Toward a Better Understanding and Evaluation of Tree Structures on Flash SSDs

Diego Didona Nikolas Ioannou Radu Stoica IBM Research Europe IBM Research Europe IBM Research Europe [email protected] [email protected] [email protected]

∗ Kornilios Kourtis Independent researcher [email protected]

ABSTRACT hinders their adoption. Hence, flash SSDs are expected to Solid-state drives (SSDs) are extensively used to deploy per- be the storage medium of choice for many applications in sistent data stores, as they provide low latency random ac- the short and medium term future [1]. cess, high write throughput, high data density, and low cost. Persistent tree data structures (PTSes) are widely used Tree-based data structures are widely used to build persis- to index large datasets. PTSes are particularly appeal- tent data stores, and indeed they lie at the backbone of ing, because, as opposed to hash-based structures, they al- many of the data management systems used in production low for storing the data in sorted order, thus enabling effi- and research today. cient implementations of range queries and iterators over In this paper, we show that benchmarking a persistent the dataset. Examples of PTSes are the log structured tree-based data structure on an SSD is a complex process, merge (LSM) tree [56], used, e.g., by RocksDB [25] and which may easily incur subtle pitfalls that can lead to an BigTable [13]; the B+Tree [17], used, e.g., by Db2 [33] and inaccurate performance assessment. At a high-level, these WiredTiger [53] (which is the default storage engine in Mon- pitfalls stem from the interaction of complex software run- goDB [52]); the Bw-tree [42], used, e.g., by Hekaton [22] and ning on complex hardware. On one hand, tree structures Peloton [28,59]; the B- tree [11], used, e.g., by Tucana [58] implement internal operations that have nontrivial effects and TokuMX [60]. on performance. On the other hand, SSDs employ firmware Over the last decade, due to the reduction in the prices of logic to deal with the idiosyncrasies of the underlying flash flash memory, PTSes have been increasingly deployed over memory, which are well known to lead to complex perfor- flash SSDs [2,3]. Not only do PTSes use flash SSDs as mance dynamics. a drop-in replacement for hard disk drives, but new PTS We identify seven benchmarking pitfalls using RocksDB designs are specifically tailored to exploit the capabilities of and WiredTiger, two widespread implementations of an flash SSDs and their internal architectures [45,66,71,75,76]. LSM-Tree and a B+Tree, respectively. We show that Benchmarking PTSes on flash SSDs. Given their ubiq- such pitfalls can lead to incorrect measurements of key uity, evaluating accurately and fairly the performance of performance indicators, hinder the reproducibility and the PTSes on flash SSDs is a task of paramount importance representativeness of the results, and lead to suboptimal for industry and research alike, to compare alternative de- deployments in production environments. We also provide signs. Unfortunately, as we show in this paper, such task is a guidelines on how to avoid these pitfalls to obtain more complex process, which may easily incur subtle pitfalls that reliable performance measurements, and to perform more can lead to an inaccurate and non-reproducible performance assessment. arXiv:2006.04658v1 [cs.DB] 8 Jun 2020 thorough and fair comparison among different design points. The reason for this complexity is the intertwined effects of the internal dynamics of flash SSDs and of the PTS implementations. On the one hand, flash SSDs employ firmware logic to deal with the idiosyncrasies of the under- 1. INTRODUCTION lying flash memory, which result in highly non-linear dy- Flash solid-state drives (SSDs) are widely used to deploy namics [21,31,67,68]. 
On the other hand, PTSes implement persistent data storage in data centers, both in bare-metal complex operations, (e.g., compactions in LSM-Trees and and cloud deployments [23,25,27,39,64,72,74], while also be- rebalancing in B+Trees), and update policies (e.g., in a log- ing an integral part of public cloud offerings [5,15,16]. While structured fashion vs in-place). These design choices are SSDs with novel technologies such as 3D Xpoint [34,51] offer known to lead to performance that are hard to analyze [20]. significant advantages [40, 41, 79], they are not expected to They also result in widely different access patterns towards replace flash-based SSDs anytime soon. These newer tech- the underlying SSDs, thus leading to complex, intertwined nologies are not yet as mature from a technology density performance dynamics. standpoint, and have a high cost per GB, which significantly Contributions. In this paper, we identify seven bench- marking pitfalls which relate to different aspects of the eval- ∗Work done while at IBM Research.

1 uation of a PTS deployed on a flash SSD. Furthermore, we provide guidelines on how to avoid these pitfalls. We pro- vide specific suggestions both to academic researchers, to improve the fairness, completeness and reproducibility of their results, and to performance engineers, to help them identify the most efficient and cost-effective PTS for their workload and deployment. In brief, the pitfalls we describe and their consequences on the PTS benchmarking process are the following:

1. Running short tests. Flash SSDs have time- evolving performance dynamics. Short tests lead to Figure 1: Write flow and amplification in PTSes. The PTS issues results that are not representative of the long-term ap- additional writes to the SSD to maintain its internal tree struc- plication performance. ture, which lead to application-level write-amplification. The SSD firmware performs additional writes to overcome the lack of in- 2. Ignoring device write amplification (WA-D). place update capabilities of flash memory, which lead to device- SSDs perform internal garbage collection that leads to level write amplification. write amplification. Ignoring this metric, leads to in- accurate measurements of the I/O efficiency of a PTS. 3. Ignoring internal state of the SSD. Experimental of our experimental analysis and discusses the benchmarking results may significantly vary depending on the initial pitfalls that we identify. Section5 discusses related work. state of the drive. This pitfall leads to unfair and non- Section6 concludes the paper. reproducible performance measurements. 4. Ignoring the effect of the dataset size on SSD performance. SSDs will exhibit different perfor- mance depending on the amount of data stored. This 2. BACKGROUND pitfall leads to biased evaluations. Section 2.1 provides on overview of the PTSes that we 5. Ignoring the extra storage capacity that a PTS use to demonstrate the benchmarking pitfalls, namely LSM- needs. This pitfall leads to ignore the storage versus Trees and B+Trees. Section 2.2 provides background on the performance trade-off of a PTS, which is methodologi- key characteristics of flash-based SSDs that are related to cally wrong and can result in sub-optimal deployments the benchmarking pitfalls we describe. Figure1 shows the in production systems. flow of write operations from an application deployed over a PTS to the flash memory of an SSD. 6. Ignoring software overprovisioning. Performance of SSDs can be improved by overprovisioning storage 2.1 Persistent Tree Data Structures space. This pitfall leads to ignore the storage versus We start by introducing the two PTSes that we use in our performance trade-off achievable by a PTS. experimental study, and two key metrics of the design of a 7. Ignoring the effect of the underlying storage PTS. technology on performance. This pitfall leads to drawing quantitative and qualitative conclusions that 2.1.1 LSM-Trees do not hold across different SSD types. LSM-Trees [56] have two main components: an in-memory component, called memtable, and a disk component. Incom- We present experimental evidence of these pitfalls using ing writes are buffered in the memtable. Upon reaching its an LSM-Tree and an B+Tree, the two most widely used maximum size, the memtable is flushed onto the disk com- PTSes, which are also at the basis of several other PTSes ponent, and a new, empty memtable is set up. The disk designs [10,47,63]. In particular, we consider the LSM-Tree component is organized in levels L1, ··· ,LN , with progres- implementation of RocksDB, and the B+Tree implementa- sively increasing sizes. L1 stores the memtables that have tion of WiredTiger. We use these two systems since they are been flushed. Each level Li, i > 1, organizes data in sorted widely adopted in production systems and research studies. files that store disjoint key ranges. 
When Li is full, part The storage community has studied the performance of of its data is pushed to Li+1 through an operation called flash SSDs extensively, focusing on understanding, model- compaction. Compaction merges data from one level to the ing, and benchmarking their performance [21,38,67,69]. De- next, and discards older versions of duplicate keys, to main- spite this, we have found that many works from the systems tain disjoint key ranges in each level. and communities are not aware of the pitfalls in evaluating PTSes on SSDs. Our work offers a list of factors 2.1.2 B+Trees to consider, and bridges this gap between these communi- B+Trees [17] are composed of internal nodes and leaf ties by illustrating the intertwined effects of the complex nodes. Leaf nodes store key-value data. Internal nodes con- dynamics of PTSes and flash SSDs. Ultimately, we hope tain information needed to route a request for a target key that our work paves the way for a more rigorous, fair, and to the corresponding leaf. Writing a key-value pair entails reproducible benchmarking of PTSes on flash SSDs. writing to the appropriate leaf node. A key-value write may Outline. The remainder of the paper is organized as fol- also involve modifying the internal nodes to update routing lows. Section2 provides background on the internals of flash information, or to perform re-balancing of the tree. SSDs, LSM-Trees and B+Trees. Section3 describes the ex- perimental setup of our study. Section4 presents the results 2.1.3 Application-level write amplification

2 addresses written by the host application, either program- Internal operations such as flushing and compactions in matically, or by reserving a partition of the disk without LSM-Trees, and internal node updating in B+Trees incur ever writing to it. extra writes to persistent storage. These extra writes are detrimental for performance because they compete with key- 2.2.3 Device-level write amplification value write operations for the SSD bandwidth, resulting in Garbage collection reduces the performance of the SSD as lower overall throughput and higher latencies [6, 41, 47, 65]. it leads to internal re-writing of data in an SSD. We define We define application-level write amplification (WA-A) the device-level write amplification (WA-D) as the ratio between ratio between the overall data written by the PTS (which the amount of data written to flash memory (including the considers both application data and internal operations) and writes induced by garbage collection) and the amount of the amount of application data written. WA-A is depicted host data sent to the device. WA-D is depicted in the right in the left part of Figure1. part of Figure1.

2.1.4 Space amplification 3. EXPERIMENTAL SETUP A PTS may require additional capacity of the drive, other This section describes our experimental setup, which in- than the one needed to store the latest value associated with cludes the PTS systems we benchmark, the hardware on each key. LSM-Trees, for example, may store multiple values which we deploy them, and the workloads we consider. of the same key in different levels (the latest of which is at the lowest level that contains the key). B+Trees store 3.1 Systems routing information in internal nodes, and may reserve some We consider two key-value (KV) stores: RocksDB [25], buffers to implement particular update policies [12]. Space that implements an LSM-Tree, and WiredTiger [53], that amplification captures the amount of extra capacity needed implements a B+Tree. Both are mature systems widely used by a PTS, and it is defined as the ratio of the amount of on their own and as building blocks of other data manage- bytes that the PTS occupies on the drive and the size of the ment systems. We configure RocksDB and WiredTiger to application’s key-value dataset. use small (10 MB) in-memory page caches and direct I/O, so that the dataset does not fit into RAM, and both KV 2.2 Flash SSDs and internal operations are served from secondary storage. In this section we describe the internal architecture of 3.2 Workload flash SSDs, as well as key concepts relevant to their per- Unless stated otherwise, we use the following workload formance dynamics. in our tests. The dataset is composed of 50M KV pairs, with 16 bytes keys and 4000 bytes values. The size of the 2.2.1 Architecture dataset is ≈200 GB, which represent ≈50% of the capac- Flash-based SSDs organize data in pages, which are com- ity of the storage device. Before each experiment we ingest bined in blocks. A prominent characteristics of flash memory all KV pairs in sequential order. We consider a write-only is that pages do not support in-place updates of the pages. workload, where one user thread updates existing KV pairs A page needs to be erased before it can be programmed (i.e., according to a uniform random distribution. We focus on a set to a new value). The erase operation is performed at the write workload as it is the most challenging to benchmark block level, so as to amortize its cost. accurately, both for the target data structures and for the Flash translation layers (FTLs) [48] hide such idiosyn- SSDs. We use a single user thread to avoid the performance crasy of the medium. In general, an FTL performs writes dynamics caused by concurrent data accesses. We also con- out-of-place, in a log-structured fashion, and maintains a sider variations of this workload to show that the pitfalls we mapping between logical and physical addresses. When describe apply to a broad class of workloads. space runs out, a garbage collection process selects some 3.3 Metrics memory blocks, relocates their valid data, and then erases the blocks. To demonstrate our pitfalls, we analyze several applica- In the remainder of the paper, for the sake of brevity, we tion, system and hardware performance metrics. use the term SSD to refer to a flash SSD. i) KV store throughput, i.e., the number of operations per second completed by the KV store. ii) Device throughput, i.e., the amount of data written per 2.2.2 Over-provisioning second to the drive as observed by the OS. 
The device Over-provisioning is a key technique to enable garbage col- throughput is often used to measure the capability of a sys- lection and to reduce the amount of data that it relocates. tem to utilize the available I/O resources [41]. We measure Over-provisioning means adding extra capacity to the SSD device throughput using iostat. to store extra blocks used for garbage collection. The more iii) User-level write amplification, which we measure by tak- an SSD is over-provisioned, the lower is the number of valid ing the ratio of the device write throughput and the product data that needs to be relocated upon performing garbage of the KV store and the size of a KV pair. By using the de- collection, and, hence, the higher is the performance of the vice write throughput, we factor in also the write overhead SSD. SSDs manufacturers always over-provision SSDs by a posed by the filesystem. We assume such overhead to be certain amount. The user can further implement a software negligible with respect to the amplification caused by the over-provisioning of the SSD by erasing its blocks and en- PTS itself. forcing that a portion of the logical block address (LBA) iv) Application-level write amplification, which we measure space is never written. This is achieved by restricting the via SMART attributes of the device.

3 v) Space amplification, which we obtain by taking the ratio 4.1 Steady-state vs. bursty performance of the disk total utilization and the cumulative size of the Pitfall 1: Running short tests. Because both PTS and KV pairs in the dataset. Also this metric factors in the SSD performance varies over time, short-lived tests are un- overhead posed by the filesystem, which is negligible with able to capture how the systems will behave under a contin- respect to the several GB datasets that we consider. uous (non-bursty) workload.

For the sake of readability of the plots, unless stated oth- Discussion. Figure2 shows the KV store and device erwise, we report 10-minutes average values when plotting throughput (top), and the WA-D and WA-A (bottom) over the evolution of a performance metric over time. time, for RocksDB (left) and WiredTiger (right). These re- sults refer to running the two systems on a trimmed SSD. 3.4 State of the drive The plots do not show the performance of the systems dur- ing the initial loading phase. The pictures show that, during We experiment with two different initial conditions of the a performance test, a PTS exhibits an initial transient phase internal state of the drive. during which the dynamics and the performance indicators • Trimmed. All blocks of the device are erased (using the may widely differ from the ones observed at steady-state. blkdiscard utility). Hence, initial writes are stored directly Hence, taking early measurements of a data store’s perfor- into free blocks and do not incur additional overhead (no mance may lead to a substantially wrong performance as- WA-D occurs), while updates after the free blocks are ex- sessment. hausted incur internal garbage collection. A trimmed device Figure 2a shows that measuring the performance of exhibits performance dynamics close (i.e., modulo the wear RocksDB in the first 15 minutes would report a through- of the storage medium) to the ones of a mint factory-fresh put of 11-8 KOps/s, which is 3.6-2.6 times higher than the device. This setting is representative of bare-metal stan- 3 KOps/s that RocksDB is able to sustain at steady-state. dalone deployments, where a system administrator can reset In the first 15 minutes, the device throughput of RocksDB the state of the drive before deploying the KV store, and the is between 375 and 300 MB/s, which is more than 3 times drive is not shared with other applications. the throughput sustained at steady-state. • Preconditioned. The device undergoes a preliminary writ- Figure 2c sheds lights on the causes of such performance ing phase so that its internal state resembles the state of degradation. The performance of RocksDB decreases over a device that has been in use. To this end, we first write time for the effect of the increased write amplification, both the whole drive sequentially, to ensure that all logical ad- at the PTS and device level. WA-A increases over time dresses have associated data. Then, we issue random writes while the levels of the LSM-Tree fills up, and its curve flat- for an amount of bytes that is twice the size of the disk, so tens once the layout of the LSM tree has stabilized. WA-D as to trigger garbage collection. In this setting even the first increases over time because of the effect of garbage collec- write operation issued by an application towards any page tion. The initial value of WA-D is close to one, because is effectively an over-write operation. This setting is repre- the SSD is initially trimmed, and keys are ingested in order sentative of i) consolidated deployments, e.g., public clouds, during the initial data loading, which results in RocksDB where multiple applications share the same physical device, issuing sequential writes to the drive. The compaction op- or ii) standalone deployments with an aged filesystem. erations triggered over time, instead, do not result in se- quential writes to the SSD flash cells, which ultimately lead These two configurations represent the two endpoints of the to a WA-D slightly higher than 2. 
spectrum of the possible initial conditions of the drive, con- WiredTiger exhibits performance degradation as well, as cerning the state of the drive’s block. In a real-world deploy- shown in Figure 2b. The performance reduction in Wired- ment the initial conditions of the drive would be somewhere Tiger is lower than in RocksDB for three reasons. First, in-between these two endpoints. WA-A is stable over time, because updating the B+Tree to accommodate new application writes incurs an amount of 3.5 Hardware extra writes that does not change over time. Second, the We use a machine equipped with an Intel(R) Xeon(R) increase in WA-D is lower than in RocksDB: WA-D reaches CPU E5-2630 v4 @ 2.20GHz (20 physical cores, without at most the value of 1.7, and converges to 1.5. We analyze hyper-threading) and 126 GB of RAM. The machine runs more in detail the WA-D of WiredTiger in the next section. Ubuntu 18.04 with a 4.15 generic Linux kernel. The ma- Third, WiredTiger is less sensitive to the performance of chine’s persistent storage device is a 400 GB Intel p3600 the underlying device because of synchronization and CPU enterprise-class SSD [36]. Unless stated otherwise, we setup overheads [41]. a single partition on the drive, which takes the whole avail- Guidelines. Researchers and practitioners should dis- able space. We mount an ext4 filesystem configured with tinguish between steady-state and bursty performance, and the nodiscard parameter [37]. prioritize reporting the former. In light of the results por- trayed by Figure2, we advocate that, to detect steady-state 4. BENCHMARKING PITFALLS behavior, one should implement a holistic approach that en- compasses application-level throughput, WA-A, and WA-D. This section discuss the benchmarking pitfalls in detail. Techniques such as CUSUM [57] can be used to detect that For each pitfall we i) first give a brief description; ii) then the values of these metrics do not change significantly for a discuss the pitfall in depth, by providing experimental evi- long enough period of time. dence that demonstrates the pitfall itself and its implications To measure throughput we suggest using an average over on the performance evaluation process; and iii) finally out- long periods of times, e.g., in the order of ten minutes. In line guidelines on how to avoid the pitfall in research studies fact, it is well known that PTSes are prone to exhibit large and production systems.

4 20 500 Device: writes reads Device: writes reads 15 RocksDB 400 1.5 WiredTiger 150 300 10 1 100 200 5 100 0.5 50

0 0 Throughput (MB/s) 0 0 Throughput (MB/s) Throughput (Kops/s) 0 30 60 90 120 150 180 210 Throughput (Kops/s) 0 30 60 90 120 150 180 210 Time (min) Time (min) (a) RocksDB: KV store and device throughput. (b) WiredTiger: KV store and device throughput.

15 15 WA-D WA-A WA-D WA-A 3 10 3 10 WA-A WA-A WA-D 2 5 WA-D 2 5

1 0 1 0 0 30 60 90 120 150 180 210 0 30 60 90 120 150 180 210 Time (min) Time (min) (c) RocksDB: Application and device-level WA. (d) WiredTiger: Application and device-level WA.

Figure 2: Difference between steady state and bursty performance (on a trimmed SSD) in RocksDB (left) and WiredTiger (right). Steady-state performance differs from the initial one (top) because of the change in write amplification (bottom). performance variations over short period of times [41,46,65]. ure 2b shows that WiredTiger exhibits a throughput drop Furthermore, we suggest expressing the WA-A at time t as at around the 50th minute mark, despite the fact Figure 2d the ratio of the cumulative application writes up to time t shows no variations in WA-A. Figure 2d shows that at the and the cumulative host writes up to time t. This is aimed at 50th minute mark WA-D increases from its initial value of avoiding oscillations that would be obtained if measuring the 1, indicating that the SSD has run out of clean blocks, and WA-D over small time windows. Finally, if WA-D cannot be the garbage collection process has started. This increase in computed directly from SMART attributes, then we suggest, WA-D explains the reduction in SSD throughput, which ul- as a rule of thumb, to consider the SSD as having reached timately determines the drop in the throughput achieved by steady-state after the cumulative host writes accrue to at WiredTiger. least 3 times the capacity of the drive. The first device write The analysis of WA-D also contributes to explaining ensures that the drive is filled once, so that each block has the performance drops in RocksDB, depicted in Figure 2a. data associated with it. The second write triggers garbage Throughout the test, the KV throughput drops by a factor collection, which overwrites the block layout induced by the of ≈4, from 11 to 2.5 KOps/s. Such a huge degradation is initial filling of the data-set. The third write ensures that not consistent with the ≈2 increase of WA-A and the slightly the garbage collection process approaches steady state, and increased CPU overhead caused by internal LSM-Tree oper- is needed also to account for the (possibly unknown) amount ations (most of the CPUs are idle throughout the test). The of hardware extra capacity of the SSD (which makes the doubling of the WA-D explains the huge device through- actual capacity of the drive higher than the nominal capacity put degradation, which contributes to the application-level exposed to the application). throughput loss. ii) WA-D is an essential measure of the I/O efficiency of a 4.2 Analysis of WA-D PTS. One needs to multiply WA-A by WA-D to obtain the Pitfall 2: Not analyzing WA-D. Overlooking WA-D end-to-end write amplification – from application to mem- leads to partial or even inaccurate performance analysis. ory cells– incurred by a PTS on flash. This is the write amplification value that should be used to quantify the I/O Discussion. Many evaluations only consider WA-A in efficiency of a PTS on flash, and its implications on the life- their analysis, which can lead to inaccurate conclusions. time of an SSD. Focusing on WA-A alone, as done in the We advocate considering WA-D when discussing the perfor- vast majority of PTS performance studies, may lead to in- mance of a PTS for three (in addition to being fundamental correct conclusions. For example, Figure2 (bottom) shows to identify steady state, as discussed previously) main rea- that RocksDB incurs a steady-state WA-A of 12, which is sons. higher than the WA-A achieved by WiredTiger by a fac- i) WA-D directly affects the throughput of the device, tor of 1.2×. 
However, the end-to-end write amplification of which strongly correlates with the application-level through- RocksDB is 25, which is 2.1× higher than WiredTiger’s. put. Analyzing WA-D explains performance variations that iii) WA-D measures the flash-friendliness of a PTS. A low cannot be inferred by the analysis of the WA-A alone. Fig-

5 20 2 Trimmed Trimmed 15 Preconditioned 1.5 Preconditioned 10 1 5 0.5 0 0 Throughput (Kops/s) 0 30 60 90 120 150 180 210 Throughput (Kops/s) 0 30 60 90 120 150 180 210 Time (min) Time (min) (a) RocksDB: throughput. (b) WiredTiger: throughput.

Trimmed Trimmed 3 Preconditioned 3 Preconditioned

WA-D 2 WA-D 2

1 1 0 30 60 90 120 150 180 210 0 30 60 90 120 150 180 210 Time (min) Time (min) (c) RocksDB: device-level WA. (d) WiredTiger: device-level WA.

Figure 3: Impact of the initial state of the SSD on performance. Performance achieved over time by RocksDB (left) and WiredTiger (right) depending on the initial conditions of the SSD (trimmed versus preconditioned). The initial conditions of the drive affect throughput (top), potentially even at steady-state, because they affect SSD garbage collection dynamics and the corresponding WA-D (bottom).

WA-D indicates that a PTS generates a write access pattern expectation and measured performance. towards the SSD that does not incur much garbage collection Guidelines. The analysis of the WA-D should be a stan- overhead. Measuring WA-D, hence, allows for quantifying dard step in the performance study of any PTS. Such analy- the fitness of a PTS for flash SSD deployments, and for sis is fundamental to measure properly the flash-friendliness verifying the effectiveness of flash-conscious design choices. and I/O efficiency of alternative systems. Moreover, the For example, LSM-Trees are often regarded as flash- analysis of WA-D leads to important insights on the inter- friendly due to their highly sequential writes, while B+Tree nal dynamics and performance of a PTS, as we show in the are considered less flash-friendly due to their random write following sections. access pattern. The direct measurement of WA-D in our tests, however, capsizes this conventional wisdom, showing that RocksDB and WiredTiger achieve a WA-D of around 4.3 Initial conditions of the drive 2.1 and 1.5, respectively. As a reference, a pure random Pitfall 3: Overlooking the internal state of the SSD. write workload, targeting also 60% of the device capacity, Not controlling the initial condition of the SSD may lead to has a WA-D of 1.4 [67]. In the next section we provide ad- biased and non-reproducible performance results. ditional insights on the causes of such mismatch between Discussion. Figure3 shows the performance over time of RocksDB (left) and WiredTiger (right), when running on an SSD that has been trimmed or preconditioned before start- 1 ing the experiment. The top row reports KV throughput, 0.75 and the bottom one reports WA-D. The plots do not show the performance of the systems during the initial loading 0.5

CDF phase. 0.25 RocksDB The plots show that the initial state of the SSD heavily WiredTiger affects the performance delivered by a PTS and that, cru- 0 0 0.25 0.5 0.75 1 cially, the steady-state performance of a PTS can greatly differ depending on the initial state of the drive. Such a re- LBA (normalized) sorted by decreasing # writes sult is surprising, given that one would expect the internal state of an SSD to converge to the same configuration, if Figure 4: CDF of the LBA write probability in RocksDB and WiredTiger. The vertical dotted line indicates where the CDF running the same workload for long enough, and hence to corresponding to WiredTiger reaches value 1, and indicates that deliver the same steady-state performance regardless of the WiredTiger does not write to ≈ 45% of the LBAs. This write initial state of the SSD. access pattern results into different WA-D depending on the initial To understand the cause of this phenomenon, we have state of the drive (see Figure 3d.) monitored the host write access pattern generated by

6 RocksDB trim WiredTiger trim RocksDB prec WiredTiger prec 5 4 4

3.3 3

3 2.6 2.3 12.3 2.3 12.3 2.3 2.3 2.2 2.2 11.9 11.7 2.2

3 2.5 11.4 2.0 11.3

2 11.0 1.9 1.9 12.5 10.6 WA-D 2.2 2.2 9.8 9.8 10.2 1.7 10.2 10.1 2 10.1 10.0 10.0 1.6 1.8 1.8 1.7 1 1.7

2 1.3

1.2 10 1.1 WA-A WA-D 1.0 0 1.0 1.0 0.9 0.9 0.8

0.8 1 200 1 0.7 7.5 250 0 Dataset 0 size (GB) 5 Throughput (Kops/s) 0.25 0.37 0.5 0.62 0.25 0.37 0.5 0.62 0.25 0.37 0.5 0.62 Dataset size / SSD capacity Dataset size / SSD capacity Dataset size / SSD capacity (a) Throughput. (b) WA-D. (c) WA-A.

Figure 5: Impact of the size of the dataset in RocksDB and WiredTiger, with preconditioned and trimmed device. Larger datasets lead to a lower throughput (a). This is mostly due to an increase in WA-D (b), since the WA-A only increases mildly (c).

RocksDB and WiredTiger with blktrace. Figure4 reports The reproducibility of a benchmarking process can be the CDF of the access probability of the page access fre- compromised because running two independent tests of a quency in the two systems. We observe that in WiredTiger PTS with the same workload and on the same hardware 46% of the pages are not written (0 read or write accesses). may lead to substantially different results. For production This indicates that WiredTiger only writes to a limited por- engineers this means that the performance study taken on tion of the logical block address (LBA) space, corresponding a test machine may differ widely with respect to the per- to the LBAs that initially store the ≈ 200 GB of KV pairs formance observed in production. For researches, it means plus some small extra capacity, i.e., ≈ 50% of the SSD’s that it may be impossible to replicate the results published capacity in total. In the case of the trimmed device, this in another work. data access pattern corresponds to having only 50% of the Guidelines. To overcome the aforementioned issues, we LBAs with associated valid data. Because SSD garbage col- recommend to control and report the initial state of the lection only relocates valid data, this gives ≈ 200 GBs of SSD before every test. This state depends on the target extra over-provisioning to the SSD, which in turn leads to deployment of the PTS. For a performance engineer, such a low WA-D. In a preconditioned device, instead, all LBAs state should be as similar as possible to the one observed in have associated valid data, which means that the garbage the production environment, which also depends on other collection process has only the hardware over-provisioning applications possibly collocated with the PTS. We suggest available, and needs to relocate more valid pages when eras- to researchers to precondition the SSD as described in Sec- ing a block, leading to a higher WA-D. tion 3.4. In this way, they can evaluate the PTS in the most The difference in WA-D, and hence in performance, de- general case possible, thus broadening the scope of their re- pending on the initial state of the SSD is much less visible in sults. To save on the preconditioning time, the SSD can RocksDB. This is due to the facts that i) the LSM tree uti- be trimmed, provided that one checks that the steady-state lizes more capacity than a B+Tree and ii) RocksDB writes performance of the PTS does not substantially differ from to the whole range of the LBA space. Hence, the initial WA- the one observed on a preconditioned drive. D for RocksDB depends heavily on the initial device state, however, all LBAs are eventually over-written and thus the 4.4 Data-set size WA-D converges to roughly the same value, regardless of Pitfall 4: Testing with a single dataset size. The the initial state of the drive. amount of data stored by the SSD changes its behavior and Our results and analysis lead to two important lessons. affects overall performance results. i) The I/O efficiency of a PTS on SSD is not only a matter of the high level design of the PTS, but also on its low-level Discussion. Figure5 reports the steady-state throughput implementation. 
Our experiments show that the benefits on (left), WA-D (middle), and WA-A (right) of RocksDB and WA-D given by the large sequential writes of the LSM imple- WiredTiger with datasets whose sizes span from 0.25 to 0.88 mentation of RocksDB are lower than the benefits achieved of the capacity of the SSD (from 100GB to 350GB of KV by the B+Tree implementation of WiredTiger, despite the pairs). We do not report results for RocksDB for the two fact that WiredTiger generates a more random write access biggest datasets because it runs out of space. The figure re- pattern. ports results both with a trimmed and with a preconditioned ii) Not controlling the initial state of the SSD can poten- SSD. tially jeopardize two key properties of a PTS performance Figure 5a shows that the throughput achieved by the two evaluation: fairness and reproducibility. The fairness of a systems is affected by the size of the dataset that they man- benchmarking process can be compromised by simply run- age, although to a different extent and in different ways ning the same workload on two different PTSes back to back. depending on the initial state of the SSD. By contrasting The performance of the second PTS is going to be heavily Figure 5b and Figure 5c we conclude that the performance influenced by the state of the SSD that is determined by the degradation brought by the larger data-set is primarily due previous test with the other PTS. The lack of fairness can to the idiosyncrasies of the SSD. In fact, larger datasets lead lead a performance engineer to pick a sub-optimal PTS for to more valid pages in each flash block, which increases the their workload, or a researcher to report incorrect results. amount of data being relocated upon performing garbage collection, i.e., the WA-D.

7 RocksDB trim WiredTiger trim RocksDB prec WiredTiger prec 5 4 25 RocksDB

99 cheaper

3 88 100 85 2 1.86 20 74 2 80 71 1.8 WA-D 61 57

1.61 Same

1 47 15

60 43 1.6 1.46 cost 1.39 0 40 29 1.4 (Kops/s) 10

200 1.15 250 1.13 1.13 1.12 1.12 20 1.2 1.12 WiredTiger

Disk utilization (%) 0 1 5 Dataset size (GB) Target throughput 0.25 0.37 0.5 0.62 0.75 0.88 Space amplification 0.25 0.37 0.5 0.62 0.75 0.88 1 2 3 4 5 cheaper Dataset size / SSD capacity Dataset size / SSD capacity Total dataset size (TB) (a) Space utilization. (b) Space amplification. (c) Storage cost comparison.

Figure 6: Space amplification in RocksDB and WiredTiger (trimmed and preconditioned SSD), and its effects on storage costs (precon- ditioned SSD). RocksDB runs out of space in the two largest datasets we consider. RocksDB uses more space than WiredTiger to store a dataset of a given size (a), leading to a higher space amplification (b). The heatmap (c) reports the system that requires fewer drives to store a given dataset while achieving a target throughput –hence incurring a lower storage monetary cost.

Changing the dataset size affects the comparison between utilization includes the overhead due to filesystem meta- the two systems both quantitatively and qualitatively. We data. Because RocksDB frequently writes and erases many also note that the comparison among the two systems is large files, its disk utilization varies sensibly over time. The affected by the initial condition of the drive. value we report is the maximum utilization that RocksDB On a trimmed SSD, RocksDB achieves a throughput that achieves. Figure 6b reports the space amplification corre- is 3.3× higher than WiredTiger’s when evaluating the two sponding to the utilization depicted in Figure 6a. systems on the smallest dataset. On the largest dataset, WiredTiger uses an amount of space only slightly higher however, this performance improvement shrinks to 1.9×. than the bare space needed to store the dataset, and Moreover, WiredTiger exhibits a lower WA-D across the achieves a space amplification that ranges from 1.15 to 1.12. board, due to the LBA access pattern discussed in the pre- RocksDB, instead, requires much more additional disk space vious section. to store the several levels of its LSM-Tree. Overall, RocksDB On a preconditioned SSD, the speedup of RocksDB over achieves an application space amplification ranging between WiredTiger still depends on the size of the dataset, but it 1.86, with the smallest dataset we consider, to 1.39, with is lower in absolute values than on a trimmed SSD, ranging the biggest dataset that it can handle 1. from 2.7× on the smallest dataset to 2.57× on the largest These results show that space amplification plays a key one. Moreover, whether RocksDB has a better WA-D than role in the performance versus storage space trade-off. Such WiredTiger depends on the dataset size. In particular, the trade-off affects the total storage cost of a PTS deployment, WA-D of RocksDB and WiredTiger are approximately equal given an SSD drive model, a target throughput, and total when storing datasets whose sizes are up to half of the dataset size. In fact, a PTS with a low space amplification drive’s capacity. Past that point, RocksDB’s WA-D is sensi- may fit the target dataset in a smaller and cheaper drive with bly lower than WiredTiger’s (2.3 versus 2.6). This happens respect to another PTS with a higher write amplification, because the benefits of WiredTiger’s LBA access pattern de- or can index more data given the same capacity, requiring crease with the size of the dataset (and hence of the range of fewer drives to store the whole dataset. LBAs storing KV data) and the reduced over-provisioning To showcase this last point, we perform a back-of-the- due to preconditioning. envelope computation to identify which of the two systems Guidelines. We suggest production engineers to bench- require fewer SSDs (and hence incur a lower storage cost) mark alternative PTSes with a dataset of the size that is to store a given dataset and at the same time achieve a tar- expected in production, and refrain from testing with scaled- get throughput. We use the throughput and disk utilization down datasets for the sake of time. We suggest researchers values that we measure for our SSD (see Figure 5a and Fig- to experiment with different dataset sizes. This suggestion ure 6a). For simplicity, we assume one PTS instance per has a twofold goal. 
First, it allows a researcher to study the SSD, and that the aggregate throughput of the deployment sensitivity of their PTS design to different device utilization is the sum of the throughputs of the instances. Figure 6c values. Second, it disallows evaluations that are purposely reports the result of this computation. Despite having a or accidentally biased in favor of one design over another. lower per-instance throughput, the higher space efficiency of WiredTiger makes it preferable over RocksDB in scenarios 4.5 Space amplification with a large dataset and a relatively low target throughput. This configuration represents an important class of work- Pitfall 5: Not accounting for space amplification. loads, given that with the ever-increasing amount of data The space utilization overhead of a PTS determines its stor- being collected and stored, many applications begin to be age requirements and deployment monetary cost. storage capacity-bound rather than throughput-bound [14]. Discussion. PTSes frequently trade additional space for 1 improved performance, and understanding their behavior The disk utilization in RocksDB depends on the setting of its in- depends on understanding these trade-offs. Figure 6a re- ternal parameters, most importantly, the maximum number of levels, and the size of each level [26]. It is possible to achieve a lower space ports the total disk utilization incurred by RocksDB and amplification than the one we report, but at the cost of substantially WiredTiger depending on the size of the dataset. The disk lower throughput due to increased compaction overhead.

8 RocksDB trim RocksDB prec 25 Extra OP WiredTiger trim WiredTiger prec cheaper 5 20 4 3 3.30

2.3 Same 3.27 2.3 4 2.3 15 3 cost 2 1.7 (Kops/s) 10 1.4 1.4 1.3 3 1.3 1.79 2 1.77 No OP

WA-D 5 1 Target throughput cheaper WA-D 0.96 0.94 2 0.87 1 2 3 4 5 1 0.76 Total dataset size (TB) 1 Throughput (Kops/s) 0 0 No OP Extra OP No OP Extra OP Figure 8: Storage cost comparison of using RocksDB with or 0 (a) Throughput. (b) WA-D. without extra over-provisioning (OP) on a preconditioned SSD. 100 200 The heatmap reports the RocksDB250 configuration that requires Figure 7: Impact of using extra SSD over-provisioning (OP). Ex- fewer drives to store a given dataset while achieving a target tra OP may increase throughput (a) by improving WA-D (b), at throughput –hence incurring a lower storage monetary cost. the cost of reducing the amount of data that the SSD can store.

The impact of extra over-provisioning is much less evi- dent in WiredTiger. In the trimmed device case, the extra over-provisioning has no effect on WiredTiger. This hap- Guidelines. The experimental evaluation of a PTS should pens because WiredTiger writes only to a certain range of not focus only on performance, but should include also space the LBA space (see Figure4). Hence, all other trimmed amplification. For research works, analyzing space amplifi- blocks act as over-provisioning, regardless of whether they cation provides additional insights on the performance dy- belong to the PTS partition or the extra over-provisioning namics and trade-offs of the design choices of a PTS, and al- one. On the preconditioned device, instead, all blocks of the lows for a multi-dimensional comparison among designs. For PTS partition have data associated with them, so the only production engineers, analyzing space amplification is key to software over-provisioning is given by the trimmed partition. compute the monetary cost of provisioning the storage for a This extra over-provisioning reduces WA-D from 1.7 to 1.3, PTS in production, which is typically more important than yielding a throughput improvement of 1.14×. sheer maximum performance [54]. Allocating extra over-provisioning can be an effective As a final remark, we note that this pitfall applies also technique to reduce the number of PTS instances needed in to PTSes not deployed over an SSD, and hence our con- a deployment (and hence reduce its storage cost), because it siderations apply more broadly to PTSes deployed over any increases the throughput of the PTS without requiring addi- persistent storage medium. tional hardware resources. However, extra over-provisioning also reduces the amount of data that a single drive can store, 4.6 SSD over-provisioning which potentially increases the amount of drives needed to store a dataset and the storage deployment cost. To as- Pitfall 6: Overlooking SSD software over- sess in which use cases using extra over-provisioning is the provisioning. Allocating extra over-provisioning capacity most cost-effective choice we perform a back-of-the-envelope to the SSD may lead to more favorable capacity versus per- computation of the number of drives needed to provision formance trade-offs. a RocksDB deployment given a dataset size and a target Discussion. Figure7 compares the steady-state through- throughput value. We perform this study on RocksDB be- put (left) and WA-D (right) achieved by RocksDB and WiredTiger in two settings: i) the default one in which the whole SSD capacity is assigned to the disk partition acces- RocksDB WiredTiger sible by the filesystem underlying the PTS, and ii) one in which some SSD space is not made available to the filesystem 30

underlying the PTS, and is instead assigned as extra over- 25 24.1 provisioning to the SSD. Specifically, in the second setting we trim the SSD and assign a 300GB partition to the PTS. 20 Hence, the SSD has 100GB of trimmed capacity that is not 15 visible to the PTS. We choose this value because 100GB cor- 10 8.7

responds to half of the free capacity of the drive once the 200 2.9 1.6 1.3 GB dataset has been loaded. For both settings we consider 5 1.2 Throughput (Kops/s) the case in which the PTS partition remains clean after the 0 initial trimming, and the case in which it is preconditioned. SSD1 SSD2 SSD3 Extra over-provisioning improves the performance of SSD type RocksDB by a factor of 1.83×. This substantial improve- ment is caused by a drastic reduction of WA-D, that drops Figure 9: Impact of the SSD type on the throughput of RocksDB from 2.3 to 1.4, and it applies to RocksDB regardless of the and WiredTiger. The type of SSD significantly affects the ab- initial state of the PTS partition, for the reason discussed solute performance achieved by the two systems, and can even in Section 4.3. determine which of them achieves the higher throughput.

9 SSD1 SSD2 SSD3 5 5 4 35 RocksDB WiredTiger 30 4 3 25 3 2 20 2 1 15

Throughput (Kops/s) 10 0 1 Throughput (Kops/s) 5 Throughput (Kops/s) 0 30 60 90 0 0 0 30 60 Time 90 (min) 0 30 60 90 Time (min) Time (min) (a) RocksDB. (b) WiredTiger.

Figure 10: Throughput (1 minute average) of RocksDB (left) and WiredTiger (right) over time when running on SSDs based on different technologies. The type of SSD influences heavily the throughput variability of RocksDB, and much less that of WiredTiger. cause it benefits the most from extra over-provisioning. We resulting in a WA-D very close to one. use the same simplifying assumptions that we made for the Figure9 shows the steady-state throughput of RocksDB previous similar analysis. Figure8 reports the results of our and WiredTiger when deployed on the three SSDs. As can computation. As expected, extra over-provisioning is bene- be depicted from the plot, the performance impact chang- ficial for use cases that require high throughput for relatively ing the underlying drive varies drastically across the two small datasets. For larger datasets with relatively low per- systems. Explaining these performance variations requires formance requirements, it is more convenient to allocate as gaining a deeper understanding of the low-level design of the much of the drive’s capacity as possible to RocksDB. SSDs, which is not always achievable. Guidelines. It is well known that PTSes have several RocksDB achieves the highest throughput on SSD3, and tuning knobs that have a huge effect on performance [8, 20, a higher throughput on SSD1 than on SSD2. This perfor- 43]. We suggest to consider SSD over-provisioning as an mance trend is mostly determined by the write latencies additional, yet first class tuning knob of a PTS. SSD extra of the SSDs, which are the lowest in SSD3, and lower in over-provisioning trades capacity for performance, and can SSD2 than in SSD1. Also WiredTiger achieves the highest reduce the storage cost of a PTS deployment in some use throughput on SSD3 but, surprisingly, it obtains a higher cases. throughput on SSD2 than on SSD1. We argue that the rea- son for this performance dynamic lies in the fact that SSD2 4.7 Storage technology has a larger internal cache than SSD1. Because WireTiger Pitfall 7: Testing on a single SSD type. The perfor- performs small writes, uniformly distributed over time, the mance of a PTS heavily depends on the type of SSD used. cache of SSD2 is able to absorb them with very low la- This makes it hard to extrapolate the expected performance tency, and destages them in the background. The larger when running on other drives and also to reach conclusive cache of SSD2 does not yield the same beneficial effects to results when comparing two systems. RocksDB because RocksDB performs large bursty writes, which overwhelm the cache, leading to longer write laten- Discussion. We exemplify this pitfall through an exper- cies and, hence, lower throughput. iment where we keep the workload and the RocksDB and These dynamics also lead to the surprising result that ei- WireTiger configurations constant and only swap the under- ther of the two systems we consider can achieve a higher lying storage device. We use three SSDs: an Intel p3600 [36] throughput than the other, just by changing the SSD on flash SSD, i.e., the drive used for the experiments discussed which they are deployed. in previous sections; an Intel 660 [35] flash SSD; and an In- We also observe very different performance variations for tel Optane [34]. We refer to these SSDs as SSD1, SSD2 and the two systems, when deployed on different SSDs. On SSD3, respectively, in the following discussion. 
SSD3 is a the one hand, the best and worst throughputs achieved high-end SSD, based on the newer 3DXP technology that by RocksDB vary by a factor of almost 20× (SSD2 versus achieves higher performance than flash SSDs. We use SSD3 SSD3). On the other hand, they vary only by a factor of 2.4 as an upper bound on performance that a PTS can achieve for WiredTiger. on a flash block device. These results are especially important, because they in- To try and isolate the performance impact due to the dicate that the performance comparison across PTS design SSD technology (i.e., architecture and underlying storage points, and the corresponding conclusions that are drawn, medium performance) itself in the assessment of a PTS, we are strongly dependent on the SSD employed in the bench- eliminate, as much as possible, the other sources of perfor- marks, and hence hard to generalize. mance variability that we have discussed so far. To this end, The type of SSD also affects dramatically the performance we run a workload with a dataset that is 10× smaller than predictability of the two system. Figure 10 reports the the default one, and we trim the flash SSDs. In this way, the throughput of RocksDB (left) and WiredTiger (right) when effect of garbage collection is the flash SSDs is minimized,

10 Throughput trim Throughput prec WA-D trim WA-D prec 2 1.5 3 10 3 2 1 8 1.5 3 2 0.5 6 WA-D 4 2 1 2 WA-D WA-D Throughput (Kops/s) 0 2 0.5 1 0 30 0 60 90 1 0 120 150 1 180 210 Throughput (Kops/s) 0 30 60 90 120 150 180 210 Time (min) Throughput (Kops/s) 0 30 60 90 120 150 180 210 Time (min) Time (min) (a) RocksDB, 50:50 r:w ratio. (b) WiredTiger, 50:50 r:w ratio.

1.5 300 3 3 200 1

100 2 WA-D 0.5 2 WA-D 0 1 0 1 Throughput (Kops/s) 0 30 60 90 120 150 180 210 Throughput (Kops/s) 0 30 60 90 120 150 180 210 Time (min) Time (min) (c) RocksDB, 128B values. (d) WiredTiger, 128B values.

Figure 11: Performance of RocksDB (left) and WiredTiger (right) over time, with a preconditioned and trimmed device. The top row refers to a workload with small (128 bytes) values. The bottom row refers to a workload with a 50:50 read:write ratio. The pitfall we describe apply to a broad set of workloads as long as they have a sustained write component (here we represent pitfalls #1, #2 and #3).

deployed over the three SSDs. To highlight the performance In this section, we show that our pitfalls affect a wider set variability we average the throughput over a 1 minute inter- of workloads than the one we have considered so far. For val (as opposed to the default 10 minutes used in previous space constraints, we focus on the first three pitfalls. Fig- plots). ure 11 reports the throughput and WA-D over time achieved The throughput of RocksDB varies widely over time, and by RocksDB (left) and WiredTiger (right) with two work- the extent of the variability depends on the type of SSD. loads that we obtain by varying one parameter of our de- When using SSD1, RocksDB suffers from throughput swings fault workload. The plots do not show the performance of of 100%. When using SSD2, RocksDB has long periods the systems during the initial loading phase. The top row of time where no application-level writes are executed at refers to a workload in which the size of the values is 128 all. This happens because the large writes performed by bytes (vs. 4000 used so far). To keep the amount of data RocksDB overwhelm the cache of SSD2 and lead to long stored indexed by the PTS constant to the one used by the stall times due to internal data destaging. On SSD3, the previous experiments, we increase accordingly the number relative RocksDB throughput variability decreases to 30%. of keys. The bottom row refers to a workload in which the WireTiger is less prone to performance variability, and ex- top level application submits a mix of read and write op- hibits a more steady and predictable performance irrespec- erations, with a 50:50 ratio. We run these workloads using tive of the storage technology. both a preconditioned SSD and a trimmed one. Guidelines. We have shown that it is difficult to predict As the plots show, the first three pitfalls apply to these the performance of a PTS just based on the high-level SSD workloads as well. First, steady-state behavior can be very specifications (e.g., bandwidth and latency specifications). different from the performance observed at the beginning of Therefore, we recommend testing a PTS on multiple SSD the test. In the mixed read/write workload, it takes longer classes, preferably using devices from multiple vendors, and to the PTSes and to the SSD to stabilize, because writes are using multiple flash storage technologies. Testing on multi- less frequent. This is visible by comparing the top row in ple SSD classes allows researchers to draw broader and more Figure 11 with the top row in Figure2 (Section 4.1). significant conclusions about the performance of a PTS’s de- Second, the value and the variations of WA-D are im- sign, and to assess the “intrinsic” validity of such design, portant to explain performance changes and dynamics. We without tying it to specific characteristics of the medium on note that the WA-D function of WiredTiger in the trimmed which the design has been tested. For a performance en- case with small values (Figure 11d) is different from the one gineer, testing with multiple drives is essential to identify observed for the workload with 4000B values (Figure 2d in which one yields the best combination of storage capacity, Section 4.1). With 4000B values the WA-D function starts performance and cost depending on the target workload. at a value close to 1, whereas with 128B values its starting point is closer to 2. This happens because the initial data 4.8 Additional workloads loading phase leads to different SSD states depending on the

4.8 Additional workloads
In this section, we show that our pitfalls affect a wider set of workloads than the one we have considered so far. Due to space constraints, we focus on the first three pitfalls. Figure 11 reports the throughput and WA-D over time achieved by RocksDB (left) and WiredTiger (right) with two workloads that we obtain by varying one parameter of our default workload. The plots do not show the performance of the systems during the initial loading phase. The top row refers to a workload in which the size of the values is 128 bytes (vs. 4000 used so far). To keep the amount of data indexed by the PTS equal to that of the previous experiments, we increase the number of keys accordingly. The bottom row refers to a workload in which the top-level application submits a mix of read and write operations, with a 50:50 ratio. We run these workloads using both a preconditioned SSD and a trimmed one.

As the plots show, the first three pitfalls apply to these workloads as well. First, steady-state behavior can be very different from the performance observed at the beginning of the test. In the mixed read/write workload, it takes longer for the PTSes and the SSD to stabilize, because writes are less frequent. This is visible by comparing the top row in Figure 11 with the top row in Figure 2 (Section 4.1).

Second, the value and the variations of WA-D are important to explain performance changes and dynamics. We note that the WA-D function of WiredTiger in the trimmed case with small values (Figure 11d) is different from the one observed for the workload with 4000B values (Figure 2d in Section 4.1). With 4000B values the WA-D function starts at a value close to 1, whereas with 128B values its starting point is closer to 2. This happens because the initial data loading phase leads to different SSD states depending on the size of the KV pairs. With 4000B values, one KV pair can fill one filesystem page, which is written only once to the underlying SSD. With small values, instead, the same page needs to be written multiple times to pack more KV pairs, which results in higher fragmentation at the SSD level. Such a phenomenon is not visible in RocksDB, because it writes large chunks of data regardless of the size of the individual KV pairs; the sketch at the end of this subsection illustrates this effect.

Third, the initial state of the SSD leads to different transient and steady-state performance. As for the other workloads considered in the paper, this pitfall applies especially to WiredTiger.

In light of these results, we argue that our pitfalls and guidelines should apply to every workload that has a sustained write component. Some of our pitfalls are also relevant to read-only workloads, i.e., the ones concerning the data-set size, the space amplification, and the storage technology.
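The 128B vs. 4000B effect described above can be made concrete with a back-of-the-envelope calculation. The sketch below is our own illustration, not the paper's methodology: it assumes 4 KB filesystem pages, a hypothetical 16-byte per-record overhead, and, pessimistically, that every appended record causes the page it lands in to be flushed. Real engines batch writes, so the numbers are upper bounds, but they show why small values make the SSD see several versions of the same page during loading.

```python
# Illustrative sketch (our assumptions, not the paper's measurements): estimate how
# many times the same 4 KB filesystem page is written while it fills up with KV pairs.
PAGE_SIZE = 4096        # bytes; typical filesystem page size
RECORD_OVERHEAD = 16    # hypothetical per-record key/metadata overhead, in bytes

def records_per_page(value_size):
    """How many KV pairs fit into one filesystem page."""
    return max(1, PAGE_SIZE // (value_size + RECORD_OVERHEAD))

def worst_case_page_writes(value_size):
    """Upper bound on writes of the *same* page during loading, assuming every
    appended record triggers a flush of its page (real engines batch writes)."""
    return records_per_page(value_size)

for value_size in (4000, 128):
    print(f"{value_size:>5} B values: {records_per_page(value_size):>2} records per page, "
          f"up to {worst_case_page_writes(value_size):>2} writes of the same page")
```

With 4000B values a page is written once, so the SSD holds no stale copies of it after loading, which is consistent with a WA-D that starts near 1. With 128B values the device may receive many short-lived versions of the same logical page, leaving more invalidated flash space to garbage-collect and a higher starting WA-D.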
5. RELATED WORK
In this section we discuss i) the performance analyses of existing storage systems on SSDs, and how they relate to the pitfalls we describe; ii) work on modeling and benchmarking SSDs in the storage community; and iii) research in the broader field of system benchmarking.

Performance analyses of SSD-based storage systems. Benchmarking the performance of PTSes on SSDs is a task frequently performed both in academia [6,7,9,10,18,19,44,46,50,62,65] and in industry [24,32,77,78] to compare different designs.

In general, these evaluations fall short in considering the benchmarking pitfalls we discuss in this paper. For example, the evaluations do not report the duration of the experiments, do not specify the initial conditions of the SSD on which the experiments are run, or consider only a single dataset size. As we show in this paper, these aspects are crucial for both the quantitative and the qualitative analysis of the performance of a data store deployed on an SSD.

In addition, performance benchmarks of PTSes typically focus on application-level write amplification to analyze the I/O efficiency and flash-friendliness of a system [6,18,19,44,50,62]. We show that device-level write amplification must also be taken into account for these purposes.

A few systems distinguish between bursty and sustained performance, by investigating the variations of throughput and latencies over time in LSM-tree key-value stores [7,46,65]. Balmau et al. [7] additionally show how some optimizations can improve the performance of LSM-tree key-value stores in the short term, but lead to performance degradation in the long run. These works focus on the high-level, i.e., data-structure-specific, causes of such performance dynamics. Our work, instead, investigates the low-level causes of performance variations in PTSes, and correlates them with the idiosyncratic performance dynamics of SSDs.

Yang et al. [80] show that stacking log-structured data structures on top of each other may hinder the effectiveness of the log-structured design. Oh et al. [55] investigate the use of SSD over-provisioning to increase the performance of an SSD-based cache. Athanassoulis et al. [4] propose the RUM conjecture, which states that PTSes have an inherent trade-off between performance and storage cost. Our paper touches some of these issues, and complements the findings of these works, by covering in a systematic fashion several pitfalls of benchmarking PTSes on SSDs, and by providing experimental evidence for each of them.

SSD performance modeling and benchmarking. The research and industry storage communities have produced much work on modeling and benchmarking the performance of SSDs. The Storage Networking Industry Association has defined the Solid State Storage Performance Test Specification [70], which contains guidelines on how to perform rigorous and reproducible SSD benchmarking, so that performance results are comparable across vendors. Many analytical models [21,31,67,68] aim to express in closed form the performance and the device-level WA of an SSD as a function of the workload and the SSD internals and parameters. MQSim [69] is a simulator specifically designed to replicate quickly and accurately the behavior of an SSD at steady state.

Despite this huge body of work, we have previously shown that practitioners and researchers in the system and database communities do not properly, or at all, take into account the performance dynamics of SSDs when benchmarking PTSes. With our work, we aim to raise awareness about the SSD performance intricacies in the system and database communities as well. To this end, we show the intertwined effects of the SSD idiosyncrasies on the performance of PTSes, and provide guidelines on how to conduct a more rigorous and SSD-aware performance benchmarking.

System benchmarking. The process of benchmarking a system is notoriously difficult, and can incur subtle pitfalls that may undermine its results and conclusions. Such complexity is epitomized by the list of benchmarking crimes [29], a popular collection of benchmarking errors and anti-patterns that are frequently found in the evaluation of research papers. Raasveldt et al. [61] provide a similar list with a focus on DB systems. Many research papers target different aspects of the process of benchmarking a system. Maricq et al. [49], Uta et al. [73], and Hoefler and Belli [30] focus on the statistical relevance of the measurements, investigating whether experiments can be repeatable, and how many trials are needed to consider a set of measurements meaningful. Our work is complementary to this body of research, in that it aims to provide guidelines to obtain a more fair and reproducible performance assessment of PTSes deployed on SSDs.

6. CONCLUSIONS
The complex interaction between a persistent tree data structure and a flash SSD device can easily lead to inaccurate performance measurements. In this paper we show seven pitfalls that one can incur when benchmarking a persistent tree data structure on a flash SSD. We demonstrate these pitfalls using RocksDB and WiredTiger, two of the most widespread implementations of the LSM-tree and the B+tree persistent data structures, respectively. We also present guidelines to avoid the benchmarking pitfalls, so as to obtain accurate and representative performance measurements. We hope that our work raises awareness about and provides a deeper understanding of some benchmarking pitfalls, and that it paves the way for more rigorous, fair, and reproducible benchmarking.

7. REFERENCES
[1] Digital storage projections for 2020. https://www.forbes.com/sites/tomcoughlin/2019/12/23/digital-storage-projections-for-2020-part-2/#4f92c2e17797. Accessed: 2020-27-04.
[2] Ssd prices continue to fall, now upgrade your hard drive! https://www.minitool.com/news/ssd-prices-fall.html. Accessed: 2020-27-04.
[3] Ssds are cheaper than ever, hit the magic 10 cents per gigabyte threshold. https://www.techpowerup.com/249972/ssds-are-cheaper-than-ever-hit-the-magic-10-cents-per-gigabyte-threshold. Accessed: 2020-27-04.
[4] M. Athanassoulis, M. Kester, L. Maas, R. Stoica, S. Idreos, A. Ailamaki, and M. Callaghan. Designing access methods: The RUM conjecture. In International Conference on Extending Database Technology (EDBT), Bordeaux, France, 2016.
[5] AWS. Ssd instance store volumes. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html. Accessed: 2020-03-21.
[6] O. Balmau, D. Didona, R. Guerraoui, W. Zwaenepoel, H. Yuan, A. Arora, K. Gupta, and P. Konka. TRIAD: Creating synergies between memory, disk and log in log structured key-value stores. In USENIX ATC, 2017.
[7] O. Balmau, F. Dinu, W. Zwaenepoel, K. Gupta, R. Chandhiramoorthi, and D. Didona. SILK: Preventing latency spikes in log-structured merge key-value stores. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 753–766, Renton, WA, July 2019. USENIX Association.
[8] N. Batsaras, G. Saloustros, A. Papagiannis, P. Fatourou, and A. Bilas. VAT: Asymptotic cost analysis for multi-level key-value stores. CoRR, abs/2003.00103, 2020.
[9] L. Bindschaedler, A. Goel, and W. Zwaenepoel. Hailstorm: Disaggregated compute and storage for distributed LSM-based databases. In Proceedings of the 25th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, page to appear, 2020.
[10] E. Bortnikov, A. Braginsky, E. Hillel, I. Keidar, and G. Sheffi. Accordion: Better memory organization for lsm key-value stores. Proc. VLDB Endow., 11(12):1863–1875, Aug. 2018.
[11] G. S. Brodal and R. Fagerberg. Lower bounds for external memory dictionaries. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '03, pages 546–554, USA, 2003. Society for Industrial and Applied Mathematics.
[12] M. Callaghan. Different kinds of copy-on-write for a b-tree: Cow-r, cow-s. http://smalldatum.blogspot.com/2015/08/different-kinds-of-copy-on-write-for-b.html. Accessed: 2020-03-21.
[13] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2), June 2008.
[14] A. Cidon, D. Rushton, S. M. Rumble, and R. Stutsman. Memshare: A dynamic multi-tenant key-value cache. In 2017 USENIX Annual Technical Conference (USENIX ATC 17), pages 321–334, Santa Clara, CA, July 2017. USENIX Association.
[15] Google Cloud. Storage options. https://cloud.google.com/compute/docs/disks#localssds. Accessed: 2020-03-21.
[16] IBM Cloud. Storage options. https://cloud.ibm.com/docs/vsi?topic=virtual-servers-storage-options. Accessed: 2020-03-21.
[17] D. Comer. Ubiquitous b-tree. ACM Comput. Surv., 11(2):121–137, June 1979.
[18] N. Dayan, M. Athanassoulis, and S. Idreos. Optimal bloom filters and adaptive merging for lsm-trees. ACM Trans. Database Syst., 43(4), Dec. 2018.
[19] N. Dayan and S. Idreos. Dostoevsky: Better space-time trade-offs for lsm-tree based key-value stores via adaptive removal of superfluous merging. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD '18, pages 505–520, New York, NY, USA, 2018. Association for Computing Machinery.
[20] N. Dayan and S. Idreos. The Log-Structured Merge-Bush & the Wacky Continuum. In Proceedings of SIGMOD '19. ACM, 2019.
[21] P. Desnoyers. Analytic models of ssd write performance. ACM Trans. Storage, 10(2), Mar. 2014.
[22] C. Diaconu, C. Freedman, E. Ismert, P.-A. Larson, P. Mittal, R. Stonecipher, N. Verma, and M. Zwilling. Hekaton: Sql server's memory-optimized oltp engine. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 1243–1254, New York, NY, USA, 2013. Association for Computing Machinery.
[23] Facebook. Mcdipper: A key-value cache for flash storage. https://www.facebook.com/notes/facebook-engineering/mcdipper-a-key-value-cache-for-flash-storage/10151347090423920. Accessed: 2020-01-31.
[24] Facebook. Rocksdb performance benchmarks. https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks. Accessed: 2020-01-31.
[25] Facebook. The RocksDB persistent key-value store. http://rocksdb.org.
[26] Facebook. Rocksdb tuning guide. https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide.
[27] Google. LevelDB. https://github.com/google/leveldb.
[28] C. M. U. D. Group. Peloton. https://pelotondb.io/about/. Accessed: 2020-01-31.
[29] G. Heiser. System benchmarking crimes. https://www.cse.unsw.edu.au/~gernot/benchmarking-crimes.html. Accessed: 2020-01-31.
[30] T. Hoefler and R. Belli. Scientific benchmarking of parallel computing systems: Twelve ways to tell the masses when reporting performance results. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, New York, NY, USA, 2015. Association for Computing Machinery.

[31] X.-Y. Hu, E. Eleftheriou, R. Haas, I. Iliadis, and R. Pletka. Write amplification analysis in flash-based solid state drives. In Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, SYSTOR '09, New York, NY, USA, 2009. Association for Computing Machinery.
[32] HyperDex. Hyperleveldb performance benchmarks. http://hyperdex.org/performance/leveldb/. Accessed: 2020-01-31.
[33] IBM. Db2 index structure. https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.admin.perf.doc/doc/c0005300.html. Accessed: 2020-01-31.
[34] Intel. Intel optane technology. https://www.intel.com/content/www/us/en/architecture-and-technology/intel-optane-technology.html.
[35] Intel 660p SSD, 2019.
[36] Intel Corporation. Intel Solid-State Drive DC P3600 Series. Product specification, October 2014.
[37] Intel Linux NVMe Driver. https://www.intel.com/content/dam/support/us/en/documents/ssdc/data-center-ssds/Intel_Linux_NVMe_Guide_330602-002.pdf. Accessed: 2019-11-15.
[38] N. Ioannou, K. Kourtis, and I. Koltsidas. Elevating commodity storage with the SALSA host translation layer. MASCOTS '18, 2018.
[39] S. Knipple. Leveraging the latest flash in the data center. Flash Memory Summit, 2017. https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2017/20170809_FG21_Knipple.pdf.
[40] K. Kourtis, N. Ioannou, and I. Koltsidas. Reaping the performance of fast nvm storage with udepot. In Proceedings of the 17th USENIX Conference on File and Storage Technologies, FAST '19, pages 1–15, USA, 2019. USENIX Association.
[41] B. Lepers, O. Balmau, K. Gupta, and W. Zwaenepoel. Kvell: The design and implementation of a fast persistent key-value store. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP '19, pages 447–461, 2019.
[42] J. J. Levandoski, D. B. Lomet, and S. Sengupta. The bw-tree: A b-tree for new hardware platforms. In Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE 2013), ICDE '13, pages 302–313, USA, 2013. IEEE Computer Society.
[43] H. Lim, D. G. Andersen, and M. Kaminsky. Towards accurate and fast evaluation of multi-stage log-structured designs. In Proceedings of the 14th Usenix Conference on File and Storage Technologies, FAST '16, pages 149–166, USA, 2016. USENIX Association.
[44] H. Lim, B. Fan, D. G. Andersen, and M. Kaminsky. Silt: A memory-efficient, high-performance key-value store. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, SOSP '11, pages 1–13, New York, NY, USA, 2011. Association for Computing Machinery.
[45] L. Lu, T. S. Pillai, H. Gopalakrishnan, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Wisckey: Separating keys from values in ssd-conscious storage. ACM Trans. Storage, 13(1), Mar. 2017.
[46] C. Luo and M. J. Carey. On performance stability in lsm-based storage systems. Proc. VLDB Endow., 13(4):449–462, Dec. 2019.
[47] C. Luo and M. J. Carey. Lsm-based storage techniques: A survey. The VLDB Journal, 29(1):393–418, 2020.
[48] D. Ma, J. Feng, and G. Li. A survey of address translation technologies for flash memories. ACM Comput. Surv., 46(3), Jan. 2014.
[49] A. Maricq, D. Duplyakin, I. Jimenez, C. Maltzahn, R. Stutsman, and R. Ricci. Taming performance variability. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 409–425, Carlsbad, CA, Oct. 2018. USENIX Association.
[50] L. Marmol, S. Sundararaman, N. Talagala, and R. Rangaswami. NVMKV: A scalable, lightweight, ftl-aware key-value store. In 2015 USENIX Annual Technical Conference (USENIX ATC 15), pages 207–219, Santa Clara, CA, July 2015. USENIX Association.
[51] Micron. 3d xpoint technology. https://www.micron.com/products/advanced-solutions/3d-xpoint-technology.
[52] MongoDB. Mongodb's wiredtiger storage engine. https://docs.mongodb.com/manual/core/wiredtiger/.
[53] MongoDB. The WiredTiger storage engine. http://source.wiredtiger.com.
[54] D. Narayanan, E. Thereska, A. Donnelly, S. Elnikety, and A. Rowstron. Migrating server storage to ssds: Analysis of tradeoffs. In Proceedings of the 4th ACM European Conference on Computer Systems, EuroSys '09, pages 145–158, 2009.
[55] Y. Oh, J. Choi, D. Lee, and S. H. Noh. Caching less for better performance: Balancing cache size and update cost of flash memory cache in hybrid storage systems. In Proceedings of the 10th USENIX Conference on File and Storage Technologies, FAST '12, page 25, USA, 2012. USENIX Association.
[56] P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil. The log-structured merge-tree (lsm-tree). Acta Informatica, 33(4):351–385, June 1996.
[57] E. S. Page. Continuous inspection schemes. Biometrika, 41(1-2):100–115, 1954.
[58] A. Papagiannis, G. Saloustros, P. González-Férez, and A. Bilas. Tucana: Design and implementation of a fast and efficient scale-up key-value store. In 2016 USENIX Annual Technical Conference (USENIX ATC 16), pages 537–550, Denver, CO, June 2016. USENIX Association.
[59] A. Pavlo, G. Angulo, J. Arulraj, H. Lin, J. Lin, L. Ma, P. Menon, T. Mowry, M. Perron, I. Quah, S. Santurkar, A. Tomasic, S. Toor, D. V. Aken, Z. Wang, Y. Wu, R. Xian, and T. Zhang. Self-driving database management systems. In CIDR 2017, Conference on Innovative Data Systems Research, 2017.
[60] Percona. Tokumx. https://www.percona.com/software/mongo-database/percona-tokumx. Accessed: 2020-01-31.
[61] M. Raasveldt, P. Holanda, T. Gubner, and H. Mühleisen. Fair benchmarking considered difficult: Common pitfalls in database performance testing. In Proceedings of the Workshop on Testing Database Systems, DBTest '18, New York, NY, USA, 2018. Association for Computing Machinery.

[62] P. Raju, R. Kadekodi, V. Chidambaram, and I. Abraham. PebblesDB: Building key-value stores using fragmented log-structured merge trees. SOSP '17, 2017.
[63] K. Ren, Q. Zheng, J. Arulraj, and G. Gibson. Slimdb: A space-efficient key-value storage engine for semi-sorted data. Proc. VLDB Endow., 10(13):2037–2048, Sept. 2017.
[64] B. Schroeder, R. Lagisetty, and A. Merchant. Flash reliability in production: The expected and the unexpected. In 14th USENIX Conference on File and Storage Technologies (FAST 16), pages 67–80, Santa Clara, CA, Feb. 2016. USENIX Association.
[65] R. Sears and R. Ramakrishnan. Blsm: A general purpose log structured merge tree. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, pages 217–228, New York, NY, USA, 2012. Association for Computing Machinery.
[66] Z. Shen, F. Chen, Y. Jia, and Z. Shao. Didacache: An integration of device and application for flash-based key-value caching. ACM Trans. Storage, 14(3), Oct. 2018.
[67] R. Stoica and A. Ailamaki. Improving flash write performance by using update frequency. VLDB, 6(9):733–744, 2013.
[68] R. Stoica, R. Pletka, N. Ioannou, N. Papandreou, S. Tomic, and H. Pozidis. Understanding the design trade-offs of hybrid flash controllers. In Proceedings of MASCOTS, 2019.
[69] A. Tavakkol, J. Gómez-Luna, M. Sadrosadati, S. Ghose, and O. Mutlu. Mqsim: A framework for enabling realistic studies of modern multi-queue SSD devices. In 16th USENIX Conference on File and Storage Technologies (FAST 18), pages 49–66, Oakland, CA, Feb. 2018. USENIX Association.
[70] The Storage Networking Industry Association. Solid state storage (sss) performance test specification (pts). https://www.snia.org/tech_activities/standards/curr_standards/pts. Accessed: 2020-01-31.
[71] A. Trivedi, N. Ioannou, B. Metzler, P. Stuedi, J. Pfefferle, K. Kourtis, I. Koltsidas, and T. R. Gross. Flashnet: Flash/network stack co-design. ACM Trans. Storage, 14(4), Dec. 2018.
[72] Twitter. Fatcache. https://github.com/twitter/fatcache. Accessed: 2020-01-31.
[73] A. Uta, A. Custura, D. Duplyakin, I. Jimenez, J. S. Rellermeyer, C. Maltzahn, R. Ricci, and A. Iosup. Is big data performance reproducible in modern cloud networks? In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), Santa Clara, CA, Feb. 2020. USENIX Association.
[74] R. L. Villars and E. Burgener. Idc: Building data centers for today's data driven economy: The role of flash. Flash Memory Summit, 2014. https://www.sandisk.it/content/dam/sandisk-main/en_us/assets/resources/enterprise/white-papers/flash-in-the-data-center-idc.pdf.
[75] P. Wang, G. Sun, S. Jiang, J. Ouyang, S. Lin, C. Zhang, and J. Cong. An efficient design and implementation of lsm-tree based key-value store on open-channel ssd. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys '14, New York, NY, USA, 2014. Association for Computing Machinery.
[76] Z. Wang, A. Pavlo, H. Lim, V. Leis, H. Zhang, M. Kaminsky, and D. G. Andersen. Building a bw-tree takes more than just buzz words. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD '18, pages 473–488, New York, NY, USA, 2018. ACM.
[77] WiredTiger. Leveldb performance benchmarks. https://github.com/wiredtiger/wiredtiger/wiki/LevelDB-Benchmark. Accessed: 2020-01-31.
[78] WiredTiger. Wiredtiger performance benchmarks. https://github.com/wiredtiger/wiredtiger/wiki/Btree-vs-LSM. Accessed: 2020-01-31.
[79] K. Wu, A. Arpaci-Dusseau, R. Arpaci-Dusseau, R. Sen, and K. Park. Exploiting intel optane ssd for microsoft sql server. In Proceedings of the 15th International Workshop on Data Management on New Hardware, DaMoN '19, New York, NY, USA, 2019. Association for Computing Machinery.
[80] J. Yang, N. Plasson, G. Gillis, N. Talagala, and S. Sundararaman. Don't stack your log on my log. In 2nd Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW 14), Broomfield, CO, Oct. 2014. USENIX Association.
