Real-Time LSM-Trees for HTAP Workloads

Real-Time LSM-Trees for HTAP Workloads Hemant Saxena Lukasz Golab University of Waterloo University of Waterloo [email protected] [email protected] Stratos Idreos Ihab F. Ilyas Harvard University University of Waterloo [email protected] [email protected] ABSTRACT We observe that a Log-Structured Merge (LSM) Tree is a natu- Real-time data analytics systems such as SAP HANA, MemSQL, ral fit for a lifecycle-aware storage engine. LSM-Trees are widely and IBM Wildfire employ hybrid data layouts, in which dataare used in key-value stores (e.g., Google’s BigTable and LevelDB, Cas- stored in different formats throughout their lifecycle. Recent data sandra, Facebook’s RocksDB), RDBMSs (e.g., Facebook’s MyRocks, are stored in a row-oriented format to serve OLTP workloads and SQLite4), blockchains (e.g., Hyperledger uses LevelDB), and data support high data rates, while older data are transformed to a stream and time-series databases (e.g., InfluxDB). While Cassandra column-oriented format for OLAP access patterns. We observe that and RocksDB can simulate columnar storage via column families, a Log-Structured Merge (LSM) Tree is a natural fit for a lifecycle- we are not aware of any lifecycle-aware LSM-Trees in which the aware storage engine due to its high write throughput and level- storage layout can change throughout the lifetime of the data. We oriented structure, in which records propagate from one level to fill this gap in our work, by extending the capabilities ofLSM- the next over time. To build a lifecycle-aware storage engine using based systems to efficiently serve real-time analytics and HTAP an LSM-Tree, we make a crucial modification to allow different workloads. data layouts in different levels, ranging from purely row-oriented An LSM-Tree is a multi-level data structure with a main-memory to purely column-oriented, leading to a Real-Time LSM-Tree. We buffer and a number of levels of exponentially-increasing size give a cost model and an algorithm to design a Real-Time LSM-Tree (shown in Figure 1 and discussed in detail in Section 2). Periodically, that is suitable for a given workload, followed by an experimental or when full, the buffer is flushed to Level-0. When Level-0, which evaluation of LASER - a prototype implementation of our idea built stores multiple flushed buffers, is nearly full, its data are merged on top of the RocksDB key-value store. In our evaluation, LASER is into the sorted runs residing in level one (via a compaction process), almost 5x faster than Postgres (a pure row-store) and two orders of and so on. We observe that LSM-Trees provide a natural framework magnitude faster than MonetDB (a pure column-store) for real-time for a lifecycle-aware storage engine for real-time analytics due to data analytics workloads. the following reasons. 1 LSM-Trees are write optimized: All writes and data transfers between levels are batched, allowing high write throughput. 1 INTRODUCTION 2 LSM-Trees naturally propagate data through the levels over time: At any point in time, the buffer stores the most re- The need for real-time analytics or Hybrid Transactional-Analytical cent data that have not yet been flushed (perhaps data inserted Processing (HTAP) is ubiquitous in present-day applications such as within the last hour), Level-0 may contain data between one hour content recommendation, real-time inventory/pricing, high-frequency and 24 hours old, and levels one and beyond store even older data. trading, blockchain data analytics, and IoT [28]. These applica- 3 Different levels can store data in different layouts: Data tions differ from traditional pure On-Line Transactional Processing may be stored in row format in the buffer and in some of the levels, (OLTP) and On-Line Analytical Processing (OLAP) applications and in column format in other levels. This suggests a flexible and in two aspects: 1) they have high data ingest rates [28]; 2) access configurable storage engine that can be adapted to the workload. patterns change over the lifecycle of the data [5, 15]. For exam- arXiv:2101.06801v1 [cs.DB] 17 Jan 2021 4 Compaction can be used to change data layout: Transform- ple, recent data may be accessed via OLTP style operations (point ing the data from a row to a column format can be done seamlessly queries and updates) as part of an alerting application, or to impute during compaction, when a level is merged into the next level. a missing attribute value based on the values of other attributes [28]. We make the following contributions in this paper. Additionally, recent and historical data may be accessed via OLAP style processes, perhaps to generate hourly, weekly and monthly • We propose the Real-Time LSM-Tree, which extends the design reports, where all values of one column (or a few selected columns, of a traditional LSM-Tree with the ability to store data in a row- depending on the time span of the report) are scanned [28]. oriented or a column-oriented format in each level. Recent systems, such as SAP HANA [20], MemSQL [3], and IBM • We characterize the design space of possible Real-Time LSM- Wildfire14 [ ], support real-time analytics with write-optimized Trees, where different designs are suitable for different workloads. features for high data ingest rates and read-optimized features for To navigate this design space, we provide a cost model to select HTAP workloads. In these systems, recent data are stored in a row- good designs for a given workload. oriented format to serve point queries (OLTP), and older data are • We develop and empirically evaluate LASER, a Lifecycle-Aware transformed to a column-oriented format suitable for OLAP. Such Storage Engine for Real-time analytics based on Real-Time LSM- systems can be described as having a lifecycle-aware data layout. Trees. We implement LASER using RocksDB, which is a popular Hemant Saxena, Lukasz Golab, Stratos Idreos, and Ihab F. Ilyas open-source key-value store based on LSM-Trees. We show that the most recent version is kept, and any key with a tombstone flag for real-time data analytics workloads, LASER is almost 5x faster is deleted. In practice, merging is done at SST granularity, i.e., some than Postgres and two orders of magnitude faster than MonetDB. SSTs from level 8 − 1 are merged with overlapping SSTs in level 8. This divides the merging process into smaller tasks, bounding the 2 OVERVIEW OF LSM-TREES processing time and allowing parallelism. Sorted runs in Level-0 are not partitioned into SSTs (or have exactly one SST) because they 2.1 Design are directly flushed from memory. Some implementations, such as Compared to traditional read-optimized data structures such as B- RocksDB, make an exception for Level-0 and allow multiple sorted trees or B¸-trees, LSM-Trees focus on high write throughput while runs to absorb write bursts. allowing indexed access to data [26]. LSM-Trees have two compo- The merging process moves data from one level to the next over nents: an in-memory piece that buffers inserts and a secondary- time. This puts recent data in the upper levels and older data in the storage (SSD or disk) piece. The in-memory piece consists of trees lower levels, providing a natural framework for a lifecycle-aware or skiplists, whereas the disk piece consists of sorted runs. storage engine proposed in this paper. In Figure 2, we present the Figure 1 shows the architecture of an LSM-Tree, with the memory results of an experiment using RocksDB with an LSM-Tree having piece at the top, followed by multiple levels of sorted runs on disk five levels (zero through four), with Level-0 starting at 64MB and (four levels, numbered zero to three, are shown in the figure). The ) = 2. We inserted data at a steady rate until all the levels were full, memory piece contains two or more skiplists of user-configured with background compaction enabled. We show the distribution size (two are shown in the figure). New records are inserted into of keys in terms of their time-since-insertion for two compaction the most recent (mutable) skiplist and into a write-ahead-log for policies commonly used in RocksDB: kByCompensatedSize (Figure durability. Once inserted, a record cannot be modified or deleted 2(a)) prioritizes the largest SST, and kOldestSmallestSeqFirst (Figure directly. Instead, a new version of it must be inserted and marked 2(b)) prioritizes SSTs whose key range has not been compacted with a tombstone flag in case of deletions. for the longest time. For both compaction priorities, each level has Once a skiplist is full, it becomes immutable and can be flushed a high density of keys within a certain time range. We will use to disk via a sequential write. Flushing is executed by a background time-based compaction priority because it is better at distributing thread (or can be called explicitly) and does not block new data keys based on time since insertion. from being inserted into the mutable skiplist. During flushing, each A point query starts from the most recent data and stops as soon skiplist is sorted and serialized to a sorted run. Sorted runs are typi- as the search key is found (there may be older versions of this cally range-partitioned into smaller chunks called Sorted Sequence key deeper in the LSM-Tree, but the query only returns the latest Tables (SSTs), which consist of fixed-size blocks. In Figure 1, we version). First, the in-memory skiplists are probed. If the search show sorted runs being range-partitioned by key into multiple SSTs.

Real-Time LSM-Trees for HTAP Workloads

NUMA-Aware Thread Migration for High Performance NVMM File Systems

Unravel Data Systems Version 4.5

Artificial Intelligence for Understanding Large and Complex

Characterizing, Modeling, and Benchmarking Rocksdb Key-Value

Myrocks in Mariadb

Dmon: Efficient Detection and Correction of Data Locality

As Focused on Software Tools That Support Software Engineering, Along with Data Structures and Algorithms Generally

Rocksdb…The Old Way…Log Structured Merge

Reducing DRAM Footprint with NVM in Facebook (Eurosys'18)

Realtime Data Processing at Facebook

Optimizing Space Amplification in Rocksdb

Towards Left Duff S Mdbg Holt Winters Gai Incl Tax Drupal Fapi Icici