A Scalable In-Memory Skip-List Index for Key-Value Store

S3: A Scalable In-memory Skip-List Index for Key-Value Store ∗ Jingtian Zhangy, Sai Wuy , Zeyuan Tany, Gang Cheny, Zhushi Chengz, Wei Caoz, Yusong Gaoz, Xiaojie Fengz yZhejiang University, Hangzhou, Zhejiang, China. zAlibaba Group, Hangzhou, Zhejiang, China. f11421015, wusai, 3160103832, [email protected] fzhushi.chengzs, mingsong.cw, jianchuan.gys, [email protected] ABSTRACT reason of adopting skip-list is two-fold. First, its mainte- Many new memory indexing structures have been proposed nance cost is low, since it does not require complex adjust- and outperform current in-memory skip-list index adopted ment operations in the tree-like index to keep balance. Sec- by LevelDB, RocksDB and other key-value systems. How- ond, when skip-list is full (4MB case in LevelDB), it can be ever, those new indexes cannot be easily intergrated with efficiently flushed to disk and merged with other disk data key-value systems, because most of them do not consider structures (e.g., SSTable). how the data can be efficiently flushed to disk. Some as- However, many new in-memory indexing structures [4, sumptions, such as fixed size key and value, are unrealistic 20, 26, 7, 30, 37, 42] are proposed, showing superior per- for real applications. In this paper, we present S3, a scal- formance than the classic skip-list index. Some core tech- able in-memory skip-list index for the customized version of niques, when designing those efficient new indexes, include RocksDB in Alibaba Cloud. S3 adopts a two-layer struc- cache-sensitive structure, SIMD speedup for multi-core ar- ture. In the top layer, a cache-sensitive structure is used chitecture, specific encoding of the key, latch-free concurrent to maintain a few guard entries to facilitate the search over access and other optimization approaches. Those sophisti- the skip-list. In the bottom layer, a semi-ordered skip-list cated designs allow the new indexes to achieve an order-of- index is built to support highly concurrent insertions and magnitude better performance than skip-list[41]. The prob- fast lookup and range query. To further improve the per- lem, on the other hand, is that they are difficult to be ap- formance, we train a neural model to select guard entries plied to existing disk-based key-value systems. Since many intelligently according to the data distribution and query services on Alibaba Cloud are supported by our customized distribution. Experiments on multiple datasets show that RocksDB (tailored for the cloud environment), we use the S3 achieves a comparable performance to other new mem- RocksDB as our example to illustrate the idea. ory indexing schemes, and can replace current in-memory All hash-based in-memory indices [4, 20] cannot be direct- skip-list of LevelDB and RocksDB to support huge volume ly integrated with the disk-part of LevelDB or RocksDB, be- of data. cause they do not maintain the keys in order. Consequently, they do not support range queries, or we need to transfor- PVLDB Reference Format: m the range queries into multiple lookup requests. Before Jingtian Zhang, Sai Wu, Zeyuan Tan, Gang Chen, Zhushi Cheng, flushing those data back to the disk as a SSTable, a sorting Wei Cao, Yusong Gao, Xiaojie Feng . S3: A Scalable In-memory process is invoked, which is costly and may block write and Skip-List Index for Key-Value Store. PVLDB, 12(12): 2183-2194, read operations. 2019. DOI: https://doi.org/10.14778/3352063.3352134 In RocksDB, the in-memory index is kept small (4MB- 64MB) so that it can be efficiently flushed to disk and does not trigger high compaction overhead 1. This strategy caus- 1. INTRODUCTION es the in-memory index to be frequently rebuilt and flushed Many popular key-value stores, such as LevelDB [8] , out. Many in-memory indices have not taken the case into RocksDB [13] and HBase [1], maintain an in-memory skip- consideration. To achieve a high in-memory performance, list [32] to support efficient data insertion and lookup. The keys and values are normally formatted to be friendly to the cache. Trie-based indices, such as ART [26], Judy [7] and ∗Sai Wu is the corresponding author. Masstree [30], split a key into multiple key slices to facilitate the search and cache usage. Each slice contains a few bytes of the original key. For example, each layer in the Masstree is indexed by a different 8-byte slice of key. This This work is licensed under the Creative Commons Attribution- will significantly slow down the flushing process, as we need NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For to assemble the keys back. Masstree also maintains the key any use beyond those covered by this license, obtain permission by emailing 1 [email protected]. Copyright is held by the owner/author(s). Publication rights Another reason is that skip-list does not perform well for licensed to the VLDB Endowment. large datasets. However, even if we replace skip-list with Proceedings of the VLDB Endowment, Vol. 12, No. 12 other indices, we still cannot maintain a too large in-memory ISSN 2150-8097. index, since the compaction overhead will become the bot- DOI: https://doi.org/10.14778/3352063.3352134 tleneck and block the insertion and lookup operations. 2183 and value separately and as a result, we need to merge them accessing disk data generates more overhead[39], many key- together before writing the data to disk. value stores adopt the log-structured merge tree (LSM) ar- Many other skip-list-based indices, such as CSSL [37] and chitecture[31]. PI [42, 43], will periodically restructure the index to opti- LevelDB is the most popular LSM key-value stores[8]. In mize the performance. During the restructuring process, the LevelDB, a small in-memory index is employed to support read and write operations are blocked. Besides, both indices fast data insertion and lookup. When the in-memory part, assume that keys are of the same size, so that a specified MemTable, is full, we will flush it back to the disk and cre- number of keys can be exactly processed by a SIMD vector. ate a new one. The disk files, SSTables, are also organized A single SIMD load instruction can load all those keys into as multiple levels. If the number of SSTables at one level a SIMD register. Such optimization may not be possible for reaches the predefined threshold, we will pick one SSTable real applications. On the other hand, FAST [25] addressed and merge it with the next level SSTables. This operation the problem by mapping variable length keys to 4-byte fixed is called compaction in LevelDB and may incur extremely size keys. But if we want to flush the memory data back to high processing overhead. the disk, we must reverse the mapping. Original LevelDB adopts a simple concurrent process s- In this paper, we present S3, a scalable in-memory skip- trategy. It buffers all insertions into a buffer and asks one list index for the customized version of RocksDB in Alibaba thread to conduct all insertions sequentially. Hyper-LevelDB Cloud. Usability is the first-class citizen in our design. S3 improves the strategy by allowing concurrent updates[6]. can be seamlessly integrated with the disk-part of LevelDB RocksDB, on the other hand, introduces a multi-threaded and RocksDB to replace their old skip-list index. S3 al- merging strategy for the disk components[13]. cLSM[21] re- so supports high throughput insertion, efficient lookup and places the global mutex lock with a global reader-writer lock range search. The performance of S3 is approaching to the and uses a concurrent memory component. Hence, opera- state-of-the-art in-memory index, ART [26], while it can be tions can proceed in parallel, but need to be blocked at the easily extended as a disk-based index. start and end of each concurrent compaction. Write ampli- S3 adopts a two-layer design. In the bottom layer, we fication is also an important performance issue for LSM-like build a semi-order skip-list, where keys inside one data entry systems. The LSM-trie data structure uses tries to organize are not required to be sorted. Data entries remain ordered. keys to reduce write amplification. However, it cannot sup- Namely, the maximal key of a data entry is smaller than port range queries[40]. RocksDB's universal compaction re- the minimal key of its successor. This strategy allows us duces write amplification by sacrificing read and range query to further improve the insertion performance, but may slow performance[13]. PebblesDB proposes a scheme to balance down the search process. To address the problem, we inten- the costs of write amplification and range query[33]. WiscK- tionally create some guard entries as shortcuts. Instead of ey directly separates values from keys, significantly reducing starting the search from the head of the skip-list, we jump write amplification regardless of the key distribution in the to the closest shortcut and resume our search, which sig- workload[29]. Most of the above works are orthogonal to nificantly reduces the search overhead. In the paper, we our improvements on the memory component. study the effect of guard entry selection on the performance Some works have been done on the memory componen- theoretically. t of LSM key-value stores. RocksDB offers two hash-based But searching for the proper shortcut introduces addition- memory components as options[13]. HashSkipList organizes al overhead. So in the top layer, we adopt a cache-sensitive data in a hash table with each hash bucket as a skip-list.

Load more