RocksDB…the Old Way…Log Structured Merge

TRocksDB

Remi Brasga
Sr. Software Engineer | Memory Storage Strategy Division
Toshiba Memory America, Inc.

Hi! I’m Remi (Remington Brasga). As a Sr. Software Engineer for the Memory Storage Strategy Division at Toshiba Memory America, Inc., I am dedicated to development and collaboration on open source software. I earned my B.S. and M.S. from the University of California, Irvine. Today, I am here fresh from having my 3rd son, who has kept me very busy the past few weeks. (Thanks for letting me come here and rest!)

Applications are King
• Software is eating the world
• Storage is “Defined by Software” – i.e., applications define how storage is used
• Yet, applications do not always use storage wisely

For Example, RocksDB
RocksDB is a popular data storage engine
…used by a wide range of database applications:
ArangoDB, Ceph™, Cassandra®, MariaDB®, Python®, MyRocks, Rockset

Cassandra is a registered trademark of The Apache Software Foundation. Ceph is a trademark of Red Hat, Inc. or its subsidiaries in the United States and other countries. Python is a registered trademark of the Python Software Foundation. MariaDB is a registered trademark of MariaDB in the European Union and other regions. All other company names, product names and service names may be trademarks of their respective companies.

The Challenge of Rocks
While highly effective as a data storage engine, RocksDB has some challenges for SSDs:
• Write amplification is very large, 20-30X (or more), as a result of the compaction layers rewriting the same data
• This can reduce SSD endurance

How does RocksDB work?
• Keys and values are stored together in pairs.
• As more data is added, these pairs waterfall through the levels.
• This waterfall occurs during each compaction.

[Diagram: data enters as key + value pairs (K+V) and moves down the levels]
Level 0: K+V
Level 1: target 1 GB
Level 2: target 10 GB
Level 3: target 100 GB
Level 4: target 1000 GB
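To make the waterfall concrete, here is a minimal sketch of why rewriting the same data at each level inflates the bytes written to the SSD. It is a toy model, not RocksDB's compaction logic: the function names, the "merge the whole level into the next one" rule, and the unit-less sizes are all assumptions made for illustration. Only the level targets (1, 10, 100, 1000 GB) come from the slide.

```python
def level_targets_gb(base_gb=1, multiplier=10, levels=4):
    """Target sizes for Level 1..N, following the slide's 1/10/100/1000 GB ladder."""
    return [base_gb * multiplier ** i for i in range(levels)]


def simulate_write_amp(user_batches, batch_size, capacities):
    """Toy leveled-compaction model (hypothetical, heavily simplified).

    Every user batch is written once to the first level. When a level exceeds
    its capacity, all of its data is merged into the next level, and the merged
    result (old data + new data) is written out again.
    Returns device bytes written divided by user bytes written.
    """
    levels = [0] * len(capacities)
    user_written = 0
    device_written = 0
    for _ in range(user_batches):
        user_written += batch_size
        device_written += batch_size          # initial write into the first level
        levels[0] += batch_size
        for i in range(len(levels) - 1):      # the last level never spills further
            if levels[i] <= capacities[i]:
                break                         # nothing overflowed, stop cascading
            merged = levels[i] + levels[i + 1]
            device_written += merged          # data that did not change is rewritten
            levels[i + 1] = merged
            levels[i] = 0
    return device_written / user_written


if __name__ == "__main__":
    print("Level targets (GB):", level_targets_gb())
    # Capacities use the same 10x ladder, in arbitrary units.
    ratio = simulate_write_amp(5000, 1, [10, 100, 1000, 10000])
    print(f"Toy write amplification: {ratio:.1f}x")
```

The absolute number this prints depends entirely on the toy assumptions; the point is only that the ratio is well above 1 because unchanged data is written again each time it moves down a level.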
Let’s Demonstrate

RocksDB…the Old Way…Log Structured Merge
[Animation: CPU, DRAM, and an SSD holding LSM compaction levels 1-3. The frame captions follow.]
Frame 1: New key-value database entries are created in DRAM and saved to the SSD in compaction level 1.
Frame 2: More writes go to compaction level 1; compaction level 1 fills up.
Frame 3: Compaction level 1 is full; its data is written to RAM and consolidated for compaction level 2.
Frame 4: The consolidated data is written to compaction level 2 (application-generated write amplification).
Frame 5: Compaction level 2 is filling up.
Frame 6: Compaction level 2 fills up; its data is written to DRAM and consolidated.
Frame 7: The consolidated data is written to compaction level 3 (application-generated write amplification).

Result
As you can see,
• This results in rewriting data “that doesn’t change” during compaction
• This can have a significant impact on SSD health and endurance

The good news is…
There is a better way
Toshiba Memory America has re-architected RocksDB to solve this SSD endurance challenge

Inspiration
We looked at some of the academic work on Rocks
– Major inspiration came from “WiscKey” from the University of Wisconsin
WiscKey: Separating Keys from Values in SSD-conscious Storage
https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf

Coding Work
Significant code was involved to split the keys from the values
• Replaced the LSM key-value pairs with a ring buffer (for a Value Log, or VLog)
• The VLog is a large number of files,
  – with each file having about the same number of values as an SST* has keys
• VLog files are organized into FIFO structures called VLogRings
• One tricky issue: keeping track of the state of the VLog over a restart
• Active recycling added to compactions to reduce instances of uncompacted files
*SST = Sorted String Table

Coding Work
Work involved managing the keys
• While keys are mostly cached in memory,
• they still need to be written into the LSM layers
• Compactions still occur,
  – about one per second as usual, but only on the keys
• Coding for compaction involved
  – Compacting the keys
  – And cleaning up the stale data in the ring buffers

Performance Tuning
Tuning performance in the code was a challenge
• Some changes made performance worse (against every expectation…)
• The split keys and values approach needed new test cases
• There are a number of new option settings to enable optimizations; this work added improvements to
  – The compaction picker (improved to reduce the need for active recycling)
  – Write amp and space amp measuring
  – Bulk load efficiency

Coding Merge
Code merge: RocksDB continues to evolve
– TRocksDB is maintaining version compatibility with each current revision of RocksDB
– TRocks is fully compatible and cross-compiles with/into regular RocksDB code today
– Once compiled, Rocks or TRocks can be enabled with option settings

How does TRocksDB work?
• Keys and values are split
• Keys go into compaction layers (in memory)
• Values are stored separately in a ring buffer
• The Value Log is unaffected by compactions

[Diagram: data is split into key and value; keys (K) enter the levels, values (V) go to the Value Log]
Level 0: K
Level 1: target 1 GB
Level 2: target 10 GB
Level 3: target 100 GB
Level 4: target 1000 GB

The Improved TRocksDB Method

TRocksDB…the New Way: Ring Buffer, Separate Keys/Values
[Animation: CPU, DRAM, and an SSD holding the LSM and the ring buffer. The frame captions follow.]
Frame 1: Create new database entries in DRAM.
Frame 2: Keys are stored separately (cacheable) in the LSM; values are written to the ring buffer.
• Lookups are lightning fast with keys cached in DRAM
• SSD endurance is improved via lowered write amp

Our Goal during Development
Improved (reduced) write amplification with the same or better performance

Improved Write Amplification
Write Amplification (lower is better)

Test                            Rocks FB 6.4   TRocks 1.2.5.23   Change
Test 1: Random Bulk Load        4.9            2.1               -57%
Test 2: Bulk Sequential Load    1              0.6               -40%
Test 3: Random Overwrites       1              0.6               -40%

Performance results based on internal Toshiba Memory testing. Benchmark tests were developed by Facebook and generated in July 2018. They measure RocksDB performance when data resides on flash storage and were tested using the newest release, RocksDB 6.1.fb. The benchmark information and test descriptions are available at https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks.

Same or Better Performance
Performance Comparison, in seconds (lower is better)

Test                            Rocks FB 6.4   TRocks 1.2.5.23   Change
Test 1: Random Bulk Load        155,081        118,244           -24%
Test 2: Bulk Sequential Load    29,742         21,565            -28%
Test 3: Random Overwrites       6,938          6,155             -11%
Test 4: Random Key Reads        9,890          9,595             -3%
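The key/value split described in the TRocksDB slides can be sketched as follows. This is a toy illustration under stated assumptions, not TRocksDB's code: the class names `ToyValueLog` and `ToySplitStore`, the 8-byte pointer size, and the in-memory dictionary standing in for the key LSM are all hypothetical, and recycling of old value-log files is not modeled. It shows only the central idea: values are appended once to a log of fixed-size files, while "compaction" rewrites just keys and small pointers.

```python
POINTER_BYTES = 8  # assumed size of a (file, slot) pointer; illustrative only


class ToyValueLog:
    """Append-only value log split into fixed-size files, oldest first."""

    def __init__(self, values_per_file=4):
        self.values_per_file = values_per_file
        self.files = [[]]
        self.bytes_written = 0

    def append(self, value):
        if len(self.files[-1]) >= self.values_per_file:
            self.files.append([])             # start a new file at the head
        self.files[-1].append(value)
        self.bytes_written += len(value)
        return (len(self.files) - 1, len(self.files[-1]) - 1)

    def read(self, pointer):
        file_id, slot = pointer
        return self.files[file_id][slot]


class ToySplitStore:
    """Keys plus pointers live in an index; values live in the value log."""

    def __init__(self):
        self.vlog = ToyValueLog()
        self.index = {}                       # stand-in for the key-only LSM
        self.key_bytes_written = 0

    def put(self, key, value):
        self.index[key] = self.vlog.append(value)
        self.key_bytes_written += len(key) + POINTER_BYTES

    def get(self, key):
        return self.vlog.read(self.index[key])

    def compact_keys(self):
        """Rewrite the sorted key index; the values are never touched."""
        self.index = dict(sorted(self.index.items()))
        self.key_bytes_written += sum(len(k) + POINTER_BYTES for k in self.index)


if __name__ == "__main__":
    store = ToySplitStore()
    for i in range(10):
        store.put(f"key{i}".encode(), b"v" * 100)
    before = store.vlog.bytes_written
    store.compact_keys()
    print("value bytes rewritten by compaction:", store.vlog.bytes_written - before)
    print("key bytes written so far:", store.key_bytes_written)
```

In this sketch a compaction costs a few bytes per key regardless of how large the values are, which is the mechanism the slides credit for the lower write amplification.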
Tuning and Optimization
• Every database is different
• Rocks and TRocks have a wide range of settings / options to optimize the database
• These settings are often confusing or “left in default”

Tuning and Optimization
• Toshiba Memory America has created a “tuning” tool
  – It is a Google spreadsheet
• Available from GitHub®
  – Easy to use: just fill in the blanks
  – It generates known-good settings
[Screenshots: “Fill in the blanks” and “Generate ‘known-good’ options settings”]

GitHub is an exclusive trademark registered in the United States by GitHub, Inc.

Call to Action
• Download
• Test
• Use
• Enjoy!
• Join the project: improve and contribute to the code

Developed by Toshiba Memory America (which will be renamed KIOXIA America, Inc. as of October 1, 2019)
Open Source
Improved write amplification = SSDs last longer, with the same or better performance
Available today: https://github.com/ToshibaMemoryAmerica

All company names, product names and service names may be trademarks of their respective companies. © 2019 Toshiba Memory America, Inc. All rights reserved. Information, including product pricing and specifications, content of services, and contact information is current and believed to be accurate on the date of the announcement, but is subject to change without prior notice. Technical and application information contained here is subject to the most recent applicable Toshiba Memory product specifications.
