Workload Analysis of Key-Value Stores on Non-Volatile Media
Vishal Verma (Performance Engineer, Intel)
Tushar Gohad (Cloud Software Architect, Intel)

Outline

- KV Stores – What and Why
- Data Structures for KV Stores
- Design Choices and Trade-offs
- Performance on Non-Volatile Media
- Key Takeaways

What are KV Stores

- A type of NoSQL store that uses a simple key/value pair mechanism to store data
- An alternative to the limitations of traditional relational databases:
  - Data is structured and the schema is pre-defined
  - Mismatch with today's workloads: data growth is large and unstructured, with lots of random writes and reads
- NoSQL brings flexibility, as the application has complete control over what is stored inside the value

- The key in a key-value pair must (or at least should) be unique; values are identified via their key and can be numbers, strings, images, videos, etc.
- API operations: get(key) for reading data, put(key, value) for writing data, and delete(key) for deleting keys (a minimal sketch follows the example below)
- Phone directory example:

  Key       Value
  Bob       (123) 456-7890
  Kyle      (245) 675-8888
  Richard   (787) 122-2212
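A minimal sketch of this get / put / delete interface, using an in-memory Python dict as a stand-in for a real KV engine (the KVStore class and names below are illustrative, not taken from any of the stores discussed later):

```python
class KVStore:
    """Toy in-memory key-value store exposing the three basic operations."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Keys are unique: a second put() with the same key overwrites the old value.
        self._data[key] = value

    def get(self, key):
        # Returns None when the key is absent instead of raising.
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


# Phone-directory example from the table above.
directory = KVStore()
directory.put("Bob", "(123) 456-7890")
directory.put("Kyle", "(245) 675-8888")
directory.put("Richard", "(787) 122-2212")
print(directory.get("Kyle"))   # -> (245) 675-8888
directory.delete("Bob")
print(directory.get("Bob"))    # -> None
```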

KV Stores: Benefits

- High performance: enables fast location of an object, rather than searching through columns or tables as in a traditional relational DB
- Highly scalable: can scale across several machines or devices by several orders of magnitude, without the need for significant redesign
- Flexible: no structure is enforced on the data
- Low TCO: simplifies adding or removing capacity as needed; a hardware or network failure does not create downtime

Outline

- KV Stores – What and Why
- Design Choices and Trade-offs
- Performance on Non-Volatile Media
- Key Takeaways

Design Choice: B-Tree

[Figure: example B-Tree with pivot keys in the internal nodes (green) and data records in the leaf nodes (blue)]

- Internal nodes (green) – pivot keys
- Leaf nodes (blue) – data records (KV pairs)
- Query time proportional to tree height – logarithmic time (a quick estimate follows below)
- Insertion / Deletion
  - Many random I/Os to disk
  - May incur rebalancing – read and write amplification
  - Full leaf nodes may be split – space fragmentation
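To make the "logarithmic time" point concrete, here is a back-of-the-envelope estimate of tree height, and thus of worst-case node reads per point lookup. The fan-out of 100 keys per node is an illustrative assumption, not a figure from the test setup:

```python
import math

def btree_height(num_records: int, fanout: int) -> int:
    """Approximate B-Tree height: ceil(log_fanout(N))."""
    return math.ceil(math.log(num_records, fanout))

# 500 million records with an assumed fan-out of 100 keys per node:
# a point lookup touches roughly one node (one block read) per level.
print(btree_height(500_000_000, 100))   # -> 5
```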

B-Tree: Trade-offs

- Example DB engines: BerkeleyDB, MySQL InnoDB, MongoDB, WiredTiger
- Good design choice for read-intensive workloads
- Gains read performance at the cost of increased read / write / space amplification
- B-Tree nodes are on-disk blocks – block-aligned I/O leads to space amplification
- Compression is not effective – with a 16KB block size, 10KB of data compressed to 5KB still occupies a full 16KB block (illustrated below)
- Byte-sized updates still end up as page-sized reads / writes
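A small sketch of the block-alignment effect called out above: with 16KB on-disk nodes every write is rounded up to a whole block, so compressing 10KB of data down to 5KB saves no space (the sizes are the ones quoted above):

```python
BLOCK_SIZE = 16 * 1024   # B-Tree node / on-disk block size quoted above

def bytes_on_disk(payload_bytes: int) -> int:
    """Aligned I/O: every payload occupies a whole number of blocks."""
    blocks = -(-payload_bytes // BLOCK_SIZE)   # ceiling division
    return blocks * BLOCK_SIZE

print(bytes_on_disk(10 * 1024))   # uncompressed 10KB -> 16384 bytes on disk
print(bytes_on_disk(5 * 1024))    # compressed to 5KB -> still 16384 bytes on disk
```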

Performance Test Configuration

Dataset:
- 500GB – 1TB (500 million – 1 billion records)
- Dataset size > 3:1 of DRAM size
- Compression: Off
- Key_size: 16 Bytes
- Value_size: 1000 Bytes

System Config:
- CPU: Intel(R) Xeon(R) CPU E5-2618L v4, 2.20GHz, HT disabled, 20 cores
- Memory: 64GB

Non-Volatile Storage Config:
- Intel® P4500 SSD (4TB)

OS:
- Distro: Ubuntu 16.04.1
- Kernel: 4.12.4
- Arch: x86_64

Key-Value Stores:
- WiredTiger 3.0.0
- RocksDB 5.4
- TokuDB 4.6.119
- Cache size: 32GB

db_bench:
- Source: https://github.com/wiredtiger/leveldb.git
- Test Duration: 30 minutes

Tuning parameters:
- Linux Kernel 4.12.0
- Drop page cache after every test run
- XFS filesystem, agcount=32, mount with discard

Workloads:
- ReadWrite (16 Readers, 1 Writer)

Readwrite: WiredTiger B-Tree, 500 million rows

[Charts: WiredTiger B-Tree Write and Read throughput (Ops/sec, log scale) over the 30 min run; x-axis: row count, 5000 ops per marker]

Read plot represents a single Reader. Ops completed: 1 Write thread 2.15 million, 16 Read threads 27.36 million.

[Charts: WiredTiger B-Tree Write and Read latency (sec/op, log scale) over the 30 min run; x-axis: row count, 5000 ops per marker]

Test: single Writer, 16 Readers. Read plot represents a single Reader's latency.

Design Choice: LSM Tree

- Log-Structured Merge-tree
- Two or more tree-like components:
  - An in-memory tree (RocksDB: memtable)
  - One or more trees on the persistent store (RocksDB: SST files)
- Transforms random writes into a few sequential writes
- Write-Ahead Log (WAL) – an append-only journal
- In-memory store – inexpensive writes to memory as the first-level write, flushed sequentially to the first level of the persistent store (see the write-path sketch after this list)
- Compaction – a merge-sort-like background operation
  - Few sequential writes (better than the random I/Os of the B-Tree case)
  - Trims duplicate data – minimal space amplification
- Example DB engines: RocksDB, WiredTiger, Cassandra, CouchBase, LevelDB
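A minimal write-path sketch of these pieces (WAL append, memtable insert, sorted flush). It is a toy model with a dict-backed memtable and in-memory lists standing in for SST files, not RocksDB's actual implementation:

```python
class ToyLSM:
    """Toy LSM write path: WAL append -> memtable -> sorted flush to level 0."""

    def __init__(self, memtable_limit=4):
        self.wal = []               # append-only journal (stand-in for the on-disk WAL)
        self.memtable = {}          # in-memory tree; a dict is enough for the sketch
        self.level0 = []            # flushed, sorted runs ("SST files")
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.wal.append((key, value))        # 1. durable, sequential append
        self.memtable[key] = value           # 2. cheap in-memory update
        if len(self.memtable) >= self.memtable_limit:
            self._flush()                    # 3. sorted, sequential flush

    def _flush(self):
        self.level0.append(sorted(self.memtable.items()))  # keys written in sorted order
        self.memtable.clear()
        self.wal.clear()             # flushed data no longer needs the WAL


db = ToyLSM()
for i in range(10):
    db.put(f"key{i:02d}", f"value{i}")
print(len(db.level0), "sorted runs flushed;", len(db.memtable), "keys still in the memtable")
```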

LSM Trees – Operation

[Figure: C0 (in-memory) tree merging into C1 … Cn trees on the persistent store; merges between levels move data in multi-page blocks]

- N-level merge tree
- Transforms random writes into sequential writes using the WAL and in-memory tables
- Optimized for insertions by buffering
- Key-value items in the store are kept sorted to support faster lookups

Design Choice: LSM Tree

- Writes – append-only constructs
  - No read-modify-write, no double writes
  - Reduced fragmentation, but sorting / merging across multiple levels causes write amplification
- Reads – expensive point, order-by, and range queries
  - Might trigger compaction
  - Might need to scan all levels – read amplification
  - Can be optimized with Bloom filters (RocksDB)
- Deletes – deferred via tombstones (see the sketch after this list)
  - Tombstones are scanned during queries
  - Tombstones don't disappear until compaction
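A sketch of the read and delete behaviour described above: lookups check the newest run first, a Bloom-filter-style membership test lets runs be skipped, and deletes are recorded as tombstones that mask older values until compaction removes them. Ordinary Python sets stand in for Bloom filters here, so there are no false positives; a real filter would occasionally let a useless run be read:

```python
TOMBSTONE = object()    # marker written by delete(); only compaction would remove it

class ToyLSMRead:
    def __init__(self):
        self.runs = []   # newest first; each run is (key_filter, {key: value})

    def add_run(self, kv):
        self.runs.insert(0, (set(kv), dict(kv)))   # the set plays the Bloom-filter role

    def delete(self, key):
        self.add_run({key: TOMBSTONE})             # a delete is just another write

    def get(self, key):
        for key_filter, data in self.runs:         # newest to oldest: up to one probe per level
            if key not in key_filter:              # "definitely not here" -> skip this run
                continue
            value = data[key]
            return None if value is TOMBSTONE else value
        return None


db = ToyLSMRead()
db.add_run({"a": 1, "b": 2})
db.delete("a")
print(db.get("a"), db.get("b"))   # -> None 2  ("a" is masked by its tombstone)
```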

LSM Trees: Trade-offs

- Read / Write / Space Amplification summary
  - Read amplification: 1 up to the number of levels
  - Write amplification: 1 + 1 + fan-out (a worked reading follows below)
- Most useful in applications with:
  - Tiered storage with varying price / performance points (RAM, Persistent Memory, NVMe SSDs, etc.)
  - Large datasets where inserts / deletes are more common than reads / searches
- Better suited for write-intensive workloads
- Better compression: page / block alignment overhead is small compared to the size of the persistent trees (SST files in RocksDB)
- Leveled LSMs have lower write and space amplification compared to B-Trees
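A worked reading of the write-amplification rule of thumb above: one write for the WAL, one for the memtable flush, plus roughly fan-out rewrites caused by compaction. The fan-out of 10 is an illustrative assumption, not a measured value:

```python
def lsm_write_amp(fanout: int) -> int:
    """Rule of thumb from above: 1 (WAL) + 1 (memtable flush) + fan-out (compaction rewrites)."""
    return 1 + 1 + fanout

# Assumed fan-out of 10 between adjacent levels:
print(lsm_write_amp(10))   # -> 12
# In a tree with several persistent levels, the fan-out term recurs roughly once per
# level that a key-value pair is eventually compacted through, so deeper trees pay more.
```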

LSM Tree Example: RocksDB, 1 billion rows

[Charts: RocksDB LSM Write and Read throughput (Ops/sec, log scale) over the 30 min run; x-axis: row count, 5000 ops per marker]

Read plot represents a single Reader. Ops completed: 1 Write thread 5.2 million, 16 Read threads 6.3 million.

[Charts: RocksDB LSM Write and Read latency (sec/op, log scale) over the 30 min run; x-axis: row count, 5000 ops per marker]

Test: single Writer, 16 Readers. Read plot represents a single Reader's latency.

Design Choice: Fractal Trees

- Merges features of B-Trees and LSM trees
- A Fractal Tree index has a buffer at each node, which allows insertions, deletions, and other changes to be stored in intermediate locations
- Fast writes; slower reads and updates
- Better than an LSM for range reads on a cold cache, about the same on a warm cache
- Commercialized in DBs by Percona (Tokutek)

Fractal Trees

[Figure: Fractal Tree – a B-Tree whose internal nodes each carry a buffer (shaded red); Insert(31, …) pushes the buffered entries 41, 43 down to the next level]

- Maintains a B-Tree in which each internal node contains a buffer
- To insert a data record, simply insert it into the buffer at the root of the tree instead of traversing the entire tree (as a B-Tree would); see the sketch after this list
- When the root buffer is full, the inserted records are moved down one level at a time until they reach the leaves, where they are stored in leaf nodes just as in a B-Tree
- Several leaf nodes are grouped together to create sequential I/O (~1–4MB)
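A toy sketch of the buffered-insert idea: new records land in the root's buffer and are pushed one level down only when that buffer fills, so many small inserts are batched into one larger transfer. The node layout and buffer size are illustrative assumptions, not TokuDB's actual format:

```python
class FTNode:
    """Toy fractal-tree node: B-Tree-style pivot plus a message buffer."""
    def __init__(self, children=None, pivot=None):
        self.buffer = []            # pending inserts destined for the subtree below
        self.children = children    # None means this node is a leaf
        self.pivot = pivot          # single pivot key, for a two-way split
        self.records = []           # leaf-level storage

BUFFER_LIMIT = 4                    # assumed buffer capacity (illustrative)

def insert(node, key, value):
    if node.children is None:                   # leaf: store like a B-Tree leaf
        node.records.append((key, value))
        return
    node.buffer.append((key, value))            # O(1): just park it in the buffer
    if len(node.buffer) >= BUFFER_LIMIT:        # buffer full: push everything one level down
        pending, node.buffer = node.buffer, []
        for k, v in pending:
            child = node.children[0] if k < node.pivot else node.children[1]
            insert(child, k, v)                 # may in turn fill the child's buffer


root = FTNode(children=[FTNode(), FTNode()], pivot="m")
for k in ["a", "q", "b", "z", "c"]:
    insert(root, k, k.upper())
print(len(root.buffer), "message(s) still buffered at the root")           # -> 1
print(len(root.children[0].records), "record(s) pushed to the left leaf")  # -> 2
```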

Fractal Trees: Trade-offs

- Read / Write / Space Amplification summary
  - Write amplification similar to a leveled LSM (RocksDB)
  - Better than B-Tree and LSM (in theory)
- Most useful in applications with:
  - Tiered storage with varying price / performance points (RAM, Persistent Memory, NVMe SSDs, etc.)
  - Large datasets where inserts / deletes are more common than reads / searches
- Better suited for write-intensive workloads
- Lower write and space amplification compared to B-Trees

Fractal Tree Example: TokuDB, 1 billion rows

[Charts: TokuDB Fractal Tree Write and Read throughput (Ops/sec, log scale) over the 30 min run; x-axis: row count, 5000 ops per marker]

Test: single Writer, 16 Readers. Read plot represents a single Reader.

[Charts: TokuDB Fractal Tree Write and Read latency (micros/op, log scale) over the 30 min run; x-axis: row count, 5000 ops per marker]

Test: single Writer, 16 Readers. Read plot represents a single Reader.

Outline

- KV Stores – What and Why
- Design Choices and Trade-offs
- Performance Comparisons
- Key Takeaways

Performance Test Configuration

Dataset:
- 200GB (200 million records)
- Dataset size > 3:1 of DRAM size
- Compression: Off
- Key_size: 16 Bytes
- Value_size: 1000 Bytes

System Config:
- CPU: Intel(R) Xeon(R) CPU E5-2618L v4, 2.20GHz, HT disabled, 20 cores
- Memory: 64GB

Non-Volatile Storage Config:
- Intel® P4500 SSD (4TB)

OS:
- Distro: Ubuntu 16.04.1
- Kernel: 4.12.4
- Arch: x86_64

Key-Value Stores:
- WiredTiger 3.0.0
- RocksDB 5.4
- TokuDB 4.6.119
- Cache size: 32GB

db_bench:
- Source: https://github.com/wiredtiger/leveldb.git
- Test Duration: 30 minutes

Tuning parameters:
- Linux Kernel 4.12.0
- Drop page cache after every test run
- XFS filesystem, agcount=32, mount with discard

Workloads:
- ReadWrite (16 Readers, 1 Writer)
- Overwrite (4 Writers)

KV Workload #0: Fill (Insert), 1 Writer – Space and Write Amplification

WiredTiger (B-Tree):
- Total App Writes: ~190 GB
- Disk Usage: 223 GB
- Space Amplification: 1.17
- Data Bytes Written (per iostat): 223 GB
- Write Amplification (per iostat): 1.07
- SSD Writes* (including NAND GC): 239 GB
- Write Amplification (including NAND GC): 1.3

RocksDB (Leveled LSM):
- Total App Writes: ~190 GB
- Disk Usage: 197 GB
- Space Amplification: 1.04
- Data Bytes Written (per iostat): 395 GB
- Write Amplification (per iostat): 1.11
- SSD Writes* (including NAND GC): 438 GB
- Write Amplification (including NAND GC): 2.3

TokuDB (Fractal-Tree):
- Total App Writes: ~190 GB
- Disk Usage: 196 GB
- Space Amplification: 1.03
- Data Bytes Written (per iostat): 605 GB
- Write Amplification (per iostat): 1.14
- SSD Writes* (including NAND GC): 688 GB
- Write Amplification (including NAND GC): 3.6

Space Amplification is the amount of disk space consumed relative to the total size of the KV write operations.
Write Amplification (per iostat) is the amount of data written relative to the total size of the KV operations.
Write Amplification (including NAND GC) is the work done by the storage media relative to the total size of the KV operations.

* - SSD Device Writes obtained with nvme-cli from SMART stats for Intel® P4500
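The amplification figures above are simple ratios against the ~190 GB issued by the application; a quick check using the RocksDB row (the values below are copied from the table):

```python
app_writes_gb = 190      # ~190 GB written by the application during the fill

# RocksDB (Leveled LSM) row from the table above.
disk_usage_gb = 197      # filesystem space consumed after the fill
ssd_writes_gb = 438      # device-level writes from SMART, including NAND GC

space_amp = disk_usage_gb / app_writes_gb        # disk space vs. application writes
write_amp_nand = ssd_writes_gb / app_writes_gb   # media work vs. application writes
print(round(space_amp, 2), round(write_amp_nand, 1))   # -> 1.04 2.3
```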

KV Workload #1: Readwhilewriting – 1 Writer, 16 Readers

ReadWrite: KV Store Write Throughput (Ops/sec)
- WiredTiger (B-Tree): 652
- RocksDB: 4580
- TokuDB: 900

ReadWrite: KV Store Read Throughput (Ops/sec)
- WiredTiger (B-Tree): 6894
- RocksDB: 2444
- TokuDB: 1966

[Chart: 99.99th-percentile Write Latency (s) for WiredTiger (B-Tree), RocksDB, and TokuDB]

KV Workload #2: Overwrite – 4 Writers

[Chart: OverWrite KV Store Write Throughput (Ops/sec) – bars for WiredTiger (B-Tree), RocksDB, and TokuDB, with labeled values 21602, 20793, and 4204]

[Chart: 99.99th-percentile Write Latency (s) for WiredTiger (B-Tree), RocksDB, and TokuDB]

Outline

- KV Stores – What and Why
- Design Choices and Trade-offs
- Performance on Non-Volatile Media
- Key Takeaways

Key Takeaways

- Key-value stores are important for unstructured data
- The choice of an SSD-backed key-value store is workload-dependent
  - The performance vs. space / write amplification trade-off is the key decision factor
- Traditional B-Tree implementations
  - Great for read-intensive workloads
  - Better write amplification compared to the alternatives
  - Poor space amplification
- LSM and Fractal Trees
  - Well-suited for write-intensive workloads
  - Better space amplification compared to B-Trees

Thank you!

Comments / Questions?

Vishal Verma ([email protected])
