Workload Analysis of Key-Value Stores on Non-Volatile Media

Vishal Verma (Performance Engineer, Intel)
Tushar Gohad (Cloud Software Architect, Intel)
2017 Storage Developer Conference

Outline
- KV Stores – What and Why
- Data Structures for KV Stores
- Design Choices and Trade-offs
- Performance on Non-Volatile Media
- Key Takeaways
What are KV Stores
- Type of NoSQL database that stores data as simple key/value pairs
- Alternative to the limitations of traditional relational databases (DBs), where data is structured and the schema pre-defined
  - Mismatch with today's workloads: data growth is large and unstructured, with lots of random writes and reads
- NoSQL brings flexibility, as the application has complete control over what is stored inside the value
- The key in a key-value pair must (or at least should) be unique
- Values are identified via a key; stored values can be numbers, strings, images, videos, etc.
- API operations: get(key) to read data, put(key, value) to write data, and delete(key) to delete keys
- Phone directory example:
  Key      Value
  Bob      (123) 456-7890
  Kyle     (245) 675-8888
  Richard  (787) 122-2212
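The get/put/delete API above can be sketched in a few lines. This is a minimal illustrative in-memory store (the `KVStore` class is hypothetical, not the API of any store discussed here), shown with the phone-directory data:

```python
# Minimal sketch of the KV API described above: get/put/delete on unique keys.
# Illustrative only -- real stores persist values to non-volatile media.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value      # keys are unique: last write wins

    def get(self, key):
        return self._data.get(key)   # None if the key is absent

    def delete(self, key):
        self._data.pop(key, None)

# Phone-directory example from the slide
db = KVStore()
db.put("Bob", "(123) 456-7890")
db.put("Kyle", "(245) 675-8888")
print(db.get("Bob"))                 # -> (123) 456-7890
db.delete("Bob")
print(db.get("Bob"))                 # -> None
```

Because the application controls what goes inside the value, the same API serves numbers, strings, or serialized images and videos.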
KV Stores: Benefits
- High performance: fast location of an object, rather than searching through the columns or tables of a traditional relational DB
- Highly scalable: can scale across several machines or devices by several orders of magnitude, without significant redesign
- Flexible: no structure is enforced on the data
- Low TCO: simplifies adding or removing capacity as needed; a hardware or network failure does not create downtime
Design Choice: B-Tree
[Figure: B-tree with pivot keys (19; 7, 13; 37, 53) in internal nodes and sorted data records in leaf nodes]

- Internal nodes (green) hold pivot keys; leaf nodes (blue) hold data records (KV)
- Query time is proportional to tree height: logarithmic time
- Insertion / deletion: many random I/Os to disk; may incur rebalancing (read and write amplification)
- Full leaf nodes may be split (space fragmentation)
B-Tree: Trade-offs
- Example DB engines: BerkeleyDB, MySQL InnoDB, MongoDB WiredTiger
- Good design choice for read-intensive workloads: trades increased read / write / space amplification for read performance
- Nodes of B-trees are on-disk blocks; block-aligned IO means space amplification
- Compression is not effective: with a 16KB block size, 10KB of data compressed to 5KB still occupies 16KB
- Byte-sized updates also end up as page-sized reads / writes
Performance Test Configuration
Dataset:
- 500GB - 1TB (500 million - 1 billion records); dataset size > 3:1 of DRAM size
- Key_size: 16 bytes, Value_size: 1000 bytes
- Compression off; cache size: 32GB

System config:
- CPU: Intel(R) Xeon(R) E5-2618L v4, 2.20GHz, HT disabled, 20 cores
- Memory: 64GB

OS config:
- Distro: Ubuntu 16.04.1; Kernel: 4.12.4; Arch: x86_64

Non-volatile storage:
- Intel(R) P4500 SSD (4TB)

Key-value stores:
- WiredTiger 3.0.0, RocksDB 5.4, TokuDB 4.6.119
- db_bench source: https://github.com/wiredtiger/leveldb.git
- Test duration: 30 minutes

Tuning parameters (Linux kernel 4.12):
- Drop page cache after every test run
- XFS filesystem, agcount=32, mounted with discard

Workloads:
- ReadWrite (16 readers, 1 writer)
Readwrite: WiredTiger B-Tree, 500 million rows
[Charts: WiredTiger B-Tree write and read throughput (Ops/sec, log scale) over the 30-minute run; x-axis: row count, 5000 ops per marker]
Read plot represents a single reader. Ops completed: 2.15 million by the 1 write thread, 27.36 million by the 16 read threads.
Readwrite: WiredTiger B-Tree, 500 million rows (latency)
[Charts: WiredTiger B-Tree write and read latency (sec/op, log scale) over the 30-minute run; x-axis: row count, 5000 ops per marker]
Test: single writer, 16 readers. Read plot represents a single reader's latency.
Design Choice: LSM Tree
- Log-Structured Merge-tree: two or more tree-like components
  - In-memory tree (RocksDB: memtable)
  - One or more trees on the persistent store (RocksDB: SST files)
- Transforms random writes into a few sequential writes
  - Write-Ahead Log (WAL): append-only journal
  - In-memory store: inexpensive writes to memory as the first-level write, flushed sequentially to the first level on the persistent store
  - Compaction: merge-sort-like background operation
    - Few sequential writes (better than the random IOs of the B-tree case)
    - Trims duplicate data: minimal space amplification
- Example DB engines: RocksDB, WiredTiger, Cassandra, CouchBase, LevelDB
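The write path above (WAL append, memtable insert, sequential flush of a sorted run) can be sketched in miniature. This `TinyLSM` class is an illustrative toy, not RocksDB's implementation; the WAL and runs are lists here, standing in for on-disk files:

```python
# Toy sketch of the LSM write path: writes hit an append-only WAL and an
# in-memory memtable; a full memtable is flushed as one sorted run
# (one sequential write instead of many random I/Os).
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.wal = []            # append-only journal (a file in a real store)
        self.memtable = {}       # in-memory tree (dict for simplicity)
        self.runs = []           # sorted runs = first level on persistent store

        self.limit = memtable_limit

    def put(self, key, value):
        self.wal.append((key, value))     # sequential append, for crash safety
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        self.runs.append(sorted(self.memtable.items()))
        self.memtable.clear()
        self.wal.clear()         # flushed data no longer needs the journal

    def get(self, key):
        if key in self.memtable:          # newest data first
            return self.memtable[key]
        for run in reversed(self.runs):   # then newest run to oldest
            for k, v in run:
                if k == key:
                    return v
        return None
```

Note how `get` must consult the memtable and then every run, newest first; that per-level scan is exactly the read amplification discussed on the next slides.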
LSM Trees – Operation
[Figure: LSM tree levels, C0 (in-memory tree) through Cn (trees on the persistent store), merged level-to-level in multi-page blocks]

- N-level merge tree
- Transforms random writes into sequential writes using the WAL and in-memory tables
- Optimized for insertions by buffering
- Key-value items in the store are sorted to support faster lookups
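The level-to-level merge in the figure is essentially a merge sort over sorted runs that keeps only the newest version of each key. A minimal sketch (illustrative, assuming runs are passed newest-first):

```python
# Compaction sketch: merge sorted runs, dropping duplicate keys.
# Each run is a list of (key, value) pairs sorted by key; runs are
# ordered newest-first, so the first occurrence of a key wins.
import heapq

def compact(runs):
    merged, seen = [], set()
    # heapq.merge is stable: for equal keys, earlier (newer) runs come first
    for key, value in heapq.merge(*runs, key=lambda kv: kv[0]):
        if key not in seen:
            seen.add(key)
            merged.append((key, value))
    return merged
```

The output is a single sorted run read and written sequentially, which is why compaction costs a few sequential I/Os rather than the random I/Os of in-place B-tree updates, and why duplicates (and hence space amplification) get trimmed.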
Design Choice: LSM Tree
- Writes: append-only constructs
  - No read-modify-write, no double write
  - Reduced fragmentation, but sorts / merges across multiple levels cause write amplification
- Reads: expensive point, order-by and range queries
  - Might call for compaction; might need to scan all levels (read amplification)
  - Can be optimized with Bloom filters (RocksDB)
- Deletes: deferred via tombstones
  - Tombstones are scanned during queries
  - Tombstones don't disappear until compaction
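The Bloom-filter optimization mentioned above works because a filter can say "key definitely absent" and let a point lookup skip a run entirely. A minimal sketch, with illustrative bit-array size and hash count (not RocksDB's actual filter format):

```python
# Minimal Bloom filter sketch: k hash positions per key over a bit array.
# A lookup that finds any unset position knows the key is definitely
# absent (no false negatives); all-set positions mean "maybe present".
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0                    # Python int as a bit array

    def _positions(self, key):
        # derive k independent positions by salting the hash input
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # -> True (never a false negative)
```

With one filter per SST file, a point query only touches the files whose filters answer "maybe", cutting read amplification from "all levels" to the few levels that plausibly hold the key.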
LSM Trees: Trade-offs
- Read / write / space amplification summary
  - Read amp: 1 up to the number of levels
  - Write amp: 1 + 1 + fan-out
- Most useful in applications with:
  - Tiered storage with varying price / performance points (RAM, persistent memory, NVMe SSDs, etc.)
  - Large datasets where inserts / deletes are more common than reads / searches
- Better suited for write-intensive workloads
- Better compression: page/block alignment overhead is small compared to the size of the persistent trees (SST files in RocksDB)
- Leveled LSMs have lower write and space amplification compared to B-trees
LSM Tree Example: RocksDB, 1 billion rows
[Charts: RocksDB LSM write and read throughput (Ops/sec, log scale) over the 30-minute run; x-axis: row count, 5000 ops per marker]
Read plot represents a single reader. Ops completed: 5.2 million by the 1 write thread, 6.3 million by the 16 read threads.
LSM Tree Example: RocksDB, 1 billion rows (latency)
[Charts: RocksDB LSM write and read latency (sec/op, log scale) over the 30-minute run; x-axis: row count, 5000 ops per marker]
Test: single writer, 16 readers. Read plot represents a single reader's latency.
Design Choice: Fractal Trees
- Merges features of B-trees with LSM trees
- A fractal tree index has buffers at each node, which allow insertions, deletions and other changes to be stored in intermediate locations
- Fast writes; slower reads and updates
- Better than LSM for range reads on a cold cache, but about the same on a warm cache
- Commercialized in DBs by Percona (Tokutek)
Fractal Trees
[Figure: fractal tree; each internal node carries a buffer (shaded red), e.g. insert(31, ...) fills the root buffer and moves the buffered records 41, 43 into the next level]

- Maintains a B-tree in which each internal node contains a buffer
- To insert a data record, simply insert it into the buffer at the root of the tree, vs. traversing the entire tree (B-tree)
- When the root buffer is full, inserted records move down one level until they reach the leaves and are stored in a leaf node, as in a B-tree
- Several leaf nodes are grouped to create sequential IO (~1-4MB)
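The buffered-insert idea can be sketched as follows. This is an illustrative toy, not TokuDB's on-disk format: inserts land in the root buffer in O(1), and a full buffer is routed one level down through the B-tree pivots (the buffer limit of 4 is an arbitrary small number for the example):

```python
# Fractal-tree sketch: a B-tree whose internal nodes buffer pending
# (key, value) messages; buffers are pushed down a level only when full.
import bisect

class Node:
    def __init__(self, pivots=None, children=None):
        self.pivots = pivots or []       # routing keys (internal nodes)
        self.children = children or []   # empty list => leaf node
        self.buffer = []                 # pending (key, value) messages
        self.records = []                # data records (leaves only)

def insert(root, key, value, limit=4):
    root.buffer.append((key, value))     # O(1): no root-to-leaf traversal
    if len(root.buffer) >= limit:
        flush(root, limit)

def flush(node, limit):
    if not node.children:                # leaf: store records, B-tree style
        node.records = sorted(node.records + node.buffer)
    else:
        for k, v in node.buffer:         # route each message one level down
            child = node.children[bisect.bisect_right(node.pivots, k)]
            child.buffer.append((k, v))
        for child in node.children:
            if len(child.buffer) >= limit:
                flush(child, limit)
    node.buffer = []

# Mirroring the figure: keys below the pivot route to the left leaf
leaf_a, leaf_b = Node(), Node()
root = Node(pivots=[50], children=[leaf_a, leaf_b])
for k in (41, 43, 7, 31):                # fourth insert fills the root buffer
    insert(root, k, None)
```

Because a flush moves a whole batch of messages at once, the random-write cost of a B-tree insert is amortized across many records, which is where the write-optimized behavior comes from.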
Fractal Trees: Trade-offs
- Read / write / space amplification summary
  - Write amplification similar to a leveled LSM (RocksDB); better than B-tree and LSM (in theory)
- Most useful in applications with:
  - Tiered storage with varying price / performance points (RAM, persistent memory, NVMe SSDs, etc.)
  - Large datasets where inserts / deletes are more common than reads / searches
- Better suited for write-intensive workloads
- Lower write and space amplification compared to B-trees
Fractal Tree Example: TokuDB, 1 billion rows
[Charts: TokuDB fractal-tree write and read throughput (Ops/sec, log scale) over the 30-minute run; x-axis: row count, 5000 ops per marker]
Test: single writer, 16 readers. Read plot represents a single reader.
Fractal Tree Example: TokuDB, 1 billion rows (latency)
[Charts: TokuDB fractal-tree write and read latency (micros/op, log scale) over the 30-minute run; x-axis: row count, 5000 ops per marker]
Test: single writer, 16 readers. Read plot represents a single reader.
Performance Test Configuration
Dataset:
- 200GB (200 million records); dataset size > 3:1 of DRAM size
- Key_size: 16 bytes, Value_size: 1000 bytes
- Compression off; cache size: 32GB

System config:
- CPU: Intel(R) Xeon(R) E5-2618L v4, 2.20GHz, HT disabled, 20 cores
- Memory: 64GB

OS config:
- Distro: Ubuntu 16.04.1; Kernel: 4.12.4; Arch: x86_64

Non-volatile storage:
- Intel(R) P4500 SSD (4TB)

Key-value stores:
- WiredTiger 3.0.0, RocksDB 5.4, TokuDB 4.6.119
- db_bench source: https://github.com/wiredtiger/leveldb.git
- Test duration: 30 minutes

Tuning parameters (Linux kernel 4.12):
- Drop page cache after every test run
- XFS filesystem, agcount=32, mounted with discard

Workloads:
- ReadWrite (16 readers, 1 writer)
- Overwrite (4 writers)
KV Workload#0: Fill (Insert), 1 Writer, Space and Write Amplification
  Store                  App Writes  Disk Usage  Space          Data Written      Write Amp     SSD Writes* incl.  Write Amp
                         (GB)        (GB)        Amplification  (per iostat, GB)  (per iostat)  NAND GC (GB)       (incl. NAND GC)
  WiredTiger (B-Tree)    ~190        223         1.17           223               1.07          239                1.3
  RocksDB (Leveled LSM)  ~190        197         1.04           395               1.11          438                2.3
  TokuDB (Fractal-Tree)  ~190        196         1.03           605               1.14          688                3.6
- Space amplification is the amount of disk space consumed relative to the total size of the KV write operations
- Write amplification (per iostat) is the amount of data written relative to the total size of the KV operations
- Write amplification (including NAND GC) is the work done by the storage media relative to the total size of the KV operations
* SSD device writes obtained with nvme-cli from SMART stats for the Intel(R) P4500
KV Workload#1: Readwhilewriting, 1 Writer, 16 Readers
ReadWrite: KV store throughput (Ops/sec):

  Store                 Write   Read
  WiredTiger (B-Tree)   652     6894
  RocksDB               4580    2444
  TokuDB                900     1966

[Chart: 99.99th-percentile write latency (seconds) per store, axis 0-35 s; per-store values not recoverable]
KV Workload#2: Overwrite, 4 Writers
Overwrite: KV store write throughput (Ops/sec):

  Store                 Write
  WiredTiger (B-Tree)   4204
  RocksDB               21602
  TokuDB                20793

[Chart: 99.99th-percentile write latency (seconds) per store, axis 0-140 s; per-store values not recoverable]
Key Takeaways
- Key-value stores are important for unstructured data
- Choice of an SSD-backed key-value store is workload-dependent
  - The performance vs. space / write amplification trade-off is the key decision factor
- Traditional B-tree implementations
  - Great for read-intensive workloads
  - Better write amplification compared to alternatives; poor space amplification
- LSM and fractal trees
  - Well-suited for write-intensive workloads
  - Better space amplification compared to B-trees
Thank you!
Comments / Questions?
Vishal Verma ([email protected])