Pelikan with AEP
Yao Yue (@thinkingfish), Twitter, Inc.

1 Pelikan: an Open-source, Modular Cache

2 Cache @ Twitter

Clusters: >400 in prod (single-tenant)
QPS: max 50M (single cluster)
Hosts: many thousands
SLO: p999 < 5ms*
Instances: tens of thousands
Protocol: Memcached, RESP, Thrift, …
Job size: 2-6 cores, 4-48 GiB
Data structure: simple KV, counter, list, hash, sorted map, …

3 A Modular Architecture

[Architecture diagram] Each service (pelikan_rds, pelikan_twemcache, pelikan_slimcache) is assembled from shared modules: protocols (RESP, Memcache, admin), data structures (SArray, List, ziplist, bitmap, …), and storage engines (slab, cuckoo). A request is read into rbuf, parsed into a request object, processed against the datastore, and the reply is composed into wbuf (parse → process → compose), all on top of a high-performance RPC server. The datastore-facing entry point is:

process_request(struct response **, struct request *);
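The entry point quoted above, process_request(struct response **, struct request *), is where the protocol layer hands a parsed request to the data structure and storage modules. The sketch below shows how a worker loop could drive parse → process → compose over a connection's buffers; apart from that one signature, every type, field, and function name here is illustrative rather than Pelikan's actual code.

    /* Illustrative parse -> process -> compose pipeline; only the
     * process_request() signature comes from the slide, the rest is a
     * stand-in for Pelikan's real types and helpers. */
    #include <stddef.h>

    struct request  { int type; const char *key; };          /* hypothetical fields */
    struct response { int status; const char *data; };       /* hypothetical fields */
    struct buf      { char *rpos; char *wpos; char *end; };  /* rbuf / wbuf */

    int parse_req(struct request *req, struct buf *rbuf);            /* protocol: bytes -> request */
    int process_request(struct response **rsp, struct request *req); /* data structure + storage */
    int compose_rsp(struct buf *wbuf, struct response *rsp);         /* protocol: response -> bytes */

    /* one pass over a connection's buffers, roughly what a worker thread does */
    static int
    serve_once(struct buf *rbuf, struct buf *wbuf)
    {
        struct request req;
        struct response *rsp = NULL;

        if (parse_req(&req, rbuf) != 0) {
            return -1;                  /* incomplete or malformed input */
        }
        if (process_request(&rsp, &req) != 0) {
            return -1;                  /* storage / data structure error */
        }

        return compose_rsp(wbuf, rsp);  /* write the reply into wbuf */
    }

Because each stage only depends on module interfaces, swapping the protocol (Memcache vs. RESP) or the storage engine (slab vs. cuckoo) does not change the loop itself, which is what makes the services composable.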

4 Pelikan + Intel® Optane™ DC Persistent Memory

5 Intel® Optane™ DC Persistent Memory

Memory Mode
· Application sees a single volatile memory pool
· DRAM acts as a cache in front of Optane persistent memory
· Affordable, large volatile memory capacity
· No code changes

App Direct
· Application addresses DRAM and Optane persistent memory separately
· Large capacity persistent memory

6 Pelikan with Apache Pass (AEP)

Motivation
· Cache more data per instance
  ➔ Reduce TCO if memory-bound
  ➔ Improve hit rate
· Persistent data => warmer cache
  ➔ Improve operations w/ graceful shutdown and faster rebuild
  ➔ Higher availability during maintenance

Constraints
· Maintainable changes
  ➔ Same codebase
  ➔ Non-invasive, retain high-level abstractions
· Operability
  ➔ Flexible invocation
  ➔ Predictable performance

7 Data Pool Abstraction with PMDK

[Diagram] The datastore modules (slab, cuckoo) allocate their memory from a Datapool abstraction rather than calling the allocator directly. The Datapool has two backends:
· DRAM, via “cc_alloc” (malloc)
· Persistent memory, as a file-backed pool with persistence provided by libpmem (PMDK)
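Below is a minimal sketch of such a data pool with the two backends, assuming an illustrative datapool_open/datapool_close interface (not Pelikan's actual API). The only PMDK calls used, pmem_map_file and pmem_unmap, are real libpmem functions; everything else is a stand-in.

    /* Two-backend data pool sketch: the datastore only ever sees addr/size,
     * whether the pool is a malloc'd DRAM region or a file-backed
     * persistent-memory mapping created with libpmem. */
    #include <stdlib.h>
    #include <stddef.h>
    #include <libpmem.h>

    struct datapool {
        void   *addr;         /* base address handed to slab/cuckoo storage */
        size_t  size;         /* usable length */
        int     file_backed;  /* created with pmem_map_file() vs. malloc() */
        int     is_pmem;      /* mapping is real persistent memory */
    };

    /* path == NULL selects the DRAM backend; otherwise a file-backed pool */
    static int
    datapool_open(struct datapool *dp, const char *path, size_t size)
    {
        if (path == NULL) {
            dp->addr = malloc(size);
            dp->size = size;
            dp->file_backed = 0;
            dp->is_pmem = 0;
            return dp->addr != NULL ? 0 : -1;
        }

        size_t mapped_len;
        int is_pmem;

        dp->addr = pmem_map_file(path, size, PMEM_FILE_CREATE, 0600,
                                 &mapped_len, &is_pmem);
        if (dp->addr == NULL) {
            return -1;
        }
        dp->size = mapped_len;
        dp->file_backed = 1;
        dp->is_pmem = is_pmem;
        return 0;
    }

    static void
    datapool_close(struct datapool *dp)
    {
        if (dp->file_backed) {
            pmem_unmap(dp->addr, dp->size);  /* handles pmem and mmap'd files */
        } else {
            free(dp->addr);
        }
    }

Because slab and cuckoo only ever see addr and size, the same storage code path serves both backends, which is what keeps the change in one codebase and non-invasive; writes that must be durable would additionally go through pmem_persist() when is_pmem is set.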

8 Durable Storage with DRAM Compatibility

[Diagram] Cuckoo buckets and slabs are laid out the same way whether the backing datapool lives in PMEM or DRAM, so the same storage code runs on both and data written to PMEM remains usable across restarts.
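One way to get this kind of DRAM/PMEM compatibility is to address slabs and items by offset from the pool base rather than by raw pointer, so the layout stays valid even if the pool maps at a different address after a restart. The sketch below only illustrates that idea; the names and the offset scheme are assumptions, not necessarily Pelikan's exact layout.

    /* Offset-based addressing sketch: slab ids and item offsets are stored in
     * the pool, pointers are recomputed from the current pool base. */
    #include <stdint.h>
    #include <stddef.h>

    #define SLAB_SIZE (1024 * 1024)   /* hypothetical 1 MiB slabs */

    struct slab_header {
        uint32_t id;         /* slab index within the pool */
        uint32_t item_size;  /* size class served by this slab */
        uint32_t nalloc;     /* items handed out so far */
    };

    /* slab id -> address inside the (DRAM or PMEM) pool */
    static inline struct slab_header *
    slab_at(void *pool_base, uint32_t id)
    {
        return (struct slab_header *)((uint8_t *)pool_base + (size_t)id * SLAB_SIZE);
    }

    /* item offset recorded in the pool -> pointer valid for this mapping */
    static inline void *
    item_at(void *pool_base, uint64_t offset)
    {
        return (uint8_t *)pool_base + offset;
    }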

9 Results

10 Benchmark Overview

Core Parameters
· Instance density: 18-30 instances / host common
· Object size: between 64 and 2048 bytes, step x2
· Dataset size: between 4 GiB and 32 GiB / instance, step x2
· # of connections per server: 100 / 1000
· R/W ratio: read-only, 90/10, 80/20
· Twemcache-only for this presentation

Focus
➔ Serving performance (vs. DRAM)
  · Perf scalability with different dataset size
  · App direct mode
➔ Rebuild performance
  · App direct vs memory mode
  · Lab vs data center
➔ Bottleneck analysis

11 Stage 1: Serving Performance (memory mode) (aka “does this work at all?”)

Hardware Config (Intel lab)
· 2 X Intel Xeon 8160 (24 cores)
· 12 X 32GB DIMM
· 12 X 128GB AEP
· 2-2-2 config
· 1 X 25Gb NIC
· CentOS 7

Test Config
· 30 instances per node
· key size is 32 bytes
· connection count is 100
· NUMA-aware
· 90 R / 10 W

[Chart] Throughput (QPS, 0-3M) and p999 latency (us, 0-1600) vs. value size (64-2048 bytes), one curve per dataset size: 4, 8, 16, and 32 GiB per instance.

12 Stage 2: Serving Performance (app direct mode)

Hardware Config (Intel lab)
· 2 X Intel Xeon 8260 (24 cores)
· 12 X 32GB DIMM
· 12 X 128GB AEP
· 2-2-2 config
· 1 X 25Gb NIC
· CentOS 7

Test Config
· 24 instances per node
· key size is 32 bytes
· connection count is 100
· NUMA-aware

[Chart] Throughput (QPS, 0-2M) and p999 latency (us, 0-200) vs. value size (64-1024 bytes), one curve per dataset size: 4, 8, 16, and 32 GiB per instance.

13 Stage 2: Recovery Performance

Status Quo
· Data availability
  ➔ No redundancy in cache by default
  ➔ Some clusters are mirrored
· Backfill
  ➔ Mostly rely on organic traffic
  ➔ “Bootstrapper” bounded by QPS
  ➔ Full warmup takes from minutes to days
· Constraints on maintenance
  ➔ 20 minute restart interval by default
  ➔ Large clusters take days to restart

Rebuild from AEP
· Single instance
  ➔ 100 GiB of slab data
  ➔ Complete rebuild: 4 minutes
· Concurrent
  ➔ 18 instances per host
  ➔ Complete rebuild: 5 minutes
· Potential impact
  ➔ Speed up maintenance by 1-2 orders of magnitude (often needs other changes)
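The rebuild numbers above rely on the slab data surviving in the persistent pool, so that on restart only the volatile index has to be reconstructed by scanning the slabs. The sketch below shows what such a scan-and-reinsert rebuild could look like; all helper functions and types are assumptions for illustration, not Pelikan's actual recovery code.

    /* Index rebuild sketch: walk every slab in the recovered pool and
     * re-insert live items into a fresh DRAM hash table. */
    #include <stdint.h>
    #include <stddef.h>

    struct item;       /* item stored inside a slab */
    struct hashtable;  /* volatile index rebuilt in DRAM */

    /* assumed helpers: iterate items within a slab, validity check, insert */
    struct item *slab_first_item(void *pool_base, uint32_t slab_id);
    struct item *slab_next_item(struct item *it);
    int item_is_valid(const struct item *it);       /* skip free/expired slots */
    void hashtable_put(struct hashtable *ht, struct item *it);

    static void
    rebuild_index(struct hashtable *ht, void *pool_base, uint32_t nslab)
    {
        for (uint32_t id = 0; id < nslab; id++) {
            for (struct item *it = slab_first_item(pool_base, id);
                 it != NULL;
                 it = slab_next_item(it)) {
                if (item_is_valid(it)) {
                    hashtable_put(ht, it);  /* repopulate the volatile index */
                }
            }
        }
    }

The rebuild is a sequential scan over local persistent memory rather than a network backfill, which is why a 100 GiB instance can come back in minutes instead of relying on a traffic-driven warmup that can take days.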

14 Stage 3: Testing In-house (memory mode)

Hardware Config (Twitter DC)
· 2 X Intel Xeon 6222 (20 cores)
· 12 X 16GB DIMM
· 4 X 512GB AEP
· 2-1-1 config
· 1 X 25Gb NIC
· CentOS 7

Test Config
· 20 instances per node
· key size is 64 bytes
· value size is 256 bytes
· connection count is 1000
· NUMA-aware
· read-only

Dataset sizes: 10M keys / 4 GiB, 20M / 9 GiB, 40M / 18 GiB, 80M / 37 GiB, 160M / 74 GiB

[Chart] Latency (us, log scale, 100 to 100k) by percentile (p25 to p9999), one curve per dataset size, against the SLO of p999 < 5ms.

Results: p999 max = 16ms, p9999 max = 148ms, throughput 1.08M QPS

15 Stage 3: Testing In-house (app direct mode)

Hardware Config (Twitter DC)
· 2 X Intel Xeon 6222 (20 cores)
· 12 X 16GB DIMM
· 4 X 512GB AEP
· 2-1-1 config
· 1 X 25Gb NIC
· CentOS 7

Test Config
· 20 instances per node
· key size is 64 bytes
· value size is 256 bytes
· connection count is 1000
· NUMA-aware
· read-only

Dataset sizes: 10M keys / 4 GiB, 20M / 9 GiB, 40M / 18 GiB, 80M / 37 GiB, 160M / 74 GiB

[Chart] Latency (us, log scale, 100 to 100k) by percentile (p25 to p9999), one curve per dataset size, against the SLO of p999 < 5ms.

Results: p999 max = 1.4ms, p9999 max = 2.5ms, throughput 1.08M QPS

16 Conclusion / Next Step

Conclusion

App direct mode
● Changes were modest
● Can serve all data structures
● Serving performance comparable to DRAM for tested Twitter workloads
● Recovery performance was good

Memory mode
● Fully-loaded config performs like DRAM
● Less scalable w/ wimpier config

Bottleneck
● Network is still primary

Next Step

Network
● Testing in-house with ADQ

Production canary
● Will we see the same performance?
● How does larger heap affect hit rate?

Performance
● Scaling with connection counts
● Profiling, especially for memory mode
● Tuning data structure/storage design
● Testing AEP with pelikan_rds

17 Further Read / Contributors

Further Read

Pelikan
● Redis at Scale
● Caching with Twemcache
● Why Pelikan
● Pelikan Github

Cache w/ AEP
● Redis-pmem
● Memcached with pmem

Contributors

Thank you Intel Team! Ali Alavi (@TheAliAlavi), Andy Rudoff (@andyrudoff), Jakub Schmiegel, Jason Harper, Michal Biesek, Mauricio Cuervo (@mauriciocuervo), Piotr Balcer, Usha Upadhyayula

Thank you Twitter Team! Brian Martin (@brayniac), Kevin Yang (@kevjyang), Matt Silver (@msilver)

#collaborate
