Pelikan with AEP
Yao Yue (@thinkingfish), Twitter, Inc.
Slide 1: Pelikan: an Open-source, Modular Cache
Slide 2: Cache @ Twitter
· Clusters: >400 in prod (single-tenant)
· Hosts: many thousands
· Instances: tens of thousands
· Job size: 2-6 cores, 4-48 GiB
· QPS: max 50M (single cluster)
· SLO: p999 < 5ms*
· Protocols: Memcached, Redis/RESP, Thrift, …
· Data structures: simple KV, counter, list, hash, sorted map, …
3 A Modular Architecture
Rds Twemcache Slimcache rbuf request response wbuf RESP ... Memcache Memcache SArray,List,.. Slab Cuckoo Slab services parse process compose
ziplist sarray memcache admin bitmap ... Data Data data structure RESP ... Structure Store
slab cuckoo datastore protocols process_request(struct response **, struct request *); High-performance rpc server
Slide 4: Pelikan with Intel® Optane™ DC Persistent Memory

Slide 5: Intel® Optane™ DC Persistent Memory
· Memory Mode: the application sees one large volatile memory pool; DRAM acts as a cache in front of the Optane persistent memory. Affordable, large volatile memory capacity, with no code changes.
· App Direct: the application addresses DRAM and Optane persistent memory separately, gaining large-capacity persistent memory.
Slide 6: Pelikan with Apache Pass (AEP)
Motivation
· Cache more data per instance
  ➔ Reduce TCO if memory-bound
  ➔ Improve hit rate
· Persistent data => warmer cache
  ➔ Improve operations w/ graceful shutdown and faster rebuild
  ➔ Higher availability during maintenance
Constraints
· Maintainable changes
  ➔ Same codebase
  ➔ Non-invasive, retain high-level APIs
· Operability
  ➔ Flexible invocation
· Predictable performance
Slide 7: Data Pool Abstraction with PMDK
Before: the slab and cuckoo datastores allocate directly from DRAM via cc_alloc (malloc).
After: both datastores sit on a datapool abstraction, backed either by DRAM (cc_alloc/malloc) or by file-backed persistent memory via libpmem (PMDK).
Slide 8: Durable Storage with DRAM Compatibility
[Diagram: the cuckoo bucket hash table and the slab storage keep the same in-memory layout whether they are placed in PMEM or in DRAM, so persisted data remains directly usable.]
Slide 9: Results
Slide 10: Benchmark Overview
Core parameters
· Instance density: 18-30 instances / host (common)
· Object size: between 64 and 2048 bytes, step ×2
· Dataset size: between 4 GiB and 32 GiB / instance, step ×2
· Connections per server: 100 / 1000
· R/W ratio: read-only, 90/10, 80/20
· Twemcache-only for this presentation
Focus
· Serving performance (vs. DRAM): perf scalability with different dataset size; app direct mode
· Rebuild performance: app direct vs memory mode
· Lab vs data center
· Bottleneck analysis

Slide 11: Stage 1: Serving Performance (memory mode) (aka “does this work at all?”)
Hardware config (Intel lab)
· 2 X Intel Xeon 8160 (24 cores)
· 12 X 32GB DIMM
· 12 X 128GB AEP
· 2-2-2 config
· 1 X 25Gb NIC
· CentOS 7
Test config
· 30 instances per node
· key size is 32 bytes
· connection count is 100
· NUMA-aware
· 90% read / 10% write
[Chart: throughput (QPS, axis to 3M) and p999 latency (µs, axis to 1600) vs. value size from 64 to 2048 bytes, for 4, 8, 16, and 32 GiB datasets per instance.]

Slide 12: Stage 2: Serving Performance (app direct mode)
Hardware config (Intel lab)
· 2 X Intel Xeon 8260 (24 cores)
· 12 X 32GB DIMM
· 12 X 128GB AEP
· 2-2-2 config
· 1 X 25Gb NIC
· CentOS 7
Test config
· 24 instances per node
· key size is 32 bytes
· connection count is 100
· NUMA-aware
[Chart: throughput (QPS, axis to 2M) and p999 latency (µs, axis to 200) vs. value size from 64 to 1024 bytes, for 4, 8, 16, and 32 GiB datasets per instance.]

Slide 13: Stage 2: Recovery Performance
Status quo
· Data availability
  ➔ No redundancy in cache by default
  ➔ Some clusters are mirrored
· Backfill
  ➔ Mostly rely on organic traffic
  ➔ “Bootstrapper” bounded by QPS
  ➔ Full warmup takes from minutes to days
· Constraints on maintenance
  ➔ 20 minute restart interval by default
  ➔ Large clusters take days to restart
Rebuild from AEP
· Single instance
  ➔ 100 GiB of slab data
  ➔ Complete rebuild: 4 minutes
· Concurrent
  ➔ 18 instances per host
  ➔ Complete rebuild: 5 minutes
· Potential impact
  ➔ Speed up maintenance by 1-2 orders of magnitude (often needs other changes)
Slide 14: Stage 3: Testing In-house (memory mode)
Hardware config (Twitter DC)
· 2 X Intel Xeon 6222 (20 cores)
· 12 X 16GB DIMM
· 4 X 512GB AEP
· 2-1-1 config
· 1 X 25Gb NIC
· CentOS 7
Test config
· 20 instances per node
· key size is 64 bytes
· value size is 256 bytes
· connection count is 1000
· NUMA-aware
· read-only
[Chart: latency (µs, log axis 100-100k) by percentile (p25-p9999) for datasets from 10M keys / 4 GiB up to 160M keys / 74 GiB; SLO: p999 < 5ms.]
Results: p999 max = 16ms; p9999 max = 148ms; throughput 1.08M QPS.

Slide 15: Stage 3: Testing In-house (app direct mode)
Hardware config (Twitter DC)
· 2 X Intel Xeon 6222 (20 cores)
· 12 X 16GB DIMM
· 4 X 512GB AEP
· 2-1-1 config
· 1 X 25Gb NIC
· CentOS 7
Test config
· 20 instances per node
· key size is 64 bytes
· value size is 256 bytes
· connection count is 1000
· NUMA-aware
· read-only
[Chart: latency (µs) by percentile (p25-p9999) for the same five datasets, 10M keys / 4 GiB up to 160M keys / 74 GiB; SLO: p999 < 5ms.]
Results: p999 max = 1.4ms; p9999 max = 2.5ms; throughput 1.08M QPS.

Slide 16: Conclusion / Next Step
Conclusion
· App direct mode
  ● Changes were modest
  ● Can serve all data structures
  ● Serving performance comparable to DRAM for tested Twitter workloads
  ● Recovery performance was good
· Memory mode
  ● Fully-loaded config performs like DRAM
  ● Less scalable w/ wimpier config
· Bottleneck
  ● Network is still primary
Next step
· Network
  ● Testing in-house with ADQ
· Production canary
  ● Will we see the same performance?
  ● How does larger heap affect hit rate?
· Performance
  ● Scaling with connection counts
  ● Profiling, especially for memory mode
  ● Tuning data structure/storage design
  ● Testing AEP with pelikan_rds

Slide 17: Further Read / Contributors
Further read
· Pelikan: Redis at Scale; Caching with Twemcache; Why Pelikan; Pelikan Github
· Cache w/ AEP: Redis-pmem; Memcached with pmem
Contributors
· Thank you Intel Team! Ali Alavi (@TheAliAlavi), Andy Rudoff (@andyrudoff), Jakub Schmiegel, Jason Harper, Michal Biesek, Mauricio Cuervo (@mauriciocuervo), Piotr Balcer, Usha Upadhyayula
· Thank you Twitter Team! Brian Martin (@brayniac), Kevin Yang (@kevjyang), Matt Silver (@msilver)
#collaborate