SKQ: Event Scheduling for Optimizing Tail Latency in a Traditional OS Kernel

Siyao Zhao, Haoyu Gu, Ali José Mashtizadeh

RCS Group @ University of Waterloo

USENIX ATC ’21 – July 16, 2021

The Latency Problem

• Modern server applications require low tail latency

• Recent research proposed kernel-bypass/custom dataplanes

• Problem: most applications still on traditional OSes

• Solving the latency problem in the kernel is challenging

Sources of Latency

• Cache Misses
  – Connection migration from RSS & kernel
  – Applications do not detect any of this!
  – RPC server → 77% avoidable L2 misses

• Workload Imbalance
  – Over-saturated vs. under-saturated threads
  – Total processing time difference: Memcached 1.4% vs. GIS application 46%

Event-Driven Programming Model

• Uses a kernel event facility:
  – Older: select, poll, /dev/poll
  – Newer: FreeBSD Kqueue, Linux epoll, Windows IOCP
  – 20+ years old

• Maximum visibility into OS and applications

• Application workflow (see the sketch below):
  – Applications register events with the kernel event object
  – Worker threads poll/listen on the kernel event object
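
A minimal sketch of this workflow using the classic FreeBSD kevent(2) API (not SKQ itself); the kqueue descriptor kq and the socket fd are assumed to exist:

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <stddef.h>

    /* Step 1: register an event with the kernel event object (the kqueue). */
    static void register_read_event(int kq, int fd)
    {
        struct kevent change;
        /* "Tell me when fd becomes readable." */
        EV_SET(&change, fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
        kevent(kq, &change, 1, NULL, 0, NULL);
    }

    /* Step 2: worker threads poll/listen on the same kernel event object. */
    static void worker_loop(int kq)
    {
        struct kevent evs[32];
        for (;;) {
            /* Blocks until one or more registered events trigger. */
            int nev = kevent(kq, NULL, 0, evs, 32, NULL);
            for (int i = 0; i < nev; i++) {
                /* handle evs[i].ident (e.g., accept or read) */
            }
        }
    }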

Event-Driven Programming Models

Model            1:1       1:N
Event Object     Private   Shared
Scalability      Good      Poor
Schedulability   Poor      Good
Affinity         Good      Poor
Popularity       High      Low

Our System: Schedulable Kqueue (SKQ)

• Efficient event scheduling on top of Kqueue

• Multicore scalability in the 1:N model

• Event scheduling to reduce latency – Cache misses & workload imbalance

• Event delivery control – Event prioritization & event pinning

SKQ Usage

Main thread:

    // Create a single SKQ instance
    int skq = kqueue();
    // Set scheduling policy
    int sched = KQ_SCHED_CPU;
    ioctl(skq, FKQMULTI, &sched);
    ...

Worker threads:

    while (true) {
        const int maxev = 32;
        struct kevent evs[maxev];
        // Query events
        int nev = kevent(skq, NULL, 0, evs, maxev, NULL);
        // Process events
        ...
    }
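
A sketch of how the two halves might be wired together with POSIX threads; only the calls shown above (kqueue(), the FKQMULTI request, KQ_SCHED_CPU, kevent()) are SKQ/Kqueue API, while the pthread scaffolding and the handler body are placeholders:

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/ioctl.h>
    #include <pthread.h>

    #define NWORKERS 4

    static int skq;                          /* the single shared SKQ instance */

    static void *worker(void *arg)
    {
        struct kevent evs[32];
        (void)arg;
        for (;;) {
            int nev = kevent(skq, NULL, 0, evs, 32, NULL);
            for (int i = 0; i < nev; i++) {
                /* process evs[i]; SKQ decides which worker sees which event */
            }
        }
        return NULL;
    }

    int main(void)
    {
        skq = kqueue();
        int sched = KQ_SCHED_CPU;            /* policy constant from the slide;
                                                defined in SKQ's patched headers */
        ioctl(skq, FKQMULTI, &sched);        /* enable multi-worker SKQ mode */

        pthread_t tids[NWORKERS];
        for (int i = 0; i < NWORKERS; i++)
            pthread_create(&tids[i], NULL, worker, NULL);
        for (int i = 0; i < NWORKERS; i++)
            pthread_join(tids[i], NULL);
        return 0;
    }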

SKQ Architecture

• Scalability
  – Thread-private event queue (kevq)
  – Fine-grained locking
• Event scheduling and delivery control
  – The event scheduler
• Best of both worlds via scheduling
  – 1:N model to applications
  – 1:1 model internally

[Figure: SKQ object with a knote hash table (knotes for files and sockets), per-thread kevqs, and an event scheduler delivering events to application threads]
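
An illustrative, userspace-compilable mock of how these pieces relate; the struct and field names below are simplified stand-ins, not the actual FreeBSD/SKQ definitions:

    /* Illustrative only: simplified stand-ins, not the real kernel structs. */
    typedef struct knote knote;   /* one registered event (fd + filter)        */
    typedef struct kevq  kevq;    /* thread-private queue of activated knotes  */

    struct knote {
        int    fd;                /* backing file, socket, ...                 */
        knote *hash_next;         /* chained in the SKQ-wide knote hash table  */
        knote *ready_next;        /* linked into exactly one kevq when active  */
    };

    struct kevq {
        knote *ready_head;        /* events the owning worker thread dequeues  */
        /* a per-kevq lock lives here: fine-grained locking, no global queue   */
    };

    struct skq {
        knote **knote_hash;       /* every registered event                    */
        kevq   *kevqs;            /* one kevq per worker thread (1:1 inside)   */
        int     nkevqs;
        int     policy;           /* the event scheduler picks a kevq per knote */
    };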

Scalability Showdown: SKQ vs. Kqueue

• POSIX pipes instead of sockets
• 1:N SKQ scales linearly
• 1:N Kqueue sees a bottleneck

[Figure: Throughput (req./s) vs. number of cores (1-12) for 1:1 Kqueues, 1:1 SKQs, 1:N Kqueue, and 1:N SKQ]

Performance Improvement: SKQ vs. Kqueue

• Memcached with 1:1 model
• Facebook Mutilate workload
• −33% tail latency at peak
• Benchmark baseline: we use 1:1 SKQs throughout the talk to isolate the benefit of scheduling

[Figure: Average and 99th percentile latency vs. throughput (req./s) for 1:1 Kqueues and 1:1 SKQs]

Scheduling Policies

• Cache locality policies

• Load balancing policies

• Hybrid policies

Challenges of Efficient Event Scheduling

• Little overhead available for scheduling – Millions of events per second

• CPU cost, lock contention, and cache footprint – Statistics – Data structures

• Memcached: – 15k cycles per request (amortized) – L3 cache miss ∼ 400 cycles – 2.7% drop in throughput
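
One way to read these numbers (arithmetic implied by the slide, matching the quoted throughput drop): a single extra L3 miss per request already costs

    400 cycles / 15,000 cycles per request ≈ 2.7%

of the per-request budget, so the scheduler itself can afford far less than one extra miss and only a handful of cycles per event.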

Cache Locality Policies

• CPU affinity
  – Delivers to the triggering core
  – Follows connection migration
  – Cache locality in the networking stack

• Queue affinity
  – Pins to the first core
  – Cache locality in userspace
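
A purely illustrative userspace mock of the two delivery decisions; the helpers and bookkeeping are assumptions, and the real logic lives inside the SKQ kernel scheduler:

    /* Purely illustrative userspace mock; all names are made up. */
    #include <stddef.h>

    #define NCPUS 4

    struct kevq  { int id; };
    struct knote { struct kevq *kevq; };        /* kevq the event was pinned to */

    static struct kevq kevqs[NCPUS];            /* one worker kevq per core     */

    static int current_cpu(void) { return 0; }  /* stand-in for the real curcpu */

    /* CPU affinity: deliver to the kevq of the core that triggered the event,
     * so delivery follows RSS / in-kernel connection migration. */
    static struct kevq *sched_cpu_affinity(struct knote *kn)
    {
        (void)kn;
        return &kevqs[current_cpu()];
    }

    /* Queue affinity: keep delivering to the kevq that first saw this event,
     * keeping the owning worker's userspace state cache-hot. */
    static struct kevq *sched_queue_affinity(struct knote *kn)
    {
        if (kn->kevq == NULL)
            kn->kevq = &kevqs[current_cpu()];
        return kn->kevq;
    }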

Cache Locality Policies in Memcached

• A uniform workload
• Facebook Mutilate workload
• Cache locality dominates latency
  – +9% low-latency throughput
  – −26% tail latency at peak

[Figure: Average and 99th percentile latency vs. throughput (req./s) for 1:1 SKQs, CPU affinity, and Queue affinity]

CPU Affinity vs. Queue Affinity

• L2 cache misses in an RPC server

• Uniform workload, Memcached-like

• CPU affinity: −31.2% L2 cache misses (vs. queue affinity)

Policy    TCP input   TCP output   Event activation   Event query   Total
CPU       252k        15k          63k                166k          496k
Queue     343k        33k          95k                250k          721k
Vanilla   828k        76k          45k                1235k         2184k

Imbalanced Workload: RocksDB

• Facebook ZippyDB workload
• Imbalanced – slow SEEK requests
• Cache locality policies don't help

[Figure: Average and 99th percentile latency vs. throughput (req./s) for 1:1 SKQs and CPU affinity]

Load Balancing Policies

• Best of two
  – Selects the better of two random kevqs
  – Statistics keeping: number of events, average processing time per event
• Work stealing
  – Idle threads steal work
  – Minimal interference
• Best of two + work stealing

[Figure: Average and 99th percentile latency vs. throughput (req./s) for 1:1 SKQs, Best of two, BoT + WS, and CPU affinity]
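
An illustrative sketch of the best-of-two pick; the per-kevq statistics fields and the expected_wait() estimate are assumptions rather than the exact SKQ formula:

    /* Illustrative only: the stats fields and cost estimate are assumptions. */
    #include <stdint.h>
    #include <stdlib.h>

    struct kevq {
        int      nevents;         /* events currently queued on this kevq   */
        uint64_t avg_ev_cycles;   /* average processing time per event      */
    };

    /* Rough estimate of how long a new event would wait on this kevq. */
    static uint64_t expected_wait(const struct kevq *q)
    {
        return (uint64_t)q->nevents * q->avg_ev_cycles;
    }

    /* Best of two: sample two random kevqs, deliver to the less loaded one. */
    static struct kevq *sched_best_of_two(struct kevq *kevqs, int n)
    {
        struct kevq *a = &kevqs[rand() % n];
        struct kevq *b = &kevqs[rand() % n];
        return expected_wait(a) <= expected_wait(b) ? a : b;
    }

Work stealing is the consumer-side complement: a worker that finds its own kevq empty dequeues from another kevq instead of idling, which keeps interference minimal.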

Hybrid Policies

• Load balancing + cache locality
  – Best of two vs. cache-local kevq
  – Cache miss penalty
• Hybrid Policy
  – CPU affinity + best of two + work stealing
  – 27.4× low-latency throughput
  – Up to 1022× lower tail latency

[Figure: Average and 99th percentile latency vs. throughput (req./s) for 1:1 SKQs, BoT + WS, and Hybrid Policy]
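
A rough sketch of the tradeoff in the first bullet, charging the best-of-two candidate an assumed cache-miss penalty before it can beat the cache-local kevq (the constant and helper names are made up):

    /* Illustrative only: the penalty constant and helper are assumptions. */
    #include <stdint.h>

    struct kevq { int nevents; uint64_t avg_ev_cycles; };

    #define CACHE_MISS_PENALTY 400  /* assumed cost of losing locality, cycles */

    static uint64_t expected_wait(const struct kevq *q)
    {
        return (uint64_t)q->nevents * q->avg_ev_cycles;
    }

    /* Deliver to the remote candidate only if it still wins after paying
     * the cache-miss penalty; otherwise stay on the cache-local kevq. */
    static struct kevq *sched_hybrid(struct kevq *local, struct kevq *candidate)
    {
        if (expected_wait(candidate) + CACHE_MISS_PENALTY < expected_wait(local))
            return candidate;
        return local;
    }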

SKQ vs. Kernel Bypass: Uniform 10 µs Workload

• RPC server
  – Uniform workload
  – 10 µs request service time
  – CPU affinity policy
• Compared at 150 µs tail latency
• Shenango: 1.67× low-latency throughput
• Bottleneck: system call overhead

[Figure: Average and 99th percentile latency vs. throughput (req./s) for SKQ 10 µs and Shenango 10 µs]

SKQ vs. Kernel Bypass: Uniform 20 µs Workload

• RPC server
  – Uniform workload
  – 20 µs request service time
  – CPU affinity policy
• Compared at 150 µs tail latency
• Shenango: 1.5× low-latency throughput
• Bottleneck: system call overhead

[Figure: Average and 99th percentile latency vs. throughput (req./s) for SKQ 20 µs and Shenango 20 µs]

SKQ vs. Kernel Bypass: Zipf-like Workload

• RPC server
  – Zipf-like workload
  – 20.5 µs average service time
  – Hybrid policy
• Compared at 150 µs tail latency
• Shenango: 1.17× low-latency throughput
• Benefit from event scheduling

[Figure: Average and 99th percentile latency vs. throughput (req./s) for SKQ Zipf and Shenango Zipf]

SKQ vs. Kernel Bypass

• Kernel bypass:
  – Short and uniform workloads
• Kernel event scheduling:
  – Long and imbalanced workloads
• Unfair comparison
  – Different OSes
  – Different TCP stacks

[Figure: Average and 99th percentile latency vs. throughput (req./s) for SKQ and Shenango under the 10 µs, 20 µs, and Zipf workloads]

Policy Selection Guidelines

• Uniform workloads – CPU affinity

• Imbalanced or IO-heavy workloads – Hybrid policy (CPU affinity + Best of two + Work stealing)

Conclusion

• A practical solution to the latency problem

• More optimization opportunities in kernel

• See paper for: – Event prioritization and pinning – Cache miss analysis and more benchmarks – Design details and optimizations

• Source code available: – https://rcs.uwaterloo.ca/skq/
