From Global to Local Quiescence: Wait-Free Code Patching of Multi-Threaded Processes

Florian Rommel, Christian Dietrich, Daniel Friesel, Marcel Köppen, Christoph Borchert, Michael Müller, Olaf Spinczyk, Daniel Lohmann

Leibniz Universität Hannover, Universität Osnabrück

2020-11-05

Dynamic Software Updating

Apply updates at run time

- High Availability: service quality must not decrease
- Expensive Reboot: e.g., applications with large run-time state

Prime Example: Operating Systems → Kernel (e.g., kGraft)

Userspace Applications? → DSU is rarely used in practice

2020-11-05 OSDI '20 From Global to Local Quiescence: Wait-Free Code Patching of Multi-Threaded Processes, Florian Rommel

Example: OpenLDAP

[Figure: clients connect to the OpenLDAP server. A listener thread dispatches incoming requests to worker threads, which compute the responses.]

Example: OpenLDAP

Worker and patcher threads:

    void worker_thread() {
        while (1) {
            wait_for_work();
            do_work();
            // quiescence point
            if (patch_pending()) {
                barrier();
                wait_for_patch();
            }
        }
    }

    void patcher_thread() {
        while (1) {
            wait_for_patch_request();
            set_patch_pending();
            barrier();
            apply_patch();
            reset_patch_pending();
            resume_workers();
        }
    }

Global Quiescence: All workers must be in the barrier before patching.

rwm_op_rollback(), reached from do_work(), buggy version:

    void rwm_op_rollback( Operation *op, SlapReply *rs, rwm_op_state *ros ) {
        ...
        op->o_tmpfree( ros->mapped_attrs, op->o_tmpmemctx );
        filter_free_x( op, op->ors_filter, 1 );
        op->o_tmpfree( op->ors_filterstr.bv_val, op->o_tmpmemctx );
        op->ors_attrs = ros->ors_attrs;
        op->ors_filter = ros->ors_filter;
        op->ors_filterstr = ros->ors_filterstr;
        ...
    }

rwm_op_rollback(), patched version:

    void rwm_op_rollback( Operation *op, SlapReply *rs, rwm_op_state *ros ) {
        ...
        if ( op->ors_filter != ros->ors_filter ) {
            filter_free_x( op, op->ors_filter, 1 );
            op->ors_filter = ros->ors_filter;
        }
        if ( op->ors_filterstr.bv_val != ros->ors_filterstr.bv_val ) {
            op->o_tmpfree( op->ors_filterstr.bv_val, op->o_tmpmemctx );
            op->ors_filterstr = ros->ors_filterstr;
        }
        ...
    }

Global Quiescence

The to-be-patched code is not active in any thread.

Problems
#1: Long Calculations
#2: I/O Operations
#3: Inter-Thread Dependencies

[Figure: two plots over the response time relative to the patch request [s] (0.0 to 1.0): Responses [1/s] and Max. Latency [ms].]

Global Quiescence

Problems
#1: Long Calculations
#2: I/O Operations
#3: Inter-Thread Dependencies → MariaDB: Transaction Locks

[Figure: timeline. Thread 1 calls tx_lock() and then enters barrier(). Thread 2 blocks in tx_lock() and never reaches the barrier. Result: Deadlock.]

Kernelspace

- Ksplice: Probe for quiescence instead of waiting in a barrier
  → Patch may never get applied
- kGraft, DynAMOS: Keep patched and unpatched functions in parallel
  Decide on a per-thread basis which version to use
  Global quiescence → local quiescence
  → Problems: kernel-specific, performance penalty

Local Quiescence

Basic Idea: Patching threads independently from each other.

Wait-Free Code Patching via Address-Space Generations

OS extension for run-time modification in multi-threaded processes

- AS generations: Multiple views of an address space
- Thread-local quiescence
- Thread-by-thread migration between AS generations

→ Implementation in the Linux kernel

Local Quiescence

Inter-Thread Dependencies → No more deadlocks!

[Figure: two plots over the response time relative to the patch request [s] (0.0 to 1.0), each comparing Global Quiescence and Local Quiescence: Responses [1/s] and Max. Latency [ms].]

Address-Space Generations

[Figure: wf_create() places Generation 1 next to Generation 0 of the address space. The text segment, marked with wf_pin(), is a Copy-On-Write Mapping and is patched in Generation 1. Data & stack are a Shared Mapping.]

Address-Space Generations

[Figure: same layout as before (text: Copy-On-Write Mapping, patched in Generation 1; data & stack: Shared Mapping), now with the threads th1, th2, and th3, which move between the generations via wf_migrate(). wf_delete() is shown next to wf_create().]

Example: OpenLDAP

Worker and patcher threads with local quiescence:

    void worker_thread() {
        while (1) {
            wait_for_work();
            do_work();
            // quiescence point
            if (migration_pending()) {
                wf_migrate();
            }
        }
    }

    void patcher_thread() {
        while (1) {
            wait_for_patch_request();
            wf_create();
            wf_migrate();
            apply_patch();
            set_migration_pending();
        }
    }

The buggy and patched versions of rwm_op_rollback() are the same as in the global-quiescence example.

Implementation in the Linux Kernel

- wf_create: Clones the memory map (MM) = AS generation
  like fork() but without COW
  However, pinned mappings use COW
- wf_migrate: Changes the thread's MM pointer
  Context switch

[Figure: wf_create turns the Memory Map (MM) into an additional Cloned MM. Both contain text (COW, pinned via wf_pin), data & bss, heap, file mappings, and stack. Thread 1, Thread 2, and Thread 3 each hold an *MM pointer.]

Implementation in the Linux Kernel

- wf_create: Clones the memory map (MM) = AS generation
  like fork() but without COW
  However, pinned mappings use COW
- wf_migrate: Changes the thread's MM pointer
  Context switch
- Mapping Changes: Synchronized on all MMs
  Master MM: Lazy page initialization, Locking proxy

[Figure: Master MM and Cloned MM, both with text (COW, pinned via wf_pin), data & bss, heap, file mappings, and stack. Page faults are marked at the heap of both MMs. Thread 1, Thread 2, and Thread 3 each hold an *MM pointer.]

Evaluation: Patches

Debian 10.0 packages and Debian patches (except MariaDB)

                 OpenLDAP  Apache   Memcached  Samba  MariaDB  Node.js
Patches (CVE)    13 (2)    10 (10)  1 (1)      2 (2)  74 (26)  4 (0)
text-only        9 (2)     7 (7)    1 (1)      2 (2)  67 (24)  4 (0)
kpatch’able      9 (2)     7 (7)    1 (1)      2 (2)  16 (5)   0 (0)

Patches (CVE) → text-only: Restrict to code-only patches, 87% (88%)
text-only → kpatch’able: Generate patches via kpatch, 39% (47%)

Evaluation: Request Latencies

[Figure: six pairs of histograms (number of requests on a log scale over latency [ms]), one pair per application, for Global Quiescence and Local Quiescence. Panels: OpenLDAP, Apache, Memcached, MariaDB, Node.js: Histogram of Request Latency [ms]; Samba: Histogram of File I/O Latency [ms].]

P99.5 latency [ms] as annotated in the histograms:

             Global Quiescence  Local Quiescence
OpenLDAP     143.52             9.56
Apache       601.00             541.00
Memcached    855.90             32.38
Samba        760.68             55.69
MariaDB      323.62             7.84
Node.js      236.08             243.15

Evaluation: Request Latencies

[Figure: OpenLDAP: Histogram of Request Latency [ms] (number of requests on a log scale, 0 to 140 ms). P99.5 = 143.52 ms with Global Quiescence, P99.5 = 9.56 ms with Local Quiescence.]

Evaluation: Request Latencies

[Figure: MariaDB: Histogram of Request Latency [ms] (number of requests on a log scale, 0 to 1000 ms). P99.5 = 323.62 ms with Global Quiescence, P99.5 = 7.84 ms with Local Quiescence.]

Evaluation: Overhead

- Run Time (microbenchmarks under load):
  - wf_create: 88±23 µs (Memcached) to 2171±139 µs (Node.js)
  - wf_migrate: 5±5 µs (Samba) to 8±7 µs (Node.js)
- Memory (under load): 132 KiB (Memcached) to 1808 KiB (Node.js)

Future Work

Basic mechanism: Synchronized address-space clones with partial differences

Possible applications
- Combination with JIT compiler
- Path-specific kernel modification (→ Synthesis)
- Implementation of dynamic variability (→ Multiverse)
- Address-space views for data (thread isolation)

Conclusion

Wait-Free Code Patching via Address-Space Generations

- Goal: Reduce global quiescence to local quiescence
  - Easier to establish
  - Maintain quality of service
- Approach: Synchronized address-space clones
- Evaluation: 6 server applications
  - Successful application of code-only patches
  - Improved tail latencies during patching

Thank you for your attention.

Try it: https://www.sra.uni-hannover.de/p/wfpatch
