Fast on Average, Predictable in the Worst Case
Exploring Real-Time Futexes in LITMUSRT
Roy Spliet, Manohar Vanga, Björn Brandenburg, Sven Dziadek
IEEE Real-Time Systems Symposium 2014 December 2-5, 2014 Rome, Italy 1 Real-Time Locking API
Mutex API kernel_do_lock(); // critical section kernel_do_unlock();
2 Real-Time Locking API
Mutex API
No other task is currently holding the lock kernel_do_lock(); // critical section kernel_do_unlock(); No other task is waiting for the lock
Operations are typically uncontended at runtime
3 Futexes: Fast Userspace Mutexes
A mechanism (API) in the Linux kernel ! • For optimizing the uncontended case of locking protocol implementations ! • Avoid kernel invocation during uncontended operations
4 Futexes: Fast Userspace Mutexes
A mechanism (API) in the Linux kernel ! • For optimizing the uncontended case of locking protocol implementations ! • Avoid kernel invocation during uncontended operations
Why optimize the uncontended case? ! • Improved throughput for soft real-time workloads ! • Increasing available slack time in the system
5 Real-Time Locking Implementations
PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)
Futex Implementation
Linux
6 Real-Time Locking Implementations
PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)
Futex Implementation
Linux LITMUSRT
LITMUSRT Real-Time Locking Implementations
PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)
Futex Implementation
Linux LITMUSRT
8 Real-Time Locking Implementations
PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)
Futex Implementation
Linux LITMUSRT
Choice between futex implementation and better analytical properties
9 Real-Time Locking Implementations
PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)
Futex Implementation
Linux LITMUSRT
Choice between futex implementation and better analytical properties
Why is it challenging to have both?
10 Real-Time Locking Implementations
PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)
Futex Implementation
Reactive Anticipatory
11 Real-Time Locking Protocol Dichotomy
Reactive Locking Protocols Anticipatory Locking Protocols
React to contention on a lock Anticipate problem scenarios only when it occurs. and take measure to minimize priority-inversion (pi) blocking.
12 Real-Time Locking Protocol Dichotomy
Reactive Locking Protocols Anticipatory Locking Protocols
React to contention on a lock Anticipate problem scenarios only when it occurs. and take measure to minimize priority-inversion (pi) blocking.
Tricky to implement futexes Simple futex implementation without violating semantics
13 Contributions
PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)
Futex Implementation
14 Contributions
PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)
Futex Implementation
Design and implementation of futexes for three anticipatory real-time locking protocols
15 Contributions
PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)
Futex Implementation Observed overhead reduction: 92% 85% 84%
Design and implementation of futexes for three anticipatory real-time locking protocols
16 Contributions
PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)
Futex Implementation Observed overhead reduction: 92% 85% 84%
Design and implementation of futexes for three anticipatory real-time locking protocols
Method of using page faults to implement futexes
17 Overview of Talk
18 Overview of Talk
Challenge
Implementation
Evaluation
19 Overview of Talk
Challenge
Implementation
Evaluation
20 Futexes — Overview and Intuition The problem with the vanilla approach
Mutex API
kernel_do_lock(); // critical section kernel_do_unlock();
21 Futexes — Overview and Intuition The problem with the vanilla approach
Kernel invoked on every lock Mutex API and unlock operation
kernel_do_lock(); // critical section kernel_do_unlock();
22 Futexes — Overview and Intuition The problem with the vanilla approach
Kernel invoked on every lock Mutex API and unlock operation
kernel_do_lock(); // critical section kernel_do_unlock(); Task Task Task
Lock State Kernel
23 Futexes — Overview and Intuition Exporting Lock State to Userspace Processes
Task Task Task
Lock State Kernel
Shared lock state between userspace and kernel
24 Futexes — Overview and Intuition Exporting Lock State to Userspace Processes
Task Task Task Futex Pseudocode try_to_acquire_lock(lock); Lock State if (lock is contended) Kernel kernel_do_lock(lock); // critical section release_lock(lock); if (there are blocked tasks) kernel_do_unlock();
25 Futexes — Overview and Intuition Fast-Path for Uncontended Operations
Task Task Task Futex Pseudocode try_to_acquire_lock(lock); Lock State if (lock is contended) Kernel kernel_do_lock(lock); // critical section release_lock(lock); if (there are blocked tasks) Fast-path when operation is kernel_do_unlock(); uncontended (kernel not invoked)
26 Futexes — Overview and Intuition Slow-Path for Contended Operations
Task Task Task Futex Pseudocode try_to_acquire_lock(lock); Lock State if (lock is contended) Kernel kernel_do_lock(lock); // critical section release_lock(lock); if (there are blocked tasks) Kernel invoked only when lock or kernel_do_unlock(); unlock operation is contended
27 Priority Inheritance Protocol Futexes for reactive locking protocols
HI Lock 2
LO Lock 1 Lock 2 …
Time
28 Fixed-Priority Scheduling Priority Inheritance Protocol Futexes for reactive locking protocols
Uncontended No kernel intervention
HI Lock 2
LO Lock 1 Lock 2 …
Time
29 Fixed-Priority Scheduling Priority Inheritance Protocol Futexes for reactive locking protocols
Contended Kernel intervention needed Uncontended No kernel intervention
HI Lock 2
LO Lock 1 Lock 2 …
Time
30 Fixed-Priority Scheduling Priority Ceiling Protocol (PCP)
• Each lock has a priority ceiling — the highest priority of all tasks that can acquire that lock.
• System ceiling — Currently held lock with highest ceiling
• A lock may be acquired by a task only if:
• Task priority > System Ceiling Priority!
• Or the task is holding the system ceiling!
• When under pi-blocking, priority inheritance raises priority of lower priority task to that of higher priority task.
31 Priority Ceiling Protocol (PCP)
• Each lock has a priority ceiling — the highest priority of all tasks that can acquire that lock.
• System ceiling — Currently held lock with highest ceiling
• A lock may be acquired by a task only if:
• Task priority > System Ceiling Priority!
• Or the task is holding the system ceiling!
• When under pi-blocking, priority inheritance raises priority of lower priority task to that of higher priority task.
32 Priority Ceiling Protocol (PCP)
• Each lock has a priority ceiling — the highest priority of all tasks that can acquire that lock.
• System ceiling — Currently held lock with highest ceiling
• A lock may be acquired by a task only if:
• Task priority > System Ceiling Priority!
• Or the task is holding the system ceiling!
• When under pi-blocking, priority inheritance raises priority of lower priority task to that of higher priority task.
33 Priority Ceiling Protocol (PCP)
• Each lock has a priority ceiling — the highest priority of all tasks that can acquire that lock.
• System ceiling — Currently held lock with highest ceiling
• A lock may be acquired by a task only if:
• Task priority > System Ceiling Priority!
• Or the task is holding the system ceiling!
• When under contention, priority inheritance (raises priority of lower priority task to that of higher priority task).
34 PCP Futexes — Challenge
HI L1
MED L2
LO L1
Time
35 Fixed-Priority Scheduling PCP Futexes — Challenge
HI L1
MED L2
LO L1
Time Raises system ceiling prio. to HI Lowers system ceiling prio. to None
36 Fixed-Priority Scheduling PCP Futexes — Challenge
Valid acquisition HI L1 MED > System Ceiling Prio.
MED L2
LO L1
Time Raises system ceiling prio. to HI Lowers system ceiling prio. to None
37 Fixed-Priority Scheduling PCP Futexes — Challenge
HI L1 HI L1
MED L2 MED L2
LO L1 LO L1
Time Time
38 Fixed-Priority Scheduling PCP Futexes — Challenge
HI L1 HI L1
MED L2 MED L2
LO L1 LO L1
Time Time Raises system ceiling prio. to HI Preemption Sys. ceiling prio. = HI
39 Fixed-Priority Scheduling PCP Futexes — Challenge
L2 free But invalid acquisition HI L1 HI L1 Sys. Ceiling Prio. > MED
MED L2 MED L2
LO L1 LO L1
Time Time Raises system ceiling prio. to HI Preemption Sys. ceiling prio. = HI
40 Fixed-Priority Scheduling PCP Futexes — Challenge
L2 free But invalid acquisition HI L1 HI L1 Sys. Ceiling Prio. > MED
MED L2 MED L2
LO L1 LO L1
Problem — Simply relying on lock state (as in reactive case) may violate PCP semantics. Must consider system ceiling. Time Time Raises system ceiling prio. to HI Preemption Sys. ceiling prio. = HI
41 Fixed-Priority Scheduling PCP Futexes — Challenge
L2 free But invalid acquisition HI L1 HI L1 Sys. Ceiling Prio. > MED
MED L2 MED L2
LO L1 LO L1
Problem — Simply relying on lock state (as in reactive case) may violate PCP semantics. Must consider system ceiling. Time Time Raises system ceiling prio. to HI Problem — system ceiling potentially updated at everyPreemption lock acquisition and release — must always invoke Sys.kernel ceiling to avoid prio. = HI protocol violations
42 Fixed-Priority Scheduling Overview of Talk
Challenge
Implementation
Evaluation
43 Overview of Talk
Challenge
Implementation
Evaluation
44 PCP Futexes — Our Approach
PCP Futex Pseudocode if (lock is contended) kernel_do_lock(lock); else acquire_lock_locally(lock) // critical section release_lock(lock); if (there are blocked tasks) kernel_do_unlock();
45 PCP Futexes — Our Approach
Whether lock acquisition is contended PCP Futex Pseudocode is determined in PCP by the current system ceiling if (lock is contended) kernel_do_lock(lock); else acquire_lock_locally(lock) // critical section release_lock(lock); if (there are blocked tasks) kernel_do_unlock();
46 PCP Futexes — Our Approach
Whether lock acquisition is contended PCP Futex Pseudocode is determined in PCP by the current system ceiling if (lock is contended) kernel_do_lock(lock); The presence of blocked tasks is else determined by the size of the wait- acquire_lock_locally(lock) queue corresponding to the lock. // critical section release_lock(lock); if (there are blocked tasks) kernel_do_unlock();
47 PCP Futexes — Our Approach
Whether lock acquisition is contended PCP Futex Pseudocode is determined in PCP by the current system ceiling if (lock is contended) kernel_do_lock(lock); The presence of blocked tasks is else determined by the size of the wait- acquire_lock_locally(lock) queue corresponding to the lock. // critical section release_lock(lock); if (there are blocked tasks) Cannot change while a task is kernel_do_unlock(); scheduled on a uniprocessor!
48 PCP Futexes — Deferred Updates
HI Scheduling Interval
LO
Time
Invoke kernel
49 PCP Futexes — Deferred Updates
HI Scheduling Interval
LO
Time
Predetermine contention Make changes to local Update kernel state based conditions and copy of lock states (if on copy at context switch communicate them to task uncontended)
50 PCP Futexes — Deferred Updates
HI Scheduling Interval
LO
Time
Predetermine contention Make changes to local Update kernel state based conditions and copy of lock states (if on copy at context switch communicate them to task uncontended)
Requires a bidirectional communication channel between tasks and the kernel
51 Bidirectional Communication — Lock Page
bitmap: lock_states
bool: tasks_blocked
Lock Page Per-process page of memory writable by both the process and kernel.
52 Bidirectional Communication — Lock Page
Lock-state bitmap used to communicate acquisitions and releases of locks to kernel
bitmap: lock_states
bool: tasks_blocked
Lock Page Per-process page of memory writable by both the process and kernel.
53 Bidirectional Communication — Lock Page
Lock-state bitmap used to communicate acquisitions and releases of locks to kernel
bitmap: lock_states Boolean set by kernel to indicate the bool: tasks_blocked presence of blocked tasks (indicates that release operation is contended) Lock Page Per-process page of memory writable by both the process and kernel.
54 Bidirectional Communication — Lock Page
Lock-state bitmap used to communicate acquisitions and releases of locks to kernel
bitmap: lock_states Boolean set by kernel to indicate the bool: tasks_blocked presence of blocked tasks (indicates that release operation is contended) Lock Page Per-process page of memory writable by both the process and A “permission bit” set by the kernel to kernel. indicate whether lock acquisition is allowed (system ceiling check)
55 Communicating the Permission Bit
PCP Futex Pseudocode if (permission bit is set) set_bit_in_bitmap(lock) else kernel_do_lock(lock); // critical section release_lock(lock); if (tasks_blocked is set) kernel_do_unlock();
56 Communicating the Permission Bit
PCP Futex Pseudocode Challenge: if (permission bit is set) set_bit_in_bitmap(lock) Checking for permission and setting else lock bit must be atomic. kernel_do_lock(lock); // critical section Why? Preemption between them release_lock(lock); may change the permission. if (tasks_blocked is set) kernel_do_unlock();
57 Communicating the Permission Bit
We proposed two approaches for the Classic PCP
Page Faults (PCP-DU-PF) CMPXCHG (PCP-DU-BOOL)
Uses page faults to invoke Boolean for permission bit. kernel when locking is prohibited Compare-and-exchange operation to check and acquire locks atomically
58 Communicating the Permission Bit
We proposed two approaches for the Classic PCP
Page Faults (PCP-DU-PF) CMPXCHG (PCP-DU-BOOL)
Uses page faults to invoke See paper for ! Boolean kernel when locking is details! prohibited Compare-and-exchange operation to check and acquire locks atomically PCP Futexes — Page Fault Approach
HI
LO
Time
If (locking is allowed) Attempting to write lock bit will set lock page writable invoke kernel automatically if else locking is not permitted set lock page unwritable
60 PCP Futexes — Page Fault Approach
Can be implemented using segmentation faults as well
Lock code-path is entirely branch-free HI
LO
Time
If (locking is allowed) Attempting to write lock bit will set lock page writable invoke kernel automatically if else locking is not permitted set lock page unwritable
61 PCP Futexes — Page Fault Approach
Can be implemented using segmentation faults as well
Lock code-path is entirely branch-free HI
Requires an optimized page-fault handler LO
Time
If (locking is allowed) Attempting to write lock bit will set lock page writable invoke kernel automatically if else locking is not permitted set lock page unwritable
62 Communicating the Permission Bit
We proposed two approaches for the Classic PCP
Page Faults (PCP-DU-PF) CMPXCHG (PCP-DU-BOOL)
Uses page faults to invoke Boolean for permission bit. kernel when locking is prohibited Compare-and-exchange operation to check and acquire locks atomically
Lock code-path is entirely Avoid page faults at cost of branch-free atomic op + branch
63 Overview of Talk
Challenge
Implementation
Evaluation
64 Overview of Talk
Challenge
Implementation
Evaluation
65 Evaluation — Platform
Boundary Devices Sabre Lite Board Freescale I.MX6Q Quad Core SoC ARM Cortex A9 (1 GHz) Implemented in LITMUSRT 2013.1 (based on Linux 3.10.5)
66 Evaluation — Benchmarking Methodology
Test program locks and unlocks once every period Uniprocessor tests: 5 threads Randomly assigned parameters for threads Critical section length — 25-45µs Execution time — 25-65µs Period — 600-800µs Measured using processor timestamp counters
67 Evaluation — Benchmarking Methodology
Test program locks and unlocks once every period Uniprocessor tests: 5 threads Randomly assigned parameters for threads Critical section length — 25-45µs Demanding workload — Execution time — 25-65µs thousands of operations per second Period — 600-800µs Measured using processor timestamp counters
68 PCP Futex Evaluation — Baseline
We compare all futex implementations against the non-futex PCP implementation in LITMUSRT
PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)
Futex Implementation
Linux LITMUSRT
69 PCP Futex Evaluation — Baseline
PRIO_PROTECT (Linux PCP) PCP (LITMUS^RT)
33,271
Overhead in Cycles 9,027 6,655 3,850 1,604 645 1,221 1,211
Lock (avg) Lock (max) Unlock (avg) Unlock (max)
70 Overhead in Cycles PCP FutexEvaluation—Baseline 1,604 Lock (avg) PRIO_PROTECT (LinuxPCP) 645 33,271 Lock (max) 3,850 Our baselinePCPimplementation 71 incurs loweroverheads 1,221 Unlock (avg) PCP (LITMUS^RT) 1,211 9,027 Unlock (max) thanLinux’s. 6,655 already PCP Futex Evaluation — Uncontended Case
How much reduction do we see in average uncontended-case overheads?
72 PCP FutexEvaluation—UncontendedCase
Overhead (compared to baseline) PCP (LITMUS^RT) 100% PCP (LITMUS Baseline 100% RT Lock (average) ) Futex (Page-Faults) 25% (Page Faults) PCP Futex 7% 73 Based on 6.6 million samples per protocol Futex (CMPXCHG) 27% (CMPXCHG) PCP Futex Unlock (average) 8% PRIO_INHERIT (Linux) (PRIO_INHERIT) 41% Linux Futex 25% PCP Futex Evaluation — Uncontended Case
Lock (average) Unlock (average)
100% 100% Our PCP futex implementation overheads are significantly lower than Linux’s PIP futex implementation.
41%
25% 27% 25% Overhead (compared to baseline) Overhead (compared 7% 8% PCPPCP (LITMUS^RT) (LITMUSRT) FutexPCP (Page-Faults) Futex FutexPCP (CMPXCHG) Futex PRIO_INHERITLinux Futex (Linux) Baseline (Page Faults) (CMPXCHG) (PRIO_INHERIT)
74Based on 6.6 million samples per protocol PCP Futex Evaluation —Contended Case
How much additional overhead do we incur in the maximum contended-case overheads?
75 PCP Futex Evaluation —Contended Case
Lock (Max) Unlock (Max)
186% 167%
137%
107% 100% 100% 106% 106%
PCPPCP (LITMUS^RT) (LITMUSRT) PCP-DU-PFPCP Futex PCP-DU-BOOLPCP Futex PRIO_INHERITLinux Futex (Linux) Baseline (Page Faults) (CMPXCHG) (PRIO_INHERIT)
76 Based on 900K samples per protocol PCP Futex Evaluation —Contended Case
Lock (Max) Unlock (Max)
Linux page-fault handler 186% not optimized for real-time 167% locking protocol implementation. 137%
107% 100% 100% 106% 106%
PCPPCP (LITMUS^RT) (LITMUSRT) PCP-DU-PFPCP Futex PCP-DU-BOOLPCP Futex PRIO_INHERITLinux Futex (Linux) Baseline (Page Faults) (CMPXCHG) (PRIO_INHERIT)
77 Based on 900K samples per protocol PCP Futex Evaluation —Contended Case
Lock (Max) Unlock (Max)
186% 167%
137%
107% 100% 100% 106% 106%
PCP-DU-PF PCPPCP (LITMUS^RT) (LITMUSRT) PCP-DU-BOOLPCP Futex PRIO_INHERITLinux Futex (Linux) Baseline (CMPXCHG) (PRIO_INHERIT)
78 Based on 900K samples per protocol PCP Futex Evaluation —Contended Case
Lock (Max) Unlock (Max)
Up to 37% 186%increase in worst-case contended overheads as a result of 167% the futex approach. 137%
107% 100% 100% 106% 106%
PCP-DU-PF PCPPCP (LITMUS^RT) (LITMUSRT) PCP-DU-BOOLPCP Futex PRIO_INHERITLinux Futex (Linux) Baseline (CMPXCHG) (PRIO_INHERIT)
79 Based on 900K samples per protocol PCP Futex Evaluation —Contended Case
Lock (Max) Unlock (Max)
Up to 37% 186%increase in worst-case contended overheads as a result of 167% the futex approach. 137% But no worse than Linux’s futex 107% implementation 100% 100% 106% 106%
PCP-DU-PF PCPPCP (LITMUS^RT) (LITMUSRT) PCP-DU-BOOLPCP Futex PRIO_INHERITLinux Futex (Linux) Baseline (CMPXCHG) (PRIO_INHERIT)
80 Based on 900K samples per protocol PCP Futex Evaluation — Summary
Up to 92% reduction in average uncontended overheads at a cost of 37% increase in maximum contended overheads (no worse than Linux).
Page faults can be used to implement futexes for anticipatory real-time locking protocols.
81 Summary
PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)
Futex Implementation Observed overhead reduction: 92% 85% 84%
Design and implementation of futexes for three anticipatory real-time locking protocols
Method of using page faults to implement futexes
82 Thanks!
Linux Testbed for Multiprocessor Scheduling in Real-Time Systems
All source code available at: http://www.litmus-rt.org/
83