Fast on Average, Predictable in the Worst Case

Exploring Real-Time Futexes in LITMUSRT

Roy Spliet, Manohar Vanga, Björn Brandenburg, Sven Dziadek

IEEE Real-Time Systems Symposium 2014 December 2-5, 2014 Rome, Italy 1 Real-Time Locking API

Mutex API kernel_do_lock(); // critical section kernel_do_unlock();

2 Real-Time Locking API

Mutex API

No other task is currently holding the kernel_do_lock(); // critical section kernel_do_unlock(); No other task is waiting for the lock

Operations are typically uncontended at runtime

3 Futexes: Fast Userspace Mutexes

A mechanism (API) in the kernel ! • For optimizing the uncontended case of locking protocol implementations ! • Avoid kernel invocation during uncontended operations

4 Futexes: Fast Userspace Mutexes

A mechanism (API) in the ! • For optimizing the uncontended case of locking protocol implementations ! • Avoid kernel invocation during uncontended operations

Why optimize the uncontended case? ! • Improved throughput for soft real-time workloads ! • Increasing available slack time in the system

5 Real-Time Locking Implementations

PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)

Futex Implementation

Linux

6 Real-Time Locking Implementations

PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)

Futex Implementation

Linux LITMUSRT

LITMUSRT Real-Time Locking Implementations

PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)

Futex Implementation

Linux LITMUSRT

8 Real-Time Locking Implementations

PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)

Futex Implementation

Linux LITMUSRT

Choice between futex implementation and better analytical properties

9 Real-Time Locking Implementations

PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)

Futex Implementation

Linux LITMUSRT

Choice between futex implementation and better analytical properties

Why is it challenging to have both?

10 Real-Time Locking Implementations

PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)

Futex Implementation

Reactive Anticipatory

11 Real-Time Locking Protocol Dichotomy

Reactive Locking Protocols Anticipatory Locking Protocols

React to contention on a lock Anticipate problem scenarios only when it occurs. and take measure to minimize priority-inversion (pi) blocking.

12 Real-Time Locking Protocol Dichotomy

Reactive Locking Protocols Anticipatory Locking Protocols

React to contention on a lock Anticipate problem scenarios only when it occurs. and take measure to minimize priority-inversion (pi) blocking.

Tricky to implement futexes Simple futex implementation without violating semantics

13 Contributions

PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)

Futex Implementation

14 Contributions

PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)

Futex Implementation

Design and implementation of futexes for three anticipatory real-time locking protocols

15 Contributions

PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)

Futex Implementation Observed overhead reduction: 92% 85% 84%

Design and implementation of futexes for three anticipatory real-time locking protocols

16 Contributions

PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)

Futex Implementation Observed overhead reduction: 92% 85% 84%

Design and implementation of futexes for three anticipatory real-time locking protocols

Method of using page faults to implement futexes

17 Overview of Talk

18 Overview of Talk

Challenge

Implementation

Evaluation

19 Overview of Talk

Challenge

Implementation

Evaluation

20 Futexes — Overview and Intuition The problem with the vanilla approach

Mutex API

kernel_do_lock(); // critical section kernel_do_unlock();

21 Futexes — Overview and Intuition The problem with the vanilla approach

Kernel invoked on every lock Mutex API and unlock operation

kernel_do_lock(); // critical section kernel_do_unlock();

22 Futexes — Overview and Intuition The problem with the vanilla approach

Kernel invoked on every lock Mutex API and unlock operation

kernel_do_lock(); // critical section kernel_do_unlock(); Task Task Task

Lock State Kernel

23 Futexes — Overview and Intuition Exporting Lock State to Userspace Processes

Task Task Task

Lock State Kernel

Shared lock state between userspace and kernel

24 Futexes — Overview and Intuition Exporting Lock State to Userspace Processes

Task Task Task Futex Pseudocode try_to_acquire_lock(lock); Lock State if (lock is contended) Kernel kernel_do_lock(lock); // critical section release_lock(lock); if (there are blocked tasks) kernel_do_unlock();

25 Futexes — Overview and Intuition Fast-Path for Uncontended Operations

Task Task Task Futex Pseudocode try_to_acquire_lock(lock); Lock State if (lock is contended) Kernel kernel_do_lock(lock); // critical section release_lock(lock); if (there are blocked tasks) Fast-path when operation is kernel_do_unlock(); uncontended (kernel not invoked)

26 Futexes — Overview and Intuition Slow-Path for Contended Operations

Task Task Task Futex Pseudocode try_to_acquire_lock(lock); Lock State if (lock is contended) Kernel kernel_do_lock(lock); // critical section release_lock(lock); if (there are blocked tasks) Kernel invoked only when lock or kernel_do_unlock(); unlock operation is contended

27 Priority Inheritance Protocol Futexes for reactive locking protocols

HI Lock 2

LO Lock 1 Lock 2 …

Time

28 Fixed-Priority Priority Inheritance Protocol Futexes for reactive locking protocols

Uncontended No kernel intervention

HI Lock 2

LO Lock 1 Lock 2 …

Time

29 Fixed-Priority Scheduling Priority Inheritance Protocol Futexes for reactive locking protocols

Contended Kernel intervention needed Uncontended No kernel intervention

HI Lock 2

LO Lock 1 Lock 2 …

Time

30 Fixed-Priority Scheduling Priority Ceiling Protocol (PCP)

• Each lock has a priority ceiling — the highest priority of all tasks that can acquire that lock.

• System ceiling — Currently held lock with highest ceiling

• A lock may be acquired by a task only if:

• Task priority > System Ceiling Priority!

• Or the task is holding the system ceiling!

• When under pi-blocking, priority inheritance raises priority of lower priority task to that of higher priority task.

31 Priority Ceiling Protocol (PCP)

• Each lock has a priority ceiling — the highest priority of all tasks that can acquire that lock.

• System ceiling — Currently held lock with highest ceiling

• A lock may be acquired by a task only if:

• Task priority > System Ceiling Priority!

• Or the task is holding the system ceiling!

• When under pi-blocking, priority inheritance raises priority of lower priority task to that of higher priority task.

32 Priority Ceiling Protocol (PCP)

• Each lock has a priority ceiling — the highest priority of all tasks that can acquire that lock.

• System ceiling — Currently held lock with highest ceiling

• A lock may be acquired by a task only if:

• Task priority > System Ceiling Priority!

• Or the task is holding the system ceiling!

• When under pi-blocking, priority inheritance raises priority of lower priority task to that of higher priority task.

33 Priority Ceiling Protocol (PCP)

• Each lock has a priority ceiling — the highest priority of all tasks that can acquire that lock.

• System ceiling — Currently held lock with highest ceiling

• A lock may be acquired by a task only if:

• Task priority > System Ceiling Priority!

• Or the task is holding the system ceiling!

• When under contention, priority inheritance (raises priority of lower priority task to that of higher priority task).

34 PCP Futexes — Challenge

HI L1

MED L2

LO L1

Time

35 Fixed-Priority Scheduling PCP Futexes — Challenge

HI L1

MED L2

LO L1

Time Raises system ceiling prio. to HI Lowers system ceiling prio. to None

36 Fixed-Priority Scheduling PCP Futexes — Challenge

Valid acquisition HI L1 MED > System Ceiling Prio.

MED L2

LO L1

Time Raises system ceiling prio. to HI Lowers system ceiling prio. to None

37 Fixed-Priority Scheduling PCP Futexes — Challenge

HI L1 HI L1

MED L2 MED L2

LO L1 LO L1

Time Time

38 Fixed-Priority Scheduling PCP Futexes — Challenge

HI L1 HI L1

MED L2 MED L2

LO L1 LO L1

Time Time Raises system ceiling prio. to HI Preemption Sys. ceiling prio. = HI

39 Fixed-Priority Scheduling PCP Futexes — Challenge

L2 free But invalid acquisition HI L1 HI L1 Sys. Ceiling Prio. > MED

MED L2 MED L2

LO L1 LO L1

Time Time Raises system ceiling prio. to HI Preemption Sys. ceiling prio. = HI

40 Fixed-Priority Scheduling PCP Futexes — Challenge

L2 free But invalid acquisition HI L1 HI L1 Sys. Ceiling Prio. > MED

MED L2 MED L2

LO L1 LO L1

Problem — Simply relying on lock state (as in reactive case) may violate PCP semantics. Must consider system ceiling. Time Time Raises system ceiling prio. to HI Preemption Sys. ceiling prio. = HI

41 Fixed-Priority Scheduling PCP Futexes — Challenge

L2 free But invalid acquisition HI L1 HI L1 Sys. Ceiling Prio. > MED

MED L2 MED L2

LO L1 LO L1

Problem — Simply relying on lock state (as in reactive case) may violate PCP semantics. Must consider system ceiling. Time Time Raises system ceiling prio. to HI Problem — system ceiling potentially updated at everyPreemption lock acquisition and release — must always invoke Sys.kernel ceiling to avoid prio. = HI protocol violations

42 Fixed-Priority Scheduling Overview of Talk

Challenge

Implementation

Evaluation

43 Overview of Talk

Challenge

Implementation

Evaluation

44 PCP Futexes — Our Approach

PCP Futex Pseudocode if (lock is contended) kernel_do_lock(lock); else acquire_lock_locally(lock) // critical section release_lock(lock); if (there are blocked tasks) kernel_do_unlock();

45 PCP Futexes — Our Approach

Whether lock acquisition is contended PCP Futex Pseudocode is determined in PCP by the current system ceiling if (lock is contended) kernel_do_lock(lock); else acquire_lock_locally(lock) // critical section release_lock(lock); if (there are blocked tasks) kernel_do_unlock();

46 PCP Futexes — Our Approach

Whether lock acquisition is contended PCP Futex Pseudocode is determined in PCP by the current system ceiling if (lock is contended) kernel_do_lock(lock); The presence of blocked tasks is else determined by the size of the wait- acquire_lock_locally(lock) queue corresponding to the lock. // critical section release_lock(lock); if (there are blocked tasks) kernel_do_unlock();

47 PCP Futexes — Our Approach

Whether lock acquisition is contended PCP Futex Pseudocode is determined in PCP by the current system ceiling if (lock is contended) kernel_do_lock(lock); The presence of blocked tasks is else determined by the size of the wait- acquire_lock_locally(lock) queue corresponding to the lock. // critical section release_lock(lock); if (there are blocked tasks) Cannot change while a task is kernel_do_unlock(); scheduled on a uniprocessor!

48 PCP Futexes — Deferred Updates

HI Scheduling Interval

LO

Time

Invoke kernel

49 PCP Futexes — Deferred Updates

HI Scheduling Interval

LO

Time

Predetermine contention Make changes to local Update kernel state based conditions and copy of lock states (if on copy at context switch communicate them to task uncontended)

50 PCP Futexes — Deferred Updates

HI Scheduling Interval

LO

Time

Predetermine contention Make changes to local Update kernel state based conditions and copy of lock states (if on copy at context switch communicate them to task uncontended)

Requires a bidirectional communication channel between tasks and the kernel

51 Bidirectional Communication — Lock Page

bitmap: lock_states

bool: tasks_blocked

Lock Page Per- page of memory writable by both the process and kernel.

52 Bidirectional Communication — Lock Page

Lock-state bitmap used to communicate acquisitions and releases of locks to kernel

bitmap: lock_states

bool: tasks_blocked

Lock Page Per-process page of memory writable by both the process and kernel.

53 Bidirectional Communication — Lock Page

Lock-state bitmap used to communicate acquisitions and releases of locks to kernel

bitmap: lock_states Boolean set by kernel to indicate the bool: tasks_blocked presence of blocked tasks (indicates that release operation is contended) Lock Page Per-process page of memory writable by both the process and kernel.

54 Bidirectional Communication — Lock Page

Lock-state bitmap used to communicate acquisitions and releases of locks to kernel

bitmap: lock_states Boolean set by kernel to indicate the bool: tasks_blocked presence of blocked tasks (indicates that release operation is contended) Lock Page Per-process page of memory writable by both the process and A “permission bit” set by the kernel to kernel. indicate whether lock acquisition is allowed (system ceiling check)

55 Communicating the Permission Bit

PCP Futex Pseudocode if (permission bit is set) set_bit_in_bitmap(lock) else kernel_do_lock(lock); // critical section release_lock(lock); if (tasks_blocked is set) kernel_do_unlock();

56 Communicating the Permission Bit

PCP Futex Pseudocode Challenge: if (permission bit is set) set_bit_in_bitmap(lock) Checking for permission and setting else lock bit must be atomic. kernel_do_lock(lock); // critical section Why? Preemption between them release_lock(lock); may change the permission. if (tasks_blocked is set) kernel_do_unlock();

57 Communicating the Permission Bit

We proposed two approaches for the Classic PCP

Page Faults (PCP-DU-PF) CMPXCHG (PCP-DU-BOOL)

Uses page faults to invoke Boolean for permission bit. kernel when locking is prohibited Compare-and-exchange operation to check and acquire locks atomically

58 Communicating the Permission Bit

We proposed two approaches for the Classic PCP

Page Faults (PCP-DU-PF) CMPXCHG (PCP-DU-BOOL)

Uses page faults to invoke See paper for ! Boolean kernel when locking is details! prohibited Compare-and-exchange operation to check and acquire locks atomically PCP Futexes — Page Fault Approach

HI

LO

Time

If (locking is allowed) Attempting to write lock bit will set lock page writable invoke kernel automatically if else locking is not permitted set lock page unwritable

60 PCP Futexes — Page Fault Approach

Can be implemented using segmentation faults as well

Lock code-path is entirely branch-free HI

LO

Time

If (locking is allowed) Attempting to write lock bit will set lock page writable invoke kernel automatically if else locking is not permitted set lock page unwritable

61 PCP Futexes — Page Fault Approach

Can be implemented using segmentation faults as well

Lock code-path is entirely branch-free HI

Requires an optimized page-fault handler LO

Time

If (locking is allowed) Attempting to write lock bit will set lock page writable invoke kernel automatically if else locking is not permitted set lock page unwritable

62 Communicating the Permission Bit

We proposed two approaches for the Classic PCP

Page Faults (PCP-DU-PF) CMPXCHG (PCP-DU-BOOL)

Uses page faults to invoke Boolean for permission bit. kernel when locking is prohibited Compare-and-exchange operation to check and acquire locks atomically

Lock code-path is entirely Avoid page faults at cost of branch-free atomic op + branch

63 Overview of Talk

Challenge

Implementation

Evaluation

64 Overview of Talk

Challenge

Implementation

Evaluation

65 Evaluation — Platform

Boundary Devices Sabre Lite Board Freescale I.MX6Q Quad Core SoC ARM Cortex A9 (1 GHz) Implemented in LITMUSRT 2013.1 (based on Linux 3.10.5)

66 Evaluation — Benchmarking Methodology

Test program locks and unlocks once every period Uniprocessor tests: 5 threads Randomly assigned parameters for threads Critical section length — 25-45µs Execution time — 25-65µs Period — 600-800µs Measured using processor timestamp counters

67 Evaluation — Benchmarking Methodology

Test program locks and unlocks once every period Uniprocessor tests: 5 threads Randomly assigned parameters for threads Critical section length — 25-45µs Demanding workload — Execution time — 25-65µs thousands of operations per second Period — 600-800µs Measured using processor timestamp counters

68 PCP Futex Evaluation — Baseline

We compare all futex implementations against the non-futex PCP implementation in LITMUSRT

PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)

Futex Implementation

Linux LITMUSRT

69 PCP Futex Evaluation — Baseline

PRIO_PROTECT (Linux PCP) PCP (LITMUS^RT)

33,271

Overhead in Cycles 9,027 6,655 3,850 1,604 645 1,221 1,211

Lock (avg) Lock (max) Unlock (avg) Unlock (max)

70 Overhead in Cycles PCP FutexEvaluation—Baseline 1,604 Lock (avg) PRIO_PROTECT (LinuxPCP) 645 33,271 Lock (max) 3,850 Our baselinePCPimplementation 71 incurs loweroverheads 1,221 Unlock (avg) PCP (LITMUS^RT) 1,211 9,027 Unlock (max) thanLinux’s. 6,655 already PCP Futex Evaluation — Uncontended Case

How much reduction do we see in average uncontended-case overheads?

72 PCP FutexEvaluation—UncontendedCase

Overhead (compared to baseline) PCP (LITMUS^RT) 100% PCP (LITMUS Baseline 100% RT Lock (average) ) Futex (Page-Faults) 25% (Page Faults) PCP Futex 7% 73 Based on 6.6 million samples per protocol Futex (CMPXCHG) 27% (CMPXCHG) PCP Futex Unlock (average) 8% PRIO_INHERIT (Linux) (PRIO_INHERIT) 41% Linux Futex 25% PCP Futex Evaluation — Uncontended Case

Lock (average) Unlock (average)

100% 100% Our PCP futex implementation overheads are significantly lower than Linux’s PIP futex implementation.

41%

25% 27% 25% Overhead (compared to baseline) Overhead (compared 7% 8% PCPPCP (LITMUS^RT) (LITMUSRT) FutexPCP (Page-Faults) Futex FutexPCP (CMPXCHG) Futex PRIO_INHERITLinux Futex (Linux) Baseline (Page Faults) (CMPXCHG) (PRIO_INHERIT)

74Based on 6.6 million samples per protocol PCP Futex Evaluation —Contended Case

How much additional overhead do we incur in the maximum contended-case overheads?

75 PCP Futex Evaluation —Contended Case

Lock (Max) Unlock (Max)

186% 167%

137%

107% 100% 100% 106% 106%

PCPPCP (LITMUS^RT) (LITMUSRT) PCP-DU-PFPCP Futex PCP-DU-BOOLPCP Futex PRIO_INHERITLinux Futex (Linux) Baseline (Page Faults) (CMPXCHG) (PRIO_INHERIT)

76 Based on 900K samples per protocol PCP Futex Evaluation —Contended Case

Lock (Max) Unlock (Max)

Linux page-fault handler 186% not optimized for real-time 167% locking protocol implementation. 137%

107% 100% 100% 106% 106%

PCPPCP (LITMUS^RT) (LITMUSRT) PCP-DU-PFPCP Futex PCP-DU-BOOLPCP Futex PRIO_INHERITLinux Futex (Linux) Baseline (Page Faults) (CMPXCHG) (PRIO_INHERIT)

77 Based on 900K samples per protocol PCP Futex Evaluation —Contended Case

Lock (Max) Unlock (Max)

186% 167%

137%

107% 100% 100% 106% 106%

PCP-DU-PF PCPPCP (LITMUS^RT) (LITMUSRT) PCP-DU-BOOLPCP Futex PRIO_INHERITLinux Futex (Linux) Baseline (CMPXCHG) (PRIO_INHERIT)

78 Based on 900K samples per protocol PCP Futex Evaluation —Contended Case

Lock (Max) Unlock (Max)

Up to 37% 186%increase in worst-case contended overheads as a result of 167% the futex approach. 137%

107% 100% 100% 106% 106%

PCP-DU-PF PCPPCP (LITMUS^RT) (LITMUSRT) PCP-DU-BOOLPCP Futex PRIO_INHERITLinux Futex (Linux) Baseline (CMPXCHG) (PRIO_INHERIT)

79 Based on 900K samples per protocol PCP Futex Evaluation —Contended Case

Lock (Max) Unlock (Max)

Up to 37% 186%increase in worst-case contended overheads as a result of 167% the futex approach. 137% But no worse than Linux’s futex 107% implementation 100% 100% 106% 106%

PCP-DU-PF PCPPCP (LITMUS^RT) (LITMUSRT) PCP-DU-BOOLPCP Futex PRIO_INHERITLinux Futex (Linux) Baseline (CMPXCHG) (PRIO_INHERIT)

80 Based on 900K samples per protocol PCP Futex Evaluation — Summary

Up to 92% reduction in average uncontended overheads at a cost of 37% increase in maximum contended overheads (no worse than Linux).

Page faults can be used to implement futexes for anticipatory real-time locking protocols.

81 Summary

PRIO_INHERIT! PRIO_PROTECT Classic Priority (Priority (Immediate Multiprocessor Ceiling Protocol! FMLP Inheritance Priority Ceiling PCP (MPCP) (PCP) Protocol) Protocol)

Futex Implementation Observed overhead reduction: 92% 85% 84%

Design and implementation of futexes for three anticipatory real-time locking protocols

Method of using page faults to implement futexes

82 Thanks!

Linux Testbed for Multiprocessor Scheduling in Real-Time Systems

All source code available at: http://www.litmus-rt.org/

83