Priority Boosting Preemptible RCU

Priority Boosting Preemptible RCU Paul E. McKenney Linux Technology Center IBM Beaverton [email protected] [email protected] ABSTRACT overhead [3].1 Read-copy update (RCU) is a light-weight synchronization Any statement that is not within an RCU read-side crit- mechanism that has been used in production for well over a ical section is a quiescent state, and a quiescent state that decade, most recently, as part of the Linux kernel. The key persists for a significant time period is called an extended concept behind RCU is the ability of RCU update-side prim- quiescent state. Any time period during which each thread itives (synchronize_rcu() and call_rcu()) to wait on pre- has occupied at least one quiescent state is a grace period. existing RCU read-side critical sections, which are delimited The synchronize_rcu() primitive waits for a grace period by rcu_read_lock() and rcu_read_unlock(). The time re- to elapse. Updates that cannot block may use an asyn- quired for all pre-existing RCU read-side critical sections to chronous primitive named call_rcu(), which causes a spec- complete is termed an RCU grace period. A common usage ified function to be invoked with a specified argument at pattern is to remove an element from a data structure, wait the end of a subsequent grace period. In some (but not for an RCU grace period to elapse, then free that element. all) production-quality implementations, call_rcu() sim- Older implementations of RCU operated by suppressing ply appends a callback to a per-thread list, and is therefore preemption across the RCU read-side critical sections, but wait-free [1, 8]. Nevertheless, RCU updates do incur some more recent implementations designed for real-time use per- overhead, so that RCU is best-suited for read-mostly situa- mit such preemption. This can lead to a priority-inversion tions. problem, where a low-priority non-real-time task is preempted Taken together, these four primitives implement RCU’s within an RCU read-side critical section by several medium- grace-period guarantee: a given grace period is guaranteed priority real-time tasks (at least one per CPU). This situa- to extend past the end of any pre-existing RCU read-side tion prevents any subsequent RCU grace periods from com- critical section [9]. It is important to note that RCU pro- pleting, which prevents the corresponding memory from be- vides this guarantee regardless of the memory model of the ing freed, which can exhaust memory, which can block a underlying computer system. high-priority real-time task that is attempting to allocate On weakly ordered systems (a category including all com- memory. modity microprocessors), RCU also provides a publish-sub- This situation is similar to lock-based priority inversion, scribe guarantee via the rcu_assign_pointer() publication and, as with lock-based priority inversion, and can be solved and rcu_dereference() subscription primitives. These prim- by temporarily boosting the priority of the low-priority task itives disable any compiler and CPU optimizations that might that was blocked in an RCU read-side critical section. otherwise result in an RCU reader seeing a pre-initialized view of a newly published data structure. Both of these primitives have O(1) computational complexity with small constant, and incur zero overhead on sequentially consistent computer systems, where “system” includes the compiler. 1. INTRODUCTION Although older implementations of RCU relied on dis- RCU is a synchronization mechanism that allows exe- abling preemption across all RCU read-side critical sections, cution to be deferred until all potentially conflicting op- more recent implementations have permitted these critical erations have completed, which greatly simplifies the de- sections to be preempted in order to improve real-time schedul- sign and implementation of concurrent algorithms [5]. Al- ing latency [3, 6, 7]. As noted earlier, such preemption though RCU antecedents date back to 1980 [4], RCU at- opens the door to a priority-inversion situation where a low- tained widespread use only after its acceptance into the priority task is preempted in an RCU read-side critical sec- Linux kernel in 2002, where it has since become quite heav- tion by medium-priority real-time tasks, preventing any sub- ily used, as shown in Figure 1. RCU’s popularity stems sequent RCU grace periods from ever completing. If grace from its solution to the “existence problem” [2]. The po- periods never complete, the corresponding memory is never tentially conflicting operations, termed RCU read-side criti- freed, eventually running the system out of memory. The cal sections, are bracketed with rcu_read_lock() and rcu_ resulting hang can be expected to block even the highest- read_unlock() primitives. Production-quality implemen- priority real-time tasks. tations of these primitives scale linearly, are wait-free, are @@@ Roadmap immune to both deadlock and livelock, and incur extremely low overheads. In fact, in server-class (CONFIG PREEMPT=n) 1Sequent’s DYNIX/ptx operating system also provided Linux-kernel builds, these two primitives incur exactly zero zero-overhead RCU read-side primitives [8]. 4000 3500 Kernel Parameter Module Parameter 3000 Priority 2500 2000 1500 RCU_SOFTIRQ RCU_BOOST_PRIO rcu_boost_prio Priority # RCU API Uses 1000 2.6 A -1 -1 Priority A from RCU_SOFTIRQ priority 500 A -1 C Priority C from rcu_boost_prio 2.5 A B -1 Priority B from RCU_BOOST_PRIO 0 A B C Priority C from rcu_boost_prio X -1 -1 RT Priority 1 by default 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 X -1 C Priority C from rcu_boost_prio X B -1 Priority B from RCU_BOOST_PRIO Year X B C Priority C from rcu_boost_prio Table 1: Relationship of Priority Defaults blocked RCU readers are to be boosted. A value of zero specifies no boosting. If the value is -1, then the Figure 1: RCU API Usage in the Linux Kernel default is taken from the BOOST_RCU_PRIO kernel parameter above. If the value is neither zero nor a valid 2. CONTROL OF BOOSTING real-time scheduler priority, then it is treated as if it was -1, though a warning will be printed in this case. The operation of RCU priority boosting is controlled by This parameter may also be controlled at runtime via the following: sysfs. 1. The new BOOST_RCU kernel configuration parameter, 5. The new BOOST_RCU_DELAY kernel parameter, which which depends on RT_MUTEXES (default off, at least specifies the number of jiffies to wait after a given grace initially). This dependency is an implementation con- period begins before doing RCU priority boosting for straint rather than a policy decision, as in absence of blocked tasks that are stalling that grace period. A RT_MUTEXES the priority-boosting scheduler infrastruc- value of -1 says never to do RCU priority boosting ture is not compiled into the kernel. Therefore, in ab- (but this may be overridden at boot time and at run sence of RT_MUTEXES, BOOST_RCU simply cannot do its time). BOOST_RCU_DELAY depends on BOOST_RCU. job. There is also a dependency on PREEMPT_RCU by de- sign, given that there is no reason to boost priority for 6. The new rcu_boost_delay module parameter in kernel/ a non-preemptible RCU implementation. The default rcupdate.c, which also controls the RCU boost de- value of BOOST_RCU will be “n” in order to avoid an lay. A value of -1 says to take the default from the automatic Linus rejection. If experience indicates that BOOST_RCU_DELAY kernel parameter, though any other enabling BOOST_RCU by default is wise, that change will negative value will have the same effect (but possibly be made in a later release of the kernel. accompanied by a warning). This parameter may also be controlled at runtime via sysfs. 2. The priority that the RCU_SOFTIRQ task runs at, but only in the -rt patchset, at least until PREEMPT_SOFTIRQS The relationship between the RCU_SOFTIRQ priority, the reaches mainline. In kernels lacking PREEMPT_SOFTIRQS, RCU_BOOST_PRIO kernel parameter, and the rcu_boost_prio the priority instead defaults to the least-important real- module parameter is involved, so their relationship is shown time priority. by Table 1. A letter A, B, or C in the first column indicates some real-time priority (which currently ranges from 1 to 3. The new BOOST_RCU_PRIO kernel configuration param- 99 in the Linux kernel), while the letter X indicates a kernel eter, which depends on BOOST_RCU. This specifies the without threaded softirqs.2 The “Priority” column indicates default priority to which blocked RCU readers are to what controls the resulting task priority. be boosted. A value of zero specifies no boosting. 2 As of this writing, mainline Linux does not permit threaded If the value is -1, then the default is taken from the softirqs, aside from the non-realtime run_ksoftirqd() task RCU_SOFTIRQ priority above. BOOST_RCU_PRIO depends that is used to handle overflow from the softirq environment. on BOOST_RCU. However, PREEMPT_RT kernels that include the -rt patchset provide a PREEMPT_SOFTIRQS kernel parameter that causes 4. The new rcu_boost_prio module parameter in kernel/ softirqs to be executed in the context of a real-time thread rcupdate.c, which also controls the priority to which whose priority may be controlled at run time. 3. NEW DATA STRUCTURES 6. Add a ->boostwq field to the rcu_preempt_ctrlblk Although RCU priority boosting does not introduce any structure, which is initialized using the DECLARE_WAIT_ new data structures to RCU, it does add fields and values to QUEUE_HEAD() macro. This wait queue is where the a number of the existing data structures under BOOST_RCU booster kernel thread blocks when there is no boosting ifdef as follows: to be done. 1. Add an ->rcu_prio field to the task_struct struc- The TREE_PREEMPT_RCU implementation additionally needs ture, parallel to the existing prio, static_prio, and the same additional fields that TINY_PREEMPT_RCU does, but normal_prio fields.

Priority Boosting Preemptible RCU

Last Time Today Response Time Vs. RM Computing Response Time More Second-Priority Task I Response Time

Preemption-Based Avoidance of Priority Inversion for Java

Real-Time Operating Systems (RTOS)

Real-Time Operating System (RTOS)

Guaranteeing Real-Time Performance Using Rate Monotonic Analysis

RTOS Northern Real Time Applications (NRTA)

Multitasking and Real-Time Scheduling

Lock-Free Programming

Generalized Rate-Monotonic Scheduling Theory: a Framework for Developing Real-Time Systems

Comparison of CPU Scheduling in Vxworks and Lynxos

Real-Time and Embedded Operating Systems What Is Real Time?

Alleviating Priority Inversion and Non-Determinism in Real-Time CORBA ORB Core Architectures 1 Introduction 2 Real-Time ORB Core