RCU Usage in the Linux Kernel: One Decade Later
Paul E. McKenney, Linux Technology Center, IBM Beaverton
Silas Boyd-Wickizer, MIT CSAIL
Jonathan Walpole, Computer Science Department, Portland State University

Abstract

Read-copy update (RCU) is a scalable high-performance synchronization mechanism implemented in the Linux kernel. RCU's novel properties include support for concurrent reading and writing, and highly optimized inter-CPU synchronization. Since RCU's introduction into the Linux kernel over a decade ago its usage has continued to expand. Today, most kernel subsystems use RCU. This paper discusses the requirements that drove the development of RCU, the design and API of the Linux RCU implementation, and how kernel developers apply RCU.

1 Introduction

The first Linux kernel to include multiprocessor support is not quite 20 years old. This kernel provided support for concurrently running applications, but serialized all execution in the kernel using a single lock. Concurrently executing applications that frequently invoked the kernel performed poorly.

Today the single kernel lock is gone, replaced by highly concurrent kernel subsystems. Kernel-intensive applications that would have performed poorly on dual-processor machines 15 years ago now scale and perform well on multicore machines with many processors [3].

Kernel developers have used a variety of techniques to improve concurrency, including fine-grained locks, lock-free data structures, per-CPU data structures, and read-copy-update (RCU), the topic of this paper. Uses of the RCU API have increased from none in 2002 to over 6500 in 2013 (see Figure 1). Most major Linux kernel subsystems use RCU as a synchronization mechanism. Linus Torvalds characterized a recent RCU-based patch to the virtual file system "as seriously good stuff" because developers were able to use RCU to remove bottlenecks affecting common workloads [25]. RCU is not unique to Linux (see [7, 13, 19] for other examples), but Linux's wide variety of RCU usage patterns is, as far as we know, unique among the commonly used kernels. Understanding RCU is now a prerequisite for understanding the Linux implementation and its performance.

The success of RCU is, in part, due to its high performance in the presence of concurrent readers and updaters. The RCU API facilitates this with two relatively simple primitives: readers access data structures within RCU read-side critical sections, while updaters use RCU synchronization to wait for all pre-existing RCU read-side critical sections to complete. When combined, these primitives allow threads to concurrently read data structures, even while other threads are updating them.

This paper describes the performance requirements that led to the development of RCU, gives an overview of the RCU API and implementation, and examines how kernel developers have used RCU to optimize kernel performance. The primary goal is to provide an understanding of the RCU API and how to apply it.

The remainder of the paper is organized as follows. Section 2 explains the important requirements for production-quality RCU implementations. Section 3 gives an overview of RCU's API and overall design. Section 4 introduces a set of common usage patterns that cover most uses of RCU, illustrating how RCU is used in Linux. In some cases its use as a replacement for existing synchronization mechanisms introduces subtle semantic problems. Section 5 discusses these problems and commonly applied solutions. Section 6 highlights the importance of RCU by documenting its use in Linux over time and by specific subsystems. Section 7 discusses related work and Section 8 presents conclusions.
Figure 1: The number of uses of the RCU API in Linux kernel code from 2002 to 2014.

2 RCU Requirements

RCU fulfills three requirements dictated by the kernel: (1) support for concurrent readers, even during updates; (2) low computation and storage overhead; and (3) deterministic completion time. The first two are performance-related requirements, while the third is important for real-time response and software engineering reasons. This section describes the three requirements and the next two sections describe how RCU is designed to fulfill these requirements and how kernel developers use RCU.

The primary RCU requirement is support for concurrent reading of a data structure, even during updates. The Linux kernel uses many data structures that are read and updated intensively, especially in the virtual file system (VFS) and in networking. For example, the VFS caches directory entry metadata – each known as a dentry – for recently accessed files. Every time an application opens a file, the kernel walks the file path and reads the dentry for each path component out of the dentry cache. Since applications might access many files, some only once, the kernel is frequently loading dentrys into the cache and evicting unused dentrys. Ideally, threads reading from the cache would not interfere with each other or be impeded by threads performing updates.

The second RCU requirement is low space and execution overhead. Low space overhead is important because the kernel must synchronize access to millions of kernel objects. On a server in our lab, for example, the kernel caches roughly 8 million dentrys. An overhead of more than a few bytes per dentry is an unacceptable storage overhead.

Low execution overhead is important because the kernel accesses data structures frequently using extremely short code paths. The SELinux access vector cache (AVC) [21] is an example of a performance-critical data structure that the kernel might access several times during a system call. In the absence of spinning to acquire a lock or waiting to fulfill cache misses, each read from the AVC takes several hundred cycles. Incurring a single cache miss, which can cost hundreds of cycles, would double the cost of accessing the AVC. RCU must coordinate readers and updaters in a way that provides low overhead synchronization in the common case.

A third requirement is deterministic completion times for read operations. This is critical to real-time response [11], but also has important software-engineering benefits, including the ability to use RCU within non-maskable interrupt (NMI) handlers. Accessing shared data within NMI handlers is tricky, because a thread might be interrupted within a critical section. Using spin locks can lead to deadlock, and lock-free techniques utilizing optimistic concurrency control lead to non-deterministic completion times if the operation must be retried many times.

Related synchronization primitives, such as read-write locks, Linux local-global locks, and transactional memory, do not fulfill the requirements discussed here. None of these primitives provide concurrent read and write operations to the same data. They all impose a storage overhead. Even the storage overhead for a read-write lock, which is a single integer, is unacceptable for some cases. Read-write locks and local-global locks use expensive atomic instructions or memory barriers during acquisition and release.

The next section describes an RCU design and API that provides concurrent reads and writes, low space and execution overhead, and deterministic execution times.

3 RCU Design

RCU is a library for the Linux kernel that allows kernel subsystems to synchronize access to shared data in an efficient manner.
The core of RCU is based on two primitives: RCU read-side critical sections, which we will refer to simply as RCU critical sections, and RCU synchronization. A thread enters an RCU critical section by calling rcu_read_lock and completes an RCU critical section by calling rcu_read_unlock. A thread uses RCU synchronization by calling synchronize_rcu, which guarantees not to return until all the RCU critical sections executing when synchronize_rcu was called have completed. synchronize_rcu does not prevent new RCU critical sections from starting, nor does it wait for RCU critical sections that were started after synchronize_rcu was called.

Developers can use RCU critical sections and RCU synchronization to build data structures that allow concurrent reading, even during updates. To illustrate one possible use, consider how to safely free the memory associated with a dentry when an application removes a file. One way to implement this is for a thread to acquire a pointer to a dentry only from the directory cache and to always execute in an RCU critical section when manipulating a dentry. When an application removes a file, the kernel, possibly in parallel with dentry readers, removes the dentry from the directory cache, then calls synchronize_rcu to wait for any readers that might still be manipulating the dentry, and finally frees the dentry.
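To make this pattern concrete, the sketch below shows a reader and an updater coordinating through these primitives. It is an illustrative sketch rather than actual kernel source: struct my_dentry, shared_dentry, dentry_name_matches, and remove_dentry are hypothetical names, and the sketch additionally uses the kernel's rcu_assign_pointer, rcu_dereference, and rcu_dereference_protected helpers for publishing and reading RCU-protected pointers.

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/types.h>

struct my_dentry {
    const char *name;
    /* ... other fields ... */
};

/* RCU-protected global pointer; updates are serialized by the caller. */
static struct my_dentry __rcu *shared_dentry;

/* Reader: may run concurrently with remove_dentry(). */
static bool dentry_name_matches(const char *name)
{
    struct my_dentry *d;
    bool match = false;

    rcu_read_lock();                        /* enter RCU critical section */
    d = rcu_dereference(shared_dentry);     /* read the protected pointer */
    if (d)
        match = strcmp(d->name, name) == 0;
    rcu_read_unlock();                      /* complete the critical section */

    return match;
}

/* Updater: unpublish the dentry, wait for pre-existing readers, then free. */
static void remove_dentry(void)
{
    /* Caller is assumed to serialize updates (e.g., with a spinlock). */
    struct my_dentry *old = rcu_dereference_protected(shared_dentry, true);

    rcu_assign_pointer(shared_dentry, NULL);  /* new readers see no dentry */
    synchronize_rcu();                        /* wait for pre-existing readers */
    kfree(old);                               /* now safe to free */
}

The updater never blocks readers: it unpublishes the pointer, uses synchronize_rcu to wait only for critical sections that might still hold the old pointer, and frees the memory afterwards, which is the dentry lifecycle sketched above.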
void rcu_read_lock()
{
    preempt_disable[cpu_id()]++;
}

void rcu_read_unlock()
{
    preempt_disable[cpu_id()]--;
}

void synchronize_rcu(void)
{
    for_each_cpu(int cpu)
        run_on(cpu);
}

Figure 2: A simplified version of the Linux RCU implementation.

Figure 2 shows a simplified version of the Linux RCU implementation. rcu_read_lock disables preemption on the current CPU and rcu_read_unlock re-enables it, so a CPU cannot context switch while one of its threads is inside an RCU critical section. Consequently, once each CPU executes a context switch, the thread calling synchronize_rcu briefly executes on every CPU, and by that point all pre-existing RCU critical sections must have completed. Notice that the cost of synchronize_rcu is independent of the number of times threads execute rcu_read_lock and rcu_read_unlock.

In practice Linux implements synchronize_rcu by waiting for all CPUs in the system to pass through a context switch, instead of scheduling a thread on each CPU. This design optimizes the Linux RCU implementation for low-cost RCU critical sections, but at the cost of delaying synchronize_rcu callers longer than necessary. In principle, a writer waiting for a particular reader need only wait for that reader to complete an RCU critical section. The reader, however, must communicate to the writer that the RCU critical section is complete. The Linux RCU implementation essentially batches reader-to-writer communication by waiting for context switches. When possible, writers can use call_rcu, an asynchronous version of synchronize_rcu that invokes a specified callback after all CPUs have passed through at least one context switch. The Linux RCU implementation tries to amortize the cost of detecting context switches over many synchronize_rcu and call_rcu operations.
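For illustration, a writer that must not block could use call_rcu instead, as in the hedged sketch below. struct my_dentry and remove_dentry_async are again hypothetical names, while struct rcu_head, call_rcu, and container_of are real kernel facilities.

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct my_dentry {
    const char *name;
    struct rcu_head rcu;    /* lets call_rcu track this object */
};

/* Runs after every CPU has passed through at least one context switch. */
static void free_dentry_rcu(struct rcu_head *head)
{
    struct my_dentry *d = container_of(head, struct my_dentry, rcu);

    kfree(d);
}

/* Updater: unpublish the dentry, then let RCU free it later. */
static void remove_dentry_async(struct my_dentry *d)
{
    /* ... remove d from the directory cache (not shown) ... */
    call_rcu(&d->rcu, free_dentry_rcu);   /* returns immediately */
}

Recent kernels also provide kfree_rcu(d, rcu) as a shorthand for this free-only callback pattern.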