RCU vs. Locking Performance on Different CPUs

Paul E. McKenney
IBM Corporation

Abstract

RCU has provided performance benefits in a number of areas in the Linux 2.6 kernel [MAK+01, LSS02, MSA+02, ACMS03, McK03, MSS04], as well as in several other kernels [HOS89, Jac93, MS98, GKAS99]. Its use has also been proposed in conjunction with a number of specific algorithms [KL80, ML84, Pug90]. This experience has generated a number of useful rules of thumb, analytic comparisons of RCU to other locking techniques [MS98, McK99], and system-wide comparisons of specific RCU patches to the Linux kernel. However, there have not been any measured comparisons of RCU to other locking techniques under a variety of conditions running on different CPUs. This paper fills that gap, comparing RCU to other locking techniques on a number of CPUs using a hash-lookup mini-benchmark. The read intensity, number of CPUs, and number of searches/updates per grace period are varied to gain better insight into which tool is right for a particular job.

1 Introduction

A useful comparison of RCU and other locking primitives requires the following:

1. Repeatability.
2. Relatively short duration to permit testing many combinations.
3. Code sequences that occur in real life.
4. Conservative treatment of the "new kid on the block," in this case, RCU.

Since the Linux kernel contains a very large number of hash tables, hash-table search is a natural choice for a mini-benchmark. To promote repeatability, deletions and insertions are performed in pairs, so that the number of elements in each hash chain remains constant over the long term. Over the short term, of course, one CPU might search the hash table after some other CPU has deleted an element, but before it has re-inserted it.

In order to err on the conservative side, this benchmark is compiled without optimization, which de-emphasizes the cache-thrashing overhead of the locking primitives; comparison with measured memory-latency and pipeline-flush overheads showed that these costs dominated in any case. This benchmark also disregards the possibility of batching RCU updates among multiple hash tables.

The locking methods compared are as follows:

1. Global spinlock (global).
2. Global rwlock (globalrw).
3. Per-bucket spinlock (bkt).
4. Per-bucket rwlock, but modified to avoid starvation (bktrw).
5. brlock.
6. Per-bucket spinlock combined with a per-element reference count (refcnt). Note that this mechanism helps with deadlock avoidance.
7. RCU, which entails lock-free search, with a per-bucket spinlock and RCU guarding updates. Note that this mechanism also helps with deadlock avoidance for read-side accesses.
8. Ideal scaling, taking the performance of searches and updates running without any sort of synchronization and multiplying by the number of CPUs. Once someone achieves ideal scaling on a given workload, that workload is thereafter designated "embarrassingly parallel".

This benchmark is run with from one to four CPUs, varying operation mixtures from read-only to update-only. Note that the single-CPU data still uses locks; see the "ideal" lines for lock-free performance. From this data, we will identify the areas of optimality for each locking primitive. Although these results are quite useful as rules of thumb, they are certainly no substitute for actually running real-world benchmarks before and after making changes!

Finally, in keeping with Linus Torvalds's request at Ottawa last summer that people place a greater focus on user-level code, these benchmarks run at user level. As a result, the names of some of the atomic primitives differ slightly from their kernel equivalents in the code fragments that follow.

Section 2 provides background on RCU. Section 3 describes the hash-table mini-benchmark, including the code for the various locking mechanisms tested. Section 4 displays the measured results, and Section 5 presents summary and conclusions.

2 Background

This section gives a brief overview of RCU; more details are available elsewhere [MS98, MAK+01, MSA+02].

Figure 1: Lock Protecting Deletion and Search

On SMP systems, any searching of or deletion from a linked list must be protected, for example, by a lock. When element B is deleted from the list shown in Figure 1, searching code is guaranteed to see this list in either the initial state (1) or the final state (3). In state (2), when element B is being deleted, the reader-writer lock guarantees that no readers (indicated by the absence of a triangle on element B) will be accessing the list.

However, many lists are searched much more often than they are modified. For example, an IP routing table would normally change at most once per few minutes, but might be searched many thousands of times per second. This could result in well over a million accesses per update, making lock-acquisition overhead burdensome to searches.

Figure 2: Race Between Deletion and Search

Unfortunately, omitting locking when searching means that the update no longer appears to be atomic. Instead, the update takes the multiple steps shown in Figure 2. A search might be referencing element B just as it was freed up, resulting in crashes, or worse, as indicated by the reader referencing nothingness in step (3).

Figure 3: RCU Protecting Deletion and Search

One solution to this problem is to delay freeing up element B until all searches have given up their references to it, as shown in Figure 3. RCU indirectly determines when all references have been given up. To see how this works, recall that there are normally restrictions on what operations may be performed while holding a lock. For example, in the Linux kernel, it is forbidden to do a context switch while holding any spinlock. RCU mandates these same restrictions: even though the RCU-protected search need not acquire any locks, it is forbidden from performing any operation that would be forbidden if it were in fact holding a lock.

Therefore, any CPU that is seen performing a context switch after the linked-list deletion shown in step (2) of Figure 3 cannot possibly hold a reference to element B. As soon as all CPUs have performed a context switch, there can no longer be any readers, as shown in step (3). Element B may then be safely freed, as shown in step (4).

A simple, though inefficient, RCU-based deletion algorithm could perform the following steps in a non-preemptive Linux kernel (preemptive kernels can be handled as well [MSA+02]):

1. Unlink element B from the list, but do not free it. The state of the list will be that shown in step (2) of Figure 3.
2. Run on each CPU in turn. At this point, each CPU has performed one context switch after element B has been unlinked. Thus, there cannot be any more references to element B.
3. Free up element B.

Much more efficient implementations have been described elsewhere [MS98, MSA+02].
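To make these steps concrete, here is a minimal sketch of this simple deletion algorithm. It is only an illustration: run_on(), which forces the caller to run (and therefore to context-switch) on the given CPU, and num_cpus are assumed helpers, not part of the benchmark or of the Linux kernel API.

/* Sketch only: slow RCU-style deletion for a non-preemptive kernel. */
void slow_rcu_delete(struct el *q)
{
        int cpu;

        list_del(&q->list);                   /* Step 1: unlink, but do not free. */
        for (cpu = 0; cpu < num_cpus; cpu++)
                run_on(cpu);                  /* Step 2: force a context switch on each CPU. */
        kfree(q);                             /* Step 3: no CPU can still hold a reference. */
}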

3 Hash-Table Mini-Benchmark

The hash table is a simple array of buckets, with linked-list chains. Separate tasks perform searches and updates in parallel, and a parent task terminates the test after the prescribed duration. Each experiment runs the test five times, and prints the average, maximum, and minimum per-operation times from the five runs, along with the standard deviation. RCU-based tests print the same statistics for the maximum number of hash-table elements that were ever waiting for a grace period to complete.

The following section describes the common code used in the test, and the sections after that describe code specific to each of the locking mechanisms.

3.1 Common Code

Figure 4 shows the code for the generic search functions, which is quite straightforward. The _search() function is used by mechanisms such as RCU and brlock that do not use a per-hash-chain lock, while _search_bucket() is used by mechanisms that do use a per-hash-chain lock. The KEY2CHAIN() macro provides a simple modulus hash function.

 1 struct el *
 2 _search_bucket(struct list_head *p, long key)
 3 {
 4   struct list_head *lep;
 5   struct el *q;
 6
 7   list_for_each(lep, p) {
 8     q = list_entry(lep, struct el, list);
 9     if (q->key == key) {
10       return (q);
11     }
12   }
13   return (NULL);
14 }
15
16 struct el *
17 _search(long key)
18 {
19   struct list_head *p;
20
21   p = KEY2CHAIN(key);
22   return (_search_bucket(p, key));
23 }

Figure 4: Search Functions
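For concreteness, the hashing and indirection macros might look roughly as follows. This is a sketch based on how the macros are used in the figures, not code taken from the benchmark; in particular, the bucket structure, the NBUCKETS constant, and the lock-type names are assumptions.

struct bucket {
        spinlock_t bktlock;        /* per-bucket lock (bkt, refcnt, and RCU variants) */
        struct list_head chain;    /* hash chain of struct el */
};

struct bucket buckets[NBUCKETS];

/* Simple modulus hash from key to bucket and to chain. */
#define KEY2BUCKET(key) (&buckets[(key) % NBUCKETS])
#define KEY2CHAIN(key)  (&KEY2BUCKET(key)->chain)

/* One level of indirection selects the locking mechanism under test,
 * here shown for the per-bucket-spinlock build: */
#define SEARCHSHD(key)  searchshd(key)
#define SEARCHX(key)    searchx(key)
#define INSERT(q)       insert(q)
#define DELETE(q)       delete(q)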
Figure 5 shows the test loop. The keys of the elements in the hash table are consecutive integers, and each task increments its key by a different odd prime on each search or update, in order to avoid "convoy" effects. The INSERT(), SEARCHSHD(), and SEARCHX() macros map directly to their lower-case equivalents for the locking primitives compared in this paper. This extra level of C-preprocessor indirection will allow future comparison of wait-free synchronization algorithms. The nsearch variable on line 3 contains the expected number of searches in the search/update mix; similarly, the nmodify variable indicates the expected number of updates in the mix, so that a mixture of 90% searches might have nsearch=9 and nmodify=1.

 1 key = 0;
 2 while (atomic_read4(&shdctl->state) == STATE_GO) {
 3   n = random() % (2 * nsearch + 1);
 4   for (i = 0; i < n; i++) {
 5     q = SEARCHSHD(key);
 6     if (q != NULL) {
 7       releaseshd(q);
 8     }
 9     key += incr;
10     if (key >= nelements) {
11       key -= nelements;
12     }
13   }
14   mycount += n;
15
16   n = random() % (2 * nmodify + 1);
17   for (i = 0; i < n; i++) {
18     q = SEARCHX(key);
19     if (q != NULL) {
20       key = q->key;
21       DELETE(q);
22       q = kmalloc(sizeof(*q), 0);
23       if (q == NULL) {
24         shdctl->pt[id].err_count++;
25       } else {
26         RCU_VALIDATE_ALLOC(q, key);
27         init_el(q);
28         q->key = key;
29         if (!INSERT(q)) {
30           kfree(q);
31         }
32       }
33     }
34     key += incr;
35     if (key >= nelements) {
36       key -= nelements;
37     }
38   }
39   mycount += n;
40   RCU_QS();
41 }

Figure 5: Test Loop

The loop from lines 4-13 performs a randomly selected number of searches based on the desired mix. The SEARCHSHD() function on line 5 is defined to return with any required locks or reference counts held, so line 7 invokes releaseshd() to do any necessary releasing. Lines 9-12 increment the search key with wrap, and line 14 counts the number of operations in a local variable.

Similarly, the loop from lines 16-38 of Figure 5 performs a randomly selected number of updates based on the desired mix. If the SEARCHX() function on line 18 returns non-NULL, it is defined to return holding whatever locks are required to delete the element, and the DELETE() function on line 21 is defined to drop those locks before returning, and to free the deleted element as well. On the other hand, if SEARCHX() returns NULL, it returns with no locks held. Line 22 allocates a new element to replace the one deleted, and line 24 counts an error if out of memory. Line 26 does some debug checks if RCU is in use, and is a no-op otherwise, again favoring normal locking. Lines 27-30 initialize the new element and insert it. The INSERT() function on line 29 is defined to acquire any needed locks and to release them before returning. Lines 34-37 advance to the next key value, and line 39 counts the operations in a local variable. Line 40 forces a quiescent state for RCU, and is a no-op otherwise.

As noted earlier, this test loop is run five times for each configuration, and the results are averaged. The maximum, minimum, and standard deviation are also computed in order to check repeatability and detect interference.

3.2 Starvation-Free rwlock

In order for the tests to run reliably, a starvation-free rwlock is required. Similar implementations have been reinvented many times over the decades, the first that I am aware of being Courtois et al. [CHP71]. Please note that I am not advocating that the Linux kernel also adopt a starvation-free rwlock, since freedom from starvation incurs additional overhead in low-contention situations. Use the right tool for the job!

The rwlock is implemented as an atomically manipulated integer that takes on the following values:

1. RW_LOCK_BIAS: the lock is idle. This is the initial value.
2. RW_LOCK_BIAS - r, where r is less than RW_LOCK_BIAS/2: r readers hold the lock.
3. 0: one writer holds the lock.
4. -r: one writer is waiting for r readers to release the lock.

Figure 6 shows the read-acquisition procedure. The outer loop from lines 6-16 spins until we have read-acquired the lock. The inner loop from lines 9-11 spins until any current write-holder has released the lock, and also takes a snapshot of the lock value at this point. Lines 15-16 then attempt to read-acquire the lock. Finally, line 17 issues whatever memory-barrier instructions are required to prevent the critical section from "bleeding out" into the preceding code.

 1 static inline void
 2 read_lock(atomic_t *rw)
 3 {
 4   long oldval;
 5
 6   do {
 7     /* Wait for writer to get done. */
 8
 9     do {
10       oldval = (long)atomic_read4(rw);
11     } while (oldval <= RW_LOCK_BIAS / 2);
12
13     /* Attempt to read-acquire the lock. */
14
15   } while (atomic_cmpxchg4(rw, oldval, oldval - 1) !=
16            oldval);
17   spin_lock_barrier();
18 }

Figure 6: Read-Acquiring an rwlock

Figure 7 shows how an rwlock is read-released. Line 4 executes any memory barriers required to prevent the critical section from bleeding out into the following code, and line 5 atomically increments the lock variable, indicating that one fewer reader is holding the lock.

 1 static inline void
 2 read_unlock(atomic_t *rw)
 3 {
 4   spin_unlock_barrier();
 5   (void)atomic_xadd4(rw, 1);
 6 }

Figure 7: Read-Releasing an rwlock

Figure 8 shows the procedure to write-acquire the rwlock. The outer loop spanning lines 7-19 spins, repeatedly attempting to write-acquire the lock, contending with any other concurrent writers. The inner loop spanning lines 11-14 repeatedly takes a snapshot of the lock variable and subtracts the bias, spinning until the result indicates that any preceding writer has released the lock. Once any preceding writer has released the lock, lines 18-19 attempt to atomically update the lock variable to indicate that this CPU is write-holding the lock. Then the loop from lines 23-25 spins waiting for any read-holding CPUs to release the lock. Finally, line 26 executes whatever memory-barrier instructions are required to prevent the critical section from bleeding out into the preceding code.

 1 static inline void
 2 write_lock(atomic_t *rw)
 3 {
 4   long newval;
 5   long oldval;
 6
 7   do {
 8
 9     /* Wait for previous writer to finish. */
10
11     do {
12       oldval = (long)atomic_read4(rw);
13       newval = oldval - RW_LOCK_BIAS;
14     } while (newval < -RW_LOCK_BIAS / 2);
15
16     /* Attempt to write-acquire. */
17
18   } while (atomic_cmpxchg4(rw, oldval, newval) !=
19            oldval);
20
21   /* Wait for any readers to finish. */
22
23   while (((long)atomic_read4(rw)) < 0) {
24     continue;
25   }
26   spin_lock_barrier();
27 }

Figure 8: Write-Acquiring an rwlock
Figure 9 shows the code to write-release an rwlock. This is very similar to the read-release code. Line 4 executes whatever memory-barrier instructions are needed to prevent the critical section from bleeding out into the following code, and line 5 atomically adds the bias back into the lock variable.

 1 static inline void
 2 write_unlock(atomic_t *rw)
 3 {
 4   spin_unlock_barrier();
 5   (void)atomic_xadd4(rw, RW_LOCK_BIAS);
 6 }

Figure 9: Write-Releasing an rwlock

Note that once a writer releases the lock, either a reader or a writer might acquire it next, so that a series of writers cannot deterministically starve a reader. Once a writer makes its presence known, it will eventually obtain the lock, so that readers cannot starve a writer.

3.3 Per-Bucket Spinlock

Figure 10 shows the per-bucket-spinlock search function. It simply computes the hash, acquires the spinlock for the corresponding hash bucket, searches the hash chain, and returns with the lock held for a successful search, or drops the lock (and returns NULL) for an unsuccessful search. The searchx() function is identical.

 1 struct el *
 2 searchshd(long key)
 3 {
 4   struct el *q;
 5
 6   spin_lock(&(KEY2BUCKET(key)->bktlock));
 7   q = _search(key);
 8   if (q != NULL) {
 9     return (q);
10   }
11   spin_unlock(&(KEY2BUCKET(key)->bktlock));
12   return (NULL);
13 }

Figure 10: Per-Bucket Search

Figure 11 shows the delete() function, which is passed an element returned by a successful call to searchx(), and is thus invoked with the bucket lock held. Line 6 removes the element from the hash chain, line 7 releases the bucket lock, and line 8 frees up the newly removed element.

 1 void
 2 delete(struct el *q)
 3 {
 4   struct list_head *p = &(q->list);
 5
 6   list_del(p);
 7   releasex(q);
 8   kfree(q);
 9 }

Figure 11: Per-Bucket Delete

Figure 12 shows the insert() function. Line 6 acquires the bucket lock. Lines 7-11 verify that the element's key is not already in the list, dropping the lock and returning failure if it is. If the key is not already in the list, line 12 adds the new element to the list, line 13 releases the lock, and line 14 returns success.
 1 int
 2 insert(struct el *q)
 3 {
 4   struct el *p;
 5
 6   spin_lock(&(KEY2BUCKET(q->key)->bktlock));
 7   p = _search(q->key);
 8   if (p != NULL) {
 9     spin_unlock(&(KEY2BUCKET(q->key)->bktlock));
10     return (0);
11   }
12   list_add(&(q->list), KEY2CHAIN(q->key));
13   spin_unlock(&(KEY2BUCKET(q->key)->bktlock));
14   return (1);
15 }

Figure 12: Per-Bucket Insert

3.4 Per-Bucket rwlock

The per-bucket rwlock's searchshd() function is the same as that shown in Section 3.3, but with the spinlock operations replaced with read-side rwlock operations. Similarly, the per-bucket rwlock's searchx(), insert(), and delete() functions are the same as those shown in Section 3.3, with the spinlock operations replaced with write-side rwlock operations.
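As an illustration, the read-side search for this variant would look roughly like the following. This is obtained by direct substitution into Figure 10 rather than taken from the benchmark, and the bktrwlock field (an atomic_t per bucket) is an assumed name; read_lock() and read_unlock() are the starvation-free primitives of Figures 6 and 7.

struct el *
searchshd(long key)
{
        struct el *q;

        read_lock(&(KEY2BUCKET(key)->bktrwlock));
        q = _search(key);
        if (q != NULL) {
                return (q);   /* return with the bucket rwlock read-held */
        }
        read_unlock(&(KEY2BUCKET(key)->bktrwlock));
        return (NULL);
}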

3.5 Per-Bucket Spinlock and Per-Element Refcnt

Figure 13 shows how a reference count is released by releaseshd(). Line 4 atomically decrements the count and checks to see if the old value was 1, in which case this is the last reference, and the item may now be freed by line 5. Note that the linked list itself holds a reference, so that the element cannot possibly be freed until after a successful delete().

 1 void
 2 releaseshd(struct el *q)
 3 {
 4   if (atomic_xadd4(&q->refcnt, -1) == 1) {
 5     kfree(q);
 6   }
 7 }

Figure 13: Reference-Count Release

The releasex() function simply releases the per-bucket lock.

Figure 14 shows the searchshd() function, which returns with the reference count held upon successful search. Line 6 acquires the per-bucket spinlock, and line 7 searches for the specified key. If the search is successful, line 9 increments the reference count, line 10 releases the per-bucket spinlock, and line 11 returns a pointer to the element. Otherwise, if the search is unsuccessful, line 13 releases the per-bucket spinlock and line 14 returns NULL.

 1 struct el *
 2 searchshd(long key)
 3 {
 4   struct el *q;
 5
 6   spin_lock(&(KEY2BUCKET(key)->bktlock));
 7   q = _search(key);
 8   if (q != NULL) {
 9     (void)atomic_xadd4(&q->refcnt, 1);
10     spin_unlock(&(KEY2BUCKET(key)->bktlock));
11     return (q);
12   }
13   spin_unlock(&(KEY2BUCKET(key)->bktlock));
14   return (NULL);
15 }

Figure 14: Reference-Count Shared Search

Note that both the search and the reference-count increment are done under the per-bucket lock. Since deletion is also done under the lock, it is not possible to gain a reference to an element that has already been removed from its list.

The searchx() exclusive-lock search function is identical in function to the per-bucket-spinlock version shown in Figure 10. Upon successful search, it returns with the per-bucket lock held.

Figure 15 shows the reference-count delete() function. This function must be passed an element that was returned from searchx(), so that the per-bucket lock is held on entry to delete(). In addition, since the linked list itself holds a reference to the element, we know that the reference-count value must be at least one, even if there are no references obtained from searchshd() currently in force. Therefore, line 6 simply removes the element from the list, line 7 drops the per-bucket lock, and line 8 releases the linked list's reference to the element. If there are no outstanding searchshd() references, the releaseshd() invocation on line 8 will also free up the element; otherwise, the last searchshd() reference to be dropped will free up the element.

 1 void
 2 delete(struct el *q)
 3 {
 4   struct list_head *p = &(q->list);
 5
 6   list_del(p);
 7   releasex(q);    /* for search. */
 8   releaseshd(q);  /* for list. */
 9 }

Figure 15: Reference-Count Delete

Figure 16 shows the reference-count insert() function. This function is identical to the per-bucket-spinlock version shown in Figure 12, with the addition of the initialization of the reference count to 1 (for the linked list, remember?) on line 12.

 1 int
 2 insert(struct el *q)
 3 {
 4   struct el *p;
 5
 6   spin_lock(&(KEY2BUCKET(q->key)->bktlock));
 7   p = _search(q->key);
 8   if (p != NULL) {
 9     spin_unlock(&(KEY2BUCKET(q->key)->bktlock));
10     return (0);
11   }
12   atomic_set4(&q->refcnt, 1);
13   list_add(&(q->list), KEY2CHAIN(q->key));
14   spin_unlock(&(KEY2BUCKET(q->key)->bktlock));
15   return (1);
16 }

Figure 16: Reference-Count Insert
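The releasex() function itself is not shown in the paper; given that it need only release the per-bucket lock, it is presumably something very close to the following sketch:

void
releasex(struct el *q)
{
        spin_unlock(&(KEY2BUCKET(q->key)->bktlock));
}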

3.6 Big-Reader Lock

Figure 17 shows the brlock searchshd() function. Line 6 read-acquires the lock and line 7 performs the search. If this search is successful, line 9 returns a pointer to the element, while still holding the lock. Otherwise, line 11 drops the lock and line 12 returns NULL.

 1 struct el *
 2 searchshd(long key)
 3 {
 4   struct el *q;
 5
 6   br_read_lock(bbrlock);
 7   q = _search(key);
 8   if (q != NULL) {
 9     return (q);
10   }
11   br_read_unlock(bbrlock);
12   return (NULL);
13 }

Figure 17: brlock Shared Search

The only difference between the searchx() function and the searchshd() function is that searchx() write-acquires the brlock.

Figure 18 shows the brlock delete() function. Line 6 removes the specified element from the list, line 7 write-releases the brlock, and line 8 frees the element.

 1 void
 2 delete(struct el *q)
 3 {
 4   struct list_head *p = &(q->list);
 5
 6   list_del(p);
 7   releasex(q);
 8   kfree(q);
 9 }

Figure 18: brlock Delete

Figure 19 shows the brlock insert() function, which is quite similar to the per-bucket-spinlock version shown in Figure 12, but with the per-bucket exclusive-lock operations replaced with the corresponding write-side brlock operations.

 1 int
 2 insert(struct el *q)
 3 {
 4   struct el *p;
 5
 6   br_write_lock(bbrlock);
 7   p = _search(q->key);
 8   if (p != NULL) {
 9     br_write_unlock(bbrlock);
10     return (0);
11   }
12   list_add(&(q->list), KEY2CHAIN(q->key));
13   br_write_unlock(bbrlock);
14   return (1);
15 }

Figure 19: brlock Insert

3.7 RCU

Figure 20 shows the RCU searchshd() function, which does a simple lock-free search. Note that rcu_read_lock() and rcu_read_unlock() generate no code in this case. The RCU searchx() function is identical to its per-bucket-lock counterpart shown in Figure 10.

 1 struct el *
 2 searchshd(long key)
 3 {
 4   struct el *q;
 5
 6   rcu_read_lock();
 7   q = _search(key);
 8   if (q != NULL) {
 9     return (q);
10   }
11   rcu_read_unlock();
12   return (NULL);
13 }

Figure 20: RCU Shared Search

Figure 21 shows the RCU delete() function. Again, this function is called with the per-bucket lock held. Line 6 removes the specified element from the list, line 7 updates debug information, line 8 releases the per-bucket lock, and line 9 frees up the element at the end of the next grace period. The call_rcu_kfree() function also updates statistics that track the maximum number of elements waiting for a grace period to expire.

 1 void
 2 delete(struct el *q, struct rcu_handle *rhp)
 3 {
 4   struct list_head *p = &(q->list);
 5
 6   list_del(p);
 7   RCU_SET_DELETED(q);
 8   releasex(q);
 9   call_rcu_kfree(rhp, q);
10 }

Figure 21: RCU Delete

Figure 22 shows RCU's insert() function. Line 6 acquires the lock, and line 7 checks to see if the desired key is already present in the list. If it is, lines 9-10 release the lock and return failure. Otherwise, if the desired key is not already present, line 12 updates debug information, line 13 adds the element to the list with appropriate memory barriers, line 14 releases the lock, and line 15 returns success.

 1 int
 2 insert(struct el *q)
 3 {
 4   struct el *p;
 5
 6   spin_lock(&(KEY2BUCKET(q->key)->bktlock));
 7   p = _search(q->key);
 8   if (p != NULL) {
 9     spin_unlock(&(KEY2BUCKET(q->key)->bktlock));
10     return (0);
11   }
12   RCU_SET_INUSE(q);
13   list_add_rcu(&(q->list), KEY2CHAIN(q->key));
14   spin_unlock(&(KEY2BUCKET(q->key)->bktlock));
15   return (1);
16 }

Figure 22: RCU Insert
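The "appropriate memory barriers" supplied by list_add_rcu() on line 13 of Figure 22 amount to a write memory barrier between initializing the new element and linking it into the chain, so that a concurrent lock-free search never sees a partially initialized element. The Linux 2.6 kernel of that era implemented it essentially as follows (the user-level benchmark presumably carries an equivalent of its own):

static inline void __list_add_rcu(struct list_head *new,
                                  struct list_head *prev,
                                  struct list_head *next)
{
        new->next = next;
        new->prev = prev;
        smp_wmb();          /* order initialization before publication */
        next->prev = new;
        prev->next = new;
}

static inline void list_add_rcu(struct list_head *new, struct list_head *head)
{
        __list_add_rcu(new, head, head->next);
}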

4 Measured Results

This section shows measured results from running the hash-table benchmark using the different locking methods on the different CPUs. The CPUs used are as follows:

1. Four-CPU 700MHz P-III system.
2. Four-CPU 1.4GHz IPF system.
3. Four-CPU 1.4GHz Opteron system.
4. Eight-CPU 1.45GHz POWER4+ system (but only four CPUs were used in these benchmarks).

Section 4.1 displays results for a read-only workload, while Section 4.2 shows how the breakevens vary as the update intensity of the workload varies.

4.1 Read-Only Workload

Figures 23, 24, 25, and 26 show performance results for each of the locking mechanisms. The y-axes are normalized in order to prevent inappropriate comparison of the different CPUs.

The left-hand graph of each figure shows the performance of a global spinlock and a global rwlock. In each case, these primitives give negative scaling. The cache misses incurred by rwlock overwhelm any read-side parallelism one might hope to gain. For rwlock to show decent scaling, the read-side critical sections must be quite lengthy.

The right-hand graph of each figure shows the performance of brlock, the per-bucket-locking mechanisms, and RCU. Note that RCU does not achieve ideal performance, even in this read-only benchmark. In fact, for some CPUs, the deviation from ideal is significant, in contrast to experience in the kernel, where the overhead of the checks is not measurable. This is due to the fact that this user-level benchmark does not have an existing scheduler_tick() function in which a low-cost test can be placed. User-level code therefore bears a greater cost for RCU than does the kernel, so, once again, these comparisons are conservative.

4.2 Mixed Workload

The mixed-workload results are more interesting. Although RCU outperforms the other mechanisms in the read-only case, the extra overhead of detecting grace periods should slow it down for updates. RCU would therefore be expected to be optimal only up to a particular update fraction.

This section shows what this update fraction is for different values of lambda, which is the expected number of searches and updates per grace period. The greater the value of lambda, the greater the number of searches and updates available to absorb the overhead of detecting a grace period. One would therefore expect RCU's breakeven update fraction to increase with increasing lambda.
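In terms of the test-loop parameters of Figure 5, and writing f for the update fraction plotted on the x-axes of Figures 27 through 38 (this notation is mine, not the paper's), the expected update fraction is simply

    f = nmodify / (nsearch + nmodify),

so the nsearch=9, nmodify=1 example from Section 3.1 corresponds to f = 0.1, while lambda is the expected number of searches plus updates completed within a single grace period.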
The two values chosen for lambda are 10 and 100. These values are quite conservative given that realtime-response-time tests of the Linux 2.6.0 kernel under heavy load have observed more than 1,000 updates per grace period. The corresponding value of lambda would be even greater, but the current kernel instrumentation does not measure the number of read-side RCU critical sections. However, even if the number of read-side critical sections is zero, the values of 10 and 100 used in this paper are quite conservative. This came as a surprise, as experience with RCU in other operating systems [MS98] has placed lambda in the range from 0.1 to 10.

Figures 27 through 38 show the performance of the various locking mechanisms for 1-, 2-, and 4-CPU systems of x86, IPF, Opteron, and PPC CPUs. In each figure, the graph on the left shows lambda of 10, and the graph on the right shows lambda of 100. Again, the y-axis is normalized to prevent inappropriate comparison between the different CPUs.

The crossover update fractions between RCU and per-bucket locking are summarized in Table 1. The crossover numbers in the table are approximate; note that the crossing lines are often nearly parallel, which means that small experimental errors translate to relatively large changes in the crossover number. The crossover update fractions are all rather large, with RCU remaining optimal when up to half of the accesses are updates for some CPUs. The x86 results show RCU improving with increasing numbers of CPUs, but Opteron, and particularly IPF/x86, show decreases. Future work for IPF includes running the test in the native instruction set to see if this behavior is an artifact of the x86 hardware-emulation environment. In the case of Opteron, the effect is much smaller, but it would still be quite interesting to run this experiment on an 8-CPU Opteron system. The "wavy" lines in a number of plots are due to run-to-run variation in measured performance.

RCU comes very close to the ideal performance in a number of cases, but not universally. This is why single-CPU benchmarks are a special challenge for RCU-based changes, and also why it is so important to subject RCU-based changes to single-CPU benchmarks.

It appears from this data that RCU should provide a performance benefit on most systems when the update fraction is less than 0.1 (10%). One important exception to this rule of thumb is the case of a data structure that already uses RCU for some of its code paths. Since the first use of RCU must bear the entire update cost, RCU may be used much more aggressively on additional code paths. As always, your mileage may vary; the data in this paper is in no way a substitute for careful performance evaluation under realistic conditions.

Figure 23: x86 Performance for Read-Only Workload

5 Summary

A hash-table mini-benchmark was used to evaluate the suitability of RCU for a range of update intensities on four different types of CPUs. As expected, RCU is best for read-only workloads. RCU remains best up to about a 10% update fraction on all systems under all tested conditions, and is best up to a 50% update fraction for 4-CPU x86, 1- and 2-CPU Opteron, and 2- and 4-CPU PPC systems running with at least 100 searches/updates per grace period. Since machines running heavy workloads have been observed with more than 1,000 updates per grace period, these figures appear to be conservative.

Therefore, a conservative rule of thumb would be to use RCU when no more than about 10% of the accesses update the hash table. However, once a given data structure uses RCU on some code paths, RCU may be applied to its other code paths much more aggressively. Of course, your mileage may vary, so this paper is in no way a substitute for careful testing under realistic conditions. Update-intensive workloads generally do well with a per-bucket lock. In update-intensive situations where deadlock is a problem, it would be better to use reference counts with per-bucket locking. Note that the plots of performance data are qualitatively similar across CPUs, as would be expected given that cacheline traffic is responsible for much of the overhead. Therefore, these rules of thumb are likely to be quite reliable.

Future work includes running on a greater variety of CPUs and on different models within the same CPU family, and testing the locking mechanisms on different data structures.

Acknowledgements

I am indebted to Jack Vogel, Chris McDermott, Dave Hansen, and Adam Litke, who provided time on the test systems. I owe thanks to Dan Frye, Jai Menon, and Juergen Deicke for their support of this work. As always, I look forward to continuing to learn about RCU and its uses, and am grateful to the many people who have tried it out. It is from them that I learn the most.

Figure 24: IPF/x86 Performance for Read-Only Workload

References

[ACMS03] Andrea Arcangeli, Mingming Cao, Paul E. McKenney, and Dipankar Sarma. Using read-copy update techniques for System V IPC in the Linux 2.5 kernel. In Proceedings of the 2003 USENIX Annual Technical Conference (FREENIX Track), June 2003.

[CHP71] P. J. Courtois, F. Heymans, and D. L. Parnas. Concurrent control with "readers" and "writers". Communications of the ACM, 14(10):667-668, October 1971.

[GKAS99] Ben Gamsa, Orran Krieger, Jonathan Appavoo, and Michael Stumm. Tornado: Maximizing locality and concurrency in a shared memory multiprocessor operating system. In Proceedings of the 3rd Symposium on Operating System Design and Implementation, New Orleans, LA, February 1999.

[HOS89] James P. Hennessey, Damian L. Osisek, and Joseph W. Seigh II. Passive serialization in a multitasking environment. Technical Report US Patent 4,809,168, US Patent and Trademark Office, Washington, DC, February 1989.

[Jac93] Van Jacobson. Avoid read-side locking via delayed free. Verbal discussion, September 1993.

[KL80] H. T. Kung and Q. Lehman. Concurrent maintenance of binary search trees. ACM Transactions on Database Systems, 5(3):354-382, September 1980.

[LSS02] Hanna Linder, Dipankar Sarma, and Maneesh Soni. Scalability of the directory entry cache. In Ottawa Linux Symposium, pages 289-300, June 2002.

[MAK+01] Paul E. McKenney, Jonathan Appavoo, Andi Kleen, Orran Krieger, Rusty Russell, Dipankar Sarma, and Maneesh Soni. Read-copy update. In Ottawa Linux Symposium, July 2001.

[McK99] Paul E. McKenney. Practical performance estimation on shared-memory multiprocessors. In Parallel and Distributed Computing and Systems, pages 125-134, Boston, MA, November 1999.

[McK03] Paul E. McKenney. Using RCU in the Linux 2.5 kernel. Linux Journal, 1(114):18-26, October 2003.

[ML84] Udi Manber and Richard E. Ladner. Concurrency control in a dynamic search structure. ACM Transactions on Database Systems, 9(3):439-455, September 1984.

[MS98] Paul E. McKenney and John D. Slingwine. Read-copy update: Using execution history to solve concurrency problems. In Parallel and Distributed Computing and Systems, pages 509-518, Las Vegas, NV, October 1998.

[MSA+02] Paul E. McKenney, Dipankar Sarma, Andrea Arcangeli, Andi Kleen, Orran Krieger, and Rusty Russell. Read-copy update. In Ottawa Linux Symposium, pages 338-367, June 2002.

[MSS04] Paul E. McKenney, Dipankar Sarma, and Maneesh Soni. Using RCU in the Linux 2.6 directory-entry cache. To appear in Linux Journal, 1(118), January 2004.

[Pug90] William Pugh. Concurrent maintenance of skip lists. Technical Report CS-TR-2222.1, Institute of Advanced Computer Science Studies, Department of Computer Science, University of Maryland, College Park, Maryland, June 1990.

Figure 25: Opteron Performance for Read-Only Workload

CPU       lambda   # CPUs   Crossover   Figure
x86           10        1        0.2        27
                        2        0.3        28
                        4        0.3        29
             100        1        0.4        27
                        2        0.4        28
                        4        0.5        29
IPF/x86       10        1        0.3        30
                        2        0.2        31
                        4        0.1        32
             100        1        0.4        30
                        2        0.4        31
                        4        0.2        32
Opteron       10        1        0.3        33
                        2        0.2        34
                        4        0.2        35
             100        1        0.5        33
                        2        0.5        34
                        4        0.4        35
PPC           10        1        0.3        36
                        2        0.5        37
                        4        0.4        38
             100        1        0.4        36
                        2        0.5        37
                        4        0.5        38

Table 1: Summary of RCU Crossover Update Fractions

Legal Statement

This work represents the view of the author and does not necessarily represent the view of IBM. Linux is a registered trademark of Linus Torvalds. Pentium and Itanium are trademarks of Intel Corporation or its subsidiaries in the United States, other countries, or both. Other company, product, and service names may be trademarks or service marks of others.

Figure 26: PPC Performance for Read-Only Workload

Figure 27: x86 1-CPU Performance for Mixed Workload

Figure 28: x86 2-CPU Performance for Mixed Workload

Figure 29: x86 4-CPU Performance for Mixed Workload

Figure 30: IPF/x86 1-CPU Performance for Mixed Workload

Figure 31: IPF/x86 2-CPU Performance for Mixed Workload

Figure 32: IPF/x86 4-CPU Performance for Mixed Workload

Figure 33: Opteron 1-CPU Performance for Mixed Workload

Figure 34: Opteron 2-CPU Performance for Mixed Workload

Figure 35: Opteron 4-CPU Performance for Mixed Workload

Figure 36: PPC 1-CPU Performance for Mixed Workload

Figure 37: PPC 2-CPU Performance for Mixed Workload

Figure 38: PPC 4-CPU Performance for Mixed Workload