Making Lockless Synchronization Fast: Performance Implications of Memory Reclamation

Thomas E. Hart¹*, Paul E. McKenney², and Angela Demke Brown¹

¹Dept. of Computer Science, University of Toronto, Toronto, ON M5S 2E4, Canada. {tomhart, demke}@cs.toronto.edu
²IBM Linux Technology Center, Beaverton, OR 97006, USA. [email protected]

Abstract

Achieving high performance for concurrent applications on modern multiprocessors remains challenging. Many programmers avoid locking to improve performance, while others replace locks with non-blocking synchronization to protect against deadlock, priority inversion, and convoying. In both cases, dynamic data structures that avoid locking require a memory reclamation scheme that reclaims nodes once they are no longer in use.

The performance of existing memory reclamation schemes has not been thoroughly evaluated. We conduct the first fair and comprehensive comparison of three recent schemes—quiescent-state-based reclamation, epoch-based reclamation, and hazard-pointer-based reclamation—using a flexible microbenchmark. Our results show that there is no globally optimal scheme. When evaluating lockless synchronization, programmers and algorithm designers should thus carefully consider the data structure, the workload, and the execution environment, each of which can dramatically affect memory reclamation performance.

1 Introduction

As multiprocessors become mainstream, multithreaded applications will become more common, increasing the need for efficient coordination of concurrent accesses to shared data structures. Traditional locking requires expensive atomic operations, such as compare-and-swap (CAS), even when locks are uncontended. For example, acquiring and releasing an uncontended spinlock requires over 400 cycles on an IBM POWER CPU. Therefore, many researchers recommend avoiding locking [2, 7, 22]. Some systems, such as Linux, use concurrently-readable synchronization, which uses locks for updates but not for reads. Locking is also susceptible to priority inversion, convoying, deadlock, and blocking due to thread failure [3, 10], leading researchers to pursue non-blocking (or lock-free) synchronization [6, 12, 13, 14, 16, 29]. In some cases, lock-free approaches can bring performance benefits [25]. For clarity, we describe all strategies that avoid locks as lockless.

A major challenge for lockless synchronization is handling the read/reclaim races that arise in dynamic data structures. Figure 1 illustrates the problem: thread T1 removes node N from a list while thread T2 is referencing it. N's memory must be reclaimed to allow reuse, lest memory exhaustion block all threads, but such reuse is unsafe while T2 continues referencing N. For languages like C, where memory must be explicitly reclaimed (e.g. via free()), programmers must combine a memory reclamation scheme with their lockless data structures to resolve these races.¹

[Figure 1. Read/reclaim race.]

Several such reclamation schemes have been proposed. Programmers need to understand the semantics and the performance implications of each scheme, since the overhead of inefficient reclamation can be worse than that of locking. For example, reference counting [5, 29] has high overhead in the base case and scales poorly with data-structure size. This is unacceptable when performance is the motivation for lockless synchronization. Unfortunately, there is no single optimal scheme, and existing work is relatively silent on factors affecting reclamation performance.

¹Reclamation is subsumed into automatic garbage collectors in environments that provide them, such as Java.
*Supported by an NSERC Canada Graduate Scholarship.
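The read/reclaim race can be made concrete with a short sketch. The following minimal C fragment is ours, not the paper's benchmark code; it assumes a simple singly-linked list and shows why an unlinked node cannot simply be passed to free() while lockless readers run.

    #include <stdlib.h>

    struct node { long key; struct node *next; };

    /* Thread T2: lockless read-side traversal. */
    int list_contains(struct node *head, long key) {
        for (struct node *cur = head; cur != NULL; cur = cur->next)
            if (cur->key == key)
                return 1;   /* cur may be node N, already unlinked by T1 */
        return 0;
    }

    /* Thread T1: unlinks node n and immediately reclaims it. */
    void remove_unsafe(struct node **prev, struct node *n) {
        *prev = n->next;    /* N is now unreachable to new readers... */
        free(n);            /* ...but T2 may still be dereferencing N */
    }

Every scheme reviewed in Section 2 closes this window differently: by waiting for readers to finish (QSBR, EBR) or by having readers advertise their references (HPBR, LFRC).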


We address this deficit by comparing three recent reclamation schemes, showing the respective strengths and weaknesses of each. In Sections 2 and 3, we review these schemes and describe factors affecting their performance. Section 4 explains our experimental setup. Our analysis, in Section 5, reveals substantial performance differences between these schemes, the greatest source of which is per-operation atomic instructions. In Section 6, we discuss the relevance of our work to designers and implementers. We show that lockless algorithms and reclamation schemes are mostly independent, by combining a blocking reclamation scheme and a non-blocking algorithm, then comparing this combination to a fully non-blocking equivalent. We also present a new reclamation scheme that combines aspects of two other schemes to give good performance and ease of use. We close with a discussion of related work in Section 7 and summarize our conclusions in Section 8.

2 Memory Reclamation Schemes

This section briefly reviews the reclamation schemes we consider: quiescent-state-based reclamation (QSBR) [22, 2], epoch-based reclamation (EBR) [6], hazard-pointer-based reclamation (HPBR) [23, 24], and reference counting [29, 26]. We provide an overview of each scheme to help the reader understand our work; further details are available in the papers cited.

2.1 Blocking Schemes

We discuss two blocking schemes, QSBR and EBR, which use the concept of a grace period. A grace period is a time interval [a,b] such that, after time b, all nodes removed before time a may safely be reclaimed. These methods force threads to wait for other threads to complete their lockless operations in order for a grace period to occur. Failed or delayed threads can therefore prevent memory from being reclaimed. Eventually, memory allocation will fail, causing threads to block.

2.1.1 Quiescent-State-Based Reclamation (QSBR)

QSBR uses quiescent states to detect grace periods. A quiescent state for thread T is a state in which T holds no references to shared nodes; hence, a grace period for QSBR is any interval of time during which all threads pass through at least one quiescent state. The choice of quiescent states is application-dependent. Many operating system kernels contain natural quiescent states, such as voluntary context switch, and use QSBR to implement read-copy update (RCU) [22, 20, 7].

Figure 2 illustrates the relationship between quiescent states and grace periods in QSBR. Thread T1 goes through quiescent states at times t1 and t5, T2 at times t2 and t4, and T3 at time t3. Hence, a grace period is any time interval containing either [t1,t3] or [t3,t5].

[Figure 2. Illustration of QSBR. Black boxes represent quiescent states.]

QSBR must detect grace periods so that removed nodes may be reclaimed; however, detecting minimal grace periods is unnecessary. In Figure 2, for example, any interval containing [t1,t3] or [t3,t5] is a grace period; implementations which check for grace periods only when threads enter quiescent states would detect [t1,t5], since T1's two quiescent states form the only pair from a single thread which enclose a grace period.

Figure 3 shows why QSBR is blocking. Here, T2 goes through no quiescent states (for example, due to blocking on I/O). Threads T1 and T3 are then prevented from reclaiming memory, since no grace periods exist. The ensuing memory exhaustion will eventually block all threads.

[Figure 3. QSBR is inherently blocking.]
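To make the mechanism concrete, the following sketch shows one way a grace-period detector of this kind can be structured. It is our illustration, not the paper's implementation: it assumes a fixed thread count, spins rather than blocking, and omits the memory fences a production implementation would need.

    #define NTHREADS 8
    static volatile unsigned long qs_count[NTHREADS];

    /* Each thread calls this at its quiescent states (cf. Figure 8). */
    void quiescent_state(int tid) {
        qs_count[tid]++;
    }

    /* Wait for a grace period: every other thread must pass through at
     * least one quiescent state after this call begins. The caller is
     * assumed to be at a quiescent state itself. */
    void wait_for_grace_period(int self) {
        unsigned long snap[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            snap[i] = qs_count[i];
        for (int i = 0; i < NTHREADS; i++) {
            if (i == self)
                continue;
            while (qs_count[i] == snap[i])
                ;   /* a thread stalled here stalls all reclamation */
        }
    }

The spin loop is exactly where Figure 3's blocking behaviour appears: if any thread never reaches a quiescent state, the loop never terminates and removed nodes are never reclaimed.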

2.1.2 Epoch-Based Reclamation (EBR)

Fraser's EBR [6] follows QSBR in using grace periods, but uses epochs in place of QSBR's quiescent states. Each thread executes in one of three logical epochs, and may lag at most one epoch behind the global epoch. Each thread atomically sets a per-thread flag upon entry into a critical region, indicating that the thread intends to access shared data without locks. Upon exit, the thread atomically clears its flag. No thread is allowed to access an EBR-protected object outside of a critical region.

[Figure 4. Illustration of EBR.]

Figure 4 shows how EBR tracks epochs, allowing safe memory reclamation. Upon entering a critical region, a thread updates its local epoch to match the global epoch. Hence, if the global epoch is e, threads in critical regions can be in either epoch e or e-1, but not e+1 (all mod 3). Any node a thread removes during a given epoch may be safely reclaimed the next time the thread re-enters that epoch. Thus, the time period [t1,t2] in the figure is a grace period. A thread will attempt to update the global epoch only if it has not changed for some pre-determined number of critical region entries.

As with QSBR, reclamation can be stalled by threads which fail in critical regions, but threads not in critical regions cannot stall EBR. EBR's bookkeeping is invisible to the applications programmer, making it easy for a programmer to use; however, Section 5 shows that this property imposes significant overhead on EBR.
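The epoch bookkeeping can be sketched as follows. This is our reading of the scheme described above, not Fraser's code; it assumes a fixed thread count and reuses the memory_fence() primitive that appears in Figure 7.

    #define NTHREADS 8
    extern void memory_fence(void);

    static volatile int global_epoch;            /* 0, 1, or 2 */
    static volatile int in_critical[NTHREADS];   /* per-thread flags */
    static volatile int local_epoch[NTHREADS];
    static int entries[NTHREADS];

    /* Advance the global epoch only when every thread in a critical
     * region has observed it; nodes retired two epochs ago are then
     * safe to reclaim. */
    static void try_advance_epoch(void) {
        int e = global_epoch;
        for (int i = 0; i < NTHREADS; i++)
            if (in_critical[i] && local_epoch[i] != e)
                return;                          /* a reader still lags */
        global_epoch = (e + 1) % 3;
    }

    void critical_enter(int tid) {
        in_critical[tid] = 1;
        memory_fence();                 /* first of EBR's two fences */
        local_epoch[tid] = global_epoch;
        if (++entries[tid] % 100 == 0)  /* matches Section 4.2's setting */
            try_advance_epoch();
    }

    void critical_exit(int tid) {
        memory_fence();                 /* second fence */
        in_critical[tid] = 0;
    }

The two fences in critical_enter() and critical_exit() are the per-operation cost that Sections 3.1 and 5 charge against EBR, and the cost that NEBR (Section 6.2) amortizes.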
Figure 5 shows an example of a search of a linked list which allows lockless reads but uses locking for updates. QSBR omits lines 4 and 12, which handle EBR's epoch bookkeeping, but is otherwise identical; QSBR's quiescent states are flagged explicitly at a higher level (see Figure 8).

     1  int search (struct list *l, long key)
     2  {
     3      node_t *cur;
     4      critical_enter();
     5      for (cur = l->list_head;
     6           cur != NULL; cur = cur->next) {
     7          if (cur->key >= key) {
     8              critical_exit();
     9              return (cur->key == key);
    10          }
    11      }
    12      critical_exit();
    13      return (0);
    14  }

Figure 5. EBR concurrently-readable search.

2.2 Non-blocking schemes

This section presents the non-blocking reclamation schemes we evaluate: hazard-pointer-based reclamation (HPBR) and lock-free reference counting (LFRC).

2.2.1 Hazard-Pointer-Based Reclamation (HPBR)

Michael's HPBR [23] provides an existence locking mechanism for dynamically-allocated nodes. Each thread performing lockless operations has K hazard pointers which it uses to protect nodes from reclamation by other threads; hence, with N threads we have H=NK hazard pointers in total. K is data-structure-dependent, and often small. Queues and linked lists need K=2 hazard pointers, while stacks require only K=1; however, we know of no upper bound on K for general tree- or graph-traversal algorithms.

After removing a node, a thread places that node in a private list. When the list grows to a predefined size R, the thread reclaims each node lacking a corresponding hazard pointer. Increasing R amortizes reclamation overhead across more nodes, but increases memory usage.

An algorithm using HPBR must identify all hazardous references — references to shared nodes that may have been removed by other threads or that are vulnerable to the ABA problem [24]. Such references require hazard pointers. The algorithm sets a hazard pointer, then checks for node removal; if the node has not been removed, then it may safely be referenced. As long as the hazard pointer references the node, HPBR's reclamation routine refrains from reclaiming it. Figure 6 illustrates the use of HPBR: node N has been removed from the linked list, but cannot be reclaimed because T2's hazard pointer HP[2] references it.

[Figure 6. Illustration of HPBR.]

Figure 7, showing code adapted from Michael [24], demonstrates HPBR with a search algorithm corresponding to Figure 5. At most two nodes must be protected: the current node and its predecessor (K=2).

     1  int search (struct list *l, long key)
     2  {
     3      node_t **prev, *cur, *next;
     4      /* Index of our first hazard pointer. */
     5      int base = getTID()*K;
     6      /* Offset into our hazard pointer segment. */
     7      int off = 0;
     8  try_again:
     9      prev = &l->list_head;
    10      for (cur = *prev; cur != NULL; cur = next & ~1) {
    11          /* Protect cur with a hazard pointer. */
    12          HP[base+off] = cur;
    13          memory_fence();
    14          if (*prev != cur)
    15              goto try_again;
    16          next = cur->next;
    17          if (cur->key >= key)
    18              return (cur->key == key);
    19          prev = &cur->next;
    20          off = (off+1) % K;
    21      }
    22      return (0);
    23  }

Figure 7. HPBR concurrently-readable search.

The code removing nodes, which is not shown here, uses the low-order bit of the next pointer as a flag. This guarantees that the validation step on line 14 will fail and retry in case of concurrent removal. Full details are given by Michael [24].
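The retire-and-scan step described above can be sketched as follows. This is a simplified illustration under our own assumptions (a small fixed array and a linear scan), not Michael's implementation, which sorts or hashes the hazard pointers for efficiency; R = 2H + 100 matches the setting used in Section 4.2.

    #include <stdlib.h>

    #define N  8                 /* threads */
    #define K  2                 /* hazard pointers per thread */
    #define H  (N * K)
    #define R  (2 * H + 100)     /* reclamation threshold (Section 4.2) */

    extern void * volatile HP[H];   /* shared hazard-pointer array */

    static void *retired[R];        /* this thread's private list */
    static int nretired;

    /* Free every retired node that no hazard pointer protects. */
    static void scan(void) {
        int kept = 0;
        for (int i = 0; i < nretired; i++) {
            int hazardous = 0;
            for (int j = 0; j < H; j++)
                if (HP[j] == retired[i]) { hazardous = 1; break; }
            if (hazardous)
                retired[kept++] = retired[i];   /* still protected */
            else
                free(retired[i]);               /* safe to reclaim */
        }
        nretired = kept;
    }

    void retire_node(void *node) {
        retired[nretired++] = node;
        if (nretired == R)
            scan();
    }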

Herlihy et al. [15] presented a very similar scheme called Pass the Buck (PTB). Since HPBR and PTB have similar per-operation costs, we believe that our HPBR results apply to PTB as well.

2.2.2 Lock-Free Reference Counting (LFRC)

Lock-free reference counting (LFRC) is a well-known non-blocking garbage-collection technique. Threads track the number of references to nodes, reclaiming any node whose count is zero. Valois's LFRC scheme [29] (corrected by Michael and Scott [26]) uses CAS and fetch-and-add (FAA), and requires nodes to retain their type after reclamation. Sundell's scheme [28], based on Valois's, is wait-free. The scheme of Detlefs et al. [5] allows nodes' types to change upon reclamation, but requires double compare-and-swap (DCAS), which no current CPU supports.

Michael [24] showed that LFRC introduces overhead which often makes lockless algorithms perform worse than lock-based versions. We include some experiments with Valois' scheme to reproduce Michael's findings.
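A minimal sketch of the reference-counting idea follows. It is our illustration, not Valois's algorithm: it omits the CAS-based traversal that safely obtains new references, and both fetch_and_add() (assumed to return the counter's previous value) and the pool hook are assumed primitives.

    struct rc_node {
        long key;
        volatile long refcount;
        struct rc_node *next;
    };

    extern long fetch_and_add(volatile long *addr, long delta);
    extern void free_to_type_stable_pool(struct rc_node *n); /* hypothetical */

    /* Advertise a reference: one FAA (plus a fence on real hardware)
     * per node visited, which is why LFRC scales so poorly with
     * traversal length (Section 5.2). */
    void rc_acquire(struct rc_node *n) {
        fetch_and_add(&n->refcount, 1);
    }

    /* Drop a reference; the last holder reclaims the node. Because a
     * concurrent thread may FAA the count of a node that has just been
     * freed, nodes must go to a pool that preserves their type. */
    void rc_release(struct rc_node *n) {
        if (fetch_and_add(&n->refcount, -1) == 1)
            free_to_type_stable_pool(n);
    }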

3 Reclamation Performance Factors

We categorize factors which can affect reclamation scheme performance; we vary these factors in Section 5.

3.1 Memory Consistency

Current literature on lock-free algorithms generally assumes a sequentially-consistent [18] memory model, which prohibits instruction reordering and globally orders memory references. For performance reasons, however, modern CPUs provide weaker memory consistency models in which ordering can be enforced only when needed via special fence instructions. Although fences are often omitted from pseudocode, they are expensive on most modern CPUs and must be included in realistic performance analyses.

HPBR, EBR, and LFRC require per-operation fences. HPBR, as shown in Figure 7, requires a fence between hazard-pointer setting and validation, thus one fence per visited node. LFRC also requires per-node fences, in addition to atomic instructions needed to maintain reference counts. EBR requires two fences per operation: one when setting a flag upon entering a critical region, and one when clearing it upon exit. Since QSBR has no per-operation fences, its per-operation overhead can be very low.

3.2 Data Structures and Workload

Data structures differ in both the operations they provide and in their common workloads. Queues are write-only, but linked lists and hash tables are often read-mostly [19]. Blocking schemes may perform poorly with update-heavy structures, since the risk of memory exhaustion is higher. Conversely, non-blocking schemes may perform poorly with operations such as list or tree traversal which visit many nodes, since they require per-node fences.

3.3 Threads and Scheduling

We expect contention due to concurrent threads to be a minor source of reclamation overhead; however, for the non-blocking schemes, it could be unbounded in degenerate cases: readers forced to repeatedly restart their traversals must repeatedly execute fence instructions for every node.

Thread preemption, especially when there are more threads than processors, can adversely affect blocking schemes. Descheduled threads can delay reclamation, potentially exhausting memory, particularly in update-heavy workloads. Longer scheduling quanta may increase the risk of this exhaustion.

3.4 Memory Constraints

Although lock-free memory allocators exist [25], many allocators use locking. Blocking methods will see greater lock contention because they must access a lock-protected global pool more frequently. Furthermore, if a thread is preempted while holding such a lock, other threads will block on memory allocation. The size of the global pool is finite, and governs the likelihood of a blocking scheme exhausting all memory. Only HPBR [23] provides a provable bound on the amount of unreclaimed memory; it should thus be less sensitive to these constraints.

4 Experimental Setup

We evaluated the memory reclamation strategies with respect to the factors outlined in Section 3 using commodity SMP systems with IBM POWER CPUs. This section provides details on these aspects of our experiments.

4.1 Data Structures Used

We tested the reclamation schemes on linked lists and queues. We used Michael's ordered lock-free linked list [24], which forbids duplicate keys, and coded our concurrently-readable lists similarly. Because linked lists permit arbitrary lengths and read-to-write ratios, we used them heavily in our experiments. Our lock-free queue follows Michael and Scott's design [24]. Queues allow evaluating QSBR on a write-only data structure, which no prior studies have done.

Authorized licensed use limited to: The University of Toronto. Downloaded on April 9, 2009 at 00:36 from IEEE Xplore. Restrictions apply. 1 while (parent’s timer has not expired) { 2 for i from 1 to 100 do { 3 key = random key; Table 1. Characteristics of Machines 4 op = random operation; XServe IBM POWER 5 d = data structure; CPUs 2x 2.0GHz PowerPC G5 8x 1.45 GHz POWER4+ 6 op(d, key); Kernel 7} Linux 2.6.8-1.ydl.7g5-smp Linux 2.6.13 (kernel.org) 8 if (using QSBR) Fence 78ns (156 cycles) 76ns (110 cycles) 9 quiescent_state(); CAS 10 } 52ns (104 cycles) 59ns (86 cycles) Lock 231ns (462 cycles) 243ns (352 cycles)

4.2 Test Program

In our tests, a parent thread creates N child threads, starts a timer, and stops the threads upon timer expiry. Child threads count the number of operations they perform, and the parent then calculates the average execution time per operation by dividing the duration of the test by the total number of operations. The CPU time is the execution time divided by the minimum of the number of threads and the number of processors. CPU time compensates for increasing numbers of CPUs, allowing us to focus on synchronization overhead. Our tests report the average of five trials.

Each thread runs repeatedly through the test loop shown in Figure 8 until the timer expires. QSBR tests place a quiescent state at the end of the loop. The probabilities of inserting and removing nodes are equal, keeping data-structure size roughly constant throughout a given run.

We vary the number of threads and nodes. For linked lists, we also vary the ratio of reads to updates. As shown in Figure 8, each thread performs 100 operations per quiescent state; hence, grace-period-related overhead is amortized across 100 operations. For EBR, each op in Figure 8 is a critical region; a thread attempts to update the global epoch whenever it has entered a critical region 100 times since the last update, again amortizing grace-period-related overhead across 100 operations. For HPBR, we amortized reclamation overhead over R = 2H + 100 node removals. For consistency, QSBR and EBR both used the fuzzy barrier [11] algorithm from Fraser's EBR implementation [6].

The code for our experiments is available at http://www.cs.toronto.edu/~tomhart/perflab/ipdps06.tgz.

4.3 Operating Environment

We performed our experiments on the two machines shown in Table 1. The last line of this table gives the combined costs of locking and then unlocking a spinlock.

Our experiment implements threads using processes. Our memory allocator is similar to that of Bonwick [4]. Each thread has two freelists of up to 100 elements each, and can acquire more memory from a global non-blocking stack of freelists. This non-blocking allocator allowed us to study reclamation performance without considering the pathological locking conditions discussed in Section 3.

We implemented CAS using POWER's LL/SC instructions (larx and stcx), and fences using the eieio instruction. Our spinlocks were implemented using CAS and fences. Our algorithms used exponential backoff [1] upon encountering conflicts.
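The backoff policy can be sketched as below; the constants and the busy-wait loop are illustrative assumptions rather than our tuned values, and CAS() stands for the LL/SC-based primitive described above.

    #include <stdlib.h>

    #define BACKOFF_MIN 1
    #define BACKOFF_MAX 4096

    extern int CAS(volatile long *addr, long oldval, long newval);

    /* Spin for a random delay, doubling the bound after each conflict. */
    static void backoff(int *bound) {
        for (volatile int n = rand() % *bound; n > 0; n--)
            ;                               /* busy-wait */
        if (*bound < BACKOFF_MAX)
            *bound *= 2;
    }

    void atomic_add(volatile long *addr, long val) {
        int bound = BACKOFF_MIN;
        for (;;) {
            long old = *addr;
            if (CAS(addr, old, old + val))  /* larx/stcx under the hood */
                return;
            backoff(&bound);                /* contention detected */
        }
    }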

aetshm ntebs case. base the in scheme eapest QSBR ha a eoesignificant become can rhead HPBR ierybtenread-only between linearly EBR

Queue QSBR HPBR EBR . clblt ihThreads with Scalability 5.3 nuigta n P saalbefrohrpoess fol- processes, other [6]. for Fraser available lowing threads, is seven CPU of one maximum preemp- that a (thread ensuring use contention we etc.), CPU migration, of tion, effects the reduce To Preemption No 5.3.1 CPU. the for compete also contention, must CPU threads no when is and there when structure s data We the sharing schemes. reclamation ory n-lmn okfe ikdls ihara-nyworkload. read-only a with list linked lock-free one-element cl lotlnal nter schemes the three in All linearly queue. almost a scale on and workload list, write-only linked a a on with workload read-only a with schemes tion e repin ernortsso u -P machine, 32. 2-CPU to 1 our from on threads of tests number our the varying ran we preemption, un- der schemes reclamation the of performance the evaluate To Preemption With 5.3.2 unaffected threads. be of number to the seems by performance relative schemes’ the nylc-rels,oetra,8CPUs. read- 8 thread, — one length list, traversal lock-free only of Effect 10. Figure FC—ra-nylc-rels,oethread, one list, 8CPUs. lock-free read-only — including LFRC length, traversal of Effect 11. Figure ocretpromnei novoscnenfrmem- for concern obvious an is performance Concurrent iue 2ad1 hwtepromneo h reclama- the of performance the show 13 and 12 Figures iue1 hw h efrac fteshmso a on schemes the of performance the shows 14 Figure

Avg CPU Time (ns) Avg CPU Time (ns) 1000 1500 2000 2500 3000 3500 4000 4500 5000 10000 12000 14000 16000 18000 20000 500 2000 4000 6000 8000 0 0 0 0 EBR QSBR HPBR LFRC EBR QSBR HPBR 20 20 Number ofElements Number ofElements 40 40 a-nycase. ead-only uyteefc fthreads of effect the tudy 60 60 80 80 nbt cases, both In 100 100 450 800 400 700 350 600 300 250 500 200 400 150 300 100 HPBR 200 HPBR Avg CPU Time (ns) QSBR QSBR 50 EBR Avg Execution Time (ns) 100 EBR 0 0 1 2 3 4 5 6 7 5 10 15 20 25 30 Number of Threads Number of Threads

Figure 12. Effect of adding threads — read- Figure 15. Effect of preemption — lock-free only lock-free list, one element, 8 CPUs. queue, 2 CPUs.

4000 90000 HPBR 80000 HPBR 3500 QSBR QSBR 70000 3000 EBR EBR 60000 2500 50000 2000 40000 1500 30000 1000 20000 Avg CPU Time (ns)

500 Avg Execution Time (ns) 10000 0 0 1 2 3 4 5 6 7 0 5 10 15 20 25 30 35 Number of Threads Number of Threads

Figure 13. Effect of adding threads — lock- Figure 16. Effect of busy waiting when out of free queue, 8 CPUs. memory — lock-free queue, 2 CPUs.

This case eliminates reclamation overhead, focusing solely reclamation. on read-side and fuzzy barrier overhead. In this case, the Although this busy waiting would be a poor design algorithms all scale well, with QSBR remaining cheapest. choice in a real application, this test demonstrates that pre- For the write-heavy workloads shown in Figure 15, HPBR emption and write-heavy workloads can cause QSBR and performs best due to its non-blocking design. EBR to exhaust all memory. Similarly, Sarma and McKen- The blocking schemes perform well on this write-heavy ney [27] have shown how QSBR-based components of the workload only because threads yield the processor upon al- Linux kernel are vulnerable to denial-of-service attacks. Al- location failure. Figure 16 shows the same test as Figure 15, though this can be avoided with engineering effort – and has but with busy-waiting upon allocation failure. Here, HPBR been, in Linux – it is in these situations that HPBR’s fault- performs well, but EBR and QSBR quickly exhaust the pool tolerance becomes valuable. of free memory. Each thread spins waiting for more mem- ory to become free, thereby preventing grace periods from 5.4 Summary completing in a timely manner and hence delaying memory We note several trends of interest. First, in the base case, atomic instructions such as fences are the dominant cost.

300 HPBR Second, when large numbers of elements must be traversed, QSBR HPBR and reference counting suffer from significant over- 250 EBR head due to extra atomic instructions. QSBR and EBR per- 200 form poorly when grace periods are stalled and the work- 150 load is update-intensive. 100

50 Avg Execution Time (ns) 0 6 Consequences of Analysis 5 10 15 20 25 30 Number of Threads We describe the consequences of our analysis for com- Figure 14. Effect of preemption — read-only paring algorithms, designing new reclamation schemes, and lock-free list, 2 CPUs. choosing reclamation schemes for applications.


800 350 300 600 250 200 400 150 HPBR LF-HPBR 100 QSBR Avg CPU Time (ns) 200 Avg CPU Time (ns) LF-QSBR EBR CR-QSBR 50 NEBR 0 0 0 0.2 0.4 0.6 0.8 1 1 2 3 4 5 6 7 Update Fraction Number of Threads

Figure 17. Lock-free versus concurrently- Figure 19. Performance of NEBR — lock-free readable algorithms — ten-element lists, one list, 8 CPUs, read-only workload, variable thread, 8 CPUs. number of threads.

1000 LF-QSBR concurrently-readable algorithms using a common reclama- CR-QSBR tion scheme. Here, the lock-free algorithm out-performs 800 locking for update fractions above about 15%. Lock-free 600 lists and hash might therefore be practical for update-heavy 400 situations in environments providing QSBR, such as OS

Avg CPU Time (ns) 200 kernels like Linux. New reclamation schemes should also be evaluated by 0 0 0.2 0.4 0.6 0.8 1 varying each of the factors that can affect their performance. Update Fraction For example, Gidenstam et al. [8] recently proposed a new non-blocking reclamation scheme that combines reference Figure 18. Lock-free versus concurrently- counting with HPBR, and can be proven to have several readable algorithms — hash tables with load attractive properties. However, like HPBR and reference factor 1, four threads, 8 CPUs. counting, it requires expensive per-node atomic operations. The evaluation of this scheme consisted only of experiments 6.1 Fair Evaluation of Algorithms on double-ended queues, thus failing to evaluate scalability with data-structure size, an HPBR weakness. This failing Reclamation schemes have profound performance ef- shows the value of our analysis: it is necessary to vary the fects that must be accounted for when experimentally eval- experimental parameters we considered to gain a full under- uating new lockless algorithms. standing of a given scheme’s performance. For example, one of our early experiments compared a concurrently-readable linked list with QSBR, as is used in 6.2 Improving Reclamation Performance the Linux kernel, with a lock-free HPBR equivalent. Our intuition was that lock-free algorithms might pay a perfor- Improved reclamation schemes can be designed based mance penalty for their fault-tolerance properties, as they on an understanding of the factors that affect performance. do in read-mostly situations. The LF-HPBR and CR-QSBR For example, we observe that a key difference between traces in Figure 17 might lead to the erroneous conclusion QSBR and EBR is the per-operation overhead of EBR’s that the concurrently-readable algorithm is always faster. A two fences. This observation allows us to make a modest better analysis takes the LF-QSBR trace into account, not- improvement to EBR called new epoch-based-reclamation ing that as the update fraction increases, lock-free perfor- (NEBR). mance improves, since its updates require fewer atomic in- NEBR requires compromising EBR’s application- structions than does locking. This example shows that one independence. Instead of setting and clearing the flag at can accurately compare two lockless algorithms only when the start and end of every lockless operation, we set it at the each is using the same reclamation scheme. application level before entering any code that might con- LF-QSBR’s higher per-node overhead makes it more tain NEBR critical regions. Since our flag is set and cleared attractive when there are fewer nodes. Figure 18 shows at the application level, we can amortize the overhead of the performance of hash tables consisting of arrays of LF- the corresponding fence instructions over a larger number QSBR or CR-QSBR single-element lists being concurrently of operations. We reran the experiment shown in Figure 12, accessed by four threads. For clarity, we omit HPBR from but including NEBR, and found that NEBR scaled linearly this graph — our intent is to compare the lock-free and and performed slightly better than did HPBR. Furthermore,


6.3 Blocking Memory Reclamation for 8 Conclusions Non-Blocking Data Structures We have performed the first fair comparison of blocking We have shown that non-blocking data structures often and non-blocking reclamation across a range of workloads, perform better when using blocking reclamation schemes. showing that reclamation has a huge effect on lockless al- One might question why one would want to use a non- gorithm performance. Choosing the right scheme for the blocking data structure in this case, since a halted would environment in which a concurrent algorithm is expected to cause an infinite memory leak, thus destroying the non- run is essential to having the algorithm perform well. blocking data structure’s fault-tolerance guarantees. Our results show that QSBR is usually the best- However, non-blocking data structures are often used for performing reclamation scheme; however, the performance reasons other than fault-tolerance; for example, Qprof [3] of both QSBR and EBR can suffer due to memory exhaus- and Cache Kernel [10] both use such structures because tion in the face of thread preemption or failure. HPBR and they can be accessed from signal handlers without risk of EBR have higher base costs than QSBR due to their re- self-deadlock. Blocking memory reclamation does not re- quired fences. For EBR, the worst-case overhead of fences move this benefit. In fact, Cache Kernel uses a blocking im- is constant, while for HPBR it is unbounded. HPBR and plementation of Type-Stable Memory to guard against read- LFRC scale poorly when many nodes must be traversed. reclamation races; its implementation [9] has similarities Our analysis helped us to identify the main source of to QSBR. Non-blocking algorithms with blocking reclama- overhead in EBR and decrease it, resulting in NEBR. Fur- tion schemes similarly continue to benefit from resistance thermore, understanding the impact of reclamation schemes to preemption-induced convoying and priority inversion. on algorithm performance enables fair comparison of differ- We view combining a non-blocking algorithm with a ent algorithms — in our case, lock-free and concurrently- blocking reclamation scheme as part of a trend towards readable lists andhashtables. weakened non-blocking properties [16, 3], designed to pre- We reiterate that blocking reclamation can be useful with serve selected advantages of non-blocking synchronization non-blocking algorithms: in the absence of thread fail- while improving performance. In this case, threads have all ure, non-blocking algorithms still benefit from deadlock- the advantages of non-blocking synchronization, unless the freedom, signal handler safety, and avoidance of priority system runs out of memory. inversion. Nevertheless, the question of what sort of weak- Conversely, one may also use non-blocking reclamation ened non-blocking property could be satisfied by a reclama- with blocking algorithms to reduce the amount of memory tion scheme without the per-node overhead of current non- awaiting reclamation in face of preempted or failed threads. blocking reclamation scheme designs remains open.

7 Related Work 9 Acknowledgments Relevant work on reclamation scheme design was dis- cussed in Section 2. Previous work on the performance of We owe thanks to Maged Michael and Keir Fraser for these schemes, however, is limited. helpful comments on their respective work, to Faith Fich Michael [24] criticized QSBR for its unbounded mem- and Cristiana Amza for much helpful advice, and to Dan ory use, but did not compare the performance of QSBR to Frye for his support of this effort. We are indebted to Martin that of HPBR, or determine when this limitation affects a Bligh, Andy Whitcroft, and the ABAT team for the use of program. the 8-CPU machine used in our experiments.

Tornado: tion for lock-free objects. IEEE Trans. on Parallel and Dis- Maximizing locality and concurrency in a shared memory tributed Syst., 15(6):491–504, June 2004. multiprocessor operating system. In Proc. 3rd Symp. on Op- [25] M. M. Michael. Scalable lock-free dynamic memory allo- erating Syst. Design and Impl., pages 87–100, 1999. cation. In Proc. ACM Conf. on Programming Language De- [8] A. Gidenstam, M. Papatriantafilou, H. Sundell, and P. Tsi- sign and Implementation, pages 35–46, June 2004. gas. Efficient and reliable lock-free memory reclamation [26] M. M. Michael and M. L. Scott. Correction of a memory based on reference counting. In Proc. 8th I-SPAN, 2005. management method for lock-free data structures. Technical [9] M. Greenwald. Non-blocking Synchronization and System Report TR599, Computer Science Department, University Design. PhD thesis, Stanford University, 1999. of Rochester, Dec. 1995. [10] M. Greenwald and D. Cheriton. The synergy between non- [27] D. Sarma and P. E. McKenney. Issues with selected scala- blocking synchronization and operating system structure. In bility features of the 2.6 kernel. In Ottawa Linux Symp.,July Proc. 2nd USENIX Symp. on Operating Syst. Design and 2004. Impl., pages 123–136. ACM Press, 1996. [28] H. Sundell. Wait-free reference counting and memory man- [11] R. Gupta. The fuzzy barrier: a mechanism for high speed agement. In Proc. 19th Intl. Parallel and Distributed Pro- synchronization of processors. In Proc. 3rd Intl. Conf. on cessing Symp., Apr. 2005. Arch. Support for Prog. Lang. and Operating Syst. (ASP- [29] J. D. Valois. Lock-free linked lists using compare-and-swap. LOS), pages 54–63, New York, NY, USA, 1989. ACM Press. In Proc. 14th ACM Symp. on Principles of Distributed Com- [12] T. L. Harris. A pragmatic implementation of non-blocking puting, pages 214–222, Aug. 1995. linked-lists. In Proc. 15th Intl. Conf. on Distributed Com- puting, pages 300–314. Springer-Verlag, 2001. [13] M. Herlihy. Wait-free synchronization. ACM Trans. Prog. Lang. Syst., 13(1):124–149, 1991.
