Efficient Userspace Optimistic Spinning Locks

Waiman Long, Principal Software Engineer
Plumbers Conference, Sep 2019

Locking in Userspace

Most Linux applications use the synchronization primitives provided by the pthread library, such as:
● Mutex - pthread_mutex_lock()
● Read/write lock - pthread_rwlock_wrlock()/pthread_rwlock_rdlock()
● Condition variable - pthread_cond_signal()/pthread_cond_wait()
● Spin lock - pthread_spin_lock()

The top three are sleeping locks, whereas the last one is a spinning lock, as the name implies. A number of large enterprise applications (e.g. Oracle, SAP HANA), however, implement their own locking code to eke out a bit more performance than can be provided by the pthread library. Those locking libraries generally use the kernel futex(2) API to implement the sleeping locks; a minimal sketch of such a futex-based mutex follows below. The System V IPC API has also been used to support userspace locking in the past, but using futex(2) is generally more performant.
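For reference, a wait-wake futex sleeping mutex of the kind such locking libraries build on might look like the following minimal sketch. This is illustrative only; ww_mutex_t and sys_futex() are made-up names rather than part of any of the libraries mentioned above.

#define _GNU_SOURCE
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Futex word states: 0 = unlocked, 1 = locked with no waiters,
 * 2 = locked, possibly with waiters. */
typedef struct { _Atomic uint32_t state; } ww_mutex_t;

static long sys_futex(_Atomic uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void ww_mutex_lock(ww_mutex_t *m)
{
    uint32_t zero = 0;

    /* Uncontended fast path: 0 -> 1 without entering the kernel. */
    if (atomic_compare_exchange_strong_explicit(&m->state, &zero, 1,
                                                memory_order_acquire,
                                                memory_order_relaxed))
        return;

    /* Contended: mark the futex word as having waiters and sleep in the
     * kernel until the lock holder wakes us up. */
    while (atomic_exchange_explicit(&m->state, 2, memory_order_acquire) != 0)
        sys_futex(&m->state, FUTEX_WAIT_PRIVATE, 2);
}

static void ww_mutex_unlock(ww_mutex_t *m)
{
    /* If a waiter may be sleeping, call into the kernel to wake one up. */
    if (atomic_exchange_explicit(&m->state, 0, memory_order_release) == 2)
        sys_futex(&m->state, FUTEX_WAKE_PRIVATE, 1);
}

Note how every contended unlock has to call into the kernel to wake a sleeping waiter; that wakeup syscall is at the heart of the performance discussion later in this talk.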

Userspace Spinning vs. Sleeping Locks

It is generally advised that spinning locks should only be used in userspace when the lock critical sections are short and the lock is not likely to be heavily contended. Both spinning locks and sleeping locks have their own problems, as shown below:

Spinning Lock:
● Lock cacheline contention
● Lock starvation (unfairness)
● A long wait wastes a lot of CPU cycles

Sleeping Lock:
● Waiter wakeup latency (in 10s to 100s of us)
● Lock starvation

The kernel's queued spinlock solves the spinlock cacheline contention and unfairness problems by putting the lock waiters in a queue, each spinning on its own cacheline. A similar type of queued spinlock may not be applicable to userspace* because of the lock owner and waiter preemption problems, which aren't an issue in the kernel where preemption can be disabled.


* There was a proposed patch to implement a queued spinlock in glibc.

Userspace Spinning vs. Sleeping Locks (Cont)

● The sleeping lock wakeup latency limits the maximum locking rate on any given lock, which can be a problem in large multithreaded applications that can have many threads contending on the same lock.
● For applications that need to scale up to a large number of threads on systems with hundreds of CPUs, there is a dilemma in deciding whether a spinning lock or a sleeping lock should be used for a certain type of data that needs to be protected.
● In fact, some large enterprise applications have performance benchmark tests that can stress a spin lock with hundreds of threads. This can result in over 50% of CPU time spent in the spin lock code itself if the system isn't properly tuned.
● A typical userspace spin lock implementation can be a regular TAS spinlock with exponential backoff* to reduce lock cacheline contention (see the sketch after the footnote below).
● It will be hard to implement a fair and efficient userspace spinning lock without assistance from the kernel.


* There was a proposed kernel patch to do exponential backoff on top of the ticket spinlock.
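Such a TAS spinlock with exponential backoff might look roughly like the sketch below. This is illustrative only: it assumes C11 atomics and the x86 pause instruction, and the backoff bounds are arbitrary tuning values.

#include <stdatomic.h>
#include <immintrin.h>   /* _mm_pause() on x86 */

typedef struct { atomic_int locked; } tas_lock_t;

#define BACKOFF_MIN   8
#define BACKOFF_MAX   4096

static void tas_lock(tas_lock_t *l)
{
    int backoff = BACKOFF_MIN;

    for (;;) {
        /* Only attempt the atomic exchange when the lock looks free, to
         * limit cacheline bouncing (test-and-test-and-set). */
        if (!atomic_load_explicit(&l->locked, memory_order_relaxed) &&
            !atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
            return;

        /* Exponential backoff: spin locally before touching the lock
         * cacheline again. */
        for (int i = 0; i < backoff; i++)
            _mm_pause();
        if (backoff < BACKOFF_MAX)
            backoff <<= 1;
    }
}

static void tas_unlock(tas_lock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}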

Optimistic Spinning Futex

● There are two main types of futexes for userspace locking implementations – wait-wake (WW) futexes and priority inversion (PI) futexes.
● Most userspace locking libraries use WW futexes for their locking implementation. PI futexes are only used in special cases where real-time processes require a bounded latency response from the system.
● Both types of futexes require lock waiters to sleep in the kernel until the lock holder calls into the kernel to wake them up.
● The optimistic spinning (OS) futex, formerly the throughput-optimized (TP) futex, is a new type of futex that is a hybrid spinning/sleeping lock. It uses a kernel mutex to form a waiting queue, with the mutex owner being the head of the queue. The head waiter will spin on the futex as long as the lock owner is running; otherwise, the head waiter will go to sleep.
● Both userspace mutexes and rwlocks can be supported by the OS futex, but not condition variables.

Why Optimistic Spinning Futex

In terms of performance, there are two main problems with the use of WW futexes for userspace locks:
1) Lock waiters are put to sleep until the lock holder calls into the kernel to wake them up after releasing the lock, so there is the wakeup latency. In addition, the act of calling into the kernel to do the wakeup also delays the task from doing other useful work in userspace.
2) With a heavily contended futex, the userspace locking performance will be constrained by the hash bucket spinlock in the kernel. With a large number of contending threads, it is possible that more than 50% of the CPU cycles can be spent waiting for the spinlock*.

With an OS futex, the lock waiters in the kernel are not sleeping unless the lock holder has slept. Instead, they are spinning in the kernel waiting for the lock to be released. Consequently, the lock holder doesn't need to go into the kernel to wake up the waiters after releasing the lock. This allows the tasks to continue doing their work without delay and avoids the kernel hash bucket lock contention problem. Once the lock is free, a lock waiter can immediately grab the lock in the kernel or return to userspace to acquire it.


* The NUMA qspinlock patch was shown to be able to improve database performance because the kernel hash bucket lock was granted to the waiters in a NUMA-friendly manner.

OS Futex Design Principles

The design goal of OS futexes is to maximize the locking throughput while still maintaining a certain amount of fairness and deterministic latency (no lock starvation). To improve locking throughput:
1) Optimistic spinning – spin on the lock owner while it is running on a CPU. If the lock owner is sleeping, the head lock waiter will sleep too. In this case, the lock owner will need to go into the kernel to wake up the lock waiter at unlock time.
2) Lock stealing – a known performance enhancement technique, provided that a safeguard is in place to prevent lock starvation. As long as the head waiter in the kernel is not sleeping and hence has not set the FUTEX_WAITERS bit, lockers from userspace are free to steal the lock.
3) Userspace locking – OS futexes have the option to use either userspace locking or kernel locking. WW futexes support only userspace locking. Userspace locking provides better performance (shorter lock hold time), but is also more prone to lock starvation.

To combat lock starvation, OS futexes also have a built-in lock hand-off mechanism that forces the lock to be handed off to the head waiter in the kernel after a certain time threshold. This is similar to what the kernel mutexes and rw semaphores are doing.

Using OS Futexes for Userspace Mutexes

OS futexes are similar to PI futexes in how they should be used. A thread in userspace will need to atomically put its thread ID into the futex word to get an exclusive lock. If that fails, the thread can choose to issue the following futex(2) call to try to lock the futex:

futex(uaddr, FUTEX_LOCK, flags, timeout, NULL, 0);

The flags value can be FUTEX_OS_USLOCK to make the lock waiter return to userspace to retry locking, or it can be 0 to acquire the lock directly in the kernel before going back to userspace. At unlock time, the task can simply atomically clear the futex word as long as the FUTEX_WAITERS bit isn't set. If it is set, the task has to issue the following futex(2) call to do the unlock:

futex(uaddr, FUTEX_UNLOCK, 0, NULL, NULL, 0);

It is relatively simple to use an OS futex to implement userspace mutexes, as the sketch below shows.
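The sketch below follows the futex word protocol just described. Note that FUTEX_LOCK, FUTEX_UNLOCK and FUTEX_OS_USLOCK come from the proposed OS futex kernel patch rather than a mainline kernel, so the opcode values defined here are only placeholders that would have to match the patched kernel in use.

#define _GNU_SOURCE
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>            /* FUTEX_WAITERS */

/* Placeholder opcodes/flags: defined by the OS futex kernel patch,
 * not by mainline <linux/futex.h>. */
#define FUTEX_LOCK        0x100
#define FUTEX_UNLOCK      0x101
#define FUTEX_OS_USLOCK   0x1

typedef struct { _Atomic uint32_t futex; } os_mutex_t;

static long sys_futex(_Atomic uint32_t *uaddr, int op, uint32_t flags,
                      const struct timespec *timeout)
{
    return syscall(SYS_futex, uaddr, op, flags, timeout, NULL, 0);
}

static void os_mutex_lock(os_mutex_t *m)
{
    uint32_t zero = 0;
    uint32_t tid = (uint32_t)syscall(SYS_gettid);

    /* Fast path: atomically put our thread ID into an unowned futex word. */
    if (atomic_compare_exchange_strong_explicit(&m->futex, &zero, tid,
                                                memory_order_acquire,
                                                memory_order_relaxed))
        return;

    /* Slow path: with flags == 0 the kernel acquires the lock for us before
     * returning; FUTEX_OS_USLOCK would make us retry in userspace instead. */
    sys_futex(&m->futex, FUTEX_LOCK, 0, NULL);
}

static void os_mutex_unlock(os_mutex_t *m)
{
    uint32_t expected = (uint32_t)syscall(SYS_gettid);

    /* Fast path: clear the futex word as long as FUTEX_WAITERS isn't set. */
    if (atomic_compare_exchange_strong_explicit(&m->futex, &expected, 0,
                                                memory_order_release,
                                                memory_order_relaxed))
        return;

    /* FUTEX_WAITERS is set: let the kernel hand the lock to the head waiter. */
    sys_futex(&m->futex, FUTEX_UNLOCK, 0, NULL);
}

The cmpxchg fast path in os_mutex_lock() is also what permits lock stealing: as long as the head waiter in the kernel has not slept and set FUTEX_WAITERS, any userspace locker can grab a free futex word ahead of the queued waiters.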

Using OS Futexes for Userspace Rwlocks

OS futexes support shared locking for implementing userspace rwlocks. A special upper bit in the futex word (FUTEX_SHARED_FLAG) is used to indicate shared locking. The lower 24 bits of the futex word are then used for the reader count. In order to acquire a shared lock, the task has to atomically put (FUTEX_SHARED_FLAG + 1) into the futex word. If the FUTEX_SHARED_FLAG bit has already been set, the task just needs to atomically increment the reader count. If the futex is exclusively owned, the following futex(2) call can be used to acquire a shared lock in the kernel:

futex(uaddr, FUTEX_LOCK_SHARED, flags, timeout, NULL, 0);

The supported flag is FUTEX_OS_PREFER_READER, which makes the kernel code prefer readers a bit more than writers. At unlock time, a task can just atomically decrement the reader count. If the count reaches 0, it has to atomically clear the FUTEX_SHARED_FLAG as well. It gets a bit more complicated if the FUTEX_WAITERS bit is set, as a FUTEX_UNLOCK_SHARED futex(2) call needs to be used and the task has to be sure that it is the only read-owner doing the unlock. A reader-side sketch follows below.
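The reader-side fast paths described above might look like the sketch below. Again, FUTEX_LOCK_SHARED, FUTEX_UNLOCK_SHARED and FUTEX_SHARED_FLAG come from the proposed OS futex patch; the bit position and opcode values used here are placeholders, and the "only read-owner" handling of the contended unlock is omitted.

#define _GNU_SOURCE
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>            /* FUTEX_WAITERS */

/* Placeholder opcodes/bits: defined by the OS futex kernel patch. */
#define FUTEX_LOCK_SHARED     0x102
#define FUTEX_UNLOCK_SHARED   0x103
#define FUTEX_SHARED_FLAG     0x10000000u   /* "special upper bit" for shared locking */
#define FUTEX_READER_MASK     0x00ffffffu   /* lower 24 bits: reader count */

typedef struct { _Atomic uint32_t futex; } os_rwlock_t;

static long sys_futex(_Atomic uint32_t *uaddr, int op, uint32_t flags)
{
    return syscall(SYS_futex, uaddr, op, flags, NULL, NULL, 0);
}

static void os_rwlock_read_lock(os_rwlock_t *rw)
{
    uint32_t old = atomic_load_explicit(&rw->futex, memory_order_relaxed);

    for (;;) {
        uint32_t new;

        if (old == 0)
            new = FUTEX_SHARED_FLAG + 1;   /* first reader sets the shared flag */
        else if (old & FUTEX_SHARED_FLAG)
            new = old + 1;                 /* already shared: bump the reader count */
        else
            break;                         /* exclusively owned: take the slow path */

        if (atomic_compare_exchange_weak_explicit(&rw->futex, &old, new,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return;
    }

    /* Acquire the shared lock in the kernel; FUTEX_OS_PREFER_READER could
     * be passed as the flags value. */
    sys_futex(&rw->futex, FUTEX_LOCK_SHARED, 0);
}

static void os_rwlock_read_unlock(os_rwlock_t *rw)
{
    uint32_t old = atomic_load_explicit(&rw->futex, memory_order_relaxed);

    for (;;) {
        uint32_t new;

        /* With kernel waiters queued, the unlock has to go through
         * FUTEX_UNLOCK_SHARED (the "only read-owner" check is omitted). */
        if (old & FUTEX_WAITERS) {
            sys_futex(&rw->futex, FUTEX_UNLOCK_SHARED, 0);
            return;
        }

        new = old - 1;
        if ((new & FUTEX_READER_MASK) == 0)
            new &= ~FUTEX_SHARED_FLAG;     /* last reader clears the shared flag */

        if (atomic_compare_exchange_weak_explicit(&rw->futex, &old, new,
                                                  memory_order_release,
                                                  memory_order_relaxed))
            return;
    }
}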

Userspace Mutex Performance Data

A mutex locking microbenchmark was run on a 2-socket, 48-core, 96-thread x86-64 system with a 5.3-based kernel. 96 mutex locking threads with a short critical section were run for 10s using the WW futex, the OS futex and the glibc pthread_mutex.

Futex Type                                 WW                 OS                 Glibc
Total locking ops                          155,646,231        187,826,170        89,292,228
Per-thread average ops/s (std deviation)   162,114 (0.53%)    195,632 (1.83%)    93,005 (0.43%)
Futex lock syscalls                        8,219,362          7,453,248          N/A
Futex unlock syscalls                      13,792,295         19                 N/A
# of kernel locks                          N/A                276,442            N/A

The OS futex saw a 20.7% increase in locking rate when compared with the WW futex. One characteristic of the OS futex is the small number of futex unlock syscalls that are issued.

Mutex Performance Chart

[Chart: "Mutex Locking Rates" – average per-thread locking rate (0 to 200,000) of the WW, Glibc and OS futexes versus critical section length* (10 to 100).]

* The number of pause instructions run in the critical section.

Userspace Rwlock Performance Data 1

A rwlock microbenchmark was run on a 2-socket, 48-core, 96-thread x86-64 system with a 5.3-based kernel. 96 rwlock locking threads doing an equal number of read and write locks with a short critical section were run for 10s.

Futex Type                                 WW                 OS                 Glibc
Total locking ops                          41,849,400         130,361,414        45,124,138
Per-thread average ops/s (std deviation)   43,579 (0.17%)     135,779 (1.01%)    47,003 (0.60%)
Futex lock syscalls                        8,557,148          6,065,240          N/A
Futex unlock syscalls                      12,775,288         10                 N/A
# of kernel locks                          N/A                605,817            N/A

The OS futex saw a 212% increase in locking rate when compared with the WW futex.

Userspace Rwlock Performance Data 2

A rwlock microbenchmark was run on a 2-socket, 48-core, 96-thread x86-64 system with a 5.3-based kernel. 48 writer threads and 48 reader threads with a short critical section were run for 10s.

Futex Type (Prefer readers)      WW                 OS                 Glibc
Per-thread writer locking rate   0                  39,076 (1.03%)     0
Per-thread reader locking rate   149,132 (0.42%)    138,714 (0.55%)    117,498 (0.57%)

Futex Type (Prefer writers)      WW                 OS                 Glibc
Per-thread writer locking rate   11,092 (0.23%)     44,916 (1.01%)     41,208 (0.64%)
Per-thread reader locking rate   67,600 (0.51%)     141,420 (0.61%)    87 (5.09%)

The OS futex lock handoff mechanism ensures that lock starvation will not happen.

Future Works Ahead

● Find a user for OS futexes, e.g. glibc or applications like PostgreSQL.
● Add a mutex_lock() variant that supports a timeout, as futex(2) supports a timeout argument.
● Investigate various alternatives for doing more userspace optimistic spinning before going into the kernel and their impact on overall performance. For example, a bit in the futex word can be used to indicate if the lock owner is running. This bit can be set by the lock waiters spinning in the kernel, or it can be set/cleared by the scheduler during context switches.


Q & A

Thank you