Efficient Userspace Optimistic Spinning Locks

Waiman Long, Principal Software Engineer, Red Hat
Linux Plumbers Conference, September 2019

Locking in Userspace

Most Linux applications use the synchronization primitives provided by the pthread library, such as:

● Mutex – pthread_mutex_lock()
● Read/write lock – pthread_rwlock_wrlock()/pthread_rwlock_rdlock()
● Condition variable – pthread_cond_signal()/pthread_cond_wait()
● Spin lock – pthread_spin_lock()

The top three are sleeping locks, whereas the last one is a spinning lock, as the name implies. A number of large enterprise applications (e.g. Oracle, SAP HANA), however, implement their own locking code to eke out a bit more performance than the pthread library can provide. Those locking libraries generally use the kernel futex(2) API to implement the sleeping locks. The System V semaphore API has also been used to support userspace locking in the past, but using futex(2) is generally more performant.

Userspace Spinning vs. Sleeping Locks

It is generally advised that a spinning lock should only be used in userspace when the lock critical sections are short and the lock is not likely to be heavily contended. Both spinning locks and sleeping locks have their own problems, as shown below:

Spinning lock:
● Lock cacheline contention
● Lock starvation (unfairness)
● A long wait can waste a lot of CPU cycles

Sleeping lock:
● Waiter wakeup latency (in 10s to 100s of us)
● Lock starvation

The Linux kernel solves the spinlock cacheline contention and unfairness problems by putting the lock waiters in a queue, each spinning on its own cacheline. A similar type of queued spinlock may not be applicable to userspace* because of the lock owner and waiter preemption problems, which aren't an issue in the kernel since preemption can be disabled there.

* There was a proposed patch to implement a queued spinlock in glibc.

Userspace Spinning vs. Sleeping Locks (Cont)

● The sleeping lock wakeup latency limits the maximum locking rate on any given lock, which can be a problem in large multithreaded applications that may have many threads contending on the same lock.
● For applications that need to scale up to a large number of threads on systems with hundreds of CPUs, there is a dilemma in deciding whether a spinning lock or a sleeping lock should be used for a certain type of data that needs to be protected.
● In fact, some large enterprise applications have performance benchmark tests that can stress a spin lock with hundreds of threads. This can result in over 50% of CPU time being spent in the spin lock code itself if the system isn't properly tuned.
● A typical userspace spin lock implementation is a regular test-and-set (TAS) spinlock with exponential backoff* to reduce lock cacheline contention (see the sketch after this list).
● It will be hard to implement a fair and efficient userspace spinning lock without assistance from the kernel.

* There was a proposed kernel patch to do exponential backoff on top of the ticket spinlock.
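As an illustration of that approach, below is a minimal sketch of a TAS spinlock with exponential backoff. This is not code from the talk; the backoff bounds and the cpu_relax() helper are illustrative choices.

#include <stdatomic.h>

#define MIN_DELAY 8          /* illustrative backoff bounds, not tuned */
#define MAX_DELAY 1024

typedef struct { atomic_int locked; } tas_lock_t;

static inline void cpu_relax(void)
{
    __builtin_ia32_pause();  /* x86 PAUSE; substitute a no-op elsewhere */
}

static void tas_lock(tas_lock_t *lock)
{
    unsigned int delay = MIN_DELAY;

    /* Spin until the test-and-set succeeds, doubling the pause between
     * attempts so contending CPUs hammer the lock cacheline less. */
    while (atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire)) {
        for (unsigned int i = 0; i < delay; i++)
            cpu_relax();
        if (delay < MAX_DELAY)
            delay <<= 1;
    }
}

static void tas_unlock(tas_lock_t *lock)
{
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}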
Optimistic Spinning Futex

● There are two main types of futexes for userspace locking implementations – the wait-wake (WW) futex and the priority inheritance (PI) futex.
● Most userspace locking libraries use WW futexes for their locking implementation. PI futexes are only used in special cases where real-time processes require a bounded latency response from the system.
● Both types of futexes require lock waiters to sleep in the kernel until the lock holder calls into the kernel to wake them up.
● The optimistic spinning (OS) futex, formerly the throughput-optimized (TP) futex, is a new type of futex that is a hybrid spinning/sleeping lock. It uses a kernel mutex to form a waiting queue, with the mutex owner being the head of the queue. The head waiter will spin on the futex as long as the lock owner is running; otherwise, the head waiter will go to sleep.
● Both userspace mutexes and rwlocks can be supported by the OS futex, but not the condition variable.

Why Optimistic Spinning Futex

In terms of performance, there are two main problems with the use of WW futexes for userspace locks:

1) Lock waiters are put to sleep until the lock holder calls into the kernel to wake them up after releasing the lock, so there is the wakeup latency. In addition, the act of calling into the kernel to do the wakeup also delays the task from doing other useful work in userspace.
2) With a heavily contended futex, the userspace locking performance will be constrained by the hash bucket spinlock in the kernel. With a large number of contending threads, it is possible for more than 50% of the CPU cycles to be spent waiting for that spinlock*.

With an OS futex, the lock waiters in the kernel are not sleeping unless the lock holder has slept. Instead, they spin in the kernel waiting for the lock to be released. Consequently, the lock holder doesn't need to go into the kernel to wake up the waiters after releasing the lock. This allows the tasks to continue doing their work without any delay and avoids the kernel hash bucket lock contention problem. Once the lock is free, a lock waiter can immediately grab the lock in the kernel or return to userspace to acquire it.

* The NUMA qspinlock patch was shown to be able to improve database performance because the kernel hash bucket lock was granted to the waiters in a NUMA-friendly manner.

OS Futex Design Principles

The design goal of OS futexes is to maximize locking throughput while still maintaining a certain amount of fairness and deterministic latency (no lock starvation). To improve locking throughput:

1) Optimistic spinning – spin on the lock owner while it is running on a CPU. If the lock owner is sleeping, the head lock waiter will sleep too. In this case, the lock owner will need to go into the kernel to wake up the lock waiter at unlock time.
2) Lock stealing – a known performance enhancement technique, provided that safeguards are in place to prevent lock starvation. As long as the head waiter in the kernel is not sleeping and hence has not set the FUTEX_WAITERS bit, lockers from userspace are free to steal the lock (see the sketch at the end of this section).
3) Userspace locking – OS futexes have the option to use either userspace locking or kernel locking. WW futexes support only userspace locking. Userspace locking provides better performance (shorter lock hold times), but is also more prone to lock starvation.

To combat lock starvation, OS futexes also have a built-in lock hand-off mechanism that forces the lock to be handed off to the head waiter in the kernel after a certain time threshold. This is similar to what the kernel mutexes and rw semaphores are doing.
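As a rough illustration, the userspace side of that lock-stealing rule could look like the sketch below, assuming the mutex futex-word layout described in the next section (owner TID in the word, FUTEX_WAITERS in the top bit). The function name and loop structure are hypothetical, not from the patchset.

#include <stdatomic.h>
#include <stdint.h>

#define FUTEX_WAITERS 0x80000000u  /* set by a head waiter that sleeps */

/* Try to acquire (or steal) the lock in userspace.  The CAS succeeds
 * only when the futex word is completely clear, i.e. the lock is free
 * and no sleeping head waiter has set FUTEX_WAITERS.  Callers can
 * retry this in a spin loop, stealing the lock ahead of the spinning
 * kernel waiters each time the owner releases it. */
static int try_lock_or_steal(_Atomic uint32_t *futex, uint32_t tid)
{
    uint32_t old = 0;

    return atomic_compare_exchange_strong_explicit(
                futex, &old, tid,
                memory_order_acquire, memory_order_relaxed);
}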
Using OS Futexes for Userspace Mutexes

OS futexes are similar to PI futexes in how they should be used. A thread in userspace will need to atomically put its thread ID into the futex word to get an exclusive lock. If that fails, the thread can choose to issue the following futex(2) call to try to lock the futex:

  futex(uaddr, FUTEX_LOCK, flags, timeout, NULL, 0);

The flags value can be FUTEX_OS_USLOCK to make the lock waiter return to userspace to retry locking, or it can be 0 to acquire the lock directly in the kernel before going back to userspace. At unlock time, the task can simply atomically clear the futex word as long as the FUTEX_WAITERS bit isn't set. If it is set, the task has to issue the following futex(2) call to do the unlock:

  futex(uaddr, FUTEX_UNLOCK, 0, NULL, NULL, 0);

It is relatively simple to use an OS futex to implement userspace mutexes.
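Putting the steps above together, a rough sketch of such a mutex follows. FUTEX_LOCK and FUTEX_UNLOCK come from the proposed OS futex patchset rather than mainline <linux/futex.h>, so the opcode values below are placeholders.

#define _GNU_SOURCE
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#define FUTEX_LOCK     0xff01      /* placeholder opcode values */
#define FUTEX_UNLOCK   0xff02
#define FUTEX_WAITERS  0x80000000u

static long sys_futex(uint32_t *uaddr, int op, uint32_t flags,
                      void *timeout)
{
    return syscall(SYS_futex, uaddr, op, flags, timeout, NULL, 0);
}

static void os_mutex_lock(_Atomic uint32_t *futex)
{
    uint32_t tid = (uint32_t)syscall(SYS_gettid);
    uint32_t expected = 0;

    /* Fast path: atomically put our TID into a free futex word. */
    if (atomic_compare_exchange_strong(futex, &expected, tid))
        return;
    /* Slow path: flags = 0 acquires the lock in the kernel before
     * returning; FUTEX_OS_USLOCK would retry in userspace instead. */
    sys_futex((uint32_t *)futex, FUTEX_LOCK, 0, NULL);
}

static void os_mutex_unlock(_Atomic uint32_t *futex)
{
    uint32_t expected = (uint32_t)syscall(SYS_gettid);

    /* Fast path: clear the word if FUTEX_WAITERS hasn't been set. */
    if (atomic_compare_exchange_strong(futex, &expected, 0))
        return;
    /* A waiter is queued in the kernel; let the kernel hand it off. */
    sys_futex((uint32_t *)futex, FUTEX_UNLOCK, 0, NULL);
}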
Using OS Futexes for Userspace Rwlocks

OS futexes support shared locking for implementing userspace rwlocks. A special upper bit in the futex word (FUTEX_SHARED_FLAG) is used to indicate shared locking, and the lower 24 bits of the futex word are then used for the reader count. In order to acquire a shared lock, the task has to atomically put (FUTEX_SHARED_FLAG + 1) into the futex word. If the FUTEX_SHARED_FLAG bit has already been set, the task just needs to atomically increment the reader count. If the futex is exclusively owned, the following futex(2) call can be used to acquire a shared lock in the kernel:

  futex(uaddr, FUTEX_LOCK_SHARED, flags, timeout, NULL, 0);

The supported flag is FUTEX_OS_PREFER_READER, which makes the kernel code prefer readers a bit more than writers. At unlock time, a task can just atomically decrement the reader count. If the count reaches 0, it has to atomically clear the FUTEX_SHARED_FLAG as well. It gets a bit more complicated if the FUTEX_WAITERS bit is set, as a FUTEX_UNLOCK_SHARED futex(2) call needs to be used, and the task has to be sure that it is the only read-owner doing the unlock.
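The reader-side fast paths can then be sketched as below, reusing the sys_futex() wrapper and FUTEX_WAITERS definition from the mutex sketch above. FUTEX_LOCK_SHARED, FUTEX_UNLOCK_SHARED, and the exact bit position of FUTEX_SHARED_FLAG are from the proposed patchset; the numeric values here are placeholders.

/* Continues the mutex sketch above (same includes and sys_futex()). */
#define FUTEX_LOCK_SHARED    0xff03       /* placeholder opcode values */
#define FUTEX_UNLOCK_SHARED  0xff04
#define FUTEX_SHARED_FLAG    0x01000000u  /* placeholder bit position  */
#define READER_COUNT_MASK    0x00ffffffu  /* low 24 bits: reader count */

static void os_read_lock(_Atomic uint32_t *futex)
{
    uint32_t old = atomic_load(futex);

    for (;;) {
        uint32_t new;

        if (old == 0)                      /* free: become first reader */
            new = FUTEX_SHARED_FLAG + 1;
        else if (old & FUTEX_SHARED_FLAG)  /* reader-owned: join in */
            new = old + 1;
        else {                             /* writer-owned: kernel slow path */
            sys_futex((uint32_t *)futex, FUTEX_LOCK_SHARED, 0, NULL);
            return;
        }
        if (atomic_compare_exchange_weak(futex, &old, new))
            return;
    }
}

static void os_read_unlock(_Atomic uint32_t *futex)
{
    uint32_t old = atomic_load(futex);

    for (;;) {
        uint32_t new;

        if (old & FUTEX_WAITERS) {         /* waiters queued: kernel unlocks */
            sys_futex((uint32_t *)futex, FUTEX_UNLOCK_SHARED, 0, NULL);
            return;
        }
        /* The last reader clears FUTEX_SHARED_FLAG along with its count. */
        new = ((old & READER_COUNT_MASK) == 1) ? 0 : old - 1;
        if (atomic_compare_exchange_weak(futex, &old, new))
            return;
    }
}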
Userspace Mutex Performance Data

A mutex locking microbenchmark was run on a 2-socket, 48-core, 96-thread x86-64 system with a 5.3-based kernel. 96 mutex locking threads with a short critical section were run for 10s using a WW futex, an OS futex, and the glibc pthread_mutex:

Futex type                           WW                 OS                 Glibc
Total locking ops                    155,646,231        187,826,170        89,292,228
Per-thread average ops/s (std dev)   162,114 (0.53%)    195,632 (1.83%)    93,005 (0.43%)
Futex lock syscalls                  8,219,362          7,453,248          N/A
Futex unlock syscalls                13,792,295         19                 N/A
# of kernel locks                    N/A                276,442            N/A

The OS futex saw a 20.7% increase in locking rate when compared with the WW futex. One notable characteristic of the OS futex is the very small number of futex unlock syscalls that need to be issued.
