Efficient Userspace Optimistic Spinning Locks


Waiman Long, Principal Software Engineer, Red Hat
Linux Plumbers Conference, Sep 2019

Locking in Userspace

Most Linux applications use the synchronization primitives provided by the pthread library, such as:
● Mutex – pthread_mutex_lock()
● Read/write lock – pthread_rwlock_wrlock()/pthread_rwlock_rdlock()
● Condition variable – pthread_cond_signal()/pthread_cond_wait()
● Spin lock – pthread_spin_lock()

The top three are sleeping locks, whereas the last one is a spinning lock, as the name implies. A number of large enterprise applications (e.g. Oracle, SAP HANA), however, implement their own locking code to eke out a bit more performance than the pthread library can provide. Those locking libraries generally use the kernel futex(2) API to implement the sleeping locks. The System V semaphore API has also been used to support userspace locking in the past, but using futex(2) is generally more performant.

Userspace Spinning vs. Sleeping Locks

It is generally advised that a spinning lock should only be used in userspace when the lock critical sections are short and the lock is not likely to be heavily contended. Both spinning and sleeping locks have their own problems, as shown below:

Spinning lock:
● Lock cacheline contention
● Lock starvation (unfairness)
● A long wait can waste a lot of CPU cycles

Sleeping lock:
● Waiter wakeup latency (in 10s to 100s of us)
● Lock starvation

The Linux kernel solves the spin lock cacheline contention and unfairness problems by putting the lock waiters in a queue, each spinning on its own cacheline. A similar type of queued spin lock may not be applicable to userspace* because of the lock owner and waiter preemption problems, which aren't an issue in the kernel where preemption can be disabled.

* There was a proposed patch to implement a queued spinlock in glibc.

Userspace Spinning vs. Sleeping Locks (Cont)

● The sleeping lock wakeup latency limits the maximum locking rate on any given lock, which can be a problem in large multithreaded applications that may have many threads contending on the same lock.
● For applications that need to scale up to a large number of threads on systems with hundreds of CPUs, there is a dilemma in deciding whether a spinning lock or a sleeping lock should be used for a certain type of data that needs to be protected.
● In fact, some large enterprise applications have performance benchmark tests that can stress a spin lock with hundreds of threads. This can result in over 50% of CPU time being spent in the spin lock code itself if the system isn't properly tuned.
● A typical userspace spin lock implementation can be a regular TAS spinlock with exponential backoff* to reduce lock cacheline contention (see the sketch after this slide).
● It will be hard to implement a fair and efficient userspace spinning lock without assistance from the kernel.

* There was a proposed kernel patch to do exponential backoff on top of the ticket spinlock.
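A minimal sketch of such a TAS spinlock with exponential backoff, using C11 atomics (the type, constant, and function names here are illustrative and not taken from any particular locking library):

#include <stdatomic.h>

/* Illustrative test-and-set spinlock with exponential backoff.
 * Initialize with: tas_lock l = { ATOMIC_FLAG_INIT }; */
typedef struct { atomic_flag locked; } tas_lock;

#define BACKOFF_MIN 4u      /* initial spin iterations between TAS attempts */
#define BACKOFF_MAX 4096u   /* cap keeps worst-case backoff latency bounded */

static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();                 /* x86 PAUSE instruction */
#else
    __asm__ __volatile__("" ::: "memory");  /* plain compiler barrier */
#endif
}

static void tas_lock_acquire(tas_lock *l)
{
    unsigned backoff = BACKOFF_MIN;

    /* Each failed test-and-set pulls the lock cacheline over in exclusive
     * mode, so back off for exponentially longer after every failure. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire)) {
        for (unsigned i = 0; i < backoff; i++)
            cpu_relax();
        if (backoff < BACKOFF_MAX)
            backoff <<= 1;
    }
}

static void tas_lock_release(tas_lock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}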
Optimistic Spinning Futex

● There are two main types of futexes for userspace locking implementations – wait-wake (WW) futexes and priority inversion (PI) futexes.
● Most userspace locking libraries use the WW futexes for their locking implementation. PI futexes are only used in special cases where real-time processes require a bounded latency response from the system.
● Both types of futexes require lock waiters to sleep in the kernel until the lock holder calls into the kernel to wake them up.
● The optimistic spinning (OS) futex, formerly the throughput-optimized (TP) futex, is a new type of futex that is a hybrid spinning/sleeping lock. It uses a kernel mutex to form a waiting queue, with the mutex owner being the head of the queue. The head waiter will spin on the futex as long as the lock owner is running; otherwise, the head waiter will go to sleep.
● Both userspace mutexes and rwlocks can be supported by the OS futex, but not the condition variable.

Why Optimistic Spinning Futex

In terms of performance, there are two main problems with the use of WW futexes for userspace locks:
1) Lock waiters are put to sleep until the lock holder calls into the kernel to wake them up after releasing the lock, so there is the wakeup latency. In addition, the act of calling into the kernel to do the wakeup also delays the task from doing other useful work in userspace.
2) With a heavily contended futex, the userspace locking performance will be constrained by the hash bucket spinlock in the kernel. With a large number of contending threads, it is possible for more than 50% of the CPU cycles to be spent waiting for that spinlock*.

With an OS futex, the lock waiters in the kernel are not sleeping unless the lock holder has slept. Instead, they are spinning in the kernel waiting for the lock to be released. Consequently, the lock holder doesn't need to go into the kernel to wake up the waiters after releasing the lock. This allows the tasks to continue doing their work without any delay and avoids the kernel hash bucket lock contention problem. Once the lock is free, the lock waiter can immediately grab the lock in the kernel or return to userspace to acquire it.

* The NUMA qspinlock patch was shown to be able to improve database performance because the kernel hash bucket lock was granted to the waiters in a NUMA-friendly manner.

OS Futex Design Principles

The design goal of OS futexes is to maximize the locking throughput while still maintaining a certain amount of fairness and deterministic latency (no lock starvation). To improve locking throughput:
1) Optimistic spinning – spin on the lock owner while it is running on a CPU. If the lock owner is sleeping, the head lock waiter will sleep too. In this case, the lock owner will need to go into the kernel to wake up the lock waiter at unlock time.
2) Lock stealing – a known performance enhancement technique, provided that a safeguard is in place to prevent lock starvation. As long as the head waiter in the kernel is not sleeping, and hence has not set the FUTEX_WAITERS bit, lockers from userspace are free to steal the lock.
3) Userspace locking – OS futexes have the option to use either userspace locking or kernel locking. WW futexes support only userspace locking. Userspace locking provides better performance (shorter lock hold time), but is also more prone to lock starvation.

To combat lock starvation, OS futexes also have a built-in lock hand-off mechanism to force lock handoff to the head waiter in the kernel after a certain time threshold. This is similar to what the kernel mutexes and rw semaphores are doing.

Using OS Futexes for Userspace Mutexes

OS futexes are similar to PI futexes in how they should be used. A thread in userspace will need to atomically put its thread ID into the futex word to get an exclusive lock. If that fails, the thread can choose to issue the following futex(2) call to try to lock the futex:

futex(uaddr, FUTEX_LOCK, flags, timeout, NULL, 0);

The flags can be FUTEX_OS_USLOCK to make the lock waiter return to userspace to retry locking, or 0 to acquire the lock directly in the kernel before going back to userspace. At unlock time, the task can simply atomically clear the futex word as long as the FUTEX_WAITERS bit isn't set. If it is set, it has to issue the following futex(2) call to do the unlock:

futex(uaddr, FUTEX_UNLOCK, 0, NULL, NULL, 0);

It is relatively simple to use an OS futex to implement userspace mutexes.
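Putting the pieces above together, a minimal sketch of such a userspace mutex might look as follows. Note that FUTEX_LOCK, FUTEX_UNLOCK and FUTEX_WAITERS come from the out-of-tree OS/TP futex patch set described in these slides, not from mainline <linux/futex.h>; the numeric values below are placeholders standing in for the patched kernel headers, and the struct and function names are illustrative assumptions.

#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Opcodes and waiter bit from the proposed OS futex patch set (NOT mainline);
 * the values here are placeholders for the real patched <linux/futex.h>. */
#define FUTEX_LOCK      0xff01
#define FUTEX_UNLOCK    0xff02
#define FUTEX_WAITERS   0x80000000u

struct os_mutex { _Atomic uint32_t word; };   /* word = owner TID + flag bits */

static void os_mutex_lock(struct os_mutex *m)
{
    uint32_t expected = 0;
    uint32_t tid = (uint32_t)syscall(SYS_gettid);

    /* Fast path: atomically install our TID into a free futex word. */
    if (atomic_compare_exchange_strong(&m->word, &expected, tid))
        return;

    /* Slow path: queue up in the kernel.  With flags = 0 the kernel
     * acquires the lock on our behalf before returning to userspace. */
    syscall(SYS_futex, &m->word, FUTEX_LOCK, 0, NULL, NULL, 0);
}

static void os_mutex_unlock(struct os_mutex *m)
{
    uint32_t expected = (uint32_t)syscall(SYS_gettid);

    /* Fast path: clear the word if no waiter has set FUTEX_WAITERS. */
    if (atomic_compare_exchange_strong(&m->word, &expected, 0))
        return;

    /* FUTEX_WAITERS is set: ask the kernel to hand off the lock. */
    syscall(SYS_futex, &m->word, FUTEX_UNLOCK, 0, NULL, NULL, 0);
}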
Using OS Futexes for Userspace Rwlocks

OS futexes support shared locking for implementing userspace rwlocks. A special upper bit in the futex word (FUTEX_SHARED_FLAG) is used to indicate shared locking. The lower 24 bits of the futex word are then used for the reader count. In order to acquire a shared lock, the task has to atomically put (FUTEX_SHARED_FLAG + 1) into the futex word. If the FUTEX_SHARED_FLAG bit has already been set, the task just needs to atomically increment the reader count. If the futex is exclusively owned, the following futex(2) call can be used to acquire a shared lock in the kernel:

futex(uaddr, FUTEX_LOCK_SHARED, flags, timeout, NULL, 0);

The supported flag is FUTEX_OS_PREFER_READER to make the kernel code prefer readers a bit more than writers. At unlock time, a task can just atomically decrement the reader count. If the count reaches 0, it has to atomically clear the FUTEX_SHARED_FLAG as well. It gets a bit more complicated if the FUTEX_WAITERS bit is set, as a FUTEX_UNLOCK_SHARED futex(2) call needs to be used and the task has to be sure that it is the only read-owner that does the unlock.

Userspace Mutex Performance Data

A mutex locking microbenchmark was run on a 2-socket, 48-core, 96-thread x86-64 system with a 5.3-based kernel. 96 mutex locking threads with a short critical section were run for 10s using the WW futex, the OS futex and the Glibc pthread_mutex.

Futex type                           WW futex           OS futex           Glibc
Total locking ops                    155,646,231        187,826,170        89,292,228
Per-thread average ops/s (std dev)   162,114 (0.53%)    195,632 (1.83%)    93,005 (0.43%)
Futex lock syscalls                  8,219,362          7,453,248          N/A
Futex unlock syscalls                13,792,295         19                 N/A
# of kernel locks                    N/A                276,442            N/A

The OS futex saw a 20.7% increase in locking rate when compared with the WW futex. One characteristic of the OS futex is the small number of futex unlock syscalls that need to be issued.
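For contrast with the WW futex numbers above, a minimal wait-wake futex mutex (the classic three-state scheme built on the mainline FUTEX_WAIT/FUTEX_WAKE operations; the type and function names are illustrative) looks roughly like the sketch below. Every contended unlock has to issue a FUTEX_WAKE syscall, which is why the WW futex unlock syscall count in the table is so much higher than the OS futex one.

#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Classic three-state wait-wake futex mutex:
 * 0 = unlocked, 1 = locked (no waiters), 2 = locked (waiters present). */
struct ww_futex_mutex { atomic_int state; };

static void ww_futex_mutex_lock(struct ww_futex_mutex *m)
{
    int c = 0;

    /* Uncontended fast path: 0 -> 1, no syscall needed. */
    if (atomic_compare_exchange_strong(&m->state, &c, 1))
        return;

    /* Contended: advertise waiters (state 2), then sleep in the kernel. */
    if (c != 2)
        c = atomic_exchange(&m->state, 2);
    while (c != 0) {
        syscall(SYS_futex, &m->state, FUTEX_WAIT_PRIVATE, 2, NULL, NULL, 0);
        c = atomic_exchange(&m->state, 2);
    }
}

static void ww_futex_mutex_unlock(struct ww_futex_mutex *m)
{
    /* If the old state was 2, someone may be sleeping: wake one waiter.
     * This extra syscall on every contended unlock is the wakeup cost
     * that the OS futex avoids while its head waiter keeps spinning. */
    if (atomic_fetch_sub(&m->state, 1) != 1) {
        atomic_store(&m->state, 0);
        syscall(SYS_futex, &m->state, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
    }
}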