Locks, Lock Based Data Structures  Review: Proportional Share Scheduler – Ch

TCSS 422 A – Spring 2018 4/24/2018 Institute of Technology TCSS 422: OPERATING SYSTEMS OBJECTIVES Quiz 2 – Scheduling Review Three Easy Pieces: Assignment 1 – MASH Shell Locks, Lock Based Data Structures Review: Proportional Share Scheduler – Ch. 9 Review: Concurrency: Introduction – Ch. 26 Review: Linux Thread API – Ch. 27 Wes J. Lloyd Locks – Ch. 28 Institute of Technology Lock Based Data Structures – Ch. 29 University of Washington - Tacoma TCSS422: Operating Systems [Spring 2018] TCSS422: Operating Systems [Spring 2018] April 23, 2018 April 23, 2018 L5.2 Institute of Technology, University of Washington - Tacoma Institute of Technology, University of Washington - Tacoma LOCKS Ensure critical section(s) are executed atomically-as a unit . Only one thread is allowed to execute a critical section at any given time . Ensures the code snippets are “mutually exclusive” CHAPTER 28 – Protect a global counter: LOCKS A “critical section”: TCSS422: Operating Systems [Spring 2018] TCSS422: Operating Systems [Spring 2018] April 23, 2018 April 23, 2018 L7.4 Institute of Technology, University of Washington - Tacoma L5.3 Institute of Technology, University of Washington - Tacoma LOCKS - 2 LOCKS - 3 Lock variables are called “MUTEX” pthread_mutex_lock(&lock) . Short for mutual exclusion (that’s what they guarantee) . Try to acquire lock . If lock is free, calling thread will acquire the lock Lock variables store the state of the lock . Thread with lock enters critical section . Thread “owns” the lock States . Locked (acquired or held) No other thread can acquire the lock before the owner . Unlocked (available or free) releases it. Only 1 thread can hold a lock TCSS422: Operating Systems [Spring 2018] TCSS422: Operating Systems [Spring 2018] April 23, 2018 L7.5 April 23, 2018 L7.6 Institute of Technology, University of Washington - Tacoma Institute of Technology, University of Washington - Tacoma Slides by Wes J. Lloyd L5.1 TCSS 422 A – Spring 2018 4/24/2018 Institute of Technology LOCKS - 4 FINE GRAINED? Program can have many mutex (lock) variables to Is this code a good example of “fine grained parallelism”? “serialize” many critical sections pthread_mutex_lock(&lock); a = b++; Locks are also used to protect data structures b = a * c; *d = a + b +c; . Prevent multiple threads from changing the same data FILE * fp = fopen ("file.txt", “r"); fscanf(fp, "%s %s %s %d", str1, str2, str3, &e); simultaneously ListNode *node = mylist->head; Int i=0 . Programmer can make sections of code “granular” while (node) { . Fine grained – means just one grain of sand at a time through an node->title = str1; node->subheading = str2; hour glass node->desc = str3; node->end = *e; . Similar to relational database transactions node = node->next; . i++ DB transactions prevent multiple users from modifying a table, } row, field e = e – i; pthread_mutex_unlock(&lock); TCSS422: Operating Systems [Spring 2018] TCSS422: Operating Systems [Spring 2018] April 23, 2018 L7.7 April 23, 2018 L7.8 Institute of Technology, University of Washington - Tacoma Institute of Technology, University of Washington - Tacoma FINE GRAINED PARALLELISM EVALUATING LOCK IMPLEMENTATIONS pthread_mutex_lock(&lock_a); pthread_mutex_lock(&lock_b); Correctness a = b++; pthread_mutex_unlock(&lock_b); . Does the lock work? pthread_mutex_unlock(&lock_a); . Are critical sections mutually exclusive? pthread_mutex_lock(&lock_b); (atomic-as a unit?) b = a * c; pthread_mutex_unlock(&lock_b); pthread_mutex_lock(&lock_d); Fairness *d = a + b +c; pthread_mutex_unlock(&lock_d); . Are threads competing for a lock have a fair chance of acquiring it? FILE * fp = fopen ("file.txt", “r"); pthread_mutex_lock(&lock_e); fscanf(fp, "%s %s %s %d", str1, str2, str3, &e); pthread_mutex_unlock(&lock_e); Overhead ListNode *node = mylist->head; int i=0 . TCSS422: Operating Systems [Spring 2018] TCSS422: Operating Systems [Spring 2018] April 23, 2018 L7.9 April 23, 2018 L7.10 Institute of Technology, University of Washington - Tacoma Institute of Technology, University of Washington - Tacoma BUILDING LOCKS HISTORICAL IMPLEMENTATION Locks require hardware support To implement mutual exclusion . Disable interrupts upon entering critical sections . To minimize overhead, ensure fairness and correctness . Special “atomic-as a unit” instructions to support lock implementation . Atomic-as a unit exchange instruction . XCHG Any thread could disable system-wide interrupt . What if lock is never released? . Compare and exchange instruction On a multiprocessor processor each CPU has its own interrupts . CMPXCHG . Do we disable interrupts for all cores simultaneously? . CMPXCHG8B While interrupts are disabled, they could be lost . CMPXCHG16B . If not queued… TCSS422: Operating Systems [Spring 2018] TCSS422: Operating Systems [Spring 2018] April 23, 2018 L7.11 April 23, 2018 L7.12 Institute of Technology, University of Washington - Tacoma Institute of Technology, University of Washington - Tacoma Slides by Wes J. Lloyd L5.2 TCSS 422 A – Spring 2018 4/24/2018 Institute of Technology SPIN LOCK IMPLEMENTATION DIY: CORRECT? Operate without atomic-as a unit assembly instructions Correctness requires luck… (e.g. DIY lock is incorrect) “Do-it-yourself” Locks Is this lock implementation: Correct? Fair? Performant? Here both threads have “acquired” the lock simultaneously TCSS422: Operating Systems [Spring 2018] TCSS422: Operating Systems [Spring 2018] April 23, 2018 L7.13 April 23, 2018 L7.14 Institute of Technology, University of Washington - Tacoma Institute of Technology, University of Washington - Tacoma DIY: PERFORMANT? TEST-AND-SET INSTRUCTION C implementation: not atomic void lock(lock_t *mutex) . Adds a simple check to basic spin lock { . One a single core CPU system with preemptive scheduler: while (mutex->flag == 1); // while lock is unavailable, wait… mutex->flag = 1; . Try this… } What is wrong with while(<cond>); ? Spin-waiting wastes time actively waiting for another thread lock() method checks that TestAndSet doesn’t return 1 while (1); will “peg” a CPU core at 100% Comparison is in the caller . Continuously loops, and evaluates mutex->flag value… Single core systems are becoming scarce . Generates heat… Try on a one-core VM TCSS422: Operating Systems [Spring 2018] TCSS422: Operating Systems [Spring 2018] April 23, 2018 L7.15 April 23, 2018 L7.16 Institute of Technology, University of Washington - Tacoma Institute of Technology, University of Washington - Tacoma DIY: TEST-AND-SET - 2 SPIN LOCK EVALUATION Requires a preemptive scheduler on single CPU core system Correctness: Lock is never released without a context switch . Spin locks guarantee: critical sections won’t be executed 1-core VM: occasionally will deadlock, doesn’t miscount simultaneously by (2) threads Fairness: . No fairness guarantee. Once a thread has a lock, nothing forces it to relinquish it… Performance: . Spin locks perform “busy waiting” . Spin locks are best for short periods of waiting . Performance is slow when multiple threads share a CPU . Especially for long periods TCSS422: Operating Systems [Spring 2018] TCSS422: Operating Systems [Spring 2018] April 23, 2018 L7.17 April 23, 2018 L7.18 Institute of Technology, University of Washington - Tacoma Institute of Technology, University of Washington - Tacoma Slides by Wes J. Lloyd L5.3 TCSS 422 A – Spring 2018 4/24/2018 Institute of Technology COMPARE AND SWAP COMPARE AND SWAP Checks that the lock variable has the expected value FIRST, Compare and Swap before changing its value . If so, make assignment . Return value at location Adds a comparison to TestAndSet Spin lock usage 1-core VM: Count is correct, no deadlock Useful for wait-free synchronization . Supports implementation of shared data structures which can be updated atomically (as a unit) using the HW support CompareAndSwap instruction X86 provides “cmpxchgl” compare-and-exchange instruction . Shared data structure updates become “wait-free” . cmpxchg8b . Upcoming in Chapter 32 . cmpxchg16b TCSS422: Operating Systems [Spring 2018] TCSS422: Operating Systems [Spring 2018] April 23, 2018 L7.19 April 23, 2018 L7.20 Institute of Technology, University of Washington - Tacoma Institute of Technology, University of Washington - Tacoma TWO MORE “LOCK BUILDING” LL/SC LOCK CPU INSTRUCTIONS Cooperative instructions used together to support synchronization on RISC systems No support on x86 processors . Supported by RISC: Alpha, PowerPC, ARM Load-linked (LL) . Loads value into register . Same as typical load . Used as a mechanism to track competition LL instruction loads pointer value (ptr) SC only stores if the load link pointer has not changed Store-conditional (SC) Requires HW support . Performs “mutually exclusive” store . C code is psuedo code . Allows only one thread to store value TCSS422: Operating Systems [Spring 2018] TCSS422: Operating Systems [Spring 2018] April 23, 2018 L7.21 April 23, 2018 L7.22 Institute of Technology, University of Washington - Tacoma Institute of Technology, University of Washington - Tacoma LL/SC LOCK - 2 HARDWARE SPIN LOCKS - SUMMARY Simple, correct Slow With long locks, waiting threads spin for entire timeslice . Repeat comparison continuously . Busy waiting How To Avoid Spinning? Two instruction lock Need both HW & OS Support ! TCSS422: Operating Systems [Spring 2018] TCSS422: Operating Systems [Winter 2018] April 23, 2018 L7.23 April 23, 2018 L7.24 Institute of Technology, University of Washington - Tacoma Institute of Technology, University of Washington

Locks, Lock Based Data Structures  Review: Proportional Share Scheduler – Ch

A Fast Concurrent and Resizable Robin Hood Hash Table Endrias

Intel Threading Building Blocks

16 Concurrent Hash Tables: Fast and General(?)!

Read-Copy-Update for Helenos (Slides)

Threading for Performance with Intel® Threading Building Blocks

Concurrent Data Structures

29. Lock-Based Concurrent Data Structures Operating System: Three Easy Pieces

Blocking and Non-Blocking Concurrent Hash Tables in Multi-Core Systems

Phase-Concurrent Hash Tables for Determinism

Fast, Dynamically-Sized Concurrent Hash Table⋆

Concurrent Hash Tables: Fast and General?(!)

Scalable Concurrent Hash Tables Via Relativistic Programming