TCSS 422 A – Spring 2018 4/24/2018 Institute of Technology
TCSS 422: OPERATING SYSTEMS
Wes J. Lloyd
Institute of Technology, University of Washington - Tacoma

OBJECTIVES
. Quiz 2 – Scheduling
. Assignment 1 – MASH Shell
. Review Three Easy Pieces: Locks, Lock-Based Data Structures
. Review: Proportional Share Scheduler – Ch. 9
. Review: Concurrency: Introduction – Ch. 26
. Review: Linux Thread API – Ch. 27
. Locks – Ch. 28
. Lock-Based Data Structures – Ch. 29
TCSS422: Operating Systems [Spring 2018] – April 23, 2018
Institute of Technology, University of Washington - Tacoma
LOCKS
. Ensure critical section(s) are executed atomically (as a unit)
. Only one thread is allowed to execute a critical section at any given time
. Ensures the code snippets are "mutually exclusive"
CHAPTER 28 – LOCKS

Protect a global counter – a "critical section":
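Protecting a global counter can be sketched in C with a pthread mutex. A minimal sketch; the names counter, counter_lock, and the helper functions are illustrative, not taken from the slides:

```c
#include <pthread.h>

/* Illustrative example: a global counter guarded by a mutex. */
static int counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

void increment_counter(void) {
    pthread_mutex_lock(&counter_lock);   /* enter critical section */
    counter = counter + 1;               /* only one thread runs this at a time */
    pthread_mutex_unlock(&counter_lock); /* leave critical section */
}

int get_counter(void) {
    pthread_mutex_lock(&counter_lock);
    int v = counter;
    pthread_mutex_unlock(&counter_lock);
    return v;
}
```

Without the lock, two threads could both read the same old value and lose an update; with it, the read-modify-write is mutually exclusive.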
LOCKS - 2
. Lock variables are called "MUTEX" – short for mutual exclusion (that's what they guarantee)
. Lock variables store the state of the lock
. States:
  . Locked (acquired or held)
  . Unlocked (available or free)
. Only one thread can hold a lock

LOCKS - 3
. pthread_mutex_lock(&lock)
  . Try to acquire the lock
  . If the lock is free, the calling thread will acquire it
  . The thread with the lock enters the critical section; the thread "owns" the lock
  . No other thread can acquire the lock before the owner releases it
Slides by Wes J. Lloyd
LOCKS - 4
. A program can have many mutex (lock) variables to "serialize" many critical sections
. Locks are also used to protect data structures
  . Prevent multiple threads from changing the same data simultaneously
  . The programmer can make sections of code "granular"
  . Fine grained – means just one grain of sand at a time through an hourglass
  . Similar to relational database transactions: DB transactions prevent multiple users from modifying a table, row, or field simultaneously

FINE GRAINED?
. Is this code a good example of "fine grained parallelism"?

pthread_mutex_lock(&lock);
a = b++;
b = a * c;
*d = a + b + c;
FILE *fp = fopen("file.txt", "r");
fscanf(fp, "%s %s %s %d", str1, str2, str3, &e);
ListNode *node = mylist->head;
int i = 0;
while (node) {
    node->title = str1;
    node->subheading = str2;
    node->desc = str3;
    node->end = *e;
    node = node->next;
    i++;
}
e = e - i;
pthread_mutex_unlock(&lock);
FINE GRAINED PARALLELISM

pthread_mutex_lock(&lock_a);
a = b++;
pthread_mutex_unlock(&lock_a);

pthread_mutex_lock(&lock_b);
b = a * c;
pthread_mutex_unlock(&lock_b);

pthread_mutex_lock(&lock_d);
*d = a + b + c;
pthread_mutex_unlock(&lock_d);

FILE *fp = fopen("file.txt", "r");
pthread_mutex_lock(&lock_e);
fscanf(fp, "%s %s %s %d", str1, str2, str3, &e);
pthread_mutex_unlock(&lock_e);

ListNode *node = mylist->head;
int i = 0;
. . .

EVALUATING LOCK IMPLEMENTATIONS
. Correctness
  . Does the lock work?
  . Are critical sections mutually exclusive? (atomic – as a unit?)
. Fairness
  . Do threads competing for a lock have a fair chance of acquiring it?
. Overhead
BUILDING LOCKS
. Locks require hardware support
  . To minimize overhead, ensure fairness and correctness
  . Special "atomic" (as a unit) instructions to support lock implementation
  . Atomic exchange instruction: XCHG
  . Compare and exchange instructions: CMPXCHG, CMPXCHG8B, CMPXCHG16B

HISTORICAL IMPLEMENTATION
. To implement mutual exclusion: disable interrupts upon entering critical sections
. Any thread could disable system-wide interrupts
  . What if the lock is never released?
. On a multiprocessor, each CPU has its own interrupts
  . Do we disable interrupts for all cores simultaneously?
. While interrupts are disabled, they could be lost
  . If not queued…
SPIN LOCK IMPLEMENTATION
. Operate without atomic (as a unit) assembly instructions
. "Do-it-yourself" locks
. Is this lock implementation: Correct? Fair? Performant?

DIY: CORRECT?
. Correctness requires luck… (the DIY lock is incorrect)
. Here both threads have "acquired" the lock simultaneously
DIY: PERFORMANT?
. C implementation: not atomic

void lock(lock_t *mutex)
{
    while (mutex->flag == 1); // while lock is unavailable, wait…
    mutex->flag = 1;
}

. What is wrong with while (mutex->flag == 1); ?
. Spin-waiting wastes time actively waiting for another thread
. while (1); will "peg" a CPU core at 100%
  . Continuously loops and evaluates mutex->flag value…
  . Generates heat…
. On a single-core CPU system with a preemptive scheduler: try this on a one-core VM (single-core systems are becoming scarce)

TEST-AND-SET INSTRUCTION
. Adds a simple check to the basic spin lock
. The lock() method checks that TestAndSet doesn't return 1
. The comparison is in the caller
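A test-and-set spin lock can be sketched in portable C11, where atomic_flag_test_and_set stands in for the hardware TestAndSet instruction the slides reference (the names are illustrative):

```c
#include <stdatomic.h>

/* Test-and-set spin lock sketch: atomic_flag_test_and_set atomically
   sets the flag and returns its OLD value, so a return of 1 means
   another thread already holds the lock. */
static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void spin_lock(void) {
    while (atomic_flag_test_and_set(&lock_flag))
        ;  /* spin-wait: burns CPU until the holder clears the flag */
}

void spin_unlock(void) {
    atomic_flag_clear(&lock_flag);  /* release: next test-and-set sees 0 */
}
```

The comparison happens in the caller's while loop, exactly as the slides describe: the loop exits only when test-and-set observes the old value 0.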
DIY: TEST-AND-SET - 2
. Requires a preemptive scheduler on a single-CPU-core system
  . The lock is never released without a context switch
. 1-core VM: occasionally deadlocks, but doesn't miscount

SPIN LOCK EVALUATION
. Correctness:
  . Spin locks guarantee critical sections won't be executed simultaneously by (2) threads
. Fairness:
  . No fairness guarantee. Once a thread has a lock, nothing forces it to relinquish it…
. Performance:
  . Spin locks perform "busy waiting"
  . Spin locks are best for short periods of waiting
  . Performance is slow when multiple threads share a CPU, especially for long periods
COMPARE AND SWAP
. Checks FIRST that the lock variable has the expected value before changing it
  . If so, make the assignment
  . Return the value at the location
. Adds a comparison to TestAndSet
. x86 provides the "cmpxchgl" compare-and-exchange instruction
  . cmpxchg8b
  . cmpxchg16b

COMPARE AND SWAP - 2
. Spin lock usage: on a 1-core VM, the count is correct and there is no deadlock
. Useful for wait-free synchronization
  . Supports implementation of shared data structures which can be updated atomically (as a unit) using the HW-supported CompareAndSwap instruction
  . Shared data structure updates become "wait-free"
  . Upcoming in Chapter 32
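A compare-and-swap lock can be sketched with C11's atomic_compare_exchange_strong standing in for the x86 cmpxchg instruction; a minimal sketch with illustrative names:

```c
#include <stdatomic.h>

/* CAS spin lock sketch: acquire succeeds only if the flag is observed
   to hold the expected value 0 (free), in which case it is atomically
   set to 1 (held). */
typedef struct { atomic_int flag; } cas_lock_t;

void cas_init(cas_lock_t *l) { atomic_init(&l->flag, 0); }

void cas_lock(cas_lock_t *l) {
    int expected = 0;
    /* on failure, CAS writes the observed value into expected,
       so reset it to 0 before retrying */
    while (!atomic_compare_exchange_strong(&l->flag, &expected, 1))
        expected = 0;
}

void cas_unlock(cas_lock_t *l) { atomic_store(&l->flag, 0); }
```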
TWO MORE "LOCK BUILDING" CPU INSTRUCTIONS
. Cooperative instructions used together to support synchronization on RISC systems
. No support on x86 processors; supported by RISC: Alpha, PowerPC, ARM
. Load-linked (LL)
  . Loads a value into a register (same as a typical load)
  . Used as a mechanism to track competition

. Store-conditional (SC)
  . Performs a "mutually exclusive" store
  . Allows only one thread to store the value

LL/SC LOCK
. The LL instruction loads the pointer value (ptr)
. SC stores only if the load-linked pointer has not changed
. Requires HW support
. The C code is pseudocode
LL/SC LOCK - 2
. A two-instruction lock

HARDWARE SPIN LOCKS - SUMMARY
. Simple and correct, but slow
. With long locks, waiting threads spin for an entire timeslice
  . Repeat the comparison continuously
  . Busy waiting
. How to avoid spinning? Need both HW & OS support!
FETCH-AND-ADD
. HW CPU instruction
. Increments a counter atomically (as a unit) in one instruction
  . Fetch and return the value
  . Increment by 1

TICKET LOCK
. Can build a Ticket Lock using Fetch-and-Add
. Ensures progress of all threads (fairness)
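A ticket lock built from fetch-and-add can be sketched with C11's atomic_fetch_add; a minimal sketch, with illustrative names:

```c
#include <stdatomic.h>

/* Ticket lock sketch: each acquirer takes the next ticket with a single
   fetch-and-add, then waits until "turn" reaches its ticket number.
   Tickets are served in FIFO order, which is what gives fairness. */
typedef struct {
    atomic_int ticket;  /* next ticket to hand out */
    atomic_int turn;    /* ticket currently allowed to hold the lock */
} ticket_lock_t;

void ticket_init(ticket_lock_t *l) {
    atomic_init(&l->ticket, 0);
    atomic_init(&l->turn, 0);
}

void ticket_lock(ticket_lock_t *l) {
    int myturn = atomic_fetch_add(&l->ticket, 1);  /* take a ticket */
    while (atomic_load(&l->turn) != myturn)
        ;  /* spin until it is our turn */
}

void ticket_unlock(ticket_lock_t *l) {
    atomic_fetch_add(&l->turn, 1);  /* hand the lock to the next ticket */
}
```

Unlike test-and-set, every waiter is guaranteed to eventually acquire the lock: its ticket number can only get closer to turn.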
TICKET LOCK - 2
[Diagram: TA locks first (myturn=0, ticket=1, turn=0): while (0 != 0) falls through, so TA acquires the lock and runs. TB then locks (myturn=1, ticket=2): while (0 != 1) spins, so TB waits. When TA unlocks, turn=1; TB's while (1 != 1) falls through and TB acquires the lock: ready → running.]

YIELD() – SYSTEM CALL
. Give up the CPU instead of busy waiting…
. yield() relinquishes the CPU for another thread (context switch): running → ready
. How does the thread get the CPU back?
  . The OS must opportunistically reschedule it: ready → running
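Yielding instead of spinning can be sketched by combining a test-and-set flag with POSIX sched_yield(); an illustrative sketch, not the slides' code:

```c
#include <sched.h>
#include <stdatomic.h>

/* Yield-based lock sketch: when the lock is held, give up the CPU
   (running -> ready) rather than burning the rest of the timeslice. */
static atomic_flag yl_flag = ATOMIC_FLAG_INIT;

void yield_lock(void) {
    while (atomic_flag_test_and_set(&yl_flag))
        sched_yield();  /* relinquish the CPU; the OS reschedules us later */
}

void yield_unlock(void) {
    atomic_flag_clear(&yl_flag);
}
```

This avoids pegging a core, but each failed attempt still costs a system call and a context switch, which motivates the thread-queue approach next.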
THREAD QUEUES
. Don't allow the OS to control your program – use internal thread queues
  . Allows the programmer to maintain control
  . Ensure fairness, prevent starvation
  . Better for synchronizing large numbers of threads
. Require OS support to add/remove threads to/from queue(s)
. Solaris API:
  . park(): puts the thread to sleep
  . unpark(threadID): wakes the specified thread
. Linux API: futex()

THREAD QUEUES - 2
. A guard uses a spin lock to protect the critical sections in lock() and unlock()
. lock(): obtain the guard lock, then try to obtain the actual lock
. If the lock is unavailable: add the thread to the queue
. Potential wakeup/waiting race
THREAD QUEUES - 3
. unlock(): obtain the guard lock (spin), wake up a thread from the queue, release the guard lock
. Note: no change to m->flag if unparking a thread
  . The lock is passed to the unparked thread "directly"

WAKEUP/WAITING RACE
. Thread B: a context switch occurs immediately before the call to park()
. Thread A: releases the lock, calls unpark(), but the queue is empty
. Thread B: regains context and proceeds to lock itself forever
. Need a new system call
  . setpark() – informs the OS about a soon-to-be-parked thread
  . Subsequent calls to unpark() are aware that Thread B is about to park
  . Thread B's call to park() immediately returns
FUTEX
. Fast Userspace muTEX
. Linux futex system calls are similar to park() and unpark()
. Linux uses an in-kernel queue
. Provides a futex() system call
. Provides an atomic (as a unit) compare-and-block operation
. Futex is a lower-level construct, used as a building block for:
  . mutexes, condition variables, semaphores
. Objective: reduce the number of system calls

FUTEX: WRITE YOUR OWN MUTEX LOCK
. futex_wait(addr, expected)
  . Puts the calling thread to sleep
  . If the value @ addr != expected, returns immediately
. futex_wake(addr)
  . Wakes one thread that is waiting on the queue
. These are not exposed as C library calls directly
  . Call futex() with FUTEX_WAIT or FUTEX_WAKE
. Uses a 32-bit integer
  . The leftmost bit (the +/- sign) tracks the lock state: 0 – free, 1 – locked
  . The remaining 31 bits identify the thread
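Because glibc exposes no futex wrappers, the two operations above are reached through the raw system call. A Linux-only sketch; the wrapper names are illustrative:

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch of the two futex operations the slides name, via syscall(2). */
static int futex_wait(uint32_t *addr, uint32_t expected) {
    /* Sleeps only if *addr still equals expected (atomic compare-and-block);
       otherwise returns -1 immediately with errno EAGAIN. */
    return (int)syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

static int futex_wake(uint32_t *addr, int nwake) {
    /* Wakes up to nwake threads sleeping on addr; returns how many woke. */
    return (int)syscall(SYS_futex, addr, FUTEX_WAKE, nwake, NULL, NULL, 0);
}
```

The compare-and-block check is what closes the wakeup/waiting race: if another thread changed the value between the user-space check and the sleep, the kernel refuses to block.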
HYBRID - TWO PHASE LOCKS
. Hybrid between spin locks and yielding
. Spinning is useful if the lock is about to be released
. First phase – spin lock
  . Spin for some time waiting for the lock to be released
  . If the lock is not acquired after the time expires, enter phase two
. Second phase – yield
  . The thread sleeps (yields)
  . It is awoken when the lock becomes free

CHAPTER 29 – LOCK-BASED DATA STRUCTURES
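The two-phase scheme above can be sketched by spinning a bounded number of times before yielding. An illustrative sketch; SPIN_LIMIT is a made-up tuning knob, not from the slides:

```c
#include <sched.h>
#include <stdatomic.h>

#define SPIN_LIMIT 1000  /* illustrative: how long phase one spins */

static atomic_flag tp_flag = ATOMIC_FLAG_INIT;

void two_phase_lock(void) {
    for (;;) {
        /* phase one: short spin, in case the lock is about to be freed */
        for (int i = 0; i < SPIN_LIMIT; i++)
            if (!atomic_flag_test_and_set(&tp_flag))
                return;  /* acquired while spinning */
        /* phase two: stop burning CPU and let another thread run
           (a real implementation would sleep on a queue/futex here) */
        sched_yield();
    }
}

void two_phase_unlock(void) {
    atomic_flag_clear(&tp_flag);
}
```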
OBJECTIVES
. Concurrent Data Structures
. Performance
. Lock Granularity

LOCK-BASED CONCURRENT DATA STRUCTURES
. Adding locks to data structures makes them thread safe
. Considerations:
  . Correctness
  . Performance
  . Lock granularity
COUNTER STRUCTURE W/O LOCK
. Synchronization-wary – not thread safe

CONCURRENT COUNTER
. Add a lock to the counter
. Require the lock to change data
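The locked counter can be sketched in C, modeled on the OSTEP concurrent counter structure (function names are illustrative):

```c
#include <pthread.h>

/* Concurrent counter sketch: one lock guards every access to value. */
typedef struct {
    int value;
    pthread_mutex_t lock;
} counter_t;

void counter_init(counter_t *c) {
    c->value = 0;
    pthread_mutex_init(&c->lock, NULL);
}

void counter_increment(counter_t *c) {
    pthread_mutex_lock(&c->lock);
    c->value++;
    pthread_mutex_unlock(&c->lock);
}

void counter_decrement(counter_t *c) {
    pthread_mutex_lock(&c->lock);
    c->value--;
    pthread_mutex_unlock(&c->lock);
}

int counter_get(counter_t *c) {
    pthread_mutex_lock(&c->lock);
    int v = c->value;
    pthread_mutex_unlock(&c->lock);
    return v;
}
```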
CONCURRENT COUNTER - 2
. Decrease counter
. Get value

CONCURRENT COUNTERS - PERFORMANCE
. iMac: four-core Intel 2.7 GHz i5 CPU
. Each thread increments the counter 1,000,000 times
. Traditional vs. sloppy counter
  . Sloppy threshold (S) = 1024
. The synchronized counter scales poorly.
PERFECT SCALING
. Achieve N× performance gain with N additional resources
. Throughput: transactions per second
  . 1 core: N = 100 tps
  . 10 cores: N = 1000 tps

SLOPPY COUNTER
. Provides a single logical shared counter
  . Implemented using local counters for each CPU core
    . 4 CPU cores = 4 local counters & 1 global counter
  . Local counters are synchronized via local locks
  . The global counter is updated periodically
    . The global counter has a lock to protect its value
    . Sloppiness threshold (S): update threshold of the global counter with local values
  . Small S: more updates, more overhead
  . Large S: fewer updates, more performant, less synchronized
. Why this implementation? Why do we want counters local to each CPU core?
SLOPPY COUNTER - 2
. Update threshold (S) = 5
. Synchronized across four CPU cores
. Threads update local CPU counters

THRESHOLD VALUE S
. Consider 4 threads, each incrementing a counter 1,000,000 times
. Low S: what is the consequence?
. High S: what is the consequence?
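The sloppy counter can be sketched in C, modeled on the OSTEP sloppy counter; NUMCPUS and the function names are illustrative assumptions:

```c
#include <pthread.h>

#define NUMCPUS 4  /* illustrative core count */

/* Sloppy counter sketch: per-CPU local counters flush into the global
   counter once they reach the sloppiness threshold S. */
typedef struct {
    int global;                      /* global count */
    pthread_mutex_t glock;           /* global lock */
    int local[NUMCPUS];              /* per-CPU counts */
    pthread_mutex_t llock[NUMCPUS];  /* per-CPU locks */
    int threshold;                   /* sloppiness S */
} sloppy_t;

void sloppy_init(sloppy_t *c, int threshold) {
    c->threshold = threshold;
    c->global = 0;
    pthread_mutex_init(&c->glock, NULL);
    for (int i = 0; i < NUMCPUS; i++) {
        c->local[i] = 0;
        pthread_mutex_init(&c->llock[i], NULL);
    }
}

/* cpu identifies the caller's core (e.g. thread id % NUMCPUS) */
void sloppy_update(sloppy_t *c, int cpu, int amt) {
    pthread_mutex_lock(&c->llock[cpu]);
    c->local[cpu] += amt;
    if (c->local[cpu] >= c->threshold) {  /* flush local -> global */
        pthread_mutex_lock(&c->glock);
        c->global += c->local[cpu];
        pthread_mutex_unlock(&c->glock);
        c->local[cpu] = 0;
    }
    pthread_mutex_unlock(&c->llock[cpu]);
}

/* approximate: ignores counts not yet flushed from local counters */
int sloppy_get(sloppy_t *c) {
    pthread_mutex_lock(&c->glock);
    int v = c->global;
    pthread_mutex_unlock(&c->glock);
    return v;
}
```

A low S flushes often (accurate but more global-lock contention); a high S flushes rarely (fast but the global value lags), which is exactly the trade-off the slide asks about.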
SLOPPY COUNTER - EXAMPLE
. Example implementation
. Also with CPU affinity

CONCURRENT LINKED LIST - 1
. Simplification – only basic list operations shown
. Structs and initialization:
CONCURRENT LINKED LIST - 2
. Insert – adds an item to the list
. Everything is critical!
  . There are two unlocks

CONCURRENT LINKED LIST - 3
. Lookup – checks the list for the existence of an item with a key
. Once again, everything is critical
  . Note – there are also two unlocks
CONCURRENT LINKED LIST
. First implementation:
  . Locks everything inside Insert() and Lookup()
  . If malloc() fails, the lock must be released
    . Research has shown "exception-based control flow" to be error prone
    . 40% of Linux OS bugs occur in rarely taken code paths
    . Unlocking in an exception handler is considered poor coding practice
  . There is nothing specifically wrong with this example, however

CCL – SECOND IMPLEMENTATION
. Init and Insert
. Second implementation…
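The second implementation can be sketched as follows: malloc() happens outside the lock, so a failed allocation never needs an unlock in an error path. A sketch in the OSTEP style; names are illustrative:

```c
#include <pthread.h>
#include <stdlib.h>

/* Single-lock concurrent list sketch: the critical section in insert
   covers only the head update, so malloc failure releases no lock. */
typedef struct node { int key; struct node *next; } node_t;
typedef struct { node_t *head; pthread_mutex_t lock; } list_t;

void list_init(list_t *L) {
    L->head = NULL;
    pthread_mutex_init(&L->lock, NULL);
}

int list_insert(list_t *L, int key) {
    node_t *n = malloc(sizeof(node_t));
    if (n == NULL)
        return -1;                 /* failure path holds no lock */
    n->key = key;
    pthread_mutex_lock(&L->lock);  /* critical section: head update only */
    n->next = L->head;
    L->head = n;
    pthread_mutex_unlock(&L->lock);
    return 0;
}

int list_lookup(list_t *L, int key) {
    int found = 0;
    pthread_mutex_lock(&L->lock);
    for (node_t *cur = L->head; cur; cur = cur->next)
        if (cur->key == key) { found = 1; break; }
    pthread_mutex_unlock(&L->lock);  /* single exit: one unlock */
    return found;
}
```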
CCL – SECOND IMPLEMENTATION - 2
. Lookup

CONCURRENT LINKED LIST PERFORMANCE
. Using a single lock for the entire list is not very performant
  . Users must "wait" in line for a single lock to access/modify any item
. Hand-over-hand locking (lock coupling)
  . Introduce a lock for each node of the list
  . Traversal involves handing over the previous node's lock and acquiring the next node's lock…
  . Improves lock granularity
  . Degrades traversal performance
. Consider a hybrid approach
  . Fewer locks, but more than 1
  . Best lock-to-node distribution?
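Hand-over-hand locking can be sketched as below: the traversal grabs the next node's lock before releasing the current one, so no node is examined unprotected. An illustrative sketch, not the slides' code:

```c
#include <pthread.h>
#include <stdlib.h>

/* Lock-coupling sketch: every node carries its own lock. */
typedef struct hnode {
    int key;
    struct hnode *next;
    pthread_mutex_t lock;
} hnode_t;

typedef struct { hnode_t *head; pthread_mutex_t lock; } hlist_t;

void hlist_init(hlist_t *L) {
    L->head = NULL;
    pthread_mutex_init(&L->lock, NULL);
}

int hlist_insert(hlist_t *L, int key) {  /* insert at head, for brevity */
    hnode_t *n = malloc(sizeof(hnode_t));
    if (!n) return -1;
    n->key = key;
    pthread_mutex_init(&n->lock, NULL);
    pthread_mutex_lock(&L->lock);
    n->next = L->head;
    L->head = n;
    pthread_mutex_unlock(&L->lock);
    return 0;
}

int hlist_lookup(hlist_t *L, int key) {
    pthread_mutex_lock(&L->lock);             /* protect the head pointer */
    hnode_t *cur = L->head;
    if (cur) pthread_mutex_lock(&cur->lock);  /* couple: grab node lock... */
    pthread_mutex_unlock(&L->lock);           /* ...then release list lock */
    while (cur) {
        if (cur->key == key) {
            pthread_mutex_unlock(&cur->lock);
            return 1;
        }
        hnode_t *next = cur->next;
        if (next) pthread_mutex_lock(&next->lock);  /* grab next lock... */
        pthread_mutex_unlock(&cur->lock);           /* ...release current */
        cur = next;
    }
    return 0;
}
```

Granularity improves (threads can traverse different parts of the list concurrently), but each step now pays two lock operations, which is why traversal performance degrades.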
MICHAEL AND SCOTT CONCURRENT QUEUES
. Improvement beyond a single master lock for a queue (FIFO)
. Two locks:
  . One for the head of the queue
  . One for the tail
. Synchronize enqueue and dequeue operations
. Add a dummy node
  . Allocated in the queue initialization routine
  . Supports separation of head and tail operations

CONCURRENT QUEUE
. Remove from queue
. Items can be added and removed by separate threads at the same time
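The two-lock queue can be sketched as below, modeled on the Michael & Scott design the slides describe; function names are illustrative:

```c
#include <pthread.h>
#include <stdlib.h>

/* Two-lock queue sketch: a dummy node keeps head and tail apart, so
   enqueue (tail lock) and dequeue (head lock) never contend. */
typedef struct qnode { int value; struct qnode *next; } qnode_t;
typedef struct {
    qnode_t *head, *tail;
    pthread_mutex_t head_lock, tail_lock;
} queue_t;

void queue_init(queue_t *q) {
    qnode_t *dummy = malloc(sizeof(qnode_t));  /* from the init routine */
    dummy->next = NULL;
    q->head = q->tail = dummy;
    pthread_mutex_init(&q->head_lock, NULL);
    pthread_mutex_init(&q->tail_lock, NULL);
}

void queue_enqueue(queue_t *q, int value) {
    qnode_t *n = malloc(sizeof(qnode_t));
    n->value = value;
    n->next = NULL;
    pthread_mutex_lock(&q->tail_lock);  /* only the tail is touched */
    q->tail->next = n;
    q->tail = n;
    pthread_mutex_unlock(&q->tail_lock);
}

int queue_dequeue(queue_t *q, int *value) {
    pthread_mutex_lock(&q->head_lock);  /* only the head is touched */
    qnode_t *old = q->head;
    qnode_t *next = old->next;
    if (next == NULL) {                 /* queue empty */
        pthread_mutex_unlock(&q->head_lock);
        return -1;
    }
    *value = next->value;
    q->head = next;                     /* next becomes the new dummy */
    pthread_mutex_unlock(&q->head_lock);
    free(old);
    return 0;
}
```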
CONCURRENT QUEUE - 2
. Add to queue

CONCURRENT HASH TABLE
. Consider a simple hash table
  . Fixed (static) size
  . A hash maps to a bucket
  . Each bucket is implemented using a concurrent linked list
  . One lock per hash bucket
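The one-lock-per-bucket idea can be sketched as below; BUCKETS and the helper names are illustrative assumptions:

```c
#include <pthread.h>
#include <stdlib.h>

#define BUCKETS 101  /* illustrative fixed table size */

/* Per-bucket locking sketch: each bucket is a small locked list, so
   inserts into different buckets never contend for the same lock. */
typedef struct bnode { int key; struct bnode *next; } bnode_t;
typedef struct { bnode_t *head; pthread_mutex_t lock; } bucket_t;
typedef struct { bucket_t buckets[BUCKETS]; } hash_t;

void hash_init(hash_t *h) {
    for (int i = 0; i < BUCKETS; i++) {
        h->buckets[i].head = NULL;
        pthread_mutex_init(&h->buckets[i].lock, NULL);
    }
}

int hash_insert(hash_t *h, int key) {
    bucket_t *b = &h->buckets[(unsigned)key % BUCKETS];
    bnode_t *n = malloc(sizeof(bnode_t));
    if (!n) return -1;
    n->key = key;
    pthread_mutex_lock(&b->lock);  /* only this bucket is locked */
    n->next = b->head;
    b->head = n;
    pthread_mutex_unlock(&b->lock);
    return 0;
}

int hash_lookup(hash_t *h, int key) {
    bucket_t *b = &h->buckets[(unsigned)key % BUCKETS];
    int found = 0;
    pthread_mutex_lock(&b->lock);
    for (bnode_t *cur = b->head; cur; cur = cur->next)
        if (cur->key == key) { found = 1; break; }
    pthread_mutex_unlock(&b->lock);
    return found;
}
```

This is why the hash table scales so much better than the single-lock list: with many buckets, concurrent operations rarely touch the same lock.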
INSERT PERFORMANCE – CONCURRENT HASH TABLE
. Four threads – 10,000 to 50,000 inserts
  . iMac with a four-core Intel 2.7 GHz CPU
. The simple concurrent hash table scales magnificently.
LOCK-FREE DATA STRUCTURES
. Lock-free data structures in Java
. java.util.concurrent.atomic package classes:
  . AtomicBoolean
  . AtomicInteger
  . AtomicIntegerArray
  . AtomicIntegerFieldUpdater
  . AtomicLong
  . AtomicLongArray
  . AtomicLongFieldUpdater
  . AtomicReference
. See: https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/atomic/package-summary.html

QUESTIONS
FUTEX: MUTEX_LOCK PSEUDOCODE

void mutex_lock(int *mutex) {
    int v;
    /* Bit 31 was clear, we got the mutex (this is a fast lock!) */
    if (atomic_bit_test_set(mutex, 31) == 0)
        return;
    /* "adds" mutex to queue */
    atomic_increment(mutex);
    while (1) {
        /* is lock available? */
        if (atomic_bit_test_set(mutex, 31) == 0) {
            /* remove mutex from queue – it has the lock now */
            atomic_decrement(mutex);
            return;
        }
        /* Have to wait. Make sure futex value is locked (negative) */
        v = *mutex;
        if (v >= 0)
            continue;
        /* wait to be woken up when lock is available (signal);
           this is not a spin lock… */
        futex_wait(mutex, v);
    }
}

FUTEX: MUTEX UNLOCK PSEUDOCODE

void mutex_unlock(int *mutex) {
    /* Adding 0x80000000 to counter results in 0 if and only if
       there are no other interested threads */
    if (atomic_add_zero(mutex, 0x80000000))
        return;
    /* There are other threads waiting for this lock (mutex);
       wake one of them up (e.g. dequeue it) */
    futex_wake(mutex);
}

Interesting note: Futex bug in Red Hat Linux
https://www.infoq.com/news/2015/05/redhat-futex