TCSS 422 A – Spring 2018 4/24/2018 Institute of Technology

TCSS 422: OPERATING SYSTEMS

Wes J. Lloyd
Institute of Technology
University of Washington - Tacoma

OBJECTIVES

- Quiz 2 – Scheduling
- Assignment 1 – MASH Shell
- Review Three Easy Pieces:
  - Proportional Share Scheduler – Ch. 9
  - Concurrency: Introduction – Ch. 26
  - Linux API – Ch. 27
  - Locks – Ch. 28
  - Lock-Based Data Structures – Ch. 29

TCSS422: Operating Systems [Spring 2018], April 23, 2018
Institute of Technology, University of Washington - Tacoma

CHAPTER 28 – LOCKS

LOCKS

- Ensure critical section(s) are executed atomically (as a unit)
- Only one thread is allowed to execute a critical section at any given time
- Ensures the code snippets are "mutually exclusive"
- Example: protecting a global counter, where the update is the "critical section"
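The global-counter case above can be sketched as follows. This is a minimal illustration, not code from the slides: the names `counter`, `counter_lock`, `worker`, and `NLOOPS` are all illustrative.

```c
#include <pthread.h>

/* Hypothetical example: a global counter protected by a mutex. */
static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

#define NLOOPS 100000

/* Each thread increments the shared counter inside the critical section. */
void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NLOOPS; i++) {
        pthread_mutex_lock(&counter_lock);   /* enter critical section */
        counter = counter + 1;               /* the protected operation */
        pthread_mutex_unlock(&counter_lock); /* leave critical section */
    }
    return NULL;
}

/* Runs two workers and returns the final counter value. */
long run_two_threads(void) {
    pthread_t t1, t2;
    counter = 0;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```

Without the lock, the two increments can interleave and the final count would usually fall short of 2 × NLOOPS; with the lock, the result is exact.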

LOCKS - 2

- Lock variables are called "mutexes", short for mutual exclusion (that's what they guarantee)
- Lock variables store the state of the lock
- States:
  - Locked (acquired or held)
  - Unlocked (available or free)
- Only one thread can hold a lock

LOCKS - 3

- pthread_mutex_lock(&lock)
  - Try to acquire the lock
  - If the lock is free, the calling thread will acquire it
- The thread holding the lock enters the critical section; that thread "owns" the lock
- No other thread can acquire the lock before the owner releases it

Slides by Wes J. Lloyd

LOCKS - 4

- A program can have many mutex (lock) variables to "serialize" many critical sections
- Locks are also used to protect data structures
  - Prevent multiple threads from changing the same data simultaneously
  - The programmer can make sections of code "granular"
  - Fine grained means just one grain of sand at a time through an hourglass
  - Similar to relational database transactions, which prevent multiple users from simultaneously modifying a table, row, or field

FINE GRAINED?

- Is this code a good example of "fine grained parallelism"?

    pthread_mutex_lock(&lock);
    a = b++;
    b = a * c;
    *d = a + b + c;
    FILE *fp = fopen("file.txt", "r");
    fscanf(fp, "%s %s %s %d", str1, str2, str3, &e);
    ListNode *node = mylist->head;
    int i = 0;
    while (node) {
        node->title = str1;
        node->subheading = str2;
        node->desc = str3;
        node->end = e;
        node = node->next;
        i++;
    }
    e = e - i;
    pthread_mutex_unlock(&lock);

FINE GRAINED PARALLELISM

- The same work, with each critical section protected by its own lock:

    pthread_mutex_lock(&lock_a);
    a = b++;
    pthread_mutex_unlock(&lock_a);

    pthread_mutex_lock(&lock_b);
    b = a * c;
    pthread_mutex_unlock(&lock_b);

    pthread_mutex_lock(&lock_d);
    *d = a + b + c;
    pthread_mutex_unlock(&lock_d);

    FILE *fp = fopen("file.txt", "r");
    pthread_mutex_lock(&lock_e);
    fscanf(fp, "%s %s %s %d", str1, str2, str3, &e);
    pthread_mutex_unlock(&lock_e);

    ListNode *node = mylist->head;
    int i = 0;
    ...

EVALUATING LOCK IMPLEMENTATIONS

- Correctness
  - Does the lock work?
  - Are critical sections mutually exclusive (atomic, executed as a unit)?
- Fairness
  - Do threads competing for a lock have a fair chance of acquiring it?
- Overhead

BUILDING LOCKS

- Locks require hardware support
  - Special "atomic" (as a unit) instructions to support lock implementation
  - Atomic exchange instruction: XCHG
  - Compare-and-exchange instructions: CMPXCHG, CMPXCHG8B, CMPXCHG16B

HISTORICAL IMPLEMENTATION

- To implement mutual exclusion, disable interrupts upon entering critical sections
  - Goal: minimize overhead while ensuring fairness and correctness
- Problems:
  - Any thread could disable interrupts system-wide
  - What if the lock is never released?
  - On a multiprocessor, each CPU has its own interrupts; do we disable interrupts for all cores simultaneously?
  - While interrupts are disabled, they could be lost if not queued


SPIN LOCK IMPLEMENTATION

- Operate without atomic (as a unit) assembly instructions
- "Do-it-yourself" locks
- Is this lock implementation correct? Fair? Performant?

DIY: CORRECT?

- Correctness requires luck… (i.e., the DIY lock is incorrect)
- With unlucky timing, both threads can "acquire" the lock simultaneously
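The broken "do-it-yourself" flag lock discussed above can be sketched as below. This is the INCORRECT version: the test of the flag and the set of the flag are separate instructions, so a context switch between them lets two threads both observe flag == 0 and both enter the critical section. The type and function names are illustrative.

```c
/* Sketch of the incorrect DIY flag-based lock. */
typedef struct { int flag; } diy_lock_t;

void diy_init(diy_lock_t *m) { m->flag = 0; }

void diy_lock(diy_lock_t *m) {
    while (m->flag == 1)   /* TEST: spin while lock appears held        */
        ;                  /* <-- a context switch here breaks the lock */
    m->flag = 1;           /* SET: acquire (not atomic with the test!)  */
}

void diy_unlock(diy_lock_t *m) { m->flag = 0; }
```

Single-threaded the code behaves as expected; the bug only manifests under concurrent interleavings.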

DIY: PERFORMANT?

    void lock(lock_t *mutex)
    {
        while (mutex->flag == 1); // while lock is unavailable, wait…
        mutex->flag = 1;
    }

- The C implementation is not atomic
- What is wrong with while(); ?
  - Spin-waiting wastes time actively waiting for another thread
  - while(1); will "peg" a CPU core at 100%
  - Continuously loops and evaluates the mutex->flag value…
  - Generates heat…
- On a single-core CPU system with a preemptive scheduler, try this on a one-core VM (single-core systems are becoming scarce)

TEST-AND-SET INSTRUCTION

- Adds a simple check to the basic spin lock
- The lock() method checks that TestAndSet doesn't return 1
- The comparison is in the caller
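A test-and-set spin lock along the lines described above can be sketched as follows. The hardware TestAndSet instruction is stood in for by GCC's `__sync_lock_test_and_set` builtin, which is an assumption about the toolchain; the struct and function names are illustrative.

```c
/* Spin lock built on an atomic test-and-set. */
typedef struct { volatile int flag; } spinlock_t;

void spin_init(spinlock_t *m) { m->flag = 0; }

void spin_lock(spinlock_t *m) {
    /* Atomically set flag to 1 and return the OLD value.
       If the old value was 1, another thread holds the lock: keep spinning. */
    while (__sync_lock_test_and_set(&m->flag, 1) == 1)
        ;
}

void spin_unlock(spinlock_t *m) {
    __sync_lock_release(&m->flag);   /* atomically store 0 */
}
```

Because the test and the set happen in one atomic step, the lost-update race of the DIY lock cannot occur.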

DIY: TEST-AND-SET - 2

- Requires a preemptive scheduler on a single-CPU-core system
- The lock is never released without a context switch
- On a 1-core VM: occasionally spins for a while, but doesn't miscount

SPIN LOCK EVALUATION

- Correctness:
  - Spin locks guarantee critical sections won't be executed simultaneously by two threads
- Fairness:
  - No fairness guarantee; once a thread has a lock, nothing forces it to relinquish it…
- Performance:
  - Spin locks perform "busy waiting"
  - Spin locks are best for short periods of waiting
  - Performance is slow when multiple threads share a CPU, especially for long waits


COMPARE AND SWAP

- Checks that the lock variable has the expected value FIRST, before changing its value
  - If so, make the assignment
  - Return the value at the location
- Adds a comparison to TestAndSet
- x86 provides the "cmpxchgl" compare-and-exchange instruction (also cmpxchg8b, cmpxchg16b)

COMPARE AND SWAP - 2

- Spin lock usage: on a 1-core VM the count is correct, with no deadlock
- Useful for wait-free synchronization
  - Supports implementation of shared data structures which can be updated atomically (as a unit) using the hardware CompareAndSwap instruction
  - Shared data structure updates become "wait-free"
  - Upcoming in Chapter 32
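The compare-and-swap semantics above can be sketched in C as below. The atomic itself is stood in for by GCC's `__sync_val_compare_and_swap` builtin (a toolchain assumption); the lock built on top of it is illustrative.

```c
/* CompareAndSwap: atomically, if *ptr == expected then *ptr = new_val.
   Either way, return the value that WAS at *ptr. */
static inline int compare_and_swap(volatile int *ptr, int expected, int new_val) {
    return __sync_val_compare_and_swap(ptr, expected, new_val);
}

typedef struct { volatile int flag; } cas_lock_t;

void cas_lock(cas_lock_t *m) {
    /* Acquire only when flag transitions 0 -> 1 atomically. */
    while (compare_and_swap(&m->flag, 0, 1) == 1)
        ;
}

void cas_unlock(cas_lock_t *m) { m->flag = 0; }
```

Used as a spin lock this behaves like test-and-set, but the same primitive also underlies the wait-free structure updates mentioned above.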

TWO MORE "LOCK BUILDING" CPU INSTRUCTIONS

- Cooperative instructions used together to support synchronization on RISC systems
- No support on x86 processors; supported by RISC architectures: Alpha, PowerPC, ARM
- Load-linked (LL)
  - Loads a value into a register, same as a typical load
  - Used as a mechanism to track competition
- Store-conditional (SC)
  - Performs a "mutually exclusive" store
  - Allows only one thread to store the value

LL/SC LOCK

- The LL instruction loads the pointer value (ptr)
- SC only stores if the load-linked pointer has not changed
- Requires hardware support; the C code is pseudocode

LL/SC LOCK - 2

- A two-instruction lock

HARDWARE SPIN LOCKS - SUMMARY

- Simple and correct
- But slow:
  - With long locks, waiting threads spin for an entire timeslice
  - The comparison is repeated continuously (busy waiting)
- How to avoid spinning? Need both hardware & OS support!


FETCH-AND-ADD

- A hardware CPU instruction
- Increments a counter atomically (as a unit) in one instruction
  - Fetch and return the value
  - Increment it by 1

TICKET LOCK

- A ticket lock can be built using fetch-and-add
- Ensures progress of all threads (fairness)
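The ticket lock described above can be sketched as follows. FetchAndAdd is stood in for by GCC's `__sync_fetch_and_add` builtin (a toolchain assumption); the struct layout follows the standard ticket/turn scheme and the names are illustrative.

```c
/* Ticket lock built on fetch-and-add. */
typedef struct {
    volatile int ticket;   /* next ticket to hand out        */
    volatile int turn;     /* ticket currently being served  */
} ticket_lock_t;

void ticket_init(ticket_lock_t *m) { m->ticket = 0; m->turn = 0; }

void ticket_lock(ticket_lock_t *m) {
    /* Atomically grab my ticket and advance the dispenser. */
    int myturn = __sync_fetch_and_add(&m->ticket, 1);
    while (m->turn != myturn)   /* spin until my number is called */
        ;
}

void ticket_unlock(ticket_lock_t *m) {
    m->turn = m->turn + 1;      /* serve the next ticket: FIFO order */
}
```

Because tickets are handed out in arrival order and served in that same order, every waiter eventually acquires the lock, which is the fairness (progress) property the slide highlights.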

TICKET LOCK - 2 YIELD() – SYSTEM CALL

TB while (1 != 1) acquire lock

TB myturn=1 ticket=2 TA myturn=0 turn=0 ticket=1 turn=0

TA  Give up the CPU – instead of busy waiting… while (0 != 0) acquire lock . running ready TB TA-unlock  Ready relinquishes the CPU for another thread (ctxt. switch) myturn=0 while (0 != 1)  How does the thread get the CPU back? spin ticket=2 turn=1 . OS must opportunistically reschedule it: ready  running

TCSS422: Operating Systems [Winter 2018] TCSS422: Operating Systems [Winter 2018] April 23, 2018 L7.27 April 23, 2018 L7.28 Institute of Technology, University of Washington - Tacoma Institute of Technology, University of Washington - Tacoma

THREAD QUEUES

- Don't allow the OS to control your program
  - Use internal thread queues
- Allows the programmer to maintain control
  - Ensure fairness, prevent starvation
  - Better for synchronizing large numbers of threads
- Requires OS support to add/remove threads to/from queue(s)
- Solaris API:
  - park(): puts a thread to sleep
  - unpark(threadID): wakes the specified thread
- Linux API: futex()

THREAD QUEUES - 2

- A guard uses a spin lock to protect the critical sections in lock() and unlock()
- lock(): obtain the guard lock, then try to obtain the actual lock
- If the lock is unavailable, add the thread to the queue
- Note: there is a potential wakeup/waiting race
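The lock() path just described can be sketched in pseudocode. This does not compile as-is: park() is the Solaris call, and TestAndSet, queue_add, queue_t, and gettid are assumed helpers.

```
/* Pseudocode sketch of a queue-based lock's lock() path. */
typedef struct {
    int flag;      /* is the lock held?                 */
    int guard;     /* spin lock protecting flag + queue */
    queue_t *q;    /* threads waiting for the lock      */
} lock_t;

void lock(lock_t *m) {
    while (TestAndSet(&m->guard, 1) == 1)
        ;                           /* acquire guard by spinning      */
    if (m->flag == 0) {
        m->flag = 1;                /* lock acquired                  */
        m->guard = 0;
    } else {
        queue_add(m->q, gettid());  /* lock held: join the wait queue */
        m->guard = 0;               /* <-- wakeup/waiting race window */
        park();                     /* sleep until unpark()           */
    }
}
```

The comment marks the race window: if the thread is preempted between releasing the guard and calling park(), an unpark() can arrive first and be lost.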


THREAD QUEUES - 3

- unlock(): obtain the guard lock (spin), wake up a thread from the queue, release the guard lock
- Note: no change to m->flag if unparking a thread
  - The lock is passed to the unparked thread "directly"

WAKEUP/WAITING RACE

- Thread B: a context switch occurs immediately before the call to park()
- Thread A: releases the lock and calls unpark(), but the queue is empty
- Thread B: regains context and proceeds to park() itself forever
- Need a new system call: setpark()
  - Informs the OS about the soon-to-be-parked thread
  - Subsequent calls to unpark() are aware that Thread B is about to park
  - Thread B's call to park() then immediately returns

FUTEX

- Fast Userspace muTEX
- Linux futex system calls are similar to park() and unpark()
- Linux uses an in-kernel queue
- Provides a futex() system call
  - An atomic (as a unit) compare-and-block operation
  - Call futex() with FUTEX_WAIT or FUTEX_WAKE
- Uses a 32-bit integer
  - The leftmost bit (the +/- sign) tracks the lock state: 0 – free, 1 – locked
  - The remaining 31 bits identify the thread
- Objective: reduce the number of system calls

FUTEX: WRITE YOUR OWN MUTEX LOCK

- futex_wait(addr, expected)
  - Puts the calling thread to sleep
  - If the value @ addr != expected, it returns immediately
- futex_wake(addr)
  - Wakes one thread that is waiting on the queue
- These are not exposed as C library calls directly
- Futex is a lower-level construct, used as a building block for mutexes, condition variables, and semaphores
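Since glibc exposes no futex() wrapper, the futex_wait/futex_wake helpers the slide names can be sketched via syscall(2). This is a Linux-only sketch and assumes a Linux build environment; the wrapper names mirror the slide's.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Sleep only if *addr still equals expected; otherwise fail immediately
   (returns -1 with errno EAGAIN), which is the compare-and-block step. */
long futex_wait(uint32_t *addr, uint32_t expected) {
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Wake at most one thread waiting on addr; returns the number woken. */
long futex_wake(uint32_t *addr) {
    return syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```

The compare-and-block behavior is what closes the wakeup/waiting race: the kernel re-checks the value atomically before putting the caller to sleep.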

HYBRID – TWO-PHASE LOCKS

- A hybrid between spin locks and yielding
- Useful if the lock is about to be released
- First phase – spin
  - Spin for some time waiting for the lock to be released
  - If the lock is not acquired before the time expires, enter phase two
- Second phase – yield
  - The thread sleeps (yields)
  - It is awoken when the lock becomes free

CHAPTER 29 – LOCK-BASED DATA STRUCTURES


OBJECTIVES

- Concurrent data structures
- Performance
- Lock granularity

LOCK-BASED CONCURRENT DATA STRUCTURES

- Adding locks to data structures makes them thread safe
- Considerations:
  - Correctness
  - Performance
  - Lock granularity

COUNTER STRUCTURE W/O LOCK

- No synchronization: not thread safe

CONCURRENT COUNTER

- Add a lock to the counter
- Require the lock to change the data
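The lock-protected counter described above can be sketched as follows. The `counter_t` layout and function names follow the OSTEP chapter's sketch and are illustrative.

```c
#include <pthread.h>

/* A counter made thread safe by taking the lock on every operation. */
typedef struct {
    int value;
    pthread_mutex_t lock;
} counter_t;

void counter_init(counter_t *c) {
    c->value = 0;
    pthread_mutex_init(&c->lock, NULL);
}

void counter_increment(counter_t *c) {
    pthread_mutex_lock(&c->lock);   /* every update requires the lock */
    c->value++;
    pthread_mutex_unlock(&c->lock);
}

void counter_decrement(counter_t *c) {
    pthread_mutex_lock(&c->lock);
    c->value--;
    pthread_mutex_unlock(&c->lock);
}

int counter_get(counter_t *c) {
    pthread_mutex_lock(&c->lock);   /* reads are serialized too */
    int rc = c->value;
    pthread_mutex_unlock(&c->lock);
    return rc;
}
```

This is the "traditional" synchronized counter whose poor scaling the performance slides below measure.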

CONCURRENT COUNTER - 2

- Decrease the counter
- Get the value

CONCURRENT COUNTERS - PERFORMANCE

- iMac: four-core Intel 2.7 GHz i5 CPU
- Each thread increments the counter 1,000,000 times
- Traditional vs. sloppy counter, with sloppy threshold S = 1024
- The synchronized counter scales poorly


PERFECT SCALING

- Achieve an N-times performance gain with N additional resources
- Throughput: transactions per second (tps)
  - 1 core: N = 100 tps
  - 10 cores: N = 1000 tps

SLOPPY COUNTER

- Provides a single logical shared counter
  - Implemented using local counters, one per CPU core
  - 4 CPU cores = 4 local counters & 1 global counter
  - Local counters are synchronized via local locks
  - The global counter is updated periodically and has its own lock to protect its value
- Sloppiness threshold (S): how often the global counter is updated with local values
  - Small S: more updates, more overhead
  - Large S: fewer updates, more performant, less synchronized
- Why this implementation? Why do we want counters local to each CPU core?

SLOPPY COUNTER - 2

- Update threshold (S) = 5
- Synchronized across four CPU cores
- Threads update their local CPU counters

THRESHOLD VALUE S

- Consider 4 threads, each incrementing a counter 1,000,000 times
- Low S: what is the consequence?
- High S: what is the consequence?
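The sloppy counter scheme above can be sketched as follows: per-CPU local counters are flushed into the global counter once they reach threshold S. The layout follows the OSTEP sketch; `NUMCPUS` and the function names are illustrative.

```c
#include <pthread.h>

#define NUMCPUS 4

typedef struct {
    int global;                       /* global count          */
    pthread_mutex_t glock;            /* global lock           */
    int local[NUMCPUS];               /* per-CPU local counts  */
    pthread_mutex_t llock[NUMCPUS];   /* per-CPU local locks   */
    int threshold;                    /* sloppiness S          */
} sloppy_t;

void sloppy_init(sloppy_t *c, int threshold) {
    c->threshold = threshold;
    c->global = 0;
    pthread_mutex_init(&c->glock, NULL);
    for (int i = 0; i < NUMCPUS; i++) {
        c->local[i] = 0;
        pthread_mutex_init(&c->llock[i], NULL);
    }
}

/* Usually called with cpu = thread id % NUMCPUS. */
void sloppy_update(sloppy_t *c, int cpu, int amt) {
    pthread_mutex_lock(&c->llock[cpu]);
    c->local[cpu] += amt;
    if (c->local[cpu] >= c->threshold) {   /* transfer to global */
        pthread_mutex_lock(&c->glock);
        c->global += c->local[cpu];
        pthread_mutex_unlock(&c->glock);
        c->local[cpu] = 0;
    }
    pthread_mutex_unlock(&c->llock[cpu]);
}

/* Approximate: ignores not-yet-flushed local counts. */
int sloppy_get(sloppy_t *c) {
    pthread_mutex_lock(&c->glock);
    int val = c->global;
    pthread_mutex_unlock(&c->glock);
    return val;
}
```

Threads mostly contend only on their own local lock, which is why the counter scales; the price is that reads lag behind by up to S per core.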

SLOPPY COUNTER - EXAMPLE

- Example implementation
- Also with CPU affinity

CONCURRENT LINKED LIST - 1

- Simplification: only basic list operations shown
- Structs and initialization


CONCURRENT LINKED LIST - 2

- Insert: adds an item to the list
- Everything is critical!
  - Note: there are two unlocks

CONCURRENT LINKED LIST - 3

- Lookup: checks the list for the existence of an item with a given key
- Once again, everything is critical
  - Note: there are also two unlocks

CONCURRENT LINKED LIST

- First implementation:
  - Locks everything inside Insert() and Lookup()
  - If malloc() fails, the lock must be released
  - Research has shown "exception-based control flow" to be error prone
    - 40% of Linux OS bugs occur in rarely taken code paths
    - Unlocking in an exception handler is considered a poor coding practice
  - There is nothing specifically wrong with this example, however
- Second implementation…

CCL – SECOND IMPLEMENTATION

- Init and Insert
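A second-implementation-style list can be sketched as below: malloc() happens before the lock is taken (so a failure path holds no lock), and lookup has a single unlock at one exit point. This follows the OSTEP sketch; the int-key node is illustrative.

```c
#include <pthread.h>
#include <stdlib.h>

/* Concurrent linked list with one lock for the whole list. */
typedef struct __node_t {
    int key;
    struct __node_t *next;
} node_t;

typedef struct {
    node_t *head;
    pthread_mutex_t lock;
} list_t;

void list_init(list_t *L) {
    L->head = NULL;
    pthread_mutex_init(&L->lock, NULL);
}

/* Returns 0 on success, -1 if allocation fails. */
int list_insert(list_t *L, int key) {
    node_t *n = malloc(sizeof(node_t));
    if (n == NULL)
        return -1;                  /* no lock held yet: nothing to release */
    n->key = key;
    pthread_mutex_lock(&L->lock);   /* only the list update is critical */
    n->next = L->head;
    L->head = n;
    pthread_mutex_unlock(&L->lock);
    return 0;
}

/* Returns 0 if key is present, -1 otherwise. */
int list_lookup(list_t *L, int key) {
    int rv = -1;
    pthread_mutex_lock(&L->lock);
    for (node_t *cur = L->head; cur != NULL; cur = cur->next) {
        if (cur->key == key) {
            rv = 0;
            break;                  /* single exit: one unlock below */
        }
    }
    pthread_mutex_unlock(&L->lock);
    return rv;
}
```

Keeping the failure path lock-free removes the error-prone "unlock in the exception path" pattern the slide criticizes.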

CCL – SECOND IMPLEMENTATION - 2

- Lookup

CONCURRENT LINKED LIST PERFORMANCE

- Using a single lock for the entire list is not very performant
- Users must "wait" in line for a single lock to access/modify any item
- Hand-over-hand locking (lock coupling)
  - Introduce a lock for each node of the list
  - Traversal involves handing over the previous node's lock while acquiring the next node's lock…
  - Improves lock granularity
  - Degrades traversal performance
- Consider a hybrid approach
  - Fewer locks, but more than 1
  - What is the best lock-to-node distribution?
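The hand-over-hand traversal described above can be sketched as follows. This is a minimal illustration of the locking discipline only (a traversal that counts nodes); the node layout and names are illustrative.

```c
#include <pthread.h>
#include <stddef.h>

/* Per-node lock for hand-over-hand (lock coupling) traversal. */
typedef struct hoh_node {
    int key;
    struct hoh_node *next;
    pthread_mutex_t lock;
} hoh_node_t;

/* Counts nodes while always holding at least one lock:
   grab the next node's lock BEFORE releasing the current one. */
int hoh_count(hoh_node_t *head) {
    if (head == NULL)
        return 0;
    int n = 0;
    pthread_mutex_lock(&head->lock);
    hoh_node_t *cur = head;
    while (cur != NULL) {
        n++;
        hoh_node_t *next = cur->next;
        if (next != NULL)
            pthread_mutex_lock(&next->lock);   /* grab next first...  */
        pthread_mutex_unlock(&cur->lock);      /* ...then hand over   */
        cur = next;
    }
    return n;
}
```

The extra lock/unlock per node is exactly the traversal overhead the slide notes: finer granularity, but more locking work per step.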


MICHAEL AND SCOTT CONCURRENT QUEUES

- An improvement beyond a single master lock for a queue (FIFO)
- Two locks:
  - One for the head of the queue
  - One for the tail
  - Synchronize enqueue and dequeue operations separately
- Add a dummy node
  - Allocated in the queue initialization routine
  - Supports separation of head and tail operations
- Items can be added and removed by separate threads at the same time

CONCURRENT QUEUE

- Remove from queue (dequeue)
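The two-lock, dummy-node queue described above can be sketched as follows, following the OSTEP rendering of the Michael and Scott design; the names are illustrative.

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct __qnode_t {
    int value;
    struct __qnode_t *next;
} qnode_t;

typedef struct {
    qnode_t *head;               /* dequeue side */
    qnode_t *tail;               /* enqueue side */
    pthread_mutex_t head_lock;
    pthread_mutex_t tail_lock;
} queue_t;

void queue_init(queue_t *q) {
    qnode_t *dummy = malloc(sizeof(qnode_t));  /* dummy separates head/tail */
    dummy->next = NULL;
    q->head = q->tail = dummy;
    pthread_mutex_init(&q->head_lock, NULL);
    pthread_mutex_init(&q->tail_lock, NULL);
}

void queue_enqueue(queue_t *q, int value) {
    qnode_t *node = malloc(sizeof(qnode_t));
    node->value = value;
    node->next = NULL;
    pthread_mutex_lock(&q->tail_lock);   /* only the tail is touched */
    q->tail->next = node;
    q->tail = node;
    pthread_mutex_unlock(&q->tail_lock);
}

/* Returns 0 and fills *value on success, -1 if the queue is empty. */
int queue_dequeue(queue_t *q, int *value) {
    pthread_mutex_lock(&q->head_lock);   /* only the head is touched */
    qnode_t *dummy = q->head;
    qnode_t *new_head = dummy->next;
    if (new_head == NULL) {
        pthread_mutex_unlock(&q->head_lock);
        return -1;                       /* queue empty */
    }
    *value = new_head->value;            /* new_head becomes the dummy */
    q->head = new_head;
    pthread_mutex_unlock(&q->head_lock);
    free(dummy);
    return 0;
}
```

Because enqueue only touches the tail and dequeue only the head, one producer and one consumer can operate at the same time without contending, which is the point of the two locks.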

CONCURRENT QUEUE - 2

- Add to queue (enqueue)

CONCURRENT HASH TABLE

- Consider a simple hash table
  - Fixed (static) size
  - A hash maps each key to a bucket
  - Each bucket is implemented using a concurrent linked list
  - One lock per bucket
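A fixed-size, lock-per-bucket hash table along the lines above can be sketched as follows. `BUCKETS` and the int-key nodes are illustrative; each bucket is a tiny linked list guarded by its own mutex.

```c
#include <pthread.h>
#include <stdlib.h>

#define BUCKETS 101

typedef struct __hnode_t {
    int key;
    struct __hnode_t *next;
} hnode_t;

typedef struct {
    hnode_t *head[BUCKETS];
    pthread_mutex_t lock[BUCKETS];   /* one lock per bucket */
} hash_t;

void hash_init(hash_t *h) {
    for (int i = 0; i < BUCKETS; i++) {
        h->head[i] = NULL;
        pthread_mutex_init(&h->lock[i], NULL);
    }
}

int hash_insert(hash_t *h, int key) {
    int b = (key % BUCKETS + BUCKETS) % BUCKETS;  /* bucket index */
    hnode_t *n = malloc(sizeof(hnode_t));
    if (n == NULL)
        return -1;
    n->key = key;
    pthread_mutex_lock(&h->lock[b]);  /* other buckets stay available */
    n->next = h->head[b];
    h->head[b] = n;
    pthread_mutex_unlock(&h->lock[b]);
    return 0;
}

int hash_lookup(hash_t *h, int key) {
    int b = (key % BUCKETS + BUCKETS) % BUCKETS;
    int rv = -1;
    pthread_mutex_lock(&h->lock[b]);
    for (hnode_t *cur = h->head[b]; cur != NULL; cur = cur->next)
        if (cur->key == key) { rv = 0; break; }
    pthread_mutex_unlock(&h->lock[b]);
    return rv;
}
```

Threads hashing to different buckets never contend, which is why this structure scales so much better than the single-lock list.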

INSERT PERFORMANCE – CONCURRENT HASH TABLE

- Four threads, 10,000 to 50,000 inserts each
  - iMac with a four-core Intel 2.7 GHz CPU
- The simple concurrent hash table scales magnificently


LOCK-FREE DATA STRUCTURES

- Lock-free data structures in Java
- The java.util.concurrent.atomic package
- Classes:
  - AtomicBoolean
  - AtomicInteger
  - AtomicIntegerArray
  - AtomicIntegerFieldUpdater
  - AtomicLong
  - AtomicLongArray
  - AtomicLongFieldUpdater
  - AtomicReference
- See: https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/atomic/package-summary.html

QUESTIONS

FUTEX: MUTEX_LOCK PSEUDOCODE

    void mutex_lock(int *mutex) {
        int v;
        /* Bit 31 was clear, we got the mutex (this is a fast lock!) */
        if (atomic_bit_test_set(mutex, 31) == 0)
            return;
        // "adds" mutex to queue
        atomic_increment(mutex);
        while (1) {
            // is lock available?
            if (atomic_bit_test_set(mutex, 31) == 0) {
                // remove mutex from queue – it has the lock now
                atomic_decrement(mutex);
                return;
            }
            // Have to wait. Make sure futex value is locked (negative)
            v = *mutex;
            if (v >= 0)
                continue;
            // wait to be woken up when lock is available
            // this is not a spin lock… (signal)
            futex_wait(mutex, v);
        }
    }

FUTEX: MUTEX_UNLOCK PSEUDOCODE

    void mutex_unlock(int *mutex) {
        // Adding 0x80000000 to the counter results in 0 if and only if
        // there are no other interested threads
        if (atomic_add_zero(mutex, 0x80000000))
            return;
        // There are other threads waiting for this lock (mutex);
        // wake one of them up (e.g. dequeue it)
        futex_wake(mutex);
    }

- Interesting note: a futex bug in Red Hat Linux
- https://www.infoq.com/news/2015/05/redhat-futex
