Synchronization
CS 740, November 14, 2003
Based on TCM/C&S

Types of Synchronization
• Locks
• Event Synchronization
  – Global or group-based (barriers)
  – Point-to-point

Topics
• Locks
• Barriers
• Hardware primitives

–2– CS 740 F’03

Busy Waiting vs. Blocking

Busy-waiting is preferable when:
• scheduling overhead is larger than expected wait time
• processor resources are not needed for other tasks
• schedule-based blocking is inappropriate (e.g., in OS kernel)

A Simple Lock

    lock:   ld  register, location
            cmp register, #0
            bnz lock
            st  location, #1
            ret
    unlock: st  location, #0
            ret

Need Atomic Primitive!

• Test&Set
• Swap
• Fetch&Op
  – Fetch&Incr, Fetch&Decr
• Compare&Swap

Test&Set based lock

    lock:   t&s register, location
            bnz lock
            ret
    unlock: st  location, #0
            ret


T&S Lock Performance

Code: lock; delay(c); unlock;
Same total no. of lock calls as p increases; measure time per transfer.

[Figure: time (µs) vs. number of processors (up to 15) for Test&set, c = 0;
Test&set with exponential backoff, c = 3.64; Test&set with exponential
backoff, c = 0; Ideal]

Test and Test and Set

    A: while (lock != free)
           if (test&set(lock) == free) {
               critical section;
           }
           else goto A;

(+) spinning happens in cache
(-) can still generate a lot of traffic when many processors go to do test&set

Test and Set with Backoff

Upon failure, delay for a while before retrying
• either constant delay or exponential backoff
Tradeoffs:
(+) much less network traffic
(-) exponential backoff can cause starvation for high-contention locks
  – new requestors back off for shorter times
But exponential found to work best in practice

Test and Set with Update

Test and Set sends updates to processors that cache the lock
Tradeoffs:
(+) good for bus-based machines
(-) still lots of traffic on distributed networks

Main problem with test&set-based schemes is that a lock release causes all
waiters to try to get the lock, using a test&set to try to get it.


Ticket Lock (fetch&incr based)

Two counters:
• next_ticket (number of requestors)
• now_serving (number of releases that have happened)
Algorithm:
• First do a fetch&incr on next_ticket (not test&set)
• When release happens, poll the value of now_serving
  – if now_serving == my_ticket, I win
Use delay while polling; but how much?

Ticket Lock Tradeoffs

(+) guaranteed FIFO order; no starvation possible
(+) latency can be low if fetch&incr is cacheable
(+) traffic can be quite low
(-) but traffic is not guaranteed to be O(1) per lock acquire

Lock Performance on SGI Challenge

Loop: lock; delay(c); unlock; delay(d);

[Figure: time (µs) vs. number of processors (up to 15) for Array-based,
LL-SC, LL-SC with exponential backoff, Ticket, and Ticket with proportional
backoff locks; panels: (a) Null (c = 0, d = 0), (b) Critical-section
(c = 3.64 µs, d = 0), (c) Delay (c = 3.64 µs, d = 1.29 µs)]

• Simple LL-SC lock does best at small p due to unfairness
  – Not so with delay between unlock and next lock
  – Need to be careful with backoff
• Ticket lock with proportional backoff scales well, as does array lock
• Methodologically challenging, and need to look at real workloads

Array-Based Queueing Locks

Every process spins on a unique location, rather than on a single
now_serving counter
fetch&incr gives a process the address on which to spin
Tradeoffs:
(+) guarantees FIFO order (like ticket lock)
(+) O(1) traffic with coherence caches (unlike ticket lock)
(-) requires space per lock proportional to P

List-Based Queueing Locks (MCS)

All other good things + O(1) traffic even without coherent caches (spin locally)
Uses compare&swap to build linked lists in software
Locally-allocated flag per list node to spin on
Can work with fetch&store, but loses FIFO guarantee
Tradeoffs:
(+) less storage than array-based locks
(+) O(1) traffic even without coherent caches
(-) compare&swap not easy to implement

Implementing Fetch&Op

Load Linked / Store Conditional

    lock:   ll   reg1, location   /* LL location to reg1 */
            bnz  reg1, lock       /* check if location locked */
            sc   location, reg2   /* SC reg2 into location */
            beqz reg2, lock       /* if failed, start again */
            ret
    unlock: st   location, #0     /* write 0 to location */
            ret

Barriers

We will discuss five barriers:
• centralized
• software combining tree
• dissemination barrier
• tournament barrier
• MCS tree-based barrier

A Working Centralized Barrier

Consecutively entering the same barrier doesn't work
• Must prevent process from entering until all have left previous instance
• Could use another counter, but that increases latency and contention
Sense reversal: wait for flag to take a different value in consecutive instances

    BARRIER (bar_name, p) {
        local_sense = !(local_sense);      /* toggle private sense var */
        LOCK(bar_name.lock);
        mycount = bar_name.counter++;      /* mycount is private */
        if (bar_name.counter == p) {
            UNLOCK(bar_name.lock);
            bar_name.counter = 0;          /* reset for next instance */
            bar_name.flag = local_sense;   /* release waiters */
        } else {
            UNLOCK(bar_name.lock);
            while (bar_name.flag != local_sense) {};
        }
    }

Centralized Barrier

Basic idea:
• notify a single shared counter when you arrive
• poll that shared location until all have arrived
Simple implementation requires polling/spinning twice:
• first to ensure that all procs have left previous barrier
• second to ensure that all procs have arrived at current barrier
Problem: contention
Solution to get one spin: sense reversal

Software Combining Tree Barrier

Little contention: tree structured rather than flat
Writes into one tree for barrier arrival
Reads from another tree to allow procs to continue
Sense reversal to distinguish consecutive barriers

Tournament Barrier

Binary combining tree
Representative processor at a node is statically chosen
• no fetch&op needed
In round k, proc i = 2^k sets a flag for proc j = i - 2^k
• i then drops out of tournament and j proceeds in next round
• i waits for global flag signalling completion of barrier to be set
  – could use combining wakeup tree

Barrier Performance

[Figure: time (µs) vs. number of processors (1-8) for Centralized,
Combining tree, Tournament, and Dissemination barriers]

• Centralized does quite well
• Helpful hardware support: piggybacking of read misses on bus
  – Also for spinning on highly contended locks

MCS Software Barrier

Modifies tournament barrier to allow static allocation in wakeup tree, and
to use sense reversal
Every processor is a node in two P-node trees:
• has pointers to its parent, building a fanin-4 arrival tree
• has pointers to its children, building a fanout-2 wakeup tree

Barrier Recommendations

Criteria:
• length of critical path
• number of network transactions
• space requirements
• atomic operation requirements

Space Requirements

Centralized:
• constant
MCS, combining tree:
• O(P)
Dissemination, Tournament:
• O(P log P)

Network Transactions

Centralized, combining tree:
• O(P) if broadcast and coherent caches
• unbounded otherwise
Dissemination:
• O(P log P)
Tournament, MCS:
• O(P)


Critical Path Length

If independent parallel network paths available:
• all are O(log P) except centralized, which is O(P)
Otherwise (e.g., shared bus):
• linear factors dominate

Primitives Needed

Centralized and combining tree:
• atomic increment
• atomic decrement
Others:
• atomic read
• atomic write

Barrier Recommendations

Without broadcast on distributed memory:
• Dissemination
  – MCS is good; only critical path length is about 1.5X longer
  – MCS has somewhat better network load and space requirements
Cache coherence with broadcast (e.g., a bus):
• MCS with flag wakeup
  – centralized is best for modest numbers of processors
Big advantage of centralized barrier:
• adapts to changing number of processors across barrier calls
