POSIX Threads (Pthreads) for Shared Address Space Programming Pthreads • POSIX threads package • Available on almost all machines (portable standard) • Sort of like doing “parallel” (not “parallel for”) in OpenMP explicitly • Basic calls: • pthread_create: creates a thread to execute a given function • pthread_join • barriers, mutexes • Thread-private variables • Many online resources: • E.g., https://computing.llnl.gov/tutorials/pthreads/

L.V.Kale 2 Pthreads – Create and Join • Spawn an attached thread: pthread_create(&thread1, NULL, foo, &arg) • Wait for it: pthread_join(thread1, &status) • Thread execution: void *foo(void *arg) { /* thread code */ ... return status; } • Detached threads • Join is not needed • The OS destroys thread resources when they terminate • A parameter in the create call indicates a detached thread
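To make the create/join pattern concrete, here is a minimal, self-contained sketch (illustrative only; the function name foo and the argument value are made up, not taken from the slides):

#include <stdio.h>
#include <pthread.h>

void *foo(void *arg) {                 /* thread entry point */
    int *value = (int *)arg;           /* argument handed over by pthread_create */
    printf("worker sees %d\n", *value);
    return NULL;                       /* return value is delivered to pthread_join */
}

int main(void) {
    pthread_t thread1;
    int arg = 42;
    void *status;
    pthread_create(&thread1, NULL, foo, &arg);   /* spawn an attached thread */
    pthread_join(thread1, &status);              /* wait for it and collect its return value */
    return 0;
}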

L.V.Kale 3 Executing a Thread

[Figure: the main program calls pthread_create(&thread1, NULL, func, &arg), which starts Thread1 executing void *func(void *arg) on its own stack (Thread1 stack vs. stackMain); main later calls pthread_join(thread1, status) to wait for the thread and collect its return value]

L.V.Kale 4 Basic Locks • Declare a lock: pthread_mutex_t mutex; • Initialize a mutex: pthread_mutex_init(&mutex, NULL); // use defaults • Enter and release: pthread_mutex_lock(&mutex); and pthread_mutex_unlock(&mutex); • Try the lock without blocking: pthread_mutex_trylock(&mutex); • Returns 0 if successful (i.e., the lock is acquired) • Release resources: pthread_mutex_destroy(&mutex);
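A short sketch of how these calls fit together, protecting a shared counter (illustrative only; the counter and the thread function are not from the slides):

#include <pthread.h>
#include <stdio.h>

long counter = 0;                      /* shared state */
pthread_mutex_t mutex;

void *increment(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&mutex);    /* enter the critical section */
        counter++;                     /* protected update */
        pthread_mutex_unlock(&mutex);  /* leave the critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_mutex_init(&mutex, NULL);            /* use default attributes */
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_mutex_destroy(&mutex);               /* release lock resources */
    printf("counter = %ld\n", counter);          /* prints 200000 */
    return 0;
}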

L.V.Kale 5 Hello World: Pthreads

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void* Hello(void* myRank) {
    long *id = (long*)(myRank);
    printf("Hello from thread %ld\n", *id);
    return NULL;
}

int main(int argc, char **argv) {
    long threads = strtol(argv[1], NULL, 10);
    pthread_t *threadHandles = malloc(threads * sizeof(pthread_t));
    long *ids = (long*)malloc(sizeof(long) * threads);
    for (long t = 0; t < threads; t++) {
        ids[t] = t;
        pthread_create(&threadHandles[t], NULL, Hello, &ids[t]);
    }
    for (long t = 0; t < threads; t++)
        pthread_join(threadHandles[t], NULL);
    free(ids);
    free(threadHandles);
    return 0;
}

L.V.Kale 7 Threads and Resources • Suppose you are running on a machine with K cores • Each core may have 2 “hardware threads” (SMT) • This is often called hyperthreading; SMT stands for simultaneous multi-threading • How many pthreads can you create? • Unlimited (well … the system may run out of resources like memory) • Can be smaller or larger than K • In performance-oriented programs, it’s rarely more than 2K (assuming 2-way SMT) • We want to prevent the OS from swapping out our threads • Which cores does each thread run on? • By default: any (i.e., the OS suspends each running thread every few ms, and runs another thread)

L.V.Kale 8 Affinity • Which cores does each thread run on? • By default: any (i.e., the OS suspends each running thread every few ms, and runs another thread) • Even if you have fewer threads than hardware threads • But that’s bad for cache locality • Caches will be polluted by the work of other threads ... you will almost always get a “cold” start when you get scheduled every few ms • Pthreads provide a way of “binding” threads to hardware resources for this purpose

L.V.Kale 9 Pthread Affinity • Set-affinity (or pinning) assigns a thread to a set of hardware threads • Can use topological info to pin to cores, sockets, NUMA domains, etc. • A library that provides such information is “hwloc” • Example pattern of usage:

...
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(PEnum, &cpuset);   // can be called multiple times to add hardware threads to the set
pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
...
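A fuller sketch of the same pattern, pinning each of four threads to its own core (Linux/glibc-specific, since pthread_setaffinity_np is a non-portable extension; the names worker and myCore are made up for illustration):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

void *worker(void *arg) {
    long myCore = (long)arg;
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(myCore, &cpuset);            /* restrict this thread to a single core */
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    printf("thread pinned to core %ld\n", myCore);
    /* ... do the actual work here ... */
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (long c = 0; c < 4; c++)
        pthread_create(&t[c], NULL, worker, (void *)c);
    for (long c = 0; c < 4; c++)
        pthread_join(t[c], NULL);
    return 0;
}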

L.V.Kale 10 OpenMP vs. Pthreads • OpenMP is great for parallel loops • And for many simple situations with just “#pragma omp parallel” as well • But when there is complicated synchronization and performance is important, pthreads are (currently) better • However, pthreads are not available on all machines/OSes • Especially Windows

L.V.Kale 11 Performance-Oriented Programming in Pthreads • Pthreads as used in OS programming don’t need to be as performance-oriented as what we need in HPC • E.g., “synchronizing” every few microseconds • I.e., exchanging data or waiting for signals

• Improving performance: • Always use affinity • Choose the number of pthreads to avoid any over-subscription, and use SMT only if memory bandwidth (and floating-point intensity) permit • Minimize barriers, using point-to-point synchronization as much as possible (say, between a producer and a consumer, as in Gauss-Seidel) • Reduce cross-core communication (it’s much better to use the data produced on one core on the same core if/when possible) • Locks cause serialization of computation across threads

L.V.Kale 12 C++11 Atomics, Wait-free Synchronization, and Queues  Recall: why the following doesn’t work

Initially, x and Flag are both 0

Thread 0:                  Thread 1:
  x = 25;                    while (Flag == 0) ;
  Flag = 1;                  Print x;

What should get printed?

L.V.Kale 14 Sequential Consistency • This is a “desired property” of parallel programming systems • The effect of executing a program consisting of k threads should be the same as some arbitrary interleaving of statements executed by each thread, executed sequentially

Modern processors do not satisfy sequential consistency!

L.V.Kale 15 [Figure: processors PE0, PE1, …, PE(p-1) all access a single Memory through an Arbitrator, the abstract machine model that sequential consistency assumes]

L.V.Kale 16 Support for memory consistency models • OpenMP provided a flush primitive for dealing with this issue • Ensures variables are written out to memory and no reordering of instructions happens across the flush call • With Pthreads, in the past, you’d need to use processor-specific memory-fence operations • On Intel (e.g., mfence) • On PowerPC (e.g., sync / lwsync) • Load-linked/store-conditional, etc. • C++ recently standardized this • C++11 atomics in the C++11 Standard • Supports sequential consistency as well as specific relaxed memory orderings

L.V.Kale 17 C++11 atomics • http://en.cppreference.com/w/cpp/atomic/atomic • Basic: • Declare some (scalar) variables as atomic • This ensures accesses to those variables, among themselves, are sequentially consistent • If one thread writes to an atomic object while another thread reads from it, the behavior is well-defined (the memory model defines it) • #include <atomic> • Declarations: • std::atomic<int> atm_var; • std::atomic<int*> atm_ptr; • std::atomic<int> *atm_array;
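Connecting this back to the earlier x/Flag example: declaring Flag as an atomic makes the busy-wait well-defined and guarantees that Thread 1 sees x = 25. A small illustrative sketch (not code from the slides):

#include <atomic>
#include <thread>
#include <cstdio>

int x = 0;
std::atomic<int> Flag{0};      // accesses to Flag are sequentially consistent by default

void thread0() {
    x = 25;                    // ordinary write ...
    Flag.store(1);             // ... guaranteed visible to anyone who sees Flag == 1
}

void thread1() {
    while (Flag.load() == 0) ; // spin until the flag is set
    std::printf("%d\n", x);    // prints 25
}

int main() {
    std::thread t1(thread1), t0(thread0);
    t0.join(); t1.join();
    return 0;
}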

L.V.Kale 18 Atomic: C++11
• atomic: class template and specializations for bool, int, and pointer types
• atomic_store: atomically replaces the value of the atomic object with a non-atomic argument
• atomic_load: atomically obtains the value stored in an atomic object
• atomic_fetch_add: adds a non-atomic value to an atomic object and obtains the previous value of the atomic
• atomic_compare_exchange_strong: atomically compares the value of the atomic object with a non-atomic argument and performs an atomic exchange if equal or an atomic load if not

Source: https://en.cppreference.com/w/cpp/atomic

L.V.Kale 19 atomic_compare_exchange_strong(a, b, c)
[Figure: two cases. If a (value 0) does not equal the expected value b (5), the exchange fails and b is loaded with a’s current value, 0. If a equals b (both 0), a is atomically set to the desired value c (7).]
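A small sketch (not from the slides) of how compare_exchange_strong is typically used, here for a lock-free increment:

#include <atomic>

std::atomic<int> a{0};

void add_one() {
    int expected = a.load();
    // If a still equals 'expected', replace it with expected + 1; otherwise
    // 'expected' is reloaded with a's current value and we try again.
    while (!a.compare_exchange_strong(expected, expected + 1))
        ;   // retry until the exchange succeeds
}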

L.V.Kale 20 C++11 Atomics: Avoiding Locks and Serialization, and Queues with Atomics  Locks, Serialization, and Wait-Free Synchronization

• Locks are an established way of enforcing mutual exclusion • They also enforce aspects of sequential consistency: • Memory operations are not moved across lock or unlock calls by the compiler • The hardware is made to ensure all writes are completed at lock() or unlock() • But locks are expensive, because they cause serialization

L.V.Kale 23 Locks, Critical Sections, and Serialization • Suppose all threads are doing the following:

for i = 0 to N {
    dowork();
    lock(x);
    ... critical section ... ;
    unlock(x);
}

• The work in dowork() takes time tw • The time in the critical section is tc • The serialization cost becomes a problem as the number of threads increases, but it stays small as long as #threads < tw/tc

L.V.Kale 24 [Figure: per-thread timelines alternating work segments (tw) and critical-section segments (tc); as more threads share the lock, their tc segments must serialize, so threads increasingly wait at the lock]
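As a rough worked example (illustrative numbers, not from the slides): if dowork() takes tw = 10 µs and the critical section takes tc = 1 µs, then tw/tc = 10. With up to about 10 threads, each thread’s critical section can be hidden under the other threads’ work and the lock rarely blocks anyone; beyond that, the lock is busy essentially all the time and the critical sections serialize, so total execution time grows with the number of threads.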

L.V.Kale 25 Locks, Serialization, and Wait-Free Synchronization

• Locks are an established way of enforcing mutual exclusion • They also enforce aspects of sequential consistency: • Memory operations are not moved across lock or unlock calls by the compiler • The hardware is made to ensure all writes are completed at lock() or unlock() • But locks are expensive, because they cause serialization • Still, for most practical situations, locks are fast enough • Just use locks and avoid all the trouble, in practice • Unless you are in a fine-grained situation with many threads • I.e., the computation between consecutive calls to lock is very short • Then, consider a wait-free implementation with atomics

L.V.Kale 26 An Aside: “Lock-Free Algorithms” • In the early days of computer science, there were many research papers and textbook treatments of lock-free algorithms • Peterson’s, Dekker’s … • These algorithms all depended on sequential consistency, which processors of the day might have supported • That is no longer true, and so those algorithms are mostly not useful • They may occasionally provide inspiration for a wait-free algorithm

L.V.Kale 27 Example: Circular Fixed-Size Queues • We will look at efficient implementations of shared queues • Depending on sharing assumptions, one can make more efficient queues • General: multiple producers/consumers • Single producer / single consumer • Multiple producers, single consumer • Steal queues (popularized by Cilk)

L.V.Kale 28 Circular Queues: Implementation • Array of fixed size 2^n • Masking of indices

[Figure: a circular array with slots 0, 1, 2, 3, …, 1021, 1022, 1023; logical indices 1024, 1025, … wrap around to slots 0, 1, …]

Masking of Indices

Example: index 1025 is 10000000001 in binary; masking it (keeping the low 10 bits for a 1024-slot queue) gives 0000000001, i.e., slot 1
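The queue code on the following slides uses a helper mask(); a minimal sketch of what it might look like (the name mask and the constant capacity follow the slides, the body is an assumption based on the example above):

// capacity must be a power of 2 (here 1024, matching the figure)
const int capacity = 1024;

int mask(int index) {
    return index & (capacity - 1);   // keep only the low log2(capacity) bits
}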

L.V.Kale 29 Single Producer Single Consumer Queue • We will look at a fixed-size queue • Not allowed to grow beyond a fixed limit • The single producer accesses the tail • The single consumer accesses the head • No contention on head and tail • Count the number of elements in the queue: used to safeguard against the empty and full conditions on the queue • Three implementations • Lockless, thread-unsafe • Locking, thread-safe • Lockless, thread-safe

L.V.Kale 30 Single Producer Single Consumer: Lockless – Thread Unsafe

class SPSCQueue {
 private:
  T* arr;
  int count;
  int head, tail;

 public:
  bool enq(T &data) {
    int ret = 0;
    if (count < capacity) {
      arr[mask(tail++)] = data;
      count++;
      ret = 1;
    }
    return ret;
  }

  bool deq(T &out) {
    int ret = 0;
    if (count > 0) {
      count--;
      out = arr[mask(head++)];
      ret = 1;
    }
    return ret;
  }
};

L.V.Kale 31 Single Producer Single Consumer: Locking – Thread Safe

class SPSCQueue {
 private:
  T* arr;
  int count;
  int head, tail;
  mutex mtx;

 public:
  bool enq(T &data) {
    int ret = 0;
    mtx.lock();
    if (count < capacity) {
      arr[mask(tail++)] = data;
      count++;
      ret = 1;
    }
    mtx.unlock();
    return ret;
  }

  bool deq(T &out) {
    int ret = 0;
    mtx.lock();
    if (count > 0) {
      count--;
      out = arr[mask(head++)];
      ret = 1;
    }
    mtx.unlock();
    return ret;
  }
};

• Using the notion of 'count' to safeguard the array against the empty and full conditions

Single Producer Single Consumer: Lockless – Thread Safe

class SPSCQueue {
 private:
  array<atomic<T>, capacity> arr;
  atomic<int> count;
  int head, tail;

 public:
  bool enq(T &data) {
    if (count.load() < capacity) {
      arr[mask(tail++)].store(data);
      count.fetch_add(1);
      return 1;
    }
    return 0;
  }

  bool deq(T &out) {
    if (count.load() > 0) {
      out = arr[mask(head++)].load();
      count.fetch_add(-1);
      return 1;
    }
    return 0;
  }
};

• Make count atomic (it is accessed by both the producer and the consumer)

L.V.Kale 33 C++11 Atomics: Queues  Multiple Producer Single Consumer Queue • Assume a fixed-size queue whose capacity is a power of 2 • We will use the notion of an 'EMPTY' element • A specific value denotes empty in the queue (say –1) • A producer thread checks whether a position contains EMPTY before inserting into it • The consumer thread makes sure a position does not contain EMPTY before extracting a value from it • After extracting the value, it stores EMPTY in that position

L.V.Kale 35 Multiple Producers Single Consumer: Thread Unsafe

class MPSC_Queue {
 private:
  T* arr;
  int head;
  int tail;

 public:
  bool enq(T &data) {
    if (arr[mask(tail)] != EMPTY) return 0;
    else {
      arr[mask(tail++)] = data;
      return 1;
    }
  }

  bool deq(T &out) {
    if (arr[mask(head)] == EMPTY) return 0;
    else {
      out = arr[mask(head)];
      arr[mask(head++)] = EMPTY;
      return 1;
    }
  }
};

• There is a race condition on tail as a result of multiple producers • There is a race on each cell in arr as a producer and the consumer thread access it without synchronization • No race on head: only one thread accesses it

L.V.Kale 36 Multiple Producers Single Consumer: Locking – Thread Safe

class MPSC_Locking_Queue {
 private:
  T* arr;
  int head, tail;
  mutex mtx;

 public:
  bool enq(T &data) {
    int ret;
    mtx.lock();
    if (arr[mask(tail)] == EMPTY) {
      arr[mask(tail++)] = data;
      ret = 1;
    } else ret = 0;
    mtx.unlock();
    return ret;
  }

  bool deq(T &out) {
    int ret;
    mtx.lock();
    if (arr[mask(head)] == EMPTY) ret = 0;
    else {
      out = arr[mask(head)];
      arr[mask(head++)] = EMPTY;
      ret = 1;
    }
    mtx.unlock();
    return ret;
  }
};

• Once the mtx is acquired by a thread, no other thread can acquire it before mtx.unlock() • Other threads wait at the critical section till the lock is released • Locking and unlocking overheads are significant • Note: always release the lock before the return statement (ret helps with that)

Multiple Producers Single Consumer: Thread Unsafe (1.0)

class MPSC_Queue {
 private:
  T* arr;
  int head;
  int tail;

 public:
  bool enq(T &data) {
    if (arr[mask(tail)] == EMPTY) {
      arr[mask(tail++)] = data;
      return 1;
    }
    else return 0;
  }

  bool deq(T &out) {
    out = arr[mask(head)];
    if (out == EMPTY) return 0;
    else {
      arr[mask(head++)] = EMPTY;
      return 1;
    }
  }
};

• We will modify this lockless, thread-unsafe version into a lockless, thread-safe version in 2 steps

L.V.Kale 41 Multiple Producers Single Consumer: Step 1

class MPSC_Queue {
 private:
  T* arr;
  int head;
  atomic<int> tail;

 public:
  bool enq(T &data) {
    int mytail = tail.fetch_add(1);
    if (arr[mask(mytail)] == EMPTY) {
      arr[mask(mytail)] = data;
      return 1;
    } else {
      tail.fetch_add(-1);
      return 0;
    }
  }

  bool deq(T &out) {
    out = arr[mask(head)];
    if (out == EMPTY) return 0;
    else {
      arr[mask(head++)] = EMPTY;
      return 1;
    }
  }
};

• The previous version was vulnerable to conflicts between 2 producers • Access to the tail was not protected • Change the tail into an atomic variable • Modify the operations accordingly: atomic fetch_add replaces the post-increment operation

L.V.Kale 42 Multiple Producers Single Consumer: Step 2

class MPSC_Queue {
 private:
  array<atomic<T>, capacity> arr;
  int head;
  atomic<int> tail;

 public:
  bool enq(T &data) {
    int mytail = tail.fetch_add(1);
    if (arr[mask(mytail)].load() == EMPTY) {
      arr[mask(mytail)].store(data);
      return 1;
    } else {
      tail.fetch_add(-1);
      return 0;
    }
  }

  bool deq(T &out) {
    out = arr[mask(head)].load();
    if (out == EMPTY) return 0;
    else {
      arr[mask(head++)].store(EMPTY);
      return 1;
    }
  }
};

• The previous version did not prevent a race between a producer and a consumer • Make the underlying data structure of the queue an array of atomics • Modify the operations accordingly • Note: load() and store() calls are not compulsory on atomics; a simple assignment performs the same operation

Multiple Producers Single Consumer: Analysis (same code as in Step 2)
• Analysis:
  • 3 atomic operations per enq in the normal case
  • 3 atomic operations per enq if the queue is full
  • 2 atomic operations per deq in the normal case
  • 1 atomic operation per deq if the queue is empty

[Figure: animation of the queue with producers P1 and P2 and consumer C1; purple cells denote the "EMPTY" value, with Head and Tail markers]
• P1 atomically fetches and increments the value of the global tail
• C1 attempts a dequeue and finds the empty value (failure)
• P2 atomically fetches and increments the value of the global tail
• P2 adds 15 to the queue
• P1 adds 7 to the queue
• C1 (non-atomically) fetches and increments the value of the global head
• C1 attempts a dequeue again and finds 7 (success)
• C1 attempts a dequeue again and finds 15 (success)

Not done yet! There is a bug
• Head is pointing to x, and tail to x as well; the queue is full
• What if P1 comes in to enqueue and gets index x as the tail where it should insert?
  • Now tail is x+1 (P1: mytail is x)
  • Q[x] is not empty, so P1 goes to the else clause
  • but P1 has not executed the else clause yet
• In the meanwhile, P2 comes to enqueue and also increments tail
  • so it has x+1 as the place where it will insert (P2: mytail is x+1)
  • the global tail is x+2
• The consumer comes to dequeue twice
  • so head now becomes x+2, and Q[x] and Q[x+1] are both empty
• Next, P1 continues
  • decrements tail, so the global tail is now x+1
  • returns with failure (returns 0)
• Now we have a hole in the queue! No data at x, but P2's data is at x+1
• The consumer will never go past the hole, and P2's data will get overwritten by the next enqueue

bool enq(T &data) {
  int mytail = tail.fetch_add(1);
  if (arr[mask(mytail)].load() == EMPTY) {
    arr[mask(mytail)].store(data);
    return 1;
  } else {
    tail.fetch_add(-1);
    return 0;
  }
}

• Tail is 2, Head is 2, the queue is full
• P1 comes to enqueue
  • P1.mytail == 2
  • tail == 3
• P1 discovers arr[2] is not empty
  • but before it can execute the else ...
• P2 comes to enqueue (say, the value 90)
  • P2.mytail == 3
  • tail == 4
  • but before it executes the if ...
• Consumer C dequeues twice
  • dequeues 82 and 43; replaces them with empty values
• Now P2 executes the if, finds arr[3] empty, and enqueues its value there
• P1 executes the else, decrements tail, and leaves an empty value (a hole) at arr[2]!
[Figure: an 8-slot queue holding 76 12 82 43 25 31 69 52 at indices 0..7, with Head and Tail both at index 2; after the interleaving, 90 sits at index 3 and index 2 is an EMPTY hole]

The problem in a nutshell
• The problem was that P1 reserved a spot in the queue, but gave up and returned
• A process must enqueue in a spot it reserves, even if it has to wait …

L.V.Kale 49 Multiple Producers Single Consumer: Lockless – Corrected

class MPSC_Lockless_Queue {
 private:
  array<atomic<T>, capacity> arr;
  int head;
  atomic<int> tail;

 public:
  bool enq(T &data) {
    int mytail = tail.fetch_add(1);
    while (arr[mask(mytail)].load() != EMPTY)
      ;  // eventually the consumer will empty it
    arr[mask(mytail)].store(data);
    return 1;
  }

  bool deq(T &out) {
    out = arr[mask(head)].load();
    if (out == EMPTY) return 0;
    else {
      arr[mask(head++)].store(EMPTY);
      return 1;
    }
  }
};
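To see the corrected queue in action, here is an illustrative, self-contained driver (not from the slides; the element type int, the capacity, the EMPTY sentinel, and the test harness are assumptions, while the queue class mirrors the code above):

#include <atomic>
#include <array>
#include <thread>
#include <cstdio>

constexpr int capacity = 1024;              // must be a power of 2
constexpr int EMPTY = -1;                   // sentinel marking an empty slot
int mask(int i) { return i & (capacity - 1); }

class MPSC_Lockless_Queue {
  std::array<std::atomic<int>, capacity> arr;
  int head = 0;
  std::atomic<int> tail{0};
 public:
  MPSC_Lockless_Queue() { for (auto &a : arr) a.store(EMPTY); }
  bool enq(int data) {
    int mytail = tail.fetch_add(1);
    while (arr[mask(mytail)].load() != EMPTY) ;  // wait for the consumer to free the reserved slot
    arr[mask(mytail)].store(data);
    return true;
  }
  bool deq(int &out) {
    out = arr[mask(head)].load();
    if (out == EMPTY) return false;
    arr[mask(head++)].store(EMPTY);
    return true;
  }
};

int main() {
  MPSC_Lockless_Queue q;
  auto produce = [&q](int base) { for (int i = 0; i < 1000; i++) q.enq(base + i); };
  std::thread p1(produce, 0), p2(produce, 10000);   // two producers
  int got = 0, v;
  while (got < 2000)                                 // single consumer drains all items
    if (q.deq(v)) got++;
  p1.join(); p2.join();
  std::printf("consumed %d items\n", got);
  return 0;
}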

L.V.Kale 50 Wait-free? • Now, of course, our queue is not wait-free in a strict sense because of the while loop • You could • Argue that the queue size must be chosen such that the queue never becomes full • Say that it’s wait-free except when the queue is full • OR: rewrite the code so that the producer remembers the value it needs to enqueue in a private data structure (a queue?) and returns with failure; but a process that gets a failure on enqueue must call again to try to enqueue • Use an overflow queue protected by a lock?? • Does the FIFO ordering matter?

L.V.Kale 51 Cilk: A Task-Based Parallel Language  Cilk Language • Developed over 15+ years

• Two keywords added to C: spawn and sync • Popularized the idea of work stealing • But the idea is older: • MultiLisp (Halstead) • State-space search: Vipin Kumar • Formalized the idea of work stealing • Proofs on optimality / time and space complexity • Intel’s Cilk++: via Cilk Arts

cilk int fib (int n) {
  if (n < 2) return n;
  else {
    int x, y;
    x = spawn fib(n-1);
    y = spawn fib(n-2);
    sync;
    return (x+y);
  }
}

• Only 2 keywords added to C • Program semantics is the same as if you delete the 2 keywords (spawn and sync) from the program

Example from: http://supertech.lcs.mit.edu/cilk/intro.html

L.V.Kale 54 Possible Implementations • User-level threads and suspensions • Task scheduling (who does what when) • Centralized queue: good balance, scalability?, overhead (locks for queues) • 1 queue for every core, and assign work to one of them randomly • Load imbalance: some core may get too much • Extension: balance queues periodically • 1 queue for every core, keep work in your queue, initially • Periodic balancing • Alternative: an idle processor de-queues work from someone else’s queue: • This is called work stealing

L.V.Kale 55 Work Stealing • Every core (Pthread, hardware thread) has its own stack • Spawned “threads” (tasks) are pushed on the local stack (which is implemented as a double-ended queue: a deque) • Each processor’s scheduler picks tasks from the top of its own queue • This is called “bottom” in some Cilk papers (deque terminology can be ambiguous) • If its own queue is empty? • Steal from a random processor’s queue • (Alternatives could have been: round robin, globally next, .. explored in Vipin Kumar’s work)

L.V.Kale 56 Formalization

• Critical path: Tinfinity • I.e., execution time with an infinite number of processors

• Best possible execution time: T1 / P • P: number of processors

• T1: execution time on 1 processor • Good implementation of steal queues (later): • Local operations are cheap, steals are expensive • So, prove that the number of steals is asymptotically minimal

• Also, execution time is O(Tinfinity + T1/P) (examine the papers) • Actually, a tighter bound with an explicit constant on the Tinfinity term

L.V.Kale 57 Intel’s Cilk Plus • Somewhat different syntax for spawn and sync • Provides a cilk_for parallel loop • Implemented by divide-and-conquer task spawns • 0:N => 0:N/2 and N/2:N, and recurse • A small syntax sketch follows below
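For reference, a small sketch of how the earlier fib example is spelled in Cilk Plus, plus a cilk_for loop (illustrative; assumes a compiler with Cilk Plus support and the <cilk/cilk.h> header):

#include <cilk/cilk.h>

int fib(int n) {
    if (n < 2) return n;
    int x = cilk_spawn fib(n - 1);     // child may run in parallel (e.g., after being stolen)
    int y = fib(n - 2);                // parent continues with the other half
    cilk_sync;                         // wait for the spawned child before using x
    return x + y;
}

void scale(double *a, double s, int n) {
    cilk_for (int i = 0; i < n; i++)   // parallel loop, implemented by recursive splitting
        a[i] *= s;
}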

L.V.Kale 58 Steal Queues • “Agenda” parallelism (set of tasks) • The idea of work stealing • Cilk, and some previous systems

L.V.Kale 59 Steal Queues (deque) • Basic idea: • Local enqueue (push) should be cheap • Local de-queue (pop) should be fast in the common case, and only occasionally use synchronization • Steal (non-local pop) will always use synchronization (locks or atomics) • A simplified sketch follows below
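A highly simplified sketch of this idea (not Cilk’s actual implementation; it uses one mutex per deque for clarity, whereas real steal queues make the owner’s push/pop much cheaper than the steal path):

#include <deque>
#include <mutex>
#include <functional>
#include <cstdlib>

struct StealDeque {
    std::deque<std::function<void()>> tasks;
    std::mutex mtx;

    void push(std::function<void()> t) {          // owner pushes newly spawned tasks
        std::lock_guard<std::mutex> g(mtx);
        tasks.push_back(std::move(t));
    }
    bool pop(std::function<void()> &t) {          // owner pops from the same end (LIFO)
        std::lock_guard<std::mutex> g(mtx);
        if (tasks.empty()) return false;
        t = std::move(tasks.back()); tasks.pop_back();
        return true;
    }
    bool steal(std::function<void()> &t) {        // thief takes from the opposite end (FIFO)
        std::lock_guard<std::mutex> g(mtx);
        if (tasks.empty()) return false;
        t = std::move(tasks.front()); tasks.pop_front();
        return true;
    }
};

// Scheduler loop for worker 'me' out of 'nworkers' deques:
void worker_loop(StealDeque *deques, int me, int nworkers) {
    std::function<void()> t;
    for (;;) {
        if (deques[me].pop(t)) t();                         // run local work first
        else if (deques[rand() % nworkers].steal(t)) t();   // otherwise steal from a random victim
        // (a real runtime also needs an idle/termination check here)
    }
}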

L.V.Kale 60