POSIX Threads (Pthreads) for Shared Address Space Programming Pthreads • POSIX threads package • Available on almost all machines (portable standard) • Sort of like doing “parallel” (not “parallel for”) in OpenMP explicitly • Basic calls: • pthread_create: creates a thread to execute a given function • pthread_join • barriers, mutexes • Thread-private variables • Many online resources: • E.g., https://computing.llnl.gov/tutorials/pthreads/

L.V.Kale 2 Pthreads – Create and Join • Spawn an attached thread: pthread_create(&thread1, NULL, foo, &arg) • Wait for it: pthread_join(thread1, &status) • Thread execution: void *foo(void *arg) { /* thread code */ ... return status; } • Detached threads • Join is not needed • The OS destroys thread resources when they terminate • A parameter in the create call indicates a detached thread
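To make the create/join pattern concrete, here is a minimal, self-contained sketch (illustrative only; the function name foo and the argument value are made up, not taken from the slides):

#include <stdio.h>
#include <pthread.h>

void *foo(void *arg) {                 /* thread entry point */
    int *value = (int *)arg;           /* argument handed over by pthread_create */
    printf("worker sees %d\n", *value);
    return NULL;                       /* return value is delivered to pthread_join */
}

int main(void) {
    pthread_t thread1;
    int arg = 42;
    void *status;
    pthread_create(&thread1, NULL, foo, &arg);   /* spawn an attached thread */
    pthread_join(thread1, &status);              /* wait for it and collect its return value */
    return 0;
}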

L.V.Kale 3 Executing a Thread

[Figure: the main program calls pthread_create(&thread1, NULL, func, &arg), which starts Thread1 executing void *func(void *arg) on its own stack (Thread1 stack vs. stackMain); main later calls pthread_join(thread1, status) to wait for the thread and collect its return value]

L.V.Kale 4 Basic Locks • Declare a lock: pthread_mutex_t mutex; • Initialize a mutex: pthread_mutex_init(&mutex, NULL); // use defaults • Enter and release: pthread_mutex_lock(&mutex); and pthread_mutex_unlock(&mutex); • Try the lock without blocking: pthread_mutex_trylock(&mutex); • Returns 0 if successful (i.e., the lock is acquired) • Release resources: pthread_mutex_destroy(&mutex);
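A short sketch of how these calls fit together, protecting a shared counter (illustrative only; the counter and the thread function are not from the slides):

#include <pthread.h>
#include <stdio.h>

long counter = 0;                      /* shared state */
pthread_mutex_t mutex;

void *increment(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&mutex);    /* enter the critical section */
        counter++;                     /* protected update */
        pthread_mutex_unlock(&mutex);  /* leave the critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_mutex_init(&mutex, NULL);            /* use default attributes */
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_mutex_destroy(&mutex);               /* release lock resources */
    printf("counter = %ld\n", counter);          /* prints 200000 */
    return 0;
}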

L.V.Kale 5 Hello World: Pthreads

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void* Hello(void* myRank) {
    long *id = (long*)(myRank);
    printf("Hello from thread %ld\n", *id);
    return NULL;
}

int main(int argc, char **argv) {
    long threads = strtol(argv[1], NULL, 10);
    pthread_t *threadHandles = malloc(threads * sizeof(pthread_t));
    long *ids = (long*)malloc(sizeof(long) * threads);
    for (long t = 0; t < threads; t++) {
        ids[t] = t;
        pthread_create(&threadHandles[t], NULL, Hello, &ids[t]);
    }
    for (long t = 0; t < threads; t++)
        pthread_join(threadHandles[t], NULL);
    free(ids);
    free(threadHandles);
    return 0;
}

L.V.Kale 7 Threads and Resources • Suppose you are running on a machine with K cores • Each core may have 2 “hardware threads” (SMT) • This is often called hyperthreading; SMT stands for simultaneous multi-threading • How many pthreads can you create? • Unlimited (well … the system may run out of resources like memory) • Can be smaller or larger than K • In performance-oriented programs, it’s rarely more than 2K (assuming 2-way SMT) • We want to prevent the OS from swapping out our threads • Which cores does each thread run on? • By default: any (i.e., the OS suspends each running thread every few ms, and runs another thread)

L.V.Kale 8 Affinity • Which cores does each thread run on? • By default: any (i.e., the OS suspends each running thread every few ms, and runs another thread) • Even if you have fewer threads than hardware threads • But that’s bad for cache locality • Caches will be polluted by the work of other threads ... you will almost always get a “cold” start when you get scheduled every few ms • Pthreads provide a way of “binding” threads to hardware resources for this purpose

L.V.Kale 9 Pthread Affinity • Set-affinity (or pinning) assigns a thread to a set of hardware threads • Can use topological info to pin to cores, sockets, NUMA domains, etc. • A library that provides such information is “hwloc” • Example pattern of usage:

...
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(PEnum, &cpuset);   // can be called multiple times to add hardware threads to the set
pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
...
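A fuller sketch of the same pattern, pinning each of four threads to its own core (Linux/glibc-specific, since pthread_setaffinity_np is a non-portable extension; the names worker and myCore are made up for illustration):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

void *worker(void *arg) {
    long myCore = (long)arg;
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(myCore, &cpuset);            /* restrict this thread to a single core */
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    printf("thread pinned to core %ld\n", myCore);
    /* ... do the actual work here ... */
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (long c = 0; c < 4; c++)
        pthread_create(&t[c], NULL, worker, (void *)c);
    for (long c = 0; c < 4; c++)
        pthread_join(t[c], NULL);
    return 0;
}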

L.V.Kale 10 OpenMP vs. Pthreads • OpenMP is great for parallel loops • And for many simple situations with just “#pragma omp parallel” as well • But when there is complicated synchronization and performance is important, pthreads are (currently) better • However, pthreads are not available on all machines/OSes • Especially Windows

L.V.Kale 11 Performance-Oriented Programming in Pthreads • Pthreads as used in OS programming don’t need to be as performance-oriented as what we need in HPC • E.g., “synchronizing” every few microseconds • I.e., exchanging data or waiting for signals

• Improving performance: • Always use affinity • Choose the number of pthreads to avoid any over-subscription, and use SMT only if memory bandwidth (and floating-point intensity) permit • Minimize barriers, using point-to-point synchronization as much as possible (say, between a producer and a consumer, as in Gauss-Seidel) • Reduce cross-core communication (it’s much better to use the data produced on one core on the same core if/when possible) • Locks cause serialization of computation across threads

L.V.Kale 12 C++11 Atomics, Wait-free Synchronization, and Queues  Recall: why the following doesn’t work

Initially, x and Flag are both 0

Thread 0:                  Thread 1:
  x = 25;                    while (Flag == 0) ;
  Flag = 1;                  Print x;

What should get printed?

L.V.Kale 14 Sequential Consistency • This is a “desired property” of parallel programming systems • The effect of executing a program consisting of k threads should be the same as some arbitrary interleaving of statements executed by each thread, executed sequentially

Modern processors do not satisfy sequential consistency!

L.V.Kale 15 [Figure: processors PE0, PE1, …, PE(p-1) all access a single Memory through an Arbitrator, the abstract machine model that sequential consistency assumes]

L.V.Kale 16 Support for memory consistency models • OpenMP provided a flush primitive for dealing with this issue • Ensures variables are written out to memory and no reordering of instructions happens across the flush call • With Pthreads, in the past, you’d need to use processor-specific memory-fence operations • On Intel (e.g., mfence) • On PowerPC (e.g., sync / lwsync) • Load-linked/store-conditional, etc. • C++ recently standardized this • C++11 atomics in the C++11 Standard • Supports sequential consistency as well as specific relaxed memory orderings

L.V.Kale 17 C++11 atomics • http://en.cppreference.com/w/cpp/atomic/atomic • Basic: • Declare some (scalar) variables as atomic • This ensures accesses to those variables, among themselves, are sequentially consistent • If one thread writes to an atomic object while another thread reads from it, the behavior is well-defined (the memory model defines it) • #include <atomic> • Declarations: • std::atomic<int> atm_var; • std::atomic<int*> atm_ptr; • std::atomic<int> *atm_array;
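Connecting this back to the earlier x/Flag example: declaring Flag as an atomic makes the busy-wait well-defined and guarantees that Thread 1 sees x = 25. A small illustrative sketch (not code from the slides):

#include <atomic>
#include <thread>
#include <cstdio>

int x = 0;
std::atomic<int> Flag{0};      // accesses to Flag are sequentially consistent by default

void thread0() {
    x = 25;                    // ordinary write ...
    Flag.store(1);             // ... guaranteed visible to anyone who sees Flag == 1
}

void thread1() {
    while (Flag.load() == 0) ; // spin until the flag is set
    std::printf("%d\n", x);    // prints 25
}

int main() {
    std::thread t1(thread1), t0(thread0);
    t0.join(); t1.join();
    return 0;
}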

L.V.Kale 18 Atomic: C++11
• atomic: class template and specializations for bool, int, and pointer types
• atomic_store: atomically replaces the value of the atomic object with a non-atomic argument
• atomic_load: atomically obtains the value stored in an atomic object
• atomic_fetch_add: adds a non-atomic value to an atomic object and obtains the previous value of the atomic
• atomic_compare_exchange_strong: atomically compares the value of the atomic object with a non-atomic argument and performs an atomic exchange if equal or an atomic load if not

Source: https://en.cppreference.com/w/cpp/atomic

L.V.Kale 19 atomic_compare_exchange_strong(a, b, c)
[Figure: two cases. If a (value 0) does not equal the expected value b (5), the exchange fails and b is loaded with a’s current value, 0. If a equals b (both 0), a is atomically set to the desired value c (7).]
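A small sketch (not from the slides) of how compare_exchange_strong is typically used, here for a lock-free increment:

#include <atomic>

std::atomic<int> a{0};

void add_one() {
    int expected = a.load();
    // If a still equals 'expected', replace it with expected + 1; otherwise
    // 'expected' is reloaded with a's current value and we try again.
    while (!a.compare_exchange_strong(expected, expected + 1))
        ;   // retry until the exchange succeeds
}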

L.V.Kale 20 C++11 Atomics: Avoiding Locks and Serialization, and Queues with Atomics  Locks, Serialization, and Wait-Free Synchronization

• Locks are an established way of enforcing mutual exclusion • They also enforce aspects of sequential consistency: • Memory operations are not moved across lock or unlock calls by the compiler • The hardware is made to ensure all writes are completed at lock() or unlock() • But locks are expensive, because they cause serialization

L.V.Kale 23 Locks, Critical Sections, and Serialization • Suppose all threads are doing the following:

for i = 0 to N {
    dowork();
    lock(x);
    ... critical section ... ;
    unlock(x);
}

• The work in dowork() takes time tw • The time in the critical section is tc • The serialization cost becomes a problem as the number of threads increases, but it stays small as long as #threads < tw/tc

L.V.Kale 24 [Figure: per-thread timelines alternating work segments (tw) and critical-section segments (tc); as more threads share the lock, their tc segments must serialize, so threads increasingly wait at the lock]
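As a rough worked example (illustrative numbers, not from the slides): if dowork() takes tw = 10 µs and the critical section takes tc = 1 µs, then tw/tc = 10. With up to about 10 threads, each thread’s critical section can be hidden under the other threads’ work and the lock rarely blocks anyone; beyond that, the lock is busy essentially all the time and the critical sections serialize, so total execution time grows with the number of threads.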

L.V.Kale 25 Locks, Serialization, and Wait-Free Synchronization

• Locks are an established way of enforcing mutual exclusion • They also enforce aspects of sequential consistency: • Memory operations are not moved across lock or unlock calls by the compiler • The hardware is made to ensure all writes are completed at lock() or unlock() • But locks are expensive, because they cause serialization • Still, for most practical situations, locks are fast enough • Just use locks and avoid all the trouble, in practice • Unless you are in a fine-grained situation with many threads • I.e., the computation between consecutive calls to lock is very short • Then, consider a wait-free implementation with atomics

L.V.Kale 26 An Aside: “Lock-Free Algorithms” • In the early days of computer science, there were many research papers and textbook treatments of lock-free algorithms • Peterson’s, Dekker’s … • These algorithms all depended on sequential consistency, which processors of the day might have supported • That is no longer true, and so those algorithms are mostly not useful • They may occasionally provide inspiration for a wait-free algorithm

L.V.Kale 27 Example: Circular Fixed-Size Queues • We will look at efficient implementations of shared queues • Depending on sharing assumptions, one can make more efficient queues • General: multiple producers/consumers • Single producer / single consumer • Multiple producers, single consumer • Steal queues (popularized by Cilk)

L.V.Kale 28 Circular Queues: Implementation • Array of fixed size 2^n • Masking of indices

[Figure: a circular array with slots 0, 1, 2, 3, …, 1021, 1022, 1023; logical indices 1024, 1025, … wrap around to slots 0, 1, …]

Masking of Indices

Example: index 1025 is 10000000001 in binary; masking it (keeping the low 10 bits for a 1024-slot queue) gives 0000000001, i.e., slot 1
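The queue code on the following slides uses a helper mask(); a minimal sketch of what it might look like (the name mask and the constant capacity follow the slides, the body is an assumption based on the example above):

// capacity must be a power of 2 (here 1024, matching the figure)
const int capacity = 1024;

int mask(int index) {
    return index & (capacity - 1);   // keep only the low log2(capacity) bits
}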

L.V.Kale 29 Single Producer Single Consumer Queue • We will look at a fixed-size queue • Not allowed to grow beyond a fixed limit • The single producer accesses the tail • The single consumer accesses the head • No contention on head and tail • Count the number of elements in the queue: used to safeguard against the empty and full conditions on the queue • Three implementations • Lockless, thread-unsafe • Locking, thread-safe • Lockless, thread-safe

L.V.Kale 30 Single Producer Single Consumer: Lockless – Thread Unsafe

class SPSCQueue {
 private:
  T* arr;
  int count;
  int head, tail;

 public:
  bool enq(T &data) {
    int ret = 0;
    if (count < capacity) {
      arr[mask(tail++)] = data;
      count++;
      ret = 1;
    }
    return ret;
  }

  bool deq(T &out) {
    int ret = 0;
    if (count > 0) {
      count--;
      out = arr[mask(head++)];
      ret = 1;
    }
    return ret;
  }
};

L.V.Kale 31 Single Producer Single Consumer: Locking – Thread Safe

class SPSCQueue {
 private:
  T* arr;
  int count;
  int head, tail;
  mutex mtx;

 public:
  bool enq(T &data) {
    int ret = 0;
    mtx.lock();
    if (count < capacity) {
      arr[mask(tail++)] = data;
      count++;
      ret = 1;
    }
    mtx.unlock();
    return ret;
  }

  bool deq(T &out) {
    int ret = 0;
    mtx.lock();
    if (count > 0) {
      count--;
      out = arr[mask(head++)];
      ret = 1;
    }
    mtx.unlock();
    return ret;
  }
};

• Using the notion of 'count' to safeguard the array against the empty and full conditions

Single Producer Single Consumer: Lockless – Thread Safe

class SPSCQueue {
 private:
  array<atomic<T>, capacity> arr;
  atomic<int> count;
  int head, tail;

 public:
  bool enq(T &data) {
    if (count.load() < capacity) {
      arr[mask(tail++)].store(data);
      count.fetch_add(1);
      return 1;
    }
    return 0;
  }

  bool deq(T &out) {
    if (count.load() > 0) {
      out = arr[mask(head++)].load();
      count.fetch_add(-1);
      return 1;
    }
    return 0;
  }
};

• Make count atomic (it is accessed by both the producer and the consumer)

L.V.Kale 33 C++11 Atomics: Queues  Multiple Producer Single Consumer Queue • Assume a fixed-size queue whose capacity is a power of 2 • We will use the notion of an 'EMPTY' element • A specific value denotes empty in the queue (say –1) • A producer thread checks whether a position contains EMPTY before inserting into it • The consumer thread makes sure a position does not contain EMPTY before extracting a value from it • After extracting the value, it stores EMPTY in that position

L.V.Kale 35 Multiple Producers Single Consumer: Thread Unsafe

class MPSC_Queue {
 private:
  T* arr;
  int head;
  int tail;

 public:
  bool enq(T &data) {
    if (arr[mask(tail)] != EMPTY) return 0;
    else {
      arr[mask(tail++)] = data;
      return 1;
    }
  }

  bool deq(T &out) {
    if (arr[mask(head)] == EMPTY) return 0;
    else {
      out = arr[mask(head)];
      arr[mask(head++)] = EMPTY;
      return 1;
    }
  }
};

• There is a race condition on tail as a result of multiple producers • There is a race on each cell in arr as a producer and the consumer thread access it without synchronization • No race on head: only one thread accesses it

L.V.Kale 36 Multiple Producers Single Consumer: Locking – Thread Safe

class MPSC_Locking_Queue {
 private:
  T* arr;
  int head, tail;
  mutex mtx;

 public:
  bool enq(T &data) {
    int ret;
    mtx.lock();
    if (arr[mask(tail)] == EMPTY) {
      arr[mask(tail++)] = data;
      ret = 1;
    } else ret = 0;
    mtx.unlock();
    return ret;
  }

  bool deq(T &out) {
    int ret;
    mtx.lock();
    if (arr[mask(head)] == EMPTY) ret = 0;
    else {
      out = arr[mask(head)];
      arr[mask(head++)] = EMPTY;
      ret = 1;
    }
    mtx.unlock();
    return ret;
  }
};

• Once the mtx is acquired by a thread, no other thread can acquire it before mtx.unlock() • Other threads wait at the critical section till the lock is released • Locking and unlocking overheads are significant • Note: always release the lock before the return statement (ret helps with that)

Multiple Producers Single Consumer: Thread Unsafe (1.0)

class MPSC_Queue {
 private:
  T* arr;
  int head;
  int tail;

 public:
  bool enq(T &data) {
    if (arr[mask(tail)] == EMPTY) {
      arr[mask(tail++)] = data;
      return 1;
    }
    else return 0;
  }

  bool deq(T &out) {
    out = arr[mask(head)];
    if (out == EMPTY) return 0;
    else {
      arr[mask(head++)] = EMPTY;
      return 1;
    }
  }
};

• We will modify this lockless, thread-unsafe version into a lockless, thread-safe version in 2 steps

L.V.Kale 41 Multiple Producers Single Consumer: Step 1

class MPSC_Queue {
 private:
  T* arr;
  int head;
  atomic<int> tail;

 public:
  bool enq(T &data) {
    int mytail = tail.fetch_add(1);
    if (arr[mask(mytail)] == EMPTY) {
      arr[mask(mytail)] = data;
      return 1;
    } else {
      tail.fetch_add(-1);
      return 0;
    }
  }

  bool deq(T &out) {
    out = arr[mask(head)];
    if (out == EMPTY) return 0;
    else {
      arr[mask(head++)] = EMPTY;
      return 1;
    }
  }
};

• The previous version was vulnerable to conflicts between 2 producers • Access to the tail was not protected • Change the tail into an atomic variable • Modify the operations accordingly: atomic fetch_add replaces the post-increment operation

L.V.Kale 42 Multiple Producers Single Consumer: Step 2

class MPSC_Queue {
 private:
  array<atomic<T>, capacity> arr;
  int head;
  atomic<int> tail;

 public:
  bool enq(T &data) {
    int mytail = tail.fetch_add(1);
    if (arr[mask(mytail)].load() == EMPTY) {
      arr[mask(mytail)].store(data);
      return 1;
    } else {
      tail.fetch_add(-1);
      return 0;
    }
  }

  bool deq(T &out) {
    out = arr[mask(head)].load();
    if (out == EMPTY) return 0;
    else {
      arr[mask(head++)].store(EMPTY);
      return 1;
    }
  }
};

• The previous version did not prevent a race between a producer and a consumer • Make the underlying data structure of the queue an array of atomics • Modify the operations accordingly • Note: load() and store() calls are not compulsory on atomics; a simple assignment performs the same operation

Multiple Producers Single Consumer: Analysis (same code as in Step 2)
• Analysis:
  • 3 atomic operations per enq in the normal case
  • 3 atomic operations per enq if the queue is full
  • 2 atomic operations per deq in the normal case
  • 1 atomic operation per deq if the queue is empty

[Figure: animation of the queue with producers P1 and P2 and consumer C1; purple cells denote the "EMPTY" value, with Head and Tail markers]
• P1 atomically fetches and increments the value of the global tail
• C1 attempts a dequeue and finds the empty value (failure)
• P2 atomically fetches and increments the value of the global tail
• P2 adds 15 to the queue
• P1 adds 7 to the queue
• C1 (non-atomically) fetches and increments the value of the global head
• C1 attempts a dequeue again and finds 7 (success)
• C1 attempts a dequeue again and finds 15 (success)

Not done yet! There is a bug
• Head is pointing to x, and tail to x as well; the queue is full
• What if P1 comes in to enqueue and gets index x as the tail where it should insert?
  • Now tail is x+1 (P1: mytail is x)
  • Q[x] is not empty, so P1 goes to the else clause
  • but P1 has not executed the else clause yet
• In the meanwhile, P2 comes to enqueue and also increments tail
  • so it has x+1 as the place where it will insert (P2: mytail is x+1)
  • the global tail is x+2
• The consumer comes to dequeue twice
  • so head now becomes x+2, and Q[x] and Q[x+1] are both empty
• Next, P1 continues
  • decrements tail, so the global tail is now x+1
  • returns with failure (returns 0)
• Now we have a hole in the queue! No data at x, but P2's data is at x+1
• The consumer will never go past the hole, and P2's data will get overwritten by the next enqueue

bool enq(T &data) {
  int mytail = tail.fetch_add(1);
  if (arr[mask(mytail)].load() == EMPTY) {
    arr[mask(mytail)].store(data);
    return 1;
  } else {
    tail.fetch_add(-1);
    return 0;
  }
}

• Tail is 2, Head is 2, the queue is full
• P1 comes to enqueue
  • P1.mytail == 2
  • tail == 3
• P1 discovers arr[2] is not empty
  • but before it can execute the else ...
• P2 comes to enqueue (say, the value 90)
  • P2.mytail == 3
  • tail == 4
  • but before it executes the if ...
• Consumer C dequeues twice
  • dequeues 82 and 43; replaces them with empty values
• Now P2 executes the if, finds arr[3] empty, and enqueues its value there
• P1 executes the else, decrements tail, and leaves an empty value (a hole) at arr[2]!
[Figure: an 8-slot queue holding 76 12 82 43 25 31 69 52 at indices 0..7, with Head and Tail both at index 2; after the interleaving, 90 sits at index 3 and index 2 is an EMPTY hole]

The problem in a nutshell
• The problem was that P1 reserved a spot in the queue, but gave up and returned
• A process must enqueue in a spot it reserves, even if it has to wait …

L.V.Kale 49 Multiple Producers Single Consumer: Lockless – Corrected

class MPSC_Lockless_Queue {
 private:
  array<atomic<T>, capacity> arr;
  int head;
  atomic<int> tail;

 public:
  bool enq(T &data) {
    int mytail = tail.fetch_add(1);
    while (arr[mask(mytail)].load() != EMPTY)
      ;  // eventually the consumer will empty it
    arr[mask(mytail)].store(data);
    return 1;
  }

  bool deq(T &out) {
    out = arr[mask(head)].load();
    if (out == EMPTY) return 0;
    else {
      arr[mask(head++)].store(EMPTY);
      return 1;
    }
  }
};
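To see the corrected queue in action, here is an illustrative, self-contained driver (not from the slides; the element type int, the capacity, the EMPTY sentinel, and the test harness are assumptions, while the queue class mirrors the code above):

#include <atomic>
#include <array>
#include <thread>
#include <cstdio>

constexpr int capacity = 1024;              // must be a power of 2
constexpr int EMPTY = -1;                   // sentinel marking an empty slot
int mask(int i) { return i & (capacity - 1); }

class MPSC_Lockless_Queue {
  std::array<std::atomic<int>, capacity> arr;
  int head = 0;
  std::atomic<int> tail{0};
 public:
  MPSC_Lockless_Queue() { for (auto &a : arr) a.store(EMPTY); }
  bool enq(int data) {
    int mytail = tail.fetch_add(1);
    while (arr[mask(mytail)].load() != EMPTY) ;  // wait for the consumer to free the reserved slot
    arr[mask(mytail)].store(data);
    return true;
  }
  bool deq(int &out) {
    out = arr[mask(head)].load();
    if (out == EMPTY) return false;
    arr[mask(head++)].store(EMPTY);
    return true;
  }
};

int main() {
  MPSC_Lockless_Queue q;
  auto produce = [&q](int base) { for (int i = 0; i < 1000; i++) q.enq(base + i); };
  std::thread p1(produce, 0), p2(produce, 10000);   // two producers
  int got = 0, v;
  while (got < 2000)                                 // single consumer drains all items
    if (q.deq(v)) got++;
  p1.join(); p2.join();
  std::printf("consumed %d items\n", got);
  return 0;
}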

L.V.Kale 50 Wait-free? • Now, of course, our queue is not wait-free in a strict sense because of the while loop • You could • Argue that the queue size must be chosen such that the queue never becomes full • Say that it’s wait-free except when the queue is full • OR: rewrite the code so that the producer remembers the value it needs to enqueue in a private data structure (a queue?) and returns with failure; but a process that gets a failure on enqueue must call again to try to enqueue • Use an overflow queue protected by a lock?? • Does the FIFO ordering matter?

L.V.Kale 51 Cilk: A Task-Based Parallel Language  Cilk Language • Developed over 15+ years

• Two keywords added to C: spawn and sync • Popularized the idea of work stealing • But the idea is older: • MultiLisp (Halstead) • State-space search: Vipin Kumar • Formalized the idea of work stealing • Proofs on optimality / time and space complexity • Intel’s Cilk++: via Cilk Arts

cilk int fib (int n) {
  if (n < 2) return n;
  else {
    int x, y;
    x = spawn fib(n-1);
    y = spawn fib(n-2);
    sync;
    return (x+y);
  }
}

• Only 2 keywords added to C • Program semantics is the same as if you delete the 2 keywords (spawn and sync) from the program

Example from: http://supertech.lcs.mit.edu/cilk/intro.html

L.V.Kale 54 Possible Implementations • User-level threads and suspensions • Task scheduling (who does what when) • Centralized queue: good balance, scalability?, overhead (locks for queues) • 1 queue for every core, and assign work to one of them randomly • Load imbalance: some core may get too much • Extension: balance queues periodically • 1 queue for every core, keep work in your queue, initially • Periodic balancing • Alternative: an idle processor de-queues work from someone else’s queue: • This is called work stealing

L.V.Kale 55 Work Stealing • Every core (Pthread, hardware thread) has its own stack • Spawned “threads” (tasks) are pushed on the local stack (which is implemented as a double-ended queue: a deque) • Each processor’s scheduler picks tasks from the top of its own queue • This is called “bottom” in some Cilk papers (deque terminology can be ambiguous) • If its own queue is empty? • Steal from a random processor’s queue • (Alternatives could have been: round robin, globally next, .. explored in Vipin Kumar’s work)

L.V.Kale 56 Formalization

• Critical path: Tinfinity • I.e., execution time with an infinite number of processors

• Best possible execution time: T1 / P • P: number of processors

• T1: execution time on 1 processor • Good implementation of steal queues (later): • Local operations are cheap, steals are expensive • So, prove that the number of steals is asymptotically minimal

• Also, execution time is O(Tinfinity + T1/P) (examine the papers) • Actually, a tighter bound with an explicit constant on the Tinfinity term

L.V.Kale 57 Intel’s Cilk Plus • Somewhat different syntax for spawn and sync • Provides a cilk_for parallel loop • Implemented by divide-and-conquer task spawns • 0:N => 0:N/2 and N/2:N, and recurse • A small syntax sketch follows below
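For reference, a small sketch of how the earlier fib example is spelled in Cilk Plus, plus a cilk_for loop (illustrative; assumes a compiler with Cilk Plus support and the <cilk/cilk.h> header):

#include <cilk/cilk.h>

int fib(int n) {
    if (n < 2) return n;
    int x = cilk_spawn fib(n - 1);     // child may run in parallel (e.g., after being stolen)
    int y = fib(n - 2);                // parent continues with the other half
    cilk_sync;                         // wait for the spawned child before using x
    return x + y;
}

void scale(double *a, double s, int n) {
    cilk_for (int i = 0; i < n; i++)   // parallel loop, implemented by recursive splitting
        a[i] *= s;
}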

L.V.Kale 58 Steal Queues • “Agenda” parallelism (set of tasks) • The idea of work stealing • Cilk, and some previous systems

L.V.Kale 59 Steal Queues (deque) • Basic idea: • Local enqueue (push) should be cheap • Local de-queue (pop) should be fast in the common case, and only occasionally use synchronization • Steal (non-local pop) will always use synchronization (locks or atomics) • A simplified sketch follows below
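A highly simplified sketch of this idea (not Cilk’s actual implementation; it uses one mutex per deque for clarity, whereas real steal queues make the owner’s push/pop much cheaper than the steal path):

#include <deque>
#include <mutex>
#include <functional>
#include <cstdlib>

struct StealDeque {
    std::deque<std::function<void()>> tasks;
    std::mutex mtx;

    void push(std::function<void()> t) {          // owner pushes newly spawned tasks
        std::lock_guard<std::mutex> g(mtx);
        tasks.push_back(std::move(t));
    }
    bool pop(std::function<void()> &t) {          // owner pops from the same end (LIFO)
        std::lock_guard<std::mutex> g(mtx);
        if (tasks.empty()) return false;
        t = std::move(tasks.back()); tasks.pop_back();
        return true;
    }
    bool steal(std::function<void()> &t) {        // thief takes from the opposite end (FIFO)
        std::lock_guard<std::mutex> g(mtx);
        if (tasks.empty()) return false;
        t = std::move(tasks.front()); tasks.pop_front();
        return true;
    }
};

// Scheduler loop for worker 'me' out of 'nworkers' deques:
void worker_loop(StealDeque *deques, int me, int nworkers) {
    std::function<void()> t;
    for (;;) {
        if (deques[me].pop(t)) t();                         // run local work first
        else if (deques[rand() % nworkers].steal(t)) t();   // otherwise steal from a random victim
        // (a real runtime also needs an idle/termination check here)
    }
}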

L.V.Kale 60