
POSIX Threads (Pthreads) for Shared Address Space Programming

Pthreads
• POSIX threads package
• Available on almost all machines (a portable standard)
• Roughly like doing "parallel" (not "parallel for") in OpenMP, but explicitly
• Basic calls:
  • pthread_create: creates a thread to execute a given function
  • pthread_join
  • barrier, lock, mutex
• Thread-private variables
• Many online resources, e.g., https://computing.llnl.gov/tutorials/pthreads/
L.V.Kale 2

Pthreads: Create and Join
• Spawn an attached thread, and later wait for it:

    pthread_create(&thread1, NULL, foo, &arg);
    ...
    pthread_join(thread1, &status);

    void *foo(void *arg) {
      // Thread code
      return status;   // a void* value, delivered to the joiner
    }

• Detached threads:
  • Join is not needed; the OS destroys the thread's resources when it terminates
  • A parameter in the create call indicates a detached thread

Executing a Thread
[Diagram: the main program calls pthread_create(&thread1, NULL, func, &arg) and continues on the main stack; Thread1 executes void *func(void *arg) { ... return status; } on its own stack; pthread_join(thread1, status) waits for Thread1 to finish]

Basic Locks
• Declare a lock: pthread_mutex_t mutex;
• Initialize a mutex: pthread_mutex_init(&mutex, NULL); // use defaults
• Enter and release: pthread_mutex_lock(&mutex); and pthread_mutex_unlock(&mutex);
• Try a lock without blocking: pthread_mutex_trylock(&mutex);
  • Returns 0 if successful (i.e., the lock is acquired)
• Release resources: pthread_mutex_destroy(&mutex);

Hello World: Pthreads

    #include <stdlib.h>
    #include <stdio.h>
    #include <pthread.h>

    void *Hello(void *myRank) {
      long *id = (long *)myRank;
      printf("Hello from thread %ld\n", *id);
      return NULL;
    }

    int main(int argc, char **argv) {
      long threads = strtol(argv[1], NULL, 10);
      pthread_t *threadHandles = malloc(threads * sizeof(pthread_t));
      long *ids = (long *)malloc(sizeof(long) * threads);
      for (long t = 0; t < threads; t++) {
        ids[t] = t;
        pthread_create(&threadHandles[t], NULL, Hello, (void *)&(ids[t]));
      }
      printf("Hello from the main thread\n");
      for (long t = 0; t < threads; t++)
        pthread_join(threadHandles[t], NULL);
      free(threadHandles);
      free(ids);
    }

Threads and Resources
• Suppose you are running on a machine with K cores
• Each core may have 2 "hardware threads" (SMT, simultaneous multi-threading; often called hyperthreading)
• How many pthreads can you create?
  • Unlimited (well... the system may run out of resources such as memory)
  • Can be smaller or larger than K
  • In performance-oriented programs, it's rarely more than 2K (assuming 2-way SMT): we want to prevent the OS from swapping out our threads
• Which cores does each thread run on?
  • By default: any (i.e., the OS suspends each running thread every few ms and runs another thread)

Affinity
• Which cores does each thread run on?
  • By default: any (the OS suspends each running thread every few ms and runs another), even if you have fewer threads than hardware threads
• But that's bad for cache locality
  • Caches will be polluted by the work of other threads... you will do a "cold" start almost every time you get scheduled, every few ms
• Pthreads provide a way of "binding" threads to hardware resources for this purpose

Pthread Affinity
• Set-affinity (or pinning) assigns a thread to a set of hardware threads
• Can use topological info to pin to cores, sockets, NUMA domains, etc.
  • A library that provides such information is "hwloc"
• Example pattern of usage:

    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(PEnum, &cpuset);  // can be called multiple times
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);

OpenMP vs. Pthreads
• OpenMP is great for parallel loops
  • And for many simple situations with just "#pragma omp parallel" as well
• But when there is complicated synchronization and performance is important, pthreads are (currently) better
• However, pthreads are not available on all machines/OSs
  • Especially Windows

Performance-Oriented Programming in Pthreads
• Pthreads as used in OS programming don't need to be as performance-oriented as what we need in HPC
  • E.g., "synchronizing" every few microseconds, i.e., exchanging data or waiting for signals
• Improving performance:
  • Always use affinity
  • Choose the number of pthreads to avoid any over-subscription, and use SMT only if memory bandwidth (and floating-point intensity) permit
  • Minimize barriers, using point-to-point synchronization as much as possible (say, between a producer and a consumer, as in Gauss-Seidel)
  • Reduce cross-core communication (it's much better to use the data produced on one core on the same core, if/when possible)
  • Locks cause serialization of computation across threads

C++11 Atomics: Wait-free Synchronization and Queues
Recall: why the following doesn't work. Initially, x and Flag are both 0:

    Thread 0:        Thread 1:
    x = 25;          while (Flag == 0) ;
    Flag = 1;        print x;

What should get printed?

Sequential Consistency
• This is a "desired property" of parallel programming systems
• The effect of executing a program consisting of k threads should be the same as some arbitrary interleaving of the statements executed by each thread, executed sequentially
• Modern processors do not satisfy sequential consistency!

[Diagram: PE0, PE1, ..., PEp-1, each feeding memory requests through a single arbitrator to memory — the idealized sequentially consistent machine]

Support for Memory Consistency Models
• OpenMP provides a flush primitive for dealing with this issue
  • Ensures variables are written out to memory, and no reordering of instructions happens across the flush call
• With Pthreads, in the past, you'd need to use processor-specific memory-fence operations
  • On Intel, on PowerPC; load-linked/store-conditional, etc.
• C++ recently standardized this
  • C++11 atomics in the C++11 standard
  • Supports sequential consistency as well as specific relaxed consistency models

C++11 Atomics
• http://en.cppreference.com/w/cpp/atomic/atomic
• Basics:
  • Declare some (scalar) variables as atomic
  • This ensures that accesses to those variables, among themselves, are sequentially consistent
  • If one thread writes to an atomic object while another thread reads from it, the behavior is well-defined (the memory model defines it)
• #include <atomic>
• Declarations:
  • std::atomic<T> atm_var
  • std::atomic<T*> atm_ptr
  • std::atomic<T>* atm_array

Atomic: C++11
• atomic: class template, with specializations for bool, integer, and pointer types
• atomic_store: atomically replaces the value of the atomic object with a non-atomic argument
• atomic_load: atomically obtains the value stored in an atomic object
• atomic_fetch_add: adds a non-atomic value to an atomic object and obtains the previous value of the atomic
• atomic_compare_exchange_strong: atomically compares the value of the atomic object with a non-atomic argument, and performs an atomic exchange if they are equal or an atomic load if not
Source: https://en.cppreference.com/w/cpp/atomic

[Diagram: atomic_compare_exchange_strong(a, b, c) — when the atomic a equals the expected value *b, a is set to c; when they differ, *b is loaded with the current value of a]

C++11 Atomics: Avoiding Locks, Serialization, and Queues with Atomics

Locks, Serialization, and Wait-Free Synchronization
• Locks are an established way of enforcing mutual exclusion
• They also enforce aspects of sequential consistency:
  • Memory operations are not moved across lock or unlock calls by the compiler
  • Hardware is made to ensure all writes are completed at lock() or unlock()
• But locks are expensive, because they cause serialization

Locks, Critical Sections, and Serialization
• Suppose all threads are doing the following:

    for i = 0, N
      dowork();
      lock(x);
      critical...;
      unlock(x);

• The work in dowork() takes time tw; the time in the critical section is tc
• The serialization cost becomes a problem as the number of threads increases, but it can be small as long as #threads < tw/tc

[Figure: per-thread timelines of alternating work (tw) and critical-section (tc) segments; as more threads are added, the critical sections queue up behind one another and serialize execution]

Locks, Serialization, and Wait-Free Synchronization (continued)
• Still, for most practical situations, locks are fast enough
  • Just use locks and avoid all the trouble, in practice
• Unless you are in a fine-grained situation with many threads
  • I.e., the computation between consecutive calls to lock is very short
  • Then, consider a wait-free implementation with atomics

An Aside: "Lock-Free Algorithms"
• In the early days of computer science, there were many research papers and textbook materials on lock-free algorithms
  • Peterson's, Dekker's, ...
• These algorithms all depended on sequential consistency, which processors of the day might have supported
• That is no longer true, and so those algorithms are mostly not useful
  • They may occasionally provide inspiration