
Programming with Shared Memory: Part I

HPC Fall 2012 Prof. Robert van Engelen

Overview

- Shared memory machines
- Programming strategies for shared memory machines
- Allocating shared data for IPC
- Processes and threads
- MT-safety issues
- Coordinated access to shared data
  - Locks
  - Semaphores
  - Condition variables
  - Barriers
- Further reading

2/23/17 HPC Fall 2012

Shared Memory Machines

- Single address space
- Shared memory
  - Single bus (UMA)
  - Interconnect with memory banks (NUMA)
  - Cross-bar switch
- Distributed shared memory (DSM)
  - Logically shared, physically distributed

Shared memory UMA machine with a single bus

Shared memory NUMA machine with memory banks

Shared memory multiprocessor with cross-bar switch

DSM

Programming Strategies for Shared Memory Machines

- Use a specialized parallel programming language
  - For example: HPF, UPC
- Use compiler directives to supplement a sequential program with parallel directives
  - For example: OpenMP
- Use libraries
  - For example: ScaLAPACK (though ScaLAPACK is primarily designed for distributed memory)
- Use heavyweight processes and a shared memory API
- Use threads
- Use a parallelizing compiler to transform (part of) a sequential program into a parallel program

Heavyweight Processes

Fork-join of processes

- The fork() system call creates a new process
  - fork() returns 0 to the child process
  - fork() returns the process ID (pid) of the child to the parent
- The system call exit(n) joins the child process with the parent and passes exit value n to it
- The parent executes wait(&n) to wait until one of its children joins, where n is set to the exit value

Process tree

- System and user processes form a tree

Fork-Join

Process 1

  …
  pid = fork();
  if (pid == 0)
  {   … // code for child
      exit(0);
  }
  else
  {   … // parent code continues
      wait(&n); // join
  }
  … // parent code continues
  …

SPMD program

Fork-Join

Process 1 and Process 2 (both execute the same code after the fork):

  …
  pid = fork();
  if (pid == 0)
  {   … // code for child
      exit(0);
  }
  else
  {   … // parent code continues
      wait(&n); // join
  }
  … // parent code continues
  …

SPMD program: the child receives a copy of the program, data, and file descriptors (operations by the processes on open files are independent)


Creating Shared Data for IPC

- Interprocess communication (IPC) via shared data
- Processes do not automatically share data
- Use files to share data
  - Slow, but portable
- Unix System V shmget()
  - Allocates shared pages between two or more processes
- BSD Unix mmap()
  - Uses file-memory mapping to create shared data in memory
  - Based on the principle that files are shared between processes

  shmget()  returns the shared memory identifier for a given key (the key is for naming and locking)
  shmat()   attaches the segment identified by a shared memory identifier and returns the address of the memory segment
  shmctl()  deletes the segment with the IPC_RMID argument
  mmap()    returns the address of a mapped object described by the file descriptor returned by open()
  munmap()  deletes the mapping for a given address

shmget vs mmap

With shmget/shmat:

  #include <sys/types.h>
  #include <sys/ipc.h>
  #include <sys/shm.h>

  size_t len;        // size of data we want
  void *buf;         // to point to shared data
  int shmid;
  key_t key = 9876;  // or IPC_PRIVATE
  shmid = shmget(key, len, IPC_CREAT|0666);
  if (shmid == -1) … // error
  buf = shmat(shmid, NULL, 0);
  if (buf == (void*)-1) … // error
  …
  fork(); // parent and child use buf
  …
  wait(&n);
  shmctl(shmid, IPC_RMID, NULL);

With mmap:

  #include <sys/mman.h>

  size_t len;  // size of data we want
  void *buf;   // to point to shared data
  buf = mmap(NULL,
             len,
             PROT_READ|PROT_WRITE,
             MAP_SHARED|MAP_ANON,
             -1,  // fd=-1 is unnamed
             0);
  if (buf == MAP_FAILED) … // error
  …
  fork(); // parent and child use buf
  …
  wait(&n);
  munmap(buf, len);
  …

Tip: use ipcs command to display IPC shared memory status of a system

Threads

- Threads of control operate in the same memory space, sharing code and data
  - Data is implicitly shared
  - Consider data on a thread's stack private
- Many OS-specific thread APIs
  - Windows threads, Java threads, …
- POSIX-compliant Pthreads

Thread creation and join:

  pthread_create()  start a new thread
  pthread_join()    wait for child thread to join
  pthread_exit()    stop thread

Detached Threads

- Detached threads do not join
- Use pthread_detach(thread_id)
- Detached threads are more efficient
- Make sure that all detached threads terminate before the program terminates

Process vs Threads

Process Process with two threads

What happens when we fork a process that executes multiple threads? Does fork duplicate only the calling thread or all threads?

Thread Pools

- Thread pooling (or process pooling) is an efficient mechanism
- One master thread dispatches jobs to worker threads
- Worker threads in the pool never terminate and keep accepting new jobs when the old job is done
- Jobs are communicated to the workers via a shared job queue
- Best for irregular job loads

MT-Safety

- Routines must be multi-thread safe (MT-safe) when invoked by more than one thread
- Non-MT-safe routines must be placed in a critical section, e.g. using a mutex lock (see later)
- Many C libraries are not MT-safe
  - Use libroutine_r() versions that are "reentrant"
  - When building your own MT-safe library, use #define _REENTRANT
- Always make your routines MT-safe for reuse in a threaded application
- Use locks when necessary (see next slides)

Use of a non-MT-safe routine:

  time_t clk = clock();
  char *txt = ctime(&clk);
  printf("Current time: %s\n", txt);

Use of the reentrant version of ctime:

  time_t clk = clock();
  char txt[32];
  ctime_r(&clk, txt);
  printf("Current time: %s\n", txt);

Is this routine MT-safe? What can go wrong?

  static int counter = 0;
  int count_events()
  {
      return counter++;
  }

Coordinated Access to Shared Data (Such as Job Queues)

- Reading and writing shared data by more than one thread or process requires coordination with locking
- Shared variables cannot be updated simultaneously by more than one thread

Unsafe:

  static int counter = 0;
  int count_events()
  {
      return counter++;
  }

Safe, with a mutex lock around the update:

  static int counter = 0;
  int count_events()
  {
      int n;
      pthread_mutex_lock(&lock);
      n = counter++;
      pthread_mutex_unlock(&lock);
      return n;
  }

Without the lock, the two threads can interleave the read-modify-write and lose an update (both return 3 and counter ends at 4):

  Thread 1                   Thread 2
  reg1 = M[counter] = 3
  reg2 = reg1 + 1 = 4        reg1 = M[counter] = 3
  M[counter] = reg2 = 4      reg2 = reg1 + 1 = 4
  return reg1 = 3            M[counter] = reg2 = 4
                             return reg1 = 3

With the lock, the updates are serialized:

  Thread 1                   Thread 2
  acquire lock               acquire lock
  reg1 = M[counter] = 3      … wait …
  reg2 = reg1 + 1 = 4
  M[counter] = reg2 = 4
  release lock               reg1 = M[counter] = 4
                             reg2 = reg1 + 1 = 5
                             M[counter] = reg2 = 5
                             release lock

Spinlocks

- Spin locks use busy waiting until a condition is met
- Naïve implementations are almost always incorrect

  // initially lock = 0
  while (lock == 1)
      ;  // do nothing (acquire lock)
  lock = 1;
  … critical section …
  lock = 0;  // release lock

Two or more threads want to enter the critical section, what can go wrong?


  Thread 1                   Thread 2
  while (lock == 1)
      ; // do nothing
  lock = 1;
  … critical section …       while (lock == 1)
  lock = 0;                      ; // spins until lock = 0
                             lock = 1;
                             … critical section …
                             lock = 0;

This ordering works


  Thread 1                   Thread 2
  while (lock == 1)
      ; // do nothing        while (lock == 1)   // lock not set in time before the read
  lock = 1;                      ; // do nothing
  … critical section …       lock = 1;
  lock = 0;                  … critical section …
                             lock = 0;

This statement interleaving leads to failure: both threads end up executing the critical section!


The compiler can also break the naïve spinlock when it optimizes the code: the assignment lock = 1 can be moved by the compiler, and the second thread's apparently useless assignment can be removed entirely. Atomic operations such as atomic "test-and-set" instructions must be used (these instructions are not reordered or removed by the compiler).

Spinlocks

- Advantage of spinlocks is that the kernel is not involved
- Better performance when the acquisition waiting time is short
- Dangerous to use in a uniprocessor system, because of priority inversion
- No guarantee of fairness: a thread may wait indefinitely in the worst case, leading to starvation

  void spinlock_lock(spinlock *s)
  {
      while (TestAndSet(s))
          while (*s == 1)
              ;
  }

  void spinlock_unlock(spinlock *s)
  {
      *s = 0;
  }

Correct and efficient spinlock operations using atomic TestAndSet, assuming the hardware supports a cache coherence protocol

Note: TestAndSet(int *n) sets n to 1 and returns old value of n

Semaphores

- A semaphore is an integer-valued counter
- The counter is incremented and decremented by two operations, signal (or post) and wait, respectively
  - Traditionally called V and P (Dutch "verhogen" and "probeer te verlagen")
- When the counter would drop below 0, the wait operation blocks and waits until the counter > 0

  sem_post(sem_t *s)
  {
      (*s)++;
  }

  sem_wait(sem_t *s)
  {
      while (*s <= 0)
          ;  // do nothing
      (*s)--;
  }

Note: actual implementations of POSIX semaphores use atomic operations and a queue of waiting processes to ensure fairness

Semaphores

- A two-valued (= binary) semaphore provides a mechanism to implement mutual exclusion (mutex)
- POSIX semaphores are named and have permissions, allowing use across a set of processes
  - sem_open() takes a unique name, permissions, and an initial value

#include <semaphore.h>

  sem_t *mutex = sem_open("lock371", O_CREAT, 0600, 1);
  …
  sem_wait(mutex);  // sem_trywait() to poll state
  …
  … critical section …
  …
  sem_post(mutex);
  …
  sem_close(mutex);

Tip: use the ipcs command to display the IPC semaphore status of a system

Pthread Mutex Locks

- POSIX mutex locks for thread synchronization
  - Threads share mutex locks, processes do not (by default)
- Pthreads is available for Unix/Linux and has Windows ports

  pthread_mutex_t mylock;
  pthread_mutex_init(&mylock, NULL);
  …
  pthread_mutex_lock(&mylock);
  …
  … critical section …
  …
  pthread_mutex_unlock(&mylock);
  …
  pthread_mutex_destroy(&mylock);

  pthread_mutex_init()     initialize lock
  pthread_mutex_lock()     lock
  pthread_mutex_unlock()   unlock

  pthread_mutex_trylock()  check if lock can be acquired

Using Mutex Locks

- Locks are used to synchronize shared data access from any part of a program, not just the same routine executed by multiple threads
- Multiple locks should be used, each for a set of shared data items that is disjoint from another set of shared data items (no single lock for everything)

Lock operations on array A:

  pthread_mutex_lock(&array_A_lck);
  … A[i] = A[i] + 1 …
  pthread_mutex_unlock(&array_A_lck);

  pthread_mutex_lock(&array_A_lck);
  … A[i] = A[i] + B[i] …
  pthread_mutex_unlock(&array_A_lck);

Lock operations on a queue:

  pthread_mutex_lock(&queue_lck);
  … add element to shared queue …
  pthread_mutex_unlock(&queue_lck);

  pthread_mutex_lock(&queue_lck);
  … remove element from shared queue …
  pthread_mutex_unlock(&queue_lck);

What if threads may or may not update some of the same elements of an array: should we use a lock for every array element?

Condition Variables

- Condition variables are associated with mutex locks
- Provide signal and wait operations within critical sections

  Process 1                  Process 2
  lock(mutex)                lock(mutex)
  if (cannot continue)       …
      wait(mutex, event)     signal(mutex, event)
  …                          …
  unlock(mutex)              unlock(mutex)

Can’t use semaphore wait and signal here: what can go wrong when using semaphores?

Condition Variables

signal releases one waiting thread (if any)


wait blocks until a signal is received. When blocked, it releases the mutex lock, and reacquires the lock when the wait is over.

Producer-Consumer Example

- Producer adds items to a shared container, when not full
- Consumer picks an item from a shared container, when not empty

A producer:

  while (true)
  { lock(mutex)
    if (container is full)
        wait(mutex, notfull)
    add item to container
    signal(mutex, notempty)
    unlock(mutex)
  }

A consumer:

  while (true)
  { lock(mutex)
    if (container is empty)
        wait(mutex, notempty)
    get item from container
    signal(mutex, notfull)
    unlock(mutex)
  }

Condition variables notempty and notfull are associated with the mutex

Semaphores versus Condition Variables

- Semaphores:
  - Semaphores must have matching signal-wait pairs, that is, the semaphore counter must stay balanced
  - One too many waits: one waiting thread is indefinitely blocked
  - One too many signals: two threads may enter the critical section that is guarded by semaphore locks
- Condition variables:
  - A signal can be executed at any time
  - When there is no wait, signal does nothing
  - If there are multiple threads waiting, signal will release one
- Both provide:
  - Fairness: waiting threads will be released with equal probability
  - Absence of starvation: no thread will wait indefinitely

Pthreads Condition Variables

- Pthreads supports condition variables
- A condition variable is always used in combination with a lock, based on the principle of "monitors"

Declarations:

  pthread_mutex_t mutex;
  pthread_cond_t notempty, notfull;

Initialization:

  pthread_mutex_init(&mutex, NULL);
  pthread_cond_init(&notempty, NULL);
  pthread_cond_init(&notfull, NULL);

A producer:

  while (1)
  { pthread_mutex_lock(&mutex);
    if (container is full)
        pthread_cond_wait(&notfull, &mutex);
    add item to container
    pthread_cond_signal(&notempty);
    pthread_mutex_unlock(&mutex);
  }

A consumer:

  while (1)
  { pthread_mutex_lock(&mutex);
    if (container is empty)
        pthread_cond_wait(&notempty, &mutex);
    get item from container
    pthread_cond_signal(&notfull);
    pthread_mutex_unlock(&mutex);
  }

Note: pthread_cond_wait(&cond, &mutex) takes the condition variable first and the mutex second; pthread_cond_signal(&cond) takes only the condition variable.

Monitor with Condition Variables

- A monitor is a programming-language concept
- A monitor combines a set of shared variables and a set of routines that operate on the variables
- Only one process may be active in a monitor at a time
  - All routines are synchronized by implicit locks (like an entry queue)
  - Shared variables are safely modified under mutex
- Condition variables are used for signal and wait within the monitor routines
  - Like a wait queue

Only P1 executes a routine, P0 waits on a signal, and P2, P3 are in the entry queue to execute next when P1 is done (or moved to the wait queue)

Barriers

- A barrier synchronization statement in a program blocks processes until all processes have arrived at the barrier
- Frequently used in data-parallel programming (implicit or explicit) and an essential part of BSP (bulk synchronous parallelism)

Each process produces part of shared data X barrier Processes use shared data X

Two-Phase Barrier with Semaphores for P Processes

  sem_t *mutex      = sem_open("mutex-492", O_CREAT, 0600, 1);
  sem_t *turnstile1 = sem_open("ts1-492", O_CREAT, 0600, 0);
  sem_t *turnstile2 = sem_open("ts2-492", O_CREAT, 0600, 1);
  int count = 0;
  …
  sem_wait(mutex);
  if (++count == P)          // rendezvous
  {
      sem_wait(turnstile2);
      sem_post(turnstile1);
  }
  sem_post(mutex);
  sem_wait(turnstile1);      // barrier sequence
  sem_post(turnstile1);
  sem_wait(mutex);
  if (--count == 0)          // critical point
  {
      sem_wait(turnstile1);
      sem_post(turnstile2);
  }
  sem_post(mutex);
  sem_wait(turnstile2);
  sem_post(turnstile2);

Pthread Barriers

- Barrier using POSIX pthreads (advanced realtime threads)
- Specify the number of threads involved in barrier syncs in the initialization

  pthread_barrier_t barrier;
  pthread_barrier_init(
      &barrier,
      NULL,     // attributes
      count);   // number of threads
  …
  pthread_barrier_wait(&barrier);
  …

  pthread_barrier_init()  initialize barrier with thread count
  pthread_barrier_wait()  barrier synchronization

Further Reading

- [PP2] pages 230-247
- [HPC] pages 191-218
- Optional:
  - [HPC] pages 219-240
  - Pthread manuals (many online)
  - "The Little Book of Semaphores" by Allen Downey, http://www.greenteapress.com/semaphores/downey05semaphores.pdf
