
3. Asynchronous Parallelism

Setup of an MIMD System

General Model: several processing elements (PE), each with local memory (Mem), connected via a system bus to shared memory, console and network.
The program is segmented into autonomous processes.
Ideal allocation: 1 process : 1 processor. But usually: n processes : 1 processor ⇒ scheduler required.

Sample MIMD Systems

Sequent Symmetry
• 10 CPUs (32-bit Intel 80386)
• 40 MB memory, addressable by all PEs
• Multibus adapter, drive controller, SCED board; Multibus, Ethernet, SCSI bus
[Figure: system layout of the Sequent Symmetry]

Sample MIMD Systems

Intel iPSC Hypercube
• iPSC = Intel "Personal SuperComputer"
• Predecessor "Cosmic Cube" (CalTech, Pasadena)
• Generations: iPSC/1 (80286), iPSC/2 (80386), iPSC/860 (80860)
• max. 128 processors
• Hypercube connection network

Cray "Red Storm" – in cooperation with Sandia National Labs
• Standard AMD Opteron processors (originally 2 per node, in future only 1)
• max. 10,000 nodes
• max. 40 TFLOPS
• at US$ 90 million
• Different node types: compute node, I/O node
• 3D grid network (6 neighbors)

Intel Paragon
• Predecessor "Touchstone Delta"
• max. 512 nodes with 2 × i860 XP processors each
• In each node: 1 processor for arithmetic + 1 processor for data exchange
• Different node types (dynamically configurable): compute node, service node, I/O node
• Grid network (4 neighbors)

[Image: Cray]

3.1 Process Model

Process States

Process states for an individual processor:

[State diagram: New –OSReady→ ready –start of time slice→ executing; executing –OSReschedule / end of time slice→ ready; executing –P(sema)→ blocked; blocked –V(sema)→ ready; executing –OSKill / terminate→ Done]

The states "ready" and "blocked" are managed as queues.
Execution of processes is done in time-sharing mode (time slices).


Process States

Process states for MIMD with shared memory:

[Diagram: states New, ready, executing (on PE1, PE2, PE3), blocked and Done, with transitions add, assign, resign, retire]

The actual allocation, i.e. which process executes on which processor, is often transparent to the programmer. A process can be executed by different processors in sequential time slices.

Process states for MIMD without shared memory:

[Diagram: each processor (Processor 1–4) has its own ready, executing and blocked states]

Threads

"Lightweight" processes
Idea: The process concept remains, but with less overhead.
Previous costs: process switching, especially due to loading/saving of data.
Saving: no loading/saving of program data on context switching.
Prerequisite: one program with multiple processes on a system with shared memory (sequential computer or MIMD).
Implementation: a user program with multiple processes always holds all of its data
• no loading/saving required
• fast execution times
User view: like processes, only faster.

3.2 Synchronization and Communication

Parallel processing creates the following problems:
1. Data exchange between two processes
2. Simultaneous access to a data area by multiple processes must be prevented
[Figure: P1 sends a message to P2]


Synchronization and Communication

Multiple processes are executed in parallel. These need to be synchronized.
[Figure: railway example of this problem]

Software Solution (by Peterson, Silberschatz)

One possibility for this is to use synchronization variables:

var turn: 1..2;
Initialization: turn:=1;

…
start(P1); start(P2);
…

P1:
loop
  while turn≠1 do (*nothing*) end;
  < critical section >
  turn:=2;
end

P2:
loop
  while turn≠2 do (*nothing*) end;
  < critical section >
  turn:=1;
end

Software Solution

• This solution guarantees that only 1 process can enter the critical section.
• But there is one major disadvantage: alternating access is enforced.
⇒ RESTRICTION !!

Software Solution – 1st Improvement

var flag: array [1..2] of BOOLEAN;
Initialization: flag[1]:=false; flag[2]:=false;

P1:
loop
  while flag[2] do (*nothing*) end;
  flag[1]:=true;
  < critical section >
  flag[1]:=false;
end

P2:
loop
  while flag[1] do (*nothing*) end;
  flag[2]:=true;
  < critical section >
  flag[2]:=false;
end


Software Solution – 1st Improvement

It is not that easy though:
Should both processes pass their while-loops simultaneously (despite the safety check), then both will enter the critical section.
⇒ INCORRECT !!

Software Solution – 2nd Improvement

var flag: array [1..2] of BOOLEAN;
Initialization: flag[1]:=false; flag[2]:=false;

P1:
loop
  flag[1]:=true;
  while flag[2] do (*nothing*) end;
  < critical section >
  flag[1]:=false;
  < other instructions >
end

P2:
loop
  flag[2]:=true;
  while flag[1] do (*nothing*) end;
  < critical section >
  flag[2]:=false;
  < other instructions >
end

Software Solution – 2nd Improvement

If the two lines "flag[i]:=true" are moved before the while-loop, the error of improvement 1 will not occur, but now we can have a deadlock instead.
⇒ INCORRECT !!

Software Solution – 3rd Improvement

var turn: 1..2;
    flag: array [1..2] of BOOLEAN;
Initialization: turn:=1; (* arbitrary *)
                flag[1]:=false; flag[2]:=false;

P1:
loop
  flag[1]:=true;
  turn:=2;
  while flag[2] and (turn=2) do (*nothing*) end;
  < critical section >
  flag[1]:=false;
end

P2:
loop
  flag[2]:=true;
  turn:=1;
  while flag[1] and (turn=1) do (*nothing*) end;
  < critical section >
  flag[2]:=false;
end
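Transcribed into C, the 3rd improvement is Peterson's algorithm. A minimal sketch for two threads, assuming C11 atomics (sequentially consistent by default) in place of the slides' plain shared variables; the loop count and the shared counter are only illustrative:

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_bool flag[2];              /* flag[i]: process i wants to enter  */
atomic_int  turn;                 /* whose turn it is to wait           */
long counter = 0;                 /* shared data protected by the lock  */

void *worker(void *arg)
{ int i = (int)(long)arg, other = 1 - i;
  for (int k = 0; k < 100000; k++)
  { atomic_store(&flag[i], true);        /* announce interest            */
    atomic_store(&turn, other);          /* give priority to the other   */
    while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
      ;                                  /* busy wait                    */
    counter++;                           /* critical section             */
    atomic_store(&flag[i], false);       /* leave critical section       */
  }
  return NULL;
}

int main(void)
{ pthread_t t[2];
  for (long i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, (void *)i);
  for (int i = 0; i < 2; i++)  pthread_join(t[i], NULL);
  printf("counter = %ld (expected 200000)\n", counter);
  return 0;
}

Without the atomics (or with plain variables and an optimizing compiler), the same logic is no longer guaranteed to work on modern hardware.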

Software Solution – 3rd Improvement

⇒ CORRECT !!
• Expandable to n processes
• Also see Dekker's algorithm
• Disadvantage of the software solution: busy-wait, i.e. significant efficiency loss if each process does not have its own processor!

Hardware Solution

Test-and-Set operation (standard in most processors)

function test_and_set (var target: BOOLEAN): BOOLEAN;
begin_atomic
  test_and_set := target;
  target := true;
end_atomic.

Implementation as an atomic operation (indivisible, uninterruptible: "1 instruction of the CPU").
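A sketch of how the same idea looks with C11's atomic_flag; the function names lock/unlock and the spin loop are illustrative, not part of the original slides:

#include <stdatomic.h>

atomic_flag lock_var = ATOMIC_FLAG_INIT;     /* false = lock is free            */

void lock(void)
{ /* spin until the previous value was false, i.e. we obtained the lock */
  while (atomic_flag_test_and_set(&lock_var))
    ;                                        /* busy wait                       */
}

void unlock(void)
{ atomic_flag_clear(&lock_var);              /* lock := false                   */
}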

Hardware Solution

New solution using the Test-and-Set operation:

var lock: BOOLEAN;
Initialization: lock:=false;

Process Pi:
loop
  while test_and_set(lock) do (*nothing*) end;
  < critical section >
  lock:=false;
end;

• Removal of the busy-wait via queues (see semaphores).

3.3 Semaphores

Dijkstra, 1965 (similar to a signal post for trains)

Usage in process Pi:
  …
  P(sema);
  < critical section >
  V(sema);
  …


Semaphore

type Semaphore = record
       value: INTEGER;
       L: list of Proc_ID;
     end;

Initialization: L = empty list;
value = number of allowed P-operations without a V-operation.
P, V are atomic operations (indivisible).

Generic Semaphore → value: INTEGER
Boolean Semaphore → value: BOOLEAN

Implementation:

procedure P (var S: Semaphore);
begin
  S.value := S.value - 1;
  if S.value < 0 then
    append(S.L, actproc);   (* append this process to S.L *)
    block(actproc)          (* and move to state "blocked" *)
  end
end P;

procedure V (var S: Semaphore);
var Pnew: Proc_ID;
begin
  S.value := S.value + 1;
  if S.value ≤ 0 then
    getfirst(S.L, Pnew);    (* remove first process from S.L *)
    ready(Pnew)             (* and change state to "ready" *)
  end
end V;

Semaphore

How do we achieve that P and V are atomic operations?
• Single-processor system: disable all interrupts
• Multi-processor system:
  – Software solution: busy-wait with a short P- or V-operation as critical section.
  – Hardware solution: short busy-wait with a Test-and-Set instruction before the start of a P- or V-operation.
Attention: Convoy phenomenon

Producer-Consumer Problem

Using Boolean semaphores

Declaration and Initialization:
var empty: semaphore [1];
    full : semaphore [0];

process Producer;
begin
  loop
    P (empty);
    (* write data *)
    V (full);
  end;
end process Producer.

process Consumer;
begin
  loop
    P (full);
    (* read data *)
    V (empty);
  end;
end process Consumer.


Producer-Consumer Problem

Corresponding Petri net: [figure with transitions "Data creation", "write", "read", "Data consumption" and places "empty", "full" between Producer and Consumer]

Bounded Buffer Problem

var critical: semaphore[1];
    free    : semaphore[n];   (* there are n buffer spaces *)
    used    : semaphore[0];

process Producer;
begin
  loop
    P(free);
    P(critical);
    (* write data *)
    V(critical);
    V(used);
  end;
end process Producer;

process Consumer;
begin
  loop
    P(used);
    P(critical);
    (* read data *)
    V(critical);
    V(free);
  end;
end process Consumer;

Bounded Buffer Problem

Corresponding Petri net: [figure: places "free" (n), "critical" (1), "used" (0) between the transitions "Data creation", "write", "read", "Data consumption" of Producer and Consumer]

Readers-Writers Problem

(* One writer or many readers *)
var count: semaphore[1];
    r_w  : semaphore[1];
    readcount: INTEGER;
Initialization: readcount := 0;

process Reader;
begin
  loop
    P(count);
    if readcount=0 then P(r_w) end;
    readcount := readcount + 1;
    V(count);
    (* read data *)
    P(count);
    readcount := readcount - 1;
    if readcount=0 then V(r_w) end;
    V(count);
  end; (* loop *)
end process Reader;

process Writer;
begin
  loop
    P(r_w);
    (* write data *)
    V(r_w);
  end; (* loop *)
end process Writer;
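A minimal pthreads sketch of the same readers-writers protocol; the names count_mtx, rw_sem, shared_data and the single reader/writer pair in main are illustrative assumptions (a POSIX semaphore stands in for r_w so that the last reader may release it):

#include <pthread.h>
#include <semaphore.h>

pthread_mutex_t count_mtx = PTHREAD_MUTEX_INITIALIZER;   /* protects readcount            */
sem_t rw_sem;                                             /* "one writer or many readers"  */
int readcount = 0;
int shared_data = 0;

void *reader(void *arg)
{ pthread_mutex_lock(&count_mtx);                 /* P(count)                        */
  if (readcount++ == 0) sem_wait(&rw_sem);        /* first reader locks out writers  */
  pthread_mutex_unlock(&count_mtx);               /* V(count)                        */

  int value = shared_data; (void)value;           /* read data                       */

  pthread_mutex_lock(&count_mtx);
  if (--readcount == 0) sem_post(&rw_sem);        /* last reader lets writers in     */
  pthread_mutex_unlock(&count_mtx);
  return NULL;
}

void *writer(void *arg)
{ sem_wait(&rw_sem);                              /* P(r_w)                          */
  shared_data++;                                  /* write data                      */
  sem_post(&rw_sem);                              /* V(r_w)                          */
  return NULL;
}

int main(void)
{ pthread_t r, w;
  sem_init(&rw_sem, 0, 1);
  pthread_create(&r, NULL, reader, NULL);
  pthread_create(&w, NULL, writer, NULL);
  pthread_join(r, NULL); pthread_join(w, NULL);
  sem_destroy(&rw_sem);
  return 0;
}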

Readers-Writers Problem

Corresponding Petri net: [figure: transitions P(count), V(count), Data write, Data read; places readcount, count, r_w; Reader and Writer processes]

Thread Example 1

#include <pthread.h>
#include <stdio.h>
#define repeats 3
#define threadnum 5

void *slave(void *arg)
{ int id = (int) arg ;
  for (int i = 0; i < repeats; i++)
  { printf("thread %d\n", id);
    sched_yield();
  }
  return( 0 ) ;
}

int main()
{ pthread_t thread;
  for(int i = 0; i < threadnum ; i++)
  { if (pthread_create(&thread, NULL, slave, (void *) i))
      printf("Error: thread creation failed\n");
  }
  pthread_exit( 0 );
}

Sample output (threads take turns after each sched_yield):
thread 0, thread 1, thread 2, thread 4, thread 3, thread 0, thread 2, thread 4, thread 1, thread 3, thread 0, thread 2, thread 4, thread 1, thread 3

Thread Example 2

Identical to Example 1, but without the sched_yield() call in the slave loop.

Sample output (each thread now completes all its repeats in a row):
thread 0, thread 0, thread 0, thread 2, thread 2, thread 2, thread 1, thread 1, thread 1, thread 3, thread 3, thread 4, thread 3, thread 4, thread 4

Thread Example 3

#include <pthread.h>
#include <stdio.h>
#define repeats 3
#define threadnum 5

void *slave(void *arg)
{ int id = (int) arg ;
  for (int i = 0; i < repeats; i++)
  { printf("%d-A\n", id);
    printf("%d-B\n", id);
    printf("%d-C\n", id);
  }
  return( 0 ) ;
}

int main()
{ pthread_t thread;
  for(int i = 0; i < threadnum ; i++)
  { if (pthread_create(&thread, NULL, slave, (void *) i))
      printf("Error: thread creation failed\n");
  }
  pthread_exit( 0 );
}

Sample output (lines from different threads interleave, e.g.):
0-A, 0-B, 0-C, 0-A, 0-B, 2-A, 1-A, 0-C, 3-A, 2-B, 4-A, 1-B, 3-B, 2-C, 4-B, 1-C, 3-C, …

Thread Example 4

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define repeats 3
#define threadnum 5

pthread_mutex_t mutex;

void *slave(void *arg)
{ int id = (int) arg ;
  for (int i = 0; i < repeats; i++)
  { pthread_mutex_lock(&mutex);
    printf("%d-A\n", id);
    printf("%d-B\n", id);
    printf("%d-C\n", id);
    pthread_mutex_unlock(&mutex);
  }
  return( 0 ) ;
}

int main()
{ pthread_t thread;
  if (pthread_mutex_init(&mutex, NULL))
  { printf("Error: mutex failed\n"); exit(1); }
  for(int i = 0; i < threadnum ; i++)
  { if (pthread_create(&thread, NULL, slave, (void *) i))
    { printf("Error: thread creation failed\n"); exit(1); }
  }
  pthread_exit( 0 );
}

Sample output (each thread's A-B-C group now stays together):
0-A, 0-B, 0-C, 0-A, 0-B, 0-C, 0-A, 0-B, 0-C, 2-A, 2-B, 2-C, 4-A, 4-B, 4-C, 1-A, 1-B, 1-C, 3-A, 3-B, 3-C, …

3.4 Monitor

By Hoare 1974, Brinch Hansen 1975

• Abstract data type
• A monitor encapsulates the data area to be protected and the corresponding data access mechanisms (Entries, Conditions).

Usage:
P1: ... Buffer.DataWrite(x) ...        P2: ... Buffer.DataRead(x) ...

Monitor calls are mutually exclusive, they are automatically synchronized.

Monitor

Application Example: Buffer management (a stack with n entries and a stack pointer)

Monitor for Buffer Management

monitor Buffer;
var Stack: array [1..n] of Dataset;
    Pointer: 0..n;
    free, used: CONDITION;

entry WriteData (a: Dataset);
begin
  while Pointer=n do (* Stack full *)
    wait(free)
  end;
  inc(Pointer);
  Stack[Pointer]:=a;
  if Pointer=1 then signal(used) end;
end WriteData;

entry ReadData (var a: Dataset);
begin
  while Pointer=0 do (* Stack empty *)
    wait(used)
  end;
  a:=Stack[Pointer];
  dec(Pointer);
  if Pointer=n-1 then signal(free) end;
end ReadData;

begin (* Monitor-Initialization *)
  Pointer:=0
end monitor Buffer;


Monitor Conditions

Conditions are queues, similar to those of semaphores.
• wait (Cond)
  The executing process blocks itself and waits until another process executes a signal-operation on the condition Cond.
• signal (Cond)
  All processes waiting in the condition Cond are reactivated and will again apply for access to the monitor. (Another variant only releases one process: the next one in the queue.)
• status (Cond)
  This function returns the number of processes waiting for entry into this condition.

Monitor Implementation

Monitor Implementation Steps:
1) var MSema: semaphore[1]; (* is required for every monitor *)
2) Rewriting of entries to procedures:
   procedure xyz(...)
   begin
     P(MSema);
     ...
     V(MSema);
   end xyz;
3) Rewrite of the monitor initialization into a procedure and a corresponding call from the main program.

4) procedure wait(Cond: condition; MSema: semaphore);
   begin
     append(Cond, actproc);  (* insert ProcID in condition queue *)
     block(actproc);         (* Dispatcher: insert ProcID in blocked list *)
     V(MSema);               (* release monitor semaphore *)
     assign;                 (* Dispatcher: next ready process *)
   end wait;

5) procedure status(Cond: condition): CARDINAL;
   begin
     return length(Cond);
   end status;

6a) In this implementation only one process is released (wait in if-clause):

   procedure signal(Cond, MSema);
   var NewProc: Proc_ID;
   begin
     if not empty(Cond) then
       GetFirst(Cond, NewProc);
       P-operation for NewProc
     end
   end signal;

   [Figure: condition queue Cond and semaphore queue MSema before and after signal (afterwards Cond = NIL)]

Monitor Implementation

6b) In this implementation all waiting processes are released (wait in while-loop):

procedure signal(var Cond: condition; var MSema: semaphore);
begin (* Status-check not needed here *)
  append(MSema.L, Cond);                      (* append list *)
  MSema.Value := MSema.Value - status(Cond);
  Cond := nil;
end signal;

[Figure: the condition queue Cond is appended to the semaphore queue MSema, then Cond is empty]

3.5 Message Passing

• For distributed systems (no shared memory) this is the only method of communication
• Also usable for systems with shared memory
• Easy to use, but computation-time expensive (overhead)
⇒ Implementation with implicit communication: Remote Procedure Call

Message Passing Example

[Figure: Process 1, Process 2, …, Process n, each with a buffer area for messages (A = Tasks, R = Replies)]

Client process PC:
  ...
  Send_A(to_Server, Task);
  ...
  Receive_R(from_Server, Result);

Server process PS:
  loop
    Receive_A(Client, Task);
    ...
    Send_R(Client, Result);
  end;

Operations:
  Send_A (Receiver, Message)
  Receive_A (var Sender; var Message)
  Send_R (Receiver, Message)
  Receive_R (var Sender; var Message)
Send/receive of jobs, and send/receive of replies.
Each process can receive as many jobs without processing them as fit into its buffer.

Implementation:
• In parallel systems with shared memory: Monitor
• In parallel systems without shared memory: decentralized network management with message protocols

Message Passing Example

Schematic for systems with shared memory (use of a global message pool):
[Figure: processes access the global Pool via Send_A, Receive_A, Send_R, Receive_R]

type PoolElem = record
       free: BOOLEAN;
       from: 1..number of Procs;
       info: Message;
     end;
     Queue = record
       contents: 1..max;
       next: pointer to Queue
     end;

monitor Messages;
var Pool: array [1..max] of PoolElem;                       (* global message pool *)
    pfree: CONDITION;                                       (* queue, in case pool is completely full *)
    Afull, Rfull: array [1..number of Procs] of CONDITION;  (* queue for each process for incoming messages *)
    queueA, queueR: array [1..number of Procs] of Queue;    (* local message queues for each process *)

Message Passing Example

entry Send_A (to: 1..number of Procs; a: Message);
var id: 1..max;
begin
  while not GetFreeElem(id) do wait(pfree) end;
  with pool[id] do
    free := false;
    from := actproc;
    info := a;
  end;
  append(queueA[to], id);      (* insert place number in task queue *)
  signal(Afull[to]);
end Send_A;

entry Receive_A (var von: 1..number of Procs; var a: Message);
var id: 1..max;
begin
  while empty(queueA[actproc]) do wait(Afull[actproc]) end;
  id := head(queueA[actproc]);
  von := pool[id].from;
  a := pool[id].info;          (* pool[id] not yet freed *)
end Receive_A;

entry Send_R (nach: 1..number of Procs; ergebnis: Message);
var id: 1..max;
begin
  id := head(queueA[actproc]);
  tail(queueA[actproc]);       (* remove first element (head) of Queue *)
  pool[id].from := actproc;
  pool[id].info := ergebnis;
  append(queueR[nach], id);    (* insert place number in reply queue *)
  signal(Rfull[nach])
end Send_R;

entry Receive_R (var von: 1..number of Procs; var erg: Message);
var id: 1..max;
begin
  while empty(queueR[actproc]) do wait(Rfull[actproc]) end;
  id := head(queueR[actproc]);
  tail(queueR[actproc]);       (* remove first element (head) of Queue *)
  with pool[id] do
    von := from;
    erg := info;
    free := true;              (* release of pool element *)
  end;
  signal(pfree);               (* a free pool element exists *)
end Receive_R;

3.6 Problems with Asynchronous Parallelism

Time-dependent errors
• Not reproducible!
• Cannot be found by systematic testing!

A. Inconsistent Data
   A set of data (or a relationship between data) does not have the value it would have received if the operations had been processed sequentially.
B. Blockings
   Deadlock, Livelock
C. Inefficiencies
   Load balancing, …

Inconsistent Data

Problem A.1: Lost Update

Before: Income[Miller] = 1000

P1:
…
x := Income[Miller];
x := x+50;
Income[Miller] := x;
…

P2:
…
y := Income[Miller];
y := 1.1*y;
Income[Miller] := y;
…

After: Income[Miller] = ?  Depending on the sequence of memory accesses: 1050, 1100, 1150 or 1155.
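The lost update can be reproduced (unreliably, since it is timing dependent) with two pthreads; the function names p1/p2 and the plain global variable are illustrative assumptions:

#include <pthread.h>
#include <stdio.h>

long income = 1000;                     /* Income[Miller] */

void *p1(void *arg)                     /* x := income; x := x+50; income := x   */
{ long x = income; x = x + 50; income = x; return NULL; }

void *p2(void *arg)                     /* y := income; y := 1.1*y; income := y  */
{ long y = income; y = (long)(1.1 * y); income = y; return NULL; }

int main(void)
{ pthread_t t1, t2;
  pthread_create(&t1, NULL, p1, NULL);
  pthread_create(&t2, NULL, p2, NULL);
  pthread_join(t1, NULL); pthread_join(t2, NULL);
  printf("income = %ld\n", income);     /* 1050, 1100, 1150 or 1155 depending on interleaving */
  return 0;
}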

Inconsistent Data

Problem A.2: Inconsistent Analysis

Before: account[A] = 2000, account[B] = 1000, Sum(A,B) = 3000

P1:
…
x := account[A];
x := x-400;
account[A] := x;
x := account[B];
x := x+400;
account[B] := x;
…

P2:
…
a := account[A];
b := account[B];
sum := a+b;
…

After: account[A] = 1600, account[B] = 1400, Sum(A,B) = 3000 (same as before!)
But the sum computed by P2: sum = ?  Depending on the sequence of memory accesses: 2600, 3000 or 3400.

Inconsistent Data

Problem A.3: Uncommitted Dependencies
In databases and transaction systems: transactions are atomic operations (commit / rollback).

Blockings

A group of processes is waiting for the occurrence of a condition which can only be generated by that group itself (alternating dependency).

Sample occurrences:
• All processes are blocked in semaphore or queue conditions (deadlock)
• All processes are caught in mutual busy-wait or polling statements (livelock)

Blockings

Problem B.1: Deadlock

Two processes require terminal (TE) and printer (PR) resources for computing:

P1:          P2:
P (TE);      P (PR);
P (PR);      P (TE);
...          ...
V (PR);      V (TE);
V (TE);      V (PR);

The following conditions are required for a deadlock to occur [Coffman, Elphick, Shoshani 71]:
1. Resources can only be used exclusively.
2. Processes possess resources while requesting new ones.
   (may be broken by demanding that all required resources must be requested at the same time)
3. Resources cannot be forcibly removed from processes.
   (may be broken by forced removal of resources, e.g. resolving existing deadlocks)
4. There is a circular chain of processes such that each process possesses the resources requested by the next one in the chain.
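A minimal pthreads sketch of this deadlock; the mutex names terminal/printer stand in for the TE and PR semaphores and are illustrative. Run repeatedly, the two threads eventually block each other forever:

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t terminal = PTHREAD_MUTEX_INITIALIZER;   /* TE */
pthread_mutex_t printer  = PTHREAD_MUTEX_INITIALIZER;   /* PR */

void *proc1(void *arg)
{ pthread_mutex_lock(&terminal);      /* P(TE)                                   */
  pthread_mutex_lock(&printer);       /* P(PR) -- waits forever if proc2 holds PR */
  /* use terminal and printer */
  pthread_mutex_unlock(&printer);     /* V(PR)                                   */
  pthread_mutex_unlock(&terminal);    /* V(TE)                                   */
  return NULL;
}

void *proc2(void *arg)
{ pthread_mutex_lock(&printer);       /* P(PR)                                   */
  pthread_mutex_lock(&terminal);      /* P(TE) -- waits forever if proc1 holds TE */
  /* use printer and terminal */
  pthread_mutex_unlock(&terminal);    /* V(TE)                                   */
  pthread_mutex_unlock(&printer);     /* V(PR)                                   */
  return NULL;
}

int main(void)
{ pthread_t t1, t2;
  pthread_create(&t1, NULL, proc1, NULL);
  pthread_create(&t2, NULL, proc2, NULL);
  pthread_join(t1, NULL); pthread_join(t2, NULL);
  printf("no deadlock this time\n");
  return 0;
}

Acquiring both resources in the same global order in every process breaks condition 4 (the circular chain) and avoids the deadlock.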


Inefficiencies

Problem C.1: Load Balancing

Simple Scheduling Model:
Static distribution of processes to processors (no redistribution during run time)
⇒ can potentially lead to large inefficiencies.

[Figure: processes 1–9 statically distributed over PE1, PE2, PE3; initial state vs. a state where some processes are blocked and parallelism is lost]

Extended Model (by K. Hwang)

Dynamic allocation of processes to processors during run time (dynamic load balancing).
Allocated processes may be re-distributed (process migration) depending on the load setting (threshold).

Methods:
• Receiver-Initiative: Processors with low load request more processes. (useful at high system load)
• Sender-Initiative: Processors with too much load want to offload processes. (useful at low system load)
• Hybrid Method: The system switches between Receiver- and Sender-Initiative depending on global system load.

Extended Scheduling Model

Advantages and Disadvantages
+ Better processor utilization, no loss of possible parallelism
– General management overheads
– Methods kick in too late, namely when the load distribution is already seriously out of balance
– "Process migration" is an expensive operation and only makes sense for longer-running processes
o Circular "process migration" must be prevented via appropriate parallel algorithms and thresholds
– Under full parallel load, load balancing is useless!

3.7 MIMD Programming Languages

• Pascal derivatives
  – Concurrent Pascal (Brinch Hansen, 1977)
  – Ada (US Dept. of Defense, 1975)
  – Modula-P (Bräunl, 1986)
• C/C++ plus parallel libraries
  – Sequent C (Sequent, 1988)
  – pthreads
  – PVM "parallel virtual machine" (Sunderam et al., 1990)
  – MPI "message passing interface" (based on CVS, MPI Forum, 1995)
• Special languages
  – CSP "Communicating Sequential Processes" (Hoare, 1978)
  – Occam (based on CSP, Inmos Ltd., 1984)
  – Linda (Carriero, Gelernter, 1986)

Pthreads (source: http://www.llnl.gov/computing/tutorials/pthreads/)

• In the UNIX environment a thread:
  – Exists within a process and uses the process resources
  – Has its own independent flow of control as long as it exists and the OS supports it
  – May share the process resources with other threads that act equally independently (and dependently)
  – Dies if the parent process dies (or something similar)
• To the software developer, the concept of a "procedure" that runs independently from its main program may best describe a thread.
• Because threads within the same process share resources:
  – Changes made by one thread to shared system resources (such as closing a file) will be seen by all other threads.
  – Two pointers having the same value point to the same data.
  – Reading and writing to the same memory locations is possible, and therefore requires explicit synchronization by the programmer.

• Extension to standard Unix fork() and join()
• Previously many different implementations
• Standard: IEEE POSIX 1003.1c (1995)
• Performance:
  – Much faster than fork() (about 10×)
  – Much faster than MPI, PVM
• See: http://www.llnl.gov/computing/tutorials/pthreads/
       http://www.cs.nmsu.edu/~jcook/Tools/pthreads/library.html

[Figures: memory layout of a sequential process vs. threads sharing one parallel address space]


Pthreads: Thread Functions

Create a new thread (use NULL pointer for default attributes):
int pthread_create (pthread_t *thread_id, const pthread_attr_t *attributes, void *(*thread_function)(void *), void *arguments);

A thread terminates when the function returns, or explicitly by calling:
int pthread_exit (void *status);

A thread can wait for the termination of another:
int pthread_join (pthread_t thread, void **status_ptr);

A thread can read its own id:
pthread_t pthread_self ();

It can be checked whether two threads are identical:
int pthread_equal (pthread_t t1, pthread_t t2);

Pthreads: Mutex Functions

Mutex data type: pthread_mutex_t

Init of Mutex (mutual exclusion = simple binary semaphore):
int pthread_mutex_init (pthread_mutex_t *mut, const pthread_mutexattr_t *attr);

Lock Mutex (= P(sema)):
int pthread_mutex_lock (pthread_mutex_t *mut);

Unlock Mutex (= V(sema)):
int pthread_mutex_unlock (pthread_mutex_t *mut);

Nonblocking version of lock (either succeeds or returns EBUSY):
int pthread_mutex_trylock (pthread_mutex_t *mut);

Deallocate Mutex:
int pthread_mutex_destroy (pthread_mutex_t *mut);

Pthreads: Semaphore Functions

Semaphore data type: sem_t

Init and de-allocation of Semaphore (use PTHREAD_PROCESS_PRIVATE as shared value):
int sem_init (sem_t *sem, int pshared, unsigned int value);
int sem_destroy (sem_t *sem);

P(sema):
int sem_wait (sem_t *sem);
int sem_timedwait (sem_t *sem, const struct timespec *abstime);

V(sema):
int sem_post (sem_t *sem);
int sem_post_multiple (sem_t *sem, int count);

Other:
int sem_getvalue (sem_t *sem, int *sval);

Pthreads: Monitor Conditions

Init Condition:
int pthread_cond_init (pthread_cond_t *cond, pthread_condattr_t *attr);

Wait Version 1: Standard (note: always blocks)
int pthread_cond_wait (pthread_cond_t *cond, pthread_mutex_t *mut);

Wait Version 2: With timeout
int pthread_cond_timedwait (pthread_cond_t *cond, pthread_mutex_t *mut, const struct timespec *abstime);

Signal Version 1 (note: releases 1 waiting thread)
int pthread_cond_signal (pthread_cond_t *cond);

Signal Version 2 (note: releases all waiting threads)
int pthread_cond_broadcast (pthread_cond_t *cond);

Deallocate Condition:
int pthread_cond_destroy (pthread_cond_t *cond);
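A short sketch of the canonical pattern for these condition functions (the names buf_mutex, not_empty and count are illustrative): the waiter re-checks its condition in a while-loop, since pthread_cond_wait releases the mutex while blocked and reacquires it before returning.

#include <pthread.h>

pthread_mutex_t buf_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
int count = 0;                       /* number of items in some buffer  */

void put_item(void)
{ pthread_mutex_lock(&buf_mutex);
  count++;                           /* add an item                     */
  pthread_cond_signal(&not_empty);   /* wake one waiting consumer       */
  pthread_mutex_unlock(&buf_mutex);
}

void get_item(void)
{ pthread_mutex_lock(&buf_mutex);
  while (count == 0)                 /* re-check after every wakeup     */
    pthread_cond_wait(&not_empty, &buf_mutex);
  count--;                           /* remove an item                  */
  pthread_mutex_unlock(&buf_mutex);
}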

Pthreads: Hello World Example

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 5

void *PrintHello(void *threadid)
{ printf("\n%d: Hello World!\n", (int) threadid);
  pthread_exit(NULL);
  return 0;
}

int main (int argc, char *argv[])
{ pthread_t threads[NUM_THREADS];
  int rc, t;
  for(t=0;t < NUM_THREADS;t++){
    printf("Creating thread %d\n", t);
    rc = pthread_create(&threads[t],NULL,PrintHello,(void *)t);
    if (rc){
      printf("ERROR; return code from create() is %d\n", rc);
      exit(-1);
    }
  }
  pthread_exit(NULL);
}

Sample output:
Creating thread 0
Creating thread 1
0: Hello World!
Creating thread 2
1: Hello World!
2: Hello World!
Creating thread 3
Creating thread 4
3: Hello World!
4: Hello World!

MPI (Message Passing Interface)

• Based on CVS, MPI Forum (incl. hardware vendors), 1994/95
• MPI is a standard for a library of functions and macros that implements data sharing and synchronization between processes.
• Designed to be practical, portable, efficient and flexible.
• Public domain implementations: MPICH (www.mcs.anl.gov/mpi), LAM-MPI (www.lam-mpi.org)
• Available for almost any platform, links to C, C++, FORTRAN
• MPE routines provide profiling and graphics output.
• Debugging tools are implementation specific.
• Prepared to address the lack of compatibility between software utilizing vendor-specific message passing libraries.

MPI (Message Passing Interface)

• Programming in a well-known language, with insertion of synchronisation, communication and process grouping functions (no process creation, however, like in PVM)
• Parallel processing for MIMD (can be used on shared or hybrid systems). Unlike in PVM, code for all processes is usually contained in one executable.
• Different implementations provide different runtime environments. Some have daemons that watch over process execution (like PVM) – e.g. LAM-MPI – and some do not, e.g. MPICH. Not all implementations are thread safe!
• MPI was designed to please many interests → many functions exist. Only a subset is really required for most application programs.

MPI Programming

• One source file for all processes. Distinguish processing responsibility via rank (an integer uniquely assigned to a process) by calling MPI_Comm_rank(). The number of processes requested by the user is obtained through MPI_Comm_size().
• MPI programs must call MPI_Init() before using any MPI calls. All programs must end with MPI_Finalize().
• MPI data type definitions are provided for basic data types (e.g. MPI_INT, MPI_DOUBLE). User data types can be defined. Packing is optional.
• Basic message passing
  – Blocking: MPI_Send(), MPI_Recv()
  – Non-blocking: MPI_Isend(), MPI_Irecv() followed by e.g. MPI_Wait(), MPI_Waitall() or MPI_Test()
• Collective communications: MPI_Bcast(), MPI_Barrier(), MPI_Reduce(), MPI_Scatter(), MPI_Gather(), MPI_Alltoall()

MPI Programming

MPI_Send(vector, 10, MPI_INT, dest, tag, MPI_COMM_WORLD)
  arguments: start address, size (no. of elements), data type, destination, tag, communicator (user-defined process group)

MPI_Recv(vector, 10, MPI_INT, src, tag, MPI_COMM_WORLD, status)
  additional arguments: source, status (info about the received message)

MPI_Bcast(vector, 10, MPI_INT, root_rank, MPI_COMM_WORLD)
  root_rank: rank of the broadcasting process

MPI_Barrier(communicator)

MPI Functions

Initialisation/Cleanup
int MPI_Init(int *argc, char ***argv)
int MPI_Finalize()

Communicator Info
int MPI_Comm_rank ( MPI_Comm comm, int *rank )
int MPI_Comm_size ( MPI_Comm comm, int *size )

Blocking Message Passing
int MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm )
int MPI_Recv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status )
int MPI_Sendrecv( void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status )

Non-Blocking Message Passing
int MPI_Isend( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request )
int MPI_Irecv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request )
int MPI_Test (MPI_Request *request, int *flag, MPI_Status *status)
int MPI_Wait (MPI_Request *request, MPI_Status *status)
int MPI_Waitall(int count, MPI_Request array_of_requests[], MPI_Status array_of_statuses[])

Collective Communication • Runtime environment is implementation specific (here: MPICH). int MPI_Barrier (MPI_Comm comm) • Machine-file containing a list of machines to be used should be created. int MPI_Bcast ( void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm ) • Processes are started with command mpirun. A machinefile can be given as a parameter (if not a global default is used). int MPI_Reduce ( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm) • A number of processes to be started may also be specified (option -np): int MPI_Allreduce ( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, mpirun -machinefile mfile -np 2 myprog MPI_Comm comm) int MPI_Gather ( void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, This starts 2 instances of myprog on machines specified in mfile int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm) int MPI_Scatter ( void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm) int MPI_Alltoall( void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, MPI_Comm comm) Bräunl 2004 77 Bräunl 2004 78

MPI "Hello World" (adapted from Manchek's PVM ex.)

#include #include #include "mpi.h" • Synchronization with Semaphores int main(int argc, char **argv) { int myrank = -1; int l; char buf[100]; MPI_Status status; • Synchronization with Monitor MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &myrank); • Pi calculation if(myrank == 0) { • Distributed simulation MPI_Recv(buf,100,MPI_CHAR,1,0,MPI_COMM_WORLD,&status); printf("message from process %d : %s \n",status.MPI_SOURCE,buf); } else if(myrank == 1) { strcpy(buf,"Hello world from: "); MPI_Get_processor_name(buf+strlen(buf),&l); MPI_Send(buf,100,MPI_CHAR,0,0,MPI_COMM_WORLD); } MPI_Finalize(); } Bräunl 2004 79 Bräunl 2004 80 Synchronization with Semaphores (pthreads) Synchronization with Semaphores #include void *prod_thread(void *pparam) #include { int i=0; #include while(1) #define BUF_SIZE 5 { i=(i+1)%10; Sample Run sem_t *full,*empty; produce(i); sem_t critical,freex,used; } int buf[BUF_SIZE]; } int pos,z; write Pos: 1 1 void *cons_thread(void *pparam) write Pos: 2 2 void produce(int i) { int i,quer; { sem_wait(&freex); quer=0; read Pos: 2 2 sem_wait(&critical); while(1) write Pos: 2 3 if(pos>=BUF_SIZE){ printf("Err\n");} { consume(&i); buf[pos++]=i; quer = (quer+i)%10; read Pos: 2 3 printf("write Pos: %d %d\n",pos-1,i); } write Pos: 2 4 sem_post(&critical); } sem_post(&used); read Pos: 2 4 } int main(void) { pthread_t p,c; write Pos: 2 5 void consume(int* i) int i; read Pos: 2 5 { sem_wait(&used); sem_init(&freex,PTHREAD_PROCESS_PRIVATE,BUF_SIZE); write Pos: 2 6 sem_wait(&critical); sem_init(&used,PTHREAD_PROCESS_PRIVATE,0); sem_init(&critical,PTHREAD_PROCESS_PRIVATE,1); read Pos: 2 6 if(pos < 0) { printf("Err\n");} ..... *i=buf[--pos]; printf("read Pos: %d %d \n",pos,*i); for(i=0;i

Synchronization with Monitor (pthreads)

#include <pthread.h>
#include <stdio.h>
#define BUF_SIZE 5

pthread_cond_t xused, xfree;
pthread_mutex_t mon;
int stack[BUF_SIZE];
int pointer=0;

void buffer_write(int a)
{ pthread_mutex_lock(&mon);
  if(pointer==BUF_SIZE)
    ...
}

void buffer_read(int* a)
{ pthread_mutex_lock(&mon);
  if(pointer==0)
    ...
}

void *prod_thread(void *pparam)
{ int i=0;
  printf("Init Producer \n");
  while(1)
  { i=(i+1)%10;
    buffer_write(i);
  }
}

void *cons_thread(void *pparam)
{ ...
}

int main(void)
{ pthread_t p,c;
  int i;
  printf("Init... \n");
  pthread_cond_init(&xfree,NULL);
  pthread_cond_init(&xused,NULL);
  pthread_mutex_init(&mon,NULL);
  for(i=0;i< ...

Sample Run:
write 1 1
read 1 1
write 1 2
read 1 2
write 1 3
write 2 4
read 2 4
write 2 5
read 2 5
write 2 6
read 2 6
write 2 7
read 2 7
write 2 8
read 2 8
.....

Pi Calculation

π = ∫₀¹ 4/(1+x²) dx = Σ (i=1..intervals) 4/(1 + ((i−0.5)·width)²) · width

[Figure: f(x) = 4/(1+x²) on [0, 1], approximated by rectangles of the given width]


Pi Calculation (pthreads)

#include <pthread.h>
#include <stdio.h>
#define MAX_THREADS 10
#define INTERVALS 1000
#define WIDTH 1.0/(double)(INTERVALS)

pthread_mutex_t result_mtx;
pthread_mutex_t interval_mtx;

double f(double x)
{ return 4.0/(1+x*x); }

void assignment_get_interval(int *iv)
{ static int pos = 0;
  pthread_mutex_lock(&interval_mtx);
  if(++pos<=INTERVALS) *iv=pos;
  else *iv=-1;
  pthread_mutex_unlock(&interval_mtx);
}

void *worker(void *pparam)
{ ...
}

int main(void)
{ pthread_t thread[MAX_THREADS];
  int i;
  pthread_mutex_init(&interval_mtx,NULL);
  pthread_mutex_init(&result_mtx,NULL);
  for(i=0;i< ...
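Since the worker body is cut off above, here is a minimal self-contained variant of the same midpoint-rule idea; it assigns every THREADS-th interval statically to each thread instead of using assignment_get_interval, and the thread count is an illustrative assumption:

#include <pthread.h>
#include <stdio.h>
#define THREADS   4
#define INTERVALS 1000
#define WIDTH     (1.0/INTERVALS)

pthread_mutex_t result_mtx = PTHREAD_MUTEX_INITIALIZER;
double pi = 0.0;

double f(double x) { return 4.0/(1+x*x); }

void *worker(void *arg)
{ int id = (int)(long)arg;
  double local = 0.0;
  /* each thread sums every THREADS-th interval with the midpoint rule */
  for (int i = id+1; i <= INTERVALS; i += THREADS)
    local += f((i-0.5)*WIDTH) * WIDTH;
  pthread_mutex_lock(&result_mtx);
  pi += local;                          /* add local sum to global result */
  pthread_mutex_unlock(&result_mtx);
  return NULL;
}

int main(void)
{ pthread_t t[THREADS];
  for (long i = 0; i < THREADS; i++) pthread_create(&t[i], NULL, worker, (void *)i);
  for (int i = 0; i < THREADS; i++)  pthread_join(t[i], NULL);
  printf("pi approx = %.10f\n", pi);
  return 0;
}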

Distributed Simulation (pthreads)

Model:
• 2-dim. field of elements (Persons)
• During each time step each person assumes the opinion of a random neighbor
• Start a number of worker threads. Each thread should get its line number from a monitor, then work locally on its area and store back results.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define MAX_THREADS 5

pthread_mutex_t crmtx;
int arr[LINES][COLS];

void monitor_get_linenumber(int* j)
/* returns current line index */
{ ...
}

void monitor_read_line(int j,int* line)
/* returns specified line */
{ int i;
  pthread_mutex_lock(&crmtx);
  for(i=0;i< ...
}

void monitor_put_line(int j,int* line)
/* write back one line */
{ int i;
  pthread_mutex_lock(&crmtx);
  for(i=0;i< ...
}

void *worker(void *pparam)
{ int k,j,pos,cnt;
  int line[COLS],above[COLS],below[COLS],newl[COLS];
  for(cnt=0;cnt<=GENERATIONS;cnt++)
  { if (pparam == 0 && cnt%1000 == 0) print_array(cnt);
    monitor_get_linenumber(&j);
    monitor_read_line(j,line);
    if (j>0) monitor_read_line(j-1, above); else monitor_read_line(LINES-1, above);
    if (j< ...
}

int main(void)
{ pthread_t thread[MAX_THREADS];
  int i,j;
  srand(time(NULL));
  pthread_mutex_init(&crmtx,NULL);
  ...
}

Distributed Simulation

Result of simulation: Local clustering appears.