
3. Asynchronous Parallelism

Setup of an MIMD System

General Model: several processing elements (PE), each with local memory (Mem), connected via a system bus to shared memory, console and network.
The program is segmented into autonomous processes.
Ideal allocation: 1 process : 1 processor. But usually: n processes : 1 processor ⇒ scheduler required.

Sample MIMD Systems

Sequent Symmetry
• 10 CPUs (32-bit Intel 80386)
• 40 MB memory, addressable by all PEs
• Multibus adapter, drive controller, SCED board; Multibus, Ethernet, SCSI bus
[Figure: system layout of the Sequent Symmetry]

Sample MIMD Systems

Intel iPSC Hypercube
• iPSC = Intel "Personal SuperComputer"
• Predecessor "Cosmic Cube" (CalTech, Pasadena)
• Generations: iPSC/1 (80286), iPSC/2 (80386), iPSC/860 (80860)
• max. 128 processors
• Hypercube connection network

Cray "Red Storm" – in cooperation with Sandia National Labs
• Standard AMD Opteron processors (originally 2 per node, in future only 1)
• max. 10,000 nodes
• max. 40 TFLOPS
• at US$ 90 million
• Different node types: compute node, I/O node
• 3D grid network (6 neighbors)

Intel Paragon
• Predecessor "Touchstone Delta"
• max. 512 nodes with 2 × i860 XP processors each
• In each node: 1 processor for arithmetic + 1 processor for data exchange
• Different node types (dynamically configurable): compute node, service node, I/O node
• Grid network (4 neighbors)

[Image: Cray]

3.1 Process Model

Process States

Process states for an individual processor:

[State diagram: New –OSReady→ ready –start of time slice→ executing; executing –OSReschedule / end of time slice→ ready; executing –P(sema)→ blocked; blocked –V(sema)→ ready; executing –OSKill / terminate→ Done]

The states "ready" and "blocked" are managed as queues.
Execution of processes is done in time-sharing mode (time slices).


Process States

Process states for MIMD with shared memory:

[Diagram: states New, ready, executing (on PE1, PE2, PE3), blocked and Done, with transitions add, assign, resign, retire]

The actual allocation, i.e. which process executes on which processor, is often transparent to the programmer. A process can be executed by different processors in sequential time slices.

Process states for MIMD without shared memory:

[Diagram: each processor (Processor 1–4) has its own ready, executing and blocked states]

Threads

"Lightweight" processes
Idea: The process concept remains, but with less overhead.
Previous costs: process switching, especially due to loading/saving of data.
Saving: no loading/saving of program data on context switching.
Prerequisite: one program with multiple processes on a system with shared memory (sequential computer or MIMD).
Implementation: a user program with multiple processes always holds all of its data
• no loading/saving required
• fast execution times
User view: like processes, only faster.

3.2 Synchronization and Communication

Parallel processing creates the following problems:
1. Data exchange between two processes
2. Simultaneous access to a data area by multiple processes must be prevented
[Figure: P1 sends a message to P2]


Synchronization and Communication

Multiple processes are executed in parallel. These need to be synchronized.
[Figure: railway example of this problem]

Software Solution (by Peterson, Silberschatz)

One possibility for this is to use synchronization variables:

var turn: 1..2;
Initialization: turn:=1;

…
start(P1); start(P2);
…

P1:
loop
  while turn≠1 do (*nothing*) end;
  < critical section >
  turn:=2;
end

P2:
loop
  while turn≠2 do (*nothing*) end;
  < critical section >
  turn:=1;
end

Software Solution

• This solution guarantees that only 1 process can enter the critical section.
• But there is one major disadvantage: alternating access is enforced.
⇒ RESTRICTION !!

Software Solution – 1st Improvement

var flag: array [1..2] of BOOLEAN;
Initialization: flag[1]:=false; flag[2]:=false;

P1:
loop
  while flag[2] do (*nothing*) end;
  flag[1]:=true;
  < critical section >
  flag[1]:=false;
end

P2:
loop
  while flag[1] do (*nothing*) end;
  flag[2]:=true;
  < critical section >
  flag[2]:=false;
end


Software Solution – 1st Improvement

It is not that easy though:
Should both processes pass their while-loops simultaneously (despite the safety check), then both will enter the critical section.
⇒ INCORRECT !!

Software Solution – 2nd Improvement

var flag: array [1..2] of BOOLEAN;
Initialization: flag[1]:=false; flag[2]:=false;

P1:
loop
  flag[1]:=true;
  while flag[2] do (*nothing*) end;
  < critical section >
  flag[1]:=false;
  < other instructions >
end

P2:
loop
  flag[2]:=true;
  while flag[1] do (*nothing*) end;
  < critical section >
  flag[2]:=false;
  < other instructions >
end

Software Solution – 2nd Improvement

If the two lines "flag[i]:=true" are moved before the while-loop, the error of improvement 1 will not occur, but now we can have a deadlock instead.
⇒ INCORRECT !!

Software Solution – 3rd Improvement

var turn: 1..2;
    flag: array [1..2] of BOOLEAN;
Initialization: turn:=1; (* arbitrary *)
                flag[1]:=false; flag[2]:=false;

P1:
loop
  flag[1]:=true;
  turn:=2;
  while flag[2] and (turn=2) do (*nothing*) end;
  < critical section >
  flag[1]:=false;
end

P2:
loop
  flag[2]:=true;
  turn:=1;
  while flag[1] and (turn=1) do (*nothing*) end;
  < critical section >
  flag[2]:=false;
end
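Transcribed into C, the 3rd improvement is Peterson's algorithm. A minimal sketch for two threads, assuming C11 atomics (sequentially consistent by default) in place of the slides' plain shared variables; the loop count and the shared counter are only illustrative:

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_bool flag[2];              /* flag[i]: process i wants to enter  */
atomic_int  turn;                 /* whose turn it is to wait           */
long counter = 0;                 /* shared data protected by the lock  */

void *worker(void *arg)
{ int i = (int)(long)arg, other = 1 - i;
  for (int k = 0; k < 100000; k++)
  { atomic_store(&flag[i], true);        /* announce interest            */
    atomic_store(&turn, other);          /* give priority to the other   */
    while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
      ;                                  /* busy wait                    */
    counter++;                           /* critical section             */
    atomic_store(&flag[i], false);       /* leave critical section       */
  }
  return NULL;
}

int main(void)
{ pthread_t t[2];
  for (long i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, (void *)i);
  for (int i = 0; i < 2; i++)  pthread_join(t[i], NULL);
  printf("counter = %ld (expected 200000)\n", counter);
  return 0;
}

Without the atomics (or with plain variables and an optimizing compiler), the same logic is no longer guaranteed to work on modern hardware.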

Software Solution – 3rd Improvement

⇒ CORRECT !!
• Expandable to n processes
• Also see Dekker's algorithm
• Disadvantage of the software solution: busy-wait, i.e. significant efficiency loss if each process does not have its own processor!

Hardware Solution

Test-and-Set operation (standard in most processors)

function test_and_set (var target: BOOLEAN): BOOLEAN;
begin_atomic
  test_and_set := target;
  target := true;
end_atomic.

Implementation as an atomic operation (indivisible, uninterruptible: "1 instruction of the CPU").
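A sketch of how the same idea looks with C11's atomic_flag; the function names lock/unlock and the spin loop are illustrative, not part of the original slides:

#include <stdatomic.h>

atomic_flag lock_var = ATOMIC_FLAG_INIT;     /* false = lock is free            */

void lock(void)
{ /* spin until the previous value was false, i.e. we obtained the lock */
  while (atomic_flag_test_and_set(&lock_var))
    ;                                        /* busy wait                       */
}

void unlock(void)
{ atomic_flag_clear(&lock_var);              /* lock := false                   */
}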

Hardware Solution

New solution using the Test-and-Set operation:

var lock: BOOLEAN;
Initialization: lock:=false;

Process Pi:
loop
  while test_and_set(lock) do (*nothing*) end;
  < critical section >
  lock:=false;
end;

• Removal of the busy-wait via queues (see semaphores).

3.3 Semaphores

Dijkstra, 1965 (similar to a signal post for trains)

Usage in process Pi:
  …
  P(sema);
  < critical section >
  V(sema);
  …


Semaphore

type Semaphore = record
       value: INTEGER;
       L: list of Proc_ID;
     end;

Initialization: L = empty list;
value = number of allowed P-operations without a V-operation.
P, V are atomic operations (indivisible).

Generic Semaphore → value: INTEGER
Boolean Semaphore → value: BOOLEAN

Implementation:

procedure P (var S: Semaphore);
begin
  S.value := S.value - 1;
  if S.value < 0 then
    append(S.L, actproc);   (* append this process to S.L *)
    block(actproc)          (* and move to state "blocked" *)
  end
end P;

procedure V (var S: Semaphore);
var Pnew: Proc_ID;
begin
  S.value := S.value + 1;
  if S.value ≤ 0 then
    getfirst(S.L, Pnew);    (* remove first process from S.L *)
    ready(Pnew)             (* and change state to "ready" *)
  end
end V;

Semaphore

How do we achieve that P and V are atomic operations?
• Single-processor system: disable all interrupts
• Multi-processor system:
  – Software solution: busy-wait with a short P- or V-operation as critical section.
  – Hardware solution: short busy-wait with a Test-and-Set instruction before the start of a P- or V-operation.
Attention: Convoy phenomenon

Producer-Consumer Problem

Using Boolean semaphores

Declaration and Initialization:
var empty: semaphore [1];
    full : semaphore [0];

process Producer;
begin
  loop
    P (empty);
    (* write data *)
    V (full);
  end;
end process Producer.

process Consumer;
begin
  loop
    P (full);
    (* read data *)
    V (empty);
  end;
end process Consumer.


Producer-Consumer Problem

Corresponding Petri net: [figure with transitions "Data creation", "write", "read", "Data consumption" and places "empty", "full" between Producer and Consumer]

Bounded Buffer Problem

var critical: semaphore[1];
    free    : semaphore[n];   (* there are n buffer spaces *)
    used    : semaphore[0];

process Producer;
begin
  loop
    P(free);
    P(critical);
    (* write data *)
    V(critical);
    V(used);
  end;
end process Producer;

process Consumer;
begin
  loop
    P(used);
    P(critical);
    (* read data *)
    V(critical);
    V(free);
  end;
end process Consumer;

Bounded Buffer Problem

Corresponding Petri net: [figure: places "free" (n), "critical" (1), "used" (0) between the transitions "Data creation", "write", "read", "Data consumption" of Producer and Consumer]

Readers-Writers Problem

(* One writer or many readers *)
var count: semaphore[1];
    r_w  : semaphore[1];
    readcount: INTEGER;
Initialization: readcount := 0;

process Reader;
begin
  loop
    P(count);
    if readcount=0 then P(r_w) end;
    readcount := readcount + 1;
    V(count);
    (* read data *)
    P(count);
    readcount := readcount - 1;
    if readcount=0 then V(r_w) end;
    V(count);
  end; (* loop *)
end process Reader;

process Writer;
begin
  loop
    P(r_w);
    (* write data *)
    V(r_w);
  end; (* loop *)
end process Writer;
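A minimal pthreads sketch of the same readers-writers protocol; the names count_mtx, rw_sem, shared_data and the single reader/writer pair in main are illustrative assumptions (a POSIX semaphore stands in for r_w so that the last reader may release it):

#include <pthread.h>
#include <semaphore.h>

pthread_mutex_t count_mtx = PTHREAD_MUTEX_INITIALIZER;   /* protects readcount            */
sem_t rw_sem;                                             /* "one writer or many readers"  */
int readcount = 0;
int shared_data = 0;

void *reader(void *arg)
{ pthread_mutex_lock(&count_mtx);                 /* P(count)                        */
  if (readcount++ == 0) sem_wait(&rw_sem);        /* first reader locks out writers  */
  pthread_mutex_unlock(&count_mtx);               /* V(count)                        */

  int value = shared_data; (void)value;           /* read data                       */

  pthread_mutex_lock(&count_mtx);
  if (--readcount == 0) sem_post(&rw_sem);        /* last reader lets writers in     */
  pthread_mutex_unlock(&count_mtx);
  return NULL;
}

void *writer(void *arg)
{ sem_wait(&rw_sem);                              /* P(r_w)                          */
  shared_data++;                                  /* write data                      */
  sem_post(&rw_sem);                              /* V(r_w)                          */
  return NULL;
}

int main(void)
{ pthread_t r, w;
  sem_init(&rw_sem, 0, 1);
  pthread_create(&r, NULL, reader, NULL);
  pthread_create(&w, NULL, writer, NULL);
  pthread_join(r, NULL); pthread_join(w, NULL);
  sem_destroy(&rw_sem);
  return 0;
}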

Readers-Writers Problem

Corresponding Petri net: [figure: transitions P(count), V(count), Data write, Data read; places readcount, count, r_w; Reader and Writer processes]

Thread Example 1

#include <pthread.h>
#include <stdio.h>
#define repeats 3
#define threadnum 5

void *slave(void *arg)
{ int id = (int) arg ;
  for (int i = 0; i < repeats; i++)
  { printf("thread %d\n", id);
    sched_yield();
  }
  return( 0 ) ;
}

int main()
{ pthread_t thread;
  for(int i = 0; i < threadnum ; i++)
  { if (pthread_create(&thread, NULL, slave, (void *) i))
      printf("Error: thread creation failed\n");
  }
  pthread_exit( 0 );
}

Sample output (threads take turns after each sched_yield):
thread 0, thread 1, thread 2, thread 4, thread 3, thread 0, thread 2, thread 4, thread 1, thread 3, thread 0, thread 2, thread 4, thread 1, thread 3

Thread Example 2

Identical to Example 1, but without the sched_yield() call in the slave loop.

Sample output (each thread now completes all its repeats in a row):
thread 0, thread 0, thread 0, thread 2, thread 2, thread 2, thread 1, thread 1, thread 1, thread 3, thread 3, thread 4, thread 3, thread 4, thread 4

Thread Example 3

#include <pthread.h>
#include <stdio.h>
#define repeats 3
#define threadnum 5

void *slave(void *arg)
{ int id = (int) arg ;
  for (int i = 0; i < repeats; i++)
  { printf("%d-A\n", id);
    printf("%d-B\n", id);
    printf("%d-C\n", id);
  }
  return( 0 ) ;
}

int main()
{ pthread_t thread;
  for(int i = 0; i < threadnum ; i++)
  { if (pthread_create(&thread, NULL, slave, (void *) i))
      printf("Error: thread creation failed\n");
  }
  pthread_exit( 0 );
}

Sample output (lines from different threads interleave, e.g.):
0-A, 0-B, 0-C, 0-A, 0-B, 2-A, 1-A, 0-C, 3-A, 2-B, 4-A, 1-B, 3-B, 2-C, 4-B, 1-C, 3-C, …

Thread Example 4

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define repeats 3
#define threadnum 5

pthread_mutex_t mutex;

void *slave(void *arg)
{ int id = (int) arg ;
  for (int i = 0; i < repeats; i++)
  { pthread_mutex_lock(&mutex);
    printf("%d-A\n", id);
    printf("%d-B\n", id);
    printf("%d-C\n", id);
    pthread_mutex_unlock(&mutex);
  }
  return( 0 ) ;
}

int main()
{ pthread_t thread;
  if (pthread_mutex_init(&mutex, NULL))
  { printf("Error: mutex failed\n"); exit(1); }
  for(int i = 0; i < threadnum ; i++)
  { if (pthread_create(&thread, NULL, slave, (void *) i))
    { printf("Error: thread creation failed\n"); exit(1); }
  }
  pthread_exit( 0 );
}

Sample output (each thread's A-B-C group now stays together):
0-A, 0-B, 0-C, 0-A, 0-B, 0-C, 0-A, 0-B, 0-C, 2-A, 2-B, 2-C, 4-A, 4-B, 4-C, 1-A, 1-B, 1-C, 3-A, 3-B, 3-C, …

3.4 Monitor

By Hoare 1974, Brinch Hansen 1975

• Abstract data type
• A monitor encapsulates the data area to be protected and the corresponding data access mechanisms (Entries, Conditions).

Usage:
P1: ... Buffer.DataWrite(x) ...        P2: ... Buffer.DataRead(x) ...

Monitor calls are mutually exclusive, they are automatically synchronized.

Monitor

Application Example: Buffer management (a stack with n entries and a stack pointer)

Monitor for Buffer Management

monitor Buffer;
var Stack: array [1..n] of Dataset;
    Pointer: 0..n;
    free, used: CONDITION;

entry WriteData (a: Dataset);
begin
  while Pointer=n do (* Stack full *)
    wait(free)
  end;
  inc(Pointer);
  Stack[Pointer]:=a;
  if Pointer=1 then signal(used) end;
end WriteData;

entry ReadData (var a: Dataset);
begin
  while Pointer=0 do (* Stack empty *)
    wait(used)
  end;
  a:=Stack[Pointer];
  dec(Pointer);
  if Pointer=n-1 then signal(free) end;
end ReadData;

begin (* Monitor-Initialization *)
  Pointer:=0
end monitor Buffer;


Monitor Conditions

Conditions are queues, similar to those of semaphores.
• wait (Cond)
  The executing process blocks itself and waits until another process executes a signal-operation on the condition Cond.
• signal (Cond)
  All processes waiting in the condition Cond are reactivated and will again apply for access to the monitor. (Another variant only releases one process: the next one in the queue.)
• status (Cond)
  This function returns the number of processes waiting for entry into this condition.

Monitor Implementation

Monitor Implementation Steps:
1) var MSema: semaphore[1]; (* is required for every monitor *)
2) Rewriting of entries to procedures:
   procedure xyz(...)
   begin
     P(MSema);
     ...
     V(MSema);
   end xyz;
3) Rewrite of the monitor initialization into a procedure and a corresponding call from the main program.

4) procedure wait(Cond: condition; MSema: semaphore);
   begin
     append(Cond, actproc);  (* insert ProcID in condition queue *)
     block(actproc);         (* Dispatcher: insert ProcID in blocked list *)
     V(MSema);               (* release monitor semaphore *)
     assign;                 (* Dispatcher: next ready process *)
   end wait;

5) procedure status(Cond: condition): CARDINAL;
   begin
     return length(Cond);
   end status;

6a) In this implementation only one process is released (wait in if-clause):

   procedure signal(Cond, MSema);
   var NewProc: Proc_ID;
   begin
     if not empty(Cond) then
       GetFirst(Cond, NewProc);
       P-operation for NewProc
     end
   end signal;

   [Figure: condition queue Cond and semaphore queue MSema before and after signal (afterwards Cond = NIL)]

Monitor Implementation

6b) In this implementation all waiting processes are released (wait in while-loop):

procedure signal(var Cond: condition; var MSema: semaphore);
begin (* Status-check not needed here *)
  append(MSema.L, Cond);                      (* append list *)
  MSema.Value := MSema.Value - status(Cond);
  Cond := nil;
end signal;

[Figure: the condition queue Cond is appended to the semaphore queue MSema, then Cond is empty]

3.5 Message Passing

• For distributed systems (no shared memory) this is the only method of communication
• Also usable for systems with shared memory
• Easy to use, but computation-time expensive (overhead)
⇒ Implementation with implicit communication: Remote Procedure Call

Message Passing Example

[Figure: Process 1, Process 2, …, Process n, each with a buffer area for messages (A = Tasks, R = Replies)]

Client process PC:
  ...
  Send_A(to_Server, Task);
  ...
  Receive_R(from_Server, Result);

Server process PS:
  loop
    Receive_A(Client, Task);
    ...
    Send_R(Client, Result);
  end;

Operations:
  Send_A (Receiver, Message)
  Receive_A (var Sender; var Message)
  Send_R (Receiver, Message)
  Receive_R (var Sender; var Message)
Send/receive of jobs, and send/receive of replies.
Each process can receive as many jobs without processing them as fit into its buffer.

Implementation:
• In parallel systems with shared memory: Monitor
• In parallel systems without shared memory: decentralized network management with message protocols

Message Passing Example

Schematic for systems with shared memory (use of a global message pool):
[Figure: processes access the global Pool via Send_A, Receive_A, Send_R, Receive_R]

type PoolElem = record
       free: BOOLEAN;
       from: 1..number of Procs;
       info: Message;
     end;
     Queue = record
       contents: 1..max;
       next: pointer to Queue
     end;

monitor Messages;
var Pool: array [1..max] of PoolElem;                       (* global message pool *)
    pfree: CONDITION;                                       (* queue, in case pool is completely full *)
    Afull, Rfull: array [1..number of Procs] of CONDITION;  (* queue for each process for incoming messages *)
    queueA, queueR: array [1..number of Procs] of Queue;    (* local message queues for each process *)

Message Passing Example

entry Send_A (to: 1..number of Procs; a: Message);
var id: 1..max;
begin
  while not GetFreeElem(id) do wait(pfree) end;
  with pool[id] do
    free := false;
    from := actproc;
    info := a;
  end;
  append(queueA[to], id);      (* insert place number in task queue *)
  signal(Afull[to]);
end Send_A;

entry Receive_A (var von: 1..number of Procs; var a: Message);
var id: 1..max;
begin
  while empty(queueA[actproc]) do wait(Afull[actproc]) end;
  id := head(queueA[actproc]);
  von := pool[id].from;
  a := pool[id].info;          (* pool[id] not yet freed *)
end Receive_A;

entry Send_R (nach: 1..number of Procs; ergebnis: Message);
var id: 1..max;
begin
  id := head(queueA[actproc]);
  tail(queueA[actproc]);       (* remove first element (head) of Queue *)
  pool[id].from := actproc;
  pool[id].info := ergebnis;
  append(queueR[nach], id);    (* insert place number in reply queue *)
  signal(Rfull[nach])
end Send_R;

entry Receive_R (var von: 1..number of Procs; var erg: Message);
var id: 1..max;
begin
  while empty(queueR[actproc]) do wait(Rfull[actproc]) end;
  id := head(queueR[actproc]);
  tail(queueR[actproc]);       (* remove first element (head) of Queue *)
  with pool[id] do
    von := from;
    erg := info;
    free := true;              (* release of pool element *)
  end;
  signal(pfree);               (* a free pool element exists *)
end Receive_R;

3.6 Problems with Asynchronous Parallelism

Time-dependent errors
• Not reproducible!
• Cannot be found by systematic testing!

A. Inconsistent Data
   A set of data (or a relationship between data) does not have the value it would have received if the operations had been processed sequentially.
B. Blockings
   Deadlock, Livelock
C. Inefficiencies
   Load balancing, …

Inconsistent Data

Problem A.1: Lost Update

Before: Income[Miller] = 1000

P1:
…
x := Income[Miller];
x := x+50;
Income[Miller] := x;
…

P2:
…
y := Income[Miller];
y := 1.1*y;
Income[Miller] := y;
…

After: Income[Miller] = ?  Depending on the sequence of memory accesses: 1050, 1100, 1150 or 1155.
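The lost update can be reproduced (unreliably, since it is timing dependent) with two pthreads; the function names p1/p2 and the plain global variable are illustrative assumptions:

#include <pthread.h>
#include <stdio.h>

long income = 1000;                     /* Income[Miller] */

void *p1(void *arg)                     /* x := income; x := x+50; income := x   */
{ long x = income; x = x + 50; income = x; return NULL; }

void *p2(void *arg)                     /* y := income; y := 1.1*y; income := y  */
{ long y = income; y = (long)(1.1 * y); income = y; return NULL; }

int main(void)
{ pthread_t t1, t2;
  pthread_create(&t1, NULL, p1, NULL);
  pthread_create(&t2, NULL, p2, NULL);
  pthread_join(t1, NULL); pthread_join(t2, NULL);
  printf("income = %ld\n", income);     /* 1050, 1100, 1150 or 1155 depending on interleaving */
  return 0;
}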

Inconsistent Data

Problem A.2: Inconsistent Analysis

Before: account[A] = 2000, account[B] = 1000, Sum(A,B) = 3000

P1:
…
x := account[A];
x := x-400;
account[A] := x;
x := account[B];
x := x+400;
account[B] := x;
…

P2:
…
a := account[A];
b := account[B];
sum := a+b;
…

After: account[A] = 1600, account[B] = 1400, Sum(A,B) = 3000 (same as before!)
But the sum computed by P2: sum = ?  Depending on the sequence of memory accesses: 2600, 3000 or 3400.

Inconsistent Data

Problem A.3: Uncommitted Dependencies
In databases and transaction systems: transactions are atomic operations (commit / rollback).

Blockings

A group of processes is waiting for the occurrence of a condition which can only be generated by that group itself (alternating dependency).

Sample occurrences:
• All processes are blocked in semaphore or queue conditions (deadlock)
• All processes are caught in mutual busy-wait or polling statements (livelock)

Blockings

Problem B.1: Deadlock

Two processes require terminal (TE) and printer (PR) resources for computing:

P1:          P2:
P (TE);      P (PR);
P (PR);      P (TE);
...          ...
V (PR);      V (TE);
V (TE);      V (PR);

The following conditions are required for a deadlock to occur [Coffman, Elphick, Shoshani 71]:
1. Resources can only be used exclusively.
2. Processes possess resources while requesting new ones.
   (may be broken by demanding that all required resources must be requested at the same time)
3. Resources cannot be forcibly removed from processes.
   (may be broken by forced removal of resources, e.g. resolving existing deadlocks)
4. There is a circular chain of processes such that each process possesses the resources requested by the next one in the chain.
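A minimal pthreads sketch of this deadlock; the mutex names terminal/printer stand in for the TE and PR semaphores and are illustrative. Run repeatedly, the two threads eventually block each other forever:

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t terminal = PTHREAD_MUTEX_INITIALIZER;   /* TE */
pthread_mutex_t printer  = PTHREAD_MUTEX_INITIALIZER;   /* PR */

void *proc1(void *arg)
{ pthread_mutex_lock(&terminal);      /* P(TE)                                   */
  pthread_mutex_lock(&printer);       /* P(PR) -- waits forever if proc2 holds PR */
  /* use terminal and printer */
  pthread_mutex_unlock(&printer);     /* V(PR)                                   */
  pthread_mutex_unlock(&terminal);    /* V(TE)                                   */
  return NULL;
}

void *proc2(void *arg)
{ pthread_mutex_lock(&printer);       /* P(PR)                                   */
  pthread_mutex_lock(&terminal);      /* P(TE) -- waits forever if proc1 holds TE */
  /* use printer and terminal */
  pthread_mutex_unlock(&terminal);    /* V(TE)                                   */
  pthread_mutex_unlock(&printer);     /* V(PR)                                   */
  return NULL;
}

int main(void)
{ pthread_t t1, t2;
  pthread_create(&t1, NULL, proc1, NULL);
  pthread_create(&t2, NULL, proc2, NULL);
  pthread_join(t1, NULL); pthread_join(t2, NULL);
  printf("no deadlock this time\n");
  return 0;
}

Acquiring both resources in the same global order in every process breaks condition 4 (the circular chain) and avoids the deadlock.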


Inefficiencies

Problem C.1: Load Balancing

Simple Scheduling Model:
Static distribution of processes to processors (no redistribution during run time)
⇒ can potentially lead to large inefficiencies.

[Figure: processes 1–9 statically distributed over PE1, PE2, PE3; initial state vs. a state where some processes are blocked and parallelism is lost]

Extended Model (by K. Hwang)

Dynamic allocation of processes to processors during run time (dynamic load balancing).
Allocated processes may be re-distributed (process migration) depending on the load setting (threshold).

Methods:
• Receiver-Initiative: Processors with low load request more processes. (useful at high system load)
• Sender-Initiative: Processors with too much load want to offload processes. (useful at low system load)
• Hybrid Method: The system switches between Receiver- and Sender-Initiative depending on global system load.

Extended Scheduling Model

Advantages and Disadvantages
+ Better processor utilization, no loss of possible parallelism
– General management overheads
– Methods kick in too late, namely when the load distribution is already seriously out of balance
– "Process migration" is an expensive operation and only makes sense for longer-running processes
o Circular "process migration" must be prevented via appropriate parallel algorithms and thresholds
– Under full parallel load, load balancing is useless!

3.7 MIMD Programming Languages

• Pascal derivatives
  – Concurrent Pascal (Brinch Hansen, 1977)
  – Ada (US Dept. of Defense, 1975)
  – Modula-P (Bräunl, 1986)
• C/C++ plus parallel libraries
  – Sequent C (Sequent, 1988)
  – pthreads
  – PVM "parallel virtual machine" (Sunderam et al., 1990)
  – MPI "message passing interface" (based on CVS, MPI Forum, 1995)
• Special languages
  – CSP "Communicating Sequential Processes" (Hoare, 1978)
  – Occam (based on CSP, Inmos Ltd., 1984)
  – Linda (Carriero, Gelernter, 1986)

Pthreads (source: http://www.llnl.gov/computing/tutorials/pthreads/)

• In the UNIX environment a thread:
  – Exists within a process and uses the process resources
  – Has its own independent flow of control as long as it exists and the OS supports it
  – May share the process resources with other threads that act equally independently (and dependently)
  – Dies if the parent process dies (or something similar)
• To the software developer, the concept of a "procedure" that runs independently from its main program may best describe a thread.
• Because threads within the same process share resources:
  – Changes made by one thread to shared system resources (such as closing a file) will be seen by all other threads.
  – Two pointers having the same value point to the same data.
  – Reading and writing to the same memory locations is possible, and therefore requires explicit synchronization by the programmer.

• Extension to standard Unix fork() and join()
• Previously many different implementations
• Standard: IEEE POSIX 1003.1c (1995)
• Performance:
  – Much faster than fork() (about 10×)
  – Much faster than MPI, PVM
• See: http://www.llnl.gov/computing/tutorials/pthreads/
       http://www.cs.nmsu.edu/~jcook/Tools/pthreads/library.html

[Figures: memory layout of a sequential process vs. threads sharing one parallel address space]


Pthreads: Thread Functions

Create a new thread (use NULL pointer for default attributes):
int pthread_create (pthread_t *thread_id, const pthread_attr_t *attributes, void *(*thread_function)(void *), void *arguments);

A thread terminates when the function returns, or explicitly by calling:
int pthread_exit (void *status);

A thread can wait for the termination of another:
int pthread_join (pthread_t thread, void **status_ptr);

A thread can read its own id:
pthread_t pthread_self ();

It can be checked whether two threads are identical:
int pthread_equal (pthread_t t1, pthread_t t2);

Pthreads: Mutex Functions

Mutex data type: pthread_mutex_t

Init of Mutex (mutual exclusion = simple binary semaphore):
int pthread_mutex_init (pthread_mutex_t *mut, const pthread_mutexattr_t *attr);

Lock Mutex (= P(sema)):
int pthread_mutex_lock (pthread_mutex_t *mut);

Unlock Mutex (= V(sema)):
int pthread_mutex_unlock (pthread_mutex_t *mut);

Nonblocking version of lock (either succeeds or returns EBUSY):
int pthread_mutex_trylock (pthread_mutex_t *mut);

Deallocate Mutex:
int pthread_mutex_destroy (pthread_mutex_t *mut);

Pthreads: Semaphore Functions

Semaphore data type: sem_t

Init and de-allocation of Semaphore (use PTHREAD_PROCESS_PRIVATE as shared value):
int sem_init (sem_t *sem, int pshared, unsigned int value);
int sem_destroy (sem_t *sem);

P(sema):
int sem_wait (sem_t *sem);
int sem_timedwait (sem_t *sem, const struct timespec *abstime);

V(sema):
int sem_post (sem_t *sem);
int sem_post_multiple (sem_t *sem, int count);

Other:
int sem_getvalue (sem_t *sem, int *sval);

Pthreads: Monitor Conditions

Init Condition:
int pthread_cond_init (pthread_cond_t *cond, pthread_condattr_t *attr);

Wait Version 1: Standard (note: always blocks)
int pthread_cond_wait (pthread_cond_t *cond, pthread_mutex_t *mut);

Wait Version 2: With timeout
int pthread_cond_timedwait (pthread_cond_t *cond, pthread_mutex_t *mut, const struct timespec *abstime);

Signal Version 1 (note: releases 1 waiting thread)
int pthread_cond_signal (pthread_cond_t *cond);

Signal Version 2 (note: releases all waiting threads)
int pthread_cond_broadcast (pthread_cond_t *cond);

Deallocate Condition:
int pthread_cond_destroy (pthread_cond_t *cond);
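A short sketch of the canonical pattern for these condition functions (the names buf_mutex, not_empty and count are illustrative): the waiter re-checks its condition in a while-loop, since pthread_cond_wait releases the mutex while blocked and reacquires it before returning.

#include <pthread.h>

pthread_mutex_t buf_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
int count = 0;                       /* number of items in some buffer  */

void put_item(void)
{ pthread_mutex_lock(&buf_mutex);
  count++;                           /* add an item                     */
  pthread_cond_signal(&not_empty);   /* wake one waiting consumer       */
  pthread_mutex_unlock(&buf_mutex);
}

void get_item(void)
{ pthread_mutex_lock(&buf_mutex);
  while (count == 0)                 /* re-check after every wakeup     */
    pthread_cond_wait(&not_empty, &buf_mutex);
  count--;                           /* remove an item                  */
  pthread_mutex_unlock(&buf_mutex);
}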

Pthreads: Hello World Example

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 5

void *PrintHello(void *threadid)
{ printf("\n%d: Hello World!\n", (int) threadid);
  pthread_exit(NULL);
  return 0;
}

int main (int argc, char *argv[])
{ pthread_t threads[NUM_THREADS];
  int rc, t;
  for(t=0;t < NUM_THREADS;t++){
    printf("Creating thread %d\n", t);
    rc = pthread_create(&threads[t],NULL,PrintHello,(void *)t);
    if (rc){
      printf("ERROR; return code from create() is %d\n", rc);
      exit(-1);
    }
  }
  pthread_exit(NULL);
}

Sample output:
Creating thread 0
Creating thread 1
0: Hello World!
Creating thread 2
1: Hello World!
2: Hello World!
Creating thread 3
Creating thread 4
3: Hello World!
4: Hello World!

MPI (Message Passing Interface)

• Based on CVS, MPI Forum (incl. hardware vendors), 1994/95
• MPI is a standard for a library of functions and macros that implements data sharing and synchronization between processes.
• Designed to be practical, portable, efficient and flexible.
• Public domain implementations: MPICH (www.mcs.anl.gov/mpi), LAM-MPI (www.lam-mpi.org)
• Available for almost any platform, links to C, C++, FORTRAN
• MPE routines provide profiling and graphics output.
• Debugging tools are implementation specific.
• Prepared to address the lack of compatibility between software utilizing vendor-specific message passing libraries.

MPI (Message Passing Interface)

• Programming in a well-known language, with insertion of synchronisation, communication and process grouping functions (no process creation, however, like in PVM)
• Parallel processing for MIMD (can be used on shared or hybrid systems). Unlike in PVM, code for all processes is usually contained in one executable.
• Different implementations provide different runtime environments. Some have daemons that watch over process execution (like PVM) – e.g. LAM-MPI – and some do not, e.g. MPICH. Not all implementations are thread safe!
• MPI was designed to please many interests → many functions exist. Only a subset is really required for most application programs.

MPI Programming

• One source file for all processes. Distinguish processing responsibility via rank (an integer uniquely assigned to a process) by calling MPI_Comm_rank(). The number of processes requested by the user is obtained through MPI_Comm_size().
• MPI programs must call MPI_Init() before using any MPI calls. All programs must end with MPI_Finalize().
• MPI data type definitions are provided for basic data types (e.g. MPI_INT, MPI_DOUBLE). User data types can be defined. Packing is optional.
• Basic message passing
  – Blocking: MPI_Send(), MPI_Recv()
  – Non-blocking: MPI_Isend(), MPI_Irecv() followed by e.g. MPI_Wait(), MPI_Waitall() or MPI_Test()
• Collective communications: MPI_Bcast(), MPI_Barrier(), MPI_Reduce(), MPI_Scatter(), MPI_Gather(), MPI_Alltoall()

MPI Programming

MPI_Send(vector, 10, MPI_INT, dest, tag, MPI_COMM_WORLD)
  arguments: start address, size (no. of elements), data type, destination, tag, communicator (user-defined process group)

MPI_Recv(vector, 10, MPI_INT, src, tag, MPI_COMM_WORLD, status)
  additional arguments: source, status (info about the received message)

MPI_Bcast(vector, 10, MPI_INT, root_rank, MPI_COMM_WORLD)
  root_rank: rank of the broadcasting process

MPI_Barrier(communicator)

MPI Functions

Initialisation/Cleanup
int MPI_Init(int *argc, char ***argv)
int MPI_Finalize()

Communicator Info
int MPI_Comm_rank ( MPI_Comm comm, int *rank )
int MPI_Comm_size ( MPI_Comm comm, int *size )

Blocking Message Passing
int MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm )
int MPI_Recv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status )
int MPI_Sendrecv( void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status )

Non-Blocking Message Passing
int MPI_Isend( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request )
int MPI_Irecv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request )
int MPI_Test (MPI_Request *request, int *flag, MPI_Status *status)
int MPI_Wait (MPI_Request *request, MPI_Status *status)
int MPI_Waitall(int count, MPI_Request array_of_requests[], MPI_Status array_of_statuses[])

Collective Communication • Runtime environment is implementation specific (here: MPICH). int MPI_Barrier (MPI_Comm comm) • Machine-file containing a list of machines to be used should be created. int MPI_Bcast ( void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm ) • Processes are started with command mpirun. A machinefile can be given as a parameter (if not a global default is used). int MPI_Reduce ( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm) • A number of processes to be started may also be specified (option -np): int MPI_Allreduce ( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, mpirun -machinefile mfile -np 2 myprog MPI_Comm comm) int MPI_Gather ( void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, This starts 2 instances of myprog on machines specified in mfile int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm) int MPI_Scatter ( void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm) int MPI_Alltoall( void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, MPI_Comm comm) Bräunl 2004 77 Bräunl 2004 78

MPI "Hello World" (adapted from Manchek's PVM ex.)

#include #include #include "mpi.h" • Synchronization with Semaphores int main(int argc, char **argv) { int myrank = -1; int l; char buf[100]; MPI_Status status; • Synchronization with Monitor MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &myrank); • Pi calculation if(myrank == 0) { • Distributed simulation MPI_Recv(buf,100,MPI_CHAR,1,0,MPI_COMM_WORLD,&status); printf("message from process %d : %s \n",status.MPI_SOURCE,buf); } else if(myrank == 1) { strcpy(buf,"Hello world from: "); MPI_Get_processor_name(buf+strlen(buf),&l); MPI_Send(buf,100,MPI_CHAR,0,0,MPI_COMM_WORLD); } MPI_Finalize(); } Bräunl 2004 79 Bräunl 2004 80 Synchronization with Semaphores (pthreads) Synchronization with Semaphores #include void *prod_thread(void *pparam) #include { int i=0; #include while(1) #define BUF_SIZE 5 { i=(i+1)%10; Sample Run sem_t *full,*empty; produce(i); sem_t critical,freex,used; } int buf[BUF_SIZE]; } int pos,z; write Pos: 1 1 void *cons_thread(void *pparam) write Pos: 2 2 void produce(int i) { int i,quer; { sem_wait(&freex); quer=0; read Pos: 2 2 sem_wait(&critical); while(1) write Pos: 2 3 if(pos>=BUF_SIZE){ printf("Err\n");} { consume(&i); buf[pos++]=i; quer = (quer+i)%10; read Pos: 2 3 printf("write Pos: %d %d\n",pos-1,i); } write Pos: 2 4 sem_post(&critical); } sem_post(&used); read Pos: 2 4 } int main(void) { pthread_t p,c; write Pos: 2 5 void consume(int* i) int i; read Pos: 2 5 { sem_wait(&used); sem_init(&freex,PTHREAD_PROCESS_PRIVATE,BUF_SIZE); write Pos: 2 6 sem_wait(&critical); sem_init(&used,PTHREAD_PROCESS_PRIVATE,0); sem_init(&critical,PTHREAD_PROCESS_PRIVATE,1); read Pos: 2 6 if(pos < 0) { printf("Err\n");} ..... *i=buf[--pos]; printf("read Pos: %d %d \n",pos,*i); for(i=0;i

Synchronization with Monitor (pthreads)

#include <pthread.h>
#include <stdio.h>
#define BUF_SIZE 5

pthread_cond_t xused, xfree;
pthread_mutex_t mon;
int stack[BUF_SIZE];
int pointer=0;

void buffer_write(int a)
{ pthread_mutex_lock(&mon);
  if(pointer==BUF_SIZE)
    ...
}

void buffer_read(int* a)
{ pthread_mutex_lock(&mon);
  if(pointer==0)
    ...
}

void *prod_thread(void *pparam)
{ int i=0;
  printf("Init Producer \n");
  while(1)
  { i=(i+1)%10;
    buffer_write(i);
  }
}

void *cons_thread(void *pparam)
{ ...
}

int main(void)
{ pthread_t p,c;
  int i;
  printf("Init... \n");
  pthread_cond_init(&xfree,NULL);
  pthread_cond_init(&xused,NULL);
  pthread_mutex_init(&mon,NULL);
  for(i=0;i< ...

Sample Run:
write 1 1
read 1 1
write 1 2
read 1 2
write 1 3
write 2 4
read 2 4
write 2 5
read 2 5
write 2 6
read 2 6
write 2 7
read 2 7
write 2 8
read 2 8
.....

Pi Calculation

π = ∫₀¹ 4/(1+x²) dx = Σ (i=1..intervals) 4/(1 + ((i−0.5)·width)²) · width

[Figure: f(x) = 4/(1+x²) on [0, 1], approximated by rectangles of the given width]


Pi Calculation (pthreads)

#include <pthread.h>
#include <stdio.h>
#define MAX_THREADS 10
#define INTERVALS 1000
#define WIDTH 1.0/(double)(INTERVALS)

pthread_mutex_t result_mtx;
pthread_mutex_t interval_mtx;

double f(double x)
{ return 4.0/(1+x*x); }

void assignment_get_interval(int *iv)
{ static int pos = 0;
  pthread_mutex_lock(&interval_mtx);
  if(++pos<=INTERVALS) *iv=pos;
  else *iv=-1;
  pthread_mutex_unlock(&interval_mtx);
}

void *worker(void *pparam)
{ ...
}

int main(void)
{ pthread_t thread[MAX_THREADS];
  int i;
  pthread_mutex_init(&interval_mtx,NULL);
  pthread_mutex_init(&result_mtx,NULL);
  for(i=0;i< ...
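Since the worker body is cut off above, here is a minimal self-contained variant of the same midpoint-rule idea; it assigns every THREADS-th interval statically to each thread instead of using assignment_get_interval, and the thread count is an illustrative assumption:

#include <pthread.h>
#include <stdio.h>
#define THREADS   4
#define INTERVALS 1000
#define WIDTH     (1.0/INTERVALS)

pthread_mutex_t result_mtx = PTHREAD_MUTEX_INITIALIZER;
double pi = 0.0;

double f(double x) { return 4.0/(1+x*x); }

void *worker(void *arg)
{ int id = (int)(long)arg;
  double local = 0.0;
  /* each thread sums every THREADS-th interval with the midpoint rule */
  for (int i = id+1; i <= INTERVALS; i += THREADS)
    local += f((i-0.5)*WIDTH) * WIDTH;
  pthread_mutex_lock(&result_mtx);
  pi += local;                          /* add local sum to global result */
  pthread_mutex_unlock(&result_mtx);
  return NULL;
}

int main(void)
{ pthread_t t[THREADS];
  for (long i = 0; i < THREADS; i++) pthread_create(&t[i], NULL, worker, (void *)i);
  for (int i = 0; i < THREADS; i++)  pthread_join(t[i], NULL);
  printf("pi approx = %.10f\n", pi);
  return 0;
}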

Distributed Simulation (pthreads)

Model:
• 2-dim. field of elements (Persons)
• During each time step each person assumes the opinion of a random neighbor
• Start a number of worker threads. Each thread should get its line number from a monitor, then work locally on its area and store back results.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define MAX_THREADS 5

pthread_mutex_t crmtx;
int arr[LINES][COLS];

void monitor_get_linenumber(int* j)
/* returns current line index */
{ ...
}

void monitor_read_line(int j,int* line)
/* returns specified line */
{ int i;
  pthread_mutex_lock(&crmtx);
  for(i=0;i< ...
}

void monitor_put_line(int j,int* line)
/* write back one line */
{ int i;
  pthread_mutex_lock(&crmtx);
  for(i=0;i< ...
}

void *worker(void *pparam)
{ int k,j,pos,cnt;
  int line[COLS],above[COLS],below[COLS],newl[COLS];
  for(cnt=0;cnt<=GENERATIONS;cnt++)
  { if (pparam == 0 && cnt%1000 == 0) print_array(cnt);
    monitor_get_linenumber(&j);
    monitor_read_line(j,line);
    if (j>0) monitor_read_line(j-1, above); else monitor_read_line(LINES-1, above);
    if (j< ...
}

int main(void)
{ pthread_t thread[MAX_THREADS];
  int i,j;
  srand(time(NULL));
  pthread_mutex_init(&crmtx,NULL);
  ...
}

Distributed Simulation

Result of simulation: Local clustering appears.