HIGH-PERFORMANCE PDC (PARALLEL AND DISTRIBUTED COMPUTING)
Prasun Dewan Department of Computer Science University of North Carolina at Chapel Hill [email protected] GOAL
High Performance PDC (Parallel and Distributed Computing)
  Important for traditional problems when computers were slow
  Back to the future with the emergence of data science
Modern popular software abstractions:
  OpenMP (Parallel Computing and Distributed Shared Memory), late nineties-
  MapReduce (Distributed Computing), 2004-
  OpenMPI (~ Sockets, not covered)
Prins 633 course covers HPC PDC algorithms and OpenMP in detail
Don Smith 590 course covers MapReduce uses in detail
Connect OpenMP and MapReduce to each other (reduction) and to non-HPC PDC research (with which I am familiar)
Derive the design and implementation of both kinds of abstractions from the similar but different performance issues that motivate them
2 A TALE OF TWO DISTRIBUTION KINDS
Remotely Accessible Services (Printers, Desktops)
Replicated Repositories (Files, Databases)
Collaborative Applications (Games, Shared Desktops)
Differences between the two groups?
Distributed Sensing (Disaster Prediction)
Computation Distribution (Number, Matrix Multiplication)
3 PRIMARY/SECONDARY DISTRIBUTION REASON
                                                          Primary                                 Secondary
Remotely Accessible Services (Printers, Desktops)         Remote Service
Replicated Repositories (Files, Databases)                Fault Tolerance, Availability
Collaborative Applications (Games, Shared Desktops)       Collaboration among distributed users   Speedup
Distributed Sensing (Disaster Prediction)                 Aggregation of Distributed Data
Computation Distribution (Number, Matrix Multiplication)  High-Performance: Speedup               Remote service, aggregation,…
4 DIST. VS PARALLEL-DIST. EVOLUTION
[Figure: evolution paths. For remotely accessible services, replicated repositories, collaborative applications, and distributed sensing, a non-distributed existing program evolves directly into a distributed program. For computation distribution, a non-distributed existing single-thread program first evolves into a non-distributed multi-thread program (threads T), and then into a distributed multi-thread program.]
5 CONCURRENCY-DISTRIBUTION RELATIONSHIP
Remotely Accessible Services (Printers, Desktops)
Replicated Repositories (Files, Databases)
Collaborative Applications (Games, Shared Desktops)
  Threads complement blocking IPC primitives by improving responsiveness, removing deadlocks, and allowing certain kinds of distributed applications that would otherwise not be possible
  e.g. server waiting for messages from multiple blocking sockets, NIO selector threads sending read data to a read thread, creating a separate thread for incoming remote calls
Distributed Sensing (Disaster Prediction)
Computation Distribution (Number, Matrix Multiplication)
  Thread decomposition for speedup can be replaced by process decomposition, and can be present in the decomposed processes
  e.g. single-process: each row of matrix A multiplied with a column of matrix B by a separate thread
  e.g. multi-process: each row of matrix A assigned to a process, which uses different threads to multiply it with different columns of matrix B
6 ALGORITHMIC CHALLENGE
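The single-process thread decomposition above can be sketched with plain Java threads; this row-per-thread matrix multiply is my own illustration, not code from the course.

```java
// Sketch: single-process speedup decomposition, one thread per row of A.
public class RowThreadedMultiply {
    public static double[][] multiply(double[][] a, double[][] b) throws InterruptedException {
        int n = a.length, m = b[0].length, k = b.length;
        double[][] c = new double[n][m];
        Thread[] threads = new Thread[n];
        for (int r = 0; r < n; r++) {
            final int row = r;
            threads[r] = new Thread(() -> {        // each thread computes one row of C
                for (int j = 0; j < m; j++)
                    for (int x = 0; x < k; x++)
                        c[row][j] += a[row][x] * b[x][j];
            });
            threads[r].start();
        }
        for (Thread t : threads) t.join();          // wait for all row threads to finish
        return c;
    }
    public static void main(String[] args) throws InterruptedException {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        double[][] c = multiply(a, b);
        System.out.println(c[0][0] + " " + c[0][1] + " " + c[1][0] + " " + c[1][1]);
    }
}
```

The rows are independent, so no synchronization is needed beyond the final join.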
Remotely Accessible Services (Printers, Desktops)
Replicated Repositories (Files, Databases)
  Consistency: How to define and implement correct coupling among distributed processes?
Collaborative Applications (Games, Shared Desktops)
Distributed Sensing (Disaster Prediction)
Computation Distribution (Number, Matrix Multiplication)
  How to parallelize/distribute single-thread algorithms?
7 CENTRAL MEDIATOR
Remotely Accessible Services (Printers, Desktops)
Replicated Repositories (Files, Databases)
  Clients often talk to each other through a central server whose code is unaware of specific arbitrary client/slave locations and ports
Collaborative Applications (Games, Shared Desktops)
Distributed Sensing (Disaster Prediction)
Computation Distribution (Number, Matrix Multiplication)
  A central master process/thread often decomposes the problem and combines results computed by slave agents, but the decomposer knows about the nature of the slaves, which are the service providers
8 IMPLEMENTATION CHALLENGE
Remotely Accessible Services (Printers, Desktops)
Replicated Repositories (Files, Databases)
Collaborative Applications (Games, Shared Desktops)
  How to add to single-process local observable/observer, producer-consumer and synchronization relationships corresponding distributed observable/observer, producer-consumer and synchronization relationships?
  Distributed separation of concerns!
Distributed Sensing (Disaster Prediction)
Computation Distribution (Number, Matrix Multiplication)
  How to reuse existing single-thread code in a multi-thread/multi-process program?
9 SECURITY ISSUES
Remotely Accessible Services (Printers, Desktops)
Replicated Repositories (Files, Databases)
  Communicating processes created independently on typically geographically dispersed autonomous hosts, raising security issues
Collaborative Applications (Games, Shared Desktops)
Distributed Sensing (Disaster Prediction)
Computation Distribution (Number, Matrix Multiplication)
  Communicating threads/processes created on hosts/processors typically under control of one authority
  Though crowd problem solving is an infrequent exception: UW Condor – part of Cverse/XSEDE, Wagstaff primes
10 FAULT TOLERANCE VS SECURITY
Byzantine and general security problems ∝ autonomy; usually less autonomy in HPC. Fault Tolerance:
Byzantine Fault Tolerance: Blockchain; handles Message Loss, Manipulation (Adversary)
Non-Byzantine Fault Tolerance: Two-Phase Commit, Paxos Algorithm; handles Computer Failure
11 LIFETIME OF PROCESSES/THREADS
Remotely Accessible Services (Printers, Desktops)
Replicated Repositories (Files, Databases)
  Long-lived; processes need to be explicitly terminated
Distributed Sensing (Disaster Prediction)
Computation Distribution (Number, Matrix Multiplication)
  Short-lived; terminate when computation complete
12 FAULT TOLERANCE
Remotely Accessible Services (Printers, Desktops)
Replicated Repositories (Files, Databases)
  Faults can be long-lived and propagate, as the goal is to couple processes
Collaborative Applications (Games, Shared Desktops)
Distributed Sensing (Disaster Prediction)
Computation Distribution (Number, Matrix Multiplication)
  More independent work, as the goal is to divide common work. Thus faults usually do not propagate. A short-term task can simply be completely or partially restarted.
13 COUPLING VS HIGH-PERFORMANCE
Coupling/Consistency
  Independent, “long-lived” processes at different autonomous locations made dependent for resource sharing, user collaboration, fault tolerance
  Include consistency algorithms, possibly in separate threads, to define dependency, that have producer-consumer, observer relationships with existing code
  May involve a central mediating server and other infrastructure code unaware of specific clients
  Division of labor between client and infrastructure an issue (centralized vs replicated)
High Performance
  A single short-lived process created to perform some computation made multi-threaded and/or distributed to increase performance
  Speedup algorithms, with distributed task decomposition, that replace logic of existing code
  May use additional mediating master code aware of and creator of specific slave processes
  Division of labor among master and slaves an issue
14 EXAMPLE: SERVICE + SPEEDUP
[Figure: A Comp 401 Grader server performs assignment-level decomposition: a master assigns each client's (Client1, Client2) submission to a slave grader thread (TS, TG). A Comp 533 Grader performs test-level decomposition: a master assigns individual tests to slave tester threads (T1..T6).]
15 ABSTRACTIONS
Remotely Accessible Services (Printers, Desktops)
Distributed Repositories (Files, Databases)
  Threads, IPC, Bounded Buffer
Collaborative Applications (Games, Shared Desktops)
Distributed Sensing (Disaster Prediction)
Computation Distribution (Number, Matrix Multiplication)
  Higher-level abstractions for different classes of speedup algorithms?
16 COMPLETE AUTOMATION?
Remotely Accessible Services (Printers, Desktops)
Distributed Repositories (Files, Databases)
Collaborative Applications (Games, Shared Desktops)
  Modules of Non-Distributed Program Transparently Distributed: loader contacts registry to determine if local module loaded or remote module accessed
  Assumes one name space, one instance of each service – cannot handle replication
  Motivates RPC transparency
Distributed Sensing (Disaster Prediction)
Computation Distribution (Number, Matrix Multiplication)
  Parallelizing compilers (Kuck and Kennedy)
  Halting problem
  Motivates Declarative Abstractions
17 DECLARATIVE ABSTRACTIONS
Remotely Accessible Services (Printers, Desktops)
Distributed Repositories (Files, Databases)
  Threads, IPC, Bounded Buffer
Collaborative Applications (Games, Shared Desktops)
Distributed Sensing (Disaster Prediction)
Computation Distribution (Number, Matrix Multiplication)
  Adding declarative abstractions for different classes of speedup algorithms
18 DECLARATIVE VS IMPERATIVE: COMPLEMENTING
Declarative: Specify what we want
  float[] floats = {4.8f, 5.2f, 4.5f};   Type declaration
  [Figure: the array pictured as cells 4.8, 5.2, 4.5]
Imperative (Procedural, Functional, O-O, …): Implement what we want
public static float sum(Float[] aList) {
    float retVal = 0.0f;
    for (int i = 0; i < aList.length; i++) {  // loop
        retVal += aList[i];
    }
    return retVal;
}
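For contrast, the same sum can be written more declaratively with Java streams (a sketch of mine, not from the slides): we state the reduction we want and leave the iteration to the library.

```java
import java.util.Arrays;

public class DeclarativeSum {
    public static float sum(Float[] aList) {
        // reduce() declares the combining operation; iteration order and
        // mechanics are left to the library (and could be parallelized)
        return Arrays.stream(aList).reduce(0.0f, Float::sum);
    }
    public static void main(String[] args) {
        System.out.println(sum(new Float[]{4.8f, 5.2f, 4.5f}));
    }
}
```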
Locality relevant to PDC abstractions - later 19 DECLARATIVE VS IMPERATIVE: COMPETING
Declarative: Specify what we want
(0|1)*1 Regular expression
Imperative: Implement what we want
[Figure: two-state FSA accepting (0|1)*1; edges labeled 0 and 1]
FSA (Finite State Automaton)
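A small runnable check (my own sketch, not course code) that the declarative regular expression and an imperative two-state FSA accept the same language, binary strings ending in 1:

```java
import java.util.regex.Pattern;

public class RegexVsFsa {
    // Imperative: a two-state automaton; state 1 means "last symbol was 1"
    static boolean fsaAccepts(String s) {
        int state = 0;
        for (char c : s.toCharArray()) {
            if (c != '0' && c != '1') return false;
            state = (c == '1') ? 1 : 0;
        }
        return state == 1;
    }
    public static void main(String[] args) {
        Pattern p = Pattern.compile("(0|1)*1");  // declarative specification
        for (String s : new String[]{"1", "0101", "0110", "10", ""}) {
            System.out.println(s + " " + p.matcher(s).matches() + " " + fsaAccepts(s));
        }
    }
}
```

Both columns of the output agree on every input, illustrating that the two notations compete to express the same computation.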
Consistency algorithms are state machines. Ease of programming?
20 DECLARATIVE VS IMPERATIVE PDC: COMPETING AND COMPLEMENTING
Declarative: Specify what we want
??? For restricted classes of programs
Imperative: Implement what we want
Thread aThread = new Thread(aRunnable); aThread.start(); Concurrency?
aRegistry.rebind(Server.NAME, aServer) Distribution? Server aServer = (Server) aRegistry.lookup(Server.NAME)
21 DECLARATIVE VS IMPERATIVE CONCURRENCY: COMPETING AND COMPLEMENTING
Declarative: Specify what we want
Arguably easier to program. ??? For restricted classes of programs
Imperative: Implement what we want
Thread aThread = new Thread(aRunnable); aThread.start(); Concurrency?
22 PARALLEL RANDOM
0.6455074599676613 0.14361454218773773 Desired I/O
Declarative Concurrency Specification, called a Pragma or Directive:
//omp parallel threadNum(2)
Declarative Concurrency-Aware Code?
System.out.println(Math.random()); Imperative Concurrency-Unaware Code
23 OPENMP
Language-independent, much like RPC and Sockets
Standard implementations exist for:
  C/C++: #pragma omp parallel num_threads(2)
  FORTRAN: !$OMP PARALLEL NUM_THREADS(2)
No standard implementation for Java
  omp4j implements part of the standard (Belohlavek undergrad thesis): //omp parallel threadNum(2)
  Used in programming examples
Like Java annotations, which are associated with variable, class and method declarations; pragmas are associated with statements
Unified by a set of statement attribute values:
  Parallel (boolean)
  Number of threads (int)
Syntax for describing them is similar
Only certain sets of attributes can be associated with an imperative statement, depending on the kind of statement
24 OPENMP STATEMENT ATTRIBUTES (ABSTRACT)
Attribute         Parameters  Semantics
Parallel                      Create multiple threads to execute attributed statement
NumberOfThreads   int         Number of threads to be created automatically
25 WHAT PROCESSES THE PRAGMAS?
0.6455074599676613 0.14361454218773773
//omp parallel threadNum(2)
System.out.println(Math.random());
Processed by what kind of program?
26 PRE-COMPILER APPROACH
[Figure: a precompiler translates a source program in (concurrency-unaware/aware native language + concurrency-aware new language) into a new source program in the concurrency-aware native language; the native-language compiler then produces a compiled program in the concurrency-aware imperative language, optionally run by an interpreter.]
omp4j –s Random.java javac Random.java
java Random 27 IMPERATIVE PARALLEL ABSTRACTION?
Statement cannot be compiled individually
  A parallel method could encapsulate the statement in a synthesized method and class:
  Thread.parallel(2, "System.out.println(Math.random());");
  Interface of the synthesized class?
Compile errors at runtime
Compilation cost at runtime
Non-global (stack) variables referenced in the statement but declared outside its scope are not visible directly to the synthesized class:
  {
    int nonStatementNonGlobal = 0;
    //omp parallel threadNum(2)
    {nonStatementNonGlobal++;}
  }
Need a (pre)compiler approach!
28 JAVA PRECOMPILATION OF PARALLEL
Attributed statement block encapsulated in a Runnable method and Runnable class
  {
    int nonStatementNonGlobal = 0;
    //omp parallel threadNum(2)
    {nonStatementNonGlobal++;}
  }
Instead of the statement, the run method of a new class instance is executed threadNum times
Creates a context class that has an instance variable for each non-statement and non-global (stack) variable referenced in the statement
Before the run method is executed, non-statement, non-global variables in the parent thread stack are assigned to the corresponding instance variables in the context object
Parent thread waits for all threads to finish, then executes the next statement
After the wait and before the next statement, instance variables are copied back to the corresponding statement-non-local variables in the parent thread stack
29 JAVA PRECOMPILATION OF PARALLEL
Assuming non-stack variables are shared by all threads, so a single context class for all threads:
{
    int nonStatementNonGlobal = 0;
    class OMPContext {
        public int local_nonStatementNonGlobal;
    }
    final OMPContext ompContext = new OMPContext();
    ompContext.local_nonStatementNonGlobal = nonStatementNonGlobal;
    final org.omp4j.runtime.IOMPExecutor ompExecutor = new org.omp4j.runtime.DynamicExecutor(2);
    for (int ompI = 0; ompI < 2; ompI++) {
        ompExecutor.execute(new Runnable() {
            @Override
            public void run() {
                {ompContext.local_nonStatementNonGlobal++;}
            }
        });
    }
    ompExecutor.waitForExecution();
    nonStatementNonGlobal = ompContext.local_nonStatementNonGlobal;
}
30 RELATIVE EXPRESSIBILITY OF TWO ABSTRACTIONS
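The generated code above depends on the omp4j runtime; the same fork-join shape can be sketched self-containedly with java.util.concurrent (my own code, not omp4j output; an AtomicInteger stands in for the context object so the shared increment is not a lost-update race):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PrecompiledParallel {
    public static void main(String[] args) throws InterruptedException {
        int nonStatementNonGlobal = 0;
        // context object standing in for OMPContext
        final AtomicInteger context = new AtomicInteger(nonStatementNonGlobal);
        ExecutorService executor = Executors.newFixedThreadPool(2);
        for (int ompI = 0; ompI < 2; ompI++) {
            executor.execute(() -> context.incrementAndGet());  // the attributed statement
        }
        executor.shutdown();                     // the join: wait for both threads
        executor.awaitTermination(10, TimeUnit.SECONDS);
        nonStatementNonGlobal = context.get();   // copy back into the stack variable
        System.out.println(nonStatementNonGlobal);
    }
}
```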
Can A1 implement A2 ?
Theory of Computation: Regular Expressions vs Finite State Automata; Push Down Automata vs Finite State Automata
OS Thread Coordination: Monitors vs Semaphores
IPC: RMI + Threads vs NIO + Threads
31 OTHER CONSIDERATIONS
Performance
Ease of Programming (Level)
Usually involves experimental data and arguments rather than proofs
32 EASE OF PROGRAMMING
Declarative version:
{
    int nonStatementNonGlobal = 0;
    //omp parallel threadNum(2)
    {nonStatementNonGlobal++;}
}
Precompiled imperative version:
{
    int nonStatementNonGlobal = 0;
    class OMPContext {
        public int local_nonStatementNonGlobal;
    }
    final OMPContext ompContext = new OMPContext();
    ompContext.local_nonStatementNonGlobal = nonStatementNonGlobal;
    final org.omp4j.runtime.IOMPExecutor ompExecutor = new org.omp4j.runtime.DynamicExecutor(2);
    for (int ompI = 0; ompI < 2; ompI++) {
        ompExecutor.execute(new Runnable() {
            @Override
            public void run() {
                {ompContext.local_nonStatementNonGlobal++;}
            }
        });
    }
    ompExecutor.waitForExecution();
    nonStatementNonGlobal = ompContext.local_nonStatementNonGlobal;
}
33 MIGRATING FROM C TO JAVA: CASE STUDY
Imperative:
  Pthread_create confused with Unix fork
  Java API and uses looked up
  Thread start() confused with run()
  Long debug time!
Declarative:
  #pragma omp parallel num_threads(2)
  //omp parallel threadNum(2)
34 TRACING PARALLEL RANDOM
[Interleaved I/O: trace prefixes and random numbers from Thread[pool-1-thread-1,5,main] and Thread[pool-1-thread-2,5,main] mix mid-line]
//omp parallel threadNum(2)
{ trace(Math.random()); } public static void trace(Object... anArgs) { System.out.print(Thread.currentThread()); for (Object anArg : anArgs) { System.out.print(" " + anArg); } System.out.println(); } 35 JAVA SOLUTION
Desired I/O:
Thread[pool-1-thread-2,5,main] 0.6501712957370558
Thread[pool-1-thread-1,5,main] 0.6907459159093547
//omp parallel threadNum(2)
{
    synchronized (this) {
        trace(Math.random());
    }
}
But trace may be called in a static method such as main, where this does not exist; OpenMP was designed for non-OO languages such as C and FORTRAN
public static void trace(Object... anArgs) {
    System.out.print(Thread.currentThread());
    for (Object anArg : anArgs) {
        System.out.print(" " + anArg);
    }
    System.out.println();
}
36 OMP DECLARATIVE SOLUTION
Thread[pool-1-thread-2,5,main] 0.6501712957370558 Thread[pool-1-thread-1,5,main] 0.6907459159093547
//omp parallel threadNum(2)
{
    //omp critical
    trace(Math.random());
}
public static void trace(Object... anArgs) {
    System.out.print(Thread.currentThread());
    for (Object anArg : anArgs) {
        System.out.print(" " + anArg);
    }
    System.out.println();
}
37 OPENMP STATEMENT ATTRIBUTES (ABSTRACT)
Attribute         Parameters  Semantics
Parallel                      Create multiple threads to execute attributed statement
NumberOfThreads   int         Number of threads to be created automatically
Critical                      Make auto-threads execute attributed statement atomically
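One way a precompiler could implement the Critical attribute, sketched with a plain shared Java lock object (my own code, not omp4j output):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CriticalSketch {
    static final Object OMP_CRITICAL_LOCK = new Object();  // hypothetical generated lock
    static void trace(Object... anArgs) {
        synchronized (OMP_CRITICAL_LOCK) {   // the whole trace line prints atomically
            System.out.print(Thread.currentThread());
            for (Object anArg : anArgs) System.out.print(" " + anArg);
            System.out.println();
        }
    }
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (int i = 0; i < 2; i++) pool.execute(() -> trace(Math.random()));
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

A static lock avoids the static-context problem of synchronized (this) shown earlier.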
38 COMPLETE PROGRAM PARALLELISM
Thread[pool-1-thread-2,5,main] 0.6501712957370558 Thread[pool-1-thread-1,5,main] 0.6907459159093547 Serial + Parallel Program?
//omp parallel threadNum(2)
{
    //omp critical
    trace(Math.random());
}
public static void trace(Object... anArgs) {
    System.out.print(Thread.currentThread());
    for (Object anArg : anArgs) {
        System.out.print(" " + anArg);
    }
    System.out.println();
}
39 PARTIAL PARALLELISM
[Interleaved I/O: Thread[main,5,main] Forking; random-number traces from Thread[pool-1-thread-1,5,main] and Thread[pool-1-thread-2,5,main]; Thread[main,5,main] Joined]
Serial + Parallel Program
Which thread(s) will print “Forking” and “Joined”?
trace("Forking");
//omp parallel threadNum(2)
{
    //omp critical
    trace(Math.random());
}
trace("Joined");
40 ABSTRACT FORK-JOIN
T0 Statement
T0 Fork(n): create thread1..threadn
T1..Tn: make each thread execute the forked statement
T0 Join: make the creating thread wait for termination of thread1..threadn
T0 Statement
More efficient thread creation?
41 EQUIVALENT, MORE EFFICIENT ABSTRACT FORK- JOIN
T0 Statement
T0 fork(n): create thread1..threadn-1; make all threads (including the creating thread) execute the forked statement
T0 Join: make the creating thread wait for termination of thread1..threadn-1
T0 Statement
Implementation requires the forked-statement code to be replicated in both the (in Java, Runnable) code executed by the new threads and also in the code executed by the existing thread
Similar to Unix Fork-Join? 42 UNIX SINGLE PROCESS FORK-JOIN
N-Slaves?
P0 Statement
P0 childPid = fork()   System call; returns pid of new process in parent and 0 in child
P0, P1 Statement       Code executed by both processes
P0 If (0 != childPid) join(childPid)   Wait for termination of specified child process
43 UNIX MULTIPLE PROCESS FORK-JOIN
P0: for (i = 1; i < n; i++) if (0 != childPid) childPid = fork()   Procedural (error-prone) code
Concept of fork-join at least as old as Unix (60s)
P0..Pn-1 Statement   Comparison with OpenMP Parallel Attribute?
P0: if (0 != childPid) join()   Wait for termination of all child processes
44 SAME CODE/INSTRUCTION, SAME DATA CONCURRENCY
T1, T2 print random#: Same data
45 SAME CODE/INSTRUCTION, SAME DATA CONCURRENCY
T1, T2 print “hello world”: Same or No data
46 RANGE OF SAME CODE/INSTRUCTION, SAME DATA CONCURRENCY?
T1, T2 execute Code C: Same or No data
System.out.println(isPrime(toInt(Math.random())));
System.out.println(“Hello World”); Range of realistic applications? Printing the same value or computing the value is not very useful
Cannot think of other examples Other patterns?
Same data forces some serialization 47 PARALLELIZATION CLASSES
       Code/Instruction  Data (Parts)
SISD   Same              Same
SIMD   Same              Different
MISD   Different         Same
MIMD   Different         Different
SISD straightforwardly supported using declarative primitives.
We cannot support all possible concurrencies with pure declarative primitives.
48 OPENMP STATEMENT ATTRIBUTES (ABSTRACT)
Attribute         Parameters  Semantics
Parallel                      Create multiple threads to execute attributed statement, which are killed after execution of statement
NumberOfThreads   int         Number of threads to be created automatically
Critical                      Make auto-threads execute attributed statement atomically
Imperative primitives that can be added to these attributes to increase flexibility? 49 DIFFERENT CODE, SAME DATA CONCURRENCY
Number-List T1 Sum
Number-List T2 ToString
How to integrate alternation with fork-join, which requires a single piece of code? Motivated by Unix?
50 UNIX INSPIRATION
P0: for (i = 1; i < n; i++) if (getPid() != childPid) childPid = fork()   Task depends on Id
P0..Pn-1 Statement
P0: if (getPid() != childPid) join()
51 ID-AWARE IMPERATIVE CODE
T1 Number-List Sum
Sum or ToString
T2 Number-List ToString
Allow each member of a thread sequence (declaratively created) to execute an imperative step to determine its index
Different code/data can be executed/accessed by different threads based on their indices
52 THREADNUM EXAMPLE: DIFFERENT CODE SAME DATA
public static void parallelSumAndToText(float[] aList) {
    // omp parallel threadNum(2)
    {
        if (OMP4J_THREAD_NUM == 0) {
            trace("Sum of rounded:" + sum(aList));
        } else {
            trace("ToText of rounded:" + toText(aList));
        }
    }
}
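The OMP4J_THREAD_NUM alternation pattern can be mimicked with plain Java threads, each picking its task from its own index (my own sketch; sum and toText are simplified stand-ins for the slides' helpers):

```java
public class ThreadNumAlternation {
    static float sum(float[] a) { float s = 0; for (float x : a) s += x; return s; }
    static String toText(float[] a) {
        StringBuilder b = new StringBuilder();
        for (float x : a) b.append(" ").append(x);
        return b.toString();
    }
    public static void main(String[] args) throws InterruptedException {
        float[] list = {5.0f, 5.0f, 5.0f};
        Thread[] team = new Thread[2];
        for (int i = 0; i < team.length; i++) {
            final int threadNum = i;             // plays the role of OMP4J_THREAD_NUM
            team[i] = new Thread(() -> {
                if (threadNum == 0) System.out.println("Sum:" + sum(list));
                else System.out.println("ToText:" + toText(list));
            });
            team[i].start();
        }
        for (Thread t : team) t.join();
    }
}
```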
OMP4J_THREAD_NUM is a predefined runtime variable with a different value for each forked thread
53 OPENMP OPERATIONS AND STATEMENT ATTRIBUTES (ABSTRACT)
Attributes        Parameters  Semantics
Parallel                      Create multiple threads to execute attributed statement, which are killed after execution of statement
NumberOfThreads   int         Number of threads to be created automatically
Critical                      Make auto-threads execute attributed statement atomically
Operations        Signature   Semantics
GetThreadNum      int         Returns index of thread created by Parallel pragma
GetNumThreads     int         Returns number of threads created by Parallel pragma
54 DIFFERENT CODE, SAME DATA CONCURRENCY
T1 Code C1
Same or No data
T2 Code C2
55 SAME CODE, DIFFERENT DATA (PARTS) CONCURRENCY
Data T1 Float-List Round (Parts) 1
Data T2 Float-List Round (Parts) 2
Need to divide data among threads.
56 SAME CODE, DIFFERENT DATA (PARTS)
Thread-Aware Float-List Round (T1, T2): chooses which indices to process based on thread number, number of threads, and loop params
What do we need besides thread ID to do the division?
57 SINGLE-THREADED ROUND
public static void round(float[] aList) { trace("Round Started:" + Arrays.toString(aList)); for (int i = 0; i < aList.length; i++) { aList[i] = (float) Math.round(aList[i]); trace(aList[i]); } trace("Round Ended:" + Arrays.toString(aList)); }
float vs Float: garbage collection. Parallelized version?
Put a compound statement for the body and precede it with a parallel pragma
Body of loop is executed based on thread number, #threads, and loop parameters
processIteration() is passed these values and returns a boolean based on whether a thread should process an iteration
58 MULTI-THREADED ROUND
public static void parallelRound(float[] aList) {
    // omp parallel threadNum(2)
    {
        trace("Round Started:" + Arrays.toString(aList));
        for (int i = 0; i < aList.length; i++) {
            if (processIteration(i, 0, aList.length, 1, OMP4J_THREAD_NUM, OMP4J_NUM_THREADS)) {
                aList[i] = (float) Math.round(aList[i]);
                trace(aList[i]);
            }
        }
        trace("Round Ended:" + Arrays.toString(aList));
    }
}
59 THREAD ASSIGNER
Interface boolean processIteration(int anIndex, int aStart, int aLimit, int aStepSize, int aThreadNum, int aNumThreads);
Different useful implementations of the method (possibly chosen by a factory) are possible and discussed later
boolean processIteration(int anIndex, int aStart, int aLimit, int aStepSize, int aThreadNum, int aNumThreads) {
    return aNumThreads == 0 || (anIndex % aNumThreads == aThreadNum);
}
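The cyclic assigner above can be exercised directly (a runnable sketch of mine): with 2 threads, even indices go to thread 0 and odd indices to thread 1.

```java
public class CyclicAssigner {
    static boolean processIteration(int anIndex, int aStart, int aLimit, int aStepSize,
                                    int aThreadNum, int aNumThreads) {
        return aNumThreads == 0 || (anIndex % aNumThreads == aThreadNum);
    }
    public static void main(String[] args) {
        for (int thread = 0; thread < 2; thread++) {
            StringBuilder indices = new StringBuilder();
            for (int i = 0; i < 8; i++)                    // 8 iterations, 2 threads
                if (processIteration(i, 0, 8, 1, thread, 2))
                    indices.append(i).append(" ");
            System.out.println("thread " + thread + ": " + indices.toString().trim());
        }
    }
}
```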
60 ALTERNATE FOR
public static void parallelRound(float[] aList) {
    int i = 0;
    // omp parallel threadNum(2)
    for (;;) {
        if (processIteration(i, 0, aList.length, 1, OMP4J_THREAD_NUM, OMP4J_NUM_THREADS)) {
            aList[i] = (float) Math.round(aList[i]);
            trace(aList[i]);
        }
        i++;
        if (i == aList.length) break;
    }
}
processIteration() parameters may not be explicit parameters of the for construct
A declarative alternative will not have this property
61 TWO INDEPENDENT PROBLEMS
T1 Float-List Round
T2 Float-List Round
Number-List T1 Sum
Number-List T2 ToString 62 PIPELINED FUNCTIONS: SINGLE-THREAD
public static void roundSumAndToText(float[] aList) {
    round(aList);
    trace("Sum of rounded:" + sum(aList));
    trace("ToText of rounded:" + toText(aList));
}
Thread[main,5,main] Round Started:[4.8, 5.2, 4.5, 4.75, 4.7] Thread[main,5,main] Round Ended:[5.0, 5.0, 5.0, 5.0, 5.0] Thread[main,5,main] Sum of rounded:25.0 Thread[main,5,main] ToText of rounded: 5.0 5.0 5.0 5.0 5.0
63 PIPELINED: SEPARATE THREAD TEAMS
T1 Float-List Round
T2 Float-List Round
Number-List T3 Sum
Number-List T4 ToString 64 SEPARATE THREAD TEAM: STEP 1
public static void parallelRound(float[] aList) { // omp parallel threadNum(2) { trace("Round Started:" + Arrays.toString(aList)); for (int i = 0; i < aList.length; i++) { if (processIteration(i, 0, aList.length, 1, OMP4J_THREAD_NUM, OMP4J_NUM_THREADS)) { aList[i] = (float) Math.round(aList[i]); trace(aList[i]); } } trace("Round Ended:" + Arrays.toString(aList)); } }
65 SEPARATE THREAD TEAM: STEP 2
public static void parallelSumAndToText(float[] aList) { // omp parallel threadNum(2) if (OMP4J_THREAD_NUM == 0) { trace("Sum of rounded:" + sum(aList)); } else { trace("ToText of rounded:" + toText(aList)); } }
66 PIPELINED: SEPARATE THREAD TEAMS
T1 Float-List Round
T2 Float-List Round
Number-List T3 Sum More resource- efficient solution?
Number-List T4 ToString 67 PIPELINED: SAME THREAD TEAM
T1 Float-List Round
T2 Float-List Round
Number-List Sum
Number-List ToString 68 REMOVE PARALLEL CONSTRUCT FROM ROUND
public static void roundWithProcessIteration(float[] aList) {
    trace("Round Started:" + Arrays.toString(aList));
    for (int i = 0; i < aList.length; i++) {
        if (processIteration(i, 0, aList.length, 1, OMP4J_THREAD_NUM, OMP4J_NUM_THREADS)) {
            aList[i] = (float) Math.round(aList[i]);
            trace(aList[i]);
        }
    }
    trace("Round Ended:" + Arrays.toString(aList));
}
69 SAME TEAM: STEP 1 AND 2 public static void parallelRoundAndSumToText(float[] aList) { // omp parallel threadNum(2) { roundWithProcessIteration(aList); if (OMP4J_THREAD_NUM == 0) { trace("Sum of rounded:" + sum(aList)); } else { trace("ToText of rounded:" + toText(aList)); } } } New declarative construct to fix problem?
Thread[pool-1-thread-1,5,main]Thread[pool-1-thread-2,5,main] Round Started:[4.8, 5.2, 4.5, 4.75, 4.7] Round Started:[4.8, 5.2, 4.5, 4.75, 4.7] Thread[pool-1-thread-1,5,main] Round Ended:[5.0, 5.2, 5.0, 4.75, 5.0] Thread[pool-1-thread-1,5,main] Sum of rounded:24.95 Thread[pool-1-thread-2,5,main] Round Ended:[5.0, 5.0, 5.0, 5.0, 5.0] 70 Thread[pool-1-thread-2,5,main] ToText of rounded: 5.0 5.0 5.0 5.0 5.0 BARRIER public static void parallelRoundAndSumToText(float[] aList) { // omp parallel threadNum(2) { roundWithProcessIteration(aList); // omp barrier if (OMP4J_THREAD_NUM == 0) { trace("Sum of rounded:" + sum(aList)); } else { trace("ToText of rounded:" + toText(aList)); } } }
Thread[pool-1-thread-1,5,main]Thread[pool-1-thread-2,5,main] Round Started:[4.8, 5.2, 4.5, 4.75, 4.7] Round Started:[4.8, 5.2, 4.5, 4.75, 4.7] Thread[pool-1-thread-1,5,main] Round Ended:[5.0, 5.2, 5.0, 4.75, 5.0] Thread[pool-1-thread-2,5,main] Round Ended:[5.0, 5.0, 5.0, 5.0, 5.0] Thread[pool-1-thread-2,5,main] ToText of rounded: 5.0 5.0 5.0 5.0 5.0 Thread[pool-1-thread-1,5,main] Sum of rounded:25.0 71 OPENMP STATEMENT ATTRIBUTES
Attributes        Parameters  Semantics
Parallel                      Create multiple threads to execute attributed statement, which are killed after execution of statement
NumberOfThreads   int         Number of threads to be created automatically
Critical                      Make auto-threads execute attributed statement atomically
Barrier                       Wait for each sibling created by active Parallel pragma to finish preceding statement before proceeding to next statement
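The Barrier attribute corresponds closely to java.util.concurrent.CyclicBarrier; this sketch (mine, not omp4j output) shows both threads finishing phase 1 before either starts phase 2:

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class BarrierSketch {
    public static void main(String[] args) throws InterruptedException {
        final CyclicBarrier barrier = new CyclicBarrier(2);  // the team size
        Runnable work = () -> {
            try {
                System.out.println("phase1 " + Thread.currentThread().getName());
                barrier.await();   // the // omp barrier point
                System.out.println("phase2 " + Thread.currentThread().getName());
            } catch (InterruptedException | BrokenBarrierException e) {
                throw new RuntimeException(e);
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}
```

Because both threads must reach await() before either is released, every phase1 line is printed before any phase2 line.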
More declarative constructs for patterns we have seen? 72 DECLARATIVE ALTERNATION? public static void parallelRoundAndSumToText(float[] aList) { // omp parallel threadNum(2) { roundWithProcessIteration(aList); // omp barrier if (OMP4J_THREAD_NUM == 0) { trace("Sum of rounded:" + sum(aList)); } else { trace("ToText of rounded:" + toText(aList)); } } }
Can we get rid of the if - use of imperative constructs in this method?
73 DECLARATIVE ALTERNATION
public static void sectionRoundAndSumToText(float[] aList) {
    // omp parallel threadNum(2)
    {
        roundWithProcessIteration(aList);
        // omp barrier
        // omp sections
        {
            // omp section
            trace("Sum of rounded:" + sum(aList));
            // omp section
            trace("ToText of rounded:" + toText(aList));
        }
    }
}
Sections can be executed in parallel rather than sequentially.
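The sections/section pattern can be mimicked with an executor: each section becomes a task executed exactly once by some thread of the team (my own sketch using invokeAll, not omp4j output):

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SectionsSketch {
    public static void main(String[] args) throws InterruptedException {
        float[] list = {5.0f, 5.0f, 5.0f};
        ExecutorService team = Executors.newFixedThreadPool(2);
        // each Callable is one "// omp section"
        List<Callable<Void>> sections = Arrays.asList(
            () -> { System.out.println("Sum: " + (list[0] + list[1] + list[2])); return null; },
            () -> { System.out.println("ToText: " + Arrays.toString(list)); return null; }
        );
        team.invokeAll(sections);   // each section runs once, possibly in parallel
        team.shutdown();
    }
}
```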
74 SEQUENTIAL VS CONCURRENT EXECUTION
Begin
T1 Statement1
T1 Statement2
T1 StatementN
End
75 SEQUENTIAL VS CONCURRENT EXECUTION
Dijkstra (1968) CoBegin
T1 Statement1   A statement/section is executed by a single thread rather than all threads
T2 Statement2
Tn StatementN
Co-Begin can replace Begin if no state changes (functional programming and eager evaluation)
CoEnd
76 OPENMP STATEMENT ATTRIBUTES
Attributes        Parameters  Semantics
Parallel                      Create multiple threads to execute attributed statement, which are killed after execution of statement
NumberOfThreads   int         Number of threads to be created automatically
Critical                      Make auto-threads execute attributed statement atomically
Barrier                       Wait for each sibling created by Parallel pragma to finish preceding statement before proceeding to next statement
Sections                      Co-Begin around attributed statement
Section                       Allow attributed sub-statement to be executed by any thread once
77 IMPERATIVE THREAD ASSIGNER
Interface boolean processIteration(int anIndex, int aStart, int aLimit, int aStepSize, int aThreadNum, int aNumThreads);
Different useful implementations of method (possibly chosen by a factory) possible and discussed later
boolean processIteration(int anIndex, int aStart, int aLimit, int aStepSize, int aThreadNum, int aNumThreads) {
    return aNumThreads == 0 || (anIndex % aNumThreads == aThreadNum);
}
Suppose the programmer does not care about how iterations are assigned to threads 78 DECLARATIVE PARALLEL ROUND?
boolean processIteration(int anIndex, int aStart, int aLimit, int aStepSize, int aThreadNum, int aNumThreads);
public static void parallelRound(float[] aList) {
    // omp parallel threadNum(2)
    {
        for (int i = 0; i < aList.length; i++) {
            if (processIteration(i, 0, aList.length, 1, OMP4J_THREAD_NUM, OMP4J_NUM_THREADS)) {
                aList[i] = (float) Math.round(aList[i]);
                trace(aList[i]);
            }
        }
    }
}
Can OMP automatically implement processIteration and call it based on a loop pragma?
processIteration() is problem-independent but dependent on the loop
79 DECLARATIVE PARALLEL ROUND
boolean processIteration(int anIndex, int aStart, int aLimit, int aStepSize, int aThreadNum, int aNumThreads);
public static void parallelRound(float[] aList) { // omp parallel for threadNum(2) for (int i = 0; i < aList.length; i++) { aList[i] = (float) Math.round(aList[i]); trace(aList[i]); } }
Conceptually, OpenMP calls processIteration(), actual implementation does not require each thread to determine if it should execute each iteration
Allow pragma to be associated with every loop?
80 DECLARATIVE PARALLEL ROUND boolean processIteration(int anIndex, int aStart, int aLimit, int aStepSize, int aThreadNum, int aNumThreads);
public static void parallelRound(float[] aList) { int i = 0; // omp parallel for threadNum(2) for (;;) { aList[i] = (float) Math.round(aList[i]); trace(aList[i]); i++; if (i == aList.length) break; } }
processIteration() cannot be automatically passed loop index and other parameters
OMP for parallelism needs counter-controlled loops: a loop with an index variable, increment, start, and limit known at loop entry
81 OPENMP ATTRIBUTES QUALIFYING PARALLEL ATTRIBUTED LOOPS
Attributes  Parameters  Semantics
For                     Allow different iterations of a counter-controlled loop to execute in parallel based on properties of counter-controlled loops
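What a parallel-for expansion could look like in plain Java, assuming the cyclic assignment discussed earlier (my own helper, not omp4j's actual output); the start, limit, and step are known at loop entry, as required:

```java
public class ParallelFor {
    interface Body { void run(int i); }   // one loop iteration
    static void parallelFor(int start, int limit, int step, int numThreads, Body body)
            throws InterruptedException {
        Thread[] team = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            final int threadNum = t;
            team[t] = new Thread(() -> {
                // cyclic assignment: thread t handles iterations t, t+numThreads, ...
                for (int i = start + threadNum * step; i < limit; i += numThreads * step)
                    body.run(i);
            });
            team[t].start();
        }
        for (Thread t : team) t.join();   // implicit join at end of parallel for
    }
    public static void main(String[] args) throws InterruptedException {
        float[] list = {4.8f, 5.2f, 4.5f, 4.75f, 4.7f};
        parallelFor(0, list.length, 1, 2, i -> list[i] = Math.round(list[i]));
        System.out.println(java.util.Arrays.toString(list));
    }
}
```

The iterations are independent (each writes a different element), so no further synchronization is needed.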
SAME CODE, DIFFERENT DATA (PARTS) CONCURRENCY
Data T1 Float-List Round (Parts) 1
Data T2 Float-List Round (Parts) 2
How should iterations be automatically divided among threads?
OUR THREAD ASSIGNER
Interface boolean processIteration(int anIndex, int aStart, int aLimit, int aStepSize, int aThreadNum, int aNumThreads);
Different useful implementations of the method (possibly chosen by a factory) are possible and discussed later.

boolean processIteration(int anIndex, int aStart, int aLimit, int aStepSize,
        int aThreadNum, int aNumThreads) {
    return aNumThreads == 0 || (anIndex % aNumThreads == aThreadNum);
}
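Two plausible implementations of this interface are a cyclic assigner (the modulo test above) and a block assigner that gives each thread one contiguous chunk. This is a sketch; the class and method names are illustrative, not part of OpenMP or the deck's code:

```java
// Sketch of two problem-independent thread assigners (names illustrative).
public class ThreadAssigners {
    // Cyclic: thread t takes iterations t, t + n, t + 2n, ...
    public static boolean cyclic(int anIndex, int aThreadNum, int aNumThreads) {
        return aNumThreads == 0 || (anIndex % aNumThreads == aThreadNum);
    }

    // Block: thread t takes one contiguous chunk of roughly equal size.
    public static boolean block(int anIndex, int aNumIterations,
                                int aThreadNum, int aNumThreads) {
        if (aNumThreads == 0) return true;
        int chunk = (aNumIterations + aNumThreads - 1) / aNumThreads; // ceiling
        return anIndex / chunk == aThreadNum;
    }
}
```

The cyclic assigner corresponds to the closest-distance division discussed next; the block assigner corresponds to the furthest-distance division.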
CLOSEST DISTANCE APPROACH
T1 Float-List Round
T2 Float-List Round
FURTHEST DISTANCE DIVISION

[Figure: T1 rounds one contiguous block of the float list (A0-A3) while T2 rounds a distant contiguous block.]
Comparison based on locality arguments?

LOCALITY GRANULARITIES
[Figure: threads T1 and T2 run on processors P1 and P2; data moves from disk to main-memory pages (1 page holds A0-A7) to per-processor caches, whose blocks replicate parts of a page.]
CONFLICTING REPLICA CHANGES
[Figure: T1 and T2 both write A0 and A1, which are replicated in both processors' caches: a true sharing conflict.]

The cache coherence algorithm guarantees consistency (atomic broadcast, two-phase commit).

CLOSEST DISTANCE DIVISION
[Figure: with closest-distance division, T1 and T2 write to different words of the same cache block.]

Consistency is maintained at the cache-block level. False sharing! Processors writing to different parts of a cache block execute a cache coherence algorithm.

FURTHEST DISTANCE DIVISION (T1'S PAGE)
[Figure: with furthest-distance division, T1 accesses A0-A3, whose page is in main memory.]

FURTHEST DISTANCE DIVISION (T2'S PAGE)
[Figure: T2 accesses A4-A7, which are on a different page: page fault!]

FURTHEST DISTANCE DIVISION (T1'S PAGE)
[Figure: T1 again accesses A0-A3, whose page may have been evicted: another page fault!]

SAME CODE, DIFFERENT DATA (PARTS) CONCURRENCY
Data T1 Float-List Round (Parts) 1
Data T2 Float-List Round (Parts) 2
How should iterations be automatically divided among threads?
Support division parameters set by the programmer.

MEMORY ACCESS MODEL
Data moves from disk to main memory to per-processor caches. Disk is larger and slower than main memory, which is larger and slower than a cache. Main memory is divided into units called pages; a cache is divided into units called cache lines or blocks. If data accessed by a process is not in main memory, an event called a page fault occurs, which is processed by writing some existing page in main memory back to disk and loading the accessed data from disk into the freed page. Parts of main-memory pages accessed by a processor are loaded into cache lines and are subsequently accessed from those cache lines while they remain loaded. Programs run faster when there are fewer transfers between disk and main memory, and between main memory and cache.
CACHE CONFLICTS
Two processors can load cache lines holding the same region of main memory. Concurrent writes to cache lines replicating the same memory block trigger a cache consistency algorithm a la two-phase commit and atomic broadcast. False sharing occurs when processors concurrently write to different words of replicas of the same memory block, causing unnecessary consistency-algorithm executions. True sharing occurs when processors concurrently write to the same words in cache replicas of the same memory block; the overhead of the consistency algorithm is then necessary. Reads from cache lines do not cause these algorithms to be executed. Fewer executions of consistency algorithms make a program faster.

CLOSE VS. FAR DISTANCE SHARING
Far-distance sharing can result in threads concurrently accessing different pages, and thus potentially causing more page faults, depending on the size of memory relative to the data accessed. Close-distance sharing can result in more false-sharing conflicts as threads write to different addresses in the same cache blocks. Moral: threads need to work on the same pages but not write to overlapping loaded cache lines. Reading from overlapping cache lines does not cause any problem.
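The moral can be made concrete with a division that keeps each thread's writes on separate cache blocks. This sketch assumes 64-byte cache lines and 4-byte floats (both assumptions, not stated in the deck); the class name is illustrative:

```java
// Sketch: block division whose boundaries are rounded to cache-line multiples,
// so two threads never write into the same cache line.
public class AlignedDivision {
    static final int FLOATS_PER_LINE = 16; // 64-byte line / 4-byte float (assumed)

    // Returns {start, end} (end exclusive) of thread threadNum's contiguous
    // share of an array of the given length.
    public static int[] range(int length, int threadNum, int numThreads) {
        int lines = (length + FLOATS_PER_LINE - 1) / FLOATS_PER_LINE;
        int linesPerThread = (lines + numThreads - 1) / numThreads;
        int start = Math.min(length, threadNum * linesPerThread * FLOATS_PER_LINE);
        int end = Math.min(length, (threadNum + 1) * linesPerThread * FLOATS_PER_LINE);
        return new int[] { start, end };
    }
}
```

For a 100-element array and two threads, thread 0 gets [0, 64) and thread 1 gets [64, 100): adjacent enough to share pages, but with no shared cache line between writers.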
SAME CODE, DIFFERENT DATA (PARTS) CONCURRENCY
Data T1 Float-List Round (Parts) 1
Data T2 Float-List Round (Parts) 2
How should iterations be automatically divided among threads?
Support division parameters set by the programmer.

OPENMP ATTRIBUTES QUALIFYING PARALLEL ATTRIBUTED LOOPS

For: allow different iterations of a counter-controlled loop to execute in parallel based on properties of counter-controlled loops.
Schedule (Static, Step): all iterations assigned at the start of the loop.
Step Parameter (int): each thread is assigned step number of consecutive iterations (chunk) at a time.
STATIC (5) SCHEDULING
I1 I2 I3 I4 I5 I6 I7
T1 I1 I2 I3 I4 I5
T2 I6 I7
Bad load balancing
STATIC (1) SCHEDULING – CLOSE COUPLING
I1 I2 I3 I4 I5 I6 I7
T1 I1 I3 I5 I7
T2 I2 I4 I6
Is it ever reasonable, given false sharing?
Adjacent reads do not conflict.

STATIC (2) SCHEDULING
I1 I2 I3 I4 I5 I6 I7
T1 I1 I2 I5 I6
T2 I3 I4 I7
Balancing locality and load balancing.

STATIC (2) SCHEDULING – VARIABLE WORK
I1 I2 I3 I4 I5 I6 I7
T1 I1 I2 I5 I6
T2 I3 I4 I7
T2 is free and T1 is busy.
Load imbalance!

OPENMP ATTRIBUTES QUALIFYING PARALLEL ATTRIBUTED LOOPS

For: allow different iterations of a counter-controlled loop to execute in parallel based on properties of counter-controlled loops.
Schedule (Static/Dynamic, Step): Static assigns all iterations at the start of the loop. Dynamic assigns new iterations dynamically based on the progress of previous iterations.
Step Parameter (int): each thread is assigned step number of consecutive iterations (chunk) in static and dynamic.
DYNAMIC (2) SCHEDULE
I1 I2 I3 I4 I5 I6 I7
T1 I1 I2 I7
T2 I3 I4 I5 I6
OPENMP ATTRIBUTES QUALIFYING PARALLEL ATTRIBUTED LOOPS

For: allow different iterations of a counter-controlled loop to execute in parallel based on properties of counter-controlled loops.
Schedule (Static/Dynamic, int): Static assigns all iterations at the start of the loop. Dynamic assigns new iterations dynamically based on the progress of previous iterations.
Step Parameter (int): each thread is assigned step number of consecutive iterations (chunk) in static and dynamic.

Why not always use dynamic?

STATIC VS DYNAMIC: VARIABLE ITERATION COST
Static:
- No load balancing
+ No additional synchronization or context-switch overhead after each segment

Dynamic:
+ Load balancing
- After each segment must check a shared data structure, leading to synchronization and possibly a context switch (if multiple threads are assigned to a processor)
DYNAMIC (2) SCHEDULE: 3 THREADS
Solution: smaller step/chunk size?
I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 I12
T1: I1 I2 I7 I8
T2: I3 I4 I9 I10
T3: I5 I6 I11 I12

Smaller step size leads to better load balancing. Larger step size leads to less overhead. Variable step size?

T2 does I10 even though T1 and T3 are free!

OPENMP ATTRIBUTES QUALIFYING PARALLEL ATTRIBUTED LOOPS

For: allow different iterations of a counter-controlled loop to execute in parallel based on properties of counter-controlled loops.
Schedule (Static/Dynamic/Guided, Step): Static assigns all iterations at the start of the loop. Dynamic assigns new iterations dynamically based on the progress of previous iterations. Guided changes the step size dynamically based on the remaining iterations after each assignment.
Step Parameter (int): each thread is assigned step number of consecutive iterations (chunk) in static and dynamic, and at least step number in guided.

GUIDED (1) SCHEDULE

Chunk size decreases dynamically ∝ Remaining Iterations / #Threads
Chunk Size at least specified size
I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 I12
T1: I1 I2 I3 I4 (chunk 12/3 = 4)
T2: I5 I6 I7 (chunk 8/3 = 3), I10 (chunk 3/3 = 1), I12
T3: I8 I9 (chunk 5/3 = 2), I11
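The guided chunk-size rule can be sketched in a few lines. Using ceiling division reproduces the chunk sizes in the figure (12/3 -> 4, 8/3 -> 3, 5/3 -> 2); this is one common formulation and an assumption, since implementations of guided scheduling vary:

```java
// Sketch (class name illustrative): guided(minStep) chunk sizing.
// Each assignment takes remaining/numThreads iterations (rounded up here),
// but never fewer than the specified minimum step.
public class GuidedScheduler {
    public static int chunkSize(int remaining, int numThreads, int minStep) {
        int guided = (remaining + numThreads - 1) / numThreads; // ceiling division
        return Math.max(minStep, guided);
    }
}
```

Early chunks are large (little scheduling overhead); late chunks shrink toward minStep (good load balancing near the end of the loop).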
OPENMP ATTRIBUTES QUALIFYING PARALLEL ATTRIBUTED LOOPS

For: allow different iterations of a counter-controlled loop to execute in parallel based on properties of counter-controlled loops.
Schedule (Static/Dynamic/Guided, Step): Static assigns all iterations at the start of the loop. Dynamic assigns new iterations dynamically based on the progress of previous iterations. Guided changes the step size dynamically based on the remaining iterations after each assignment.
Step Parameter (int): each thread is assigned step number of consecutive iterations (chunk) in static and dynamic, and at least step number in guided.

DYNAMIC/GUIDED APPLICATION
Grader assignments and tests
Irregular algorithms in general
PARALLEL ROUND
public static void parallelRound(float[] aList) {
    // omp parallel for threadNum(2) schedule(static, 128)
    for (int i = 0; i < aList.length; i++) {
        aList[i] = (float) Math.round(aList[i]);
        trace(aList[i]);
    }
}
Threads write to different locations
PARALLEL SUM?
public static float parallelSum(float[] aList) {
    float retVal = 0.0f;
    // omp parallel for threadNum(2)
    for (int i = 0; i < aList.length; i++) {
        retVal += aList[i];
    }
    return retVal;
}
Threads write to common variable
PARALLEL ATOMIC SUM
public static float parallelSum(float[] aList) {
    float retVal = 0.0f;
    // omp parallel for threadNum(2)
    for (int i = 0; i < aList.length; i++) {
        // omp critical
        retVal += aList[i];
    }
    return retVal;
}
Computations are serialized!
Alternative algorithm?

REDUCTION: DIVIDE AND CONQUER
2 + 3 + 4 + 6
(2 + 3 ) (4 + 6 ) Partial reduction
(5 + 10 ) Final reduction
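The partial/final reduction scheme above can be sketched with plain Java threads, outside OpenMP; the class name is illustrative. Each thread accumulates a private partial sum, and the partials are combined under a lock:

```java
// Sketch: two-level reduction with explicit threads.
// Partial reduction: each thread sums its own (private) share of the list.
// Final reduction: partials are combined into the shared result under a lock.
public class TwoLevelSum {
    public static float sum(float[] list, int numThreads) {
        float[] result = new float[1];
        Object lock = new Object();
        Thread[] threads = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            final int threadNum = t;
            threads[t] = new Thread(() -> {
                float partial = 0.0f; // private variable: no sharing in the loop
                for (int i = threadNum; i < list.length; i += numThreads) {
                    partial += list[i];
                }
                synchronized (lock) { // the "critical" final reduction
                    result[0] += partial;
                }
            });
            threads[t].start();
        }
        try {
            for (Thread th : threads) th.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return result[0];
    }
}
```

Only numThreads lock acquisitions occur in total, instead of one per list element as in the parallel atomic sum.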
MULTI-LEVEL PARALLEL REDUCTION OF LIST: SUM
[Figure: two threads compute partial sums Sum1 and Sum2 of different parts of the number list; a final reduction combines them into Sum.]

Partial reduction: a partial solution of the problem, computed in thread-specific private variables by one or more threads. Final reduction: combining the reduced results held in the threads' private variables. Replicate a reducible shared variable (but not its value) for wait-free / lockless coordination!

MULTI-LEVEL PARALLEL REDUCTION OF LIST: TOSTRING
[Figure: two threads compute partial strings String1 and String2 of different parts of the number list; a final reduction concatenates them.]
Number (abstract) + is commutative and associative. String + is associative but not commutative.

CHANGE HOW FOR MULTI-LEVEL REDUCTION?
public static float parallelSum(float[] aList) {
    float retVal = 0.0f;
    // omp parallel for threadNum(2)
    for (int i = 0; i < aList.length; i++) {
        // omp critical
        retVal += aList[i];
    }
    return retVal;
}
2-LEVEL SUM REDUCTION WITH PARALLEL FOR

public static float parallelSum(float[] aList) {
    float retVal = 0.0f;
    float[] aSums = {0.0f, 0.0f};
    // omp parallel for threadNum(2)
    for (int i = 0; i < aList.length; i++) {
        aSums[OMP4J_THREAD_NUM] += aList[i];
    }
    for (float aSum : aSums) {
        retVal += aSum;
    }
    return retVal;
}
Cache issues in the allocation of aSums? False sharing, as all threads write to adjacent array slots!

2-LEVEL SUM REDUCTION WITH PARALLEL

public static float parallelSum(float[] aList) {
    float retVal = 0.0f;
    // omp parallel private(aSum)
    {
        float aSum = 0.0f; // private to each thread
        for (int i = 0; i < aList.length; i++) {
            if (processIteration(i, 0, aList.length, 1, OMP4J_THREAD_NUM, OMP4J_NUM_THREADS)) {
                aSum += aList[i]; // replaced retVal += aList[i]
            }
        }
        // omp critical
        retVal += aSum; // second reduction
    }
    return retVal;
}
OPENMP STATEMENT ATTRIBUTES
Parallel: create multiple threads to execute the attributed statement; they are killed after execution of the statement.
NumberOfThreads (int): number of threads to be created automatically.
Critical: make auto-threads execute the attributed statement atomically.
Barrier: wait for each sibling created by the Parallel pragma to finish the preceding statement before proceeding to the next statement.
Sections: Co-Begin around the attributed statement.
Section: allow the attributed sub-statement to be executed by any thread once.
Private/Shared (String): variable declared in the statement is private to or shared by the threads that execute it.

2-LEVEL SUM REDUCTION WITH PARALLEL

public static float parallelSum(float[] aList) {
    float retVal = 0.0f;
    // omp parallel private(aSum)
    {
        float aSum = 0.0f; // private to each thread
        for (int i = 0; i < aList.length; i++) {
            if (processIteration(i, 0, aList.length, 1, OMP4J_THREAD_NUM, OMP4J_NUM_THREADS)) {
                aSum += aList[i];
            }
        }
        // omp critical
        retVal += aSum; // second reduction
    }
    return retVal;
}

Can we add more parameters to parallel for to create this automatically?
2-LEVEL PROD REDUCTION WITH PARALLEL

public static float parallelProduct(float[] aList) {
    float retVal = 1.0f;
    // omp parallel private(aProd)
    {
        float aProd = 1.0f; // private to each thread
        for (int i = 0; i < aList.length; i++) {
            if (processIteration(i, 0, aList.length, 1, OMP4J_THREAD_NUM, OMP4J_NUM_THREADS)) {
                aProd *= aList[i];
            }
        }
        // omp critical
        retVal *= aProd; // second reduction
    }
    return retVal;
}

Can we add more parameters to parallel for to create this automatically?
CHANGE HOW FOR MULTI-LEVEL REDUCTION?

public static float parallelSum(float[] aList) {
    float retVal = 0.0f;
    // omp parallel for threadNum(2) reduction(+:retVal)
    for (int i = 0; i < aList.length; i++) {
        retVal += aList[i];
    }
    return retVal;
}
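The private copy created for a reduction variable must start at the identity element of the reduction operator, so that partial reductions compose correctly. A sketch of that correspondence (class name illustrative; the operator set shown is an assumption, not an exhaustive list):

```java
// Sketch: identity elements used to initialize the private copy of a
// reduction variable, per operator.
public class ReductionIdentity {
    public static float identity(String op) {
        switch (op) {
            case "+":   return 0.0f;                    // x + 0 == x
            case "*":   return 1.0f;                    // x * 1 == x
            case "max": return Float.NEGATIVE_INFINITY; // max(x, -inf) == x
            case "min": return Float.POSITIVE_INFINITY; // min(x, +inf) == x
            default: throw new IllegalArgumentException("unknown operator: " + op);
        }
    }
}
```

This is why the sum version starts each private variable at 0.0f and the product version at 1.0f: initializing a product's private copy to 0 would zero out every partial result.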
A private variable is created for the reduction variable (retVal) before the attributed statement is entered. All occurrences of the reduction variable in the for loop are replaced with the private variable.

Initialization of the private variable? The private variable is initialized to a value dependent on the reduction operator (+). The reduction operator (+) is repeatedly applied to the private variables, and the resulting reduced value is stored in the reduction variable.

PARALLEL WORD COUNT
Desired I/O. Words: [the, a, an, the, a]. Counts: {the=2, a=2, an=1}
public static void add(Map<String, Integer> aCounts, String aWord) {
    Integer anOldCount = aCounts.get(aWord);
    aCounts.put(aWord, anOldCount == null ? 1 : anOldCount + 1);
}
Words: [the, a, an, the, a] → Counts: {the=2, a=2, an=1}. Before the for loop: private variable initialization.
// omp declare reduction \
// (mapAdd : java.util.Map<String, Integer> : mapAdd(omp_out, omp_in))

Before the for loop, each private variable is assigned the initial value specified for the reduction.

Parameter omp_out is replaced with the for reduction variable, and omp_in is replaced with the private variable created from it.
OPENMP ATTRIBUTES QUALIFYING FOR ATTRIBUTED LOOPS

For: allow different iterations of a counter-controlled loop to execute in parallel based on properties of counter-controlled loops.
Schedule (Static/Dynamic/Guided, Step): Static assigns all iterations at the start of the loop. Dynamic assigns new iterations dynamically based on the progress of previous iterations. Guided changes the step size dynamically based on the remaining iterations after each assignment.
Step Parameter (int): each thread is assigned step number of consecutive iterations (chunk) in static and dynamic, and at least step number in guided.
Reduction (String: String): multi-level reduction using the operator identified by the 1st parameter of the value stored by the variable identified by the 2nd parameter.

OPENMP STATEMENT ATTRIBUTES
Parallel: create multiple threads to execute the attributed statement; they are killed after execution of the statement.
NumberOfThreads (int): number of threads to be created automatically.
Critical: make auto-threads execute the attributed statement atomically.
Barrier: wait for each sibling created by the Parallel pragma to finish the preceding statement before proceeding to the next statement.
Sections: Co-Begin around the attributed statement.
Section: allow the attributed sub-statement to be executed by any thread once.

How many threads to create?

+ REDUCTION: DIVIDE AND CONQUER
A = + (a1, a2, a3, a4)
A1 = + (a1, a2) A2 = + (a3, a4) Partial reduction
A = + (A1, A2) Final reduction
* REDUCTION: DIVIDE AND CONQUER
A = * (a1, a2, a3, a4)
A1 = * (a1, a2) A2 = * (a3, a4) Partial reduction
A = * (A1, A2) Final reduction
MAPPLUS REDUCTION: DIVIDE AND CONQUER
A = mapPlus (a1, a2, a3, a4)
A1 = mapPlus (a1, a2) A2 = mapPlus (a3, a4) Partial reduction
A = mapPlus (A1, A2) Final reduction
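The mapPlus operator itself can be sketched in Java as a merge of count maps; the class name is illustrative and this is not the deck's code. Because value addition is commutative and associative, mapPlus is too, which is what makes it usable as a reduction operator:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: mapPlus merges two maps by summing the values of equal keys.
public class MapPlus {
    public static Map<String, Integer> mapPlus(Map<String, Integer> a,
                                               Map<String, Integer> b) {
        Map<String, Integer> out = new HashMap<>(a);        // copy the left operand
        b.forEach((k, v) -> out.merge(k, v, Integer::sum)); // fold in the right one
        return out;
    }
}
```

mapPlus({the=1}, {the=1, a=1}) yields {the=2, a=1}, the partial-reduction step of the word-count example that follows.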
OPENMP ATTRIBUTES QUALIFYING FOR ATTRIBUTED LOOPS

For: allow different iterations of a counter-controlled loop to execute in parallel based on properties of counter-controlled loops.
Schedule (Static/Dynamic/Guided, Step): Static assigns all iterations at the start of the loop. Dynamic assigns new iterations dynamically based on the progress of previous iterations. Guided changes the step size dynamically based on the remaining iterations after each assignment.
Step Parameter (int): each thread is assigned step number of consecutive iterations (chunk) in static and dynamic, and at least step number in guided.
Reduction (String: String): multi-level reduction using the operator identified by the 1st parameter of the value stored by the variable identified by the 2nd parameter.

OPENMP STATEMENT ATTRIBUTES
Parallel: create multiple threads to execute the attributed statement; they are killed after execution of the statement.
NumberOfThreads (int): number of threads to be created automatically.
Critical: make auto-threads execute the attributed statement atomically.
Barrier: wait for each sibling created by the Parallel pragma to finish the preceding statement before proceeding to the next statement.
Sections: Co-Begin around the attributed statement.
Section: allow the attributed sub-statement to be executed by any thread once.

How many threads to create?

DEFAULT VALUES
NumberOfThreads: Runtime.getRuntime().availableProcessors()
Schedule: Static
Step Size:
  Static: number of iterations / number of threads (each thread gets one chunk)
  Dynamic/Guided: 1
HOW MANY THREADS?
P1 P2 P3 P4
Runtime.getRuntime().availableProcessors()
A1 T11 T12 T13 T14
Multiple applications?

MULTIPLE APPLICATIONS
P1 P2 P3 P4
Multi-Process Scheduling Policy?
A1 T11 T12 T13 T14 T15
A2 T21 T22 T23
A3 T31 T32

COMMON GLOBAL THREAD QUEUE
[Figure: a single global queue holds the threads of A1, A2, and A3; processors P1-P4 run the four threads at the head of the queue, and T22, T13, T23, T14, T15, T32 wait.]

Which threads are made current when the time quantum allocated for the current threads expires (assuming no blocking)?
A1 T11 T12 T13 T14 T15
A2 T21 T22 T23
A3 T31 T32

FIRST THREAD SWITCH
[Figure: after the first thread switch, the next four threads in the global queue are mapped to processors P1-P4.]

Does it matter how threads are mapped in the second context switch?
A1 T11 T12 T13 T14 T15
A2 T21 T22 T23
A3 T31 T32

SECOND THREAD SWITCH: PROCESSOR AGNOSTIC
[Figure: processor-agnostic mapping schedules T11 and T21 on different processors than before.]

Cache blocks of P1 and P2 may still hold data of T11 and T21 from when they were last scheduled on them.
A1 T11 T12 T13 T14 T15
A2 T21 T22 T23
A3 T31 T32

SECOND CONTEXT SWITCH: PROCESSOR AFFINITY
[Figure: affinity-based mapping schedules T11 and T21 on the processors they ran on earlier.]

Affinity-based: schedule threads on the processors on which they were scheduled earlier.
A1 T11 T12 T13 T14 T15
A2 T21 T22 T23
A3 T31 T32

PROCESSOR AGNOSTIC VS AFFINITY-BASED SCHEDULING
Affinity-based scheduling can increase performance if the load does not become unbalanced, as in our scheme. Other scheduling schemes that do affinity-based scheduling (e.g., Linux) need a load-balancing step.
COMMON GLOBAL THREAD QUEUE
[Figure: the single global thread queue again, with the threads of all three applications intermixed.]

Fair? Applications with more threads get more processor time. Reason: a single queue.
A1 T11 T12 T13 T14 T15
A2 T21 T22 T23
A3 T31 T32

APPLICATION- AND THREAD-LEVEL QUEUES
[Figure: a level-1 queue of applications (A2, A3) and a level-2 queue of the current application A1's threads; T11-T14 run on P1-P4 while T15 waits in the thread queue.]
A1 T11 T12 T13 T14 T15
A2 T21 T22 T23
A3 T31 T32

THREAD SWITCH
[Figure: after a thread switch, T15 runs and T11 waits in A1's thread queue.]

Process switch?
A1 T11 T12 T13 T14 T15
A2 T21 T22 T23
A3 T31 T32

APPLICATION/PROCESS SWITCH
[Figure: after an application switch, A2's threads T21-T23 run on P1-P3; P4 is idle.]

How to use an idle processor? Threads of the next application(s) can use processors not used by the current application. Allocated time depends on position in the application queue. Fairness vs. throughput?
A1 T11 T12 T13 T14 T15
A2 T21 T22 T23
A3 T31 T32

SINGLE-LEVEL VS MULTI-LEVEL QUEUE
Single-level: a single queue of threads associated with a single time quantum t. After t units, a thread switch occurs. Applications with more threads get more processor time.
Multi-level: a level-1 queue of processes/applications with time quantum T, and a level-2 queue with the threads of the current application with time quantum t. Every t units, the threads in the level-2 queue are switched. Every T units, the processes in the level-1 queue are switched, which loads the new process's threads into the level-2 queue, plus n threads of the next processes in the level-1 queue if n = #processors - #threads of the current process > 0. The total time a process gets depends on its position in the queue: if it is behind a thread-stingy process, it gets time during its own quantum and possibly during the quanta of processes in front of it.
Processor-agnostic or affinity-based scheduling can be used with both.

BETTER USE OF CACHE THAN TIME SCHEDULING?
[Figure: A2's threads occupy the processors; A1 and A3 wait in the application queue.]

How to make sure a thread always goes to the same processor, to make better use of the cache? Divide processors rather than time among applications?
A1 T11 T12 T13 T14 T15
A2 T21 T22 T23
A3 T31 T32

SPACE SCHEDULING
[Figure: space scheduling. P1 and P2 are dedicated to A1's threads, P3 to A2's, and P4 to A3's; each application time-multiplexes its remaining threads on its own processors.]
A1 T11 T12 T13 T14 T15
A2 T21 T22 T23
A3 T31 T32

APPLICATIONS < #PROCESSORS
P1 P2 P3 P4
A1 A2 A3
A1
A2
A3
A4
APPLICATIONS = #PROCESSORS
P1 P2 P3 P4
A1 A2 A3 A4
A1
A2
A3
A4
APPLICATIONS > #PROCESSORS
A5 P1 P2 P3 P4
A1 A2 A3 A4
A1
A2
A3
A4
A5

Application wait queue serviced when current applications complete.

SPACE SCHEDULING
Up to #processors applications are put in the active application queue; the others go in the wait queue. Processors are assigned to the active applications, with some applications getting more processors than others if they do not divide evenly. An active application continues to use its processors until it completes, when it is replaced with an application from the wait queue. Threads of an application use time multiplexing to share its processors.
SPACE VS. TIME MULTIPLEXING
Space:
- Throughput: better use of cache; speedup is sublinear.
- Response time: applications in the wait queue must wait until current applications finish.
- Critical mass: certain process teams (producer/consumer) may not get a critical number of processors.

Time:
- Throughput: developed when caches were small or nonexistent.
- Response time: new small jobs go to the front of the queue and others go in the serviced queue.
- Critical mass: applications can get a critical number of processors.
UNCOORDINATED ATOMICITY-AWARE SCHEDULING
[Figure: T11 spins (S) on a test-and-set lock on P1 while the lock holder T22 (C, in its critical section) sits in the queue.]

A thread spinning on test-and-set does no work during its time quantum when the thread holding the lock is in the queue.
A1 T11 T12 T13 T14 T15
A2 T21 T22 T23
A3 T31 T32

COORDINATED ATOMICITY-AWARE SCHEDULING
[Figure: the thread in its critical section (C) runs while the spinning thread (S) is deprioritized.]

Threads in critical regions are given more precedence, and those spinning are given less precedence.
A1 T11 T12 T13 T14 T15
A2 T21 T22 T23
A3 T31 T32

MORE COORDINATION: CONTROL THREADS?
[Figure: the three applications' threads on processors P1-P4.]

Should the scheduler and the application coordinate on the number of threads? Can an application go faster if it has fewer threads than it can use? Given N processors, how many threads should be scheduled at one time?
A1 T11 T12 T13 T14 T15
A2 T21 T22 T23
A3 T31 T32

SPEEDUP: THREADS VS. PROCESSORS

One active application with n threads, or two applications with n/2 processors each? Why is speedup not linear when there are enough processors? Why does speedup go down when #threads > #processors?

Speedup is not linear because of thread coordination overhead: the cost of splitting and combining work. Speedup goes down because of cache invalidation, lack of coordination, and thread switching.
Keep # of threads = # processors
A. Tucker and A. Gupta. 1989. Process control and scheduling issues for multiprogrammed shared-memory multiprocessors. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles (SOSP '89). ACM, New York, NY, USA, 159-166. DOI: https://doi.org/10.1145/74850.74866

COORDINATED VS UNCOORDINATED ATOMICITY-AWARE SCHEDULING
Uncoordinated: the scheduler and the application are independent.
Coordinated: the application tells the scheduler if it is spinning or in a critical section. Spinning, that is, continuously executing a test-and-set instruction, is necessary in multiprocessor scheduling to ensure atomic operations in the kernel. In uniprocessor systems, interrupts can be disabled on the processor on which a thread is executing, to prevent other threads from being scheduled on that processor and thus accessing some shared data structure. Disabling interrupts on one processor does not disable interrupts on other processors, which can schedule threads that access the shared data structure. Hence a thread spins, atomically polling some condition before entering a critical region (in the kernel), until some thread in the critical region resets the condition. Atomic access to user-level data structures can be ensured by implementing, in the kernel, blocking semaphores or monitors that use spinning to check semaphore and monitor blocking conditions. If language/library abstractions support atomicity, then they can automatically inform the scheduler, which may be implemented by the language/library.
COORDINATED VS UNCOORDINATED THREAD-CONTROLLING SCHEDULING
Uncoordinated: the application gets no information about the number of threads it should create.
Coordinated: the scheduler keeps #threads = #processors. The application gets a callback about the desired number of threads. In dynamic scheduling, it can react to the callback by terminating some threads when they finish their current chunk, that is, when they reach their next safe point. Alternatively, the scheduler can force-terminate certain threads immediately, especially those spinning on locks.
Moral: dynamic scheduling is better on multi-application computers! Two applications running n/2 threads each will go faster than one application running n threads, as speedup decreases with the number of threads.
DIST. VS PARALLEL-DIST. EVOLUTION
Remotely Accessible Services (Printers, Desktops)
Replicated Repositories (Files, Databases)
Collaborative Applications (Games, Shared Desktops)
Distributed Sensing (Disaster Prediction)
Computation Distribution (Number, Matrix Multiplication)

[Figure: an existing non-distributed single-thread program evolves into a non-distributed multi-thread program and into distributed programs.]
DISTRIBUTED OPENMP ARCHITECTURE
[Figure: input is split among local and remote processes; each performs its iterations and a partial reduction (PR) on its share; a final reduction merges the local and remote partial results into the output.]
DISTRIBUTING OPENMP
The fork-join model requires the input and output data of the for loop to start and end at one process/thread.
Need a software implementation of shared memory
Inconsistent with big data, where all the data may not fit in the memory of one process; even if it does, the process becomes a bottleneck.
MapReduce provides a competitor to fork-join loop-based reduction (not sections, non-loop parallelism, …).
MAPPLUS REDUCTION: DIVIDE AND CONQUER
A = mapPlus (a1, a2, a3, a4)
A1 = mapPlus (a1, a2) A2 = mapPlus (a3, a4) Partial reduction
A = mapPlus (A1, A2) = mapPlus ({(k1, v1), (k2, v2)}, {(k1, v3), (k2, v4)})   Final reduction
Collection result! Further parallelism when collections are large?

MAPPLUS PARTITION: DIVIDE AND CONQUER
A = mapPlus ({(k1, v1), (k2, v2)}, {(k1, v3), (k2, v4)})
A1 = seqPlus ({(k1, v1)}, {(k1, v3)})   A2 = seqPlus ({(k2, v2)}, {(k2, v4)})   Split and partial reduction
A = mapPlus (A1, A2) Final reduction
Values associated with corresponding elements (keys) can be reduced in parallel.

MAPREDUCE KEY IDEAS
Assume data fits in a set of input files rather than a single process
Work is split among different processes by giving them different portions of the files, without first loading them into one process.
Data-based division rather than loop-based division
Work is combined by multiple processes that directly write to output files rather than sending results to one process.
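Key-based division typically relies on a partition function, so that all values for a given key reach exactly one reducer. A common sketch (class name illustrative; this is not the deck's or Hadoop's exact code):

```java
// Sketch: deterministic key partitioning. Every occurrence of the same key
// maps to the same reducer, so per-key reductions never span reducers.
public class Partitioner {
    public static int partition(String key, int numReducers) {
        // Mask the sign bit so negative hash codes still index a valid reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

Because the partitions are disjoint by construction, reducers can write their outputs independently, which is what lets work be combined without funneling data through one process.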
MAPREDUCE COLLECTION-BASED DIVISION
External Data Structure: File Text lines
In-Memory Data Structure: Map
Why Map and not Bean, Array, List, Set?
Can simulate a Bean and any collection: array, list, set, and the PL equivalent of (relational) database tables.
Values associated with the same keys can be reduced independently and thus in parallel.

DIVIDE AND CONQUER: ITERATOR-BASED DIVISION
A = mapPlus (a1, a2, a3, a4)   Division based on the enumeration of keys provided by a for or collection iterator
A1 = mapPlus (a1, a2) A2 = mapPlus (a3, a4) Partial reduction
A = mapPlus (A1, A2)   Final reduction
Further parallelism when collections are large?

DIVIDE AND CONQUER: ITERATOR-BASED DIVISION
A = mapPlus ({(k1, v1)}, {(k2, v2)}, {(k1, v3)}, {(k2, v4)}) Expanding the maps, making use of the fact that they are collections
A1 = mapPlus ({(k1, v1)}, {(k2, v2)})   A2 = mapPlus ({(k1, v3)}, {(k2, v4)})
A = mapPlus (A1, A2) = mapPlus ({(k1, v1), (k2, v2)}, {(k1, v3), (k2, v4)})   Further parallelism when collections are large and values associated with different keys are independent?
{(k1, v1 + v3), (k2, v2 + v4)}

DIVIDE AND CONQUER: KEY-BASED DIVISION
A = seqPlus ({(k1, v1)}, {(k2, v2)}, {(k1, v3)}, {(k2, v4)})   Can ask different processes to partition or split the input sequence among key-based reducers
A1 = seqPlus ({(k1, v1)}, {(k1, v3)})   A2 = seqPlus ({(k2, v2)}, {(k2, v4)})   Plus of key-value sequences rather than maps
A = indepMapPlus (A1, A2) = {(k1, v1 + v3), (k2, v2 + v4)}   Simple combination, no computation; can be done in the file system without sending the partially reduced maps to a process
{(k1, v1 + v3), (k2, v2 + v4)}
MAPREDUCE
[Figure: MapReduce pipeline. Inter-process input splitting feeds mapper processes; intra-process input splitting divides lines among mapper threads (M); optional intra-process partial reduction (PR) follows; key-based inter-process shuffling routes results to reduction processes; key-based intra-process splitting divides keys among reduction threads, whose final reduction (R) writes the output.]

Processes go to the data, or we do load balancing!

Mapper and final-reduction processes, optionally partial-reduction processes. Splitting of keys can be different for local and remote reducers as long as the partitions are disjoint.

MAP REDUCE ARCHITECTURE: WORD COUNT

Input: 1, (the a); 2, (a). The mapper ignores position and converts each input word into (word, 1). (Local partial and remote final) reducers convert list input elements with the same key and multiple values to the key with the sum of the values.
Input (continued): 3, (the); 4, (an).

[Figure: map outputs (the,1), (a,1), (the,1), (an,1), (a,1); partial reductions yield (the,2), (a,2), (an,1); the final reduction writes the=2, a=2, an=1.]

Loop parallelizable by OpenMP?

MAP CLASS
public class WordCount {
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> { ... }
}
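The map/shuffle/reduce pipeline of the word-count slides can be simulated in plain Java, with no Hadoop dependency; the class name is illustrative and the shuffle is modeled as an in-memory grouping by key:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the word-count pipeline: map each word to (word, 1), shuffle the
// pairs by key, then reduce each key's value list by summing it.
public class WordCountSketch {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit (word, 1); shuffle: group values by key.
        Map<String, List<Integer>> shuffled = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                shuffled.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        // Reduce phase: sum each key's value list.
        Map<String, Integer> counts = new HashMap<>();
        shuffled.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("the a an", "the a")));
    }
}
```

For the slide's input [the, a, an, the, a] this yields the desired output {the=2, a=2, an=1}; in a real deployment the map and reduce phases would run in separate processes with the shuffle done by the infrastructure.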
Parallelizing/distributing a parameterized method rather than a scoped statement. Infrastructure-provided parameter types and output collection.

MAP REDUCE VS. OPENMP CHALLENGES
OpenMP: Write problem from scratch and tell OpenMP about parallelization-relevant aspects
MapReduce: Transform problem into a Map and Reduce Method
Create an FSA for a regular expression
Convert the given expression into another known regular expression using a transformation known to preserve regular expressions
Proving from scratch that the halting problem is intractable
Reducing a known intractable problem to the given problem

MAP REDUCE: MIN TEMP

Input: 1, (A 70); 2, (A 60); 3, (B 55); 4, (B 70); 5, (A 90). The mapper ignores position and converts each input (city, temp) into (city, temp). Reducers convert list input elements with the same key and multiple values to the key with the min of the values.
[Figure: map outputs (A,70), (A,60), (B,55), (B,70), (A,90); partial reductions yield (A,60) and (B,55); the final reduction writes A=60, B=55.]

MAP REDUCE: SUM

Input: 1, 5; 2, 3; 3, 3; 4, (2 1). The mapper ignores position and converts each input into (S, int). Reducers convert list input elements with the same key and multiple values to the key with the sum of the values.
[Figure: map outputs (S,5), (S,3), (S,3), (S,2), (S,1); partial reductions yield (S,11) and (S,3); the final reduction writes S=14.]

MAP REDUCE: ROUND
Input: 1, 5.2; 2, 3.3; 3, 3.4; 4, (2.2 1.3). The mapper converts each (position, float) to (position, round(float)). Reducers get a single value for each key and simply output it.
[Figure: map outputs (1.1, 5), (2.1, 3), (3.1, 3), (4.1, 2), (4.2, 1); each reducer passes its single value through.]

Assuming the output is or can be sorted by key.

MAP REDUCE: ROUND + SUM

Input: 1, 5.2; 2, 3.3; 3, 3.4; 4, (2.2 1.3). The mapper converts each (position, float) to (S, round(float)). Reducers convert list input elements with the same key and multiple values to the key with the sum of the values.
[Diagram: map outputs (S, 5), (S, 3), (S, 3), (S, 2), (S, 1) are reduced in stages to (S, 14)]
MAP REDUCE: COMMON FRIENDS
Mapper converts each input S>F1..FN into { (S.F1, F1..FN), ..., (S.FN, F1..FN) }, normalizing each key so that its two components are in alphabetical order
Reducers convert list elements with the same key and multiple set values into the key with the intersection of the values
Input: 1, A>B C; 2, C>A B; 3, B>A C D
[Diagram: map outputs (A.B, {B,C}), (A.C, {B,C}), (A.C, {A,B}), (B.C, {A,B}), (A.B, {A,C,D}), (B.C, {A,C,D}), (B.D, {A,C,D}) are reduced by intersection to (A.B, C), (A.C, B), (B.C, A), (B.D, {A,C,D})]

To find the common friends of A and B, we need to associate each of their friend sets with the same key and have reducers intersect the two sets
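The key normalization and set intersection can be sketched sequentially in plain Java. All names here are illustrative, and the driver simulates the shuffle with an in-memory map rather than a distributed one:

```java
import java.util.*;

public class CommonFriends {
    // Mapper: "S>F1 F2 ..." emits one pair per friend, with the key's two names put
    // in alphabetical order so the pair from A's list and the pair from B's list collide.
    static Map<String, Set<String>> mapLine(String line) {
        String[] halves = line.split(">");
        String self = halves[0].trim();
        Set<String> friends = new TreeSet<>(Arrays.asList(halves[1].trim().split("\\s+")));
        Map<String, Set<String>> out = new LinkedHashMap<>();
        for (String f : friends) {
            String key = self.compareTo(f) < 0 ? self + "." + f : f + "." + self;
            out.put(key, friends);
        }
        return out;
    }

    // Reducer: intersects all friend sets bound to one key.
    static Set<String> reduce(List<Set<String>> sets) {
        Set<String> common = new TreeSet<>(sets.get(0));
        for (Set<String> s : sets) common.retainAll(s);
        return common;
    }

    // Sequential driver: group the map outputs by normalized key, then intersect each group.
    static Map<String, Set<String>> run(List<String> lines) {
        Map<String, List<Set<String>>> groups = new TreeMap<>();
        for (String line : lines)
            mapLine(line).forEach((k, v) ->
                groups.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        Map<String, Set<String>> result = new TreeMap<>();
        groups.forEach((k, vs) -> result.put(k, reduce(vs)));
        return result;
    }

    public static void main(String[] args) {
        // Slide input: A>B C, C>A B, B>A C D
        System.out.println(run(List.of("A>B C", "C>A B", "B>A C D")));
    }
}
```

The key B.D receives only one friend set because D's own friend list is not in the input, so, as in the diagram, its "intersection" is just that set.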
OpenMP:
- A binary commutative and associative reduce operation known to OpenMP
- A counter-controlled loop known to OpenMP
- An output variable in the forking thread, known to OpenMP, that is an operand to the operation on each iteration
- An expression computed by each loop iteration that is the other operand to the operation in that iteration

MapReduce:
- An n-ary reduce function that reduces the list of values bound to a key to a single value, using a commutative and associative binary function unknown to the infrastructure
- Input text files known to MapReduce
- A map function, known to MapReduce, that converts each input line to (Key, Value) pairs processed by a reduce function
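Since the deck's code is in Java, here is a thread-based sketch of what OpenMP's reduction arranges, rather than a C `#pragma` example. Everything in it (class name, the interleaved split of the loop index, the use of `AtomicLong` for the final combine) is an illustrative assumption, not OpenMP's actual implementation:

```java
import java.util.concurrent.atomic.AtomicLong;

public class OpenMPStyleReduction {
    // Analog of: #pragma omp parallel for reduction(+: sum)
    // Each thread accumulates a private partial over its share of the iterations,
    // then combines it once into the shared output variable with the known
    // binary commutative/associative operation (+).
    static long parallelSum(int n, int numThreads) {
        AtomicLong sum = new AtomicLong();               // shared output variable
        Thread[] threads = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                long partial = 0;                        // thread-private copy, as with reduction()
                for (int i = id; i < n; i += numThreads) // split of the counter-controlled loop
                    partial += i;                        // expression computed by each iteration
                sum.addAndGet(partial);                  // one combine per thread
            });
            threads[t].start();
        }
        try {
            for (Thread th : threads) th.join();         // barrier at the end of the parallel region
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sum.get();
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(100, 4)); // 0 + 1 + ... + 99 = 4950
    }
}
```

The contrast with MapReduce is visible in the code: the binary operation (+) and the loop are known to the framework-analog, while in MapReduce the reduce function is opaque and is applied per key instead of per loop.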
OPENMP REDUCTION VS MAPREDUCE
OpenMP:
- Work divided among different threads/processes consists of reduction
- Splits work among multiple threads/processes based on the loop index
- Feeds the results of child threads/processes to one thread, which calls the reduction function to combine the reduced results

MapReduce:
- Work divided among different threads/processes consists of mapping and reduction
- Splits application lines among different mapper processes based on the location of the input files
- Splits mapper-process lines among different mapper threads, which call the mapper function
- Optionally feeds the results of mapper threads to reduction threads of the mapper process, which call the reduction function on non-overlapping keys
- Splits the results of reduction among different reduction processes, which are responsible for combining the values of non-overlapping sets of keys
- Splits the keys received by a reduction process among different reduction threads, which directly write the output for each key

MAP/REDUCE FUNCTIONS
Mapper function:
- Infrastructure can call map in parallel on different input lines
- Gets as input a line of text (the value parameter) and its offset in the input file (the key), which is like the index in a loop
- Produces as output key-value pairs
- Each call is free to determine how input is transformed to output

Reduce function:
- Infrastructure can call the reduce function in parallel with different keys
- Takes as input a key and a list of map values produced for that key by different map calls, which correspond to local result variables in OpenMP
- Produces as output a single reduced value for that key
MAP REDUCE ARCHITECTURE
Mapper node and phase:
- Executes in parallel the map function on the lines assigned to it
- Reads the input data assigned to a mapper and feeds it into a programmer-defined map method
- Feeds the output ...
- Transmits each ...

Automation:
- Automation provided based on a Map data structure rather than a loop, which can simulate any collection, bean, or array
- Provides a distributed file system in which different parts of the output can be placed from different locations
- Automatically determines the number and location of mappers and reducers based on the input file distribution: the process goes to the data rather than vice versa
- Automatically splits the input data among the mappers
- Automatically splits keys (as opposed to iterations) among the reducers
- Computed results can be combined by multiple reducers
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
RUNNING WORDCOUNT
//Run the application:
$ bin/hadoop jar /usr/joe/wordcount.jar mapreduce.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
//Output:
$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
the 2
an 1
a 2
COMPILING AND JARING
Usage: assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the Hadoop version installed, compile WordCount.java and create a jar:
$ mkdir wordcount_classes
$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java
$ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ .
CREATING INPUT
Assuming that:
/usr/joe/wordcount/input - input directory in HDFS
/usr/joe/wordcount/output - output directory in HDFS

Sample text files as input:
$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
the, a, an
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
the,a
MAP CLASS

public class WordCount {
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
REMOTE AND LOCAL REDUCE CLASS

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
MAIN CLASS

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
}
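The Hadoop job above needs a cluster (or at least the Hadoop jars) to run. As a sketch only, the same map/combine/reduce pipeline can be simulated sequentially in plain Java; the class and method names below are stand-ins, not Hadoop's API:

```java
import java.util.*;

public class WordCountSim {
    // Stand-in for the Map class: tokenize a line and emit (word, 1) for each token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("[,\\s]+"))
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        return out;
    }

    // Stand-in for the Reduce class (usable as both combiner and reducer): sum the counts.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Stand-in for JobClient.runJob: shuffle map outputs by key, then reduce each group.
    static Map<String, Integer> runJob(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        Map<String, Integer> counts = new TreeMap<>();
        groups.forEach((k, vs) -> counts.put(k, reduce(k, vs)));
        return counts;
    }

    public static void main(String[] args) {
        // The two input files from the slides: "the, a, an" and "the,a"
        System.out.println(runJob(List.of("the, a, an", "the,a"))); // {a=2, an=1, the=2}
    }
}
```

Using the same reduce function as both combiner and reducer works here only because summation is commutative and associative, which is exactly the property the earlier model comparison calls out.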
ASSIGNMENT
[Diagram: a bounded buffer feeds the input to local and remote partial (map-side) reductions; the partial results are joined after a barrier; the keys are then split for final local and remote reductions, whose output is written after the final reduction]
PARALLEL-DISTRIBUTED ASSIGNMENT

[Diagram: a client submits Words: [the, a, a, the, an]; master (M), slave (S), and combiner/reducer (C/R) threads on the client and server exchange (word, 1) pairs, producing Counts: {the=2, a=2, an=1}]

ASSIGNMENT: OPENMP + MAPREDUCE
Like MapReduce, distributes computation among multiple processes and threads without assuming distributed shared memory
Like OpenMP, assumes a single master process is the source and sink of the input and output data, respectively
Like MapReduce, splits work based on data rather than a loop: a la OpenMP dynamic scheduling with chunk size 1, data items rather than loop indices are split for the first reduction, except that there is no incremental reduction, to ease implementation effort while allowing one message to be sent to remote clients for each sublist
As in MapReduce, map-based parallelism and distribution, and key-based splitting for the final reduction
ASSIGNMENT: OPENMP + MAPREDUCE
The number of slave threads and the input to be processed are determined through interactive input handled by an MVC architecture
Uses a bounded buffer of (Key, Value) pairs to distribute work to the slave threads of the master process, a la dynamic scheduling
Like MapReduce, does key-based final reduction in parallel
A slave thread can delegate work to a bound client registered dynamically with the server, so, like MapReduce, it distributes work
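The bounded-buffer scheduling described above can be sketched with `java.util.concurrent`. This is only an illustration of the idea, not the assignment's actual architecture: the class name, buffer size, poison-pill shutdown, and the use of a concurrent map for the reduction are all assumptions.

```java
import java.util.*;
import java.util.concurrent.*;

public class BoundedBufferScheduling {
    // Master puts words into a bounded buffer; slave threads pull them one at a
    // time, a la dynamic scheduling with chunk size 1, and reduce by key as they go.
    static Map<String, Integer> run(List<String> words, int numSlaves) {
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(4); // the bounded buffer
        ConcurrentMap<String, Integer> counts = new ConcurrentHashMap<>();
        final String POISON = "\u0000END";                          // shutdown marker, one per slave
        Thread[] slaves = new Thread[numSlaves];
        for (int i = 0; i < numSlaves; i++) {
            slaves[i] = new Thread(() -> {
                try {
                    for (String w = buffer.take(); !w.equals(POISON); w = buffer.take())
                        counts.merge(w, 1, Integer::sum);           // key-based reduction
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            slaves[i].start();
        }
        try {
            for (String w : words) buffer.put(w);                   // master is the data source
            for (int i = 0; i < numSlaves; i++) buffer.put(POISON);
            for (Thread s : slaves) s.join();                       // wait for all slaves to drain
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new TreeMap<>(counts);
    }

    public static void main(String[] args) {
        // The assignment's example input: [the, a, a, the, an]
        System.out.println(run(List.of("the", "a", "a", "the", "an"), 2)); // {a=2, an=1, the=2}
    }
}
```

Whichever slave is free takes the next pair, which is what makes the split dynamic rather than the static loop-index split of a plain OpenMP parallel for.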
THE END