Parallel Programming Patterns Overview and Map Pattern

Parallel Computing CIS 410/510 Department of Computer and Information Science

Lecture 5 – Parallel Programming Patterns - Map

Outline

q Parallel programming models
q Dependencies
q Structured programming patterns overview
  ❍ Serial / parallel control flow patterns
  ❍ Serial / parallel data management patterns
q Map pattern
  ❍ Optimizations
    ◆ Sequences of maps
    ◆ Code fusion
    ◆ Cache fusion
  ❍ Related patterns
  ❍ Example: Scaled Vector Addition (SAXPY)

Parallel Models 101

q Sequential models
  ❍ von Neumann (RAM) model

q Parallel model
  ❍ A parallel computer is simply a collection of processors interconnected in some manner to coordinate activities and exchange data
  ❍ Models that can be used as general frameworks for describing and analyzing parallel algorithms
    ◆ Simplicity: description, analysis, architecture independence
    ◆ Implementability: able to be realized, reflect performance
q Three common parallel models
  ❍ Directed acyclic graphs, shared memory, network

Directed Acyclic Graphs (DAG)

q Captures data flow parallelism
q Nodes represent operations to be performed
  ❍ Inputs are nodes with no incoming arcs
  ❍ Outputs are nodes with no outgoing arcs
  ❍ Think of nodes as tasks
q Arcs are paths for the flow of data results
q The DAG represents the operations of the algorithm and implies precedence constraints on their order

    for (i=1; i<100; i++)
        a[i] = a[i-1] + 100;

  (dependence chain: a[0] → a[1] → … → a[99])

Shared Memory Model

q Parallel extension of the RAM model (PRAM)
  ❍ Memory size is infinite
  ❍ Number of processors is unbounded
  ❍ Processors communicate via the shared memory
  ❍ Every processor accesses any memory location in 1 cycle
  ❍ Synchronous
    ◆ All processors execute the same algorithm synchronously
      – READ phase
      – COMPUTE phase
      – WRITE phase
    ◆ Some subset of the processors can stay idle
  ❍ Asynchronous

Network Model

q G = (N, E)
  ❍ N are processing nodes
  ❍ E are bidirectional communication links
q Each processor has its own memory

q No shared memory is available
q Network operation may be synchronous or asynchronous
q Requires communication primitives
  ❍ Send (X, i)
  ❍ Receive (Y, j)
q Captures message passing model for algorithm design

Parallelism

q Ability to execute different parts of a computation concurrently on different machines
q Why do you want parallelism?
  ❍ Shorter running time or handling more work
q What is being parallelized?
  ❍ Task: instruction, statement, procedure, …
  ❍ Data: data flow, size, replication
  ❍ Parallelism granularity
    ◆ Coarse-grained versus fine-grained
q Thinking about parallelism
q Evaluation

Why is parallel programming important?

q Parallel programming has matured

❍ Standard programming models ❍ Common machine architectures ❍ Programmer can focus on computation and use suitable programming model for implementation

q Increase portability between models and architectures

q Reasonable hope of portability across platforms

q Problem

❍ Performance optimization is still platform-dependent ❍ Performance portability is a problem

❍ Parallel programming methods are still evolving

Parallel Algorithm

q Recipe to solve a problem “in parallel” on multiple processing elements q Standard steps for constructing a parallel algorithm

❍ Identify work that can be performed concurrently

❍ Partition the concurrent work on separate processors ❍ Properly manage input, output, and intermediate data ❍ Coordinate data accesses and work to satisfy dependencies q Which are hard to do?

Parallelism Views

q Where can we find parallelism? q Program (task) view ❍ Statement level ◆ Between program statements ◆ Which statements can be executed at the same time? ❍ Block level / Loop level / Routine level / Process level ◆ Larger-grained program statements q Data view ❍ How is data operated on? ❍ Where does data reside? q Resource view

Parallelism, Correctness, and Dependence

q Parallel execution, from any point of view, will be constrained by the sequence of operations needed to be performed for a correct result q Parallel execution must address control, data, and system dependences q A dependency arises when one operation depends on an earlier operation to complete and produce a result before this later operation can be performed q We extend this notion of dependency to resources since some operations may depend on certain resources ❍ For example, due to where data is located

Executing Two Statements in Parallel

q Want to execute two statements in parallel
q On one processor:
    Statement 1;
    Statement 2;
q On two processors:
    Processor 1:      Processor 2:
    Statement 1;      Statement 2;

q Fundamental (concurrent) execution assumption
  ❍ Processors execute independently of each other
  ❍ No assumptions made about the speed of processor execution

Sequential Consistency in Parallel Execution

q Case 1:

    Processor 1:      Processor 2:
    statement 1;
                      statement 2;
    (time runs downward)

q Case 2:
    Processor 1:      Processor 2:
                      statement 2;
    statement 1;

q Sequential consistency

❍ Statement executions do not interfere with each other
❍ Computation results are the same (independent of order)

Independent versus Dependent

q In other words the execution of statement1; statement2; must be equivalent to statement2; statement1;

q Their order of execution must not matter! q If true, the statements are independent of each other q Two statements are dependent when the order of their execution affects the computation outcome

Examples

q Example 1 – Independent
    S1: a = 1;
    S2: b = 1;
q Example 2 – Dependent: true (flow) dependence
    S1: a = 1;
    S2: b = a;
  The second statement depends on the first. Can you remove the dependency?
q Example 3 – Dependent: output dependence
    S1: a = f(x);
    S2: a = b;
  The second statement depends on the first. Can you remove the dependency? How?
q Example 4 – Dependent: anti-dependence
    S1: a = b;
    S2: b = 1;
  The first statement depends on the second (it must read b before S2 overwrites it). Can you remove the dependency? How?

(One way to remove the name dependences of Examples 3 and 4 is sketched below.)
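A minimal C sketch of removing the name dependences in Examples 3 and 4 by renaming; the temporaries a1 and b_old and the function f are illustrative only, not part of the original examples:

#include <stdio.h>

static int f(int v) { return v + 10; }   /* placeholder for Example 3's f(x) */

int main(void) {
    int x = 1, b = 2, a;

    /* Example 3 (output dependence: a = f(x); a = b;)
       Rename the first write so the two writes no longer collide. */
    int a1 = f(x);    /* S1 now writes the temporary a1 instead of a */
    a = b;            /* S2 can execute independently of S1 */

    /* Example 4 (anti-dependence: a = b; b = 1;)
       Copy the value S1 reads so S2 may overwrite b at any time. */
    int b_old = b;    /* capture b before S2 writes it */
    a = b_old;        /* S1 reads the copy */
    b = 1;            /* S2 no longer conflicts with S1's read */

    printf("a=%d a1=%d b=%d\n", a, a1, b);
    return 0;
}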

True Dependence and Anti-Dependence

q Given statements S1 and S2, S1; S2;

q S2 has a true (flow) dependence on S1 (S1 δ S2)
  if and only if S2 reads a value written by S1:
    S1: X = ...
    S2: ... = X

q S2 has an anti-dependence on S1 (S1 δ⁻¹ S2)
  if and only if S2 writes a value read by S1:
    S1: ... = X
    S2: X = ...

Output Dependence

q Given statements S1 and S2, S1; S2;
q S2 has an output dependence on S1 (S1 δ⁰ S2)
  if and only if S2 writes a variable written by S1:
    S1: X = ...
    S2: X = ...

q Anti- and output dependences are “name” dependences
  ❍ Are they “true” dependences?
q How can you get rid of output dependences?
  ❍ Are there cases where you cannot?

Statement Dependency Graphs

q Can use graphs to show dependence relationships
q Example
    S1: a = 1;
    S2: b = a;
    S3: a = b + 1;
    S4: c = a;

q S2 δ S3 : S3 is flow-dependent on S2
q S1 δ⁰ S3 : S3 is output-dependent on S1
q S2 δ⁻¹ S3 : S3 is anti-dependent on S2

When can two statements execute in parallel?

q Statements S1 and S2 can execute in parallel if and only if there are no dependences between S1 and S2

❍ True dependences ❍ Anti-dependences ❍ Output dependences

q Some dependences can be removed by modifying the program
  ❍ Rearranging statements
  ❍ Eliminating statements

How do you compute dependence?

q Data dependence relations can be found by comparing the IN and OUT sets of each node q The IN and OUT sets of a statement S are defined as: ❍ IN(S) : set of memory locations (variables) that may be used in S ❍ OUT(S) : set of memory locations (variables) that may be modified by S q Note that these sets include all memory locations that may be fetched or modified q As such, the sets can be conservatively large

IN / OUT Sets and Computing Dependence

q Assuming that there is a path from S1 to S2 , the following shows how to intersect the IN and OUT sets to test for data dependence

    OUT(S1) ∩ IN(S2)  ≠ ∅   ⇒   S1 δ S2      (flow dependence)
    IN(S1)  ∩ OUT(S2) ≠ ∅   ⇒   S1 δ⁻¹ S2    (anti-dependence)
    OUT(S1) ∩ OUT(S2) ≠ ∅   ⇒   S1 δ⁰ S2     (output dependence)

Loop-Level Parallelism

q Significant parallelism can be identified within loops

for (i=0; i<100; i++)
    S1: a[i] = i;

for (i=0; i<100; i++) {
    S1: a[i] = i;
    S2: b[i] = 2*i;
}

q Dependencies? What about i, the loop index?
q DOALL loop (a.k.a. foreach loop)
  ❍ All iterations are independent of each other
  ❍ All statements can be executed in parallel at the same time
    ◆ Is this really true? (An OpenMP sketch of the parallel version follows.)
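A minimal OpenMP sketch of this DOALL loop, assuming arrays a and b are provided by the caller; every iteration is independent, so iterations can be distributed across threads and the loop index is automatically private:

void doall_example(int a[100], int b[100]) {
    #pragma omp parallel for
    for (int i = 0; i < 100; i++) {
        a[i] = i;         /* S1 */
        b[i] = 2 * i;     /* S2 */
    }
}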

Iteration Space

q Unroll loop into separate statements / iterations q Show dependences between iterations

for (i=0; i<100; i++)
    S1: a[i] = i;

for (i=0; i<100; i++) {
    S1: a[i] = i;
    S2: b[i] = 2*i;
}

Iteration space:
    S1_0  S1_1  …  S1_99        S1_0  S1_1  …  S1_99
                                S2_0  S2_1  …  S2_99

Multi-Loop Parallelism

q Significant parallelism can be identified between loops

for (i=0; i<100; i++)
    a[i] = i;          →   a[0]  a[1]  …  a[99]

for (i=0; i<100; i++)
    b[i] = i;          →   b[0]  b[1]  …  b[99]

q Dependencies?

q How much parallelism is available?

q Given 4 processors, how much parallelism is possible?

q What parallelism is achievable with 50 processors?

Loops with Dependencies

Case 1:                            Case 2:
for (i=1; i<100; i++)              for (i=5; i<100; i++)
    a[i] = a[i-1] + 100;               a[i-5] = a[i] + 100;

Case 1 dependence chain:   a[0] → a[1] → … → a[99]
Case 2 dependence chains:  a[0] a[5] a[10] …;  a[1] a[6] a[11] …;  a[2] a[7] a[12] …;  a[3] a[8] a[13] …;  a[4] a[9] a[14] …

q Dependencies? What type?
q Is the Case 1 loop parallelizable?
q Is the Case 2 loop parallelizable? (One possible transformation is sketched below.)
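Case 2 carries only an anti-dependence: iteration i reads a[i], which iteration i+5 later overwrites. A hedged C/OpenMP sketch of one possible transformation, reading from an unmodified copy (and ignoring the cost of that copy); Case 1's loop-carried flow dependence cannot be removed this way:

#include <string.h>

void case2_parallel(int a[100]) {
    int old[100];
    memcpy(old, a, sizeof old);        /* snapshot removes the anti-dependence */
    #pragma omp parallel for
    for (int i = 5; i < 100; i++)
        a[i - 5] = old[i] + 100;       /* every iteration is now independent */
}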

Another Loop Example

for (i=1; i<100; i++)
    a[i] = f(a[i-1]);

q Dependencies?

❍ What type? q Loop iterations are not parallelizable

❍ Why not?

Loop Dependencies

q A loop-carried dependence is a dependence that is present only if the statements are part of the execution of a loop (i.e., between two statement instances in two different iterations of the loop)
q Otherwise, it is loop-independent, including between two statement instances in the same loop iteration
q Loop-carried dependences can prevent loop iteration parallelization
q The dependence is lexically forward if the source comes before the target, and lexically backward otherwise
  ❍ Unroll the loop to see

Loop Dependence Example

for (i=0; i<100; i++)
    a[i+10] = f(a[i]);

q Dependencies?

❍ Between a[10], a[20], … ❍ Between a[11], a[21], …

q Some parallel execution is possible

❍ How much?

Dependences Between Iterations

for (i=1; i<100; i++) {
    S1: a[i] = …;
    S2: … = a[i-1];
}

q Dependencies?

❍ Between a[i] and a[i-1]

q Is parallelism possible?

❍ Statements can be executed in a “pipelined” manner

Another Loop Dependence Example

for (i=0; i<100; i++)
    for (j=1; j<100; j++)
        a[i][j] = f(a[i][j-1]);

q Dependencies?
  ❍ Loop-independent dependence on i
  ❍ Loop-carried dependence on j
q Which loop can be parallelized?
  ❍ Outer loop parallelizable (see the sketch below)
  ❍ Inner loop cannot be parallelized
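A sketch of the outer-loop parallelization with OpenMP; f is the slide's unspecified function, passed here as a pointer purely for illustration:

void outer_parallel(float a[100][100], float (*f)(float)) {
    #pragma omp parallel for              /* i iterations are independent */
    for (int i = 0; i < 100; i++)
        for (int j = 1; j < 100; j++)     /* j must stay serial: a[i][j] needs a[i][j-1] */
            a[i][j] = f(a[i][j - 1]);
}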

Still Another Loop Dependence Example

for (j=1; j<100; j++)
    for (i=0; i<100; i++)
        a[i][j] = f(a[i][j-1]);

q Dependencies?
  ❍ Loop-independent dependence on i
  ❍ Loop-carried dependence on j
q Which loop can be parallelized?
  ❍ Inner loop parallelizable
  ❍ Outer loop cannot be parallelized
  ❍ Less desirable (why?)

Key Ideas for Dependency Analysis

q To execute in parallel: ❍ Statement order must not matter ❍ Statements must not have dependences

q Some dependences can be removed

q Some dependences may not be obvious

Dependencies and Synchronization

q How is parallelism achieved when there are dependencies?

❍ Think about concurrency ❍ Some parts of the execution are independent ❍ Some parts of the execution are dependent

q Must control ordering of events on different processors (cores)

❍ Dependencies pose constraints on parallel event ordering ❍ Partial ordering of execution action

q Use synchronization mechanisms ❍ Need for concurrent execution too

❍ Maintains partial order

Parallel Patterns

q Parallel Patterns: A recurring combination of task distribution and data access that solves a specific problem in parallel algorithm design.

q Patterns provide us with a “vocabulary” for algorithm design q It can be useful to compare parallel patterns with serial patterns

q Patterns are universal – they can be used in any parallel programming system

Parallel Patterns

q Nesting Pattern q Serial / Parallel Control Patterns

q Serial / Parallel Data Management Patterns

q Other Patterns

q Programming Model Support for Patterns

Nesting Pattern

q Nesting is the ability to hierarchically compose patterns q This pattern appears in both serial and parallel algorithms

q “Pattern diagrams” are used to visually show the pattern idea where each “task block” is a location of general code in an algorithm

q Each “task block” can in turn be another pattern in the nesting pattern

Nesting Pattern

Nesting Pattern: a compositional pattern. Nesting allows other patterns to be composed in a hierarchy so that any task block in the diagram can be replaced with a pattern with the same input/output and dependencies.

Serial Control Patterns

q Structured serial programming is based on these patterns: sequence, selection, iteration, and recursion

q The nesting pattern can also be used to hierarchically compose these four patterns q Though you should be familiar with these, it’s extra important to understand these patterns when parallelizing serial algorithms based on these patterns

Serial Control Patterns: Sequence

q Sequence: Ordered list of tasks that are executed in a specific order q Assumption – program text ordering will be followed (obvious, but this will be important when parallelized)

Serial Control Patterns: Selection

q Selection: condition c is first evaluated. Either task a or b is executed depending on the true or false result of c.

q Assumptions – a and b are never executed before c, and only a or b is executed - never both

Serial Control Patterns: Iteration

q Iteration: a condition c is evaluated. If true, a is evaluated, and then c is evaluated again. This repeats until c is false.

q Complication when parallelizing: potential for dependencies to exist between previous iterations

Serial Control Patterns: Recursion

q Recursion: dynamic form of nesting allowing functions to call themselves q Tail recursion is a special recursion that can be converted into iteration – important for functional languages

Parallel Control Patterns

q Parallel control patterns extend serial control patterns q Each parallel control pattern is related to at least one serial control pattern, but relaxes assumptions of serial control patterns q Parallel control patterns: fork-join, map, stencil, reduction, scan, recurrence

Parallel Control Patterns: Fork-Join

q Fork-join: allows control flow to fork into multiple parallel flows, then rejoin later
q Cilk Plus implements this with spawn and sync
  ❍ The call tree is a parallel call tree and functions are spawned instead of called
  ❍ A function that spawns another function continues to execute
  ❍ The caller syncs with the spawned function to join the two
q A “join” is different from a “barrier”
  ❍ Sync – only one flow continues
  ❍ Barrier – all threads continue
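A minimal fork-join sketch using the Cilk Plus keywords (the classic Fibonacci example, shown only to illustrate spawn/sync): cilk_spawn forks a child, the parent keeps working, and cilk_sync joins.

#include <cilk/cilk.h>

long fib(long n) {
    if (n < 2) return n;
    long x = cilk_spawn fib(n - 1);   /* fork: child may run in parallel */
    long y = fib(n - 2);              /* parent continues with its own work */
    cilk_sync;                        /* join: wait for the spawned child */
    return x + y;
}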

Parallel Control Patterns: Map

q Map: performs a function over every element of a collection q Map replicates a serial iteration pattern where each iteration is independent of the others, the number of iterations is known in advance, and computation only depends on the iteration count and data from the input collection

q The replicated function is referred to as an “elemental function”

Input → Elemental Function → Output

Parallel Control Patterns: Stencil

q Stencil: the elemental function accesses a set of “neighbors”; stencil is a generalization of map
q Often combined with iteration – used with iterative solvers or to evolve a system through time

q Boundary conditions must be handled carefully in the stencil pattern q See stencil lecture…

Parallel Control Patterns: Reduction

q Reduction: Combines every element in a collection using an associative “combiner function”

q Because of the associativity of the combiner function, different orderings of the reduction are possible

q Examples of combiner functions: addition, multiplication, maximum, minimum, and Boolean AND, OR, and XOR
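A sum reduction sketched in C with OpenMP; because + is associative (and commutative), each thread can accumulate a partial sum and the partial sums can be combined in any order:

double sum_reduce(const double *x, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i];                  /* per-thread partial sums, combined at the end */
    return sum;
}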

Parallel Control Patterns: Reduction

Serial Reduction vs. Parallel Reduction

Parallel Control Patterns: Scan

q Scan: computes all partial reductions of a collection
q For every output position in the collection, a reduction of the input up to that point is computed
q If the function being used is associative, the scan can be parallelized

q Parallelizing a scan is not obvious at first, because of dependencies to previous iterations in the serial loop

q A parallel scan will require more operations than a serial version
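For contrast, a serial inclusive scan (prefix sum) in C; the loop-carried dependence on running is exactly what a parallel scan must work around, at the cost of extra operations:

void inclusive_scan(const int *in, int *out, int n) {
    int running = 0;
    for (int i = 0; i < n; i++) {
        running += in[i];             /* reduction of in[0..i] */
        out[i] = running;
    }
}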

Parallel Control Patterns: Scan

Serial Scan Parallel Scan

Parallel Control Patterns: Recurrence

q Recurrence: More complex version of map, where the loop iterations can depend on one another q Similar to map, but elements can use outputs of adjacent elements as inputs

q For a recurrence to be computable, there must be a serial ordering of the recurrence elements so that elements can be computed using previously computed outputs

Serial Data Management Patterns

q Serial programs can manage data in many ways q Data management deals with how data is allocated, shared, read, written, and copied q Serial Data Management Patterns: random read and write, stack allocation, heap allocation, objects

Serial Data Management Patterns: Random Read and Write

q Memory locations indexed with addresses q Pointers are typically used to refer to memory addresses q Aliasing (uncertainty of two pointers referring to the same object) can cause problems when serial code is parallelized

Serial Data Management Patterns: Stack Allocation

q Stack allocation is useful for dynamically allocating data in LIFO manner q Efficient – arbitrary amount of data can be allocated in constant time

q Stack allocation also preserves locality

q When parallelized, typically each thread will get its own stack so thread locality is preserved

Serial Data Management Patterns: Heap Allocation

q Heap allocation is useful when data cannot be allocated in a LIFO fashion q But, heap allocation is slower and more complex than stack allocation

q A parallelized heap allocator should be used when dynamically allocating memory in parallel ❍ This type of allocator will keep separate pools for each parallel worker

Serial Data Management Patterns: Objects

q Objects are language constructs to associate data with code to manipulate and manage that data q Objects can have member functions, and they also are considered members of a class of objects

q Parallel programming models will generalize objects in various ways

Parallel Data Management Patterns

q To avoid things like race conditions, it is critically important to know when data is, and isn’t, potentially shared by multiple parallel workers

q Some parallel data management patterns help us with data locality q Parallel data management patterns: pack, pipeline, geometric decomposition, gather, and scatter

Parallel Data Management Patterns: Pack

q Pack is used to eliminate unused space in a collection
q Elements marked false are discarded; the remaining elements are placed in a contiguous sequence in the same order
q Useful when used with map
q Unpack is the inverse and is used to place elements back in their original locations
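A serial C sketch of pack, where keep[] plays the role of the true/false marks; a parallel pack would typically use a scan of keep to compute each surviving element's output position:

int pack(const int *in, const int *keep, int *out, int n) {
    int m = 0;
    for (int i = 0; i < n; i++)
        if (keep[i])
            out[m++] = in[i];         /* surviving elements stay in order */
    return m;                         /* packed length */
}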

Parallel Data Management Patterns: Pipeline

q Pipeline connects tasks in a producer-consumer manner
q A linear pipeline is the basic pattern idea, but a pipeline in a DAG is also possible
q Pipelines are most useful when used with other patterns, as they can multiply the available parallelism

Parallel Data Management Patterns: Geometric Decomposition

q Geometric decomposition – arranges data into subcollections
q Overlapping and non-overlapping decompositions are possible
q This pattern doesn’t necessarily move data; it just gives us another view of it

Parallel Data Management Patterns: Gather

q Gather reads a collection of data given a collection of indices
q Think of a combination of map and random serial reads
q The output collection shares the same type as the input collection, but it shares the same shape as the index collection

Parallel Data Management Patterns: Scatter

q Scatter is the inverse of gather
q A collection of input elements and a collection of indices are required, but each input element is written to the output at the given index instead of being read from the input at that index
q Race conditions can occur when there are two writes to the same location!
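Serial C sketches of gather and scatter (types and names are illustrative); the write in scatter is where duplicate indices turn into a race under parallel execution:

void gather(const float *in, const int *idx, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = in[idx[i]];          /* output has the shape of idx */
}

void scatter(const float *in, const int *idx, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[idx[i]] = in[i];          /* unsafe in parallel if idx repeats a value */
}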

Other Parallel Patterns

q Superscalar Sequences: write a sequence of tasks, ordered only by dependencies q Futures: similar to fork-join, but tasks do not need to be nested hierarchically

q Speculative Selection: general version of serial selection where the condition and both outcomes can all run in parallel

q Workpile: general map pattern where each instance of elemental function can generate more instances, adding to the “pile” of work

Other Parallel Patterns

q Search: finds some data in a collection that meets some criteria
q Segmentation: operations on subdivided, non-overlapping, non-uniformly sized partitions of 1D collections
q Expand: a combination of pack and map

q Category Reduction: Given a collection of elements each with a label, find all elements with same label and reduce them

Programming Model Support for Patterns



Map Pattern - Overview

q Map
q Optimizations
  ❍ Sequences of maps
  ❍ Code fusion
  ❍ Cache fusion
q Related patterns
q Example implementation: Scaled Vector Addition (SAXPY)
  ❍ Problem description
  ❍ Various implementations

Mapping

q “Do the same thing many times”
    foreach i in foo:
        do something
q Well-known higher-order function in languages like ML, Haskell, Scala:
    map : ∀ a b. (a → b) → List a → List b
  applies a function to each element in a list and returns a list of results

Example Maps

Add 1 to every item in an array:    [0 4 5 3 1 0]  →  [1 5 6 4 2 1]
Double every item in an array:      [3 7 0 1 4 0]  →  [6 14 0 2 8 0]

Key Point: An operation is a map if it can be applied to each element without knowledge of neighbors.

Key Idea

q Map is a “foreach loop” where each iteration is independent

Embarrassingly Parallel

Independence is a big win. We can run map completely in parallel. Significant speedups! More precisely: T(∞) is O(1) plus implementation overhead that is O(log n), so T(∞) ∈ O(log n).

Sequential Map

for (int n = 0; n < array.length; ++n) {
    process(array[n]);
}

Parallel Map

parallel_for_each (x in array) {
    process(x);
}
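One concrete realization of this pseudocode, sketched in C with OpenMP; process is hypothetical and stands in for the elemental function:

void parallel_map(float *array, int n, float (*process)(float)) {
    #pragma omp parallel for              /* each element handled by exactly one iteration */
    for (int i = 0; i < n; i++)
        array[i] = process(array[i]);
}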

Comparing Maps: Serial Map vs. Parallel Map


The gap between the two timelines is the speedup: with the parallel map, our program finished execution early, while the serial map is still running.

Independence

q The key to (embarrassing) parallelism is independence
    Warning: No shared state!
  The map function should be “pure” (or “pure-ish”) and should not modify shared state
q Modifying shared state breaks perfect independence
q Results of accidentally violating independence:
  ❍ non-determinism
  ❍ data races
  ❍ undefined behavior
  ❍ segfaults

Implementation and API

q OpenMP and Cilk Plus contain a parallel for language construct
q Map is a mode of use of parallel for
q TBB uses higher-order functions with lambda expressions/“functors”

q Some languages (CilkPlus, Matlab, Fortran) provide array notation which makes some maps more concise

Array Notation

A[:] = A[:]*5; is CilkPlus array notation for “multiply every element in A by 5”

Unary Maps

So far we have only dealt with mapping over a single collection…

Map with 1 Input, 1 Output

index    0   1   2   3   4   5   6   7   8   9  10  11
x        3   7   0   1   4   0   0   4   5   3   1   0
result   6  14   0   2   8   0   0   8  10   6   2   0

int oneToOne(int x) {    // elemental function, applied to each element of the input
    return x * 2;
}

N-ary Maps

But sometimes it makes sense to map over multiple collections at once…

Map with 2 Inputs, 1 Output

index    0   1   2   3   4   5   6   7   8   9  10  11
x        3   7   0   1   4   0   0   4   5   3   1   0
y        2   4   2   1   8   3   9   5   5   1   2   1
result   5  11   2   2  12   3   9   9  10   4   3   1

int twoToOne(int x, int y) {    // elemental function over corresponding elements of the two inputs
    return x + y;
}

Introduction to Parallel Computing, University of Oregon, IPCC Lecture 5 – Parallel Programming Patterns - Map 81 To protect the rights of the author(s) and publisher we inform you that this PDF is an uncorrected proof for internal business use only by the author(s), editor(s), reviewer(s), Elsevier and typesetter diacriTech. It is not allowed to publish this proof online or in print. This proof copy is the copyright property of the publisher and is confidential until formal publication. McCool — e9780124159938 — 2012/6/6 — 23:09 — Page 139 — #139



Optimization – Sequences of Maps

(The slide shows §4.4, “Sequence of Maps versus Map of Sequence,” from McCool et al., Structured Parallel Programming.)

q Often several map operations occur in sequence

❍ Vector math consists of many small operations such as additions and multiplications applied as maps

q A naïve implementation may write each intermediate result to memory, wasting memory BW and likely overwhelming the cache

Figure 4.2 (McCool et al.): Code fusion optimization — convert a sequence of maps into a map of sequences, avoiding the need to write intermediate results to memory. This can be done automatically by ArBB and explicitly in other programming models.




Optimization – Code Fusion

q Can sometimes “fuse” together the operations to perform them all at once
q Increases arithmetic intensity, reduces memory/cache usage
q Ideally, operations can be performed using registers alone

(See Figure 4.2 above: code fusion converts a sequence of maps into a map of sequences. A sketch follows.)
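A hedged C/OpenMP sketch of the idea, using an arbitrary pair of elementwise operations (a*x[i] followed by +b, chosen purely for illustration); the fused version keeps the intermediate value in a register instead of a temporary array:

/* Unfused: two maps, with an intermediate array t that costs memory bandwidth. */
void unfused(const float *x, float *t, float *y, int n, float a, float b) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) t[i] = a * x[i];      /* map 1 */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) y[i] = t[i] + b;      /* map 2 */
}

/* Fused: one map whose body is the whole sequence. */
void fused(const float *x, float *y, int n, float a, float b) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        float t = a * x[i];                           /* stays in a register */
        y[i] = t + b;
    }
}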

Optimization – Cache Fusion

q Sometimes it is impractical to fuse together the map operations
q Can instead break the work into blocks (tiles), giving each CPU one block at a time
q Hopefully, operations then use the cache alone

Figure 4.3 (McCool et al.): Cache fusion optimization — process sequences of maps in small tiles sequentially. When code fusion is not possible, a sequence of maps can be broken into small tiles and each tile processed sequentially; this avoids synchronization between the individual maps and, if the tiles are small enough, keeps intermediate data in cache. Code fusion is still preferred when possible, since registers are faster than cache, but cache fusion is useful when there is no access to the code inside the individual maps (for example, precompiled user-defined functions such as image-processing plugins).
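A cache-fusion sketch in the same spirit, for the case where the two map bodies cannot be fused (e.g., they are opaque precompiled functions); TILE, f, and g are illustrative assumptions, not part of the original slide:

#define TILE 1024

void cache_fused(const float *x, float *y, int n,
                 float (*f)(float), float (*g)(float)) {
    #pragma omp parallel for
    for (int start = 0; start < n; start += TILE) {
        int end = (start + TILE < n) ? start + TILE : n;
        float tmp[TILE];                              /* per-tile intermediate, cache resident */
        for (int i = start; i < end; i++) tmp[i - start] = f(x[i]);   /* map 1 on the tile */
        for (int i = start; i < end; i++) y[i] = g(tmp[i - start]);   /* map 2 on the tile */
    }
}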

Related Patterns

Three patterns related to map are discussed here:
  ❍ Stencil
  ❍ Workpile

❍ Divide-and-Conquer

More detail presented in a later lecture

Stencil

q Each instance of the map function accesses neighbors of its input, offset from its usual input
q Common in imaging and PDE solvers

Figure 7.1 (McCool et al.): Stencil pattern — the stencil pattern combines a local, structured gather with a function to combine the results into a single output for each input neighborhood.


Workpile

q Work items can be added to the map while it is in progress, from inside map function instances q Work grows and is consumed by the map

q Workpile pattern terminates when no more work is available


Divide-and-Conquer

q Applies if a problem can be divided into smaller subproblems recursively until a base case is reached that can be solved serially

FIGURE 6.17 Partitioning in 2D. The partition pattern can be extended to multiple dimensions.

Example: Scaled Vector Addition (SAXPY)

q y ← ax + y ❍ Scales vector x by a and adds it to vector y ❍ Result is stored in input vector y

q Comes from the BLAS (Basic Linear Algebra Subprograms) library
q Every element of vector x and vector y can be processed independently

What does y ← ax + y look like?

index        0   1   2   3   4   5   6   7   8   9  10  11
a            4   4   4   4   4   4   4   4   4   4   4   4
x            2   4   2   1   8   3   9   5   5   1   2   1
y            3   7   0   1   4   0   0   4   5   3   1   0

y ← a*x + y
            11  23   8   5  36  12  36  24  25   7   9   4

Visual: y ← ax + y


Twelve processors used → one for each element in the vector

Visual: y ← ax + y


Six processors used → one for every two elements in the vector

Visual: y ← ax + y


Two processors used → one for every six elements in the vector



Serial SAXPY Implementation

void saxpy_serial(
    size_t n,         // the number of elements in the vectors
    float a,          // scale factor
    const float x[],  // the first input vector
    float y[]         // the output vector and second input vector
) {
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

Listing 4.1: Serial implementation of SAXPY in C.




TBB SAXPY Implementation

void saxpy_tbb(
    int n,       // the number of elements in the vectors
    float a,     // scale factor
    float x[],   // the first input vector
    float y[]    // the output vector and second input vector
) {
    tbb::parallel_for(
        tbb::blocked_range<int>(0, n),
        [&](tbb::blocked_range<int> r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                y[i] = a * x[i] + y[i];
        }
    );
}

Listing 4.2: Tiled implementation of SAXPY in TBB. Tiling not only leads to better spatial locality but also exposes opportunities for vectorization by the host compiler.





Cilk Plus SAXPY Implementation

void saxpy_cilk(
    int n,      // the number of elements in the vectors
    float a,    // scale factor
    float x[],  // the first input vector
    float y[]   // the output vector and second input vector
) {
    cilk_for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

Listing 4.3: SAXPY in Cilk Plus using cilk_for.

void saxpy_array_notation(
    int n,      // the number of elements in the vectors
    float a,    // scale factor
    float x[],  // the input vector
    float y[]   // the output vector and offset
) {
    y[0:n] = a * x[0:n] + y[0:n];
}

Listing 4.4: SAXPY in Cilk Plus using array notation for explicit vectorization. Here x[0:n] and y[0:n] refer to n consecutive elements of each array starting at x[0] and y[0]; note that no cilk_for is needed.


Uniform inputs are handled by scalar promotion: When a scalar and an array are combined with an operator, the scalar is conceptually “promoted” to an array of the same length by replication.


OpenMP SAXPY Implementation

void saxpy_openmp(
    int n,      // the number of elements in the vectors
    float a,    // scale factor
    float x[],  // the first input vector
    float y[]   // the output vector and second input vector
) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

Listing 4.5: SAXPY in OpenMP.

OpenMP SAXPY Performance (vector size = 500,000,000)
