Outline: Embarrassingly Parallel Problems

• what they are
• Mandelbrot Set computation
  - cost considerations
  - static parallelization
  - dynamic parallelization and its analysis
• Monte Carlo methods
• parallel random number generation

Ref: Lin and Snyder Ch 5, Wilkinson and Allen Ch 3

Admin: reminder that pracs run this week (get your NCI accounts!); parallel programming poll

Embarrassingly Parallel Problems

• computation can be divided into completely independent parts for execution by separate processors (these correspond to totally disconnected computational graphs)
  - infrastructure: the Berkeley Open Infrastructure for Network Computing (BOINC) project
  - SETI@home and Folding@home are projects solving very large such problems
• part of an application may be embarrassingly parallel
• distribution and collection of data are the key issues (can be non-trivial and/or costly)
• frequently uses the master/slave approach (one master, p − 1 slaves)

[Figure: master/slave structure: the master sends data to the p − 1 slaves and collects their results]

Example #1: Computation of the Mandelbrot Set

The Mandelbrot Set

• a set of points in the complex plane that are quasi-stable
• computed by iterating the function

      z_{k+1} = z_k^2 + c

  - z and c are complex numbers (z = a + bi)
  - z is initially zero
  - c gives the position of the point in the complex plane
• iterations continue until |z| > 2 or some arbitrary iteration limit is reached, where

      |z| = sqrt(a^2 + b^2)

• the set is enclosed by a circle centered at (0,0) of radius 2

Evaluating 1 Point

    typedef struct complex { float real, imag; } complex;
    const int MaxIter = 256;

    /* Iterate z = z^2 + c from z = 0; return the iteration count at which
       |z|^2 exceeds 4 (i.e. |z| > 2), capped at MaxIter. */
    int calc_pixel(complex c) {
        int count = 0;
        complex z = {0.0, 0.0};
        float temp, lengthsq;
        do {
            temp = z.real * z.real - z.imag * z.imag + c.real;
            z.imag = 2 * z.real * z.imag + c.imag;
            z.real = temp;
            lengthsq = z.real * z.real + z.imag * z.imag;
            count++;
        } while (lengthsq < 4.0 && count < MaxIter);
        return count;
    }

Building the Full Image

Define:

• min. and max. values for c (usually −2 to 2)
• number of horizontal and vertical pixels

    for (x = 0; x < width; x++)
        for (y = 0; y < height; y++) {
            c.real = min.real + ((float) x * (max.real - min.real) / width);
            c.imag = min.imag + ((float) y * (max.imag - min.imag) / height);
            color = calc_pixel(c);
            display(x, y, color);
        }

Summary:

• width × height totally independent tasks
• each task can be of a different length (execution time)

Cost Considerations on NCI’s Gadi

• 10 flops per iteration
• maximum 256 iterations per point
• approximate time on one core: 10 × 256 / (2 × 3.2 × 10^9) s ≈ 0.4 µs (arithmetic worked below)
• between two nodes, the time to communicate a single point to the slave and receive the result is ≈ 2 × 1.4 µs (latency limited)
• conclusion: we cannot usefully parallelize over individual points
• the master must also have time to send to all slaves before it can return to sending out further tasks
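To make the conclusion concrete, here is the arithmetic behind those estimates, assuming (as above) a core retiring 2 flops per cycle at 3.2 GHz and a one-way inter-node latency of about 1.4 µs:

    t_comp ≈ (10 flops × 256 iterations) / (2 flops/cycle × 3.2 × 10^9 cycles/s) ≈ 0.4 µs
    t_comm ≈ 2 × 1.4 µs = 2.8 µs

so t_comm / t_comp ≈ 7: even the most expensive single point costs roughly seven times more to communicate than to compute.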

Parallelisation: Static

[Figure: two static decompositions of the width × height pixel grid across processes: square region distribution vs. row distribution]

Static Implementation

Master (assume that height is a multiple of nproc − 1):

    for (slave = 1, row = 0; slave < nproc; slave++) {
        send(&row, 1, slave);
        row = row + height / (nproc - 1);
    }
    for (npixel = 0; npixel < (width * height); npixel++) {
        recv({&x, &y, &color}, 3, any_proc);
        display(x, y, color);
    }

Slave:

    const int master = 0;  /* process id of master */
    recv(&firstrow, 1, master);
    lastrow = firstrow + height / (nproc - 1);
    for (x = 0; x < width; x++)
        for (y = firstrow; y < lastrow; y++) {
            c.real = min.real + ((float) x * (max.real - min.real) / width);
            c.imag = min.imag + ((float) y * (max.imag - min.imag) / height);
            color = calc_pixel(c);
            send({&x, &y, &color}, 3, master);
        }
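The send/recv calls on these slides are generic pseudocode. Below is a minimal sketch of how the slave side might map onto MPI; WIDTH, HEIGHT, the c-plane bounds of −2..2, and the rank-0 master are assumptions carried over from earlier slides, and the tag value 0 is arbitrary:

    #include <mpi.h>

    #define WIDTH  640            /* illustrative image dimensions */
    #define HEIGHT 480

    typedef struct complex { float real, imag; } complex;  /* as earlier */
    int calc_pixel(complex c);                              /* as earlier */

    /* Slave side of the static row decomposition: receive the first row,
       compute HEIGHT/(nproc-1) rows, return one (x, y, colour) triple per
       pixel to the master (rank 0). */
    void slave(int nproc) {
        int firstrow, lastrow, x, y, msg[3];
        complex c;
        MPI_Recv(&firstrow, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        lastrow = firstrow + HEIGHT / (nproc - 1);
        for (x = 0; x < WIDTH; x++)
            for (y = firstrow; y < lastrow; y++) {
                c.real = -2.0f + (float) x * 4.0f / WIDTH;   /* c in [-2,2] */
                c.imag = -2.0f + (float) y * 4.0f / HEIGHT;
                msg[0] = x;  msg[1] = y;  msg[2] = calc_pixel(c);
                MPI_Send(msg, 3, MPI_INT, 0, 0, MPI_COMM_WORLD);
            }
    }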

Dynamic Task Assignment

• discussion point: why would we expect static assignment to be sub-optimal for the Mandelbrot set calculation? Would any regular static decomposition be significantly better (or worse)?
• pool of over-decomposed tasks that are dynamically assigned to the next requesting process

[Figure: work pool containing tasks (x1,y1) … (x7,y7); each slave requests a task from the pool and returns a result]

Processor Farm: Master

    count = 0;
    row = 0;
    for (slave = 1; slave < nproc; slave++) {
        send(&row, 1, slave, data_tag);
        count++;
        row++;
    }
    do {
        recv({&slave, &r, &color}, width + 2, any_proc, result_tag);
        count--;
        if (row < height) {
            send(&row, 1, slave, data_tag);
            row++;
            count++;
        } else
            send(&row, 1, slave, terminator_tag);
        display_vector(r, color);
    } while (count > 0);

Processor Farm: Slave

myid = 1..nproc−1 is the slave’s process id:

    const int master = 0;  /* process id of master */
    recv(&y, 1, master, any_tag, &source_tag);
    while (source_tag == data_tag) {
        c.imag = min.imag + ((float) y * (max.imag - min.imag) / height);
        for (x = 0; x < width; x++) {
            c.real = min.real + ((float) x * (max.real - min.real) / width);
            color[x] = calc_pixel(c);
        }
        send({&myid, &y, color}, width + 2, master, result_tag);
        recv(&y, 1, master, any_tag, &source_tag);
    }
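As a companion to the pseudocode, a minimal sketch of the farm slave in MPI. The key point is that the slave reads the message tag out of MPI_Status to decide whether to keep working; WIDTH, HEIGHT, the −2..2 bounds, and the tag values are illustrative assumptions, not from the slides:

    #include <mpi.h>

    #define WIDTH      640        /* illustrative */
    #define HEIGHT     480
    #define DATA_TAG   1          /* arbitrary tag values */
    #define RESULT_TAG 3

    typedef struct complex { float real, imag; } complex;  /* as earlier */
    int calc_pixel(complex c);                              /* as earlier */

    /* Slave side of the processor farm: a DATA_TAG message carries a row
       number to compute; any other tag signals termination. */
    void farm_slave(int myid) {
        int x, y, msg[WIDTH + 2];
        complex c;
        MPI_Status status;
        MPI_Recv(&y, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        while (status.MPI_TAG == DATA_TAG) {
            c.imag = -2.0f + (float) y * 4.0f / HEIGHT;
            msg[0] = myid;
            msg[1] = y;
            for (x = 0; x < WIDTH; x++) {
                c.real = -2.0f + (float) x * 4.0f / WIDTH;
                msg[2 + x] = calc_pixel(c);
            }
            MPI_Send(msg, WIDTH + 2, MPI_INT, 0, RESULT_TAG, MPI_COMM_WORLD);
            MPI_Recv(&y, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        }
    }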

Analysis

Let p, m, n, I denote nproc, height, width, MaxIter:

• sequential time (t_f denotes the floating-point operation time):

      t_seq ≤ I·m·n·t_f = O(mn)

• parallel communication 1 (neglect the t_h term; assume a message length of 1 word):

      t_com1 = (p − 1)(t_s + t_w)

• parallel computation:

      t_comp ≤ (I·m·n / (p − 1)) · t_f

• parallel communication 2:

      t_com2 = (m / (p − 1)) (t_s + n·t_w)

• overall:

      t_par ≤ (I·m·n / (p − 1)) t_f + (p − 1 + m/(p − 1)) t_s + (p − 1 + mn/(p − 1)) t_w

Discussion point: what assumptions have we been making here? Are there any situations where we might still have poor performance, and how could we mitigate this? (A speedup sketch follows below.)
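Dividing t_seq by t_par gives a rough speedup estimate. The derivation below is a sketch only: it treats the bounds above as equalities and assumes no overlap of communication with computation:

    S(p) = t_seq / t_par
         ≈ I·m·n·t_f / [ (I·m·n / (p − 1)) t_f + (p − 1 + m/(p − 1)) t_s + (p − 1 + mn/(p − 1)) t_w ]
         → p − 1    as the t_s and t_w terms become negligible relative to the t_f term

i.e. we approach one full core’s worth of speedup per slave whenever each row carries I·n·t_f of computation against only t_s + n·t_w of communication.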

Example #2: Monte Carlo Methods

• use random numbers to solve numerical/physical problems
• evaluation of π by determining whether random points in a square fall inside the inscribed circle (a code sketch follows below):

      area of circle / area of square = π(1)^2 / (2 × 2) = π/4

  so π can be estimated as 4 times the fraction of points that land inside the circle

[Figure: circle of radius 1 (area π) inscribed in a square of side 2 (total area 4)]
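A minimal serial sketch of this dartboard estimate; the sample count N and the seed are arbitrary choices, points are drawn from the side-2 square, and the circle test is x² + y² ≤ 1:

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: estimate pi by sampling N points in [-1,1] x [-1,1] and
       counting the fraction that land inside the unit circle. */
    int main(void) {
        const int N = 10000000;
        int i, hits = 0;
        srand(42);
        for (i = 0; i < N; i++) {
            float x = 2.0f * rand() / (float) RAND_MAX - 1.0f;
            float y = 2.0f * rand() / (float) RAND_MAX - 1.0f;
            if (x * x + y * y <= 1.0f)
                hits++;
        }
        printf("pi ~= %f\n", 4.0 * hits / N);   /* area ratio is pi/4 */
        return 0;
    }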

Monte Carlo Integration

• evaluation of an integral over x_1 ≤ x_i ≤ x_2:

      area = ∫_{x_1}^{x_2} f(x) dx = lim_{N→∞} (1/N) ∑_{i=1}^{N} f(x_i) · (x_2 − x_1)

• example:

      I = ∫_{x_1}^{x_2} (x^2 − 3x) dx

      sum = 0;
      for (i = 0; i < N; i++) {
          xr = rand_v(x1, x2);
          sum += xr * xr - 3 * xr;
      }
      area = sum * (x2 - x1) / N;

• where rand_v(x1, x2) computes a pseudo-random number between x1 and x2 (one possible implementation is sketched below)
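rand_v is left unspecified on the slides; one plausible implementation, built on the C library generator (adequate for illustration, though a serious Monte Carlo code would use a better generator):

    #include <stdlib.h>

    /* Sketch: uniform pseudo-random float in [x1, x2]. */
    float rand_v(float x1, float x2) {
        return x1 + (x2 - x1) * ((float) rand() / (float) RAND_MAX);
    }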

Parallelization

• the only problem is ensuring that each process uses a different random number seed and that there is no correlation between the streams
• one solution is to have a dedicated process (maybe the master) issuing random numbers to the slaves

[Figure: slaves send requests to a dedicated random number process and receive random numbers; partial sums are returned to the master]

Parallel Code: Integration

Master (process 0):

    int n = N / nproc;
    for (i = 0; i < N / n; i++) {
        for (j = 0; j < n; j++)
            xr[j] = rand_v(x1, x2);
        recv(&p_src, 1, any_proc, req_tag);
        send(xr, n, p_src, comp_tag);
    }
    for (i = 1; i < nproc; i++) {
        recv(&p_src, 1, i, req_tag);
        send(NULL, 0, i, stop_tag);
    }
    sum = 0;
    reduce_add(&sum, 1, 0 /* master */);

Slave (myid = 1..nproc−1 is its process id):

    const int master = 0;
    int n = N / nproc;
    sum = 0;
    send(&myid, 1, master, req_tag);
    recv(xr, n, master, any_tag, &tag);
    while (tag == comp_tag) {
        for (i = 0; i < n; i++)
            sum += xr[i] * xr[i] - 3 * xr[i];
        send(&myid, 1, master, req_tag);
        recv(xr, n, master, any_tag, &tag);
    }
    reduce_add(&sum, 1, master);

Question: performance problems with this code?

Parallel Random Numbers

• linear congruential generators:

      x_{i+1} = (a·x_i + c) mod m    (a, c, and m are constants)

• using the property x_{i+p} = (A(a,p,m)·x_i + C(c,a,p,m)) mod m, we can generate the first p random numbers sequentially, then repeatedly calculate the next p numbers in parallel (sketched below)

[Figure: x(1) … x(p) are generated sequentially; each x(i) then generates x(i+p), x(i+2p), … in parallel]
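A minimal single-program sketch of this leapfrog scheme; the LCG constants are toy values, and the p = 4 “processes” are simulated with an array rather than real ranks:

    #include <stdio.h>

    typedef unsigned long long u64;

    /* Toy LCG constants; illustrative only, not production-quality. */
    #define A_MULT 1103515245ULL
    #define C_ADD  12345ULL
    #define MOD    (1ULL << 31)
    #define P      4     /* number of processes (simulated) */

    int main(void) {
        /* Precompute A = a^P mod m and C = c*(a^(P-1) + ... + a + 1) mod m,
           using the recurrences A' = a*A and C' = a*C + c. */
        u64 A = 1, C = 0;
        for (int k = 0; k < P; k++) {
            C = (A_MULT * C + C_ADD) % MOD;
            A = (A_MULT * A) % MOD;
        }
        /* Generate the first P numbers sequentially; x[r] seeds process r. */
        u64 x[P], seed = 1;
        for (int r = 0; r < P; r++) {
            seed = (A_MULT * seed + C_ADD) % MOD;
            x[r] = seed;
        }
        /* Each process r now strides independently through its own
           subsequence: x(r+1), x(r+1+P), x(r+1+2P), ... */
        for (int step = 0; step < 3; step++)
            for (int r = 0; r < P; r++) {
                printf("proc %d step %d: %llu\n", r, step, x[r]);
                x[r] = (A * x[r] + C) % MOD;
            }
        return 0;
    }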

Summary: embarrassingly parallel problems

• defining characteristic: tasks do not need to communicate
• non-trivial nonetheless: providing input data to tasks, assembling results, load balancing, scheduling, heterogeneous compute resources, costing
  - static task assignment (lower communication costs) vs. dynamic task assignment + over-decomposition (better load balance)
• Monte Carlo and ensemble simulations are a big use of computational power!
• the field of grid computing arose to solve this issue
