Outline: Embarrassingly Parallel Problems

• what they are
• Mandelbrot Set computation
  - cost considerations
  - static parallelization
  - dynamic parallelization and its analysis
• Monte Carlo methods
• parallel random number generation

Ref: Lin and Snyder Ch 5, Wilkinson and Allen Ch 3

Admin: reminder that pracs run this week (get your NCI accounts!); parallel programming poll

Embarrassingly Parallel Problems

• computation can be divided into completely independent parts for execution by separate processors (these correspond to totally disconnected computational graphs)
  - infrastructure: the Berkeley Open Infrastructure for Network Computing (BOINC) project
  - SETI@home and Folding@home are projects solving very large such problems
• part of an application may be embarrassingly parallel
• distribution and collection of data are the key issues (can be non-trivial and/or costly)
• frequently uses the master/slave approach (one master, p − 1 slaves)

[Figure: master/slave structure: the master sends data to the p − 1 slaves and collects their results]

Example #1: Computation of the Mandelbrot Set

The Mandelbrot Set

• a set of points in the complex plane that are quasi-stable
• computed by iterating the function

      z_{k+1} = z_k^2 + c

  - z and c are complex numbers (z = a + bi)
  - z is initially zero
  - c gives the position of the point in the complex plane
• iterations continue until |z| > 2 or some arbitrary iteration limit is reached, where

      |z| = sqrt(a^2 + b^2)

• the set is enclosed by a circle centered at (0,0) of radius 2

Evaluating 1 Point

    typedef struct complex { float real, imag; } complex;
    const int MaxIter = 256;

    /* Iterate z = z^2 + c from z = 0; return the iteration count at which
       |z|^2 exceeds 4 (i.e. |z| > 2), capped at MaxIter. */
    int calc_pixel(complex c) {
        int count = 0;
        complex z = {0.0, 0.0};
        float temp, lengthsq;
        do {
            temp = z.real * z.real - z.imag * z.imag + c.real;
            z.imag = 2 * z.real * z.imag + c.imag;
            z.real = temp;
            lengthsq = z.real * z.real + z.imag * z.imag;
            count++;
        } while (lengthsq < 4.0 && count < MaxIter);
        return count;
    }

Building the Full Image

Define:

• min. and max. values for c (usually −2 to 2)
• number of horizontal and vertical pixels

    for (x = 0; x < width; x++)
        for (y = 0; y < height; y++) {
            c.real = min.real + ((float) x * (max.real - min.real) / width);
            c.imag = min.imag + ((float) y * (max.imag - min.imag) / height);
            color = calc_pixel(c);
            display(x, y, color);
        }

Summary:

• width × height totally independent tasks
• each task can be of a different length (execution time)

Cost Considerations on NCI’s Gadi

• 10 flops per iteration
• maximum 256 iterations per point
• approximate time on one core: 10 × 256 / (2 × 3.2 × 10^9) s ≈ 0.4 µs (arithmetic worked below)
• between two nodes, the time to communicate a single point to the slave and receive the result is ≈ 2 × 1.4 µs (latency limited)
• conclusion: we cannot usefully parallelize over individual points
• the master must also have time to send to all slaves before it can return to sending out further tasks
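To make the conclusion concrete, here is the arithmetic behind those estimates, assuming (as above) a core retiring 2 flops per cycle at 3.2 GHz and a one-way inter-node latency of about 1.4 µs:

    t_comp ≈ (10 flops × 256 iterations) / (2 flops/cycle × 3.2 × 10^9 cycles/s) ≈ 0.4 µs
    t_comm ≈ 2 × 1.4 µs = 2.8 µs

so t_comm / t_comp ≈ 7: even the most expensive single point costs roughly seven times more to communicate than to compute.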

Parallelisation: Static

[Figure: two static decompositions of the width × height pixel grid across processes: square region distribution vs. row distribution]

Static Implementation

Master (assume that height is a multiple of nproc − 1):

    for (slave = 1, row = 0; slave < nproc; slave++) {
        send(&row, 1, slave);
        row = row + height / (nproc - 1);
    }
    for (npixel = 0; npixel < (width * height); npixel++) {
        recv({&x, &y, &color}, 3, any_proc);
        display(x, y, color);
    }

Slave:

    const int master = 0;  /* process id of master */
    recv(&firstrow, 1, master);
    lastrow = firstrow + height / (nproc - 1);
    for (x = 0; x < width; x++)
        for (y = firstrow; y < lastrow; y++) {
            c.real = min.real + ((float) x * (max.real - min.real) / width);
            c.imag = min.imag + ((float) y * (max.imag - min.imag) / height);
            color = calc_pixel(c);
            send({&x, &y, &color}, 3, master);
        }
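The send/recv calls on these slides are generic pseudocode. Below is a minimal sketch of how the slave side might map onto MPI; WIDTH, HEIGHT, the c-plane bounds of −2..2, and the rank-0 master are assumptions carried over from earlier slides, and the tag value 0 is arbitrary:

    #include <mpi.h>

    #define WIDTH  640            /* illustrative image dimensions */
    #define HEIGHT 480

    typedef struct complex { float real, imag; } complex;  /* as earlier */
    int calc_pixel(complex c);                              /* as earlier */

    /* Slave side of the static row decomposition: receive the first row,
       compute HEIGHT/(nproc-1) rows, return one (x, y, colour) triple per
       pixel to the master (rank 0). */
    void slave(int nproc) {
        int firstrow, lastrow, x, y, msg[3];
        complex c;
        MPI_Recv(&firstrow, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        lastrow = firstrow + HEIGHT / (nproc - 1);
        for (x = 0; x < WIDTH; x++)
            for (y = firstrow; y < lastrow; y++) {
                c.real = -2.0f + (float) x * 4.0f / WIDTH;   /* c in [-2,2] */
                c.imag = -2.0f + (float) y * 4.0f / HEIGHT;
                msg[0] = x;  msg[1] = y;  msg[2] = calc_pixel(c);
                MPI_Send(msg, 3, MPI_INT, 0, 0, MPI_COMM_WORLD);
            }
    }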

Dynamic Task Assignment

• discussion point: why would we expect static assignment to be sub-optimal for the Mandelbrot set calculation? Would any regular static decomposition be significantly better (or worse)?
• pool of over-decomposed tasks that are dynamically assigned to the next requesting process

[Figure: work pool containing tasks (x1,y1) … (x7,y7); each slave requests a task from the pool and returns a result]

Processor Farm: Master

    count = 0;
    row = 0;
    for (slave = 1; slave < nproc; slave++) {
        send(&row, 1, slave, data_tag);
        count++;
        row++;
    }
    do {
        recv({&slave, &r, &color}, width + 2, any_proc, result_tag);
        count--;
        if (row < height) {
            send(&row, 1, slave, data_tag);
            row++;
            count++;
        } else
            send(&row, 1, slave, terminator_tag);
        display_vector(r, color);
    } while (count > 0);

Processor Farm: Slave

myid = 1..nproc−1 is the slave’s process id:

    const int master = 0;  /* process id of master */
    recv(&y, 1, master, any_tag, &source_tag);
    while (source_tag == data_tag) {
        c.imag = min.imag + ((float) y * (max.imag - min.imag) / height);
        for (x = 0; x < width; x++) {
            c.real = min.real + ((float) x * (max.real - min.real) / width);
            color[x] = calc_pixel(c);
        }
        send({&myid, &y, color}, width + 2, master, result_tag);
        recv(&y, 1, master, any_tag, &source_tag);
    }
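As a companion to the pseudocode, a minimal sketch of the farm slave in MPI. The key point is that the slave reads the message tag out of MPI_Status to decide whether to keep working; WIDTH, HEIGHT, the −2..2 bounds, and the tag values are illustrative assumptions, not from the slides:

    #include <mpi.h>

    #define WIDTH      640        /* illustrative */
    #define HEIGHT     480
    #define DATA_TAG   1          /* arbitrary tag values */
    #define RESULT_TAG 3

    typedef struct complex { float real, imag; } complex;  /* as earlier */
    int calc_pixel(complex c);                              /* as earlier */

    /* Slave side of the processor farm: a DATA_TAG message carries a row
       number to compute; any other tag signals termination. */
    void farm_slave(int myid) {
        int x, y, msg[WIDTH + 2];
        complex c;
        MPI_Status status;
        MPI_Recv(&y, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        while (status.MPI_TAG == DATA_TAG) {
            c.imag = -2.0f + (float) y * 4.0f / HEIGHT;
            msg[0] = myid;
            msg[1] = y;
            for (x = 0; x < WIDTH; x++) {
                c.real = -2.0f + (float) x * 4.0f / WIDTH;
                msg[2 + x] = calc_pixel(c);
            }
            MPI_Send(msg, WIDTH + 2, MPI_INT, 0, RESULT_TAG, MPI_COMM_WORLD);
            MPI_Recv(&y, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        }
    }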

Analysis

Let p, m, n, I denote nproc, height, width, MaxIter:

• sequential time (t_f denotes the floating-point operation time):

      t_seq ≤ I·m·n·t_f = O(mn)

• parallel communication 1 (neglect the t_h term; assume a message length of 1 word):

      t_com1 = (p − 1)(t_s + t_w)

• parallel computation:

      t_comp ≤ (I·m·n / (p − 1)) · t_f

• parallel communication 2:

      t_com2 = (m / (p − 1)) (t_s + n·t_w)

• overall:

      t_par ≤ (I·m·n / (p − 1)) t_f + (p − 1 + m/(p − 1)) t_s + (p − 1 + mn/(p − 1)) t_w

Discussion point: what assumptions have we been making here? Are there any situations where we might still have poor performance, and how could we mitigate this? (A speedup sketch follows below.)
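Dividing t_seq by t_par gives a rough speedup estimate. The derivation below is a sketch only: it treats the bounds above as equalities and assumes no overlap of communication with computation:

    S(p) = t_seq / t_par
         ≈ I·m·n·t_f / [ (I·m·n / (p − 1)) t_f + (p − 1 + m/(p − 1)) t_s + (p − 1 + mn/(p − 1)) t_w ]
         → p − 1    as the t_s and t_w terms become negligible relative to the t_f term

i.e. we approach one full core’s worth of speedup per slave whenever each row carries I·n·t_f of computation against only t_s + n·t_w of communication.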

Example #2: Monte Carlo Methods

• use random numbers to solve numerical/physical problems
• evaluation of π by determining whether random points in a square fall inside the inscribed circle (a code sketch follows below):

      area of circle / area of square = π(1)^2 / (2 × 2) = π/4

  so π can be estimated as 4 times the fraction of points that land inside the circle

[Figure: circle of radius 1 (area π) inscribed in a square of side 2 (total area 4)]
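A minimal serial sketch of this dartboard estimate; the sample count N and the seed are arbitrary choices, points are drawn from the side-2 square, and the circle test is x² + y² ≤ 1:

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: estimate pi by sampling N points in [-1,1] x [-1,1] and
       counting the fraction that land inside the unit circle. */
    int main(void) {
        const int N = 10000000;
        int i, hits = 0;
        srand(42);
        for (i = 0; i < N; i++) {
            float x = 2.0f * rand() / (float) RAND_MAX - 1.0f;
            float y = 2.0f * rand() / (float) RAND_MAX - 1.0f;
            if (x * x + y * y <= 1.0f)
                hits++;
        }
        printf("pi ~= %f\n", 4.0 * hits / N);   /* area ratio is pi/4 */
        return 0;
    }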

Monte Carlo Integration

• evaluation of an integral over x_1 ≤ x_i ≤ x_2:

      area = ∫_{x_1}^{x_2} f(x) dx = lim_{N→∞} (1/N) ∑_{i=1}^{N} f(x_i) · (x_2 − x_1)

• example:

      I = ∫_{x_1}^{x_2} (x^2 − 3x) dx

      sum = 0;
      for (i = 0; i < N; i++) {
          xr = rand_v(x1, x2);
          sum += xr * xr - 3 * xr;
      }
      area = sum * (x2 - x1) / N;

• where rand_v(x1, x2) computes a pseudo-random number between x1 and x2 (one possible implementation is sketched below)
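rand_v is left unspecified on the slides; one plausible implementation, built on the C library generator (adequate for illustration, though a serious Monte Carlo code would use a better generator):

    #include <stdlib.h>

    /* Sketch: uniform pseudo-random float in [x1, x2]. */
    float rand_v(float x1, float x2) {
        return x1 + (x2 - x1) * ((float) rand() / (float) RAND_MAX);
    }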

Parallelization

• the only problem is ensuring that each process uses a different random number seed and that there is no correlation between the streams
• one solution is to have a dedicated process (maybe the master) issuing random numbers to the slaves

[Figure: slaves send requests to a dedicated random number process and receive random numbers; partial sums are returned to the master]

Parallel Code: Integration

Master (process 0):

    int n = N / nproc;
    for (i = 0; i < N / n; i++) {
        for (j = 0; j < n; j++)
            xr[j] = rand_v(x1, x2);
        recv(&p_src, 1, any_proc, req_tag);
        send(xr, n, p_src, comp_tag);
    }
    for (i = 1; i < nproc; i++) {
        recv(&p_src, 1, i, req_tag);
        send(NULL, 0, i, stop_tag);
    }
    sum = 0;
    reduce_add(&sum, 1, 0 /* master */);

Slave (myid = 1..nproc−1 is its process id):

    const int master = 0;
    int n = N / nproc;
    sum = 0;
    send(&myid, 1, master, req_tag);
    recv(xr, n, master, any_tag, &tag);
    while (tag == comp_tag) {
        for (i = 0; i < n; i++)
            sum += xr[i] * xr[i] - 3 * xr[i];
        send(&myid, 1, master, req_tag);
        recv(xr, n, master, any_tag, &tag);
    }
    reduce_add(&sum, 1, master);

Question: performance problems with this code?

Parallel Random Numbers

• linear congruential generators:

      x_{i+1} = (a·x_i + c) mod m    (a, c, and m are constants)

• using the property x_{i+p} = (A(a,p,m)·x_i + C(c,a,p,m)) mod m, we can generate the first p random numbers sequentially, then repeatedly calculate the next p numbers in parallel (sketched below)

[Figure: x(1) … x(p) are generated sequentially; each x(i) then generates x(i+p), x(i+2p), … in parallel]
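A minimal single-program sketch of this leapfrog scheme; the LCG constants are toy values, and the p = 4 “processes” are simulated with an array rather than real ranks:

    #include <stdio.h>

    typedef unsigned long long u64;

    /* Toy LCG constants; illustrative only, not production-quality. */
    #define A_MULT 1103515245ULL
    #define C_ADD  12345ULL
    #define MOD    (1ULL << 31)
    #define P      4     /* number of processes (simulated) */

    int main(void) {
        /* Precompute A = a^P mod m and C = c*(a^(P-1) + ... + a + 1) mod m,
           using the recurrences A' = a*A and C' = a*C + c. */
        u64 A = 1, C = 0;
        for (int k = 0; k < P; k++) {
            C = (A_MULT * C + C_ADD) % MOD;
            A = (A_MULT * A) % MOD;
        }
        /* Generate the first P numbers sequentially; x[r] seeds process r. */
        u64 x[P], seed = 1;
        for (int r = 0; r < P; r++) {
            seed = (A_MULT * seed + C_ADD) % MOD;
            x[r] = seed;
        }
        /* Each process r now strides independently through its own
           subsequence: x(r+1), x(r+1+P), x(r+1+2P), ... */
        for (int step = 0; step < 3; step++)
            for (int r = 0; r < P; r++) {
                printf("proc %d step %d: %llu\n", r, step, x[r]);
                x[r] = (A * x[r] + C) % MOD;
            }
        return 0;
    }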

Summary: embarrassingly parallel problems

• defining characteristic: tasks do not need to communicate
• non-trivial nonetheless: providing input data to tasks, assembling results, load balancing, scheduling, heterogeneous compute resources, costing
  - static task assignment (lower communication costs) vs. dynamic task assignment + over-decomposition (better load balance)
• Monte Carlo and ensemble simulations are a big use of computational power!
• the field of grid computing arose to solve this issue
