
COMP4300/8300 L7: Embarrassingly Parallel Problems (2021)

Outline:
- what they are
- Mandelbrot Set computation
  - cost considerations
  - static parallelization
  - dynamic parallelization and its analysis
- Monte Carlo methods
- parallel random number generation

Ref: Lin and Snyder Ch 5; Wilkinson and Allen Ch 3
Admin: reminder - pracs this week, get your NCI accounts!; parallel programming poll

Embarrassingly Parallel Problems

- the computation can be divided into completely independent parts for execution by separate processors (these correspond to totally disconnected computational graphs)
  - infrastructure: the Berkeley Open Infrastructure for Network Computing (BOINC) project
  - SETI@home and Folding@home are projects solving very large such problems
- part of an application may be embarrassingly parallel
- distribution and collection of data are the key issues (they can be non-trivial and/or costly)
- frequently uses the master/slave approach (at best a p - 1 speedup)

[Figure: the master sends data to the slaves and collects their results]

Example #1: Computation of the Mandelbrot Set

The Mandelbrot Set

- a set of points in the complex plane that are quasi-stable
- computed by iterating the function

    z_{k+1} = z_k^2 + c

  - z and c are complex numbers (z = a + bi)
  - z is initially zero
  - c gives the position of the point in the complex plane
- iteration continues until |z| > 2 or some arbitrary iteration limit is reached, where |z| = sqrt(a^2 + b^2)
- the set is enclosed by a circle of radius 2 centred at (0,0)

Evaluating 1 Point

    typedef struct complex { float real, imag; } complex;
    const int MaxIter = 256;

    int calc_pixel(complex c) {
        int count = 0;
        complex z = {0.0, 0.0};
        float temp, lengthsq;
        do {
            temp = z.real * z.real - z.imag * z.imag + c.real;
            z.imag = 2 * z.real * z.imag + c.imag;
            z.real = temp;
            lengthsq = z.real * z.real + z.imag * z.imag;
            count++;
        } while (lengthsq < 4.0 && count < MaxIter);
        return count;
    }

Building the Full Image

Define:
- min. and max. values for c (usually -2 to 2)
- the number of horizontal and vertical pixels

    for (x = 0; x < width; x++)
        for (y = 0; y < height; y++) {
            c.real = min.real + ((float) x * (max.real - min.real) / width);
            c.imag = min.imag + ((float) y * (max.imag - min.imag) / height);
            color = calc_pixel(c);
            display(x, y, color);
        }

Summary:
- width x height totally independent tasks
- each task can be of a different length (execution time)

Cost Considerations on NCI's Gadi

- 10 flops per iteration
- a maximum of 256 iterations per point
- approximate time on one core: 10 * 256 / (2 * 3.2e9) ≈ 0.4 usec
- between two nodes, the time to communicate a single point to a slave and receive the result is ≈ 2 * 1.4 usec (latency limited)
- conclusion: we cannot usefully parallelize over individual points
- we must also allow time for the master to send to all slaves before it can return to sending out more tasks

Parallelisation: Static

[Figure: a width x height map distributed over processes, either by square regions or by rows]

Static Implementation

Master (assume that height is a multiple of nproc - 1):

    for (slave = 1, row = 0; slave < nproc; slave++) {
        send(&row, 1, slave);
        row = row + height / (nproc - 1);
    }
    for (npixel = 0; npixel < (width * height); npixel++) {
        recv({&x, &y, &color}, 3, any_proc);
        display(x, y, color);
    }

Slave:

    const int master = 0;  /* process id of master */
    recv(&firstrow, 1, master);
    lastrow = firstrow + height / (nproc - 1);
    for (x = 0; x < width; x++)
        for (y = firstrow; y < lastrow; y++) {
            c.real = min.real + ((float) x * (max.real - min.real) / width);
            c.imag = min.imag + ((float) y * (max.imag - min.imag) / height);
            color = calc_pixel(c);
            send({&x, &y, &color}, 3, master);
        }

Dynamic Task Assignment

- discussion point: why would we expect static assignment to be sub-optimal for the Mandelbrot set calculation? Would any regular static decomposition be significantly better (or worse)?
- a pool of over-decomposed tasks is dynamically assigned to the next requesting process

[Figure: a work pool of tasks (x1,y1) ... (x7,y7); each slave requests a task and returns a result]

Processor Farm: Master

    count = 0;
    row = 0;
    for (slave = 1; slave < nproc; slave++) {
        send(&row, 1, slave, data_tag);
        count++;
        row++;
    }
    do {
        recv({&slave, &r, &color}, width + 2, any_proc, result_tag);
        count--;
        if (row < height) {
            send(&row, 1, slave, data_tag);
            row++;
            count++;
        } else
            send(&row, 1, slave, terminator_tag);
        display_vector(r, color);
    } while (count > 0);

Processor Farm: Slave

myid = 1..nproc-1 is the slave's process id:

    const int master = 0;  /* process id of master */
    recv(&y, 1, master, any_tag, &source_tag);
    while (source_tag == data_tag) {
        c.imag = min.imag + ((float) y * (max.imag - min.imag) / height);
        for (x = 0; x < width; x++) {
            c.real = min.real + ((float) x * (max.real - min.real) / width);
            color[x] = calc_pixel(c);
        }
        send({&myid, &y, color}, width + 2, master, result_tag);
        recv(&y, 1, master, any_tag, &source_tag);
    }

Analysis

Let p, m, n, I denote nproc, height, width, and MaxIter:
- sequential time (t_f denotes the floating point operation time):
    t_seq <= I * m * n * t_f = O(mn)
- parallel communication 1 (neglect the t_h term; assume a message length of 1 word):
    t_com1 = (p - 1) * (t_s + t_w)
- parallel computation:
    t_comp <= I * m * n / (p - 1) * t_f
- parallel communication 2:
    t_com2 = m / (p - 1) * (t_s + n * t_w)
- overall:
    t_par <= I * m * n / (p - 1) * t_f + (p - 1 + m / (p - 1)) * t_s + (p - 1 + m * n / (p - 1)) * t_w

Discussion point: What assumptions have we been making here?
Are there any situations where we might still have poor performance, and how could we mitigate this?

Example #2: Monte Carlo Methods

- use random numbers to solve numerical/physical problems
- example: evaluate pi by determining whether random points in a square fall inside the inscribed circle:

    pi / 4 = (area of circle) / (area of square) = pi * 1^2 / (2 * 2)

[Figure: a circle of radius 1 (area pi) inscribed in a square of side 2 (total area 4)]

Monte Carlo Integration

- evaluation of an integral (x1 <= x_i <= x2):

    area = integral from x1 to x2 of f(x) dx = lim_{N -> infinity} (1/N) * sum_{i=1}^{N} f(x_i) * (x2 - x1)

- example:

    I = integral from x1 to x2 of (x^2 - 3x) dx

    sum = 0;
    for (i = 0; i < N; i++) {
        xr = rand_v(x1, x2);
        sum += xr * xr - 3 * xr;
    }
    area = sum * (x2 - x1) / N;

- where rand_v(x1, x2) computes a pseudo-random number between x1 and x2

Parallelization

- the only problem is ensuring that each process uses a different random number seed and that there is no correlation between the sequences
- one solution is to have a unique process (maybe the master) issuing random numbers to the slaves

[Figure: slaves send requests and partial sums to the master; a separate random number process issues the random numbers]

Parallel Code: Integration

Master (process 0):

    int n = N / nproc;
    for (i = 0; i < N / n; i++) {
        for (j = 0; j < n; j++)
            xr[j] = rand_v(x1, x2);
        recv(&p_src, 1, any_proc, req_tag);
        send(xr, n, p_src, comp_tag);
    }
    for (i = 1; i < nproc; i++) {
        recv(&p_src, 1, i, req_tag);
        send(NULL, 0, i, stop_tag);
    }
    sum = 0;
    reduce_add(&sum, 1, 0 /* master */);

Slave (myid = 1..nproc-1 is its process id):

    const int master = 0;
    int n = N / nproc;
    sum = 0;
    send(&myid, 1, master, req_tag);
    recv(xr, n, master, any_tag, &tag);
    while (tag == comp_tag) {
        for (i = 0; i < n; i++)
            sum += xr[i] * xr[i] - 3 * xr[i];
        send(&myid, 1, master, req_tag);
        recv(xr, n, master, any_tag, &tag);
    }
    reduce_add(&sum, 1, master);

Question: performance problems with this code?
Parallel Random Numbers

- linear congruential generators:

    x_{i+1} = (a * x_i + c) mod m    (a, c, and m are constants)

- using the property x_{i+p} = (A(a, p, m) * x_i + C(c, a, p, m)) mod m, we can generate the first p random numbers sequentially and then repeatedly calculate the next p numbers in parallel

[Figure: x(1) ... x(p) generated sequentially; x(p+1) ... x(2p) and each subsequent block computed in parallel, one stream per process, each leaping ahead by p]

Summary: Embarrassingly Parallel Problems

- defining characteristic: tasks do not need to communicate
- non-trivial aspects remain, however: providing input data to tasks, assembling results, load balancing, scheduling, heterogeneous compute resources, costing
  - static task assignment (lower communication costs) vs. dynamic task assignment + over-decomposition (better load balance)
- Monte Carlo or ensemble simulations are a big use of computational power!
- the field of grid computing arose to solve such problems