FDA125 APP, Lecture 2: Foundations of parallel algorithms
C. Kessler, IDA, Linköpings Universitet, 2003

Literature

[PPP]  Keller, Kessler, Träff: Practical PRAM Programming. Wiley Interscience, New York, 2000. Chapter 2.
[JaJa] JaJa: An Introduction to Parallel Algorithms. Addison-Wesley, 1992.
[CLR]  Cormen, Leiserson, Rivest: Introduction to Algorithms, Chapter 30. MIT Press, 1989.
[JA]   Jordan, Alaghband: Fundamentals of Parallel Processing. Prentice Hall, 2003.

Contents

- PRAM model
- Time, work, cost
- Self-simulation and Brent's Theorem
- Amdahl's Law and Gustafsson's Law
- NC
- Fundamental PRAM algorithms: reduction, parallel prefix, list ranking
- PRAM variants, simulation results and separation theorems
- Survey of other models of parallel computation: Asynchronous PRAM, Delay model, BSP, LogP, LogGP

Parallel computation models (1)

A parallel computation model should
+ abstract from hardware and technology
+ specify the basic operations, when applicable
+ specify how data can be stored
+ explain available observations
+ predict future behaviour
+ abstract from unimportant details  →  generalization

Purpose: analyze algorithms before implementation, independent of a particular parallel computer; focus on the most characteristic features (w.r.t. their influence on time/space complexity) of a broader class of parallel machines.

A model comprises a programming model (basic operations, how data can be stored, message passing vs. shared memory, degree of synchronous execution) and a cost model (cost functions for the basic operations, constraints).

Parallel computation models (2)

Simplifications to reduce model complexity:
- use an idealized machine model; ignore hardware details: memory hierarchies, network topology, ...
- use asymptotic analysis; keep key parameters, drop insignificant effects
- use empirical studies; calibrate parameters, evaluate the model

Flashback to DALG, Lecture 1: The RAM model

The RAM (Random Access Machine) [PPP 2.1] is the programming and cost model for the analysis of sequential algorithms: a CPU with ALU, registers and program counter, connected to a program memory and a data memory M[0], M[1], ...

Example: computing the global sum of N elements

    s = d(0)
    do i = 1, N-1
       s = s + d(i)
    end do

Analysis by counting instructions:

    t = t_load + t_store + (N-1) * (2 t_load + t_add + t_store + t_branch) = 5N - 3  ∈  Θ(N)

(Figure: RAM organization, and the linear summation chain for d[0..7] versus a balanced summation tree.)

The RAM model (2)

For the analysis of computations, compare also the arithmetic circuit model and the directed acyclic graph (DAG) model.

PRAM model [PPP 2.2]

Parallel Random Access Machine [Fortune/Wyllie'78]
- p processors P0, ..., Pp-1 (MIMD), driven by a common clock signal
- arithmetic/jump operations: 1 clock cycle
- shared memory; memory access latency: 1 clock cycle (!)
- concurrent memory accesses; sequential consistency
- private memory (optional), processor-local access only

(Figure: processors P0 ... Pp-1 with optional private memories M0 ... Mp-1, connected to the shared memory and driven by the common CLOCK.)

PRAM model: Variants for memory access conflict resolution

- Exclusive Read, Exclusive Write (EREW) PRAM: concurrent access only to different locations in the same cycle
- Concurrent Read, Exclusive Write (CREW) PRAM: simultaneous reading from, or single writing to, the same location is possible
- Concurrent Read, Concurrent Write (CRCW) PRAM: simultaneous reading from or writing to the same location is possible; concurrent writes are resolved as Weak CRCW, Common CRCW, Arbitrary CRCW, Priority CRCW, or Combining CRCW (global sum, max, etc.)
- There is no need for an ERCW variant.

Example: in one cycle t, five processors issue *a=0; *a=1; nop; *a=0; *a=2; on the same location a — the CRCW variants differ in which value a holds afterwards.

Global sum computation on EREW and Combining-CRCW PRAM (1)

Given n numbers x_0, x_1, ..., x_{n-1} stored in an array, the global sum ∑_{i=0}^{n-1} x_i can be computed in ⌈log2 n⌉ time steps on an EREW PRAM with n processors.

Parallel algorithmic paradigm used: Parallel Divide-and-Conquer.

ParSum(n):
- Divide phase: trivial, time O(1)
- Recursive calls: two half-sized instances ParSum(n/2) solved in parallel, time T(n/2)
- Combine phase: one addition, time O(1)
- Base case: a load operation, time O(1)

    T(n) = T(n/2) + O(1),   T(1) = O(1)

Use induction or the master theorem [CLR 4]: unrolling gives T(n) = T(n/2) + c = T(n/4) + 2c = ... = T(1) + c·log2 n, hence T(n) ∈ O(log n).

Global sum computation on EREW and Combining-CRCW PRAM (2)

Recursive parallel sum program in the PRAM programming language Fork [PPP]:

    sync int parsum( sh int *d, sh int n )
    {
      sh int s1, s2;
      sh int nd2 = n / 2;
      if (n==1) return d[0];           // base case
      $ = rerank();                    // re-rank processors within group
      if ($ < nd2)                     // split the group: lower half ...
        s1 = parsum( d, nd2 );         //   ... sums the first nd2 elements
      else                             // upper half ...
        s2 = parsum( &d[nd2], n-nd2 ); //   ... sums the remaining elements
      return s1 + s2;                  // combine the partial sums
    }

(The source truncates the routine after "if ($"; the group-splitting branch and the final return are reconstructed here following the usual Fork divide-and-conquer pattern.)

(The slide also shows a Fork95 trace of this program on 8 processors P0-P7: traced time period 6 msecs, in total 434 shared loads, 344 shared stores, 78 mpadd operations, 7 barriers per processor, with about 14-15% of the time spent spinning on barriers.)

Global sum computation on EREW and Combining-CRCW PRAM (3)

Iterative parallel sum program in Fork:

    int sum( sh int a[], sh int n )
    {
      int d, dd;
      int ID = rerank();
      d = 1;
      while (d < n) {                  // log2 n rounds
        dd = d;  d = d*2;
        if (ID % d == 0)               // every d-th processor is active ...
          a[ID] = a[ID] + a[ID+dd];    //   ... and adds its right neighbour's partial sum
      }
      return a[0];                     // the global sum
    }

(Only the declarations and the loop header survive in the source; the loop body and return statement are a reconstruction of the usual iterative tree summation, in which half of the active processors fall idle in each round.)

PRAM model: CRCW is stronger than CREW

Example: computing the logical OR of p bits (one bit per processor).
- CREW: reduction tree, time O(log p).
- Common CRCW: time O(1) — the result location a is initialized to 0, and every processor whose bit is 1 writes the value 1 to a; all concurrent writers write the same value.
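To make the separation concrete, here is a small C sketch (not from the slides; a sequential simulation in which each inner loop stands for one parallel step) of both strategies on the slide's bit vector 0 1 0 1 0 0 0 1: the CRCW-style version needs one write step, the CREW/EREW-style version needs ⌈log2 p⌉ reduction rounds.

    #include <stdio.h>

    /* One simulated Common-CRCW step: all processors holding a 1 write the
       same value 1 to the shared cell -- constant parallel time. */
    int or_crcw(const int bit[], int p) {
        int shared = 0;                      /* shared memory cell, initialized to 0 */
        for (int i = 0; i < p; i++)          /* all iterations belong to ONE parallel step */
            if (bit[i]) shared = 1;          /* concurrent writers all write the same value */
        return shared;
    }

    /* CREW/EREW-style tree reduction: ceil(log2 p) parallel rounds. */
    int or_tree(int bit[], int p) {
        for (int d = 1; d < p; d *= 2)           /* one parallel round per doubling of d */
            for (int i = 0; i + d < p; i += 2*d) /* processors active in this round */
                bit[i] = bit[i] | bit[i + d];
        return bit[0];
    }

    int main(void) {
        int a[8] = {0,1,0,1,0,0,0,1};
        int b[8] = {0,1,0,1,0,0,0,1};
        printf("CRCW OR: %d (1 step), tree OR: %d (3 rounds)\n",
               or_crcw(a, 8), or_tree(b, 8));
        return 0;
    }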

Analysis of parallel algorithms

Two complementary approaches:
(a) asymptotic analysis: estimation based on the model and on pseudocode operations; results hold for large problem sizes and large numbers of processors
(b) empirical analysis: measurements based on an implementation, for fixed (small) problem and machine sizes

Asymptotic analysis: parallel work and time

The parallel work w_A(n) of algorithm A on an input of size n is the maximum number of instructions performed by all processors during an execution of A, where in each (parallel) time step as many processors are available as needed to execute the step in constant time.

The parallel time t_A(n) of algorithm A on an input of size n is the maximum number of parallel time steps required under the same circumstances.

Work and time are thus worst-case measures. t_A(n) is sometimes called the depth of A (cf. the circuit model and the DAG model of (parallel) computation).

Let p_i(n) be the number of processors needed in time step i, 0 ≤ i < t_A(n), to execute the step in constant time. Then

    w_A(n) = ∑_{i=0}^{t_A(n)-1} p_i(n).

Asymptotic analysis: Work and time optimality, work efficiency

Algorithm A needs p_A(n) = max_{0 ≤ i < t_A(n)} p_i(n) processors.

A is work-optimal if w_A(n) = O(t_S(n)), where S is the optimal or currently best known sequential algorithm for the same problem.

A is work-efficient if w_A(n) = t_S(n) · O(log^k(t_S(n))) for some constant k ≥ 1.

A is time-optimal if any other parallel algorithm for this problem requires Ω(t_A(n)) time steps.

Asymptotic analysis: Cost, cost optimality

The cost c_A(n) of A on an input of size n is the processor-time product

    c_A(n) = p_A(n) · t_A(n).

A is cost-optimal if c_A(n) = O(t_S(n)), with S the optimal or currently best known sequential algorithm for the same problem.

Work ≤ Cost: w_A(n) = O(c_A(n)).

A is cost-effective if w_A(n) = Θ(c_A(n)).

Asymptotic analysis for global sum computation

For the parallel sum algorithm with problem size n and p processors, consider the time t(p,n), the work w(p,n), and the cost c(p,n) = t(p,n) · p.

Sequential sum algorithm:

    s = a(1)
    do i = 2, n
       s = s + a(i)
    end do

    p = 1:  t(1,n) = t_seq(n) = O(n)
            w(1,n) = O(n)   (n-1 additions, n loads, O(n) other operations)
            c(1,n) = t(1,n) · 1 = O(n)

    p = n:  t(n,n) = O(log n)
            w(n,n) = O(n)
            c(n,n) = O(n log n)   →  the parallel sum algorithm is not cost-effective!

(Figure: the linear summation chain for p = 1 versus the summation tree for p = n, in which most processors are idle in most steps.)

Trading concurrency for cost-effectiveness

Making the parallel sum algorithm cost-optimal: instead of n processors, use only n / log2 n processors.

First, each processor sequentially computes the global sum of "its" log n local elements. This takes time O(log n).
Then the processors compute the global sum of the n / log n partial sums using the previous parallel sum algorithm, again in time O(log n).

Time: O(log n) for local summation plus O(log n) for global summation.
Cost: (n / log n) · O(log n) = O(n)  →  linear!

This is an example of a more general technique based on Brent's theorem.
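A minimal C sketch of this two-phase schedule (an illustration, not from the slides; it simulates the parallel steps sequentially and assumes unit-cost operations, with each outer loop iteration standing for one virtual processor):

    /* Cost-optimal global sum: p = ceil(n / ceil(log2 n)) virtual processors.
       Phase 1: each processor sums its own block of about log2 n elements.
       Phase 2: tree summation of the p partial sums in ceil(log2 p) rounds.
       Both phases take O(log n) parallel time, so cost = p * O(log n) = O(n). */
    long cost_optimal_sum(const long x[], long n)
    {
        if (n <= 0) return 0;
        long logn = 1;
        while ((1L << logn) < n) logn++;        /* logn = ceil(log2 n), block size */
        long p = (n + logn - 1) / logn;         /* number of virtual processors */

        long partial[p];                        /* partial sums, one per processor (C99 VLA) */

        /* Phase 1: local sequential summation (all processors in parallel). */
        for (long i = 0; i < p; i++) {
            long s = 0;
            for (long j = i*logn; j < (i+1)*logn && j < n; j++)
                s += x[j];
            partial[i] = s;
        }

        /* Phase 2: parallel tree summation of the p partial sums. */
        for (long d = 1; d < p; d *= 2)             /* one parallel round per iteration */
            for (long i = 0; i + d < p; i += 2*d)
                partial[i] += partial[i + d];

        return partial[0];
    }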

Self-simulation and Brent's Theorem

Self-simulation (aka work-time scheduling in [JaJa'92]):
A model of parallel computation is self-simulating if a p-processor machine can simulate one time step of a q-processor machine in O(⌈q/p⌉) time steps.

All PRAM variants are self-simulating.

Proof idea for the (EREW) PRAM with p ≤ q simulating processors:
- divide the q simulated processors into p chunks of size ⌈q/p⌉
- assign a chunk to each of the p simulating processors
- map the memory of the simulated PRAM to the memory of the simulating PRAM
- simulate step by step, with O(⌈q/p⌉) steps per simulated step
- take care of pending memory accesses in the current simulated step
- extra space O(⌈q/p⌉) for the registers and status of the simulated machine

Consequences of self-simulation

- The RAM (a 1-processor PRAM) simulates one step of a p-processor PRAM in O(p) time steps; it simulates A with cost c_A(n) = p_A(n) · t_A(n) in O(c_A(n)) time. (Actually, O(w_A(n)) time is possible.)
- Even with arbitrarily many processors, A cannot be simulated any faster than t_A(n).
- For cost-optimal A, c_A(n) = Θ(t_S(n)).  →  Exercise
- A p-processor PRAM can simulate one step of A requiring p_A(n) ≥ p processors in O(p_A(n)/p) time steps.
- Self-simulation emulates virtual processors with significant overhead; in practice, other mechanisms for adapting the granularity are more suitable.
- How can the simulation of inactive processors be avoided when c_A(n) = ω(w_A(n))?

Brent's Theorem

Brent's theorem [Brent'74]: Any PRAM algorithm A that runs in t_A(n) time steps and performs w_A(n) work can be implemented to run on a p-processor PRAM in

    O( t_A(n) + w_A(n) / p )

time steps. Proof: see [PPP p. 41].

Algorithm design issue: balance the two terms for cost-effectiveness, i.e., design A with p_A(n) processors such that w_A(n) / p_A(n) = O(t_A(n)).

Note: the proof is non-constructive! How can the active processors be determined for each time step? Language constructs, dependence analysis, static/dynamic scheduling, ...

Absolute Speedup

Let A be a parallel algorithm for a problem P and S an asymptotically optimal (or the best known) sequential algorithm for P. Let t_A(p,n) be the worst-case execution time of A with p ≤ p_A(n) processors and t_S(n) the worst-case execution time of S.

The absolute speedup of the parallel algorithm A is the ratio

    SU_abs(p,n) = t_S(n) / t_A(p,n).

If S is an optimal algorithm for P, then SU_abs(p,n) ≤ p for any fixed input size n, since a sequential simulation of A gives t_S(n) ≤ p · t_A(p,n) (for p = p_A(n): t_S(n) ≤ c_A(n)).

A cost-optimal parallel algorithm A for a problem P has linear absolute speedup. This holds for n sufficiently large. "Superlinear" speedup > p may exist only for small n.
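A small numeric illustration (a sketch, not from the slides; n, t and w below are example values for the tree-based global sum) tying Brent's bound to absolute speedup: with p = n / log2 n the two terms of the bound balance, giving O(log n) time, linear speedup and cost O(n).

    #include <stdio.h>

    /* Brent's theorem: a PRAM algorithm with time t and work w can be run
       on p processors in O(t + w/p) time steps (constant factors ignored). */
    double brent_time(double t, double w, double p) {
        return t + w / p;
    }

    int main(void) {
        double n = 1048576.0;            /* example problem size, n = 2^20 */
        double t = 20.0;                 /* tree sum: log2 n parallel steps */
        double w = n;                    /* Theta(n) work */
        double t_seq = n;                /* optimal sequential time */

        double p1 = n, p2 = n / t;       /* all n processors vs. n / log2 n */
        printf("p = n:        time <= %.0f, SU_abs >= %.0f, cost = %.0f\n",
               brent_time(t, w, p1), t_seq / brent_time(t, w, p1),
               p1 * brent_time(t, w, p1));
        printf("p = n/log2 n: time <= %.0f, SU_abs >= %.0f, cost = %.0f\n",
               brent_time(t, w, p2), t_seq / brent_time(t, w, p2),
               p2 * brent_time(t, w, p2));
        return 0;
    }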

Relative Speedup and Efficiency

Compare A with p processors to A itself running on 1 processor.

The asymptotic relative speedup of a parallel algorithm A is the ratio

    SU_rel(p,n) = t_A(1,n) / t_A(p,n).

Since t_S(n) ≤ t_A(1,n), we have SU_rel(p,n) ≥ SU_abs(p,n).  [PPP p. 44 contains a typo here.]

Relative speedup is preferably used in papers on parallelization, as it gives "nicer" performance results; most papers show only relative speedup (since SU_abs ≤ SU_rel, and the best sequential algorithm would be needed to measure SU_abs).

The relative efficiency of a parallel algorithm A is the ratio

    EF(p,n) = t_A(1,n) / ( p · t_A(p,n) ) = SU_rel(p,n) / p  ∈ [0,1].

Speedup curves

Speedup curves measure the utility of parallelization, not absolute speed. Typical curve shapes as a function of p:
- superlinear (rare; see speedup anomalies)
- ideal speedup SU = p: trivially parallel problems (e.g., matrix product, LU decomposition, ray tracing)
- close to ideal, linear SU ∈ Θ(p): work-bound, work-optimal algorithms
- sublinear SU ∈ Θ(p / log p): tree-like task graphs (e.g., global sum / max)
- saturating or decreasing SU = 1 / f_n(p): communication-bound algorithms

Speedup anomalies

Speedup anomaly: an implementation on p processors may execute faster than expected.

Superlinear speedup: a speedup function that grows faster than linear, i.e., in ω(p). Possible causes: cache effects, search anomalies. (Real-world example: moving scaffolding.)

Speedup anomalies may occur only for a fixed (small) range of p.

Theorem: There is no absolute superlinear speedup for arbitrarily large p.

Amdahl's Law

Consider an execution (trace) of a parallel algorithm A:
- a sequential part As in which only 1 processor is active
- a parallel part Ap that can be sped up perfectly by p processors
- total work w_A(n) = w_As(n) + w_Ap(n)

Amdahl's Law: If the sequential part of A is a fixed fraction of the total work irrespective of the problem size n, that is, if there is a constant β with

    β = w_As(n) / w_A(n) ≤ 1,

then the relative speedup of A with p processors is limited by

    SU_rel ≤ p / ( β p + (1 - β) ) ≤ 1/β.
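The bound is easy to tabulate; the C helper below is a sketch (the sequential fraction beta = 0.05 and the processor counts are made-up example values, not taken from the slides):

    #include <stdio.h>

    /* Amdahl's Law: relative speedup with sequential fraction beta of the work. */
    double amdahl_speedup(double beta, double p) {
        return p / (beta * p + (1.0 - beta));
    }

    int main(void) {
        double beta = 0.05;                       /* 5% sequential work (example value) */
        for (int p = 1; p <= 1024; p *= 4)
            printf("p = %4d   SU_rel <= %.2f\n", p, amdahl_speedup(beta, p));
        printf("limit for p -> infinity: 1/beta = %.1f\n", 1.0 / beta);
        return 0;
    }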

Proof of Amdahl's Law

    SU_rel = T(1) / T(p) = T(1) / ( T_As + T_Ap(p) )

Assume perfect parallelizability of the parallel part Ap, that is, T_Ap(p) = (1 - β) T(1) / p. Then

    SU_rel = T(1) / ( β T(1) + (1 - β) T(1) / p ) = p / ( β p + 1 - β ) ≤ 1/β.

(Figure: one processor executes the sequential fraction β T(1); the remaining (1-β) T(1) is spread over the p processors as (1-β) T(1) / p each.)

Remark: For most parallel algorithms the sequential part is not a fixed fraction.

NC

Recall the complexity class P: the set of all problems solvable on a RAM in polynomial time.

Can all problems in P be solved fast on a PRAM?

"Nick's class" NC: the set of problems solvable on a PRAM in polylogarithmic time O(log^k n), for some k ≥ 0, using only n^{O(1)} processors (i.e., a polynomial number) in the size n of the input instance.

By self-simulation: NC ⊆ P.

NC — some remarks

Are the problems in NC just the well-parallelizable problems?
Counterexample: searching for a given element in an ordered array is sequentially solvable in logarithmic time (thus in NC), but cannot be solved significantly faster in (EREW-)parallel [PPP 2.5.2].

Are NC algorithms always a good choice?
Time log^3 n is faster than time n^{1/4} only for roughly n > 10^12.

Is NC = P?
For some problems in P no polylogarithmic PRAM algorithm is known, so it is considered likely that NC ≠ P. See P-completeness [PPP p. 46].

Speedup and efficiency w.r.t. other sequential architectures

Parallel algorithm A runs on a "real" parallel machine N with fixed size p; a sequential algorithm S for the same problem runs on a sequential machine M. Measure the execution times T^N_A(p,n) and T^M_S(n) in seconds.

    absolute, machine-uniform speedup of A:   SU_abs(p,n) = T^M_S(n) / T^M_A(p,n)
    parallelization slowdown of A:            SL(n) = T^M_A(1,n) / T^M_S(n)
    hence                                     SU_abs(p,n) = SU_rel(p,n) / SL(n)
    absolute, machine-nonuniform speedup:     T^M_S(n) / T^N_A(n)

The machine-nonuniform variant was used in the 1990s to disqualify parallel processing by comparing against newer superscalar processors.

Example: Cost-optimal parallel sum algorithm on SB-PRAM

Measurements for n = 10,000:

    Processors   Clock cycles   Time     SU_rel   SU_abs    EF
    Sequential       460118     1.84
          1         1621738     6.49      1.00     0.28    1.00
          4          408622     1.63      3.97     1.13    0.99
         16          105682     0.42     15.35     4.35    0.96
         64           29950     0.12     54.15    15.36    0.85
        256           10996     0.04    147.48    41.84    0.58
       1024            6460     0.03    251.04    71.23    0.25

and for n = 100,000:

    Processors   Clock cycles   Time     SU_rel   SU_abs    EF
    Sequential      4600118    18.40
          1        16202152    64.81      1.00     0.28    1.00
          4         4054528    16.22      4.00     1.13    1.00
         16         1017844     4.07     15.92     4.52    0.99
         64          258874     1.04     62.59    17.77    0.98
        256           69172     0.28    234.23    66.50    0.91
       1024           21868     0.09    740.91   210.36    0.72
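As a consistency check on the first table, the speedup and efficiency columns can be recomputed directly from the definitions; the C snippet below (a sketch, using the cycle counts of the n = 10,000, p = 4 row) reproduces the tabulated values:

    #include <stdio.h>

    int main(void) {
        /* Clock cycles from the n = 10,000 table above. */
        double t_seq   = 460118.0;     /* best sequential algorithm */
        double t_par1  = 1621738.0;    /* parallel algorithm on 1 processor */
        double p       = 4.0;
        double t_par_p = 408622.0;     /* parallel algorithm on p = 4 processors */

        double su_rel = t_par1 / t_par_p;        /* relative speedup */
        double su_abs = t_seq  / t_par_p;        /* absolute speedup */
        double eff    = su_rel / p;              /* relative efficiency */

        printf("SU_rel = %.2f, SU_abs = %.2f, EF = %.2f\n", su_rel, su_abs, eff);
        /* prints SU_rel = 3.97, SU_abs = 1.13, EF = 0.99 -- matching the table */
        return 0;
    }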

Scalability

For a machine N with p ≤ p_A(n) processors we have t_A(p,n) = O(c_A(n)/p), and thus SU_abs(p,n) = Ω( p · t_S(n) / c_A(n) ):
linear speedup for cost-optimal A — "well scalable" (in theory) in the range 1 ≤ p ≤ p_A(n).
For fixed n there is no further speedup beyond p_A(n) processors.

For realistic problem sizes (small n, small p) the speedup is often sublinear, i.e., less scalable:
- communication costs (non-PRAM) may increase more than linearly in p
- the sequential part may increase with p — not enough work available

What about scaling the problem size n with p to keep the speedup?

Isoefficiency [Rao, Kumar'87]

Measured efficiency of parallel algorithm A on machine M for problem size n:

    EF(p,n) = T^M_A(1,n) / ( p · T^M_A(p,n) ) = SU_rel(p,n) / p

Let A solve a problem of size n0 on M with p0 processors with efficiency ε.
The isoefficiency function for A is a function of p that expresses the increase in problem size required for A to retain a given efficiency ε.

If the isoefficiency function for A is linear, A is well scalable.
Otherwise (superlinear isoefficiency function), A needs a large increase in n to keep the same efficiency.

Gustafsson's Law

Revisit Amdahl's Law: it assumes that the sequential work As is a constant fraction β of the total work — so when scaling up n, w_As(n) would scale linearly as well!

Gustafsson's Law [Gustafsson'88]: Assume instead that the sequential work is constant (independent of n), given by the sequential fraction α of an unscaled problem (e.g., of size n = 1, run with p = 1), so that

    T_As = α T_1(1),    T_Ap(1) = (1 - α) T_1(1),

and that w_Ap(n) scales linearly in n. Then the scaled relative speedup for n > 1 (with p = n processors) is predicted by

    SU^s_rel(n) = T_n(1) / T_n(n) = α + (1 - α) n = n - α (n - 1).

The sequential part is assumed to be replicated over all processors.
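For comparison with Amdahl's bound, a C sketch evaluating the scaled speedup (the sequential fraction alpha = 0.05 is a made-up example value; p is scaled with the problem size, p = n):

    #include <stdio.h>

    /* Gustafsson's Law: scaled speedup when the problem (and p = n) grows,
       with a constant sequential part that is fraction alpha of the unscaled run. */
    double gustafsson_speedup(double alpha, double n) {
        return n - alpha * (n - 1.0);     /* = alpha + (1 - alpha) * n */
    }

    int main(void) {
        double alpha = 0.05;              /* example sequential fraction */
        for (double n = 1; n <= 1024; n *= 4)
            printf("p = n = %6.0f   scaled speedup = %.2f\n",
                   n, gustafsson_speedup(alpha, n));
        return 0;
    }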

Proof of Gustafsson's Law

(Figure: for n = 1, processor P0 executes α T(1) sequential work followed by (1-α) T(1) parallel work; for n > 1, each of the n processors P0, ..., Pn-1 executes the replicated sequential part plus a 1/n share of the n-fold parallel work.)

Scaled speedup for p = n > 1, assuming perfect parallelizability of Ap for up to p = n processors, and normalizing T_1(1) = 1:

    SU^s_rel(n) = T_n(1) / T_n(n)
                = ( T_As + w_Ap(n) ) / ( T_As + T_Ap(n) )
                = ( α + (1 - α) n ) / 1
                = n - α (n - 1).

Gustafsson's Law yields better speedup predictions for data-parallel algorithms.

Fundamental PRAM algorithms

- reduction  →  see the parallel sum algorithm
- prefix sums
- list ranking

Oblivious (PRAM) algorithm [JaJa 4.4.1]: the control flow (and hence the execution time) does not depend on the input data.
Oblivious algorithms can be represented as arithmetic circuits whose shape depends only on the input size.
Examples: reduction, (parallel) prefix, pointer jumping; sorting networks, e.g. bitonic sort [CLR'90 ch. 28] (see the lab), mergesort.
Counterexample: (parallel) quicksort.

The Prefix-sums problem

Given: a set S (e.g., the integers), a binary associative operator ⊕ on S, and a sequence of n items x_0, ..., x_{n-1} ∈ S.
Compute the sequence y of prefix sums defined by

    y_i = ⊕_{j=0}^{i} x_j    for 0 ≤ i < n.

An important building block of many parallel algorithms! [Blelloch'89]

Typical operations ⊕: integer addition, maximum, bitwise AND, bitwise OR.

Example: a bank account, initially 0$, with daily changes x_0, x_1, ...; the daily balances are (0,) x_0, x_0 + x_1, x_0 + x_1 + x_2, ...

Sequential prefix sums computation

    void seq_prefix( int x[], int n, int y[] )
    {
      int i;
      int ps;                    // i'th prefix sum
      if (n>0) ps = y[0] = x[0];
      for (i=1; i<n; i++) {      // loop body completed from the truncated source
        ps = ps + x[i];
        y[i] = ps;
      }
    }

The task dependence graph is a linear chain of dependences: the computation seems to be inherently sequential — how can it be parallelized?

Parallel prefix sums (1)

Naive parallel implementation: apply the definition

    y_i = ⊕_{j=0}^{i} x_j    for 0 ≤ i < n

and assign one processor to compute each y_i.

Parallel time Θ(n); work and cost Θ(n^2).

But we observe a lot of redundant computation (common subexpressions).
Idea: exploit the associativity of ⊕.

Parallel prefix sums (2)

Algorithmic technique: parallel divide-and-conquer. We consider the simplest variant, called upper/lower parallel prefix.

Recursive formulation: the N-prefix of x_1, ..., x_N is computed by splitting the input into two halves, computing the two N/2-prefixes recursively in parallel, and then adding the total of the first half, x_1 ⊕ ... ⊕ x_{N/2}, to every prefix of the second half.

Parallel time: log n steps; work: (n/2) log n additions; cost: Θ(n log n).
Not work-optimal! ... and the combine step needs concurrent read.
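A C sketch of the upper/lower scheme (not from the slides; a sequential simulation in which the two recursive calls and the loop iterations stand for parallel activities):

    /* Upper/lower parallel prefix, simulated sequentially.
       a[0..n-1] is overwritten with its prefix sums.
       Depth log2 n, work (n/2) log2 n additions for n a power of two. */
    void ul_prefix(int a[], int n)
    {
        if (n <= 1) return;
        int h = n / 2;
        ul_prefix(a, h);               /* lower half a[0..h-1]   (in parallel) */
        ul_prefix(a + h, n - h);       /* upper half a[h..n-1]   (in parallel) */
        for (int i = h; i < n; i++)    /* one parallel step: all i concurrently */
            a[i] += a[h - 1];          /* add the total of the lower half (concurrent read!) */
    }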

Parallel prefix sums (3)

Rework the upper/lower prefix sums algorithm for exclusive read; iterative formulation in data-parallel pseudocode:

    real a : array[0..N-1];
    int stride;
    stride := 1;
    while stride < N do
        forall i : [0..N-1] in parallel do
            if i >= stride then
                a[i] := a[i - stride] + a[i];
        stride := stride * 2;
    od
    (* finally, a[i] holds the i-th prefix sum; the total sum is in a[N-1] *)

(Figure: after the round with stride 2^k, each a[i] holds the sum of the up to 2^{k+1} input elements ending at position i.)

Θ(log n) rounds, but work Θ(n log n) — still not work-optimal.

Parallel prefix sums (4)

Odd/even parallel prefix P^oe(n):
1. combine adjacent pairs z_j = x_{2j-1} ⊕ x_{2j}  (n/2 additions, one parallel step);
2. recursively compute the prefix sums of z_1, ..., z_{n/2} — these are exactly the prefix sums y_2, y_4, ..., y_n at the even positions;
3. compute the odd positions y_{2j+1} = y_{2j} ⊕ x_{2j+1}  (n/2 - 1 additions, one parallel step); y_1 = x_1 needs no work.

EREW, 2 log n - 2 time steps, work 2n - log n - 2 additions, cost Θ(n log n) with n processors.
Not cost-optimal! But Brent's theorem may be used ...
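A C sketch of the odd/even construction (not from the slides; sequential simulation, 0-based indexing, n assumed to be a power of two, a C99 variable-length array used for brevity):

    /* Odd/even parallel prefix: about 2 log2 n parallel steps and
       2n - log2 n - 2 additions; x[0..n-1] is overwritten with its prefix sums. */
    void oe_prefix(int x[], int n)
    {
        if (n == 1) return;
        int half = n / 2;
        int z[half];                        /* pair sums (step 1) */

        for (int j = 0; j < half; j++)      /* step 1: one parallel addition step */
            z[j] = x[2*j] + x[2*j + 1];

        oe_prefix(z, half);                 /* step 2: recursive prefix of the pair sums */

        /* step 3: z[j] now holds the prefix sum ending at position 2j+1 */
        for (int j = 1; j < half; j++) {    /* one more parallel step */
            x[2*j + 1] = z[j];
            x[2*j]     = z[j-1] + x[2*j];
        }
        x[1] = z[0];                        /* x[0] already holds its own prefix */
    }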

Ladner/Fischer parallel prefix [Ladner/Fischer'80]

Combines the advantages of upper/lower and odd/even parallel prefix.

EREW, time log n steps, work 4n - 4.96 n^{0.69} + 1, cost Θ(n log n).

It can be made cost-optimal using Brent's theorem:
the prefix-sums problem can be solved on an EREW PRAM with n / log n processors in Θ(log n) time steps and cost Θ(n).

Towards List Ranking

Parallel list: an (unordered) array of list items (one per processor), singly linked via next pointers.
Problem: for each element, find the end of its list.

(Figure: a linked list in which, after each jumping round, every chum pointer skips twice as many list elements.)

Algorithmic technique: recursive doubling, here "pointer jumping" [Wyllie'79].

The algorithm in pseudocode:

    forall k in [1..N] in parallel do
        chum[k] := next[k];
        while chum[k] != null and chum[chum[k]] != null do
            chum[k] := chum[chum[k]];
        od
    od

The lengths of the chum chains are halved in each step, so ⌈log N⌉ pointer-jumping steps suffice.
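A C sketch of the pointer-jumping loop above (not from the slides; a sequential simulation of the synchronous rounds, with lists represented as an array of indices and next[k] == k marking the list end instead of a null pointer):

    /* Pointer jumping: after ceil(log2 L) rounds, chum[k] points to the last
       element of k's list (lists given by next[], with next[k] == k at the end). */
    void find_list_end(const int next[], int chum[], int n)
    {
        for (int k = 0; k < n; k++)            /* forall k in parallel */
            chum[k] = next[k];

        for (;;) {                             /* synchronous rounds */
            int changed = 0;
            int newchum[n];                    /* all reads before any write = one PRAM step */
            for (int k = 0; k < n; k++)
                newchum[k] = chum[chum[k]];    /* jump over one element */
            for (int k = 0; k < n; k++) {
                if (newchum[k] != chum[k]) changed = 1;
                chum[k] = newchum[k];
            }
            if (!changed) break;               /* every chum[k] has reached its list end */
        }
    }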

List ranking

Extended problem: for each element, compute its rank = the distance to the end of the list.

Pointer jumping [Wyllie'79], EREW: every element starts with distance value 1 (the last element with 0); in each step, each element adds the distance value of its successor, which it then splices out of its view of the list. Every step doubles the number of (virtual) lists and halves their lengths, so ⌈log2 n⌉ steps suffice. Not work-efficient!

(Figure: evolution of the rank values over the ⌈log2 n⌉ rounds.)

The same technique also computes parallel prefix sums on a list.  →  Exercise

List ranking (2): Pointer jumping — implementation in Fork

NULL checks can be avoided by marking the list end with a self-loop.

    sync wyllie( sh LIST list[], sh int length )
    {
      LIST *e;                // private pointer to my own list element
      int nn;
      e = list[$$];           // $$ is my processor index
      if (e->next != e) e->rank = 1; else e->rank = 0;
      nn = length;
      while (nn>1) {
        e->rank = e->rank + e->next->rank;
        e->next = e->next->next;
        nn = nn>>1;           // division by 2
      }
    }

CREW is more powerful than EREW

Example problem: given a directed forest, compute for each node a pointer to the root of its tree.

CREW: with the pointer-jumping technique, ⌈log2(maximum depth)⌉ steps suffice; e.g. for a balanced binary tree this is O(log log n). An O(1) algorithm exists.

EREW: lower bound Ω(log n) steps — per step, one given value can be copied to at most one other location, so after k steps at most 2^k locations can contain the identity of the root (e.g. of a single binary tree). A Θ(log n) EREW algorithm exists.

Simulating a CRCW algorithm with an EREW algorithm

A p-processor CRCW algorithm can be no more than O(log p) times faster than the best p-processor EREW algorithm for the same problem.

Step-by-step simulation [Vishkin'83], for the Weak/Common/Arbitrary CRCW PRAM: handle concurrent writes with an auxiliary array A of pairs. If CRCW processor i should write x_i into location l_i:
- EREW processor i writes the pair <l_i, x_i> to A[i]
- sort A by first coordinates on p EREW processors in time O(log p) [Ajtai/Komlos/Szemeredi'83], [Cole'88]
- processor j inspects the write requests A[j] = <l_k, x_k> and A[j-1] = <l_q, x_q> and assigns x_k to l_k iff l_k ≠ l_q or j = 0.

For the Combining (Maximum) CRCW PRAM: see [PPP p. 66/67].

Simulation summary

    EREW  ≺  CREW  ≺  CRCW
    Common CRCW  ≺  Priority CRCW
    Arbitrary CRCW  ≺  Priority CRCW

where ≺ means "strictly weaker than" (a transitive relation). See [PPP p. 68/69] for more separation results.

PRAM Variants [PPP 2.6]

- Broadcasting with selective reduction (BSR) PRAM
- Distributed RAM (DRAM)
- Local memory PRAM (LPRAM)
- Asynchronous PRAM
- Queued PRAM (QRQW PRAM)
- Hierarchical PRAM (H-PRAM)

Message passing models: Delay model, BSP, LogP, LogGP  →  Lecture 4

FDA125 APP Lecture 2: Foundations of parallel algorithms. 51 C. Kessler, IDA, Linkopings¨ Universitet, 2003. FDA125 APP Lecture 2: Foundations of parallel algorithms. 52 C. Kessler, IDA, Linkopings¨ Universitet, 2003. Broadcasting with selective reduction (BSR) Asynchronous PRAM

BSR: generalization of a Combine CRCW PRAM [Akl/Guenther’89] Asynchronous PRAM [Cole/Zajicek’89] [Gibbons’89] [Martel et al’92]

SHARED MEMORY 1 BSR write step: ...... store_sh load_sh fetch&incr Each processor can write a value to all memory locations (broadcast) NETWORK atomic_incr Each memory location computes a global reduction (max, sum, ...) P P 1 P2 ...... Pp-1 processors over a specified subset of all incoming write contributions (selective re- 0 duction) store_pr load_pr M M M ...... M 0 1 2 p-1 private memory modules

No common clock No uniform memory access time Sequentially consistent shared memory FDA125 APP Lecture 2: Foundations of parallel algorithms. 53 C. Kessler, IDA, Linkopings¨ Universitet, 2003. FDA125 APP Lecture 2: Foundations of parallel algorithms. 54 C. Kessler, IDA, Linkopings¨ Universitet, 2003. Delay model BSP model

Delay model

Idealized multicomputer: a point-to-point communication costs time t_msg. The cost of communicating a larger block of n bytes is

    t_msg(n) = sender overhead + latency + receiver overhead + n / bandwidth
             = t_startup + n · t_transfer

where t_startup is the startup time (the time to send a 0-byte message), accounting for hardware and software overhead, and t_transfer is the transfer time per word sent, which depends on the network bandwidth.

Assumption: the network is not overloaded and no conflicts occur at routing.

BSP model

Bulk-synchronous parallel programming [Valiant'90] [McColl'93]

BSP computer = abstract message-passing architecture (p, L, g, s): p processors (MIMD, programmed in SPMD style), barrier/synchronization cost L, gap g = time per word communicated, processor speed s.

A BSP program is a sequence of supersteps, separated by (logical) global barriers. Each superstep consists of
- a local computation phase, using local data only, and
- a communication phase (message passing), whose pattern/volume is an h-relation: h_i [words] = communication fan-in/fan-out of processor P_i, and h = max_{1≤i≤p} h_i.

(Figure: a superstep on processors P0 ... P9: local computations of varying size, communication phase, next barrier.)

Cost of a superstep with maximum local work w:

    t_step = w + h·g + L
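Both cost formulas can be evaluated mechanically; the C helpers below are a sketch (parameter values must be measured for a concrete machine, so none are supplied here):

    /* Delay model: time to communicate a block of n words point-to-point. */
    double delay_tmsg(double t_startup, double t_transfer, double n) {
        return t_startup + n * t_transfer;
    }

    /* BSP model: cost of one superstep with local work w, h-relation h,
       gap g (time per word communicated) and barrier cost L. */
    double bsp_tstep(double w, double h, double g, double L) {
        return w + h * g + L;
    }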

BSP example: Global maximum computation (non-optimal algorithm)

Compute the maximum of n numbers A[0..n-1] on a BSP(p, L, g, s) machine; A is distributed block-wise across the p processors.

    // superstep 1
    // local computation phase:
    m := -infinity;
    for all A[i] in my local partition of A
        m := max( m, A[i] );
    // communication phase:
    if myPID != 0
        send( m, 0 );
    else   // on P0:
        for each i in { 1, ..., p-1 }
            recv( m_i, i );
    // next superstep, local computation on P0:
    if myPID = 0
        for each i in { 1, ..., p-1 }
            m := max( m, m_i );

Analysis: local work w = Θ(n/p); communication h-relation h = p - 1 (P0 is the bottleneck); with t_step = w + h·g + L the total time is

    Θ( n/p + p·g + L ).

LogP model (1)

The LogP model [Culler et al. 1993] describes the cost of communicating small messages (a few bytes). It abstracts from the network topology and has 4 parameters:
- latency L
- overhead o
- gap g (models the bandwidth): g = inverse network bandwidth per processor
- processor number P

The network capacity is L/g messages to or from each processor. L, o and g are typically measured as multiples of the CPU cycle time.

Transmission time for a small message: L + 2o, provided the network capacity is not exceeded.

LogP model (2)

Example: broadcast from P0 to P1, P2, P3 along a 2-dimensional hypercube, with example parameters P = 4, o = 2 µs, g = 3 µs, L = 5 µs.

(Figure: time axis from 0 to 18 µs showing the send overheads o, the gaps g between consecutive sends, and the latencies L of the individual messages.)

With these parameters it takes at least 18 µs to broadcast 1 byte from P0 to P1, P2, P3.

Remark: for determining time-optimal broadcast trees in LogP, see [Papadimitriou/Yannakakis'89], [Karp et al.'93].

LogP model (3): the LogGP model

The LogGP model [Culler et al. '95] extends LogP by the parameter G = gap per word, to model block communication.

Remark: the gap constraint does not apply to recv/send sequences.

Communication of an n-word block:

    with the LogP model:    t_n = (n-1)·g + L + 2o
    with the LogGP model:   t_n = o + (n-1)·G + L + o
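The two block-transfer formulas are easy to compare numerically; the C sketch below reuses the example LogP parameters from the broadcast above, while the per-word gap G = 0.1 µs is a made-up placeholder value:

    #include <stdio.h>

    /* Time to transfer an n-word block between two processors. */
    double logp_block(double n, double o, double g, double L) {
        return (n - 1.0) * g + L + 2.0 * o;        /* n separate small messages */
    }
    double loggp_block(double n, double o, double G, double L) {
        return o + (n - 1.0) * G + L + o;          /* one block, per-word gap G */
    }

    int main(void) {
        double o = 2.0, g = 3.0, L = 5.0, G = 0.1;  /* microseconds; G is an assumed value */
        for (double n = 1; n <= 1000; n *= 10)
            printf("n = %5.0f   LogP: %8.1f us   LogGP: %8.1f us\n",
                   n, logp_block(n, o, g, L), loggp_block(n, o, G, L));
        return 0;
    }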

Summary

Parallel computation models:
- shared memory: PRAM and PRAM variants
- message passing: Delay model, BSP, LogP, LogGP
- parallel time, work, cost

Parallel algorithmic paradigms (up to now): parallel divide-and-conquer (includes reduction and pointer jumping / recursive doubling).

Fundamental parallel algorithms: global sum, prefix sums, list ranking, broadcast.