Theoretical Parallel Computing

Arnaud Legrand, CNRS, University of Grenoble
LIG laboratory, [email protected]

November 1, 2009

Outline

1. Parallel RAM
   - Introduction
   - Pointer Jumping
   - Reducing the Number of Processors
   - PRAM Model Hierarchy
   - Conclusion
2. Combinatorial Networks
   - Merge Sort
   - 0-1 Principle
   - Odd-Even Transposition Sort
   - FFT
3. Conclusion


Models for Parallel Computation

- We have seen how to implement parallelism in practice, and we have come up with performance analyses.
- In traditional complexity work, the Turing machine makes it possible to compare algorithms precisely, to establish precise notions of complexity, etc.
- Can we do something like this for parallel computing?
- Parallel machines are complex, with many hardware characteristics that are difficult to take into account for algorithm work (e.g., the network). Is it hopeless?
- The most famous theoretical model of parallel computing is the PRAM model.
- We will see that many principles in the model are really at the heart of the more applied things we have seen so far.

Courtesy of Henri Casanova

The PRAM Model

- Parallel Random Access Machine (PRAM): an imperfect model that relates only tangentially to the performance of a real parallel machine, just as a Turing machine relates only tangentially to the performance of a real computer.
- Goal: make it possible to reason about and classify parallel algorithms, and to obtain complexity results (optimality, minimal complexity results, etc.).
- One way to look at it: the model makes it possible to determine the "maximum parallelism" in an algorithm or a problem, and makes it possible to devise new algorithms.

The PRAM Model

- Memory size is infinite, and the number of processors is unbounded; but nothing prevents you from "folding" this back to something more realistic.
- No direct communication between processors: they communicate via the memory, and they can operate in an asynchronous fashion.
- Every processor accesses any memory location in 1 cycle.
- Typically, all processors execute the same algorithm in a synchronous fashion: READ phase, COMPUTE phase, WRITE phase.
- Some subset of the processors can stay idle (e.g., even-numbered processors may not work while odd-numbered processors do, and conversely).

[Figure: processors P1, P2, P3, ..., PN all connected to a single shared memory]

Memory Access in PRAM

- Exclusive Read (ER): p processors can simultaneously read the content of p distinct memory locations.
- Concurrent Read (CR): p processors can simultaneously read the content of p' memory locations, where p' < p.
- Exclusive Write (EW): p processors can simultaneously write to p distinct memory locations.
- Concurrent Write (CW): p processors can simultaneously write to p' memory locations, where p' < p.

PRAM CW?

- What ends up being stored when multiple writes occur?
  - Priority CW: processors are assigned priorities, and the top-priority processor is the one that counts for each group write.
  - Fail common CW: if the values are not equal, no change.
  - Collision common CW: if the values are not equal, write a "failure value".
  - Fail-safe common CW: if the values are not equal, the algorithm aborts.
  - Random CW: non-deterministic choice of the value written.
  - Combining CW: write the sum, average, max, min, etc. of the values.
  - etc.
- The above means that when you write an algorithm for a CW PRAM, you can use any of the above at different points in time.
- It doesn't correspond to any hardware in existence; it is just a logical/algorithmic notion that could be implemented in software.
- In fact, most algorithms end up not needing CW.

Classic PRAM Models

- CREW (concurrent read, exclusive write): most commonly used.
- CRCW (concurrent read, concurrent write): most powerful.
- EREW (exclusive read, exclusive write): most restrictive, but unfortunately probably the most realistic.
- Theorems exist that prove the relative power of the above models (more on this later).

PRAM Example 1

- Problem:
  - We have a linked list of length n.
  - For each element i, compute its distance to the end of the list:
        d[i] = 0                 if next[i] = NIL
        d[i] = d[next[i]] + 1    otherwise
- The sequential algorithm is in O(n).
- We can define a PRAM algorithm in O(log n):
  - associate one processor to each element of the list;
  - at each iteration, split the list in two, with odd-placed and even-placed elements in different lists.

PRAM Example 1

- Principle:
  - look at the next element;
  - add its d[i] value to yours;
  - point to the next element's next element.

- Successive d values for a 6-element list:

      1 1 1 1 1 0
      2 2 2 2 1 0
      4 4 3 2 1 0
      5 4 3 2 1 0

- The size of each list is halved at each step, hence the O(log n) complexity.

PRAM Example 1

- Algorithm:

      forall i
          if next[i] == NIL then d[i] ← 0 else d[i] ← 1
      while there is an i such that next[i] ≠ NIL
          forall i
              if next[i] ≠ NIL then
                  d[i] ← d[i] + d[next[i]]
                  next[i] ← next[next[i]]

- What about the correctness of this algorithm?

forall loop

- At each step, the updates must be synchronized so that pointers point to the right things:
      next[i] ← next[next[i]]
- This is ensured by the semantics of forall:

      forall i            forall i
          A[i] = B[i]  ≡      tmp[i] = B[i]
                          forall i
                              A[i] = tmp[i]

- Nobody really writes it out, but one mustn't forget that this is really what happens underneath.
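The buffering implied by the forall semantics can be sketched in Python (a sequential simulation, not from the slides; the helper name forall_assign is ours):

```python
# Sketch: simulating the synchronous semantics of "forall i: A[i] = expr(i)".
# All reads happen before any write, so a sequential simulation must buffer
# the reads in a temporary array, exactly as the slide's tmp[] does.

def forall_assign(A, expr):
    """Evaluate expr(i) for every i first (READ/COMPUTE phase),
    then write all results back (WRITE phase)."""
    tmp = [expr(i) for i in range(len(A))]   # READ/COMPUTE phase
    for i in range(len(A)):                  # WRITE phase
        A[i] = tmp[i]

A = [0, 1, 2, 3, 4]
# forall i: A[i] = A[(i+1) mod n] -- without the buffer, a naive
# sequential loop would propagate A[0]'s new value down the array.
forall_assign(A, lambda i: A[(i + 1) % len(A)])
print(A)  # -> [1, 2, 3, 4, 0]
```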

while condition

- while there is an i such that next[i] ≠ NIL
- How can one do such a global test on a PRAM?
  - It cannot be done in constant time unless the PRAM is CRCW: at the end of each step, each processor could write TRUE or FALSE to the same memory location depending on whether next[i] equals NIL, and one can take the AND of all values (to resolve the concurrent writes).
  - On a CREW PRAM, one needs O(log n) steps for a global test like the above.
- In this case, one can simply rewrite the while loop into a for loop, because we have analyzed how the iterations proceed:
      for step = 1 to log n

What type of PRAM?

- The previous algorithm does not require a CW machine, but
      tmp[i] ← d[i] + d[next[i]]
  requires concurrent reads, on processors i and j such that j = next[i].
- Solution: split it into two instructions:
      tmp2[i] ← d[i]
      tmp[i] ← tmp2[i] + d[next[i]]
  (note that the two instructions above are technically in two different forall loops)
- Now we have an execution that works on an EREW PRAM, which is the most restrictive type.

Final Algorithm on an EREW PRAM

      forall i                                           -- O(1)
          if next[i] == NIL then d[i] ← 0 else d[i] ← 1
      for step = 1 to log n                              -- O(log n) iterations
          forall i                                       -- O(1) per iteration
              if next[i] ≠ NIL then
                  tmp[i] ← d[i]
                  d[i] ← tmp[i] + d[next[i]]
                  next[i] ← next[next[i]]

Conclusion: one can compute the length of a list of size n in time O(log n) on any PRAM.
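The whole EREW algorithm can be simulated sequentially as a sanity check. The sketch below (ours, not from the slides) represents the list as an array of successor indices with None standing for NIL; each forall is simulated by snapshotting d and next before writing, which is what the tmp[] trick enforces:

```python
import math

# Sequential simulation (sketch) of the EREW list-ranking algorithm.
def list_ranking(next_):
    n = len(next_)
    d = [0 if next_[i] is None else 1 for i in range(n)]   # O(1) init
    nxt = list(next_)
    for _ in range(max(1, math.ceil(math.log2(n)))):       # log n steps
        # READ phase: snapshot so every "PU" sees the same old values
        tmp_d, tmp_nxt = list(d), list(nxt)
        for i in range(n):                                 # forall i
            if tmp_nxt[i] is not None:
                d[i] = tmp_d[i] + tmp_d[tmp_nxt[i]]
                nxt[i] = tmp_nxt[tmp_nxt[i]]
    return d

# List 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> NIL, as in the earlier figure
print(list_ranking([1, 2, 3, 4, 5, None]))  # -> [5, 4, 3, 2, 1, 0]
```

The successive values of d match the rows of the pointer-jumping figure.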

Cost and Work

- Let P be a problem of size n that we want to solve (e.g., a computation over an n-element list).
- Let Tseq(n) be the execution time of the best (known) sequential algorithm for solving P.
- Let us now consider a PRAM algorithm that solves P in time Tpar(p, n) with p PUs.

Definition: Cost and Work.
The cost of a PRAM algorithm is defined as

    Cp(n) = p · Tpar(p, n).

The work Wp(n) of a PRAM algorithm is the sum, over all PUs, of the number of operations performed. The difference between cost and work is that the work does not account for PU idle time.

Intuitively, the cost is a rectangle of area p × Tpar(p, n). Therefore, the cost is minimal if, at each step, all PUs are used to perform useful computations, i.e., computations that are part of the sequential algorithm.

Speedup and Efficiency

The speedup of a PRAM algorithm is the factor by which the program's execution time is lower than that of the sequential program.

Definition: Speedup and Efficiency.
The speedup of a PRAM algorithm is defined as

    Sp(n) = Tseq(n) / Tpar(p, n).

The efficiency of the PRAM algorithm is defined as

    ep(n) = Sp(n) / p = Tseq(n) / (p · Tpar(p, n)).

Some authors use the definition Sp(n) = Tpar(1, n) / Tpar(p, n), where Tpar(1, n) is the execution time of the algorithm with a single PU. This definition quantifies how the algorithm scales: if the speedup is close to p (i.e., is Ω(p)), one says that the algorithm scales well. However, this definition does not reflect the quality of the parallelization (i.e., how much we can really gain by parallelization), and it is thus better to use Tseq(n) than Tpar(1, n).
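These definitions translate directly into code. A minimal sketch (the helper names and example numbers are ours), using the list-ranking algorithm where Tseq(n) = n and Tpar(n, n) = log n:

```python
import math

# Direct transcription (sketch) of the definitions above.
def cost(p, t_par):
    return p * t_par                 # Cp(n) = p . Tpar(p, n)

def speedup(t_seq, t_par):
    return t_seq / t_par             # Sp(n) = Tseq(n) / Tpar(p, n)

def efficiency(t_seq, p, t_par):
    return t_seq / cost(p, t_par)    # ep(n) = Sp(n) / p

# List ranking with p = n PUs: Tseq(n) = n, Tpar(n, n) = log n
n = 1024
t_seq, p, t_par = n, n, math.log2(n)
print(cost(p, t_par))                # -> 10240.0 (not cost-optimal: > Tseq)
print(speedup(t_seq, t_par))         # -> 102.4
print(efficiency(t_seq, p, t_par))   # -> 0.1
```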

Brent's Theorem

Theorem 1 (Brent).
Let A be an algorithm that executes a total of m operations and that runs in time t on a PRAM (with some unspecified number of PUs). A can be simulated in time O(m/p + t) on a PRAM of the same type that contains p PUs.

Proof.
Say that at step i, A performs m(i) operations (implying that the sum of m(i) over the t steps equals m). Step i can be simulated with p PUs in time ⌈m(i)/p⌉ ≤ m(i)/p + 1. One can simply sum these upper bounds over the t steps to prove the theorem.

Brent Theorem

- Theorem: let A be an algorithm with m operations that runs in time t on some PRAM (with some number of processors). It is possible to simulate A in time O(t + m/p) on a PRAM of the same type with p processors.
- Example: maximum of n elements on an EREW PRAM.
  - It can clearly be done in O(log n) with O(n) processors: compute a series of pairwise maxima. The first step requires O(n/2) processors.
  - What happens if we have fewer processors? By the theorem, with p processors one can simulate the same algorithm in time O(log n + n/p).
  - If p = n / log n, then we can simulate the same algorithm in O(log n + log n) = O(log n) time, which has the same complexity!
- This theorem is useful to obtain lower bounds on the number of processors that can still achieve a given complexity.
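A small sketch (our own simulation, not from the slides) illustrating the theorem on the pairwise-maximum example: each PRAM step i performing m(i) operations is simulated in ⌈m(i)/p⌉ time units with p PUs, and with p = n / log n the total stays O(log n):

```python
import math

def simulated_max_time(n, p):
    """Parallel time to compute max of n elements (pairwise maxima),
    when each step's m(i) operations are simulated with only p PUs."""
    steps = 0
    m = n // 2                       # m(i): number of pairwise maxima
    while m >= 1:
        steps += math.ceil(m / p)    # Brent: one step costs ceil(m(i)/p)
        m //= 2
    return steps

n = 1024
print(simulated_max_time(n, n // 2))                   # -> 10 (= log n)
print(simulated_max_time(n, n // int(math.log2(n))))   # -> 18 (still O(log n))
```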

Another Useful Theorem

Theorem 2.
Let A be an algorithm whose execution time is t on a PRAM with p PUs. A can be simulated on a PRAM of the same type with p' ≤ p PUs in time O(t·p / p'). The cost of the algorithm on the smaller PRAM is at most twice the cost on the larger PRAM.

Proof.
Each step of A can be simulated in at most ⌈p/p'⌉ time units with p' ≤ p PUs, by simply reducing concurrency and having the p' PUs perform sequences of operations. Since there are at most t steps, the execution time t' of the simulated algorithm is at most ⌈p/p'⌉·t. Therefore, t' = O((p/p')·t) = O(t·p/p').
We also have Cp' = t'·p' ≤ ⌈p/p'⌉·p'·t ≤ (p/p' + 1)·p'·t = p·t·(1 + p'/p) = Cp·(1 + p'/p) ≤ 2Cp.

Another useful theorem

- Theorem: let A be an algorithm that executes in time t on a PRAM with p processors. One can simulate A on a PRAM with p' processors in time O(t·p/p').
- This makes it possible to realize the "folding" we talked about earlier, by which one goes from an unbounded number of processors to a bounded number. For instance, a running time of A·n·log n + B on n processors becomes:
  - A·n² + B·n/(log n) on log n processors;
  - A·(n²·log n)/10 + B·n/10 on 10 processors.

Are all PRAMs equivalent?

- Consider the following problem: given an array of n distinct elements e_1, ..., e_n, find whether some element e is in the array.
- On a CREW PRAM, there is an algorithm that works in time O(1) with n processors:
  - initialize a boolean to FALSE;
  - each processor i reads e_i and e and compares them;
  - if they are equal, write TRUE into the boolean (only one processor will write, so exclusive write suffices).
- On an EREW PRAM, one cannot do better than O(log n):
  - each processor must read e separately;
  - at worst a complexity of O(n), with sequential reads;
  - at best a complexity of O(log n), with a series of "doublings" of the value at each step so that eventually everybody has a copy (just like a broadcast in a binary tree, or in fact a k-ary tree for some constant k).
- Generally, "diffusion of information" to n processors on an EREW PRAM takes O(log n).
- Conclusion: CREW PRAMs are more powerful than EREW PRAMs.

CRCW > CREW

- O(1) on a CRCW PRAM.
- Ω(log n) on a CREW PRAM.

Therefore, we have CRCW > CREW > EREW (with at least a logarithmic factor each time).

Simulation Theorem

- Simulation theorem: any algorithm running on a CRCW PRAM with p processors can be at most O(log p) times faster than the best algorithm for the same problem on an EREW PRAM with p processors.
- Proof: "simulate" the concurrent writes. Each step of the algorithm on the CRCW PRAM is simulated as O(log p) steps on an EREW PRAM:
  - When P_i writes value x_i to address l_i, one replaces the write by an (exclusive) write of (l_i, x_i) to A[i], where A is an auxiliary array with one slot per processor.
  - Then one sorts array A by the first component of its content.
  - Processor i of the EREW PRAM looks at A[i] and A[i-1]: if the first components differ, or if i = 0, it writes the value of the pair in A[i] to that pair's address.
  - Since A is sorted according to the first component, the writes are exclusive.

Proof (continued)

Picking one processor for each competing write:

    P0 → (29,43) = A[0]           A[0] = (8,12)    P0 writes
    P1 → (8,12)  = A[1]           A[1] = (8,12)    P1 nothing
    P2 → (29,43) = A[2]   sort    A[2] = (29,43)   P2 writes
    P3 → (29,43) = A[3]   --->    A[3] = (29,43)   P3 nothing
    P4 → (92,26) = A[4]           A[4] = (29,43)   P4 nothing
    P5 → (8,12)  = A[5]           A[5] = (92,26)   P5 writes

Courtesy of Henri Casanova
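One simulated CRCW step can be sketched in Python, following the construction above (the function name is ours; Python's built-in sort stands in for the O(log p) EREW sort, and ties are broken by processor id, which matches priority CW). The write pattern is the one from the figure:

```python
# Sketch: simulate one CRCW step on an EREW machine by sorting the
# (address, value) pairs and letting only the first PU of each group
# of equal addresses perform the now-exclusive write.
def simulate_crcw_step(writes, memory):
    """writes: one (address, value) pair per processor, indexed by PU id."""
    A = sorted(enumerate(writes), key=lambda t: (t[1][0], t[0]))
    for i, (pid, (addr, val)) in enumerate(A):
        if i == 0 or A[i - 1][1][0] != addr:   # first writer for this addr
            memory[addr] = val
    return memory

# P0..P5 as in the figure: three groups of concurrent writes
writes = [(29, 43), (8, 12), (29, 43), (29, 43), (92, 26), (8, 12)]
print(simulate_crcw_step(writes, {}))  # -> {8: 12, 29: 43, 92: 26}
```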

Proof (continued)

- Note that we said that we just sort array A.
- If we have an algorithm that sorts p elements with O(p) processors in O(log p) time, we're set.
- It turns out there is such an algorithm: Cole's algorithm, basically a merge sort in which lists are merged in constant time!
- It's beautiful, but rather complicated, and we don't really have time for it.
- Therefore, the proof is complete.

And many, many more things

- J. Reif (editor), Synthesis of Parallel Algorithms, Morgan Kaufmann, 1993: everything you've ever wanted to know about PRAMs.
- Every now and then, there are references to PRAMs in the literature:
  - "by the way, this can be done in O(x) on a XRXW PRAM";
  - "this network can simulate an EREW PRAM, and thus we know a bunch of useful algorithms (and their complexities) that we can instantly implement";
  - etc.
- You probably will never care if all you do is hack MPI and OpenMP code.

Relevance of the PRAM model

- Central question in complexity theory for sequential algorithms: P = NP?
- Central question in complexity theory for parallel algorithms: P = NC?
- NC is the class of all problems that, with a polynomial number of PUs, can be solved in polylogarithmic time: a problem of size n is in NC if it can be solved in O((log n)^c) time with O(n^k) PUs, where c and k are constants.

Nice theoretical complexity considerations, huh?

Skeptical readers may question the relevance of the PRAM model for practical implementation purposes:
- You can't have n = 100,000,000 PUs.
- One never has an immediately addressable, unbounded shared parallel memory.

Relevance of the PRAM model

This criticism, similar to the criticism of O(n^17) "polynomial" time algorithms, often comes from a misunderstanding of the role of theory:
- Theory is not everything: theoretical results are not to be taken as is and implemented by engineers.
- Theory is also not nothing: the fact that an algorithm cannot in general be implemented as is does not mean it is meaningless.

Deep understanding.
When an O(n^17) algorithm is designed for a problem, it does not lead to a practical way to solve that problem. However, it proves something inherent about the problem (namely, that it is in P). Hopefully, in the process of proving this result, key insights may be developed that can later be used in practical algorithms for this or other problems, or for other theoretical results. Similarly, the design of PRAM algorithms for a problem proves something inherent to the problem (namely, that it is parallelizable) and can in turn lead to new ideas.

Relevance of the PRAM model

Candidate for Practical Implementation.
Even if communications are not taken into account in the performance evaluation of PRAM algorithms (a potentially considerable discrepancy between theoretical complexity and practical execution time), trying to design fast PRAM algorithms is not a useless pursuit. Indeed, PRAM algorithms can be simulated on other models and do not necessarily incur prohibitive communication overheads. It is commonly admitted that only cost-optimal PRAM algorithms have potential practical relevance, and that the most promising are those cost-optimal algorithms with minimal execution time.

Outline

1. Parallel RAM
2. Combinatorial Networks
   - Merge Sort
   - 0-1 Principle
   - Odd-Even Transposition Sort
   - FFT
3. Conclusion

Combinational circuits/networks

- More realistic than PRAMs, but more restricted.
- Algorithms for combinational circuits were among the first parallel algorithms developed.
- Understanding how they work makes it easier to learn more complex parallel algorithms.
- Many combinational-circuit algorithms provide the basis for algorithms in other models (they are good building blocks).
- We're going to look at:
  - sorting networks;
  - the FFT circuit.

Sorting Networks

- Goal: sort lists of numbers.
- Main principle: computing elements (comparators) take two numbers as input and sort them:

      inputs a, b  →  outputs min(a,b), max(a,b)

- We arrange comparators in a network.
- We look for an architecture that depends only on the size of the lists to be sorted, not on the values of the elements.

Merge sort on a sorting network

- First, build a network to merge two sorted lists.
- Some notation:
  - (c1, c2, ..., cn) is a list of numbers;
  - sort(c1, c2, ..., cn) is the same list, sorted;
  - sorted(x1, x2, ..., xn) is true if the list is sorted;
  - if sorted(a1, ..., an) and sorted(b1, ..., bn), then merge((a1, ..., an), (b1, ..., bn)) = sort(a1, ..., an, b1, ..., bn).
- We're going to build a network, merge_m, that merges two sorted lists of 2^m elements each.
- Base cases:
  - m = 0: a single comparator; outputs min(a1, b1), max(a1, b1).
  - m = 1: outputs min(a1, b1), min(max(a1, b1), min(a2, b2)), max(max(a1, b1), min(a2, b2)), max(a2, b2).

What about m = 2?

[Figure: the merge_2 network. Inputs a1, b1, a3, b3 feed one copy of merge_1 (the odd-indexed elements); inputs a2, b2, a4, b4 feed a second copy (the even-indexed elements); a final row of comparators completes the merge.]

- Why does this work? To build merge_m one uses:
  - 2 copies of the merge_{m-1} network;
  - 1 row of 2^m - 1 comparators.
- The first copy of merge_{m-1} merges the odd-indexed elements, the second copy merges the even-indexed elements.
- The row of comparators completes the global merge, which is quite a miracle really.

Theorem to build merge_m

Given sorted(a1, ..., a2n) and sorted(b1, ..., b2n), let
- (d1, ..., d2n) = merge((a1, a3, ..., a2n-1), (b1, b3, ..., b2n-1)),
- (e1, ..., e2n) = merge((a2, a4, ..., a2n), (b2, b4, ..., b2n)).
Then

    sorted(d1, min(d2,e1), max(d2,e1), ..., min(d2n,e2n-1), max(d2n,e2n-1), e2n).

Proof

- Assume all elements are distinct.
- d1 is indeed the first element, and e2n is the last element of the global sorted list.
- For 1 < i ≤ 2n, d_i and e_{i-1} must appear in the final list in position 2i-2 or 2i-1.
- Let's prove that they end up in the right places:
  - if each is larger than 2i-3 elements,
  - and if each is smaller than 4n-2i+1 elements,
  - then each is either in position 2i-2 or 2i-1,
  - and the comparison between the two sends each to the correct place.
- So we must show that:
  - d_i is larger than 2i-3 elements;
  - e_{i-1} is larger than 2i-3 elements;
  - d_i is smaller than 4n-2i+1 elements;
  - e_{i-1} is smaller than 4n-2i+1 elements.

Proof (cont'd)

- d_i is larger than 2i-3 elements:
  - Assume that d_i belongs to the list (a_j)_{j=1..2n}.
  - Let k be the number of elements of {d_1, d_2, ..., d_i} that belong to the (a_j) list.
  - Then d_i = a_{2k-1}, and d_i is larger than 2k-2 elements of the (a_j) list.
  - There are i-k elements from the (b_j) list in {d_1, d_2, ..., d_{i-1}}, the largest of which is b_{2(i-k)-1}. Therefore d_i is larger than 2(i-k)-1 elements of the (b_j) list.
  - Therefore, d_i is larger than (2k-2) + (2(i-k)-1) = 2i-3 elements.
- The proof is similar if d_i belongs to the (b_j) list.
- Similar proofs establish the other three properties.

Construction of merge_m

[Figure: recursive construction implementing the theorem. The odd-indexed inputs a1, a3, ... and b1, b3, ... feed one copy of merge_{m-1}, producing d1, d2, ...; the even-indexed inputs a2, a4, ... and b2, b4, ... feed a second copy, producing e1, e2, ...; a final row of comparators on the pairs (d_{i+1}, e_i) completes the merge.]
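The recursive construction can be transcribed as sequential Python (a sketch of ours, not from the slides; each min/max pair corresponds to one comparator of the network):

```python
def odd_even_merge(a, b):
    """Merge two sorted lists of equal power-of-two length, following
    the recursive construction: merge the odd-indexed and even-indexed
    sublists with two copies of merge_{m-1}, then one row of comparators."""
    if len(a) == 1:                        # merge_0: a single comparator
        return [min(a[0], b[0]), max(a[0], b[0])]
    d = odd_even_merge(a[0::2], b[0::2])   # odd-placed elements (a1, a3, ...)
    e = odd_even_merge(a[1::2], b[1::2])   # even-placed elements (a2, a4, ...)
    out = [d[0]]
    for i in range(1, len(d)):             # final row of 2^m - 1 comparators
        out += [min(d[i], e[i - 1]), max(d[i], e[i - 1])]
    return out + [e[-1]]

print(odd_even_merge([1, 4, 6, 7], [2, 3, 5, 8]))
# -> [1, 2, 3, 4, 5, 6, 7, 8]
```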

Performance of merge_m

- Execution time is defined as the maximum number of comparators that one input must go through to produce the output.
  - t_m: time to go through merge_m;
  - p_m: number of comparators in merge_m.
- Two inductions:
  - t_0 = 1, t_1 = 2, t_m = t_{m-1} + 1, hence t_m = m + 1;
  - p_0 = 1, p_1 = 3, p_m = 2·p_{m-1} + 2^m - 1, hence p_m = 2^m·m + 1.
- These are easily deduced from the construction. In terms of n = 2^m: O(log n) time and O(n log n) comparators.
- Fast execution in O(log n), but poor efficiency:
  - sequential time with one comparator: n;
  - efficiency = n / (n · log n · log n) = 1 / (log n)^2;
  - comparators are not used efficiently, as each is used only once.
- The network could be used in pipelined mode, processing a series of lists, with all comparators used at each step and one result available at each step.

Parallel RAM Introduction  Sort network Sort network Pointer Jumping 2 3 Reducing the Number of Processors PRAM Model Hierarchy Conclusion merge

Combinatorial merge Networks

Merge Sort 1

0-1 Principle merge

Odd-Even 1 Transposition sort2 Sort FFT

sort2 2 merge Conclusion

Sort 1st half of the list 1 Sort 2nd half of the list Merge the results sort2 Recursively Courtesy of Henri Casanova 41 / 63 Theoretical Parallel Computing

A. Legrand Performance

- Execution time t'_m and number of comparators p'_m:
  - t'_1 = 1, t'_m = t'_{m-1} + t_{m-1} (hence t'_m = O(m^2))
  - p'_1 = 1, p'_m = 2·p'_{m-1} + p_{m-1} (hence p'_m = O(2^m·m^2))
- In terms of n = 2^m:
  - sort time: O((log n)^2)
  - number of comparators: O(n·(log n)^2)
- Poor performance given the number of comparators (unless used in pipeline mode)
  - Efficiency: T_seq / (p · T_par) = O(n log n / (n·(log n)^4)) = O((log n)^-3)
- There was a PRAM algorithm in O(log n). Is there a sorting network that achieves this?
  - Yes: the AKS network (Ajtai, Komlós, Szemerédi, 1983)
  - O(log n) time, O(n log n) comparators
  - But the constants are SO large that it is impractical.
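As a sanity check, the recurrences and closed forms quoted on these slides can be verified numerically. A minimal sketch (the function names are ours):

```python
def merge_costs(m):
    """Depth t_m and comparator count p_m of merge_m, from the
    recurrences t_m = t_{m-1} + 1 and p_m = 2*p_{m-1} + 2^m - 1."""
    t, p = 1, 1                      # t_0 = 1, p_0 = 1
    for i in range(1, m + 1):
        t, p = t + 1, 2 * p + 2**i - 1
    return t, p

def sort_costs(m):
    """Depth t'_m and comparator count p'_m of the sorting network for
    n = 2^m inputs: t'_m = t'_{m-1} + t_{m-1}, p'_m = 2*p'_{m-1} + p_{m-1}."""
    tp, pp = 1, 1                    # t'_1 = 1, p'_1 = 1 (sort_2 is one comparator)
    for i in range(2, m + 1):
        t_prev, p_prev = merge_costs(i - 1)
        tp, pp = tp + t_prev, 2 * pp + p_prev
    return tp, pp

# Closed forms from the slides: t_m = m + 1 and p_m = 2^m * m + 1
for m in range(12):
    assert merge_costs(m) == (m + 1, 2**m * m + 1)

print(sort_costs(3))  # → (6, 19)
```

For n = 8 (m = 3) this gives depth 6 and 19 comparators, matching the known figures for Batcher's odd-even merge sort on 8 inputs.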

42 / 63

0-1 Principle

- Theorem: a network of comparators implements sorting correctly if and only if it implements it correctly for lists that consist solely of 0's and 1's.
- This theorem makes proofs of results like the "merge theorem" much simpler; in general, one works only with lists of 0's and 1's when dealing with sorting networks.
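The 0-1 principle also gives a cheap way to machine-check a small network: instead of testing all n! input orders, it suffices to test the 2^n binary inputs. A minimal sketch (the encoding of a network as a list of (i, j) wire pairs is our own):

```python
from itertools import product

def apply_network(network, values):
    """Apply a comparator network, given as a sequence of (i, j) wire
    pairs with i < j: each comparator routes min to i and max to j."""
    v = list(values)
    for i, j in network:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v

def sorts_correctly(network, n):
    """By the 0-1 principle, testing the 2^n binary inputs suffices."""
    return all(apply_network(network, bits) == sorted(bits)
               for bits in product([0, 1], repeat=n))

# The odd-even transposition network for n = 4 (0-indexed wires):
# phases (0,1),(2,3) and (1,2) alternating, n phases in total.
net4 = [(0, 1), (2, 3), (1, 2), (0, 1), (2, 3), (1, 2)]
assert sorts_correctly(net4, 4)
```

Dropping the last phase from net4 makes sorts_correctly fail (e.g. on input 1 1 0 0), which matches the n-phase lower bound discussed below.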

43 / 63

Another (simpler) sorting network

- Sort by odd-even transposition.
- The network is built to sort a list of n = 2p elements:
  - p copies of a 2-row network
  - the first row contains p comparators that take elements 2i-1 and 2i, for i = 1, ..., p
  - the second row contains p-1 comparators that take elements 2i and 2i+1, for i = 1, ..., p-1
  - for a total of n(n-1)/2 comparators
- A similar construction works when n is odd.
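The network transcribes directly into a sequential simulation; within each phase the comparators touch disjoint pairs, so a real implementation could run them in parallel. A sketch (the function name is ours):

```python
def odd_even_transposition_sort(a):
    """Simulate the odd-even transposition network: n alternating
    phases of disjoint compare-exchange operations."""
    a = list(a)
    n = len(a)
    for phase in range(n):
        start = 0 if phase % 2 == 0 else 1   # pairs (0,1),(2,3),... then (1,2),(3,4),...
        for i in range(start, n - 1, 2):     # disjoint pairs: parallelizable
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([8, 3, 12, 10, 16, 5]))  # → [3, 5, 8, 10, 12, 16]
```

The same loop works unchanged for odd n, mirroring the "similar construction" remark above.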

44 / 63

Odd-even transposition network

[Figure: the odd-even transposition networks for n = 8 and for n = 7.]

45 / 63

Proof of correctness

- To prove that the previous network sorts correctly:
  - a rather complex induction, made simple by the 0-1 principle
- Let (a_i)_{i=1,...,n} be a list of 0's and 1's to sort.
- Let k be the number of 1's in that list, and j_0 the position of the last 1.

  Example:  1 1 0 1 0 0 0   (k = 3, j_0 = 4)

- Note that a 1 never "moves" to the left (this is why using the 0-1 principle makes this proof easy).
- Let's follow the last 1. If j_0 is even, it does not move in the first step, but moves to the right at the next step. If j_0 is odd, it moves to the right in the first step. In all cases, it moves to the right at the 2nd step, and at every step after that, until it reaches the nth position. Since the last 1 is in position 2 at least after the first step, it reaches position n after at most n-1 steps.

46 / 63

Proof (continued)

- Let's follow the next-to-last 1, starting in position j: since the last 1 moves to the right from step 2 at the latest, the next-to-last 1 is never "blocked". At step 3, and at all following steps, the next-to-last 1 moves to the right, until it arrives in position n-1.
- Generally, the ith 1, counting from the right, moves right during step i+1 and keeps moving until it reaches position n-i+1.
- This goes on up to the kth 1, which ends in position n-k+1.
- At the end we have the n-k 0's followed by the k 1's.
- Therefore we have sorted the list.

47 / 63

Example for n=6

Input:            1 0 1 0 1 0
After step 1:     0 1 0 1 0 1
After step 2:     0 0 1 0 1 1
After step 3:     0 0 0 1 1 1
After steps 4-6:  0 0 0 1 1 1   (redundant steps)

48 / 63

Performance

- Compute time: t_n = n
- Number of comparators: p_n = n(n-1)/2
- Efficiency: O(n log n / (n(n-1)/2 · n)) = O(log n / n^2)
  - really, really, really poor
  - but at least it's a simple network
- Is there a sorting network with good, practical performance and efficiency?
  - Not really.
  - But one can use the principle of a sorting network to come up with a good algorithm on a linear network of processors.

49 / 63

Sorting on a linear array

- Consider a linear array of p general-purpose processors:

    P1 -- P2 -- P3 -- ... -- Pp

- Consider a list of n elements to sort (with n divisible by p for simplicity).
- Idea: use the odd-even transposition network and sort of "fold" it onto the linear array.

50 / 63

Principle

- Each processor receives a sub-part of the list to sort, i.e. n/p elements.
- Each processor sorts its sub-list locally, in parallel.
- There are then p steps of alternating exchanges, as in the odd-even transposition sorting network:
  - exchanges involve full sub-lists, not just single elements
  - when two processors communicate, their two lists are merged
  - the left processor keeps the left half of the merged list
  - the right processor keeps the right half of the merged list
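This scheme can be simulated sequentially, using the slides' 18-element example as a check. A sketch (the function name is ours; a single sorted() over two sub-lists stands in for the merge-split exchange between neighbors):

```python
def parallel_odd_even_sort(sublists):
    """sublists[i] is the data held by processor i (equal sizes assumed).
    Local sorts, then p phases of merge-split with alternating neighbors."""
    p = len(sublists)
    chunk = len(sublists[0])
    lists = [sorted(s) for s in sublists]        # local sorts, in parallel
    for phase in range(p):
        start = 0 if phase % 2 == 0 else 1       # alternate the paired neighbors
        for i in range(start, p - 1, 2):         # each pair works in parallel
            merged = sorted(lists[i] + lists[i + 1])        # merge-split step
            lists[i], lists[i + 1] = merged[:chunk], merged[chunk:]
    return lists

# The 18-element example from the slides, on p = 6 processors
init = [[8, 3, 12], [10, 16, 5], [2, 18, 9],
        [17, 15, 4], [1, 6, 13], [11, 7, 14]]
print(parallel_odd_even_sort(init))
# → [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]]
```

The intermediate states of this simulation reproduce, phase by phase, the table on the next slide.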

51 / 63

Example

            P1          P2          P3          P4          P5          P6
init        {8,3,12}    {10,16,5}   {2,18,9}    {17,15,4}   {1,6,13}    {11,7,14}
local sort  {3,8,12}    {5,10,16}   {2,9,18}    {4,15,17}   {1,6,13}    {7,11,14}
odd         {3,5,8}     {10,12,16}  {2,4,9}     {15,17,18}  {1,6,7}     {11,13,14}
even        {3,5,8}     {2,4,9}     {10,12,16}  {1,6,7}     {15,17,18}  {11,13,14}
odd         {2,3,4}     {5,8,9}     {1,6,7}     {10,12,16}  {11,13,14}  {15,17,18}
even        {2,3,4}     {1,5,6}     {7,8,9}     {10,11,12}  {13,14,16}  {15,17,18}
odd         {1,2,3}     {4,5,6}     {7,8,9}     {10,11,12}  {13,14,15}  {16,17,18}
even        {1,2,3}     {4,5,6}     {7,8,9}     {10,11,12}  {13,14,15}  {16,17,18}

Same pattern as the sorting network.

52 / 63

Performance

- Local sort: O(n/p · log(n/p)) = O(n/p · log n)
- Each step costs one merge of two lists of n/p elements: O(n/p)
- There are p such steps, hence O(n)
- Total: O(n/p · log n + n)
- If p = log n: O(n)
- The algorithm is optimal for p ≤ log n.
- More information on sorting networks: D. Knuth, The Art of Computer Programming, volume 3: Sorting and Searching, Addison-Wesley (1973).

53 / 63

FFT circuit - what's an FFT?

- Fourier Transform (FT): a "tool" to decompose a function into sinusoids of different frequencies, which sum to the original function
  - useful in signal processing, linear system analysis, quantum physics, image processing, etc.
- Discrete Fourier Transform (DFT): works on a discrete sample of function values
  - in many domains, nothing is truly continuous or continuously measured
- Fast Fourier Transform (FFT): an algorithm to compute a DFT, proposed initially by Cooley and Tukey in 1965, which reduces the number of computations from O(n^2) to O(n log n)

54 / 63

How to compute a DFT

- Given a sequence of numbers {a_0, ..., a_{n-1}}, its DFT is defined as the sequence {b_0, ..., b_{n-1}}, where

    b_j = Σ_{k=0}^{n-1} a_k · ω_n^{jk}

  (a polynomial evaluation: b_j = A(ω_n^j) for A(x) = Σ_k a_k x^k), with ω_n a primitive root of 1, i.e., ω_n^n = 1 and ω_n^k ≠ 1 for 0 < k < n (for instance ω_n = e^{2πi/n}).
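The definition transcribes directly into code. A sketch using Python's cmath (the sign of the exponent in ω_n is a convention that varies across references):

```python
import cmath

def naive_dft(a):
    """b_j = sum_k a_k * omega_n^(j*k), with omega_n = e^(2*pi*i/n).
    Costs n^2 complex multiply-adds."""
    n = len(a)
    omega = cmath.exp(2j * cmath.pi / n)
    return [sum(a[k] * omega ** (j * k) for k in range(n))
            for j in range(n)]

b = naive_dft([1, 0, 0, 0])
# The DFT of an impulse is flat: every output equals 1
assert all(abs(x - 1) < 1e-12 for x in b)
```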

55 / 63

The FFT Algorithm

- A naive algorithm would require n^2 complex additions and multiplications, which is not practical as typically n is very large.
- Let n = 2^s. Split the polynomial A(x) = Σ_{k=0}^{n-1} a_k x^k into its even and odd parts:

    A(x) = A_even(x^2) + x · A_odd(x^2)

  where A_even collects the even-indexed coefficients a_0, a_2, ..., a_{n-2} and A_odd the odd-indexed ones a_1, a_3, ..., a_{n-1}. The size-(n/2) DFT of the even coefficients gives the values u_j; that of the odd coefficients gives the values v_j.

56 / 63

The FFT Algorithm

- Therefore, evaluating the polynomial A at ω_n^0, ω_n^1, ..., ω_n^{n-1} can be reduced to:
  1. Evaluate the two polynomials A_even and A_odd at the points (ω_n^j)^2, j = 0, ..., n-1
  2. Compute b_j = u_{j mod (n/2)} + ω_n^j · v_{j mod (n/2)}
- BUT: the list of points (ω_n^{2j})_{j=0,...,n-1} really contains only n/2 distinct elements!!!

57 / 63

The FFT Algorithm

- As a result, the original problem of size n (that is, n polynomial evaluations) has been reduced to 2 problems of size n/2 (that is, n/2 polynomial evaluations each).

  FFT(in A, out B)
    if n = 1
      b_0 ← a_0
    else
      FFT(a_0, a_2, ..., a_{n-2}; u_0, u_1, ..., u_{n/2-1})
      FFT(a_1, a_3, ..., a_{n-1}; v_0, v_1, ..., v_{n/2-1})
      for j = 0 to n-1
        b_j ← u_{j mod (n/2)} + ω_n^j · v_{j mod (n/2)}
      end for
    end if
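The pseudocode transcribes almost line for line into Python. A sketch, checked against the O(n^2) definition (same ω convention as above):

```python
import cmath

def fft(a):
    """Recursive radix-2 FFT; len(a) must be a power of 2."""
    n = len(a)
    if n == 1:
        return list(a)
    u = fft(a[0::2])                 # FFT of even-indexed coefficients
    v = fft(a[1::2])                 # FFT of odd-indexed coefficients
    omega = cmath.exp(2j * cmath.pi / n)
    return [u[j % (n // 2)] + omega ** j * v[j % (n // 2)]
            for j in range(n)]

def dft_direct(a):
    """The n^2 definition, for checking."""
    n = len(a)
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(a[k] * w ** (j * k) for k in range(n)) for j in range(n)]

a = [1, 2, 3, 4, 0, 0, 0, 0]
assert all(abs(x - y) < 1e-9 for x, y in zip(fft(a), dft_direct(a)))
```

The two recursive calls are independent, and so are the n iterations of the final list comprehension, which is exactly the parallelism exploited on the following slides.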

58 / 63

Performance of the FFT

- t(n): running time of the algorithm
- t(n) = d·n + 2·t(n/2), where d is some constant
- Hence t(n) = O(n log n)
- How to do this in parallel?
  - both recursive FFT computations can be done independently
  - then all iterations of the for loop are also independent

59 / 63

FFT_n Circuit

[Figure: the FFT_n circuit. One FFT_{n/2} box produces u_0, ..., u_{n/2-1} from the even-indexed inputs, a second produces v_0, ..., v_{n/2-1} from the odd-indexed inputs, and a final column of multiply-add units combines them with the factors ω_n^0, ..., ω_n^{n-1} to produce b_0, ..., b_{n-1}.]

60 / 63

Performance

- Number of elements:
  - width of O(n)
  - depth of O(log n)
  - therefore O(n log n) units
- Running time:
  - t(n) = t(n/2) + 1
  - therefore O(log n)
- Efficiency:
  - O(1/log n)
- One can decide which part of this circuit should be mapped to a real parallel platform, for instance.

61 / 63

Conclusion

- We could teach an entire semester of theoretical parallel computing.
- Most people just happily ignore it.
- But:
  - it's the source of most fundamental ideas
  - it's a source of inspiration for algorithms
  - it's a source of inspiration for implementations: DSPs

63 / 63