Theoretical Parallel Computing
Arnaud Legrand, CNRS, University of Grenoble
LIG laboratory, [email protected]

November 1, 2009
Outline

1 Parallel RAM
    Introduction
    Pointer Jumping
    Reducing the Number of Processors
    PRAM Model Hierarchy
    Conclusion
2 Combinatorial Networks
    Merge Sort
    0-1 Principle
    Odd-Even Transposition Sort
    FFT
3 Conclusion
Models for Parallel Computation

We have seen how to implement parallel algorithms in practice, and we have come up with performance analyses.
In traditional algorithm complexity work, the Turing machine makes it possible to precisely compare algorithms, establish precise notions of complexity, etc.
Can we do something like this for parallel computing? Parallel machines are complex, with many hardware characteristics that are difficult to take into account for algorithm work (e.g., the network). Is it hopeless?
The most famous theoretical model of parallel computing is the PRAM model.
We will see that many principles in the model are really at the heart of the more applied things we've seen so far.

Courtesy of Henri Casanova
The PRAM Model

Parallel Random Access Machine (PRAM).
An imperfect model that only tangentially relates to the performance of a real parallel machine, just like a Turing machine only tangentially relates to the performance of a real computer.
Goal: make it possible to reason about and classify parallel algorithms, and to obtain complexity results (optimality, minimal complexity results, etc.).
One way to look at it: the model makes it possible to determine the "maximum parallelism" in an algorithm or a problem, and to devise new algorithms.
The PRAM Model

Memory size is infinite, and the number of processors is unbounded (but nothing prevents you from "folding" this back to something more realistic).
No direct communication between processors: they communicate via the memory, and they can operate in an asynchronous fashion.
Every processor accesses any memory location in 1 cycle.
Typically all processors execute the same algorithm in a synchronous fashion: READ phase, COMPUTE phase, WRITE phase.
Some subset of the processors can stay idle (e.g., even-numbered processors may not work while odd-numbered processors do, and conversely).

[Figure: processors P1, P2, P3, ..., PN all connected to a single shared memory]
Memory Access in PRAM

Exclusive Read (ER): p processors can simultaneously read the content of p distinct memory locations.
Concurrent Read (CR): p processors can simultaneously read the content of p' memory locations, where p' < p.
Exclusive Write (EW): p processors can simultaneously write the content of p distinct memory locations.
Concurrent Write (CW): p processors can simultaneously write the content of p' memory locations, where p' < p.
PRAM CW?

What ends up being stored when multiple writes occur?
    Priority CW: processors are assigned priorities, and the top-priority processor is the one that counts for each group write.
    Fail common CW: if the values are not equal, no change.
    Collision common CW: if the values are not equal, write a "failure value".
    Fail-safe common CW: if the values are not equal, the algorithm aborts.
    Random CW: nondeterministic choice of the value written.
    Combining CW: write the sum, average, max, min, etc., of the values.
    etc.
The above means that when you write an algorithm for a CW PRAM you can use any of the above at different points in time.
CW doesn't correspond to any hardware in existence; it is just a logical/algorithmic notion that could be implemented in software.
In fact, most algorithms end up not needing CW.
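The write-resolution policies above can be sketched in plain Python. This is our own illustrative helper (the name `resolve_cw` and its interface are assumptions, not from the slides): it takes the set of values concurrently written to one location and applies one policy.

```python
# Hypothetical sketch: resolving a set of concurrent writes to a single
# memory location under a few of the CW policies listed above.
def resolve_cw(values, policy, priorities=None):
    """values: the values concurrently written to the same location."""
    if policy == "priority":
        # The processor with the highest priority (lowest number here) wins.
        winner = min(range(len(values)), key=lambda i: priorities[i])
        return values[winner]
    if policy == "common":
        # Fail common CW: the write succeeds only if all values agree.
        return values[0] if all(v == values[0] for v in values) else None
    if policy == "combining-sum":
        return sum(values)
    if policy == "combining-max":
        return max(values)
    raise ValueError("unknown policy")

print(resolve_cw([3, 3, 3], "common"))      # 3
print(resolve_cw([1, 2], "combining-sum"))  # 3
```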
Classic PRAM Models

CREW (concurrent read, exclusive write): most commonly used.
CRCW (concurrent read, concurrent write): most powerful.
EREW (exclusive read, exclusive write): most restrictive, but unfortunately probably the most realistic.
Theorems exist that prove the relative power of the above models (more on this later).
PRAM Example 1

Problem: we have a linked list of length n. For each element i, compute its distance to the end of the list:
    d[i] = 0 if next[i] = NIL
    d[i] = d[next[i]] + 1 otherwise
The sequential algorithm runs in O(n).
We can define a PRAM algorithm in O(log n):
    associate one processor to each element of the list
    at each iteration, split the list in two, with odd-placed and even-placed elements in different lists
PRAM Example 1

Principle ("pointer jumping"):
    Look at the next element
    Add its d value to yours
    Point to the next element's next element

    1  1  1  1  1  0
    2  2  2  2  1  0
    4  4  3  2  1  0
    5  4  3  2  1  0

The length of each list is halved at each step, hence the O(log n) complexity.
PRAM Example 1

Algorithm:

    forall i
        if next[i] == NIL then d[i] ← 0 else d[i] ← 1
    while there is an i such that next[i] ≠ NIL
        forall i
            if next[i] ≠ NIL then
                d[i] ← d[i] + d[next[i]]
                next[i] ← next[next[i]]

What about the correctness of this algorithm?
forall loop

At each step, the updates must be synchronized so that pointers point to the right things:
    next[i] ← next[next[i]]
This is ensured by the semantics of forall:

    forall i              forall i
        A[i] = B[i]   ≡       tmp[i] = B[i]
                          forall i
                              A[i] = tmp[i]

Nobody really writes it out, but one mustn't forget that this is what really happens underneath.
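The double-buffered forall semantics can be made concrete with a small Python sketch (ours, not from the slides; the helper name `forall_assign` is an assumption): all right-hand sides are read before any left-hand side is written.

```python
# Sketch of the synchronous forall semantics: every RHS is evaluated against
# the OLD array before any element is overwritten.
def forall_assign(A, f):
    """Simulate `forall i: A[i] = f(i)` with PRAM (synchronous) semantics."""
    tmp = [f(i) for i in range(len(A))]  # READ phase: snapshot all RHS values
    for i in range(len(A)):              # WRITE phase: commit them together
        A[i] = tmp[i]

A = [10, 20, 30, 40]
# A naive sequential loop `for i: A[i] = A[(i+1) % n]` would propagate
# already-updated values and end with [20, 30, 40, 20].
forall_assign(A, lambda i: A[(i + 1) % len(A)])
print(A)  # [20, 30, 40, 10]
```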
while condition

    while there is an i such that next[i] ≠ NIL
How can one do such a global test on a PRAM?
It cannot be done in constant time unless the PRAM is CRCW:
    At the end of each step, each processor could write TRUE or FALSE to the same memory location depending on whether its next[i] is NIL, and one can then take the AND of all values (to resolve the concurrent writes).
On a CREW PRAM, one needs O(log n) steps for such a global test.
In this case, one can just rewrite the while loop into a for loop, because we have analyzed the way in which the iterations go:
    for step = 1 to log n
What type of PRAM?

The previous algorithm does not require a CW machine, but:
    tmp[i] ← d[i] + d[next[i]]
requires concurrent reads, on processors i and j such that j = next[i].
Solution: split it into two instructions:
    tmp2[i] ← d[i]
    tmp[i] ← tmp2[i] + d[next[i]]
(note that the above are technically in two different forall loops)
Now we have an execution that works on an EREW PRAM, which is the most restrictive type.
Final Algorithm on an EREW PRAM

    forall i                                             O(1)
        if next[i] == NIL then d[i] ← 0 else d[i] ← 1
    for step = 1 to log n                                O(log n) iterations,
        forall i                                         each in O(1)
            if next[i] ≠ NIL then
                tmp[i] ← d[i]
                d[i] ← tmp[i] + d[next[i]]
                next[i] ← next[next[i]]

Conclusion: one can compute the length of a list of size n in time O(log n) on any PRAM.
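The EREW algorithm above can be simulated sequentially in Python, copying the arrays at each step to reproduce the synchronous forall semantics (a sketch of ours; `list_ranking` and the `None`-for-NIL encoding are assumptions):

```python
import math

def list_ranking(next_):
    """Sequentially simulate the EREW pointer-jumping algorithm.
    next_[i] is the index of i's successor, or None for the last element."""
    n = len(next_)
    nxt = list(next_)
    d = [0 if nxt[i] is None else 1 for i in range(n)]
    for _ in range(math.ceil(math.log2(n))):
        # One synchronous step: read all old values before writing new ones.
        new_d, new_nxt = list(d), list(nxt)
        for i in range(n):
            if nxt[i] is not None:
                new_d[i] = d[i] + d[nxt[i]]      # d[i] <- d[i] + d[next[i]]
                new_nxt[i] = nxt[nxt[i]]         # next[i] <- next[next[i]]
        d, nxt = new_d, new_nxt
    return d

# List stored as 0 -> 1 -> 2 -> 3 -> 4 -> 5: distances are 5, 4, 3, 2, 1, 0.
print(list_ranking([1, 2, 3, 4, 5, None]))  # [5, 4, 3, 2, 1, 0]
```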
Cost, Work

Let P be a problem of size n that we want to solve (e.g., a computation over an n-element list).
Let Tseq(n) be the execution time of the best (known) sequential algorithm for solving P.
Let us now consider a PRAM algorithm that solves P in time Tpar(p, n) with p PUs.

Definition: Cost and Work.
The cost of a PRAM algorithm is defined as
    Cp(n) = p · Tpar(p, n).
The work Wp(n) of a PRAM algorithm is the sum, over all PUs, of the number of performed operations. The difference between cost and work is that the work does not account for PU idle time.

Intuitively, the cost is the area of a rectangle of width Tpar(p, n) and height p. Therefore, the cost is minimal if, at each step, all PUs are used to perform useful computations, i.e., computations that are part of the sequential algorithm.
Speedup and Efficiency

The speedup of a PRAM algorithm is the factor by which the program's execution time is lower than that of the sequential program.

Definition: Speedup and Efficiency.
The speedup of a PRAM algorithm is defined as
    Sp(n) = Tseq(n) / Tpar(p, n).
The efficiency of the PRAM algorithm is defined as
    ep(n) = Sp(n) / p = Tseq(n) / (p · Tpar(p, n)).

Some authors use the following definition: Sp(n) = Tpar(1, n) / Tpar(p, n), where Tpar(1, n) is the execution time of the algorithm with a single PU. This definition quantifies how the algorithm scales: if the speedup is close to p (i.e., is Ω(p)), one says that the algorithm scales well. However, this definition does not reflect the quality of the parallelization (i.e., how much we can really gain by parallelization), and it is thus better to use Tseq(n) than Tpar(1, n).
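As a small worked instance of these definitions (an illustration of ours, using list ranking with unit constants: Tseq(n) = n and Tpar(n, n) = log n with p = n PUs):

```python
import math

def speedup(t_seq, t_par):
    return t_seq / t_par

def efficiency(t_seq, t_par, p):
    return t_seq / (p * t_par)

n = 1024
t_seq = n             # sequential list ranking: O(n), constant taken as 1
t_par = math.log2(n)  # PRAM pointer jumping: O(log n), with p = n PUs
print(speedup(t_seq, t_par))        # 102.4
print(efficiency(t_seq, t_par, n))  # 0.1
```

The speedup n / log n is large, but the efficiency 1 / log n shows that most of the n PUs sit idle most of the time: the cost n log n exceeds the work.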
Brent Theorem

Theorem 1: Brent.
Let A be an algorithm that executes a total number of m operations and that runs in time t on a PRAM (with some unspecified number of PUs). A can be simulated in time O(m/p + t) on a PRAM of the same type that contains p PUs.

Proof.
Say that at step i, A performs m(i) operations (implying that Σ_{i=1}^{t} m(i) = m). Step i can be simulated with p PUs in time ⌈m(i)/p⌉ ≤ m(i)/p + 1. One can simply sum these upper bounds to prove the theorem.
Brent Theorem

Theorem: Let A be an algorithm with m operations that runs in time t on some PRAM (with some number of processors). It is possible to simulate A in time O(t + m/p) on a PRAM of the same type with p processors.
Example: maximum of n elements on an EREW PRAM.
    Clearly can be done in O(log n) with O(n) processors: compute a series of pairwise maxima. The first step requires O(n/2) processors.
    What happens if we have fewer processors?
    By the theorem, with p processors one can simulate the same algorithm in time O(log n + n/p).
    If p = n / log n, then we can simulate the same algorithm in O(log n + log n) = O(log n) time, which has the same complexity!
This theorem is useful to obtain lower bounds on the number of processors that can still achieve a given complexity.
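Brent's simulation can be checked numerically with a sketch of ours: simulate the pairwise-max reduction when only p PUs are available, charging ⌈m(i)/p⌉ time units per round, and compare against the m/p + t bound.

```python
import math

def max_with_p_pus(values, p):
    """Simulate the pairwise-max reduction with only p PUs available:
    a round with m(i) pair-comparisons costs ceil(m(i)/p) time units."""
    time_units = 0
    while len(values) > 1:
        pairs = [(values[i], values[i + 1])
                 for i in range(0, len(values) - 1, 2)]
        time_units += math.ceil(len(pairs) / p)   # Brent: ceil(m(i)/p)
        reduced = [max(a, b) for a, b in pairs]
        if len(values) % 2 == 1:                  # odd leftover element
            reduced.append(values[-1])
        values = reduced
    return values[0], time_units

n = 1024
p = n // int(math.log2(n))  # p = n / log n processors
result, t = max_with_p_pus(list(range(n)), p)
print(result, t)  # the max is n-1; t stays within O(log n + n/p)
```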
Another Useful Theorem

Theorem 2.
Let A be an algorithm whose execution time is t on a PRAM with p PUs. A can be simulated on a PRAM of the same type with p' ≤ p PUs in time O(t·p/p'). The cost of the algorithm on the smaller PRAM is at most twice the cost on the larger PRAM.

Proof.
Each step of A can be simulated in at most ⌈p/p'⌉ time units with p' ≤ p PUs, by simply reducing concurrency and having the p' PUs perform sequences of operations. Since there are at most t steps, the execution time t' of the simulated algorithm is at most ⌈p/p'⌉·t. Therefore, t' = O((p/p')·t) = O(t·p/p').
We also have C_{p'} = t'·p' ≤ ⌈p/p'⌉·p'·t ≤ (p/p' + 1)·p'·t = p·t·(1 + p'/p) = Cp·(1 + p'/p) ≤ 2·Cp.
Another useful theorem

Theorem: Let A be an algorithm that executes in time t on a PRAM with p processors. One can simulate A on a PRAM with p' processors in time O(t·p/p').
This makes it possible to think of the "folding" we talked about earlier, by which one goes from an unbounded number of processors to a bounded number. For instance, a running time of A·n·log n + B on n processors becomes:
    A·n² + B·n/log n on log n processors
    A·(n²·log n)/10 + B·n/10 on 10 processors
Are all PRAMs equivalent?

Consider the following problem: given an array of n distinct elements (ei)i=1..n, find whether some element e is in the array.
On a CREW PRAM, there is an algorithm that works in time O(1) with n processors:
    initialize a boolean to FALSE
    each processor i reads ei and e and compares them
    if they are equal, it writes TRUE into the boolean (only one processor will write, so we're OK for CREW)
On an EREW PRAM, one cannot do better than O(log n), because each processor must read e separately:
    at worst a complexity of O(n), with sequential reads
    at best a complexity of O(log n), with a series of "doublings" of the value at each step so that eventually everybody has a copy (just like a broadcast in a binary tree, or in fact a k-ary tree for some constant k)
Generally, "diffusion of information" to n processors on an EREW PRAM takes O(log n).
Conclusion: CREW PRAMs are more powerful than EREW PRAMs.
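The doubling broadcast can be sketched as follows (our own simulation; `erew_broadcast` is an assumed name). Each round, every processor that already holds the value copies it to one distinct processor, so reads and writes stay exclusive:

```python
def erew_broadcast(value, n):
    """Simulate the doubling broadcast of `value` to n EREW processors.
    Each round, the k processors holding the value copy it to k distinct
    new processors: no memory cell is read or written concurrently."""
    cells = [None] * n
    cells[0] = value
    rounds, holders = 0, 1
    while holders < n:
        for src in range(holders):
            dst = holders + src
            if dst < n:
                cells[dst] = cells[src]  # each cell touched by one PU only
        holders = min(2 * holders, n)
        rounds += 1
    return cells, rounds

cells, rounds = erew_broadcast(42, 10)
print(rounds)  # 4 = ceil(log2(10)) rounds, i.e. O(log n)
```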
CRCW > CREW

O(1) on a CRCW
Ω(log n) on a CREW
Therefore, we have CRCW > CREW > EREW (with at least a logarithmic factor each time).
Simulation Theorem

Simulation theorem: any algorithm running on a CRCW PRAM with p processors cannot be more than O(log p) times faster than the best algorithm on an EREW PRAM with p processors for the same problem.
Proof: "simulate" the concurrent writes; each step of the algorithm on the CRCW PRAM is simulated as O(log p) steps on an EREW PRAM.
    When Pi writes value xi to address li, one replaces the write by an (exclusive) write of (li, xi) to A[i], where A is an auxiliary array with one slot per processor.
    Then one sorts array A by the first component of its content.
    Processor i of the EREW PRAM looks at A[i] and A[i-1]; if the first components differ, or if i = 0, it writes the value stored in A[i] to the corresponding address.
    Since A is sorted according to the first component, the writing is exclusive.
Proof (continued)

Picking one processor for each competing write (here P0, P1, P5 all write 12 to address 8; P2, P3 write 43 to address 29; P4 writes 26 to address 92):

    Before sorting          After sorting A by address
    P0 → (29,43) = A[0]     A[0] = (8,12)    P0 writes 12 to address 8
    P1 → (8,12)  = A[1]     A[1] = (8,12)    P1 nothing
    P2 → (29,43) = A[2]     A[2] = (29,43)   P2 writes 43 to address 29
    P3 → (29,43) = A[3]     A[3] = (29,43)   P3 nothing
    P4 → (92,26) = A[4]     A[4] = (29,43)   P4 nothing
    P5 → (8,12)  = A[5]     A[5] = (92,26)   P5 writes 26 to address 92
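One simulated CRCW write step can be coded directly from this scheme (a sequential sketch of ours; on a real EREW PRAM the sort itself would be a parallel O(log p) sort):

```python
def simulate_crcw_write(pairs, memory):
    """Simulate one CRCW write step using exclusive writes only.
    pairs[i] = (address, value) that processor i wants to write."""
    A = list(pairs)                 # exclusive write of (li, xi) into A[i]
    A.sort(key=lambda av: av[0])    # sort A by address (Cole's sort on a PRAM)
    for i, (addr, val) in enumerate(A):
        # Only the first holder of each address performs the write.
        if i == 0 or A[i - 1][0] != addr:
            memory[addr] = val
    return memory

mem = {}
simulate_crcw_write(
    [(29, 43), (8, 12), (29, 43), (29, 43), (92, 26), (8, 12)], mem)
print(mem)  # {8: 12, 29: 43, 92: 26}
```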
Proof (continued)

Note that we said that we just sort array A. If we have an algorithm that sorts p elements with O(p) processors in O(log p) time, we're set.
It turns out there is such an algorithm: Cole's algorithm, basically a merge sort in which lists are merged in constant time!
It's beautiful, but we don't really have time for it, and it's rather complicated.
Therefore, the proof is complete.
And many, many more things

J. Reif (editor), Synthesis of Parallel Algorithms, Morgan Kaufmann, 1993: everything you've ever wanted to know about PRAMs.
Every now and then, there are references to PRAMs in the literature:
    "by the way, this can be done in O(x) on an XRXW PRAM"
    "this network can simulate an EREW PRAM, and thus we know a bunch of useful algorithms (and their complexities) that we can instantly implement"
    etc.
You probably will never care if all you do is hack MPI and OpenMP code.

Courtesy of Henri Casanova
Relevance of the PRAM model

Central question in complexity theory for sequential algorithms: P = NP?
Central question in complexity theory for parallel algorithms: P = NC?
NC is the class of all problems that, with a polynomial number of PUs, can be solved in polylogarithmic time: a problem of size n is in NC if it can be solved in O(log(n)^c) time with O(n^k) PUs, where c and k are constants.
Nice theoretical complexity considerations, huh?
Skeptical readers may question the relevance of the PRAM model for practical implementation purposes:
    You can't have n = 100,000,000 PUs.
    One never has an immediately addressable, unbounded shared parallel memory.
Relevance of the PRAM model

This criticism, similar to the criticism of O(n^17) "polynomial" time algorithms, often comes from a misunderstanding of the role of theory:
    Theory is not everything: theoretical results are not to be taken as is and implemented by engineers.
    Theory is also not nothing: the fact that an algorithm cannot in general be implemented as is does not mean it is meaningless.

Deep understanding
When an O(n^17) algorithm is designed for a problem, it does not lead to a practical way to solve that problem. However, it proves something inherent about the problem (namely, that it is in P). Hopefully, in the process of proving this result, key insights may be developed that can later be used in practical algorithms for this or other problems, or for other theoretical results.
Similarly, the design of PRAM algorithms for a problem proves something inherent to the problem (namely, that it is parallelizable) and can in turn lead to new ideas.
Relevance of the PRAM model

Candidate for Practical Implementation
Even if communications are not taken into account in the performance evaluation of PRAM algorithms (a potentially considerable discrepancy between theoretical complexity and practical execution time), trying to design fast PRAM algorithms is not a useless pursuit.
Indeed, PRAM algorithms can be simulated on other models and do not necessarily incur prohibitive communication overheads. It is commonly admitted that only cost-optimal PRAM algorithms have potential practical relevance, and that the most promising are those cost-optimal algorithms with minimal execution time.
Combinational circuits/networks

More realistic than PRAMs, but more restricted.
Algorithms for combinational circuits were among the first parallel algorithms developed.
Understanding how they work makes it easier to learn more complex parallel algorithms.
Many combinational circuit algorithms provide the basis for algorithms for other models (they are good building blocks).
We're going to look at:
    sorting networks
    the FFT circuit
Sorting Networks

Goal: sort lists of numbers.
Main principle: computing elements (comparators) take two numbers as input and sort them:

    a ──► min(a,b)
    b ──► max(a,b)

We arrange them in a network, and we look for an architecture that depends only on the size of the lists to be sorted, not on the values of the elements.
Mergesort on a sorting network

First, build a network to merge two sorted lists. Some notation:
    (c1, c2, ..., cn): a list of numbers
    sort(c1, c2, ..., cn): the same list, sorted
    sorted(x1, x2, ..., xn): true if the list is sorted
    if sorted(a1, ..., an) and sorted(b1, ..., bn), then merge((a1, ..., an), (b1, ..., bn)) = sort(a1, ..., an, b1, ..., bn)
We're going to build a network, merge_m, that merges two sorted lists of 2^m elements each.

m = 0: a single comparator:
    a1, b1 → min(a1, b1), max(a1, b1)
m = 1: three comparators:
    a1, a2, b1, b2 → min(a1, b1), min(max(a1, b1), min(a2, b2)), max(max(a1, b1), min(a2, b2)), max(a2, b2)
What about for m = 2?

[Figure: merge_2 network; inputs a1, a3, b1, b3 feed one copy of merge_1, inputs a2, a4, b2, b4 feed the other, followed by a final row of comparators]

Why does this work? To build merge_m one uses:
    2 copies of the merge_{m-1} network
    1 row of 2^m - 1 comparators
The first copy of merge_{m-1} merges the odd-indexed elements, the second copy merges the even-indexed elements. The row of comparators completes the global merge, which is quite a miracle really.
Theorem to build merge_m

Given sorted(a1, ..., a_{2n}) and sorted(b1, ..., b_{2n}), let
    (d1, ..., d_{2n}) = merge((a1, a3, ..., a_{2n-1}), (b1, b3, ..., b_{2n-1}))
    (e1, ..., e_{2n}) = merge((a2, a4, ..., a_{2n}), (b2, b4, ..., b_{2n}))
Then
    sorted(d1, min(d2, e1), max(d2, e1), ..., min(d_{2n}, e_{2n-1}), max(d_{2n}, e_{2n-1}), e_{2n})
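The theorem translates directly into a recursive merge procedure (a Python sketch of ours, for power-of-two lengths; `oe_merge` is an assumed name):

```python
def comparator(a, b):
    """One comparator: outputs (min, max)."""
    return (a, b) if a <= b else (b, a)

def oe_merge(A, B):
    """Merge two sorted lists of equal power-of-two length using the
    odd-even construction of the theorem above."""
    if len(A) == 1:
        return list(comparator(A[0], B[0]))
    d = oe_merge(A[0::2], B[0::2])  # merge odd-indexed elements (a1, a3, ...)
    e = oe_merge(A[1::2], B[1::2])  # merge even-indexed elements (a2, a4, ...)
    out = [d[0]]
    for i in range(1, len(d)):      # final row of comparators: (d_{i+1}, e_i)
        lo, hi = comparator(d[i], e[i - 1])
        out += [lo, hi]
    out.append(e[-1])
    return out

print(oe_merge([1, 4, 6, 7], [2, 3, 5, 8]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```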
Proof

Assume all elements are distinct.
d1 is indeed the first element, and e_{2n} is the last element of the global sorted list.
For 1 < i ≤ 2n, d_i and e_{i-1} must appear in the final list in position 2i-2 or 2i-1. Let's prove that they are at the right place:
    if each is larger than 2i-3 elements,
    and if each is smaller than 4n-2i+1 elements,
    then each is in position 2i-2 or 2i-1,
    and the comparison between the two makes each go to the correct place.
So we must show that:
    d_i is larger than 2i-3 elements
    e_{i-1} is larger than 2i-3 elements
    d_i is smaller than 4n-2i+1 elements
    e_{i-1} is smaller than 4n-2i+1 elements
Proof (cont'd)

d_i is larger than 2i-3 elements:
    Assume that d_i belongs to the (a_j)j=1..2n list.
    Let k be the number of elements in {d1, d2, ..., d_i} that belong to the (a_j)j=1..2n list.
    Then d_i = a_{2k-1}, and d_i is larger than 2k-2 elements of the a list.
    There are i-k elements from the (b_j)j=1..2n list in {d1, d2, ..., d_{i-1}}, and thus the largest one is b_{2(i-k)-1}. Therefore d_i is larger than 2(i-k)-1 elements of the (b_j)j=1..2n list.
    Therefore, d_i is larger than 2k-2 + 2(i-k)-1 = 2i-3 elements.
The proof is similar if d_i belongs to the (b_j)j=1..2n list, and similar for the other three properties.
Construction of merge_m

[Figure: inputs a1, a2, ... and b1, b2, ... are split by parity into two copies of merge_{m-1}, producing (d1, d2, ...) and (e1, e2, ...); a final row of comparators pairs d_{i+1} with e_i]

Recursive construction that implements the result from the theorem.
Performance of merge_m

Execution time is defined as the maximum number of comparators that one input must go through to produce the output.
    t_m: time to go through merge_m
    p_m: number of comparators in merge_m
Two inductions (easily deduced from the theorem):
    t_0 = 1, t_1 = 2, t_m = t_{m-1} + 1, hence t_m = m + 1
    p_0 = 1, p_1 = 3, p_m = 2·p_{m-1} + 2^m - 1, hence p_m = m·2^m + 1
In terms of n = 2^m: O(log n) time and O(n log n) comparators.
Fast execution in O(log n), but poor efficiency:
    Sequential merge time with one comparator: n
    Efficiency = n / (n · log n · log n) = 1 / (log n)^2
Comparators are not used efficiently, as each is used only once. The network could, however, be used in pipelined mode, processing a series of lists, with all comparators used at each step and one result available at each step.
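The closed forms t_m = m + 1 and p_m = m·2^m + 1 can be checked mechanically against the recurrences (a small verification script of ours):

```python
# Verify the merge_m recurrences against their claimed closed forms.
def t(m):
    """Depth recurrence: t_0 = 1, t_m = t_{m-1} + 1."""
    return 1 if m == 0 else t(m - 1) + 1

def p(m):
    """Comparator-count recurrence: p_0 = 1, p_m = 2*p_{m-1} + 2^m - 1."""
    return 1 if m == 0 else 2 * p(m - 1) + 2**m - 1

for m in range(10):
    assert t(m) == m + 1            # closed form for the depth
    assert p(m) == m * 2**m + 1     # closed form for the comparator count
print("closed forms verified up to m = 9")
```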
Sorting network using merge_m

[Figure: sort_3 network built from copies of sort_2 (each itself two sort_1 blocks followed by merge_1) whose outputs feed a final merge_2]

Sort the 1st half of the list, sort the 2nd half of the list, then merge the results; apply this construction recursively.
A. Legrand Performance
Parallel RAM Execution time t’m and p’m number of comparators Introduction 2 Pointer Jumping t’1=1 t’m = t’m1 + tm1 (t’m = O(m )) Reducing the m 2 Number of p’1=1 p’m = 2p’m1 + pm1 (p’m = O(2 m )) Processors m PRAM Model In terms of n = 2 Hierarchy 2 Conclusion Sort time: O((log n) ) Number of comparators: O(n(log n)2) Combinatorial Networks Poor performance given the number of comparators (unless Merge Sort used in pipeline mode) 0-1 Principle Odd-Even Efficiency: Tseq / (p * Tpar) Transposition Sort Efficiency = O(n log n/ (n (log n)4 )) = O((log n)3) FFT There was a PRAM algorithm in O(log n) Conclusion Is there a sorting network that achieves this? yes, recent work in 1983 O(log n) time, O(n log n) comparators But constants are SO large, that it is impractical
0-1 Principle

- Theorem: a network of comparators implements sorting correctly if and only if it implements it correctly for lists that consist solely of 0's and 1's.
- This theorem makes proofs of things like the "merge theorem" much simpler, and in general one only works with lists of 0's and 1's when dealing with sorting networks.
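As an illustration, the sketch below checks a classic 5-comparator sorting network for n = 4 (my example, not from the slides) on all 2^4 binary inputs, then confirms — as the theorem predicts — that it sorts all permutations:

```python
from itertools import product, permutations

def apply_network(network, values):
    """Apply each comparator (i, j), i < j: min goes to wire i, max to wire j."""
    v = list(values)
    for i, j in network:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v

# A standard 5-comparator sorting network for n = 4.
net4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

# 0-1 principle: checking the 2^4 binary inputs ...
assert all(apply_network(net4, bits) == sorted(bits)
           for bits in product([0, 1], repeat=4))

# ... suffices to guarantee correctness on arbitrary inputs.
assert all(apply_network(net4, list(p)) == sorted(p)
           for p in permutations(range(4)))
```

The practical payoff: 2^n binary tests instead of n! permutation tests when checking a candidate network.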
Another (simpler) sorting network

- Sort by odd-even transposition.
- The network is built to sort a list of n = 2p elements:
  - p copies of a 2-row network
  - the first row contains p comparators that take elements 2i-1 and 2i, for i = 1, ..., p
  - the second row contains p-1 comparators that take elements 2i and 2i+1, for i = 1, ..., p-1
  - for a total of n(n-1)/2 comparators
- There is a similar construction for when n is odd.
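The construction is straightforward to simulate; a sketch using 0-based wire indices (the slides use 1-based), with helper names of my choosing:

```python
def odd_even_transposition_network(n):
    """Rows of the odd-even transposition network for n inputs.
    Row r compares adjacent wires (i, i+1) starting at i = 0 for even r
    and i = 1 for odd r; there are n rows and n(n-1)/2 comparators."""
    return [[(i, i + 1) for i in range(r % 2, n - 1, 2)] for r in range(n)]

def run(rows, values):
    v = list(values)
    for row in rows:          # rows execute one after another ...
        for i, j in row:      # ... comparators within a row are independent
            if v[i] > v[j]:
                v[i], v[j] = v[j], v[i]
    return v

rows = odd_even_transposition_network(8)
assert sum(len(r) for r in rows) == 8 * 7 // 2      # n(n-1)/2 comparators
assert run(rows, [5, 3, 8, 1, 7, 2, 6, 4]) == [1, 2, 3, 4, 5, 6, 7, 8]
```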
Odd-even transposition network

[Figure: the odd-even transposition networks for n = 8 and n = 7 — n rows of alternating comparators between adjacent wires.]
Proof of correctness

- To prove that the previous network sorts correctly:
  - either a rather complex induction,
  - or use of the 0-1 principle.
- Let (a_i)_{i=1,...,n} be a list of 0's and 1's to sort.
- Let k be the number of 1's in that list, and j_0 the position of the last 1.
  - Example: 1 1 0 1 0 0 0 has k = 3 and j_0 = 4.
- Note that a 1 never "moves" to the left (this is why using the 0-1 principle makes this proof easy).
- Let's follow the last 1:
  - If j_0 is even, it does not move in the first step, but moves to the right at the next step.
  - If j_0 is odd, it moves to the right in the first step.
  - In all cases, it moves to the right at the 2nd step, and at each following step, until it reaches the nth position.
  - So the last 1 reaches position n after at most n-1 steps.
Proof (continued)

- Let's follow the next-to-last 1, starting in position j: since the last 1 moves to the right starting at step 2 at the latest, the next-to-last 1 is never "blocked". At step 3, and at all following steps, the next-to-last 1 moves to the right, until it arrives in position n-1.
- Generally, the ith 1, counting from the right, moves right during step i+1 and keeps moving until it reaches position n-i+1.
- This goes on up to the kth 1, which goes to position n-k+1.
- At the end we have the n-k 0's followed by the k 1's.
- Therefore we have sorted the list.
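The movement of the 1's can be observed empirically; the sketch below (helper names are mine) replays the network on the slides' 0-1 example and checks that the rightmost 1 never moves left:

```python
def transposition_steps(v):
    """Yield the list after each row of the odd-even transposition network."""
    v = list(v)
    n = len(v)
    for r in range(n):
        for i in range(r % 2, n - 1, 2):   # adjacent comparators, alternating
            if v[i] > v[i + 1]:
                v[i], v[i + 1] = v[i + 1], v[i]
        yield list(v)

# On a 0-1 list, a 1 never moves left: the index of the rightmost 1
# never decreases from one step to the next.
states = list(transposition_steps([1, 1, 0, 1, 0, 0, 0]))
last_one = [max(i for i, x in enumerate(s) if x == 1) for s in states]
assert last_one == sorted(last_one)            # monotonically non-decreasing
assert states[-1] == sorted([1, 1, 0, 1, 0, 0, 0])
```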
Example for n=6

  init:   1 0 1 0 1 0
  step 1: 0 1 0 1 0 1
  step 2: 0 0 1 0 1 1
  step 3: 0 0 0 1 1 1
  step 4: 0 0 0 1 1 1
  step 5: 0 0 0 1 1 1   (steps 4-6 are redundant)
  step 6: 0 0 0 1 1 1
Performance

- Compute time: t_n = n
- Number of comparators: p_n = n(n-1)/2
- Efficiency: O(n log n / (n(n-1)/2 · n)) = O(log n / n^2)
- Really, really, really poor — but at least it's a simple network.
- Is there a sorting network with good and practical performance and efficiency?
  - Not really.
  - But one can use the principle of a sorting network to come up with a good algorithm on a linear network of processors.
Sorting on a linear array

- Consider a linear array of p general-purpose processors:

    P1 -- P2 -- P3 -- ... -- Pp

- Consider a list of n elements to sort (such that n is divisible by p, for simplicity).
- Idea: use the odd-even transposition network and sort of "fold" it onto the linear array.
Principle

- Each processor receives a subpart of the list to sort, i.e., n/p elements.
- Each processor sorts its sublist locally, in parallel.
- There are then p steps of alternating exchanges, as in the odd-even transposition sorting network:
  - exchanges involve full sublists, not just single elements
  - when two processors communicate, their two lists are merged
  - the left processor keeps the left half of the merged list
  - the right processor keeps the right half of the merged list
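A minimal sequential simulation of this merge-split scheme (the function name is mine); it reproduces the 18-element, 6-processor example from these slides:

```python
def odd_even_merge_split(blocks):
    """Simulate odd-even transposition sort on a linear array of p processors,
    each holding a block of n/p elements."""
    blocks = [sorted(b) for b in blocks]                 # local sort, in parallel
    p = len(blocks)
    for step in range(p):                                # p exchange steps
        for i in range(step % 2, p - 1, 2):              # alternating pairs
            merged = sorted(blocks[i] + blocks[i + 1])   # merge ...
            half = len(blocks[i])
            blocks[i], blocks[i + 1] = merged[:half], merged[half:]  # ... split
    return blocks

# The example from the slides:
init = [[8, 3, 12], [10, 16, 5], [2, 18, 9], [17, 15, 4], [1, 6, 13], [11, 7, 14]]
assert odd_even_merge_split(init) == [[1, 2, 3], [4, 5, 6], [7, 8, 9],
                                      [10, 11, 12], [13, 14, 15], [16, 17, 18]]
```

`sorted` stands in for the O(n/p) merge of two sorted lists; a real implementation would use a linear-time merge and message passing between neighbors.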
Example

              P1         P2          P3          P4          P5          P6
  init:       {8,3,12}   {10,16,5}   {2,18,9}    {17,15,4}   {1,6,13}    {11,7,14}
  local sort: {3,8,12}   {5,10,16}   {2,9,18}    {4,15,17}   {1,6,13}    {7,11,14}
  odd:        {3,5,8}    {10,12,16}  {2,4,9}     {15,17,18}  {1,6,7}     {11,13,14}
  even:       {3,5,8}    {2,4,9}     {10,12,16}  {1,6,7}     {15,17,18}  {11,13,14}
  odd:        {2,3,4}    {5,8,9}     {1,6,7}     {10,12,16}  {11,13,14}  {15,17,18}
  even:       {2,3,4}    {1,5,6}     {7,8,9}     {10,11,12}  {13,14,16}  {15,17,18}
  odd:        {1,2,3}    {4,5,6}     {7,8,9}     {10,11,12}  {13,14,15}  {16,17,18}
  even:       {1,2,3}    {4,5,6}     {7,8,9}     {10,11,12}  {13,14,15}  {16,17,18}

Same pattern as the sorting network.
Performance

- Local sort: O(n/p · log(n/p)) = O(n/p · log n)
- Each step costs one merge of two lists of n/p elements: O(n/p)
- There are p such steps, hence O(n) for the exchange phase.
- Total: O(n/p · log n + n)
- If p = log n: O(n)
- The algorithm is optimal for p ≤ log n.
- More information on sorting networks: D. Knuth, The Art of Computer Programming, volume 3: Sorting and Searching, Addison-Wesley (1973).
FFT circuit — what's an FFT?

- Fourier Transform (FT): a "tool" to decompose a function into sinusoids of different frequencies, which sum to the original function.
  - Useful in signal processing, linear system analysis, quantum physics, image processing, etc.
- Discrete Fourier Transform (DFT): works on a discrete sample of function values.
  - In many domains, nothing is truly continuous or continuously measured.
- Fast Fourier Transform (FFT): an algorithm to compute a DFT, proposed by Cooley and Tukey in 1965, which reduces the number of computations from O(n^2) to O(n log n).
How to compute a DFT

- Given a sequence of numbers {a_0, ..., a_{n-1}}, its DFT is defined as the sequence {b_0, ..., b_{n-1}}, where

    b_j = sum_{k=0}^{n-1} a_k · ω_n^{jk}

  (a polynomial evaluation: b_j = A(ω_n^j) with A(x) = sum_k a_k x^k)
- with ω_n a primitive nth root of 1, i.e., ω_n^n = 1 and ω_n^k ≠ 1 for 0 < k < n (e.g., ω_n = e^{2πi/n}).
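A direct O(n^2) evaluation of this definition, as a reference point (a sketch using ω_n = e^{2πi/n}; the sign convention in the exponent varies between texts):

```python
import cmath

def dft_naive(a):
    """b_j = sum_k a_k * w^(j*k), with w = e^(2*pi*i/n): O(n^2) operations."""
    n = len(a)
    w = cmath.exp(2j * cmath.pi / n)   # a primitive n-th root of unity
    return [sum(a[k] * w**(j * k) for k in range(n)) for j in range(n)]

# DFT of a constant sequence: all the "energy" lands in b_0.
b = dft_naive([1, 1, 1, 1])
assert abs(b[0] - 4) < 1e-9
assert all(abs(x) < 1e-9 for x in b[1:])
```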
The FFT Algorithm

- A naive algorithm would require n^2 complex additions and multiplications, which is not practical as typically n is very large.
- Let n = 2^s, and split A(x) = sum_{k=0}^{n-1} a_k x^k by even and odd coefficient indices:

    A(x) = A_e(x^2) + x · A_o(x^2)

  where A_e(y) = sum_k a_{2k} y^k (even indices: yields the u_j) and A_o(y) = sum_k a_{2k+1} y^k (odd indices: yields the v_j).
The FFT Algorithm

- Therefore, evaluating the polynomial A(x) at 1, ω_n, ω_n^2, ..., ω_n^{n-1} can be reduced to:
  1. Evaluate the two polynomials A_e and A_o at (ω_n^j)^2, for j = 0, ..., n-1.
  2. Compute b_j = u_{j mod n/2} + ω_n^j · v_{j mod n/2}.
- BUT: the set {(ω_n^j)^2} really contains only n/2 distinct elements, since ω_n^2 = ω_{n/2}!!!
The FFT Algorithm

- As a result, the original problem of size n (that is, n polynomial evaluations) has been reduced to 2 problems of size n/2 (that is, n/2 polynomial evaluations each).

    FFT(in A, out B)
      if n = 1
        b_0 ← a_0
      else
        FFT((a_0, a_2, ..., a_{n-2}), (u_0, u_1, ..., u_{n/2-1}))
        FFT((a_1, a_3, ..., a_{n-1}), (v_0, v_1, ..., v_{n/2-1}))
        for j = 0 to n-1
          b_j ← u_{j mod (n/2)} + ω_n^j · v_{j mod (n/2)}
        end for
      end if
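A direct transcription of this pseudocode into Python (a sketch, using ω_n = e^{2πi/n} as above), checked against the naive O(n^2) DFT:

```python
import cmath

def fft(a):
    """Recursive radix-2 FFT; len(a) must be a power of 2."""
    n = len(a)
    if n == 1:
        return list(a)
    u = fft(a[0::2])                   # DFT of even-indexed coefficients
    v = fft(a[1::2])                   # DFT of odd-indexed coefficients
    w = cmath.exp(2j * cmath.pi / n)   # primitive n-th root of unity
    return [u[j % (n // 2)] + w**j * v[j % (n // 2)] for j in range(n)]

def dft_naive(a):
    """Reference O(n^2) evaluation of the DFT definition."""
    n = len(a)
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(a[k] * w**(j * k) for k in range(n)) for j in range(n)]

a = [1, 2, 3, 4, 5, 6, 7, 8]
assert all(abs(x - y) < 1e-9 for x, y in zip(fft(a), dft_naive(a)))
```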
Performance of the FFT

- t(n): running time of the algorithm
- t(n) = d · n + 2 · t(n/2), where d is some constant
- Hence t(n) = O(n log n).
- How to do this in parallel?
  - Both recursive FFT computations can be done independently.
  - Then all iterations of the for loop are also independent.
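The recurrence can be unrolled numerically to confirm the O(n log n) bound (the base case t(1) = d is my assumption):

```python
def t(n, d=1):
    """Unroll t(n) = d*n + 2*t(n/2), with assumed base case t(1) = d."""
    return d if n == 1 else d * n + 2 * t(n // 2, d)

# Agrees with the closed form d * n * (log2(n) + 1), hence O(n log n).
assert all(t(2**s) == 2**s * (s + 1) for s in range(16))
```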
FFT_n circuit: multiply + add

[Figure: the FFT_n circuit — the even-indexed inputs a_0, a_2, ..., a_{n-2} feed one FFT_{n/2} box producing u_0, ..., u_{n/2-1}; the odd-indexed inputs a_1, a_3, ..., a_{n-1} feed a second FFT_{n/2} box producing v_0, ..., v_{n/2-1}; a final stage of n multiply-add units computes b_j = u_{j mod n/2} + ω_n^j · v_{j mod n/2}.]
Performance

- Number of elements:
  - width of O(n)
  - depth of O(log n)
  - therefore O(n log n) elements in total
- Running time:
  - t(n) = t(n/2) + 1
  - therefore O(log n)
- Efficiency: 1/log n
- You can decide which part of this circuit should be mapped to a real parallel platform, for instance.

Outline
1 Parallel RAM
2 Combinatorial Networks
3 Conclusion
Conclusion

- We could teach an entire semester of theoretical parallel computing.
- Most people just happily ignore it.
- But:
  - it's the source of most fundamental ideas
  - it's a source of inspiration for algorithms
  - it's a source of inspiration for implementations: DSP