Theoretical Parallel Computing

Total Pages: 16

File Type: PDF, Size: 1020 KB

Arnaud Legrand, CNRS, University of Grenoble
LIG laboratory, [email protected]
November 1, 2009
(Several slides courtesy of Henri Casanova.)

Outline
1. Parallel RAM: Introduction; Pointer Jumping; Reducing the Number of Processors; PRAM Model Hierarchy; Conclusion
2. Combinatorial Networks: Merge Sort; 0-1 Principle; Odd-Even Transposition Sort; FFT
3. Conclusion

Models for Parallel Computation
- We have seen how to implement parallel algorithms in practice, and we have come up with performance analyses.
- In traditional algorithm complexity work, the Turing machine makes it possible to compare algorithms precisely, to establish precise notions of complexity, etc.
- Can we do something like this for parallel computing? Parallel machines are complex, with many hardware characteristics (e.g., the network) that are difficult to take into account in algorithm work. Is it hopeless?
- The most famous theoretical model of parallel computing is the PRAM model.
- We will see that many principles of the model are really at the heart of the more applied things we have seen so far.

The PRAM Model
- Parallel Random Access Machine (PRAM): an imperfect model that only tangentially relates to the performance of a real parallel machine, just as a Turing machine only tangentially relates to the performance of a real computer.
- Goal: make it possible to reason about and classify parallel algorithms, and to obtain complexity results (optimality, minimal complexity, etc.).
- One way to look at it: the model makes it possible to determine the "maximum parallelism" in an algorithm or a problem, and to devise new algorithms.
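To make the notion of "maximum parallelism" concrete, here is a small Python sketch (my own illustration, not from the slides) of a synchronous PRAM-style reduction: with roughly n/2 processors, one per pair of cells, n values are summed in O(log n) super-steps, whereas a single processor needs O(n) steps.

    # Minimal sketch (not from the slides): a synchronous PRAM-style sum.
    # Each pass of the while loop stands for one READ/COMPUTE/WRITE
    # super-step in which every active processor touches distinct cells
    # (EREW-style), so all its iterations could run in the same cycle.
    def pram_sum(values):
        mem = list(values)              # shared memory
        n = len(mem)
        stride, rounds = 1, 0
        while stride < n:
            for i in range(0, n - stride, 2 * stride):
                mem[i] += mem[i + stride]   # processor at cell i is active
            stride *= 2
            rounds += 1
        return mem[0], rounds           # O(log n) rounds, ~n/2 processors

    print(pram_sum(range(1, 9)))        # (36, 3) for n = 8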
The PRAM Model (continued)
- Memory size is infinite and the number of processors is unbounded; but nothing prevents you from "folding" this back to something more realistic.
- No direct communication between processors: they communicate via the memory, and they can operate in an asynchronous fashion.
- Every processor accesses any memory location in 1 cycle.
- Typically all processors execute the same algorithm in a synchronous fashion: a READ phase, a COMPUTE phase, a WRITE phase.
- Some subset of the processors can stay idle (e.g., even-numbered processors may not work while odd-numbered processors do, and conversely).
[Figure: processors P1, P2, P3, ..., PN all connected to a single shared memory.]

Memory Access in PRAM
- Exclusive Read (ER): p processors can simultaneously read the contents of p distinct memory locations.
- Concurrent Read (CR): p processors can simultaneously read the contents of p' memory locations, where p' < p.
- Exclusive Write (EW): p processors can simultaneously write to p distinct memory locations.
- Concurrent Write (CW): p processors can simultaneously write to p' memory locations, where p' < p.

PRAM CW?
What ends up being stored when multiple writes occur?
- Priority CW: processors are assigned priorities, and the top-priority processor is the one whose write counts for each group write.
- Fail common CW: if the values are not equal, no change.
- Collision common CW: if the values are not equal, write a "failure value".
- Fail-safe common CW: if the values are not equal, the algorithm aborts.
- Random CW: non-deterministic choice of the value written.
- Combining CW: write the sum, average, max, min, etc. of the values.
- etc.
The above means that when you write an algorithm for a CW PRAM you can use any of these rules, even at different points in time. CW does not correspond to any hardware in existence; it is just a logical/algorithmic notion that could be implemented in software. In fact, most algorithms end up not needing CW.

Classic PRAM Models
- CREW (concurrent read, exclusive write): the most commonly used.
- CRCW (concurrent read, concurrent write): the most powerful.
- EREW (exclusive read, exclusive write): the most restrictive; unfortunately, probably the most realistic.
Theorems exist that prove the relative power of these models (more on this later).
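The concurrent-write policies listed above are simply rules for combining the values that land on one cell during the same super-step. The helper below is a hypothetical illustration (the function name and the policies-as-strings are my own, not an existing API) of how a few of them resolve a group write:

    # Minimal sketch: resolve simultaneous writes to one shared-memory cell.
    def resolve_write(writes, policy):
        """writes: list of (processor_id, value) pairs targeting one cell."""
        values = [v for _, v in writes]
        if policy == "priority":        # lowest processor id wins
            return min(writes)[1]
        if policy == "common":          # only written if all values agree
            return values[0] if len(set(values)) == 1 else None
        if policy == "collision":       # disagreement stores a failure value
            return values[0] if len(set(values)) == 1 else "FAIL"
        if policy == "combining-sum":   # combine the values, e.g. with a sum
            return sum(values)
        raise ValueError("unknown policy")

    writes = [(0, 5), (3, 5), (7, 2)]
    print(resolve_write(writes, "priority"))        # 5 (processor 0 wins)
    print(resolve_write(writes, "collision"))       # FAIL (values differ)
    print(resolve_write(writes, "combining-sum"))   # 12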
PRAM Example 1
Problem:
- We have a linked list of length n.
- For each element i, compute its distance to the end of the list:
      d[i] = 0                  if next[i] = NIL
      d[i] = d[next[i]] + 1     otherwise
- The sequential algorithm runs in O(n).
- We can define a PRAM algorithm that runs in O(log n): associate one processor with each element of the list, and at each iteration split the list in two, with odd-placed and even-placed elements in different lists.

Principle:
- Look at the next element.
- Add its d value to yours.
- Point to the next element's next element.
[Figure: d values for a 6-element list after each step: 1 1 1 1 1 0, then 2 2 2 2 1 0, then 4 4 3 2 1 0, then 5 4 3 2 1 0.]
- The size of each list is divided by 2 at each step, hence the O(log n) complexity.

Algorithm:
    forall i
        if next[i] == NIL then d[i] ← 0 else d[i] ← 1
    while there is an i such that next[i] ≠ NIL
        forall i
            if next[i] ≠ NIL then
                d[i] ← d[i] + d[next[i]]
                next[i] ← next[next[i]]
What about the correctness of this algorithm?

forall loop
- At each step, the updates must be synchronized so that pointers point to the right things: next[i] ← next[next[i]].
- This is ensured by the semantics of forall:
      forall i
          A[i] = B[i]
  really means
      forall i
          tmp[i] = B[i]
      forall i
          A[i] = tmp[i]
- Nobody really writes it out this way, but one must not forget that this is really what happens underneath.

while condition
- while there is an i such that next[i] ≠ NIL
- How can one do such a global test on a PRAM? It cannot be done in constant time unless the PRAM is CRCW: at the end of each step, each processor writes TRUE or FALSE to the same memory location depending on whether next[i] is equal to NIL, and one then takes the AND of all the values (to resolve the concurrent writes).
- On a CREW PRAM, one needs O(log n) steps to do a global test like the above.
- In this case, one can simply rewrite the while loop as a for loop, because we have analyzed how the iterations go: for step = 1 to log n ...
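As a concrete illustration of both the algorithm and the forall semantics, here is a sequential Python sketch (my own, assuming the list is given as a next array with None marking the tail) that simulates the synchronous rounds by snapshotting d and next before any write:

    # Minimal sketch of pointer jumping (list ranking). The explicit
    # snapshots d_old / nxt_old mimic the PRAM's synchronous READ phase:
    # every read of a round happens before any write of that round.
    def list_ranking(next_):
        n = len(next_)
        d = [0 if next_[i] is None else 1 for i in range(n)]
        nxt = list(next_)
        while any(p is not None for p in nxt):      # the global test
            d_old, nxt_old = list(d), list(nxt)     # READ phase snapshot
            for i in range(n):                      # one processor per element
                if nxt_old[i] is not None:
                    d[i] = d_old[i] + d_old[nxt_old[i]]
                    nxt[i] = nxt_old[nxt_old[i]]
        return d

    # List 0 -> 1 -> 2 -> 3 -> 4 -> 5, where element 5 is the tail.
    print(list_ranking([1, 2, 3, 4, 5, None]))      # [5, 4, 3, 2, 1, 0]

On this six-element list the successive d vectors are 1 1 1 1 1 0, 2 2 2 2 1 0, 4 4 3 2 1 0 and finally 5 4 3 2 1 0, matching the figure above.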