Parallel Computation Models Parallel Computation Models

Parallel Computation Models • PRAM (parallel RAM) Parallel Computation Models • Fixed Interconnection Network – bus, ring, mesh, hypercube, shuffle-exchange • Boolean Circuits Lecture 3 • Combinatorial Circuits Lecture 4 • BSP • LOGP Slide 1 Slide 2 TYPES OF MULTIPROCESSING FRAMEWORKS PARALLEL DISTRIBUTED PARALLEL AND DISTRIBUTED COMPUTATION TECHNICAL ASPECTS • MANY INTERCONNECTED PROCESSORS WORKING CONCURRENTLY •PARALLEL COMPUTERS (USUALLY) WORK IN TIGHT SYNCRONY, SHARE MEMORY TO A LARGE EXTENT AND HAVE A VERY FAST AND RELIABLE COMMUNICATION MECHANISMBETWEEN THEM. • DISTRIBUTED COMPUTERS ARE MORE INDEPENDENT, COMMUNICATION IS LESS P4 P5 FREQUENT AND LESS SYNCRONOUS, AND THE COOPERATION IS LIMITED. P3 PURPOSES • PARALLEL COMPUTERS COOPERATE TO SOLVE MORE EFFICIENTLY (POSSIBLY) DIFFICULT PROBLEMS INTERCONNECTION NETWORK • DISTRIBUTED COMPUTERS HAVE INDIVIDUAL GOALS AND PRIVATE ACTIVITIES. SOMETIME COMMUNICATIONS WITH OTHER ONES ARE NEEDED. (E. G. DISTRIBUTED DA TA BASE OPERATIONS). P2 PARALLEL COMPUTERS: COOPERATION IN A POSITIVE SENSE P1 . Pn DISTRIBUTED COMPUTERS: COOPERATION IN A NEGATIVE SENSE, ONLY WHEN IT IS NECESSARY • CONNECTION MACHINE • INTERNET Connects all the computers of the world Slide 3 Slide 4 FOR PARALLEL SYSTEMS PARALLEL ALGORITHMS WE ARE INTERESTED TO SOLVE ANY PROBLEM IN PARALLEL • WHICH MODEL OF COMPUTATION IS THE BETTER TO USE? FOR DISTRIBUTED SYSTEMS • HOW MUCH TIME WE EXPECT TO SAVE USING A PARALLEL ALGORITHM? WE ARE INTERESTED TO SOLVE IN PARALLEL • HOW TO CONSTRUCT EFFICIENT ALGORITHMS? PARTICULAR PROBLEMS ONLY, TYPICAL EXAMPLES ARE: •COMMUNICATION SERVICES MANY CONCEPTS OF THE COMPLEXITY THEORY MUST BE REVISITED ROUTING BROADCASTING • IS THE PARALLELISM A SOLUTION FOR HARD PROBLEMS? •MAINTENANCE OF CONTROL STUCTURE SPANNING TREE CONSTRUCTION • ARE THERE PROBLEMS NOT ADMITTING AN EFFICIENT PARALLEL SOLUTION, TOPOLOGY UPDATE LEADER ELECTION THAT IS INHERENTLY SEQUENTIAL PROBLEMS? •RESOURCE CONTROL ACTIVITIES LOAD BALANCING MANAGING GLOBAL DIRECTORIES Slide 5 Slide 6 1 We need a model of computation • NETWORK (VLSI) MODEL HYPERCUBE • The processors are connected by a network of bounded degree. • No shared memory is available. 0110 0111 1110 1111 • Several interconnection topologies. 0100 0101 1100 diameter = 4 degree = 4 (log2N) 1101 0010 • Synchronous way of operating. 0011 1010 1011 0000 0001 MESH CONNECTED ARRAY 1000 1001 N = 24 PROCESSORS degree = 4 (N) diameter = 2N Slide 7 Slide 8 Other important topologies • binary trees • mesh of trees Model Equivalence • cube connected cycles In the network model a PARALLEL MACHINEis a very complex • given two models M and M , and a problem P ensemble of small interconnected units, performing elementary 1 2 operations. of size n - Each processor has its own memory. - Processors work synchronously. LIMITS OF THE MODEL • if M1 and M2 are equivalent then solving P • different topologies require different algorithms to solve the same requires: problem • it is difficult to describe and analyse algorithms (the migration of – T(n) time and P(n) processors on M1 data have to be described) O(1) O(1) – T(n) time and P(n) processors on M2 A shared-memory model is more suitable by an algorithmic point of view Slide 9 Slide 10 PRAM MODEL 1 PRAM 2 P1 3 P2 . Common Memory • Parallel Random Access Machine . ? . P • Shared-memory multiprocessor i . • unlimited number of processors, each Pn – has unlimited local memory m – knows its ID PRAM n RAM processors connected to a common memory of m cells ASSUMPTION: at each time unit each Pi can read a memory cell, make an internal – able to access the shared memory computation and write another memory cell. CONSEQUENCE: any pair of processor P P can communicate in constant time! • unlimited shared memory i j Pi writes the message in cell x at timet Pi reads the message in cell x at timet+1 Slide 11 Slide 12 2 PRAM PRAM Instruction Set • Inputs/Outputs are placed in the shared • accumulator architecture memory (designated address) – memory cell R0 accumulates results • Memory cell stores an arbitrarily large integer • multiply/divide instructions take only • Each instruction takes unit time constant operands • Instructions are synchronized across the – prevents generating exponentially large processors numbers in polynomial time Slide 13 Slide 14 PRAM Complexity Measures Two Technical Issues for PRAM • for each individual processor – time: number of instructions executed • How processors are activated – space: number of memory cells accessed • How shared memory is accessed • PRAM machine – time: time taken by the longest running processor – hardware: maximum number of active processors Slide 15 Slide 16 THE PRAM IS A THEORETICAL (UNFEASIBLE) MODEL Processor Activation • The interconnection network between processors and memory would require a very large amount of area . • The message-routing on the interconnection network would require time • P0 places the number of processors (p) in the proportional to network size (i. e. the assumption of a constant access time designated shared-memory cell to the memory is not realistic). – each active Pi, where i < p, starts executing WHY THE PRAM IS A REFERENCE MODEL? – O(1) time to activate • Algorithm’s designers can forget the communication problems and focus their attention on the parallel computation only. – all processors halt when P0 halts •There exist algorithms simulating any PRAM algorithm on bounded degree networks. • Active processors explicitly activate additional E. G. A PRAM algorithm requiring time T(n), can be simulated in a mesh of tree processors via FORK instructions in time T(n)log2n/loglogn, that is each step can be simulated with a slow-down of log2n/loglogn. – tree-like activation • Instead of design ad hoc algorithms for bounded degree networks, design more – O(log p) time to activate general algorithms for the PRAM model and simulate them on a feasible network. Slide 17 Slide 18 3 • For the PRAM model there exists a well developed body of techniques and methods to handle different classes of computational problems. Metrics • The discussion on parallel model of computation is still HOT The actual trend: A measure of relative performance between a multiprocessor COARSE-GRAINED MODELS system and a single processor system is the speed-up S( p), defined as follows: • The degree of parallelism allowed is independent from the number Execution time using a single processor system of processors. S( p) = Execution time using a multiprocessor with pprocessors • The computation is divided in supersteps, each one includes T1 Sp • local computation S( p) = Efficiency = • communication phase Tp p • syncronization phase Cost = p ´ Tp the study is still at the beginning! Slide 19 Slide 20 Metrics • Parallel algorithm is cost-optimal: Amdahl’s Law parallel cost = sequential time • f = fraction of the problem that’s C = T p 1 inherently sequential Ep = 100% (1 – f) = fraction that’s parallel • Critical when down-scaling: • Parallel time Tp: T = f + (1- f ) p parallel implementation may p become slower than sequential 1 3 S = T1 = n • Speedup with p processors: p 1- f 2.5 2 f + Tp = n when p = n p 4.5 Cp = n Slide 21 Slide 22 Amdahl’s Law PRAM • Too many interconnections gives problems with synchronization • Upper bound on speedup (p = ¥) • However it is the best conceptual model for designing efficient 1 1 parallel algorithms S p = Converges to 0 1- f S¥ = – due to simplicity and possibility of simulating efficiently PRAM f + f algorithms on more realistic parallel architectures p • Example: f = 2% S = 1 / 0.02 = 50 Slide 23 Slide 24 4 Shared-Memory Access Example CRCW-PRAM Concurrent (C) means, many processors can do the operation simultaneously in • Initially the same memory Exclusive (E) not concurent – table A contains values 0 and 1 – output contains value 0 • EREW (Exclusive Read Exclusive Write) • CREW (Concurrent Read Exclusive Write) – Many processors can read simultaneously the same location, but only one can attempt to write to a given location • ERCW • CRCW – Many processors can write/read at/from the same memory location • The program computes the “Boolean OR” of A[1], A[2], A[3], A[4], A[5] Slide 25 Slide 26 Example CREW-PRAM Pascal triangle • Assume initially table A contains [0,0,0,0,0,1] and we have the parallel program PRAM CREW Slide 27 Slide 28 5.

Parallel Computation Models Parallel Computation Models

Chapter 4 Algorithm Analysis

Introduction to Parallel Computing, 2Nd Edition

Oblivious Network RAM and Leveraging Parallelism to Achieve Obliviousness

An External-Memory Work-Depth Model and Its Applications to Massively Parallel Join Algorithms

Introduction to Parallel Computing

Parallel Functional Arrays

Oblivious Parallel RAM and Applications

Parallel Programming

Oblivious Network

Effective Data Parallel Computing on Multicore Processors

A Practical Hierarchial Model of Parallel Computation: the Model

The Choice of Hybrid Architectures, a Realistic Strategy to Reach the Exascale Julien Loiseau