Parallel Computation Models Parallel Computation Models
Total Page:16
File Type:pdf, Size:1020Kb
Parallel Computation Models • PRAM (parallel RAM) Parallel Computation Models • Fixed Interconnection Network – bus, ring, mesh, hypercube, shuffle-exchange • Boolean Circuits Lecture 3 • Combinatorial Circuits Lecture 4 • BSP • LOGP Slide 1 Slide 2 TYPES OF MULTIPROCESSING FRAMEWORKS PARALLEL DISTRIBUTED PARALLEL AND DISTRIBUTED COMPUTATION TECHNICAL ASPECTS • MANY INTERCONNECTED PROCESSORS WORKING CONCURRENTLY •PARALLEL COMPUTERS (USUALLY) WORK IN TIGHT SYNCRONY, SHARE MEMORY TO A LARGE EXTENT AND HAVE A VERY FAST AND RELIABLE COMMUNICATION MECHANISMBETWEEN THEM. • DISTRIBUTED COMPUTERS ARE MORE INDEPENDENT, COMMUNICATION IS LESS P4 P5 FREQUENT AND LESS SYNCRONOUS, AND THE COOPERATION IS LIMITED. P3 PURPOSES • PARALLEL COMPUTERS COOPERATE TO SOLVE MORE EFFICIENTLY (POSSIBLY) DIFFICULT PROBLEMS INTERCONNECTION NETWORK • DISTRIBUTED COMPUTERS HAVE INDIVIDUAL GOALS AND PRIVATE ACTIVITIES. SOMETIME COMMUNICATIONS WITH OTHER ONES ARE NEEDED. (E. G. DISTRIBUTED DA TA BASE OPERATIONS). P2 PARALLEL COMPUTERS: COOPERATION IN A POSITIVE SENSE P1 . Pn DISTRIBUTED COMPUTERS: COOPERATION IN A NEGATIVE SENSE, ONLY WHEN IT IS NECESSARY • CONNECTION MACHINE • INTERNET Connects all the computers of the world Slide 3 Slide 4 FOR PARALLEL SYSTEMS PARALLEL ALGORITHMS WE ARE INTERESTED TO SOLVE ANY PROBLEM IN PARALLEL • WHICH MODEL OF COMPUTATION IS THE BETTER TO USE? FOR DISTRIBUTED SYSTEMS • HOW MUCH TIME WE EXPECT TO SAVE USING A PARALLEL ALGORITHM? WE ARE INTERESTED TO SOLVE IN PARALLEL • HOW TO CONSTRUCT EFFICIENT ALGORITHMS? PARTICULAR PROBLEMS ONLY, TYPICAL EXAMPLES ARE: •COMMUNICATION SERVICES MANY CONCEPTS OF THE COMPLEXITY THEORY MUST BE REVISITED ROUTING BROADCASTING • IS THE PARALLELISM A SOLUTION FOR HARD PROBLEMS? •MAINTENANCE OF CONTROL STUCTURE SPANNING TREE CONSTRUCTION • ARE THERE PROBLEMS NOT ADMITTING AN EFFICIENT PARALLEL SOLUTION, TOPOLOGY UPDATE LEADER ELECTION THAT IS INHERENTLY SEQUENTIAL PROBLEMS? •RESOURCE CONTROL ACTIVITIES LOAD BALANCING MANAGING GLOBAL DIRECTORIES Slide 5 Slide 6 1 We need a model of computation • NETWORK (VLSI) MODEL HYPERCUBE • The processors are connected by a network of bounded degree. • No shared memory is available. 0110 0111 1110 1111 • Several interconnection topologies. 0100 0101 1100 diameter = 4 degree = 4 (log2N) 1101 0010 • Synchronous way of operating. 0011 1010 1011 0000 0001 MESH CONNECTED ARRAY 1000 1001 N = 24 PROCESSORS degree = 4 (N) diameter = 2N Slide 7 Slide 8 Other important topologies • binary trees • mesh of trees Model Equivalence • cube connected cycles In the network model a PARALLEL MACHINEis a very complex • given two models M and M , and a problem P ensemble of small interconnected units, performing elementary 1 2 operations. of size n - Each processor has its own memory. - Processors work synchronously. LIMITS OF THE MODEL • if M1 and M2 are equivalent then solving P • different topologies require different algorithms to solve the same requires: problem • it is difficult to describe and analyse algorithms (the migration of – T(n) time and P(n) processors on M1 data have to be described) O(1) O(1) – T(n) time and P(n) processors on M2 A shared-memory model is more suitable by an algorithmic point of view Slide 9 Slide 10 PRAM MODEL 1 PRAM 2 P1 3 P2 . Common Memory • Parallel Random Access Machine . ? . P • Shared-memory multiprocessor i . • unlimited number of processors, each Pn – has unlimited local memory m – knows its ID PRAM n RAM processors connected to a common memory of m cells ASSUMPTION: at each time unit each Pi can read a memory cell, make an internal – able to access the shared memory computation and write another memory cell. CONSEQUENCE: any pair of processor P P can communicate in constant time! • unlimited shared memory i j Pi writes the message in cell x at timet Pi reads the message in cell x at timet+1 Slide 11 Slide 12 2 PRAM PRAM Instruction Set • Inputs/Outputs are placed in the shared • accumulator architecture memory (designated address) – memory cell R0 accumulates results • Memory cell stores an arbitrarily large integer • multiply/divide instructions take only • Each instruction takes unit time constant operands • Instructions are synchronized across the – prevents generating exponentially large processors numbers in polynomial time Slide 13 Slide 14 PRAM Complexity Measures Two Technical Issues for PRAM • for each individual processor – time: number of instructions executed • How processors are activated – space: number of memory cells accessed • How shared memory is accessed • PRAM machine – time: time taken by the longest running processor – hardware: maximum number of active processors Slide 15 Slide 16 THE PRAM IS A THEORETICAL (UNFEASIBLE) MODEL Processor Activation • The interconnection network between processors and memory would require a very large amount of area . • The message-routing on the interconnection network would require time • P0 places the number of processors (p) in the proportional to network size (i. e. the assumption of a constant access time designated shared-memory cell to the memory is not realistic). – each active Pi, where i < p, starts executing WHY THE PRAM IS A REFERENCE MODEL? – O(1) time to activate • Algorithm’s designers can forget the communication problems and focus their attention on the parallel computation only. – all processors halt when P0 halts •There exist algorithms simulating any PRAM algorithm on bounded degree networks. • Active processors explicitly activate additional E. G. A PRAM algorithm requiring time T(n), can be simulated in a mesh of tree processors via FORK instructions in time T(n)log2n/loglogn, that is each step can be simulated with a slow-down of log2n/loglogn. – tree-like activation • Instead of design ad hoc algorithms for bounded degree networks, design more – O(log p) time to activate general algorithms for the PRAM model and simulate them on a feasible network. Slide 17 Slide 18 3 • For the PRAM model there exists a well developed body of techniques and methods to handle different classes of computational problems. Metrics • The discussion on parallel model of computation is still HOT The actual trend: A measure of relative performance between a multiprocessor COARSE-GRAINED MODELS system and a single processor system is the speed-up S( p), defined as follows: • The degree of parallelism allowed is independent from the number Execution time using a single processor system of processors. S( p) = Execution time using a multiprocessor with pprocessors • The computation is divided in supersteps, each one includes T1 Sp • local computation S( p) = Efficiency = • communication phase Tp p • syncronization phase Cost = p ´ Tp the study is still at the beginning! Slide 19 Slide 20 Metrics • Parallel algorithm is cost-optimal: Amdahl’s Law parallel cost = sequential time • f = fraction of the problem that’s C = T p 1 inherently sequential Ep = 100% (1 – f) = fraction that’s parallel • Critical when down-scaling: • Parallel time Tp: T = f + (1- f ) p parallel implementation may p become slower than sequential 1 3 S = T1 = n • Speedup with p processors: p 1- f 2.5 2 f + Tp = n when p = n p 4.5 Cp = n Slide 21 Slide 22 Amdahl’s Law PRAM • Too many interconnections gives problems with synchronization • Upper bound on speedup (p = ¥) • However it is the best conceptual model for designing efficient 1 1 parallel algorithms S p = Converges to 0 1- f S¥ = – due to simplicity and possibility of simulating efficiently PRAM f + f algorithms on more realistic parallel architectures p • Example: f = 2% S = 1 / 0.02 = 50 Slide 23 Slide 24 4 Shared-Memory Access Example CRCW-PRAM Concurrent (C) means, many processors can do the operation simultaneously in • Initially the same memory Exclusive (E) not concurent – table A contains values 0 and 1 – output contains value 0 • EREW (Exclusive Read Exclusive Write) • CREW (Concurrent Read Exclusive Write) – Many processors can read simultaneously the same location, but only one can attempt to write to a given location • ERCW • CRCW – Many processors can write/read at/from the same memory location • The program computes the “Boolean OR” of A[1], A[2], A[3], A[4], A[5] Slide 25 Slide 26 Example CREW-PRAM Pascal triangle • Assume initially table A contains [0,0,0,0,0,1] and we have the parallel program PRAM CREW Slide 27 Slide 28 5.