Flynn's Taxonomy
- SISD (Single Instruction, Single Data): typical desktop PC
- SIMD (Single Instruction, Multiple Data): multiple processors connected to a single control unit; not flexible, since all processors run in lockstep
- MISD (Multiple Instruction, Single Data): no real implementation
- MIMD (Multiple Instruction, Multiple Data): typical multi-processor machine

Parallel Execution Time
For a fixed problem, performance is limited by the overhead involved in execution. Execution time can be expressed as:
  T = Tcomp + Tpar + Tinteract
- Tcomp: computation required by the program
- Tpar: overhead imposed by executing in parallel
- Tinteract: overhead due to communication and synchronization

Speedup and Efficiency
- Speedup: S_n = T_1 / T_n
- Efficiency: E_n = S_n / n <= 1

PRAM Model
- Synchronous at the instruction level; single address space; exclusive read/write (EREW)
- Ignores communication overhead

BSP Model
- n processor/memory pairs
- A program is a sequence of supersteps:
  - Computation: at most w clock cycles
  - Communication: g*h cycles (h = message size; g = platform dependent)
  - Barrier synchronization: wait for all processes to finish the superstep; l cycles
- Time for one superstep: w + g*h + l

Amdahl's Law for a Fixed Workload
- The "job" has a fixed workload: sequential portion W_1; the parallel portion W_n uses all available processors
- a = fraction of the code that must execute sequentially; n = number of processors
- S_n = 1 / (a + (1 - a)/n)

Gustafson's Law for a Scaled Workload, Fixed Time
- As the number of processors is increased, expand the workload so that the runtime stays the same
- a = fraction of the code that runs in parallel; n = number of processors
- Scaled workload: W' = (1 - a)*W + a*n*W
- S_n = (1 - a) + a*n

Sun and Ni's Law: Memory-Bounded Speedup
- Let the workload scale up to fill the available memory
- Assume W = g(M); in general, let the scaled parallel workload be G(n)*W_n
- S = (W_1 + G(n)*W_n) / (W_1 + G(n)*W_n / n)
- If G(n) = 1: Amdahl's fixed-workload model
- If G(n) = n: Gustafson's fixed-time model
- If G(n) > n: the workload increases faster than the memory is scaled

Single Processor Systems
- CPI = Cycles Per Instruction; ILP = Instruction-Level Parallelism
- T = IC * CPI * Cycle Time
- Pipelining: overlap execution of several instructions; stages: Fetch, Decode, Execute, Writeback
- Structural hazards: resource conflicts
- Data hazards: an instruction needs the result of a previous, unfinished instruction
  - RAW: read after write (most common)
  - WAW: write after write (only present if writes are done in more than one pipeline stage)
  - WAR: write after read (only present if writes are early and reads are late in the pipeline)
  - Prevented by using interlocks
  - Out-of-order execution: increase the distance between dependent instructions so that the first instruction has finished before the second one needs its data
- Control hazards: branches, interrupts, exceptions; a branch is detected in EX and its destination is known in MEM
- Example: k = 5 pipeline stages, N instructions, 30% of instructions are branches and cause a 3-stage delay:
  T(k) = k + (N - 1) + 0.3*N*3 = 1.9*N + k - 1 = 1.9*N + 4
  Without pipelining: T(1) = k*N = 5*N
  Speedup = 5*N / (1.9*N + 4) ~ 5/1.9 = 2.63 (the effective CPI is 1.9)

Programming Issues
Dependence Graphs
- Flow dependence: S1 -> S2 (S2 reads a value S1 writes)
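The two speedup laws above can be checked numerically. A minimal sketch (the function names are mine, not from the notes), keeping the conventions used here: for Amdahl, a is the sequential fraction; for Gustafson, a is the parallel fraction.

```python
def amdahl_speedup(a, n):
    """Fixed workload: a = sequential fraction, n = number of processors."""
    return 1.0 / (a + (1.0 - a) / n)

def gustafson_speedup(a, n):
    """Scaled workload: a = parallel fraction, n = number of processors."""
    return (1.0 - a) + a * n
```

For example, a 10% sequential fraction on 16 processors gives a fixed-workload speedup of 6.4, while scaling the workload (90% parallel) gives 14.5.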

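The branch-delay example works out as claimed; a small sketch (function names are mine) with k = 5 stages, a 30% branch frequency, and a 3-cycle penalty:

```python
def pipelined_time(n, k=5, branch_frac=0.3, penalty=3):
    """Cycles with pipelining: fill time + one per instruction + branch stalls."""
    return k + (n - 1) + branch_frac * n * penalty

def unpipelined_time(n, k=5):
    """Cycles without pipelining: k cycles per instruction."""
    return k * n

def speedup(n):
    return unpipelined_time(n) / pipelined_time(n)
```

As N grows, the speedup approaches 5/1.9, roughly 2.63.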
- Antidependence: S1 -> S2 (S2 overwrites a value S1 reads)
- Output dependence: S1 -> S2 (S1 and S2 write the same location)

Bernstein's Conditions
Two statements S1 and S2 can execute in parallel if:
- The input of S2 is disjoint from the output of S1 (no flow dependence)
- The input of S1 is disjoint from the output of S2 (no antidependence)
- The outputs of S1 and S2 are disjoint (no output dependence)

Shared-Memory Primitives
Name          Syntax                         Function
CREATE        CREATE(p, proc, args)          Create p processes that start execution at procedure proc with arguments args
G_MALLOC      G_MALLOC(size)                 Allocate shared data of size bytes
LOCK          LOCK(name)                     Acquire mutually exclusive access
UNLOCK        UNLOCK(name)                   Release mutually exclusive access
BARRIER       BARRIER(name, number)          Global synchronization among number processes: none gets past the BARRIER until number have arrived
WAIT_FOR_END  WAIT_FOR_END(number)           Wait for number processes to terminate
wait for flag while (!flag); or WAIT(flag)   Wait for flag to be set (spin or block); used for point-to-point event synchronization
set flag      flag = 1; or SIGNAL(flag)      Set flag; wakes up a process that is spinning or blocked on flag, if any

Programming for Performance
- Cache performance: a cache miss is implicit communication
- Communication costs: algorithm, partitioning (decomposition), message sizes

Modeling Available Parallelism
Assume that for a given application, F(i, p) gives the fraction of time that exactly i processors are usable, given that a total of p processors are available. Note that the sum over i = 1..p of F(i, p) = 1. With a_i = F(i, p):
a. Amdahl's Law:
   T(1) = a_1 + a_2 + ... + a_p = 1
   T(p) = a_1 + a_2/2 + a_3/3 + ... + a_p/p = sum over i = 1..p of a_i/i
   S(p) = 1 / (sum over i = 1..p of F(i, p)/i)
b. Gustafson's Law:
   T(1) = a_1 + 2*a_2 + ... + p*a_p  (time on a sequential processor)
   T(p) = a_1 + a_2 + ... + a_p = 1
   S(p) = T(1) / T(p) = sum over i = 1..p of i*F(i, p)

Message-Passing Primitives
Name          Syntax                                Function
CREATE        CREATE(procedure)                     Create a process that starts execution at procedure
SEND          SEND(src_addr, size, dest, tag)       Send size bytes starting at src_addr to the dest process, with identifier tag
RECEIVE       RECEIVE(buffer_addr, size, src, tag)  Receive a message with identifier tag from the src process, and put size bytes of it into the buffer starting at buffer_addr
SEND_PROBE    SEND_PROBE(tag, dest)                 Check whether a message with identifier tag has been sent to process dest (only for asynchronous message passing; depends on semantics)
RECV_PROBE    RECV_PROBE(tag, src)                  Check whether a message with identifier tag has been received from process src (only for asynchronous message passing)
BARRIER       BARRIER(name, number)                 See above
WAIT_FOR_END  WAIT_FOR_END(number)                  See above

Performance Measures
- Speedup relates uniprocessor execution time to parallel execution time (best time for each)

Shared Memory Systems
- Memory operation: a read/write to a specific memory location
- Issue: a memory operation issues when it leaves the processor's internal state and is presented to the memory system
- Perform: the operation appears to have taken place as far as a processor can tell
  - A write performs when a subsequent read returns its value
  - A read performs when a subsequent write can no longer affect the value it returns
- Complete: the operation has performed for all processors

Coherence
A memory system is coherent if the results of any execution of a program are such that, for each location, we can construct a hypothetical serial order of operations consistent with the results of the execution. Coherence applies with respect to a single memory location.
- Operations issued by a particular process occur in the order issued (preserve program ordering for a processor)
- The value returned by a read is the value last written to that location
- Write propagation: a written value must be made visible to the other processors
- Write serialization: writes to a location are seen in the same order by all processors

Caches
- Write-through cache: when the processor updates a cache value, it also updates main memory; memory is kept consistent with the cache
- Write-back cache: writes are not written back to memory until the cache line is flushed; memory is not always consistent with the cache

Interconnect Network Metrics
- Contention delay: caused by other traffic in the network
- Software overhead: needed to send and receive a message
- Node degree: number of links connected to a node (in-degree, out-degree, and total degree)
- Network diameter: maximum path length between any two nodes; path length is the number of links traversed when travelling from source to destination
- Bisection bandwidth: maximum number of bits/bytes crossing all wires on a bisection plane dividing the network in half

Virtual Cut-Through / Wormhole Routing
- Hybrid between circuit and packet switching
- A data message can span several switches and interconnect links
- When the front of a message arrives at a switch, it is immediately forwarded over the outbound link
- If the outbound link is not available, the message can be queued in the switch or allowed to remain in the other switches
- Performance is similar to circuit switching for light loads and to packet switching when congested

Low-Dimension Tori vs. High-Dimension Hypercubes
- Low-dimension torus: lower degree; higher diameter; fewer links, so higher bandwidth per link; generally will perform better; hot spots hurt less, due to the higher per-link bandwidth
- High-dimension hypercube: low diameter; high degree; more links, with less bandwidth per link
- Routing note: with store-and-forward routing, latency is proportional to the number of hops a message takes

Synchronization
Load-Linked/Store-Conditional
- The pair is non-atomic; it returns a condition code that indicates whether the instructions executed atomically
- Slow with lots of contention; more bus traffic due to the load-linked instruction
- Load-linked: load the value from memory; copy the address into a special register; if a cache invalidation occurs, clear the special register
- Store-conditional: only store if the address matches the special register value; returns a condition code to indicate whether the store succeeded

Memory Fences
- A fixed point in the computation that ensures no read or write is moved across the fence
- acquire()/release() operate as memory fences
- Write fence: all writes by processor p complete before the write fence executes on p; no writes after the fence start before the fence operation
- Read fence: similar to the write fence, but for reads
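The load-linked/store-conditional pattern above can be sketched as a toy, single-threaded simulation (the class and method names are my own illustration, not a real API): store-conditional fails if the link was cleared by an intervening write, and the retry loop is what makes the increment atomic.

```python
class LLSCWord:
    """Toy model of one memory word plus a link flag."""
    def __init__(self, value=0):
        self.value = value
        self.linked = False

    def load_linked(self):
        self.linked = True            # start watching this location
        return self.value

    def plain_store(self, value):
        self.value = value
        self.linked = False           # any conflicting write clears the link

    def store_conditional(self, value):
        if not self.linked:
            return False              # condition code: store failed
        self.value = value
        self.linked = False
        return True                   # the LL/SC pair executed atomically

def atomic_increment(word, interfere=None):
    """Retry the LL/SC pair until the store succeeds."""
    while True:
        old = word.load_linked()
        if interfere:                 # simulate a write by another processor
            interfere(word)
            interfere = None
        if word.store_conditional(old + 1):
            return word.value
```

With a conflicting store injected between the load-linked and the store-conditional, the first attempt fails and the loop retries against the new value, which is exactly the behavior the interlock relies on.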

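The torus-versus-hypercube comparison above can be quantified with the standard degree/diameter/link counts (function names are mine): a k-ary 2-D torus has constant degree 4 and diameter 2*floor(k/2), while a hypercube on 2^d nodes has degree and diameter d.

```python
def torus_metrics(k, dims=2):
    """k nodes per dimension, wraparound links in each dimension."""
    degree = 2 * dims
    diameter = dims * (k // 2)
    links = dims * k ** dims          # nodes * degree / 2
    return degree, diameter, links

def hypercube_metrics(d):
    """2**d nodes, one link per dimension per node."""
    nodes = 2 ** d
    return d, d, d * nodes // 2       # degree, diameter, total links
```

For 64 nodes, the 8x8 torus has 128 links against the hypercube's 192, which is the "fewer links, higher bandwidth per link" trade-off described above.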
Sequential Consistency (in terms of fences)
- All reads are read fences and all writes are write fences

Relaxed Models
Notation: R = read, W = write, S = sync, a = acquire, r = release
- Processor Consistency (Total Store Ordering):
  - Relax the W->R constraint: allow a read to complete before a previous write has completed
  - The programmer can insert synchronization to preserve order if needed
- Partial Store Ordering:
  - Relax W->R and W->W: allow non-conflicting writes to complete out of order; pipeline or overlap writes
- Weak Ordering:
  - Requires only that a read or write completes before the next sync, and that a sync completes before the next read or write
- Release Consistency:
  - Most common ordering; same as weak ordering, but only enforces: Sa->W, Sa->R, R->Sr, W->Sr, S->S

Routing
- Store-and-forward routing: latency is proportional to the number of hops a message takes
  L = packet length; W = channel bandwidth; D = distance (# of hops)
  T_SF = L*D / W
- Wormhole routing: pipeline the transmission of a packet by dividing the packet into flits
  L = packet length (bits); F = flit length (bits); W = channel bandwidth; D = distance (# of hops)
  Number of flits = L/F
  T_WH = (D + L/F - 1) * F/W
  Favors lower-dimension networks: fewer connections, higher bandwidth per link; constant-degree network
- Deadlock requires: mutual exclusion, hold & wait, no preemption, circular wait
  Prevent circular wait by restricting routing; example: Dimension-Order Routing (DOR)
- Virtual channels: divide physical links into multiple logical links; multiplex the virtual channels over one physical connection

Vector & SIMD Systems
Memory Organization
- Want to be able to pipeline memory accesses
- Use multiple memory banks; each has its own controller and address bus
- Different from using interleaved memory banks: non-contiguous accesses can be pipelined more easily
- Goal: have successive memory accesses map to separate banks; allows for better pipelining
- How many memory banks? Depends on the latency of an access and the depth of the pipeline; real-world vector systems have as many as 1024 banks
Chaining
- Combine multiple vector operations: the output of one functional unit is fed into the input of another
- Depends on the number of functional units
- Early systems had only one load/store pipe, which becomes a bottleneck; later systems had at least 3 load/store pipes

Memory Consistency Models
- When does a write actually become visible?
- How do we establish order between reads & writes?
- Coherence applies with respect to a single memory location; it does not address ordering across multiple locations
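A classic way to see what ordering questions are at stake is the store-buffering litmus test: each of two processors writes its own flag and then reads the other's. The sketch below (the encoding is my own) enumerates every interleaving allowed under sequential consistency and shows that both reads returning 0 is impossible; a model with the W->R relaxation described above would permit that outcome.

```python
from itertools import permutations

# Program order per processor: write own flag, then read the other's.
P0 = [("w", "x"), ("r", "y")]
P1 = [("w", "y"), ("r", "x")]

def sc_outcomes():
    """Run every interleaving that preserves each processor's program order."""
    outcomes = set()
    ops = [(0, op) for op in P0] + [(1, op) for op in P1]
    for perm in permutations(ops):
        # keep only merges that preserve per-processor program order
        if [op for p, op in perm if p == 0] != P0:
            continue
        if [op for p, op in perm if p == 1] != P1:
            continue
        mem = {"x": 0, "y": 0}
        reads = {}
        for p, (kind, var) in perm:
            if kind == "w":
                mem[var] = 1
            else:
                reads[p] = mem[var]
        outcomes.add((reads[0], reads[1]))
    return outcomes
```

Under sequential consistency the possible results are (0, 1), (1, 0), and (1, 1); (0, 0) would require each read to precede the other processor's write, contradicting both program orders.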

- Memory consistency models specify constraints on the order in which memory operations can appear to execute
  - What orders are preserved?
  - For a read, what possible values can be returned?

Sequential Consistency
- The most restrictive model
- A total order is achieved by interleaving the accesses of different processors in some fashion, while maintaining program order with respect to each processor
- Two requirements:
  - Program order: memory operations issued by a process must appear to become visible in program order
  - Atomicity: memory operations are atomic
- Sufficient conditions for sequential consistency (not always necessary):
  - Every process issues memory operations in program order
  - After a write is issued, the processor waits for the write to complete before issuing the next operation
  - After a read is issued, the processor waits for the read to complete, and for the write whose value is returned by the read to complete, before issuing its next operation

Multithreading Systems
Efficiency, single-threaded
- R = number of cycles between misses; L = latency of a miss
- Efficiency = Busy / (Busy + Idle) = R / (R + L)
- As L increases, efficiency decreases
Efficiency, multithreaded processors
- R = number of cycles between misses; L = latency of a miss; C = time needed for a context switch
- Saturation region: E_sat = R / (R + C), independent of L
- Saturation is reached when N >= L/(R + C) + 1; this threshold N_sat is the number of threads needed for 100% of the saturation efficiency
- Linear region (not enough threads to keep the processor busy): E_lin = N*R / (R + C + L); efficiency increases with the number of threads, and decreases as L increases
Interleaved execution
- Interleave instructions from different threads into the pipeline
- T_average = p*t_hit + (1 - p)*t_miss: the average access time is the cache-hit time weighted by the hit rate plus the miss time weighted by the miss rate

Message Passing Systems
Interconnect Networks
- Link (channel, cable): physical connection between two units; either a serial or a parallel connection
- Switch: connects inputs to outputs; each input has a receiver + buffer; each output has a transmitter
- Network Interface Circuit (NIC): used to connect a host to the network
- Bandwidth (BW): data transfer rate
- Latency: total time needed to transfer a message from source to destination
- Channel delay: message length / BW
- Routing delay: how long it takes to make decisions about the path through the network

VLIW Systems
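The multithreaded-efficiency formulas above can be collected into a small calculator (function names are mine): efficiency grows linearly with the thread count N until N reaches L/(R + C) + 1, after which it saturates at R/(R + C).

```python
def single_thread_efficiency(R, L):
    """R cycles of useful work per miss, each miss stalls for L cycles."""
    return R / (R + L)

def n_saturation(R, L, C):
    """Threads needed to fully hide the miss latency L."""
    return L / (R + C) + 1

def multithread_efficiency(N, R, L, C):
    if N >= n_saturation(R, L, C):
        return R / (R + C)            # saturation region: independent of L
    return N * R / (R + C + L)        # linear region: not enough threads
```

With R = 40, L = 200, C = 10, five threads saturate the processor at 80% efficiency, the same value the linear formula gives at N = 5, so the two regions meet at N_sat.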