Flynn's Taxonomy
- SISD (Single Instruction, Single Data): typical desktop PC
- SIMD (Single Instruction, Multiple Data): multiple processors connected to a single control unit; not flexible, since all processors run in lockstep
- MISD (Multiple Instruction, Single Data): no real implementation
- MIMD (Multiple Instruction, Multiple Data): typical multi-processor machine

Parallel Execution Time
For a fixed problem, performance is limited by the overhead involved in execution. Execution time can be expressed as:
  T = Tcomp + Tpar + Tinteract
- Tcomp: computation required by the program
- Tpar: overhead imposed by executing in parallel
- Tinteract: overhead due to communication and synchronization

Speedup and Efficiency
- Speedup: S_n = T_1 / T_n
- Efficiency: E_n = S_n / n <= 1

PRAM Model
- Synchronous at the instruction level; single address space; exclusive read/write (EREW)
- Ignores communication overhead

BSP Model
- n processor/memory pairs
- A program is a sequence of supersteps:
  - Computation: at most w clock cycles
  - Communication: g*h cycles (h = message size; g = platform dependent)
  - Barrier synchronization: wait for all processes to finish the superstep; l cycles
- Time for one superstep: w + g*h + l

Amdahl's Law for a Fixed Workload
- The "job" has a fixed workload: sequential portion W_1; the parallel portion W_n uses all available processors
- a = fraction of the code that must execute sequentially; n = number of processors
- S_n = 1 / (a + (1 - a)/n)

Gustafson's Law for a Scaled Workload, Fixed Time
- As the number of processors is increased, expand the workload so that the runtime stays the same
- a = fraction of the code that runs in parallel; n = number of processors
- Scaled workload: W' = (1 - a)*W + a*n*W
- S_n = (1 - a) + a*n

Sun and Ni's Law: Memory-Bounded Speedup
- Let the workload scale up to fill the available memory
- Assume W = g(M); in general, let the scaled parallel workload be G(n)*W_n
- S = (W_1 + G(n)*W_n) / (W_1 + G(n)*W_n / n)
- If G(n) = 1: Amdahl's fixed-workload model
- If G(n) = n: Gustafson's fixed-time model
- If G(n) > n: the workload increases faster than the memory is scaled

Single Processor Systems
- CPI = Cycles Per Instruction; ILP = Instruction-Level Parallelism
- T = IC * CPI * Cycle Time
- Pipelining: overlap execution of several instructions; stages: Fetch, Decode, Execute, Writeback
- Structural hazards: resource conflicts
- Data hazards: an instruction needs the result of a previous, unfinished instruction
  - RAW: read after write (most common)
  - WAW: write after write (only present if writes are done in more than one pipeline stage)
  - WAR: write after read (only present if writes are early and reads are late in the pipeline)
  - Prevented by using interlocks
  - Out-of-order execution: increase the distance between dependent instructions so that the first instruction has finished before the second one needs its data
- Control hazards: branches, interrupts, exceptions; a branch is detected in EX and its destination is known in MEM
- Example: k = 5 pipeline stages, N instructions, 30% of instructions are branches and cause a 3-stage delay:
  T(k) = k + (N - 1) + 0.3*N*3 = 1.9*N + k - 1 = 1.9*N + 4
  Without pipelining: T(1) = k*N = 5*N
  Speedup = 5*N / (1.9*N + 4) ~ 5/1.9 = 2.63 (the effective CPI is 1.9)

Programming Issues
Dependence Graphs
- Flow dependence: S1 -> S2 (S2 reads a value S1 writes)
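The two speedup laws above can be checked numerically. A minimal sketch (the function names are mine, not from the notes), keeping the conventions used here: for Amdahl, a is the sequential fraction; for Gustafson, a is the parallel fraction.

```python
def amdahl_speedup(a, n):
    """Fixed workload: a = sequential fraction, n = number of processors."""
    return 1.0 / (a + (1.0 - a) / n)

def gustafson_speedup(a, n):
    """Scaled workload: a = parallel fraction, n = number of processors."""
    return (1.0 - a) + a * n
```

For example, a 10% sequential fraction on 16 processors gives a fixed-workload speedup of 6.4, while scaling the workload (90% parallel) gives 14.5.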

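The branch-delay example works out as claimed; a small sketch (function names are mine) with k = 5 stages, a 30% branch frequency, and a 3-cycle penalty:

```python
def pipelined_time(n, k=5, branch_frac=0.3, penalty=3):
    """Cycles with pipelining: fill time + one per instruction + branch stalls."""
    return k + (n - 1) + branch_frac * n * penalty

def unpipelined_time(n, k=5):
    """Cycles without pipelining: k cycles per instruction."""
    return k * n

def speedup(n):
    return unpipelined_time(n) / pipelined_time(n)
```

As N grows, the speedup approaches 5/1.9, roughly 2.63.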
- Antidependence: S1 -> S2 (S2 overwrites a value S1 reads)
- Output dependence: S1 -> S2 (S1 and S2 write the same location)

Bernstein's Conditions
Two statements S1 and S2 can execute in parallel if:
- The input of S2 is disjoint from the output of S1 (no flow dependence)
- The input of S1 is disjoint from the output of S2 (no antidependence)
- The outputs of S1 and S2 are disjoint (no output dependence)

Shared-Memory Primitives
Name          Syntax                         Function
CREATE        CREATE(p, proc, args)          Create p processes that start execution at procedure proc with arguments args
G_MALLOC      G_MALLOC(size)                 Allocate shared data of size bytes
LOCK          LOCK(name)                     Acquire mutually exclusive access
UNLOCK        UNLOCK(name)                   Release mutually exclusive access
BARRIER       BARRIER(name, number)          Global synchronization among number processes: none gets past the BARRIER until number have arrived
WAIT_FOR_END  WAIT_FOR_END(number)           Wait for number processes to terminate
wait for flag while (!flag); or WAIT(flag)   Wait for flag to be set (spin or block); used for point-to-point event synchronization
set flag      flag = 1; or SIGNAL(flag)      Set flag; wakes up a process that is spinning or blocked on flag, if any

Programming for Performance
- Cache performance: a cache miss is implicit communication
- Communication costs: algorithm, partitioning (decomposition), message sizes

Modeling Available Parallelism
Assume that for a given application, F(i, p) gives the fraction of time that exactly i processors are usable, given that a total of p processors are available. Note that the sum over i = 1..p of F(i, p) = 1. With a_i = F(i, p):
a. Amdahl's Law:
   T(1) = a_1 + a_2 + ... + a_p = 1
   T(p) = a_1 + a_2/2 + a_3/3 + ... + a_p/p = sum over i = 1..p of a_i/i
   S(p) = 1 / (sum over i = 1..p of F(i, p)/i)
b. Gustafson's Law:
   T(1) = a_1 + 2*a_2 + ... + p*a_p  (time on a sequential processor)
   T(p) = a_1 + a_2 + ... + a_p = 1
   S(p) = T(1) / T(p) = sum over i = 1..p of i*F(i, p)

Message-Passing Primitives
Name          Syntax                                Function
CREATE        CREATE(procedure)                     Create a process that starts execution at procedure
SEND          SEND(src_addr, size, dest, tag)       Send size bytes starting at src_addr to the dest process, with identifier tag
RECEIVE       RECEIVE(buffer_addr, size, src, tag)  Receive a message with identifier tag from the src process, and put size bytes of it into the buffer starting at buffer_addr
SEND_PROBE    SEND_PROBE(tag, dest)                 Check whether a message with identifier tag has been sent to process dest (only for asynchronous message passing; depends on semantics)
RECV_PROBE    RECV_PROBE(tag, src)                  Check whether a message with identifier tag has been received from process src (only for asynchronous message passing)
BARRIER       BARRIER(name, number)                 See above
WAIT_FOR_END  WAIT_FOR_END(number)                  See above

Performance Measures
- Speedup relates uniprocessor execution time to parallel execution time (best time for each)

Shared Memory Systems
- Memory operation: a read/write to a specific memory location
- Issue: a memory operation issues when it leaves the processor's internal state and is presented to the memory system
- Perform: the operation appears to have taken place as far as a processor can tell
  - A write performs when a subsequent read returns its value
  - A read performs when a subsequent write can no longer affect the value it returns
- Complete: the operation has performed for all processors

Coherence
A memory system is coherent if the results of any execution of a program are such that, for each location, we can construct a hypothetical serial order of operations consistent with the results of the execution. Coherence applies with respect to a single memory location.
- Operations issued by a particular process occur in the order issued (preserve program ordering for a processor)
- The value returned by a read is the value last written to that location
- Write propagation: a written value must be made visible to the other processors
- Write serialization: writes to a location are seen in the same order by all processors

Caches
- Write-through cache: when the processor updates a cache value, it also updates main memory; memory is kept consistent with the cache
- Write-back cache: writes are not written back to memory until the cache line is flushed; memory is not always consistent with the cache

Interconnect Network Metrics
- Contention delay: caused by other traffic in the network
- Software overhead: needed to send and receive a message
- Node degree: number of links connected to a node (in-degree, out-degree, and total degree)
- Network diameter: maximum path length between any two nodes; path length is the number of links traversed when travelling from source to destination
- Bisection bandwidth: maximum number of bits/bytes crossing all wires on a bisection plane dividing the network in half

Virtual Cut-Through / Wormhole Routing
- Hybrid between circuit and packet switching
- A data message can span several switches and interconnect links
- When the front of a message arrives at a switch, it is immediately forwarded over the outbound link
- If the outbound link is not available, the message can be queued in the switch or allowed to remain in the other switches
- Performance is similar to circuit switching for light loads and to packet switching when congested

Low-Dimension Tori vs. High-Dimension Hypercubes
- Low-dimension torus: lower degree; higher diameter; fewer links, so higher bandwidth per link; generally will perform better; hot spots hurt less, due to the higher per-link bandwidth
- High-dimension hypercube: low diameter; high degree; more links, with less bandwidth per link
- Routing note: with store-and-forward routing, latency is proportional to the number of hops a message takes

Synchronization
Load-Linked/Store-Conditional
- The pair is non-atomic; it returns a condition code that indicates whether the instructions executed atomically
- Slow with lots of contention; more bus traffic due to the load-linked instruction
- Load-linked: load the value from memory; copy the address into a special register; if a cache invalidation occurs, clear the special register
- Store-conditional: only store if the address matches the special register value; returns a condition code to indicate whether the store succeeded

Memory Fences
- A fixed point in the computation that ensures no read or write is moved across the fence
- acquire()/release() operate as memory fences
- Write fence: all writes by processor p complete before the write fence executes on p; no writes after the fence start before the fence operation
- Read fence: similar to the write fence, but for reads
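The load-linked/store-conditional pattern above can be sketched as a toy, single-threaded simulation (the class and method names are my own illustration, not a real API): store-conditional fails if the link was cleared by an intervening write, and the retry loop is what makes the increment atomic.

```python
class LLSCWord:
    """Toy model of one memory word plus a link flag."""
    def __init__(self, value=0):
        self.value = value
        self.linked = False

    def load_linked(self):
        self.linked = True            # start watching this location
        return self.value

    def plain_store(self, value):
        self.value = value
        self.linked = False           # any conflicting write clears the link

    def store_conditional(self, value):
        if not self.linked:
            return False              # condition code: store failed
        self.value = value
        self.linked = False
        return True                   # the LL/SC pair executed atomically

def atomic_increment(word, interfere=None):
    """Retry the LL/SC pair until the store succeeds."""
    while True:
        old = word.load_linked()
        if interfere:                 # simulate a write by another processor
            interfere(word)
            interfere = None
        if word.store_conditional(old + 1):
            return word.value
```

With a conflicting store injected between the load-linked and the store-conditional, the first attempt fails and the loop retries against the new value, which is exactly the behavior the interlock relies on.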

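The torus-versus-hypercube comparison above can be quantified with the standard degree/diameter/link counts (function names are mine): a k-ary 2-D torus has constant degree 4 and diameter 2*floor(k/2), while a hypercube on 2^d nodes has degree and diameter d.

```python
def torus_metrics(k, dims=2):
    """k nodes per dimension, wraparound links in each dimension."""
    degree = 2 * dims
    diameter = dims * (k // 2)
    links = dims * k ** dims          # nodes * degree / 2
    return degree, diameter, links

def hypercube_metrics(d):
    """2**d nodes, one link per dimension per node."""
    nodes = 2 ** d
    return d, d, d * nodes // 2       # degree, diameter, total links
```

For 64 nodes, the 8x8 torus has 128 links against the hypercube's 192, which is the "fewer links, higher bandwidth per link" trade-off described above.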
Sequential Consistency (in terms of fences)
- All reads are read fences and all writes are write fences

Relaxed Models
Notation: R = read, W = write, S = sync, a = acquire, r = release
- Processor Consistency (Total Store Ordering):
  - Relax the W->R constraint: allow a read to complete before a previous write has completed
  - The programmer can insert synchronization to preserve order if needed
- Partial Store Ordering:
  - Relax W->R and W->W: allow non-conflicting writes to complete out of order; pipeline or overlap writes
- Weak Ordering:
  - Requires only that a read or write completes before the next sync, and that a sync completes before the next read or write
- Release Consistency:
  - Most common ordering; same as weak ordering, but only enforces: Sa->W, Sa->R, R->Sr, W->Sr, S->S

Routing
- Store-and-forward routing: latency is proportional to the number of hops a message takes
  L = packet length; W = channel bandwidth; D = distance (# of hops)
  T_SF = L*D / W
- Wormhole routing: pipeline the transmission of a packet by dividing the packet into flits
  L = packet length (bits); F = flit length (bits); W = channel bandwidth; D = distance (# of hops)
  Number of flits = L/F
  T_WH = (D + L/F - 1) * F/W
  Favors lower-dimension networks: fewer connections, higher bandwidth per link; constant-degree network
- Deadlock requires: mutual exclusion, hold & wait, no preemption, circular wait
  Prevent circular wait by restricting routing; example: Dimension-Order Routing (DOR)
- Virtual channels: divide physical links into multiple logical links; multiplex the virtual channels over one physical connection

Vector & SIMD Systems
Memory Organization
- Want to be able to pipeline memory accesses
- Use multiple memory banks; each has its own controller and address bus
- Different from using interleaved memory banks: non-contiguous accesses can be pipelined more easily
- Goal: have successive memory accesses map to separate banks; allows for better pipelining
- How many memory banks? Depends on the latency of an access and the depth of the pipeline; real-world vector systems have as many as 1024 banks
Chaining
- Combine multiple vector operations: the output of one functional unit is fed into the input of another
- Depends on the number of functional units
- Early systems had only one load/store pipe, which becomes a bottleneck; later systems had at least 3 load/store pipes

Memory Consistency Models
- When does a write actually become visible?
- How do we establish order between reads & writes?
- Coherence applies with respect to a single memory location; it does not address ordering across multiple locations
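A classic way to see what ordering questions are at stake is the store-buffering litmus test: each of two processors writes its own flag and then reads the other's. The sketch below (the encoding is my own) enumerates every interleaving allowed under sequential consistency and shows that both reads returning 0 is impossible; a model with the W->R relaxation described above would permit that outcome.

```python
from itertools import permutations

# Program order per processor: write own flag, then read the other's.
P0 = [("w", "x"), ("r", "y")]
P1 = [("w", "y"), ("r", "x")]

def sc_outcomes():
    """Run every interleaving that preserves each processor's program order."""
    outcomes = set()
    ops = [(0, op) for op in P0] + [(1, op) for op in P1]
    for perm in permutations(ops):
        # keep only merges that preserve per-processor program order
        if [op for p, op in perm if p == 0] != P0:
            continue
        if [op for p, op in perm if p == 1] != P1:
            continue
        mem = {"x": 0, "y": 0}
        reads = {}
        for p, (kind, var) in perm:
            if kind == "w":
                mem[var] = 1
            else:
                reads[p] = mem[var]
        outcomes.add((reads[0], reads[1]))
    return outcomes
```

Under sequential consistency the possible results are (0, 1), (1, 0), and (1, 1); (0, 0) would require each read to precede the other processor's write, contradicting both program orders.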

- Memory consistency models specify constraints on the order in which memory operations can appear to execute
  - What orders are preserved?
  - For a read, what possible values can be returned?

Sequential Consistency
- The most restrictive model
- A total order is achieved by interleaving the accesses of different processors in some fashion, while maintaining program order with respect to each processor
- Two requirements:
  - Program order: memory operations issued by a process must appear to become visible in program order
  - Atomicity: memory operations are atomic
- Sufficient conditions for sequential consistency (not always necessary):
  - Every process issues memory operations in program order
  - After a write is issued, the processor waits for the write to complete before issuing the next operation
  - After a read is issued, the processor waits for the read to complete, and for the write whose value is returned by the read to complete, before issuing its next operation

Multithreading Systems
Efficiency, single-threaded
- R = number of cycles between misses; L = latency of a miss
- Efficiency = Busy / (Busy + Idle) = R / (R + L)
- As L increases, efficiency decreases
Efficiency, multithreaded processors
- R = number of cycles between misses; L = latency of a miss; C = time needed for a context switch
- Saturation region: E_sat = R / (R + C), independent of L
- Saturation is reached when N >= L/(R + C) + 1; this threshold N_sat is the number of threads needed for 100% of the saturation efficiency
- Linear region (not enough threads to keep the processor busy): E_lin = N*R / (R + C + L); efficiency increases with the number of threads, and decreases as L increases
Interleaved execution
- Interleave instructions from different threads into the pipeline
- T_average = p*t_hit + (1 - p)*t_miss: the average access time is the cache-hit time weighted by the hit rate plus the miss time weighted by the miss rate

Message Passing Systems
Interconnect Networks
- Link (channel, cable): physical connection between two units; either a serial or a parallel connection
- Switch: connects inputs to outputs; each input has a receiver + buffer; each output has a transmitter
- Network Interface Circuit (NIC): used to connect a host to the network
- Bandwidth (BW): data transfer rate
- Latency: total time needed to transfer a message from source to destination
- Channel delay: message length / BW
- Routing delay: how long it takes to make decisions about the path through the network

VLIW Systems
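The multithreaded-efficiency formulas above can be collected into a small calculator (function names are mine): efficiency grows linearly with the thread count N until N reaches L/(R + C) + 1, after which it saturates at R/(R + C).

```python
def single_thread_efficiency(R, L):
    """R cycles of useful work per miss, each miss stalls for L cycles."""
    return R / (R + L)

def n_saturation(R, L, C):
    """Threads needed to fully hide the miss latency L."""
    return L / (R + C) + 1

def multithread_efficiency(N, R, L, C):
    if N >= n_saturation(R, L, C):
        return R / (R + C)            # saturation region: independent of L
    return N * R / (R + C + L)        # linear region: not enough threads
```

With R = 40, L = 200, C = 10, five threads saturate the processor at 80% efficiency, the same value the linear formula gives at N = 5, so the two regions meet at N_sat.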