Computer Science 246, Spring 2010, Harvard University

Instructor: Prof. David Brooks [email protected]

Multiprocessing

Multiprocessors Outline

• Multiprocessors
  – Computation taxonomy
  – Interconnection overview
  – Design criteria
• Cache coherence
• Very broad overview
• H&P: Chapter 6, mostly 6.1-6.9, 8.5
• Next time: More Multiprocessors + Multithreading

Parallel Processing

• Multiple processors working cooperatively on problems: not the same as multiprogramming
• Goals/Motivation
  – Performance: limits of uniprocessors
    • ILP (branch prediction, RAW dependencies, memory)
  – Cost: build big systems with commodity parts (uniprocessor cores)
  – Scalability: just add more cores to get more performance
  – Fault tolerance: if one processor fails, the others can continue processing

Parallel Processing

• Sounds great, but…
• As usual, software is the bottleneck
  – Difficult to parallelize applications
    • Compiler parallelization is hard (huge research efforts in 80s/90s)
    • By-hand parallelization is harder/expensive/error-prone
  – Difficult to make parallel applications run fast
    • Communication is expensive
    • Makes the first task even harder
• Inherent problems
  – Insufficient parallelism in some apps
  – Long-latency remote communication

Recall Amdahl's Law

Speedup_overall = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

• Example: Achieve a speedup of 80x using 100 processors

– 80 = 1/[Frac_parallel/100 + (1 - Frac_parallel)]
– Frac_parallel = 0.9975 => only 0.25% of the work can be serial!
• Assumes linear speedup
  – Superlinear speedup is possible in some cases
  – Increased memory + cache with increased processor count
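A minimal sketch of the slide's Amdahl's Law arithmetic in C (editor addition; the function name and the second example value are illustrative, and linear speedup of the parallel part is assumed as on the slide):

```c
#include <stdio.h>

/* Amdahl's law: overall speedup for a given parallel fraction and
 * processor count, assuming the parallel part speeds up linearly. */
static double amdahl_speedup(double frac_parallel, int processors) {
    return 1.0 / ((1.0 - frac_parallel) + frac_parallel / processors);
}

int main(void) {
    /* Slide example: 99.75% parallel work on 100 processors. */
    printf("speedup = %.1f\n", amdahl_speedup(0.9975, 100)); /* ~80x   */
    printf("speedup = %.1f\n", amdahl_speedup(0.90,   100)); /* ~9.2x  */
    return 0;
}
```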

Parallel Application Domains

• Traditional Parallel Programs
  – True parallelism in one job
  – Regular loop structures
  – Data usually tightly shared
  – Automatic parallelization
  – Called "data-level parallelism"
  – Can often exploit vectors as well
• Workloads
  – Scientific simulation codes (FFT, weather, fluid dynamics)
  – Dominant market segment in the 1980s

Parallel Program Example: Matrix Multiply

• Code:
    for i = 1 to n
      for j = 1 to n
        for k = 1 to n
          C(i,j) = C(i,j) + A(i,k) * B(k,j)
• Parameters
  – Size of matrix: N*N
  – P processors: N/P^(1/2) * N/P^(1/2) blocks
  – C_ij needs A_ik, B_kj for all k
• Growth Functions
  – Computation grows as f(N^3)
  – Computation per processor: f(N^3/P)
  – Data size: f(N^2)
  – Data size per processor: f(N^2/P)
  – Communication: f(N^2/P^(1/2))
  – Computation/Communication: f(N/P^(1/2))
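A C sketch of the inner work one processor would do under the blocked partition described above (editor addition; the function name, row/column bounds, and row-major layout are illustrative assumptions, not the course's reference code):

```c
#include <stddef.h>

/* Multiply the tile of C owned by one processor. Under the slide's
 * partition each of P processors owns an (N/sqrt(P)) x (N/sqrt(P)) tile,
 * giving computation f(N^3/P) and communication f(N^2/sqrt(P)). */
void matmul_block(size_t n, const double *A, const double *B, double *C,
                  size_t row_lo, size_t row_hi,   /* this processor's rows of C */
                  size_t col_lo, size_t col_hi) { /* this processor's cols of C */
    for (size_t i = row_lo; i < row_hi; i++)
        for (size_t j = col_lo; j < col_hi; j++) {
            double sum = C[i * n + j];
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j]; /* C(i,j) += A(i,k)*B(k,j) */
            C[i * n + j] = sum;
        }
}
```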

Application Domains

• Parallel independent but similar tasks
  – Irregular control structures
  – Loosely shared data locked at different granularities
  – Programmer defines and fine-tunes parallelism
  – Cannot exploit vectors
  – "Thread-level parallelism" (throughput is important)
• Workloads
  – Transaction processing, databases, web servers
  – Dominant MP market segment today

Database Application Example

• Bank database
• Parameters:
  – D = number of accounts
  – P = number of processors in server
  – N = number of ATMs (parallel transactions)
• Growth Functions
  – Computation: f(N)
  – Computation per processor: f(N/P)
  – Communication? Lock records while changing them
  – Communication: f(N)
  – Computation/communication: f(1)

Computation Taxonomy

• Proposed by Flynn (1966)
• Dimensions:
  – Instruction streams: single (SI) or multiple (MI)
  – Data streams: single (SD) or multiple (MD)
• Cross-product:
  – SISD: uniprocessors (chapters 3-4)
  – SIMD: multimedia/vector processors (MMX, HP-MAX)
  – MISD: no real examples
  – MIMD: multiprocessors + multicomputers (ch. 6)

SIMD vs. MIMD

• MP vs. SIMD
  – Programming model flexibility
    • Could simulate vectors with MIMD but not the other way
    • Dominant market segment cannot use vectors
  – Cost effectiveness
    • Commodity parts: high-volume (cheap) components
    • MPs can be made up of many uniprocessors
    • Allows easy scalability from small to large

Taxonomy of Parallel (MIMD) Processors

• Centers on organization of main memory
  – Shared vs. Distributed
• Appearance of memory to hardware
  – Q1: Is memory access latency uniform?
  – Shared (UMA): yes, doesn't matter where data goes
  – Distributed (NUMA): no, makes a big difference
• Appearance of memory to software
  – Q2: Can processors communicate directly via memory?
  – Shared (shared memory): yes, communicate via loads/stores
  – Distributed (message passing): no, communicate via messages
• Dimensions are orthogonal
  – e.g. DSM: (physically) distributed, (logically) shared memory

UMA vs. NUMA: Why it matters

[Figure: processors p0-p3 sharing one memory with uniformly low latency]
• Ideal model:
  – Perfect (single-cycle) memory latency
  – Perfect (infinite) memory bandwidth
• Real systems:
  – Latencies are long and grow with system size
  – Bandwidth is limited
• Add memory banks, interconnect to hook up (latency goes up)

UMA vs. NUMA: Why it matters

[Figure: UMA - p0-p3 reach m0-m3 through a shared interconnect (uniformly long latency); NUMA - each processor has a local memory (short latency) and reaches remote memories through the interconnect (long latency)]
• UMA:
  – From p0, same latency to m0-m3
  – Data placement doesn't matter
  – Latency worse as system scales
  – Interconnect contention restricts bandwidth
  – Small multiprocessors only
• NUMA: non-uniform memory access
  – From p0, faster to m0 than to m1-m3
  – Low latency to local memory helps performance
  – Data placement important (software!)
  – Less contention => more scalable
  – Large multiprocessor systems

Interconnection Networks

• Need something to connect those processors/memories
  – Direct: endpoints connect directly
  – Indirect: endpoints connect via switches/routers
• Interconnect issues
  – Latency: average latency more important than max
  – Bandwidth: per processor
  – Cost: #wires, #switches, #ports per switch
  – Scalability: how latency, bandwidth, cost grow with processors
• Interconnect topology primarily concerns architects

Interconnect 1: Bus

[Figure: p0-p3 and m0-m3 attached to a single shared bus]
• Direct interconnect style
  + Cost: f(1) wires
  + Latency: f(1)
  - Bandwidth: not scalable, f(1/P)
    . Only works in small systems (<= 4)
  + Only performs ordered broadcast
    . No other protocols work

Interconnect 2: Crossbar Switch

[Figure: crossbar connecting p0-p3 to m0-m3]
• Indirect interconnect
  – Switch implemented as big MUXes
  + Latency: f(1)
  + Bandwidth: f(1)
  - Cost:
    . f(2P) wires
    . f(P^2) switches
    . 4 wires per switch

Interconnect 3: Multistage Network

[Figure: multistage network of small switches between p0-p3 and m0-m3]
• Indirect interconnect
  – Routing done by address decoding
  – k: switch arity (#inputs/#outputs per switch)
  – d: number of network stages = log_k P
  + Cost:
    . f(d*P/k) switches
    . f(P*d) wires
    . f(k) wires per switch
  + Latency: f(d)
  + Bandwidth: f(1)
• Commonly used in large UMA systems

Interconnect 4: 2D Torus

[Figure: 4x4 grid of p/m nodes with wraparound links]
• Direct interconnect
  + Cost:
    . f(2P) wires
    . 4 wires per switch
  + Latency: f(P^(1/2))
  + Bandwidth: f(1)
• Good scalability (widely used)
  – Variants: 1D (ring), 3D, mesh (no wraparound)

Interconnect 5: Hypercube

[Figure: 16 p/m nodes connected as a hypercube]
• Direct interconnect
  – k: arity (#nodes per dimension)
  – d: dimension = log_k P
  + Latency: f(d)
  + Bandwidth: f(k*d)
  - Cost:
    . f(k*d*P) wires
    . f(k*d) wires per switch
• Good scalability, but expensive switches
  – New design needed for more nodes

Interconnect Routing

• Store-and-Forward Routing
  – Switch buffers entire message before passing it on
  – Latency = [(message length / bandwidth) + fixed switch overhead] * #hops
• Wormhole Routing
  – Pipeline the message through the interconnect
  – Switch passes message on before it completely arrives
  – Latency = (message length / bandwidth) + (fixed switch overhead * #hops)
  – No buffering needed at switch
  – Latency (relatively) independent of number of intermediate hops
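The two latency formulas above, written out as C helpers (editor addition; function and parameter names are illustrative, and the units are whatever the bandwidth and overhead terms are expressed in):

```c
/* Store-and-forward: the whole message is re-serialized at every hop. */
double store_and_forward_latency(double msg_bytes, double bandwidth,
                                 double switch_overhead, int hops) {
    return (msg_bytes / bandwidth + switch_overhead) * hops;
}

/* Wormhole: serialization cost is paid once; only the per-switch
 * overhead accumulates with hop count. */
double wormhole_latency(double msg_bytes, double bandwidth,
                        double switch_overhead, int hops) {
    return msg_bytes / bandwidth + switch_overhead * hops;
}
```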

Shared Memory vs. Message Passing

• MIMD (appearance of memory to software)
• Message Passing (multicomputers)
  – Each processor has its own address space
  – Processors send and receive messages to and from each other
  – Communication patterns explicit and precise
  – Explicit messaging forces the programmer to optimize communication
  – Used for scientific codes (explicit communication)
  – Message passing systems: PVM, MPI
  – Simple hardware
  – Difficult programming model
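A minimal MPI sketch of the explicit send/receive style described above (editor addition; MPI is named on the slide, but this particular exchange is only an illustration and assumes an MPI installation):

```c
#include <mpi.h>
#include <stdio.h>

/* Explicit message passing: rank 0 sends one integer to rank 1. */
int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* explicit communication */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```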

Shared Memory vs. Message Passing

• Shared Memory (multiprocessors)
  – One shared address space
  – Processors use conventional loads/stores to access shared data
  – Communication can be complex/dynamic
  – Simpler programming model (compatible with uniprocessors)
  – Hardware-controlled caching is useful to reduce latency + contention
  – Has drawbacks
    • Synchronization (discussed later)
    • More complex hardware needed
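For contrast with the MPI sketch, a minimal shared-memory example using POSIX threads (editor addition; pthreads is an assumed choice of API, and the mutex stands in for the synchronization issue the slide defers to later):

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                          /* shared data */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

/* Both threads touch the same variable with ordinary loads/stores;
 * the mutex serializes the increments. */
static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&m);
        counter++;                                /* plain load/store to shared memory */
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);           /* 200000 */
    return 0;
}
```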

Parallel Systems (80s and 90s)

Machine         Communication    Interconnect   #cpus      Remote latency (us)
SPARCcenter     Shared memory    Bus            <=20       1
SGI Challenge   Shared memory    Bus            <=32       1
CRAY T3D        Shared memory    3D Torus       64-1024    1
Convex SPP      Shared memory    X-bar/ring     8-64       2
KSR-1           Shared memory    Bus/ring       32         2-6
TMC CM-5        Messages         Fat tree       64-1024    10
Intel Paragon   Messages         2D mesh        32-2048    10-30
IBM SP-2        Messages         Multistage     32-256     30-100

• Trend towards shared memory systems

Multiprocessor Trends

• Shared Memory
  – Easier, more dynamic programming model
  – Can do more to optimize the hardware
• Small-to-medium size UMA systems (2-8 processors)
  – Processor + memory + switch on single board (4x Pentium)
  – Single-chip multiprocessors (POWER4)
  – Commodity parts soon - glueless MP systems
• Larger NUMAs built from smaller UMAs
  – Use commodity small UMAs with commodity interconnects (ethernet, myrinet)
  – NUMA clusters

Computation Taxonomy

[Taxonomy tree: SISD (ILP), SIMD (vectors, MM-ISAs), MISD, MIMD; MIMD splits into Shared Memory (UMA vs. NUMA) and Message Passing]

• Cache Coherence
  – Write invalidate/update? Snooping? Directory?
• Synchronization
• Memory Consistency

Major issues for Shared Memory

• Cache coherence - "common sense"
  • P1 Read[X] => P1 Write[X] => P1 Read[X] will return the value P1 wrote
  • P1 Write[X] => P2 Read[X] will return the value written by P1
  • P1 Write[X] => P2 Write[X] => serialized (all processors see the writes in the same order)
• Synchronization
  – Atomic read/write operations
• Memory consistency
  – When will a written value be seen?
  – P1 Write[X], (10 ps later) P2 Read[X]: what happens?
• These are not issues for message passing systems
  – Why?

Cache Coherence

• Benefits of coherent caches in parallel systems?
  – Migration and replication of shared data
  – Migration: data moved locally lowers latency + main memory bandwidth demand
  – Replication: data being simultaneously read can be replicated to reduce latency + contention
• Problem: sharing of writeable data

Processor 0    Processor 1    Correct value of A in:
                              Memory
Read A                        Memory, p0 cache
               Read A         Memory, p0 cache, p1 cache
Write A                       p0 cache, memory (if write-through)
               Read A         p1 gets stale value on hit

Solutions to Cache Coherence

• No caches
  – Not likely :)
• Make shared data non-cacheable
  – Simple software solution
  – Low performance if lots of shared data
• Software flush at strategic times
  – Relatively simple, but could be costly (frequent syncs)
• Hardware cache coherence
  – Make memory and caches coherent/consistent with each other

HW Coherence Protocols

• Absolute coherence
  – All copies of each block have same data at all times
  – A little bit overkill…
• Need appearance of absolute coherence
  – Temporary incoherence is OK
    • Similar to write-back cache
    • As long as all loads get their correct value
• Coherence protocol: FSM that runs at every cache
  – Invalidate protocols: invalidate copies in other caches
  – Update protocols: update copies in other caches
  – Snooping vs. directory-based protocols (HW implementation)
  – Memory is always updated

Write Invalidate

• Much more common for most systems
• Mechanics
  – Broadcast address of cache line to invalidate
  – All processors snoop until they see it, then invalidate if in local cache
  – Same policy can be used to service cache misses in write-back caches

Processor Activity    Bus Activity           CPU A's cache   CPU B's cache   Memory location X
                                                                             0
CPU A reads X         Cache miss for X       0                               0
CPU B reads X         Cache miss for X       0               0               0
CPU A writes 1 to X   Invalidation for X     1                               0
CPU B reads X         Cache miss for X       1               1               1

Write Update (Broadcast)

Processor Activity    Bus Activity            CPU A's cache   CPU B's cache   Memory location X
                                                                              0
CPU A reads X         Cache miss for X        0                               0
CPU B reads X         Cache miss for X        0               0               0
CPU A writes 1 to X   Write broadcast for X   1               1               1
CPU B reads X                                 1               1               1

• Bandwidth requirements are excessive
  – Can reduce by checking if word is shared (also can reduce write-invalidate traffic)
• Comparison with write invalidate
  – Multiple writes with no intervening reads require multiple broadcasts
  – Multiple broadcasts needed for multiple-word cache line writes (only 1 invalidate)
  – Advantage: other processors can get the data faster

Bus-based protocols (Snooping)

• Snooping
  – All caches see and react to all bus events
  – Protocol relies on global visibility of events (ordered broadcast)
• Events:
  – Processor events (from own processor)
    • Read (R), Write (W), Writeback (WB)
  – Bus events (from other processors)
    • Bus Read (BR), Bus Write (BW)

Three-State Invalidate Protocol

[Figure: three-state FSM with transitions labeled by CPU events (read/write hits and misses), bus events (read/write misses for the block), and actions such as "put read/write miss on bus" and "write back (WB) block"]
• Three states
  – Invalid: don't have block
  – Exclusive: have block and wrote it
  – Shared: have block but only read it
• Transitions are labeled with
  – CPU activity
  – Bus activity
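A software sketch of this three-state (MSI-style) FSM (editor addition; real controllers also handle bus acquisition, writebacks, and non-atomic transactions, and the enum/event names are illustrative):

```c
typedef enum { INVALID, SHARED, EXCLUSIVE } LineState;
typedef enum { CPU_READ, CPU_WRITE, BUS_READ_MISS, BUS_WRITE_MISS } Event;

/* Next-state function for one cache line, following the slide's diagram. */
LineState next_state(LineState s, Event e) {
    switch (s) {
    case INVALID:
        if (e == CPU_READ)  return SHARED;         /* put read miss on bus   */
        if (e == CPU_WRITE) return EXCLUSIVE;      /* put write miss on bus  */
        return INVALID;
    case SHARED:
        if (e == CPU_WRITE)      return EXCLUSIVE; /* put write miss on bus  */
        if (e == BUS_WRITE_MISS) return INVALID;   /* another CPU is writing */
        return SHARED;                             /* reads stay shared      */
    case EXCLUSIVE:
        if (e == BUS_READ_MISS)  return SHARED;    /* write back, then share */
        if (e == BUS_WRITE_MISS) return INVALID;   /* write back, invalidate */
        return EXCLUSIVE;                          /* own reads/writes hit   */
    }
    return s;
}
```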

Three-State Invalidate Protocol

[Figure: Processor 1 and Processor 2 copies of the three-state FSM, traced through the event sequence below]
• Events: A: P1 Read; A1: P2 Read; B: P1 Write; C: P2 Read; D: P1 Write; E: P2 Write; F-Z: P2 Write

Three-State Invalidate Problems

• Real implementations are much more complicated
  – We assumed all operations are atomic
    • Multiphase operations can be done with no intervening ops
    • E.g. write miss detected, acquire bus, receive response
    • Even read misses are non-atomic with split-transaction busses
    • Deadlock is a possibility; see Appendix I
  – Other simplifications
    • Write hits/misses treated the same
    • Don't distinguish between true sharing and a clean copy in only one cache
    • Other caches could supply data on misses to a shared block

Scalable Coherence Protocols: Directory-Based Systems

• Bus-based systems are not scalable
  – Not enough bus B/W for everyone's coherence traffic
  – Not enough processor snooping B/W to handle everyone's traffic
• Directories: scalable cache coherence for large MPs
  – Each memory entry (cache line) has a bit vector (1 bit per processor)
  – Bit vector tracks which processors have cached copies of the line
  – Send all messages to the memory directory
  – If no other cached copies, memory returns data
  – Otherwise memory forwards the request to the correct processor
  + Low B/W consumption (communicate only with relevant processors)
  + Works with general interconnect (bus not needed)
  - Longer latency (3-hop transaction: p0 => directory => p1 => p0)
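A sketch of the per-line directory state described above (editor addition; field and function names are illustrative, and the 64-bit presence vector simply assumes at most 64 processors):

```c
#include <stdint.h>
#include <stdbool.h>

/* One directory entry per memory block: a presence bit per processor
 * plus a dirty bit indicating a single up-to-date cached copy. */
typedef struct {
    uint64_t presence;   /* bit p set => processor p has a cached copy */
    bool     dirty;      /* true => one cache owns the only fresh copy */
} DirEntry;

/* On a read miss: if no cache is dirty, memory replies directly;
 * otherwise the request is forwarded to the owner (3-hop transaction). */
bool memory_can_reply(const DirEntry *e) { return !e->dirty; }

void record_sharer(DirEntry *e, int p)   { e->presence |= (1ULL << p); }
void clear_sharers(DirEntry *e)          { e->presence = 0; e->dirty = false; }
```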

Multiprocessor Performance

• OLTP: TPC-C (Oracle)
  – CPI ~ 7.0
• DSS: TPC-D
  – CPI ~ 1.61
• AV: AltaVista
  – CPI ~ 1.3

• Huge fraction of time in memory and I/O for TPC-C

Coherence Protocols: Performance

• As block size is increased
  – Coherence misses increase (false sharing)
  – False sharing
    • Sharing of different data in the same cache line
    • Block is shared, but the data word is not shared

[Figure: misses per 1000 instructions vs. cache block size, OLTP workload]
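A small C illustration of false sharing (editor addition; the 64-byte cache line size and the iteration counts are assumptions, and the padding trick is one common fix, not the only one):

```c
#include <pthread.h>

/* The two counters share a cache line, so writes by different threads
 * invalidate each other even though no data is logically shared.
 * Padding to a (assumed) 64-byte line removes the coherence misses. */
struct counters {
    long a;                          /* written only by thread 1 */
    /* char pad[64 - sizeof(long)];     uncomment to avoid false sharing */
    long b;                          /* written only by thread 2 */
} shared_counters;

static void *bump_a(void *arg) { for (int i = 0; i < 10000000; i++) shared_counters.a++; return NULL; }
static void *bump_b(void *arg) { for (int i = 0; i < 10000000; i++) shared_counters.b++; return NULL; }

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```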

Coherence Protocols: Performance

• 3C miss model => 4C miss model
  – Capacity, compulsory, conflict
  – Coherence: additional misses due to the coherence protocol
  – Complicates uniprocessor cache analysis
• As cache size is increased
  – Capacity misses decrease
  – Coherence misses may increase (more shared data is cached)

Coherence Protocols: Performance

• As processors are added
  – Coherence misses increase (more communication)

Today's Multiprocessor Trends: Single-Chip Multiprocessors

• e.g. IBM POWER4
  – 1 chip contains: two 1.3 GHz processors, L2, L3 tags
  – Can connect 4 chips on 1 MCM to create an 8-way system
  – Targets threaded server workloads, high-performance computing

[Figure: POWER4 chip layout - Core0 and Core1 sharing an on-chip L2 cache, plus L3 tags and bus/I/O logic; the L3 cache itself is off-chip]

Single Chip MP: The Future – but how many cores?

• Some companies are betting big on it…
  – Sun's Niagara (8 cores per chip)
  – Intel's rumored Tukwila or Tanglewood processor (8 or 16?)
• Basis in Piranha research project (ISCA 2000)

Multithreading

• Another trend: multithreaded processors
  – Processor utilization: IPC / processor width
    • Decreases as width increases (~50% on 4-wide)
    • Why? Cache misses, branch mispredicts, RAW dependencies
  – Idea? Two (or more) processes (threads) share one pipeline
  – Replicate (thread) state
    • PC, register file, bpred, page table pointer, etc. (~5% area overhead)
  – One copy of stateless (naturally tagged) structures
    • Caches, functional units, buses, etc.
  – Hardware thread context switch must be fast
    • Multiple on-chip contexts => no need to load from memory

Multithreading Paradigms

[Figure: issue-slot utilization over time for Superscalar, Coarse MT, Fine MT, and Simultaneous MT; example implementations: Pentium 3 and Alpha EV6/7 (superscalar), IBM Pulsar (coarse MT), Intel P4-HT, Alpha EV8, and POWER5 (SMT)]

Coarse vs. Fine Grained MT

• Coarse-grained
  – Makes sense for in-order/shorter pipelines
  – Switch threads on long stalls (L2 cache misses)
  – Threads don't interfere with each other much
  – Can't improve utilization on L1 misses/bpred mispredicts
• Fine-grained
  – Out-of-order, deep pipelines
  – Instructions from multiple threads in a stage at a time, miss or not
  – Improves utilization in all scenarios
  – Individual thread performance suffers due to interference

Pentium 4 Front-End

• Front-end resources arbitrate between threads every cycle
• ITLB, RAS, BHB (global history), and decode queue are duplicated

Pentium 4 Backend

• Some queues and buffers are partitioned (only ½ of the entries per thread)
• Scheduler is oblivious to instruction thread IDs (limit on # per scheduler)

Multithreading Performance

[Simulated SMT configuration: 9-stage pipeline, 128KB I/D caches, 6 integer ALUs (4 load/store), 4 FP ALUs, 8 SMT threads, 8-instruction fetch (from 2 threads)]

• Intel claims more like a 20-30% increase on P4
• Debate still exists on the performance benefits

Synchronization

• Synchronization: important issue for shared memory
  – Regulates access to shared data
  – E.g. semaphore, monitor, critical section (s/w constructs)
  – Synchronization primitive: lock
  – Well-written parallel programs will synchronize on shared data

  acquire(lock);     // while (lock != 0); lock = 1;
  critical section;
  release(lock);     // lock = 0;

0: ldw  r1, lock    // wait for lock to be free
1: bnez r1, #0
2: stw  #1, lock    // acquire lock
…                   // critical section
9: stw  #0, lock    // release lock

Implementing Locks

• Called a spin lock, but it doesn't work quite right…

Processor 0                              Processor 1
0: ldw  r1, lock
1: bnez r1, #0    // p0 sees lock free
                                         0: ldw  r1, lock
                                         1: bnez r1, #0    // p1 sees lock free
2: stw  #1, lock  // p0 acquires lock
                                         2: stw  #1, lock  // p1 acquires lock
…                 // p0 and p1 in
…                 // critical section
…                 // together
9: stw  #0, lock

Implementing Locks

• Problem: acquire sequence (load-test-store) is not atomic
  – Solution: ISA provides an atomic lock-acquire operation
  – Load + check + store in one instruction (uninterruptible by definition)
  – E.g. test&set instruction (also fetch-and-increment, exchange)

  t&s r1, lock    // ldw r1, lock; stw #1, lock

0: t&s  r1, lock
1: bnez r1, #0
2: …                // critical section
3: stw  #0, lock    // release lock

Lock-release is already atomic
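The same test-and-set spin lock written with C11 atomics (editor addition; atomic_exchange plays the role of the t&s instruction here, and the variable/function names are illustrative):

```c
#include <stdatomic.h>

static atomic_int lock = 0;    /* 0 = free, 1 = held */

/* Spin until the atomic exchange returns 0, i.e. the lock was free
 * and we stored 1 in the same indivisible operation. */
void acquire(void) {
    while (atomic_exchange(&lock, 1) != 0)
        ;                      /* spin */
}

/* Lock release is a single store, hence already atomic. */
void release(void) {
    atomic_store(&lock, 0);
}
```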

More atomic instructions

• All atomic instructions can implement semaphores
  – Fetch-and-Add (generalization of T&S)
  – Compare-and-Swap (CAS)
  – Load-Linked and Store-Conditional (LL/SC)

• Only CAS and LL/SC can implement lock-free and wait-free algorithms (hard)
  – Semaphores can deadlock if the thread holding the lock is swapped out
  – Lock-free: threads do not require a lock, and some thread always makes progress
  – Wait-free: every operation completes in a finite number of steps

• See the proofs in: M. P. Herlihy, "Wait-Free Synchronization," ACM Transactions on Programming Languages and Systems, 13(1):124-149, January 1991.
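A lock-free counter built on CAS, using C11 atomics (editor addition; a minimal sketch of the lock-free idea above, not a wait-free construction, and the names are illustrative):

```c
#include <stdatomic.h>

static atomic_long value = 0;

/* No thread ever holds a lock: if our CAS fails, some other thread's
 * CAS succeeded, so the system as a whole always makes progress. */
void lockfree_increment(void) {
    long old = atomic_load(&value);
    while (!atomic_compare_exchange_weak(&value, &old, old + 1))
        ;   /* on failure, 'old' is refreshed with the current value; retry */
}
```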

Load-Linked/Store-Conditional

[Figure: LL/SC example from Prof. Bruning, Universität Mannheim]

HW Implementation Issues

Memory Ordering: Consistency

• Memory updates may become reordered by the memory system

Processor 0               Processor 1
A = 0;                    B = 0;
A = 1;                    B = 1;
L1: if (B == 0)           L2: if (A == 0)
      critical section          critical section

• Intuitively, it is impossible for both to be in the critical section
• BUT, if memory ops are reordered, it could happen
• Coherence: the copies of A and B must be the same eventually
• Coherence doesn't specify the relative timing of these updates
• See: Sarita V. Adve and Kourosh Gharachorloo, "Shared Memory Consistency Models: A Tutorial," IEEE Computer, 1995.
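The same flag pattern expressed with C11 atomics (editor addition; C11 memory orders are the editor's way of illustrating the point, not part of the lecture: with memory_order_seq_cst the two threads can never both enter, while with relaxed ordering the store and the following load may be reordered and both "if"s can see 0):

```c
#include <stdatomic.h>
#include <stdbool.h>

atomic_int A = 0, B = 0;

/* Processor 0's attempt: set A, then check B. */
bool p0_enters(void) {
    atomic_store_explicit(&A, 1, memory_order_seq_cst);
    return atomic_load_explicit(&B, memory_order_seq_cst) == 0;  /* L1 */
}

/* Processor 1's attempt: set B, then check A. */
bool p1_enters(void) {
    atomic_store_explicit(&B, 1, memory_order_seq_cst);
    return atomic_load_explicit(&A, memory_order_seq_cst) == 0;  /* L2 */
}
```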

Memory Ordering: Sequential Consistency

• "A system is sequentially consistent if the result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in program order" [Lamport]
• Sequential Consistency
  – All loads and stores in order
  – Delay completion of a memory access until all invalidations caused by that access complete
  – Delay the next memory access until the previous one completes
  – Delay the read of A/B until the previous write of A/B completes
    • Cannot place writes in a write buffer and continue with the read!
  – Simple for the programmer
  – Not much room for HW/SW performance optimizations

Sequential Consistency Performance

• Suppose a write miss takes:
  – 40 cycles to establish ownership
  – 10 cycles to issue each invalidate after ownership
  – 50 cycles to complete/acknowledge invalidates
• Suppose 4 processors share the cache block

• Sequential consistency: must wait 40 + 10 + 10 + 10 + 10 + 50 cycles = 130 cycles
• Ownership time is only 40 cycles
• Write buffers could continue before this

Weaker Consistency Models

• Assume programs are synchronized
• Observation: SC only needed for lock variables
  – Other variables?
  – Either in a critical section (no parallel access)
  – Or not shared
• Weaker consistency: can delay/reorder loads and stores
  – More room for hardware optimizations
  – Somewhat trickier programming model (memory barriers)
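A small illustration of the "synchronized program + barriers" model in C11 (editor addition; the release/acquire pair stands in for the memory barriers the slide mentions, and the producer/consumer names are illustrative):

```c
#include <stdatomic.h>

int data = 0;            /* ordinary variable, only touched around the handoff */
atomic_int ready = 0;    /* synchronization variable */

/* Under a weak model the plain store to 'data' could drift past the
 * flag update; the release store acts as the barrier that forbids it. */
void producer(void) {
    data = 42;                                               /* plain store   */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* barrier point */
}

/* The acquire load is the matching barrier: once 'ready' is seen as 1,
 * the consumer is guaranteed to observe data == 42. */
int consumer(void) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                    /* spin */
    return data;
}
```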

One other possibility…

• Many processors have extensive support for speculation recovery
• Can this be re-used to support strict consistency?
  – Use the delayed-commit feature for memory references
  – If we receive an invalidate message, back out the computation and restart
  – Rarely triggered anyway (only on unsynchronized accesses that cause a race)
