High-Performance Processors’ Design Choices
Ramon Canal

PD Fall 2013

1 High-Performance Processors’ Design Choices

1 Motivation
2 Multiprocessors
3 Multithreading
4 VLIW

2 Outline

• Motivation
• Multiprocessors
  – SISD, SIMD, MIMD, and MISD
  – Memory organization
  – Communication mechanisms
• Multithreading
• VLIW

3 Motivation

Instruction-Level Parallelism (ILP): what we have covered so far:
– simple pipelining
– dynamic scheduling: scoreboarding and Tomasulo’s algorithm
– dynamic branch prediction
– multiple-issue architectures: superscalar, VLIW
– compiler techniques and software approaches

Bottom line: there just aren’t enough instructions that can actually be executed in parallel!
– instruction issue: limit on maximum issue count
– branch prediction: imperfect
– # registers: finite
– functional units: limited in number
– data dependencies: hard to detect dependencies via memory

4 So, What do we do?

Key Idea: increase the number of running processes
– multiple processes at a given “point” in time
  • i.e., at the granularity of one (or a few) clock cycles
  • not sufficient to have multiple processes at the OS level!

Two Approaches:
– multiple CPUs, each executing a distinct process
  • “Multiprocessors” or “Parallel Architectures”
– single CPU executing multiple processes (“threads”)
  • “Multi-threading” or “Thread-level parallelism”

5 Taxonomy of Parallel Architectures

Flynn’s Classification:
– SISD: single instruction stream, single data stream
  • uniprocessor
– SIMD: single instruction stream, multiple data streams
  • same instruction executed by multiple processors (see the sketch below)
  • each has its own data memory
  • Ex: multimedia processors, vector architectures
– MISD: multiple instruction streams, single data stream
  • successive functional units operate on the same stream of data
  • rarely found in general-purpose commercial designs
  • special-purpose stream processors (digital filters etc.)
– MIMD: multiple instruction streams, multiple data streams
  • each processor has its own instruction and data streams
  • most popular form of parallel processing
    – single-user: high-performance for one application
    – multiprogrammed: running many tasks simultaneously (e.g., servers)
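To make the SIMD idea concrete, here is a minimal C sketch (the saxpy kernel is an illustrative choice, not from the slides): every iteration applies the same operation to different data elements, which is exactly the pattern a vectorizing compiler maps onto SIMD instructions.

    /* Same instruction, multiple data streams: a vectorizing compiler
       (e.g., gcc -O3) turns this loop into SIMD instructions that
       process several elements of x and y at once. */
    void saxpy(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* one operation, many data items */
    }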

6 Multiprocessor: Memory Organization

Centralized shared-memory multiprocessor:
– usually few processors
– share a single memory and bus
– use large caches

7 Multiprocessor: Memory Organization

Distributed-memory multiprocessor:
– can support large processor counts
  • cost-effective way to scale memory bandwidth
  • works well if most accesses are to the local memory node
– requires an interconnection network
  • communication between processors becomes more complicated and slower

8 Communication Mechanisms

• Shared-Memory Communication
  – around for a long time, so well understood and standardized
    • memory-mapped
  – ease of programming when communication patterns are complex or dynamically varying
  – better use of bandwidth when items are small
  – Problem: cache coherence is harder
    • use “Snoopy” and other protocols
• Message-Passing Communication (e.g., Intel’s Knight… family)
  – simpler hardware: caches need not be kept coherent
  – communication is explicit, simpler to understand
    • focuses programmer attention on communication
  – synchronization: naturally associated with communication
    • fewer errors due to incorrect synchronization
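A minimal sketch of the two styles in C (all names are illustrative, and a POSIX pipe stands in for a real interconnect): shared memory communicates through ordinary loads and stores with explicit synchronization, while message passing makes the communication itself explicit and gets synchronization along with it.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Shared-memory style: threads communicate via ordinary stores and
       loads; synchronization (the lock) must be added explicitly. */
    static int shared_value;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *producer(void *arg) {
        pthread_mutex_lock(&lock);
        shared_value = 42;            /* communication is implicit...   */
        pthread_mutex_unlock(&lock);  /* ...synchronization is explicit */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        pthread_join(t, NULL);
        printf("shared: %d\n", shared_value);

        /* Message-passing style: send/receive make communication
           explicit, and the receive doubles as synchronization. */
        int fd[2], msg = 42, got;
        pipe(fd);
        write(fd[1], &msg, sizeof msg);   /* "send"                   */
        read(fd[0], &got, sizeof got);    /* "receive" = implicit sync */
        printf("message: %d\n", got);
        return 0;
    }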

9 Multiprocessor: Hybrid Organization
• Use distributed-memory organization at the top level
• Each node may itself be a shared-memory multiprocessor (2-8 processors)

10 Multiprocessor: Hybrid Organization

• What about Big Data? Is it a “game changer”?
  – Next slides based on the following works:
    • M. Ferdman et al., “Clearing the Clouds”, ASPLOS’12
    • P. Lotfi-Kamran et al., “Scale-Out Processors”, ISCA’12
    • B. Grot et al., “Optimizing Datacenter TCO with Scale-Out Processors”, IEEE Micro 2012
  – Next couple of slides © Prof. Babak Falsafi (EPFL)

11 Multiprocessors and Big Data

16 Scale-out Processors

• Small LLC, just large enough to capture instructions
• More cores for higher throughput
• “Pods” to keep the distance to memory small

17 Performance

• Iso server power (20 MW), i.e., compared at an equal power budget

18 Summary: Multiprocessors

• Need to tailor chip design to applications
  – Big Data applications are too big for data caches; the best solution is to eliminate them
  – Big Data applications need coarse-grain parallelism (i.e., at the request level)

  – Single-thread performance is still important for other applications (i.e., computation-intensive ones)

19 Multithreading

Threads: multiple processes that share code and data (and much of their address space)
• recently, the term has come to include processes that may run on different processors and even have disjoint address spaces, as long as they share the code

Multithreading: exploit thread-level parallelism within a processor
– fine-grain multithreading
  • switch between threads on each instruction!
– coarse-grain multithreading
  • switch to a different thread only if the current thread has a costly stall
    – e.g., switch only on a level-2 cache miss

20 Multithreading

• How can we guarantee no dependencies between instructions in a pipeline?
  – One way is to interleave execution of instructions from different program threads on the same pipeline

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:

T1: LW   r1, 0(r2)
T2: ADD  r7, r1, r4
T3: XORI r5, r4, #12
T4: SW   0(r7), r5
T1: LW   r5, 12(r1)

21 Simple Multithreaded Pipeline

• Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
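A minimal sketch of that bookkeeping (field names are illustrative): the thread id is latched into every pipeline register alongside the instruction, so each stage reads and writes the right thread's state.

    #include <stdint.h>

    /* Each inter-stage latch carries the thread id with the instruction,
       so PC/register accesses at every stage hit the right context. */
    typedef struct {
        uint32_t insn;   /* the instruction in flight       */
        int      tid;    /* which thread's state it touches */
    } pipe_reg_t;

    pipe_reg_t if_id, id_ex, ex_mem, mem_wb;   /* 5-stage pipe latches */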

22 Multithreading

Fine-grain multithreading
– switch between threads on each instruction!
– multiple threads executed in interleaved manner
– interleaving is usually round-robin
– CPU must be capable of switching threads on every cycle!
  • fast, frequent switches
– main disadvantage:
  • slows down the execution of individual threads
  • that is, latency is traded off for better throughput
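A sketch of the round-robin selection (names are illustrative): with at least as many ready threads as the pipeline has hazard-prone stages, consecutive instructions in the pipe always come from different threads, so no interlocks or bypasses are needed.

    #define NTHREADS 4   /* enough threads to cover the pipeline depth */

    /* Pick the thread to fetch from this cycle, round-robin. */
    static int next_thread = 0;

    int select_thread(void) {
        int t = next_thread;
        next_thread = (next_thread + 1) % NTHREADS;
        return t;
    }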

23 CDC 6600 Peripheral Processors (Cray, 1965)

• First multithreaded hardware
• 10 “virtual” I/O processors
• fixed interleave on simple pipeline
• pipeline has 100 ns cycle time
• each processor executes one instruction every 1000 ns
• accumulator-based instruction set to reduce processor state

24 Denelcor HEP (Burton Smith, 1982)

• First commercial machine to use hardware threading in main CPU
  – 120 threads per processor
  – 10 MHz
  – up to 8 processors
  – precursor to Tera MTA (Multithreaded Architecture)

25 Tera MTA (Cray, 1997)

• Up to 256 processors
• Up to 128 active threads per processor
• Processors and memory modules populate a sparse 3D torus interconnection fabric
• Flat, shared main memory
  – no data cache
  – sustains one main memory access per cycle per processor
• 50 W/processor @ 260 MHz

26 Tera MTA (Cray)

• Each processor supports 128 active hardware threads
  – 128 SSWs (stream status words), 1024 target registers, 4096 general-purpose registers
• Every cycle, one instruction from one active thread is launched into the pipeline
• Instruction pipeline is 21 cycles long
• At best, a single thread can issue one instruction every 21 cycles
  – clock rate is 260 MHz, so the effective single-thread issue rate is 260/21 ≈ 12.4 MHz

27 Multithreading

Coarse-grain multithreading
– switch only if the current thread has a costly stall
  • e.g., a level-2 cache miss
– can accommodate slightly costlier switches
– less likely to slow down an individual thread
  • a thread is switched “off” only when it has a costly stall
– main disadvantage:
  • limited ability to overcome throughput losses
    – shorter stalls are ignored, and there may be plenty of those
  • issues instructions from a single thread
    – every switch involves emptying and restarting the instruction pipeline
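A sketch of the switch policy (flush_pipeline and the event hook are hypothetical helpers, not from any real design): the running thread keeps the pipeline until a costly stall, and every switch pays the pipeline-refill penalty noted above.

    #define NTHREADS 2

    extern void flush_pipeline(void);   /* hypothetical: drop in-flight insns */

    static int current_thread = 0;

    /* Called only on costly stalls (e.g., an L2 miss); short stalls are
       simply endured, which is the throughput limitation noted above. */
    void on_costly_stall(void) {
        flush_pipeline();               /* empty and restart the pipe */
        current_thread = (current_thread + 1) % NTHREADS;
    }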

28 IBM PowerPC RS64-III (Pulsar)
• Commercial coarse-grain multithreading CPU
• Based on PowerPC with quad-issue, in-order, five-stage pipeline
• Each physical CPU supports two virtual CPUs
• On an L2 cache miss, the pipeline is flushed and execution switches to the second thread
  – short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
  – flushing the pipeline also simplifies exception handling

29 Simultaneous Multithreading (SMT)

Key Idea: exploit ILP across multiple threads!
– share the CPU among multiple threads
– i.e., convert thread-level parallelism into more ILP
– exploit the following features of modern processors:
  • multiple functional units
    – modern processors typically have more functional units available than a single thread can utilize
  • register renaming and dynamic scheduling
    – multiple instructions from independent threads can co-exist and co-execute!
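A sketch of per-cycle SMT issue (the helper names are hypothetical): unlike fine- or coarse-grain multithreading, one cycle's issue slots can be filled with ready instructions drawn from several threads at once.

    #define ISSUE_WIDTH 4
    #define NTHREADS    2

    extern int  has_ready_insn(int tid);   /* hypothetical scheduler query  */
    extern void issue_from(int tid);       /* hypothetical: dispatch one op */

    /* One cycle: fill up to ISSUE_WIDTH slots from any mix of threads. */
    void issue_cycle(void) {
        int issued = 0;
        for (int t = 0; t < NTHREADS && issued < ISSUE_WIDTH; t++)
            while (issued < ISSUE_WIDTH && has_ready_insn(t)) {
                issue_from(t);
                issued++;
            }
    }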

30 Multithreading: Illustration


(a) A superscalar processor with no multithreading
(b) A superscalar processor with coarse-grain multithreading
(c) A superscalar processor with fine-grain multithreading
(d) A superscalar processor with simultaneous multithreading (SMT)

31 From Superscalar to SMT

• SMT is an out-of-order superscalar extended with hardware to support multiple executing threads

32 Simultaneous Multithreaded Processor

33 Simultaneous Multithreaded Processor
• Add multiple contexts and fetch engines to a wide out-of-order superscalar processor
  – [Tullsen, Eggers, Levy, University of Washington, 1995]
• OOO already has most of the circuitry required to schedule from multiple threads
• Any single thread can utilize the whole machine

• First examples:
  – Alpha 21464 (DEC/Compaq)
  – Pentium IV (Intel)
  – Power 5 (IBM)
  – UltraSPARC IV (Sun)

34 SMT: Design Challenges

• Dealing with a large register file
  – needed to hold multiple contexts

• Maintaining low overhead on the clock cycle
  – fast instruction issue: choosing what to issue
  – instruction commit: choosing what to commit
  – keeping cache conflicts within acceptable bounds

• Power hungry!

35 Intel Pentium-4 Processor
• Hyperthreading = SMT
• Dual physical processors, each 2-way SMT
• Logical processors share nearly all resources of the physical processor
  – caches, execution units, branch predictors
• Die area overhead of hyperthreading ~5%
• When one logical processor is stalled, the other can make progress
  – no logical processor can use all entries in the queues when two threads are active
• A processor running only one active software thread runs at the same speed with or without hyperthreading

36 Pentium 4 Micro-architecture

[Block diagram labels: 400 MHz System Bus, Advanced Transfer Cache, Advanced Dynamic Execution, Hyper Pipelined Technology, Rapid Execution Engine, Streaming SIMD Extensions 2, Enhanced Floating Point / Multi-Media]

37 Pentium 4 Micro-architecture

What hardware complexity do OoO and SMT incur?

[Block diagram highlights: Advanced Dynamic Execution, Hyper Pipelined Technology]

38 Sun/Oracle Ultrasparc T5 (2013)

16 cores @ 3.6 GHz, 8 threads/core (128 threads/chip)

Each core: 2-way OoO, 16 KB I$, 16 KB D$, 128 KB L2; 8 MB shared L3

28 nm process

39 IBM Power 7

40 VLIW

• Very Long Instruction Word:
  – the compiler packs a fixed number of operations into a single VLIW “instruction”
  – the operations within a VLIW instruction are issued and executed in parallel

– Examples:
  • high-end signal processors (TMS320C6201)
  • Intel’s Itanium
  • Transmeta Crusoe, Efficeon

41 VLIW
• VLIW (very long instruction word) processors use a long instruction word that contains a usually fixed number of operations that are fetched, decoded, issued, and executed synchronously.
• All operations specified within a VLIW instruction must be independent of one another.
• Some of the key issues of a (V)LIW processor:
  – (very) long instruction word (up to 1024 bits per instruction),
  – each instruction consists of multiple independent parallel operations,
  – each operation requires a statically known number of cycles to complete,
  – a central controller that issues a long instruction word every cycle,
  – multiple FUs connected through a global shared register file.

42 VLIW and Superscalar
• sequential stream of long instruction words
• instructions scheduled statically by the compiler
• number of simultaneously issued instructions is fixed at compile time
• instruction issue is less complicated than in a superscalar processor
• Disadvantage: VLIW processors cannot react to dynamic events (e.g., cache misses) with the same flexibility as superscalars.
• The number of instructions in a VLIW instruction word is usually fixed.
• Padding VLIW instructions with no-ops is needed when the full issue bandwidth cannot be met; this increases code size (see the sketch below). More recent VLIW architectures use a denser code format which allows the no-ops to be removed.
• VLIW is an architectural technique, whereas superscalar is a microarchitectural technique.
• VLIW processors take advantage of spatial parallelism.
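To make the padding point concrete, here is a sketch with a hypothetical 4-slot instruction word (the encoding is invented for illustration): when the compiler cannot find enough independent operations for a cycle, the unused slots still occupy encoding space.

    #include <stdint.h>

    #define NOP 0u                  /* explicit empty-slot encoding */

    /* Hypothetical 4-slot VLIW word; all four slots issue together. */
    typedef struct {
        uint32_t alu0, alu1, mem, br;
    } vliw_word_t;

    /* Only two independent ops were found for this cycle, so the other
       two slots are padded with no-ops, inflating code size. */
    vliw_word_t w = { /* alu0 */ 0x12345678u, /* alu1 */ NOP,
                      /* mem  */ 0x9abcdef0u, /* br   */ NOP };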

43 VLIW and Superscalar

• Superscalar RISC solution
  – based on sequential execution semantics
  – compiler’s role is limited by the instruction set architecture
  – superscalar hardware identifies and exploits parallelism

• VLIW solution
  – based on parallel execution semantics
  – VLIW ISA enhancements support static parallelization
  – compiler takes greater responsibility for exploiting parallelism
  – compiler / hardware collaboration often resembles superscalar

44 VLIW and Superscalar
• Advantages of pursuing VLIW architectures
  – make wide issue & deep latency less expensive in hardware
  – allow processor parallelism to scale with additional VLSI density

• Architect the processor to do well with in-order execution
  – enhance the ISA to allow static parallelization
  – use compiler technology to parallelize the program
    • loop unrolling, software pipelining, ... (see the sketch below)
  – however, a purely static VLIW is not appropriate for general-purpose use
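For instance, a sketch of loop unrolling in C (assuming, for brevity, that n is a multiple of 4): the four independent operations per iteration are exactly what a static compiler can schedule into the parallel slots of a long instruction word.

    /* Unrolled by 4: the four multiplies are independent, so a VLIW
       compiler can place them in the same (or adjacent) wide words. */
    void scale(int n, float a, float *x) {
        for (int i = 0; i < n; i += 4) {
            x[i]     *= a;
            x[i + 1] *= a;
            x[i + 2] *= a;
            x[i + 3] *= a;
        }
    }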

45 Examples

• Intel Itanium

• Transmeta Crusoe

• Almost all DSPs

– Texas Instruments

– ST Microelectronics

46 Intel Itanium, Itanium 2

47 IA-64 Encoding

Source: Intel/HP IA-64 Application ISA Guide 1.0

48 IA-64 Templates

Source: Intel/HP IA-64 Application ISA Guide 1.0

49 Intel's IA-64 ISA

• Intel 64-bit Architecture (IA-64) register model:
  – 128 64-bit general-purpose registers GR0-GR127 to hold values for integer and multimedia computations
    • each register has one additional NaT (Not a Thing) bit to indicate whether the value stored is valid
  – 128 82-bit floating-point registers FR0-FR127
    • registers f0 and f1 are read-only with values +0.0 and +1.0
  – 64 1-bit predicate registers PR0-PR63
    • the first register, p0, is read-only and always reads 1 (true)
  – 8 64-bit branch registers BR0-BR7 to specify the target addresses of indirect branches
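The same register model restated as a C sketch (the layout is illustrative; C has no 82-bit floating-point type, so long double stands in):

    #include <stdint.h>

    /* IA-64 architectural register state as listed above (sketch). */
    typedef struct {
        uint64_t    gr[128];    /* GR0-GR127, general purpose           */
        uint8_t     nat[128];   /* one NaT ("Not a Thing") bit per GR   */
        long double fr[128];    /* FR0-FR127; f0 = +0.0, f1 = +1.0 (RO) */
        uint8_t     pr[64];     /* PR0-PR63, 1-bit predicates; p0 == 1  */
        uint64_t    br[8];      /* BR0-BR7, indirect branch targets     */
    } ia64_regs_t;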

50 Transmeta Crusoe and Efficeon

51 Overview

• HW/SW system for executing x86 code
  – VLIW processor
  – Code Morphing Software
• Underlying ISA and details invisible
  – convenient level of indirection
  – upgrades, fixes, freedom for changes
    • as long as new CMS is implemented
  – anything else?

52 VLIW CPU

• Simple
  – in-order, very few interlocks
  – TM5400: 7 million transistors, 7-stage pipeline
  – low power, easier (and cheaper) to design
• TM5800
  – up to 1 GHz, 64 KB L1, 512 KB L2
  – 0.5-1.5 W @ 300-1000 MHz, 0.8-1.3 V running a typical multimedia app

53 Crusoe vs. PIII mobile (temperature)

54 VLIW CPU
• RISC-like ISA
  – molecule (long instruction)
    • 2 or 4 atoms (RISC-like instructions)
    • slot distribution?
  – 64 GPRs and 32 FPRs
    • dedicated regs for x86 architectural regs

128-bit molecule: FADD | ADD | LD | BRCC
– the four atoms are dispatched in lock-step to the functional units: floating-point unit, INT unit 1, INT unit 2, load/store unit, branch unit
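A sketch of that molecule as a C struct (field names and layout are illustrative, not Transmeta's actual encoding): four 32-bit atoms travel together and each is routed to its own functional unit.

    #include <stdint.h>

    /* One 128-bit molecule = four RISC-like atoms, issued together. */
    typedef struct {
        uint32_t fadd;   /* -> floating-point unit */
        uint32_t add;    /* -> INT unit 1 (or 2)   */
        uint32_t ld;     /* -> load/store unit     */
        uint32_t brcc;   /* -> branch unit         */
    } molecule_t;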

55 Conclusions

• VLIW
  – reduces hardware complexity at the cost of increasing compiler complexity
  – good for DSPs
  – not so good for GPPs (so far?)

56 Conclusions

• Multiprocessors
  – conventional superscalars are reaching ILP’s limits → exploit TLP or PLP
  – already known technology
• Multithreading
  – good for extensive use of superscalar cores
  – more efficient than MP but more complex too
