
High-Performance Processors' Design Choices
Ramon Canal, PD Fall 2013

Outline
• Motivation
• Multiprocessors
  – SISD, SIMD, MIMD, and MISD
  – Memory organization
  – Communication mechanisms
• Multithreading
• VLIW

Motivation
Instruction-level parallelism (ILP): what we have covered so far:
– simple pipelining
– dynamic scheduling: scoreboarding and Tomasulo's algorithm
– dynamic branch prediction
– multiple-issue architectures: superscalar, VLIW
– compiler techniques and software approaches
Bottom line: there just aren't enough instructions that can actually be executed in parallel!
– instruction issue: limit on maximum issue count
– branch prediction: imperfect
– # registers: finite
– functional units: limited in number
– data dependencies: hard to detect dependencies via memory

So, What Do We Do?
Key idea: increase the number of running processes
– multiple processes at a given "point" in time
  • i.e., at the granularity of one (or a few) clock cycles
  • not sufficient to have multiple processes at the OS level!
Two approaches:
– multiple CPUs, each executing a distinct process
  • "multiprocessors" or "parallel architectures"
– a single CPU executing multiple processes ("threads")
  • "multithreading" or "thread-level parallelism"

Taxonomy of Parallel Architectures
Flynn's classification:
– SISD: single instruction stream, single data stream
  • uniprocessor
– SIMD: single instruction stream, multiple data streams
  • same instruction executed by multiple processors
  • each has its own data memory
  • e.g., multimedia processors, vector architectures
– MISD: multiple instruction streams, single data stream
  • successive functional units operate on the same stream of data
  • rarely found in general-purpose commercial designs
  • special-purpose stream processors (digital filters, etc.)
– MIMD: multiple instruction streams, multiple data streams
  • each processor has its own instruction and data streams
  • most popular form of parallel processing
    – single-user: high performance for one application
    – multiprogrammed: running many tasks simultaneously (e.g., servers)

Multiprocessor: Memory Organization
Centralized, shared-memory multiprocessor:
– usually few processors
– share a single memory and bus
– use large caches
Distributed-memory multiprocessor:
– can support large processor counts
  • cost-effective way to scale memory bandwidth
  • works well if most accesses are to the local memory node
– requires an interconnection network
  • communication between processors becomes more complicated and slower

Communication Mechanisms
• Shared-memory communication
  – around for a long time, so well understood and standardized
    • memory-mapped
  – ease of programming when communication patterns are complex or dynamically varying
  – better use of bandwidth when items are small
  – problem: cache coherence is harder
    • use "snoopy" and other protocols
• Message-passing communication (e.g., Intel's Knights family)
  – simpler hardware because keeping caches coherent is easier
  – communication is explicit, simpler to understand
    • focuses programmer attention on communication
  – synchronization: naturally associated with communication
    • fewer errors due to incorrect synchronization
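To make the shared-memory model concrete, here is a minimal C++ sketch (mine, not from the slides) of two threads communicating through ordinary memory plus a synchronization flag. All names are illustrative; the cache-coherence machinery mentioned above is what keeps both threads' views of `data` consistent.

```cpp
// Hypothetical sketch of shared-memory communication between two threads:
// the producer writes data and then releases a flag; the consumer acquires
// the flag and reads the data. The hardware coherence protocol (e.g.
// snooping) keeps the consumer's cached copy of `data` consistent.
#include <atomic>
#include <iostream>
#include <thread>

int data = 0;                    // shared payload (plain memory)
std::atomic<bool> ready{false};  // shared synchronization flag

void producer() {
    data = 42;                                      // write payload
    ready.store(true, std::memory_order_release);   // publish it
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // spin until published
    std::cout << "received " << data << '\n';          // guaranteed to see 42
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```

With message passing, the same exchange would instead be an explicit send/receive pair, which is why synchronization comes naturally with communication in that model.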
Multiprocessor: Hybrid Organization
• Use a distributed-memory organization at the top level
• Each node itself may be a shared-memory multiprocessor (2–8 processors)
• What about Big Data? Is it a "game changer"?
  – The next slides are based on the following works:
    • M. Ferdman et al., "Clearing the Clouds," ASPLOS'12
    • P. Lotfi-Kamran et al., "Scale-Out Processors," ISCA'12
    • B. Grot et al., "Optimizing Datacenter TCO with Scale-Out Processors," IEEE Micro, 2012
  – Figures © Prof. Babak Falsafi (EPFL)
[Figure slides omitted: "Multiprocessors and Big Data"]

Scale-Out Processors
• Small LLC, just large enough to capture instructions
• More cores for higher throughput
• "Pods" for a small distance to memory

Performance
• Iso-server-power comparison (20 MW)

Summary: Multiprocessors
• Need to tailor chip design to applications
  – Big Data applications are too big for data caches; the best solution is to eliminate them
  – Big Data applications need coarse-grain parallelism (i.e., at the request level)
  – Single-thread performance is still important for other applications (e.g., computation-intensive ones)

Multithreading
Threads: multiple processes that share code and data (and much of their address space)
• recently, the term has come to include processes that may run on different processors and even have disjoint address spaces, as long as they share the code
Multithreading: exploit thread-level parallelism within a processor
– fine-grain multithreading
  • switch between threads on each instruction!
– coarse-grain multithreading
  • switch to a different thread only if the current thread has a costly stall
    – e.g., switch only on a level-2 cache miss

Multithreading
• How can we guarantee no dependencies between instructions in a pipeline?
  – One way is to interleave the execution of instructions from different program threads on the same pipeline
Interleave 4 threads, T1–T4, on a non-bypassed 5-stage pipe:
  T1: LW   r1, 0(r2)
  T2: ADD  r7, r1, r4
  T3: XORI r5, r4, #12
  T4: SW   0(r7), r5
  T1: LW   r5, 12(r1)

Simple Multithreaded Pipeline
• Have to carry the thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage

Multithreading
Fine-grain multithreading:
– switch between threads on each instruction!
– multiple threads executed in an interleaved manner
– interleaving is usually round-robin
– the CPU must be capable of switching threads on every cycle!
  • fast, frequent switches
– main disadvantage:
  • slows down the execution of individual threads
  • that is, latency is traded off for better throughput
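A minimal sketch (mine, not from the slides) of the fixed round-robin interleave just described, reusing the T1–T4 instruction strings from the pipeline example above; the thread count and streams are purely illustrative.

```cpp
// Toy model of fine-grain multithreading: each cycle the fetch stage picks
// the next thread in round-robin order, so consecutive pipeline stages hold
// instructions from different threads and need no interlocks between them.
#include <cstdio>
#include <string>
#include <vector>

int main() {
    std::vector<std::vector<std::string>> threads = {
        {"LW  r1,0(r2)", "LW  r5,12(r1)"},   // T1
        {"ADD r7,r1,r4"},                    // T2
        {"XORI r5,r4,#12"},                  // T3
        {"SW  0(r7),r5"},                    // T4
    };
    std::vector<size_t> pc(threads.size(), 0);  // per-thread program counter
    size_t remaining = 5;                       // total instructions above

    for (int cycle = 0; remaining > 0; ++cycle) {
        size_t t = cycle % threads.size();      // fixed round-robin selection
        if (pc[t] < threads[t].size()) {
            std::printf("cycle %2d: T%zu issues %s\n",
                        cycle, t + 1, threads[t][pc[t]].c_str());
            ++pc[t];
            --remaining;
        } else {
            std::printf("cycle %2d: T%zu idle (bubble)\n", cycle, t + 1);
        }
    }
}
```

Because consecutive pipeline stages hold instructions from different threads, the non-bypassed pipeline needs no stall logic between them, at the cost of the single-thread latency noted above.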
CDC 6600 Peripheral Processors (Cray, 1965)
• first multithreaded hardware
• 10 "virtual" I/O processors
• fixed interleave on a simple pipeline
• pipeline has a 100 ns cycle time
• each processor executes one instruction every 1000 ns
• accumulator-based instruction set to reduce processor state

Denelcor HEP (Burton Smith, 1982)
• first commercial machine to use hardware threading in the main CPU
– 120 threads per processor
– 10 MHz clock rate
– up to 8 processors
– precursor to the Tera MTA (Multithreaded Architecture)

Tera MTA (Cray, 1997)
• up to 256 processors
• up to 128 active threads per processor
• processors and memory modules populate a sparse 3D-torus interconnection fabric
• flat, shared main memory
  – no data cache
  – sustains one main-memory access per cycle per processor
• 50 W/processor @ 260 MHz
• each processor supports 128 active hardware threads
  – 128 SSWs, 1024 target registers, 4096 general-purpose registers
• every cycle, one instruction from one active thread is launched into the pipeline
• the instruction pipeline is 21 cycles long
• at best, a single thread can issue one instruction every 21 cycles
  – the clock rate is 260 MHz, so the effective single-thread issue rate is 260/21 = 12.4 MHz

Multithreading
Coarse-grain multithreading:
– switch only if the current thread has a costly stall
  • e.g., a level-2 cache miss
– can accommodate slightly costlier switches
– less likely to slow down an individual thread
  • a thread is switched "off" only when it has a costly stall
– main disadvantages:
  • limited ability to overcome throughput losses
    – shorter stalls are ignored, and there may be plenty of those
  • issues instructions from a single thread
    – every switch involves emptying and restarting the instruction pipeline
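A toy model (mine, not from the slides) of the switch-on-miss behavior just described: run one thread until a modeled L2 miss, pay a pipeline-flush penalty, then switch. All latencies and miss patterns below are invented for illustration.

```cpp
// Toy model of coarse-grain multithreading: one thread runs until it takes
// a long-latency event (an "L2 miss" here), then the pipeline is flushed
// (fixed penalty) and execution switches to the other thread.
#include <cstdio>

struct Thread {
    int executed;    // instructions completed
    int until_miss;  // instructions until the next modeled L2 miss
};

int main() {
    const int kFlushPenalty = 4;     // cycles lost emptying the pipeline
    const int kInstructions = 12;    // per-thread work (arbitrary)
    Thread t[2] = {{0, 5}, {0, 7}};  // miss patterns are made up

    int cur = 0, cycle = 0;
    while (t[0].executed < kInstructions || t[1].executed < kInstructions) {
        Thread &th = t[cur];
        if (th.executed >= kInstructions) {
            cur ^= 1;                                  // this thread is done
        } else if (th.until_miss > 0) {
            ++th.executed; --th.until_miss; ++cycle;   // one instr per cycle
        } else {
            std::printf("cycle %3d: T%d misses, flush (%d cycles), switch\n",
                        cycle, cur, kFlushPenalty);
            cycle += kFlushPenalty;                    // flush-and-refill cost
            th.until_miss = 6;                         // toy reset
            cur ^= 1;                                  // switch thread
        }
    }
    std::printf("both threads done at cycle %d\n", cycle);
}
```

Note how any stall shorter than the modeled miss is simply absorbed by the running thread, which is exactly the throughput loss listed above as the main disadvantage.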
IBM PowerPC RS64-III (Pulsar)
• commercial coarse-grain multithreaded CPU
• based on the PowerPC, with a quad-issue, in-order, five-stage pipeline
• each physical CPU supports two virtual CPUs
• on an L2 cache miss, the pipeline is flushed and execution switches to the second thread
  – the short pipeline minimizes the flush penalty (4 cycles), small compared to the memory access latency
  – the pipeline is flushed to simplify exception handling

Simultaneous Multithreading (SMT)
Key idea: exploit ILP across multiple threads!
– share the CPU among multiple threads
– i.e., convert thread-level parallelism into more ILP
– exploit the following features of modern processors:
  • multiple functional units
    – modern processors typically have more functional units available than a single thread can utilize
  • register renaming and dynamic scheduling
    – multiple instructions from independent threads can co-exist and co-execute!

Multithreading: Illustration
(a) a superscalar processor with no multithreading
(b) a superscalar processor with coarse-grain multithreading
(c) a superscalar processor with fine-grain multithreading
(d) a superscalar processor with simultaneous multithreading (SMT)

From Superscalar to SMT
• SMT is an out-of-order superscalar extended with hardware to support multiple executing threads

Simultaneous Multithreaded Processor
• add multiple contexts and fetch engines to a wide out-of-order superscalar processor
  – [Tullsen, Eggers, Levy, University of Washington, 1995]
• the OOO instruction window already has most of the circuitry required to schedule from multiple threads
• any single thread can utilize the whole machine
• first examples:
  – Alpha 21464 (DEC/Compaq)
  – Pentium 4 (Intel)
  – POWER5 (IBM)
  – UltraSPARC IV (Sun)

SMT: Design Challenges
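One well-known SMT design challenge is the fetch/issue policy: deciding which threads fill the shared issue slots each cycle. Below is a minimal sketch (mine, not from the slides) of the simplest possible policy, round-robin slot filling across per-thread ready queues; real policies such as ICOUNT, from Tullsen et al.'s follow-up work, instead favor threads with fewer instructions in flight. The queue contents and issue width are invented.

```cpp
// Toy model of SMT issue: every cycle, up to kIssueWidth instructions are
// issued, drawn from several threads' ready queues. Unlike fine-grain
// multithreading, multiple threads can occupy slots in the SAME cycle.
#include <cstdio>
#include <deque>
#include <string>
#include <vector>

int main() {
    const int kIssueWidth = 4;                  // shared issue slots per cycle
    std::vector<std::deque<std::string>> q = {  // per-thread ready queues
        {"i0", "i1", "i2"},                     // thread 0
        {"j0"},                                 // thread 1
        {"k0", "k1"},                           // thread 2
    };

    for (int cycle = 0; ; ++cycle) {
        int issued = 0;
        for (int slot = 0; slot < kIssueWidth; ++slot) {
            // each slot takes the next ready instruction, scanning threads
            // in a rotating order so no single thread monopolizes the slots
            for (size_t t = 0; t < q.size(); ++t) {
                size_t cand = (cycle + slot + t) % q.size();
                if (!q[cand].empty()) {
                    std::printf("cycle %d slot %d: T%zu %s\n",
                                cycle, slot, cand, q[cand].front().c_str());
                    q[cand].pop_front();
                    ++issued;
                    break;
                }
            }
        }
        if (issued == 0) break;                 // all queues drained
    }
}
```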