High-Performance Processors’ Design Choices Ramon Canal
PD Fall 2013
1 High-Performance Processors’ Design Choices
1. Motivation
2. Multiprocessors
3. Multithreading
4. VLIW
2 Outline
• Motivation
• Multiprocessors
  – SISD, SIMD, MIMD, and MISD
  – Memory organization
  – Communication mechanisms
• Multithreading
• VLIW
3 Motivation
Instruction-Level Parallelism (ILP) — what we have covered so far:
– simple pipelining
– dynamic scheduling: scoreboarding and Tomasulo’s algorithm
– dynamic branch prediction
– multiple-issue architectures: superscalar, VLIW
– compiler techniques and software approaches
Bottom line: there just aren’t enough instructions that can actually be executed in parallel!
– instruction issue: limit on maximum issue count
– branch prediction: imperfect
– # registers: finite
– functional units: limited in number
– data dependencies: hard to detect dependencies via memory
4 So, What do we do?
Key Idea: increase the number of running processes
– multiple processes at a given “point” in time
  • i.e., at the granularity of one (or a few) clock cycles
  • not sufficient to have multiple processes at the OS level!
Two Approaches:
– multiple CPUs, each executing a distinct process
  • “Multiprocessors” or “Parallel Architectures”
– a single CPU executing multiple processes (“threads”)
  • “Multithreading” or “Thread-Level Parallelism”
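The second approach can be made concrete with a toy example. This is my own sketch (function names are invented, not from the slides): several software threads cooperate on one reduction, each computing an independent chunk. Whether those threads land on multiple CPUs or share one multithreaded CPU is exactly the distinction the two approaches draw.

```python
# Hedged sketch: thread-level parallelism expressed in software.
# On a multiprocessor the OS may place each thread on its own CPU;
# on a multithreaded CPU they share one pipeline. The code is identical.
import threading

def partial_sum(lo, hi, out, idx):
    # Each worker handles an independent chunk of the reduction.
    out[idx] = sum(range(lo, hi))

def sum_with_threads(n, workers=4):
    out = [0] * workers
    chunk = n // workers
    threads = [threading.Thread(
                   target=partial_sum,
                   args=(i * chunk,
                         n if i == workers - 1 else (i + 1) * chunk,
                         out, i))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(out)
```

Replacing `threading.Thread` with `multiprocessing.Process` would turn the same decomposition into the first approach (multiple CPUs, distinct processes).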
5 Taxonomy of Parallel Architectures
Flynn’s Classification:
– SISD: single instruction stream, single data stream
  • uniprocessor
– SIMD: single instruction stream, multiple data streams
  • same instruction executed by multiple processors
  • each has its own data memory
  • Ex: multimedia processors, vector architectures
– MISD: multiple instruction streams, single data stream
  • successive functional units operate on the same stream of data
  • rarely found in general-purpose commercial designs
  • special-purpose stream processors (digital filters, etc.)
– MIMD: multiple instruction streams, multiple data streams
  • each processor has its own instruction and data streams
  • most popular form of parallel processing
    – single-user: high performance for one application
    – multiprogrammed: running many tasks simultaneously (e.g., servers)
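The SISD/SIMD distinction can be sketched by counting instruction fetches. This toy model (my own, not real hardware) shows why SIMD pays for one fetch/decode per vector of lanes while SISD pays per datum:

```python
# Toy model of Flynn's SISD vs SIMD: same result, different number of
# instruction fetch/decode events. 'op' stands for the decoded instruction.

def run_sisd(op, xs, ys):
    fetches, out = 0, []
    for x, y in zip(xs, ys):
        fetches += 1          # one instruction fetched per datum
        out.append(op(x, y))
    return out, fetches

def run_simd(op, xs, ys, width=4):
    fetches, out = 0, []
    for i in range(0, len(xs), width):
        fetches += 1          # one fetch drives a whole vector of lanes
        out.extend(op(x, y) for x, y in
                   zip(xs[i:i + width], ys[i:i + width]))
    return out, fetches
```

For 8 elements and 4 lanes, SIMD performs 2 fetches where SISD performs 8, which is the amortization that makes vector/multimedia architectures attractive.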
6 Multiprocessor: Memory Organization
Centralized, shared-memory multiprocessor:
– usually few processors
– share a single memory & bus
– use large caches
7 Multiprocessor: Memory Organization
Distributed-memory multiprocessor:
– can support large processor counts
  • cost-effective way to scale memory bandwidth
  • works well if most accesses are to the local memory node
– requires an interconnection network
  • communication between processors becomes more complicated and slower
8 Communication Mechanisms
• Shared-Memory Communication
  – around for a long time, so well understood and standardized
    • memory-mapped
  – ease of programming when communication patterns are complex or dynamically varying
  – better use of bandwidth when items are small
  – Problem: cache coherence is harder
    • use “snoopy” and other coherence protocols
• Message-Passing Communication (e.g., Intel’s Knight… family)
  – simpler hardware, because keeping caches coherent is easier
  – communication is explicit, simpler to understand
    • focuses programmer attention on communication
  – synchronization: naturally associated with communication
    • fewer errors due to incorrect synchronization
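The two mechanisms can be contrasted in a few lines of code. A hedged sketch using Python threads as stand-ins for processors (names are mine): in the shared-memory style, communication is implicit through a common structure and synchronization is the programmer's explicit job; in the message-passing style, the send/receive is explicit and synchronization comes with the message.

```python
# Sketch: the same reduction under both communication mechanisms.
import threading
import queue

def shared_memory_sum(values):
    # Shared-memory style: workers write to a common structure;
    # the lock is the explicit synchronization the text warns about.
    total = {"sum": 0}
    lock = threading.Lock()
    def worker(chunk):
        s = sum(chunk)
        with lock:
            total["sum"] += s
    ts = [threading.Thread(target=worker, args=(values[i::2],))
          for i in range(2)]
    for t in ts: t.start()
    for t in ts: t.join()
    return total["sum"]

def message_passing_sum(values):
    # Message-passing style: communication is an explicit send;
    # synchronization is naturally bundled with the message.
    mailbox = queue.Queue()
    def worker(chunk):
        mailbox.put(sum(chunk))
    ts = [threading.Thread(target=worker, args=(values[i::2],))
          for i in range(2)]
    for t in ts: t.start()
    for t in ts: t.join()
    return mailbox.get() + mailbox.get()
```

Forgetting the lock in the first version is exactly the class of synchronization error the slide says message passing tends to avoid.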
9 Multiprocessor: Hybrid Organization • Use distributed-memory organization at top level • Each node itself may be a shared-memory multiprocessor (2-8 processors)
10 Multiprocessor: Hybrid Organization
• What about Big Data? Is it a “game changer”?
  – Next slides based on the following works:
    • M. Ferdman et al., “Clearing the Clouds”, ASPLOS’12
    • P. Lotfi-Kamran et al., “Scale-Out Processors”, ISCA’12
    • B. Grot et al., “Optimizing Datacenter TCO with Scale-Out Processors”, IEEE MICRO 2012
  – Next couple of slides © Prof. Babak Falsafi (EPFL)
11 Multiprocessors and Big Data
12–16 Scale-out Processors
• Small LLC. Just to capture instructions. • More cores for higher throughput • “Pods” for small distance to memory
17 Performance
• Iso-server-power comparison (20 MW)
18 Summary: Multiprocessors
• Need to tailor chip design to applications
  – Big Data applications are too big for data caches; the best solution is to eliminate them
  – Big Data applications need coarse-grain parallelism (i.e., at the request level)
  – Single-thread performance is still important for other applications (e.g., computation-intensive ones)
19 Multithreading
Threads: multiple processes that share code and data (and much of their address space)
• recently, the term has come to include processes that may run on different processors and even have disjoint address spaces, as long as they share the code
Multithreading: exploit thread-level parallelism within a processor
– fine-grain multithreading
  • switch between threads on each instruction!
– coarse-grain multithreading
  • switch to a different thread only if the current thread has a costly stall
    – e.g., switch only on a level-2 cache miss
20 Multithreading
• How can we guarantee no dependencies between instructions in a pipeline?
  – One way is to interleave the execution of instructions from different program threads on the same pipeline

Interleave 4 threads, T1–T4, on a non-bypassed 5-stage pipe:
  T1: LW   r1, 0(r2)
  T2: ADD  r7, r1, r4
  T3: XORI r5, r4, #12
  T4: SW   0(r7), r5
  T1: LW   r5, 12(r1)
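The schedule above can be sketched as a round-robin fetch sequence (a toy model; the function name is mine): with four threads sharing the pipe, a given thread fetches again only every fourth cycle, so by the time T1's second instruction reads r1, the earlier LW has nearly left the pipeline and no bypass is needed.

```python
# Toy model of fine-grain interleaving: which thread occupies the
# fetch stage on each cycle, with a fixed round-robin policy.

def round_robin_schedule(threads, cycles):
    """Return the thread issuing on each of the first `cycles` cycles."""
    return [threads[c % len(threads)] for c in range(cycles)]
```

Running it for five cycles reproduces the slide's issue order: T1, T2, T3, T4, then T1 again.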
21 Simple Multithreaded Pipeline
• Have to carry the thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage
22 Multithreading
Fine-grain multithreading
– switch between threads on each instruction!
– multiple threads executed in an interleaved manner
– interleaving is usually round-robin
– the CPU must be capable of switching threads every cycle!
  • fast, frequent switches
– main disadvantage:
  • slows down the execution of individual threads
  • that is, latency is traded off for better throughput
23 CDC 6600 Peripheral Processors (Cray, 1965)
• First multithreaded hardware
• 10 “virtual” I/O processors
• fixed interleave on a simple pipeline
• pipeline has a 100 ns cycle time
• each processor executes one instruction every 1000 ns
• accumulator-based instruction set to reduce processor state
24 Denelcor HEP (Burton Smith, 1982)
• First commercial machine to use hardware threading in the main CPU
  – 120 threads per processor
  – 10 MHz clock rate
  – up to 8 processors
  – precursor to the Tera MTA (Multithreaded Architecture)
25 Tera MTA (Cray, 1997)
• Up to 256 processors
• Up to 128 active threads per processor
• Processors and memory modules populate a sparse 3D torus interconnection fabric
• Flat, shared main memory
  – no data cache
  – sustains one main-memory access per cycle per processor
• 50 W/processor @ 260 MHz
26 Tera MTA (Cray)
• Each processor supports 128 active hardware threads
  – 128 SSWs, 1024 target registers, 4096 general-purpose registers
• Every cycle, one instruction from one active thread is launched into the pipeline
• The instruction pipeline is 21 cycles long
• At best, a single thread can issue one instruction every 21 cycles
  – the clock rate is 260 MHz, so the effective single-thread issue rate is 260/21 ≈ 12.4 MHz
27 Multithreading
Coarse-grain multithreading
– switch only if the current thread has a costly stall
  • e.g., a level-2 cache miss
– can accommodate slightly costlier switches
– less likely to slow down an individual thread
  • a thread is switched “off” only when it has a costly stall
– main disadvantages:
  • limited ability to overcome throughput losses
    – shorter stalls are ignored, and there may be plenty of those
  • issues instructions from a single thread
    – every switch involves emptying and restarting the instruction pipeline
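The switch-on-stall policy can be captured in a toy cycle counter (my own simplification — real pipelines overlap far more than this models): each thread runs until it misses, pays a flush penalty on every switch, and a thread parked on a miss only becomes runnable again after the miss latency elapses.

```python
# Toy model of coarse-grain multithreading. Traces are per-thread lists
# of "op" (1-cycle instruction) or "miss" (long-latency memory miss).

def coarse_grain_cycles(traces, miss_latency=20, switch_penalty=4):
    cycles = 0
    ready_at = [0] * len(traces)      # cycle at which each thread may run again
    traces = [list(t) for t in traces]
    cur = 0
    while any(traces):
        cand = [i for i in range(len(traces)) if traces[i]]
        runnable = [i for i in cand if ready_at[i] <= cycles]
        if not runnable:              # everyone is waiting on memory: stall
            cycles = min(ready_at[i] for i in cand)
            continue
        nxt = cur if cur in runnable else runnable[0]
        if nxt != cur:
            cycles += switch_penalty  # flush + restart the pipeline
            cur = nxt
        ev = traces[cur].pop(0)
        if ev == "op":
            cycles += 1
        else:                         # miss: park this thread, switch next turn
            ready_at[cur] = cycles + miss_latency
    return cycles
```

With a second thread available, most of a miss's latency is hidden behind the other thread's work; with a single thread, the full latency shows up as stall cycles — the throughput argument the slide makes.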
28 IBM PowerPC RS64-III (Pulsar)
• Commercial coarse-grain multithreaded CPU
• Based on PowerPC, with a quad-issue, in-order, five-stage pipeline
• Each physical CPU supports two virtual CPUs
• On an L2 cache miss, the pipeline is flushed and execution switches to the second thread
  – the short pipeline minimizes the flush penalty (4 cycles), small compared to memory access latency
  – flushing the pipeline simplifies exception handling
29 Simultaneous Multithreading (SMT)
Key Idea: exploit ILP across multiple threads!
– share the CPU among multiple threads
– i.e., convert thread-level parallelism into more ILP
– exploit the following features of modern processors:
  • multiple functional units
    – modern processors typically have more functional units available than a single thread can utilize
  • register renaming and dynamic scheduling
    – multiple instructions from independent threads can co-exist and co-execute!
30 Multithreading: Illustration
(a) A superscalar processor with no multithreading (b) A superscalar processor with coarse-grain multithreading (c) A superscalar processor with fine-grain multithreading (d) A superscalar processor with simultaneous multithreading (SMT)
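The four cases can be pictured as issue-slot grids: rows are cycles, columns are the issue slots of a 4-wide superscalar, letters name the thread owning a slot, and '.' is a wasted slot. The patterns below are illustrative (my own, not measured from any machine), but they show the trend the figure conveys: multithreading fills vertical waste (whole idle cycles), and only SMT also fills horizontal waste (idle slots within a cycle).

```python
# Illustrative issue-slot grids for the four multithreading flavors.

def utilization(grid):
    """Fraction of issue slots doing useful work."""
    slots = [s for row in grid for s in row]
    return sum(s != "." for s in slots) / len(slots)

no_mt  = ["AA..", "A...", "AAA.", "...."]   # horizontal + vertical waste
coarse = ["AA..", "A...", "BBB.", "B..."]   # stalls filled by another thread
fine   = ["AA..", "BBB.", "AAA.", "B..."]   # a different thread every cycle
smt    = ["AABB", "ABBB", "AAAB", "BBAA"]   # both kinds of waste filled
```

Utilization rises monotonically from (a) to (d): 37.5% for the single thread, up to 100% in the idealized SMT grid.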
31 From Superscalar to SMT
• SMT is an out-of-order superscalar extended with hardware to support multiple executing threads
32 Simultaneous Multithreaded Processor
33 Simultaneous Multithreaded Processor
• Add multiple contexts and fetch engines to a wide out-of-order superscalar processor
  – [Tullsen, Eggers, Levy, University of Washington, 1995]
• The OOO instruction window already has most of the circuitry required to schedule from multiple threads
• Any single thread can utilize the whole machine
• First examples:
  – Alpha 21464 (DEC/Compaq)
  – Pentium IV (Intel)
  – Power 5 (IBM)
  – UltraSPARC IV (Sun)
34 SMT: Design Challenges
• Dealing with a large register file – needed to hold multiple contexts
• Maintaining low overhead on clock cycle – fast instruction issue: choosing what to issue – instruction commit: choosing what to commit – keeping cache conflicts within acceptable bounds
• Power hungry!
35 Intel Pentium-4 Processor
• Hyperthreading = SMT
• Dual physical processors, each 2-way SMT
• Logical processors share nearly all resources of the physical processor
  – caches, execution units, branch predictors
• Die-area overhead of hyperthreading: ~5%
• When one logical processor is stalled, the other can make progress
  – no logical processor can use all entries in the queues when two threads are active
• A processor running only one active software thread runs at the same speed with or without hyperthreading
36 Pentium 4 Micro-architecture
Feature labels from the block diagram: 400 MHz System Bus, Advanced Transfer Cache, Advanced Dynamic Execution, Hyper-Pipelined Technology, Rapid Execution Engine, Execution Trace Cache, Streaming SIMD Extensions 2, Enhanced Floating-Point/Multi-Media
37 Pentium 4 Micro-architecture
What hardware complexity do OoO execution and SMT incur?
(Highlighted in the diagram: Advanced Dynamic Execution, Hyper-Pipelined Technology)
38 Sun/Oracle Ultrasparc T5 (2013)
• 16 cores @ 3.6 GHz
• 8 threads/core (128 threads/chip)
• Per core: 2-way OoO, 16 KB I$, 16 KB D$, 128 KB L2
• 8 MB shared L3
• 28 nm process
39 IBM Power 7
40 VLIW
• Very Long Instruction Word:
  – the compiler packs a fixed number of operations into a single VLIW “instruction”
  – the operations within a VLIW instruction are issued and executed in parallel
  – Examples:
    • high-end signal processors (TMS320C6201)
    • Intel’s Itanium
    • Transmeta Crusoe, Efficeon
41 VLIW
• VLIW (very long instruction word) processors use a long instruction word that contains a usually fixed number of operations that are fetched, decoded, issued, and executed synchronously.
• All operations specified within a VLIW instruction must be independent of one another.
• Some of the key issues of a (V)LIW processor:
  – a (very) long instruction word (up to 1024 bits per instruction)
  – each instruction consists of multiple independent parallel operations
  – each operation requires a statically known number of cycles to complete
  – a central controller that issues a long instruction word every cycle
  – multiple FUs connected through a global shared register file
42 VLIW and Superscalar
• sequential stream of long instruction words
• instructions scheduled statically by the compiler
• the number of simultaneously issued instructions is fixed at compile time
• instruction issue is less complicated than in a superscalar processor
• Disadvantage: VLIW processors cannot react to dynamic events (e.g., cache misses) with the same flexibility as superscalars.
• The number of operations in a VLIW instruction word is usually fixed.
• Padding VLIW instructions with no-ops is needed when the full issue bandwidth cannot be met; this increases code size. More recent VLIW architectures use a denser code format that allows the no-ops to be removed.
• VLIW is an architectural technique, whereas superscalar is a microarchitectural technique.
• VLIW processors take advantage of spatial parallelism.
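The compiler's packing job, and the no-op padding cost, can be sketched in a few lines. This is a hypothetical, greatly simplified packer (my own; real VLIW compilers do list scheduling with latencies and resource constraints): operations are tuples of register names, and "independent" here just means no register overlap within a word.

```python
# Hedged sketch: statically pack operations into fixed-width VLIW words.
# An op is (dest, src1, src2, ...). A new word starts on a register
# dependence with the current word, or when the word is full; unused
# slots are padded with "nop" (the code-size cost noted above).

def pack_vliw(ops, width=4):
    words, current, defs = [], [], set()
    for op in ops:
        dest, srcs = op[0], op[1:]
        hazard = dest in defs or any(s in defs for s in srcs)
        if hazard or len(current) == width:
            current += ["nop"] * (width - len(current))
            words.append(current)
            current, defs = [], set()
        current.append(op)
        defs.add(dest)
    if current:
        current += ["nop"] * (width - len(current))
        words.append(current)
    return words
```

Packing `r7 = r1 + r4` after the two ops that produce r1 and r4 forces a second word, and six of the eight slots end up as no-ops — a small example of why naive VLIW code is bulky and why denser encodings were introduced.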
43 VLIW and Superscalar
• Superscalar RISC solution – Based on sequential execution semantics – Compiler’s role is limited by the instruction set architecture – Superscalar hardware identifies and exploits parallelism
• VLIW solution – Based on parallel execution semantics – VLIW ISA enhancements support static parallelization – Compiler takes greater responsibility for exploiting parallelism – Compiler / hardware collaboration often resembles superscalar
44 VLIW and Superscalar • Advantages of pursuing VLIW architectures – Make wide issue & deep latency less expensive in hardware – Allow processor parallelism to scale with additional VLSI density
• Architect the processor to do well with in-order execution
  – enhance the ISA to allow static parallelization
  – use compiler technology to parallelize programs
    • loop unrolling, software pipelining, ...
  – however, a purely static VLIW is not appropriate for general-purpose use
45 Examples • Intel Itanium
• Transmeta Crusoe
• Almost all DSPs
– Texas Instruments
– ST Microelectronics
46 Intel Itanium, Itanium 2
47 IA-64 Encoding
Source: Intel/HP IA-64 Application ISA Guide 1.0
48 IA-64 Templates
Source: Intel/HP IA-64 Application ISA Guide 1.0
49 Intel's IA-64 ISA
• Intel 64-bit Architecture (IA-64) register model:
  – 128 64-bit general-purpose registers GR0–GR127 to hold values for integer and multimedia computations
    • each register has one additional NaT (Not a Thing) bit to indicate whether the value stored is valid
  – 128 82-bit floating-point registers FR0–FR127
    • registers f0 and f1 are read-only with values +0.0 and +1.0
  – 64 1-bit predicate registers PR0–PR63
    • the first register, pr0, is read-only and always reads 1 (true)
  – 8 64-bit branch registers BR0–BR7 to specify the target addresses of indirect branches
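The predicate registers enable predicated (if-converted) execution. A toy sketch in the IA-64 spirit (my own encoding, not real IA-64 semantics): a compare writes a 1-bit predicate, both arms of a would-be branch execute unconditionally, and each result commits only under its guarding predicate — so there is no branch to mispredict.

```python
# Toy model of predication: abs(x) without a branch.

def predicated_abs(x):
    p = int(x < 0)                   # cmp writes the predicate register
    neg, pos = -x, x                 # both arms execute unconditionally
    return p * neg + (1 - p) * pos   # each result commits under its predicate
```

Real IA-64 attaches a predicate to (almost) every instruction, e.g. `(p6) add r1 = r2, r3` only commits if predicate register p6 is true.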
50 Transmeta Crusoe and Efficeon
51 Overview
• HW/SW system for executing x86 code
  – VLIW processor
  – Code Morphing Software
• Underlying ISA and details invisible
  – convenient level of indirection
  – upgrades, fixes, freedom for changes
    • as long as a new CMS is implemented
  – anything else?
52 VLIW CPU
• Simple
  – in-order, very few interlocks
  – TM5400: 7 million transistors, 7-stage pipeline
  – low power, easier (and cheaper) to design
• TM5800
  – up to 1 GHz, 64 KB L1, 512 KB L2
  – 0.5–15 W @ 300–1000 MHz, 0.8–1.3 V running a typical multimedia app
53 Crusoe vs. PIII mobile (temperature)
54 VLIW CPU
• RISC-like ISA
  – molecule (long instruction)
    • 2 or 4 atoms (RISC-like instructions)
    • slot distribution?
  – 64 GPRs and 32 FPRs
    • dedicated registers for the x86 architectural registers
A 128-bit molecule holds four atoms, e.g.:
  FADD | ADD | LD | BRCC
routed to the floating-point unit, integer unit 1, the load/store unit, and the branch unit (a second integer unit is also available).
55 Conclusions
• VLIW
  – reduces hardware complexity at the cost of increasing compiler complexity
  – good for DSPs
  – not so good for GPPs (so far?)
56 Conclusions
• Multiprocessors
  – conventional superscalars are reaching ILP’s limits → exploit TLP or PLP
  – already-known technology
• Multithreading
  – good for extensive use of superscalar cores
  – more efficient than MP, but more complex too
57