High-Performance Processors’ Design Choices
Ramon Canal

PD Fall 2013

1 High-Performance Processors’ Design Choices

1 Motivation
2 Multiprocessors
3 Multithreading
4 VLIW

2 Outline

• Motivation
• Multiprocessors
  – SISD, SIMD, MIMD, and MISD
  – Memory organization
  – Communication mechanisms
• Multithreading
• VLIW

3 Motivation

Instruction-Level Parallelism (ILP): what we have covered so far:
– simple pipelining
– dynamic scheduling: scoreboarding and Tomasulo’s algorithm
– dynamic branch prediction
– multiple-issue architectures: superscalar, VLIW
– compiler techniques and software approaches

Bottom line: there just aren’t enough instructions that can actually be executed in parallel!
– instruction issue: limit on maximum issue count
– branch prediction: imperfect
– # registers: finite
– functional units: limited in number
– data dependencies: hard to detect dependencies via memory

4 So, What do we do?

Key Idea: increase the number of running processes
– multiple processes at a given “point” in time
  • i.e., at the granularity of one (or a few) clock cycles
  • not sufficient to have multiple processes at the OS level!

Two Approaches:
– multiple CPUs, each executing a distinct process
  • “Multiprocessors” or “Parallel Architectures”
– single CPU executing multiple processes (“threads”)
  • “Multi-threading” or “Thread-level parallelism”

5 Taxonomy of Parallel Architectures

Flynn’s Classification:
– SISD: single instruction stream, single data stream
  • uniprocessor
– SIMD: single instruction stream, multiple data streams
  • same instruction executed by multiple processors (see the sketch below)
  • each has its own data memory
  • Ex: multimedia processors, vector architectures
– MISD: multiple instruction streams, single data stream
  • successive functional units operate on the same stream of data
  • rarely found in general-purpose commercial designs
  • special-purpose stream processors (digital filters etc.)
– MIMD: multiple instruction streams, multiple data streams
  • each processor has its own instruction and data streams
  • most popular form of parallel processing
    – single-user: high-performance for one application
    – multiprogrammed: running many tasks simultaneously (e.g., servers)
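To make the SIMD idea concrete, here is a minimal C sketch (the saxpy kernel is an illustrative choice, not from the slides): every iteration applies the same operation to different data elements, which is exactly the pattern a vectorizing compiler maps onto SIMD instructions.

    /* Same instruction, multiple data streams: a vectorizing compiler
       (e.g., gcc -O3) turns this loop into SIMD instructions that
       process several elements of x and y at once. */
    void saxpy(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* one operation, many data items */
    }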

6 Multiprocessor: Memory Organization

Centralized shared-memory multiprocessor:
– usually few processors
– share a single memory and bus
– use large caches

7 Multiprocessor: Memory Organization

Distributed-memory multiprocessor:
– can support large processor counts
  • cost-effective way to scale memory bandwidth
  • works well if most accesses are to the local memory node
– requires an interconnection network
  • communication between processors becomes more complicated and slower

8 Communication Mechanisms

• Shared-Memory Communication
  – around for a long time, so well understood and standardized
    • memory-mapped
  – ease of programming when communication patterns are complex or dynamically varying
  – better use of bandwidth when items are small
  – Problem: cache coherence is harder
    • use “Snoopy” and other protocols
• Message-Passing Communication (e.g., Intel’s Knight… family)
  – simpler hardware: caches need not be kept coherent
  – communication is explicit, simpler to understand
    • focuses programmer attention on communication
  – synchronization: naturally associated with communication
    • fewer errors due to incorrect synchronization
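A minimal sketch of the two styles in C (all names are illustrative, and a POSIX pipe stands in for a real interconnect): shared memory communicates through ordinary loads and stores with explicit synchronization, while message passing makes the communication itself explicit and gets synchronization along with it.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Shared-memory style: threads communicate via ordinary stores and
       loads; synchronization (the lock) must be added explicitly. */
    static int shared_value;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *producer(void *arg) {
        pthread_mutex_lock(&lock);
        shared_value = 42;            /* communication is implicit...   */
        pthread_mutex_unlock(&lock);  /* ...synchronization is explicit */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        pthread_join(t, NULL);
        printf("shared: %d\n", shared_value);

        /* Message-passing style: send/receive make communication
           explicit, and the receive doubles as synchronization. */
        int fd[2], msg = 42, got;
        pipe(fd);
        write(fd[1], &msg, sizeof msg);   /* "send"                   */
        read(fd[0], &got, sizeof got);    /* "receive" = implicit sync */
        printf("message: %d\n", got);
        return 0;
    }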

9 Multiprocessor: Hybrid Organization
• Use distributed-memory organization at the top level
• Each node may itself be a shared-memory multiprocessor (2-8 processors)

10 Multiprocessor: Hybrid Organization

• What about Big Data? Is it a “game changer”?
  – Next slides based on the following works:
    • M. Ferdman et al., “Clearing the Clouds”, ASPLOS’12
    • P. Lotfi-Kamran et al., “Scale-Out Processors”, ISCA’12
    • B. Grot et al., “Optimizing Datacenter TCO with Scale-Out Processors”, IEEE Micro 2012
  – Next couple of slides © Prof. Babak Falsafi (EPFL)

11 Multiprocessors and Big Data

16 Scale-out Processors

• Small LLC, just large enough to capture instructions
• More cores for higher throughput
• “Pods” to keep the distance to memory small

17 Performance

• Iso server power (20 MW), i.e., compared at an equal power budget

18 Summary: Multiprocessors

• Need to tailor chip design to applications
  – Big Data applications are too big for data caches; the best solution is to eliminate them
  – Big Data applications need coarse-grain parallelism (i.e., at the request level)

  – Single-thread performance is still important for other applications (i.e., computation-intensive ones)

19 Multithreading

Threads: multiple processes that share code and data (and much of their address space)
• recently, the term has come to include processes that may run on different processors and even have disjoint address spaces, as long as they share the code

Multithreading: exploit thread-level parallelism within a processor
– fine-grain multithreading
  • switch between threads on each instruction!
– coarse-grain multithreading
  • switch to a different thread only if the current thread has a costly stall
    – e.g., switch only on a level-2 cache miss

20 Multithreading

• How can we guarantee no dependencies between instructions in a pipeline?
  – One way is to interleave execution of instructions from different program threads on the same pipeline

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:

T1: LW   r1, 0(r2)
T2: ADD  r7, r1, r4
T3: XORI r5, r4, #12
T4: SW   0(r7), r5
T1: LW   r5, 12(r1)

21 Simple Multithreaded Pipeline

• Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
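A minimal sketch of that bookkeeping (field names are illustrative): the thread id is latched into every pipeline register alongside the instruction, so each stage reads and writes the right thread's state.

    #include <stdint.h>

    /* Each inter-stage latch carries the thread id with the instruction,
       so PC/register accesses at every stage hit the right context. */
    typedef struct {
        uint32_t insn;   /* the instruction in flight       */
        int      tid;    /* which thread's state it touches */
    } pipe_reg_t;

    pipe_reg_t if_id, id_ex, ex_mem, mem_wb;   /* 5-stage pipe latches */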

22 Multithreading

Fine-grain multithreading
– switch between threads on each instruction!
– multiple threads executed in interleaved manner
– interleaving is usually round-robin
– CPU must be capable of switching threads on every cycle!
  • fast, frequent switches
– main disadvantage:
  • slows down the execution of individual threads
  • that is, latency is traded off for better throughput
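A sketch of the round-robin selection (names are illustrative): with at least as many ready threads as the pipeline has hazard-prone stages, consecutive instructions in the pipe always come from different threads, so no interlocks or bypasses are needed.

    #define NTHREADS 4   /* enough threads to cover the pipeline depth */

    /* Pick the thread to fetch from this cycle, round-robin. */
    static int next_thread = 0;

    int select_thread(void) {
        int t = next_thread;
        next_thread = (next_thread + 1) % NTHREADS;
        return t;
    }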

23 CDC 6600 Peripheral Processors (Cray, 1965)

• First multithreaded hardware
• 10 “virtual” I/O processors
• fixed interleave on simple pipeline
• pipeline has 100 ns cycle time
• each processor executes one instruction every 1000 ns
• accumulator-based instruction set to reduce processor state

24 Denelcor HEP (Burton Smith, 1982)

• First commercial machine to use hardware threading in main CPU
  – 120 threads per processor
  – 10 MHz
  – up to 8 processors
  – precursor to Tera MTA (Multithreaded Architecture)

25 Tera MTA (Cray, 1997)

• Up to 256 processors
• Up to 128 active threads per processor
• Processors and memory modules populate a sparse 3D torus interconnection fabric
• Flat, shared main memory
  – no data cache
  – sustains one main memory access per cycle per processor
• 50 W/processor @ 260 MHz

26 Tera MTA (Cray)

• Each processor supports 128 active hardware threads
  – 128 SSWs (stream status words), 1024 target registers, 4096 general-purpose registers
• Every cycle, one instruction from one active thread is launched into the pipeline
• Instruction pipeline is 21 cycles long
• At best, a single thread can issue one instruction every 21 cycles
  – clock rate is 260 MHz, so the effective single-thread issue rate is 260/21 ≈ 12.4 MHz

27 Multithreading

Coarse-grain multithreading
– switch only if the current thread has a costly stall
  • e.g., a level-2 cache miss
– can accommodate slightly costlier switches
– less likely to slow down an individual thread
  • a thread is switched “off” only when it has a costly stall
– main disadvantage:
  • limited ability to overcome throughput losses
    – shorter stalls are ignored, and there may be plenty of those
  • issues instructions from a single thread
    – every switch involves emptying and restarting the instruction pipeline
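A sketch of the switch policy (flush_pipeline and the event hook are hypothetical helpers, not from any real design): the running thread keeps the pipeline until a costly stall, and every switch pays the pipeline-refill penalty noted above.

    #define NTHREADS 2

    extern void flush_pipeline(void);   /* hypothetical: drop in-flight insns */

    static int current_thread = 0;

    /* Called only on costly stalls (e.g., an L2 miss); short stalls are
       simply endured, which is the throughput limitation noted above. */
    void on_costly_stall(void) {
        flush_pipeline();               /* empty and restart the pipe */
        current_thread = (current_thread + 1) % NTHREADS;
    }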

28 IBM PowerPC RS64-III (Pulsar)
• Commercial coarse-grain multithreading CPU
• Based on PowerPC with quad-issue, in-order, five-stage pipeline
• Each physical CPU supports two virtual CPUs
• On an L2 cache miss, the pipeline is flushed and execution switches to the second thread
  – short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
  – flushing the pipeline also simplifies exception handling

29 Simultaneous Multithreading (SMT)

Key Idea: exploit ILP across multiple threads!
– share the CPU among multiple threads
– i.e., convert thread-level parallelism into more ILP
– exploit the following features of modern processors:
  • multiple functional units
    – modern processors typically have more functional units available than a single thread can utilize
  • register renaming and dynamic scheduling
    – multiple instructions from independent threads can co-exist and co-execute!
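A sketch of per-cycle SMT issue (the helper names are hypothetical): unlike fine- or coarse-grain multithreading, one cycle's issue slots can be filled with ready instructions drawn from several threads at once.

    #define ISSUE_WIDTH 4
    #define NTHREADS    2

    extern int  has_ready_insn(int tid);   /* hypothetical scheduler query  */
    extern void issue_from(int tid);       /* hypothetical: dispatch one op */

    /* One cycle: fill up to ISSUE_WIDTH slots from any mix of threads. */
    void issue_cycle(void) {
        int issued = 0;
        for (int t = 0; t < NTHREADS && issued < ISSUE_WIDTH; t++)
            while (issued < ISSUE_WIDTH && has_ready_insn(t)) {
                issue_from(t);
                issued++;
            }
    }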

30 Multithreading: Illustration


(a) A superscalar processor with no multithreading
(b) A superscalar processor with coarse-grain multithreading
(c) A superscalar processor with fine-grain multithreading
(d) A superscalar processor with simultaneous multithreading (SMT)

31 From Superscalar to SMT

• SMT is an out-of-order superscalar extended with hardware to support multiple executing threads

32 Simultaneous Multithreaded Processor

33 Simultaneous Multithreaded Processor
• Add multiple contexts and fetch engines to a wide out-of-order superscalar processor
  – [Tullsen, Eggers, Levy, University of Washington, 1995]
• OOO already has most of the circuitry required to schedule from multiple threads
• Any single thread can utilize the whole machine

• First examples:
  – Alpha 21464 (DEC/Compaq)
  – Pentium IV (Intel)
  – Power 5 (IBM)
  – UltraSPARC IV (Sun)

34 SMT: Design Challenges

• Dealing with a large register file
  – needed to hold multiple contexts

• Maintaining low overhead on the clock cycle
  – fast instruction issue: choosing what to issue
  – instruction commit: choosing what to commit
  – keeping cache conflicts within acceptable bounds

• Power hungry!

35 Intel Pentium-4 Processor
• Hyperthreading = SMT
• Dual physical processors, each 2-way SMT
• Logical processors share nearly all resources of the physical processor
  – caches, execution units, branch predictors
• Die area overhead of hyperthreading ~5%
• When one logical processor is stalled, the other can make progress
  – no logical processor can use all entries in the queues when two threads are active
• A processor running only one active software thread runs at the same speed with or without hyperthreading

36 Pentium 4 Micro-architecture

[Block diagram labels: 400 MHz System Bus, Advanced Transfer Cache, Advanced Dynamic Execution, Hyper Pipelined Technology, Rapid Execution Engine, Streaming SIMD Extensions 2, Enhanced Floating Point / Multi-Media]

37 Pentium 4 Micro-architecture

What hardware complexity do OoO and SMT incur?

[Block diagram highlights: Advanced Dynamic Execution, Hyper Pipelined Technology]

38 Sun/Oracle Ultrasparc T5 (2013)

16 cores @ 3.6 GHz, 8 threads/core (128 threads/chip)

Each core: 2-way OoO, 16 KB I$, 16 KB D$, 128 KB L2; 8 MB shared L3

28 nm process

39 IBM Power 7

40 VLIW

• Very Long Instruction Word:
  – the compiler packs a fixed number of operations into a single VLIW “instruction”
  – the operations within a VLIW instruction are issued and executed in parallel

– Examples:
  • high-end signal processors (TMS320C6201)
  • Intel’s Itanium
  • Transmeta Crusoe, Efficeon

41 VLIW
• VLIW (very long instruction word) processors use a long instruction word that contains a usually fixed number of operations that are fetched, decoded, issued, and executed synchronously.
• All operations specified within a VLIW instruction must be independent of one another.
• Some of the key issues of a (V)LIW processor:
  – (very) long instruction word (up to 1024 bits per instruction),
  – each instruction consists of multiple independent parallel operations,
  – each operation requires a statically known number of cycles to complete,
  – a central controller that issues a long instruction word every cycle,
  – multiple FUs connected through a global shared register file.

42 VLIW and Superscalar
• sequential stream of long instruction words
• instructions scheduled statically by the compiler
• number of simultaneously issued instructions is fixed at compile time
• instruction issue is less complicated than in a superscalar processor
• Disadvantage: VLIW processors cannot react to dynamic events (e.g., cache misses) with the same flexibility as superscalars.
• The number of instructions in a VLIW instruction word is usually fixed.
• Padding VLIW instructions with no-ops is needed when the full issue bandwidth cannot be met; this increases code size (see the sketch below). More recent VLIW architectures use a denser code format which allows the no-ops to be removed.
• VLIW is an architectural technique, whereas superscalar is a microarchitectural technique.
• VLIW processors take advantage of spatial parallelism.
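To make the padding point concrete, here is a sketch with a hypothetical 4-slot instruction word (the encoding is invented for illustration): when the compiler cannot find enough independent operations for a cycle, the unused slots still occupy encoding space.

    #include <stdint.h>

    #define NOP 0u                  /* explicit empty-slot encoding */

    /* Hypothetical 4-slot VLIW word; all four slots issue together. */
    typedef struct {
        uint32_t alu0, alu1, mem, br;
    } vliw_word_t;

    /* Only two independent ops were found for this cycle, so the other
       two slots are padded with no-ops, inflating code size. */
    vliw_word_t w = { /* alu0 */ 0x12345678u, /* alu1 */ NOP,
                      /* mem  */ 0x9abcdef0u, /* br   */ NOP };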

43 VLIW and Superscalar

• Superscalar RISC solution
  – based on sequential execution semantics
  – compiler’s role is limited by the instruction set architecture
  – superscalar hardware identifies and exploits parallelism

• VLIW solution
  – based on parallel execution semantics
  – VLIW ISA enhancements support static parallelization
  – compiler takes greater responsibility for exploiting parallelism
  – compiler / hardware collaboration often resembles superscalar

44 VLIW and Superscalar
• Advantages of pursuing VLIW architectures
  – make wide issue & deep latency less expensive in hardware
  – allow processor parallelism to scale with additional VLSI density

• Architect the processor to do well with in-order execution
  – enhance the ISA to allow static parallelization
  – use compiler technology to parallelize the program
    • loop unrolling, software pipelining, ... (see the sketch below)
  – however, a purely static VLIW is not appropriate for general-purpose use
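For instance, a sketch of loop unrolling in C (assuming, for brevity, that n is a multiple of 4): the four independent operations per iteration are exactly what a static compiler can schedule into the parallel slots of a long instruction word.

    /* Unrolled by 4: the four multiplies are independent, so a VLIW
       compiler can place them in the same (or adjacent) wide words. */
    void scale(int n, float a, float *x) {
        for (int i = 0; i < n; i += 4) {
            x[i]     *= a;
            x[i + 1] *= a;
            x[i + 2] *= a;
            x[i + 3] *= a;
        }
    }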

45 Examples

• Intel Itanium

• Transmeta Crusoe

• Almost all DSPs

– Texas Instruments

– ST Microelectronics

46 Intel Itanium, Itanium 2

47 IA-64 Encoding

Source: Intel/HP IA-64 Application ISA Guide 1.0

48 IA-64 Templates

Source: Intel/HP IA-64 Application ISA Guide 1.0

49 Intel's IA-64 ISA

• Intel 64-bit Architecture (IA-64) register model:
  – 128 64-bit general-purpose registers GR0-GR127 to hold values for integer and multimedia computations
    • each register has one additional NaT (Not a Thing) bit to indicate whether the value stored is valid
  – 128 82-bit floating-point registers FR0-FR127
    • registers f0 and f1 are read-only with values +0.0 and +1.0
  – 64 1-bit predicate registers PR0-PR63
    • the first register, p0, is read-only and always reads 1 (true)
  – 8 64-bit branch registers BR0-BR7 to specify the target addresses of indirect branches
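The same register model restated as a C sketch (the layout is illustrative; C has no 82-bit floating-point type, so long double stands in):

    #include <stdint.h>

    /* IA-64 architectural register state as listed above (sketch). */
    typedef struct {
        uint64_t    gr[128];    /* GR0-GR127, general purpose           */
        uint8_t     nat[128];   /* one NaT ("Not a Thing") bit per GR   */
        long double fr[128];    /* FR0-FR127; f0 = +0.0, f1 = +1.0 (RO) */
        uint8_t     pr[64];     /* PR0-PR63, 1-bit predicates; p0 == 1  */
        uint64_t    br[8];      /* BR0-BR7, indirect branch targets     */
    } ia64_regs_t;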

50 Transmeta Crusoe and Efficeon

51 Overview

• HW/SW system for executing x86 code
  – VLIW processor
  – Code Morphing Software
• Underlying ISA and details invisible
  – convenient level of indirection
  – upgrades, fixes, freedom for changes
    • as long as new CMS is implemented
  – anything else?

52 VLIW CPU

• Simple
  – in-order, very few interlocks
  – TM5400: 7 million transistors, 7-stage pipeline
  – low power, easier (and cheaper) to design
• TM5800
  – up to 1 GHz, 64 KB L1, 512 KB L2
  – 0.5-1.5 W @ 300-1000 MHz, 0.8-1.3 V running a typical multimedia app

53 Crusoe vs. PIII mobile (temperature)

54 VLIW CPU
• RISC-like ISA
  – molecule (long instruction)
    • 2 or 4 atoms (RISC-like instructions)
    • slot distribution?
  – 64 GPRs and 32 FPRs
    • dedicated regs for x86 architectural regs

128-bit molecule: FADD | ADD | LD | BRCC
– the four atoms are dispatched in lock-step to the functional units: floating-point unit, INT unit 1, INT unit 2, load/store unit, branch unit
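A sketch of that molecule as a C struct (field names and layout are illustrative, not Transmeta's actual encoding): four 32-bit atoms travel together and each is routed to its own functional unit.

    #include <stdint.h>

    /* One 128-bit molecule = four RISC-like atoms, issued together. */
    typedef struct {
        uint32_t fadd;   /* -> floating-point unit */
        uint32_t add;    /* -> INT unit 1 (or 2)   */
        uint32_t ld;     /* -> load/store unit     */
        uint32_t brcc;   /* -> branch unit         */
    } molecule_t;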

55 Conclusions

• VLIW
  – reduces hardware complexity at the cost of increasing compiler complexity
  – good for DSPs
  – not so good for GPPs (so far?)

56 Conclusions

• Multiprocessors
  – conventional superscalars are reaching ILP’s limits → exploit TLP or PLP
  – already known technology
• Multithreading
  – good for extensive use of superscalar cores
  – more efficient than MP but more complex too
