Chip Multiprocessing and the Cell Broadband Engine

Cell Broadband Engine Chip Multiprocessing and the Cell Broadband Engine Michael Gschwind © 2006 IBM Corporation ACM Computing Frontiers 2006 Cell Broadband Engine Computer Architecture at the turn of the millenium “The end of architecture” was being proclaimed – Frequency scaling as performance driver – State of the art microprocessors •multiple instruction issue •out of order architecture •register renaming •deep pipelines Little or no focus on compilers – Questions about the need for compiler research Academic papers focused on microarchitecture tweaks 2 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine The age of frequency scaling Scaled Device Voltage, V / α 1 WIRING tox/α bips W/α 0.9 GATE n+ n+ source drain 0.8 L/α xd/α p substrate, doping α*NA 0.7 Performance SCALING: RESULTS: Voltage: V/α Higher Density: ~α2 0.6 Oxide: tox /α Higher Speed: ~α Wire width: W/α Power/ckt: ~1/α2 0.5 Gate width: L/α Power Density: 37 34 31 28 25 22 19 16 13 10 7 ~Constant Total FO4 Per Stage Diffusion: xd /α Substrate: α *NA Source: Dennard et al., JSSC 1974. deeper pipeline 3 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Frequency scaling A trusted standby – Frequency as the tide that raises all boats – Increased performance across all applications – Kept the industry going for a decade Massive investment to keep going – Equipment cost – Power increase – Manufacturing variability – New materials 4 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine … but reality was closing in! 1000 100 1 10 0.8 1 0.6 0.4 0.1 bips 0.2 bips^3/W 0.01 Relative to Optimal FO4 Optimal to Relative Active Power 0 Leakage Power 37 34 31 28 25 22 19 16 13 10 7 0.001 Total FO4 Per Stage 1 0.1 0.01 5 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Power crisis 1000 The power crisis is not “natural” – Created by deviating from 100 ideal scaling theory Tox(Å) – Vdd and Vt not scaled by α • additional performance with increased voltage 10 Laws of physics Vdd classic scaling – Pchip = ntransistors * Ptransistor Marginal performance gain 1 Vt per transistor low – significant power increase – decreased power/performance 0.1 efficiency 1 0.1 0.01 gate length Lgate (µm) 6 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine The power inefficiency of deep pipelining Power-performance optimal Performance optimal 1 0.8 0.6 0.4 bips 0.2 bips^3/W deeper pipeline Relative to Optimal FO4 Relative 0 37 34 31 28 25 22 19 16 13 10 7 Total FO4 Per Stage Source: Srinivasan et al., MICRO 2002 Deep pipelining increases number of latches and switching rate ⇒ power increases with at least f2 Latch insertion delay limits gains of pipelining 7 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine 100 Hitting the memory wall MFLOPS Memory latency 80 60 40 Latency gap 20 – Memory speedup lags behind normalized 0 processor speedup -20 1990 2003 – limits ILP -40 -60 Chip I/O bandwidth gap Source: McKee, Computing Frontiers 2004 – Less bandwidth per MIPS Latency gap as application bandwidth gap usable bandwidth avg. request size no. of inflight roundtrip latency requests – Typically (much) less than chip I/O bandwidth Source: Burger et al., ISCA 1996 8 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Our Y2K challenge 10× performance of desktop systems 1 TeraFlop / second with a four-node configuration 1 Byte bandwidth per 1 Flop – “golden rule for balanced supercomputer design” scalable design across a range of design points mass-produced and low cost 9 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Cell Design Goals Provide the platform for the future of computing – 10× performance of desktop systems shipping in 2005 Computing density as main challenge –Dramatically increase performance per X • X = Area, Power, Volume, Cost,… Single core designs offer diminishing returns on investment – In power, area, design complexity and verification cost 10 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Necessity as the mother of invention Increase power-performance efficiency –Simple designs are more efficient in terms of power and area Increase memory subsystem efficiency –Increasing data transaction size –Increase number of concurrently outstanding transactions ⇒ Exploit larger fraction of chip I/O bandwidth Use CMOS density scaling –Exploit density instead of frequency scaling to deliver increased aggregate performance Use compilers to extract parallelism from application –Exploit application parallelism to translate aggregate performance to application performance 11 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Exploit application-level parallelism Data-level parallelism – SIMD parallelism improves performance with little overhead Thread-level parallelism – Exploit application threads with multi-core design approach Memory-level parallelism – Improve memory access efficiency by increasing number of parallel memory transactions Compute-transfer parallelism – Transfer data in parallel to execution by exploiting application knowledge 12 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Concept phase ideas modular design large architected design reuse register set Chip control cost of high frequency coherency and low power! Multiprocessor reverse Vdd efficient use of scaling for low memory interface power 13 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Cell Architecture SPE Heterogeneous multi- SPU SPU SPU SPU SPU SPU SPU SPU core system SXU SXU SXU SXU SXU SXU SXU SXU architecture LS LS LS LS LS LS LS LS – Power Processor Element for control tasks SMF SMF SMF SMF SMF SMF SMF SMF – Synergistic Processor 16B/cycle Elements for data- EIB (up to 96B/cycle) intensive processing 16B/cycle PPE 16B/cycle 16B/cycle (2x) Synergistic Processor PPU MIC BIC Element (SPE) consists of L2 L1 PXU – Synergistic Processor 32B/cycle 16B/cycle Unit (SPU) Dual FlexIOTM TM – Synergistic Memory Flow XDR Control (SMF) 64-bit Power Architecture with VMX 14 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Shifting the Balance of Power with Cell Broadband Engine Data processor instead of control system – Control-centric code stable over time – Big growth in data processing needs • Modeling • Games • Digital media • Scientific applications Today’s architectures are built on a 40 year old data model – Efficiency as defined in 1964 – Big overhead per data operation – Parallelism added as an after-thought 15 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Powering Cell – the Synergistic Processor Unit 64B Fetch ILB Local Store 128B SMF Issue / Single Port Branch SRAM 2 instructions 16B VRF 16B x 2 16B x 3 x 2 V F P U V F X U PERM LSU Source: Gschwind et al., Hot Chips 17, 2005 16 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Density Computing in SPEs Today, execution units only fraction of core area and power – Bigger fraction goes to other functions • Address translation and privilege levels • Instruction reordering • Register renaming • Cache hierarchy Cell changes this ratio to increase performance per area and power – Architectural focus on data processing • Wide datapaths • More and wide architectural registers • Data privatization and single level processor-local store • All code executes in a single (user) privilege level • Static scheduling 17 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Streamlined Architecture Architectural focus on simplicity – Aids achievable operating frequency – Optimize circuits for common performance case – Compiler aids in layering traditional hardware functions – Leverage 20 years of architecture research Focus on statically scheduled data parallelism – Focus on data parallel instructions • No separate scalar execution units • Scalar operations mapped onto data parallel dataflow – Exploit wide data paths • data processing • instruction fetch – Address impediments to static scheduling • Large register set • Reduce latencies by eliminating non-essential functionality 18 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine SPE Highlights LS DP RISC organization SFP – 32 bit fixed instructions LS – Load/store architecture FXU EVN – Unified register file User-mode architecture FWD – No page translation within SPU FXU ODD LS SIMD dataflow

Chip Multiprocessing and the Cell Broadband Engine

Ray Tracing on the Cell Processor

IBM System P5 Quad-Core Module Based on POWER5+ Technology: Technical Overview and Introduction

Implementing Powerpc Linux on System I Platform

From Blue Gene to Cell Power.Org Moscow, JSCC Technical Day November 30, 2005

IBM Powerpc 970 (A.K.A. G5)

IBM Power Systems Performance Report Apr 13, 2021

IBM's POWER5 Micro Processor Design and Methodology

Microprocessor

Introduction to the Cell Multiprocessor

IBM Power Roadmap

POWER8: the First Openpower Processor

Power8 Quser Mspl Nov 2015 Handout