<<

Systems and Technology Group

Power Efficient Design and the Processor

H. , Ph. D. hofstee@us..com Architect, Cell Synergistic Processor Element IBM Systems and Technology Group Austin, Texas

© 2005 IBM Corporation Systems and Technology Group

Agenda

 Power Efficient Processor Architecture  System Trends  Cell Processor Overview

2 © 2005 IBM Corporation Systems and Technology Group

Power Efficient Architecture

3 © 2005 IBM Corporation Systems and Technology Group

Limiters to Processor Performance

 Power wall  Memory wall  Frequency wall

4 © 2005 IBM Corporation Systems and Technology Group Power Wall (Voltage Wall) 1000  Power components: 100 )

2 Active – Active power Power – Passive power 10 Passive Power • Gate leakage 1 • Sub-threshold

leakage (source- 0.1 drain leakage)

0.01 Power Density (W/cm Power Density 1994 2004 Gate10S Tox=11A Stack 0.001 1 0.1 0.01 NET: INCREASINGGate Length PERFORMANCE (microns) REQUIRES INCREASING EFFICIENCY Gate dielectric approaching a fundamental limit (a few atomic layers)

5 © 2005 IBM Corporation Systems and Technology Group

Memory wall

 Main memory now nearly 1000 cycles from the processor – Situation worse with (on-chip) SMP  Memory latency penalties drive inefficiency in the design – Expensive and sophisticated hardware to try and deal with it – Programmers that try to gain control of cache content, but are hindered by the hardware mechanisms  Latency induced bandwidth limitations – Much of the bandwidth to memory in systems can only be used speculatively – Diminishing returns from added bandwidth on traditional systems

6 © 2005 IBM Corporation Systems and Technology Group

Frequency wall

 Increasing frequencies and deeper pipelines have reached diminishing returns on performance  Returns negative if power is taken into account  Results of studies depend on issue width of processor – The wider the processor the slower it wants to be – Simultaneous Multithreading helps to use issue slots efficiently  Results depend on number of architected registers and workload – More registers tolerate deeper pipeline – Fewer random branches in application tolerates deeper pipelines

7 © 2005 IBM Corporation Systems and Technology Group

Microprocessor Efficiency

 Recent History: –Gelsinger’s law • 1.4x more performance for 2x more transistors –Hofstee’s corollary • 1/1.4x efficiency loss in every generation • Examples: Cache size, OoO, Superscalar, etc. etc.

 Re-examine microarchitecture with performance per transistor as metric –Pipelining is last clear win

8 © 2005 IBM Corporation Systems and Technology Group

Attacking the Performance Walls

 Multi-Core Non-Homogeneous Architecture – Control Plane vs. Data Plane processors – Attacks Power Wall  3-level Model of Memory – Main Memory, Local Store, Registers – Attacks Memory Wall  Large Shared & SW Controlled Branching – Allows deeper pipelines (11FO4 … helps power!) – Attacks Frequency Wall

9 © 2005 IBM Corporation Systems and Technology Group

System Trends

10 © 2005 IBM Corporation Systems and Technology Group

System Trends toward Integration

Memory Northbridge Memory Cell Accel Processor Processor IO Southbridge IO

 Increased integration is driving processors to take on many functions typically associated with systems – Integration forces processor developers to address off- load and acceleration in the design of the processor – Integration of bridge chip functionality  Virtualization technology is used to support non- homogeneous environments

11 © 2005 IBM Corporation Systems and Technology Group Next Generation Processors address Programming Complexity and Trend Towards Programmable Offload Engines with a Simpler System Alternative

Streaming 64b Power Mem. Graphics Processor GPU Processor Contr.

NIC Network Synergistic Processor Processor CPU CPU Security Security Processor .. Media . Media Processor Config. … Synergistic IO … Processor Hardwired Programmable Function ASIC Cell

12 © 2005 IBM Corporation Systems and Technology Group

“Outward Facing” Aspects of Cell  Cell is designed to be responsive  .. to human user – Real-time response – Supports rich visual interfaces  .. to network – Flexible, can support new standards – High-bandwidth – Content protection, privacy & security  Contrast to traditional processors which evolved from “batch processing” mentality (inward focused).

13 © 2005 IBM Corporation Systems and Technology Group

Cell Overview

14 © 2005 IBM Corporation Systems and Technology Group Key Attributes of Cell  Cell is Multi-Core – Contains 64-bit Power Architecture TM – Contains 8 Synergistic Processor Elements (SPE)  Cell is a Flexible Architecture – Multi-OS support (including ) with Virtualization technology – Path for OS, legacy apps, and software development  Cell is a Broadband Architecture – SPE is RISC architecture with SIMD organization and Local Store – 128+ concurrent transactions to memory per processor  Cell is a Real-Time Architecture – Resource allocation (for Bandwidth Measurement) – Locking Caches (via Replacement Management Tables)  Cell is a Security Enabled Architecture – SPE dynamically reconfigurable as secure processors

15 © 2005 IBM Corporation Systems and Technology Group

Cell Chip Block Diagram SPU SPE

SXU SXU SXU SXU SXU SXU SXU SXU

LS LS LS LS LS LS LS LS

SMF SMF SMF SMF SMF SMF SMF SMF

EIB (up to 96 /cycle)

PPE L2 MIC BIC

TM Dual XDRTM FlexIO L1 PXU

16 © 2005 IBM Corporation Systems and Technology Group Cell Prototype Die (Pham et al, ISSCC 2005)

S S S S P P P P U U U U P M P R B I U MIB R I C A C C

S S S S P P P P U U U U

17 © 2005 IBM Corporation Systems and Technology Group Cell Highlights

 Observed clock speed – > 4 GHz  Peak performance (single precision) – > 256 GFlops  Peak performance (double precision) – >26 GFlops  Area 221 mm2  Technology 90nm SOI  Total # of transistors 234M

18 © 2005 IBM Corporation Systems and Technology Group Element Interconnect  EIB data ring for internal communication – Four 16 data rings, supporting multiple transfers – 96B/cycle peak bandwidth – Over 100 outstanding requests

19 © 2005 IBM Corporation Systems and Technology Group Power Processor Element

 PPE handles and control tasks – 64-bit Power ArchitectureTM with VMX – In-order, 2-way hardware Multi-threading – Coherent Load/Store with 32KB I & D L1 and 512KB L2

20 © 2005 IBM Corporation Systems and Technology Group Synergistic Processor Element

 SPE provides computational performance – Dual issue, up to 16-way 128-bit SIMD – Dedicated resources: 128 128-bit RF, 256KB Local Store – Each can be dynamically configured to protect resources – Dedicated DMA engine: Up to 16 outstanding request

21 © 2005 IBM Corporation Systems and Technology Group SPE Highlights  User-mode architecture – No translation/protection within SPU – DMA is full Power Arch protect/x-late LS  DP Direct programmer control SFP SPU – DMA/DMA-list LS FXU EVN – Branch hint FWD  VMX-like SIMD dataflow

FXU ODD LS – Broad set of operations – Graphics SP-Float GPR – IEEE DP-Float (BlueGene-like) CONTROL LS CHANNEL  Unified register file DMA SMM – 128 entry x 128 bit ATO SMF  RTB 256kB Local Store

SBI B E

B – Combined I & D 14.5mm2 (90nm SOI) – 16B/cycle L/S bandwidth – 128B/cycle DMA bandwidth

22 © 2005 IBM Corporation Systems and Technology Group

SPE Organization (Flachs et al, ISSCC 2005)

23 © 2005 IBM Corporation Systems and Technology Group

SPE PIPELINE (Flachs et al, ISSCC 2005)

24 © 2005 IBM Corporation Systems and Technology Group I/O and Memory Interfaces  I/O Provides wide bandwidth – Dual XDRTM controller (25.6GB/s @ 3.2Gbps) – Two configurable interfaces (76.8GB/s @6.4Gbps) – Flexible Bandwidth between interfaces – Allows for multiple system configurations

25 © 2005 IBM Corporation Systems and Technology Group Cell Processor Can Support Many Systems  Game console systems XDRtm XDRtm XDRtm XDRtm  Workstations (CPBW)  HDTV  Home media servers CELL CELL Processor Processor  IOIF BIF IOIF

XDRtm XDRtm XDRtm XDRtm XDRtm XDRtm CELL CELL

Processor Processor

IOIF IOIF BIF

BIF SW IOIF

IOIF Processor Processor CELL

CELL CELL

Processor

XDR XDR XDR XDR

tm tm tm IOIF0 IOIF1 tm

26 © 2005 IBM Corporation Systems and Technology Group

Cell Processor Based Workstation (CPBW) ( Group and IBM) CELL Processor Board

 High BW Sys First Prototype “Powered On” Storage Networks Mgmt  I/O 16 Tera- in a rack (est.) Bridge

CELL Processor – ( equals 1 Peta-flop in 64 racks ) (2-Way SMP) Memory  Optimized for Digital Content 16 TFlop rack Creation, including • Computer entertainment • Movies • Real-time rendering … • Physics simulation

27 © 2005 IBM Corporation Systems and Technology Group

Cell Processor Example Application Areas

 Cell is a processor that excels at processing of rich media content in the context of broad connectivity – Digital content creation (games and movies) – Game playing and game serving – Distribution of (dynamic, media rich) content – Imaging and image processing – Image analysis (e.g. video surveillance) – Next-generation physics-based visualization – Video conferencing (3D?) – Streaming applications (codecs etc.) – Physical simulation & science

28 © 2005 IBM Corporation Systems and Technology Group

Summary

 Cell ushers in a new era of leading edge processors optimized for digital media and entertainment  Desire for realism is driving a convergence between supercomputing and entertainment  New levels of performance and power efficiency beyond what is achieved by PC processors  Responsiveness to the human user and the network are key drivers for Cell  Cell will enable entirely new classes of applications, even beyond those we contemplate today

29 © 2005 IBM Corporation Systems and Technology Group

Acknowledgements

 Cell is the result of a deep partnership between SCEI/Sony, , and IBM  Cell represents the work of more than 400 people starting in 2001  More detailed papers on the Cell implementation and the SPE micro-architecture can be found in the ISSCC 2005 proceedings

30 © 2005 IBM Corporation