Low-Power, High-Performance Architecture of the PWRficient Processor Family

Presented by Tse-Yu Yeh Director, Architecture & Verification

Hot Chips 18 Aug 21, 2006

8/21/2006 © Copyright 2006 P.A. Semi, Inc. 1 Hthi 18 Acknowledgement

THE ARCHITECTURE, DESIGN, AND IMPLEMENTATION OF THE PWRFICIENT PROCESSOR FAMILY ARE THE OUTCOME OF THE UNTIRING EFFORTS OF THE ENTIRE TEAM AT P.A. SEMI, INC.

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 2 Outline

Design paradigm Family Core Performance and power Summary

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 3 Low-Power Design Paradigm

Design goal —970-class performance at less than 7 watts per core Power dissipation is a primary consideration at all levels Circuit and Process Voltage/frequency scaling Multiple power planes for optimal voltage selection per region and Logic Clock gating to reduce power of idle circuits Active and pre-charge standby modes in external DRAM array Sizing and hierarchy of cache structures Architecture Integration to reduce I/O interface power Internal and external power-saving modes — CPU, , PCIe Verification Monitor power consumption against budget throughout the design process Software Power management of I/O devices and CPU

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 4 Power-Influenced µ/Architectural Choices

Extensive fine-grained clock gating used throughout core and SOC If a function can be performed sequentially without performance loss, why build power-hungry parallel mechanisms? The smaller, the better Design employs smaller subsections that are powered up for frequent accesses Much attention to clocking only those elements that are performing work Each major block’s design criteria included goals for power in addition to the traditional area, complexity & timing budgets Energy per memory access at each level of cache vs. DRAM Leakage, voltage, and execution frequency Overall execution time saving Speculation has to be effective or it becomes a power sink

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 5 Fine-Grain Clock Gating Reduces Dynamic Power

80 Coarse-Grain Clock Gating 70

60

50 Reset and Flop Normal Thermal Initialization Operation Virus

40 % of Flops Clocked 30

20

Time 10

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 6 Device-Specific Vdd Reduces Static and Dynamic Power

45 Conventional approach 40 Operate at 1.1V across entire process 35 30 range 25 Fast parts tend to be very leaky 20 15 10 5 0 FF @ 1.1V TT @ 1.1V SS @ 1.1V

45 P.A. Semi approach 40 35 Operate at device-specific optimal Vdd 30 Partition power plane for optimal 25 voltage selection per region 20 Enables full process range for power 15 yield 10 5 0 FF @ 0.8V TT @ 0.9V SS @ 1.1V Leakage Power Dynamic Power 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 7 PWRficient Family

8/21/2006 © Copyright 2006 P.A. Semi, Inc. 8 Hthi 18 PWRficient Family

PA6T core —the CPU Power Architecture compliant, 64-bit, high-performance FPU and VMX 7W @ 2GHz worst-case power dissipation CONEXIUM™—the on-chip coherent interconnect Scalable cross-bar interconnect 1–8 SMP cores 1 or 2 L2 caches, sized 512KB–8MB 1–4 64-bit DDR2 memory controllers ENVOI™—the I/O system SERDES I/O—PCI Express®, XAUI, SGMII Offload engines—TCP/IP, iSCSI, cryptography, and RAID Support I/O—Boot bus, UARTs, SMBus, GPIOs

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 9 PWRficient PA6T-1682M Block Diagram

64KB 64KB 64KB 64KB I-Cache D-Cache I-Cache D-Cache 2MB2MB DDR2 DDR2 L2 Cache DDR2 DDR2 L2 Cache Controller Controller PA6T 2GHz Core PA6T 2GHz Core Controller Controller

CONEXIUM™ Interchange

Power Control TTM*TTM* I/OI/O OffloadOffload System Control BridgeBridge DMADMA CacheCache EnginesEngines PTM†† Interrupt/GPIO PTM

ENVOI™ Intelligent I/O SMBus (3) OctalOctal PCIPCI DualDual 10GbE10GbE QuadQuad GbEGbE ExpressExpress XAUIXAUI SGMIISGMII UART (2) ConfigurableConfigurable Boot Bus SERDESSERDES (24)(24)

*Transaction trace memory † trace memory

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 10 The PA6T Core

8/21/2006 © Copyright 2006 P.A. Semi, Inc. 11 Hthi 18 PA6T Core Features

Fully compliant Power Architecture High-performance memory implementation hierarchy @ 2GHz Power Architecture version 2.04 L1 data—32GB/s read or write Full FPU and VMX SIMD capabilities L2 data—16GB/s read plus 64-bit with 32-bit capability 16GB/s write DDR2-1067—16GB/s read or write Super-scalar, out-of-order design 16 transactions in flight L1 instruction cache Fetch four instructions per cycle into 64- entry scheduler CONEXIUM Interchange Issue up to 3 per cycle into 6 functional units 1G transactions per second Branch predictors 64GB/s peak data rate Strongly ordered memory model MOESI coherency protocol Issue out of order, retire in order

Hypervisor and virtualization support

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 12 PA6T Low-Power Features

Improved branch prediction Minimize incorrect speculation to avoid wasting power Efficient superscalar design Index-based, out-of-order execution engine minimizes the use of CAMs and avoids unnecessary replays Energy-efficient memory pipelines Hierarchical address translation to balance power consumption and performance Highly integrated memory pipeline that supports out-of-order execution and sequential consistency with minimal data movement Coherent memory subsystem Coherent memory subsystem designed for high throughput and low latency while minimizing the energy used per reference Extensive power-management capabilities Doze, nap, and sleep power-down modes trade varying degrees of power savings with recovery time

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 13 Processor Pipeline

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 IF IT ID DE UC PM MP SE SP IS RR AL AD AG TR DT DD LW

IntegerInteger InstructionInstruction InstructionInstruction InstructionInstruction FetchFetch DecodeDecode && MapMap IssueIssue FloatingFloating PointPoint

VMXVMX BranchBranch PredictionPrediction Load/StoreLoad/Store

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 14 Front End

I-cache 1 2 3 4 5 6 7 IF IT ID DE UC PM MP 64KB 2-way associative

IERATIERAT 2-cycle 4 instructions/fetch

µIERATµIERAT µSeqµSeq 4 fetches to CONEXIUM Next-line pre-fetch after a miss I-CacheI-Cache DecDec µDecµDec MapMap Instruction buffer 4 fetch groups (16 instructions) Decode Next Select RFRF DepDep Line MapMap MapMap 4 µ ops/dispatch Taken Map Not Taken 4 µ ops renamed/cycle 64 rename registers RSB Pred TAP Target Target

back to main diagram

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 15 Branch Processing

Branch prediction 1 2 3 4 5 6 7 IF IT ID DE UC PM MP 0-cycle bubble: 16-entry next-fetch prediction

IERATIERAT 4-cycle bubble (taken prediction) Path: 16K bi-mode taken/not taken

µIERAT (4 predicted) µIERAT µSeq 16-entry return stack 64-entry history-assisted target predictor I-Cache Dec µDec Map IP-relative Early verification NextNext SelectSelect RF Dep Branch execution LineLine Map Map TakenTaken 2-cycle branch execution on integer P1 NotNot Branch misprediction TakenTaken 13-cycle path mispredict

RSBRSB 13-cycle target mispredict PredPred TAPTAP TargetTarget TargetTarget

back to main diagram

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 16 Scheduler and Execution Pipes

Schedule & Issue 8 9 10 11 12 13 14 15 16 … 20 64-entry schedule buffer SE SP IS RR AL AD 6464 dependency 3 pickers Pre-slotted Replay from scheduler IntInt P0:IntegerP0:Integer RFRF Replay prevention Retire PickPick CRCR P1:Integer/BrP1:Integer/Br 00 BRBR Commit to architecture registers PickPick EvalEval IssueIssue Precise exception 11 P1:FPUP1:FPU ArithmeticArithmetic 88 4 µ ops/cycle P1:FPUP1:FPU CompareCompare PickPick FPUFPU 3 pickers, 5 execution pipes + 22 RF RF P1:VMXP1:VMX FloatFloat 88 one load/store pipe P1:VMXP1:VMX ComplexComplex IntegerInteger 66 VMXVMX Integer (P0&1): 2-cycle Micro RFRF Micro P0:VMXP0:VMX PermutePermute FP (P1): 8-cycle OpOp BufferBuffer P0:VMXP0:VMX EasyEasy IntInt VMX (P1): 8-cycle FP or 6- cycle complex integer VMX (P0): 2-cycle permute or simple integer Load/store (P2) back to main diagram

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 17 Load/Store Processing

Latency —Load to Use 10 11 12 13 14 15 IS RR AL AD L1 cache 4 AG TR DT DD L2 cache 24 Open page 100 IntInt P0:IntegerP0:Integer RFRF Closed page 120 data addr Remote L1 42 CRCR P1:Integer/BrP1:Integer/Br BRBR Loads and Stores Issued out of order FPUFPU RF Strongly-ordered stores Issue RF Issue 32-Entry Load/Store Queue VMX D-CacheD-Cache VMX MRB RF MRB RF HWHW PrefetchPrefetch Re-ordered for consistency Checking and retire CONEXIUM AddrAddr Load/StoreLoad/Store DERATDERAT Store-to-load forwarding MicroMicro GenGen QueueQueue OpOp 16 blocks in flight BufferBuffer SNPSNP 12-entry hardware prefetcher

DupDup TagTag

back to main diagram

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 18 Address Translation

1 2 3 4 I-address translation IF IT ID DE 4-entry direct µ IERAT IERAT 64-entry 2-way IERAT D-address translation µIERAT 128-entry 2-way DERAT EA VA SLB—64-entry fully-assoc I-Cache

CPU TLB—1024-entry 4-way SLB TLB CPU SLB TLB BUS HTWHTW BUS I/FI/F H/W Page table walk Reload D-CacheD-Cache ERAT RA 4-walks in flight EA 64-bit effective address Addr Load/Store Prefetch of next PTE Addr DERAT Load/Store VA 65-bit virtual address Gen DERAT Queue Gen Queue RA 44-bit real (physical) address after a TLB miss

EA RA ERAT Effective-to-real address AG TR DT DD translation SLB Segment look-aside buffer 12 13 14 15 HTW Hardware page-table walk back to main diagram

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 19 CONEXIUM Interchange

Transaction initiators: cores & I/O bridge

PA6T 2MB PA6T All devices respond Core L2 Core MOE SI-style protocol Minimize copy-back to memory and L2-cache to save power Address bus cycles at half the core

DATA CROSSBAR frequency—1G address/sec DDR2 DDR2 Ctrl Ctrl Address arbitration enforces strong ordering ADDRESS BUS Data connected as crossbar Scales with number of agents I/O ENVOI DMA, Cache I/O Bridge Offload Each port provides 16-byte dual- simplex connections

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 20 Memory Hierarchy

On-chip memories are power efficient RAM structures have low power density due to low inherent activity Only a few of many bit cells accessed per cycle On-chip RAMs save power by avoiding chip-to-chip bus structures

Most on-chip memory is devoted to caches Caches have diminishing (logarithmic) performance return vs. size

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 21 Power Saving Modes

Doze (entry time immediate, wakeup time immediate) Cores idle at reduced frequency; continue snooping on the bus Wake up immediately—no state reloading needed Nap (entry time 2–16µs*, wakeup time <0.5 ms) Core clock stopped and voltage lowered to reduce leakage D-cache modified data is flushed by hardware All architecture state is retained SRAM remains power on, value retained Branch predictors: state retained TLB, I-cache: invalidate if snooped Sleep (entry time 2–16µs*, wakeup time <1 ms†) Core powered off (either or both cores) D-cache modified data is flushed by hardware Some architecture state must be saved by software On wakeup, core goes through power-on-reset sequence *Depends on number of modi fied L1 data cache lines †Wakeup time is dominated by power regulator restart time 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 22 PWRficient Performance & Power

8/21/2006 © Copyright 2006 P.A. Semi, Inc. 23 Hthi 18 High Performance at Low Power Across a Range of Metrics

PWRficient 1682M PROVIDES MAINSTREAM PERFORMANCE AT LOW POWER

General-purpose computing Application offloads SPECint®2000 >1000 per core* TCP/IP termination § Floating-point performance > 20Gbps Encryption SPECfp®2000 >2000 per core* 10Gbps IPSec/SSL encryption and Imaging authentication§ FFT 24 GFlops/sec (total)† 3,000 public-key handshakes/sec in System bandwidth software* Storage Sustained block copy 2.0GB/s RAID5 (data+parity)§ 10 Gigabytes/sec* High-speed SERDES I/O *Estimated max sustained performance at 2GHz † 104Gbps aggregate peak bandwidth 75% of estimated max sustained performance at 2GHz, 1D single-precision FFT with 2K elements §Estimated peak performance at 2GHz

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 24 Power Efficient

POWER IS FIRST-ORDER DESIGN PRINCIPLE

PADS > 15,000 MC optimizes DRAM MC gated power clocks CPU L2 L2 cache and I/O are PADS coherent with core off Separate power rails CPU for cores, I/O pads PADS L2

Turn off IO Dynamic power-control unused I/Os PwrCtl hardware and software voltage/frequency SERDES management

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 25 PA6T-1682M Projected Power Dissipation

Projected power assuming full power-regulation scheme Independent per-core VRM Dynamic control loop based on PA6T-1682M frequency and operating conditions VID_CP0 6 PA6TPA6T VRM for SOC VDD_CP0 VRMVRM CoreCore Static control VID_CP1 Max Typ Max 6 PA6TPA6T Freq VDD_CP1 VRMVRM CoreCore PA6T core only 2.0GHz 4W 7W VID_SOC PA6T-1682M-FCN 2.0GHz 17W 25W SoCSoC VDD_SOC VRMVRM PA6T-1682M-FCG1.5GHz 8W 15W PA6T-1682M-FCD1.0GHz 6W 10W I/O coherent nap 2W*

*PA6T-1682M-FCN nap power may be higher

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 26 Summary

Power-aware design from ground up to maximize performance/Watt Modular design supports rapid family deployment Interesting convergence of needs across a wide range of design points Networking, telecom, servers, mil/aero, imaging Performance and power enable wide range of applications

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 27 Contact P.A. Semi

For further information, please visit P.A. Semi web site at:

www.pasemi.com

Kindly direct sales inquiries to:

[email protected]

Full contact information:

P.A. Semi, Inc. 3965 Freedom Circle, Floor 8 Santa Clara CA 95054-1203 USA Main: 408.200.4500 Fax: 408.200.4501

8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 28 Thank You

The P.A. Semi name and the P.A. Semi logo and combinations thereof are trademarks of P.A. Semi, Inc. The Power name is a trademark of International Business Machines Corporation, used under license therefrom. SPECint and SPECfp are registered trademarks of the Standard Performance Evaluation Corporation (SPEC). All other trademarks are the property of their respective owners.

8/21/2006 © Copyright 2006 P.A. Semi, Inc. 29 Hthi 18