Low-Power, High-Performance Architecture of the Pwrficient Processor Family
Total Page:16
File Type:pdf, Size:1020Kb
Low-Power, High-Performance Architecture of the PWRficient Processor Family Presented by Tse-Yu Yeh Director, Architecture & Verification Hot Chips 18 Aug 21, 2006 8/21/2006 © Copyright 2006 P.A. Semi, Inc. 1 Hthi 18 Acknowledgement THE ARCHITECTURE, DESIGN, AND IMPLEMENTATION OF THE PWRFICIENT PROCESSOR FAMILY ARE THE OUTCOME OF THE UNTIRING EFFORTS OF THE ENTIRE TEAM AT P.A. SEMI, INC. 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 2 Outline Design paradigm Family Core Performance and power Summary 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 3 Low-Power Design Paradigm Design goal —970-class performance at less than 7 watts per core Power dissipation is a primary consideration at all levels Circuit and Process Voltage/frequency scaling Multiple power planes for optimal voltage selection per region Microarchitecture and Logic Clock gating to reduce power of idle circuits Active and pre-charge standby modes in external DRAM array Sizing and hierarchy of cache structures Architecture Integration to reduce I/O interface power Internal and external power-saving modes — CPU, memory controller, PCIe Verification Monitor power consumption against budget throughout the design process Software Power management of I/O devices and CPU 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 4 Power-Influenced µ/Architectural Choices Extensive fine-grained clock gating used throughout core and SOC If a function can be performed sequentially without performance loss, why build power-hungry parallel mechanisms? The smaller, the better Design employs smaller subsections that are powered up for frequent accesses Much attention to clocking only those elements that are performing work Each major block’s design criteria included goals for power in addition to the traditional area, complexity & timing budgets Energy per memory access at each level of cache vs. DRAM Leakage, voltage, and execution frequency Overall execution time saving Speculation has to be effective or it becomes a power sink 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 5 Fine-Grain Clock Gating Reduces Dynamic Power 80 Coarse-Grain Clock Gating 70 60 50 Reset and Flop Normal Thermal Initialization Operation Virus 40 % of Flops Clocked 30 20 Time 10 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 6 Device-Specific Vdd Reduces Static and Dynamic Power 45 Conventional approach 40 Operate at 1.1V across entire process 35 30 range 25 Fast parts tend to be very leaky 20 15 10 5 0 FF @ 1.1V TT @ 1.1V SS @ 1.1V 45 P.A. Semi approach 40 35 Operate at device-specific optimal Vdd 30 Partition power plane for optimal 25 voltage selection per region 20 Enables full process range for power 15 yield 10 5 0 FF @ 0.8V TT @ 0.9V SS @ 1.1V Leakage Power Dynamic Power 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 7 PWRficient Family 8/21/2006 © Copyright 2006 P.A. Semi, Inc. 8 Hthi 18 PWRficient Family PA6T core —the CPU Power Architecture compliant, 64-bit, high-performance FPU and VMX 7W @ 2GHz worst-case power dissipation CONEXIUM™—the on-chip coherent interconnect Scalable cross-bar interconnect 1–8 SMP cores 1 or 2 L2 caches, sized 512KB–8MB 1–4 64-bit DDR2 memory controllers ENVOI™—the I/O system SERDES I/O—PCI Express®, XAUI, SGMII Offload engines—TCP/IP, iSCSI, cryptography, and RAID Support I/O—Boot bus, UARTs, SMBus, GPIOs 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 9 PWRficient PA6T-1682M Block Diagram 64KB 64KB 64KB 64KB I-Cache D-Cache I-Cache D-Cache 2MB2MB DDR2 DDR2 L2 Cache DDR2 DDR2 L2 Cache Controller Controller PA6T 2GHz Core PA6T 2GHz Core Controller Controller CONEXIUM™ Interchange Power Control TTM*TTM* I/OI/O OffloadOffload System Control BridgeBridge DMADMA CacheCache EnginesEngines PTM†† Interrupt/GPIO PTM ENVOI™ Intelligent I/O SMBus (3) OctalOctal PCIPCI DualDual 10GbE10GbE QuadQuad GbEGbE ExpressExpress XAUIXAUI SGMIISGMII UART (2) ConfigurableConfigurable Boot Bus SERDESSERDES (24)(24) *Transaction trace memory †Peripheral trace memory 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 10 The PA6T Core 8/21/2006 © Copyright 2006 P.A. Semi, Inc. 11 Hthi 18 PA6T Core Features Fully compliant Power Architecture High-performance memory implementation hierarchy @ 2GHz Power Architecture version 2.04 L1 data—32GB/s read or write Full FPU and VMX SIMD capabilities L2 data—16GB/s read plus 64-bit with 32-bit capability 16GB/s write DDR2-1067—16GB/s read or write Super-scalar, out-of-order design 16 transactions in flight L1 instruction cache Fetch four instructions per cycle into 64- entry scheduler CONEXIUM Interchange Issue up to 3 per cycle into 6 functional units 1G transactions per second Branch predictors 64GB/s peak data rate Strongly ordered memory model MOESI coherency protocol Issue out of order, retire in order Hypervisor and virtualization support 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 12 PA6T Low-Power Features Improved branch prediction Minimize incorrect speculation to avoid wasting power Efficient superscalar design Index-based, out-of-order execution engine minimizes the use of CAMs and avoids unnecessary replays Energy-efficient memory pipelines Hierarchical address translation to balance power consumption and performance Highly integrated memory pipeline that supports out-of-order execution and sequential consistency with minimal data movement Coherent memory subsystem Coherent memory subsystem designed for high throughput and low latency while minimizing the energy used per reference Extensive power-management capabilities Doze, nap, and sleep power-down modes trade varying degrees of power savings with recovery time 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 13 Processor Pipeline 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 IF IT ID DE UC PM MP SE SP IS RR AL AD AG TR DT DD LW IntegerInteger InstructionInstruction InstructionInstruction InstructionInstruction FetchFetch DecodeDecode && MapMap IssueIssue FloatingFloating PointPoint VMXVMX BranchBranch PredictionPrediction Load/StoreLoad/Store 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 14 Front End I-cache 1 2 3 4 5 6 7 IF IT ID DE UC PM MP 64KB 2-way associative IERATIERAT 2-cycle 4 instructions/fetch µIERATµIERAT µSeqµSeq 4 fetches to CONEXIUM Next-line pre-fetch after a miss I-CacheI-Cache DecDec µDecµDec MapMap Instruction buffer 4 fetch groups (16 instructions) Decode Next Select RFRF DepDep Line MapMap MapMap 4 µ ops/dispatch Taken Map Not Taken 4 µ ops renamed/cycle 64 rename registers RSB Pred TAP Target Target back to main diagram 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 15 Branch Processing Branch prediction 1 2 3 4 5 6 7 IF IT ID DE UC PM MP 0-cycle bubble: 16-entry next-fetch prediction IERATIERAT 4-cycle bubble (taken prediction) Path: 16K bi-mode taken/not taken µIERAT (4 predicted) µIERAT µSeq 16-entry return stack 64-entry history-assisted target predictor I-Cache Dec µDec Map IP-relative Early verification NextNext SelectSelect RF Dep Branch execution LineLine Map Map TakenTaken 2-cycle branch execution on integer P1 NotNot Branch misprediction TakenTaken 13-cycle path mispredict RSBRSB 13-cycle target mispredict PredPred TAPTAP TargetTarget TargetTarget back to main diagram 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 16 Scheduler and Execution Pipes Schedule & Issue 8 9 10 11 12 13 14 15 16 … 20 64-entry schedule buffer SE SP IS RR AL AD 6464 dependency 3 pickers Pre-slotted Replay from scheduler IntInt P0:IntegerP0:Integer RFRF Replay prevention Retire PickPick CRCR P1:Integer/BrP1:Integer/Br 00 BRBR Commit to architecture registers PickPick EvalEval IssueIssue Precise exception 11 P1:FPUP1:FPU ArithmeticArithmetic 88 4 µ ops/cycle P1:FPUP1:FPU CompareCompare PickPick FPUFPU 3 pickers, 5 execution pipes + 22 RF RF P1:VMXP1:VMX FloatFloat 88 one load/store pipe P1:VMXP1:VMX ComplexComplex IntegerInteger 66 VMXVMX Integer (P0&1): 2-cycle Micro RFRF Micro P0:VMXP0:VMX PermutePermute FP (P1): 8-cycle OpOp BufferBuffer P0:VMXP0:VMX EasyEasy IntInt VMX (P1): 8-cycle FP or 6- cycle complex integer VMX (P0): 2-cycle permute or simple integer Load/store (P2) back to main diagram 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 17 Load/Store Processing Latency —Load to Use 10 11 12 13 14 15 IS RR AL AD L1 cache 4 AG TR DT DD L2 cache 24 Open page 100 IntInt P0:IntegerP0:Integer RFRF Closed page 120 data addr Remote L1 42 CRCR P1:Integer/BrP1:Integer/Br BRBR Loads and Stores Issued out of order FPUFPU RF Strongly-ordered stores Issue RF Issue 32-Entry Load/Store Queue VMX D-CacheD-Cache VMX MRB RF MRB RF HWHW PrefetchPrefetch Re-ordered for consistency Checking and retire CONEXIUM AddrAddr Load/StoreLoad/Store DERATDERAT Store-to-load forwarding MicroMicro GenGen QueueQueue OpOp 16 blocks in flight BufferBuffer SNPSNP 12-entry hardware prefetcher DupDup TagTag back to main diagram 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 18 Address Translation 1 2 3 4 I-address translation IF IT ID DE 4-entry direct µ IERAT IERAT 64-entry 2-way IERAT D-address translation µIERAT 128-entry 2-way DERAT EA VA SLB—64-entry fully-assoc I-Cache CPU TLB—1024-entry 4-way SLB TLB CPU SLB TLB BUS HTWHTW BUS I/FI/F H/W Page table walk Reload D-CacheD-Cache ERAT RA 4-walks in flight EA 64-bit effective address Addr Load/Store Prefetch of next PTE Addr DERAT Load/Store VA 65-bit virtual address Gen DERAT Queue Gen Queue RA 44-bit real (physical) address after a TLB miss EA RA ERAT Effective-to-real address AG TR DT DD translation SLB Segment look-aside buffer 12 13 14 15 HTW Hardware page-table walk back to main diagram 8/21/2006 ©Copyright 2006 P.A.