Microarchitecture-Level Power-Performance Simulators: Modeling, Validation, and Impact on Design

Microarchitecture-Level Power-Performance Simulators: Modeling, Validation, and Impact on Design Zhigang Hu, David Brooks, Victor Zyuban, Pradip Bose IBM Research Harvard University Tutorial Outline 8:00-8:15 Introduction and Motivation Basics of Performance Modeling - Turandot performance simulation infrastructure Architectural Power Modeling - PowerTimer extensions to Turandot - Power-Performance Efficiency Metrics Case Studies and Examples - Optimal Power-Performance Pipeline Depth Validation and Calibration Efforts Future challenges and Discussion Bibliography Power Dissipation Trends 1000 ) 2 Intel Data SIA Projection Nuclear Reactor 100 Pentium III Hot Plate Pentium II 10 Pentium Pro Pentium Power Density (W/cm Density Power 486 1 386 1980 1990 2000 2010 The Battery Gap Diverging Gap Between Actual Battery Capacities and Energy Needs 10kbps 64kbps 384kbps 2Mbps 5000 Interactive Mobile video- 4000 Conferencing, Collaboration Video email, 3000 Voice recognition, Battery Downlink Mobile commerce capacity (mAh) dominated Fuel Cells 2000 Web browser, Energy PIM, SMS, MMS, Video clips requirement Voice (mAh) Energy (mAh) 1000 Lithium Source: Lithium Ion Polymer Anand 0 Raghunathan, NEC Labs 2000 2001 2002 2003 2004 2005 2006 2007 Power Issues Capacitive (Dynamic) Power Static (Leakage) Power Vdd VIN VOUT Vin Vout ISub CL IGate CL Temperature Di/Dt (Vdd/Gnd Bounce) 20 cycles Voltage (V) (A) Current Minimum Voltage Application Areas for Power-Aware Computing Temperature/di-dt-Constrained Energy-Constrained Computing Why architecture/system level? • Many architectural/system decisions have huge impact on power and performance • Often need feedback at the early-stage of a design – Pre-RTL, pre-circuit analysis • Run-time, system-level feedback control – Application/dynamic run-time characteristics allow dynamic scaling for power reduction – Perhaps power, temperature, and voltage sensor to guide throttling for worst-case situations What architects need from lower levels… • Architects need abstract models on many levels… – Static speed-power knobs for structures • Parameterized models for HW structures • Impact of implementation choices – Given cycle-level power estimates (power vs. time) • Chip temperature models • Chip di/dt models • Hardware hooks – Dynamic speed-power knobs for structures • Clock gating, Vdd-scaling, Vdd-gating • Need to understand costs of these knobs – On-chip sensors to measure power, temperature, voltage deltas Tutorial Outline Introduction and Motivation 8:15-9:00 Basics of Performance Modeling - Turandot performance simulation infrastructure Architectural Power Modeling - PowerTimer extensions to Turandot - Power-Performance Efficiency Metrics Case Studies and Examples - Optimal Power-Performance Pipeline Depth Validation and Calibration Efforts Future challenges and Discussion Bibliography A Developer's Guide to Turandot/PowerTimer Acknowledgments: J-D Wellman, Jaime Moreno, and other IBMers in the original Turandot/MET development team 1 Processor Simulator: An Overview Processor simulator: a tool that emulates the behavior of a real processor Software-based: Concept phase: C/C++/System C Design phase: VHDL Hardware-based: FPGA Simulators are used for: Workload characterization Performance / power target projection Compiler tuning Design space exploration and tradeoff evaluation Testing / debugging/ validation Existing simulators Academia simulators: SimpleScalar, RSIM, SMTSIM, etc. Industry simulators: Concept phase Product phase 2 Turandot/PowerTimer Overview An out-of-order superscalar processor model for the PowerPC architecture Cycle-accurate, cycle-based Initial version developed by a group of researchers at IBM T.J. Watson Power4-like machine configuration by default Other configurations attainable through compile-time parameters Performance model validated against Power4 preRTL model Power model added in summer 2000 Based on circuit simulation of Power4-like circuits Supporting trace-driven and execution-driven modes Trace-driven mode now supports SMT, and is portable to AIX/Linux/Cygwin Interpretation-based execution mode is underway program trace 1 program trace segment ... Turandot binary Aria Turandot program program trace request trace N inputs trace-driven mode execution-driven mode 3 Disclaimer 1. Power4-like != Power4 2. Simulator implementation != real hardware implementation 4 Source File Organization Turandot root dir Sources source file dir Aria Turandot translate opcode deps ffreader src standalone opcode predecode ffreader turandot Aria library library library source files 5 Turandot Source Files turandot.c headers units init/flush utilities trace power stages controls.h array.c init.macros param.macros init.power.macros stage_commit.macros iq.h block_bus.macros flush_arbitrary.macros trace.macros power.macros stage_retire.macros trauma.h block_dcache.macros flush_mispredicted.macros dep.macros power_def.macros stage_fpu_exec.macros stage_fpu1_exec.macros array.h block_icache.macros reset.macros utils_cmdline.macros block_memq.macros utils_reader.macros stage_fix_exec.macros block_nfa.macros utils_trans.macros stage_fix1_exec.macros block_prediction.macros dep_prep_process.c stage_dmiss_exec.macros block_prefetch.macors castout_exec (in Turandot.c) stage_mem_exec.macros stage_mem1_exec.macros stage_log_exec.macros stage_br_exec.macros stage_rename.macros stage_dispatch.macros stage_decode.macros stage_ifetch.macros flush_exec (turandot.c) 6 Turandot Simulation Framework NFA + Branch Predictor I-TLB1 L1-I cache Fetch I-TLB2 Decode/Expand I-Buffer I-Fetch Decode/Expand I-Prefetch Rename/Dispatch Rename/Dispatch L2 cache Issue queue Issue queue Issue queue Issue queue Integer Load/store FP Branch Issue Issue logic Issue logic Issue logic Issue logic Main Memory Reg Read Reg.read Reg.read Reg.read Reg.read Cast-out queue Exec/Mem Integer Load/store FP Branch units units units units WB L1-D cache Load/store reorder buffer Retirement queue Retire D-TLB1 store queue D-TLB2 Retirement logic 7 miss queue Simulation Flow: Reverse Pipeline Order stores are committed to cache/memory COMMIT_STORES_DELAY cycles after retire Commit FIX_EXEC 1. If store, remove from storeq FIX1_EXEC 2. If load, remove from reorderq, check if there is a load/store Retire conflict. If yes, flush the pipeline CMPLX_EXEC (reset.macros) 3. Update branch history 4. Remove instruction from retireq Exec FPU_EXEC Rename FPU1_EXEC 1. Rename DMISS_EXEC I. Check if enough rename registers available, if not, stall until available. Dispatch II. Rename architectural registers to DCASTOUT_EXEC physical registers. III.If the instruction is a mispredicted branch instruction, check if all Decode operands are ready. If yes, resolve MEM_EXEC the branch and start fetching from the right path from next cycle. IV.Note: registers in different class are MEM1_EXEC renamed separately. IFetch 2. Dispatch 1. Place renamed IOPs into the LOG_EXEC corresponding issue queue. If a given operation can not be placed in the Flush issue queue (i.e. the queue is full), stall the stage until available. BR_EXEC 8 Fetch Stage 1. If a mispredicted branch is resolved (therefore NFA/Branch Predictor ifetch has been on the mispredicted path), then I-TLB1 L1-I cache revert back to the true taken path, flush the pipeline, and stall ifetch for a number of cycles. Decode/Expand I-TLB2 I-Buffer I-Fetch 2. If ifetch is stalled for some reason, check whether the reason has been resolved. If so, resume ifetch I-Prefetch from next cycle. 3. Stall ifetch if I-Buffer is full, or no more fetch blocks are allowed, or no more inflight insns are L2 cache allowed. bus 4. Fill the trace reader buffer. Stall ifetch if no instruction is available due to (1). no trace on the path (2). end of trace. stage_ifetch.macros 5. Use address of the first insn in this fetch block to: main fetch logic 1. Check ITLB1 / ITLB2. If miss, stall ifetch for a array.c/h number of cycles according to the miss type. arrays: caches, counter prediction table, NFA, etc. 2. Check L1 ICACHE / IPrefetch / L2 ICACHE. If miss, stall ifetch and charge appropriate miss unit definitions: penalties. block_icache.macros 3. Lookup NFA for next fetch address. block_bus.macros 6. For each insn in current fetch block: block_nfa.macros 1. Decode (see process_iword), expand into IOPs block_prefetch.macros (internal insns), and insert IOPs into I-Buffer. block_prediction.macros 2. If branch, perform branch prediction. 7. Update NFA. 9 Decode/Expand Stage NFA/Branch Predictor Decode/Expand I-Buffer I-Fetch 1. Expand instructions into IOPs and insert them into I-Buffer, in program order. (This code is in stage_ifetch.macros but logically it belongs to decode stage) 2. Handle millicode instructions (insns that expand to more than two IOPs), such as string ops. Stall decode if necessary. 3. Form instruction groups according to Power4 grouping rules (see IBM JR&D stage_ifetch.macros Power4 paper). expand instructions stage_decode.macros millicode handling and instruction group formation 10 Rename/Dispatch Stage NFA/Branch Predictor I-TLB1 L1-I cache 1. Rename I-TLB2 Decode/Expand I. Check if enough physical registers I-Buffer I-Fetch available, if not, stall until available. I-Prefetch II. Rename operands to physical registers, Rename/Dispatch and allocate physical registers for each L2 cache Issue queue Issue queue Issue queue Issue queue result. Integer Load/store FP Branch III.If the insn is a branch, check if all operands are ready. If so, resolve the Main Memory branch and start fetching from the true path from

Microarchitecture-Level Power-Performance Simulators: Modeling, Validation, and Impact on Design

45-Year CPU Evolution: One Law and Two Equations

Cuda C Best Practices Guide

Multi-Cycle Datapathoperation

CS2504: Computer Organization

Analysis of Body Bias Control Using Overhead Conditions for Real Time Systems: a Practical Approach∗

A Performance Analysis Tool for Intel SGX Enclaves

ESC-470: ARM 9 Instruction Set Architecture with Performance

A Characterization of Processor Performance in the VAX-1 L/780

Autotuning GPU Kernels Via Static and Predictive Analysis

The Anatomy of the ARM Cortex-M0+ Processor

Assembly Language Programming (Part 1) 2 7

V850e/Ms1 , V850e/Ms2