White Paper | ADVANCED POWER MANAGEMENT HELPS BRING IMPROVED PERFORMANCE TO HIGHLY INTEGRATED X86 PROCESSORS

TABLE OF CONTENTS
THE IMPORTANCE OF POWER MANAGEMENT
THE X86 EXAMPLE
ESTABLISH A REALISTIC WORST-CASE FOR POWER
POWER LIMITS CAN TRANSLATE TO PERFORMANCE LIMITS
AMD TACKLES THE UNDERUSED TDP HEADROOM ISSUE
GOING ABOVE TDP
INTELLIGENT BOOST
CONFIGURABLE TDP
SUMMARY

Complex heterogeneous processors have the potential to leave a large amount of performance headroom untapped when workloads don’t utilize all cores. Advanced power management techniques for x86 processors are designed to reduce the power of underutilized cores while also allowing for dynamic allocation of the thermal budget between cores for improved performance.

THE IMPORTANCE OF POWER MANAGEMENT

Those with experience implementing microprocessors know the importance of proper power management. Whether for simple applications processors or high-end server processors, the ability to down-clock, clock-gate, power-off, or in some manner disable unused or underused hardware blocks is crucial in limiting power consumption.

The benefits of better power management range from energy savings within the data center to improved battery life in mobile devices. But don’t underestimate the value of reducing power and increasing efficiency. In fact, power reduction and increased efficiency are even more important today, as processors integrate more and varied functional blocks.

THE X86 EXAMPLE

Typical x86 processors widely used in both consumer and embedded applications are a perfect example: integration of network and security engines, memory controllers, graphics processing units (GPUs), and video encode/decode engines has effectively turned them into heterogeneous compute units that excel at a wide variety of workloads.

The notable thing about traditional reduction-based power management is that a particular functional block is only turned off when unused, or down-clocked when higher performance is not needed by the application. What about applications that desire more performance? Shouldn’t saving power in one area allow you to utilize it in another?

Specifying power usage is complex, particularly with highly integrated processors. If the worst-case power for each individual hardware block in a heterogeneous processor were added together, the resulting total could be several times the achievable worst-case power for the device. One reason is that it is nearly impossible to write software that will simultaneously utilize all functional blocks to their fullest extent. Simply feeding the various compute engines and I/O ports with enough data to keep them all 100% utilized would likely exceed the available bandwidth of internal buses. Central processing unit (CPU) cores manage data movement, and time spent there is less time spent executing higher-power instructions.

Another issue is that different instruction sequences can incur vastly different power usage, which can further complicate specifying processor power. For instance, complex floating-point instructions burn much more power than a simple I/O data read due to the significant difference in transistor logic they activate during execution. The combination of varying instruction types and utilized hardware blocks makes the actual power usage of the processor highly workload-dependent, and explains why it is rare to see a “typical” power specification for this device type. Still, implementers expect a maximum power specification on which to base their design.

ESTABLISH A REALISTIC WORST-CASE FOR POWER

The pragmatic approach for silicon providers is to survey real-world application software to establish a more realistic worst-case power and add some guard-band for safety. Both AMD and Intel use this type of methodology and specify it as thermal design power (TDP). TDP is essentially the maximum sustained power a processor can draw with “real world” software while operating under defined temperature and voltage limits.

POWER LIMITS CAN TRANSLATE TO PERFORMANCE LIMITS

Most embedded x86-based systems are power-constrained in some way. Designers will look for the best performance they can get in a given power envelope, at a price they can afford. The worst-case power limit can translate directly into a performance limit for a given processor product by effectively defining the maximum operating frequency.

Using TDP as a worst-case power specification instead of the cumulative per-block maximum power helps to increase that operating frequency, but it’s also based on an assumption of the software workload. Applications using fewer hardware blocks, or using them to a lesser extent, use less power and effectively leave performance headroom on the table.

AMD TACKLES THE UNDERUSED TDP HEADROOM ISSUE

AMD Turbo CORE technology¹ was launched several years ago to address underutilized TDP headroom. AMD Turbo CORE began with a simple core-counting mechanism that allowed some CPU cores to use higher-frequency “boost” states while other CPU cores were idle. This approach only affected the CPU cores, and was primarily targeted at accelerating single-threaded applications that didn’t leverage a multi-core architecture.
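The core-counting idea can be illustrated with a short sketch. It is not AMD's implementation: the frequencies, the active-core threshold, and the names `BoostTable` and `select_frequency_mhz` are hypothetical, chosen only to show how counting idle cores can gate access to a boost state.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BoostTable:
    """Hypothetical operating points; real values are product-specific."""
    base_mhz: int
    boost_mhz: int
    max_active_for_boost: int  # boost only if this many cores or fewer are busy


def select_frequency_mhz(core_busy: Sequence[bool], table: BoostTable) -> int:
    """Pick a frequency for the busy cores by counting how many are active.

    When few cores are busy, the idle cores leave power budget unused,
    so the busy cores may run at the boost frequency.
    """
    active = sum(1 for busy in core_busy if busy)
    if 0 < active <= table.max_active_for_boost:
        return table.boost_mhz
    return table.base_mhz


# Assumed example values for a four-core part.
table = BoostTable(base_mhz=2300, boost_mhz=3200, max_active_for_boost=2)
single_threaded = select_frequency_mhz([True, False, False, False], table)
fully_loaded = select_frequency_mhz([True, True, True, True], table)
```

With these assumed values, a single busy core is granted the boost frequency, while a fully loaded processor stays at the base frequency. The decision ignores what the busy cores are actually doing, which is the limitation that finer-grained power monitoring addresses.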
[Figure: APU block diagram with two “Piledriver” dual-core x86 modules (2MB L2 each), northbridge, PCI Express®, memory interface, DP & VGA, and graphics cores & multimedia. Integration of large GPU cores, as done in AMD R-Series APUs, increases the potential for unused power budget.]

AMD’s recent move to integrate discrete-class GPUs with x86 processor cores in accelerated processing units (APUs) underscores this power management challenge. Some APUs contain a GPU that accounts for more than half of the silicon die and a proportional amount of the power budget. A much larger potential for under-utilization of the APU’s power envelope exists in this scenario if the software workload is highly CPU-centric or GPU-centric. The trend toward integration of these complex, heterogeneous cores is likely to continue and necessitates a means of harnessing the excess thermal headroom.

Generational improvements to AMD Turbo CORE have increased the granularity and effectiveness of the technology by adding more boost states for CPU and GPU cores, adding real-time power and temperature monitors, and enabling dynamic power budget allocation between cores.

Increasing performance by boosting to higher frequencies is relatively simple, since the use of multiple performance states (voltage and frequency operating points) has been around for a while. However, the complexity lies in determining when and which cores to boost. For AMD Embedded R-Series APUs, the process starts by dividing the processor into separate thermal entities: one for each CPU core-pair and one for the GPU. I/O power is small by comparison, so it is defined as a fixed value based on characterization to reduce complexity.

An integrated microcontroller manages AMD Turbo CORE calculations, allowing a more complex and therefore more effective algorithm. In deciding whether boosting a given core is possible, the power usage of each thermal entity must be determined. On-die analog power measurement at many amps is not practical in a 32 nm silicon-on-insulator (SOI) process, and external measurement is not possible because the various cores share power rails.

[Figure: APU power (CPU core power, I/O power, unused power budget) and die temperature for APP 1 (high CAC) and APP 2 (low CAC), shown against the TDP budget and the maximum die temperature limit. Applications with a low CAC can leave unused TDP and temperature headroom. New power management techniques can exploit both for improved performance.]

Alternatively, proprietary activity monitors that are integrated throughout the processor architecture model current logic activity as an AC capacitance (CAC). The CAC monitors effectively profile the running application to determine if it is one of those “worst-case” workloads that define TDP or something less laborious. Static power of the core […] be explained later. Total instantaneous power of the thermal entity can then be calculated by P = CAC × V² × f + Pstatic, and total power for the APU equals the summation of the power for each thermal entity and the I/O power offset.

The instantaneous power calculation result is compared to an allocated power budget for the thermal entity, as well as to the device’s thermal design current specification, to ensure that current demand does not exceed what the voltage regulator can provide. If either value is too close to the limit, firmware can impose throttling by reducing the core’s performance state. The ability to boost the performance state is maintained when headroom exists on both parameters.

GOING ABOVE TDP

Even if an application with a high CAC drives the APU to consume the full TDP, operation at this level may occur in bursts or be preceded by idle time such that the die temperature at the start of the high CAC period is far below the maximum specification. The latest version of AMD Turbo CORE also takes the opportunity to boost in this scenario by allowing brief excursions above TDP when there is adequate temperature headroom. After all, the purpose of a TDP limit is only to ensure die temperature stays in check.

Real-time temperature values from
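The per-entity power estimate and the two-limit check described earlier in this section can be sketched as follows. This is an illustration under stated assumptions, not the firmware algorithm: the `ThermalEntity` fields, the 5% safety margin, the way current demand is supplied, and every numeric value are hypothetical. Only the formula P = CAC × V² × f + Pstatic, the summation with a fixed I/O offset, and the rule that boosting requires headroom on both power and current come from the text.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ThermalEntity:
    """One thermal entity (a CPU core-pair or the GPU). All fields are assumed inputs."""
    name: str
    cac_farads: float   # modeled AC capacitance reported by activity monitors
    voltage_v: float    # current operating voltage
    freq_hz: float      # current operating frequency
    static_w: float     # static power estimate
    budget_w: float     # power budget allocated to this entity

    def power_w(self) -> float:
        # P = CAC * V^2 * f + Pstatic
        return self.cac_farads * self.voltage_v ** 2 * self.freq_hz + self.static_w


def apu_power_w(entities: Iterable[ThermalEntity], io_offset_w: float) -> float:
    """Total APU power: sum over thermal entities plus the fixed I/O offset."""
    return sum(e.power_w() for e in entities) + io_offset_w


def boost_decision(entity: ThermalEntity, current_a: float, tdc_a: float,
                   margin: float = 0.05) -> str:
    """Allow boost only when both power and current have headroom.

    `margin` is a hypothetical definition of "too close to the limit".
    """
    power_ok = entity.power_w() <= entity.budget_w * (1.0 - margin)
    current_ok = current_a <= tdc_a * (1.0 - margin)
    return "boost_allowed" if power_ok and current_ok else "throttle"


# Assumed example values.
cpu = ThermalEntity("cpu_pair0", cac_farads=5e-9, voltage_v=1.0,
                    freq_hz=3.0e9, static_w=2.0, budget_w=20.0)
gpu = ThermalEntity("gpu", cac_farads=8e-9, voltage_v=1.1,
                    freq_hz=8.0e8, static_w=1.5, budget_w=9.5)

total_w = apu_power_w([cpu, gpu], io_offset_w=3.0)
cpu_state = boost_decision(cpu, current_a=12.0, tdc_a=20.0)
gpu_state = boost_decision(gpu, current_a=12.0, tdc_a=20.0)
```

In this example the CPU entity is estimated at about 17 W against a 20 W budget and keeps its ability to boost, while the GPU entity sits within the margin of its 9.5 W budget and is throttled. A low-CAC workload at the same voltage and frequency would produce a lower estimate and more headroom, which is the effect the activity monitors are meant to expose.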
