SysScale: Exploiting Multi-Domain Dynamic Voltage and Frequency Scaling for Energy Efficient Mobile Processors


Jawad Haj-Yahya§ Mohammed Alser§ Jeremie Kim§ A. Giray Yağlıkçı§ Nandita Vijaykumar§*† Efraim Rotem* Onur Mutlu§
§ETH Zürich *Intel †University of Toronto

There are three domains in a modern thermally-constrained mobile system-on-chip (SoC): compute, IO, and memory. We observe that a modern SoC typically allocates a fixed power budget, corresponding to worst-case performance demands, to the IO and memory domains even if they are underutilized. The resulting unfair allocation of the power budget across domains can cause two major issues: 1) the IO and memory domains can operate at a higher frequency and voltage than necessary, increasing power consumption and 2) the unused power budget of the IO and memory domains cannot be used to increase the throughput of the compute domain, hampering performance. To avoid these issues, it is crucial to dynamically orchestrate the distribution of the SoC power budget across the three domains based on their actual performance demands.

We propose SysScale, a new multi-domain power management technique to improve the energy efficiency of mobile SoCs. SysScale is based on three key ideas. First, SysScale introduces an accurate algorithm to predict the performance (e.g., bandwidth and latency) demands of the three SoC domains. Second, SysScale uses a new DVFS (dynamic voltage and frequency scaling) mechanism to distribute the SoC power to each domain according to the predicted performance demands. This mechanism is designed to minimize the significant latency overheads associated with applying DVFS across multiple domains. Third, in addition to using a global DVFS mechanism, SysScale uses domain-specialized techniques to optimize the energy efficiency of each domain at different operating points.

We implement SysScale on an Intel Skylake microprocessor for mobile devices and evaluate it using a wide variety of SPEC CPU2006, graphics (3DMark), and battery life workloads (e.g., video playback). On a 2-core Skylake, SysScale improves the performance of SPEC CPU2006 and 3DMark workloads by up to 16% and 8.9% (9.2% and 7.9% on average), respectively. For battery life workloads, which typically have fixed performance demands, SysScale reduces the average power consumption by up to 10.7% (8.5% on average), while meeting performance demands.

1. Introduction

A high-end mobile microprocessor is built as a system-on-chip (SoC) that integrates multiple components into a single chip. It typically has three main domains: compute (e.g., CPU cores, graphics engines), IO (e.g., display controller, image signal processing (ISP) engine), and memory (i.e., memory controller, memory interface, and DRAM) as illustrated in Fig. 1. A mobile SoC operates in a thermally-constrained environment, limited by what is known as thermal design power (TDP) [17, 19, 25, 34, 59, 66, 73, 83, 84, 86, 87, 93, 107]. To keep the system running below a TDP, the SoC power-management-unit (PMU) employs a power budget management algorithm (PBM) to dynamically distribute the total power budget to each SoC domain [6, 18, 42, 57, 71, 80, 82, 84, 87, 111]. This allows each domain to operate within its allocated power budget. For instance, CPU cores and graphics engines in the compute domain share the same power budget. When a graphics-intensive workload is executed, the graphics engines consume most of the compute domain's power budget. To keep the power consumption of the compute domain within its allocated power budget, PMU applies dynamic voltage and frequency scaling (DVFS) to 1) reduce the CPU cores' power consumption and 2) increase the graphics engines' performance [48, 51, 62, 77, 83, 84, 85, 87, 102, 109].

Figure 1: A modern mobile SoC (Intel Skylake [18]) with three domains (compute, IO, memory). Voltage regulators (VRs) are highlighted, e.g., IO engines/controllers, IO interconnect, and memory controller share the same VR, V_SA.

In this work, we demonstrate that the power budget the PBM allocates to the IO and memory domains is inefficiently managed, making the energy and performance of a high-end mobile SoC suboptimal. We make four key observations.

Observation 1. In a typical high-end mobile SoC, the power budget management algorithm assigns a fixed power budget to the IO and memory domains corresponding to the worst-case performance demands (bandwidth/latency) from the IO interconnect and the memory subsystem resources. However, we observe that common use cases of mobile systems have only modest demands relative to the worst-case. Unfortunately, these systems do not apply DVFS to the IO and memory domains based on the actual demands of the three domains, making these SoCs energy inefficient.

Observation 2. While mobile SoCs employ a power budget redistribution mechanism between components within a domain, such as between cores and graphics engines in the compute domain [84, 85, 87], we observe that current power budget management algorithms do not support dynamic power redistribution across different domains. Therefore, when a domain's power budget is underutilized, the remaining budget is wasted, making system performance suboptimal. This unused power budget could have been allocated to another domain (e.g., the compute domain) to increase performance.

Observation 3. In modern mobile SoCs, we observe that multiple components in the IO and compute domains have widely-varying main memory bandwidth demands across different workloads. However, due to over-provisioning of IO and memory demands, SoC energy efficiency remains low while running many workloads, as we demonstrate in Sec. 3.

Observation 4. Unoptimized DRAM configuration register values can significantly reduce the energy efficiency benefits of multi-domain DVFS (e.g., they provide 22% less power reduction than optimized values).

Unfortunately, there are three main challenges that make it difficult for existing high-end mobile systems to apply DVFS to the IO and memory domains based on demands of multiple domains. First, accurate prediction of 1) the actual bandwidth/latency demands of the multiple domains, and 2) the potential effect of DVFS on power/performance of the SoC, in the presence of multiple domains, is challenging. A modern high-end SoC integrates several components that share the IO interconnect and memory subsystem. Some of these components have strict quality of service (QoS) requirements [99] with respect to latency (e.g., isochronous traffic [5, 12, 103]) and bandwidth (e.g., display [9, 38, 94, 99]). Mispredicting a component's actual demand can violate the QoS requirements and/or significantly degrade system performance. Second, the DVFS process of the IO and memory domains is a global system optimization. It requires monitoring the demands of the three SoC domains and subsequently configuring multiple components in the SoC to carry out the actual DVFS. Therefore, a power management transition flow for applying this global optimization can be computationally expensive. If it is not done correctly, the transition from one voltage/frequency operating point to another can degrade SoC performance by stalling the SoC domains. Third, the DVFS process should be holistic, efficient, and optimized to maximize power savings. For instance, previous works on memory subsystem DVFS [10, 11, 13, 14, 16, 20, 35, 58, 111] do not dynamically optimize the DRAM interface (i.e., DDRIO) configuration registers [52, 75, 79] and voltage during the DVFS process.

[...] main memory subsystem. To our knowledge, no previous work on SoC DVFS 1) coordinates and combines DVFS across three domains, or 2) optimizes the DRAM configuration registers [52, 75, 79] and voltage.

To enable more holistic power management in a mobile SoC and thereby to improve overall SoC power efficiency and performance, we propose SysScale, a new power management technique. SysScale is based on three key ideas. First, SysScale can accurately and dynamically predict the bandwidth/latency demand of multiple SoC domains by implementing new performance counters and utilizing existing system configuration registers. Second, SysScale uses a highly-efficient global DVFS mechanism to dynamically distribute the SoC power budget across all three domains, according to the predicted performance requirements. SysScale's DVFS mechanism minimizes latency overheads by 1) performing DVFS simultaneously in all domains to overlap the DVFS latencies and 2) storing the required configuration registers in on-chip SRAM near each domain. Third, to maximize power savings, SysScale optimizes the energy efficiency of each domain at different DVFS operating points with domain-specific mechanisms. For instance, we optimize the energy efficiency of the DRAM interface by adding a dedicated scalable voltage supply and optimizing the configuration registers for each DVFS operating point.

This work makes the following contributions:

  • To our knowledge, SysScale is the first work to enable coordinated and highly-efficient DVFS across all SoC domains to increase the energy efficiency of mobile SoCs. SysScale introduces the ability to redistribute the total power budget across all SoC domains according to the performance demands of each domain.
  • We propose an effective algorithm to accurately predict the performance (e.g., bandwidth and latency) demands of the three SoC domains, utilizing newly-implemented performance counters and existing system configuration registers.
  • We introduce a new global DVFS mechanism that minimizes the performance overhead of applying DVFS across multiple domains.
  • We implement SysScale on the Intel Skylake SoC for mobile devices [2, 18, 97] and evaluate SysScale using a wide variety of workloads: SPEC CPU2006 [95], graphics (3DMark [100]), and battery life workloads for mobile devices [1] (e.g., web browsing, light gaming, video con-
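
The cross-domain budget redistribution described in Observations 1 and 2 can be illustrated with a small sketch. This is not the paper's algorithm: the `OperatingPoint` type, the function names, and every bandwidth and power figure below are hypothetical values chosen only to show the idea. The sketch picks the lowest-power IO/memory operating point that still covers a predicted bandwidth demand and hands the remaining budget to the compute domain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    """A hypothetical IO/memory DVFS operating point (illustrative values only)."""
    name: str
    peak_bandwidth_gbps: float  # bandwidth this point can sustain
    power_w: float              # power the IO and memory domains draw at this point


def pick_operating_point(points, predicted_demand_gbps):
    """Return the lowest-power point whose bandwidth covers the predicted demand.

    If no point covers the demand, fall back to the highest-bandwidth point.
    """
    feasible = [p for p in points if p.peak_bandwidth_gbps >= predicted_demand_gbps]
    if not feasible:
        return max(points, key=lambda p: p.peak_bandwidth_gbps)
    return min(feasible, key=lambda p: p.power_w)


def redistribute(total_budget_w, points, predicted_demand_gbps):
    """Give IO/memory what the chosen point needs; compute receives the rest."""
    point = pick_operating_point(points, predicted_demand_gbps)
    return point, total_budget_w - point.power_w


# Hypothetical operating points and SoC budget, not taken from the paper.
POINTS = [
    OperatingPoint("low", peak_bandwidth_gbps=8.0, power_w=1.5),
    OperatingPoint("mid", peak_bandwidth_gbps=16.0, power_w=2.5),
    OperatingPoint("high", peak_bandwidth_gbps=25.0, power_w=4.0),
]

point, compute_budget_w = redistribute(15.0, POINTS, predicted_demand_gbps=6.0)
print(point.name, compute_budget_w)
```

Under these assumed numbers, a fixed worst-case allocation would always reserve the "high" point's power for IO and memory, whereas a demand-based choice frees the difference for the compute domain whenever the predicted demand is modest. A real implementation would also have to respect the QoS constraints and transition latencies discussed above, which this sketch ignores.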
