
MULTIMEDIA from Computer's October 2016 Issue

pg. 12 Author Charles Severance goes behind the scenes at the Living Computer Museum. View the full-resolution video at https://youtu.be/ZHU4nzIsaIM. Subscribe to the Computing Conversations podcast on iTunes at https://itunes.apple.com/us/podcast/computing-conversations/id731495760.

pg. 12 Author Charles Severance provides an audio recording of his Computing Conversations column, in which he recounts his visit to the Living Computer Museum. Subscribe to the Computing Conversations podcast on iTunes at https://itunes.apple.com/us/podcast/computing-conversations/id731495760.

pg. 14 In this video at https://youtu.be/35PaR65EYa8, Karlheinz Meier from Ruprecht-Karls University in Heidelberg speaks about neuromorphic computing's concepts, achievements, and challenges at the 2016 International Supercomputing Conference.

pg. 112 Author David Alan Grier reads his Errant Hashtag column, in which he traces the influence of the American industrial experience on Silicon Valley's innovation system. Subscribe to the Errant Hashtag podcast on iTunes at https://itunes.apple.com/us/podcast/errant-hashtag/id893229126.

GUEST EDITORS' INTRODUCTION
New Frontiers in Energy-Efficient Computing
VLADIMIR GETOV, ADOLFY HOISIE, AND PRADIP BOSE

OCTOBER 2016 FEATURES

20 Using Performance-Power Modeling to Improve Energy Efficiency of HPC Applications
XINGFU WU, VALERIE TAYLOR, JEANINE COOK, AND PHILIP J. MUCCI

30 Power, Reliability, and Performance: One System to Rule Them All
BILGE ACUN, AKHIL LANGER, ESTEBAN MENESES, HARSHITHA MENON, OSMAN SAROOD, EHSAN TOTONI, AND LAXMIKANT V. KALÉ

38 Standardizing Power Monitoring and Control at Exascale
RYAN E. GRANT, MICHAEL LEVENHAGEN, STEPHEN L. OLIVIER, DAVID DEBONIS, KEVIN T. PEDRETTI, AND JAMES H. LAROS III


COVER FEATURE ENERGY-EFFICIENT COMPUTING

Power, Reliability, and Performance: One System to Rule Them All

Bilge Acun, University of Illinois at Urbana–Champaign
Akhil Langer, Intel
Esteban Meneses, Costa Rica Institute of Technology and Costa Rica National High Technology Center
Harshitha Menon, University of Illinois at Urbana–Champaign
Osman Sarood, Yelp
Ehsan Totoni, Intel Labs
Laxmikant V. Kalé, University of Illinois at Urbana–Champaign

In a design based on the Charm++ parallel programming framework, an adaptive runtime system dynamically interacts with a datacenter's resource manager to control power by intelligently scheduling jobs, reallocating resources, and reconfiguring hardware. It simultaneously manages reliability by cooling the system to the running application's optimal level and maintains performance through load balancing.

High-performance computing (HPC) datacenters and applications are facing major challenges in reliability, power management, and thermal variations that will require disruptive solutions to optimize performance. A unified system design with a smart runtime system that interacts with the system resource manager will be imperative in addressing these challenges. The system would be part of each job, but interact with an adaptive resource manager for the whole machine. Studies have shown that a smart, adaptive runtime system can improve efficiency in a power-constrained environment,1 increase performance with load-balancing algorithms,2 better control the reliability of systems with substantial thermal variations,3 and configure hardware components to operate within power constraints to save energy.4,5 Although smart runtime systems are a promising way to overcome exascale computing barriers, there is

30 COMPUTER PUBLISHED BY THE IEEE COMPUTER SOCIETY 0018-9162/16/$33.00 © 2016 IEEE



FIGURE 1. Interacting components in our design based on the Charm++ framework. The two major interacting components are the resource manager and an adaptive runtime system. Our design achieves the objectives of both datacenter users and system administrators by allowing dynamic interaction between the system resource manager or scheduler and the job runtime system to meet system-level constraints such as power caps and hardware configurations.

no integrated solution that combines past research into a single system for optimizing performance across multiple dimensions, such as load balancing and power use. To fill that need, we have developed a comprehensive design based on Charm++, a parallel programming framework in C++ that has been in use since 2001. Our design

› lets the datacenter resource manager dynamically interact with the individual job's runtime system,
› optimizes for both performance and power consumption, and
› enables operation in an environment with system failures under constraints supplied by users or administrators.

At Charm++'s core is an adaptive runtime system that enables the dynamic collection of performance data, dynamic task migration (load balancing), and temperature restraint and power capping with optimal performance. The ability to migrate tasks and data from one processor to any other available processor is critical to solving conflicting requirements, such as application load imbalances across processors, high fault rates, power and energy constraints, and thermal variations. Balance is essential in addressing these concerns because of potential conflicts among them—for example, applying power and temperature constraints can compromise performance and lead to a load imbalance.

Research has shown that Charm++ enhances user productivity and allows programs to run portably from small multicore computers (laptops, phones) to the largest supercomputers.6 Users can easily expose and express much of the parallelism in their algorithms while automating many of the requirements for high performance and scalability. Because of these strengths, Charm++ has thousands of users across a wide variety of computing disciplines with multiple large-scale applications, including simulation programs such as Nanoscale Molecular Dynamics—formerly Not (just) Another Molecular Dynamics program—for molecular dynamics, ChaNGa for cosmology, and OpenAtom for quantum chemistry.6

In our design, the Charm++ adaptive runtime system intelligently schedules jobs and reallocates job resources when utilization changes, controls reliability by a temperature-aware module that cools the system, and reconfigures the hardware without sacrificing performance.

To evaluate our solution, we conducted several efficiency tests. Our results show that the adaptive runtime system's capabilities result in greater power efficiency for common HPC applications.

INTERACTION FLOW
Figure 1 shows how Charm++'s adaptive runtime system interacts with the resource manager to address power, reliability, and performance issues. Datacenter users are focused on job performance. System administrators are tasked with guaranteeing good performance to individual jobs, yet ensuring that total power consumption does not exceed the center's allocated budget and that overall job throughput remains high despite node failures and thermal variations. Charm++ addresses both concerns. The job scheduler strives to allocate system resources to jobs optimally according to their power and performance characteristics, and the runtime system implements the scheduler's decision by shrinking or expanding the number of nodes assigned and dynamically balancing load as needed. The runtime system can turn on or off or reconfigure




hardware components without impacting application performance, as long as vendors provide adequate hardware control.

FIGURE 2. Components of the adaptive runtime system and their interaction with the resource manager. Charm++ has three main components: the local manager and the load-balancing and power-resiliency modules.

CHARM++ ATTRIBUTES
Charm++ has three main attributes. The first is overdecomposition, in which the programmer divides an application's computation requirements into small work and data units so that the number of units greatly exceeds the number of processors. The second attribute is message-driven execution, which involves scheduling work units on the basis of when a message is received for them. The third attribute, migratability, is the ability to move data and work units among processors. These three attributes enable Charm++ to provide many useful features including dynamic load balancing, fault tolerance, and job malleability (shrinking or expanding the number of processors the application is running on).

The framework collects information about the application and system in a distributed database including processor loads, each object's load, communication patterns, and core temperatures. With a large number of processors, centralized data collection becomes a performance bottleneck, so data collection and decision making are hierarchical. The adaptive runtime system's modules use this information to make decisions such as improving load balance, handling faults, and enforcing power constraints.

Load balancing
One critical Charm++ feature is a measurement-based mechanism for load balancing that relies on the principle-of-persistence heuristic. This principle states that, for overdecomposed iterative applications, the task's or object's computation load and communication pattern tend to persist over time. The heuristic uses the application's load statistics collected by the runtime system, which provides an automatic, application-independent way of obtaining load statistics without any user input. If desired, the user can specify predicted loads and thus override system predictions. Using the collected load statistics, Charm++ executes a chosen load-balancing strategy to determine a possible objects-to-processors mapping and then carries out migrations on the basis of this mapping. A suite of load balancers includes several centralized, distributed, and hierarchical strategies.

Charm++ can also automate the decision of when to call the load balancer,7 predict future load, and make load-balancing decisions. Load balancing is automatically triggered when Charm++ detects an imbalance and load-balancing benefits are likely to exceed its overhead.

Fault tolerance
Charm++ implements both proactive and reactive techniques for ensuring reliability.8 In a proactive strategy, the runtime system evacuates all objects from a node that a monitoring system predicts will soon crash. Because failure prediction is not completely accurate, reactive techniques recover the information lost after a failure brings down a system node. Because recovery techniques are based primarily on checkpoint and restart, Charm++ routinely stores the application's global state so that it can retrieve a prior global state during recovery.

Job malleability
The ability of Charm++ objects to migrate enables job malleability, in which a job can shrink (decrease) or expand (increase) the number of nodes it is running on. Job malleability does not require any additional code from the application developer;9 rather, an external command or internal decision by the runtime system can trigger shrink or expand operations.

During a shrink operation, the runtime system moves objects from processors that will no longer be used and returns them to the resource manager.
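The objects-to-processors mapping step in the load-balancing discussion above can be sketched as a simple greedy heuristic. This is illustrative only: Charm++ ships a suite of centralized, distributed, and hierarchical strategies, and `greedy_map` is a hypothetical helper, not its API. Under the principle of persistence, each object's measured load serves as its predicted load, and assigning the heaviest objects first to the least-loaded processor yields an approximately even mapping.

```python
# Greedy sketch of measurement-based load balancing (illustrative, not
# Charm++ code). Measured loads stand in for predicted loads, per the
# principle-of-persistence heuristic.

def greedy_map(object_loads, num_procs):
    """object_loads: {object_id: measured load}; returns {object_id: processor}."""
    proc_load = [0.0] * num_procs
    mapping = {}
    # Heaviest objects first; ties broken by id so the mapping is deterministic.
    for obj, load in sorted(object_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        target = min(range(num_procs), key=proc_load.__getitem__)  # least loaded
        mapping[obj] = target
        proc_load[target] += load
    return mapping
```

A real runtime would then migrate only the objects whose assignment changed, since migrations themselves carry a cost that must not exceed the balancing benefit.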



In an expand operation, the runtime system launches new processes on the additional processors and distributes objects from current processors to the newly allocated ones. In addition to moving objects, the runtime system must adjust its distributed data structures, such as spanning trees and location managers.

Figure 2 shows Charm++'s internal components and functions. Each processor has a local manager (LM) that controls objects residing on that processor and interaction with other Charm++ components. The LM sends its total computational load and the computational load of each of its objects to the load-balancing module (LBM) and sends the CPU temperature to the power-resiliency module (PRM). The LBM makes load-balancing decisions and redistributes load in response to shrink or expand commands from the resource manager. It also communicates object-migration decisions to the respective LMs.

The PRM ensures that the CPU temperatures remain below the temperature threshold specific to that job, controlling temperature by adjusting the CPU's power cap. When a processor's temperature is above the threshold, the PRM lowers its power cap; when the temperature is well below the desired threshold, it increases the cap, thus ensuring that the total power the job consumes remains below that job's allocated power budget. Jobs do not have administrator rights to constrain their CPUs' power consumption, so the PRM must communicate any new power caps to the resource manager, which then applies them to the CPUs.

HANDLING POWER LIMITS
The US Department of Energy has set a power limit of 20 MW for an exascale machine that it is willing to purchase and deploy.10 Such a limit underlines the need for power-aware resource management along with job malleability.

With recent advances in processor hardware design, users can employ software such as Intel's Running Average Power Limit (RAPL) driver to control the power the processor consumes. Power capping forces the processors to run below their thermal design power (TDP)—the maximum power a processor can consume. A node's TDP also determines the maximum number of nodes that a datacenter with a power budget can use.

Overprovisioning
With power capping, the datacenter can control the nodes' power consumption and thus have additional nodes while remaining within the power budget—a practice referred to as overprovisioning.11 Research shows that an increase in the power allocated to a processor does not yield a proportional increase in job performance because jobs react differently to an increase in the power allocated to the CPU.1 The idiosyncrasies in a job's performance based on allocated CPU power point to the possibility of running different applications at different power levels. Overprovisioned systems can significantly improve the performance of applications that are not sensitive to CPU power by capping CPU power to values well below their TDP and adding more nodes to get the scaling benefits.

Power-aware speedup
An application's response to a CPU power limit can be captured by the application's power-aware speedup12—the ratio of the job's execution time when running at the lowest power level the CPU allows to its execution time on a CPU capped at a given power level.1 The higher the power-aware speedup, the more sensitive the application is to changes in the power allocated to the CPU.

Figure 3 shows the power-aware speedups for four HPC applications with different characteristics running under different CPU power caps.1

FIGURE 3. Power-aware speedups of four applications running on 20 nodes. LeanMD, a molecular dynamics application, has the highest power-aware speedup because it is the most CPU-intensive application. Jacobi2D, a stencil application, has the lowest speedup because it is memory intensive. The many minor deviations from monotonic behavior (lack of a smooth curve) are likely due to external factors such as OS jitter and network delays.
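As a concrete reading of that definition, the helper below turns per-cap timing measurements into power-aware speedups. The function name and the timings are invented for illustration; they are not measurements from the article.

```python
# Power-aware speedup: execution time at the lowest allowed CPU power cap
# divided by execution time at a given cap. Timings here are made up.

def power_aware_speedup(time_by_cap):
    """Map {power cap in watts: execution time in seconds} to {cap: speedup}."""
    baseline = time_by_cap[min(time_by_cap)]  # time at the lowest power level
    return {cap: baseline / t for cap, t in time_by_cap.items()}

# A CPU-bound code speeds up steeply as its cap rises; a memory-bound code
# barely responds, which is why overprovisioning favors the latter.
cpu_bound = power_aware_speedup({30: 100.0, 45: 70.0, 60: 55.0})
mem_bound = power_aware_speedup({30: 100.0, 45: 95.0, 60: 92.0})
```

Computed this way, a CPU-sensitive code shows a steep speedup curve while a memory-bound one stays near 1, the same contrast Figure 3 draws between LeanMD and Jacobi2D.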




Power-Aware Resource Manager
The Power-Aware Resource Manager (PARM),1 developed at the Parallel Programming Laboratory of the University of Illinois at Urbana–Champaign, is an essential part of our Charm++-based design. PARM makes scheduling decisions by selecting jobs and their resource configurations (power budget and compute nodes) such that the total power-aware speedup of running jobs is maximized. It dynamically interacts with the runtime system, hardware, user, and system administrator to optimally distribute available resources to the jobs while staying under the datacenter's total power budget and not exceeding the allowable number of computational nodes. Similar to the resource manager in Figure 1, PARM has three components.

Job profiler. Before a job is added to the scheduler queue, PARM profiles it to develop a power-aware model with strong scaling that is the basis for calculating power-aware speedups. This profiling mechanism has negligible overhead; the application needs to run only for a few iterations for PARM to get the necessary data points.

Scheduler. PARM implements its strategy for optimizing resource allocation as an integer linear program (ILP) that aims to maximize the power-aware speedup of jobs running under power constraints. The arrival of a new job or termination of a running job triggers PARM's scheduler and reoptimizes scheduling and resource-allocation decisions. The ILP is fast enough to run frequently with negligible overhead.1

Execution framework. This component implements the scheduler decisions by launching jobs, sending shrink/expand decisions to the runtime systems of the jobs, and applying power caps on compute nodes. Job runtime systems interact with the execution framework to convey job termination, completion of shrink/expand operations, and any changes to CPU power caps as determined by the runtime system's PRM module.

Comparison with other resource managers. As Figure 4 shows, PARM outperforms the Simple Linux Utility for Resource Management (SLURM; now the Slurm Workload Manager)—an open source power-unaware utility that many supercomputers use. The malleable version of PARM (PARM-M) reduced average job completion time by up to 41 percent relative to SLURM. PARM gives less performance benefit over SLURM with SetH as compared to SetL because reducing CPU power greatly affects performance for jobs with high power sensitivity (SetH).

FIGURE 4. Comparison of average job-completion times with the Simple Linux Utility for Resource Management (SLURM; now the Slurm Workload Manager) and the Power-Aware Resource Manager (PARM) in rigid (R) and malleable (M) versions. Each bar represents average completion time as a percentage of the average completion time using SLURM. In PARM-R, the node allocation decision cannot be changed once the job starts running. With PARM-M, in contrast, the nodes allocated to a running job can either decrease or increase (shrink or expand). SetL has jobs with low sensitivity to CPU power, and SetH has jobs with high sensitivity.

IMPROVING RELIABILITY
Checkpoint and restart is the most popular mechanism for providing fault tolerance in high-performance computers and is used to recover work from the stored checkpoint before the failure. An application's total execution time T on an unreliable system is given by

T = Tsolve + Tcheckpoint + Trecover + Trestart,

where Tsolve represents the total effort required to solve the problem, Tcheckpoint accumulates all the time spent on saving the system checkpoints, Trecover represents the total lost work from system failures that must be recovered, and Trestart represents the amount of time required to resume execution after a crash (usually a constant value).

A system using checkpoint and restart must choose an appropriate checkpoint period, denoted by W. The value of W has a delicate balance.
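The article notes the tradeoff governing W but gives no formula. A classic first-order answer, not taken from the article, is Young's approximation: with per-checkpoint cost delta and mean time between failures M, the overhead-minimizing period is roughly W = sqrt(2 * delta * M).

```python
import math

# Young's first-order approximation for the checkpoint period (a standard
# result from the checkpoint/restart literature, not derived in this article).

def young_checkpoint_period(delta_s, mtbf_s):
    """Checkpoint period (s) minimizing first-order overhead: sqrt(2*delta*M)."""
    return math.sqrt(2.0 * delta_s * mtbf_s)

def expected_overhead(period_s, delta_s, mtbf_s):
    """First-order overhead model: checkpoint cost per period, plus the
    half-period of work expected to be lost per failure."""
    return delta_s / period_s + period_s / (2.0 * mtbf_s)

# A 60 s checkpoint on a machine with a 24-hour MTBF suggests checkpointing
# roughly every 54 minutes.
w_opt = young_checkpoint_period(60.0, 24 * 3600.0)
```

The model makes the tension in the surrounding text concrete: stretching or shrinking the period away from w_opt raises the expected overhead from either the Trecover or the Tcheckpoint side.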



A long value (low checkpoint frequency) decreases Tcheckpoint, but might increase Trecover. Conversely, a short value (high checkpoint frequency) could reduce Trecover, but might enlarge Tcheckpoint. The optimum value of W strongly depends on the system's mean time between failures (MTBF).

An electronic component's MTBF is directly affected by the component's temperature. That relation is usually exponential, and there is some experimental evidence that a 10°C increase in a processor's temperature decreases its MTBF by half.3 Therefore, system reliability can be controlled by restraining the temperature of its components. The cooler the system runs, the more reliable it is, but the slower it runs because CPU power is constrained to keep temperature down. This interplay between power management and reliability has been studied in other contexts.13

The runtime system allows each core to consume the maximum allowable power as long as it is within the maximum temperature threshold. If the temperature of any core exceeds the maximum threshold, power is decreased, causing the core's temperature to fall. However, this can degrade the performance of tightly coupled applications because of thermal variations. The LBM will automatically detect any load imbalance and make the load-balancing decisions.2

The runtime system must strike a balance in the temperature at which each component should be restrained. Moreover, that balance depends on the application. Different codes generate different thermal profiles on the system at different stages of the application. Some codes are more compute intensive and tend to heat up the processors more quickly. Appropriate, application-based temperature thresholds are stored as part of the job profiler in Figure 1. In the end, the runtime system aims to reduce the application's total execution time in light of the system's MTBF and power limitations.3

Figure 5 shows the percentage reduction in execution time after constraining core temperatures to different thresholds for two different applications. The reduction is calculated relative to the baseline case in which processor temperature is unconstrained. Figure 5 also shows the ratio of MTBF for the machine with our design versus the MTBF for the baseline case. For example, by restraining core temperatures to 42°C for Jacobi2D, the machine's MTBF increased 2.3 times while the execution time decreased by 12 percent relative to the baseline case. The inverted U shape of both curves strongly suggests a tradeoff between reliability (MTBF) and the slowdown induced by the temperature restraint.

FIGURE 5. Reduction in execution time and change in mean time between failures (MTBF) for different temperature thresholds. Comparison is relative to a baseline case in which core temperature is unconstrained.

The resource manager sends the runtime system's PRM the upper bounds of the temperatures that honor the system's power envelope (as shown in Figure 2). These temperature values are input to an internal resilience component in PRM and are changed according to algorithms that optimize performance and consider the system's MTBF and the running application's characteristics. The output will be propagated to additional PRM components, which will consolidate the final power limits and communicate them back to the resource manager.

As this interaction sequence implies, a dynamic runtime system is fundamental to controlling system reliability while simultaneously adhering to power limits. Because thermal variations are dynamic, a reactive runtime system efficiently responds to those changes and provides a healthy balance between system performance and reliability. It also allows users to address a broader range of reliability concerns, including how to correct soft errors (such as bit flips caused by high-energy particles) in tandem with hard errors.14 Circuits using near-threshold voltage will introduce a more complex scenario with higher performance variability and a higher transient-failure rate.

DYNAMIC CONFIGURATION
The runtime system can exploit available hardware controls for power




management, such as frequency scaling and power capping. However, greater power savings are possible with more runtime control over hardware components. By taking advantage of the running application's properties, the runtime system can turn off or reconfigure many components without a significant performance penalty.

Ideally, high-performance computing systems should be energy proportional; the hardware components should consume power and energy only when their functionality is being used. However, network links are always on, whether or not they are actually being used. Moreover, processor caches consume large amounts of power, even when they are not improving the application's performance. The runtime system approach can save this wasted energy by dynamically reconfiguring hardware as the application requires.

Caches consume up to 40 percent of a processor's power.4 Much of that consumption can be avoided by turning off some cache banks when doing so would not degrade application performance. Many common HPC applications do not use caches effectively. For example, molecular dynamics applications typically have small working data sets and do not need the large last-level caches. Grid-based physical simulation applications typically have very large data sets that do not fit in caches, and the data reuse in cache is minimal. The hardware cannot predict this application behavior.

To address this concern, we developed an approach in which the runtime system uses profiling data to reconfigure the cache to save power without significant performance loss. Using a set of representative HPC applications, we demonstrated that,

ABOUT THE AUTHORS

BILGE ACUN is a doctoral student in the Department of Computer Science at the University of Illinois at Urbana–Champaign (UIUC). Her research interests include power-aware software design, malleability, and variability in large-scale applications. Acun received a BS in computer science from Bilkent University. She is a member of ACM. Contact her at [email protected].

AKHIL LANGER is a software engineer at Intel. His research interests include scalable distributed algorithms and power and communication optimizations in high-performance computing (HPC). Langer received a PhD in computer science from UIUC. Contact him at [email protected].

ESTEBAN MENESES is an assistant professor in the School of Computing at the Costa Rica Institute of Technology and the Costa Rica National High Technology Center. His research interests include reliability in HPC systems, parallel-objects application design, and accelerator programming. Meneses received a PhD in computer science from UIUC. He is a member of ACM. Contact him at [email protected].

HARSHITHA MENON is a doctoral student in the Department of Computer Science at UIUC. Her research interests include scalable load-balancing algorithms and adaptive runtime techniques to improve large-scale application performance. Menon received an MS in computer science from UIUC. Contact her at [email protected].

OSMAN SAROOD is a software engineer at Yelp. His research interests include performance optimization for parallel and distributed computing under power and energy constraints. Sarood received a PhD in computer science from UIUC. Contact him at [email protected].

EHSAN TOTONI is a research scientist at Intel Labs. His research interests include programming systems for HPC and big-data analytics. Totoni received a PhD in computer science from UIUC. He is a member of IEEE and ACM. Contact him at [email protected].

LAXMIKANT V. KALÉ is a professor in the Department of Computer Science at UIUC. His research interests include aspects of parallel computing, with a focus on enhancing performance and productivity through adaptive runtime systems. Kalé received a PhD in computer science from Stony Brook University. He is a Fellow of IEEE and a member of ACM. Contact him at [email protected].

