
MULTIMEDIA from Computer's October 2016 Issue

pg. 12 Author Charles Severance goes behind the scenes at the Living Computer Museum. View the full-resolution video at https://youtu.be/ZHU4nzIsaIM. Subscribe to the Computing Conversations podcast on iTunes at https://itunes.apple.com/us/podcast/computing-conversations/id731495760.

pg. 12 Author Charles Severance provides an audio recording of his Computing Conversations column, in which he recounts his visit to the Living Computer Museum. Subscribe to the Computing Conversations podcast on iTunes at https://itunes.apple.com/us/podcast/computing-conversations/id731495760.

pg. 14 In this video at https://youtu.be/35PaR65EYa8, Karlheinz Meier from Ruprecht-Karls University in Heidelberg speaks about neuromorphic computing's concepts, achievements, and challenges at the 2016 International Supercomputing Conference.

pg. 112 Author David Alan Grier reads his Errant Hashtag column, in which he traces the influence of the American industrial experience on Silicon Valley's innovation system. Subscribe to the Errant Hashtag podcast on iTunes at https://itunes.apple.com/us/podcast/errant-hashtag/id893229126.

GUEST EDITORS' INTRODUCTION
New Frontiers in Energy-Efficient Computing
VLADIMIR GETOV, ADOLFY HOISIE, AND PRADIP BOSE

OCTOBER 2016 FEATURES

20 Using Performance-Power Modeling to Improve Energy Efficiency of HPC Applications
XINGFU WU, VALERIE TAYLOR, JEANINE COOK, AND PHILIP J. MUCCI

30 Power, Reliability, and Performance: One System to Rule Them All
BILGE ACUN, AKHIL LANGER, ESTEBAN MENESES, HARSHITHA MENON, OSMAN SAROOD, EHSAN TOTONI, AND LAXMIKANT V. KALÉ

38 Standardizing Power Monitoring and Control at Exascale
RYAN E. GRANT, MICHAEL LEVENHAGEN, STEPHEN L. OLIVIER, DAVID DEBONIS, KEVIN T. PEDRETTI, AND JAMES H. LAROS III


COVER FEATURE ENERGY-EFFICIENT COMPUTING

Power, Reliability, and Performance: One System to Rule Them All

Bilge Acun, University of Illinois at Urbana–Champaign
Akhil Langer, Intel
Esteban Meneses, Costa Rica Institute of Technology and Costa Rica National High Technology Center
Harshitha Menon, University of Illinois at Urbana–Champaign
Osman Sarood, Yelp
Ehsan Totoni, Intel Labs
Laxmikant V. Kalé, University of Illinois at Urbana–Champaign

In a design based on the Charm++ parallel programming framework, an adaptive runtime system dynamically interacts with a datacenter's resource manager to control power by intelligently scheduling jobs, reallocating resources, and reconfiguring hardware. It simultaneously manages reliability by cooling the system to the running application's optimal level and maintains performance through load balancing.

High-performance computing (HPC) datacenters and applications are facing major challenges in reliability, power management, and thermal variations that will require disruptive solutions to optimize performance. A unified system design with a smart runtime system that interacts with the system resource manager will be imperative in addressing these challenges. The system would be part of each job, but interact with an adaptive resource manager for the whole machine. Studies have shown that a smart, adaptive runtime system can improve efficiency in a power-constrained environment,1 increase performance with load-balancing algorithms,2 better control the reliability of systems with substantial thermal variations,3 and configure hardware components to operate within power constraints to save energy.4,5 Although smart runtime systems are a promising way to overcome exascale computing barriers, there is

30 COMPUTER PUBLISHED BY THE IEEE COMPUTER SOCIETY 0018-9162/16/$33.00 © 2016 IEEE



FIGURE 1. Interacting components in our design based on the Charm++ framework. The two major interacting components are the resource manager and an adaptive runtime system. Our design achieves the objectives of both datacenter users and system administrators by allowing dynamic interaction between the system resource manager or scheduler and the job runtime system to meet system-level constraints such as power caps and hardware configurations.

no integrated solution that combines past research into a single system for optimizing performance across multiple dimensions, such as load balancing and power use. To fill that need, we have developed a comprehensive design based on Charm++, a parallel programming framework in C++ that has been in use since 2001. Our design

› lets the datacenter resource manager dynamically interact with the individual job's runtime system,
› optimizes for both performance and power consumption, and
› enables operation in an environment with system failures under constraints supplied by users or administrators.

At Charm++'s core is an adaptive runtime system that enables the dynamic collection of performance data, dynamic task migration (load balancing), and temperature restraint and power capping with optimal performance. The ability to migrate tasks and data from one processor to any other available processor is critical to solving conflicting requirements, such as application load imbalances across processors, high fault rates, power and energy constraints, and thermal variations. Balance is essential in addressing these concerns because of potential conflicts among them—for example, applying power and temperature constraints can compromise performance and lead to a load imbalance.

Research has shown that Charm++ enhances user productivity and allows programs to run portably from small multicore computers (laptops, phones) to the largest supercomputers.6 Users can easily expose and express much of the parallelism in their algorithms while automating many of the requirements for high performance and scalability. Because of these strengths, Charm++ has thousands of users across a wide variety of computing disciplines with multiple large-scale applications, including simulation programs such as Nanoscale Molecular Dynamics—formerly Not (just) Another Molecular Dynamics program—for molecular dynamics, ChaNGa for cosmology, and OpenAtom for quantum chemistry.6

In our design, the Charm++ adaptive runtime system intelligently schedules jobs and reallocates job resources when utilization changes, controls reliability by a temperature-aware module that cools the system, and reconfigures the hardware without sacrificing performance.

To evaluate our solution, we conducted several efficiency tests. Our results show that the adaptive runtime system's capabilities result in greater power efficiency for common HPC applications.

INTERACTION FLOW
Figure 1 shows how Charm++'s adaptive runtime system interacts with the resource manager to address power, reliability, and performance issues. Datacenter users are focused on job performance. System administrators are tasked with guaranteeing good performance to individual jobs, yet ensuring that total power consumption does not exceed the center's allocated budget and that overall job throughput remains high despite node failures and thermal variations. Charm++ addresses both concerns. The job scheduler strives to allocate system resources to jobs optimally according to their power and performance characteristics, and the runtime system implements the scheduler's decision by shrinking or expanding the number of nodes assigned and dynamically balancing load as needed. The runtime system can turn on or off or reconfigure




hardware components without impacting application performance, as long as vendors provide adequate hardware control.

FIGURE 2. Components of the adaptive runtime system and their interaction with the resource manager. Charm++ has three main components: the local manager and the load-balancing and power-resiliency modules.

CHARM++ ATTRIBUTES
Charm++ has three main attributes. The first is overdecomposition, in which the programmer divides an application's computation requirements into small work and data units so that the number of units greatly exceeds the number of processors. The second attribute is message-driven execution, which involves scheduling work units on the basis of when a message is received for them. The third attribute, migratability, is the ability to move data and work units among processors. These three attributes enable Charm++ to provide many useful features including dynamic load balancing, fault tolerance, and job malleability (shrinking or expanding the number of processors the application is running on).

The framework collects information about the application and system in a distributed database including processor loads, each object's load, communication patterns, and core temperatures. With a large number of processors, centralized data collection becomes a performance bottleneck, so data collection and decision making are hierarchical. The adaptive runtime system's modules use this information to make decisions such as improving load balance, handling faults, and enforcing power constraints.

Load balancing
One critical Charm++ feature is a measurement-based mechanism for load balancing that relies on the principle-of-persistence heuristic. This principle states that, for overdecomposed iterative applications, the task's or object's computation load and communication pattern tend to persist over time. The heuristic uses the application's load statistics collected by the runtime system, which provides an automatic, application-independent way of obtaining load statistics without any user input. If desired, the user can specify predicted loads and thus override system predictions. Using the collected load statistics, Charm++ executes a chosen load-balancing strategy to determine a possible objects-to-processors mapping and then carries out migrations on the basis of this mapping. A suite of load balancers includes several centralized, distributed, and hierarchical strategies.

Charm++ can also automate the decision of when to call the load balancer,7 predict future load, and make load-balancing decisions. Load balancing is automatically triggered when Charm++ detects an imbalance and load-balancing benefits are likely to exceed its overhead.

Fault tolerance
Charm++ implements both proactive and reactive techniques for ensuring reliability.8 In a proactive strategy, the runtime system evacuates all objects from a node that a monitoring system predicts will soon crash. Because failure prediction is not completely accurate, reactive techniques recover the information lost after a failure brings down a system node. Because recovery techniques are based primarily on checkpoint and restart, Charm++ routinely stores the application's global state so that it can retrieve a prior global state during recovery.

Job malleability
The ability of Charm++ objects to migrate enables job malleability, in which a job can shrink (decrease) or expand (increase) the number of nodes it is running on. Job malleability does not require any additional code from the application developer;9 rather, an external command or internal decision by the runtime system can trigger shrink or expand operations.

During a shrink operation, the runtime system moves objects from processors that will no longer be used and returns them to the resource manager.
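The objects-to-processors mapping step in the load-balancing discussion above can be sketched as a simple greedy heuristic. This is illustrative only: Charm++ ships a suite of centralized, distributed, and hierarchical strategies, and `greedy_map` is a hypothetical helper, not its API. Under the principle of persistence, each object's measured load serves as its predicted load, and assigning the heaviest objects first to the least-loaded processor yields an approximately even mapping.

```python
# Greedy sketch of measurement-based load balancing (illustrative, not
# Charm++ code). Measured loads stand in for predicted loads, per the
# principle-of-persistence heuristic.

def greedy_map(object_loads, num_procs):
    """object_loads: {object_id: measured load}; returns {object_id: processor}."""
    proc_load = [0.0] * num_procs
    mapping = {}
    # Heaviest objects first; ties broken by id so the mapping is deterministic.
    for obj, load in sorted(object_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        target = min(range(num_procs), key=proc_load.__getitem__)  # least loaded
        mapping[obj] = target
        proc_load[target] += load
    return mapping
```

A real runtime would then migrate only the objects whose assignment changed, since migrations themselves carry a cost that must not exceed the balancing benefit.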



In an expand operation, the runtime system launches new processes on the additional processors and distributes objects from current processors to the newly allocated ones. In addition to moving objects, the runtime system must adjust its distributed data structures, such as spanning trees and location managers.

Figure 2 shows Charm++'s internal components and functions. Each processor has a local manager (LM) that controls objects residing on that processor and interaction with other Charm++ components. The LM sends its total computational load and the computational load of each of its objects to the load-balancing module (LBM) and sends the CPU temperature to the power-resiliency module (PRM). The LBM makes load-balancing decisions and redistributes load in response to shrink or expand commands from the resource manager. It also communicates object-migration decisions to the respective LMs.

The PRM ensures that the CPU temperatures remain below the temperature threshold specific to that job, controlling temperature by adjusting the CPU's power cap. When a processor's temperature is above the threshold, the PRM lowers its power cap; when the temperature is well below the desired threshold, it increases the cap, thus ensuring that the total power the job consumes remains below that job's allocated power budget. Jobs do not have administrator rights to constrain their CPUs' power consumption, so the PRM must communicate any new power caps to the resource manager, which then applies them to the CPUs.

HANDLING POWER LIMITS
The US Department of Energy has set a power limit of 20 MW for an exascale machine that it is willing to purchase and deploy.10 Such a limit underlines the need for power-aware resource management along with job malleability.

With recent advances in processor hardware design, users can employ software such as Intel's Running Average Power Limit (RAPL) driver to control the power the processor consumes. Power capping forces the processors to run below their thermal design power (TDP)—the maximum power a processor can consume. A node's TDP also determines the maximum number of nodes that a datacenter with a power budget can use.

Overprovisioning
With power capping, the datacenter can control the nodes' power consumption and thus have additional nodes while remaining within the power budget—a practice referred to as overprovisioning.11 Research shows that an increase in the power allocated to a processor does not yield a proportional increase in job performance because jobs react differently to an increase in the power allocated to the CPU.1 The idiosyncrasies in a job's performance based on allocated CPU power point to the possibility of running different applications at different power levels. Overprovisioned systems can significantly improve the performance of applications that are not sensitive to CPU power by capping CPU power to values well below their TDP and adding more nodes to get the scaling benefits.

Power-aware speedup
An application's response to a CPU power limit can be captured by the application's power-aware speedup12—the ratio of the job's execution time when running at the lowest power level the CPU allows to its execution time on a CPU capped at a given power level.1 The higher the power-aware speedup, the more sensitive the application is to changes in the power allocated to the CPU.

Figure 3 shows the power-aware speedups for four HPC applications with different characteristics running under different CPU power caps.1

FIGURE 3. Power-aware speedups of four applications running on 20 nodes. LeanMD, a molecular dynamics application, has the highest power-aware speedup because it is the most CPU-intensive application. Jacobi2D, a stencil application, has the lowest speedup because it is memory intensive. The many minor deviations from monotonic behavior (lack of a smooth curve) are likely due to external factors such as OS jitter and network delays.
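As a concrete reading of that definition, the helper below turns per-cap timing measurements into power-aware speedups. The function name and the timings are invented for illustration; they are not measurements from the article.

```python
# Power-aware speedup: execution time at the lowest allowed CPU power cap
# divided by execution time at a given cap. Timings here are made up.

def power_aware_speedup(time_by_cap):
    """Map {power cap in watts: execution time in seconds} to {cap: speedup}."""
    baseline = time_by_cap[min(time_by_cap)]  # time at the lowest power level
    return {cap: baseline / t for cap, t in time_by_cap.items()}

# A CPU-bound code speeds up steeply as its cap rises; a memory-bound code
# barely responds, which is why overprovisioning favors the latter.
cpu_bound = power_aware_speedup({30: 100.0, 45: 70.0, 60: 55.0})
mem_bound = power_aware_speedup({30: 100.0, 45: 95.0, 60: 92.0})
```

Computed this way, a CPU-sensitive code shows a steep speedup curve while a memory-bound one stays near 1, the same contrast Figure 3 draws between LeanMD and Jacobi2D.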




Power-Aware Resource Manager
The Power-Aware Resource Manager (PARM),1 developed at the Parallel Programming Laboratory of the University of Illinois at Urbana–Champaign, is an essential part of our Charm++-based design. PARM makes scheduling decisions by selecting jobs and their resource configurations (power budget and compute nodes) such that the total power-aware speedup of running jobs is maximized. It dynamically interacts with the runtime system, hardware, user, and system administrator to optimally distribute available resources to the jobs while staying under the datacenter's total power budget and not exceeding the allowable number of computational nodes. Similar to the resource manager in Figure 1, PARM has three components.

Job profiler. Before a job is added to the scheduler queue, PARM profiles it to develop a power-aware model with strong scaling that is the basis for calculating power-aware speedups. This profiling mechanism has negligible overhead; the application needs to run only for a few iterations for PARM to get the necessary data points.

Scheduler. PARM implements its strategy for optimizing resource allocation as an integer linear program (ILP) that aims to maximize the power-aware speedup of jobs running under power constraints. The arrival of a new job or termination of a running job triggers PARM's scheduler and reoptimizes scheduling and resource-allocation decisions. The ILP is fast enough to run frequently with negligible overhead.1

Execution framework. This component implements the scheduler decisions by launching jobs, sending shrink/expand decisions to the runtime systems of the jobs, and applying power caps on compute nodes. Job runtime systems interact with the execution framework to convey job termination, completion of shrink/expand operations, and any changes to CPU power caps as determined by the runtime system's PRM module.

Comparison with other resource managers. As Figure 4 shows, PARM outperforms the Simple Linux Utility for Resource Management (SLURM; now the Slurm Workload Manager)—an open source power-unaware utility that many supercomputers use. The malleable version of PARM (PARM-M) reduced average job completion time by up to 41 percent relative to SLURM. PARM gives less performance benefit over SLURM with SetH as compared to SetL because reducing CPU power greatly affects performance for jobs with high power sensitivity (SetH).

FIGURE 4. Comparison of average job-completion times with the Simple Linux Utility for Resource Management (SLURM; now the Slurm Workload Manager) and the Power-Aware Resource Manager (PARM) in rigid (R) and malleable (M) versions. Each bar represents average completion time as a percentage of the average completion time using SLURM. In PARM-R, the node allocation decision cannot be changed once the job starts running. With PARM-M, in contrast, the nodes allocated to a running job can either decrease or increase (shrink or expand). SetL has jobs with low sensitivity to CPU power, and SetH has jobs with high sensitivity.

IMPROVING RELIABILITY
Checkpoint and restart is the most popular mechanism for providing fault tolerance in high-performance computers and is used to recover work from the stored checkpoint before the failure. An application's total execution time T on an unreliable system is given by

T = Tsolve + Tcheckpoint + Trecover + Trestart,

where Tsolve represents the total effort required to solve the problem, Tcheckpoint accumulates all the time spent on saving the system checkpoints, Trecover represents the total lost work from system failures that must be recovered, and Trestart represents the amount of time required to resume execution after a crash (usually a constant value).

A system using checkpoint and restart must choose an appropriate checkpoint period, denoted by W. The value of W has a delicate balance.
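The article notes the tradeoff governing W but gives no formula. A classic first-order answer, not taken from the article, is Young's approximation: with per-checkpoint cost delta and mean time between failures M, the overhead-minimizing period is roughly W = sqrt(2 * delta * M).

```python
import math

# Young's first-order approximation for the checkpoint period (a standard
# result from the checkpoint/restart literature, not derived in this article).

def young_checkpoint_period(delta_s, mtbf_s):
    """Checkpoint period (s) minimizing first-order overhead: sqrt(2*delta*M)."""
    return math.sqrt(2.0 * delta_s * mtbf_s)

def expected_overhead(period_s, delta_s, mtbf_s):
    """First-order overhead model: checkpoint cost per period, plus the
    half-period of work expected to be lost per failure."""
    return delta_s / period_s + period_s / (2.0 * mtbf_s)

# A 60 s checkpoint on a machine with a 24-hour MTBF suggests checkpointing
# roughly every 54 minutes.
w_opt = young_checkpoint_period(60.0, 24 * 3600.0)
```

The model makes the tension in the surrounding text concrete: stretching or shrinking the period away from w_opt raises the expected overhead from either the Trecover or the Tcheckpoint side.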



A long value (low checkpoint frequency) decreases Tcheckpoint, but might increase Trecover. Conversely, a short value (high checkpoint frequency) could reduce Trecover, but might enlarge Tcheckpoint. The optimum value of W strongly depends on the system's mean time between failures (MTBF).

An electronic component's MTBF is directly affected by the component's temperature. That relation is usually exponential, and there is some experimental evidence that a 10°C increase in a processor's temperature decreases its MTBF by half.3 Therefore, system reliability can be controlled by restraining the temperature of its components. The cooler the system runs, the more reliable it is, but the slower it runs because CPU power is constrained to keep temperature down. This interplay between power management and reliability has been studied in other contexts.13

The runtime system allows each core to consume the maximum allowable power as long as it is within the maximum temperature threshold. If the temperature of any core exceeds the maximum threshold, power is decreased, causing the core's temperature to fall. However, this can degrade the performance of tightly coupled applications because of thermal variations. The LBM will automatically detect any load imbalance and make the load-balancing decisions.2

The runtime system must strike a balance in the temperature at which each component should be restrained. Moreover, that balance depends on the application. Different codes generate different thermal profiles on the system at different stages of the application. Some codes are more compute intensive and tend to heat up the processors more quickly. Appropriate, application-based temperature thresholds are stored as part of the job profiler in Figure 1. In the end, the runtime system aims to reduce the application's total execution time in light of the system's MTBF and power limitations.3

Figure 5 shows the percentage reduction in execution time after constraining core temperatures to different thresholds for two different applications. The reduction is calculated relative to the baseline case in which processor temperature is unconstrained. Figure 5 also shows the ratio of MTBF for the machine with our design versus the MTBF for the baseline case. For example, by restraining core temperatures to 42°C for Jacobi2D, the machine's MTBF increased 2.3 times while the execution time decreased by 12 percent relative to the baseline case. The inverted U shape of both curves strongly suggests a tradeoff between reliability (MTBF) and the slowdown induced by the temperature restraint.

FIGURE 5. Reduction in execution time and change in mean time between failures (MTBF) for different temperature thresholds. Comparison is relative to a baseline case in which core temperature is unconstrained.

The resource manager sends the runtime system's PRM the upper bounds of the temperatures that honor the system's power envelope (as shown in Figure 2). These temperature values are input to an internal resilience component in PRM and are changed according to algorithms that optimize performance and consider the system's MTBF and the running application's characteristics. The output will be propagated to additional PRM components, which will consolidate the final power limits and communicate them back to the resource manager.

As this interaction sequence implies, a dynamic runtime system is fundamental to controlling system reliability while simultaneously adhering to power limits. Because thermal variations are dynamic, a reactive runtime system efficiently responds to those changes and provides a healthy balance between system performance and reliability. It also allows users to address a broader range of reliability concerns, including how to correct soft errors (such as bit flips caused by high-energy particles) in tandem with hard errors.14 Circuits using near-threshold voltage will introduce a more complex scenario with higher performance variability and a higher transient-failure rate.

DYNAMIC CONFIGURATION
The runtime system can exploit available hardware controls for power




management, such as frequency scaling and power capping. However, greater power savings are possible with more runtime control over hardware components. By taking advantage of the running application's properties, the runtime system can turn off or reconfigure many components without a significant performance penalty.

Ideally, high-performance computing systems should be energy proportional; the hardware components should consume power and energy only when their functionality is being used. However, network links are always on, whether or not they are actually being used. Moreover, processor caches consume large amounts of power, even when they are not improving the application's performance. The runtime system approach can save this wasted energy by dynamically reconfiguring hardware as the application requires.

Caches consume up to 40 percent of a processor's power.4 Much of that consumption can be avoided by turning off some cache banks when doing so would not degrade application performance. Many common HPC applications do not use caches effectively. For example, molecular dynamics applications typically have small working data sets and do not need the large last-level caches. Grid-based physical simulation applications typically have very large data sets that do not fit in caches, and the data reuse in cache is minimal. The hardware cannot predict this application behavior.

To address this concern, we developed an approach in which the runtime system uses profiling data to reconfigure the cache to save power without significant performance loss. Using a set of representative HPC applications, we demonstrated that,

ABOUT THE AUTHORS

BILGE ACUN is a doctoral student in the Department of Computer Science at the University of Illinois at Urbana–Champaign (UIUC). Her research interests include power-aware software design, malleability, and variability in large-scale applications. Acun received a BS in computer science from Bilkent University. She is a member of ACM. Contact her at [email protected].

AKHIL LANGER is a software engineer at Intel. His research interests include scalable distributed algorithms and power and communication optimizations in high-performance computing (HPC). Langer received a PhD in computer science from UIUC. Contact him at [email protected].

ESTEBAN MENESES is an assistant professor in the School of Computing at the Costa Rica Institute of Technology and the Costa Rica National High Technology Center. His research interests include reliability in HPC systems, parallel-objects application design, and accelerator programming. Meneses received a PhD in computer science from UIUC. He is a member of ACM. Contact him at [email protected].

HARSHITHA MENON is a doctoral student in the Department of Computer Science at UIUC. Her research interests include scalable load-balancing algorithms and adaptive runtime techniques to improve large-scale application performance. Menon received an MS in computer science from UIUC. Contact her at [email protected].

OSMAN SAROOD is a software engineer at Yelp. His research interests include performance optimization for parallel and distributed computing under power and energy constraints. Sarood received a PhD in computer science from UIUC. Contact him at [email protected].

EHSAN TOTONI is a research scientist at Intel Labs. His research interests include programming systems for HPC and big-data analytics. Totoni received a PhD in computer science from UIUC. He is a member of IEEE and ACM. Contact him at [email protected].

LAXMIKANT V. KALÉ is a professor in the Department of Computer Science at UIUC. His research interests include aspects of parallel computing, with a focus on enhancing performance and productivity through adaptive runtime systems. Kalé received a PhD in computer science from Stony Brook University. He is a Fellow of IEEE and a member of ACM. Contact him at [email protected].

