Evaluation of Parallel I/O Performance and Energy with Frequency Scaling on XC30

Suren Byna and Brian Austin

Lawrence Berkeley National Laboratory Berkeley, CA 94720, USA. Email: {sbyna, baustin}@lbl.gov

Abstract—Large-scale simulations produce massive data that needs to be stored on parallel file systems. The simulations use parallel I/O to write data to parallel file systems such as Lustre. Since writing data to disks is often a synchronous operation, the application-level computing workload on the CPU cores is minimal during I/O, and hence we consider whether energy may be saved by keeping the cores in lower power states. To examine this postulation, we have conducted a thorough evaluation of the energy consumption and performance of various I/O kernels from real simulations on a Cray XC30 supercomputer, named Edison, at the National Energy Research Scientific Computing Center (NERSC). To adjust CPU power consumption, we use the frequency scaling capabilities provided by the Cray power management and monitoring tools. In this paper, we present our initial observations that when the I/O load is high enough to saturate the capability of the file system, down-scaling the CPU frequency on compute nodes reduces energy consumption without diminishing I/O performance.

I. INTRODUCTION

Power efficiency and energy efficiency are increasingly prominent concerns for the high-performance computing (HPC) community with upcoming exascale systems. As supercomputing systems approach the exascale regime, their power consumption (and associated infrastructure and operating costs) will grow proportionally unless their efficiency increases. The U.S. Department of Energy has set an ambitious efficiency goal of approximately 20 MW for future exascale systems. Thus, energy efficiency is becoming a first-order constraint for the design of future HPC systems. A recent DOE report states that energy efficiency is one of the most difficult challenges at exascale [4].

While flops per watt may be the top-billing metric for next-generation systems, the performance of the growing class of data-intensive workloads will emphasize I/O performance more than flops. For example, plasma physics and cosmology codes simulate tens of trillions of particles and produce datasets in the range of tens to hundreds of terabytes per time step [31], [6]. The development of in-situ analysis and visualization tools (for example, in VisIt [7] or ParaView [25]) is, in part, a reflection of the anticipated imbalance between I/O and compute resources. File systems must also operate within the system's power budget, so the energy efficiency of I/O operations is critical.

There is a growing body of work studying application energy efficiency (see Section II), but relatively limited investigation of parallel I/O energy efficiency. Ge et al. [15], [18] investigated the impact of application I/O patterns on energy consumption and proposed a middleware to apply dynamic voltage and frequency scaling (DVFS) for compute nodes. This study shows 9% to 28% energy reduction for I/O benchmarks that have contiguous or non-contiguous data reads/writes. However, this study was conducted on a small-scale dedicated cluster with few jobs running on the system.

Supercomputing systems located at facilities such as the National Energy Research Scientific Computing Center (NERSC) are shared by hundreds of users, and the parallel file system is heavily used by a large number of applications. There has been no study to analyze the benefits of DVFS during the I/O phases of applications on large-scale systems with applications that write terabytes of data.

In this paper, we present our initial observations of energy consumption and parallel I/O performance trade-offs with two I/O kernels, scaling them up to 16K cores and writing data up to 4 TB and 12 TB in size, respectively. We extracted the I/O kernels from real applications: plasma physics (vector particle-in-cell, VPIC) and accelerator physics (VORPAL) simulations. Each I/O kernel represents storing data to the file system at one time step; the real simulations typically store data at tens of such time steps.

The primary contributions of this paper are:
• We demonstrate that DVFS has the potential to save energy and improve energy efficiency on large-scale systems for two I/O benchmarks sampled from real scientific applications.
• We show a collection of qualitative power/performance trade-off curves for the two benchmarks. We use "I/O rate per Watt" as a metric of energy efficiency.
• Our observations show that energy savings and energy efficiency differ for the two kernels we tested. The dependency of the savings on I/O patterns needs to be studied further.

The remainder of this paper is organized as follows. Section II summarizes related work. Section III describes, at a high level, the factors that contributed to our experimental design. Section IV details the computational platform, I/O benchmarks, and power measurement techniques used. Our results are presented in Section V. We conclude in Section VI.

II. RELATED WORK

A. Frequency scaling and energy efficiency

A significant number of research efforts have focused on saving power and energy in high-performance computing systems by taking advantage of the frequency scaling capability of modern processors. A non-exhaustive list of these efforts includes [14], [33], [26], [16], [27], [22], [17], [23], [20]. Hsu et al. [22], [23] proved the feasibility of using DVFS (dynamic voltage and frequency scaling) to reduce processor power consumption. Freeh et al. [14] studied energy and execution time tradeoffs in MPI programs. Song et al. [33] proposed an analytical model to predict the performance and energy consumption of applications based on various system and application parameters to help users balance energy use and performance. An Energy-aware Distributed Runtime (EDR) has been proposed by Li et al. [26] to efficiently select data replicas in data center environments. All these efforts explore saving power without impacting the performance of the computation and communication phases of applications. In this paper, we study the impact of power-saving strategies on performance during parallel I/O phases.

Various efforts focused on DVFS for parallel task scheduling on clusters [38], [37], [28], [19], [36]. Yao et al. [38] and Manzak et al. [28] proposed scheduling independent tasks with DVFS on single-processor systems. Wei et al. [37] and Gruian et al. [19] discuss scheduling dependent tasks on multiple processors using DVFS. Wang et al. [36] recently proposed power-aware scheduling based on task clustering for dependent tasks, zeroing communication links to reduce power consumption. All these efforts aimed to reduce power consumption during computation and communication phases, and none of them target the phases that move data to storage media.

B. Storage energy efficiency

Studies of the energy efficiency of storage systems have focused on reducing the power consumption of disks and I/O servers by keeping them in low power states. Multi-speed disks have been proposed to reduce energy dissipation in hard drives. Son et al. [32] proposed software-directed disk power management with energy-aware compilers for multi-speed disks. Several proposals focused on scheduling the I/O requests of applications and reducing the power consumption of the I/O subsystem by placing some disks in low power states. Chou et al. [8], [9] and Kim et al. [24] propose strategies for energy-aware disk scheduling by analyzing the arrival patterns of I/O requests. All the above-mentioned proposals used disk simulators and analytical models for evaluating the power/energy consumption of parallel I/O systems. Moreover, powering down disks or placing them in lower power states is infeasible on large-scale supercomputing systems, where hundreds of users are simultaneously using the I/O subsystem. Because the I/O subsystem is a shared resource, real-time analysis and scheduling of I/O requests would likely increase application overhead and lower energy efficiency. In contrast, this study focuses on the energy efficiency of compute nodes. Typically, compute nodes are dedicated to a single application, and a batch processing service is used to assign compute nodes to an application.

Reducing parallel system energy consumption specifically during parallel I/O phases has been studied by Ge et al. In [15], these authors investigate the impact of an application's I/O access patterns, parallel file system deployment, and processor frequency scaling on energy efficiency; this study shows the feasibility of improving energy efficiency through processor frequency scaling. Their SERA-IO middleware package has been shown to save energy without affecting performance by intercepting MPI-IO calls and interleaving DVFS commands [18]. These two studies were performed on a small-scale system in a dedicated environment. In this study, we investigate the impact of frequency scaling on the compute nodes of a petascale Cray XC30 supercomputer, where the I/O subsystem is heavily shared by hundreds of simultaneous applications.

C. Using Cray PMDB

Measuring power and energy consumption on Cray XC30 systems has become possible with the Cray Power Management Database (PMDB) [29]. First experiences with these measurements were reported by Fourestey et al. [13] and Austin et al. [2]. While Fourestey et al. [13] validated the measurements of PMDB using various benchmarks, the latter study focused on evaluating first-order energy and performance models using various compute-intensive microbenchmarks. In this paper, we study the power, energy, and performance impacts of I/O kernels retrieved from two highly scalable simulation codes from plasma physics and accelerator physics.

III. MOTIVATION AND APPROACH

The parallel I/O software stack contains multiple layers of software. As shown in Figure 1, scientific applications often use high-level I/O libraries, such as HDF5 [34] or NetCDF [35], to read or write data arrays to and from parallel file systems. These libraries internally use POSIX-IO and MPI-IO [10] middleware to perform I/O. I/O optimization layers, such as the I/O Forwarding Layer [1] and the Scientific Data Services framework [11], perform optimizations such as data organization or redirection of I/O calls to different views of data to take advantage of the parallelism available in file systems. All these layers are interdependent in achieving efficient I/O performance.

Each of the parallel I/O software layers offers tunable parameters. For example, in the HDF5 layer, selecting chunking sizes to write multi-dimensional arrays improves future reads of the chunked data. MPI-IO offers two-phase I/O parameters that applications can use to reduce the number of readers or writers that interact directly with the file system. File systems, such as Lustre, allow users to set the number of storage targets to write data to and the size of the contiguous chunks of data on each storage target. Optimizing I/O by selecting the right configuration of the above-mentioned software layers improves I/O performance [3]. In our previous research, we have extensively studied tuning these parameters automatically. (An illustrative sketch of setting such parameters appears at the end of this section.)

Optimized configurations provide improved I/O performance that, in turn, results in less energy consumption. Since many scientific simulations perform computation and I/O in alternating phases (shown in Figure 2), we postulate that energy consumption can be further reduced by maintaining higher-performance power states during computation phases and lowering CPU frequencies on compute nodes during application I/O phases. We are encouraged by the observation by Ge et al. that energy consumption can be reduced by up to 28% with DVFS for various I/O patterns on a small-scale cluster [15], [18].

In this paper, we investigate the above-mentioned hypothesis on a large-scale supercomputer, where the interconnect network and the I/O subsystem resources are shared by hundreds of concurrent jobs. The compute nodes running our jobs are not shared by other users. Hence, we reduce the power consumption of CPU cores by setting the power state of a node. We note that, other than the compute nodes, all the other resources, i.e., the network and the I/O subsystem, are unaffected by setting the power state of the nodes. This ensures other jobs running on the system are unaffected. We discuss the details of the system, the I/O kernels we used in this study, and the I/O time and power/energy measurement methods in the following section.
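To make the tunable parameters discussed above concrete, the following sketch shows one way an MPI application might request collective-buffering and Lustre striping settings through MPI-IO info hints before opening a shared file. The hint names (cb_nodes, romio_cb_write, striping_factor, striping_unit) are standard ROMIO/Cray MPI-IO hints, but the specific values and the file name are illustrative assumptions, not the settings used in this paper.

/* Illustrative sketch: passing MPI-IO and Lustre tuning hints.
 * The hint values below are examples only, not the paper's settings. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Info_create(&info);

    /* Two-phase (collective buffering) I/O: limit the number of
       aggregator processes that talk to the file system directly. */
    MPI_Info_set(info, "cb_nodes", "144");
    MPI_Info_set(info, "romio_cb_write", "enable");

    /* Lustre striping: number of OSTs and stripe (chunk) size per OST. */
    MPI_Info_set(info, "striping_factor", "144");
    MPI_Info_set(info, "striping_unit", "33554432");   /* 32 MB */

    MPI_File_open(MPI_COMM_WORLD, "output.h5",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... collective writes would go here ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}

High-level libraries such as HDF5/H5Part typically pass an info object like this down to MPI-IO on the application's behalf, which is how the Lustre optimization mode described in Section IV works.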

Fig. 1. Various layers of software in the contemporary parallel I/O stack

IV. EXPERIMENTAL SETUP

A. Platform

We performed our experiments using 'Edison', a Cray XC30 located at NERSC. Edison's compute partition consists of 5576 compute nodes. Within each compute node are two 2.4 GHz 12-core Ivy Bridge processors and 64 GB of 1866 MHz DDR3 DRAM (Edison's memory frequency was recently upgraded from 1600 MHz). Compute nodes communicate through a Cray Aries interconnect [12], which supports injection rates of 10 GB/s bi-directional bandwidth per node. The first two tiers of the Aries dragonfly topology have ample bandwidth to support the full injection rate. The rank-3 network provides 11 GB/s global bandwidth.

Edison has a total of 7.4 PB of "scratch" storage provided by a Cray Sonexion 1600 Lustre appliance. The aggregate scratch space is partitioned among three scratch file systems. Our experiments are performed on the "scratch3" file system. Scratch3 has 36 object store servers (OSSs) and 144 object store targets (OSTs), and provides 3.2 PB of capacity with 72 GB/s of I/O bandwidth. For all of our experiments, files are striped across all 144 OSTs.

B. I/O Kernels

Fig. 2. Banded computation and I/O phases present in various parallel scientific simulations, where all MPI processes of a simulation perform computation and collectively perform I/O, i.e., the following computation phase starts after all the processes finish I/O.

We use two parallel I/O kernels in this evaluation: VPIC-IO and VORPAL-IO. These kernels are derived from two applications, Vector Particle-In-Cell (VPIC) [5] and VORPAL [30]. These I/O kernels represent two distinct I/O write motifs with different data sizes.

1) VPIC-IO: VPIC is a highly optimized and scalable particle physics simulation developed by Los Alamos National Lab [5]. VPIC-IO uses H5Part [21] to create a file, write eight 1D array variables, and close the file. The H5Part API provides a simple veneer for issuing HDF5 calls corresponding to a time-varying, multi-variate particle data model. We extracted all the H5Part function calls of the VPIC code to form the VPIC-IO kernel. The particle data written by the kernel is random data of float data type. The I/O motif of VPIC-IO is a 1D particle array of a given number of particles, where each particle has eight variables. For the weak-scaling tests, the I/O kernel writes 8 million particles per MPI process, and the total size of the file increases as the number of MPI processes increases. For the strong-scaling tests, we vary the number of particles per MPI process based on the total file size.

2) VORPAL-IO: This I/O kernel is extracted from VORPAL, a computational plasma physics framework developed by Tech-X [30] that simulates the dynamics of electromagnetic systems, plasmas, and rarefied as well as dense gases. This benchmark uses H5Block [21] to write non-uniform chunks of 3D data per MPI process. The kernel takes the 3D block dimensions (x, y, and z) and the number of components as input. In the weak-scaling experiments with this kernel, we used 3D blocks of 100x100x60 with different numbers of processors, and the data is written for 20 time steps.
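For readers unfamiliar with H5Part, the following minimal sketch illustrates the kind of write pattern VPIC-IO performs: collectively open a shared HDF5 file, declare the number of particles per process, write 1D float variables, and close the file. The function names come from the H5Part/H5hut interface, while the file name, the use of only two variables, and the lack of error checking are illustrative simplifications; the actual kernel writes eight variables of 8 million particles per process.

/* Minimal sketch of a VPIC-IO-style H5Part write phase (illustrative only). */
#include <stdlib.h>
#include <mpi.h>
#include "H5Part.h"

#define NUM_PARTICLES (8 * 1024 * 1024)   /* 8M particles per process */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Collectively create one shared HDF5 file. */
    H5PartFile *file = H5PartOpenFileParallel("particles.h5",
                                              H5PART_WRITE, MPI_COMM_WORLD);

    h5part_float32_t *x  = malloc(NUM_PARTICLES * sizeof(*x));
    h5part_float32_t *px = malloc(NUM_PARTICLES * sizeof(*px));
    for (long i = 0; i < NUM_PARTICLES; i++) {   /* random data, as in the kernel */
        x[i]  = (h5part_float32_t)rand() / RAND_MAX;
        px[i] = (h5part_float32_t)rand() / RAND_MAX;
    }

    H5PartSetStep(file, 0);                      /* one time step */
    H5PartSetNumParticles(file, NUM_PARTICLES);  /* particles per process */
    H5PartWriteDataFloat32(file, "x",  x);       /* one 1D variable ...    */
    H5PartWriteDataFloat32(file, "px", px);      /* ... of eight in VPIC-IO */

    H5PartCloseFile(file);
    free(x);
    free(px);
    MPI_Finalize();
    return 0;
}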
Measurements VORPAL/2K# 1) I/O time measurements: The VPIC and VORPAL-I/O 10.00# VORPAL/4K# 5.00# VORPAL/8K# kernels initiate data structures of corresponding simulations VORPAL/16K# 0.00# with random data and write data to file system. Both kernels 1.2# 1.4# 1.6# 1.8# 2# 2.2# 2.4# Default2.6# 2.8# use MPI-IO in collective I/O mode, where the H5Part/H5Block P/State$(Frequency$in$GHz)$ uses Lustre optimizations. The H5Part/H5Block in the Lustre optimization mode sets the number of MPI-IO aggregators Fig. 3. Weak Scaling of VPIC-IO and VORPAL-IO - I/O Rate in GB/s equal to a multiple of the number of Lustre OSTs. As mentioned earlier, we have used all the 144 OSTs available V. RESULTS on the Lustre file system we used and have set the stripe size as 32 MB for all the experiments. We have measured the I/O In this section, we present the I/O rate (in GB/s), power time by using gettimeofday() calls before opening the file and consumption, energy consumption, and energy efficiency (in after closing the file. This interval includes the time to open a MB/s/Watt) of VPIC-IO and VORPAL-IO kernels with weak HDF5 file in write mode, to write metadata of HDF5 datasets, scaling and of VPIC-IO kernel with strong scaling. For weak to write the data to file system, and to close the file. As shown scaling tests, the data written by each kernel increases pro- in Figure 2, we select the maximum I/O time of all the MPI portionately with the number of cores used. We increased processes assuming all the processes wait until the I/O phase the number of cores from 2048 (2K) to 16384 (16K). Since is finished. the measured I/O time is the maximum time of all the MPI 2) Energy and Power measurements: Edison compute processes, the I/O rate may be less than the possible I/O nodes include a counter that records the total energy used by bandwidth of the system if an MPI process takes more time to the node. The Cray power monitoring utility makes the energy write data than all the other processes. We use the ratio of the counter available to the user through a memory mapped file. I/O rate and the power consumed as the energy efficiency (in We have written a small library to collect the energy counter MB/s per Watt). In previous studies of I/O energy [15], [18], data from all nodes in a job. One library function is used to the same metric was used for measuring energy efficiency. For sample and record the initial time and energy counters when strong scaling tests, we have studied the VPIC-IO kernel in the code enters a region of interest. A second function samples writing a fixed amount of data at all concurrencies. As the data the final values and adds these elapsed time and energy to the size does not scale evenly for the VORPAL-IO kernel, we have node’s accumulated counters. At the end of the job, the library limited the strong scaling study to the VPIC-IO benchmark. aggregates the single-node measurements, and prints the total We have varied the frequency of CPUs from 1.2 to 2.4 GHz, energy, walltime and average power to standard output. with 0.2 GHz increments and compared with the default setting The energy measurement reported by our library does not on Edison. The default state dynamically adjusts the CPU include contributions from the interconnect or the file system. frequency between 2.4 and 3.2 GHz, depending on the thermal These are important components of the total energy of our budget of the chip. 
The Cray PM tools give users control of the frequency, and (indirectly) the power, used by the CPUs on compute nodes. At run time, users may set the CPU frequency for a single job using the aprun --p-state option to select a frequency between 1.2 and 2.4 GHz. For example, to run an MPI job with 2048 cores at a --p-state of 1.8 GHz, we use the following command, where exec and args are an MPI application executable name and the arguments of the application:

aprun --p-state=1800000 -n 2048 exec args

V. RESULTS

In this section, we present the I/O rate (in GB/s), power consumption, energy consumption, and energy efficiency (in MB/s/Watt) of the VPIC-IO and VORPAL-IO kernels with weak scaling, and of the VPIC-IO kernel with strong scaling. For the weak-scaling tests, the data written by each kernel increases proportionately with the number of cores used. We increased the number of cores from 2048 (2K) to 16384 (16K). Since the measured I/O time is the maximum time of all the MPI processes, the I/O rate may be less than the possible I/O bandwidth of the system if one MPI process takes more time to write data than all the other processes. We use the ratio of the I/O rate and the power consumed as the energy efficiency (in MB/s per Watt). Previous studies of I/O energy [15], [18] used the same metric for measuring energy efficiency. For the strong-scaling tests, we have studied the VPIC-IO kernel writing a fixed amount of data at all concurrencies. As the data size does not scale evenly for the VORPAL-IO kernel, we have limited the strong-scaling study to the VPIC-IO benchmark.

We have varied the frequency of the CPUs from 1.2 to 2.4 GHz in 0.2 GHz increments and compared with the default setting on Edison. The default state dynamically adjusts the CPU frequency between 2.4 and 3.2 GHz, depending on the thermal budget of the chip.
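In other words, the efficiency metric reduces to bytes moved per Joule. Writing D for the total data written, t for the measured (maximum) I/O time, E for the measured energy, and P = E/t for the average power (notation introduced here only for clarity, not taken from the paper), the metric can be expressed as:

\[
\mathrm{Efficiency} \;=\; \frac{R_{\mathrm{I/O}}}{P} \;=\; \frac{D/t}{E/t} \;=\; \frac{D}{E}
\qquad \left[\frac{\mathrm{MB/s}}{\mathrm{W}} = \frac{\mathrm{MB}}{\mathrm{J}}\right]
\]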

180" VPIC-2K" Thousands& 160" VPIC-4K"

& 140" VPIC-8K" VPIC-16K" 120" VORPAL-2K" 100" VORPAL-4K" 80" VORPAL-8K" VORPAL-16K" 60" Power&(in&x1000&Wa/s) 40" 20" Fig. 7. Percent improvement of energy and energy efficiency for the I/O 0" kernels 1.2" 1.4" 1.6" 1.8" 2" 2.2" 2.4" Default2.6" 2.8" P6State&(Frequency&in&GHz)& independent of concurrency or I/O workload. In Figure 5, Fig. 4. Weak Scaling of VPIC-IO and VORPAL-IO - Power consumption the total energy per node for the VPIC I/O benchmark is generally independent of CPU frequency, but the turbo-boost (enabled by the default p-state) increases energy use. Vorpal- IO uses more energy per node at higher concurrencies, and has minimum around 2.4 GHz that become more pronounced when more cores are used. Fig. 6 shows the energy efficiency (in MB/s/Watt) combin- ing the energy consumption and the I/O rate for the kernels. Similar to energy consumption, the efficiency was highest when the frequency was set at 2.2 GHz for five out of the eight experiments (VPIC-2K, VPIC-4K, VORPAL-2K, VORPAL- 8K, VORPAL-16K). The highest energy efficiency for VPIC- 8K, and VPIC-16K, VORPAL-4K were observed at 1.4 GHz, at 1.6GHz, and at 1.8GHz, respectively. In Fig. 7, we show the energy savings for the kernels at different concurrencies compared to the default energy. The energy savings vary between 7.6% and 33% for these Fig. 5. Weak Scaling of VPIC-IO and VORPAL-IO - Energy consumption kernels. Fig. 7 also shows the energy efficiency improvement per node (in percent), where we can see that it ranges between 8% and 52%. Further study is needed to reduce the I/O rate degradation with frequency reduction to improve the energy efficiency significantly at lower frequencies.

Fig. 6. Weak Scaling of VPIC-IO and VORPAL-IO - Efficiency (I/O rate per Watt [MB/s/Watt])

B. Strong Scaling - VPIC-IO

Figures 8, 9, and 10 show the I/O rate, energy consumption, and energy efficiency of the VPIC-IO kernel for a fixed problem size at different concurrencies and variable CPU frequencies. The power-per-node measurements matched those of Fig. 4. The I/O rates shown in Fig. 8 increase between the 2K and 4K core counts, but not between 4K and 8K. This suggests that the performance of the I/O subsystem can be saturated with fewer than 4K cores. However, the I/O rate increases with CPU frequency at all concurrencies tested. This is reflected by the similar energy-per-core measurements with 8K and 4K cores: both have a broad, shallow energy minimum (about 10% below the default p-state) around 2.2 GHz. With 2K cores, the total energy use is higher, and the energy minimum is more pronounced (about 20% below the default p-state) and shifts to 2.4 GHz.

In Fig. 10, we observe the highest energy efficiency at 2.4 GHz for VPIC-2K and at 2.2 GHz for VPIC-4K, but the energy efficiency of the VPIC-8K experiment does not depend strongly on CPU frequency. The energy efficiency improvement is between 12% and 25%. Energy efficiency decreases with concurrency because the I/O rate approaches its maximum between 2K and 4K cores, while power use increases in proportion to node count.

Fig. 8. Strong Scaling of VPIC-IO - I/O Rate in GB/s

Fig. 9. Strong Scaling of VPIC-IO - Energy consumption per node

Fig. 10. Strong Scaling of VPIC-IO - Efficiency (I/O rate per Watt [MB/s/Watt])

VI. CONCLUSIONS

In this study, we presented our initial evaluations of compute node energy consumption with frequency scaling on a Cray XC30 and the corresponding parallel I/O performance. We studied the scalability of two I/O kernels extracted from plasma physics and accelerator physics simulations. We have shown that performance degrades significantly as the frequency is reduced. However, energy consumption is minimized neither at the lowest power state nor at the highest. Among our observations with weak scaling of the two I/O kernels, each at four concurrencies, the lowest energy was consumed with a CPU frequency of 2.2 GHz compared to the default frequency. While the achieved I/O rate was highest with the default CPU frequency for all the observations, the best energy efficiency, i.e., I/O rate per Watt, was achieved at 2.2 GHz in 5 out of the 8 observations. For the other observations, the lowest energy and best energy efficiency were achieved at 1.8 GHz, although the values at 2.2 GHz were approximately similar. The energy savings at the best energy efficiency are between 7.6% and 33%, and the energy efficiency benefit is between 8.2% and 52%.

While this is the first study of energy and parallel I/O energy efficiency on a massive-scale supercomputer, we observed significant variance in I/O rates and energy consumption. As the I/O subsystem is shared by numerous users, this variation will be notable. Dynamic frequency scaling to accommodate those variations has the potential to save further energy. Understanding these behaviors and applying dynamic scaling requires further investigation. Another area to investigate is setting the power states only for the cores, instead of setting them for an entire compute node. We suspect that moving the memory controller and DRAM to a lower power state also degrades I/O performance substantially. We will investigate the impact of fine-grained frequency scaling to avoid I/O performance degradation. We will also explore MPI-IO settings, such as independent I/O mode and setting the aggregators in collective I/O based on topology.

ACKNOWLEDGMENT

This work is supported by the Director, Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. This research used resources of the National Energy Research Scientific Computing Center.

REFERENCES

[1] N. Ali, P. Carns, K. Iskra, D. Kimpe, S. Lang, R. Latham, R. Ross, L. Ward, and P. Sadayappan. Scalable I/O forwarding framework for high-performance computing systems. In Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on, pages 1-10, Aug 2009.
[2] B. Austin and N. J. Wright. Measurement and interpretation of microbenchmark and application energy use on the Cray XC30. In Proceedings of the 2nd International Workshop on Energy Efficient Supercomputing, E2SC '14, pages 51-59, 2014.
[3] B. Behzad, H. V. T. Luu, et al. Taming Parallel I/O Complexity with Auto-tuning. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, pages 68:1-68:12, New York, NY, USA, 2013. ACM.
[4] K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, P. Kogge, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick. ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems, Peter Kogge, Editor and Study Lead. http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf, 2008.
[5] K. J. Bowers, B. J. Albright, L. Yin, B. Bergen, and T. J. T. Kwan. Ultrahigh performance three-dimensional electromagnetic relativistic kinetic plasma simulation. Physics of Plasmas, 15(5):7, 2008.
[6] S. Breitenfeld, K. Chadalavada, R. Sisneros, S. Byna, Q. Koziol, N. Fortner, Prabhat, and V. Vishwanath. Recent Progress in Tuning Performance of Large-scale I/O with Parallel HDF5. In Proceedings of the 9th Parallel Data Storage Workshop, PDSW '14, 2014.
[7] H. Childs, E. Brugger, B. Whitlock, J. Meredith, S. Ahern, et al. VisIt: An End-User Tool for Visualizing and Analyzing Very Large Data. 2013.
[8] J. Chou, J. Kim, and D. Rotem. Energy-aware scheduling in disk storage systems. In 2011 International Conference on Distributed Computing Systems, ICDCS 2011, Minneapolis, Minnesota, USA, June 20-24, 2011, pages 423-433, 2011.
[9] J. Chou, T.-H. Lai, J. Kim, and D. Rotem. Exploiting replication for energy-aware scheduling in disk storage systems. Parallel and Distributed Systems, IEEE Transactions on, PP(99):1-1, 2014.
[10] Cray Inc. Getting Started with MPI I/O. http://docs.cray.com/books/S-2490-40/S-2490-40.pdf.
[11] B. Dong, S. Byna, and K. Wu. SDS: A Framework for Scientific Data Services. In Proceedings of the 8th Parallel Data Storage Workshop, PDSW '13, pages 27-32, New York, NY, USA, 2013. ACM.
[12] G. Faanes, A. Bataineh, D. Roweth, T. Court, E. Froese, B. Alverson, T. Johnson, J. Kopnick, M. Higgins, and J. Reinhard. Cray Cascade: A Scalable HPC System Based on a Dragonfly Network. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, pages 103:1-103:9, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press.
[13] G. Fourestey, B. Cumming, L. Gilly, and T. C. Schulthess. First Experiences With Validating and Using the Cray Power Management Database Tool. CoRR, abs/1408.2657, 2014.
[14] V. W. Freeh, F. Pan, N. Kappiah, D. Lowenthal, and R. Springer. Exploring the Energy-Time Tradeoff in MPI Programs on a Power-Scalable Cluster. In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, pages 4a-4a, April 2005.
[15] R. Ge. Evaluating parallel I/O energy efficiency. In Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on & Int'l Conference on Cyber, Physical and Social Computing (CPSCom), pages 213-220, Dec 2010.
[16] R. Ge, X. Feng, and K. Cameron. Performance-constrained Distributed DVS Scheduling for Scientific Applications on Power-aware Clusters. In Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference, Nov 2005.
[17] R. Ge, X. Feng, W. chun Feng, and K. Cameron. CPU MISER: A Performance-Directed, Run-Time System for Power-Aware Clusters. In Parallel Processing, 2007. ICPP 2007. International Conference on, Sept 2007.
[18] R. Ge, X. Feng, and X.-H. Sun. SERA-IO: Integrating Energy Consciousness into Parallel I/O Middleware. In Cluster, Cloud and Grid Computing (CCGrid), 2012 12th IEEE/ACM International Symposium on, pages 204-211, May 2012.
[19] F. Gruian and K. Kuchcinski. LEneS: task scheduling for low-energy systems using variable supply voltage processors. In Design Automation Conference, 2001. Proceedings of the ASP-DAC 2001. Asia and South Pacific, pages 449-455, 2001.
[20] S. Herbert, S. Garg, and D. Marculescu. Exploiting process variability in voltage/frequency control. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 20(8):1392-1404, Aug 2012.
[21] M. Howison, A. Adelmann, E. W. Bethel, A. Gsell, B. Oswald, and Prabhat. H5hut: A High-Performance I/O Library for Particle-Based Simulations. In Proceedings of 2010 Workshop on Interfaces and Abstractions for Scientific Data Storage (IASDS10), Heraklion, Crete, Greece, Sept. 2010. LBNL-4021E.
[22] C.-H. Hsu and W. chun Feng. A feasibility analysis of power awareness in commodity-based high-performance clusters. In Cluster Computing, 2005. IEEE International, pages 1-10, Sept 2005.
[23] C.-H. Hsu and W. chun Feng. A power-aware run-time system for high-performance computing. In Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference, Nov 2005.
[24] J. Kim, J. Chou, and D. Rotem. iPACS: Power-aware covering sets for energy proportionality and performance in data parallel computing clusters. Journal of Parallel and Distributed Computing, 74(1):1762-1774, 2014.
[25] Kitware. ParaView. http://www.paraview.org/.
[26] B. Li, S. Song, I. Bezakova, and K. Cameron. EDR: An energy-aware runtime load distribution system for data-intensive applications in the cloud. In Cluster Computing (CLUSTER), 2013 IEEE International Conference on, pages 1-8, Sept 2013.
[27] J. Li and J. Martinez. Dynamic power-performance adaptation of parallel computation on chip multiprocessors. In High-Performance Computer Architecture, 2006. The Twelfth International Symposium on, pages 77-87, Feb 2006.
[28] A. Manzak and C. Chakrabarti. Variable voltage task scheduling algorithms for minimizing energy. In Low Power Electronics and Design, International Symposium on, 2001, pages 279-282, 2001.
[29] S. Martin and M. Kappel. Cray XC30 Power Monitoring and Management. In Cray User Group meeting proceedings, 2014.
[30] C. Nieter and J. R. Cary. VORPAL: a versatile plasma simulation code. Journal of Computational Physics, 196:448-472, 2004.
[31] S. W. Skillman, M. S. Warren, M. J. Turk, R. H. Wechsler, D. E. Holz, and P. M. Sutter. Dark Sky Simulations: Early Data Release. ArXiv e-prints, July 2014.
[32] S. Son, M. Kandemir, and A. Choudhary. Software-directed disk power management for scientific applications. In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, April 2005.
[33] S. Song, C.-Y. Su, R. Ge, A. Vishnu, and K. Cameron. Iso-Energy-Efficiency: An Approach to Power-Constrained Parallel Computation. In Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 128-139, May 2011.
[34] The HDF Group. HDF5 user guide. http://hdf.ncsa.uiuc.edu/HDF5/doc/H5.user.html, 2010.
[35] Unidata. The NetCDF users' guide. http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/, 2010.
[36] L. Wang, S. U. Khan, D. Chen, J. Kołodziej, R. Ranjan, C.-Z. Xu, and A. Zomaya. Energy-aware parallel task scheduling in a cluster. Future Generation Computer Systems, 29(7):1661-1670, Sept. 2013.
[37] G.-Y. Wei, J. Kim, D. Liu, S. Sidiropoulos, and M. Horowitz. A variable-frequency parallel I/O interface with adaptive power-supply regulation. Solid-State Circuits, IEEE Journal of, 35(11):1600-1610, Nov 2000.
[38] F. Yao, A. Demers, and S. Shenker. A scheduling model for reduced CPU energy. In Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on, pages 374-382, Oct 1995.