The Reliability Wall for Exascale Supercomputing
Xuejun Yang, Member, IEEE, Zhiyuan Wang, Jingling Xue, Senior Member, IEEE, and Yun Zhou
• X. Yang, Z. Wang and Y. Zhou are with the School of Computer, National University of Defense Technology, ChangSha, Hunan, 410073, China. E-mail: {xjyang, zy vxd}@nudt.edu.cn and wang [email protected]
• J. Xue is with the School of Computer Science and Engineering, University of New South Wales, Australia. E-mail: [email protected]

Abstract—Reliability is a key challenge to be understood to turn the vision of exascale supercomputing into reality. Inevitably, large-scale supercomputing systems, especially those at the peta/exascale levels, must tolerate failures by incorporating fault-tolerance mechanisms to improve their reliability and availability. As the benefits of fault-tolerance mechanisms rarely come without associated time and/or capital costs, reliability will limit the scalability of parallel applications. This paper introduces for the first time the concept of the "Reliability Wall" to highlight the significance of achieving scalable performance in peta/exascale supercomputing with fault tolerance. We quantify the effects of reliability on scalability by proposing a reliability speedup, defining quantitatively the reliability wall, giving an existence theorem for the reliability wall, and categorizing a given system according to the time overhead incurred by fault tolerance. We also generalize these results into a general reliability speedup/wall framework by considering not only speedup but also costup. We analyze and extrapolate the existence of the reliability wall using two representative supercomputers, Intrepid and ASCI White, both employing checkpointing for fault tolerance, and have also studied the general reliability wall using Intrepid. These case studies provide insights on how to mitigate reliability-wall effects in system design and through hardware/software optimizations in peta/exascale supercomputing.

Index Terms—fault tolerance, exascale, performance metric, reliability speedup, reliability wall, checkpointing.

1 INTRODUCTION

In these past decades, the performance advances of supercomputers have been remarkable as their sizes are scaled up. Some of the fastest supercomputers crowned in the TOP500 supercomputing list in the past decade are ASCI White in 2000, an IBM SP (Scalable POWERparallel) cluster system consisting of 8,192 cores with an Rpeak of 12 TeraFlops; Blue Gene/L in 2005, an IBM MPP system consisting of 131,072 cores with an Rpeak of 367 TeraFlops; and Tianhe-1A in 2010, currently the world's fastest PetaFlop system, a NUDT YH MPP system consisting of 186,368 cores with an Rpeak of 4.701 PetaFlops. Therefore, scalable parallel computing has been the common approach to achieving high performance.

Despite these great strides made by supercomputing systems, much of scientific computation's potential remains untapped "because many scientific challenges are far too enormous and complex for the computational resources at hand" [1]. Planned exascale supercomputers (capable of an exaflop, 10^3 petaflops, or 10^18 floating point operations per second) in this decade promise to overcome these challenges by a revolution in computing at a greatly accelerated pace [2]. However, exascale supercomputing itself faces a number of major challenges, including technology, architecture, power, reliability, programmability, and usability. This research addresses the reliability challenge when building scalable supercomputing systems, particularly those at the peta/exascale levels.

The reliability of a system refers to the probability that it will function without failure under stated conditions for a specified amount of time [3]. MTTF (mean time to failure) is a key reliability metric. It is generally accepted in the supercomputing community that exascale systems will have more faults than today's supercomputers do [4], [5] and thus must handle a continuous stream of faults/errors/failures [6]. Thus, fault tolerance will play an even more important role in future exascale supercomputing.

While many fault-tolerance mechanisms are available to improve the reliability of a system, their benefits rarely come without time and/or capital costs. For example, checkpointing [7], [8] improves system reliability but incurs additional time for saving checkpoints and performing rollback-recovery (as well as the capital cost associated with the hardware used). According to some projections [9] made based on existing technologies, the time spent on saving a global checkpoint may reach or even exceed the MTTF of a system when the system performance sits between the petascale and exascale levels. Therefore, reliability will inhibit scalability in peta/exascale supercomputing.
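To make this trend concrete, the following minimal Python sketch scales a hypothetical machine up and compares the system MTTF against the time needed to save a global checkpoint. It is not a reproduction of the projections in [9]: the node MTTF, memory per node, I/O bandwidth and the assumption that system MTTF shrinks as 1/P with a fixed aggregate I/O bandwidth are all illustrative assumptions.

    # Illustrative sketch of why checkpoint time can approach or exceed system MTTF
    # as a machine is scaled up. All constants are hypothetical assumptions,
    # not measurements and not the projections of [9].

    def system_mttf_hours(num_nodes, node_mttf_hours=5.0e5):
        # Assume independent node failures: system MTTF shrinks as 1/P.
        return node_mttf_hours / num_nodes

    def checkpoint_time_hours(num_nodes, mem_per_node_gb=16.0, io_bandwidth_gbs=200.0):
        # Assume a full-memory checkpoint written through a shared (centralized)
        # I/O system of fixed aggregate bandwidth, so checkpoint time grows with P.
        total_gb = num_nodes * mem_per_node_gb
        return total_gb / io_bandwidth_gbs / 3600.0

    if __name__ == "__main__":
        for p in (10_000, 50_000, 100_000, 500_000, 1_000_000):
            mttf = system_mttf_hours(p)
            ckpt = checkpoint_time_hours(p)
            print(f"P={p:>9}: system MTTF={mttf:8.2f} h, checkpoint={ckpt:8.2f} h, "
                  f"ratio={ckpt / mttf:6.2f}")

Under these toy assumptions the ratio of checkpoint time to system MTTF crosses 1 somewhere above 10^5 nodes, which is the qualitative behavior the projections above describe.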
In this research, we analyze and quantify the effects of reliability, which manifest as the extra time and dollar cost induced by fault tolerance, on the scalability of peta/exascale supercomputing. Our case studies provide insights on how to mitigate reliability-wall effects when building reliable and scalable parallel applications and parallel systems at the peta/exascale levels.

The main contributions are summarized below:

• We introduce for the first time the concept of the "Reliability Wall", thereby highlighting the significance of achieving scalable performance in peta/exascale supercomputing with fault tolerance.
• For a fault-tolerant system, we present a reliability speedup model, define quantitatively the reliability wall, give an existence theorem for the reliability wall, and categorize a given system according to the time overhead induced by fault tolerance (Section 3).
• We generalize our results to obtain a general reliability speedup model and to quantify the general reliability wall for a system when the dollar cost induced by the fault-tolerance mechanism used is also taken into account, by considering simultaneously both speedup and costup [10] (Section 4).
• We conduct a series of case studies using two real-world supercomputers, Intrepid and ASCI White, both employing checkpointing, the predominant technique for fault tolerance used in practice (Sections 5.1 – 5.3). Intrepid is an IBM Blue Gene/P MPP system while ASCI White is an IBM SP cluster system, with both being representative of the majority of the TOP500 list. As exascale systems do not exist yet, we use these two supercomputing systems to verify, analyze and extrapolate the existence of the (general) reliability wall for a number of system configurations, depending on whether checkpointing is full-memory or incremental-memory and whether I/O is centralized or distributed, as the systems are scaled up in size. Furthermore, we show that checkpoint interval optimization does not affect the existence of the (general) reliability wall.
• We describe briefly how to apply our reliability-wall results to mitigate reliability-wall effects in system design (e.g., in developing fault-tolerance mechanisms and I/O architectures) and through hardware/software optimizations in supercomputing, particularly peta/exascale supercomputing (Section 5.4).

2 RELATED WORK

Supercomputing systems have mainly relied on checkpointing for fault tolerance. As this paper represents the first work on quantifying the Reliability Wall for supercomputing systems based on checkpointing, we review below the prior work on speedup and costup modeling and on checkpointing techniques.

2.1 Speedup and Costup

Speedup has been almost exclusively used for measuring scalability in parallel computing, such as Amdahl's speedup, Gustafson's speedup or memory-bounded speedup [11], [12], [13], [14]. These models are concerned only with the computing capability of a system.

Amdahl proposed a speedup model for a fixed-size problem [11]:

    S_{P,Amdahl} = \frac{P}{1 + f(P - 1)}    (1)

where P is the number of processors and f is the serial part of the program. Interestingly, Amdahl's speedup argues against the usefulness of large-scale parallel computing systems.

To overcome the shortcomings of Amdahl's speedup model, Gustafson [12] presented a scaled speedup for a fixed-time problem, which scales up the workload with the increasing number of processors in such a way as to preserve the execution time:

    S_{P,Gustafson} = f + P(1 - f)    (2)

Later, Wood and Hill considered a costup metric [10]:

    C = \frac{C_P}{C_1}    (3)

where C_P is the dollar cost of a P-processor parallel system and C_1 is the dollar cost of a single-processor system. They argued that parallel computing is cost-effective whenever speedup exceeds costup.

In this work, we have developed our reliability wall framework based on these existing speedup and costup models.
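The three metrics above are easy to evaluate side by side. The sketch below (a minimal illustration, not part of the paper's framework) computes Eqs. (1)–(3) for a few processor counts; the serial fraction f and the dollar-cost model are arbitrary assumptions chosen only to show how Amdahl's speedup saturates near 1/f, how Gustafson's scaled speedup keeps growing, and how the Wood–Hill criterion compares a speedup against the costup.

    # Side-by-side evaluation of the speedup and costup models of Section 2.1.
    # The serial fraction f and the cost model are illustrative assumptions only.

    def amdahl_speedup(p, f):
        # Eq. (1): fixed-size speedup with serial fraction f.
        return p / (1.0 + f * (p - 1))

    def gustafson_speedup(p, f):
        # Eq. (2): fixed-time (scaled) speedup with serial fraction f.
        return f + p * (1.0 - f)

    def costup(cost_p, cost_1):
        # Eq. (3): dollar cost of a P-processor system over a 1-processor system.
        return cost_p / cost_1

    if __name__ == "__main__":
        f = 0.01              # hypothetical serial fraction
        cost_1 = 5_000.0      # hypothetical dollar cost of a single-processor system
        for p in (64, 1024, 16384, 262144):
            # Assumed cost model: slightly sublinear in P (shared packaging/infrastructure).
            cost_p = cost_1 * (0.8 * p + 0.2 * p ** 0.5)
            s_a = amdahl_speedup(p, f)
            s_g = gustafson_speedup(p, f)
            c = costup(cost_p, cost_1)
            # Wood and Hill: parallel computing is cost-effective when speedup > costup.
            print(f"P={p:>6}: Amdahl={s_a:10.1f}  Gustafson={s_g:12.1f}  "
                  f"costup={c:12.1f}  cost-effective(scaled)={s_g > c}")

With these toy numbers, the fixed-size (Amdahl) speedup stays below 1/f = 100 and quickly falls behind the costup, whereas the scaled (Gustafson) speedup can still exceed it, which is consistent with the contrast between the two models drawn above.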
2.2 Fault-Tolerant Techniques

A plethora of fault-tolerance techniques have been proposed to improve the reliability of parallel computing systems. We consider software solutions first and then hardware solutions.

Of the many software techniques developed for fault tolerance, checkpointing has been the most popular in large-scale parallel computing. Checkpointing can occur at the operating system level or the application level. In the first approach, system-level checkpointing [15], [16] only requires the user to specify a checkpoint interval, with no additional programmer effort. On the other hand, application-level checkpointing [17], [18], [19] requires the user to decide the placement and contents of the checkpoints. This second approach requires more programmer effort
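As a concrete illustration of the application-level approach just described, the toy sketch below (not drawn from the systems of [17], [18], [19]; the file name, checkpoint interval and saved variables are all hypothetical) shows a programmer-placed checkpoint: the application itself decides when to save, which data constitute the checkpoint, and how to roll back on restart.

    # Toy application-level checkpointing: the programmer chooses both the placement
    # (every CHECKPOINT_EVERY iterations) and the contents (only `step` and `state`)
    # of the checkpoint. File name and interval are illustrative assumptions.
    import os
    import pickle

    CHECKPOINT_FILE = "app_checkpoint.pkl"
    CHECKPOINT_EVERY = 100

    def save_checkpoint(step, state):
        with open(CHECKPOINT_FILE, "wb") as f:
            pickle.dump({"step": step, "state": state}, f)

    def load_checkpoint():
        if os.path.exists(CHECKPOINT_FILE):
            with open(CHECKPOINT_FILE, "rb") as f:
                data = pickle.load(f)
            return data["step"], data["state"]
        return 0, 0.0   # no checkpoint yet: start from the beginning

    def run(total_steps=1000):
        step, state = load_checkpoint()       # roll back to the last checkpoint, if any
        while step < total_steps:
            state += 1.0                       # stand-in for real computation
            step += 1
            if step % CHECKPOINT_EVERY == 0:   # programmer-chosen checkpoint placement
                save_checkpoint(step, state)
        return state

    if __name__ == "__main__":
        print(run())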