
IEEE TRANSACTIONS ON COMPUTERS, VOL. *, NO. *, * *

The Reliability Wall for Exascale Supercomputing

Xuejun Yang, Member, IEEE, Zhiyuan Wang, Jingling Xue, Senior Member, IEEE, and Yun Zhou

Abstract—Reliability is a key challenge to be understood to turn the vision of exascale supercomputing into reality. Inevitably, large-scale supercomputing systems, especially those at the peta/exascale levels, must tolerate failures, by incorporating fault-tolerance mechanisms to improve their reliability and availability. As the benefits of fault-tolerance mechanisms rarely come without associated time and/or capital costs, reliability will limit the scalability of parallel applications. This paper introduces for the first time the concept of "Reliability Wall" to highlight the significance of achieving scalable performance in peta/exascale supercomputing with fault tolerance. We quantify the effects of reliability on scalability, by proposing a reliability speedup, defining quantitatively the reliability wall, giving an existence theorem for the reliability wall, and categorizing a given system according to the time overhead incurred by fault tolerance. We also generalize these results into a general reliability speedup/wall framework by considering not only speedup but also costup. We analyze and extrapolate the existence of the reliability wall using two representative supercomputers, Intrepid and ASCI White, both employing checkpointing for fault tolerance, and have also studied the general reliability wall using Intrepid. These case studies provide insights on how to mitigate reliability-wall effects in system design and through hardware/software optimizations in peta/exascale supercomputing.

Index Terms—fault tolerance, exascale, performance metric, reliability speedup, reliability wall, checkpointing.

1 INTRODUCTION

IN these past decades, the performance advances of supercomputers have been remarkable as their sizes are scaled up. Some fastest supercomputers crowned in the TOP500 supercomputing list in the past decade are ASCI White in 2000, an IBM SP (Scalable POWERparallel) cluster system consisting of 8,192 cores with an Rpeak of 12 TeraFlops, Blue Gene/L in 2005, an IBM MPP system consisting of 131,072 cores with an Rpeak of 367 TeraFlops, and Tianhe-1A in 2010, currently the world's fastest PetaFlop system, a NUDT YH MPP system consisting of 186,368 cores with an Rpeak of 4.701 PetaFlops. Therefore, scalable parallelism has been the common approach to achieving high performance.

Despite these great strides made by supercomputing systems, much of scientific computation's potential remains untapped "because many scientific challenges are far too enormous and complex for the computational resources at hand" [1]. Planned exascale supercomputers (capable of an exaflop, 10^3 petaflops, or 10^18 floating point operations per second) in this decade promise to overcome these challenges by a revolution in computing at a greatly accelerated pace [2]. However, exascale supercomputing itself faces a number of major challenges, including technology, architecture, power, reliability, programmability, and usability. This research addresses the reliability challenge when building scalable supercomputing systems, particularly those at the peta/exascale levels.

The reliability of a system is referred to as the probability that it will function without failure under stated conditions for a specified amount of time [3]. MTTF (mean time to failure) is a key reliability metric. It is generally accepted in the supercomputing community that exascale systems will have more faults than today's supercomputers do [4], [5] and thus must handle a continuous stream of faults/errors/failures [6]. Thus, fault tolerance will play a more important role in future exascale supercomputing.

While many fault-tolerance mechanisms are available to improve the reliability of a system, their benefits rarely come without time and/or capital costs. For example, checkpointing [7], [8] improves system reliability but incurs additional time on saving checkpoints and performing rollback-recovery (as well as the capital cost associated with the hardware used). According to some projections [9] made based on existing technologies, the time spent on saving a global checkpoint may reach or even exceed the MTTF of a system when the system performance sits between the petascale and exascale levels. Therefore, reliability will inhibit scalability in peta/exascale supercomputing.

In this research, we analyze and quantify the effects of reliability, which is realized at the extra time and dollar cost induced by fault tolerance, on the scalability of peta/exascale supercomputing. Our case studies provide insights on how to mitigate reliability-wall effects when building reliable and scalable parallel applications and parallel systems at the peta/exascale levels.

The main contributions are summarized below:

• We introduce for the first time the concept of "Reliability Wall", thereby highlighting the significance of achieving scalable performance in peta/exascale supercomputing with fault tolerance.
• For a fault-tolerant system, we present a reliability speedup model, define quantitatively the reliability wall, give an existence theorem of the reliability wall, and categorize a given system according to the time overhead induced by fault tolerance (Section 3).
• We generalize our results to obtain a general reliability speedup model and to quantify the general reliability wall for a system when the dollar cost induced by the fault-tolerance mechanism used is also taken into account, by considering simultaneously both speedup and costup [10] (Section 4).
• We conduct a series of case studies using two real-world supercomputers, Intrepid and ASCI White, both employing checkpointing, the predominant technique for fault tolerance used in practice (Sections 5.1 – 5.3). Intrepid is an IBM Blue Gene/P MPP system while ASCI White is an IBM SP cluster system, with both being representatives of the majority of the TOP500 list. As exascale systems do not exist yet, we use these two supercomputing systems to verify, analyze and extrapolate the existence of the (general) reliability wall for a number of system configurations, depending on whether checkpointing is full-memory or incremental-memory and whether I/O is centralized or distributed, as the systems are scaled up in size. Furthermore, we show that checkpoint interval optimization does not affect the existence of the (general) reliability wall.
• We describe briefly how to apply our reliability-wall results to mitigate reliability-wall effects in system design (e.g., on developing fault tolerance mechanisms and I/O architectures) and through hardware/software optimizations in supercomputing, particularly peta/exascale supercomputing (Section 5.4).

• X. Yang, Z. Wang and Y. Zhou are with the School of Computer, National University of Defense Technology, ChangSha, Hunan, 410073, China. E-mail: {xjyang, zy vxd}@nudt.edu.cn and wang [email protected]
• J. Xue is with the School of Computer Science and Engineering, University of New South Wales, Australia. E-mail: [email protected]

2 RELATED WORK

Supercomputing systems have mainly relied on checkpointing for fault tolerance. As this paper represents the first work on quantifying the Reliability Wall for supercomputing systems based on a new reliability speedup model, we review below the prior work on speedup and costup modeling and on checkpointing techniques.

2.1 Speedup and Costup

Speedup has been almost exclusively used for measuring scalability in parallel computing, such as Amdahl's speedup, Gustafson's speedup or memory-bounded speedup [11], [12], [13], [14]. These models are concerned only with the computing capability of a system.

Amdahl proposed a speedup model for a fixed-size problem [11]:

  S_P,Amdahl = P / (1 + f (P - 1))        (1)

where P is the number of processors and f is the serial part of the program. Interestingly, Amdahl's speedup argues against the usefulness of large-scale parallel computing systems.

To overcome the shortcomings of Amdahl's speedup model, Gustafson [12] presented a scaled speedup for a fixed-time problem, which scales up the workload with the increasing number of processors in such a way as to preserve the execution time:

  S_P,Gustafson = f + P (1 - f)        (2)

Later, Wood and Hill considered a costup metric [10]:

  C = C_P / C_1        (3)

where C_P is the dollar cost of a P-processor parallel system and C_1 is the dollar cost of a single-processor system. They argued that parallel computing is cost-effective whenever speedup exceeds costup.

In this work, we have developed our reliability wall framework based on these existing speedup and costup models.

2.2 Fault-Tolerant Techniques

A plethora of fault-tolerance techniques have been proposed to improve the reliability of parallel computing systems. We consider software solutions first and then hardware solutions.

Of the many software techniques developed for fault tolerance, checkpointing has been the most popular in large-scale parallel computing. Checkpointing can occur at the system level or the application level. In the first approach, system-level checkpointing [15], [16] only requires the user to specify a checkpoint interval, with no additional programmer effort. On the other hand, application-level checkpointing [17], [18], [19] requires the user to decide the placement and contents of the checkpoints. This second approach requires more programmer effort but provides the user an opportunity to place checkpoints in such a way that checkpoint contents and

checkpointing time are reduced. In order to combine the best of these two approaches, compiler-assisted checkpointing [20], [21], [22], once coupled with application-level checkpointing, can provide a certain degree of transparency to the user. Our case studies focus on system-level checkpointing techniques. By checkpointing, we mean system-level checkpointing from now on unless indicated otherwise.

Saving the states of thousands of processes at a checkpoint in a system can result in a heavy demand on the underlying I/O and network resources, rendering this part of the system a performance bottleneck. To alleviate this problem, many checkpointing optimizations have been considered. Instead of full-memory checkpointing, which involves saving the entire memory context of a process at a checkpoint, incremental-memory checkpointing opts to save only the dirty pages, i.e., the ones that have been modified since the last checkpoint [23]. Thus, incremental-memory checkpointing can reduce the amount of memory contexts that need to be saved, especially for large systems. In our case studies discussed in Section 5, both types of checkpointing techniques are considered.

In diskless checkpointing [24], [25], each processor makes local checkpoints in local memory (or local disk), eliminating the performance bottleneck of traditional checkpointing. Some form of redundancy is maintained in order to guarantee checkpoint availability in case of failures. However, diskless checkpointing doubles the memory occupation and may not be well suited to exascale supercomputing [4].

Fault tolerance at the hardware level has been achieved mostly through redundancy or replication of resources, either physically (as in RAIDed disks [26] and backup servers) or information-wise (as in ECC memory [27], parity memory [28] and data replication [29]). In the case of N modular redundancy [30], for example, N systems perform a common operation and their results are processed by a voting system to produce a single output, at the cost of employing N - 1 redundant duplicate copies. In data replication, data are copied into a local node from remote nodes to facilitate local rather than remote reads, improving the overall system performance. Due to data redundancy, data replication may also improve the reliability of communication, at the cost of additional storage space.

In summary, the overhead introduced by a fault-tolerance mechanism includes both a time component and a dollar cost component, limiting scalability as the system size is scaled up. Below we analyze and quantify the effects of reliability achieved through fault tolerance techniques on scalability in supercomputing.

3 THE RELIABILITY WALL

Table 1 provides a list of the main symbols used, their meanings and where they are defined in the paper. Section 3.1 characterises the time overhead incurred by fault tolerance. Section 3.2 introduces a new reliability speedup model. Section 3.3 quantifies the reliability wall based on our reliability speedup model. Section 3.4 provides a classification of supercomputing systems based on their fault-tolerance time overhead.

3.1 Analyzing the Fault-Tolerance Time Overhead

Consider a parallel system consisting of P single-core processor nodes, denoted by N_1, ..., N_P. Let a parallel program G be executed on the system with one process per node, resulting in a total of P processes. (If a processor has more than one core, then P denotes the number of cores in the system with cores being treated as nodes. In this case, there will be one process running on each core.)

A system is called an R system if it uses some fault-tolerance mechanism, denoted R, to improve its reliability, and R-free otherwise.

Fig. 1. Execution state of a program on a P-node system, showing the computing time and fault-tolerance time of each node N_1, ..., N_i, ..., N_P. (a) R-free system. (b) R system.

Figure 1 compares and contrasts the execution states of a program G on an R-free system and an R system. Let t_i (t_i^R) be the execution time of program G on node N_i in the R-free (R) system. Then the execution time of G on the R-free system is:

  T_P = max{t_i, i = 1, ..., P}        (4)

Similarly, the execution time of G on the R system is given by:

  T_P^R = max{t_i^R, i = 1, ..., P}        (5)

Thus, the time overhead induced by the fault-tolerance mechanism R when program G is run on an R system is found to be:

  H = T_P^R - T_P        (6)

In general, most fault-tolerance mechanisms have positive time overhead values, i.e. H > 0. However, there are cases where H <= 0, when, for example, a program runs on a data replication system with a

TABLE 1
List of main symbols, their meanings and definitions.

  P                  Number of single-core processor nodes                       Sect. 3.1
  R                  Fault tolerance mechanism                                   Sect. 3.1
  G                  Parallel program with P processes                           Sect. 3.1
  f                  Serial part of parallel program G                           Sect. 2.1
  T_P                Execution time of G on an R-free system (with P nodes)      (4)
  S_P                Traditional speedup on an R-free system                     Sect. 3.2
  S_P,Amdahl         Amdahl's speedup on an R-free system                        (1)
  S_P,Gustafson      Gustafson's speedup on an R-free system                     (2)
  E_P                Traditional efficiency on an R-free system                  Sect. 3.2
  T_P^R              Execution time of G on an R system (with P nodes)           (5)
  F                  Average number of failures that may occur when G runs
                     on an R-free system                                         (7)
  H                  Time overhead induced by R when running G on an R system    (6)
  H_P                Average time incurred per failure for fault tolerance       (8)
  M_P                MTTF of an R-free system                                    Sect. 3.1
  R(P)               Fault tolerance time factor                                 (11)
  E_P^R              Efficiency of the fault tolerance technique R               (12)
  S_P^R              Reliability speedup                                         (9)
  S_P,Amdahl^R       Amdahl's reliability speedup                                (14)
  S_P,Gustafson^R    Gustafson's reliability speedup                             (15)
  C(P)               Fault tolerance cost factor                                 (17)
  C                  Costup                                                      (3)
  C_P                Dollar cost of a P-processor parallel system                Sect. 2.1
  C_1                Dollar cost of a single-processor system                    Sect. 2.1
  c_P                Dollar cost of the fault tolerance mechanism R              Sect. 4.1
  C^R                Costup of an R system                                       (19)
  C_P^R              Dollar cost of a P-node R system                            Sect. 4.1
  E_P^C              Efficiency of fault tolerance cost                          (18)
  S_P^GR             General reliability speedup                                 (16)
  sup S_P^R          Reliability wall                                            Sect. 3.3
  sup S_P^GR         General reliability wall                                    Sect. 4.2
  m                  Average number of checkpoints between failures              (21)
  S_P^{R-full}       Reliability speedup for full-memory checkpointing           (25)
  S_P^{R-inc}        Reliability speedup for incremental-memory checkpointing    (26)
  sup S_P^{R-full}   Reliability wall for full-memory checkpointing              (28)
  sup S_P^{R-inc}    Reliability wall for incremental-memory checkpointing       (29)
  S_P^{GR-full}      General reliability speedup for full-memory checkpointing   (31)
  sup S_P^{GR-full}  General reliability wall for full-memory checkpointing      Sect. 5.2

Such mechanisms do not limit scalability but are rarely achievable in peta/exascale supercomputing. Thus, this work focuses only on the fault tolerance mechanisms with positive time overhead values (H > 0).

Let us use M_P to denote the MTTF of an R-free system. Suppose that T_P >> M_P for a large-scale parallel program G. When program G runs on an R-free system, the average number of failures that may occur can be approximated by:

  F = ceil(T_P / M_P)        (7)

Statistically, it is assumed that one failure occurs during every MTTF-sized time interval and no failure occurs during the last residual interval of length T_P mod M_P.

The average time incurred per failure for fault tolerance is defined as:

  H_P = H / F = H / ceil(T_P / M_P)        (8)

3.2 Defining the Reliability Speedup

In this section, we generalize the traditional concept of speedup to reliability speedup. In Section 3.3, we apply this generalization to analyze and quantify the effects of the fault tolerance time overhead on scalability.

Recall that T_P given in (4) represents the (parallel) execution time of program G on an R-free system. Let T_seq be the (sequential) execution time of the sequential version of G. Thus, the traditional speedup S_P is S_P = T_seq / T_P and the efficiency achieved is E_P = S_P / P.

Definition 1 (Reliability Speedup). When program G runs on an R system, the reliability speedup achieved is defined as follows:

  S_P^R = T_seq / T_P^R        (9)

Our reliability speedup formula S_P^R can be refined below. To this end, let us refine T_P^R first. According to (6) – (8), we have:

  T_P^R = T_P + H = T_P + ceil(T_P / M_P) H_P
        = T_P + (1 - (T_P mod M_P) / T_P) (T_P / M_P) H_P        (10)
        ~= T_P + (T_P / M_P) H_P = T_P (1 + H_P / M_P)

where the term (T_P mod M_P) / T_P can be dropped since T_P mod M_P < M_P << T_P, i.e., (T_P mod M_P) / T_P << 1.
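The approximation in (10) is easy to check numerically. The sketch below compares the exact refinement of T_P^R against the approximate form; all numerical values are illustrative choices of ours, not figures from the paper:

```python
import math

def exec_time_R_exact(T_P, M_P, H_P):
    # T_P^R = T_P + F * H_P with F = ceil(T_P / M_P) failures, cf. (6)-(8).
    F = math.ceil(T_P / M_P)
    return T_P + F * H_P

def exec_time_R_approx(T_P, M_P, H_P):
    # T_P^R ~= T_P * (1 + H_P / M_P), valid when (T_P mod M_P) << T_P.
    return T_P * (1 + H_P / M_P)

# Illustrative values with T_P >> M_P, as assumed for large-scale runs.
T_P, M_P, H_P = 2.0e6, 1.5e3, 40.0
exact = exec_time_R_exact(T_P, M_P, H_P)
approx = exec_time_R_approx(T_P, M_P, H_P)
print(exact, approx)  # the relative gap shrinks as T_P / M_P grows
```

With these values the two forms agree to within a fraction of a percent, which is what makes the simplification in (10) harmless for the asymptotic analysis that follows.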

  E_P^R = 1 / (1 + R(P)) <= 1        (12)

which is called the efficiency of the fault tolerance technique R. Since S_P,Amdahl^R <= S_P,Amdahl

3.3 Quantifying the Reliability Wall

Definition 2 (Reliability Wall). When a program G runs on an R system, the reliability wall is defined as the supremum of the corresponding reliability speedup S_P^R, denoted sup_P S_P^R.

Theorem 1. When a program G runs on an R system, the reliability wall exists if and only if lim_{P->inf} S_P^R != inf.

Proof: => By Definition 2, when the reliability wall exists, the supremum of the corresponding reliability speedup S_P^R exists. Suppose sup_P S_P^R = Z < inf, where Z is a positive constant. Then lim_{P->inf} S_P^R <= sup_P S_P^R = Z < inf. Thus, lim_{P->inf} S_P^R != inf holds.

<= Since lim_{P->inf} S_P^R != inf, lim_{P->inf} S_P^R = Z < inf holds. Thus, for all eps > 0 there exists N in Z+ s.t. |S_P^R - Z| < eps when P > N. Then Z - eps < S_P^R < Z + eps when P > N, so S_P^R is bounded from above and its supremum, i.e., the reliability wall, exists.

A system is called a constant system if R(P) = Theta(1) and an incremental system if R(P) > Theta(1). As is customary, the Theta notation describes asymptotically both an upper bound and a lower bound. For example, if R(P) = kQ in Theta(1), where k and Q are positive constants (and remain so for the rest of this paper), then the R system is a constant system. By (15), we have:

  lim_{P->inf} S_P,Gustafson^R = lim_{P->inf} (f + P (1 - f)) / (1 + kQ) = inf

Then the reliability wall does not exist, according to Theorem 1.

If R(P) = kP lg P > Theta(1), then the R system is an incremental system. Similarly, by (15), we have:

  lim_{P->inf} S_P,Gustafson^R = lim_{P->inf} (f + P (1 - f)) / (1 + kP lg P) = 0

This time, however, the reliability wall exists by Theorem 1.

Thus, lim_{P->inf} S_P^R != inf. By Theorem 1, the reliability wall exists for the program G.

For an incremental R system, R(P) > Theta(1), by definition. Therefore, the reliability wall may exist according to Theorem 2. On the other hand, a program that satisfies lim_{P->inf} S_P = inf remains scalable on a constant R system since R(P) = Theta(1), which grows strictly more slowly than S_P, as summarized below.

Corollary 1. When a program G that satisfies lim_{P->inf} S_P = inf runs on a constant R system, the reliability wall does not exist.

TABLE 2
Existence of reliability wall for R systems.

  System                            Categorization                       Reliability wall
  Incremental (R(P) > Theta(1))     R(P) grows at least as fast as S_P   Yes
                                    Theta(1) < R(P) < S_P                No
  Constant                          R(P) = Theta(1)                      No

The conditions that govern the existence of the reliability wall for a program that satisfies lim_{P->inf} S_P = inf running on constant and incremental systems are summarized in Table 2.

Fig. 2. Reliability speedup and wall for a constant system with R(P) = kQ (in the solid line) and two incremental systems with R(P) = k lg P and R(P) = kP lg P (in dashed lines). The X axis is the number of nodes, P.

By instantiating S_P with Gustafson's speedup in (13), Figure 2 illustrates three different systems when R(P) is set to be kP lg P, k lg P and kQ, respectively, to represent the three conditions of Table 2.

4 GENERALIZATION

We generalize our reliability wall theory introduced in Section 3 by considering not only speedup but also costup [10]. This generalization is practically important for peta/exascale supercomputing systems as the fault-tolerant mechanisms employed can be costly. The generalized theory allows us to investigate the effects of both the time and cost induced by fault tolerance on scalability.

4.1 General Reliability Speedup

Definition 4 (General Reliability Speedup). When program G runs on an R system, the general reliability speedup is defined as:

  S_P^GR = S_P^R / C^R        (16)

where C^R is the costup of the R system.

Let us write c_P as the cost of the fault tolerance mechanism R. We define:

  C(P) = c_P / C_P        (17)

which is simply the ratio of the cost of fault tolerance mechanism R over the cost of a P-node R-free system and is referred to as the fault tolerance cost factor. In addition, we define:

  E_P^C = 1 / (1 + C(P))        (18)

which is called the efficiency of the fault tolerance cost.

Suppose that C_P^R is the cost of a P-node R system. According to (3), we have:

  C^R = C_P^R / C_1 = (C_P + c_P) / C_1 = C + c_P / C_1
      = C (1 + c_P / (C C_1)) = C (1 + c_P / C_P) = C (1 + C(P)) = C / E_P^C        (19)

Then, the general reliability speedup becomes:

  S_P^GR = S_P / (C (1 + R(P)) (1 + C(P))) = (P * E_P * E_P^R * E_P^C) / C        (20)

If the time overhead incurred by fault tolerance can be omitted, i.e., if H ~= 0, then we can make the following simplification, by using (8):

  E_P^R = 1 / (1 + R(P)) = 1 / (1 + H_P / M_P) = 1 / (1 + H / (F M_P)) ~= 1

As a result, the general reliability speedup can be simplified to:

  S_P^GR = (P * E_P * E_P^R * E_P^C) / C ~= (P * E_P * E_P^C) / C

which considers essentially the effects of only the dollar costs incurred by fault tolerance on scalability. For example, parity encoding techniques [28] fall into this category; they incur some capital cost due to extra encoding bits added in hardware but negligible time overhead.

Therefore, our generalization allows us to study the effects of either the time overhead or dollar cost or both introduced by a fault-tolerance mechanism on scalability.
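The quantities above compose mechanically. The following sketch instantiates S_P with Gustafson's speedup and contrasts a constant system with an incremental one; the constants k, Q, f and the cost functions are illustrative choices of ours, not values from the paper:

```python
import math

def gustafson(P, f):
    return f + P * (1 - f)                      # eq. (2)

def reliability_speedup(P, f, R):
    return gustafson(P, f) / (1.0 + R(P))       # S_P^R = S_P / (1 + R(P)), cf. (12)-(13)

def general_reliability_speedup(P, f, R, Cfac, costup):
    C_R = costup(P) * (1.0 + Cfac(P))           # C^R = C * (1 + C(P)), eq. (19)
    return reliability_speedup(P, f, R) / C_R   # S_P^GR = S_P^R / C^R, eq. (16)

k, Q, f = 1e-3, 1e-3, 0.01
constant_R = lambda P: k * Q                    # R(P) in Theta(1): no wall
incremental_R = lambda P: k * P * math.log2(P)  # R(P) grows as fast as S_P: wall

Ps = (10**3, 10**4, 10**5)
s_const = [reliability_speedup(P, f, constant_R) for P in Ps]
s_incr = [reliability_speedup(P, f, incremental_R) for P in Ps]
# s_const keeps growing with P; s_incr is bounded and eventually declines.

# Adding an assumed costup C = 5000 lg P and cost factor C(P) = 0.01
# damps the speedup further, as in (20).
g = general_reliability_speedup(10**4, f, constant_R,
                                lambda P: 0.01, lambda P: 5000 * math.log2(P))
```

Running the sketch shows the qualitative behavior summarized in Table 2: the constant system scales without bound while the incremental system with R(P) = kP lg P hits its wall.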

Fig. 3. Trends of the reliability speedup S_P^R and the existence of the reliability wall, where S_P is taken to be Gustafson's speedup. The X axis is the number of nodes, P; the Y axis is drawn in the logarithmic scale with base 10. (a) Four forms of R(P) (kQ, kP, kP lg P, kP^2) shown with fixed k = 10^-3 and Q = 10^-3. (b) Four forms of R(P) = kP^1.5 lg P shown with varying k (10^-9, 10^-6, 10^-5, 10^-3) as shown.

Fig. 4. Trends of the general reliability speedup S_P^GR and the existence of the general reliability wall, where S_P is Gustafson's speedup and C = 5000 lg P is the costup used in (20). The X axis is the number of nodes, P. (a) Four form combinations of (R(P), C(P)), namely (kQ, qV), (kQ, qP/lg P), (kP, qV) and (kP, qP/lg P), shown with fixed k = 10^-3, V = 10, q = 0.01 and Q = 10^-3. (b) Four forms of (R(P), C(P)) = (kP, qP/lg P) with fixed k = 10^-3 and varying q (1, 0.1, 0.01, 0.001) as shown.
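The peaked curves of Fig. 3(b) can be reproduced qualitatively with a brute-force scan over P; the scan range, step and parameter values below are our own choices for illustration, not the exact values behind the published plots:

```python
import math

def reliability_speedup(P, f, k):
    # Gustafson's S_P divided by 1 + R(P), with R(P) = k * P^1.5 * lg P,
    # the form of R(P) used in Fig. 3(b).
    S_P = f + P * (1 - f)
    R_P = k * (P ** 1.5) * math.log2(P)
    return S_P / (1.0 + R_P)

def wall_location(f, k, P_max=10**6, step=500):
    # Coarse scan for the node count at which S_P^R peaks (the wall).
    best_P = max(range(2, P_max, step),
                 key=lambda P: reliability_speedup(P, f, k))
    return best_P, reliability_speedup(best_P, f, k)

# A larger overhead constant k pushes the wall to a smaller system size
# and a lower peak reliability speedup, matching the trend in Fig. 3(b).
p_small_k, s_small_k = wall_location(0.01, 1e-6)
p_large_k, s_large_k = wall_location(0.01, 1e-3)
```

The scan is deliberately coarse; its point is only that beyond the wall, adding nodes lowers the reliability speedup rather than raising it.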

4.2 General Reliability Wall

Definition 5 (General Reliability Wall). When a program G runs on an R system, the general reliability wall is defined as the supremum of the corresponding general reliability speedup S_P^GR, denoted sup_P S_P^GR.

Theorem 3. When a program G runs on an R system, the general reliability wall exists if and only if lim_{P->inf} S_P^GR != inf.

Proof: Proceeds similarly to the proof of Theorem 1.

As before, this theorem has the following immediate implication: if lim_{P->inf} S_P^R != inf, then lim_{P->inf} S_P^GR != inf.

4.3 Understanding the General Reliability Speedup and Wall

According to (13) and (20), R(P) and C(P), which characterize the time overhead and dollar cost incurred by the fault tolerance technique R, respectively, are the key scalability-limiting factors of a program that satisfies lim_{P->inf} S_P = inf running on an R system. Their effects on the (general) reliability speedup and the existence of the (general) reliability wall are plotted in Figures 3 and 4 for a number of representative function forms.

In order to determine realistically the two parameters, k and Q, used in R(P), we have assumed an abstract P-node system, where the MTTF of a single node in the R-free system is 512 x 3600 secs.
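A small sketch makes this setup concrete, using the single-node MTTF assumed above; the linear failure-rate scaling M_P = M / P is the relation later given as eq. (23):

```python
import math

M = 512 * 3600                      # single-node MTTF in seconds (from the text)

def system_mttf(P):
    # Failure rates grow roughly linearly with P, so M_P = M / P, cf. eq. (23).
    return M / P

def expected_failures(T_P, P):
    # Average number of failures during a run of length T_P, eq. (7).
    return math.ceil(T_P / system_mttf(P))

# Under this assumption, a 512-node system fails about once an hour,
# and a day-long run on it sees roughly 24 failures.
print(system_mttf(512))                   # 3600.0
print(expected_failures(24 * 3600, 512))  # 24
```

The sketch illustrates why the fault-tolerance time factor grows with system size: the same run accumulates more failures, and hence more overhead, as P increases.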


5 CASE STUDIES

In our case studies, we use existing representative supercomputing systems with representative fault-tolerance techniques deployed to verify, analyze, and extrapolate the existence of the (general) reliability wall for forthcoming exascale systems. (A simulator for an exascale system, which does not exist yet, may not be helpful unless it can keep track of the execution state of the entire system.) Our case analysis also provides insights on how to mitigate reliability-wall effects in system design and through applying optimizations for large-scale parallel computing.

MPPs and clusters are the two most popular architectures in the TOP500 supercomputing list, and checkpointing is a fault-tolerance mechanism widely adopted in large-scale parallel systems. In our case studies, we have selected Intrepid [34], an MPP system, and ASCI White [35], a cluster system, with their architectural details discussed later. We have selected these two systems also because their performance data, including MTTF, are publicly available. We consider two checkpointing techniques, full-memory and incremental-memory [23], and two I/O architectures, centralized and distributed, to evaluate their impact on scalability.

We write Interval to represent the checkpointing interval used in a system and m to represent the average number of checkpoints between failures. The two parameters are related to each other as follows:

  m = floor(M_P / Interval)        (21)

In Section 5.1, we focus on the reliability speedup and wall by considering four system configurations, depending on whether checkpointing is full-memory or incremental-memory and whether I/O is centralized or distributed. Both of these affect the existence of the reliability wall. For each configuration, we first develop its reliability speedup model and then analyze the existence of the reliability wall. In Section 5.2, we move to the general reliability speedup and wall by considering one configuration with full-memory checkpointing and a centralized I/O architecture. In Section 5.3, we further analyze the existence of the (general) reliability wall for the above configuration when checkpoint interval optimization is adopted. In Section 5.4, we summarize how to mitigate the reliability-wall effects through reducing the scalability-limiting overhead induced by fault tolerance.

In this section, the traditional speedup S_P is always instantiated with Gustafson's speedup, S_P,Gustafson. In our case analysis conducted in Sections 5.1 and 5.2, we use only a fixed value for m to analyze the reliability walls for Intrepid and ASCI White. In Section 5.3, we show that our results are valid when different values of m are used.

5.1 Reliability Speedup and Wall

In Section 5.1.1, we build reliability speedup models for both full-memory and incremental-memory checkpointing techniques. We show that the existence of the reliability wall depends critically on two key parameters, one related to checkpoint size and the other to the bandwidth of the underlying I/O architecture. In Section 5.1.2, we conduct our analysis using Intrepid with a centralized I/O architecture by considering full-memory and incremental-memory checkpointing. In Section 5.1.3, we repeat this analysis for ASCI White with a distributed I/O architecture.

5.1.1 Formulating Reliability Speedup for Checkpointing

If the fault-tolerance technique R used is checkpointing, then the average time incurred per failure due to fault tolerance, H_P given in (13), consists of mainly two parts. One part represents the average time spent on saving checkpoints, enforcing coordination among processors and orchestrating the communication among processes between failures. The other part represents the average time spent on rollback and recovery when a failure occurs.

The bandwidth of the underlying I/O architecture is the bottleneck for checkpointing, which mainly affects the time on saving checkpoints between failures and the time on recovering a checkpoint per failure. To simplify our analysis, we assume below that H_P only includes these two components. Thus, we obtain:

  H_P = m (D_size^s / W) + D_size^r / W = (m D_size^s + D_size^r) / W        (22)

where D_size^s is the average size of a checkpoint saved each time and D_size^r is the average size of a checkpoint per failure to be recovered. For some checkpointing techniques, such as incremental-memory checkpointing, D_size^s is not necessarily equal to D_size^r, as will be discussed below. W is the bandwidth of the underlying I/O architecture in Gbits/sec. The other two parameters, Interval and m, are introduced in (21).

In 2006, Schroeder [36] tested 22 high performance computers in LANL and pointed out that failure rates are roughly proportional to the number of processors P in a system [36], implying that the time to failure is inversely proportional to P. Let M > 0 be the MTTF of a single node in the R-free system. Thus, we assume that M and M_P are related by:

  M_P = M / P        (23)

According to (13), we have:

  S_P^R = S_P / (1 + H_P / M_P) = S_P / (1 + (m D_size^s + D_size^r) P / (W M))        (24)

where R(P) = (m D_size^s + D_size^r) P / (W M).

In full-memory checkpointing, the entire memory context for a process at a checkpoint is saved. Thus,

is saved at each checkpoint. Thus, the checkpoint saved is of the same size every time, i.e., the size of the total memory. So is the size of a checkpoint to be recovered. Thus, $D^s_{size} = D^r_{size}$. According to (24), the reliability speedup for full-memory checkpointing is given by:

$$S^{R-full}_P = \frac{S_P}{1 + \frac{(mD^s_{size} + D^r_{size})P}{WM}} = \frac{S_P}{1 + \frac{(m+1)D^s_{size}P}{WM}} \qquad (25)$$

Unlike full-memory checkpointing, incremental-memory checkpointing reduces the amount of checkpoint context saved. Typically, the size of the first saved checkpoint file and the size of a checkpoint to be recovered per failure are equal to that of the entire memory, but the size of a subsequently saved checkpoint file may be smaller. Thus, the average size of a saved checkpoint between failures, $D^s_{size}$, is not larger than $D^r_{size}$, i.e., $D^s_{size} \preceq D^r_{size}$. Thus, the reliability speedup for incremental-memory checkpointing is:

$$S^{R-inc}_P = \frac{S_P}{1 + \frac{(mD^s_{size} + D^r_{size})P}{WM}} \leq \frac{S_P}{1 + \frac{(m+1)D^s_{size}P}{WM}} \qquad (26)$$

In this section, we explain the "if" condition governing the existence of the reliability wall in Theorem 1. The instance of the "if" part of Theorem 1 for full-memory or incremental-memory checkpointing is given as follows.

Theorem 4. Suppose that a program $G$ that satisfies $\lim_{P \to \infty} S_P = \infty$ runs on an R system using full-memory checkpointing with its reliability speedup given in (25) or incremental-memory checkpointing with its reliability speedup given in (26), where $S_P$ is instantiated with $S_{P,Gustafson}$. If $\frac{D^s_{size}}{W} \succeq \Theta(1)$, then the reliability wall exists.

Proof: In (25), $m$ and $M$ are positive constants independent of $P$, and $S_{P,Gustafson} = \Theta(P)$. If $\frac{D^s_{size}}{W} \succeq \Theta(1)$, then $P \preceq \frac{(m+1)D^s_{size}P}{MW}$, which implies that $\frac{P}{1 + \frac{(m+1)D^s_{size}P}{MW}}$ has a supremum. Note that

$$\frac{S_{P,Gustafson}}{1 + \frac{(m+1)D^s_{size}P}{MW}} = \Theta\!\left(\frac{P}{1 + \frac{(m+1)D^s_{size}P}{MW}}\right)$$

So $S^{R-full}_P$ with $S_P$ being replaced by $S_{P,Gustafson}$ has a supremum. Given (25) and (26), the reliability wall exists in both cases.

Let us examine the two key parameters in Theorem 4 that determine the existence of the reliability wall, and consequently, the scalability of a program $G$ that satisfies $\lim_{P \to \infty} S_P = \infty$ on an R system.

$D^s_{size}$. For full-memory checkpointing, $D^s_{size}$ is the size of the total memory, which increases linearly with the number of system nodes. For incremental-memory checkpointing, $D^s_{size}$ is related to the characteristics and scalability of program $G$. In general, the amount of data required for computing the execution state of program $G$ increases with $P$, and so does the size of a saved checkpoint. However, the dirty pages of a non-first checkpoint do not necessarily increase with $P$. Thus, $\Theta(1) \preceq D^s_{size} \preceq \Theta(P)$ holds.

$W$. The I/O bandwidth $W$ is determined by the underlying I/O architecture. The storage systems used in some modern large-scale MPP supercomputing systems are decoupled from the compute nodes [37], as in the IBM Blue Gene series, giving rise to a centralized I/O architecture. However, this design puts the storage system on a separate network from the compute nodes, limiting the available I/O bandwidth. As a result, today's limited I/O bandwidth is choking the capabilities of modern supercomputers, particularly making fault-tolerance techniques, such as checkpointing, prohibitively expensive. Thus, a scalable I/O system design, referred to as the distributed I/O architecture, may overcome this limitation. For example, clusters such as ASCI White with local disks have certain advantages when adopting checkpointing techniques.

Below we take two systems, Intrepid and ASCI White, as examples to investigate the existence of the reliability wall for two different checkpointing techniques and two different I/O architectures.

5.1.2 Intrepid

The Blue Gene/P Intrepid system at Argonne National Laboratory is one of IBM's massively parallel supercomputers, consisting of 40 racks with a total of 40,960 quad-core compute nodes (totalling 163,840 processors) and 80 TB of memory. Each compute node consists of four PowerPC 450 processors and has 2 GB of memory. The peak performance is 557 TeraFlops. A 10-Gigabit Ethernet network, consisting of Myricom switching gear connected in a non-blocking configuration, provides connectivity between the I/O nodes, file servers, and several other storage resources. The centralized I/O architecture of Intrepid is shown in Figure 5, where the compute node / I/O node ratio can be 16, 32, 64 or 128, resulting in $n = 2560$, $1280$, $640$ or $320$ I/O controllers. (We use Intrepid rather than Blue Gene/L since Blue Gene/P is newer.)

The peak unidirectional bandwidth of each 10-Gigabit port of an I/O node is limited to 6.8 Gbits/sec by the internal collective network that feeds it. $P$ stands for the number of cores, as one process is assumed to run on one core. The MTTF of Intrepid (40 racks) is $M_P = 12.6$ days [38]. By (23), the MTTF of a single core is $M = M_P \times P = 1.8 \times 10^{11}$ secs. Generally, large-scale parallel computing systems take a checkpoint interval according to the system's MTTF $M_P$, and current systems have a checkpoint interval in the order of several hours.
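As a quick sanity check on the per-core MTTF quoted above, the arithmetic can be written out in a few lines of Python (a sketch using only the constants from the text):

```python
# MTTF scaling per (23): M_P = M / P, hence M = M_P * P.
mttf_system_s = 12.6 * 86400       # system MTTF of Intrepid (40 racks): 12.6 days [38]
p = 40960 * 4                      # 40,960 quad-core nodes -> 163,840 cores
mttf_core_s = mttf_system_s * p    # per-core MTTF, about 1.8e11 seconds
```

This recovers the value $M = 1.8 \times 10^{11}$ secs used in the analysis below.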

Fig. 5. The centralized I/O architecture with $n$ I/O controllers.

Fig. 6. Trend of $S^R_{P,Gustafson}$ for Intrepid with full-memory checkpointing ($m = 100$).

Thus, our analysis assumes $m = 100$, implying that $Interval = 3$ hours for Intrepid (with $P = 163{,}840$ processors).

Since $\Theta(1) \preceq D^s_{size} \preceq \Theta(P)$ and $W$ is a positive constant, $\frac{D^s_{size}}{W} \succeq \Theta(1)$ holds. According to Theorem 4, the reliability wall exists for any program $G$ that satisfies $\lim_{P \to \infty} S_P = \infty$ running on Intrepid when either the full-memory or the incremental-memory checkpointing technique is used. Below we calculate the smallest size of the system when the reliability wall is hit.

5.1.2.1 Centralized I/O + Full-Memory Checkpointing: $D^s_{size} = \frac{2}{4} \times 8 \times P = 4P$ Gbits, since each quad-core node has 2 GB of memory and $P$ is the total number of cores. Observing from (25), we have $R(P) = kP^2 = \Theta(P^2)$, where

$$k = \frac{4(m+1)}{WM}$$

Substituting $S_{P,Gustafson}$ for $S_P$ into (25), we obtain:

$$S^{R-full}_P = \frac{f + (1-f)P}{1 + R(P)} = \frac{f + (1-f)P}{1 + kP^2} \qquad (27)$$

Suppose $S^{R-full}_P$ reaches the reliability wall at $P_0$, called the optimal system size; then we can find $P_0$ as follows:

$$\left.\frac{dS^{R-full}_P}{dP}\right|_{P=P_0} = \frac{(1-f)(1 + kP_0^2) - 2kP_0(f + P_0(1-f))}{(1 + kP_0^2)^2} = 0$$

Because $f$ is small for a large-scale parallel program and can thus be omitted, we obtain $P_0 \approx \frac{1}{\sqrt{k}}$ and $S^{R-full}_P\big|_{P=P_0} \approx \frac{P_0}{2}$. By Definition 2, the reliability wall is found to be:

$$\sup_P S^{R-full}_P = \sqrt{\frac{MW}{16(m+1)}} \equiv \sqrt{\frac{MWP}{4(m+1)D^s_{size}}} \qquad (28)$$

5.1.2.2 Centralized I/O + Incremental-Memory Checkpointing: As discussed earlier, the total size of all saved checkpoints is at least equal to that of the entire memory, since that is the size saved at the first checkpoint. Thus, $D^s_{size}$ is at least the ratio of the total memory size to the number of checkpoints when program $G$ runs on Intrepid. Suppose that program $G$ runs on Intrepid for a month (of 30 days). Then $D^s_{size} = \frac{4P}{\frac{30 \times 24}{Interval}}$ Gbits. Each checkpoint, except the first one, saves only the modified pages, while the size to be recovered at a checkpoint, $D^r_{size}$, is the same as in the case of full-memory checkpointing, i.e., $D^r_{size} = 4P$ Gbits.

Based on (26), we also have $R(P) = kP^2 = \Theta(P^2)$, where

$$k = \frac{4\left(m\frac{Interval}{30 \times 24} + 1\right)}{WM}$$

Substituting $S_{P,Gustafson}$ for $S_P$ into (26), we obtain $S^{R-inc}_P$ of the same form as in (27), except that $k$ represents a different coefficient. Proceeding exactly as in full-memory checkpointing, we obtain $P_0 \approx \frac{1}{\sqrt{k}}$ and $S^{R-inc}_P\big|_{P=P_0} \approx \frac{P_0}{2}$. By Definition 2, the reliability wall for incremental-memory checkpointing is:

$$\sup_P S^{R-inc}_P = \sqrt{\frac{MW}{16\left(m\frac{Interval}{30 \times 24} + 1\right)}} \equiv \sqrt{\frac{MWP}{4(mD^s_{size} + D^r_{size})}} \qquad (29)$$

5.1.2.3 Reliability Wall: For a few combinations of different values taken by $n$ (the number of I/O controllers), Figures 6 and 7 display the trend of $S^R_{P,Gustafson}$ for Intrepid with the full-memory and incremental-memory checkpointing techniques, respectively. The optimal system size $P_0$ lies between $10^6$ and $10^7$ before the system hits the reliability wall, indicating that the reliability wall will exist at the peta/exascale levels. By comparing the full-memory and incremental-memory checkpointing techniques, we find that we can push the reliability wall forward by using less expensive fault-tolerance techniques, as further illustrated by an example given in Figure 8. In each case, the reliability speedup increases initially and starts to drop as soon as it hits the reliability wall, i.e., the maximum of the speedup achievable. Based on our earlier analysis results given in (28) and (29), the system designer can also increase $M$ or $W$, i.e., adopt more reliable system components or improve the bandwidth, to push the reliability wall forward.
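Equations (27) and (28) can be checked numerically. The sketch below is our code, not the paper's; it assumes the $n = 640$ configuration (640 I/O controllers at 6.8 Gbits/sec each) together with the $M$ and $m$ values quoted above, and it recovers an optimal system size in the $10^6$ to $10^7$ range reported here.

```python
from math import sqrt

def intrepid_full_memory_wall(m, w_gbits, mttf_core_s):
    """P0 ~ 1/sqrt(k) and sup S^{R-full} ~ P0/2 with k = 4(m+1)/(W*M),
    following (27)-(28) for centralized I/O + full-memory checkpointing."""
    k = 4.0 * (m + 1) / (w_gbits * mttf_core_s)
    p0 = 1.0 / sqrt(k)
    return p0, p0 / 2.0

# n = 640 I/O controllers at 6.8 Gbits/sec each; M = 1.8e11 s; m = 100.
p0, wall = intrepid_full_memory_wall(100, 640 * 6.8, 1.8e11)
```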

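The gain from switching full-memory checkpointing to incremental-memory checkpointing, discussed around Figure 8, follows directly from the ratio of (29) to (28). A small sketch (our code, assuming the 30-day run used in the text):

```python
from math import sqrt

def wall_gain(m, interval_hours, month_hours=30 * 24):
    """Ratio sup S^{R-inc} / sup S^{R-full} from (28)-(29):
    sqrt((m + 1) / (m * Interval / (30*24) + 1))."""
    return sqrt((m + 1) / (m * interval_hours / month_hours + 1.0))
```

With $m = 100$ and a 3-hour checkpoint interval (the Intrepid setting), the wall moves out by roughly a factor of 8; if the interval stretched to the whole month, the incremental checkpoint would be as large as a full one and the gain would vanish.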

Fig. 7. Trend of $S^R_{P,Gustafson}$ for Intrepid with incremental-memory checkpointing ($m = 100$).

Fig. 8. Comparison of $S^R_{P,Gustafson}$ for Intrepid between full-memory checkpointing and incremental-memory checkpointing when $n = 640$ ($m = 100$).

Fig. 9. The distributed I/O architecture.

5.1.3 ASCI White

We now turn to ASCI White with local disks to study the existence of the reliability wall for systems with scalable I/O architectures.

ASCI White is the third step in the DOE's five-stage ASCI plan, with a peak performance of 12.3 TeraFlops. It is a computer cluster built from IBM's commercial RS/6000 SP Power3 375 MHz computers. Each compute node contains 16 Power3-II CPUs built with IBM's latest semiconductor technology (silicon-on-insulator and copper interconnects) and has 16 GB of memory. ASCI White uses a faster switching network (IBM's SP switch) to exchange information across the compute nodes. The number of compute nodes is 512 (totalling 8,192 CPUs). The local disk per compute node is 70.9 GB, with an I/O bandwidth to local disk of 40 MB/sec. As discussed below, ASCI White with scalable I/O bandwidth is better suited to checkpointing techniques than Intrepid with a centralized I/O architecture.

The MTTF of ASCI White is $M_P = 40$ hours (in 2003 with 8,192 CPUs) [39]. $P$ denotes the number of (single-core) CPUs, with one process running per CPU. By (23), the MTTF of a single CPU is $M = M_P \times P = 1.2 \times 10^9$ secs. As in the case of Intrepid, we assume $m = 100$. This implies that $Interval = 0.4$ hours for ASCI White (with $P = 8{,}192$ processors).

The distributed I/O architecture of ASCI White is shown in Figure 9. When the system size increases, the I/O bandwidth of the system increases too, i.e., $W = 0.04 \times 8 \times P = \Theta(P)$ Gbits/sec. According to Theorem 4, the reliability wall exists when $\frac{D^s_{size}}{W} \succeq \Theta(1)$. Thus, for ASCI White, the criterion for the existence of the reliability wall is $D^s_{size} \succeq \Theta(P)$. Note that $\Theta(1) \preceq D^s_{size} \preceq \Theta(P)$, as discussed in Section 5.1.1. According to Theorem 4, when a program $G$ that satisfies $\lim_{P \to \infty} S_P = \infty$ runs on the ASCI White system with either full-memory or incremental-memory checkpointing, the reliability wall always exists because $D^s_{size}/W = \Theta(1)$ holds.

5.1.3.1 Distributed I/O + Full-Memory Checkpointing: $D^s_{size} = 16 \times \frac{P}{16} \times 8 = 8P$ Gbits, as each compute node has 16 CPUs sharing 16 GB of memory and $P$ is the number of CPUs in ASCI White. In (25), we find that $R(P) = kP = \Theta(P)$, where

$$k = \frac{m+1}{0.04M}$$

Substituting $S_{P,Gustafson}$ for $S_P$ into (25), we obtain:

$$S^{R-full}_P = \frac{f + (1-f)P}{1 + R(P)} = \frac{f + (1-f)P}{1 + kP} \qquad (30)$$

Since $\frac{dS^{R-full}_P}{dP} > 0$ for all $P$, the reliability wall is calculated as:

$$\sup_P S^{R-full}_P = \lim_{P \to \infty} S^{R-full}_P = \lim_{P \to \infty} \frac{f + (1-f)P}{1 + kP} = \frac{1-f}{k}$$

Again, $f$ is small and is thus omitted, and we have:

$$\sup_P S^{R-full}_P \approx \frac{1}{k} = \frac{0.04M}{m+1} \equiv \frac{MW}{(m+1)D^s_{size}}$$

5.1.3.2 Distributed I/O + Incremental-Memory Checkpointing: As in the case of Intrepid, we consider running program $G$ on ASCI White for a month (of 30 days). By conducting a similar analysis, we have $D^s_{size} = \frac{8P}{\frac{30 \times 24}{Interval}}$ Gbits and $D^r_{size} = 8P$ Gbits. In this case, we continue to have $R(P) = kP = \Theta(P)$, where

$$k = \frac{m\frac{Interval}{30 \times 24} + 1}{0.04M}$$
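Because $R(P)$ grows only linearly here, $S^{R-full}_P$ in (30) climbs toward its supremum instead of peaking. The asymptote can be evaluated directly; the sketch below is our code, reusing the $M$ and $m$ values quoted above.

```python
def asci_white_wall(m, mttf_cpu_s):
    """sup S^{R-full} ~ 1/k = 0.04*M/(m+1) for distributed I/O +
    full-memory checkpointing, per the limit derived from (30)."""
    return 0.04 * mttf_cpu_s / (m + 1)

def speedup_30(p, k, f=0.0):
    """S^{R-full}_P from (30), with the small serial fraction f omitted."""
    return (f + (1.0 - f) * p) / (1.0 + k * p)

m, mttf = 100, 1.2e9
wall = asci_white_wall(m, mttf)      # a few hundred thousand
k = (m + 1) / (0.04 * mttf)
```

The speedup increases monotonically with $P$ but never exceeds the wall, which is approached only asymptotically.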


Substituting $S_{P,Gustafson}$ for $S_P$ into (26), we obtain $S^{R-inc}_P$ of the same form as (30), except that $k$ represents a different coefficient. Proceeding exactly as in full-memory checkpointing, we obtain the reliability wall as follows:

$$\sup_P S^{R-inc}_P \approx \frac{1}{k} = \frac{0.04M}{m\frac{Interval}{30 \times 24} + 1} \equiv \frac{MW}{mD^s_{size} + D^r_{size}}$$

5.1.3.3 Reliability Wall: The same observations made earlier for Intrepid also apply here and are thus omitted.

Fig. 10. Comparison of $S^R_{P,Gustafson}$ for ASCI White between full-memory checkpointing and incremental-memory checkpointing ($m = 100$).

Consider an example illustrated in Figure 10 for the two checkpointing techniques. In each case, the speedup function increases monotonically and eventually approaches its supremum, i.e., the reliability wall. In each case, the optimal system size can be determined by some simple heuristics. For example, let $threshold$ represent the growth rate (i.e., the first-order derivative) of a given reliability speedup function $S^R_P$. If the growth rate of $S^R_P$ is slower than $threshold$, then any attempt at boosting performance by increasing the system size will be futile. Thus, the optimal system size $P_0$ can be calculated by simply solving $\frac{dS^R_P}{dP}\big|_{P=P_0} = threshold$. If $threshold = 0.01$, then $P_0$ is found to be $4.28 \times 10^6$ for full-memory checkpointing and $3.04 \times 10^8$ for incremental-memory checkpointing, as shown in Figure 10. This implies that the reliability wall exists at the peta/exascale levels when the size of ASCI White increases over the range between $10^6$ and $10^8$.

5.2 General Reliability Speedup and Wall

We consider Intrepid with a centralized I/O architecture using full-memory checkpointing for fault tolerance, since its price information is publicly available. We derive its general reliability speedup and discuss the existence of the general reliability wall.

For full-memory checkpointing, the cost of the disks used to save checkpoint files is the cost incurred by checkpointing. $c_P$ is the cost of RAIDs with a size equal to the size of the total memory available in Intrepid.

TABLE 3
List prices for Intrepid in July, 2010 (US$)
Rack: 1.5 million; 100 TB RAIDs: 500,000; Interconnect per node: 290; $C_1$: 1,170

According to Table 3, the cost of 100 TB of RAIDs is 500,000 dollars, so the cost of 1 GB is 5 dollars. Given that $P = 40{,}960 \times 4 = 163{,}840$, the (hardware) cost of full-memory checkpointing is $c_P = 40{,}960 \times 2 \times 5 = 2.5P$ dollars. Suppose program $G$ consumes 6 GB of the RAIDs when it runs on Intrepid without full-memory checkpointing; the (hardware) cost is then $6 \times 5 = 30$ dollars. Thus, the cost of a $P$-node Intrepid system with full-memory checkpointing is $C_P = 1.5 \times 10^6 \times 40 + 40{,}960 \times 290 + 30 = 71.9$ million dollars.

Suppose the costup (without checkpointing) is $C = \frac{C_P}{C_1} = \Theta(\lg P)$, i.e., $C = l \lg P$, where $l$ is a positive constant. Then we find that $l = \frac{C_P}{C_1 \lg P} \approx 1.2 \times 10^4$ when $P = 163{,}840$.

According to (20) and (25), when $n = 640$, the general reliability speedup for full-memory checkpointing is:

$$S^{GR-full}_P = \frac{S^R_P}{C^R} = \frac{S_P}{\left(1 + \frac{(m+1)D^s_{size}P}{WM}\right)C\left(1 + \frac{c_P}{C_P}\right)} = \frac{S_P}{\left(1 + \frac{4(100+1)P^2}{WM}\right)C\left(1 + \frac{2.5P}{C_1 l \lg P}\right)} = \frac{f + (1-f)P}{1.2 \times 10^4 \lg P \left(1 + 5.2 \times 10^{-13}P^2\right)\left(1 + \frac{1.8 \times 10^{-7}P}{\lg P}\right)} \qquad (31)$$

It is easy to see that $\lim_{P \to \infty} S^{GR-full}_P = 0$. Thus, the general reliability wall exists in this situation according to Theorem 3. Proceeding similarly as when analyzing the reliability wall for Intrepid with full-memory checkpointing in Section 5.1.2, the optimal system size $P_0$ for the general reliability wall can be calculated by solving $\frac{dS^{GR-full}_P}{dP}\big|_{P=P_0} = 0$. We find that $P_0 = 1.26 \times 10^6$ and the general reliability wall is $\sup_P S^{GR-full}_P = 8.21$, as further illustrated in Figure 11. Because the high costup is taken into account, the general reliability wall and its optimal system size are both much smaller than if the costup were ignored.

We can easily observe that the general reliability wall exists if the reliability wall does, but the converse is not always true.
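The closed form (31) is easy to probe numerically. In the sketch below (our code, with $\lg$ taken as $\log_{10}$), the serial fraction $f$ is set to 0, so the peak value comes out slightly above the 8.21 reported for the paper's choice of $f$, but the location of the optimum agrees with Figure 11.

```python
from math import log10

def general_speedup(p, f=0.0):
    """S^{GR-full}_P from (31) for Intrepid with n = 640 (lg = log10)."""
    num = f + (1.0 - f) * p
    den = (1.2e4 * log10(p)
           * (1.0 + 5.2e-13 * p * p)
           * (1.0 + 1.8e-7 * p / log10(p)))
    return num / den

# Coarse grid search for the optimal system size P0.
grid = range(10_000, 5_000_000, 10_000)
p0 = max(grid, key=general_speedup)
```

The maximum lands near $1.26 \times 10^6$ nodes at a single-digit speedup, consistent with the general reliability wall shown in Figure 11.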

Fig. 11. Trend of $S^{GR-full}_{P,Gustafson}$ for Intrepid when $n = 640$ ($m = 100$).

5.3 On the Optimum Checkpoint Interval

In Sections 5.1 and 5.2, we consider $m = \left\lceil\frac{M_P}{Interval}\right\rceil$, i.e., the average number of checkpoints between failures, as a constant. However, $m$ can be optimized for a particular configuration by optimizing $Interval$ to obtain the optimum checkpoint interval [40] as follows:

$$Interval = \begin{cases} \sqrt{\frac{2D^s_{size}M_P}{W}} - \frac{D^s_{size}}{W}, & \frac{D^s_{size}}{W} < \frac{M_P}{2} \\ M_P, & \frac{D^s_{size}}{W} \geq \frac{M_P}{2} \end{cases} \qquad (32)$$

There is a broad consensus in the fault-tolerance community that when the system size is scaled up continuously, the time spent on saving a global checkpoint will eventually reach and even exceed the MTTF of the system [9]. This implies that $\frac{D^s_{size}}{W} \geq \frac{M_P}{2}$ will hold eventually when $P$ is in $[P', \infty)$ and $\frac{D^s_{size}}{W} < \frac{M_P}{2}$ will hold when $P$ is in $[1, P'-1]$, where $\frac{D^s_{size}}{W}$ is roughly the time spent on saving a global checkpoint and $P'$ is a certain system size at which $\frac{D^s_{size}}{W} = \frac{M_P}{2}$. When $P$ is in $[1, P'-1]$, i.e., $\frac{D^s_{size}}{W} < \frac{M_P}{2}$ holds, both $S^{R-full}_P$ and $S^{R-inc}_P$ are bounded and have suprema. When $P$ is in $[P', \infty)$, i.e., $\frac{D^s_{size}}{W} \geq \frac{M_P}{2}$ holds, we have $m = \left\lceil\frac{M_P}{Interval}\right\rceil = 1$, which falls into the cases discussed earlier where $m$ is a constant. In both cases, Theorem 4 is still applicable when $Interval$ is instantiated with the optimum checkpoint interval. Thus, the existence of the (general) reliability wall does not change for the four system configurations discussed earlier in Sections 5.1 and 5.2.

5.4 Mitigating Reliability-Wall Effects

As demonstrated in our case studies, the reliability wall may exist in future exascale supercomputing. Our reliability-wall theory provides insights on reducing fault-tolerance overhead for improved scalability in a number of principal directions.

• Based on our (general) reliability-wall theory, the relationship between reliability and scalability can be revealed. A (general) reliability speedup function can be developed, and the existence of the (general) reliability wall can be analytically predicted.

• By performing a reliability-wall analysis, the key scalability-limiting parameters, such as $D^s_{size}$, $D^r_{size}$, $Interval$, $M$ and $W$, can be identified. The system designer can push the reliability wall forward by deploying more reliable components and increasing the I/O bandwidth to optimize the system-wide parameters $M$ and $W$. For example, scalable diskless checkpointing [24] is designed for optimizing $W$. Simultaneously, application programmers or compiler researchers can develop optimizations to reduce fault-tolerance overhead by choosing appropriate values for the other parameters, such as $D^s_{size}$, $D^r_{size}$ and $Interval$, which are the main design decisions to be made in application-level checkpointing [17], [18].

• As illustrated in Figures 8 and 10, the development of new fault-tolerance techniques and I/O architectures can be guided by our theory to focus on optimizing the key scalability-limiting factors.

6 CONCLUSION

This paper introduces for the first time the concept of "Reliability Wall" and proposes a (general) reliability-wall theory that allows the effects of reliability on the scalability of parallel applications to be understood and predicted analytically for large-scale parallel computing systems, particularly those at the peta/exascale levels. The significance of this work is demonstrated with case studies using representative real-world supercomputing systems with commonly used checkpointing techniques for fault tolerance. Our work enables us to push the reliability wall forward by mitigating reliability-wall effects in system design (e.g., by developing cost-effective fault-tolerance techniques and I/O architectures) and through applying compiler-assisted optimizations at the application level.

ACKNOWLEDGMENTS

The authors thank the reviewers for their helpful comments and suggestions, which greatly improved the final version of the paper. This work is supported by the National Natural Science Foundation of China (NSFC) No. 60921062 and 61003087.

REFERENCES

[1] D. B. Kothe, "Science prospects and benefits with exascale computing," Oak Ridge National Laboratory, Tech. Rep. ORNL/TM-2007/232, 2007.
[2] H. Simon, T. Zacharia, and R. Stevens, "Modeling and Simulation at the Exascale for Energy and the Environment," Website, http://www.sc.doe.gov/ascr/ProgramDocuments/ProgDocs.html.
[3] J. Stearley, "Defining and measuring supercomputer Reliability, Availability, and Serviceability (RAS)," in Proceedings of the Clusters Institute Conference, 2005.