
Frontiers of Information Technology & Electronic Engineering
www.zju.edu.cn/jzus; engineering.cae.cn; www.springerlink.com
ISSN 2095-9184 (print); ISSN 2095-9230 (online)
E-mail: [email protected]

Storage wall for exascale supercomputing∗

Wei HU‡1, Guang-ming LIU1,2, Qiong LI1, Yan-huang JIANG1, Gui-lin CAI1
(1 College of Computer, National University of Defense Technology, Changsha 410073, China)
(2 National Supercomputing Center in Tianjin, Tianjin 300457, China)
E-mail: {huwei, liugm}@nscc-tj.gov.cn; [email protected]; [email protected]; [email protected]
Received June 15, 2016; Revision accepted Oct. 5, 2016; Crosschecked Oct. 25, 2016

Abstract: The mismatch between compute performance and I/O performance has long been a stumbling block as supercomputers evolve from petaflops to exaflops. Currently, many parallel applications are I/O intensive, and their overall running times are typically limited by I/O performance. To quantify the I/O performance bottleneck and highlight the significance of achieving scalable performance in peta/exascale supercomputing, in this paper, we introduce for the first time a formal definition of the 'storage wall' from the perspective of parallel application scalability. We quantify the effects of the storage bottleneck by providing a storage-bounded speedup, defining the storage wall quantitatively, presenting existence theorems for the storage wall, and classifying system architectures depending on I/O performance variation. We analyze and extrapolate the existence of the storage wall by experiments on Tianhe-1A and case studies on Jaguar. These results provide insights on how to alleviate the storage wall bottleneck in system design and achieve hardware/software optimizations in peta/exascale supercomputing.

Key words: Storage-bounded speedup, Storage wall, High performance computing, Exascale computing
http://dx.doi.org/10.1631/FITEE.1601336
CLC number: TP338.6

1 Introduction

With the developments in system architecture, the supercomputer has witnessed great advances in system performance. In the TOP500 list of the world's most powerful supercomputers in 2015, Tianhe-2 was ranked as the world's No. 1 system with a performance of 33.86 petaflop/s (TOP500, 2015). It will not be long before supercomputers reach exascale performance (10^3 petaflop/s) (HPCwire, 2010).

Supercomputers provide important support for scientific applications; meanwhile, the demands of these applications stimulate the development of supercomputers. In particular, the rise in I/O-intensive applications, which are common in the fields of genetic science, geophysics, nuclear physics, atmospherics, etc., brings new challenges to the performance, scalability, and usability of a supercomputer's storage system.

In supercomputing, there exists a performance gap between compute and I/O. Actually, there are two causes of the mismatch between application requirements and I/O performance. First, since the 1980s, the read and write bandwidth of a disk has increased by only 10% to 20% per year, while the average growth in processor performance has reached 60%. Second, the compute system scales much faster than the I/O system. Thus, as I/O-intensive applications expand at a very fast pace, supercomputers have to confront a scalability bottleneck due to the competition for storage resources.

The storage bottleneck is well known in the supercomputer community. Many studies have explored the challenges of exascale computing, including the storage bottleneck (Cappello et al., 2009; Agerwala, 2010; Shalf et al., 2011; Lucas et al., 2014).

‡ Corresponding author
* Project supported by the National Natural Science Foundation of China (Nos. 61272141 and 61120106005) and the National High-Tech R&D Program (863) of China (No. 2012AA01A301)
ORCID: Wei HU, http://orcid.org/0000-0002-8839-7748
© Zhejiang University and Springer-Verlag Berlin Heidelberg 2016

Some important and basic issues related to the storage bottleneck, e.g., quantitative description, inherent laws, and system scalability, have not been solved yet. This work seeks to solve these problems and to provide more insights in this area. To achieve this goal, we present the 'storage wall' theory to analyze and quantify the storage bottleneck challenge, and focus on the effects of the storage wall on the scalability of parallel supercomputing. Then, we present some experiments and case studies to provide insights on how to mitigate storage wall effects when building scalable parallel applications and systems at peta/exascale levels. Our main contributions can be summarized as follows:

1. Based on the analysis of I/O characteristics in parallel applications, we present a storage-bounded speedup model to describe the system scalability under the storage constraint.

2. We define the notion of a 'storage wall' formally for the first time to quantitatively interpret the storage bottleneck issue for supercomputers. We also propose an existence theorem and a variation law of the storage wall to reflect the degree of system scalability affected by storage performance.

3. We identify the key factor for the storage wall, called the 'storage workload factor'. Based on this factor we present an architecture classification method, which can reveal the system scalability under the storage constraint for each architecture category. Then, we extend the storage workload factor and find the system parameters that affect scalability. We categorize mainstream storage architectures in supercomputers according to the above scalability classification method and reveal their scalability features by taking the storage wall into account. These works can help system designers and application developers understand and find solutions to current performance issues in hardware, software, and architecture.

4. Through a series of experiments on Tianhe-1A by using IOR benchmarks and case studies on Jaguar by using parallel applications with checkpointing, we verify the existence of the storage wall. These results and analyses can provide effective insights on how to alleviate or avoid storage wall bottlenecks in system design and optimization in peta/exascale supercomputing.

2 Related work

Speedup has long been the exclusive measurement for performance scalability in parallel computing (Culler et al., 1998). Various speedup models have been proposed.

In Amdahl (1967), a speedup model (Amdahl's law) for a fixed-size problem was advocated:

S_Amdahl = 1 / (f + (1 − f)/P),

where P is the number of processors and f the ratio of the sequential portion in one program. Amdahl's law can serve as a guide to how much an enhancement will improve performance and how to distribute resources to improve cost-performance (Hennessy and Patterson, 2011). Nevertheless, the equation reveals a pessimistic view on the usefulness of large-scale parallel computers because the maximum speedup cannot exceed 1/f.

Gustafson (1988) introduced a scaled speedup for a fixed-time problem (Gustafson's law), which scales up the workload with the increasing number of processors to preserve the execution time:

S_Gustafson = f + (1 − f)P.

This speedup proves the scalability of parallel computers. Compared to Amdahl's law, Gustafson's law fixes the response time and is concerned with how large a problem could be solved within that time. It overcomes the shortcomings of Amdahl's law.

Sun and Ni (1993) presented a memory-bounded speedup model, which scales up the workload according to the memory capacity of the system, to demonstrate the relationship among memory capacity, parallel workload, and speedup. This speedup is especially suitable for supercomputers with distributed-memory multiprocessors, because the memory available determines the size of the scalable problem.

Culler et al. (1998) proposed a speedup model taking the communication and synchronization overhead into account for the first time. Based on this model, researchers can improve system performance by reducing the communication and synchronization overheads, such as overlapping communication and computing and load balancing.

In addition to the above speedup models, there are some other related studies. Wang (2009) provided a reliability speedup metric for parallel applications with checkpointing and studied the reliability problem of supercomputers from the view of scalability. Furthermore, Yang et al. (2011; 2012) introduced for the first time the concept of a reliability wall based on reliability speedup. Their work is most relevant to ours. They provided insights on how to mitigate reliability wall effects when building reliable and scalable parallel applications and systems at peta/exascale levels. Here, we establish the storage wall related theory to investigate storage challenges for current and future large-scale supercomputers.

Chen et al. (2016) provided a topology partitioning methodology to reduce the static energy of supercomputer interconnection networks. This is a new exploration to solve the energy problem for future exascale supercomputers.

In a supercomputer system, the performance of the storage system significantly affects the system scalability of a supercomputer. In this work, we develop a storage wall framework based on our previous work (Hu et al., 2016) to analyze and quantify the effects of storage performance on the scalability of the supercomputer.
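The two classical laws recalled above can be written down in a few lines. The following is a minimal sketch of ours (not from the cited works); the function names and the example value of f are illustrative assumptions.

```python
def amdahl_speedup(f: float, p: int) -> float:
    """Amdahl's law: fixed-size speedup, bounded above by 1/f."""
    return 1.0 / (f + (1.0 - f) / p)

def gustafson_speedup(f: float, p: int) -> float:
    """Gustafson's law: fixed-time (scaled) speedup, grows linearly in P."""
    return f + (1.0 - f) * p

if __name__ == "__main__":
    f = 0.05  # 5% sequential portion (illustrative value)
    for p in (10, 100, 1000, 10000):
        print(p, round(amdahl_speedup(f, p), 1), round(gustafson_speedup(f, p), 1))
```

With f = 0.05, Amdahl's speedup saturates near 1/f = 20 while Gustafson's speedup keeps growing with P, which is why Section 4.2 later instantiates the compute speedup with S_Gustafson when studying the storage wall.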
3 I/O characteristics of parallel applications

Large-scale scientific parallel applications running on supercomputers are mostly numerical simulation programs, and they have some distinctive I/O features as follows:

1. Periodicity: Scientific applications are usually composed of two stages, i.e., the compute phases and the I/O phases (Miller and Katz, 1991; Pasquale and Polyzos, 1993; Wang et al., 2004; Liu et al., 2014; Kim and Gunasekaran, 2015). For many I/O-intensive applications, write operations are usually the dominant operations in the I/O phase. For example, checkpointing can be seen as an I/O-intensive application and usually periodically uses write operations to save the state of the compute nodes (Hargrove and Duell, 2006; Bent et al., 2009).

2. Burstiness: Scientific applications not only have periodic I/O phases, but also perform I/O in short bursts (Kim et al., 2010; Carns et al., 2011; Liu et al., 2014; Kim and Gunasekaran, 2015), particularly with write operations (Wang et al., 2004).

3. Synchronization: The parallel patterns of applications often determine parallel I/O access patterns (Lu, 1999). In single program multiple data (SPMD) or multiple program multiple data (MPMD) applications, more than 70% of I/O operations are synchronous (Kotz and Nieuwejaar, 1994; Purakayastha et al., 1995). For instance, applications in earth science and climatology (Carns et al., 2011) consist of nearly identical jobs, which usually lead to a large number of synchronous I/O operations among processes (Xie et al., 2012).

4. Repeatability: A number of jobs tend to run many times using different inputs, and thus they exhibit a repetitive I/O signature (Liu et al., 2014).

According to these I/O characteristics, we will introduce a storage-bounded speedup model in Section 4 to pave the way for the storage wall definition.

4 Storage wall theory

Since processors are evolving rapidly, the increase in computing power has enabled the creation of important scientific applications. As parallel applications increase gradually in scale, their performance is often confined by the storage performance. This significantly affects supercomputer scalability. To analyze and quantify this effect, we first define the storage-bounded speedup, and then provide the storage wall theory to elaborate on supercomputer scalability under the storage performance constraint.

4.1 Defining storage-bounded speedup

Storage-bounded speedup is built to quantify system scalability considering the storage performance constraint. Prior to defining storage-bounded speedup, we assume that one process runs on a single-core processor in a parallel system, and we treat a multi-core processor system as a system which consists of multiple single-core processors. Thus, given that a parallel system consists of P single-core processors, the running processes own the same number. For clarity, let each node denote each process. The given parallel system W has P nodes in total, and here we denote the ith node as N_i (i = 0, 1, ..., P−1). All the symbols used in this paper are listed in Table 1.

Table 1 List of main symbols and the related specifications

Symbol | Description | Definition
N_i | The ith node in the parallel system (i = 0, 1, ..., P−1) | Section 4.1
P | The number of nodes with a single-core processor | Section 4.1
W | A parallel system with P nodes | Section 4.1
Q | A parallel program with P processes | Section 4.1
S_Amdahl | Amdahl's speedup | Section 2
S_Gustafson | Gustafson's speedup | Section 2
S_P | Compute speedup | Section 4.1
N^s | The number of I/O phases when Q runs on a single node | Eq. (1)
N^c | The number of compute phases when Q runs on a single node | Section 4.1
I^sint | Average time of all compute phases when Q runs on a single node | Eq. (2)
I^sser | Average time of all I/O phases when Q runs on a single node | Eq. (1)
T^snon | Execution time of all compute phases when Q runs on a single node | Eq. (1)
T^sio | Execution time of all I/O phases when Q runs on a single node | Section 4.1
T_Sto^S | Execution time of program Q running on a node serially | Eq. (1)
N_i^p | The number of I/O phases on the ith node | Section 4.1
I_i^pint | Average time of all the compute phases on the ith node | Section 4.1
I_i^pser | Average time of all the I/O phases on the ith node | Section 4.1
T_i^pnon | All the compute time on the ith node | Section 4.1
N^p | Average of N_i^p for each process when program Q is running on P nodes | Eq. (3)
I^pint | Average of I_i^pint for each process when program Q is running on P nodes | Eq. (4)
I^pser | Average of I_i^pser for each process when program Q is running on P nodes | Eq. (3)
T^pnon | The time for the compute parts when program Q is running on P nodes | Eq. (3)
T_i^pio | The I/O time on the ith node | Section 4.1
T^pio | The time for the I/O parts when program Q is running on P nodes | Section 4.1
T_Sto^P | Execution time of program Q on P nodes | Eq. (3)
S_Sto^P | Storage-bounded speedup | Eq. (5)
N | I^pser/I^pint | Eq. (6)
M | I^sser/I^sint | Eq. (6)
O(P) | Storage workload factor | Eq. (7)
S_Amdahl-Sto^P | Amdahl storage-bounded speedup | Eq. (9)
S_Gustafson-Sto^P | Gustafson storage-bounded speedup | Eq. (10)
sup S_Sto^P | Storage wall | Section 4.2
α | Memory hit rate | Section 5.1
D | Average amount of data for the I/O phases for all processes | Section 5.1
V_mem | Average bandwidth for the memory access on each node | Section 5.1
V_out-mem | Average I/O bandwidth per node | Section 5.1
CDP architecture | Centralized, distributed, and parallel I/O architecture | Section 5.2.3
β | Absorbing ratio | Section 5.2.4
V_burst | Average bandwidth per node provided by burst buffers | Section 5.2.4
S_G-Sto-fullckp^P | Storage-bounded speedup for parallel applications with full-memory checkpointing | Eq. (19)
S_G-Sto-incrckp^P | Storage-bounded speedup for parallel applications with incremental-memory checkpointing | Eq. (20)
P_opt | Optimal parallel size at which parallel applications obtain the largest storage-bounded speedup | Section 6.2

For a given program Q, the execution process consists of two types of phases: the compute phase and the I/O phase (Miller and Katz, 1991; Pasquale and Polyzos, 1993; Wang et al., 2004; Liu et al., 2014; Kim and Gunasekaran, 2015). In this study, the compute phases also denote I/O-free phases. In Fig. 1, we use shaded parts to denote compute phases and white parts to denote I/O phases. Fig. 1a shows a program Q running on a node serially, where N_0 denotes the process and T_Sto^S stands for the execution time. Fig. 1b displays the case where program Q is running on a P-node system in parallel, where T_Sto^P denotes the practical execution time.

In the following paragraphs we analyze the execution time of a serial or parallel program Q running on one or multiple nodes.

In the serial program situation (Fig. 1a), the total execution time T_Sto^S is obviously the summation of the times of all the compute phases and I/O phases, i.e., T_Sto^S = T^snon + T^sio, where T^snon denotes the total time of all compute phases and T^sio the total time of all I/O phases.

Fig. 1 Execution state of a program on a P-node system: (a) single process; (b) parallel processes

Let N^s be the number of I/O phases and I^sser the average time of all I/O phases. We can obtain T^sio = N^s I^sser, and thus T_Sto^S can be described as

T_Sto^S = T^snon + N^s I^sser.  (1)

Let N^c be the number of compute phases and I^sint be the average time of all compute phases. Thus, the total execution time of all compute phases is T^snon = N^c I^sint. The I/O load of a program consists mainly of reading raw data, storing intermediate data, and writing final results, to name a few tasks. Both compute and I/O phases happen alternately, and the number of I/O phases N^s is usually one larger than that of the compute phases N^c, i.e., N^s = N^c + 1 and T^snon = N^c I^sint = (N^s − 1) I^sint (Fig. 1). Therefore, N^s = T^snon/I^sint + 1. Then, we can rewrite Eq. (1) as

T_Sto^S = T^snon + (T^snon/I^sint) I^sser + I^sser.  (2)

Before analyzing the execution time of a parallel program Q running on multiple nodes, two assumptions need to be described in detail. The first is that the load, especially the I/O load, is balanced among all nodes. The scalability of a large-scale parallel application can be affected by many factors, such as the application scaling model, architecture type, storage performance, and load imbalance (Culler et al., 1998; Gamblin et al., 2008). Due to the complexity of the problem, many studies generally focus on the impact of a certain factor, assuming that the other factors are fixed. Examples include Amdahl's law (Amdahl, 1967) and Gustafson's law (Gustafson, 1988), which came to their conclusions in ideal conditions, including load balance. In this work, we discuss mainly the influence of storage performance on the scalability of a parallel application, assuming that the workload is balanced when the application scales.

The condition we study here is similar to Gustafson's law, in which the load (including the compute load and I/O load) increases continuously with application scaling (Gustafson, 1988). In this procedure, if the total load increases but the increase of the I/O load is disproportionate to the increase of the compute load, the application scalability may suffer from additional effects. To be specific, if the I/O load increment is less than that of the compute load, it would not be sufficient to detect the impact of storage performance on application scalability; if the I/O load increment is more than that of the compute load, the application scalability will be further deteriorated by a rapid I/O load increase. The primary objective of this work is to discuss the influence of storage performance on the scalability of parallel applications. To exclude the influence of other factors, such as an irregular expansion of the application load, here we assume that the I/O load is in proportion to the compute load, and both loads increase with the increase of the number of compute nodes.

In the parallel program situation, it is non-trivial to analyze the execution time. In this work, we focus on the statistical significance of the execution time. That is, many parameters in the parallel running situation, such as the number of I/O phases, are values on average. As a matter of fact, large-scale parallel applications may encounter barriers, and the slowest node determines the execution time in this condition. The difference between the average time and the worst-case time due to the barrier is mainly the global synchronization overhead. For parallel applications, the synchronization overhead is affected mainly by load imbalance (Culler et al., 1998). In the situation of load balance, all the processes can reach the barrier almost simultaneously. As described by the first assumption above, this research is established under the assumption of load balance, and the global synchronization overhead is small and can be ignored here.
From a statistical perspective, using the average time rather than a particular runtime can effectively reflect the trend in application execution time and help eliminate the bias caused by casual/random factors, thus improving the validity and reliability of this study. For these reasons, we use the average time to represent the time of the compute phase and the I/O phase to simplify the storage-bounded speedup model, which is reasonable and in line with the assumptions above.

Here we analyze the execution time of the parallel program Q running on a P-node system W. Let T_Sto^P denote the total execution time of the parallel program. Similar to the serial program situation, T_Sto^P also consists of two parts, i.e., the time for the compute and I/O parts, denoted by T^pnon and T^pio, respectively. Thus, T_Sto^P ≈ T^pnon + T^pio. Here we use approximation instead of equality. The approximation is reasonable and valid for the following theory because all parameters are defined on the statistical values discussed above.

For the given parallel program Q that runs on P nodes, we use N_i^p to denote the number of I/O phases on the ith node, and I_i^pser to denote the average time of all the I/O phases on the ith node. So, the I/O time on the ith node is T_i^pio = N_i^p I_i^pser. Further, we define N^p = Σ_{i=0}^{P−1} N_i^p / P as the average of N_i^p for each process, and I^pser = Σ_{i=0}^{P−1} I_i^pser / P as the average of I_i^pser for each process. So, we obtain T^pio = Σ_{i=0}^{P−1} (N_i^p I_i^pser) / P. Since most parallel scientific applications have I/O features such as periodicity and synchronization, for applications running on a large number of nodes, whose I/O load is balanced and scales in proportion to the compute load, the numbers of I/O phases are approximately equal among all processes and they are all approximately equal to N^p. This means N_i^p ≈ N^p. So, we can obtain T^pio = Σ_{i=0}^{P−1} (N_i^p I_i^pser) / P ≈ N^p Σ_{i=0}^{P−1} (I_i^pser / P) = N^p I^pser.

Analogous with Eq. (1), the execution time of program Q running on P nodes is

T_Sto^P ≈ T^pnon + N^p I^pser.  (3)

Similarly, we use I_i^pint to signify the average time of all the compute phases on the ith node, and define I^pint = Σ_{i=0}^{P−1} I_i^pint / P as the average of I_i^pint for each process. We use T_i^pnon to stand for all the compute time on the ith node, and T^pnon for its average over all processes. As stated above, the number of I/O phases is one more than that of the compute phases on each node. Thus, we have N^p = T^pnon/I^pint + 1 and Eq. (3) becomes

T_Sto^P ≈ T^pnon + (T^pnon/I^pint) I^pser + I^pser.  (4)

In terms of the above analysis, we define the storage-bounded speedup to analyze the scalability variation under the storage performance constraint.

Definition 1 (Storage-bounded speedup) When a parallel program Q runs on a parallel system W with P nodes, the storage-bounded speedup can be defined as follows:

S_Sto^P = T_Sto^S / T_Sto^P.  (5)

According to Eqs. (2) and (4), suppose that S_P is the speedup of the compute part. Meanwhile, we define N = I^pser/I^pint and M = I^sser/I^sint. Thus, the storage-bounded speedup can be described as

S_Sto^P = T_Sto^S / T_Sto^P
        ≈ [T^snon + (T^snon/I^sint) I^sser + I^sser] / [T^pnon + (T^pnon/I^pint) I^pser + I^pser]
        ≤ [T^snon (1 + I^sser/I^sint)] / [T^pnon (1 + I^pser/I^pint)]
        = S_P (1 + M)/(1 + N) = S_P · 1/(1 + (N − M)/(1 + M)).  (6)

Here, we leave the derivations of Eq. (6) to the Appendix. In Eq. (6), we set the upper bound of the right-hand side of the approximation as the value of the storage-bounded speedup. Since the storage-bounded speedup is used to quantify the scalability under the storage performance constraint, for long-running applications, the error caused by the upper bound is negligible and will not affect the correctness of the results analyzed below.

Let us define

O(P) = (N − M)/(1 + M).  (7)

O(P) reflects the I/O execution time variation with the rise in the application load, and in this sense, we may call it the 'storage workload factor'.

Thus, the final form of the storage-bounded speedup becomes

S_Sto^P = S_P · 1/(1 + O(P)).  (8)

Eq. (8) implies how the I/O performance variation affects the speedup with the increase of the number of processors. Thus, it can reflect the scalability of parallel systems under the storage constraint.

In Eq. (8), if S_P is instantiated by Amdahl's speedup and Gustafson's speedup, respectively, we can obtain the Amdahl storage-bounded speedup as

S_Amdahl-Sto^P = S_Amdahl · 1/(1 + O(P)) = P/(1 + f(P − 1)) · 1/(1 + O(P)),  (9)

and the Gustafson storage-bounded speedup as

S_Gustafson-Sto^P = S_Gustafson · 1/(1 + O(P)) = (f + P(1 − f)) · 1/(1 + O(P)).  (10)

In both Amdahl's law and Gustafson's law, there is only one parameter f; namely, the ratio of the sequential portion in one program is used to characterize the scalability. Similarly, in this study, Eqs. (8), (9), and (10) use f to characterize the scalability of the compute part of the application. The difference between the earlier works, i.e., Amdahl's law and Gustafson's law, and ours is that our storage-bounded speedup takes into account the storage workload factor O(P), which reflects the effect of the storage system on the whole system scalability, and this has not been studied before. For instance, both laws yield the highest speedup when f = 0. Actually, both claims may not be true in practice because I/O effects may exist. Due to the competition for I/O resources, I/O service time is prolonged in parallel computing, resulting in O(P) ≠ 0. From this perspective, the storage-bounded speedup is more reasonable for practical applications, which is our main motivation for this study. Ideally, if O(P) = 0, our two storage-bounded speedup models, i.e., Eqs. (9) and (10), are simplified into the traditional Amdahl's and Gustafson's speedup models, respectively, and the system scalability is not affected by the storage performance.
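To make the model concrete, the following is a small illustrative sketch of ours (not part of the paper) that evaluates Eqs. (8)–(10) for a user-supplied storage workload factor O(P); the function names and the example O(P) shape are assumptions for illustration only.

```python
from typing import Callable

def amdahl(f: float, p: int) -> float:
    """Compute speedup under Amdahl's law (fixed-size problem)."""
    return p / (1.0 + f * (p - 1))

def gustafson(f: float, p: int) -> float:
    """Compute speedup under Gustafson's law (fixed-time problem)."""
    return f + p * (1.0 - f)

def storage_bounded(speedup: Callable[[float, int], float],
                    f: float, p: int,
                    o: Callable[[int], float]) -> float:
    """Eq. (8): S_Sto^P = S_P / (1 + O(P))."""
    return speedup(f, p) / (1.0 + o(p))

# Example: an O(P) that grows linearly with P (an assumed shape)
o_linear = lambda p: 1e-3 * p

for p in (100, 1000, 10000):
    print(p,
          round(storage_bounded(amdahl, 0.01, p, o_linear), 1),     # Eq. (9)
          round(storage_bounded(gustafson, 0.01, p, o_linear), 1))  # Eq. (10)
```

With a linearly growing O(P), even the Gustafson storage-bounded speedup levels off instead of growing without bound, which is exactly the behavior the storage wall definition in Section 4.2 captures.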
4.2 Defining the storage wall

Definition 2 (Storage wall) When a program Q runs on a parallel system which satisfies lim_{P→∞} S_P = ∞, the storage wall is the supremum of the storage-bounded speedup S_Sto^P, denoted as sup S_Sto^P.

Theorem 1 When a program Q runs on a parallel system W which satisfies lim_{P→∞} S_P = ∞, the storage wall exists if and only if lim_{P→∞} S_Sto^P ≠ ∞.

Proof According to Definition 2, if the storage wall exists, the supremum of the storage-bounded speedup S_Sto^P exists, i.e., sup S_Sto^P = R (R is a positive constant). Then, S_Sto^P ≤ sup S_Sto^P. Therefore, lim_{P→∞} S_Sto^P ≠ ∞. Conversely, if lim_{P→∞} S_Sto^P ≠ ∞, then lim_{P→∞} S_Sto^P = R (R ∈ Q+) holds. Then, ∀ε > 0, ∃E ∈ Z+, which makes |S_Sto^P − R| < ε when P > E. Thus, R − ε < S_Sto^P < R + ε when P > E, so S_Sto^P has an upper bound. Based on the supremum principle (Rudin, 1976), the supremum of S_Sto^P exists, and the storage wall exists for program Q owing to Definition 2.

An important implication of Theorem 1 is as follows. If lim_{P→∞} S_P ≠ ∞, then lim_{P→∞} S_Sto^P ≠ ∞ as well, since S_Sto^P < S_P; thus, the storage wall always exists. For example, Amdahl's speedup meets this condition and will always have a storage wall. So, in the rest of this paper, we consider the condition lim_{P→∞} S_P = ∞ and replace S_P with S_Gustafson.

4.3 System classification

According to Eq. (8), it is easy to observe that O(P) is the key factor in determining the existence of the storage wall. Due to different O(P), the system storage-bounded speedup has different trends. To better analyze the laws of the storage wall, we categorize parallel systems into two types according to the characteristic of O(P).

Before the classification, we introduce the following symbol conventions:
1. f(x) ≻ g(x) if lim_{x→∞} f(x)/g(x) = ∞, and f(x) ⪰ g(x) if lim_{x→∞} f(x)/g(x) is a positive constant or ∞.
2. f(x) ≺ g(x) if lim_{x→∞} f(x)/g(x) = 0, and f(x) ⪯ g(x) if lim_{x→∞} f(x)/g(x) is a non-negative constant.
3. Suppose Θ(x) is the set of all functions over x, f(x) ∈ Θ(x) and g(x) ∈ Θ(x). If lim_{x→∞} f(x)/g(x) is a positive constant, then f(x) = Θ(g(x)) or g(x) = Θ(f(x)).

Definition 3 (Constant and incremental systems) Suppose that a program Q satisfying lim_{P→∞} S_P = ∞ runs on a system W. If O(P) ⪯ Θ(1), system W is considered to be a constant system. If O(P) ≻ Θ(1), system W is considered to be an incremental system.

Fig. 2 shows three examples for the classification based on O(P). If O(P) = K ⪯ Θ(1), where K is a positive constant, system W is a constant system. According to Eq. (10), we find that

lim_{P→∞} S_Gustafson-Sto^P = lim_{P→∞} (f + P(1 − f)) · 1/(1 + K) = ∞.

By Theorem 1, there is no storage wall in this system. If O(P) = KP ln P ≻ Θ(1), system W is an incremental system. Similarly, we find that

lim_{P→∞} S_Gustafson-Sto^P = lim_{P→∞} (f + P(1 − f)) · 1/(1 + KP ln P) = 0.

In this situation, the storage wall exists.

Fig. 2 Three examples for storage-bounded speedup and the storage wall of different types of systems based on O(P) (K is a constant in each function). When O(P) = K ⪯ Θ(1), W is a constant system and has no storage wall; when O(P) = K ln P ≻ Θ(1), W is an incremental system and has no storage wall; when O(P) = KP ln P ≻ Θ(1), W is an incremental system and has a storage wall

Now we introduce another existence theorem of the storage wall as follows:

Theorem 2 When a program Q that satisfies lim_{P→∞} S_P = ∞ runs on a system W, the storage wall exists if and only if O(P) ⪰ S_P.

Proof According to Theorem 1, if the storage wall exists, lim_{P→∞} S_Sto^P = lim_{P→∞} S_P/(1 + O(P)) ≠ ∞. Obviously, O(P) ⪰ S_P holds. Conversely, if O(P) ⪰ S_P, then lim_{P→∞} S_Sto^P = lim_{P→∞} S_P/(1 + O(P)) < ∞. Thus, lim_{P→∞} S_Sto^P ≠ ∞. So, the storage wall exists for program Q on system W by Theorem 1.

Thus, for a constant system which satisfies lim_{P→∞} S_P = ∞, the storage wall does not exist because O(P) ⪯ Θ(1) ≺ S_P.

On the other hand, for an incremental system, there are two cases. If O(P) ⪰ S_P, the storage wall exists by Theorem 2. However, if Θ(1) ≺ O(P) ≺ S_P, the storage wall does not exist, which is similar to a constant system.
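The classification can also be checked numerically. Below is a small sketch of ours (K and f are arbitrary illustrative constants) that evaluates the Gustafson storage-bounded speedup for the three O(P) shapes used in Fig. 2 and reports whether the speedup keeps growing or flattens out.

```python
import math

def s_gustafson_sto(f: float, p: int, o_p: float) -> float:
    """Eq. (10): Gustafson storage-bounded speedup."""
    return (f + p * (1.0 - f)) / (1.0 + o_p)

K, f = 1e-3, 0.05  # assumed constants for illustration
shapes = {
    "O(P)=K":        lambda p: K,
    "O(P)=K ln P":   lambda p: K * math.log(p),
    "O(P)=K P ln P": lambda p: K * p * math.log(p),
}

for name, o in shapes.items():
    values = [s_gustafson_sto(f, p, o(p)) for p in (10**3, 10**4, 10**5, 10**6)]
    trend = "keeps growing" if values[-1] > values[-2] > values[-3] else "saturates or declines"
    print(f"{name:14s} -> {[round(v) for v in values]} ({trend})")
```

Only the O(P) = KP ln P case, where O(P) ⪰ S_Gustafson = Θ(P), produces a bounded speedup, matching Theorem 2.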

5 System analyses according to I/O architecture

The architecture determines the fundamentals of the supercomputer's hardware and software, and also the scalability of the supercomputer. In this section, we analyze the storage wall in terms of storage architecture, while we discuss the storage wall challenge based on large-scale parallel applications in Section 6. To be specific, the factors that impact the storage wall are analyzed in Section 5.1 and the storage wall related to different storage architectures is investigated in Section 5.2.

5.1 Factor analysis of the storage wall

In Section 3, we presented four I/O characteristics of scientific applications, i.e., periodicity, burstiness, synchronization, and repeatability. Based on these characteristics, we make the following assumptions to identify the key factors that influence the existence of the storage wall:

1. Suppose that the P processes of program Q running on system W are synchronized and analogous, which is in line with the application characteristics detailed in Section 3. Meanwhile, the compute nodes of most supercomputers have the same configurations. So, the P processes running on the nodes have approximately the same I/O behavior.

2. Suppose that the parallel workload, including compute and I/O loads, increases proportionally with the number of processes. Ideally, the load of the compute phases is approximately constant for each process even if the number of processes increases. So, we assume both I_i^pint and I^sint to be equivalent in the following analysis.

3. Suppose that the memory hit rate on every processor is approximately equal, denoted as α. The average data amount of the I/O phases for each process is considered to be approximately equal, denoted by D. The average bandwidth for memory access on each node is approximately equal, denoted as V_mem. On the other hand, the I/O bandwidth on each compute node is approximately the same when the memory access misses, denoted by V_out-mem.

Many representative applications, e.g., those in earth science, nuclear physics, and climatology, are in line with the above assumptions. These applications consist of nearly identical jobs running on compute nodes of identical architectures (Carns et al., 2011). Besides, the checkpoint/restart mechanism as a synchronous I/O workload also fully satisfies the above assumptions. This mechanism is used to protect parallel applications against failure by storing the application state to persistent storage. Even if a failure occurs, the application can still restart from the most recent checkpoint (Bent et al., 2009).

By the above assumptions, from the view of a compute node, we transform the storage workload factor into

O(P) = (N − M)/(1 + M) = (I^pser/I^pint − I^sser/I^sint) / (1 + I^sser/I^sint)
     = [I^sint (I^pser/I^pint) − I^sser] / (I^sint + I^sser)
     = (1/C)(I^pser − I^sser)
     = (1/C)[D/V_mem + (1 − α)D/V_out-mem − I^sser],  (11)

where C = I^sint + I^sser is a positive constant. Hence, we obtain

S_Sto^P = S_P · 1/(1 + O(P)) = S_P · 1/[(1/C)(D/V_mem + (1 − α)D/V_out-mem + I^sint)].  (12)

Based on Eq. (12) and Theorems 1 and 2, we derive the following corollary:

Corollary 1 When a program Q that satisfies lim_{P→∞} S_P = ∞ runs on a system W, if (1 − α)D/V_out-mem ⪰ Θ(P), the storage wall exists.

Proof In Eq. (12), C, I^sint, and V_mem are constants. Let S_P be S_Gustafson = Θ(P). If (1 − α)D/V_out-mem ⪰ Θ(P), then P / [(1 − α)D/V_out-mem] has a supremum. Since Eq. (12) can be converted to

S_Sto^P = Θ( P / [(1/C)(D/V_mem + (1 − α)D/V_out-mem + I^sint)] ),

S_Sto^P also has a supremum. According to Definition 2, when (1 − α)D/V_out-mem ⪰ Θ(P) holds, the storage wall exists.

According to Corollary 1, when a program Q satisfying lim_{P→∞} S_P = ∞ runs on a system W, the factors affecting the storage wall can be summarized as follows:

1. D and I^sint: D represents the average amount of data of the I/O phases for all processes. We use D_i to denote the average amount of data of the I/O phases on the ith node, and thus D = Σ_{i=0}^{P−1} D_i / P. According to the assumptions, I^sint represents the average time of the compute phases no matter how the program runs (serially or in parallel). D and I^sint reflect the I/O characteristics of parallel programs, which distinguish compute-intensive programs from I/O-intensive ones. Table 2 shows the basic characteristics of compute-intensive and I/O-intensive programs related to D and I^sint. The larger the D and the smaller the I^sint, the more intensive the I/O workload.

Table 2 I/O characteristics of parallel programs

D | I^sint | Compute-intensive | I/O-intensive
Large | Large | √ | √
Small | Small | √ | √
Large | Small |  | √
Small | Large | √ |

2. α and V_mem: α represents the average memory hit rate on every processor when the process performs I/O operations. It essentially reflects the hierarchical storage architecture. V_mem represents the average bandwidth of the memory access on each node. According to the performance of dynamic random access memory (DRAM), memory provides better access bandwidth but confronts data volatility and small capacity due to the high price. Improving the memory hit rate is a hot topic for enhancing I/O performance. Many techniques, such as data prefetching and cooperative caching, have been proposed to achieve this purpose. Meanwhile, a number of new media, such as FLASH and PCM, have been developed to replace DRAM or to join together with DRAM to compose hybrid memory. Both advanced media and advanced memory architecture can improve V_mem.

3. V_out-mem: It represents the average I/O bandwidth of each processor when the I/O requests miss in the memory. The average I/O access bandwidth is the ratio between the I/O data size and the I/O request time, which starts from initiation and ends with completion. V_out-mem plays a fundamental role in the storage wall because it is closely associated with storage devices, storage architecture, and I/O access patterns. For different storage architectures, the impact factors of V_out-mem include metadata management, data layout, and data management strategies, to name a few. For different I/O access patterns, V_out-mem is influenced by I/O request size, access density, burstiness, synchronization, and so on. As the processors scale up, the overhead for metadata access, communication collision, and data access competition will increase remarkably, and this leads to a significant decrease of V_out-mem.
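Eq. (12) lends itself to a direct numerical reading. The sketch below is ours; all parameter values are illustrative assumptions, not measurements from the paper. It computes O(P) and the storage-bounded speedup from the per-node factors D, α, V_mem, V_out-mem, and I^sint, where the per-node external bandwidth is allowed to shrink as the node count grows.

```python
def storage_workload_factor(D, alpha, v_mem, v_out_mem, i_sint, i_sser):
    """Eq. (11): O(P) from per-node I/O factors (C = I^sint + I^sser)."""
    c = i_sint + i_sser
    i_pser = D / v_mem + (1.0 - alpha) * D / v_out_mem
    return (i_pser - i_sser) / c

def s_sto(f, p, D, alpha, v_mem, v_out_mem, i_sint, i_sser):
    """Eq. (12) with S_P instantiated by Gustafson's speedup."""
    o_p = storage_workload_factor(D, alpha, v_mem, v_out_mem, i_sint, i_sser)
    return (f + p * (1.0 - f)) / (1.0 + o_p)

# Illustrative numbers only: 512 MB per I/O phase, 80% memory hit rate,
# 8 GB/s memory bandwidth, and an external bandwidth shared by all nodes.
D, alpha, v_mem = 512.0, 0.8, 8192.0    # MB, ratio, MB/s
i_sint, i_sser, f = 10.0, 1.0, 0.05     # s, s, sequential fraction
aggregate = 4000.0                      # MB/s shared external bandwidth (assumed)

for p in (64, 256, 1024, 4096):
    v_out = aggregate / p               # per-node share of the shared bandwidth
    print(p, round(s_sto(f, p, D, alpha, v_mem, v_out, i_sint, i_sser), 1))
```

Because the (1 − α)D/V_out-mem term grows like Θ(P) when the external bandwidth is a fixed shared resource, the printed speedup approaches a ceiling instead of growing linearly, which is the Corollary 1 condition in action.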

5.2 Storage wall analysis related to main supercomputer I/O architectures

According to the interconnecting relationship between compute and storage in supercomputer systems, the architecture of supercomputer storage systems can be divided into four categories: centralized architecture, distributed and parallel architecture, centralized, distributed, and parallel (CDP) architecture, and CDP architecture with burst buffers. Next, we analyze the storage wall characteristics of each I/O architecture.

5.2.1 Centralized I/O architecture

Fig. 3 shows a centralized I/O architecture, such as network attached storage (NAS). This architecture is often used in small-scale supercomputers, and it is easily configured and managed, but has poor scalability. It can merely scale up instead of scaling out because the storage system cannot horizontally expand to multiple servers but can only vertically enhance the performance of a single server. Therefore, the I/O architecture severely limits the performance of the storage system. Suppose that the bandwidth provided by the storage system is M_S. Although M_S concerns different I/O access patterns, there always exists a positive constant M so that M_S < M. Since this bounded aggregate bandwidth is shared by the P compute nodes, in synchronous I/O mode V_out-mem ⪯ Θ(M/P); by Eq. (11), O(P) ⪰ Θ(P) ⪰ S_P, and by Theorem 2 the storage wall exists in a supercomputer with a centralized I/O architecture. Fig. 4 shows a case of the storage wall for a supercomputer with a centralized I/O architecture.

Fig. 3 Centralized I/O architecture (CN: compute node)

Fig. 4 A case of a supercomputer with a centralized I/O architecture in which the storage wall exists

5.2.2 Distributed and parallel I/O architecture

Fig. 5 shows a system with a distributed and parallel I/O architecture. This supercomputer storage system usually refers to the architecture in which each compute node has built-in storage or a direct-attached storage server. Each compute node has a file system, which can provide I/O service to itself and other compute nodes. Due to the problems of I/O scheduling and global data consistency, the implementations of the parallel file system are complicated. In addition, the storage servers have different positions relative to the compute nodes, which are direct-attached or non-direct-attached; therefore, system load balance also becomes a problem. Hence, this architecture brings extra complexity in implementation and management. On the other hand, because the storage resources scale out together with the compute nodes, the aggregate I/O bandwidth grows with P and O(P) can remain bounded; as the case in Fig. 6 (O(P) = K) shows, the storage wall does not exist in such a system.

Fig. 5 Distributed and parallel I/O architecture (CN: compute node)

6 CN0 CN1 CN2 CN3 ... CNP )

3 4 O(P)=K Interconnection network (×10 Sto P

2 S S Storage Storage ... Storage (a) server server server 0 0246810 P (×103)

CN0 CN1 CN2 CN3 ... CNP Fig. 6 A case of a supercomputer with a distributed and parallel I/O architecture in which the storage wall does not exist Interconnection network

5.2.3 Centralized, distributed, and parallel architecture

Fig. 7 shows the CDP architectures, which dominate in TOP500 supercomputers (TOP500, 2015). Table 3 details the fastest 10 supercomputers in the TOP500, whose storage architectures are all CDP architectures. For medium-sized supercomputers (the number of compute nodes reaches 10^3 to 10^4), compute nodes and storage servers are connected directly through an interconnection network (Fig. 7a). A parallel storage system consists of storage servers usually equipped with a parallel file system, such as Lustre and PVFS, while compute nodes access the storage system through the client of the parallel file system. The typical systems include Titan, Tianhe-1A, and so on.

Fig. 7 Centralized, distributed, and parallel (CDP) architectures: (a) without I/O nodes; (b) with I/O nodes (CN: compute node; ION: I/O node)

With the development of supercomputers, I/O nodes (IONs) are inserted between compute nodes and parallel storage systems to provide I/O forwarding and management functions (Fig. 7b). On the one hand, the number of compute nodes can increase continually without the limitation of the client number of parallel file systems. On the other hand, I/O performance can be improved by I/O scheduling with the introduction of IONs in the systems. The typical systems are the IBM Bluegene series and Tianhe-2. Although the system has been improved by the I/O nodes, the essence of the I/O architecture remains unchanged. Thus, to simplify the model, the I/O node layer is omitted in the subsequent analysis.

In the CDP architecture, the storage system can not only scale up but also scale out. However, these systems still have some limitations: (1) For a specific supercomputer, the scale of the storage system is fixed within a certain time period; (2) The parallel file system usually has ceilings on the number of storage nodes, aggregate performance, and storage capacity. Here we assume that the aggregate bandwidth of the parallel file system is M_S, and there always exists a positive constant M so that M_S < M.

Table 3 Storage architecture of the fastest 10 computers in the TOP500 (TOP500, 2015)

TOP500 | System | Number of cores | Storage architecture | File system | I/O nodes used?
1 | Tianhe-2 | 3 120 000 | CDP | Hybrid hierarchy file system (H2FS) | Yes
2 | Titan | 560 640 | CDP | Lustre | No
3 | Sequoia | 1 572 864 | CDP | Lustre + zettabyte file system (ZFS) | Yes
4 | K computer | 705 024 | CDP | Fujitsu exabyte file system (FEFS) | No
5 | Mira | 786 432 | CDP | General parallel file system (GPFS) | Yes
6 | Piz Daint | 115 984 | CDP | Lustre | No
7 | Shaheen II | 196 608 | CDP | Lustre | No
8 | Stampede | 462 462 | CDP | Lustre | No
9 | JUQUEEN | 458 752 | CDP | GPFS | Yes
10 | Vulcan | 393 216 | CDP | Lustre + ZFS | Yes

As the number of compute nodes increases, the metadata access overhead and the access conflict of the storage system are seriously aggravated. Thus, in synchronous I/O mode for one compute node, we obtain V_out-mem ⪯ Θ(M/P). Since M is a constant factor, we can omit it and simplify the above formula to V_out-mem ⪯ Θ(1/P). According to Eq. (11), we obtain O(P) ⪰ Θ(P). Then we can find that the supercomputer with the CDP architecture is an incremental system by Definition 3. Meanwhile, when S_P = Θ(P), according to Theorem 2, we obtain O(P) ⪰ S_P, and this demonstrates that the storage wall exists in the supercomputer with the CDP architecture. Fig. 8 shows a case for a supercomputer with the CDP architecture.

Fig. 8 A case for a supercomputer with a centralized, distributed, and parallel (CDP) architecture in which the storage wall exists

5.2.4 Centralized, distributed, and parallel architecture with burst buffers

The I/O workloads in scientific applications are often characterized by periodicity and burstiness, especially for write operations, and this provides an opportunity to use a write buffer. Burst buffers (Liu et al., 2012; Sisilli, 2015; Wang et al., 2014; 2015) are large high-performance write buffering spaces inserted as a tier of the supercomputer storage using solid state drives (SSDs), non-volatile random access memory (NVRAM), and so on. By using burst buffers, scientific applications can flush data to a high-performance temporary buffer, and then overlap the computations that follow the I/O bursts with writing data from the burst buffer back to external storage. This buffering strategy can remove expensive parallel file system (PFS) write operations from the critical path of execution.

Under the CDP storage architecture, for a system without the I/O forwarding layer (like Titan), burst buffer nodes are typically located between the compute nodes and the PFS (Fig. 9a) (Wang et al., 2014; 2015). For a system with an I/O forwarding layer (the IBM Bluegene series) (Moreira et al., 2006; Ali et al., 2009), the burst buffers can be inserted into the I/O nodes, so the burst buffer layer is located at the same level as the I/O forwarding layer (Fig. 9b) (Liu et al., 2012). Despite these different designs and implementations, the burst buffers have the same functionality and usage in absorbing the write activity. Below we use the work of Liu et al. (2012) as a case to study the burst buffer related architecture; other works undergo an analogous analysis procedure.

According to the hierarchical CDP with a burst buffer architecture, Eqs. (11) and (12) can be rewritten as

O(P) = (1/C)(I^pser − I^sser) = (1/C)[D/V_mem + βD/V_burst + (1 − α − β)D/V_out-mem − I^sser]  (13)

and

S_Sto^P = S_P · 1/(1 + O(P)) = S_P · 1/[(1/C)(D/V_mem + βD/V_burst + (1 − α − β)D/V_out-mem + I^sint)].  (14)

Here, C = I^sint + I^sser is a positive constant, β is defined as a data absorbing ratio, and thus βD represents the amount of data that can be absorbed by the burst buffers for each node. V_burst denotes the average bandwidth for each node provided by the burst buffers. Since the burst buffers form an intermediate layer with the I/O nodes, they scale out with compute nodes in a fixed proportion horizontally (for example, the ratio between the number of compute nodes and the number of I/O nodes is 64:1). So, we can assume that β and V_burst are approximately equal on every processor as the processor number increases.

Likewise, we can reach a conclusion similar to Corollary 1:

Corollary 2 When a program Q that satisfies lim_{P→∞} S_P = ∞ runs on a system W (CDP with burst buffers), if D(1 − α − β)/V_out-mem ⪰ Θ(P), the storage wall exists.

The proof is omitted here because it is similar to that of Corollary 1. Based on Corollary 2, it can easily be found that the storage wall can be pushed forward or diminished for a storage system with a CDP architecture with burst buffers.
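The burst-buffer variant of the model only adds the (β, V_burst) term to the per-node I/O time, so the earlier factor sketch extends naturally. The following is an illustrative sketch of ours (all numbers are assumptions), contrasting β = 0 (no burst buffer), a partial absorbing ratio, and the ideal case β = 1 − α discussed below.

```python
def o_p_burst(D, alpha, beta, v_mem, v_burst, v_out_mem, i_sint, i_sser):
    """Eq. (13): storage workload factor for CDP with burst buffers."""
    c = i_sint + i_sser
    i_pser = (D / v_mem
              + beta * D / v_burst
              + (1.0 - alpha - beta) * D / v_out_mem)
    return (i_pser - i_sser) / c

def s_sto_burst(f, p, **kw):
    """Eq. (14) with S_P instantiated by Gustafson's speedup."""
    return (f + p * (1.0 - f)) / (1.0 + o_p_burst(**kw))

base = dict(D=512.0, alpha=0.8, v_mem=8192.0, v_burst=2048.0,
            i_sint=10.0, i_sser=0.1)           # illustrative values (MB, MB/s, s)
aggregate, f, p = 4000.0, 0.05, 4096           # shared external bandwidth (assumed)
v_out = aggregate / p                          # per-node share of external bandwidth

for beta in (0.0, 0.1, 1.0 - base["alpha"]):   # none, partial, ideal absorption
    print(round(beta, 2),
          round(s_sto_burst(f, p, beta=beta, v_out_mem=v_out, **base), 1))
```

In the ideal case β = 1 − α the external-storage term vanishes and the speedup returns almost to the plain Gustafson value, which is the dashed-line behavior in Fig. 10; partial absorption only pushes the wall outward.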

Fig. 9 Centralized, distributed, and parallel architecture with burst buffers without (a) and with (b) an I/O forwarding layer (CN: compute node; ION: I/O node)

The key factors for the effectiveness of burst buffers are V_burst and β. Burst buffers usually use nonvolatile media, which have better performance than disk-based storage. So, V_burst is a foundation such that burst buffers can play a role in dealing with the storage wall. This is closely related to the media itself. β is another key point, revealing how many write operations can be absorbed by the burst buffers. It is closely related to the performance and capacity of the burst buffers, and to the I/O patterns of the applications. Burst buffers can accelerate only write operations rather than read operations. Furthermore, they are also sensitive to the intensity of I/O write activity, the cost of flushing the burst buffers, and protocol interactions with the external storage system.

In the ideal case, if β = 1 − α, an application Q has only write activity, which can all be absorbed by the burst buffers. The storage-bounded speedup is then not restrained by the external storage performance, and the storage wall does not exist, as the dashed line in Fig. 10 shows. However, in real circumstances, the storage wall can be pushed forward because a lot of write operations are absorbed and hidden by the burst buffer layer, as the solid line in Fig. 10 shows.

Fig. 10 Two cases for a supercomputer with a centralized, distributed, and parallel architecture with burst buffers. In the ideal case (β = 1 − α), an application Q has only write activity which can all be absorbed by the burst buffers and the storage wall does not exist (dashed line); in real circumstances (β < 1 − α), the storage wall can be pushed forward because a lot of write operations are absorbed and hidden by the burst buffer layer (solid line)

6 Experiments

In our experiments, we used typical benchmarks and applications to verify, analyze, and extrapolate the existence of the storage wall for current representative systems and forthcoming exascale systems. The experiments have two goals: the first is to obtain the performance laws of storage systems by using benchmarks and to verify the existence of the storage wall in a specific I/O architecture (Section 6.1), and the second is to analyze a parallel application with checkpointing to verify the storage wall existence and seek the best application scale (Section 6.2). Our analyses also provide insights on how to mitigate storage wall effects in system design and optimize large-scale parallel applications.

6.1 Storage wall existence verification using an I/O benchmark

In this section, we concentrate on the typical parallel I/O operations, including the write and read operations, and obtain a further grip on the performance trends of storage systems along with the increase in program parallelism. Moreover, we preliminarily find the existence of the storage wall in the experiments.

The test platform of our experiment is the Tianhe-1A supercomputer (National Supercomputing Center in Tianjin, NSCC-TJ, Table 4), which has a typical CDP architecture with the Lustre parallel file system (Fig. 7a).

Table 4 Specifications of the Tianhe-1A (YH MPP)

Item | Configuration
CPU | Xeon X5670 6C 2.93 GHz
Number of compute nodes | 7168
Memory | 229 376 GB
Interconnect | Proprietary (optic-electronic hybrid fat-tree structure, point-to-point bandwidth 160 Gbps)
Storage file system | Lustre (Lustre × 4, total capacity 4 PB)
Operating system | Kylin Linux

We used the interleaved-or-random (IOR) high performance computing (HPC) benchmark (University of California, 2007) to generate the I/O operations. IOR can be used for testing the performance of parallel file systems using various access patterns and interfaces such as the portable operating system interface (POSIX), message passing interface (MPI)-IO, and hierarchical data format 5 (HDF5). Because Tianhe-1A is a production system, our experiments could not be extended to the whole system. Thus, we made a Lustre pool including four object storage targets (OSTs) on Lustre4 (a Lustre system of the Tianhe-1A), and this Lustre pool had no jobs during the maintenance time. The Lustre pool had the same configuration as the whole storage system and was a microcosm of it, which can also reveal the performance trends of large-scale systems with larger program parallelism.

In our experiments, first we obtained the scalability of the I/O performance for the typical I/O operations with the number of processes increasing. Then, we revealed the existence of the storage wall and the related factors.

Since the I/O requests of parallel applications have different data block sizes, we chose three typical block sizes, 4 KB, 2 MB, and 512 MB, which represent small, medium, and large I/O requests. In the experiments, we ran IOR using the POSIX application programming interface (API) and scaled the number of processes from 1 to 160, with one client per process. We set a few configurations to avoid the impact of the cache, by setting the O_DIRECT flag for POSIX to bypass the I/O buffers and setting the reorderTasks flag to avoid the read-after-write cache effect.

6.1.1 Scalability of I/O performance

Figs. 11–13 show the write and read performance variation trends along with the scaling of the number of processes in synchronous mode. Although the I/O block size is different in the three figures, they show the same trends in I/O performance as the number of processes increases.

Figs. 11a, 12a, and 13a consistently show that the aggregate bandwidth of the storage system initially keeps up with the increase in the number of processes, and then starts growing slowly or declining due to the performance limits and resource competition. Figs. 11b, 12b, and 13b show that the I/O bandwidth for each process has a sharply declining trend with a relatively low value. Figs. 11c, 12c, and 13c consistently demonstrate that the overhead time for read or write operations on each process always grows at a faster pace and achieves a large difference over the storage-bounded speedup.

From the results above, it is easy to see that a storage system with a CDP architecture, like that of the Tianhe-1A, has poor storage performance scalability, and falls behind the development of the computing power of a supercomputer.
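The measured trend, i.e., aggregate bandwidth that saturates while per-process bandwidth collapses, can be turned into an empirical storage workload factor. The sketch below is ours; the bandwidth numbers are made-up placeholders, not the measured Tianhe-1A data. It shows one way to estimate per-node V_out-mem from aggregate-bandwidth measurements and feed it back into Eq. (11).

```python
# Hypothetical (P, aggregate MB/s) pairs standing in for measurements like Figs. 11-13.
measurements = [(1, 95.0), (8, 620.0), (32, 1150.0), (96, 1300.0), (160, 1280.0)]

D, alpha = 2.0, 0.0          # 2 MB per I/O phase, cache bypassed (O_DIRECT), assumed
i_sint, i_sser = 0.5, 0.02   # assumed compute/I-O phase times on a single node (s)
c = i_sint + i_sser

for p, aggregate in measurements:
    v_out_mem = aggregate / p                 # per-process share of the pool bandwidth
    i_pser = (1.0 - alpha) * D / v_out_mem    # memory term dropped since alpha = 0
    o_p = (i_pser - i_sser) / c               # Eq. (11)
    print(f"P={p:4d}  V_out-mem={v_out_mem:7.2f} MB/s  O(P)={o_p:6.3f}")
```

Once the aggregate bandwidth stops scaling, V_out-mem behaves like Θ(1/P) and the estimated O(P) grows roughly linearly in P, which is the storage-wall signature observed in Fig. 14.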

Fig. 11 Write and read operation performances when the I/O block size is 4 KB: (a) aggregate bandwidth of the Lustre pool; (b) bandwidth per compute node; (c) time overhead per operation

Fig. 12 Write and read operation performances when the I/O block size is 2 MB: (a) aggregate bandwidth of the Lustre pool; (b) bandwidth per compute node; (c) time overhead per operation

Fig. 13 Write and read operation performances when the I/O block size is 512 MB: (a) aggregate bandwidth of the Lustre pool; (b) bandwidth per compute node; (c) time overhead per operation

6.1.2 Storage wall and the related factors

In this subsection, we reveal the existence of the storage wall in a system with a CDP architecture, and verify the impact factors. Fig. 14 shows the presence of the storage wall of the Tianhe-1A storage system based on Lustre. Fig. 14 reveals the storage-bounded speedup variation along with the application parallelism P from three aspects: I/O block size (4 KB, 2 MB, or 512 MB), I^sint, and D. For a CDP architecture like Lustre, the storage wall is subject to the intensity and amount of data of the I/O requests, which are represented by I^sint and D, respectively.

For the same D, a smaller I^sint leads to a slower storage-bounded speedup increment, so the storage wall will appear sooner; the same holds when D is larger with the same I^sint. Thus, we find that the storage wall for the CDP architecture will appear quickly when the applications have intense I/O requests and a large amount of I/O data; this type of application is generally called an 'I/O-intensive application'.

Fig. 15 shows the impact of α on the storage wall. We can easily find that the storage wall can be alleviated by an improvement in the memory hit rate, and this is in line with Eq. (12).

Fig. 14 Cases for the storage wall in centralized, distributed, and parallel (CDP) architecture in different block sizes, Isint, and D: (a) block size = 4 KB; (b) block size = 2 MB; (c) block size = 512 MB

Fig. 15 The impact of α on the storage wall: (a) D=32 KB, Isint=0.1 s, block size=4 KB; (b) D=2 MB, Isint=0.5 s, block size=32 MB; (c) D=512 MB, Isint=10.0 s, block size=512 MB

For different checkpoint methods, parallel applications, and supercomputer configurations, the details of the checkpoint/restart mechanism differ, especially in the checkpoint data (Kalaiselvi and Rajaraman, 2000; Elnozahy et al., 2002; Agarwal et al., 2004; Elnozahy and Plank, 2004; Oldfield et al., 2007; Bent et al., 2009; Egwutuoha et al., 2013). We consider parallel applications with system-level checkpointing techniques to evaluate system scalability under the storage performance constraint (Plank et al., 1995; Agarwal et al., 2004). According to checkpoint/restart mechanisms with different amounts of checkpoint data, we performed the analysis and comparison to obtain the storage wall laws.

In this subsection, we consider only the I/O caused by the checkpoint/restart mechanism and omit the I/O caused by the application itself, which may exacerbate the storage wall. As a production system, Tianhe-1A could not be tested at a larger scale. The Jaguar storage scaling tests, however, were run at large scale by the researchers, and the following analysis is based on their results (Alam et al., 2007; Fahey et al., 2008). The tests used the IOR benchmark in MPI I/O mode with a constant buffer size (2 MB, 8 MB, or 32 MB) per core, and they used two modes: one writing or reading one file per client, and the other using a single shared file.

Table 5 Specifications of the Cray XT4 Jaguar system

Item                                 Configuration
CPU                                  2.6 GHz dual-core Opteron
Number of processor sockets/cores    6296/12 592
Memory                               2 GB/core
Interconnect                         Cray SeaStar2
Disk controllers                     18 DDN-9550 couplets
Number of OSSes/OSTs                 72/144

Fig. 16 Cray XT4 Jaguar storage architecture (OSS: object storage server; OST: object storage target; MDS: metadata server; MDT: metadata target)

These results show the I/O aggregate bandwidth as the number of processes increases. Here we consider the N-1 checkpointing pattern (Bent et al., 2009). Since the major operations of checkpointing are write operations, we use the results of the write experiments to a shared file by multiple clients.

We use V_out-mem-aggr to represent the aggregate bandwidth of the shared storage system. Since the data are transferred mainly between the memory and the shared storage system, both D/V_mem and α are zero. So, Eq. (12) can be simplified to

S_{Sto}^P = \frac{C S_P}{\frac{DP}{V_{out-mem-aggr}} + I^{sint}},    (15)

where P stands for the number of parallel processes. Substituting S_Gustafson for S_P, we obtain

S_{Gustafson-Sto}^P = \frac{C(f + (1-f)P)}{\frac{DP}{V_{out-mem-aggr}} + I^{sint}}.    (16)

For a large-scale parallel application, f is usually small and can be omitted. Then we obtain

S_{Gustafson-Sto}^P = \frac{CP}{\frac{DP}{V_{out-mem-aggr}} + I^{sint}}.    (17)

To obtain the relationship between V_out-mem-aggr and P when P increases, we use the Matlab curve fitting toolbox to analyze the storage scaling test results (Alam et al., 2007; Fahey et al., 2008). We find that a two-term Gaussian function, i.e.,

V_{out-mem-aggr} = a_1 \exp\left(-\left(\frac{P-b_1}{c_1}\right)^2\right) + a_2 \exp\left(-\left(\frac{P-b_2}{c_2}\right)^2\right),    (18)

can fit the V_out-mem-aggr curve for different numbers of processes with a small root mean square error (RMSE). Fig. 17 and Table 6 show the fitting curves of the scaling test data for three different buffer sizes (2 MB, 8 MB, and 32 MB) per core. We find that the aggregate bandwidth of the storage system first rises to a maximum and eventually decreases to an asymptotic level. The curves in Fig. 17 reflect the trend in the storage performance with a small error.

Fig. 17 Curve fitting of the storage system aggregate bandwidth by using two terms of the Gaussian function with the buffer size per core fixed at 2 MB (a), 8 MB (b), and 32 MB (c)

For different amounts of checkpoint data, parallel applications with a checkpoint/restart mechanism show a similar trend for the storage wall, but are affected by it to different degrees. Hereafter, we detail the storage wall of the cases for full- and incremental-memory checkpointing, and use partial-memory (10% and 30% of memory) checkpointing briefly for comparison.

1. Full-memory checkpointing. In this case, the entire memory context is saved at one checkpoint for a process, so D is the size of the total memory. Since each dual-core node has 4 GB of memory (Table 5), D = 4 GB. Let P be the number of processes, I^sint the checkpointing interval, and I^sser the average I/O phase time of the process in serial mode. Supposing that the I/O bandwidth is 500 MB/s for one node in serial mode, I^sser = 4 GB / (500 MB/s) = 8 s. Then S^P_Sto is

S_{G-Sto-fullckp}^P = \frac{(I^{sint} + 8)P}{\frac{4P}{a_1 e^{-((P-b_1)/c_1)^2} + a_2 e^{-((P-b_2)/c_2)^2}} + I^{sint}}.    (19)

Fig. 18 shows the variation trends of S^P_Sto in parallel applications with full-memory checkpointing on Jaguar. The results are based on three sets of write aggregate bandwidth data (buffer sizes per core of 2 MB, 8 MB, and 32 MB). Fig. 19 displays the S^P_Sto variation trends for three different checkpointing intervals (1, 3, and 6 h) under the write aggregate bandwidth data with a buffer size per core of 32 MB.
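The two-term Gaussian fit of Eq. (18), performed in the paper with the Matlab curve fitting toolbox, could equally be reproduced with SciPy. The sketch below is illustrative only: the (clients, bandwidth) samples are synthetic points generated from the 2 MB-per-core coefficients of Table 6 plus noise, standing in for the Jaguar scaling measurements, which are not tabulated in the paper; only the model form follows Eq. (18).

```python
import numpy as np
from scipy.optimize import curve_fit

def two_term_gauss(P, a1, b1, c1, a2, b2, c2):
    # Two-term Gaussian model of Eq. (18) for V_out-mem-aggr(P)
    return (a1 * np.exp(-((P - b1) / c1) ** 2)
            + a2 * np.exp(-((P - b2) / c2) ** 2))

# Synthetic stand-in data: generated from Table 6's 2 MB coefficients with noise.
rng = np.random.default_rng(0)
clients = np.linspace(64, 9000, 30)
true_params = (4.73, 887.3, 1094.0, 4.342, -5.489e3, 1.033e4)
bandwidth = two_term_gauss(clients, *true_params) + rng.normal(0, 0.05, clients.size)

p0 = (5.0, 900.0, 1000.0, 4.0, -5e3, 1e4)          # rough starting guesses
params, _ = curve_fit(two_term_gauss, clients, bandwidth, p0=p0, maxfev=20000)
rmse = np.sqrt(np.mean((two_term_gauss(clients, *params) - bandwidth) ** 2))
print(np.round(params, 3), round(rmse, 4))         # fitted a1, b1, c1, a2, b2, c2 and RMSE
```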

Table 6 Parameters related to the curve fitting of storage system aggregate bandwidth using two terms of the Gaussian function

Buffer size    a1       b1       c1      a2             b2              c2             RMSE
2 MB           4.73     887.3    1094    4.342          −5.489×10^3     1.033×10^4     0.06902
8 MB           10.46    823      1471    6.991×10^15    −4.984×10^5     8.478×10^4     0.1212
32 MB          10.05    1191     1703    1.570×10^16    −7.925×10^5     1.346×10^5     0.4897
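As a rough check on Eq. (19), a short script can evaluate the 32 MB-per-core fit from Table 6 over a range of process counts and locate the point where the storage-bounded speedup peaks. This is a sketch under the paper's assumptions (bandwidth in GB/s, D = 4 GB, I^sser = 8 s, a 3 h checkpointing interval), and P is treated simply as the number of processes in Eq. (19); it lands in the same neighborhood as the storage-wall point marked for the 32 MB curve in Fig. 18 (on the order of 12 000 with a speedup of roughly 6×10^3), rather than reproducing it exactly.

```python
import numpy as np

# Table 6 coefficients for the 32 MB-per-core write experiments.
a1, b1, c1 = 10.05, 1191.0, 1703.0
a2, b2, c2 = 1.570e16, -7.925e5, 1.346e5

def v_out_mem_aggr(P):
    """Fitted aggregate bandwidth of the shared storage system in GB/s, Eq. (18)."""
    return a1 * np.exp(-((P - b1) / c1) ** 2) + a2 * np.exp(-((P - b2) / c2) ** 2)

def speedup_fullckp(P, i_sint):
    """Storage-bounded speedup with full-memory checkpointing, Eq. (19).
    D = 4 GB per process image, I_sser = 8 s (4 GB at 500 MB/s)."""
    return (i_sint + 8.0) * P / (4.0 * P / v_out_mem_aggr(P) + i_sint)

P = np.arange(1, 40001)
i_sint = 3 * 3600.0                    # 3 h checkpointing interval, in seconds
S = speedup_fullckp(P, i_sint)
P_opt = P[np.argmax(S)]
print(P_opt, S.max())                  # storage-wall point, cf. Fig. 18
```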

Fig. 18 S^P_Sto variation trends in parallel applications with full-memory checkpointing on Jaguar for write aggregate bandwidths with buffer size per core at 2, 8, and 32 MB, respectively (checkpointing interval is 3 h)

Fig. 19 S^P_Sto variation trends in parallel applications with full-memory checkpointing on Jaguar, for checkpointing intervals of 1 h, 3 h, and 6 h, respectively (write aggregate bandwidth of buffer size per core at 32 MB)

2. Incremental-memory checkpointing. Incremental checkpointing attempts to reduce the amount of checkpoint data by saving only the differences from the last checkpoint (Agarwal et al., 2004; Ferreira et al., 2014). In this case, since the size of the first checkpoint data is the entire memory, the total size of all saved checkpoints is at least equal to that of the entire memory. Suppose that program Q runs on Jaguar for 20 days, and that the mean time to failure (MTTF) of the whole system is 10 days. We can obtain

D \ge \frac{\frac{Runtime}{MTTF} D_{Fullmemory} + D_{Fullmemory}}{\frac{Runtime}{I^{sint}} + \frac{Runtime}{MTTF}} = \frac{\frac{20}{10} \times 4 + 4}{\frac{Runtime}{I^{sint}} + \frac{20}{10}}\ \mathrm{GB} = \frac{12}{\frac{Runtime}{I^{sint}} + 2}\ \mathrm{GB}.

According to Eqs. (17) and (18) and the D described above, S^P_Sto for incremental-memory checkpointing can be changed to

S_{G-Sto-incrckp}^P \le \frac{(I^{sint} + 8)P}{\frac{DP}{a_1 e^{-((P-b_1)/c_1)^2} + a_2 e^{-((P-b_2)/c_2)^2}} + I^{sint}}.    (20)

Fig. 20 shows the comparison of S^P_Sto among applications with full-, partial-, and incremental-memory checkpointing under the write aggregate bandwidth data with a buffer size per core of 32 MB. Applications with partial-memory checkpointing are used to reveal the effect of checkpoint data size on system scalability; we use 10% and 30% of the entire memory as the checkpoint data size.

In Figs. 18–20, the storage-bounded speedup increases initially and starts to drop as soon as it hits the storage wall, i.e., the point P_opt. This point is called the 'optimal parallel scale' and is where a parallel application achieves its largest storage-bounded speedup in terms of the storage wall definition. For large-scale parallel applications on supercomputers with a CDP architecture, the storage wall may widely exist at the petascale level now and at the exascale level in the future. By comparing the different lengths of time intervals between I/O phases in Fig. 19, we find that the larger the compute proportion of the parallel application is compared to I/O, the more the appearance of the storage wall is postponed.
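Returning to the incremental-memory bound derived above, a small self-contained sketch can evaluate the average checkpoint size D it implies for the 20-day run and 10-day MTTF; the bound itself follows the reconstruction given above (one full-memory image at the start and after each expected failure, averaged over all checkpoints), so the numbers here are illustrative rather than reproduced from the paper.

```python
RUNTIME = 20.0   # run length, days
MTTF = 10.0      # mean time to failure, days
D_FULL = 4.0     # GB, entire memory image of a node

def d_incremental_lower_bound(i_sint_hours: float) -> float:
    """Lower bound (GB) on the average incremental checkpoint size:
    full-memory data spread over all checkpoints taken during the run."""
    n_restarts = RUNTIME / MTTF                           # expected failures over the run
    n_checkpoints = RUNTIME * 24.0 / i_sint_hours + n_restarts
    return (n_restarts + 1) * D_FULL / n_checkpoints

for interval in (1.0, 3.0, 6.0):                          # checkpointing intervals of Fig. 19
    print(interval, round(d_incremental_lower_bound(interval), 4))

# Using these much smaller D values in Eq. (20), in place of the 4 GB of
# Eq. (19), moves the storage-wall point to a larger scale, as Fig. 20 shows.
```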

By comparing applications with full-, partial-, and incremental-memory checkpointing, we find that reducing the data size of I/O operations can push the storage wall forward (Fig. 20). Based on the analysis results given in Eqs. (19) and (20), the system designer can increase V_out-mem-aggr or I^sint and decrease D, i.e., improve the bandwidth or optimize the computing-to-I/O ratio, to push the storage wall forward. All these analyses are in accordance with the impact factors of Section 5.1.

Fig. 20 Comparison of S^P_Sto among applications with full-, partial-, and incremental-memory checkpointing with the write aggregate bandwidth data of buffer size per core at 32 MB (checkpointing interval is 3 h)

6.3 Mitigating storage wall effects

As demonstrated above, the storage wall may exist in large-scale supercomputers, especially in forthcoming exascale supercomputing. Tackling this challenge will require a combined effort from storage system designers, file system developers, and application developers. Our storage wall theory can give researchers more insights into alleviating or even removing the challenge:

1. The storage-bounded speedup model clearly reveals the relationship between storage scalability and compute scalability. The storage wall quantifies the storage bottleneck as it describes the application scalability characteristics under the storage performance constraint.

2. Through the analysis of the storage wall, the key parameters that affect system scalability, such as V_out-mem, D, I^sint, and α, should be taken into serious consideration. The system designer can push the storage wall forward by increasing the I/O bandwidth and the memory hit rate. For example, a global shared cache across the memory of multiple compute nodes can significantly increase the memory hit rate (Liao et al., 2007; Frasca et al., 2011; Sisilli, 2015).

3. Based on the analysis of I/O architectures using the storage wall theory, new I/O architecture designs and the usage of new storage media can be guided by our theory to focus on the key factors that affect scalability. Hence, a scalable I/O system design, perhaps a new kind of distributed I/O architecture, may overcome the storage wall limitation.

4. Currently, the goal of parallel programming methods is to maximize the use of computing resources; however, this may engender the storage wall problem because the storage constraint is ignored. It is imperative that new parallel programming models focus on both computing and I/O performance, and our storage wall theory can provide recommendations for this.

7 Conclusions and future work

In this paper, we presented a storage-bounded speedup model to describe system scalability under the storage constraint and introduced for the first time the formal definition of a 'storage wall', which accounts for the effects of the storage bottleneck on the scalability of parallel applications for large-scale parallel computing systems, particularly those at peta/exascale levels.

We studied the storage wall theory through theoretical analyses, real experiments, and case studies using representative real-world supercomputing systems. The results verified the existence of the storage wall, revealed the storage wall characteristics of different architectures, and identified the key factors that affect the storage wall. Based on our study, a balanced design can be achieved between the compute and storage systems, and newly proposed storage architectures can be analyzed to determine whether parallel scalability is constrained by the storage system. Our work enables researchers to push the storage wall forward by designing or improving storage architectures, applying I/O resource-oriented programming models, and so on.

In the future, our efforts will focus mainly on the following aspects. For the storage wall, we will refine the theory and explore alleviation parameters in detail. At the same time, we will present methods to optimize the I/O architecture and improve system flexibility.

References

Agarwal, S., Garg, R., Gupta, M.S., et al., 2004. Adaptive incremental checkpointing for massively parallel systems. Proc. 18th Annual Int. Conf. on Supercomputing, p.277-286. http://dx.doi.org/10.1145/1006209.1006248
Agerwala, T., 2010. Exascale computing: the challenges and opportunities in the next decade. IEEE 16th Int. Symp. on High Performance Computer Architecture. http://dx.doi.org/10.1109/HPCA.2010.5416662
Alam, S.R., Kuehn, J.A., Barrett, R.F., et al., 2007. Cray XT4: an early evaluation for petascale scientific simulation. Proc. ACM/IEEE Conf. on Supercomputing, p.1-12. http://dx.doi.org/10.1145/1362622.1362675
Ali, N., Carns, P.H., Iskra, K., et al., 2009. Scalable I/O forwarding framework for high-performance computing systems. IEEE Int. Conf. on Cluster Computing and Workshops, p.1-10. http://dx.doi.org/10.1109/CLUSTR.2009.5289188
Amdahl, G.M., 1967. Validity of the single processor approach to achieving large scale computing capabilities. Proc. Spring Joint Computer Conf., p.483-485. http://dx.doi.org/10.1145/1465482.1465560
Bent, J., Gibson, G., Grider, G., et al., 2009. PLFS: a checkpoint file system for parallel applications. Proc. Conf. on High Performance Computing Networking, Storage and Analysis, p.21. http://dx.doi.org/10.1145/1654059.1654081
Cappello, F., Geist, A., Gropp, B., et al., 2009. Toward exascale resilience. Int. J. High Perform. Comput. Appl., 23(4):374-388. http://dx.doi.org/10.1177/1094342009347767
Carns, P., Harms, K., Allcock, W., et al., 2011. Understanding and improving computational science storage access through continuous characterization. ACM Trans. Stor., 7(3):1-26. http://dx.doi.org/10.1145/2027066.2027068
Chen, J., Tang, Y.H., Dong, Y., et al., 2016. Reducing static energy in supercomputer interconnection networks using topology-aware partitioning. IEEE Trans. Comput., 65(8):2588-2602. http://dx.doi.org/10.1109/TC.2015.2493523
Culler, D.E., Singh, J.P., Gupta, A., 1998. Parallel Computer Architecture: a Hardware/Software Approach. Morgan Kaufmann Publishers Inc., San Francisco, USA.
Egwutuoha, I.P., Levy, D., Selic, B., et al., 2013. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. J. Supercomput., 65(3):1302-1326. http://dx.doi.org/10.1007/s11227-013-0884-0
Elnozahy, E.N., Plank, J.S., 2004. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery. IEEE Trans. Depend. Secur. Comput., 1(2):97-108. http://dx.doi.org/10.1109/TDSC.2004.15
Elnozahy, E.N., Alvisi, L., Wang, Y.M., et al., 2002. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv., 34(3):375-408. http://dx.doi.org/10.1145/568522.568525
Fahey, M., Larkin, J., Adams, J., 2008. I/O performance on a massively parallel Cray XT3/XT4. IEEE Int. Symp. on Parallel and Distributed Processing, p.1-12. http://dx.doi.org/10.1109/IPDPS.2008.4536270
Ferreira, K.B., Riesen, R., Bridges, P., et al., 2014. Accelerating incremental checkpointing for extreme-scale computing. Fut. Gener. Comput. Syst., 30:66-77. http://dx.doi.org/10.1016/j.future.2013.04.017
Frasca, M., Prabhakar, R., Raghavan, P., et al., 2011. Virtual I/O caching: dynamic storage cache management for concurrent workloads. Proc. Int. Conf. for High Performance Computing, Networking, Storage and Analysis, p.38. http://dx.doi.org/10.1145/2063384.2063435
Gamblin, T., de Supinski, B.R., Schulz, M., et al., 2008. Scalable load-balance measurement for SPMD codes. Proc. ACM/IEEE Conf. on Supercomputing, p.1-12.
Gustafson, J.L., 1988. Reevaluating Amdahl's law. Commun. ACM, 31(5):532-533. http://dx.doi.org/10.1145/42411.42415
Hargrove, P.H., Duell, J.C., 2006. Berkeley lab checkpoint/restart (BLCR) for Linux clusters. J. Phys. Conf. Ser., 46(1):494-499. http://dx.doi.org/10.1088/1742-6596/46/1/067
Hennessy, J.L., Patterson, D.A., 2011. Computer Architecture: a Quantitative Approach. Elsevier.
HPCwire, 2010. DARPA Sets Ubiquitous HPC Program in Motion. Available from http://www.hpcwire.com/2010/08/10/darpa_sets_ubiquitous_hpc_program_in_motion/.
Hu, W., Liu, G.M., Li, Q., et al., 2016. Storage speedup: an effective metric for I/O-intensive parallel application. 18th Int. Conf. on Advanced Communication Technology, p.1-2. http://dx.doi.org/10.1109/ICACT.2016.7423395
Kalaiselvi, S., Rajaraman, V., 2000. A survey of checkpointing algorithms for parallel and distributed computers. Sadhana, 25(5):489-510. http://dx.doi.org/10.1007/BF02703630
Kim, Y., Gunasekaran, R., 2015. Understanding I/O workload characteristics of a peta-scale storage system. J. Supercomput., 71(3):761-780. http://dx.doi.org/10.1007/s11227-014-1321-8
Kim, Y., Gunasekaran, R., Shipman, G.M., et al., 2010. Workload characterization of a leadership class storage cluster. Petascale Data Storage Workshop, p.1-5. http://dx.doi.org/10.1109/PDSW.2010.5668066
Kotz, D., Nieuwejaar, N., 1994. Dynamic file-access characteristics of a production parallel scientific workload. Proc. Supercomputing, p.640-649. http://dx.doi.org/10.1109/SUPERC.1994.344328
Liao, W.K., Ching, A., Coloma, K., et al., 2007. Using MPI file caching to improve parallel write performance for large-scale scientific applications. Proc. ACM/IEEE Conf. on Supercomputing, p.8. http://dx.doi.org/10.1145/1362622.1362634
Liu, N., Cope, J., Carns, P., et al., 2012. On the role of burst buffers in leadership-class storage systems. IEEE 28th Symp. on Mass Storage Systems and Technologies, p.1-11. http://dx.doi.org/10.1109/MSST.2012.6232369

Liu, Y., Gunasekaran, R., Ma, X.S., et al., 2014. Automatic identification of application I/O signatures from noisy server-side traces. Proc. 12th USENIX Conf. on File and Storage Technologies, p.213-228.
Lu, K., 1999. Research on Parallel File Systems Technology Toward Parallel Computing. PhD Thesis, National University of Defense Technology, Changsha, China (in Chinese).
Lucas, R., Ang, J., Bergman, K., et al., 2014. DOE Advanced Scientific Computing Advisory Subcommittee (ASCAC) Report: Top Ten Exascale Research Challenges. US DOE Office of Science. http://dx.doi.org/10.2172/1222713
Miller, E.L., Katz, R.H., 1991. Input/output behavior of supercomputing applications. Proc. ACM/IEEE Conf. on Supercomputing, p.567-576. http://dx.doi.org/10.1145/125826.126133
Moreira, J., Brutman, M., Castano, J., et al., 2006. Designing a highly-scalable operating system: the Blue Gene/L story. Proc. ACM/IEEE Conf. on Supercomputing, p.53-61. http://dx.doi.org/10.1109/SC.2006.23
Oldfield, R.A., Arunagiri, S., Teller, P.J., et al., 2007. Modeling the impact of checkpoints on next-generation systems. 24th IEEE Conf. on Mass Storage Systems and Technologies, p.30-46. http://dx.doi.org/10.1109/MSST.2007.4367962
Pasquale, B.K., Polyzos, G.C., 1993. A static analysis of I/O characteristics of scientific applications in a production workload. Proc. ACM/IEEE Conf. on Supercomputing, p.388-397. http://dx.doi.org/10.1145/169627.169759
Plank, J.S., Beck, M., Kingsley, G., et al., 1995. Libckpt: transparent checkpointing under Unix. Proc. USENIX Technical Conf., p.18.
Purakayastha, A., Ellis, C., Kotz, D., et al., 1995. Characterizing parallel file-access patterns on a large-scale multiprocessor. 9th Int. Parallel Processing Symp., p.165-172. http://dx.doi.org/10.1109/IPPS.1995.395928
Rudin, W., 1976. Principles of Mathematical Analysis. McGraw-Hill Publishing Co.
Shalf, J., Dosanjh, S., Morrison, J., 2011. Exascale computing technology challenges. 9th Int. Conf. on High Performance Computing for Computational Science, p.1-25. http://dx.doi.org/10.1007/978-3-642-19328-6_1
Sisilli, J., 2015. Improved Solutions for I/O Provisioning and Application Acceleration. Available from http://www.flashmemorysummit.com/English/Collaterals/Proceedings/2015/20150811_FD11_Sisilli.pdf [Accessed on Nov. 18, 2015].
Strohmaier, E., Dongarra, J., Simon, H., et al., 2015. TOP500 Supercomputer Sites. Available from http://www.top500.org/ [Accessed on Dec. 30, 2015].
Sun, X.H., Ni, L.M., 1993. Scalable problems and memory-bounded speedup. J. Parall. Distr. Comput., 19(1):27-37. http://dx.doi.org/10.1006/jpdc.1993.1087
University of California, 2007. IOR HPC Benchmark. Available from http://sourceforge.net/projects/ior-sio/ [Accessed on Sept. 1, 2014].
Wang, F., Xin, Q., Hong, B., et al., 2004. File system workload analysis for large scale scientific computing applications. Proc. 21st IEEE/12th NASA Goddard Conf. on Mass Storage Systems and Technologies, p.139-152.
Wang, T., Oral, S., Wang, Y.D., et al., 2014. BurstMem: a high-performance burst buffer system for scientific applications. IEEE Int. Conf. on Big Data, p.71-79. http://dx.doi.org/10.1109/BigData.2014.7004215
Wang, T., Oral, S., Pritchard, M., et al., 2015. Development of a burst buffer system for data-intensive applications. arXiv:1505.01765. Available from http://arxiv.org/abs/1505.01765
Wang, Z.Y., 2009. Reliability speedup: an effective metric for parallel application with checkpointing. Int. Conf. on Parallel and Distributed Computing, Applications and Technologies, p.247-254. http://dx.doi.org/10.1109/PDCAT.2009.19
Xie, B., Chase, J., Dillow, D., et al., 2012. Characterizing output bottlenecks in a supercomputer. Int. Conf. for High Performance Computing, Networking, Storage and Analysis, p.1-11. http://dx.doi.org/10.1109/SC.2012.28
Yang, X.J., Du, J., Wang, Z.Y., 2011. An effective speedup metric for measuring productivity in large-scale parallel computer systems. J. Supercomput., 56(2):164-181. http://dx.doi.org/10.1007/s11227-009-0355-9
Yang, X.J., Wang, Z.Y., Xue, J.L., et al., 2012. The reliability wall for exascale supercomputing. IEEE Trans. Comput., 61(6):767-779. http://dx.doi.org/10.1109/TC.2011.106

Appendix: Derivation of Eq. (6)

In Section 4.1, the storage-bounded speedup is relaxed to an upper bound. Here we describe the derivation of the upper bound. Suppose that

A = \frac{T^{snon} + N^s I^{sser}}{T^{pnon} + N^p I^{pser}} = \frac{T^{snon} + (T^{snon}/I^{sint})I^{sser} + I^{sser}}{T^{pnon} + (T^{pnon}/I^{pint})I^{pser} + I^{pser}}

and

B = \frac{T^{snon} + (N^s - 1) I^{sser}}{T^{pnon} + (N^p - 1) I^{pser}} = \frac{T^{snon} + (T^{snon}/I^{sint})I^{sser}}{T^{pnon} + (T^{pnon}/I^{pint})I^{pser}}.

The difference between B and A is

\mathrm{Diff} = B - A = \frac{T^{snon} + (N^s - 1)I^{sser}}{T^{pnon} + (N^p - 1)I^{pser}} - \frac{T^{snon} + N^s I^{sser}}{T^{pnon} + N^p I^{pser}} = \frac{I^{pser}T^{snon} - I^{sser}T^{pnon} + (N^s - N^p)I^{pser}I^{sser}}{(T^{pnon} + (N^p - 1)I^{pser})(T^{pnon} + N^p I^{pser})}.

According to the assumptions in Section 4.1, the load is balanced on all nodes and the I/O load is in proportion to the compute load when the compute load increases with the number of processors. Thus, N^p is approximately equal to N^s, and T^{pnon} is less than or equal to T^{snon}. Due to storage performance degradation, I^{pser} is usually greater than I^{sser}, and this is verified by the experiments in Section 6. Thus, we have (N^s − N^p)I^{pser}I^{sser} ≈ 0 and I^{pser}T^{snon} − I^{sser}T^{pnon} ≥ 0. Based on the analysis above, the numerator of Diff is greater than or equal to zero, and the denominator is greater than zero. Therefore, B is greater than or equal to A, and B is an upper bound of A.

Suppose that T^P_{Sto} = T^{pnon} + N^p I^{pser} ≈ T^{pnon} + (N^p - 1)I^{pser} and T^{pnon} ≈ T^{snon}. We have

\mathrm{Diff} = B - A = \frac{(I^{pser} - I^{sser})T^{pnon}}{(T^P_{Sto})^2}.

For long-running applications, (I^{pser} − I^{sser})T^{pnon} ≪ (T^P_{Sto})^2. Hence, in Eq. (6), we set the upper bound as the value of the storage-bounded speedup. Since Diff is negligible, this error does not affect the correctness of the results in this study.
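As a quick algebraic check of the appendix (not part of the original derivation), SymPy can confirm that B − A has the numerator stated above and that it reduces to (I^pser − I^sser)T^pnon under the stated approximations; the variable names in the sketch simply transliterate the paper's symbols.

```python
import sympy as sp

Ts_non, Tp_non, Is_ser, Ip_ser, Ns, Np = sp.symbols(
    'Ts_non Tp_non Is_ser Ip_ser Ns Np', positive=True)

A = (Ts_non + Ns * Is_ser) / (Tp_non + Np * Ip_ser)
B = (Ts_non + (Ns - 1) * Is_ser) / (Tp_non + (Np - 1) * Ip_ser)

# Claimed numerator and denominator of Diff = B - A from the appendix.
claimed_num = Ip_ser * Ts_non - Is_ser * Tp_non + (Ns - Np) * Ip_ser * Is_ser
den = (Tp_non + (Np - 1) * Ip_ser) * (Tp_non + Np * Ip_ser)

print(sp.simplify(B - A - claimed_num / den))               # -> 0, identity holds

# Under the approximations N^p ~ N^s and T^pnon ~ T^snon, the numerator
# reduces to (I^pser - I^sser) * T^pnon, as used for the final estimate.
print(sp.factor(claimed_num.subs({Ns: Np, Ts_non: Tp_non})))
```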