Graphical Abstract The performance wall of parallelized sequential computing: the dark performance and the roofline of performance gain

János Végh

Using parallelized sequential computing for achieving the needed higher performance adds one more limitation to the already known ones. The performance gain of parallelized sequential processor systems has a theoretical upper limit, derived from the laws of nature, the paradigm and the technical implementation of computing. Based on the different limiting factors, the paper derives theoretical upper limits of the performance gain. From the rigorously controlled TOP500 database, the performance gains of the ever-built supercomputers are analyzed and compared to the theoretical bound. It is shown that the present implementations have already achieved the theoretical upper bound. The upper bound of the performance gain for processor-based brain simulation is also derived.

[Graphical abstract figure: "The roofline of performance gain of supercomputers". Performance gain (10^2 ... 10^8) versus year (1994-2020) for the HPL benchmark (1st, 2nd, 3rd by RMax; best by α_eff), the HPCG benchmark (1st, 2nd, 3rd by RMax; best by RMax) and brain simulation.]

arXiv:1908.02280v1 [cs.DC] 2 Aug 2019

Highlights

The performance wall of parallelized sequential computing: the dark performance and the roofline of performance gain

János Végh

• A new limitation of computing performance is derived
• A new merit for parallelization is introduced

• The theoretically possible performance gain is already achieved
• The performance of brain simulation is compared with that of the benchmarks

The performance wall of parallelized sequential computing: the dark performance and the roofline of performance gain⋆

János Végh

a Kalimános BT, 4032 Debrecen, Komlóssy 26, Hungary

ARTICLE INFO

Keywords: efficiency, single-processor performance, performance gain, brain simulation

ABSTRACT

The computing performance today is developing mainly using parallelized sequential computing, in many forms. The paper scrutinizes whether the performance of that type of computing has an upper limit. The simple considerations point out that the theoretically possible upper bound is practically achieved, and that the main obstacle to stepping further is the presently used computing paradigm and implementation technology. In addition to the former "walls", also the "performance wall" must be considered. As the paper points out, similarly to the "dark silicon", also the "dark performance" is always present in the parallelized many-processor systems.

1. Introduction

The parallelization in the computing today receives more and more focus [5]. After the growth of the single-processor performance has stalled [45], the only hope for making computing systems with higher performance is to assemble them from a large number of sequentially working computers. "Computer architects have long sought the "City of Gold" (El Dorado) of computer design: to create powerful computers simply by connecting many existing smaller ones." [37]. Solving that task, however, is not simple at all. One of the major motivations of the origin of the Gordon Bell Prize was to increase the resulting performance gain of the parallelized sequential systems: "a speedup of at least 200 times on a real problem running on a general purpose parallel processor" [6]. It is known from the very beginnings that the "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities" [3, 41] is at least questionable, and the really large-scale tasks (like building supercomputers comprising millions of single processors or building brain simulators from single processors) may face serious scalability issues. After the initial difficulties, for today the large-scale supercomputers are stretched to the limits [7], and although the EFlops (payload) performance has not yet been achieved, already a 10^4 times higher performance is planned [32]. Many of the planned large-scale supercomputers, however, are delayed, canceled, paused, withdrawn, re-targeted, etc. It looks like in the present "gold rush" it has not been scrutinized whether the resulting payload performance of parallelized systems has some upper bound. In the known feasibility studies this aspect remains out of sight in the USA [44], in the EU [17], in Japan [18] and in China [32]. The designers did not follow "that system designers must make the effort to understand the relevant characteristics of the benchmark applications they use, if they are to arrive at the correct design decisions for building larger multiprocessor systems" [41].

The rules of the game are different for the segregated processors and the parallelized ones [51]; the larger the systems, the more remarkable deviations appear in their behavior. In addition to the already known limitations [35], valid for segregated processors, a new limitation, valid for parallelized processors, appears [55]. This paper presents that the well-known Amdahl's law, the commonly used computing paradigm (the single-processor approach [3]) and the commonly used implementation technology together form a strong upper bound for the performance of the parallelized sequential computing systems, and that the experienced failures, saturation, etc. can be attributed to approaching and attempting to exceed that upper bound.

"We realize that Amdahl's Law is one of the few, fundamental laws of computing" [38]; in section 2 we reinterpret it for the targeted goal. The section introduces the mathematical formalism used and derives the deployed merits. Based on the idea of Amdahl, in section 3 a general model of parallel computing is constructed. In this (by intention) strongly simplified model the contributions are classified either as parallelizable or non-parallelizable ones. The section provides an overview of the components contributing to the non-parallelizable fraction of processing, and attempts to reveal their origin and behavior. By properly interpreting the contributions, it shows that under different conditions different contributions can have the role of being the performance-limiting factor. The section also interprets the role of the benchmarks, and explains why the different benchmarks produce different results. Section 4 shows that the parallelized sequential computing systems have their inherent performance bound, and presents numerous potential limiting factors. The section also provides some supercomputer-specific numerical examples.

The last two sections directly underpin the previous theoretical discussion with measured data. In section 5 some specific supercomputer features are discussed, where the case is well documented and the special case enables to draw general conclusions.

⋆ This work is supported by the National Research, Development and Innovation Fund of Hungary, under the K funding scheme (Projects no. 125547 and 132683); ERC-ECAS support of project 861938 is also acknowledged.
∗ Corresponding author: [email protected] (J. Végh)
ORCID(s): 0000-0001-7511-2910 (J. Végh)

This section also discusses the near (predictable) future of large-scale supercomputers and their behavior. Section 6 draws some statistical conclusions, based on the available supercomputer database, containing rigorously verified, reliable data for the complete supercomputer history. The large number of data enables, among others, to derive conclusions about a new law of the electronic development, to tell where and when it is advantageous to apply graphic accelerator units, as well as to derive a "roofline" model of supercomputing.

2. Amdahl's classic analysis

2.1. Origin and interpretation
The most commonly known and cited limitation on parallelization speedup ([3], the so-called Amdahl's law) considers the fact that some parts (Pi) of a code can be parallelized, while some (Si) must remain sequential. Amdahl only wanted to draw attention to the fact that, when putting together several single processors and using the Single Processor Approach (SPA), the available speed gain due to using large-scale computing capabilities has a theoretical upper bound. He also mentioned that data housekeeping (non-payload calculations) causes some overhead, and that the nature of that overhead appears to be sequential, independently of its origin.

2.2. Validity
A general misconception (introduced by successors of Amdahl) is to assume that Amdahl's law is valid for software only and that the non-parallelizable fraction means something like the ratio of the numbers of the corresponding instructions. Amdahl in his famous paper speaks about "the fraction of the computational load" and explicitly mentions, in the same sentence and same rank, algorithmic reasons like "computations required may be dependent on the states of the variables at each point"; architectural aspects like "may be strongly dependent on sweeping through the array along different axes on succeeding passes"; as well as "physical problems" like "propagation rates of different physical effects may be quite different". His point of view is valid also today: one has to consider the load of the complex hardware (HW)/software (SW) system, rather than some segregated component, and his idea describes parallelization imperfectness of any kind. When applied to a particular case, however, one shall scrutinize which contributions can actually be neglected.

Actually, Amdahl's law is valid for any partly parallelizable activity (including computer-unrelated ones), and the non-parallelizable fragment shall be given as the ratio of the time spent with non-parallelizable activity to the total time. The concept was frequently and successfully utilized in quite unexpected fields, as well as misunderstood and abused (see [30, 23]). As discussed in [35]:

• many parallel computations today are limited by several forms of communication and synchronization
• the parallel and sequential runtime components are only slightly affected by cache operation
• the wires get increasingly slower relative to gates

2.3. Amdahl's case under realistic conditions
The realistic case is that the parallelized parts are not of equal length (even if they comprise exactly the same instructions). The hardware operation in modern processors may execute them in considerably different times; for examples see [53, 24] and references cited therein: operation of hardware accelerators inside a core, or network operation between processors, etc. One can also see that the time required to control the parallelization is not negligible and varying, representing another source of performance bound.

The static correspondence between program chunks and processing units can be very inefficient: all assigned processing units must wait for the delayed unit. The measurable performance does not match the nominal performance, leading to the appearance of the "dark performance": the processors cannot be utilized at the same time, much like a fraction of the cores (because of energy dissipation) cannot be utilized at the same time, leading to the issue of "dark silicon" [16, 15].

Also, some capacity is lost if the number of computing resources exceeds the number of parallelized chunks. If the number of processing units is smaller than that of the parallelized threads, several "rounds" for the remaining threads must be organized, with all disadvantages of the duty of synchronization [57, 10]. In such cases it is not possible to apply Amdahl's Law directly: the actual architecture is too complex or not known. However, in all cases the speedup can be measured and expressed in function of the number of processors.

2.3.1. Factors affecting parallelism
Usually, Amdahl's law is expressed as

S^{-1} = (1 - α) + α/k    (1)

where k is the number of parallelized code fragments, α is the ratio of the parallelizable fraction to the total, and S is the measurable speedup. The assumption can be visualized so that (assuming many processors) in the α fraction of the running time the processors are processing data, and in the (1 - α) fraction they are waiting (all but one). That is, α describes how much, on average, the processors are utilized. Having those data, the resulting speedup can be estimated.

For today's complex systems, to calculate α is hopeless, but for a system under test, where α is not a priori known, one can derive from the measurable speedup S an effective parallelization factor as

α_eff = k/(k-1) · (S-1)/S    (2)

Obviously, this is not more than S expressed in terms of α and k from Equ. (1). So, for the classical case, α = α_eff, which simply means that in the ideal case the actually measurable effective parallelization achieves the theoretically possible one.
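As a quick numerical illustration of Equs. (1) and (2), the short Python sketch below computes the ideal speedup for a given α and recovers α_eff from a measured speedup. It is an illustration only; the helper names and the sample numbers are assumptions, not values taken from the paper.

```python
# Sketch of Equs. (1) and (2): ideal speedup and effective parallelization.
# The sample values below are illustrative assumptions, not measured data.

def speedup(alpha: float, k: int) -> float:
    """Ideal Amdahl speedup, Equ. (1): S = 1 / ((1 - alpha) + alpha / k)."""
    return 1.0 / ((1.0 - alpha) + alpha / k)

def alpha_eff(S: float, k: int) -> float:
    """Effective parallelization, Equ. (2), derived from the measured speedup S."""
    return (k / (k - 1.0)) * ((S - 1.0) / S)

if __name__ == "__main__":
    k = 1_000_000          # number of processing units (assumed)
    alpha = 1.0 - 1e-7     # assumed parallelizable fraction
    S = speedup(alpha, k)
    print(f"ideal speedup S      = {S:.3e}")
    # For the classical case the two notions coincide: alpha == alpha_eff
    print(f"recovered alpha_eff  = {alpha_eff(S, k):.10f}")
```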

In other words, α describes a system the architecture of which is completely known, while α_eff characterizes a system the performance of which is known from experiments. Again in other words, α is the theoretical upper limit, which can hardly be achieved, while α_eff is the experimental actual value that describes the complex architecture and the actual conditions.

The value of α_eff can then be used to refer back to Amdahl's classical assumption even in realistic cases when the detailed architecture is not known. On one side, the speedup S can be measured and α_eff can be utilized to characterize the measurement setup and conditions [54], i.e. how much of the theoretically possible maximum parallelization is realized. On the other side, the theoretically achievable S or α_eff can be guessed from some general assumptions.

In the case of real tasks a Sequential/Parallel Execution Model [57] shall be applied, which cannot use the simple picture reflected by α, but α_eff gives a good merit of the degree of parallelization for the duration of executing the process on the given hardware configuration, and can be compared to the results of technology-dependent parametrized formulas¹. Numerically, (1 - α_eff) equals the f value established theoretically [29].

To the scaling of parallel systems several models can be applied, and all of them can be good in their specific field. However, one must recall that "The truth is that there is probably no specific parameter scaling principle that can be universally applied" [41]. This also means that the validity of the scaling methods for extremely large numbers of processors must be scrutinized.

Figure 1: The dependence of computing efficiency of parallelized sequential computing systems on the parallelization efficacy and the number of cores. The surface is described by Eq. (4); the data points for the TOP5 supercomputers (November 2018: Summit, Sierra, Taihulight, Tianhe-2, Piz Daint) are calculated from the publicly available database [42]. [Surface plot: efficiency versus (1 - α_eff^HPL) and the number of cores.]

2.4. Efficiency of parallelization
The distinguished constituent in Amdahl's classic analysis is the parallelizable payload fraction α; all the rest (including wait time, communication, system contribution and any other non-payload activities) goes into the apparently "sequential-only" fraction according to this extremely simple model. When using several processors, one of them makes the sequential-only calculation, the others are waiting (use the same amount of time)². In the age of Amdahl the number of processors was small and the contribution of the SW was high, relative to the contribution of the parallelized HW. Because of this, the contribution of the SW dominated the sequential part, so the value of (1 - α) really could be considered as a constant. The technical development resulted in decreasing all non-parallelizable components; the large systems today can be idle also because of HW reasons.

Anyhow, when calculating the speedup, one calculates

S = ((1 - α) + α) / ((1 - α) + α/k) = k / (k(1 - α) + α)    (3)

hence the efficiency³ (how the speedup scales with the number of processors) is

E = S/k = 1 / (k(1 - α) + α)    (4)

This means that, according to Amdahl, as presented in Fig. 1, the efficiency depends both on the total number of processors in the system and on the perfectness⁴ of the parallelization. The perfectness comprises two factors: the theoretical limitation and the engineering ingenuity.

If the parallelization is well-organized (load balanced, small overhead, right number of Processing Units (PUs)), α saturates at unity (in other words: the sequential-only fraction approaches zero), so the tendencies can be better displayed through using (1 - α_eff) in the diagrams below.

The importance of this practical term α_eff is underlined by the fact that it can be interpreted and utilized in many different areas [54, 52], and the achievable speedup (the maximum achievable performance gain when using an infinitely large number of processors) can easily be derived from Equ. (1) as

G = 1 / (1 - α_eff)    (5)

Footnote 1: Just notice here that passing parameters among cores as well as blocking each other (of course, through the operating system (OS)) are all a kind of synchronization or communication, and their amount differs task by task.
Footnote 2: In a different technology age, the same phenomenon was already described: "Amdahl argued that most parallel programs have some portion of their execution that is inherently serial and must be executed by a single processor while others remain idle." [41]
Footnote 3: This quantity is almost exclusively used to describe the computing performance of multi-processor systems. In the case of supercomputers, RMax/RPeak is provided, which is identical with E.
Footnote 4: As will be explained below, in a higher-order approximation the value of (1 - α) itself also depends on the number of the processors.
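The sketch below illustrates Equs. (3)-(5) numerically: for a fixed (1 - α) it shows how the efficiency E collapses as the number of cores k grows, and what performance gain G corresponds to a given α_eff. It is a minimal illustration; the sample (1 - α) value and core counts are assumptions, not data taken from the TOP500 database.

```python
# Illustration of Equs. (3)-(5): efficiency E and gain G.
# The (1 - alpha) value and core counts below are assumed sample points.

def efficiency(alpha: float, k: int) -> float:
    """Equ. (4): E = S/k = 1 / (k*(1 - alpha) + alpha)."""
    return 1.0 / (k * (1.0 - alpha) + alpha)

def gain(alpha_eff: float) -> float:
    """Equ. (5): maximum achievable performance gain G = 1 / (1 - alpha_eff)."""
    return 1.0 / (1.0 - alpha_eff)

if __name__ == "__main__":
    one_minus_alpha = 1e-7                     # assumed non-parallelizable fraction
    for k in (1_000, 100_000, 10_000_000):
        E = efficiency(1.0 - one_minus_alpha, k)
        # In TOP500 terms, E corresponds to RMax/RPeak for the given configuration
        print(f"k = {k:>10,}  E = {E:.3f}")
    print(f"G (upper bound on speedup) = {gain(1.0 - one_minus_alpha):.2e}")
```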


Provided that the value of α_eff does not depend on the number of the processors, for a homogeneous system the total payload performance is

P_total_payload = G · P_single_processor    (6)

i.e. the total payload performance can be increased by increasing the performance gain or by increasing the single-processor performance, or both. Notice, however, that increasing the single-processor performance through accelerators also has its drawbacks and limitations [47], and that the performance gain and the single-processor performance are players of the same rank in defining the payload performance.

2.4.1. Connecting efficiency and α
Through using Equ. (4), S/k can be equally good for describing the efficiency of parallelization of a setup, but anyhow a second parameter, the number of the processors k, is also required. From Equ. (4)

α_{E,k} = (E·k - 1) / (E·(k - 1))    (7)

This quantity depends on both E and k, but in some cases it can be assumed that α is independent from the number of processors. This seems to be confirmed by data calculated from several publications, as was noticed early. At this point one can notice that 1/E in Equ. (4) is a linear function of the number of processors, and its slope equals (1 - α_eff). The value calculated in this way is denoted by (1 - α_Δ). Its numerical value is quite near to the (1 - α_eff) value calculated (see Equ. (2)) using all processors, and so it is not displayed in the rest of the figures. This also means that from efficiency data one can estimate the value of α_Δ even for intermediate regions, i.e. without knowing the execution time on a single processor (for technical reasons, this is the usual case for supercomputers). From a handful of processors one can find out if the supercomputer under construction can have hopes to beat the No. 1 in the TOP500 [42]⁵. This result can also be used for investment protection.

2.4.2. Time to organize parallelization
The timing analysis given above can be applied to different kinds of parallelization, from processor-level parallelization (instruction or data level parallelization, in the nanoseconds range), to OS-level parallelization (including thread-level parallelization using several processors or cores, in the microseconds range), to network-level parallelization (between networked computers, like grids, in the milliseconds range). The principles are the same [10], independently of the kind of implementation. In agreement with [57], housekeeping overhead is always present (and mainly depends on the HW+SW architectural solution) and remains a key question. The main focus is always on reducing its effect. Notice that the application itself also comprises some (variable) amount of sequential contribution.

The actual speedup (or effective parallelization) depends strongly on the 'tricks' used during implementation. Although HW and SW parallelisms are interpreted differently [26], they can even be combined [8], resulting in hybrid architectures. For those greatly different architectural solutions it is hard even to interpret α, while α_eff enables to compare different implementations (or the same implementation under different conditions).

3. Our model of parallel execution

As mentioned in section 2.2, Amdahl listed different reasons why losses in the "computational load" can occur. To understand the operation of computing systems working in parallel, one needs to extend Amdahl's original (rather than that of the successors') model in such a way that the non-parallelizable (i.e. apparently sequential) part comprises contributions from HW, OS, SW and Propagation Delay (PD), and also some access time is needed for reaching the parallelized system. The technical implementations of the different parallelization methods show up an infinite variety, so here a (by intention) strongly simplified model is presented. Amdahl's idea enables to put everything⁶ that cannot be parallelized into the sequential-only fraction. The model is general enough to discuss qualitatively some examples of parallelly working systems, neglecting different contributions as possible in the different cases. The model can also be converted to a limited-validity technical (quantitative) one.

3.1. Formal introduction of the model
The contributions of the model component XXX to α_eff will be denoted by α_eff^XXX in the following. Notice the different nature of those contributions. They have only one common feature: they all consume time. The extended Amdahl's model is shown in Fig. 2. The vertical scale displays the actual activity for the processing units shown on the horizontal scale.

Notice that our model assumes no interaction between processes running on the parallelized systems in addition to the absolutely necessary minimum: starting and terminating the otherwise independent processes, which take parameters at the beginning and return results at the end. It can, however, be trivially extended to the more general case when processes must share some resource (like a database, which shall provide different records for the different processes), either implicitly or explicitly. Concurrent objects have inherent sequentiality [13], and synchronization and communication among those objects considerably increase [57] the non-parallelizable fraction (i.e. the contribution (1 - α_eff^SW)), so in the case of an extremely large number of processors special attention must be devoted to their role on the efficiency of the application on the parallelized system.

Let us notice that all contributions have a role during measurement: the effects of contributions due to SW, HW, OS and PD cannot be separated, though dedicated measurements can reveal their role, at least approximately. The relative weights of the different contributions are very different for the different parallelized systems, and even within those cases depend on many specific factors, so in every single parallelization case a careful analysis is required.

3.1.1. Access time
Initiating and terminating the parallel processing is usually made from within the same computer, except when one can only access the parallelized computer system from another computer (like in the case of clouds). This latter access time is independent from the parallelized system, and one must properly correct for the access time when deriving timing data for the parallelized system. Amdahl's law is valid only for a properly selected computing system. This is a one-time, and usually fixed-size, time contribution.

Footnote 5: At least a lower bound on α_eff (i.e. a higher bound on the parallelization gain) can be derived.
Footnote 6: Although the modern diagnostic methods enable to separate the theoretically different contributions and measure their value separately, unfortunately their effect is summed up in the non-parallelizable fraction. The summing rule is not simple: some sequential contribution may occur in parallel with some other, like the propagation delay with the OS functionality; and it must also be considered whether the given non-parallelizable item contributes to the main thread or to one of the fellow threads.
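Referring back to Equs. (6) and (7) above, the sketch below shows how, from a published efficiency E = RMax/RPeak and core count k alone, one can estimate α and from it the gain G and the payload performance bound. It is a hypothetical illustration with assumed numbers, not a reproduction of any TOP500 entry.

```python
# Sketch of Equs. (6)-(7): estimating alpha from efficiency data and
# predicting the payload performance bound. All numbers are assumptions.

def alpha_from_efficiency(E: float, k: int) -> float:
    """Equ. (7): alpha_{E,k} = (E*k - 1) / (E*(k - 1))."""
    return (E * k - 1.0) / (E * (k - 1.0))

def payload_bound(alpha_eff: float, p_single_flops: float) -> float:
    """Equ. (6) with G from Equ. (5): P_total = G * P_single (upper bound)."""
    gain = 1.0 / (1.0 - alpha_eff)
    return gain * p_single_flops

if __name__ == "__main__":
    k = 2_000_000              # assumed number of cores
    E = 0.75                   # assumed HPL efficiency (RMax / RPeak)
    p_single = 20e9            # assumed single-core performance, flop/s
    a = alpha_from_efficiency(E, k)
    print(f"(1 - alpha)   = {1.0 - a:.3e}")
    print(f"payload bound = {payload_bound(a, p_single):.3e} flop/s")
```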


Figure 2: The extended Amdahl's model of parallelizing sequential processing (somewhat idealistic and not proportional). [Diagram: for processing units P0...P4, the time axis shows the Access Initiation, Software Pre, OS Pre, propagation delay (PD_x0, PD_x1), payload Process, OS Post, Software Post and Access Termination phases, together with the "just waiting" periods; the Total (Extended) Process Time spans all of them.]

3.1.2. Execution time
The execution time Total covers all processing on the parallelized system. All applications running on a parallelized system must make some non-parallelizable activity at least before beginning and after terminating the parallelizable activity. This SW activity represents what was assumed by Amdahl as the total sequential fraction⁷. As shown in Fig. 2, the apparent execution time includes the real payload activity, as well as waiting and OS and SW activity. Recall that the execution times may be different [36, 37, 39] in the individual cases, even if the same processor executes the same instruction, but executing an instruction mix many times results in practically identical execution times, at least at model level. Note that the standard deviation of the execution times appears as a contribution to the non-parallelizable fraction, and in this way increases the "imperfectness" of the architecture. This feature of the processors deserves serious consideration when utilizing a large number of processors. Over-optimizing a processor for the single-thread regime hits back when using it in a many-processor environment.

3.2. The principle of the measurements
When measuring performance, one faces serious difficulties, see for example [37], chapter 1, both with making measurements and with interpreting them. When making a measurement (i.e. running a benchmark), either on a single processor or on a system of parallelized processors, an instruction mix is executed many times. There is, however, a crucial difference: in the second case an extra activity is also included: the job to organize the joint work. It is the reason for the 'efficiency', and it leads to critical issues in the case of an extremely large number of processors.

The large number of executions averages the rather different execution times [36], with an acceptable standard deviation. In the case when the executed instruction mix is the same, the conditions (like cache and/or memory size, the network bandwidth, Input/Output (I/O) operations, etc.) are different and they form the subject of the comparison. In the case when comparing different algorithms (like results of different benchmarks), the instruction mix itself is also different.

Notice that the so-called "algorithmic effects" – like dealing with sparse data structures (which affects cache behavior) or communication between the parallelly running threads, like returning results repeatedly to the main thread in an iteration (which greatly increases the non-parallelizable fraction in the main thread) – manifest through the HW/SW architecture, and they can hardly be separated. Also notice that there are fixed-size contributions, like utilizing time measurement facilities or calling system services. Since α_eff is a relative merit, the absolute measurement time shall be long. When utilizing efficiency data from measurements which were dedicated to some other goal, proper caution must be exercised with the interpretation and accuracy of the data.

3.3. The measurement method (benchmarking)
Not surprisingly, the method of the measurement basically affects the result of the measurement: the "device under test" and the "measurement device" are the same. The benchmarks, utilized to derive numerical parameters for supercomputers, are specialized and standardized programs, which run in the HW/OS environment provided by the parallelized computer under test.

Footnote 7: Although some OS activity was surely included, Amdahl assumed some 20 % SW fraction, so the other contributions could be neglected compared to the SW contribution.
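To make the time-based bookkeeping of the extended model of Fig. 2 concrete, the sketch below sums hypothetical per-phase times (access, SW/OS pre- and post-processing, propagation delay, payload) into a non-parallelizable fraction and the corresponding gain bound. The phase names follow the figure, but the durations are invented for illustration, and the simple additive summing is itself a simplification (cf. footnote 6).

```python
# A minimal, assumed-numbers sketch of the extended Amdahl model (Fig. 2):
# every non-payload phase is booked as sequential-only time.
# (The real summing rule is more involved; see footnote 6.)

from dataclasses import dataclass

@dataclass
class PhaseTimes:          # all times in seconds (illustrative values)
    access: float          # access initiation + termination
    sw: float              # SoftwarePre + SoftwarePost
    os: float              # OSPre + OSPost
    pd: float              # propagation delay within the thread
    payload: float         # parallelizable payload time

def one_minus_alpha(t: PhaseTimes) -> float:
    """Non-parallelizable fraction: sequential-only time over total time."""
    sequential = t.access + t.sw + t.os + t.pd
    return sequential / (sequential + t.payload)

if __name__ == "__main__":
    t = PhaseTimes(access=1e-3, sw=5e-3, os=2e-3, pd=1e-4, payload=3600.0)
    seq_frac = one_minus_alpha(t)
    print(f"(1 - alpha_eff) = {seq_frac:.2e}")
    print(f"gain bound G    = {1.0 / seq_frac:.2e}")
```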

One can use benchmarks for different goals. Two typical fields of utilization: to describe the environment the computer application runs in (a "best case" estimation), and to guess how quickly an application will run on a given parallelized computer (a "real-life" estimation).

If the goal is to characterize the supercomputer's HW+OS system itself, a benchmark program should distort the HW+OS contribution as little as possible, i.e. the SW contribution must be much lower than the HW+OS contribution. In the case of supercomputers, the benchmark High Performance LINPACK (HPL) [25] (with minor modifications) is used for this goal since the beginning of the supercomputer age. The mathematical behavior of HPL enables to minimize the SW contribution, i.e. HPL delivers the possible best estimation for α_eff^{HW+OS}.

If the goal is to estimate the expectable behavior of an application, the benchmark program should imitate the structure and behavior of the application. In the case of supercomputers, a couple of years ago the benchmark High Performance Conjugate Gradients (HPCG) [25] was introduced for this goal, since "HPCG is designed to exercise computational and data access patterns that more closely match a different and broad set of important applications, and to give incentive to computer system designers to invest in capabilities that will have impact on the collective performance of these applications" [25]. However, its utilization can be misleading: the ranking is only valid for the HPCG application, and only when utilizing that number of processors. HPCG really seems to give better hints for designing supercomputer applications⁸ than HPL does. According to our model, in the case of using the HPCG benchmark the SW contribution dominates⁹, i.e. HPCG delivers the best possible estimation for α_eff^SW for this class of supercomputer applications.

The different benchmarks provide different (1 - α_eff^SW) contributions to the non-parallelizable fraction (resulting in different efficiencies and rankings [27]), so comparing results (and especially establishing rankings!) derived using different benchmarks shall be done with maximum care. Since the efficiency depends heavily on the number of cores (see Fig. 1 and Eq. (4)), the different configurations shall be compared using the same benchmark and the same number of processors (or the same RPeak)¹⁰.

Footnote 8: This is why for example [11] considers HPCG as "practical performance".
Footnote 9: Returning the calculated gradients requires much more sequential communication (unintended blocking).
Footnote 10: This is why it is misleading in some cases on the TOP500 lists to compare the HPL benchmark result, measured with all available cores, with the HPCG benchmark, measured with a small fragment of the cores.

4. The inherent limitations of supercomputing

The limitations can be derived basically in two ways. Either the experiences based on the implementations can be utilized to draw conclusions (as an empirical technical limit), or some theoretical assumptions can be utilized to derive a kind of theoretical limit. The first way can be followed only if a large number of rigorously verified data are available for drawing conclusions. This method will be followed in connection with the supercomputers, where the reliable database TOP500 [42] is available. This method is absolutely empirical, and results only in an "up to now" achieved value, so one cannot be sure whether the experienced limitation is just a kind of engineering imperfectness. The other method results in a theoretical upper bound, and one cannot be sure whether it can technically be achieved. It is a strong confirmation, however, that the two ways lead to the same limitation: what can be achieved theoretically is already achieved in the practice.

The technical implementation of the parallelized sequential computing systems shows up an infinite variety, so it is not really possible to describe all of them in a uniform scheme. Instead, some originating factors are mentioned and their corresponding term in the model is named. At this point the simplicity of the model is a real advantage: all possible contributions shall be classified as parallelizable or non-parallelizable ones only. The model uses time-equivalent units, so all contributions are expressed with time, independently of their origin.

That parallel programs have inherently sequential parts (and so an inherent performance limit) has been known for decades: "Amdahl argued that most parallel programs have some portion of their execution that is inherently serial and must be executed by a single processor while others remain idle." [41] Those limitations follow immediately from the physical implementation and the computing paradigm; it depends on the actual conditions which of them will dominate. It is crucial to understand that the decreasing efficiency (see Equ. (4)) is coming from the computing paradigm itself rather than from some kind of engineering imperfectness. This inherent limitation cannot be mitigated without changing the computing/implementation principle.

4.1. Propagation delay (PD)
In the modern high clock speed processors it is increasingly hard to reach the right component inside the processor at the right time, and it is even harder if the PUs are at a distance much larger than the die size. Also, the technical implementation of the interconnection can seriously contribute.

4.1.1. Wiring
As discussed in [35], the weight of wiring compared to the processing is continuously increasing. The gates may become (much) faster, but the speed of light is an absolute limit for the signal propagation on the wiring connecting them. This is increasingly true when considering large systems: the need of cooling the modern high density processors increases the length of wiring between them.

4.1.2. Physical size
Although the signals travel in a computing system with nearly the speed of light, with increasing physical size of the computer system a considerable time passes between issuing and receiving a signal, causing the other party to wait,

without making any payload job. At today's frequencies and chip sizes a signal cannot even travel in one clock period from one side of the chip to the other; in the case of a stadium-sized supercomputer this delay can be in the order of several hundred clock cycles. Since the time of Amdahl, the ratio of the computing time to the propagation time has drastically changed, so – as [35] calls the attention to it – it cannot be neglected any more, although presently it is not (yet) a major dominating term.

As long as computer components are in proximity in the mm range, the contribution by PD can be neglected, but the distance of PUs in supercomputers is typically in the 100 m range, and the nodes of a cloud system can be geographically far from each other, so considerable propagation delays can also occur.

4.1.3. Interconnection
The interconnection between cores happens in very different contexts, from the public Internet connection between clouds, through the various connections used inside supercomputers, down to the System-on-Chip (SoC) connections. The OS only initiates accessing the processors; after that the HW works partly in parallel with the next action of the OS and with other actions initiating access to other processors. This period is denoted in Fig. 2 by T_x. After the corresponding signals are generated, they must reach the target processor, that is, they need some propagation time. The PDs are denoted by PD_x0 and PD_x1, corresponding to the actions delivering the input data and the result, respectively. This propagation time (which of course occurs in parallel with actions on other processors, but which is a sequential contribution within the thread) depends strongly on how the processors are interconnected: this contribution can be considerable if the distance to travel is large or the message transfer takes a long time (like lengthy messages, signal latency, handshaking, store and forward operations in networks, etc.).

Although sometimes even in quite large-scale systems like [22] Ethernet-based internal communication is deployed, it is getting accepted that "The idea of using the popular shared bus to implement the communication medium [in large systems] is no longer acceptable, mainly due to its high contention." [34]

4.2. Other sources of delay
The internal operation of the processors can also contribute to the issues experienced in the large parallel computing systems.

4.2.1. Internal latency
The instruction execution micro-environment can be quite different for the different parallelized sequential systems. Here it is interpreted as the internal non-payload time around executing (a bunch of) machine instructions, like waiting for the instruction being processed, the bus being disabled for some short period, copying data between address spaces, speculating or predicting.

4.2.2. Accelerators
It is a trivial idea that, since the single-processor performance cannot be increased any more, some external computing accelerator (Graphics Processing Unit(s) (GPUs)) shall be used. However, because of the SPA, the data must be copied to the memory of the accelerator, and this takes time. This non-payload activity is a kind of sequential contribution and surely makes the value of (1 - α_eff) worse. The difference is negligible at a low number of cores, but in large-scale systems it strongly degrades the efficiency, see section 6.2.

4.2.3. Complexity
The processors are optimized for single-processor performance. As a result, they attempt to make more and more operations in a single clock cycle, and doing so introduces a limitation for the length of the clock period itself: "we believed that the ever-increasing complexity of superscalar processors would have a negative impact upon their clock rate, eventually leading to a leveling off of the rate of increase in microprocessor performance." [40]

4.3. The computing paradigm
At the time when the basic operating principles of the computer were formulated, there was literally only one processor, which led naturally to using the SPA. For today, due to the development of the technology, the processor became a "free resource" [22]. Despite that, mainly by inertia (the preferred incremental development) and because up to now the performance could be improved even using SPA components (and thinking), today the SPA is commonly used when building large parallel computing systems, although the importance of "cooperative computing" is already recognized and demonstrated [58]. The stalling of the parallel performance may lead to the need of introducing the Explicitly Many-Processor Approach (EMPA) [48].

4.3.1. Addressing
One of the major drawbacks of the SPA shows up in the components constructed for SPA systems. Among others, the SPA processors use SPA memories through SPA buses. This also means that only one single addressing action can take place at a time; that is why the need for addressing processors increases linearly with the size of the system. Although this issue can be mitigated by segmenting, clustering, vectoring, the basic limitation effect is present.

4.3.2. The context switching
All applications must use OS services and some HW facilities to initiate themselves as well as to access other processors. Because the operating system works in a different (supervisor) mode, a considerable amount of time is required for switching context. Actually, this means virtually "another processor": a different (extended) Instruction Set Architecture (ISA), and a new set of processor registers. The processor registers are very useful in single-processor optimization, but their saving and restoring considerably increases the internal latency, and what is worse, also introduces many otherwise unneeded memory operations. This is usually not a really crucial contribution, but under the extreme conditions represented by supercomputers (and especially if single-port memory is used), specialized operating systems must be used [22], or the calculation must be run in supervisor mode [20], or every single core must run a lightweight OS [58].


Figure 3: a) Dependence of payload supercomputer performance RMax on the nominal performance RPeak for the TOP10 supercomputers (as of November 2018) in case of utilizing the HPL benchmark. b) The timeline of the development of the payload performance RMax for the Piz Daint supercomputer, as documented in the database TOP500. The actual positions are marked by bubbles on the diagram lines. [Panel a) legend: Summit, Sierra, Taihulight, Tianhe-2, Piz Daint, Trinity, ABCI, SuperMUC-NG, Titan, Sequoia. Panel b) stages: Xeon E5-2670 (2012, 2013), Xeon E5-2670 + NVIDIA K20x (2013), Xeon E5-2690 + NVIDIA Tesla P100 (2016, 2017, 2018). Both panels: RMax versus RPeak, in exaFLOPS, on logarithmic scales.]

4.3.3. Synchronization
Although not explicitly dealt with here, notice that the data exchange between the first thread and the other ones also contributes to the non-parallelizable fraction and typically uses system calls; for details see [57, 19, 10]. Actually, we may have communicating serial processes, which does not improve the effective parallelism at all [1]. Some classes of applications (like artificial neural networks) need intensive and frequent data exchange and, in addition, because of the "time grid" they need to use to coordinate the operation of the "neurons", they overload the network with bursts of messages.

5. Supercomputer case studies

From the sections above it can be concluded that parallelized sequential computing systems have some upper limit on their payload performance (the nominal performance of course can be increased without limitation, but the efficiency decreases proportionally). In this section some case studies on supercomputer implementations are presented, utilizing only public information. Examples of deploying the formalism on other fields of parallelized computing are given in [54].

5.1. Taihulight (Sunway)
In the parallelized sequential computing systems implemented in the SPA [3], the life begins in one such sequential subsystem. In the large parallelized applications running on general purpose supercomputers, initially and finally only one thread exists, i.e. the minimal absolutely necessary non-parallelizable activity is to fork the other threads and join them again. With the present technology, no such action can be shorter than one processor clock period¹¹. That is, the absolute minimum value of the non-parallelizable fraction will be given as the ratio of the time of those two clock periods to the total execution time. The latter time is a free parameter in describing the efficiency, i.e. the value of the effective parallelization α_eff also depends on the total benchmarking time (and so does the achievable parallelization gain, too).

This dependence is of course well known to supercomputer scientists: for measuring the efficiency with better accuracy (and also for producing better α_eff values), hours of execution time are used in practice. In the case of benchmarking Taihulight [12], 13,298 seconds benchmark runtime was used; on the 1.45 GHz processors it means 2·10^13 clock periods. The inherent limit of (1 - α_eff) at such a benchmarking time is 10^-13 (or equivalently the achievable performance gain is 10^13). If the fork/join is executed by the OS as usual, because of the needed context switchings 2·10^4 clock cycles [43] are needed rather than the 2 clock cycles considered in the idealistic case, i.e. the derived values are correspondingly 4 orders of magnitude different; that is, the performance gain cannot be above 10^9. For the development of the achieved performance gain and the α values for the top supercomputers, see Fig. 8.

Footnote 11: Taking these two clock periods as an ideal (but not realistic) case, the actual limitation will surely be (thousands of times) worse than the one calculated for this idealistic case. The actual number of clock periods depends on many factors, as discussed below.
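The arithmetic of section 5.1 can be condensed into a few lines; the sketch below reproduces the orders of magnitude quoted above (about 2·10^13 clock periods, a 10^-13 inherent floor for (1 - α_eff), and the 10^9 gain ceiling once roughly 2·10^4 context-switch cycles are charged). Only the figures quoted in the text are used as inputs; everything else is an assumption.

```python
# Back-of-the-envelope arithmetic of Section 5.1 (Taihulight HPL run).
# Inputs are the figures quoted in the text; the function itself is generic.

def gain_ceiling(benchmark_time_s: float, clock_hz: float,
                 sequential_cycles: float) -> float:
    """Performance-gain ceiling G = 1 / (1 - alpha_eff), where the
    non-parallelizable fraction is sequential_cycles / total_cycles."""
    total_cycles = benchmark_time_s * clock_hz
    one_minus_alpha = sequential_cycles / total_cycles
    return 1.0 / one_minus_alpha

if __name__ == "__main__":
    t, f = 13_298.0, 1.45e9            # benchmark runtime [s], clock [Hz]
    print(f"idealistic fork/join (2 cycles):  G <= {gain_ceiling(t, f, 2):.1e}")
    print(f"OS fork/join (~2e4 cycles [43]):  G <= {gain_ceiling(t, f, 2e4):.1e}")
```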


In the following, for simplicity, 1.00 GHz processors (i.e. 1 ns clock cycle time) will be assumed.

The supercomputers are also distributed systems. In a stadium-sized supercomputer a distance between the processors (cable length) of about 100 m can be assumed. The net signal round trip time is ca. 10^-6 seconds, or 10^3 clock periods, i.e. in the case of a finite-sized supercomputer the performance gain cannot be above 10^10 (or 10^6 if context switching is also needed). The presently available network interfaces have 100...200 ns latency times, and sending a message between processors takes time in the same order of magnitude. Since the signal propagation time is longer than the latency of the network, this also means that making a better interconnection is not really a bottleneck in enhancing computing performance. This statement is underpinned also by statistical considerations [47].

Taking the (maybe optimistic) value of 2·10^3 clock periods for the signal propagation time, the value of the effective parallelization (1 - α_eff) will be at best in the range of 10^-10, only because of the physical size of the supercomputer. This also means that the expectations against the absolute performance of supercomputers are excessive: assuming a 100 Gflop/s processor and a realistic physical size, no operating system and no non-parallelizable code fraction, the achievable absolute nominal performance (see Eq. (5)) is 10^11 · 10^10 flop/s, i.e. 1000 EFlops. To implement this, around 10^9 processors are required. One can assume that the value of (1 - α_eff^x) will be¹² around the value 10^-7. With those very optimistic assumptions (see Equ. (4)) the payload performance for the benchmark HPL will be less than 10 Eflops, and for the real-life applications of the class of the benchmark HPCG it will surely be below 0.01 EFlops, i.e. lower than the payload performance of the present TOP1-3 supercomputers.

These predictions enable to assume that the presently achieved value of (1 - α_eff) persists also for roughly a hundred times more cores. However, another major issue arises from the computing principle SPA: on an SPA bus only one core at a time can be addressed. As a consequence, at minimum as many clock cycles are to be used for organizing the parallel work as many addressing steps are required. Basically, this number equals the number of cores in the supercomputer, i.e. the addressing in the TOP10 positions typically needs clock cycles in the order of 5·10^5...10^7, degrading the value of (1 - α_eff) into the range 10^-6...2·10^-5. The number of the addressing steps can be mitigated using clustering, vectoring, etc., or at the other end the processor itself can take over the responsibility of addressing its cores [58]. Depending on the actual construction, the reducing factor of clustering of those types can be in the range 10^1...5·10^3, i.e. the resulting value of (1 - α_eff) is expected to be around 10^-7. Notice that utilizing "cooperative computing" [58] enhances further the value of (1 - α_eff), but it means already utilizing a (slightly) different computing paradigm: the cores have a direct connection and can communicate with the exclusion of the main memory.

An operating system must also be used, for protection and convenience. If one considers context switching with its consumed 2·10^4 cycles [43], the absolute limit is cca. 5·10^-8, on a zero-sized supercomputer. This value is somewhat better than the limiting value derived above, but it is close to that value and surely represents a considerable contribution. This is why Taihulight runs the actual computations in kernel mode [58].

5.2. Sum of the non-parallelizable contributions
Notice the special role of the non-parallelizable activities: independently of their origin, they are summed up as the 'sequential-only' contribution and degrade the payload performance considerably. In systems comprising parallelized sequential processes, actions like communication (including also MPI), synchronization, accessing shared resources, etc. [1, 10, 19, 57] all contribute to the sequential-only part. Their effect becomes more and more drastic as the number of the processors increases. One must take care, however, how the communication is implemented. A nice example is shown in [4] of how direct core to core (in other words: direct thread to thread) communication can enhance parallelism in large-scale systems.

5.3. Competition of the contributions for dominance
As discussed above, the different contributions to (1 - α_eff) depend on different factors, so their ranking in affecting the value of RMax changes with the nominal performance and with how the system is assembled from SPA processors. Fig. 4 attempts to provide a feeling of the effect of the software contribution. A fictive supercomputer (with behavior somewhat similar to that of supercomputer Taihulight) is modeled. All subfigures have dual scaling. The blue diagram line refers to the right hand scale and shows the payload performance corresponding to the actual α_eff contributions; all the rest refer to the left hand scale and display the (1 - α_eff^XX) contributions (for the details see [49]) to the non-parallelizable fraction. The turn-back of the (1 - α_eff) diagram clearly shows the presence of the "performance wall" (compare it to Fig. 1 in [41]).

For the sake of simplicity, only those components are depicted that have some role in forming the (1 - α_eff) value. In some other special cases other contributions may dominate. For example, as presented in [52], in the case of brain simulation a hidden clock signal is introduced, and its effect is in close competition with the effect of the frequent context switchings for dominating the achievable performance. Notice that the performance breakdowns shown in the figures were experimentally measured by [41], [28] (Fig. 7) and [2] (Fig. 8).

Footnote 12: With the present technology the best achievable value is ca. 10^-6, which was successfully enhanced by clustering to ca. 2·10^-7 for Summit and Sierra, and the special cooperating cores of Taihulight enabled to achieve 3·10^-8.
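The competition described in sections 5.1-5.3 can be mimicked by summing the individual (1 - α_eff^X) terms and watching which one dominates at a given size. The sketch below uses the orders of magnitude quoted in the text (propagation around 10^-10, context switching around 5·10^-8, addressing of roughly one cycle per core, reduced by clustering) purely as illustrative inputs, and the simple additive combination is itself an assumption (cf. footnote 6).

```python
# Rough competition of (1 - alpha_eff) contributions, using the orders of
# magnitude quoted in Section 5; the additive summing is a simplification.

def addressing_term(n_cores: int, clustering_factor: float,
                    total_cycles: float) -> float:
    """One addressing cycle per core on an SPA bus, mitigated by clustering."""
    return (n_cores / clustering_factor) / total_cycles

def combined_one_minus_alpha(n_cores: int) -> dict:
    total_cycles = 2e13                    # ~ hours-long benchmark at ~1 GHz
    terms = {
        "propagation": 1e-10,              # physical size (Section 5)
        "context_sw":  5e-8,               # OS context switching [43]
        "addressing":  addressing_term(n_cores, 1e3, total_cycles),
    }
    terms["total"] = sum(terms.values())
    return terms

if __name__ == "__main__":
    for n in (1e5, 1e7, 1e9):
        t = combined_one_minus_alpha(int(n))
        dominant = max((k for k in t if k != "total"), key=t.get)
        print(f"cores={n:.0e}  (1-alpha)={t['total']:.1e}  dominant: {dominant}")
```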


Figure 4: Contributions (1 - α_eff^X) to (1 - α_eff^total) and the maximum payload performance RMax of a fictive supercomputer (P = 1 Gflop/s @ 1 GHz) in function of the nominal performance. The blue diagram line refers to the right hand scale (RMax values), all others (the (1 - α_eff^X) contributions) to the left scale. The left subfigure illustrates the behavior measured with benchmark HPL. The looping contribution becomes remarkable around 0.1 Eflops, and breaks down the payload performance when approaching 1 Eflops. In the right subfigure the behavior measured with benchmark HPCG is displayed. In this case the contribution of the application (thin brown line) is much higher, the looping contribution (thin green line) is the same as above. As a consequence, the achievable payload performance is lower and also the breakdown of the performance is softer. [Both panels: RMax (Eflop/s) and the (1 - α_eff^SW), (1 - α_eff^OS) contributions versus RPeak (Eflop/s), on logarithmic scales.]

5.4. The future of supercomputing
Because of all of the above, in the name of the company PEZY¹³ the last two letters are surely obsolete. Also, no Zettaflops supercomputers will be delivered for science and military [14].

Experts expect the performance¹⁴ to achieve the magic 1 Eflop/s around year 2020, see Fig. 1 in [33], although already question marks, mystic events and communications appeared as the date approaches. The authors noticed that "the performance increase of the No. 1 systems slowed down around 2013, and it was the same for the sum performance", but they extrapolate linearly and expect that the development continues and that "zettascale computing" (i.e. 10^4 times more than the present performance) will be achieved in just more than a decade. Although they address a series of important questions, the question whether building computers of such size is feasible remains out of their sight.

From TOP500 data, as a prediction, RMax values in function of RPeak can be calculated, see Fig. 3 a). The reported (measured) performance values are marked by bubbles on the figure. When making that prediction, the number of processors was virtually changed for the different configurations, without correcting for the increasing looping delay; i.e. the graphs are strongly optimistic, see also Fig. 4. As expected, the RMax values (calculated in this optimistic way) saturate around 0.35 Eflop/s. Without some breakthrough in the technology and/or paradigm, even approaching the "dream limit" is not possible.

5.5. Piz Daint
Due to the quick development of the technology, the supercomputers usually have not many items registered in the database TOP500 on their development. One of the rare exceptions is the supercomputer Piz Daint. Its development history spans 6 years and two orders of magnitude in performance, and used both non-accelerated computing and accelerated computing with two different accelerators. Although usually more than one of its parameters was changed between the registered stages of its development, it nicely underpins the statements of the paper.

Fig. 3 b) displays how the payload performance in function of the nominal performance has developed in the case of Piz Daint (see also Fig. 2 in [52]). The bubbles display the measured performance values documented in the database TOP500 [42], and the diagram lines show the (at that stage) predicted performance. As the diagram lines show the "predicted performance", the accuracy of the prediction can also be estimated through the data measured in the next stage. It is very accurate for short distances, and the jumps can be qualitatively understood knowing the reason. In the previous section the accuracy of the predictions based on the model has been left open. This figure also validates the prediction for the TOP10 supercomputers depicted in Fig. 3 a).

The data from the first two years of Piz Daint (non-accelerated mode of operation) can be compared directly. Increasing the number of the cores results in the expected higher performance, as the working point is still in the linear region of the efficiency surface. The value slightly above the predicted one can be attributed to the fine-tuning of the architecture. Introducing accelerators resulted in a jump of payload efficiency (and also moved the working point to the slightly non-linear region, see Fig. 5), and the payload performance is roughly 3 times more than would be expected purely on the predicted value calculated from the non-accelerated architecture.

Footnote 13: https://en.wikipedia.org/wiki/PEZY_Computing: The name PEZY is an acronym derived from the Greek-derived metric prefixes Peta, Eta, Zetta, Yotta.
Footnote 14: There are some doubts about the definition of exaFLOPS: whether it means RPeak or RMax, in the former case whether it includes accelerator cores, and in the latter case measured by which benchmark. Here the term is used as HPL RMax.
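The saturation of the predicted RMax described above can be reproduced qualitatively from Eq. (4) alone: with a fixed single-processor performance and a fixed (1 - α), the payload performance levels off at P_single/(1 - α) no matter how far RPeak is pushed. The sketch below is only a qualitative illustration; the single-processor performance and the (1 - α) value are assumptions, not fitted to the TOP500 data.

```python
# Qualitative "roofline" of payload performance from Equ. (4):
# RMax = RPeak * E, with E = 1 / (k*(1-alpha) + alpha) and k = RPeak / P_single.
# The parameter values are illustrative assumptions.

def rmax(rpeak_flops: float, p_single_flops: float, one_minus_alpha: float) -> float:
    k = rpeak_flops / p_single_flops                  # number of single processors
    alpha = 1.0 - one_minus_alpha
    efficiency = 1.0 / (k * one_minus_alpha + alpha)
    return rpeak_flops * efficiency

if __name__ == "__main__":
    p_single = 1e10          # assumed 10 Gflop/s per processor
    oma = 3e-8               # assumed effective (1 - alpha)
    for rpeak in (1e16, 1e17, 1e18, 1e19):            # 0.01 ... 10 Eflop/s
        print(f"RPeak = {rpeak:.0e}  RMax = {rmax(rpeak, p_single, oma):.2e} flop/s")
    print(f"saturation level ~ {p_single / oma:.2e} flop/s")
```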


Figure 5: The efficiency values of the supercomputer Piz Daint on the two-dimensional efficiency surface (the dependence of EHPL on (1 − αeff) and the number of cores) at the different stages of its building, between 2012/11 and 2018/11; calculated from the publicly available database [42].

5.6. Gyoukou

A nice "experimental proof" of the existence of the performance limit is the one-time appearance of the supercomputer Gyoukou on the TOP500 list in Nov. 2017. It participated in the competition using 2.5M cores (out of the 20M available), and its (1 − αeff) value was 1.9 ∗ 10^−7, comparable with the data of Summit (2.4M cores and 1.7 ∗ 10^−7). Simply, the performance bound did not enable increasing the payload performance any further.

5.7. Brain simulation

Artificial intelligence (including simulating the brain operation using computing devices) attracts exponentially growing interest, and the size of such systems is also continuously growing. In the case of brain simulation the "flagship goal" is to simulate tens of billions of neurons, corresponding to the capacity of the human brain. Those definitely huge systems really go to the extremes, but they also suffer from the common limitation of large-scale parallelized sequential systems. Recent studies have shown that, using the present methods (paradigm and technology), the behavioral level of the brain is simply out of reach of the research [2].

It was shown recently [52] that the special method of simulating artificial neural networks, using a "time grid", causes a breakdown at relatively low computing performance, and that under those special conditions the frequent context switchings and the permanent need of synchronization compete for dominating the performance of the application.

6. Statistical underpinning

For now, supercomputing has a quarter of a century of history and a well-documented and rigorously verified database [42] of architectural and performance data. The huge variety of solutions and ideas does not make it easy to draw conclusions, and especially to make forecasts for the future of supercomputing. The large number of available data, however, enables drawing reliable general conclusions about some features of parallelized sequential computing systems. Those conclusions have of course only statistical validity, because of the variety of sources of components, the different technologies and ideas, as well as the interplay of many factors. That is, the result shows a considerable scattering and requires an extremely careful analysis. The large number of cases, however, enables drawing some reliable general conclusions.

6.1. Correlation between the number of cores and the achieved rank

Since the resulting performance (and so the ranking) depends both on the number of processors and on the effective parallelization, those quantities are correlated in Fig. 6. As expected, in the TOP50 supercomputers the higher the ranking position, the higher the required number of processors in the configuration, and, as outlined above, the more processors, the lower the required (1 − αeff) (provided that the same efficiency is targeted).

In the TOP10, the slope of the regression line in the left subfigure changes sharply relative to the TOP50 regression line, showing the strong competition for a better ranking position. Maybe the value of the slope can provide the "cut line" between "racing supercomputers" and "commodity supercomputers". In the right subfigure, the TOP10 data points provide the same slope as the TOP50 data points, demonstrating that, to produce a reasonable efficiency, the increasing number of cores must be accompanied by a proper decrease in the value of (1 − αeff), as expected from Equ. (4). Furthermore, to achieve a good ranking, a good value of (1 − αeff) must be provided. Recall that the excellent performance of TaihuLight shall be attributed to its special processor, deploying "Cooperative computing" [58].
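Under the same assumed Amdahl form (an assumption of this sketch, not a quotation of the paper's Equ. (4)), the (N, 1 − αeff) pairs quoted above for Gyoukou and Summit translate into a performance gain and an implied efficiency as follows; the printed numbers are implications of the assumed model, not values taken from the TOP500 database.

# Assumed Amdahl form: gain G = 1/((1-alpha) + alpha/N), efficiency E = G/N.
def gain_and_efficiency(n_cores, one_minus_alpha):
    alpha = 1.0 - one_minus_alpha
    gain = 1.0 / (one_minus_alpha + alpha / n_cores)
    return gain, gain / n_cores

# (N, 1-alpha_eff) pairs quoted in the text for Gyoukou and Summit.
for name, n, oma in (("Gyoukou", 2.5e6, 1.9e-7), ("Summit", 2.4e6, 1.7e-7)):
    g, e = gain_and_efficiency(n, oma)
    print(f"{name}: performance gain = {g:.2e}, implied efficiency = {e:.2f}")

Both systems end up with a gain of the order of 10^6 despite millions of cores, illustrating why the performance bound did not allow increasing the payload performance further.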


Figure 6: a) The correlation of the TOP50 supercomputer ranking (by HPL) with the number of cores. b) The correlation of the TOP50 supercomputer ranking with the value of (1 − αeff); both panels show regression lines for the TOP50 and for the TOP10.
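The (1 − αeff) values plotted in Fig. 6 b) (and later in Fig. 8 a)) can be derived from publicly listed data; a minimal sketch follows, assuming that Equ. (7) is the Amdahl formula solved for (1 − αeff), i.e. (1 − αeff) = (1/E − 1)/(N − 1) with E = RMax/RPeak. This assumed form and the numeric values below are illustrative, not entries of the database.

# Assumed inversion of the Amdahl formula (taken here as Equ. (7)):
#   E = RMax/RPeak,  (1 - alpha_eff) = (1/E - 1) / (N - 1)
def one_minus_alpha_eff(rmax, rpeak, n_cores):
    efficiency = rmax / rpeak
    return (1.0 / efficiency - 1.0) / (n_cores - 1)

# Hypothetical entry (not a database value): RMax 100 Pflop/s,
# RPeak 150 Pflop/s, five million cores.
print(f"(1 - alpha_eff) = {one_minus_alpha_eff(100.0, 150.0, 5_000_000):.2e}")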

Figure 7: a) The correlation of the single-processor performance of the TOP50 supercomputers (having processors with and without accelerator) with ranking, in 2017. b) The correlation of the performance gain of the TOP50 supercomputers (having processors with and without accelerator) with ranking, in 2017.

6.2. Deploying accelerators (GPUs)

As suggested by Eq. (5), the trivial way to increase the absolute performance of a supercomputer is to increase the single-processor performance of its processors. Since the single-processor performance has reached its limits, some kind of accelerator (mostly a General-Purpose Graphics Processing Unit (GPGPU)) is frequently used for this goal. Fig. 7 shows how utilizing accelerators influences the ranking of supercomputers. The two important factors of supercomputers, the single-processor performance and the parallelization efficiency, are displayed as a function of ranking.

As the left side of the figure depicts, the coprocessor-accelerated cores show the lowest performance; they really can benefit from acceleration.^15 The GPGPU-accelerated processors really do increase the performance of the processors by a factor of 2–3. This result confirms the results of a former study where an average factor of 2.5 was found [31]. However, this increased performance is about 40–70 times lower than the nominal performance of the GPGPU accelerator. The effect is attributed to the considerable overhead [9], and it was demonstrated that by improving the transfer performance the application performance can be considerably enhanced. Indirectly, that research also proved that the operating principle itself (i.e. that the data must be transferred to and from the GPU memory; recall that GPUs do not have cache memory) takes some extra time. In terms of Amdahl's law, this transfer time contributes to the non-parallelizable fraction, i.e. it increases (1 − αeff) and thus decreases the achievable performance gain.

^15 In the number of the total cores, the number of coprocessors is included.
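The trade-off just described can be made explicit with a small sketch, again assuming the Amdahl form of the efficiency; the acceleration factor of about 3 and the roughly tenfold growth of (1 − αeff) caused by the host-to-device copying are illustrative values motivated by the discussion in this section, not measured data.

# Assumed Amdahl form; accel_factor multiplies single-processor performance,
# while copying to/from the accelerator memory enlarges (1-alpha_eff).
def payload_gflops(n_cores, p_single, one_minus_alpha, accel_factor=1.0):
    efficiency = 1.0 / (n_cores * one_minus_alpha + (1.0 - one_minus_alpha))
    return n_cores * p_single * accel_factor * efficiency

for n in (1e3, 1e5, 1e7):
    plain = payload_gflops(n, 20.0, 1e-7)                  # no accelerator
    gpu = payload_gflops(n, 20.0, 1e-6, accel_factor=3.0)  # ~3x faster cores, ~10x worse (1-alpha)
    print(f"N = {n:.0e}: accelerated / non-accelerated payload = {gpu / plain:.2f}")

With these illustrative numbers the accelerated configuration wins at a few thousand cores but loses at ten million cores, which mirrors the conclusion drawn below.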


Figure 8: a) The trend of the development of (1 − αeff) in the past 25 years, based on the first three supercomputers (by RMax) and the best one (by (1 − αeff)) in the year in question; the position of Sunway TaihuLight is marked. b) The performance gain of supercomputers modeled as a "roofline" [56], as measured with the benchmarks HPL and HPCG, and the level concluded for brain simulation.

The right side of the figure reveals this effect: the effective parallelization of the GPU-accelerated systems is nearly ten times worse than that of the coprocessor-accelerated processors and about 5 times worse than that of the non-accelerated processors, i.e. the resulting efficiency is worse than in the case of utilizing unaccelerated processors; this is a definite disadvantage when GPUs are used in systems with an extremely large number of processors.

The key to this enigma is hidden in Eq. (4): the payload performance increases by a factor of nearly 3, but the value of (1 − αeff), increased by nearly an order of magnitude, is multiplied by the number of cores in the system. In other words: while deploying GPGPU-accelerated cores in systems having a few thousand cores is advantageous, in supercomputers having processors in the range of millions it is a rather expensive way to make the supercomputer performance worse. This makes it at least questionable whether it is worth utilizing GPGPUs in large-scale supercomputers. For a discussion see section 5.5; for a direct experimental proof see Figs. 3 and 5.

As the left figure shows, neither kind of acceleration shows a correlation between the ranking of the supercomputer and the type of the acceleration. Essentially the same is confirmed by the right side of the figure: the performance amplification rises with the better ranking position, and the slope is higher for any kind of acceleration: to move the data from one memory to the other takes time.

6.3. The supercomputer timeline and the "roofline" model

As a quick test, Equ. (7) can be applied to the data from [42], see Fig. 8 a). As shown, supercomputer history is about the development of the effective parallelism, and Amdahl's law as formulated by Equ. (7) is to the effective parallelization what Moore's law is to the size of electronic components.^16 (The effect of Moore's law is eliminated when calculating RMax/RPeak.) To understand the behavior of the trend line, just recall Equ. (4): to increase the absolute performance, more processors shall be included, and to provide a reasonable efficiency, the value of (1 − αeff) must be properly reduced.

^16 They are also connected in that their validity terminates in the present technological epoch.

The "roofline" model [56] is successful in fields where some resource limits the maximum performance resulting from the interplay of the other components. Here the limiting resource is the performance gain (originating from Amdahl's law, the technical implementation and the computing paradigm together) that limits the engineering solutions. Their interplay may result in more or less perfect performance gains, but no combination enables exceeding that absolute limit.

The two roofline levels displayed in Fig. 8 b) correspond to the values measurable with the benchmarks HPL and HPCG, respectively. The latter benchmark has a documented history of only three items, but the data are convincing enough to set a reliable roofline level. Here the contribution of the SW dominates (the "real-life" application class), so the architectural solution makes no big difference, see also section 3.3.

The third roofline level (see section 5.7) is inferred from the single available measured data set [2]. The two smaller black dots show the performance data of the full configuration (as measured by the benchmark HPCG) of the two supercomputers the authors had access to, and the red dot denotes the saturation value they experienced (and so they did not deploy more hardware). This strongly supports the assumption that the achievable supercomputer performance depends on the type of the application.
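The roofline itself follows directly from the assumed Amdahl form: the gain G(N) = 1/((1 − αeff) + αeff/N) grows with N but can never exceed 1/(1 − αeff). In the sketch below the (1 − αeff) values are illustrative only, spaced by two orders of magnitude to mimic the HPL, HPCG and brain-simulation roof levels discussed in the next paragraphs; they are not the paper's measured values.

# Assumed Amdahl form of the gain; the roof is its limit for N -> infinity.
def performance_gain(n_cores, one_minus_alpha):
    return 1.0 / (one_minus_alpha + (1.0 - one_minus_alpha) / n_cores)

# Illustrative (1-alpha_eff) values, two orders of magnitude apart.
for name, oma in (("HPL-like", 1e-7), ("HPCG-like", 1e-5), ("brain-simulation-like", 1e-3)):
    roof = 1.0 / oma
    gain = performance_gain(1e7, oma)
    print(f"{name}: roof = {roof:.0e}, gain at N = 1e7 cores = {gain:.2e}")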

The roof level concluded from the performance gain measured by the benchmark HPL measures only the effect of organizing the joint work plus the fork/join operation. The roof level concluded from the HPCG benchmark is about two orders of magnitude lower because of the increased amount of communication needed for the iteration. In the case of brain simulation the top level of the performance gain is two more orders of magnitude lower because of the need for even more intensive communication (many other neurons must also be periodically informed about the result of the neural calculation). In the case of AI networks the intensity of communication is between the last two (depending on the type and size of the network), so correspondingly the achievable performance gain must also reside between the last two roof levels. The figure demonstrates how the "communication-to-computation ratio" introduced by [41] affects the achievable performance gain. Notice that the achieved performance gain (= speedup) of brain simulation is about 10^3, and, based on the amount of communication, artificial neural networks cannot show a much higher performance gain either. The bottleneck is not the performance of the floating-point operations; rather it is the need for communication and the exceptionally high "communication-to-computation ratio". Without the need of organizing the joint work there would be no roof level at all: adding more cores would increase the performance gain with a constant slope. With a non-zero non-parallelizable contribution the roof level appears, and the higher that contribution is, the lower the value of the roof level (or, in other words, the higher the non-parallelizable contribution, the lower the nominal performance at which the roofline effect appears; in the figure this is expressed in years).

The results of the benchmark HPL show a considerable scatter, and some points are even above the roofline. This benchmark is sensitive to architectural solutions like clustering (internal or external), accelerators, absolute performance, etc., since here no single dominating contribution is present from the outset. Because of this, some "relaxation time" was and will be needed until a right combination resulting in a performance gain approaching the roofline was/will be found.

The three points above the roofline belong to the same supercomputer, TaihuLight. Its "cooperating processors" [58] work using a slightly different computing paradigm: they slightly violate the principles of the SPA, which is applied by all the rest of the supercomputers. Because of this, its performance gain is limited by a slightly different roofline: changing the computing paradigm (or the principles of implementation) changes the rules of the game. This also hints at a possible way out: the computing paradigm shall be modified [50, 48] in order to introduce a higher roofline level.

7. Summary

The payload performance of parallelized sequential computing systems has been analyzed both theoretically and using the supercomputer database with well-documented performance values. It was shown that both the (strongly simplified) theoretical description and the empirical trend show a limitation on the payload performance of large-scale parallelized computing, at the same value. The difficulties experienced in building ever-larger supercomputers, and especially in utilizing artificial intelligence applications on supercomputers or building brain simulators from SPA computer components, convincingly prove that the present supercomputing has achieved what was enabled by the computing paradigm and the implementation technology. To step further to the next level [21] a real rebooting is required, among others renewing computing [50] and introducing a new computing paradigm [46]. The "performance wall" [55] has been hit.

References

[1] Abdallah, A.E., Jones, C., Sanders, J.W. (Eds.), 2005. Communicating Sequential Processes. The First 25 Years. Springer. doi:10.1007/b136154.
[2] van Albada, S.J., Rowley, A.G., Senk, J., Hopkins, M., Schmidt, M., Stokes, A.B., Lester, D.R., Diesmann, M., Furber, S.B., 2018. Performance Comparison of the Digital Neuromorphic Hardware SpiNNaker and the Neural Network Simulation Software NEST for a Full-Scale Cortical Microcircuit Model. Frontiers in Neuroscience 12, 291.
[3] Amdahl, G.M., 1967. Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities, in: AFIPS Conference Proceedings, pp. 483–485. doi:10.1145/1465482.1465560.
[4] Ao, Y., Yang, C., Liu, F., Yin, W., Jiang, L., Sun, Q., 2018. Performance Optimization of the HPCG Benchmark on the Sunway TaihuLight Supercomputer. ACM Trans. Archit. Code Optim. 15, 11:1–11:20.
[5] Asanovic, K., Bodik, R., Demmel, J., Keaveny, T., Keutzer, K., Kubiatowicz, J., Morgan, N., Patterson, D., Sen, K., Wawrzynek, J., Wessel, D., Yelick, K., 2009. A View of the Parallel Computing Landscape. Communications of the ACM 52, 56–67.
[6] Bell, G., Bailey, D.H., Dongarra, J., Karp, A.H., Walsh, K., 2017. A look back on 30 years of the Gordon Bell Prize. The International Journal of High Performance Computing Applications 31, 469–484. doi:10.1177/1094342017738610.
[7] Bourzac, K., 2017. Stretching supercomputers to the limit. Nature 551, 554–556.
[8] Chandy, J.A., Singaraju, J., 2009. Hardware parallelism vs. software parallelism, in: Proceedings of the First USENIX Conference on Hot Topics in Parallelism, USENIX Association, Berkeley, CA, USA. pp. 2–2.
[9] Daga, M., Aji, A.M., Feng, W.c., 2011. On the efficacy of a fused CPU+GPU processor (or APU) for parallel computing, in: 2011 Symposium on Application Accelerators in High-Performance Computing, pp. 141–149. doi:10.1109/SAAHPC.2011.29.
[10] David, T., Guerraoui, R., Trigonakis, V., 2013. Everything you always wanted to know about synchronization but were afraid to ask, in: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13), pp. 33–48. doi:10.1145/2517349.2522714.
[11] Dettmers, T., 2015. The Brain vs Deep Learning Part I: Computational Complexity – Or Why the Singularity Is Nowhere Near. http://timdettmers.com/2015/07/27/brain-vs-deep-learning-singularity/.
[12] Dongarra, J., 2016. Report on the Sunway TaihuLight System. Technical Report UT-EECS-16-742. University of Tennessee Department of Electrical Engineering and Computer Science.
[13] Ellen, F., Hendler, D., Shavit, N., 2012. On the Inherent Sequentiality of Concurrent Objects. SIAM J. Comput. 43, 519–536. doi:10.1137/08072646X.


[14] DeBenedictis, E.P., 2005. Petaflops, Exaflops, and Zettaflops for Science and Defense. http://debenedictis.org/erik/SAND-2005/SAND2005-2690-CUG2005-B.pdf.
[15] Esmaeilzadeh, H., 2015. Approximate acceleration: A path through the era of dark silicon and big data, in: Proceedings of the 2015 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, IEEE Press, Piscataway, NJ, USA. pp. 31–32. http://dl.acm.org/citation.cfm?id=2830689.2830693.
[16] Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam, K., et al., 2012. Dark Silicon and the End of Multicore Scaling. IEEE Micro 32, 122–134.
[17] European Commission, 2016. Implementation of the Action Plan for the European High-Performance Computing strategy. http://ec.europa.eu/newsroom/dae/document.cfm?doc_id=15269.
[18] Extremtech, 2018. Japan Tests Silicon for Exascale Computing in 2021. https://www.extremetech.com/computing/272558-japan-tests-silicon-for-exascale-computing-in-2021.
[19] Eyerman, S., Eeckhout, L., 2010. Modeling Critical Sections in Amdahl's Law and Its Implications for Multicore Design. SIGARCH Comput. Archit. News 38, 362–370. doi:10.1145/1816038.1816011.
[20] Fu, H., Liao, J., Yang, J., Wang, L., Song, Z., Huang, X., Yang, C., Xue, W., Liu, F., Qiao, F., Zhao, W., Yin, X., Hou, C., Zhang, C., Ge, W., Zhang, J., Wang, Y., Zhou, C., Yang, G., 2016. The Sunway TaihuLight supercomputer: system and applications. Science China Information Sciences 59, 1–16. doi:10.1007/s11432-016-5588-7.
[21] Fuller, S.H., Millett, L.I., 2011. Computing Performance: Game Over or Next Level? Computer 44, 31–38.
[22] Furber, S.B., Lester, D.R., Plana, L.A., Garside, J.D., Painkras, E., Temple, S., Brown, A.D., 2013. Overview of the SpiNNaker System Architecture. IEEE Transactions on Computers 62, 2454–2467. doi:10.1109/TC.2012.142.
[23] Gustafson, J.L., 1988. Reevaluating Amdahl's Law. Commun. ACM 31, 532–533. doi:10.1145/42411.42415.
[24] Hennessy, J.L., Patterson, D.A., 2007. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers.
[25] HPCG Benchmark, 2016. HPCG benchmark. http://www.hpcg-benchmark.org/.
[26] Hwang, K., Jotwani, N., 2016. Advanced Computer Architecture: Parallelism, Scalability, Programmability. 3rd ed., McGraw Hill.
[27] IEEE Spectrum, 2017. Two Different Top500 Supercomputing Benchmarks Show Two Different Top Supercomputers. https://spectrum.ieee.org/tech-talk/computing/hardware/two-different-top500-supercomputing-benchmarks-show-two-different-top-supercomputers.
[28] Ippen, T., Eppler, J.M., Plesser, H.E., Diesmann, M., 2017. Constructing Neuronal Network Models in Massively Parallel Environments. Frontiers in Neuroinformatics 11, 30. doi:10.3389/fninf.2017.00030.
[29] Karp, A.H., Flatt, H.P., 1990. Measuring Parallel Processor Performance. Commun. ACM 33, 539–543. doi:10.1145/78607.78614.
[30] Krishnaprasad, S., 2001. Uses and Abuses of Amdahl's Law. J. Comput. Sci. Coll. 17, 288–293. http://dl.acm.org/citation.cfm?id=775339.775386.
[31] Lee, V.W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen, A.D., Satish, N., Smelyanskiy, M., Chennupaty, S., Hammarlund, P., Singhal, R., Dubey, P., 2010. Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU, in: Proceedings of the 37th Annual International Symposium on Computer Architecture, ACM, New York, NY, USA. pp. 451–460. doi:10.1145/1815961.1816021.
[32] Liao, X., 2019. Moving from exascale to zettascale computing: challenges and techniques. Frontiers of Information Technology & Electronic Engineering, 1236–1244.
[33] Liao, X.k., Lu, K., Yang, C.q., Li, J.w., Yuan, Y., Lai, M.c., Huang, L.b., Lu, P.j., Fang, J.b., Ren, J., Shen, J., 2018. Moving from exascale to zettascale computing: challenges and techniques. Frontiers of Information Technology & Electronic Engineering 19, 1236–1244. doi:10.1631/FITEE.1800494.
[34] de Macedo Mourelle, L., Nedjah, N., Pessanha, F.G., 2016. Reconfigurable and Adaptive Computing: Theory and Applications. CRC Press. Chapter 5: Interprocess Communication via Crossbar for Shared Memory Systems-on-chip.
[35] Markov, I., 2014. Limits on fundamental limits to computation. Nature 512(7513), 147–154.
[36] Molnár, P., Végh, J., 2017. Measuring Performance of Processor Instructions and Operating System Services in Soft Processor Based Systems, in: 18th Internat. Carpathian Control Conf. ICCC, pp. 381–387.
[37] Patterson, D., Hennessy, J. (Eds.), 2017. Computer Organization and Design. RISC-V Edition. Morgan Kaufmann.
[38] Paul, J.M., Meyer, B.H., 2007. Amdahl's Law Revisited for Single Chip Systems. International Journal of Parallel Programming 35, 101–123. doi:10.1007/s10766-006-0028-8.
[39] Bryant, R.E., O'Hallaron, D.R., 2014. Computer Systems: A Programmer's Perspective. Pearson.
[40] Schlansker, M., Rau, B., 2000. EPIC: Explicitly Parallel Instruction Computing. Computer 33, 37–45. doi:10.1109/2.820037.
[41] Singh, J.P., Hennessy, J.L., Gupta, A., 1993. Scaling parallel programs for multiprocessors: Methodology and examples. Computer 26, 42–50. doi:10.1109/MC.1993.274941.
[42] TOP500.org, 2019. The TOP500 supercomputers. https://www.top500.org/.
[43] Tsafrir, D., 2007. The context-switch overhead inflicted by hardware interrupts (and the enigma of do-nothing loops), in: Proceedings of the 2007 Workshop on Experimental Computer Science, ACM, New York, NY, USA. pp. 3–3. doi:10.1145/1281700.1281704.
[44] US Government NSA and DOE, 2016. A Report from the NSA-DOE Technical Meeting on High Performance Computing. https://www.nitrd.gov/nitrdgroups/images/b/b4/NSA_DOE_HPC_TechMeetingReport.pdf.
[45] US National Research Council, 2011. The Future of Computing Performance: Game Over or Next Level? http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/mar11/Yelick.pdf.
[46] Végh, J., 2016. A configurable accelerator for manycores: the Explicitly Many-Processor Approach. ArXiv e-prints. arXiv:1607.01643.
[47] Végh, J., 2017. Statistical considerations on limitations of supercomputers. CoRR abs/1710.08951. arXiv:1710.08951.
[48] Végh, J., 2018. Introducing the Explicitly Many-Processor Approach. Parallel Computing 75, 28–40.
[49] Végh, J., 2018. Limitations of performance of Exascale Applications and supercomputers they are running on. ArXiv e-prints; submitted to a special issue of the IEEE Journal of Parallel and Distributed Computing. arXiv:1808.05338.
[50] Végh, J., 2018. Renewing computing paradigms for more efficient parallelization of single-threads. IOS Press. Volume 29 of Advances in Parallel Computing, chapter 13, pp. 305–330.
[51] Végh, J., 2019a. Classic versus modern computing: analogies with classic versus modern physics. Information Science, in review.
[52] Végh, J., 2019b. How Amdahl's Law limits the performance of large artificial neural networks (Why the functionality of full-scale brain simulation on processor-based simulators is limited). Brain Informatics 6, 1–11. https://braininformatics.springeropen.com/articles/10.1186/s40708-019-0097-2.
[53] Végh, J., Bagoly, Z., Kicsák, A., Molnár, P., 2014. An alternative implementation for accelerating some functions of operating system, in: Proceedings of the 9th International Conference on Software Engineering and Applications (ICSOFT-EA-2014), pp. 494–499. doi:10.5220/0005104704940499.


[54] Végh, J., Molnár, P., 2017. How to measure perfectness of parallelization in hardware/software systems, in: 18th Internat. Carpathian Control Conf. ICCC, pp. 394–399.
[55] Végh, J., Vásárhelyi, J., Drótos, D., 2019. The performance wall of large parallel computing systems. Springer, pp. 224–237.
[56] Williams, S., Waterman, A., Patterson, D., 2009. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52, 65–76.
[57] Yavits, L., Morad, A., Ginosar, R., 2014. The effect of communication and synchronization on Amdahl's law in multicore systems. Parallel Computing 40, 1–16.
[58] Zheng, F., Li, H.L., Lv, H., Guo, F., Xu, X.H., Xie, X.H., 2015. Cooperative computing techniques for a deeply fused and heterogeneous many-core processor architecture. Journal of Computer Science and Technology 30, 145–162. doi:10.1007/s11390-015-1510-9.


