
Preprint – To appear in the IEEE Int’l. Parallel & Distributed Processing Symp. (IPDPS), May 2013

A theoretical framework for algorithm-architecture co-design

Kenneth Czechowski and Richard Vuduc
School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia
{kentcz,richie}@gatech.edu

Abstract—We consider the problem of how to enable architects and algorithm designers to reason directly and analytically about the relationship between high-level architectural features and algorithm characteristics. We propose a modeling framework designed to help understand the long-term and high-level impacts of algorithmic and technology trends. This model connects abstract communication complexity analysis (with respect to both the inter-core and inter-node networks and the memory hierarchy) with current technology proposals and projections. We illustrate how one might use the framework by instantiating a particular model for a class of architectures and sample computations (three-dimensional fast Fourier transforms, matrix multiply, and three-dimensional stencil). Then, as a suggestive demonstration, we analyze a number of what-if scenarios within the model in light of these trends to suggest broader statements and alternative futures for power-constrained architectures and algorithms.

I. INTRODUCTION

We seek a formal framework that explicitly relates characteristics of an algorithm, such as its inherent parallelism or memory behavior, with parameters of an architecture, such as the number of cores, structure of the memory hierarchy, or network topology. Our ultimate goal is to say precisely and analytically how high-level changes to the architecture might affect the execution time, scalability, accuracy, and power-efficiency of a computation; and, conversely, to identify what classes of computation might best match a given architecture. Our approach marries abstract algorithmic complexity analysis with key physical constraints, such as caps on power and die area, that will be critical in the extreme scale systems of 2018 and beyond [1, 35]. We refer to our approach as one of algorithm-architecture co-design.

We say "algorithm-architecture" rather than "hardware-software," so as to evoke a high-level mathematical analysis that precedes and complements traditional methods based on detailed architectural simulation of concrete code artifacts and traces [9, 23, 27, 30, 48, 51]. Our approach takes inspiration from prior work on high-level performance analysis and modeling [3, 26–28, 44], as well as the classical theory of circuit models and the area-time trade-offs studied in models based on very large-scale integration (VLSI) [37, 49]. Our analysis is in many ways most similar to several recent theoretical exascale modeling studies [22, 47], combined with trends analysis [34]. However, our specific methods return to higher-level I/O-centric complexity analysis [4, 5, 10, 13, 21, 54], pushing it further by trying to resolve analytical constants, which is necessary to connect abstract complexity measures with the physical constraints imposed by power and die area. This approach necessarily will not yield cycle-accurate performance estimates, and that is not our aim. Rather, our hope is that a principled algorithmic analysis that accounts for major architectural parameters will still yield interesting insights and suggest new directions for improving performance and scalability in the long run.

A formal framework. We pose the formal co-design problem as follows. Let a be an algorithm from a set A of algorithms that all perform the same computation within the same desired level of accuracy. The set A might contain different algorithms, such as "A = {fast Fourier transform, F-cycle multigrid}," for the Poisson model problem [17, 41]. Or, A may be a set of tuning parameters for one algorithm, such as the set of tile sizes for matrix multiply. Next, let µ be a machine architecture from a set M, and suppose that each processor of µ has an area of χ(µ). Lastly, let T(n; a, µ) be the time to execute a on µ for a problem of size n, while using a maximum instantaneous power of Φ(µ). Then, our goal is to determine the algorithm a and architecture µ that minimize time subject to constraints on total power and processor die area, e.g.,

    (a*, µ*) = argmin over (a ∈ A, µ ∈ M) of T(n; a, µ)    (1)

subject to

    Φ(µ*) = Φmax    (2)
    χ(µ*) = χmax,   (3)

where Φmax and χmax are caps on power and die area, respectively.
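To make the optimization statement of Equations (1)–(3) concrete, here is a minimal sketch, in Python, of how one might search a finite candidate set for (a*, µ*). Everything in it is our own illustrative scaffolding: the sets A and M, the model functions T, Phi, and chi are placeholders supplied by the user of the framework, and the equality constraints (2)–(3) are relaxed to "fits within the caps" for the purpose of the search.

```python
def codesign(A, M, T, Phi, chi, n, Phi_max, chi_max):
    """Return (a*, mu*) minimizing T(n; a, mu) subject to power and area caps.

    A    -- iterable of candidate algorithms (or tuning-parameter settings)
    M    -- iterable of candidate machine configurations mu
    T    -- T(n, a, mu): modeled execution time
    Phi  -- Phi(mu): modeled total system power (Watts)
    chi  -- chi(mu): modeled processor die area (mm^2)
    """
    best = None
    for a in A:
        for mu in M:
            if Phi(mu) > Phi_max or chi(mu) > chi_max:
                continue  # violates the power or die-area budget
            t = T(n, a, mu)
            if best is None or t < best[0]:
                best = (t, a, mu)
    if best is None:
        raise ValueError("no feasible (a, mu) under the given caps")
    return best[1], best[2]
```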


[Figure 1 appears here. Panel (a) is a cartoon of a notional power/area allocation problem, in which a fixed die area is split between cores and cache and a fixed power budget is split among core frequency, DRAM bandwidth, and additional cores or cache. Panel (b) shows two heatmaps, "Performance on FFT" and "Performance on Matrix Multiply" (lighter is better), plotted over cache size (MB, 0 to 367) versus memory bandwidth (TB/s, 0 to 4.7), each with a marker for the Echelon design point.]

Figure 1: (a) In our framework, a fixed die area (allocated between cores and cache) and a fixed power budget (allocated between core frequency and bandwidth) define a space of possible machines. (b) Different algorithms may perform differently on these machines. The marker is the approximate "location" within this space of the NVIDIA Echelon GPU-like architecture proposed for the year 2017 [32]. In the 3D FFT example, the optimal configuration is 2.6 times faster than Echelon.

The central research problem is to determine the form of T(n; a, µ), Φ(µ), and χ(µ). The significance and novelty of this analysis framework is that it explicitly binds characteristics of algorithms and architectures, Equation (1), with physical hardware constraints, Equations (2)–(3).

A demonstration. Suppose we wish to design a processor µ, which we represent by the four-tuple (βmem, q, f, Z): βmem is the processor-memory bandwidth (words per unit time), q is the number of cores per processor, f is the clock frequency of each core (cycles per unit time), and Z is the total size of the aggregate on-chip cache (in words), assuming just a two-level hierarchy (cache and main memory). Further suppose that the χmax = 141 mm² of die area can be divided between on-chip cache (Z) and cores (q). Lastly, suppose the node power budget is constrained to Φmax = 173 Watts, which can be used to increase cycle-frequency (f) or boost off-die memory bandwidth (βmem). Figure 1a is a cartoon that suggests how these parameters and constraints imply a space of possible designs. Figure 1b shows how, given a specific model of different algorithms on this space of machines, we might then solve the optimization problem of Equations 1–3 to identify optimal systems. Unsurprisingly, a processor tuned for a communication-intensive 3D fast Fourier transform will devote more of a fixed power budget (y-axis) to memory bandwidth, compared to matrix multiply.

However, Figure 1b also suggests an intriguing possibility. Observe that a die area configuration (x-axis) that is good for matrix multiply will also be good for an FFT; to make a system that can perform "optimally" on both workloads, we would need the ability to dynamically shift power from the processor to memory bandwidth, by a large factor of roughly 7×. That is, reconfigurability of the processor may be relatively less important than extreme power reconfigurability with respect to bandwidth. Whether one can build such a system is a separate question; this demonstration suggests and attempts to quantify the possibility.

The remainder of this paper formalizes this analysis. As a demonstration, we develop an analytical model of these constraints, Φ(µ) and χ(µ), as well as performance models, T(n; a, µ). To suggest the possibilities of the framework, we develop models for a full-system configuration, consisting of a distributed memory machine comprising any number of manycore processors connected by a network, in the case of distributed matrix multiply, distributed 3D FFTs, and distributed stencil algorithms. We do not view any specific models and projections as the main contribution of our work. Rather, we wish to emphasize the basic framework, with the large variety of potential detailed modeling strategies, analyses, and projections as possibilities based on it.

II. BACKGROUND AND RELATED WORK

The principal challenge is how to connect power (and, implicitly, energy) and area constraints with a complexity analysis. There are numerous approaches. The most widely-cited come from the computer architecture community [11, 19, 25, 26, 40, 57]. In such approaches, the application or algorithm is typically abstracted away through an Amdahl's Law style analysis, which means it can be difficult to relate high-level algorithmic characteristics to architectures precisely. Theorists have also considered a variety of new complexity models that incorporate energy as an explicit cost [36, 42]. However, this body of work is very abstract and focused purely on asymptotics, making these models difficult to operationalize, in our view. Lastly, there is a considerable body of work from the embedded hardware/software community, emphasizing analysis suitable for compile-time and run-time systems [15, 33, 53]. However, this work necessarily focuses on specific concrete code and architecture implementations, and therefore does not explicitly illuminate constraints due to fundamental algorithmic and physical limits. It is these limits that we seek to understand to make forecasts about future algorithm and system behavior.

Regarding area constraints, note that our framework treats die area, χ(µ), as a function of architecture µ only. Thus, to construct this function we need to consider what architectural components we will put on chip (e.g., cores, caches, on-chip networks) and derive cost estimates for the size of each component. Effectively, this choice implies that we are most interested in still building architectures that can execute general-purpose computations; however, the power and area allocations are tuned to specific workloads. Thus, our approach stands in contrast to the classical work on models of computation under very large-scale integration (VLSI) [49, Chap. 12]. That body of work also considers physical area-time trade-offs [37, 45], with connections to energy [2, 55], but does so for specific circuit structures that correspond to specific computations. We imagine a bridging between these two approaches but are for the moment agnostic on the specific analytical form of that bridge.

To develop cost models for both power and area, we are mining the vast literature on models and technology trends for everything from low-level processor device physics, functional units, cores, caches and memory systems, to on-chip and off-chip networks [6, 7, 16, 24, 31, 35, 38, 50]. Since our ultimate goal is to consider potential long-term outcomes, we focus on recent projections of scaling trends [6, 32, 34, 35, 50].

III. AN EXAMPLE OF INSTANTIATING A MODEL WITHIN THE FRAMEWORK

This section explains how one might go about constructing meaningful cost and constraint models within the framework. In particular, we instantiate specific forms for T(n; a, µ), Φ(µ), and χ(µ). These forms are meant to be illustrative, not necessarily definitive. Armed with such a model, Section IV then considers a variety of what-if scenarios at exascale (roughly 1 exaflop/s capable systems) expected in the 2017–2020 timeframe, to illustrate the kinds of insights possible within the framework.

A. Technological and architectural parameters

To guide parameter selection and model calibration, we start with the basic technology trend assumptions laid out in a recent description of the proposed NVIDIA Echelon processor, scheduled for release in approximately the year 2017 [32]. For our subsequent analysis and projections, we will assume the technology constants that appear in Table I. These values depend on specific assumptions about technological capabilities in the 2017–2020 time frame, for which we borrow the expectations used to design Echelon. We take those projections as-is; debates about the accuracy of these values are beyond the scope and intention of this paper.

Our target system is a distributed-memory supercomputer. We parameterize the high-level architecture µ by the following: the number of cores per processor (q), cycle-frequency of each core (f), the aggregate on-chip cache capacity (Z), memory bandwidth (βmem), on-chip network bandwidth (βnoc), off-chip network link bandwidth (βnet), and total number of nodes (p). By "node" we really mean a single unit of distributed memory processing, consisting of a (manycore) processor, local memory, and persistent storage (disk). This definition constitutes a simplification of how a real "node" might look in a future system; however, we do model the power required by such a node (see εnode in Table I).
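To make the parameterization above concrete, the following sketch collects the architectural parameters of µ into a single structure. This is purely illustrative scaffolding for the models that follow; the field names and unit choices are ours, not part of the paper's notation beyond the symbols given above.

```python
from dataclasses import dataclass

@dataclass
class Machine:
    """High-level architecture mu, as parameterized in Section III-A."""
    q: int            # cores per processor
    f: float          # core clock frequency (cycles/s)
    Z: float          # aggregate on-chip cache per processor (bytes, for this sketch)
    beta_mem: float   # processor-memory bandwidth (bytes/s)
    beta_noc: float   # on-chip network link bandwidth (bytes/s)
    beta_net: float   # off-chip network link bandwidth (bytes/s)
    p: int            # total number of nodes

# Example: roughly the Echelon-like design point listed in Table II.
echelon = Machine(q=4096, f=2.0e9, Z=256e6, beta_mem=1.6e12,
                  beta_noc=4.0e9, beta_net=67e9, p=102_500)
```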


We use the term "core" to represent the basic unit of processing in the system. We endow a core with general-purpose processing capabilities, e.g., ALU, address generation, branch unit, register file; however, because the sample algorithms we analyze are floating-point intensive, we characterize the core essentially by its peak rate of floating-point instructions completed per cycle. The caches are core-private but the value Z is aggregate over all cores on a chip. We connect cores via a √q × √q 2D mesh with an on-chip link bandwidth of βnoc. The on-chip cache is evenly distributed to all cores so that each core has a private cache of size Z/q. We connect nodes by a p^(1/3) × p^(1/3) × p^(1/3) 3D torus with a link bandwidth of βnet. As a reference point, the values of these parameters as proposed for the Echelon design appear in column 2 of Table II. (Columns 3–5 will be discussed in Section IV.)

Our specific algorithmic analyses will also assume an abundance of parallelism. Though this assumption seems strong, it is also a necessary condition for any application that can hope to scale to very large numbers of nodes relative to today's standards.

Table I: Technology constants, projected values for 2018 (10 nm).
    System power cap (Φmax): 20 MW
    Chip die area (χmax): 141.7 mm²
    Cache density (σcache): 0.386 mm²/MB
    Core density (σcore): 0.0105 mm²/core
    Memory BW power (λmem): 36 mW per GB/s
    Network BW power (λnet): 36 mW per GB/s
    On-chip network BW energy (λnoc): 0.75 pJ/mm/Byte
    Dynamic power coefficient (cdynamic): 0.00129704
    Short-circuit coefficient (cshort): 0.0032426
    Leakage coefficient (cleakage): 0.002026625
    Node overhead (εnode): 2 W

B. A model of physical constraints

We use the following models of power and area, based on the parameters shown in Tables I and II. The total system power comprises the power of the cores (Pcomp), memory (Pmem), on-chip interconnect (Pnoc), and the system network (Pnet). There is also a nominal per-node power cost (εnode) that represents the inherent overhead cost of maintaining a node, e.g., power supply, chipset, disk:

    Φ(µ) = p (Pcomp + Pmem + Pnoc + εnode) + Pnet.    (4)

Power consumption of CMOS circuits is frequently modeled with a three-term equation of the form P = ACV²f + τAVIf + V·Ileakage, which accounts for dynamic power consumption, short-circuit current, and leakage current [43]. The key variables from our perspective are voltage V and frequency f. Since the maximum operating frequency f is roughly linearly related to the voltage V, we can simplify the expression for the power consumption of each core to a function that is cubic in f:

    Pcomp = q (cdynamic f³ + cshort f² + cleakage f).    (5)

We can obtain suitable coefficients by fitting this equation to the different voltage, frequency, and energy settings provided in the Echelon paper [32]. Bandwidth power (Joules/sec, or Watts) is bandwidth (Bytes/sec) times energy cost per Byte (Joules/Byte):

    Pmem = βmem · λmem    (6)
    Pnet = βnet · Links(p) · λnet.    (7)

The die area dedicated to each core is (Z/q)·σcache + σcore. Assuming each core is square in shape, the distance between the centers of two neighboring cores is √((Z/q)·σcache + σcore), which we will use to approximate the length of the on-chip interconnect links. Assuming a 2D mesh topology, which is the most natural choice given the planar manufacturing process of modern processors, there are a total of 4q − 4√q links. The power of the on-chip network is therefore

    Pnoc = (4q − 4√q) · √((Z/q)·σcache + σcore) · βnoc · λnoc.    (8)

The limited processor die area constrains the number of cores and the amount of cache that can be placed on a single node. A larger cache capacity means there is less space for cores and vice-versa. Thus, given a total die area of Ωdie, we constrain total core and cache capacity by requiring that

    Ωdie = (q · σcore) + (Z · σcache).    (9)
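As a concrete illustration of Equations (4)–(9), the sketch below evaluates the area constraint and the system power model using the Table I constants. It is our transcription of the model under stated assumptions, not the authors' code: we assume the fitted frequency coefficients take f in GHz (the paper does not state the coefficient units), and we take Links(p) = 3p for the 3D torus.

```python
import math

# Technology constants from Table I (2018, 10 nm projections).
SIGMA_CACHE = 0.386        # mm^2 per MB of cache
SIGMA_CORE  = 0.0105       # mm^2 per core
LAMBDA_MEM  = 0.036        # W per (GB/s) of memory bandwidth
LAMBDA_NET  = 0.036        # W per (GB/s) per network link
LAMBDA_NOC  = 0.75e-12     # J per mm per Byte on the on-chip network
C_DYN, C_SHORT, C_LEAK = 0.00129704, 0.0032426, 0.002026625
EPS_NODE    = 2.0          # W of fixed per-node overhead

def die_area(q, Z_mb):
    """Die area consumed by cores plus cache in mm^2, Equation (9)."""
    return q * SIGMA_CORE + Z_mb * SIGMA_CACHE

def power_model(q, f_ghz, Z_mb, bmem_gbs, bnoc_gbs, bnet_gbs, p):
    """Total system power Phi(mu) in Watts, Equations (4)-(8).

    Assumptions (ours): the fitted coefficients take f in GHz, and the
    3D torus has Links(p) = 3p links.
    """
    p_comp = q * (C_DYN * f_ghz**3 + C_SHORT * f_ghz**2 + C_LEAK * f_ghz)      # Eq. (5)
    p_mem  = bmem_gbs * LAMBDA_MEM                                             # Eq. (6)
    link_mm = math.sqrt((Z_mb / q) * SIGMA_CACHE + SIGMA_CORE)                 # core pitch
    p_noc  = (4*q - 4*math.sqrt(q)) * link_mm * (bnoc_gbs * 1e9) * LAMBDA_NOC  # Eq. (8)
    p_net  = (3 * p) * bnet_gbs * LAMBDA_NET                                   # Eq. (7)
    return p * (p_comp + p_mem + p_noc + EPS_NODE) + p_net                     # Eq. (4)

# Echelon-like design point from Table II: roughly 141 mm^2 of die area
# and on the order of the 20 MW system power cap.
print(die_area(4096, 256.0), "mm^2")
print(power_model(4096, 2.0, 256.0, 1600, 4.0, 67.0, 102_500) / 1e6, "MW")
```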

C. Algorithmic cost models

Given the basic architectural model and parameters, the next step is to analyze an algorithm or class of algorithms, so that we can compute T(n; a, µ). We specifically consider the total execution time to be the maximum of four component times:

    T = max {Tcomp, Tnet, Tmem, Tnoc},    (10)

where Tcomp is the time spent performing computation (flops), Tnet is the time spent in network communication, Tmem is the time spent performing processor-memory communication, and Tnoc is the time spent in on-chip network communication. Below, we give sample analyses for 2.5D matrix multiply, a three-dimensional FFT, and a stencil computation.
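A minimal sketch of Equation (10): given per-algorithm models for the four component times, the predicted execution time is simply their maximum. The dictionary-of-callables interface is our own convenience, not the paper's notation.

```python
def predicted_time(models, n, machine):
    """Equation (10): evaluate each component-time model and take the maximum.

    `models` maps component names ("comp", "net", "mem", "noc") to callables
    of the form model(n, machine) -> seconds, i.e., the per-algorithm
    analyses developed below (Equations 11-26).
    """
    return max(model(n, machine) for model in models.values())
```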


Table II: Hardware characteristics (µ) of the Echelon reference design and the three ideal configurations.

    Parameter                              | Echelon   | Ideal MatMult | Ideal FFT    | Ideal Stencil
    Cores: q                               | 4,096     | 11,491        | 11,295       | 9,494
    Frequency: f                           | 2.0 GHz   | 0.275 GHz     | 3.0 GHz      | 0.280 GHz
    Aggregate on-chip cache: Z             | 256 MB    | 54.5 MB       | 59.8 MB      | 109 MB
    Memory bandwidth: βmem                 | 1.6 TB/s  | 0.034 TB/s    | 29.5 TB/s    | 0.787 TB/s
    On-chip network bandwidth: βnoc        | 4.0 GB/s  | 0.11 GB/s     | 38.2 GB/s    | 0.19 GB/s
    Network bandwidth: βnet                | 67 GB/s   | 11.4 GB/s     | 18,100 GB/s  | 12.5 GB/s
    Number of nodes: p                     | 102,500   | 1,280,000     | 3,400        | 480,000
    Peak: 2qfp                             | 1.7 EF/s  | 8.1 EF/s      | 230 PF/s     | 2.5 EF/s
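The "Peak: 2qfp" row of Table II is just the product of cores per node, clock rate, and node count, with the factor of two presumably corresponding to two flops per core per cycle. A quick check (ours, for illustration) reproduces the listed peaks:

```python
# Peak = 2 * q * f * p flop/s for each column of Table II.
configs = {
    "Echelon":       dict(q=4_096,  f_ghz=2.0,   p=102_500),
    "Ideal MatMult": dict(q=11_491, f_ghz=0.275, p=1_280_000),
    "Ideal FFT":     dict(q=11_295, f_ghz=3.0,   p=3_400),
    "Ideal Stencil": dict(q=9_494,  f_ghz=0.280, p=480_000),
}
for name, c in configs.items():
    peak = 2 * c["q"] * c["f_ghz"] * 1e9 * c["p"]
    print(f"{name}: {peak / 1e18:.2f} EF/s")  # ~1.7, 8.1, 0.23, and 2.55 EF/s
```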

1) Example: Distributed 2.5D matrix multiply: The 2.5D matrix multiply algorithm of Solomonik and Demmel [52] is a particularly interesting case for our framework. In particular, it contains a tuning parameter that can be used to reduce communication at the cost of increasing memory capacity, a trade-off that we subsequently analyze.

The 2.5D matrix multiplication algorithm decomposes an n × n matrix multiply, distributed across p nodes, into a sequence of (p/C)^(3/2) matrix multiplies, each of size (n√(C/p)) × (n√(C/p)). The value C is the tuning parameter that, when increased, decreases communication at the cost of increased (replicated) storage.

We take computation time to be that of the conventional (non-Strassen) algorithm:

    Tcomp = 2n³ / (pqf).    (11)

Network communication costs are based on the asymptotically optimal bandwidth costs of the 2.5D algorithm [52]:

    Tnet = 2n² / (√(Cp) βnet).    (12)

Each node will compute p^(1/2)/C^(3/2) local matrix multiplies of size (n√(C/p)) × (n√(C/p)) during the computation. From this fact, we can calculate the time spent locally transferring data between the processor and memory, given the aggregate cache of size Z. Assuming an asymptotically I/O-optimal blocked algorithm with block size b = √(Z/3), we obtain

    Tmem = (√p / C^(3/2)) · (n√(C/p))³ · (2√3 / (√Z βmem))    (13)
         = 2√3 n³ / (p √Z βmem).    (14)

Since we assume private caches and a 2D mesh network, we can treat the local matrix multiply as a distributed computation across the cores. We estimate the on-chip network communication time assuming the communication-optimal Cannon algorithm [8]:

    Tnoc = (√p / C^(3/2)) · (2n²C / (p √q βnoc)) = 2n² / (√(Cpq) βnoc),    (15)

where the constants reflect the additional assumption of the matrix operands being distributed in "skew" order across the private caches.
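The following sketch evaluates Equations (11)–(15) numerically. It is our illustrative transcription of the model, not the authors' code; the unit convention (ours) is f in cycles/s, Z in words, and bandwidths in words/s, so the returned time is in seconds.

```python
import math

def matmul_25d_time(n, C, q, f, Z, bmem, bnoc, bnet, p):
    """Modeled execution time of distributed 2.5D matrix multiply, Eqs. (11)-(15)."""
    t_comp = 2 * n**3 / (p * q * f)                                  # Eq. (11)
    t_net  = 2 * n**2 / (math.sqrt(C * p) * bnet)                    # Eq. (12)
    t_mem  = 2 * math.sqrt(3) * n**3 / (p * math.sqrt(Z) * bmem)     # Eq. (14)
    t_noc  = 2 * n**2 / (math.sqrt(C * p * q) * bnoc)                # Eq. (15)
    return max(t_comp, t_net, t_mem, t_noc)                          # Eq. (10)
```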


2) Example: Distributed 3D FFTs: Our second example is the 3D FFT. We assume a problem of size N = n³ using the pencil decomposition [39]. The algorithm consists of three computation phases separated by two communication phases. Each computation phase computes n² 1D FFTs of size n in parallel. Each communication phase involves √p independent √p-node personalized all-to-all exchanges.

Each 1D FFT of size n is computed locally on a node using Θ(n log n) floating point operations. We assume the classic Cooley-Tukey radix-2 algorithm, which requires approximately 5n log₂ n flops; thus,

    Tcomp = 3 × 5n³ log₂ n / (fpq).    (16)

During each of the two communication phases, approximately n³ data points are transferred across the network. Assuming a 3D torus network with a bisection bandwidth of O(p^(2/3) βnet), the communication cost is approximately

    Tnet = 2 × n³ / (p^(2/3) βnet).    (17)

During each of the three computation phases, each node must compute n²/p 1D FFTs. The number of processor-memory transactions necessary to compute each local 1D FFT depends on the total cache capacity Z of the node. If the entire 1D FFT can fit within the on-chip caches (n ≤ Z), then memory transfers are limited to just O(n) compulsory cache misses. Otherwise, the computation will incur an additional Θ(n log_Z n) capacity misses [29]. We have previously estimated a lower bound on this constant to be 2.5 [14], using hardware counter measurements of last-level cache misses incurred by the highly-tuned FFTW [20]. Thus,

    Tmem = 3 × (n²/p) × (2.5n max(log_Z n, 1) / βmem)    (18)
         = 7.5n³ max(log_Z n, 1) / (p βmem),    (19)

where the max function ensures that the transfers include at least the compulsory misses.

If an entire 1D FFT can fit in the private cache of a single core (n < Z/q), then no on-chip communication is necessary beyond the compulsory cache misses to DRAM. Otherwise, the 1D FFT must be distributed across q̂ = nq/Z cores, which requires O(n / (√q̂ βnoc)) time assuming a 2D mesh topology. In total, each group must compute at least n³/(pZ) of these distributed FFTs. Additionally, we only consider large problem sizes (q ≪ n) in this paper, so that load balance should not be a significant factor. Thus,

    Tnoc = 3 × (n³/(pZ)) × (n√Z / (√(qn) βnoc))    (20)
         = 3n³√n / (p √q √Z βnoc).    (21)

3) Example: Distributed 3D Stencil: Our final example is a 3D stencil. For simplicity of demonstration, we will only consider a 3D cross-shaped stencil of width w and a total of 6w + 1 points, and we will ignore the possibility of algorithms that tile in time. We assume a problem of size N = n³. The most direct method requires 12w + 1 floating point operations per element; thus,

    Tcomp = n³(12w + 1) / (fpq).    (22)

Assuming each node owns an (n/p^(1/3)) × (n/p^(1/3)) × (n/p^(1/3)) block of the dataset, each node will need an (n/p^(1/3)) × (n/p^(1/3)) × w sub-block from each of the six adjacent nodes on the 3D torus network. Therefore, the communication cost is approximately

    Tnet = 6wn² / (p^(2/3) βnet).    (23)

Without reuse, the number of processor-memory transactions necessary to compute each element is 6w + 2. A cache of size Z can be used to reduce the total number of reads by a factor of O(Z^(1/3)). Thus,

    Tmem = (n³/p) × (6w/Z^(1/3) + 2) / βmem,    (24)

where n³/p is the number of elements processed on each node.

We can approximate the amount of on-chip communication by comparing the cache misses incurred by a core with a private cache of size Z/q with the number of cache misses incurred by a processor with an aggregate cache of size Z. Thus,

    Tnoc = (n³/p) × [ (6w/(Z/q)^(1/3) + 2) − (6w/Z^(1/3) + 2) ] / (q βnoc)    (25)
         = 6wn³ (q^(1/3) − 1) / (Z^(1/3) q p βnoc).    (26)
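As with the matrix-multiply example, the 3D FFT and stencil models translate directly into a few lines of arithmetic. The sketch below is our illustrative transcription of Equations (16)–(26) under the same unit conventions as before (f in cycles/s, Z in words, bandwidths in words/s); the base-Z logarithm and the compulsory-miss floor follow Equations (18)–(19).

```python
import math

def fft3d_time(n, q, f, Z, bmem, bnoc, bnet, p):
    """Modeled time for the distributed 3D FFT (pencil decomposition), Eqs. (16)-(21)."""
    t_comp = 3 * 5 * n**3 * math.log2(n) / (f * p * q)                   # Eq. (16)
    t_net  = 2 * n**3 / (p**(2/3) * bnet)                                # Eq. (17)
    logZn  = max(math.log(n, Z), 1.0)    # at least the compulsory misses
    t_mem  = 7.5 * n**3 * logZn / (p * bmem)                             # Eq. (19)
    if n < Z / q:
        t_noc = 0.0   # each 1D FFT fits in a single core's private cache
    else:
        t_noc = 3 * n**3 * math.sqrt(n) / (p * math.sqrt(q * Z) * bnoc)  # Eq. (21)
    return max(t_comp, t_net, t_mem, t_noc)

def stencil3d_time(n, w, q, f, Z, bmem, bnoc, bnet, p):
    """Modeled time for the 3D cross-shaped stencil of width w, Eqs. (22)-(26)."""
    t_comp = n**3 * (12 * w + 1) / (f * p * q)                             # Eq. (22)
    t_net  = 6 * w * n**2 / (p**(2/3) * bnet)                              # Eq. (23)
    t_mem  = (n**3 / p) * (6 * w / Z**(1/3) + 2) / bmem                    # Eq. (24)
    t_noc  = 6 * w * n**3 * (q**(1/3) - 1) / (Z**(1/3) * q * p * bnoc)     # Eq. (26)
    return max(t_comp, t_net, t_mem, t_noc)
```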
IV. ANALYSIS

Given the models and architectural parameters of Section III, we can now carry out a series of what-if analyses to gain some insight into possible futures and high-level architectural designs, and even compute solutions to Equation 1.

A. Ideal architectures

We solved the optimization problem of Equation 1 for the 3D FFT, matrix multiply, and stencil algorithms. For this first analysis, we fixed the matrix multiply algorithm to be the Cannon (2D) algorithm, rather than the 2.5D algorithm considered in the next section, and fixed the stencil width (w = 10). The ideal configuration for each appears in columns 3–5 of Table II. Think of these configurations as being the ones optimally tuned for the corresponding algorithm, though recall that the model is for a general-purpose system. Figure 2 shows how resources are allocated in each of these tuned configurations, as well as in the proposed Echelon configuration. Figure 3 shows execution times for each of the hypothetical machines on the 3D FFT, stencil, and matrix multiply workloads. We can make a number of observations about these results.

Even under optimistic assumptions and an optimal machine configuration, the ideal FFT machine has a peak of only 230 petaflop/s (PF/s) with 20 MW of power, which is just 1/36 of the 8 exaflop/s (EF/s) possible on the ideal matrix multiply system. This means that relative to a matrix multiplication, the FFT computation requires 36× more energy per floating-point operation.
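That 36× figure follows directly from the peak rates under a fixed 20 MW budget; a quick back-of-the-envelope check (ours, using the rounded Table II peaks):

```python
# Energy per flop at a fixed 20 MW system power, using the Table II peaks.
POWER_W = 20e6
peak_fft, peak_mm = 230e15, 8.1e18     # flop/s
e_fft = POWER_W / peak_fft             # ~87 pJ per flop
e_mm  = POWER_W / peak_mm              # ~2.5 pJ per flop
print(e_fft / e_mm)                    # ~35x, i.e., roughly the 36x cited in the text
```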


[Figure 2 appears here: four panels ("Ideal MatMult Configuration," "Ideal Stencil Configuration," "Ideal FFT Configuration," and "Echelon Configuration"), each showing the die area budget (mm²), node power budget (W), and system power (MW), broken down among cache area, core area, cores, memory bandwidth, on-chip network (NoC) bandwidth, network, node overhead, and computation.]

Figure 2: Hardware configurations for the hypothetical machines. The subplots break down the power and die area resource allocations.

[Figure 3 appears here: execution-time comparisons of the FFT, MatMult, and Stencil workloads on each of the Ideal FFT, Ideal MatMult, Ideal Stencil, and Echelon configurations (vertical axis 0 to 60).]

Figure 3: Relative execution times for the hypothetical machines. The subplots show execution time relative to the ideal FFT, Stencil, and MatMult configurations.

However, tuning for a 3D FFT means we will necessarily divert power resources (and, therefore, energy efficiency) elsewhere in the system.

The Echelon design calls for 256 MB of on-chip cache, which is over four times more than the ideal FFT and ideal MatMult configurations. This is interesting because, relative to NVIDIA's current GPU, the core count increased by a factor of 16 (the scaling factor from a 40 nm to a 10 nm process technology) but the cache capacity increased by a factor of 64 (4× more than the scaling factor).


It is interesting to further consider these configurations in light of our motivating demonstration of Section I. There, we observed that a single-processor system with extreme reconfigurability of power, rather than die area, might lead to designs capable of performing both a compute-intensive matrix multiply and a communication-intensive 3D FFT. The three ideal configurations, depicted in Figure 2 and enumerated in Table II, are similarly suggestive. From Figure 2, observe that the processor configurations for the three ideal systems are not too dissimilar. Rather, the dramatic differences come from shifting power allocations to processors (Ideal MatMult), memory bandwidth (Ideal Stencil), or network (Ideal FFT). The node counts also differ significantly (Table II). Thus, an intriguing question is whether there is any way to engineer a single system having the same processors but a mechanism to perform dramatic power reconfigurations (drawing down or shutting off nodes as needed, and diverting power to bandwidth).

[Figure 4 appears here: "Performance as a Function of Node Density," showing performance relative to the best configuration (0 to 1) for FFT, MatMult, and Stencil versus the number of nodes (1,000 to 10,000,000).]

Figure 4: Node density plots.

B. Architecture trade-offs: lightweight vs. heavyweight designs

Recent discussions surrounding the direction of high-end systems often characterize design strategies as either "lightweight" or "heavyweight." The key distinction between these two strategies is node density. Lightweight designs, exemplified by Blue Gene-style processors, consist of many lower-power processors. Alternatively, heavyweight designs, exemplified by Jaguar-class machines, consist of fewer but more powerful processors, each operating at a high clock frequency.

Interestingly, these characterizations apply to the ideal machine configurations in Figure 2. The ideal matrix multiply configuration, "Ideal MatMult," reflects a lightweight strategy, whereas the ideal 3D FFT configuration, "Ideal FFT," resembles a heavyweight strategy. The difference is extreme: Ideal FFT has only 3,400 nodes, which is 376× fewer nodes than Ideal MatMult. Essentially, an FFT is communication-bound and is therefore highly sensitive to the decreased energy-efficiency of a large network: on a 3D torus, the energy-efficiency will decrease at a rate of O(p^(2/3))/p as p increases. Unlike the FFT, matrix multiply benefits from large node counts because it can exploit the increased core count to make the computation more efficient with a lower clock frequency. Figure 4 shows the stark contrast in performance between the two computations as a function of system density.

C. Algorithm trade-offs: computation vs. communication

A convolution is an algorithmic pattern that can capture the characteristics of many scientific computing problems. We may interpret a convolution as the application of a "filter" to a "signal." Computationally, a convolution may be implemented as a stencil computation, when the filter is compact, or alternatively by an FFT. A small value of the stencil width w in our model (Section III-C3) corresponds to our measure of compactness. In a stencil-based approach, the computational complexity of convolution will be O(wn). When the filter is not compact, an FFT-based method may be more suitable, as it reduces the computational complexity to O(n log n). However, the FFT-based methods have a higher communication cost. For moderately sized stencils, this results in a computation versus communication trade-off: the FFT-based method requires fewer floating point operations but more data movement than the stencil method [12, 18].

In the case of a large 3D convolution (n = 2^17), the FFT-based method will in our model become more efficient when w ≥ 22. Figure 5 compares the execution time of the stencil- and FFT-based methods over various stencil sizes. The results show that an ideal stencil machine is actually much faster than the ideal FFT machine until the stencil size is significantly larger (w ≈ 600) than expected. The reason for the discrepancy is the relative cost of a floating-point operation versus data movement. As expected, trading communication for computation is beneficial until the trade-offs become extreme.


[Figure 5 appears here: "Algorithms for a Convolution," plotting time (sec, 0 to 4) versus stencil width w (0 to 800) for the Stencil and FFT methods.]

Figure 5: Plot of the time to compute a convolution using different stencil sizes. The figure compares the stencil method with an FFT-based method in the 3D case.

[Figure 7 appears here: "Performance as a Function of System Power," plotting normalized performance (1.00 to 4.00) versus power budget (20 to 80 MW) for Stencil, FFT, and MatMult.]

Figure 7: Plot of performance as a function of the power budget. Performance values for the algorithms are scaled to their performance at a 20 MW power budget.

D. Algorithm trade-offs: space vs. communication

In Section IV-A, we found the ideal architecture for Cannon's matrix multiply algorithm (the "2D" approach). While this algorithm is asymptotically optimal when all of the available system memory is utilized, the class of "2.5D" algorithms can further reduce network communication by a factor of √C by making an additional C copies of the data. Based on historical trends, we estimate that in 2018 increasing the system memory capacity could require an extra watt of power for every eight GB of additional DRAM [46, 56]. Thus, the power consumed by the requisite memory capacity, 3Cn² nanowatts, increases as C increases. This introduces a trade-off between memory utilization and network communication. An interesting question is to what extent replication can be used to improve performance without increasing the total power and energy costs.

To find the optimal balance, we solve Equation 1 for the optimal algorithm, a*, and the corresponding architecture, µ*, considering the set of all 2.5D implementations (i.e., values of the replication factor, C). Figure 6 shows the performance and resource allocation of these algorithms. The results show that replication can indeed slightly improve upon Cannon's algorithm under these conditions. However, the optimal balance is not at one of the extremes, but rather somewhere in-between.

E. Increasing the power budget

The U.S. Department of Energy, one of the primary customers of top-tier supercomputers, has instituted a strict 20 MW power cap for future supercomputers. This power constraint is one of the dominant challenges to reaching exascale. We can consider how much performance improves if the power cap is relaxed a little by changing the power constraint, Φmax, in Equation 1. Figure 7 compares the performance of an "ideal" machine, designed for a 20 MW power budget, with a similar machine that is designed for a larger power budget. As the figure shows, communication-heavy applications like the FFT will improve at a slower rate than applications that are less dependent on communication. This is because the most efficient way to scale the machine is to increase the number of nodes, which in turn increases the communication overheads.

V. CONCLUSION

The ultimate goal of this work is to determine whether alternative ways to allocate die area and power might lead to better designs for specific applications. Our proposed framework suggests how to achieve this goal in a way that retains analytical rigor while remaining sufficiently high-level to yield insight. The suggestion of extreme power reconfigurability in the architecture as a mechanism that might support diverse applications is one forward-looking example.

Having said that, we emphasize the methodological contribution of our work over any specific quantitative predictions. The intent of this paper is to suggest a different way of thinking about the high-level directions for algorithms and architectures to drive the subsequent detailed co-design process, rather than to provide a comprehensive critique of exascale designs. We hope that a much deeper exploration of our basic setup, including more architectural details and consideration of richer classes of algorithms and mixed workloads, could produce a number of alternative futures, for both architecture and algorithm design.

VI. ACKNOWLEDGEMENTS

We thank Mark Adams for comments. This work was supported in part by the National Science Foundation (NSF) under NSF CAREER award number 0953100; the U.S. Dept. of Energy (DOE) under award DE-FC02-10ER26006/DE-SC0004915; and grants from the Defense Advanced Research Projects Agency (DARPA) Computer Science Study Group program. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of NSF, DOE, or DARPA.


[Figure 6 appears here: two panels, "2.5D Performance" (performance in Eflop/s, roughly 5.00 to 10.00, versus the replication parameter, axis labeled "C Value," 0 to 0.33) and "2.5D Power Allocation" (system power in MW, 0 to 20, over the same axis, broken down into DRAM, network bandwidth, node overhead, NoC bandwidth, memory bandwidth, and compute).]

Figure 6: Performance of 2.5D matrix multiply variants. The 2.5D algorithm is parameterized by a value C = p^α, where 0 ≤ α ≤ 1/3. "DRAM" represents the power consumed by the requisite memory capacity.

REFERENCES

[1] The Potential Impact of High-End Capability Computing on Four Illustrative Fields of Science and Engineering. The National Academies Press, Washington, DC, USA, 2008.
[2] A. Aggarwal, A. Chandra, and P. Raghavan. Energy consumption in VLSI circuits. In Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing (STOC '88), pages 205–216, New York, NY, USA, 1988. ACM Press.
[3] K. J. Barker, A. Hoisie, and D. J. Kerbyson. An early performance analysis of POWER7-IH HPC systems. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11), page 1, New York, NY, USA, 2011. ACM Press.
[4] G. E. Blelloch. Programming parallel algorithms. Communications of the ACM, 39(3):85–97, March 1996.
[5] G. E. Blelloch, P. B. Gibbons, and H. V. Simhadri. Low depth cache-oblivious algorithms. In Proc. ACM Symp. Parallel Algorithms and Architectures (SPAA), Santorini, Greece, June 2010.
[6] S. Borkar and A. A. Chien. The future of microprocessors. Communications of the ACM, 54(5):67, May 2011.
[7] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In Proc. Int'l. Symp. Computer Architecture (ISCA), pages 83–94, Vancouver, British Columbia, Canada, June 2000.
[8] L. E. Cannon. A cellular computer to implement the Kalman filter algorithm. PhD thesis, Montana State University, 1969.
[9] M. Casas, R. M. Badia, and J. Labarta. Prediction of behavior of MPI applications. In 2008 IEEE International Conference on Cluster Computing, pages 242–251. IEEE, Sept. 2008.
[10] R. A. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pages 1–12. IEEE, 2010.
[11] E. S. Chung, P. A. Milder, J. C. Hoe, and K. Mai. Single-chip heterogeneous computing: Does the future include custom logic, FPGAs, and GPGPUs? In IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 225–236, Atlanta, GA, USA, 2010.
[12] R. Clapp, H. Fu, and O. Lindtjorn. Selecting the right hardware for reverse time migration. The Leading Edge, 29(1):48–58, 2010.
[13] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a realistic model of parallel computation. ACM SIGPLAN Notices, 28(7):1–12, July 1993.
[14] K. Czechowski, C. McClanahan, C. Battaglino, K. Iyer, P.-K. Yeung, and R. Vuduc. On the communication complexity of 3D FFTs and its implications for exascale. In Proc. ACM Int'l. Conf. Supercomputing (ICS), San Servolo Island, Venice, Italy, June 2012. (to appear).


[15] P. D'Alberto, M. Püschel, and F. Franchetti. Performance/energy optimization of DSP transforms on the XScale processor. In Proc. High Performance Embedded Architectures and Compilers (HiPEAC), volume LNCS 4367, pages 201–214, Ghent, Belgium, January 2007.
[16] W. J. Dally, J. Balfour, D. Black-Shaffer, J. Chen, R. C. Harting, V. Parikh, J. Park, and D. Sheffield. Efficient embedded computing. Computer, 41(7):27–32, 2008.
[17] J. W. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.
[18] M. Ernst. Serializing parallel programs by removing redundant computation. Master's thesis, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1992.
[19] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger. Dark silicon and the end of multicore scaling. In Proceedings of the 38th International Symposium on Computer Architecture (ISCA), San Jose, CA, USA, 2011.
[20] M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE: Special Issue on "Program Generation, Optimization, and Platform Adaptation", 93(2):216–231, 2005.
[21] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proc. Symp. Foundations of Computer Science (FOCS), pages 285–297, New York, NY, USA, October 1999.
[22] H. Gahvari, A. H. Baker, M. Schulz, U. M. Yang, K. E. Jordan, and W. Gropp. Modeling the performance of an algebraic multigrid cycle on HPC platforms. In Proceedings of the International Conference on Supercomputing (ICS '11), page 172, New York, NY, USA, 2011. ACM Press.
[23] J. Gonzalez, J. Gimenez, M. Casas, M. Moreto, A. Ramirez, J. Labarta, and M. Valero. Simulating whole supercomputer applications. IEEE Micro, 31(3):32–45, May 2011.
[24] N. Hardavellas, M. Ferdman, A. Ailamaki, and B. Falsafi. Power scaling: the ultimate obstacle to 1K-core chips. Northwestern University, Technical Report NWU-EECS-10-05, pages 1–23, 2010.
[25] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Toward dark silicon in servers. IEEE Micro, 31(4):6–15, July 2011.
[26] M. D. Hill and M. R. Marty. Amdahl's Law in the multicore era. IEEE Computer, 41(7):33–38, July 2008.
[27] T. Hoefler, T. Schneider, and A. Lumsdaine. LogGOPSim: Simulating large-scale applications in the LogGOPS model. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC '10), page 597, New York, NY, USA, 2010. ACM Press.
[28] A. Hoisie, G. Johnson, D. J. Kerbyson, M. Lang, and S. Pakin. A performance comparison through benchmarking and modeling of three leading supercomputers: Blue Gene/L, Red Storm, and Purple. In Proc. ACM/IEEE Conf. Supercomputing (SC), number 74, Tampa, FL, USA, November 2006.
[29] J.-W. Hong and H. Kung. I/O complexity: The red-blue pebble game. In Proc. ACM Symp. Theory of Computing (STOC), pages 326–333, Milwaukee, WI, USA, May 1981.
[30] H. Jagode, A. Knupfer, J. Dongarra, M. Jurenz, M. S. Mueller, and W. E. Nagel. Trace-based performance analysis for the petascale simulation code FLASH. International Journal of High Performance Computing Applications, Dec. 2010.
[31] A. B. Kahng, B. Li, L.-S. Peh, and K. Samadi. ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration. In Proceedings of the Design, Automation, and Test in Europe (DATE) Conference & Exhibition, volume 29, pages 1–6. IEEE, 2009.
[32] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco. GPUs and the future of parallel computing. IEEE Micro, 31(5):7–17, Sept. 2011.
[33] W. Kirschenmann and L. Plagne. Optimizing computing and energy performances on GPU clusters: Experimentation on a PDE solver. In M. Pierson and H. Hlavacs, editors, Proceedings of the COST Action IC0804 on Large Scale Distributed Systems, pages 1–5. IRIT, 2010.
[34] P. Kogge and T. Dysart. Using the TOP500 to trace and project technology and architecture trends. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, page 28. ACM, 2011.
[35] P. Kogge et al. Exascale Computing Study: Technology challenges in achieving exascale systems, September 2008.


[36] V. A. R. Korthikanti and G. Agha. Analysis of parallel algorithms for energy conservation in scalable multicore architectures. In Proc. Int'l. Conf. Parallel Processing (ICPP), Vienna, Austria, September 2009.
[37] H. Kung. Let's design algorithms for VLSI systems. In Proceedings of the Caltech Conference on VLSI: Architecture, Design, and Fabrication, pages 65–90, 1979.
[38] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), page 469, New York, NY, USA, 2009. ACM Press.
[39] C. Van Loan. Computational Frameworks for the Fast Fourier Transform. SIAM, 1992.
[40] G. Loh. The cost of uncore in throughput-oriented many-core processors. In Workshop on Architectures and Languages for Throughput Applications, pages 1–9, June 2008.
[41] J. Mandel and S. V. Parter. On the multigrid F-cycle. Applied Mathematics and Computation, 37(1):19–36, May 1990.
[42] A. J. Martin. Towards an energy complexity of computation. Information Processing Letters, 77(2–4):181–187, February 2001.
[43] T. Mudge. Power: A first-class architectural design constraint. Computer, 34(4):52–58, 2001.
[44] R. W. Numrich and M. A. Heroux. Self-similarity of parallel machines. Parallel Computing, 37(2):69–84, Feb. 2011.
[45] E. T. L. Omtzigt. Domain flow and streaming architectures: A paradigm for efficient parallel computation. PhD dissertation, Yale University, 1993.
[46] D. Patterson. Technology trends: The datacenter is the computer, the cellphone/laptop is the computer, October 2007. www.hpts.ws/papers/2007/TechTrendsHPTSPatterson2007.ppt.
[47] S. J. Pennycook, S. D. Hammond, S. A. Jarvis, and G. R. Mudalige. Performance analysis of a hybrid MPI/CUDA implementation of the NAS-LU benchmark. In Proceedings of the International Workshop on Performance Modeling, Benchmarking and Simulation, New Orleans, LA, USA, Nov. 2010.
[48] A. F. Rodrigues et al. The structural simulation toolkit. ACM SIGMETRICS Performance Evaluation Review, 38(4):37, Mar. 2011.
[49] J. E. Savage. Models of Computation: Exploring the Power of Computing. Electronic edition (CC BY-NC-ND 3.0), 2008.
[50] J. Shalf, S. Dosanjh, and J. Morrison. Exascale computing technology challenges. In High Performance Computing for Computational Science (VECPAR 2010), pages 1–25, 2011.
[51] A. Snavely, N. Wolter, and L. Carrington. Modeling application performance by convolving machine signatures with application profiles. In Proceedings of the Fourth Annual IEEE International Workshop on Workload Characterization (WWC-4), pages 149–156. IEEE, 2001.
[52] E. Solomonik and J. Demmel. Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. Technical report, University of California, Berkeley, CA, USA, 2011.
[53] H. Takizawa, K. Sato, and H. Kobayashi. SPRAT: Runtime processor selection for energy-aware computing. In Proc. IEEE Int'l. Conf. Cluster Computing (CLUSTER), pages 386–393, Tsukuba, Japan, October 2008.
[54] S. Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM J. Matrix Anal. Appl., 18(4):1065–1081, October 1997.
[55] A. Tyagi. Energy-time trade-offs in VLSI computations. In Foundations of Software Technology and Theoretical Computer Science, volume LNCS 405, pages 301–311, 1989.
[56] T. Vogelsang. Understanding the energy consumption of dynamic random access memories. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, pages 363–374. IEEE Computer Society, 2010.
[57] D. H. Woo and H.-H. S. Lee. Extending Amdahl's Law for energy-efficient computing in the many-core era. IEEE Computer, 41(12):24–31, December 2008.
