A Theoretical Framework for Algorithm-Architecture Co-Design
Preprint – To appear in the IEEE Int'l. Parallel & Distributed Processing Symp. (IPDPS), May 2013

A theoretical framework for algorithm-architecture co-design

Kenneth Czechowski, Richard Vuduc
School of Computational Science and Engineering
Georgia Institute of Technology, Atlanta, Georgia
{kentcz, [email protected]}

Abstract—We consider the problem of how to enable computer architects and algorithm designers to reason directly and analytically about the relationship between high-level architectural features and algorithm characteristics. We propose a modeling framework designed to help understand the long-term and high-level impacts of algorithmic and technology trends. This model connects abstract communication complexity analysis—with respect to both the inter-core and inter-processor networks and the memory hierarchy—with current technology proposals and projections. We illustrate how one might use the framework by instantiating a particular model for a class of architectures and sample algorithms (three-dimensional fast Fourier transforms, matrix multiply, and three-dimensional stencil). Then, as a suggestive demonstration, we analyze a number of what-if scenarios within the model in light of these trends to suggest broader statements and alternative futures for power-constrained architectures and algorithms.

I. INTRODUCTION

We seek a formal framework that explicitly relates characteristics of an algorithm, such as its inherent parallelism or memory behavior, with parameters of an architecture, such as the number of cores, structure of the memory hierarchy, or network topology. Our ultimate goal is to say precisely and analytically how high-level changes to the architecture might affect the execution time, scalability, accuracy, and power-efficiency of a computation; and, conversely, to identify what classes of computation might best match a given architecture. Our approach marries abstract algorithmic complexity analysis with key physical constraints, such as caps on power and die area, that will be critical in the extreme-scale systems of 2018 and beyond [1, 35]. We refer to our approach as one of algorithm-architecture co-design.

We say "algorithm-architecture" rather than "hardware-software" so as to evoke a high-level mathematical process that precedes and complements traditional methods based on detailed architecture simulation of concrete benchmark code artifacts and traces [9, 23, 27, 30, 48, 51]. Our approach takes inspiration from prior work on high-level performance analysis and modeling [3, 26–28, 44], as well as the classical theory of circuit models and the area-time trade-offs studied in models based on very large-scale integration (VLSI) [37, 49]. Our analysis is in many ways most similar to several recent theoretical exascale modeling studies [22, 47], combined with trends analysis [34]. However, our specific methods return to higher-level I/O-centric complexity analysis [4, 5, 10, 13, 21, 54], pushing it further by trying to resolve analytical constants, which is necessary to connect abstract complexity measures with the physical constraints imposed by power and die area. This approach necessarily will not yield cycle-accurate performance estimates, and that is not our aim. Rather, our hope is that a principled algorithmic analysis that accounts for major architectural parameters will still yield interesting insights and suggest new directions for improving performance and scalability in the long run.

A formal framework. We pose the formal co-design problem as follows. Let a be an algorithm from a set A of algorithms that all perform the same computation within the same desired level of accuracy. The set A might contain different algorithms, such as "A = {fast Fourier transform, F-cycle multigrid}," for the Poisson model problem [17, 41]. Or, A may be a set of tuning parameters for one algorithm, such as the set of tile sizes for matrix multiply. Next, let µ be a machine architecture from a set M, and suppose that each processor of µ has an area of χ(µ). Lastly, let T(n; a, µ) be the time to execute a on µ for a problem of size n, while using a maximum instantaneous power of Φ(µ). Then, our goal is to determine the algorithm a and architecture µ that minimize time subject to constraints on total power and processor die area, e.g.,

    (a*, µ*) = argmin_{a ∈ A, µ ∈ M} T(n; a, µ)    (1)

subject to

    Φ(µ*) = Φ_max    (2)
    χ(µ*) = χ_max,   (3)

where Φ_max and χ_max are caps on power and die area, respectively. The central research problem is to determine the form of T(n; a, µ), Φ(µ), and χ(µ).

[Figure 1: (a) A notional power/transistor allocation problem. (b) Projected performance for a 3D FFT and matrix multiply at some problem size, plotted over cache size (MB, x-axis) versus memory bandwidth (TB/s, y-axis); lighter is better. Caption: (a) In our framework, a fixed die area (allocated between cores and cache) and a fixed power budget (allocated between core frequency and bandwidth) define a space of possible machines. (b) Different algorithms may perform differently on these machines. The marker is the approximate "location" within this space of the NVIDIA Echelon GPU-like architecture proposed for the year 2017 [32]. In the 3D FFT example, the optimal configuration is 2.6 times faster than Echelon.]

The significance and novelty of this analysis framework is that it explicitly binds characteristics of algorithms and architectures, Equation (1), with physical hardware constraints, Equations (2)–(3).

A demonstration. Suppose we wish to design a manycore processor µ, which we represent by the four-tuple (β_mem, q, f, Z): β_mem is the processor-memory bandwidth (words per unit time), q is the number of cores per processor, f is the clock frequency of each core (cycles per unit time), and Z is the total size of the aggregate on-chip cache (in words), assuming just a two-level hierarchy (cache and main memory). Further suppose that the χ_max = 141 mm² of die area can be divided between on-chip cache (Z) and cores (q). Lastly, suppose the node power budget is constrained to Φ_max = 173 Watts, which can be used to increase cycle frequency (f) or boost off-die memory bandwidth (β_mem). Figure 1a is a cartoon that suggests how these parameters and constraints imply a space of possible designs.

Figure 1b shows how, given a specific model of different algorithms on this space of machines, we might then solve the optimization problem of Equations (1)–(3) to identify optimal systems. Unsurprisingly, a processor tuned for a communication-intensive 3D fast Fourier transform will devote more of a fixed power budget (y-axis) to memory bandwidth, compared to matrix multiply.

However, Figure 1b also suggests an intriguing possibility. Observe that a die area configuration (x-axis) that is good for matrix multiply will also be good for an FFT; to make a system that can perform "optimally" on both workloads, we would need the ability to dynamically shift power from the processor to memory bandwidth, by a large factor of roughly 7×. That is, reconfigurability of the processor transistors may be relatively less important than extreme power reconfigurability with respect to bandwidth. Whether one can build such a system is a separate question; this demonstration suggests and attempts to quantify the possibility.

The remainder of this paper formalizes this analysis. As a demonstration, we develop an analytical model of these constraints, Φ(µ) and χ(µ), as well as performance models, T(n; a, µ). To suggest the possibilities of the framework, we develop models for a full-system configuration, consisting of a distributed-memory machine comprising any number of manycore processors connected by a network, in the case of distributed matrix multiply, distributed 3D FFTs, and distributed stencil algorithms. We do not view any specific models and projections as the main contribution of our work. Rather, we wish to emphasize the basic framework, with the large variety of potential detailed modeling strategies, analyses, and projections as possibilities based on it.

II. BACKGROUND AND RELATED WORK

The principal challenge is how to connect power (and, implicitly, energy) and area constraints with a complexity analysis. There are numerous approaches. The most widely cited come from the computer architecture community [11, 19, 25, 26, 40, 57]. In such approaches, the application or algorithm is typically abstracted away through an Amdahl's Law style analysis, which means it can be difficult to relate high-level algorithmic characteristics to architectures precisely.

… caches and memory systems, and on-chip and off-chip networks [6, 7, 16, 24, 31, 35, 38, 50]. Since our ultimate goal is to consider potential long-term outcomes, we focus on recent projections of scaling trends [6, 32, 34, 35, 50].

III. AN EXAMPLE OF INSTANTIATING A MODEL WITHIN THE FRAMEWORK

This section explains how one might go about constructing meaningful cost and constraint models within the framework. In particular, we instantiate specific forms for T(n; a, µ), Φ(µ), and
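The co-design optimization of Equations (1)–(3) and the four-tuple demonstration can be sketched as a small brute-force search. Everything concrete below is hypothetical: the discretized design grid, the linear area/power cost functions, and the toy compute/bandwidth execution-time models are illustrative stand-ins, not the calibrated forms of χ(µ), Φ(µ), and T(n; a, µ) that the paper develops; only the caps χ_max = 141 mm² and Φ_max = 173 W come from the text.

```python
import itertools

# Caps from the demonstration in Section I.
CHI_MAX = 141.0   # die area budget, mm^2
PHI_MAX = 173.0   # node power budget, Watts

# Toy cost coefficients (purely illustrative, not from the paper).
AREA_PER_CORE_MM2 = 2.0        # die area per core
AREA_PER_MB_MM2 = 1.0          # die area per MB of cache
WATTS_PER_GHZ_PER_CORE = 0.25  # power per core per GHz
WATTS_PER_GBS = 0.15           # power per GB/s of off-die bandwidth

def chi(q, Z):
    """Die area chi(mu): area for cores plus area for cache."""
    return q * AREA_PER_CORE_MM2 + Z * AREA_PER_MB_MM2

def phi(q, f, beta_mem):
    """Peak power Phi(mu): core frequency term plus bandwidth term."""
    return q * f * WATTS_PER_GHZ_PER_CORE + beta_mem * WATTS_PER_GBS

# Toy time models T(n; a, mu): max of a compute term and a memory term.
def t_mm(n, q, f, Z, beta_mem):
    flops = 2.0 * n**3                    # classical matrix multiply
    words = n**3 / max(Z, 1.0)**0.5       # cache-blocked traffic ~ n^3/sqrt(Z)
    return max(flops / (q * f * 1e9), words / (beta_mem * 1e9))

def t_fft(n, q, f, Z, beta_mem):
    flops = 5.0 * n * n.bit_length()      # ~ 5 n log n
    words = 2.0 * n                       # streaming memory traffic
    return max(flops / (q * f * 1e9), words / (beta_mem * 1e9))

def codesign(t_model, n, designs):
    """argmin of T over machines mu that satisfy the power and area caps."""
    feasible = [(q, f, Z, b) for (q, f, Z, b) in designs
                if chi(q, Z) <= CHI_MAX and phi(q, f, b) <= PHI_MAX]
    return min(feasible, key=lambda d: t_model(n, *d))

# Discretized design space over (q, f, Z, beta_mem).
designs = list(itertools.product(
    [8, 16, 32, 64],           # q: cores
    [1.0, 2.0, 3.0],           # f: GHz
    [16.0, 32.0, 64.0],        # Z: cache, MB
    [200.0, 400.0, 800.0]))    # beta_mem: GB/s

best_mm = codesign(t_mm, 4096, designs)
best_fft = codesign(t_fft, 1 << 22, designs)
```

Even with these toy models, the search reproduces the framework's qualitative point: the compute-bound and bandwidth-bound workloads pull the feasible allocation of area and power in different directions, and the optimizer picks a different µ* for each algorithm a.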