Amdahl's Law in the Multicore

COVER FEATURE Amdahl’s Law in the Multicore Era Mark D. Hill, University of Wisconsin-Madison Michael R. Marty, Google Augmenting Amdahl’s law with a corollary for multicore hardware makes it relevant to future generations of chips with multiple processor cores. Obtaining optimal multicore performance will require further research in both extracting more parallelism and making sequential cores faster. s we enter the multicore era, we’re at an A COROLLARY FOR MULTIcoRE CHIP COST inflection point in the computing landscape. To apply Amdahl’s law to a multicore chip, we need Computing vendors have announced chips a cost model for the number and performance of cores with multiple processor cores. Moreover, that the chip can support. vendor road maps promise to repeatedly We first assume that a multicore chip of given size and Adouble the number of cores per chip. These future chips technology generation can contain at most n base core are variously called chip multiprocessors, multicore equivalents, where a single BCE implements the baseline chips, and many-core chips. core. This limit comes from the resources a chip designer Designers must subdue more degrees of freedom for is willing to devote to processor cores (with L1 caches). multicore chips than for single-core designs. They must It doesn’t include chip resources expended on shared address such questions as: How many cores? Should caches, interconnection networks, memory controllers, cores use simple pipelines or powerful multi-issue pipeline and so on. Rather, we simplistically assume that these designs? Should cores use the same or different micro- nonprocessor resources are roughly constant in the mul- architectures? In addition, designers must concurrently ticore variations we consider. manage power from both dynamic and static sources. We are agnostic on what limits a chip to n BCEs. It Although answering these questions for today’s multi- might be power, area, or some combination of power, core chip with two to eight cores is challenging now, it will area, and other factors. become much more challenging in the future. Sources as Second, we assume that (micro-) architects have varied as Intel and the University of California, Berkeley, techniques for using the resources of multiple BCEs predict a hundred,1 if not a thousand,2 cores. to create a core with greater sequential performance. As the “Amdahl’s Law” sidebar describes, this model Let the performance of a single-BCE core be 1. We has important consequences for the multicore era. To assume that architects can expend the resources of r complement Amdahl’s software model, we offer a corol- BCEs to create a powerful core with sequential per- lary of a simple model of multicore hardware resources. formance perf(r). Our results should encourage multicore designers to Architects should always increase core resources when view the entire chip’s performance rather than focusing perf(r) > r because doing so speeds up both sequential and on core efficiencies. We also discuss several important parallel execution. When perf(r) < r, however, the trade- limitations of our models to stimulate discussion and off begins. Increasing core performance aids sequential future work. execution, but hurts parallel execution. 0018-9162/08/$25.00 © 2008 IEEE Published by the IEEE Computer Society July 2008 33 Amdahl’s Law Everyone knows Amdahl’s law, but quickly for- Finally, Amdahl argued that typical values of 1 – f gets it. —Thomas Puzak, IBM, 2007 were large enough to favor single processors. Despite their simplicity, Amdahl’s arguments Most computer scientists learn Amdahl’s law in held, and mainframes with one or a few proces- school: Let speedup be the original execution time sors dominated the computing landscape. They 1 divided by an enhanced execution time. The modern also largely held in the minicomputer and personal version of Amdahl’s law states that if you enhance a computer eras that followed. As recent technology fraction f of a computation by a speedup S, the overall trends usher us into the multicore era, Amdahl’s law speedup is: is still relevant. 1 Amdahl’s equations assume, however, that the Speedup f, S enhanced () = f computation problem size doesn’t change when ()1− f + S running on enhanced machines. That is, the frac- Amdahl’s law applies broadly and has important tion of a program that is parallelizable remains corollaries such as: fixed. John Gustafson argued that Amdahl’s law doesn’t do justice to massively parallel machines • Attack the common case: When f is small, optimi- because they allow computations previously intrac- 2 zations will have little effect. table in the given time constraints. A machine • The aspects you ignore also limit speedup: with greater parallel computation ability lets com- As S approaches infinity, speedup is bound by putations operate on larger data sets in the same 1/(1 – f). amount of time. When Gustafson’s arguments apply, parallelism will be ample. In our view, how- Four decades ago, Gene Amdahl defined his law for ever, robust general-purpose multicore designs the special case of using n processors (cores) in parallel should also operate well under Amdahl’s more when he argued for the single-processor approach’s pessimistic assumptions. validity for achieving large-scale computing capa- 1 bilities. He used a limit argument to assume that a References fraction f of a program’s execution time was infinitely 1. G.M. Amdahl, “Validity of the Single-Processor Approach parallelizable with no scheduling overhead, while to Achieving Large-Scale Computing Capabilities,” Proc. the remaining fraction, 1 – f, was totally sequential. Am. Federation of Information Processing Societies Conf., AFIPS Without presenting an equation, he noted that the Press, 1967, pp. 483-485. speedup on n processors is governed by: 2. J.L. Gustafson, “Reevaluating Amdahl’s Law,” Comm. ACM, May 1988, pp. 532-533. 1 Speedup f, n parallel () = f ()1− f + n Our equations allow perf(r) to be an arbitrary func- 1a and 1b show two hypothetical symmetric multicore tion, but all our graphs follow Shekhar Borkar3 and chips for n = 16. assume perf(r) = r . In other words, we assume efforts Under Amdahl’s law, the speedup of a symmetric that devote r BCE resources will result in sequential multicore chip (relative to using one single-BCE core) performance r . Thus, architectures can double per- depends on the software fraction that is parallelizable formance at a cost of four BCEs, triple it for nine BCEs, (f), the total chip resources in BCEs (n), and the BCE and so on. We tried other similar functions (for example, resources (r) devoted to increase each core’s perfor- 1. 5 r ), but found no important changes to our results. mance. The chip uses one core to execute sequentially at performance perf(r). It uses all n/r cores to exe- SYmmETRIC MULTIcoRE CHIPS cute in parallel at performance perf(r) × n/r. Overall, A symmetric multicore chip requires that all its cores we get: have the same cost. A symmetric multicore chip with a resource budget of n = 16 BCEs, for example, can sup- 1 Speedupsymmetric ()f,, n r = port 16 cores of one BCE each, four cores of four BCEs 1−f + f⋅ r perf ()r pperf ()r⋅ n each, or, in general, n/r cores of r BCEs each (our equations and graphs use a continuous approximation instead Consider Figure 2a. It assumes a symmetric multi- of rounding down to an integer number of cores). Figures core chip of n = 16 BCEs and perf(r) = r . The x-axis 34 Computer (a) (b) (c) Figure 1. Varieties of multicore chips. (a) Symmetric multicore with 16 one-base core equivalent cores, (b) symmetric multicore with four four-BCE cores, and (c) asymmetric multicore with one four-BCE core and 12 one-BCE cores. These figures omit important structures such as memory interfaces, shared caches, and interconnects, and assume that area, not power, is a chip’s limiting resource. 16 250 Symmetric, n = 16 Symmetric, n = 256 14 f = 0.999 200 f = 0.99 12 f = 0.975 f = 0.9 150 10 f = 0.5 symmetric symmetric 8 100 6 Speedup Speedup 50 4 2 0 2 4 8 16 0 2 4 8 16 32 64 128 256 (a) r BCEs (b) r BCEs 16 250 Asymmetric, n = 16 Asymmetric, n = 256 14 200 12 10 150 symmetric symmetric 8 100 6 Speedup Speedup 50 4 2 0 2 4 8 16 0 2 4 8 16 32 64 128 256 (c) r BCEs (d) r BCEs 16 250 Dynamic, n = 16 Dynamic, n = 256 14 200 12 10 150 dynamic dynamic 8 100 Speedup 6 Speedup 4 50 2 (e) 0 2 4 8 16 0 2 4 8 16 32 64 128 256 r BCEs (f) r BCEs Figure 2. Speedup of (a, b) symmetric, (c, d) asymmetric, and (e, f) dynamic multicore chips with n = 16 BCEs (a, c, and e) or n = 256 BCEs (b, d, and f). July 2008 35 gives resources used to increase each core’s perfor- with more resources to execute sequentially at per- mance: a value 1 says the chip has 16 base cores, while formance perf(r). In the parallel fraction, however, a value of r = 16 uses all resources for a single core. it gets performance perf(r) from the large core and Lines assume different values for the parallel fraction performance 1 from each of the n – r base cores. (f = 0.5, 0.9, …, 0.999). The y-axis gives the symmet- Overall, we get: ric multicore chip’s speedup relative to its running on 1 one single-BCE base core.

Load more