Real-world Concurrency

Chances are you won't actually have to write multithreaded code. But if you do, some key principles will help you master this "black art."

Bryan Cantrill and Jeff Bonwick, Sun Microsystems

Software practitioners today could be forgiven if recent microprocessor developments have given them some trepidation about the future of software. While Moore's law continues to hold (that is, transistor density continues to double roughly every 18 months), as a result of both intractable physical limitations and practical engineering considerations, that increasing density is no longer being spent on boosting clock rate. Instead, it is being used to put multiple CPU cores on a single CPU die. From the software perspective, this is not a revolutionary shift, but rather an evolutionary one: multicore CPUs are not the birthing of a new paradigm, but rather the progression of an old one (multiprocessing) into more widespread deployment. Judging from many recent articles and papers on the subject, however, one might think that this blossoming of concurrency is the coming of the apocalypse, that "the free lunch is over."1

As practitioners who have long been at the coal face of concurrent systems, we hope to inject some calm reality (if not some hard-won wisdom) into a discussion that has too often descended into hysterics. Specifically, we hope to answer the essential question: what does the proliferation of concurrency mean for the software that you develop? Perhaps regrettably, the answer to that question is neither simple nor universal—your software's relationship to concurrency depends on where it physically executes, where it is in the stack of abstraction, and the business model that surrounds it.

Given that many software projects now have components in different layers of the abstraction stack spanning different tiers of the architecture, you may well find that even for the software that you write, you do not have one answer but several: you may be able to leave some of your code forever executing in sequential bliss, and some may need to be highly parallel and explicitly multithreaded. Further complicating the answer, we will argue that much of your code will not fall neatly into either category: it will be essentially sequential in nature but will need to be aware of concurrency at some level.

Although we assert that less—much less—code needs to be parallel than some might fear, it is nonetheless true that writing parallel code remains something of a black art. We also therefore give specific implementation techniques for developing a highly parallel system. As such, this article is somewhat dichotomous: we try both to argue that most code can (and should) achieve concurrency without explicit parallelism, and at the same time to elucidate techniques for those who must write explicitly parallel code. This article is half stern lecture on the merits of abstinence and half Kama Sutra.
SOME HISTORICAL CONTEXT

Before we discuss concurrency with respect to today's applications, it would be helpful to explore the history of concurrent execution. Even by the 1960s—when the world was still wet with the morning dew of the computer age—it was becoming clear that a single central processing unit executing a single instruction stream would result in unnecessarily limited system performance. While computer designers experimented with different ideas to circumvent this limitation, it was the introduction of the Burroughs B5000 in 1961 that proffered the idea that ultimately proved to be the way forward: disjoint CPUs concurrently executing different instruction streams but sharing a common memory. In this regard (as in many) the B5000 was at least a decade ahead of its time. It was not until the 1980s that the need for multiprocessing became clear to a wider body of researchers, who over the course of the decade explored cache coherence protocols (e.g., the Xerox Dragon and DEC Firefly), prototyped parallel operating systems (e.g., multiprocessor Unix running on the AT&T 3B20A), and developed parallel databases (e.g., Gamma at the University of Wisconsin).

In the 1990s, the seeds planted by researchers in the 1980s bore the fruit of practical systems, with many computer companies (e.g., Sun, SGI, Sequent, Pyramid) placing big bets on symmetric multiprocessing. These bets on concurrent hardware necessitated corresponding bets on concurrent software—if an operating system cannot execute in parallel, neither can much else in the system—and these companies independently came to the realization that their operating systems would need to be rewritten around the notion of concurrent execution. These rewrites took place early in the 1990s, and the resulting systems were polished over the decade; much of the resulting technology can today be seen in open source operating systems such as OpenSolaris, FreeBSD, and Linux.

Just as several computer companies made big bets around multiprocessing, several database vendors made bets around highly parallel relational databases; upstarts including Oracle, Teradata, Tandem, Sybase, and Informix needed to use concurrency to achieve a performance advantage over the mainframes that had dominated transaction processing until that time.2 As in operating systems, this work was conceived in the late 1980s and early 1990s, and incrementally improved over the course of the decade.

The upshot of these trends was that by the end of the 1990s, concurrent systems had displaced their uniprocessor forebears as high-performance computers: when the TOP500 list of supercomputers was first drawn up in 1993, the highest-performing uniprocessor in the world was just #34, and more than 80 percent of the top 500 were multiprocessors of one flavor or another. By 1997, uniprocessors were off the list entirely. Beyond the supercomputing world, many transaction-oriented applications scaled with CPU, allowing users to realize the dream of expanding a system without revisiting architecture.

The rise of concurrent systems in the 1990s coincided with another trend: while CPU clock rate continued to increase, the speed of main memory was not keeping up. To cope with this relatively slower memory, microprocessor architects incorporated deeper (and more complicated) pipelines, caches, and prediction units. Even then, the clock rates themselves were quickly becoming something of a fib: while the CPU might be able to execute at the advertised rate, only a slim fraction of code could actually achieve (let alone surpass) the rate of one cycle per instruction—most code was mired spending three, four, five (or many more) cycles per instruction.

Many saw these two trends—the rise of concurrency and the futility of increasing clock rate—and came to the logical conclusion: instead of spending transistor budget on "faster" CPUs that weren't actually yielding much in terms of performance gains (and had terrible costs in terms of power, heat, and area), why not take advantage of the rise of concurrent software and use transistors to effect multiple (simpler) cores per die?
That it was the success of concurrent software that contributed to the genesis of chip multiprocessing is an incredibly important historical point and bears reemphasis. There is a perception that microprocessor architects have—out of malice, cowardice, or despair—inflicted concurrency on software.3 In reality, the opposite is the case: it was the maturity of concurrent software that led architects to consider concurrency on the die. (The reader is referred to one of the earliest chip multiprocessors—DEC's Piranha—for a detailed discussion of this motivation.4) Were software not ready, these microprocessors would not be commercially viable today. If anything, the "free lunch" that some decry as being over is in fact, at long last, being served. One need only be hungry and know how to eat!

CONCURRENCY IS FOR PERFORMANCE

The most important conclusion from this foray into history is that concurrency has always been employed for one purpose: to improve the performance of the system. This seems almost too obvious to make explicit—why else would we want concurrency if not to improve performance?—yet for all its obviousness, concurrency's raison d'être is increasingly forgotten, as if the proliferation of concurrent hardware has awakened an anxiety that all software must use all available physical resources. Just as no programmer felt a moral obligation to eliminate pipeline stalls on a superscalar microprocessor, no software engineer should feel responsible for using concurrency simply because the hardware supports it. Rather, concurrency should be thought about and used for one reason and one reason only: because it is needed to yield an acceptably performing system.

Concurrent execution can improve performance in three fundamental ways: it can reduce latency (that is, make a unit of work execute faster); it can hide latency (that is, allow the system to continue doing work during a long-latency operation); or it can increase throughput (that is, make the system able to perform more work).

Using concurrency to reduce latency is highly problem-specific in that it requires a parallel algorithm for the task at hand. For some kinds of problems—especially those found in scientific computing—this is straightforward: work can be divided a priori, and multiple compute elements set on the task. Many of these problems, however, are often so parallelizable that they do not require the tight coupling of a shared memory—and they are often able to execute more economically on grids of small machines instead of a smaller number of highly concurrent ones. Further, using concurrency to reduce latency requires that a unit of work be long enough in its execution to amortize the substantial costs of coordinating multiple compute elements: one can envision using concurrency to parallelize a sort of 40 million elements—but a sort of a mere 40 elements is unlikely to take enough compute time to pay the overhead of parallelism. In short, the degree to which one can use concurrency to reduce latency depends much more on the problem than on those endeavoring to solve it—and many important problems are simply not amenable to it.

For long-running operations that cannot be parallelized, concurrent execution can instead be used to perform useful work while the operation is pending; in this model, the latency of the operation is not reduced, but it is hidden by the progression of the system. Using concurrency to hide latency is particularly tempting when the operations themselves are likely to block on entities outside of the program—for example, a disk I/O operation or a DNS lookup. Tempting though it may be, one must be very careful when considering using concurrency merely to hide latency: having a parallel program can become a substantial complexity burden to bear just for improved responsiveness. Further, concurrent execution is not the only way to hide system-induced latencies: one can often achieve the same effect by employing nonblocking operations (e.g., asynchronous I/O) and an event loop (e.g., the poll()/select() calls found in Unix) in an otherwise sequential program. Programmers who wish to hide latency should therefore consider concurrent execution as an option, not as a foregone conclusion.
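As a minimal sketch (not from the original text) of that event-loop alternative, the following otherwise sequential program hides I/O latency with nonblocking descriptors and poll(); the two-descriptor set and the handle_input() callback are hypothetical placeholders for whatever the application actually manages.

```c
/*
 * Sketch: hide I/O latency without threads by servicing nonblocking
 * descriptors from a poll()-based event loop in a sequential program.
 */
#include <poll.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#define	NFDS	2

static void
handle_input(int fd, char *buf, ssize_t len)
{
	/* Application-specific processing of the data just read. */
	printf("fd %d: read %zd bytes\n", fd, len);
}

int
event_loop(int fds[NFDS])
{
	struct pollfd pfds[NFDS];
	char buf[4096];
	int i;

	for (i = 0; i < NFDS; i++) {
		/* Never block in read(); block only in poll(), below. */
		(void) fcntl(fds[i], F_SETFL,
		    fcntl(fds[i], F_GETFL) | O_NONBLOCK);
		pfds[i].fd = fds[i];
		pfds[i].events = POLLIN;
	}

	for (;;) {
		if (poll(pfds, NFDS, -1) < 0)
			return (-1);

		for (i = 0; i < NFDS; i++) {
			if (pfds[i].revents & POLLIN) {
				ssize_t len = read(pfds[i].fd, buf,
				    sizeof (buf));
				if (len > 0)
					handle_input(pfds[i].fd, buf, len);
			}
		}
	}
}
```

The single thread blocks only in poll(), so a slow descriptor does not stall progress on the others; the program remains sequential and needs no locks.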
When problems resist parallelization or have no appreciable latency to hide, the third way that concurrent execution can improve performance is to increase the throughput of the system. Instead of using parallel logic to make a single operation faster, one can employ multiple concurrent executions of sequential logic to accommodate more simultaneous work. Importantly, a system using concurrency to increase throughput need not consist exclusively (or even largely) of multithreaded code. Rather, those components of the system that share no state can be left entirely sequential, with the system executing multiple instances of these components concurrently. The sharing in the system can then be offloaded to components explicitly designed around parallel execution on shared state, which can ideally be reduced to those elements already known to operate well in concurrent environments: the database and/or the operating system.

To make this concrete, in a typical MVC (model-view-controller) application, the view (typically implemented in environments such as JavaScript, PHP, or Flash) and the controller (typically implemented in environments such as J2EE or Ruby on Rails) can consist purely of sequential logic and still achieve high levels of concurrency, provided that the model (typically implemented in terms of a database) allows for parallelism. Given that most don't write their own databases (and virtually no one writes their own operating systems), it is possible to build (and indeed, many have built) highly concurrent, highly scalable MVC systems without explicitly creating a single thread or acquiring a single lock; it is concurrency by architecture instead of by implementation.

ILLUMINATING THE BLACK ART

What if you are the one developing the operating system or database or some other body of code that must be explicitly parallelized? If you count yourself among the relative few who need to write such code, you presumably do not need to be warned that writing multithreaded code is hard. In fact, this domain's reputation for difficulty has led some to conclude (mistakenly) that writing multithreaded code is simply impossible: "No one knows how to organize and maintain large systems that rely on locking," reads one recent (and typical) assertion.5 Part of the difficulty of writing scalable and correct multithreaded code is the scarcity of written wisdom from experienced practitioners: oral tradition in lieu of formal writing has left the domain shrouded in mystery. So in the spirit of making this domain less mysterious for our fellow practitioners (if not also to demonstrate that some of us actually do know how to organize and maintain large lock-based systems), we present our collective bag of tricks for writing multithreaded code.

Know your cold paths from your hot paths. If there is one piece of advice to dispense to those who must develop parallel systems, it is to know which paths through your code you want to be able to execute in parallel (the hot paths) versus which paths can execute sequentially without affecting performance (the cold paths). In our experience, much of the software we write is bone-cold in terms of concurrent execution: it is executed only when initializing, in administrative paths, when unloading, etc. Not only is it a waste of time to make such cold paths execute with a high degree of parallelism, but it is also dangerous: these paths are often among the most difficult and error-prone to parallelize.

In cold paths, keep the locking as coarse-grained as possible. Don't hesitate to have one lock that covers a wide range of rare activity in your subsystem. Conversely, in hot paths—those that must execute concurrently to deliver highest throughput—you must be much more careful: locking strategies must be simple and fine-grained, and you must be careful to avoid activity that can become a bottleneck. And what if you don't know if a given body of code will be the hot path in the system? In the absence of data, err on the side of assuming that your code is in a cold path and adopt a correspondingly coarse-grained locking strategy—but be prepared to be proven wrong by the data.

Intuition is frequently wrong—be data intensive. In our experience, many scalability problems can be attributed to a hot path that the developing engineer originally believed (or hoped) to be a cold path. When cutting new software from whole cloth, you will need some intuition to reason about hot and cold paths—but once your software is functional, even in prototype form, the time for intuition has ended: your gut must defer to the data. Gathering data on a concurrent system is a tough problem in its own right. It requires you first to have a machine that is sufficiently concurrent in its execution to be able to highlight scalability problems. Once you have the physical resources, it requires you to put load on the system that resembles the load you expect to see when your system is deployed into production. Once the machine is loaded, you must have the infrastructure to be able to dynamically instrument the system to get to the root of any scalability problems.

The first of these problems has historically been acute: there was a time when multiprocessors were so rare that many software development shops simply didn't have access to one. Fortunately, with the rise of multicore CPUs, this is no longer a problem: there is no longer any excuse for not being able to find at least a two-processor (dual-core) machine, and with only a little effort, most will be able (as of this writing) to run their code on an eight-processor (two-socket, quad-core) machine.

Even as the physical situation has improved, however, the second of these problems—knowing how to put load on the system—has worsened: production deployments have become increasingly complicated, with loads that are difficult and expensive to simulate in development. As much as possible, you must treat load generation and simulation as a first-class problem; the earlier you tackle this problem in your development, the earlier you will be able to get critical data that may have tremendous implications for your software. Although a test load should mimic its production equivalent as closely as possible, timeliness is more important than absolute accuracy: the absence of a perfect load simulation should not prevent you from simulating load altogether, as it is much better to put a multithreaded system under the wrong kind of load than under no load whatsoever.

Once a system is loaded—be it in development or in production—it is useless to software development if the impediments to its scalability can't be understood. Understanding scalability inhibitors on a production system requires the ability to safely dynamically instrument its synchronization primitives. In developing Solaris, our need for this was so historically acute that it led one of us (Bonwick) to develop a technology (lockstat) to do this in 1997. This tool became instantly essential—we quickly came to wonder how we ever resolved scalability problems without it—and it led the other of us (Cantrill) to further generalize dynamic instrumentation into DTrace, a system for nearly arbitrary dynamic instrumentation of production systems that first shipped in Solaris in 2004, and has since been ported to many other systems including FreeBSD and Mac OS.6 (The instrumentation methodology in lockstat has been reimplemented to be a DTrace provider, and the tool itself has been reimplemented to be a DTrace consumer.)

Today, dynamic instrumentation continues to provide us with the data we need not only to find those parts of the system that are inhibiting scalability, but also to gather sufficient data to understand which techniques will be best suited for reducing that contention. Prototyping new locking strategies is expensive, and one's intuition is frequently wrong; before breaking up a lock or rearchitecting a subsystem to make it more parallel, we always strive to have the data in hand indicating that the subsystem's lack of parallelism is a clear inhibitor to system scalability!

Know when—and when not—to break up a lock. Global locks can naturally become scalability inhibitors, and when gathered data indicates a single hot lock, it is reasonable to want to break up the lock into per-CPU locks, a hash table of locks, per-structure locks, etc. This might ultimately be the right course of action, but before blindly proceeding down that (complicated) path, carefully examine the work done under the lock: breaking up a lock is not the only way to reduce contention, and contention can be (and often is) more easily reduced by decreasing the hold time of the lock. This can be done by algorithmic improvements (many scalability improvements have been achieved by reducing execution under the lock from quadratic time to linear time!) or by finding activity that is needlessly protected by the lock. Here's a classic example of this latter case: if data indicates that you are spending time (say) deallocating elements from a shared data structure, you could dequeue and gather the data that needs to be freed with the lock held and defer the actual deallocation of the data until after the lock is dropped. Because the data has been removed from the shared data structure under the lock, there is no data race (other threads see the removal of the data as atomic), and lock hold time has been decreased with only a modest increase in implementation complexity.
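A sketch of that deferred-deallocation pattern, assuming a simple singly linked list guarded by a POSIX mutex; the node layout and its n_expired flag are illustrative only.

```c
/*
 * Sketch: gather the elements to be freed while the lock is held, but
 * defer the (potentially expensive) free() calls until after the lock
 * has been dropped.
 */
#include <pthread.h>
#include <stdlib.h>

typedef struct node {
	struct node *n_next;
	int n_expired;
	/* ... payload ... */
} node_t;

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static node_t *list_head;

void
reap_expired(void)
{
	node_t *doomed = NULL, *node, **nextp;

	(void) pthread_mutex_lock(&list_lock);

	/*
	 * Unlink expired nodes and chain them onto a private list; other
	 * threads see each removal as atomic because the lock is held.
	 */
	for (nextp = &list_head; (node = *nextp) != NULL; ) {
		if (node->n_expired) {
			*nextp = node->n_next;
			node->n_next = doomed;
			doomed = node;
		} else {
			nextp = &node->n_next;
		}
	}

	(void) pthread_mutex_unlock(&list_lock);

	/* Hold time is now bounded by the unlink, not by the deallocation. */
	while ((node = doomed) != NULL) {
		doomed = node->n_next;
		free(node);
	}
}
```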
Be wary of readers/writer locks. If there is a novice error when trying to break up a lock, it is this: seeing that a data structure is frequently accessed for reads and infrequently accessed for writes, one may be tempted to replace a mutex guarding the structure with a readers/writer lock to allow for concurrent readers. This seems reasonable, but unless the hold time for the lock is long, this solution will scale no better (and indeed, may scale worse) than having a single lock. Why? Because the state associated with the readers/writer lock must itself be updated atomically, and in the absence of a more sophisticated (and less space-efficient) synchronization primitive, a readers/writer lock will use a single word of memory to store the number of readers. Because the number of readers must be updated atomically, acquiring the lock as a reader requires the same bus transaction—a read-to-own—as acquiring a mutex, and contention on that line can hurt every bit as much.

There are still many situations where long hold times (e.g., performing I/O under a lock as reader) more than pay for any memory contention, but one should be sure to gather data to make sure that it is having the desired effect on scalability. Even in those situations where a readers/writer lock is appropriate, an additional note of caution is warranted around blocking semantics. If, for example, the lock implementation blocks new readers when a writer is blocked (a common paradigm to avoid writer starvation), one cannot recursively acquire a lock as reader: if a writer blocks between the initial acquisition as reader and the recursive acquisition as reader, deadlock will result when the recursive acquisition is blocked. All of this is not to say that readers/writer locks shouldn't be used—just that they shouldn't be romanticized.

Consider per-CPU locking. Per-CPU locking (that is, acquiring a lock based on the current CPU identifier) can be a convenient technique for diffracting contention, as a per-CPU lock is not likely to be contended (a CPU can run only one thread at a time). If one has short hold times and operating modes that have different coherence requirements, one can have threads acquire a per-CPU lock in the common (noncoherent) case, and then force the uncommon case to grab all the per-CPU locks to construct coherent state. Consider this concrete (if trivial) example: if one were implementing a global counter that is frequently updated but infrequently read, one could implement a per-CPU counter protected by its own lock. Updates to the counter would update only the per-CPU copy, and in the uncommon case in which one wanted to read the counter, all per-CPU locks could be acquired and their corresponding values summed.

Two notes on this technique: first, it should be employed only when the data indicates that it's necessary, as it clearly introduces substantial complexity into the implementation; second, be sure to have a single order for acquiring all locks in the cold path: if one case acquires the per-CPU locks from lowest to highest and another acquires them from highest to lowest, deadlock will (naturally) result.
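A sketch of such a per-CPU counter using POSIX threads; the fixed NCPU bound and the use of sched_getcpu() (a Linux-specific call standing in for whatever notion of "current CPU" the environment provides) are assumptions made for illustration.

```c
/* Sketch: per-CPU counter, cheap to update and expensive (but rare) to read. */
#define	_GNU_SOURCE		/* for sched_getcpu(); Linux-specific */
#include <sched.h>
#include <pthread.h>
#include <stdint.h>

#define	NCPU	8		/* assumed bound on CPU identifiers */

typedef struct percpu_counter {
	pthread_mutex_t pc_lock;
	uint64_t pc_count;
} percpu_counter_t;

static percpu_counter_t counter[NCPU];

void
counter_init(void)
{
	int i;

	for (i = 0; i < NCPU; i++)
		(void) pthread_mutex_init(&counter[i].pc_lock, NULL);
}

void
counter_add(uint64_t delta)
{
	int cpu = sched_getcpu();
	percpu_counter_t *pc;

	if (cpu < 0)
		cpu = 0;			/* fall back if unavailable */
	pc = &counter[cpu % NCPU];

	/* Common case: this CPU's lock is almost never contended. */
	(void) pthread_mutex_lock(&pc->pc_lock);
	pc->pc_count += delta;
	(void) pthread_mutex_unlock(&pc->pc_lock);
}

uint64_t
counter_read(void)
{
	uint64_t sum = 0;
	int i;

	/* Uncommon case: take every per-CPU lock in a single, fixed order. */
	for (i = 0; i < NCPU; i++)
		(void) pthread_mutex_lock(&counter[i].pc_lock);

	for (i = 0; i < NCPU; i++)
		sum += counter[i].pc_count;

	for (i = 0; i < NCPU; i++)
		(void) pthread_mutex_unlock(&counter[i].pc_lock);

	return (sum);
}
```

Note that counter_read() always acquires the locks from lowest to highest index, which is exactly the single lock order the second note above demands.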
Know when to broadcast—and when to signal. Virtually all condition variable implementations allow threads waiting on the variable to be awakened either via a signal (in which case one thread sleeping on the variable is awakened) or via a broadcast (in which case all threads sleeping on the variable are awakened). These constructs have subtly different semantics: because a broadcast will awaken all waiting threads, it should generally be used to indicate state change rather than resource availability. If a condition broadcast is used when a condition signal would have been more appropriate, the result will be a thundering herd: all waiting threads will wake up, fight over the lock protecting the condition variable, and (assuming that the first thread to acquire the lock also consumes the available resource) sleep once again when they discover that the resource has been consumed. This needless scheduling and locking activity can have a serious effect on performance, especially in Java-based systems, where notifyAll() (i.e., broadcast) seems to have entrenched itself as a preferred paradigm; changing these calls to notify() (i.e., signal) has been known to result in substantial performance gains.7
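As a sketch of the distinction, consider a hypothetical pool of buffers guarded by a mutex and condition variable: returning one buffer is resource availability (signal one waiter), while tearing the pool down is a state change that every waiter must observe (broadcast).

```c
/* Sketch: signal for resource availability, broadcast for state change. */
#include <pthread.h>

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t pool_cv = PTHREAD_COND_INITIALIZER;
static int pool_avail;		/* number of free buffers */
static int pool_shutdown;	/* nonzero once the pool is being torn down */

int
pool_get(void)
{
	int rv = 0;

	(void) pthread_mutex_lock(&pool_lock);
	while (pool_avail == 0 && !pool_shutdown)
		(void) pthread_cond_wait(&pool_cv, &pool_lock);
	if (!pool_shutdown) {
		pool_avail--;
		rv = 1;			/* caller now owns one buffer */
	}
	(void) pthread_mutex_unlock(&pool_lock);
	return (rv);
}

void
pool_put(void)
{
	(void) pthread_mutex_lock(&pool_lock);
	pool_avail++;
	/* Exactly one buffer became available: wake exactly one waiter. */
	(void) pthread_cond_signal(&pool_cv);
	(void) pthread_mutex_unlock(&pool_lock);
}

void
pool_destroy(void)
{
	(void) pthread_mutex_lock(&pool_lock);
	pool_shutdown = 1;
	/* A state change that every waiter must see: wake them all. */
	(void) pthread_cond_broadcast(&pool_cv);
	(void) pthread_mutex_unlock(&pool_lock);
}
```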
Learn to debug postmortem. Among some Cassandras of concurrency, a deadlock seems to be a particular bogeyman of sorts, having become the embodiment of all that is difficult in lock-based multithreaded programming. This fear is somewhat peculiar, because deadlocks are actually among the simplest pathologies in software: because (by definition) the threads involved in a deadlock cease to make forward progress, they do the implementer the service of effectively freezing the system with all state intact. To debug a deadlock, one need have only a list of threads, their corresponding stack backtraces, and some knowledge of the system. This information is contained in a snapshot of state so essential to software development that its very name reflects its origins at the dawn of computing: it is a core dump.

Debugging from a core dump—postmortem debugging—is an essential skill for those who implement parallel systems: problems in highly parallel systems are not necessarily reproducible, and a single core dump is often one's only chance to debug them. Most debuggers support postmortem debugging, and many allow user-defined extensions.8 We encourage practitioners to understand their debugger's support for postmortem debugging (especially of parallel programs) and to develop extensions specific to debugging their systems.

Design your systems to be composable. Among the more galling claims of the detractors of lock-based systems is the notion that they are somehow uncomposable: "Locks and condition variables do not support modular programming," reads one typically brazen claim, "building large programs by gluing together smaller programs[:] locks make this impossible."9 The claim, of course, is incorrect. For evidence one need only point at the composition of lock-based systems such as databases and operating systems into larger systems that remain entirely unaware of lower-level locking.

There are two ways to make lock-based systems completely composable, and each has its own place. First (and most obviously), one can make locking entirely internal to the subsystem. For example, in concurrent operating systems, control never returns to user level with in-kernel locks held; the locks used to implement the system itself are entirely behind the system call interface that constitutes the interface to the system. More generally, this model can work whenever a crisp interface exists between software components: as long as control flow is never returned to the caller with locks held, the subsystem will remain composable.

Second (and perhaps counterintuitively), one can achieve concurrency and composability by having no locks whatsoever. In this case, there must be no global subsystem state—subsystem state must be captured in per-instance state, and it must be up to consumers of the subsystem to assure that they do not access their instance in parallel. By leaving locking up to the client of the subsystem, the subsystem itself can be used concurrently by different subsystems and in different contexts. A concrete example of this is the AVL tree implementation used extensively in the Solaris kernel. As with any balanced binary tree, the implementation is sufficiently complex to merit componentization, but by not having any global state, the implementation may be used concurrently by disjoint subsystems—the only constraint is that manipulation of a single AVL tree instance must be serialized.

Don't use a semaphore where a mutex would suffice. A semaphore is a generic synchronization primitive originally described by Dijkstra that can be used to effect a wide range of behavior. It may be tempting to use semaphores in lieu of mutexes to protect critical sections, but there is an important difference between the two constructs: unlike a semaphore, a mutex has a notion of ownership—the lock is either owned or not, and if it is owned, it has a known owner. By contrast, a semaphore (and its kin, the condition variable) has no notion of ownership: when sleeping on a semaphore, one has no way of knowing which thread one is blocking upon.

The lack of ownership presents several problems when used to protect critical sections. First, there is no way of propagating the blocking thread's scheduling priority to the thread that is in the critical section. This ability to propagate scheduling priority—priority inheritance—is critical in a realtime system, and in the absence of other protocols, semaphore-based systems will always be vulnerable to priority inversions. A second problem with the lack of ownership is that it deprives the system of the ability to make assertions about itself. For example, when ownership is tracked, the machinery that implements thread blocking can detect pathologies such as deadlocks and recursive lock acquisitions, inducing fatal failure (and that all-important core dump) upon detection. Finally, the lack of ownership makes debugging much more onerous. A common pathology in a multithreaded system is a lock not being dropped in some errant return path. When ownership is tracked, one at least has the smoking gun of the past (faulty) owner—and, thus, clues as to the code path by which the lock was not correctly dropped. Without ownership, one is left clueless and reduced to debugging by staring at code/the ceiling/into space.

All of this is not to say that semaphores shouldn't be used (indeed, some problems are uniquely suited to a semaphore's semantics), just that they shouldn't be used when mutexes would suffice.

Consider memory retiring to implement per-chain hash-table locks. Hash tables are common data structures in performance-critical systems software, and sometimes they must be accessed in parallel. In this case, adding a lock to each hash chain, with the per-chain lock held while readers or writers iterate over the chain, seems straightforward. The problem, however, is resizing the table: dynamically resizing a hash table is central to its efficient operation, and the resize means changing the memory that contains the table. That is, in a resize the pointer to the hash table must change—but we do not wish to require hash lookups to acquire a global lock to determine the current hash table!

This problem has several solutions, but a (relatively) straightforward one is to retire memory associated with old hash tables instead of freeing it. On a resize, all per-chain locks are acquired (using a well-defined order to prevent deadlock), and a new table is then allocated, with the contents of the old hash table being rehashed into the new table. After this operation, the old table is not deallocated but rather placed in a queue of old hash tables. Hash lookups then require a slight modification to operate correctly: after acquiring the per-chain lock, the lookup must check the hash-table pointer and compare it with the hash-table pointer that was used to determine the hash chain. If the hash table has changed (that is, if a hash resize has occurred), it must drop the lock and repeat the lookup (which will acquire the correct chain lock in the new table).

There are some delicate issues in implementing this—the hash-table pointer must be declared volatile, and the size of the hash table must be contained in the table itself—but the implementation complexity is modest given the alternatives, and (assuming hash tables are doubled when they are resized) the cost in terms of memory is only a factor of two. For an example of this in production code, the reader is directed to the file descriptor locking in Solaris, the source code for which can be found by searching the Internet for "flist_grow."
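A sketch of the lookup side of this scheme with a hypothetical table layout; the essential details are that the table pointer is volatile, that the table records its own size, and that the lookup revalidates the pointer after taking the per-chain lock (the retired old table keeps its locks valid, so the retry is safe).

```c
/* Sketch: per-chain locking with the table pointer revalidated under the lock. */
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

typedef struct hash_entry {
	struct hash_entry *he_next;
	uint64_t he_key;
	void *he_value;
} hash_entry_t;

typedef struct hash_table {
	size_t ht_nchains;		/* the table records its own size */
	pthread_mutex_t *ht_locks;	/* one lock per chain */
	hash_entry_t **ht_chains;
} hash_table_t;

/* Replaced (never freed) by the resize code, which is not shown here. */
static hash_table_t *volatile cur_table;

void *
hash_lookup(uint64_t key)
{
	hash_table_t *t;
	hash_entry_t *e;
	size_t chain;
	void *value = NULL;

	for (;;) {
		t = cur_table;
		chain = key % t->ht_nchains;
		(void) pthread_mutex_lock(&t->ht_locks[chain]);

		if (t != cur_table) {
			/* A resize slipped in; retry against the new table. */
			(void) pthread_mutex_unlock(&t->ht_locks[chain]);
			continue;
		}

		for (e = t->ht_chains[chain]; e != NULL; e = e->he_next) {
			if (e->he_key == key) {
				value = e->he_value;
				break;
			}
		}

		(void) pthread_mutex_unlock(&t->ht_locks[chain]);
		return (value);
	}
}
```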
Be aware of false sharing. There are a variety of different protocols for keeping memory coherent in caching multiprocessor systems. Typically, these protocols dictate that only a single cache may have a given line of memory in a dirty state. If a different cache wishes to write to the dirty line, the new cache must first read-to-own the dirty line from the owning cache. The size of the line used for coherence (the coherence granularity) has an important ramification for parallel software: because only one cache may own a line at a given time, one wishes to avoid a situation where two (or more) small, disjoint data structures are both contained within a single line and accessed in parallel by disjoint caches. This situation—called false sharing—can induce suboptimal scalability in otherwise scalable software. This most frequently arises in practice when one attempts to defract contention with an array of locks: the size of a lock structure is typically no more than the size of a pointer or two and is usually quite a bit less than the coherence granularity (which is typically on the order of 64 bytes). Disjoint CPUs acquiring different locks can therefore potentially contend for the same cache line.

False sharing is excruciating to detect dynamically: it requires not only a bus analyzer, but also a way of translating from the physical addresses of the bus to the virtual addresses that make sense to software, and then from there to the actual structures that are inducing the false sharing. (This process is so arduous and error-prone that we have experimented—with some success—with static mechanisms to detect false sharing.10) Fortunately, false sharing is rarely the single greatest scalability inhibitor in a system, and it can be expected to be even less of an issue on a multicore system (where caches are more likely to be shared among CPUs). Nonetheless, this remains an issue that the practitioner should be aware of, especially when creating arrays that are designed to be accessed in parallel. (In this situation, array elements should be padded out to be a multiple of the coherence granularity.)
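As a sketch, an array of locks might be padded so that each element occupies its own coherence line; the 64-byte line size and the GCC/Clang-style alignment attribute are assumptions that should be checked against the target platform.

```c
/* Sketch: pad an array of locks so disjoint locks never share a cache line. */
#include <pthread.h>

#define	CACHE_LINE	64	/* assumed coherence granularity */
#define	NLOCKS		16

typedef union padded_lock {
	pthread_mutex_t pl_lock;
	char pl_pad[CACHE_LINE];	/* one lock per line */
} padded_lock_t;

/* Compiler-specific alignment; align the array itself to a line boundary. */
static padded_lock_t locks[NLOCKS] __attribute__((aligned(CACHE_LINE)));

void
locks_init(void)
{
	int i;

	for (i = 0; i < NLOCKS; i++)
		(void) pthread_mutex_init(&locks[i].pl_lock, NULL);
}
```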
Consider using nonblocking synchronization routines to monitor contention. Many synchronization primitives have different entry points to specify different behavior if the primitive is unavailable: the default entry point will typically block, whereas an alternative entry point will return an error code instead of blocking. This second variant has a number of uses, but a particularly interesting one is the monitoring of one's own contention: when an attempt to acquire a synchronization primitive fails, the subsystem can know that there is contention. This can be especially useful if a subsystem has a way of dynamically reducing its contention. For example, the Solaris kernel memory allocator has per-CPU caches of memory buffers. When a CPU exhausts its per-CPU caches, it must obtain a new series of buffers from a global pool. Instead of simply acquiring a lock in this case, the code attempts to acquire the lock, incrementing a counter when this fails (and then acquiring the lock through the blocking entry point). If the counter reaches a predefined threshold, the size of the per-CPU caches is increased, thereby dynamically reducing contention.
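A sketch of self-monitored contention in this style, using pthread_mutex_trylock() as the nonblocking entry point; the threshold and the grow_cache() response are hypothetical stand-ins for whatever the subsystem can actually do to reduce its own contention.

```c
/* Sketch: use the nonblocking entry point to observe one's own contention. */
#include <pthread.h>

#define	CONTENTION_THRESHOLD	64	/* hypothetical tuning value */

static pthread_mutex_t depot_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long depot_contention;	/* protected by depot_lock */

static void
grow_cache(void)
{
	/* Hypothetical response: enlarge the per-CPU caches. */
}

void
depot_refill(void)
{
	if (pthread_mutex_trylock(&depot_lock) != 0) {
		/*
		 * The nonblocking attempt failed: someone else holds the
		 * lock.  Note the contention, then block as usual.
		 */
		(void) pthread_mutex_lock(&depot_lock);

		if (++depot_contention > CONTENTION_THRESHOLD) {
			depot_contention = 0;
			grow_cache();
		}
	}

	/* ... refill from the global pool while holding depot_lock ... */

	(void) pthread_mutex_unlock(&depot_lock);
}
```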
When reacquiring locks, consider using generation counts to detect state change. When lock ordering becomes complicated, at times one will need to drop one lock, acquire another, and then reacquire the first. This can be tricky, as state protected by the first lock may have changed during the time that the lock was dropped—and reverifying this state may be exhausting, inefficient, or even impossible. In these cases, consider associating a generation count with the data structure; when a change is made to the data structure, a generation count is bumped. The logic that drops and reacquires the lock must cache the generation before dropping the lock, and then check the generation upon reacquisition: if the counts are the same, the data structure is as it was when the lock was dropped and the logic may proceed; if the count is different, the state has changed and the logic may react accordingly (for example, by reattempting the larger operation).
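A sketch of the generation-count idiom with a hypothetical structure: every writer bumps wl_gen under the lock, and logic that must drop and reacquire the lock compares generations to learn whether the state it saw is still valid.

```c
/* Sketch: detect state change across a drop/reacquire with a generation count. */
#include <pthread.h>
#include <stdint.h>

typedef struct widget_list {
	pthread_mutex_t wl_lock;
	uint64_t wl_gen;	/* bumped (under wl_lock) on every modification */
	/* ... the data the lock protects ... */
} widget_list_t;

int
widget_operation(widget_list_t *wl)
{
	uint64_t gen;

	(void) pthread_mutex_lock(&wl->wl_lock);
	gen = wl->wl_gen;		/* remember the state we saw */
	(void) pthread_mutex_unlock(&wl->wl_lock);

	/*
	 * Lock ordering forces us to drop wl_lock here: acquire some other
	 * lock, perform whatever work is required, and release it.
	 */

	(void) pthread_mutex_lock(&wl->wl_lock);
	if (wl->wl_gen != gen) {
		/* The structure changed while we were away: caller retries. */
		(void) pthread_mutex_unlock(&wl->wl_lock);
		return (-1);
	}

	/* ... proceed under wl_lock, knowing the state is as we left it ... */
	(void) pthread_mutex_unlock(&wl->wl_lock);
	return (0);
}
```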

Use wait- and lock-free structures only if you absolutely must. Over our careers, we have each implemented wait- and lock-free data structures in production code, but we did this only in contexts in which locks could not be acquired for reasons of correctness. Examples include the implementation of the locking system itself,11 the subsystems that span interrupt levels, and dynamic instrumentation facilities.12 These constrained contexts are the exception, not the rule; in normal contexts, wait- and lock-free data structures are to be avoided as their failure modes are brutal (livelock is much nastier to debug than deadlock), their effect on complexity and the maintenance burden is significant, and their benefit in terms of performance is usually nil.

Prepare for the thrill of victory—and the agony of defeat. Making a system scale can be a frustrating pursuit: the system will not scale until all impediments to scalability have been removed, but it is often impossible to know if the current impediment to scalability is the last one. Removing that last impediment is incredibly gratifying: with that change, throughput finally gushes through the system as if through an open sluice. Conversely, it can be disheartening to work on a complicated lock breakup only to discover that while it was the impediment to scalability, it was merely hiding another impediment, and removing it improves performance very little—or perhaps not at all. As discouraging as it may be, you must return to the system to gather data: does the system not scale because the impediment was misunderstood, or does it not scale because a new impediment has been encountered? If the latter is the case, you can take solace in knowing that your work is necessary—though not sufficient—to achieve scalability, and that the glory of one day flooding the system with throughput still awaits you.

THE CONCURRENCY BUFFET

There is universal agreement that writing multithreaded code is difficult: although we have attempted to elucidate some of the lessons learned over the years, it nonetheless remains, in a word, hard. Some have become fixated on this difficulty, viewing the coming of multicore computing as cataclysmic for software. This fear is unfounded, for it ignores the fact that relatively few software engineers actually need to write multithreaded code: for most, concurrency can be achieved by standing on the shoulders of those subsystems that already are highly parallel in implementation. Those practitioners who are implementing a database or an operating system or a virtual machine will continue to need to sweat the details of writing multithreaded code, but for everyone else, the challenge is not how to implement those components but rather how best to use them to deliver a scalable system. While lunch might not be exactly free, it is practically all-you-can-eat—and the buffet is open!
REFERENCES
1. Sutter, H., Larus, J. 2005. Software and the concurrency revolution. ACM Queue 3(7): 54-62.
2. DeWitt, D., Gray, J. 1992. Parallel database systems: the future of high-performance database systems. Communications of the ACM 35(6): 85-98.
3. Oskin, M. 2008. The revolution inside the box. Communications of the ACM 51(7): 70-78.
4. Barroso, L. A., Gharachorloo, K., McNamara, R., Nowatzyk, A., Qadeer, S., Sano, B., Smith, S., Stets, R., Verghese, B. 2000. Piranha: a scalable architecture based on single-chip multiprocessing. In Proceedings of the 27th Annual International Symposium on Computer Architecture: 282-293.
5. Shavit, N. 2008. Transactions are tomorrow's loads and stores. Communications of the ACM 51(8): 90.
6. Cantrill, B. 2006. Hidden in plain sight. ACM Queue 4(1): 26-36.
7. McKusick, K. A. 2006. A conversation with Jarod Jenson. ACM Queue 4(1): 16-24.
8. Cantrill, B. 2003. Postmortem object type identification. In Proceedings of the Fifth International Workshop on Automated Debugging.
9. Peyton Jones, S. 2007. Beautiful concurrency. In Beautiful Code, ed. A. Oram and G. Wilson. Cambridge, MA: O'Reilly.
10. See reference 8.
11. Cantrill, B. 2007. A spoonful of sewage. In Beautiful Code, ed. A. Oram and G. Wilson. Cambridge, MA: O'Reilly.
12. See reference 6.

LOVE IT, HATE IT? LET US KNOW
[email protected] or www.acmqueue.com/forums

BRYAN CANTRILL is a Distinguished Engineer at Sun Microsystems, where he has worked on concurrent systems since coming to Sun to work with Jeff Bonwick on Solaris performance in 1996. Along with colleagues Mike Shapiro and Adam Leventhal, Cantrill developed DTrace, a facility for dynamic instrumentation of production systems that was directly inspired by his frustration in understanding the behavior of concurrent systems.

JEFF BONWICK is a Fellow at Sun Microsystems, where he has worked on concurrent systems since 1990. He is best known for inventing and leading the development of Sun's ZFS (Zettabyte File System), but prior to this he was known for having written (or rather, rewritten) many of the most parallel subsystems in the Solaris kernel, including the synchronization primitives, the kernel memory allocator, and the thread-blocking mechanism.

© 2008 ACM 1542-773/08/0900 $5.00