The Wavescalar Architecture

The WaveScalar Architecture Steven Swanson A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy University of Washington 2006 Program Authorized to O↵er Degree: Computer Science & Engineering University of Washington Graduate School This is to certify that I have examined this copy of a doctoral dissertation by Steven Swanson and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final examining committee have been made. Chair of the Supervisory Committee: Mark Oskin Reading Committee: Mark Oskin Susan Eggers John Wawrzynek Date: In presenting this dissertation in partial fulfillment of the requirements for the doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of this dissertation is allowable only for scholarly purposes, consistent with “fair use” as prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this dissertation may be referred to Proquest Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106-1346, 1-800-521-0600, or to the author. Signature Date University of Washington Abstract The WaveScalar Architecture Steven Swanson Chair of the Supervisory Committee: Assistant Professor Mark Oskin Computer Science & Engineering Silicon technology will continue to provide an exponential increase in the avail- ability of raw transistors. E↵ectively translating this resource into application performance, however, is an open challenge that conventional superscalar designs will not be able to meet. We present WaveScalar as a scalable alternative to conventional designs. WaveScalar is a dataflow instruction set and execution model designed for scalable, low-complexity, high-performance processors. Unlike previous dataflow machines, WaveScalar can efficiently provide the sequential memory seman- tics imperative languages require. To allow programmers to easily express parallelism, WaveScalar supports pthread-style, coarse-grain multithreading and dataflow-style, fine-grain threading. In addition, it permits blending the two styles within an application or even a single function. To execute WaveScalar programs, we have designed a scalable, tile-based processor architecture called the WaveCache. As a program executes, the WaveCache maps the program’s instructions onto its array of processing elements (PEs). The instructions remain at their processing elements for many invocations, and as the working set of instructions changes, the WaveCache removes unused instructions and maps new instructions in their place. The instructions communicate directly with one-another over a scalable, hierarchical on-chip interconnect, obviating the need for long wires and broadcast communication. This thesis presents the WaveScalar instruction set and evaluates a simulated implementation based on current technology. For single-threaded applications, the WaveCache achieves performance on par with conventional processors, but in less area. For coarse-grain threaded applications, WaveCache performance scales with chip size over a wide range, and it outperforms a range of the multi-threaded designs. The WaveCache sustains 7-14 multiply-accumulates per cycle on fine-grain threaded versions of well-known kernels. Finally, we apply both styles of threading to an example application, equake from spec2000, and speed it up by 9 compared to the ⇥ serial version. TABLE OF CONTENTS List of Figures . iii List of Tables . v Chapter 1: Introduction . 1 Chapter 2: WaveScalar Sans Memory . 5 2.1 The von Neumann model . 5 2.2 WaveScalar’s dataflow model . 6 2.3 Discussion . 15 Chapter 3: Wave-ordered Memory . 16 3.1 Dataflow and memory ordering . 17 3.2 Wave-ordered memory . 19 3.3 Expressing parallelism . 25 3.4 Evaluation . 27 3.5 Future directions . 31 3.6 Discussion . 35 Chapter 4: A WaveScalar architecture for single-threaded programs . 36 4.1 WaveCache architecture overview . 37 4.2 The processing element . 40 4.3 The WaveCache interconnect . 43 4.4 The store bu↵er.............................. 45 4.5 Caches . 49 4.6 Placement . 49 4.7 Managing parallelism . 50 i Chapter 5: Running multiple threads in WaveScalar . 52 5.1 Multiple memory orderings . 53 5.2 Synchronization . 58 5.3 Discussion . 61 Chapter 6: Experimental infrastructure . 63 6.1 The RTL model . 63 6.2 The WaveScalar tool chain . 65 6.3 Applications . 69 Chapter 7: WaveCache Performance . 71 7.1 Area model and timing results . 73 7.2 Performance analysis . 75 7.3 Network traffic.............................. 92 7.4 Discussion . 93 Chapter 8: WaveScalar’s dataflow features . 95 8.1 Unordered memory . 96 8.2 Mixing threading models . 99 Chapter 9: Related work . 104 9.1 Previous dataflow designs . 104 9.2 Tiled architectures . 112 9.3 Objections to dataflow . 114 Chapter 10: Conclusions and future work . 116 Bibliography . 120 ii LIST OF FIGURES Figure Number Page 2.1 A simple dataflow fragment . 7 2.2 Implementing control in WaveScalar . 9 2.3 Loops in WaveScalar . 10 2.4 A function call . 14 3.1 Program order . 18 3.2 Simple wave-ordered annotations . 20 3.3 Wave-ordering and control . 21 3.4 Resolving ambiguity . 22 3.5 Simple ripples . 26 3.6 Ripples and control . 27 3.7 Memory parallelism . 29 3.8 Reusing sequence numbers . 32 3.9 Loops break up waves . 33 3.10 A reentrant wave . 34 4.1 The WaveCache . 37 4.2 Mapping instruction into the WaveCache . 38 4.3 The flow of operands through the PE pipeline and forwarding networks 42 4.4 The cluster interconnects . 43 4.5 The store bu↵er.............................. 46 5.1 Thread creation and destruction . 54 5.2 Thread creation overhead . 55 5.3 Tag matching . 60 5.4 A mutex . 60 6.1 The WaveScalar tool chain . 66 7.1 Pareto-optimal WaveScalar designs . 78 iii 7.2 Single-threaded WaveCache vs. superscalar . 80 7.3 Performance per unit area . 81 7.4 Splash-2 on the WaveCache . 87 7.5 Performance comparison of various architectures . 88 7.6 The distribution of traffic in the WaveScalar processor . 92 8.1 Fine-grain performance . 98 8.2 Transitioning between memory interfaces . 100 8.3 Using ordered and unordered memory together . 101 iv LIST OF TABLES Table Number Page 6.1 Workload configurations . 70 7.1 A cluster’s area budget . 72 7.2 Microarchitectural parameters . 74 7.3 WaveScalar processor area model . 76 7.4 Pareto optimal configurations . 84 9.1 Dataflow architectures . 106 v ACKNOWLEDGMENTS There are many, many people who have contributed directly and indirectly to this thesis. Mark Oskin provided the trifecta of mentoring, leadership, and (during the very early days of WaveScalar) Reece’s Peanut Butter Cups and soda water. He also taught me to think big in research. Susan Eggers has provided invaluable guidance and support from my very first day of graduate school. She has also helped me to become a better writer. Working with Ken Michelson, Martha Mercaldi, Andrew Petersen, Andrew Put- nam, and Andrew Schwerin has been an enormous amount of fun. The vast majority of the work in this thesis would not have been possible without their hard work and myriad contributions. They are a fantastic group people. A whole slew of other grad students have helped me out in various ways during grad school: Melissa Meyer, Kurt Partridge, Mike Swift, Vibha Sazawal, Ken Yasuhara, Luke McDowell, Don Patter- son, Gerome Miklau, Robert Grimm, Kevin Rennie, and Josh Redstone to name a few. I could not have asked for better colleagues (or friends) in graduate school. I fear that I may have been thoroughly spoiled. A huge number of other people have helped me along the way, both in graduate school and before. Foremost among them are the faculty that have advised me (both on research and otherwise) through my academic career. They include Perry Fizzano, who gave me my first taste of research, Hank Levy, Roy Want, Dan Grossman, and Gaetano Boriello. I owe all my teachers from over the years a huge debt, both for everything they have taught me and for their amazing patience. For their insightful comments on the document at hand, I thank my reading vi committee members: Mark Oskin, Susan Eggers, and John Wawrzynek. Finally, my family deserves thanks above the rest. They have been unendingly supportive. They have raised me to be a nice guy, to pursue my dreams, to take it in stride when they don’t work out, and to always, always keep my sense of humor. Thanks! 26 May 2006 vii DEDICATION To Glenn E. Tyler Oh, bring back my bonnie to me. To Spoon & Peanut Woof! viii 1 Chapter 1 INTRODUCTION It is widely accepted that Moore’s Law will hold for the next decade. Although more transistors will be available, simply scaling up current architectures will not convert them into commensurate increases in performance [5]. The gap between the increases in performance we have come to expect and those that larger versions of existing architectures can deliver will force engineers to search for more scalable processor architectures. Three problems contribute to this gap: (1) the ever-increasing disparity between computation and communication performance – fast transistors but slow wires; (2) the increasing cost of circuit complexity, leading to longer design times, schedule slips, and more processor bugs; and (3) the decreasing reliability of circuit technology, caused by shrinking feature sizes and continued scaling of the underlying material characteristics. In particular, modern superscalar processor designs will not scale, be- cause they are built atop a vast infrastructure of slow broadcast networks, associative searches, complex control logic, and centralized structures. This thesis proposes a new instruction set architecture (ISA), called WaveScalar [59], that adopts the dataflow execution model [21] to address these challenges in two ways. First, the dataflow model dictates that instructions execute when their inputs are available.

Load more