High-Speed Soft-Processor Architecture for FPGA Overlays by Charles Eric Laforest a Thesis Submitted in Conformity with the Requ
Total Page:16
File Type:pdf, Size:1020Kb
High-Speed Soft-Processor Architecture for FPGA Overlays by Charles Eric LaForest A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Electrical and Computer Engineering University of Toronto c Copyright 2015 by Charles Eric LaForest Abstract High-Speed Soft-Processor Architecture for FPGA Overlays Charles Eric LaForest Doctor of Philosophy Graduate Department of Electrical and Computer Engineering University of Toronto 2015 Field-Programmable Gate Arrays (FPGAs) provide an easier path than Application- Specific Integrated Circuits (ASICs) for implementing computing systems, and generally yield higher performance and lower power than optimized software running on high- end CPUs. However, designing hardware with FPGAs remains a difficult and time- consuming process, requiring specialized skills and hours-long CAD processing times. An easier design process abstracts away the FPGA via an \overlay architecture", which implements a computing platform upon which we construct the desired system. Soft- processors represent the base case of overlays, allowing easy software-driven design, but at a large cost in performance and area. This thesis addresses the performance limitations of FPGA soft-processors, as building blocks for overlay architectures. We first aim to maximize the usage of FPGA structures by designing Octavo, a strict round-robin multi-threaded soft-processor architecture tailored to the underlying FPGA and capable of operating at maximal speed. We then scale Octavo to SIMD and MIMD parallelism by replicating its datapath and connecting Octavo cores in a point-to-point mesh. This scaling creates multi-local logic, which we preserve via logical partitioning to avoid artificial critial paths introduced by unnecessary CAD optimizations. We plan ahead for larger Octavo systems by adding support for architectural ex- tensions, instruction predication, and variable I/O latency. These features improve the ii efficiency of hardware extensions, eliminate busy-wait loops, and provide a building block for more efficient code execution. By extracting the flow-control and addressing sub-graphs out of a program's Control- Data Flow Graph (CDFG), we can execute them in parallel with useful computations using special dedicated hardware units, granting better performance than fully unrolled loops without increasing code size. Finally, we benchmark Octavo against the MXP soft vector processor, the NiosII/f scalar soft-processor, and equivalent benchmark implementations written in C synthe- sized with the LegUp High-Level Synthesis (HLS) tool, and direct Verilog Hardware Description Language (HDL) implementations. Octavo's higher clock frequency offsets its higher cycle count, performing roughly on-par with MXP, and within an order of magnitude of the performance of LegUp and Verilog solutions, but with an order of magnitude area penalty. iii Dedication To my family and friends, near and far. It's been a life-long road, and you all helped me travel it. I am to all in your debt. In Memoriam: John Gregory Steffan August 27, 1972 - July 24, 2014 iv Acknowledgements Since I was a teenager, I've always wanted to design a computer. You would not be reading this without the help and influence of many. Firstly, my wife Joy. Do I need to expound on the love, support, and patience of a spouse over a decade spent in academia, with her own career, while raising a family? Secondly, to Greg Steffan. I can credit pretty much my entire academic success to his ability to pull me back to the big picture after fruitfully letting me get lost in the details, arguing for my own direction. Back in 2008, he proposed I do my MASc on a seemingly small little problem, which grew into a sub-field of its own after we published, and won us a significant award. Right from the beginning, while he co-supervised my undergraduate BIS thesis in 2006, before he even had tenure, we talked about what kind of computer architectures would work best on an FPGA. The work you are holding right now stems from those conversations, and I think I can say we figured out a good solution. Neither of us knew how far it would all go, and I am sorry he did not get to see it completed. Jason Anderson stepped up to carry the supervision torch, and gave his time and effort to make sure the LegUp benchmarks were great. I am lucky to have had his support, unflagging enthusiasm, and invaluable thesis-writing feedback. Guy Lemieux, of UBC, influenced the later stages of this work enormously. The entire work on execution overhead originates from a single conversation we had. Guy always took the time to help and explain things, criticizing honestly, greatly stimulating my thinking. One of his students, Aaron Severance, gave his time and effort also to provide MXP benchmark data, better than any I could have done myself, and patiently explained how the machinery worked. Any mistakes or omissions are mine alone. I have had the luck of also working with Vaughn Betz and Tarek Abdelrahman on my thesis committee, both bringing contrasting and broad perspectives on my research. In fact, every ECE and DCS Faculty member I have spoken with have answered my questions and given help freely. v I have also had the pleasure of great student colleagues, too numerous to mention all, but especially Ruediger Willenberg, Xander Chin, Braiden Brousseau, Robert Hesse, Ming G. Liu, Zimo Li, Tristan O'Rourke, Mihai Burcea, Andrew Shorten, Jason Lu, Henry Wong, Mark Jeffreys, Jasmina Vasiljevic, Ioana Baldini, and Charles Chiasson. Some roots of this work came from conversations at MIT with Jack Dennis and some of Arvind's grad students in the summer of 2010. Joel Emer of Intel also provided some early feedback. Octavo's initial design improved a lot after speaking with Patrick Doyle, Mark Stoodley, and other people of IBM Markham. I am also indebted to Blair Fort (as a student), Martin Labrecque, and Peter Yian- nacouras. Their earlier soft-processor work laid a lot of the groundwork for my own. I am still learning things from it. I got nothing but first-class IT tech support from Jaro Pristupa, Eugenia Distefano, and YongJoo Lee, who never blinked at all my software installation requests, and saved my data through RAID failures and my errors. Furthermore, Kelly Chan and Beverly Chu were always available for EECG administrative help, as were Judith Levene and Darlene Gorzo of the ECE Grad Office, and Jayne Leake who managed my TA assignments. A lot of computational horsepower (over 24 CPU-years of it!), was provided by the GPC supercomputer at the SciNet HPC Consortium [83]. My special thanks to Ching- Hsing Yu and Dr. Daniel Gruner for their enthusiasm at helping me scale my computa- tions, even when I unintentionally broke their system a little sometimes. (\What do you mean, I used up all the file descriptors?!?") The folk at Altera have also provided support in many ways: funding, Quartus li- censes, support, and hardware. My special thanks to Blair Fort (as Altera staff) for his freely given help. Further funding came from the ECE Department, the Queen Elizabeth II World Telecommunication Congress Graduate Scholarship in Science and Technology, the Wal- ter C. Sumner Foundation, and NSERC. If you read this and feel forgotten, my apologies. Call me up and reminisce! vi \He was alive, trembling ever so slightly with delight, proud that his fear was under control. Then without ceremony he hugged in his forewings, extended his short, angled wingtips, and plunged directly toward the sea. By the time he passed four thousand feet he had reached terminal velocity, the wind was a solid beating wall of sound against which he could move no faster. He was flying now straight down, at two hundred fourteen miles per hour. He swallowed, knowing that if his wings unfolded at that speed he'd be blown into a million tiny shreds of seagull. But the speed was power, and the speed was joy, and the speed was pure beauty." {\Jonathan Livingston Seagull", Richard Bach I was born to Rock n' Roll. Everything I need. I was born with the hammer down. I was built for speed. {\Built For Speed", Mot¨orhead It is by caffeine alone I set my mind in motion. It is by the beans of Java that thoughts acquire speed, the hands acquire shakes, the shakes become a warning. It is by caffeine alone I set my mind in motion. { variation on the \Mentat Mantra" { attributed to Mark Stein, Arisia SF Convention, Boston, ca. 1993, Sunday morning. Soundtrack by deadmau5. vii Contents 1 Introduction1 1.1 Motivation...................................2 1.2 Contributions.................................3 1.2.1 Chapter3: The Octavo Soft-Processor...............3 1.2.2 Chapter4: Tiling Overlay Architectures..............4 1.2.3 Chapter5: Planning for Larger Systems..............5 1.2.4 Chapter6: Approaching Overhead-Free Execution.........5 1.3 Organization.................................6 2 Background7 2.1 FPGA Architecture..............................7 2.1.1 Logic Elements............................7 2.1.2 Logic Clusters and Interconnect...................9 2.1.3 Island-Style FPGA Architecture...................9 2.2 Limitations of Design Processes with FPGAs................ 10 2.2.1 Hardware Description Languages.................. 11 2.2.2 High-Level Synthesis......................... 12 2.2.3 Overlays................................ 13 Overlays: Design Cycle........................ 13 Overlays: Scalar Soft-Processors................... 14 Overlays: Vector Processors..................... 16 Overlays: \Virtual" FPGAs and CGRAs.............. 17 2.3 Significant FPGA Features Affecting The Design Process......... 18 2.4 The Potential for Higher-Performance Overlays.............. 22 2.5 Summary................................... 25 viii 3 The Octavo Soft Processor 26 3.1 Design Goals................................. 27 3.1.1 A Multi-Threaded Overlay...................... 28 3.2 Related Work................................. 29 3.3 Contributions................................. 30 3.4 Introducing Octavo.............................. 31 3.5 Experimental Framework........................... 32 3.6 Storage Architecture............................