Divide-and-Conquer Techniques for Large Scale FPGA Design
by
Kevin Edward Murray

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto

© Copyright 2015 by Kevin Edward Murray

Abstract

Divide-and-Conquer Techniques for Large Scale FPGA Design
Kevin Edward Murray
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2015

The exponential growth in Field-Programmable Gate Array (FPGA) size afforded by Moore's Law has greatly increased the breadth and scale of applications suitable for implementation on FPGAs. However, the increasing design size and complexity challenge the scalability of the conventional approaches used to implement FPGA designs, making FPGAs difficult and time-consuming to use. This thesis investigates new divide-and-conquer approaches to address these scalability challenges. In order to evaluate the scalability and limitations of existing approaches, we present a new large FPGA benchmark suite suitable for exploring these issues. We then investigate the practicality of using latency insensitive design to decouple timing requirements and reduce the number of design iterations required to achieve timing closure. Finally, we study floorplanning, a technique which spatially decomposes the FPGA implementation to expose additional parallelism during the implementation process. To evaluate the impact of floorplanning on FPGAs, we develop Hetris, a new automated FPGA floorplanning tool.

Acknowledgements

First, I would like to thank my supervisor Vaughn Betz. His suggestions and feedback have been invaluable in improving the quality of this work. Furthermore, I am deeply appreciative of the time and effort he has invested in mentoring me. I would also like to thank my lab mates and friends. You have always been willing to hear me out and answer my questions.
You have also been the catalysts for many good ideas and well needed breaks. I specifically would like to thank Jason Luu for his assistance and suggestions with all things VPR related, Suya Liu for her work organizing and collecting benchmark circuits, and Scott Whitty for creating the VQM2BLIF tool.

I am also grateful to the many individuals and organizations which have shared benchmark circuits including: Altera, Braiden Brousseau, Deming Chen, Jason Cong, George Constantinides, Zefu Dai, Joseph Garvey, IWLS2005, Mark Jervis, LegUP, Simon Moore, OpenCores.org, OpenSparc.net, Kalin Ovtcharov, Alex Rodionov, Russ Tessier, Danyao Wang, Wei Zhang, and Jianwen Zhu. I also thank David Lewis, Jonathan Rose and Jason Anderson for useful discussions, and Stuart Taylor for introducing me to the fascinating world of hard optimization problems.

During this work I have been fortunate to receive financial support from the Province of Ontario, the University of Toronto and the Noakes Family.

Finally, I would like to thank my parents. It is through your constant love and support that this is possible.

Preface

This thesis is based in part on the following works published with co-authors:

• K. E. Murray, S. Whitty, S. Liu, J. Luu and V. Betz, "Timing Driven Titan: Enabling Large Benchmarks and Exploring the Gap Between Academic and Commercial CAD", To appear in ACM Trans. Reconfig. Technol. Syst., 18 pages.
• K. E. Murray and V. Betz, "Quantifying the Cost and Benefit of Latency Insensitive Communication on FPGAs", ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays, 2014, 223-232.
• K. E. Murray, S. Whitty, S. Liu, J. Luu and V. Betz, "Titan: Enabling Large and Complex Benchmarks in Academic CAD", IEEE Int. Conf. on Field-Programmable Logic and Applications, 2013, 1-8.
• K. E. Murray, S. Whitty, S. Liu, J. Luu and V. Betz, "From Quartus To VPR: Converting HDL to BLIF with the Titan Flow", IEEE Int. Conf. on Field-Programmable Logic and Applications, 2013, 1-1.
[Demo Night Paper]

Contents

1 Introduction
  1.1 Motivation
  1.2 Organization
2 Background
  2.1 Field Programmable Gate Arrays
    2.1.1 FPGA Architecture
    2.1.2 CAD for FPGAs
    2.1.3 FPGA Trends
  2.2 FPGA Benchmarks & CAD Flows
    2.2.1 FPGA Benchmarks
  2.3 Impact of CAD & Design Methodology on Productivity
    2.3.1 Scaling Challenges and Approaches
  2.4 Timing Closure
    2.4.1 Scalability Challenges with Synchronous Design
    2.4.2 Beyond Synchronous Design
    2.4.3 Latency Insensitive Design
  2.5 Scalable Design Modification and Synthesis
    2.5.1 Scalable Design Modification
    2.5.2 Scalable Design Synthesis
    2.5.3 Floorplanning
  2.6 Types of Floorplanning Problems
    2.6.1 The Homogeneous Floorplanning Problem
    2.6.2 The Fixed-Outline Homogeneous Floorplanning Problem
    2.6.3 The Rectangular Homogeneous Floorplanning Problem
    2.6.4 The Heterogeneous Floorplanning Problem
    2.6.5 Optimization Domain
  2.7 Floorplanning for ASICs
    2.7.1 ASIC Floorplanning Techniques
    2.7.2 Simulated Annealing
    2.7.3 Floorplan Representations
  2.8 Floorplanning for FPGAs
    2.8.1 FPGA Floorplanning Techniques
    2.8.2 Comments on FPGA Floorplanning Techniques
3 Titan: Large Benchmarks for FPGA Architecture and CAD Evaluation
  3.1 Motivation
  3.2 Introduction
  3.3 The Titan Flow
  3.4 Flow Comparison
  3.5 Benchmark Suite
    3.5.1 Titan23 Benchmark Suite
    3.5.2 Benchmark Conversion Methodology
    3.5.3 Comparison to Other Benchmark Suites
  3.6 Stratix IV Architecture Capture
    3.6.1 Floorplan
    3.6.2 Global (Inter-Block) Routing
    3.6.3 Logic Array Block (LAB)
    3.6.4 Adaptive Logic Module (ALM)
    3.6.5 DSP Block
    3.6.6 RAM Block
    3.6.7 Phase-Locked-Loops
    3.6.8 I/O
  3.7 Advanced Architectural Features
    3.7.1 Carry Chains
    3.7.2 Direct-Link Interconnect and Three Sided Logic Array Blocks (LABs)
    3.7.3 Improved DSP Packing
  3.8 Timing Model
    3.8.1 LAB Timing
    3.8.2 RAM Timing
    3.8.3 DSP Timing
    3.8.4 Wire Timing
    3.8.5 Other Timing
    3.8.6 VPR Limitations
    3.8.7 Timing Model Verification
  3.9 Benchmark Results
    3.9.1 Benchmarking Configuration
    3.9.2 Quality of Results Metrics
    3.9.3 Timing Driven Compilation and Enhanced Architecture Impact
    3.9.4 Performance Comparison with Quartus II
    3.9.5 Quality of Results Comparison with Quartus II
    3.9.6 Modified Quartus II Comparison
    3.9.7 Comparison of VPR to Other Commercial Tools
    3.9.8 VPR versus Quartus II Quality Implications
  3.10 Conclusion
4 Latency Insensitive Communication on FPGAs
  4.1 Introduction
  4.2 Latency Insensitive Design Implementation
    4.2.1 Baseline Wrapper
    4.2.2 Optimized Wrapper
  4.3 Results
    4.3.1 FIR Design Overhead
    4.3.2 Pipelining Efficiency
    4.3.3 Generalized Latency Insensitive Wrapper Scaling
    4.3.4 Latency Insensitive Design Overhead
  4.4 Conclusions