Fast and Flexible CAD for Fpgas and Timing Analysis by Kevin Edward

Total Page:16

File Type:pdf, Size:1020Kb

Fast and Flexible CAD for Fpgas and Timing Analysis by Kevin Edward Fast and Flexible CAD for FPGAs and Timing Analysis by Kevin Edward Murray A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Electrical and Computer Engineering University of Toronto © Copyright 2020 by Kevin Edward Murray Abstract Fast and Flexible CAD for FPGAs and Timing Analysis Kevin Edward Murray Doctor of Philosophy Graduate Department of Electrical and Computer Engineering University of Toronto 2020 Field Programmable Gate Arrays (FPGAs) are a widely used platform for hardware acceleration and digital systems design, which offer high performance and power efficiency while remaining re- programmable. However a major challenge to using FPGAs is the complex Computer Aided Design (CAD) flow required to program them, which makes FPGAs difficult and time-consuming to use. This thesis investigates techniques to address these challenges. A key component of the CAD flow is timing analysis, which is used to verify correctness and drive optimizations. Since the dominant form of timing analysis, Static Timing Analysis (STA), is time consuming to perform, we investigate methods to efficiently perform STA in parallel. We then develop Extended Static Timing Analysis (ESTA), which performs a more detailed and less pessimistic form of timing analysis by accounting for the probability of signal transitions. We next study a variety of enhancements to the FPGA physical design flow, including to key optimization stages like packing, placement and routing. These techniques are combined in the open- source VTR 8 design flow, and significantly outperform previous methods. We also develop advanced architecture modelling capabilities which allow VTR 8 to target a wider range of FPGA architectures. These enhancements allow FPGA architects more fine grained control of the targeted FPGA architecture, enabling more in depth study of key architectural features such as the routing network. Additionally, we show how these enhancements enable the VPR placement and routing tool within VTR to target commercial FPGAs as part of a fully open-source design flow. Finally, we study how Reinforcement Learning (RL) can be used to learn new effective heuristics for use in CAD algorithms, and show how this can be used to improve the run-time quality trade-off of an FPGA placement algorithm. ii Acknowledgements First, I would like to thank my supervisor, Vaughn Betz. He has provided numerous suggestions and provided very helpful feedback which has greatly improved this work. Beyond research, I am deeply appreciative of the time and effort he has spent mentoring me, and helping to develop my teaching, communication and management skills. I have learned about much beyond CAD and FPGAs under your guidance. I would also like to thank many of my fellow lab mates and friends. Thank you for your engaging discussions which have been the catalysts for many ideas, and for your encouraging friendship. I'd particularly like to thank Mohamed Abdelfattah, Jeff Cassidy, Henry Wong, Sadegh Yazdanshenas, Ibrahim Ahmed, Oleg Petelin, Kosuke Tatsumura, Sameh Attia, Abed Yassine, Andrew Boutros, Fynn Schwiegelshohn, Mathew Hall, Linda Shen, Tanner Young-Schultz, Mustafa Abbas, Mohamed ElDafrawy and Mohamed ElGammal. I am also greatful for having been able to work with a number of undergraduate summer students including: Johnson Zhong, Jasmine Wang, Eugene Sha, Yang Su, Tina Yang, Richard Ren, Andre Pereira, Jean Wu, Matthew Walker, Hanqing Zeng and Martin Zhang. I hope you learned as much for the experience as I have. From Imperial College London, I would like to thank George Constantinides (who hosted me at Imperial for a semester), Andrea Suardi, and the whole Circuit and Systems group. I am also fortunate to have collaborated with many people as part of the VTR project. In particular I would like to thank Jean-Philippe Legault, Kenneth Kent, Jason Luu, Tim Ansell, Keith Rothman, Dusty DeWeese, and Jeff Goeders for their feedback, brainstorming and help supporting and developing VTR. I would also like to thank the members of my thesis commitee, Jason Anderson and Jonathan Rose, for their helpful discussions and feedback over the years, and Tom Spyrou, Chris Wysocki, and Paul Leventis for helpful insights into timing analysis. During this work I have been fortunate to recieve funding from the National Science and Engineering Research Council (NSERC), the Province of Ontario, and the University of Toronto. I am also thankful for a number of individuals who were influential in leading me to this path, including particularly Ian Rowe, and Stuart Taylor. Finally, I would like to thank my parents, whose constant love, support and encouragment have made this all possible. iii Contents 1 Introduction 1 1.1 Motivation . .1 1.2 Contributions . .3 1.3 Organization . .4 2 Background 6 2.1 Field Programmable Gate Arrays . .6 2.1.1 FPGA Architecture . .6 2.1.2 CAD for FPGAs . .9 2.2 CAD Flows . 11 2.2.1 Impact of CAD Flow on Productivity . 11 2.3 Timing Analysis . 13 2.4 FPGA Packing & Placement . 16 2.4.1 Optimization Objectives . 16 2.4.2 Optimization Algorithms . 17 2.4.3 Handling Legality . 18 2.5 FPGA Routing . 18 I Timing Analysis 20 3 Tatum: Fast Parallel Timing Analysis 21 3.1 Introduction . 21 3.2 Background & Related Work . 22 3.2.1 STA Background . 22 3.2.2 Parallel Timing Analysis . 23 3.3 Baseline Implementation . 23 3.3.1 Memory Layout Optimizations . 24 3.4 Parallel STA on GPUs . 25 3.4.1 GPU Algorithm . 25 3.4.2 GPU Experimental Results . 26 3.5 Parallel STA on CPUs . 28 3.5.1 CPU Algorithm A: Levelized Locks . 28 3.5.2 CPU Algorithm B: Levelized No-Locks . 28 iv 3.5.3 CPU Algorithm C: Dynamic . 30 3.5.4 CPU Experimental Results . 32 3.6 Multi-Corner Timing Analysis . 33 3.6.1 Parallel . 33 3.6.2 SIMD . 34 3.6.3 Analysis within Fixed CAD Run-Time Budgets . 35 3.7 Multi-clock Timing Analysis . 35 3.8 Tatum . 36 3.8.1 Asymptotic Scalability . 36 3.8.2 Comparison to OpenTimer . 38 3.9 Evaluation in VPR . 39 3.10 Improving Optimization . 40 3.10.1 Algorithmic Enhancements . 42 3.10.2 Results . 42 3.11 Conclusion . 43 3.12 Future Work . 44 4 Extended Static Timing Analysis 45 4.1 Introduction . 45 4.2 Background & Related Work . 46 4.3 Extended Static Timing Analysis Formulation . 48 4.3.1 Transition Model . 48 4.3.2 Using #SAT to Calculate Probabilities . 49 4.3.3 Combining Activation Functions . 50 4.3.4 Conditioning Functions . 51 4.4 ESTA Implementation . 51 4.4.1 Calculating Timing Tags . 52 4.4.2 Input Filtering . 52 4.4.3 Tag Merging . 53 4.4.4 Run-Time/Accuracy Trade-Offs . 53 4.4.5 Enhanced Algorithm . 54 4.4.6 Computational Complexity . 54 4.4.7 Comparison with Exhaustive Simulation . 56 4.4.8 Per-Output & Max-Delay Modes . 56 4.5 Non-Uniform & Correlated Conditioning Functions . 56 4.5.1 Non-uniform Conditioning Functions . 57 4.5.2 Correlated Condition Functions . 59 4.6 Evaluation Methodology . 60 4.6.1 Monte-Carlo Simulation . 61 4.6.2 Metrics . 62 4.7 Results . 62 4.7.1 Verification . 62 4.7.2 Maximum Delay Estimation with False Paths . 63 4.7.3 Monte-Carlo Convergence . 63 v 4.7.4 ESTA Run-time/Accuracy Trade-Offs . 63 4.7.5 ESTA Non-Uniform Conditioning Functions . 65 4.7.6 ESTA and MC Comparison . 66 4.8 Conclusion . 69 4.9 Future Work . ..
Recommended publications
  • EMBEDDED PROGRAMMABLE LOGIC CORES By
    IMPLEMENTATION CONSIDERATIONS FOR “SOFT” EMBEDDED PROGRAMMABLE LOGIC CORES by James Cheng-Huan Wu B.A.Sc., University of British Columbia, 2002 A thesis submitted in partial fulfillment of the requirements for the degree of Master of Applied Science in The Faculty of Graduate Studies Department of Electrical and Computer Engineering We accept this thesis as conforming to the required standard: ___________________________ ___________________________ ___________________________ ___________________________ The University of British Columbia September 2004 © James C.H. Wu, 2004 ABSTRACT IMPLEMENTATION CONSIDERATIONS FOR “SOFT” EMBEDDED PROGRAMMABLE LOGIC CORES As integrated circuits become increasingly more complex and expensive, the ability to make post-fabrication changes will become much more attractive. This ability can be realized using programmable logic cores. Currently, such cores are available from vendors in the form of “hard” macro layouts. An alternative approach for fine-grain programmability is possible: vendors supply an RTL version of their programmable logic fabric that can be synthesized using standard cells. Although this technique may suffer in terms of speed, density, and power overhead, the task of integrating such cores is far easier than the task of integrating “hard” cores into an ASIC or SoC. When the required amount of programmable logic is small, this ease of use may be more important than the increased overhead. In this thesis, we identify potential implementation issues associated with such cores, and investigate in depth the area, speed and power overhead of using this approach. Based on this investigation, we attempt to improve the performance of programmable cores created in this manner. Using a test-chip implementation, we identify three main issues: core size selection, I/O connections, and clock-tree synthesis.
    [Show full text]
  • FPGA Emulation for Critical-Path Coverage Analysis
    FPGA Emulation for Critical-Path Coverage Analysis by Kyle Balston B.A.Sc., Simon Fraser University, 2010 ATHESISSUBMITTEDINPARTIAL FULFILLMENT OFTHEREQUIREMENTSFORTHEDEGREEOF Master of Applied Science in THEFACULTYOFGRADUATESTUDIES (Electrical and Computer Engineering) The University Of British Columbia (Vancouver) October 2012 © Kyle Balston, 2012 Abstract A major task in post-silicon validation is timing validation: it can be incredibly difficult to ensure a new chip meets timing goals. Post-silicon validation is the first opportunity to check timing with real silicon under actual operating conditions and workloads. However, post-silicon tests suffer from low observability, making it difficult to properly quantify test quality for the long-running random and di- rected system-level tests that are typical in post-silicon. In this thesis, we propose a technique for measuring the quality of long-running system-level tests used for timing coverage through the use of on-chip path monitors to be used with FPGA emulation. We demonstrate our technique on a non-trivial SoC, measuring the cov- erage of 2048 paths (selected as most critical by static timing analysis) achieved by some pre-silicon system-level tests, a number of well-known benchmarks, boot- ing Linux, and executing randomly generated programs. The results show that the technique is feasible, with area and timing overheads acceptable for pre-silicon FPGA emulation. ii Preface Preliminary work related to this thesis has been published in a conference paper and a journal paper. The work itself will be published as an invited conference paper. The first paper, Post-Silicon Code Coverage Evaluation with Reduced Area Overhead for Functional Verification of SoC, was published as [33].
    [Show full text]
  • Common Path Pessimism Removal in Static Timing Analysis
    Scholars' Mine Masters Theses Student Theses and Dissertations Summer 2015 Common path pessimism removal in static timing analysis Chunyu Wang Follow this and additional works at: https://scholarsmine.mst.edu/masters_theses Part of the Computer Engineering Commons Department: Recommended Citation Wang, Chunyu, "Common path pessimism removal in static timing analysis" (2015). Masters Theses. 7440. https://scholarsmine.mst.edu/masters_theses/7440 This thesis is brought to you by Scholars' Mine, a service of the Missouri S&T Library and Learning Resources. This work is protected by U. S. Copyright Law. Unauthorized use including reproduction for redistribution requires the permission of the copyright holder. For more information, please contact [email protected]. COMMON PATH PESSIMISM REMOVAL IN STATIC TIMING ANALYSIS by CHUNYU WANG A THESIS Presented to the Faculty of the Graduate School of the MISSOURI UNIVERSITY OF SCIENCE AND TECHNOLOGY In Partial Fulfillment of the Requirements for the Degree MASTER OF SCIENCE IN COMPUTER ENGINEERING 2015 Approved by Yiyu Shi, Advisor Minsu Choi Jun Fan 2015 Chunyu Wang All Rights Reserved iii ABSTRACT Static timing analysis is a key process to guarantee timing closure for modern IC designs. However, additional pessimism can significantly increase the difficulty to achieve timing closure. Common path pessimism removal (CPPR) is a prevalent step to achieve accurate timing signoff. To speed up the existing exhaustive exploration on all paths in a design, this thesis introduces a fast multi-threading timing analysis for removing common path pessimism based on block-based static timing analysis. Experimental results show that the proposed method has faster runtime in removing excess pessimism from clock paths.
    [Show full text]
  • Gate-Level Timing Analysis and Waveform Evaluation
    Syracuse University SURFACE Dissertations - ALL SURFACE May 2015 Gate-level timing analysis and waveform evaluation Chaobo Li Syracuse University Follow this and additional works at: https://surface.syr.edu/etd Part of the Engineering Commons Recommended Citation Li, Chaobo, "Gate-level timing analysis and waveform evaluation" (2015). Dissertations - ALL. 220. https://surface.syr.edu/etd/220 This Dissertation is brought to you for free and open access by the SURFACE at SURFACE. It has been accepted for inclusion in Dissertations - ALL by an authorized administrator of SURFACE. For more information, please contact [email protected]. Abstract Static timing analysis (STA) is an integral part of modern VLSI chip design. Table lookup based methods are widely used in current industry due to its fast runtime and mature algorithms. Conventional STA algorithms based on table-lookup methods are developed under many assumptions in timing analysis; however, most of those assumptions, such as that input signals and output signals can be accurately modeled as ramp waveforms, are no longer satisfactory to meet the increasing demand of accuracy for new technologies. In this dissertation, we discuss several crucial issues that conventional STA has not taken into consideration, and propose new methods to handle these issues and show that new methods produce accurate results. In logic circuits, gates may have multiple inputs and signals can arrive at these inputs at different times and with different waveforms. Different arrival times and waveforms of signals can cause very different responses. However, multiple-input transition effects are totally overlooked by current STA tools. Using a conventional single-input transition model when multiple-input transition happens can cause significant estimation errors in timing analysis.
    [Show full text]
  • Manoel Barros Marin BE-BI-BP ISOTDAQ 2019 @ Royal Holloway, University of London (UK) 09/04/2019
    ISOTDAQ 2019 @ Royal Holloway, University of London (UK) 09/04/2019 Manoel Barros Marin BE-BI-BP ISOTDAQ 2019 @ Royal Holloway, University of London (UK) 09/04/2019 Outline: • … from the previous lesson • Key concepts about FPGA design • FPGA gateware design work flow • Summary Manoel Barros Marin BE-BI-BP ISOTDAQ 2019 @ Royal Holloway, University of London (UK) 09/04/2019 • … from the previous lesson Manoel Barros Marin BE-BI-BP … from the previous lesson What is an Field Programmable Gate Array (FPGA)? 4 … from the previous lesson What is an Field Programmable Gate Array (FPGA)? FPGA - Wikipedia https://en.wikipedia.org/wiki/Field-programmable_gate_array A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by a customer or a designer after manufacturing – hence "field-programmable". 5 … from the previous lesson What is an Field Programmable Gate Array (FPGA)? FPGA - Wikipedia https://en.wikipedia.org/wiki/Field-programmable_gate_array A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by a customer or a designer after manufacturing – hence "field-programmable". 6 … from the previous lesson • FPGA fabric (matrix like structure) made of: • I/O-cells to communicate with outside world • Logic cells o Look-Up-Table (LUT) to implement combinatorial logic o Flip-Flops (D) to implement sequential logic • Interconnect network between logic resources • Clock tree to distribute the clock signals 7 … from the previous lesson • But it also features Hard Blocks: Example of
    [Show full text]
  • Chapter 8 Timing Closure
    Chapter 8 8 Timing Closure 8 Timing Closure ............................................................221 8.1 Introduction..................................................................................... 221 8.2 Timing Analysis and Performance Constraints ............................. 223 8.2.1 Static Timing Analysis................................................... 224 8.2.2 Delay Budgeting with the Zero-Slack Algorithm........... 229 8.3 Timing-Driven Placement............................................................... 233 8.3.1 Net-Based Techniques ................................................. 234 8.3.2 Embedding STA into Linear Programs for Placement . 237 8.4 Timing-Driven Routing ................................................................... 239 8 8.4.1 The Bounded-Radius, Bounded-Cost Algorithm.......... 240 8.4.2 Prim-Dijkstra Tradeoff................................................... 241 8.4.3 Minimization of Source-to-Sink Delay........................... 242 8.5 Physical Synthesis ......................................................................... 244 8.5.1 Gate Sizing.................................................................... 244 8.5.2 Buffering........................................................................ 245 8.5.3 Netlist Restructuring...................................................... 246 8.6 Performance-Driven Design Flow.................................................. 250 8.7 Conclusions...................................................................................
    [Show full text]
  • Timing Closure
    Timing Closure Introduction Every design has to run at a certain speed based on the design requirement. There are generally three types of speed requirement in an FPGA design: Timing requirement – how fast or slow a design should run. This is defined through the target clock period (or clock frequency) and a few other constraints. Throughput – the average rate of the valid output delivered per clock cycle Latency – the amount of the time required when the valid output is available after the input arrives, usually measured in the number of clock cycles Throughput and latency are usually related to the design architecture and application, and they need to be traded off between each other based on the system requirement. For example, high throughput usually means more pipelining, which increases the latency; low latency usually requires longer combinatorial paths, which removes pipelines, and this can reduce the throughput and clock speed. More often, FPGA designers deal with the timing requirement to make sure that the design runs at the required clock speed. This can require hard work for high-speed design in order to close timing using various techniques (including the trade-off between throughput and latency, appropriate timing constraint adjustments, etc.) and running through multiple processing iterations including Synthesis, MAP and PAR. This document focuses on the timing requirement; it explains the timing- driven FPGA implementation processes and shows how to tackle timing issues when timing closure becomes problematic. Copyright © October 2013 Lattice Semiconductor Corporation. Introduction Timing Requirements and Constraints Several types of timing requirement are commonly used in FPGA designs and can be specified in Diamond through constraints and preferences.
    [Show full text]
  • Aadam: a Fast, Accurate, and Versatile Aging-Aware Cell Library Delay Model Using Feed-Forward Neural Network
    Aadam: A Fast, Accurate, and Versatile Aging-Aware Cell Library Delay Model using Feed-Forward Neural Network Seyed Milad Ebrahimipour1, Behnam Ghavami2, Hamid Mousavi1, Mohsen Raji3 Zhenman Fang2, Lesley Shannon2 1Shahid Bahonar University of Kerman, {miladebrahimi, hamidmousavi0}@eng.uk.ac.ir 2Simon Fraser University, {behnam_ghavami, zhenman, lesley_shannon}@sfu.ca 3Shiraz University, [email protected] ABSTRACT 1 INTRODUCTION With the CMOS technology scaling, transistor aging has become With the CMOS technology scaling, the reliability of circuits has one major issue affecting circuit reliability and lifetime. There are become one of the major issues affecting digital circuit designs1 [ , 7]. two major classes of existing studies that model the aging effects In addition to the correct functionality of the circuit, a longtime in the circuit delay. One is at transistor-level, which is highly ac- lifespan is also crucial in many application fields such as aerospace, curate but very slow. The other is at gate-level, which is faster but defense, and medical industries [5, 13]. Transistor aging is a key less accurate. Moreover, most prior studies only consider a limited source of failure that threatens the lifetime reliability of digital subset or limited value ranges of aging factors. circuits. It leads to a degradation of the electrical characteristics of In this paper, we propose Aadam, a fast, accurate, and versatile transistors and subsequently, a considerable increase of the device aging-aware delay model for generic cell libraries. In Aadam, we delay [19]. For example, the Negative Bias Temperature Instability first use transistor-level SPICE simulations to accurately charac- (NBTI) aging phenomenon, a major parametric reliability issue, terize the delay degradation of each library cell under a versatile may increase the circuit delay by up to 30% [14].
    [Show full text]