Fast and Flexible CAD for FPGAs and Timing Analysis

by

Kevin Edward Murray

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Electrical and Computer Engineering University of Toronto

© Copyright 2020 by Kevin Edward Murray

Abstract

Fast and Flexible CAD for FPGAs and Timing Analysis

Kevin Edward Murray Doctor of Philosophy Graduate Department of Electrical and Computer Engineering University of Toronto 2020

Field Programmable Gate Arrays (FPGAs) are a widely used platform for hardware acceleration and digital systems design, which offer high performance and power efficiency while remaining re-programmable. However, a major challenge to using FPGAs is the complex Computer Aided Design (CAD) flow required to program them, which makes FPGAs difficult and time-consuming to use. This thesis investigates techniques to address these challenges.

A key component of the CAD flow is timing analysis, which is used to verify correctness and drive optimizations. Since the dominant form of timing analysis, Static Timing Analysis (STA), is time-consuming to perform, we investigate methods to efficiently perform STA in parallel. We then develop Extended Static Timing Analysis (ESTA), which performs a more detailed and less pessimistic form of timing analysis by accounting for the probability of signal transitions.

We next study a variety of enhancements to the FPGA physical design flow, including key optimization stages like packing, placement and routing. These techniques are combined in the open-source VTR 8 design flow, and significantly outperform previous methods.

We also develop advanced architecture modelling capabilities which allow VTR 8 to target a wider range of FPGA architectures. These enhancements give FPGA architects finer-grained control of the targeted FPGA architecture, enabling more in-depth study of key architectural features such as the routing network. Additionally, we show how these enhancements enable the VPR placement and routing tool within VTR to target commercial FPGAs as part of a fully open-source design flow.

Finally, we study how Reinforcement Learning (RL) can be used to learn new effective heuristics for use in CAD algorithms, and show how this can be used to improve the run-time/quality trade-off of an FPGA placement algorithm.

Acknowledgements

First, I would like to thank my supervisor, Vaughn Betz. He has provided numerous suggestions and very helpful feedback which has greatly improved this work. Beyond research, I am deeply appreciative of the time and effort he has spent mentoring me, and helping to develop my teaching, communication and management skills. I have learned about much beyond CAD and FPGAs under your guidance.

I would also like to thank many of my fellow lab mates and friends. Thank you for your engaging discussions which have been the catalysts for many ideas, and for your encouraging friendship. I’d particularly like to thank Mohamed Abdelfattah, Jeff Cassidy, Henry Wong, Sadegh Yazdanshenas, Ibrahim Ahmed, Oleg Petelin, Kosuke Tatsumura, Sameh Attia, Abed Yassine, Andrew Boutros, Fynn Schwiegelshohn, Mathew Hall, Linda Shen, Tanner Young-Schultz, Mustafa Abbas, Mohamed ElDafrawy and Mohamed ElGammal.

I am also grateful for having been able to work with a number of undergraduate summer students including: Johnson Zhong, Jasmine Wang, Eugene Sha, Yang Su, Tina Yang, Richard Ren, Andre Pereira, Jean Wu, Matthew Walker, Hanqing Zeng and Martin Zhang. I hope you learned as much from the experience as I have.

From Imperial College London, I would like to thank George Constantinides (who hosted me at Imperial for a semester), Andrea Suardi, and the whole Circuit and Systems group.

I am also fortunate to have collaborated with many people as part of the VTR project. In particular I would like to thank Jean-Philippe Legault, Kenneth Kent, Jason Luu, Tim Ansell, Keith Rothman, Dusty DeWeese, and Jeff Goeders for their feedback, brainstorming and help supporting and developing VTR. I would also like to thank the members of my thesis committee, Jason Anderson and Jonathan Rose, for their helpful discussions and feedback over the years, and Tom Spyrou, Chris Wysocki, and Paul Leventis for helpful insights into timing analysis.

During this work I have been fortunate to receive funding from the Natural Sciences and Engineering Research Council (NSERC), the Province of Ontario, and the University of Toronto. I am also thankful for a number of individuals who were influential in leading me to this path, including particularly Ian Rowe and Stuart Taylor.

Finally, I would like to thank my parents, whose constant love, support and encouragement have made this all possible.

Contents

1 Introduction 1 1.1 Motivation ...... 1 1.2 Contributions ...... 3 1.3 Organization ...... 4

2 Background 6 2.1 Field Programmable Gate Arrays ...... 6 2.1.1 FPGA Architecture ...... 6 2.1.2 CAD for FPGAs ...... 9 2.2 CAD Flows ...... 11 2.2.1 Impact of CAD Flow on Productivity ...... 11 2.3 Timing Analysis ...... 13 2.4 FPGA Packing & Placement ...... 16 2.4.1 Optimization Objectives ...... 16 2.4.2 Optimization Algorithms ...... 17 2.4.3 Handling Legality ...... 18 2.5 FPGA Routing ...... 18

I Timing Analysis 20

3 Tatum: Fast Parallel Timing Analysis 21 3.1 Introduction ...... 21 3.2 Background & Related Work ...... 22 3.2.1 STA Background ...... 22 3.2.2 Parallel Timing Analysis ...... 23 3.3 Baseline Implementation ...... 23 3.3.1 Memory Layout Optimizations ...... 24 3.4 Parallel STA on GPUs ...... 25 3.4.1 GPU Algorithm ...... 25 3.4.2 GPU Experimental Results ...... 26 3.5 Parallel STA on CPUs ...... 28 3.5.1 CPU Algorithm A: Levelized Locks ...... 28 3.5.2 CPU Algorithm B: Levelized No-Locks ...... 28

iv 3.5.3 CPU Algorithm C: Dynamic ...... 30 3.5.4 CPU Experimental Results ...... 32 3.6 Multi-Corner Timing Analysis ...... 33 3.6.1 Parallel ...... 33 3.6.2 SIMD ...... 34 3.6.3 Analysis within Fixed CAD Run-Time Budgets ...... 35 3.7 Multi-clock Timing Analysis ...... 35 3.8 Tatum ...... 36 3.8.1 Asymptotic Scalability ...... 36 3.8.2 Comparison to OpenTimer ...... 38 3.9 Evaluation in VPR ...... 39 3.10 Improving Optimization ...... 40 3.10.1 Algorithmic Enhancements ...... 42 3.10.2 Results ...... 42 3.11 Conclusion ...... 43 3.12 Future Work ...... 44

4 Extended Static Timing Analysis 45 4.1 Introduction ...... 45 4.2 Background & Related Work ...... 46 4.3 Extended Static Timing Analysis Formulation ...... 48 4.3.1 Transition Model ...... 48 4.3.2 Using #SAT to Calculate Probabilities ...... 49 4.3.3 Combining Activation Functions ...... 50 4.3.4 Conditioning Functions ...... 51 4.4 ESTA Implementation ...... 51 4.4.1 Calculating Timing Tags ...... 52 4.4.2 Input Filtering ...... 52 4.4.3 Tag Merging ...... 53 4.4.4 Run-Time/Accuracy Trade-Offs ...... 53 4.4.5 Enhanced Algorithm ...... 54 4.4.6 Computational Complexity ...... 54 4.4.7 Comparison with Exhaustive Simulation ...... 56 4.4.8 Per-Output & Max-Delay Modes ...... 56 4.5 Non-Uniform & Correlated Conditioning Functions ...... 56 4.5.1 Non-uniform Conditioning Functions ...... 57 4.5.2 Correlated Condition Functions ...... 59 4.6 Evaluation Methodology ...... 60 4.6.1 Monte-Carlo Simulation ...... 61 4.6.2 Metrics ...... 62 4.7 Results ...... 62 4.7.1 Verification ...... 62 4.7.2 Maximum Delay Estimation with False Paths ...... 63 4.7.3 Monte-Carlo Convergence ...... 63

v 4.7.4 ESTA Run-time/Accuracy Trade-Offs ...... 63 4.7.5 ESTA Non-Uniform Conditioning Functions ...... 65 4.7.6 ESTA and MC Comparison ...... 66 4.8 Conclusion ...... 69 4.9 Future Work ...... 70

II Flexible and Efficient FPGA CAD 72

5 VTR Introduction 73 5.1 Introduction ...... 73 5.2 Related Work ...... 75 5.2.1 FPGA CAD & Architecture Exploration Frameworks ...... 75 5.2.2 Research Using VTR ...... 75

6 VTR Design Flow 78 6.1 VTR Design Flow Overview ...... 78 6.2 Design Flow Enhancements ...... 79

7 Architecture Modeling Generality & Completeness in VTR 82 7.1 Architecture Modelling Enhancements ...... 82 7.1.1 General Device Grid ...... 82 7.1.2 Flexible Routing Architectures ...... 83 7.1.3 Area & Timing Modelling ...... 86 7.2 Architecture Enhancements Case Study: Stratix IV ...... 87 7.2.1 Grid Layout ...... 87 7.2.2 Routing Network ...... 87 7.2.3 Timing Model ...... 90 7.2.4 Architecture Enhancements Comparison ...... 90

8 Enhanced FPGA Packing & Placement 92 8.1 Packing Enhancements ...... 92 8.1.1 Where to Start? ...... 93 8.1.2 What to Pack Next? ...... 94 8.1.3 When is a Cluster Full? ...... 96 8.1.4 Adapting to Available Device Resources ...... 97 8.1.5 Packing QoR Evaluation ...... 98 8.2 Placement Enhancements ...... 99 8.2.1 Placement Macro Move Generator ...... 99 8.2.2 Compressed Move Grid ...... 100

9 AIR: A Fast but Lazy Timing-Driven FPGA Router 102 9.1 Introduction ...... 102 9.2 Algorithm Overview ...... 103 9.3 A Fast but Lazy Router ...... 105

vi 9.3.1 Incremental Routing ...... 105 9.3.2 High-Fanout Nets ...... 106 9.3.3 Block Output-pin Lockdown ...... 108 9.4 Improving Quality ...... 109 9.4.1 Router Lookahead ...... 109 9.4.2 Routing Resource Base Costs ...... 111 9.5 Adapting to Congestion ...... 112 9.5.1 Dynamic Router Bounding Boxes ...... 112 9.5.2 Congestion Mode ...... 113 9.6 Multi-convergence Routing ...... 114 9.7 Routing with Non-Configurable Switches ...... 115 9.8 Speeding-Up Minimum Channel Width Search ...... 117 9.8.1 Routing Failure Predictor ...... 117 9.8.2 Minimum Routable Channel Width Hints ...... 119 9.9 AIR Evaluation & Comparison ...... 119 9.9.1 Comparison with VPR Routers ...... 120 9.9.2 Comparison with CRoute and Quartus ...... 121

10 Other VTR Enhancements 123 10.1 Logic Optimization & Technology Mapping ...... 123 10.1.1 Safe Multi-Clock Optimization ...... 123 10.2 Timing Analysis ...... 124 10.2.1 Combined Setup & Hold Analysis ...... 125 10.2.2 Comparison with VPR Classic STA ...... 125 10.3 Software Engineering ...... 125 10.3.1 VPR ...... 126 10.3.2 Compiler Settings ...... 127 10.3.3 Regression Testing ...... 128 10.3.4 Error Messages & File Parsers ...... 129 10.3.5 Documentation ...... 129

11 VTR 8 Evaluation 131 11.1 VTR Benchmarks ...... 131 11.2 Titan23 Benchmarks ...... 132 11.2.1 Comparison with VPR 7 ...... 133 11.2.2 Comparison with Quartus 18.0 ...... 133 11.3 Conclusion ...... 135 11.4 Future Work ...... 135

III An Open-Source CAD Flow for Commercial FPGA Devices 139

12 Symbiflow & VPR 140 12.1 Introduction ...... 140 12.2 Related Work ...... 142

vii 12.3 Design Flow ...... 143 12.3.1 Specifying Architectures ...... 144 12.3.2 Enhancements to Target Commercial FPGAs ...... 145 12.3.3 Verification and Correctness ...... 147 12.4 Flow Quality ...... 147 12.5 Conclusion & Future Work ...... 149

IV Reinforcement Learning Enhanced CAD 150

13 Adaptive FPGA Placement Optimization via Reinforcement Learning 151 13.1 Introduction ...... 151 13.2 Reinforcement Learning ...... 152 13.3 Simulated Annealing Placement for FPGAs ...... 152 13.3.1 Placement Moves ...... 154 13.4 Reinforcement Learning Enhanced Move Generator ...... 154 13.4.1 Reward Formulation ...... 155 13.4.2 Estimating Action Values ...... 156 13.4.3 Action Selection: Exploration versus Exploitation ...... 158 13.5 Results ...... 159 13.6 Conclusion & Future Work ...... 161

14 Conclusion & Future Work 163 14.1 Parallel STA and Tatum ...... 164 14.2 Extended Static Timing Analysis ...... 164 14.3 VTR Design Flow & Architecture Modeling ...... 165 14.4 FPGA Packing & Placement ...... 166 14.5 FPGA Routing & AIR ...... 167 14.6 VTR Evaluation ...... 168 14.7 Open-Source FPGA CAD ...... 168 14.8 Reinforcement Learning Enhanced CAD ...... 169

Appendices 170

A Synthesizing Correlated ESTA Condition Functions 170 A.1 An ILP approach ...... 170 A.1.1 Specifying Probabilities and Correlations ...... 170 A.1.2 Relating Condition Functions to the Specification ...... 171 A.1.3 Condition Function Synthesis Formulation ...... 171 A.1.4 Implementing Condition Function Synthesis ...... 172 A.1.5 Example ...... 173 A.2 A Stochastic Synthesis Approach ...... 174 A.2.1 Covariance Matrices ...... 174 A.2.2 Decorrelation with Cholesky Decomposition ...... 175 A.2.3 Stochastic Computation ...... 177

viii A.2.4 Stochastic ...... 177 A.2.5 Combining Stochastic Synthesis & Cholesky Decomposition to Generate ESTA Condition Functions ...... 178 A.3 Conclusion ...... 180

Bibliography 181

List of Tables

3.1 Performance of Memory Optimizations ...... 25 3.2 Compute Platforms ...... 27 3.3 GPU Performance ...... 27 3.4 GPU Speed-up Comparison ...... 28 3.5 Speed-Up of CPU Parallel Algorithms ...... 33 3.6 Timing Graph Memory Usage ...... 33 3.7 VPR Multi-clock STA Run-Time During Placement ...... 40 3.8 VPR Placement Total Run-Time ...... 41 3.9 Impact of STA Frequency During Quench ...... 43

4.1 Path-Delay Distribution (uniform input transition probability)...... 47 4.2 Signal Transitions. Initial/Final correspond to signal values at the start/end of a clock cycle...... 49 4.3 AND Gate Transitions...... 49 4.4 Normalized Critical Path Delays with False Paths...... 62 4.5 Impact of convergence criteria on MC Max Delay run-time of the MCNC20 Benchmarks. 64 4.6 Relative ESTA per-output run-time for various conditioning functions (d = 100, m = 104). 67 4.7 Quality and Run-Time on the MCNC20 Benchmarks...... 69

7.1 Normalized Impact of 2x3 RAM block Internal Routing Connectivity on VTR benchmarks (> 10K primitives) ...... 85 7.2 Summary of Modelled Stratix IV Routing Architecture Characteristics ...... 89 7.3 Geometric Mean Normalized Impact of Architecture Variants on the Titan Benchmarks . 91

8.1 Cumulative Normalized Impact of Packing Changes on Titan benchmarks ...... 99 8.2 Normalized Impact of Packer High Fanout Threshold on Titan benchmarks ...... 99 8.3 Cummulative Normalized Impact of Placement Move Generation Changes on Titan bench- marks ...... 100

9.1 Normalized Impact of Incremental Routing on VTR benchmarks (> 10K Primitives) . . . 106 9.2 Normalized Impact of Incremental Routing on Titan benchmarks ...... 106 9.3 Normalized Impact of High Fanout Routing on Titan benchmarks ...... 108 9.4 Normalized Impact of OPIN Lockdown on Titan benchmarks ...... 108 9.5 Normalized Impact of Router Lookahead and Base Costs on Titan benchmarks ...... 110

x 9.6 Normalized Impact of Dynamic Router Bounding Boxes on VTR benchmarks (> 10K primitives) ...... 113 9.7 Normalized Impact of Congestion Mode Threshold on VTR benchmarks (> 10K)..... 113 9.8 Normalized Impact of Routing Predictor on VTR benchmarks (> 10K primitives) . . . . 118 9.9 Normalized Impact of Minimum Channel Width Hints on VTR benchmarks ...... 119 9.10 VPR Router Comparisons on Titan v1.1.0 Benchmarks ...... 120 9.11 AIR & Academic Router Comparison on Titan Benchmarks ...... 121 9.12 AIR & Quartus Comparison on Titan Benchmarks ...... 121 9.13 AIR Congestion Oblivious vs Congestion Free on Titan Benchmarks ...... 122

10.1 Normalized Impact of ABC Multiclock Flows on VTR benchmarks (> 10K primitives) . . 124 10.2 Normalized Impact of Setup & Hold STA Traversals While Routing Titan Benchmarks . . 125 10.3 Comparison of Classic & Tatum STA (Setup Only) on Titan Benchmarks ...... 125 10.4 Normalized Impact of Atom Netlist Data Structure Refactoring on VTR Benchmarks . . 126 10.5 Cumulative Impact of VPR Peak Memory Optimizations on neuron ...... 127 10.6 Normalized Impact of Advanced Compiler Optimizations on VPR 8.0 Run-Time (Titan Benchmarks) ...... 128 10.7 VTR Code Coverage ...... 129

11.1 Summary VTR Flagship Architecture Characteristics ...... 131 11.2 VTR 7 & VTR 8 Comparison Summary on the VTR Benchmarks ...... 132 11.3 VPR Quartus QoR Summary on Titan Benchmarks ...... 134 11.4 VTR 8.0: VTR Benchmarks (Relative to VTR 7) ...... 137 11.5 VPR 8.0: Titan v1.1.0 Benchmarks (Relative to VPR 7+) ...... 137 11.6 VPR 8.0 High Effort: Titan v1.3.0 (Relative to Quartus 18.0) ...... 138

12.1 VPR 8 Quality and Run-time targeting Stratix IV Model on Titan Benchmarks ...... 148 12.2 Symbiflow & VPR Quality and Run-time on Artix 7 (XC7A50TFGG484) ...... 148

13.1 Agent State and Action Spaces ...... 158

List of Figures

1.1 FPGA Size and Serial CPU Performance ...... 2

2.1 Basic Logic Element ...... 7 2.2 Logic Block ...... 7 2.3 Uniform FPGA ...... 8 2.4 Switch Block and Connection Block ...... 8 2.5 Heterogeneous FPGA ...... 9 2.6 FPGA CAD Flow ...... 11 2.7 Research FPGA CAD Flow ...... 12 2.8 Design Implementation CAD Flow ...... 13 2.9 Timing delays between two registers...... 14 2.10 Example circuit and timing graph with annotated delays. The path b → c → d → e is the critical timing path, with a delay of 3.0 units ...... 15

3.1 GPU calculation of level i arrival times...... 26 3.2 Multi-corner STA performance for openCV at P = 4. All results except ‘Naive’ perform a single set of traversals for all timing corners...... 34 3.3 Self-relative parallel speed-up on Tau’15...... 39 3.4 Tatum Performance Scaling over Classic VPR STA...... 41

4.1 Example circuit and timing graph with annotated delays...... 47 4.2 The path b → d → e is a false-path; the common sel means transitions on b can never propagate to e after the other paths have stabilized...... 48 4.3 AND gate ...... 49 4.4 ESTA models the glitch between t = 1 and t = 2 as a F transition with arrival time 1 + δ. 49 4.5 Input filtering example. The arrival time at c is 1 + δ, but STA or a naive ESTA implementation will report 2 + δ...... 53 4.6 Per-Output Mode: Path delay distributions calculated for each output...... 57 4.7 Max-Delay Mode: Bounding path delay distribution calculated over all output...... 57 4.8 Independent Condition Functions...... 57 4.9 Non-Independent Condition Functions...... 57

4.10 Example Truth Table and conditioning functions satisfying P (fR) = 0.25, P (fF ) = 0.125,

P (fH ) = 0.125 and P (fL) = 0.5. Note that the specific minterms assigned to each transition have no impact on the analysis results...... 58 4.11 Evaluation Flow...... 62

xii 4.12 MC max-delay run-time on apex2 for various convergence criteria...... 64 4.13 ESTA run-time/accuracy tradeoffs on the dsip benchmark in per-output mode...... 65 4.14 CDFs for the pksi 83 d output of the dsip benchmark for various conditioning func- tions. The ‘ESTA Uniform’ CDF corresponds to using the conditioning functions from Equation (4.5) for each primary input. Each ‘ESTA Non-Uniform’ CDF correspond to a different set of randomly generated non-uniform conditioning functions used at the primary inputs. ESTA was run with d = 1 and m = ∞...... 66 4.15 Maximum delay CDFs on the spla benchmark...... 67

6.1 VTR-based CAD Flow Variants ...... 79 6.2 Routing Resource Graph. Dotted arrows represent switch-block switches, while dashed arrows represent connection-block switches...... 80

7.1 General Device Grid Examples ...... 83 7.2 Blocks and Routing Channels ...... 84 7.3 Switch-block connectivity within and around a unit and 2x3 sized blocks...... 85 7.4 Circuit Structures which can be modelled with non-configurable switches...... 86 7.5 Dual-Port RAM Block Internal Timing Path Example ...... 87 7.6 Placement of the leon2 benchmark with the updated grid layout and block type annotations. 88 7.7 Three sided architecture with each pin able to access one horizontal track, and one vertical track...... 89 7.8 Direct connections between adjacent blocks shown in purple, including horizontal ‘direct- link’ connections (between all blocks) and vertical carry-chain connections (between LABs). Block colors are the same as Figure 7.6...... 90

8.1 Packing example with ζ =4...... 94 8.2 Unrelated Clustering Example ...... 96 8.3 des90 high packing density (unroutable) ...... 97 8.4 des90 moderate packing density (routable) ...... 97 8.5 Macro swap examples...... 100 8.6 Region Limit Examples ...... 101

9.1 Route tree pruning example...... 105 9.2 Routing resources explored (colored by cost to reach the current target sink) while routing a high-fanout net (∼ 3500 terminals) on the neuron benchmark circuit targeted to the Stratix IV model from Section 7.2. The target sink is located near the centre of the device.108 9.3 Stratix IV-like hierarchical routing architecture with short (L4) and long (L16) wires. Long wires are only accessible via short wires, and do not directly connect to block pins...... 110 9.4 Lookahead Delay Estimates ...... 110 9.5 Lookahead Congestion Estimates ...... 111 9.6 Comparison of routing congestion by resource type and router run-time for different base costs on the neuron benchmark circuit with the map lookahead...... 112 9.7 Dynamic bounding box example...... 113 9.8 QoR Impact of multiple routing convergence on the VTR Benchmarks (> 10K primitives) 114

xiii 9.9 Non-configurable routing example. Assumes unit congestion cost and delay for wires, unit delay for configurable switches, and zero delay for non-configurable switches...... 117 9.10 Routing Congestion Trend on LU32PEEng (targeting VTR’s flagship k6 frac N10 frac chain mem32K 40nm architecture) at routable and unroutable channel widths...... 118

12.1 Left, an FPGA consisting of Configurable Logic Blocks (CLBs) for computation and Block RAMs (BRAMs) for storage, along with programmable routing to interconnect them. Right, each CLB contains Look-Up Tables (LUTs) and Flip-Flops (FFs). Routing Muxes are configured to interconnect elements within each block, or inter-block routing wires. . 142 12.2 Design flow...... 143 12.3 FPGA Architecture Representation...... 144 12.4 Architecture and Implementation Flows...... 145 12.5 Architecture Model Generation...... 146

List of Algorithms

1 Baseline Serial STA Algorithm ...... 24 2 CPU Algorithm A: Levelized Locks ...... 29 3 Levelized walk with no locks ...... 29 4 Dynamic Walk ...... 31 5 Multi-Corner Node Traversal ...... 33 6 Multi-traversal Multi-clock Timing Analysis ...... 35 7 Single-traversal Multi-clock Timing Analysis ...... 36 8 Basic ESTA Node Traversal ...... 52 9 Enhanced ESTA Node Traversal ...... 55 10 Grouped Minterm Allocation ...... 59 11 VPR 8 Packing Algorithm ...... 93 12 AIR Netlist Routing ...... 104 13 AIR Connection Routing ...... 104 14 Map Lookahead Creation ...... 109 15 AIR Path Finding ...... 115 16 AIR Node Expansion ...... 116 17 Simulated Annealing Placement ...... 153

Preface

This thesis is based in part on the following peer-reviewed works published with co-authors:

Refereed Journals

• K. E. Murray, T. Ansell, K. Rothman, A. Comodi, M. A. Elgammal and V. Betz, “Symbiflow & VPR: An Open-Source Design Flow for Commercial and Novel FPGAs”, IEEE Micro, 2020. [1]

• K. E. Murray, O. Petelin, S. Zhong, J. M. Wang, M. ElDafrawy, J.-P. Legault, E. Sha, A. G. Graham, J. Wu, M. J. P. Walker, H. Zeng, P. Patros, J. Luu, K. B. Kent and V. Betz, “VTR 8: High Performance CAD and Customizable FPGA Architecture Modelling”, ACM Trans. Reconfig. Technol. Syst. (TRETS), 2020. [2]

• K. E. Murray, A. Suardi, V. Betz, G. Constantinides, “Calculated Risks: Quantifying Timing Error Probability with Extended Static Timing Analysis”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2018. [3]

Refereed Conference/Workshop Proceedings

• K. E. Murray, S. Zhong and V. Betz, “AIR: A Fast but Lazy Timing-Driven FPGA Router”, IEEE/ACM Asia Pacific Design Automation Conference (ASP-DAC), 2020, 1-7. [4]

• K. E. Murray and V. Betz, “Adaptive FPGA Placement Optimization via Reinforcement Learning”, ACM/IEEE Workshop on Machine Learning for CAD (MLCAD), 2019, 1-6. [5]

• K. E. Murray and V. Betz, “Tatum: Parallel Timing Analysis for Faster Design Cycles and Improved Optimization”, IEEE Int. Conf. on Field-Programmable Technology (FPT), 2018, 1-8. [6]

• K. E. Murray, A. Suardi, V. Betz, G. Constantinides, “Quantifying Error: Extending Static Timing Analysis with Probabilistic Transitions”, Design, Automation & Test in Europe (DATE), 2017, pp. 1486-1491. [7]

Chapter 1

Introduction

1.1 Motivation

Field Programmable Gate Arrays (FPGAs) offer a compelling platform for designing and implementing digital systems, offering the potential for higher performance and/or power efficiency than other programmable platforms like CPUs and GPUs [8]. While not as efficient as Application Specific Integrated Circuits (ASICs) [9], FPGAs are re-programmable and can be re-configured to implement different applications in seconds. This allows them to be re-targeted to different applications and, compared to an ASIC (which may take years to design, manufacture and verify), quickly adjust to changing application requirements.

However, one of the major barriers to using FPGAs is the challenge of implementing an application on an FPGA. While this process is typically faster than for an ASIC, it is still time consuming. A large component of this time is spent within the automated FPGA Computer Aided Design (CAD) flow which figures out how to map a designer’s high-level intent to the actual FPGA hardware. This CAD flow must make many detailed optimization decisions to produce a good implementation, and can take hours to days to complete for large designs [10]. Since multiple design iterations are typically required to converge to a working implementation, long CAD flow run-times significantly degrade designer productivity. Furthermore, as shown in Figure 1.1, FPGA device sizes continue to increase (30× more logic elements over the past decade [11, 12]), while serial CPU performance growth has slowed (only increasing 3.5× over the same period [13, 14]) – further challenging the CAD flow’s scalability. It is therefore important to improve the efficiency, scalability and robustness of the FPGA CAD flow.

Figure 1.1: FPGA Size and Serial CPU Performance (SPECInt).

FPGA architects face a wide range of options when designing a new FPGA architecture, and many of the trade-offs they face evolve over time. New application domains emerge, such as machine learning (e.g. machine translation, automated question answering, self-driving cars), new and faster communication standards (e.g. 400G Ethernet, 5G wireless), embedded FPGAs within larger SoCs [15, 16, 17], and unique customized FPGA-like devices [18, 19]. Additionally, the underlying manufacturing process technology used to build FPGAs also changes. For instance, early FPGAs were built on process nodes with large features where transistor delay dominated, but state-of-the-art FPGAs aggressively pursue leading edge processes where features are small and interconnect delay dominates (due to wire resistance) [20, 21]. These trends have also led FPGAs to become more complex, changing from relatively uniform architectures (consisting primarily of Look-Up Tables (LUTs) and wires) to become increasingly heterogeneous (with various types of I/O, computational elements like adders/multipliers, various types of RAM, numerous global clock networks with complex controls, many types of wires and new communication structures such as Networks-on-Chip (NoCs) [22, 23]) in order to deal with diverse applications and the realities of huge transistor budgets and high wire resistance. It is therefore important for FPGA architectures to adapt to these new requirements, constraints and changing trade-offs.

To fairly compare and evaluate these different ideas requires a flexible CAD flow which can target each proposed architecture. To ensure the resulting architectural conclusions are valid, it is important for the CAD flow to support modeling all important details, and produce high quality results. Traditionally, the tools used by academic researchers have made various simplifying assumptions which do not necessarily hold true in commercial FPGAs. To ensure the validity of their conclusions it is therefore important to ensure the tools available to researchers capture the important architectural details.

Since the CAD flow is such a critical component for the use of FPGAs, commercial vendors have traditionally invested heavily in their closed-source commercial tools. However the tool flows produced are not always ideally suited to the desires of FPGA designers, and their closed-source nature limits their flexibility to what the vendor chooses to support. Furthermore, the CAD flow is a significant barrier to smaller commercial FPGA vendors or other groups wishing to produce their own customized FPGA-like devices. Open-source FPGA design flows capable of targeting both commercial and customized FPGAs have the potential to alleviate these issues, allowing users to customize the CAD flow to their wishes and allowing the broader commercial and research communities to pool their resources and collaboratively develop improved tools and flows.

This thesis aims to address several of these issues. Faster CAD with higher result quality, combined with less pessimistic timing analysis, improves designer productivity, making it tractable to tackle larger design sizes. More flexible CAD allows the investigation of new FPGA architectures, including those with advanced and complex features. More detailed CAD bridges the gap between architecture research (idea) and a production CAD flow (implementation), allowing not only the prototyping of a new FPGA, but the ability to fully describe, fine-tune, and ultimately program it.

1.2 Contributions

The main contributions of this thesis are as follows:

• The investigation of parallel algorithms for Static Timing Analysis on GPUs and CPUs (Chapter 3), the result of which is the open-source Tatum STA library. Using our best algorithms, Tatum achieves a 7× larger parallel speed-up than a recent parallel ASIC timing analyzer, and runs 15.2× faster (using parallelism) than VPR 7’s (serial) STA engine. Tatum’s speed and flexibility have led to its adoption as the STA engine for VTR 8 [2] and CGRA-ME [24]. We also show how faster STA enables improved optimizations reducing critical path delay by 4%.

• The development of Extended STA (ESTA), a more accurate form of timing analysis which considers the probability of signal transitions (Chapter 4). ESTA is 75% to 37% less pessimistic than standard STA, while running 569× to 16× faster than Monte-Carlo timing simulation, while delivering stronger correctness guarantees. ESTA gives designers new insights into the timing behaviour of their designs which enable new optimizations [25].

• Significant extensions to the FPGA architecture modelling capabilities of the Verilog to Routing (VTR) project [26, 27, 2], which enable architects to explore new architectures and capture many of the details necessary to model, and even program, commercial devices (Chapters 6 and 7). As a result, VTR now supports general device grids which do not restrict blocks to be in columns or IOs to be around the perimeter. Block types of any width and height are supported at any location within the device grid. The detailed routing architecture description is now more flexible, allowing architects detailed control of the switch-block and connection-block patterns. VTR also now supports loading an arbitrary externally defined routing architecture. The routing model has been extended to efficiently support arbitrary shaped wires (e.g. ‘L’ and ‘T’) and complex structures like clock networks using non-configurable edges. VTR’s area and delay modelling have also been improved to produce more accurate results, and support key features like minimum delay timing for hold analysis and primitive internal timing paths. These capabilities are illustrated with the creation of an improved capture of Intel’s Stratix IV FPGA architecture which reduces critical path delay by 8%, area by 11%, and route time by 28% compared to what is achievable with VTR 7.

• Techniques to improve the quality of FPGA packing (Chapter 8) by producing more natural and less interconnected clusters, which improve routability (from 13 to 23 routable large Titan benchmarks), wirelength (-23%), critical path delay (-5%) and overall run-time (-30%). We also propose methods to improve FPGA placement quality (Chapter 8) through better optimization of structures like carry-chains and a better range limit formulation which improves the placement of sparse blocks; together these improve wirelength (-8%) and critical path delay (-2%) at equivalent place-time.

• A new FPGA routing algorithm called AIR (Chapter 9), which avoids unnecessary work and adapts to the targeted FPGA architecture, improving both quality (-15% wirelength, -19% critical path delay) and run-time (6.2× faster). We also present enhancements to reduce minimum channel width search time (2 − 2.7× faster), and develop a new approach to incrementally improve an existing legal routing (reducing critical path delay at minimum channel width by 2%). Additionally, we introduce extensions to support routing with non-configurable edges (which are used to model low flexibility routing structures like clock networks).

• A number of additional engineering improvements to VTR (Chapter 10), which improve other parts of the flow (like logic synthesis) as well as its overall usability, robustness and flexibility.

• A detailed evaluation of the above VTR enhancements in the context of the entire design flow (Chapter 11). These results show the combined VTR 8 design flow significantly improves quality (-15% minimum routable channel width, -41% wirelength, -12% critical path delay), run-time (5.2× faster) and memory usage (3.3× lower) compared to VTR 7 on the VTR benchmarks. Compared to Intel’s commercial Quartus CAD system, the quality gap on the large Titan benchmarks is narrowed (compared to VPR 7) from 2.0× to 1.26× for wirelength, and from 1.53× to 1.26× for critical path delay, at comparable run-time and memory usage.

• An open-source design flow (built in collaboration with others) which leverages the above contributions and can target both commercial (e.g. Xilinx Artix 7) and novel FPGA architectures (Chapter 12). This process also shows that how an FPGA architecture is modelled has a significant impact on the quality of results achieved. Compared to Xilinx’s Vivado CAD system, this design flow completes approximately 2× faster, but produces circuits which are 2.3× slower.

• A new approach to developing adaptive heuristics for use in CAD tools (Chapter 13). This technique uses Reinforcement Learning (RL) to learn heuristics and adapt to the specific optimization problem being considered without requiring a human in the loop. We demonstrate how this technique can be used to improve the run-time/quality trade-off achieved by an FPGA placement algorithm, allowing it to achieve the same quality in 20 to 50% less run-time.

1.3 Organization

Initial background about FPGA architecture, CAD and timing analysis is presented in Chapter 2. The remainder of the thesis is structured in four parts.

Part I focuses on techniques for analyzing the timing behaviour of digital circuits. Chapter 3 investigates parallel algorithms for performing STA, and presents Tatum, a fast parallel timing analysis library. Chapter 4 describes our new ESTA method for analyzing the timing behaviour of a digital circuit while accounting for input transition probabilities.

Part II describes a number of contributions relating to FPGA physical design algorithms and FPGA architecture modelling which fall under the VTR project. Chapters 5 and 6 describe VTR and the VTR design flow, along with enhancements which improve the flow’s flexibility. Chapter 7 describes extensions to VTR’s FPGA architecture modelling capabilities which allow the capture of many of the realities of modern FPGA architectures. These features are then used to produce a more detailed capture of the Intel Stratix IV FPGA architecture. Chapter 8 focuses on methods to produce more natural clusters of netlist primitives and improve placement quality. Chapter 9 describes AIR, a highly efficient timing-driven FPGA routing algorithm. A number of other engineering improvements to VTR are outlined in Chapter 10, and Chapter 11 presents a complete evaluation of the VTR tools and overall design flow.

Part III outlines how many of the enhancements from Part II can be leveraged to create a fully open-source design flow (described in Chapter 12), which can target and program both commercial and novel FPGAs.

Part IV investigates how Reinforcement Learning (RL) can be used to improve the quality and run-time of CAD algorithms (such as placement in Chapter 13) by making them more adaptive.

Finally, Chapter 14 concludes and presents directions for future work.

Chapter 2

Background

In this chapter we present an overview of necessary background and work related to the overall thesis.1 More detailed background and related work is presented where relevant in the subsequent chapters.

2.1 Field Programmable Gate Arrays

Field-Programmable Gate Arrays (FPGAs) offer many benefits as a computation platform. They provide many of the benefits of dedicated hardware, such as high performance application-customized datapaths and low power consumption [8]. However, they are also re-programmable and require significantly reduced design time and effort compared to Full Custom or Standard Cell based ASICs [29]. FPGAs have been used successfully to accelerate a wide range of applications such as molecular dynamics [30], biophotonics simulation [31], web search [32], option pricing [33], solving systems of linear equations [34], machine learning [35] and many others. The programmable nature of FPGAs however, comes at a cost. FPGAs require 21-40× more silicon area, 9-12× more dynamic power, and operate 2.8-4.5× slower than Application Specific Integrated Circuits (ASICs) [9]. These present a unique set of trade-offs compared to ASICs, Graphics Processing Units (GPUs) and Microprocessors, which have enabled FPGAs to find use in a wide range of applications.

2.1.1 FPGA Architecture

FPGAs typically contain K-input Look-up Tables (LUTs) and Flip-Flops (FFs) interconnected by pre-fabricated programmable routing. These are used to implement ‘soft logic’. Typically a LUT and FF are grouped together into a Basic Logic Element (BLE) (Figure 2.1), where the output of the LUT is optionally registered. To improve area efficiency and performance, the BLEs are usually grouped together into a Logic Block (LB) (Figure 2.2) [36, 37, 38]. An FPGA typically consists of columns of LBs, with programmable inter-block routing used to interconnect the LBs as shown in Figure 2.3. The inter-block routing consists of Connection Blocks (CBs) where adjacent LB input and output pins connect to the FPGA routing, and Switch Blocks (SBs) where routing wires interconnect (Figure 2.4) [38]. While ‘soft logic’ can be used to implement nearly any type of digital circuit, it is often more efficient to ‘harden’ certain commonly used functions into fixed-function hardware on the device. This trades off flexibility for efficiency.

1Parts of this chapter are derived from [28].



Figure 2.1: A conventional academic Basic Logic Element (BLE).
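To make the BLE of Figure 2.1 concrete, the following minimal behavioural sketch (not taken from the thesis; a 6-input LUT is assumed) models the LUT as a 64-entry truth table whose output is optionally registered by the flip-flop:

```cpp
// Minimal behavioural model of a BLE: a 6-LUT whose output is optionally registered.
// Illustrative only; a real FPGA implements this with SRAM configuration cells and a mux.
#include <bitset>
#include <cstdio>

struct BLE {
    std::bitset<64> lut_config;  // one configuration bit per 6-input combination
    bool use_ff = false;         // configuration bit: registered or combinational output
    bool ff_state = false;       // current flip-flop contents

    // Combinational LUT output: index the truth table with the 6 input bits.
    bool lut_out(unsigned inputs) const { return lut_config[inputs & 0x3F]; }

    // Output driven onto the routing: the LUT output or the registered value.
    bool output(unsigned inputs) const { return use_ff ? ff_state : lut_out(inputs); }

    // On a clock edge the FF captures the LUT output.
    void clock_edge(unsigned inputs) { ff_state = lut_out(inputs); }
};

int main() {
    BLE ble;
    ble.lut_config[0x3F] = true;  // configure the LUT as a 6-input AND gate
    std::printf("AND(111111)=%d AND(111110)=%d\n",
                (int)ble.output(0x3F), (int)ble.output(0x3E));
    return 0;
}
```

Any 6-input Boolean function can be realized simply by loading a different 64-bit truth table, which is what makes the soft logic fully programmable.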

Figure 2.2: A simple Logic Block (LB). ‘x’ indicates a programmable switch.


Figure 2.3: A simple homogeneous FPGA with vertical and horizontal routing channels (containing wires) between logic blocks.

Figure 2.4: A LB and associated CB and SB. Arrows terminating at the LB indicate block inputs from the general routing, arrows leaving the LB indicate block outputs driving the general routing. Horizontal and vertical routing wires are respectively shown below and to the right of the LB. ‘x’ indicates a programmable CB switch. The configurable right-going connections from the horizontal channel are shown with dotted lines.


Figure 2.5: A simple heterogeneous FPGA.

Typical examples of ‘hard’ blocks in modern FPGAs include Digital Signal Processing (DSP) blocks (with embedded hard multipliers) and Random Access Memory (RAM) blocks (Figure 2.5). This variety of block types makes modern FPGAs heterogeneous, with a variety of different resource types. Additionally, smaller ‘hard’ components can also be added to LBs to perform common operations, such as hard-adders for arithmetic operations [39].

2.1.2 CAD for FPGAs

In order to program an FPGA to implement a specific application, the designer’s high-level intent must be translated into a low level bitstream which sets the millions of individual configuration switches in the FPGA. This translation process constitutes the ‘Computer Aided Design (CAD) Flow’. Since the CAD flow takes only an abstract high-level description, but produces a detailed low level implementation, it must make numerous choices to implement the system. These choices have significant impacts on key performance metrics such as power, area and operating frequency. It is therefore key that the CAD flow makes good choices to optimize the final implementation. An example FPGA CAD flow is illustrated in Figure 2.6, and discussed below.2 [41]

High-Level Synthesis High-Level Synthesis (HLS) is a relatively recent addition to FPGA CAD flows, which aims to improve designer productivity by further increasing their level of abstraction. This is typically accomplished by allowing designers to describe their systems algorithmically, using conventional programming languages such as C or OpenCL [42, 33, 43], rather than a close-to-the-metal, cycle-by-cycle behavioural description in a Hardware Description Language (HDL) (e.g. Verilog, VHDL). Given an algorithmic description of a system, HLS selects an appropriate hardware architecture to implement the algorithm and chooses which computations occur in each clock cycle. The output of HLS is normally an HDL description of the design, in a language such as Verilog.

2It should be noted that while discrete steps in the CAD flow are described here, many modern flows blur the lines between these different stages — for example by re-optimizing the design logic after placement [40]. Confusingly, this is sometimes referred to as ‘Physical Synthesis’ in the literature. Here we take Physical Synthesis to be an encompassing term for the physically aware stages of the CAD flow (i.e. packing, placement and routing), in contrast with Logical Synthesis which encompasses the non-physically aware stages.

Elaboration Elaboration converts the behavioural description of the hardware (either provided by the designer, or generated by HLS) into a logical hardware description (i.e. set of logic operations and signals). The input to elaboration is generally an HDL, such as Verilog, and its output is a netlist of operations.

Logic Optimization Technology independent logic optimization is then performed, which involves removing redundant portions of the hardware and re-structuring the logic to improve the quality (area, speed, power) of the resulting hardware.

Technology Mapping Once logic optimization is completed, the system is then mapped to (i.e. implemented with) the primitive components found in the FPGA architecture (LUTs, FFs, multipliers etc.) to create a netlist of primitives (low level hardware components) that exist on the targeted FPGA.

Clustering Clustering (also referred to as Packing) groups together device primitives into the larger blocks (e.g. LBs, RAM blocks, DSP blocks) of the target FPGA architecture. This step is usually not found in non-FPGA CAD flows. It is typically used to enforce the strict legality constraints facing FPGAs (since all resources are pre-fabricated), and also helps to reduce the number of placeable objects. The output of clustering is a netlist of these larger, ‘clustered’ blocks.

Placement Placement decides the location for each placeable block on the target device. This makes it one of the key steps in the physical design implementation flow, since it largely determines wirelength and best-case delay. It also affects routability and power consumption.3 The output of placement is a legal location for each block in the (usually clustered) netlist.

Routing Given the locations of the various blocks determined by placement, routing determines how to interconnect the various pins in the netlist using the pre-fabricated routing wires and programmable switches on the FPGA. The output of a routing algorithm is a tree of routing wires and switches for every signal in the netlist, where each tree ensures that all the signal destinations are connected to the source, and no routing resources are overused (i.e. no two different signals are shorted together).

Analysis With the design fully implemented, it is passed through detailed analysis tools to evaluate the result. This can include confirming circuit functionality via timing analysis and performing detailed power analysis. These tools output reports that (for example) detail if the design meets timing or not, its maximum operating frequency and/or timing slack, and its expected power dissipation.

3In some CAD flows placement can also change the clustering (move primitives), while in others the placement algorithm places only the larger blocks that were formed by a prior clustering step; see Section 2.4.3 for more details.


Figure 2.6: An example FPGA CAD flow.

Bitstream Generation After routing there is finally sufficient information to determine how to set all block configuration bits and programmable interconnect switches on the FPGA to implement the designer’s original specification. Bitstream generation converts all this information into a programming file used to configure the FPGA.

2.2 CAD Flows

Two major thrusts in FPGA research are building improved FPGA architectures (Section 2.1.1) and improving FPGA CAD tools (Section 2.1.2). Both of these are typically evaluated empirically, since closed form analytical solutions are rarely applicable. A typical research CAD flow is shown in Figure 2.7. The VTR project [2] is a popular open-source example of this type of CAD flow. In a research CAD flow a set of benchmark circuits are mapped onto candidate FPGA architectures, and the results analyzed. In typical usage for FPGA architecture research, the CAD flow and benchmarks are kept constant while the target FPGA architectures are varied. Conversely for CAD tool research the benchmarks and target architectures are kept constant while the CAD flow is varied.4

2.2.1 Impact of CAD Flow on Productivity

The typical process for a designer implementing an application targeting an FPGA is shown in Figure 2.8. A designer describes his/her design using an HDL and then passes it off to the automated CAD flow for synthesis and analysis. After analysis it is determined whether the design has met its constraints (e.g. timing, power and area).

4In reality this distinction is not so clear cut, as there is an interdependence between the CAD flow and FPGA architectures. For example, if a CAD flow fails to take full advantage of an FPGA’s architectural features, or optimizes poorly, the conclusions about the architecture would not be accurate.


Figure 2.7: CAD and architecture evaluation process.

If the constraints are not satisfied then the designer must go back and modify their design and re-run the design flow. Since this iterative process is repeated numerous times during development, it is important that each iteration occur quickly; however, this is rarely the case. Firstly, the synthesis and analysis design flow, while automated, is large and complex, requiring significant computing time — on the order of hours to days for large designs [10]. Secondly, manually modifying the design to address the constraint violations may not be easy. It can involve significant design modifications, which in turn require design re-verification to ensure correctness is maintained. On large designs this may involve changes across multiple design components owned by other individuals or teams — making design modification a time-consuming process.5 As a result, there are two primary approaches to improving designer productivity:

1. Reducing the required number of design iterations, and
2. Reducing the required time for each design iteration.

The first can be addressed through improving the Quality of Result (QoR) produced by the CAD flow. If the CAD flow produces higher quality results, the designer will not need to make costly and time-consuming changes to their design to achieve their performance targets. The second can be addressed by improving the efficiency (run-time, memory usage) of the CAD flow, for instance by improving the underlying algorithms and tools.

5It should be noted that FPGA designers have less flexibility than ASIC designers to address issues during the physical stages of the CAD flow. To resolve timing issues, ASIC designers have multiple adjustments they can make, such as inserting buffers on long nets, adjusting transistor threshold voltages and adjusting transistor sizing. Most of these techniques cannot be applied on FPGAs due to their prefabricated nature. As a result, FPGA designers are often forced to address design issues by making RTL changes.


Figure 2.8: FPGA design implementation process.

2.3 Timing Analysis

An important consideration for both designers and optimization tools is the timing behaviour of a particular circuit implementation. Most digital circuits operate synchronously, where the state holding elements of a circuit (e.g. D Flip-Flops) are updated together (synchronously) relative to a shared control signal (i.e. a clock). To avoid metastability issues,6 and ensure elements update to the correct values, each state element must satisfy a number of timing constraints. Two of the most significant constraints are the setup and hold constraints. Setup constraints ensure that signals arrive at registers a sufficient amount of time before the capturing clock edge. During each cycle of operation every connection terminating at a register must satisfy:

$$t_{cq}^{(max)} + t_{pd}^{(max)} \le T_{clk} - \tau_{su} \tag{2.1}$$

where $t_{cq}^{(max)}$ is the maximum clock-to-q delay of the launching register, $t_{pd}^{(max)}$ is the longest propagation delay between the launch and capture registers, $\tau_{su}$ is the setup time of the capture register and $T_{clk}$ is the desired clock period. Long (slow) paths typically cause setup violations. Setup violations can be alleviated by increasing the clock period (giving more time for the signal to arrive), although this decreases the operating frequency of the circuit.

Hold constraints ensure signals that have arrived at registers remain stable for a sufficient amount of time after the capturing clock edge. During each cycle of operation every connection terminating at a register must satisfy:

$$t_{cq}^{(min)} + t_{pd}^{(min)} \ge \tau_{hld} \tag{2.2}$$

6Where the stored value is indeterminate, not clearly a logic ‘1’ or ‘0’ [44].


Figure 2.9: Timing delays between two registers.

where $t_{cq}^{(min)}$ is the minimum clock-to-q delay of the upstream register, $t_{pd}^{(min)}$ is the shortest propagation delay between the upstream and current register, and $\tau_{hld}$ is the required hold time of the capture register. Short (fast) paths typically cause hold violations. Unlike setup violations, hold violations cannot be fixed by changing the clock frequency. More generally, the delay from clock source to each launch and capture register may not be identical, as shown in Figure 2.9, leading to the constraints:

$$\underbrace{t_{clk}^{(launch)} + t_{cq}^{(max)} + t_{pd}^{(max)}}_{t_{arr}^{(max)}} \le \underbrace{t_{clk}^{(capture)} + T_c - \tau_{su}}_{t_{req,c}^{(setup)}} \tag{2.3}$$
$$\underbrace{t_{clk}^{(launch)} + t_{cq}^{(min)} + t_{pd}^{(min)}}_{t_{arr}^{(min)}} \ge \underbrace{t_{clk}^{(capture)} + \tau_{hld}}_{t_{req}^{(hold)}}$$

where $t_{clk}^{(launch)}$ and $t_{clk}^{(capture)}$ are the clock insertion delays to the launch and capture registers respectively,7 and $T_c$ is the timing constraint. It is helpful to define the arrival time and required time,8 which correspond respectively to the left and right hand sides of Equation (2.3):

Definition 1 (Arrival Time ($t_{arr}$)) The time at which a data signal stabilizes (arrives) at the capture register input.

Definition 2 (Required Time ($t_{req}$)) The time by which a data signal must stabilize at the capture register input to avoid metastability and ensure the correct data value is captured/latched.

We can then restate the setup and hold constraints as:

$$t_{req}^{(hold)} \le t_{arr}^{(min)} \le t_{arr}^{(max)} \le t_{req,c}^{(setup)} \tag{2.4}$$

Hence, the arrival times must fall within a timing window bounded by the required setup and hold times.
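As a concrete (purely hypothetical) numeric check of these constraints, suppose $t_{cq}^{(max)} = 0.5$ ns, $t_{pd}^{(max)} = 3.0$ ns, $\tau_{su} = 0.2$ ns and $T_{clk} = 4.0$ ns. Equation (2.1) requires $0.5 + 3.0 = 3.5 \le 4.0 - 0.2 = 3.8$, so the setup constraint is met with 0.3 ns to spare. If additionally $t_{cq}^{(min)} = 0.3$ ns, $t_{pd}^{(min)} = 0.1$ ns and $\tau_{hld} = 0.1$ ns, Equation (2.2) requires $0.3 + 0.1 = 0.4 \ge 0.1$, so the hold constraint is also met and the arrival times fall inside the window of Equation (2.4).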

7In general $T_c$ is the smallest time interval between active clock edges of the launch and capture clocks, which reduces to the clock period ($T_{clk}$) in the single-clock case.
8Sometimes referred to as the required arrival time.

Figure 2.10: Example circuit and timing graph with annotated delays. The path b → c → d → e is the critical timing path, with a delay of 3.0 units.

For optimization purposes it is also useful to define the concept of slack:

Definition 3 (Slack) The difference between required (arrival) and arrival (required) times of a setup (hold) constraint.

This is formally specified as:

$$slack_{setup,c} = t_{req,c}^{(setup)} - t_{arr}^{(max)} \tag{2.5}$$
$$slack_{hold} = t_{arr}^{(min)} - t_{req}^{(hold)}$$

where positive slack indicates the timing constraint is met, and negative slack indicates the constraint is violated. Slacks are useful to both designers and optimization tools since they provide insight into how close or far a constraint is from violation.

The above discussion focused on the timing constraints which apply between a single connected pair of registers. To analyze the timing behaviour of an entire circuit we need a way to represent all registers/primary inputs/primary outputs and logic in the design to analyze them together. A common approach is to store the topological structure of the circuit as a timing graph:

Definition 4 (Timing Graph) A directed acyclic graph9 where nodes represent the pins of circuit elements, edges represent timing dependencies and edge weights correspond to delays.

The circuit’s primary inputs and state element outputs (e.g. Flip-Flop Q pins) become timing sources: nodes with no fan-in. Conversely, the primary outputs and state element inputs (e.g. Flip-Flop D pins) become timing endpoints: nodes with no fan-out. An example circuit and timing graph are shown in Figure 2.10. The timing sources are inputs a and b, and the output e is the single timing endpoint. We can now define a timing path:

Definition 5 (Timing Path) A path in the timing graph between a timing source and timing endpoint.

This allows us to define the timing verification problem:

Definition 6 (Timing Verification Problem) Determine whether timing constraints are satisfied at all timing endpoints.

9If the timing graph contains cycles it can be transformed into an acyclic graph using standard techniques [45].

That is, verify whether the constraints in Equation (2.3) are satisfied for all timing endpoints. Typically designers and optimization tools desire more information than a yes/no answer about whether their design’s timing constraints are satisfied, so we usually consider the more general timing analysis problem:

Definition 7 (Timing Analysis Problem) Determine the slacks at all nodes in the timing graph.

Having access to slack information (Definition 3) at each node allows designers and optimization tools to focus their optimization efforts on parts of their design which are failing timing constraints.
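To make the timing analysis problem concrete, the following minimal sketch (not the thesis’ Tatum implementation; the edge set, unit delays and a single 3.0-unit setup constraint are assumed to loosely match Figure 2.10) computes arrival times, required times and slacks with one forward and one backward traversal of a small timing graph in topological order:

```cpp
// Minimal block-based STA sketch: forward (arrival) and backward (required) traversals
// over a small DAG, then slack = required - arrival at each node (Equation 2.5).
// Nodes 0..4 correspond to a, b, c, d, e; edges and unit delays are illustrative.
#include <algorithm>
#include <cstdio>
#include <limits>
#include <vector>

struct Edge { int from, to; float delay; };

int main() {
    const int num_nodes = 5;
    // Edges listed in topological order (a->d, b->c, c->d, c->e, d->e), each with 1.0 delay.
    std::vector<Edge> edges = {{0,3,1.0f}, {1,2,1.0f}, {2,3,1.0f}, {2,4,1.0f}, {3,4,1.0f}};

    // Forward traversal: arrival time is the max over fan-in of (arrival + edge delay).
    std::vector<float> arr(num_nodes, 0.0f);  // timing sources start at t = 0
    for (const Edge& e : edges)
        arr[e.to] = std::max(arr[e.to], arr[e.from] + e.delay);

    // Backward traversal: required time is the min over fan-out of (required - edge delay).
    const float constraint = 3.0f;  // assumed setup constraint at the endpoint e
    std::vector<float> req(num_nodes, std::numeric_limits<float>::infinity());
    req[4] = constraint;
    for (auto it = edges.rbegin(); it != edges.rend(); ++it)
        req[it->from] = std::min(req[it->from], req[it->to] - it->delay);

    for (int v = 0; v < num_nodes; ++v)
        std::printf("node %d: arrival=%.1f required=%.1f slack=%.1f\n",
                    v, arr[v], req[v], req[v] - arr[v]);
    return 0;
}
```

With these assumed values the nodes along b → c → d → e all end up with zero slack (they form the critical path), while input a has 1.0 units of positive slack; this per-node slack is exactly the information the timing analysis problem (Definition 7) asks for.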

2.4 FPGA Packing & Placement

FPGA placement involves determining the position of each netlist element within the FPGA device grid. There are several key considerations, including the optimization objectives, the optimization algorithms used, and how detailed legality constraints are handled.

2.4.1 Optimization Objectives

The classic objective to be optimized during FPGA placement is wirelength, the total amount of wiring needed to interconnect a particular circuit placement. Since the true wirelength is unknown until routing is complete, the approximate Half-Perimeter Wirelength (HPWL) objective is often used instead, as it is reasonably accurate and fast to calculate. For instance, VPR [46] uses a corrected HPWL cost to estimate the wirelength of a particular net:

    HPWL(k) = q(fanout(k)) · (bbx(k) + bby(k))   (2.6)

where the HPWL of net k is determined by the bounding box of its terminals in the x (bbx(k)) and y (bby(k)) directions, with q(·) a fanout dependent correction factor which avoids underestimating the wirelength for high-fanout nets. Reducing wirelength is important since excessive demand for wiring may make it impossible for the router to find a feasible routing solution10, or drastically increase router run-time. Reducing wirelength is also correlated with reductions in power usage and can decrease connection delays (due to shorter connections).

Another important objective is timing. Since interconnect delays account for the majority of delay within FPGAs, and the length of these connections is largely determined by placement, timing is an important consideration during placement. Timing optimization is often achieved by generating delay estimates for each connection (based on the current placement) and performing timing analysis to determine the slack of each connection. This slack information is then converted into a timing cost which is optimized during placement. For instance, VPR [47, 48] calculates connection setup slack11 between the combinationally connected pins i and j for constraint c as:

    slack_setup,c(i, j) = t_req,c^(setup)(i) − t_arr^(max)(j) − delay(i, j)   (2.7)

10 This is particularly relevant for FPGAs which have limited amounts of pre-fabricated routing.
11 Most timing-driven placement algorithms focus on optimizing setup timing only.

and then converts connection slacks into a 'criticality'. For a particular timing constraint c this is typically specified as:

    crit_c(i, j) = 1 − slack_setup,c(i, j) / max_{∀s ∈ sinks} { t_req,c^(setup)(s) }   (2.8)

where t_req,c^(setup)(s) is the required setup time at node s for the constraint c. The resulting connection criticality ranges between zero and one,12 with values equal to one indicating a connection on the critical path. In the presence of multiple clock domains there are multiple constraints (i.e. for each pair of interacting clock domains), and the criticality is calculated as the maximum over all constraints:

    crit(i, j) = max_{∀c ∈ constrs} { crit_c(i, j) }   (2.9)

The placer then calculates the timing cost of a connection as the product of its criticality and delay:

    TimingCost(i, j) = crit(i, j)^critexp · delay(i, j)   (2.10)

where critexp (criticality exponent) is a parameter used to ‘sharpen’ criticalities to focus the placer more on timing critical connections. The combined wirelength and timing cost function is then the summation of wirelength (over all nets) and timing cost (over all connections):

    (1 − λ) · Σ_{k ∈ nets} HPWL(k) + λ · Σ_{(i,j) ∈ conns} TimingCost(i, j)   (2.11)

where λ is an empirical weighting factor (timing trade-off) used to control the placer's focus on wirelength versus timing. This cost function guides the placer to place elements with timing critical connections between them at locations where their connection delays are lower (usually closer together).

A number of other optimization objectives have also been considered such as power [49], and routing congestion [50, 51, 52].
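To make Equations (2.6) to (2.11) concrete, the following is a minimal C++ sketch of the per-net and per-connection cost terms. The q(·) values, crit_exp, and lambda used are illustrative placeholders rather than VPR's actual tables or defaults, and the criticality helper assumes slacks have already been normalized to be non-negative.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct Pin { int x, y; };

    // Hypothetical fanout correction factor q(.); VPR tabulates this empirically,
    // so the values here are placeholders for illustration only.
    double q(std::size_t fanout) {
        return (fanout <= 3) ? 1.0 : 1.0 + 0.02 * (fanout - 3);
    }

    // Corrected half-perimeter wirelength of one net (Equation 2.6).
    double hpwl(const std::vector<Pin>& terminals) {
        int min_x = terminals.front().x, max_x = terminals.front().x;
        int min_y = terminals.front().y, max_y = terminals.front().y;
        for (const Pin& p : terminals) {
            min_x = std::min(min_x, p.x); max_x = std::max(max_x, p.x);
            min_y = std::min(min_y, p.y); max_y = std::max(max_y, p.y);
        }
        return q(terminals.size()) * ((max_x - min_x) + (max_y - min_y));
    }

    // Connection criticality under one constraint (Equation 2.8), assuming a
    // non-negative slack and the maximum required time over all sinks.
    double criticality(double slack, double max_required_time) {
        return 1.0 - slack / max_required_time;
    }

    // Timing cost of a connection (Equation 2.10).
    double timing_cost(double crit, double delay, double crit_exp) {
        return std::pow(crit, crit_exp) * delay;
    }

    // Combined wirelength/timing cost (Equation 2.11), with lambda trading off
    // the summed wirelength term against the summed timing term.
    double combined_cost(double wirelength_sum, double timing_sum, double lambda) {
        return (1.0 - lambda) * wirelength_sum + lambda * timing_sum;
    }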

2.4.2 Optimization Algorithms

The most common algorithmic approaches currently used are based on Analytic Placement (AP), and Simulated Annealing (SA) [53].13 Commercial FPGA CAD tools often use a hybrid approach, with a fast early stage placement (e.g. analytic-based), followed by a detailed iterative improvement algorithm (e.g. annealing-based) [54].

SA Placement Perturbations

An important consideration during SA placement is how the current placement is perturbed (the perturbations used are often called moves). VPR [46] uses random swaps between blocks of the same type. To avoid proposing large perturbations which are likely to be rejected late in the anneal, VPR also uses a range limit, controlled by the parameter Rlim, which bounds the maximum distance between

12 In practice negative slacks may need further processing to ensure criticality is bounded in this way; see [48] for further details.
13 Historically, partitioning-based algorithms have also been used, but while fast, they tend to produce lower quality results.

swapped blocks [46, 55]. Initially Rlim is as large as the device, but eventually reduces to 1 (i.e. only allowing neighbouring swaps to be performed) as a function of the fraction of moves accepted.
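The range-limit update can be sketched as follows. This follows the commonly described VPR-style scheme of scaling Rlim to hold the move acceptance rate near a target of 0.44, but the exact constant and clamping shown are assumptions rather than the precise implementation.

    #include <algorithm>

    // Sketch of a VPR-style range-limit update (assumed form, not the exact code):
    // grow Rlim when too many moves are accepted, shrink it when too few are,
    // and keep it between one grid unit and the device dimension.
    double update_rlim(double rlim, double fraction_accepted, double device_size) {
        double new_rlim = rlim * (1.0 - 0.44 + fraction_accepted);
        return std::clamp(new_rlim, 1.0, device_size);  // requires C++17
    }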

2.4.3 Handling Legality

One common approach to handling legality is to pre-cluster primitive elements (e.g. LUTs and Flip-Flops) together into architecture blocks (e.g. LBs, RAM blocks, DSP blocks), and then place the resulting clusters. This is the approach taken by VPR/VTR (SA-based) [46, 27, 2], and HeAP (AP-based, followed by greedy swaps) [56]. One advantage of this approach is that the post-clustered netlist is smaller than the primitive netlist, which reduces the size of the placement problem and reduces run-time. Another key advantage is that it cleanly separates the detailed legality constraints of the FPGA's clusters from placement (e.g. any clustered block of a particular type can be placed at any location containing that block type in the device grid). As a result, it is relatively straight-forward to re-target the combined packing and placement algorithms to different FPGA architectures (which may have different legality constraints). This is particularly important for tools like VPR/VTR which are used for FPGA architecture research and need to be easily re-targetable to different FPGAs.

Other placement approaches have combined analytic placement and clustering at various stages. For instance, performing AP on fully clustered netlists [56, 57], partial BLE-level packing followed by AP [50], flat (unclustered) initial AP followed by clustered AP [51], and flat placement with embedded clustering [52, 58]. However, one drawback of many of these approaches is that non-clustered placement algorithms often embed the constraints of the FPGA architecture they target into the placement algorithm, making them applicable only to that FPGA architecture. Additionally, most of these proposed AP algorithms are not timing-driven.

2.5 FPGA Routing

A number of approaches have also been proposed for FPGA routing: the problem of finding a large number of non-overlapping route trees within the FPGA's routing network which implement the netlist connectivity. Early approaches focused on two-stage routing [59], with an initial global routing stage (assigning connections to channels) followed by a detailed routing stage (assigning connections to tracks within channels). However these approaches were outperformed by single-stage combined global-detailed routing based on the PathFinder negotiated congestion algorithm [60], as exemplified in VPR [46].

PathFinder is an iterative routing algorithm, which rips-up and re-routes all nets in a design over a number of routing iterations. The key insight of PathFinder is to allow multiple nets to use the same routing resources. Such routing resources are said to be congested. While this is electrically illegal (nets which are shorted together in such a way are no longer independent nets), it is helpful during routing since it prevents nets from blocking each other. To eventually arrive at a legal routing solution, PathFinder relies on 'congestion negotiation', a process which increases the costs of congested resources so that the number of congested resources drops over time. This is done in such a way that connections negotiate among themselves, trading off their timing criticality and wiring requirements to converge to a legal solution.

The formulation of PathFinder used in VPR is outlined below [38]. The cost of using a routing resource l as part of a connection between pins i and j consists of two parts, one based on delay and the other on congestion:

    ResourceCost(l) = crit(i, j) · delay(l) + (1 − crit(i, j)) · cong(l)   (2.12)

where the connection's criticality (2.9) is used to control the relative importance of the resource's delay and congestion costs. The congestion cost of a routing resource l is defined as:

cong(l) = b(l) · h(l) · p(l) (2.13)

The base cost b(l) represents the fundamental/basic cost of using resource l. The historical congestion cost h(l) increases in each routing iteration in which the routing resource l is overused, and helps to bias future routed connections away from the resource. The present congestion cost p(l) is zero if no other connections are using l, but is non-zero if l is already in use. Furthermore, p(l) increases exponentially over the routing iterations (it is multiplied by a constant presfac > 1 at the end of each iteration). As a result, during later routing iterations connections increasingly avoid inducing congestion by using already-in-use routing resources. By iteratively increasing resource costs in this way, PathFinder allows connections to negotiate their usage of routing resources and converge to a high-quality legal solution.

PathFinder relies on a path search algorithm to find low-cost paths through the routing network. This is usually a directed search algorithm, which costs a partial path from a source s to sink t passing through a routing resource l as:

    PathCost(s, t, l) = KnownCost(s, l) + Astarfac · ExpectedCost(l, t)   (2.14)

where KnownCost(s, l) is the sum of ResourceCosts (2.12) to reach l from s (following the current partial path), and ExpectedCost(l, t) is a look-ahead estimate of the cost to reach the target t from the candidate resource l. The path cost defined by Equation (2.14) is then used to order a priority queue of routing resources to explore using a Dijkstra/A* path search algorithm.14
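A minimal C++ sketch of Equations (2.12) to (2.14) follows. The per-resource fields and the handling of h(l), p(l) and presfac are simplified stand-ins for the bookkeeping a real negotiated-congestion router maintains, not VPR's actual data structures.

    // Simplified per-resource state for a negotiated-congestion router (illustrative).
    struct RRNode {
        double base_cost;  // b(l): basic cost of using the resource
        double hist_cost;  // h(l): grown in each iteration the resource is overused
        double pres_cost;  // p(l): zero when unused by others, scaled by presfac while congested
        double delay;      // intrinsic delay of the resource
    };

    // Congestion cost of a routing resource (Equation 2.13).
    double cong_cost(const RRNode& n) {
        return n.base_cost * n.hist_cost * n.pres_cost;
    }

    // Cost of using resource l for a connection of given criticality (Equation 2.12).
    double resource_cost(const RRNode& n, double crit) {
        return crit * n.delay + (1.0 - crit) * cong_cost(n);
    }

    // Priority used to order the directed (A*-style) search (Equation 2.14).
    double path_cost(double known_cost, double expected_cost, double astar_fac) {
        return known_cost + astar_fac * expected_cost;
    }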

14 To reduce run-time an Astarfac greater than one is usually used, which makes this an approximate A* shortest path search. Additionally, often ExpectedCost is not a strict lower bound.

Part I

Timing Analysis

Chapter 3

Tatum: Fast Parallel Timing Analysis

Static Timing Analysis (STA) is used to evaluate the correctness and performance of a digital circuit implementation. In addition to final sign-off checks, STA is called numerous times during placement and routing to guide optimization. As a result, STA consumes a significant fraction of the time required for design implementation; to make progress reducing FPGA compile times we need faster STA. We first evaluate whether a GPU or multi-core CPU is the best platform for a fast parallel timing analyzer. On core STA algorithms our GPU kernel achieves a 6.2 times kernel speed-up, but data transfer overhead reduces this to 0.9 times overall. This leads us to focus on CPU techniques, with our best CPU implementation achieving a 9.2 times parallel speed-up on 32 cores, yielding a 15.2 times overall speed-up compared to the VPR 7 analyzer, and a 6.9 times larger parallel speed-up than a recent parallel ASIC timing analyzer. We then show how reducing the run-time cost of STA can be leveraged to improve optimization quality, demonstrating that frequent timing analysis during the late stages of placement reduces critical path delay by 4%.

3.1 Introduction

A common goal when designing a digital circuit is maximizing its performance; as a result a circuit is analyzed repeatedly during design to determine its operating frequency. The most common approach to determine a digital circuit's speed is Static Timing Analysis (STA). While STA is significantly faster than other approaches such as timing simulation, it is still time-consuming. Consequently, designers and optimization tools often opt to sacrifice accuracy by performing STA only 'occasionally' during the design process, to minimize design iteration times. Even so, a placement tool like VPR will call STA hundreds of times during optimization [47]. However, this still means design decisions are made using stale (old and possibly now incorrect) timing information. This leads designers and optimization algorithms (whose decisions can benefit from accurate timing information) to assume unnecessarily pessimistic design conditions – resulting in costly over-design. Furthermore, design sizes continue to increase rapidly [61], while improvements in single-threaded CPU performance have slowed [62]. Additionally, the number of timing analyses required to fully characterize a design is also increasing due to the proliferation of timing corners [63], and the growing number of

clock domains [64]. As a result, in commercial FPGA place and route tools STA typically takes 25% of total run-time, but may dominate the optimization algorithms when designs have multiple clocks and timing constraints [65]. Moreover, modern FPGAs have performance-driven architectural features such as pulsed latches [66], beneficial skew [67, 20] and interconnect registers [68], which exacerbate hold-time issues. Evaluating these features requires additional minimum-delay timing analyses, and design tools must explicitly optimize for hold time, requiring numerous rapid calls to STA [69]. Finally, a variety of performant parallel algorithms have been proposed for FPGA placement [62, 70, 71, 72] and routing [73, 74, 75, 76]. As these parallel approaches speed-up the core optimization algorithms, timing analysis becomes an increasingly dominant portion of run-time – limiting the achievable speed-up.1 These factors all make the development of fast and scalable timing analysis algorithms, which can exploit the parallelism available from modern computing systems, key to reducing FPGA design times. In this chapter we focus on the problem of developing efficient and scalable parallel algorithms for block-based STA (described in Section 3.2.1). Our contributions include:

• Memory layout optimizations for STA data structures,

• Parallel algorithms for block-based STA and evaluations on GPUs and multi-core CPUs,

• Techniques to efficiently perform multi-corner and multi-clock STA,

• Tatum, an open-source reusable STA library built on our best algorithms,

• Analysis and comparison of Tatum with existing serial and parallel STA tools, and

• An evaluation of Tatum's performance when integrated in an existing placement tool (VPR) to guide its optimization.

Section 3.2 discusses background and related work. Sections 3.3 to 3.5 present our GPU and CPU parallel algorithms and evaluate them within a simplified STA formulation. Sections 3.6 and 3.7 describe our methods for speeding-up multi-corner and multi-clock STA, and Section 3.8 describes and evaluates our best algorithms within a full-featured STA formulation. Section 3.10 illustrates how fast STA can be exploited to improve optimization quality. The conclusion and future work are presented in Section 3.11.

3.2 Background & Related Work

Related work can be divided into two components: STA and its related extensions, and parallel algorithms.

3.2.1 STA Background

Timing analysis checks whether a digital circuit implementation will operate correctly, given a set of timing constraints. A typical timing constraint is the clock period, which requires signals to be stable before the active clock edge to avoid latching stale data or inducing meta-stability in data storage elements [44]. STA [77] is the conventional approach for timing analysis. STA operates on a timing graph (Definition 4), an abstract representation of a digital circuit where nodes represent the pins of circuit elements

1 Most works on parallel placement and routing do not account for time spent on STA when reporting speed-ups. However STA's impact is significant. For instance, at 25% of total run-time STA limits the best-case speed-up of any parallel placement or routing algorithm to only 4× (Amdahl's law).

and uni-directional edges represent the timing dependencies between them. We denote the number of nodes in the timing graph as N, and the number of levels after topological sorting as L. Two classes of STA algorithms have been proposed: path-based and block-based. Path-based algorithms perform a detailed independent analysis of every path in a circuit – offering high accuracy, but with worst-case exponential run-time (since there are O(2^N) paths). As a result more efficient block-based algorithms are usually used. Block-based STA analyzes timing behaviour at the node (rather than path) level, determining the worst-case timing behaviour of all timing paths passing through a node.2 As a result block-based STA's computation time grows linearly with the size of the circuit (i.e. O(N)).

3.2.2 Parallel Timing Analysis

Several approaches have been taken to parallelize timing analysis on CPUs. Distributed approaches partition the circuit into pieces which can be processed by multiple machines [79, 80], using message passing for synchronization and communication. Scalability is typically limited by the partitioning [80] and communication overheads. In [81] an enhanced sampling method for Monte-Carlo based SSTA is presented, which is trivially parallelized. OpenTimer [82] presents a static timing analyzer using pipeline parallelism, and is evaluated in Section 3.8.2.

Several works have looked at accelerating timing analysis on GPUs. A path-based STA engine, formulated as a Sparse-Matrix Vector Product problem, is presented briefly in [83]; however it requires O(N^2) memory for a circuit of size N, which limits scalability. [84] describes a GPU-accelerated path-based SSTA implementation using the Monte Carlo approach. However, path-based Monte Carlo SSTA is very computationally expensive, making it impractical in many industrial design flows and too expensive to call repeatedly in an optimizer. [85] describes a GPU SSTA algorithm using a dynamic batch construction scheme based on the Coffman-Graham algorithm. CASTA [86] presents a GPU-based STA implementation and is compared to our approach in Section 3.4.2.

Unlike previous work, we develop multiple parallel CPU algorithms, account for data transfer when evaluating GPUs, consider simultaneous multi-corner and multi-clock analysis, and evaluate our algorithms in an optimization tool on a variety of large benchmark circuits.

3.3 Baseline Implementation

Developing a full-featured timing analysis engine for each parallel algorithm and compute platform would require a large effort. Hence we first consider a simplified timing analysis formulation which captures the key algorithmic characteristics of STA. We develop several STA engine variants to compare a wide variety of

2 This introduces pessimism compared to path-based algorithms since worst-case analysis at each node may be pessimistic for some paths. However it is 'safe' in the sense that the worst-case behaviour ensures path delays are never under/over estimated during setup/hold analysis.

parallel approaches. These simplified engines calculate node arrival/required times for single-clock circuits using a pre-calculated delay model. In Section 3.8 we create and evaluate a full-featured parallel timing engine (Tatum) built on the most promising approach, which supports multiple clocks and additional timing constraints like false and multicycle paths.

Algorithm 1 Baseline Serial STA Algorithm
Require: G levelized timing graph to analyze
Require: Arr array of initialized arrival times (zero for timing sources, −∞ for other nodes.)
Require: Req array of initialized required times (constraint values for timing endpoints, +∞ for other nodes.)
 1: function SerialWalk(G, Arr, Req)
 2:     L ← NumLevels(G, Arr, Req)
 3:     for ℓ ∈ (0 ··· L − 2) do                       ▷ Forward traversal
 4:         for node ∈ LevelNodes(G, ℓ) do
 5:             FwdTraverseNodePush(node)
 6:
 7:     for ℓ ∈ (L − 2 ··· 0) do                       ▷ Backward traversal
 8:         for node ∈ LevelNodes(G, ℓ) do
 9:             BwdTraverseNodePull(node)
10:
11: function FwdTraverseNodePush(nsrc)
12:     for eout ∈ OutEdges(nsrc) do                   ▷ Each edge to a successor
13:         nsnk ← SinkNode(eout)                      ▷ Forward edge to successor
14:         edge_arr ← Arr(nsrc) + EdgeDelay(eout)     ▷ Calculate arrival time via edge
15:         Arr(nsnk) ← Max(Arr(nsnk), edge_arr)       ▷ Push arrival time update to sink
16:
17: function BwdTraverseNodePull(nsrc)
18:     for eout ∈ OutEdges(nsrc) do                   ▷ Each edge to a successor
19:         nsnk ← SinkNode(eout)                      ▷ Forward edge to successor
20:         edge_req ← Req(nsnk) − EdgeDelay(eout)     ▷ Calculate required time via edge
21:         Req(nsrc) ← Min(Req(nsrc), edge_req)       ▷ Pull required time update from sink

The baseline CPU algorithm is shown in Algorithm 1, which is based on a standard breadth-first traversal of the timing graph.3 The timing graph (levelized by a topological sort) is first traversed in the forward direction one level at a time (Line 3), which ensures all predecessor nodes have valid arrival times. FwdTraverseNodePush, which walks the out-going edges to update the arrival time of downstream nodes, is then called on each node in the level (Line 5). The graph is traversed similarly in the backward direction to calculate required times (Lines 7 to 9).

3.3.1 Memory Layout Optimizations

The graph traversals in Algorithm 1 have a relatively small amount of computation (arrival/required time calculation) compared to data access (graph nodes and edges). As a result, memory access patterns and caching behaviour are key to achieving high performance. We implemented several optimizations to improve this behaviour:

Struct-of-Arrays We change the memory layout from Array-of-Structs (AoS) (e.g. where a node’s arrival time, required time and out-going edges are located in adjacent memory) to Struct-of-Arrays (SoA) (e.g. where the arrival times of all nodes are located in adjacent memory). This is more efficient when only a subset of a node’s data members need to be accessed (e.g. node type), as

3 This particular implementation is based on the implementation used in VPR 7 [27].

Table 3.1: Performance of Memory Optimizations

    Data Layout                       Speed-Up
    AoS                               1.00
    SoA                               1.44
    SoA & Re-Order Edges              1.61
    SoA & Re-Order Nodes              3.98
    SoA & Re-Order Edges + Nodes      5.40

    Geomean across neuron, openCV, denoise, and gaussianblur Titan FPGA benchmarks [10] (see Section 3.9 and Table 3.8 for details).

instead of loading otherwise unused data (e.g. node fan-in) into the cache, the data members (e.g. node types) of other nearby nodes are loaded into the cache (a minimal sketch of the two layouts follows this list).

Re-order Edges We re-order timing graph edges to match the expected traversal order based on the levelization.

Re-order Nodes We similarly re-order timing graph nodes.
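The layout change can be sketched as follows. The field names are illustrative, not Tatum's actual data structures.

    #include <vector>

    // Array-of-Structs: each node's fields are adjacent in memory, so touching
    // one field of many nodes also drags the rest of each struct into the cache.
    struct NodeAoS { float arrival; float required; int first_out_edge; int type; };
    std::vector<NodeAoS> nodes_aos;

    // Struct-of-Arrays: each field is stored contiguously across all nodes, so a
    // traversal that only reads arrival times streams through one dense array.
    struct TimingGraphSoA {
        std::vector<float> arrival;         // indexed by node id
        std::vector<float> required;        // indexed by node id
        std::vector<int>   first_out_edge;  // indexed by node id
        std::vector<int>   type;            // indexed by node id
    };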

Table 3.1 shows the impact of these optimizations, which improve the spatial and temporal locality of the timing graph. Individually, the SoA layout and re-ordering edges provide moderate improvements, while re-ordering nodes causes a significant improvement. These optimizations combine to yield an overall 5.4× speed-up.4 All following comparisons are made with this optimized implementation unless otherwise noted.

While these results were measured on an Intel Xeon E5-1620 (see Table 3.2), we expect they will generalize to other CPU and GPU architectures since they focus on improving general memory access characteristics such as spatial and temporal locality. Improvements in spatial and temporal locality help improve the performance of block-based caches (where a memory access brings a block/line of nearby memory elements into the cache) and lookahead prefetchers (where a predictable stride between memory accesses facilitates speculatively loading data into the cache ahead of time), both of which are standard in modern CPU architectures [87]. They also help ensure coalesced memory access, which is important for making effective use of GPU external memory bandwidth.

3.4 Parallel STA on GPUs

GPUs are a popular platform for compute acceleration, due to their potential high performance if many threads can be utilized effectively. GPU accelerated FPGA placement [72] and routing [76] techniques have been proposed, so evaluating their ability to accelerate STA is warranted. A GPU consists of a large number of simple compute units, which execute a unique thread but usually share a common control flow [88]. Subsets of the compute units can communicate quickly using a shared local memory, while compute units outside the subset must communicate through slower global memory [88].

3.4.1 GPU Algorithm

To evaluate the suitability of GPUs, we focus on accelerating the timing graph traversals. These traversals are the most time-consuming part of STA, accounting for ∼ 90% of the run-time in the optimized CPU

4 Re-ordering nodes and edges is done once based on the timing graph structure, and takes less time than a full timing analysis. Therefore the re-ordering run-time overhead is trivial in CAD tools like VPR which perform hundreds of timing analyses.

implementation. We accelerate the forward traversal (the backward traversal is similar). The timing graph is processed one level at-a-time using two kernels as shown in Figure 3.1. In the segmented additive map kernel5, each thread is assigned a single edge and calculates the edge arrival time by adding the source node arrival time (Level i − 1) and the edge delay. Next, in the segmented max reduction kernel, each thread is assigned a sink node (at Level i) and performs a maximum reduction over all incoming edge arrival times.

The key benefit of this approach is the exposure of a large amount of parallelism to the GPU when calculating edge arrival times. We expect this to improve performance by reducing code divergence (divergent branching) caused by variable node fan-in, which is inefficient on GPUs [88]. We also expect better load-balancing compared to processing the timing graph in a strictly node-at-a-time manner, since the work performed per edge is effectively constant. The edge and node data is re-ordered as in Section 3.3.1 to ensure coalesced memory accesses, and speculatively loaded into local memories for fast access.

Figure 3.1: GPU calculation of level i arrival times. (The segmented additive map kernel adds source node arrival times from Level i − 1 to edge delays to produce edge arrival times; the segmented max reduction kernel then reduces these to sink node arrival times at Level i.)

3.4.2 GPU Experimental Results

GPU Experimental Methodology

Both the CPU and GPU implementations are evaluated using a selection of large benchmarks from the Titan FPGA benchmark suite [10], run on an Intel Q9550 (45nm 'Yorkfield') processor and an NVIDIA GTX780 (28nm 'GK110') GPU (see Table 3.2 for additional details). The GPU algorithm is implemented using OpenCL, and minimizes wall-clock time by overlapping computation and data transfer. Reported run-times are the sum over all kernel invocations required to evaluate the timing graph, and are averaged over 100 runs.

GPU Experimental Results

Table 3.3 shows the performance results. Comparing first the CPU and GPU Kernel times, we observe the GPU significantly outperforms the well optimized single-core CPU implementation from Section 3.3 by 6.2×. However, taking into account the time spent on data transfer between the host CPU and

5 This kernel takes in arrays of incoming arrival times and edge delays and adds them together in a parallel map operation. This is done in a segmented manner where subsets of edge delays (i.e. those associated with the same node) will re-use the same incoming arrival time.

Table 3.2: Compute Platforms

    Type  Model                               Cores  Sockets  PCI-E BW (MB/s)     Frequency (MHz)  Memory BW (GB/s)  Power (W)  Process (nm)  Year
    GPU   NVIDIA GTX780 (GK110)               -      -        15760 (x16 Gen 3)   900              288.4             250        28            2013
    CPU   Intel Core 2 Q9550 (Yorkfield)      4      1                            2,830            10.6              95         45            2008
    CPU   Intel Xeon E5-1620 (Sandybridge)    4      1                            3,600            51.2              95         32            2012
    CPU   Intel Xeon E5-2683 v4 (Broadwell)   16     2                            2,100            76.8              120        14            2016
    CPU   Intel Xeon Gold 6146 (Skylake)      12     2                            3,200            128.0             165        14            2017

Table 3.3: GPU Performance

    Benchmark        CPU Opt. (P=1)   GPU Kernel        GPU Total        Transf. Frac.
    neuron           0.014            0.006 (2.33×)     0.031 (0.45×)    0.81
    openCV           0.046            0.014 (3.30×)     0.077 (0.60×)    0.82
    bitcoin miner    0.179            0.007 (24.46×)    0.090 (1.97×)    0.92
    sparcT1 chip2    0.124            0.013 (9.52×)     0.107 (1.16×)    0.88
    LU230            0.100            0.020 (5.02×)     0.108 (0.92×)    0.82
    GEOMEAN          0.068            0.011 (6.17×)     0.076 (0.89×)    0.85

    Time in seconds, speed-up relative to optimized CPU baseline in brackets.

GPU, the GPU implementation is slower, taking 12% longer to complete. The data transfer overhead is significant and dominates overall run-time, on average accounting for 85% of the GPU's wall-clock time.

On a per-benchmark basis it is clear that the GPU implementation's performance varies widely. This is largely tied to the size of the benchmark and the number of levels in the timing graph. The GPU implementation performs better on larger benchmarks, with the bitcoin miner benchmark achieving the largest speed-up (2.0× overall, 24.5× kernel only). Compared to the other benchmarks bitcoin miner has substantially fewer (and wider) levels in the timing graph (it is deeply pipelined with a shallow logic depth). This increases the work per kernel invocation, improving GPU utilization and providing more opportunity to overlap computation and data transfer.

Data transfer dominates the compute time, even though the GPU implementation overlaps computation and data transfer. The small amount of computation and large data transfer time means there is relatively little opportunity for overlap to occur. The largest amount of overlap occurs on bitcoin miner where overlapping data transfer reduced wall clock time by 11%.

Table 3.4 summarizes the achieved GPU speed-ups and compares them with [86]. Comparing only GPU Kernel execution time, our method achieved a larger speed-up (33.3× vs 12.9×) than [86] when compared to baseline CPU implementations. However using an optimized CPU implementation (Section 3.3.1) our achieved speed-up shrinks to 6.2×. This is a smaller speed-up than [86] but is against a more realistic baseline.6 Including data-transfer times (GPU Total) further decreases our speed-up to 0.9×. Data-transfer times are important, since they dominate the Kernel execution time, but are not discussed in [86].7

GPU Conclusion

While our results show that GPUs are amenable to accelerating the core kernels of block-based STA, the data transfer overhead dominates the computation resulting in a net slow down. Fundamentally, unlike

6 The CPU baseline in [86] contained GPU specific modifications, and was not optimized for CPU execution.
7 While it may be possible to transfer only a subset of the data (e.g. edge delays only), this would still require transferring a quantity of data which grows linearly with the timing graph size.

Table 3.4: GPU Speed-up Comparison

                  GPU Kernel vs          GPU Kernel vs       GPU Total vs
                  CPU Baseline (P=1)     CPU Opt. (P=1)      CPU Opt. (P=1)
    This Work     33.3×                  6.2×                0.9×
    CASTA [86]    12.9×

    Blank values unreported.

many popular algorithms accelerated by GPUs, block-based STA is a linear-time algorithm – requiring a linear amount of data to be transferred. This limits (in the Amdahl's law sense) the overall performance of the GPU.8,9 Given that our GPU implementation considered only a simplified STA kernel, a more full-featured STA engine (supporting multiple clocks, more complex delay models and timing exceptions) would likely perform worse, due to increased code-divergence, increased load-imbalance between threads and additional data transfer overhead.

3.5 Parallel STA on CPUs

As shown in Section 3.4, while GPUs show promising performance on the core computational kernels, the overhead of marshalling data between the CPU and GPU is prohibitively expensive. We now focus on parallel STA algorithms targeting multi-core CPUs.10 By parallelizing on a CPU we avoid the data-transfer overhead associated with an off-chip accelerator. This also allows easier integration with existing CPU-based data structures and optimization algorithms, which can call STA a very large number of times.

3.5.1 CPU Algorithm A: Levelized Locks

The first parallel algorithm, Algorithm 2, is a straight-forward parallel extension of the serial baseline (Algorithm 1). The timing graph is still processed one level at a time (Lines 3 and 7), but the nodes within each level are processed in parallel (Lines 4 and 8). To avoid race conditions while updating arrival times (when a node pushes arrival time updates to its successors), we synchronize access using fine-grained per-node locking (Lines 16 and 18). There is no race condition when updating node required times as the uni-directional timing graph ensures only one thread updates a node’s required time (by pulling the required time updates from its predecessors).

3.5.2 CPU Algorithm B: Levelized No-Locks

The observation that the parallel backward traversal can calculate required times without explicit locking synchronization suggests the same can be done while calculating arrival times during the forward traversal. This would reduce synchronization costs and improve scalability. However, to do this efficiently, we must

8 For instance, even if the GPU kernel run-time was reduced to zero the average speed-up would be only 1.05× for the benchmarks in Table 3.3.
9 It is left for future work to investigate whether GPU-based systems where memory is tightly coupled between the host CPU and GPU (e.g. which avoid explicit data transfer via a shared address space) could alleviate the data transfer bottlenecks we observed. However we expect that such systems would require a very high-bandwidth interface between the host CPU and GPU to alleviate the challenges we observed.
10 While techniques to parallelize STA have been explored in industrial settings, to the best of our knowledge the algorithms we developed have not been previously reported in the open literature.

Algorithm 2 CPU Algorithm A: Levelized Locks
Require: G levelized timing graph to analyze
Require: Arr array of initialized arrival times (zero for timing sources, −∞ for other nodes.)
Require: Req array of initialized required times (constraint values for timing endpoints, +∞ for other nodes.)
 1: function ParallelLevelizedWalkLocks(G, Arr, Req)
 2:     L ← NumLevels(G)
 3:     for ℓ ∈ (0 ··· L − 2) do                       ▷ Forward traversal
 4:         parallel for node ∈ LevelNodes(G, ℓ) do
 5:             FwdTraverseNodePushLocks(node)
 6:
 7:     for ℓ ∈ (L − 2 ··· 0) do                       ▷ Backward traversal
 8:         parallel for node ∈ LevelNodes(G, ℓ) do
 9:             BwdTraverseNodePull(node)
10:
11: function FwdTraverseNodePushLocks(nsrc)
12:     for eout ∈ OutEdges(nsrc) do
13:         nsnk ← SinkNode(eout)
14:         edge_arr ← Arr(nsrc) + EdgeDelay(eout)     ▷ Read-only access to source node
15:
16:         Lock(nsnk)                                 ▷ Multiple threads (source nodes) may update sink in parallel
17:         Arr(nsnk) ← Max(Arr(nsnk), edge_arr)       ▷ Push arrival time update to sink
18:         UnLock(nsnk)

modify the timing graph to be bi-directional; each edge in the graph must store both its source and sink nodes. The memory costs associated with this are small and evaluated in Section 3.5.4. The improved levelized walk algorithm is shown in Algorithm 3. As before, the forward traversal

Algorithm 3 Levelized walk with no locks
Require: G levelized timing graph to analyze
Require: Arr array of initialized arrival times (zero for timing sources, −∞ for other nodes.)
Require: Req array of initialized required times (constraint values for timing endpoints, +∞ for other nodes.)
 1: function ParallelLevelizedWalkNoLocks(G, Arr, Req)
 2:     L ← NumLevels(G)
 3:     for ℓ ∈ (1 ··· L − 1) do                       ▷ Forward traversal
 4:         parallel for node ∈ LevelNodes(G, ℓ) do
 5:             FwdTraverseNodePull(node)
 6:
 7:     for ℓ ∈ (L − 2 ··· 0) do                       ▷ Backward traversal
 8:         parallel for node ∈ LevelNodes(G, ℓ) do
 9:             BwdTraverseNodePull(node)
10:
11: function FwdTraverseNodePull(nsnk)
12:     for ein ∈ InEdges(nsnk) do                     ▷ Each edge from a predecessor
13:         nsrc ← SourceNode(ein)                     ▷ Reverse edge to predecessor
14:         edge_arr ← Arr(nsrc) + EdgeDelay(ein)      ▷ Read-only access to upstream source nodes
15:         Arr(nsnk) ← Max(Arr(nsnk), edge_arr)       ▷ Only one thread updates sink node

processes each level serially (Line 3), while nodes within the level are processed in parallel (Line 4). However, a pull (instead of push) based arrival time calculation is performed (Line 5). FwdTraverseNodePull updates the arrival time of a node without requiring locking. It iterates through a node's input edges (Line 12) and uses the bi-directional timing graph to identify the source node driving the edge (Line 13). It then calculates the arrival time from that edge (Line 14) and updates the arrival time for the current node (Line 15).11 Since only a single thread updates each node no locks are required for synchronization.
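The forward traversal of Algorithm 3 maps naturally onto a per-level parallel loop. The following minimal C++ sketch uses OpenMP purely for brevity (the implementations evaluated here use Cilk Plus), and assumes a bi-directional timing graph stored as per-node incoming edge lists; the type and field names are illustrative.

    #include <algorithm>
    #include <vector>

    struct Edge { int src; float delay; };

    struct TimingGraph {
        std::vector<std::vector<int>>  level_nodes;  // node ids at each level
        std::vector<std::vector<Edge>> in_edges;     // incoming edges per node
    };

    // Lock-free levelized forward traversal: within a level each sink node is
    // updated by exactly one thread, which pulls arrival times from its
    // (already finalized) predecessors in the previous levels.
    void forward_traversal(const TimingGraph& g, std::vector<float>& arr) {
        for (std::size_t level = 1; level < g.level_nodes.size(); ++level) {
            const std::vector<int>& nodes = g.level_nodes[level];
            #pragma omp parallel for
            for (long i = 0; i < static_cast<long>(nodes.size()); ++i) {
                int snk = nodes[i];
                for (const Edge& e : g.in_edges[snk]) {
                    arr[snk] = std::max(arr[snk], arr[e.src] + e.delay);
                }
            }
        }
    }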

3.5.3 CPU Algorithm C: Dynamic

While the levelized algorithm presented in Section 3.5.2 eliminates explicit locking, it still relies on implicit barriers where all threads wait after each parallel for in Algorithm 3 to synchronize across levels. Since the amount of work per node may vary, barriers may not be the most efficient form of synchronization.12 To avoid this per-level synchronization we developed a dynamic graph walk algorithm. Instead of walking the graph in a strictly levelized manner (where nodes are processed only when all nodes in the preceding level have been processed), each node is processed whenever its dependencies become satisfied. The detailed algorithm is shown in Algorithm 4. The forward traversal13 is started by invoking in parallel the dynamic forward node traversal function (Algorithm 4 Line 4). However unlike the levelized traversals (Algorithms 1 to 3), this is done only for the initial level of the timing graph. Having been launched in this fashion the traversal proceeds dynamically through the timing graph, with the node traversal function processing nodes as their dependencies become satisfied. After each node’s arrival times are updated (Algorithm 4 Line 10) the node’s downstream dependencies are updated (Algorithm 4 Lines 15 to 17, described in more detail below). If the node was the last outstanding dependency of a downstream node (Algorithm 4 Line 19), then the downstream node is enqueued (forked) for processing (Algorithm 4 Line 20). Note that fork is performed efficiently using Cilk Plus task spawning [89], which puts each forked call into per-thread work-queues, and uses work-stealing to distribute the work-load evenly among threads.

Dependency Tracking

An important aspect of the dynamic traversal algorithm is tracking dependencies during the traversal. To ensure correct results, each node in the timing graph must only be processed after its dependencies (predecessors for the forward traversal, and successors for the backward traversal) have been fully updated. For the forward traversal14 dependencies are tracked using an array of counters (input ready counts in Algorithm 4). Each counter corresponds to a node in the timing graph, and stores the number of predecessor (input) nodes which have been fully processed (i.e. are ‘ready’). When this count corresponds to the number of incoming edges to a node, we know all upstream dependencies have been satisfied and the node can be safely processed. Rather than using high-overhead locks to synchronize access when updating the counters, we opt for a lock-free approach. This uses an atomic compare-and-swap (CAS) operation (Algorithm 4 Line 16) to avoid race conditions when updating the counters. AtomicCAS(value, expected, new) returns true and atomically updates value with new only when value = expected (i.e. no other thread has changed value), otherwise it returns false. Modern processors support efficient AtomicCAS operations in hardware [90]. The atomic load and CAS loop (Algorithm 4 Lines 15 to 17) safely increments the associated downstream node’s counter. In the event another thread updates the counter between when it is loaded (Line 15) and stored (Line 16) the CAS will fail. The current thread will then keep retrying (Line 16)

11 For SSTA, statistical addition/max operators would be used.
12 For example, one node may take substantially longer to process — leaving other threads waiting idle at the barrier.
13 The backward traversal uses the same approach.
14 The backward traversal is similar.

using the updated count (Line 17) until the CAS succeeds.15 This ensures the downstream node's ready counter is incremented atomically, while also ensuring the previous count value (prev_inputs_ready, from before the current thread's increment) is correct. This allows us to detect when the current thread satisfies the last outstanding dependency (Line 19), ensuring the downstream node is enqueued precisely once (Line 20), and as soon as its dependencies are satisfied.
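A minimal C++11 sketch of this compare-and-swap loop is shown below; std::atomic stands in for the AtomicLoad/AtomicCAS primitives of Algorithm 4, and the fanin() and spawn() names in the usage comment are illustrative rather than actual API calls.

    #include <atomic>

    // Atomically increment a downstream node's ready counter and return the value
    // it held before the increment, so the caller can detect when it has satisfied
    // the node's last outstanding dependency.
    int increment_ready_count(std::atomic<int>& counter) {
        int prev = counter.load(std::memory_order_relaxed);
        // compare_exchange_weak may fail spuriously; on failure it refreshes 'prev'
        // with the counter's current value, so we simply retry until the increment
        // is applied exactly once.
        while (!counter.compare_exchange_weak(prev, prev + 1,
                                              std::memory_order_acq_rel,
                                              std::memory_order_relaxed)) {
            // 'prev' now holds the latest counter value; loop and retry.
        }
        return prev;
    }

    // Usage sketch (illustrative names): enqueue the sink node exactly once, when
    // its final input dependency becomes ready.
    //   if (increment_ready_count(input_ready_counts[snk]) == fanin(snk) - 1) spawn(snk);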

Algorithm 4 Dynamic Walk
Require: G timing graph to analyze
Require: Arr array of initialized arrival times (zero for timing sources, −∞ for other nodes.)
Require: Req array of initialized required times (constraint values for timing endpoints, +∞ for other nodes.)
Require: input_ready_counts array of node fanins (zero initialized)
Require: output_ready_counts array of node fanouts (zero initialized)
 1: function ParallelDynamicWalk(G, Arr, Req, input_ready_counts, output_ready_counts)
 2:     L ← NumLevels(G)
 3:     parallel for node ∈ LevelNodes(G, 1) do        ▷ Launch forward traversal
 4:         FwdTraverseNodeDynamic(node)
 5:
 6:     parallel for node ∈ LevelNodes(G, L − 2) do    ▷ Launch backward traversal
 7:         BwdTraverseNodeDynamic(node)
 8:
 9: function FwdTraverseNodeDynamic(n)
10:     FwdTraverseNodePull(n)                         ▷ Update arrival times of node n as previously defined
11:
12:     for eout ∈ OutEdges(n) do                      ▷ Update dependency tracking of nodes downstream of n
13:         nsnk ← SinkNode(eout)
14:
15:         prev_inputs_ready ← AtomicLoad(input_ready_counts[nsnk])
16:         while !AtomicCAS(input_ready_counts[nsnk], prev_inputs_ready, prev_inputs_ready + 1) do
17:             prev_inputs_ready ← AtomicLoad(input_ready_counts[nsnk])
18:
19:         if prev_inputs_ready = NumInEdges(nsnk) − 1 then    ▷ n was last outstanding dependency of nsnk
20:             fork FwdTraverseNodeDynamic(nsnk)               ▷ Enqueue nsnk to be processed
21:
22: function BwdTraverseNodeDynamic(n)
23:     BwdTraverseNodePull(n)                         ▷ Update required times of node n as previously defined
24:
25:     for ein ∈ InEdges(n) do                        ▷ Update dependency tracking of nodes upstream of n
26:         nsrc ← SourceNode(ein)
27:
28:         prev_outputs_ready ← AtomicLoad(output_ready_counts[nsrc])
29:         while !AtomicCAS(output_ready_counts[nsrc], prev_outputs_ready, prev_outputs_ready + 1) do
30:             prev_outputs_ready ← AtomicLoad(output_ready_counts[nsrc])
31:
32:         if prev_outputs_ready = NumOutEdges(nsrc) − 1 then  ▷ n was last outstanding dependency of nsrc
33:             fork BwdTraverseNodeDynamic(nsrc)               ▷ Enqueue nsrc to be processed

15 We use a 'weak' CAS (which may fail spuriously due to hardware implementation details) since it can be more efficient, and we already need to retry the CAS.

3.5.4 CPU Experimental Results

Experimental Methodology

The various STA algorithms are evaluated on the same benchmarks used in Section 3.4.2. All algorithms are implemented in C++, and run on an Intel Xeon E5-1620 (32nm ‘Sandy Bridge’, 4 cores, see Table 3.2). Results are the average across 100 runs. It should be noted that the simplified timing analysis formulation evaluated here performs limited work per-node. Therefore the synchronization overhead is high, limiting the absolute parallel speed-ups. A more full-featured timing analyzer which performs more computation per node will scale better (as shown in Sections 3.6 and 3.8). However, this pessimistic scenario allows us to differentiate and evaluate the relative scalability of these algorithms in a high stress setting.

Results

The performance of the parallel algorithms is shown in Table 3.5. The 'Levelized Locks' algorithm (Section 3.5.1) achieves only limited speed-up (1.13×). Closer examination showed the forward traversal achieved no parallel speed-up – the synchronization enforced by locking effectively serialized the traversal. By avoiding locks with a bi-directional timing graph, 'Levelized No-Locks' (Section 3.5.2) produces better results, achieving an average speed-up of 1.92×. Avoiding the synchronization cost of locks (by ensuring only a single writer per node) improves scalability.

The 'Dynamic' algorithm (Section 3.5.3) performs significantly worse, running 4× slower than the serial algorithm. The cost of tracking when a node's dependencies are satisfied, and enqueuing nodes to be processed, is substantial and dominates the computation.

To determine how close these algorithms come to the best-case performance we implemented an additional 'No Dependencies' algorithm. This algorithm performs the same amount of work as the other algorithms, but ignores the dependencies between timing graph nodes.16 The 'No Dependencies' algorithm achieves an average speed-up of 2.32×. This speed-up is less than linear, indicating that memory (rather than computation and synchronization) is the remaining bottleneck. The 'Levelized No-Locks' algorithm achieves a speed-up only 21% less than 'No Dependencies', showing the levelized approach can extract nearly all of the parallelism available. The remaining difference consists of both the effect of the timing graph's dependency structure which cannot be parallelized, and the overhead of using implicit barriers for synchronization.

The memory required to store bi-directional and uni-directional edges in the timing graph is shown in Table 3.6. While storing bi-directional edges increases the timing graph size by 63% on average, the timing graph is a small fraction of a CAD tool's memory usage. For instance in VPR this corresponds to an increase of less than 0.5%. Combined with the improved speed-ups achieved with 'Levelized No-Locks' this represents a good trade-off.

16 While 'No Dependencies' does not produce a correct analysis (due to race conditions), it produces an upper-bound on the achievable performance since no synchronization is performed.

Table 3.5: Speed-Up of CPU Parallel Algorithms

    Benchmark        Levelized Locks   Levelized No-Locks   Dynamic   No Dep.
                     (P=4)             (P=4)                (P=4)     (P=4)
    neuron           0.98              1.70                 0.26      2.21
    openCV           1.09              1.87                 0.28      2.43
    bitcoin miner    1.22              2.04                 0.27      2.37
    sparcT2 chip2    1.25              1.94                 0.27      2.31
    LU230            1.13              2.06                 0.18      2.31
    GEOMEAN          1.13              1.92                 0.25      2.32

    Speed-up relative to (serial) optimized CPU baseline. P is the number of processor cores.

Table 3.6: Timing Graph Memory Usage

    Benchmark       Uni-Directional Edges   Bi-Directional Edges
    neuron          63.4 (0.72%)            103.8 (1.19%)
    opencv          147.7 (0.71%)           237.9 (1.14%)
    denoise         125.4 (0.87%)           204.0 (1.41%)
    gaussianblur    645.1 (0.78%)           1,071.6 (1.30%)

    Memory in MB. Percentage of VPR's peak memory use in brackets.

3.6 Multi-Corner Timing Analysis

An additional challenge facing digital design is the rapid increase in the number of timing corners to analyze [63], both for sign-off and to drive optimization in multi-corner aware tools [91]. The ‘Naive’ approach is to repeatedly invoke the analysis for each of the T timing corners. However, from a computational perspective, analyzing multiple corners simultaneously should be beneficial, since it increases the amount of work performed per node, better amortizing the data access and synchronization costs.17 Algorithm 5 shows the extension to arrival time calculations to support simultaneous multi-corner analysis.18 The graph traversal is identical across all timing corners, so the only modification is the extension of the single node arrival time to a vector of arrival times (one for each timing corner). The loop at Line 4 then calculates and updates the arrival times for each corner.

Algorithm 5 Multi-Corner Node Traversal

Require: nsnk the timing graph node to process, T number of timing corners
 1: function FwdTraverseNodePullMultiCorner(nsnk)
 2:     for ein ∈ InEdges(nsnk) do
 3:         nsrc ← Source(ein)
 4:         for t ∈ 0 ... T − 1 do                     ▷ For each timing corner t
 5:             α ← Arr(nsrc, t) + Delay(ein, t)
 6:             Arr(nsnk, t) ← Max(Arr(nsnk, t), α)

3.6.1 Parallel

Figure 3.2 illustrates the impact of increasing the number of timing corners (T ) on run-time. Results are again collected on a 4-core Intel Xeon E5-1620 processor (see Table 3.2).

17 This also allows us to easily explore the scaling trends for the algorithms from Section 3.5 as the arithmetic intensity increases.
18 To the best of our knowledge this approach has not been previously published.

Figure 3.2: Multi-corner STA performance for openCV at P = 4. All results except 'Naive' perform a single set of traversals for all timing corners. (Run-time in seconds versus T, the number of timing corners, for the Naive (Separate Traversals), Serial, Serial SIMD, Dynamic, Dynamic SIMD, Levelized No-Locks, and Levelized No-Locks SIMD algorithms.)

All parallel algorithms exhibit similar slopes which are substantially lower than those of the ‘Serial’ algorithm, meaning their speed-ups improve as T (work per node) increases. For instance the ‘Levelized No-Locks’ algorithm improves its speed-up from 1.7× at T = 1 to 3.8× at T = 32 (a nearly linear speed-up with 4 cores). The ‘Dynamic’ algorithm has substantial synchronization overhead (Section 3.5.4), and is always dominated by ‘Levelized No-Locks’. The per-node synchronization performed by the ‘Dynamic’ algorithm substantially outweighs the benefit of avoiding the per-level barriers used by ‘Levelized No-Locks’. It is interesting to note that performing a combined multi-corner analysis is inherently more efficient than the ‘Naive’ approach of traversing the timing graph for each corner analyzed. For example, the ‘Naive’ method runs 32× longer at T = 32 compared to T = 1, while the ‘Serial’ algorithm at T = 32 runs only 22.4× longer (since it traversed the timing graph only once). The difference is more substantial for the parallel algorithms, since their scalability improves as the amount of work increases. For instance, ‘Levelized No-Locks’ at T = 32 runs 8.7× faster than ‘Naive’.

3.6.2 SIMD

Processing multiple timing corners in parallel also exposes potential new data-level parallelism (the loop at Line 4 of Algorithm 5), since each basic operation (addition, max etc.) can be applied element-wise to the vector of arrival/required time values. This can be performed efficiently using the Single Instruction Multiple Data (SIMD) instructions found on many modern processors [88]. The Intel Xeon E5-1620 processor targeted supports 8-way SIMD on 32-bit values. To use SIMD, we manually vectorized the arrival time computation corner loop (Algorithm 5 Line 4) Chapter 3. Tatum: Fast Parallel Timing Analysis 35 using compiler intrinsics.19 This further improves scalability, flattening out the slopes of all algorithms in Figure 3.2. This results in larger speed-ups compared to ‘Serial’, with the ‘Dynamic SIMD’ and ‘Levelized No-Locks SIMD’ algorithms achieving speed-ups of 3.2× and 6.2× respectively at T = 32 using 4 cores.

3.6.3 Analysis within Fixed CAD Run-Time Budgets

In practice STA is often performed within a fixed CAD run-time budget. This could be a fixed portion of run-time dedicated to STA within an optimization tool, or a design engineer's constraint to evaluate a proposed design change (e.g. overnight). In both cases the thoroughness of the analysis is limited by what can fit within the run-time budget. Since the scalability of the parallel techniques improves as the workload increases, this changes the trade-offs. For instance, in the same fixed-time budget required by 'Naive' to analyze 3 corners, 'Serial' analyzes 6 corners, 'Levelized No-Locks' analyzes 21 corners, and 'Levelized No-Locks SIMD' analyzes 32 corners. This means with 4 cores, 'Levelized No-Locks' and 'Levelized No-Locks SIMD' can respectively analyze 7.0× and 10.6× more corners than 'Naive', enabling designers and optimization tools to perform a much more detailed analysis than would otherwise be possible.

3.7 Multi-clock Timing Analysis

So far, we have only considered timing analysis with a single clock. We now consider a full-featured timing analyzer which supports advanced features such as multi-clock timing analysis and timing exceptions. Modern designs make use of a large number of clock domains; for instance, FPGA designs can use tens to hundreds of clock domains [64]. As a result, performing efficient timing analysis in the presence of multiple clock domains is important [65].

The multi-traversal approach to multi-clock timing analysis is shown in Algorithm 6. For each pair of clock domains, the arrival times are initialized (Line 4) on the nodes in the first level of the graph (level 0), and required times are initialized (Line 5) on the last level of the graph (level L − 1). The arrival and required times for the clock domain pair are then propagated through the graph using one of the graph traversal algorithms described in Sections 3.3 and 3.5. The overall complexity of this approach with C clocks and a graph of N nodes is O(C^2 N), since the graph is traversed for all pairs of interacting clock domains (O(C^2)) and the work performed at each of the N timing graph nodes is O(1). However, a major drawback of this approach is that the graph traversal is invoked O(C^2) times, and as discussed in Section 3.3.1, the memory accesses to traverse the timing graph are expensive.

Algorithm 6 Multi-traversal Multi-clock Timing Analysis
Require: G timing graph to analyze, Clocks set of clock domains to analyze
 1: function MultiTraversalSTA(G, Clocks)
 2:     for clklaunch ∈ Clocks do
 3:         for clkcapture ∈ Clocks do
 4:             InitializeLaunchArrival(G, clklaunch)
 5:             InitializeCaptureRequired(G, clkcapture)
 6:             Walk(G)                                ▷ One of the previously described traversal algorithms

19 Compiler-driven auto-vectorization was less efficient, performing 13-42% slower for the various algorithms at T = 32.

An alternate method is the tagged approach, shown in Algorithm 7. Here, instead of tracking only a single arrival/required time at each node during the traversal, there may be multiple arrival/required times which are 'tagged' with their associated clock domains. This allows the graph to be traversed only once: the work performed per node increases from O(1) to O(C^2), but only a single graph traversal is required. While the overall complexity remains O(N C^2), the constant factors are significantly reduced since the expensive memory accesses to the timing graph occur only once and are amortized over multiple clock domains.

The tagged approach can be more efficient when clock domains are localized to only a portion of the timing graph. For instance, in the case where all C clock domains are independent and non-interacting (i.e. each node is part of a single clock domain), each node can be processed in O(1) time, and all C clocks can be analyzed in O(N) time. Similarly, this approach can be more efficient when timing exceptions (such as false paths between parts of the circuit or entire clock domains) reduce cross-clock-domain interactions. We evaluate the performance of the tag-based approach in Section 3.9.

Algorithm 7 Single-traversal Multi-clock Timing Analysis
Require: G timing graph to analyze, Clocks set of clock domains to analyze
 1: function SingleTraversalSTA(G, Clocks)
 2:     for clklaunch ∈ Clocks do
 3:         InitializeLaunchTags(G, clklaunch)
 4:     for clkcapture ∈ Clocks do
 5:         InitializeCaptureTags(G, clkcapture)
 6:     Walk(G)                                        ▷ One of the previously described traversal algorithms

3.8 Tatum

Guided by the preceding results, we developed Tatum, an open-source C++ library suitable for integration with CAD tools. Tatum provides facilities for application defined timing graphs and delay calculation.20 These capabilities facilitate the creation of timing-aware CAD tools and the development of timing-aware optimization algorithms. Tatum supports multi-clock timing analysis and timing exceptions. Tatum uses the most performant CPU parallel algorithm (‘Levelized No-Locks’ from Section 3.5.2), uses the tagged approach (Section 3.7) to perform multi-clock timing analysis, and uses Cilk Plus [89] for parallelization. Cilk Plus was selected since it provides dynamic load-balancing between threads through efficient work-stealing [89]. In contrast, contemporary versions of OpenMP supported only static (i.e. ahead-of-time) partitioning of work between threads, which was found to be less efficient since the work performed per node varies depending on node fan-in/fan-out.21
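A rough sketch of such a levelized, lock-free traversal on top of a work-stealing runtime (shown here with TBB, which Tatum later adopted per the footnote below; the names and structure are illustrative rather than Tatum’s actual internals) is:

#include <functional>
#include <vector>
#include <tbb/parallel_for.h>

// Levels are visited serially in topological order; nodes within a level have
// no dependencies on one another, so each level is a parallel_for with no
// locks or atomics. The work-stealing scheduler balances uneven per-node work.
struct TimingGraph;  // assumed: opaque graph type provided by the application

void forward_traversal(TimingGraph& g,
                       const std::vector<std::vector<int>>& levels,
                       const std::function<void(TimingGraph&, int)>& update_node) {
    for (const std::vector<int>& level : levels) {
        tbb::parallel_for(size_t(0), level.size(), [&](size_t i) {
            // Reads only already-finalized fan-in arrival times.
            update_node(g, level[i]);
        });
    }
}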

3.8.1 Asymptotic Scalability

To evaluate asymptotic parallel scalability we adopt the Work-Span model from [92].

20Tatum’s source code is available at: kmurray.github.io/tatum. 21Tatum was later converted to use Thread Building Blocks library [92], which supports the same work-stealing scheduler as Cilk Plus, and does not require explicit compiler support.

Work-Span Model

The Work-Span model describes a computation as a sequence of work-items with dependencies between them. This forms a Directed Acyclic Graph (DAG) where vertices represent work-items and edges the dependencies between them. Before a work-item can be executed, its dependencies must be satisfied. Intuitively work corresponds to the total amount of work to be performed by an algorithm, and the span is the portion of work the algorithm must perform serially (i.e. the longest path in the DAG of work-items).

Following [92], we define TP as the time for an algorithm to execute with P processor cores. T1 is then the serial execution time,22 and T∞ is the time to execute on a theoretical infinitely parallel machine.

An upper-bound on TP is Brent’s Lemma:

\[ T_P \le \frac{T_1 - T_\infty}{P} + T_\infty \qquad (3.1) \]

The speed-up with P processors is then:
\[ S_P = \frac{T_1}{T_P} = O\!\left(\frac{T_1}{\frac{T_1 - T_\infty}{P} + T_\infty}\right) \qquad (3.2) \]

Work-Span of Static Timing Analysis

STA performs a set of operations on each of the N nodes in the timing graph, which can be topologically sorted into L levels. For all parallel algorithms analyzed

\[ T_1 = \Theta(N) \qquad (3.3) \]
meaning the time-complexity of all algorithms grows linearly with the size of the timing graph, when executed on a single processor. This means the parallel algorithms are work efficient: they perform no additional work compared to a strictly sequential algorithm.

Tatum: Data Parallelism

Tatum takes a data-parallel approach which parallelizes STA operations across the N timing graph nodes. However, the structure of the timing graph imposes an unavoidable degree of serialization to ensure all dependencies are satisfied and no race conditions occur. That is, the span is equal to L, the number of levels in the timing graph.23 As a result

\[ T_\infty = \Theta(L) \qquad (3.4) \]
and the parallel speed-up is then
\[ S_P = O\!\left(\frac{N}{\frac{N-L}{P} + L}\right) \qquad (3.5) \]
which as P → ∞ becomes
\[ S_\infty = O\!\left(\frac{N}{L}\right) . \qquad (3.6) \]
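As a rough worked example (illustrative values, not measurements from this work), consider a timing graph with N = 10^6 nodes and L = 100 levels, an N/L ratio in line with that noted for the Titan benchmarks:
\[ S_{32} = O\!\left(\frac{N}{\frac{N-L}{P} + L}\right) = \frac{10^6}{\frac{10^6 - 100}{32} + 100} \approx 31.9, \qquad S_\infty = \frac{N}{L} = 10^4 . \]
For graphs of this shape the span-based bound is nearly linear at P = 32, suggesting that the sub-linear speed-ups observed in practice are dominated by the synchronization and communication costs the work-span model neglects.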

22 Intuitively, T∞ is the run-time of the inherently unparallelizable work. 23Intuitively, in the limit when N = L no parallel speed-up is possible – since the timing graph would be a serial chain of nodes.

This means that Tatum’s speed-up grows linearly with the size of the timing graph, which is inherently more scalable than OpenTimer’s constant-factor speed-up.24

3.8.2 Comparison to OpenTimer

We next compare Tatum to OpenTimer [82], an ASIC oriented tool which at the time was the only other open-source parallel static timing analyzer.

OpenTimer: Pipeline Parallelism

OpenTimer uses pipelining as its parallelization method. This corresponds to a functional decomposition of STA, where each functional task (e.g. delay calculation, arrival time propagation) is performed in parallel. To ensure all dependencies between tasks remain satisfied and no race conditions occur, each functional task is performed on a different level of the timing graph – forming a parallel pipeline [82]. If all K pipeline stages take an equal amount of time then:

\[ T_\infty = \Theta\!\left(\frac{N}{K}\right) \qquad (3.7) \]
and the parallel speed-up is then
\[ S_P = O\!\left(\frac{N}{\frac{N - N/K}{P} + \frac{N}{K}}\right) \qquad (3.8) \]
which as P → ∞ becomes

\[ S_\infty = O(K) = O(1) \qquad (3.9) \]
since the number of parallel pipeline stages is constant (K = 6 for OpenTimer). Equation (3.9) means that OpenTimer’s pipeline parallelism only speeds up the algorithm by at most a constant factor. Notably, the speed-up does not scale with the size of the problem. In addition, if the time per pipeline stage is not perfectly balanced, the slowest stage will form a bottleneck causing other stages to block until it completes – further limiting scalability.

Numeric Comparison

We evaluate Tatum and OpenTimer on the Tau 2015 ASIC Timing Contest benchmarks [93]. The smallest benchmarks, with fewer than 10^5 timing nodes (N ≤ 10^5), are excluded, as they are too small to benefit from parallelization and are much smaller than modern commercial ASIC designs. We report the self-relative speed-up achieved when fully recalculating the timing graph, averaged over 100 runs. Run-time results were collected with two Intel E5-2683 v4 CPUs (14nm ‘Broadwell’, 16 cores, see Table 3.2), for a total of 32 cores. As Figure 3.3 shows, OpenTimer achieves a maximum speed-up of 1.2×, and does not scale beyond 8 cores. OpenTimer’s speed-up is limited by its fixed task-based pipeline, which yields at most an O(1) constant-factor speed-up as the number of cores increases [92].25 In contrast, Figure 3.3 shows Tatum scales to at least 32 cores, achieving a maximum speed-up of 8.3×. Notably, Tatum’s speed-up increases monotonically with the number of cores unlike OpenTimer.

24 N 4 5 Additionally, the ratio of L is large in practise. For instance, on the order of 10 to 10 for the Titan benchmarks [10]. 25This limited scalability is in agreement with Equation (3.9), although the magnitude of the speed-up is much less than K, likely because of unbalanced work between pipeline stages, and synchronization overhead. Chapter 3. Tatum: Fast Parallel Timing Analysis 39

Figure 3.3: Self-relative parallel speed-up on Tau’15. (Geomean speed-up vs. number of cores P for Tatum and OpenTimer on benchmarks with N > 10^5, with a linear speed-up reference; Tatum reaches 8.3× at 32 cores while OpenTimer peaks at 1.2×.)

Since Tatum’s data-parallel approach scales as O(N/L), this means its speed-up improves with larger design size (as N increases). While Tatum is more scalable than OpenTimer, it scales sub-linearly with the number of cores, which is in agreement with Equation (3.5). This is due to:

• the inherently unparallelizable work (e.g. caused by the span, L in Equation (3.6)), and

• the communication and synchronization costs of real computing systems which are neglected in the work-span model.

3.9 Evaluation in VPR

We next integrated Tatum into the VPR FPGA CAD tool, which repeatedly calls STA to guide optimization [94]. We compare Tatum’s performance (and verify correctness) against the classic VPR 7 timing analyzer [27].26 We report both the run-time spent on STA and total run-time while placing the Titan FPGA benchmarks [10], which are large designs ranging in size from 90K to 1.8M netlist primitives. The Titan benchmarks are synthesized with Intel’s Quartus tool and consist of various FPGA device primitives (e.g. LUTs, FFs, adders, multipliers, RAM slices) [10]. We should note that the Titan benchmarks are substantially larger (153× more timing graph nodes on average) than the Tau’15 benchmarks, and include multi-clock circuits (which OpenTimer does not support). The results were again collected on a system with dual Intel E5-2683 v4 CPUs (32 cores, see Table 3.2).

26VPR 7’s classic analyzer performs a serial levelized graph traversal (Algorithm 1) and uses the multi-traversal approach to multi-clock analysis (Algorithm 6).

Table 3.7: VPR Multi-clock STA Run-Time During Placement

Benchmark      | Timing Graph Nodes | Clocks | Classic VPR | Tatum P = 1     | Tatum P = 32
stereo vision  | 416,747            | 4      | 93.7        | 39.6 (2.37×)    | 4.8 (19.7×)
sparcT1 core   | 513,974            | 3      | 78.9        | 39.6 (1.99×)    | 4.8 (16.5×)
minres         | 1,278,986          | 3      | 268.1       | 127.9 (2.10×)   | 10.5 (25.6×)
sparcT2 core   | 1,566,943          | 3      | 402.6       | 132.7 (3.03×)   | 11.6 (34.6×)
gsm switch     | 2,666,350          | 5      | 979.2       | 273.7 (3.58×)   | 18.6 (52.6×)
LU230          | 3,038,632          | 3      | 1,218.8     | 416.4 (2.93×)   | 40.7 (30.0×)
mes noc        | 3,261,440          | 10     | 1,855.9     | 383.4 (4.84×)   | 29.7 (62.5×)
LU Network     | 3,355,076          | 21     | 1,494.0     | 427.5 (3.49×)   | 28.7 (52.0×)
sparcT1 chip2  | 3,765,406          | 3      | 1,099.4     | 446.4 (2.46×)   | 35.7 (30.8×)
bitcoin miner  | 4,513,409          | 3      | 1,891.4     | 943.6 (2.00×)   | 66.1 (28.6×)
directrf       | 4,902,870          | 3      | 2,365.5     | 1,066.5 (2.22×) | 86.2 (27.4×)
GEOMEAN        | 2,081,093          | 4      | 673.7       | 248.5 (2.71×)   | 21.1 (32.0×)

Time in seconds; speed-up relative to ‘Classic VPR’ in brackets. P = number of cores used for STA; P = 32 uses two sockets.

STA Run-Time

Table 3.7 shows the STA run-time results on the multi-clock subset of Titan benchmarks. With a single core (P = 1) Tatum runs 2.7× faster than Classic VPR,27 showing the tagged approach (Section 3.7) performs better on multi-clock circuits. With 32 cores Tatum runs 32× faster. Figure 3.4 shows how Tatum’s performance scales with the number of cores. Tatum achieves an average speed-up of 15.2× on the full Titan benchmark set with 32 cores, and an average speed-up of 7.7× on the single-clock subset of the Titan benchmarks.

Total Run-Time

Table 3.8 shows the results for total placement run-time. With VPR’s classic timing analyzer, STA accounts for 10% of total run-time on average, with multi-clock circuits spending up to 22% of run-time on STA. When VPR is run using Tatum’s parallel timing analysis, total placement run-time is reduced by 11%.

3.10 Improving Optimization

CAD tools like VPR have been tuned to minimize the amount of run-time spent on STA by performing timing analysis relatively infrequently as they change the implementation [47]. This controls STA run-time, but has an important side effect: the optimization tools are making decisions based on timing information that is stale (not reflective of the latest circuit state). For example, during placement of the gaussianblur Titan benchmark VPR modifies the placement ∼ 747 million times, but only calls timing analysis to obtain updated delay information 150 times (i.e. once every ∼ 5 million modifications). Since Tatum reduces the run-time overhead of maintaining up-to-date timing information it is worth considering whether more frequent timing analysis can improve optimization quality. To this end we modified VPR’s placement algorithm to exploit more accurate timing information.

27Tatum and the Classic VPR timing analyzer produce identical results to within floating point round-off.

Figure 3.4: Tatum Performance Scaling over Classic VPR STA. (Geomean speed-up vs. number of cores P for the Titan23 benchmarks and their multi-clock subset, with a linear speed-up reference; the multi-clock subset reaches 32.0× and the full set 15.2× at 32 cores.)

Table 3.8: VPR Placement Total Run-Time

Benchmark      | Timing Graph Nodes | Clocks | STA Updates | VPR Classic    | VPR Tatum (P = 32) | Tatum Speed-Up
stereo vision  | 416,747            | 4      | 143         | 7.1 (22.0%)    | 5.6 (1.9%)         | 1.26×
neuron         | 451,916            | 1      | 143         | 6.7 (10.7%)    | 6.1 (1.9%)         | 1.10×
sparcT1 core   | 513,974            | 3      | 125         | 8.7 (15.1%)    | 7.5 (1.4%)         | 1.16×
cholesky mc    | 615,925            | 1      | 148         | 10.4 (11.6%)   | 9.4 (2.3%)         | 1.10×
SLAM spheric   | 756,307            | 1      | 170         | 21.3 (8.7%)    | 19.7 (1.5%)        | 1.08×
des90          | 779,404            | 1      | 140         | 14.4 (11.4%)   | 13.0 (1.7%)        | 1.11×
segmentation   | 980,309            | 1      | 193         | 31.3 (9.7%)    | 30.0 (5.8%)        | 1.04×
dart           | 1,137,708          | 1      | 126         | 20.9 (10.3%)   | 19.0 (1.2%)        | 1.10×
stap qrd       | 1,248,502          | 1      | 224         | 60.2 (7.8%)    | 56.2 (1.0%)        | 1.07×
minres         | 1,278,986          | 3      | 135         | 32.1 (13.9%)   | 27.9 (1.0%)        | 1.15×
openCV         | 1,281,693          | 1      | 146         | 25.6 (11.2%)   | 23.0 (1.3%)        | 1.11×
cholesky bdti  | 1,392,239          | 1      | 185         | 40.1 (9.4%)    | 36.8 (1.4%)        | 1.09×
bitonic mesh   | 1,409,553          | 1      | 127         | 32.6 (9.6%)    | 29.8 (1.0%)        | 1.09×
sparcT2 core   | 1,566,943          | 3      | 141         | 58.7 (11.4%)   | 52.2 (0.5%)        | 1.12×
denoise        | 2,068,769          | 1      | 245         | 115.5 (5.7%)   | 112.0 (2.8%)       | 1.03×
gsm switch     | 2,666,350          | 5      | 142         | 109.8 (14.9%)  | 94.0 (0.5%)        | 1.17×
LU230          | 3,038,632          | 3      | 176         | 182.7 (11.1%)  | 163.4 (0.6%)       | 1.12×
mes noc        | 3,261,440          | 10     | 142         | 183.9 (16.8%)  | 153.7 (0.5%)       | 1.20×
LU Network     | 3,355,076          | 21     | 166         | 202.7 (12.3%)  | 178.6 (0.5%)       | 1.13×
sparcT1 chip2  | 3,765,406          | 3      | 143         | 203.9 (9.0%)   | 186.3 (0.4%)       | 1.09×
bitcoin miner  | 4,513,409          | 3      | 272         | 355.8 (8.9%)   | 325.8 (0.5%)       | 1.09×
directrf       | 4,902,870          | 3      | 254         | 489.4 (8.1%)   | 452.0 (0.5%)       | 1.08×
gaussianblur   | 11,554,254         | 1      | 150         | 1,143.5 (1.6%) | 1,127.3 (0.2%)     | 1.01×
GEOMEAN        | 1,591,104          | 2      | 162         | 55.2 (10.0%)   | 49.8 (1.0%)        | 1.11×

Time in minutes; percentage of time spent on STA in brackets. P = number of cores used for STA; P = 32 uses two sockets.

3.10.1 Algorithmic Enhancements

We focus on critical path optimizations during VPR’s final placement optimization stage: the ‘quench’ phase of the anneal, where only modifications which improve quality are accepted. All other placement steps use the standard VPR formulation. During the quench, we call STA after each move (swap of blocks) to ensure the optimizer has a fully accurate view of the move’s timing impact. Interestingly, with VPR’s default move generator and cost formulation this resulted in no net improvement in critical path delay.28 To improve efficiency we biased the move generator to pick blocks which are connected to highly critical connections (criticality > 0.7). We also set a high criticality exponent (critexp = 50) and timing trade-off (λ = 0.99) to focus the optimization on improving the critical path, and used a small region limit (Rlim = 1) to generate small incremental modifications to the circuit.29 We observed that VPR’s connection-based criticality timing cost formulation (Equation (2.10)) [47] fails to provide accurate guidance to reduce the critical path in this final optimization, as its connection oriented approach does not directly evaluate the critical path delay. At this final stage of fine-tuning, focusing on specific paths is key to improving results, and we modified the timing cost function during the quench to directly optimize the circuit’s critical path delay. The resulting cost function (based on Equation (2.11)) used during the quench is then:
\[ \sum_{k \in nets} (1 - \lambda) \cdot HPWL(k) + \lambda \cdot CPD \qquad (3.10) \]
where CPD is the worst critical path delay calculated by timing analysis. Finally, we exit the quench when no moves have resulted in cost improvements for a significant period of time (indicating the system has frozen out and reached a minimum), or the standard VPR move limit is reached.
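A minimal sketch of the modified quench loop is shown below; the helper names are hypothetical stand-ins rather than VPR’s real API, and the sketch simply re-runs STA after every proposed swap so the cost of Equation (3.10) always reflects up-to-date delays:

#include <random>

// Assumed (hypothetical) hooks into the placer; not VPR's actual interfaces.
struct Placement;                                       // current block locations
double wirelength_cost(const Placement&);               // sum of HPWL over nets
double run_sta_and_get_cpd(Placement&);                 // full STA, returns critical path delay
void   propose_critical_swap(Placement&, std::mt19937&); // biased move generator, Rlim = 1
void   revert_last_swap(Placement&);

void quench(Placement& place, double lambda, long move_limit) {
    std::mt19937 rng(0);
    double best = (1.0 - lambda) * wirelength_cost(place)
                + lambda * run_sta_and_get_cpd(place);
    long moves_since_improvement = 0;
    for (long m = 0; m < move_limit; ++m) {
        propose_critical_swap(place, rng);              // pick blocks on critical connections
        double cost = (1.0 - lambda) * wirelength_cost(place)
                    + lambda * run_sta_and_get_cpd(place);  // per-move STA (Equation 3.10)
        if (cost < best) {                              // quench: accept only improvements
            best = cost;
            moves_since_improvement = 0;
        } else {
            revert_last_swap(place);
            if (++moves_since_improvement > 100000) break;  // frozen out: exit early
        }
    }
}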

3.10.2 Results

Table 3.9 compares the impact of using stale and up-to-date timing information while directly optimizing the critical path during the quench. Results were collected on the MCNC20 benchmarks [95] targeting a K = 4, N = 1 (single 4-LUT per logic block) architecture with length one wires, and were run on a machine with two Intel Xeon Gold 6146 (14nm ‘Skylake’, 12 cores), for a total of 24 cores. Reported values are the average over 5 placement seeds. Performing STA during the evaluation of each proposed move (Per-Move STA) improves Critical Path Delay (CPD) by 4% on average, and by up to 8% on some circuits (e.g. elliptic) compared to standard VPR. These CPD improvements are obtained with no degradation in wirelength (WL). To show that these improvements are in fact derived from frequent timing analysis (rather than the algorithmic modifications listed above), we also report results where only a single STA is performed at the start of the quench (Single STA). In this scenario, the optimizer is forced to make decisions using stale timing information. This leads it to over-optimize paths which originally appeared critical, degrading the actual critical path by 5% compared to standard VPR. While we have shown that frequent timing analysis can improve optimization quality, it comes with a run-time overhead – increasing placement run-time by 5.96×. However, this overhead is dominated by STA and can be reduced to 2.39× when Tatum uses 24 cores. This makes the approach suitable for late stages of the design process (such as final timing closure), where increasing tool effort is acceptable in return for improved quality. We expect these results to improve on heavily pipelined designs (which are increasingly common), where up-to-date timing information has also been shown to improve timing [96].

28We suspect that for detailed tweaks as proposed here, VPR’s connection based timing cost function fails to effectively guide the placer to path-level improvements. 29For definitions of these terms see Section 2.4.

Table 3.9: Impact of STA Frequency During Quench

             |              Per-Move STA                                |        Single STA
Benchmark    | WL   | CPD  | Place Time (P = 1) | Place Time (P = 24)   | WL   | CPD  | Place Time
tseng        | 1.00 | 0.97 | 3.80               | 2.75                  | 1.00 | 1.08 | 0.77
dsip         | 1.00 | 0.93 | 2.97               | 1.39                  | 1.00 | 1.07 | 0.77
diffeq       | 1.00 | 1.00 | 4.15               | 2.33                  | 1.00 | 1.02 | 0.90
bigkey       | 1.00 | 0.99 | 3.21               | 1.43                  | 1.00 | 0.99 | 0.82
s298         | 1.00 | 0.95 | 6.16               | 3.38                  | 1.00 | 1.02 | 0.88
frisc        | 1.00 | 0.96 | 9.79               | 3.38                  | 1.00 | 1.14 | 0.90
elliptic     | 1.00 | 0.92 | 11.36              | 3.23                  | 1.00 | 1.08 | 1.01
s38584.1     | 1.00 | 0.98 | 11.07              | 2.44                  | 1.00 | 1.05 | 1.04
s38417       | 1.00 | 0.97 | 12.11              | 2.80                  | 1.00 | 1.04 | 1.01
clma         | 1.00 | 0.94 | 12.75              | 3.02                  | 1.00 | 0.99 | 1.01
ex5p         | 1.00 | 0.97 | 4.45               | 3.32                  | 1.00 | 1.02 | 1.00
apex4        | 1.00 | 0.96 | 4.78               | 2.98                  | 1.00 | 1.09 | 1.05
misex3       | 1.00 | 0.95 | 4.02               | 2.10                  | 1.00 | 1.03 | 0.95
alu4         | 1.00 | 0.97 | 3.20               | 1.62                  | 1.00 | 1.06 | 0.78
des          | 1.00 | 0.99 | 4.27               | 1.84                  | 1.00 | 1.04 | 0.85
seq          | 1.00 | 0.98 | 3.82               | 1.80                  | 1.00 | 1.07 | 0.84
apex2        | 1.00 | 0.97 | 5.23               | 2.17                  | 1.00 | 1.02 | 0.97
spla         | 1.00 | 0.98 | 9.41               | 2.63                  | 1.00 | 1.07 | 1.06
pdc          | 1.00 | 0.95 | 9.87               | 2.57                  | 1.00 | 1.05 | 0.91
ex1010       | 1.00 | 0.97 | 8.51               | 2.34                  | 1.00 | 1.05 | 0.90
GEOMEAN      | 1.00 | 0.96 | 5.96               | 2.39                  | 1.00 | 1.05 | 0.92

Values normalized to VPR default. P = number of cores used for STA; P = 24 uses two sockets.

3.11 Conclusion

Static Timing Analysis is a key element in FPGA design flows and accounts for a significant fraction of design implementation time. This fraction threatens to increase further in the face of growing design sizes and increasingly complex timing conditions. Furthermore, as other algorithms such as placement and routing are parallelized serial STA becomes the dominant run-time component. Therefore speeding-up STA is fundamental to reducing FPGA design cycle times.

To address these issues we have investigated a variety of techniques to speed-up Static Timing Analysis, including data layout optimizations and parallel algorithms on both GPUs and CPUs. While GPUs were shown to offer good kernel speed-ups (average 6.2×), the overhead of data transfer to/from the host CPU negated the performance improvement. Of the CPU parallel algorithms, a level-parallel approach performed best, particularly when synchronization costs are reduced by avoiding expensive locks and atomic operations. This approach achieved an average self-relative speed-up of 1.9× with 4 cores. Analyzing multiple timing corners increases the amount of work per memory access, improving the speed-up to 3.8× with 4 cores. Exploiting SIMD techniques further improves the speed-up to 6.2×. The combination of these methods allows 14.5× more timing corners to be analyzed on 4 cores within a fixed CAD run-time budget.

With this knowledge of how to speed-up the basic operations of a timing analyzer, we moved on to more advanced STA features, describing how tagged arrival/required times improve performance on multi-clock circuits by reducing the number of timing graph traversals. We then introduced the open-source Tatum STA library, supporting efficient parallel timing analysis of complex designs with multiple clocks and timing constraints. Tatum reduces the barrier to producing timing-aware CAD tools, and has been adopted as the timing analysis engine in CGRA-ME [24] and version 8.0 of VTR [2]. We compared Tatum’s performance with OpenTimer, showing that OpenTimer’s parallel approach achieves a maximum 1.2× speed-up at 8 cores, while Tatum’s speed-up increases to 8.3× at 32 cores. We evaluated Tatum’s performance in VPR, showing that it runs on average 15.2× faster than VPR’s classic timing analyzer with 32 cores, and 32× faster on more complex multi-clock circuits; reducing total placement time of the Titan benchmarks by 11%. Finally, we showed how faster timing analysis can be used to improve optimization results, showing that more frequent STA can improve critical path delay by 4%.

As Tatum is an open-source re-usable library, we believe it can be leveraged by other researchers to facilitate the creation of advanced timing-aware CAD tools. Such tools enable new CAD algorithm explorations, and a wider variety of more accurate architectural studies.

3.12 Future Work

Another way to get up-to-date timing information after small perturbations to the design implementation (which can occur during the late optimizations discussed in Section 3.10), is to incrementally update timing information. From an algorithmic perspective this would involve identifying the subset of edges in the timing graph whose delays have changed and updating the affected nodes. However it is unclear how this would impact the performance of the proposed algorithms. For instance, while we found that level-parallel approaches such as Algorithm 3 are most efficient during full timing updates, they may perform more work than necessary during an incremental update, and the more fine-grained parallel algorithms like the dynamic approach (Algorithm 4) may be more efficient. Likely this trade-off will depend on the number of nodes which must be updated during the incremental update, but further research is needed to explore the trade-offs in this space. Other directions for future work include extending Tatum to support more advanced features such as modelling on-chip-variation and performing clock pessimism removal.

Chapter 4

Extended Static Timing Analysis: Quantifying Risks of Input-Dependent Timing Errors

As discussed in the prior chapter, timing analysis is a key step in the digital design process. However, where Chapter 3 was concerned with rapid timing analysis which scales to large designs and guides optimization efficiently, in this chapter we look for less pessimistic timing analysis, even at the expense of more computation. By modelling device delay variations Statistical Static Timing Analysis (SSTA) reduces pessimism compared to traditional Static Timing Analysis (STA). However, like STA, it ignores the circuit’s logic, which causes some timing paths to never, or only rarely, be sensitized. We introduce a general timing analysis approach and tool to calculate the probability that individual timing paths are sensitized, enabling the calculation of bounding delay distributions over all input combinations. We show how this analysis is related to the well-known #SAT problem and present approaches to improve scalability, achieving results that are on average 75% to 37% less pessimistic than Static Timing Analysis, while running 569 to 16 times faster than Monte-Carlo timing simulation and delivering stronger guarantees.

4.1 Introduction

Timing analysis, the process of determining whether a circuit will operate correctly at speed, is a key step in the digital design process. The conventional approach for performing timing analysis is Static Timing Analysis (STA) [97]. However, with smaller manufacturing process technologies the worst case and average case delay can deviate significantly, causing STA to be very pessimistic [98, 78]. Statistical Static Timing Analysis (SSTA) was introduced to reduce this pessimism, by modeling device and interconnect delay variations. This allows designers to sacrifice a small amount of delay coverage (and resulting yield) for considerable performance improvement [78]. However, both STA and SSTA assume the worst case switching behaviour across all input combinations. We propose a complementary approach which quantifies the stochastic variation across inputs rather than across delays, with the aim of sacrificing a small amount of input combination coverage to achieve

considerable performance improvement on the remaining combinations. This is motivated by applications amenable to approximate computing, where small errors may be acceptable (e.g. decoding lossy video), correctable (e.g. by a higher level algorithm), or where the input data is inherently noisy (e.g. sensor readings) and extreme accuracy is unnecessary [99]. By running systems beyond their strictly robust operating regimes, it is hoped that better trade-offs between power, area and performance can be achieved. For instance [25] studied the impact of ‘overclocking’ arithmetic operators beyond their ‘maximum’ safe operating frequencies1 for improved performance, and inspired changes to the architecture of arithmetic components [100]. However designing such systems is challenging as their behaviour is difficult to analyze. This is particularly true at the circuit-level where conventional tools like STA and SSTA assume worst-case switching behaviour to ensure exhaustive coverage of input combinations. In [25] timing analysis was performed by hand, a method which is time consuming, error prone and not scalable. To the best of our knowledge there has been no generic approach to the timing analysis of this kind of design. We aim to address this design capability gap by developing a generalization of STA. Instead of assuming worst-case switching behaviour to generate a longest path delay bound as in STA, we determine a bounding delay distribution over all input combinations.2 To simplify matters we initially consider the case where the set of inputs at cycles i and i + 1 are independent, identically uniformly distributed and statistically stationary. We later describe how these restrictions can be lifted. Our key contributions include:

• A new timing analysis formulation to determine bounding delay distributions across input combinations,

• Methods to model non-uniform and correlated inputs,

• Reduction of the analysis to #SAT,

• Techniques to improve scalability on real circuits, and

• Experimental comparison with Monte-Carlo simulation.

Section 4.2 discusses background and related work. Sections 4.3 and 4.4 present our formulation and implementation. Section 4.5 describes how non-uniform input probabilities and correlations can be modelled. Sections 4.6 and 4.7 describe the experimental methodology and results. Section 4.8 concludes and Section 4.9 outlines future work.

4.2 Background & Related Work

The conventional approach for performing timing analysis is STA [97]. STA performs a pessimistic analysis: it considers only the topological structure of the circuit, assuming that all signals switch every cycle. This topological structure is stored as a timing graph (Definition 4). We denote the number of timing sources by I. An example timing graph is shown in Figure 4.1. The timing sources are inputs a and b (I = 2), and the output e is the single timing endpoint.

1In practice non-error tolerant signals (e.g. control signals) must have sufficient slack to operate reliably at the overclocked frequency. 2This differs from conventional SSTA, where delay distributions are derived from device and interconnect delay variations.

Table 4.1: Path-Delay Distribution (uniform input transition probability).

Active Path             | Delay | Probability
b → c → d → e           | 3.0   | 0.125
a → d → e               | 2.0   | 0.1875
b → e                   | 1.0   | 0.3125
None (constant output)  | 0.0   | 0.375

Figure 4.1: Example circuit and timing graph with annotated delays.

We are interested in setup time (or maximum delay) analysis, as setup time limits the clock speed of circuits. For setup analysis STA calculates the delay of the longest (or critical) timing paths (Definition 5), by calculating the latest (worst-case) arrival time (Definition 1) of signals at each node in the graph (that is, the latest time the signal can become stable). In Figure 4.1, the path b → c → d → e is the critical timing path, with a delay of 3.0 units. Conventional STA always performs a robust analysis (never underestimating delay), but can be quite pessimistic in practice. There are two primary sources of pessimism: the use of worst-case delays and the assumption of worst-case switching behaviour (e.g. ignoring logic functions). SSTA [78] has been developed to address the pessimism introduced by worst-casing delay values, which becomes particularly problematic in the face of increasing device and interconnect variation. SSTA directly models the statistical variation of device and interconnect delays, calculating delay distributions rather than the fixed worst-case delays used by conventional STA. Directly modelling delay variations reduces pessimism since worst-case delay combinations (which are unlikely to occur) can be ignored.3 Both STA and SSTA assume worst-case switching behaviour, by assuming all signals switch every clock cycle. In real operation this assumption does not hold. The most obvious cases are false paths, timing paths which are impossible to exercise in practice. An example is shown in Figure 4.2. There has been a variety of previous work investigating these issues. In [101] a framework for comparing false path identification criteria is presented. [102] presents algorithms for detecting near-critical false paths. Algebraic Decision Diagrams (ADDs) are used in [103] to identify and eliminate false paths. [104] describes how Automatic Test Pattern Generation techniques can be applied to a transformed version of the circuit to identify false paths. In [105] a circuit’s true delay bounds are determined by checking whether critical paths are sensitizable using CNF-based SAT solvers. However, unlike this work, these works consider only false path identification and elimination. They do not consider the probability of (true) paths being activated. In [106], toggle rate information (similar to vector-less power estimation) is used to adjust the results of SSTA to reduce pessimism. However only the average toggling behaviour (i.e. across many cycles) is considered and the toggling of individual timing paths cannot be distinguished. The problem of re-convergent paths which cause correlations is also not addressed.

3In practice, the designer chooses an acceptable level of timing yield for their design, accepting the failure of some devices.

Figure 4.2: The path b → d → e is a false-path; the common sel means transitions on b can never propagate to e after the other paths have stabilized.

4.3 Extended Static Timing Analysis Formulation

To present the Extended Static Timing Analysis (ESTA) formulation we begin by defining some terminology. Definition 8 (Path Sensitization Probability) The probability of a timing path experiencing a transition during a single clock cycle.

By associating a delay with each path sensitization we can build path-delay distributions: Definition 9 (Path-delay Distribution) A set of paths, delays, and their associated sensitization probabilities.

Table 4.1 shows the path-delay distribution (determined by exhaustive timing simulation) for the circuit in Figure 4.1 assuming uniform input transition probability, where rise, fall, static high, and static low each have 25% probability (i.e. each case is equally likely in a given clock cycle). We observe that the longest path is active only 12.5% of the time, much less likely than other paths. This occurs when b undergoes a falling transition (p = 0.25), and a simultaneously undergoes either a static high or rising transition (p = 0.5). Interestingly no timing paths are active 37.5% of the time since the output (e) remains constant. A path-delay distribution is different from the delay distribution produced by SSTA. Under SSTA the delays are distributed according to a statistical delay model (i.e. due to device and interconnect delay variation), while in ESTA delays are distributed according to the probability of individual timing paths being activated.4 We can now define the ESTA problem which we focus on for the remainder of this chapter: Definition 10 (Extended Static Timing Analysis Problem) Determine the path-delay distributions at all timing endpoints.

4.3.1 Transition Model

To analyze timing path behaviour, we must model the signal transitions which can occur. We define a transition as: Definition 11 (Transition) The state change of a signal value between the start and end of a clock cycle.

The transitions we consider are shown in Table 4.2 which correspond to a transition mode [107] model with four types of transitions: R (rising), F (falling), H (high), and L (low).

4While we use a deterministic delay model for each timing graph edge in this work, it does not preclude the use of a statistical delay model.

Table 4.2: Signal Transitions. Initial/Final correspond to signal values at the start/end of a clock cycle.

Initial State | Final State | Transition
0             | 1           | R
1             | 0           | F
1             | 1           | H
0             | 0           | L

Table 4.3: AND Gate Transitions (output transition c for input transitions a and b).

       | a = R | a = F | a = H | a = L
b = R  | R     | L     | R     | L
b = F  | L     | F     | F     | L
b = H  | R     | F     | H     | L
b = L  | L     | L     | L     | L

Figure 4.3: AND gate.

As an example, consider the AND gate in Figure 4.3. For this simple circuit we can enumerate the possible output transitions, as shown in Table 4.3. Note that circuits may have glitches, where the signal value changes (potentially multiple times in a clock cycle) before settling to its final value. In ESTA, these temporary glitches are modelled by the logical transition which occurs and an appropriate delay. For example, Figure 4.4 shows a high signal which temporarily glitches before settling to low. We modelled this as a F transition (initial state high, final state low) with arrival time 1 + δ. Note that as a result of glitches, logically static transitions (i.e. H or L) are modelled as having a (potentially non-zero) delay (i.e. the delay after which they no longer glitch).
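A small self-contained sketch (illustrative names, not the ESTA tool’s code) of mapping input transitions to an output transition by evaluating a gate’s logic function at the initial and final input values is shown below; for a 2-input AND it reproduces Table 4.3 (e.g. inputs H and R give R):

#include <functional>
#include <vector>

enum class Trans { R, F, H, L };

bool initial_value(Trans t) { return t == Trans::F || t == Trans::H; }
bool final_value(Trans t)   { return t == Trans::R || t == Trans::H; }

// Evaluate the gate logic g() at the initial and final input values and
// classify the resulting output state change as a transition.
Trans output_transition(const std::function<bool(const std::vector<bool>&)>& g,
                        const std::vector<Trans>& input_transitions) {
    std::vector<bool> init, fin;
    for (Trans t : input_transitions) {
        init.push_back(initial_value(t));
        fin.push_back(final_value(t));
    }
    bool v0 = g(init), v1 = g(fin);      // output before and after the cycle
    if (!v0 &&  v1) return Trans::R;
    if ( v0 && !v1) return Trans::F;
    if ( v0 &&  v1) return Trans::H;
    return Trans::L;
}

// Example: auto and2 = [](const std::vector<bool>& v){ return v[0] && v[1]; };
//          output_transition(and2, {Trans::H, Trans::R}) == Trans::R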

4.3.2 Using #SAT to Calculate Probabilities

To calculate a path-delay distribution we need to determine the delays and sensitization probabilities of different timing paths. The delay of a path can be calculated by traversing the timing graph and adding up the annotated delays (as in conventional STA), but calculating path sensitization probabilities is more involved. We first define an activation function: Definition 12 (Activation Function) A Boolean function which evaluates to true whenever a path is sensitized by a transition.

Figure 4.4: ESTA models the glitch between t = 1 and t = 2 as a F transition with arrival time 1 + δ.

Note that for our purposes (as described in Section 4.3.1) it is necessary to model logically static values as transitions to capture the impact of glitches. For example, if we had two events A and B which together produced an event C, we could define the activation of event C as:

\[ f_c = f_a \wedge f_b \qquad (4.1) \]
where fa and fb are the activation functions for events A and B. fc is the logical conjunction of fa and fb since event C occurs only when both A and B occur together. Given a Boolean activation function f with support size |f| (i.e. the number of Boolean variables f depends on) we can calculate its sensitization probability as:

\[ p = \frac{\#SAT(f)}{2^{|f|}} \qquad (4.2) \]
where #SAT(f) represents the number of satisfying assignments to f, and 2^|f| represents the total number of possible assignments to a Boolean function of |f| variables. #SAT is a well established problem in theoretical computer science closely related to Boolean satisfiability (SAT) [108]. Where SAT seeks to find a satisfying assignment to a Boolean function, #SAT seeks the number of satisfying assignments. There are a number of algorithms for solving #SAT which are more efficient than naively enumerating the satisfying assignments with SAT [108]. Provided we can build appropriate activation functions for the paths under analysis we can use Equation (4.2) to calculate their sensitization probabilities. This generalizes previous work, in that a path with zero sensitization probability is by definition a false path.
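For illustration, Equation (4.2) can be evaluated by brute-force enumeration of the truth table (the ESTA tool instead counts minterms on a BDD); the following sketch uses a plain predicate over bit-packed assignments:

#include <cstdint>
#include <functional>

// Sensitization probability = #SAT(f) / 2^|f|, computed by enumeration.
double sensitization_probability(const std::function<bool(uint64_t)>& f,
                                 unsigned support_size) {
    uint64_t num_sat = 0;
    uint64_t num_assignments = uint64_t(1) << support_size;  // 2^|f|
    for (uint64_t assignment = 0; assignment < num_assignments; ++assignment) {
        if (f(assignment)) ++num_sat;                         // count satisfying assignments
    }
    return double(num_sat) / double(num_assignments);
}

// Example: f = x1 AND x2 (support size 2) has probability 1/4:
//   sensitization_probability([](uint64_t a){ return (a & 1) && (a & 2); }, 2);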

4.3.3 Combining Activation Functions

We can now define a timing tag, which intuitively corresponds to the delay of a transition along a particular path:

Definition 13 (Timing Tag) A tuple (τ, ν, f) ∈ ℝ × {R, F, H, L} × (B^Q → B), corresponding to a path and transition combination. τ is the arrival time, ν the signal transition, f the activation function, and Q is the support size of f.

For example, a tag (15, R, x1 ∧ x2) corresponds to a rising transition with an arrival time of 15 units, which occurs only when the Boolean function x1 ∧ x2 evaluates true. Since f is a general Boolean function encoding all the scenarios where the timing tag applies, false and re-convergent paths can be accounted for by appropriately constructing f.
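A timing tag maps naturally onto a small record; the sketch below (illustrative types, with the activation function shown as a plain predicate rather than the tool’s BDD representation) encodes the example tag (15, R, x1 ∧ x2):

#include <cstdint>
#include <functional>

enum class Trans { R, F, H, L };

struct TimingTag {
    double tau;                           // arrival time
    Trans  nu;                            // signal transition
    std::function<bool(uint64_t)> f;      // activation function over Q variables
};

// The example tag (15, R, x1 ∧ x2), with bit 0 = x1 and bit 1 = x2:
TimingTag example{15.0, Trans::R,
                  [](uint64_t x) { return (x & 1) && (x & 2); }};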

Consider the AND gate from Figure 4.3. If we have two timing tags ta and tb arriving at the gate inputs and a gate delay of δAND we can construct the corresponding output tag tc as:

\[ t_c = \big( \delta_{AND} + \max(t_a.\tau,\ t_b.\tau),\ \ AND(t_a.\nu,\ t_b.\nu),\ \ t_a.f \wedge t_b.f \big) \qquad (4.3) \]
where δAND + max(ta.τ, tb.τ) is the latest arrival time of a transition at the output, AND(ta.ν, tb.ν) is the resulting transition (e.g. determined from Table 4.3), and ta.f ∧ tb.f is the logical conjunction (AND) of the input activation functions.

More generally for a K-input gate with delay δgate implementing the logic function g(x1, x2, . . . , xK ) and incoming tags t1, t2, . . . , tK the output tag tgate can be defined as:

\[ t_{gate} = \big( \delta_{gate} + \max(t_1.\tau, t_2.\tau, \ldots, t_K.\tau),\ \ G(t_1.\nu, t_2.\nu, \ldots, t_K.\nu),\ \ t_1.f \wedge t_2.f \wedge \ldots \wedge t_K.f \big) \qquad (4.4) \]
where G(ν1, ν2, . . . , νK) is the transition function derived from the logic function g().5 Equation (4.4) produces an STA-like delay estimate which is a safely pessimistic upper-bound, ensuring no paths will be underestimated. The activation function is specified as the logical conjunction of all the incoming tag activation functions, since all the tags must arrive to generate the corresponding output transition and arrival time.

4.3.4 Conditioning Functions

Equation (4.4) describes how to propagate timing tags through a gate (or wire6), allowing us to construct timing tags – including their associated activation functions – by walking through the timing graph. However we still require some base activation functions at timing sources, and a method to specify their transition probabilities. We can accomplish this by defining a set of Boolean conditioning functions at each timing source. Intuitively these functions model the statistical behaviour of the circuit’s primary inputs. For the simplest case of uniform probability we can define the following conditioning functions:

\[ f_R(x, x') = \neg x \wedge x' \qquad f_F(x, x') = x \wedge \neg x' \qquad f_H(x, x') = x \wedge x' \qquad f_L(x, x') = \neg x \wedge \neg x' \qquad (4.5) \]
where intuitively x and x′ represent the current and next state of the source. Each function in Equation (4.5) corresponds to a particular transition occurring on the source. Note each conditioning function in Equation (4.5) is satisfied 25% of the time (e.g. #SAT(f_R)/2^{|f_R|} = 1/4), assuming uniformly random x and x′. This yields a uniform probability for each transition. In general arbitrary conditioning functions can be used, allowing for non-uniform probabilities and correlations between sources. This is discussed in more detail in Section 4.5. By combining Equation (4.4) with conditioning functions such as those in Equation (4.5) we can propagate timing tags from timing sources to timing endpoints. The probabilities of the tags at all timing endpoints can then be calculated using Equation (4.2) to construct the path-delay distributions – completing the ESTA analysis.

4.4 ESTA Implementation

We have developed a tool to perform ESTA.7 The tool is written in C++ and uses Binary Decision Diagrams (BDDs) [109] (via the CUDD library [110]) to represent the netlist logic and timing tag activation functions. BDDs allow easy manipulation of Boolean functions and enable #SAT to be solved efficiently once the BDDs are constructed [108].

5G() can be determined by evaluating g() twice; first at the initial and then at the final values of the input transitions (e.g. first 0, then 1 for a R input transition). 6Wires can be treated as single-input ‘gates’ implementing logical identity. 7The source code is available from github.com/kmurray/esta.

4.4.1 Calculating Timing Tags

The basic procedure to calculate a node’s output timing tags is shown in Algorithm 8. Provided with the set of tags arriving at each of the K inputs, we enumerate the Cartesian product of the input tag sets (Line 3) thereby ensuring all possible cases of input transitions and arrival times are considered.8 Lines 4 to 6 evaluate a specific set of InputTags (one tag per input) according to Equation (4.4). For each case the resulting tag is recorded (Line 7), and the full set of output tags returned (Line 8) for use by downstream nodes.

Algorithm 8 Basic ESTA Node Traversal
Require: Intags(1), . . . , Intags(K) sets of tags on each input, δ input to output delay, G node logic function
 1: function TraverseNode(Intags(1), . . . , Intags(K), δ, G)
 2:   OutTags ← ∅
 3:   for each InputTags ∈ Intags(1) × . . . × Intags(K) do
 4:     τ ← δ + Max(InputTags[1].τ, . . . , InputTags[K].τ)
 5:     ν ← G(InputTags[1].ν, . . . , InputTags[K].ν)
 6:     f ← InputTags[1].f ∧ . . . ∧ InputTags[K].f
 7:     OutTags.Append((τ, ν, f))
 8:   return OutTags

After all nodes in the timing graph have been processed the tags at all endpoints can be evaluated with #SAT to build the path-delay distributions.9

4.4.2 Input Filtering

Consider the timing diagram in Figure 4.5 for the AND gate from Figure 4.3. Initially, both inputs (a, b) are high, producing a high output (c). At t = 1 input a falls, producing a falling transition on the output c with some delay. The later transition on input b at t = 2 produces no change in the output c, since input a remained low controlling the output. In this case input a can be said to ‘filter’ transitions on input b. While Algorithm 8 handles these cases correctly, it does so in an unnecessarily pessimistic manner, always using the latest arrival time, even if the associated transition would be filtered and have no effect. To counteract this, as each input arrives we restrict the node logic function to the input’s post-transition value. We can then use Boolean difference to identify subsequently arriving input transitions which do not affect the output. Such input tags are ignored during arrival time calculation, removing the unnecessary pessimism. It should be noted that this approach maintains the monotone speed-up property10 necessary for a robust analysis [112]. While input filtering is performed in a delay dependent manner it is safe since:

• Decreasing the delay of an input transition can only increase the amount of filtering performed (since the signal would be stable earlier).

• Increased filtering can only decrease the delay of a resulting output transition (if an additional late arriving input transition was filtered).

• Input filtering never prevents a transition from propagating through the circuit.

Together these ensure that the delay of any timing path can never increase if a component delay decreases, and that all timing paths are considered during the analysis – none are ‘dropped’ in a delay-dependent manner.

Figure 4.5: Input filtering example. The arrival time at c is 1 + δ, but STA or a naive ESTA implementation will report 2 + δ.

8This can result in a large number of tags for nodes deep in the timing graph. Sections 4.4.2 to 4.4.4 discuss techniques to safely reduce the number of tags considered. 9While our implementation always performs a full timing analysis, it could be extended to perform an incremental timing analysis using the same techniques as conventional STA [111, 82]. 10Monotone speed-up ensures that decreasing the delay of circuit element(s) cannot increase the critical path delay.

4.4.3 Tag Merging

In the worst case Algorithm 8 can produce O(ℓ^K) output tags, where ℓ is the maximum number of tags across all K inputs. While the number of tags produced is often smaller in practice it can still grow large – particularly since the output tags of a node become the inputs to subsequent nodes.

To counteract this rapid growth, we can merge tags together. Suppose we have the tags t1, t2, . . . , tn with the same transition ν. We can produce a new tag tmerged which approximates the original tags:

\[ t_{merged} = \big( \max(t_1.\tau, t_2.\tau, \ldots, t_n.\tau),\ \ \nu,\ \ t_1.f \vee t_2.f \vee \ldots \vee t_n.f \big) . \qquad (4.6) \]

We ensure a safely pessimistic approximation by using the maximum arrival time, and only merging tags with the same transition.11 The activation function of the merged tag is the logical disjunction (OR) of the original tags, since any of the tags could produce this bounding transition. Note the approximation is exact if the original tags have the same arrival time. Our implementation always merges such tags.
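A minimal sketch of the merge in Equation (4.6) is shown below, again representing activation functions as plain predicates for illustration rather than the tool’s BDDs:

#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

using Activation = std::function<bool(uint64_t)>;
enum class Trans { R, F, H, L };

struct Tag { double tau; Trans nu; Activation f; };

// Merge tags that share the same transition: latest arrival time (pessimistic
// bound) and logical disjunction of the activation functions.
Tag merge_tags(const std::vector<Tag>& tags) {
    Tag merged = tags.front();                       // precondition: same 'nu' for all
    for (size_t i = 1; i < tags.size(); ++i) {
        merged.tau = std::max(merged.tau, tags[i].tau);
        Activation a = merged.f, b = tags[i].f;
        merged.f = [a, b](uint64_t x) { return a(x) || b(x); };  // OR of activations
    }
    return merged;
}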

4.4.4 Run-Time/Accuracy Trade-Offs

Although precise tag merging helps, for larger circuits the number of tags (and hence run-time) can grow prohibitively large. Accordingly we have also developed several different techniques to trade-off accuracy for reduced run-time. They all rely on Equation (4.6) to safely merge tags.

11Since the arrival time defines the point at which a signal is stable (i.e. after glitching), taking the maximum is safely pessimistic even for signals which glitch.

Fixed Delay Binning: Merges output tags with the same transition, and delay within the same delay bin. For a delay bin size of d, the delay bins are defined as [k · d, (k + 1) · d), k ∈ N. As an example, consider a set of five tags with the same transition and delays 75, 80, 110, 120, 155. Performing Fixed Delay Binning with d = 100 would produce two tags with delays 80 and 155.

Adaptive Binning: Merges input tags to limit the size of the Cartesian product evaluated at a node to at most m. This is performed by iteratively re-binning the node’s input tags at increasing bin-sizes, until the size of the Cartesian product drops below m.

Percentile Binning: Merges output tags which cannot generate s-th percentile critical paths. For example, s = 0.05 would ensure no merging occurred on the top 5% of critical paths, while all other tags would be merged. This allows a detailed analysis to be efficiently performed for the near-critical paths of interest. This approach requires conventional STA to be run first to determine the worst-case required times at each node, and infer which tags can be merged during ESTA without affecting the top s percentile of paths.

In practice these various techniques can be used in combination, and many other binning formulations could be developed.
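As an illustration of the Fixed Delay Binning scheme described above (illustrative types; activation functions omitted for brevity), tags sharing a transition and a delay bin can be collapsed to the latest-arriving representative:

#include <algorithm>
#include <cmath>
#include <map>
#include <utility>
#include <vector>

enum class Trans { R, F, H, L };
struct Tag { double tau; Trans nu; };  // activation functions omitted in this sketch

// Merge tags with the same transition whose delays fall in the same bin
// [k*d, (k+1)*d), keeping the pessimistic (maximum-delay) representative.
std::vector<Tag> bin_tags(const std::vector<Tag>& tags, double d) {
    std::map<std::pair<Trans, long>, Tag> bins;   // key: (transition, bin index)
    for (const Tag& t : tags) {
        long k = static_cast<long>(std::floor(t.tau / d));
        auto key = std::make_pair(t.nu, k);
        auto it = bins.find(key);
        if (it == bins.end()) bins.emplace(key, t);
        else it->second.tau = std::max(it->second.tau, t.tau);
    }
    std::vector<Tag> out;
    for (const auto& kv : bins) out.push_back(kv.second);
    return out;   // e.g. delays {75, 80, 110, 120, 155} with d = 100 -> {80, 155}
}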

4.4.5 Enhanced Algorithm

Algorithm 9 shows the procedure for traversing nodes including the enhancements from Sections 4.4.2 to 4.4.4. Lines 3 to 5 calculate the Cartesian product of tags, limiting the number of cases evaluated to at most m. To perform input filtering we first sort the K input tags of a single case by ascending order of arrival time (Line 10).12 We then check whether the transitions associated with each input can affect the output (Line 12). If a transition does affect the output (is not filtered) we ‘restrict’ the node’s logic function13 (Line 13), so the current input’s stable post-transition value will be considered when filtering later arriving inputs. Finally, we bin the resulting tags into bins of size d (Line 17). In our implementation activation function construction (Line 15) is performed lazily. As a result the cost of constructing activation function BDDs is incurred only during the final #SAT evaluations, and not during the graph traversal.

4.4.6 Computational Complexity

The computational complexity of our ESTA implementation is dominated by two components: evaluating tags, and performing #SAT.

The complexity of manipulating BDDs is O(2^{n_vars}), where n_vars is the total number of Boolean variables in the BDD. The size of a node’s Cartesian tag product is O(L^K), where K is the maximum number of inputs to any node, and L is the maximum number of tags on any node input. Sorting the input tags for filtering takes O(K log K) time, and manipulating the node’s logic function (as a BDD) takes O(2^K) time. The complexity of evaluating a single node is then O(L^K (2^K + K log K)). This must be done across all n nodes in the timing graph, taking O(n L^K (2^K + K log K)) time.

12This enforces causality by ensuring that only stable inputs can filter subsequently arriving transitions. 13This is a commonly supported operation in logic manipulation packages (such as CUDD [110]), which effectively fixes a variable to a constant value and propagates the constant through the function. For example, if the logic function was f(x1, x2, x3, . . . , xn) = x1 ∧ x2 ∧ x3 ∧ · · · ∧ xn, and x2 was fixed/restricted to 0, the restricted logic function would be f′(x1, 0, x3, . . . , xn) = 0.

Algorithm 9 Enhanced ESTA Node Traversal
Require: Intags(1), . . . , Intags(K) sets of tags on each input, δ input to output delay, G node logic function, m maximum Cartesian product size, d delay bin size
 1: function TraverseNode(Intags(1), . . . , Intags(K), δ, G, m, d)
 2:   Outtags ← ∅
 3:   AllCases ← Intags(1) × . . . × Intags(K)    ▷ Cartesian product
 4:   if |AllCases| > m then    ▷ Limit product size to m
 5:     AllCases ← ReduceProduct(AllCases, m)
 6:   for InputTags ∈ AllCases do
 7:     τ ← ∅
 8:     ν ← G(InputTags[1].ν, . . . , InputTags[K].ν)
 9:     f ← True
10:     SortAscendingArrival(InputTags)
11:     for InputTag ∈ InputTags do    ▷ Each node input
12:       if not Filtered(G, InputTag) then
13:         G ← Restrict(G, InputTag)    ▷ Constrain input
14:         τ ← Max(τ, δ + InputTag.τ)    ▷ Update delay
15:         f ← f ∧ InputTag.f    ▷ Update activation function
16:     Outtags.Append((τ, ν, f))
17:   Outtags ← ReduceTags(Outtags, d)    ▷ Delay binning
18:   return Outtags

To solve #SAT we must construct BDDs which in the worst case takes O(2^Q) time, where Q is the total number of Boolean conditioning function variables. The resulting complexity of ESTA is
\[ O\!\left(n L^K (2^K + K \log K) + 2^Q\right) . \qquad (4.7) \]

In typical circuits K is bounded by a small constant (e.g. 6), and the complexity simplifies to

\[ O\!\left(n L^K + 2^Q\right) . \qquad (4.8) \]

If we define q as the number of conditioning function variables per timing source then Q = qI. For the conditioning functions in Equation (4.5) q = 2 and the complexity becomes

\[ O\!\left(n L^K + 4^I\right) . \qquad (4.9) \]

L can be controlled using the methods in Section 4.4.4, and as a result run-time is typically dominated by BDD construction for #SAT. However in practice decreasing L also reduces #SAT run-time, since fewer BDDs covering more of the input space (which are typically simpler to encode) are required. While the time complexity of ESTA is high, this is fundamental to the problem of analyzing the logic behaviour of a circuit under all possible inputs. For instance, all known general false-path detection algorithms also have exponential time complexity, since false-path identification is reducible to SAT: an NP-complete problem. The ESTA problem (calculating the probability that timing paths are activated) is more general than the false-path identification problem (finding timing paths with zero activation probability), and therefore subject to the same theoretical limits. It should also be noted that the O(4^I) term represents the worst-case complexity of BDD construction, however it can be more efficient in practice – such as when using techniques like dynamic variable re-ordering [113].

4.4.7 Comparison with Exhaustive Simulation

It is informative to compare the complexity of ESTA and Exhaustive Simulation, which takes

\[ O\!\left(n\, 4^I\right) \qquad (4.10) \]
time. An important distinction between Equations (4.9) and (4.10) is the coefficient of the O(4^I) term. In exhaustive simulation the coefficient is n (circuit size), while in ESTA the coefficient is a constant. This results in a significant difference: exhaustive simulation’s complexity grows as the product of circuit size and the number of possible input combinations, while ESTA’s complexity grows as the sum of a term linear in circuit size and an O(4^I) term. Intuitively, ESTA achieves this lower complexity since it traverses the timing graph only once during its analysis. In contrast, exhaustive simulation must evaluate the whole circuit for each of the 4^I possible sets of input transitions. In addition, constructing BDDs to evaluate #SAT is far more efficient in practice than enumerating a vast input space.14

4.4.8 Per-Output & Max-Delay Modes

Our ESTA tool can be run in two possible operating modes as illustrated in Figures 4.6 and 4.7:

Per-Output Mode: Calculates a unique path-delay distribution for each timing endpoint. This is analogous to the maximum arrival time at each timing endpoint reported by STA.

Max-Delay Mode: Calculates a single path-delay distribution corresponding to the maximum delay across all timing endpoints. This is analogous to the Critical Path Delay (CPD) reported by STA. However the maximum path-delay distribution calculated may involve multiple timing paths, since different paths will be maximal for different sets of input transitions. This requires the ESTA tool to track which timing path (tag) is responsible for the maximum delay of each set of input transitions. To reduce this overhead, binning is applied when calculating the maximum delay distribution to reduce the number of unique tags.

It is important to note that the max-delay distribution cannot be calculated from the path-delay distributions produced in per-output mode, since the distributions contain only aggregate information and do not capture which path is maximal for a given set of primary input transitions.

4.5 Non-Uniform & Correlated Conditioning Functions

In traditional STA a design is typically analyzed with multiple delay models to capture the impact of delay variation at different ‘timing corners’, and in multiple operational ‘modes’ to model the behaviour of the design in different use cases [114]. Like traditional STA, ESTA can be performed using different combinations of delay models and operating modes. However, ESTA generalizes STA’s concept of a mode to include the transition probability distribution of the primary inputs. So far we have considered uniform and independent inputs, which are modelled with the conditioning functions in Equation (4.5). In general, inputs may have non-uniform probabilities and may not be independent. In practice, the probabilities and correlations could be determined from designer knowledge, simulation, the upstream circuit logic, or derived from word-level statistics [115]. In order to model these statistical characteristics we need to develop different conditioning functions. Intuitively, conditioning functions can be viewed as a form of pre-processing which transforms a set of uniform and independent Boolean variables (e.g. x and x′) into some output distributions. As shown in Figure 4.8, if unique independent variables are used for the conditioning functions associated with each primary input, then the resulting conditioning functions are independent. However, if some variables are shared across conditioning functions, as in Figure 4.9, the resulting conditioning functions may not be independent and could exhibit correlations.

14For example, exhaustive simulation of the clma benchmark requires evaluating > 10^249 sets of input transitions; counting the satisfying assignments is far faster.

Figure 4.6: Per-Output Mode: Path delay distributions calculated for each output.
Figure 4.7: Max-Delay Mode: Bounding path delay distribution calculated over all outputs.
Figure 4.8: Independent Condition Functions.
Figure 4.9: Non-Independent Condition Functions.

4.5.1 Non-uniform Conditioning Functions

As described by Equation (4.2), the probability of a particular transition is proportional to the number of satisfying assignments (i.e. minterms) of the associated conditioning function. To model non-uniform

x1  x2  x3 | f
0   0   0  | R
0   0   1  | F
0   1   0  | H
0   1   1  | L
1   0   0  | R
1   0   1  | L
1   1   0  | L
1   1   1  | L

fR(x1, x2, x3) = (¬x1 ∧ ¬x2 ∧ ¬x3) ∨ (x1 ∧ ¬x2 ∧ ¬x3)

fF(x1, x2, x3) = (¬x1 ∧ ¬x2 ∧ x3)

fH(x1, x2, x3) = (¬x1 ∧ x2 ∧ ¬x3)

fL(x1, x2, x3) = (¬x1 ∧ x2 ∧ x3) ∨ (x1 ∧ ¬x2 ∧ x3) ∨ (x1 ∧ x2 ∧ ¬x3) ∨ (x1 ∧ x2 ∧ x3)

Figure 4.10: Example Truth Table and conditioning functions satisfying P(fR) = 0.25, P(fF) = 0.125, P(fH) = 0.125 and P(fL) = 0.5. Note that the specific minterms assigned to each transition have no impact on the analysis results.

input probabilities, we can assign the minterms in proportion to the desired probabilities. However, care must be taken to ensure no minterms are shared between conditioning functions associated with the same primary input, as the transitions are mutually exclusive (e.g. R and F cannot occur simultaneously). Furthermore, by adding additional variables to the conditioning functions' support, probabilities can be assigned at finer granularity.

Figure 4.10 shows how a set of conditioning functions for a non-uniform primary input can be created. Each transition is assigned to a unique subset of the possible minterms, in proportion to the desired probability. Note that the order in which minterms are assigned to transitions has no functional impact, as only the number of minterms assigned to each transition determines its probability. As a result, there are many possible conditioning functions which satisfy a set of target probabilities.

The algorithm we use to construct non-uniform conditioning functions is shown in Algorithm 10. This approach tries to assign adjacent minterms to a particular transition. Since the minterms for a particular transition are grouped together, the conditioning function is easier to encode as a BDD.15 First, the conditioning functions are initialized on Lines 2 to 5. Next, the transitions are sorted (Line 7) so the highest probability transition is processed first. The number of minterms to assign to each transition is calculated on Line 10, based on the transition probability and the number of Boolean variables per input (q). The relevant minterms are then set in the conditioning function on Line 13, and the conditioning functions are returned on Line 15.

15Evaluation on 12 of the MCNC20 benchmarks showed this approach reduced ESTA run-time by 1.8× compared to a round-robin allocation, such as shown in Figure 4.10.

Algorithm 10 Grouped Minterm Allocation
Require: q the number of Boolean variables per primary input, P the desired probabilities of the R/F/H/L transitions
 1: function AllocateGroupedMinterms(q, P)
 2:     f[R] ← False
 3:     f[F] ← False
 4:     f[H] ← False
 5:     f[L] ← False
 6:     Trans ← (R, F, H, L)
 7:     SortDescendingProbability(Trans, P)
 8:     m ← 0                                   ▷ Current minterm number
 9:     for t ∈ Trans do
10:         M ← P[t] · 2^q                      ▷ Number of minterms to allocate
11:         for 1 ... M do
12:             fm ← GetMintermFunction(q, m)
13:             f[t] ← f[t] ∨ fm                ▷ Set minterm
14:             m ← m + 1
15:     return (f[R], f[F], f[H], f[L])
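A minimal sketch of this allocation is shown below. It is illustrative only: it represents each conditioning function as a set of minterm indices rather than a BDD, and the helper and variable names are not those of our implementation:

```python
# Illustrative sketch of Algorithm 10 (Grouped Minterm Allocation). Conditioning
# functions are modelled as sets of minterm indices instead of BDDs.

def allocate_grouped_minterms(q, probs):
    """q: Boolean variables per primary input; probs: dict mapping R/F/H/L to probability."""
    f = {t: set() for t in ("R", "F", "H", "L")}        # initially empty (False) functions
    trans = sorted(probs, key=probs.get, reverse=True)  # highest probability first (Line 7)
    m = 0                                               # current minterm number
    for t in trans:
        num_minterms = round(probs[t] * 2 ** q)         # minterms to allocate (Line 10)
        for _ in range(num_minterms):
            f[t].add(m)                                 # set minterm (Line 13)
            m += 1
    return f

# Example: the target probabilities of Figure 4.10 with q = 3 (8 minterms in total).
funcs = allocate_grouped_minterms(3, {"R": 0.25, "F": 0.125, "H": 0.125, "L": 0.5})
assert sum(len(s) for s in funcs.values()) == 2 ** 3    # the functions partition the minterms
```

Because each transition receives a contiguous block of minterm indices, the resulting functions encode compactly as BDDs, which is the motivation for the grouped allocation.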

4.5.2 Correlated Condition Functions

It is also possible to construct conditioning functions which capture correlations between inputs. In such cases the number of minterms shared between two conditioning functions describes their degree of correlation.

Consider the following joint probability specification between inputs a and b:

      aR     aF     aH     aL
bR    1/16   0      0      1/8
bF    1/16   0      1/8    0
bH    1/16   1/8    1/4    0         (4.11)
bL    1/16   1/8    0      0

where each entry corresponds to the probability of specific transitions occurring simultaneously on a and b.

In Equation (4.11) aR is uncorrelated with the transitions on input b, since it is equally likely to occur with any transition on b. In contrast, aF occurs only when b is undergoing an H or L transition, aH occurs only if b is undergoing an F or H transition, and aL occurs only if b is undergoing an R transition.
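A quick numeric check of these observations, computed directly from the joint probabilities of Equation (4.11), is shown below:

```python
# Verify, from the joint probabilities of Equation (4.11), that given aR every
# transition on b is equally likely (i.e. aR is uncorrelated with b's transitions).
joint = {  # joint[(a, b)] = probability of transition a on input a and transition b on input b
    ("R", "R"): 1/16, ("R", "F"): 1/16, ("R", "H"): 1/16, ("R", "L"): 1/16,
    ("F", "R"): 0,    ("F", "F"): 0,    ("F", "H"): 1/8,  ("F", "L"): 1/8,
    ("H", "R"): 0,    ("H", "F"): 1/8,  ("H", "H"): 1/4,  ("H", "L"): 0,
    ("L", "R"): 1/8,  ("L", "F"): 0,    ("L", "H"): 0,    ("L", "L"): 0,
}
p_aR = sum(p for (a, b), p in joint.items() if a == "R")        # marginal P(aR) = 1/4
cond_b_given_aR = {b: joint[("R", b)] / p_aR for b in "RFHL"}   # 1/4 for every transition on b
print(cond_b_given_aR)
```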

A set of conditioning functions which models this behaviour is:

faR(x0, x1, x2, x3) = (x0 ∧ x1 ∧ x2 ∧ x3) ∨ (x0 ∧ x2 ∧ x3) ∨ (x0 ∧ x1 ∧ x2 ∧ x3)

faF(x0, x1, x2, x3) = (x1 ∧ x2 ∧ x3) ∨ (x0 ∧ x1 ∧ x2)

faH(x0, x1, x2, x3) = (x1 ∧ x2 ∧ x3) ∨ (x0 ∧ x1 ∧ x2) ∨ (x0 ∧ x1 ∧ x3) ∨ (x0 ∧ x2 ∧ x3) ∨ (x0 ∧ x1 ∧ x2 ∧ x3)

faL(x0, x1, x2, x3) = (x0 ∧ x1 ∧ x2 ∧ x3) ∨ (x0 ∧ x1 ∧ x2 ∧ x3)                                   (4.12)

fbR(x0, x1, x2, x3) = (x0 ∧ x1 ∧ x2 ∧ x3) ∨ (x0 ∧ x1 ∧ x2 ∧ x3) ∨ (x0 ∧ x1 ∧ x2 ∧ x3)

fbF(x0, x1, x2, x3) = (x0 ∧ x1 ∧ x2) ∨ (x0 ∧ x1 ∧ x2 ∧ x3)

fbH(x0, x1, x2, x3) = (x1 ∧ x3) ∨ (x2 ∧ x3) ∨ (x0 ∧ x1 ∧ x2)

fbL(x0, x1, x2, x3) = (x0 ∧ x1 ∧ x2 ∧ x3) ∨ (x0 ∧ x1 ∧ x2)

Where, for example:

P(aH ∧ bF) = #SAT(faH ∧ fbF) / 2^4
           = #SAT((x0 ∧ x1 ∧ x2 ∧ x3) ∨ (x0 ∧ x1 ∧ x2 ∧ x3)) / 16          (4.13)
           = 2/16 = 1/8

as desired, since faH and fbF share the two minterms x0 ∧ x1 ∧ x2 ∧ x3 and x0 ∧ x1 ∧ x2 ∧ x3.

In general, constructing a set of conditioning functions that satisfies an arbitrary set of joint probabilities for a large number of inputs is challenging. This task can be formulated as a constrained minterm assignment problem and solved with Integer Linear Programming (ILP). However, this approach is computationally expensive, and as a result it is only practical for groups of fewer than 4 correlated circuit inputs. For details of the ILP formulation, and attempts to generate conditioning functions by Stochastic Synthesis, see Appendix A.
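The sketch below (with deliberately simple, hypothetical conditioning functions rather than those of Equation (4.12)) shows the underlying computation of Equation (4.13): count the satisfying assignments of the conjunction of two conditioning functions and divide by the size of the assignment space:

```python
# Toy #SAT-based joint probability computation in the style of Equation (4.13).
# The conditioning functions here are hypothetical and chosen so that they share
# exactly two of the 2^4 minterms.
from itertools import product

def count_sat(f, num_vars):
    """Number of assignments of num_vars Boolean variables satisfying f (brute force)."""
    return sum(1 for bits in product([False, True], repeat=num_vars) if f(*bits))

f_aH = lambda x0, x1, x2, x3: x0 and x1       # 4 minterms
f_bF = lambda x0, x1, x2, x3: x1 and not x2   # 4 minterms
joint = lambda *xs: f_aH(*xs) and f_bF(*xs)   # shared minterms: x0=1, x1=1, x2=0

p = count_sat(joint, 4) / 2 ** 4              # 2 / 16 = 1/8
print(p)
```

In the actual implementation the conjunction is represented as a BDD and the satisfying assignments are counted symbolically rather than by enumeration.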

4.6 Evaluation Methodology

To evaluate our ESTA implementation we compare it against post-place-and-route timing simulation performed with Mentor Graphics Modelsim SE 10.4c. The evaluation flow used is shown in Figure 4.11. We take in a circuit netlist which is mapped onto a 40nm 6-input Look-Up-Table (6-LUT) based FPGA using VPR [27] to generate an SDF file with both logic and routing delays. The SDF file is then used to annotate identical delays in both Modelsim and ESTA. To avoid unrealistic glitch filtering Modelsim was run using the transport delay model. We evaluate all tools on the 20 largest MCNC benchmarks [95] which are listed in Table 4.7. Any state elements (e.g. Flip-Flops) were replaced with primary inputs and outputs.16

16While our ESTA implementation does not perform clock-pessimism removal, the same techniques used in traditional STA could be applied [116, 117].

4.6.1 Monte-Carlo Simulation

While exhaustive simulation is useful for verification it quickly becomes impractical, since the number of cases to be simulated grows as Θ(4^I). To enable the evaluation of larger benchmarks, and provide a more realistic run-time comparison to ESTA, we also developed a Monte-Carlo (MC) based simulation framework for calculating path-delay distributions. It is important to distinguish between the strength of guarantees that MC and ESTA provide. ESTA guarantees it will always produce safely pessimistic upper bounds of path-delay distributions. MC cannot provide any such guarantees.

In the MC framework we uniformly generate random sets of input transitions to sample the input space. This sampling procedure was run for 48 hours on each benchmark. All quality comparisons are based on the larger and more accurate 48-hour sample, while run-times are determined by finding the smallest sample size which meets some convergence criterion. In order to define the MC convergence criteria it is helpful to define some additional metrics.

Definition 14 (Population Max Delay Probability) Let p be the probability of activating maximum delay paths across all input transition combinations.

In other words, p is the true probability of activating a maximum delay path (i.e. the 'population' statistic). The obvious way to calculate p is to exhaustively simulate all input transition combinations and count the fraction which activate a maximum delay path (i.e. exhaustive simulation). However, this is impractical for large circuits as the number of input combinations is far too large. To address this, we use MC to estimate p by sampling a subset of input transition combinations (i.e. simulating a smaller, non-exhaustive set of randomly selected input transition combinations).

Definition 15 (Sample Max Delay Probability) Let p̂ be the probability of activating maximum delay paths from a uniformly random sample of simulated transitions.

In other words, p̂ is an estimate of p based on a particular sample (i.e. p̂ is a 'sample' statistic). We can also calculate, for a particular sample, a confidence interval17 around p̂ (i.e. LB, UB such that p̂ ∈ [LB, UB]), where the size of the confidence interval (i.e. |UB − LB|) gives us insight into how 'good' an estimate p̂ is of p.18 As a result, we determine convergence as follows:

Definition 16 (MC Max-Delay Convergence) Given a sample with max-delay probability confidence interval [LB, UB] at α confidence, we say the sample has converged if (UB − LB)/p̂ < ∆p̂rel.

For instance, choosing α = 0.95 and ∆p̂rel = 0.05 corresponds to a 95% confidence that the relative error in p̂ is less than 5%. MC run-time is dependent on how α and ∆p̂rel are chosen, and their effect is evaluated in Section 4.7.3. The reported Monte-Carlo run-times include only the simulation run-time, and exclude the large amount of post-processing required to extract useful transition and delay information, and to determine convergence.
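A sketch of this convergence check is shown below. It assumes SciPy is available for the Clopper-Pearson interval; it mirrors Definition 16 but is not our exact evaluation code:

```python
# Sketch of the MC max-delay convergence check of Definition 16 using a
# Clopper-Pearson confidence interval (via SciPy's beta distribution).
from scipy.stats import beta

def mc_converged(num_max_delay, num_samples, confidence=0.95, delta_rel=0.05):
    """num_max_delay: samples that activated a maximum-delay path."""
    sig = 1.0 - confidence                                   # two-sided significance level
    k, n = num_max_delay, num_samples
    lb = 0.0 if k == 0 else beta.ppf(sig / 2, k, n - k + 1)  # Clopper-Pearson lower bound
    ub = 1.0 if k == n else beta.ppf(1 - sig / 2, k + 1, n - k)
    p_hat = k / n
    return p_hat > 0 and (ub - lb) / p_hat < delta_rel

# e.g. after 10^6 random input-transition samples, 10,000 of which hit a max-delay path:
print(mc_converged(10_000, 10**6))
```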

17We model the activation of a max delay path as a Bernoulli process and calculate the confidence interval using the Clopper-Pearson method [118], using the implementation from [119].
18This is not strictly accurate: the confidence interval defines the region where we would expect the p̂'s of different independently drawn samples to fall with some probability (i.e. the confidence level). There is no guarantee that the true 'population' p is actually contained within a particular confidence interval [120].

Table 4.4: Normalized Critical Path Delays with False Paths.

Benchmark   STA     MC      ESTA
s298        1.000   0.981   0.981
clma        1.000   0.959„  0.984
frisc       1.000   0.965   0.999*
elliptic    1.000   0.974
* ESTA upper-bound, „ MC optimistic

Figure 4.11: Evaluation Flow.

4.6.2 Metrics

To compare the Quality of Results (QoR) between STA, ESTA and MC we used the Earth Mover's Distance (EMD) metric [121], a measure of the 'distance' between two probability distributions (commonly used to compare image histograms). Intuitively, EMD considers each probability distribution as a pile of earth, and the EMD is the minimum amount of work required to transform one pile of earth into the other. This can be formulated as a flow transportation problem on a bipartite flow network, where the sources and sinks correspond to the discrete bins of the two probability functions being compared, and the cost to transport a flow between two bins is proportional to the distance between them. The value of the minimum cost (subject to the detailed constraints outlined in [121]) corresponds to the EMD.

To account for different critical path delays across benchmarks we normalize EMD by the EMD between MC and STA. The resulting normalized EMD ∈ [0, 1] describes how closely the MC distribution is approximated. A value of 1 corresponds to an STA-like analysis (worst-case switching behaviour), and a value of 0 corresponds to the MC distribution (assumed true switching behaviour).

For each benchmark we report the run-times for determining the maximum delay across all outputs, and for determining the delay distribution of each output. We also report the mean normalized EMD of the resulting path-delay distributions.
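The sketch below shows the normalized EMD computation on hypothetical path-delay distributions, using SciPy's one-dimensional Wasserstein distance as a stand-in for the flow-based formulation of [121]:

```python
# Normalized EMD between binned path-delay distributions: EMD(ESTA, MC) / EMD(STA, MC).
# The distributions below are hypothetical; SciPy's 1-D Wasserstein distance is used
# in place of the flow-network formulation of [121].
from scipy.stats import wasserstein_distance

def emd(dist_a, dist_b):
    """dist_*: list of (delay_ps, probability) bins."""
    da, pa = zip(*dist_a)
    db, pb = zip(*dist_b)
    return wasserstein_distance(da, db, u_weights=pa, v_weights=pb)

mc   = [(0, 0.13), (2000, 0.50), (4000, 0.33), (6230, 0.04)]  # assumed 'true' behaviour
esta = [(2000, 0.40), (4000, 0.40), (6230, 0.20)]             # safely pessimistic bound
sta  = [(6230, 1.0)]                                          # critical path delay in all cases

norm_emd = emd(esta, mc) / emd(sta, mc)   # 0 = matches MC, 1 = STA-like
print(norm_emd)
```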

4.7 Results

Using the experimental methodology from Section 4.6 and the implementation from Section 4.4 we perform several experiments to investigate the characteristics of ESTA. Unless otherwise noted ESTA was run with d = 100 ps, m = 10^4 and s = 0.0, which were found experimentally to offer good run-time/quality trade-offs as shown in Section 4.7.4. For MC we used α = 0.95 and ∆p̂rel = 0.05 unless otherwise noted. Results were collected on an Intel Xeon E5-2643v3 machine with 256GB of memory.

4.7.1 Verification

To verify the correctness of our ESTA tool we exhaustively compared with Modelsim on a set of small micro-benchmarks including simple logic circuits, ripple-carry adders and array multipliers. We verified for each possible set of input transitions that ESTA agreed with simulation on the resulting output transitions and that ESTA's delay estimate was an upper-bound of the simulation delay.

4.7.2 Maximum Delay Estimation with False Paths

By running MC, STA, and ESTA (with percentile binning) we can investigate the impact of false paths. While most circuits in the MCNC20 benchmarks produced the same critical path delay in all tools, the existence of false paths caused divergence on the benchmarks in Table 4.4. ESTA was able to identify the true critical path delay on the s298 and clma benchmarks, and confirm false paths exist on frisc.19 Notably, on clma MC reported an unsafe (optimistic) delay – indicating the worst-case path was never sampled. This illustrates the utility of ESTA's strong guarantees; it never underestimates delay.

4.7.3 Monte-Carlo Convergence

The specific criteria chosen for MC convergence (Definition 16) can affect whether MC converges and its associated run-time. Figure 4.12 illustrates the impact of varying these parameters on the run-time of the apex2 benchmark, a representative example. Increasing the confidence level (α) increases run-time, particularly as the confidence approaches 1.0.20 Decreasing the allowed relative error in max-delay probability (∆p̂rel) strongly increases run-time. Reducing the allowed relative max-delay probability variation from 5% to 2.5% causes run-time to more than triple, and to exceed 48 hours for confidences beyond 0.99.

Table 4.5 shows the impact, across all benchmarks, of varying ∆p̂rel and α on MC max-delay convergence. For relatively loose convergence criteria (∆p̂rel = 0.05 and α = 0.95), only 10 of the 20 MCNC benchmarks converge. Increasing the confidence level (α) increases run-time, but decreasing the allowed relative error in p̂ has a more substantial impact, causing large increases in the median run-time and reducing the number of converged benchmarks. At ∆p̂rel = 0.005 and α = 0.999, MC fails to converge on all benchmarks.

It should be noted that to equal ESTA's strong upper-bound guarantee (analogous to 1.0 confidence) MC effectively reverts to exhaustive simulation. Furthermore, all comparisons we make between MC and ESTA are performed with relatively loose MC convergence criteria (∆p̂rel = 0.05, α = 0.95), and both MC convergence and run-time would be much worse if tighter convergence criteria were used.

4.7.4 ESTA Run-time/Accuracy Trade-Offs

Figure 4.13 shows run-time/accuracy trade-offs for the binning methods described in Section 4.4.4. Increasing the fixed bin size (d) from 0 to 100 at m = ∞ speeds up ESTA by 2.2×, with only a 4% degradation in QoR. Beyond d = 100 the QoR degrades significantly for little or no run-time improvement. Decreasing m from ∞ to 10^6 has limited impact on QoR. Further decreasing m to 10^4 causes some impact, resulting in a 40% QoR degradation for a speed-up of 1.5× compared to m = ∞ at d = 10. At m = 10^2 speed-up compared to m = ∞ improves further (e.g. a further 1.8× compared to m = ∞ at d = 100), but at the cost of significant QoR degradation (e.g. a further 222% compared to m = ∞ at d = 100).

In general d = 100 and m in the range 10^4 to 10^6 provide good run-time/accuracy trade-offs. d is the most effective, and setting it to a small but non-negligible portion of the circuit delay ensures only timing

19ESTA exceeded memory limits on frisc due to the large number of nearly critical false paths, and exceeded 48 hours run-time on elliptic.
20Higher α causes the size of the calculated confidence interval to increase, slowing convergence.

Figure 4.12: MC max-delay run-time on apex2 for various convergence criteria.

Table 4.5: Impact of convergence criteria on MC Max-Delay run-time of the MCNC20 Benchmarks.

∆p̂rel    α       # Converged Benchmarks   Median MC Run-time (hours)
0.05     0.95    10                        1.09
0.05     0.99    10                        1.88
0.05     0.999   10                        3.07
0.025    0.95    10                        4.34
0.025    0.99    8                         3.62
0.025    0.999   8                         5.92
0.01     0.95    6                         11.60
0.01     0.99    5                         17.57
0.01     0.999   5                         28.71
0.005    0.95    3                         21.11
0.005    0.99    2                         33.13
0.005    0.999   0
Blank entries exceeded 48 hours run-time.

tags with similar delays are merged together. This reduces run-time with minimal QoR impact. m serves to limit ESTA's worst-case complexity by bounding the O(L^K) term in Equation (4.9). Setting m to a large, but finite value ensures designs with large numbers of convergent timing paths remain tractable, at the cost of some additional pessimism.

Figure 4.13: ESTA run-time/accuracy tradeoffs on the dsip benchmark in per-output mode.
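One plausible realization of these two controls (not ESTA's exact implementation) is sketched below: delays are rounded up to a d-wide bin edge, so the result remains a safe upper bound, and the number of tags is capped at m by pessimistically merging the smallest-delay tags:

```python
# Illustrative delay binning of timing tags. Each tag is a (delay_ps, probability) pair.
# Rounding delays up keeps the binned distribution a safe upper bound; the merge policy
# used when more than m tags remain is one plausible (pessimistic) choice.

def bin_tags(tags, d, m):
    binned = {}
    for delay, prob in tags:
        key = delay if d == 0 else -(-delay // d) * d     # round delay up to a bin edge
        binned[key] = binned.get(key, 0.0) + prob
    merged = sorted(binned.items())
    while len(merged) > m:                                # merge the two smallest-delay
        (d0, p0), (d1, p1) = merged[0], merged[1]         # tags into the larger delay
        merged = [(d1, p0 + p1)] + merged[2:]
    return merged

print(bin_tags([(120, 0.3), (180, 0.2), (410, 0.5)], d=100, m=2))
# [(200, 0.5), (500, 0.5)]
```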

4.7.5 ESTA Non-Uniform Conditioning Functions

As described in Section 4.5, ESTA can model non-uniform and correlated inputs with appropriate conditioning functions. Figure 4.14 illustrates the impact of different conditioning functions on the delay-probability Cumulative Distribution Functions (CDFs) of the pksi 83 d output of the dsip benchmark. The conditioning functions for the 'ESTA Non-Uniform' curves are constructed to satisfy randomly generated target transition probabilities. By changing the random number generator's seed, different target transition probabilities can be created. In this instance the non-uniform conditioning functions result in a sharper delay transition around 2000ps compared to the uniform conditioning functions. Notably, the different input probability distributions can result in significantly different delay behaviour; for instance the 99th-percentile delay for 'ESTA Non-Uniform Seed 2' is 2656ps, while for 'ESTA Uniform' it is 3235ps, 22% higher.

It is also important to consider the impact of conditioning functions on ESTA's run-time. A key factor is how many Boolean variables (q) are used to form the support of the conditioning functions of each input. Larger values of q allow probabilities to be specified at finer granularity, since the smallest probability that can be specified is 1/2^q. Table 4.6 shows the run-time impact of using the uniform conditioning functions from Equation (4.5), or randomly generated non-uniform conditioning functions, for various values of q. Increasing q has a strong effect on run-time, since the total number of Boolean variables is directly proportional to q. Increasing q to 4 and 6 allows probabilities to be specified with 6.25% and 1.56% granularity respectively. However this comes at a cost to run-time, with ESTA taking 3.2× and 7.4× longer on the common benchmarks which completed. Clearly, the smallest value of q which captures the non-uniform input

distribution should be used to minimize run-time.21

Figure 4.14: CDFs for the pksi 83 d output of the dsip benchmark for various conditioning functions. The 'ESTA Uniform' CDF corresponds to using the conditioning functions from Equation (4.5) for each primary input. Each 'ESTA Non-Uniform' CDF corresponds to a different set of randomly generated non-uniform conditioning functions used at the primary inputs. ESTA was run with d = 1 and m = ∞.

4.7.6 ESTA and MC Comparison

Figure 4.15 plots the maximum path-delay CDFs for the spla benchmark. STA, which assumes worst-case switching behaviour, produces a single maximum delay estimate of 6230ps (the critical path delay) for all cases (p = 1). In contrast MC, which directly simulates switching behaviour, produces a path-delay distribution, showing 13% of input transitions cause no delay (i.e. don’t affect the output), and only 4% of input transitions produce delays > 5000ps. ESTA is always safely conservative compared to MC and less pessimistic than STA, with its CDF always falling between MC and STA. The form of ESTA’s CDF follows the shape of MC’s CDF, with larger m producing a more accurate result. Table 4.7 quantitatively compares ESTA and MC.

Max-Delay Comparison

First, we consider the behaviour of ESTA and MC when evaluated in Max-Delay mode (Section 4.4.8), where a single path-delay distribution represents the maximum delay over all timing endpoints. For MC max-delay probability (p̂) we observe that benchmarks with relatively large max-delay probability tend to converge, while those with smaller probability (p̂ ≤ 10^-4) tend not to converge. A relatively high p̂ means max-delay paths are observed relatively frequently, and the confidence interval around p̂ shrinks quickly. In contrast, on benchmarks with relatively rare max-delay paths (e.g. p̂ ≤ 10^-4) MC often fails to converge within 48 hours. Intuitively, it is difficult to determine from a given sample if a small p̂ is caused by an inherently rare path, or by insufficient sample size; this causes MC to converge slowly.

21It should be noted that despite these slow-downs ESTA would still outperform MC (Section 4.7.6).

Table 4.6: Relative ESTA per-output run-time for various conditioning functions (d = 100, m = 10^4).

            Uniform   Non-Uniform   Non-Uniform
Benchmark   q = 2     q = 4         q = 6
ex5p        1.00      1.08          1.73
apex4       1.00      4.33          7.10
ex1010      1.00      1.26          1.87
s298        1.00      2.82          6.60
misex3      1.00      5.58          8.75
alu4        1.00      8.58          14.07
spla        1.00      1.34          7.30
pdc         1.00      2.91          12.27
apex2       1.00
seq         1.00      41.12
des         1.00      4.54          18.25
clma        1.00
tseng       1.00
diffeq
dsip        1.00      4.86          10.34
bigkey      1.00      6.23          12.00
frisc
elliptic
s38584.1    1.00      12.22
s38417
Geomean     1.00      4.37          7.42
Blank entries exceeded 48 hours run-time.

Figure 4.15: Maximum delay CDFs on the spla benchmark.

Now considering Max-Delay QoR for ESTA m = 10^4, we see ESTA's analysis falls between STA and MC, with normalized EMD ranging between 0.96 (nearly STA-like) and 0.33 (more MC-like), with a geometric mean of 0.63 across the benchmarks which completed. Increasing m to 10^6 (Max-Delay ESTA m = 10^6) reduces the geometric mean normalized EMD by 24% to 0.48 on the common benchmarks which completed.

The run-time performance of Max-Delay ESTA and MC is also shown in Table 4.7. Looking at Max-Delay Run-time for MC, it converges on only 10 of the 20 benchmarks within 48 hours, and fails to converge even on benchmarks with few inputs (e.g. s298). Max-Delay ESTA with m = 10^4 completed 11 of 20 benchmarks and shows more stable run-time, completing all of the benchmarks with 41 or fewer inputs and also the largest benchmark (in terms of logic), clma. For those benchmarks which completed under both MC and ESTA, ESTA m = 10^4 and m = 10^6 achieved geometric mean speed-ups of 164× and 16× respectively. For all benchmarks with > 415 inputs ESTA in Max-Delay mode exceeded 48 hours run-time during #SAT evaluation.

Per-Output Comparison

We can also evaluate ESTA and MC in Per-Output mode (Section 4.4.8), where a unique path-delay distribution is calculated for each timing endpoint.22 Since there are multiple path-delay distributions produced, we calculate the normalized EMD and report the arithmetic mean across all timing endpoints. For MC convergence we report the time required for the maximum delay of each timing endpoint to converge, according to Definition 16.

Table 4.7 also shows the Per-Output QoR; ESTA produces a more accurate analysis than in Max-Delay mode, with a geometric mean EMD of 0.28 and 0.25 for m = 10^4 and m = 10^6 respectively. Looking at Per-Output Run-time in Table 4.7 we can see that MC converges on only 6 of 20 benchmarks, and total run-time increased 4.3× compared to MC Max-Delay mode for those which converged. In comparison, in Per-Output mode ESTA completes more benchmarks: 16 and 13 for m = 10^4 and m = 10^6 respectively. On the common benchmarks which completed, ESTA's geometric mean speed-ups over MC were 569× and 109× for m = 10^4 and m = 10^6 respectively.

Discussion

The QoR gap between ESTA and MC is derived from four factors. First, binning (Section 4.4.4) introduces additional pessimism since the true distribution is approximated with fewer timing tags. Second, Modelsim performs more aggressive transition filtering than ESTA (Section 4.4.2). Since Modelsim simulates glitching behaviour it can filter transitions when some inputs have not yet stabilized to their final values. In contrast, ESTA does not directly model glitching behaviour, instead calculating upper bounds on the stable arrival times, and only filters based on the stable post-transition signal value. To illustrate this difference consider Figure 4.4. ESTA will filter based on the signal state only after t = 1 + δ (when the signal is guaranteed to be stable and glitch-free), while Modelsim will filter based on the signal state for the full time period (even before the signal has stabilized).23 Third, Modelsim optimistically treats simultaneous transitions at a gate input with no logical effect (e.g. simultaneous

22This is more indicative of how an ESTA-like tool would be used to analyze an overclocking-style design, since the acceptable error rates are likely to be endpoint dependent.
23ESTA's behaviour ensures it maintains the monotone speed-up property [112]. Modelsim's analysis does not satisfy the monotone speed-up property, since it filters transitions in a delay-dependent manner.

Table 4.7: Quality and Run-Time on the MCNC20 Benchmarks.

                               Max-Delay   Max-Delay QoR      Max-Delay Run-time         Per-Output QoR     Per-Output Run-time
                               Prob.       (norm. EMD)        (minutes)                  (mean norm. EMD)   (minutes)
                                           ESTA     ESTA               ESTA     ESTA     ESTA     ESTA               ESTA     ESTA
Benchmark    I      LUTs       MC p̂        m=10^4   m=10^6    MC       m=10^4   m=10^6   m=10^4   m=10^6    MC       m=10^4   m=10^6
ex5p         8      740        3.9·10^-3   0.73     0.37      151.7    0.5      4.6      0.40     0.33      305.1    0.7      5.6
apex4        9      970        5.9·10^-3   0.85     0.74      118.1    1.5      8.7      0.63     0.48      712.0    1.2      8.7
ex1010       10     3,093      2.7·10^-3   0.96     0.85      922.1    5.8      40.4     0.84     0.65               10.1     127.4
s298         11     1,301                  0.87     0.79               5.7      59.4     0.54     0.45               15.7     118.6
misex3       14     1,158      1.5·10^-3   0.66     0.46      781.3    5.2      54.9     0.41     0.34               4.4      68.3
alu4         14     1,173      2.5·10^-3   0.83     0.46      473.6    2.4      58.9     0.58     0.44      1,966.4  1.8      62.3
spla         16     3,005                  0.40     0.25               6.3      207.1    0.26     0.19               12.7     108.3
pdc          16     3,627                  0.56     0.31               11.3     118.4    0.39     0.25               15.5     282.7
apex2        39     1,478      6.9·10^-4   0.33               2,068.2  87.1              0.31                        80.1
seq          41     1,325                  0.34                        57.9              0.19     0.15               5.7      164.5
des          256    554        1.8·10^-2                      60.7                       0.27     0.23      1,101.8  0.6      2.8
clma         415    6,239                  0.92                        319.2             0.13                        129.4
tseng        436    798                                                                  0.15                        87.0
diffeq       440    871
dsip         460    880        3.5·10^-2                      49.8                       0.15     0.14      56.1     0.5      0.6
bigkey       494    883        1.2·10^-2                      150.6                      0.17     0.16      151.2    0.4      0.5
frisc        905    3,028
elliptic     1,252  2,135
s38584.1     1,332  4,486      3.2·10^-3                      2,829.4                    0.09     0.07               26.3     111.8
s38417       1,545  3,465
Common
Geomean                        3.0·10^-3   0.80     0.55      360.8    2.2      22.1     0.31     0.27      398.3    0.7      3.6

'Common Geomean' is the geomean of the subset of benchmarks which completed in MC, ESTA m = 10^4 and ESTA m = 10^6 in the same mode. Note this means Max-Delay and Per-Output geomeans are not comparable. Blank entries exceeded 48 hours run-time.

R/F in Table 4.3) as producing no output transition. To remain safely pessimistic ESTA models these with an appropriate R/F transition. Fourth, on some benchmarks MC fails to converge (e.g. s298), meaning our assumed ground-truth may not actually be the ground-truth. In such cases, if ESTA's calculated path-delay distributions are closer to the real ground-truth than MC's, the reported normalized EMD may be pessimistic.

It is also interesting to note that MC performs worse (converges on fewer benchmarks and runs slower) when run in Per-Output mode, while ESTA exhibits the opposite behaviour. For MC, Per-Output mode requires that all timing endpoints converge according to Definition 16, while MC Max-Delay mode requires only that the maximum delay converges. Thus Per-Output is a stricter convergence criterion, since it requires potentially non-maximal delay outputs to converge. For ESTA, Per-Output mode is faster since it does not require tracking the covered input transitions, and produces better QoR since it avoids the final stage of tag merging and binning used to compute the maximum distribution over all timing endpoints in Max-Delay mode (Section 4.4.8).

Overall, these results show that ESTA can be used to quickly calculate bounding path-delay distributions on most of the circuits, including some with hundreds or thousands of primary inputs. This would enable automated analysis of practical design components such as the 12-input overclocked designs considered in [25].

4.8 Conclusion

In conclusion, we have presented ESTA, a new timing analysis method which accounts for non-worst-case switching behaviour, calculating safe bounding path-delay distributions over all input combinations. We showed how the sensitization probability of timing paths can be calculated using #SAT, allowing path-delay distributions to be constructed. We presented a BDD-based implementation, including approaches to improve accuracy and scalability. We also described how non-uniform probabilities and correlations can be modelled in ESTA.

Our experimental comparisons of ESTA and Monte-Carlo timing simulation showed ESTA on average runs 16 to 569× faster while achieving results within 25 to 63% of Monte-Carlo, representing a 75 to 37% reduction in pessimism compared to STA. ESTA is more general than the previous state of the art for calculating path delay distributions (manual analysis by hand), and is more scalable than Monte-Carlo timing simulation while performing a more robust analysis with stronger correctness guarantees.

ESTA's high complexity does limit its worst-case scalability, and analysis of large-scale designs remains intractable, as it does for all known general methods that analyze input-dependent circuit behaviour. However, ESTA is suitable for use on high value sub-circuits where the additional design effort, pessimism reduction and strong correctness guarantees are warranted.

4.9 Future Work

There are a variety of directions for future work. The key algorithmic challenge for ESTA is scalability, with #SAT being the main run-time bottleneck in our implementation. One potential approach is to simplify the transition model from the 4-state model ('R', 'F', 'H', 'L') used in this work to a simpler 2-state model (e.g. 'Static', 'Switch'). This would reduce ESTA's complexity to O(nL^K + 2^I), but would likely increase pessimism since filtering (Section 4.4.2) would become less effective.

There have also been a variety of works proposing accuracy/guarantee trade-offs while approximating the solution to #SAT [108]. These approximations could substantially reduce the time required to perform #SAT provided some potential for error in the analysis is acceptable, or #SAT can be approximated pessimistically (e.g. only ever safely over-estimating activation probabilities). Whether these accuracy/guarantee/complexity trade-offs are worthwhile requires further investigation. Another avenue for investigation is using CNF-based solvers instead of BDDs for solving #SAT, which could be more efficient for some problems, although they are subject to the same worst-case complexity bounds as BDDs (#SAT is a #P-complete problem). However, prior work has found that in practice some #SAT problems are easier to solve with BDDs, and others with CNF-based solvers, indicating that hybrid techniques which combine both may be a promising approach.

It would also be interesting to compare ESTA's efficiency at identifying false paths with other dedicated false path detection algorithms. Improved run-time/quality trade-offs (Section 4.4.4), particularly those that actively consider the impact on quality, would also improve results. For instance percentile binning, which analyzes only the most critical paths, warrants further investigation.

There are also open questions driven by the application of ESTA. While it is possible to model arbitrary correlations in ESTA, it is not clear how to automatically construct such conditioning functions in an efficient manner. It is also not clear how best to model the switching behaviour of state elements like Flip-Flops. ESTA's approach of analyzing circuit behaviour over the input combination space using #SAT could potentially also be applied to areas such as vector-less power analysis.

It would also be useful to combine both ESTA and SSTA (i.e. a statistical delay model). In this sense ESTA and SSTA are complementary, and combining them would allow determination of the path-delay distribution while considering both device-level delay variation (SSTA) and input combination variation (ESTA) simultaneously. The use of statistical delays may impact how pruning should be performed in ESTA, but the traditional maximum delay operations used in SSTA should still be applicable.

Finally, since ESTA enables automated evaluation of the path-delay distribution associated with different circuit implementations, it could be used to drive design optimization. This would require new metrics to guide optimization which combine path delay and path activation probability information. Potential metrics could include a probabilistically weighted slack, or an error-rate slack. Such metrics would allow designers and optimization tools (such as technology mapping or placement) to directly optimize a design's error-rate based on its physical implementation.
This would enable new trade-offs between delay, path sensitization probability, and other circuit characteristics which would have interesting applications to approximate computing techniques such as overclocking.

Part II

Flexible and Efficient FPGA CAD

Chapter 5

VTR Introduction

Developing Field Programmable Gate Array (FPGA) architectures is challenging due to the competing requirements of various application domains, and changing manufacturing process technology. This is compounded by the difficulty of fairly evaluating FPGA architectural choices, which requires sophisticated, high-quality Computer Aided Design (CAD) tools to target each potential architecture. This part of the thesis describes enhancements and extensions to the open source Verilog To Routing (VTR) project (included in version 8 of VTR), which provides such a design flow. These extensions expand the scope of FPGA architectures which can be modelled, allowing VTR to target and model many details of both commercial and proposed FPGA architectures. The VTR design flow also serves as a baseline for evaluating new CAD algorithms. It is therefore important, for both CAD algorithm comparisons and the validity of architectural conclusions, that VTR produce high-quality circuit implementations. Many of these enhancements significantly improve optimization quality (reductions of 15% minimum routable channel width, 41% wirelength, and 12% critical path delay), run-time (5.3× faster) and memory footprint (3.3× lower). Finally, we demonstrate that with these enhancements VTR is run-time and memory footprint efficient, while producing circuit implementations of reasonable quality compared to highly-tuned architecture-specific industrial tools – showing that architecture generality, good implementation quality and run-time efficiency are not mutually exclusive goals.

5.1 Introduction

The application domains targeting FPGAs and the underlying manufacturing process technology used to build FPGAs both change rapidly. As a result FPGA architectures cannot remain static – they must evolve to optimize their efficiency for the target process technology, and both legacy and emerging application domains. Furthermore novel architectural ideas to improve FPGAs are constantly being proposed. Being able to quantitatively compare different architectural ideas is key to developing a high quality FPGA architecture. However performing these comparisons is challenging, as it requires well-optimized implementations for a suite of representative benchmark circuits across many architectural variations, along with accurate area, delay and power estimates. This requires a full high quality automated design flow to target each architecture. Since creating such optimization algorithms for a specific FPGA architecture is itself a complex research problem [122], it is impractical to develop unique optimization algorithms for every

architectural variant. Similarly, accurately estimating the delay, area and power of a completed design implementation on an FPGA also requires sophisticated algorithms and tools [6, 123, 124, 125]. The approach pioneered by Versatile Place and Route (VPR) [46, 126] and the Verilog To Routing (VTR) project [26, 27]1 is to build highly flexible Computer Aided Design (CAD) tools and optimization algorithms which can target and adapt to a wide variety of FPGA architectures. This allows FPGA architects to efficiently evaluate different architectural choices while ensuring well optimized implementations are produced for each benchmark circuit. This facilitates fair comparisons across architectures.

We introduce a variety of extensions and enhancements to the open-source Verilog To Routing project (included in version 8.0 of VTR).2 The main contributions include:

ˆ Extending VTR’s FPGA architecture modelling capabilities to capture a broader range of FPGA architectures,

ˆ An improved capture of the Intel Stratix IV FPGA architecture,

ˆ Enhancing VTR’s interoperability with other tools and design flows,

ˆ More complete and flexible timing analysis,

ˆ Significant improvements to optimization quality (41% wirelength, 12% critical path delay, and 15% minimum routeable channel width reductions on the VTR benchmarks), robustness (completes 9 more large Titan benchmarks), run-time (5.3× faster on the VTR benchmarks, reducing average run-time on the large VTR designs from over 35 minutes to less than 7 minutes, and from 3.4 hours to less than 40 minutes on the larger Titan benchmarks) and memory usage (3.3× lower on the VTR benchmarks), and

ˆ A detailed evaluation comparing VPR to Intel’s commercial Quartus CAD flow which shows that VPR 8, despite its flexibility, is competitive in run-time and reasonable in optimization quality compared to a highly-optimized architecture-specific tool.3

In combination, these enhancements mean a wider range of architectures can be explored with both more detailed analysis and higher quality optimization, which increase confidence in the resulting architectural conclusions. The more flexible architectural representation also means it can model and target commercial FPGA devices, and program them through architecture-specific bitstream generators which translate VTR’s implementation results into bitstreams (discussed in detail in Chapter 12). This part of the thesis is organized as follows:

ˆ the remainder of this chapter (Chapter 5) presents related work,

ˆ Chapter 6 introduces the VTR CAD flow and its enhancements,

ˆ Chapter 7 describes enhancements to VTR’s architecture modelling capabilities, and illustrates how these features enable an improved capture of Stratix IV FPGAs.

The following chapters then describe algorithmic and engineering improvements to the CAD flow:

1We use VPR to refer to the architecture modelling and place & route tool, and VTR to refer to the design flow including ODIN, ABC and VPR (Chapter 6). 2VTR 8 can be downloaded from verilogtorouting.org. 3Note that since VTR 7 [27], the VTR and VPR version numbers have been aligned. Therefore VPR 8 refers to the version of VPR used with VTR 8, and VPR 7 to the version used in VTR 7.

ˆ Chapter 8 studies improvements to packing and placement,

ˆ Chapter 9 investigates improvements to routing,

ˆ Chapter 10 discusses improvements to logic optimization and technology mapping, timing analysis and addresses software engineering concerns, and

Finally, Chapter 11 performs a detailed evaluation of the enhanced VTR 8 CAD flow.

5.2 Related Work

Related work falls into two broad categories: FPGA CAD and architecture exploration frameworks, and research which builds upon and compares to VTR.

5.2.1 FPGA CAD & Architecture Exploration Frameworks

There have been a variety of FPGA-related frameworks produced which target different aspects of the design implementation and architecture exploration space. Commercial FPGA vendors have internal tools which they use to develop and architect their next generation devices. For instance, Intel/Altera use their FPGA Modelling Toolkit, originally based on a heavily modified version of VPR, to explore new FPGA architectures [127]. However these tools are closed-source and proprietary – making them unavailable to other researchers.

Torc [128] and RapidSmith [129, 130] provide infrastructure for manipulating the design implementation of commercial Xilinx FPGAs, but both rely on the XDL interface provided by Xilinx's legacy ISE tools. More recently, RapidWright has enabled low-level access to the design implementation of more recent Xilinx FPGAs [131]. However these tools can only target fixed devices from a single manufacturer, making it impractical to perform architecture research within these frameworks. Furthermore, while these tools provide the infrastructure to build CAD tools targeting these devices, they provide simple proof-of-concept algorithms for optimization stages like placement and routing (e.g. non-timing driven, conflict unaware) – making them unsuitable for architecture evaluation and as baselines for comparing CAD algorithms. Independence [132] is a very general placement and routing tool for evaluating FPGA architectures, but it runs more than 4 orders of magnitude slower than VPR – too slow to allow broad exploration of architectures with large-scale benchmarks.

5.2.2 Research Using VTR

VTR is commonly used as a baseline upon which other tools and research build. These works broadly fall into three categories: FPGA Architecture, CAD algorithms and hybrid design flows. VTR’s ability to target a wide range of novel FPGA architectures, combined with its adaptive optimization algorithms make it a common choice for exploring and evaluating FPGA architectural ideas. Some examples include:

ˆ Introducing new hard blocks to accelerate dynamic memory access [133],

ˆ Evaluating the trade-offs between Look-up Table (LUT) size and logic block size [134], and LUT fracturability and logic block routing connectivity [135].

ˆ Exploring new logic elements such as AND-Inverter Cones [136, 137], NAND-NOR Cones [138], dedicated Muxes [139], and reduced flexibility LUTs [140, 141]

ˆ Optimizing logic blocks for arithmetic operations [142, 39],

ˆ Developing routing architectures which exploit mixtures of routing wire lengths and complex switch block patterns [143],

ˆ Quantifying the impact of 3D stacking [144] and interposer [145] technologies,

ˆ Investigating the efficiency of different RAM block architectures [146], and implementation tech- nologies [147],

ˆ Exploring the impact of different semiconductor technologies including DRAM [148], threshold logic [149], and non-volatile technologies like MTJ [150], PCM [151] and RRAM [152],

ˆ Evaluating power reduction methods like charge recycling [153], power gating [154] and dynamic voltage scaling [155], and

ˆ Developing uncloneable and secure FPGAs [156].

VTR’s CAD algorithms strike a effective balance between flexibility and optimization quality. This makes VTR a common infrastructure for exploring and evaluating new CAD algorithms and optimization techniques. These include new or enhanced algorithms which aim to improve run-time/quality trade offs during packing [157, 158], placement [57, 158, 159, 160, 5], and also include cross-stage techniques like multi-clock timing optimization [48] and CAD parameter auto-tuning [161]. Other approaches include reducing run-time through parallel processing during packing [157], placement [71, 70, 72], routing [75, 74, 76, 162] and timing analysis [6]. VTR has also been used to explore novel CAD techniques like device-aging aware routing [163] and the insertion of debug logic into unused resources [164]. In addition to architecture and CAD research, VTR has also been used to create various hybrid flows which often involve targeting or modelling fabricated FPGA devices and often mix commercial and open-source tools. Some examples include:

ˆ Interfacing with Intel/Altera tools for front-end HDL synthesis [165, 10],

ˆ Programming Xilinx FPGAs by interfacing with Xilinx’s tools [166, 167],

ˆ Automatically generating standard cell realizations of VTR generated FPGA architectures [168, 169, 170, 171, 172], and

ˆ Programming fabricated FPGAs using device-specific bitstream generators (which translate VTR’s circuit implementation into bitstreams) targeting: realizations of VTR generated FPGA architec- tures [173, 18, 19, 171, 168, 170, 169, 174, 156], non-VTR generated FPGA architectures [172, 175], and commercial QuickLogic [176], Xilinx & Lattice FPGAs [177].

The VTR code base has also been used by commercial companies such as Intel/Altera and Texas Instruments, as well as numerous FPGA start-up companies (such as Efinix [178]) – often forming the basis of their internal place and route tools. Given the many uses of VTR, improving the flexibility of the tool flow, the quality of its results and the software engineering of its code base is very important.

Previous work [10, 167] has shown significant gaps in quality and run-time between commercial and open-source/academic FPGA design flows. In addition FPGA sizes continue to grow, putting increasing pressure on CAD flow run-time to allow efficient design of large-scale systems. This has led some to question whether architecture-flexible tools (like VTR) can be adapted to target commercial devices while generating high quality implementations in reasonable run-time – or whether specialized device-specific tools (either commercial [179, 180] or open-source [181, 182]) are required. We will show that architecture generality, good implementation quality and efficient run-times are not mutually exclusive goals. In the following chapters we will show how to improve architecture modelling flexibility, along with optimization quality and run-time – showing how these competing goals can be well balanced.

Chapter 6

VTR Design Flow

In this chapter, we describe the VTR design flow and our extensions and enhancements of it. These enhancements improve the design flow's flexibility and interoperability, opening up many new potential flows and flow variations. These capabilities will also be exploited in Chapter 12 to support targeting commercial FPGAs.

6.1 VTR Design Flow Overview

The VTR design flow and several of its variants are shown in Figure 6.1. It provides a full design implementation flow which maps an application onto a target FPGA architecture. Importantly, and unlike commercial FPGA CAD flows, the VTR design flow is flexible and can target a wide range of FPGA architectures. This makes it a useful design flow for evaluating and comparing different FPGA architectures. The VTR flow takes two primary inputs:

ˆ a human-readable FPGA architecture description, and

ˆ an application benchmark described in behavioural Hardware Description Language (HDL).

The design flow will build a model of the specified FPGA architecture and map the given application onto it. The design flow proceeds in several stages. First, the behavioural HDL is elaborated and synthesized into a circuit consisting of soft logic and FPGA architectural primitives (e.g. FFs, multipliers, adders) using Odin II [183]. Second, the circuit logic is passed to ABC [184, 185] which performs technology independent combinational and sequential logic optimizations, and then technology maps the soft logic to LUTs. Third, the resulting technology-mapped circuit netlist is then passed to VPR which generates the physical implementation of the circuit on the target FPGA architecture.1 VPR first builds a detailed in-memory model of the target FPGA device based on the FPGA architecture description. VPR then [38]:

ˆ Packs the circuit netlist into the blocks available on the FPGA device (e.g. Logic/DSP/RAM Blocks).

1For details on the file formats used by VTR see the documentation at docs.verilogtorouting.org.


Figure 6.1: VTR-based CAD Flow Variants

ˆ Places the clustered blocks onto valid grid locations.

ˆ Routes all connections in the netlist through the FPGA’s interconnect network.

At each stage VPR optimizes the implementation for area and speed. Finally, VPR analyzes the resulting circuit implementation to produce area, speed and power results, and a post-implementation netlist. The tools which make up the VTR design flow can be re-used and re-combined. An increasingly common use case is to use portions of the VTR CAD flow. These use cases often relate to targeting fabricated FPGA devices or mixing open source and proprietary tools. As a result, improving the ability to mix-and-match tools and increase the flexibility of the design flow is an area we have focused on enhancing in VTR 8. For example, the Titan Flow [165] uses Intel’s Quartus [179] FPGA CAD tool to perform logic synthesis, optimization and technology mapping. Through the use of a conversion tool (VQM to BLIF) it is then possible to use the resulting technology mapped netlist with VPR. Other synthesis tools like Yosys [186] can also be used, provided they generate technology mapped netlists with primitives matching the targeted FPGA architecture.

6.2 Design Flow Enhancements

VTR 8 extends the capabilities of the design flow compared to previous releases. These enhancements make it more flexible, and improve interoperability with other tools.

Complete External Implementation State and Flexible Analysis

In addition to support for loading a design's packing or placement from external files, VPR 8 adds support for also loading routing from an external file. This enables the complete design implementation

Figure 6.2: Routing Resource Graph: (a) Routing Architecture; (b) Corresponding Routing Resource Graph. Dotted arrows represent switch-block switches, while dashed arrows represent connection-block switches.

(labeled in Figure 6.1) to be loaded into VPR from external files. This allows individual optimization stages to be run independently, and allows other tools to modify the design implementation. This facilitates a variety of alternative design flows, as shown in Figure 6.1. For instance, tools other than VPR can perform some or all of the design implementation, while still making use of VPR’s device modelling or area, timing and power analysis features. Another use case is to re-analyze a fixed design implementation under different operating conditions, such as different supply voltages [155].

External Routing Resource Graph

The detailed routing architecture is a key component of any FPGA. VPR models the detailed routing architecture using a Routing Resource Graph (RR-Graph), which represents conductors (wires and pins) as nodes, and the switches between them as edges. The RR-Graph is a key abstraction which allows VPR’s optimization and analysis engines to easily target a wide variety of FPGA routing architectures. Figure 6.2 illustrates how a particular detailed routing architecture can be modelled by an RR-Graph. Historically, VPR has generated RR-graphs internally from a high-level specification provided in an FPGA architecture description file [187]. Using a parameterizable RR-Graph generator has been very productive for FPGA research as it allows rapid exploration and evaluation of a wide range of routing architectures. While VTR 8 significantly extends the capabilities of VPR’s RR-Graph generator (Section 7.1.2), this approach has limitations when targeting pre-fabricated FPGA devices. To address these limitations VPR 8 adds support for loading the targeted device’s RR-Graph from an external file.2 This yields several advantages:

ˆ First, it enables an RR-Graph to be loaded which exactly matches a fabricated device. This is crucial, as any disagreement between the CAD tool’s internal device model and the physical device may result in invalid or functionally incorrect bitstreams.

ˆ Second, it allows VPR to target routing architectures with features which are either difficult to specify, or not supported by VPR’s RR-Graph generator.3 Provided an RR-Graph describing such an architecture can be generated (e.g. by another tool), it can be targeted by VPR.

ˆ Third, it decouples the RR-Graph targeted by VPR from VPR’s RR-Graph generation algorithm. The routing architecture specification provided to the RR-Graph generator is under-specified, and hence does not uniquely describe a particular architecture. This is highly beneficial since it is easy

2The RR-Graph is a low-level detailed description. This makes it highly flexible, but generally not practical to create by hand. Typically it would be created by another program or a layout extractor. To aid in this process, VPR can generate an RR-Graph file from its internal RR-graph representation. VPR also performs sanity checks to ensure the loaded RR-Graph is reasonably structured. 3VPR performs basic sanity checks to ensure the provided RR graph is well formed.

for the architect to specify, but means the RR-Graph generator has significant freedom to choose how it constructs the routing architecture. As a consequence, the generated routing architecture is dependent on the exact RR-Graph generation algorithm used, and changes to this algorithm (e.g. to add support for new features, or improve the quality of the generated architecture) may change the resulting routing architecture.

ˆ Fourth, the RR-Graph becomes an interchange format describing the FPGA’s routing architecture. When committing a VPR generated routing architecture to silicon, the RR-Graph becomes the specification the physical layout must implement. Alternately, an RR-Graph produced from a physical layout becomes the device specification CAD tools (e.g. VPR or other tools like bitstream generators) target.

Support for external RR-Graphs improves VPR’s interoperability with other tools, and helps to decouple VPR’s device model generation, optimization and analysis stages. For instance, the Symbiflow [177, 1] (discussed in Chapter 12) and PRGA [172] projects generate RR-Graphs in order to target their architectures with VPR.
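To make the RR-Graph abstraction concrete, the sketch below shows a toy graph structure in the spirit of Figure 6.2. The types and node kinds are simplified illustrations, not VPR's actual data structures or RR-Graph file format:

```python
# Toy Routing Resource Graph: conductors (wires and pins) are nodes, programmable
# switches between them are directed edges.
from dataclasses import dataclass, field

@dataclass
class RRNode:
    id: int
    kind: str                                     # e.g. "OPIN", "CHANX", "CHANY", "IPIN"
    fan_out: list = field(default_factory=list)   # node ids reachable through a switch

@dataclass
class RRGraph:
    nodes: dict = field(default_factory=dict)

    def add_node(self, node_id, kind):
        self.nodes[node_id] = RRNode(node_id, kind)

    def add_edge(self, src, dst):                 # a connection- or switch-block switch
        self.nodes[src].fan_out.append(dst)

rrg = RRGraph()
for node_id, kind in [(0, "OPIN"), (1, "CHANX"), (2, "IPIN")]:
    rrg.add_node(node_id, kind)
rrg.add_edge(0, 1)   # block output pin drives a horizontal wire (connection-block switch)
rrg.add_edge(1, 2)   # the wire reaches a block input pin
```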

Post-Implementation Netlist & Verification

VPR supports generating a post-implementation netlist which structurally models the design after complete implementation in the FPGA. It can also produce delay information for each primitive and interconnect element in the FPGA. These can be used to simulate the design implementation, with realistic delays, to check its functional correctness. In VPR 8 this ability has been extended to enable the post-implementation netlist to be produced in BLIF format (in addition to Verilog). This allows ABC to formally verify (prove) the original netlist and final design implementation are logically equivalent. This provides a strong check ensuring the tool flow correctly implements the original design.

Chapter 7

Chapter 7

Architecture Modeling Generality & Completeness in VTR

A key feature of the VTR project [2] is its ability to be re-targeted to a wide range of FPGA architectures. While VTR could already model many different FPGA architectures, its modelling capabilities had limitations which often precluded incorporating and investigating features found in modern commercial FPGAs. In this chapter, we outline a variety of enhancements included in VTR 8 which improve its architecture modelling generality. These allow architectures to be modelled in greater detail and with much higher accuracy and completeness. They will also be leveraged to allow open-source tools to target commercial FPGAs in Chapter 12. Section 7.1 outlines the various modelling enhancements made in VTR 8. To illustrate the utility of these enhancements, and quantify their impact, we present a more accurate model of a commercial Intel Stratix IV FPGA architecture in Section 7.2.

7.1 Architecture Modelling Enhancements

FPGA architectures continue to evolve as new application domains emerge, the underlying process technology changes, and new architecture features are proposed. To facilitate exploration of these changing characteristics we have extended and improved VTR’s modelling capabilities. In particular VTR now supports: more general device grids which are not restricted to columns with perimeter I/O (Section 7.1.1), more general and flexible routing architecture descriptions (Section 7.1.2), as well as a variety of area and delay modelling improvements (Section 7.1.3).

7.1.1 General Device Grid

Historically, island-style FPGAs used column-based architectures (where all tiles in a column were of the same type) with I/Os located around the device perimeter as shown in Figure 7.1a. However, modern FPGAs have more complex and non-uniform device grids, with I/Os often placed in columns instead of around the perimeter [188], and other large blocks such as hardened accelerators [189], embedded microprocessors [190, 23] and PLLs [191] located throughout the device grid. Other architectural ideas, such as embedding a Network-on-Chip (NoC) into the FPGA fabric (Figure 7.1c), may also result in non-columnar device grids [192, 22]. To this end we have extended VTR to support arbitrary device grids – there are no longer any location or size limitations for architectural blocks. For example Figure 7.1b shows a non-perimeter IO device with logic blocks, RAMs, I/Os and a hardened PCIE accelerator. Notably, the I/Os are not restricted to the device perimeter, and the accelerator block can be located at an arbitrary location spanning multiple rows and columns. Large blocks also interact with the routing network architecture, requiring enhancements to the routing architecture generator (see Section 7.1.2). VTR 8’s placement, routing and analysis routines (like timing and area estimation) have all been updated to adapt automatically to the generated architecture. For instance, the VPR placer uses a new grid representation (Section 8.2) which helps efficiently perturb placements with more complex and irregular placement grids.

Figure 7.1: General Device Grid Examples. (a) Traditional column-based layout with perimeter I/O. (b) Non-column based layout with non-perimeter I/O. (c) Layout with embedded Network-on-Chip.

7.1.2 Flexible Routing Architectures

Detailed Routing Architecture

To optimize for the characteristics and availability of different metal layers, modern FPGA routing architectures make use of wire types with different lengths, electrical characteristics and connectivity [143]. For instance long wires, which are usually implemented in the lower resistance upper metal layers, are used to quickly cross large distances but are relatively rare and usually have more restricted connectivity. Previously, VTR supported only a limited number of switch blocks (Wilton [193], Subset/Planar/Disjoint [194] and Universal [195]) which defined a routing architecture’s wire-to-wire connectivity pattern. These switch block patterns were developed with unit-length wires in mind, and do not describe different patterns between wire types, such as the hierarchical wire type connectivity used in commercial FPGA architectures [127, 196]. Similarly, VTR previously provided only limited control over the connectivity between block pins and different wire types.

In VTR 8 we have extended VTR’s routing architecture generation capabilities to allow the architect detailed control of the switch block (wire-to-wire), connection block (pin-to-wire) and direct-connect (pin-to-pin) patterns. The switch block pattern is now highly customizable,1 allowing detailed control of how different wire types and individual tracks interconnect along their lengths. The connection block specification is also more flexible, allowing the connectivity between individual block pins and wire types to be specified. Finally, support for direct connections between block pins has also been extended to fully support connectivity between blocks of differing sizes (as shown as purple connections in Figure 7.8).

1Developed in collaboration with Oleg Petelin.

Figure 7.2: Blocks and Routing Channels, showing the horizontal and vertical routing channels, routing tracks, and channel segments associated with each grid location.

These extensions give the architect more flexibility to define higher quality switch patterns which exploit different wire types [197], and more control over how the inter-block routing network connects to block pins. This also enables more accurate modelling of commercial routing architectures with hierarchical connectivity, as discussed in Section 7.2.

Routing Channels & Large Blocks

VTR groups routing tracks which are logically adjacent into routing channels which are associated with a particular location in the FPGA device grid as shown in Figure 7.2. We refer to the wires above or to the right of a block as channel segments. In addition to unit-sized logic blocks (Figure 7.3a), most heterogeneous FPGA architectures now include larger blocks (Figure 7.3b), such as large DSP blocks [189], large high-density RAM blocks (e.g. UltraRAM [198]), or other hardened accelerators (e.g. PCI-E, Ethernet [199]). These types of blocks are often created following different design methodologies from the rest of the FPGA fabric. For instance, a DSP block or hardened PCI-E interface may be designed with a standard-cell-based methodology, while RAM blocks are usually built around a full-custom memory array. This makes integrating these blocks into the FPGA routing fabric challenging, particularly when the blocks are large enough to contain entire routing channels (both width and height > 1). There are several different approaches which can be taken, all of which are now supported in VTR 8.2

The simplest approach is to forbid the FPGA routing fabric from crossing the large block, as shown in Figure 7.3c. This allows the large block to be implemented independently from the FPGA fabric, but results in large routing blockages within the FPGA fabric. Another approach is to reserve some of the metal layers above the large block, allowing FPGA fabric routing wires to cross over the block (Figure 7.3d). This has a smaller impact on the large block implementation, since only the reserved metal layers cannot be used by it and no routing fabric switching needs to be pushed down into the block’s active layers. Since signals can cross over the block this provides some routability improvement, but the lack of full switching capabilities between wires (switch-block) is still restrictive. This approach can also produce longer routing wires, since the wires on either side of the block are electrically shorted together, increasing their delay. A more complex approach is to fully integrate the routing fabric into the large block (Figure 7.3b).

2Previous releases of VTR assumed all large blocks contained switch-blocks as in Figure 7.3b.

Figure 7.3: Switch-block connectivity within and around unit-sized and 2x3 sized blocks. (a) Unit-sized blocks. (b) Large block with full internal switch-block. (c) Large block with no internal connections. (d) Large block with shorted straight connections.

Table 7.1: Normalized Impact of 2x3 RAM block Internal Routing Connectivity on VTR benchmarks (> 10K primitives)

           Wmin    Routing Area    Routed WL      Crit. Path Delay    Route Time
                   (1.3 · Wmin)    (1.3 · Wmin)   (1.3 · Wmin)        (1.3 · Wmin)
  none     1.00    1.00            1.00           1.00                1.00
  short    0.86    0.86            1.06           1.02                1.01
  full     0.86    0.92            1.00           1.00                0.94

Area is silicon area measured in Minimum Width Transistor Areas (MWTAs).

This requires both reserving metal layers and pushing down routing fabric switches into the active layers of the large block’s implementation. While this ensures maximum routing flexibility and performance it is highly disruptive to the large block’s implementation, and may not be feasible in many cases.3 As a result this approach likely only makes sense for extremely common blocks where the other approaches would cause too much disruption to the routing fabric. VTR 8 supports all the above architectural options, allowing architects to specify how the routing network is implemented within and around different block types, enabling them to evaluate the different trade-offs between design time, area, delay, and routability.4 Table 7.1 shows the impact of these approaches on an architecture with length 4 wires and large 2x3 grid location RAM blocks (evaluated with the large VTR benchmarks). Compared to the least flexible routing approach (none, Figure 7.3c), allowing wires to cross-over the block (short, Figure 7.3d) 5 significantly improves routability, with minimum routable channel width (Wmin) decreasing by 14%. Interestingly, allowing full switch-blocks within the block (full, Figure 7.3b) does not improve minimum routable channel width compared to short, but increases area.6 However full does improve routed wirelength, critical path delay, and route time compared to short.7

Non-configurable Switches

Traditionally, VTR has assumed that all switches in the routing network are configurable – meaning they can be configured to either connect or disconnect their input and output wires.

3For instance it may be difficult or impossible to embed routing switches into a RAM block’s full-custom memory array.
4All of these architectural choices are used in various blocks of a variety of commercial FPGA architectures.
5Critical path delay and wirelength also increase moderately. As wires crossing the large block are shorted together in this architecture the additional loading will increase their delay. It may also increase wirelength, since they can no longer be used independently.
6The area increases due to the additional switch-block switches; however our area estimate neglects the likely significant area cost of disrupting the block’s layout.
7While full achieves similar wirelength and critical path delay as none it does so at substantially smaller channel widths.

However some common circuit structures, such as inline non-configurable buffers along a wire and electrical shorts between wires, do not satisfy this assumption, as they always connect their input and output wires. To alleviate this, we added support for non-configurable switches which cannot be disconnected or turned off, but must be used if their associated wiring is used. This allows VTR to model structures such as wiring with inline buffers as shown in Figure 7.4a (which often appear in structures like clock networks), and non-rectilinear wiring (e.g. L, T or other shapes [200, 201]) as shown in Figure 7.4b which can be modelled as a combination of horizontal and vertical wire segments connected together with an electrical ‘short’. Section 9.7 describes the routing algorithm enhancements required to support non-configurable edges.

Figure 7.4: Circuit Structures which can be modelled with non-configurable switches. (a) Wire with inline buffer. (b) Diagonal or ‘L’-shaped wire modelled as rectilinear wires with an electrical short.
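As an illustrative sketch (hypothetical types, not VPR’s implementation), a non-configurable switch can be modelled as a flag on an RR-Graph edge; the set of wires a router is forced to use together is then found by following only non-configurable edges:

    #include <cstddef>
    #include <stack>
    #include <vector>

    // Each edge records whether its switch can be turned off. Non-configurable
    // edges (inline buffers, electrical shorts) are always 'on', so using any
    // node of a non-configurably connected set implies using the entire set.
    struct Edge { std::size_t to; bool configurable; };

    // adjacency[i] lists the edges leaving node i (assumed symmetric for shorts).
    std::vector<std::size_t> non_configurable_set(
            const std::vector<std::vector<Edge>>& adjacency, std::size_t start) {
        std::vector<bool> seen(adjacency.size(), false);
        std::vector<std::size_t> members;
        std::stack<std::size_t> work;
        work.push(start);
        seen[start] = true;
        while (!work.empty()) {
            std::size_t n = work.top(); work.pop();
            members.push_back(n);
            for (const Edge& e : adjacency[n]) {
                if (!e.configurable && !seen[e.to]) {  // follow only forced connections
                    seen[e.to] = true;
                    work.push(e.to);
                }
            }
        }
        return members;
    }

The ‘L’-shaped wire of Figure 7.4b, for example, would form a two-node set joined by a non-configurable short, and a router selecting either segment must account for the cost and legality of both.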

7.1.3 Area & Timing Modelling

VTR’s modelling and analysis capabilities have also been extended.

Area Modelling

VTR now uses the COFFE area model [123], which provides more accurate area estimates on modern process technologies. In particular, it better captures the effects of N-well spacing and the combined effects of increasing transistor width and adding parallel diffusion regions.

Size Dependent Mux Delays

As VTR constructs an FPGA routing architecture it builds muxes with various numbers of inputs. While the VTR area model accounts for this, VTR 7 assumed a fixed delay regardless of mux size. VTR 8 now supports specifying a range of size-dependent mux delays8, facilitating more accurate delay estimates.
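As a sketch of the idea (a hypothetical helper, not VTR’s actual delay model or architecture file syntax), the delay of a routing mux can be looked up from a small table of size/delay pairs supplied by the architect:

    #include <map>

    // Delays are specified for a few representative mux sizes; other sizes use
    // the nearest specified entry at or above the requested fanin.
    // Assumes delay_by_size is non-empty.
    float mux_delay(const std::map<int, float>& delay_by_size, int fanin) {
        auto it = delay_by_size.lower_bound(fanin);     // first size >= fanin
        if (it == delay_by_size.end()) {
            return delay_by_size.rbegin()->second;      // larger than any entry: use largest
        }
        return it->second;
    }

For example, with (hypothetical) entries for 4-, 8- and 16-input muxes, a 6-input mux would use the 8-input delay.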

Internal Primitive Timing Paths

Many components of an FPGA architecture are modelled as architectural primitives (e.g. LUTs, Flip-Flops). With increasing heterogeneity, a wider variety of primitives are used. Internally these components can be fairly complex and contain internal timing paths which will impact the timing behaviour of the overall system (e.g. DSP blocks, RAM blocks, hard memory controllers). VTR 8 now supports modelling these internal timing paths, which can occur between a primitive’s sequential input/output pins. It also supports multi-clock primitives where different ports are controlled by different clock domains. Figure 7.5 illustrates these capabilities. The input ports we1, addr1 and data1 are sequential inputs clocked by clk1. These are connected to the sequential output data2 (clocked by clk2) by an internal timing path. In contrast, the input addr2 is a combinational input which connects to data2. VTR 8 supports any collection of combinational and sequential inputs (with multiple clocks) and internal timing edges/paths between them.

Figure 7.5: Dual-Port RAM Block Internal Timing Path Example.

8Implemented by Oleg Petelin based on architectural discussions between us.

Improved Delay Modelling

Traditionally, VTR has only supported the specification of maximum delay timing models. These models are used with Static Timing Analysis (STA) to guide implementation optimization, and report circuit critical paths and maximum operating frequencies.9 However maximum-delay setup-time checks are not the only timing constraints which determine correct operation. It is also important to analyze minimum-delay timing for hold-time checks. To this end, VTR now supports the specification of minimum (in addition to maximum) timing information on all block delays, which is used for hold-time analysis (Section 10.2).
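For reference, the two checks exercise opposite delay corners. Neglecting clock skew, the standard constraints are:

    t_cq,max + t_comb,max + t_su  ≤  T_clk      (setup check: maximum delays)
    t_cq,min + t_comb,min  ≥  t_hold            (hold check: minimum delays)

where t_cq is the launching register’s clock-to-q delay, t_comb the combinational path delay, t_su and t_hold the capturing register’s setup and hold times, and T_clk the clock period. The minimum-delay specifications added in VTR 8 supply the t_cq,min and t_comb,min terms required by the hold check.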

7.2 Architecture Enhancements Case Study: Stratix IV

The new architecture modelling capabilities described in Section 7.1 make it possible to model a much wider variety of FPGA architectures. To illustrate their utility, we now present improvements to the Stratix IV architecture capture used with the Titan benchmark suite [10].10

7.2.1 Grid Layout

Using the generalized grid support (Section 7.1.1), the device grid layout was adjusted to more accurately reflect real Stratix IV devices, as shown in Figure 7.6. PLLs are now placed within the I/O periphery and the I/Os are positioned so they can access vertical and horizontal routing regardless of where they are located. The M9K and M144K RAM blocks along with DSP blocks (which contain multipliers and accumulators) are arranged in columns. The remainder of the device is filled with Logic Array Blocks (LABs) consisting of Look-Up-Tables (LUTs), adders and Flip-Flops (FFs).

7.2.2 Routing Network

The routing network connectivity pattern was one of the largest approximations in the original Titan Stratix IV architecture capture from [10], since VTR 7 offered very limited control of the generated connectivity pattern.

9The delay values for an architecture can be derived from the low-level architecture component implementations using tools like COFFE [123].
10Later Stratix series devices [196, 202] are similar to Stratix IV in the architecture dimensions considered here.

Figure 7.6: Placement of the leon2 benchmark with the updated grid layout and block type annotations (LAB, M9K, M144K, DSP, I/O and PLL blocks).

Wire Connectivity, Switch Block & Routing Mux Sizes

VTR 7 supported only a small number of fixed switch blocks, with the Wilton switch block being the most commonly used. However, the Wilton switch block was originally designed only considering wires of the same length. When used with multiple wirelengths the twisting pattern used by the Wilton switch block to improve routability means long wires rarely connect to each other. This defeats one of the primary purposes of having long wires in a routing architecture, which is to provide fast routes across long distances (accomplished by stringing together multiple long wires). By leveraging VTR 8’s custom switch block support (Section 7.1.2) we can define a switch block which better captures the characteristics of the Stratix IV routing architecture. The characteristics modelled are summarized in Table 7.2. As in previous work [165, 10] we model the routing network as a mixture of 87% L4 and 13% L16 wires with a channel width of 300 tracks throughout the device. Using VTR 8 we can correctly model the connectivity between the different wire types and between wires and block pins. For instance, the L4 wires connect to block pins and connect to other wires via switch blocks at every grid tile (L4ISBD = 1). The L16 wires are only accessible from the L4 wires and can neither drive nor be driven by block pins.11 L16 wires also only connect to other wires via switch blocks at every 4th grid tile (L16ISBD = 4),12 and we correctly model this restriction.

We also take care to match the size of the wire driver muxes, something which can be explicitly controlled with the custom switch block support in VTR 8. The L4 Driver Input Muxes (DIMs in Intel terminology) are sized to be 12:1. We distribute the inputs to these muxes such that 6 inputs come from block pins (determined by Fc), and the remaining 6 inputs come from a combination of other L4 and L16 wires. For the L16 DIMs Stratix IV uses larger muxes to increase the number of potential paths onto the long-wire network. Using VTR 8’s custom switch-block support we were able to specify a routing architecture matching these characteristics. Since the precise details of Stratix IV’s switch pattern have never been disclosed, we experimentally evaluated a variety of routing patterns which matched the characteristics in Table 7.2 to find a pattern which performed well.13

11Intuitively, this makes the rare L16 wires easier to use since the more numerous L4 wires serve as a feeder network for getting on to and off of the L16 wires.
12This is done in Stratix IV to reduce the expensive deep via stacks required to access transistors from the upper metal layers [127].
13Note these experiments, which evaluated which wires and which position along those wires were selected to drive each mux, were not possible to perform in VTR 7 due to limited user control over the generated switch pattern.

Table 7.2: Summary of Modelled Stratix IV Routing Architecture Characteristics

  Metric                            Expression    Value
  Channel Width                     W             300
  L4 Wires                          0.8667 · W    260
  L16 Wires                         0.1333 · W    40
  L4 Fcin                           L4Fcin        0.055
  L4 Fcout                          L4Fcout       0.075
  L16 Fcin                          L16Fcin       0.000
  L16 Fcout                         L16Fcout      0.000
  L4 Driver Mux Size                L4DIM         12:1
  L16 Driver Mux Size               L16DIM        40:1
  L4 inter-switch-block distance    L4ISBD        1
  L16 inter-switch-block distance   L16ISBD       4

Figure 7.7: Three sided architecture with each pin able to access one horizontal track, and one vertical track.

Pin Locations

Stratix IV uses a three-sided routing architecture where each block can reach routing channels which are above, left or right of it [203]. By leveraging the enhanced pin location specifications, and improved control of routing connectivity around multi-height blocks (Section 7.1.2) this style of architecture can be accurately modelled. In particular, for each logical pin on a block we create two physically equivalent pins: one on the top edge of the block, and the other on either the left or right side of the block. This allows each logical block pin to access both horizontal (above) and vertical (either left or right) routing wires, matching Stratix IV’s capability (see Figure 7.7). By scaling the fraction of routing tracks to which each pin connects (Fc) appropriately we ensure the total number of wires connected to each logical pin remains constant.
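As a hypothetical numerical illustration of this scaling: if a single physical pin with Fc = 0.1 in a channel of W = 300 tracks connects to 0.1 · 300 = 30 wires, then after splitting it into two physically equivalent pins each pin is given Fc = 0.05, so the logical pin still reaches 2 · (0.05 · 300) = 30 wires in total.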

Direct-Connect

We also extend the modelling of direct-link connections (which were previously modelled only between LABs) to include connections between blocks of differing types. In particular care was taken to ensure multi-height blocks like DSPs and M144K RAM blocks connect correctly to adjacent blocks which may have differing heights as shown in Figure 7.8. We have also updated the architecture capture to include pack-patterns [204, 178] for all DSP block modes, ensuring that complex uses of the DSP block (like multiply-accumulate) are packed efficiently and use the appropriate dedicated circuitry.

Figure 7.8: Direct connections between adjacent blocks shown in purple, including horizontal ‘direct-link’ connections (between all blocks) and vertical carry-chain connections (between LABs). Block colors are the same as Figure 7.6.

7.2.3 Timing Model

The timing model has been extended to include block internal timing paths (Section 7.1.3), such as those between registered RAM address ports and data output ports, and within registered DSP blocks. A minimum delay timing model calibrated to match the component delays reported by Quartus STA is also included to enable hold-time analysis (Section 7.1.3).

7.2.4 Architecture Enhancements Comparison

Table 7.3 shows the impact of various architectural enhancements. In this experiment the Titan v1.1.0 benchmarks are run targeting both the original Stratix IV architecture capture (A) from [10] (which was describable in VTR 7), and several variants of the enhanced architecture with the custom switch-block and various DIM sizes (describable in VTR 8). The same version and configuration of VPR 8 is used in all cases. Compared to the baseline architecture (A), the enhanced architecture (E) improves routing area (silicon area used for routing) by 11% and critical path delay by 8%. The critical path delay improvement is achieved in part by appropriately modelling the hierarchical long and short wire connectivity of Stratix IV. This ensures long wires always connect to each other, allowing critical signals to quickly traverse large distances by staying on the long-wire network.14 The area reduction results from using smaller wire driver muxes than in the baseline architecture. Finally, the enhanced architecture is also easier to route, reducing route-time by 28%.

We also explored several other routing architecture variants, which tweak the size of the wire driver muxes. Note these experiments cannot be performed in VTR 7, as the driver mux sizes cannot be directly controlled. Compared to the baseline architecture (A), moving to the customized switch-block with hierarchical wire connectivity and fixed 12:1 driver muxes (B) reduces routing area by 13% and improves critical path delay by 5% while decreasing route time by 22%, at the cost of a small 3% increase in wirelength. Increasing the L16 driver muxes to 72:1 (C) makes it easier for critical connections to reach the fast long wires, further reducing critical path delay. While these muxes are much larger, there are relatively few L16 wires, so the overall area impact is less significant, but it does reduce the area improvement to 8%. Increasing the much more common L4 driver muxes to 16:1 (D) reduces wirelength by 4% and critical path delay by 10% compared to the baseline, but at the cost of 9% higher area than (C) (1% higher area than the baseline). Interestingly, we found that decreasing the L16 driver mux size to 40:1 (like Stratix IV) had no significant impact on wirelength or delay, but reclaims some area compared to using 72:1 muxes. Based on these results the enhanced architecture (E) was selected as the best/closest approximation to Stratix IV. Unless otherwise noted, all results targeting Stratix IV in VPR use this architecture. For instance, this architecture is used when comparing VPR to Quartus in Chapter 11.

14This is not the case with the Wilton switch-block used in the legacy architecture, as it shuffles tracks in a manner which makes it unlikely the relatively few long wires will connect to other long wires.

Table 7.3: Geometric Mean Normalized Impact of Architecture Variants on the Titan Benchmarks

     Switch-block   L4 DIM   L16 DIM   Routing Area Per-Tile   Routed WL   Crit. Path Delay   Route Time
  A  wilton         -        -         1.00                    1.00        1.00               1.00
  B  custom         12:1     12:1      0.87                    1.03        0.95               0.78
  C  custom         12:1     72:1      0.92                    0.99        0.92               0.76
  D  custom         16:1     72:1      1.01                    0.96        0.90               0.79
  E  custom         12:1     40:1      0.89                    1.00        0.92               0.72

Area is silicon area measured in Minimum Width Transistor Areas (MWTAs).

Chapter 8

Enhanced FPGA Packing & Placement

Packing and Placement are key steps in the FPGA design flow. This chapter outlines enhancements to VPR’s packing and placement algorithms included in VTR 8 [2], which significantly improve result quality, routability and run-time.

8.1 Packing Enhancements

VTR supports a highly general packing algorithm based on greedy bottom-up seed-based clustering, which can target a wide variety of FPGA logic block and hard block architectures. Seed-based clustering algorithms can achieve a very dense packing (high utilization per cluster) which helps minimize the number of clustered blocks required to implement a circuit. However, achieving high logic utilization has also been found to harm routability [205], and has been identified as a potential cause of routability issues with VTR 7 [10]. To this end we have focused on improving VPR’s packing quality to produce more natural clusters, which better reflect the true relationships between netlist primitives. These more natural clusters are typically less dense and less interconnected which improves routability and overall implementation quality. Algorithm 11 shows the key features of VPR 8’s packing algorithm. Given a target architecture and primitive netlist to cluster, the packer groups primitives into molecules (Line 2), which are groups of primitives which must be packed together.1 Next a seed molecule is selected (Line 5, described in Section 8.1.1) and used to create a new open cluster (Line 6). The cluster is then grown by selecting an unpacked molecule ‘related’ to the current cluster (Line 9, described in Section 8.1.2). The algorithm then attempts to add this candidate molecule into the cluster (Line 12). This continues until either the cluster is deemed to be sufficiently full (Line 8, described in Section 8.1.3), or no more ‘related’ molecules can be found (Line 10). At that point the current open cluster is closed (Line 15), and a new seed is selected from the remaining unpacked molecules (Line 5). This repeats until all molecules have been packed into clusters (Line 4). For more details on the generality of VPR’s packing algorithm and how legality is ensured see [204, 178].

1An example of a multi-primitive molecule is a carry-chain. Each adder in the chain must be packed together to use a logic block’s dedicated carry logic.


Algorithm 11 VPR 8 Packing Algorithm
Require: The target FPGA architecture, the primitive netlist to be packed
Returns: The packed clusters implementing the primitive netlist
 1: function vpr pack(architecture, primitive netlist)
 2:     unpacked mols ← build molecules(architecture, primitive netlist)
 3:     clusters ← ∅
 4:     while unpacked mols ≠ ∅ do
 5:         seed mol ← select seed(unpacked mols)
 6:         cluster ← create cluster(architecture, seed mol)
 7:         untried mols ← unpacked mols
 8:         while space remains(cluster) do
 9:             next mol ← pick next molecule(cluster, untried mols)
10:             if next mol = ∅ then
11:                 break                                        ▷ Nothing productive to add
12:             if try add molecule to cluster(cluster, next mol) then
13:                 unpacked mols ← unpacked mols \ next mol     ▷ Molecule is now packed
14:             untried mols ← untried mols \ next mol           ▷ Do not retry molecule
15:         clusters ← clusters ∪ cluster                        ▷ Commit cluster
16:     return clusters

Some FPGA clustering approaches have bypassed bottom-up clustering [206, 207, 157, 208] using either partitioning or flat placement. However these techniques often relax legality constraints, are architecture specific, or still use bottom-up clustering as the final legalization step. The VPR packer faces unique constraints as it must support many different FPGA architectures which include different heterogeneous block types (e.g. DSP, RAM and custom blocks, in addition to logic blocks) with arbitrary internal connectivity (e.g. depopulated crossbars) and legality constraints (e.g. external/internal pin limits, special structures like carry-chains). Focusing on a bottom-up clustering approach makes it feasible to continue supporting this flexibility. Consistent with [209] we found that avoiding high complexity clusters was important for improving routability and wirelength.

8.1.1 Where to Start?

Selecting a good seed primitive to start a cluster is an important consideration, since it impacts the order in which different parts of the netlist will be clustered. This is particularly important when transitive connectivity is used to determine relatedness. This is illustrated in Figure 8.1, which shows the different types of connectivity the packer considers from the perspective of the seed primitive LUT1. Initially (Figure 8.1a) only FF4 is considered to be transitively related to LUT1 (since both LUT1 and FF4 connect to RAM1). However, after the RAM slices (RAM1, RAM2, RAM3) have been packed into a RAM block (Figure 8.1b), the packer gains additional information as it sees that FF5 and FF6 are also transitively related to LUT1 (via the RAM block). It is therefore beneficial to pack large blocks like RAMs and DSPs early during packing, so the packer has more complete transitive connectivity information when packing soft logic.2 However, we found that VPR 7 did not always pack such blocks early. As a result we changed the seed selection criteria for timing-driven clustering to prefer this behaviour. This was accomplished by using a different criterion to rank potential seed molecules:

    gain = 0.5 · norm input pins + 0.2 · norm used input pins
         + 0.2 · norm primitives in molecule + 0.1 · primitive criticality        (8.1)

Where norm input pins, norm used input pins, norm primitives in molecule and primitive criticality are respectively: the number of input pins on the primitive (normalized to the primitive type with the most pins), the number of connected input pins (normalized to the primitive type’s number of input pins), the number of primitives in the molecule (normalized to the largest molecule in the circuit), and primitive timing criticality. The new gain function favours selecting primitives with a larger number of pins (such as DSPs and RAMs), ensuring they are clustered early in the packing process. In the case of a tie, the gain function favours blocks with more used inputs (i.e. which are connected in the circuit), and primitives which form parts of large multi-block molecules like carry-chains. Timing criticality is used as a final tie-breaker.

2RAMs and DSPs usually have less packing flexibility and so benefit less from transitive connectivity information.

Figure 8.1: Packing example with ζ = 4. (a) Connectivity before clustering. (b) Connectivity after RAM block clustered. (c) After packing prioritizing large net connectivity. (d) After packing prioritizing transitive connectivity.
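A minimal sketch of Equation 8.1 as code (hypothetical struct and field names; not VPR’s exact implementation):

    // Sketch of the VPR 8 seed gain of Equation 8.1.
    struct MoleculeStats {
        float norm_input_pins;         // pins on primitive / max pins of any type
        float norm_used_input_pins;    // connected input pins / primitive's input pins
        float norm_primitives_in_mol;  // primitives in molecule / largest molecule
        float primitive_criticality;   // timing criticality in [0, 1]
    };

    float seed_gain(const MoleculeStats& m) {
        return 0.5f * m.norm_input_pins
             + 0.2f * m.norm_used_input_pins
             + 0.2f * m.norm_primitives_in_mol
             + 0.1f * m.primitive_criticality;
    }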

8.1.2 What to Pack Next?

After a seed molecule has been selected and a new cluster opened, the packer searches for unclustered molecules to add. This process is guided by attraction functions which estimate the gain (benefit) of adding an unclustered molecule to the current open cluster. We observed that VPR 7’s attraction functions tend to produce somewhat unnatural clusters, with a mixture of related, loosely related, and unrelated logic. To address this we have made several enhancements to the attraction functions in VPR 8. Chapter 8. Enhanced FPGA Packing & Placement 95

Prioritization of Connectivity Types

The packer considers three types of connectivity when selecting molecules to try adding to the current cluster. Small net connectivity considers molecules connected to the current cluster via low-fanout nets, those with fanout below the high-fanout net threshold ζ (in VTR 7 ζ = 256). In Figure 8.1a the connections between FF1, FF2, RAM1 and LUT1 are examples of this type of connectivity. Selecting candidate molecules connected by small nets is beneficial as it encourages nets to become completely absorbed within a cluster, reducing the number of inter-block connections which must be routed through the inter-block routing network [38]. Large net connectivity considers molecules connected to the current cluster via high-fanout nets (fanout ≥ ζ). In Figure 8.1a the connections from FF7 are examples of this type of connectivity. Large high-fanout nets are less beneficial for guiding packing than small nets, as they can rarely be well localized (e.g. absorbed into a single cluster). Transitive connectivity considers molecules which are connected to the current cluster indirectly through another molecule or cluster. In Figure 8.1b, FF4, FF5, and FF6 are transitively connected to LUT1 through the RAM block. Transitive connectivity can be beneficial since it helps keep logic which is indirectly related together.3

VPR 7 prioritized small net connectivity first, followed by large net connectivity, and finally transitive connectivity. Since many molecules connect to at least one high-fanout net this meant VPR 7 only occasionally exploited transitive connectivity information. As a result, molecules which were only weakly related through high-fanout connectivity were often packed together – even when they were strongly related (e.g. via low-fanout connectivity) to other logic which had yet to be packed. The clusters which resulted often ended up with split personalities: some portion of the logic was strongly related, but other parts were only weakly interconnected with the rest of the cluster (and may have had stronger connectivity to logic which ended up in other clusters). This is illustrated in Figure 8.1c and Figure 8.1d. In Figure 8.1c, from the seed LUT1 small net connectivity guides the packer to pack FF1 and FF2 into the cluster. Then, following large net connectivity, the only loosely related FF7, FF8 and FF9 are added to the cluster. In contrast Figure 8.1d follows the same small net connectivity to pack FF1 and FF2, but then follows transitive connectivity to select FF4, FF5 and FF6. This results in a more natural clustering.

In VPR 8 we have de-emphasized the importance of large net connectivity through two techniques. First, we prioritize transitive connectivity over large net connectivity. This ensures transitively related molecules are added to a cluster before those connected to high fanout nets. Second, we have lowered the threshold for high fanout nets (ζ) from 256 to 32 for logic blocks and 128 for all other block types.4
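The resulting candidate ranking can be sketched as follows (a simplified illustration with hypothetical types; the real packer also handles legality checks and gain computation):

    #include <algorithm>
    #include <vector>

    // Candidates are ranked first by connectivity class (small-net before
    // transitive before large-net), then by attraction gain within a class.
    enum class Relation { SmallNet = 0, Transitive = 1, LargeNet = 2 };

    struct Candidate {
        int      molecule_id;
        Relation relation;
        float    gain;   // attraction to the current open cluster
    };

    int pick_next_molecule(std::vector<Candidate> candidates) {
        if (candidates.empty()) return -1;  // nothing related: close the cluster
        std::sort(candidates.begin(), candidates.end(),
                  [](const Candidate& a, const Candidate& b) {
                      if (a.relation != b.relation) return a.relation < b.relation;
                      return a.gain > b.gain;      // higher gain first within a class
                  });
        return candidates.front().molecule_id;
    }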

Unrelated Logic

As VPR packs a cluster it may be unable to find any unpacked molecules which are related to the current cluster. In VPR 7 the packer will continue to try filling up the cluster by finding unrelated molecules which will fit. While this minimizes the number of clusters created (by maximizing the utilization of each cluster) it can be detrimental to quality.

3For instance the FFs in a register bank may drive related logic but have no direct connectivity between themselves.
4A lower threshold treats more nets as high-fanout, further focusing the packer on small net connectivity.

Figure 8.2: Unrelated Clustering Example. (a) Unclustered. (b) Naturally clustered. (c) Clustered with unrelated logic.

Consider the unclustered netlist in Figure 8.2a, which has a clear structure and can be naturally clustered as shown in Figure 8.2b by placing only related logic in each cluster. Allowing unrelated logic as in Figure 8.2c may result in fewer clusters but they are typically much more strongly interconnected; in this case leading to two highly connected clusters connected to other logic on opposite sides of the device. Since the packer (both in VPR 7 and VPR 8) operates on a single open cluster at a time, ‘unrelated’ logic is only unrelated from the viewpoint of the current cluster and may be strongly related to other yet-to-be-packed parts of the netlist. As a result unrelated logic packing tends to scatter connected logic (which may have otherwise formed new more natural clusters) across many clusters. The resulting clusters are then coupled together through the unrelated logic making them more difficult to place and route. In VPR 8, unrelated logic packing is disabled unless too many clusters are produced to fit on a fixed-size device. The impact of these changes on the des90 benchmark is illustrated in Figures 8.3 and 8.4, which correspond to VTR 7-style and VTR 8-style packings respectively. In Figure 8.3 unrelated logic packing is enabled, producing fewer logic blocks which are more tightly coupled together. This coupling induces the placer to produce a tighter placement at the centre of the device to minimize wirelength (Figure 8.3a). However as Figure 8.3b shows, this causes significant routing congestion at the centre of the device, where 5066 channel segments5 use ≥ 90% of available wiring, resulting in an unroutable implementation. In contrast, with unrelated logic packing disabled (Figure 8.4a) the clusters are less interconnected and the placer is able to spatially localize the circuit’s connectivity resulting in a more spread-out and natural placement. As Figure 8.4b shows, this significantly reduces routing congestion, and the design routes easily – with routing utilization ≥ 90% in only two channel segments.

8.1.3 When is a Cluster Full?

Determining when a cluster should have no additional logic packed into it is also important for producing natural clusters. When attempting to fill each cluster as full as possible (as VTR 7 does) some clusters may end up using all (or nearly all) of their external routing ports (i.e. cluster pins which connect to inter-block routing). While this high density packing may be beneficial from a utilization perspective (minimizing the number of logic blocks), it can make routing more difficult. Clusters with a large number of external routing connections will put additional stress on their adjacent routing channels – potentially inducing congestion.

5A channel segment is all the wires in either a vertical or horizontal routing channel which overlap a particular grid tile, as illustrated in Figure 7.2.

Figure 8.3: des90 high packing density (unroutable). (a) Placement. (b) Routing utilization.

Figure 8.4: des90 moderate packing density (routable). (a) Placement. (b) Routing utilization.

Clustered Block Pin Utilization Targets

To avoid this behaviour we have modified the packer to consider the open cluster’s pin utilization as part of the criteria for deciding when it should be closed. If the current open cluster exceeds its pin utilization targets it is considered full and is closed. These targets bias the packer away from creating difficult to route high pin utilization clusters. Importantly, these targets are treated as soft (rather than hard) constraints, which ensures netlist structures which may legitimately require more pins than the utilization target (e.g. large carry-chains) can still be packed. VPR 8 now sets the default input pin utilization target of the logic block type to 80% of the available pins. All other blocks default to a 100% pin utilization target. This serves to guide the packer away from producing unnecessary high pin utilization logic blocks, at the potential cost of increasing the number of clusters created.
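A minimal sketch of this check (hypothetical names; the packer applies it as a soft constraint, so structures like large carry-chains can still legitimately exceed the target):

    // Decide whether an open cluster has hit its pin utilization target
    // (defaults: 80% of input pins for logic blocks, 100% for other block types).
    struct ClusterPins {
        int used_inputs;       // cluster input pins currently driven by external nets
        int available_inputs;  // physical input pins on the cluster's block type
    };

    bool exceeds_pin_utilization_target(const ClusterPins& c, float target = 0.8f) {
        return c.used_inputs > target * c.available_inputs;
    }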

8.1.4 Adapting to Available Device Resources

While VPR is often allowed to dynamically re-size the device grid based on the benchmark circuit, it also supports targeting fixed-size devices. When targeting fixed-size devices it is possible for the packer to produce more blocks of a particular resource than are available on the device. In VTR 8 we have enhanced the packer to be resource aware and adapt to the resource mix of the targeted device. Together these changes ensure VTR produces a higher quality natural packing when there is sufficient space, and a denser clustering only if required.

Resource Balancing

The packer can optionally attempt to balance the resource utilization of different block types to better match the resources available on the device. For instance, if a netlist primitive can map to multiple block types (e.g. RAMs of two different sizes), the packer will attempt to produce a packing which balances the usage of the different block types in proportion to their quantity on the device. This is implemented by tracking the usage of each block type for the current partial netlist packing, and comparing it to the number of each block type available in the device grid. Specifically, if a netlist primitive could map to multiple block types, resource balancing causes the packer to select the candidate block type which has lower utilization.6 On the Titan benchmarks (Section 11.2) targeting the Stratix IV model from Section 7.2 this results in more RAMs being implemented as larger M144Ks (which are typically underutilized), instead of the more common M9Ks. This decreased the average device size needed to implement RAM limited circuits by 10%.
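A sketch of this selection heuristic (hypothetical types; assumes every candidate block type has at least one instance available on the device):

    #include <cstddef>
    #include <limits>
    #include <vector>

    // When a primitive can map to several block types, pick the type whose
    // current utilization (blocks used / blocks available on the device) is lowest.
    struct BlockTypeUsage {
        int used;       // blocks of this type created so far in the partial packing
        int available;  // blocks of this type in the targeted device grid
    };

    int pick_balanced_block_type(const std::vector<BlockTypeUsage>& candidates) {
        int best_idx = -1;
        double best_util = std::numeric_limits<double>::max();
        for (std::size_t i = 0; i < candidates.size(); ++i) {
            double util = static_cast<double>(candidates[i].used) / candidates[i].available;
            if (util < best_util) {
                best_util = util;
                best_idx = static_cast<int>(i);
            }
        }
        return best_idx;  // index of the least-utilized candidate type (-1 if none)
    }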

Adaptive Re-packing

The packer initially focuses on producing a natural packing, with unrelated logic packing disabled (Section 8.1.2) and reasonable pin utilization targets (Section 8.1.3). If this packing uses more resources than are available, packing is re-started from scratch – with the aim of achieving a higher packing density and reducing resource utilization. During this re-packing unrelated logic packing is enabled along with resource balancing, and the pin utilization targets are relaxed.
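The overall two-pass flow can be sketched as follows (hypothetical options structure and callback; not VPR’s actual interface):

    #include <functional>

    // Two-pass strategy: attempt a 'natural' packing first, and only re-pack
    // more densely if the result does not fit on the targeted fixed-size device.
    struct PackOptions {
        bool  allow_unrelated_clustering;
        bool  balance_block_type_usage;
        float lab_input_pin_util_target;
    };

    void adaptive_pack(const std::function<bool(const PackOptions&)>& pack_and_check_fit) {
        // First pass: natural packing (unrelated clustering off, 80% pin target).
        if (pack_and_check_fit({false, false, 0.8f})) return;
        // Second pass: denser packing (unrelated clustering and resource balancing
        // on, pin utilization targets relaxed), re-packed from scratch.
        pack_and_check_fit({true, true, 1.0f});
    }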

8.1.5 Packing QoR Evaluation

Table 8.1 shows the cumulative impact of these modifications on the Titan benchmarks (Section 11.2) targeting the Stratix IV model from Section 7.2,7 where baseline corresponds to a VTR 7 style packing. Turning off unrelated logic packing (no-unrel.) had the largest impact, reducing routed wirelength by 16%, decreasing router run-time by > 2×, and allowing seven additional benchmarks to complete, at the (small) cost of 4% more logic blocks. Using the new seed formulation (seed) and an 80% input pin utilization target on logic blocks (pin util.) had limited effect. However, prioritizing transitive connectivity (prioritize transitive) further reduced (compared to the baseline) wirelength by 19% and improved critical path delay by 3%, while also enabling two more benchmarks to complete.

Table 8.2 shows the impact of modifying the packer’s high-fanout net threshold (ζ) with other parameters set to those of the last row of Table 8.1. Decreasing the threshold from 256 to 64 improves wirelength by 4% and reduces route time by 10% while increasing the number of M144K RAM blocks by 8%. Further decreasing the threshold to 48 or 32 allows the largest Titan benchmark (gaussianblur) to complete, but further increases the number of RAM blocks. These results indicate that high-fanout net connectivity is useful for densely packing coarser blocks like RAMs and DSPs but can hurt the routability of the more fine-grained logic blocks. Decreasing the LAB high-fanout threshold to 32 while leaving the threshold at 128 for other block types (128 LAB:32) achieves the same routability improvement without increasing the number of DSP and RAM blocks. This validates the use of these values as the VTR 8 default.

6This is still a very simple heuristic based on resource balancing, although still an advancement over VPR 7 which simply used a first-fit heuristic. Notably, resource balancing does not consider the efficiency of implementing primitives in different block types, which is left for future work.
7The VTR benchmarks showed similar trends but smaller improvements.

Table 8.1: Cumulative Normalized Impact of Packing Changes on Titan benchmarks

                         # Circuits                                 Routed   Crit. Path   Pack   Place   Route   Total
                         Completed   LABs   DSPs   M9Ks   M144Ks    WL       Delay        Time   Time    Time    Time
  baseline               13          1.00   1.00   1.00   1.00      1.00     1.00         1.00   1.00    1.00    1.00
  no-unrel.              20          1.04   1.00   1.00   1.00      0.84     0.99         0.83   1.04    0.44    0.78
  seed                   20          1.05   1.00   1.00   1.00      0.84     0.98         0.86   1.04    0.42    0.78
  pin util.              20          1.05   1.00   1.00   1.00      0.84     0.99         0.86   1.08    0.44    0.80
  prioritize transitive  22          1.05   1.00   1.00   1.00      0.81     0.97         0.86   1.07    0.43    0.80

Normalized values are the geometric means over the common set of benchmark circuits which completed for all tool parameters.

Table 8.2: Normalized Impact of Packer High Fanout Threshold on Titan benchmarks

               # Circuits                                 Routed   Crit. Path   Pack   Place   Route   Total
               Completed   LABs   DSPs   M9Ks   M144Ks    WL       Delay        Time   Time    Time    Time
  256          22          1.00   1.00   1.00   1.00      1.00     1.00         1.00   1.00    1.00    1.00
  64           22          1.01   1.00   1.00   1.08      0.96     0.99         0.97   0.98    0.90    0.97
  48           23          1.01   1.00   1.14   1.21      0.98     0.98         0.88   0.85    0.85    0.89
  32           23          1.03   1.01   1.19   1.22      0.97     0.99         0.93   0.97    0.85    0.97
  128 LAB:64   22          1.01   1.00   1.00   1.00      0.97     0.99         0.87   0.89    0.89    0.90
  128 LAB:32   23          1.03   1.00   1.00   1.00      0.95     0.98         0.83   0.87    0.85    0.87
  128 LAB:16   23          1.07   1.00   1.00   1.00      0.94     0.99         0.80   0.91    0.78    0.88

Normalized values are the geometric means over the common set of benchmark circuits which completed for all tool parameters.

8.2 Placement Enhancements

VPR’s Simulated Annealing (SA) based placer iteratively makes a perturbation to the current placement (called a move) which is then evaluated and either accepted or rejected [46]. A desirable property for any move generator used in SA-based placement is that all possible configurations (i.e. placements) should be reachable through some (hopefully short) sequence of moves. Failure to ensure this means some parts of the placement solution space will either be impossible or very difficult for the placer to reach, potentially excluding higher quality placements. In VPR 8 we have made enhancements to the move generator to improve robustness and optimization quality.

8.2.1 Placement Macro Move Generator

VPR’s placer supports placement macros which enforce relative placement constraints between specific blocks in the packed netlist. This is primarily motivated by architectural features such as fast dedicated routing for carry chains which require the connected blocks to have specific relative placements. As noted in [210], VPR 7’s move generator had limited support for moves involving placement macros, and immediately aborted (gave up on) any moves involving two or more placement macros. This meant the placement of carry chains was not always well optimized as the placer required longer sequences of indirect moves to change the placement of two macros.8 In VPR 8 we have extended the placer’s move generator to support moving two or more macros simultaneously.

Table 8.3 shows the cumulative impact of supporting different types of increasingly complex moves involving multiple placement macros on the Titan benchmarks targeting the Stratix IV model from Section 7.2. Compared to the VPR 7-style baseline placer (baseline, which does not support moves involving more than one macro), supporting moves with two macros (macro swap, Figure 8.5a) improves routed wirelength by 2% and reduces route-time by 4%.9 This more complex move generator increases placement time by only 1%.

We observed that despite this, many moves involving macros were still being aborted; particularly late during the anneal. This occurred because VPR’s region limit10 is reduced as placement progresses and as a result moves involving macros often either overlapped themselves or overlapped multiple nearby macros. For example, a move could propose shifting a carry chain by a distance shorter than the carry chain’s length, resulting in it overlapping its previous position (Figure 8.5b). Furthermore circuit structures like adder trees may cause carry chains to be placed close together, and as a result moving one carry chain may require shifting multiple carry chains to avoid overlaps (Figure 8.5c). Adding support for these types of overlapping macro moves (overlapping macros in Table 8.3) produces similar average QoR to macro swap, but enables the large carry-chain heavy bitcoin miner design to successfully route.11 On that design many proposed macro moves result in overlaps, so supporting such moves is key to allowing the placer to fine-tune the placement of carry chains.

8This was particularly problematic on designs with many carry-chains, as proposed moves are more likely to involve multiple carry-chains in such cases. For instance in the Titan bitcoin miner design 44% of logic blocks are part of a carry chain, and as a result > 49% of proposed moves were aborted.

Figure 8.5: Macro swap examples. (a) Simple independent macro-macro swap. (b) Self-overlapping macro swap (shift). (c) Overlapping macro-macro swap: moving macro a overlaps macro b, which when swapped also overlaps macro d.

Table 8.3: Cumulative Normalized Impact of Placement Move Generation Changes on Titan benchmarks

                          Completed    Placed   Placed   Routed   Routed CPD   Place   Route   Total
                          Benchmarks   WL       CPD      WL       (geomean)    Time    Time    Time
  baseline                21           1.00     1.00     1.00     1.00         1.00    1.00    1.00
  macro swap              21           0.97     0.98     0.99     1.00         1.01    0.96    0.98
  overlapping macros      22           0.97     1.00     0.99     1.01         1.01    0.94    0.99
  compressed move grid    22           0.92     0.98     0.94     1.02         1.01    0.86    0.97

QoR and run-time metrics are normalized geomeans over mutually completed benchmarks.

8.2.2 Compressed Move Grid

When proposing moves, VPR 7 uses a region limit to control the distance between the involved blocks. Figure 8.6a shows the region limit around a selected block; the block can only be swapped with blocks within the region limit. As placement progresses the region limit is decreased; focusing the placer on smaller, less disruptive changes which help fine-tune the placement. One challenge with this approach is that it can prematurely lock down blocks which are relatively sparse within the FPGA device grid, as shown in Figure 8.6b. Once the region limit shrinks below the inter-column distance of a particular block type, those blocks become stuck within their current column and will never be able to move to another column. If this happens too early in the anneal such blocks may be prematurely locked down in sub-optimal positions.

To counteract this, VTR 8 does not apply the region limit directly to the FPGA grid. Instead, it constructs a compressed grid space for each block type to which the region limit is applied. This ensures blocks of a particular type in the FPGA grid which are logically adjacent to each other are considered as such during move generation. As shown in Figures 8.6c and 8.6d this prevents the range limit from artificially restricting the movement of some block types.12 Using these compressed grid spaces also ensures VPR can always quickly find adjacent blocks of the same type.13 The results of this change are listed as compressed move grid in Table 8.3. Using compressed move grids further improves routed wirelength and route-time by 6% and 14% respectively. While routed critical path is shown to increase by 2%, this excludes the improvements on benchmarks like bitcoin miner which did not complete in the baseline. However, the post-placement estimated critical path delay (which includes all benchmarks) improves by 2%.

Figure 8.6: Region Limit Examples. (a) VTR 7 move region limit for a common block. (b) VTR 7 move region limit preventing a sparse block from moving between columns. (c) VTR 8 move region limit for a block type in a compressed grid space; common blocks can move across columns of other block types. (d) VTR 8 move region limit for a sparse block in a compressed grid space; sparse blocks can move between columns even at low range limits.

9On bitcoin miner the move abort rate was reduced to 22.8%.
10The region limit controls how far apart the blocks involved in a move are allowed to be.
11The bitcoin miner move abort rate was further reduced to 0.6%.

12Effectively, the region limit Rlim (Section 2.4.2) is scaled based on the density of each block type within the FPGA device grid.
13Previously in VTR 7 the placer would abort some moves if it was unable to quickly find another block of the correct type.
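To summarize the mechanism, here is a small sketch (hypothetical structure, not VPR’s code) of applying the range limit in a per-block-type compressed column space:

    #include <algorithm>
    #include <utility>
    #include <vector>

    // For one block type, keep the sorted list of grid columns containing that
    // type. The region limit is applied to indices in this list, so sparse block
    // types can still change columns even when the range limit is small.
    struct CompressedGrid {
        std::vector<int> columns_with_type;  // sorted x coordinates containing the type
    };

    // Range of real columns a block at column 'x' may move to, given a region
    // limit 'rlim' measured in compressed column indices. Assumes a non-empty grid.
    std::pair<int, int> column_move_range(const CompressedGrid& g, int x, int rlim) {
        int n   = static_cast<int>(g.columns_with_type.size());
        auto it = std::lower_bound(g.columns_with_type.begin(),
                                   g.columns_with_type.end(), x);
        int idx = std::min(static_cast<int>(it - g.columns_with_type.begin()), n - 1);
        int lo  = std::max(0, idx - rlim);
        int hi  = std::min(n - 1, idx + rlim);
        return {g.columns_with_type[lo], g.columns_with_type[hi]};
    }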

Chapter 9

AIR: A Fast but Lazy Timing-Driven FPGA Router

Routing is a key step in the FPGA design process, which significantly impacts design implementation quality. Routing is also very time-consuming, and can scale poorly to very large designs. This chapter describes the Adaptive Incremental Router (AIR) [4], the high-performance timing-driven FPGA router used in VTR 8 [2]. AIR dynamically adapts to the routing problem, which it solves ‘lazily’ to minimize work. Compared to the widely used VPR 7 router, AIR significantly reduces route-time (7.1× faster), while also improving quality (15% wirelength, and 18% critical path delay reductions). We also show how these techniques enable efficient incremental improvement of existing routing.

9.1 Introduction

Field Programmable Gate Arrays (FPGAs) have pre-fabricated routing resources along with configurable switches to interconnect them. This allows the routing resources to be re-configured to implement different designs, but also means the interconnect network has limited and restricted connectivity. The interconnect network typically accounts for a majority of delay, and 50%+ device area [38, 66]. As a result the routing of signals through the network has a large effect on the final design implementation quality (wirelength, power consumption and critical path delay). Unlike Application Specific Integrated Circuits (ASICs) where custom interconnect topologies can be constructed, FPGA routing requires finding a legal embedding of the design’s interconnect into the pre-fabricated network. Routing also accounts for a large portion of FPGA CAD flow run-time (41-86% in commercial and academic tools [10]) and can scale poorly to very large designs due to high-fanout nets and difficult to resolve congestion. In the FPGA CAD flow there are no stages following routing which can fix up routability or timing issues (e.g. no buffer insertion or gate re-sizing). The only way for designers to address these issues is to restructure their design’s logic (to reduce congestion and/or improve timing), which is extremely disruptive and time-consuming. It is therefore very important for FPGA routers to be robust (find legal routings) and high quality (minimize wirelength and delay). An FPGA’s routing resources and switches are typically modelled as a Routing Resource (RR) graph as shown in Figure 6.2, where conductors (wires/pins) and switches correspond to nodes and edges respectively. The FPGA routing problem is to find a large number of non-overlapping trees within the


RR graph which implement the design's connectivity, while simultaneously minimizing metrics such as wirelength and critical path delay, all within reasonable run-time. This is a challenging task given the large size of the RR graph1 and its limited connectivity (sparseness).

The most successful FPGA routing techniques are based upon the PathFinder negotiated congestion algorithm (described in Section 2.5) [60], where multiple nets are allowed to use the same routing resources. Such overused resources are said to be congested. Allowing congestion prevents nets from blocking each other, an important characteristic given the interconnect network's limited flexibility. To resolve congestion and arrive at a legal solution, nets are repeatedly ripped-up and re-routed while modifying routing resource costs.

A variety of works have studied FPGA routing. Early works focused on two-stage global-detailed routing [211] and investigated heuristics for Steiner point selection to improve quality [212]. However, both were outperformed by single-stage negotiated congestion routing [46]. Different criteria for ranking paths during graph search were explored in [213], improving quality at the cost of significantly higher run-time. [214] investigated methods to prune the search space and build good initial 'spines' for high-fanout nets. More recently, CRoute [215] proposed decomposing nets into independent connections to reduce route-time. Unlike CRoute, our approach maintains the natural net-based structure of the problem, which improves routing resource re-use, reduces redundant work during graph search, and facilitates further net-based optimizations. Another popular approach has been to exploit parallel computing resources to speed up routing [75, 216, 76, 162]. However, these techniques have also faced scalability challenges due to synchronization costs and load imbalances. Our approach is complementary, as it reduces the work required to route individual nets, particularly large high-fanout nets which often limit parallel speed-up [75].

We focus on developing the Adaptive Incremental Router (AIR), which improves scalability and quality compared to previous approaches. We describe the core algorithm in Section 9.2 and detail our enhancements in Sections 9.3 to 9.5. We then present the incremental improvement of existing routing in Section 9.6, and extensions to support non-configurable switches in Section 9.7. Finally, improvements to minimum channel width search are described in Section 9.8, and Section 9.9 evaluates the enhanced router.

9.2 Algorithm Overview

The AIR netlist routing algorithm is outlined in Algorithm 12. The algorithm operates over multiple routing iterations (Line 4). During each iteration, connections are routed between each net's driver and sinks (Lines 6 and 7). Importantly, each connection is allowed to use routing resources already used by other nets. Such overused routing resources are said to be congested. Once all nets have been routed (Line 5), the routing resource costs are selectively increased (Line 16) with the aim of reducing congestion during subsequent routing iterations. This process repeats until a legal routing is found (Lines 8 and 11), or the design is deemed unroutable, either by prediction (Line 14) or by hitting the maximum iteration limit (Line 4). Finally, the best legal routing found (if any) is returned (Line 20).

1 Tens to hundreds of millions of nodes for modern FPGAs.

Algorithm 12 AIR Netlist Routing
Require: The nets to route, α the maximum number of routing convergences
Returns: The best routing found
 1: function air route(nets, α)
 2:     best ← ∅, current ← ∅                          ▷ Best and current route trees for each net
 3:     convergences ← 0                               ▷ How many times a legal routing has been found
 4:     for iter ∈ 1 . . . max iters do
 5:         for net ∈ nets do
 6:             for sink ∈ unrouted sinks(net, current[net]) do    ▷ Incrementally route
 7:                 current[net] ← air route connection(net, current[net], sink)
 8:         if is legal(current) then
 9:             best ← best routing(best, current)     ▷ Keep best routing so far
10:             convergences ← convergences + 1
11:             if convergences = α then               ▷ Convergence limit reached
12:                 break
13:             reset pres fac()                       ▷ Reduce present congestion cost to focus on delay
14:         if predict unroutable() then               ▷ Early abort
15:             break
16:         update costs()                             ▷ Update pathfinder costs
17:         for net ∈ nets do                          ▷ Prepare for incremental re-routing
18:             ripup illegal connections(current[net])
19:             ripup delay degraded connections(current[net])
20:     return best

Algorithm 13 AIR Connection Routing
Require: The net to route, its existing route tree, the target sink to be added
Returns: The updated route tree with a branch connecting to sink
 1: function air route connection(net, route tree, sink)
 2:     path ← ∅
 3:     if fanout(net) ≥ β and criticality(sink) < γ then
 4:         heap ← initialize nearby routing(route tree, sink)     ▷ High-fanout opt.
 5:         path ← find path from heap(heap, sink)
 6:     if path = ∅ then
 7:         heap ← initialize full route tree(route tree)
 8:         path ← find path from heap(heap, sink)
 9:     update route tree(route tree, path)
10:     return route tree

(a) Route tree from driver/root A to sinks/leaves {E, H, I, G}. Node F is illegal (also used by another net).
(b) Pruned legal partial route tree. Only sinks H and I must be re-routed. Any of {A, B, C, D, E, G} could be new branch points.
(c) Incrementally re-routed legal route tree. Sinks H and I are now legally routed through J.
Figure 9.1: Route tree pruning example.

9.3 A Fast but Lazy Router

A key component to AIR’s efficient run-times compared to previous approaches is its ‘lazy’ approach. Wherever possible AIR attempts to minimize performing redundant work. This section outlines a number of enhancements made to AIR which improve its run-time performance by making it lazier.

9.3.1 Incremental Routing

In PathFinder-based routers each net's routing is represented as a route tree: a tree of routing resources spanning from the net's driver (tree root) to each of its sinks (tree leaves). Each path in the route tree (from root to leaf) corresponds to a net connection (from driver to sink). To reduce wirelength it is usually desirable for connections within the same net to share some routing resources. Traditional PathFinder-based routing algorithms rip-up and re-route all nets every routing iteration. However, in practice many nets (or portions of nets) may have already been legally routed. In such cases it is not strictly necessary to re-route these connections. AIR exploits this by re-routing nets incrementally on a per-connection basis. This 'lazy' approach avoids redundantly ripping up (and then re-routing) the legal portions of a net.2 First, a net's current route tree is walked to identify congested sub-trees as shown in Figure 9.1a. Illegal sub-trees are then pruned as shown in Figure 9.1b to leave only legal routing (Algorithm 12 Line 18). Second, the pruned net sinks are re-routed during the next routing iteration (Algorithm 12 Line 6), using the net's remaining legal routing as potential branch points, as shown in Figure 9.1c.3

CRoute [215] proposed decomposing nets into independent connections, which allowed the creation of a routing algorithm with reduced run-time. Unlike CRoute, our approach does not strictly decompose the routing problem into connections and maintains its natural net-based structure. This allows partial routing from previous connections to be re-used, which reduces the amount of graph exploration required and enables further net-based optimizations such as high-fanout routing (Section 9.3.2).

2 This is a much more fine-grained approach than [216], which only avoided ripping-up completely legal nets.
3 Note the router already stores route trees for each net, so no additional memory is required for incremental routing.
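The pruning step of Figure 9.1 can be sketched as a simple post-order walk over the route tree. The Python below is purely illustrative (RouteTreeNode, is_congested and the surrounding names are hypothetical stand-ins, not VPR's data structures); a real implementation would additionally trim legal branches that no longer lead to any sink.

from dataclasses import dataclass, field
from typing import List

@dataclass
class RouteTreeNode:
    rr_node: int                                      # RR graph node id
    is_sink: bool = False
    children: List["RouteTreeNode"] = field(default_factory=list)

def prune_route_tree(node, is_congested, lost_sinks):
    """Return the legal portion of the tree rooted at `node`, or None.

    Sinks removed along with an illegal sub-tree are appended to
    `lost_sinks` so they can be incrementally re-routed next iteration.
    """
    if is_congested(node.rr_node):
        collect_sinks(node, lost_sinks)               # whole sub-tree is dropped
        return None
    node.children = [kept for kept in (prune_route_tree(c, is_congested, lost_sinks)
                                       for c in node.children) if kept is not None]
    return node

def collect_sinks(node, lost_sinks):
    if node.is_sink:
        lost_sinks.append(node.rr_node)
    for c in node.children:
        collect_sinks(c, lost_sinks)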

Table 9.1: Normalized Impact of Incremental Routing on VTR benchmarks (> 10K Primitives)

δ         Wmin    Routed WL    Crit. Path Delay   Route Time    Route Time    Peak     Total Flow
                  (1.3·Wmin)   (1.3·Wmin)         (find Wmin)   (1.3·Wmin)    Memory   Time
Baseline  1.00    1.00         1.00               1.00          1.00          1.00     1.00
8         1.01    0.99         1.01               0.63          0.77          1.01     0.72
16        1.01    0.98         1.03               0.66          0.77          1.01     0.76
32        1.02    0.99         1.00               0.75          0.81          1.01     0.84
64        1.01    0.99         1.01               0.82          0.90          1.01     0.86
128       1.00    0.99         1.01               0.82          0.97          1.00     0.90

Table 9.2: Normalized Impact of Incremental Routing on Titan benchmarks

δ         Routed WL   Crit. Path Delay   Route Time   Peak Memory   Total Flow Time
Baseline  1.00        1.00               1.00         1.00          1.00
8         0.98        1.00               0.85         1.00          0.93
16        0.99        1.00               0.81         1.00          0.87
32        0.99        1.00               0.85         1.00          0.87
64        1.00        1.00               0.85         1.00          0.88
128       1.00        1.00               0.87         1.00          0.91
Excluding: directrf, gaussianblur, dart

While incremental routing is primarily a run-time optimization, it is important to consider whether failing to rip-up legal connections can have any impact on QoR. In particular, since PathFinder uses a present congestion cost factor, not ripping up all connections may result in some timing critical nets taking more indirect routes to avoid present congestion. During experiments we found that in some instances this could degrade critical path delay. To alleviate this behaviour we also perform delay-based rip-up (Algorithm 12 Line 19), which will force critical connections4 whose delay has also degraded (compared to previous results)5 to be re-routed even if they have a legal routing.

While this technique can save significant work for large nets, on small low-fanout nets the work performed pruning the route tree is comparable to the work required to re-route the net. As a result we apply incremental net re-routing only for nets beyond a certain fanout threshold (δ).

Tables 9.1 and 9.2 compare incremental routing at various fanout thresholds to the Baseline method (where nets are always ripped up) on the VTR benchmarks (targeting VTR's flagship k6 frac N10 frac chain mem32K 40nm architecture) and Titan benchmarks (targeting the Stratix IV model from Section 7.2) respectively, with all other routing optimizations enabled. The results show that incremental routing has effectively no impact on quality of results (minimum routable channel width, wirelength, or critical path delay), while reducing the run-time of the router. For the VTR benchmarks (Table 9.1) and Titan benchmarks (Table 9.2), fixed channel width route-time is minimized at a fanout threshold of 16, reducing average route-time by 23% and 19% respectively. Importantly, the impact of incremental routing is more significant on larger circuits (run-time reduction of 2.3× on directrf), which shows it improves scalability.
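Concretely, the per-connection rip-up decision can be sketched as below, using the thresholds from footnotes 4 and 5 (criticality within 90% of the maximum, and delay more than 10% worse than the best seen). The connection object and its fields are hypothetical stand-ins for VPR's internal bookkeeping.

def should_ripup(conn, is_congested, max_criticality,
                 crit_frac=0.9, delay_growth=1.1):
    """Decide whether a routed connection must be ripped up and re-routed.

    `conn` is assumed to carry: .rr_nodes (nodes on its path), .criticality,
    .delay (this iteration) and .best_delay (best delay seen so far).
    """
    # 1) Illegal connections (using a congested node) are always re-routed.
    if any(is_congested(n) for n in conn.rr_nodes):
        return True
    # 2) Timing-critical connections whose delay has degraded relative to the
    #    best seen value are also re-routed, even though they are legal.
    is_critical = conn.criticality >= crit_frac * max_criticality
    has_degraded = conn.delay > delay_growth * conn.best_delay
    return is_critical and has_degraded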

9.3.2 High-Fanout Nets

AIR routes each net one connection at a time (Algorithm 12 Line 7) using Algorithm 13. To avoid wasting large amounts of wiring, any existing routing (from a net’s previously routed connections) is added to the heap with zero cost, allowing it to be re-used as a branch-point for subsequent connections (Algorithm 13 Line 7).

4 Those with criticality ≥ 0.9 · max criticality.
5 Where new delay > 1.1 · best seen delay.

Most designs have a relatively small number of high-fanout nets which span a large portion of the device. On a subset of the Titan benchmarks we found VPR spent 12-34% of its run-time routing these few high-fanout nets. For a net with k sinks, putting the entire route tree into the heap means the time complexity to route the net grows quickly. For a single sink, pushing the net's O(k) route tree into the heap has a time complexity6 of

O(k log k).    (9.1)

Since this is done for each of the k sinks, the overall time complexity to route a net grows as:

O(k² log k).    (9.2)

The O(k²) component of Equation (9.2) is problematic, as it leads to very high route-time for high-fanout nets. To avoid this, VTR 8 uses an approach based on [217]. Rather than putting the entire route tree into the heap, only routing resources which are spatially near the target sink are added (Algorithm 13 Line 4). Since the number of such resources is bounded by a small constant, this reduces the per-sink O(k log k) term of Equation (9.2) to O(1), and the complexity of routing a net falls to:

O(k),    (9.3)

which is much more scalable.

Figure 9.2 illustrates the differences between these approaches, by showing the routing resources examined by the router when routing to a particular high-fanout net sink located near the center of the device. Figure 9.2a shows the traditional approach looks at a large number of routing resources (i.e. considers the entirety of the large route tree), many of which will not lead to the target sink (since they are far from it). In contrast, Figure 9.2b shows the spatial approach is much more efficient, with the router only examining a few routing resources which are near the target sink.7

However, we found that on some FPGA architectures this technique was not robust, and no path could be found between the spatially nearby previous routing and the target sink. This can occur in routing architectures which have sparse connectivity to certain routing resources such as long wires. In such cases we fall back to placing the full route tree onto the heap (Algorithm 13 Lines 6 to 8). Since such instances are rare, this maintains the speed-up of spatially aware high-fanout routing while ensuring robustness. Furthermore, we observed that this high-fanout routing technique can also degrade timing performance, since high-fanout timing-critical connections will have less flexibility to find potentially faster routes. To alleviate this we route all timing-critical high-fanout connections8 using the full previous route tree (Algorithm 13 Line 3).

While these optimizations had no significant effect on the VTR benchmarks, they do impact the larger Titan benchmarks. Table 9.3 shows the results on the Titan benchmarks (targeting the Stratix IV model from Section 7.2)9 for several fanout (β) and criticality (γ) thresholds (Algorithm 13 Line 3), with all other routing optimizations enabled.

6 Assuming a binary heap with O(log k) insertions.
7 Note that even in commercial FPGAs which often have dedicated networks for high-fanout connections (e.g. global clock networks), this technique should still be beneficial, as it helps avoid examining the entire global network when routing a new sink.
8 For the first routing iteration we treat all connections as timing critical. For later iterations the criticality is determined based on the delays of the previous routing iteration.
9 The modelled routing architecture has no high-fanout optimized global routing networks, and as a result clocks are not routed in these results. However other high-fanout nets like resets are routed in the regular routing.
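The spatially limited heap seeding can be sketched as follows. This Python sketch is illustrative only: node_xy and the linear scan over route-tree nodes stand in for the spatial bin lookup a real router would use (a linear scan would re-introduce the per-sink O(k) cost the optimization is meant to avoid).

import heapq

def seed_heap_near_sink(route_tree_nodes, node_xy, sink_xy, window=3):
    """Push only route-tree nodes spatially near the target sink onto the heap.

    route_tree_nodes: RR node ids already used by this net's routing.
    node_xy / sink_xy: (x, y) grid locations (hypothetical helpers).
    Returns a heap of (cost, node) entries; an empty heap signals the caller
    to fall back to seeding the full route tree (Algorithm 13 Lines 6 to 8).
    """
    sx, sy = sink_xy
    heap = []
    for node in route_tree_nodes:
        x, y = node_xy(node)
        if abs(x - sx) <= window and abs(y - sy) <= window:
            # Existing routing is free to re-use, so it is seeded at zero cost.
            heapq.heappush(heap, (0.0, node))
    return heap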

(a) Traditional approach   (b) Spatial approach
Figure 9.2: Routing resources explored (colored by cost to reach the current target sink) while routing a high-fanout net (∼ 3500 terminals) on the neuron benchmark circuit targeted to the Stratix IV model from Section 7.2. The target sink is located near the centre of the device.

Table 9.3: Normalized Impact of High Fanout Routing on Titan benchmarks

                   Routed WL   Crit. Path Delay   Route Time
off                1.00        1.00               1.00
β = 32, γ = 1.0    1.00        1.03               0.84
β = 64, γ = 1.0    1.00        1.03               0.86
β = 64, γ = 0.9    1.00        1.00               0.77

While high-fanout routing reduces run-time for a wide range of these parameters, the best result occurs with a fanout threshold of 64 and a criticality threshold of 0.9, where route time is reduced by 23% with no impact on wirelength or critical path delay. Notably, setting an active criticality threshold (γ < 1) improves route time since it likely avoids having to rip-up and re-route otherwise delay sub-optimal high-fanout connections.

9.3.3 Block Output-pin Lockdown

To facilitate faster routing convergence, the VPR 7 router locks down block output pins after a fixed number of routing iterations. This is a search-space pruning technique, which aims to prevent signals from oscillating between different logically equivalent output pins as congestion is resolved. As shown in Table 9.4 (with all other routing optimizations enabled), disabling this feature reduces both wirelength and route-time on the Titan benchmarks targeting the Stratix IV model from Section 7.2. AIR's incremental re-routing enhancement (Section 9.3.1) avoids unnecessarily ripping up and re-routing uncongested output pins, and so naturally avoids the scenario output pin lockdown tried to address in VPR 7. As a result, output pin lockdown is not performed by AIR.

Table 9.4: Normalized Impact of OPIN Lockdown on Titan benchmarks

      Routed WL   Crit. Path Delay   Route Time   Peak Memory
on    1.00        1.00               1.00         1.00
off   0.98        1.01               0.94         1.00

9.4 Improving Quality

In addition to being run-time performant, it is important that an FPGA router like AIR produce high quality results. This section describes enhancements which focus on improving the quality of AIR's routing. Often, these enhancements also lead to further run-time improvements, as a higher quality routing leads AIR to converge more quickly.

9.4.1 Router Lookahead

VPR’s connection-based router makes use of a predictive lookahead, to estimate the cost (delay and congestion) of reaching the target sink through the node being expanded (Algorithm 16 Lines 10 and 11). This lookahead is used to quickly guide the router to find a low cost path to the target.10 The lookahead used by prior versions of VPR, which we call the classic router lookahead, makes a variety of simplifying assumptions which may not hold true on modern FPGA architectures. In particular, it assumes different wire types do not interconnect, and all wire types connect to block pins. As a result, the classic lookahead can mislead the router on modern architectures (where its assumptions typically do not hold, as illustrated in Figure 9.3), harming delay and wirelength. As a result, AIR uses a new lookahead based on an enhanced version of [143], which adapts to the targeted routing architecture. The new lookahead uses an undirected Djikstra-based search to quickly profile a large number of different routes through the RR graph. As shown in Algorithm 14, this search is performed for a handful of sample locations, and every wire type and orientation (vertical or horizontal). This produces the delay and congestion costs to travel different horizontal and vertical distances. This data is reduced into a concise map of the routing network by making the common assumption of translational invariance in the FPGA routing network, so only differences in position need to be stored.

Algorithm 14 Map Lookahead Creation
 1: function build map lookahead()
 2:     map ← init map inf()                          ▷ Initialize all entries to ∞
 3:     for loc ∈ SAMPLE LOCS do                      ▷ Each sample location
 4:         for w ∈ WIRE TYPES do                     ▷ Each wire type (e.g. length)
 5:             for c ∈ {CHANX, CHANY} do             ▷ Each wire orientation
 6:                 ref node ← pick ref node(loc, w, c)
 7:                 costs ← dijkstra flood from(ref node)
 8:                 for x ∈ 1 . . . W do              ▷ Each horizontal position
 9:                     for y ∈ 1 . . . H do          ▷ Each vertical position
10:                         dx, dy ← |loc.x − x|, |loc.y − y|    ▷ Position difference
11:                         delay ← min(costs[x][y].delay, map[w][c][dx][dy].delay)
12:                         cong ← min(costs[x][y].cong, map[w][c][dx][dy].cong)
13:                         map[w][c][dx][dy].delay ← delay
14:                         map[w][c][dx][dy].cong ← cong
15:     return map
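Once built, the map is queried during node expansion using only the horizontal and vertical distance to the target, exploiting the translational invariance assumption. A minimal Python sketch of such a lookup, with hypothetical accessors and an illustrative map layout (not VPR's actual data structures):

from collections import namedtuple

Cost = namedtuple("Cost", ["delay", "cong"])

def lookahead_estimate(cost_map, wire_type, chan, node_xy, target_xy):
    """Return the profiled (delay, congestion) cost from `node_xy` to `target_xy`.

    cost_map[wire_type][chan][(dx, dy)] -> Cost, as produced by a build step
    like Algorithm 14.  Only the absolute position difference is needed.
    """
    dx = abs(target_xy[0] - node_xy[0])
    dy = abs(target_xy[1] - node_xy[1])
    return cost_map[wire_type][chan][(dx, dy)]

# Tiny illustrative map: from an L4 wire in a horizontal channel, moving
# (2, 0) tiles is profiled to cost 1.5 ns of delay and 8 units of congestion.
example_map = {"L4": {"CHANX": {(2, 0): Cost(delay=1.5e-9, cong=8.0)}}}
print(lookahead_estimate(example_map, "L4", "CHANX", (10, 5), (12, 5)))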

Table 9.5 shows the impact of the different lookaheads on the Titan benchmarks (targeting the Stratix IV model from Section 7.2), with all other router optimizations enabled. First, compared to the VPR 7-style lookahead (classic), the new map lookahead (map) achieves much better result quality, reducing critical path delay by 9% and wirelength by 2%, but at the cost of increasing route-time by nearly 6× (the high run-time is alleviated by adjusting the base costs as described in Section 9.4.2).

10 This is an approximate variant of the A* algorithm [218], where the lookahead is akin to a heuristic lower bound but is not strictly admissible.

Table 9.5: Normalized Impact of Router Lookahead and Base Costs on Titan benchmarks

                 Routed WL   Crit. Path Delay   Route Time   Peak Memory
classic          1.00        1.00               1.00         1.00
map              0.98        0.91               5.94         1.00
classic length   0.93        0.99               0.76         1.00
map length       0.92        0.91               0.89         1.00
Values are normalized geomeans over mutually completed benchmarks.

Figure 9.3: Stratix IV-like hierarchical routing architecture with short (L4) and long (L16) wires. Long wires are only accessible via short wires, and do not directly connect to block pins.

As illustrated in Figure 9.3, modern FPGAs often have high-speed long wire sub-networks which are only accessible from a subset of the more plentiful short wires [127, 66] (Section 7.2.2). The map lookahead improves critical path delay since it understands this hierarchical structure from its profiling of the routing network. This is illustrated in Figure 9.4, which shows the delay estimates for various distances starting from a short wire for both lookaheads. In Figure 9.4a the classic lookahead assumes all connections use the same type of short wire, leading the lookahead's delay estimate to rapidly increase with distance. In contrast, Figure 9.4b shows the map lookahead's delay estimate, which increases much more slowly, particularly at longer distances. Because the map lookahead understands the interconnections between the different wire types, it knows it is possible to get onto the faster long wire network even when starting from a short wire, and its delay estimate therefore captures the effect of using fast long wires to traverse long distances.

These differences guide the router to make different choices for timing critical long distance connections. With the classic lookahead, the router immediately starts driving towards the target using the short-wire network, since it does not understand faster paths may exist. It will only use the long wire sub-network if it happens to encounter it during its directed exploration. The map lookahead instead guides the router to search the short-wire network more thoroughly to find a way onto the faster long wire network. Getting onto the long wire network early leads to significant delay reductions for long distance connections.

(a) Classic L4   (b) Map L4
Figure 9.4: Lookahead Delay Estimates

(a) Classic L4   (b) Classic L16   (c) Map Length L16
Figure 9.5: Lookahead Congestion Estimates

9.4.2 Routing Resource Base Costs

The lookaheads also provide congestion/resource cost estimates to guide the router's search process. Figure 9.5 shows the congestion cost estimates. Figures 9.5a and 9.5b show the classic lookahead's congestion cost estimates when starting from short or long wires. Interestingly, this shows it is much cheaper to travel an equivalent distance using the long rather than short wires, even though the long wires are much larger and rarer routing resources.11 This cost difference biases the router to prefer using the long wire network even for non-timing-critical connections. The original map lookahead suffers from the same bias. Since the map lookahead also has a stronger preference for using long wires for timing-critical connections, this led to significant congestion in the long wire network. While this congestion would eventually be resolved through congestion negotiation, that is a very slow process, requiring connections to be repeatedly ripped-up and re-routed over many routing iterations.

To address this issue we adjusted the wire base costs to be proportional to each wire's length. For instance, length 4 and 16 wires are automatically assigned base costs of 4 and 16 units respectively (previously both would have been allocated unit base costs). The resulting congestion cost estimates for long wires with the map lookahead are shown in Figure 9.5c. These costs make long wires more expensive than short wires, particularly when using a long wire to travel distances shorter than its length.12 This guides short distance and non-timing-critical connections to use the more plentiful short wires. As a result, wirelength improves, and run-time decreases (as congestion is resolved quickly).

This is illustrated in Figure 9.6, which compares how base costs affect congestion resolution and route-time. For non-length-scaled base costs (Figure 9.6a) both the L4 and L16 wires are initially heavily congested. However, the L16 wires are proportionally much more congested than the L4 wires.13 Congestion in both wire lengths then resolves very slowly over many routing iterations. In contrast, for length-scaled base costs (Figure 9.6b) congestion in both wire lengths resolves quickly. This is particularly true of the L16s, which are congestion free much earlier (6th iteration) than with unit base costs (11th iteration). As a result the router converges in fewer routing iterations, and requires significantly less run-time (50 vs. 80 seconds).14

11 This is derived from the original VPR formulation [38], which used a single base cost for all wire lengths.
12 For instance, with the previous uniform base costs using an L16 wire to move 8 units was half the cost of using 2 L4 wires. Using length-scaled base costs the L16 wire would be twice as expensive as using 2 L4s, which is intuitively consistent as half the L16 wire would be unused.
13 There are ∼ 26× fewer L16 than L4 wires (Table 7.2), but initially only 5× fewer congested L16s in Figure 9.6.
14 Re-routing in the presence of congestion requires the router to repeatedly search more and more of the RR graph to find uncongested paths. As a result, the run-time benefits of reducing congestion tend to improve with the larger RR graphs needed for bigger designs.

(a) Unit base costs.   (b) Length-scaled base costs.
Figure 9.6: Comparison of routing congestion by resource type and router run-time for different base costs on the neuron benchmark circuit with the map lookahead.

The impact of using length-scaled base costs with the two lookaheads is also shown in Table 9.5. With length-scaled base costs the map lookahead (map length) improves route-time substantially, making it 11% faster than the classic lookahead. Length-scaled base costs also improve the classic lookahead's run-time (classic length), though the relative improvement is smaller. For both lookaheads the length-scaled base costs reduce routed wirelength by 7-8%.15
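The base-cost change itself is small. The sketch below (a hypothetical WireSegment type with a length attribute, not VPR's data structures) assigns each wire segment a base cost proportional to its length, in place of the previous uniform unit cost:

from dataclasses import dataclass

@dataclass(frozen=True)
class WireSegment:
    name: str
    length: int   # wire length in tiles (e.g. 4 for an L4, 16 for an L16)

def base_cost(segment, length_scaled=True):
    """Congestion base cost of one wire segment.

    With length_scaled=True an L16 wire costs 4x an L4, so using a long wire
    to travel a short distance is no longer artificially cheap.
    """
    return float(segment.length) if length_scaled else 1.0

print(base_cost(WireSegment("L4", 4)), base_cost(WireSegment("L16", 16)))  # 4.0 16.0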

9.5 Adapting to Congestion

Difficult-to-resolve routing congestion is a significant challenge in FPGA routing. This section outlines some enhancements which allow AIR to adapt dynamically to congestion, improving its robustness when faced with difficult congestion.

9.5.1 Dynamic Router Bounding Boxes

To reduce router run-time, negotiated congestion routers like VPR have historically restricted each net to route within a fixed region derived from the bounding box of the net's driver and sink locations.16 While this limiting of the router's search space saves run-time, it can make it more difficult to resolve congestion. In particular, the fixed bounding box limits how far a net can 'move away' from a congested region (as it is restricted to remain inside the bounding box). Such an approach can make it difficult to resolve congestion, since it forms a hard limit which may prevent a connection from routing around congestion, as shown in the left of Figure 9.7. To avoid this, AIR dynamically expands the search limit when a net uses a routing resource adjacent to a bounding box edge, as shown in the right of Figure 9.7. This ensures no hard limit restricts how far nets can move out of the way to alleviate congestion. Table 9.6 shows the impact of dynamic bounding boxes with all other routing optimizations enabled. Dynamic bounding boxes improve minimum routable channel width by 2% on the large VTR benchmarks (targeting VTR's flagship k6 frac N10 frac chain mem32K 40nm architecture), and reduce run-time in

15 This likely indicates long wires were previously sub-optimally used by connections traversing short distances. This was also independently identified and fixed by [215].
16 The default router bounding box is the net's bounding box expanded by 3 routing channels on each side [46].

Figure 9.7: Dynamic bounding box example. Left: in iteration i the fixed bounding box makes the congestion between source s and target t unavoidable. Right: once the net's routing reaches the bounding box edge the box is expanded, and in iteration i + 1 the congestion is avoided.

Table 9.6: Normalized Impact of Dynamic Router Bounding Boxes on VTR benchmarks (> 10K primitives)

          Wmin   Route Time    Routed WL   Crit. Path Delay   Route Time
                 (find Wmin)   (Fixed W)   (Fixed W)          (Fixed W)
static    1.00   1.00          1.00        1.00               1.00
dynamic   0.98   1.11          1.00        1.00               0.98

the less congested fixed channel width (1.3 · Wmin) case by 2%. While the minimum channel width search time increases moderately, this is not significant, as minimum channel width run-time is sensitive to small changes in routability (which may cause the binary search to explore a different sequence of channel widths, some of which may be more challenging to route). Given that both the minimum routable channel width and the route time at a fixed channel width are improved, we believe this indicates an overall routability improvement.
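The dynamic expansion can be sketched as follows: whenever a resource used by the net's routing lies on (or outside) a bounding-box edge, that edge is pushed outward before the next iteration, clamped to the device grid. The field names and expansion amount are illustrative, not VPR's exact implementation.

def expand_bbox_if_at_edge(bbox, used_xy, grid_w, grid_h, delta=1):
    """Grow a net's routing bounding box when its routing touches an edge.

    bbox: dict with 'xmin', 'xmax', 'ymin', 'ymax' (inclusive, in tiles).
    used_xy: iterable of (x, y) locations used by the net's current routing.
    Returns the (possibly) expanded bounding box, clamped to the device grid.
    """
    for x, y in used_xy:
        if x <= bbox["xmin"]:
            bbox["xmin"] = max(0, bbox["xmin"] - delta)
        if x >= bbox["xmax"]:
            bbox["xmax"] = min(grid_w - 1, bbox["xmax"] + delta)
        if y <= bbox["ymin"]:
            bbox["ymin"] = max(0, bbox["ymin"] - delta)
        if y >= bbox["ymax"]:
            bbox["ymax"] = min(grid_h - 1, bbox["ymax"] + delta)
    return bbox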

9.5.2 Congestion Mode

If the number of routing iterations is approaching the maximum allowed, AIR will enter a high-effort congestion resolution mode. This mode disables delay-driven incremental re-routing (so the router only rips-up congested connections) and also increases net bounding box sizes, giving the router more freedom to find alternative (less congested) routes. Table 9.7 shows the impact of different congestion thresholds on the large VTR benchmarks (targeting VTR's flagship k6 frac N10 frac chain mem32K 40nm architecture), with all routing optimizations enabled. Decreasing the iteration congestion threshold increases router effort in hard-to-route instances. With thresholds of 80% and 50% of maximum routing iterations respectively, the minimum routable channel width decreases by 1% and 2%, while minimum channel width route time increases by 8% and 57% respectively. There is no run-time impact at relaxed channel widths since congestion is less severe. The default threshold is set to 1.0, which leaves congestion mode disabled by default. However, end-users can decrease the threshold to increase router effort on difficult-to-route designs.
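The trigger itself is a simple fraction-of-iterations check; a minimal sketch, assuming the router tracks its current iteration and limit (with the default threshold of 1.0 the mode never engages):

def in_congestion_mode(iteration, max_iterations, threshold=1.0):
    """Return True once the router is within `threshold` of its iteration limit.

    e.g. threshold=0.8 engages the high-effort mode for the last 20% of
    allowed routing iterations; threshold=1.0 (default) disables it.
    """
    return iteration >= threshold * max_iterations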

Table 9.7: Normalized Impact of Congestion Mode Threshold on VTR benchmarks (> 10K primitives)

Threshold (fraction of       Wmin   Routed WL    Crit. Path Delay   Route Time    Route Time   VPR      Total Flow
maximum router iterations)          (1.3·Wmin)   (1.3·Wmin)         (find Wmin)   (1.3·Wmin)   Memory   Time
1.0                          1.00   1.00         1.00               1.00          1.00         1.00     1.00
0.8                          0.99   1.01         1.00               1.08          1.00         1.00     1.01
0.5                          0.98   1.01         1.00               1.57          1.00         1.00     1.23

Figure 9.8: QoR impact of multiple routing convergences on the VTR benchmarks (> 10K primitives). The panels plot normalized geomean critical path delay, wirelength, and route time against the routing convergence count, at channel widths of 1.00, 1.05, 1.10 and 1.30 × Wmin.

9.6 Multi-convergence Routing

During negotiated congestion routing, the router attempts to resolve congestion while minimizing the impact on critical path delay. However, in highly congested designs congestion costs can end up dominating timing costs, and critical connections may be detoured, degrading the critical path delay. As a result the achieved critical path delay can be somewhat chaotic in the presence of significant routing congestion.

Multi-convergence routing aims to address this by continuing to improve the routing of timing-critical connections after the first legal routing solution is found. In particular, the router will continue re-routing timing-critical connections with the aim of improving their delay. This allows the router to fix up timing issues it neglected while focusing on resolving routing congestion. When the routing next converges to a legal congestion-free solution, the routing with the best QoR is kept.

The pseudo-code for multi-convergence routing is shown in Algorithm 12. When a legal routing is found (Algorithm 12 Line 8), the best routing is updated17 (Algorithm 12 Line 9). In preparation for the rip-up and re-routing of delay sub-optimal connections, the present congestion cost factor (pres fac) is reset to its initial value (default zero) (Algorithm 12 Line 13). This allows timing-critical connections to focus on finding faster routes instead of avoiding congestion. Incremental re-routing (Section 9.3.1) then rips up delay sub-optimal connections (Algorithm 12 Line 19). It is worth noting that once re-routing is 'kicked off' in this manner, any resulting congestion is naturally handled by incremental re-routing, and less critical (but newly congested) connections will be detoured away to resolve congestion.

Figure 9.8 shows the impact of multi-convergence routing on the large VTR benchmarks (targeting VTR's flagship k6 frac N10 frac chain mem32K 40nm architecture) at various levels of routing stress, and with all other router optimizations enabled. We can make several interesting observations. Firstly, independent of multi-convergence routing, increasing channel width improves both delay and wirelength, and reduces route-time. The additional routing resources mean the router does not need to detour as drastically to resolve congestion. Secondly, multi-convergence routing improves critical path delay and wirelength in high-stress settings at or near the minimum routable channel width. For instance, compared to only a single convergence, allowing two convergences reduces critical path delay at minimum channel width by 2.3%. However these gains diminish as channel width increases. Allowing more than two routing convergences offers minimal quality benefit. Thirdly, multi-convergence routing is run-time efficient, with subsequent convergences increasing run-time by far less than the initial convergence (which routed the entire netlist). For instance, at minimum channel width the second convergence only increased overall route-time by 30%. The run-time overhead of multi-convergence routing also decreases in less stressful routing conditions (where it offers less benefit). The connection-based re-route and high-fanout optimizations (Section 9.3) greatly reduce work when a routing is almost legal, which keeps the run-time overhead of multi-convergence routing low.

It is interesting to note that multi-convergence routing with delay-based rip-up accomplishes many of the same goals as the delay-targeted routing approach proposed in [219]. In particular, it reduces the often chaotic impact of routing congestion on critical path delay at narrow channel widths. Multi-convergence routing should be more run-time efficient as it only re-routes the relevant connections in the netlist. In contrast, delay-targeted routing requires the full netlist to be re-routed multiple times in search of an appropriate delay target. Finally, multi-convergence routing naturally extends to multi-clock designs where there is no longer a single delay target.

17 Provided the new routing is better, defined as having lower critical path delay, with wirelength used as a tie-breaker.

9.7 Routing with Non-Configurable Switches

As described in Section 7.1.2, VPR 8 supports non-configurable connections between routing resources (i.e. connections which must always be used and cannot be disabled), such as non-tristate-able buffers (which are common in clock networks and can be useful for long wires) and electrical shorts. Using non-configurable switches to create electrical shorts allows the modeling of a broader range of wires, such as more complex 'L' and 'T' shaped wires, while allowing the fundamental RR node data structure to remain a simple one-dimensional wire. This is important because the RR node data structures form the largest component of VPR's peak memory footprint, so RR nodes must be described using compact and efficient data structures, and it keeps the far more common case of a linear wire efficient. The VPR router has traditionally assumed each switch, modelled as an edge in the Routing Resource (RR) graph, was configurable, meaning it could be enabled/disabled to connect/disconnect two routing resources. Supporting non-configurable switches required modifications to the router's path finding algorithm.

Algorithm 15 AIR Path Finding
Require: The heap containing start locations, the target node to be found
Returns: A low-cost path to target from elements initially in heap
 1: function find path from heap(heap, target)
 2:     while not empty(heap) do
 3:         elem ← pop smallest(heap)
 4:         if target = elem.node then
 5:             return traceback path(elem)                ▷ Found path to target
 6:         for edge ∈ outgoing edges(elem.node) do        ▷ Explore neighbours
 7:             next elem ← expand node(elem, sink node(edge), target)
 8:             push(heap, next elem)
 9:     return ∅                                           ▷ No path to target

Algorithm 15 details AIR’s algorithm for finding paths for each net connection through the RR-graph. Like a standard Djikstra/A*-based path finding algorithm AIR maintains a heap (priority queue) of Chapter 9. AIR: A Fast but Lazy Timing-Driven FPGA Router 116

Algorithm 16 AIR Node Expansion
Require: The previous element used to reach the current node, the target node to be found
Returns: An element to be placed on the heap with the costs of using node to reach target
 1: function expand node(prev, node, target)
 2:     if reached by configurable edge(prev, node) then
 3:         nodes = nonconfig connected nodes(node)        ▷ Includes node
 4:         node congestion = sum congestion(nodes)        ▷ All non-config. connected nodes
 5:     else                                               ▷ Part of previously reached non-config. node set
 6:         node congestion = 0                            ▷ Already included in non-config. connected node set
 7:     congestion = prev.congestion + node congestion
 8:     delay = prev.delay + delay(prev.node, node)        ▷ Only delay of current node
 9:     γ = criticality(target)
10:     cost = (1 − γ) · (congestion + expected congestion(node, target))
11:            + γ · (delay + expected delay(node, target))
12:     return element(node, congestion, delay, cost)

currently explored nodes in the graph. The lowest cost node is popped off the heap (Line 3), and its neighbours are explored and added to the heap (Lines 6 to 8). This continues until either the target is found and the resulting path returned (Lines 4 and 5), or the heap is emptied (Line 2) indicating no path exists (Line 9). There are several challenges to supporting non-configurable edges. First, the router must ensure all non-configurably connected nodes are always used together. This is achieved by pre-processing the RR-graph to identify any sets of non-configurably connected nodes in the graph, and by updating the path traceback routine (Algorithm 15 Line 5) to add all non-configurably connected nodes to the route tree when a path is found. Second, it is crucial to cost such nodes appropriately so the router effectively optimizes their use. In particular, the router must:

• see the full congestion cost impact of using any node in a non-configurably connected set (since choosing to use at least one node from the set implies using all nodes in the set), and

• account for the delay impact of choosing particular nodes to form the routing path (since the individual nodes selected determine the delay).

Algorithm 16 shows how this is done by AIR. Whenever a node is explored via a configurable edge the relevant nodes are identified (Algorithm 16 Line 3)18 and their congestion costs summed (Algorithm 16 Line 4).19 When a node is expanded via a non-configurable edge (Algorithm 16 Line 6), by definition it is part of a non-configurably connected node set whose congestion cost is already included (it was added to the previous congestion when the current partial path first encountered the set), so no additional incremental congestion cost is added. The remainder of the cost calculations proceed as usual, weighting the delay and congestion costs by target criticality, and including the expected congestion and delay costs to the target (Algorithm 16 Lines 9 to 11).

Figure 9.9 illustrates how non-configurable switches affect costing. As the router explores from the source s in Figure 9.9a it may expand to node d1, which is a member of the non-configurably connected node set D = {d1, d2, d3}. As shown in Figure 9.9b, the congestion cost to reach d1 includes the cost of

18 With pre-processing this can be done in constant time.
19 For regular nodes this is just the individual node's congestion costs, while for a node in a non-configurably connected set this is the total congestion cost of all the nodes in the set.

18With pre-processing this can be done in constant time. 19For regular nodes this is just the individual node’s congestion costs, while for a node in a non-configurably connected set this is the total congestion cost of all the nodes in the set. Chapter 9. AIR: A Fast but Lazy Timing-Driven FPGA Router 117

(a) Example routing resource graph with source s and target t. Dashed edges are configurable, solid edges are non-configurable.
(b) Path costs from s:

Node   Cong.   Delay   Est. Delay to t
s      0       0       7
a      1       2       5
b      2       4       3
c      3       6       1
d1     4       4       2
d2     4       5       1
d3     4       5       5

Figure 9.9: Non-configurable routing example. Assumes unit congestion cost and delay for wires, unit delay for configurable switches, and zero delay for non-configurable switches.

using all members of the set, but its delay cost only reflects the delay to reach d1. When expanding from a to d1, the router pushes only d1 onto the heap (not all of D), but costs it to reflect the full congestion cost of using D. This ensures the router does not underestimate the cost to use d1 (since choosing to use d1 actually requires using all of D). If the router later pops d1 off the heap the other members of

D (i.e. d2, d3) will be expanded, accounting for their individual timing impact, but with no additional congestion cost. It is important that each member of D is tracked separately in the heap, since each member may have different expected congestion/delay costs to reach the target. Tracking each node separately ensures the router can prioritize exploring promising nodes which move it closer to the target (e.g. expanding d2 before d3). This is particularly important for large non-configurably connected node sets which may fan out to a large number of nodes, many of which may not be relevant for reaching a particular target.20 With this formulation the router correctly determines that the path a → b → c is better for non-timing-critical connections (lower congestion cost), while a → d1 → d2 is better for timing-critical connections (lower delay).
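A Python sketch of this costing, transliterating Algorithm 16 with hypothetical helper callables/mappings standing in for the RR-graph queries (nonconfig_set would be precomputed from the graph): when a node is first reached through a configurable edge the congestion of its whole set is charged, while expansion along a non-configurable edge adds no further congestion.

def expand_node(prev, node, target, via_configurable_edge,
                nonconfig_set, node_congestion, node_delay,
                expected_congestion, expected_delay, criticality):
    """Compute the cost element for reaching `node` from the `prev` element.

    All helpers (nonconfig_set, node_congestion, ...) are illustrative
    stand-ins for RR-graph queries; prev is a dict-like heap element.
    """
    if via_configurable_edge:
        # Charge the congestion of every node in the non-configurable set,
        # since using one member implies using them all.
        members = nonconfig_set.get(node, [node])
        cong_incr = sum(node_congestion(n) for n in members)
    else:
        # Already charged when the set was first entered.
        cong_incr = 0.0

    congestion = prev["congestion"] + cong_incr
    delay = prev["delay"] + node_delay(prev["node"], node)   # only this node's delay
    crit = criticality(target)
    cost = ((1.0 - crit) * (congestion + expected_congestion(node, target))
            + crit * (delay + expected_delay(node, target)))
    return {"node": node, "congestion": congestion, "delay": delay, "cost": cost}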

9.8 Speeding-Up Minimum Channel Width Search

In the architecture evaluation flow, AIR will be invoked multiple times at various fixed channel widths as part of a binary search to identify the architecture’s minimum routable channel width (Wmin). As a result, all the enhancements included in AIR which have been described so far also benefit the minimum channel width search. However AIR includes several further enhancements targeted particularly at reducing the run-time of minimum channel width search.

9.8.1 Routing Failure Predictor

Router run-time is highly dependent on the amount and difficulty of resolving congestion. In uncongested designs the router explores only a limited portion of the routing network and runs quickly. However, in highly congested designs the router must repeatedly explore much more of the routing network in order to find uncongested paths, which significantly increases run-time. For designs which are unroutable this means the router will continually explore most of the routing network in a (futile) attempt to resolve congestion. To improve the run-time behaviour we designed

20 Such as a clock network.


Figure 9.10: Routing Congestion Trend on LU32PEEng (targeting VTR’s flagship k6 frac N10 frac chain mem32K 40nm architecture) at routable and unroutable channel widths.

a routing predictor21 which will abort routing early if it is not expected to complete successfully (Algorithm 12 Line 14). The routing predictor is also beneficial when routing at a fixed channel width, since it reduces the time to inform the user a design is unroutable. However, its largest benefit comes when an FPGA architect is attempting to find the minimum routable channel width, since the binary search run-time is dominated by routings which occur at unroutable channel widths.

As shown in Figure 9.10, congestion typically decreases exponentially over routing iterations.22 The routing predictor uses this information to construct a model of congestion over time and estimate when congestion will be resolved. We use a simple linear regression model fit using least-squares to the logarithm of the congested routing resource count.23 At the end of each routing iteration the model is constructed using the congestion information from the most recent 50% of previous routing iterations.24 If the estimated iteration count is greater than 3 times (safe setting) or 1.5 times (aggressive setting) the maximum number of allowed routing iterations, the routing is aborted.

Table 9.8: Normalized Impact of Routing Predictor on VTR benchmarks (> 10K primitives)

             Wmin   Routed WL    Crit. Path Delay   Route Time    Route Time   VPR      Total Flow
                    (1.3·Wmin)   (1.3·Wmin)         (find Wmin)   (1.3·Wmin)   Memory   Time
off          1.00   1.00         1.00               1.00          1.00         1.00     1.00
safe         1.00   1.00         1.00               0.71          1.02         1.00     0.77
aggressive   1.00   1.00         1.00               0.52          1.05         1.00     0.64

Table 9.8 shows the impact of the routing predictor on routing QoR and run-time metrics on the large circuit subset of the VTR benchmarks (targeting VTR's flagship k6 frac N10 frac chain mem32K 40nm architecture), with all other routing optimizations enabled. For the safe and aggressive predictor settings there are no changes in quality metrics, but minimum channel width route time is reduced by 29% and 48% respectively.
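The prediction itself can be sketched as follows, assuming per-iteration congested-resource counts are recorded: a straight line is fit to the logarithm of the most recent half of the history, and solved for the iteration at which the fitted log-count reaches zero (i.e. roughly one congested node). This is an illustrative sketch of the idea, not VPR's exact implementation.

import math

def predict_convergence_iteration(congested_counts):
    """Estimate the routing iteration at which congestion reaches zero.

    congested_counts[i] is the congested RR node count after iteration i+1.
    Returns float('inf') if congestion is not decreasing.
    """
    if congested_counts and congested_counts[-1] == 0:
        return 0.0                                   # already congestion free
    history = [(i + 1, c) for i, c in enumerate(congested_counts) if c > 0]
    recent = history[len(history) // 2:]             # most recent 50% only
    if len(recent) < 2:
        return float("inf")
    xs = [it for it, _ in recent]
    ys = [math.log(c) for _, c in recent]
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    if slope >= 0.0:
        return float("inf")                          # congestion not shrinking
    intercept = y_mean - slope * x_mean
    return -intercept / slope                        # iteration where log(count) = 0

def should_abort(congested_counts, max_iters, factor=3.0):
    """Abort if resolution is predicted beyond factor*max_iters (safe=3.0, aggressive=1.5)."""
    return predict_convergence_iteration(congested_counts) > factor * max_iters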

21 Building on an initial implementation by Andre Pereira and Jason Luu which used linear interpolation.
22 A linear trend on a log-linear plot such as Figure 9.10.
23 Using the logarithm allows a linear model to fit an exponential trend.
24 Using the most recent 50% of routing iterations means only recent history is considered, but the length of history increases as routing proceeds. This helps minimize noise caused by the typically small number of congested routing resources late in the routing process.

Table 9.9: Normalized Impact of Minimum Channel Width Hints on VTR benchmarks

               Wmin   Routed WL    Crit. Path Delay   Route Time    Route Time   Total Flow   VPR
                      (1.3·Wmin)   (1.3·Wmin)         (find Wmin)   (1.3·Wmin)   Time         Memory
No Hints       1.00   1.00         1.00               1.00          1.00         1.00         1.00
-30% Hint      1.02   1.00         1.00               1.10          0.97         1.05         1.00
-10% Hint      1.01   1.00         1.00               0.86          1.00         0.93         1.02
Correct Hint   1.00   1.00         1.00               0.72          0.99         0.85         1.02
+10% Hint      1.01   1.00         1.00               0.69          1.02         0.83         1.00
+30% Hint      1.02   1.00         1.00               1.09          0.98         1.04         0.99

9.8.2 Minimum Routable Channel Width Hints

Often when performing minimum routable channel width experiments the architect has some insight into what the expected channel widths will be.25 VPR 8’s minimum channel width search can now take such estimates as optional ‘hints’ to the minimum channel width binary search. These hints are used as the starting point of the binary search, and if accurate allow a large part of the search space to be quickly eliminated. Additionally, if the hint value is found to be routable a smaller initial step (10% channel width reduction, instead of the default 50%) is taken in hopes of quickly identifying an unroutable channel width and pruning the number of such channel widths explored. Table 9.9 shows the impact of minimum channel width hints on the VTR benchmarks (targeting VTR’s flagship k6 frac N10 frac chain mem32K 40nm architecture) with all other routing optimizations enabled. Providing accurate hints (Correct Hint) results in the same minimum channel width as not providing a hint (No Hint), but reduces minimum channel width search time by 28%. It should be noted that the accuracy of the hints does not significantly affect the resulting minimum channel width found by the binary search. Less accurate hints can still reduce minimum channel width search time (+10% Hint, -10% Hint). Inaccurate hints (+30% Hint, -30% Hint) only moderately increase search time since the number of additional channel widths explored is bounded.
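A sketch of how a hint can seed the search (the routes_at() callback is a stand-in for a full routing attempt, and the maximum width is assumed routable): a routable hint becomes the upper bound and is followed by a small 10% downward step, while an unroutable hint simply tightens the lower bound of the usual binary search.

def find_min_channel_width(routes_at, w_max, hint=None):
    """Binary-search the minimum routable channel width.

    routes_at(w) -> bool stands in for a full routing attempt at width w.
    """
    lo, hi = 1, w_max                        # [lo, hi] brackets Wmin
    if hint is not None and routes_at(hint):
        hi = hint
        first = max(lo, int(hint * 0.9))     # small 10% step instead of halving
        if routes_at(first):
            hi = first
        else:
            lo = first + 1
    elif hint is not None:
        lo = hint + 1                        # hint was unroutable
    while lo < hi:
        mid = (lo + hi) // 2
        if routes_at(mid):
            hi = mid
        else:
            lo = mid + 1
    return hi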

It is interesting to note that the obtained Wmin values in Table 9.9 vary by small amounts depending on the hints given. VPR uses a binary search to determine Wmin, which assumes routing failures are monotonic in W. In practice the routability is nearly monotonic in W, but there is some small variation

(algorithmic noise) in routability near Wmin which is dependent on the detailed connectivity pattern produced by the RR graph generator. This can cause some small variation in the calculated Wmin depending on the exact sequence of channel widths evaluated by the binary search. If desired, end users can configure VPR to only give up after routing has failed two times at nearby channel widths.26

9.9 AIR Evaluation & Comparison

We now evaluate AIR (which is the VPR 8 Timing-Driven router) and compare it with other academic routers (VPR 7 Timing-Driven, VPR 7 Breadth-First, VPR 8 Breadth-First, CRoute) and the commercial Intel Quartus router.

25 For instance they may have an estimate based on prior experience with the same or similar architectures.
26 This is controlled via the --verify binary search option.

Table 9.10: VPR Router Comparisons on Titan v1.1.0 Benchmarks

                                        # Routable     Routed    Routed    Route
                                        Benchmarks     WL        CPD       Time
AIR/VPR8 Timing-Driven (Timing + WL)    21             (1.00×)   (1.00×)   (1.00×)
AIR/VPR8 Timing-Driven (WL only)        21             (0.97×)   (1.47×)   (0.70×)
VPR8 Breadth-First (WL only)            14             (1.10×)   (1.66×)   (394.54×)
VPR7+ Timing-Driven (Timing + WL)       20             (1.18×)   (1.16×)   (6.16×)
VPR7+ Breadth-First (WL only)           14             (1.08×)   (1.56×)   (304.45×)
All routers used identical packings/placements/RR graphs produced by VPR 8.
QoR is the geomean of the mutually routable subset of 14 benchmarks.

9.9.1 Comparison with VPR Routers

To evaluate the effect of the router enhancements, Table 9.10 compares the various routing algorithms in VPR 7+27 and VPR 8 under identical conditions (same netlist, packing, placement, and RR graphs produced by VPR 8) on the Titan v1.1.0 benchmarks and the Stratix IV model from [10] (which is compatible with both VPR 8 and VPR 7+).28 VPR contains two different routing algorithms: the primary timing-driven router (whose enhanced form in VPR 8, AIR, is described above), which can be run to either optimize wirelength and timing simultaneously (Timing + WL) or wirelength only (WL only), and the breadth-first router, which optimizes wirelength only and is included for historical reasons [46].

First considering the VPR 8 and VPR 7+ timing-driven routers (Timing + WL), we observe the AIR/VPR 8 router is substantially more run-time efficient (6.2× faster) while also improving QoR (18% wirelength and 16% critical path delay reductions). It also successfully routes an additional benchmark compared to VPR 7+. Second, the simpler breadth-first routers in VPR 7+ and VPR 8 both run more than two orders of magnitude slower (> 300×) than AIR. They also only successfully route 14 (vs. 21) benchmarks, while using 8 to 10% more wiring and producing 56 to 66% longer critical paths. Third, the AIR timing-driven router is very efficient at trading-off wirelength and timing. When run with a wirelength-only objective (WL only) which ignores timing, it only further reduces wirelength by 3% and run-time by 30%. This shows that when run with a combined timing and wirelength objective, the timing-driven router is able to effectively prioritize the routing of timing critical signals without significantly degrading the routing of other connections. As a result, there is little reason to run with only a wirelength objective.

These results show AIR is superior in both run-time and QoR to the breadth-first routers and the VPR 7 timing-driven router. Therefore other works building on and/or comparing to VPR should ensure they compare to AIR. Choosing to compare to the non-default VPR 8 breadth-first router will result in misleading conclusions. For instance, parallel routing algorithms typically report speed-ups of 2 to 8× [75]; however, not all authors use the same serial baseline. While some works have used the VPR 7 timing-driven router as a baseline, others used the VPR 7 breadth-first router which is orders of magnitude slower. This means not all published parallel routing speed-ups are significant. Going forward, parallel routing researchers should ensure they use a state-of-the-art serial routing algorithm as their baseline.

27 VPR 7+ is the enhanced version of VPR 7 used in [10] which is compatible with the Titan v1.1.0 benchmarks.
28 Note that in the following results clock nets are not being routed, as clock network modelling support was added after these results were collected.

Table 9.11: AIR & Academic Router Comparison on Titan Benchmarks

               Routed       Routed CPD      Route
               Wirelength   (worst clock)   Time
VPR 7+         1.00         1.00            1.00
CRoute [215]   0.89         0.95            0.30
AIR            0.85         0.81            0.15
Normalized geomean of mutually routable benchmarks

Table 9.12: AIR & Quartus Comparison on Titan Benchmarks

                   Routed       Routed CPD        Route
                   Wirelength   (clock geomean)   Time
Quartus 18.0       1.00         1.00              1.00
VPR8 Place + AIR   1.27         1.23              0.23
Normalized geomean of mutually routable benchmarks

9.9.2 Comparison with CRoute and Quartus

To evaluate AIR’s effectiveness we compare it to the routers from VPR 7+ [10], CRoute [215], and the industrial Intel Quartus 18.0 router [179]. To fairly evaluate the core routing algorithm run-time, all algorithms were run serially. All tools are evaluated using the Titan FPGA benchmarks [10] and a Stratix IV-like architecture, which are representative of modern FPGA usage. To isolate the effect of the VPR 7+ and AIR routers, identical packings and placements29 and routing architectures are used, while wirelength and Critical Path Delay (CPD) metrics are calculated by VPR 8 [2]. Table 9.11 compares AIR and CRoute’s improvements relative to VPR 7. Note that CRoute results are from [215] which uses a different CAD flow for packing and placement – making it impossible to perfectly isolate the impact of the routers. With that caveat, the results show AIR’s circuit implementations operate 17% faster and use 5% less wirelength than CRoute’s. This was achieved while also completing routing 2.0× faster, and routing 2 more benchmarks (21 vs 19) than CRoute. Finally, following the methodology from [10], we can compare AIR with the industrial Intel Quartus 18.0 router. Quartus results use Quartus’ packing and placement which are higher quality/more routable than those produced by VPR [10].30 Despite routing from a lower quality placement, Table 9.12 shows AIR (targeting the Stratix IV model from Section 7.2) completes routing 4.3× faster than the Quartus router on the Titan benchmarks – further illustrating its scalability. While the circuits routed by AIR use 27% more wirelength and operate 23% slower, this is due to the lower quality placement being used. This can be confirmed by comparing the quality of the initial routing (congestion oblivious and routed for minimum delay), and final legal routing (congestion free) produced by AIR.31 Table 9.13 shows the final legal routings produced by AIR on the Titan benchmarks (targeting the Stratix IV model from Section 7.2) do not degrade critical path delay or wirelength. In fact, wirelength improves since non-critical connections are re-routed for wirelength. The reduction in wirelength may seem somewhat surprising, but results from routing all connections during the first routing iteration

29 Generated by VPR 8 with inner_num set to match Quartus' average placement time.
30 There are additional caveats which make this an approximate ballpark comparison. AIR uses the automatically generated adaptive lookahead from Section 9.4.1, while Quartus uses a highly tuned hand-written lookahead which is hard-coded for the routing architecture. The routing architecture targeted by AIR is an approximate capture (Section 7.2) of the Stratix IV devices, as the precise routing patterns used in Stratix IV are not publicly available. Quartus also performs timing-driven LUT rotation, while VPR does not, and Quartus optimizes for additional objectives such as hold time. Finally, the higher quality Quartus placement may present a more difficult routing problem if there are more highly critical timing paths (i.e. the path delay distribution has fewer outliers due to better/more balanced placement) which are more challenging for routing to balance.
31 Congestion oblivious routing produces a critical path delay which is a near lower bound for the given placement.

Table 9.13: AIR Congestion Oblivious vs Congestion Free on Titan Benchmarks
                                     Routed       Routed CPD
                                     Wirelength   (worst clock)
Congestion Oblivious (min. delay)    1.00         1.00
Congestion Free (legal)              0.86         1.00
Normalized geomean of routable benchmarks

Chapter 10

Other VTR Enhancements

In addition to the core enhancements to VTR described in Chapters 7 to 9, a large number of other enhancements and a substantial amount of engineering work went into improving VTR. This chapter provides a brief overview of some of these contributions.

10.1 Logic Optimization & Technology Mapping

VTR uses ABC to perform logic optimization and technology mapping. The version of ABC included with VTR 8 has been updated from an older 2012 version to a newer 2018 version [185]. We have also re-written our synthesis scripts to make use of ABC's new capabilities. The first three rows of Table 10.1 show the impact of these changes on the large (> 10K primitive) VTR benchmarks (targeting VTR's flagship k6_frac_N10_frac_chain_mem32K_40nm architecture). Note that in these results only ABC is being modified; all other parts of the flow remain constant.

Compared to using the old version of ABC (2012) and VTR's old synthesis script (2012), the new version of ABC (2018) produces smaller circuits (8% fewer primitives), resulting in a 10% reduction in routed wirelength, while ABC run-time decreases by 20%. Combining the new version of ABC (2018) with a new synthesis script1 (2018) further improves results, decreasing netlist primitives by 13% and logic depth by 10%, resulting in a 12% reduction in wirelength and a 6% reduction in critical path delay. However, this substantially increases ABC's run-time (> 9×) compared to the 2012 ABC and script. This results in only a small increase in total flow time when searching for Wmin, since run-time is dominated by that search. However, when running an implementation flow targeting a fixed channel width (and hence performing a single routing), ABC accounts for 35% of flow run-time on average, increasing overall flow time by 1.5×.2

10.1.1 Safe Multi-Clock Optimization

Multiple clock domains are a key characteristic of modern FPGA applications [220, 64]. One of the challenges with new versions of ABC is that all clock-related information is dropped during optimization.3 This is dangerous, since it means sequential optimizations can occur across clock domains, which can produce incorrect results. For instance, sequential elements with common fan-in but controlled by different clock domains could be incorrectly merged, or flip-flops controlled by one clock domain could be re-timed into logic related to another clock domain. Furthermore, special care is needed to safely transfer data across clock domains while avoiding synchronization and metastability issues [44], and sequential optimizations may interfere with these transfers. Since VTR supports multi-clock circuits it is important to ensure synthesis is performed safely and all clock-domain information is correctly maintained.

1 Based on suggestions from Alan Mishchenko at UC Berkeley.
2 ABC can dominate run-time on some benchmarks, for instance accounting for 74% of overall run-time on mcml.
3 Effectively, ABC assumes all sequential circuit elements operate on a single shared clock domain. ABC also ignores their initial state.


Table 10.1: Normalized Impact of ABC Multiclock Flows on VTR benchmarks (> 10K primitives)
ABC      ABC     Multi-clock      Primitive   Logic          Routed WL      Routed Crit. Path    ABC    Total Flow Time   Total Flow Time
Version  Script  Black-box Flow   Blocks      Depth   Wmin   (1.3 · Wmin)   Delay (1.3 · Wmin)   Time   (find Wmin)       (1.3 · Wmin)
2012     2012    none             1.00        1.00    1.00   1.00           1.00                 1.00   1.00              1.00
2018     2012    none             0.92        0.98    1.01   0.90           1.02                 0.80   0.73              0.87
2018     2018    none             0.87        0.90    1.01   0.88           0.94                 9.19   1.05              1.51
2018     2018    all              0.98        0.95    0.99   0.94           0.98                 2.05   0.79              0.99
2018     2018    iterative        0.87        0.89    0.99   0.87           0.94                 8.22   1.07              1.54

To address this, VTR 8 adds support for several different approaches to ensure safe multi-clock logic optimization. The most straightforward way to accomplish this is to hide a circuit's sequential elements from ABC by replacing them with black-box primitives which ABC does not optimize. However, as shown in the fourth row of Table 10.1, black-boxing all sequential primitives (all) substantially degrades the quality of the resulting netlist compared to performing no black-boxing (none). Instead, VTR 8 defaults to an iterative black-boxing flow, where ABC is invoked multiple times with only a single clock domain exposed as sequential elements; all sequential elements in other clock domains are black-boxed.4 ABC therefore safely performs sequential optimizations on only one clock domain at a time. As shown in the fifth row of Table 10.1, this approach (iterative) results in Quality of Results (QoR) similar to exposing sequential elements in all clock domains to ABC simultaneously (none), but ensures multi-clock circuits are implemented correctly.
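The core of the iterative flow can be sketched as follows. This is a simplified illustration only: the Netlist type and run_abc() placeholder are hypothetical stand-ins for VTR's BLIF writer/reader and its invocation of ABC, not the actual implementation.

    #include <string>
    #include <vector>

    // Toy netlist model for illustration: each latch records its clock domain
    // and whether it is currently hidden from ABC behind a black box.
    struct Latch { std::string clock; bool black_boxed = false; };
    struct Netlist {
        std::vector<std::string> clocks;
        std::vector<Latch> latches;
    };

    // Placeholder: the real flow writes a BLIF netlist, invokes ABC on it, and
    // reads the optimized result back. Stubbed out here so the sketch compiles.
    Netlist run_abc(Netlist netlist) { return netlist; }

    // Iterative black-boxing: ABC sees the sequential elements of only one clock
    // domain per invocation, so it cannot merge or retime registers across domains.
    Netlist optimize_multiclock(Netlist netlist) {
        for (const std::string& clk : netlist.clocks) {
            for (Latch& l : netlist.latches)
                l.black_boxed = (l.clock != clk);  // hide the other domains
            netlist = run_abc(netlist);            // safe sequential optimization
        }
        for (Latch& l : netlist.latches)
            l.black_boxed = false;                 // expose all latches again
        return netlist;
    }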

10.2 Timing Analysis

VTR 8 uses Tatum [6] (Chapter 3) as its Static Timing Analysis (STA) engine, which was designed to be more complete, more flexible, and higher performance than VPR 7's classic STA engine. Tatum provides various new timing analysis features such as hold (min-delay) timing analysis, support for multi-clock netlist primitives (e.g. dual-port RAMs with different clocks), additional timing constraints and exceptions, and support for paths completely contained within primitives (such as registered block RAM read paths). As a result, VTR 8 also supports more advanced timing reports, including detailed multi-path timing reports which help users understand and evaluate the speed-limiting paths in their design. Furthermore, Tatum produces the same analysis results as VPR 7 (to within floating point round-off) for setup analysis, allowing critical path delays to be fairly compared between VPR 7 and VPR 8 when identical delay models are used. As STA is called hundreds of times to evaluate the timing characteristics of the design implementation and to drive further optimization, it is key for STA to be fast. As described in Chapter 3, Tatum includes several enhancements which improve its performance over VPR 7's classic STA engine:

4 This approach was developed by Jean-Philippe Legault and Kenneth B. Kent at the University of New Brunswick with assistance from the author.

Table 10.2: Normalized Impact of Setup & Hold STA Traversals While Routing Titan Benchmarks
            STA Traversal Time
Separate    1.00
Combined    0.70

Table 10.3: Comparison of Classic & Tatum STA (Setup Only) on Titan Benchmarks
                 Pack Time   Place Time   Route Time   Total Time   STA Time
Classic          1.00        1.00         1.00         1.00         1.00
Tatum serial     1.00        0.88         0.95         0.94         0.47
Tatum parallel   0.99        0.80         0.91         0.90         0.07

• Tatum performs a single traversal of the timing graph, simultaneously analyzing all clock domains. This is more efficient than VPR 7's classic STA engine, which traversed the timing graph for each pair of clock domains.

• If both setup and hold timing analyses are required (e.g. during combined long and short path timing optimization), Tatum can perform a combined setup and hold analysis during the same traversal of the timing graph.

• Tatum supports parallel execution across multiple cores to further speed-up STA.

10.2.1 Combined Setup & Hold Analysis

Table 10.2 quantifies the effect of combining the setup and hold analyses into a single traversal (Combined) compared to performing two independent traversals (Separate) while routing the Titan benchmarks (targeting the Stratix IV model from Section 7.2). Performing a single set of combined traversals reduces the total STA traversal time by 30% due to improved cache locality: with combined traversals the timing graph does not need to be reloaded into the cache separately for the setup and hold analyses.
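To illustrate why the combined traversal is cheaper, the simplified sketch below (not Tatum's actual API or data structures) updates both the setup (latest) and hold (earliest) arrival times while each timing-graph edge is already resident in the cache, rather than streaming the graph through memory twice.

    #include <algorithm>
    #include <cstddef>
    #include <limits>
    #include <vector>

    // Simplified timing graph: nodes are stored in topological order and each
    // node lists its fan-in edges (source node index and edge delay).
    struct Edge { int src; double delay; };
    struct Node { std::vector<Edge> fanin; };

    struct ArrivalTimes {
        std::vector<double> setup;  // latest (max) arrival, used for setup checks
        std::vector<double> hold;   // earliest (min) arrival, used for hold checks
    };

    // A single forward traversal computes both analyses; each node and edge is
    // touched once, so the graph is only loaded into the cache once.
    ArrivalTimes combined_arrival(const std::vector<Node>& topo_order) {
        const double inf = std::numeric_limits<double>::infinity();
        ArrivalTimes t{std::vector<double>(topo_order.size(), -inf),
                       std::vector<double>(topo_order.size(), inf)};
        for (std::size_t i = 0; i < topo_order.size(); ++i) {
            if (topo_order[i].fanin.empty()) {        // primary input / source node
                t.setup[i] = t.hold[i] = 0.0;
                continue;
            }
            for (const Edge& e : topo_order[i].fanin) {
                t.setup[i] = std::max(t.setup[i], t.setup[e.src] + e.delay);
                t.hold[i]  = std::min(t.hold[i],  t.hold[e.src]  + e.delay);
            }
        }
        return t;
    }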

10.2.2 Comparison with VPR Classic STA

Table 10.3 compares the run-time of using VPR’s classic timing analyzer and Tatum when implementing the Titan benchmarks on the Stratix IV model from Section 7.2. Overall STA time in VPR is reduced by 2.1× serially and 14.3× in parallel with 24 cores.5 As a result overall run-time is reduced by 6% (Tatum serial) to 10% (Tatum parallel), with the largest run-time reductions occurring in placement (12-20%) and routing (5-9%). However it is worth noting that for some designs STA is a larger proportion of overall run-time, and the run-time reductions are larger.6

10.3 Software Engineering

VTR is a large complex software project, and as a result requires significant software engineering efforts. These efforts are multifaceted and include:

5 Compared to VPR's classic STA engine, Tatum performs best on large multi-clock designs, with circuits like mes_noc and gsm_switch running 7.5-6.2× faster serially, and 59.6-48.0× faster with 24 cores.
6 For example, stereo_vision's total run-time is reduced by 21% using Tatum with parallelism.

Table 10.4: Normalized Impact of Atom Netlist Data Structure Refactoring on VTR Benchmarks
              Pack Time   Place Time   Route Time     Route Time      Total VPR
                                       (find Wmin)    (1.3 · Wmin)    Time
Baseline      1.00        1.00         1.00           1.00            1.00
Refactored    0.63        1.00         0.92           0.93            0.86

• Optimizing the implementations of key algorithms and data structures,

• Improving robustness and stability,

• Providing support and documentation for users, and

• Facilitating collaboration between developers.

Some of these items have a significant impact on the technical merits of the project such as Quality of Results (QoR) and run-time, while others affect VTR’s usefulness to end-users, and the project’s longer-term health and sustainability. This section highlights a subset of this work.

10.3.1 VPR

VPR is developed in C/C++, and has a long development history, being continually extended and enhanced over time. As a result it is important to improve code quality, which makes it easier for both developers and end-users to make future enhancements and modifications to VPR. Additionally, for a high performance CAD tool solving large-scale optimization problems, the implementation details of various data structures and algorithms have a significant impact on run-time and memory footprint. These enhancements are beneficial to the research community as it makes it easier for researchers to develop and evaluate new CAD algorithms within VPR, or customize VPR to explore new architectural features.

VPR Netlists

One of the significant software engineering efforts in VPR 8 was refactoring the netlist data structures which store the technology mapped (or ‘atom’) netlist, and the post-packing ‘clustered’ netlist. These data structures have been significantly re-written in order to produce a well-defined common API for accessing and modifying the netlist data. During this process the underlying data layout and representation was modified to reduce indirect memory accesses (pointer-chasing) and improve cache utilization. In particular, the data layout was changed from an Array-of-Structs (AoS) to a Struct-of-Arrays (SoA) representation, which is more cache friendly when only a few data fields are accessed at a time. Table 10.4 shows the atom netlist refactoring reduced pack time, route time and total VPR run-time by 37%, 7-8% and 14% respectively on the VTR benchmarks when targeting VTR's flagship k6_frac_N10_frac_chain_mem32K_40nm architecture.
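As a rough illustration of the layout change (hypothetical types, not VPR's actual netlist classes), the Struct-of-Arrays form lets a loop that reads a single field, such as each pin's net, stream through one contiguous array rather than striding over whole structs:

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    using NetId   = std::uint32_t;
    using BlockId = std::uint32_t;

    // Array-of-Structs: iterating over one field also drags every other field
    // of each pin through the cache.
    struct PinAoS {
        std::string name;
        BlockId     block;
        NetId       net;
        bool        is_output;
    };
    using NetlistAoS = std::vector<PinAoS>;

    // Struct-of-Arrays: each field is its own contiguous array, so the common
    // field-wise loops in packing and placement are far more cache friendly.
    struct NetlistSoA {
        std::vector<std::string> pin_name;
        std::vector<BlockId>     pin_block;
        std::vector<NetId>       pin_net;
        std::vector<bool>        pin_is_output;
    };

    // Example field-wise query: how many pins are attached to a given net?
    std::size_t pins_on_net(const NetlistSoA& netlist, NetId net) {
        std::size_t count = 0;
        for (NetId n : netlist.pin_net)  // touches only the pin_net array
            if (n == net) ++count;
        return count;
    }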

Contexts

VPR has historically made use of numerous global variables to represent the device model and design implementation state, which made it challenging to reason about and modify the code base. To address this we have restructured and grouped together the various data structures representing the device model and implementation state into contexts. The device context contains all data structures used to represent the targeted FPGA device, including block type definitions, the device grid and the RR graph. There are also contexts representing different aspects of the implementation state, such as the clustering, placement or routing. Each part of the code base (e.g. the router) now explicitly asks for the contexts it requires, and specifies whether read-only or read-write access is needed. This helps to minimize coupling between different parts of the code base, while making dependencies more explicit.
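The following heavily simplified sketch (hypothetical structures, not VPR's real context definitions) illustrates the idea: code declares which contexts it needs through its interface, and whether it needs them read-only or read-write, instead of reaching into global variables.

    #include <utility>
    #include <vector>

    struct DeviceContext {            // read-mostly description of the target FPGA
        int grid_width = 0, grid_height = 0;
        // ... block types, RR graph, etc.
    };

    struct PlacementContext {         // mutable implementation state
        std::vector<int> block_x, block_y;
    };

    struct AllContexts {
        DeviceContext device;
        PlacementContext placement;

        const DeviceContext&    device_ctx() const      { return device; }
        const PlacementContext& placement_ctx() const   { return placement; }
        PlacementContext&       mutable_placement_ctx() { return placement; }
    };

    // The placer's dependencies (and required access) are explicit in its signature.
    void swap_blocks(const DeviceContext& device, PlacementContext& placement,
                     int blk_a, int blk_b) {
        (void)device;  // e.g. legality checks against the device grid would go here
        std::swap(placement.block_x[blk_a], placement.block_x[blk_b]);
        std::swap(placement.block_y[blk_a], placement.block_y[blk_b]);
    }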

Table 10.5: Cumulative Impact of VPR Peak Memory Optimizations on neuron
Optimization                          Code Area         Peak Memory (MiB)   Ratio
Baseline                                                9,030               1.00
Exact RR Graph alloc.                 RR Construction   5,874               0.65
Contiguous SB pattern                 RR Construction   5,423               0.60
Flyweight segment details             RR Construction   4,573               0.51
Sparse Intra-block routing            Packer            3,866               0.43
Contiguous RR Edge & RR Node fanin    RR Construction   3,431               0.38

Redundant Work

After performing a move during placement, VPR iterates through the affected nets to update wirelength and timing information. In VPR 7, the affected nets would be iterated through multiple times. We have improved this in VPR 8, by ensuring that all updates required occur during a single iteration through the affected nets. Similarly to performing a single set of timing graph traversals (Section 10.2.1), this improves cache locality by not repeatedly reloading the netlist and placement information into the cache.
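Conceptually, the change folds the separate per-quantity loops into one pass over the affected nets, as in the sketch below (simplified types, not VPR's actual placer code):

    #include <vector>

    struct Net { double bb_cost; double timing_cost; };  // simplified per-net data

    struct MoveCosts { double wirelength = 0.0; double timing = 0.0; };

    // One traversal of the nets touched by a move updates both cost terms while
    // each net's data is already in the cache; VPR 7 walked the list once per term.
    MoveCosts recompute_affected(const std::vector<const Net*>& affected_nets) {
        MoveCosts cost;
        for (const Net* net : affected_nets) {
            cost.wirelength += net->bb_cost;
            cost.timing     += net->timing_cost;
        }
        return cost;
    }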

Memory Usage Optimizations

It was previously noted that VPR 7's peak memory usage was substantially higher than that of commercial tools like Quartus [10], which meant very high-memory machines were required to implement the largest FPGA designs. As a result we have worked to reduce peak memory usage. The largest data structure in VPR is typically the RR graph, which can contain millions of nodes and tens of millions of edges. This data structure is carefully packed to minimize the size of each node and edge (improving cache utilization), and common data is factored out in a flyweight pattern [221]. Table 10.5 shows the impact of the most recent set of optimizations on the Titan neuron benchmark (targeting the Stratix IV model from Section 7.2). VPR's peak memory usage typically occurs during RR graph construction, when both the RR graph and the auxiliary RR-graph-sized data structures used to build it are allocated simultaneously. During RR graph construction, exactly sizing allocations and allocating contiguous regions of memory significantly reduced memory lost to fragmentation, as did flyweighting additional non-unique data. Moving to a sparse representation of the intra-block routing within each cluster also reduced memory usage, since much of that routing is unused in any particular cluster. Together these optimizations reduced peak memory use by 2.6×. As discussed in Section 11.2, VPR's peak memory usage is now comparable to Quartus on the Titan benchmarks.
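The flyweight idea can be sketched as follows (simplified, not VPR's actual RR graph data structures): data shared by many nodes, such as wire segment electrical properties, is stored once in a small side table, and each node keeps only a compact index into it.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct SegmentDetails {            // shared ("intrinsic") data: a handful of entries
        float resistance;
        float capacitance;
        std::int16_t length;
    };

    struct RRNode {                    // per-node ("extrinsic") data, kept as small as possible
        std::int16_t xlow, ylow, xhigh, yhigh;
        std::uint16_t segment_idx;     // flyweight reference instead of a full copy
    };

    struct RRGraph {
        std::vector<SegmentDetails> segments;  // tens of entries
        std::vector<RRNode>         nodes;     // millions of entries

        const SegmentDetails& node_segment(std::size_t node) const {
            return segments[nodes[node].segment_idx];
        }
    };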

10.3.2 Compiler Settings

Modern C++ compilers support a variety of advanced optimizations such as Inter-Procedural/Link-Time Optimization (IPO/LTO) and Profile-Guided Optimization (PGO). Table 10.6 shows the impact of building VPR with these optimizations enabled using GCC 8 when implementing the Titan benchmarks on the enhanced Stratix IV model from Section 7.2. IPO offers significant improvements, reducing place, route and STA time by 23-26% and total time by 19%.

Table 10.6: Normalized Impact of Advanced Compiler Optimizations on VPR 8.0 Run-Time (Titan Benchmarks)
             Pack Time   Place Time   Route Time   Misc. Time   Total Time   STA Time
Standard     1.00        1.00         1.00         1.00         1.00         1.00
IPO          1.00        0.77         0.74         0.93         0.81         0.76
IPO + PGO    0.89        0.70         0.62         0.92         0.74         0.68
STA Time is included in Pack/Place/Route/Misc. and Total Times

PGO (using a profile generated from a run of neuron) improves the total run-time reduction further to 26%. Given that the run-time reductions from these compilation options are significant, it is important for future work (e.g. work comparing to VTR) to use consistent compilation settings, so as not to mis-attribute run-time improvements to algorithmic changes when they are in fact due to compiler settings.

10.3.3 Regression Testing

VTR has a large code base consisting of several different tools, each of which has a large number of features. To ensure robustness it is important to have tests which catch when existing features are broken by code changes. Additionally, it is vital to track VTR's Quality of Results (QoR) to detect any quality or run-time regressions, as it is otherwise very easy to inadvertently degrade VTR's optimization algorithms or data structures. The full regression test suite requires ∼168 hours to run serially, and covers over 370 architecture, benchmark and tool option combinations. Since each combination is independent, the run-time can be reduced to less than 40 hours with moderate parallelism (3 or 4 cores).

Benchmark & Architecture Coverage

VTR 8 regression tests now cover a wider variety of architectures including: classic (soft-logic only) architectures, modern FPGA architectures like the VTR Flagship and Titan architectures, architectures with custom hard blocks, and architectures using a variety of different routing technologies (classic bi-directional, uni-directional, complex custom switch patterns). We now also test a wider variety of benchmark sets including the classic MCNC20 [95], VTR benchmarks, Titan ‘other’ and Titan23. In particular including the Titan23 ensures VTR’s QoR and scalability are regularly evaluated on large complex benchmarks.

Feature Coverage

In addition to covering additional architectures and benchmarks, VTR 8 also includes more extensive tests which verify the functionality of specific features. While most VTR tests are integration tests that exercise the overall functionality of the flow or specific tools, we have also begun including unit tests to verify the functionality of smaller parts of the code base. While not a perfect metric, code coverage quantifies how well the test suite covers the various parts of the code base. Table 10.7 compares VTR 7 & 8's code line count and code coverage.7 Compared to VTR 7, the code base for Odin II has grown by 7% while VPR has grown by 40%. The large growth in VPR is due to the additional architecture modelling features, improved code structure, better encapsulated data structures, and enhancements to the various optimization algorithms (particularly the router). Despite this growth, VPR's overall code coverage has increased to 86%, with the largest coverage improvements occurring in the router and base functionality.

7 Results were collected with the gcov utility [222].

Table 10.7: VTR Code Coverage
               Norm. Lines           Test Coverage
               VTR 7    VTR 8        VTR 7    VTR 8
odin           1.00     1.07         0.70     0.69
vpr            1.00     1.40         0.75     0.86
vpr base       1.00     1.26         0.58     0.86
vpr pack       1.00     1.25         0.85     0.86
vpr place      1.00     1.17         0.92     0.94
vpr route      1.00     2.40         0.79     0.84
vpr timing     1.00     0.75         0.68     0.87
vpr power      1.00     1.19         0.88     0.87
vpr util       1.00     1.96         0.77     0.77
libarchfpga    1.00     1.20         0.70     0.80
libvtrutil                                    0.80
libtatum                                      0.82

Formal Verification

In addition to integration and unit tests which measure the functionality and QoR of VTR, in VTR 8 we have also added correctness tests. These leverage VPR’s improved netlist writer and ABC’s formal verification capabilities (Section 6.2) to formally prove the resulting implementation is equivalent to the original netlist. These tests ensure the various tools correctly implement the design and do not change the design’s functionality.

10.3.4 Error Messages & File Parsers

A common issue encountered by users of previous versions of VTR has been limited and often unhelpful error messages. We have made significant efforts in VTR 8 to provide more detailed, easier to interpret, and actionable error messages. A large part of this effort has involved improving the robustness of VPR's file format parsers, particularly those dealing with human-editable formats such as the FPGA architecture file, SDC constraints, and BLIF netlist. To facilitate this, several of the file parsers (SDC, BLIF) have been re-written using a formal format specification and parser generators [223, 224], which leads to helpful error messages when invalid syntax is provided. The FPGA architecture file has also been updated to use a robust XML parsing library [225] which removes artificial syntax restrictions, while the logic which interprets the XML file has been improved to emit more user-friendly error messages when invalid or unexpected values are encountered.

10.3.5 Documentation

We have also worked to improve VTR’s documentation, which is key for helping users understand and use VTR. VTR’s documentation is now centralized and automatically generated from the VTR source tree, ensuring it is easy to update and keep synchronized with the source code. The documentation is now searchable and available online at docs.verilogtorouting.org, and includes:

• Tutorials for common VTR use cases,

• Design flow documentation,

• Tool specific documentation,

• Common file-format specifications, and

• Developer guidelines.

The resulting VTR 8 user manual runs over 250 pages.

Chapter 11

VTR 8 Evaluation

With the wide range of new algorithms, enhancements and features, it is important to quantify the overall impact of these changes on VTR 8. This chapter carefully evaluates VTR 8, including the enhancements described in Chapters 6 to 10. We believe it is important to ground these comparisons, so in addition to comparing against the previous release of VTR (VTR 7), we also compare to Intel's commercial Quartus FPGA CAD system. Unless otherwise noted, all results were collected on an identical system with two Intel Xeon Gold 6146 CPUs (24 cores) and 768 GB of RAM. VTR was compiled with GCC 8 with full optimization (-O3), IPO, and PGO using a profile generated from single runs of the stereovision1 and neuron benchmarks.

11.1 VTR Benchmarks

One of VTR's purposes is to serve as a flow for architecture research and evaluation. To that end, we now evaluate VTR 8's QoR in an architecture exploration context. These results were collected on the VTR benchmarks while targeting VTR's flagship k6_frac_N10_frac_chain_mem32K_40nm architecture, whose characteristics are listed in Table 11.1. Each benchmark is run through the standard VTR flow (Figure 6.1), starting from a behavioural Verilog description which is synthesized by ODIN II and optimized by ABC. VPR then performs packing and placement, followed by a binary search over channel width (W) to find the minimum routable channel width (Wmin).

Wirelength and critical path delay are then measured after routing at a relaxed channel width set to 1.3 · Wmin. Table 11.2 summarizes VTR 8's achieved QoR relative to VTR 7 for benchmarks with > 10K netlist primitives (to avoid small benchmarks skewing the results). We include results for the default VTR 8 (ODIN8, ABC'18, VPR8), VTR 7 (ODIN7, ABC'12, VPR7), and hybrid flows mixing tool versions, which allow us to isolate the impact of improvements to ODIN, ABC, and VPR separately.

Table 11.1: Summary VTR Flagship Architecture Characteristics
Metric                        Expression   Value
Process Technology                         40nm
LUT Size                      K            6 (fracturable)
BLEs per Logic Block          N            10
Adder Bits per Logic Block                 20
Memory Block Size                          32 Kb
DSP Block Multiplier Size                  36x36 (fracturable)
Switchblock Type                           Wilton
Wiretype                                   Length 4


Table 11.2: VTR 7 & VTR 8 Comparison Summary on the VTR Benchmarks
                           Netlist      ABC                     Routed WL      Routed CPD     Odin   ABC    Pack   Place   Route Time    Route Time     Total   Peak
                           Primitives   Depth   CLBs    Wmin    (1.3 · Wmin)   (1.3 · Wmin)   Time   Time   Time   Time    (Find Wmin)   (1.3 · Wmin)   Time    Memory
VTR 7: ODIN7/ABC'12/VPR7   1.00         1.00    1.00    1.00    1.00           1.00           1.00   1.00   1.00   1.00    1.00          1.00           1.00    1.00
ODIN8/ABC'12/VPR7          0.86         0.99    0.88    1.00    0.88           0.98           1.29   0.73   0.87   0.97    1.15          0.96           1.10    0.90
ODIN8/ABC'12/VPR8          0.86         0.99    1.13    0.83    0.74           0.97           1.68   0.72   0.55   0.63    0.10          0.33           0.15    0.49
VTR 8: ODIN8/ABC'18/VPR8   0.74         0.89    0.74    0.85    0.59           0.88           1.53   5.15   0.58   0.56    0.08          0.30           0.19    0.30
Values are normalized geomeans over mutually completed benchmarks

We first evaluate the impact of enhancements to ODIN (ODIN8/ABC'12/VPR7), which shows ODIN produces fewer netlist primitives due to improved sweeping of unused logic and hard blocks [39]. We next compare the impact of enhancements to VPR (ODIN8/ABC'12/VPR8), which shows significant improvements in minimum routable channel width and wirelength, and drastic improvements in pack/place/route run-times and peak memory usage, with the overall flow completing significantly faster. Finally, we can compare the full VTR 8 (ODIN8/ABC'18/VPR8), showing overall a 15% reduction in Wmin, a 41% reduction in routed wirelength and a 12% reduction in critical path delay. Additionally, the run-times of the physical optimization stages are substantially reduced (1.7× packing, 1.8× placement, 12.5× to find Wmin and 3.3× routing at 1.3 · Wmin), reducing total flow run-time by 5.2×.

Peak memory usage is also reduced by 3.3×. Notably, while the newer ABC performs better optimizations its run-time is also substantially increased. These results show that improvements to all stages of the CAD flow contribute to the overall QoR and run-time improvements.

Table 11.4 shows VTR 8's detailed QoR on the VTR benchmarks and the normalized results compared to VTR 7. We again focus on the results for benchmarks with > 10K netlist primitives. The improved logic sweeping and optimization in VTR 8 reduces the number of netlist primitives by 26% and logic depth by 11%. As a result, despite decreasing packing density (Section 8.1) VTR 8 uses 26% fewer CLBs than VTR 7. VTR 8 produces significantly better implementation quality than VTR 7, improving minimum routable channel width by 15%, reducing wirelength by 41%, and reducing critical path delay by 12%. These quality improvements are achieved while cutting the overall VTR flow run-time by 5.2×, and peak memory consumption by 3.3×. The run-time gains come primarily from reductions in minimum channel width search time (12.5× faster), and place time (1.8× faster). Pack and relaxed channel width route times were also reduced by 1.7× and 3.3× respectively. ODIN II's run-time increased by 1.5× but remains a very small fraction (0.6%) of overall run-time. While the new version of ABC performs better optimizations, it runs substantially slower (5.2×) and accounts for 39% of overall run-time in the architecture exploration flow (find Wmin), which is comparable to the time spent on the physical implementation (packing, placement and routing). ABC run-time is an even larger fraction (62%) of run-time in the design implementation (W = 1.3 · Wmin) flow. As a result, despite significant run-time improvements during the physical implementation, VTR 8's run-time is comparable to VTR 7's when targeting a fixed channel width.

11.2 Titan23 Benchmarks

To compare VTR and Intel's commercial Quartus [179] tool we used the Titan Flow (Figure 6.1), where designs are synthesized and technology mapped using Quartus. The physical implementation (packing, placement, routing) can then be performed using either Quartus or VPR. Unless otherwise noted, VPR was run with its default options, except the maximum number of routing iterations was increased to 400, and the router used the new map lookahead (Section 9.4.1) in VPR 8. For all benchmarks a 48 hour wall-clock run-time limit was imposed. Any benchmarks exceeding the limit were classified as failures. Critical path delay is reported as the geometric mean over all clock domains in a circuit, excluding any virtual I/O clocks.

11.2.1 Comparison with VPR 7

To compare VPR 7 and 8 on the Titan benchmarks, we used VPR 8 and VPR 7+. Both tools use the same Titan v1.1.0 benchmarks synthesized with Quartus 12.1, and target the same legacy Stratix IV architecture model from [10]. Table 11.5 compares VPR 8 and VPR 7+ on the Titan v1.1.0 benchmarks. VPR 8 completes 5 more benchmarks than VPR 7+, showing that it is more robust and better able to handle large complex designs. VPR 8 packs less densely, using 6% more LABs and 13% more DSPs than VPR 7+ (Section 8.1.2). VPR 8 reduces routed wirelength by 34% and critical path delay by 11% on the common benchmarks which completed in both tools. From an overall run-time perspective VPR 8 runs substantially faster (5.3×) and uses much less memory (5.0×) than VPR 7+. The largest run-time improvement came from routing which completes 33.3× faster in VPR 8, highlighting the utility of the router enhancements detailed in Chapter 9 and improvements to packing (Chapter 8). Pack time is also reduced by 1.5× and place time by 1.1×.1

11.2.2 Comparison with Quartus 18.0

To compare VPR and Quartus on the Titan benchmarks, we used VPR 8 and Quartus 18.0. Both VPR and Quartus used the more recent Titan v1.3.0 benchmarks synthesized and technology mapped with Quartus 18.0. Quartus 18.0 targeted Stratix IV, while VPR 8 targeted the Stratix IV architecture model from Section 7.2 (which improves upon the model used in [10]). However the precise details of the Stratix IV routing architecture are not publicly available, so it is not possible to perform a perfect comparison.2 Our Stratix IV model uses a timing model calibrated to Quartus STA’s reported component delays, and models the key characteristics of both block and routing architectures and was validated in [10]. We believe this allows us to perform a reasonable comparison between VPR and Quartus. Quartus is configured to use ‘auto’ device selection which selects the smallest device with sufficient resources to implement the design. Similarly, VPR will automatically construct the smallest device with sufficient resources to implement the design. Since several of the Titan designs are larger than the largest commercially available Stratix IV device Quartus refuses to fit them, while VPR can build a sufficiently large device. Quartus was run in STANDARD FIT mode to ensure full optimization effort. To ensure run-times remain comparable, both Quartus and VPR were run using a single thread. VPR was run at its default effort level and at an increased High Effort (HE) level (inner num = 2, astar fac = 1) which resulted in run-times comparable to Quartus. Both Quartus and VPR were given equivalent timing constraints as follows. Paths between clock domains were cut, except for those to/from external I/Os which are constrained on a virtual I/O clock.

1 Miscellaneous time increases in VPR 8 since the routing graph needs to be profiled to create the map lookahead, but remains a small fraction overall.
2 We expect VPR's automatically generated RR graph, while matching Stratix IV's major characteristics, is of lower quality than the extensively hand-optimized pattern used in Stratix IV.

Table 11.3: VPR Quartus QoR Summary on Titan Benchmarks
                            # Benchmarks Completed
                            VPR    Quartus   LABs   DSPs   M9Ks   M144Ks   Routed WL   Routed CPD   Pack Time   Place Time   Route Time   Total Time   Peak Memory
VPR 7+ / Quartus 12.0 *     14     20        0.82   1.13   1.13   1.69     2.02        1.53         2.36        0.87         8.23         2.82         6.21
VPR 8 / Quartus 18.0 *      23     20        0.85   0.96   1.33   0.00     1.38        1.32         1.39        0.46         0.30         0.51         0.96
VPR 8 HE / Quartus 18.0 *   23     20        0.85   0.96   1.33   0.00     1.31        1.26         1.39        1.04         0.35         0.87         0.97
VPR 8 HE / Quartus 18.0 †   23     20        0.95   0.97   1.30   0.00     1.26        1.20         1.18        1.00         0.34         0.83         0.96
* QoR from common subset of 13 completed benchmarks across all tool pairs
† QoR from common subset of 20 completed benchmarks across VPR 8 and Quartus 18.0
VPR 7+/Quartus 12.0 results from [10].

Each clock domain is given an aggressive 1ns clock period target. Extremely small clock domains (those fanning out to < 0.1% of netlist primitives) were ignored to avoid skewing the average critical path delay and instead focus on primary system clocks.

Table 11.3 summarizes the QoR comparison between various versions of VPR, normalized to Quartus. Compared to VPR 7+, VPR 8 completes significantly more circuits (23 vs. 14). When VPR 8 is run with its default parameters it requires 38% more wire and produces circuits which are 32% slower than Quartus 18.0, but runs nearly 2× faster and uses less memory. Increasing VPR 8's effort level so its run-time is comparable to Quartus 18.0, and including all 20 benchmarks which mutually complete (VPR 8 HE / Quartus 18.0), the wirelength and circuit speed gaps narrow to 26% wirelength and 20% critical path delay.3 VPR 8 substantially narrows the quality gap, with the wirelength and critical path delay gaps reduced by 1.5× and 1.2× respectively compared to VPR 7+. Notably, including the larger benchmarks which VPR 8 can now complete further narrows the gap with Quartus, indicating the scalability of VPR's optimizations. VPR 8's run-time is also substantially improved (5.5× to 3.2× faster overall compared to VPR 7+) with route time seeing the largest improvement (27× to 24× faster), while peak memory use is reduced by 6.5×.

Table 11.6 compares VPR 8 (in high effort mode) and Quartus 18.0 in detail. VPR 8 completes all 23 of 23 feasible Titan benchmarks, and Quartus completes 20 of 21 feasible benchmarks. This shows that VPR and Quartus' benchmark completion rates are similar. In terms of resource utilization VPR 8 uses 5% fewer LABs and 3% fewer DSPs than Quartus. However, VPR 8 uses more M9Ks and fewer M144Ks since it uses the larger and rarer M144Ks only when RAMs would otherwise not fit in M9Ks. Due to resource utilization differences VPR 8's auto-sized devices can be either smaller or larger than the devices used by Quartus, but on average they are 15% larger. As expected, those designs where VPR uses more resources of relatively scarce types (e.g. M9Ks, DSPs) than Quartus tend to produce the largest differences in device sizes. From a run-time perspective, VPR 8 HE runs 1.2× faster than Quartus and has a 4% smaller memory footprint. VPR 8's packer runs 18% slower than Quartus', but is much more general purpose. For these large benchmarks, placement dominates run-time (78% of total time), and VPR 8's placer run-time is comparable to Quartus. The VPR 8 router is more efficient and runs 2.9× faster than Quartus even in this high effort mode. Compared to VPR 8, Quartus spends approximately 3.3× more run-time on STA. VPR 8's QoR and run-time improvements result from two key factors:

1. CAD algorithm enhancements, which improve the quality and robustness of VPR’s optimizations, and

2. Improved architecture modelling capabilities (Section 7.1), which allow VPR to target a more realistic model of Stratix IV (Section 7.2), and in particular a more accurate capture of Stratix IV's routing architecture.4

3 For multi-clock circuits geometric mean and worst critical path delay results were similar.
4 However, it should be noted the detailed routing pattern used by VPR is still automatically generated, and likely remains of lower quality than the heavily hand-optimized routing pattern used in Stratix IV.


This shows the benefit of improving architecture modelling capabilities and CAD algorithms together.

11.3 Conclusion

This part of the thesis has presented the enhancements and extensions included in version 8.0 of the open-source Verilog To Routing project. VTR 8 substantially enhances VTR’s architecture modelling capabilities allowing VTR to model a wider range of modern FPGA characteristics including: general device grids, highly customizable routing architectures, and improved area and delay estimates. These features can be used to both facilitate new research (e.g. by allowing more control over characteristics such as the detailed routing pattern), and to better model and target modern commercial FPGAs with suitable bitstream generators [177]. In addition to these features we have also significantly enhanced the CAD algorithms and tools used to implement and optimize designs in VTR. These enhancements have significantly improved implementation quality of the VTR benchmarks (15% minimum channel width, 41% wirelength, and 12% critical path delay reductions) while also reducing tool run-time and memory usage by 5.2× and 3.3× respectively. On the larger industrial-scale Titan benchmarks, VPR 8 now completes a similar proportion of mutually feasible designs as a commercial tool like Quartus. We have also reduced the quality gap between VPR and the commercial Quartus tool to 26% wirelength and 20% critical path delay (reductions of 1.5× and 1.2× compared to previous results), while simultaneously running 17% faster and using similar amounts of memory. These results show that through careful code architecture and algorithm design it is possible for general re-targetable CAD tools like VPR to be both run-time and memory footprint efficient, while producing circuit implementations which are of reasonable quality compared to those produced by highly tuned architecture-specific industrial tools. Reducing the implementation quality gap also helps ensure the architectural conclusions produced using these tools are valid and not skewed due to poor optimization.

11.4 Future Work

In the future we plan to continue extending VTR's modelling capabilities to enable the exploration of an even broader range of FPGA architectures and to enable full modelling of advanced features found in commercial FPGAs. We will also continue to focus on improving the quality of VTR's optimization algorithms. We expect that a large portion of the remaining quality gaps result from clustering quality. In particular, we have observed that while VPR 8 produces clusters which are more natural than VPR 7's, related logic can still sometimes be scattered between different clusters, which end up placed far apart. The result is that such signals end up crossing large parts of the chip, increasing wirelength. Inspection of the resulting critical paths shows a similar trend, with timing-critical connections sometimes spanning large portions of the chip. We believe it is important to consider cross-boundary optimizations between the traditional pack/place/route stages, such as being able to take apart clusters during placement [226].5

It would also be interesting to continue exploring the application of Machine Learning [227, 5] techniques to various stages of the CAD flow (one instance of this is discussed in Chapter 13). From a run-time perspective, further investigation into hierarchical and parallel compilation techniques also seems promising.

We also plan to continue extending VPR to be an end-to-end CAD flow capable of both targeting existing FPGA devices and facilitating the creation of physical devices based on VTR architectures. To that end, it would be useful to extend the RR graph file (Section 6.2) to become a description of the entire FPGA device (including the device grid and block architectures). Such a ‘device file’ could be generated from VTR's higher-level architecture file description, or through other means, to model new architectures or architectural features not currently supported by VTR's higher-level descriptions. This would allow VTR's design implementation and optimization algorithms to be better decoupled from the high-level architecture description, and to be more easily re-targeted to other architectures.

While we believe that VTR's improved optimization quality allows it to make accurate CAD and architecture conclusions, it would be interesting to verify its fidelity by, for instance, comparing the impact of different optimizations in both VTR and commercial CAD systems. It would additionally be interesting to compare the quality of routing patterns generated by VTR when targeting our model of Stratix IV to the heavily optimized actual routing patterns used in commercial devices such as Stratix IV. It is also future work to consider how VTR could be extended to support multi-context FPGAs and partial reconfiguration.

5 Previous measurements of Quartus [10] indicate taking apart clusters allows Quartus to reduce wirelength and critical path delay by 9% and 10% respectively.

Table 11.4: VTR 8.0: VTR Benchmarks (Relative to VTR 7)
Columns: Benchmark, Netlist Primitives, Clocks, Logic Depth, IOs, CLBs, DSPs, RAMs, Wmin, Routed Wirelength (1.3 · Wmin), Routed CPD (1.3 · Wmin), Odin II Time, ABC Time, Pack Time, Place Time, Route Time (Find Wmin), Route Time (1.3 · Wmin), VTR Flow Elapsed Time, VTR Flow Elapsed Time (excl. Find Wmin), Peak Memory
mcml 165,809 (0.86×) 1 27 (0.96×) 36 (1.00×) 7,196 (0.94×) 27 (0.90×) 159 (1.00×) 128 (0.98×) 1,007,246 (0.63×) 45.83 (1.03×) 24.94 (1.81×) 3,172.5 (17.49×) 204.7 (0.44×) 679.4 (0.80×) 1,321.6 (0.12×) 91.2 (0.38×) 5,597.2 (0.42×) 4,275.6 (2.33×) 1.93 (0.36×)
LU32PEEng 101,542 (0.78×) 1 99 (0.95×) 114 (1.00×) 6,919 (0.79×) 32 (1.00×) 167 (0.99×) 136 (0.96×) 1,275,909 (0.72×) 75.25 (0.94×) 29.82 (4.25×) 853.4 (13.69×) 222.5 (0.83×) 536.1 (0.79×) 1,673.7 (0.02×) 146.9 (0.56×) 3,549.4 (0.04×) 1,875.7 (1.41×) 1.66 (0.28×)
stereovision2 42,077 (0.93×) 1 3 (1.00×) 149 (1.00×) 1,703 (0.75×) 202 (0.95×) 88 (0.71×) 417,138 (0.57×) 14.25 (0.88×) 1.01 (1.09×) 6.6 (0.68×) 24.2 (0.78×) 88.7 (0.71×) 321.8 (0.03×) 11.4 (0.16×) 484.6 (0.05×) 162.8 (0.63×) 0.89 (0.49×)
LU8PEEng 31,396 (0.79×) 1 103 (0.99×) 114 (1.00×) 2,062 (0.80×) 8 (1.00×) 44 (0.98×) 94 (0.98×) 295,304 (0.69×) 77.62 (0.89×) 2.44 (1.30×) 73.5 (4.46×) 64.2 (0.90×) 68.6 (0.92×) 177.5 (0.11×) 17.9 (0.34×) 427.7 (0.23×) 250.2 (1.09×) 0.53 (0.31×)
bgm 24,865 (0.43×) 1 14 (0.52×) 257 (1.00×) 2,136 (0.54×) 11 (1.00×) 74 (0.90×) 281,735 (0.46×) 19.25 (0.62×) 9.07 (2.30×) 146.4 (6.87×) 39.5 (0.52×) 58.1 (0.33×) 101.8 (0.07×) 8.2 (0.25×) 388.3 (0.22×) 286.6 (0.88×) 0.46 (0.18×)
stereovision0 21,789 (0.70×) 1 5 (0.83×) 157 (0.93×) 704 (0.63×) 56 (0.88×) 59,060 (0.50×) 3.54 (0.79×) 0.91 (1.00×) 8.3 (1.48×) 7.6 (0.37×) 8.6 (0.33×) 11.9 (0.18×) 1.3 (0.30×) 47.5 (0.36×) 35.6 (0.54×) 0.20 (0.26×)
stereovision1 19,549 (0.71×) 1 3 (1.00×) 115 (0.86×) 702 (0.65×) 40 (1.05×) 78 (0.72×) 115,572 (0.56×) 5.44 (0.92×) 1.02 (1.01×) 40.0 (7.36×) 7.3 (0.34×) 11.1 (0.39×) 60.1 (0.07×) 3.8 (0.25×) 134.5 (0.15×) 74.4 (0.92×) 0.24 (0.33×)
arm_core 18,437 1 18 133 1,034 40 116 212,114 20.97 1.11 65.6 29.6 23.0 170.7 14.4 340.5 169.8 0.36 ◦
blob_merge 11,415 (0.85×) 1 5 (1.00×) 36 (1.00×) 623 (0.88×) 62 (0.74×) 68,514 (0.60×) 14.95 (1.00×) 0.30 (1.21×) 36.0 (9.09×) 14.9 (0.78×) 7.8 (0.57×) 19.4 (0.21×) 1.3 (0.31×) 86.3 (0.63×) 66.9 (1.49×) 0.14 (0.31×)
or1200 4,530 (0.82×) 1 8 (1.00×) 385 (1.00×) 255 (0.88×) 1 (1.00×) 2 (1.00×) 94 (1.15×) 45,258 (0.81×) 9.06 (1.01×) 0.24 (1.03×) 3.9 (2.52×) 6.4 (0.88×) 5.6 (0.87×) 10.2 (0.06×) 2.7 (1.23×) 32.6 (0.18×) 22.4 (1.15×) 0.09 (0.43×)
mkDelayWorker32B 4,145 (0.36×) 1 5 (0.71×) 506 (0.99×) 505 (0.99×) 44 (1.02×) 40 (0.53×) 18,099 (0.14×) 6.52 (0.86×) 0.60 (1.25×) 5.4 (1.61×) 3.0 (0.23×) 10.1 (0.60×) 6.7 (0.00×) 0.9 (0.07×) 32.1 (0.02×) 25.4 (0.52×) 0.25 (0.49×)
raygentop 2,934 (0.61×) 1 3 (0.75×) 214 (0.90×) 110 (0.53×) 8 (1.14×) (0.00×) 56 (0.74×) 20,468 (0.62×) 4.89 (0.97×) 0.19 (1.10×) 1.0 (1.20×) 1.9 (0.33×) 2.0 (0.53×) 7.0 (0.27×) 0.8 (0.44×) 14.9 (0.38×) 7.9 (0.57×) 0.06 (0.39×)
mkSMAdapter4B 2,852 (0.69×) 1 4 (0.67×) 193 (0.99×) 199 (0.99×) 5 (1.00×) 50 (0.83×) 17,952 (0.69×) 5.33 (0.93×) 0.20 (1.22×) 1.6 (1.57×) 2.7 (0.48×) 2.3 (0.73×) 5.9 (0.35×) 0.3 (0.42×) 15.0 (0.52×) 9.1 (0.76×) 0.06 (0.39×)
sha 2,744 (0.83×) 1 3 (0.60×) 38 (1.00×) 154 (0.76×) 74 (1.19×) 15,641 (0.86×) 10.13 (1.01×) 0.56 (1.74×) 245.3 (67.95×) 1.8 (0.60×) 1.5 (0.63×) 6.0 (0.19×) 0.3 (0.46×) 260.1 (6.10×) 254.1 (22.45×) 0.06 (0.43×)
spree 1,229 1 15 45 65 1 3 70 11,210 10.52 0.12 0.7 2.0 0.8 2.4 0.3 7.5 5.1 0.04 ◦
mkPktMerge 1,160 (0.88×) 1 2 (0.67×) 311 (1.00×) 29 (1.38×) 15 (1.00×) 40 (0.80×) 13,486 (0.90×) 4.51 (1.09×) 0.10 (1.11×) 0.1 (0.60×) 0.5 (0.79×) 1.5 (0.68×) 2.6 (0.08×) 0.5 (0.20×) 6.9 (0.18×) 4.3 (0.69×) 0.06 (0.69×)
boundtop 1,141 (0.20×) 1 3 (0.38×) 142 (0.52×) 94 (0.37×) (0.00×) 48 (0.89×) 3,152 (0.11×) 3.38 (0.54×) 0.25 (1.06×) 0.4 (0.28×) 0.7 (0.09×) 0.8 (0.19×) 0.9 (0.11×) 0.1 (0.08×) 4.9 (0.20×) 3.9 (0.25×) 0.04 (0.22×)
diffeq1 886 (0.87×) 1 6 (1.00×) 162 (1.00×) 32 (0.91×) 5 (1.00×) 50 (0.81×) 8,945 (0.86×) 17.02 (1.01×) 0.03 (1.24×) 0.2 (0.79×) 0.3 (0.46×) 0.7 (0.90×) 6.3 (0.62×) 0.3 (0.42×) 8.7 (0.65×) 2.4 (0.74×) 0.04 (0.91×)
diffeq2 599 (0.81×) 1 6 (1.20×) 66 (1.00×) 22 (0.85×) 5 (1.00×) 52 (0.90×) 7,411 (0.83×) 12.97 (1.02×) 0.02 (1.89×) 0.1 (0.80×) 0.4 (0.53×) 0.6 (0.76×) 2.3 (0.21×) 0.3 (0.63×) 4.5 (0.34×) 2.2 (0.87×) 0.04 (0.97×)
ch_intrinsics 493 (0.55×) 1 3 (0.75×) 99 (1.00×) 64 (1.73×) 1 (1.00×) 54 (1.13×) 1,304 (0.33×) 2.58 (0.68×) 0.06 (1.58×) 0.2 (0.59×) 0.1 (0.20×) 0.3 (0.81×) 0.6 (0.64×) 0.0 (0.35×) 2.1 (0.67×) 1.5 (0.68×) 0.03 (0.91×)
stereovision3 321 (0.92×) 2 5 (1.25×) 11 (1.00×) 14 (1.08×) 30 (0.94×) 791 (1.06×) 2.65 (0.95×) 0.04 (1.28×) 0.1 (0.92×) 0.3 (1.02×) 0.1 (0.60×) 0.3 (0.98×) 0.0 (0.88×) 1.6 (1.21×) 1.3 (1.29×) 0.02 (1.30×)
GEOMEAN 5,450.6 (0.68×) 1 7.2 (0.82×) 112.1 (0.95×) 291.6 (0.82×) 10.2 (1.00×) 15.7 (1.00×) 65.7 (0.87×) 36,594.1 (0.56×) 10.25 (0.89×) 0.46 (1.39×) 5.7 (2.39×) 4.4 (0.48×) 5.8 (0.60×) 15.4 (0.13×) 1.3 (0.33×) 50.6 (0.29×) 31.3 (0.99×) 0.14 (0.44×)
GEOMEAN (> 10K) 33,242.9 (0.74×) 1 13.6 (0.89×) 105 (0.97×) 1,700.5 (0.74×) 29.2 (0.98×) 82.7 (0.99×) 88.6 (0.85×) 254,151.9 (0.59×) 19.68 (0.88×) 2.58 (1.53×) 81.1 (5.15×) 34.4 (0.58×) 50.7 (0.56×) 146.2 (0.08×) 10.8 (0.30×) 406.1 (0.19×) 241.7 (1.05×) 0.49 (0.30×)
% TOTAL (> 10K) 0.64% 39.1% 5.7% 13.4% 34.9% 2.7% 100.0% 64.9%
Run-time in Seconds, Memory in GiB, WL in grid tiles, CPD in ns. ◦ VTR 7 error.

Table 11.5: VPR 8.0: Titan v1.1.0 Benchmarks (Relative to VPR 7+)
Columns: Benchmark, Netlist Primitives, Clocks, IOs, LABs, DSPs, M9Ks, M144Ks, Device Grid Tiles, Routed Wirelength, Routed CPD (geomean), Pack Time, Place Time, Route Time, Misc. Time, Total Time, Peak Memory
gaussianblur 1,860,045 1 558 105,457 4 12 0 117,410 126.1 (0.76×) 576.8  ◦
bitcoin_miner 1,062,214 2 385 29,533 (1.02×) 1,668 (1.14×) 0 (0.00×) 43,139 12,604,393 1.01 12.1 (0.23×) 169.2 (1.40×) 23.8 12.3 217.4 12.28 
directrf 934,809 2 319 58,474 252 2,535 0 73,162 23.5 217.6 N ◦
sparcT1_chip2 766,880 1 1,890 29,716 (0.96×) 3 (1.00×) 580 (1.09×) 55,622 7,207,537 (0.61×) 19.93 (0.89×) 25.1 (0.47×) 88.2 (0.94×) 7.8 (0.02×) 14.3 (5.20×) 135.4 (0.23×) 11.41 (0.24×)
LU_Network 630,376 19 405 31,026 112 1,175 0 35,478 11.8 86.0  ◦
LU230 568,367 2 373 17,049 (1.11×) 116 (1.00×) 5,040 (1.09×) 16 (0.05×) 135,676 18,171,449 10.13 20.6 (0.51×) 62.9 20.8 32.7 137.0 19.67 ◦
mes_noc 549,051 9 5 25,570 (1.03×) 800 (1.00×) 29,106 5,093,677 (0.56×) 9.90 24.8 (0.73×) 77.9 (0.85×) 9.8 (0.03×) 8.0 (3.59×) 120.5 (0.24×) 8.15 (0.21×)
gsm_switch 491,989 4 138 20,827 (1.08×) 1,848 (1.15×) 0 (0.00×) 47,124 6,566,158 (0.70×) 5.59 (0.85×) 12.7 (0.50×) 39.3 (0.83×) 5.6 (0.00×) 11.8 (5.48×) 69.4 (0.02×) 9.17 (0.14×)
denoise 343,755 1 852 18,202 (1.05×) 48 (1.00×) 377 (1.03×) 21,125 3,897,065 (0.66×) 864.44 (0.93×) 8.4 (0.38×) 79.3 (1.09×) 7.5 (0.01×) 6.0 (4.10×) 101.2 (0.10×) 6.17 (0.24×)
sparcT2_core 288,458 1 451 14,215 (1.09×) 266 (1.00×) 16,280 4,604,317 (0.84×) 18.81 (1.53×) 11.2 (0.61×) 41.0 (1.29×) 6.4 (0.04×) 4.4 (4.40×) 63.0 (0.31×) 4.62 (0.25×)
cholesky_bdti 256,234 1 162 9,678 (1.04×) 132 (1.00×) 600 (1.00×) 20,832 2,562,018 (0.66×) 9.34 (0.77×) 4.8 (0.61×) 14.2 (0.74×) 4.5 (0.04×) 5.1 (4.55×) 28.6 (0.20×) 5.02 (0.20×)
minres 252,687 2 229 7,693 (1.07×) 149 (1.17×) 1,458 (1.27×) 0 (0.00×) 36,244 2,749,531 (0.61×) 6.28 (0.86×) 5.4 (0.64×) 13.5 (0.96×) 3.2 (0.03×) 8.3 (6.61×) 30.3 (0.21×) 6.61 (0.15×)
stap_qrd 237,347 1 150 16,067 (1.04×) 74 (1.00×) 553 (1.31×) 0 (0.00×) 18,212 2,635,810 (0.62×) 8.06 (0.82×) 4.3 (0.56×) 25.7 (0.86×) 4.1 (0.10×) 4.7 (4.09×) 38.9 (0.47×) 4.52 (0.19×)
openCV 212,826 1 208 7,135 (1.10×) 248 (1.47×) 787 (1.01×) 40 (0.83×) 38,930 3,271,085 11.50 5.3 (0.64×) 12.1 (0.76×) 6.5 8.9 32.9 6.65 4
dart 202,439 1 69 6,898 (1.07×) 530 (1.07×) 0 (0.00×) 13,837 2,533,682 13.32 6.6 (0.79×) 11.4 (0.92×) 3.5 3.6 25.1 3.58 4
bitonic_mesh 191,783 1 119 7,671 (1.12×) 169 (1.84×) 1,664 (1.07×) 0 (0.00×) 43,139 4,139,332 (0.53×) 15.33 (0.81×) 8.8 (0.78×) 17.9 (0.94×) 5.9 (0.00×) 9.9 (7.39×) 42.4 (0.03×) 7.05 (0.13×)
segmentation 168,365 1 441 8,641 (1.06×) 21 (1.00×) 488 (1.33×) 0 (0.00×) 13,500 2,060,810 (0.67×) 880.13 (0.96×) 3.8 (0.43×) 23.2 (0.98×) 3.2 (0.01×) 3.5 (4.27×) 33.7 (0.09×) 3.55 (0.20×)
SLAM_spheric 125,679 1 479 6,657 (1.04×) 66 (1.06×) 87 (1.00×) 10,890 2,116,663 82.97 3.7 (0.59×) 12.6 (0.95×) 3.7 2.7 22.7 2.99 
des90 109,928 1 117 4,252 (1.07×) 88 (1.80×) 860 (1.05×) 0 (0.00×) 21,125 2,192,517 (0.59×) 12.50 (0.80×) 4.7 (0.85×) 8.4 (0.91×) 3.6 (0.02×) 4.8 (6.03×) 21.6 (0.09×) 3.92 (0.14×)
cholesky_mc 108,499 1 262 4,792 (1.05×) 58 (1.00×) 444 (1.09×) 16 (0.80×) 11,193 1,203,602 (0.78×) 8.37 (0.95×) 1.9 (0.74×) 4.4 (0.82×) 2.2 (0.11×) 2.7 (4.11×) 11.3 (0.40×) 2.75 (0.17×)
stereo_vision 93,171 3 506 3,038 (1.12×) 80 (1.05×) 135 (1.00×) 12,160 702,858 (0.73×) 3.19 (0.88×) 3.7 (2.96×) 2.3 (0.67×) 1.1 (0.19×) 2.7 (5.90×) 9.8 (0.89×) 2.68 (0.29×)
sparcT1_core 91,580 1 310 3,910 (1.04×) 1 (1.00×) 131 (1.00×) 4,661 1,255,399 (0.78×) 8.50 (0.84×) 4.4 (1.02×) 4.7 (0.94×) 1.7 (0.08×) 1.3 (2.93×) 12.1 (0.40×) 1.90 (0.26×)
neuron 90,857 1 77 3,310 (1.05×) 97 (1.04×) 185 (1.00×) 14,076 873,185 (0.69×) 6.07 (0.80×) 1.6 (0.88×) 2.7 (0.80×) 1.5 (0.13×) 3.3 (7.01×) 9.1 (0.51×) 2.94 (0.29×)
GEOMEAN 286,921 1.6 236.2 12,112.4 (1.06×) 51.4 (1.13×) 530.2 (1.08×) 21.7 (0.31×) 26,191.7 3,071,830.2 (0.66×) 14.80 (0.89×) 8.1 (0.65×) 26.1 (0.92×) 4.6 (0.03×) 5.7 (4.88×) 38.5 (0.19×) 5.25 (0.20×)
Run-time in Minutes, Memory in GiB, WL in grid tiles, CPD in ns
N VPR 8 time-out;  VPR 8 unroute; 4 VPR 7+ time-out;  VPR 7+ unroute; ◦ VPR 7+ error

) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) × × × × × × × × × × × × × × × × × × × × × 96 88 88 12 02 89 90 08 99 94 01 78 96 78 88 09 13 80 22 13 92 ...... (0 (0 (0 (1 (1 (0 (0 (1 (0 (0 07 95 03 79 68 38 54 67 36 91 47 75 76 (1 55 (0 86 (0 39 (0 22 24 (0 19 (1 73 (1 98 (0 10 (1 69 (1 92 (0 ...... 7 4 6 4 3 3 3 2 2 27 16 10 ) ) ) ) ) ) ) ) ) ) ) 8 ) 2 ) 5 ) 11 ) 5 ) 5 ) 6 ) 1 ) 17 ) 5 ) 2 × × × × × × × × × × × × × × × × × × × × × 22 21 33 22 39 26 52 25 32 20 27 36 30 44 53 22 24 28 25 38 30 ...... (0 (0 (0 (0 (0 (0 (0 (0 (0 (0 6 9 1 1 4 5 8 2 7 8 6 8 4% 93 11 (0 4 (0 2 (0 8 (0 6 (0 3 (0 4 (0 0 (0 1 (0 5 (0 2 (0 ...... 8 6 3 2 2 2 1 1 0 0 2 22 10 ) ) ) ) ) ) ) ) ) ) ) 4 ) 1 ) 3 ) 9 ) 3 ) 2 ) 3 ) 1 ) 8 ) 2 ) 1 × × × × × × × × × × × × × × × × × × × × × 01 77 94 05 33 48 62 10 86 88 59 70 83 73 60 87 88 70 81 14 04 ...... (1 (0 (0 (1 (1 (0 (0 (1 (0 (0 0 7 2 1 1 0 3 7 5 6 8 5 0% 29 6 (0 8 (0 6 (0 6 (0 2 (0 8 (0 4 (0 9 (0 3 (0 1 (1 5 (1 ...... 73 79 41 48 39 13 14 438 540 218 197 106 100 , 1 ) ) )) 152 ) ) ) ) )) 28 ) ) ) 91 ) 461 ) 131 ) 59 ) 93 ) 16 ) 439 ) 62 ) 21 × × × × × × × × × × × × × × × × × × × × × 36 28 49 36 67 55 11 68 38 69 07 84 49 24 43 55 59 38 64 98 68 ...... (0 (0 (0 (0 (0 (0 (0 (0 (1 (0 5 9 3 1 28 (0 2 9 8 8 08 (0 1 2 99% (0 0 (0 9 271 7 (0 7 (0 8 (0 5 (0 9 (0 3 (0 1 (0 ...... 8 4 9 4 3 3 5 3 3 4 33 20 10 ) ) ) ) ) ) ) ) ) ) ) 6 ) 12 ) 2 ) 11 ) 33 ) 4 ) 5 ) 8 ) 10 ) 3 ) 1 × × × × × × × × × × × × × × × × × × × × × 20 26 18 38 97 14 23 44 38 35 34 35 54 09 67 19 01 99 15 98 20 ...... (0 (0 (0 (0 (0 (0 (0 (0 (0 (0 1 2 1 7 3 2 5 0 5 7 6 9 08% (0 1 (0 1 (0 0 (0 24 (0 15 5 (0 2 (1 3 (0 5 (0 8 (0 3 (0 ...... 6 7 6 3 5 2 2 2 0 0 9 77 84 ) 6 ) ) ) 17 ) ) ) ) ) )) 3 ) ) ) 10 ) 6 ) 183 ) 4 ) 8 ) 6 ) 5 ) 2 ) 1 × × × × × × × × × × × × × × × × × × × × × × 00 37 87 64 44 32 65 82 66 73 33 86 05 97 85 07 65 91 17 42 17 83 ...... (1 (0 (1 (1 (1 (0 (0 (1 (0 (1 6 (1 9 (0 4 (0 7 3 9 1 3 6 3 4 0 7 9 9 0% 3 (0 9 (0 4 (1 1 (0 2 (1 7 (1 7 (1 0 (0 3 (0 ...... 8 8 83 54 64 29 39 26 78 200 418 188 158 , 1 ) 66 ) 110 ) 20 ) ) ) ) ) ) ) ) ) ) ) 428 ) 221 ) 200 ) 41 ) 42 ) 68 ) 13 ) 10 ) 115 × × × × × × × × × × × × × × × × × × × × × × 18 16 80 69 51 89 03 76 61 33 70 92 98 31 53 75 97 55 28 15 44 13 ...... (0 (1 (1 (1 (0 (1 (1 (2 (0 (0 4 (1 7 (1 2 (1 7 3 9 2 7 0 5 5 2 5 2 4 2% 3 (0 2 (0 6 (0 7 (0 3 (1 3 (2 0 (1 0 (3 6 (1 ...... Pack Time Place Time Route Time Misc. Time Total Time STA Time Peak Memory 6 4 6 3 4 1 1 7 17 12 23 11 126 ) 7 ) 12 ) 3 ) 12 ) 21 ) 4 ) 5 ) 8 ) 2 ) 4 ) ) ) )) 6 ) ) ) ) ) ) × × × × × × × × × × × × × × × × × × × × × 20 25 11 93 03 68 24 32 58 13 92 16 09 90 12 04 29 92 60 96 32 ...... (0 (1 (1 (1 (2 (1 (0 (1 (0 (1 00 (1 10 (1 06 (1 24 71 05 38 83 16 67 93 22 71 47 56 38 (0 0002 (1 27 09 (1 20 (1 74 (1 26 (1 67 (1 00 (0 ...... (geomean) 9 5 8 4 7 3 5 Routed CPD 10 11 12 929 848 ) 16 ) 6 ) 79 ) 8 ) ) 10 ) ) ) )) 9 ) 11 ) ) 13 ) ) )) 7 )) 8 ) 851 × × × × × × × × × × × × × × × × × × × × × 26 95 15 06 19 45 03 06 46 48 53 46 02 80 19 26 96 52 00 27 10 ...... (1 (1 (1 (1 (1 (1 (1 (1 (1 (2 9 (1 . 
814 (0 905 (1 344 160 973002 (1 606573 (1 5 119 (1 630 (1 536 (0 099 (1 016 (1 344 841 316 873510 (1 613 081 886 716 530 195 , , , , , , , , , , , , , , , , , , , , , , , , 314 671 427 738 238 269 706 448 572 997 258 235 225 652 942 562 067 708 595 117 704 147 608 765 , , , , , , , , , , , , , , , , , , , , , , 5 4 4 2 2 2 1 2 Routed Wirelength 30 12 DSP DSP LAB LAB LAB LAB LAB M9K M9K M9K M9K M9K Limiting Resource ) M9K 5 ) LAB 1 ) 3 ) LAB) 10 ) IO)) M9K 7 15 ) ) )) DSP)) 2 DSP)) 2 M9K 4 ) )) M9K) 1 ) LAB 1 ) LAB 3 × × × × × × × × × × × × × × × × × × × × × × 15 50 78 77 72 56 64 36 23 20 58 90 02 16 08 47 33 29 03 59 85 87 ...... (1 (0 (0 (1 (1 (1 (0 (1 (2 (1 2 (1 . 342 195 (1 984 (0 170 (3 395 (1 318 (1 625 (1 718 184495 (1 753860 (1 936 176 912184 (0 125762 (1 076 736 717 384 002384 (0 , , , , , , , , , , , , , , , , , , , , , , , , 74 35 27 17 37 18 14 13 21 12 12 116 ) 25 ) 48 ) ) ) 137 ) 32 ) 43 ) ) 11 ) × × × × × × × × × × 87 00 00 00 00 83 00 00 80 00 ...... (0 (0 (0 (0 7 (0 . 0 0 0 0 VPR 8.0 High Effort: Titan v1.3.0 (Relative to Quartus 18.0) ) 21 ) 0 (0 ) )) )) ) 16) (1 37 ) ) 57 ) )) ) 40) (0 ) 16 0) (0 21 ) ) 16) (0 )) 5 × × × × × × × × × × × × × × × × × × × × × × × 15 30 00 02 23 00 26 05 19 00 74 31 00 00 17 00 03 87 00 19 00 01 00 ...... (1 (1 (1 (1 (1 (1 (1 (1 (1 (1 (1 (4 1 (1 . 12 848 (1 331535 (1 175 040 (4 458 664 (1 800 260 553 530 481 860 113 136 , , , , , , , 2 1 1 ) 551 ) 6 ) 506) (1 5 ) 359) (2 600) (1 785) (1 1 ) 444) (1 128 (1 ) ) ) ) ) ) ) × × × × × × × × × × × × × × × × × Table 11.6: 97 56 00 00 73 00 68 71 00 00 00 93 00 83 50 00 81 ...... (1 (0 (1 (0 (0 (4 (0 7 (0 . 2 78 74 15 44 76 89 240 112 ) 39 ) 1 ) 37 (0 )) 3) (1 116 (1 1 ) 24) (0 132 (1 ) 213 (1 ) 85 (0 ) 59) (1 1 (1 ) ) ) ) ) ) ) ) ) ) ) ) × × × × × × × × × × × × × × × × × × × × × × × × 95 72 68 00 73 51 79 09 84 50 88 95 12 75 47 81 04 63 82 83 78 75 87 68 ...... (1 (1 (1 (0 (1 (0 (1 (0 (0 (0 (0 (0 2 (0 . 534 (0 607 (0 977 405 (4 619 (0 434 (0 128 (0 824 (0 802 382824 (1 050 167 663 030 725029 (1 325 381093 (0 248 327 983136 (0 , , , , , , , , , , , , , , , , , , , , , , , , 8 7 7 4 3 3 60 31 24 14 16 103 ) 21 ) 5 ) 11 ) ) ) 33 ) ) 16 ) ) ) 14 ) ) )) 7 ) ) )) 4 ) ) 32 ) 9 ) 7 ) 3 × × × × × × × × × × × × × × × × × × × × × × × × 01 79 95 81 98 70 04 00 00 95 87 67 95 62 00 82 88 98 30 74 71 00 88 94 ...... (0 (0 (2 (1 (0 (0 (0 (1 (0 (0 (1 (0 4 (0 . 5 69 77 558 319 411 451 229 150 441 117 506 891 (1 , Quartus unroute

6 236 . 1 2 4 9 1 2 1 1 1 1 2 1 Clocks IOs LABs DSPs M9Ks M144Ks Device Grid Tiles 8 1 . 070 3 138 (1 354 1 479 (0 371 989 079 568 220 480 177 402 832 549 090 875 412 21001 1 2786 373 (1 1553 852 (0 1 208 (0 592 1 262 (0 014 537 1 385 (0 478 1746 162 (1 1 119 (0 975 1 310 (0 , , , , , , , , , , , , , , , , , , , , , , , , 94 86 91 Netlist 490 111 930 630 547 300 257 234 202 137 110 760 568 274 212 108 859 087 255 190 , , Primitives 1 1 mc bdti core mesh core miner chip2 qrd noc vision switch spheric dart Network LU230 des90 minres openCV neuron mes denoise stap directrf % TOTAL Benchmark LU gsm Quartus device size exceeded; GEOMEAN 280 cholesky sparcT2 bitonic SLAM sparcT1 gaussianblur segmentation bitcoin sparcT1 cholesky stereo Run-time in Minutes, MemorySTA in Time GiB, is WL included in„ in grid Pack/Place/Route/Misc. tiles, and CPD Total in Times ns Part III

An Open-Source CAD Flow for Commercial FPGA Devices

Chapter 12

Symbiflow & VPR: An Open-Source Design Flow for Commercial and Novel FPGAs

As the benefits of Moore’s Law diminish, computing performance and efficiency gains are increasingly achieved through specializing hardware to a specific domain of computation. However this limits the hardware’s generality and flexibility. Field Programmable Gate Arrays (FPGAs), microchips which can be re-programmed to implement arbitrary digital circuits, enable the benefits of specialization while remaining flexible. A challenge to using FPGAs is the complex computer aided design flow required to efficiently map a computation onto an FPGA. Traditionally these design flows are closed-source and highly specialized to a particular vendor’s devices. We propose an alternate data-driven approach which uses highly adaptable and re-targetable open-source tools to target both commercial and research FPGA architectures. While challenges remain, we believe this approach makes the development of novel and commercial FPGA architectures faster and more accessible. Furthermore, it provides a path forward for industry, academia, and the open-source community to collaborate and combine their resources to advance FPGA technology.

12.1 Introduction

Moore’s Law has tracked our ability to perform increasingly efficient and complex computation over the past 55 years, enabling general purpose devices like CPUs (and more recently GPUs) to continually improve performance and power efficiency. However as the traditional benefits of manufacturing process technology scaling diminish, so do the performance and efficiency gains of these devices. At the same time, demand for computation from domains like machine learning and wireless signal processing continues to rapidly increase. This has led computer architects to investigate domain-specific computing architectures, which focus on efficiently implementing a specific domain of related computational applications. While such approaches offer significant benefits, these are largely derived from their specialization and lack of flexibility. Furthermore, developing such architectures is expensive (designing and fabbing a custom chip

may cost hundreds of millions of dollars), and is risky (e.g. what if some unsupported operation becomes required?). As a result relatively few application domains are stable enough and garner sufficient financial support for this approach.

Field-Programmable Gate Arrays (FPGAs) present an alternative approach which combines flexibility and general applicability, with the performance and efficiency benefits of Domain Specific Architectures (DSAs). Instead of designing a DSA which must then be manufactured as an Application Specific Integrated Circuit (ASIC) (a process which usually takes years to design, manufacture and verify), a DSA can be implemented and re-programmed onto an FPGA in a few hours or days. As discussed in Chapter 2, and shown in Figure 12.1, an FPGA consists of programmable logic blocks (which can implement arbitrary boolean logic functions), data storage (Flip-Flops and BRAMs) and a programmable routing fabric to interconnect them. This enables an FPGA to be quickly re-programmed to implement an arbitrary digital circuit. The architecture of FPGAs themselves continues to evolve in many ways, such as the integration of increasingly heterogeneous blocks such as digital signal processing blocks, diverse I/O controllers, embedded Networks-on-Chip [22, 192] and more (see Section 2.1.1).

The evolution of FPGAs and the creation of FPGAs with unique features to benefit new computational domains is hindered by the very complex Computer-Aided Design (CAD) system needed to map a high-level computation description to the millions of configuration bits which control the FPGA. The creation of such tooling is a daunting task: FPGA vendors such as Xilinx and Intel employ more software engineers than hardware engineers in their FPGA divisions to develop their (closed-source) CAD systems [61]. Furthermore, without a prototype CAD system for each FPGA architecture of interest, an FPGA architect cannot quantitatively evaluate new architectural ideas.

In this chapter we discuss the SymbiFlow project, which combines VPR and other open-source CAD tools to build a fully open-source design flow for FPGAs that can be used not only to program commercial FPGAs, but also to evaluate new FPGA architectures.1 This dual goal creates a challenge. On the one hand, creating a programming file for a specific FPGA requires that every device detail is perfectly supported by the CAD system. On the other hand, for a CAD system to be able to target a wide range of FPGA architectures it must be very flexible and avoid chip-specific coding. We show that a data-driven approach is possible: SymbiFlow (using VPR) can fully map designs to commercial Xilinx Artix 7 devices with open-source synthesis, placement, routing, and bitstream generation tools that can still be retargeted to other (existing or novel) FPGAs. While the VTR flow developed in Chapters 5 to 11 allows exploration of FPGA architectures without significant extra coding, before this work it could not capture all the details required to program a production FPGA architecture. In this chapter, we quantify the current feature support and result quality of this flow when targeting commercial devices. While much remains to be done, we believe SymbiFlow has the potential both to allow a much larger community to develop CAD flows for current FPGAs and to make the development of new FPGA architectures faster and more accessible.

1This work was done in collaboration with Google (Tim Ansell and Keith Rothman), and Antmicro (Alessandro Commodi and others) [1]. The contribution towards this thesis was to generalize a number of parts of VPR to support specifying and modelling the details required to target commercial FPGAs with VPR, as well as providing high-level guidance for how the design flow should be architected and evolve in a data-driven way. Google and Antmicro were responsible for creating the Artix 7 device model for use in VPR, the open-source bitstream generator, and verifying the correctness of the resulting flow. They also proposed fixes and enhancements to VPR to address issues targeting Artix 7. Experimental results for Symbiflow & VPR, and Vivado were collected by Antmicro, with additional assistance from the author and Mohamed A. Elgammal.

Figure 12.1: Left, an FPGA consisting of Configurable Logic Blocks (CLBs) for computation and Block RAMs (BRAMs) for storage, along with programmable routing to interconnect them. Right, each CLB contains Look-Up Tables (LUTs) and Flip-Flops (FFs). Routing Muxes are configured to interconnect elements within each block, or inter-block routing wires.

12.2 Related Work

FPGA CAD flows serve two main purposes: implementing designs in a completely-specified FPGA architecture, and quantitatively evaluating new hardware architectures before their creation. Each FPGA vendor has a CAD flow for the first purpose – generating a programming file for their devices – and several also have CAD flows to evaluate new architecture ideas for future microchips, such as Intel/Altera’s FPGA Modeling Toolkit (FMT) [68]. These tools are closed source however, so they cannot be used by academic or other researchers to test new CAD algorithms, to investigate new architectures, or to implement designs on novel FPGAs.

Some interfaces to commercial tools have been documented, allowing some interaction between external open source tools and these commercial CAD flows. Torc [128] and RapidSmith II [130] use Xilinx Design Language (XDL) to pass intermediate CAD result information to Xilinx’s (now legacy) ISE CAD flow. RapidWright [131] uses similar techniques to read and write partial CAD results to and from Xilinx’s Vivado CAD tool, enabling customization of some portions of the design flow. While these interfaces are helpful to the open source community, they do not address all usage scenarios. First, device programming information is not exposed. Second, the code bases to which they interface are closed, and the devices targeted are limited to existing commercial devices. This places limits on how much the CAD flow can be changed and precludes investigation of, or support for, novel FPGA architectures.

To overcome these limitations, several standalone open-source frameworks have been created. The VTR framework combines Odin II for Verilog synthesis [183], ABC for technology mapping [184] and VPR for packing, placement and routing [2]. VPR takes a human-readable description of an FPGA architecture as input and has been used for a wide variety of FPGA architecture research in academia and industry (Altera’s FMT was originally derived from VPR). VPR/VTR also serve as a framework for investigating many new CAD algorithms, allowing researchers to implement or modify only their novel part of the CAD flow.

Figure 12.2: Design flow.

While VTR has been extensively used for research, it has historically seen less use for programming actual devices. VTR-to-Bitstream [166] proposed a toolchain based on VTR that can program a Virtex 6 commercial FPGA, using Xilinx’s XDL-based tools only for the final programming. Several start-ups, such as Efinix [178], have made private copies of VPR and developed implementation tool flows for their devices on top of it – but their tools are closed source. Similarly, several academic research teams created novel spatial architectures which use VTR as their CAD flow (e.g. [18]), but they all required significant custom coding to create a full device model and programming flow. This has led to significant duplicate work across these projects. Hence a major recent focus of the VTR project has been to enhance the tool flow in a data-driven way to support full implementation flows for both existing and future novel FPGAs with little (or ideally no) custom coding.

The OpenFPGA project further builds on the VTR infrastructure to allow automatic layout of a full FPGA, enabling a complete idea-evaluation-layout-programming flow [228]. Nextpnr [182] is another open source FPGA placement and routing tool that has been created to enable a complete open-source FPGA implementation flow for the Lattice iCE40 and ECP5 devices. Unlike VTR, its main purpose is to target existing commercial FPGA architectures with an open-source flow; as such it is more amenable to custom coding.

In this work we detail Symbiflow: enhancements to VPR and a data-driven bitstream generator that allow a complete open-source implementation flow for a Xilinx Artix 7 commercial device, without significant custom coding. We believe this new flow fills an important gap by enabling rapid creation of full implementation CAD flows for both existing and new FPGA devices.

12.3 Design Flow

The open source design flow we focus on is shown in Figure 12.2. The end-user provides a behavioural specification of the computations they wish to perform in a Hardware Description Language (HDL). This description is then transformed by several tools to map the computation to the target architecture, finally producing a bitstream used to configure the FPGA. This mapping occurs in several stages.

Figure 12.3: FPGA Architecture Representation. Levels of description, from abstract to detailed: Architecture (LUT size K, cluster size N, channel width W, switch block Fs, connection block Fc, wire types); CAD Device Model (netlist primitives, block types, device grid, RR graph, bitstream); Circuit (detailed description of routing wires, muxes, buffers, SRAM); Fabrication (detailed process technology & mask layout: transistors, vias, metal layers).

First, Yosys converts the behavioural HDL into a circuit consisting of soft logic (boolean equations) and hard architectural primitives (like adders, multipliers, and RAMs). For hard primitives Yosys’ technology mapping library lowers generic HDL operations (such as addition) to architecture specific primitives (e.g. full adders).2 The soft logic is then optimized by ABC and technology mapped to the primitive computational elements (e.g. LUTs). VPR then takes the technology mapped netlist along with a detailed model of the FPGA device and determines where to place each circuit element (placement) and how to interconnect them (routing) while minimizing the wiring required and maximizing circuit speed. The resulting circuit implementation is then converted to a low level FPGA Assembly (FASM) description which can be directly translated into the binary bitstream which programs an FPGA.
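As a concrete illustration, the script below sketches how such a flow might be driven end-to-end. It is a minimal sketch rather than the actual SymbiFlow scripts: the file names (top.v, arch.xml, rr_graph.xml), the specific Yosys synthesis recipe, and the FASM/bitstream utility invocations (genfasm, fasm2bitstream) are assumptions for illustration only.

import subprocess

def run(cmd):
    # Run one flow stage, failing loudly if the tool reports an error.
    print(">>", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1) Synthesis: Yosys elaborates the HDL and maps it to architecture primitives
#    (ABC is invoked internally to technology-map the soft logic to LUTs).
run(["yosys", "-p", "read_verilog top.v; synth_xilinx -top top; write_blif top.blif"])

# 2) Pack/Place/Route: VPR reads the device model (architecture description and,
#    optionally, a pre-built routing-resource graph) and implements the netlist.
run(["vpr", "arch.xml", "top.blif", "--read_rr_graph", "rr_graph.xml"])

# 3) Bitstream generation: convert the implementation to FASM features, then
#    assemble the binary bitstream (both tool names here are placeholders).
run(["genfasm", "arch.xml", "top.blif"])
run(["fasm2bitstream", "top.fasm", "top.bit"])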

12.3.1 Specifying Architectures

FPGA architectures can be described at various levels of detail, as shown in Figure 12.3. While the lower level descriptions are more complete they are also unstructured, making it impractical for CAD tools to target them directly. Instead CAD tools operate on a ‘Device Model’ which captures the key information about an FPGA architecture/device in a structured manner. This allows tools to effectively optimize and implement designs based on the targeted device model. In our data-driven approach, the Device Model serves as an Intermediate Representation (IR) which can model a range of hypothetical and commercial FPGA devices. FPGA architects developing new architectures often take a top-down approach (i.e. the upper half of Figure 12.4), using high-level descriptions to quickly describe and evaluate a wide range of architectures and architectural parameters [229]. These descriptions leave many aspects under-specified, leaving it up to the tools to automatically fill in the details (i.e. the ‘Elaborate’ step in Figure 12.4). This approach is highly productive for architecture exploration and research, and if an FPGA device matches the data model well, it can produce a reasonably accurate Device Model. For instance in [10] (Section 7.2) we used

2Further lowering is later performed to ensure primitives match the VPR device model.

Figure 12.4: Architecture and Implementation Flows. Top: top-down architecture evaluation, where candidate architectures are elaborated into a Device Model and evaluated on benchmarks for area and speed. Bottom: bottom-up design implementation, where a Device Model is extracted from existing device data and used to implement designs and produce bitstreams.

VPR to model commercial Intel Stratix IV FPGAs. However this method can make it challenging to precisely model the details of a specific device, which may not fit cleanly into the available architectural parameters. Implementing designs targeting commercial FPGAs and producing valid bitstreams (i.e. the bottom half of Figure 12.4) requires modelling the target device with bit-level accuracy. This requires a bottom-up approach which constructs the Device Model from the low-level device details (i.e. the ‘Extract’ step in Figure 12.4). With such a Device Model the CAD flow in Figure 12.2 can then be used to map applications to that device. Recently, these low-level device descriptions have become available for some FPGAs, including Lattice’s iCE40 and ECP5 families, and Xilinx’s Artix 7. These low-level descriptions provide detailed information about the FPGA structure (logic and how routing resources interconnect), and how their use is described in the bitstream. However, as noted previously, merely having access to the low-level details is insufficient to successfully target these devices. The CAD flow requires higher-level context and structure to correctly implement and effectively optimize applications for the device. The process of converting these low-level device descriptions into the higher-level Device Model can be challenging, as the low-level details are relatively unstructured. This process is typically performed through a combination of automated processing and human-driven modelling (‘Extract’ in Figure 12.4). While much of this work can be re-used within a particular device family (or potentially across related families from a particular vendor), each vendor makes a variety of different architectural, implementation and terminology choices. The broadness and significance of these differences have often been at the root of concerns about whether flexible architecture agnostic tools like VPR can target commercial FPGAs.

12.3.2 Enhancements to Target Commercial FPGAs

A variety of new tools and enhancements are required in order to enable a fully open-source FPGA tool flow which can be re-used and re-targeted to multiple devices.

Figure 12.5: Architecture Model Generation.

VPR Enhancements

In order to support commercial devices we have extended VPR’s device model and flexibility to capture the details required to produce valid bitstreams. The changes include:

ˆ Support for loading the detailed routing architecture from a file (rather than building it from a high-level specification) as described in Chapter 6 [2],

ˆ Generic device grids (to model the precise layouts of commercial devices) as described in Chapter 7 [2],

ˆ Non-configurable routing graph edges (to model multi-segment and non-linear wiring) as described in Chapter 7, and support for routing them as described in Chapter 9 [2],

ˆ Support for routing clocks on dedicated clock networks, and

ˆ Extensions to timing analysis to capture clock network delays and fan-out related loading effects.

Device Model Extraction

As described above, low-level device databases exist for some commercial FPGAs but are very low-level and unstructured. This makes them an unsuitable description for place and route tools which aim to optimize higher-level characteristics like resource usage, wirelength and timing. It is therefore necessary to extract additional higher-level structural information which tools like VPR can target. Figure 12.5 illustrates at a high level the major steps in Symbiflow required to convert the low-level device database for Xilinx’s Artix 7 family into an appropriate Device Model for use in VPR (for more details see [1]). The resulting Device Model captures all the details correctly and includes the higher-level structural information required for effective place and route optimizations. This process involves extracting higher-level structural information from the low-level database, including the available block types and operating modes, device floorplan/grid, and inter-block routing network (i.e. RR Graph). Some architectural details also need to be transformed to fit into the VPR data model. For instance, the Artix 7 family includes ‘U’-shaped wires which are converted into three linear wires connected by non-configurable edges (Sections 7.1.2 and 9.7).

These steps are sufficient to generate a functionally correct Device Model, which captures the low-level details of the device and can be used with VPR to implement designs and generate valid bitstreams for real Artix 7 devices. However, additional information is required to enable VPR to optimize circuit implementations well. To enable timing-driven optimization it is necessary to provide a timing model for the various architecture components (routing wires, LUTs, FFs etc.). This is done by creating a timing model (‘Build Timing Model’ in Figure 12.5) from a database of timing information extracted from Vivado. The timing model is used by Tatum [6] (Chapter 3), VPR’s Static Timing Analysis (STA) engine, for timing analysis and to drive timing-based optimizations.

Additionally, at each stage it is necessary to extract higher-level information to enable VPR to effectively optimize the use of various routing and logic resources. For example, grouping the numerous wires in the RR graph into a smaller number of ‘types’ which share similar electrical and connectivity characteristics (e.g. rare but fast long wires, common but slower and more flexible short wires), enables VPR to trade off which signals use the different wire types. Another example is the importance of aligning the block and routing network coordinate systems, which ensures both the placer and router have aligned understandings of what components of the device are physically close together.
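As a small illustration of this kind of structure recovery, the sketch below groups raw routing wires into a handful of ‘types’ based on shared length and similar delay. The Wire fields, example names and the bucketing rule are invented for illustration; the real importer derives this information from the Artix 7 database.

from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Wire:
    name: str        # raw wire identifier from the low-level device database
    length: int      # number of tiles the wire spans
    delay_ps: float  # extracted intrinsic wire delay

def group_wires_by_type(wires, delay_bucket_ps=50.0):
    # Bucket wires by (length, coarse delay) so the CAD tools can reason about
    # a few wire types (e.g. rare fast long wires vs. plentiful short wires)
    # instead of thousands of individual wires.
    types = defaultdict(list)
    for w in wires:
        key = (w.length, round(w.delay_ps / delay_bucket_ps))
        types[key].append(w)
    return dict(types)

# Example with made-up wires:
wires = [Wire("EE4_0", 4, 210.0), Wire("EE4_1", 4, 205.0), Wire("SS2_0", 2, 120.0)]
for (length, bucket), members in group_wires_by_type(wires).items():
    print("type: length", length, "delay bucket", bucket, "->", len(members), "wires")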

Generic Bitstream Assembler

To create a generic bitstream assembler, a generic FPGA assembly (FASM) format [230] is used. FASM is a textual format which lists ‘features’ to enable within the FPGA fabric. A feature might be the LUT truth-table, or the input wire selected by a routing mux. Generally a FASM feature will enable 1 or more bits, and may also require that 1 or more bits remain cleared in the output bitstream. FPGA bitstream contents can be roughly broken down into three categories: primitive block configurations (e.g. LUTs, FFs), intra-block routing muxes and inter-block routing muxes. During Device Model extraction, information about the FASM features associated with each of these categories is tagged as metadata on the relevant components. The corresponding FASM features are then set based on component usage in the final design implementation as determined by VPR.
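To give a feel for this step, the sketch below shows a few illustrative FASM-style feature lines and a toy lookup that expands each enabled feature into the (frame, bit) positions it sets. All feature names and bit addresses are invented; the real names and addresses come from the extracted Artix 7 device database, and value-carrying features (such as LUT INITs) additionally need their values expanded into individual bits.

# Illustrative FASM-style feature lines (names are hypothetical):
fasm_lines = [
    "CLB_X5Y10.SLICE0.ALUT.INIT[15:0]=16'hBEEF",  # a LUT truth-table (value-carrying)
    "CLB_X5Y10.SLICE0.AFF.ZINI",                  # a flip-flop configuration bit
    "INT_X5Y10.IMUX12.EE2_1",                     # a routing mux input selection
]

# Toy feature -> (frame, bit) database, as would be produced during extraction.
feature_bits = {
    "CLB_X5Y10.SLICE0.AFF.ZINI": [(0x0040, 17)],
    "INT_X5Y10.IMUX12.EE2_1": [(0x0041, 3), (0x0041, 4)],
}

def assemble(lines, db):
    # Return the set of (frame, bit) positions to set; unknown or value-carrying
    # features are skipped here for brevity.
    set_bits = set()
    for line in lines:
        feature = line.split("=")[0]
        set_bits.update(db.get(feature, []))
    return set_bits

print(sorted(assemble(fasm_lines, feature_bits)))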

12.3.3 Verification and Correctness

Multiple levels of verification have been performed to ensure the resulting bitstreams are functionally correct on various example and benchmark designs. This includes programming the bitstreams onto real devices and testing functionality (e.g. manually, or using Built In Self Test), re-importing the produced implementation into Vivado, and formally verifying the equivalence of the post-synthesis and post-place-and-route netlists.

12.4 Flow Quality

To quantitatively evaluate the quality and run-time of our open-source design flow we compare it in two scenarios: one where VPR understands the higher level structure of the architecture, and another which captures the full details of a commercial Xilinx device. In the first scenario we use a mixed open and closed source flow, where we use Intel’s Quartus (closed-source) for logic synthesis and technology mapping, and either Quartus or VPR (our open-source place and route tool) to perform the physical implementation.3 In this scenario VPR targets a somewhat

3This allows us to evaluate only the impact of placement and routing (since the same synthesis result is used).

Table 12.1: VPR 8 Quality and Run-time targeting Stratix IV Model on Titan Benchmarks

LABs | Routed WL | Routed CPD | Pack Time | Place Time | Route Time | Total Time
0.95 | 1.26      | 1.20       | 1.18      | 1.00       | 0.34       | 0.83

Geomean of 20 mutually implementable Titan benchmarks [10], normalized to Intel Quartus 18.0 [2] (Reproduced from Table 11.3).

Table 12.2: Symbiflow & VPR Quality and Run-time on Artix 7 (XC7A50TFGG484)

Benchmark | Netlist Primitives | CLBs | RAMB18s | Crit. Path Delay (ns) | Synth. Time (sec) | Pack & Place Time (sec) | Route Time (sec) | Assembly Time (sec) | Total Time (sec)
picosoc | 8,108 | 895 (1.14×) | 3 (1.50×) | 18.47 (1.93×) | 29.2 (0.56×) | 62.0 (4.43×) | 11.1 (0.56×) | 9.5 (1.19×) | 115.9 (0.83×)
murax | 4,558 | 392 (1.23×) | 4 (0.67×) | 12.22 (1.45×) | 18.4 (0.63×) | 23.0 (1.92×) | 3.9 (0.28×) | 9.0 (1.00×) | 58.3 (0.58×)
top_bram_n3 | 3,595 | 361 (4.25×) | 1 (1.00×) | 14.56 (3.17×) | 14.8 (0.82×) | 9.9 (1.41×) | 4.2 (0.38×) | 6.5 (0.81×) | 37.8 (0.49×)
top_bram_n2 | 2,619 | 264 (3.72×) | 1 (1.00×) | 14.07 (2.84×) | 13.6 (0.68×) | 7.2 (1.02×) | 3.0 (0.25×) | 6.2 (0.77×) | 32.4 (0.40×)
top_bram_n1 | 1,661 | 162 (2.49×) | 1 (1.00×) | 13.36 (2.57×) | 11.0 (0.65×) | 7.9 (1.12×) | 1.2 (0.10×) | 5.2 (0.65×) | 27.6 (0.36×)
dram_test_64x1d | 1,035 | 115 (1.47×) | 0 | 10.95 (2.12×) | 7.8 (0.39×) | 6.6 (0.94×) | 0.6 (0.06×) | 4.3 (0.53×) | 21.1 (0.26×)
GEOMEAN | 2,902.7 | 292.2 (2.08×) | 1.6 (1.00×) | 13.75 (2.27×) | 14.5 (0.61×) | 13.2 (1.53×) | 2.7 (0.21×) | 6.5 (0.80×) | 41.1 (0.45×)
% TOTAL |  |  |  |  | 32.7% | 38.8% | 8.0% | 14.1% | 100.0%

Crit. Path Delay analyzed by Xilinx Vivado 2017.2. Ratios normalized to Xilinx Vivado 2017.2 in parentheses.

abstracted model [2] (Section 7.2) of Intel’s Stratix IV FPGA family, since the low level details of Stratix IV are unavailable. Both tools are used to implement the Titan benchmark suite [10], which consists of modern large scale benchmarks (90K-1.9M netlist primitives) which use all the types of heterogeneous resources available in the Stratix IV family. Table 12.1 shows that compared to Quartus, VPR uses similar amounts of logic (LABs) and requires a comparable amount of run-time to implement all designs. However VPR’s implementations use approximately 26% more wiring and operate 20% slower than those produced by Quartus. These results show that when our tools comprehend the overall device (including higher-level structure) their implementation and optimization algorithms achieve good run-time and reasonable result quality compared to a highly tuned architecture specific tool like Quartus.

In the second scenario we use the fully open-source Symbiflow (with VPR) design flow shown in Figure 12.2 to target a Xilinx Artix 7 device for which the full low-level device details are available, and compare the results with Xilinx’s closed-source Vivado 2017.2 tool suite.4

The results are shown in Table 12.2. Here we see that Symbiflow uses roughly double the amount of logic (CLBs) to implement these designs, and produces circuits which have 2.3× longer critical paths. A significant portion of this gap is due to logic synthesis and technology mapping, with Vivado producing smaller and better optimized netlists.5 From a tool run-time perspective, Symbiflow’s average run-time is lower than Vivado’s. Symbiflow spends less time on synthesis, more time on placement and completes routing faster than Vivado.

It is important to note that the results in Table 12.2 are achieved while targeting the full details of Artix 7, and as a result produce valid bitstreams which can be programmed onto an Artix 7 device without any closed-source tooling. We believe that many of the run-time and quality issues can be alleviated through either improving the modelling of Artix 7’s high-level characteristics, or improving the adaptability of VPR’s algorithms.

4Note the reported critical path delays are calculated by Vivado’s timing analyzer for both design flows. 5When Vivado uses the same netlist as VPR, VPR uses 24% fewer CLBs than Vivado (indicating Vivado packs the design less densely than VPR) and the critical path delay gap reduces to 1.6×. This indicates that the remaining gap may be derived from optimization differences and the quality of the device model.

12.5 Conclusion & Future Work

FPGAs offer a compelling approach to achieving the performance, power and cost benefits of domain specific architectures while retaining the flexibility to adapt to changing computational requirements and supporting a wide range of application domains. However the complex CAD flow required to target FPGAs has hindered progress in this direction. We believe open-source tools for FPGAs are an important step towards addressing this challenge, allowing the broader academic, commercial and open-source communities to pool their efforts in order to advance this technology.

We’ve shown that with a data-driven approach, it is possible to create architecture agnostic tools, like VPR, which can target the full details of commercial FPGAs and produce valid bitstreams. These same tools enable FPGA architects to experiment with and evaluate new FPGA architectures, while also providing a ready-made functional CAD flow for targeting such devices.

However there is plenty of work still to be done. When architectures match VPR’s data model well, it is possible to achieve reasonable implementation quality and run-time (Table 12.1). However for architectures which do not match well the results are more mixed. It will be important to continue improving the quality of architecture captures given to VPR, and continue extending VPR’s device model to allow better descriptions of such architectures.

Working from low-level device descriptions is challenging as many of the important structural characteristics of an FPGA architecture are implicit. This requires complex processing and manual efforts to extract this higher-level structure from the low-level data, but is key to achieving high quality optimization and reasonable run-times. If vendors provided this higher-level device information along with their chips this effort would be eased significantly.

Finally, there remains significant room for improving the quality and run-time of VPR’s optimization algorithms. There are many identified areas for improvement [2] which we hope the community will collaborate to address.

Part IV

Reinforcement Learning Enhanced CAD

Chapter 13

Adaptive FPGA Placement Optimization via Reinforcement Learning

Developing new or improved optimization heuristics for Computer Aided Design (CAD) tools is challenging and time consuming, relying on empirical experimentation and researcher experience. In this chapter we study how this process can be improved by using Reinforcement Learning (RL) to learn effective and adaptive heuristics. Applying these techniques to Field Programmable Gate Array (FPGA) placement, we show our RL-enhanced algorithm outperforms the standard VTR 8 placer, achieving a better run-time & quality trade-off (up to 2× faster for equivalent quality). RL enables the placer to more efficiently explore the solution space and to dynamically adapt to the specific problem instance being solved. We expect further application of advanced RL methods will improve these results, which motivates further exploration of how RL can be applied in the CAD flow.

13.1 Introduction

Computer Aided Design (CAD) often requires solving complex large scale optimization problems. The large solution spaces necessitate the use of heuristics to find good quality solutions in reasonable run-times. However, the process of designing these heuristic optimization algorithms is itself challenging and time consuming. The process typically involves two stages:

ˆ first, a CAD researcher develops a new algorithm or algorithm enhancement;

ˆ second, they tune their algorithms’ parameters to produce suitable quality and run-time trade-offs.

The tuning process is key to producing good results but is often opaque, relying on the researcher’s intuition/experience (e.g. which parameters to expose and how to set them) combined with empirical experimentation. Machine Learning techniques offer the potential to automate and improve this process. In this work we investigate the use of Reinforcement Learning (RL) to improve a Simulated Annealing (SA) based Field Programmable Gate Array (FPGA) placement algorithm. Unlike previous work looking at CAD parameter auto-tuning (which typically wraps around an existing algorithm) [231, 232], we


focus on using Reinforcement Learning to develop policies (which also adapt online) for use within the optimization algorithm.

Figure 13.1: Reinforcement Learning Problem.

13.2 Reinforcement Learning

Reinforcement Learning is an approach to machine learning which trains an Agent, through experience, to achieve an objective by interacting with its environment. The agent receives feedback about its performance via a numeric reward, with the RL agent’s goal being to maximize its total reward. RL is a very general problem formulation, allowing RL techniques to be applied to a wide range of problems, including robotics [233], finding good neural network architectures [234], and website recommendations [235]. RL techniques have also been shown to be effective for hard search and decision problems such as Chess, Go [236], and real time strategy games [237].

Figure 13.1 outlines the RL problem in more detail. At timestep t, the agent selects an action at to perform, from the set of possible actions A. At the next timestep the environment is observed, producing observation ot+1. This observation is then interpreted to produce a reward rt+1 ∈ R and the agent’s perception of the environment’s state st+1 (from the set of possible states S). On the basis of the states and rewards previously observed the agent selects a new action at+1 to perform. For more background on RL see [238].

Many RL techniques rely upon estimating action values: the long-term value of taking a particular action (i.e. considering both immediate and future rewards). This estimate is often defined as a function Q(s, a): S × A → R. How Q is estimated from the agent’s experience (action value estimation), along with how Q is used to select the next action (action selection) are important considerations for many RL algorithms. An important special case of the RL problem is the class of so-called K-armed bandit problems. In such problems there is only a single state (i.e. |S| = 1), and Q becomes a function only of the actions: Q(a): A → R.

13.3 Simulated Annealing Placement for FPGAs

As described in Chapters 2 and 11, FPGA placement (which determines where to place design logic while minimizing the amount of wiring required and critical path delays) is often the most time-consuming step in the automated FPGA CAD flow. As shown in Figure 13.2 design logic can only be placed at


Figure 13.2: FPGA placement grid

locations where pre-fabricated resources exist; as a result, FPGA placement is a discrete optimization problem with many legality constraints. It is also key to generating a high quality implementation, as placement is the primary determinant of both routability and timing.1 These considerations have made Simulated Annealing (SA) a popular algorithm for FPGA placement [46, 53].

As shown in Algorithm 17, SA starts with an initial solution (Line 2). The annealer then repeatedly generates moves (Line 6) which perturb the current solution. The associated change in cost (Line 7) is used to decide whether the move should be accepted or rejected. SA accepts all moves which decrease cost, and probabilistically accepts moves which increase cost as a function of the current temperature (Line 8). Accepting cost increases allows the annealer to hill climb and escape local minima. After making M moves (Line 5) the temperature is decreased (Line 10). This focuses the annealer on cost improvements (less likely to accept uphill moves), guiding it to converge to a high quality solution. The process repeats until the exit conditions are met (Line 11).

Algorithm 17 Simulated Annealing Placement

Require: Pinit initial placement, Tinit initial temp., M moves per temp.
Returns: An optimized placement
 1: function sa_place(Pinit, Tinit, M)
 2:     P ← Pinit
 3:     T ← Tinit
 4:     repeat
 5:         for 1 ... M do
 6:             P′ ← generate_move(P)              ▷ Perturb Solution
 7:             ∆cost ← evaluate_move(P, P′)
 8:             if ProbabilisticAccept(∆cost, T) then
 9:                 P ← P′                          ▷ Accepted Move
10:         T ← update_temp(T)
11:     until exit_condition(P, T)
12:     return P                                    ▷ Optimized Placement
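The acceptance test on Line 8 is typically the standard Metropolis criterion; a minimal sketch (one common form, not necessarily VPR's exact implementation) is:

import math
import random

def probabilistic_accept(delta_cost, temperature, rng=random):
    # Always accept improvements; accept cost increases with probability
    # exp(-delta_cost / T), so hill-climbing becomes rarer as T decreases.
    if delta_cost <= 0.0:
        return True
    if temperature <= 0.0:
        return False
    return rng.random() < math.exp(-delta_cost / temperature)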

1Unlike Application Specific Integrated Circuits (ASICs), there are no later stages (e.g. buffer insertion, gate re-sizing) to fix-up timing issues.

13.3.1 Placement Moves

It is useful to clarify what a move means in the context of FPGA placement. Perhaps the simplest type of move is to swap two blocks of the same type (e.g. two DSP blocks). These are the types of moves used by the standard VTR 8 placer [2], which randomly selects a block from the netlist with uniform probability, and randomly swaps it with another block (or empty location) of the same type. Random swaps are a ‘complete’ type of move as they ensure any valid placement configuration can be reached from any other. This is desirable since it means potentially better configurations are always reachable through some sequence of moves. Despite this, reaching a significantly better configuration may be challenging, requiring a large number of swaps (potentially many of them uphill if already in a deep local minimum).

However, moves are not limited to simple random swaps. By exploiting knowledge about the structure of the problem more powerful and intelligent moves can be applied. For instance, directed moves [239, 62] use circuit structure, current placement, and timing information to intelligently move blocks towards their optimal positions.2 These moves make the annealer more effective at escaping local minima (by making larger steps through the solution space), achieving better quality in fewer moves, which reduces run-time.

An interesting and more general way of viewing these more intelligent moves is as a set of algorithms the annealer selects from to improve the current placement. There is no fundamental restriction on what types of algorithms could be used for move generation. Moves could be very significant, potentially encompassing what would traditionally be considered entirely separate placement algorithms, for instance based on partitioning, assignment or analytic placement techniques [240]. In this sense annealing acts as a framework for combining these algorithms together, and as a true meta-heuristic selecting among their results.

However designing such a system which selects among numerous other algorithms is inherently challenging. It is unclear when or how often the different move types should be used. Which type of move will be effective is likely to be situationally dependent, for instance depending on circuit structure and target technology (which varies across designs), the current optimization progress, and the run-time budget. These factors, combined with the large number of possibly diverse move types, likely make it intractable for a human algorithm designer to tune this type of system well. As a result, to control complexity move generators are typically statically configured with limited coupling to the rest of the optimizer.
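For reference, the baseline uniform-random same-type swap described above might look like the short sketch below; the Block and Placement structures are invented placeholders, not VPR's data structures.

import random
from dataclasses import dataclass

@dataclass
class Block:
    name: str
    type: str

@dataclass
class Placement:
    blocks: list   # all netlist blocks
    sites: dict    # block type -> list of legal (x, y) locations

def generate_random_swap(placement, rng=random):
    # Pick a block uniformly at random, then a random legal location of the
    # same type; the move swaps the block with whatever occupies that location.
    block = rng.choice(placement.blocks)
    target = rng.choice(placement.sites[block.type])
    return block, target

# Toy example:
p = Placement(blocks=[Block("alu0", "Logic"), Block("mult0", "DSP")],
              sites={"Logic": [(1, 1), (1, 2)], "DSP": [(5, 1)]})
print(generate_random_swap(p))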

13.4 Reinforcement Learning Enhanced Move Generator

To address this challenge we will apply RL to learn effective policies for controlling more powerful and flexible move generators, aiming to make the optimizer more adaptable. For instance, enabling the optimizer to adapt to both the static characteristics of the problem instance (e.g. circuit structure), and the evolving dynamics of the optimization process (e.g. which types of moves are most productive, which regions of the circuit would benefit from further optimization). This allows the optimizer to focus effort where it is most needed, and explore the solution space more effectively. An illustration of our RL-enhanced move generator is shown in Figure 13.3. The ‘Agent’ chooses between several different types of moves. In this case selecting a different type of block to be randomly

2For instance, the commercial Quartus placer uses > 10 different types of directed moves [62].

swapped.3 The impact of a move selected by the agent is not known to the agent until after it has been performed. After selecting a move type, the blocks are swapped and the move evaluated using the existing tool infrastructure to calculate the associated change in cost (∆cost). The annealer then determines whether the move is accepted or rejected (Algorithm 17 Line 8) as usual. The ∆cost and whether the move was accepted or rejected (i.e. reverted) are provided to the agent to calculate the reward. Based on this feedback, over time the agent learns which move types yield high reward and focuses the placer’s effort there.

Figure 13.3: RL-based move generator.
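Putting the pieces together, one inner-loop step of the RL-enhanced annealer might look like the sketch below. The agent interface (select_move_type/update), the placement object and the move_generators mapping are placeholders for illustration; the reward computed from (∆cost, accepted) is defined in Section 13.4.1.

import math

def annealing_step(agent, placement, temperature, move_generators, rng):
    # The agent only chooses *which kind* of move to try; it never sees how
    # the move generator works internally.
    move_type = agent.select_move_type()
    proposed = move_generators[move_type](placement, rng)

    # Evaluate and accept/reject exactly as the standard annealer would.
    delta_cost = placement.evaluate(proposed)
    accepted = (delta_cost <= 0.0) or (rng.random() < math.exp(-delta_cost / temperature))
    if accepted:
        placement.apply(proposed)

    # Feed the outcome back so the agent can compute its reward and adapt.
    agent.update(move_type, delta_cost, accepted)
    return accepted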

13.4.1 Reward Formulation

To guide the agent towards the desired behaviour we must formulate a reward which captures this intent. The agent will then attempt to maximize this reward through its actions. One reward formulation we considered was:

rt = −∆cost (13.1)

Since placement is a minimization problem (we seek decreases in wirelength and timing costs), but the agent (by convention) seeks to maximize reward, we use the negative of the cost change. Intuitively, this seems to capture our goal: for moves which decrease cost the agent’s actions are affirmed (with a positive reward), while for moves which increase cost the agent is penalized (negative reward). However, we found Equation (13.1) did not perform well in practice. In particular, the agent preferred move types which had low probability of producing negative rewards. A common example of this is moving IO blocks. Since IO connectivity is not the dominant factor in the placer’s cost functions, and given that many IO configurations have similar cost, this allowed the agent to avoid the (potentially)

large negative rewards associated with moving other (more significant) block types. Furthermore, as optimization progresses, it becomes harder to find moves which decrease cost (since the circuit is becoming increasingly well optimized), and correspondingly easier to find moves which increase cost (although such moves become more likely to be rejected as the anneal progresses). This leads the agent to become very risk-averse, focusing on actions with limited chance of a downside – even though it would ultimately be more productive to propose a number of moves with high probability of negative reward (which are likely to be rejected) in the hope of finding some moves which decrease cost. To better capture this intent we used an alternative reward defined as:

rt = −∆cost if the move was accepted, and rt = 0 if it was rejected (13.2)

3While we use these simple moves as a proof of concept, in general they could be more complex. Notably, the agent does not have any insight into what any of the moves actually do; it only needs to be aware that they exist and can be selected.

This reward formulation does not penalize the agent for proposing moves which are ultimately rejected by the annealer. By reducing the downside the agent becomes more exploratory and focuses on proposing moves which ultimately lead to cost improvements. It is interesting to note this formulation has another advantage: the sum of all rewards is equal to the total cost change of the circuit, which is our ultimate goal. It is also worth considering how this reward function treats hill climbing moves (which increase cost, but are accepted by the annealer). In this case the agent receives a negative reward which, if a common result, will lead the agent away from proposing that move type. It is unclear whether this is a good choice. Hill climbing moves are helpful for escaping local minima, so penalizing them may not be ideal. On the other hand, it does guide the agent away from unproductive move types, allowing it to focus on more productive types.
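Equation (13.2) is straightforward to express in code; a minimal sketch:

def reward(delta_cost, accepted):
    # Equation (13.2): credit the (negated) cost change only for accepted moves;
    # rejected proposals are neither rewarded nor penalized.
    return -delta_cost if accepted else 0.0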

13.4.2 Estimating Action Values

Our agent’s action space is shown in the first row of Table 13.1. Each action corresponds to swapping a random block of a particular type. The FPGA architecture we targeted has four block types, but in general there could be arbitrary types of moves (such as those described in Section 13.3.1). The agent only needs to know the number of potential move types; it does not need to know what they do, or how they work. The agent learns to use them effectively by trying them out and observing the resulting rewards. To determine what action to perform next the agent estimates/learns action values. The key difference between action values and rewards is the time scale. A reward corresponds to the outcome of the previous action, while action values provide numeric estimates of the long term value of performing each of the available actions. The agent estimates action values from the sequence of rewards it receives for performing particular actions. As the second row of Table 13.1 shows, our agent has only a single state, and hence treats the move type selection problem as a K-armed bandit problem [238]. Intuitively, this means the agent doesn’t gain any information about the environment (e.g. current placement, circuit structure) in which it operates,4 and hence it only acts based on the sequence of rewards it has received. This is a significant simplifying assumption, and we believe extending the state space will allow the agent to recognize different situations

4A single state provides no differentiation.

(e.g. optimization progress, previous moves, circuit structure) and respond accordingly.5 However this is left for future work (Section 13.6). One aspect of the move type selection problem which differs from many conventional RL problems is that the problem statistics are non-stationary. That is, the distribution of rewards offered by the different move types evolves over time as optimization progresses. For instance, a particular move type may offer the highest expected reward early in placement optimization, but later becomes sub-optimal (i.e. when the placement is better optimized). It is therefore important for our agent to continually adapt to the changing dynamics of the move type rewards. To estimate the value of taking each potential action, our agent uses a tabular Q function which stores the current estimated reward of taking a particular action at. After performing action at and receiving reward rt+1 the Q value is updated as:

Q(at) = Q(at) + α(rt+1 − Q(at)) (13.3) where rt+1 − Q(at) corresponds to the ‘error’ between the received reward and the current estimate, and α is the size of step taken to reduce the error. In order to track the evolving reward statistics we weight the rewards of recent moves more heavily using a weighted exponential average. We correspondingly set α as:

α = 1 − e^(log(γ)/M) (13.4) where M is the number of moves per temperature, and γ is the fraction of weight given to moves which occurred > M moves ago.6 We empirically found small γ values (0.1 to 0.001) performed best as they focus the agent on tracking the shorter term dynamics of which move types were most effective. We expect an expanded state representation (Section 13.6) will be needed to effectively capture longer term trends. Figure 13.4 illustrates how the agent’s Q estimates change during early placement with γ = 0.1.7 We can see that the Q values for each move type change rapidly as placement proceeds. It is interesting to note the ‘best’ (highest Q value) move type changes in an oscillatory fashion, alternating between the different block types (Logic, DSP and RAM). Previous work has found this approach (cyclically optimizing different block types) to be an effective heuristic [242, 56].8 It is particularly noteworthy that in previous work this heuristic was manually programmed into the placers by their developers. In contrast, we provided no such heuristic guidance to our placer. The agent learned this technique by itself, based on its experience and the feedback provided by the reward function. This points toward the potential of using RL to find new improved heuristics and reduce the development effort required to build high quality CAD tools.
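A minimal sketch of this tabular estimator, implementing Equations (13.3) and (13.4) (the move-type names follow Table 13.1; the example values are illustrative):

import math

class ActionValueEstimator:
    # Tabular Q estimates with an exponential recency-weighted average.

    def __init__(self, move_types, moves_per_temperature, gamma=0.001):
        # Equation (13.4): choose alpha so that rewards older than M moves
        # carry only a fraction gamma of the total weight.
        self.alpha = 1.0 - math.exp(math.log(gamma) / moves_per_temperature)
        self.q = {t: 0.0 for t in move_types}

    def update(self, move_type, reward):
        # Equation (13.3): step the estimate towards the observed reward.
        self.q[move_type] += self.alpha * (reward - self.q[move_type])

# Example: four move types, M = 100,000 moves per temperature.
est = ActionValueEstimator(["IO", "Logic", "DSP", "RAM"], 100_000)
est.update("Logic", 3.2e-6)
print(est.q)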

5For instance, when RL was successfully applied to playing Atari video games, the state representation used was the raw pixels from the last four video frames [241]. 6γ controls the ‘length’ of the agent’s memory. As γ approaches 1 the agent has the longest term memory (Q approaches the average of all previous moves). At γ = 0 the agent has the shortest term memory (Q equals the reward of the previous move). 7This places 90% of weight on the most recent M moves. On the large VTR benchmarks considered, this corresponds to α values ranging from 1.6 × 10^−5 to 1.3 × 10^−3. 8Intuitively, after optimizing the placement of a particular block type (e.g. DSPs), it is likely the placement of other block types (e.g. Logic, RAM) could be improved to account for the first type’s new placement.

Table 13.1: Agent State and Action Spaces

Space      | Values
Action (A) | {IO, Logic, DSP, RAM}
State (S)  | {s0}

Figure 13.4: Agent’s perceived action values (Q) during early placement on the LU8PEEng benchmark.

13.4.3 Action Selection: Exploration versus Exploitation

Once the agent has an estimate of the value of potential actions it must select which action to perform. The agent faces two competing factors:

ˆ exploring the set of potential actions (A) to improve its action value estimates (Q), and

ˆ performing high pay-off actions to maximize reward.

This is typically referred to in the RL literature as the exploration versus exploitation trade-off [238]. Failing to sufficiently explore possible actions means the agent’s value estimates will not reflect the true values – leading to poor choices. However each exploratory action also has a cost, since it is not spent making further progress towards the goal. ε-Greedy action selection is one approach to handle this trade-off. With this technique, the agent mostly selects the highest value (greedy) action – the action it believes will yield maximum reward. However for some fraction of actions ε the agent instead performs a random exploratory action. In stationary RL problems (where the rewards of actions do not evolve) it is common to decrease ε over time, to focus on exploitation, once action value estimates have converged. In our case, the reward statistics are non-stationary so we use a fixed ε. Figure 13.5 shows how varying the fixed ε changes the obtained Quality of Result (QoR) and run-time. First looking at QoR, we observe varying ε has only a small impact on Wirelength (WL) and Critical Path Delay (CPD) for most of its range (0.9 to 0.005).9
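ε-greedy selection on top of these estimates takes only a few lines; a minimal sketch (working on any dictionary of Q values, such as the estimator sketched earlier):

import random

def epsilon_greedy(q_values, epsilon, rng=random):
    # With probability epsilon explore (uniform random action); otherwise
    # exploit by taking the action with the highest estimated value.
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=q_values.get)

# Example:
print(epsilon_greedy({"IO": 0.1, "Logic": 2.5, "DSP": 0.7, "RAM": 0.3}, epsilon=0.005))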

9Interestingly, CPD degrades more slowly than WL which is desirable. FPGA designers are typically more concerned with timing than wirelength – so long as their design remains routable.

Figure 13.5: Run-time trade-off for varying degrees of exploration (ε) with γ = 0.001, normalized to the default VTR 8 placer. Results are the average over the same benchmark circuits following the methodology of Section 13.5.

Despite this some exploration is key for good QoR, as decreasing ε (below 0.0005 and particularly to 0.0) significantly degrades QoR. For very small ε the agent performs insufficient exploration (resulting in misleading action value estimates) which leads the agent to myopically focus on sub-optimal move types. It was surprising such small ε values (e.g. 0.005) did not cause more significant QoR degradation, as they make the agent more exploitative/greedy and hence likely to get stuck in local minima. One reason for this is the agent’s relatively short term ‘memory’ (Section 13.4.2), which means it tends to forget about historical action performance (whether poor or successful). This ensures the agent continues to adapt, which also makes the agent inherently explorative. Additionally, the move generation process still has a random component, as the locations blocks are moved to are chosen randomly.

Now considering run-time, we also see in Figure 13.5 that decreasing ε achieves a significant 2.0× to 2.8× run-time reduction for ε from 0.005 to 0.0001. Figure 13.5 also shows the total number of moves the annealer performs (Num. Moves), which tends to decrease with ε. Making the agent more exploitative (decreasing ε) focuses it on high pay-off move types, which quickly improves the placement in fewer moves. The annealer then exits when it detects no further quality improvement seems likely [46], reducing run-time.

13.5 Results

We evaluate our RL-enhanced placer (Section 13.4) on the standard VTR benchmark circuits targeting the k6_frac_N10_frac_chain_mem32K_40nm FPGA architecture [2]. We exclude small benchmark circuits with < 10K primitives, and average results over three random number generator seeds to reduce algorithmic noise. We compare to the standard VTR 8 placer (Chapter 8) [2], with two variants of our algorithm using different agents:

ˆ Random: selects a block type to move at random (ignoring any learned action values)

ˆ RL: selects the block type to move based on action values (exploiting learned action values) using an ε-greedy approach.

Run-time and QoR are both important metrics when evaluating and comparing placement algorithms. Most placement algorithms have some tuneable parameters which allow a trade-off between run-time and quality. This is useful since different run-time/quality trade-offs may be desirable at different stages of the design. For instance, early in the design process design engineers may value faster turn-around time at the cost of some quality, but at later stages (e.g. during final timing closure) may be willing to spend additional run-time for improved quality. Therefore the entire run-time/quality trade-off curve of a placement algorithm is of interest.

Figure 13.6 shows the run-time versus quality trade-offs for WL and CPD achieved by the different algorithms. For the VTR 8 placer, this was achieved by varying the number of moves per temperature (M in Algorithm 17), which is the best run-time/quality trade-off parameter as determined by the original developers. As M is decreased the placer performs fewer moves which decreases run-time, at the cost of some QoR. While the quality degradation is initially moderate it rises quickly for run-time below 0.4, as the solution space is not being explored sufficiently to find a good solution.

Our placer using the RL-enhanced move generator (RL Agent) outperforms the VTR 8 placer, achieving a better trade-off – particularly at low run-times. This was achieved by varying ε and γ at the same values of M used by the VTR 8 placer. The RL Agent curve is shifted left (better run-time) and below (better quality) the VTR 8 curve, running up to 2× faster at equivalent quality. The improvements are more pronounced at low run-times where a small number of moves are being made. Here the RL Agent significantly outperforms VTR 8, achieving results which are otherwise unachievable by the VTR 8 placer in the same run-time budget. The RL Agent is able to focus the placer’s efforts on high reward moves, which is particularly valuable when only limited exploration is possible. At higher run-times, the improvement is less pronounced, likely because enough moves are performed for the standard placer to sample these high reward moves.

To verify our agent is responsible for these improvements Figure 13.6 also includes results for a ‘Random Agent’ which uses the same move generator, but randomly selects the type of move to perform (i.e. ignores any learned action values). The Random Agent performs uniformly worse than both our RL Agent and VTR 8. This shows these improvements are not simply due to an improved move generation mechanism, but are a result of the RL agent’s ability to learn and adapt as optimization proceeds.

To further illustrate this, we can look at how the RL agent’s move proposal distribution changes as shown in Figure 13.7. First, comparing across the different benchmark circuits we can see the agent chooses to emphasize moves of different types on different circuits. For instance on mcml it focuses primarily on moving Logic blocks (> 80% of all proposed moves), with less than 4% of moves spent on DSP blocks. In contrast, stereovision2 has a more even distribution, with 33% DSP moves, 48% Logic moves, and 19% IO moves.10 We can also look at how the move proposal distribution changes during the anneal, showing how the RL agent adapts online.11

10It is interesting to note that the RL agent broadly recovers the uniform move generator’s rough ranking of block type importance (which is determined statically by the frequency of block types in the netlist). However, unlike the uniform approach, the RL agent tends to perform more moves affecting the less common block types (which tend to be larger).
11In contrast, the move proposal distribution of the standard VTR 8 placer is fixed and statically determined by the frequency of block types in the netlist.

Figure 13.6: Pareto optimal run-time versus quality trade-offs (wirelength and critical path delay) for the VTR 8 placer, the Random Agent and the RL Agent on the VTR Benchmarks (> 10K primitives, geomean over 3 seeds). Run-time and quality metrics are normalized to the VTR 8 default settings.

13.6 Conclusion & Future Work

In conclusion, we’ve presented an RL-enhanced FPGA placer which uses an RL agent to control how the placer explores the solution space. The agent controls a more flexible SA move generator, and learns online what types of moves are productive. This enables the placer to adapt based on circuit characteristics and how optimization progresses. Compared to the VTR 8 placer, our results show the RL-enhanced placer achieves an improved run-time/quality trade-off. This allows our placer to achieve the same QoR while running up to 2× faster. Our approach is particularly effective in the low run-time regime where few moves are made, and performing ‘good’ moves is more beneficial. This illustrates how RL can be used to make more adaptable CAD tools and learn effective optimization heuristics with reduced human design effort. RL offers a very general framework for formulating these types of problems and so should be applicable at various stages of the CAD flow.

There are many potential avenues for future work. While we showed our technique was effective with a relatively small number of simple moves (based on block type), we believe using a wider variety of more powerful moves will make the techniques presented here more effective, as they will give the agent a broad range of capabilities which it can use to improve placement quality. The reward formulations (Section 13.4.1) have also not been well explored and can likely be improved. One factor not accounted for is the run-time impact of different move types, as some moves may be more computationally intensive than others. We would like the agent to account for this, enabling it to effectively trade off both quality and run-time.

Currently our agent has only a single state (Section 13.4.2), which limits its ability to learn longer time scale phenomena and situationally dependent behaviour. Additional state information (e.g. circuit and optimizer statistics) would likely help it make better decisions. The agent also uses ε-greedy action selection (Section 13.4.3). One drawback of this approach is that if several actions are perceived to have nearly equal value, the agent will usually greedily select the highest-value action. A less greedy approach to action selection (e.g. soft-max) would likely better balance the agent’s actions with its perceived action values.

Our agent also only learns online. While we believe online learning is important (since it allows the agent to adapt to the current problem characteristics), it means the agent begins learning from scratch at the start of each placement. We expect off-line training (e.g. on a suite of benchmark circuits) would help the agent learn general techniques based on a wider range of experience, which would further improve results. Online learning could then be used to ensure the agent still adapts to the specific placement problem.

Additionally, our agent uses relatively simple K-armed bandit-based RL algorithms. While these approaches are fast and run-time efficient (important since they are used in the placer’s inner loops), there is a wide variety of more powerful RL algorithms which could be applied, such as Temporal Difference Learning and Policy Gradients [238]. We expect more complex RL techniques will need to be invoked periodically (to keep their run-time overheads low) with the goal of setting the general optimization direction. This direction can then be fine-tuned online with fast RL algorithms like those used in this work.

Finally, we believe RL techniques can be used with other CAD algorithms, such as analytic placement and negotiated congestion routing (e.g. determining which connections conflict, when to rip-up and re-route connections, and how congestion costs should change). However the application of RL to such algorithms remains future work, although other recent works have begun investigating the use of RL to determine good orders for applying optimization passes in logic [243] and high level [244] synthesis.

Figure 13.7: Cumulative number of moves performed during the anneal for several benchmark circuits (mcml, stereovision2, blob_merge and LU8PEEng), broken down by block type (IO, Logic, DSP, RAM) for the RL and uniform move generators. The RL-enhanced move generator adapts online to both benchmark characteristics and optimization progress, while the traditional VPR move generator (uniform) uses a fixed method.

Chapter 14

Conclusion & Future Work

In this thesis, we have presented several major components:

1. Parallel Algorithms for Static Timing Analysis (STA) on CPUs and GPUs, and the open-source Tatum STA library (Chapter 3),

2. Extended Static Timing Analysis (ESTA), a new approach and algorithms to analyze the timing behaviour of digital circuits under input transition probability distributions (Chapter 4),

3. Improvements to the VTR design flow (Chapters 5 and 6) which increase its flexibility and interoperability with other tools,

4. Extensions to VTR’s architecture modelling capabilities (Chapter 7) which allow VTR to capture many of the important details of both commercial and novel FPGAs, and an enhanced capture of Stratix IV for use with the Titan benchmarks,

5. Enhanced approaches to FPGA clustering which produce more ‘natural’ clusters, and improvements to placement optimization which improve wirelength, timing and routability (Chapter 8),

6. A scalable and efficient approach to FPGA routing (AIR) which is significantly faster than previous approaches, and simultaneously improves wirelength and timing (Chapter 9),

7. The overall VTR 8 design flow (Chapters 5 to 10) which substantially improves optimization quality and run-time compared to VTR 7 along with detailed comparisons to commercial tools (Chapter 11),

8. An open-source FPGA design flow (built in collaboration with others) using parts of the VTR design flow, and leveraging its flexibility and adaptability to target both commercial and novel FPGAs (Chapter 12),

9. A new approach to improve the run-time and quality of CAD tools, by leveraging Reinforcement Learning (RL) to learn new optimization heuristics and fine-tune optimizer behaviour to the specific problem being solved (Chapter 13).

In this concluding chapter we discuss the key conclusions from each component, and future research directions.


14.1 Parallel STA and Tatum

STA is a time consuming part of digital design flows, and is a key requirement to enable timing-driven optimizations. We presented a number of approaches for performing parallel STA, which exploit the parallelism available on modern computer systems (Chapter 3). While we found that GPUs performed well on the core STA traversals (6.2×), the overhead of data transfers limited the overall speed-up (0.9×). We also investigated a number of CPU-parallel algorithms for performing STA, finding that restructuring the traversals to reduce hazards and minimizing explicit locking were key to scalability. We also presented further optimizations to perform multi-clock and multi-corner STA efficiently. These techniques were combined into the open-source Tatum STA library, which was shown to be more scalable (7× larger parallel speed-up) than other published approaches. Integrating Tatum into VPR, we found Tatum achieved speed-ups of 15 to 32× compared to the VTR 7 timing analyzer, and showed how this enables more up-to-date timing information to be used to improve timing optimization quality. Tatum has been adopted as the STA engine within both VTR 8 [2] and the CGRA-ME framework [24], illustrating its flexibility and performance.

Future Work

There are a number of potential directions for future work. Firstly, it would be interesting and useful to extend Tatum to support incremental updates, which would further reduce the cost of using up-to-date timing information during optimization.1 Incremental updates are also likely to make parallelization more challenging, since the work to be performed is not known ahead-of-time but is ‘discovered’ on the fly, likely requiring more synchronization which may harm parallel scalability. There are also a number of additional STA features which Tatum could be extended to support, including derating, path-based analysis, and statistical delays.

14.2 Extended Static Timing Analysis

Extended Static Timing Analysis (ESTA) presented a new approach to analyzing the timing behaviour of digital circuits, which unlike traditional STA or SSTA, considers the probability of inputs and internal signals transitioning (Chapter 4). This provides designers with new insight into the statistical behaviour of their circuits with respect to the data fed into them. This kind of analysis is useful for approximate computing design styles like overclocking, where designers must understand this behaviour to properly bound the accuracy of their circuits. ESTA is the first automated technique suitable for analyzing the behaviour of these kinds of designs (such analysis was previously performed by hand). ESTA produces an analysis with strong guarantees (always producing a bounding delay distribution over all input combinations), while remaining far more scalable than exhaustive timing simulation. Furthermore, ESTA runs 16 to 569× faster than Monte Carlo timing simulation (which provides no strong guarantees) and reduces pessimism by 37 to 75% compared to traditional STA.

1Initial results to this end show significant run-time savings, 31× faster placement on 15 of the Titan benchmarks when performing frequent STA (i.e. after every move).

Future Work

Scalability is a fundamental challenge for ESTA, as it is for all techniques which attempt to analyze input-dependent circuit behaviour. Potential directions for future work to improve scalability include: investigating alternative switching models, evaluating different solution techniques for #SAT (e.g. CNF-based solvers, approximation techniques), and developing improved heuristics for merging path-delay distributions.

In particular, further investigation of the percentile-binning approach to merging path-delay distributions seems like an important next step. This approach has the potential to significantly improve scalability by focusing the analysis on the longest timing paths, which are usually of interest to designers.

There are also open questions about how best to model statistical correlations between inputs and generate the conditioning functions used by ESTA. While simple correlations can be captured, it is unclear how to handle more complex inter-dependent correlations across multiple circuit inputs. Further investigation into techniques like Stochastic Synthesis (Appendix A) may prove fruitful.

It would be interesting to investigate how ESTA could be combined with SSTA, to produce timing analysis results which account for both input-dependent variation and component delay variation. There is also future work to be done studying how ESTA can be integrated into design methodologies and CAD tools to drive input-dependent variation aware optimizations.

14.3 VTR Design Flow & Architecture Modeling

Extending the flexibility of the VTR design flow (Chapter 6) makes it easier to combine the various design tools within VTR (along with other tools outside VTR) into new and novel CAD flows. These provide researchers and end-users with much greater flexibility to create CAD flows which are suited to their interests. Combined with the architecture modelling enhancements (Chapter 7), such as general device grids, improved detailed routing architecture descriptions, externally defined RR graphs and non-configurable switches, this allows VTR to capture many of the details of commercial FPGAs which were previously not considered, or had limited end-user configurability.

This flexibility allows architects to more thoroughly explore a wider range of potential FPGA architectures using VTR. It also enables improved detailed modeling of commercial architectures such as Intel’s Stratix IV, capturing key characteristics like hierarchical routing network connectivity which are important for achieving high performance.

Additionally, we have made a number of engineering and usability enhancements to VTR which make it more robust and easier to use (Chapter 10). This is an important contribution to the overall research community, as it reduces the friction, effort and duplicate work involved in studying new FPGA architectures and new CAD algorithms within VTR.

Future Work

While we have improved the flexibility and modeling capabilities of VTR, there is still much that could be done. For instance, many of the interfaces between or within the tools in VTR could be further standardized, and there is no generally agreed upon format for transferring design or architecture representations between VTR and other tools. The development of such representations would greatly improve tool interoperability.

From an architecture modeling perspective there is still potential to generalize these capabilities to a wider range of architectures. For instance, some commercial architectures fit more ‘cleanly’ into the VTR model than others, since they are closer in conceptual structure to what VTR expects. While it is often still possible to model these other architectures, it cannot always be done in an easy and efficient way. We expect that further generalizing VTR’s modeling capabilities and removing implicit assumptions would further simplify and improve this process.

14.4 FPGA Packing & Placement

Having separate packing/clustering and placement stages is very helpful for allowing VTR to target a wide range of FPGA architectures, as it allows many of the detailed device legality constraints to be considered only during packing. We presented a number of techniques and heuristics which helped to produce more ‘natural’ clusters which were less interconnected (Chapter 8). This in turn helped to improve routability (from 13/23 routable to 23/23 routable Titan benchmarks), reduced wirelength by 23%, and improved critical path delay by 5%. These more natural clusters also reduced overall run-time by 30% as later optimization stages were able to converge more quickly.

We also made improvements to VTR’s SA-based placer (Chapter 8), which now performs better optimization of complex structures like carry chains, and uses a compressed representation of the placement grid to allow more efficient exploration of the solution space. These changes also improved routability (completing an additional Titan benchmark), while reducing placed wirelength and critical path delay by 8% and 2% respectively with no increase in placement time.

Future Work

There are a number of approaches that could be taken to further improve the quality and run-time of packing and placement. For instance, the hard barrier between packing and placement as currently exists in VTR is likely limiting the degree to which the placement can be further optimized. There are a number of approaches which can be considered to alleviate this.

First, the quality of the initial packing can likely be significantly improved by seeding the packer with additional physical information. This would provide the existing packing algorithm with improved guidance about which netlist primitives are closely related, and should be packed together into the same cluster. An approach like a flat analytic placement which considers only coarse legality constraints seems like a good candidate for generating this information, as such an approach could run quickly and take into consideration both timing and wirelength.

Second, the placer could be enhanced to support taking apart clusters during placement. This would allow the placer to perform better detailed optimization by moving smaller groups of (or individual) netlist primitives, allowing it to fix up ‘mistakes’ made by the initial clustering due to incomplete information.

Third, the scalability of the SA-based placer could be enhanced by performing directed moves [62, 239]. Such moves would exploit information about the current placement to propose perturbations which aim to optimize specific design characteristics like wirelength and timing.

Fourth, a hybrid placer could be developed performing an initial analytic placement which is then legalized and refined by the SA-based placer. This approach has the potential to further improve the placer’s scalability by avoiding the less efficient high temperature annealing regime.

14.5 FPGA Routing & AIR

AIR (Chapter 9) is an FPGA routing algorithm which is both run-time efficient and produces high quality results. Many of AIR’s run-time benefits are derived from avoiding unnecessary or redundant work. This is achieved by incrementally re-routing only congested connections, and efficiently determining where to expand from while routing high-fanout nets. AIR also adapts to the targeted FPGA architecture by creating an adaptive lookahead and adjusting routing resource base costs to the specific characteristics of the targeted FPGA architecture. Together these enhancements enable AIR to find legal routings significantly faster (6.2×) than the widely used VPR 7 router, while significantly reducing routed wirelength (15% less) and critical path delay (19% faster). We also presented extensions for routing with non-configurable edges (which allow modeling of non-linear wiring patterns), and extensions to significantly reduce the time required for finding the minimum routable channel width (2.0 to 2.7× faster), as is done during architecture evaluation. We also described a new algorithm which exploits AIR’s incremental capabilities to improve the quality of an existing legal routing. As a result of its performance and scalability AIR is used as the default router in VTR 8 [2].

Future Work

It would be useful, particularly when targeting commercial devices, to support hold-time aware routing in AIR. It would be interesting to investigate how hold-aware routing techniques like RCV [69] could be used in AIR.

While AIR uses an enhanced map lookahead, it could be further improved. In particular the map lookahead fails to account for the characteristics of different block input/output pins. Since the connectivity between input/output pins and the inter-block routing network is usually identical for each instance of a block type in the device grid, this connectivity could be profiled efficiently in a side lookahead mapping from output pins → routing wires, and another from routing wires → input pins. These two sub-lookaheads could then be combined with the map lookahead’s existing inter-block routing wire lookahead to capture wire↔wire, pin→wire and wire→pin connectivity effects in the routing network.

As currently implemented, AIR is a serial algorithm which does not exploit the parallel processing power available on modern systems. It would therefore be useful to investigate parallelizing the AIR algorithm. In particular, AIR’s connection-based routing approach is promising in this respect, as a connection-based parallelization may expose significantly more parallelism than more traditional approaches which have relied upon coarser spatial or net-based parallelism.

Another potential approach to investigate would be to study the efficacy of differential/landmark-based lookaheads [245, 246], which to our knowledge have not yet been considered for FPGA routing. These techniques also profile the graph to be searched in order to generate a lookahead, but do not assume translational invariance (which may be helpful in some FPGA architectures, for instance where large hard blocks like hard-processor sub-systems intrude into the FPGA fabric). However, it is unclear how tenable the likely higher memory use of these methods would be. Another potential advantage of differential based lookaheads is that they could be updated during routing to account for congestion costs, which are ignored by current lookaheads.

While AIR’s incremental approaches substantially reduce the work required by only re-routing congested connections, the underlying path search still repeatedly searches the routing resource graph

(since the lowest cost path may have changed due to changes in circuit timing or congestion). Since these costs often change slowly or only affect small portions of the graph, it would be interesting to investigate how incremental shortest path algorithms (like incremental A* [247] or D*-lite [248]) could be applied to this problem. These algorithms can incrementally re-calculate shortest paths based on changes to graph node and edge costs. However the additional state that must be maintained may mean these approaches significantly increase memory usage.

14.6 VTR Evaluation

To quantify the impact of the above enhancements to packing, placement and routing, and other engineering improvements (Chapters 8 to 10), we also evaluated the overall VTR design flow, and compared it to the previous VTR 7 release and Intel’s commercial Quartus CAD system. These enhancements significantly improve quality (15% minimum routable channel width, 41% wirelength, and 12% critical path delay reductions), run-time (5.2× faster) and memory use (3.3× lower) compared to VTR 7 on the VTR benchmarks. On the larger Titan benchmarks wirelength and critical path delay were reduced by 44% and 10% respectively, while run-time and memory usage were reduced by 5.3× and 5.0× respectively compared to VPR 7. These improvements also allow VPR to complete a similar number of circuits as Quartus, and narrow the quality gaps from 2.0× to 1.26× for wirelength and 1.53× to 1.26× for critical path delay at comparable run-time and memory usage.

Future Work

It would be interesting to further investigate how other CAD flows compare to VTR, such as Vivado. Additionally, it would be interesting to evaluate how well VTR adapts to a wider range of FPGA architectures.

14.7 Open-Source FPGA CAD

We have also described how the architecture modelling enhancements previously outlined, along with other enhancements, can be used to target commercial FPGAs using VPR (Chapter 12). This allows for a fully open-source end-to-end design flow which can produce valid and functionally correct bitstreams for commercial devices. This design flow can be re-targeted to both novel and commercial FPGAs. Compared to Xilinx’s commercial Vivado tool, this design flow completes approximately 2× faster, but produces circuits which are 2.3× slower. Approximately half of this gap is due to open source logic synthesis (which produces larger, less optimized netlists than Vivado) with the remainder occurring during physical design (in VPR).

Future Work

An important consideration raised by this work is how an FPGA architecture is modeled, which was found to have a significant impact on both tool run-time and implementation quality. It is therefore important future work to continue refining techniques for generating efficient FPGA architectural models which appropriately capture the architectural details needed by the tools to perform good quality optimization in reasonable run-time.

This work also revealed some implicit assumptions made by VPR which do not necessarily hold on all FPGA architectures. While many of these have been resolved, it illustrates there is still room for improvement in VPR’s architectural modelling capabilities, in particular to allow cleanly capturing many of the commercial device details. Further work is needed to identify and resolve the remaining causes of the quality gap. We suspect it can be reduced through a combination of improved architectural modeling and more adaptive optimization algorithms.

14.8 Reinforcement Learning Enhanced CAD

Developing high quality and performant heuristics for CAD algorithms is an iterative design process, where CAD developers propose new/enhanced heuristics and then empirically fine-tune their parameters. This process is challenging and time-consuming as it requires a ‘human-in-the-loop’. In an effort to address this, we have shown how Reinforcement Learning (RL) can be used to automate parts of the fine-tuning process. As an illustrative example, we studied SA-based FPGA placement, where we used an RL agent to choose among different types of placement perturbations. This allows the overall placement algorithm to dynamically adapt on-line to the circuit being placed, the target FPGA architecture, and optimization progress. This approach improves the placer’s run-time quality trade-off, yielding the same quality in 20% less run-time in the high run-time regime, and 50% less run-time in low run-time regimes.

Future Work

In the context of FPGA SA-based placement with RL there are a number of important next steps. To give the agent greater control of the optimization process, it makes sense to increase the types of moves/perturbations it can perform. There is also significant scope for determining better reward functions and action selection strategies for the agent.

It will also be important to expand the agent’s state representation (i.e. moving beyond a K-armed bandit), combined with off-line training. We expect this will be key to allowing the agent to learn more complex heuristics and useful situation specific behaviours. This will require the use of more advanced RL training algorithms, and determining what information/features to include in the state representation. As shown with high performance RL systems in other fields, it seems likely that using neural networks (i.e. deep reinforcement learning) will be helpful to encode the likely complex state representation. Since such agents are likely to be more computationally intensive, they will likely need to be called less frequently (e.g. once per temperature) to set the general optimization approach. The optimization approach could then be fine-tuned with simpler but much faster RL algorithms (like those we presented), allowing the system to continue adapting on-line after every move.

More broadly, it would be interesting to investigate how RL techniques can be applied to other areas of the CAD flow. A particularly compelling direction seems to be replacing simple heuristic or random choices (e.g. what annealing moves to make, the order of routing nets and connections) with an RL agent which can adapt and learn more complex heuristics.

Appendix A

Synthesizing Correlated ESTA Condition Functions

In ESTA (Chapter 4) the basic formulation models primary-inputs as:

• each having uniform probability (i.e. Rise/Fall/High/Low transitions all have probability 0.25), and

• each being independent (i.e. no correlations between input transitions)

If the circuit being analyzed is embedded in a larger system these assumptions may not hold. It is therefore desirable to allow users to specify both the probabilities of different transitions, and the correlations between them. This can be done by creating appropriate condition functions (Section 4.5) which capture these correlations. This appendix considers two approaches for synthesizing such condition functions from a user-provided specification. The first (Section A.1) is based on constructing an Integer Linear Programming (ILP) feasibility problem. The second (Section A.2) is a constructive approach based on stochastic logic synthesis. Ultimately only the ILP-based approach was successful; the stochastic synthesis approach was unable to fully capture the required correlations (its limitations are discussed in Section A.2.5).

A.1 An ILP approach

The first approach is based on formulating an Integer Linear Program whose solution yields the desired condition functions.

A.1.1 Specifying Probabilities and Correlations

The first step is to formulate a way of describing the probabilities and transition correlations between primary inputs. Let us assume a circuit with two primary inputs a and b. We can then build a matrix which describes the probabilities and correlations between these two inputs (i.e. the joint probability distribution)1:

1Note that this captures only pair-wise correlations and does not capture any higher-order correlations between more than two primary inputs


\[
\begin{array}{c|cccc}
    & a_R & a_F & a_H & a_L \\
\hline
b_R & p_{a_R b_R} & p_{a_F b_R} & p_{a_H b_R} & p_{a_L b_R} \\
b_F & p_{a_R b_F} & p_{a_F b_F} & p_{a_H b_F} & p_{a_L b_F} \\
b_H & p_{a_R b_H} & p_{a_F b_H} & p_{a_H b_H} & p_{a_L b_H} \\
b_L & p_{a_R b_L} & p_{a_F b_L} & p_{a_H b_L} & p_{a_L b_L} \\
\end{array}
\]

where, for instance, $p_{a_R b_R}$ represents the probability of a rising (R) transition occurring simultaneously on both input a and input b. As a joint probability distribution, the sum of each row/column corresponds to the marginal probability of a particular transition occurring on the associated input. Similarly the sum of all entries must be 1. In general there will be a joint probability distribution for each pair of circuit primary inputs. However these distributions must be consistent; in particular the marginal distribution for a particular input must be the same across the joint probability distributions of all pairs of inputs. For example, if we had a circuit with inputs a, b and c, then for the $a_R$ transition to be consistent:

\[
p_{a_R} = p_{a_R b_R} + p_{a_R b_F} + p_{a_R b_H} + p_{a_R b_L} = p_{a_R c_R} + p_{a_R c_F} + p_{a_R c_H} + p_{a_R c_L} \tag{A.1}
\]
must hold. Similar constraints must hold between all inputs for all transitions.
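As a small illustration of this consistency requirement, the following Python sketch (with hypothetical data structures, not part of ESTA itself) checks that the marginal distribution of each input agrees across all of the pairwise joint distributions it appears in, per Equation (A.1).

```python
TRANSITIONS = ["R", "F", "H", "L"]

def marginal(joint, axis):
    """Marginal of one input from a pairwise joint: joint[(t, u)] = P(first=t, second=u)."""
    if axis == 0:
        return {t: sum(joint[(t, u)] for u in TRANSITIONS) for t in TRANSITIONS}
    return {u: sum(joint[(t, u)] for t in TRANSITIONS) for u in TRANSITIONS}

def check_consistency(pairwise_joints, tol=1e-9):
    """Verify Equation (A.1): each input's marginal agrees across all pairwise joints.

    pairwise_joints maps an ordered input pair (i, j) to its joint distribution.
    """
    seen = {}
    for (i, j), joint in pairwise_joints.items():
        for inp, axis in ((i, 0), (j, 1)):
            m = marginal(joint, axis)
            if inp in seen and any(abs(seen[inp][t] - m[t]) > tol for t in TRANSITIONS):
                return False
            seen.setdefault(inp, m)
    return True
```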

A.1.2 Relating Condition Functions to the Specification

Given a set of joint probability distributions describing the desired input statistics, we now need to relate these statistics to the boolean condition functions which are used to model the inputs in the ESTA formulation (Section 4.5). This amounts to constructing a set of boolean functions $f_{it}$ where i is the primary input (e.g. a), and $t \in \{R,F,H,L\}$ is the transition. If the condition functions all depend on the same N boolean variables2 then we require the condition functions to satisfy the following:

\[
\frac{\#SAT(f_{it} \wedge f_{ju})}{2^N} = p_{itju} \qquad \forall i,j \in \mathrm{combinations}(inputs),\ \forall t,u \in \{R,F,H,L\} \tag{A.2}
\]
which ensures that they match the pair-wise probability specification.

A.1.3 Condition Function Synthesis Formulation

The problem of synthesizing condition functions to meet a probability specification is analogous to assigning a fixed number of minterms (satisfying assignments) to the set of condition functions such that:

\[
\#SAT(f_{it} \wedge f_{ju}) = 2^N \cdot p_{itju} \qquad \forall i,j \in \mathrm{combinations}(inputs),\ \forall t,u \in \{R,F,H,L\} \tag{A.3}
\]

2This implies that all condition functions have the same support; it is possible other formulations could require different/partitioned supports per primary-input.

This can be formulated as a constraint satisfaction problem, where the goal is to find any valid assignment of minterms to condition functions such that Equation (A.3) holds. If we assume there are N boolean variables, then there are $2^N$ minterms to be assigned, and the problem can be formulated as:

\[
\begin{aligned}
\text{find feasible } x \ \text{ s.t. } \quad
& \sum_{m \in [1, 2^N]} x_{mit}\, x_{mju} = \#SAT(f_{it} \wedge f_{ju}) && \forall i,j \in \mathrm{combinations}(inputs),\ \forall t,u \in \{R,F,H,L\} \\
& \sum_{t \in \{R,F,H,L\}} x_{mit} = 1 && \forall i \in inputs
\end{aligned}
\tag{A.4}
\]

where $x_{mit}$ is a boolean decision variable3 which is true if minterm $m \in [1, 2^N]$ is assigned to condition function $f_{it}$. The result of (A.4) is an assignment of minterms to each condition function. The constraint $\sum_m x_{mit} x_{mju} = \#SAT(f_{it} \wedge f_{ju})$ ensures that the pair-wise conditions have the correct number of common minterms (i.e. probability). The constraint $\sum_t x_{mit} = 1$ ensures that each minterm is assigned precisely once for each primary input, guaranteeing that the resulting condition functions are well formed

(i.e. cover all minterms, or analogously $f_{iR} \vee f_{iF} \vee f_{iH} \vee f_{iL} = 1\ \forall i \in inputs$). If there are I primary inputs, N variables and T transitions then this formulation requires $O(2^N \cdot T^2 \cdot I)$ decision variables and $O(\binom{I}{2} \cdot T^2 + I)$ constraints.

A.1.4 Implementing Condition Function Synthesis

Equation (A.4) can be solved directly as a Constraint Satisfaction Problem, however this has so far proved too slow to be practical4. Since the decision variables of (A.4) are boolean it can also be formulated as an Integer Linear Programming (ILP) problem. However the formulation in (A.4) includes the non-linear term $x_{mit} x_{mju}$ (logical AND) in the constraints, making a direct implementation impossible. Fortunately we can transform this non-linear term into a set of linear constraints on an additional set of decision variables:

\[
\begin{aligned}
\text{find feasible } x, y \ \text{ s.t. } \quad
& \sum_{m \in [1, 2^N]} y_{mitju} = \#SAT(f_{it} \wedge f_{ju}) && \forall i,j \in \mathrm{combinations}(inputs),\ \forall t,u \in \{R,F,H,L\} \\
& \sum_{t \in \{R,F,H,L\}} x_{mit} = 1 && \forall i \in inputs \\
& x_{mit} + x_{mju} - y_{mitju} \leq 1 && \forall i,j \in \mathrm{combinations}(inputs),\ \forall t,u \in \{R,F,H,L\} \\
& y_{mitju} - x_{mit} \leq 0 && \forall i,j \in \mathrm{combinations}(inputs),\ \forall t,u \in \{R,F,H,L\} \\
& y_{mitju} - x_{mju} \leq 0 && \forall i,j \in \mathrm{combinations}(inputs),\ \forall t,u \in \{R,F,H,L\}
\end{aligned}
\tag{A.5}
\]

where $y_{mitju}$ is a boolean variable which is true if minterm m is assigned to both conditions $f_{it}$ and $f_{ju}$. The constraints $x_{mit} + x_{mju} - y_{mitju} \leq 1$, $y_{mitju} - x_{mit} \leq 0$ and $y_{mitju} - x_{mju} \leq 0$ ensure that $y_{mitju}$ is always equal to $x_{mit} x_{mju}$.

3Note that the x’s here are decision variables for the optimization problem, and not the boolean variables which form the support of the condition functions.
4Based on initial experiments with Google’s OR-Tools [249]; however further investigation using different solvers may be warranted.

This formulation requires $O(2^N \cdot T^2 \cdot (I + \binom{I}{2}))$ decision variables and $O(\binom{I}{2} \cdot T^2 + I)$ constraints.
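The following is a minimal sketch of this linearized formulation using the CP-SAT solver from Google OR-Tools. It is illustrative only: the function and argument names are hypothetical, and it uses CP-SAT rather than the CBC ILP solver used in our experiments, but it encodes the same constraints as (A.5).

```python
from itertools import combinations
from ortools.sat.python import cp_model

TRANSITIONS = ["R", "F", "H", "L"]

def synthesize_conditions(inputs, required_counts, num_vars):
    """Assign minterms to condition functions per the linearized formulation (A.5).

    required_counts[((i, t), (j, u))] is the required number of shared minterms,
    i.e. #SAT(f_it AND f_ju), keyed in the order the inputs appear in `inputs`.
    Returns {(input, transition): [minterms]} or None if infeasible.
    """
    model = cp_model.CpModel()
    minterms = range(2 ** num_vars)

    # x[m, i, t] == 1 iff minterm m is assigned to condition function f_it.
    x = {(m, i, t): model.NewBoolVar(f"x_{m}_{i}_{t}")
         for m in minterms for i in inputs for t in TRANSITIONS}

    # Each minterm is assigned to exactly one transition per input (well-formedness).
    for m in minterms:
        for i in inputs:
            model.Add(sum(x[m, i, t] for t in TRANSITIONS) == 1)

    # Pairwise shared-minterm counts, with y linearizing the product x_mit * x_mju.
    for (i, j) in combinations(inputs, 2):
        for t in TRANSITIONS:
            for u in TRANSITIONS:
                ys = []
                for m in minterms:
                    y = model.NewBoolVar(f"y_{m}_{i}_{t}_{j}_{u}")
                    model.Add(x[m, i, t] + x[m, j, u] - y <= 1)
                    model.Add(y <= x[m, i, t])
                    model.Add(y <= x[m, j, u])
                    ys.append(y)
                model.Add(sum(ys) == required_counts[((i, t), (j, u))])

    solver = cp_model.CpSolver()
    if solver.Solve(model) not in (cp_model.FEASIBLE, cp_model.OPTIMAL):
        return None
    return {(i, t): [m for m in minterms if solver.Value(x[m, i, t])]
            for i in inputs for t in TRANSITIONS}
```

For the three-input example in the next section this would be called with num_vars = 3 and entries such as required_counts[(("b", "R"), ("c", "R"))] = 2.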

A.1.5 Example

Let us consider a circuit with 3 inputs (a, b and c), and the following joint probability distributions:

\[
\begin{array}{c|cccc}
    & a_R & a_F & a_H & a_L \\
\hline
b_R & 1/8 & 1/8 & 0 & 0 \\
b_F & 1/8 & 1/8 & 0 & 0 \\
b_H & 1/8 & 1/8 & 0 & 0 \\
b_L & 1/8 & 1/8 & 0 & 0 \\
\end{array}
\qquad
\begin{array}{c|cccc}
    & a_R & a_F & a_H & a_L \\
\hline
c_R & 1/4 & 1/4 & 0 & 0 \\
c_F & 1/4 & 1/4 & 0 & 0 \\
c_H & 0 & 0 & 0 & 0 \\
c_L & 0 & 0 & 0 & 0 \\
\end{array}
\qquad
\begin{array}{c|cccc}
    & c_R & c_F & c_H & c_L \\
\hline
b_R & 1/4 & 0 & 0 & 0 \\
b_F & 0 & 1/4 & 0 & 0 \\
b_H & 1/4 & 0 & 0 & 0 \\
b_L & 0 & 1/4 & 0 & 0 \\
\end{array}
\]

Since the smallest probability is $1/8 = 2^{-3}$, we require at least 3 variables to be able to precisely express the probabilities. We then multiply each joint probability distribution by $2^3$ to yield the required number of minterms shared between all pairs of condition functions:

\[
\begin{array}{c|cccc}
    & a_R & a_F & a_H & a_L \\
\hline
b_R & 1 & 1 & 0 & 0 \\
b_F & 1 & 1 & 0 & 0 \\
b_H & 1 & 1 & 0 & 0 \\
b_L & 1 & 1 & 0 & 0 \\
\end{array}
\qquad
\begin{array}{c|cccc}
    & a_R & a_F & a_H & a_L \\
\hline
c_R & 2 & 2 & 0 & 0 \\
c_F & 2 & 2 & 0 & 0 \\
c_H & 0 & 0 & 0 & 0 \\
c_L & 0 & 0 & 0 & 0 \\
\end{array}
\qquad
\begin{array}{c|cccc}
    & c_R & c_F & c_H & c_L \\
\hline
b_R & 2 & 0 & 0 & 0 \\
b_F & 0 & 2 & 0 & 0 \\
b_H & 2 & 0 & 0 & 0 \\
b_L & 0 & 2 & 0 & 0 \\
\end{array}
\]

This information can then be used to formulate the ILP in (A.5) (e.g. $\#SAT(f_{b_R} \wedge f_{c_R}) = 2$). Passing it to an ILP solver then yields the following minterm assignments:

• $f_{a_R} = [0, 2, 4, 7]$

• $f_{a_F} = [1, 3, 5, 6]$

• $f_{a_H} = []$

• $f_{a_L} = []$

• $f_{b_R} = [2, 5]$

• $f_{b_F} = [4, 6]$

• $f_{b_H} = [0, 1]$

• $f_{b_L} = [3, 7]$

• $f_{c_R} = [0, 1, 2, 5]$

• $f_{c_F} = [3, 4, 6, 7]$

• $f_{c_H} = []$

• $f_{c_L} = []$

It is important to note the same minterms must be assigned to a condition function across each pair of inputs. As an example consider $f_{b_R}$, which was assigned the minterms 2 and 5. According to the required minterm counts, one of these minterms must be shared with $f_{a_R}$ (minterm 2) and the other with $f_{a_F}$ (minterm 5). Furthermore, these two minterms must also be shared with $f_{c_R}$. Similar constraints apply for all other condition functions.

The highly constrained nature of this problem makes it difficult to break it down into smaller sub-problems, necessitating the use of ‘global’ search techniques like ILP. In our experiments using the CBC ILP solver (included with Google OR-Tools [249]), we found that this approach was capable of synthesizing condition functions for up to 3 correlated inputs in reasonable run-time. However the large number of ILP variables and constraints limits its scalability. Despite this, it would be interesting to evaluate scalability with industrial ILP solvers like Gurobi, which may be more efficient.
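The shared-minterm structure described above can be checked directly. The following sketch simply recomputes the pairwise intersection sizes of the (non-empty) condition functions reported above; each count should match the corresponding entry of the scaled joint probability distributions.

```python
from itertools import combinations

# Minterm assignments reported above (empty condition functions omitted).
conditions = {
    ("a", "R"): {0, 2, 4, 7}, ("a", "F"): {1, 3, 5, 6},
    ("b", "R"): {2, 5}, ("b", "F"): {4, 6}, ("b", "H"): {0, 1}, ("b", "L"): {3, 7},
    ("c", "R"): {0, 1, 2, 5}, ("c", "F"): {3, 4, 6, 7},
}

# Pairwise shared-minterm counts between condition functions of different inputs.
for (i, ti), (j, tj) in combinations(conditions, 2):
    if i != j:
        shared = conditions[(i, ti)] & conditions[(j, tj)]
        print(f"#SAT(f_{i}{ti} AND f_{j}{tj}) = {len(shared)}")
```

For instance, this prints a count of 2 for the pair ($f_{b_R}$, $f_{c_R}$), matching the required value above.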

A.2 A Stochastic Synthesis Approach

While the ILP-based approach described above is functional, it involves solving a large and complex search problem which limits its scalability. An alternative approach is to construct the condition functions directly, in such a way that they satisfy the probability and correlation constraints. It was hoped that this correct-by-construction style of approach would prove more scalable.

A.2.1 Covariance Matrices

For each transition/input combination we can define a Bernoulli random variable which is 1 if the specific transition/input combination occurs, and 0 otherwise. We can then define x as a vector of all these Bernoulli random variables. For example, if we have the inputs a and b, each subject to 4 types of transitions (R, F, H, L), then:

\[
x = [a_R, a_F, a_H, a_L, b_R, b_F, b_H, b_L] \tag{A.6}
\]

and $E[x] = [p_{a_R}, p_{a_F}, p_{a_H}, p_{a_L}, p_{b_R}, p_{b_F}, p_{b_H}, p_{b_L}]$ corresponds to the probabilities of each transition type occurring on each input. The joint probability distributions of Section A.1.1 can then be viewed as the off-diagonal blocks of the matrix $E[x \cdot x^T]$. Continuing the above example, $E[x \cdot x^T]$ would be:

\[
\begin{array}{c|cccccccc}
    & a_R & a_F & a_H & a_L & b_R & b_F & b_H & b_L \\
\hline
a_R & p_{a_R} & 0 & 0 & 0 & p_{a_R b_R} & p_{a_R b_F} & p_{a_R b_H} & p_{a_R b_L} \\
a_F & 0 & p_{a_F} & 0 & 0 & p_{a_F b_R} & p_{a_F b_F} & p_{a_F b_H} & p_{a_F b_L} \\
a_H & 0 & 0 & p_{a_H} & 0 & p_{a_H b_R} & p_{a_H b_F} & p_{a_H b_H} & p_{a_H b_L} \\
a_L & 0 & 0 & 0 & p_{a_L} & p_{a_L b_R} & p_{a_L b_F} & p_{a_L b_H} & p_{a_L b_L} \\
b_R & p_{a_R b_R} & p_{a_F b_R} & p_{a_H b_R} & p_{a_L b_R} & p_{b_R} & 0 & 0 & 0 \\
b_F & p_{a_R b_F} & p_{a_F b_F} & p_{a_H b_F} & p_{a_L b_F} & 0 & p_{b_F} & 0 & 0 \\
b_H & p_{a_R b_H} & p_{a_F b_H} & p_{a_H b_H} & p_{a_L b_H} & 0 & 0 & p_{b_H} & 0 \\
b_L & p_{a_R b_L} & p_{a_F b_L} & p_{a_H b_L} & p_{a_L b_L} & 0 & 0 & 0 & p_{b_L} \\
\end{array}
\]

Note that $E[x \cdot x^T]$ is symmetric along the diagonal, and that some entries are always zero. The zero entries indicate that only a single transition can occur on a single input (i.e. transitions on a single input are mutually exclusive).

With this information specified, we can derive the covariance matrix Q:

\[
Q = COV(x) = E[x \cdot x^T] - E[x]E[x^T] \tag{A.7}
\]
which fully captures the correlations between inputs. As a concrete example, consider the probabilities described for a and b in Section A.1.5. The corresponding $E[x \cdot x^T]$ is then:

\[
\begin{array}{c|cccccccc}
    & a_R & a_F & a_H & a_L & b_R & b_F & b_H & b_L \\
\hline
a_R & 1/2 & 0 & 0 & 0 & 1/8 & 1/8 & 1/8 & 1/8 \\
a_F & 0 & 1/2 & 0 & 0 & 1/8 & 1/8 & 1/8 & 1/8 \\
a_H & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
a_L & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
b_R & 1/8 & 1/8 & 0 & 0 & 1/4 & 0 & 0 & 0 \\
b_F & 1/8 & 1/8 & 0 & 0 & 0 & 1/4 & 0 & 0 \\
b_H & 1/8 & 1/8 & 0 & 0 & 0 & 0 & 1/4 & 0 \\
b_L & 1/8 & 1/8 & 0 & 0 & 0 & 0 & 0 & 1/4 \\
\end{array}
\]

and $E[x]E[x^T]$ is:

\[
\begin{array}{c|cccccccc}
    & a_R & a_F & a_H & a_L & b_R & b_F & b_H & b_L \\
\hline
a_R & 1/4 & 1/4 & 0 & 0 & 1/8 & 1/8 & 1/8 & 1/8 \\
a_F & 1/4 & 1/4 & 0 & 0 & 1/8 & 1/8 & 1/8 & 1/8 \\
a_H & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
a_L & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
b_R & 1/8 & 1/8 & 0 & 0 & 1/16 & 1/16 & 1/16 & 1/16 \\
b_F & 1/8 & 1/8 & 0 & 0 & 1/16 & 1/16 & 1/16 & 1/16 \\
b_H & 1/8 & 1/8 & 0 & 0 & 1/16 & 1/16 & 1/16 & 1/16 \\
b_L & 1/8 & 1/8 & 0 & 0 & 1/16 & 1/16 & 1/16 & 1/16 \\
\end{array}
\]

which yields the covariance matrix $Q = E[x \cdot x^T] - E[x]E[x^T]$:

\[
\begin{array}{c|cccccccc}
    & a_R & a_F & a_H & a_L & b_R & b_F & b_H & b_L \\
\hline
a_R & 1/4 & -1/4 & 0 & 0 & 0 & 0 & 0 & 0 \\
a_F & -1/4 & 1/4 & 0 & 0 & 0 & 0 & 0 & 0 \\
a_H & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
a_L & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
b_R & 0 & 0 & 0 & 0 & 3/16 & -1/16 & -1/16 & -1/16 \\
b_F & 0 & 0 & 0 & 0 & -1/16 & 3/16 & -1/16 & -1/16 \\
b_H & 0 & 0 & 0 & 0 & -1/16 & -1/16 & 3/16 & -1/16 \\
b_L & 0 & 0 & 0 & 0 & -1/16 & -1/16 & -1/16 & 3/16 \\
\end{array}
\]

A.2.2 Decorrelation with Cholesky Decomposition

Our ultimate goal is to generate condition functions which transform a set of independent 0/1 random variables into a set of correlated random variables with a desired covariance matrix Q. One approach is to use the Cholesky decomposition, which decomposes a matrix into the product of a lower triangular matrix L and its transpose $L^T$:

\[
Q = LL^T \tag{A.8}
\]

Let $X = [x_0, x_1, \ldots, x_n]$ be a vector of random variables with $E[x_i] = 0$ and $\sigma^2_{x_i} = 1$, with each $x_i$ independent. It follows that:
\[
E[XX^T] = I \tag{A.9}
\]

Let us define Y as:
\[
Y = LX \tag{A.10}
\]
where L is derived from Cholesky’s decomposition. Y represents a set of variables $y_i$ constructed as linear combinations of the independent $x_i$’s. Then:

\[
E[YY^T] = E[(LX)(LX)^T] = E[LXX^T L^T] = L\,E[XX^T]\,L^T = LIL^T = LL^T \tag{A.11}
\]

Additionally:
\[
E[Y] = E[LX] = L\,E[X] = 0 \tag{A.12}
\]
and similarly,
\[
E[Y^T] = 0 \tag{A.13}
\]

Then:
\[
COV(Y) = E[YY^T] - E[Y]E[Y^T] = E[YY^T] = LL^T = Q \tag{A.14}
\]
as desired. That is, the set of transformed variables $y_i$ constructed as linear combinations of the independent variables $x_i$ have the desired covariance matrix Q.

It can also be shown that for $E[x_i] = 0.5$ and $\sigma^2_{x_i} = 0.25$ (e.g. $x_i$ either 0 or 1, instead of -1 or 1) the transformation becomes:
\[
Y = L\hat{X} = 2LX \tag{A.15}
\]

In summary, to generate a set of correlated random variables Y with covariance matrix Q from a set of independent 0-1 (Bernoulli) random variables X we do the following:

1. Construct the desired covariance matrix Q

2. Calculate the Cholesky Decomposition L of Q

3. Generate the correlated variables: Y = 2LX

This yields a set of linear equations defining each correlated random variable $y_i$ as a linear combination of the independent random variables $x_i$:
\[
y_i = \alpha_{i1} x_1 + \alpha_{i2} x_2 + \ldots + \alpha_{in} x_n \tag{A.16}
\]

However it is important to note that the coefficients ($\alpha_{ij}$) in the linear combination, and the resulting $y_i$, are real numbers, and hence do not directly yield the desired boolean condition functions.
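The three-step procedure above is easy to exercise numerically; the sketch below does so with numpy. Note that it uses a small, hypothetical, strictly positive-definite covariance matrix purely for illustration, since the example matrices above are singular (the mutually exclusive transitions of one input are linearly dependent) and so have no ordinary Cholesky factor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical strictly positive-definite covariance matrix (illustration only).
Q = np.array([[0.25, 0.10, 0.05],
              [0.10, 0.25, 0.08],
              [0.05, 0.08, 0.25]])

L = np.linalg.cholesky(Q)                    # Step 2: Q = L @ L.T
X = rng.integers(0, 2, size=(3, 200_000))    # independent 0/1 (Bernoulli) variables
Y = 2 * L @ X                                # Step 3: Y = 2 L X

print(np.allclose(np.cov(Y), Q, atol=1e-2))  # empirical covariance of Y is approximately Q
```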

A.2.3 Stochastic Computation

Stochastic Computing (SC) is an approach where numbers are represented by sequences of boolean values in time (often called bitstreams) [250]. This is in contrast to the traditional dense binary encoding where a number is represented by a set of spatially parallel boolean values. A simple SC encoding is the unipolar (UP) encoding, which encodes a number’s value as the fraction of 1’s appearing in the bitstream [250]. For example, the bitstream ‘1010111010’ has length 10 and contains 6 1’s, yielding the value $\frac{6}{10} = 0.6$. SC naturally operates over the real numbers, although with a limited range. In particular the UP encoding is limited to values within [0, 1], while other encodings such as bipolar (BP) and inverted-bipolar (IBP) [250] are limited to values within [-1, 1]. It has been shown that a variety of common mathematical operations (e.g. addition/subtraction, multiplication) can be implemented using simple boolean logic gates [250]. The link SC provides between mathematical operations over the real numbers and boolean logic seemed promising, as it could allow us to generate boolean condition functions by implementing Equation (A.16) using SC.
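As a small illustration of the UP encoding (a sketch which assumes long, independent random bitstreams, and is not part of ESTA), bitwise AND of two UP bitstreams approximates multiplication of the encoded values:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000  # bitstream length

def up_encode(value, n):
    """Unipolar encoding: each bit is 1 with probability equal to the encoded value."""
    return rng.random(n) < value

a = up_encode(0.6, n)
b = up_encode(0.5, n)
product = a & b           # an AND gate multiplies independent UP bitstreams
print(product.mean())     # approximately 0.6 * 0.5 = 0.3
```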

A.2.4 Stochastic Logic Synthesis

A variety of approaches have been proposed to perform SC synthesis. STRAUSS [250] introduces a flexible approach to performing SC synthesis, which allows multilinear functions over the real numbers to be synthesized to boolean logic. STRAUSS describes how the Walsh-Hadamard transformation (a Fourier transform using the Walsh functions as a basis) can be used to either analyze (e.g. determine the characteristic equation defining the SC behaviour of a boolean function) or synthesize a boolean function (e.g. convert from an equation over the real numbers to a boolean function). In practice this means that applying the inverse Walsh-Hadamard transform to an IBP multilinear equation yields the truth table (TT) of the corresponding function. [250] notes three possible cases for the resulting TT:

1. The elements are +1 (false) or -1 (true), and the function can be directly converted to a boolean circuit.

2. The elements fall within [-1,+1]. Any element not equal to +1 or -1 is treated as a stochastic constant, and additional inputs and logic are required to generate it.

3. The elements fall outside [-1,+1]. Such a function is said to not be SC-implementable (although it is possible a scaled version may be SC-implementable).

As an example, consider the case of multiplying the two SC numbers $X_1$ and $X_2$ in UP format:

\[
F(X_1, X_2) = X_1 X_2 \tag{A.17}
\]

To synthesize this characteristic equation we first convert it into IBP format using the relation $\hat{X} = 1 - 2X$ (where X is the value in UP, and $\hat{X}$ is in IBP):

\[
\frac{1 - \hat{F}(\hat{X}_1, \hat{X}_2)}{2} = \left(\frac{1 - \hat{X}_1}{2}\right)\left(\frac{1 - \hat{X}_2}{2}\right) \implies
\hat{F}(\hat{X}_1, \hat{X}_2) = \frac{1}{2}\left(1 + \hat{X}_1 + \hat{X}_2 - \hat{X}_1 \hat{X}_2\right) \tag{A.18}
\]

We then apply the inverse Walsh-Hadamard transform to yield the TT $\vec{f}$:

\[
\vec{f} = \mathcal{F}^{-1}(\vec{\hat{F}}) = H_n \vec{\hat{F}} \tag{A.19}
\]

where $H_n$ is the Hadamard matrix of size n, and $\vec{\hat{F}}$ is the coefficients of $\hat{F}$ in appropriate vector form [250]. This yields:
\[
\vec{f} =
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}
\begin{bmatrix} 0.5 \\ 0.5 \\ 0.5 \\ -0.5 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 1 \\ 1 \\ -1 \end{bmatrix} \tag{A.20}
\]
which is the truth table for an AND gate where only the 4th minterm is set (in IBP -1 is true, and 1 is false). This is the known implementation of a UP multiplier [250].
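The transform in (A.19) and (A.20) can be reproduced numerically in a few lines of numpy. This sketch assumes the coefficient ordering $[1, \hat{X}_2, \hat{X}_1, \hat{X}_1\hat{X}_2]$ implied by (A.20):

```python
import numpy as np

# Sylvester-construction Hadamard matrix H_4 (matches the matrix in Equation (A.20)).
H2 = np.array([[1, 1], [1, -1]])
H4 = np.kron(H2, H2)

# Coefficients of F_hat = 0.5*(1 + X1_hat + X2_hat - X1_hat*X2_hat),
# in the (assumed) basis ordering [1, X2_hat, X1_hat, X1_hat*X2_hat].
F_hat = np.array([0.5, 0.5, 0.5, -0.5])

f = H4 @ F_hat   # inverse Walsh-Hadamard transform, yielding the IBP truth table
print(f)         # -> [ 1.  1.  1. -1.]: the AND gate / UP multiplier of (A.20)
```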

A.2.5 Combining Stochastic Synthesis & Cholesky Decomposition to Generate ESTA Condition Functions

The hope was that Stochastic Synthesis would allow us to convert the correlation transformations determined with the Cholesky decomposition into a set of boolean functions. While each component (Cholesky decomposition and Stochastic Synthesis) works correctly in isolation, combining them together was ultimately unsuccessful. In particular, the following issues limited this approach:

1. The transformations produced by Cholesky were not always SC-implementable (e.g. the TT coefficients fall outside [-1,+1]).

2. Mutually exclusive conditions require non-linear characteristic equations, while Cholesky always yields linear equations (even in the presence of mutual exclusivity).

Additionally, a large number of stochastic constants were required, which requires a large number of additional inputs (boolean variables) and hurts ESTA’s scalability. Although many of the stochastic constant values were the same, the stochastic constant generators could not be shared among condition functions, since that would introduce spurious correlations.

Independent Inputs Counter Example

Consider the following example, where we have two inputs a and b which are independent (i.e. uncorrelated) with each transition (e.g. R, F, H, L) having uniform probability on each input. The corresponding $E[x \cdot x^T]$ is then:

\[
\begin{array}{c|cccccccc}
    & a_R & a_F & a_H & a_L & b_R & b_F & b_H & b_L \\
\hline
a_R & 1/4 & 0 & 0 & 0 & 1/16 & 1/16 & 1/16 & 1/16 \\
a_F & 0 & 1/4 & 0 & 0 & 1/16 & 1/16 & 1/16 & 1/16 \\
a_H & 0 & 0 & 1/4 & 0 & 1/16 & 1/16 & 1/16 & 1/16 \\
a_L & 0 & 0 & 0 & 1/4 & 1/16 & 1/16 & 1/16 & 1/16 \\
b_R & 1/16 & 1/16 & 1/16 & 1/16 & 1/4 & 0 & 0 & 0 \\
b_F & 1/16 & 1/16 & 1/16 & 1/16 & 0 & 1/4 & 0 & 0 \\
b_H & 1/16 & 1/16 & 1/16 & 1/16 & 0 & 0 & 1/4 & 0 \\
b_L & 1/16 & 1/16 & 1/16 & 1/16 & 0 & 0 & 0 & 1/4 \\
\end{array}
\]

This scenario with uniform probabilities across independent inputs has a known set of condition functions (Section 4.3.4):

\[
\begin{aligned}
f_{a_R}(x_1, x_1') &= \overline{x_1} \wedge x_1' & \qquad f_{a_F}(x_1, x_1') &= x_1 \wedge \overline{x_1'} \\
f_{a_H}(x_1, x_1') &= x_1 \wedge x_1' & \qquad f_{a_L}(x_1, x_1') &= \overline{x_1} \wedge \overline{x_1'} \\
f_{b_R}(x_2, x_2') &= \overline{x_2} \wedge x_2' & \qquad f_{b_F}(x_2, x_2') &= x_2 \wedge \overline{x_2'} \\
f_{b_H}(x_2, x_2') &= x_2 \wedge x_2' & \qquad f_{b_L}(x_2, x_2') &= \overline{x_2} \wedge \overline{x_2'}
\end{aligned}
\tag{A.21}
\]

where the variables $x_1$, $x_1'$, $x_2$ and $x_2'$ are independent boolean variables. If the SC synthesis approach is to be successful, it should yield an equivalent set of condition functions for this basic scenario. One approach to verifying whether this is possible is to determine the characteristic equations of these known condition functions (via the Walsh-Hadamard transform), and see if the resulting characteristic equation could be constructed.

Consider the first condition function $f_{a_R} = \overline{x_1} \wedge x_1'$. Its corresponding characteristic equation is:

\[
F_{a_R} = 0.5 X_0 X_1 - 0.5 X_0 + 0.5 X_1 + 0.5 \tag{A.22}
\]

Notably, the characteristic equation is non-linear, containing the product term $X_0 X_1$. Such non-linear terms cannot be constructed via a linear transformation (such as the Cholesky decomposition). This result indicates that some type of non-linear transformation is required to generate stochastic condition functions with mutually exclusive states. Provided such a transformation could be determined, the stochastic synthesis approach outlined above would be a feasible approach to generating ESTA conditioning functions. However the exact type or form of such a non-linear transformation remains unclear.

One potential direction for future work in this area would be to investigate whether product terms such as $X_0 X_1$ in (A.22) could be constructed by synthesizing multi-level stochastic circuits, for instance by defining $B = X_0 X_1$ (an SC multiplier) and feeding that into the next level of the SC equation. Applied to (A.22) this would yield the multi-level equation:

\[
\begin{aligned}
B &= X_0 X_1 \\
F_{a_R} &= 0.5B - 0.5X_0 + 0.5X_1 + 0.5
\end{aligned}
\tag{A.23}
\]

However, the correctness of this approach, and how it could be used for synthesis requires further investigation and evaluation.

A.3 Conclusion

In conclusion, we have investigated how ESTA condition functions subject to arbitrary correlation constraints could be synthesized. In Section A.1 we described an ILP formulation for synthesizing condition functions. While this formulation works, it takes a brute force approach which fails to scale beyond 2 or 3 variables. In Section A.2 we described a Stochastic Computing approach to construct condition functions directly from their specification (covariance matrix). This approach fails to produce the correct condition functions, at least in part since generating independent condition functions requires non-linear characteristic SC equations. However further investigation is needed, with multi-level SC circuits showing some potential promise.

Bibliography

[1] K. E. Murray, et al. “Symbiflow & VPR: An Open-Source Design Flow for Commercial and Novel FPGAs.” IEEE Micro, 2020, doi:10.1109/MM.2020.2998435. To Appear.

[2] K. E. Murray, O. Petelin, et al. “VTR 8: High Performance CAD and Customizable FPGA Architecture Modelling.” ACM Transactions on Reconfigurable Technology and Systems, 2020. To Appear.

[3] K. E. Murray, A. Suardi, V. Betz, and G. Constantinides. “Calculated Risks: Quantifying Timing Error Probability With Extended Static Timing Analysis.” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 38 (4), pp. 719–732, 2019.

[4] K. E. Murray, S. Zhong, and V. Betz. “AIR: A Fast but Lazy Timing-Driven FPGA Router.” In Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 1–7. 2020.

[5] K. E. Murray and V. Betz. “Adaptive FPGA Placement Optimization via Reinforcement Learning.” In ACM/IEEE Workshop on Machine Learning for CAD (MLCAD), pp. 1–6. 2019.

[6] K. E. Murray and V. Betz. “Tatum: Parallel Timing Analysis for Faster Design Cycles and Improved Optimization.” In Int. Conf. on Field-Programmable Technology (FPT). 2018.

[7] K. E. Murray, A. Suardi, V. Betz, and G. Constantinides. “Quantifying error: Extending static timing analysis with probabilistic transitions.” In Design, Automation Test in Europe Conference Exhibition (DATE), 2017, pp. 1486–1491. 2017, doi:10.23919/DATE.2017.7927226.

[8] J. Richardson et al. “Comparative Analysis of HPC and Accelerator Devices: Computation, Memory, I/O, and Power.” In Int. Workshop on High-Performance Reconfigurable Comput. Technol. and Appl. (HPRCTA), pp. 1–10. 2010.

[9] I. Kuon and J. Rose. “Measuring the gap between FPGAs and ASICs.” In Proceedings of the international symposium on Field programmable gate arrays - FPGA’06, p. 21. ACM Press, New York, New York, USA, 2006, doi:10.1145/1117201.1117205.

[10] K. E. Murray, S. Whitty, S. Liu, J. Luu, and V. Betz. “Timing-Driven Titan: Enabling Large Benchmarks and Exploring the Gap Between Academic and Commercial CAD.” ACM Trans. Reconfigurable Technol. Syst., 8 (2), pp. 10:1–10:18, 2015.

Stratix III Device Family Overview. Altera Corp., SIII51001-1.8, 2010. https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/stx3/stx3_siii51001.pdf.


[12] Intel Stratix 10 GX/SX Product Table. Intel Corp., Gen-1023-1.8, 2003. https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/pt/stratix-10-product-table.pdf.

[13] Fourth Quarter 2008 SPEC CPU2006 Results. Standard Performance Evaluation Corporation, 2008. https://www.spec.org/cpu2006/results/res2008q4/.

[14] Fourth Quarter 2019 SPEC CPU2017 Results. Standard Performance Evaluation Corporation, 2019. https://www.spec.org/cpu2017/results/res2019q4/.

“Speedcore eFPGAs.”, Achronix, PB02B v1.7. 2019. https://www.achronix.com/sites/default/files/docs/Speedcore_eFPGA_Product_Brief_PB028.pdf.

“EFLX Embedded FPGA.”, FlexLogic, 2020. https://flex-logix.com/wp-content/uploads/2020/04/2020-04-EFLX-4-page-Overview-TGF.pdf.

“ArcticPro.”, QuickLogic, 2019. https://www.quicklogic.com/wp-content/uploads/2019/09/ArcticPro-Flyer_June-2019_Online_FINAL.pdf.

“Reconfigurable Imaging (ReImagine).”, DARPA, 2016. https://www.darpa.mil/attachments/Final_Compiled_ReImagineProposersDay.pdf.

“Reconfigurable Integrated Circuits for ReImagine.”, MIT Lincoln Lab, 2016. https://www.darpa.mil/attachments/MITLL_ProposerDaySlides%20v3.pdf.

[20] B. Gaide, D. Gaitonde, C. Ravishankar, and T. Bauer. “Xilinx Adaptive Compute Acceleration Platform: Versal Architecture.” In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’19, pp. 84–93. Association for Computing Machinery, New York, NY, USA, 2019, doi:10.1145/3289602.3293906.

[21] R. Ho, K. W. Mai, and M. A. Horowitz. “The Future of Wires.” Proceedings of the IEEE, 89 (4), pp. 490–504, 2001.

[22] I. Swarbrick, D. Gaitonde, S. Ahmad, B. Gaide, and Y. Arbel. “Network-on-Chip Programmable Platform in Versal ACAP Architecture.” In ACM/SIGDA Int. Symp. on Field-programmable Gate Arrays (FPGA), pp. 212–221. 2019.

[23] B. Gaide, D. Gaitonde, C. Ravishankar, and T. Bauer. “Xilinx Adaptive Compute Acceleration Platform: Versal Architecture.” In ACM/SIGDA Int. Symp. on Field-programmable Gate Arrays (FPGA), FPGA ’19, pp. 84–93. 2019.

[24] K. P. Niu and J. H. Anderson. “Compact Area and Performance Modelling for CGRA Architecture Evaluation.” In FPT. 2018.

[25] K. Shi, D. Boland, and G. A. Constantinides. “Accuracy-Performance Tradeoffs on an FPGA through Overclocking.” In IEEE FCCM, pp. 29–36. 2013, doi:10.1109/FCCM.2013.10.

[26] J. Rose, et al. “The VTR Project: Architecture and CAD for FPGAs from Verilog to Routing.” In ACM/SIGDA Int. Symp. on Field Programmable Gate Arrays (FPGA), pp. 77–86. 2012.

[27] J. Luu, et al. “VTR 7.0: Next Generation Architecture and CAD System for FPGAs.” ACM Trans. Reconfigurable Technol. Syst., 7 (2), pp. 6:1–6:30, 2014.

[28] K. E. Murray. Divide-and-conquer Techniques for Large Scale FPGA Design. Master’s thesis, Department of Electrical and Computer Engineering, University of Toronto, 2015. http://hdl.handle.net/1807/69673.

[29] “Standard Cell ASIC to FPGA Design Methodology and Guidelines.” Technical report, Altera Corporation, 2009.

[30] N. Azizi, I. Kuon, A. Egier, A. Darabiha, and P. Chow. “Reconfigurable Molecular Dynamics Simulator.” In 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 197–206. IEEE, 2004, doi:10.1109/FCCM.2004.48.

[31] J. Cassidy, L. Lilge, and V. Betz. “Fast, Power-Efficient Biophotonic Simulations for Cancer Treatment Using FPGAs.” In Int. Symp. on Field-Programmable Custom Computing Machines (FCCM), pp. 133–140. IEEE Computer Society, 2014, doi:10.1109/FCCM.2014.43.

[32] A. Putnam, et al. “A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services.” In 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), pp. 13–24. IEEE, 2014, doi:10.1109/ISCA.2014.6853195.

[33] “Implementing FPGA Design with the OpenCL Standard.” Technical report, Altera Corporation, 2012.

[34] W. Zhang, V. Betz, and J. Rose. “Portable and scalable FPGA-based acceleration of a direct linear system solver.” ACM TRETS, 5 (1), pp. 6:1–6:26, 2012.

[35] E. Chung, et al. “Serving DNNs in Real Time at Datacenter Scale with Project Brainwave.” IEEE Micro, 38 (2), pp. 8–20, 2018.

[36] A. S. Marquardt, V. Betz, and J. Rose. “Using cluster-based logic blocks and timing-driven packing to improve FPGA speed and density.” In Proceedings of the 1999 ACM/SIGDA seventh international symposium on Field programmable gate arrays - FPGA ’99, pp. 37–46. ACM Press, New York, New York, USA, 1999, doi:10.1145/296399.296426.

[37] V. Betz and J. Rose. “Cluster-based logic blocks for FPGAs: Area-efficiency vs. input sharing and size.” In IEEE Custom Integrated Circuits Conference, pp. 551–554. IEEE, 1997.

[38] V. Betz, J. Rose, and A. Marquardt. Architecture and CAD for Deep-Submicron FPGAs. Kluwer Academic Publishers, 1999.

[39] K. E. Murray, et al. “Optimizing FPGA Logic Block Architectures for Arithmetic.” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 28 (6), pp. 1378–1391, 2020, doi:10.1109/TVLSI.2020.2965772.

[40] D. Singh, V. Manohararajah, and S. Brown. “Two-stage Physical Synthesis for FPGAs.” In Proceedings of the IEEE 2005 Custom Integrated Circuits Conference, 2005., pp. 170–177. IEEE, 2005, doi:10.1109/CICC.2005.1568635.

[41] D. Chen, J. Cong, and P. Pan. “FPGA Design Automation: A Survey.” Foundations and Trends in Electronic Design Automation, 1 (3), pp. 195–334, 2006, doi:10.1561/1000000003.

[42] A. Canis, et al. “LegUp: High-level synthesis for FPGA-based processor/accelerator systems.” In FPGA, pp. 33–36. 2011.

[43] “Vivado Design Suite User Guide: High-Level Synthesis.” Technical report, Xilinx Incorporated, 2014.

[44] D. Chen, et al. “A Comprehensive Approach to Modeling, Characterizing and Optimizing for Metastability in FPGAs.” In ACM/SIGDA Int. Symp. on Field Programmable Gate Arrays (FPGA), pp. 167–176. 2010.

[45] Y.-C. Hsu, S. Sun, and D. H. C. Du. “Finding the longest simple path in cyclic combinational circuits.” In ICCAD. 1998.

[46] V. Betz and J. Rose. “VPR: A new packing, placement and routing tool for FPGA research.” In Field-Programmable Logic and Appl., pp. 213–222. 1997.

[47] A. Marquardt, V. Betz, and J. Rose. “Timing-driven Placement for FPGAs.” In FPGA, pp. 203–213. ACM, New York, NY, USA, 2000.

[48] M. Wainberg and V. Betz. “Robust Optimization of Multiple Timing Constraints.” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 34 (12), pp. 1942–1953, 2015.

[49] J. Lamoureux and S. J. E. Wilton. “On the Interaction Between Power-Aware FPGA CAD Algorithms.” In Proceedings of the 2003 IEEE/ACM International Conference on Computer-Aided Design, ICCAD ’03, p. 701. IEEE Computer Society, USA, 2003.

[50] G. Chen, et al. “RippleFPGA: Routability-Driven Simultaneous Packing and Placement for Modern FPGAs.” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37 (10), pp. 2022–2035, 2018.

[51] W. Li, S. Dhar, and D. Z. Pan. “UTPlaceF: A Routability-Driven FPGA Placer With Physical and Congestion Aware Packing.” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37 (4), pp. 869–882, 2018.

[52] Z. Abuowaimer, et al. “GPlace3.0: Routability-Driven Analytic Placer for UltraScale FPGA Architectures.” ACM Trans. Des. Autom. Electron. Syst., 23 (5), 2018, doi:10.1145/3233244.

[53] M. Hutton, V. Betz, and J. Anderson. “FPGA Synthesis and Physical Design.” In Electronic Design Automation for IC Implementation, Circuit Design, and Process Technology, chapter 16. CRC Press, 2nd edition, 2016.

[54] “Spectra-Q Engine.” Technical Report BG-1002-1.1, Altera Corp., 2015. https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/backgrounder/spectra-q-engine-backgrounder.pdf.

[55] C. Sechen and A. Sangiovanni-Vincentelli. “The TimberWolf placement and routing package.” IEEE Journal of Solid-State Circuits, 20 (2), pp. 510–522, 1985.

[56] M. Gort and J. H. Anderson. “Analytical placement for heterogeneous FPGAs.” In Int. Conf. on Field Programmable Logic and Applications, pp. 143–150. 2012, doi:10.1109/FPL.2012.6339278.

[57] D. Vercruyce, E. Vansteenkiste, and D. Stroobandt. “Liquid: High quality scalable placement for large heterogeneous FPGAs.” In Int. Conf. on Field Programmable Technology (FPT), pp. 17–24. 2017.

[58] W. Li, Y. Lin, and D. Z. Pan. “elfPlace: Electrostatics-based Placement for Large-Scale Heterogeneous FPGAs.” In 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–8. 2019.

[59] S. Brown, J. Rose, and Z. G. Vranesic. “A detailed router for field-programmable gate arrays.” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 11 (5), pp. 620–628, 1992.

[60] L. McMurchie and C. Ebeling. “PathFinder: A Negotiation-based Performance-driven Router for FPGAs.” In Field Programmable Gate Arrays, pp. 111–117. 1995.

[61] S. M. Trimberger. “Three Ages of FPGAs: A Retrospective on the First Thirty Years of FPGA Technology.” Proceedings of the IEEE, 103 (3), 2015.

[62] A. Ludwin and V. Betz. “Efficient and Deterministic Parallel Placement for FPGAs.” ACM Transactions on Design Automation of Electronic Systems, 16 (3), pp. 22:1–22:23, 2011.

[63] A. B. Kahng. “New Game, New Goal Posts: A Recent History of Timing Closure.” In DAC, pp. 4:1–4:6. 2015.

[64] C. Ebeling et al. “Stratix 10 High Performance Routable Clock Networks.” In Int. Symp. on FPGAs, pp. 64–73. 2016.

[65] M. Hutton et al. “Efficient Static Timing Analysis and Applications Using Edge Masks.” In FPGA, pp. 174–183. 2005.

[66] D. Lewis, et al. “Architectural Enhancements in Stratix V.” In Field Programmable Gate Arrays, pp. 147–156. 2013.

[67] I. Ganusov and B. Devlin. “Time-borrowing platform in the Xilinx UltraScale+ family of FPGAs and MPSoCs.” In 2016 26th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–9. 2016.

[68] D. Lewis, G. Chiu, et al. “The Stratix 10 Highly Pipelined FPGA Architecture.” In FPGA, pp. 159–168. 2016.

[69] R. Fung, V. Betz, and W. Chow. “Slack Allocation and Routing to Improve FPGA Timing While Repairing Short-Path Violations.” IEEE TCAD, 27 (4), pp. 686–697, 2008, doi:10.1109/TCAD.2008.917585.

[70] J. B. Goeders, G. G. F. Lemieux, and S. J. E. Wilton. “Deterministic Timing-Driven Parallel Placement by Simulated Annealing Using Half-Box Window Decomposition.” In ReConFig, pp. 41–48. 2011.

[71] M. An, J. G. Steffan, and V. Betz. “Speeding Up FPGA Placement: Parallel Algorithms and Methods.” In Int. Symp. on Field-Programmable Custom Computing Machines (FCCM), pp. 178–185. 2014.

[72] C. Fobel, G. Grewal, and D. Stacey. “A scalable, serially-equivalent, high-quality parallel placement methodology suitable for modern multicore and GPU architectures.” In Int. Conf. on Field Programmable Logic and Applications (FPL), pp. 1–8. 2014.

[73] M. Gort and J. H. Anderson. “Deterministic multi-core parallel routing for FPGAs.” In FPT, pp. 78–86. 2010, doi:10.1109/FPT.2010.5681758.

[74] C. H. Hoo, A. Kumar, and Y. Ha. “ParaLaR: A parallel FPGA router based on Lagrangian relaxation.” In Int. Conf. on Field Programmable Logic and Applications (FPL), pp. 1–6. 2015.

[75] M. Stojilovic. “Parallel FPGA Routing: Survey and Challenges.” In Field Programmable Logic and Appl., pp. 1–8. 2017, doi:10.23919/FPL.2017.8056782.

[76] M. Shen and G. Luo. “Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion.” In Field Programmable Gate Arrays, pp. 105–114. 2017.

[77] J. Bhasker and R. Chadha. Static Timing Analysis for Nanometer Designs: A Practical Approach. Springer, 2009.

[78] D. Blaauw, K. Chopra, A. Srivastava, and L. Scheffer. “Statistical Timing Analysis: From Basic Principles to State of the Art.” IEEE Trans. Comput.-Aided Design Integ. Circuits. Syst., 27 (4), pp. 589–607, 2008, doi:10.1109/TCAD.2007.907047.

[79] W. Donath and D. Hathaway. “Distributed Static Timing Analysis.” US Patent 6,557,151, 1998.

[80] A. Holder et al. “Prototype for a large-scale static timing analyzer running on an IBM Blue Gene.” In IEEE Int. Sym. on Paral. Distr. Proc., pp. 1–8. 2010.

[81] V. Veetil et al. “Efficient Monte Carlo based incremental statistical timing analysis.” In DAC, pp. 676–681. 2008.

[82] T.-W. Huang and M. D. F. Wong. “OpenTimer: A High-Performance Timing Analysis Tool.” In ICCAD, pp. 895–902. 2015.

[83] Y. Deng, B. D. Wang, and S. Mu. “Taming irregular EDA applications on GPUs.” In ICCAD, p. 539. 2009.

[84] K. Gulati et al. “Accelerating statistical static timing analysis using graphics processing units.” In ASP-DAC, pp. 260–265. 2009.

[85] Y. Shen and J. Hu. “GPU acceleration for PCA-based statistical static timing analysis.” In ICCAD, pp. 674–679. 2015.

[86] H. Wang et al. “CASTA: CUDA-Accelerated Static Timing Analysis for VLSI Designs.” In Int. Conf. on Paral. Proc., pp. 192–200. 2014.

[87] J. L. Hennessy and D. A. Patterson. “Memory Hierarchy Design.” In Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 5th edition, 2011.

[88] J. L. Hennessy and D. A. Patterson. “Data-Level Parallelism in Vector, SIMD, and GPU Architectures.” In Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 5th edition, 2011.

[89] “Cilk Plus.”, Intel Corporation, 2017. www.cilkplus.org.

[90] T. David, R. Guerraoui, and V. Trigonakis. “Everything You Always Wanted to Know about Synchronization but Were Afraid to Ask.” In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pp. 33–48. Association for Computing Machinery, New York, NY, USA, 2013, doi:10.1145/2517349.2522714.

[91] R. Fung and V. Betz. “Optimizing long-path and short-path timing and accounting for manufacturing and operating condition variability.” US Patent 7,254,789, 2007.

[92] M. McCool, J. Reinders, and A. Robison. Structured Parallel Programming: Patterns for Efficient Computation. Morgan Kaufmann, 2012.

[93] “TAU 2015 Contest Resources.”, https://sites.google.com/site/taucontest2015, 2015. Accessed: April 12, 2018.

[94] V. Betz and J. Rose. “VPR: A new packing, placement and routing tool for FPGA research.” In FPL, pp. 213–222. 1997.

[95] S. Yang. “Logic Synthesis and Optimization Benchmarks User Guide 3.0.” Technical report, MCNC, 1991.

[96] K. Eguro and S. Hauck. “Enhancing Timing-driven FPGA Placement for Pipelined Netlists.” In DAC, pp. 34–37. 2008.

[97] S. Sapatnekar. “Static Timing Analysis.” In L. Lavagno, L. Scheffer, and G. Martin, eds., EDA for IC Implementation, Circuit Design, and Process Technology, chapter 6. CRC Press, 2006.

[98] S. R. Nassif. “Modeling and forecasting of manufacturing variations.” In Int. Workshop on Statistical Metrology, pp. 2–10. 2000, doi:10.1109/IWSTM.2000.869299.

[99] J. Han and M. Orshansky. “Approximate computing: An emerging paradigm for energy-efficient design.” In European Test Symp., pp. 1–6. 2013, doi:10.1109/ETS.2013.6569370.

[100] K. Shi, D. Boland, E. Stott, S. Bayliss, and G. A. Constantinides. “Datapath Synthesis for Overclocking: Online Arithmetic for Latency-Accuracy Trade-offs.” In DAC, pp. 190:1–190:6. 2014, doi:10.1145/2593069.2593118.

[101] H. C. Chen and D. H. C. Du. “Path sensitization in critical path problem.” In ICCAD Digest of Technical Papers, pp. 208–211. 1991, doi:10.1109/ICCAD.1991.185233.

[102] D. H. C. Du, S. H. C. Yen, and S. Ghanta. “On the General False Path Problem in Timing Analysis.” In DAC, pp. 555–560. 1989, doi:10.1145/74382.74475.

[103] R. I. Bahar, H. Cho, G. D. Hachtel, E. Macii, and F. Somenzi. “Timing analysis of combinational circuits using ADDs.” In Proceedings of European Design and Test Conference EDAC-ETC-EUROASIC, pp. 625–629. 1994, doi:10.1109/EDTC.1994.326813.

[104] P. Ashar and S. Malik. “Functional timing analysis using ATPG.” IEEE Trans. Comput.-Aided Design Integ. Circuits. Syst., 14 (8), pp. 1025–1030, 1995, doi:10.1109/43.402501.

[105] Y.-T. Chung and J.-H. R. Jiang. “Functional Timing Analysis Made Fast and General.” IEEE Trans. Comput.-Aided Design Integ. Circuits. Syst., 32 (9), pp. 1421–1434, 2013.

[106] B. Liu. “Signal Probability Based Statistical Timing Analysis.” In DATE, pp. 562–567. 2008, doi:10.1109/DATE.2008.4484736.

[107] S. Devadas, K. Keutzer, and S. Malik. “Computation of floating mode delay in combinational circuits: theory and algorithms.” IEEE Trans. Comput.-Aided Design Integ. Circuits. Syst., 12 (12), pp. 1913–1923, 1993, doi:10.1109/43.251155.

[108] C. P. Gomes, A. Sabharwal, and B. Selman. “Model Counting.” In A. Biere, M. Heule, and H. van Maaren, eds., Handbook of satisfiability, chapter 20. IOS press, 2009.

[109] R. E. Bryant. “Symbolic Boolean manipulation with ordered binary-decision diagrams.” ACM Computing Surveys, 24 (3), pp. 293–318, 1992.

[110] F. Somenzi. “CUDD: CU Decision Diagram package release 2.5.1.”, University of Colorado at Boulder, 2015. http://vlsi.colorado.edu/~fabio/CUDD.

[111] C. Visweswariah, et al. “First-Order Incremental Block-Based Statistical Timing Analysis.” IEEE Trans. Comput.-Aided Design Integ. Circuits. Syst., 25 (10), pp. 2170–2180, 2006, doi:10.1109/TCAD.2005.862751.

[112] P. C. McGeer and R. K. Brayton. Integrating Functional and Temporal Domains in Logic Design: The False Path Problem and Its Implications. Kluwer Academic Publishers, Norwell, MA, USA, 1991.

[113] R. Rudell. “Dynamic Variable Ordering for Ordered Binary Decision Diagrams.” In ICCAD, pp. 42–47. 1993.

[114] J. Bhasker and R. Chadha. Static Timing Analysis for Nanometer Designs: A Practical Approach. Springer Science + Business Media, New York, NY, USA, 2009.

[115] S. Ramprasad, N. R. Shanbhag, and I. N. Hajj. “Analytical estimation of signal transition activity from word-level statistics.” IEEE Trans. Comput.-Aided Design Integ. Circuits. Syst., 16 (7), pp. 718–733, 1997, doi:10.1109/43.644033.

[116] J. Zejda and P. Frain. “General framework for removal of clock network pessimism.” In ICCAD, pp. 632–639. 2002, doi:10.1109/ICCAD.2002.1167599.

[117] T.-W. Huang, P.-C. Wu, and M. D. F. Wong. “UI-timer: An Ultra-fast Clock Network Pessimism Removal Algorithm.” In ICCAD, pp. 758–765. 2014.

[118] C. J. Clopper and E. S. Pearson. “The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial.” Biometrika, 26 (4), pp. 404–413, 1934, doi:10.1093/biomet/26.4.404.

[119] S. Seabold and J. Perktold. “statsmodels: Econometric and statistical modeling with python.” In 9th Python in Science Conference. 2010.

[120] S. Greenland, et al. “Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations.” European Journal of Epidemiology, 31, 2016, doi:10.1007/s10654-016-0149-3.

[121] Y. Rubner, C. Tomasi, and L. J. Guibas. “The earth mover’s distance as a metric for image retrieval.” Int. J. of Computer Vision, 40 (2), pp. 99–121, 2000.

[122] D. Chen, J. Cong, and P. Pan. “FPGA Design Automation: A Survey.” Found. Trends Electron. Des. Autom., 1 (3), pp. 139–169, 2006.

[123] C. Chiasson and V. Betz. “COFFE: Fully-automated transistor sizing for FPGAs.” In Int. Conf. on Field-Programmable Technology (FPT), pp. 34–41. 2013.

[124] F. F. Khan and A. Ye. “An evaluation on the accuracy of the minimum width transistor area models in ranking the layout area of FPGA architectures.” In Int. Conf. on Field Programmable Logic and Applications (FPL), pp. 1–11. 2016.

[125] J. B. Goeders and S. J. E. Wilton. “VersaPower: Power estimation for diverse FPGA architectures.” In Int. Conf. on Field-Programmable Technology (FPT), pp. 229–234. 2012.

[126] J. Luu, et al. “VPR 5.0: FPGA CAD and Architecture Exploration Tools with Single-driver Routing, Heterogeneity and Process Scaling.” ACM Trans. Reconfigurable Technol. Syst., 4 (4), pp. 32:1–32:23, 2011.

[127] D. Lewis, et al. “The Stratix Routing and Logic Architecture.” In Field Programmable Gate Arrays, pp. 12–20. 2003.

[128] N. Steiner, et al. “Torc: Towards an Open-source Tool Flow.” In ACM/SIGDA Int. Symp. on Field Programmable Gate Arrays (FPGA), pp. 41–44. 2011.

[129] C. Lavin, et al. “RapidSmith: Do-It-Yourself CAD Tools for Xilinx FPGAs.” In Int. Conf. on Field Programmable Logic and Applications (FPL), pp. 349–355. 2011.

[130] T. Haroldsen, B. Nelson, and B. Hutchings. “RapidSmith 2: A Framework for BEL-level CAD Exploration on Xilinx FPGAs.” In ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays (FPGA), pp. 66–69. 2015.

[131] C. Lavin. “Building domain-specific implementation tools using the RapidWright framework.” In FPGA. 2019.

[132] A. Sharma, S. Hauck, and C. Ebeling. “Architecture-adaptive routability-driven placement for FPGAs.” In Int. Conf. on Field Programmable Logic and Applications (FPL), pp. 427–432. 2005.

[133] X. Niu, W. Luk, and Y. Wang. “EURECA: On-Chip Configuration Generation for Effective Dynamic Data Access.” In ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays (FPGA), pp. 74–83. 2015.

[134] G. Zgheib and P. Ienne. “Evaluating FPGA clusters under wide ranges of design parameters.” In Int. Conf. on Field Programmable Logic and Applications (FPL), pp. 1–8. 2017.

[135] S. Yazdanshenas and V. Betz. “COFFE 2: Automatic Modelling and Optimization of Complex and Heterogeneous FPGA Architectures.” ACM Trans. Reconfigurable Technol. Syst., 12 (1), pp. 3:1–3:27, 2019.

[136] H. Parandeh-Afshar, H. Benbihi, D. Novo, and P. Ienne. “Rethinking FPGAs: Elude the Flexibility Excess of LUTs with And-inverter Cones.” In ACM/SIGDA Int. Symp. on Field Programmable Gate Arrays (FPGA), pp. 119–128. 2012.

[137] G. Zgheib, et al. “Revisiting And-inverter Cones.” In ACM/SIGDA Int. Symp. on Field-programmable Gate Arrays (FPGA), pp. 45–54. 2014.

[138] Z. Huang, et al. “NAND-NOR: A Compact, Fast, and Delay Balanced FPGA Logic Element.” In ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays (FPGA), pp. 135–140. 2017.

[139] S. A. Chin, J. Luu, S. Huda, and J. H. Anderson. “Hybrid LUT/Multiplexer FPGA Logic Architectures.” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 24 (4), pp. 1280–1292, 2016.

[140] I. Ahmadpour, B. Khaleghi, and H. Asadi. “An efficient reconfigurable architecture by characterizing most frequent logic functions.” In Int. Conf. on Field Programmable Logic and Applications (FPL), pp. 1–6. 2015.

[141] Z. Ebrahimi, B. Khaleghi, and H. Asadi. “PEAF: A Power-Efficient Architecture for SRAM-Based FPGAs Using Reconfigurable Hard Logic Design in Dark Silicon Era.” IEEE Transactions on Computers, 66 (6), pp. 982–995, 2017.

[142] J. Luu, et al. “On Hard Adders and Carry Chains in FPGAs.” In Int. Symp. on Field-Programmable Custom Computing Machines (FCCM), pp. 52–59. 2014.

[143] O. Petelin and V. Betz. “The Speed of Diversity: Exploring Complex FPGA Routing Topologies for the Global Metal Layer.” In Field Programmable Logic and Appl., pp. 1–10. 2016.

[144] K. Siozios and D. Soudris. “A Customizable Framework for Application Implementation onto 3-D FPGAs.” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 35 (11), pp. 1783–1796, 2016.

[145] E. Nasiri, J. Shaikh, A. H. Pereira, and V. Betz. “Multiple Dice Working as One: CAD Flows and Routing Architectures for Silicon Interposer FPGAs.” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 24 (5), pp. 1821–1834, 2016.

[146] E. Kadric, D. Lakata, and A. DeHon. “Impact of Memory Architecture on FPGA Energy Con- sumption.” In ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays (FPGA), pp. 146–155. 2015.

[147] S. Yazdanshenas, K. Tatsumura, and V. Betz. “Don’t Forget the Memory: Automatic Block RAM Modelling, Optimization, and Architecture Exploration.” In ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays (FPGA), pp. 115–124. 2017.

[148] M. Gao, et al. “DRAF: A Low-power DRAM-based Reconfigurable Acceleration Fabric.” SIGARCH Comput. Archit. News, 44 (3), pp. 506–518, 2016.

[149] N. Kulkarni, J. Yang, and S. Vrudhula. “A fast, energy efficient, field programmable threshold-logic array.” In Int. Conf. on Field-Programmable Technology (FPT), pp. 300–305. 2014.

[150] K. Huang, R. Zhao, W. He, and Y. Lian. “High-Density and High-Reliability Nonvolatile Field- Programmable Gate Array With Stacked 1D2R RRAM Array.” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 24 (1), pp. 139–150, 2016.

[151] C. Wei, A. Dhar, and D. Chen. “A Scalable and High-density FPGA Architecture with Multi-level Phase Change Memory.” In Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1365–1370. 2015.

[152] X. Tang, P. Gaillardon, and G. De Micheli. “A high-performance low-power near-Vt RRAM-based FPGA.” In Int. Conf. on Field-Programmable Technology (FPT), pp. 207–214. 2014.

[153] S. Huda and J. H. Anderson. “Leveraging Unused Resources for Energy Optimization of FPGA Interconnect.” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 25 (8), pp. 2307–2320, 2017.

[154] Z. Seifoori, B. Khaleghi, and H. Asadi. “A power gating switch box architecture in routing network of SRAM-based FPGAs in dark silicon era.” In Design, Automation Test in Europe Conference Exhibition (DATE), pp. 1342–1347. 2017.

[155] I. Ahmed, L. L. Shen, and V. Betz. “Becoming More Tolerant: Designing FPGAs for Variable Supply Voltage.” In Int. Conf. on Field Programmable Logic and Applications (FPL). 2019.

[156] B. Erbagci, N. E. Can Akkaya, C. Erbagci, and K. Mai. “An Inherently Secure FPGA using PUF Hardware-Entanglement and Side-Channel Resistant Logic in 65nm Bulk CMOS.” In ESSCIRC 2019 - IEEE 45th European Solid State Circuits Conference (ESSCIRC), pp. 65–68. 2019, doi:10.1109/ESSCIRC.2019.8902789.

[157] D. Vercruyce, E. Vansteenkiste, and D. Stroobandt. “How Preserving Circuit Design Hierarchy During FPGA Packing Leads to Better Performance.” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37 (3), pp. 629–642, 2018.

[158] Y. Chen, S. Chen, and Y. Chang. “Efficient and effective packing and analytical placement for large-scale heterogeneous FPGAs.” In Int. Conf. on Computer-Aided Design (ICCAD), pp. 647–654. 2014.

[159] S. Chen and Y. Chang. “Routing-architecture-aware analytical placement for heterogeneous FPGAs.” In ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6. 2015.

[160] J. Yuan, L. Wang, X. Zhou, Y. Xia, and J. Hu. “RBSA: Range-based simulated annealing for FPGA placement.” In Int. Conf. on Field Programmable Technology (FPT), pp. 1–8. 2017.

[161] C. Xu, et al. “A Parallel Bandit-Based Approach for Autotuning FPGA Compilation.” In Int. Symp. on Field-Programmable Gate Arrays (FPGA), pp. 157–166. 2017.

[162] C. H. Hoo and A. Kumar. “ParaDRo: A Parallel Deterministic Router Based on Spatial Partitioning and Scheduling.” In Field-Programmable Gate Arrays, pp. 67–76. 2018.

[163] B. Khaleghi, B. Omidi, H. Amrouch, J. Henkel, and H. Asadi. “Stress-aware routing to mitigate aging effects in SRAM-based FPGAs.” In Int. Conf. on Field Programmable Logic and Applications (FPL), pp. 1–8. 2016.

[164] E. Hung and S. J. E. Wilton. “Incremental Trace-Buffer Insertion for FPGA Debug.” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 22 (4), pp. 850–863, 2014.

[165] K. E. Murray, S. Whitty, S. Liu, J. Luu, and V. Betz. “Titan: Enabling Large and Complex Benchmarks in Academic CAD.” In Int. Conf. on Field Programmable Logic and Applications (FPL). 2013.

[166] E. Hung, F. Eslami, and S. J. E. Wilton. “Escaping the Academic Sandbox: Realizing VPR Circuits on Xilinx Devices.” In Int. Symp. on Field-Programmable Custom Computing Machines (FCCM), pp. 45–52. 2013.

[167] E. Hung. “Mind the (synthesis) gap: Examining where academic FPGA tools lag behind industry.” In International Conference on Field Programmable Logic and Applications (FPL), pp. 1–4. 2015.

[168] J. H. Kim and J. H. Anderson. “Synthesizable Standard Cell FPGA Fabrics Targetable by the Verilog-to-Routing CAD Flow.” ACM Trans. Reconfigurable Technol. Syst., 10 (2), pp. 11:1–11:23, 2017.

[169] J. Teifel, M. E. Land, and R. D. Miller. “Improving ASIC Reuse with Embedded FPGA Fabrics.” In Government Microcircuit Applications & Critical Technology Conference. 2016.

[170] B. Grady and J. H. Anderson. “Synthesizable Verilog Backend for the VTR FPGA Evaluation Framework.” In Int. Conf. on Field-Programmable Technology (FPT). 2018.

[171] B. Chauviere, A. Alacchi, E. Giacomin, X. Tang, and P.-E. Gaillardon. “OpenFPGA: Complete Open Source Framework for FPGA Prototyping.” In Workshop on Open Source Design Automation. 2019.

[172] A. Li and D. Wentzlaff. “PRGA: An Open-source Framework for Building and Using Custom FPGAs.” In Workshop on Open Source Design Automation. 2019.

[173] G. Yu, et al. “FPGA with Improved Routability and Robustness in 130nm CMOS with Open-Source CAD Targetability.” ArXiv e-prints, 2017. arXiv:1712.03411.

[174] H. J. Liu. Archipelago - An Open Source FPGA with Toolflow Support. Master’s thesis, EECS Department, University of California, Berkeley, 2014. http://www2.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-43.html.

[175] J. Tian, et al. “A field programmable transistor array featuring single-cycle partial/full dynamic reconfiguration.” In Design, Automation Test in Europe Conference Exhibition (DATE), 2017, pp. 1336–1341. 2017, doi:10.23919/DATE.2017.7927200.

[176] QuickLogic Open Reconfigurable Computing (QORC) MCU + eFPGA SoC Open Source Software Tools. QuickLogic, 2020. https://ir.quicklogic.com/press-releases/detail/535/quicklogic-announces-open-reconfigurable-computing.

[177] “SymbiFlow Project.”, 2019. https://symbiflow.github.io/.

[178] J. Luu. Architecture-aware Packing and CAD Infrastructure for Field-Programmable Gate Arrays. Ph.D. thesis, University of Toronto, 2014. http://hdl.handle.net/1807/68469.

[179] Quartus 18.0. Intel Corporation, 2019. https://www.intel.ca/content/www/ca/en/software/programmable/quartus-prime/overview.html.

[180] “Vivado.”, Xilinx Corporation, 2019. https://www.xilinx.com/products/design-tools/vivado.html.

[181] “Arachne-pnr.”, 2019. https://github.com/YosysHQ/arachne-pnr.

[182] D. Shah, et al. “Yosys+nextpnr: an Open Source Framework from Verilog to Bitstream for Commercial FPGAs.” In Int. Symp. on Field-Programmable Custom Computing Machines (FCCM). 2019.

[183] P. Jamieson, K. B. Kent, F. Gharibian, and L. Shannon. “Odin II - An Open-Source Verilog HDL Synthesis Tool for CAD Research.” In Int. Symp. on Field-Programmable Custom Computing Machines (FCCM), pp. 149–156. 2010.

[184] R. Brayton and A. Mishchenko. “ABC: An Academic Industrial-Strength Verification Tool.” In T. Touili, B. Cook, and P. Jackson, eds., Computer Aided Verification, pp. 24–40. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.

[185] Berkeley Logic Synthesis and Verification Group. “ABC: A System for Sequential Synthesis and Verification.”, 2018. Revision 1fc200ffacabed1796639b562181051614f5fedb. http://www.eecs.berkeley.edu/~alanmi/abc/.

[186] C. Wolf. “Yosys Open SYnthesis Suite.”, 2019. http://www.clifford.at/yosys/.

[187] V. Betz and J. Rose. “Automatic Generation of FPGA Routing Architectures from High-level Descriptions.” In ACM/SIGDA Int. Symp. on Field Programmable Gate Arrays (FPGA), pp. 175–184. 2000.

[188] C. Huriaux, O. Sentieys, and R. Tessier. “Effects of I/O routing through column interfaces in embedded FPGA fabrics.” In Int. Conf. on Field Programmable Logic and Applications (FPL), pp. 1–9. 2016.

[189] Speedster7t FPGAs. Achronix Semiconductor, 2019. PB003 v1.0.

[190] Zynq UltraScale+ MPSoC Data Sheet. Xilinx Inc., 2018. DS891 v1.7.

[191] Stratix V Device Handbook. Altera Corporation, 2015. SV5V1.

[192] M. S. Abdelfattah and V. Betz. “Design tradeoffs for hard and soft FPGA-based Networks-on-Chip.” In Int. Conf. on Field-Programmable Technology (FPT), pp. 95–103. 2012.

[193] S. Wilton. Architectures and Algorithms for Field-Programmable Gate Arrays with Embedded Memories. Ph.D. thesis, University of Toronto, 1997.

[194] Xilinx Inc. “The Programmable Logic Data Book.”, 1994.

[195] Y.-W. Chang, D. F. Wong, and C. K. Wong. “Universal Switch Modules for FPGA Design.” ACM Trans. Des. Autom. Electron. Syst., 1 (1), pp. 80–101, 1996.

[196] D. Lewis, et al. “Architectural Enhancements in Stratix V.” In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA ’13, pp. 147–156. ACM, New York, NY, USA, 2013, doi:10.1145/2435264.2435292.

[197] O. Petelin. CAD Tools and Architectures for Improved FPGA Interconnect. Master’s thesis, University of Toronto, 2016. http://hdl.handle.net/1807/75854.

[198] UltraRAM: Breakthrough Embedded Memory Integration on UltraScale+ Devices. Xilinx Inc., 2016. WP477 v1.0.

[199] Versal Architecture and Product Data Sheet. Xilinx Inc., 2019. DS950 v1.1.

[200] S. Sivaswamy, et al. “HARP: Hard-wired Routing Pattern FPGAs.” In ACM/SIGDA Int. Symp. on Field-programmable Gate Arrays (FPGA), pp. 21–29. 2005.

[201] X. Sun, H. Zhou, and L. Wang. “Bent Routing Pattern for FPGA.” In 2019 29th Inter- national Conference on Field Programmable Logic and Applications (FPL), pp. 9–16. 2019, doi:10.1109/FPL.2019.00012.

[202] D. Lewis, et al. “The Stratix 10 Highly Pipelined FPGA Architecture.” In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’16, pp. 159–168. ACM, New York, NY, USA, 2016, doi:10.1145/2847263.2847267.

[203] Stratix IV Device Handbook. Altera, 2016.

[204] J. Luu, J. Rose, and J. Anderson. “Towards Interconnect-adaptive Packing for FPGAs.” In ACM/SIGDA Int. Symp. on Field-programmable Gate Arrays (FPGA), pp. 21–30. 2014.

[205] A. DeHon. “Balancing Interconnect and Computation in a Reconfigurable Computing Array (or, Why You Don’t Really Want 100% LUT Utilization).” In ACM/SIGDA Int. Symp. on Field-programmable Gate Arrays (FPGA), pp. 69–78. 1999.

[206] Z. Abuowaimer, et al. “GPlace3.0: Routability-Driven Analytic Placer for UltraScale FPGA Architectures.” ACM Trans. Des. Autom. Electron. Syst., 23 (5), pp. 66:1–66:33, 2018.

[207] W. Feng, J. Greene, K. Vorwerk, V. Pevzner, and A. Kundu. “Rent’s Rule Based FPGA Packing for Routability Optimization.” In ACM/SIGDA Int. Symp. on Field-programmable Gate Arrays (FPGA), pp. 31–34. 2014.

[208] D. T. Chen, K. Vorwerk, and A. Kennings. “Improving Timing-Driven FPGA Packing with Physical Information.” In Int. Conf. on Field Programmable Logic and Applications (FPL), pp. 117–123. 2007.

[209] A. Singh and M. Marek-Sadowska. “Efficient Circuit Clustering for Area and Power Reduction in FPGAs.” In ACM/SIGDA Int. Symp. on Field-programmable Gate Arrays (FPGA), pp. 59–66. 2002.

[210] K. Honda, T. Imagawa, and H. Ochi. “Placement algorithm for mixed-grained reconfigurable architecture with dedicated carry chain.” In IEEE International System-on-Chip Conference (SOCC), pp. 80–85. 2017.

[211] G. G. Lemieux and S. D. Brown. “A Detailed Routing Algorithm for Allocating Wire Segments in Field-Programmable Gate Arrays.” In Proc. Physical Design Workshop, pp. 215–226. 1993.

[212] M. J. Alexander and G. Robins. “New performance-driven FPGA routing algorithms.” IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 15 (12), pp. 1505–1517, 1996, doi:10.1109/43.552083.

[213] K. So. “Enforcing Long-path Timing Closure for FPGA Routing with Path Searches on Clamped Lexicographic Spirals.” In Field Programmable Gate Arrays, pp. 24–34. 2008.

[214] X. Chen, J. Zhu, and M. Zhang. “Timing-Driven Routing of High Fanout Nets.” In Field Programmable Logic and Appl., pp. 423–428. 2011, doi:10.1109/FPL.2011.84.

[215] D. Vercruyce, E. Vansteenkiste, and D. Stroobandt. “CRoute: A Fast High-quality Timing-driven Connection-based FPGA Router.” In Field-Programmable Custom Comput. Mach. 2019.

[216] M. Gort and J. H. Anderson. “Accelerating FPGA Routing Through Parallelization and Engineering Enhancements.” IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 31 (1), pp. 61–74, 2012, doi:10.1109/TCAD.2011.2165715.

[217] J. S. Swartz, V. Betz, and J. Rose. “A Fast Routability-driven Router for FPGAs.” In Int. Symp. on Field Programmable Gate Arrays, pp. 140–149. 1998.

[218] P. E. Hart, N. J. Nilsson, and B. Raphael. “A Formal Basis for the Heuristic Determination of Minimum Cost Paths.” IEEE Trans. on Syst. Sci. and Cybern., 4 (2), pp. 100–107, 1968.

[219] R. Y. Rubin and A. M. DeHon. “Timing-driven Pathfinder Pathology and Remediation: Quantifying and Reducing Delay Noise in VPR-pathfinder.” In Field Programmable Gate Arrays, pp. 173–176. 2011.

[220] M. Hutton, D. Karchmer, B. Archell, and J. Govig. “Efficient Static Timing Analysis and Applications Using Edge Masks.” In ACM/SIGDA 13th Int. Symp. on Field-programmable Gate Arrays (FPGA), pp. 174–183. 2005.

[221] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional Computing Series. Pearson Education, 1994.

[222] “gcov: A Test Coverage Program.”, 2019. https://gcc.gnu.org/onlinedocs/gcc/Gcov.html.

[223] “Flex: The Fast Lexical Analyzer.”, 2019. https://www.gnu.org/software/flex/.

[224] “Bison.”, 2019. https://www.gnu.org/software/bison/.

[225] “Pugixml: Light-weight, simple and fast XML parser for C++.”, 2019. https://pugixml.org/.

[226] G. Chen and J. Cong. “Simultaneous Timing Driven Clustering and Placement for FPGAs.” In Int. Conf. on Field Programmable Logic and Applications (FPL), pp. 158–167. 2004.

[227] C. Yu and Z. Zhang. “Painting on Placement: Forecasting Routing Congestion Using Conditional Generative Adversarial Nets.” In Proceedings of the 56th Annual Design Automation Conference 2019, DAC ’19. Association for Computing Machinery, New York, NY, USA, 2019, doi:10.1145/3316781.3317876.

[228] B. Chauviere, A. Alacchi, E. Giacomin, X. Tang, and P.-E. Gaillardon. “OpenFPGA: Complete Open Source Framework for FPGA Prototyping.” In Workshop on Open Source Design Automation. 2019.

[229] V. Betz and J. Rose. “Automatic Generation of FPGA Routing Architectures from High-Level Descriptions.” In Proceedings of the 2000 ACM/SIGDA Eighth International Symposium on Field Programmable Gate Arrays, FPGA ’00, pp. 175–184. Association for Computing Machinery, New York, NY, USA, 2000, doi:10.1145/329166.329203.

[230] “FPGA ASM (FASM) Specification.”, 2019. https://symbiflow.readthedocs.io/en/latest/fasm/docs/specification.html.

[231] C. Xu, et al. “A Parallel Bandit-Based Approach for Autotuning FPGA Compilation.” In ACM/SIGDA Int. Symp. on Field Programmable Gate Arrays, pp. 157–166. 2017.

[232] N. Kapre, H. Ng, K. Teo, and J. Naude. “InTime: A Machine Learning Approach for Efficient Selection of FPGA CAD Tool Parameters.” In ACM/SIGDA Int. Symp. on Field Programmable Gate Arrays, pp. 23–26. 2015.

[233] J. Kober and J. Peters. Reinforcement Learning in Robotics: A Survey, pp. 9–67. Springer International Publishing, 2014.

[234] B. Zoph and Q. V. Le. “Neural Architecture Search with Reinforcement Learning.” ArXiv, abs/1611.01578, 2016.

[235] L. Li, W. Chu, J. Langford, and R. E. Schapire. “A contextual-bandit approach to personalized news article recommendation.” Int. Conf. on World Wide Web, 2010, doi:10.1145/1772690.1772758.

[236] D. Silver, et al. “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play.” Science, 362 (6419), pp. 1140–1144, 2018, doi:10.1126/science.aar6404.

[237] O. Vinyals, et al. “Grandmaster level in StarCraft II using multi-agent reinforcement learning.” Nature, 2019.

[238] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 2nd edition, 2018.

[239] K. Vorwerk, A. Kennings, and J. W. Greene. “Improving Simulated Annealing-based FPGA Placement with Directed Moves.” Trans. Comp.-Aided Des. Integ. Cir. Sys., 28 (2), pp. 179–192, 2009, doi:10.1109/TCAD.2008.2009167.

[240] I. L. Markov, J. Hu, and M. Kim. “Progress and Challenges in VLSI Placement Research.” Proc. of the IEEE, 103 (11), pp. 1985–2003, 2015, doi:10.1109/JPROC.2015.2478963.

[241] V. Mnih, et al. “Playing Atari With Deep Reinforcement Learning.” In NIPS Deep Learning Workshop. 2013.

[242] D. Vercruyce, E. Vansteenkiste, and D. Stroobandt. “Liquid: High quality scalable placement for large heterogeneous FPGAs.” In Int. Conf. on Field Programmable Technology, pp. 17–24. 2017, doi:10.1109/FPT.2017.8280116.

[243] A. Hosny, S. Hashemi, M. Shalan, and S. Reda. “DRiLLS: Deep Reinforcement Learning for Logic Synthesis.” In 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 581–586. 2020.

[244] Q. Huang, et al. “AutoPhase: Juggling HLS Phase Orderings in Random Forests with Deep Reinforcement Learning.”, 2020. arXiv:2003.00671.

[245] M. Goldenberg, N. Sturtevant, A. Felner, and J. Schaeffer. “The Compressed Differential Heuristic.” In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI ’11, pp. 24–29. AAAI Press, 2011.

[246] A. V. Goldberg and C. Harrelson. “Computing the Shortest Path: A* Search Meets Graph Theory.” In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’05, pp. 156–165. Society for Industrial and Applied Mathematics, USA, 2005.

[247] S. Koenig and M. Likhachev. “Incremental A*.” In T. G. Dietterich, S. Becker, and Z. Ghahramani, eds., Advances in Neural Information Processing Systems 14, pp. 1539–1546. MIT Press, 2002.

[248] S. Koenig and M. Likhachev. “Fast replanning for navigation in unknown terrain.” IEEE Transactions on Robotics, 21 (3), pp. 354–363, 2005.

[249] “Google OR-Tools.”, 2017. https://developers.google.com/optimization.

[250] A. Alaghi and J. P. Hayes. “STRAUSS: Spectral Transform Use in Stochastic Circuit Synthesis.” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 34 (11), pp. 1770–1783, 2015, doi:10.1109/TCAD.2015.2432138.