LAMDA: Learning-Assisted Multi-Stage Autotuning for FPGA Design Closure
Ecenur Ustun∗, Shaojie Xiang, Jinny Gui, Cunxi Yu∗, and Zhiru Zhang∗
School of Electrical and Computer Engineering, Cornell University, Ithaca, NY, USA
eu49, cunxi.yu, [email protected]

Abstract—A primary barrier to rapid hardware specialization with FPGAs stems from the weak guarantees of existing CAD tools on achieving design closure. Current methodologies require extensive manual effort to configure a large set of options across multiple stages of the toolflow in order to achieve high quality-of-results. Due to the size and complexity of the design space spanned by these options, coupled with the time-consuming evaluation of each design point, exploration for reconfigurable computing has become remarkably challenging. To tackle this challenge, we present a learning-assisted autotuning framework called LAMDA, which accelerates FPGA design closure by utilizing design-specific features extracted from early stages of the design flow to guide the tuning process with significant runtime savings. LAMDA automatically configures logic synthesis, technology mapping, placement, and routing to achieve design closure efficiently. Compared with a state-of-the-art FPGA-targeted autotuning system, LAMDA realizes faster timing closure on various realistic benchmarks using Intel Quartus Pro.

Fig. 1. Timing distribution of bfly for various tool settings and timing constraints – x-axis represents target clock period (ns), and y-axis represents critical path delay (ns).

I. INTRODUCTION

Limitations in technology scaling have led to a growing interest in non-traditional system architectures that incorporate specialized hardware accelerators for improved performance and energy efficiency. Although FPGAs have shown significant potential in hardware specialization [1], the weak guarantees of existing CAD tools on achieving design closure out of the box remain a main barrier to their adoption. To achieve high quality-of-results (QoR), CAD tools require substantial manual effort to configure a large set of design and tool parameters.

To meet the diverse requirements of a broad range of application domains, FPGA development environments commonly provide users with an extensive set of tool options. For instance, the synthesis and place-and-route (PnR) options in Intel Quartus Pro translate to a search space of over 1.8 × 10^24 design points. Fig. 1 shows 500 design points randomly sampled from possible combinations of Intel Quartus Pro tool options, along with the resulting critical path delays. Results of the default tool options are included for reference. There are two important observations: (1) the default timing results are on average 15% higher than the best results of the random samples; (2) more than 30% of the randomly sampled configurations produce better timing than the default configuration. This suggests that there is considerable room for timing improvement by tuning tool options. However, exploring a large number of tool options is extremely inefficient and cannot be effectively carried out by human effort alone. A similar challenge also exists in high-performance computing and software compilation, where autotuners have been developed for automatically optimizing compiler configurations [2].
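To make the scale concrete, the short Python sketch below shows how a handful of per-stage choices already multiply into a large configuration space, and how 500 design points can be sampled uniformly at random, as in the Fig. 1 experiment. The option names and value sets are hypothetical placeholders for illustration only, not actual Quartus Pro settings.

```python
import math
import random

# Toy illustration of why the tool-option space explodes combinatorially.
# The option names and value sets below are hypothetical placeholders, NOT
# actual Intel Quartus Pro settings; the paper reports that real synthesis
# and PnR options span over 1.8 x 10^24 design points.
option_space = {
    "synthesis_effort":  ["auto", "fast", "balanced", "high"],
    "mapping_strategy":  ["area", "speed", "balanced"],
    "placement_effort":  ["low", "medium", "high", "maximum"],
    "routing_opt_level": ["0", "1", "2", "3"],
    # ...a real flow exposes dozens more options per stage...
}

# The space size is the product of the per-option cardinalities.
space_size = math.prod(len(values) for values in option_space.values())
print(f"toy space size: {space_size} design points")

def sample_configuration(rng: random.Random) -> dict:
    """Draw one random design point: a full assignment of every tool option."""
    return {name: rng.choice(values) for name, values in option_space.items()}

# 500 random design points, analogous to the sampling experiment in Fig. 1.
rng = random.Random(0)
samples = [sample_configuration(rng) for _ in range(500)]
```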
Recent years have seen an increasing employment of machine learning (ML) in EDA to enable rapid design space exploration (DSE) [3]–[7] and automatic configuration of design or tool parameters [2], [8], [9]. However, there are two major limitations with the existing approaches. First, current techniques mainly focus on a single stage of the design flow, such as high-level synthesis (HLS) [4] or logic synthesis [5], thereby missing important cross-boundary optimization opportunities. Second, existing methods often use pre-PnR or even pre-synthesis reports for assessing the quality of a design point [10]. While this shortens execution times, simply relying on crude estimates from an early design stage may prevent DSE from reaching high-quality design points.

To address the aforementioned limitations, we propose LAMDA, a Learning-Assisted Multi-stage Design Autotuning framework that accelerates FPGA design closure. We develop a multi-stage QoR inference model based on online supervised learning, which allows LAMDA to effectively detect and prune unpromising design points over large search spaces. LAMDA automatically configures a wide range of CAD tool options by balancing the trade-off between computing effort and estimation accuracy. Our main technical contributions include:
• An ML-based multi-stage autotuner, which leverages features from early stages to estimate post-PnR QoR.
• LAMDA achieves faster design closure using online learning: design points visited during autotuning are used to further increase the ML model accuracy.
• LAMDA achieves a 5.43× speedup compared to a state-of-the-art FPGA-targeted autotuning system for multiple realistic designs using Intel Quartus Pro.
• Emulation databases of five realistic designs using Intel Quartus Pro, which enable fast autotuning evaluation. The databases will be open sourced to facilitate further research on autotuning algorithms and tools.

Fig. 2. Potential speedups by leveraging QoR inference at different stages.
Fig. 3. Overview of LAMDA.

II. BACKGROUND

The mainstream FPGA compilation flow takes untimed C++/OpenCL code or an RTL design as input and generates a device-specific bitstream. This process involves several distinct and modular steps, including HLS, logic synthesis, technology mapping, packing, placement, and routing. Each step provides designers with a set of configuration switches that select between different heuristics or influence the behavior of a heuristic. These switches need to be calibrated with significant manual effort and expert knowledge to achieve the desired QoR. Due to the lack of predictability and the time-consuming nature of the FPGA design flow, there is an urgent need to lower design cost by minimizing human supervision and significantly reducing the time required to obtain accurate QoR estimates.
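As a rough schematic of the staged flow just described, the sketch below models each step as a callable that applies the chosen tool options and emits a report of design-specific features. The stage names follow the flow above, while StageResult, the Stage signature, and run_flow_prefix are assumptions made purely for illustration; concrete implementations would wrap the vendor tool invocations and report parsers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class StageResult:
    artifact: object            # e.g., mapped netlist, placed design, routed design
    features: Dict[str, float]  # design-specific features parsed from the stage report
    runtime_s: float            # stage runtime

# A stage consumes the previous artifact plus the chosen tool options
# and produces a StageResult.
Stage = Callable[[object, Dict[str, str]], StageResult]

FLOW_ORDER: List[str] = [
    "logic_synthesis", "technology_mapping", "packing", "placement", "routing",
]

def run_flow_prefix(design, tool_options, stages: Dict[str, Stage], stop_after: str):
    """Run the flow only through `stop_after`, accumulating per-stage features.

    Stopping early (e.g., after technology mapping) is cheap but yields crude
    features; running further is more informative but costs more runtime."""
    artifact, features, total_runtime = design, {}, 0.0
    for name in FLOW_ORDER:
        result = stages[name](artifact, tool_options)
        artifact = result.artifact
        features.update({f"{name}.{key}": val for key, val in result.features.items()})
        total_runtime += result.runtime_s
        if name == stop_after:
            break
    return features, total_runtime
```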
Autotuning has been used for optimization in FPGA-targeted CAD toolflows by automatically configuring parameters and tool options to optimize certain objective functions. InTime [8], [11], [12] explores supervised learning techniques to accelerate FPGA timing closure; it automatically selects tool options for a given design by exploring the design space with a timing estimator. DATuner [13] utilizes the multi-armed bandit technique to automatically tune the options of a complete FPGA compilation flow.

Note that these are single-stage autotuning frameworks, meaning that their QoR estimation is conducted on features of a single stage. Single-stage autotuning with QoR estimation conducted at a late design stage is time-consuming because each iteration runs through PnR. On the other hand, using early-stage features to assess the quality of a design point during autotuning shortens runtime. However, simply relying on crude estimates from an early stage may prevent the CAD tool from applying the appropriate set of optimizations, resulting in sub-optimal trade-offs. Lo et al. proposed a multi-fidelity approach for tuning HLS parameters, incorporating features across the HLS, synthesis, and implementation stages [14]. This paper demonstrates how a multi-stage approach can significantly…

III. APPROACH

The overall autotuning flow of LAMDA is illustrated in Fig. 3. It takes an HDL description as input and automatically configures the tool options across the logic synthesis, technology mapping, packing, and PnR stages, where the search space is defined by the extensive set of tool options. Table I lists a subset of the tunable tool options of Intel Quartus Pro. LAMDA leverages a highly accurate ML model to effectively prune the design space, thus accelerating FPGA design closure. The rest of this section describes the key components of LAMDA in more detail.

Multi-stage QoR inference: We develop a multi-stage inference model that estimates post-PnR results based on tool features (configurations of the tool options) and design-specific features. Using multi-stage design-specific features is one of our main contributions compared to InTime and DATuner. As discussed in Section II, collecting early-stage features is fast but can lack QoR estimation accuracy, whereas collecting features from later stages is more informative yet time-consuming. Therefore, the fast and low-cost design stages in Fig. 3 need to be carefully selected to balance the accuracy-runtime trade-off. To this end, we analyze the effects of the features in Table II, from which one can draw three conclusions. First, design-specific features help estimate QoR more accurately compared to using tool options only (i.e., pre-synthesis). Second, accuracy increases as features from later stages are included in the feature set, bringing about an accuracy-runtime trade-off. Third, although tool estimates are less accurate under tight constraints, design-specific features still help improve estimation accuracy compared to using tool options only.
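The loop below is a minimal Python sketch of the autotuning flow described above, written under explicit assumptions: this excerpt does not specify LAMDA's learning model, so a scikit-learn random forest regressor stands in for the multi-stage QoR inference model, and propose_config, run_fast_stages, run_full_flow, and encode_config are hypothetical helpers standing in for the search engine, the fast/low-cost stage prefix, the full flow through PnR, and tool-option encoding.

```python
from sklearn.ensemble import RandomForestRegressor

def autotune(design, propose_config, run_fast_stages, run_full_flow,
             encode_config, n_iters=200, warmup=10, prune_margin=0.05):
    """Sketch of a LAMDA-style tuning loop with multi-stage QoR inference.

    propose_config, run_fast_stages, run_full_flow, and encode_config are
    hypothetical helpers, not LAMDA APIs."""
    model = RandomForestRegressor(n_estimators=100)
    X, y = [], []                     # visited design points and their actual QoR
    best_qor, best_cfg = float("inf"), None

    for _ in range(n_iters):
        cfg = propose_config()                          # search engine proposes a design point
        early_feats = run_fast_stages(design, cfg)      # cheap design-specific features
        x = encode_config(cfg) + early_feats            # tool features + multi-stage features

        if len(y) >= warmup:                            # enough labels to trust the model
            predicted_qor = model.predict([x])[0]
            if predicted_qor > best_qor * (1.0 + prune_margin):
                continue                                # prune: skip the expensive PnR run

        qor = run_full_flow(design, cfg)                # actual post-PnR QoR (e.g., critical path delay)
        X.append(x)
        y.append(qor)
        model.fit(X, y)                                 # refit on all visited points (stand-in for online updates)
        if qor < best_qor:
            best_qor, best_cfg = qor, cfg

    return best_cfg, best_qor
```

The pruning threshold relative to the best result so far, the warm-up length, and the refit-from-scratch update are illustrative simplifications; the essential structure is the one the text describes: cheap early-stage features feed a learned QoR estimator that decides whether a design point deserves a full PnR evaluation, and every fully evaluated point becomes new training data for the model.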