A Parallel Bandit-Based Approach for Autotuning FPGA Compilation
Chang Xu1,∗, Gai Liu2, Ritchie Zhao2, Stephen Yang3, Guojie Luo1, Zhiru Zhang2,†
1 Center for Energy-Efficient Computing and Applications, Peking University, Beijing, China
2 School of Electrical and Computer Engineering, Cornell University, Ithaca, USA
3 Xilinx, Inc., San Jose, USA
∗[email protected], †[email protected]

Abstract
Mainstream FPGA CAD tools provide an extensive collection of optimization options that have a significant impact on the quality of the final design. These options together create an enormous and complex design space that cannot be effectively explored by human effort alone. Instead, we propose to search this parameter space using autotuning, which is a popular approach in the compiler optimization domain. Specifically, we study the effectiveness of applying the multi-armed bandit (MAB) technique to automatically tune the options for a complete FPGA compilation flow from RTL to bitstream, including RTL/logic synthesis, technology mapping, placement, and routing. To mitigate the high runtime cost incurred by the complex FPGA implementation process, we devise an efficient parallelization scheme that enables multiple MAB-based autotuners to explore the design space simultaneously. In particular, we propose a dynamic solution space partitioning and resource allocation technique that intelligently allocates computing resources to promising search regions based on the runtime information of search quality from previous iterations. Experiments on academic and commercial FPGA CAD tools demonstrate promising improvements in quality and convergence rate across a variety of real-life designs.

1. Introduction
Over the last three decades, FPGAs have evolved from small chips with a few thousand logic blocks to billion-transistor system-on-chips containing hardened DSP blocks, embedded memories, and multicore processors, alongside millions of programmable logic elements. Concurrently, FPGA development tools have also grown into sophisticated design environments. Compiling an RTL design into a bitstream typically involves heuristically solving a sequence of complex combinatorial optimization problems such as logic synthesis, technology mapping, placement, and routing [7].

To meet the stringent yet diverse design requirements from different domains and use cases, modern FPGA CAD tools commonly provide users with a large collection of optimization options (or parameters) that have a significant impact on the quality of the final design. For instance, the placement step alone in the Xilinx Vivado Design Suite offers up to 20 different parameters, translating to a search space of more than 10^6 design points [3]. In addition, multiple options may interact in subtle ways, resulting in unpredictable effects on solution quality. Traditionally, navigating through such an enormous design space requires designers to rely on either prior design experience or vendor-supplied guidelines. Such ad hoc design practices incur costly manual effort to achieve the desired quality of results (QoR). Worse, each new design may require a drastically different set of options to achieve the best QoR [24].

One solution to improve design productivity is employing meta-heuristic search techniques to explore the parameter space automatically. Figure 1 shows the improvement of the worst negative slack (WNS) of three designs generated by Vivado, each tuned using three different search techniques: active learning, Bayes classification, and greedy mutation. From our experiments, it is evident that the most effective search technique (in terms of the number of Vivado runs needed to close timing) varies across different designs. Intuitively, distinct designs often present vastly different structures of the search space. Besides, different phases of the design space exploration benefit from different search techniques. For example, stochastic methods such as genetic algorithms may be more useful during the initial phase of the search, while first-order optimizations like gradient descent are very efficient in finding local minima once the promising search space is narrowed.

The above observations clearly motivate the use of an ensemble of search heuristics rather than one particular technique to effectively explore the design space of FPGA compilation. Similar insights were also gained in the OpenTuner project, which aimed to provide an extensible open-source framework for software program autotuning [4].


Figure 1: The search traces of three designs using three different search algorithms with the goal of improving worst negative slack — We use meta-heuristic algorithms to analyze results from Vivado and guide the selection of Vivado configuration parameters. The x-axis denotes the number of Vivado runs and the y-axis the worst negative slack (ns). (a) Greedy mutation, a simple genetic algorithm, is the first to close timing for binary GCD; (b) active learning, a semi-supervised machine learning technique, is the first to close timing for computational fluid dynamics; (c) Bayes classification, a naïve Bayes classifier, is the first to close timing for the bubble sort design.

OpenTuner currently incorporates a collection of search techniques to provide robustness against different search spaces, and uses the multi-armed bandit (MAB) algorithm [11] to dynamically determine the allocation of trials among the available techniques. In addition to applications in program autotuning [4], MAB has already been applied to many important optimization problems in various fields, such as artificial intelligence [19, 22] and operations research [9, 15].

Since FPGA CAD tools usually require long execution times (minutes to hours for real-world designs), it is crucial to significantly speed up the MAB-guided search without sacrificing the final QoR. An intuitive approach is launching multiple machines simultaneously, each conducting an MAB-guided search within the solution space independently. Alternatively, one can use a more efficient scheme that dynamically partitions the solution space into multiple partitions and allocates additional computing resources to regions that are more likely to generate high-quality solutions.

In this paper, we propose DATuner — a parallel bandit-based framework for autotuning FPGA compilation. DATuner is built on OpenTuner but instead focuses on improving the productivity and quality of FPGA-targeted hardware designs. We also propose scalable and effective parallelization techniques based on dynamic solution space partitioning to speed up the convergence of DATuner. Our main contributions are as follows:

1. We adapt OpenTuner to tune the CAD tool parameters for FPGA compilation and demonstrate the effectiveness of the bandit-based approach in improving the design QoR.
2. We propose a scalable parallelization scheme which accomplishes the following: (1) it efficiently partitions the global solution space into promising subspaces; (2) it allocates compute resources among subspaces to balance the exploration of unknown subspaces and the exploitation of subspaces with known high-quality solutions.
3. Experiments with DATuner on academic and commercial FPGA CAD tools demonstrate very encouraging improvements in design quality across a variety of real-life benchmarks. We believe that our framework is also applicable to many other EDA problems.

The rest of the paper is organized as follows: Section 2 introduces the preliminaries that serve as the basics of this work; Section 3 discusses our proposed techniques; Section 4 presents the experimental results; Section 5 summarizes the related work, followed by conclusions in Section 6.

2. Preliminaries
In this section, we provide an overview of the MAB problem formulation, its usage in OpenTuner, as well as the basics of the FPGA compilation process.

2.1 Multi-Armed Bandit Approach
The MAB problem and its solutions are extensively studied in statistics and machine learning [11]. The classic problem is formulated as a game, where the player has a fixed number of actions to choose from (i.e., arms), each of which has a reward given by an unknown probability distribution with an unknown expected value. The game progresses in rounds, and in each round the player chooses one action (i.e., pulls an arm) and obtains a reward sampled from the corresponding distribution. The reward loosely captures the effectiveness of an arm, and crucially, its probability distribution is learned during the process of the game. The objective is to maximize the total payoff after all the rounds. An effective MAB algorithm must find the right balance between exploitation (choosing the known best arm to obtain the highest expected reward) and exploration (selecting an infrequently used arm to gain more information about its reward distribution). Choosing an infrequently used arm sacrifices short-term gain for the possibility of discovering greater payoff in the long run. Existing methods are usually randomized algorithms, which pick an action in each round based on the history of chosen actions and observed rewards so far [5, 6]. The quality metric of an MAB algorithm is regret, which is the ratio between the optimal payoff (obtained by pulling the optimal arm every round) and the payoff generated by the MAB algorithm. Several known MAB algorithms can achieve a regret of O(log N) for an N-round MAB, which has been shown to be asymptotically optimal [8].

Recently, an open-source autotuning framework called OpenTuner has adopted MAB to improve the runtime of software benchmarks [4]. Specifically, OpenTuner incorporates an ensemble of meta-heuristics for effectively searching the space of software compiler options. Examples of these heuristic methods include differential evolution, genetic algorithms, particle swarm optimization, and simulated annealing. The MAB algorithm used in OpenTuner treats each search method as an arm and measures its reward using an area-under-curve mechanism — if an arm has yielded a new global best, an upward line is drawn; otherwise, a flat line is drawn. The area under this curve (scaled to a maximum value of 1) is the total payoff attributed to the corresponding arm. To balance exploitation and exploration, OpenTuner ranks each arm with a weighted sum of the area-under-curve metric and the frequency of its previous uses. Notably, OpenTuner reported up to 2.8x speedup at no programming cost by automatically tuning GCC flags.
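To make the credit-assignment scheme above concrete, the following minimal Python sketch ranks an ensemble of search techniques by an area-under-curve score plus an exploration bonus. It illustrates the mechanism described in this subsection rather than OpenTuner's actual implementation; the window length, weights, and technique names are assumptions.

```python
import math
import random
from collections import deque

class BanditEnsemble:
    """Toy multi-armed bandit over search techniques (arms).

    An arm earns area-under-curve credit whenever one of its recent trials
    produced a new global best; an exploration bonus favors rarely used arms.
    Window size and weights are illustrative assumptions.
    """

    def __init__(self, techniques, window=50, exploration_weight=0.5):
        self.techniques = list(techniques)
        self.exploration_weight = exploration_weight
        # Per-arm history of booleans: True = trial yielded a new global best.
        self.history = {t: deque(maxlen=window) for t in self.techniques}
        self.uses = {t: 0 for t in self.techniques}
        self.total_uses = 0

    def _auc(self, tech):
        """Area under the 'new global best' step curve, scaled to [0, 1]."""
        hist = self.history[tech]
        if not hist:
            return 0.0
        height, area = 0, 0
        for improved in hist:
            if improved:
                height += 1
            area += height
        max_area = len(hist) * (len(hist) + 1) / 2  # every trial improved
        return area / max_area

    def select(self):
        """Pick the next technique: exploit high AUC, explore rarely used arms."""
        def score(tech):
            bonus = math.sqrt(
                2.0 * math.log(self.total_uses + 1) / (self.uses[tech] + 1))
            return self._auc(tech) + self.exploration_weight * bonus
        return max(self.techniques, key=score)

    def report(self, tech, yielded_new_best):
        """Record the outcome of one trial run by `tech`."""
        self.history[tech].append(bool(yielded_new_best))
        self.uses[tech] += 1
        self.total_uses += 1


if __name__ == "__main__":
    bandit = BanditEnsemble(
        ["differential_evolution", "genetic", "pso", "simulated_annealing"])
    best = float("inf")
    for _ in range(200):
        tech = bandit.select()
        qor = random.random()          # stand-in for one CAD-tool evaluation
        improved = qor < best
        best = min(best, qor)
        bandit.report(tech, improved)
    print("most-credited arm:", bandit.select())
```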
2.2 FPGA Compilation Flow
The mainstream FPGA compilation flow takes an RTL design as input and generates a device-specific bitstream. This process involves several distinct and modular steps: logic synthesis, technology mapping, placement, and routing. Synthesis lowers the RTL design into a technology-independent logic or gate-level network. Technology mapping then maps this network into a netlist of look-up tables (LUTs). Placement determines the physical location of each LUT in the netlist, and routing connects all signal paths using the available programmable interconnects.

Many of these steps involve NP-hard problems. To tackle the difficulty of solving these problems, experts propose approximate solutions based on heuristic methods. FPGA CAD tools often provide designers with a set of configuration parameters that select between heuristics or influence the behavior of a heuristic. Examples of such parameters include enabling remapping or retiming in logic synthesis, deciding how much to spread logic for congestion, or how much to weight wire delay. Table 1 shows the tunable parameters available in the open-source Verilog-to-Routing (VTR) toolflow [16], covering logic synthesis, packing, placement, and routing. With the commercial FPGA CAD tools from Altera and Xilinx, a much larger collection of switches is available (roughly 60 to 80 options are exposed to designers).

Table 1: List of tunable VTR configuration parameters.
Parameter                     Value                              Stage
resyn                         {on, off}                          logic synthesis
resyn2                        {on, off}                          logic synthesis
resyn3                        {on, off}                          logic synthesis
alpha_clustering              [0-1]                              packing
beta_clustering               [0-1]                              packing
allow_unrelated_clustering    {on, off}                          packing
connection_driven_clustering  {on, off}                          packing
alpha_t                       [0.5-0.9]                          placement
seed                          {1, 2, 3, 4, 5}                    placement
inner_num                     {1, 10, 100}                       placement
timing_tradeoff               [0.3-0.7]                          placement
inner_loop_recompute          {0, 1, 5}                          placement
td_place_exp_first            {0, 1, 3}                          placement
td_place_exp_last             {5, 8, 10}                         placement
max_router_iterations         {20, 50, 80}                       routing
initial_pres_fac              [0.3-100]                          routing
pres_fac_mult                 [1.2-2]                            routing
acc_fac                       [1-2]                              routing
bb_factor                     {1, 3, 5}                          routing
astar_fac                     [1-2]                              routing
max_criticality               [0.8-1]                            routing
criticality_exp               [0.8-1]                            routing
base_cost_type                {demand_only, delay_normalized}    routing

2.3 Autotuning FPGA Compilation Parameters
Obviously, these tool options together create an enormous design space which cannot be effectively explored by human effort alone. Besides, there is no single set of compilation parameters that works for all designs [24]. Thus, we propose to automatically configure the FPGA tool options to achieve faster design closure and better QoR. The QoR metrics can be timing slack, resource usage, or power consumption. Due to the slow runtime of FPGA compilation, effective parallelization is paramount to ensure the viability of autotuning. In this paper, we propose a parallel search methodology named DATuner. DATuner adopts OpenTuner as its core search engine to leverage the advantage of an ensemble over a single technique.

3. DATuner Techniques
In this section, we first propose a dynamic solution space partitioning method for parallelization. We also provide an MAB-based method for uneven computing resource allocation. We then illustrate our parallelization framework.

3.1 Motivation
We formulate the EDA autotuning problem as a search problem in an N-dimensional solution space called S0, where each dimension can be either continuous or discrete. Obviously, S0 grows exponentially with the dimensionality of the solution space. When N is large, it is impractical to exhaustively traverse the search space to find the optimal solution. A simple approach is to partition S0 into subspaces, launch multiple parallel searches, and assign each search instance to explore within one subspace. However, since some subspaces are more promising than others in terms of yielding good solutions, it is vital to partition the solution space in a way that assigns the majority of search instances to the most promising subspaces. Of course, just as there is no single set of tool parameters that works well for all designs, no static partitioning of the solution space is optimal across designs.

Finding a suitable partitioning of the search space is key to improving the quality of the search, but doing so manually is usually hard and requires adequate domain knowledge. Thus, we propose a novel parallelization method, where we gradually identify promising subspaces via dynamic partitioning based on QoR samples obtained at runtime. We further propose an MAB-based compute resource allocation method for uneven sampling — more compute resources are assigned to promising subspaces to increase sampling quality.

3.2 Dynamic Solution Space Partitioning
We propose to partition the solution space dynamically: the partitioning does not rely on any prior knowledge; instead, it is decided by posterior knowledge learned at runtime. More specifically, our partitioning method iteratively constructs a space partitioning tree (SP tree), where the root node of the tree represents the initial solution space and each intermediate node represents a subspace. The leaf nodes of the SP tree collectively form the active partitioning that is currently used by the parallel search instances. In each iteration, we select a subspace and divide it into multiple smaller partitions. As a result, a leaf in the SP tree becomes an intermediate node that branches out to multiple new children, with each "newborn" representing a newly created subspace.

Given an N-dimensional solution space, we propose to dynamically partition the search space into subspaces and allocate more computing resources for searching within promising subspaces. This dynamic partitioning process is illustrated in Figure 2. S0 represents the initial solution space and S1, S2, ..., Sn are the subspaces iteratively created during the partitioning process. Figure 2(a) shows the known samples that have been explored, where each grey dot indicates a sample with a good QoR and red crosses are those with poor QoRs. At each step of the partitioning process, a key decision is to select the most profitable dimension out of the N dimensions to partition, such that we gain as much information about the solution space as possible. Here we propose to examine the entropy and information gain [20] when partitioning a specific dimension. Specifically, for each dimension i in the N-dimensional search space, we compute the conditional entropy assuming we partition along dimension i and derive the corresponding information gain. We select the dimension with the highest information gain to partition.

Formally, for a subspace Si, we define the set of known samples within Si as Di. We further label each sample as good or bad based on its associated QoR, and use Dgi and Dbi to denote the subsets of good and bad samples within Di, respectively (note that Dgi ∪ Dbi = Di). Then we define the entropy of Di as

  H(Di) = -( (|Dgi|/|Di|) log(|Dgi|/|Di|) + (|Dbi|/|Di|) log(|Dbi|/|Di|) )

where |Di|, |Dgi|, and |Dbi| are the cardinalities of the sets Di, Dgi, and Dbi, respectively. We next define the conditional entropy of Di conditioned on a specific dimension of Si. Suppose d represents the parameter chosen for the d-th dimension of Si, and assume d has k possible discrete values. If we further define H(Di | d = j) as the entropy of Di conditioned on d taking the value j, we have

  H(Di | d) = sum_{j=1..k} ( |D_{i,d=j}| / |Di| ) * H(Di | d = j)

Here |D_{i,d=j}| is the cardinality of the set of samples in Si with d set to j. With the above notation, we can formally define the information gain along dimension d as G(Di, d) = H(Di) - H(Di | d). Our dynamic space partitioning algorithm creates a new subspace by partitioning along the dimension with the highest information gain. If the chosen dimension has k possible parameter values (continuous values will be discretized), it gets partitioned and becomes an intermediate node in the SP tree with (k - 1) new children.

For the example in Figure 2(a), we have the following calculations when deciding whether to partition along the x or the y dimension. Here we assume we have sampled 14 design points and eight of them have good QoRs. Therefore, the entropy of the initial solution space S0 is H(D0) = -( (8/14) log(8/14) + (6/14) log(6/14) ) = 0.986. We then compute the conditional entropy of D0 conditioned on dimensions x and y as H(D0|x) = (8/14) * 0.81 + (6/14) * 0.91 = 0.853 and H(D0|y) = (6/14) * 1.00 + (8/14) * 0.96 = 0.977. Finally, we compute the information gain for partitioning along the x or y dimension as G(D0, x) = 0.133 and G(D0, y) = 0.009. Since the former information gain is higher, we decide to partition S0 along the x dimension, which results in two new subspaces S1 and S2, as shown in Figure 2(b, c). In the next iteration of this process, we follow the same method and choose to further partition along the y dimension in S2, as shown in Figure 2(d).
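The entropy-based dimension selection above fits in a few lines of Python. The sketch below computes G(D, d) for each candidate dimension from a set of (configuration, good/bad) samples and returns the dimension to split on. The sample assignment in the usage example is a hypothetical one chosen only to match the totals of the Figure 2(a) example (14 samples, 8 good, so H(D0) ≈ 0.986); like the example, it selects the x dimension.

```python
import math
from collections import Counter, defaultdict

def entropy(num_good, num_bad):
    """Binary entropy (base 2) of a good/bad sample set; 0 for empty or pure sets."""
    total = num_good + num_bad
    h = 0.0
    for n in (num_good, num_bad):
        if 0 < n < total:
            p = n / total
            h -= p * math.log2(p)
    return h

def information_gain(samples, dim):
    """G(D, dim) = H(D) - H(D | dim).

    `samples` is a list of (config, is_good) pairs, where `config` is a dict
    mapping dimension name -> (discretized) value.
    """
    good = sum(1 for _, ok in samples if ok)
    h_d = entropy(good, len(samples) - good)

    buckets = defaultdict(Counter)            # value of `dim` -> good/bad counts
    for config, ok in samples:
        buckets[config[dim]][ok] += 1

    h_cond = 0.0
    for counts in buckets.values():
        n = counts[True] + counts[False]
        h_cond += (n / len(samples)) * entropy(counts[True], counts[False])
    return h_d - h_cond

def best_split_dimension(samples, dims):
    """Pick the dimension with the highest information gain."""
    return max(dims, key=lambda d: information_gain(samples, d))


if __name__ == "__main__":
    # Hypothetical 14 samples (8 good, 6 bad) over two discretized dimensions.
    samples = (
        [({"x": "lo", "y": "lo"}, True)] * 3 + [({"x": "lo", "y": "hi"}, True)] * 3
        + [({"x": "lo", "y": "hi"}, False)] * 2
        + [({"x": "hi", "y": "lo"}, True)] + [({"x": "hi", "y": "hi"}, True)]
        + [({"x": "hi", "y": "lo"}, False)] * 2 + [({"x": "hi", "y": "hi"}, False)] * 2
    )
    print(round(information_gain(samples, "x"), 3))   # ~0.13 with this assignment
    print(best_split_dimension(samples, ["x", "y"]))  # -> "x"
```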

Figure 2: Conceptual illustration of the solution space partitioning method — The solution space is 2D with each dimension constrained to [0, 1]. Grey circles represent samples with good QoR and red crosses are poor samples. (a) and (c) show the partitions and the known samples at each iteration; (b) and (d) show the corresponding SP trees. Given four cores, we assign a different number of cores to each subspace based on the quality of the solutions in that subspace. At each step, we choose the dimension with the highest information gain to partition.

3.3 MAB-directed Computing Resource Allocation
Once we obtain an SP tree, we need to properly allocate the available compute resources to the subspaces to maximize the overall sampling rewards. The key here is to balance exploitation (i.e., allocating more search instances to the most promising subspace) with exploration (i.e., sampling less promising subspaces to obtain a better estimate of the enclosed solutions). Just as we can leverage MAB to choose the search technique when exploring within a subspace, we also propose to use it to solve the resource allocation problem. More concretely, we treat the subspaces as arms. To calculate the reward, we employ the UCB1 algorithm [5], which defines the reward of a subspace Si at round t as

  payoff(Si, t) = x̄i + sqrt( 2 ln t / ni )

In this formulation, x̄i is the average QoR of the samples in Si and ni is the number of times (i.e., the frequency) Si has been chosen so far by the MAB. Essentially, the QoR and frequency terms are used to balance exploitation and exploration in the compute resource allocation.

3.4 Parallelization Framework
Figure 3 shows the overall flow of the proposed parallelization scheme, which follows a master-slave model. The master is responsible for distributing the parallel slave processes (search instances) to different subspaces. It starts with the original global space and, after collecting a sufficient number of samples, performs dynamic space partitioning and allocates search instances to the appropriate subspaces. A slave process invokes its search techniques to explore a specific subspace and reports its results back to the master. The master then further partitions the search space and reallocates the slaves. This process iterates until the target QoR is attained or a timeout is reached.

Figure 3: DATuner parallelization framework — We apply a master/slave parallelization model and use MPI for communication. The master sends tasks to the slaves. A task specifies which subspace the slave instance should explore and may contain the historical best QoR sample found in that subspace (if one exists) for the slave instance to use as a search seed.
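A compact sketch of the master's allocation step described in Sections 3.3 and 3.4: each leaf subspace of the SP tree is scored with the UCB1 payoff x̄i + sqrt(2 ln t / ni), and the available slave processes are then divided among subspaces in proportion to those scores. The proportional rounding policy and the normalization of QoR to [0, 1] are assumptions; the paper specifies only the payoff formula.

```python
import math

def ucb1_payoff(avg_qor, times_chosen, round_idx):
    """UCB1 score of a subspace: average (normalized) QoR plus exploration bonus."""
    if times_chosen == 0:
        return float("inf")            # force at least one visit per subspace
    return avg_qor + math.sqrt(2.0 * math.log(round_idx) / times_chosen)

def allocate_slaves(subspaces, num_slaves, round_idx):
    """Split `num_slaves` search instances among SP-tree leaves by UCB1 score.

    `subspaces` maps subspace id -> (avg_qor in [0, 1], times_chosen).
    Returns subspace id -> number of slaves; every leaf keeps at least one.
    """
    scores = {sid: ucb1_payoff(avg, n, round_idx)
              for sid, (avg, n) in subspaces.items()}
    # Cap infinite scores (never-sampled leaves) at the largest finite score + 1.
    finite = [s for s in scores.values() if s != float("inf")] or [1.0]
    cap = max(finite) + 1.0
    scores = {sid: min(s, cap) for sid, s in scores.items()}

    total = sum(scores.values())
    alloc = {sid: max(1, int(round(num_slaves * s / total)))
             for sid, s in scores.items()}

    # Trim low-scoring subspaces / pad high-scoring ones to hit num_slaves exactly.
    while sum(alloc.values()) > num_slaves:
        candidates = [sid for sid in alloc if alloc[sid] > 1]
        if not candidates:
            break
        alloc[min(candidates, key=lambda k: scores[k])] -= 1
    while sum(alloc.values()) < num_slaves:
        alloc[max(alloc, key=lambda k: scores[k])] += 1
    return alloc


if __name__ == "__main__":
    # Three leaves of the SP tree after a few rounds of sampling.
    leaves = {"S1": (0.8, 5), "S3": (0.4, 3), "S4": (0.2, 1)}
    print(allocate_slaves(leaves, num_slaves=8, round_idx=9))
```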

4. Experimental Results
In this section, we evaluate the QoR improvement and convergence time of DATuner on two widely used FPGA CAD tools: (1) the academic Verilog-to-Routing (VTR) framework [16] (version 7.0) and (2) the Xilinx Vivado Design Suite [3] (version 2016.1). We run DATuner with eight machines (each machine running one search instance). Each machine has a quad-core Xeon processor running at 2.8GHz.

We make use of a set of real-life benchmarks from several different domains, including image processing, general-purpose computing, communication, security, and computer vision. For the experiments targeting VTR, we choose 10 benchmarks covering different domains from the VTR-7.0 benchmark suite [16]. For the experiments using Vivado, we use five large industry designs. The characteristics of the VTR benchmarks and the industry designs are summarized in Table 2 and Table 3, respectively.
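To make the experimental setup concrete, the sketch below shows one way a search instance can evaluate a candidate VTR configuration: it renders the sampled parameters into command-line options, runs the flow, and converts the reported critical-path delay into a frequency. The executable path, option spellings, and log-parsing pattern are assumptions for illustration and must be adapted to the installed VTR version; the Vivado experiments would instead be driven through a generated Tcl script.

```python
import re
import subprocess

# Hypothetical paths/filenames; adjust for the local VTR 7.0 installation.
VPR_BINARY = "./vpr"
ARCH_FILE = "k6_frac_N10_mem32K_40nm.xml"

def run_vtr(circuit_blif, config, timeout_s=3600):
    """Run placement and routing once with the sampled `config`.

    `config` maps parameter names from Table 1 (e.g. "seed", "alpha_t",
    "inner_num", "max_router_iterations") to values.  Returns the achieved
    frequency in MHz, or None if the run fails or the log cannot be parsed.
    """
    cmd = [VPR_BINARY, ARCH_FILE, circuit_blif]
    for name, value in config.items():
        cmd += ["--" + name, str(value)]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None

    # Assumed log format: a line reporting the final critical path in ns.
    match = re.search(r"critical path[^\d]*([\d.]+)\s*ns",
                      result.stdout, flags=re.IGNORECASE)
    if not match:
        return None
    critical_path_ns = float(match.group(1))
    return 1000.0 / critical_path_ns      # ns -> MHz


if __name__ == "__main__":
    sample = {"seed": 3, "alpha_t": 0.7, "inner_num": 10,
              "max_router_iterations": 50}
    print(run_vtr("sha.blif", sample))
```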

Table 2: Profiles of the benchmarks used for VTR tuning.
Circuit          LUT     BRAM     Description
MkPktMerge       239     7344     Packet processing
Diffeq1          362     0        Scientific computing
Ch_intrinsics    425     256      Memory system
Raygentop        1884    5376     Ray tracing
MkSMAdapter4B    1960    4456     Packet processing
Sha              2001    0        Cryptography
Boundtop         3053    32768    Ray tracing
Or1200           3075    2048     RISC processor
Blob_merge       8067    0        Image processing
Stereovision0    9567    0        Computer vision

Table 3: Profiles of the industry benchmarks used for Vivado tuning.
Circuit    FF       LUT      Constraint (ns)    Device
Design1    14545    14122    2.60               Virtex 7K160T
Design2    17847    29012    6.55               Virtex 7K70T
Design3    18204    28361    2.00               Virtex 7K160T
Design4    26098    17242    2.65               Virtex 7VX330T
Design5    27873    38261    4.30               Virtex 7VX330T

Table 4: List of tunable Vivado configuration parameters.
Parameter            Value                                                   Stage
OptDirective         {Explore, ExploreSequentialArea, AddRemap,              logic synthesis
                      ExploreArea, Default}
PlaceDirective       {Explore, ExtraNetDelay_high, ExtraNetDelay_medium,     placement
                      ExtraNetDelay_low, ExtraPostPlacementOpt,
                      WLDrivenBlockPlacement, LateBlockPlacement,
                      SSI_SpreadLogic_high, SSI_SpreadLogic_low,
                      AltSpreadLogic_low, AltSpreadLogic_medium,
                      AltSpreadLogic_high, ExtraTimingOpt, Default}
Fanout_opt           {on, off}                                               post-placement
Placement_opt        {on, off}                                               post-placement
Critical_cell_opt    {on, off}                                               post-placement
Critical_pin_opt     {on, off}                                               post-placement
Retime               {on, off}                                               post-placement
Rewire               {on, off}                                               post-placement
RouteDirective       {Explore, HigherDelayCost, Default}                     routing

We select 23 tunable parameters from the VTR-7.0 manual [16] and 9 tunable parameters from the Vivado flow, covering logic synthesis, packing, placement, and routing. The lists of parameters for VTR and Vivado are shown in Table 1 and Table 4, respectively.

4.1 Tuning VTR
We first compare DATuner with a parallel baseline that performs a static partitioning scheme, denoted as Static-Part, in Section 4.1.1. We then compare DATuner with a serial baseline running OpenTuner on one machine, denoted as Ser-MAB, in Section 4.1.2.

4.1.1 Comparison with Static Partitioning
For the case of static partitioning, we choose three parameters from the VTR parameter list as the pivots for partitioning and partition the solution space into eight subspaces. We empirically choose the partitioning pivots to be alpha_t, allow_unrelated_clustering, and base_cost_type, which usually have a large impact on timing quality.

In Figure 4, we compare DATuner with Static-Part, where both methods use eight machines for parallel searching. We conduct 100 iterations of searching for the 10 VTR benchmarks and show the best-found frequency at different iterations. DATuner achieves better QoR as well as a faster convergence rate than Static-Part for eight out of the ten designs. Besides, DATuner shows a significant improvement in average QoR over Static-Part, further demonstrating the benefits of dynamic partitioning. From our experiments, the SP trees learnt by DATuner at runtime through dynamic partitioning are indeed very different across the designs.

Table 5 provides a more detailed comparison between the best configurations found by DATuner and static partitioning in terms of the clock frequency, runtime, and area increase over the results from the default setting. It is not surprising to see frequency improvements from both dynamic and static partitioning, which indicates the effectiveness of using MAB-guided search ensembles. In addition, DATuner with dynamic partitioning outperforms the static scheme in the majority of designs in terms of both frequency and LUT counts. The runtime overheads are also similar.
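For reference, the Static-Part baseline of Section 4.1.1 can be expressed as a one-off enumeration: the three pivot parameters each take two coarse settings, giving the 2^3 = 8 fixed subspaces handed to the eight machines. The concrete value splits below are illustrative assumptions; only the choice of pivots comes from the text.

```python
from itertools import product

# Pivot parameters from Section 4.1.1; the two coarse settings per pivot
# (e.g. splitting alpha_t's [0.5, 0.9] range at its midpoint) are assumptions.
PIVOTS = {
    "alpha_t": [(0.5, 0.7), (0.7, 0.9)],
    "allow_unrelated_clustering": [("on",), ("off",)],
    "base_cost_type": [("demand_only",), ("delay_normalized",)],
}

def static_partitions():
    """Enumerate the 8 fixed subspaces used by the Static-Part baseline."""
    names = list(PIVOTS)
    for combo in product(*(PIVOTS[n] for n in names)):
        yield dict(zip(names, combo))

if __name__ == "__main__":
    for machine_id, subspace in enumerate(static_partitions()):
        print(machine_id, subspace)   # one subspace per parallel machine
```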



Figure 4: Comparison of dynamic and static partitioning schemes for VTR tuning — Par-MAB uses the Static-Part parallelization scheme and Par-MAB+Dyn-Part is the DATuner method. "best" is the best frequency found up to the current iteration and "avg" is the average frequency of all design points found up to the current iteration.

Table 5: Profiles of the best-found VTR configurations by DATuner and static partitioning — Freq., RT, and LUT are the ratios of the best configuration over the default on clock frequency, runtime, and LUT count, respectively.
                   DATuner                          Static-Part
Circuit            Freq.    RT      LUT             Freq.    RT      LUT
Blob_merge         1.11     5.22    1.01            1.09     2.54    1.01
Boundtop           1.17     3.85    1.08            1.14     3.21    1.06
Diffeq1            1.16     1.16    1.06            1.13     1.23    1.36
Ch_intrinsics      1.06     3.51    0.97            1.06     4.59    1.73
MkPktMerge         1.09     0.89    1.20            1.14     0.68    1.00
MkSMAdapter4B      1.04     1.86    0.99            1.01     0.69    1.00
Sha                1.12     2.28    1.00            1.06     1.69    1.00
Raygentop          1.12     3.62    0.99            1.07     2.09    1.42
Or1200             1.08     0.95    1.00            1.06     2.18    1.12
Stereovision0      1.18     1.74    1.29            1.21     2.04    1.24
Avg.               1.11     2.51    1.06            1.09     2.10    1.19

4.1.2 Comparison with Serial Search
We further compare DATuner with Ser-MAB in Figure 5. For a fair comparison, we constrain DATuner and Ser-MAB to the same amount of compute effort: Ser-MAB runs on one machine for 800 iterations, while DATuner runs eight search instances of 100 iterations each.

While each MAB instance makes use of an ensemble of heuristic searches to avoid getting easily stuck in local optima, our parallelization scheme further enables multiple MAB search instances to explore additional promising regions that cannot be quickly reached by a single search. In Figure 6, we attempt to visualize the search trajectory of one VTR design through dimensionality reduction. Here the black dashed line captures the search trajectory of Ser-MAB, which is mostly trapped in one region. The other colored lines represent the search trajectories of the different search instances instantiated by DATuner, which simultaneously explore different promising regions of the design space. We believe this partly explains the results in Figure 5, where Ser-MAB results in a worse QoR than DATuner in six out of the ten VTR designs.

In Figure 7, we further compare DATuner with Ser-MAB in terms of the runtime needed to achieve a specific QoR target. We set the QoR target to be a 2% improvement in design frequency over the default design, and measure the speedup in runtime of DATuner (with eight search instances) over Ser-MAB. DATuner reaches the target QoR 11X faster than Ser-MAB.

4.2 Tuning Vivado
We have also applied DATuner to resolve the timing closure problem for five large industry designs, which are listed in Table 3. These designs are specified with very tight timing constraints and are known to be challenging benchmarks in terms of meeting timing.


Figure 5: Comparison of DATuner and sequential search (Ser-MAB) for VTR tuning — We use the same total number of search iterations for DATuner and Ser-MAB.


Figure 6: Search trajectory of VTR tuning for Diffeq1 — We use the t-SNE package [23] to reduce the dimensionality of the search space of the VTR design Diffeq1 from 23 to 2 for the sake of visualization. Different colors represent samples of different search engines. Arrows indicate the search time sequence. The black dashed line represents the trajectory of Ser-MAB; the other colored lines capture traces of DATuner. Ser-MAB is stuck in one promising region while DATuner reaches multiple promising regions.

Figure 7: Runtime improvement using DATuner for VTR tuning — Compared with Ser-MAB, DATuner reaches the same frequency target 11.1X faster on average using eight machines.

In addition to showing the timing improvement over the default settings of Vivado, we also experiment with the Vivado exploration mode, a tool option that explores various optimizations in the placement and routing stages for improving timing. Figure 8 shows that this mode improves the WNS for four designs but still fails to meet the timing constraints. In contrast, DATuner helps close timing for all designs.

We also study the profiles of the best-found configurations by DATuner. Figure 9 shows that the average similarity among the five designs is only 46%, indicating that a one-size-fits-all solution indeed does not exist. Table 6 further shows the ratios of worst negative slack (WNS), runtime, and area between the best-found configurations and the default. On average, we reduce WNS by 11% and increase the runtime by a factor of 3.7X. The resource utilization is almost the same as with the default configuration.

Figure 8: Vivado tuning by DATuner for large-scale industry designs — These designs fail to meet timing with both the Vivado-default and Vivado-explore modes, whereas DATuner closes timing for all of them.

Figure 9: Similarity matrix of the best-found configurations for the Vivado designs — the average similarity is 46%.

Table 6: Profiles of the best-found Vivado configurations by DATuner — WNS, RT, LUT, and FF are the ratios of the best configuration over the default one in terms of worst negative slack, runtime, LUT count, and flip-flop count, respectively.
Circuit    WNS ratio    RT ratio    LUT ratio    FF ratio
Design1    0.77         2.38        0.83         1.00
Design2    0.93         6.80        1.00         1.00
Design3    0.94         3.26        1.00         1.00
Design4    0.88         4.30        1.03         1.00
Design5    0.90         1.62        1.00         1.00
Avg.       0.89         3.67        0.97         1.00

5. Related Work
Intelligent design space exploration and autotuning-based techniques have been proposed in the domain of high-performance computing (e.g., stencil computation [10] and matrix computation [21]) and compiler optimization (e.g., a domain-specific compiler for image processing applications [18] and the compiler for the Java Virtual Machine [12]).

Similar research efforts have recently emerged in the EDA field to automatically configure CAD tool options to improve design quality. Xilinx SmartXplorer [1] and the Altera Design Space Explorer [2] use predefined or user-specified configuration bundles for design space exploration. They support parallel CAD runs with fixed sets of parameters and automatically report the best solution found. Unlike the fixed heuristics used in these tools, DATuner dynamically determines the best search method for the current design.

InTime [13, 14] is a commercial autotuning tool for FPGA timing closure based on the naïve Bayesian classifier, and has recently been extended to include other machine learning techniques as well [24]. InTime builds a database of configurations from a series of preliminary runs and learns to predict the next set of CAD tool options to improve timing results, achieving a 30% improvement in timing results compared to vendor-supplied design space exploration tools. The authors in [17] also propose machine learning techniques such as linear regression and random forest to autotune the performance and power consumption of FPGA designs. We note that the learning-based sampling and classification techniques used in InTime [14] and [17] are complementary to our proposal. It is possible to integrate these methods into DATuner as an additional arm in the MAB algorithm. Besides timing closure, our framework can also be applied to other EDA tools.

OpenTuner [4] leverages the MAB method to dynamically select the best search technique from an ensemble of search strategies to tune global compiler switches, which improves the performance of software compilers such as GCC by up to 2.8x over the baseline using gcc -O3. Our work builds on OpenTuner, targets hardware synthesis rather than software compiler optimizations, and investigates parallelization techniques to speed up the tuning process.

6. Conclusions
In this paper we propose DATuner, a parallel autotuning framework for FPGA compilation using the multi-armed bandit technique. To mitigate the high runtime cost incurred by the complex CAD optimization process, we devise an efficient parallelization scheme that enables many MAB-based autotuners to explore the design space simultaneously. Concretely, DATuner dynamically partitions the solution space into promising subspaces based on information gain, and allocates compute resources among subspaces to balance the exploration of unknown subspaces and the exploitation of subspaces with known high-quality solutions. Applications of DATuner on VTR and the Xilinx Vivado tools have demonstrated promising improvements in quality and convergence rate across a variety of academic and industry designs.

7. Acknowledgements
This research was supported in part by DARPA Young Faculty Award D15AP00096, Intel Corporation under the ISRA Program, a research gift from Xilinx, Inc., National Natural Science Foundation of China (NSFC) Grant 61520106004, Beijing Natural Science Foundation (BJNSF) Grant 4142022, and the Chinese Scholarship Council.

References
[1] SmartXplorer for ISE Project Navigator Users Tutorial. Xilinx Inc., 2010.
[2] Quartus II Handbook Volume 1: Design and Synthesis. Altera Corporation, 2014.
[3] Vivado Design Suite User Guide. Xilinx Inc., 2015.
[4] J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom, U.-M. O'Reilly, and S. Amarasinghe. OpenTuner: An Extensible Framework for Program Autotuning. Int'l Conf. on Parallel Architectures and Compilation Techniques (PACT), 2014.
[5] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 2002.
[6] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing, 2002.
[7] D. Chen, J. Cong, and P. Pan. FPGA Design Automation: A Survey. Foundations and Trends in Electronic Design Automation, 1(3):139–169, 2006.
[8] W. Chen, Y. Wang, and Y. Yuan. Combinatorial Multi-Armed Bandit: General Framework and Applications. Int'l Conf. on Machine Learning (ICML), 2013.
[9] V. F. Farias and R. Madan. The Irrevocable Multiarmed Bandit Problem. Operations Research, 59:383, 2011.
[10] J. D. Garvey and T. S. Abdelrahman. Automatic Performance Tuning of Stencil Computations on GPUs. Int'l Conf. on Parallel Processing (ICPP), pages 300–309, Sept 2015.
[11] J. Gittins, K. Glazebrook, and R. Weber. Multi-Armed Bandit Allocation Indices. 2011.
[12] S. Jayasena, M. Fernando, T. Rusira, C. Perera, and C. Philips. Auto-Tuning the Java Virtual Machine. Parallel and Distributed Processing Symposium Workshop (IPDPSW), pages 1261–1270, 2015.
[13] N. Kapre, B. Chandrashekaran, H. Ng, and K. Teo. Driving Timing Convergence of FPGA Designs through Machine Learning and Cloud Computing. IEEE Symp. on Field Programmable Custom Computing Machines (FCCM), 2015.
[14] N. Kapre, H. Ng, K. Teo, and J. Naude. InTime: A Machine Learning Approach for Efficient Selection of FPGA CAD Tool Parameters. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), 2015.
[15] L. Lai, H. Jiang, and H. V. Poor. Medium Access in Cognitive Radio Networks: A Competitive Multi-Armed Bandit Framework. Asilomar Conf. on Signals, Systems and Computers, 2008.
[16] J. Luu, J. Goeders, M. Wainberg, A. Somerville, T. Yu, K. Nasartschuk, M. Nasr, S. Wang, T. Liu, N. Ahmed, K. B. Kent, J. Anderson, J. Rose, and V. Betz. VTR 7.0: Next Generation Architecture and CAD System for FPGAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS), June 2014.
[17] A. Mametjanov, P. Balaprakash, C. Choudary, P. D. Hovland, S. M. Wild, and G. Sabin. Autotuning FPGA Design Parameters for Performance and Power. IEEE Symp. on Field Programmable Custom Computing Machines (FCCM), pages 84–91, 2015.
[18] R. T. Mullapudi, V. Vasista, and U. Bondhugula. PolyMage: Automatic Optimization for Image Processing Pipelines. Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 429–443, 2015.
[19] S. Ontañón. The Combinatorial Multi-Armed Bandit Problem and its Application to Real-Time Strategy Games. Artificial Intelligence and Interactive Digital Entertainment Conference, 2013.
[20] C. E. Shannon. A Mathematical Theory of Communication. ACM SIGMOBILE Mobile Computing and Communications Review, 2001.
[21] O. Spillinger, D. Eliahu, A. Fox, and J. Demmel. Matrix Multiplication Algorithm Selection with Support Vector Machines. 2015.
[22] L. Tran-Thanh, S. Stein, A. Rogers, and N. R. Jennings. Efficient Crowdsourcing of Unknown Experts using Bounded Multi-Armed Bandits. Artificial Intelligence, 214:89–111, 2014.
[23] L. Van der Maaten and G. Hinton. Visualizing Data Using t-SNE. Journal of Machine Learning Research, 2008.
[24] Q. Yanghua, N. Kapre, H. Ng, and K. Teo. Improving Classification Accuracy of a Machine Learning Approach for FPGA Timing Closure. IEEE Symp. on Field Programmable Custom Computing Machines (FCCM), 2016.