Intelligent Exploration for High-Level and System Synthesis

Antonino Tumeo

AIDArc 2020

1 Outline

• Synthesis of accelerators for irregular applications and data analytics
• Speeding up design space exploration in high-level and system synthesis with bio-inspired heuristics
• Overview of SODALITE
• Opportunities for artificial intelligence in SODALITE

2 Irregular Applications Characteristics

• Unpredictable, fine-grained data accesses
• Poor locality
• Pointer or linked-list based data structures
• Graphs & sparse matrices, unbalanced trees, unstructured grids
• Difficult to partition in a balanced way
• Inherent parallelism (for each element)
• High synchronization intensity

• In general, memory-bound
• High memory parallelism, but many small memory operations in unrelated locations
• The key problem is actual bandwidth utilization

• Prototypical irregular kernels: graph algorithms
• Data analysts do not only want to compute metrics on graphs, but also, and foremost, to query graph databases (e.g., to find interesting patterns)

3 Application-specific Accelerators

• As Moore’s law slows down, application-specific accelerators appear to be the main approach to keep increasing efficiency

• At one end, sea of application-specific accelerators

• At the other end, (re)emergence of (re)configurable architectures
  • FPGAs in the cloud, FPGAs for HPC
  • Renewed interest in Coarse Grained Reconfigurable Arrays (CGRAs)

• Reconfigurable architectures, and FPGAs in particular, may have a hard time reaching the peak flop rates of ASICs
• Can make it up in efficiency
• Key aspect (especially for irregular applications): enable exploration around the memory interface

4 High-Level Synthesis

• Tries to bridge the design gap of FPGA accelerators
• Generation of hardware description language (HDL) designs starting from high-level program specifications

• Conventional High-Level Synthesis flows address:
  • Dense, regular data structures
  • Simple memory models
  • Instruction-level parallelism
  • Compute-bound kernels (Digital Signal Processing-like)
• Latest commercial tools based on OpenCL work well for regular, compute-bound workloads
• Significant limits for nested loops, no support for atomic memory operations

5 Our contributions

• We have developed a set of techniques to enable HLS of irregular applications
• Customizable architectural templates and related analysis and synthesis methodologies
• Implemented in an open-source HLS research framework, PandA Bambu, available at: https://panda.dei.polimi.it

6 Query example

Return the names of all persons owning at least two cars, of which at least one is an SUV
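To make the query concrete, here is a minimal Python sketch that evaluates it over a toy set of (subject, predicate, object) triples. The data, the predicate names (`type`, `name`, `owns`), and the function name are illustrative assumptions, not the benchmark or source code used in the talk; the sketch also assumes that everything a person owns is a car.

```python
from collections import defaultdict

# Hypothetical toy graph, stored as (subject, predicate, object) triples.
TRIPLES = [
    ("alice", "type", "Person"), ("alice", "name", "Alice"),
    ("bob", "type", "Person"), ("bob", "name", "Bob"),
    ("carol", "type", "Person"), ("carol", "name", "Carol"),
    ("car1", "type", "SUV"), ("car2", "type", "Sedan"),
    ("car3", "type", "Sedan"), ("car4", "type", "Sedan"),
    ("car5", "type", "SUV"),
    ("alice", "owns", "car1"), ("alice", "owns", "car2"),
    ("bob", "owns", "car3"), ("bob", "owns", "car4"),
    ("carol", "owns", "car5"),
]


def owners_with_two_cars_one_suv(triples):
    """Names of persons owning at least two cars, at least one of them an SUV."""
    types, names = {}, {}
    owned = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "type":
            types[subject] = obj
        elif predicate == "name":
            names[subject] = obj
        elif predicate == "owns":
            owned[subject].add(obj)

    result = []
    # Each person can be checked independently: this is the per-element
    # parallelism (with irregular, pointer-chasing accesses) of such queries.
    for person, cars in owned.items():
        if types.get(person) != "Person":
            continue
        if len(cars) >= 2 and any(types.get(car) == "SUV" for car in cars):
            result.append(names.get(person, person))
    return sorted(result)


print(owners_with_two_cars_one_suv(TRIPLES))
```

In this toy graph only Alice qualifies: Bob owns two cars but no SUV, and Carol owns a single SUV.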

7 Source Code Example

8 Multithreaded template

• Architectural templates: expose a set of parameters (number of accelerators, memory channels, contexts) to explore
• Effectively synthesizes parallel loop iterations with atomic memory operations

9 Design Space Exploration

10 Intelligent System Design

• The previous example has shown only the parameter space of the multithreaded architecture template
• In reality, High-Level or System Synthesis needs to solve various NP-complete problems
  • High-Level Synthesis: Resource Allocation, Scheduling, Resource Binding, Interconnection…
  • System Synthesis: HW/SW partitioning, Scheduling, Mapping, Communication orchestration…
• Brute-force methods require too much time
• The problems are also tightly correlated, and may be solved in different orders
• Many Integer Linear Programming formulations
  • Still too much time to converge
• Heuristic optimization algorithms
  • Many bio-inspired (genetic algorithms, swarm optimization)

11 Genetic Algorithm for High-Level Synthesis

• Genetic Algorithms (GAs) enable exploring non-convex design spaces by evolving a population of solutions
  • Mutation introduces local variations
  • Crossover allows jumping across areas of the space and escaping local minima or maxima
  • Selection of the fittest then guides the search toward the most promising areas

• We apply GAs to the High-Level Synthesis process
  • Each chromosome represents a full synthesis process
  • By considering the full synthesis process, we can explore a much larger design space than by considering each synthesis “task” alone
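The three operators can be sketched in a few lines of Python. This is a generic single-objective GA under stated assumptions, not the encoding used in the talk: each gene is assumed to pick one of `N_CHOICES` hypothetical implementation options for one synthesis decision, and `toy_cost` is a made-up stand-in for an actual synthesis and evaluation run.

```python
import random

# Hypothetical encoding: gene i selects one of N_CHOICES options for decision i.
N_GENES, N_CHOICES = 8, 4
TARGET = [3, 0, 2, 1, 3, 1, 0, 2]  # made-up optimum, used only by the toy cost


def toy_cost(chromosome):
    """Stand-in for evaluating a full synthesis run (lower is better)."""
    return sum(abs(g - t) for g, t in zip(chromosome, TARGET))


def evolve(fitness, n_genes, n_choices, pop_size=30, generations=40,
           mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randrange(n_choices) for _ in range(n_genes)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]        # selection: keep the fittest half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)   # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n_genes):          # mutation: small local variation
                if rng.random() < mutation_rate:
                    child[i] = rng.randrange(n_choices)
            children.append(child)
        pop = parents + children              # parents survive unchanged
    return min(pop, key=fitness)


best = evolve(toy_cost, N_GENES, N_CHOICES)
print(best, toy_cost(best))
```

Because the surviving parents are carried over unchanged, the best cost in the population never gets worse from one generation to the next.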

12 NSGA-II for synthesis example

• Non-Dominated Sorting Genetic Algorithm II
• Chromosome encoding
  • Binding of Operations to Functional Units
  • Algorithms for Scheduling, Register Allocation, Interconnection
• Mutation and crossover
• Elitism preserves diversity in the population
• Crowded-comparison operator, based on density estimation, yields the crowding distance
• Selection: non-dominated rank and crowding distance
• Solutions are also ranked inside a non-dominated level: if they have the same rank, they belong to the same front and selection prefers the less crowded region

[C. Pilato, A. Tumeo, G. Palermo, F. Ferrandi, P. L. Lanzi, D. Sciuto: Improving evolutionary exploration to area-time optimization of FPGA designs. J. Syst. Archit. 54(11): 1046-1057 (2008)]
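The two quantities used by selection, non-dominated rank and crowding distance, can be illustrated with a short Python sketch. It uses a simple quadratic front-peeling loop rather than the fast non-dominated sort of the original NSGA-II, assumes all objectives are minimized, and the (area, latency) points are invented for illustration.

```python
def dominates(a, b):
    """a dominates b: no worse in every objective, strictly better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated_fronts(points):
    """Peel off successive fronts; front 0 is the current Pareto set."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i])
                       for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


def crowding_distance(points, front):
    """Larger distance means a less crowded region of the front."""
    dist = {i: 0.0 for i in front}
    for m in range(len(points[front[0]])):
        ordered = sorted(front, key=lambda i: points[i][m])
        lo, hi = points[ordered[0]][m], points[ordered[-1]][m]
        # Boundary solutions are always kept.
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for k in range(1, len(ordered) - 1):
            gap = points[ordered[k + 1]][m] - points[ordered[k - 1]][m]
            dist[ordered[k]] += gap / (hi - lo)
    return dist


# Hypothetical (area, latency) pairs for five candidate designs.
designs = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)]
fronts = non_dominated_fronts(designs)
print(fronts)
print(crowding_distance(designs, fronts[0]))
```

Selection would then prefer a lower front index and, within the same front, a larger crowding distance.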

13 Ant Colony Optimization for Scheduling and Mapping

• Ant Colony Optimization: multi-agent optimization heuristic
• Ants randomly explore different paths to the food
• At each decision point:

• They deposit pheromone in an amount that depends on the length of the path (shorter paths receive more), which encourages other ants to follow the same trail. Pheromone also evaporates over time.
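The pheromone mechanism can be sketched with a toy problem in which each ant picks one of a few alternative paths. This is a minimal illustration, not the scheduling and mapping formulation of the talk; the path lengths, the deposit rule `q / length`, and all parameter values are assumptions.

```python
import random


def ant_colony(lengths, n_ants=20, iterations=50, evaporation=0.5,
               q=1.0, seed=0):
    """Return the final pheromone level on each alternative path."""
    rng = random.Random(seed)
    pheromone = [1.0] * len(lengths)
    for _ in range(iterations):
        # Each ant picks a path with probability proportional to its pheromone.
        choices = rng.choices(range(len(lengths)), weights=pheromone, k=n_ants)
        # Evaporation: old trails fade unless they keep being reinforced.
        pheromone = [(1.0 - evaporation) * p for p in pheromone]
        # Deposit: an ant on a shorter path leaves more pheromone.
        for path in choices:
            pheromone[path] += q / lengths[path]
    return pheromone


levels = ant_colony([2.0, 5.0, 9.0])
print(levels)
```

With settings like these, the shortest path typically ends up holding most of the pheromone, since it is reinforced more strongly on every iteration while the others evaporate.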

14 Scheduling and Mapping Example

[F.Ferrandi, P.L. Lanzi, C. Pilato, D. Sciuto, A. Tumeo: Ant Colony Heuristic for Mapping and Scheduling Tasks and Communications on Heterogeneous Embedded Systems. IEEE Trans. on CAD of Integrated Circuits and Systems 29(6): 911-924 (2010)]

15 Bayesian Optimization for Mapping Pipelined Applications

• The Bayesian Optimization Algorithm (BOA) is a Probabilistic Model Building Genetic Algorithm (PMBGA)
  • Mutation and crossover operators are replaced by the construction and sampling of a Bayesian network
  • Through the Bayesian network, it can find the underlying substructures of some complex problems

• We apply BOA for the mapping of pipelined applications on a heterogeneous platform
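The "build a model, then sample it" loop of a PMBGA can be sketched as follows. Note the simplification: this is a univariate estimation-of-distribution algorithm (UMDA-style) that keeps independent per-gene probabilities, not BOA itself, which learns a Bayesian network to capture dependencies between genes. The stage times, the two processing elements, and the cost function are invented for illustration.

```python
import random

# Hypothetical pipeline: execution time of each stage, mapped onto N_PES
# identical processing elements; gene i is the element chosen for stage i.
STAGE_TIMES = [4, 2, 3, 1, 2]
N_PES = 2


def pipeline_cost(mapping):
    """Load of the busiest processing element (it bounds the throughput)."""
    loads = [0] * N_PES
    for stage, pe in enumerate(mapping):
        loads[pe] += STAGE_TIMES[stage]
    return max(loads)


def umda(cost, n_genes, n_values, pop_size=60, keep=20, generations=30,
         seed=0):
    rng = random.Random(seed)
    probs = [[1.0 / n_values] * n_values for _ in range(n_genes)]
    best = None
    for _ in range(generations):
        # Sample a new population from the current probabilistic model.
        pop = [[rng.choices(range(n_values), weights=probs[g])[0]
                for g in range(n_genes)] for _ in range(pop_size)]
        pop.sort(key=cost)
        if best is None or cost(pop[0]) < cost(best):
            best = pop[0]
        # Rebuild the model from the best individuals (replaces crossover
        # and mutation). Counts start at 1 so every value stays reachable.
        for g in range(n_genes):
            counts = [1.0] * n_values
            for individual in pop[:keep]:
                counts[individual[g]] += 1.0
            total = sum(counts)
            probs[g] = [c / total for c in counts]
    return best


best = umda(pipeline_cost, len(STAGE_TIMES), N_PES)
print(best, pipeline_cost(best))
```

Replacing the per-gene table with a learned Bayesian network is what would let the model capture which stages tend to be mapped together, which is the part this sketch leaves out.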

16 BOA example

[A. Tumeo, M. Branca, L. Camerini, C. Pilato, P. L. Lanzi, F. Ferrandi, D. Sciuto: Mapping pipelined applications onto heterogeneous embedded systems: a bayesian optimization algorithm based approach. CODES+ISSS 2009: 443-452]

17 BOA Results

• Applied to more complex task graphs (B, C, D)
• Compared to multiobjective Simulated Annealing (MSA), Tabu Search (TSA), Genetic Algorithm (GA)
• Also a hybrid formulation where each offspring generation of BOA is followed by several iterations of SA
• Reports execution latency in clock cycles, Relative Standard Deviation, and execution time of the optimization algorithm

18 Multi-objective Synthesis for Real-Time Systems

• Consider a real-time application with hard and soft deadlines, described by a task graph
• We are given a set of resources that could be composed together to form a system
  • Processors, accelerators, memories, communication elements (buses or point-to-point channels)
• We want a system that minimizes area, is feasible (no violations of hard deadlines), minimizes buffer/memory size, and minimizes violations of soft deadlines
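As a rough illustration of how one candidate system could be scored against these goals, the sketch below treats hard deadlines as a feasibility constraint and returns the remaining goals as a tuple of objectives to minimize. The `Task` fields, the way finish times are obtained, and the numbers are all hypothetical; the actual formulation in the cited work may differ.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    finish_time: int   # completion time under the candidate schedule
    deadline: int
    hard: bool         # True for a hard deadline, False for a soft one


def evaluate(tasks, resource_areas, buffer_sizes):
    """Return (feasible, (area, buffer size, soft-deadline violation))."""
    # Feasibility: no hard deadline may be missed.
    feasible = all(t.finish_time <= t.deadline for t in tasks if t.hard)
    # Soft deadlines: accumulate lateness instead of rejecting the candidate.
    soft_violation = sum(max(0, t.finish_time - t.deadline)
                         for t in tasks if not t.hard)
    return feasible, (sum(resource_areas), sum(buffer_sizes), soft_violation)


# Hypothetical candidate: three tasks, two resources, two buffers.
tasks = [
    Task("sense", finish_time=8, deadline=10, hard=True),
    Task("filter", finish_time=14, deadline=12, hard=False),
    Task("log", finish_time=20, deadline=25, hard=False),
]
print(evaluate(tasks, resource_areas=[120, 300], buffer_sizes=[16, 32]))
```

A multi-objective optimizer would discard or penalize infeasible candidates and compare the objective tuples of the remaining ones by Pareto dominance.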

19 Multi-objective Synthesis for Real-Time Systems

[Figure panels: overall flow; conversion to multi-rate task graph]

20 Multi-objective Synthesis for Real-Time Systems

• Problem formulation
  • Resource library, communication paths, mapping, scheduling
• Optimization algorithms evaluated:
  • Multiobjective Simulated Annealing (SA)
  • Multiobjective Tabu Search (TS)
  • Niched Pareto Genetic Algorithm II (GA)
• On average, the GA is more robust and able to cover more non-dominated solutions in highly constrained problems
• The TS performs worse than the SA with a high number of evaluations, but is comparable or better with few evaluations
• SA obtains valuable results on problems with higher degrees of freedom

[M. Ceriani, F. Ferrandi, P. L. Lanzi, D. Sciuto, A. Tumeo: Multiprocessor systems-on-chip synthesis using multi-objective evolutionary computation. GECCO 2010: 1267-1274]

21 SODALITE: Software Defined Accelerators from Machine Learning Tools Environment

• SODALITE is PNNL’s project in the DARPA RTML (Real Time Machine Learning) program
  • 3 years, 2 phases of 1.5 years each
  • Coordinated with a parallel NSF program
• DARPA RTML targets the development of a compiler that generates Verilog designs starting from high-level machine learning frameworks (e.g., PyTorch, TensorFlow, MXNet, CNTK, …)
• The designs will then be fabricated in chiplets

22 SODALITE overview

• Distill promising network architectures from suggested application area
  • High-Bandwidth Imaging
  • Driver to enable agile codesign approach and identification of architectural templates, but objective is generality of the synthesizer
• Synthesizer frontend lowers a High-Level Intermediate Representation (HLIR) to Low-Level IR (LLIR)
  • Initially exploit ONNX to lower to a common HLIR
  • Explore opportunities to employ MLIR as HLIR
  • LLIR: LLVM IR
• Synthesizer middle end performs the actual synthesis
  • New dataflow template-based synthesis
  • Classical high-level synthesis path
• Design Space Exploration engine plugs into the middle end
  • Heuristic optimization algorithms, including bio-inspired ones
• Closed loop with chip design and evaluation
  • Provides constant feedback for synthesizer development

23 Artificial Intelligence in SODALITE

• SODALITE is a new-generation synthesizer
• Like the examples for high-level and system synthesis, we will use optimization algorithms to explore a multidimensional design space
  • Performance, power, accuracy, heat…
• A synthesizer is a compiler
  • A large number of compiler optimizations can significantly influence the Verilog generation process
  • Not only optimizations, but also the ability to understand computational patterns and reuse
  • Patterns may not be the conventional ones
• We also need methods to estimate the quality of the results
  • ASIC vs FPGA interconnect
  • Estimators for FPGAs are mostly based on linear regression through the synthesizers; can we do better for ASICs?

24 Conclusion

• Synthesis techniques for graph analytics and large design space
• Overview of heuristic optimization methods for high-level and system synthesis
• Overview of SODALITE
• Possible directions for SODALITE design space exploration

• Looking to create an open-source ecosystem for synthesis and system-level design space exploration

25 Thank you!

• Thank you to my past and present collaborators
• Thanks to the SODALITE and SO(DA)2 team
  • Vinay Amatya, Vito Giovanni Castellana, Joseph Manzano, Marco Minutoli, Cheng Tan

• Questions?
• [email protected]
