Hybrid Optimization/Heuristic Instruction Scheduling for Programmable Accelerator Codesign

Hybrid Optimization/Heuristic Instruction Scheduling for Programmable Accelerator Codesign Tony Nowatzki∗ Newsha Ardalaniy Karthikeyan Sankaralingamz Jian Weng∗ University of California, Los Angeles∗ SimpleMachines, Inc.y University of Wisconsin-Madisonz [email protected] ABSTRACT 1 INTRODUCTION Recent programmable accelerators are faster and more en- With the slowing of exponential technology scaling, it has ergy efficient than general purpose processors, but expose become clear that processors with traditional instruction sets complex hardware/software abstractions for compilers. A are limited in terms of their energy efficiency and hence per- key problem is instruction scheduling, which requires sophis- formance scalability. To address this problem, programmable ticated algorithms for mapping instructions to distributed accelerators add a rich interface between the software (com- processing elements, routing of operand dependences, and piler or programmer) and hardware layers, enabling more ef- timing the arrival of operands to enable high throughput. ficient and exposed microarchitectures while retaining some The complex dependences between mapping, communi- level of programmability and generality [10, 11, 22, 29, 34, 42, cation and timing make prior scheduling techniques insuf- 44, 46, 48, 55, 56]. Essentially, these accelerators have repur- ficient. Optimization-based approaches are too slow, and posed spatial architecture principles for the specialization heuristic-based approaches cannot achieve high quality. Our era, where such a rich interface is acceptable. first insight is that the algorithm can be solved in a series In an attempt to achieve ASIC-like energy efficiency, sev- of phases with overlapping responsibilities to reduce com- eral recent spatial architectures completely eschew control plexity. Second, it is possible to combine optimization-based capabilities from each processing element (PE) [12, 13, 15, 16, and stochastic-heuristic based search strategies, to exploit 20, 35, 36], hearkening back to classic systolic arrays [8, 50], the best features of both. This leads to the two primary tech- but with a more flexible network. We refer to such designs niques we explore, phase overlapping and hybridization. as dedicated-PE accelerators. In this work we explore the codesign of scheduling algo- The primary challenge in adopting these exposed and re- rithms with a challenging-to-schedule programmable accel- strictive microarchitectures is that they require increasingly erator. We show we can improve its area by 35% by trimming demanding compiler support. One of the most recurring its scheduling-friendly structures, using a scheduling algo- and difficult problems in these designs is the problem of rithm that is 5× faster than the state-of-the-art optimization- instruction-scheduling: deciding when and where each com- based scheduler, with up to 2× better throughput. putation and operand communication should be performed. ACM Reference Format: More specifically, this includes the highly dependent prob- Tony Nowatzki, Newsha Ardalani, Karthikeyan Sankaralingam, Jian lems of mapping instructions to resources, routing dependent Weng. 2018. Hybrid Optimization/Heuristic Instruction Scheduling values between instances of instructions, and the critical task for Programmable Accelerator Codesign. In International conference of assigning the timing of concurrent instructions to achieve on Parallel Architectures and Compilation Techniques (PACT ’18), a high execution rate (hardware utilization). We identify November 1–4, 2018, Limassol, Cyprus. ACM, New York, NY, USA, and demonstrate that the main challenge in attaining high 15 pages. https://doi.org/10.1145/3243176.3243212 throughput is precisely matching arrival times of instruction Permission to make digital or hard copies of all or part of this work for operands, avoiding delay-mismatch. personal or classroom use is granted without fee provided that copies Though scheduling algorithms exist for dedicated-PE ac- are not made or distributed for profit or commercial advantage and that celerators, they inadequately address the above timing prob- copies bear this notice and the full citation on the first page. Copyrights lem, imposing a stark hardware/software co-design tradeoff: for components of this work owned by others than the author(s) must either sacrifice area/power efficiency by adding structures be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific to the hardware to ease the burden of scheduling, or have a permission and/or a fee. Request permissions from [email protected]. lean accelerator and suffer exorbitant scheduling times ofa PACT ’18, November 1–4, 2018, Limassol, Cyprus rigorous schedule-space search. © 2018 Copyright held by the owner/author(s). Publication rights licensed Fast scheduling algorithms are typically based on a variety to ACM. of heuristics around minimizing communication distances ACM ISBN 978-1-4503-5986-3/18/11...$15.00 and resource consumption. They have extreme difficulty https://doi.org/10.1145/3243176.3243212 PACT ’18, November 1–4, 2018, Limassol, Cyprus Nowatzki et al. methods into a hybrid scheduling algorithm, by applying Easy Config (+50% area) Hard Config (+10% area) them on the phases they are best suited for. 5.0 Greedy Heuristic In this work, we explore a design space of instruction 4.5 Our Heuristic scheduling algorithms for an extremely-lean dedicated-PE 4.0 Optimization array from the stream-dataflow [35] accelerator, a represen- 3.5 40% fail 3.0 tative programmable accelerator. We then explore how these 2.5 algorithms can tradeoff scheduling time, solution quality 2.0 0% (performance) and hardware overhead. 33% 0% 40% fail 1.5 fail 50% fail fail fail Specifically, our contributions are: 1.0 0 1 2 3 0 1 2 3 3 2 1 3 2 1 - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 • 1 1 1 1 1 1 Identification of delay mismatch as the bottleneck in Sched. Quality (throughput degredation) Scheduling Time (seconds) Scheduling Time (seconds) dedicated-PE scheduling; mitigation with delay-FIFOs. Figure 1: State-of-art scheduler effectiveness. Lower is • Developing the principles of phase-overlapping better in both axes. Max time: 20 Minutes, Benchmark- (overlapped execution of scheduling responsibilities) and s/methodology in Section 6 (page 9) hybrid-scheduling (integration of heuristic phases), which can be applied to spatial scheduling in general. constructing a schedule with matching arrival times on all • Evaluation on a representative dedicated-PE accelerator, inputs, as their timing is dependent on the source instruc- demonstrating phase-overlapping/hybrid-scheduling ef- tions’ mapping and routing of operands. The more rigorous fectiveness on extremely restrictive hardware. approach is to solve the problem mathematically, commonly as a mixed integer linear program (MILP) or using satisfiabil- As a minor contribution, we construct a sophisticated ity modulo theory (SMT) [38]. While this can lead to optimal heuristic scheduler based on stochastic search for use in solutions, and require less complex hardware, they are far hybrid scheduling, and augment a prior MILP scheduling too slow when applied to even modestly sized dedicated-PE formulation with advanced delay-matching techniques. accelerators. The fundamental codesign tradeoff can be seen in Figure 1, Findings: First, solving any phase independently, even with which shows the scheduling results for the accelerator we sophisticated optimization based search, conclusively fails on target in this work, stream-dataflow [35]. The x-axis is the a dedicated-PE accelerator. Second, this work demonstrates scheduling time in a log scale, and y-axis is the expected the importance of compiler/architecture codesign – we show throughput degradation (big circles are centroids, small cir- we can improve the area-efficiency of a high performance cles are benchmarks). Existing simple heuristics fail due to a accelerator by 35% by trimming its scheduling-friendly struc- small search space, and existing optimization techniques fail tures, using a scheduling algorithm that is 5× faster than the due to too-large a search space. The sophisticated heuristic state-of-the-art optimization-based scheduler, with up to 2× scheduler we develop, in green, always succeeds. However, better throughput. it either must suffer performance loss (up to 2.5 ×) on the Paper Organization: We first discuss challenges and con- lean and “hard” hardware configuration, or spend 1.5× area straints for dedicated-PE accelerator scheduling (§2), then to achieve good performance on the “easy” hardware config- overview the explored and developed scheduling techniques uration. (§3). In the following section, we discuss techniques to en- Goal and Insight: The goal of this work is to discover sched- able phase overlapping for an optimization-based scheduler uling techniques which are both reasonably fast and find (§4). This motivates the discussion of our heuristic scheduler high-performance solutions on lean hardware configurations. and its integration with optimization phases (§5). Finally, we To this end, we have two main insights. First, we observe then discuss the evaluation methodology (§6), results (§7), that it is neither necessary to solve the problem in a series related work (§8), and conclude (§9). of independent phases for mapping, routing and timing, nor to solve them all simultaneously. Rather, these different

Hybrid Optimization/Heuristic Instruction Scheduling for Programmable Accelerator Codesign

Integrating Program Optimizations and Transformations with the Scheduling of Instruction Level Parallelism*

Computer Architecture: Static Instruction Scheduling

Static Instruction Scheduling for High Performance on Limited Hardware

Lecture 1 Introduction

Compiler Optimisation 6 – Instruction Scheduling

Register Allocation and Instruction Scheduling in Unison

An Application of Constraint Programming to Superblock Instruction Scheduling

CS415 Compilers Instruction Scheduling

A Simple, Verified Validator for Software Pipelining

SSE Optimizations These Optimizations Leverage the Streaming SIMD Extension (SSE) Instruction Set of the CPU

Survey on Combinatorial Register Allocation and Instruction Scheduling ACM Computing Surveys

Learning Heuristics for the Superblock Instruction Scheduling Problem Tyrel Russell, Abid M