Compiling Irregular Software to Specialized Hardware
Richard Townsend

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY
2019

©2019 Richard Townsend
All rights reserved

ABSTRACT

Compiling Irregular Software to Specialized Hardware
Richard Townsend

High-level synthesis (HLS) has simplified the design process for energy-efficient hardware accelerators: a designer specifies an accelerator's behavior in a "high-level" language, and a toolchain synthesizes register-transfer level (RTL) code from this specification. Many HLS systems produce efficient hardware designs for regular algorithms (i.e., those with limited conditionals or regular memory access patterns), but most struggle with irregular algorithms that rely on dynamic, data-dependent memory access patterns (e.g., traversing pointer-based structures like lists, trees, or graphs). HLS tools typically provide imperative, side-effectful languages to the designer, which makes it difficult to correctly specify and optimize complex, memory-bound applications.

In this dissertation, I present an alternative HLS methodology that leverages properties of functional languages to synthesize hardware for irregular algorithms. The main contribution is an optimizing compiler that translates pure functional programs into modular, parallel dataflow networks in hardware. I give an overview of this compiler, explain how its source and target together enable parallelism in the face of irregularity, and present two specific optimizations that further exploit this parallelism. Taken together, this dissertation verifies my thesis that pure functional programs exhibiting irregular memory access patterns can be compiled into specialized hardware and optimized for parallelism.

This work extends the scope of modern HLS toolchains. By relying on properties of pure functional languages, our compiler can synthesize hardware from programs containing constructs that commercial HLS tools prohibit, e.g., recursive functions and dynamic memory allocation. Hardware designers may thus use our compiler in conjunction with existing HLS systems to accelerate a wider class of algorithms than before.
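To make this concrete, here is a minimal Haskell sketch, written for this summary rather than taken from the dissertation (the names List and mapList are illustrative), of the kind of program in question. It is recursive and allocates heap cells dynamically, the two constructs named above that commercial HLS tools prohibit, and it performs the map f l computation described in Figure 1.1:

    -- A user-defined linked list: a pointer-based, recursive structure.
    data List a = Nil | Cons a (List a)

    -- Recursively apply f to each element, allocating a new list.
    mapList :: (a -> b) -> List a -> List b
    mapList _ Nil         = Nil
    mapList f (Cons x xs) = Cons (f x) (mapList f xs)

Each Cons cell sits behind a pointer, so traversing the list triggers the dynamic, data-dependent memory accesses that make such algorithms irregular.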
Contents

List of Figures
List of Tables
Acknowledgements

1 Introduction
   1.1 My Thesis
   1.2 Structure of the Dissertation
   1.3 High-Level Synthesis and Irregular Algorithms
   1.4 Contributions

2 Related Work
   2.1 High-Level Synthesis
       2.1.1 Functional HLS
       2.1.2 Irregular HLS
   2.2 Hardware Dataflow Networks
   2.3 Parallelizing Divide-and-Conquer Algorithms
       2.3.1 Software Techniques
       2.3.2 Memory Partitioning
       2.3.3 HLS for Divide-and-Conquer
   2.4 Optimizing Recursive Data Structures

3 An Overview of Our Compiler
   3.1 Our Main IR: a Variant of GHC Core
   3.2 The End-to-End Compilation Flow
   3.3 Lowering Core
       3.3.1 Removing Polymorphism
       3.3.2 Making Names Unique
       3.3.3 Lambda Lifting
       3.3.4 Removing Recursion
       3.3.5 Tagging Memory Operations
       3.3.6 Simplifying Case
       3.3.7 Adding Go
       3.3.8 Lifting Expressions
       3.3.9 Optimizing Reads

4 From Functional Programs to Dataflow Networks
   4.1 A Restricted Dialect of Core: Floh
       4.1.1 An Example: Map
   4.2 Dataflow Networks
   4.3 Translation from Floh to Dataflow
       4.3.1 Translating Expressions
       4.3.2 Translating Simple Functions and Cases
       4.3.3 Translating Clustered Functions and Cases
       4.3.4 Putting It All Together: Translating the Map Example
   4.4 Dataflow Networks in Hardware
       4.4.1 Evaluation Order
       4.4.2 Stateless Actors
       4.4.3 Stateful Actors
       4.4.4 Inserting Buffers
   4.5 Experimental Evaluation
       4.5.1 Methodology
       4.5.2 Strict vs. Non-strict Tail Recursive Calls
       4.5.3 Sensitivity to Memory Latency
       4.5.4 Sensitivity to Function Latency

5 Realizing Dataflow Networks in Hardware
   5.1 Specifications: Kahn Networks
       5.1.1 Kahn Networks
       5.1.2 Dataflow Actors
       5.1.3 Unit-rate, Mux, and Demux Actors
       5.1.4 Nondeterministic Merge
   5.2 Hardware Dataflow Actors
       5.2.1 Communication and Combinational Cycles
       5.2.2 Unit-Rate Actors
       5.2.3 Mux and Demux Actors
       5.2.4 Merge Actors
   5.3 Channels in Hardware
       5.3.1 Data and Control Buffers
       5.3.2 Forks
   5.4 The Argument for Correctness
   5.5 Our Dataflow IR: DF
       5.5.1 Channel Type Definitions
       5.5.2 Actor Instances and Type Definitions
       5.5.3 Checking DF Specifications
   5.6 Experimental Evaluation
       5.6.1 Experimental Networks
       5.6.2 Random Buffer Allocation
       5.6.3 Manual Buffer Allocation
       5.6.4 Pipelining the Conveyor
       5.6.5 Memory in Dataflow Networks
       5.6.6 Overhead of Our Method

6 Optimizing Irregular Divide-and-Conquer Algorithms
   6.1 Transforming Code
       6.1.1 Finding Divide-and-Conquer Functions
       6.1.2 Task-Level Parallelism: Copying Functions
       6.1.3 Memory-Level Parallelism: Copying Types
       6.1.4 New Types and Conversion Functions
   6.2 Partitioning On-Chip Memory
   6.3 Experimental Evaluation
       6.3.1 Exposing Memory-Level Parallelism
       6.3.2 Exploiting Memory-Level Parallelism

7 Packing Recursive Data Types
   7.1 An Example: Appending Lists
   7.2 Packing Algorithm
       7.2.1 Packed Data Types
       7.2.2 Pack and Unpack Functions
       7.2.3 Injection and Hoisting
       7.2.4 Simplification
       7.2.5 Heuristics to Limit Code Growth
   7.3 Experimental Evaluation
       7.3.1 Testing Scheme
       7.3.2 Experimental Results

8 Conclusions and Further Work
   8.1 Conclusions
   8.2 Further Work
       8.2.1 Formalism
       8.2.2 Extending the Compiler
       8.2.3 Synthesizing a Realistic Memory System
       8.2.4 Improved and Additional Optimizations

Bibliography

List of Figures

1.1  A visualization of the research supporting my thesis. I focus on synthesizing hardware from pure functional programs exhibiting irregular memory access patterns, e.g., the function map f l, which applies a function f to each element of a linked list l, storing the results in a new list (a). I translate these programs into modular, parallel dataflow networks in hardware (b), and apply two optimizations that exploit more parallelism in the synthesized circuits: the first improves memory-level parallelism in divide-and-conquer algorithms by combining a novel code transformation with a type-based memory partitioning scheme (c); the second packs more data into recursive types to improve spatial locality (i.e., data-level parallelism) and reduce an irregular program's memory footprint (d).

3.1  The abstract syntax of the compiler's main IR: a variant of GHC Core.
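To illustrate the packing optimization sketched in Figure 1.1(d) and developed in Chapter 7, here is a hypothetical Haskell encoding, an assumption for exposition rather than the dissertation's actual transformation, in which a recursive type gains a constructor holding several elements per heap cell:

    -- An ordinary list (as in the earlier sketch) and a packed variant.
    data List a = Nil | Cons a (List a)

    data PackedList a
      = PNil
      | PCons1 a (PackedList a)       -- residual cell: one element
      | PCons3 a a a (PackedList a)   -- packed cell: three elements

    -- A sketch of a pack function in the spirit of Section 7.2.2:
    -- greedily group elements three at a time.
    pack :: List a -> PackedList a
    pack (Cons x (Cons y (Cons z r))) = PCons3 x y z (pack r)
    pack (Cons x r)                   = PCons1 x (pack r)
    pack Nil                          = PNil

Fewer cells mean fewer pointers to chase and more elements fetched per access, which is how packing improves spatial locality and shrinks an irregular program's memory footprint.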