Spatial Computation

Mihai Budiu

December 2003
CMU-CS-03-217

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Thesis Committee:
Seth Copen Goldstein, Chair
Peter Lee
Todd Mowry
Babak Falsafi, ECE
Nevin Heintze, Agere Systems

This research was sponsored by the Defense Advanced Research Projects Agency (DARPA) under US Army contract no. DABT63-96-C-0083, by DARPA under US Navy grant no. N00014-01-1-0659 via Pennsylvania State University and via Hewlett-Packard (HP) Laboratories, by the National Science Foundation (NSF) under grants no. CCR-0205523 and CCR-9876248, and by generous donations from the Altera Corporation and HP. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the sponsors or any other entity.

Keywords: spatial computation, optimizing compilers, internal representation, dataflow machines, asynchronous circuits, reconfigurable computing, power-efficient computation

Abstract

This thesis presents a compilation framework for translating ANSI C programs into hardware dataflow machines. The framework is embodied in the CASH compiler, a Compiler for Application-Specific Hardware. CASH generates asynchronous hardware circuits that directly implement the functionality of the source program, without using any interpretative structures. This style of computation is dubbed "Spatial Computation." CASH relies extensively on predication and speculation to build efficient hardware circuits.

The first part of this document describes Pegasus, the internal representation of CASH, and a series of novel program transformations performed by CASH. The most notable of these are a new optimal register-promotion algorithm and partial redundancy elimination for memory accesses based on predicate manipulation.

The second part of this document evaluates the performance of the generated circuits using simulation. Using media-processing benchmarks, we show that for the domain of embedded computation, the circuits generated by CASH can sustain high levels of instruction-level parallelism, due to the effective use of dataflow software pipelining. A comparison of Spatial Computation and superscalar processors highlights some of the weaknesses of our model of computation, such as the lack of branch prediction and register renaming. Low-level simulation, however, suggests that the energy efficiency of Application-Specific Hardware is three orders of magnitude better than that of superscalar processors and one order of magnitude better than that of low-power digital signal processors and asynchronous processors, approaching that of custom hardware chips.

The results presented in this document can be applied in several domains: (1) most of the compiler optimizations are applicable to traditional compilers for high-level languages; (2) CASH itself can be used as a hardware synthesis tool for very fast system-on-a-chip prototyping directly from C sources; (3) the compilation framework we describe can be applied to the translation of imperative languages to dataflow machines; (4) we have extended the dataflow machine model to encompass predication, data speculation, and control speculation; and (5) the tool chain described, together with some specific optimizations such as lenient execution and pipeline balancing, can be used for the synthesis and optimization of asynchronous hardware.
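To make the role of predication and speculation mentioned above concrete, the following is a minimal C sketch (illustrative only, not code from the thesis; the function names are hypothetical). If-conversion replaces a branch with speculative evaluation of both arms and a predicate-driven selection of the live result:

    /* With control flow: only one arm of the branch executes. */
    int select_branch(int c, int a, int b) {
        int x;
        if (c)
            x = a + 1;
        else
            x = b - 1;
        return x;
    }

    /* If-converted: both arms are evaluated speculatively and a
     * multiplexor controlled by the predicate c selects the result.
     * Safe here because both arms are side-effect free. */
    int select_mux(int c, int a, int b) {
        int t = a + 1;     /* speculated "then" value */
        int f = b - 1;     /* speculated "else" value */
        return c ? t : f;  /* predicate-driven selection */
    }

In hardware, both arms exist as separate circuits in any case, so evaluating them in parallel trades some energy for the latency of a branch.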
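Similarly, the register-promotion transformation named above can be illustrated with a small C sketch (again hypothetical, not taken from the thesis), assuming the compiler can prove that no store inside the loop may alias the promoted location:

    /* Before promotion: *p is reloaded from memory on every iteration. */
    int dot_scale(const int *a, int n, int *p) {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += a[i] * (*p);  /* redundant load each iteration */
        return s;
    }

    /* After promotion: the loop-invariant load is hoisted into a
     * scalar (a register in hardware) and performed only once. */
    int dot_scale_promoted(const int *a, int n, int *p) {
        int s = 0;
        int t = *p;            /* single load before the loop */
        for (int i = 0; i < n; i++)
            s += a[i] * t;
        return s;
    }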
Contents

1 Introduction
  1.1 Motivation
  1.2 Application-Specific Hardware and Spatial Computation
      1.2.1 ASH and Computer Architecture
  1.3 Research Approach
  1.4 Contributions
      1.4.1 Thesis Statement
  1.5 Thesis Outline

I Compiling for Application-Specific Hardware

2 CASH: A Compiler for Application-Specific Hardware
  2.1 Overview
  2.2 The Virtual Instruction Set Architecture
  2.3 Compiler Structure
      2.3.1 Front-End Transformations
      2.3.2 Core Analyses and Transformations
      2.3.3 Back-Ends
  2.4 Optimizations Overview
      2.4.1 What CASH Does Not Do
      2.4.2 What CASH Does
  2.5 Compiling C to Hardware
      2.5.1 Using C for Spatial Computation
      2.5.2 Related Work
      2.5.3 Limitations
  2.6 Code Complexity
  2.7 Compilation Speed

3 Pegasus: a Dataflow Internal Representation
  3.1 Introduction
      3.1.1 Summary of Pegasus Features
      3.1.2 Related Work
  3.2 Pegasus Primitives
      3.2.1 Basic Model
      3.2.2 Data Types
      3.2.3 Operations
  3.3 Building Pegasus Representations
      3.3.1 Overview
      3.3.2 Input Language: CFGs
      3.3.3 Decomposition in Hyperblocks
      3.3.4 Block and Edge Predicates
      3.3.5 Multiplexors
      3.3.6 Predication
      3.3.7 Speculation
      3.3.8 Control-Flow
      3.3.9 Token Edges
      3.3.10 Prologue and Epilogue Creation
      3.3.11 Handling Recursive Calls
      3.3.12 Program Counter, Current Execution Point
      3.3.13 Switch Nodes
  3.4 Measurements

4 Program Optimizations
  4.1 Introduction
  4.2 Optimization Order
  4.3 Classical Optimizations
      4.3.1 Free Optimizations
      4.3.2 Term-Rewriting Optimizations
      4.3.3 Other Optimizations
      4.3.4 Induction Variable Analysis
      4.3.5 Boolean Minimization
  4.4 Partial Redundancy Elimination for Memory
      4.4.1 An Example
      4.4.2 Increasing Memory Parallelism
      4.4.3 Partial Redundancy Elimination for Memory
      4.4.4 Experimental Evaluation
  4.5 Pipelining Memory Accesses in Loops
      4.5.1 Motivation
      4.5.2 Fine-Grained Memory Tokens
      4.5.3 Using Address Monotonicity
      4.5.4 Dynamic Slip Control: Loop Decoupling
  4.6 Optimal Inter-Iteration Register Promotion
      4.6.1 Preliminaries
      4.6.2 Conventions
      4.6.3 Scalar Replacement of Loop-Invariant Memory Operations
      4.6.4 Inter-Iteration Scalar Promotion
      4.6.5 Implementation
      4.6.6 Handling Loop-Invariant Predicates
      4.6.7 Discussion
      4.6.8 Experimental Evaluation
  4.7 Dataflow Analyses on Pegasus
      4.7.1 Global Constant Propagation
      4.7.2 Dead Code Elimination
      4.7.3 Dead Local Variable Removal

5 Engineering a C Compiler
  5.1 Software Engineering Rules
  5.2 Design Mistakes
  5.3 Debugging Support
  5.4 Coding in C++

II ASH at Run-time

6 Pegasus at Run-Time
  6.1 Asynchronous Circuit Model
      6.1.1 The Finite-State Machine of an Operation
  6.2 Timing Assumptions ...