Stream-Dataflow Acceleration

Tony Nowatzki∗† Vinay Gangadhar† Newsha Ardalani† Karthikeyan Sankaralingam†

∗University of California, Los Angeles † University of Wisconsin, Madison [email protected] vinay,newsha,[email protected]

ABSTRACT

Demand for low-power data-processing hardware continues to rise inexorably. Existing programmable and "general purpose" solutions (e.g. SIMD, GPGPUs) are insufficient, as evidenced by the order-of-magnitude improvements and industry adoption of application- and domain-specific accelerators in important areas like machine learning, computer vision and big data. The stark tradeoff between efficiency and generality at these two extremes poses a difficult question: how could domain-specific hardware efficiency be achieved without domain-specific hardware solutions?

In this work, we rely on the insight that "acceleratable" algorithms have broad common properties: high computational intensity with long phases, simple control patterns and dependences, and simple streaming memory access and reuse patterns. We define a general architecture (a hardware-software interface), called stream-dataflow, which can more efficiently express programs with these properties. The dataflow component of this architecture enables high concurrency, and the stream component enables communication and coordination at very low power and area overhead. This paper explores the hardware and software implications of stream-dataflow, describes its detailed microarchitecture, and evaluates an implementation. Compared to a state-of-the-art domain-specific accelerator (DianNao), and to fixed-function accelerators for MachSuite, Softbrain can match their performance with only a 2× power overhead on average.

CCS CONCEPTS

• Computer systems organization → Heterogeneous (hybrid) systems; Reconfigurable computing; Data flow architectures; Single instruction, multiple data; Special purpose systems;

KEYWORDS

Streaming, Dataflow, Architecture, Accelerator, Reconfigurable, CGRA, Programmable, Domain-Specific

ACM Reference format:
Tony Nowatzki, Vinay Gangadhar, Newsha Ardalani, and Karthikeyan Sankaralingam. 2017. Stream-Dataflow Acceleration. In Proceedings of ISCA '17, Toronto, ON, Canada, June 24-28, 2017, 14 pages.
https://doi.org/10.1145/3079856.3080255

1 INTRODUCTION

Data-processing hardware is vital to the global economy – from the scale of web services and warehouse computing, to networked Internet of Things and personal mobile devices. As application needs in these areas have evolved, general purpose techniques (even SIMD and GPGPUs) are no longer sufficient and have fallen out of favor, because of the energy and performance overheads of traditional Von Neumann architectures.

Instead, application-specific and domain-specific hardware is prevailing. For large-scale computing, Microsoft has deployed the Catapult FPGA accelerator [25] in its datacenters, and Google has likewise deployed a custom accelerator for distributed machine learning [12]. Internet of Things devices and modern mobile systems-on-chip (SoCs) are already laden with custom hardware, and innovation continues in this space with companies (e.g. Movidius) developing specialized processors for computer vision [11].

While more narrow hardware solutions are effective, they pose many challenges. As algorithms change at an alarming rate, hardware must be redesigned and re-verified, which is burdensome in terms of development cost and time-to-market. As a corollary, innovation in algorithms becomes more difficult without access to flexible hardware. Furthermore, programmable hardware can be time-shared across applications, while domain-specific hardware cannot, making the latter more costly in terms of silicon. Finally, from the academic viewpoint, it is difficult to formalize and apply improvements from domain-specific hardware to the broader field of computer architecture – limiting the intellectual impact of such work.

Ideally, what we require is hardware that is capable of executing data-intensive algorithms at high performance with much lower power than existing programmable architectures, while remaining broadly applicable and adaptable.

An important observation, as alluded to in the literature [9, 21], is that typically-accelerated workloads have common characteristics: 1. high computational intensity with long phases; 2. small instruction footprints with simple control flow; and 3. straightforward memory access and re-use patterns. The reason for this is simple: these properties lend themselves to very efficient hardware implementations through exploitation of concurrency. Existing data-parallel hardware solutions perform well on these workloads, but in their attempt to be far more general, sacrifice too much efficiency to supplant domain-specific hardware. As an example, short-vector SIMD relies on inefficient general pipelines for control and address generation, but accelerated codes typically do not have complex control and memory access. GPGPUs hide memory latency using hardware for massive multithreading, but accelerated codes' memory access patterns can usually be trivially decoupled without multithreading.

To take advantage of this opportunity, this work proposes an architecture and execution model for acceleratable workloads, whose hardware implementation can approach the power and area efficiency of specialized designs, while remaining flexible across application domains. Because of its components, it is called stream-dataflow, and it exposes these basic abstractions:

• A dataflow graph for repeated, pipelined computations.
• Stream-based commands for facilitating efficient data movement across components and to memory.
• A private (scratchpad) address space for efficient data reuse.

Figure 1: Stream-Dataflow Abstractions & Implementation. (a) Stream-Dataflow Architecture Abstractions: dataflow computation with memory, reuse, and reduction streams; (b) Softbrain: Stream-Dataflow Implementation: a control core and stream dispatch feeding local memory and a coarse-grained reconfigurable architecture (CGRA) over a wide memory interface.

Figure 1(a) depicts the programmer view of stream-dataflow, consisting of the dataflow graph itself and explicit stream communication for memory access, read reuse and recurrence. The abstractions lead to an intuitive hardware implementation; Figure 1(b) shows our high-level design. It consists of a coarse grain reconfigurable architecture (CGRA) and scratchpad, connected with wide buses to memory. It is controlled from a simple control core, which sends stream commands to be executed concurrently by the memory control engine, the scratchpad control engine and the CGRA. The coarse-grain nature of the stream-based interface enables the core to be quite simple without sacrificing highly-parallel execution. The stream access patterns and restricted memory semantics also enable efficient address generation and coordination hardware.

Relative to a domain-specific architecture, a stream-dataflow processor can reconfigure its datapath and memory streams, so it is far more general and adaptable. Relative to existing solutions like GPGPUs or short-vector SIMD, the power and area overheads are significantly less on amenable workloads. An implementation can also be deployed flexibly in a variety of settings, either as a standalone chip or as a block on an SoC; it could be integrated with virtual memory, and either use caches or directly access memory.

In this paper, we first define stream-dataflow, describe its execution model, and explain why it provides specialization benefits over existing architectures. We then discuss the ISA and programmability before describing the microarchitecture of our implementation, Softbrain. To demonstrate the generality of this architecture and this implementation's capabilities, we compare against a state-of-the-art machine learning accelerator, as well as fixed-function accelerators for MachSuite workloads. Our evaluation shows that Softbrain can achieve equivalent performance to the accelerators, with orders-of-magnitude area and power efficiency improvements over CPUs. Compared to the machine learning accelerator, we average only 2× power and area overhead. On the broader set of MachSuite workloads, compared to custom ASICs, the average overhead was 2× power and 8× area. Overall, we believe that the stream-dataflow abstractions we propose could serve as the basis for future programmable accelerator innovation.

2 MOTIVATION AND OVERVIEW

For a broad class of data-processing algorithms, domain-specific hardware provides orders of magnitude performance and energy benefits over existing general purpose solutions. By definition, the strategy that domain-specific accelerators employ is to limit the programming interface to support a much narrower set of functionality suitable for the domain, and in doing so simplify the hardware design and improve efficiency. We further hypothesize that the efficiency gap between domain-specific and general purpose architectures is fundamental to the way general purpose programs are expressed at an instruction level, rather than a facet of the microarchitectural mechanisms employed.

So far, existing programmable architectures (e.g. SIMD, SIMT, Spatial) have shown some promise, but have only had limited success in providing a hardware/software interface that enables the same specialized microarchitecture techniques that more customized designs have employed.

Therefore, our motivation is to discover the architectural abstractions that would enable programmable hardware with the execution style and efficiency of a customized design, at least for a broad and important class of applications that have long phases of data processing and streaming memory behavior. To get insights into the limitations of current architectures and the opportunities, this section examines the specialization mechanisms of existing programmable hardware paradigms. We then discuss how their limitations can inspire a new set of architecture abstractions.

2.1 Specialization in Existing Approaches

Several fundamentally different architecture paradigms have been explored in an attempt to enable programmable hardware specialization; prominent examples are depicted in Figure 2. We discuss their specialization capabilities in three broad categories:

1. Reducing the per-instruction power and resource access costs,
2. Reducing the cost of memory addressing and communication, and
3. Reducing the cost of attaining high execution resource utilization.

Table 1 provides a summary, and we discuss these in detail below.

SIMD and SIMT: Both SIMD and SIMT provide fixed-length vector abstractions in their ISA, which enables microarchitectures that amortize instruction dispatch and enable fewer, wider memory accesses. The specification of computation at the instruction level, however, means that neither can avoid instruction communication through large register files.

Figure 2: High-Level Architecture Organizations of SIMT, Vector, Spatial Dataflow, Stream-Dataflow, and Application-Specific designs, showing each one's control and dispatch structures, datapath (vector lanes, distributed PEs, coarse-grained reconfigurable array, scratchpads, register files) and memory interface (green regions denote control).

Potential Specialization Capability                    | SIMD | SIMT     | Vector Threads         | Spatial Dataflow | Stream-Dataflow
Instr.: Amortize instruction dispatch                  | Yes  | Yes      | Yes SIMD / No Scalar   | Somewhat         | Yes
Instr.: Reduce control divergence penalty              | No   | Somewhat | Yes                    | Yes              | Somewhat
Instr.: Avoid large register file access               | No   | No       | No                     | Yes              | Yes
Memory: Coalesce spatially-local memory access         | Yes  | Yes      | Yes SIMD / No Scalar   | No               | Yes
Memory: Avoid redundant addr. gen. for spatial access  | Yes  | No       | Yes SIMD / No Scalar   | No               | Yes
Memory: Provide efficient memory for data reuse        | No   | Yes      | No                     | No               | Yes
Util.:  Avoid multi-issue logic                        | No   | Yes      | No                     | Yes              | Yes
Util.:  Avoid multi-threading logic and state          | Yes  | No       | Yes                    | Yes              | Yes

Table 1: Architectural Specialization Capabilities (Assumption: High-Parallelism, Small-Footprint Compute Kernels)

In addition, neither architecture enables inexpensive support of high hardware utilization. Because the vector length is fixed and relatively short, short-vector SIMD processors constantly rely on the general purpose core to dynamically schedule parallel instructions. Scaling issue width, reordering logic and register file ports is expensive in area and power. SIMT exposes massive-multithreading capability to enable high hardware utilization, by allowing concurrent execution of many warps, which are groups of threads that issue together. This requires large register files to hold live state and warp-scheduling hardware, and it incurs cache pressure from many more independent warps.

Both SIMT and SIMD have some unique limitations. SIMD's specification of control flow through masking and merging introduces additional instruction overhead. Also, typical short-vector SIMD extensions lack programmable scratchpads for efficient data reuse. SIMT threads are programmed to operate on scalar data, and threads within a warp typically perform redundant address generation for spatially-local access; additional logic coalesces common cross-thread access patterns.

Vector Threads ([15, 16]): A vector-thread architecture is similar to SIMD, but exposes the programming capability to specify both SIMD-style and scalar execution of the computation lanes. While this does eliminate the control divergence penalty, it faces many of the same limitations as SIMD. Specifically, it cannot avoid register file access due to the instruction-level specification of computation, and the limited vector length means the control core's pipeline is relied on to achieve high utilization.

Spatial Dataflow ([2, 23, 32, 33]): Spatial dataflow architectures expose the communication channels of an underlying computation fabric through their hardware/software interface. This enables distributed instruction dispatch, and eliminates the need for register file access between live instructions. The dispatch overheads can be somewhat amortized using a configuration step. The distributed nature of this abstraction also enables high utilization without the need for multi-threading or multi-issue logic.

However, these architectures are unable to specialize the memory access to the same degree. Microarchitectural implementations typically perform redundant address generation and issue more and smaller cache accesses for spatially-local access. This is because the spatially-distributed address generation and accesses are more difficult to coalesce into vector-memory operations.

Summary and Observations: First, being able to specify vectorized memory access is extremely important, not just for parallelism and reducing memory accesses, but also for reducing address generation overhead. On the other hand, though vectorized instructions do reduce instruction dispatch overhead, the separation of the work into fixed-length instructions requires inefficient operand communication through register files and requires high-power mechanisms to attain high utilization. Exposing a spatial dataflow substrate to software solves the above, but complicates and disrupts the ability to specify and take advantage of vectorized memory access.

2.2 Opportunities for Stream-Dataflow

At the heart of the previous observations lies the fundamental tradeoff between vector and spatial architectures: vector architectures expose a far more efficient parallel memory interface, while spatial architectures expose a far more efficient parallel computation interface. For it to be conceivable that a programmable architecture can be competitive with an application-specific or domain-specific one, it must expose both efficient interfaces.

Opportunities and Overview: While achieving the benefits of spatial and vector architectures in the general case is perhaps impossible, we argue that it is possible in a restricted but important workload setting. In particular, many data-processing algorithms exhibit the property that their computation and memory access components can be specified independently. Following the principles of decoupled access/execute [30], we propose an architecture combining stream and dataflow abstractions – stream-dataflow. The stream component exposes a vector-like memory interface, and the dataflow component exposes a spatial specification of computation.

The following is a brief overview of the abstractions, as shown in Figure 3. The stream interface provides support for an ordered set of stream commands, which is embedded within an existing Von Neumann ISA. Stream commands specify long and concurrent patterns of memory access. Expressible patterns include contiguous, strided, and indirect. We add a separate "scratchpad" address space, which can be used to efficiently collect and access re-used data. Finally, the dataflow interface exposes instructions and their dependences through a dataflow graph (DFG). The input and output interfaces of the DFG are named ports with configurable width, which are the sources and destinations for stream commands.

Microarchitecture: A standard hardware implementation consists of a coarse grained reconfigurable architecture (CGRA) for computation, a programmable scratchpad, stream engines to execute the stream commands for memory or scratchpad access, and a control core to generate stream commands. A stream dispatcher enforces architectural dependences between streams. The control core can execute arbitrary programs, but is programmed to offload as much work as possible to the stream-dataflow hardware.

Capabilities and Comparison: Stream-dataflow implementations enable coalesced memory access and avoid redundant address generation by breaking down streaming access into memory-interface-sized requests. Flexible-length (typically long) stream commands mean the control core can be very simple, as it is needed only to generate streams (not to manage their execution). High utilization is provided without multi-threading or multi-issue pipelines by using a dataflow computation substrate, which also avoids large register file access to communicate values. A disadvantage is that fine-grain control relies on predication (a power overhead), and coarse-grain control requires reconfiguration (a performance overhead). Luckily, both are somewhat rare in typically accelerated codes.

3 STREAM-DATAFLOW ARCHITECTURE

In this section we describe the stream-dataflow architecture through its programming abstractions, execution model and ISA.

Figure 3: Stream-Dataflow Abstractions. (a) Computation Abstraction: Dataflow Graph (DFG), e.g. input ports A and B (width 3) feeding multipliers and adders into output port C (width 1); (b) Communication Abstraction: Stream Command, consisting of a source, an access pattern, and a destination, where source and destination are each a memory address, a scratchpad address, or a DFG/indirect port.

3.1 Abstractions

Stream-dataflow abstracts computation as a dataflow graph and data movement with streams and barriers (Figure 3). Commands expressing these abstractions are embedded in a general program. The following describes these abstractions and associated commands.

Dataflow Graph (DFG): The DFG is an acyclic graph containing instructions and dependences, ultimately mapped to the computation substrate. Note that we do support cycles for direct accumulation, where an instruction produces a value accumulated by a later instance of itself. More general cyclic dependences are supported through "recurrence streams," which we describe later.

DFG inputs and outputs are named ports with explicit vector widths, which serve as the inputs and outputs of data streams, facilitating communication. For every set of inputs that arrives at the input ports, one set of outputs is generated. One iteration of the entire dataflow-graph execution through input ports, computation nodes and output ports is a computation instance. Dataflow graphs can be switched through a configuration command.

Streams: Streams are defined by a source architectural location, a destination and an access pattern. Since a private scratchpad address space is exposed, locations are either a memory or programmable scratchpad address, or a named port. Ports either represent communication channels to the inputs or outputs of the DFG, or they can be indirect ports which are used to facilitate indirect memory access. Streams from DFG outputs to inputs support recurrence. Access patterns for DFG ports are first-in-first-out (FIFO) only, while access patterns for scratchpad and memory can be more complex (linear, strided, repeating, indirect, scatter-gather, etc.), though a restricted subset of patterns may be chosen at the expense of generality.

Streams generally execute concurrently, but streams with the same DFG port must logically execute in program order, and streams that write from output DFG ports wait until that data is available. Also, streams from DFG outputs to DFG inputs can be used to represent inter-iteration dependences.

Barriers and Concurrency: Barrier instructions serialize the execution of certain types of commands. They include a location, either scratchpad or memory, and a direction (read or write). The semantics is that any following stream command must logically enforce the happens-before relationship between itself and any prior commands described by the barrier. For example, a scratch read barrier would enforce that a scratch write which accesses address A must come after a scratch read of A issued before the barrier. Barriers can also optionally serialize the control core, to coordinate when data is available to the host.

Note that in the absence of barriers, all streams are allowed to execute concurrently. Therefore, if two stream-dataflow commands read and write the same scratchpad or memory address, with no barrier in between them, the semantics of that operation are undefined. The same is true between the host core's memory instructions and the stream-dataflow commands. The programmer or compiler is responsible for enforcing memory dependences.
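To make the barrier rule concrete, the short sketch below fills a scratchpad region from memory and then streams it into the DFG; without the intervening barrier the read stream could race with the write stream and the result would be undefined. The command mnemonics follow Table 2 (introduced in the next section), but the C-level calling convention and the port name PORT_A are assumptions of this sketch rather than a definitive API.

#include <stdint.h>
// Hedged sketch only: SD_* calls name the commands of Table 2; their exact
// C signatures are assumed (see the prototype sketch after Table 2).
#define PORT_A 1                     // hypothetical DFG input-port number

void fill_then_stream(const uint64_t *src, unsigned n) {
  // Write n contiguous 8-byte words from memory into scratchpad address 0.
  SD_Mem_Scratch(src, /*stride*/8, /*access size*/8, /*num strides*/n, /*scratch addr*/0);

  // Without this barrier, the read stream below may start before the
  // scratchpad writes land -- the semantics would be undefined.
  SD_Barrier_Scratch_Wr();

  // Now stream the same n words from scratchpad into DFG input port A.
  SD_Scratch_Port(/*scratch addr*/0, 8, 8, n, PORT_A);

  // Serialize with the host core before it inspects results in memory.
  SD_Barrier_All();
}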

Figure 4: Original program vs. stream-dataflow program (code listings not recoverable in this copy).

Figure 5: Access patterns expressible with the start address, stride, and access size parameters.

Performance: The abstractions and execution model lead to intuitive implications for achieving higher performance. First, the DFG

1 This is because the recurrence stream would not ever be able to reserve the corresponding vector port. In practice lengths of 16-64 scalar values seem to be sufficient.

Command Name          | Parameters                                                           | Description
SD_Config             | Address, Size                                                        | Set CGRA configuration from given address
SD_Mem_Scratch        | Source Mem. Addr., Stride, Access Size, Strides, Dest. Scratch Addr. | Read from memory with pattern to scratchpad
SD_Scratch_Port       | Source Scratch Addr., Stride, Access Size, Strides, Input Port #     | Read from scratchpad with pattern to input port
SD_Mem_Port           | Source Mem. Addr., Stride, Access Size, Strides, Input Port #        | Read from memory with pattern to input port
SD_Const_Port         | Constant Value, Num Elements, Input Port #                           | Send constant value to input port
SD_Clean_Port         | Num Elements, Output Port #                                          | Throw away some elements from output port
SD_Port_Port          | Output Port #, Num Elements, Input Port #                            | Issue recurrence between input/output ports
SD_Port_Scratch       | Output Port #, Num Elements, Scratch Addr.                           | Write from port to scratchpad
SD_Port_Mem           | Output Port #, Stride, Access Size, Strides, Destination Mem. Addr.  | Write from port to memory with pattern
SD_IndPort_Port       | Indirect Port #, Offset Addr., Destination Port                      | Indirect load from port
SD_IndPort_Mem        | Indirect Port #, Output Port #, Offset Addr.                         | Indirect store to address in indirect port
SD_Barrier_Scratch_Rd | -                                                                    | Barrier for scratchpad reads
SD_Barrier_Scratch_Wr | -                                                                    | Barrier for scratchpad writes
SD_Barrier_All        | -                                                                    | Barrier to wait for completion of all commands

Table 2: Stream-Dataflow ISA (Mem: Memory, Scratch: Scratchpad, IndPort: Indirect Vector Port)
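Section 5 explains that these commands are exposed to programs through a thin C wrapper API lowered to a RISC-V encoding, but the paper does not list the C signatures. The prototypes below are therefore only a hedged reconstruction: parameter order follows Table 2, the types are guesses, and the abbreviated forms used in Figure 6 (e.g. SD_CONST, SD_Clean) suggest the real API differs in detail.

// Assumed prototypes for the stream-dataflow wrapper API (illustrative only).
#include <stdint.h>
#include <stddef.h>

void SD_Config            (const void *cfg_addr, size_t size);
void SD_Mem_Scratch       (const void *mem_addr, uint64_t stride, uint64_t access_size,
                           uint64_t num_strides, uint64_t scratch_addr);
void SD_Scratch_Port      (uint64_t scratch_addr, uint64_t stride, uint64_t access_size,
                           uint64_t num_strides, int input_port);
void SD_Mem_Port          (const void *mem_addr, uint64_t stride, uint64_t access_size,
                           uint64_t num_strides, int input_port);
void SD_Const_Port        (uint64_t value, uint64_t num_elements, int input_port);
void SD_Clean_Port        (uint64_t num_elements, int output_port);
void SD_Port_Port         (int output_port, uint64_t num_elements, int input_port);
void SD_Port_Scratch      (int output_port, uint64_t num_elements, uint64_t scratch_addr);
void SD_Port_Mem          (int output_port, uint64_t stride, uint64_t access_size,
                           uint64_t num_strides, void *mem_addr);
void SD_IndPort_Port      (int indirect_port, const void *offset_addr, int dest_port);
void SD_IndPort_Mem       (int indirect_port, int output_port, const void *offset_addr);
void SD_Barrier_Scratch_Rd(void);
void SD_Barrier_Scratch_Wr(void);
void SD_Barrier_All       (void);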

Stream Specification: To balance between efficiency and generality, we focus on supporting common address patterns for which we can build efficient hardware. One such pattern is two-dimensional affine access, defined by an access size (size of the lowest-level access), a stride (size between consecutive accesses), and a number of strides. This abstraction, along with different example patterns it can generate, is shown in Figure 5. More formally, these are accesses of the form a[C*i+j], where induction variables i and j increment from 0 to an upper bound.

The other pattern we support is indirect. The source for an indirect stream is another stream's destination port, which it uses to generate memory addresses. The origin stream's data is either treated as an offset from a starting address, or as a pointer value. Indirect streams can be chained to create multi-indirect access patterns (eg. a[b[c[i]]]). We emphasize that only the lowest level of a data-structure access should conform to one of these patterns. The control core is free to generate arbitrary starting addresses to allow more complex overall patterns.

Finally, for supporting control and software pipelining, we added streams for sending constants (SD_Const_Port) and for clearing unneeded values from ports (SD_Clean_Port).

With the possible sources and destinations defined, as well as the possible access patterns, the commands which specify these streams fall out naturally (see Table 2). Stream commands have parameters for source, destination, pattern and length.

Barrier Specification: This ISA provides three barrier commands. The first two synchronize the reading and writing of the scratchpad memory, and the final one guarantees the phase is complete and the data is visible in the memory system.

Example Program: Figure 6 is an example neural network classifier, essentially a dense matrix-vector multiply of synapses and input neurons. The figure shows the original program, the stream-dataflow version and the execution behavior. The transformation to a stream-dataflow version pulls the entire loading of neurons and synapses out of the loop using long stream commands (lines 4-7). Here, the input neurons are loaded into scratchpad while the synapses are simultaneously read into a port. Inside the single loop (lines 10-13) are streams which coordinate the accumulator reset, and then send out the final value of each output neuron to memory. Note that the number of instructions executed by the control core is reduced by roughly a factor of Ni.

Original Program:
1  //Synapse and input/output neurons (Ni=784, Nn=10)
2  uint16_t synapse[Nn][Ni], neuron_i[Ni], neuron_n[Nn];
3
4  for (n = 0; n < Nn; n++) { //for each output neuron
5    sum = 0;
6    for (i = 0; i < Ni; i++) //for each input neuron
7      sum += synapse[n][i] * neuron_i[i];
8
9    neuron_n[n] = sigmoid(sum); }

Stream-Dataflow Program:
1  SD_Config(classifier_config); //Load CGRA Config.
2
3  //Scratchpad load and memory-to-port loading
4  SD_Mem_Port(synapse, Ni*2, Ni*2, Nn, Port_S);
5  SD_Mem_Scratch(neuron_i, Ni*2, Ni*2, 1, 0);
6  SD_Barrier_Scratch_Wr();
7  SD_Scratch_Port(0, Ni*2, Ni*2, 1, Port_N);
8
9  for (n = 0; n < Nn; n++) { //for each output neuron
10   SD_CONST(Port_R, 0, Ni/4-1);
11   SD_CONST(Port_R, 1, 1);
12   SD_Clean(Port_C, Ni/4-1);
13   SD_Port_Mem(Port_C, 1, &neuron_n[n]);
14 }
15 SD_Barrier_All();

(The accompanying dataflow graph has input ports S (width 4), N (width 4) and R (width 1) feeding a tree of four multipliers and adders, an accumulator, and a sigmoid unit into output port C (width 1). The execution timeline shows each command from lines 4-15 being enqueued, dispatched, active and complete: the synapse and neuron loads, the scratchpad-write barrier, the per-neuron constant streams that control accumulator reset, the port clean, the write of each result to memory, and the final all-barrier after which the general core resumes.)

Figure 6: Example Program and Execution
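The affine pattern just described can be made concrete with a small, self-contained model. The function below enumerates the byte addresses of a (start, stride, access size, number of strides) stream; the call in main() reproduces the pattern of line 4 of the stream-dataflow program in Figure 6, where the stride equals the access size so the whole synapse matrix is read contiguously. This models only the pattern itself, not the hardware's request coalescing (Section 4.3), and the base address is a placeholder.

#include <stdint.h>
#include <stdio.h>

// Software model of a two-dimensional affine stream: for each of
// num_strides rows, touch access_size contiguous bytes starting at
// start + i*stride (the a[C*i+j] form described above).
static void expand_affine_stream(uint64_t start, uint64_t stride,
                                 uint64_t access_size, uint64_t num_strides) {
  for (uint64_t i = 0; i < num_strides; ++i)
    for (uint64_t j = 0; j < access_size; ++j)
      printf("0x%llx\n", (unsigned long long)(start + i * stride + j));
}

int main(void) {
  // Same pattern as line 4 of Figure 6: SD_Mem_Port(synapse, Ni*2, Ni*2, Nn, Port_S);
  // i.e. Nn rows of Ni 16-bit synapses, read back to back.
  enum { Ni = 784, Nn = 10 };
  uint64_t synapse_base = 0x10000;   // placeholder address for illustration
  expand_affine_stream(synapse_base, (uint64_t)Ni * 2, (uint64_t)Ni * 2, Nn);
  return 0;
}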

Cache/ Command from Steam Dispatcher

Scratchpad Memory Interface Stream Command Decoder Scratchpad Memory Stream Engine (SSE) Stream Engine (MSE)

Stream #1 Stream Table Stream

Stream #2 Ready Logic Stream #3 I-Cache Req/Resp D-Cache Req/Resp Input Vector Port Interface Stream #4

. . . . Indirect Vector PortInterface Indices from Stream Select Unit Indirect Ports Control Stream Recurrence Stream Engine CGRA Core Dispatcher (RSE) Indirect Load/Store . . . . Affine AGU AGU LEGEND Output Vector Port Interface Control BLACK Data Line State storage/SRAM RED Control Data/ Address, Mask, Commands To MSE Datapath Stream parameters To RSE To SSE Figure 7: Softbrain Overview Figure 8: Stream Request Pipeline

4 A STREAM-DATAFLOW MICROARCHITECTURE

Our goal in constructing a microarchitecture for the stream-dataflow ISA is to provide efficiency as close as possible to an application- or domain-specific design. Therefore, we adopt two primary design principles: we avoid introducing large or power-hungry structures, especially multi-ported memories, and we take full advantage of the concurrency provided by the ISA. Here we describe our microarchitecture implementation, Softbrain, and how it accomplishes the above goals by leveraging stream-dataflow architecture primitives.

4.1 Overview

At a high level, we combine a low-power control core to generate stream commands, a set of stream engines to efficiently interface with memories and move data, and a deeply-pipelined reconfigurable dataflow substrate for efficient parallel computation (see Figure 7). There are five primary component types:

• Control Core - A low-power single-issue inorder core which generates stream-dataflow commands to the stream dispatcher. It facilitates programmability with low power and area cost.
• Stream Dispatcher - This unit manages the concurrent execution of the stream engines by tracking stream resource dependences and issuing commands to the stream engines.
• Stream Engines - Data access and movement is carried out through three "stream engines": one for memory (facilitating wide access to memory), one for scratchpad (efficient data reuse), and one for DFG recurrences (for immediate reuse without memory storage). The stream engines arbitrate the access of their respective memory and scratchpad resources across the streams assigned to them.
• Vector Ports - The vector ports are the interface between the computations performed by the CGRA and the streams of incoming/outgoing data. In addition, a set of vector ports not connected to the CGRA are used for storing the streaming addresses of indirect loads/stores.
• CGRA - The coarse grained reconfigurable architecture enables pipelined computation of dataflow graphs. The spatial nature of the CGRA avoids the overheads of accessing register files or memories for live values.

Stream Command Lifetime: The lifetime of a stream command is as follows. First, the control core generates the command and sends it to the stream dispatcher. The stream dispatcher issues the command to the appropriate stream engine once any associated resources (vector ports and stream-engine table entries) are free. The data transfer for each stream is carried out by the stream engine, which keeps track of the running state of the stream over its lifetime. When the stream completes, the stream engine notifies the dispatcher that the corresponding resources are free, enabling the next stream command to be issued.

4.2 Stream Dispatch and Control Core

The role of the stream dispatcher is to enforce resource dependences on streams (and barrier commands), and to coordinate the execution of the stream engines by sending them commands. Figure 9 shows the design and internal resource management mechanisms of the stream dispatcher. Stream requests from the control core are queued until they can be processed by the command decoder. This unit consults resource status checking logic to determine if a command can be issued; if so, it is dequeued. Barrier commands block the core from issuing further stream commands until the barrier condition is resolved. We explain in more detail below.

Figure 9: Stream Dispatcher Microarchitecture (command queue, command decoder, resource status checker and resource assigner, with scoreboards tracking the free/taken state of input/output vector ports and stream engines, and command issue to the MSE, SSE and RSE).

Resource Management: Subsequent streams that have the same source or destination port must be issued in program order, i.e. the dynamic order of the streams on the control core. The stream dispatch unit is responsible for maintaining this order, and does so by tracking vector port and stream engine status in a scoreboard. Before issuing a stream, it checks the state of these scoreboards.

The state of a vector port is either taken, free, or all-requests-in-flight. A port moves from free to taken when issued (by the resource assigner), and that stream logically owns that resource while in flight. When the stream is finished, the associated stream engine notifies the dispatcher to update the scoreboard entry of the vector port to the free state. The all-requests-in-flight state indicates that all requests for a memory stream have been completely sent to the memory system (but have not yet arrived). This state exists as an optimization, to enable two memory streams using the same port to have their requests in the memory system at the same time.

Barriers: The stream dispatcher must also coordinate the ordering of streams given barrier instructions. Our simple microarchitecture only implements scratchpad barriers, and simply blocks commands beyond the barrier until the condition is met (eg. no outstanding scratchpad writes). Other active streams can continue to perform useful work while the stream dispatcher waits, which is how forward progress can be guaranteed for a correct program.

Interface to the Control Core: The interface between the dispatcher and the control core is a command queue, and commands are queued until issued. The dispatcher can stall the core, which would happen if the command queue is full or an SD_Barrier_All command is in the command queue.

4.3 Stream Engines

Stream engines manage concurrent access to various resources (memory interface, scratchpad, output vector ports) by many active streams. They are critical to achieving high parallelism with low power overhead, by fully utilizing the associated resources through arbitrating stream access.

Stream engines are initiated by receiving commands from the stream dispatcher. They then coordinate the address generation and data transfer over the lifetime of the stream, and finally notify the dispatcher when the corresponding vector ports are freed (when the stream completes). The stream engines each have their own 512-bit wide bus to the input and output vector ports. The stream dispatcher ensures that concurrent streams have dedicated access to their vector ports. Indirect access is facilitated by vector ports which are not connected to the CGRA, which buffer addresses in flight. Below we describe the central structure in each stream engine, the stream request pipeline, followed by the specific design aspects of each.

Stream Request Pipeline: Similar hardware is required in each stream engine to arbitrate access to the underlying resource, and we call this a stream request pipeline. On each cycle, this unit selects one of the multiple active streams and facilitates its address generation and data transfer. Figure 8 shows the template for a stream request pipeline, which is tailored slightly for each type of stream engine.

When a command is issued to a stream engine, it is first decoded and any relevant state is placed in the stream table. This maintains the set of active streams, where each row contains the associated data for a stream. Ready logic observes the validity of streams, and connects to external backpressure signals (eg. the remaining space in an input vector port) to determine which streams are ready to issue. A stream is ready if its destination is not full and its source has data. A selector prioritizes a ready stream.

The state of the selected stream is then sent to an appropriate address generation unit (AGU), either affine or indirect, which computes the next 64-byte aligned address. These units also generate a mask to indicate the words relevant to the stream. The affine AGU generates the minimal number of memory requests by examining the access size, stride size and number-of-strides parameters. The indirect AGU takes address values from an indirect port; this unit will attempt to coalesce up to four increasing addresses in the current 64-byte line.

Memory Stream Engine: This unit delivers data from or to the memory system, which in the case of Softbrain is a wide-interface cache. The read and write engines have their own independent stream request pipelines. The memory read engine has buffering for outstanding requests, and uses a balance arbitration unit for stream priority selection (described in Section 4.5). The backpressure signal for memory reads to the CGRA is the number of entries free in the buffers associated with the vector ports. For handling backpressure on scratchpad writes, a buffer sits between the MSE and SSE; this buffer is allocated on a request to memory to ensure space exists. The memory write engine uses the data-available signals from the vector ports for priority selection.

Scratchpad Stream Engine: This unit is similar to the above, except that it indexes a scratchpad memory. A single-read, single-write ported SRAM is sufficient, and its width is sized proportional to the maximum data consumption rate of the CGRA. Similar to the memory stream engine, the backpressure signal is the number of free buffer entries on the vector ports. If there are no entries available (ie. there is backpressure), then the corresponding stream will not be selected for loading data.

Reduction/Recurrence Stream Engine: The reduction/recurrence stream engine delivers data from the output to the input vector ports for efficiently communicating dependences. It is also used for delivering "constants" from the core. It does not require an AGU, but does use a similar backpressure mechanism as the above units.

4.4 Computation and Dataflow Firing

Vector Port Interface: Vector ports are the interface between the CGRA and the stream engines, and are essentially 512-bit wide FIFOs that hold values waiting to be consumed by the CGRA. Each vector port can accept or send a variable number of words per cycle, up to 8 64-bit words. On the CGRA side, vector ports attach to a heterogeneous set of CGRA ports, which are selected to spread incoming/outgoing values around the CGRA to minimize contention. This mapping is fed to the DFG scheduler to map ports of the program DFG to hardware vector ports. Dataflow firing occurs in a coarse-grained fashion: when one instance worth of data is available on all relevant vector ports, all of the relevant data is simultaneously released into the CGRA.
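The firing rule just described can be summarized in a few lines: an instance is released into the CGRA only once every input vector port holds at least one full instance worth of words (its configured vector width). The structure below is a minimal software sketch with placeholder port counts; output-side backpressure is handled separately at the vector ports and stream engines, so it is not modeled here.

#include <stdbool.h>

// Minimal model of coarse-grained dataflow firing (Section 4.4).
// NUM_IN_PORTS and the fields are placeholders, not the Softbrain configuration.
#define NUM_IN_PORTS 3

typedef struct {
  int vec_width;   // words one computation instance consumes from this port
  int occupancy;   // words currently buffered in the port's FIFO
} in_port_t;

static bool can_fire(const in_port_t in[NUM_IN_PORTS]) {
  for (int p = 0; p < NUM_IN_PORTS; ++p)
    if (in[p].occupancy < in[p].vec_width)
      return false;                 // some port lacks a full instance of data
  return true;                      // release all inputs into the CGRA at once
}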
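Similarly, the affine address generation of the stream request pipeline (Section 4.3) can be sketched in software. The model below walks the (access size, stride, number of strides) pattern and emits one request per 64-byte line touched, together with a mask of the 8-byte words that belong to the stream. It assumes word-aligned, 8-byte-granularity accesses and generates the whole stream at once, whereas the real AGU produces one aligned request at a time.

#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64u
#define WORD_BYTES 8u    // 64-bit words: 8 per line, one mask bit each

// Simplified model of the affine AGU: emit one (line address, word mask)
// request each time the walk crosses into a new 64-byte line.
static void affine_agu(uint64_t start, uint64_t stride,
                       uint64_t access_size, uint64_t num_strides) {
  uint64_t cur_line = (uint64_t)-1;
  uint8_t  mask = 0;
  for (uint64_t i = 0; i < num_strides; ++i) {
    for (uint64_t off = 0; off < access_size; off += WORD_BYTES) {
      uint64_t addr = start + i * stride + off;
      uint64_t line = addr / LINE_BYTES;
      if (line != cur_line) {
        if (mask)
          printf("req 0x%llx mask 0x%02x\n",
                 (unsigned long long)(cur_line * LINE_BYTES), mask);
        cur_line = line;
        mask = 0;
      }
      mask |= (uint8_t)(1u << ((addr % LINE_BYTES) / WORD_BYTES));
    }
  }
  if (mask)
    printf("req 0x%llx mask 0x%02x\n",
           (unsigned long long)(cur_line * LINE_BYTES), mask);
}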

CGRA: Our CGRA is a deeply pipelined execution substrate similar to prior designs [7, 29]. It has a circuit-switched mesh of processing elements, where each contains a set of pipelined functional units. Predication support at each FU allows mapping of local data-dependent control flow, and FUs store re-used constants and accumulators. It differs from the referenced designs in that there is no flow control inside the mesh (we found this reduced the network area by about half). This is enabled by the synchronized dataflow firing of input vectors. The lack of flow control also requires the DFG compiler to ensure delay-matching along all computation paths, including to the output vector ports. The CGRA's datapath is 64-bit in our implementation, and functional units can perform multiple sub-word operations, including 32-bit and 16-bit.

The configuration of the CGRA (datapath and constants) and the vector ports is initialized by the SD_Config command, which is processed by the memory stream engine. Configuration can be completed in less than 10 cycles if the data is cached.

4.5 Cross-cutting Design Issues

Buffering and Deadlocks: The Softbrain unit must avoid deadlock by balancing the stream requests to different vector ports. This can be local to a stream engine, as each stream engine owns a single resource. Deadlock can occur, for example, when many long-latency operations for a single port fill the request pipeline to memory, but data is needed on another port for the CGRA computation to achieve forward progress. This can happen if one of the streams is strided, causing a lower relative memory bandwidth. We solve this issue by adding a balance unit to the memory load stream engine. It tracks the amount of unbalance for each active vector port, and heavily unbalanced ports are de-prioritized.

Memory Coherence: Because Softbrain's memory interface directly accesses the L2 cache, there is the possibility of incoherence between the control core's L1 and the L2. To avoid incoherent reads from the L2 to the stream engines, the control core's L1 is write-through. To avoid incoherent reads on the L1, Softbrain sends L1 tag invalidations as it processes the stream.

Role of the Compiler and Programmer: In this work, we express programs directly in terms of intrinsics for the stream-dataflow commands (see Figure 6). So the primary compiler task is to generate an appropriate CGRA configuration for the DFG and the vector port mapping. The DFGs are specified in a simple graph language, and we extend an integer linear optimization based scheduling approach from prior work [22].

Though programming is low-level, the primitives are more flexible than their SIMD counterparts. Compiling to a stream-dataflow ISA from a higher level language (OpenCL/OpenMP/OpenAcc) seems practical and quite useful, especially to scale the design to more complex workloads. This is future work.

Integration: Softbrain can be integrated into a broader system in a number of ways, including as a unit in an SoC, through unified virtual memory, or as a standalone chip. In this work we assume a standalone device for evaluation purposes. It is possible to support integration with a unified virtual memory and coherent caches. Address translation could be supported at the L2 level (making L1 and L2 virtual) if dedicated accelerator access is assumed, or by integrating TLBs in the memory stream engine.

There are other design aspects critical to certain settings and systems, like support for precise exceptions, backwards compatibility, security and virtualization. These are deferred to future work.

5 IMPLEMENTATION

Here we discuss the implementation in terms of the hardware, software stack, and simulator used in the evaluation. We also discuss how this setup can be used in a practical hardware/software workflow. Figure 10 shows an overview.

Figure 10: Software Stack (C/C++ with stream-dataflow extensions and DFGs are processed by the DFG compiler and stream-dataflow modifier, then RISC-V GCC and assembler produce a RISC-V binary that drives the cycle-level Softbrain simulator; a Softbrain hardware parameter model feeds the Chisel implementation, whose RTL is synthesized with Synopsys DC).

Hardware: We implemented the design from Section 4 in Chisel [1]. The design is parameterizable (eg. CGRA size, FU types, port widths, scratchpad size, etc.), and uses an architecture description file which is shared with the software stack and simulator.

Software Stack: For programming, we create a simple wrapper API that is mapped down into the RISC-V encoding of the stream-dataflow ISA. We modified a GCC cross-compiler for RISC-V with stream-dataflow ISA extensions, and implemented our own DFG compiler. The DFG-to-CGRA mapper extends prior work [22].

Simulator: We implement a cycle-level RISC-V based simulator for the control core and Softbrain. The Softbrain simulator is a simple module, which takes stream-dataflow commands and integrates with the core's cache interface for load/store requests.

Hardware/Software Workflow: In practice, the hardware would be provisioned once per chip family. For instance, if it is known ahead of time that all data types for a particular market were a maximum of 16-bit (eg. for machine learning), this could be incorporated into the functional unit composition of the CGRA. Here, an architect either uses existing knowledge or profiles applications from the domain(s) in question. They would then adjust only the FU mix and scratchpad size, recording this in the hardware parameter model file. Even in this case, no Chisel or hardware interfaces need modification.

For each application, the developer constructs DFGs of the computation component, and writes the stream coordination program (stream commands embedded into a C program).

6 EXPERIMENT METHODOLOGY

Workloads: To compare against domain-specific accelerators, we use the deep neural network (DNN) workloads from the DianNao accelerator work [3], including classifier, convolutional and pooling layers. They have high data-parallelism and memory regularity, and vary in their re-use behavior (and thus memory bandwidth).

To capture a broader understanding of the efficiency and generality tradeoffs, we consider MachSuite [26], a set of typically-accelerated workloads. Unlike the DNN workloads, these capture wider program behaviors like regular and irregular memory access patterns, data-dependent control, and varying computation intensity. We compare these designs against application-specific versions.

Power & Area: For the power and area of our Softbrain implementation, we synthesize the Chisel-generated verilog with Synopsys DC and the ARM 55nm technology library, which meets timing at 1GHz. We use Cacti [19] for SRAM and cache estimates.

Component                      | Area (mm2) | Power (mW)
Control Core + 16kB I & D$     | 0.16       | 39.1
CGRA Network                   | 0.12       | 31.2
CGRA FUs (4x5)                 | 0.04       | 24.4
Total CGRA                     | 0.16       | 55.6
5x Stream Engines              | 0.02       | 18.3
Scratchpad (4KB)               | 0.10       | 2.6
Vector Ports (Input & Output)  | 0.03       | 3.6
1 Softbrain Total              | 0.47       | 119.3
8 Softbrain Units              | 3.76       | 954.4
DianNao                        | 2.16       | 418.3
Softbrain / DianNao Overhead   | 1.74       | 2.28

Table 3: Area and Power Breakdown / Comparison (all numbers normalized to 55nm process technology)

Comparison Methodology: For the DNN workloads we compare against the DianNao accelerator using a simple performance model. This model optimistically assumes perfect hardware pipelining and scratchpad reuse; it is bound only by parallelism in the neural network topology and by memory bandwidth. We take the power/area numbers from the relevant publication [3]. For comparison points, we consider single-threaded CPU implementations running on an i7 2600K Sandy Bridge machine. We also compare against GPGPU implementations of these workloads written in CUDA, running on a Kepler-based GTX 750, which has 4 SMs and 512 total CUDA cores.

For comparing to MachSuite accelerators, we use Aladdin [28], a pre-RTL accelerator modeling tool. Aladdin determines the fixed-function accelerator performance, power, and area given a set of prescribed hardware transformations (eg. loop unrolling, loop flattening, memory array partitioning and software pipelining), which impact the datapath and scratchpad/cache sizes.
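The comparison methodology above relies on a "simple performance model" for DianNao whose exact form is not spelled out in this text; the sketch below shows the kind of optimistic bound it implies, in which a layer's cycle count is limited either by the provisioned multiply-accumulate throughput or by memory bandwidth, whichever is larger. The structure and parameter names are assumptions of this sketch, not the paper's model.

#include <stdint.h>

// Assumed bound-style model: perfect pipelining and scratchpad reuse, so
// execution time is max(compute-limited cycles, bandwidth-limited cycles).
typedef struct {
  uint64_t mac_ops;         // multiply-accumulates required by the layer
  uint64_t bytes_from_mem;  // bytes that must come from memory (after reuse)
} layer_work_t;

static uint64_t ceil_div(uint64_t a, uint64_t b) { return (a + b - 1) / b; }

static uint64_t estimate_cycles(layer_work_t w,
                                uint64_t macs_per_cycle,
                                uint64_t mem_bytes_per_cycle) {
  uint64_t compute = ceil_div(w.mac_ops, macs_per_cycle);
  uint64_t memory  = ceil_div(w.bytes_from_mem, mem_bytes_per_cycle);
  return compute > memory ? compute : memory;   // perfect overlap assumed
}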

7 EVALUATION

In this section, we address five important questions for evaluating Softbrain, and we list them here with brief answers for each:

(1) What are the sources of its power and area overhead?
    • The CGRA network and the control core.
(2) Can it match the speedup of a domain-specialized accelerator?
    • Yes.
(3) Is the stream-dataflow paradigm general?
    • Yes; all DNN and most MachSuite workloads are implementable using the stream-dataflow abstractions.
(4) What are its limitations in terms of generality?
    • Code properties that are not suitable are arbitrary memory indirection and aliasing, control-dependent loads, and bit-level manipulations.
(5) How does stream-dataflow compare to application-specific designs?
    • Only 2× power and 8× area overhead.

7.1 Domain-Specific Accelerator Comparison

Here we explore the power and area of Softbrain compared to a domain-specific accelerator for deep neural networks, DianNao. Our approach is to compare designs with equivalent performance, and examine the area and power overheads of Softbrain.

Area and Power Comparison: To make an intuitive comparison, we configure Softbrain's functional units to meet the needs of the DNN workloads. Here, we need four-way 16-bit subword-SIMD multipliers and ALUs, and a 16-bit sigmoid. We also include 8 total tiles (Softbrain units), which enables Softbrain to reach the same number of functional units as DianNao.

Table 3 shows the breakdown of area and power. All the analysis is normalized to 55nm process technology. For the power calculations here, we use the maximum activity factors across the DNN workloads. A large fraction of the area comes from the CGRA network, consuming about one-fourth of the total area and power. The other large factor is the control core, which consumes a third of the power and the area. Compared to the DianNao accelerator, Softbrain is only about 1.75× larger in area and a little over twice the power.

Figure 11: Performance on DNN Workloads (speedup of the GPU, DianNao and Softbrain on the classifier, convolutional and pooling layers; GM is the geometric mean).

Performance: Figure 11 shows the speedups of the Kepler GPU, DianNao and Softbrain for the three classes of DNN workloads. Overall, the GPU is able to obtain up to 20× performance improvement, while DianNao and Softbrain achieve similar performance, around 100× or more on some workloads. The reason is intuitive: both architectures use the same basic algorithm, and they are able to keep 100s of FUs active in every cycle by decoupling memory access from deeply pipelined computation. Softbrain does see some advantage in pooling workloads, as its more flexible network allows it to reuse many of the subsequent partial sums in neighboring computations, rather than re-fetching them from memory. This allows it to reduce bandwidth and improve speedup.

7.2 Stream-Dataflow Generality

Here we attempt to distill the limitations of the stream-dataflow processor in terms of its generality. To this end, we study a broader set of typically-accelerated workloads from MachSuite, and provision a single design for them. We first characterize our implementations of these workloads and the limitations we discovered.

Softbrain Provisioning: To provision Softbrain's FU resources, we implemented stream-dataflow versions of the MachSuite workloads targeting a maximum of 20 DFG instructions (the same size we used for the DianNao comparison). We then provisioned Softbrain's FU mix to the maximum needed across workloads. Note that for consistency we used 64-bit integer/fixed-point datatypes. (Using floating point would have decreased the relative overheads of Softbrain versus an ASIC, but also slightly decreased the area/power benefits of acceleration.) Hereafter, we refer to this as the broadly provisioned Softbrain.

Softbrain Generality: Table 4 summarizes the architectural abstractions used in the stream-dataflow program implementations. It describes the streaming patterns and datapath structure. (There are 4 additional workloads which we have not yet implemented, but which do fit into the stream-dataflow paradigm: fft, md (gridding version), nw and backprop.)

Implemented Codes | Stream Patterns                   | Datapath
bfs               | Indirect Loads/Stores, Recurrence | Compare/Increment
gemm              | Affine, Recurrence                | 8-Way Multiply-Accumulate
md-knn            | Indirect Loads, Recurrence        | Large Irregular Datapath
spmv-crs          | Indirect, Linear                  | Single Multiply-Accumulate
spmv-ellpack      | Indirect, Linear, Recurrence      | 4-Way Multiply-Accumulate
stencil2d         | Affine, Recurrence                | 8-Way Multiply-Accumulate
stencil3d         | Affine                            | 6-1 Reduce and Multiplier Tree
viterbi           | Recurrence, Linear                | 4-Way Add-Minimize Tree

Unsuitable Codes  | Reason
aes               | Byte-level data manipulation
kmp               | Multi-level indirect pointer access
merge-sort        | Fine-grain data-dependent loads/control
radix-sort        | Concurrent reads/writes to same address

Table 4: Workload Characterization

We found that each architectural feature was useful across several workloads. Affine accesses were used in gemm and the stencil codes to reduce access penalties. Indirect loads or stores were required in four workloads (bfs and md-knn, and the spmv versions). Recurrence patterns were useful across most workloads, mainly for reduction variables. The size and configuration of the datapath varies greatly, from reduction trees to SIMD-style and more irregular datapaths, suggesting that the flexible CGRA was useful.

There were four workloads that we found could not be efficiently implemented in stream-dataflow. The aes encryption workload required too much byte-level manipulation (accessing words of less than 16 bits), which made it difficult to justify offloading onto a coarse-grained fabric. The kmp string-matching code requires arbitrary-way indirect loads, and the architecture can only support a finite amount of indirection in an efficient way. The merge-sort code contains fine-grain data-dependent loads and control instructions, used to decide which list to read from next. The radix-sort workload has several phases where reads or writes during a phase could be to the same address (and we do not provide hardware support for implicit store-load forwarding).

Overall, Softbrain is quite general and applicable across many data-processing tasks, even some with a significant degree of irregularity. Its limitations are potentially addressable in future work.

7.3 Application-Specific Comparison

In this section we compare the broadly provisioned Softbrain to customized ASICs generated for each application, in terms of their power, performance, energy and area.

ASIC Design Point Selection: For comparing the broadly provisioned Softbrain with a workload-specific ASIC, we chose to do an iso-performance analysis, while secondarily minimizing ASIC area and power. To explain in detail: for each workload we explore a large ASIC design space by modifying hardware optimization parameters, and find the set of ASIC designs within a certain performance threshold of Softbrain (within 10% where possible). Within these points, we chose a Pareto-optimal ASIC design across power, area, and execution time, where power is given priority over area.

Performance: For the performance evaluation of Softbrain against the ASICs, the execution cycles obtained from our Softbrain RISC-V based simulator are compared to the execution cycles of the benchmark-specific custom accelerator generated from Aladdin. Figure 12 shows the performance of Softbrain compared to the benchmark-specific Pareto-optimal ASICs. For both the ASICs and Softbrain, the execution cycles are normalized to a Sandy Bridge OOO4 (4-wide out-of-order) core, with both achieving 1-7× speedup. In most cases we found an ASIC design with similar performance to Softbrain. (For some workloads, eg. stencil and md, there is an ASIC design point with better performance than Softbrain, but it falls outside the performance threshold.)

Figure 12: Softbrain Performance Comparison to ASIC (speedup relative to the OOO4-core baseline for bfs, md, spmv-crs, spmv-ellpack, gemm, viterbi, stencil, stencil3d, and the geometric mean).

Power, Area & Energy vs. ASIC Designs: As explained above, we choose the iso-performance design points for the power, energy and area comparison of Softbrain to the ASICs.

Figure 13: Softbrain Power Comparison to ASIC (power efficiency relative to the OOO4-core baseline).

For the power analysis, we consider that only benchmark-specific FUs are active during execution in Softbrain's CGRA, along with the other support structures, including the control core, stream engines, scratchpad and vector ports. The static power and area for Softbrain are obtained from the synthesized design, and the dynamic power estimates reported are based on the activity factor of each module.
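The design-point selection described above can be written as a simple filter: keep the Aladdin design points whose execution time is within roughly 10% of Softbrain's, then prefer lower power and break ties on area. The struct layout and the exact way the threshold and tie-breaking are applied are assumptions of this sketch; the paper's selection is a Pareto-optimal choice with power prioritized over area.

#include <stddef.h>

// Sketch of iso-performance ASIC design-point selection (Section 7.3).
typedef struct {
  double cycles;     // execution time of the Aladdin-generated design
  double power_mw;
  double area_mm2;
} design_point_t;

static int select_design(const design_point_t *p, size_t n, double softbrain_cycles) {
  int best = -1;
  for (size_t i = 0; i < n; ++i) {
    if (p[i].cycles > 1.10 * softbrain_cycles)
      continue;                                    // not within the performance threshold
    if (best < 0 ||
        p[i].power_mw < p[best].power_mw ||        // power has priority over area
        (p[i].power_mw == p[best].power_mw && p[i].area_mm2 < p[best].area_mm2))
      best = (int)i;
  }
  return best;                                     // index of chosen ASIC, or -1 if none
}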

Figure 13: Softbrain Power Comparison to ASIC (power efficiency relative to the baseline OOO4 core for bfs, md, spmv, gemm, viterbi, stencil, ellpack, stencil3d, and their geometric mean).

Figure 14: Softbrain Energy Comparison to ASIC (energy efficiency relative to the baseline OOO4 core for the same benchmarks).

Figure 15: Softbrain Area Comparison to ASIC (ASIC area relative to Softbrain for the same benchmarks).

Energy estimates are straightforward to obtain from execution cycles and dynamic power estimates. ASIC area and power are obtained from Aladdin, using 40nm technology, and are normalized to 55nm.

Figure 13 shows the power savings (efficiency) over a Sandybridge OOO4 core⁵ as the baseline. Both the ASICs and Softbrain have large power savings of up to 300× compared to the OOO4 core, which is expected because of the power and area the OOO4 core spends on supporting generality. ASICs have better power efficiency than Softbrain overall, but only by 2× across all benchmarks. With some workloads, the ASIC has almost the same power as Softbrain; this is because Aladdin instantiates larger memory structures (scratchpads, buffers etc.) for loop unrolling, essentially flattening the array data structures in order for the design space to have performance points close to Softbrain. Note that we include the local memory structures of the ASICs in their power estimation, as Softbrain also has a programmable scratchpad. Most of the additional power consumption in Softbrain comes from the generality-supporting structures, such as the CGRA network, which is capable of mapping a wide variety of possible DFGs.

⁵ We consider the dynamic power of 1 core at 32nm and scale it to 55nm.

Figure 14 shows the energy efficiency comparison of Softbrain and the ASICs, showing a high efficiency advantage for both compared to the baseline OOO4 core. The energy consumption of Softbrain is within 2× of the ASICs, mainly due to the difference in power consumption.
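Since the energy numbers here are derived directly from execution cycles and power estimates, a minimal bookkeeping sketch of that derivation is shown below. The clock frequency and units are made up purely for illustration; the real inputs come from the synthesized design and from Aladdin.

    CLOCK_GHZ = 1.0  # hypothetical accelerator clock, for illustration only

    def energy_nj(cycles, static_mw, dynamic_mw):
        """Energy = (static + dynamic power) x execution time."""
        time_ns = cycles / CLOCK_GHZ                       # cycles at 1 GHz -> ns
        return (static_mw + dynamic_mw) * time_ns * 1e-3   # mW * ns = pJ; /1000 -> nJ

    def efficiency_vs_baseline(baseline_nj, design_nj):
        """Energy efficiency relative to the OOO4 baseline, as plotted in Figure 14."""
        return baseline_nj / design_nj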

Figure 15 shows the area comparison. As Softbrain's area is fixed across benchmarks, the results show ASIC area relative to Softbrain's area. We do not include the area of the ASIC designs' memory structures in their estimates, as most of the time these workloads have streaming behavior and the ASICs can achieve the same performance with more parallel FUs rather than larger storage structures⁶. The mean Softbrain area is 8× that of the ASIC, which is expected, as Softbrain is programmable and must run all workloads. From another perspective, Softbrain is area efficient: including all eight MachSuite accelerators would have required 2.54× as much area as including only Softbrain.

⁶ Including the memory structure area in the ASIC estimates would make Softbrain look better in comparison.

Overall, Softbrain is competitive with ASICs in terms of performance, power, energy and area, even with the hardware necessary to support significant programmability. This demonstrates that there is scope to develop programmable architectures by tapping the right synergy between the algorithm properties of typically accelerated workloads and the microarchitectural mechanisms supporting stream-dataflow execution.

8 RELATED WORK

Streaming in Data Parallel Architectures: The concept of exposing streams in a core's ISA to communicate with reconfigurable hardware was proposed in the Reconfigurable Streaming Vector Processor (RSVP) [5]. RSVP uses similar descriptions of affine streams and dataflow graphs, but has several fundamental limitations. RSVP's communication patterns for any given computation cannot change during an execution phase. This reduces the flexibility of the types of patterns that can be expressed (eg. outputs that are used for reduction and sometimes written to memory). It also hampers the ability to prefetch data across different phases – one phase must complete before the data for another phase can be fetched, a high overhead for short phases. Next, the inter-iteration dependence distance in RSVP can only be 1, which limits programmability. Finally, the address space of RSVP's "scratchpad memory" is not exposed to the programmer, and can only map linear portions of the address space. This disallows some optimizations where sparse data representations are compacted into the scratchpad and read multiple times.
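To make the dependence-distance limitation concrete, here is a minimal, hypothetical loop whose inter-iteration dependence distance is 2, a pattern a distance-1-only recurrence mechanism cannot express directly:

    def distance_two_recurrence(a, b):
        """a[i] depends on a[i - 2], so the loop carries a dependence at distance 2;
        the even- and odd-indexed chains could otherwise proceed concurrently."""
        for i in range(2, len(a)):
            a[i] = a[i - 2] + b[i]
        return a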

An early foundational work in this area is Imagine [13], which is a scalable data parallel architecture for media processing. Imagine uses the concept of streams for explicit communication between memory and a so-called stream register file, which acts as a scratchpad for communicating between memory and the execution units, as well as between subsequent kernels. Streams are restricted to being linear and have a relatively small maximum size. Streams are also not exposed in the lower-level interface for controlling the execution resources: a cluster of VLIW pipelines is activated in SIMD fashion by a microcontroller. A stream-based ISA in this context could reduce the complexity of the controlling VLIW core. From a high level, Imagine can be viewed as a set of stream-dataflow processors which read all memory through the scratchpad, and where the reconfigurable fabric is replaced by more rigid SIMD+VLIW execution units.

Triggered instructions [23] is a spatial architecture featuring some streaming memory capability to feed its dataflow fabric. Its computing fabric is more flexible in terms of control flow and total instruction capacity, but is likely higher power and area.

The problem of efficiently interfacing with reconfigurable hardware also occurs in FPGA computation offloading environments. CoRAM++ [36] enables data-structure-specific API interfaces for transferring data to FPGA-synthesized datapaths, implemented with specialized soft-logic for each supported data structure. This interface is primarily based on streams.

Removing Data-Parallel Architecture Inefficiencies: Several works attempt to specialize existing data parallel architectures. One example for SIMT is exploiting value structure to eliminate redundant affine address and value computations [14]. The SGMF architecture adds dataflow extensions to GPGPUs [34]. Another work explores a set of mechanisms for data parallel execution, and evaluates these for the TRIPS spatial architecture [27]. Many of the data-parallel program attributes they define are targeted here, but we use a different set of mechanisms. We previously discussed the relationship to vector-thread architectures like Maven VT [16].

Softbrain can be viewed as an instance of the prototypical accelerator we defined in prior work [21], called LSSD. It leverages a multi-issue inorder SIMD core to bring data to the reconfigurable array (rather than stream engines), which requires higher energy due to the core's issue width, and it is more difficult to program or compile for because of the required modulo scheduling.

Heterogeneous Cores: There is a large body of work on combining general purpose cores and reconfigurable or otherwise specialized engines. A number of these designs target irregular workloads, like Composite Cores [18] for code phases with low ILP, XLOOPS [31] for loops with inter-iteration dependence patterns, or SEED [20] for phases with non-critical control. We consider those designs to be mostly orthogonal to the stream-dataflow ISA.

Programmable accelerators like DySER [7] or Libra [24] could benefit from the architectural interfaces we propose, and it would be interesting if compilers could automatically target such architectures, perhaps by extending existing SIMD-style vectorization approaches [8].

A highly related work is Memory Access Dataflow [10], another access-execute style architecture. It consists of a general core, a compute accelerator, and a reconfigurable address-generation fabric. We considered a similar approach for address generation, but found that, for the access patterns we needed to support, the area overheads would have been multiple factors higher.

A recurring issue for resource-exposed architectures is binary compatibility. VEAL uses dynamic compilation to translate loops in the baseline ISA into a template loop accelerator which has simple address generators and a modulo-scheduled programmable engine [6]. Similar techniques have been proposed to dynamically compile for CGRAs [35], and are applicable to stream-dataflow.

Streaming in Domain Specific Accelerators: Many domain-specific accelerators use streaming and dataflow abstractions. Eyeriss is a domain-specific accelerator for convolutional neural networks, using streaming access to bring in data, as well as a dataflow substrate for computation [4]. A recent work, Cambricon [17], proposes SIMD instruction set extensions which can perform the stream-like access patterns found in DNNs. Outside the domain of machine learning, Q100 [37] is an accelerator for performing streaming database queries. It uses a stream-based abstraction for accessing database columns, and a dataflow abstraction for performing computations.

9 DISCUSSION AND CONCLUSIONS

This paper has proposed a new execution model and architecture, stream-dataflow, which provides abstractions that balance the tradeoffs of vector and spatial architectures, and attains the specialization capabilities of both on an important class of data-processing workloads. We show that these primitives are sufficiently general to express the execution of a variety of deep-learning workloads and many workloads from MachSuite (a proxy for workloads that an ASIC could be considered for). In terms of our hardware implementation, we have developed an efficient microarchitecture, and our evaluation suggests its power and area are only small factors more than those of domain-specific and ASIC designs.

We envision that this architectural paradigm can have a radical simplifying effect on future chips by reducing the number of specialization blocks. Instead, a stream-dataflow type fabric can sit alongside CPU and GPU processors, with functionality synthesized on the fly as programs encounter suitable phases for efficient offloading. This not only reduces the area and complexity of having vast arrays of specialized accelerators, it can also mitigate growing design and verification costs. In such a broad setting, it will be critical to develop effective compilation tools that can balance the complex tradeoffs between parallelism and data reuse that these architectures provide. Providing dynamic compilation support for a stream-dataflow architecture could bring specialized-architecture efficiency to the masses.

Overall, we believe that stream-dataflow has a large role to play going forward. Just as RISC ISAs defined and drove the era of pipelining and the multi-decade dominance of general purpose processors, for the era of specialization we need new architectures and ISA abstractions that match the nature of emerging workloads. Stream-dataflow can serve as this generation's ISA to drive further microarchitecture- and architecture-level innovations for accelerators.

10 ACKNOWLEDGMENTS

We would first like to thank the anonymous reviewers for their detailed questions and suggestions, which helped us to clarify the presentation. We want to thank Preyas Shah for setting up the automated Synopsys toolchain for area-power analysis. We would also like to thank Sophia Shao and Michael Pellauer for their thoughtful suggestions and advice during the revision process. This work was supported by the National Science Foundation, grant CCF-1618234.

REFERENCES
[1] Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas Avižienis, John Wawrzynek, and Krste Asanović. 2012. Chisel: Constructing Hardware in a Scala Embedded Language. In Proceedings of the 49th Annual Design Automation Conference (DAC '12). ACM, New York, NY, USA, 1216–1225. DOI:https://doi.org/10.1145/2228360.2228584
[2] Doug Burger, Stephen W. Keckler, Kathryn S. McKinley, Mike Dahlin, Lizy K. John, Calvin Lin, Charles R. Moore, James Burrill, Robert G. McDonald, William Yoder, and the TRIPS Team. 2004. Scaling to the End of Silicon with EDGE Architectures. Computer 37, 7 (July 2004), 44–55. DOI:https://doi.org/10.1109/MC.2004.65
[3] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-learning. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 269–284. DOI:https://doi.org/10.1145/2541940.2541967
[4] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A Spatial Architecture for Energy-efficient Dataflow for Convolutional Neural Networks. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA '16). IEEE Press, Piscataway, NJ, USA, 367–379. DOI:https://doi.org/10.1109/ISCA.2016.40
[5] Silviu Ciricescu, Ray Essick, Brian Lucas, Phil May, Kent Moat, Jim Norris, Michael Schuette, and Ali Saidi. 2003. The Reconfigurable Streaming Vector Processor (RSVP). In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 36). IEEE Computer Society, Washington, DC, USA, 141–. http://dl.acm.org/citation.cfm?id=956417.956540
[6] Nathan Clark, Amir Hormati, and Scott Mahlke. 2008. VEAL: Virtualized Execution Accelerator for Loops. In Proceedings of the 35th Annual International Symposium on Computer Architecture (ISCA '08). IEEE Computer Society, Washington, DC, USA, 389–400. DOI:https://doi.org/10.1109/ISCA.2008.33
[7] Venkatraman Govindaraju, Chen-Han Ho, Tony Nowatzki, Jatin Chhugani, Nadathur Satish, Karthikeyan Sankaralingam, and Changkyu Kim. 2012. DySER: Unifying Functionality and Parallelism Specialization for Energy-Efficient Computing. IEEE Micro 32, 5 (Sept. 2012), 38–51. DOI:https://doi.org/10.1109/MM.2012.51
[8] Venkatraman Govindaraju, Tony Nowatzki, and Karthikeyan Sankaralingam. 2013. Breaking SIMD Shackles with an Exposed Flexible Microarchitecture and the Access Execute PDG. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT '13). IEEE Press, Piscataway, NJ, USA, 341–352. DOI:https://doi.org/10.1109/PACT.2013.6618830
[9] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding Sources of Inefficiency in General-purpose Chips. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). ACM, New York, NY, USA, 37–47. DOI:https://doi.org/10.1145/1815961.1815968
[10] Chen-Han Ho, Sung Jin Kim, and Karthikeyan Sankaralingam. 2015. Efficient Execution of Memory Access Phases Using Dataflow Specialization. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, New York, NY, USA, 118–130. DOI:https://doi.org/10.1145/2749469.2750390
[11] Mircea Horea Ionica and David Gregg. 2015. The Movidius Myriad Architecture's Potential for Scientific Computing. IEEE Micro 35, 1 (2015), 6–14. DOI:https://doi.org/10.1109/MM.2015.4
[12] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, and others. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In ISCA '17.
[13] Brucek Khailany, William J Dally, Ujval J Kapasi, Peter Mattson, Jinyung Namkoong, John D Owens, Brian Towles, Andrew Chang, and Scott Rixner. 2001. Imagine: Media Processing with Streams. IEEE Micro 21, 2 (2001), 35–46. DOI:https://doi.org/10.1109/40.918001
[14] Ji Kim, Christopher Torng, Shreesha Srinath, Derek Lockhart, and Christopher Batten. 2013. Microarchitectural Mechanisms to Exploit Value Structure in SIMT Architectures. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 130–141. DOI:https://doi.org/10.1145/2485922.2485934
[15] Ronny Krashinsky, Christopher Batten, Mark Hampton, Steve Gerding, Brian Pharris, Jared Casper, and Krste Asanović. 2004. The Vector-Thread Architecture. In Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA '04). IEEE Computer Society, Washington, DC, USA, 52–. DOI:https://doi.org/10.1145/1028176.1006736
[16] Yunsup Lee, Rimas Avizienis, Alex Bishara, Richard Xia, Derek Lockhart, Christopher Batten, and Krste Asanović. 2011. Exploring the Tradeoffs Between Programmability and Efficiency in Data-parallel Accelerators. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA '11). ACM, New York, NY, USA, 129–140. DOI:https://doi.org/10.1145/2000064.2000080
[17] Shaoli Liu, Zidong Du, Jinhua Tao, Dong Han, Tao Luo, Yuan Xie, Yunji Chen, and Tianshi Chen. 2016. Cambricon: An Instruction Set Architecture for Neural Networks. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA '16). IEEE Press, Piscataway, NJ, USA, 393–405. DOI:https://doi.org/10.1109/ISCA.2016.42
[18] Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Faissal M. Sleiman, Ronald Dreslinski, Thomas F. Wenisch, and Scott Mahlke. 2012. Composite Cores: Pushing Heterogeneity Into a Core. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE Computer Society, Washington, DC, USA, 317–328. DOI:https://doi.org/10.1109/MICRO.2012.37
[19] Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P Jouppi. 2009. CACTI 6.0: A Tool to Model Large Caches. HP Laboratories (2009), 22–31.
[20] Tony Nowatzki, Vinay Gangadhar, and Karthikeyan Sankaralingam. 2015. Exploring the Potential of Heterogeneous Von Neumann/Dataflow Execution Models. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, New York, NY, USA, 298–310. DOI:https://doi.org/10.1145/2749469.2750380
[21] Tony Nowatzki, Vinay Gangadhar, Karthikeyan Sankaralingam, and Greg Wright. 2016. Pushing the Limits of Accelerator Efficiency While Retaining Programmability. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 27–39. DOI:https://doi.org/10.1109/HPCA.2016.7446051
[22] Tony Nowatzki, Michael Sartin-Tarm, Lorenzo De Carli, Karthikeyan Sankaralingam, Cristian Estan, and Behnam Robatmili. 2013. A General Constraint-centric Scheduling Framework for Spatial Architectures. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). ACM, New York, NY, USA, 495–506. DOI:https://doi.org/10.1145/2491956.2462163
[23] Angshuman Parashar, Michael Pellauer, Michael Adler, Bushra Ahsan, Neal Crago, Daniel Lustig, Vladimir Pavlov, Antonia Zhai, Mohit Gambhir, Aamer Jaleel, Randy Allmon, Rachid Rayess, Stephen Maresh, and Joel Emer. 2013. Triggered Instructions: A Control Paradigm for Spatially-programmed Architectures. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 142–153. DOI:https://doi.org/10.1145/2485922.2485935
[24] Yongjun Park, Jason Jong Kyu Park, Hyunchul Park, and Scott Mahlke. 2012. Libra: Tailoring SIMD Execution Using Heterogeneous Hardware and Dynamic Configurability. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE Computer Society, Washington, DC, USA, 84–95. DOI:https://doi.org/10.1109/MICRO.2012.17
[25] Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, and Doug Burger. 2014. A Reconfigurable Fabric for Accelerating Large-scale Datacenter Services. In Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA '14). IEEE Press, Piscataway, NJ, USA, 13–24. DOI:https://doi.org/10.1109/MM.2015.42
[26] Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Gu-Yeon Wei, and David Brooks. 2014. MachSuite: Benchmarks for Accelerator Design and Customized Architectures. In Workload Characterization (IISWC), 2014 IEEE International Symposium on. IEEE, 110–119.
[27] Karthikeyan Sankaralingam, Stephen W. Keckler, William R. Mark, and Doug Burger. 2003. Universal Mechanisms for Data-Parallel Architectures. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 36). IEEE Computer Society, Washington, DC, USA, 303–. DOI:https://doi.org/10.1109/MICRO.2003.1253204
[28] Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A Pre-RTL, Power-performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. In Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA '14). IEEE Press, Piscataway, NJ, USA, 97–108. DOI:https://doi.org/10.1109/MM.2015.50
[29] Hartej Singh, Ming-Hau Lee, Guangming Lu, Nader Bagherzadeh, Fadi J. Kurdahi, and Eliseu M. Chaves Filho. 2000. MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications. IEEE Trans. Comput. 49, 5 (May 2000), 465–481. DOI:https://doi.org/10.1109/12.859540
[30] James E. Smith. 1982. Decoupled Access/Execute Computer Architectures. In Proceedings of the 9th Annual Symposium on Computer Architecture (ISCA '82). IEEE Computer Society Press, Los Alamitos, CA, USA, 112–119. DOI:https://doi.org/10.1145/357401.357403
[31] Shreesha Srinath, Berkin Ilbeyi, Mingxing Tan, Gai Liu, Zhiru Zhang, and Christopher Batten. 2014. Architectural Specialization for Inter-Iteration Loop Dependence Patterns. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 583–595. DOI:https://doi.org/10.1109/MICRO.2014.31
[32] Steven Swanson, Ken Michelson, Andrew Schwerin, and Mark Oskin. 2003. WaveScalar. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 36). IEEE Computer Society, Washington, DC, USA, 291–. DOI:https://doi.org/10.1109/MICRO.2003.1253203
[33] Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffman, Paul Johnson, Jae-Wook Lee, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. 2002. The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs. IEEE Micro 22, 2 (March 2002), 25–35. DOI:https://doi.org/10.1109/MM.2002.997877
[34] Dani Voitsechov and Yoav Etsion. 2014. Single-graph Multiple Flows: Energy Efficient Design Alternative for GPGPUs. In Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA '14). IEEE Press, Piscataway, NJ, USA, 205–216. DOI:https://doi.org/10.1109/ISCA.2014.6853234
[35] Matthew A Watkins, Tony Nowatzki, and Anthony Carno. 2016. Software Transparent Dynamic Binary Translation for Coarse-Grain Reconfigurable Architectures. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 138–150. DOI:https://doi.org/10.1109/HPCA.2016.7446060
[36] Gabriel Weisz and James C Hoe. 2015. CoRAM++: Supporting Data-structure-specific Memory Interfaces for FPGA Computing. In 25th International Conference on Field Programmable Logic and Applications (FPL). 1–8. DOI:https://doi.org/10.1109/FPL.2015.7294017
[37] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. 2014. Q100: The Architecture and Design of a Database Processing Unit. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 255–268. DOI:https://doi.org/10.1145/2541940.2541961