The Aladdin Approach to Accelerator Design and Modeling

Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks
Harvard University

Published by the IEEE Computer Society. 0272-1732/15/$31.00 © 2015 IEEE

[...]’ POWER AND PERFORMANCE IN AN SOC ENVIRONMENT.

As we near the end of Dennard scaling, traditional performance and power scaling benefits based on technology improvements no longer exist. At the same time, transistor density improvements continue, resulting in the dark silicon problem, wherein chips have more transistors than a system can fully power at any point in time [1]. To overcome these challenges, application-specific hardware acceleration has surfaced as a promising approach because it delivers orders of magnitude performance and energy benefits compared to general-purpose solutions. Analysis of die photos (http://vlsiarch.eecs.harvard.edu/accelerators/die-photo-analysis) from Apple’s A6 (iPhone 5), A7 (iPhone 5s), and A8 (iPhone 6) systems on chips (SoCs) shows that more than half of the die area is dedicated to blocks that are neither CPUs nor GPUs, but rather specialized IP blocks (see Figure 1). We also observe a consistent trend of an increasing number of specialized IP blocks across generations of Apple’s SoCs (see Figure 2).

Figure 1. Die area breakdown of Apple’s systems on chips (SoCs). (a) A6 (iPhone 5), (b) A7 (iPhone 5s), and (c) A8 (iPhone 6). More than half of the die area is dedicated to specialized IP blocks.

Figure 2. Number of specialized IP blocks across generations of Apple’s SoCs (A4 through A8). We see a consistent increase in the number of blocks.

The natural evolution of this trend leads to a growing volume and diversity of customized accelerators in future heterogeneous architectures ranging from mobile SoCs to high-performance servers (Figure 3). To thoroughly evaluate such architectures, designers must perform large design space exploration to understand the tradeoffs across the entire system, which is currently infeasible due to the lack of a fast simulation infrastructure for accelerator-centric systems.

Computer architects have long been developing and using high-level power and performance simulation tools for general-purpose cores and GPUs. In contrast, current accelerator-related research relies primarily on register-transfer level (RTL) implementations, a tedious and time-consuming process. It takes hours, if not days, to generate and simulate RTL and then synthesize it into logic to estimate the power and performance of a single accelerator design. This low-level, RTL-centric infrastructure cannot support large architecture-level design space exploration, which is required to navigate the vast parameter space governing the interactions between general-purpose cores, accelerators, and shared resources, including cache hierarchies and on-chip networks. Hence, there is a clear need for a fast, high-level design tool to enable broad design space exploration for next-generation customized architectures.

In this article, we present Aladdin, a pre-RTL, power-performance simulator designed to enable rapid design space exploration for accelerator-centric systems. Aladdin takes high-level language descriptions of algorithms as inputs and uses dynamic data dependence graphs (DDDG) as a representation of an accelerator without having to generate RTL. Starting with an unconstrained program DDDG, which corresponds to an initial representation of accelerator hardware, Aladdin applies optimizations as well as constraints to the graph to create a realistic model of accelerator activity.
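The idea of starting from an unconstrained DDDG and then applying constraints can be illustrated with a minimal sketch. This is not Aladdin's implementation: the trace format, the unit-latency assumption, the greedy in-order scheduler, and the names `TRACE`, `build_dddg`, and `schedule` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical dynamic trace: (node_id, op, destination, sources).
# Each entry is one executed operation, as a profiler might record it.
TRACE = [
    (0, "load", "a0", []),
    (1, "load", "b0", []),
    (2, "mul", "p0", ["a0", "b0"]),
    (3, "load", "a1", []),
    (4, "load", "b1", []),
    (5, "mul", "p1", ["a1", "b1"]),
    (6, "add", "s", ["p0", "p1"]),
]


def build_dddg(trace):
    """Return {node: [predecessor nodes]} from read-after-write dependences."""
    last_writer = {}  # value name -> node that most recently produced it
    preds = {}
    for node, _op, dest, srcs in trace:
        # Sources with no recorded writer are treated as external inputs.
        preds[node] = [last_writer[s] for s in srcs if s in last_writer]
        last_writer[dest] = node
    return preds


def schedule(trace, preds, units_per_cycle=None):
    """Greedily schedule unit-latency nodes in trace order; return total cycles.

    units_per_cycle=None models the unconstrained graph (dataflow limit);
    an integer caps how many operations may issue in the same cycle.
    """
    start = {}
    issued = defaultdict(int)  # cycle -> operations already issued
    for node, *_ in trace:  # trace order is a valid topological order
        cycle = max((start[p] + 1 for p in preds[node]), default=0)
        if units_per_cycle is not None:
            while issued[cycle] >= units_per_cycle:
                cycle += 1  # wait for a free issue slot
        issued[cycle] += 1
        start[node] = cycle
    return max(start.values()) + 1


if __name__ == "__main__":
    graph = build_dddg(TRACE)
    print("unconstrained:", schedule(TRACE, graph))
    print("2 units/cycle:", schedule(TRACE, graph, units_per_cycle=2))
    print("1 unit/cycle: ", schedule(TRACE, graph, units_per_cycle=1))
```

For this seven-node trace the unconstrained graph finishes in 3 cycles, while capping issue width at 2 or 1 operations per cycle stretches it to 5 and 7 cycles. The greedy in-order scheduler is a simplification and does not guarantee the shortest constrained schedule.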
We rigorously validated Aladdin against RTL implementations of accelerators from both handwritten Verilog and a commercial high-level synthesis (HLS) tool for various applications, including accelerators in Memcached [2], HARP [3], NPU [4], and a commonly used throughput-oriented benchmark suite called SHOC [5]. Our results show that Aladdin can model performance within 0.9 percent, power within 4.9 percent, and area within 6.6 percent compared to accelerator designs generated by traditional RTL flows. In addition, Aladdin provides these estimates more than 100 times faster.

Figure 3. Future heterogeneous architecture will include a large number of customized accelerators. Large design space exploration is needed to design such architecture.

Aladdin captures accelerator design tradeoffs, enabling new architectural research directions in heterogeneous systems comprising general-purpose cores, accelerators, and a shared memory hierarchy. We demonstrated this capability by integrating Aladdin with a full memory hierarchy model. Such infrastructure lets users explore customized and shared memory hierarchies for accelerators in a heterogeneous environment. In a case study with the GEMM benchmark, Aladdin uncovers significant high-level design tradeoffs by evaluating a broader design space of the entire system. This analysis results in more than 3× performance improvements compared to the conventional approach of designing accelerators in isolation.

IEEE Micro, May/June 2015, Top Picks.

Background

Hardware acceleration exists in many forms, ranging from analog accelerators and fixed-function accelerators to programmable accelerators, such as GPUs and digital signal processors. In this work, we focus on fixed-function accelerators. We discuss the design flow, design space, and state-of-the-art research infrastructure for fixed-function accelerators in order to illustrate the challenges associated with current accelerator research and discuss why a tool like Aladdin opens up new research opportunities for architects.

Accelerator design flow

The current accelerator design flow requires multiple CAD tools, which is inherently tedious and time-consuming. The process starts with a high-level description of an algorithm; then, designers either manually implement the algorithm in RTL or use HLS tools, such as Xilinx’s Vivado HLS, to compile the high-level implementation (for example, C/C++) to RTL.

Writing RTL manually takes significant effort, and the quality highly depends on designers’ expertise. Although HLS tools offer opportunities to automatically generate the RTL implementation, extensively tuning C code is still necessary to meet design requirements. After generating RTL, designers must use commercial CAD tools, such as Synopsys’s Design Compiler and Mentor Graphics’ ModelSim, to estimate power and cycle counts.

In contrast, Aladdin takes unmodified, high-level language descriptions of algorithms to generate a DDDG representation of accelerators, which accurately models the cycle-level power, performance, and area of realistic accelerator designs. As a pre-RTL simulator, Aladdin is orders of magnitude faster than existing CAD flows.

Accelerator design space

Despite the application-specific nature of accelerators, the accelerator design space is quite large, given a range of architecture- and circuit-level alternatives.
Figure 4 illustrates a power-performance design space of accelerator designs for the GEMM workload from the SHOC benchmark suite. The square points were generated from a commercial HLS flow sweeping datapath parameters, including loop-iteration parallelism, pipelining, array partitioning, and clock frequency. However, HLS flows generally provision a fixed latency for all memory accesses, implicitly assuming local scratchpad memory fed by direct memory access (DMA) controllers. Such simple designs are not well suited for capturing data locality or interactions with complex memory hierarchies. The circle points in Figure 4 were generated by Aladdin integrated with a full cache hierarchy and memory model, sweeping not only datapath parameters but also memory parameters. By doing so, Aladdin exposes a rich design space that incorporates the realistic memory penal-

Figure 4. GEMM design space. A large [...] (Plot of accelerator power in mW against execution time in μs; series: datapath only, and datapath + memory.)
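A parameter sweep of the kind behind Figure 4 can be sketched as follows. The cost model here is invented purely for illustration and does not reproduce Aladdin's or any HLS tool's estimates; `toy_model`, `sweep`, `pareto_front`, the parameter ranges, and all constants are assumptions. The sketch only shows the mechanics: enumerate datapath configurations, estimate time and power for each, and keep the points that no other point beats on both axes.

```python
from itertools import product


def toy_model(lanes, clock_mhz, partitions):
    """Made-up cost model, for illustration only (not Aladdin's estimates)."""
    work = 1_000_000  # abstract operation count
    # Assumption: memory partitions limit how many lanes can be kept busy.
    effective = min(lanes, partitions)
    time_us = work / (effective * clock_mhz)
    power_mw = 0.02 * lanes * clock_mhz + 0.5 * partitions
    return time_us, power_mw


def sweep():
    """Enumerate every combination of the hypothetical datapath parameters."""
    points = []
    for lanes, clock, parts in product([1, 2, 4, 8], [100, 200, 500], [1, 2, 4, 8]):
        time_us, power_mw = toy_model(lanes, clock, parts)
        points.append({
            "lanes": lanes,
            "clock_mhz": clock,
            "partitions": parts,
            "time_us": time_us,
            "power_mw": power_mw,
        })
    return points


def pareto_front(points):
    """Keep points not dominated in time and power (lower is better for both)."""
    front = []
    best_power = float("inf")
    # Scan from fastest to slowest; a point survives only if it uses
    # strictly less power than every faster (or equally fast) point.
    for p in sorted(points, key=lambda p: (p["time_us"], p["power_mw"])):
        if p["power_mw"] < best_power:
            front.append(p)
            best_power = p["power_mw"]
    return front


if __name__ == "__main__":
    for p in pareto_front(sweep()):
        print(p)
```

Extending the sweep with memory parameters (for example cache size or bandwidth) would only add more dimensions to `product`; the frontier extraction stays the same.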
