The Aladdin Approach to Accelerator Design and Modeling

Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks
Harvard University

The authors developed Aladdin, a pre-RTL, power-performance accelerator modeling framework, and demonstrated its application to system-on-chip (SoC) simulation. Aladdin estimates the performance, power, and area of accelerators within 0.9, 4.9, and 6.6 percent with respect to register-transfer-level (RTL) implementations. Integrated with architecture-level general-purpose core and memory hierarchy simulators, Aladdin provides a fast but accurate way to model accelerators' power and performance in an SoC environment.

As we near the end of Dennard scaling, traditional performance and power scaling benefits based on technology improvements no longer exist. At the same time, transistor density improvements continue, resulting in the dark silicon problem, wherein chips have more transistors than a system can fully power at any point in time.1 To overcome these challenges, application-specific hardware acceleration has surfaced as a promising approach because it delivers orders-of-magnitude performance and energy benefits compared to general-purpose solutions. Analysis of die photos (http://vlsiarch.eecs.harvard.edu/accelerators/die-photo-analysis) from Apple's A6 (iPhone 5), A7 (iPhone 5s), and A8 (iPhone 6) systems on chips (SoCs) shows that more than half of the die area is dedicated to blocks that are neither CPUs nor GPUs, but rather specialized IP blocks (see Figure 1). We also observe a consistent trend of an increasing number of specialized IP blocks across generations of Apple's SoCs (see Figure 2). The natural evolution of this trend leads to a growing volume and diversity of customized accelerators in future heterogeneous architectures ranging from mobile SoCs to high-performance servers (Figure 3). To thoroughly evaluate such architectures, designers must perform large design space exploration to understand the tradeoffs across the entire system, which is currently infeasible due to the lack of a fast simulation infrastructure for accelerator-centric systems.

Figure 1. Die area breakdown of Apple's systems on chips (SoCs). (a) A6 (iPhone 5), (b) A7 (iPhone 5s), and (c) A8 (iPhone 6). More than half of the die area is dedicated to specialized IP blocks.

Figure 2. Number of specialized IP blocks across generations of Apple's SoCs (A4 through A8). We see a consistent increase in the number of blocks.

Computer architects have long been developing and using high-level power and performance simulation tools for general-purpose cores and GPUs. In contrast, current accelerator-related research relies primarily on register-transfer level (RTL) implementations, a tedious and time-consuming process. It takes hours, if not days, to generate and simulate RTL and then synthesize it into logic to estimate the power and performance of a single accelerator design. This low-level, RTL-centric infrastructure cannot support large architecture-level design space exploration, which is required to navigate the vast parameter space governing the interactions between general-purpose cores, accelerators, and shared resources, including cache hierarchies and on-chip networks. Hence, there is a clear need for a fast, high-level design tool to enable broad design space exploration for next-generation customized architectures.

In this article, we present Aladdin, a pre-RTL, power-performance simulator designed to enable rapid design space exploration for accelerator-centric systems. Aladdin takes high-level language descriptions of algorithms as inputs and uses dynamic data dependence graphs (DDDG) as a representation of an accelerator without having to generate RTL. Starting with an unconstrained program DDDG, which corresponds to an initial representation of accelerator hardware, Aladdin applies optimizations as well as constraints to the graph to create a realistic model of accelerator activity. We rigorously validated Aladdin against RTL implementations of accelerators from both handwritten Verilog and a commercial high-level synthesis (HLS) tool for various applications, including accelerators in Memcached,2 HARP,3 NPU,4 and a commonly used throughput-oriented benchmark suite called SHOC.5 Our results show that Aladdin can model performance within 0.9 percent, power within 4.9 percent, and area within 6.6 percent compared to accelerator designs generated by traditional RTL flows. In addition, Aladdin provides these estimates more than 100 times faster.

[Figure 3 depicts such a system: big and little cores with private L1 caches or scratchpads, a GPGPU, and application-specific accelerator datapaths, all sharing a last-level cache and on-chip network.]

Figure 3. Future heterogeneous architectures will include a large number of customized accelerators. Large design space exploration is needed to design such architectures.

Aladdin captures accelerator design tradeoffs, enabling new architectural research directions in heterogeneous systems comprising general-purpose cores, accelerators, and a shared memory hierarchy. We demonstrated this capability by integrating Aladdin with a full memory hierarchy model. Such infrastructure lets users explore customized and shared memory hierarchies for accelerators in a heterogeneous environment. In a case study with the GEMM benchmark, Aladdin uncovers significant high-level design tradeoffs by evaluating a broader design space of the entire system. This analysis results in more than 3x performance improvements compared to the conventional approach of designing accelerators in isolation.

Background
Hardware acceleration exists in many forms, ranging from analog accelerators and fixed-function accelerators to programmable accelerators such as GPUs and digital signal processors. In this work, we focus on fixed-function accelerators. We discuss the design flow, design space, and state-of-the-art research infrastructure for fixed-function accelerators in order to illustrate the challenges associated with current accelerator research and to discuss why a tool like Aladdin opens up new research opportunities for architects.

Accelerator design flow
The current accelerator design flow requires multiple CAD tools, which is inherently tedious and time-consuming. The process starts with a high-level description of an algorithm; then, designers either manually implement the algorithm in RTL or use HLS tools, such as Xilinx's Vivado HLS, to compile the high-level implementation (for example, C/C++) to RTL.

Writing RTL manually takes significant effort, and the quality highly depends on designers' expertise. Although HLS tools offer opportunities to automatically generate the RTL implementation, extensively tuning C code is still necessary to meet design requirements. After generating RTL, designers must use commercial CAD tools, such as Synopsys's Design Compiler and Mentor Graphics' ModelSim, to estimate power and cycle counts.

In contrast, Aladdin takes unmodified, high-level language descriptions of algorithms to generate a DDDG representation of accelerators, which accurately models the cycle-level power, performance, and area of realistic accelerator designs. As a pre-RTL simulator, Aladdin is orders of magnitude faster than existing CAD flows.

Accelerator design space
Despite the application-specific nature of accelerators, the accelerator design space is quite large, given a range of architecture- and circuit-level alternatives. Figure 4 illustrates a power-performance design space of accelerator designs for the GEMM workload from the SHOC benchmark suite. The square points were generated from a commercial HLS flow sweeping datapath parameters, including loop-iteration parallelism, pipelining, array partitioning, and clock frequency. However, HLS flows generally provision a fixed latency for all memory accesses, implicitly assuming local scratchpad memory fed by direct memory access (DMA) controllers.

Such simple designs are not well suited for capturing data locality or interactions with complex memory hierarchies. The circle points in Figure 4 were generated by Aladdin integrated with a full cache hierarchy and memory model, sweeping not only datapath parameters but also memory parameters. By doing so, Aladdin exposes a rich design space that incorporates realistic memory penalties in terms of time and power, which is impractical with existing HLS tools alone.

Figure 4. GEMM design space. A large design space is overlooked if we ignore the memory parameters.
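The kind of input both the HLS flow and Aladdin consume is plain C. The following is a minimal sketch of a GEMM kernel in the spirit of the SHOC benchmark (our own simplified version, not the SHOC source); the datapath parameters swept in Figure 4, such as loop-iteration parallelism, pipelining, and array partitioning, are applied to loops like the innermost one below.

/* Simplified GEMM kernel of the kind handed to HLS or Aladdin. */
#define N 64

void gemm(const float a[N][N], const float b[N][N], float c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            /* Candidate loop for unrolling and pipelining in the sweep. */
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}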

State-of-the-art accelerator research infrastructure
The International Technology Roadmap for Semiconductors (ITRS) predicts hundreds to thousands of customized accelerators by 2022.6 However, state-of-the-art accelerator research projects still contain only a handful of accelerators because of the cumbersome design flow that inhibits computer architects from evaluating large accelerator-centric systems. Table 1 categorizes accelerator-related research projects in the computer architecture community over the past five years based on the means of implementation (handwritten RTL versus HLS tools) and the scope of possible design exploration.

Researchers have been able to propose novel implementations of accelerators for a wide range of applications, by either writing RTL directly or using HLS tools, despite the time-consuming process. With the help of the HLS flow, we have begun to see studies evaluating design tradeoffs in accelerator datapaths, which is otherwise impractical using handwritten RTL. However, as discussed earlier, HLS tools cannot easily navigate large design spaces of heterogeneous architectures. This inadequacy in infrastructure has confined the exploratory scope of accelerator research.

The Aladdin framework
Aladdin takes high-level language descriptions of algorithms and accelerator design parameters as inputs and then outputs power, performance, area, and cycle-level activities of accelerator implementations, including memory activity that can be connected with shared memory and interconnect models. Aladdin can be first used as an accelerator simulator that models an accelerator's behavior in an accelerator-rich SoC to facilitate the evaluation of the interaction between accelerators and shared resources like memory. On the other hand, Aladdin can also be used as an early-stage accelerator design assistant that allows SoC designers to quickly navigate the large design space of accelerators before they start implementing RTL, thereby greatly reducing design iterations.
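As a hypothetical illustration of the design-assistant use, the sketch below enumerates design points over parameters in the spirit of Table 2 and queries an estimator for each. The DesignPoint and Estimate types, the aladdin_estimate function, and all numbers in the stub are our own illustrative names and values, not Aladdin's actual API.

/* A hypothetical sketch of early-stage design space exploration. */
#include <stdio.h>

typedef struct { int unroll; int pipelined; int mem_ports; double clk_ns; } DesignPoint;
typedef struct { double cycles, power_mw, area_mm2; } Estimate;

/* Stand-in so the sketch compiles; a real flow would invoke Aladdin itself. */
static Estimate aladdin_estimate(const char *kernel, DesignPoint p)
{
    (void)kernel;
    Estimate e = { 1.0e5 / p.unroll, 20.0 * p.mem_ports, 0.05 * p.unroll };
    return e;
}

static void sweep(const char *kernel)
{
    int unrolls[] = { 1, 2, 4, 8, 16 };
    int ports[]   = { 1, 2, 4 };

    for (int u = 0; u < 5; u++)
        for (int m = 0; m < 3; m++) {
            DesignPoint p = { unrolls[u], 1, ports[m], 2.0 };
            Estimate e = aladdin_estimate(kernel, p);
            printf("unroll=%2d ports=%d -> %.0f cycles, %.1f mW, %.3f mm^2\n",
                   p.unroll, p.mem_ports, e.cycles, e.power_mw, e.area_mm2);
        }
}

int main(void) { sweep("gemm"); return 0; }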

Table 1. Accelerator research infrastructure.

Means of implementation       Novel accelerator                          Accelerator datapath design tradeoffs   Heterogeneous SoC tradeoffs
Hand-coded RTL                Buffer-integrated cache,7 Memcached,2,8    Inadequate                              Inadequate
                              Sonic Millip3De,9 HARP3
High-level synthesis (HLS)    LINQits,10 Convolution Engine,11           Cong,13 Liu,14 Reagen15                 Inadequate
                              Conservation Cores12

[Figure 5 depicts the flow: C code is compiled to an optimistic IR; the optimization phase builds the initial DDDG and refines it into an idealistic DDDG; the realization phase produces program-constrained and resource-constrained DDDGs, whose activity feeds the power model to yield power and execution time.]

Figure 5. Overview of the Aladdin framework. Aladdin takes C code and passes it through an optimization phase and a realization phase to estimate an accelerator’s power and performance.

Modeling methodology
The foundation of the Aladdin infrastructure is the use of the DDDG to represent accelerators. A DDDG is a directed acyclic graph wherein nodes represent computation and edges represent dynamic data dependences between nodes. The dataflow nature of hardware accelerators makes the DDDG a good candidate to model their behavior. Figure 5 illustrates Aladdin's overall structure, starting from a C description of an algorithm and passing through an optimization phase, where the DDDG is constructed and optimized to derive an idealized representation of the algorithm. The idealized DDDG then passes to a realization phase that restricts the DDDG by applying realistic program dependences and resource constraints. User-defined configurations allow wide design space exploration of accelerator implementations. The outcome of these two phases is a pre-RTL, power-performance model of accelerators.

Aladdin uses a DDDG to represent program behaviors so that it can take arbitrary C code descriptions of an algorithm, without any modifications, to expose algorithmic parallelism. This fundamental feature lets users rapidly investigate different algorithms and accelerator implementations. Owing to its optimistic nature, dynamic analysis has previously been deployed in parallelism research exploring the limits of instruction-level parallelism (ILP)16 and in recent modeling frameworks for multicore processors.17 These studies sought to quickly measure the upper bound of performance achievable on an ideal parallel machine. Our work has two main distinctions from these efforts. First, previous efforts model traditional Von Neumann machines where instructions are fetched, decoded, and executed on a fixed, but programmable, architecture. In contrast, Aladdin models a vast palette of different accelerator implementation alternatives for the DDDG; the optimization phase incorporates typical hardware optimizations, such as removing memory operations via customized storage inside the datapath and reducing the bit width of functional units. The second distinction is that Aladdin provides a realistic power-performance model of accelerators across a range of design alternatives during its realization phase, unlike previous work that offered an upper-bound performance estimate.

In contrast to dynamic approaches, parallelizing compilers and HLS tools use program dependence graphs (PDG) that statically capture both control and data dependences. Static analysis is inherently conservative in its dependence analysis, because it is used to generate code and hardware that works in all circumstances and is built without runtime information. A classic example of this conservatism is the enforcement of false dependences that restrict algorithmic parallelism. For instance, programmers often use pointers to navigate arrays, and disambiguating these memory references is a challenge for HLS tools. Such situations frequently lead to designs that are more sequential compared to what a human RTL programmer would develop. Therefore, although HLS tools offer the opportunity to automatically generate RTL, designers still need to extensively tune their C code to expose parallelism explicitly. Thus, Aladdin differs from HLS tools; it is a pre-RTL simulator that models the accelerator's behavior, whereas HLS is burdened with generating actual, correct hardware.

Optimization phase
The optimization phase forms an idealized DDDG that represents only the fundamental dependences of the algorithm. An idealized DDDG for accelerators must satisfy three requirements:

- express only necessary computation and memory accesses,
- capture only true read-after-write dependences, and
- remove unnecessary dependences in the context of customized accelerators.

Here, we describe how Aladdin's optimization phase addresses these requirements.

Optimistic intermediate representation. Aladdin builds the DDDG from a dynamic instruction trace, where the choice of the instruction set architecture (ISA) significantly impacts the complexity and granularity of the nodes in the graph. In fact, a trace using a machine-specific ISA contains instructions that are not part of the program but are produced due to the artifacts of the ISA (for example, register spills).18 To avoid such artifacts, Aladdin uses a high-level, machine-independent intermediate representation (IR) provided by the LLVM compiler. The LLVM IR is optimistic because it allows an unlimited number of registers, eliminating additional instructions associated with register spilling. We use a customized LLVM pass to emit fully optimized IR instructions in a trace file. The trace includes dynamic instruction information such as operation codes, register IDs, parameter data values, and dynamic addresses of memory operations.

Initial DDDG. Aladdin analyzes both register and memory dependences based on the IR trace. Only true read-after-write data dependences are respected in the initial DDDG construction. This DDDG is optimistic enough for the purpose of ILP limit studies but is missing several characteristics of hardware accelerators.

Idealized DDDG. Hardware accelerators can take on application-specific features not modeled in the initial DDDG. Aladdin generates an idealized DDDG representation of the original algorithm by employing commonly used hardware customization strategies such as bit-width tuning, strength reduction, induction-variable removal, and memory-to-register conversion.19

Realization phase
The realization phase uses program and resource parameters, defined by users, to constrain the idealized DDDG generated in the optimization phase.

Program-constrained DDDG. The idealized DDDG assumes that hardware designers can eliminate all control and false data dependences at design time, which is too optimistic. To model more realistic hardware designs, the program-constrained DDDG brings back the control and data dependences that are hard to resolve statically. For example, the idealized DDDG optimistically removes all false memory dependences between dynamic instructions, keeping only true read-after-write dependences. This is realistic for memory accesses with addresses that can be resolved at design time. However, some algorithms have input-dependent memory accesses (for example, histogram), wherein different inputs result in different dynamic dependences. Without runtime memory disambiguation support, designers must make conservative assumptions about memory dependences to ensure correctness. To model realistic memory dependences, the realization phase includes memory ambiguation that constrains the input-dependent memory accesses by adding dependences between all dynamic instances of a load-store pair, as long as a true memory dependence is observed for any pair.
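To make the trace-to-graph step concrete, here is a minimal sketch in C (our own illustration, not Aladdin's implementation) of initial-DDDG construction: each dynamic IR instruction becomes a node, and edges are added only for true read-after-write dependences carried through registers or through matching dynamic memory addresses. The trace values in main are made up and loosely follow the Figure 6 example.

/* Initial-DDDG construction sketch: only true RAW dependences become edges;
 * false (WAR/WAW) dependences are deliberately omitted, which is what makes
 * the initial DDDG optimistic. */
#include <stdio.h>
#include <string.h>

#define MAX_REGS   256
#define MAX_NODES  1024
#define MAX_EDGES  4096

typedef struct {
    int  is_load, is_store;  /* memory ops carry a dynamic address  */
    long addr;
    int  dest;               /* destination register id, -1 if none */
    int  src[2];             /* source register ids, -1 if unused   */
} TraceEntry;

static int edge_from[MAX_EDGES], edge_to[MAX_EDGES], n_edges;

static void add_edge(int from, int to)
{
    if (from >= 0 && n_edges < MAX_EDGES) {
        edge_from[n_edges] = from;
        edge_to[n_edges] = to;
        n_edges++;
    }
}

static void build_initial_dddg(const TraceEntry *trace, int n)
{
    int  last_reg_writer[MAX_REGS];    /* register id -> last writer node  */
    long store_addr[MAX_NODES];        /* past stores, for load-store RAW  */
    int  store_node[MAX_NODES], n_stores = 0;

    memset(last_reg_writer, -1, sizeof last_reg_writer);
    for (int i = 0; i < n; i++) {
        /* True register dependence: edge from each source's last writer. */
        for (int s = 0; s < 2; s++)
            if (trace[i].src[s] >= 0)
                add_edge(last_reg_writer[trace[i].src[s]], i);
        /* True memory dependence: a load depends on the most recent store
         * to the same dynamic address seen so far in the trace. */
        if (trace[i].is_load)
            for (int s = n_stores - 1; s >= 0; s--)
                if (store_addr[s] == trace[i].addr) { add_edge(store_node[s], i); break; }
        if (trace[i].is_store && n_stores < MAX_NODES) {
            store_addr[n_stores] = trace[i].addr;
            store_node[n_stores++] = i;
        }
        if (trace[i].dest >= 0)
            last_reg_writer[trace[i].dest] = i;
    }
    printf("%d nodes, %d true-dependence edges\n", n, n_edges);
}

int main(void)
{
    /* One iteration of a vector-add trace (addresses made up; base-address
     * registers are folded into the addr field for brevity). */
    TraceEntry t[] = {
        { 0, 0, 0,   0,  {-1, -1} },  /* 0: r0 = 0          */
        { 1, 0, 100, 4,  { 0, -1} },  /* 1: r4 = load a[0]  */
        { 1, 0, 200, 5,  { 0, -1} },  /* 2: r5 = load b[0]  */
        { 0, 0, 0,   6,  { 4,  5} },  /* 3: r6 = r4 + r5    */
        { 0, 1, 300, -1, { 0,  6} },  /* 4: store c[0] = r6 */
        { 0, 0, 0,   0,  { 0, -1} },  /* 5: r0 = r0 + 1     */
    };
    build_initial_dddg(t, 6);
    return 0;
}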

Table 2. Realization phase user-defined parameters. i::j::k denotes a set of values from i to k by a stepping factor j.

Parameters                Example range
Loop rolling factor       [1::2::trip count]
Clock period (ns)         [1::2::16]
Functional unit latency   Single cycle, pipelined
Memory ports              [1::2::64]

Resource-constrained DDDG. Finally, Aladdin accounts for user-specified hardware resource constraints, a subset of which is shown in Table 2. Users specify the type and size of hardware resources in an input configuration file. Aladdin then binds the program-constrained DDDG onto the hardware resources, leading to the resource-constrained DDDG. Aladdin can easily sweep resource parameters to explore an algorithm's design space.

An example. Figure 6 illustrates the different phases of Aladdin's transformations using a microbenchmark as an example. After the IR trace of the C code has been produced, the optimization and realization phases generate the resource-constrained DDDG that models accelerator behavior. In this example, we assume the user wants an accelerator with a factor-of-2 loop-iteration parallelism and without loop pipelining. The solid arrows in the DDDG are true data dependences, and the dashed arrows represent resource constraints, such as loop rolling and turning off loop pipelining. The horizontal dashed lines represent clock cycle boundaries. The corresponding resource activities are shown to the right of the DDDG example. We see that the DDDG reflects the dataflow nature of the accelerator. Figure 7 shows realistic cycle-level resource activity for one implementation of the fast Fourier transform (FFT) benchmark, wherein Aladdin accurately captures the distinct phases of FFT.

Power and area models. We can construct detailed power and area models by synthesizing microbenchmarks that exercise the functional units. Our microbenchmarks cover all of the computational instructions in the IR so that there is a one-to-one mapping between nodes in the DDDG and functional units in the power model. We synthesized these microbenchmarks using Synopsys's Design Compiler in conjunction with a commercial 40-nm standard cell library to characterize the switching, internal, and leakage power of each functional unit. This characterization is fully automated to easily migrate to new technologies. The memory power model is based on a commercial register file and static RAM (SRAM) memory compiler that accompanies our standard cell library. We compared this memory power model to CACTI and found consistent trends, but we retained the memory compiler model for consistency with the standard cell library.

Limitations
As a dynamic modeling approach, Aladdin has its limitations.

Algorithm choices. Aladdin does not automatically sweep different algorithms. Rather, it provides a framework to quickly explore various hardware designs of a particular algorithm. This means designers can use Aladdin to quickly and quantitatively compare the design spaces of multiple algorithms for the same application to find the most suitable algorithm choice.

Input dependence. Aladdin models only those operations that appear in the dynamic trace, which means it does not instantiate hardware for code not executed with a specific input. For Aladdin to completely model the hardware cost of a program, users must provide inputs that exercise all paths of the code.

Input C code. Aladdin can create a DDDG for any C code. However, in terms of modeling accelerators, C constructs that require resources outside the accelerator, such as system calls and dynamic memory allocation, are not modeled. In fact, understanding how to handle such events is a research direction that Aladdin facilitates.
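To illustrate the binding step described under "Resource-constrained DDDG" above (and visualized in Figure 6 below), here is a minimal, assumed sketch in C, not Aladdin's code, of scheduling a dependence graph when only a fixed number of memory ports and arithmetic units may issue per cycle; all operations are assumed to take a single cycle.

/* Resource-constrained scheduling sketch (assumes mem_ports, n_alus >= 1). */
#include <stdio.h>

typedef enum { OP_ALU, OP_MEM } OpClass;

typedef struct {
    OpClass cls;
    int n_preds;
    int preds[4];   /* indices of predecessor nodes          */
    int cycle;      /* assigned issue cycle, -1 = unscheduled */
} Node;

static int schedule(Node *nodes, int n, int mem_ports, int n_alus)
{
    for (int i = 0; i < n; i++) nodes[i].cycle = -1;
    int scheduled = 0, cycle = 0;

    while (scheduled < n) {
        int mem_used = 0, alu_used = 0;
        for (int i = 0; i < n; i++) {
            if (nodes[i].cycle >= 0) continue;
            int ready = 1;   /* all predecessors finished in an earlier cycle? */
            for (int p = 0; p < nodes[i].n_preds; p++) {
                int pr = nodes[i].preds[p];
                if (nodes[pr].cycle < 0 || nodes[pr].cycle >= cycle) ready = 0;
            }
            if (!ready) continue;
            /* Respect the per-cycle resource constraints (cf. Table 2). */
            if (nodes[i].cls == OP_MEM && mem_used < mem_ports) {
                nodes[i].cycle = cycle; mem_used++; scheduled++;
            } else if (nodes[i].cls == OP_ALU && alu_used < n_alus) {
                nodes[i].cycle = cycle; alu_used++; scheduled++;
            }
        }
        cycle++;
    }
    return cycle;   /* total latency in cycles */
}

int main(void)
{
    /* One iteration of the Figure 6 kernel: two loads feed an add and a store. */
    Node n[4] = {
        { OP_MEM, 0, {0},    -1 },   /* load a[i]   */
        { OP_MEM, 0, {0},    -1 },   /* load b[i]   */
        { OP_ALU, 2, {0, 1}, -1 },   /* a[i] + b[i] */
        { OP_MEM, 1, {2},    -1 },   /* store c[i]  */
    };
    printf("latency with 1 memory port: %d cycles\n", schedule(n, 4, 1, 1));
    printf("latency with 2 memory ports: %d cycles\n", schedule(n, 4, 2, 1));
    return 0;
}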

The C code and IR trace shown in Figure 6:

C code:
for (i = 0; i < N; ++i)
    c[i] = a[i] + b[i];

IR trace:
0.  r0 = 0
1.  r4 = load (r0 + r1)    // load a[i]
2.  r5 = load (r0 + r2)    // load b[i]
3.  r6 = r4 + r5
4.  store (r0 + r3, r6)    // store c[i]
5.  r0 = r0 + 1            // ++i
6.  r4 = load (r0 + r1)    // load a[i]
7.  r5 = load (r0 + r2)    // load b[i]
8.  r6 = r4 + r5
9.  store (r0 + r3, r6)    // store c[i]
10. r0 = r0 + 1            // ++i
...

[The figure's resource-constrained DDDG schedules two loop iterations side by side: each wave of clock cycles issues the index increments, then the paired loads of a[i] and b[i] on the memory ports, then the adds, then the stores of c[i], with the corresponding per-cycle adder and memory-port activity shown alongside.]

Figure 6. Different phases of Aladdin transformation illustrated with a microbenchmark: starting from C code, to an intermediate representation (IR) trace, then a resource-constrained dynamic data dependence graph (DDDG), and finally a dynamic activity profile.
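The per-cycle activity profile that falls out of the resource-constrained DDDG (Figures 6 and 7) is what drives the power model described above. Below is a minimal sketch, with made-up characterization numbers, of how such an activity profile could be turned into an average power estimate; it illustrates the idea rather than Aladdin's implementation.

/* Activity-to-power sketch: per-unit dynamic energy and leakage are assumed,
 * illustrative values; real numbers come from the standard-cell and memory
 * compiler characterization described in the text. */
#include <stdio.h>

#define N_UNIT_TYPES 3
enum { U_ADDER, U_MULT, U_SRAM_PORT };

static const double dyn_energy_pj[N_UNIT_TYPES] = { 0.5, 3.0, 6.0 };
static const double leak_mw[N_UNIT_TYPES]       = { 0.01, 0.05, 0.20 };

static double avg_power_mw(int activity[][N_UNIT_TYPES], int n_cycles,
                           int n_units[N_UNIT_TYPES], double clk_ns)
{
    double total_energy_pj = 0.0;
    for (int c = 0; c < n_cycles; c++)
        for (int u = 0; u < N_UNIT_TYPES; u++)
            total_energy_pj += activity[c][u] * dyn_energy_pj[u];

    /* Leakage is paid by every instantiated unit on every cycle. */
    double leak_total_mw = 0.0;
    for (int u = 0; u < N_UNIT_TYPES; u++)
        leak_total_mw += n_units[u] * leak_mw[u];

    double runtime_ns   = n_cycles * clk_ns;
    double dyn_power_mw = total_energy_pj / runtime_ns;   /* pJ/ns == mW */
    return dyn_power_mw + leak_total_mw;
}

int main(void)
{
    /* Three cycles of activity: counts of active adders, multipliers, ports. */
    int activity[3][N_UNIT_TYPES] = { {2, 0, 4}, {2, 1, 0}, {0, 0, 2} };
    int n_units[N_UNIT_TYPES]     = { 2, 1, 4 };
    printf("average power: %.3f mW\n", avg_power_mw(activity, 3, n_units, 2.0));
    return 0;
}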

Figure 7. Realistic cycle-level functional-unit and memory activity of a fast Fourier transform (FFT) accelerator. This cycle-level activity is also an input to the power model to represent the dynamic activity of accelerators.

Aladdin validation
Figure 8 outlines the methodology used to validate Aladdin. To generate Verilog, we either hand-code RTL or use Xilinx's Vivado HLS tool. The RTL design flow is an iterative process and requires extensive tuning of both RTL and C code. The power and area estimates from Aladdin are compared against Verilog synthesized by Design Compiler using commercial 40-nm standard cells. Aladdin's performance model is validated against ModelSim Verilog simulations. The Switching Activity Interchange Format (SAIF) activity file generated from ModelSim is fed to Design Compiler to capture the switching activity at the gate level.

Figure 9 shows that Aladdin accurately models performance, power, and area compared against RTL implementations across all of the presented benchmarks, with average errors of 0.9, 4.9, and 6.5 percent, respectively. For each SHOC workload, we validated six points on the power-performance Pareto frontier. The SHOC validation results show that Aladdin accurately models entire design spaces, whereas for single accelerator designs, Aladdin is not subject to HLS shortcomings and can accurately model different customization strategies.

Aladdin also enables rapid design space exploration of accelerators. Table 3 quantifies the differences in algorithm-to-solution time to explore a design space of the FFT benchmark with 36 points. Compared to traditional RTL flows, Aladdin skips the time-consuming RTL generation, synthesis, and simulation process. On average, it takes 87 minutes to generate a single design using the RTL flow but only 1 minute for Aladdin, including both Aladdin's optimization phase (50 seconds) and realization phase (12 seconds). However, because Aladdin needs to perform the optimization phase only once for each algorithm, this optimization time can be amortized across large design spaces. Consequently, it takes only 7 minutes to enumerate the full design space of FFT with Aladdin, compared to 52 hours with the RTL flow.

Embracing the era of specialization
Customized architectures comprising CPUs, GPUs, and accelerators are already seen in mobile systems and are beginning to emerge in servers and desktops. However, current SoC designs that pull together hard IP blocks into a single integrated substrate (simply continuing the multidecade trend of SoC integration) cannot leverage higher-level coordination and optimization between traditional general-purpose cores, accelerators, and shared resources such as cache hierarchies. As the industry dives further into this era of specialization, there is a clear need for a design methodology like Aladdin that facilitates broad design space exploration of next-generation heterogeneous architectures.

To illustrate, we consider a heterogeneous system comprising a general-purpose core, a GEMM accelerator, and a shared L2 cache that can be accessed by both the core and the accelerator. Figure 10a shows the accelerator design space without memory contention from the general-purpose core. We modulate the algorithmic blocking factor and find that a blocking factor of 16 is always better than 32 with respect to both power and performance.
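For intuition about why the blocking factor matters once the memory system is shared (the contended case revisited after Figure 10), a back-of-envelope sketch of blocked GEMM's traffic to the L2 follows. This is our own simplification, not Aladdin's model: for an N x N GEMM with block size B, roughly 2*N^3/B elements of the input matrices plus N^2 results stream through the cache, so the dominant term shrinks as the block grows.

/* Back-of-envelope L2 traffic estimate for blocked GEMM. */
#include <stdio.h>

static double approx_l2_accesses(double n, double block)
{
    return 2.0 * n * n * n / block + n * n;
}

int main(void)
{
    double n = 1024.0;
    printf("block 16: %.2e accesses\n", approx_l2_accesses(n, 16.0));
    printf("block 32: %.2e accesses\n", approx_l2_accesses(n, 32.0));
    /* The larger block roughly halves the dominant term, which is why a
     * blocking factor of 32 suffers less when the core contends for the L2. */
    return 0;
}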


Figure 8. Aladdin validation flow. Aladdin's estimates are compared against synthesized accelerators using commercial CAD tools.

[Figure 9 plots execution time, power, and area for Aladdin and the RTL flow across the MD, FFT, SORT, SCAN, REDUCTION, NPU, HASH, HARP, STENCIL, GEMM, and TRIAD accelerators.]

Figure 9. Aladdin validation results against RTL implementations: (a) performance, (b) power, and (c) area validation. Aladdin accurately models performance, power, and area across various benchmarks.

Table 3. Algorithm-to-solution time per design.

Design stage                    Hand-coded RTL       HLS           Aladdin
Programming effort              High                 Medium        N/A
RTL generation                  Designer dependent   37 minutes    N/A
RTL simulation time                    5 minutes                   N/A
RTL synthesis time                     45 minutes                  N/A
Time to solution per design            87 minutes                  1 minute
Time to solution (36 designs)          52 hours                    7 minutes
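The arithmetic behind Table 3's totals is worth spelling out; the sketch below is our reconstruction from the per-stage times reported in the text, and the small gap between its Aladdin total and the reported 7 minutes is presumably per-design variation in realization time.

/* Rough time-to-solution arithmetic for a 36-point FFT design space. */
#include <stdio.h>

int main(void)
{
    /* RTL flow: HLS generation + simulation + synthesis per design. */
    double rtl_min_per_design = 37.0 + 5.0 + 45.0;                 /* = 87 min */
    double rtl_hours_36       = 36.0 * rtl_min_per_design / 60.0;  /* ~52 h    */

    /* Aladdin: optimization once per algorithm, realization per design. */
    double aladdin_min_36 = (50.0 + 36.0 * 12.0) / 60.0;           /* ~8 min   */

    printf("RTL flow: %.0f min/design, %.1f hours for 36 designs\n",
           rtl_min_per_design, rtl_hours_36);
    printf("Aladdin:  roughly %.0f minutes for 36 designs\n", aladdin_min_36);
    return 0;
}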


[Figure 10 plots accelerator power (mW) against execution time (million cycles) for blocking factors of 16 and 32, (a) without and (b) with memory contention.]

Figure 10. GEMM design space: (a) without contention and (b) with contention. A blocking factor of 16 outperforms a blocking factor of 32 when no memory contention is present, but the opposite occurs when there is memory contention.

However, when both the accelerator and the core actively access the shared cache, as shown in Figure 10b, we see that blocking factor 32 offers better power-performance tradeoffs than blocking factor 16. With a larger blocking factor, the accelerator requires fewer references to the matrices in total and, thus, makes fewer data requests to the L2 cache. If designers had picked a blocking factor of 16 instead of 32 without considering the memory contention, the accelerator could suffer up to a 3x performance degradation in a contended system. Aladdin helps designers find and avoid such design decisions by evaluating system-wide accelerator design tradeoffs, which is not possible with conventional RTL-based design flows.

Moving forward, the era of specialization poses unique opportunities and challenges for computer architects. Despite the energy-efficiency benefit, accelerators lack flexibility and programmability, especially compared with general-purpose processors. For example, virtual memory and cache coherency are well-known mechanisms to ease programmability in traditional systems but are not commonly applied in today's accelerators. Instead, accelerators rely heavily on software-managed scratchpad memory to provide fixed-latency memory access and on DMA to communicate data back and forth, which leads to significant chip resource and software engineering cost. To redesign today's SoCs to support programmability, architects can use Aladdin to evaluate the cost and benefit of applying different mechanisms to accelerators without worrying about low-level implementation details.

Aladdin is publicly available, and we have integrated Aladdin with the gem5 simulator,20 one of the most widely used architectural simulators in the community. Aladdin, integrated with gem5, lets researchers model an SoC-like environment (for example, Figure 11). Tools like Aladdin will transform the process of architectural design for heterogeneous systems, unlocking new opportunities for wide adoption of accelerator-rich architectures.

Figure 11. SoC simulation infrastructure with Aladdin. Aladdin can be integrated with other architectural simulators, such as gem5, GPGPU-Sim, and CACTI/Orion 2, to model future heterogeneous SoCs.

Acknowledgments
This work was partially supported by C-FAR, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA; the National Science Foundation (NSF) Expeditions in Computing Award no. CCF-0926148; DARPA under contract no. HR0011-13-C-0022; and a Google Faculty Research Award. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.

References
1. H. Esmaeilzadeh et al., "Dark Silicon and the End of Multicore Scaling," Proc. 38th Ann. Int'l Symp. Computer Architecture, 2011, pp. 365–376.
2. K.T. Lim et al., "Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached," Proc. 40th Ann. Int'l Symp. Computer Architecture, 2013, pp. 36–47.
3. L. Wu et al., "Navigating Big Data with High-Throughput, Energy-Efficient Data Partitioning," Proc. 40th Ann. Int'l Symp. Computer Architecture, 2013, pp. 249–260.
4. H. Esmaeilzadeh et al., "Neural Acceleration for General-Purpose Approximate Programs," Proc. 45th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2012, pp. 449–460.
5. A. Danalis et al., "The Scalable Heterogeneous Computing (SHOC) Benchmark Suite," Proc. 3rd Workshop on General-Purpose Computation on Graphics Processing Units, 2010, pp. 63–74.
6. System Drivers, ITRS, 2013; www.itrs.net/ITRS%201999-2014%20Mtgs,%20Presentations%20&%20Links/2013ITRS/2013Chapters/2013SysDriversSummary.pdf.
7. C.F. Fajardo et al., "Buffer-Integrated-Cache: A Cost-Effective SRAM Architecture for Handheld and Embedded Platforms," Proc. 48th Design Automation Conf., 2011, pp. 966–971.
8. M. Lavasani, H. Angepat, and D. Chiou, "An FPGA-Based In-Line Accelerator for Memcached," IEEE Computer Architecture Letters, 2013, pp. 57–60.
9. R. Sampson et al., "Sonic Millip3De: A Massively Parallel 3D-Stacked Accelerator for 3D Ultrasound," Proc. IEEE 19th Int'l Symp. High Performance Computer Architecture, 2013, pp. 318–329.
10. E.S. Chung, J.D. Davis, and J. Lee, "LINQits: Big Data on Little Clients," Proc. 40th Int'l Symp. Computer Architecture, 2013, pp. 261–272.
11. W. Qadeer et al., "Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing," Proc. 40th Ann. Int'l Symp. Computer Architecture, 2013, pp. 24–35.
12. G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations," Proc. 15th Conf. Architectural Support for Programming Languages and Operating Systems, 2010, pp. 205–218.
13. J. Cong, K. Gururaj, and G. Han, "Synthesis of Reconfigurable High-Performance Multicore Systems," Proc. ACM/SIGDA Int'l Symp. Field Programmable Gate Arrays, 2009, pp. 201–208.
14. H.-Y. Liu, M. Petracca, and L.P. Carloni, "Compositional System-Level Design Exploration with Planning of High-Level Synthesis," Proc. Conf. Design, Automation and Test in Europe, 2012, pp. 641–646.
15. B. Reagen et al., "Quantifying Acceleration: Power/Performance Trade-offs of Application Kernels in Hardware," Proc. Int'l Symp. Low Power Electronics and Design, 2013, pp. 395–400.
16. D.W. Wall, "Limits of Instruction-Level Parallelism," Proc. 4th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 1991, pp. 176–188.
17. D. Jeon et al., "Kismet: Parallel Speedup Estimates for Serial Programs," Proc. ACM Int'l Conf. Object Oriented Programming Systems Languages and Applications, 2011, pp. 519–536.
18. Y.S. Shao and D. Brooks, "ISA-Independent Workload Characterization and Its Implications for Specialized Architectures," Proc. IEEE Int'l Symp. Performance Analysis of Systems and Software, 2013, pp. 245–255.
19. Y.S. Shao et al., "Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures," Proc. ACM/IEEE 41st Int'l Symp. Computer Architecture, 2014, pp. 97–108.
20. N.L. Binkert et al., "The Gem5 Simulator," ACM SIGARCH Computer Architecture News, vol. 39, no. 2, 2011, pp. 1–7.

Yakun Sophia Shao is a PhD candidate in computer science at Harvard University. Her research focuses on computer architecture, particularly on modeling and design for specialized architectures. Shao has an MS in computer science from Harvard University. Contact her at [email protected].

Brandon Reagen is a PhD student in computer science at Harvard University. His research interests include methods for low-power, high-performance design leveraging specialized hardware, accelerator-centric systems, and computer architecture/VLSI codesign. Reagen has an MS in computer science from Harvard University. Contact him at [email protected].

Gu-Yeon Wei is a Gordon McKay Professor of Electrical Engineering and Computer Science at Harvard University. His research interests include mixed-signal integrated circuits, computer architecture, and runtime software, looking for cross-layer opportunities to develop energy-efficient systems. Wei has a PhD in electrical engineering from Stanford University. Contact him at [email protected].

David Brooks is the Haley Family Professor of Computer Science at Harvard University. His research interests include architectural and software approaches to address power, thermal, and reliability issues for embedded and high-performance computing systems. Brooks has a PhD in electrical engineering from Princeton University. Contact him at [email protected].
