Programmable Hardware Acceleration

Vinay Gangadhar

PhD Final Examination Thursday, Nov 16th, 2017

Advisor: Karu Sankaralingam Committee: Mark Hill, Mikko Lipasti, David Wood, Dimitris Papailiopoulos


Computing Trends

• Device scaling slowdown (or dead)
• Dark silicon problem
• Emerging applications driving computing with new demands

Era of Specialization

Traditional multicore → domain-specific acceleration: deep specialization for an application domain (reg. expr., AI, neural approx., graph traversal, scan, sort, stencil, image processing, deep neural).
Fixed-function accelerators for a specific domain: Domain Specific Accelerators (DSAs)
+ High efficiency: 10 – 1000x Performance/Power or Performance/Area
Examples: Myriad VPU, DGX-1 AI accelerator & NVDLA architecture, Google TPU

Caveats of Domain-Specific Accelerators (DSAs)

- Minimally programmable / not re-configurable
- Obsoletion-prone (e.g., H.265 → H.266)
- Domains targeting each device type (server, mobile, IoT)
- Architecture, design, verification and fabrication cost
- A multi-DSA chip for "N" application domains → area- and cost-inefficient
Source: Malitel Consulting

The Universal Accelerator Dream...

100+ accelerators (deep neural, image processing, automated driving, compression, regex matching, query processing, ...) replaced by one programmable accelerator fabric with a standard programming and threading interface: a generic programmable hardware accelerator matching the efficiency of Domain Specific Accelerators (DSAs), with an efficient hardware-software interface. (Source: Malitel Consulting)

Specialization Paradigms

Research Overview

Domain-Specific Accelerators (DSAs): reg. expr., AI, neural approx., graph traversal, scan, stencil, image processing, deep neural, sort

Commonality in DSAs?

Specialization Principles

Micro-Architectural Mechanisms

Programmable Hardware Accelerator Architecture

Research Overview
[Figure: generality vs. efficiency spectrum – GPP, SIMD, DSP, GPGPU, FPGA, DSA, ASIC]

Specialization principles (energy-efficient computing) + programmability / re-configurability features
→ a general set of micro-architectural mechanisms + a flexible hardware-software programming interface
→ a programmable / re-configurable specialized hardware accelerator architecture

Efficiency close to DSAs/ASICs | retains programmability | trivial adaptation to new algorithms/applications

Dissertation Research Goal: Programmable Hardware Acceleration
1. Explore the commonality in the way DSAs specialize – specialization principles

2. General Mechanisms for the design of a generic programmable hardware accelerator matching the efficiency of DSAs

3. A programmable/re-configurable accelerator architecture with an efficient accelerator hardware-software (ISA) interface

4. Easy adaptation of new acceleratable algorithms in a domain-agnostic way

Dissertation Statement: Programmable Hardware Acceleration

A programmable hardware accelerator nearing the efficiency of a domain-specific accelerator (DSA) is feasible to build by:

• Identifying the common principles of architectural specialization

• Applying a general set of micro-architectural mechanisms for the identified principles

• Having an efficient hardware-software interface to be able to express any typical accelerator application

Contributions

Modeling Programmable Hardware Acceleration:
• Exploring the common principles of architectural specialization
• Modeling a general set of mechanisms to exploit the specialization principles – the GenAccel Model
• Quantitative evaluation of the GenAccel Model with four DSAs
• System-level tradeoffs of the GenAccel Model vs. DSAs

Architectural Realization with Stream-Dataflow Acceleration:
• Stream-Dataflow programmable accelerator architecture, with programming abstractions, an execution model, and an ISA interface
• Detailed micro-architecture with an efficient architectural realization of the stream-dataflow accelerator – Softbrain
• Quantitative evaluation of Softbrain against state-of-the-art DSA solutions

Modeling Programmable Hardware Acceleration*

*Published in HPCA 2016, IEEE Micro Top Picks 2017

Outline

• Principles of architectural specialization (concurrency, computation, communication, data reuse, coordination)
   Embodiment of the principles in DSAs
• Modeling mechanisms exploiting the specialization principles for a generic programmable accelerator (GenAccel Model)
• Evaluation of GenAccel with 4 DSAs (performance, power & area)
• System-level energy-efficiency tradeoffs of GenAccel and DSAs

Key Insight: Commonality in DSAs' Specialization Principles

[Figure: a host system (cores + cache) surrounded by DSAs – reg. expr., AI, neural approx., graph traversal, scan, linear algebra, deep neural, stencil, sort]

Most DSAs employ 5 common specialization principles:
Concurrency | Computation | Communication | Data Reuse | Coordination

Principles of Architectural Specialization
• Match hardware concurrency to that of the algorithm

• Problem-specific computation units
• Explicit communication, as opposed to implicit communication
• Customized structures for data reuse
• Hardware coordination using simple low-power control logic

Concurrency | Computation | Communication | Data Reuse | Coordination

Principles in DSAs

How do DSAs embody these principles in a domain-specific way?

[Table: the 5 specialization principles (concurrency, computation, communication, data reuse, coordination) as embodied by four DSAs – NPU (neural approx.), DianNao (deep neural), Convolution Engine (stencil), Q100 (database streaming)]

Principles in DSAs: NPU – Neural Processing Unit

[Figure: high-level organization – bus scheduler, input/output FIFOs, a grid of processing engines (PEs); each PE contains a weight buffer, FIFO, mult-add units, accumulator register, controller, sigmoid unit, and output buffer]

Most DSAs employ five common specialization principles:
Concurrency | Computation | Communication | Data Reuse | Coordination

Outline

• Principles of architectural specialization (concurrency, computation, communication, data reuse, coordination)
   Embodiment of the principles in DSAs
• Modeling mechanisms exploiting the specialization principles for a generic programmable accelerator (GenAccel Model)
• Evaluation of GenAccel with 4 DSAs (performance, power & area)
• System-level energy-efficiency tradeoffs of GenAccel and DSAs

Implementation of the Principles in a General Way: composition of simple micro-architectural mechanisms

• Concurrency: multiple tiles (tile = hardware for a coarse-grained unit of work)
• Computation: special FUs in a spatial fabric
• Communication: dataflow + spatial fabric
• Data reuse: scratchpad (SRAMs)
• Coordination: a low-power simple core

Modeling the Generic Programmable Accelerator Design

[Figure: GenAccel design – memory, DMA and scratchpad (with D$) feeding an input interface, a spatial fabric of FUs and switches (S), an output interface, and a low-power core]

Low-power core | spatial fabric | scratchpad | DMA → GenAccel Model

Instantiating GenAccel: a programmable hardware template for specialization

Provisioned for a single application domain – GA_N (neural approx.), GA_C (stencil), GA_D (deep neural), GA_Q (database) – or for multiple application domains – GA_B (balanced).
GenAccel usage, design-point selection & synthesis: more details in backup. (*Figures not to scale)

Outline

• Principles of architectural specialization (concurrency, computation, communication, data reuse, coordination)
   Embodiment of the principles in DSAs
• Modeling mechanisms exploiting the specialization principles for a generic programmable accelerator (GenAccel Model)
• Evaluation of GenAccel with 4 DSAs (performance, power & area)
• System-level energy-efficiency tradeoffs of GenAccel and DSAs

Methodology
• Modeling framework for GenAccel
   Performance: trace-driven simulator + application-specific modeling
   Power & area: synthesized modules, CACTI and McPAT
• Compared against four DSAs (published performance, area & power)
• Four parameterized GenAccels – GA_N (1 unit, vs. NPU), GA_C (1 unit, vs. Conv. Engine), GA_D (8 units, vs. DianNao), GA_Q (4 units, vs. Q100) – plus one combined balanced GenAccel, GA_B (8 units)
• Provisioned to match the performance of the DSAs
   Other tradeoffs possible (power, area, energy, etc.)

Performance Analysis: GenAccel vs. DSAs

[Charts: speedups of GA_N vs. NPU (1 unit), GA_C vs. Conv. Engine (1 unit), GA_D vs. DianNao (8 units), GA_Q vs. Q100 (4 units); bars stack LP core + SFUs (+comp.), multi-tile (+concur.), SIMD (+concur.), spatial (+comm.), GA (+reuse), with the DSA geomean for reference]

Performance: GenAccel is able to match the DSAs.
Main contributor to speedup: concurrency.
Baseline – 4-wide OOO core (Intel 3770K)

Domain-Provisioned GenAccels

How do GenAccel area & power compare to a single DSA?

Domain-Provisioned GenAccels: Area and Power Analysis

[Charts: normalized area (0.5x, 1.2x, 1.7x, 3.8x) and normalized power (0.6x, 2x, 3.6x, 4.1x) of each domain-provisioned GenAccel relative to its DSA]

Domain-provisioned GenAccel overheads: 1x – 4x worse in area, 2x – 4x worse in power.
*Detailed area breakdown in backup

Balanced GenAccel Design

What are the area and power of the balanced GenAccel design when multiple domains are mapped*?

* Still provisioned to match the performance of each DSA

GenAccel Balanced Design: Area-Power Analysis

[Charts: area and power of the balanced GenAccel, normalized to the combined DSAs]

Balanced GenAccel design overheads: more area-efficient than multiple DSAs (0.6x area); 2.5x worse in power than multiple DSAs.

Outline
• Introduction

• Principles of architectural specialization (concurrency, computation, communication, data reuse, coordination)
   Embodiment of the principles in DSAs
• Modeling mechanisms exploiting the specialization principles for a generic programmable accelerator (GenAccel Model)
• Evaluation of GenAccel with 4 DSAs (performance, power & area)
• System-level energy-efficiency tradeoffs of GenAccel and DSAs

Conclusion – Modeling Programmable Hardware Acceleration
• 5 common principles for architectural specialization

• Modeled the mechanisms embodying the specialization principles – Design of a Generic Programmable accelerator (GenAccel Model)

• The GenAccel model is competitive with DSA performance, with overheads of only up to 4x in area and power

• Power overhead inconsequential when system-level energy tradeoffs considered

• GenAccel Model as a baseline for future accelerator research

Dissertation Research Goal: Programmable Hardware Acceleration
1. Explore the commonality in the way DSAs specialize – specialization principles ✓

2. General mechanisms for the design of a generic programmable hardware accelerator matching the efficiency of DSAs ✓
3. A programmable/re-configurable accelerator architecture with an efficient accelerator hardware-software (ISA) interface

4. Easy adaptation of new acceleratable algorithms in a domain-agnostic way

Contributions

Modeling Programmable Hardware Acceleration:
• Exploring the common principles of architectural specialization
• Modeling a general set of mechanisms to exploit the specialization principles – the GenAccel Model
• Quantitative evaluation of the GenAccel Model with four DSAs
• System-level tradeoffs of the GenAccel Model vs. DSAs

Architectural Realization with Stream-Dataflow Acceleration:
• Stream-Dataflow programmable accelerator architecture, with programming abstractions, an execution model, and an ISA interface
• Detailed micro-architecture with an efficient architectural realization of the stream-dataflow accelerator – Softbrain
• Quantitative evaluation of Softbrain against state-of-the-art DSA solutions

Stream-Dataflow Acceleration*

*Published in ISCA 2017, Submitted to IEEE Micro Top-Picks 2018

Architectural Realization of Programmable Hardware Acceleration
• Workload characteristics:
   Regular streaming memory accesses with straightforward patterns
   Computationally intensive, with long execution phases
   Ample data-level parallelism
   Small instruction footprints with simple control flow
• An accelerator architecture to accelerate data-streaming applications:
   Instantiates the hardware primitives from the GenAccel model, exploiting all five specialization principles
   A stream-dataflow high-performance compute substrate with dataflow and stream specialization components
   Exposes a novel stream-dataflow ISA interface for programming the accelerator

Stream-Dataflow Acceleration

Exploit common accelerator application behavior:
• Stream-dataflow execution model – abstracts typical accelerator computation phases (dataflow computation, memory streams, local-storage reuse streams, recurrence streams)
• Stream-dataflow ISA encoding and hardware-software interface – exposes the parallelism available in these phases (stream patterns and interface, dataflow graph)
• Synchronization primitives – barrier commands to facilitate data coordination and data consistency

Stream-Dataflow Acceleration: Stream-Dataflow Model → Programmable Stream-Dataflow Accelerator

[Figure: memory/cache hierarchy and memory interface carrying input/output data streams; local storage (programmable re-configurable scratchpad) holding reused and recurring streams; a computation fabric operating on data streams]

• Data-parallel program kernels stream data from memory
• The dataflow computation fabric (DFG) operates on the data streams iteratively
• Computed output streams are stored back to memory

Outline
• Overview

• Stream-Dataflow Execution Model

• Hardware-Software (ISA) Interface for Programmable Hardware Accelerator

• Stream-Dataflow Accelerator Architecture and Example program

[Figure: Softbrain micro-architecture – stream dispatcher, scratchpad stream engines (SSE), memory stream engines (MSE), recurrence stream engine (RSE) with indirect load/store support, CGRA with input/output vector ports, and a RISCV Rocket core with I$/D$]
• Stream-Dataflow Micro-Architecture – Softbrain
• Evaluation and Results

Stream-Dataflow Execution Model

Architectural Abstractions for the Stream-Dataflow Model
• Computation abstraction – a dataflow graph (DFG) with input/output vector ports (of configurable width)
• Data abstraction – streams of data fetched from memory and stored back to memory
• Reuse abstraction – streams of data fetched once from memory, stored in local storage (programmable scratchpad), and reused again
• Communication abstraction – stream data-movement commands and barriers; dataflow-based firing of data from vector ports
• Each stream is specified by a source (memory address, local-storage address, or DFG port), a destination, and an access pattern; a sketch of such a stream descriptor follows
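To make the stream abstraction concrete, here is a minimal C sketch of what one stream command's descriptor could carry. The struct and field names are illustrative assumptions, not Softbrain's actual encoding:

#include <stdint.h>

// Hypothetical descriptor for one stream-dataflow command (illustrative only).
typedef enum { LOC_MEMORY, LOC_SCRATCHPAD, LOC_DFG_PORT } sd_location_t;

typedef struct {
    sd_location_t source;       // memory, scratchpad, or DFG port
    sd_location_t destination;  // memory, scratchpad, or DFG port
    uint64_t start_addr;        // start address, or port number for DFG ports
    uint32_t access_size;       // bytes accessed contiguously per stride
    uint32_t stride_size;       // bytes between starts of consecutive accesses
    uint32_t num_strides;       // number of strides in the stream
} sd_stream_cmd_t;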

Programmer Abstractions for the Stream-Dataflow Model
• Separates data movement from computation
• Achieves high concurrency through the execution of coarse-grained data streams alongside dataflow computation

[Timeline: read data → compute → write data, overlapped in a pipeline; a read barrier and an all-barrier enforce ordering]

Outline
• Overview

• Stream-Dataflow Execution Model

• Hardware-Software (ISA) Interface for Programmable Hardware Accelerator

• Stream-Dataflow Accelerator Architecture and Example program

[Figure: Softbrain micro-architecture (as above)]
• Stream-Dataflow Micro-Architecture – Softbrain
• Evaluation and Results

Traditional Programmable Architecture vs. Hardware Accelerator (DSA)

[Figure: three points on the spectrum – general-purpose hardware (programs + compiler, general H/W-S/W interface, general ISA), re-configurable hardware (programs + language/compiler, tiny H/W-S/W interface), and domain-specific hardware running "specialized" programs with H/W parameters]
DSAs achieve 10-1000x performance/power or performance/area, but completely lose generality/programmability.
Can specialized programs be adapted in a domain-agnostic way with this interface?

Stream-Dataflow ISA Interface

Goal: express any data-stream pattern of accelerator applications using a simple, flexible, and yet efficient encoding scheme.

Stream-Dataflow ISA
• Set-up interface: SD_Config – a configuration data stream for the dataflow computation fabric (CGRA)

• Control Interface: SD_Barrier_Scratch_Rd, SD_Barrier_Scratch_Wr, SD_Barrier_All

• Stream interface: SD_[source]_[dest]
   Source/dest parameters: address (memory or local storage), DFG port number
   Pattern parameters: access_size, stride_size, num_strides

[Figure: streams move data between memory, local storage (scratchpad), and the compute fabric]
A sketch of commands against this interface follows.
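A minimal sketch of the three interface classes in use, in the SD_* style of the later slides; the exact argument order here is an assumption, not the verified macro signatures:

// Illustrative stream-dataflow command sequence (argument order assumed).
SD_CONFIG(cfg, sizeof(cfg));        // set-up: configure the CGRA
SD_MEM_PORT(a, 8, 8, N, Port_A);    // stream: memory -> DFG input port
SD_PORT_MEM(Port_C, 8, 8, N, c);    // stream: DFG output port -> memory
SD_BARRIER_ALL();                   // control: wait for all streams to complete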

Stream-Dataflow Programming Interface
• A stream is defined by: source (memory, local storage, or DFG port), start address, destination (memory, local storage, or DFG port), stride, access size, and number of strides
• Supported patterns: linear, strided, overlapped, repeating, 2D direct, 2D indirect, and offset-indirect streams
• Example: mem_addr = 0xA, access_size = 4, memory_stride = 8, num_strides = 2 (sketched below)
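A small sketch of the addresses the example pattern generates, assuming byte addressing; fetch() is a hypothetical stand-in for the hardware's request path:

#include <stdint.h>
extern void fetch(uint64_t addr);   // stand-in for the hardware request logic

void example_pattern(void) {
    // mem_addr = 0xA, access_size = 4, memory_stride = 8, num_strides = 2
    for (int s = 0; s < 2; ++s)         // two strides
        for (int b = 0; b < 4; ++b)     // four contiguous bytes each
            fetch(0xA + s * 8 + b);     // bytes 0xA..0xD, then 0x12..0x15
}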

Stream-Dataflow ISA Encoding

Example loop:
  for i = 1 to 100:
    ... = a[2*i];
    ... = b[i];
    c[b[i]] = ...

Stream encoding: one stream per access pattern; the indirect store is encoded as IND<[prev], c, 100>, taking its addresses from the previously loaded b stream.
Dataflow: the computation (vector A[0:2] × vector B[0:2], reduced through + into C) is specified as a dataflow graph in a simple dataflow language (DSL).
A plausible full encoding for this loop is sketched below.
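One plausible mapping of the loop's three access patterns onto stream encodings, written in the slide's IND<> notation; the MEM<> form is an assumption by analogy, not a documented encoding:

// for i = 1 to 100:  ... = a[2*i];  ... = b[i];  c[b[i]] = ...
MEM<a, access=1 elem, stride=2 elems, num_strides=100> -> Port_A   // a[2*i]
MEM<b, access=1 elem, stride=1 elem,  num_strides=100> -> Port_B   // b[i]
IND<[prev], c, 100> <- Port_C   // indirect store c[b[i]], offsets from the b stream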

Example Pseudo-Code: Dot Product

Original program:
  for (int i = 0 to N) { c += a[i] * b[i]; }

Stream ISA encoding:
  Put a[0:N] -> P1
  Put b[0:N] -> P2
  Recur P3, N - 1
  Get P3 -> c

Dataflow encoding: P1 × P2 feed +, which produces P3.
The same program in SD_* macro form is sketched below.
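The dot product expressed with the SD_* macros used on the classifier slides; a sketch, with assumed port names and stride/access arguments:

// Dot product with stream-dataflow commands (illustrative sketch).
SD_CONFIG(dotp_cfg, sizeof(dotp_cfg));    // load the multiply-accumulate DFG
SD_MEM_PORT(a, 8, 8, N, Port_1);          // Put a[0:N] -> P1
SD_MEM_PORT(b, 8, 8, N, Port_2);          // Put b[0:N] -> P2
SD_PORT_PORT(Port_3, N - 1, Port_acc);    // Recur P3, N-1 times
SD_PORT_MEM(Port_3, 8, 8, 1, &c);         // Get P3 -> c
SD_BARRIER_ALL();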

New ISA Class for Programmable Hardware Acceleration

Stream-Dataflow ISA:
• Expresses long memory streams and access patterns efficiently – the address-generation hardware becomes much simpler
• Decouples the access and execute phases
• Reduces instruction overheads; dependences are explicitly encoded
• Reduces cache requests and pressure by encoding alias-free memory requests – implicit coalescing for concurrent memory accesses
• Separates the architecture abstractions from the implementation details – separates the memory-interface primitives and their interaction

A new ISA paradigm for acceleration:
• Needs to embody the common accelerator principles and execution model
• Needs to represent programs without requiring complex micro-architecture techniques for performance – VLIW, SIMT and SIMD have their own drawbacks for accelerators
• A micro-architecture for C-programmable ASICs – enables a 'hardened' ASIC compute-substrate implementation

Outline
• Overview

• Stream-Dataflow Execution Model

• Hardware-Software (ISA) Interface for Programmable Hardware Accelerator

• Stream-Dataflow Accelerator Architecture and Example program

[Figure: Softbrain micro-architecture (as above)]
• Stream-Dataflow Micro-Architecture – Softbrain
• Evaluation and Results

Requirements for a Stream-Dataflow Accelerator Architecture
1. Should employ the common specialization principles and hardware mechanisms explored in the GenAccel model (*IEEE Micro Top Picks 2017: Domain Specialization is Generally Unnecessary for Accelerators)

Concurrency – multiple tiles | Computation – problem-specific FUs | Communication – spatial fabric (CGRA) | Data reuse – scratchpad | Coordination – low-power core

2. Programmability features without the inefficiencies of existing data-parallel architectures (less power, area and control overhead)

Inefficiencies in Data-Parallel Architectures

[Table: SIMD & short-vector, SIMT, vector-thread, and spatial-dataflow architectures compared on addressing & communication, resource utilization, and irregular execution – e.g., unaligned and redundant address generation, memory coalescing, mask & merge instructions, thread scheduling, large multi-ported register files and cache pressure, core issue width, warp divergence, and re-convergence support for diverged vector threads]

Takeaways: vector architectures – efficient parallel memory interface; spatial architectures – efficient parallel execute phases; application/domain-specific architectures – efficient dispatch & datapath for pipelined concurrent execution.

Stream-Dataflow Accelerator Architecture Opportunities
• Reduce address-generation & duplication overheads

Irregular • Inefficient general • Warp divergence • Re-convergence - execution pipeline hardware support for diverged support vector threads 11/16/2017 Dissertation Talk 50 Stream-Dataflow Accelerator Architecture Opportunities • Reduce address generation & duplication overheads

• Distributed control to boost pipelined concurrent execution Stream Dataflow

• High utilization of execution resources w/o massive multi- Scratchpad Command Core Command threading, reducing cache pressure or using multi- Vector Interface

ported scratchpad Coarse-Grained Reconfigurable Arch.

• Decouple access and execute phases of programs Vector Interface Memory Interface • Simplest hardware fallback mechanism for irregular memory access support

• Able to be easily customizable/configurable for new application domain

11/16/2017 Dissertation Talk 51

Stream-Dataflow Accelerator Architecture

[Figure: CGRA spatial fabric of FUs and switches (S) with input and output vector port interfaces; a programmable scratchpad and its stream engine (64b); a memory stream engine to/from the memory hierarchy (512b); a recurrence stream engine; and an indirect vector port interface]

Dataflow:
• Coarse-grained reconfigurable architecture (CGRA) for data-parallel execution
• Direct vector port interfaces into and out of the CGRA for vector dataflow execution

Stream interface:
• Memory stream engine for streaming data in and out of the accelerator
• Programmable scratchpad and supporting stream engine to facilitate data locality and data reuse
• Recurrence stream engine to support recurrent data streams
• Indirect vector port interface for streaming addresses (indirect loads/stores)

Stream-Dataflow Accelerator Architecture (continued)

[Figure: the same fabric fronted by a stream command dispatcher and a tiny in-order core with I$ and D$]

• Stream commands are issued by the core through a command queue
• A stream dispatcher dispatches the coarse-grained stream commands
• The stream command interface is exposed to a general-purpose core
• Tiny in-order core → a non-intrusive, programmable accelerator design

Stream ISA encoding (dot product):
  Put a[0:N] -> P1
  Put b[0:N] -> P2
  Recur P3, N - 1
  Get P3 -> c

Stream-Dataflow Accelerator Architecture Integration

Multi-Tile Stream-Dataflow Accelerator

[Figure: multiple stream-dataflow tiles sharing a memory/cache hierarchy]

• Each tile is connected to a higher-level L2 cache interface
• A simple scheduler is needed to schedule the offloaded stream-dataflow kernels across tiles

Programming the Stream-Dataflow Accelerator
1. Specify the datapath for the CGRA – a simple dataflow language for the DFG

2. Orchestrate the parallel execution of the hardware components – coarse-grained stream commands using the stream interface

[Figure: tiny in-order core issuing stream commands; scratchpad and memory feed the input ports of the CGRA (execution resources); output ports write results back]

Classifier Layer (Original)

[Figure: input neurons (Ni) × synapses (Nn x Ni) summed into output neurons (Nn)]

#define Ni 8
#define Nn 8
// synapses and neurons – 2 bytes each
uint16_t synapse[Nn][Ni];
uint16_t neuron_i[Ni];
uint16_t neuron_n[Nn];

for (n = 0; n < Nn; n++) {
  sum = 0;
  for (i = 0; i < Ni; i++) {
    sum += synapse[n][i] * neuron_i[i];
  }
  neuron_n[n] = sigmoid(sum);
}

Dataflow Graph (DFG) for the CGRA: Classifier Kernel
Computation: sum += synapse[n][i] * neuron_i[i];  neuron_n[n] = sigmoid(sum);

DFG (inputs: N, S, acc, do_sig; output: out):
  M = Mul16x4(N, S)
  R = Red16x4(M, acc)
  out = Sig16(R, do_sig)

Ports: N – input neurons (neuron_i), S – synapses (synapse), do_sig – sigmoid predicate, acc – accumulate, out – output neurons (neuron_n).
Compilation + spatial scheduling produce class_cfg, the configuration data for the CGRA.

Stream-Dataflow Program: Classifier Kernel
// Configure the CGRA
SD_CONFIG(class_cfg, sizeof(class_cfg));

// Stream the data from memory to ports
SD_MEM_PORT(synapse, 8, 8, Ni * Nn / 4, Port_S);
SD_MEM_PORT(neuron_i, 8, 8, Ni / 4, Port_N);

for (n = 0; n < Nn / nthreads; n++) {
  // Stream constant values to the constant ports
  SD_CONST(Port_acc, 0, 1);
  SD_CONST(Port_do_sig, 0, Ni - 1);

  // Recur the computed data back for accumulation
  SD_PORT_PORT(Port_out, N - 1, Port_acc);

  // Sigmoid computation; the output neuron is written back
  SD_CONST(Port_do_sig, 1, 1);
  SD_PORT_MEM(Port_out, 2, 2, 1, &neuron_n[n]);
}
SD_BARRIER_ALL();

Performance Considerations
• Goal: fully pipeline the largest dataflow graph
   Increase performance [CGRA instructions / cycle]
   Increase throughput [graph computation instances / cycle]

• Primary bottlenecks → remedies:
   Computations per size of dataflow graph → increase through loop unrolling/vectorization
   General core (for issuing streams) → increase the "length" of streams
   Memory/cache bandwidth → use the scratchpad for data reuse
   Recurrence serialization overhead → increase parallel computations (tiling)

Outline
• Overview

• Stream-Dataflow Execution Model

• Hardware-Software (ISA) Interface for Programmable Hardware Accelerator

• Stream-Dataflow Accelerator Architecture and Example program

[Figure: Softbrain micro-architecture (as above)]
• Stream-Dataflow Micro-Architecture – Softbrain
• Evaluation and Results

Micro-Architecture Design Principles

1. Low-overhead control structures

2. Efficient execution of concurrent stream commands with simple resource dependency tracking

3. No power-hungry or large CAM-like structures

4. Parameterizable design

Micro-Architecture of the Stream-Dataflow Accelerator – Softbrain

Stream-Dispatcher of Softbrain

• Issues the stream commands to stream-engines

• Resource dependency tracking – a simple vector-port-to-stream-engine scoreboard mechanism

• Barriers – Enforces the explicit stream-barriers for data-consistency in scratchpad as well as memory state

• Interfaces to the low-power core using simple queue-based custom accelerator logic

Micro-Architecture of the Stream-Dataflow Accelerator – Softbrain (continued)

Stream-Engines of Softbrain

Memory Stream-Engine (MSE) Scratchpad Stream-Engine (SSE)

• Arbitration of multiple stream command requests

• Responsible for address generation for various data-stream access patterns

• Manages concurrent accesses to vector ports, scratchpad and the cache/memory hierarchy

• Dynamic switching between streams to tolerate L2 cache misses and maintain high-bandwidth memory access

Softbrain Stream-Engine Controller & Request Pipeline

• Responsible for address generation for both direct and indirect data streams
• Priority-based selection among multiple queued data streams
• Direct streams – an affine address generation unit (AGU) generates the memory addresses
• Indirect streams – a non-affine AGU gets addresses and offsets from the indirect vector ports (see the sketch below)
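A minimal C sketch of the two address-generation styles, using the pattern parameters from the ISA slides; issue_request() and pop_indirect_port() are illustrative stand-ins, not real hardware interfaces:

#include <stdint.h>
extern void issue_request(uint64_t addr);
extern uint64_t pop_indirect_port(void);

// Direct (affine) stream: addresses are a linear function of the stride count.
void gen_direct(uint64_t base, uint32_t access_size, uint32_t stride_size,
                uint32_t num_strides, uint32_t elem_size) {
    for (uint32_t s = 0; s < num_strides; ++s)
        for (uint32_t b = 0; b < access_size; b += elem_size)
            issue_request(base + s * stride_size + b);
}

// Indirect stream: per-element offsets arrive through an indirect vector port.
void gen_indirect(uint64_t base, uint32_t num_elems, uint32_t elem_size) {
    for (uint32_t i = 0; i < num_elems; ++i) {
        uint64_t off = pop_indirect_port();     // e.g., the b[i] index stream
        issue_request(base + off * elem_size);  // c[b[i]]
    }
}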

Micro-Architecture Flow of Softbrain

[Figure: command flow – the RISCV Rocket core enqueues SD commands to the stream dispatcher; the dispatcher checks resource status via a vector-port/stream-engine scoreboard and issues streams to the scratchpad (SSE), memory (MSE), and recurrence (RSE, with indirect load/store) stream engines, which move data through the input/output vector ports of the CGRA and the cache/memory hierarchy]

Outline
• Overview

• Stream-Dataflow Execution Model

• Hardware-Software (ISA) Interface for Programmable Hardware Accelerator

• Stream-Dataflow Accelerator Architecture and Example program

[Figure: Softbrain micro-architecture (as above)]
• Stream-Dataflow Micro-Architecture – Softbrain
• Evaluation and Results

Stream-Dataflow Implementation: Softbrain Evaluation
Software stack: stream-dataflow code (C/C++) compiled with RISCV GCC into a RISCV ISA binary; the DFG is compiled by a DFG compiler (ILP solver) into DFG.h and an accelerator config file; both feed a cycle-level Softbrain simulator.

Hardware: a parameterizable Chisel accelerator model plus an accelerator configuration produce Chisel-generated Verilog RTL, synthesized with Synopsys DC.

Evaluation Methodology
• Workloads

 Deep Neural Networks (DNN) – For domain provisioned comparison

 Machsuite Accelerator Workloads – For comparison with application specific accelerators

• Comparison

 Domain Provisioned Softbrain vs. DianNao DSA

 Broadly provisioned Softbrain vs. ASIC design points – Aladdin* generated performance, power and area

• Area and Power of Softbrain

 Synthesized area, power estimates

 CACTI for cache and SRAM estimates

*Sophia Shao et al. – Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures

Domain-Specific Comparison (Softbrain vs. DianNao DSA)

[Chart: speedup relative to OOO4 on DNN workloads, log scale; Softbrain and DianNao track each other closely (bars up to 298 and 191)]

Area-Power Estimates of Domain-Provisioned Softbrain

Components                    | Area (mm²) @ 28nm | Power (mW)
Rocket core (16KB I$ + D$)    | 0.16              | 39.1
Network                       | 0.12              | 31.2
CGRA FUs (5 x 4)              | 0.04              | 24.4
Total CGRA                    | 0.16              | 55.6
5 x stream engines            | 0.02              | 18.3
Scratchpad (4KB)              | 0.10              | 2.6
Vector ports (input & output) | 0.03              |
1 Softbrain unit              | 0.47              | 119.3
8 Softbrain units             | 3.76              | 954.4
DianNao DSA                   | 2.16              | 418.3
Softbrain / DianNao overhead  | 1.74x             | 2.28x

Softbrain vs. DianNao (DNN DSA): performance – able to match; area – 1.74x overhead; power – 2.28x overhead.

Broadly Provisioned Softbrain vs. ASIC: Performance Comparison

[Chart: speedup relative to OOO4 on MachSuite workloads; geomean 2.59 (Softbrain) vs. 2.67 (ASIC)]

Aladdin*-generated ASIC design points, with resources constrained to be within ~15% of Softbrain performance for an iso-performance analysis.
*Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. Sophia Shao et al.

Broadly Provisioned Softbrain vs. ASIC: Area & Power Comparison

[Charts: power efficiency and energy efficiency relative to OOO4 (geomean), and ASIC area relative to Softbrain (geomean)]

Softbrain vs. ASIC designs:
• Performance – able to match
• Power – 1.6x overhead
• Energy – 1.5x overhead
• Area – 8x overhead* (*all 8 ASICs combined are 2.15x more area than one Softbrain)

Conclusion – Stream-Dataflow Acceleration
• Stream-dataflow acceleration:

 Stream-Dataflow Execution Model – Abstracts typical accelerator computation phases using a dataflow graph

 Stream-Dataflow ISA Encoding and Hardware-Software Interface – Exposes parallelism available in these phases

• Stream-Dataflow Accelerator Architecture

 CGRA and vector ports for pipelined vector-dataflow computation

 Highly parallel stream-engines for low-power stream communication

• Stream-Dataflow Prototype & Implementation – Softbrain

 Matches performance of domain provisioned accelerator (DianNao DSA) with ~2x overheads in area and power

 Compared to application specific designs (ASICs), Softbrain has ~2x overheads in power and ~8x in area

Dissertation Research Goal: Programmable Hardware Acceleration
1. Explore the commonality in the way DSAs specialize – specialization principles ✓
2. General mechanisms for the design of a generic programmable hardware accelerator matching the efficiency of DSAs ✓
3. A programmable/re-configurable accelerator architecture with an efficient accelerator hardware-software (ISA) interface ✓
4. Easy adaptation of new acceleratable algorithms in a domain-agnostic way ✓

Conclusion – Programmable Hardware Acceleration
• A new acceleration paradigm in the specialization era: programmable hardware acceleration, breaking the limits of acceleration
• Foundational specialization principles abstracting the acceleration primitives
• Enables instantiation of programmable accelerators in IoT, embedded, and cloud environments to support edge computing
• A new accelerator ISA paradigm for an efficient programmable accelerator hardware implementation
• A good enabler for exploring general-purpose programmable hardware acceleration – getting there!

• Reduces the orders-of-magnitude overheads of programmability and generality compared to ASICs

• Drives future accelerator research and innovation

Future Work
• Multiple DFG executions – a configuration cache for the CGRA to switch between DFGs

• Further distribute the control into the vector ports – dynamic deadlock detection for buffer overflow; concurrent execution of different sets of streams (of different DFGs)
• Low-power dynamic credit-based CGRA scheduling – allow vector ports to run out of order, reducing overall latency
• 3D support for streams in the ISA
• Partitioned scratchpad to support data-dependent address generation
• Support for fine-grained configuration through FPGA slices (along with SRAM mats) next to the CGRA, for memory-dependent algorithm acceleration

Related Work

• Programmable specialization architectures: Smart Memories, CHARM, CAMEL, MorphoSys, XLOOPS, Maven VT
• Principles of specialization: GPPs are inefficient and need specialization – Hameed et al.; trace processing – BERET; transparent specialization – CCA, CRIB, etc.
• Heterogeneous cores (GPP + specialized engines): Composite Cores, DySER, Cambricon
• Streaming engines: RSVP architecture, Imagine, Triggered Instructions, MAD, CoRAM++

Other Works
• Open-source GPGPU – MIAOW
   Lead developer and contributor to the open-source hardware GPGPU MIAOW
   AMD Southern Islands based RTL implementation of a GPGPU, able to execute unmodified AMD APP OpenCL kernels
   Published in ACM TACO 2015, HOTCHIPS 2015, COOLCHIPS 2015, HiPEAC 2016

• Von-Neumann/Dataflow Hybrid Architecture  A hybrid architecture aimed to exploit ILP in irregular applications  Lead developer of the micro-architecture of the dataflow offload engine – Specialized Engine for Explicit Dataflow (SEED)  Published in [ISCA‘ 2015, IEEE MICRO Top Picks 2016]

• Open-Source Hardware: Opportunities and Challenges
   A position article on the advantages of open-source hardware for hardware innovation
   A strong believer in open-source hardware and contribution
   To be published in IEEE '17

Back Up

Programmable Hardware Acceleration

Idea 1: Specialization principles can be exploited in a general way

Idea 2: Composition of known Micro-Architectural mechanisms embodying the specialization principles

Programmable Hardware Accelerator (GenAccel)

GenAccel as a programmable hardware design template to map one or many application domains

[Figure: domain-provisioned GenAccels (deep neural; stencil, sort, scan, AI) vs. a balanced GenAccel; figures not to scale]

Principles in DSAs: NPU – Neural Processing Unit vs. a General-Purpose Processor

• Match hardware concurrency to that of the algorithm
• Problem-specific computation units
• Explicit communication, as opposed to implicit communication
• Customized structures for data reuse
• Hardware coordination using simple low-power control logic

[Figure: NPU high-level organization – bus scheduler, input/output FIFOs, PEs; each processing engine has a weight buffer, FIFO, mult-add units, accumulator register, controller, sigmoid unit, and output buffer]

Concurrency | Computation | Communication | Data Reuse | Coordination

Accelerator Workloads

Neural approx. | Convolution | DNN | Database streaming
1. Ample parallelism  2. Regular memory  3. Large datapath  4. Computation heavy

GenAccel Modeling Strategy
• Phase 1: model a single core with PIN + gem5 based trace simulation

 The algorithm to specialize, in the form of C code/binary

 Potential Core Types, CGRA sizes, any specialized instructions

 Degree of memory customization (which memory accesses to be specialized, either with DMA or scratchpad)

 Output: single-core perf./energy for “Pareto-optimal” designs

• Phase 2. Model coarse-grained parallelism

 Use profiling information to determine the parallel portion of the algorithm (or ask the user to indicate or estimate it)

 Apply simple Amdahl's law to get a performance estimate (see the sketch below)

 Use execution time, single-core energy estimate, and static power estimate to get overall energy estimate
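A minimal sketch of the phase-2 estimate, assuming a parallel fraction p from profiling and a per-unit speedup s from phase 1; the function and parameter names are illustrative:

// Amdahl's-law estimate of whole-application speedup (illustrative).
double genaccel_speedup(double p, double s, int num_units) {
    double serial   = 1.0 - p;              // unaccelerated fraction
    double parallel = p / (s * num_units);  // accelerated fraction, scaled
    return 1.0 / (serial + parallel);
}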

GenAccel in Practice
1. Design synthesis: performance requirements per application and hardware constraints (area goal, power goal) drive the design decisions – FU types, number of FUs, spatial fabric size, number of GenAccel tiles – followed by hardware synthesis into a programmable accelerator (GenAccel).
2. Programming, for each application: write the control program (a C program + annotations) and the datapath program (spatial scheduling).
3. Runtime: configure for an application, then run it – serially on a single configuration, or in parallel across tiles (configure for app. 1, run app. 1; configure for app. 2, run app. 2; ...).

Programming GenAccel: Pragmas

#pragma genaccel cores 2
#pragma reuse-scratchpad weights
void nn_layer(int num_in, int num_out,
              const float* weights,
              const float* in,
              float* out)
{
  for (int j = 0; j < num_out; ++j) {
    for (int i = 0; i < num_in; ++i) {
      out[j] += weights[j * num_in + i] * in[i];
    }
    out[j] = sigmoid(out[j]);
  }
}

Compiler steps: insert data transfers; loop-parallelize, insert communication, modulo-schedule; resize the computation (unroll), extract the computation subgraph, spatially schedule it onto the fabric.

GenAccel Design Point Selection

Design | Concurrency                               | Computation                   | Communication                   | Data Reuse                  | Units
GA_N   | 24-tile CGRA (8 Mul, 8 Add, 1 Sigmoid)    | 2k x 32b sigmoid lookup table | 32b CGRA; 256b SRAM interface   | 2k x 32b weight buffer      | 1
GA_C   | 64-tile CGRA (32 Mul/Shift, 32 Add/Logic) | standard 16b FUs              | 16b CGRA; 512b SRAM interface   | 512 x 16b SRAM for inputs   | 1
GA_D   | 64-tile CGRA (32 Mul, 32 Add, 2 Sigmoid)  | piecewise-linear sigmoid unit | 32b CGRA; 512b SRAM interface   | 2k x 16b SRAMs for inputs   | 8
GA_Q   | 32-tile CGRA (16 ALU, 4 Agg, 4 Join)      | join + filter units           | 64b CGRA; 256b SRAM interface   | SRAMs for buffering         | 4
GA_B   | 32-tile CGRA (combination of above)       | combination of above FUs      | 64b CGRA; 512b SRAM interface   | 4KB SRAM                    | 8

Mul: multiplier, Add: adder

Synthesis – Time Run – Time Concurrency No. of GenAccel Units Power-gating unused GenAccel Units Computation Spatial fabric FU mix Scheduling of spatial fabric and core

Communication Enabling spatial datapath Configuration of spatial datapath, elements, & SRAM interface switches and ports, memory access widths pattern

Data Reuse Scratchpad (SRAM) size Scratchpad used as DMA/reuse buffer

11/16/2017 Dissertation Talk 89 Performance Analysis (1)

GAN vs. NPU 18 16 14 GA N (+reuse.) 12 Spatial (+comm.) 10 SIMD (+concur.) 8 LP Core + Sig. (+comp.) Speedup 6 NPU (DSA) 4 2

0

fft

jpeg

sobel

Mean

(2-8-2)

jmeint

(9-8-1)

(1-4-4-2)

kmeans

(6-8-4-1)

inversek2j

Geometric

(64-16-64) (18-32-8-2) Baseline – 4 wide OOO core (Intel 3770K) 11/16/2017 Dissertation Talk 90 Source of Accelertion Benefits

Overall, hardware specialization is never the sole factor in a DSA's speedup, and rarely the larger one.
• Massive benefits come from straightforward parallelization, and significant benefit from algorithmic modifications that improve concurrency
• Some benefit from optimizing the algorithm to expose concurrency/reuse and to avoid data copying
• Some benefit from specialization proper: a specialized weight buffer and broadcast, bit-width specialization and inter-layer shift registers, and a specialized graph-fusion unit

[Figure: NPU, DianNao, Q100 and Convolution Engine placed on a specialization vs. algorithm/concurrency spectrum]

Performance Analysis (2)

[Charts: GA_C vs. Conv. Engine (1 tile) across IME, FME, DOG, EXTR. and geomean; GA_D vs. DianNao (8 tiles) across conv/pool/class layers; GA_Q vs. Q100 (4 tiles) across queries q1–q17; bars stack LP core + FUs (+comp.), tile (+concur.), SIMD (+concur.), spatial (+comm.), GA (+reuse), against each domain accelerator]
Baseline – 4-wide OOO core (Intel 3770K)

GenAccel Area & Power Numbers

Domain             | Design       | Area (mm²) | Power (mW)
Neural approx.     | GA_N         | 0.37       | 149
Neural approx.     | NPU          | 0.30       | 74
Stencil            | GA_C         | 0.15       | 108
Stencil            | Conv. Engine | 0.08       | 30
Deep neural        | GA_D         | 2.11       | 867
Deep neural        | DianNao      | 0.56       | 213
Database streaming | GA_Q         | 1.78       | 519
Database streaming | Q100         | 3.69       | 870
(all domains)      | GA_Balanced  | 2.74       | 352

*Intel Ivybridge 3770K CPU, 1 core: area 12.9 mm², power 4.95 W. *Intel Ivybridge 3770K iGPU, 1 execution lane: area 5.75 mm². +AMD Kaveri APU (Tahiti-based GPU), 1 CU: area 5.02 mm².
*Source: http://www.anandtech.com/show/5771/the-intel-ivy-bridge-core-i7-3770k-review/3  +Estimate from die-photo analysis and block diagrams from wccftech.com

Power & Area Analysis (1)

GA_N: 1.2x more area, 2x more power than its DSA. GA_C: 1.7x more area, 3.6x more power.

Power & Area Analysis (2)
GA_Q: 0.5x the area, 0.6x the power of its DSA. GA_D: 3.8x more area, 4.1x more power.

Power & Area Analysis (3)
Balanced LSSD design (LSSD_B): 0.6x the area and 2.5x the power of the DSAs combined (the chart labels also show 2.7x area and 2.4x power).

Unsuitable Workloads for GenAccel / Stream-Dataflow

• Memory-dominated workloads – specifically, a small memory footprint but "irregular" access; heavily serialized, data-dependent address generation
• Example: memory compression – A Scalable High-Bandwidth Architecture for Lossless Compression on FPGAs, Fowers et al.
• Other examples: IBM PowerEN regular-expression (DFA-based) codes

GenAccel vs. FPGA

• FPGAs run at much lower frequency (global routing, and too fine-grained)
• BlockRAMs are too small to gang up
• A logical multi-ported register file is needed to pass values between DSP slices to match high operand-level concurrency
• Altera's Stratix 10 seems headed exactly this direction

11/16/2017 Dissertation Talk 98 GenAccel’s power overhead of 2x - 4x matter in a system with accelerator?

In what scenarios would you build a DSA over GenAccel?

Energy Efficiency Tradeoffs

System with an accelerator: an OOO core (Pcore = 5W core power) plus caches, system bus and memory (Psys = 5W system power), and the accelerator – GenAccel or DSA (Pacc = 0.1 – 5W accelerator power). t: execution time; S: accelerator's speedup; U: accelerator utilization.

Overall energy of the computation executed on the system:
  E = Pacc * (U/S) * t  +  Psys * (1 - U + U/S) * t  +  Pcore * (1 - U) * t
      (accelerator energy)  (system energy)              (core energy)
A sketch of this model in code follows.
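The energy model transcribed into a small C function, useful for reproducing the tradeoff curves on the next slides; a sketch, with the slide's example power values as inputs:

// E = Pacc*(U/S)*t + Psys*(1 - U + U/S)*t + Pcore*(1 - U)*t
double system_energy(double p_acc, double p_sys, double p_core,
                     double U, double S, double t) {
    double e_accel  = p_acc  * (U / S) * t;            // accelerator energy
    double e_system = p_sys  * (1.0 - U + U / S) * t;  // system energy
    double e_core   = p_core * (1.0 - U) * t;          // core energy
    return e_accel + e_system + e_core;
}
// Eff_dsa/ga = system_energy(P_ga, ...) / system_energy(P_dsa, ...)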

(*Power numbers are a representative example.)

Energy Efficiency Gains of GenAccel & DSA over an OOO Core
Assume Speedup_ga = Speedup_dsa (speedup w.r.t. OOO), with Pdsa ≈ 0W and Pga = 0.5W (a 500mW, 5x power overhead).

[Charts: energy efficiency over OOO vs. accelerator speedup, for U = 0.75, 0.9, 0.95, 1 – the GenAccel and DSA curves nearly coincide]

Efficiency gains of GenAccel and the DSA are almost identical; at higher speedups both get "capped" due to the large system power. Baseline – 4-wide OOO core.

Does GenAccel's power overhead of 2x – 4x matter in a system with an accelerator?

When Psys >> Pga, 2x - 4x power overheads of GenAccel become inconsequential

Energy Efficiency Gains of DSA over GenAccel (Speedup_ga = Speedup_dsa, w.r.t. OOO)

Eff_dsa/ga = (1 / DSA energy) / (1 / GenAccel energy) = GenAccel energy / DSA energy

[Chart: Eff_dsa/ga vs. accelerator speedup for U = 0.75, 0.9, 0.95, 1; the curves stay below ~1.10]

At lower speedups, the DSA's energy-efficiency gains over GenAccel are 6 – 10%; at higher speedups they drop below 5%. Eff_dsa/ga is no more than 10% even at 100% utilization.

In what scenarios would you build a DSA over GenAccel?

Only when application speedups are small and even small energy-efficiency gains are important.

When does accelerator power (or a DSA) matter?
• GenAccel cannot match the DSA's performance
• The accelerator is "vertically integrated" – logic attached to memory or IO such that Psys is affected (e.g., ShiDianNao, a DNN accelerator attached to an image sensor)
• Speedups are "small" and a 10% energy difference is "valuable"

Energy Efficiency Gains of DianNao over GenAccel (Speedup_GA = Speedup_DianNao, w.r.t. OOO)

[Chart: energy efficiency of DianNao over GenAccel vs. accelerator speedup, for U = 0.75, 0.9, 0.95, 1; gains stay below ~1.14x]

Does Accelerator Power Matter?

• At speedups > 10x, the DSA's efficiency advantage is around 5%, even when accelerator power equals core power

• At smaller speedups, it makes a bigger difference – up to 35%

Detailed Example of the Stream-Dataflow Execution Model
Example: C[i] = A[i] * B[i], with A staged through the scratchpad. Stream commands, in program order:
  C1) Mem -> Scratch
  C2) Scratch write barrier
  C3) Scratch -> Port A
  C4) Mem -> Port B
  C5) Port C -> Mem
  C6) Mem -> Port B
  C7) All barrier
[Timeline legend: enqueued / dispatched / resource idle / resource in use; barriers, dependencies, iteration boundaries; CGRA fabric state; low-power core state (command generation, resume)]
Takeaways: 1) dataflow-based pipelined concurrent execution; 2) a high computation-activity ratio – many computations per stream command.
A macro-level sketch of this sequence follows.
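One way the C1–C7 sequence could be written against the SD_* interface; the scratchpad-staging macros are assumptions patterned after the optimized classifier slide, not verified signatures:

// C[i] = A[i] * B[i], with A staged through the scratchpad (sketch).
SD_MEM_SCRATCH(A, 8, 8, N, 0);        // C1: memory -> scratchpad (addr 0)
SD_BARRIER_SCRATCH_WR();              // C2: scratchpad write barrier
SD_SCRATCH_PORT(0, 8, 8, N, Port_A);  // C3: scratchpad -> input port A
SD_MEM_PORT(B, 8, 8, N, Port_B);      // C4: memory -> input port B
SD_PORT_MEM(Port_C, 8, 8, N, C);      // C5: output port C -> memory
SD_MEM_PORT(B, 8, 8, N, Port_B);      // C6: the next iteration's B stream
SD_BARRIER_ALL();                     // C7: all barrier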

Example Code: Dot Product (Instruction Comparisons)

Original program (computation graph: P1 × P2 → + → P3):
  for (int i = 0 to N) { dot_prod += a[i] * b[i]; }

Scalar (~2N instructions):
  for (i = 0 to N) {
    Send a[i] -> P1
    Send b[i] -> P2
  }
  Get P3 -> result

Vector (~2N/vec_len instructions):
  for (i = 0 to N, i += vec_len) {
    Send a[i:i+vec_len] -> P1
    Send b[i:i+vec_len] -> P2
  }
  Get P3 -> result

Stream-dataflow (~3 instructions):
  Send a[0:N] -> P1
  Send b[0:N] -> P2
  Get P3 -> result

Stream-Dataflow ISA vs. TPU ISA

Google TPU ISA

• Design goal of the TPU ISA – a programmable ISA with low instruction overheads
• Restricted to the neural-network domain only – more of a programmable ISA for the NN domain
• CISC principles to run complex tasks – fast multiply-add accumulations
• Uses the matrix as a primitive instead of a vector or scalar – a huge performance benefit for neural-network applications; reduced inference latency (< 7ms); the ISA is heavily restricted to certain computations [Read_Host_Memory, Read_Weights, MatrixMultiply/Convolve, Activate, Write_Host_Memory]
• Heavily relies on the host processor to send instructions; host software will be a bottleneck
• Does not decouple the memory and computation phases

TPU Compute Capability

• 700 Mhz target frequency with 40W TDP. External accelerator and PCIe based interconnect to host – 12.5GB/s effective bandwidth

• An inference chip for MLPs, CNN and LSTM  Matrix-Matrix multiplication support – 65K operations per cycle using a 256 x 256 2D pipeline

• Quantization to operate on 8-bit integers only helps performance

11/16/2017 Dissertation Talk 111 Potential Performance Bottlenecks

1. Computations Per CGRA Instance
2. General Core Instructions
3. Cache → CGRA Bandwidth
4. Initialization/Draining Latency (Memory & CGRA)
5. Length of Recurrence through CGRA

11/16/2017 Dissertation Talk 112 1. Computations Per CGRA Instance Principle: Few instructions control many computation instances

HINT: This usually involves unrolling a loop – but not necessarily the inner loop.
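A minimal sketch in plain C (illustrative names; assumes N is a multiple of 4): unrolling the dot-product loop by 4 exposes four multiply-accumulates per CGRA instance, so the same stream commands drive 4x the computation.

for (int i = 0; i < N; i += 4) {   // unrolled by 4: one CGRA instance
  acc += a[i+0] * b[i+0];          // now carries four multiplies
  acc += a[i+1] * b[i+1];
  acc += a[i+2] * b[i+2];
  acc += a[i+3] * b[i+3];
}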

11/16/2017 Dissertation Talk 113 2. General Core Instructions

• Principle: Few core instructions control many computation instances
– Use streams that are as long as possible
– Computation Instances > 2 * Number of Commands

Per-iteration streams:
for (int i = 0; i < 128; ++i) {
  SB_MEM_PORT(array[i], stride_size, acc_size, num_times, Port);
  …
}

Streams covering two iterations each:
for (int i = 0; i < 128; i += 2) {
  SB_MEM_PORT(array[i], stride_size, acc_size, num_times*2, Port);
  …
}

One stream command covering all 128 iterations:
SB_MEM_PORT(array[0], stride_size, acc_size, num_times*128, Port);
for (int i = 0; i < 128; ++i) { … }
11/16/2017 Dissertation Talk 114
3. Cache → CGRA Bandwidth (1)

• Principle 1: Only 64 bytes per cycle can come from memory
– Can feed one 8-wide port, two 4-wide ports, or four 2-wide ports
– Use scratch streams to supplement memory streams

[Figure: streams fed from both Scratchpad and Memory into the CGRA ports]
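A minimal sketch of this principle, reusing the SD_* stream calls from this talk; the arrays a and b, the length N, and the port names are illustrative:

SD_Mem_Scratch(b, 8, 8, N, 0);        // stage b[] into scratch once
SD_Barrier_Scratch_Wr();              // wait until the scratch write completes
SD_Mem_Port(a, 8, 8, N, Port_A);      // memory stream uses the full 64B/cycle
SD_Scratch_Port(0, 8, 8, N, Port_B);  // scratch stream feeds the second port

Memory then only has to feed Port_A at its full rate, while Port_B is kept busy from scratch.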

11/16/2017 Dissertation Talk 115 3. Cache  CGRA Bandwidth (2)

• Principle 2: Not-accessed elements within a 64-byte cache line COUNT towards bandwidth

Stream: access_size = 16 bytes, stride_size = 24 bytes
[Figure: address pattern alternating 16 useful bytes and 8-byte gaps (16, 8, 16, 8, 8, ...) against a 64-byte cache line]
HINT 1: Don't use access patterns with "gaps" smaller than the cache line size.
HINT 2: Try to align accesses with cache-line boundaries.
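Worked out for the pattern above: only 16 of every 24 bytes brought in are useful, so effective bandwidth is at most 16/24 ≈ 67% of peak, and misalignment with the 64-byte line boundary lowers it further.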

11/16/2017 Dissertation Talk 116 Optimizing Classifier Layer

[Figure: computation DFG for the classifier layer]
SD_Config(classifier_cfg, sizeof(classifier_cfg));
SD_Mem_Port(synapse, 8, 8, Ni * Nn / 4, Port_S);   // Optimization: size of DFG (4-wide)
SD_Mem_Scratch(neuron_i, Ni * 2, Ni * 2, 1, 0);    // Optimization: scratch for memory B/W
SD_Barrier_Scratch_Wr();
for (n = 0; n < Nn; n++) {
  SD_Scratch_Port(0, Ni * 2, Ni * 2, 1, Port_N);
  SD_Const_Port(0, 1, Port_acc);
  SD_Const_Port(0, Ni/4 - 1, Port_do_sig);
  SD_Const_Port(1, 1, Port_do_sig);
  SD_Port_Port(Port_out, Ni/4 - 1, Port_acc);
  SD_Port_Mem(Port_out, 1, &neuron_n[n]);
}
SD_Barrier_All();
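The scratch optimization pays off because neuron_i is reused by every output neuron: it is read from memory once (Ni * 2 bytes) and then replayed from the scratchpad on each of the Nn iterations, rather than being re-streamed from memory every time.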

6. Initialization/Draining Latency (Memory & CGRA)

• Principle: Hide memory latency by having "longer pipelined phases"
[Figure: pipeline phases - memory streams take ~100 cycles (or ~20 cycles from cache) to initialize and to drain, while the CGRA pipeline itself is ~15 cycles deep]
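A rough illustration of the principle: if a pipelined phase keeps the CGRA busy for ~2,000 cycles, a ~100-cycle memory fill adds only about 5% overhead; if the phase lasts just 200 cycles, the same fill costs ~50%.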

11/16/2017 Dissertation Talk 118 7. Length of Recurrence through CGRA • Principle: Number of independent instances should be > the length of the longest recurrence.

[Figure: dot product of arrays B and A streamed four elements per group through the CGRA, with a single carried accumulator chaining every group]
Latency = 15 Cycles
Instances / Cycle = 1 / 15

11/16/2017 Dissertation Talk 119 7. Length of Recurrence through CGRA (2)

[Figure: the same dot product mapped with two independent carried accumulators (Carry1 and Carry2), alternating element groups between them]
Latency = 15 Cycles
Instances / Cycle = 2 / 15
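The same trick in plain C (a sketch with illustrative names; assumes N is even): splitting the accumulator into two independent partial sums doubles the instances in flight per recurrence, matching the 2/15 rate above.

int64_t s0 = 0, s1 = 0;
for (int i = 0; i < N; i += 2) {
  s0 += a[i]   * b[i];      // recurrence chain 1
  s1 += a[i+1] * b[i+1];    // independent recurrence chain 2
}
int64_t dot = s0 + s1;      // combine the partial sums once at the end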

11/16/2017 Dissertation Talk 120
Recurrence Serialization Overhead
Maximum Computation Rate = # Pipelinable Instances / Recurrence Length

Recurrence Length = 12 Cycles

Max. Computation Rate = 1 / 12 Cycles
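For example, with 12 pipelinable instances in flight over this 12-cycle recurrence, the rate rises from 1/12 to 12/12 = 1 instance per cycle.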

11/16/2017 Dissertation Talk 121 Pipelining Classifier Layer

SD_Config(classifier_cfg)
SD_Mem_Scratch(neuron_i, 0, Ni*2, 1, 0)
SD_Barrier_Scratch_Write()
for (n = 0; n < Nn; n += tile_h) {
  SD_Constant(0, tile_h, Port_acc)
  for (i = 0; i < Ni; i += tile_w) {
    if (not last_iter) {
      SD_Constant(0, tile_h, P_do_sig)
      SD_Port_Port(P_out, tile_h, P_acc)
    } else {
      SD_Constant(0, tile_h, P_do_sig)
      SD_Port_Mem(Port_out, 1, &neuron_n[i])
    }
    SD_Scratch_Port(i*2, 0, 8*tile_w, 1, Port_N)
    SD_Mem_Port(&synapse[n][i], 2*Ni, 8*tile_w, tile_h, Port_S)
  }
}
SD_Barrier_All();
[Figure: synapse matrix (Nn x Ni) tiled into tile_h x tile_w blocks; rows are output neurons (Nn), columns are input neurons (Ni)]

11/16/2017 Dissertation Talk 122 2D Stencil Example

[Figure: Input Array × Stencil Array → ∑ → Output Array]
for (r = 0; r < row_size - 2; r++)
  for (c = 0; c < col_size - 2; c++)
    for (k1 = 0; k1 < 3; k1++)
      for (k2 = 0; k2 < 3; k2++)
        sol[r * col_size + c] += orig[(r + k1) * col_size + (c + k2)] * filter[k1 * 3 + k2];

[Figure: Input Array × Stencil Array → ∑ → Output Array]

for (r = 0; r < row_size - 2; r++) {
  for (c = 0; c < col_size - 2; c++) {
    SD_Constant(P_stencil_sb_carry, 1, 1);
    for (k1 = 0; k1 < 3; k1++) {
      SD_Mem_Port((orig + (r + k1) * col_size + c), sizeof(TYPE), sizeof(TYPE), 4, P_stencil_sb_I);
      SD_Mem_Port(filter + (k1 * 3), sizeof(TYPE), sizeof(TYPE), 4, P_stencil_sb_F);
    }
    SD_Port_Port(P_stencil_sb_R, P_stencil_sb_carry, 2);
    SD_Port_Mem(P_stencil_sb_R, sizeof(TYPE), sizeof(TYPE), 1, sol + (r * col_size) + c);
  }
}
SD_Barrier_All();
11/16/2017 Dissertation Talk 124
Easy Approach's Bottlenecks

1. Computations Per CGRA Instance (only 3 mults!)
2. General Core Instructions (core insts == CGRA insts)
3. Cache → CGRA Bandwidth (wasted b/c of acc_size)
4. Initialization/Draining Latency
5. Length of Recurrence through CGRA (no independent computations through the CGRA)

11/16/2017 Dissertation Talk 125

Better Approach (probably not best)

[Figure: Input Array × Stencil Array → ∑ → Output Array]
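One way this might look in plain C (a sketch, and as the slide title says, probably not the best; consistent with the bottleneck list that follows; assumes col_size - 2 is a multiple of 8): stripmining the c-loop by 8 gives up to 8 independent multiplies per CGRA instance.

for (int r = 0; r < row_size - 2; r++)
  for (int c = 0; c < col_size - 2; c += 8)   // stripmined c-loop
    for (int k1 = 0; k1 < 3; k1++)
      for (int k2 = 0; k2 < 3; k2++)
        for (int u = 0; u < 8; u++)           // 8 independent outputs
          sol[r * col_size + c + u] +=
            orig[(r + k1) * col_size + (c + k2 + u)] * filter[k1 * 3 + k2];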


for (r = 0; r < …) { … }

1. Computations Per CGRA Instance (up to 8 mults!)
2. General Core Instructions (core insts << CGRA insts)
3. Cache → CGRA Bandwidth (acc_size > cache_size)
4. Scratchpad → CGRA Bandwidth
5. Memory → Cache Bandwidth
6. Initialization/Draining Latency
7. Length of Recurrence through CGRA (if you stripmine the c-loop past the DFG width, you can stream multiple independent computations through the CGRA!)

11/16/2017 Dissertation Talk 130

Programming Restrictions

• CGRA Instruction Types & Data-width
• Shape of the stream (strided)
• Width of input/output ports
• Number of simultaneous streams
• Issue to free port (data always balanced)

11/16/2017 Dissertation Talk 131
Pipelining Classifier Layer
SD_Config(classifier_cfg, sizeof(cfg));
SD_Mem_Scratch(neuron_i, Ni * 2, Ni * 2, 1, 0);
SD_Barrier_Scratch_Wr();
for (n = 0; n < Nn; n += tile_h) {
  SD_Const_Port(0, tile_h, Port_acc);
  for (i = 0; i < Ni; i += tile_w) {
    if (not last_iter) {
      SD_Const_Port(0, tile_h, Port_do_sig);
      SD_Port_Port(P_out, tile_h, Port_acc);
    } else {
      SD_Const_Port(0, tile_h, Port_do_sig);
      SD_Port_Mem(Port_out, 1, &neuron_n[i]);
    }
    SD_Scratch_Port(i * 2, 8 * tile_w, 8 * tile_w, 1, Port_N);
    SD_Mem_Port(&synapse[n][i], 2 * Ni, 8 * tile_w, tile_h, Port_S);
  }
}
SD_Barrier_All();
[Figure: synapse matrix (Nn x Ni) tiled over input neurons (Ni, tile_w) and output neurons (Nn, tile_h)]
11/16/2017 Dissertation Talk 132
CGRA – Vector Port Interface

[Figure: input vector port interface (vector offsets 0-7) feeding the spatial CGRA fabric of FUs and switches, with an output vector port interface below]
• 4-entry vector port, 512b (64B) wide; each element is 64b (8B)
• Vector ports facilitate vector/SIMD execution and can absorb an entire cache line in a cycle (8-wide)
• Vector ports' offsets are connected to CGRA input links
– Mapping is done by hardware architects and recorded in the Softbrain hardware parameter model
• The hardware parameter model is passed to the scheduler/compiler for mapping software DFG ports to hardware vector ports
Example vector port to CGRA link mapping, in the form [VPORT_Num]: [Offset]:[CGRA Link Num]
VPORT_IN 0: 0:2, 1:5, 2:8, 3:11, 4:17, 5:20, 6:23, 7:26
VPORT_IN 1: 0:4, 1:7, 2:10, 3:16, 4:19, 5:22, 6:25, 7:31
VPORT_OUT 0: 0:1, 1:3, 2:5, 3:6, 4:8, 5:9, 6:11, 7:12
(Reading the first entry: offset 0 of input vector port 0 drives CGRA link 2.)
• Enables a flexible hardware-software interface for variable-width SIMD execution
11/16/2017 Dissertation Talk 133
Workload Characterization for Application Specific Softbrain

11/16/2017 Dissertation Talk 134 Softbrain vs. DianNao vs. GPU

[Figure: log-scale bar chart (1-1000) comparing SoftBrain, DianNao, and GPU]

11/16/2017 Dissertation Talk 135 ASIC Area Relative to Softbrain

[Figure: bar chart of ASIC area relative to Softbrain, y-axis 0-0.8]

11/16/2017 Dissertation Talk 136 Softbrain vs. ASIC Power Efficiency Comparison

[Figure: power efficiency relative to OOO4, log scale 1-1000, Softbrain vs. ASIC]

11/16/2017 Dissertation Talk 137 Softbrain vs. ASIC Energy Efficiency Comparison

[Figure: energy efficiency relative to OOO4, log scale 1-1000, Softbrain vs. ASIC]

11/16/2017 Dissertation Talk 138 Design Space Exploration for ASIC Comparison

11/16/2017 Dissertation Talk 139 DSA Architectures

NPU Convolution Engine

Q100 DianNao 11/16/2017 Dissertation Talk 140 Convolutional Neural Network

11/16/2017 Dissertation Talk 141 Rocket Core RoCC Interface

11/16/2017 Dissertation Talk 142 Recurrent Neural Network

11/16/2017 Dissertation Talk 143 Specialization Spectrum More gains the lower you go

[Figure: energy efficiency vs. flexibility spectrum, with ASICs above FPGAs]
Source: Bob Brodersen, Berkeley Wireless Group

11/16/2017 Dissertation Talk 144