Compilation and Hardware Support for Approximate Acceleration

Thierry Moreau, Adrian Sampson, Andre Baixo, Mark Wyse, Ben Ransford, Jacob Nelson, Hadi Esmaeilzadeh (Georgia Tech), Luis Ceze and Mark Oskin University of Washington [email protected]

Theme: 2384.004

1 Thierry Moreau Approximate Computing

Aims to exploit application resilience to trade off quality for efficiency

✅ Accurate, ❌ Expensive
✅ Approximate, ✅ Cheap

Neural Networks as Approximate Accelerators

CPU

Esmaeilzadeh et al. [MICRO 2012]

Neural Acceleration

float foo (float a, float b) { … return val; }

[Figure: foo is replaced via approximation by a neural network running on an NPU, with acceleration on ARM + FPGA]

compiler-support: ACCEPT (Sampson et al. [UW-TR])
HW-support: SNNAP (Moreau et al. [HPCA 2015])

3.8x speedup and 2.8x energy efficiency at 10% error

Talk Outline

Introduction

Compiler Support with ACCEPT

SNNAP Accelerator design

Evaluation & Comparison with HLS

Compilation Overview

1. Region detection & program instrumentation: ACCEPT takes the code annotation and performs region detection
2. ANN training & topology search: back prop. on [training.data]
3. Code generation: ACCEPT applies the code transformation; the transformed code executes on CPU and SNNAP
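As an illustration of how program instrumentation can produce training data for step 2, here is a minimal sketch. It is not ACCEPT's actual instrumentation: the body of foo, the sample_t record, the samples buffer, and the foo_instrumented wrapper are all hypothetical names chosen for this example. The idea is only that each call to the candidate region records one (inputs, output) pair that a trainer could later consume.

```c
#include <stddef.h>

#define MAX_SAMPLES 1024

/* One observed (inputs -> output) pair for a two-input region like foo().
   Hypothetical record layout, not ACCEPT's on-disk format. */
typedef struct {
    float in[2];
    float out;
} sample_t;

sample_t samples[MAX_SAMPLES];
size_t n_samples = 0;

/* The precise region. The body is illustrative only; the slides elide it. */
float foo(float a, float b) {
    return a * b + a;
}

/* Instrumented wrapper: run the precise code, then log the pair. */
float foo_instrumented(float a, float b) {
    float val = foo(a, b);
    if (n_samples < MAX_SAMPLES) {
        samples[n_samples].in[0] = a;
        samples[n_samples].in[1] = b;
        samples[n_samples].out = val;
        ++n_samples;
    }
    return val;
}
```

Running the instrumented program on representative inputs fills samples, which plays the role of [training.data] in the overview.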

Programming Model

float sobel (float* p);
. . .

float** src;
float** dst;

while (true) {
  src = read_from_camera();
  for (y=0; y < h; ++y) {
    for (x=0; x < w; ++x) {
      dst[y][x] = sobel(&src[y][x]);
    }
  }
  display(dst);
}

Programming Model

APPROX float sobel (APPROX float* p);
. . .

APPROX float** src;
APPROX float** dst;

while (true) {
  src = read_from_camera();
  for (y=0; y < h; ++y) {
    for (x=0; x < w; ++x) {
      dst[y][x] = sobel(&src[y][x]);
    }
  }
  display(ENDORSE(dst));
}

sobel call: ✅ no side effects, ✅ executes often
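To make the annotated example concrete, here is a small self-contained sketch of a Sobel-style kernel with the same annotations. Several things are assumptions made for illustration: APPROX and ENDORSE are defined as no-op stand-ins so the file builds with any C compiler (ACCEPT supplies its own definitions), the kernel takes a flat 3x3 window rather than the pointer used in the slide, and the gradient magnitude is approximated as |gx| + |gy| to avoid a square root.

```c
/* No-op stand-ins for the annotations (assumption for this sketch). */
#define APPROX
#define ENDORSE(e) (e)

static float absf(float v) { return v < 0.0f ? -v : v; }

/* Sobel response over a 3x3 window stored row-major in p[0..8].
   Pure function of its inputs: no side effects. */
APPROX float sobel(APPROX const float *p) {
    float gx = (p[2] + 2.0f * p[5] + p[8]) - (p[0] + 2.0f * p[3] + p[6]);
    float gy = (p[6] + 2.0f * p[7] + p[8]) - (p[0] + 2.0f * p[1] + p[2]);
    float mag = absf(gx) + absf(gy);   /* L1 stand-in for sqrt(gx^2 + gy^2) */
    return mag > 1.0f ? 1.0f : mag;    /* clamp to [0, 1] */
}

/* Apply sobel to every interior pixel of a w-by-h row-major image.
   The inner call runs once per pixel, so it executes often. */
void sobel_image(APPROX const float *src, APPROX float *dst, int w, int h) {
    for (int y = 1; y < h - 1; ++y) {
        for (int x = 1; x < w - 1; ++x) {
            float win[9];
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    win[(dy + 1) * 3 + (dx + 1)] = src[(y + dy) * w + (x + dx)];
            dst[y * w + x] = sobel(win);
        }
    }
}
```

A caller would pass the result through ENDORSE before handing it to precise code, mirroring display(ENDORSE(dst)) in the slide.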

Checking for Quality

annotated program: sobel.c
quality metric: d(y, y')
input data: training, test

[Plot: Performance vs. Output Quality]

Talk Outline

Introduction

Compiler Support with ACCEPT

SNNAP Accelerator design

Evaluation & Comparison with HLS

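Background: Multi-Layer Perceptrons

[Figure: neural network with Input Layer (x0 to x3), Hidden Layer 0 (x4 to x6), Hidden Layer 1 (x7 to x9), and Output (y0, y1); activation function f]

computing a single layer:

$$x_7 = f\left(\sum_{i=4}^{6} w_{i7} \cdot x_i\right)$$

$$\begin{bmatrix} x_7 \\ x_8 \\ x_9 \end{bmatrix} = f\left(\begin{bmatrix} w_{67} & w_{57} & w_{47} \\ w_{68} & w_{58} & w_{48} \\ w_{69} & w_{59} & w_{49} \end{bmatrix} \begin{bmatrix} x_6 \\ x_5 \\ x_4 \end{bmatrix}\right)$$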
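The single-layer computation can be sketched in a few lines of C. This is an illustrative software version, not the SNNAP implementation: eval_layer and hard_sigmoid are hypothetical names, the weights are assumed to be stored row-major with one row per output neuron, and hard_sigmoid is a piecewise-linear stand-in for the activation function f so that the example needs no math library.

```c
/* Piecewise-linear stand-in for a sigmoid (assumption for this sketch). */
float hard_sigmoid(float s) {
    float v = 0.25f * s + 0.5f;
    if (v < 0.0f) return 0.0f;
    if (v > 1.0f) return 1.0f;
    return v;
}

/* Evaluate one layer: out[j] = f(sum over i of w[j * n_in + i] * in[i]). */
void eval_layer(const float *w, const float *in, int n_in,
                float *out, int n_out, float (*f)(float)) {
    for (int j = 0; j < n_out; ++j) {
        float sum = 0.0f;
        for (int i = 0; i < n_in; ++i)
            sum += w[j * n_in + i] * in[i];   /* weighted sum of inputs */
        out[j] = f(sum);                      /* activation function */
    }
}
```

Evaluating a full multi-layer perceptron is then a chain of eval_layer calls, with each layer's out serving as the next layer's in.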

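Background: Systolic Arrays

computing a single layer on a systolic array:

$$\begin{bmatrix} x_7 \\ x_8 \\ x_9 \end{bmatrix} = f\left(\begin{bmatrix} w_{67} & w_{57} & w_{47} \\ w_{68} & w_{58} & w_{48} \\ w_{69} & w_{59} & w_{49} \end{bmatrix} \begin{bmatrix} x_6 \\ x_5 \\ x_4 \end{bmatrix}\right)$$

[Figure: systolic array. Inputs x6, x5, x4 stream past the weight rows (w49 w48 w47), (w59 w58 w57), (w69 w68 w67); the accumulated sums pass through f to produce x7, x8, x9]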
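A cycle-level software model can help show how a systolic array computes the same layer. The sketch below rests on assumptions that the slides do not spell out: one processing element per output neuron, inputs that advance by one element per cycle, and a relu function standing in for the activation unit f. The names systolic_layer, N_IN, and N_PE are hypothetical.

```c
#define N_IN 3   /* inputs per neuron (x4..x6 in the slide) */
#define N_PE 3   /* one processing element per output neuron (x7..x9) */

/* Simple activation used as a stand-in for the sigmoid unit (assumption). */
float relu(float s) { return s > 0.0f ? s : 0.0f; }

/* Cycle-level simulation of a 1-D systolic array.
   w[j][i] is the weight that PE j applies to input i. */
void systolic_layer(float w[N_PE][N_IN], const float x[N_IN],
                    float (*f)(float), float out[N_PE]) {
    float acc[N_PE] = {0};
    /* Input i reaches PE j at cycle i + j, so the last product is issued
       at cycle (N_IN - 1) + (N_PE - 1). */
    for (int cycle = 0; cycle < N_IN + N_PE - 1; ++cycle) {
        for (int j = 0; j < N_PE; ++j) {
            int i = cycle - j;              /* input currently held by PE j */
            if (i >= 0 && i < N_IN)
                acc[j] += w[j][i] * x[i];   /* multiply-accumulate */
        }
    }
    for (int j = 0; j < N_PE; ++j)
        out[j] = f(acc[j]);                 /* shared activation unit */
}
```

The result matches a plain matrix-vector product followed by f; what changes is the schedule, since every PE performs at most one multiply-accumulate per cycle using only its local weights.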

PU Micro-Architecture

[Figure: processing unit (PU) built around a systolic array: PU control, PEs with weight storage (w49 w48 w47), (w59 w58 w57), (w69 w68 w67), inputs x6 x5 x4, sigmoid unit f, outputs x7 x8 x9]

1 - processing elements in DSP logic
2 - local storage for synaptic weights
3 - sigmoid unit implements non-linear activation functions
4 - vertically micro-coded sequencer

Multi-Processing Units

[Figure: a DMA master and a scheduler connected over a bus to four PUs, each with its own control, PEs, storage, and sigmoid unit f]

CPU-SNNAP Integration

coherent reads & writes with accelerator coherency port
custom mastering interface
low-latency event signaling, sleep & wakeup

[Figure: CPU with $L1 and $L2; the ACP connects $L2 to the DMA master; SEV/WFE signaling between the CPU and the scheduler; a bus links the DMA master and scheduler to four PUs]

Talk Outline

Introduction

Programming model

SNNAP design:

• Efficient neural network evaluation

• Low-latency communication

Evaluation & Comparison with HLS

Evaluation

Neural acceleration on SNNAP (8x8 configuration, clocked at 1/4 of f_CPU) vs. precise CPU execution

application    domain          error metric
blackscholes   option pricing  MSE
fft            DSP             MSE
inversek2j     robotics        MSE
jmeint         3D-modeling     miss rate
jpeg           compression     image diff
kmeans         ML              image diff
sobel          vision          image diff
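Two of the error metrics in the table have simple textbook definitions, sketched below. This is an illustrative version and not the evaluation harness used for these results: the function names and signatures are hypothetical, and the image diff metric is left out because the slides do not define it.

```c
#include <stddef.h>

/* Mean squared error between precise and approximate outputs. */
double mse(const float *precise, const float *approx, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = (double)precise[i] - (double)approx[i];
        sum += d * d;
    }
    return n ? sum / (double)n : 0.0;
}

/* Miss rate: fraction of boolean decisions that differ between the
   precise and approximate runs. */
double miss_rate(const int *precise, const int *approx, size_t n) {
    size_t misses = 0;
    for (size_t i = 0; i < n; ++i)
        if ((precise[i] != 0) != (approx[i] != 0))
            ++misses;
    return n ? (double)misses / (double)n : 0.0;
}
```

Both functions compare the output of the precise CPU run with the output of the accelerated run on the same inputs, which is the role of d(y, y') in the quality-checking step.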

Whole-Application Speedup

[Bar chart: whole-application speedup (y-axis 0.00 to 4.00) for bscholes, fft, inversek2j, jmeint, jpeg, kmeans, sobel, and GEOMEAN. Bar labels in extraction order, not matched to benchmarks: 10.8, 38.1, 3.8, 2.7, 2.3, 2.4, 1.5, 1.3]

Energy Savings

Energy = Power * Runtime on (DRAM + SoC)

[Bar chart: energy savings (y-axis 0.00 to 4.00) for bscholes, fft, inversek2j, jmeint, jpeg, kmeans, sobel, and GEOMEAN. Bar labels in extraction order, not matched to benchmarks: 7.8, 28.0, 2.8, 2.2, 1.7, 1.8, 1.1, .9. Callout: +36%]
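The relation Energy = Power * Runtime allows a quick consistency check between the speedup and energy numbers. The sketch below assumes, which the slide does not state, that the +36% callout is the average power increase with SNNAP enabled; energy_savings and its parameters are illustrative names.

```c
/* Energy savings factor = E_baseline / E_accelerated
                         = (P_base * T_base) / (P_accel * T_accel)
                         = speedup / power_ratio,
   where speedup = T_base / T_accel and power_ratio = P_accel / P_base. */
double energy_savings(double speedup, double power_ratio) {
    return speedup / power_ratio;
}
```

Under that assumption, energy_savings(3.8, 1.36) evaluates to about 2.79, which is close to the 2.8x energy savings reported alongside the 3.8x speedup.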

Conclusion

float foo (float a, float b) { … return val; }

[Figure: foo is replaced via approximation by a neural network running on an NPU, with acceleration on ARM + FPGA]

compiler-support: ACCEPT
HW-support: SNNAP

3.8x speedup & 2.8x energy savings

ACCEPT: http://accept.rocks
SNNAP: upon request
