Compilation and Hardware Support for Approximate Acceleration
Thierry Moreau, Adrian Sampson, Andre Baixo, Mark Wyse, Ben Ransford, Jacob Nelson, Hadi Esmaeilzadeh (Georgia Tech), Luis Ceze and Mark Oskin University of Washington [email protected]
Theme: 2384.004
Approximate Computing
Aims to exploit application resilience to trade off quality for efficiency
✅ Accurate, ❌ Expensive
✅ Approximate, ✅ Cheap
Neural Networks as Approximate Accelerators
[Figure: CPU] Esmaeilzadeh et al. [MICRO 2012]
Neural Acceleration
float foo (float a, float b) { … return val; }
[Figure labels: approximation, acceleration; ARM, NPU, FPGA]
compiler-support: ACCEPT (Sampson et al. [UW-TR])
HW-support: SNNAP (Moreau et al. [HPCA 2015])
3.8x speedup and 2.8x energy efficiency, with 10% error
Talk Outline
Introduction
Compiler Support with ACCEPT
SNNAP Accelerator design
Evaluation & Comparison with HLS
Compilation Overview
1. Region detection & program instrumentation (code annotation → ACCEPT region detection)
2. ANN training & topology search (back prop. on [training.data])
3. Code generation (ACCEPT code transformation → executes on CPU and SNNAP)
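The instrumentation in step 1 can be pictured as a wrapper that logs each input/output pair of a candidate region, so that step 2 has data to train on. The sketch below only illustrates that idea and is not ACCEPT's actual output: `region`, `record_call`, `training_log`, and the fixed-capacity in-memory log are all hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical candidate region: any side-effect-free function. */
static float region(float a, float b) {
    return a * b + a;
}

/* Hypothetical in-memory stand-in for the training data file. */
#define LOG_CAPACITY 1024

struct sample {
    float in[2];   /* region inputs  */
    float out;     /* precise output */
};

static struct sample training_log[LOG_CAPACITY];
static size_t training_count = 0;

/* Instrumented call: run the precise region and record the pair.
 * Samples beyond LOG_CAPACITY are dropped, the result is still returned. */
static float record_call(float a, float b) {
    float out = region(a, b);
    if (training_count < LOG_CAPACITY) {
        training_log[training_count].in[0] = a;
        training_log[training_count].in[1] = b;
        training_log[training_count].out = out;
        training_count++;
    }
    return out;
}
```

Calling `record_call` in place of `region` during a profiling run leaves program results unchanged while filling `training_log` with the pairs a trainer would consume.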
Programming Model
    APPROX float sobel (APPROX float* p);
    . . .
    APPROX float** src;
    APPROX float** dst;
    while (true) {
      src = read_from_camera();
      for (y=0; y < h; ++y) {
        for (x=0; x < w; ++x) {
          dst[y][x] = sobel(& src[y][x]);
        }
      }
      display(ENDORSE(dst));
    }
The APPROX qualifiers and the ENDORSE call are the only additions to the original program.
sobel: ✅ no side effects ✅ executes often
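To show that the annotations leave the program structure untouched, here is a small version of the annotated loop that a plain C compiler accepts. Everything beyond the `APPROX` and `ENDORSE` names is an assumption: the two macros are defined as no-op stand-ins, the image is a flat row-major array of fixed size, and the kernel body (which the slides elide) is a simplified gradient rather than a real Sobel operator.

```c
#include <assert.h>

/* Assumption: no-op stand-ins so a standard compiler accepts the
 * annotations. Under ACCEPT they would come from the toolchain. */
#define APPROX
#define ENDORSE(x) (x)

#define W 4   /* hypothetical image width  */
#define H 4   /* hypothetical image height */

/* Hypothetical kernel body: central differences around p in a row-major,
 * W-wide image, combined as |gx| + |gy|. Not the real Sobel operator. */
static APPROX float sobel(APPROX float *p) {
    float gx = p[1] - p[-1];
    float gy = p[W] - p[-W];
    if (gx < 0) gx = -gx;
    if (gy < 0) gy = -gy;
    return gx + gy;
}

/* Apply the kernel to interior pixels; border pixels are set to 0. */
static void filter(APPROX float *src, APPROX float *dst) {
    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
            int interior = y > 0 && y < H - 1 && x > 0 && x < W - 1;
            dst[y * W + x] = interior ? sobel(&src[y * W + x]) : 0.0f;
        }
    }
}
```

Approximate values flow freely between `src`, `sobel`, and `dst`; only the point where a result leaves the approximate domain (here, reading a pixel for display or checking) goes through `ENDORSE`.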
Checking for Quality
annotated program: sobel.c
quality metric: d(y, y0)
input data: training and test sets
[Figure: performance versus output quality]
Talk Outline
Introduction
Compiler Support with ACCEPT
SNNAP Accelerator design
Evaluation & Comparison with HLS
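Background: Multi-Layer Perceptrons
[Figure: neural network with input layer (x0 to x3), hidden layer 0 (x4 to x6), hidden layer 1 (x7 to x9), and outputs (y0, y1); the edges into x7 carry weights w47, w57, w67; f is the activation function]
Computing a single layer:
\[ x_7 = f\left( \sum_{i=4}^{6} w_{i7} \cdot x_i \right) \]
\[ \begin{bmatrix} x_7 \\ x_8 \\ x_9 \end{bmatrix} = f\left( \begin{bmatrix} w_{67} & w_{57} & w_{47} \\ w_{68} & w_{58} & w_{48} \\ w_{69} & w_{59} & w_{49} \end{bmatrix} \begin{bmatrix} x_6 \\ x_5 \\ x_4 \end{bmatrix} \right) \]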
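The single-layer computation (each output is the activation function applied to a weighted sum of the inputs) can be sketched in a few lines of C. The function and type names, the row-major weight layout, and the `clamp01` stand-in activation are assumptions made for this example; the slides only specify the formula.

```c
#include <assert.h>
#include <stddef.h>

/* Activation function type: the f in x_j = f(sum_i w_ij * x_i). */
typedef float (*activation_fn)(float);

/* Evaluate one fully connected layer.
 * w is row-major with one row per output neuron: w[j * n_in + i] is the
 * weight on the edge from input i to output j. */
static void layer_forward(const float *w, const float *x, float *y,
                          size_t n_in, size_t n_out, activation_fn f) {
    assert(w && x && y && f);
    for (size_t j = 0; j < n_out; ++j) {
        float sum = 0.0f;
        for (size_t i = 0; i < n_in; ++i)
            sum += w[j * n_in + i] * x[i];
        y[j] = f(sum);
    }
}

/* Hypothetical stand-in activation: clamp to [0, 1]. A sigmoid would
 * normally go here. */
static float clamp01(float v) {
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}
```

For the layer on the slide, `n_in` and `n_out` are both 3, `x` holds x4 to x6, and `y` receives x7 to x9.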
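Background: Systolic Arrays
Computing a single layer:
\[ \begin{bmatrix} x_7 \\ x_8 \\ x_9 \end{bmatrix} = f\left( \begin{bmatrix} w_{67} & w_{57} & w_{47} \\ w_{68} & w_{58} & w_{48} \\ w_{69} & w_{59} & w_{49} \end{bmatrix} \begin{bmatrix} x_6 \\ x_5 \\ x_4 \end{bmatrix} \right) \]
[Figure: systolic array; inputs x4, x5, x6 stream in; weight rows (w49 w48 w47), (w59 w58 w57), (w69 w68 w67); results pass through f to produce x7, x8, x9]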
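A cycle-by-cycle software model can make the systolic evaluation concrete. The dataflow below is an assumption, one plausible reading of the figure rather than the documented SNNAP design: each PE holds one input and the weights leaving it, adds its product to the partial sum handed over by the previous PE, and the last PE feeds the activation unit. All names (`systolic_layer`, `MAX_PES`, `identity`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PES 16   /* hypothetical upper bound on chain length */

typedef float (*activation_fn)(float);

/* Model of a 1-D systolic chain (assumed dataflow).
 * PE k holds input x[k] and the weights w[k * n_out + j] for every
 * output neuron j. In cycle t it works on neuron j = t - k, adding
 * w[k][j] * x[k] to the partial sum produced by PE k-1 one cycle earlier.
 * The last PE passes its sum through f. */
static void systolic_layer(const float *w, const float *x, float *y,
                           size_t n_in, size_t n_out, activation_fn f) {
    float psum[MAX_PES] = {0};   /* pipeline register after each PE */
    assert(n_in > 0 && n_in <= MAX_PES);

    for (size_t t = 0; t < n_out + n_in - 1; ++t) {
        /* Walk PEs from last to first so each one reads its neighbour's
         * register as it was at the end of the previous cycle. */
        for (size_t k = n_in; k-- > 0;) {
            if (t < k || t - k >= n_out)
                continue;                    /* PE k is idle this cycle */
            size_t j = t - k;
            float in = (k == 0) ? 0.0f : psum[k - 1];
            psum[k] = in + w[k * n_out + j] * x[k];
            if (k == n_in - 1)
                y[j] = f(psum[k]);
        }
    }
}

/* Identity activation, used to compare against a plain dot product. */
static float identity(float v) { return v; }
```

In this model a layer with `n_in` inputs and `n_out` outputs finishes in `n_in + n_out - 1` cycles, with every PE busy once the pipeline fills.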
PU Micro-Architecture
[Figure: processing unit (PU) with a control block, a systolic array of PEs fed from weight storage (w49 w48 w47; w59 w58 w57; w69 w68 w67), and a sigmoid unit f; inputs x4, x5, x6; outputs x7, x8, x9]
1 - processing elements in DSP logic
2 - local storage for synaptic weights
3 - sigmoid unit implements non-linear activation functions
4 - vertically micro-coded sequencer
Multi-Processing Units
[Figure: DMA master and scheduler connected over a bus to four PUs, each with its own control block, PEs, weight storage, and sigmoid unit f]
CPU-SNNAP Integration
coherent reads & writes with accelerator coherency port (ACP)
custom mastering interface
low-latency event signaling, sleep & wakeup
[Figure: CPU with $L1 and $L2 caches; ACP; DMA master, scheduler, and bus feeding four PUs; SE and WF event signals between CPU and accelerator]
Talk Outline
Introduction
Programming model
SNNAP design:
• Efficient neural network evaluation
• Low-latency communication
Evaluation & Comparison with HLS
Evaluation
Neural acceleration on SNNAP (8x8 configuration, clocked at 1/4 of fCPU) vs. precise CPU execution

application    domain           error metric
blackscholes   option pricing   MSE
fft            DSP              MSE
inversek2j     robotics         MSE
jmeint         3D-modeling      miss rate
jpeg           compression      image diff
kmeans         ML               image diff
sobel          vision           image diff
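Each error metric compares an approximate output y against the precise output y0, in the spirit of the quality metric d(y, y0). The sketch below gives one straightforward definition for each of the three names in the table. The exact formulas used in the evaluation are not given in the slides, so these are assumptions, in particular the reading of "image diff" as a mean absolute pixel difference on pixels normalized to [0, 1].

```c
#include <assert.h>
#include <stddef.h>

/* Mean squared error between approximate y and precise y0. */
static double mse(const float *y, const float *y0, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = (double)y[i] - (double)y0[i];
        sum += d * d;
    }
    return sum / (double)n;
}

/* Fraction of boolean decisions that differ from the precise run. */
static double miss_rate(const int *y, const int *y0, size_t n) {
    assert(n > 0);
    size_t misses = 0;
    for (size_t i = 0; i < n; ++i)
        if ((y[i] != 0) != (y0[i] != 0))
            ++misses;
    return (double)misses / (double)n;
}

/* Assumed form of "image diff": mean absolute pixel difference. */
static double image_diff(const float *y, const float *y0, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = (double)y[i] - (double)y0[i];
        sum += d < 0.0 ? -d : d;
    }
    return sum / (double)n;
}
```

All three return 0 for a perfect match, so a single threshold check can gate whether an approximated region is accepted.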
Whole-Application Speedup
[Figure: bar chart of whole-application speedup for bscholes, fft, inversek2j, jmeint, jpeg, kmeans, sobel, and GEOMEAN; y-axis 0.00 to 4.00; bar labels (order not preserved): 10.8, 38.1, 3.8, 2.7, 2.3, 2.4, 1.5, 1.3]
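The GEOMEAN bar is the geometric mean of the per-benchmark speedups. Which label belongs to which benchmark is not recoverable from the chart text, but the geometric mean does not depend on order, so the seven per-benchmark labels can still be checked against the 3.8x headline. The helper names are hypothetical, and the n-th root uses Newton's method only to keep the sketch free of a math library dependency.

```c
#include <assert.h>
#include <stddef.h>

/* n-th root of a positive value by Newton's method on g^n = value. */
static double nth_root(double value, size_t n) {
    assert(value > 0.0 && n > 0);
    double g = value > 1.0 ? value : 1.0;   /* start at or above the root */
    for (int iter = 0; iter < 200; ++iter) {
        double p = 1.0;                      /* p = g^(n-1) */
        for (size_t k = 1; k < n; ++k)
            p *= g;
        double next = ((double)(n - 1) * g + value / p) / (double)n;
        if (next >= g)                       /* no further progress */
            break;
        g = next;
    }
    return g;
}

/* Geometric mean of n positive values. */
static double geomean(const double *v, size_t n) {
    double product = 1.0;
    for (size_t i = 0; i < n; ++i)
        product *= v[i];
    return nth_root(product, n);
}
```

Feeding in the labels 10.8, 38.1, 2.7, 2.3, 2.4, 1.5, and 1.3 gives a value that rounds to 3.8.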
Energy Savings
Energy = Power * Runtime on (DRAM + SoC); annotation: +36%
[Figure: bar chart of energy savings for bscholes, fft, inversek2j, jmeint, jpeg, kmeans, sobel, and GEOMEAN; y-axis 0.00 to 4.00; bar labels (order not preserved): 7.8, 28.0, 2.8, 2.2, 1.7, 1.8, 1.1, .9]
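Because Energy = Power * Runtime, energy savings follow directly from speedup and the change in power draw. The worked example below assumes that the "+36%" annotation means the accelerated system draws 36% more power than the CPU-only baseline. That reading is a guess from the chart text, though it is consistent with the reported averages.

```c
#include <assert.h>

/* Energy = Power * Runtime, so
 *   E_cpu / E_snnap = (P_cpu * T_cpu) / (P_snnap * T_snnap)
 *                   = speedup / power_ratio,
 * where speedup = T_cpu / T_snnap and power_ratio = P_snnap / P_cpu. */
static double energy_savings(double speedup, double power_ratio) {
    assert(speedup > 0.0 && power_ratio > 0.0);
    return speedup / power_ratio;
}
```

With the 3.8x average speedup and an assumed power ratio of 1.36, `energy_savings(3.8, 1.36)` is about 2.79, in line with the 2.8x average energy savings. A speedup below the power ratio yields a value under 1, meaning the accelerated run costs more energy than the baseline.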
Conclusion
float foo (float a, float b) { … return val; }
[Figure labels: approximation, acceleration; ARM, NPU, FPGA]
compiler-support: ACCEPT
HW-support: SNNAP
3.8x speedup & 2.8x energy savings
ACCEPT: http://accept.rocks
SNNAP: upon request