Compilation and Hardware Support for Approximate Acceleration

Thierry Moreau, Adrian Sampson, Andre Baixo, Mark Wyse, Ben Ransford, Jacob Nelson, Hadi Esmaeilzadeh (Georgia Tech), Luis Ceze and Mark Oskin
University of Washington
[email protected]
Theme: 2384.004

Approximate Computing
Aims to exploit application resilience to trade off quality for efficiency.

✅ Accurate      ❌ Expensive
✅ Approximate   ✅ Cheap

Neural Networks as Approximate Accelerators
Esmaeilzadeh et al. [MICRO 2012]
[Figure; label recovered: CPU]

Neural Acceleration

    float foo (float a, float b) {
      …
      return val;
    }

[Diagram labels: approximation, acceleration; NPU, ARM, FPGA]

compiler-support: ACCEPT (Sampson et al. [UW-TR])
HW-support: SNNAP (Moreau et al. [HPCA2015])
3.8x speedup and 2.8x efficiency - 10% error

Talk Outline
• Introduction
• Compiler Support with ACCEPT
• SNNAP Accelerator design
• Evaluation & Comparison with HLS

Compilation Overview
1. Region detection & program instrumentation
   [Diagram labels: code annotation, ACCEPT, region detection]
2. ANN Training & topology search
   [Diagram labels: training.data, back prop.]
3. Code Generation
   [Diagram labels: ACCEPT, code transformation, executes, SNNAP, CPU]

Programming Model

Original program:

    float sobel (float* p);
    …
    float** src;
    float** dst;
    while (true) {
      src = read_from_camera();
      for (y=0; y < h; ++y) {
        for (x=0; x < w; ++x) {
          dst[y][x] = sobel(&src[y][x]);
        }
      }
      display(dst);
    }

With ACCEPT annotations (APPROX, ENDORSE):

    APPROX float sobel (APPROX float* p);
    …
    APPROX float** src;
    APPROX float** dst;
    while (true) {
      src = read_from_camera();
      for (y=0; y < h; ++y) {
        for (x=0; x < w; ++x) {
          dst[y][x] = sobel(&src[y][x]);
        }
      }
      display(ENDORSE(dst));
    }
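Step 1 of the compilation overview, program instrumentation, feeds the training.data used in step 2. One way to picture it is a wrapper that runs the precise region and records each input/output pair. The sketch below is only an illustration under stated assumptions: `target`, `target_instrumented`, `struct sample`, `training_data`, and `MAX_SAMPLES` are hypothetical names, the body of `target` is a placeholder, and none of this is ACCEPT's actual instrumentation.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SAMPLES 1024  /* hypothetical cap on recorded samples */

/* One training example: the region's inputs and its precise output. */
struct sample { float a, b, out; };

struct sample training_data[MAX_SAMPLES];
size_t n_samples = 0;

/* Hypothetical target region; the body is a placeholder, not code from the talk. */
float target(float a, float b) {
    return 0.5f * a + 0.25f * b;
}

/* Instrumented variant: run the precise code, then log the I/O pair
   so a neural network can later be trained to mimic the region. */
float target_instrumented(float a, float b) {
    float out = target(a, b);
    assert(n_samples <= MAX_SAMPLES);
    if (n_samples < MAX_SAMPLES) {
        training_data[n_samples].a = a;
        training_data[n_samples].b = b;
        training_data[n_samples].out = out;
        n_samples++;
    }
    return out;
}
```

Calling `target_instrumented` in place of `target` on representative inputs fills `training_data`; samples beyond the cap are silently dropped in this sketch.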
Programming Model: why sobel is a candidate region
✅ no side effects
✅ executes often

Checking for Quality
• annotated program: sobel.c
• quality metric: d(y, y0)
• input data: training and test sets
• results: Performance, Output Quality

Background: Multi-Layer Perceptrons
[Figure: neural network with Input Layer (x0 to x3), Hidden Layer 0 and Hidden Layer 1 (x4 to x6 and x7 to x9), and Output (y0, y1); weights w47, w57, w67 feed x7; activation function f]

computing a single layer:

$$x_7 = f\left(\sum_{i=4}^{6} w_{i7} \cdot x_i\right)$$

$$\begin{bmatrix} x_7 \\ x_8 \\ x_9 \end{bmatrix} = f\left(\begin{bmatrix} w_{67} & w_{57} & w_{47} \\ w_{68} & w_{58} & w_{48} \\ w_{69} & w_{59} & w_{49} \end{bmatrix} \begin{bmatrix} x_6 \\ x_5 \\ x_4 \end{bmatrix}\right)$$

Background: Systolic Arrays
computing a single layer (the matrix product above) on a systolic array
[Diagram labels: inputs x6, x5, x4; weight rows (w49 w48 w47), (w59 w58 w57), (w69 w68 w67); f; outputs x7, x8, x9]

PU Micro-Architecture
[Diagram labels: systolic array, processing unit, PU control, PE (x4), Storage, f]
1 - processing elements in DSP logic
2 - local storage for synaptic weights
3 - sigmoid unit implements non-linear activation functions
4 - vertically micro-coded sequencer

Multi-Processing Units
[Diagram labels: DMA Master, scheduler, bus, four PUs, each with control, PEs, Storage, f]

CPU-SNNAP Integration
• coherent reads & writes with accelerator coherency port
• custom mastering interface
• low-latency event signaling, sleep & wakeup
[Diagram labels: CPU, $L1, $L2, ACP, DMA master, scheduler, bus, SE, WF, four PUs]
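The single-layer computation from the Multi-Layer Perceptron background slide, each output being f applied to a weighted sum of the inputs, can be sketched in plain C. This is a software illustration only: it ignores systolic scheduling and any hardware number format. The names `mlp_layer` and `hard_sigmoid`, the row-major weight layout, and the activation itself are assumptions. In particular, `hard_sigmoid` is a piecewise-linear stand-in chosen to keep the sketch dependency-free, not the function SNNAP's sigmoid unit implements.

```c
#include <assert.h>
#include <stddef.h>

/* Piecewise-linear stand-in for a sigmoid: clamp(0.25*z + 0.5, 0, 1).
   Assumption for illustration only. */
float hard_sigmoid(float z) {
    float y = 0.25f * z + 0.5f;
    if (y < 0.0f) return 0.0f;
    if (y > 1.0f) return 1.0f;
    return y;
}

/* Evaluate one layer: out[j] = f( sum_i w[j*n_in + i] * in[i] ).
   w is stored row-major, one row of n_in weights per output neuron. */
void mlp_layer(const float *w, const float *in, float *out,
               size_t n_in, size_t n_out, float (*f)(float)) {
    assert(w != NULL && in != NULL && out != NULL && f != NULL);
    for (size_t j = 0; j < n_out; ++j) {
        float acc = 0.0f;                  /* multiply-accumulate per output */
        for (size_t i = 0; i < n_in; ++i)
            acc += w[j * n_in + i] * in[i];
        out[j] = f(acc);                   /* non-linear activation */
    }
}
```

For the 3-input, 3-output layer on the slide, `in` would hold (x4, x5, x6), each row of `w` the weights feeding one of x7, x8, x9, and `out` would receive (x7, x8, x9). Chaining calls, one per layer, evaluates the whole network.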
Talk Outline
• Introduction
• Programming model
• SNNAP design:
  • Efficient neural network evaluation
  • Low-latency communication
• Evaluation & Comparison with HLS

Evaluation
Neural acceleration on SNNAP (8x8 configuration, clocked at 1/4 of fCPU) vs. precise CPU execution

application    domain           error metric
blackscholes   option pricing   MSE
fft            DSP              MSE
inversek2j     robotics         MSE
jmeint         3D-modeling      miss rate
jpeg           compression      image diff
kmeans         ML               image diff
sobel          vision           image diff

Whole-Application Speedup
[Bar chart over bscholes, fft, inversek2j, jmeint, jpeg, kmeans, sobel, GEOMEAN; y-axis: Whole Application Speedup, 0.00 to 4.00. Bar labels recovered: 10.8 and 38.1 (off scale), 2.7, 2.3, 2.4, 1.5, 1.3; GEOMEAN 3.8]

Energy Savings
Energy = Power * Runtime on (DRAM + SoC)
[Bar chart over bscholes, fft, inversek2j, jmeint, jpeg, kmeans, sobel, GEOMEAN; y-axis: Energy Savings, 0.00 to 4.00. Bar labels recovered: 7.8 and 28.0 (off scale), 2.2, 1.7, 1.8, 1.1, .9; GEOMEAN 2.8; annotation: +36%]

Conclusion

    float foo (float a, float b) {
      …
      return val;
    }

[Diagram labels: approximation, acceleration; NPU, ARM, FPGA]

compiler-support: ACCEPT
HW-support: SNNAP
3.8x speedup & 2.8x energy savings

Compilation and Hardware Support for Approximate Acceleration
Thierry Moreau, Adrian Sampson, Andre Baixo, Mark Wyse, Ben Ransford, Jacob Nelson, Luis Ceze and Mark Oskin
University of Washington
[email protected]
ACCEPT: http://accept.rocks
SNNAP: upon request
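The GEOMEAN bars in the speedup and energy charts can be checked against the seven other bar labels on each chart. The sketch below computes a geometric mean by bisection, which avoids a math-library dependency; exp of the mean of logs would be the more common formulation. The function name `geomean` and the two arrays are hypothetical, and the values are simply the labels recovered from the charts, with no assumption about which benchmark each belongs to (the geometric mean does not depend on order).

```c
#include <assert.h>
#include <stddef.h>

/* Geometric mean of n positive values: the g for which prod(v[i] / g) == 1.
   Found by bisection between the smallest and largest value, since the
   geometric mean always lies in that interval. */
double geomean(const double *v, size_t n) {
    assert(v != NULL && n > 0);
    double lo = v[0], hi = v[0];
    for (size_t i = 0; i < n; ++i) {
        assert(v[i] > 0.0);
        if (v[i] < lo) lo = v[i];
        if (v[i] > hi) hi = v[i];
    }
    for (int iter = 0; iter < 100; ++iter) {
        double g = 0.5 * (lo + hi);
        double ratio = 1.0;
        for (size_t i = 0; i < n; ++i)
            ratio *= v[i] / g;
        if (ratio > 1.0)
            lo = g;   /* candidate is below the geometric mean */
        else
            hi = g;   /* candidate is at or above the geometric mean */
    }
    return 0.5 * (lo + hi);
}

/* Bar labels as recovered from the two charts (benchmark order not assumed). */
const double speedups[7]       = {10.8, 38.1, 2.7, 2.3, 2.4, 1.5, 1.3};
const double energy_savings[7] = {7.8, 28.0, 2.2, 1.7, 1.8, 1.1, 0.9};
```

With these inputs, `geomean(speedups, 7)` comes out near 3.8 and `geomean(energy_savings, 7)` near 2.8, consistent with the headline figures on the conclusion slide.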