
COEN-4730 Computer Architecture Lecture 12

Domain Specific Architectures (DSA) Chapter 7

Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University


Introduction


• Need a factor of 100 improvement in the number of operations per instruction
  – Requires domain specific architectures (DSAs)
  – For ASICs, NRE cannot be amortized over large volumes
  – FPGAs are less efficient than ASICs
• Video:
  – https://www.acm.org/hennessy-patterson-turing-lecture
• Paper:
  – https://cacm.acm.org/magazines/2019/2/234352-a-new-golden-age-for-computer-architecture/fulltext
• Slides:
  – https://iscaconf.org/isca2018/docs/HennessyPattersonTuringLectureISCA4June2018.pdf


Machine Learning Domain


Artificial Intelligence, Machine Learning and Deep Learning


Example Domain: Deep Neural Networks

• Three of the most important DNNs:
  1. Multi-Layer Perceptrons
  2. Convolutional Neural Networks
  3. Recurrent Neural Networks


1) Multi-Layer Perceptrons


2) Convolutional Neural Network


◼ Parameters:
  ◼ DimFM[i-1]: dimension of the (square) input Feature Map
  ◼ DimFM[i]: dimension of the (square) output Feature Map
  ◼ DimSten[i]: dimension of the (square) stencil
  ◼ NumFM[i-1]: number of input Feature Maps
  ◼ NumFM[i]: number of output Feature Maps
◼ Number of neurons: NumFM[i] x DimFM[i]²
◼ Number of weights per output Feature Map: NumFM[i-1] x DimSten[i]²
◼ Total number of weights per layer: NumFM[i] x Number of weights per output Feature Map
◼ Number of operations per output Feature Map: 2 x DimFM[i]² x Number of weights per output Feature Map
◼ Total number of operations per layer: NumFM[i] x Number of operations per output Feature Map = 2 x DimFM[i]² x NumFM[i] x Number of weights per output Feature Map = 2 x DimFM[i]² x Total number of weights per layer
◼ Operations/Weight: 2 x DimFM[i]²
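To make the arithmetic concrete, here is a minimal Python sketch of these per-layer counts. The dimensions in the example call (56x56 output maps, 3x3 stencil, 64 input and 128 output feature maps) are hypothetical, not taken from the slides.

# Minimal sketch of the per-layer CNN counts defined above.
# The example dimensions below are hypothetical.

def cnn_layer_counts(dim_fm_out, dim_sten, num_fm_in, num_fm_out):
    """Return neuron, weight, and operation counts for one convolutional layer."""
    neurons = num_fm_out * dim_fm_out ** 2                 # NumFM[i] x DimFM[i]^2
    weights_per_ofm = num_fm_in * dim_sten ** 2            # NumFM[i-1] x DimSten[i]^2
    total_weights = num_fm_out * weights_per_ofm           # NumFM[i] x weights per output FM
    ops_per_ofm = 2 * dim_fm_out ** 2 * weights_per_ofm    # multiply + add per weight per output pixel
    total_ops = num_fm_out * ops_per_ofm                   # = 2 x DimFM[i]^2 x total weights
    return {
        "neurons": neurons,
        "total_weights": total_weights,
        "total_ops": total_ops,
        "ops_per_weight": total_ops / total_weights,       # = 2 x DimFM[i]^2
    }

# Example: 56x56 output feature maps, 3x3 stencil, 64 -> 128 feature maps
print(cnn_layer_counts(dim_fm_out=56, dim_sten=3, num_fm_in=64, num_fm_out=128))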

◼ Batches:
  ◼ Reuse weights once fetched from memory across multiple inputs
  ◼ Increases operational intensity
◼ Quantization:
  ◼ Use 8- or 16-bit fixed point
◼ Summary:
  ◼ Need the following kernels:
    ◼ Matrix-vector multiply
    ◼ Matrix-matrix multiply
    ◼ Stencil
    ◼ ReLU
    ◼ Sigmoid
    ◼ Hyperbolic tangent
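The element-wise kernels and 8-bit quantization above can be sketched in a few lines of NumPy. The symmetric, scale-based quantization scheme shown here is one common choice and is an assumption for illustration, not a scheme prescribed by the chapter.

# Sketch of the element-wise kernels and 8-bit quantization listed above (NumPy).
# The symmetric per-tensor quantization scheme is an assumption for illustration.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hyperbolic_tangent(x):
    return np.tanh(x)

def quantize_int8(x):
    """Map a float tensor to 8-bit integers with a single per-tensor scale."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

x = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(x)
print(relu(x))
print(q.astype(np.float32) * scale)   # dequantized values approximate x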

3) Recurrent Neural Network

◼ Speech recognition and language translation
◼ Long short-term memory (LSTM) network


◼ Parameters:
  ◼ Number of weights per cell: 3 x (3 x Dim x Dim) + (2 x Dim x Dim) + (1 x Dim x Dim) = 12 x Dim²
  ◼ Number of operations for the 5 vector-matrix multiplies per cell: 2 x Number of weights per cell = 24 x Dim²
  ◼ Number of operations for the 3 element-wise multiplies and 1 addition (vectors are all the size of the output): 4 x Dim
  ◼ Total number of operations per cell (5 vector-matrix multiplies and the 4 element-wise operations): 24 x Dim² + 4 x Dim
  ◼ Operations/Weight: ~2
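As a quick check of these counts, a small Python sketch follows; the vector width Dim = 1024 in the example call is hypothetical.

# Sketch of the per-cell LSTM counts above; Dim = 1024 is a hypothetical width.

def lstm_cell_counts(dim):
    weights = 12 * dim ** 2                # 3 x (3 x Dim^2) + 2 x Dim^2 + 1 x Dim^2
    matvec_ops = 2 * weights               # 5 vector-matrix multiplies: 24 x Dim^2
    elementwise_ops = 4 * dim              # 3 element-wise multiplies + 1 addition
    total_ops = matvec_ops + elementwise_ops
    return weights, total_ops, total_ops / weights   # Operations/Weight is ~2

print(lstm_cell_counts(dim=1024))   # (12582912, 25169920, ~2.0003)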


Machine Learning Mapping to Linear Algebra


Summary


Five Guidelines for Domain Specific Architectures (DSAs)


Guidelines for DSAs


Examples of DSAs
1. Tensor Processing Unit (TPU)
2. Microsoft Catapult
3. Intel Crest
4. Pixel Visual Core


1) Tensor Processing Unit (TPU)


Tensor Processing Unit


◼ Read_Host_Memory
  ◼ Reads data from the CPU host memory into the Unified Buffer
◼ Read_Weights
  ◼ Reads weights from the Weight Memory into the Weight FIFO as input to the Matrix Unit
◼ MatrixMultiply/Convolve
  ◼ Performs a matrix-matrix multiply, a vector-matrix multiply, an element-wise matrix multiply, an element-wise vector multiply, or a convolution from the Unified Buffer into the accumulators
  ◼ Takes a variable-sized B x 256 input, multiplies it by a 256 x 256 constant input, and produces a B x 256 output, taking B pipelined cycles to complete
◼ Activate
  ◼ Computes the activation function
◼ Write_Host_Memory
  ◼ Writes data from the Unified Buffer into host memory
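A functional sketch of the MatrixMultiply semantics described above follows; it models only the arithmetic (int8 operands accumulated into 32-bit accumulators for a B x 256 input against a 256 x 256 weight tile), not the systolic hardware, and the NumPy modeling is an assumption for illustration.

# Functional sketch (not a hardware model) of the MatrixMultiply semantics above:
# a B x 256 int8 input times a 256 x 256 int8 weight tile, accumulated into
# 32-bit accumulators.
import numpy as np

def tpu_matrix_multiply(inputs_i8, weights_i8, accumulators_i32):
    """inputs_i8: (B, 256) int8; weights_i8: (256, 256) int8;
    accumulators_i32: (B, 256) int32, updated in place."""
    product = inputs_i8.astype(np.int32) @ weights_i8.astype(np.int32)
    accumulators_i32 += product          # the hardware takes ~B pipelined cycles
    return accumulators_i32

B = 8
x = np.random.randint(-128, 128, size=(B, 256), dtype=np.int8)
w = np.random.randint(-128, 128, size=(256, 256), dtype=np.int8)
acc = np.zeros((B, 256), dtype=np.int32)
tpu_matrix_multiply(x, w, acc)
print(acc.shape, acc.dtype)              # (8, 256) int32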

TPU Microarchitecture – Systolic Array


TPU Implementation


Improving the TPU


The TPU and the Guidelines

1. Use dedicated memories
  ◼ 24 MiB dedicated buffer, 4 MiB accumulator buffers
2. Invest resources in arithmetic units and dedicated memories
  ◼ 60% of the memory and 250X the arithmetic units of a server-class CPU
3. Use the easiest form of parallelism that matches the domain
  ◼ Exploits 2D SIMD parallelism
4. Reduce the data size and type needed for the domain
  ◼ Primarily uses 8-bit integers
5. Use a domain-specific programming language
  ◼ Uses TensorFlow

Other DSAs

Read the remainder of Chapter 7:
2) Microsoft Catapult: an FPGA-based DSA solution
3) Intel Crest: more details needed
4) Pixel Visual Core: a DSA for stencil computations, very interesting

• Heterogeneity and Open Instruction Set (RISC-V)
  – Check out the RISC-V Summit, 12/03 – 12/06, 2018: https://tmt.knect365.com/risc-v-summit/
  – Agile HW development
• CPUs vs. GPUs vs. DNN accelerators
  – Performance comparison and the Roofline Model

Conclusion: A New Golden Age

• End of Dennard Scaling and Moore’s Law ➢ architecture innovation to improve performance/cost/energy
• Security ⇒ architecture innovation too
• Domain Specific Languages ⇒ Domain Specific Architectures
• Free, open architectures and open source implementations ➢ everyone can innovate and contribute
• Cloud FPGAs ⇒ all can design and deploy custom “HW”
• Agile HW development ⇒ all can afford to make (small) chips
• Like the 1980s, a great time for architects in academia and in industry!


Credits

• Copyright © 2019, Elsevier Inc. All rights reserved
  – Textbook slides
• “A New Golden Age for Computer Architecture” slides
  – https://iscaconf.org/isca2018/docs/HennessyPattersonTuringLectureISCA4June2018.pdf
• Yonghong Yan’s slides
  – https://passlab.github.io/CSCE513/notes/lecture26_DSA_DomainSpecificArchitectures.pdf
• “Machine Learning for Science” (2018) and “A Superfacility Model for Science” (2017) by Kathy Yelick
  – https://people.eecs.berkeley.edu/~yelick/talks.html
