The University of Adelaide, School of Computer Science 22 November 2018
Computer Architecture: A Quantitative Approach, Sixth Edition
Chapter 7: Domain-Specific Architectures
Introduction
- Moore’s Law enabled:
  - Deep memory hierarchy
  - Wide SIMD units
  - Deep pipelines
  - Branch prediction
  - Out-of-order execution
  - Speculative prefetching
  - Multithreading
  - Multiprocessing
- Objective:
  - Extract performance from software that is oblivious to architecture
- Need a factor of 100 improvement in the number of operations per instruction
  - Requires domain-specific architectures (DSAs)
- For ASICs, non-recurring engineering (NRE) costs cannot be amortized over large volumes
- FPGAs are less efficient than ASICs
Guidelines for DSAs
- Use dedicated memories to minimize data movement
- Invest resources in more arithmetic units or bigger memories
- Use the easiest form of parallelism that matches the domain
- Reduce data size and type to the simplest needed for the domain
- Use a domain-specific programming language
Example: Deep Neural Networks
- Inspired by the neurons of the brain
- Computes a non-linear “activation” function of the weighted sum of input values
- Neurons are arranged in layers
- Most practitioners will choose an existing design:
  - Topology
  - Data type
- Training (learning):
  - Calculate weights using the backpropagation algorithm
  - Supervised learning: stochastic gradient descent (see the sketch below)
- Inference: use the trained neural network for classification
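A minimal sketch of one stochastic gradient descent update for a single linear neuron, assuming a squared-error loss and learning rate lr (both illustrative choices, not details from the chapter):

```python
import numpy as np

def sgd_step(w, x, y, lr=0.01):
    """One stochastic gradient descent update for a single linear neuron
    with squared-error loss (both choices are illustrative)."""
    y_hat = w @ x                # forward pass: weighted sum of the inputs
    grad = (y_hat - y) * x       # gradient of 0.5 * (y_hat - y)**2 w.r.t. w
    return w - lr * grad         # step the weights against the gradient

w = np.zeros(3)
for x, y in [(np.array([1.0, 2.0, 3.0]), 1.0)] * 100:   # toy training pairs
    w = sgd_step(w, x, y)
```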
Multi-Layer Perceptrons
- Parameters:
  - Dim[i]: number of neurons
  - Dim[i-1]: dimension of the input vector
  - Number of weights: Dim[i-1] x Dim[i]
  - Operations: 2 x Dim[i-1] x Dim[i]
  - Operations/weight: 2 (see the sketch below)
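The operations/weight figure can be checked directly; a minimal sketch with arbitrary example sizes and an illustrative ReLU non-linearity:

```python
import numpy as np

def mlp_layer(x, W):
    """Fully connected layer: non-linear function of the weighted sums.
    x has length Dim[i-1]; W is a Dim[i-1] x Dim[i] weight matrix."""
    return np.maximum(0.0, x @ W)          # ReLU chosen for illustration

dim_prev, dim_i = 4096, 2048               # Dim[i-1], Dim[i] (example sizes)
weights = dim_prev * dim_i                 # Dim[i-1] x Dim[i]
ops = 2 * dim_prev * dim_i                 # one multiply and one add per weight
print(ops / weights)                       # operations/weight = 2
```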
Convolutional Neural Network
- Computer vision
- Each layer raises the level of abstraction:
  - First layer recognizes horizontal and vertical lines
  - Second layer recognizes corners
  - Third layer recognizes shapes
  - Fourth layer recognizes features, such as the ears of a dog
  - Higher layers recognize different breeds of dogs
- Parameters:
  - DimFM[i-1]: dimension of the (square) input feature map
  - DimFM[i]: dimension of the (square) output feature map
  - DimSten[i]: dimension of the (square) stencil
  - NumFM[i-1]: number of input feature maps
  - NumFM[i]: number of output feature maps
  - Number of neurons: NumFM[i] x DimFM[i]²
  - Number of weights per output feature map: NumFM[i-1] x DimSten[i]²
  - Total number of weights per layer: NumFM[i] x number of weights per output feature map
  - Number of operations per output feature map: 2 x DimFM[i]² x number of weights per output feature map
  - Total number of operations per layer: NumFM[i] x number of operations per output feature map = 2 x DimFM[i]² x NumFM[i] x number of weights per output feature map = 2 x DimFM[i]² x total number of weights per layer
  - Operations/weight: 2 x DimFM[i]² (see the sketch below)
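The same check for a convolutional layer, with arbitrary example sizes:

```python
# Per-layer counts for a convolutional layer (example sizes are arbitrary).
num_fm_prev, num_fm = 64, 128       # NumFM[i-1], NumFM[i]
dim_fm, dim_sten = 56, 3            # DimFM[i], DimSten[i]

neurons = num_fm * dim_fm**2                    # NumFM[i] x DimFM[i]^2
weights_per_ofm = num_fm_prev * dim_sten**2     # NumFM[i-1] x DimSten[i]^2
total_weights = num_fm * weights_per_ofm
ops_per_ofm = 2 * dim_fm**2 * weights_per_ofm
total_ops = num_fm * ops_per_ofm

print(total_ops / total_weights)                # 2 x DimFM[i]^2 = 6272
```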
Recurrent Neural Network
- Speech recognition and language translation
- Long short-term memory (LSTM) network
- Parameters:
  - Number of weights per cell: 3 x (3 x Dim x Dim) + (2 x Dim x Dim) + (1 x Dim x Dim) = 12 x Dim²
  - Number of operations for the 5 vector-matrix multiplies per cell: 2 x number of weights per cell = 24 x Dim²
  - Number of operations for the 3 element-wise multiplies and 1 addition (vectors are all the size of the output): 4 x Dim
  - Total number of operations per cell (5 vector-matrix multiplies and the 4 element-wise operations): 24 x Dim² + 4 x Dim
  - Operations/weight: ~2 (see the sketch below)
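And for one LSTM cell (Dim is an arbitrary example size):

```python
# Per-cell counts for an LSTM cell (Dim is an arbitrary example size).
dim = 1024

weights = 12 * dim**2            # 3*(3*Dim*Dim) + 2*Dim*Dim + 1*Dim*Dim
matmul_ops = 2 * weights         # 24 * Dim^2 for the 5 vector-matrix multiplies
elementwise_ops = 4 * dim        # 3 element-wise multiplies plus 1 addition
total_ops = matmul_ops + elementwise_ops

print(total_ops / weights)       # ~2 operations per weight
```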
Convolutional Neural Network
- Batches:
  - Reuse weights once fetched from memory across multiple inputs
  - Increases operational intensity
- Quantization:
  - Use 8- or 16-bit fixed point
- Summary:
  - Need the following kernels (sketched after this list):
    - Matrix-vector multiply
    - Matrix-matrix multiply
    - Stencil
    - ReLU
    - Sigmoid
    - Hyperbolic tangent
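Minimal NumPy sketches of the three activation kernels, plus an illustrative int8 quantization helper; the symmetric scale-by-max scheme is an assumption, not a detail from the slides:

```python
import numpy as np

def relu(x):       # rectified linear unit: max(0, x) element-wise
    return np.maximum(0.0, x)

def sigmoid(x):    # squashes each element into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):       # squashes each element into (-1, 1)
    return np.tanh(x)

def quantize_int8(x):
    """Symmetric quantization of a float array to 8-bit integers
    (the scale-by-max scheme is an illustrative assumption)."""
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8), scale
```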
Tensor Processing Unit
- Google’s DNN ASIC
- 256 x 256 8-bit matrix multiply unit
- Large software-managed scratchpad memory
- Coprocessor on the PCIe bus
TPU ISA
- Read_Host_Memory
  - Reads data from CPU host memory into the Unified Buffer
- Read_Weights
  - Reads weights from Weight Memory into the Weight FIFO as input to the Matrix Unit
- MatrixMultiply/Convolve
  - Performs a matrix-matrix multiply, a vector-matrix multiply, an element-wise matrix multiply, an element-wise vector multiply, or a convolution from the Unified Buffer into the accumulators
  - Takes a variable-sized B x 256 input, multiplies it by a 256 x 256 constant input, and produces a B x 256 output, taking B pipelined cycles to complete (see the sketch below)
- Activate
  - Computes the activation function
- Write_Host_Memory
  - Writes data from the Unified Buffer into host memory
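A functional NumPy model of the MatrixMultiply instruction’s arithmetic: a B x 256 int8 input times a 256 x 256 int8 weight block, summed into 32-bit accumulators. The B pipelined cycles of the systolic Matrix Unit are not modeled, and the names are illustrative:

```python
import numpy as np

def matrix_multiply(inputs, weights, accumulators):
    """B x 256 int8 inputs times 256 x 256 int8 weights, summed into
    B x 256 int32 accumulators (arithmetic only; pipelining not modeled)."""
    accumulators += inputs.astype(np.int32) @ weights.astype(np.int32)
    return accumulators

B = 32   # variable batch size; the real instruction takes B pipelined cycles
inputs = np.random.randint(-128, 128, (B, 256), dtype=np.int8)
weights = np.random.randint(-128, 128, (256, 256), dtype=np.int8)
accumulators = np.zeros((B, 256), dtype=np.int32)
accumulators = matrix_multiply(inputs, weights, accumulators)
```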
Improving the TPU
The TPU and the Guidelines
- Use dedicated memories:
  - 24 MiB dedicated buffer, 4 MiB accumulator buffers
- Invest resources in arithmetic units and dedicated memories:
  - 60% of the memory and 250x the arithmetic units of a server-class CPU
- Use the easiest form of parallelism that matches the domain:
  - Exploits 2D SIMD parallelism
- Reduce the data size and type needed for the domain:
  - Primarily uses 8-bit integers
- Use a domain-specific programming language:
  - Uses TensorFlow
Microsoft Catapult
- Needed to be general purpose and power efficient
- Uses an FPGA PCIe board with a dedicated 20 Gbps network in a 6 x 8 torus (see the sketch after this list)
- Each of the 48 servers in half a rack has a Catapult board
- Limited to 25 watts
- 32 MiB Flash memory
- Two banks of DDR3-1600 (11 GB/s) and 8 GiB DRAM
- FPGA (unconfigured) has 3926 18-bit ALUs and 5 MiB of on-chip memory
- Programmed in Verilog RTL
- Shell is 23% of the FPGA
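A small sketch of neighbor addressing in the 6 x 8 torus connecting the 48 boards: each FPGA links to the boards north, south, east, and west of it, with links wrapping around at the edges. The (row, column) coordinate scheme is an illustrative assumption:

```python
ROWS, COLS = 6, 8   # 6 x 8 torus of the 48 Catapult boards in a half-rack

def torus_neighbors(r, c):
    """North/south/west/east neighbors, wrapping around at the edges."""
    return [((r - 1) % ROWS, c),    # north
            ((r + 1) % ROWS, c),    # south
            (r, (c - 1) % COLS),    # west
            (r, (c + 1) % COLS)]    # east

print(torus_neighbors(0, 0))        # [(5, 0), (1, 0), (0, 7), (0, 1)]
```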
Microsoft Catapult: CNN
- CNN accelerator, mapped across multiple FPGAs
Microsoft Catapult: Search Ranking
- Feature extraction (1 FPGA):
  - Extracts 4500 features for every document-query pair, e.g. the frequency with which the query appears in the page
  - Systolic array of FSMs
- Free-form expressions (2 FPGAs):
  - Calculates feature combinations
- Machine-learned scoring (1 FPGA for compression, 3 FPGAs to calculate the score):
  - Uses the results of the previous two stages to calculate a floating-point score
- One FPGA allocated as a hot spare
- Free-form expression evaluation:
  - 60-core processor
  - Pipelined cores
  - Each core supports four threads that can hide each other’s latency
  - Threads are statically prioritized according to thread latency
- Version 2 of Catapult:
  - Placed the FPGA between the CPU and the NIC
  - Increased network from 10 Gb/s to 40 Gb/s
  - Also performs network acceleration
  - Shell now consumes 44% of the FPGA
  - The FPGA now performs only feature extraction
Catapult and the Guidelines
- Use dedicated memories:
  - 5 MiB dedicated memory
- Invest resources in arithmetic units and dedicated memories:
  - 3926 ALUs
- Use the easiest form of parallelism that matches the domain:
  - 2D SIMD for the CNN, MISD parallelism for search scoring
- Reduce the data size and type needed for the domain:
  - Uses a mixture of 8-bit integers and 64-bit floating-point
- Use a domain-specific programming language:
  - Uses Verilog RTL; Microsoft did not follow this guideline
Intel Crest
- DNN training
- 16-bit fixed point
- Operates on blocks of 32 x 32 matrices (see the sketch below)
- SRAM plus HBM2
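A sketch of matrix multiplication performed block by block in 32 x 32 tiles, the granularity Crest operates on; float32 stands in for Crest’s 16-bit fixed point, and the scheduling and memory hierarchy are not modeled:

```python
import numpy as np

TILE = 32    # Crest operates on 32 x 32 matrix blocks

def blocked_matmul(A, B):
    """Tile-by-tile matrix multiply; dimensions assumed divisible by TILE."""
    n, m, p = A.shape[0], A.shape[1], B.shape[1]
    C = np.zeros((n, p), dtype=A.dtype)
    for i in range(0, n, TILE):
        for j in range(0, p, TILE):
            for k in range(0, m, TILE):   # accumulate one 32 x 32 block
                C[i:i+TILE, j:j+TILE] += (A[i:i+TILE, k:k+TILE]
                                          @ B[k:k+TILE, j:j+TILE])
    return C

A = np.random.rand(128, 64).astype(np.float32)
B = np.random.rand(64, 96).astype(np.float32)
assert np.allclose(blocked_matmul(A, B), A @ B, atol=1e-4)
```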
Pixel Visual Core
- Image Processing Unit (IPU)
- Performs stencil operations
- Descended from the Image Signal Processor
- Software is written in Halide, a domain-specific language:
  - Compiled to a virtual ISA (vISA)
  - vISA is lowered to the physical ISA (pISA) using application-specific parameters
  - pISA is VLIW
- Optimized for energy:
  - Power budget is 6 to 8 W for bursts of 10-20 seconds, dropping to tens of milliwatts when idle
  - An 8-bit DRAM access takes as much energy as 12,500 8-bit integer operations or 7 to 100 8-bit SRAM accesses (see the worked comparison below)
  - IEEE 754 floating-point operations cost 22x to 150x as much as 8-bit integer operations
- Optimized for 2D access:
  - 2D SIMD unit
  - On-chip SRAM structured using a square geometry
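A worked comparison using the energy ratios above, normalized so that one 8-bit integer operation costs 1 unit; the absolute unit is an assumption, only the ratios come from the slides:

```python
# Energy costs relative to one 8-bit integer operation = 1 unit.
int8_op = 1.0
dram_access = 12_500 * int8_op               # one 8-bit DRAM access
sram_access_hi = 100 * int8_op               # one 8-bit SRAM access (upper end)
fp_op_lo, fp_op_hi = 22 * int8_op, 150 * int8_op   # one IEEE 754 FP op

# 1,000 int8 ops on an operand held in SRAM vs. re-fetched from DRAM:
print(1_000 * int8_op + sram_access_hi)      # 1,100 units with on-chip SRAM
print(1_000 * int8_op + dram_access)         # 13,500 units if DRAM is touched
```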
Visual Core and the Guidelines
- Use dedicated memories:
  - 128 KiB + 64 KiB of dedicated memory per core
- Invest resources in arithmetic units and dedicated memories:
  - 16 x 16 2D array of processing elements per core and a 2D shifting network per core
- Use the easiest form of parallelism that matches the domain:
  - 2D SIMD and VLIW
- Reduce the data size and type needed for the domain:
  - Uses a mixture of 8-bit and 16-bit integers
- Use a domain-specific programming language:
  - Halide for image processing and TensorFlow for CNNs
Fallacies and Pitfalls
- It costs $100 million to design a custom chip
- Performance counters added as an afterthought
- Architects are tackling the right DNN tasks
- For DNN hardware, inferences per second (IPS) is a fair summary performance metric
- Being ignorant of architecture history when designing a DSA