
The University of Adelaide, School of Computer Science
22 November 2018

Computer Architecture: A Quantitative Approach, Sixth Edition
Chapter 7: Domain-Specific Architectures
Copyright © 2019, Elsevier Inc. All rights reserved.

Introduction
- Moore's Law enabled:
  - Deep memory hierarchy
  - Wide SIMD units
  - Deep pipelines
  - Branch prediction
  - Out-of-order execution
  - Speculative prefetching
  - Multithreading
  - Multiprocessing
- Objective:
  - Extract performance from software that is oblivious to architecture

Introduction
- Need factor-of-100 improvements in the number of operations per instruction
  - Requires domain-specific architectures (DSAs)
  - For ASICs, NRE cannot be amortized over large volumes
  - FPGAs are less efficient than ASICs

Guidelines for DSAs
- Use dedicated memories to minimize data movement
- Invest resources into more arithmetic units or bigger memories
- Use the easiest form of parallelism that matches the domain
- Reduce data size and type to the simplest needed for the domain
- Use a domain-specific programming language

Guidelines for DSAs (figure)

Example: Deep Neural Networks
- Inspired by the neurons of the brain
- Computes a non-linear "activation" function of the weighted sum of input values
- Neurons arranged in layers
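The neuron computation just described, a non-linear "activation" of a weighted sum of inputs, can be sketched in a few lines. This is an illustrative model only; ReLU is chosen here as one common activation function, and all names and values are hypothetical.

```python
def relu(x):
    # ReLU activation: pass positive values through, clamp negatives to zero.
    return max(0.0, x)

def neuron(inputs, weights, bias=0.0):
    # Weighted sum of inputs, then a non-linear activation.
    s = sum(i * w for i, w in zip(inputs, weights)) + bias
    return relu(s)

# Example: two inputs, two weights.
print(neuron([1.0, 2.0], [0.5, -0.25]))  # 0.5*1.0 + (-0.25)*2.0 = 0.0
print(neuron([1.0, 2.0], [0.5, 0.25]))   # 0.5*1.0 + 0.25*2.0 = 1.0
```

Arranging many such neurons into layers, with each layer's outputs feeding the next layer's inputs, gives the networks counted in the following slides.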
Example: Deep Neural Networks
- Most practitioners will choose an existing design
  - Topology
  - Data type
- Training (learning):
  - Calculate weights using the backpropagation algorithm
  - Supervised learning: stochastic gradient descent
- Inference: use the neural network for classification

Example: Multi-Layer Perceptrons
- Parameters:
  - Dim[i]: number of neurons
  - Dim[i-1]: dimension of input vector
  - Number of weights: Dim[i-1] x Dim[i]
  - Operations: 2 x Dim[i-1] x Dim[i]
  - Operations/weight: 2

Example: Convolutional Neural Networks
- Computer vision
- Each layer raises the level of abstraction
  - First layer recognizes horizontal and vertical lines
  - Second layer recognizes corners
  - Third layer recognizes shapes
  - Fourth layer recognizes features, such as ears of a dog
  - Higher layers recognize different breeds of dogs
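The MLP counts given above follow directly: each weight participates in one multiply and one add, so operations per weight is always 2. A small sketch, with illustrative layer sizes:

```python
def mlp_layer_counts(dim_in, dim_out):
    # Fully connected layer: every input connects to every output neuron.
    weights = dim_in * dim_out        # Dim[i-1] x Dim[i]
    ops = 2 * weights                 # one multiply + one add per weight
    return weights, ops, ops / weights

# Hypothetical layer: 4096 inputs, 2048 neurons.
w, ops, ops_per_weight = mlp_layer_counts(4096, 2048)
print(w, ops, ops_per_weight)  # 8388608 16777216 2.0
```

The low operations-per-weight ratio is why MLPs are memory-bound unless weights are reused, which motivates the batching discussed later.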
Example: Convolutional Neural Networks
- Parameters:
  - DimFM[i-1]: dimension of the (square) input feature map
  - DimFM[i]: dimension of the (square) output feature map
  - DimSten[i]: dimension of the (square) stencil
  - NumFM[i-1]: number of input feature maps
  - NumFM[i]: number of output feature maps
  - Number of neurons: NumFM[i] x DimFM[i]^2
  - Number of weights per output feature map: NumFM[i-1] x DimSten[i]^2
  - Total number of weights per layer: NumFM[i] x number of weights per output feature map
  - Number of operations per output feature map: 2 x DimFM[i]^2 x number of weights per output feature map
  - Total number of operations per layer: NumFM[i] x number of operations per output feature map = 2 x DimFM[i]^2 x NumFM[i] x number of weights per output feature map = 2 x DimFM[i]^2 x total number of weights per layer
  - Operations/weight: 2 x DimFM[i]^2

Example: Recurrent Neural Networks
- Speech recognition and language translation
- Long short-term memory (LSTM) network

Example: Recurrent Neural Networks
- Parameters:
  - Number of weights per cell: 3 x (3 x Dim x Dim) + (2 x Dim x Dim) + (1 x Dim x Dim) = 12 x Dim^2
  - Number of operations for the 5 vector-matrix multiplies per cell: 2 x number of weights per cell = 24 x Dim^2
  - Number of operations for the 3 element-wise multiplies and 1 addition (vectors are all the size of the output): 4 x Dim
  - Total number of operations per cell (5 vector-matrix multiplies and the 4 element-wise operations): 24 x Dim^2 + 4 x Dim
  - Operations/weight: ~2
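A short sketch applying the CNN and LSTM counting formulas above. All dimensions below are illustrative, not taken from any particular network:

```python
def cnn_layer_counts(num_fm_in, num_fm_out, dim_fm_out, dim_sten):
    # Weights per output feature map: NumFM[i-1] x DimSten[i]^2
    weights_per_ofm = num_fm_in * dim_sten ** 2
    total_weights = num_fm_out * weights_per_ofm
    # Each weight is applied at every output position (multiply + add).
    ops_per_ofm = 2 * dim_fm_out ** 2 * weights_per_ofm
    total_ops = num_fm_out * ops_per_ofm
    return total_weights, total_ops, total_ops / total_weights

def lstm_cell_counts(dim):
    # 3 gates with 3 Dim x Dim matrices each, plus 2 + 1 more matrices.
    weights = 3 * (3 * dim * dim) + 2 * dim * dim + 1 * dim * dim  # 12 Dim^2
    matmul_ops = 2 * weights                                       # 24 Dim^2
    elementwise_ops = 4 * dim  # 3 element-wise multiplies + 1 addition
    total_ops = matmul_ops + elementwise_ops
    return weights, total_ops, total_ops / weights

# Hypothetical CNN layer: 64 -> 128 feature maps, 56x56 output, 3x3 stencil.
print(cnn_layer_counts(64, 128, 56, 3))   # ops/weight = 2 x 56^2 = 6272
# Hypothetical LSTM cell with Dim = 1024: ops/weight is ~2.
print(lstm_cell_counts(1024))
```

The contrast is the point: the CNN reuses each weight DimFM[i]^2 times, while the LSTM, like the MLP, touches each weight only about twice per cell.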
Example: Convolutional Neural Networks
- Batches:
  - Reuse weights once fetched from memory across multiple inputs
  - Increases operational intensity
- Quantization:
  - Use 8- or 16-bit fixed point
- Summary: need the following kernels:
  - Matrix-vector multiply
  - Matrix-matrix multiply
  - Stencil
  - ReLU
  - Sigmoid
  - Hyperbolic tangent

Tensor Processing Unit
- Google's DNN ASIC
- 256 x 256 8-bit matrix multiply unit
- Large software-managed scratchpad
- Coprocessor on the PCIe bus

Tensor Processing Unit (figure)

TPU ISA
- Read_Host_Memory
  - Reads data from the CPU memory into the Unified Buffer
- Read_Weights
  - Reads weights from the Weight Memory into the Weight FIFO as input to the Matrix Unit
- MatrixMultiply/Convolve
  - Performs a matrix-matrix multiply, a vector-matrix multiply, an element-wise matrix multiply, an element-wise vector multiply, or a convolution from the Unified Buffer into the accumulators
  - Takes a variable-sized B x 256 input, multiplies it by a 256 x 256 constant input, and produces a B x 256 output, taking B pipelined cycles to complete
- Activate
  - Computes the activation function
- Write_Host_Memory
  - Writes data from the Unified Buffer into host memory
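A functional sketch of what the MatrixMultiply instruction computes: a B x 256 input times a fixed 256 x 256 weight matrix yields a B x 256 output. The real unit is a pipelined systolic array taking B cycles; this only models the result, and uses a small stand-in N instead of 256 for readability:

```python
N = 4  # illustrative stand-in for the TPU's 256

def matrix_multiply(inputs, weights):
    # inputs: B x N matrix, weights: N x N matrix -> B x N output.
    B = len(inputs)
    out = [[0] * N for _ in range(B)]
    for b in range(B):
        for j in range(N):
            out[b][j] = sum(inputs[b][k] * weights[k][j] for k in range(N))
    return out

# With the identity as the "constant" weight matrix, output equals input.
identity = [[1 if i == j else 0 for j in range(N)] for i in range(N)]
x = [[1, 2, 3, 4], [5, 6, 7, 8]]  # B = 2 rows of input
print(matrix_multiply(x, identity))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Because the weights stay resident in the Matrix Unit while B input rows stream past, a larger batch B amortizes each weight fetch over more operations, the same batching idea noted above.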
TPU ISA (figures)

Improving the TPU (figure)
The TPU and the Guidelines
- Use dedicated memories
  - 24 MiB dedicated buffer, 4 MiB accumulator buffers
- Invest resources in arithmetic units and dedicated memories
  - 60% of the memory and 250x the arithmetic units of a server-class CPU
- Use the easiest form of parallelism that matches the domain
  - Exploits 2D SIMD parallelism
- Reduce the data size and type needed for the domain
  - Primarily uses 8-bit integers
- Use a domain-specific programming language
  - Uses TensorFlow

Microsoft Catapult
- Needed to be general purpose and power efficient
  - Uses FPGA PCIe board with dedicated 20 Gbps network in a 6 x 8 torus
  - Each of the 48 servers in half the rack has a Catapult board
  - Limited to 25 watts
  - 32 MiB flash memory
  - Two banks of DDR3-1600 (11 GB/s) and 8 GiB DRAM
  - FPGA (unconfigured) has 3962 18-bit ALUs and 5 MiB of on-chip memory
  - Programmed in Verilog RTL
  - Shell is 23% of the FPGA

Microsoft Catapult: CNN
- CNN accelerator, mapped across multiple FPGAs (figure)