Pruning and Quantization for Deep Neural Network Acceleration: a Survey A,B A,B,C a A,B a ∗ Tailin Liang , John Glossner , Lei Wang , Shaobo Shi and Xiaotong Zhang

Total Page:16

File Type:pdf, Size:1020Kb

Pruning and Quantization for Deep Neural Network Acceleration: a Survey A,B A,B,C a A,B a ∗ Tailin Liang , John Glossner , Lei Wang , Shaobo Shi and Xiaotong Zhang Pruning and Quantization for Deep Neural Network Acceleration: A Survey a,b a,b,c a a,b a < Tailin Liang , John Glossner , Lei Wang , Shaobo Shi and Xiaotong Zhang , aSchool of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China bHua Xia General Processor Technologies, Beijing 100080, China cGeneral Processor Technologies, Tarrytown, NY 10591, United States ARTICLEINFO ABSTRACT Keywords: Deep neural networks have been applied in many applications exhibiting extraordinary abilities in convolutional neural network the field of computer vision. However, complex network architectures challenge efficient real-time neural network acceleration deployment and require significant computation resources and energy costs. These challenges can neural network quantization be overcome through optimizations such as network compression. Network compression can often neural network pruning be realized with little loss of accuracy. In some cases accuracy may even improve. This paper low-bit mathematics provides a survey on two types of network compression: pruning and quantization. Pruning can be categorized as static if it is performed offline or dynamic if it is performed at run-time. We compare pruning techniques and describe criteria used to remove redundant computations. We discuss trade-offs in element-wise, channel-wise, shape-wise, filter-wise, layer-wise and even network-wise pruning. Quantization reduces computations by reducing the precision of the datatype. Weights, biases, and activations may be quantized typically to 8-bit integers although lower bit width implementations are also discussed including binary neural networks. Both pruning and quantization can be used independently or combined. We compare current techniques, analyze their strengths and weaknesses, present compressed network accuracy results on a number of frameworks, and provide practical guidance for compressing networks. 1. Introduction connections between neurons. Feed forward layers reduce connections by considering only connections in the forward Deep Neural Networks (DNNs) have shown extraordinary path. This reduces the number of connections to N. Other abilities in complicated applications such as image classifica- types of components such as dropout layers can reduce the tion, object detection, voice synthesis, and semantic segmen- number of connections even further. tation [138]. Recent neural network designs with billions of Network Architecture Search (NAS) [63], also known as parameters have demonstrated human-level capabilities but network auto search, programmatically searches for a highly at the cost of significant computational complexity. DNNs efficient network structure from a large predefined search with many parameters are also time-consuming to train [26]. space. An estimator is applied to each produced architecture. These large networks are also difficult to deploy in embedded While time-consuming to compute, the final architecture of- environments. Bandwidth becomes a limiting factor when ten outperforms manually designed networks. moving weights and data between Compute Units (CUs) and Knowledge Distillation (KD) [80, 206] evolved from memory. Over-parameterization is the property of a neural knowledge transfer [27]. The goal is to generate a simpler network where redundant neurons do not improve the accu- compressed model that functions as well as a larger model. racy of results. This redundancy can often be removed with KD trains a student network that tries to imitate a teacher net- little or no accuracy loss [225]. work. The student network is usually but not always smaller Figure 1 shows three design considerations that may con- arXiv:2101.09671v3 [cs.CV] 15 Jun 2021 and shallower than the teacher. The trained student model tribute to over-parameterization: 1) network structure, 2) net- should be less computationally complex than the teacher. work optimization, and 3) hardware accelerator design. These Network optimization [137] includes: 1) computational design considerations are specific to Convolutional Neural convolution optimization, 2) parameter factorization, 3) net- Networks (CNNs) but also generally relevant to DNNs. work pruning, and 4) network quantization. Convolution op- Network structure encompasses three parts: 1) novel com- erations are more efficient than fully connected computations ponents, 2) network architecture search, and 3) knowledge dis- because they keep high dimensional information as a 3D ten- tillation. Novel components is the design of efficient blocks sor rather than flattening the tensors into vectors that removes such as separable convolution, inception blocks, and residual the original spatial information. This feature helps CNNs blocks. They are discussed in Section 2.4. Network com- to fit the underlying structure of image data in particular. ponents also encompasses the types of connections within N2 Convolution layers also require significantly less coefficients layers. Fully connected deep neural networks require compared to Fully Connected Layers (FCLs). Computational < Corresponding author convolution optimizations include Fast Fourier Transform [email protected] (T. Liang); [email protected] (J. (FFT) based convolution [168], Winograd convolution [135], [email protected] [email protected] Glossner); (L. Wang); (S. Shi); and the popular image to column (im2col) [34] approach. [email protected] (X. Zhang) ORCID(s): 0000-0002-7643-912X (T. Liang) We discuss im2col in detail in Section 2.3 since it is directly T Liang et al.: Preprint submitted to Elsevier Page 1 of 41 Survey on pruning and quantization CNN Acceleration [40, 39, 142, 137, 194, 263, 182] Network Optimization Hardware Accelerator [151, 202] Network Structure Convolution Optimization Novel Components Platform Optimization Factorization Network Architecture Search [63] CPU Lookup Table Pruning [201, 24, 12, 250] GPU Knowledge Distillation [80, 206] Computation Reuse Quantization [131, 87] ASIC Memory Optimization FPGA [86,3, 234, 152] ... Figure 1: CNN Acceleration Approaches: Follow the sense from designing to implementing, CNN acceleration could fall into three categories, structure design (or generation), further optimization, and specialized hardware. related to general pruning techniques. include: General Processor Technologies (GPT) [179], ARM, Parameter factorization is a technique that decomposes nVidia, and 60+ others [202] all have processors targeting higher-rank tensors into lower-rank tensors simplifying mem- this space. ASICs may also target both training and inference ory access and compressing model size. It works by breaking in datacenters. Tensor processing units (TPU) from Google large layers into many smaller ones, thereby reducing the [125], Habana from Intel [169], Kunlun from Baidu [191], number of computations. It can be applied to both convolu- Hanguang from Alibaba [124], and Intelligence Processing tional and fully connected layers. This technique can also be Unit (IPU) from Graphcore [121]. applied with pruning and quantization. Programmable reconfigurable FPGAs have been used for Network pruning [201, 24, 12, 250] involves removing neural network acceleration [86, 3, 234, 152]. FPGAs are parameters that don’t impact network accuracy. Pruning can widely used by researchers due to long ASIC design cycles. be performed in many ways and is described extensively in Neural network libraries are available from Xilinx [128] and Section 3. Intel [69]. Specific neural network accelerators are also being Network quantization [131, 87] involves replacing datatypes integrated into FPGA fabrics [248, 4, 203]. Because FPGAs with reduced width datatypes. For example, replacing 32-bit operate at the gate level, they are often used in low-bit width Floating Point (FP32) with 8-bit Integers (INT8). The val- and binary neural networks [178, 267, 197]. ues can often be encoded to preserve more information than Neural network specific optimizations are typically in- simple conversion. Quantization is described extensively in corporated into custom ASIC hardware. Lookup tables can Section 4. be used to accelerate trigonometric activation functions [46] Hardware accelerators [151, 202] are designed primarily or directly generate results for low bit-width arithmetic [65], for network acceleration. At a high level they encompass partial products can be stored in special registers and reused entire processor platforms and often include hardware opti- [38], and memory access ordering with specialized address- mized for neural networks. Processor platforms include spe- ing hardware can all reduce the number of cycles to compute cialized Central Processing Unit (CPU) instructions, Graph- a neural network output [126]. Hardware accelerators are ics Processing Units (GPUs), Application Specific Integrated not the primary focus of this paper. However, we do note Circuits (ASICs), and Field Programmable Gate Arrays (FP- hardware implementations that incorporate specific accelera- GAs). tion techniques. Further background information on efficient CPUs have been optimized with specialized Artificial processing and hardware implementations of DNNs can be Intelligence (AI) instructions usually within specialized Sin- found in [225]. gle Instruction Multiple Data (SIMD) units [49, 11]. While We summarize our main contributions as follows: CPUs can be used for training, they have primarily been used for inference in systems that do not have specialized inference • We provide a review of two network compression
Recommended publications
  • Lab 7: Floating-Point Addition 0.0
    Lab 7: Floating-Point Addition 0.0 Introduction In this lab, you will write a MIPS assembly language function that performs floating-point addition. You will then run your program using PCSpim (just as you did in Lab 6). For testing, you are provided a program that calls your function to compute the value of the mathematical constant e. For those with no assembly language experience, this will be a long lab, so plan your time accordingly. Background You should be familiar with the IEEE 754 Floating-Point Standard, which is described in Section 3.6 of your book. (Hopefully you have read that section carefully!) Here we will be dealing only with single precision floating- point values, which are formatted as follows (this is also described in the “Floating-Point Representation” subsec- tion of Section 3.6 in your book): Sign Exponent (8 bits) Significand (23 bits) 31 30 29 ... 24 23 22 21 ... 0 Remember that the exponent is biased by 127, which means that an exponent of zero is represented by 127 (01111111). The exponent is not encoded using 2s-complement. The significand is always positive, and the sign bit is kept separately. Note that the actual significand is 24 bits long; the first bit is always a 1 and thus does not need to be stored explicitly. This will be important to remember when you write your function! There are several details of IEEE 754 that you will not have to worry about in this lab. For example, the expo- nents 00000000 and 11111111 are reserved for special purposes that are described in your book (representing zero, denormalized numbers and NaNs).
    [Show full text]
  • The Hexadecimal Number System and Memory Addressing
    C5537_App C_1107_03/16/2005 APPENDIX C The Hexadecimal Number System and Memory Addressing nderstanding the number system and the coding system that computers use to U store data and communicate with each other is fundamental to understanding how computers work. Early attempts to invent an electronic computing device met with disappointing results as long as inventors tried to use the decimal number sys- tem, with the digits 0–9. Then John Atanasoff proposed using a coding system that expressed everything in terms of different sequences of only two numerals: one repre- sented by the presence of a charge and one represented by the absence of a charge. The numbering system that can be supported by the expression of only two numerals is called base 2, or binary; it was invented by Ada Lovelace many years before, using the numerals 0 and 1. Under Atanasoff’s design, all numbers and other characters would be converted to this binary number system, and all storage, comparisons, and arithmetic would be done using it. Even today, this is one of the basic principles of computers. Every character or number entered into a computer is first converted into a series of 0s and 1s. Many coding schemes and techniques have been invented to manipulate these 0s and 1s, called bits for binary digits. The most widespread binary coding scheme for microcomputers, which is recog- nized as the microcomputer standard, is called ASCII (American Standard Code for Information Interchange). (Appendix B lists the binary code for the basic 127- character set.) In ASCII, each character is assigned an 8-bit code called a byte.
    [Show full text]
  • Penguin Computing Upgrades Corona with Latest AMD Radeon Instinct GPU Technology for Enhanced ML and AI Capabilities
    Penguin Computing Upgrades Corona with latest AMD Radeon Instinct GPU Technology for Enhanced ML and AI Capabilities November 18, 2019 Fremont, CA., November 18, 2019 -Penguin Computing, a leader in high-performance computing (HPC), artificial intelligence (AI), and enterprise data center solutions and services, today announced that Corona, an HPC cluster first delivered to Lawrence Livermore National Lab (LLNL) in late 2018, has been upgraded with the newest AMD Radeon Instinct™ MI60 accelerators, based on Vega which, per AMD, is the World’s 1st 7nm GPU architecture that brings PCIe® 4.0 support. This upgrade is the latest example of Penguin Computing and LLNL’s ongoing collaboration aimed at providing additional capabilities to the LLNL user community. As previously released, the cluster consists of 170 two-socket nodes with 24-core AMD EPYCTM 7401 processors and a PCIe 1.6 Terabyte (TB) nonvolatile (solid-state) memory device. Each Corona compute node is GPU-ready with half of those nodes today utilizing four AMD Radeon Instinct MI25 accelerators per node, delivering 4.2 petaFLOPS of FP32 peak performance. With the MI60 upgrade, the cluster increases its potential PFLOPS peak performance to 9.45 petaFLOPS of FP32 peak performance. This brings significantly greater performance and AI capabilities to the research communities. “The Penguin Computing DOE team continues our collaborative venture with our vendor partners AMD and Mellanox to ensure the Livermore Corona GPU enhancements expand the capabilities to continue their mission outreach within various machine learning communities,” said Ken Gudenrath, Director of Federal Systems at Penguin Computing. Corona is being made available to industry through LLNL’s High Performance Computing Innovation Center (HPCIC).
    [Show full text]
  • POINTER (IN C/C++) What Is a Pointer?
    POINTER (IN C/C++) What is a pointer? Variable in a program is something with a name, the value of which can vary. The way the compiler and linker handles this is that it assigns a specific block of memory within the computer to hold the value of that variable. • The left side is the value in memory. • The right side is the address of that memory Dereferencing: • int bar = *foo_ptr; • *foo_ptr = 42; // set foo to 42 which is also effect bar = 42 • To dereference ted, go to memory address of 1776, the value contain in that is 25 which is what we need. Differences between & and * & is the reference operator and can be read as "address of“ * is the dereference operator and can be read as "value pointed by" A variable referenced with & can be dereferenced with *. • Andy = 25; • Ted = &andy; All expressions below are true: • andy == 25 // true • &andy == 1776 // true • ted == 1776 // true • *ted == 25 // true How to declare pointer? • Type + “*” + name of variable. • Example: int * number; • char * c; • • number or c is a variable is called a pointer variable How to use pointer? • int foo; • int *foo_ptr = &foo; • foo_ptr is declared as a pointer to int. We have initialized it to point to foo. • foo occupies some memory. Its location in memory is called its address. &foo is the address of foo Assignment and pointer: • int *foo_pr = 5; // wrong • int foo = 5; • int *foo_pr = &foo; // correct way Change the pointer to the next memory block: • int foo = 5; • int *foo_pr = &foo; • foo_pr ++; Pointer arithmetics • char *mychar; // sizeof 1 byte • short *myshort; // sizeof 2 bytes • long *mylong; // sizeof 4 byts • mychar++; // increase by 1 byte • myshort++; // increase by 2 bytes • mylong++; // increase by 4 bytes Increase pointer is different from increase the dereference • *P++; // unary operation: go to the address of the pointer then increase its address and return a value • (*P)++; // get the value from the address of p then increase the value by 1 Arrays: • int array[] = {45,46,47}; • we can call the first element in the array by saying: *array or array[0].
    [Show full text]
  • Subtyping Recursive Types
    ACM Transactions on Programming Languages and Systems, 15(4), pp. 575-631, 1993. Subtyping Recursive Types Roberto M. Amadio1 Luca Cardelli CNRS-CRIN, Nancy DEC, Systems Research Center Abstract We investigate the interactions of subtyping and recursive types, in a simply typed λ-calculus. The two fundamental questions here are whether two (recursive) types are in the subtype relation, and whether a term has a type. To address the first question, we relate various definitions of type equivalence and subtyping that are induced by a model, an ordering on infinite trees, an algorithm, and a set of type rules. We show soundness and completeness between the rules, the algorithm, and the tree semantics. We also prove soundness and a restricted form of completeness for the model. To address the second question, we show that to every pair of types in the subtype relation we can associate a term whose denotation is the uniquely determined coercion map between the two types. Moreover, we derive an algorithm that, when given a term with implicit coercions, can infer its least type whenever possible. 1This author's work has been supported in part by Digital Equipment Corporation and in part by the Stanford-CNR Collaboration Project. Page 1 Contents 1. Introduction 1.1 Types 1.2 Subtypes 1.3 Equality of Recursive Types 1.4 Subtyping of Recursive Types 1.5 Algorithm outline 1.6 Formal development 2. A Simply Typed λ-calculus with Recursive Types 2.1 Types 2.2 Terms 2.3 Equations 3. Tree Ordering 3.1 Subtyping Non-recursive Types 3.2 Folding and Unfolding 3.3 Tree Expansion 3.4 Finite Approximations 4.
    [Show full text]
  • A Variable Precision Hardware Acceleration for Scientific Computing Andrea Bocco
    A variable precision hardware acceleration for scientific computing Andrea Bocco To cite this version: Andrea Bocco. A variable precision hardware acceleration for scientific computing. Discrete Mathe- matics [cs.DM]. Université de Lyon, 2020. English. NNT : 2020LYSEI065. tel-03102749 HAL Id: tel-03102749 https://tel.archives-ouvertes.fr/tel-03102749 Submitted on 7 Jan 2021 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. N°d’ordre NNT : 2020LYSEI065 THÈSE de DOCTORAT DE L’UNIVERSITÉ DE LYON Opérée au sein de : CEA Grenoble Ecole Doctorale InfoMaths EDA N17° 512 (Informatique Mathématique) Spécialité de doctorat :Informatique Soutenue publiquement le 29/07/2020, par : Andrea Bocco A variable precision hardware acceleration for scientific computing Devant le jury composé de : Frédéric Pétrot Président et Rapporteur Professeur des Universités, TIMA, Grenoble, France Marc Dumas Rapporteur Professeur des Universités, École Normale Supérieure de Lyon, France Nathalie Revol Examinatrice Docteure, École Normale Supérieure de Lyon, France Fabrizio Ferrandi Examinateur Professeur associé, Politecnico di Milano, Italie Florent de Dinechin Directeur de thèse Professeur des Universités, INSA Lyon, France Yves Durand Co-directeur de thèse Docteur, CEA Grenoble, France Cette thèse est accessible à l'adresse : http://theses.insa-lyon.fr/publication/2020LYSEI065/these.pdf © [A.
    [Show full text]
  • Survey and Benchmarking of Machine Learning Accelerators
    1 Survey and Benchmarking of Machine Learning Accelerators Albert Reuther, Peter Michaleas, Michael Jones, Vijay Gadepally, Siddharth Samsi, and Jeremy Kepner MIT Lincoln Laboratory Supercomputing Center Lexington, MA, USA freuther,pmichaleas,michael.jones,vijayg,sid,[email protected] Abstract—Advances in multicore processors and accelerators components play a major role in the success or failure of an have opened the flood gates to greater exploration and application AI system. of machine learning techniques to a variety of applications. These advances, along with breakdowns of several trends including Moore’s Law, have prompted an explosion of processors and accelerators that promise even greater computational and ma- chine learning capabilities. These processors and accelerators are coming in many forms, from CPUs and GPUs to ASICs, FPGAs, and dataflow accelerators. This paper surveys the current state of these processors and accelerators that have been publicly announced with performance and power consumption numbers. The performance and power values are plotted on a scatter graph and a number of dimensions and observations from the trends on this plot are discussed and analyzed. For instance, there are interesting trends in the plot regarding power consumption, numerical precision, and inference versus training. We then select and benchmark two commercially- available low size, weight, and power (SWaP) accelerators as these processors are the most interesting for embedded and Fig. 1. Canonical AI architecture consists of sensors, data conditioning, mobile machine learning inference applications that are most algorithms, modern computing, robust AI, human-machine teaming, and users (missions). Each step is critical in developing end-to-end AI applications and applicable to the DoD and other SWaP constrained users.
    [Show full text]
  • CS31 Discussion 1E Spring 17’: Week 08
    CS31 Discussion 1E Spring 17’: week 08 TA: Bo-Jhang Ho [email protected] Credit to former TA Chelsea Ju Project 5 - Map cipher to crib } Approach 1: For each pair of positions, check two letters in cipher and crib are both identical or different } For each pair of positions pos1 and pos2, cipher[pos1] == cipher[pos2] should equal to crib[pos1] == crib[pos2] } Approach 2: Get the positions of the letter and compare } For each position pos, indexes1 = getAllPositions(cipher, pos, cipher[pos]) indexes2 = getAllPositions(crib, pos, crib[pos]) indexes1 should equal to indexes2 Project 5 - Map cipher to crib } Approach 3: Generate the mapping } We first have a mapping array char cipher2crib[128] = {‘\0’}; } Then whenever we attempt to map letterA in cipher to letterB in crib, we first check whether it violates the previous setup: } Is cipher2crib[letterA] != ‘\0’? // implies letterA has been used } Then cipher2crib[letterA] should equal to letterB } If no violation happens, } cipher2crib[letterA] = letterB; Road map of CS31 String Array Function Class Variable If / else Loops Pointer!! Cheat sheet Data type } int *a; } a is a variable, whose data type is int * } a stores an address of some integer “Use as verbs” } & - address-of operator } &b means I want to get the memory address of variable b } * - dereference operator } *c means I want to retrieve the value in address c Memory model Address Memory Contents 1000 1 byte 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 Memory model Address Memory
    [Show full text]
  • Videocard Benchmarks
    Software Hardware Benchmarks Services Store Support About Us Forums 0 CPU Benchmarks Video Card Benchmarks Hard Drive Benchmarks RAM PC Systems Android iOS / iPhone Videocard Benchmarks Over 1,000,000 Video Cards Benchmarked Video Card List Below is an alphabetical list of all Video Card types that appear in the charts. Clicking on a specific Video Card will take you to the chart it appears in and will highlight it for you. Find Videocard VIDEO CARD Single Video Card Passmark G3D Rank Videocard Value Price Videocard Name Mark (lower is better) (higher is better) (USD) High End (higher is better) 3DP Edition 826 822 NA NA High Mid Range Low Mid Range 9xx Soldiers sans frontiers Sigma 2 21 1926 NA NA Low End 15FF 8229 114 NA NA 64MB DDR GeForce3 Ti 200 5 2004 NA NA Best Value Common 64MB GeForce2 MX with TV Out 2 2103 NA NA Market Share (30 Days) 128 DDR Radeon 9700 TX w/TV-Out 44 1825 NA NA 128 DDR Radeon 9800 Pro 62 1768 NA NA 0 Compare 128MB DDR Radeon 9800 Pro 66 1757 NA NA 128MB RADEON X600 SE 49 1809 NA NA Video Card Mega List 256MB DDR Radeon 9800 XT 37 1853 NA NA Search Model 256MB RADEON X600 67 1751 NA NA GPU Compute 7900 MOD - Radeon HD 6520G 610 1040 NA NA Video Card Chart 7900 MOD - Radeon HD 6550D 892 775 NA NA A6 Micro-6500T Quad-Core APU with RadeonR4 220 1421 NA NA A10-8700P 513 1150 NA NA ABIT Siluro T400 3 2059 NA NA ALL-IN-WONDER 9000 4 2024 NA NA ALL-IN-WONDER 9800 23 1918 NA NA ALL-IN-WONDER RADEON 8500DV 5 2009 NA NA ALL-IN-WONDER X800 GT 84 1686 NA NA All-in-Wonder X1800XL 30 1889 NA NA All-in-Wonder X1900 127 1552
    [Show full text]
  • Chapter 9: Memory Management
    Chapter 9: Memory Management I Background I Swapping I Contiguous Allocation I Paging I Segmentation I Segmentation with Paging Operating System Concepts 9.1 Silberschatz, Galvin and Gagne 2002 Background I Program must be brought into memory and placed within a process for it to be run. I Input queue – collection of processes on the disk that are waiting to be brought into memory to run the program. I User programs go through several steps before being run. Operating System Concepts 9.2 Silberschatz, Galvin and Gagne 2002 Binding of Instructions and Data to Memory Address binding of instructions and data to memory addresses can happen at three different stages. I Compile time: If memory location known a priori, absolute code can be generated; must recompile code if starting location changes. I Load time: Must generate relocatable code if memory location is not known at compile time. I Execution time: Binding delayed until run time if the process can be moved during its execution from one memory segment to another. Need hardware support for address maps (e.g., base and limit registers). Operating System Concepts 9.3 Silberschatz, Galvin and Gagne 2002 Multistep Processing of a User Program Operating System Concepts 9.4 Silberschatz, Galvin and Gagne 2002 Logical vs. Physical Address Space I The concept of a logical address space that is bound to a separate physical address space is central to proper memory management. ✦ Logical address – generated by the CPU; also referred to as virtual address. ✦ Physical address – address seen by the memory unit. I Logical and physical addresses are the same in compile- time and load-time address-binding schemes; logical (virtual) and physical addresses differ in execution-time address-binding scheme.
    [Show full text]
  • AI Chips: What They Are and Why They Matter
    APRIL 2020 AI Chips: What They Are and Why They Matter An AI Chips Reference AUTHORS Saif M. Khan Alexander Mann Table of Contents Introduction and Summary 3 The Laws of Chip Innovation 7 Transistor Shrinkage: Moore’s Law 7 Efficiency and Speed Improvements 8 Increasing Transistor Density Unlocks Improved Designs for Efficiency and Speed 9 Transistor Design is Reaching Fundamental Size Limits 10 The Slowing of Moore’s Law and the Decline of General-Purpose Chips 10 The Economies of Scale of General-Purpose Chips 10 Costs are Increasing Faster than the Semiconductor Market 11 The Semiconductor Industry’s Growth Rate is Unlikely to Increase 14 Chip Improvements as Moore’s Law Slows 15 Transistor Improvements Continue, but are Slowing 16 Improved Transistor Density Enables Specialization 18 The AI Chip Zoo 19 AI Chip Types 20 AI Chip Benchmarks 22 The Value of State-of-the-Art AI Chips 23 The Efficiency of State-of-the-Art AI Chips Translates into Cost-Effectiveness 23 Compute-Intensive AI Algorithms are Bottlenecked by Chip Costs and Speed 26 U.S. and Chinese AI Chips and Implications for National Competitiveness 27 Appendix A: Basics of Semiconductors and Chips 31 Appendix B: How AI Chips Work 33 Parallel Computing 33 Low-Precision Computing 34 Memory Optimization 35 Domain-Specific Languages 36 Appendix C: AI Chip Benchmarking Studies 37 Appendix D: Chip Economics Model 39 Chip Transistor Density, Design Costs, and Energy Costs 40 Foundry, Assembly, Test and Packaging Costs 41 Acknowledgments 44 Center for Security and Emerging Technology | 2 Introduction and Summary Artificial intelligence will play an important role in national and international security in the years to come.
    [Show full text]
  • Objective-C Internals Realmac Software
    André Pang Objective-C Internals Realmac Software 1 Nice license plate, eh? In this talk, we peek under the hood and have a look at Objective-C’s engine: how objects are represented in memory, and how message sending works. What is an object? 2 To understand what an object really is, we dive down to the lowest-level of the object: what it actually looks like in memory. And to understand Objective-C’s memory model, one must first understand C’s memory model… What is an object? int i; i = 0xdeadbeef; de ad be ef 3 Simple example: here’s how an int is represented in C on a 32-bit machine. 32 bits = 4 bytes, so the int might look like this. What is an object? int i; i = 0xdeadbeef; ef be af de 4 Actually, the int will look like this on Intel-Chip Based Macs (or ICBMs, as I like to call them), since ICBMs are little- endian. What is an object? int i; i = 0xdeadbeef; de ad be ef 5 … but for simplicity, we’ll assume memory layout is big-endian for this presentation, since it’s easier to read. What is an object? typedef struct _IntContainer { int i; } IntContainer; IntContainer ic; ic.i = 0xdeadbeef; de ad be ef 6 This is what a C struct that contains a single int looks like. In terms of memory layout, a struct with a single int looks exactly the same as just an int[1]. This is very important, because it means that you can cast between an IntContainer and an int for a particular value with no loss of precision.
    [Show full text]