AdaptivFloat: A Floating-Point Based Data Type for Resilient Deep Learning Inference
Thierry Tambe¹  En-Yu Yang¹  Zishen Wan¹  Yuntian Deng¹  Vijay Janapa Reddi¹  Alexander Rush²  David Brooks¹  Gu-Yeon Wei¹

¹Harvard University, Cambridge, MA, USA   ²Cornell Tech, New York, NY, USA

Preprint.

ABSTRACT

Conventional hardware-friendly quantization methods, such as fixed-point or integer, tend to perform poorly at very low word sizes as their shrinking dynamic ranges cannot adequately capture the wide data distributions commonly seen in sequence transduction models. We present AdaptivFloat, a floating-point inspired number representation format for deep learning that dynamically maximizes and optimally clips its available dynamic range, at a layer granularity, in order to create faithful encodings of neural network parameters. AdaptivFloat consistently produces higher inference accuracies compared to block floating-point, uniform, IEEE-like float or posit encodings at very low precision (≤ 8-bit) across a diverse set of state-of-the-art neural network topologies. Notably, AdaptivFloat is seen surpassing baseline FP32 performance by up to +0.3 in BLEU score and -0.75 in word error rate at weight bit widths that are ≤ 8-bit. Experimental results on a deep neural network (DNN) hardware accelerator exploiting AdaptivFloat logic in its computational datapath demonstrate per-operation energy and area that are 0.9× and 1.14×, respectively, those of equivalent bit width integer-based accelerator variants.

1 INTRODUCTION

Deep learning approaches have transformed representation learning in a multitude of tasks. Recurrent Neural Networks (RNNs) are now the standard solution for speech recognition, exhibiting remarkably low word error rates (Chiu et al., 2017), while neural machine translation has narrowed the performance gap versus human translators (Wu et al., 2016). Convolutional Neural Networks (CNNs) are now the dominant engine behind image processing and have been pushing the frontiers in many computer vision applications (Krizhevsky et al., 2012; He et al., 2016). Today, deep neural networks (DNNs) are deployed at all computing scales, from resource-constrained IoT edge devices to massive data center farms. In order to extract higher compute density and energy efficiency on these compute platforms, a plethora of reduced-precision quantization techniques have been proposed.

In this line of research, a large body of work has focused on fixed-point encodings (Choi et al., 2019; Gupta et al., 2015; Lin et al., 2015; Hwang & Sung, 2014; Courbariaux et al., 2015) or uniform quantization via integer (Migacz, 2017; Jacob et al., 2017). These fixed-point techniques are frequently evaluated on shallow models or on CNNs exhibiting relatively narrow weight distributions. However, as can be seen from Figure 1, sequence transduction models with layer normalization such as the Transformer (Vaswani et al., 2017) can contain weights more than an order of magnitude larger than those from popular CNN models with batch normalization such as ResNet-50, Inception-v3 or DenseNet-201. The reason for this phenomenon is that batch normalization effectively produces a weight normalization side effect (Salimans & Kingma, 2016) whereas layer normalization adopts invariance properties that do not reparameterize the network (Ba et al., 2016).

Figure 1. Histogram of the weight distribution for (a) ResNet-50 (max weight 1.32, min -0.78), (b) Inception-v3 (max 1.27, min -1.20), (c) DenseNet-201 (max 1.33, min -0.92) and (d) Transformer (max 20.41, min -12.46), whose weight values are more than 10× higher than the maximum weight value from popular CNNs.
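To make the impact of such wide distributions concrete, consider a symmetric uniform (integer) quantizer whose step size is dictated by the largest magnitude it must represent. The short sketch below is only an illustration of this point; the helper name and the chosen bit widths are our own assumptions and are not taken from the paper.

```python
# Illustrative sketch: step size of a symmetric uniform (integer) quantizer
# as a function of the largest magnitude it must cover. The max |weight|
# values mirror those reported in Figure 1.

def uniform_step(max_abs: float, n_bits: int) -> float:
    """Step size of a symmetric n-bit integer quantizer covering [-max_abs, +max_abs]."""
    return max_abs / (2 ** (n_bits - 1) - 1)

for name, max_abs in [("ResNet-50", 1.32), ("Transformer", 20.41)]:
    for n_bits in (8, 6, 4):
        print(f"{name:12s} {n_bits}-bit step = {uniform_step(max_abs, n_bits):.4f}")

# With 8 bits, the Transformer step (~0.16) is ~15x coarser than ResNet-50's (~0.01),
# so the many small-magnitude weights collapse onto very few quantization levels.
```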
In the pursuit of wider dynamic range and improved numerical accuracy, there has been surging interest in floating-point based (Drumond et al., 2018; Köster et al., 2017), logarithmic (Johnson, 2018; Miyashita et al., 2016) and posit representations (Gustafson & Yonemoto, 2017), which also form the inspiration of this work.

AdaptivFloat improves on the aforementioned techniques by dynamically maximizing its available dynamic range at a neural network layer granularity. And unlike block floating-point (BFP) approaches with shared exponents that may lead to degraded rendering of smaller magnitude weights, AdaptivFloat achieves higher inference accuracy by remaining committed to the standard floating-point delineation of independent exponent and mantissa bits for each tensor element. However, we break from IEEE 754 standard compliance with a unique clamping strategy for denormal numbers and with a customized proposition for zero assignment, which enables us to engineer leaner hardware.
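To give a rough feel for how such a layer-adaptive float encoding can behave, the sketch below quantizes a tensor with a small float format whose exponent range is shifted according to max|W|, flushing sub-minimal magnitudes to zero and clipping the rest. This is a minimal sketch of the idea as described above, not the paper's Section 3 algorithm; the function name, the exponent-bias selection rule and the flush-to-zero threshold are our own assumptions.

```python
# Minimal sketch of a per-layer adaptive float quantizer in the spirit of the
# scheme described above: shift the exponent range so the format's top binade
# covers max|W|, clamp sub-minimal magnitudes, and round mantissas.
# NOT the paper's exact algorithm; bias choice and zero rule are assumptions.
import numpy as np

def quantize_adaptivfloat_like(w: np.ndarray, n_bits: int = 8, n_exp: int = 3) -> np.ndarray:
    """Quantize w with 1 sign bit, n_exp exponent bits and (n_bits-1-n_exp)
    mantissa bits, choosing the exponent bias per tensor from max|w|."""
    n_mant = n_bits - 1 - n_exp
    max_abs = float(np.max(np.abs(w)))
    assert max_abs > 0, "an all-zero tensor needs no quantization"
    # Largest mantissa value of the format (1.xxx with n_mant fractional bits).
    mant_top = 2.0 - 2.0 ** (-n_mant)
    # Shift the exponent range so the top binade covers max|w|.
    exp_max = int(np.ceil(np.log2(max_abs / mant_top)))
    exp_min = exp_max - (2 ** n_exp - 1)          # 2**n_exp binades available
    v_min = 2.0 ** exp_min                        # smallest normal magnitude
    v_max = mant_top * 2.0 ** exp_max             # largest representable magnitude
    sign, mag = np.sign(w), np.abs(w)
    # Magnitudes well below v_min flush to zero; the rest are clamped into range.
    out = np.where(mag < v_min / 2, 0.0, np.clip(mag, v_min, v_max))
    nz = out > 0
    # Round the mantissa of each nonzero value to n_mant fractional bits.
    e = np.floor(np.log2(out[nz]))
    m = np.round(out[nz] / 2.0 ** e * 2 ** n_mant) / 2 ** n_mant
    out[nz] = np.minimum(m * 2.0 ** e, v_max)
    return sign * out

# Example: an 8-bit (1/3/4) encoding adapts its range to this tensor's max |w|.
w = np.array([0.003, -0.07, 0.5, -1.1, 19.7])
print(quantize_adaptivfloat_like(w))   # -> [ 0.    -0.125  0.5   -1.125 20.   ]
```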
Rather than proposing binary or ternary quantization techniques evaluated on a small number of carefully selected models, through AdaptivFloat we aim to inform a generalized floating-point based mathematical blueprint for adaptive and resilient DNN quantization that can be easily applied to neural models of various categories (CNN, RNN or MLP), layer depths and parameter statistics. By virtue of an algorithm-hardware co-design, we also propose a processing element implementation that exploits the AdaptivFloat arithmetic in its computational datapath in order to yield energy efficiencies that surpass those of integer-based variants. Furthermore, owing to the superior performance of AdaptivFloat at very low word sizes, as will be shown, higher compute density can be acquired at a lower penalty in computational accuracy compared to block floating-point, integer, or non-adaptive IEEE-like float or posit encodings. Altogether, the AdaptivFloat algorithm-hardware co-design framework offers a compelling alternative to integer or fixed-point solutions.

Finally, we note that the AdaptivFloat encoding scheme is self-supervised as it only relies on unlabeled data distributions in the network.

This paper makes the following contributions:

• We propose and describe AdaptivFloat: a floating-point based data encoding algorithm for deep learning, which maximizes its dynamic range at a neural network layer granularity by dynamically shifting its exponent range and by optimally clipping its representable datapoints.

• We evaluate AdaptivFloat across a diverse set of DNN models and tasks and show that it achieves higher classification and prediction accuracies compared to equivalent bit width uniform, block floating-point and non-adaptive posit and float quantization techniques.

• We propose a hybrid float-integer (HFINT) PE implementation that exploits the AdaptivFloat mechanism and provides a cost-effective compromise between the high accuracy of floating-point computations and the greater hardware density of fixed-point post-processing. We show that the HFINT PE produces higher energy efficiencies compared to conventional monolithic integer-based PEs.

• We design and characterize an accelerator system targeted for sequence-to-sequence neural networks and show that, when integrated with HFINT PEs, it achieves lower overall power consumption than an integer-based variant.

The rest of the paper is structured as follows. A summary of prominent number and quantization schemes used in deep learning is provided in Section 2. We present the intuition and a detailed description of the AdaptivFloat algorithm in Section 3. The efficacy and resiliency of AdaptivFloat are demonstrated in Section 4 across DNN models of varying parameter distributions. Section 5 describes the hardware modeling, with energy, area and performance efficiency results reported in Section 6. Section 7 concludes the paper.

2 RELATED WORK

Quantization Techniques. Low-precision DNN training and inference have been researched heavily in recent years with the aim of saving energy and memory costs. A rather significant percentage of prior work in this domain (Wu et al., 2015; Mishra et al., 2017; Park et al., 2017; Zhou et al., 2016; Cai et al., 2017; Zhang et al., 2018; Han et al., 2015) has focused on or evaluated its low-precision strategies strictly on CNNs or on models with narrow parameter distributions. Notably, inference performance with modest accuracy degradation has been demonstrated with binary (Courbariaux & Bengio, 2016), ternary (Zhu et al., 2016), and quaternary weight precision (Choi et al., 2019). Often, tricks such as skipping quantization on the sensitive first and last layers are performed in order to escape steeper end-to-end accuracy loss.

Extending these aggressive quantization techniques to RNNs has been reported (Alom et al., 2018), although still with recurrent models exhibiting the same narrow distributions seen in many CNNs. Park et al. (2018) noticed that large-magnitude weights bear a higher impact on model performance and proposed outlier-aware quantization, which requires separate low and high bit-width precision for small and outlier weight values, respectively. However, this technique complicates the hardware implementation by requiring two separate PE datapaths for the small and the outlier weights.

Hardware-Friendly Encodings. Linear fixed-point or uniform integer quantization is commonly used for deep learning hardware acceleration (Jouppi et al., 2017; Jacob et al., 2017; Reagen et al., 2016) as it presents an area- and energy-cost-effective solution compared to floating-point based processors.

[Figure: "Floating points w/o denormal" vs. "Floating points w/o denormal but sacrifice ±min for ±0": the smallest magnitudes ±0.25 are traded for ±0, while ±0.375, ±0.5, ... are kept unchanged.]
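The value lists referenced in the figure above can be reproduced with a toy 4-bit float format (1 sign, 2 exponent and 1 mantissa bit, with an exponent bias picked so the smallest normal magnitude is 0.25). The snippet below enumerates its representable values and shows the variant that gives up +/-min in exchange for an exact +/-0; the format parameters are our own illustrative assumptions.

```python
# Illustrative sketch: representable values of a toy 4-bit float
# (1 sign, 2 exponent, 1 mantissa bit, no denormals) and the variant that
# sacrifices the smallest magnitude +/-min to encode +/-0 instead.
# Parameters are assumptions chosen to reproduce the 0.25 / 0.375 / 0.5 values
# shown in the figure above.
EXP_BITS, MANT_BITS, EXP_BIAS = 2, 1, -2

def magnitudes():
    """All positive values of the format: 2^(e + EXP_BIAS) * (1 + m / 2^MANT_BITS)."""
    return [2.0 ** (e + EXP_BIAS) * (1 + m / 2 ** MANT_BITS)
            for e in range(2 ** EXP_BITS)
            for m in range(2 ** MANT_BITS)]

vals = magnitudes()                       # [0.25, 0.375, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0]
no_denormal = sorted(-v for v in vals) + sorted(vals)                   # no zero code
with_zero = sorted(-v for v in vals[1:]) + [0.0] + sorted(vals[1:])     # +/-0.25 -> +/-0

print("w/o denormal:             ", no_denormal)
print("sacrifice +/-min for +/-0:", with_zero)
# The two codes freed by dropping +0.25 and -0.25 serve as +0 and -0,
# while +/-0.375, +/-0.5 and all larger values are unchanged.
```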