Delta-DNN: Efficiently Compressing Deep Neural Networks via Exploiting Floats Similarity
Zhenbo Hu§,*, Xiangyu Zou§,*, Wen Xia§,$, Sian Jinζ, Dingwen Taoζ, Yang Liu§,$, Weizhe Zhang§,$, Zheng Zhang§
§School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
$Cyberspace Security Research Center, Peng Cheng Laboratory, Shenzhen, China
ζSchool of Electrical Engineering and Computer Science, Washington State University, WA, USA
[email protected], [email protected], {xiawen,liu.yang,wzzhang,zhengzhang}@hit.edu.cn, {sian.jin,dingwen.tao}@wsu.edu

* Z. Hu and X. Zou equally contributed to this work.

ABSTRACT
Deep neural networks (DNNs) have gained considerable attention in various real-world applications due to their strong performance on representation learning. However, a DNN needs to be trained for many epochs to pursue a higher inference accuracy, which requires storing sequential versions of the DNN and releasing the updated versions to users. As a result, large amounts of storage and network resources are required, significantly hampering DNN utilization on resource-constrained platforms (e.g., IoT, mobile phones).

In this paper, we present a novel delta compression framework called Delta-DNN, which can efficiently compress the floating-point numbers in DNNs by exploiting the floats similarity existing in DNNs during training. Specifically, (1) we observe the high similarity of floating-point numbers between the neighboring versions of a neural network in training; (2) inspired by the delta compression technique, we only record the delta (i.e., the differences) between two neighboring versions, instead of storing the full new version of the DNN; (3) we use error-bounded lossy compression to compress the delta data for a high compression ratio, where the error bound is strictly assessed by an acceptable loss of DNNs' inference accuracy; (4) we evaluate Delta-DNN's performance in two scenarios, including reducing the transmission of releasing DNNs over the network and saving the storage space occupied by multiple versions of DNNs.

According to experimental results on six popular DNNs, Delta-DNN achieves a compression ratio 2×-10× higher than state-of-the-art methods, without sacrificing inference accuracy or changing the neural network structure.

KEYWORDS
Lossy compression, neural network, delta compression

1 INTRODUCTION
In recent years, deep neural networks (DNNs) have been widely applied to many artificial intelligence tasks in various scientific and technical fields, such as computer vision [44] and natural language processing [4]. Generally, DNNs are designed to solve complicated and non-linear problems (e.g., face recognition [45]) by using multi-layer structures, which consist of millions of parameters (i.e., floating-point data types).

To further improve the data analysis capabilities by increasing the inference accuracy, DNNs are becoming deeper and more complicated [43]. Meanwhile, the frequent training of DNNs results in huge overheads for storage and network, which overwhelm resource-constrained platforms such as mobile devices and IoT devices. For instance, a typical use case is to train a DNN in cloud servers using high-performance accelerators, such as GPUs or TPUs, and then transfer the trained DNN model to edge devices to provide accurate, intelligent, and effective services [20], while the cloud-to-edge data transfer is very costly.

Compressing neural networks [32] is an effective way to reduce the data transfer cost. However, existing approaches focus on simplifying DNNs for compression, such as pruning [21, 33], low-rank approximation [14], quantization [12, 37], knowledge distillation [11], and compact network design [38, 55]. These methods usually operate on models that have already been trained, modifying the model structure, and may make the model untrainable [30]. Therefore, in this paper, we focus on designing an approach to effectively compress DNNs without changing their structures.

Existing studies [15, 24] suggest that DNNs are difficult to compress using traditional compressors (e.g., GZIP [5], LZMA [35]) since they contain large amounts of floating-point numbers with random ending mantissa bits. However, our observation on existing DNNs suggests that most of the floating-point values only slightly change during the training of DNNs. Thus, there exists data similarity between the neighboring neural networks¹ in training, which means we only need to record the delta (i.e., the differences) of the data between two neighboring versions for potentially higher data reduction. Moreover, considering the locality of the delta, error-bounded lossy compression mechanisms can be very effective.

¹Neighboring neural networks in this paper refers to the checkpointed/saved neural network models in the training epochs/iterations.

In this paper, we present a novel delta compression framework called Delta-DNN to efficiently compress DNNs with a high compression ratio by exploiting the data similarity existing in DNNs during training. Our contributions are four-fold:

(1) We observe the high similarity of floating-point numbers (i.e., parameters) existing in the neighboring versions of a DNN, which can be exploited for data reduction.
(2) We propose to only store the delta data between neighboring versions, and then apply error-bounded lossy compression to the delta data for a high compression ratio. The error bound is strictly assessed by the maximal tolerable loss of DNNs' inference accuracy. To the best of our knowledge, this is the first work to employ the idea of delta compression for efficiently reducing the size of DNNs.
(3) We discuss Delta-DNN's benefits in two typical scenarios, including significantly reducing the bandwidth consumption when releasing DNNs over the network and reducing the storage requirements when preserving versions of DNNs.
(4) Our evaluation on six popular DNNs suggests that, compared with state-of-the-art compressors including SZ, LZMA, and Zstd, Delta-DNN can improve the compression ratio by 2× to 10× while keeping the neural network structure unchanged, at the cost of losing inference accuracy by less than 0.2%.

The rest of the paper is organized as follows. Section 2 presents the background and motivation of this work. Section 3 describes the design methodologies of the Delta-DNN framework in detail. Section 4 discusses the two typical application scenarios of Delta-DNN. Section 5 discusses the evaluation results of Delta-DNN on six well-known DNNs, compared with state-of-the-art compressors. In Section 6, we conclude this paper and present our future work.
2 BACKGROUND AND MOTIVATION
In this section, we present the necessary background about data compression techniques, the compressibility of DNNs, and lossy compression of floats, and we also discuss our critical observations of the data similarity existing in DNNs as our research motivation.

2.1 Data Compression Techniques
Recently, data has grown exponentially in cloud computing, big data, and artificial intelligence environments. As a result, efficient data compression techniques are especially important for data reduction. Generally, data compression can be categorized into lossless compression and lossy compression techniques.

General lossless compression, such as GZIP [5], LZMA [35], and Zstd [57], has been developed for decades. These techniques usually deal with data as byte streams, and reduce data at the byte/string level based on classic algorithms such as Huffman coding [13] and dictionary coding [56]. These fine-grained data compression techniques are extremely compute-intensive, and are usually used to eliminate redundancies inside a file or in a limited data range.

Recently, some other special lossless compression techniques have been proposed, such as data deduplication [8, 31] and delta compression [42], which eliminate redundancies among files. For example, data deduplication, a coarse-grained compression technique, detects and eliminates the duplicate chunks/files among the different backup versions in backup storage systems. Delta compression observes the high data similarity, and thus the high data redundancy, in the neighboring versions of the (modified) files in backup storage systems, and then records the delta data (the differences) between them for space savings [42, 51], eliminating redundancies between the similar files at the byte and string level.

Lossy compression has also existed for decades. The typical lossy compressors are for images, and their designs are based on human perception, such as JPEG2000 [36]. Thus they use lossy compression techniques, such as wavelet transform and vector optimization, to dramatically reduce image data sizes.

Nowadays, more attention is paid to the lossy compression of floating-point data generated from HPC. The most popular lossy compressors include ZFP [27] and SZ [6, 26, 47]. ZFP is an error-controlled lossy compressor designed for scientific data. Specifically, ZFP first transforms the floating-point numbers to fixed-point values block by block. Then, a reversible orthogonal block transformation is applied in each block to mitigate the spatial correlation, and embedded coding [39] is used to encode the coefficients.

SZ, another lossy compressor, is designed for scientific data, focusing on a high compression ratio by fully exploiting the characteristics of floating-point numbers. Specifically, there is a data-fitting predictor in SZ's workflow, which generates a predicted value for each data point according to its surrounding data (i.e., the data's spatial correlation). The predictor utilizes the characteristics (or rules) in the floats for prediction while ensuring point-wise error control on demand. Consequently, the difference between the predicted value and the real value is quantized, encoded, and batch compressed, which often achieves a much higher compression ratio than directly compressing the real values. Note that SZ's compression ratio depends on the predictor design and on the rules existing in the data. Thus, to achieve a high compression ratio, some variants of SZ try to explore more rules existing in the data, such as temporal correlation and spatiotemporal decimation [23, 25].
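To make the predict-then-quantize idea concrete, here is a highly simplified sketch of an SZ-style encoder (our illustration, not SZ's actual implementation): a previous-value predictor with error-bounded quantization of the residual. For simplicity it uses an absolute error bound; SZ also supports relative bounds.

```python
import numpy as np

def sz_like_encode(values, eps):
    """Toy SZ-style encoder: predict each value from the previously
    reconstructed one, then quantize the residual to an integer code.
    Guarantees |value - reconstruction| <= eps."""
    codes = np.empty(len(values), dtype=np.int64)
    prev = 0.0                                # the decoder mirrors this state
    for i, v in enumerate(values):
        pred = prev                           # data-fitting prediction
        code = int(np.floor((v - pred) / (2 * eps) + 0.5))
        codes[i] = code
        prev = pred + code * 2 * eps          # reconstructed value, not v
    return codes                              # entropy-coded in a real compressor
```

Note that the predictor must be fed the *reconstructed* value rather than the original one; otherwise quantization errors would accumulate beyond the bound.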
2.2 Compressing DNNs
Recently, with the rise of the AI boom, efficiently compressing deep neural networks (DNNs) is also gaining increasing attention. However, this is not easy, because DNNs consist of many floating-point numbers generated from training on large amounts of user data, which have very low compressibility.

A widely adopted method of training DNNs is called Gradient Descent [19], which usually generates large amounts of floating-point numbers and requires lots of resources (e.g., memory, storage, and network). More specifically, in the training process of a DNN, snapshots are generated periodically for backup, to prevent the accuracy from decreasing by over-fitting the DNN. Thus, storing many versions of DNNs is space-consuming, and frequently dispatching new versions of a DNN to users is becoming a challenge for using DNNs in practice, especially in resource-limited scenarios, such as mobile phones and wireless sensors.

Generally, compressing DNNs means compressing a large amount of very random floating-point numbers. Due to the strong randomness of the ending mantissa bits in DNNs' floats, existing traditional compression approaches only achieve a very limited compression ratio on DNNs according to recent studies [28, 41], including both lossless compressors (such as GZIP [5], Zstd [57], and LZMA [35]) and lossy compressors (such as SZ [6, 26, 47] and ZFP [27]).

Therefore, other special technologies for compressing DNNs have been proposed, such as pruning [7, 21, 33] and quantization [12, 37]. Deep Compression [9] uses the techniques of pruning (removing some unimportant parameters), quantization (transforming the floats into integers), and Huffman coding to compress DNNs, which may change the neural network structure (by pruning) and may significantly degrade the networks' inference accuracy (by quantization and pruning). Therefore, the network needs to be retrained many times to avoid accuracy degradation, resulting in huge execution overheads. DeepSZ [15] combines the SZ lossy compressor and the pruning technique to achieve a high compression ratio on DNNs, but it may also change the structure of networks due to pruning.

There are also some other approaches focusing on not only compressing the storage space of DNNs but also simplifying the models to reduce computation, which is called structured pruning [21, 50]. For example, Wen et al. [50] added a LASSO (Least Absolute Shrinkage and Selection Operator) regular penalty term into the loss function to realize the structured sparsity of DNNs. Li et al. [21] proposed a method of pruning the filters of convolutional layers to remove the unimportant filters accordingly.

Overall, the pruning- or quantization-based approaches will destroy the completeness of the structure of DNNs, and may degrade the inference accuracy. In this paper, we focus on using traditional compression approaches to compress the parameters of DNNs without changing DNNs' structures.

2.3 Observation and Motivation
As introduced in the last subsection, there are many parameters (i.e., floating-point numbers) in a neural network, and in the training process (i.e., Gradient Descent), some parameters in the networks vary greatly while others vary slightly. According to our experimental observations, most of the parameters vary slightly during training, which means data similarity (i.e., data redundancy) exists among the neighboring neural network versions.

To study the similarity in DNNs, we collect 368 versions of six popular neural networks in training (the six DNNs' characteristics and the training dataset will be detailed in Subsection 5.1) and then linearly fit all the corresponding parameters from every two neighboring networks (including all layers), as shown in Figure 1. Besides, in this test, we use the metric of Structural Similarity (SSIM) [49] (also shown in Figure 1), which is widely used to measure the similarity of two bitmap pictures. Because bitmap pictures and neural networks can both be regarded as matrices, it is reasonable to use this metric to measure the similarity of two networks.

Observation: the results shown in Figure 1 suggest that the floating-point numbers of the neighboring networks are very similar: ① the results of the linear fitting are very close to y = x. Specifically, in Figure 1, the x-axis and y-axis denote the parameters of two neighboring network versions, respectively. We can observe a clear linear relationship, which reflects the extremely high similarity between the neighboring neural network models. ② the SSIM is very close to 1.0, which usually suggests that bitmaps are very similar [49].

[Figure 1: Parameter similarity (using linear fitting) of the neighboring neural networks in training, from the six popular DNNs. Panels: (a) VGG-16, SSIM: 0.99994; (b) ResNet101, SSIM: 0.99971; (c) GoogLeNet, SSIM: 0.99999; (d) EfficientNet, SSIM: 0.99624; (e) MobileNet, SSIM: 0.99998; (f) ShuffleNet, SSIM: 0.99759.]
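As a sketch of how the measurement behind Figure 1 can be reproduced (our simplification: a single global SSIM over the flattened parameters with the dynamic range taken as 1, rather than the windowed SSIM of [49]):

```python
import numpy as np

def checkpoint_similarity(params_a, params_b):
    """Linear fit and a global SSIM-style score between the flattened
    parameters of two neighboring checkpoints (lists of weight arrays)."""
    a = np.concatenate([p.ravel() for p in params_a])
    b = np.concatenate([p.ravel() for p in params_b])
    slope, intercept = np.polyfit(a, b, 1)        # near y = x for similar nets
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    c1, c2 = 0.01 ** 2, 0.03 ** 2                 # usual SSIM stabilizers, L = 1
    ssim = ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2))
    return slope, intercept, ssim
```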
Although there exists data similarity, parameters in neighboring versions of DNNs are completely different in terms of bit representation, due to the random ending mantissa bits. Thus, traditional lossless compression methods cannot achieve an ideal compression ratio on these similar floats in DNNs. Meanwhile, the SZ compressor can compress similar floats (i.e., floats following some rule) well by using a data-fitting predictor and an error-controlled quantizer.

Therefore, according to the above observation and discussion, we present the motivations of this paper. Motivation ①: inspired by the delta compression technique used in the area of backup storage, we can calculate the delta data (i.e., the differences) of the similar floats between neighboring networks, which is very compressible in lossy compression. Motivation ②: we employ the ideas of SZ lossy compression, i.e., a data-fitting predictor and an error-controlled quantizer, to compress the delta data while ensuring the inference accuracy loss is under control.

The two motivations, combined with our observations of the floats similarity existing in DNNs, inspire us to adopt delta compression and error-bounded lossy compression with high compressibility. Specifically, given that there exists floats similarity between two neighboring versions of a DNN (see Figure 1), we propose an "inter-version" predictor to predict the corresponding data value between two neighboring network versions, which achieves a much higher compression ratio (as demonstrated in Section 5); the specific techniques will be introduced in Section 3.

Note that DeepSZ [15] also uses error-bounded SZ lossy compression to compress DNNs. The key difference between DeepSZ and our approach Delta-DNN is that DeepSZ directly compresses floats one-by-one on DNNs, combined with some pruning techniques (which may change the network structure [30]), while Delta-DNN exploits the floats similarity of neighboring networks for lossy compression to target a much higher compression ratio on DNNs, regardless of whether the pruning techniques are used.

3 DESIGN AND IMPLEMENTATION
Generally, Delta-DNN runs after the time-consuming training or fine-tuning of the network on the training datasets (i.e., after the updated neural networks are obtained). In this section, we describe the Delta-DNN design in detail, including exploiting the similarity of the neighboring neural networks to calculate the lossy delta data, optimizing parameters, and the encoding schemes.

3.1 Overview of Delta-DNN Framework
The general workflow of the Delta-DNN framework is shown in Figure 2. To compress a neural network (called the target network), we need a reference neural network, which is usually the former version of the network in training; Delta-DNN calculates and compresses the delta data of the two networks for efficient space savings. Specifically, Delta-DNN consists of three key steps, listed below and sketched end-to-end after the list: calculating the delta data, optimizing the error bound, and compressing the delta data.

[Figure 2: Overview of Delta-DNN framework for compressing deep neural networks.]

(1) Calculating the delta data is to calculate the lossy delta data of the target and reference networks (including all layers), which is much more compressible than directly compressing the floats in DNNs, as detailed in Subsection 3.2.
(2) Optimizing the error bound is to select the error bound that maximizes the lossy compression efficiency while meeting the requirement of the maximal tolerable loss of DNNs' inference accuracy, as detailed in Subsection 3.3.
(3) Compressing the delta data is to reduce the delta data size by using lossless compressors, as detailed in Subsection 3.4.
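Putting the three steps together, a minimal end-to-end sketch might look as follows. This is our illustration under simplifying assumptions: the quantization factors are assumed to fit in one signed byte, zlib stands in for the Zstd/LZMA stage, and the error-bound search of step (2) is omitted (see Subsection 3.3).

```python
import zlib
import numpy as np

def compress(target, reference, eps):
    """Steps (1) and (3): quantize the inter-version delta (Equation (1)),
    then losslessly compress the mostly-zero integer factors."""
    factors = np.floor((target - reference) / (2 * np.log(1 + eps)) + 0.5)
    return zlib.compress(factors.astype(np.int8).tobytes(), 9)

def decompress(blob, reference, eps):
    """Recover the target parameters via Equation (2)."""
    factors = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
    return 2 * factors * np.log(1 + eps) + reference
```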
In the remainder of this section, we discuss these three steps in detail to show how Delta-DNN efficiently compresses the floating-point numbers by exploiting their similarity.

3.2 Calculating the Delta Data
In Delta-DNN, because of the observed floats similarity existing in the neighboring networks, the lossy delta data of the corresponding parameters (i.e., floats) of the target and reference networks is calculated following the idea of the SZ lossy compressor [6, 26, 47]. Specifically, we denote a parameter from the target network as A_i and the corresponding parameter from the reference network as B_i; the lossy delta data of the floats A_i and B_i is calculated and quantized as:

    M_i = \left\lfloor \frac{A_i - B_i}{2 \cdot \log(1 + \epsilon)} + 0.5 \right\rfloor    (1)

Here ε is the predefined relative error bound used for SZ lossy compression (e.g., 1E-1 or 1E-2), and M_i is an integer (called the 'quantization factor') recording the delta data of A_i and B_i. The delta calculation of floats shown in Equation (1) is demonstrated to be very efficient in lossy compression of scientific data (i.e., compressing large amounts of floats), where the 'quantization factors' (i.e., the integers M_i) are highly compressible [26].

In Delta-DNN, according to the similarity we observed in Subsection 2.3, we expect that A_i − B_i will be very small in most cases, which means most of the M_i are equal to zero. Therefore, the 'quantization factors' M_i are also very compressible in Delta-DNN, as in scientific data. The target network is then replaced by "the M_i network" (the lossy delta data) for space savings in Delta-DNN.

With the delta data (i.e., the integers M_i) and the reference network (i.e., the B_i), we can recover (i.e., decompress) the target network (i.e., the parameters A'_i) as illustrated in Equation (2), with the loss limited by the error bound ε, i.e., |A_i − A'_i| < ε:

    A'_i = 2 \cdot M_i \cdot \log(1 + \epsilon) + B_i    (2)

All in all, the delta data M_i between the target and reference networks is very compressible, while its compression ratio depends on two key factors. The first is the similarity of the data A_i and B_i, which is demonstrated to be very high in Subsection 2.3. The second is the predefined relative error bound ε used for lossy compression, which also impacts the inference accuracy of DNNs in our lossy delta compression framework. The selection of the error bound ε used in Delta-DNN is discussed in Subsection 3.3.
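As a quick numeric check of Equations (1) and (2) (toy values of ours, with ε = 0.1):

```python
import numpy as np

eps = 0.1
width = 2 * np.log(1 + eps)                  # quantization bin, ~0.1906
B = np.array([0.5010, -0.2300, 0.8720])      # reference parameters
A = np.array([0.5030, -0.2290, 0.9720])      # target parameters

M = np.floor((A - B) / width + 0.5).astype(int)   # Equation (1) -> [0, 0, 1]
A_rec = 2 * M * np.log(1 + eps) + B               # Equation (2)
assert np.all(np.abs(A - A_rec) <= eps)           # loss stays within the bound
```

The first two parameters barely change between versions, so their factors are zero and they decompress exactly to the reference values; round-to-nearest quantization keeps the recovery error within half a bin, i.e., log(1 + ε) < ε.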
3.3 Optimizing the Error Bound
In this subsection, we discuss how to choose a reasonable relative error bound ε that maximizes the compression ratio of Delta-DNN without compromising DNNs' inference accuracy.

From Equation (1) in Subsection 3.2, we know that a larger error bound results in smaller 'quantization factors' and thus a higher compression ratio. Nevertheless, at the same time, it leads to an uncertain inference accuracy loss of DNNs. This is because the recovered target network parameters vary randomly (i.e., sometimes larger, sometimes smaller) under different error bounds after decompression in Delta-DNN.

Figures 3 and 4 show examples studying the impact of the error bound ε on the final compression ratio of DNNs, with Delta-DNN running on six well-known DNNs. Generally, the results confirm the discussion in the last paragraph: Delta-DNN achieves a higher compression ratio (see Figure 4) but causes uncertain inference accuracy losses when increasing the error bounds (see Figure 3).

Note that Figure 3 shows the inference accuracy of the last model files of the six DNNs using Delta-DNN with different error bounds. Figure 3 (c) has a significant accuracy decrease, and we also observe this phenomenon among the different training stages of other DNNs. This is because gradient descent is nonlinear, and some of its steps seem to be more sensitive.

Therefore, in Delta-DNN's workflow of calculating the delta data, we need to find a reasonable error bound considering both key metrics: the compression ratio and the inference accuracy loss of DNNs. Moreover, we need a flexible solution that meets users' various requirements on the two metrics while strictly ensuring the inference accuracy loss is acceptable (e.g., 0.2% as shown in Figure 3).

To this end, Algorithm 1 presents our method of selecting an optimal error bound by assessing its impact on both the compression ratio and the inference accuracy loss of DNNs. Generally, it consists of two steps: ① collecting the compression ratio and the inference accuracy degradation for each available error bound, while meeting the requirement of the maximal tolerable accuracy loss of the tested DNN; ② assessing the collected results to select an optimal error bound according to Formula (3):

    Score = \alpha \cdot \Phi + \beta \cdot \Omega, \quad (\alpha + \beta = 1)    (3)

where Φ is the relative loss of inference accuracy (calculated on the recovered network after decompression, compared with the original network); Ω is the compression ratio (i.e., the ratio of sizes before/after compression); and α and β are influencing weights defined by users to fine-tune the importance of Φ and Ω in Formula (3), with α + β = 1 (e.g., the user can set α = 0, β = 1 to maximize the importance of the compression ratio in Delta-DNN).

Algorithm 1: Error Bound Assessment and Selection
    Input: target network N1; reference network N2; accepted accuracy loss θ; available error bounds EB;
           accuracy degradation Φ(N, ε) and compression ratio Ω(N, ε) of network N with error bound ε
    Output: the best error bound EB_best
    // α, β are weights of compression ratio & accuracy defined by the user
    for ε in EB do
        {Φ(N1, ε), Ω(N1, ε)} ← Estimate(N1, N2, ε);
        if abs(Φ(N1, ε)) < θ then
            save {Φ(N1, ε), Ω(N1, ε)} in Sets;
    SCORE_best ← 0; EB_best ← λ;   // λ is the user-defined default minimal error bound
    for {Φ(N1, ε), Ω(N1, ε)} in Sets do
        Score ← CalcScore(Φ(N1, ε), Ω(N1, ε), α, β);
        if Score > SCORE_best then
            SCORE_best ← Score; EB_best ← ε;
    return EB_best;

Thus, as shown in Formula (3) and Algorithm 1, we use the Score to assess the compression efficiency of Delta-DNN and select the optimal error bound, satisfying users' requirements on both the compression ratio and the inference accuracy of DNNs.
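A compact Python rendering of Algorithm 1 might look as follows; `estimate` stands in for the Estimate step above (compress with ε, decompress, and measure the accuracy degradation and compression ratio), and all names are illustrative rather than Delta-DNN's actual API.

```python
def select_error_bound(target_net, reference_net, bounds, theta,
                       alpha, beta, estimate, default_bound):
    """Algorithm 1 sketch: pick the error bound maximizing Formula (3).

    estimate(target_net, reference_net, eps) -> (phi, omega), where phi is
    the accuracy degradation and omega the compression ratio under eps.
    """
    candidates = []
    for eps in bounds:                       # step 1: collect feasible bounds
        phi, omega = estimate(target_net, reference_net, eps)
        if abs(phi) < theta:                 # respect the tolerable accuracy loss
            candidates.append((eps, phi, omega))

    best_score, best_eps = 0.0, default_bound
    for eps, phi, omega in candidates:       # step 2: score and select
        score = alpha * phi + beta * omega   # Formula (3), alpha + beta = 1
        if score > best_score:
            best_score, best_eps = score, eps
    return best_eps
```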
Note that the computation cost of Algorithm 1 is minor compared with the training process of DNNs, as explained below. Generally, the time complexity of Algorithm 1 is O(n · (τ + e + d)), where n is the number of error bounds to test (n is equal to 10 in this paper), O(τ) is the time complexity of testing a network's inference accuracy on a testing dataset, O(e) is the time complexity of compressing a network, and O(d) is the time complexity of decompressing a network (usually negligible [15]). The time costs of compressing and testing (for DNN accuracy) are both positively related to the size of the network, while compressing is usually faster than testing. Hence, the time complexity of Algorithm 1 can be simplified to O(n · k · τ), where 1 < k < 2. Recall that Delta-DNN runs after the time-consuming training or fine-tuning of the network on the training datasets. Specifically, in deep neural networks, the ratio of the time complexity O(M) of training a DNN on the training datasets to that of verifying the DNN's accuracy on the testing datasets is usually 99:1², so the time complexity of Algorithm 1 is about O(n·k·M/99) ≈ O(M/5), which is much smaller than the training time complexity O(M) in deep learning.

²Andrew Ng. https://www.deeplearning.ai/

3.4 Compressing the Delta Data
After calculating the lossy delta data (i.e., the 'quantization factors' introduced in Equation (1) in Subsection 3.2) with the error bound optimized according to the inference accuracy requirement of DNNs, Delta-DNN compresses the delta data using lossless compressors, such as Zstd and LZMA, which are widely used to compress 'quantization factors' in lossy compression [6]. Here, before applying Zstd and LZMA in Delta-DNN, we introduce Run-Length Encoding (RLE), which efficiently records the duplicate bytes, since most of the quantization factors are zeros.
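As an illustration of why RLE helps here, the sketch below run-length encodes a byte stream dominated by zero quantization factors before handing it to a general-purpose compressor; this is our own minimal rendering, not Delta-DNN's exact encoding format.

```python
def rle_encode(data: bytes) -> bytes:
    """Run-length encode as (count, value) pairs; counts are capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and j - i < 255 and data[j] == data[i]:
            j += 1
        out += bytes((j - i, data[i]))
        i = j
    return bytes(out)

# Quantization factors are mostly zero, so long zero runs collapse well.
factors = bytes([0] * 1000 + [3] + [0] * 500)
encoded = rle_encode(factors)   # 1501 bytes -> 14 bytes before Zstd/LZMA
```

Collapsing the zero runs first makes the stream both smaller and more regular, so the subsequent Zstd or LZMA pass works on far less input.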