Neural Network Distiller: A Python Package For DNN Compression Research

Neta Zmora∗ Guy Jacob∗ Lev Zlotnik Bar Elharar Gal Novik
Intel AI Lab
{neta.zmora, guy.jacob, lev.zlotnik, bar.elharar, gal.novik}@intel.com

∗Equal contribution

Abstract

This paper presents the philosophy, design and feature-set of Neural Network Distiller, an open-source Python package for DNN compression research. Distiller is a library of DNN compression algorithm implementations, with tools, tutorials and sample applications for various learning tasks. Its target users are both engineers and researchers, and the rich content is complemented by a design for extensibility to facilitate new research. Distiller is open-source and is available on GitHub at https://github.com/NervanaSystems/distiller.

1 Introduction

Deep learning (DL) is transforming incumbent industries and creating new ones. In many use-cases it is preferable to execute inference on network edge devices, due to privacy, bandwidth, latency, connectivity and power constraints. However, the power, thermal, memory and compute requirements of some deep neural networks (DNNs) are prohibitive for many classes of edge devices. Efforts to ameliorate this tension span multiple disciplines: hardware [1, 2], algorithms [3], and DNN design and optimization [4]. Neural Network Distiller is a library for DNN compression research in PyTorch [5], which caters to the latter domain: the search for ever-smaller, faster and more energy-efficient neural networks.

DNN compression is a dynamic research area with both practical and theoretical implications, making it important to industry and academia. For practical reasons, it is important to be able to test and compare DNN compression methods under the same test conditions (datasets, pre-processing, hyper-parameters and execution environment); to experiment with mixing compression methods (e.g. quantization and pruning); and to share knowledge, results and insights. Distiller is our humble attempt to bring together researchers and practitioners by providing a library of algorithmic tools for DNN compression, together with tutorials, implementations of example applications on various tasks, and infrastructure code to quickly prototype new ideas. Most of the implemented algorithms relate to quantization and sparsification via pruning and regularization. Additionally, Distiller contains examples of low-rank approximation [6], conditional computation [7], knowledge distillation [8] and automated compression [9], and we continue adding more methods. In this paper, we briefly describe Distiller’s motivation, design decisions and algorithms, in the hope that this work can serve others.

2 Design

Distiller is written in Python and is designed to be simple and extensible, accessible to experts and non-experts alike, and reusable as a library in various contexts. An important design goal was low barriers-to-adoption by compression aficionados and researchers, for easy integration of Distiller into users’ research applications. This meant a flatter code structure with fewer abstraction layers, so that library code feels approachable and easy to understand. We chose PyTorch as the underlying DL framework because of its wide adoption by the research community, and opted for tight-coupling with PyTorch for simplicity and clarity, at the expense of genericity. Published works exemplify a few ways researchers are using Distiller to implement new compression methods [10, 11], and to advance their work in fields related to DNN compression [12, 13, 14, 15].

Figure 1: Distiller high-level architecture. Green boxes are 3rd-party components, user application code is colored gray and Distiller library code is blue.

Figure 1 depicts Distiller’s high-level architecture². At the core of Distiller are implementations of the algorithmic building blocks required for DNN compression (e.g. pruning, quantization and regularization). These building blocks are assembled into compression methods, together with software components often needed for compression:

• New operation layers, for example: TruncatedSVD as a decomposed replacement of Linear layers (see the sketch after this list), and DistillerLSTM for fine control of quantization inside an LSTM cell
• Layer fusion (e.g. batch-norm folding)
• Model contraction (“thinning”) and expansion
• Utilities such as activation statistics, graph data-dependency analysis and model summaries
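The following is a minimal sketch of the idea behind such a decomposed replacement: factoring a Linear layer into two smaller ones via truncated SVD. It is plain PyTorch for illustration only; the function name and rank selection are ours, and this is not Distiller's actual TruncatedSVD implementation.

import torch
import torch.nn as nn

def truncated_svd_linear(linear: nn.Linear, rank: int) -> nn.Sequential:
    # Factor W (out_features x in_features) into U_r @ V_r of the given rank.
    W = linear.weight.data
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]          # (out_features, rank)
    V_r = Vh[:rank, :]                    # (rank, in_features)

    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data.copy_(V_r)
    second.weight.data.copy_(U_r)
    if linear.bias is not None:
        second.bias.data.copy_(linear.bias.data)
    return nn.Sequential(first, second)   # approximates the original layer

Applied to a large fully-connected layer, this trades a small accuracy drop for fewer parameters and multiply-accumulate operations; choosing the rank is task- and layer-dependent.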

For compression methods that involve training or fine-tuning models, Distiller supports various training techniques, including full/partial dataset, single/multi-GPU, knowledge distillation and lottery ticket hypothesis (LTH) [16] training. A compression-scheduling component is responsible for interleaving training and compression, for methods such as iterative pruning [17]. To accelerate development, Distiller also provides APIs for common utility functionalities. These include logging and plotting (e.g. per-parameter tensor sparsity); checkpointing (with compression-specific metadata) and simple experiment-artifact management; data-loading (full/partial datasets with various samplers); compression-specific command-line argument handling; and reproducible execution. We have integrated models from Torchvision [18] and Cadene’s GitHub repository [19], together with our own models. We demonstrate the range and flexibility of Distiller on various example tasks and datasets: image classification, recommendation systems (NCF [20]), NLP (e.g. GNMT [21]) and object detection (e.g. Faster R-CNN [22]). Users can readily add models, datasets and learning tasks.
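As an illustration of the per-parameter sparsity statistics mentioned above, here is a minimal plain-PyTorch sketch of such a summary. It is not Distiller's logging API; the function name and the zero threshold are ours.

import torch

def weight_sparsity_report(model: torch.nn.Module, threshold: float = 1e-6) -> dict:
    # Fraction of (near-)zero elements in each multi-dimensional parameter tensor.
    report = {}
    for name, param in model.named_parameters():
        if param.dim() > 1:               # skip biases and other 1-D parameters
            zeros = (param.abs() <= threshold).sum().item()
            report[name] = zeros / param.numel()
    return report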

2.1 Compression Scheduling

Most compression methods involve some form of model training (post-training quantization being a notable exception). These can be, for example, simple fine-tuning using a subset of the data; LTH training with some of the weights kept masked at zero; or iterative pruning following some pruning recipe. To give the user fine control over the interaction between the training process and the execution of compression algorithms, we built a scheduling subsystem composed of a compression scheduler and a YAML-syntax schedule parser. The scheduler orchestrates the compression process by invoking algorithmic components, following a user-prescribed schedule. This process is demonstrated in Algorithm 1. The scheduling recipe can be defined programmatically or specified in a YAML file. Using a YAML file allows the user to quickly and easily change the compression algorithms, their configuration and scheduling without changing code, which is especially valuable in an environment requiring rapid experimentation. The recipe defines³ the compression methods to use and their parameters. In some cases, parameters can be overridden per layer or per group of layers. In addition, the recipe defines when each method should be applied: start epoch, end epoch and frequency.

/* Read schedule from file */
scheduler = distiller.file_config(file_name)
foreach epoch do
    scheduler.on_epoch_begin(epoch)
    foreach mini_batch do
        Get next mini-batch of data x and labels y
        scheduler.on_minibatch_begin(mini_batch)
        Clear gradients of weights
        Feed-forward on network M: y’ = M(x)
        /* Compute additive loss due to compression constraints */
        compression_loss = scheduler.before_backward_pass(mini_batch)
        Compute loss = criterion(y, y’) + compression_loss
        Compute gradients
        /* Final opportunity to alter weights before they are updated */
        scheduler.before_parameter_optimization(mini_batch)
        Update weights
        scheduler.on_minibatch_end(mini_batch)
    end
    Validate results using validation dataset
    scheduler.on_epoch_end(epoch)
end

Algorithm 1: Compression scheduling. Example pseudo-code of an application invoking compression-scheduling from its training-loop. Lines in blue show commands related to the scheduler, and illustrate the simplicity of adding scheduled-compression to existing PyTorch applications. Call-backs into the scheduler trigger the invocation of back-end compression algorithms according to the programmed schedule.
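For concreteness, below is a minimal PyTorch rendering of Algorithm 1. The scheduler call-back names follow the pseudo-code above; the exact signatures in the released Distiller API may take additional arguments, and the model, criterion, optimizer, data loaders and validation routine are placeholders.

import distiller

def train_with_compression(model, criterion, optimizer, train_loader, val_loader,
                           schedule_file, num_epochs, validate_fn):
    # Read schedule from a YAML recipe file (the released API may also take
    # the model and optimizer as arguments)
    scheduler = distiller.file_config(schedule_file)
    for epoch in range(num_epochs):
        scheduler.on_epoch_begin(epoch)
        model.train()
        for mini_batch, (x, y) in enumerate(train_loader):
            scheduler.on_minibatch_begin(mini_batch)
            optimizer.zero_grad()                      # clear gradients of weights
            y_pred = model(x)                          # feed-forward
            # Additive loss due to compression constraints (e.g. regularizers)
            compression_loss = scheduler.before_backward_pass(mini_batch)
            loss = criterion(y_pred, y) + compression_loss
            loss.backward()                            # compute gradients
            # Final opportunity to alter weights before they are updated
            scheduler.before_parameter_optimization(mini_batch)
            optimizer.step()                           # update weights
            scheduler.on_minibatch_end(mini_batch)
        validate_fn(model, val_loader)                 # validate on the validation dataset
        scheduler.on_epoch_end(epoch)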

2.2 Supporting Components

Distiller is designed as a toolbox for the research community and is packaged with tools to share, teach and accelerate application development. Command-line utilities facilitate export to ONNX [23] for execution in an inference-serving environment, as well as generation of model summaries, pruning sensitivity analysis and activation statistics. Jupyter notebooks⁴, wiki tutorials⁵ and rich documentation⁶ provide a means to share and discuss algorithmic details, explain sample applications and APIs, and explore the technical and algorithmic aspects of the compression processes and methods.
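As a sketch of the ONNX export path, a (possibly compressed) PyTorch model can be serialized with the standard torch.onnx exporter. The helper below is illustrative; the input shape and output path are placeholders, and Distiller's command-line utility presumably wraps a similar call.

import torch

def export_to_onnx(model, input_shape, onnx_path="model.onnx"):
    # Trace the model with a dummy input and write the ONNX graph to disk
    model.eval()
    dummy_input = torch.randn(1, *input_shape)
    torch.onnx.export(model, dummy_input, onnx_path)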

² More details available at https://nervanasystems.github.io/distiller/design.
³ Examples of scheduling recipes can be found at https://nervanasystems.github.io/distiller/schedule.html
⁴ https://github.com/NervanaSystems/distiller/tree/master/jupyter
⁵ https://github.com/NervanaSystems/distiller/wiki
⁶ https://nervanasystems.github.io/distiller/index.html

3 Algorithms and Methods

DNN compression is a rapidly moving field, and keeping up with the state of the art is challenging. Table 1 lists the algorithms and methods currently supported in Distiller.

Table 1: Compression algorithms and methods in Distiller

Weights Regularization: Lp-norm, group Lasso [24], GSS [25], SSL [26]
Weights Pruning: Magnitude pruning [17], block and structure pruning, greedy pruning [27], feature-map reconstruction [28], hybrid pruning [29], activation statistics-based [30, 31], Taylor expansion [31], Network Surgery [32], automated scheduling [33, 34], one-shot pruning [35], automated pruning [9], sensitivity analysis [17], network thinning [36]
Post-Training Quantization: Gemmlowp-based [37, 38]; dynamic / stats-based, symmetric / asymmetric, per-tensor / per-channel, activations clipping (average min/max, ACIQ [39])
Quantization-Aware Training: EMA-tracked ranges [38, 40], DoReFa [41], WRPN [42], PACT [43]
Conditional Computation: Early-exit [7]
Low-rank Decomposition: Truncated SVD [6]
Others: Knowledge distillation [8]
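To illustrate the range-based post-training quantization scheme in the table, here is a minimal sketch of asymmetric quantization of a single tensor. This is a conceptual example rather than Distiller's quantizer, which adds per-channel granularity, clipping and the other variants listed above; the function name is ours.

import torch

def asymmetric_quantize(x: torch.Tensor, num_bits: int = 8):
    # Map the tensor's [min, max] range onto the unsigned integer grid [0, 2^bits - 1].
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(qmin - x_min / scale).clamp(qmin, qmax)
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    x_hat = (q - zero_point) * scale       # "fake" quantization, simulated in FP32
    return q, x_hat, scale, zero_point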

4 Related Work

Support for INT8 and FP16 quantization is becoming more common, at different levels of the DL software spectrum. DL frameworks such as TensorFlow [44] and MXNet [45] have APIs for post-training quantization (PTQ) to INT8, with TensorFlow supporting quantization-aware training (QAT) as well. Support for built-in PTQ and QAT to INT8 was added to PyTorch very recently. NN compiler libraries such as Glow [46] and TVM [47], and platform-specific solutions such as Intel OpenVino [48] and NVIDIA TensorRT [49], provide PTQ of FP32 models to INT8/FP16. Support for pruning is far less prevalent. TensorFlow provides a model pruning API based on a specific magnitude-pruning algorithm [33]. PocketFlow [50] is a framework for automatic model compression and acceleration which has some overlapping features with Distiller, with both pruning and quantization capabilities.

Compared to the above, Distiller provides a larger collection and mix of compression methods implemented out-of-the-box. Distiller offers flexibility and control over the compression process via its scheduling mechanism and YAML configuration. This makes it easy to experiment with, for example, mixed-precision quantization at the layer level, or with different pruning settings for different layers at different stages of training. Finally, beyond ready-to-use algorithms, Distiller implements the infrastructure and exposes the basic building blocks which can be used to develop new ones.

A disadvantage of Distiller’s current implementation of PTQ is that it simulates INT8 using FP32 operations. In addition, it lacks the capability to export quantized models to ONNX and/or Glow. This makes Distiller a great vehicle to evaluate the effects of quantization on a model, but makes it hard to realize the gains from quantization on actual hardware. With the upcoming release of native quantization capabilities in PyTorch, we hope to rectify this.

5 Conclusion and Future Work

We presented Neural Network Distiller, a DNN-compression research library built for the community at large, as part of Intel AI Lab’s contribution to the DL community. Distiller packages together compression algorithms and methods, a supporting training framework, and examples of model compression for various learning tasks. Distiller is a work-in-progress and there is room for improvement.

Compression automation is an emerging research field with an exciting promise that we intend to follow very closely. Progress in DNN compression research can provide the tailwind necessary to help bring DL innovation to more industries and application domains, to make people’s lives easier, healthier, and more productive. Distiller is an attempt to build a community of researchers, scientists and engineers vested in this future vision.

Acknowledgements

We wish to thank Yury Nahshan for the implementation of ACIQ; Haim Barad for the implementation of Early-exit; and the Distiller community on GitHub for bug fixes, small features, and challenging questions and observations.

References

[1] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, pages 243–254, 2016.

[2] Intel Deep Learning Deployment Toolkit, 2018. URL https://software.intel.com/en-us/openvino-toolkit/deep-learning-cv.

[3] T. Abtahi, C. Shea, A. Kulkarni, and T. Mohsenin. Accelerating convolutional neural network with FFT on embedded hardware. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 26(9):1737–1749, 2018.

[4] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[5] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.

[6] J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pages 2365–2369, 2013.

[7] Surat Teerapittayanon, Bradley McDanel, and H.T. Kung. BranchyNet: Fast inference via early exiting from deep neural networks. 2016 23rd International Conference on Pattern Recognition (ICPR), Dec 2016.

[8] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. ArXiv, abs/1503.02531, 2015.

[9] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In European Conference on Computer Vision (ECCV), 2018.

[10] Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Chris De Sa, and Zhiru Zhang. Improving neural network quantization without retraining using outlier channel splitting. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 7543–7552, 2019.

[11] Ahmed T. Elthakeb, Prannoy Pilligundla, and Hadi Esmaeilzadeh. SinReQ: Generalized sinusoidal regularization for low-bitwidth deep quantized training. ArXiv, abs/1905.01416, 2019.

[12] Gil Shomron, Tal Horowitz, and Uri C. Weiser. SMT-SA: Simultaneous multithreading in systolic arrays. IEEE Computer Architecture Letters, 18:99–102, 2019.

[13] Moin Nadeem, Wei Fang, Brian Xu, Mitra Mohtarami, and James Glass. FAKTA: An automatic end-to-end fact checking system. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 78–83, 2019.

[14] Shangqian Gao, Cheng Deng, and Heng Huang. Cross domain model compression by structurally weight sharing. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8973–8982, 2019.

[15] Angad S. Rekhi, Brian Zimmer, Nikola Nedovic, Ningxi Liu, Rangharajan Venkatesan, Miaorong Wang, Brucek Khailany, William J. Dally, and C. Thomas Gray. Analog/mixed-signal hardware error modeling for deep learning inference. 2019 56th ACM/IEEE Design Automation Conference (DAC), 2019.

[16] Jonathan Frankle and Michael Carbin. The Lottery Ticket Hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), 2019.

[17] Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. ArXiv, abs/1506.02626, 2015.

[18] torchvision. URL https://github.com/pytorch/vision.

[19] Remi Cadene. Pretrained models for PyTorch. URL https://github.com/Cadene/pretrained-models.pytorch.

[20] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, WWW ’17, pages 173–182, 2017.

[21] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. Google’s Neural Machine Translation System: Bridging the gap between human and machine translation. ArXiv, abs/1609.08144, 2016.

[22] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.

[23] ONNX. Open Neural Network Exchange. URL https://onnx.ai/.

[24] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68:49–67, 2006.

[25] Amirsina Torfi, Rouzbeh A. Shirvani, Sobhan Soleymani, and Nasser M. Nasrabadi. Attention-based guided structured sparsity of deep neural networks. In International Conference on Learning Representations (ICLR), 2018.

[26] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. ArXiv, abs/1608.03665, 2016.

[27] Reza Abbasi-Asl and Bin Yu. Structural compression of convolutional neural networks based on greedy filter pruning. ArXiv, abs/1705.07356, 2017.

[28] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017.

[29] W. Yang, L. Jin, S. Wang, Z. Cu, X. Chen, and L. Chen. Thinning of convolutional neural network with mixed pruning. IET Image Processing, 13(5):779–784, 2019.

[30] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network Trimming: A data-driven neuron pruning approach towards efficient deep architectures. In International Conference on Learning Representations (ICLR), 2016.

[31] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. ArXiv, abs/1611.06440, 2016.

[32] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. ArXiv, abs/1608.04493, 2016.

[33] Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. ArXiv, abs/1710.01878, 2017.

[34] Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta. Exploring sparsity in recurrent neural networks. In International Conference on Learning Representations (ICLR), 2017.

[35] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning Filters for Efficient ConvNets. ArXiv, abs/1608.08710, 2016.

[36] Guillaume Leclerc, Manasi Vartak, Raul Castro Fernandez, Tim Kraska, and Samuel Madden. Smallify: Learning network size while training. ArXiv, abs/1806.03723, 2018.

[37] gemmlowp: a small self-contained low-precision GEMM library. URL https://github.com/google/gemmlowp.

[38] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew G. Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2704–2713, 2018.

[39] Ron Banner, Yury Nahshan, and Daniel Soudry. Post training 4-bit quantization of convolutional networks for rapid-deployment. In Advances in Neural Information Processing Systems 32, 2019.

[40] Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. ArXiv, abs/1806.08342, 2018.

[41] Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. ArXiv, abs/1606.06160, 2016.

[42] Asit Mishra, Eriko Nurvitadhi, Jeffrey J Cook, and Debbie Marr. WRPN: Wide reduced-precision networks. In International Conference on Learning Representations (ICLR), 2018.

[43] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. PACT: Parameterized clipping activation for quantized neural networks. ArXiv, abs/1805.06085, 2018.

[44] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. ArXiv, abs/1603.04467, 2015.

[45] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. ArXiv, abs/1512.01274, 2015.

[46] Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Summer Deng, Roman Dzhabarov, James Hegeman, Roman Levenstein, Bert Maher, Nadathur Satish, Jakob Olesen, Jongsoo Park, Artem Rakhov, and Misha Smelyanskiy. Glow: Graph lowering compiler techniques for neural networks. ArXiv, abs/1805.00907, 2018.

[47] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, 2018.

[48] Intel. Intel Distribution of OpenVino Toolkit. URL https://software.intel.com/en-us/openvino-toolkit.

[49] NVIDIA. NVIDIA TensorRT. URL https://developer.nvidia.com/tensorrt.

[50] Jiaxiang Wu, Yao Zhang, Haoli Bai, Huasong Zhong, Jinlong Hou, Wei Liu, and Junzhou Huang. PocketFlow: An automated framework for compressing and accelerating deep neural networks. In Advances in Neural Information Processing Systems (NIPS), Workshop on Compact Deep Neural Networks with Industrial Applications, 2018.
