Arxiv:1912.08795V2 [Cs.LG] 16 Jun 2020 the Teacher and Student Network Logits

Dreaming to Distill: Data-free Knowledge Transfer via DeepInversion Hongxu Yin1;2:∗ , Pavlo Molchanov1˚, Zhizhong Li1;3:, Jose M. Alvarez1, Arun Mallya1, Derek Hoiem3, Niraj K. Jha2, and Jan Kautz1 1NVIDIA, 2Princeton University, 3University of Illinois at Urbana-Champaign fhongxuy, [email protected], fzli115, [email protected], fpmolchanov, josea, amallya, [email protected] Knowledge Distillation DeepInversion Target class Synthesized Images Feature distribution regularization Teacher Cross Input (updated) logits entropy ... Feature Back Conv ReLU Loss Pretrained model (fixed) maps propagation BatchNorm ⟶ Noise Image Kullback–Leibler Pretrained model (fixed), e.g. ResNet50 1-JS divergence Teacher logits Student model Student Jensen- Inverted from a pretrained ImageNet ResNet-50 classifier (updated) Back Student model (fixed) logits Shannon (JS) (more examples in Fig. 5 and Fig. 6) propagation Loss divergence Adaptive DeepInversion Figure 1: We introduce DeepInversion, a method that optimizes random noise into high-fidelity class-conditional images given just a pretrained CNN (teacher), in Sec. 3.2. Further, we introduce Adaptive DeepInversion (Sec. 3.3), which utilizes both the teacher and application-dependent student network to improve image diversity. Using the synthesized images, we enable data-free pruning (Sec. 4.3), introduce and address data-free knowledge transfer (Sec. 4.4), and improve upon data-free continual learning (Sec. 4.5). Abstract 1. Introduction We introduce DeepInversion, a new method for synthesiz- ing images from the image distribution used to train a deep The ability to transfer learned knowledge from a trained neural network. We “invert” a trained network (teacher) to neural network to a new one with properties desirable for the synthesize class-conditional input images starting from ran- task at hand has many appealing applications. For example, dom noise, without using any additional information on the one might want to use a more resource-efficient architecture training dataset. Keeping the teacher fixed, our method opti- for deployment on edge inference devices [46, 68, 78], or to mizes the input while regularizing the distribution of interme- adapt the network to the inference hardware [10, 65, 73], or diate feature maps using information stored in the batch nor- for continually learning to classify new image classes [31, malization layers of the teacher. Further, we improve the di- 36], etc. Most current solutions for such knowledge transfer versity of synthesized images using Adaptive DeepInversion, tasks are based on the concept of knowledge distillation [22], which maximizes the Jensen-Shannon divergence between wherein the new network (student) is trained to match its arXiv:1912.08795v2 [cs.LG] 16 Jun 2020 the teacher and student network logits. The resulting syn- outputs to that of a previously trained network (teacher). thesized images from networks trained on the CIFAR-10 and However, all such methods have a significant constraint – ImageNet datasets demonstrate high fidelity and degree of re- they assume that either the previously used training dataset is alism, and help enable a new breed of data-free applications available [9, 31, 47, 59], or some real images representative – ones that do not require any real images or labeled data. of the prior training dataset distribution are available [27, 28, We demonstrate the applicability of our proposed method to 36, 58]. Even methods not based on distillation [29, 52, 76] three tasks of immense practical importance – (i) data-free assume that some additional statistics about prior training is network pruning, (ii) data-free knowledge transfer, and (iii) made available by the pretrained model provider. data-free continual learning. Code is available at https: The requirement for prior training information can be //github.com/NVlabs/DeepInversion very restrictive in practice. For example, suppose a very deep network such as ResNet-152 [20] was trained on datasets ∗Equal contribution. : Work done during an internship at NVIDIA. with millions [11] or even billions of images [38], and we Work supported in part by ONR MURI N00014-16-1-2007. wish to distill its knowledge to a lower-latency model such 1 as ResNet-18. In this case, we would need access to these tween the responses of the two networks. datasets, which are not only large but difficult to store, trans- In order to show that our dataset synthesis method is use- fer, and manage. Further, another emerging concern is that ful in practice, we demonstrate its effectiveness on three of data privacy. While entities might want to share their different use cases. First, we show that the generated images trained models, sharing the training data might not be desir- support knowledge transfer between two networks using dis- able due to user privacy, security, proprietary concerns, or tillation, even with different architectures, with a minimal competitive disadvantage. accuracy loss on the simple CIFAR-10 as well as the large In the absence of prior data or metadata, an interesting and complex ImageNet dataset. Second, we show that we question arises – can we somehow recover training data from can prune the teacher network using the synthesized images the already trained model and use it for knowledge transfer? to obtain a smaller student on the ImageNet dataset. Finally, A few methods have attempted to visualize what a trained we apply DeepInversion to continual learning that enables deep network expects to see in an image [3, 39, 48, 51]. The the addition of new classes to a pretrained CNN without the most popular and simple-to-use method is DeepDream [48]. need for any original data. Using our DeepInversion tech- It synthesizes or transforms an input image to yield high nique, we empower a new class of “data-free” applications output responses for chosen classes in the output layer of of immense practical importance, which need neither any a given classification model. This method optimizes the natural image nor labeled data. input (random noise or a natural image), possibly with some Our main contributions are as follows: regularizers, while keeping the selected output activations • We introduce DeepInversion, a new method for synthe- fixed, but leaves intermediate representations constraint-free. sizing class-conditional images from a CNN trained for The resulting “dreamed” images lack natural image statistics image classification (Sec. 3.2). Further, we improve syn- and can be quite easily identified as unnatural. These images thesis diversity by exploiting student-teacher disagree- are also not very useful for the purposes of transferring ments via Adaptive DeepInversion (Sec. 3.3). knowledge, as our extensive experiments in Section 4 show. • We enable data-free and hardware-aware pruning that In this work, we make an important observation about achieves performance comparable to the state-of-the- deep networks that are widely used in practice – they all art (SOTA) methods that rely on the training dataset implicitly encode very rich information about prior training (Sec. 4.3). data. Almost all high-performing convolutional neural net- • We introduce and address the task of data-free knowledge works (CNNs), such as ResNets [20], DenseNets [24], or transfer between a teacher and a randomly initialized their variants, use the batch normalization layer [26]. These student network (Sec. 4.4). layers store running means and variances of the activations • We improve prior work on data-free continual (a.k.a. at multiple layers. In essence, they store the history of pre- incremental) learning, and achieve results comparable to viously seen data, at multiple levels of representation. By oracle methods given the original data (Sec. 4.5). assuming that these intermediate activations follow a Gaus- sian distribution with mean and variance equal to the running statistics, we show that we can obtain “dreamed” images that 2. Related Work are realistic and much closer to the distribution of the training Knowledge distillation. Transfer of knowledge from one dataset as compared to prior work in this area. model to another was first introduced by Breiman and Our approach, visualized in Fig. 1, called DeepInversion, Shang when they learned a single decision tree to approx- introduces a regularization term for intermediate layer acti- imate the outputs of multiple decision trees [4]. Simi- vations of dreamed images based on just the two layer-wise lar ideas are explored in neural networks by Bucilua et statistics: mean and variance, which are directly available al. [6], Ba and Caruana [2], and Hinton et al. [22]. Hin- with trained models. As a result, we do not require any train- ton et al. formulate the problem as “knowledge distilla- ing data or metadata to perform training image synthesis. tion,” where a compact student mimics the output distri- Our method is able to generate images with high fidelity and butions of expert teacher models [22]. These methods and realism at a high resolution, as can be seen in the middle improved variants [1, 55, 59, 69, 75] enable teaching stu- section of Fig. 1, and more samples in Fig. 5 and Fig. 6. dents with goals such as quantization [44, 57], compact We also introduce an application-specific extension of neural network architecture design [59], semantic segmen- DeepInversion, called Adaptive DeepInversion, which can tation [33], self-distillation [15], and un-/semi-supervised enhance the diversity of the generated images. More specif- learning [36, 56, 72]. All these methods still rely on images ically, it exploits disagreements between the pretrained from the original or proxy datasets. More recent research has teacher and the in-training student network to expand the explored data-free knowledge distillation. Lopes et al.[35] coverage of the training set by the synthesized images. It synthesize inputs based on pre-stored auxiliary layer-wise does so by maximizing the Jensen-Shannon divergence be- statistics of the teacher network. Chen et al. [8] train a new 2 generator network for image generation while treating the the parameters of the student model, WS, can be learned by teacher network as a fixed discriminator.

Arxiv:1912.08795V2 [Cs.LG] 16 Jun 2020 the Teacher and Student Network Logits

Training Autoencoders by Alternating Minimization

ML-Based Interactive Data Visualization System for Diversity and Fairness Issues 1

Q-Learning in Continuous State and Action Spaces

Building Intelligent Systems with Large Scale Deep Learning Jeff Dean Google Brain Team G.Co/Brain

Audio Event Classification Using Deep Learning in an End-To-End Approach

Comparative Analysis of Recurrent Neural Network Architectures for Reservoir Inﬂow Forecasting

Training Deep Networks Without Learning Rates Through Coin Betting

Bridging Reinforcement Learning and Creativity: Implementing Reinforcement Learning in Processing

The Perceptron

Learning Semantics of Raw Audio

Neural Networks a Simple Problem (Linear Regression)

Learning-Rate Annealing Methods for Deep Neural Networks