Data-Free Knowledge Distillation for Image Super-Resolution

Data-Free Knowledge Distillation For Image Super-Resolution Yiman Zhang1, Hanting Chen1,3, Xinghao Chen1, Yiping Deng2, Chunjing Xu1, Yunhe Wang1* 1 Noah’s Ark Lab, Huawei Technologies. 2 Central Software Institution, Huawei Technologies. 3 Key Lab of Machine Perception (MOE), Dept. of Machine Intelligence, Peking University. {yiman.zhang, chenhanting, yunhe.wang}@huawei.com Abstract els, they cannot be directly embedded into mobile devices with limited computing capacity, such as self-driving cars, Convolutional network compression methods require micro-robots and cellphones. Therefore, how to compress training data for achieving acceptable results, but train- CNNs with enormous parameters and then apply them on ing data is routinely unavailable due to some privacy and resource-constrained devices, becomes a research hotspot. transmission limitations. Therefore, recent works focus on In order to accelerate the pre-trained heavy convolutional learning efficient networks without original training data, networks, various attempts have been made recently, in- i.e., data-free model compression. Wherein, most of ex- cluding quantization [24], NAS [5, 39], pruning [25, 34], isting algorithms are developed for image recognition or knowledge distillation [37, 38] and etc. For example, Han et segmentation tasks. In this paper, we study the data-free al.[11] utilize pruning, quantization and Huffman coding to compression approach for single image super-resolution compress a deep model with extremely higher compression (SISR) task which is widely used in mobile phones and and speed-up ratio. Hinton et al.[14] propose knowledge smart cameras. Specifically, we analyze the relationship distillation, which learns a portable student network from between the outputs and inputs from the pre-trained net- a heavy teacher network. Luo et al.[29] propose ThiNet work and explore a generator with a series of loss func- to perform filter pruning by solving an optimization prob- tions for maximally capturing useful information. The lem. Courbariaux et al.[6] propose binary neural network generator is then trained for synthesizing training sam- with only binary weights and activations to extremely re- ples which have similar distribution to that of the origi- duce the networks’ computation cost and storage consump- nal data. To further alleviate the training difficulty of the tion. Han et al.[10] introduce cheap operations in Ghost- student network using only the synthetic data, we intro- Net to generate more features for lightweight convolutional duce a progressive distillation scheme. Experiments on var- models. ious datasets and architectures demonstrate that the pro- Although the compressed models with low computation posed method is able to be utilized for effectively learn- complexity can be easily deployed in mobile devices, these ing portable student networks without the original data, techniques require original data to fine-tune or train the e.g., with 0.16dB PSNR drop on Set5 for 2 super resolu- × compressed networks to achieve comparable performance tion. Code will be available at https://github.com/huawei- with the pre-trained model. However, the original train- noah/Data-Efficient-Model-Compression. ing data is often unavailable due to some privacy or transmission constraints. For example, Google shared a series of excellent models trained on the JFT-300M dataset [20] 1. Introduction which is still unpublic. In addition, there are considerable apps trained using privacy-related data such as face ID and Deep convolutional neural networks have achieved huge voice assistant, which is often not provided. It is very hard success in various computer vision tasks, such as image to provide model compression and acceleration service for recognition [12], object detection [26], semantic segmen- these models without the original training data. Therefore, tation [27] and super-resolution [7]. Such great progress existing network compression methods cannot be well per- largely relies on the advances of computing power and stor- formed. age capacity in modern equipments. For example, ResNet- 50 [12] requires a 98MB storage and 4G FLOPs. How- To this end, recent works are devoted to compress and ever, due to the heavy∼ computation cost∼ of these deep mod- accelerate the pre-trained heavy deep models without the training dataset, i.e. data-free model compression. Lopes et *Corresponding author. al.[28] first propose to use meta data to reconstruct the orig- 7852 Teacher Network Progressive Knowledge Distillation Random KD Loss Randomly initialized Signals . H B0 T Loaded from pre-trained model Head Block Block Block Tail downsample KD Loss H B0 B1 T Progressive Knowledge Reconstruction Loss Distillation Loss Adversarial Loss KD Loss H B0 B1 B2 T Generative Network . KD Loss Head Block Block Block Tail … H B0 B1 Bp-1 T Student Network Figure 1. Framework of the proposed data-free knowledge distillation. The generator is trained with reconstruction loss and adversarial loss to synthesize images that similar with the original data. The student network is then obtained utilizing progressive distillation from the teacher network. inal dataset for knowledge distillation. Nayak et al.[32] periments on several benchmark datasets and models. The present zero-shot knowledge distillation, which leverages results demonstrate that the proposed framework can effec- the information of pre-trained model to synthesize useful tively learn a portable network from a pre-trained model training data. Chen et al.[4] exploit Generative Adversarial without any training data. Networks (GAN) to generate training samples which have The rest of this paper is organized as follows. Section similar distribution with the original images and achieve 2 investigates related work about model compression in better performance. Yin et al.[40] propose DeepInversion super-resolution and data-free knowledge distillation meth- and successfully generate training data on the ImageNet ods. Section 3 introduces our data-free distillation method dataset. However, they only focus on image recognition for image super-resolution. Section 4 provides experimen- tasks with sophisticated loss function, e.g., the one-hot loss tal results on several benchmark datasets and models and for capturing features learned by the conventional cross- Section 5 concludes the paper. entropy loss. Differently, the training of SISR models does not involve the semantic information, the ground-truth im- 2. Related Work ages are exactly the original high-resolution images. Thus, In this section, we briefly review the related works on the existing data-free compression approaches cannot be di- data-driven knowledge distillation for super-resolution net- rectly employed. works and data-free model compression approaches. To this end, we propose a new data-free knowledge dis- 2.1. DataDriven Knowledge Distillation for Super tillation framework for super-resolution. A generator net- Resolution work is also adopted for approximating the original training data from the given teacher network. Different from the As there are urgent demands for applying image super- classification networks whose outputs are probability dis- resolution networks to the mobile devices such as cell- tributions, the inputs and outputs of SR models are images phones and cameras, various attempts have been made to with similar patterns. Therefore, we develop the reconstruc- learn lightweight super-resolution models. tion loss function by utilizing this relationship. In prac- Gao et al.[9] calculate different statistical maps of fea- tice, we have to ensure that the synthetic images will not be ture maps to distill from teacher super-resolution networks. distorted significantly by the teacher SR network, i.e., the He et al.[13] propose a feature affinity-based knowledge super-resolution results of these images should be similar distillation (FAKD) framework for super-resolution net- to themselves. Moreover, an adversarial loss is combined works, which improves the distillation performance by us- to prevent the model collapse of the generator. Since SISR ing the correlation within a feature map. Lee et al.[22] models are often required to capture and emphasize details propose to utilize ground truth high resolution images as such as edge and texture from the input images, the learn- privileged information and use feature distillation to im- ing on intermediate features is also very important. Thus, prove the performance of compact super-resolution net- we propose to conduct the distillation progressively to alle- work. To enhance the performance of lightweight super- viate the training difficulty. We then conduct a series of ex- resolution networks, Zhang et al.[42] introduce the con- 7853 cept of learnable pixel-wise importance map, and use pre- student for classification tasks. In addition, based on the ob- diction of teacher network to initialize the importance map. servation that Batch Normalization (BN) layers which con- In addition, Hui et al.[16] and Jiang et al.[17] design new tain channel-wise means and variances of training dataset structures to perform distillation between different parts of are widely used in classification neural networks, they pro- the model and improve the performance of the lightweight pose to use these BN data to inverse teacher network and super-resolution network. generate samples. This method provides comparable per- The compressed models obtained by the aforementioned formance in classification and generates samples that are methods

Load more