arXiv:2103.17012v1 [cs.CV] 31 Mar 2021

Knowledge Distillation By Sparse Representation Matching

Dat Thanh Tran∗, Moncef Gabbouj∗, Alexandros Iosifidis†
∗Department of Computing Sciences, Tampere University, Tampere, Finland
†Department of Electrical and Computer Engineering, Aarhus University, Aarhus, Denmark
Email: {thanh.tran, moncef.gabbouj}@tuni.fi, …@eng.au.dk

Abstract—Knowledge Distillation refers to a class of methods that transfers the knowledge from a teacher network to a student network. In this paper, we propose Sparse Representation Matching (SRM), a method to transfer intermediate knowledge obtained from one Convolutional Neural Network (CNN) to another CNN utilizing sparse representation learning. SRM first extracts sparse representations of the hidden features of the teacher CNN, which are then used to generate both pixel-level and image-level labels for training intermediate feature maps of the student network. We formulate SRM as a neural processing block, which can be efficiently optimized using stochastic gradient descent and integrated into any CNN in a plug-and-play manner. Our experiments demonstrate that SRM is robust to architectural differences between the teacher and the student networks, and outperforms other KD techniques across several datasets.

I. INTRODUCTION

Over the past decade, deep neural networks have become the primary tools to tackle problems in several domains, ranging from computer vision [1], [2] and natural language processing [3], [4] to biomedical analysis [5], [6], [7], [8] or financial forecasting [9]. Of those developments, Convolutional Neural Networks have evolved as the main choice for high-dimensional signals, either as a feature extraction block or as the de facto workhorse in a learning system. Initially developed in the 1990s for handwritten character recognition using only two convolutional layers [10], state-of-the-art topologies nowadays consist of hundreds of layers with millions of parameters [11], [12]. In fact, not only in vision but also in other domains, state-of-the-art solutions are mainly driven by very large networks [3], [4], which limits their deployment in practice due to the high computational complexity. The promising results obtained by these models have encouraged a lot of research efforts on developing smaller and light-weight models while achieving similar performances. This includes designing efficient neural network families (both hand-crafted and automatic) [13], [14], [15], [16], [17], weight pruning [18], [19], [20], [21], quantization [22], [23], low-rank approximation [24], [25], compressive data acquisition and analysis systems [26], [27], [28], as well as compressing pretrained networks by transferring knowledge from one network to another [29]. Of these developments, Knowledge Distillation (KD) [29] is a simple and widely used technique that has been shown to be effective in improving the performance of a network, given access to one or many pretrained networks. KD and its variants work by utilizing the knowledge acquired in one or many models (the teacher(s)) as supervisory signals to train another model (the student) along with the labeled data. Thus, there are two central questions in KD:
• How to represent the knowledge encoded in a teacher network?
• How to efficiently transfer such knowledge to other networks, especially when there are architectural differences between the teacher and the student networks?
In the original formulation [29], the soft probabilities produced by the teacher represent its knowledge, and the student network is trained to mimic this prediction. Besides the final prediction, other works have proposed to utilize intermediate feature maps of the teacher as additional knowledge [30], [31], [32], [33], [34]. Intuitively, certain intermediate feature maps contain clues on how the input is progressively transformed through the layers of a CNN, and thus can act as a good source of knowledge. However, we argue that the intermediate feature maps themselves are not a good representation of the knowledge encoded in the teacher. To address the question of representing the knowledge of the teacher CNN, instead of directly utilizing the feature maps of the teacher as the supervisory signals, we propose to encode each pixel (of the feature maps) in a sparse domain and use the sparse representation as the source of supervision to teach the students.

Prior to the era of deep learning, sparse representations attracted a great amount of interest in the computer vision community and served as a basis of many important works [35]. Sparse representation learning aims at representing the input signal in a domain where the coefficients are sparsest. This is achieved by decomposing the signal as a linear combination of the atoms of an overcomplete dictionary. While the dictionary can be prespecified, it is often desirable to optimize the dictionary together with the sparse decomposition using the signals in the example set. Since neighboring pixels in the hidden feature maps of a CNN are often smooth, with high correlations between them, they are compressible, e.g., in a Fourier domain. Thus, sparse representation serves as a good choice for representing the information encoded in the hidden feature maps.

Sparse representation learning is a well-established topic in which several algorithms have been proposed [35]. However, to the best of our knowledge, existing formulations are computationally intensive to fit a large amount of data. Although learning task-specific sparse representations has been proposed in prior works [36], [37], [38], we have not seen its utilization for knowledge transfer using deep neural networks and stochastic optimization. In this work, we formulate sparse representation learning as a computation block that can be incorporated into any CNN and be efficiently optimized using mini-batch updates from stochastic gradient-descent based algorithms. Our formulation allows us to take advantage of not only modern stochastic optimization techniques but also data augmentation to generate target sparsity on-the-fly.

Given the sparse representations obtained from the teacher network, we derive the target pixel-level and image-level sparse representations for the student network. Transferring knowledge from the teacher to the student is then conducted by optimizing the student with its own dictionaries to induce the target sparsity. Thus, our knowledge distillation method is dubbed Sparse Representation Matching (SRM). Extensive experiments presented in Section IV show that SRM significantly outperforms other recent KD methods, especially in transfer learning tasks by large margins. In addition, empirical results also indicate that SRM exhibits robustness to architectural mismatch between the teacher and the student. The implementation of the proposed method is publicly accessible through the following repository: https://github.com/viebboy/SRM
II. RELATED WORK

The idea of transferring knowledge from one model to another has existed for a long time. It was first introduced in [39], in which the authors proposed to grow decision trees to mimic the output of a complex predictor. Later, similar ideas were proposed for training neural networks [40], [41], [29], mainly for the purpose of model compression. Variants of the knowledge transfer idea differ in the methods of representing and transferring knowledge [42], [33], [30], [31], [34], [43], [44], as well as the types of data being used [45], [46], [47], [48].

In [41], the final predictions of an ensemble on unlabeled data are used to train a single neural network. In [40], the authors proposed to use the logits produced by the source network as the representation of knowledge, which is transferred to a target network by minimizing the Mean Squared Error (MSE) between the logits. The term Knowledge Distillation was introduced in [29], in which the student network is trained to simultaneously minimize the cross-entropy measured on the labeled data and the Kullback-Leibler (KL) divergence between its predicted probabilities and the soft probabilities produced by the teacher network. Since its introduction, this formulation has been widely adopted.

In addition to the soft probabilities of the teacher, later works have proposed to utilize intermediate features of the teacher as additional sources of knowledge. For example, in FitNet [30], the authors referred to intermediate feature maps of the teacher as hints, and the student is first pretrained by regressing its intermediate features to the teacher's hints, then later optimized with the standard KD approach. In other works, activation maps [31] as well as feature distributions [34] computed from intermediate layers have been proposed. In recent works [43], [44], instead of transferring knowledge about each individual sample, the authors proposed to transfer relational knowledge between pairs of samples.

Our SRM method bears some resemblance to previous works in the sense that SRM also uses intermediate feature maps as additional sources of knowledge. However, there are many differences between SRM and existing works. For example, in FitNet [30], the student network learns to regress from its intermediate features to the teacher's; however, the regressed features are not actually used in the student network. Thus, the hints in FitNet only indirectly influence the student's features. Since the sparse representation is another (equivalent) representation of the feature maps, SRM directly influences the student's features. In addition, by manipulating the sparse representation rather than the hidden features themselves, SRM is less prone to feature value range mismatch between the teacher and the student. This is because, by construction, the sparse coefficients generated by SRM only have values in the range [0, 1], as we will see in Section III. The attention-based KD method [31] overcomes this problem by normalizing (using the l2 norm) the attention maps of the teacher and the student. This normalization step, however, might suffer from numerical instability (when the l2 norm is very small) when attention maps are calculated from activation layers such as ReLU.

In [49], the authors also employed the idea of sparse coding, however, to represent the network's parameters rather than the intermediate feature maps as in our work. In [50], the authors used the idea of feature quantization and k-means clustering on intermediate features of the teacher to train the student with additional convolutional modules that predict the cluster labels. Our pixel-level label bears some resemblance to this method. However, we explicitly represent the intermediate features by sparse representation (by minimizing the reconstruction error) and use the same process to transfer both local (pixel-level) and global (image-level) information.
III. KNOWLEDGE DISTILLATION BY SPARSE REPRESENTATION MATCHING

A. Knowledge Representation

Given the n-th input image $X_n$, let us denote by $T_n^{(l)} \in \mathbb{R}^{H_l \times W_l \times C_l}$ the output of the l-th layer of the teacher CNN, with $H_l$ and $W_l$ being the spatial dimensions and $C_l$ the number of channels. In the following, we use the subscripts $T$ and $S$ to denote a variable that is related to the teacher and the student network, respectively. In addition, we also denote by $t_{n,i,j}^{(l)} = T_n^{(l)}(i, j, :) \in \mathbb{R}^{C_l}$ the pixel at position $(i, j)$ of $T_n^{(l)}$. The first objective in SRM is to represent each pixel $t_{n,i,j}^{(l)}$ in a sparse domain. To do so, SRM learns an overcomplete dictionary of $M_l$ atoms, $D_T^{(l)} = [d_{T,1}^{(l)}, \dots, d_{T,M_l}^{(l)}] \in \mathbb{R}^{C_l \times M_l}$ ($M_l > C_l$), which is used to express each pixel $t_{n,i,j}^{(l)}$ as a linear combination of the atoms $d_{T,m}^{(l)}$ as follows:

$$t_{n,i,j}^{(l)} = \sum_{m=1}^{M_l} \psi_k\big(t_{n,i,j}^{(l)}, d_{T,m}^{(l)}\big) \cdot \kappa\big(t_{n,i,j}^{(l)}, d_{T,m}^{(l)}\big) \cdot d_{T,m}^{(l)} \quad (1)$$

where
• $\kappa(t_{n,i,j}^{(l)}, d_{T,m}^{(l)})$ denotes a function that measures the similarity between $t_{n,i,j}^{(l)}$ and atom $d_{T,m}^{(l)}$. We further denote by $k_{T,n,i,j}^{(l)} = [\kappa(t_{n,i,j}^{(l)}, d_{T,1}^{(l)}), \dots, \kappa(t_{n,i,j}^{(l)}, d_{T,M_l}^{(l)})]$ the vector that contains the similarities between $t_{n,i,j}^{(l)}$ and all atoms in the dictionary $D_T^{(l)}$.
• $\psi_k(t_{n,i,j}^{(l)}, d_{T,m}^{(l)})$ denotes the indicator function that returns a value of 1 if $\kappa(t_{n,i,j}^{(l)}, d_{T,m}^{(l)})$ belongs to the set of top-$k$ values in $k_{T,n,i,j}^{(l)}$, and a value of 0 otherwise.

The decomposition in Eq. (1) basically means that $t_{n,i,j}^{(l)}$ is expressed as the linear combination of the $k$ most similar atoms in $D_T^{(l)}$, with the coefficients being the corresponding similarity values. Let $\lambda_{n,i,j,m}^{(l)} = \psi_k(t_{n,i,j}^{(l)}, d_{T,m}^{(l)}) \cdot \kappa(t_{n,i,j}^{(l)}, d_{T,m}^{(l)})$; the sparse representation of $t_{n,i,j}^{(l)}$ is then defined as:

$$\tilde{t}_{n,i,j}^{(l)} = \big[\lambda_{n,i,j,1}^{(l)}, \dots, \lambda_{n,i,j,M_l}^{(l)}\big] \in \mathbb{R}^{M_l} \quad (2)$$

By construction, there are only $k$ non-zero elements in $\tilde{t}_{n,i,j}^{(l)}$, and $k$ defines the degree of sparsity, which is a hyperparameter of SRM. In order to find $\tilde{t}_{n,i,j}^{(l)}$, we simply minimize the reconstruction error in Eq. (1) as follows:

$$\arg\min_{D_T^{(l)}} \sum_{n,i,j} \Big\| t_{n,i,j}^{(l)} - \sum_{m=1}^{M_l} \psi_k\big(t_{n,i,j}^{(l)}, d_{T,m}^{(l)}\big) \cdot \kappa\big(t_{n,i,j}^{(l)}, d_{T,m}^{(l)}\big) \cdot d_{T,m}^{(l)} \Big\|_2^2 \quad (3)$$

There are many choices for the similarity function $\kappa$, such as the linear kernel, RBF kernel, sigmoid kernel and so on. In our work, we used the sigmoid kernel $\kappa(x, y) = \mathrm{sigmoid}(x^T y + c)$, since the dot-product makes it computationally efficient and the gradients in the backward pass are stable. Although the RBF kernel is popular in many works, we empirically found that the RBF kernel is sensitive to its hyperparameter, which easily leads to numerical issues.
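To make Eqs. (1)–(3) concrete, the following is a minimal PyTorch-style sketch of the sparse coding block, assuming the sigmoid kernel described above. The class name `SRMDictionary`, its initialization and the mean-reduced loss are our own illustrative choices and are not taken from the official repository.

```python
import torch
import torch.nn as nn

class SRMDictionary(nn.Module):
    """Overcomplete dictionary D in R^{C x M} with top-k sigmoid-kernel codes (Eqs. (1)-(3))."""
    def __init__(self, channels, num_atoms, k):
        super().__init__()
        assert num_atoms > channels, "the dictionary should be overcomplete (M > C)"
        self.atoms = nn.Parameter(0.01 * torch.randn(channels, num_atoms))  # D
        self.bias = nn.Parameter(torch.zeros(1))                            # c in sigmoid(x^T y + c)
        self.k = k

    def similarities(self, feats):
        # feats: (N, C, H, W) hidden feature map -> (N, H, W, M) similarities kappa(t, d_m)
        pixels = feats.permute(0, 2, 3, 1)
        return torch.sigmoid(pixels @ self.atoms + self.bias)

    def sparse_codes(self, feats):
        # Eq. (2): keep the k largest similarities per pixel, zero out the rest (psi_k)
        sims = self.similarities(feats)
        topk = sims.topk(self.k, dim=-1).indices
        mask = torch.zeros_like(sims).scatter_(-1, topk, 1.0)
        return sims * mask

    def reconstruction_loss(self, feats):
        # Eq. (3): fit the dictionary by minimizing the pixel-wise reconstruction error;
        # the teacher features are treated as fixed targets (detach them upstream).
        codes = self.sparse_codes(feats)            # (N, H, W, M)
        recon = codes @ self.atoms.t()              # (N, H, W, C)
        target = feats.permute(0, 2, 3, 1)
        return (recon - target).pow(2).sum(dim=-1).mean()
```

For a teacher layer with $C_l$ channels, one such block per source layer could be instantiated, e.g. `SRMDictionary(channels=C_l, num_atoms=int(mu * C_l), k=int(lam * mu * C_l))`, and fitted on mini-batches of teacher features with a stochastic optimizer.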

B. Transferring Knowledge

Let us denote by $S_n^{(p)} \in \mathbb{R}^{H_p \times W_p \times C_p}$ the output of the p-th layer of the student network given the input image $X_n$. In addition, $s_{n,i,j}^{(p)} \in \mathbb{R}^{C_p}$ denotes the pixel at position $(i, j)$ of $S_n^{(p)}$. We consider the task of transferring knowledge from the l-th layer of the teacher to the p-th layer of the student. To do so, we require that the spatial dimensions of both networks match ($H_p = H_l$ and $W_p = W_l$), while the channel dimensions might differ.

Given the sparse representation of the teacher in Eq. (2), a straightforward way is to train the student network to produce hidden features at spatial position $(i, j)$ having the same sparse coefficients as its teacher. However, trying to learn the exact sparse representations produced by the teacher is too restrictive a task, since this enforces learning the absolute value of every point in a high-dimensional space. Instead of enforcing an absolute constraint on how each pixel of every sample should be represented, to transfer knowledge, we only enforce relative constraints between them in the sparse domain. Specifically, we propose to train the student to only approximate the sparse structures of the teacher network by solving a classification problem with two types of labels extracted from the sparse representation $\tilde{t}_{n,i,j}^{(l)}$ of the teacher: pixel-level and image-level labels.

Pixel-level labeling: for each spatial position $(i, j)$, we assign a class label, which is the index of the largest element of $\tilde{t}_{n,i,j}^{(l)}$, i.e., the index of the closest (most similar) atom in $D_T^{(l)}$. This basically means that we partition all pixels into $M_l$ disjoint sets using dictionary $D_T^{(l)}$, and the student network is trained to learn the same partitioning using its own dictionary $D_S^{(p)} = [d_{S,1}^{(p)}, \dots, d_{S,M_l}^{(p)}] \in \mathbb{R}^{C_p \times M_l}$. Let $k_{S,n,i,j}^{(p)} = [\kappa(s_{n,i,j}^{(p)}, d_{S,1}^{(p)}), \dots, \kappa(s_{n,i,j}^{(p)}, d_{S,M_l}^{(p)})]$ denote the vector that contains the similarities between pixel $s_{n,i,j}^{(p)}$ and the $M_l$ atoms in $D_S^{(p)}$. The first knowledge transfer objective in our method, using the pixel-level labels, is defined as follows:

$$\arg\min_{\Theta_S, D_S^{(p)}} \sum_{n,i,j} L_{CE}\big(c_{n,i,j}, k_{S,n,i,j}^{(p)}\big) \quad (4)$$

where $\Theta_S$ denotes the parameters of the student network, $L_{CE}$ denotes the cross-entropy loss, and $c_{n,i,j} = \arg\max(\tilde{t}_{n,i,j}^{(l)})$. Here we should note that the idea of transferring the structure instead of the absolute representation is not new. For example, in [43] and [44], the authors proposed to transfer the relative distances between the embeddings of samples. In our case, the pixel-level labels provide supervisory information on how the pixels in the student network should be represented in the sparse domain so that their partitioning using the nearest atom is the same.
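Below is a sketch of the pixel-level objective in Eq. (4), reusing the hypothetical `SRMDictionary` block from the earlier sketch. Using the raw similarity vector directly as class scores is our simplification; the reference implementation may scale or normalize it differently.

```python
import torch
import torch.nn.functional as F

def pixel_level_loss(teacher_dict, student_dict, teacher_feats, student_feats):
    """Eq. (4): classify every student pixel into the index of the teacher's most similar atom."""
    with torch.no_grad():
        # c_{n,i,j}: index of the largest element of the teacher's sparse code
        labels = teacher_dict.sparse_codes(teacher_feats).argmax(dim=-1)    # (N, H, W)
    scores = student_dict.similarities(student_feats)                       # (N, H, W, M)
    return F.cross_entropy(scores.flatten(0, 2), labels.flatten())
```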
Image-level labeling: given the sparse representation $\tilde{t}_{n,i,j}^{(l)}$ of the teacher, we generate an image-level label by averaging $\tilde{t}_{n,i,j}^{(l)}$ over the spatial dimensions. While the pixel-level labels provide local supervisory information encoding the spatial information, the image-level label provides global supervisory information, promoting the shift-invariance property. The image-level label bears some resemblance to the Bag-of-Features model [51], which aggregates the histograms of image patches to generate an image-level feature. The second knowledge transfer objective in our method, using the image-level labels, is defined as follows:

$$\arg\min_{\Theta_S, D_S^{(p)}} \sum_{n} L_{BCE}\Big( \frac{\sum_{i,j} \tilde{t}_{n,i,j}^{(l)}}{H_l \cdot W_l}, \frac{\sum_{i,j} k_{S,n,i,j}^{(p)}}{H_l \cdot W_l} \Big) \quad (5)$$

where $L_{BCE}$ denotes the binary cross-entropy loss. Here we should note that since most kernel functions output a similarity score in $[0, 1]$, the elements of $\tilde{t}_{n,i,j}^{(l)}$ and $k_{S,n,i,j}^{(p)}$ are also in this range, making the two inputs to $L_{BCE}$ in Eq. (5) valid.
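Correspondingly, a sketch of the image-level objective in Eq. (5) under the same assumptions (and reusing the imports of the previous sketches); both inputs stay in [0, 1] because the sigmoid-kernel similarities and their top-k sparse codes do.

```python
def image_level_loss(teacher_dict, student_dict, teacher_feats, student_feats):
    """Eq. (5): match spatially averaged sparse codes (teacher) and similarities (student) with BCE."""
    with torch.no_grad():
        target = teacher_dict.sparse_codes(teacher_feats).mean(dim=(1, 2))   # (N, M), in [0, 1]
    pred = student_dict.similarities(student_feats).mean(dim=(1, 2))         # (N, M), in (0, 1)
    return F.binary_cross_entropy(pred, target)
```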
To summarize, the procedure of our SRM method is similar to that of FitNet [30] and consists of the following steps:
• Step 1: Given the source layers (with indices l) in the teacher network T, find the sparse representations by solving Eq. (3).
• Step 2: Given the target layers (with indices p), optimize the student network S to predict the pixel-level and image-level labels by solving Eqs. (4) and (5).
• Step 3: Given the student network obtained in Step 2, optimize it using the original KD algorithm.
All optimization objectives in our algorithm are solved by stochastic gradient descent.
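A condensed sketch of how the three steps could be wired together, building on the hypothetical helpers above; `teacher.features`, `student.features`, `kd_loss_fn` and the single-epoch loops are illustrative placeholders rather than the paper's actual training script (its schedules and epoch counts are given in Section IV).

```python
def train_srm(teacher, student, dict_T, dict_S, loader, opt_dict, opt_student, kd_loss_fn):
    # Step 1: learn the teacher-side dictionary by minimizing Eq. (3); the teacher stays frozen.
    for images, _ in loader:
        feats_T = teacher.features(images).detach()
        loss = dict_T.reconstruction_loss(feats_T)
        opt_dict.zero_grad(); loss.backward(); opt_dict.step()

    # Step 2: pretrain the student on pixel- and image-level labels, Eqs. (4)-(5).
    # opt_student is assumed to also cover dict_S.parameters().
    for images, _ in loader:
        feats_T = teacher.features(images).detach()
        feats_S = student.features(images)
        loss = pixel_level_loss(dict_T, dict_S, feats_T, feats_S) \
             + image_level_loss(dict_T, dict_S, feats_T, feats_S)
        opt_student.zero_grad(); loss.backward(); opt_student.step()

    # Step 3: finish with standard KD (soft teacher targets plus cross-entropy on the labels).
    for images, labels in loader:
        loss = kd_loss_fn(student(images), teacher(images).detach(), labels)
        opt_student.zero_grad(); loss.backward(); opt_student.step()
```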

IV. EXPERIMENTS

The first set of experiments was conducted on the CIFAR100 dataset [52] to compare our SRM method with other KD methods: KD [29], FitNet [30], AT [31], PKT [34], RKD [43] and CRD [44]. In the second set of experiments, we tested the algorithms under a transfer learning setting. In the final set of experiments, we evaluated SRM on the large-scale ImageNet dataset.

For every experiment configuration, we ran 3 times and reported the mean and standard deviation. Regarding the source and the target layers for transferring intermediate knowledge, we used the outputs of the downsampling layers of the teacher and student networks. The detailed experimental setup and our analysis are provided in the following sections. In addition, our implementation is publicly accessible through the following repository: https://github.com/viebboy/SRM

A. Experiments on CIFAR100

Since CIFAR100 has no validation set, we randomly selected 5K samples from the training set for validation purposes, reducing the training set size to 45K. Our setup is different from the conventional practice of validating and reporting the result on the test set of CIFAR100. In this set of experiments, we used DenseNet121 [11] as the teacher network (7.1M parameters and 900M FLOPs) and a non-residual architecture (a variant of the AllCNN network [53]) as the student network (2.2M parameters and 161M FLOPs).

Regarding the training setup, we used the ADAM optimizer and trained all networks for 200 epochs with an initial learning rate of 0.001, which is reduced by 0.1 at epochs 31 and 131. For those methods that have a pre-training phase, the number of epochs for pre-training was set to 160. For regularization, we used weight decay with coefficient 0.0001. For data augmentation, we followed the conventional protocol, which randomly performs horizontal flipping and random horizontal and/or vertical shifting by four pixels. For SRM, KD and FitNet, we used our own implementation; for the other methods (AT, PKT, RKD, CRD), we used the code provided by the authors of the CRD method [44].

For KD, FitNet and SRM, we validated the temperature τ (used to soften the teacher's probabilities) and the weight α (used to balance between the classification and distillation losses) from the following sets: τ ∈ {2.0, 4.0, 8.0} and α ∈ {0.25, 0.5, 0.75}. For the other methods, we used the values provided in the CRD paper [44]. In addition, there are two hyperparameters of SRM: the degree of sparsity (λ = k/M_l) and the overcompleteness of the dictionaries (µ = M_l/C_l). Lower values of λ indicate sparser representations, while higher values of µ indicate larger dictionaries. For CIFAR100, we performed validation with λ ∈ {0.01, 0.02, 0.03} and µ ∈ {1.5, 2.0, 3.0}.

TABLE I: PERFORMANCE ON CIFAR100

Model                   Test Accuracy (%)
DenseNet121 (teacher)   75.09 ± 00.29
AllCNN                  67.64 ± 01.87
AllCNN-KD               73.27 ± 00.20
AllCNN-FitNet           72.03 ± 00.27
AllCNN-AT               70.88 ± 0.29
AllCNN-PKT              72.22 ± 0.35
AllCNN-RKD              70.39 ± 0.17
AllCNN-CRD              72.70 ± 0.18
AllCNN-SRM (our)        74.73 ± 00.26

Overall comparison (Table I): It is clear that the student network significantly benefits from all knowledge transfer methods. The proposed SRM method clearly outperforms the other competing methods, establishing more than a 1% margin over the second best method (KD). In fact, the student network trained with SRM achieves performance very close to its teacher (74.73% versus 75.09%), despite having a non-residual architecture and 3× fewer parameters. In addition, recent methods such as PKT and CRD perform better than FitNet, but are inferior to the original KD method.

Quality of intermediate knowledge transfer: Since FitNet and SRM have a pretraining phase to transfer intermediate knowledge, we compared the quality of the intermediate knowledge transferred by FitNet and SRM by conducting two types of experiments after transferring intermediate knowledge: (1) the student network is optimized for 60 epochs using only the training data (without the teacher's soft probabilities), with all parameters fixed except the last linear layer for classification. This experiment is called Linear Probing; (2) the student network is optimized with respect to all parameters for 200 epochs using only the training data (without the teacher's soft probabilities). This experiment is named Whole Network Update. The results are shown in Table II.

TABLE II: COMPARISON (TEST ACCURACY %) BETWEEN FITNET AND SRM IN TERMS OF QUALITY OF INTERMEDIATE KNOWLEDGE TRANSFER ON CIFAR100 USING ALLCNN ARCHITECTURE

Method      Linear Probing   Whole Network Update
FitNet      69.95 ± 00.20    68.06 ± 00.10
SRM (our)   69.10 ± 00.31    71.99 ± 00.08

Firstly, both experiments show that the student networks outperform their baseline and thus benefit from intermediate knowledge transfer by both methods, even without the final KD phase. In the first experiment, when we only updated the output layer, the student pretrained by FitNet achieves slightly better performance than the one pretrained by SRM (69.95% versus 69.10%). However, when we optimized with respect to all parameters, the student pretrained by SRM performs significantly better than the one pretrained by FitNet (71.99% versus 68.06%). While the full parameter update led the student pretrained by SRM to better local optima, it undermines the student pretrained by FitNet. This result suggests that the process of intermediate knowledge transfer in SRM can initialize the student network at better positions in the parameter space compared to FitNet.

Effects of sparsity (λ) and dictionary size (µ): in Table III, we show the performance of SRM with different degrees of sparsity (parameterized by λ = k/M_l, lower λ indicates higher sparsity) and dictionary sizes (parameterized by µ = M_l/C_l, higher µ indicates higher overcompleteness). As can be seen from Table III, SRM is not sensitive to λ and µ. In fact, the worst configuration still performs slightly better than KD (73.71% versus 73.27%), and much better than AT, PKT, RKD and CRD.

TABLE III: SRM TEST ACCURACY (%) ON CIFAR100 WITH DIFFERENT DICTIONARY SIZES (PARAMETERIZED BY µ) AND DEGREE OF SPARSITY (PARAMETERIZED BY λ)

λ \ µ      1.5              2.0              3.0
0.01       74.00 ± 00.15    74.12 ± 00.35    74.09 ± 00.08
0.02       74.34 ± 00.07    74.73 ± 00.26    74.20 ± 00.27
0.03       73.77 ± 00.05    73.83 ± 00.12    73.71 ± 00.51

Pixel-level label and image-level label: finally, to show the importance of combining both the pixel-level and the image-level labels, we experimented with two other variants of SRM on CIFAR100: using either the pixel-level or the image-level label only. The results are shown in Table IV. The student network performs poorly when only receiving intermediate knowledge via the image-level labels, even though it was later optimized with the standard KD phase. Similar to the observations made from Table II, this again suggests that the position in the parameter space, which is obtained after the intermediate knowledge transfer phase and prior to the standard KD phase, heavily affects the final performance. Once the student network is badly initialized, even receiving soft probabilities from the teacher as additional signals does not help. Although using the pixel-level label alone is better than the image-level label, the best performance is obtained by combining both objectives.

TABLE IV: EFFECTS OF PIXEL-LEVEL LABEL AND IMAGE-LEVEL LABEL IN SRM ON CIFAR100

Pixel-level label   Image-level label   Test accuracy %
X                                       73.16 ± 00.39
                    X                   53.50 ± 09.96
X                   X                   74.73 ± 00.26
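The sparsity and overcompleteness values in Table III translate into an integer number of atoms and non-zeros per layer; the small helper below (our own naming, shown only to make the parameterization concrete) reproduces this bookkeeping.

```python
def srm_dictionary_shape(num_channels, mu=2.0, lam=0.02):
    """Map SRM's hyperparameters to a dictionary size M_l = mu * C_l and sparsity k = lam * M_l."""
    num_atoms = int(round(mu * num_channels))
    k = max(1, int(round(lam * num_atoms)))
    return num_atoms, k

# Example: a 256-channel teacher layer with mu = 2.0 and lam = 0.02
# gives M_l = 512 atoms and k = 10 non-zero coefficients per pixel.
```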
B. Transfer Learning Experiments

Since transfer learning is key to the success of many applications that utilize deep neural networks, we conducted experiments on 5 transfer learning problems (Flowers [54], CUB [55], Cars [56], Indoor-Scenes [57] and PubFig83 [58]) to assess how well the proposed method works under a transfer learning setting compared to the others.

In our experiments, we used a ResNext50 [12] pretrained on the ILSVRC2012 database as the teacher network, which is finetuned using the training set of each transfer learning task. We then benchmarked how well each knowledge distillation method transfers both the pretrained knowledge and the domain-specific knowledge to a randomly initialized student network using domain-specific data. Both residual (ResNet18 [59], 11.3M parameters, 1.82G FLOPs) and non-residual (a variant of AllCNN [53], 5.1M parameters, 1.35G FLOPs) architectures were used as the student.

For each transfer learning dataset, we randomly sampled a few samples from the training set to establish the validation set. All images were resized to a resolution of 224 × 224, and we used standard ImageNet data augmentation (random crop and horizontal flipping) during stochastic optimization. Based on the analysis of hyperparameters in the CIFAR100 experiments, we validated KD, FitNet and SRM using α ∈ {0.5, 0.75} and τ ∈ {4.0, 8.0}. Sparsity degree λ = 0.02 and dictionary size µ = 2.0 were used for SRM. For the other methods, we used the hyperparameter settings provided in the CRD paper [44]. All experiments were conducted using the ADAM optimizer for 200 epochs with an initial learning rate of 0.001, which is reduced by 0.1 at epochs 41 and 161. The weight decay coefficient was set to 0.0001.

Full transfer setting: in this setting, we used all samples available in the training set to perform knowledge transfer from the teacher to the student. The test accuracy achieved by the different methods is shown in the upper part of Table V. It is clear that the proposed SRM outperforms the other methods on many datasets, except on the Cars and Indoor-Scenes datasets for the AllCNN student. While KD, AT, PKT, CRD and SRM successfully led the students to better minima with both the residual and the non-residual student, FitNet was only effective with the residual one. The results suggest that the proposed intermediate knowledge transfer mechanism in SRM is robust to architectural differences between the teacher and the student networks.

TABLE V: FULL TRANSFER LEARNING USING ALLCNN (ACNN) AND RESNET18 (RN18) (TEST ACCURACY %)

Model        Flowers          CUB              Cars             Indoor-Scenes    PubFig83
ResNext50    89.35 ± 00.62    69.53 ± 00.45    87.45 ± 00.27    63.51 ± 00.43    91.41 ± 00.14
Full Shot
ACNN         40.80 ± 02.33    47.26 ± 00.18    61.93 ± 01.38    35.82 ± 00.43    78.47 ± 00.17
ACNN-KD      46.14 ± 00.39    51.80 ± 00.41    66.12 ± 00.17    38.44 ± 00.99    81.54 ± 00.09
ACNN-FitNet  30.10 ± 02.41    44.30 ± 01.25    60.20 ± 03.83    30.87 ± 00.35    77.61 ± 00.87
ACNN-AT      51.62 ± 00.69    51.74 ± 00.39    73.89 ± 00.06    43.56 ± 00.52    81.11 ± 01.51
ACNN-PKT     47.12 ± 00.52    47.60 ± 00.85    70.16 ± 00.51    37.71 ± 00.64    82.03 ± 00.26
ACNN-RKD     42.00 ± 01.10    39.99 ± 00.61    56.99 ± 02.44    30.94 ± 00.45    75.44 ± 00.50
ACNN-CRD     46.99 ± 01.13    51.12 ± 00.34    68.89 ± 00.47    42.82 ± 00.21    83.02 ± 00.11
ACNN-SRM     51.72 ± 00.58    54.51 ± 01.72    71.44 ± 04.76    43.09 ± 00.60    82.89 ± 01.78
RN18         44.25 ± 00.42    44.79 ± 00.68    57.17 ± 01.95    36.72 ± 00.29    79.08 ± 00.36
RN18-KD      48.26 ± 00.33    54.91 ± 00.33    75.29 ± 00.46    43.84 ± 00.67    84.49 ± 00.28
RN18-FitNet  48.29 ± 01.24    61.28 ± 00.48    85.01 ± 00.10    45.93 ± 01.00    89.78 ± 00.20
RN18-AT      51.49 ± 00.42    53.13 ± 00.40    77.14 ± 00.15    44.13 ± 00.55    83.60 ± 00.19
RN18-PKT     45.32 ± 00.51    45.24 ± 00.39    71.24 ± 02.23    37.27 ± 00.79    82.00 ± 00.10
RN18-RKD     42.32 ± 00.28    36.29 ± 00.58    56.87 ± 02.64    29.57 ± 00.90    70.90 ± 01.79
RN18-CRD     47.67 ± 00.05    53.25 ± 00.83    76.51 ± 00.74    43.76 ± 00.40    84.43 ± 00.40
RN18-SRM     67.46 ± 01.06    63.11 ± 00.45    84.40 ± 00.67    54.59 ± 00.38    89.79 ± 00.53

TABLE VI: 5-SHOT TRANSFER LEARNING USING ALLCNN (ACNN) AND RESNET18 (RN18) (TEST ACCURACY %)

Model        Flowers          CUB              Cars             Indoor-Scenes    PubFig83
ResNext50    89.35 ± 00.62    69.53 ± 00.45    87.45 ± 00.27    63.51 ± 00.43    91.41 ± 00.14
5-Shot
ACNN         32.91 ± 00.94    13.19 ± 00.46    05.03 ± 00.11    09.21 ± 00.81    05.15 ± 00.45
ACNN-KD      35.98 ± 00.76    21.60 ± 00.83    10.61 ± 00.52    14.81 ± 00.67    11.97 ± 01.40
ACNN-FitNet  28.73 ± 01.18    14.78 ± 00.60    06.11 ± 00.37    08.36 ± 00.52    08.25 ± 01.36
ACNN-AT      38.21 ± 00.88    17.69 ± 00.94    08.52 ± 01.50    08.76 ± 00.39    08.02 ± 00.70
ACNN-PKT     33.25 ± 00.31    11.30 ± 00.20    06.16 ± 00.29    10.75 ± 01.04    06.13 ± 00.18
ACNN-RKD     30.27 ± 00.27    09.55 ± 00.34    04.82 ± 00.31    09.68 ± 00.39    04.82 ± 00.16
ACNN-CRD     35.01 ± 00.57    18.09 ± 00.91    06.77 ± 00.55    09.73 ± 00.49    06.33 ± 00.28
ACNN-SRM     41.14 ± 01.09    22.89 ± 01.18    11.71 ± 01.21    16.63 ± 00.34    13.96 ± 00.52
RN18         32.95 ± 00.37    11.55 ± 00.10    05.00 ± 00.27    09.04 ± 00.44    04.98 ± 00.22
RN18-KD      38.07 ± 01.47    25.53 ± 00.35    11.37 ± 00.58    14.61 ± 00.88    10.69 ± 00.99
RN18-FitNet  39.17 ± 00.62    26.50 ± 00.46    12.36 ± 01.00    13.72 ± 01.07    11.49 ± 00.66
RN18-AT      37.36 ± 01.23    18.22 ± 00.47    08.96 ± 00.92    09.93 ± 00.73    08.76 ± 00.34
RN18-PKT     33.24 ± 00.47    10.63 ± 00.18    05.88 ± 00.18    11.03 ± 00.82    05.34 ± 00.11
RN18-RKD     29.97 ± 00.55    08.83 ± 00.40    04.98 ± 00.13    08.41 ± 00.49    04.49 ± 00.24
RN18-CRD     34.24 ± 00.32    18.36 ± 02.09    06.95 ± 00.26    09.01 ± 00.23    06.36 ± 00.21
RN18-SRM     51.24 ± 00.48    34.67 ± 00.63    26.63 ± 01.82    21.18 ± 01.29    29.31 ± 00.51

Few-shot transfer setting: we further assessed how well the methods perform when there is a limited amount of data for knowledge transfer. For each dataset, we randomly selected 5 samples (5-shot) and 10 samples (10-shot) from the training set for training purposes, and kept the validation and test sets similar to the full transfer setting. Since the Flowers dataset has only 10 training samples in total (the original split provided by the database has 20 samples per class; however, we used 10 samples for validation purposes), the results for 10-shot are similar to the full transfer learning setting. The test performance (accuracy %) under the 5-shot and 10-shot transfer learning settings is reported in Table VI and Table VII, respectively. Under this restrictive regime, the proposed SRM method performs far better than the other tested methods for both types of students, largely improving the baseline results.

TABLE VII: 10-SHOT TRANSFER LEARNING USING ALLCNN (ACNN) AND RESNET18 (RN18) (TEST ACCURACY %)

Model        Flowers          CUB              Cars             Indoor-Scenes    PubFig83
ResNext50    89.35 ± 00.62    69.53 ± 00.45    87.45 ± 00.27    63.51 ± 00.43    91.41 ± 00.14
10-Shot
ACNN         40.80 ± 02.33    25.53 ± 01.07    09.50 ± 00.82    16.23 ± 00.46    10.09 ± 00.19
ACNN-KD      46.14 ± 00.39    34.63 ± 00.35    23.43 ± 01.47    21.76 ± 00.53    28.11 ± 00.50
ACNN-FitNet  30.10 ± 02.41    29.40 ± 00.63    15.81 ± 02.16    15.71 ± 01.32    17.89 ± 00.67
ACNN-AT      51.62 ± 00.69    30.26 ± 00.35    25.22 ± 02.90    17.90 ± 00.40    26.27 ± 01.76
ACNN-PKT     47.12 ± 00.53    24.93 ± 01.92    13.82 ± 01.18    16.78 ± 00.59    10.67 ± 00.26
ACNN-RKD     42.00 ± 01.10    19.36 ± 00.50    09.60 ± 00.25    12.40 ± 00.76    06.99 ± 00.57
ACNN-CRD     46.99 ± 01.13    29.72 ± 00.25    16.96 ± 00.77    17.60 ± 00.85    17.24 ± 02.98
ACNN-SRM     51.72 ± 00.58    35.86 ± 01.39    36.28 ± 04.33    24.82 ± 00.47    31.47 ± 01.84
RN18         44.25 ± 00.42    22.72 ± 00.91    11.76 ± 01.35    15.16 ± 00.54    08.92 ± 00.73
RN18-KD      48.26 ± 00.33    40.57 ± 00.53    35.44 ± 01.13    23.85 ± 00.78    29.67 ± 00.91
RN18-FitNet  48.29 ± 01.24    43.83 ± 00.47    51.88 ± 01.72    24.62 ± 00.47    36.79 ± 01.27
RN18-AT      51.49 ± 00.42    30.47 ± 00.55    27.70 ± 02.72    17.48 ± 00.38    29.80 ± 01.41
RN18-PKT     45.32 ± 00.52    20.62 ± 02.56    11.00 ± 00.63    15.76 ± 00.62    08.80 ± 00.54
RN18-RKD     42.32 ± 00.28    17.73 ± 00.14    08.73 ± 00.71    11.82 ± 00.34    06.49 ± 00.27
RN18-CRD     47.67 ± 00.05    30.04 ± 00.13    17.11 ± 00.66    17.82 ± 00.49    16.08 ± 01.67
RN18-SRM     67.46 ± 01.06    48.42 ± 01.86    61.22 ± 06.84    32.21 ± 01.21    51.99 ± 02.95

TABLE VIII: TOP-1 ERROR OF RESNET18 ON IMAGENET. (*) INDICATES RESULTS OBTAINED WITH 110 EPOCHS

Method       KD      AT      Seq. KD   KD+ONE   ESKD    ESKD+AT   CRD*     SRM (our)
Top-1 error  30.79   29.30   29.60     29.45    29.16   28.84     28.62*   28.79

C. ImageNet Experiments

For the ImageNet [60] experiments, we followed a similar experimental setup as in [61], [31]: ResNet34 and ResNet18 were used as the teacher and the student network, respectively. For the hyperparameters of SRM, we set µ = 4.0, λ = 0.02, τ = 4.0, α = 0.3. For the other training protocols, we followed the standard setup for ImageNet, which trains the student network for 100 epochs using the SGD optimizer with an initial learning rate of 0.1, dropping by 0.1 at epochs 51, 81 and 91. The weight decay coefficient was set to 0.0001. In addition, the pre-training phase took 80 epochs with an initial learning rate of 0.1, dropping by 0.1 every 20 epochs.

Table VIII shows the top-1 classification errors of SRM and related methods, including KD [29], AT [31], Seq. KD [62], KD+ONE [63], ESKD [61], ESKD+AT [61] and CRD [44], under the same experimental setup. The teacher network (ResNet34) achieves a top-1 error of 26.70%. Using SRM, we can successfully train ResNet18 to achieve a 28.79% classification error, which is lower than that of existing methods (except CRD, which was trained for 110 epochs), including the recently proposed ESKD+AT [61] that combines the early-stopping trick and Attention Transfer [31].

V. CONCLUSION

In this work, we proposed Sparse Representation Matching (SRM), a method to transfer intermediate knowledge from one network to another using sparse representation learning. Experimental results on several datasets indicated that SRM outperforms related methods, successfully performing intermediate knowledge transfer even if there is a significant architectural mismatch between the networks and/or a limited amount of data. SRM serves as a starting point for developing specific knowledge transfer use cases, e.g., data-free knowledge transfer, which is an interesting future research direction.

VI. ACKNOWLEDGEMENT

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 871449 (OpenDR). This publication reflects the authors' views only. The European Commission is not responsible for any use that may be made of the information it contains.

REFERENCES

[1] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," in Advances in neural information processing systems, pp. 91–99, 2015.
[2] J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[3] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[4] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[5] S. Kiranyaz, T. Ince, and M. Gabbouj, "Real-time patient-specific ecg classification by 1-d convolutional neural networks," IEEE Transactions on Biomedical Engineering, vol. 63, no. 3, pp. 664–675, 2015.
[6] D. T. Tran, N. Passalis, A. Tefas, M. Gabbouj, and A. Iosifidis, "Attention-based neural bag-of-features learning for sequence data," arXiv preprint arXiv:2005.12250, 2020.
[7] D. T. Tran, A. Iosifidis, J. Kanniainen, and M. Gabbouj, "Temporal attention-augmented bilinear network for financial time-series data analysis," IEEE transactions on neural networks and learning systems, vol. 30, no. 5, pp. 1407–1418, 2018.
[8] Z. Zhang, S. Zohren, and S. Roberts, "Deeplob: Deep convolutional neural networks for limit order books," IEEE Transactions on Signal Processing, vol. 67, no. 11, pp. 3001–3012, 2019.
[9] D. T. Tran, J. Kanniainen, M. Gabbouj, and A. Iosifidis, "Data normalization for bilinear structures in high-frequency financial time-series," arXiv preprint arXiv:2003.00598, 2020.
[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[11] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708, 2017.
[12] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500, 2017.
[13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[14] D. T. Tran, A. Iosifidis, and M. Gabbouj, "Improving efficiency in convolutional neural networks with multilinear filters," Neural Networks, vol. 105, pp. 328–339, 2018.
[15] D. T. Tran, S. Kiranyaz, M. Gabbouj, and A. Iosifidis, "Heterogeneous multilayer generalized operational perceptron," IEEE transactions on neural networks and learning systems, 2019.
[16] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," arXiv preprint arXiv:1611.01578, 2016.
[17] D. T. Tran, S. Kiranyaz, M. Gabbouj, and A. Iosifidis, "Pygop: A python library for generalized operational perceptron algorithms," Knowledge-Based Systems, vol. 182, p. 104801, 2019.
[18] D. T. Tran, S. Kiranyaz, M. Gabbouj, and A. Iosifidis, "Knowledge transfer for face verification using heterogeneous generalized operational perceptrons," in 2019 IEEE International Conference on Image Processing (ICIP), pp. 1168–1172, IEEE, 2019.
[19] D. T. Tran and A. Iosifidis, "Learning to rank: A progressive neural network learning approach," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8355–8359, IEEE, 2019.
[20] F. Manessi, A. Rozza, S. Bianco, P. Napoletano, and R. Schettini, "Automated pruning for deep neural network compression," in 2018 24th International Conference on Pattern Recognition (ICPR), pp. 657–664, IEEE, 2018.
[21] F. Tung and G. Mori, "Clip-q: Deep network compression learning by in-parallel pruning-quantization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7873–7882, 2018.
[22] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6869–6898, 2017.
[23] S.-C. Zhou, Y.-Z. Wang, H. Wen, Q.-Y. He, and Y.-H. Zou, "Balanced quantization: An effective and efficient approach to quantized neural networks," Journal of Computer Science and Technology, vol. 32, no. 4, pp. 667–682, 2017.
[24] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," in Advances in neural information processing systems, pp. 1269–1277, 2014.
[25] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," arXiv preprint arXiv:1405.3866, 2014.
[26] D. T. Tran, M. Yamaç, A. Degerli, M. Gabbouj, and A. Iosifidis, "Multilinear compressive learning," IEEE transactions on neural networks and learning systems, 2020.
[27] D. T. Tran, M. Gabbouj, and A. Iosifidis, "Multilinear compressive learning with prior knowledge," arXiv preprint arXiv:2002.07203, 2020.
[28] D. T. Tran, M. Gabbouj, and A. Iosifidis, "Performance indicator in multilinear compressive learning," in 2020 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1822–1828, IEEE, 2020.
[29] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[30] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "Fitnets: Hints for thin deep nets," arXiv preprint arXiv:1412.6550, 2014.
[31] S. Zagoruyko and N. Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," arXiv preprint arXiv:1612.03928, 2016.
[32] M. Gao, Y. Shen, Q. Li, J. Yan, L. Wan, D. Lin, C. C. Loy, and X. Tang, "An embarrassingly simple approach for knowledge distillation," arXiv preprint arXiv:1812.01819, 2018.
[33] B. Heo, M. Lee, S. Yun, and J. Y. Choi, "Knowledge transfer via distillation of activation boundaries formed by hidden neurons," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3779–3787, 2019.
[34] N. Passalis and A. Tefas, "Learning deep representations with probabilistic knowledge transfer," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 268–284, 2018.
[35] Z. Zhang, Y. Xu, J. Yang, X. Li, and D. Zhang, "A survey of sparse representation: algorithms and applications," IEEE Access, vol. 3, pp. 490–530, 2015.
[36] J. Mairal, F. Bach, and J. Ponce, "Task-driven dictionary learning," IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 4, pp. 791–804, 2011.
[37] P. Sprechmann, A. M. Bronstein, and G. Sapiro, "Learning efficient sparse and low rank models," IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 9, pp. 1821–1833, 2015.
[38] V. Monga, Y. Li, and Y. C. Eldar, "Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing," arXiv preprint arXiv:1912.10557, 2019.
[39] L. Breiman and N. Shang, "Born again trees," University of California, Berkeley, Berkeley, CA, Technical Report, vol. 1, p. 2, 1996.
[40] J. Ba and R. Caruana, "Do deep nets really need to be deep?," in Advances in neural information processing systems, pp. 2654–2662, 2014.
[41] C. Buciluă, R. Caruana, and A. Niculescu-Mizil, "Model compression," in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535–541, 2006.
[42] C. Cao, F. Liu, H. Tan, D. Song, W. Shu, W. Li, Y. Zhou, X. Bo, and Z. Xie, "Deep learning and its applications in biomedicine," Genomics, proteomics & bioinformatics, vol. 16, no. 1, pp. 17–32, 2018.
[43] W. Park, D. Kim, Y. Lu, and M. Cho, "Relational knowledge distillation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3967–3976, 2019.
[44] Y. Tian, D. Krishnan, and P. Isola, "Contrastive representation distillation," in International Conference on Learning Representations, 2020.
[45] R. G. Lopes, S. Fenu, and T. Starner, "Data-free knowledge distillation for deep neural networks," arXiv preprint arXiv:1710.07535, 2017.
[46] J. Yoo, M. Cho, T. Kim, and U. Kang, "Knowledge extraction with no observable data," in Advances in Neural Information Processing Systems, pp. 2701–2710, 2019.
[47] N. Papernot, M. Abadi, U. Erlingsson, I. Goodfellow, and K. Talwar, "Semi-supervised knowledge transfer for deep learning from private training data," arXiv preprint arXiv:1610.05755, 2016.
[48] A. Kimura, Z. Ghahramani, K. Takeuchi, T. Iwata, and N. Ueda, "Few-shot learning of neural networks from scratch by pseudo example optimization," arXiv preprint arXiv:1802.03039, 2018.
[49] J. Liu, D. Wen, H. Gao, W. Tao, T.-W. Chen, K. Osa, and M. Kato, "Knowledge representing: Efficient, sparse representation of prior knowledge for knowledge distillation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0, 2019.
[50] H. Jain, S. Gidaris, N. Komodakis, P. Pérez, and M. Cord, "Quest: Quantized embedding space for transferring knowledge," arXiv preprint arXiv:1912.01540, 2019.
[51] N. Passalis and A. Tefas, "Neural bag-of-features learning," Pattern Recognition, vol. 64, pp. 277–294, 2017.
[52] A. Krizhevsky, G. Hinton, et al., "Learning multiple layers of features from tiny images," 2009.
[53] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, "Striving for simplicity: The all convolutional net," arXiv preprint arXiv:1412.6806, 2014.
[54] M.-E. Nilsback and A. Zisserman, "Automated flower classification over a large number of classes," in 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729, IEEE, 2008.
[55] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The Caltech-UCSD Birds-200-2011 Dataset," Tech. Rep. CNS-TR-2011-001, California Institute of Technology, 2011.
[56] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, "3d object representations for fine-grained categorization," in Proceedings of the IEEE international conference on computer vision workshops, pp. 554–561, 2013.
[57] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 413–420, IEEE, 2009.
[58] N. Pinto, Z. Stone, T. Zickler, and D. Cox, "Scaling up biologically-inspired computer vision: A case study in unconstrained face recognition on facebook," in CVPR 2011 Workshops, pp. 35–42, IEEE, 2011.
[59] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
[60] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., "Imagenet large scale visual recognition challenge," International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015.
[61] J. H. Cho and B. Hariharan, "On the efficacy of knowledge distillation," in Proceedings of the IEEE International Conference on Computer Vision, pp. 4794–4802, 2019.
[62] C. Yang, L. Xie, S. Qiao, and A. Yuille, "Knowledge distillation in generations: More tolerant teachers educate better students," arXiv preprint arXiv:1805.05551, 2018.
[63] X. Zhu, S. Gong, et al., "Knowledge distillation by on-the-fly native ensemble," in Advances in neural information processing systems, pp. 7517–7527, 2018.