Similarity Transfer for Knowledge Distillation

Haoran Zhao, Kun Gong, Xin Sun, Member, IEEE, Junyu Dong, Member, IEEE, and Hui Yu, Senior Member, IEEE

Abstract—Knowledge distillation is a popular paradigm for learning portable neural networks by transferring the knowledge from a large model into a smaller one. Most existing approaches enhance the student model by utilizing the similarity information between categories at the instance level provided by the teacher model. However, these works ignore the similarity correlation between different instances, which also plays an important role in confidence prediction. To tackle this issue, we propose a novel method in this paper, called similarity transfer for knowledge distillation (STKD), which aims to fully utilize the similarities between categories of multiple samples. Furthermore, we propose to better capture the similarity correlation between different instances by the mixup technique, which creates virtual samples by a weighted linear interpolation. Note that our distillation loss can fully utilize the similarities among the incorrect classes via the mixed labels. The proposed approach promotes the performance of the student model, as a virtual sample created from multiple images produces a similar probability distribution in the teacher and student networks. Experiments and ablation studies on several public classification datasets, including CIFAR-10, CIFAR-100, CINIC-10 and Tiny-ImageNet, verify that this light-weight method can effectively boost the performance of the compact student model. They show that STKD substantially outperforms vanilla knowledge distillation and achieves superior accuracy over the state-of-the-art knowledge distillation methods.

Index Terms—Deep neural networks, image classification, model compression, knowledge distillation.

This work was supported in part by the National Natural Science Foundation of China (No. U1706218, 61971388) and the Major Program of the Natural Science Foundation of Shandong Province (No. ZR2018ZB0852). (Corresponding authors: Xin Sun and Junyu Dong.)
H. Zhao, K. Gong, X. Sun, and J. Dong are with the College of Information Science and Engineering, Ocean University of China, 238 Songling Road, Qingdao, China (e-mail: [email protected]; [email protected]; [email protected]).
H. Yu is with the School of Creative Technologies, University of Portsmouth, Portsmouth, PO1 2DJ, UK (e-mail: [email protected]).

I. INTRODUCTION

Deep convolutional neural networks (CNNs) have made unprecedented advances in a wide range of computer vision applications such as image classification [1][2], object detection [3][4] and semantic segmentation [5]. However, these top-performing neural networks are usually developed with large depth, many parameters and high complexity, which consume expensive computing resources and make them hard to deploy on low-capacity edge devices. With the increasing demand for low-cost networks and real-time response on resource-constrained devices, there is an urgent need for novel solutions that can reduce model complexity while keeping decent performance.

To tackle this problem, a large body of work has been proposed in recent years to accelerate or compress deep neural networks. Generally, these solutions fall into the following categories: network pruning [6][7][8], network decomposition [9][10][11], network quantization and knowledge distillation [12][13]. Among these methods, the seminal work of knowledge distillation has attracted a lot of attention due to its ability to exploit the dark knowledge of a pre-trained large network.

Knowledge distillation (KD) was proposed by Hinton et al. [12] for supervising the training of a compact yet efficient student model by capturing and transferring the knowledge of a large teacher model to the compact one. Its success is attributed to the knowledge contained in the class distributions provided by the teacher via the softened softmax. Many follow-up works have since been proposed, focusing on different categories of knowledge with various kinds of distillation loss functions for CNN optimization. However, it is still an open question how to define the distillation loss, i.e., how to best capture and define the knowledge of the teacher to train the student.

In vanilla KD [12], the knowledge of the teacher network is transferred to the student network by mimicking the class probability outputs, which are softened by setting a temperature hyperparameter in the softmax. Inspired by KD, subsequent works introduce the outputs of intermediate layers as knowledge to supervise the training of the student network. For example, FitNets [14] first introduces intermediate representations as hints to guide the training of the student network, directly matching the feature maps of intermediate layers between the teacher and the student. Later, Zagoruyko et al. [15] introduce attention maps (AT) derived from the feature maps of intermediate layers as knowledge. FSP [16] designs a flow distillation loss to encourage the student to mimic the teacher's flow matrices between the feature maps of two layers. Recently, Park et al. [17] propose a relational knowledge distillation (RKD) method which draws mutual relations of data examples via the proposed distance-wise and angle-wise distillation losses. In particular, SP [18] preserves the pairwise similarities in the student's representation space instead of mimicking the representation space of the teacher. Besides, various approaches extend these works by matching other statistics, such as gradients [19] and distributions [20].

Recently, there have also been attempts to extend KD to other domains, where it likewise shows its potential. For example, Papernot et al. [21] introduce defensive distillation against adversarial attacks to reduce the effectiveness of adversarial samples on CNNs. Gupta et al. [22] transfer knowledge among images from different modalities. Besides, knowledge distillation can be employed in task-specific methods such as object detection [23], semantic segmentation [24][25], face model compression [26] and depth estimation [27][28].

From a theoretical perspective, Yuan et al. [29] interpret knowledge distillation in terms of Label Smoothing Regularization [30] and find the importance of soft-target regularization. They propose to manually design the regularization distribution to replace the teacher network. Notably, Hinton et al. [31] investigate the effect of label smoothing on knowledge distillation and observe that label smoothing can alter the performance of distillation when the teacher model is trained with label smoothing. Through visualization of the penultimate-layer activations of CNNs, they find that label smoothing can discard the similarity information between categories, causing poor distillation results. Thus, the similarity information between categories is vital for knowledge distillation, yet this issue has been ignored by existing methods. As mentioned above, most existing works investigate knowledge distillation depending on similarities at the level of individual instances.

Different from previous methods, we aim to exploit the privileged knowledge of the similarity correlation between different instances using virtual samples created by mixing samples. Generally speaking, a high-capacity teacher model with excellent performance produces over-confident predictions. Consequently, the incorrect classes receive low confidence, which carries little similarity information among categories. As shown in Fig. 1, the teacher in vanilla knowledge distillation [12] outputs an over-confident prediction for the correct category and low confidence for the incorrect classes. However, this category similarity information is exactly the most important knowledge that should be transferred to the student. For example, the probability of incorrectly classifying a cat as a car must be lower than the probability of misclassifying it as a dog. Thus, we argue that such labels are more informative for the specific images. In other words, the teacher should guide the student to distinguish both what the image is and what it looks like.

In this work, we employ mixup [32] to create our virtual training samples by a weighted linear interpolation. In this way, the virtual samples with mixed labels can provide additional intra- and inter-class relations in the datasets. In contrast to existing approaches, the proposed method transfers the similarity correlation between different instances instead of the similarity information of individual samples. That is, we force the virtual samples created from multiple images to produce similar probability distributions in the teacher and student networks. Due to the softened probability distributions of the virtual samples, the teacher's knowledge can be more easily learned by the compact student model, which further improves its robustness.

To sum up, our contributions in this paper can be summarized as follows:
• We propose similarity transfer for knowledge distillation, which utilizes the valuable knowledge of similarities between categories across multiple instances.
• We introduce the mixup technique to create virtual samples, together with a novel knowledge distillation loss that incorporates the mixed labels.
• Extensive experiments are conducted on various public datasets such as CIFAR-10/100, CINIC-10 and Tiny-ImageNet. We experimentally demonstrate that our approach significantly outperforms vanilla knowledge distillation and other SOTA approaches.

The rest of this paper is organized as follows. Related work is reviewed in Part II. We present the proposed similarity transfer for knowledge distillation architecture in Part III. Experimental results are presented in Part IV. Finally, Part V concludes this paper.

II. RELATED WORKS

In this section, we first briefly introduce the background of network model compression and acceleration and then summarize existing works on knowledge distillation. Finally, we review existing works related to the mixup technique.

A. Model Compression and Acceleration

Recently, CNNs have become predominant in many computer vision tasks. Network architectures have become deeper and wider, achieving better performance than shallow ones. However, these complicated networks also bring high computation and memory costs, which obstruct their deployment in real-time applications. Thus, model compression and acceleration aim to reduce the model size and computation cost while maintaining high performance.

One kind of method directly designs small network structures by modifying the convolution, since the original deep model has many redundant parameters. For example, MobileNet [33] introduces depth-wise separable convolutions to build blocks instead of standard convolutions. ShuffleNet [34] proposes point-wise group convolution and channel shuffle to maintain decent performance without adding computational burden. These approaches retain decent performance without adding too much computing burden at the inference phase.

Besides, another kind of method attempts to remove the redundant information of the teacher network. For example, network pruning aims to boost the speed of inference by getting rid of the redundancy of the large CNN model. Han et al. [35] propose to boost the speed of inference by deleting unimportant connections. However, network pruning methods require many iterations to converge, and the pruning threshold needs to be set manually. Network decomposition uses matrix decomposition techniques to decompose the original convolution kernels of the CNN model. For example, Novikov et al. [36] propose to reduce the number of parameters while preserving the original performance of the network by converting the dense weight matrices of the fully-connected layers into the Tensor Train format. Network quantization methods aim to represent the float weights with fewer bits. Yang et al. [37] propose to simulate the quantization process with a differentiable function. Note that the above-mentioned works can be combined with KD methods for further improvement.


Fig. 1. The whole framework of our Similarity Transfer Knowledge Distillation (STKD) vs. vanilla knowledge distillation. STKD extracts more similarity information among categories by feeding virtual samples with mixed labels built from the original samples. As shown on the right, STKD obtains softer probability distributions from the teacher and transfers this similarity information to the student.

B. Knowledge Distillation

Different from the above methods, knowledge distillation enriches the student model by extracting various kinds of knowledge from a fixed teacher model. To address the challenge of deploying CNNs on resource-constrained edge devices, Bucilua et al. [38] first propose to transfer the knowledge of an ensemble of models to a small model. Then Caruana et al. [39] propose to train the student model by mimicking the teacher model's logits. Later, Hinton et al. [12] popularize the idea of knowledge distillation, which efficiently transfers knowledge from a large teacher network to a compact student network by mimicking its class probability outputs. Note that the outputs of the teacher network are referred to as dark knowledge in KD, providing similarity information as extra supervision compared with one-hot labels. In other words, two sources of supervision are used to train the student in KD: one from the ground-truth labels, and the other from the soft targets of the teacher network.

Afterward, some recent works [14][15] extend KD by distilling knowledge from intermediate feature representations instead of soft labels. For example, FitNets [14] proposes to train the student network by mimicking the intermediate feature maps of the teacher network, which are defined as hints. Inspired by this, Zagoruyko et al. [15] propose to match attention maps between the teacher and the student, which are derived from the original feature maps as knowledge. Wang et al. [40] propose to improve the performance of the student network by matching the distributions of spatial neuron activations between the teacher and the student. Recently, Heo et al. [41] introduce the activation boundaries of hidden neurons as knowledge for distilling a compact student network.

However, the aforementioned knowledge distillation methods only utilize the knowledge contained in the outputs of specific layers of the teacher network. Richer knowledge between different layers has also been explored and utilized for knowledge distillation. For example, Yim et al. [16] propose to use the Gram matrix between different feature layers as distilled knowledge, named the flow of solution procedure (FSP), which reflects the relations between different feature maps. Lee et al. [42] propose to extract valuable correlation information between different feature maps as knowledge using singular value decomposition (SVD). However, these methods only focus on the knowledge contained in individual data samples and ignore the similarities between the categories of multiple samples, which are also valuable for knowledge distillation.

Thus, some very recent works [17][18] aim to explore the relationships between data samples. Park et al. [17] propose a relational knowledge distillation (RKD) method, which extracts mutual relations of data samples via the proposed distance-wise and angle-wise distillation losses. Tung et al. [18] propose a similarity-preserving (SP) knowledge distillation method which transfers the similar activations of input pairs from the teacher network to the student network. Peng et al. [43] propose to capture the high-order correlation between samples using a kernel method. However, the similarity information between instances is hard for the student network to learn due to the high dimensionality of the embedding space. Different from existing works, the proposed approach presents the similarity correlation between instances through the probability distribution outputs of virtual samples created by the mixup technique. In contrast to vanilla KD, which softens the softmax outputs using a temperature hyperparameter, our approach directly outputs a softened probability distribution using the virtual samples.

C. Mixed Sample Data Augmentation

There are many training strategies to further improve the performance of CNNs, such as data augmentation and regularization techniques. In particular, mixed sample data augmentation has gained increasing attention due to its state-of-the-art performance. It trains models on an augmented data set built by combining data samples according to some policy. For example, Zhang et al. [32] first introduce a data augmentation approach named mixup, which improves the generalization of CNNs by linearly interpolating a random pair of original training samples, and their corresponding labels, to create a new training sample. Inspired by this, several mixup variants [44][45][46] have been proposed. Later, CutMix [47] proposes to replace removed image regions with a patch from another image. FMix [48] proposes to use binary masks obtained by applying a threshold to low-frequency images sampled from Fourier space. Following them, Pairing Samples [49] proposes to take an average of two samples for each pixel.

Recently, Cubuk et al. [50] employ automatic search to improve data augmentation policies, called AutoAugment. Although these methods achieve remarkable performance in the training of CNNs, they suffer from prohibitive training time.

III. PROPOSED METHOD

In this section, we first revisit the definition of vanilla KD [12], which transfers knowledge from a high-capacity teacher network to a student network with soft labels. Then we describe the proposed similarity transfer for knowledge distillation. Fig. 1 compares vanilla KD with STKD and describes the whole framework of our approach.

A. Problem Definition

In our approach, we aim to explore a more effective kind of knowledge, contained in the similarity correlation between instances, to guide the training of the student model. Moreover, we employ the mixup technique to represent this knowledge by introducing virtual samples.

For simplicity, we define the large, complex teacher deep neural network as T with learned parameters P_T, and the less complex student network as S with learned parameters P_S. The original training data consist of tuples of input data and target (x, y) ∈ D. In general, conventional knowledge distillation methods train the student by minimizing the following objective function with respect to the parameters P_S over the training samples (x, y) ∈ D:

L = L_KD(S(x, P_S), T(x, P_T)) + λ H(y_s, y)    (1)

where λ is a balancing hyperparameter, H is the cross-entropy loss computed between the predicted labels y_s and the corresponding ground-truth labels y, and L_KD is the distillation loss, such as cross-entropy or mean squared error. T(x, P_T) and S(x, P_S) represent the softmax outputs of the teacher and student, respectively.

For example, Hinton et al. [12] use pre-softmax outputs for T(x, P_T) and S(x, P_S), and the Kullback-Leibler divergence for L_KD:

L_KD = Σ_{x_i ∈ D} KL( softmax(T(x_i, P_T) / t), softmax(S(x_i, P_S) / t) )    (2)

where t is a temperature hyperparameter that softens the softmax outputs of the teacher network. Thus, a proper t needs to be set manually. However, it is hard to obtain a proper temperature t due to the gap between the teacher and student networks.

Likewise, other works [14][16] involve objective functions that can also be formulated in the form of Eq. 1. However, these conventional KD methods only focus on the similarity correlation between the categories of individual samples, while ignoring the similarity correlation between different samples. Moreover, the temperature hyperparameter t, which is used to soften the probability distribution, needs to be set manually.

B. Similarity Transfer Knowledge Distillation

In this section, we introduce similarity transfer for knowledge distillation in detail. Usually, traditional knowledge distillation methods [12] utilize the temperature hyperparameter to soften the one-hot labels into soft labels. Different from existing works, STKD presents the similarity correlation between different samples through the probability distribution outputs of virtual samples, which are created by the mixup technique.

To be specific, we first employ the recently proposed mixup [32] to create the virtual samples. We randomly sample two samples (x_i, y_i) and (x_j, y_j) from D and generate a new synthetic sample (x, y) by a weighted linear interpolation of these two samples:

x = λ x_i + (1 − λ) x_j    (3)

y = λ y_i + (1 − λ) y_j    (4)

where the merging coefficient λ ∈ [0, 1] is a random number drawn from the Beta(α, α) distribution, and y is the label of x, a convex combination of the labels of x_i and x_j.

As can be seen from Fig. 1, the virtual samples are created by this mixup technique. In other words, each sample in the virtual set is formed from two arbitrary original images by a weighted linear interpolation.

By Eq. 4, we turn the one-hot labels into mixed labels, which also contribute the similarity information between different samples. Thus, the mixed labels of the virtual samples are employed in our framework, and the mix loss can be formulated as:

L_mix = λ H(y_{s_i}, y) + (1 − λ) H(y_{s_j}, y)    (5)

where y_{s_i} and y_{s_j} are the predicted labels corresponding to the ground-truth labels y, and λ is the same coefficient as in the mixup technique. Note that we supervise the training of the student network using these mixed labels together with the teacher network's logits.

Then we distill the similarity knowledge between multiple samples from the teacher network to the student network using the created virtual samples. The total loss for similarity transfer knowledge distillation is defined as follows:

L_total = λ H(y_{s_i}, y) + (1 − λ) H(y_{s_j}, y) + L_KD(S(x, P_S), T(x, P_T))    (6)

Different from the vanilla KD method, we get rid of the temperature hyperparameter. Thus, the probability distribution outputs of T and S are defined as T(x, P_T) = softmax(T(x_i, P_T)) and S(x, P_S) = softmax(S(x_i, P_S)), respectively. Note that the classification loss keeps the hyperparameter λ to balance the weight of the mixed labels.

Unlike previous methods, we define the similarity correlation between instances as knowledge and present it through the probability distributions of the virtual images. To this end, we employ the mixup technique to create virtual samples, which is effective for our method. It generates mixup images that, on the one hand, contain the similarity correlation of multiple images and, on the other hand, regularize the student model through knowledge distillation to favor simple linear behavior in between training samples. Moreover, we can generate different mixup images by varying the coefficient λ; we study this coefficient in the ablation study.

In this way, the teacher network transfers more valuable information to the student network than in existing methods. That means the teacher's dark knowledge can be better transferred to the student.
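The following PyTorch sketch illustrates Eqs. (3)-(6) as stated above; it is a minimal reading of the method rather than the authors' released code, and the helper names make_virtual_batch and stkd_loss are our own.

```python
import torch
import torch.nn.functional as F


def make_virtual_batch(x, y, alpha=1.0):
    """Create virtual samples by mixup (Eqs. 3-4): mix each image in the
    batch with a randomly chosen partner from the same shuffled batch."""
    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    index = torch.randperm(x.size(0), device=x.device)
    x_mix = lam * x + (1.0 - lam) * x[index]
    return x_mix, y, y[index], lam


def stkd_loss(student_logits, teacher_logits, y_i, y_j, lam):
    """STKD total loss (Eq. 6): mixed-label cross-entropy (Eq. 5) plus a
    KL term between the plain, temperature-free softmax outputs."""
    mix = lam * F.cross_entropy(student_logits, y_i) \
          + (1.0 - lam) * F.cross_entropy(student_logits, y_j)
    kd = F.kl_div(F.log_softmax(student_logits, dim=1),
                  F.softmax(teacher_logits, dim=1),
                  reduction="batchmean")
    return mix + kd
```

Because no temperature is applied in this sketch, the softness of the targets comes entirely from the mixed inputs themselves, matching the discussion above.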

C. Training Procedure

Algorithm 1 describes the details of the whole training paradigm. We first obtain a pre-trained teacher model T with parameters P_T, trained using standard back-propagation on the original dataset (x, y) ∈ D.

Then we iterate the following stages until the accuracy of the student S converges. 1) We create the virtual set in place of the original dataset using the mixup technique. Specifically, we use a single data loader to obtain the mini-batch and apply mixup to the same mini-batch after random shuffling. 2) We feed these virtual samples to our knowledge distillation framework, which distills the similarity correlation between instances from the pre-trained teacher to the student network using our STKD loss in Eq. 6.

Fig. 1 illustrates the overall framework of STKD. Different from previous methods, the proposed STKD regularizes the student network to favor simple linear behavior under the teacher's guidance. In other words, we employ the mixup technique to provide more similarity information among classes and make the high-dimensional representation of the teacher more easily transferable to the student. We demonstrate its effectiveness in the following section.

Algorithm 1 Similarity Transfer Knowledge Distillation
Input: A pre-trained teacher model T with parameters P_T.
Input: The original training data (x, y) ∈ D.
Input: Hyper-parameters (coefficient λ, etc.).
Output: A compact student model S with parameters P_S.
1: Initialize: the student network S and training hyper-parameters.
2: Repeat:
3:   Stage 1: Create the virtual samples by mixup.
4:   Sample a batch (x, y) from the training dataset.
5:   Sample two samples (x_i, y_i) and (x_j, y_j) randomly from the batch.
6:   Synthesize the virtual samples by Eqs. 3 and 4.
7:   Stage 2: Train the student model.
8:   Sample a batch (x, y_i, y_j) from the virtual samples.
9:   Calculate the softmax outputs of the teacher and student networks: P_t(x), P_s(x).
10:  Calculate the knowledge distillation loss in Eq. 6.
11:  Update the weights P_S according to the gradient.
12: Until: convergence.
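A compact training loop corresponding to Algorithm 1 might look as follows; this is a sketch under the same assumptions as the loss snippet in Section III-B (the function name and the in-batch mixing strategy follow our reading of the text).

```python
import torch
import torch.nn.functional as F


def train_stkd_epoch(student, teacher, loader, optimizer, alpha=1.0, device="cuda"):
    """One pass of Algorithm 1: build virtual samples from each mini-batch,
    then update the student with the STKD loss of Eq. (6)."""
    teacher.eval()
    student.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        # Stage 1: virtual samples by mixup within the shuffled mini-batch (Eqs. 3-4).
        lam = float(torch.distributions.Beta(alpha, alpha).sample())
        idx = torch.randperm(x.size(0), device=device)
        x_mix = lam * x + (1.0 - lam) * x[idx]
        # Stage 2: softmax outputs of the (frozen) teacher and the student.
        with torch.no_grad():
            t_logits = teacher(x_mix)
        s_logits = student(x_mix)
        # STKD loss: mixed-label cross-entropy plus the temperature-free KL term.
        loss = (lam * F.cross_entropy(s_logits, y)
                + (1.0 - lam) * F.cross_entropy(s_logits, y[idx])
                + F.kl_div(F.log_softmax(s_logits, dim=1),
                           F.softmax(t_logits, dim=1), reduction="batchmean"))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```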
IV. EXPERIMENTS AND RESULTS

In this section, extensive experiments are conducted to verify the effectiveness of the proposed approach. We first design comparison experiments with state-of-the-art (SOTA) KD methods (e.g., AT [15], SP [18] and RKD [17]) to test our approach. Additionally, different network architectures (e.g., ResNet [51], WideResNet [52], VGG [53]) are explored in ablation studies.

Note that we implement the experiments using PyTorch on Nvidia 1080Ti GPU devices. For all the experiments, the results are averaged over 5 trials with different randomly selected seeds.

A. Experiment Settings

Datasets. To demonstrate our method under general situations of data diversity, we run experiments on CIFAR-10/100, CINIC-10 and Tiny-ImageNet, which are popular benchmark datasets for image classification. CIFAR-10 [54] and CIFAR-100 both contain 50K training images and 10K test images; CIFAR-10 has 10 categories while CIFAR-100 contains 100 classes. The CINIC-10 dataset consists of images from both the CIFAR-10 dataset and the ImageNet dataset, and its scale is closer to that of ImageNet: it is composed of 270,000 images at a spatial resolution of 32 × 32 via the addition of down-sampled ImageNet images. We adopt the CINIC-10 dataset for rapid experimentation. The Tiny-ImageNet dataset [55] is a popular subset of the ImageNet dataset [1]. It contains 100K training images at 64 × 64 resolution in 200 categories, plus 10K additional images for testing. Each class has 500 training images, 50 validation images, and 50 test images.

Network architecture. We use three state-of-the-art convolutional neural network architectures: ResNet [56], Wide Residual Network (WRN) [57] and VGG [58]. For each network family, we employ a deep or wide one as the teacher network and a shallow or thin one as the student network. Note that WRN has a standard convolutional layer followed by three groups of residual blocks, each of size n; additionally, it uses a widening factor m to increase the width, which brings more representation ability. We denote WRN as WRN-n-m in our experiments. In the following experiments, we take ResNet-110, WRN-40-2 and VGG-13 as the teacher networks, and ResNet-20, WRN-40-1 and VGG-8 as the student networks. In addition, we use different network architectures (e.g., ResNet-50 as teacher and VGG-8 as student) in the ablation studies.

Evaluation metric. We adopt classification accuracy as the evaluation metric. Note that the accuracy is computed as the median of 5 runs with different seeds.
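As a concrete example of the data pipeline described above, a CIFAR-100 loader with the flip-and-crop augmentation used in the following experiments could be built as below; the crop padding of 4 and the normalization statistics are common defaults rather than values given in the paper, and CINIC-10 or Tiny-ImageNet would need an ImageFolder-style loader instead.

```python
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

# CIFAR-style augmentation: random crop (with padding) and horizontal flip.
normalize = T.Normalize((0.5071, 0.4865, 0.4409), (0.2673, 0.2564, 0.2762))
train_tf = T.Compose([T.RandomCrop(32, padding=4),
                      T.RandomHorizontalFlip(),
                      T.ToTensor(), normalize])
test_tf = T.Compose([T.ToTensor(), normalize])

train_set = torchvision.datasets.CIFAR100("./data", train=True, download=True, transform=train_tf)
test_set = torchvision.datasets.CIFAR100("./data", train=False, download=True, transform=test_tf)

# A single shuffled loader, as required by the in-batch mixup of Sec. III-C.
train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)
test_loader = DataLoader(test_set, batch_size=128, shuffle=False, num_workers=4)
```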

TABLE I
CLASSIFICATION ACCURACY (%) ON THE CIFAR-10 AND CIFAR-100 DATASETS (5 RUNS). "BASELINE" MEANS THE STUDENT NETWORK IS TRAINED INDIVIDUALLY; "STKD" IS THE CLASSIFICATION ACCURACY OF OUR METHOD.

Dataset   | Student (S)       | Teacher (T)       | Baseline | KD [12] | AT [15] | SP [18] | RKD [17] | STKD  | Teacher
CIFAR-10  | WRN-40-1 (0.56M)  | WRN-40-2 (2.20M)  | 93.84    | 94.17   | 93.96   | 94.21   | 94.06    | 94.44 | 94.94
CIFAR-10  | ResNet-20 (0.27M) | ResNet-110 (1.7M) | 92.56    | 93.08   | 93.21   | 93.30   | 93.20    | 93.52 | 94.51
CIFAR-10  | VGG-8 (0.27M)     | VGG-13 (1.7M)     | 92.76    | 93.14   | 93.29   | 93.08   | 93.31    | 93.68 | 94.27
CIFAR-100 | WRN-40-1 (0.56M)  | WRN-40-2 (2.20M)  | 72.08    | 73.34   | 73.77   | 73.46   | 73.08    | 74.63 | 75.61
CIFAR-100 | ResNet-20 (0.27M) | ResNet-110 (1.7M) | 69.14    | 70.69   | 70.97   | 71.02   | 70.77    | 72.14 | 74.37
CIFAR-100 | VGG-8 (0.27M)     | VGG-13 (1.7M)     | 70.39    | 72.87   | 73.43   | 73.49   | 73.01    | 73.65 | 74.66

TABLE II
CLASSIFICATION ACCURACY (%) ON CINIC-10 (5 RUNS). WRN-40-2 IS THE TEACHER NETWORK AND WRN-40-1 THE STUDENT NETWORK. "BASELINE" MEANS WRN-40-1 IS TRAINED INDIVIDUALLY; "STKD" IS THE WRN-40-1 RESULT WITH OUR METHOD.

Type     | Model    | Params (M) | Acc (%)
Baseline | WRN-40-1 | 0.56       | 84.30
KD       | WRN-40-1 | 0.56       | 85.04
AT       | WRN-40-1 | 0.56       | 85.53
RKD      | WRN-40-1 | 0.56       | 84.96
SP       | WRN-40-1 | 0.56       | 85.16
STKD     | WRN-40-1 | 0.56       | 85.94
Teacher  | WRN-40-2 | 2.20       | 86.40

Fig. 2. (a) Testing accuracy of the pre-trained teacher, the student trained with our method, and the student trained individually. (b) Testing accuracy of different knowledge distillation methods on CIFAR-10.

B. Experiment on CIFAR-10/CIFAR-100

On the CIFAR datasets, we use three teacher/student model pairs: ResNet-110/ResNet-20, WRN-40-2/WRN-40-1 and VGG-13/VGG-8. We apply the standard horizontal flip and random crop data augmentation scheme widely adopted for these datasets. We set the weight decay to 10^-4 and the batch size to 128, and use stochastic gradient descent (SGD) with momentum 0.9. The initial learning rate is set to γ = 0.1 and divided by 10 at the 80th, 100th and 150th epochs, for 200 epochs in total. We use the same settings for training the teacher network and distilling the student network.

We conduct baseline comparisons with vanilla KD [12], AT [15], SP [18] and RKD [17]. For vanilla KD, we set the temperature hyperparameter t = 4 following the experiments in [15]. We set the hyperparameter β = 1000 for AT and γ = 0.1 for SP, respectively.

Table I summarizes the classification accuracy of several teacher/student model pairs on the CIFAR-10/100 datasets. For the WRN architecture, the teacher and student networks have the same depth but different widths (WRN-40-2 teacher with WRN-40-1 student).

Our method achieves 94.44%, 93.52% and 93.68% top-1 accuracy for WRN-40-1, ResNet-20 and VGG-8 on CIFAR-10, respectively, substantially surpassing vanilla KD by 1.6%, 1.9% and 1.8%. Compared to the other baselines, the student network from STKD still significantly surpasses the three related SOTA methods. For CIFAR-100, our method also yields 1.6%, 1.7% and 1.8% top-1 accuracy improvements over vanilla KD. Moreover, our approach still surpasses the three SOTA distillation-related methods, which verifies its effectiveness. These results validate our intuition that the similarity correlation between instances encodes valuable knowledge from the teacher network and provides important supervision for knowledge distillation.

Figure 2 plots the accuracy curves for the WRN-40-2/WRN-40-1 experiments on CIFAR-10. Figure 2 (a) shows the testing accuracy curves of the teacher network, the student network trained individually, and the student network trained with our method. Note that STKD obtains a significant improvement over the student trained individually, and its final accuracy is close to that of the teacher. Figure 2 (b) shows the validation accuracy curves of WRN-40-1 over time for different knowledge distillation approaches on the CIFAR-10 dataset. We can observe that our method delivers a significant improvement in final accuracy and outperforms the other SOTA approaches.

From the perspective of the teacher model, our method also shows the potential to compress large networks into more compact ones with minimal accuracy loss. For example, we distill the knowledge from the pre-trained WRN-40-2 teacher network, which contains 2.20M parameters, to the much smaller WRN-40-1 student network, which contains 0.56M parameters. The student network achieves an approximately 4x compression rate with only a 0.5% loss in classification accuracy using off-the-shelf PyTorch.
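In code, the optimization schedule just described amounts to the following sketch; the student, teacher, train_loader and train_stkd_epoch objects are assumed from the earlier snippets.

```python
import torch

# SGD with momentum 0.9 and weight decay 1e-4, initial learning rate 0.1.
optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Divide the learning rate by 10 at epochs 80, 100 and 150; 200 epochs in total.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[80, 100, 150],
                                                 gamma=0.1)

for epoch in range(200):
    train_stkd_epoch(student, teacher, train_loader, optimizer)
    scheduler.step()
```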

C. Experiment on CINIC-10

In this section, we conduct experiments on the CINIC-10 dataset, whose scale is closer to that of ImageNet. We use WRN-40-2 and WRN-40-1 as the teacher network and student network, respectively. For the training phase, we apply CIFAR-style data augmentation with horizontal flips and random crops. The initial learning rate is set to 0.01 and decayed by a factor of 10 after the 100th, 160th and 220th epochs. All the networks are trained for 240 epochs using SGD with Nesterov momentum and a batch size of 96. For a fair comparison, we adopt the same settings as mentioned above for all baseline methods, with hyperparameters t = 8 for KD, β = 100 for AT and γ = 0.1 for SP.

The results on CINIC-10 are shown in Table II. Our method achieves 85.94% classification accuracy, surpassing the student network trained individually by 1.64%. Moreover, STKD substantially surpasses vanilla KD by 0.9% and shows comparable performance with the other SOTA approaches. These results imply that the STKD method induces more meaningful dark knowledge, such as the similarity information between instances, than the other baseline methods.

TABLE III
CLASSIFICATION ACCURACY (%) ON TINY-IMAGENET (5 RUNS). "BASELINE" MEANS WRN-40-1 IS TRAINED INDIVIDUALLY; "STKD" IS THE WRN-40-1 RESULT WITH OUR METHOD.

Type     | Model    | Params (M) | Acc (%)
Baseline | WRN-40-1 | 0.56       | 56.93
KD       | WRN-40-1 | 0.56       | 58.52
AT       | WRN-40-1 | 0.56       | 58.43
RKD      | WRN-40-1 | 0.56       | 57.02
STKD     | WRN-40-1 | 0.56       | 59.25
Teacher  | WRN-40-2 | 2.20       | 62.21

D. Experiment on Tiny-ImageNet

For rapid experimentation, we conduct experiments on the Tiny-ImageNet dataset, which is a popular subset of ImageNet. We adopt the VGG architecture for the following experiments. To be specific, VGG-13 is employed as the teacher network, which achieves 62.21% classification accuracy, and VGG-8 is used as the student network. Both networks are trained for 240 epochs.

In our Tiny-ImageNet classification experiments, we apply random rotation and horizontal flipping for data augmentation. We optimize the model using stochastic gradient descent (SGD) with a mini-batch size of 128 and momentum 0.9. The learning rate starts from 0.1 and is multiplied by 0.2 at the 60th, 120th, 160th, 200th and 250th epochs. We train the network for 300 epochs in total and adopt the deep and wide WRN (WRN-40-1) as a teacher model and WRN-16-1 as a student model. In this section, a ResNet with four blocks is selected, with channel sizes of 16, 32, 64 and 128. Random crop and horizontal flip are used for data augmentation. Table III shows the results, where "baseline" denotes the student network trained individually. As can be seen from Table III, our method, as well as the other SOTA methods, achieves higher classification accuracy than the original student network. Moreover, our STKD method achieves better performance than KD and shows comparable performance with the other knowledge distillation methods, further surpassing them.

TABLE IV
ADDITIONAL EXPERIMENTS UNDER DIFFERENT NETWORK ARCHITECTURE FAMILIES ON CIFAR-100. CLASSIFICATION ACCURACY (%) OVER FIVE RUNS IS REPORTED; THE BEST PERFORMANCE IS SHOWN IN BOLD. "BASELINE" MEANS THE STUDENT NETWORK IS TRAINED INDIVIDUALLY; "STKD" IS THE PERFORMANCE OF THE STUDENT NETWORK WITH OUR METHOD.

Teacher  | ResNet-50   | ResNet-50 | VGG-13
Student  | MobileNetV2 | VGG-8     | MobileNetV2
Baseline | 64.64       | 70.40     | 64.64
KD       | 67.62       | 73.39     | 67.36
AT       | 58.58       | 71.14     | 59.83
SP       | 68.00       | 72.89     | 65.89
STKD     | 69.07       | 73.81     | 68.22
Teacher  | 79.34       | 79.34     | 74.31

E. Ablation Study

Evaluation on different network architectures. To further evaluate the performance of STKD, we perform additional experiments with student/teacher pairs from different network architecture families on CIFAR-100. Table IV shows the experimental results.

As can be seen from Table IV, STKD consistently outperforms KD and the other SOTA methods, which indicates the effectiveness of our method. In particular, STKD achieves accuracies of 69.07%, 73.81% and 68.22% for the three network pairs from different families, respectively, obtaining performance gains of 1.07%, 0.92% and 2.33% over SP. However, we observe that some methods such as AT [15] perform quite poorly when the teacher and student belong to different network architectures. Thus, the results validate that our approach is robust to the choice of teacher/student pairs, and we conclude that the knowledge contained in the similarity correlation between different instances helps to improve the robustness of the student network.

Impact of different coefficients for mixup. In this subsection, we explore the impact of the mixup coefficient. We adopt ResNet-110 as the teacher network and ResNet-20 as the student network. The results on CIFAR-100 can be found in Table V, which illustrates how the performance of STKD is affected by the choice of the coefficient λ. By setting different λ for STKD, we observe that STKD achieves an accuracy of 72.14% at λ = 0.50, which is the best performance in our settings.

Furthermore, we show mixup images generated from the same sample pair with varying mixup coefficient λ in Figure 3. The output probability distributions are also plotted under each mixup image. As can be seen from Figure 3, when λ = 0.50 the mixup image obtains a smoother probability distribution than the other settings. This means that the relative probability assigned to similar categories encodes the semantic similarity between them. Note that previous conventional KD methods usually achieve this goal by setting a large temperature hyperparameter t.

Visualization of the student's output distribution. In this subsection, we visualize the output distributions of the students trained with STKD and with vanilla training on CIFAR-100. Figure 4 shows the visualization results over 100 classes and over 10 sampled classes using tSNE [59]. Notably, the feature space of STKD is significantly more separable than that of the vanilla training manner.

Fig. 3. Different mixup images from the same sample pair obtained by varying the mixup coefficient λ (λ = 0.43, 0.46, 0.50, 0.53). We plot the probability distribution of each mixup image from the teacher network.

TABLE V
CLASSIFICATION ACCURACY (%) ON CIFAR-100 (5 RUNS). "BASELINE" MEANS RESNET-20 IS TRAINED INDIVIDUALLY; "STKD" IS THE RESNET-20 RESULT WITH OUR METHOD (RESNET-110 TEACHER).

λ        | 0.43  | 0.46  | 0.50  | 0.53
1 − λ    | 0.57  | 0.54  | 0.50  | 0.47
Baseline | 69.14 | 69.14 | 69.14 | 69.14
STKD     | 70.21 | 71.08 | 72.14 | 69.82
Teacher  | 74.37 | 74.37 | 74.37 | 74.37

Fig. 4. The tSNE visualization over 100 classes for STKD (left) and vanilla student training (right). Each color represents a category; best viewed in color. The first row shows all 100 classes together. The second row shows 10 sampled classes from CIFAR-100.

V. CONCLUSION

In this paper, we introduce a novel knowledge distillation method, called similarity transfer knowledge distillation, that aims to utilize the similarity correlation between different instances. Moreover, we adopt the mixup technique to encode the semantic similarity between categories, which assigns relative probabilities to the different categories of the instances, and the one-hot labels are replaced by the mixed labels of the virtual samples created by mixup. Our experiments demonstrate that the dark knowledge of the teacher model can be better transferred to the student model and improve the training outcome of the student. We also show that our approach can be regarded as a regularization term for the student in knowledge distillation. The experimental evaluation on the benchmarks CIFAR-10/100, CINIC-10 and Tiny-ImageNet verifies the effectiveness of our approach, which achieves state-of-the-art accuracy for a variety of network architectures.

REFERENCES

[1] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, and M. Bernstein, "Imagenet large scale visual recognition challenge," International Journal of Computer Vision, pp. 211-252, 2015.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 770-778.
[3] Q. Li, S. Jin, and J. Yan, "Mimicking very efficient network for object detection," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[4] S. Chen, Z. Li, and Z. Tang, "Relation r-cnn: A graph based relation-aware network for object detection," IEEE Signal Processing Letters, vol. 27, pp. 1680-1684, 2020.
[5] J. Xie, B. Shuai, J.-F. Hu, J. Lin, and W.-S. Zheng, "Improving fast segmentation with teacher-student learning," British Machine Vision Conference 2018 (BMVC), p. 205, 2018.
[6] Y. L. Cun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in International Conference on Neural Information Processing Systems, 1989, pp. 598-605.
[7] L. Hao, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convnets," 5th International Conference on Learning Representations, ICLR, 2017.
[8] S. Lin, R. Ji, Y. Li, C. Deng, and X. Li, "Towards compact convnets via structure-sparsity regularized filter pruning," IEEE Transactions on Neural Networks and Learning Systems, 2019.
[9] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. D. Freitas, "Predicting parameters in deep learning," in International Conference on Neural Information Processing Systems, 2013, pp. 2148-2156.
[10] Y. D. Kim, E. Park, S. Yoo, T. Choi, Y. Lu, and D. Shin, "Compression of deep convolutional neural networks for fast and low power mobile applications," Computer Science, vol. 71, no. 2, pp. 576-584, 2015.
[11] W. Ren, J. Zhang, L. Ma, J. Pan, X. Cao, W. Zuo, W. Liu, and M. Yang, "Deep non-blind deconvolution via generalized low-rank approximation," in Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, 2018, pp. 295-305.
[12] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," Computer Science, vol. 14, no. 7, pp. 38-39, 2015.
[13] D. Yoon, J. Park, and D. Cho, "Lightweight deep cnn for natural image matting via similarity-preserving knowledge distillation," IEEE Signal Processing Letters, vol. 27, pp. 2139-2143, 2020.
[14] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, and Y. Bengio, "Fitnets: Hints for thin deep nets," 3rd International Conference on Learning Representations, ICLR, 2015.
[15] S. Zagoruyko and N. Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," 5th International Conference on Learning Representations, ICLR, 2017.

[16] J. Yim, D. Joo, J. Bae, and J. Kim, "A gift from knowledge distillation: Fast optimization, network minimization and transfer learning," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7130-7138.
[17] W. Park, D. Kim, Y. Lu, and M. Cho, "Relational knowledge distillation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3967-3976.
[18] F. Tung and G. Mori, "Similarity-preserving knowledge distillation," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1365-1374.
[19] S. Srinivas and F. Fleuret, "Knowledge transfer with Jacobian matching," ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. Stockholmsmässan, Stockholm, Sweden: PMLR, 10-15 Jul 2018, pp. 4723-4731.
[20] N. Passalis and A. Tefas, "Learning deep representations with probabilistic knowledge transfer," in Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XI, 2018, pp. 283-299.
[21] N. Papernot, P. D. McDaniel, X. Wu, S. Jha, and A. Swami, "Distillation as a defense to adversarial perturbations against deep neural networks," in IEEE Symposium on Security and Privacy, SP 2016, San Jose, CA, USA, May 22-26, 2016. IEEE Computer Society, 2016, pp. 582-597.
[22] S. Gupta, J. Hoffman, and J. Malik, "Cross modal distillation for supervision transfer," in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 2016, pp. 2827-2836.
[23] J. Uijlings, S. Popov, and V. Ferrari, "Revisiting knowledge transfer for training object class detectors," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1101-1110.
[24] Y. Liu, K. Chen, C. Liu, Z. Qin, Z. Luo, and J. Wang, "Structured knowledge distillation for semantic segmentation," in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 2019, pp. 2604-2613.
[25] J. Jiao, Y. Wei, Z. Jie, H. Shi, R. W. H. Lau, and T. S. Huang, "Geometry-aware distillation for indoor semantic segmentation," in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 2019, pp. 2869-2878.
[26] P. Luo, Z. Zhu, Z. Liu, X. Wang, X. Tang et al., "Face model compression by distilling knowledge from neurons," in AAAI, 2016, pp. 3560-3566.
[27] A. Pilzer, S. Lathuilière, N. Sebe, and E. Ricci, "Refine and distill: Exploiting cycle-inconsistency and knowledge distillation for unsupervised monocular depth estimation," in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 2019, pp. 9768-9777.
[28] X. Xu, Q. Zou, X. Lin, Y. Huang, and Y. Tian, "Integral knowledge distillation for multi-person pose estimation," IEEE Signal Processing Letters, vol. 27, pp. 436-440, 2020.
[29] L. Yuan, F. E. H. Tay, G. Li, T. Wang, and J. Feng, "Revisiting knowledge distillation via label smoothing regularization," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. IEEE, 2020, pp. 3902-3910. [Online]. Available: https://doi.org/10.1109/CVPR42600.2020.00396
[30] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 2016, pp. 2818-2826. [Online]. Available: https://doi.org/10.1109/CVPR.2016.308
[31] R. Müller, S. Kornblith, and G. E. Hinton, "When does label smoothing help?" in Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc., 2019, pp. 4694-4703.
[32] H. Zhang, M. Cissé, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.
[33] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," CoRR, vol. abs/1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
[34] X. Zhang, X. Zhou, M. Lin, and J. Sun, "Shufflenet: An extremely efficient convolutional neural network for mobile devices," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848-6856.
[35] S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both weights and connections for efficient neural network," in Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 1135-1143.
[36] A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov, "Tensorizing neural networks," in Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 442-450.
[37] J. Yang, X. Shen, J. Xing, X. Tian, H. Li, B. Deng, J. Huang, and X. Hua, "Quantization networks," in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, 2019, pp. 7308-7316.
[38] C. Bucila, R. Caruana, and A. Niculescu-Mizil, "Model compression," in Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006, 2006, pp. 535-541.
[39] J. Ba and R. Caruana, "Do deep nets really need to be deep?" in Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13, 2014, Montreal, Quebec, Canada, 2014, pp. 2654-2662.
[40] Z. Huang and N. Wang, "Like what you like: Knowledge distill via neuron selectivity transfer," CoRR, vol. abs/1707.01219, 2017. [Online]. Available: http://arxiv.org/abs/1707.01219
[41] B. Heo, M. Lee, S. Yun, and J. Y. Choi, "Knowledge transfer via distillation of activation boundaries formed by hidden neurons," in The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, 2019, pp. 3779-3787.
[42] S. H. Lee, D. H. Kim, and B. C. Song, "Self-supervised knowledge distillation using singular value decomposition," in Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VI, 2018, pp. 339-354.
[43] B. Peng, X. Jin, D. Li, S. Zhou, Y. Wu, J. Liu, Z. Zhang, and Y. Liu, "Correlation congruence for knowledge distillation," in 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, 2019, pp. 5006-5015.
[44] V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-Paz, and Y. Bengio, "Manifold mixup: Better representations by interpolating hidden states," in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, 2019, pp. 6438-6447.
[45] C. Summers and M. J. Dinneen, "Improved mixed-example data augmentation," in IEEE Winter Conference on Applications of Computer Vision, WACV 2019, Waikoloa Village, HI, USA, January 7-11, 2019, 2019, pp. 1262-1270.
[46] R. Takahashi, T. Matsubara, and K. Uehara, "RICAP: Random image cropping and patching data augmentation for deep cnns," in Proceedings of The 10th Asian Conference on Machine Learning, ACML 2018, Beijing, China, November 14-16, 2018, 2018, pp. 786-798.
[47] S. Yun, D. Han, S. Chun, S. J. Oh, Y. Yoo, and J. Choe, "Cutmix: Regularization strategy to train strong classifiers with localizable features," in 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, 2019, pp. 6022-6031.
[48] E. Harris, A. Marcu, M. Painter, M. Niranjan, A. Prügel-Bennett, and J. S. Hare, "Understanding and enhancing mixed sample data augmentation," CoRR, vol. abs/2002.12047, 2020. [Online]. Available: https://arxiv.org/abs/2002.12047
[49] H. Inoue, "Data augmentation by pairing samples for images classification," CoRR, vol. abs/1801.02929, 2018. [Online]. Available: http://arxiv.org/abs/1801.02929
[50] E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, and Q. V. Le, "Autoaugment: Learning augmentation policies from data," CoRR, vol. abs/1805.09501, 2018. [Online]. Available: http://arxiv.org/abs/1805.09501
[51] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[52] S. Zagoruyko and N. Komodakis, "Wide residual networks," Proceedings of the British Machine Vision Conference, 2016.
[53] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 3rd International Conference on Learning Representations, ICLR, 2015.

[54] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Tech. Rep., 2009.
[55] Y. Le and X. Yang, "Tiny imagenet visual recognition challenge," Stanford Class CS 231N, 2015.
[56] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 770-778.
[57] S. Zagoruyko and N. Komodakis, "Wide residual networks," in Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016, 2016.
[58] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[59] L. van der Maaten and K. Q. Weinberger, "Stochastic triplet embedding," in IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2012, Santander, Spain, September 23-26, 2012, 2012, pp. 1-6.