Heterogeneous Knowledge Distillation using Information Flow Modeling

N. Passalis, M. Tzelepi and A. Tefas
Department of Informatics, Aristotle University of Thessaloniki, Greece
{passalis, mtzelepi, tefas}@csd.auth.gr

Abstract

Knowledge Distillation (KD) methods are capable of transferring the knowledge encoded in a large and complex teacher into a smaller and faster student. Early methods were usually limited to transferring the knowledge only between the last layers of the networks, while later approaches were capable of performing multi-layer KD, further increasing the accuracy of the student. However, despite their improved performance, these methods still suffer from several limitations that restrict both their efficiency and flexibility. First, existing KD methods typically ignore the different learning phases that neural networks undergo during the training process, each of which often requires a different type of supervision. Furthermore, existing multi-layer KD methods are usually unable to effectively handle networks with significantly different architectures (heterogeneous KD). In this paper we propose a novel KD method that works by modeling the information flow through the various layers of the teacher model and then training a student model to mimic this information flow. The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process, as well as by designing and training an appropriate auxiliary teacher model that acts as a proxy model capable of "explaining" the way the teacher works to the student. The effectiveness of the proposed method is demonstrated using four image datasets and several different evaluation setups.

Figure 1. Existing knowledge distillation approaches ignore the existence of critical learning periods when transferring the knowledge, even when multi-layer transfer approaches are used. However, as argued in [1], the information plasticity rapidly declines after the first few training epochs, reducing the effectiveness of knowledge distillation. On the other hand, the proposed method models the information flow in the teacher network and provides the appropriate supervision during the first few critical learning epochs in order to ensure that the necessary connections between successive layers of the networks will be formed. Note that even though this process initially slows down the convergence of the network slightly (epochs 1-8), it allows for rapidly increasing the rate of convergence after the critical learning period ends (epochs 10-25). The parameter α controls the relative importance of transferring the knowledge from the intermediate layers during the various learning phases, as described in detail in Section 3.

1. Introduction

Despite the tremendous success of deep learning (DL) in a wide range of domains [12], most DL methods suffer from a significant drawback: powerful hardware is needed for training and deploying DL models. This significantly hinders DL applications in resource-scarce environments, such as embedded and mobile devices, leading to the development of various methods for overcoming these limitations. Among the most prominent methods for this task is knowledge distillation (KD) [9], which is also known as knowledge transfer (KT) [30]. These approaches aim to transfer the knowledge encoded in a large and complex neural network into a smaller and faster one. In this way, it is possible to increase the accuracy of the smaller model, compared to the same model trained without employing KD. Typically, the smaller model is called the student model, while the larger model is called the teacher model.

Early KD approaches focused on transferring the knowledge between the last layer of the teacher and student models [4, 9, 19, 26, 28, 31]. This allowed for providing richer training targets to the student model, which capture more information regarding the similarities between different samples, reducing overfitting and increasing the student's accuracy. Later methods further increased the efficiency of KD by modeling and transferring the knowledge encoded in the intermediate layers of the teacher [22, 30, 32]. These approaches usually attempt to implicitly model the way information gets transformed through the various layers of a network, providing additional hints to the student model regarding the way that the teacher model processes the information.

Even though these methods were indeed able to further increase the accuracy of models trained with KD, they also suffer from several limitations that restrict both their efficiency and flexibility. First, note that neural networks exhibit an evolving behavior, undergoing several different and distinct phases during the training process. For example, during the first few epochs critical connections are formed [1], defining almost permanently the future information flow paths of a network. After fixing these paths, the training process can only fine-tune them, while forming new paths is significantly less probable after the critical learning period ends [1]. After forming these critical connections, the fitting and compression (when applicable) phases follow [24, 23]. Despite this dynamic, time-dependent behavior of neural networks, virtually all existing KD approaches ignore the phases that neural networks undergo during training. This observation leads us to the first research question of this paper: Is a different type of supervision needed during the different learning phases of the student, and is it possible to use a stronger teacher to provide this supervision?

To this end, we propose a simple, yet effective way to exploit KD to train a student that mimics the information flow paths of the teacher, while also providing further evidence confirming the existence of critical learning periods during the training phase of a neural network, as originally described in [1]. Indeed, as also demonstrated in the ablation study shown in Fig. 1, providing the correct supervision during the critical learning period of a neural network can have a significant effect on the overall training process, increasing the accuracy of the student model. More information regarding this ablation study is provided in Section 4. It is worth noting that the additional supervision, which is employed to ensure that the student will form similar information paths to the teacher, actually slows down the learning process until the critical learning period is completed. However, after the information flow paths are formed, the rate of convergence is significantly accelerated compared to student networks that do not take into account the existence of critical learning periods.

Another limitation of existing KD approaches that employ multiple intermediate layers is their ability to handle heterogeneous multi-layer knowledge distillation, i.e., transferring the knowledge between teachers and students with vastly different architectures. Existing methods almost exclusively use network architectures that provide a trivial one-to-one matching between the layers of the student and teacher, e.g., ResNets with the same number of blocks are often used, altering only the number of layers inside each residual block [30, 32]. Many of these approaches, such as [30], are even more restrictive, also requiring the layers of the teacher and student to have the same dimensionality. As a result, it is especially difficult to perform multi-layer KD between networks with vastly different architectures, since even if just one layer of the teacher model is incorrectly matched to a layer of the student model, the accuracy of the student can be significantly reduced, either due to over-regularizing the network or by forcing an early compression of the representations of the student.

Figure 2. Examining the effect of transferring the knowledge from different layers of a teacher model into the third layer of the student model. Two different teachers are used, a strong teacher (ResNet-18, where each layer refers to a layer block) and an auxiliary teacher (CNN-1-A). The nearest centroid classifier (NCC) accuracy is reported for the representations extracted from each layer (in order to provide an intuitive measure of how each layer transforms the representations extracted from the input data). The final precision is reported for a student model trained by either not using intermediate layer supervision (upper black values) or by using different layers of the teacher (four subsequent precision values). Several different phenomena are observed when the knowledge is transferred from different layers, while the proposed auxiliary teacher allows for achieving the highest precision and provides a straightforward way to match the layers between the models (the auxiliary teacher transforms the data representations in a way that is closer to the student model, as measured through the NCC accuracy).

This behavior is demonstrated in Fig. 2, where the knowledge is transferred from various layers of two different teachers to the third layer of the student. These findings lead us to the second research question of this paper: Is it possible to handle heterogeneous KD in a structured way to avoid such phenomena?

To this end, in this work, we propose a simple, yet effective approach for training an auxiliary teacher model, which is closer to the architecture of the student model. This auxiliary teacher is responsible for explaining the way the larger teacher works to the student model. Indeed, this approach can significantly increase the accuracy of the student, as demonstrated both in Fig. 2, as well as in the rest of the experiments conducted in this paper. It is worth noting that during our initial experiments it was almost impossible to find a layer matching that would actually help us to improve the accuracy of the student model without first designing an appropriate auxiliary teacher model, highlighting the importance of using auxiliary teachers in heterogeneous KD scenarios, as also highlighted in [16].

The main contribution of this paper is proposing a KD method that works by modeling the information flow through the teacher model and then training a student model to mimic this information flow. However, as explained previously and experimentally demonstrated in this paper, this process is often very difficult, especially when there is no obvious layer matching between the teacher and student models, which can often process the information in vastly different ways. In fact, even a single layer mismatch, i.e., overly regularizing the network or forcing an early compression of the representation, can significantly reduce the accuracy of the student model. To overcome these limitations the proposed method works by a) designing and training an appropriate auxiliary teacher model that allows for a direct and effective one-to-one matching between the layers of the student and teacher models, as well as b) employing a critical-learning-aware KD scheme that ensures that critical connections will be formed, allowing for effectively mimicking the teacher's information flow instead of just learning a student that mimics the output of the teacher.

The effectiveness of the proposed method is demonstrated using several different tasks, ranging from metric learning and classification to mimicking handcrafted feature extractors for providing fast neural network-based implementations for low-power embedded hardware. The experimental evaluation also includes an extensive representation learning evaluation, given its increasing importance in many embedded DL and robotic applications, following the evaluation protocol of recently proposed KD methods [19, 31]. An open-source implementation of the proposed method is provided at https://github.com/passalis/pkth.

The rest of the paper is structured as follows. First, the related work is briefly discussed and compared to the proposed method in Section 2. Then, the proposed method is presented in Section 3, while the experimental evaluation is provided in Section 4. Finally, conclusions are drawn in Section 5.

2. Related Work

A large number of knowledge transfer methods which build upon the neural network distillation approach have been proposed [2, 4, 9, 26, 28]. These methods typically use a teacher model to generate soft-labels and then use these soft-labels for training a smaller student network. It is worth noting that several extensions to this approach have been proposed. For example, soft-labels can be used for pre-training a large network [25] and performing domain adaptation [28], while an embedding-based approach for transferring the knowledge was proposed in [20]. Also, online distillation methods, such as [3, 33], employ a co-training strategy, training both the student and teacher models simultaneously. However, none of these approaches takes into account that deep neural networks transition through several learning phases, each one with different characteristics, which requires handling them in different ways. On the other hand, the proposed method models the information flow in the teacher model and then employs a weighting scheme that provides the appropriate supervision during the initial critical learning period of the student, ensuring that the critical connections and information paths formed in the teacher model will be transferred to the student.

Furthermore, several methods that support multi-layer KD have been proposed, such as using hints [22], the flow of solution procedure (FSP) matrix [30], attention transfer [32], or singular value decomposition to extract major features from each layer [13]. However, these approaches usually only target networks with compatible architectures, e.g., residual networks with the same number of residual blocks for both the teacher and student models. Also, it is not straightforward to use them to successfully transfer the knowledge between heterogeneous models, since even a slight layer mismatch can have a devastating effect on the student's accuracy, as demonstrated in Fig. 2. It is also worth noting that we were actually unable to effectively apply most of these methods for heterogeneous KD, since either they do not support transferring the knowledge between layers of different dimensionality, e.g., [30], or they are prone to over-regularization or representation collapse (as demonstrated in Fig. 2), reducing the overall performance of the student.

In contrast with the aforementioned approaches, the proposed method provides a way to perform heterogeneous multi-layer KD by appropriately designing and training an auxiliary network and exploiting the knowledge encoded by the earlier layers of this network.

In this way, the proposed method provides an efficient way for handling any possible network architecture by employing an auxiliary network that is close to the architecture of the student model, regardless of the architecture of the teacher model. Using the proposed auxiliary network strategy ensures that the teacher model will transform the representations extracted from the data in a way compatible with the student model, allowing for a one-to-one matching between the intermediate layers of the networks. It is also worth noting that the use of a similar auxiliary network, used as an intermediate step for KD, was also proposed in [16]. However, in contrast with the proposed method, the auxiliary network used in [16] was employed for merely improving the performance of KD between the final classification layers, instead of designing an auxiliary network that can facilitate efficient multi-layer KD, as proposed in this paper. Finally, to the best of our knowledge, in this work we propose the first architecture-agnostic probabilistic KD approach that works by modeling the information flow through the various layers of the teacher model using a hybrid kernel formulation, can support heterogeneous network architectures, and can effectively supervise the student model during its critical learning period.

3. Proposed Method

Let $\mathcal{T} = \{\mathbf{t}_1, \mathbf{t}_2, \dots, \mathbf{t}_N\}$ denote the transfer set that contains $N$ transfer samples and is used to transfer the knowledge from the teacher model to the student model. Note that the proposed method can also work in a purely unsupervised fashion and, as a result, unlabeled data samples can also be used for transferring the knowledge. Also, let $\mathbf{x}_i^{(l)} = f(\mathbf{t}_i, l)$ denote the representation extracted from the $l$-th layer of the teacher model $f(\cdot)$ and $\mathbf{y}_i^{(l)} = g(\mathbf{t}_i, l, \mathbf{W})$ denote the representation extracted from the $l$-th layer of the student model $g(\cdot)$. Note that the trainable parameters of the student model are denoted by $\mathbf{W}$. The proposed method aims to train the student model $g(\cdot)$, i.e., learn the appropriate parameters $\mathbf{W}$, in order to "mimic" the behavior of $f(\cdot)$ as closely as possible.

Furthermore, let $\mathcal{X}^{(l)}$ denote the random variable that describes the representation extracted from the $l$-th layer of the teacher model and $\mathcal{Y}^{(l)}$ the corresponding random variable for the student model. Also, let $\mathcal{Z}$ denote the random variable that describes the training targets for the teacher model. In this work, the information flow of the teacher network is defined as the progression of mutual information between every layer representation of the network and the training targets, i.e., $I(\mathcal{X}^{(l)}, \mathcal{Z})$ $\forall l$. Note that even though the training targets are required for modeling the information flow, they are not actually needed during the KD process, as we will demonstrate later. Then, we can define the information flow vector that characterizes the way the network processes information as:

$$\omega_t := \left[ I(\mathcal{X}^{(1)}, \mathcal{Z}), \dots, I(\mathcal{X}^{(N_{L_t})}, \mathcal{Z}) \right]^T \in \mathbb{R}^{N_{L_t}}, \quad (1)$$

where $N_{L_t}$ is the number of layers of the teacher model. Similarly, the information flow vector for the student model is defined as:

$$\omega_s := \left[ I(\mathcal{Y}^{(1)}, \mathcal{Z}), \dots, I(\mathcal{Y}^{(N_{L_s})}, \mathcal{Z}) \right]^T \in \mathbb{R}^{N_{L_s}}, \quad (2)$$

where again $N_{L_s}$ is the number of layers of the student model. The proposed method works by minimizing the divergence between the information flow in the teacher and student models, i.e., $D_F(\omega_s, \omega_t)$, where $D_F(\cdot)$ is a metric used to measure the divergence between two, possibly heterogeneous, networks. To this end, the information flow divergence is defined as the sum of squared differences between each paired element of the information flow vectors:

$$D_F(\omega_s, \omega_t) = \sum_{i=1}^{N_{L_s}} \left( [\omega_s]_i - [\omega_t]_{\kappa(i)} \right)^2, \quad (3)$$

where the layer of the teacher $\kappa(i)$ is chosen in order to minimize the divergence with the corresponding layer of the student:

$$\kappa(i) = \begin{cases} N_{L_t}, & \text{if } i = N_{L_s} \\ \arg\min_j \left( [\omega_s]_i - [\omega_t]_j \right)^2, & \text{otherwise} \end{cases} \quad (4)$$

and the notation $[\mathbf{x}]_i$ is used to refer to the $i$-th element of vector $\mathbf{x}$. This definition employs the optimal matching between the layers (considering the discriminative power of each layer), except for the final one, which corresponds to the task at hand. In this way, it allows for measuring the flow divergence between networks with different architectures. At the same time, it is also expected to minimize the impact of over-regularization and/or representation collapse phenomena, such as those demonstrated in Fig. 2, which often occur when there is a large divergence between the layers used for transferring the knowledge. However, this also implies that for networks with vastly different architectures, or for networks not yet trained for the task at hand, the same layer of the teacher may be used for transferring the knowledge to multiple layers of the student model, leading to a significant loss of granularity during the KD and to stability issues. In Subsection 3.2 we provide a simple, yet effective way to overcome this issue by using auxiliary teacher models. Note that more advanced methods, such as employing fuzzy assignments between different sets of layers, can also be used.
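To make the layer-matching rule of Eqs. (3)-(4) concrete, the following sketch computes the information flow divergence from precomputed per-layer mutual information estimates, i.e., the information flow vectors of Eqs. (1)-(2). It is a minimal illustration under our own naming, not the authors' implementation.

```python
import numpy as np

def flow_divergence(omega_s, omega_t):
    """Information flow divergence of Eq. (3) with the layer matching of Eq. (4).

    omega_s, omega_t: 1-D arrays holding the estimated mutual information
    I(layer representation; targets) for every layer of the student and the
    teacher, respectively (Eqs. (1)-(2)).
    """
    n_s, n_t = len(omega_s), len(omega_t)
    divergence = 0.0
    for i in range(n_s):
        if i == n_s - 1:
            j = n_t - 1  # the final layers are always paired (Eq. 4)
        else:
            # match the teacher layer whose MI estimate is closest (Eq. 4)
            j = int(np.argmin((omega_s[i] - omega_t) ** 2))
        divergence += (omega_s[i] - omega_t[j]) ** 2
    return divergence

# Example: a 4-layer student compared against an 8-layer teacher
print(flow_divergence(np.array([0.2, 0.5, 0.9, 1.4]),
                      np.array([0.1, 0.3, 0.6, 0.8, 1.1, 1.3, 1.5, 1.6])))
```

Note that nothing in this matching prevents several student layers from being assigned to the same teacher layer, which is exactly the loss of granularity that the auxiliary teacher of Subsection 3.2 is designed to avoid.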

3.1. Tractable Information Flow Divergence Measures using Quadratic Mutual Information

In order to effectively transfer the knowledge between two different networks, we have to provide an efficient way to calculate the mutual information, as well as to train the student model to match the mutual information between two layers of different networks. Recently, it has been demonstrated that when the Quadratic Mutual Information (QMI) [27] is used, it is possible to efficiently minimize the difference between the mutual information of a specific layer of the teacher and student by appropriately relaxing the optimization problem [19]. More specifically, the problem of matching the mutual information between two layers can be reduced to a simpler probability matching problem that involves only the pairwise interactions between the transfer samples. Therefore, to transfer the knowledge between a specific layer of the student and another layer of the teacher, it is adequate to minimize the divergence between the teacher's and student's conditional probability distributions, which can be estimated as [19]:

$$p^{(t, l_t)}_{i|j} = \frac{K(\mathbf{x}_i^{(l_t)}, \mathbf{x}_j^{(l_t)})}{\sum_{i=1, i \neq j}^{N} K(\mathbf{x}_i^{(l_t)}, \mathbf{x}_j^{(l_t)})} \in [0, 1], \quad (5)$$

and

$$p^{(s, l_s)}_{i|j} = \frac{K(\mathbf{y}_i^{(l_s)}, \mathbf{y}_j^{(l_s)})}{\sum_{i=1, i \neq j}^{N} K(\mathbf{y}_i^{(l_s)}, \mathbf{y}_j^{(l_s)})} \in [0, 1], \quad (6)$$

where $K(\cdot)$ is a kernel function and $l_t$ and $l_s$ refer to the teacher and student layers used for the transfer. These probabilities also express how probable it is for each sample to select each of its neighbors [14], modeling in this way the geometry of the feature space, while matching these two distributions also ensures that the mutual information between the models and a set of (possibly unknown) classes is maintained [19]. Note that the actual training labels are not required during this process and, as a result, the proposed method can work in a purely unsupervised fashion.

The kernel choice can have a significant effect on the quality of the KD, since it alters how the mutual information is estimated [19]. Apart from the well-known Gaussian kernel, which is however often hard to tune, other kernel choices include cosine-based kernels [19], e.g., $K_c(\mathbf{a}, \mathbf{b}) = \frac{1}{2}\left( \frac{\mathbf{a}^T \mathbf{b}}{\|\mathbf{a}\|_2 \|\mathbf{b}\|_2} + 1 \right)$, where $\mathbf{a}$ and $\mathbf{b}$ are two vectors, and the T-student kernel, i.e., $K_T(\mathbf{a}, \mathbf{b}) = \frac{1}{1 + \|\mathbf{a} - \mathbf{b}\|_2^d}$, where $d$ is typically set to 1. Selecting the most appropriate kernel for the task at hand can lead to significant performance improvements, e.g., cosine-based kernels perform better for retrieval tasks, while using kernel ensembles, i.e., estimating the probability distribution using multiple kernels, can also improve the robustness of mutual information estimation. Therefore, in this paper a hybrid objective is used that aims at minimizing the divergence calculated using both the cosine kernel, which ensures the good performance of the learned representation for retrieval tasks, and the T-student kernel, which experimentally demonstrated good performance for classification tasks:

$$\mathcal{L}^{(l_t, l_s)} = D(\mathcal{P}_c^{(t,l_t)} \| \mathcal{P}_c^{(s,l_s)}) + D(\mathcal{P}_T^{(t,l_t)} \| \mathcal{P}_T^{(s,l_s)}), \quad (7)$$

where $D(\cdot)$ is a probability divergence metric and the notation $\mathcal{P}_c^{(t,l_t)}$ and $\mathcal{P}_T^{(t,l_t)}$ is used to denote the conditional probabilities of the teacher calculated using the cosine and T-student kernels, respectively. Again, the representations used for KD were extracted from the $l_t$-th/$l_s$-th layer. The student probability distributions are denoted similarly by $\mathcal{P}_c^{(s,l_s)}$ and $\mathcal{P}_T^{(s,l_s)}$. The divergence between these distributions can be calculated using a symmetric version of the Kullback-Leibler (KL) divergence, the Jeffreys divergence [10]:

$$D(\mathcal{P}^{(t,l_t)} \| \mathcal{P}^{(s,l_s)}) = \sum_{i=1}^{N} \sum_{j=1, j \neq i}^{N} \left( p^{(t,l_t)}_{j|i} - p^{(s,l_s)}_{j|i} \right) \left( \log p^{(t,l_t)}_{j|i} - \log p^{(s,l_s)}_{j|i} \right), \quad (8)$$

which can be sampled at a finite number of points during the optimization, e.g., using batches of 64-128 samples. This batch-based strategy has been successfully employed in a number of different works [19, 31], without any significant effect on the optimization process.
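A minimal PyTorch sketch of the hybrid objective of Eqs. (5)-(8) is given below, assuming a batch of representations from one teacher layer and one student layer. The kernels and the Jeffreys divergence follow the definitions above; the function names and the row-wise normalization convention are our assumptions rather than the authors' code.

```python
import torch

def cosine_kernel(feats):
    # K_c(a, b) = 0.5 * (a^T b / (||a||_2 ||b||_2) + 1), computed pairwise
    normed = feats / (feats.norm(dim=1, keepdim=True) + 1e-8)
    return 0.5 * (normed @ normed.t() + 1.0)

def t_student_kernel(feats, d=1.0):
    # K_T(a, b) = 1 / (1 + ||a - b||_2^d), with d = 1 as in the paper
    return 1.0 / (1.0 + torch.cdist(feats, feats) ** d)

def conditional_probs(kernel_matrix, eps=1e-8):
    # Eqs. (5)-(6): normalize the pairwise affinities into conditional
    # probabilities, excluding the self-similarity on the diagonal.
    K = kernel_matrix - torch.diag(torch.diag(kernel_matrix))
    return K / (K.sum(dim=1, keepdim=True) + eps)

def jeffreys(p, q, eps=1e-8):
    # Eq. (8): symmetric KL (Jeffreys) divergence between the two distributions
    return torch.sum((p - q) * (torch.log(p + eps) - torch.log(q + eps)))

def hybrid_layer_loss(teacher_feats, student_feats):
    # Eq. (7): cosine-kernel term plus T-student-kernel term
    loss = 0.0
    for kernel in (cosine_kernel, t_student_kernel):
        p_t = conditional_probs(kernel(teacher_feats))
        p_s = conditional_probs(kernel(student_feats))
        loss = loss + jeffreys(p_t, p_s)
    return loss
```

In practice this loss would be evaluated on mini-batches of 64-128 samples, as discussed above, with gradients back-propagated only through the student representations.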

3.2. Auxiliary Networks and Information Flow

Even though the flow divergence metric defined in (3) takes into account the way different networks process the information, it suffers from a significant drawback: if the teacher processes the information in a significantly different way compared to the student, then the same layer of the teacher model might be used for transferring the knowledge to multiple layers of the student model, leading to a significant loss in the granularity of the information flow used for KD. Furthermore, this problem can also arise even when the student model is capable of processing the information in a way compatible with the teacher, but has not yet been appropriately trained for the task at hand. To better understand this, note that the information flow divergence in (3) is calculated based on the estimated mutual information and not the actual learning capacity of each model. Therefore, directly using the flow divergence definition presented in (3) is not optimal for KD. It is worth noting that this issue is especially critical for every KD method that employs multiple layers, since, as we demonstrate in Section 4, if the layer pairs are not carefully selected, the accuracy of the student model is often lower compared to a model trained without using multi-layer transfer at all.

Unfortunately, due to the poor understanding of the way that neural networks transform the probability distribution of the input data, there is currently no way to select the most appropriate layers for transferring the knowledge a priori. This process can be especially difficult and tedious, especially when the architectures of the student and teacher differ a lot. To overcome this critical limitation, in this work we propose constructing an appropriate auxiliary proxy for the teacher model that allows for directly matching all the layers of the auxiliary model and the student model, as shown in Fig. 3. In this way, the proposed method employs an auxiliary network, which has an architecture compatible with the student model, to better facilitate the process of KD. A simple, yet effective approach for designing the auxiliary network is employed in this work: the auxiliary network follows the same architecture as the student model, but uses twice the neurons/convolutional filters per layer. Thus, the greater learning capacity of the auxiliary network ensures that enough knowledge will always be available to the auxiliary network (when compared to the student model), leading to better results compared to directly transferring the knowledge from the teacher model. Designing the most appropriate auxiliary network is an open research area and significantly better ways than the proposed one might exist. However, even this simple approach was adequate to significantly enhance the performance of KD and demonstrate the potential of information flow modeling, as further demonstrated in the ablation studies provided in Section 4. Also, note that a hierarchy of auxiliary teachers can be trained in this fashion, as also proposed in [16].

Figure 3. First, the knowledge is transferred to an appropriate auxiliary teacher, which will better facilitate the process of KD. Then, the proposed method minimizes the information flow divergence between the two models, taking into account the existence of critical learning periods.

The final loss used to optimize the student model, when an auxiliary network is employed, is calculated as:

$$\mathcal{L} = \sum_{i=1}^{N_{L_s}} \alpha_i \mathcal{L}^{(i,i)}, \quad (9)$$

where $\alpha_i$ is a hyper-parameter that controls the relative weight of transferring the knowledge from the $i$-th layer of the teacher to the $i$-th layer of the student and the loss $\mathcal{L}^{(i,i)}$ defined in (7) is calculated using the auxiliary teacher, instead of the initial teacher. The value of $\alpha_i$ can be dynamically selected during the training process, to ensure that the applied KD scheme takes into account the current learning state of the network, as further discussed in Subsection 3.3. Finally, stochastic gradient descent is employed to train the student model: $\Delta \mathbf{W} = -\eta \frac{\partial \mathcal{L}}{\partial \mathbf{W}}$, where $\mathbf{W}$ is the matrix with the parameters of the student model and $\eta$ is the employed learning rate.

3.3. Critical Period-aware Divergence Minimization

Neural networks transition through different learning phases during the training process, with the first few epochs being especially critical for the later behavior of the network [1]. Using a stronger teacher model provides the opportunity of guiding the student model during the initial critical learning period in order to form the appropriate connectivity between the layers, before the information plasticity declines. However, just minimizing the information flow divergence does not ensure that the appropriate connections will be formed. To better understand this, we have to consider that the gradients back-propagated through the network depend both on the training target, as well as on the initialization of the network. Therefore, for a randomly initialized student, the task of forming the appropriate connections between the intermediate layers might not facilitate the final task at hand (until reaching a certain critical point). This was clearly demonstrated in Fig. 1, where the convergence of the network was initially slower when the proposed method was used, until reaching the point at which the critical learning period ends and the convergence of the network is accelerated.

Therefore, in this work we propose using an appropriate weighting scheme for calculating the value of the hyper-parameter $\alpha_i$ during the training process. More specifically, during the critical learning period a significantly higher weight is given to matching the information flow for the earlier layers, ignoring the task at hand dictated by the final layer of the teacher, while this weight gradually decays to 0 as the training process progresses. Therefore, the parameter $\alpha_i$ is calculated as:

$$\alpha_i = \begin{cases} 1, & \text{if } i = N_{L_s} \\ \alpha_{init} \cdot \gamma^k, & \text{otherwise} \end{cases} \quad (10)$$

where $k$ is the current training epoch, $\gamma$ is a decay factor and $\alpha_{init}$ is the initial weight used for matching the information flow in the intermediate layers. The parameter $\gamma$ was set to 0.7, while $\alpha_{init}$ was set to 100 for all the experiments conducted in this paper (unless otherwise stated). Therefore, during the first few epochs (1-10) the final task at hand has a minimal impact on the optimization objective. However, as the training process progresses, the importance of matching the information flow for the intermediate layers gradually diminishes and the optimization switches to fine-tuning the network for the task at hand.
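The weighting scheme of Eqs. (9)-(10) can be summarized with the following sketch, which reuses the hybrid_layer_loss from the sketch in Subsection 3.1 and the default values $\gamma = 0.7$ and $\alpha_{init} = 100$ stated above; the function names and list-based layer interface are illustrative assumptions.

```python
def layer_weights(epoch, n_layers, alpha_init=100.0, gamma=0.7):
    # Eq. (10): the final (task) layer always receives weight 1, while the
    # intermediate layers start at alpha_init and decay by gamma every epoch,
    # so their contribution fades out after the critical learning period.
    decayed = alpha_init * (gamma ** epoch)
    return [decayed] * (n_layers - 1) + [1.0]

def total_loss(epoch, aux_teacher_feats, student_feats):
    # Eq. (9): weighted sum of the per-layer divergences between the auxiliary
    # teacher and the student; aux_teacher_feats and student_feats are lists of
    # per-layer representation batches, one entry per matched layer pair.
    weights = layer_weights(epoch, len(student_feats))
    return sum(w * hybrid_layer_loss(t, s)
               for w, t, s in zip(weights, aux_teacher_feats, student_feats))
```

With these defaults the intermediate-layer weight drops from 100 to roughly 2.8 by epoch 10 ($100 \cdot 0.7^{10} \approx 2.8$), so the early epochs are dominated by information flow matching, after which the objective is dominated by the task-related final layer, consistent with the behavior shown in Fig. 1.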

4. Experimental Evaluation

The experimental evaluation of the proposed method is provided in this section. The proposed method was evaluated using four different datasets (CIFAR-10 [11], STL-10 [6], CUB-200 [29] and SUN Attribute [21]) and compared to four competitive KD methods: neural network distillation [9], hint-based transfer [22], probabilistic knowledge transfer (PKT) [19] and metric knowledge transfer (abbreviated as MKT) [31]. A variety of different evaluation setups were used to evaluate various aspects of the proposed method. Please refer to the supplementary material for a detailed description of the employed networks and evaluation setups.

First, the proposed method was evaluated in a metric learning setup using the CIFAR-10 dataset (Table 1). The methods were evaluated under two different settings: a) using contrastive supervision (by adding a contrastive loss term to the loss function [8]), as well as b) using a purely unsupervised setting (cloning the responses of the powerful teacher model). The simple variants (Hint, MKT, PKT) refer to transferring the knowledge only from the penultimate layer of the teacher, while the "-H" variants refer to transferring the knowledge simultaneously from all the layers of the auxiliary model. The abbreviation "e" is used to refer to retrieval using the Euclidean similarity metric, while "c" is used to refer to retrieval using the cosine similarity.

Table 1. Metric Learning Evaluation: CIFAR-10
Method                mAP (e)  mAP (c)  top-100 (e)  top-100 (c)
Baseline Models
Teacher (ResNet-18)   87.18    90.47    92.15        92.26
Aux. (CNN1-A)         62.12    66.78    73.72        75.91
With Contrastive Supervision
Student (CNN1)        47.69    48.72    57.46        58.50
Hint.                 43.56    48.73    60.44        62.43
MKT                   45.34    46.84    55.89        57.10
PKT                   48.87    49.95    58.44        59.48
Hint-H                43.24    47.46    58.97        61.07
MKT-H                 44.83    47.12    56.28        57.90
PKT-H                 48.69    50.09    58.71        60.20
Proposed              49.55    50.82    59.50        60.79
Without Contrastive Supervision
Student (CNN1)        35.30    39.00    55.87        58.77
Distill.              37.39    40.53    56.17        58.56
Hint.                 43.99    48.99    60.69        62.42
MKT                   36.26    38.20    50.55        52.72
PKT                   48.07    51.56    60.02        62.50
Hint-H                42.65    46.46    58.51        60.59
MKT-H                 41.16    43.99    55.10        57.63
PKT-H                 48.05    51.73    60.39        63.01
Proposed              49.20    53.06    61.54        64.24

First, note that using all the layers for distilling the knowledge provides small to no improvements in the retrieval precision, with the exception of the MKT method (when applied without any form of supervision). Actually, in some cases (e.g., when hint-based transfer is employed) the performance when multiple layers are used is worse. This behavior further confirms and highlights the difficulty of applying multi-layer KD methods between heterogeneous architectures. Also, using contrastive supervision seems to provide more consistent results for the competitive methods, especially for the MKT method. Using the proposed method leads to a significant increase in the mAP, as well as in the top-K precision. For example, mAP (c) increases by over 2.5% (relative increase) over the next best performing method (PKT-H). At the same time, note that the proposed method seems to lead to overall better results when there is no additional supervision. This is again linked to the existence of critical learning periods. As explained before, forming the appropriate information flow paths requires little to no supervision from the final layers when the network is randomly initialized (since forming these paths usually changes the way the network processes information, temporarily increasing the loss related to the final task at hand). Similar conclusions can also be drawn from the classification evaluation using the CIFAR-10 dataset. The results are reported in Table 2. Again, the proposed method leads to a relative increase of about 0.7% over the next best-performing method.

Table 2. Classification Evaluation: CIFAR-10
Method     Train Accuracy  Test Accuracy
Distill.   72.50           70.68
Hint.      71.29           70.59
MKT        69.73           69.13
PKT        72.70           70.44
Hint-H     70.93           69.52
MKT-H      69.67           68.82
PKT-H      73.43           71.44
Proposed   73.24           71.97
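For reference, the retrieval metrics reported in these tables can be computed as in the following sketch: each query is ranked against the database using Euclidean ("e") or cosine ("c") similarity, and the mean average precision and top-K precision are averaged over the query set. This is a generic illustration of the protocol under our own naming, not the exact evaluation code used for the paper.

```python
import numpy as np

def rank_database(query, database, metric="e"):
    # Rank database items by Euclidean distance ("e") or cosine similarity ("c")
    if metric == "e":
        scores = -np.linalg.norm(database - query, axis=1)   # higher = closer
    else:
        q = query / (np.linalg.norm(query) + 1e-8)
        d = database / (np.linalg.norm(database, axis=1, keepdims=True) + 1e-8)
        scores = d @ q
    return np.argsort(-scores)

def average_precision(relevance):
    # relevance: boolean array in ranked order (True if same class as the query)
    hits, precisions = 0, []
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return float(np.mean(precisions)) if precisions else 0.0

def top_k_precision(relevance, k=100):
    # Fraction of relevant items among the first k retrieved items
    return float(np.mean(relevance[:k]))
```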

Next, the proposed method was evaluated under a distribution shift setup using the STL-10 dataset (Table 3). For these experiments, the teacher model was trained using the CIFAR-10 dataset, but the KD was conducted using the unlabeled split of the STL dataset. Again, similar results as with the CIFAR-10 dataset are observed, with the proposed method outperforming the rest of the evaluated methods over all the evaluated metrics. Again, it is worth noting that directly transferring the knowledge between all the layers of the network often harms the retrieval precision for the competitive approaches.

Table 3. Metric Learning Evaluation: STL Distribution Shift
Method                mAP (e)  mAP (c)  top-100 (e)  top-100 (c)
Teacher (ResNet-18)   57.40    61.20    66.75        69.70
Aux. (CNN1-A)         44.89    48.48    53.54        56.26
Student (CNN1)        30.60    33.04    39.08        41.69
Distill               33.56    36.23    43.32        46.01
Hint.                 37.11    40.33    46.60        49.46
MKT                   33.46    35.91    40.65        43.23
PKT                   37.22    40.26    44.73        47.98
Hint-H                35.56    37.85    43.83        46.13
MKT-H                 33.57    35.23    40.20        42.11
PKT-H                 37.56    39.77    44.76        47.17
Proposed              38.11    40.35    48.44        50.57

This behavior is also confirmed using the more challenging CUB-200 dataset (Table 4), where the proposed method again outperforms the rest of the evaluated approaches both for the retrieval evaluation, as well as for the classification evaluation. For the latter, a quite large improvement is observed, since the accuracy increases by over 1.5% over the next best performing method.

Table 4. Metric Learning and Classification Evaluation: CUB-200
Method    mAP (e)  mAP (c)  top-10 (e)  top-10 (c)  Acc.
Teacher   63.17    78.17    76.02       81.64       72.16
Aux.      17.01    18.98    25.77       27.07       32.33
Student   15.60    17.24    23.40       24.89       34.08
Distill   16.40    18.55    24.82       26.57       35.21
Hint.     14.34    15.98    22.31       23.41       28.71
MDS       12.99    13.39    20.60       20.59       30.46
PKT       16.36    18.57    24.68       26.70       34.96
Hint-H    13.94    15.37    21.75       22.61       28.34
MDS-H     13.83    15.39    21.27       22.76       32.08
PKT-H     15.58    17.77    23.50       25.39       33.83
Proposed  16.70    19.01    25.41       27.67       36.95

Furthermore, we also conducted a HoG [7] cloning experiment, in which the knowledge was transferred from a handcrafted feature extractor, in order to demonstrate the flexibility of the proposed approach. The same strategy as in the previous experiments was used, i.e., the knowledge was first transferred to an auxiliary model and then further distilled into the student model. It is worth noting that this setup has several emerging applications, as discussed in a variety of recent works [19, 5], since it allows for pre-training deep neural networks for domains for which it is difficult to acquire large annotated datasets, as well as providing a straightforward way to exploit the highly optimized deep learning libraries for embedded devices to provide neural network-based implementations of hand-crafted features. The evaluation results for this setup are reported in Table 5, confirming again that the proposed method outperforms the rest of the evaluated methods.

Table 5. HoG Cloning Network: SUN Dataset
Method    mAP (c)        top-1 (e)      top-10 (c)
HoG       32.06 ± 1.20   62.55 ± 1.10   47.93 ± 1.73
Aux.      29.69 ± 2.09   55.26 ± 3.03   42.34 ± 3.71
Hint      20.87 ± 2.13   44.14 ± 4.11   31.15 ± 4.48
MDS       21.65 ± 2.79   43.43 ± 4.87   31.29 ± 4.24
PKT       27.22 ± 2.60   49.90 ± 3.67   36.92 ± 2.64
Proposed  27.63 ± 0.62   51.18 ± 1.74   38.59 ± 1.00

Finally, several ablation studies have been conducted. First, in Fig. 1 we evaluated the effect of using the proposed weighting scheme that takes into account the existence of critical learning periods. The proposed scheme indeed leads to faster convergence over both single-layer KD using the PKT method, as well as over the multi-layer PKT-H method. To validate that the improved results arise from the higher weight given to the intermediate layers over the critical learning period, we used the same decaying scheme for the PKT-H method, but with the initial α_init set to 1 instead of 100. Next, we also demonstrated the impact of matching the correct layers in Fig. 2. Several interesting conclusions can be drawn from the results reported in Fig. 2. For example, note that over-regularization occurs when transferring the knowledge from a teacher layer that has lower MI with the targets (lower NCC accuracy). On the other hand, using a layer with slightly lower discriminative power (Layer 1 of ResNet-18) can have a slightly positive regularization effect. At the same time, using too discriminative layers (Layers 3 and 4 of ResNet-18) can lead to an early collapse of the representation, harming the precision of the student. The accuracy of the student increases only when the correct layers of the auxiliary teacher are matched to the student (Layers 2 and 3 of CNN-1-A).

Furthermore, we also evaluated the effect of using auxiliary models of different sizes on the precision of the student model trained with the proposed method. The evaluation results are provided in Table 6. Two different student models are used: CNN-1 (15k parameters) and CNN-1-L (6k parameters). As expected, the auxiliary models that are closer to the complexity of the student lead to improved performance compared both to the more complex and the less complex teachers. That is, when the CNN-1 model is used as the student, the CNN-1-A teacher achieves the best results, while when the CNN-1-L is used as the student, the weaker CNN-1 teacher achieves the highest precision. Note that as the complexity of the teacher increases beyond that of the student, the efficiency of the KD process declines.

Table 6. Effect of using auxiliary networks of different sizes (CNN order in terms of parameters: CNN-1-H > CNN-1-A > CNN-1 > CNN-1-L)
Method               mAP (e)  mAP (c)  top-100 (e)  top-100 (c)
CNN1-L → CNN1        35.03    37.89    46.31        49.27
CNN1-A → CNN1        49.20    53.06    61.54        64.24
CNN1-H → CNN1        48.82    52.77    61.25        63.99
CNN1 → CNN1-L        36.49    39.25    48.21        50.88
CNN-1-A → CNN-1-L    35.72    38.61    47.25        50.13
CNN-1-H → CNN-1-L    34.90    37.51    45.83        48.50

5. Conclusions

In this paper we presented a novel KD method that works by modeling the information flow through the various layers of the teacher model. The proposed method was able to overcome several limitations of existing KD approaches, especially when used for training very lightweight deep learning models with architectures that differ significantly from the teacher, by a) designing and training an appropriate auxiliary teacher model, and b) employing a critical-learning-aware KD scheme that ensures that critical connections will be formed to effectively mimic the information flow paths of the auxiliary teacher.

Acknowledgment

This work was supported by the European Union's Horizon 2020 Research and Innovation Program (OpenDR) under Grant 871449. This publication reflects the authors' views only. The European Commission is not responsible for any use that may be made of the information it contains.

References

[1] Alessandro Achille, Matteo Rovere, and Stefano Soatto. Critical learning periods in deep neural networks. arXiv preprint arXiv:1711.08856, 2017.
[2] Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9163–9171, 2019.
[3] Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E Dahl, and Geoffrey E Hinton. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235, 2018.
[4] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–541, 2006.
[5] Zhenghua Chen, Le Zhang, Zhiguang Cao, and Jing Guo. Distilling the knowledge from handcrafted features for human activity recognition. IEEE Transactions on Industrial Informatics, 14(10):4334–4342, 2018.
[6] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proceedings of the Computer Society Conference on Computer Vision and Pattern Recognition, pages 886–893, 2005.
[8] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 1735–1742, 2006.
[9] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In Proceedings of the Neural Information Processing Systems Deep Learning Workshop, 2014.
[10] Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453–461, 1946.
[11] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical Report, 2009.
[12] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[13] Seung Hyun Lee, Dae Ha Kim, and Byung Cheol Song. Self-supervised knowledge distillation using singular value decomposition. In European Conference on Computer Vision, pages 339–354. Springer, 2018.
[14] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
[15] Chris McCool, Tristan Perez, and Ben Upcroft. Mixtures of lightweight deep convolutional neural networks: Applied to agricultural robotics. IEEE Robotics and Automation Letters, 2(3):1344–1351, 2017.
[16] Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant: Bridging the gap between student and teacher. arXiv preprint arXiv:1902.03393, 2019.
[17] Daniele Palossi, Francesco Conti, and Luca Benini. An open source and open hardware deep learning-powered visual navigation engine for autonomous nano-UAVs. In Proceedings of the International Conference on Distributed Computing in Sensor Systems, pages 604–611, 2019.
[18] Daniele Palossi, Antonio Loquercio, Francesco Conti, Eric Flamand, Davide Scaramuzza, and Luca Benini. Ultra low power deep-learning-powered autonomous nano drones. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2018.
[19] Nikolaos Passalis and Anastasios Tefas. Learning deep representations with probabilistic knowledge transfer. In Proceedings of the European Conference on Computer Vision, pages 268–284, 2018.
[20] Nikolaos Passalis and Anastasios Tefas. Unsupervised knowledge transfer using similarity embeddings. IEEE Transactions on Neural Networks and Learning Systems, 30(3):946–950, 2018.
[21] Genevieve Patterson and James Hays. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2751–2758, 2012.
[22] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In Proceedings of the International Conference on Learning Representations, 2015.
[23] Andrew Michael Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan Daniel Tracey, and David Daniel Cox. On the information bottleneck theory of deep learning. In Proceedings of the International Conference on Learning Representations.
[24] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
[25] Zhiyuan Tang, Dong Wang, Yiqiao Pan, and Zhiyong Zhang. Knowledge transfer pre-training. arXiv preprint arXiv:1506.02256, 2015.
[26] Zhiyuan Tang, Dong Wang, and Zhiyong Zhang. Recurrent neural network training with dark knowledge transfer. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5900–5904, 2016.
[27] Kari Torkkola. Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research, 3(Mar):1415–1438, 2003.
[28] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4068–4076, 2015.
[29] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-UCSD Birds 200. 2010.
[30] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7130–7138, 2017.
[31] Lu Yu, Vacit Oguz Yazici, Xialei Liu, Joost van de Weijer, Yongmei Cheng, and Arnau Ramisa. Learning metrics from teachers: Compact networks for image embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2907–2916, 2019.
[32] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.
[33] Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4320–4328, 2018.
