
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. -, NO. -, 2021

A Survey on Negative Transfer

Wen Zhang, Lingfei Deng, Lei Zhang, Senior Member, IEEE, Dongrui Wu, Senior Member, IEEE

Abstract—Transfer learning (TL) utilizes data or knowledge from one or more source domains to facilitate the learning in a target domain. It is particularly useful when the target domain has very few or no labeled data, due to annotation expense, privacy concerns, etc. Unfortunately, the effectiveness of TL is not always guaranteed. Negative transfer (NT), i.e., leveraging source domain data/knowledge undesirably reduces the learning performance in the target domain, has been a long-standing and challenging problem in TL. Various approaches have been proposed in the literature to handle it. However, there does not exist a systematic survey on the formulation of NT, the factors leading to NT, and the algorithms that mitigate NT. This paper fills this gap, by first introducing the definition of NT and its factors, then reviewing about fifty representative approaches for overcoming NT, according to four categories: secure transfer, domain similarity estimation, distant transfer, and NT mitigation. NT in related fields, e.g., multi-task learning, lifelong learning, and adversarial attacks, is also discussed.

Index Terms—Negative transfer, transfer learning, domain adaptation, domain similarity

1 INTRODUCTION

A common assumption in traditional machine learning is that the training data and the test data are drawn from the same distribution. However, this assumption does not hold in many real-world applications. For example, two image datasets may contain images taken using cameras with different resolutions under different light conditions; different people may demonstrate strong individual differences in brain-computer interfaces [1]. Therefore, the resulting machine learning model may generalize poorly.

A conventional approach to mitigate this problem is to re-collect a large amount of labeled or partly labeled data, which have the same distribution as the test data, and then train a machine learning model on the new data. However, many factors may prevent easy access to such data, e.g., high annotation cost, privacy concerns, etc.

A better solution to the above problem is transfer learning (TL) [2], or domain adaptation (DA) [3], which tries to utilize data or knowledge from related domains (called source domains) to facilitate the learning in a new domain (called target domain). TL was first studied in educational psychology to enhance humans' ability to learn new tasks and to solve novel problems [4]. In machine learning, TL is mainly used to improve a model's ability to generalize in the target domain, which usually has zero or a very small number of labeled data. Many different TL approaches have been proposed, e.g., traditional (statistical) TL [5]–[9], deep TL [10], [11], adversarial TL [12], [13], etc.

Unfortunately, the effectiveness of TL is not always guaranteed, unless its basic assumptions are satisfied: 1) the learning tasks in the two domains are related/similar; 2) the source domain and target domain data distributions are not too different; and, 3) a suitable model can be applied to both domains. Violations of these assumptions may lead to negative transfer (NT), i.e., introducing source domain data/knowledge undesirably decreases the learning performance in the target domain, as illustrated in Fig. 1. NT is a long-standing and challenging problem in TL [2], [14], [15].

Fig. 1. Illustration of NT: introducing source domain data/knowledge decreases the target domain learning performance.

The following fundamental problems need to be addressed for reliable TL [2]: 1) what to transfer; 2) how to transfer; and, 3) when to transfer. Most TL research [3], [16] focused only on the first two, whereas all three should be taken into consideration to avoid NT. To our knowledge, NT was first studied in 2005 [14], and received increasing attention recently [15], [17], [18]. Various ideas, e.g., finding similar parts of domains, evaluating the transferability of different tasks/models/features, etc., have been explored.

Though very important, there does not exist a comprehensive survey on NT. This paper aims to fill this gap, by systematically reviewing about fifty representative approaches to cope with NT. We mainly consider homogeneous and closed-set classification problems in TL, i.e., the source and target tasks are the same, and the target feature and label spaces are also unchanged during testing. This is the most studied TL scenario. We introduce the definition and factors of NT, methods that can avoid NT under theoretical guarantees, methods that can mitigate NT to a certain extent, and some related fields. Articles that do not explain their methods from the perspective of NT are not included in this survey, to keep it more focused.

The remainder of this paper is organized as follows. Section 2 introduces background knowledge in TL and NT. Section 3 proposes a scheme for reliable TL. Sections 4-7 review secure transfer, domain similarity estimation, distant transfer, and NT mitigation strategies, respectively. Section 8 introduces several related machine learning fields. Section 9 compares all reviewed approaches. Finally, Section 10 draws conclusions and points out some future research directions.

• Wen Zhang, Lingfei Deng and Dongrui Wu are with the Key Laboratory of the Ministry of Education for Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: {wenz, lfdeng, drwu}@hust.edu.cn).
• Lei Zhang is with the School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China (e-mail: [email protected]).
• Wen Zhang and Lingfei Deng contributed equally to this work. Dongrui Wu is the corresponding author.

arXiv:2009.00909v4 [cs.LG] 9 Aug 2021

2 BACKGROUND KNOWLEDGE

This section introduces some background knowledge on TL and NT, including the notations, definitions and categorizations of TL, and factors of NT.

2.1 Notations and Definitions

We consider classifiers with $K$ categories, with an input feature space $\mathcal{X}$ and an output label space $\mathcal{Y}$. Assume we have access to one labeled source domain $S = \{(x_s^i, y_s^i)\}_{i=1}^{n_s}$ drawn from $P_S(X, Y)$, where $X \subseteq \mathcal{X}$ and $Y \subseteq \mathcal{Y}$. The target domain consists of two sub-datasets: $T = (T_l, T_u)$, where $T_l = \{(x_l^j, y_l^j)\}_{j=1}^{n_l}$ consists of $n_l$ labeled samples drawn from $P_T(X, Y)$, and $T_u = \{x_u^k\}_{k=1}^{n_u}$ consists of $n_u$ unlabeled samples drawn from $P_T(X)$. The main notations are summarized in Table 1.

TABLE 1
Main notations in this survey.

Notation        Description       Notation                     Description
$x$             Feature vector    $\ell(\cdot)$, $L(\cdot)$    Loss function
$y$             Label of $x$      $h$                          Hypothesis
$\mathcal{X}$   Feature space     $f$                          Classifier
$\mathcal{Y}$   Label space       $\theta$                     TL algorithm
$S$             Source domain     $g$                          Feature extractor
$T$             Target domain     $\epsilon$                   Error (risk)
$P(\cdot)$      Distribution      $n$                          Number of samples
$E(\cdot)$      Expectation       $K$                          Number of classes
$d(\cdot)$      Distance metric   $M$                          No. of source domains

In TL, the condition that the source and target domains are different (i.e., $S \neq T$) implies one or more of the following:

1) The feature spaces are different, i.e., $\mathcal{X}_S \neq \mathcal{X}_T$.
2) The label spaces are different, i.e., $\mathcal{Y}_S \neq \mathcal{Y}_T$.
3) The marginal probability distributions of the two domains are different, i.e., $P_S(X) \neq P_T(X)$.
4) The conditional probability distributions of the two domains are different, i.e., $P_S(Y|X) \neq P_T(Y|X)$.

This survey focuses on the last two differences, i.e., we assume that the source and target domains share the same feature and label spaces. This is actually the case of DA.

TL aims to design a learning algorithm $\theta(S, T)$, which utilizes data/information in the source and target domains to output a hypothesis $h$ as the target domain mapping function, with a small expected loss $\epsilon_T(h) = E_{x,y \sim P_T(X,Y)}[\ell(h(x), y)]$, where $\ell$ is a target domain loss function.

2.2 TL Categorization

According to [2], TL approaches can be categorized into four groups: instance based, feature based, model/parameter based, and relation based.

Instance based approaches mainly focus on sample weighting, assuming the distribution discrepancy between the source and target domains is caused by a sample selection bias, which can be compensated by reusing a certain portion of the weighted source domain data [3], [8].

Feature based approaches aim to find a latent subspace or representation to match the two domains, assuming there exists a common space in which distribution discrepancies of different domains can be minimized [3].

Model/parameter based approaches transfer knowledge via parameters, assuming the distributions of model parameters in different domains are the same or similar [19], [20].

Relation based approaches assume that some internal logical relationships or rules in the source domain are preserved in the target domain.

More details on TL can be found in [2], [3], [21], [22].

2.3 NT

Rosenstein et al. [14] first discovered NT through experiments, and concluded that “transfer learning may actually hinder performance if the tasks are too dissimilar” and “inductive bias learned from the auxiliary tasks will actually hurt performance on the target task.” Pan et al. [2] also briefly mentioned NT in their TL survey: “When the source domain and target domain are not related to each other, brute-force transfer may be unsuccessful. In the worst case, it may even hurt the performance of learning in the target domain, a situation which is often referred to as negative transfer.”

Wang et al. [15] gave a mathematical definition of NT, and proposed a negative transfer gap (NTG) to determine whether NT happens.

Definition 1. (Negative transfer gap [15]). Let $\epsilon_T$ be the test error in the target domain, $\theta(S, T)$ a TL algorithm between $S$ and $T$, and $\theta(\emptyset, T)$ the same algorithm that does not use the source domain information at all. Then, NT happens when $\epsilon_T(\theta(S, T)) > \epsilon_T(\theta(\emptyset, T))$, and the degree of NT can be evaluated by the NTG:

$$\mathrm{NTG} = \epsilon_T(\theta(S, T)) - \epsilon_T(\theta(\emptyset, T)). \tag{1}$$

Obviously, NT occurs if the NTG is positive. However, NTG may not always be computable. For example, in an unsupervised scenario, $\epsilon_T(\theta(\emptyset, T))$ is impossible to compute due to the lack of labeled target data.

2.4 Factors of NT

Ben-David et al. [23] gave a theoretical bound for TL:

$$\epsilon_T(h) \leq \epsilon_S(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(X_s, X_t) + \lambda, \tag{2}$$

where $\epsilon_T(h)$ and $\epsilon_S(h)$ are respectively the expected error of hypothesis $h$ in the target domain and the source domain, $d_{\mathcal{H}\Delta\mathcal{H}}(X_s, X_t)$ is the domain divergence between the two domains, and $\lambda$ is a problem-specific constant.

Based on (2), the following four factors could contribute to NT: domain divergence, the transfer algorithm, source data quality, and target data quality.
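As a rough illustration of Definition 1 and the bound in (2), the following Python sketch estimates the NTG from a held-out labeled target set and evaluates the right-hand side of (2). It assumes that the target test error is approximated by the empirical 0-1 error; the function names `target_error`, `negative_transfer_gap`, and `bound_rhs`, as well as the toy threshold classifiers, are hypothetical and do not come from [15] or [23].

```python
from typing import Callable, Sequence, Tuple

Sample = Tuple[Sequence[float], int]           # (feature vector, label)
Hypothesis = Callable[[Sequence[float]], int]  # maps a feature vector to a label


def target_error(h: Hypothesis, labeled_target: Sequence[Sample]) -> float:
    """Empirical 0-1 error of h on labeled target samples (stand-in for eps_T)."""
    mistakes = sum(1 for x, y in labeled_target if h(x) != y)
    return mistakes / len(labeled_target)


def negative_transfer_gap(h_transfer: Hypothesis,
                          h_target_only: Hypothesis,
                          labeled_target: Sequence[Sample]) -> float:
    """NTG as in (1): error with transfer minus error without transfer."""
    return (target_error(h_transfer, labeled_target)
            - target_error(h_target_only, labeled_target))


def bound_rhs(eps_source: float, divergence: float, lam: float) -> float:
    """Right-hand side of (2): eps_S(h) + 0.5 * d_{H Delta H} + lambda."""
    return eps_source + 0.5 * divergence + lam


# Toy 1-D example (hypothetical data): the true boundary lies at 1.5.
toy_target = [([0.0], 0), ([1.0], 0), ([2.0], 1), ([3.0], 1)]
h_without_source = lambda x: int(x[0] > 1.5)  # trained on target data only
h_with_source = lambda x: int(x[0] > 0.5)     # boundary shifted by source data

ntg = negative_transfer_gap(h_with_source, h_without_source, toy_target)
print("NTG:", ntg, "-> negative transfer" if ntg > 0 else "-> no negative transfer")
```

In this toy case the transferred classifier misclassifies one of the four target samples while the target-only classifier makes no mistakes, so the NTG is 0.25 and NT is flagged. As noted above, such a check is only possible when labeled target data are available.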

1) Domain divergence. Arguably, the divergence between the source and target domains is the root of NT. TL approaches that do not explicitly consider minimizing the divergence, whether at the feature, classifier, or target output level, are more likely to result in NT.
2) Transfer algorithm. A secure transfer algorithm should have a theoretical guarantee that the learning performance in the target domain is better when auxiliary data are utilized, or the algorithm has been carefully designed to improve the transferability of auxiliary domains. Otherwise, NT may happen.
3) Source data quality. Source data quality determines the quality of the transferred knowledge. If the source data are inseparable or very noisy, then a classifier trained on them may be unreliable. Sometimes the source data have been converted into pre-trained models, e.g., for privacy preservation. An over-fitting or under-fitting source domain model may also cause NT.
4) Target data quality. The target domain data may be noisy and/or non-stationary, which may also lead to NT.

3 RELIABLE TL

Fig. 2 shows our proposed reliable TL scheme, considering existing typical strategies for alleviating or avoiding NT.

Fig. 2. Our proposed reliable TL scheme.

Secure transfer can overcome NT with theoretical guarantees, regardless of whether the source and target domains are similar or not. Most other approaches assume the source domain has some similarity with the target domain. Accurately estimating this similarity helps determine which strategy should be used to handle NT.

With the estimated domain similarity, we can: 1) refuse to transfer or use distant transfer when the similarity is low; 2) perform NT mitigation when the similarity is medium; or, 3) concatenate data from different domains directly and train a classifier from them, when the similarity is high.

When there are multiple source domains, domain similarity estimation can also be used to select the most transferable source domains.

4 SECURE TRANSFER

Secure transfer explicitly avoids NT in its objective function, i.e., the TL algorithm should perform better than the one without transfer, which is appealing in real-world applications, especially when it is difficult to estimate the domain similarity.

As shown in Table 2, there are only a few secure transfer approaches. Some of them consider only classification problems, whereas others consider only regression problems.

TABLE 2
Existing secure transfer approaches.

CATEGORY                  APPROACH                REFERENCE
Classification Transfer   Adaptive learning       [24]
                          Performance gain        [17], [25]
Regression Transfer       Output truncation       [26]
                          Regularization          [27]
                          Bayesian optimization   [28]

4.1 Secure Transfer for Classification

Cao et al. [24] proposed a Bayesian adaptive learning approach to adjust the transfer schema automatically according to the similarity of the two tasks. It assumes the source and target data obey a Gaussian distribution with a semi-parametric transfer kernel $K$,

$$K_{nm} \sim k(x_n, x_m)\left(2e^{-\varsigma(x_n, x_m)\rho} - 1\right), \tag{3}$$

where $k$ is a valid kernel function. $\varsigma(x_n, x_m) = 0$ if $x_n$ and $x_m$ are from the same domain, and $\varsigma(x_n, x_m) = 1$ otherwise. The parameter $\rho$ represents the dissimilarity between the source and target domains. By assuming $\rho$ is from a Gamma distribution $\Gamma(b, \mu)$, where $b$ and $\mu$ are respectively the shape and the scale parameters inferred from a few labeled samples in both domains, we can define

$$\lambda = 2\left(\frac{1}{1 + \mu}\right)^{b} - 1, \tag{4}$$

which determines the similarity between the two domains, and also what can be transferred. For example, when $\lambda$ is close to 0, the correlation between the domains is low, so only the parameters in the kernel function $k$ can be shared.

Jamal et al. [25] proposed a deep face detector adaptation approach to avoid NT and catastrophic forgetting in deep TL, by minimizing the following loss function:

$$\min_{u, \tilde{\theta}} \; \frac{\lambda}{2}\|u\|_2^2 + E_t \max_{y_t \in \{0, 1\}} \mathrm{RES}_t\left(w + u, \tilde{\theta}\right), \tag{5}$$

where $w$ are the classifier weights of the source detector, $u$ the offset weights to constrain the target face detector around the source detector, and $\tilde{\theta}$ the parameters of the target feature extractor. $\mathrm{RES}_t$ is the relative performance loss of the learned target detector over the pre-trained source face detector, which is non-positive after optimization. Hence, the obtained target detector is always no worse than the source detector, i.e., NT is avoided.

Li et al. [17] proposed a safe weakly supervised learning (SAFEW) approach for semi-supervised DA. Assume the target hypothesis $h^*$ can be constructed from multiple base learners in the source domains, i.e., $h^* = \sum_{i=1}^{M} \alpha_i h_i$, where $\{h_i\}_{i=1}^{M}$ are the $M$ source models with $\alpha = [\alpha_1; \alpha_2; ...; \alpha_M] \geq 0$ and $\sum_{i=1}^{M} \alpha_i = 1$. The goal is to learn a prediction $h$ that maximizes the performance gain over a baseline $h_0$.
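The convex-combination assumption on the target hypothesis can be made concrete with a short Python sketch. It only illustrates how $h^* = \sum_{i} \alpha_i h_i$ is formed for one input when each source model returns class scores and $\alpha$ lies on the probability simplex; it is not the SAFEW optimization itself, and the helper names `on_simplex` and `combine` are hypothetical.

```python
from typing import List, Sequence


def on_simplex(alpha: Sequence[float], tol: float = 1e-9) -> bool:
    """Check alpha >= 0 and sum(alpha) = 1, up to a small tolerance."""
    return all(a >= -tol for a in alpha) and abs(sum(alpha) - 1.0) <= tol


def combine(alpha: Sequence[float],
            source_outputs: Sequence[Sequence[float]]) -> List[float]:
    """Return sum_i alpha_i * h_i(x), where source_outputs[i] = h_i(x).

    Each h_i(x) is assumed to be a list of per-class scores of equal length.
    """
    if len(alpha) != len(source_outputs):
        raise ValueError("one weight is needed per source model")
    if not on_simplex(alpha):
        raise ValueError("alpha must be non-negative and sum to one")
    n_classes = len(source_outputs[0])
    return [sum(a * out[k] for a, out in zip(alpha, source_outputs))
            for k in range(n_classes)]


# Three hypothetical source models scoring one target sample over two classes.
outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(combine([0.5, 0.5, 0.0], outputs))
```

Because the weights are restricted to the simplex, the combined score for each class always stays between the smallest and largest score given by the individual source models.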

The baseline $h_0$ is trained on the labeled target data only, and $h$ is obtained by optimizing the following objective function:

$$\max_{h} \min_{\alpha \in \mathcal{M}} \; \ell\left(h_0, \sum_{i=1}^{M} \alpha_i h_i\right) - \ell\left(h, \sum_{i=1}^{M} \alpha_i h_i\right), \tag{6}$$

i.e., SAFEW optimizes the worst-case performance gain to avoid NT.

4.2 Secure Transfer for Regression

Kuzborskij and Orabona [26] introduced a class of regularized least squares (RLS) [29] algorithms with biased regularization to avoid NT. The original RLS algorithm solves the following optimization problem:

$$\min_{w} \; \frac{1}{n}\sum_{i=1}^{n}\left(w^\top x_i - y_i\right)^2 + \lambda\|w\|^2. \tag{7}$$

After obtaining the optimized source hypothesis $h'(\cdot)$, the authors constructed a training set $\{(x_i, y_i - h'(x_i))\}_{i=1}^{n}$, and generated the transfer hypothesis

$$h_T(x) = T_C\left(x^\top \hat{w}_T\right) + h'(x), \tag{8}$$

where $T_C(\hat{y}) = \min(\max(\hat{y}, -C), C)$ is a truncation function to limit the output to $[-C, C]$, and

$$\hat{w}_T = \arg\min_{w} \; \frac{1}{n}\sum_{i=1}^{n}\left(w^\top x_i - y_i + h'(x_i)\right)^2 + \lambda\|w\|^2. \tag{9}$$

Kuzborskij and Orabona [26] showed that the proposed approach is equivalent to RLS trained solely in the target domain when the source domains are unrelated to the target domain.

Yoon and Li [27] proposed a positive TL approach, also based on the RLS algorithm. It assumes the source parameters follow a normal distribution, and optimizes the following loss function:

$$\min_{w} \; \ell_{T_l}(w; b) + \beta R(w) + \lambda N(w; \mu_w, \Sigma_w), \tag{10}$$

where $w$ are model coefficients, $R(w)$ a regularization term to control the model complexity, and $N(w; \mu_w, \Sigma_w)$ a regularization term to constrain $w$ in a multi-variable Gaussian distribution with mean $\mu_w$ and variance $\Sigma_w$ computed from the source domains. They showed that NT arises when $\lambda$ is too large, and thus proposed an optimization rule to select the weight $\lambda$ and hence to eliminate harmful source domains.

Sorocky et al. [28] derived a theoretical bound on the test error and proposed a Bayesian-optimization based approach to estimate this bound to guarantee positive transfer in a robot tracking system. Firstly, they bounded the 2-norm of the tracking error of the target robot using the source module by

$$\|e_{t,s}\|_2 \leq \|E_{t,s}\|_\infty \|y_d\|_2, \tag{11}$$

where $E_{t,s}$ represents the transfer function of the robot tracking system, and $y_d$ the desired output of the source module. Given the baseline target tracking error $e_{t,b}$, this bound can guarantee positive transfer if $\|e_{t,b}\|_2 \geq \|E_{t,s}\|_\infty \|y_d\|_2$. Since $y_d$ is fixed and known, the authors established a Gaussian Process model to estimate $\|E_{t,s}\|_\infty$ and compute the error bound, guaranteeing positive transfer.

5 DOMAIN SIMILARITY ESTIMATION

Domain similarity (or transferability) estimation is a very important block in reliable TL, as shown in Fig. 2. Existing estimation approaches can be categorized into three groups: feature statistics based, test performance based, and fine-tuning based, as summarized in Table 3.

TABLE 3
Approaches for domain similarity estimation.

CATEGORY                   APPROACH             REFERENCE
Feature Statistics Based   MMD                  [30]
                           Correlation          [31], [32]
                           KL divergence        [33], [34]
Test Performance Based     Target performance   [35], [36]
                           Domain classifier    [37]–[39]
Fine-Tuning Based          Clustering quality   [40]
                           Entropy              [18], [41]–[43]

5.1 Feature Statistics Based

The original feature representation and its first- or high-order statistics, such as mean and covariance, are direct and important inputs for measuring the domain distribution discrepancy. Three widely used domain discrepancy measurements are maximum mean discrepancy (MMD), correlation coefficient, and KL-divergence.

MMD [30] may be the most popular discrepancy measure in traditional TL [5], [6], [8], [44], due to its simplicity and effectiveness. It is a nonparametric measure, and can be computed directly from the feature means in the raw feature space or a mapped Reproducing Kernel Hilbert Space (RKHS). Empirically, the MMD between the source and target domains can be computed via:

$$\mathrm{MMD}^2(S, T) = \left\| \frac{1}{n_s}\sum_{i=1}^{n_s} x_{s,i} - \frac{1}{n_t}\sum_{j=1}^{n_t} x_{t,j} \right\|^2, \tag{12}$$

where $x_{s,i}$ and $x_{t,j}$ are source and target domain samples, respectively; in the RKHS case, the samples are first mapped through the feature map induced by a kernel function $k$. However, MMD only considers the marginal distribution discrepancy between different domains, so it may fail when the conditional distributions between the source and target domains are also significantly different.

The correlation between two high-dimensional random variables from different distributions can also be used to evaluate the distribution discrepancy. Lin and Jung [31] evaluated the inter-subject similarity in emotion classification via the correlation coefficient of the original feature representations from two different subjects.

To fully utilize the source label information, Zhang and Wu [32] developed a domain transferability estimation (DTE) index to evaluate the transferability between a source domain and a target domain via between-class and between-domain scatter matrices:

$$\mathrm{DTE}(S, T) = \frac{\|S_b^{S}\|_1}{\|S_b^{S,T}\|_1}, \tag{13}$$

where $S_b^{S}$ is the between-class scatter matrix in the source domain, $S_b^{S,T}$ the between-domain scatter matrix, and $\|\cdot\|_1$ the 1-norm. DTE has low computational cost and is insensitive to the sample size.

KL-divergence [45] is a non-symmetric measure of the divergence between two probability distributions. Gong et al. [33] proposed a rank of domain (ROD) approach to rank the similarities of the source domains to the target domain, by computing the symmetrized KL divergence weighted average of the principal angles. It can be used to automatically select the optimal source domains to adapt. Azab et al. [34] computed the similarity $\alpha_s$ between the target domain feature set $d_t$ and the source domain feature set $d_s$ as:

$$\alpha_s = \frac{1 / \left(\overline{\mathrm{KL}}[d_t, d_s] + \epsilon\right)^4}{\sum_{m=1}^{M} 1 / \left(\overline{\mathrm{KL}}[d_t, d_m] + \epsilon\right)^4}, \tag{14}$$

where $\overline{\mathrm{KL}}$ represents the average per-class KL-divergence, $M$ the number of source domains, and $\epsilon = 0.0001$ is used to ensure the stability of calculation.

Additionally, Hilbert-Schmidt independence criterion (HSIC) [46], Bregman divergence [47], optimal transport and Wasserstein distance [48], etc., have also been used to measure the feature distribution discrepancy in conventional TL.

5.2 Test Performance Based

The domain similarity can also be measured from the test performance: if a source domain classifier can achieve good performance on the labeled target domain data, then that source domain and the target domain should be very similar. However, this requires the target domain to have some labeled data.

Yao and Doretto [35] used $M$ iterations to train $M$ weak classifiers from $N$ source domains ($M \leq N$) and then combined them as the final classifier. In the $m$-th iteration, the classifier with the smallest error on the labeled target domain samples was chosen as the $m$-th weak classifier.

Some works assume the target model is accessible for source selection. Xie et al. [36] proposed selective transfer incremental learning (STIL) to remove less relevant source (historical) models for online TL. STIL computes the following Q-statistic as the correlation between a historical model and the newly trained target model:

$$Q_{f_i, f_j} = \frac{N^{11}N^{00} - N^{01}N^{10}}{N^{11}N^{00} + N^{01}N^{10}}, \tag{15}$$

where $f_i$ and $f_j$ are two classifiers. $N^{y_i y_j}$ is the number of instances for which the classification result is $y_i$ by $f_i$ ($y_i = 1$ if $f_i$ classifies the example correctly; otherwise $y_i = 0$), and $y_j$ by $f_j$. STIL then removes the less transferable historical models, whose Q-statistics are close to 0. In this way, it can overcome NT. This strategy was also used in [49].

The above approaches require a sufficient number of labeled target samples, or the target model, which may not always be available. In such cases, discriminator based similarity measures could be used. These approaches train classifiers to discriminate the two domains and then define a similarity measure from the classification error [23], [38].

Ben-David et al. [37] proposed an unsupervised $\mathcal{A}$-distance, defined through the minimum-error domain classifier:

$$d_{\mathcal{A}}(\mu_S, \mu_T) = 2\left(1 - 2\min_{h \in \mathcal{H}} \epsilon(h)\right), \tag{16}$$

where $\mathcal{H}$ is the hypothesis space, $h$ a domain classifier, and $\epsilon(h)$ the domain classification error. The $\mathcal{A}$-distance should be small for good transferability. Unfortunately, computing $d_{\mathcal{A}}(\mu_S, \mu_T)$ is NP-hard. To reduce the computational cost, they trained a linear classifier to determine which domain the data come from, and utilized its error to approximate the optimal classifier. However, $\mathcal{A}$-distance neglects the difference in label spaces and only considers the marginal distribution discrepancy between domains, which may affect its performance.

Recently, Wu and He [39] proposed a novel label-informed divergence between the source and target domains, when the target domain is time evolving. This divergence can measure the shift of joint distributions, which improves the $\mathcal{A}$-distance.

5.3 Fine-Tuning Based

The domain similarity can also be estimated from fine-tuning [50], [51], which is frequently used in deep TL to adapt a source domain model to the target domain, by fixing its lower layer parameters and re-tuning the higher layer parameters.

Generally, these approaches feed the target data into the source neural network, and use the output to determine the domain similarity. Meiseles et al. [40] introduced a clustering quality metric, mean silhouette coefficient [52], to assess the quality of the target encodings produced by a given source model. They found that this metric has the potential for source model selection. Tran et al. [41] developed negative conditional entropy (NCE), which measures the amount of information from a source domain to the target domain, to evaluate the source domain transferability.

Recently, computationally-efficient domain similarity measures without the source data have attracted great attention [18], [42], [43]. Nguyen et al. [18] proposed log expected empirical prediction (LEEP), which can be computed from a source model $\theta$ with $n_l$ labeled target data, by running the target data through the model only once:

$$T(\theta, D) = \frac{1}{n_l}\sum_{i=1}^{n_l} \log\left(\sum_{z \in \mathcal{Z}} \hat{P}(y_i | z)\, \theta(x_i)_z\right), \tag{17}$$

where $\hat{P}(y_i | z)$ is the empirical conditional distribution of the real target label $y_i$ given the dummy target label $z$ predicted by model $\theta$. $T(\theta, D)$ represents the transferability of the pre-trained model $\theta$ to the target domain $D$, and is an upper bound of the NCE measure.

In addition, Huang et al. [43] developed transfer rate (TransRate) for transferability estimation in fine-tuning based TL. Its motivation comes from mutual information of the output from the pre-trained feature extractor:

$$\mathrm{TrR}_{S \to T}(g, \epsilon) = R(Z, \epsilon) - R(Z, \epsilon | Y), \tag{18}$$

where $Y$ are labels of the target instances, $Z = g(X)$ features extracted by the pre-trained feature extractor $g$, and $R(Z, \epsilon)$ the rate distortion of $H(Z)$ to encode $Z$ with an expected

decoding error less than $\epsilon$. They showed that TransRate has superior performance in selecting the source data, source model architecture, and even network layers.

6 DISTANT TRANSFER

NT may happen easily when the source and target data have very low similarity. For example, using text data as the auxiliary source is likely to reduce the performance of the target classifier originally trained on image data, causing NT. Distant TL (also called transitive TL) [53], which bridges dramatically different source and target domains through one or more intermediate domains, could be a solution to this problem.

Tan et al. [54] introduced an instance selection mechanism to identify useful source data, and constructed multiple intermediate domains. They learned a pair of encoding function $f_e(\cdot)$ and decoding function $f_d(\cdot)$ to minimize the reconstruction errors on the selected instances in the source and intermediate domains, and also on all instances in the target domain simultaneously:

$$\begin{aligned} L(f_e, f_d, v_S, v_I) = {} & R(v_S, v_I) + \frac{1}{n_S}\sum_{i=1}^{n_S} v_S^i \|\hat{x}_S^i - x_S^i\|_2^2 \\ & + \frac{1}{n_I}\sum_{i=1}^{n_I} v_I^i \|\hat{x}_I^i - x_I^i\|_2^2 + \frac{1}{n_T}\sum_{i=1}^{n_T} \|\hat{x}_T^i - x_T^i\|_2^2, \end{aligned} \tag{19}$$

where $\hat{x}_S^i$, $\hat{x}_T^i$ and $\hat{x}_I^i$ are reconstructions of $x_S^i$, $x_T^i$ and $x_I^i$ from an auto-encoder, $v_S$ and $v_I$ are selection indicators, and $R(v_S, v_I)$ is a regularization term. They also incorporated side information, such as predictions in the intermediate domains, to help the model learn more task-related feature representations.

Similar strategies have also been used in applications where the training data are very scarce, such as medical diagnostics and remote sensing. Niu et al. [55] proposed a distant domain TL method that transfers knowledge from object recognition datasets, chest X-ray images, etc., to coronavirus diagnosis. They developed a convolutional auto-encoder pair to reconstruct both common image domains and medical image domains in the same intermediate feature space. All task-related information that may induce NT was removed after reconstruction. Xie et al. [56] proposed a feature-based method to enrich scarce daytime satellite image data, by transferring knowledge learned from an object classification task, using the night-time light intensity information as a bridge.

7 NT MITIGATION

In most TL problems, the source and target domains have some similarity, which could be used to mitigate NT. This can be achieved by data transferability enhancement, model transferability enhancement, and target prediction enhancement, as shown in Table 4.

TABLE 4
Approaches for NT mitigation.

CATEGORY                            APPROACHES            REFERENCES
Data Transferability Enhancement    Domain level          [57]–[59]
                                    Instance level        [15], [60]–[63]
                                    Feature level         [64]–[70]
Model Transferability Enhancement   TransNorm             [71]
                                    Adversarial robust    [72]–[74]
Target Prediction Enhancement       Soft labeling         [75], [76]
                                    Selective labeling    [77], [78]
                                    Weighted clustering   [20], [78], [79]

7.1 Data Transferability Enhancement

The transferability of the source domain can be enhanced by improving the data quality from coarse to fine-grained, at the domain level, instance level, or feature level.

At the domain level, when there are multiple source domains, we can select a subset of them or weight them. At the instance level, we can select or weight the source instances. At the feature level, we can transform the original features into a common latent space or enhance their transferability.

7.1.1 Domain Level Transferability Enhancement

When there are multiple source domains, selectively aggregating the most similar ones to the target domain, or a weighted aggregation of all source domains, may achieve better performance than simply averaging all source domains [80], [81]. Therefore, domain selection/weighting can be used to mitigate NT.

The approaches introduced in Section 5 for estimating the similarity between a single source domain and a single target domain can be easily extended to multi-source TL scenarios. For example, Wang and Carbonell [57] used MMD to measure the proximity between the source and the target domains. They first trained a classifier in each source domain and then weighted them for target domain prediction, where each weight was a combination of the source domain's MMD-based proximity to the target domain and its transferability to other source domains. In the special case that the confidence of a source domain classifier is low, its own classification is discarded; instead, it queries its peers on this specific test sample, and each peer is weighted by its transferability to the current source domain.

The domain similarity (weight) can also be obtained through optimization [58], [59]. Zuo et al. [58] introduced an attention-based domain recognition module to estimate the domain correlations to alleviate the effects caused by dissimilar domains. Its main idea is to reorganize the instance labels when there are multiple source domains so that it can simultaneously distinguish each category and domain. It redefines the source labels by $\hat{Y}_{s,i} = Y_{s,i} + (i - 1) \times K$, and trains a domain recognition model on the original features. The learned weight of the $i$-th domain is

$$w_i = \frac{\sum_{j=1}^{n_t} \mathrm{sign}(\hat{d}_j, i)}{n_t}, \tag{20}$$

where $n_t$ is the target image number in a batch, $\mathrm{sign}(\cdot, \cdot)$ a sign function, and $\hat{d}_j$ the domain label of a target instance $x_j$ by analyzing the domain recognition model prediction. The authors verified that the learned domain weights had a high correlation with the groundtruth.

Ahmed et al. [59] considered optimizing the weights in a more challenging scenario when the source domain

data are absent, and only the pre-trained source models are formulated as an optimization problem of non-negative accessible. They first developed a sophisticated loss function matrix tri-factorization: Ltar from the source model predictions on the unlabeled > target data. Then, the domain weights were optimized by, min XS − [U0,US]HVS , (24) U0,US ,H,VS ≥0 M X where XS is the source domain feature matrix, U0 and min Ltar, s.t. αi = 1, αi ≥ 0. (21) {α }M U i i=1 i=1 S are respectively common feature clusters and domain specific feature clusters, VS is a sample cluster assignment With the learned weights, the target model performs at least matrix, and H is the association matrix. (24) minimizes as good as the single best source model. the marginal distribution discrepancy between different do- mains to enable optimal knowledge transfer. Shi et al. [65] 7.1.2 Instance Level Transferability Enhancement proposed a twin bridge transfer approach, which uses latent Instance selection/weighting are frequently used in TL [2], factor decomposition of users and similarity graph transfer [3], [22]. They can also be used to mitigate NT. to facilitate knowledge transfer to reduce NT. This idea was Seah et al. [60] proposed a predictive distribution match- also investigated in [67], which seeks a latent feature space ing (PDM) regularizer to remove irrelevant source domain of the source and target data to minimize the distribution data. It iteratively infers the pseudo-labels of the unlabeled discrepancy. target domain data and retains the highly confident ones. Another prominent line of work, particularly used in Finally, an SVM or logistic regression classifier is trained deep TL, is to enhance the feature transferability at the using the remaining source domain data and the pseudo- time of feature representations learning. Yosinski et al. [86] labeled target domain data. Yi et al. 
[62] partitioned source defined feature transferability based on its specificity to the data into components by clustering, and assigned them domain in which it is trained and its generality. different weights by iterative optimization. Multiple approaches have been proposed to compute Active learning [82], [83], which selects the most useful and enhance the feature transferability [68]–[70], [87]. For unlabeled samples for labeling, can also be used to select example, Chen et al. [68] found that features with small the most appropriate source samples [63], [84]. Peng et al. singular values have low transferability in deep network [63] proposed active TL to optimally select source samples fine-tuning; they thus proposed a regularization term to that are class balanced and highly similar to those in the reduce NT, by suppressing the small singular values of the target domain. It simultaneously minimizes the MMD and feature matrices. mitigates NT. If the unlabeled target domain samples can Unfortunately, focusing only on improving the feature be queried for their labels, then active learning can also be transferability may lead to poor discriminability. It is nec- integrated with TL for instance selection [61], [85]. essary to consider both feature transferability and discrim- Instance weighting has also been used in deep TL to inability to mitigate NT. To this end, Chen et al. [69] pro- handle NT. Wang et al. [15] developed a discriminator posed to enhance the feature transferability with guaranteed gate to achieve both adversarial adaptation and class-level acceptable discriminability by using batch spectral penaliza- weighting of the source samples. They used the output of tion regularization on the largest few singular values. a discriminator to estimate the distribution density ratio of To reduce the negative impact of noise in the learned two domains at each specific feature point: feature spaces, Xu et al. 
[66] introduced a sparse matrix in unsupervised TL to model the feature noise. The loss Pt(x, y) D(x, y) = , (22) function with noise minimization is: Ps(x, y) 1 − D(x, y) 1 where D(x, y) is the output of the discriminator when the min φ(P,Y,XS) + kZk∗ + αkZk1 + βkEk1 P,Z,E 2 (25) input is the concatenation of the feature representation x > > and its predicted label y. The loss is: s.t. P Xt = P XsZ + E, where P , Z and E are the transformation matrix, recon- L(C,F ) = Exj ,yj ∼Tl [`(C(F (xj)), yj)] (23) struction matrix and noise matrix, respectively, φ(P,Y,XS) + λExi,yi∼S [w(xi, yi)`(C(F (xi)), yi)], is a discriminant subspace learning function, and k·k∗ is the where C and F are respectively the classifier and the feature nuclear norm of a matrix. The goal is to align the source and extractor, and w(xi, yi) = D(xi, yi)/(1 − D(xi, yi)) is the target domains in a common low-rank sparse space with weight of a source sample xi. Wang et al. [15] demonstrated noise suppression. that this approach can remarkably mitigate NT.

7.1.3 Feature Level Transferability Enhancement 7.2 Model Transferability Enhancement A popular strategy for feature level transferability enhance- Model transferability enhancement can be achieved through ment is to learn a common latent feature space, where the transferable normalization (TransNorm) [71], adversarial ro- features of different domains become more consistent. bust training, etc. Long et al. [64] proposed dual TL to distinguish between TransNorm [71] reduces domain shift in batch normal- the common and domain-specific latent factors automati- ization [88], and is usually used after the convolutional cally. Its main idea is to find a latent feature space that layer to enhance the model transferability. Let the mean and can maximally help the classification in the target domain, variance of the source domain be us and σs, and the target IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. -, NO. -, 2021 8 domain be ut and σt. TransNorm quantifies the domain select a subset containing mnt/T high-probability target distance as samples in the m-th iteration, where T is the number of (j) (j) iterations of the learning process. Their experiments showed us u d(j) = − t , (26) that this simple strategy generally improved the target pre- 2(j) 2(j) σs +  σt +  diction performance. where j denotes the j-channel in a layer that TransNorm Cluster enhanced pseudo-labeling is based on soft is applied to. Then, it uses distance-based probability α to pseudo-labeling, but further explores the unsupervised clus- adapt each channel according to its transferability: tering information to enhance target domain prediction [20], [78], [79]. For example, Liang et al. [20] developed a self- c(1 + d(j))−1 α(j) = . supervised pseudo-labeling approach to alleviate harmful Pc (k) −1 (27) k=1(1 + d ) effects resulted from inaccurate adaptation network out- puts. 
Its main idea is to perform weighted k-means clus- Another way to enhance the model transferability is to tering on the target data to get the class means, improve its robustness to adversarial examples [89]. Ad- P ˆ versarial examples are slightly perturbed inputs aiming to δk(θt(xt))ˆgt(xt) µ = xt∈T , fool a machine learning model [90], [91]. An adversarially k P ˆ (29) δk(θt(xt)) robust model that is resilient to such adversarial examples xt∈T can be achieved by replacing the standard empirical risk ˆ ˆ where θt = ft(ˆgt(xt)) denotes the learned target network minimization loss with a robust optimization loss [89]: ˆ parameters, gˆt(·) is a feature extractor, ft(·) is a classification   layer, and δk(·) denotes the k-th element in the soft-max min E(x,y)∼D max `(x + δ, y; θ) , (28) µ θ kδk2≤ε output. With the learned robust feature centroids k, the pseudo-labels can be updated by a nearest centroid classi- δ ε where is a small perturbation, a hyper-parameter to fier: control the perturbation magnitude, and θ the set of model 2 yˆt = arg min kgˆt(xt) − µkk2. parameters. k (30) Several recent studies found that adversarially robust models have better transferability [72]–[74]. Salman et al. The centroids and pseudo-labels can be optimized itera- [72] empirically verified that adversarially robust networks tively to obtain better pseudo-labels. achieved higher transfer accuracies than standard ImageNet models, and increasing the width of a robust network may 8 NT IN RELATED FIELDS increase its transfer performance gain. Liang et al. [73] found NT has also been found and studied in several related fields, a strong positive correlation between adversarial transfer- including multi-task learning [93], lifelong learning [94], and ability and knowledge transferability; thus, increasing the adversarial attacks [89]. adversarial transferability may benefit knowledge transfer- ability. 
8.1 Multi-Task Learning 7.3 Target Prediction Enhancement Multi-task learning solves multiple learning tasks jointly, by exploiting commonalities and differences across them. TL is frequently applied to the target domain with few Similar to TL, it needs to facilitate positive transfer among labeled samples but abundant unlabeled ones. Similar to tasks to improve the overall learning performance on all semi-supervised learning [92], pseudo-labels can be used tasks. Previous studies [95], [96] have observed that con- to exploit these unlabeled samples in TL. Soft pseudo- flicting gradients among different tasks may induce NT labeling, selective pseudo-labeling and cluster enhanced (also known as negative interference). Various techniques pseudo-labeling could be used to mitigate NT. have been explored to remedy negative interference, such Soft pseudo-labeling assigns each unlabeled sample to as altering the gradients directly [97], [98], weighting tasks different classes with different probabilities, rather than a [99], learning task relatedness [100], [101], routing networks single class, in order to alleviate label noise from a weak [102], [103], and searching for Pareto solutions [104], [105], source classifier [78]. For example, in multi-adversarial DA etc. [75], the soft pseudo-label of a target sample is used to As a concrete example of multi-task learning, multilin- indicate how much this sample should be emphasized by gual models have demonstrated success in processing tens different class-specific domain discriminators. In deep unsu- or even hundreds of languages simultaneously [106]–[108]. et al. pervised DA, Ge [76] introduced a soft softmax-triplet However, not all languages can benefit from this training loss based on the soft pseudo-labels, which outperformed paradigm. Studies [109] have revealed NT in multilingual hard labeling. models, especially for high-resource languages [108]. 
Pos- Selective pseudo-labeling is another strategy to enhance sible remedies include parameter soft-sharing [110], meta- target prediction. Its main motivation is to select the unla- learning [109], and gradient vaccine [96]. beled samples with high confidence as the training targets. For instance, Gui et al. [77] developed an approach to predict when NT would occur. They identified and removed the 8.2 Lifelong Learning noisy samples in the target domain to reduce class noise ac- Lifelong learning learns a series of tasks in a sequential cumulation in future training iterations. Wang and Breckon order, without revisiting previously seen data. While the [78] proposed selective pseudo-labeling to progressively goal is to master all tasks in a single model, there are two IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. -, NO. -, 2021 9 key challenges, which may lead to NT. First, the model may similarity is high; choose a NT mitigation approach forget earlier knowledge when trained on new tasks, known if the similarity is medium; or, use distant transfer or as catastrophic forgetting [111], [112]. Second, transferring no transfer at all if the similarity is low. from early tasks may hurt the performance in later tasks. Ex- • If the domain similarity cannot be estimated, then isting literature mainly studies how to mitigate catastrophic secure transfer may be used. forgetting using regularization [113], [114], memory replay [115]–[117], parameter isolation [118], [119], etc., whereas The following directions may be considered in future forward NT in lifelong learning is less investigated [120]. research: 1) Develop secure transfer approaches that correspond to 8.3 Adversarial Attacks popular TL paradigms such as unsupervised DA, few shot TL, and adversarial TL. Adversarial attack aims at learning perturbations on the 2) Investigate NT mitigation approaches for regression training data or models, and then sabotaging the test per- problems. 
Currently most such approaches are for clas- formances. Researchers found that adversarial examples sification problems. have high transferability, and the unsecured source data or 3) Ensure positive transfer in challenging open environ- models highly affect the target learning performance. As ments, which may include continual data stream, het- a result, the target model performs poorly, and NT may erogeneous features, private sources, unclear domain occur. For example, the white-box teacher model, black- boundaries, unseen/unknown categories, etc. box student model and TL parameters can be affected by 4) Design theoretical tests or empirical procedures to iden- evasion attacks [121]–[123], and source data by backdoor tify the exact factors leading to NT in a specific appli- attacks [124]–[126]. cation, which can help us choose the most appropriate approach to mitigate NT. 9 METHOD COMPARISON According to the scheme shown in Fig. 2, we have intro- duced secure transfer, domain similarity estimation, distant ACKNOWLEDGMENT transfer, and tens of NT mitigation approaches. To see the This work was supported by the Hubei Province Funds for forest for the trees, we compare their characteristics and Distinguished Young Scholars under Grant 2020CFA050, the differences in Table 5. Technology Innovation Project of Hubei Province of China Several observations can be made from Table 5: under Grant 2019AEA171, the National Natural Science 1) Most current works in the NT literature focus on NT Foundation of China under Grants 61873321 and U1913207, mitigation and domain similarity estimation. and the International Science and Technology Coopera- 2) NT mitigation research mainly focuses on data transfer- tion Program of China under Grant 2017YFE0128300. The ability enhancement. authors would also like to thank Mr. Zirui Wang of the 3) The research on overcoming NT spreads across many Carnegie Mellon University for insightful discussions. 
different categories of TL, indicating that it has received extensive attention in TL. 4) Most secure transfer strategies are based on model REFERENCES adaptation. 5) Most NT mitigation approaches consider two or more [1] D. Wu, Y. Xu, and B.-L. Lu, “Transfer learning for EEG-based factors of NT (domain divergence, transfer algorithm, brain-computer interfaces: A review of progress made since 2016,” IEEE Trans. on Cognitive and Developmental Systems, 2021, source data quality, and target data quality). in press. [2] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, ONCLUSIONSAND UTURE ESEARCH 10 C F R 2009. TL utilizes data or knowledge from one or more source [3] L. Zhang and X. Gao, “Transfer adaptation learning: A decade domains to facilitate the learning in a target domain, which survey,” arXiv preprint arXiv:1903.04687, 2019. [4] Z. Chen and M. W. Daehler, “Positive and negative transfer is particularly useful when the target domain has very in analogical problem solving by 6-year-old children,” Cognitive few or no labeled data. NT is undesirable in TL, and has Development, vol. 4, no. 4, pp. 327–344, 1989. been attracting increasing research interest recently. This [5] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adap- paper has systematically categorized and reviewed about tation via transfer component analysis,” IEEE Trans. on Neural Networks, vol. 22, no. 2, pp. 199–210, 2011. fifty representative approaches for handling NT, from four [6] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer feature perspectives: secure transfer, domain similarity estimation, learning with joint distribution adaptation,” in Proc. IEEE Int’l distant transfer, and NT mitigation. Besides, some funda- Conf. on , Sydney, Australia, Dec. 2013, pp. 2200– 2207. mental concepts, e.g., the definition of NT, the factors of [7] W. Dai, Q. Yang, G.-R. Xue, and Y. 
Yu, “Boosting for transfer NT, and related fields of NT, are also introduced. To our learning,” in Proc. 24th Int’l Conf. on Machine learning, Corvallis, knowledge, this is the first comprehensive survey on NT. OR, Jun. 2007, pp. 193–200. We suggest the following guidelines in coping with NT: [8] J. Zhang, W. Li, and P. Ogunbona, “Joint geometrical and sta- tistical alignment for visual domain adaptation,” in Proc. IEEE • If the domain similarity can be estimated, then we Conf. on Computer Vision and Pattern Recognition, Honolulu, HI, can choose different strategies according to the simi- Jul. 2017, pp. 1859–1867. [9] D. Wu, “Online and offline domain adaptation for reducing BCI larity: directly concatenate the source and target do- calibration effort,” IEEE Trans. on Human-Machine Systems, vol. 47, main data and train a machine learning model if the no. 4, pp. 550–563, 2017. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. -, NO. -, 2021 10

TABLE 5
Comparison of approaches for mitigating NT. “TL Category” groups all methods into five categories [3]: instance adaptation, feature adaptation, model adaptation, deep TL, and adversarial TL. “Ability to overcome NT” has three levels: guaranteed (★★★), very probable (★★), and possible (★). “Factors of NT” includes four elements mentioned in Section 2.4: domain divergence (D), transfer algorithm (A), source data quality (S), and target data quality (T).

| Strategy | Reference | Approach | TL Category | Ability to Overcome NT | Factors of NT |
|---|---|---|---|---|---|
| Domain Similarity Estimation | [30] | MMD | Feature adaptation | ★★ | D, S |
| | [33] | ROD | Feature adaptation | ★★ | S |
| | [31] | cTL | Feature adaptation | ★★ | D, A, S |
| | [32] | DTE | Feature adaptation | ★★ | S |
| | [34] | WTL | Model adaptation | ★★ | D, A, S |
| | [35] | | Instance adaptation | ★★ | D, S |
| | [36] | Q-statistic | Model adaptation | ★★ | D, S |
| | [37] | A-distance | Instance adaptation | ★★ | D, S |
| | [38] | DCTN | Deep TL/Feature adaptation | ★★ | D, A, S |
| | [39] | TransLATE | Adversarial TL | ★★ | D, A, S, T |
| | [40] | MSC | Deep TL | ★★ | S |
| | [41] | NCE | Deep TL/Model adaptation | ★★ | A, S |
| | [42] | | Deep TL/Model adaptation | ★★ | A, S |
| | [18] | LEEP | Deep TL/Model adaptation | ★★ | A, S |
| | [43] | TransRate | Deep TL/Model adaptation | ★★ | A, S |
| Secure Transfer | [24] | AT-GP | Model adaptation | ★★★ | A, T |
| | [25] | | Deep TL | ★★★ | A, T |
| | [17] | SAFEW | Model adaptation | ★★★ | A, T |
| | [26] | | Model adaptation | ★★★ | A, T |
| | [27] | PTL | Model adaptation | ★★★ | A, T |
| | [28] | | Model adaptation | ★★★ | A, T |
| Distant Transfer | [53] | TTL | Instance adaptation | ★ | D, A, S |
| | [54] | DDTL | Instance adaptation | ★ | D, A, S |
| | [55] | DFF | Deep TL/Feature adaptation | ★ | D, A, S |
| | [56] | | Feature adaptation | ★ | A, S |
| NT Mitigation: Data Transferability Enhancement | [57] | PW-MSTL | Instance adaptation | ★★ | A, S |
| | [58] | ABMSDA | Deep TL/Feature adaptation | ★★ | D, A, S |
| | [59] | DECISION | Deep TL/Model adaptation | ★★ | D, A, S, T |
| | [60] | PDM | Instance adaptation | ★★ | A, S |
| | [61] | AwAR | Feature adaptation | ★★ | D, A, T |
| | [62] | MCTML | Instance adaptation | ★★ | D, A, S |
| | [63] | ATL | Feature adaptation | ★★ | D, A, S |
| | [15] | GATE | Adversarial TL | ★★ | D, A, S |
| | [64] | DTL | Feature adaptation | ★★ | D, A, S |
| | [65] | TBT | Feature adaptation | ★★ | D, A, S |
| | [66] | | Feature adaptation | ★★ | D, A, S |
| | [67] | DTL | Feature adaptation | ★★ | D, A, S |
| | [68] | BSS | Deep TL/Model adaptation | ★★ | A, S |
| | [69] | BSP | Adversarial TL | ★★ | D, A, S |
| | [70] | HTCN | Adversarial TL | ★★ | D, A, S |
| NT Mitigation: Model Transferability Enhancement | [71] | TransNorm | Deep TL | ★★ | A |
| | [72] | AT | Deep TL | ★★ | A |
| | [73] | | Deep TL | ★★ | A |
| | [74] | | Deep TL | ★★ | A |
| NT Mitigation: Target Prediction Enhancement | [75] | MADA | Adversarial TL | ★★ | D, A, T |
| | [76] | MMT | Deep TL/Model adaptation | ★★ | D, A, T |
| | [77] | NTD | Instance adaptation | ★★ | D, A, T |
| | [78] | SPL | Feature adaptation | ★★ | D, A, T |
| | [79] | PACET | Feature adaptation | ★★ | D, A, T |
| | [20] | SHOT | Deep TL/Model adaptation | ★★ | D, A, T |
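To make one row of Table 5 concrete, the sketch below walks through the cluster enhanced pseudo-labeling step listed as SHOT [20], i.e., the weighted class means of Eq. (29) followed by the nearest-centroid update of Eq. (30). It is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the function names (`class_centroids`, `nearest_centroid`, `refine_pseudo_labels`), the toy features and the toy soft-max outputs are all hypothetical, and the squared Euclidean distance is used only because that is how Eq. (30) is written in this survey.

```python
from typing import List, Optional, Sequence

Vector = Sequence[float]


def class_centroids(features: Sequence[Vector],
                    weights: Sequence[Vector],
                    previous: Optional[List[List[float]]] = None) -> List[List[float]]:
    """Eq. (29): weighted mean of the target features for every class k.

    weights[t][k] plays the role of the k-th soft-max output for sample t.
    If a class receives zero total weight, its previous centroid is reused
    (an assumption of this sketch, not something stated in the survey).
    """
    n_classes, dim = len(weights[0]), len(features[0])
    centroids = []
    for k in range(n_classes):
        total = sum(w[k] for w in weights)
        if total == 0.0:
            if previous is None:
                raise ValueError(f"class {k} has zero total weight")
            centroids.append(list(previous[k]))
            continue
        centroids.append([
            sum(w[k] * f[d] for f, w in zip(features, weights)) / total
            for d in range(dim)
        ])
    return centroids


def nearest_centroid(feature: Vector, centroids: Sequence[Vector]) -> int:
    """Eq. (30): index of the centroid with the smallest squared L2 distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(feature, c)) for c in centroids]
    return dists.index(min(dists))


def refine_pseudo_labels(features: Sequence[Vector],
                         soft_outputs: Sequence[Vector],
                         n_rounds: int = 2) -> List[int]:
    """Alternate Eq. (29) and Eq. (30); rounds after the first use one-hot weights."""
    n_classes = len(soft_outputs[0])
    weights: Sequence[Vector] = soft_outputs
    centroids: Optional[List[List[float]]] = None
    labels: List[int] = []
    for _ in range(n_rounds):
        centroids = class_centroids(features, weights, centroids)
        labels = [nearest_centroid(f, centroids) for f in features]
        weights = [[1.0 if k == y else 0.0 for k in range(n_classes)]
                   for y in labels]
    return labels


# Hypothetical toy target data: two well-separated clusters in 2-D.
features = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
            [3.0, 3.0], [2.8, 3.1], [3.2, 2.9]]
# Hypothetical soft-max outputs of a source model; the third sample is
# assigned to class 1 although it lies in the first cluster.
soft_outputs = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6],
                [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]

labels = refine_pseudo_labels(features, soft_outputs)
print(labels)
```

In this toy run the third sample, which the hypothetical source model assigns to class 1, is moved back to class 0 because its feature lies next to the class-0 centroid. A practical version would operate on network features in batches and could use a different distance.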

[10] M. Ghifary, W. B. Kleijn, and M. Zhang, “Domain adaptive neural networks for object recognition,” in Proc. Pacific Rim Int’l Conf. on Artificial Intelligence, Queensland, Australia, Jun. 2014, pp. 898–904.
[11] M. Long, Y. Cao, J. Wang, and M. Jordan, “Learning transferable features with deep adaptation networks,” in Proc. 32nd Int’l Conf. on Machine Learning, Lille, France, Jul. 2015, pp. 97–105.
[12] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
[13] H. Tang and K. Jia, “Discriminative adversarial domain adaptation,” in Proc. 34th AAAI Conf. on Artificial Intelligence, New York, NY, Feb. 2020, pp. 5940–5947.
[14] M. T. Rosenstein, Z. Marx, L. P. Kaelbling, and T. G. Dietterich, “To transfer or not to transfer,” in Proc. NIPS 2005 Workshop on Transfer Learning, vol. 898, Vancouver, Canada, May 2005, pp. 1–4.
[15] Z. Wang, Z. Dai, B. Póczos, and J. Carbonell, “Characterizing and avoiding negative transfer,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Long Beach, CA, Jun. 2019, pp. 11 293–11 302.
[16] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu, “A survey on deep transfer learning,” in Proc. Int’l Conf. on Artificial Neural Networks, Rhodes, Greece, Oct. 2018, pp. 270–279.
[17] Y.-F. Li, L.-Z. Guo, and Z.-H. Zhou, “Towards safe weakly supervised learning,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 334–346, 2019.
[18] C. Nguyen, T. Hassner, C. Archambeau, and M. Seeger, “LEEP: A new measure to evaluate transferability of learned representations,” in Proc. 37th Int’l Conf. on Machine Learning, Vienna, Austria, Jul. 2020, pp. 5640–5651.
[19] P. Agrawal, R. Girshick, and J. Malik, “Analyzing the performance of multilayer neural networks for object recognition,” in Proc. European Conf. on Computer Vision, Zurich, Switzerland, Sep. 2014, pp. 329–344.
[20] J. Liang, D. Hu, and J. Feng, “Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation,” in Proc. Int’l Conf. on Machine Learning, Vienna, Austria, Jul. 2020, pp. 6028–6039.
[21] Q. Yang, Y. Zhang, W. Dai, and S. J. Pan, Transfer learning. Cambridge, UK: Cambridge University Press, 2020.
[22] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, “A comprehensive survey on transfer learning,” Proceedings of the IEEE, vol. 109, no. 1, pp. 43–76, 2020.
[23] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, “A theory of learning from different domains,” Machine Learning, vol. 79, no. 1-2, pp. 151–175, 2010.
[24] B. Cao, S. J. Pan, Y. Zhang, D.-Y. Yeung, and Q. Yang, “Adaptive transfer learning,” in Proc. 24th AAAI Conf. on Artificial Intelligence, vol. 2, no. 5, Atlanta, GA, Jul. 2010, p. 7.
[25] M. Abdullah Jamal, H. Li, and B. Gong, “Deep face detector adaptation without negative transfer or catastrophic forgetting,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Salt Lake City, Utah, Jun. 2018, pp. 5608–5618.
[26] I. Kuzborskij and F. Orabona, “Stability and hypothesis transfer learning,” in Proc. 30th Int’l Conf. on Machine Learning, Atlanta, GA, Jun. 2013, pp. 942–950.
[27] H. Yoon and J. Li, “A novel positive transfer learning approach for telemonitoring of Parkinson’s disease,” IEEE Trans. on Automation Science and Engineering, vol. 16, no. 1, pp. 180–191, 2018.
[28] M. J. Sorocky, S. Zhou, and A. P. Schoellig, “To share or not to share? Performance guarantees and the asymmetric nature of cross-robot experience transfer,” IEEE Control Systems Letters, vol. 5, no. 3, pp. 923–928, 2021.
[29] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY: Springer, 2006.
[30] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” Journal of Machine Learning Research, vol. 13, no. 3, pp. 723–773, 2012.
[31] Y.-P. Lin and T.-P. Jung, “Improving EEG-based emotion classification using conditional transfer learning,” Frontiers in Human Neuroscience, vol. 11, p. 334, 2017.
[32] W. Zhang and D. Wu, “Manifold embedded knowledge transfer for brain-computer interfaces,” IEEE Trans. on Neural Systems and Rehabilitation Engineering, vol. 28, no. 5, pp. 1117–1127, 2020.
[33] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsupervised domain adaptation,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Providence, RI, Jun. 2012, pp. 2066–2073.
[34] A. M. Azab, L. Mihaylova, K. K. Ang, and M. Arvaneh, “Weighted transfer learning for improving motor imagery-based brain-computer interface,” IEEE Trans. on Neural Systems and Rehabilitation Engineering, vol. 27, no. 7, pp. 1352–1359, 2019.
[35] Y. Yao and G. Doretto, “Boosting for transfer learning with multiple sources,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, San Francisco, CA, Jun. 2010, pp. 1855–1862.
[36] G. Xie, Y. Sun, M. Lin, and K. Tang, “A selective transfer learning method for concept drift adaptation,” in Proc. Int’l Symposium on Neural Networks, Hokkaido, Japan, Jun. 2017, pp. 353–361.
[37] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, “Analysis of representations for domain adaptation,” in Proc. Advances in Neural Information Processing Systems, Vancouver, Canada, Dec. 2007, pp. 137–144.
[38] R. Xu, Z. Chen, W. Zuo, J. Yan, and L. Lin, “Deep cocktail network: Multi-source unsupervised domain adaptation with category shift,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Salt Lake City, Utah, Jun. 2018, pp. 3964–3973.
[39] J. Wu and J. He, “Continuous transfer learning with label-informed distribution alignment,” arXiv preprint arXiv:2006.03230, 2020.
[40] A. Meiseles and L. Rokach, “Source model selection for deep learning in the time series domain,” IEEE Access, vol. 8, pp. 6190–6200, 2020.
[41] A. T. Tran, C. V. Nguyen, and T. Hassner, “Transferability and hardness of supervised classification tasks,” in Proc. IEEE Int’l Conf. on Computer Vision, Seoul, Korea, Nov. 2019, pp. 1395–1405.
[42] M. J. Afridi, A. Ross, and E. M. Shapiro, “On automated source selection for transfer learning in convolutional neural networks,” Pattern Recognition, vol. 73, pp. 65–75, 2018.
[43] L.-K. Huang, Y. Wei, Y. Rong, Q. Yang, and J. Huang, “Frustratingly easy transferability estimation,” arXiv preprint arXiv:2106.09362, 2021.
[44] W. Zhang and D. Wu, “Discriminative joint probability maximum mean discrepancy (DJP-MMD) for domain adaptation,” in Proc. Int’l Joint Conf. on Neural Networks, Glasgow, UK, Jul. 2020.
[45] S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.
[46] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf, “Measuring statistical dependence with Hilbert-Schmidt norms,” in Proc. Int’l Conf. on Algorithmic Learning Theory, Padova, Italy, Oct. 2005, pp. 63–77.
[47] S. Si, D. Tao, and B. Geng, “Bregman divergence-based regularization for transfer subspace learning,” IEEE Trans. on Knowledge and Data Engineering, vol. 22, no. 7, pp. 929–942, 2009.
[48] J. Shen, Y. Qu, W. Zhang, and Y. Yu, “Wasserstein distance guided representation learning for domain adaptation,” in Proc. 32nd AAAI Conf. on Artificial Intelligence, New Orleans, LA, Feb. 2018, pp. 4058–4065.
[49] Y. Sun, K. Tang, Z. Zhu, and X. Yao, “Concept drift adaptation by exploiting historical knowledge,” IEEE Trans. on Neural Networks and Learning Systems, vol. 29, no. 10, pp. 4822–4832, 2018.
[50] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “DeCAF: A deep convolutional activation feature for generic visual recognition,” in Proc. 31st Int’l Conf. on Machine Learning, Beijing, China, Jun. 2014, pp. 647–655.
[51] P. Agrawal, R. Girshick, and J. Malik, “Analyzing the performance of multilayer neural networks for object recognition,” in Proc. European Conf. on Computer Vision, Zurich, Switzerland, 2014, pp. 329–344.
[52] P. J. Rousseeuw, “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,” Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[53] B. Tan, Y. Song, E. Zhong, and Q. Yang, “Transitive transfer learning,” in Proc. 21st ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, Sydney, Australia, Aug. 2015, pp. 1155–1164.
[54] B. Tan, Y. Zhang, S. J. Pan, and Q. Yang, “Distant domain transfer learning,” in Proc. 31st AAAI Conf. on Artificial Intelligence, San Francisco, CA, Feb. 2017, pp. 2604–2610.
[55] S. Niu, M. Liu, Y. Liu, J. Wang, and H. Song, “Distant domain transfer learning for medical imaging,” IEEE Journal of Biomedical and Health Informatics, 2021, early access.
[56] M. Xie, N. Jean, M. Burke, D. Lobell, and S. Ermon, “Transfer learning from deep features for remote sensing and poverty mapping,” in Proc. 30th AAAI Conf. on Artificial Intelligence, no. 7, Phoenix, AZ, Feb. 2016, pp. 3929–3935.
[57] Z. Wang and J. Carbonell, “Towards more reliable transfer learning,” in Proc. Joint European Conf. on Machine Learning and Knowledge Discovery in Databases, Dublin, Ireland, 2018, pp. 794–810.
[58] Y. Zuo, H. Yao, and C. Xu, “Attention-based multi-source domain adaptation,” IEEE Trans. on Image Processing, vol. 30, pp. 3793–3803, 2021.
[59] S. M. Ahmed, D. S. Raychaudhuri, S. Paul, S. Oymak, and A. K. Roy-Chowdhury, “Unsupervised multi-source domain adaptation without access to source data,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Nashville, TN, Jun. 2021.
[60] C.-W. Seah, Y.-S. Ong, and I. W. Tsang, “Combating negative transfer from predictive distribution differences,” IEEE Trans. on Cybernetics, vol. 43, no. 4, pp. 1153–1165, 2012.
[61] D. Wu, V. J. Lawhern, W. D. Hairston, and B. J. Lance, “Switching EEG headsets made easy: Reducing offline calibration effort using active weighted adaptation regularization,” IEEE Trans. on Neural Systems and Rehabilitation Engineering, vol. 24, no. 11, pp. 1125–1137, 2016.
[62] Y. Xu, H. Yu, Y. Yan, Y. Liu et al., “Multi-component transfer metric learning for handling unrelated source domain samples,” Knowledge-Based Systems, p. 106132, 2020.
[63] Z. Peng, W. Zhang, N. Han, X. Fang, P. Kang, and L. Teng, “Active transfer learning,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 30, no. 4, pp. 1022–1036, 2020.
[64] M. Long, J. Wang, G. Ding, W. Cheng, X. Zhang, and W. Wang, “Dual transfer learning,” in Proc. 2012 SIAM Int’l Conf. on Data Mining, Brussels, Belgium, Dec. 2012, pp. 540–551.
[65] J. Shi, M. Long, Q. Liu, G. Ding, and J. Wang, “Twin bridge transfer learning for sparse collaborative filtering,” in Proc. Pacific-Asia Conf. on Knowledge Discovery and Data Mining, Gold Coast, Australia, Apr. 2013, pp. 496–507.
[66] Y. Xu, X. Fang, J. Wu, X. Li, and D. Zhang, “Discriminative transfer subspace learning via low-rank and sparse representation,” IEEE Trans. on Image Processing, vol. 25, no. 2, pp. 850–863, 2015.
[67] M. Rajesh and J. Gnanasekar, “Annoyed realm outlook taxonomy using twin transfer learning,” Int’l Journal of Pure and Applied Mathematics, vol. 116, no. 21, pp. 549–558, 2017.
[68] X. Chen, S. Wang, B. Fu, M. Long, and J. Wang, “Catastrophic forgetting meets negative transfer: Batch spectral shrinkage for safe transfer learning,” in Proc. Advances in Neural Information Processing Systems, Vancouver, Canada, Dec. 2019, pp. 1908–1918.
[69] X. Chen, S. Wang, M. Long, and J. Wang, “Transferability vs. discriminability: Batch spectral penalization for adversarial domain adaptation,” in Proc. 36th Int’l Conf. on Machine Learning, Long Beach, CA, Jun. 2019, pp. 1081–1090.
[70] C. Chen, Z. Zheng, X. Ding, Y. Huang, and Q. Dou, “Harmonizing transferability and discriminability for adapting object detectors,” in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition, Seattle, WA, Jun. 2020, pp. 8869–8878.
[71] X. Wang, Y. Jin, M. Long, J. Wang, and M. I. Jordan, “Transferable normalization: Towards improving transferability of deep neural networks,” in Proc. Advances in Neural Information Processing Systems, Vancouver, Canada, Dec. 2019, pp. 1953–1963.
[72] K. Liang, J. Y. Zhang, O. Koyejo, and B. Li, “Does adversarial transferability indicate knowledge transferability?” in Proc. 38th Int’l Conf. on Machine Learning, 2021.
[73] H. Salman, A. Ilyas, L. Engstrom, A. Kapoor, and A. Madry, “Do adversarially robust ImageNet models transfer better?” Proc. Advances in Neural Information Processing Systems, Dec. 2020.
[74] Z. Deng, L. Zhang, K. Vodrahalli, K. Kawaguchi, and J. Zou, “Adversarial training helps transfer learning via better representations,” arXiv preprint arXiv:2106.10189, 2021.
[75] Z. Pei, Z. Cao, M. Long, and J. Wang, “Multi-adversarial domain adaptation,” in Proc. 32nd AAAI Conf. on Artificial Intelligence, New Orleans, LA, Feb. 2018.
[76] Y. Ge, D. Chen, and H. Li, “Mutual mean-teaching: Pseudo

the AAAI Conf. on Artificial Intelligence, vol. 34, no. 04, New York, NY, Feb. 2020, pp. 6243–6250.
[79] J. Liang, R. He, Z. Sun, and T. Tan, “Exploring uncertainty in pseudo-label guided unsupervised domain adaptation,” Pattern Recognition, vol. 96, p. 106996, 2019.
[80] E. Eaton et al., “Selective transfer between learning tasks using task-based boosting,” in Proc. 25th AAAI Conf. on Artificial Intelligence, San Francisco, CA, Aug. 2011, pp. 337–342.
[81] Y.-L. Yu and C. Szepesvári, “Analysis of kernel mean matching under covariate shift,” in Proc. 29th Int’l Conf. on Machine Learning, Edinburgh, Scotland, Jun. 2012, pp. 1147–1154.
[82] D. Wu, “Pool-based sequential active learning for regression,” IEEE Trans. on Neural Networks and Learning Systems, vol. 30, no. 5, pp. 1348–1359, 2019.
[83] D. Wu, C.-T. Lin, and J. Huang, “Active learning for regression using greedy sampling,” Information Sciences, vol. 474, pp. 90–105, 2019.
[84] Z. Peng, Y. Jia, and J. Hou, “Non-negative transfer learning with consistent inter-domain distribution,” IEEE Signal Processing Letters, vol. 27, pp. 1720–1724, 2020.
[85] D. Wu, B. J. Lance, and V. J. Lawhern, “Transfer learning and active transfer learning for reducing calibration data in single-trial classification of visually-evoked potentials,” in Proc. IEEE Int’l Conf. on Systems, Man, and Cybernetics, San Diego, CA, Oct. 2014.
[86] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Proc. Advances in Neural Information Processing Systems, Montréal, Canada, Dec. 2014, pp. 3320–3328.
[87] J. Chen, F. Lécué, J. Z. Pan, I. Horrocks, and H. Chen, “Knowledge-based transfer learning explanation,” in Proc. 16th Int’l Conf. on Principles of Knowledge Representation and Reasoning, Tempe, AZ, Oct. 2018, pp. 349–358.
[88] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. 32nd Int’l Conf. on Machine Learning, Lille, France, Jul. 2015, pp. 448–456.
[89] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in Proc. Int’l Conf. on Learning Representations, Vancouver, Canada, Apr. 2018.
[90] X. Zhang, D. Wu, L. Ding, H. Luo, C.-T. Lin, T.-P. Jung, and R. Chavarriaga, “Tiny noise, big mistakes: Adversarial perturbations induce errors in brain-computer interface spellers,” National Science Review, vol. 8, no. 4, 2021.
[91] Z. Liu, L. Meng, X. Zhang, W. Fang, and D. Wu, “Universal adversarial perturbations for CNN classifiers in EEG-based BCIs,” Journal of Neural Engineering, vol. 8, p. 0460a4, 2021.
[92] D.-H. Lee et al., “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” in Workshop on Challenges in Representation Learning, ICML, vol. 3, no. 2, Atlanta, GA, Jun. 2013, p. 896.
[93] S. Ruder, “An overview of multi-task learning in deep neural networks,” arXiv preprint arXiv:1706.05098, 2017.
[94] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, “Continual lifelong learning with neural networks: A review,” Neural Networks, vol. 113, pp. 54–71, 2019.
[95] T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task learning,” Proc. Advances in Neural Information Processing Systems, Dec. 2020.
[96] Z. Wang, Y. Tsvetkov, O. Firat, and Y. Cao, “Gradient vaccine: Investigating and improving multi-task optimization in massively multilingual models,” in Proc. Int’l Conf. on Learning Representations, Vienna, Austria, Apr. 2021.
[97] Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich, “Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks,” in Proc. Int’l Conf. on Machine Learning, Stockholm, Sweden, Jul. 2018, pp. 794–803.
label refinery for unsupervised domain adaptation on person [98] A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using re-identification,” in Proc. Int’l Conf.on Learning Representations, uncertainty to weigh losses for scene geometry and semantics,” Addis Ababa, Ethiopia, Apr. 2020. in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Salt [77] L. Gui, R. Xu, Q. Lu, J. Du, and Y. Zhou, “Negative transfer de- Lake City, Utah, Jun. 2018, pp. 7482–7491. tection in transductive transfer learning,” Int’l Journal of Machine [99] S. Liu, Y. Liang, and A. Gitter, “Loss-balanced task weighting Learning and Cybernetics, vol. 9, no. 2, pp. 185–197, 2018. to reduce negative transfer in multi-task learning,” in Proc. 33rd [78] Q. Wang and T. Breckon, “Unsupervised domain adaptation via AAAI Conf. on Artificial Intelligence, vol. 33, Honolulu, HI, Jan. structured prediction based selective pseudo-labeling,” in Proc. of 2019, pp. 9977–9978. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. -, NO. -, 2021 13

[100] Y. Zhang and D.-Y. Yeung, “A convex formulation for learning task relationships in multi-task learning,” arXiv preprint arXiv:1203.3536, 2012.
[101] C. Shui, M. Abbasi, L.-É. Robitaille, B. Wang, and C. Gagné, “A principled approach for learning task similarity in multitask learning,” in Proc. 28th Int’l Joint Conf. on Artificial Intelligence, Macao, China, Aug. 2019, pp. 3446–3452.
[102] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,” arXiv preprint arXiv:1606.04671, 2016.
[103] C. Rosenbaum, I. Cases, M. Riemer, and T. Klinger, “Routing networks and the challenges of modular and compositional computation,” arXiv preprint arXiv:1904.12774, 2019.
[104] O. Sener and V. Koltun, “Multi-task learning as multi-objective optimization,” in Proc. Advances in Neural Information Processing Systems, Montréal, Canada, Dec. 2018, pp. 527–538.
[105] X. Lin, H.-L. Zhen, Z. Li, Q.-F. Zhang, and S. Kwong, “Pareto multi-task learning,” in Proc. Advances in Neural Information Processing Systems, Vancouver, Canada, Dec. 2019, pp. 12 060–12 070.
[106] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[107] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” in Proc. 58th Annual Meeting of the Association for Computational Linguistics, Seattle, WA, Jul. 2020.
[108] N. Arivazhagan, A. Bapna, O. Firat, D. Lepikhin, M. Johnson, M. Krikun, M. X. Chen, Y. Cao, G. Foster, C. Cherry et al., “Massively multilingual neural machine translation in the wild: Findings and challenges,” arXiv preprint arXiv:1907.05019, 2019.
[109] Z. Wang, Z. C. Lipton, and Y. Tsvetkov, “On negative interference in multilingual models: Findings and a meta-learning treatment,” in Proc. 2020 Conf. on Empirical Methods in Natural Language Processing, Dominican Republic, Nov. 2020.
[110] J. Guo, D. J. Shah, and R. Barzilay, “Multi-source domain adaptation with mixture of experts,” in Proc. 2018 Conf. on Empirical Methods in Natural Language Processing, Brussels, Belgium, Nov. 2018, pp. 4694–4703.
[111] M. McCloskey and N. J. Cohen, “Catastrophic interference in connectionist networks: The sequential learning problem,” in Psychology of Learning and Motivation. Elsevier, 1989, vol. 24, pp. 109–165.
[112] T. Doan, M. A. Bennani, B. Mazoure, G. Rabusseau, and P. Alquier, “A theoretical analysis of catastrophic forgetting through the NTK overlap matrix,” in Proc. Int’l Conf. on Artificial Intelligence and Statistics, San Diego, CA, 2021, pp. 1072–1080.
[113] A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny, “Efficient lifelong learning with A-GEM,” in Proc. Int’l Conf. on Learning Representations, New Orleans, LA, May 2019.
[114] P. Sprechmann, S. Jayakumar, J. Rae, A. Pritzel, A. P. Badia, B. Uria, O. Vinyals, D. Hassabis, R. Pascanu, and C. Blundell, “Memory-based parameter adaptation,” in Proc. Int’l Conf. on Learning Representations, Vancouver, Canada, May 2018.
[115] D. Lopez-Paz and M. Ranzato, “Gradient episodic memory for continual learning,” in Proc. Advances in Neural Information Processing Systems, Long Beach, CA, Dec. 2017, pp. 6467–6476.
[116] D. Rolnick, A. Ahuja, J. Schwarz, T. Lillicrap, and G. Wayne, “Experience replay for continual learning,” in Proc. Advances in Neural Information Processing Systems, Vancouver, Canada, Dec. 2019, pp. 348–358.
[117] C. d. M. d’Autume, S. Ruder, L. Kong, and D. Yogatama, “Episodic memory in lifelong language learning,” in Proc. Advances in Neural Information Processing Systems, Vancouver, Canada, Dec. 2019.
[118] A. Mallya and S. Lazebnik, “PackNet: Adding multiple tasks to a single network by iterative pruning,” in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition, Salt Lake City, UT, Jun. 2018, pp. 7765–7773.
[119] J. Serra, D. Suris, M. Miron, and A. Karatzoglou, “Overcoming catastrophic forgetting with hard attention to the task,” in Proc. 35th Int’l Conf. on Machine Learning, Stockholm, Sweden, Jul. 2018.
[120] Z. Wang, S. V. Mehta, B. Póczos, and J. Carbonell, “Efficient meta lifelong-learning with limited memory,” in Proc. 2020 Conf. on Empirical Methods in Natural Language Processing, Dominican Republic, Nov. 2020.
[121] B. Wang, Y. Yao, B. Viswanath, H. Zheng, and B. Y. Zhao, “With great training comes great vulnerability: Practical attacks against transfer learning,” in Proc. 27th USENIX Security Symposium, Baltimore, MD, August 2018, pp. 1281–1297.
[122] S. Cheng, Y. Dong, T. Pang, H. Su, and J. Zhu, “Improving black-box adversarial attacks with a transfer-based prior,” in Proc. Advances in Neural Information Processing Systems, Vancouver, Canada, Dec. 2019.
[123] A. Abdelkader, M. J. Curry, L. Fowl, T. Goldstein, A. Schwarzschild, M. Shu, C. Studer, and C. Zhu, “Headless horseman: Adversarial attacks on transfer learning models,” in Proc. Int’l Conf. on Acoustics, Speech and Signal Processing, Barcelona, Spain, May 2020, pp. 3087–3091.
[124] S. Wang, S. Nepal, C. Rudolph, M. Grobler, S. Chen, and T. Chen, “Backdoor attacks against transfer learning with pre-trained deep learning models,” IEEE Trans. on Services Computing, 2020.
[125] L. Meng, J. Huang, Z. Zeng, X. Jiang, S. Yu, T.-P. Jung, C.-T. Lin, R. Chavarriaga, and D. Wu, “EEG-based brain-computer interfaces are vulnerable to backdoor attacks,” Engineering, 2021, submitted. [Online]. Available: https://www.researchsquare.com/article/rs-108085/v1
[126] X. Jiang, D. Wu, L. Meng, and S. Li, “Active poisoning: Efficient backdoor attacks to transfer learning based BCIs,” IEEE Trans. on Neural Systems and Rehabilitation Engineering, 2021, submitted.