
Transfer Adaptation Learning: A Decade Survey

Lei Zhang, Xinbo Gao

• L. Zhang is with the School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China. (E-mail: [email protected]).
• X. Gao is with the Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications, Chongqing 400065. (E-mail: [email protected]).

Abstract—The world we see is ever-changing: it changes with people, things, and the environment. A domain refers to the state of the world at a certain moment. A research problem is characterized as transfer adaptation learning (TAL) when it requires knowledge correspondence between different moments/domains. Conventional machine learning aims to find a model with the minimum expected risk on test data by minimizing the regularized empirical risk on the training data, which, however, supposes that the training and test data share a similar joint probability distribution. TAL aims to build models that can perform tasks in a target domain by learning knowledge from a semantically related but distribution-different source domain. It is an energetic research field of increasing influence and importance, currently showing an explosive publication trend. This paper surveys the advances of TAL methodologies in the past decade, and the technical challenges and essential problems of TAL are observed and discussed with deep insights and new perspectives. Broader solutions of transfer adaptation learning created by researchers are identified, i.e., instance re-weighting adaptation, feature adaptation, classifier adaptation, deep network adaptation and adversarial adaptation, which go beyond the early semi-supervised and unsupervised split. The survey helps researchers rapidly but comprehensively understand and identify the research foundation, research status, theoretical limitations, future challenges and under-studied issues (universality, interpretability, and credibility) to be addressed in the field toward universal representation and safe applications in open-world scenarios.

Index Terms—Transfer Learning, Domain Adaptation, Distribution Discrepancy

1 INTRODUCTION

Visual understanding of an image or video is a long-standing and challenging problem in computer vision. Visual classification, as a fundamental problem of visual understanding, aims to recognize what an image depicts. A solidified route of visual classification is to establish a learning model based on a well-labeled image dataset, which can be recognized as target data (task). However, labeling a large number of target samples is cost-ineffective, because it consumes a lot of human resources and time and becomes almost unrealistic for many domain-specific tasks in real applications. Therefore, leveraging another sufficiently labeled, distribution-different but semantically related source domain for recognizing target samples is becoming an increasingly important topic.

With the explosive increase of multi-source data from the Internet such as YouTube and Flickr, a large number of labeled web databases can be easily crawled. It is thus natural to consider training a learning model using multi-source web data for recognizing target data. However, a prevailing problem is that distribution mismatch and domain shift [1], [2] across source and target domains often exist owing to various factors such as resolution, illumination, viewpoint, background, weather condition, etc. in computer vision. Therefore, the classification performance on the target task is dramatically degraded when the source data for training the classifier has a different distribution from the target data. This is because the fundamental independent and identically distributed (i.i.d.) condition implied in statistical machine learning is no longer satisfied, which therefore promotes the emergence of transfer learning (TL) and domain adaptation (DA) [3], [4], [5]. Originally, the concept of transfer was first proposed in 1991 by Pratt et al. [6] and further introduced in [7] for transfer between neural networks. Mathematically, the earlier TL supposed different joint probability distributions, i.e., P(X_source, Y_source) ≠ P(X_target, Y_target), between source and target domains. DA supposed a different marginal distribution, i.e., P(X_source) ≠ P(X_target), but a similar category space between domains, i.e., P(Y_source|X_source) = P(Y_target|X_target). Currently, according to different settings of the label set between domains, a preliminary split into semi-supervised domain adaptation (SSDA), unsupervised domain adaptation (UDA), open-set domain adaptation (OSDA) and partial domain adaptation (PDA) is presented, in all of which the source labels are completely available. Notably, both SSDA and UDA assume the same category space C across domains (close-set), i.e., C_S = C_T. The former assumes a few labeled target samples can be used, while the target labels are completely unknown for the latter. Contrastively, OSDA and PDA assume different category spaces across domains, i.e., C_S ≠ C_T. In general, OSDA supposes C_S ⊂ C_T and PDA supposes C_T ⊂ C_S. In this survey, we aim to cover the recent progress from a pure methodological perspective beyond the conventional split of semi-supervised, unsupervised, open-set and partial DA tasks.

The earlier related reviews on transfer learning and domain adaptation can be referred to as [4], [15], [13], [12], [10], [11], [14], [9], [8], [16], which are summarized in Table 1. However, our concern is that these existing surveys can only reflect several aspects of the TL/DA community. With the recent explosive increase of transfer learning and domain adaptation models and algorithms, these surveys are unable to accurately characterize the status, challenges and future of this community from a more comprehensive perspective. Therefore, in this survey, we use a general name, Transfer Adaptation Learning (TAL), for unifying both TLs and DAs from a broader perspective, and discuss the technical advances, challenges and under-studied issues. In the past decade, especially the recent five years, TAL has been one of the most active areas in the machine learning community. Its goal is to narrow down the marginal and conditional distribution gap between source and target data, such that the labeled source data from one or more relevant domains can be utilized for executing different learning tasks in the target domain, as illustrated in Fig. 1. The cross-domain recognition, detection and segmentation tasks and challenges are highlighted.

TABLE 1
Summarization of Related Surveys in TL/DA

No.  Survey Title                                                                          Ref.   Year
1    A Survey on Transfer Learning                                                         [4]    2010
2    Extreme learning machine based transfer learning algorithms: A survey                 [8]    2017
3    A survey of transfer learning                                                         [9]    2016
4    Transfer Learning for Visual Categorization: A Survey                                 [10]   2015
5    A survey of transfer learning for collaborative recommendation with auxiliary data    [11]   2016
6    Visual Domain Adaptation: A survey of recent advances                                 [12]   2015
7    Transfer learning using computational intelligence: A survey                          [13]   2015
8    Deep visual domain adaptation: A survey                                               [14]   2015
9    Transfer learning for activity recognition: a survey                                  [15]   2013
10   Domain adaptation for visual applications: A comprehensive survey                     [16]   2017

Fig. 1. Cross-domain object detection, recognition and semantic segmentation. F denotes models of three tasks learned on source domain. The value of TAL depends on the performance gain of target tasks.

Moving forward, deep learning (DL) techniques [17], [18], [19], [20] have recently become dominant and powerful algorithms in feature representation and abstraction for image classification. In particular, the parameter adaptability and generality of DL models to other target data is worthy of praise: a pre-trained deep neural network can be fine-tuned using a small amount of labeled target data. Therefore, fine-tuning has become a commonly used strategy for training deep models and frameworks in various applications, such as object detection [21], [22], [23], [24], person re-identification [25], [26], [27], medical imaging [28], [29], [30], remote sensing [31], [32], [33], [34], etc. Generally, fine-tuning can be recognized as a prototype for bridging the sufficiently labeled big source data and the few/un-labeled target data [35], which also facilitates the research progress of visual transfer adaptation learning (VTAL) in computer vision. Conceptually, fine-tuning is an intuitive and data-driven transfer learning method, which depends on pre-trained models with a task-specific source database (e.g., ImageNet). The context of the transfer learning challenge and why the representation generated with pre-trained models is useful have been formally explored in [35]. Extensively, from the viewpoint of generative learning, the popular generative adversarial net (GAN) [36] and its variants [37], [38], [39], [40], [41], which aim to synthesize plausible images belonging to a target distribution from some source distribution (e.g., a noise signal), can also be recognized as generalized transfer learning techniques (e.g., style transfer tasks). Differently, conventional transfer learning approaches put emphasis on output knowledge adaptation (e.g., high-level features) across source and target domains, while GANs focus on input data adaptation (e.g., low-level image pixels) from the source distribution to the target distribution. Recently, image pixel-level transfer has been intensively studied in image-to-image translation [42], [43], [44], [45], [46], style transfer [47], [48], [49] and target face synthesis (e.g., pose transfer vs. age transfer) [50], [51], [52], [53], [54], etc.

In this survey, we focus on elaborating the technical advances and remaining challenges in model-driven transfer adaptation learning. Learning from multiple sources for transferring or adapting to new target domains offers the possibility of promoting model generalization and understanding the essence of biological learning. Notably, transfer adaptation learning is similar to but different from multi-task learning, which resorts to shared feature representations or classifiers for multiple different tasks (e.g., detection vs. retrieval) [55], [56] simultaneously.

In the past decade, a number of transfer learning and domain adaptation approaches have emerged by following the expected target error upper bounds of domain adaptation, which minimize a convex combination of the source empirical risk, the domain discrepancy and the joint error, as analyzed and justified in [57], [58]. Specifically, given a labeled source domain D_s, an unlabeled target domain D_t and a hypothesis h ∈ H (hypothesis set), according to the Ben-David theorem [57], the expected target risk ε_T(h) can be upper-bounded by the expected source risk ε_S(h), the domain discrepancy d_{H∆H}(D_s, D_t) and the joint error λ = min_{h∈H} [ε_S(h) + ε_T(h)]. Intuitively, domain adaptation can be achieved by modeling and simultaneously minimizing the upper bound of ε_T(h) from multiple different aspects, such as the source error ε_S, the domain discrepancy d(D_s, D_t) and the joint error λ under an ideal hypothesis h*.
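For reference, a standard statement of this bound (following Ben-David et al. [57]; the constant on the discrepancy term varies slightly across formulations) is:

ε_T(h) ≤ ε_S(h) + (1/2) d_{H∆H}(D_s, D_t) + λ,   with   λ = min_{h∈H} [ε_S(h) + ε_T(h)]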

Due to the endless emergence and variety of existing methods of visual transfer adaptation learning (VTAL), in this paper the remaining challenges and main-stream methodological advances in the VTAL community are identified and surveyed without further discussing the domain adaptation theory. Specifically, around the ultimate challenge of universal representation, and considering covariate shift, domain discrepancy and class discrimination, we propose five main-stream key technical categories of transfer adaptation learning for domain adaptive feature representations and classifiers, which go beyond the early split of semi-supervised and unsupervised (close-set) domain adaptation and the recent split of open-set and partial domain adaptation. The novel taxonomy of VTAL is summarized as follows.

• Instance Re-weighting Adaptation. Due to the probability distribution discrepancy across domains, it is natural to account for the difference by directly inferring the resampling weights of instances based on feature distribution matching in a non-parametric manner. Probability and Shannon entropy based re-weighting has become a common technique in UDA, OSDA and PDA problems. The adaptive parameter estimation of the weights under a parametric distribution assumption remains a challenge.

• Feature Adaptation. For adapting the data from multiple sources, a common feature subspace or representation in which the projected source and target domains have similar distributions is generally learned. However, learning a common subspace generally belongs to shallow methodology, which relies on pre-trained deep models for discriminative feature representation. It is challenging to unify the shallow model with the deep model. Also, the heterogeneity of data distributions makes it challenging to obtain such a generic feature.

• Classifier Adaptation. The classifier trained on instances of the source domain is often biased when recognizing instances from the target domain due to domain shifts and category space inconsistency. Learning a generalized classifier from the source domain that can be used for recognizing seen/partially seen/unseen object classes in another domain without bias is an essential but challenging topic.

• Deep Network Adaptation. Deep neural networks have been recognized for their strong feature representation power, yet a general deep model is built on a single domain. Large domain shift makes it challenging for deep neural network training to obtain transferrable deep representations. Additionally, the unavailability (uncertainty) of target labels (pseudo-labels) makes deep network adaptation still challenging due to negative transfer.

• Adversarial Adaptation. Adversarial learning originates from the generative adversarial nets and has become a general technique in VTAL. A basic objective of VTAL is to make the source and target domains closer in marginal distribution. Therefore, it amounts to confusing the two domains, such that they cannot be easily discriminated. However, due to the class mis-alignment problem, there is a technical challenge in class-conditional domain confusion by using adversarial training and gaming strategies.

For each challenge, the taxonomic classes and sub-classes are presented to structure the recent work in transfer adaptation learning. We start with a discussion of weakly-supervised learning perspectives in Section 2, which is followed by the technical advances of TAL, including instance re-weighting adaptation (Section 3), feature adaptation (Section 4), classifier adaptation (Section 5), deep network adaptation (Section 6), and adversarial adaptation (Section 7). The existing benchmarks and future challenging tasks are discussed in Section 8. Extension of the versatile TAL theory and algorithms to more vision tasks is presented in Section 9. A conclusion of the research status and an outlook of the field for further promoting the healthy development of TAL are provided in Section 10.

2 WEAKLY-SUPERVISED PERSPECTIVE

Considering that VTAL can be characterised as weakly-supervised learning, in this section we first present some weakly-supervised learning perspectives, including semi-supervised learning, active learning, zero/few-shot learning and open set recognition. These learning methodologies are similar to but essentially different from TAL, and are therefore briefly introduced to give a taste of them in this survey. Note that, in this section, unsupervised/self-supervised learning is not discussed, considering that the category label is completely unavailable, which is a different starting point from TAL. It must be mentioned that TAL does share essentially the same goal as unsupervised/self-supervised learning, i.e., learning generalized/universal representations.

The concept of weak learning originated 20 years ago in the AdaBoost [59] and ensemble learning [60] algorithms, which tend to ensemble multiple weak learners to solve a problem. AdaBoost, which has been listed among the top 10 algorithms in data mining [61], aims to learn multiple weak learners, in which each weak learner is obtained by training on the re-weighted incorrectly classified examples. By ensembling multiple weak learners, the performance is significantly boosted. Although the weak-learning concept was proposed as early as 1997, the problems of that era were still established on strong supervision due to the relatively smaller data. That is, the early problems could be strongly learned by conventional statistical learning models. Today, however, in the big data era, weak supervision becomes a real problem due to the inaccurate, inexact, and incomplete characteristics of data labels [62], which therefore has to be weakly learned. Currently, weakly-supervised learning is becoming a leading research topic. Undoubtedly, transfer adaptation learning, which resorts to solving cross-domain problems, is also a kind of weakly-supervised learning methodology. This section is organized around typical weakly-supervised learning frameworks and perspectives.

2.1 Semi-supervised Learning

Semi-supervised learning (SSL) aims to solve the problem where there are a large amount of unlabeled examples X_u and a few labeled examples (X_l, Y_l) in the dataset [63], [64]. Generally, semi-supervised learning methods consist of four categories. (i) Generative methods that advocate generating the labeled and unlabeled data via an inherent model [65], [66], [67]. (ii) Low-density separation methods that constrain the classifier boundary to cross the low-density region [68], [69], [70]. (iii) Disagreement based methods that advocate co-training of multiple learners for annotating the unlabeled instances [71], [72], [73]. (iv) Graph based methods that propose to build the connection graph of the training instances for label propagation through graph modeling [74], [75], [76], [77]. A good literature review of semi-supervised learning can be referred to as [78], [79].

Consider a general SSL framework; then the following expected risk is generally minimized:

R[P_r, W, l(X, Y, W)] = E_{(X,Y)∼P_r(X,Y)}[l(X, Y, W)]   (1)

where P_r is the probability distribution, X = [X_l, X_u] ∈ R^{D×N} is the data, Y = [Y_l, 0] ∈ R^{C×N} is the label index in which a zero vector is posed for unlabeled samples, and W is the model parameter. D, N and C denote the number of dimensions, samples, and classes of the data, respectively. The training data usually comes from a subset; therefore, the regularized risk, i.e., the average empirical risk with regularization, is minimized:

R_reg[W, l(X, Y, W)] = R_emp[W, l(X, Y, W)] + λΩ(W)   (2)

where l(·) is the prediction loss function and R_emp[·] is the average empirical risk (e.g., mean squared loss) on the training data. A general SSL model with graph based manifold regularization can be written as

min_W R_reg[W, l(X, Y, W)] + γ Σ_{i,j}^{N} A_{i,j} d^2(f_i, f_j)   (3)

where f_i is the predicted label for sample i and A is the affinity matrix used for locality preservation. Usually, A_{i,j} = exp(−σ d^2(x_i, x_j)) if x_i and x_j are neighbors, otherwise 0.

The key difference from transfer learning is that the marginal distribution and label space distribution are the same, i.e., P(X_l) = P(X_u) and P(Y_l|X_l) = P(Y_u|X_u). Generally, SSL attempts to exploit the unlabeled data for auxiliary training on the labeled data without human intervention, because the distribution of unlabeled data can intrinsically reflect sample class information. Actually, in SSL models, three basic assumptions, i.e., smoothness, cluster, and manifold, have been established. The smoothness assumption denotes that data is distributed with different density, and two instances falling into a region of high density have the same label. The cluster assumption denotes that data have an inherent cluster structure, and two samples in the same cluster are more similar. The manifold assumption means that the data lie on a manifold, and the instances in a small local neighborhood have similar semantics. The three basic assumptions are visually shown in Fig. 2.

Fig. 2. Illustration of the three basic assumptions in SSL. a) Smoothness assumption. b) Cluster assumption. c) Manifold assumption.
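To make the graph-based regularizer in Eq. (3) concrete, the sketch below propagates the few available labels over a k-NN affinity graph. It is a minimal illustration rather than an implementation from the cited works; the Gaussian affinity, the k-NN sparsification and the closed-form propagation step are choices made here.

```python
import numpy as np

def gaussian_affinity(X, sigma=1.0, k=10):
    """Affinity A_ij = exp(-sigma * ||x_i - x_j||^2), kept only for k-NN pairs (cf. Eq. (3))."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    A = np.exp(-sigma * d2)
    np.fill_diagonal(A, 0.0)
    idx = np.argsort(d2, axis=1)[:, 1:k + 1]          # k nearest neighbours of each sample
    mask = np.zeros_like(A, dtype=bool)
    rows = np.repeat(np.arange(X.shape[0]), k)
    mask[rows, idx.ravel()] = True
    return np.where(mask | mask.T, A, 0.0)            # symmetrised sparse affinity

def label_propagation(X, Y_onehot, labeled_mask, sigma=1.0, k=10, alpha=0.99):
    """Graph-based SSL: spread the labels of the few labeled samples over the affinity graph."""
    A = gaussian_affinity(X, sigma, k)
    D = np.diag(1.0 / np.sqrt(A.sum(1) + 1e-12))
    S = D @ A @ D                                     # normalised affinity
    Y = np.where(labeled_mask[:, None], Y_onehot, 0.0)
    F = np.linalg.solve(np.eye(len(X)) - alpha * S, Y)  # closed-form propagation
    return F.argmax(1)                                # predicted labels f_i
```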
2.2 Active Learning

Active learning (AL) aims to obtain the ground-truth labels of selected unlabeled instances with human intervention [80], [81], which is different from semi-supervised learning that exploits unlabeled data together with labeled data for improving recognition performance. Specifically, AL aims at progressively selecting and annotating the most informative data points from the pool of unlabeled samples, such that the labeling cost for training an effective model can be minimized [82], [83]. There are two engines in the active learning paradigm: a learning engine and a selection engine. The learning engine targets obtaining a baseline classifier, while the selection engine tries to select unlabeled instances and deliver them to human experts for manual annotation. The selection criterion is generally determined based on information uncertainty [80], [81].

2.3 Zero/Few-shot Learning

Recently, zero/few-shot learning (Z/FSL) [84], [85], [86], [87], [88], as a typical weakly-supervised learning paradigm, has attracted researchers' attention. ZSL tries to recognize the samples of unseen categories that never appear in the training data, i.e., there is no overlap between the seen categories in training data and the unseen categories in test data. That is, the label space distribution between training and test data is different, i.e., P(Y_seen|X_seen) ≠ P(Y_unseen|X_unseen), which can be recognized as a special case of transfer learning. This situation often occurs in various fields, because manually annotating tens of thousands of different object classes in the world is quite expensive and almost unrealistic. The general problem of ZSL is as follows.

Zero-shot learning with disjoint training and testing classes: Let X be an arbitrary feature space of training data. Let Y and Z be the sets of seen and unseen object categories, respectively, with Y ∩ Z = ∅. The task is to learn a classifier f: X → Z by using the training data (x_1, y_1), ..., (x_N, y_N) ⊂ (X, Y).

An extension of ZSL is one/few-shot learning (O/FSL), where a few labeled examples of each unseen object class are revealed during the training process. The usual idea of Z/O/FSL is to learn an embedding of the image feature into a semantic space or semantic attributes [85], [89]. Afterwards, recognition of new classes can be conducted by matching the semantic embedding of the visual features with the semantic/attribute representation. However, the visual-semantic mapping learned from the seen categories may not generalize well to the unseen categories due to the domain shift, which can thus be a challenging topic for applying transfer learning to ZSL. Actually, for improving ZSL under domain shift, transductive or semi-supervised zero-shot learning approaches have been studied for reducing the difference of visual-semantic mappings between seen and unseen categories [90], [91], [92], [93], [94].

2.4 Open Set Recognition

Conventional recognition tasks in computer vision where all testing classes are known at training time are generally recognized as closed-set recognition. Open set recognition addresses a more realistic vision scenario where unknown classes can be encountered during testing time [95], [96], [97], [98], which shares very similar characteristics with ZSL in its tasks. ZSL is different from open set recognition in that the former uses the semantic embedding of visual features for recognizing unknown classes, while the latter focuses on a one-class classification problem.

More recently, a framework similar to transductive ZSL for recognition under domain shift is the open-set domain adaptation (OSDA) approach [99], [100], [101], [102], which was established on the concept of open set recognition.

[Fig. 3 road map: the feature adaptation main stream spans 1991–2020, from Pratt's neural network transfer and the Ben-David domain adaptation theory through MMD, SGF, TCA, GFK, JDA, SA, SDA, CORAL, LTSL, DTSL, LSDT, RTML, JGSA, MCTL, CRAFT, HFA, GSL toward universal representation; the instance re-weighting, deep adaptation, classifier adaptation and adversarial adaptation sub-branches flag their own milestone methods and years.]
Fig. 3. A road map and research context of transfer adaptation learning. We represent the feature adaptation as the main stream which is divided into three stages (1991∼1993, 2006∼2015, 2015∼2020). Other four sub-branches, including instance re-weighting, classifier adaptation, deep adaptation and adversarial adaptation, are closely connected to the main stream towards universal representation. The milestones and key time nodes of TAL in the past decade have been flagged on each branch.

Conventional domain adaptation assumes that the categories in the target domain are known and can be seen in the source domain, while open-set domain adaptation addresses scenarios where the target domain contains instances of categories that are unseen in the source domain [100]. The differences between zero-shot learning and open-set domain adaptation lie in that (1) conventional ZSL tends to solve the recognition of instances of unseen categories under the same marginal distribution across training and testing data, while open set domain adaptation aims to solve the same problem but under different marginal distributions across source and target domains; and (2) generalized ZSL [89], [103] was proposed for the scenario where the training and test classes are not necessarily disjoint, while open set domain adaptation was proposed for the scenario where a few categories of interest are still shared across source and target data. Open set domain adaptation shares some similarity with ZSL in that some classes are unseen. The difference lies in that the unseen classes in OSDA are directly recognized as an unknown class.

This paper provides a novel taxonomy of TAL regardless of the closed-set, open-set and partial-set split. In the past 30 years, the overall research context of transfer adaptation learning (TAL), including the origin of the transfer concept, the establishment of domain adaptation theory, and the rapid development of TAL models and algorithms towards universal representation, is illustrated in Fig. 3. The milestones with respect to representative breakthroughs in theory and models around the taxonomy have been flagged.

3 INSTANCE RE-WEIGHTING ADAPTATION

When the training and test data are drawn from different distributions, this is commonly referred to as sample selection bias or covariate shift [104]. Instance re-weighting aims to infer the resampling weights directly by feature distribution matching across different domains in a non-parametric manner. Generally, given a dataset (x, y) ∼ P_r(x, y), a learning model can be obtained by minimizing the following expected risk on the training set:

R[P_r, θ, l(x, y, θ)] = E_{(x,y)∼P_r(x,y)}[l(x, y, θ)]   (4)

But actually, we are more concerned about the expected risk on the testing set, shown as follows:

R[P'_r, θ, l(x, y, θ)] = E_{(x,y)∼P'_r(x,y)}[l(x, y, θ)]
                      = E_{(x,y)∼P_r(x,y)}[(P'_r(x, y) / P_r(x, y)) l(x, y, θ)]
                      = E_{(x,y)∼P_r(x,y)}[β(x, y) l(x, y, θ)]   (5)

where P_r(x, y) and P'_r(x, y) represent the probability distributions of training and testing data, respectively, l(x, y, θ) is the loss function, and β(x, y) is the ratio between the two probabilities, which amounts to the weighting coefficient. Obviously, when P'_r(x, y) = P_r(x, y), we have β(x, y) = 1.

From Eq. (5), we know that P_r(x, y) and P'_r(x, y) can be estimated for computing the weight β(x, y) by following [105], based on prior knowledge of the class distributions. Although this is intuitive, it requires very good density estimation of P_r(x, y) and P'_r(x, y). In particular, serious overweighting of observations with very large coefficients β(x, y) can result from small errors or noise in estimating P_r(x, y) and P'_r(x, y). Therefore, in order to improve the reliability of the weights, β(x, y) can be directly estimated by imposing flexible constraints on the learning model without having to estimate the two probability distributions.

Sample re-weighting based domain adaptation methods mainly focus on the case where the difference between the source domain and the target domain is not too large. The objective is to re-weight the source samples so that the source data distribution becomes closer to the target data distribution. Usually, when the distribution difference between the two domains is relatively large, the sample re-weighting methods can be combined with others (e.g., feature adaptation) for auxiliary transfer learning. Instance re-weighting has been studied with different models, which can be divided into three categories based on the weighting scheme: (i) Intuitive Weighting, (ii) Kernel Mapping Based Weighting, and (iii) Co-training Based Weighting. These kinds of methods put emphasis on the learning or computation of the weights by using different criteria and training protocols. The taxonomy of instance re-weighting based models is summarized in Table 2.

TABLE 2
Our Taxonomy of Instance Re-weighting Adaptation Approaches

Re-weighting Adaptation                   Model Basis            Reference
Intuitive Weighting                       Adaptive tuning        [106], [107], [108], [109]
Kernel Map-Based: Distribution Matching   KMM & MMD              [104], [110], [111]
Kernel Map-Based: Sample Selection        K-Means & l2,1-norm    [112], [113]
Co-training-Based                         Double classifiers     [114], [115]

3.1 Intuitive Weighting

Instance re-weighting based domain adaptation was first proposed for natural language processing (NLP) [106], [107]. In [106], Jiang and Zhai proposed an intuitive instance-weighted domain adaptation framework, which introduced four parameters for characterizing the distribution difference between source and target samples. For example, for each (x_i^s, y_i^s) ∈ D_s, the labeled source data, a parameter α_i was used to indicate how likely P_target(y_i^s|x_i^s) is to be close to P_source(y_i^s|x_i^s), and a parameter β_i was introduced that is ideally computed as P_target(x_i^s) / P_source(x_i^s). Obviously, a large α_i means high confidence that the labeled source sample (x_i^s, y_i^s) contributes positively to the learning effectiveness. A small α_i means the two probabilities are very different, and the instance (x_i^s, y_i^s) can be discarded in the learning process. Additionally, for each x_i^{t,u} ∈ D_{t,u}, the unlabeled target data, and for each possible label y ∈ Y, the hypothesis space, a parameter γ_i(y) indicates how likely a tentative pseudo-label y can be assigned to x_i^{t,u}, and then (x_i^{t,u}, y) is included as a training sample.

Generally, α_i and γ_i play an intuitive role in sample selection by removing misleading source samples and adding valuable pseudo-labeled target samples during the transfer learning process. Although the optimal weighting values of these parameters for the target domain are unknown, the intuitions behind the weights can serve as guidelines for researchers designing heuristic parameter tuning schemes [106]. Therefore, adaptive learning of these intuitive weights still remains a challenging issue.

In [107], Wang et al. proposed two instance weighting schemes for neural machine translation (NMT) domain adaptation, i.e., sentence weighting and dynamic domain weighting. Specifically, given the parallel training corpus D = [D_in, D_o] consisting of in-domain data and out-of-domain data, the sentence-weighted NMT objective function was written as

J_sw = Σ_{⟨x_i, y_i⟩ ∈ D} λ_i log P(y_i|x_i)   (6)

where λ_i is the weight that scores each ⟨x_i, y_i⟩, P(·) is the conditional probability activated by a softmax function, and x and y represent the source sentence and target sentence, respectively. For domain weighting (dw), a weight λ was designed for the in-domain data, and the NMT objective function Eq. (6) can be transformed as [107]

J_dw = λ Σ_{⟨x,y⟩ ∈ D_in} log P(y|x) + Σ_{⟨x',y'⟩ ∈ D_o} log P(y'|x')   (7)

A dynamic batch weight tuning scheme was also proposed by monotonically increasing the ratio of in-domain data in the minibatch, which is supervised by the training cost.
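To make the role of the weights concrete, the sketch below trains a classifier under the β-weighted empirical risk of Eq. (5). It is a simplified illustration, not the method of [106] or [107]; the logistic model and the externally supplied weight vector are assumptions made here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_reweighted_classifier(Xs, ys, beta):
    """Minimize the beta-weighted empirical risk of Eq. (5):
    each labeled source sample (x_i, y_i) contributes beta_i * loss(x_i, y_i, theta)."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(Xs, ys, sample_weight=beta)   # beta_i ~ P_target(x_i) / P_source(x_i)
    return clf

# usage: beta can come from any density-ratio or KMM-style estimator (Section 3.2)
# clf = train_reweighted_classifier(Xs, ys, beta)
```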

Dai et al. proposed the TrAdaBoost [108] transfer learning method, which leveraged the Boosting algorithm to automatically tune the weights of the training samples.

In [109], Chen et al. proposed a more intuitive weighting based subspace alignment method by re-weighting the source samples for generating a source subspace that is close to the target subspace. Let w = [w_1, ..., w_m]^T ∈ R^m denote the weighting vector of the source samples. Obviously, the w_i w.r.t. the source sample x_i increases if its distribution is closer to the target data. Therefore, a simple weight assignment strategy was presented for assigning larger weights to the source samples that are closer to the target domain [109]. After obtaining the weight vector w, the weighted source space can be obtained by performing PCA on the following covariance matrix C of the weighted source data:

C = (1/m) Σ_{i=1}^{m} w_i (x_i − μ)(x_i − μ)^T   (8)

where μ is the weighted mean vector. Then the eigenvectors P_S can span the source subspace. By performing PCA on the target data, the eigenvectors P_T can span the target subspace. Thereafter, the following unsupervised domain adaptation model, subspace alignment (SA) [116], with Frobenius norm minimization, was implemented:

min_M ‖P_S M − P_T‖_F^2   (9)

The subspace alignment matrix M can be easily solved with a least-squares solution.

3.2 Kernel Mapping Based Weighting

The intuitive weighting based domain adaptation was implemented in the raw data space. In order to infer the sampling weights by distribution matching across source and target data in a feature space in a non-parametric way, kernel mapping based weighting was proposed. Briefly, the distribution difference between source and target data can be better characterised by re-weighting the source samples such that the means of the source and target instances in a reproducing kernel Hilbert space (RKHS) are close [104]. Kernel mapping based weighting consists of two categories of methods: Distribution Matching [104], [110], [111] and Sample Selection [112], [113].

(1) Distribution Matching. The intuitive idea of distribution matching is to match the means between the source and target data in a reproducing kernel Hilbert space (RKHS) by resampling the weights of the source data. Two similar distribution matching criteria, i.e., kernel mean matching (KMM) [104] and maximum mean discrepancy (MMD) [117], [118], have been used as non-parametric statistics to measure the distribution difference. Specifically, Huang et al. [104] firstly proposed to re-weight the source samples with β, such that the KMM between the means of the target data and the weighted source data is minimized:

min_β ‖E_{x'∼P'_r}[Φ(x')] − E_{x∼P_r}[β(x)Φ(x)]‖
s.t. β(x) ≥ 0,  E_{x∼P_r}[β(x)] = 1   (10)

where Φ(·) is the nonlinear mapping function into the RKHS. Chu et al. [110] further proposed a selective transfer machine (STM) by minimizing KMM for distribution matching and simultaneously minimizing the empirical risk of the classifiers learned on the re-weighted training samples:

(w, s) = arg min_{w,s} R_w(D^tr, s) + λ Ω_s(D^tr, D^te)   (11)

where R_w(·) is the empirical risk (loss) on the training set D^tr, Ω_s indicates the distribution mismatch formulated by KMM, s is the weighting vector of the source samples, and w is the classifier parameters. From Eq. (11), the KMM based distribution mismatch plays an important role in regularizing the model on the sampling weights.

More recently, Yan et al. [111] proposed a weighted MMD (WMMD) for domain adaptation, which was implemented with a convolutional neural network. WMMD overcomes the flaw of conventional MMD that it ignores the class weight bias and assumes the same class weights between source and target domain. WMMD is formulated as [111]

d_wmmd^2 = ‖ (1/Σ_{i=1}^{M} α_{y_i^s}) Σ_{i=1}^{M} α_{y_i^s} φ(x_i^s) − (1/N) Σ_{j=1}^{N} φ(x_j^t) ‖_H^2   (12)

where α_{y_i^s} is the class weight w.r.t. the class y_i^s of the i-th source sample and φ(·) is the nonlinear mapping into the RKHS H. M and N denote the number of samples drawn from the source and target domain, respectively.

(2) Sample Selection is another kind of kernel mapping based re-weighting method. Zhong et al. [112] proposed a cluster based sample selection method, KMapWeighted, which was established on the assumption that kernel mapping can make the marginal distributions across domains similar, but the conditional probabilities between the two domains after kernel mapping are still different. Therefore, in the RKHS space, they further select those source samples that are more likely similar to the target data via a K-means based clustering criterion. The data in the same cluster should have the same labels, and then the source samples with labels similar to the target data were selected.

Long et al. [113] proposed a TJM method for domain adaptation by minimizing the MMD based distribution mismatch between source and target data, in which the transformation matrix A was imposed with structural sparsity (i.e., an l2,1-norm regularization constraint) for sampling. Then, larger coefficients correspond to a stronger correlation between the source samples and the target domain samples. The TJM model is provided as [113]

min_A tr(A^T M A) + λ(‖A_s‖_{2,1} + ‖A_t‖_F^2)   s.t.  A^T K H K^T A = I   (13)

where the l2,1-norm on the source transformation A_s means that source outliers can be excluded in transferring to the target domain, the target transformation A_t is regularized for smoothness, M is the matrix deduced from MMD, H is the centering matrix and K is the kernel matrix.

3.3 Co-training Based Weighting

Co-training [72] assumes that the dataset is characterized by two different views, on which two classifiers are then separately learned, one for each view. The inputs with high confidence from one of the two classifiers can be moved to the training set. In weighting based transfer learning, Chen et al. proposed the CODA [114] method, in which two classifiers with different weight vectors were trained. For better training both classifiers on the training set, the two classifiers were jointly minimized with weighting. In essence, the method of sample re-weighting based on the classifier is similar to TrAdaBoost [108] and KMapWeighted [112].

In [115], Chen et al. proposed a re-weighted adversarial adaptation network (RAAN) for unsupervised domain adaptation. Two classifiers, including a multi-class source instance classifier C and a binary domain classifier D, were designed for adversarial training. The domain classifier D aims to discriminate whether features are from the source or target domain, while the domain feature representation network T tries to confuse it, which formulates an adversarial training manner. For improving the domain confusion effect, the source feature distribution is re-weighted with β during training of the domain classifier D. With the gaming between T and D as GAN does [36], the following minimax objective function was used [115]:

min_T max_{D,β} L_adv   (14)

where the weight β is multiplied with D, and both β and D were trained in a cooperative way. The learning of the source classifier C was easily performed by minimizing the cross-entropy loss.

3.4 Discussion and Summary

In this section, we recognize three kinds of instance re-weighting: intuitive, kernel mapping based and co-training based. The intuitive re-weighting advocates tuning the weights of the source samples, such that the weighted source distribution is closer to the target distribution. The kernel mapping based re-weighting is further divided into distribution matching and sample selection. The former aims to learn source sample weights such that the kernel mean discrepancy between the target data and the weighted source data is minimized, and the latter advocates sample selection by using K-means clustering (cluster assumption) and l2,1-norm based structural sparsity in the RKHS space. The co-training mechanism focuses on learning with two classifiers. Additionally, the adversarial training of the weighted domain classifier can facilitate domain confusion.

Although instance re-weighting is the earliest method to address the domain mismatch problem, there are still some directions worth studying: 1) essentially, instance weighting can be incorporated into most learning frameworks; 2) the initialization and estimation of instance weights are important and can be treated as a latent variable obeying some probability distribution.
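As an illustration of the distribution-matching re-weighting summarized above, the following sketch estimates KMM weights by matching kernel means in the spirit of Eq. (10). It is a simplified sketch rather than the implementation of [104]; the RBF kernel, the box bound B, the equality constraint on the mean of β and the use of SciPy's SLSQP solver are choices made here.

```python
import numpy as np
from scipy.optimize import minimize

def rbf(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kmm_weights(Xs, Xt, gamma=1.0, B=10.0):
    """Kernel mean matching: beta >= 0 with mean(beta) ~= 1 that minimises
    || mean_j phi(x_t_j) - mean_i beta_i phi(x_s_i) ||_H^2  (cf. Eq. (10))."""
    ns, nt = len(Xs), len(Xt)
    K = rbf(Xs, Xs, gamma)                               # quadratic term beta^T K beta
    kappa = rbf(Xs, Xt, gamma).sum(axis=1) * (ns / nt)   # linear term from the target mean
    obj = lambda b: 0.5 * b @ K @ b - kappa @ b
    grad = lambda b: K @ b - kappa
    cons = ({'type': 'eq', 'fun': lambda b: b.mean() - 1.0},)
    res = minimize(obj, np.ones(ns), jac=grad, method='SLSQP',
                   bounds=[(0.0, B)] * ns, constraints=cons)
    return res.x     # per-sample source weights beta_i
```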

4 FEATURE ADAPTATION

Feature adaptation aims to discover a common feature representation of the data drawn from multiple sources by using different techniques, including linear and nonlinear ones. In the past decade, feature adaptation induced transfer adaptation learning has been intensively studied, which, in our taxonomy, can be categorized into (i) Feature Subspace-Based, (ii) Feature Transformation-Based, (iii) Feature Reconstruction-Based and (iv) Feature Coding-Based. Despite these advances, the technical challenges being faced by researchers lie in domain subspace alignment, projection learning for distribution matching, generic representation, and shared domain dictionary coding. The taxonomy of feature adaptation approaches is summarized in Table 3.

TABLE 3
Our Taxonomy of Feature Adaptation Approaches

Feature Adaptation                            Model Basis                          Reference
Feature Subspace: Geodesic path               Grassmann manifold                   [119], [120]
Feature Subspace: Alignment                   Subspace learning                    [116], [121], [122], [123]
Feature Transformation: Projection            MMD & HSIC & Bregman divergence      [124], [125], [126], [127]
Feature Transformation: Metric                First/second-order statistic         [128], [129], [130], [131]
Feature Transformation: Augmentation          Zero-padding & Generative            [132], [133], [134], [135]
Feature Reconstruction: Low-rank models       Low-rank representation (LRR)        [136], [137], [135], [138]
Feature Reconstruction: Sparse models         Sparse subspace clustering (SSC)     [139], [140], [141], [142]
Feature Coding: Domain-shared dictionary      Dictionary learning                  [143], [144], [145], [146]
Feature Coding: Domain-specific dictionary    Dictionary learning                  [147], [148], [149], [150]

4.1 Feature Subspace-Based

Learning a subspace generally resorts to unsupervised domain adaptation. Three representative models are sampling geodesic flow (SGF) [119], geodesic flow kernel (GFK) [120] and subspace alignment (SA) [116]. There exists a common property of the three methods, i.e., the data is assumed to be represented by a low-dimensional linear subspace. That is, a low-dimensional Grassmann manifold is embedded in the high-dimensional data. Generally, principal component analysis (PCA) was used to construct the Grassmann manifold, where the source and target domains become two points and a geodesic flow or path was formulated. SGF, proposed by Gopalan et al. [119], is an unsupervised low-dimensional subspace transfer method, which samples a group of subspaces along the geodesic path between source and target data, and aims to find an intermediate representation with a closer domain distance.

Similar to but different from SGF, Gong et al. proposed GFK [120], in which the geodesic flow kernel was used to model the domain shift by integrating an infinite number of subspaces. GFK explores an intrinsic low-dimensional spatial structure that associates the two domains, and the main idea behind it is to find a geodesic line from φ(0) to φ(1), such that the raw feature can be transformed into a space of infinite dimension from φ(0) to φ(1) where the distribution difference is easy to reduce. In particular, the infinite-dimensional features in the manifold space can be represented as z = φ(t)^T x. The inner product of the transformed features z_i and z_j defines a positive semi-definite geodesic flow kernel as follows:

⟨z_i^∞, z_j^∞⟩ = ∫_0^1 (φ(t)^T x_i)^T (φ(t)^T x_j) dt = x_i^T G x_j   (15)

where G is a positive semi-definite mapping matrix. With z = √G x, features in the original space can be transformed into the Grassmann manifold space.

For aligning the source subspace to the target subspace, in SA [116], Fernando et al. proposed to move the two subspaces closer with respect to the points in the Grassmann manifold by directly designing an alignment matrix M, which well bridges the source and target subspaces. The model of SA is described in Eq. (9). As presented in SA, the subspaces of source and target data were spanned by the eigenvectors induced with PCA. Further, Sun and Saenko proposed a subspace distribution alignment (SDA) [121] by simultaneously aligning the distributions as well as the subspace bases, which overcomes the flaw of SA that it does not take the distribution difference into account.

More intuitively, Liu and Zhang proposed a guided transfer hashing (GTH) [122] framework, which introduced a more generic method for moving the source subspace W_s closer to the target subspace W_t:

min_{W_s, W_t} (1/2) ‖M^{1/2} (W_t − W_s)‖^2   (16)

where M is a weighting matrix on the difference between source and target subspaces. In this way, the two subspaces can be solved alternately and progressively, which is therefore recognized as a guided transfer mechanism. Further, a guide subspace learning (GSL) was proposed for unsupervised domain adaptation [123].

4.2 Feature Transformation-Based

This kind of models aims to learn a transformation or projection of the data with some distribution matching metric between source and target domains [5], [151], [152]. Then, the distribution difference of the transformed or projected features across the two domains can be removed or relieved. Feature transformation based domain adaptation has been a mainstream in the visual transfer learning community in recent years, and it can be further divided into Projection, Metric, and Augmentation types according to the model formulation.

(1) Projection-Based domain adaptation aims to solve a projection matrix in the source and target domain for reducing the marginal and conditional distribution difference between domains, by introducing a Kernel Matching Criterion [127], [124], [153], [125], [126], [154], [155], [156] and a Discriminative Criterion [157], [158], [159], [160]. The kernel matching criterion generally adopts the maximum mean discrepancy (MMD) statistic, which characterizes the marginal distribution difference and conditional distribution difference between source and target data. In the unsupervised domain adaptation setting, the labels of target domain samples are generally unavailable; therefore, the pseudo-labels of target samples should be iteratively predicted for quantifying the conditional MMD between domains [157], [161]. The discriminative criterion focuses on within-class compactness and between-class separability of the projection. Mathematically, the formulation of the empirical nonparametric MMD in a universal RKHS is written as

d_H^2(D_s, D_t) = ‖ (1/M) Σ_{i=1}^{M} φ(x_i^s) − (1/N) Σ_{j=1}^{N} φ(x_j^t) ‖_H^2   (17)

Specifically, with the MMD based kernel matching criterion, Pan and Yang firstly proposed transfer component analysis (TCA) [124] by introducing the marginal MMD with projection as the loss function. The joint distribution adaptation (JDA) proposed by Long et al. [125] further introduced the conditional MMD on the basis of TCA, such that the cross-domain distribution alignment becomes more discriminative. The general model can be written as

min_W d_m^2(X_S, X_T, W) + λ d_c^2(X_S, X_T, Y_S, Y'_T, W)   (18)

where W denotes the projection matrix, Y'_T denotes the predicted pseudo-labels of the target data, and d_m^2 and d_c^2 represent the marginal and conditional distribution discrepancy, respectively. For improving the discrimination of the projection matrix, such that the within-class compactness and between-class separability in each domain can be better characterized, models with joint discriminative subspace learning and MMD minimization were proposed, for example, JGSA [157] and CDSL [158], and are generally written as

min_W F(W, X_S, X_T, Y_S, Y'_T) + λ d_{m,c}^2(X_S, X_T, W)   (19)

where F(·) is a scalable subspace learning function of the projection W, for example, linear discriminative analysis (LDA), locality preserving projection (LPP), marginal Fisher analysis (MFA), principal component analysis (PCA), etc. In addition to the MMD based criterion in projection based transfer models, Bregman divergence based [127], Hilbert-Schmidt independence criterion (HSIC) based [162], [126], [163], [142], and manifold criterion based [135] measures have also been used.

In [127], Si et al. proposed transfer subspace learning (TSL) by introducing a Bregman divergence-based discrepancy as regularization instead of MMD, which is written as

W = arg min_W F(W) + λ D_W(P_L ‖ P_U)   (20)

where F(W) is similar to Eq. (19) and D_W(P_L ‖ P_U) is the Bregman divergence-based regularization that measures the distance between the probability distribution P_L of the training samples and that of the testing samples P_U in the projected subspace W.

The HSIC, proposed by Gretton et al. [162] (the same author as that of MMD), was used to measure the dependency between two sets X and Y. Let k_x and k_y denote the kernel functions w.r.t. the RKHSs F and G. The HSIC is mathematically written as [162]

HSIC(X, Y, F, G) = ‖C_XY‖_{H−S}^2 = (N − 1)^{−2} Tr(K_X H K_Y H)
s.t. H = I − N^{−1} 1_{N×1} 1_{N×1}^T   (21)

where N is the size of the sets X and Y and ‖C_XY‖_{H−S}^2 is the Hilbert-Schmidt norm of the cross-covariance operator. K_X and K_Y denote the two kernel Gram matrices, and H is the centering matrix. HSIC will be zero if and only if X and Y are independent.
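The HSIC statistic of Eq. (21) can be computed directly from kernel Gram matrices; the sketch below is a minimal illustration (the RBF kernel and the use of one-hot label vectors for Y are assumptions made here, not details taken from [162]).

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def hsic(X, Y, gamma=1.0):
    """Empirical HSIC of Eq. (21): (N-1)^-2 * Tr(K_X H K_Y H).
    X: features (n, d); Y: e.g. one-hot labels (n, C)."""
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    Kx, Ky = rbf(X, X, gamma), rbf(Y, Y, gamma)
    return np.trace(Kx @ H @ Ky @ H) / (n - 1) ** 2
```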

In [142], Wang et al. proposed to use the projected HSIC as a regularization, which is written as

min_W F(W) − λ HSIC(W, X, Y, F, G)   (22)

where Y denotes the label set of source and target data. Obviously, the model constrains W to reduce the independency between the feature set X and the label set Y, such that the classification performance can be improved. In model formulation, the general way is to set a common projection for both domains. Another way is to learn two projections W_S and W_T, one for each domain, such that domain-specific projections can be solved [157], [131], [122], [138]. For moving the two projections of both domains closer, the Frobenius norm of their difference, as in Eq. (16), can be used.

(2) Metric-Based feature transformation aims to learn a good distance metric from labeled source data which can be easily adapted to a related but different target domain [164]. Metric transfer has a close link to projection based methods, if the metric M is a semi-definite matrix and can be decomposed into M = W W^T [128]. The metric-based transfer can be divided into First-order statistic [128], [129], [151], [165], [166], [167], [168] and Second-order statistic [130], [131], [169], [170], [171] based distance metrics, such as Euclidean distance, Mahalanobis distance, moment distance, etc.

The First-order metric transfer generally learns a metric under which the distance between source and target features is minimized, and it can be written as

min_M d(M, φ(X_S), φ(X_T)) + λ R(M)   (23)

where φ(·) is the feature representation or mapping function, and it can be a linear mapping [151], kernel mapping [129], [166], auto-encoder [128] or neural network [165]. For example, the robust transfer metric learning (RTML) proposed by Ding et al. [128] adopted an auto-encoder based feature representation for metric learning, such that the Mahalanobis distance between the source and target domain is minimized (Eq. (24)). In that formulation, M is a positive semi-definite matrix, X̄ is the repeated version of X, and X̃ is the randomly corrupted version of X; the first term is the Mahalanobis distance induced domain discrepancy under the metric M, the second term is an auto-encoder for feature learning, and the third term is a low-rank constraint for characterizing the internal correlation between domains.

The Second-order metric transfer generally learns a metric under which the distance between the covariances of the source and target domain, instead of the means, is minimized [170], [131], [169], [130]. For example, Sun et al. [169], [130] proposed a simple but efficient correlation alignment (CORAL) by aligning the second-order statistic (i.e., the covariance) between source and target distributions instead of the first-order metric. By introducing a metric matrix A, the difference between the source covariance Σ_S and the target covariance Σ_T in CORAL can be minimized by solving

min_A ‖A^T Σ_S A − Σ_T‖_F^2   (25)

Eq. (25) amounts to matching two centered Gaussian distributions, which is the basic assumption for such second-order statistic based transfer.

(3) Augmentation-Based domain adaptation often assumes that the feature representation is grouped into three types: common representation, source-specific representation and target-specific representation. In the general case, the source domain should be characterized as the composition of the common component and the source-specific component, and similarly, the target domain should be characterized as the composition of the common component and the target-specific component. Feature augmentation based DA can be divided into the generic Zero Padding [172], [132], [173], [133], [174] and the latest Generative [134], [135] types.

Zero Padding was firstly proposed by Daume III [172], who presented the EasyAdapt (EA) model. Assume the raw input data space to be X ∈ R^F; then the augmented feature space should be Y ∈ R^{3F}. Define the mapping functions of the source and target domain from X to Y as Φ_s(·) and Φ_t(·), respectively. Then, there is

Φ_s(x) = [x, x, 0],  Φ_t(x) = [x, 0, x]   (26)

where 0 ∈ R^F is a zero vector. The first, second and third blocks of the augmented feature Φ(x) in Eq. (26) represent the common, source-specific and target-specific feature components, respectively. However, in heterogeneous domain adaptation, which addresses different feature dimensions between source and target domain [175], [176], [177], for example, cross-modal learning (e.g., images vs. text), Li et al. [133] argued that such simple zero-padding for dimensionality consistency between domains is not meaningful. The reason is that there would be no correspondences between the heterogeneous features. Therefore, Li et al. [133] proposed a heterogeneous feature augmentation (HFA) model, which incorporates the projected features together with the raw features for feature augmentation by introducing two projection matrices (Eq. (27)), where d_s and d_t represent the dimensionality of the source and target data, respectively. For incorporating the unlabeled target data, Daume III further proposed an EA++ model with zero padding based feature augmentation for semi-supervised domain adaptation [132], [173]. Chen et al. [174] proposed to use zero padding based camera correlation aware feature augmentation (CRAFT) for cross-view person re-identification.

Generative methods used for feature augmentation mainly focus on plausible data generation towards enhancing the robustness of domain transfer. In [134], Volpi et al. proposed an adversarial feature augmentation by introducing two generative adversarial nets (GANs). The first GAN was used to train the generator S for synthesizing plausible source images by inputting noise and conditional labels. The second GAN was used to train the shared feature encoder E (feature augmentation) for both domains, by adversarial learning with the synthesized source images via S. Finally, the encoder E was used as the domain-adapted feature extractor shared by both domains. In [135], Zhang et al. proposed a manifold criterion guided intermediate domain generation for feature augmentation, which improved the transfer performance by generating high-quality intermediate features.
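The zero-padding augmentation of Eq. (26) is essentially a one-line feature map; the sketch below mirrors the EasyAdapt construction described above (the NumPy interface is a choice made here, not the authors' code).

```python
import numpy as np

def easy_adapt(x, domain):
    """EasyAdapt feature augmentation (Eq. (26)):
    source: [x, x, 0]  (common block + source-specific block + zeroed target block)
    target: [x, 0, x]  (common block + zeroed source block + target-specific block)"""
    zero = np.zeros_like(x)
    if domain == 'source':
        return np.concatenate([x, x, zero])
    return np.concatenate([x, zero, x])

# usage: train any standard classifier on the 3F-dimensional augmented features
# of both domains; the shared block lets source and target supervision interact.
```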
The model has also been kernerlized by defining can be generally divided into three types: Low-rank Recon- the projection W as the linear representation of X. The struction [136], [137], [135], [138], Sparse Reconstruction [139], reconstruction is then implemented in a high-dimensional [141], [140], [142], and Graph Matching [178], [179], [180]. For reproducing kernel Hilbert space (RKHS), based on the the first one, for characterizing the domain differences and Representor theorem. In [141], Zhang et al. proposed a uncovering the domain noises, the reconstruction matrix l -norm constraint based reconstruction transfer model was imposed with low-rank constraint, such that the relat- 2,1 with discriminative subspace learning and the domain-class edness between domains can be discovered. For the second consistency was guaranteed. The joint constraint with low- one, sparsity or structural sparsity was generally used for rankness and sparsity for the reconstruction matrix was transferrable sample selection. For the last one, the graphs proposed in [140], such that the global and local structures of each domain are matched by a correspondence matrix. of data can be preserved. Methodologically, reconstruction based domain transfer is (3) Graph Matching based domain adaptation focused on closely related to low-rank representation (LRR) [181], [182], the graph constructions of both domains with adjacency matrix recovery [183], [184] and sparse subspace clustering matrices and solved the correspondence matrix C between (SSC) [185], [186], [187]. (1) Low-rank Reconstruction based domain adaptation was graphs for inherent domain similarity matching. In [178], firstly proposed by Jhuo et al. [136], in which the W trans- the first-order (data matrices) and second-order (adjacency formed source feature was reconstructed by the target do- matrices) based matching were formulated. main with low-rank constraint on the reconstruction matrix 2 2 min kCXT − XSkF + λkCDT − DSCkF + γR(C) (31) and l2,1-norm constraint on the error. C

In Eq.(31), D_S and D_T represent the adjacency matrices of the graphs for the source and target domain, respectively, and R(C) denotes the class regularization.

min_{W,Z,E} rank(Z) + α‖E‖_{2,1},  s.t.  W X_S = X_T Z + E,  W W^T = I    (28)

However, seeking for an alignment between WXS and XT may not transfer knowledge directly, due to the out of 4.4 Feature Coding-Based domain problem of W for unilateral projection. On the basis of [136], Shao et al. [137] proposed a In feature reconstruction based transfer models, the focus is latent subspace transfer learning (LTSL), which tends to the learning of reconstruction coefficients across domains, reconstruct the target data by using the source data as basis on the basis of the raw feature of source or target data. in a projected latent subspace. Different from that, feature coding based transfer learning put emphasis on seeking a group of basis (i.e., dictionary) min F (W, XS) + λ1rank(Z) + αkEk2,1 W,Z,E and representation coefficients in each domain, which was (29) generally called domain adaptive dictionary learning. The s.t. W T X = W T X Z + E S T typical dictionary learning approach aims to minimize the where F (·) is a subspace learning function, similar to representation error of the given data set under a sparsity Eq.(19), Eq.(20) and Eq.(22). By comparing Eq.(28) to Eq.(29), constraint [188], [189], [190]. The cross-domain dictionary the major difference lies in the latent space learning of W learning aims to learn domain adaptive dictionaries without for both domains in LTSL. Both methods, established on requiring any explicit correspondences between domains, LRR, advocated low-rank reconstruction between domains which was generally divided into two types of learn- for transfer learning. As demonstrated in [182], trivial so- ing, domain-shared dictionary-based [143], [144], [145], [146] lution may be easily encountered when handling disjoint and domain-specific dictionary-based [147], [148], [149], [150], subspaces and insufficient data using LRR and a strong [191]. Obviously, the former resorts to learning one common independent subspace assumption is necessary. dictionary for both domains, while the latter contributes to (2) Sparse Reconstruction based domain transfer was es- obtain two or more dictionaries for each domain. tablished on the SSC, which, different from LRR, is well (1) Domain-shared dictionary aims at representing the supported by theoretical analysis and experiments when source and target domain using a common dictionary. JOURNAL OF LATEX CLASS FILES, VOL. -, NO. -, MARCH 2019 12

In [143], [145], Shekhar et al. proposed to separately rep- 5 CLASSIFIER ADAPTATION resent the source and target data in a latent subspace with a In cross-domain visual categorization, classifier adaptation shared dictionary D, which can be written as based TAL aims to learn a generic classifier by leveraging X 2 labeled samples drawn from source domain and few labeled min kP(k)X(k) − Dα(k)kF + <(D, P, α) D,P,α (32) samples from target domain [3], [192], [193], [194]. Typical k∈{s,t} cross-domain classifier adaptation can be divided into (i) where P denotes the latent subspace projection, α denotes Kernel Classifier-Based [3], [195], [196], [197], [192], [198], the representational coefficients for source data Xs and [199], (ii) Manifold Regularizer-Based [200], [201], [202], [203], target data Xt using a shared dictionary D, and <(·) denotes [204], [205], [206] and (iii) Bayesian Classifier-Based [207], the regularizer. The shared dictionary D is demonstrated to [208], [209], [210], [211], [212]. The taxonomy of classifier incorporate the common information from both domains. adaptation approaches is summarized in Table 4. (2) Domain-specific dictionary tends to learn multiple dic- tionaries, one for each domain, to represent the data in each TABLE 4 domain based on domain specific or common representation Our Taxonomy of Classifier Adaptation Approaches coefficients [149], [191]. The general model can be written as CLASSIFIER ADAPTATION MODEL BASIS REFERENCE X 2 min kX(k) − D(k)α(k)kF + Ω(αs, αt) [3], [196], [197], D,P,α (33) Kernel Classifier SVM&MKL k∈{s,t} [192], [198], [199] Label Propagation [200], [201], [202], where Ω(·) denotes the difference between representation Manifold Regularizer &MMD [203], [204], [206] coefficients of source and target. If αs = αt = α, then Probabilistic [207], [208], [209], Ω(αs, αt) = 0 and the model in Eq.(33) is degenerated as the Bayesian Classifier common representation coefficients based domain adaptive graph models [210], [211], [212] dictionary learning [147]. In [148], [144], [150], a set of intermediate domains that bridge the gap between source and target domains were 5.1 Kernel Classifier-Based K−1 incorporated as multiple dictionaries {Dk}k=1 , which can Yang et al. [3] firstly proposed an adaptive support vector progressively capture the intrinsic domain shift between machine (ASVM) in 2007 for target classifier training, which source domain dictionary D0 and target domain dictionary assumed that there exists a bias ∆f(x) between source DK . The difference 4Dk between the atoms of adjacent two classifier f a(x) and target classifier f(x). This means that sub-dictionaries can well characterize the incremental tran- the bias can be added to the source classifier to generate sition and shift between two domains. Actually, this kind a new decision function, that is adapted to classifying the of models can be linked with SGF [119] and GFK [120] by target data. There is, sampling finite or infinite number of intermediate subspaces a a T on the Grassmann manifold for better capturing the intrinsic f(x) = f (x) + ∆f(x) = f (x) + w φ(x) (34) domain shift. where w is the parameter of the bias function ∆f(x), which was solved by standard SVM, 4.5 Discussion and Summary N 1 2 X In this section, feature adaptation methods are presented, min kwk + C εi w 2 (35) including subspace, transformation, reconstruction and cod- i=1 a T ing based types. Feature subspace focuses on the subspace s.t. 
yif (xi) + yiw φ(xi) ≥ 1 − εi, εi > 0 alignment between domains in Grassmann manifold. Fea- In Eq.(35), f a(·) was known and trained on labeled source ture transformation is further categorized into three sub- data, (x , y ) are drawn from few labeled target data, and w classes: projection learning with MMD criterion, metric i i is the parameter of ∆f(·) rather than f(·). learning with first-order or second-order statistics and aug- More recently, on the basis of ASVM, Duan et al. pro- mentation with zero-padding. Feature reconstruction aims posed a series of multiple kernel learning (MKL) based to explicitly bridge the source and target data in a latent domain transfer classifiers [195], [196], [197], [198], including subspace by low-rank or sparse reconstruction. Besides the AMKL, DTSVM, and DTMKL, in which the kernel func- data-level matching, higher-order graph matching between tion was assumed to be a linear combination of multiple the adjacency matrices of graphs for domains is also intro- predefined base kernel functions by following the MKL duced. Finally, the feature coding focus on domain data methodology [213], [214]. Additionally, for reducing the do- representation by learning domain adaptive dictionaries main distribution mismatch, MMD based kernel matching without explicit correspondences between domains. metric d2(·) was jointly minimized with the structural risk Feature adaptation has been intensively studied by ad- k based classifiers. The general model of MKL based classifier dressing negative transfer and under-adaptation problems adaptation can be written as from different perspectives. Two future directions in feature 2 level are specified: 1) more reliable probability distribution min R(k, f, XS,XT ) + λΩ(d (XS,XT )) ,f k (36) similarity metric is needed, except the Gaussian kernel k induced MMD; 2) for learning domain-invariant representa- where R(·) denotes the structural risk on labeled training tion, model ensemble of linear and nonlinear ones is desired. samples, f is the decision function, Ω(·) is the monotonic JOURNAL OF LATEX CLASS FILES, VOL. -, NO. -, MARCH 2019 13 PM increasing function, k = m=1 dmkm is a linear combina- [212], which aim to have better insights on the transferrable PM tion of a set of base kernels kms with m=1 dm = 1 and process from source domain to target domain. dm > 0. The structural risk R(·) was generally formulated In [207], Go¨nen and Margolin firstly proposed graphical based on the hinge loss, i.e., l~(t) = max(0, 1 − t), as model, i.e., kernelized Bayesian transfer learning (KBTL) for that in SVM. Duan et al. [215] also proposed a domain domain adaptation. This work aims to seek a shared sub- adaptation machine (DAM), which incorporated SVM hinge space and learn a coupled linear classifier in this subspace loss based structural risk with multiple domain regularizers using a full Bayesian framework, solved by a variational for target classifier learning. Regularized least-square loss approximation based inference algorithm. In [210], Gho- based classifier adaptation can be referred to as [200], [201]. lami et al. proposed a probabilistic latent variable model (PUnDA) for unsupervised domain adaptation, by simul- 5.2 Manifold Regularizer-Based taneously learning the classifier in a projected latent space and minimizing the MMD based domain disparity. 
A regu- The manifold assumption in semi-supervised learning larized Variational Bayesian (VB) algorithm was used for means that the the similar samples with small distance in efficient model parameter estimation in PUnDA, because feature space more likely belongs to the same class. By the computation of exact posterior distribution of the latent constructing the affinity graph based manifold regularizer, variables is intractable. More recently, Karbalayghareh et under which, the classifier trained on source data can be al. [211] proposed an optimal Bayesian transfer learning more easily adapted to target data through label propaga- (OBTL) classifier to formulate the optimal Bayesian classifier tion. Long et al. [200] and Cao et al. [203] proposed ARTL (OBC) in target domain by using the prior knowledge of and DMM which advocated manifold regularization based source and target domains, where OBC [216] aims to achieve structural risk and between-domain MMD minimization Bayesian minimum mean squared error over uncertainty for classifier training, structural preservation and domain classes. In order to avoid costly computations such as alignment. In [204], Yao et al. proposed to simultaneously MCMC sampling, OBTL classifier was derived based on the minimize the classification error, preserve the geometric Laplace approximated hypergeometric functions. structure of data and restrict similarity characterized on unlabeled target data. Zhang and Zhang [201] proposed a manifold regularization based least-square classifier EDA 5.4 Discussion and Summary on both domains with label pre-computation and refining In this section, classifier adaptation including kernel clas- for domain adaptation. More recently, Wang et al. [202] pro- sifier, manifold regularizer and Bayesian classifier are sur- posed a domain-invariant classifier MEDA in Grassmann veyed, which mostly rely on a small amount of tagged manifold with structural risk minimization, while perform- target domain data and facilitate semi-supervised transfer ing cross-domain distribution alignment of marginal and learning. This can be easily adapted to unsupervised trans- conditional distributions with different importances. Graph fer learning by pre-computing and iteratively updating the based manifold regularization M(·) can be written as pseudo-labels of the completely unlabeled target domain in X M(X) = W (f(x ) − f(x ))2 = tr(F T LF ) classifier adaptation. The kernel classifier focuses on SVM ij i j (37) i,j or MKL learning jointly with MMD based domain disparity minimization. The manifold regularizer based models aim X F where is the data of source and target domain, is to preserve the data affinity structure for label propagation. L = D − W the predicted labels, is the Laplacian matrix, The Bayesian classifier based models resort to compensating W i j D ij is the weight between sample and , and is a the generalization performance loss due to data scarcity by D = P W diagonal matrix with ii i ij. This term constrains modeling on the prior knowledge under reliable distribu- the geometric structure preservation in label propagation tion assumptions, and having theoretical understanding on and helps classifier adaptation. Although manifold regu- transfer learning from the viewpoint of data generation. 
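For readers who prefer code, a minimal NumPy sketch of the graph-based manifold regularizer tr(F^T L F) in Eq.(37) is given below; the Gaussian kNN affinity construction and all names are illustrative assumptions, not the exact choices of ARTL, EDA or MEDA.

import numpy as np

def manifold_regularizer(X, F, k=5, sigma=1.0):
    # X: (n, d) stacked source and target features, F: (n, C) predicted label scores.
    # Returns M(X) = tr(F^T L F) with L = D - W built from a Gaussian kNN graph.
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(D2[i])[1:k + 1]               # k nearest neighbours, skipping self
        W[i, idx] = np.exp(-D2[i, idx] / (2.0 * sigma ** 2))
    W = np.maximum(W, W.T)                             # symmetrize the affinity graph
    L = np.diag(W.sum(axis=1)) - W                     # graph Laplacian
    return np.trace(F.T @ L @ F)

Adding this term to a least-square or SVM objective penalizes predictions that differ sharply between neighbouring samples, which is exactly the label-propagation effect described above.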
larizer can improve classifier adaptation performance, the However, some inherent flaws exist: 1) incorrect pseudo- fact is that the manifold assumption may not always hold, labels of target data significantly lead to performance degra- particularly when domain distribution does not match [135]. dation; 2) inaccurate distribution assumption in estimating various latent variables produces very negative effect; 3) the 5.3 Bayesian Classifier-Based manifold assumption between domains does not hold for In learning complex systems with limited data, Bayesian serious domain disparity. learning can well integrate prior knowledge to improve the weak generalization of models caused by data scarcity. For EEP ETWORK DAPTATION unsupervised domain adaptation, an underlying assump- 6 D N A tion in the kernel classifier and manifold classifier based Deep neural networks (DNNs) have been recognized as models is that the conditional domain shift between do- dominant techniques for addressing computer vision tasks, mains can be minimized without relying on the target labels. due to their powerful feature representation and end-to-end Additionally, these methods are deterministic, which rely training capability. Although DNNs can achieve more gen- more on the expensive cross-validation for determining the eralized features and performance in visual categorization, underlying manifold space where the kernel mismatch be- they rely on massive amounts of labeled data. For a target tween domains is effectively reduced. Recently, probabilistic domain where the labeled data is unavailable or a very few model, i.e., Bayesian classifier based graphical models for labeled data is available, deep network adaptation started DA/TL have been studied [207], [208], [209], [210], [211], to rise. Yosinski et al. [217] has discussed the transferability JOURNAL OF LATEX CLASS FILES, VOL. -, NO. -, MARCH 2019 14

of features in bottom, middle and top layers of DNNs, TABLE 5 and demonstrated that the transferability of features de- Our Taxonomy of Deep Network Adaptation Approaches creases as the distance between domains increases. In [218], Donahue et al. proposed the deep convolutional activation DEEP NET ADAPTATION MODEL BASIS REFERENCE feature (DeCAF) extracted by using a pre-trained AlexNet [219], [229], [230], Marginal Alignment CNN&MMD model [19], which has well proved the generalization of [231], [232], [233] DNNs for generic visual classification. This work further [220], [234], [235], CNN&MMD facilitated deep transfer learning and deep domain adapta- Conditional Alignment [236], [237], [238] &Semantics tion. Generally, the presented three types of TAL models in [239], [240] Section 3, 4 and 5, including instance re-weighting, feature [241], [242], [243], BN Layer-Derived CNN&Whitening adaptation and classifier adaptation, can be incorporated [244], [245] into DNNs with end-to-end training for deep network Stacked Denoising [246], [247], [248], Autoencoder-Based adaptation. In 2015, Long et al. [219], [220] proposed a autoencoders [176], [249], [250] deep adaptation network (DAN) for learning transferrable features, which, for the first time opened the topic of deep transfer and adaptation. The basic idea of DAN is to enhance top layered features were generally transformed to a RKHS feature transferability in task-specific layers of DNNs by space where the maximum mean discrepancy (MMD) based embedding the higher layered features into reproducing kernel matching between domains was performed, which kernel Hilbert spaces (RKHSs) for nonparametric kernel is recognized as marginal alignment based deep network matching (e.g., MMD-based) between domains. In training adaptation [219], [229], [230], [231]. For image classification, process, DAN was trained by fine-tuning on the ImageNet the softmax guided cross-entropy loss on the labeled source pre-trained DNN, such as AlexNet [19], VGGNet [221], data is generally minimized. Representative works can be GoogLeNet [222] and ResNet [20]. referred to as DDC proposed by Tzeng et al. [229] and Another sub-branch of deep network adaptation is par- DAN [219]. The model can be written as

tial domain adaptation [223], [224], [225], [226] and open- Ns 1 X X 2 l l set domain adaptation [101], [102], [227], which loose the min J (θ(xi), yi) + λ dma(Ds, Dt) (38) Θ N category space (label set) consistency assumption across do- s i=1 l mains. The former supposes that the source domain shares where J (·) is the cross-entropy loss function, θ(·) is the fea- C ⊃ C partial categories with target data, i.e., S T , and the ture representation function, Dl denotes the domain feature disjoint source categories are recognized as noisy instances. th 2 set from the l layer and dma(·) is the marginal align- The latter assumes that the target domain shares partial ment function (e.g., MMD in Eq.(17)) between domains. C ⊂ C categories with source data, i.e., S T , and the disjoint Clearly, in Eq.(38), multiple MMDs were formulated, one for target categories are unified as unknown category, i.e., the each layer, and the summation of all MMDs is minimized. (C + 1)th C class. is the number of known classes. The com- For better measuring the discrepancy between domains, mon strategy is to actively discover the common/priviate a unified MMD called joint MMD (JMMD) was further classes with thresholding and re-weighting the model in designed by Long et al. [231] in a tensor product Hilbert sample-level. More general, You et al. [228] proposed a space for matching the joint distribution of activations of universal domain adaptation (UDA) which requires no prior multiple layers. Besides, other domain discrepancy metrics, knowledge on the label sets. That is, both domains can such as Kullback-Leibler (KL) divergence [251], [252], [253], have private categories. Actually, the methodology in con- [254], Jensen-Shannon (JS) divergence [255], Wasserstein dis- ventional DA can be freely generalized into PDA or OSDA tance [256], [257], larger feature norms [258], margin dispar- with modifications, therefore, they are not further specified ity discrepancy [259], moment distance [260], and optimal in order to put emphasis on the methodological aspects. transport distance [261], [262] have also been explored. In recent years, various deep network adaptation models The model in Eq.(38) does not take into account the net- have emerged by modifying the network architecture, loss work outputs of target domain stream, which may not well and threshold tricks, which is presenting a blowout trend, adapt the source classifier to target data. For addressing this because deep network adaptation has yield significant per- problem, conditional-entropy minimization principle [263] formance gains against shallow domain adaptation meth- that favors the low-density separation between classes in ods. For a better overview of various deep network adap- unlabeled target data Dt was further exploited in [220], tation methods, we divide them into (i) Marginal Alignment- [232], [233]. The entropy minimization is written as Based, (ii) Conditional Alignment-Based, (iii) Batch Normaliza- Nt C tion Layer-Derived, and (iv) Autoencoder-Based, in which the 1 X X t t min − fj(xi) log fj(xi) (39) first three mainly focus on convolutional neural network f∈F N t i=1 j=1 architecture. The taxonomy of deep network adaptation approaches is summarized in Table 5. where fj(x) is the probability that sample x is predicted as class j. Entropy minimization is amount to uncertainty minimization of the predicted labels of target samples. 
6.1 Marginal Alignment-Based

In unsupervised deep domain adaptation frameworks, for reducing the distribution disparity d_{H∆H}(D_S, D_T) between the labeled source domain and the unlabeled target domain, the top-layer features are generally mapped into an RKHS where MMD-based kernel matching between domains is performed, as formalized in Eq.(38). Additionally, by following the assumption of ASVM in [3], the residual ∆f(x) between source and target classifiers was learned in the residual transfer network (RTN) [232], with a residual connection.
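The residual classifier idea shared by ASVM [3] and RTN [232], i.e., f(x) = f^a(x) + ∆f(x), can be sketched as a small PyTorch module; the two-layer residual head and all names are illustrative assumptions, not the RTN architecture itself.

import torch.nn as nn

class ResidualClassifier(nn.Module):
    # Target classifier f(x) = f_src(x) + delta_f(x): a (possibly frozen) source
    # classifier plus a small learnable residual head trained with target data.
    def __init__(self, f_src, feat_dim, num_classes, hidden=64):
        super().__init__()
        self.f_src = f_src
        self.delta_f = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes))

    def forward(self, x):
        return self.f_src(x) + self.delta_f(x)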

6.2 Conditional Alignment-Based should be different. Therefore, researchers have proposed In marginal alignment based deep network adaptation, to embed domain alignment layers into the network based only the top layered feature matching in RKSH spaces on batch normalization layer for domain adaptation [241], was formulated by using the nonparametric MMD metric. [242], [243], [244], [245]. However, the high-level semantic information was not taken Roy et al. [243] proposed a domain specific batch into account in domain alignment, which may degrade whitening (BW) layer instead of BN for reducing domain shift, which is defined as BW (xi,k) = γkxˆi,k + βk and the adaptability of source data trained DNNs to unlabeled T −1 xˆi = WB(xi − µB), such that WB WB = ΣB , where µB target domain due to the class mis-alignment. Theoreti- −1 cally, an unexpected larger upper bound of the joint error is the domain-specific mean vector in a batch B and ΣB λ = minh∈H S(h, ls) + T (h, lt) ≤ minh∈H S(h, ls) + is the domain-specific covariance matrix computed using B T (h, ls) + T (ls, lt) may be encountered, where ls and lt . Wang et al. [245] proposed a transferable normalization are labeling function for source and target domains, respec- (TransNorm), in which domain specific mean vectors µS, µT σ , σ tively. This is because T (ls, lt) can be large enough due to and variances S T are computed, while the learnable class mis-alignment. Therefore, conditional alignment based parameters γ, β to scale and shift the normalized signal deep network adaptation methods were presented jointly are domain shared. Similarly, the AutoDial proposed by with marginal alignment based models [234], [235], [239], Carlucci et al. [242] also shares the two parameters across [236], [237], [238], [240]. Due to the unavailability of target domains and Chang et al. [244] also proposed a domain spe- labels, in such kinds of models, clustering labels or progres- cific batch normalization layer. Besides BN layer, it is worthy sively updated target pseudo-labels (label refinement) are noting that, Lee [252] proposed a dropout adaptation (DTA) generally exploited. Similar to the formulation of MMD in by imposing a learnable dropout mask. Eq.(17), the conditional alignment was generally formulated 6.4 Autoencoder-Based by building between-domain MMD like discrepancy metric 2 dca on the probability p, which reflects the uncertainty As mentioned above, the training of DNNs needs a large predicting a sample to class c. amount of labeled source data. For unsupervised feature learning in domain adaptation, deep autoencoder based C n n X 1 Xs 1 Xt d2 = k p(ys = c|xs) − p(yt = c|xt )k2 network adaptation framework was presented [246], [247], ca n i i n i j [248], [176], [249], [267]. Generic auto-encoders are com- c=1 s i=1 t j=1 (40) prised of an encoder function f(·) and a decoder function g(·), which are typically trained to minimize the reconstruc- Therefore, conditional alignment based deep adaptation tion error. Denoising autoencoders (DAE) were generally model was generally constructed by combining Eq.(38) and constructed with one-layer neural networks for reconstruct- Eq.(40) together. The probability constraint between do- ing original data from partially or randomly corrupted mains can effectively improve the semantic discrimination. data [268]. 
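To make the class-conditional discrepancy of Eq.(40) concrete, the following sketch compares the class-wise average predicted probabilities of a source batch and a target batch; this is a simplified reading of Eq.(40), and the names are illustrative assumptions.

import torch
import torch.nn.functional as F

def conditional_discrepancy(logits_s, logits_t):
    # d_ca^2 of Eq.(40): squared differences between the per-class average
    # predicted probabilities of the source and target batches, summed over classes.
    p_s = F.softmax(logits_s, dim=1).mean(dim=0)   # (C,) average source probabilities
    p_t = F.softmax(logits_t, dim=1).mean(dim=0)   # (C,) average target probabilities
    return torch.sum((p_s - p_t) ** 2)

In practice the target probabilities are usually replaced by progressively refined pseudo-labels, as noted above.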
The denoising autoencoders can be stacked into Actually, l1-norm can also be imposed on the difference a deep network (i.e., SDA), optimized by greedy layer-wise between the probabilities of source and target samples. fashion based on stochastic gradient descent (SGD). The ra- Generally, the basic strategies for constraining the joint tional behind deep autoencoder based network adaptation error λ to be small enough refer to as target pseudo- is that the source data trained parameters of encoder and labeling [236], source pre-trained network freezing [264], decoder can be adapted to represent those samples from a information-theory metric [255], etc. target domain. Very recently, with the privacy protection mechanism, In [246], Glorot et al. proposed a SDA based feature source-free domain adaptation models have emerged [265], representation in conjunction with SVMs for sentiment anal- [266], in which the source samples are used to pre-train a ysis across different domains. Chen et al. [247] proposed CNN model but not explicitly exploited in DA model. a marginalized stacked denoising autoencoder (mSDA), which addressed two crucial limitations of SDAs, such 6.3 Batch Normalization Layer-Derived as high computational cost and low scalability to high- dimensional features, by inducing a closed-form solution of Other than loss functions design for TAL, in this section, parameters without SGD. In [248], Zhuang et al. proposed a we introduce another branch, i.e., network design. That supervised deep autoencoder for learning domain invariant is modifying the internal component of CNN for trans- features. The encoder is constructed with two encoding ferable representation. Batch normalization (BN) has been layers: embedding layer for domain disparity minimization proved to improve the predictive performance and training and label encoding layer for softmax guided source classifier efficiency by normalizing the inputs to be zero-mean and training. Suppose x, z and xˆ to be the input sample, inter- univariance, and further scales and shifts the normalized mediate representation (encoded) and reconstructed output signals with two trainable parameters.Given a batch B, each xi,k−µB,k (decoded), respectively, then there is xi ∈ B is transformed as BN(xi,k) = γk √ + βk, σ2 + B,k z = f(x), xˆ = g(z) (41) where k (1 ≤ k ≤ d) indicates the k-th dimension of xi, µB,k and σB,k are the mean and std. w.r.t. the k-th dimension in where z is the intermediate feature representation of sample batch B,  is a constant for numerical stability, γk and βk are x. Generally, stacked deep autoencoder based TAL frame- learnable scaling and shifting parameters. For cross-domain work can be written as applications, due to the datasets bias and domain shift, min J (x, xˆ) + λΩ(zs, zt) + βL(zs, ys, θ) + γR(f, g) the means and variance in a batch for different domains f,g,θ (42) JOURNAL OF LATEX CLASS FILES, VOL. -, NO. -, MARCH 2019 16

where f is domain shared encoder, g is domain shared representations by training robust DNNs. Currently, adver- decoder, J (·) = Js(·) + Jt(·) represents the reconstruc- sarial learning has become an increasing popular idea for tion error loss (e.g., l2-norm squared loss), Ω(·) is the addressing TAL issues, by minimizing the between-domain distribution discrepancy metric between source feature zs discrepancy through an adversarial objective (e.g., binary and target feature zt, L(·) is the classifier loss (e.g. cross- domain discriminator), instead of the generic MMD-based entropy) with parameter θ learned on the set(zs, ys), and domain disparity in RKHS spaces. In fact, minimization of R(·) is the regularizer of the network parameters of f the domain disparity is amount to domain confusion in and g. In [248], Kullback-Leibler (KL) divergence [269] a learned feature space, where the domain discriminator based distribution distance metric was considered. KL is cannot discriminate which domain a sample comes from. In a non-symmetric measure of the divergence between two this paper, the adversarial adaptation based TAL approaches probability distributions P and Q, which was defined as are divided into three types: (i) Gradient Reversal-Based, (ii) P P (i) DKL(P ||Q) = i P (i) ln( Q(i) ). Smaller value of DKL(·) Minimax Optimization-Based and (iii) Generative Adversarial means higher similarity of two distributions. Due to that Net-Based. The first two resort to feature-level domain con- DKL(P ||Q) 6= DKL(Q||P ), a symmetric KL version was fusion supervised by a domain discriminator for domain used in [248], in which the Ω(·) in Eq.(42) was written as distribution discrepancy minimization, while the last one tends to pixel-level domain transfer by synthesizing implau- Ω(zs, zt) = DKL(Ps||Pt) + DKL(Pt||Ps) (43) sible target domain images. Actually, domain adaptation is in essence a minimax problem. The taxonomy of adversarial z¯s z¯t where Ps = and Pt = represent the probability Σ¯zs Σ¯zt adaptation approaches is summarized in Table 6. distribution of source and target domains. z¯s and z¯t repre- sent the mean vector of encoded feature representations of TABLE 6 source and target samples, respectively. Our Taxonomy of Adversarial Adaptation Approaches Similar to the reconstruction protocol in stacked autoen-

coder, a related work with deep reconstruction based on ADVERSARIAL ADAPTATION MODEL BASIS REFERENCE convolutional neural networks can be referred to as [250], in Domain [270], [271], [224], which the encoded source feature representation is feeded Gradient Reversal Confusion [272], [273], [274], into the source classifier for visual classification and simul- &GRL [275], [276], [277] taneously into the decoder module for reconstructing the Domain [278], [279], [280] target data. Under this framework, a shared encoder for Minimax Optimization Confusion [281], [282], [283], both domains can be learned. &Game [284], [285], [286] [287], [288], [289], Pixel-level [290], [291], [44], GANs-Based 6.5 Discussion and Summary Image Synthesis [292], [293], [294] In this section, deep network adaptation advances are pre- [295], [51], [50] sented and categorized, which mainly contains three types of technical challenges: marginal alignment based, condi- tional alignment based and autoencoder based. A com- 7.1 Gradient Reversal-Based mon characteristic of these methods is that the softmax guided cross-entropy loss based on labeled source data was In adversarial optimization of DNNs between the general minimized for classifier learning. In marginal alignment cross-entropy loss for source classifier learning and the based models, the distribution discrepancy of feature rep- domain discriminator for domain label prediction, Ganin resentation from top layers is generally characterized by and Lempitsky [270] firstly demonstrated that the domain MMD. Besides that, the semantic similarity across domains adaptation behavior can be achieved by adding a simple but was further characterized in conditional alignment based effective gradient reversal layer (GRL). The augmented deep models. Different from both marginal and conditional align- architecture can still be trained using standard stochastic ment models, the autoencoder based ones tend to learn do- gradient descent (SGD) based backpropagation. The gradi- main invariant feature embedding by imposing a Kullback- ent reversal based adversarial adaptation network consists Leibler divergence in feature embedding layer. of three parts: domain-invariant feature representation θf , Despite recent advances deep network adaptation faces visual classifier θc and domain classifier θd. Objectively, θf several challenges: 1) a number of labeled source data is can be learned by trying to minimize the visual classifier needed for training (fine-tuning) a deep network; 2) the loss Lc and simultaneously maximize the domain classifier confidence of an unlabeled target sample predicted to class k loss Ld, such that the feature representation can be domain is sometimes very low when domain disparity is very large; invariant (i.e. domain confusion) and class discriminative. 3) the interpretability of deep adaptation is not optimistic Therefore, in backpropagation optimization of θf , the con- ∂Lc tributed gradients from losses Lc and Ld are and and negative transfer is easily encountered. ∂θf −λ ∂Ld , respectively. The essence of GRL lies in the reversal ∂θf gradient with negative multiplier −λI. 
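A minimal PyTorch sketch of such a gradient reversal layer is given below: the forward pass is the identity, while the backward pass multiplies the incoming gradient by −λ; class and function names are illustrative assumptions rather than code from [270].

import torch

class GradReverse(torch.autograd.Function):
    # identity in the forward pass; gradient multiplied by -lambda in the backward pass
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # reversed gradient; no gradient for lam

def grad_reverse(x, lam=1.0):
    # features routed through this call feed the domain classifier, so minimizing the
    # domain loss pushes the feature extractor toward domain confusion
    return GradReverse.apply(x, lam)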
7 ADVERSARIAL ADAPTATION More recently, the gradient reversal based adversarial Adversarial learning, originated from the generative adver- strategy has been used for domain adaptation [271], [224], sarial net (GAN) [36], is a promising approach for gen- [272], [273], [296] under CNN architecture, domain adap- erating pixel-level target samples or feature-level target tive object detection [274], [297], [298] under Faster-RCNN JOURNAL OF LATEX CLASS FILES, VOL. -, NO. -, MARCH 2019 17

framework, large-scale kinship verification [275], [276] and where DS and DT mean the source and target domain fine-grained visual classification [277] under Siamese net- samples, and YS denotes the source data labels. work. By following a similar protocol with [270], in [271], Under this basic framework in Eq.(44), Tzeng et al. [281] [274], a binary domain classifier (e.g., source is labeled further proposed an adversarial discriminative domain as +1 and -1 for target) was designed as an adversarial adaptation (ADDA) method, in which two CNNs were sep- objective for learning domain-invariant features by deploy- arately learned for source and target domain. The training ing a GRL layer. In [275], [276], two methods, AdvNet of source CNN relied only on the source data and labels by and Adv-Kin were proposed, in which a general Siamese minimizing the cross-entropy loss LC , while the target CNN network was constructed with three fully-connected (fc-) and the domain discriminator loss LD was alternatively layers for similarity learning. The reversal gradient with trained in an adversarial fashion with the source CNN negative multiplier −λI was placed in the 1st fc-layer fixed. Rozantsev et al. [283] proposed a residual parameter (MMD-loss), the generic contrastive loss was deployed in transfer model with adversarial domain confusion super- the 2nd fc-layer and the softmax guided cross-entropy loss vised by a domain classifier, in which the residual transform was deployed in the last fc-layer. In [272], Pei et al. argued between domains was deployed in convolutional layers. that single domain discriminator based adversarial adap- For augmenting the domain-specific feature representation, tation only aligns the between-domain distribution with- Long et al. [284] proposed a conditional domain adversarial out exploiting the multimode structures. Therefore, they network (CDAN), in which the feature representation and proposed a multi-adversarial domain adaptation (MADA) classifier prediction were integrated via multilinear map method based on GRL with multiple class-wise domain dis- for jointly learning the domain classifier. More recently, criminators for capturing multimode structures, such that Saito et al. [286] proposed a novel adversarial strategy, fine-grained alignment of different distributions is enabled. i.e., maximum classifier discrepancy (MCD), which aims Also, Zhang et al. [273] proposed a collaborative adversarial to maximize the discrepancy between two classifiers’ out- network (CAN) by designing multiple domain classifiers, puts instead of domain discriminator. The feature extractor one for each feature extraction block in CNN. aims to minimize the two classifiers’ discrepancy. They argued that the general domain discriminator does not take 7.2 Minimax Optimization-Based into account the task-specific decision boundaries between classes, which may lead to ambiguous features near class In GANs, the two key parts G and D are often placed with boundaries from the feature extractor. In MCD, only two an adversarial state, and generally solved by using a mini- classifiers are designed. By following the wisdom of crowd max based gaming optimization method [36]. 
Therefore, the principle, a stochastic classifier (STAR) [299] which can minimax optimization based adversarial adaptation can be sample an arbitrary number of classifiers from a Gaussian implemented for domain confusion, through an adversarial distribution rather than two classifiers is proposed and objective of the domain discriminator or regressor [278], achieves better performance. By following the adversarial [279], [280], [281], [282], [283], [284], [285], [240]. Minimax principle of double classifiers, a classifier competition model optimization based adversarial adaptation training of DNNs for reliable domain adaptation was proposed in [300]. originated in 2015 [278], [279]. Domain confusion maxi- mization based adversarial domain adaptation was first proposed by Tzeng et al. [278], in which an adversarial CNN 7.3 Generative Adversarial Net-Based framework was deployed with classification loss, soft label In generative adversarial net (GAN) [36] and its variants, loss and two adversarial objectives i.e., domain confusion two key parts: generator G and discriminator D are gener- loss and domain discriminator loss. In [279], Ajakan et al. ally composed. The generator G aims to synthesize implau- firstly proposed an adversarial training of stacked autoen- sible images by using the encoder and decoder, while the coders (DANN) deployed with classification loss and an discriminator D plays a role in identification of authenticity adversarial objective i.e., domain regressor loss. by recognizing a sample to be true or false. A minimax Suppose the labeled source data trained visual classifier gaming based alternative optimization scheme is generally C D to be , the domain discriminator to be , and the feature used for solving G and D. In TAL studies, started from F representation to be . The corresponding parameters are 2017, GAN based models have been presented to synthesize θ θ θ defined as C , D and F . The general adversarial adap- distribution approximated pixel-level images with target L tation model aims to minimize the visual classifier loss C domain and then enable the cross-domain image classifi- L and maximize the domain discriminator loss D by learning cation by using synthesized image samples (e.g., objects, θ F F , such that the feature representation function can be scenes, pedestrians and faces, etc.) [287], [288], [289], [290], more discriminative and domain-invariant. Simultaneously, [291], [44], [292], [293], [294], [301], [302]. the adversarial training aims to minimize the domain dis- Under the CycleGAN framework proposed by Zhu et criminator loss L under θ . Generally, maximizing L D F D al. [43], Hoffman et al. [287] firstly proposed a cycle- is amount to maximizing the domain confusion, such that consistent adversarial domain adaptation model (CyCADA) it cannot discriminate which domain the samples come for adapting representations in both pixel-level and feature- from, and vice versa. The above process can be generally level without requiring aligned pairs, by jointly minimizing formulated as the following adversarial adaptation model, pixel loss, feature loss, semantic loss and cycle consistence min LC (DS, YS; θC , θF ) − λLD(DS, DT , θD; θF ) loss. Bousmalis et al. [288] and Taigman et al. [289] pro- θ ,θ C F (44) posed GAN-based models for unsupervised image-level min LD(DS, DT , θF ; θD) domain adaptation, which aims to adapt source domain θD JOURNAL OF LATEX CLASS FILES, VOL. -, NO. 
-, MARCH 2019 18 images to appear as if drawn from target domain with well- 8.1 Office-31 (3DA) preserved identity. In [290], Hu et al. proposed a duplex Office-31 is a popular benchmark for visual domain transfer, GAN (DupGAN) for image-level domain transformation, which includes 31 categories of samples drawn from three in which the duplex discriminators, one for each domain, different domains, i.e., Amazon (A), DSLR (D) and Webcam were trained against the generator for ensuring the reality of (W). Amazon consists of online e-commerce pictures, DSLR the domain transformation. Murez et al. [291] and Hong et contains high-resolution pictures and Webcam contains low- al. [292] proposed image-to-image translation based domain resolution pictures taken by a web camera. There are totally adaptation models by leveraging GAN and synthetic data 4652 images, composed of 2817, 498 and 795 images from for semantic segmentation of the target domain images. domain A, D and W, respectively. In feature extraction, (1) Person re-identification (ReID) is typically a cross-domain for shallow features, 800-dimensional feature vectors ex- feature match and retrieval problem [191], [174]. Recently, tracted by the Speed Up Robust Features (SURF) were used, for addressing ReID challenges in complex scenarios, GAN- and (2) for deep features, 4096-dimensional feature vec- based domain adaptation was presented for implausible tors extracted from pre-trained AlexNet/VGG-net/ResNet- person image generation from source to target domain [295], 50 were generally used. In model evaluation, six kinds of [293], [294], across different visual cues and styles, such as source-target domain pairs were tested, i.e., A→D, A→W, poses, backgrounds, lightings, resolutions, seasons, etc. Ad- D→A, D→W,W→A, W→D. ditionally, GAN based cross-domain facial image generation for pose-invariant face representation, face frontalization and rotation were intensively studied [39], [53], [51], [50], 8.2 Office+Caltech-10 (4DA) all of which tend to address domain adaptation and transfer problems in face recognition across poses. This 4DA dataset contains 4 domains, in which 3 domains (A, D, W) are from the Office-31 and another domain (C) is from Caltech-256, a benchmark containing 30,607 images 7.4 Discussion and Summary of 256 classes in object recognition. The common 10 classes among the Office-31 and Caltech-256 were selected to form In this section, adversarial adaptation is presented with the 4DA, and therefore 2,533 images composed of 958, 157, three streams, including gradient reversal, minimax opti- 295 and 1123 images from domain A, D, W and C were mization and generative adversarial net (GAN). The gra- collected. In evaluation, 12 tasks with different source-target dient reversal and minimax optimization share a common domain pairs are addressed, i.e., A→D, A→C, A→W, D→A, characteristic, i.e., feature-level adaptation, by introducing a D→C, D→W, C→A, C→D, C→W, W→A, W→C, W→D. domain discriminator based adversarial objective for train- ing against the feature extractor. The difference between them is the against strategy. Different from both of them, 8.3 MNIST+USPS GAN-based adversarial adaptation focuses on pixel-level MNIST and USPS are two benchmarks containing 10 cat- adaptation, i.e., image generation from source domain to egories of digit images under different distribution for a target, such that the synthesized implausible images are handwritten digit recognition, and therefore qualified for as if drawn from target domain. 
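To complement the pixel-level view above, a compact sketch of the feature-level minimax objective of Eq.(44), with alternating updates of the domain discriminator and the feature extractor, is given below; the optimizer handling, the inverted-label trick and all names are illustrative assumptions rather than the recipe of any specific cited method.

import torch
import torch.nn.functional as F

def adversarial_step(feat_net, clf, disc, opt_fc, opt_d, xs, ys, xt, lam=1.0):
    # one alternating update of the minimax game in Eq.(44)
    # (1) domain discriminator: source -> 1, target -> 0, features detached
    fs, ft = feat_net(xs).detach(), feat_net(xt).detach()
    out_s, out_t = disc(fs), disc(ft)
    d_loss = F.binary_cross_entropy_with_logits(out_s, torch.ones_like(out_s)) \
           + F.binary_cross_entropy_with_logits(out_t, torch.zeros_like(out_t))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # (2) feature extractor + classifier: classify source, confuse the discriminator
    opt_fc.zero_grad()
    fs, ft = feat_net(xs), feat_net(xt)
    c_loss = F.cross_entropy(clf(fs), ys)
    out_t = disc(ft)
    confusion = F.binary_cross_entropy_with_logits(out_t, torch.ones_like(out_t))
    (c_loss + lam * confusion).backward()   # inverted labels stand in for maximizing L_D
    opt_fc.step()
    return c_loss.item(), d_loss.item()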
TAL tasks. The MNIST includes 60,000 training pictures Adversarial adaptation is recognized to be an emerg- and 10,000 test pictures. The USPS includes 7291 training ing perspective, despite these advances it still faces with pictures and 2007 test pictures. For TAL tasks, 2000 pictures several challenges: 1) the domain discriminator is easily and 1800 pictures were randomly selected from MNIST overtrained; 2) maximizing only domain confusion easily and USPS, respectively. For feature extraction, each image leads to class bias; 3) the gaming between feature generator was resized into 16×16 and a 256-dimensional feature vec- and discriminator is human dependent. tor that encode the pixel values was finally extracted. In evaluation, 2 cross-domain tasks, i.e., MNIST→USPS and USPS→MNIST are addressed. 8 BENCHMARK DATASETS OF VTAL In this section, the benchmark datasets for testing TAL 8.4 Multi-PIE models are introduced to facilitate readers’ impression on how to start studies of transfer adaptation learn- Multi-PIE is a benchmark with poses, illuminations and ing. Totally, 12 benchmark datasets including Office- expressions in face recognition, which includes 41,368 faces 31 (3DA) [5], Office+Caltech-10 (4DA) [5], [120], [218], of 68 different identities. For TAL tasks, (1) face recognition [303], MNIST+USPS [139], [140], Multi-PIE [139], [140], across poses is generally evaluated on five different face COIL-20 [304], MSRC+VOC2007 [125], IVLSC [305], [306], orientations, including C05: left pose, C07: upward pose, AwA [307], Cross-dataset Testbed [1], Office Home [308], C09: downward pose, C27: front pose and C29: right pose. ImageCLEF [309], and P-A-C-S [305] are summarized, each Totally, 3332, 1629, 1632, 3329, and 1632 facial images are of which contains at least 2 different domains. For these contained in C05, C07, C09, C27 and C29. Therefore, 20 benchmarks, classification accuracy of target data is com- tasks were evaluated, i.e., C05→C07, C05→C09, C05→C27, monly used for performance comparison. Due to the space etc.; (2) face recognition across illuminations and exposure limitation, the classification performances of different mod- conditions is evaluated by randomly selecting two sets: PIE1 els are not listed, so that we can focus more on discussing and PIE2 from front face images. Two tasks: PIE→PIE2 and the methodological advances and potential issues. PIE2→PIE1 were evaluated. JOURNAL OF LATEX CLASS FILES, VOL. -, NO. -, MARCH 2019 19 8.5 COIL-20 8.11 ImageCLEF COIL-20 is a 3D object recognition benchmark containing This benchmark includes 1800 images of 12 categories, 1440 images of 20 object categories. By rotating each object which were drawn from 3 domains: 600 images in Caltech class horizontally of 5 degrees, 72 images per class after 256 (C), 600 images in ImageNet ILSVRC2012 (I), and 600 rotating 360 degrees were obtained. For TAL tasks, two images in Pascal VOC2012 (P). Therefore, 6 cross-domain disjoint subsets with different distribution i.e., COIL1 and tasks i.e., C→I, C→P, I→C, I→P, P→C, P→I were evaluated. COIL2 were prepared, where COIL1 contains the images in ◦ ◦ ◦ ◦ ◦ [0 , 85 ]U[180 , 265 ] and the images of COIL2 are in [90 , 8.12 P-A-C-S 175◦]U[270◦, 355◦]. Therefore, two cross-domain tasks i.e., PACS is a new benchmark containing 7 common categories: COIL1→COIL2 and COIL2→COIL1 were evaluated. 
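Since all of the benchmarks above are evaluated over ordered source→target domain pairs, a tiny helper for enumerating such tasks may be handy; the function name is an illustrative assumption.

from itertools import permutations

def make_tasks(domains):
    # all ordered source->target pairs, e.g. make_tasks(['A', 'D', 'W', 'C'])
    # yields the 12 Office+Caltech-10 tasks A->D, A->W, ..., C->W
    return [f"{s}->{t}" for s, t in permutations(domains, 2)]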
dog, elephant, giraffe, guitar, horse, house and person, from 4 domains, i.e., 1670 images in Photo (P), 2048 images in Art 8.6 MSRC+VOC2007 Painting (A), 2344 images in Cartoon (C), and 3929 images The MSRC contains 4323 images of 18 categories and in Sketch (S). In feature representation, 4096-dimensional VOC2007 contains 5011 images of 20 categories. 1269 and VGG-M deep features were used and 12 cross-domain tasks 1530 images w.r.t six common categories, i.e., aeroplane, bicy- are evaluated, e.g., P→A, P →C, P→S, A→P, A→C, etc. cle, bird, car, cow and sheep, were finally selected from MSRC and VOC2007, respectively. In feature representation, 128- 8.13 Discussion and Summary dimensional DenseSIFT features were extracted for cross- In this section, 12 benchmarks constructed based on popular domain image classification tasks, i.e., MSRC→VOC2007 datasets in computer vision such as ImageNet, ILSVRC, and VOC2007→MSRC. PASCAL VOC, Caltech-256, multi-PIE and MNIST for ad- dressing cross-domain image classification tasks and evalu- 8.7 IVLSC ating the DA models are presented. Despite these endeavors IVLSC is a large-scale image dataset containing five subsets, made by researchers, more benchmarks in cross-domain i.e., ImageNet (I), VOC2007 (V), LabelMe (L), SUN09 (S), vision understanding problems we could see, namely: object and Caltech (C). For TAL tasks, 7341, 3376, 2656, 3282, and detection, semantic segmentation, visual relation modeling, 1415 samples w.r.t. five common categories i.e., bird, cat, scene parsing, etc. are future challenges for more universal chair, dog and human, were randomly selected from I, V, L, and safe applications. These can better testify the practicality S, and C domains, respectively. In feature representation, of TAL methodologies. 4096-dimensional DeCaf6 deep features were extracted for Pleasantly, with the stronger models and algorithms, cross-domain image classification under 20 tasks, i.e., I→V, the performance on the benchmarks has taken a big step I→L, I→S, I→C, ..., C→I, C→V, C→L, C→S. forward. However, we have to denote that the existing UDA setting has always used target data for training, which, 8.8 AwA inevitably leads to overfitting and loses fairness. Therefore, we must be aware of this, and a reasonable and scientific AwA is an animal identification dataset containing 30,475 training and testing protocol is expected to effectively evalu- images of 50 categories, which provides a benchmark due ate the DA models. Considering the necessity of target data, to the inherent data distribution difference. This data set is a possible solution is that, as few-shot learning does, partial currently less used in evaluating TAL algorithms. unlabeled target samples are used as seen samples (training) for determining an ideal hypothesis h∗ with small joint error ∗ ∗ 8.9 Cross-dataset Testbed λ = S(h ) + T (h ), and the unseen target samples (test) This benchmark contains 10,473 images of 40 categories, col- are used for model evaluation. lected from three domains: 3,847 images in Caltech256 (C), 4,000 images in ImageNet (I), and 2,626 images in SUN (S). 9 VTAL IN MORE VISION TASKS In feature extraction, the 4096-dimensional DeCAF7 deep The first 8 sections presents the preliminaries, methodolog- features were used for cross-domain image classification ical advances and benchmarks of VTAL for image classi- tasks, i.e., C→I, C→S, I→C, I→S, S→C, S→I. fication, a fundamental vision task. 
In this section, from the perspective of applications, we have a discussion on 8.10 Office Home VTAL theory utilized in more vision tasks, such as ob- Office Home is a relatively new benchmark containing ject detection and semantic segmentation, as illustrated in 15,585 images of 65 categories, collected from 4 domains, Fig. 1. It is not difficult to imagine that, for example, fog i.e., (1) Art (Ar): artistic depictions of objects in the form caused image degradation will inevitably lead to detection of sketches, paintings, ornamentation, etc.; (2) Clipart (Cl): performance degradation. A preliminary research in [310] collection of clipart images; (3) Product (Pr): images of finds that degradation removal does not help CNN-based objects without a background, akin to the Amazon category image classification, because degradation removal focuses in Office dataset; (4) Real-World (RW): images of objects on pixel manipulation for image enhancement but does captured with a regular camera. In detail, there contains not bring any new information beneficial to classification. 2421, 4379, 4428 and 4357 images in Ar, Cl, Pr and RW Therefore, cross-dataset object detection/segmentation (e.g., domains, respectively. In evaluation, 12 cross-domain tasks from cityscapes to foggy cityscapes) rather than learning to were tested, e.g., Ar→Cl, Ar→Pr, Ar→RW, Cl→Ar, etc. dehaze is an important but challenging vision task. In the JOURNAL OF LATEX CLASS FILES, VOL. -, NO. -, MARCH 2019 20 past two years, this topic has attracted researcher’ attention domains can be found [330], and co-adaptation expects and a significant progress is witnessed. Specifically, many to capture commonality and specificity among multiple do- domain adaptive object detection models [274], [297], [311], mains. Additionally, the essence of TAL is to learn domain- [298], [312], [313], [314], [315] and domain adaptive semantic invariant but class-discriminative representation for unla- segmentation models [316], [317], [318], [319], [320], [321], beled target data regardless of close-set, open-set or partial [322], [323] have been proposed. These works have well DA. Therefore, unsupervised/self-supervised learning idea revealed that the versatile VTAL theory and algorithm can can be a choice used to improve TAL toward universal rep- be combined with detection network (e.g., Faster-RCNN) resentations beyond the domain-invariant representations. or segmentation network (i.e. CNN-based) for more gen- To uncover the knowledge used to transfer, meta learning eralized vision tasks in the wild. Besides, VTAL has also can be a choice towards explainable TAL. The meta-transfer been used for cross-domain visual retrieval [324], person re- learning in [331] and task disentanglement [332] show some identification [325], person search [326], image restoration instructive insights on learning what and where to transfer. and dehazing [327], [328]. It is hopeful that VTAL will play Undoubtedly, a huge and pleasant progress in TAL an increasingly important role in other vision tasks with model and algorithm is witnessed with blowout publi- more heuristic models and algorithms towards transferable cations, which has well answered how to facilitate domain and universal representations. adaptation under the well-established expected target error upper bound theory. However, we are not optimistic about this and three open questions that remain to be unanswered ONCLUSIONAND UTLOOK 10 C O are stressed. 
1) When will we need transfer adaptation Transfer adaptation learning is an energetic research field learning for a given application scenario? The basic ana- which aims to learn domain adaptive representations and lyzing condition of whether cross-domain happens is still classifiers from source domains toward representing and not clear, which makes TAL blind. 2) Where does the do- recognizing samples from a distribution different but se- main adaptation theory go when two domains are com- mantic related target domain. This paper surveyed recent pletely disjoint (i.e., large domain discrepancy) in open- advances in transfer adaptation learning in the past decade world (dynamic) scenarios? The domain adaptation theory and present a new taxonomy of five technical challenges is scenario-dependent and the impossibility theory [333] of being faced by researchers: instance re-weighting adapta- TAL needs to be broken. Also, studying the target error tion, feature adaptation, classifier adaptation, deep network lower-bound may be more instructive than the expected error adaptation and adversarial adaptation. Actually, a number upper-bound, but seems to be paid less attention. 3) What of models are overlapped among the five subcategories, useful knowledge has been transferred from the source and the most appealing combination in most recent works to target? Currently, it cannot be explicitly explained. We is re-weighting+deep network+adversarial strategies, which observe these heuristic but promising directions of transfer are not specially discussed due to space limitation. Besides, adaptation learning for future research. Addressing these 12 visual benchmarks that address multiple cross-domain issues and challenges can well improve the universality, recognition tasks are collected and summarized to help interpretability and credibility of TAL in open-world scenarios facilitate researchers’ insight on the tasks and scenarios that toward safer applications. transfer adaptation learning aims to address. The proposed taxonomy of transfer adaptation learning challenges in this paper provides a framework for better ACKNOWLEDGMENT understanding and identifying the status of the field, future The author would like to thank the pioneer researchers research challenges and directions. Each challenge was sum- in transfer learning, domain adaptation and other related marized with a discussion of existing problems and future fields. The author would also like to thank Dr. Mingsheng direction, which, we believe, are worth studying for bet- Long and Dr. Lixin Duan for their kindly help in providing ter capturing general domain knowledge, toward universal insightful discussions. machine learning. The typical problems of transfer learning and domain adaptation are negative transfer (i.e., model over-fitting) and under-adaptation (i.e., model under-fitting), REFERENCES which attracted many researchers’ focus in the past decade, [1] A. Torralba and A. A. Efros, “Unbiased look at dataset bias,” in but not well addressed due to the difficulty in explicitly CVPR, 2011, pp. 1521–1528. characterizing the domain discrepancy. There lacks of some [2] H. Shimodaira, “Improving predictive inference under covariate judgment conditions and evaluation criteria. Negative trans- shift by weighting the log-likelihood function,” J. Stat. Plan. transferring knowledge from source can Inference, vol. 90, no. 2, pp. 227–244, 2000. fer can be stated as [3] J. Yang, R. Yan, and A. G. 
10 CONCLUSION AND OUTLOOK

Transfer adaptation learning is an energetic research field which aims to learn domain adaptive representations and classifiers from source domains toward representing and recognizing samples from a distribution-different but semantically related target domain. This paper surveyed recent advances in transfer adaptation learning in the past decade and presented a new taxonomy of five technical challenges being faced by researchers: instance re-weighting adaptation, feature adaptation, classifier adaptation, deep network adaptation and adversarial adaptation. Actually, a number of models overlap among the five subcategories, and the most appealing combination in most recent works is re-weighting + deep network + adversarial strategies, which is not specially discussed due to space limitation. Besides, 12 visual benchmarks that address multiple cross-domain recognition tasks are collected and summarized to help facilitate researchers' insight into the tasks and scenarios that transfer adaptation learning aims to address.

The proposed taxonomy of transfer adaptation learning challenges in this paper provides a framework for better understanding and identifying the status of the field, future research challenges and directions. Each challenge was summarized with a discussion of existing problems and future directions, which, we believe, are worth studying for better capturing general domain knowledge, toward universal machine learning. The typical problems of transfer learning and domain adaptation are negative transfer (i.e., model over-fitting) and under-adaptation (i.e., model under-fitting), which attracted many researchers' focus in the past decade but have not been well addressed due to the difficulty of explicitly characterizing the domain discrepancy; judgment conditions and evaluation criteria are still lacking. Negative transfer can be stated as follows: transferring knowledge from the source can have a negative impact on the target learner. In [329], three impacting factors leading to underlying negative transfer are stressed, namely algorithmic choice, distribution divergence and the size of labeled target data. Throughout the entire research lines, one specific research area of transfer adaptation learning that seems to be still under-studied is the co-adaptation of multiple but heterogeneous domains, which goes beyond two homogeneous domains. This challenge is closer to real-world scenarios in which numerous domains can be found [330], and co-adaptation expects to capture commonality and specificity among multiple domains. Additionally, the essence of TAL is to learn domain-invariant but class-discriminative representations for unlabeled target data regardless of close-set, open-set or partial DA. Therefore, unsupervised/self-supervised learning ideas can be a choice for improving TAL toward universal representations beyond domain-invariant representations. To uncover the knowledge used to transfer, meta learning can be a choice towards explainable TAL. The meta-transfer learning in [331] and the task disentanglement in [332] show some instructive insights on learning what and where to transfer.

Undoubtedly, huge and pleasant progress in TAL models and algorithms has been witnessed with blowout publications, which has well answered how to facilitate domain adaptation under the well-established expected target error upper-bound theory. However, we are not optimistic about this, and three open questions that remain unanswered are stressed. 1) When will we need transfer adaptation learning for a given application scenario? The basic analyzing condition of whether a cross-domain problem happens is still not clear, which makes TAL blind. 2) Where does the domain adaptation theory go when two domains are completely disjoint (i.e., large domain discrepancy) in open-world (dynamic) scenarios? The domain adaptation theory is scenario-dependent, and the impossibility theory [333] of TAL needs to be broken. Also, studying the target error lower-bound may be more instructive than the expected error upper-bound, but it seems to have received less attention. 3) What useful knowledge has been transferred from the source to the target? Currently, it cannot be explicitly explained. We observe these heuristic but promising directions of transfer adaptation learning for future research. Addressing these issues and challenges can well improve the universality, interpretability and credibility of TAL in open-world scenarios toward safer applications.

ACKNOWLEDGMENT

The author would like to thank the pioneer researchers in transfer learning, domain adaptation and other related fields. The author would also like to thank Dr. Mingsheng Long and Dr. Lixin Duan for their kind help in providing insightful discussions.

REFERENCES

[1] A. Torralba and A. A. Efros, “Unbiased look at dataset bias,” in CVPR, 2011, pp. 1521–1528.
[2] H. Shimodaira, “Improving predictive inference under covariate shift by weighting the log-likelihood function,” J. Stat. Plan. Inference, vol. 90, no. 2, pp. 227–244, 2000.
[3] J. Yang, R. Yan, and A. G. Hauptmann, “Cross-domain video concept detection using adaptive svms,” in ACM MM, 2007, pp. 188–197.
[4] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[5] K. Saenko, B. Kulis, M. Fritz, and T. Darrell, “Adapting visual category models to new domains,” in ECCV, 2010.
[6] L. Pratt, J. Mostow, and C. Kamm, “Direct transfer of learned information among neural networks,” in AAAI, 1991.
[7] L. Pratt, “Discriminability-based transfer between neural networks,” in NIPS, 1993.

[8] S. Salaken, A. Khosravi, T. Nguyen, and S. Nahavandi, “Extreme [38] L. Tran, X. Yin, and X. Liu, “Representation learning by rotating learning machine based transfer learning algorithms: A survey,” your faces,” IEEE Trans. Pattern Analysis and Machine Intelligence, Neurocomputing, vol. 267, pp. 516–524, 2017. 2018. [9] K. Weiss, T. Khoshgoftaar, and D. Wang, “A survey of transfer [39] L. Tran, X. Yin, and X. Liu, “Disentangled representation learning learning,” J. Big Data, vol. 3, pp. 1–40, 2016. gan for pose-invariant face recognition,” in CVPR, 2017. [10] L. Shao, F. Zhu, and X. Li, “Transfer learning for visual cate- [40] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, gorization: A survey,” IEEE Trans. Neural Networks and Learning “Generative adversarial text to image synthesis,” in ICML, 2016. Systems, vol. 26, no. 5, pp. 1019–1034, 2015. [41] A. Radford, L. Metz, and S. Chintala, “Unsupervised represen- [11] W. Pan, “A survey of transfer learning for collaborative recom- tation learning with deep convolutional generative adversarial mendation,” Neurocomputing, vol. 177, pp. 447–453, 2016. networks,” in ICLR, 2016. [12] V. Patel, R. Gopalan, R. Li, and R. Chellappa, “Visual domain [42] P. Isola, J. Zhu, T. Zhou, and A. Efros, “Image-to-image transla- adaptation,” IEEE Signal Processing Magazine, pp. 53–69, 2015. tion with conditional adversarial networks,” in CVPR, 2017, pp. [13] J. Lu, V. Behbood, P. Hao, H. Zuo, S. Xue, and G. Zhang, “Transfer 1125–1134. learning using computational intelligence: A survey,” Knowledge [43] J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to- Based Systems, vol. 80, pp. 14–23, 2015. image translation using cycle-consistent adversarial networks,” [14] M. Wang and W. Deng, “Deep visual domain adaptation: A in ICCV, 2017. survey,” Neurcomputing, vol. 312, pp. 135–153, 2018. [44] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo, “Star- [15] D. Cook, K. Feuz, and N. Krishnan, “Transfer learning for activity gan: Unified generative adversarial networks for multi-domain recognition: a survey,” Knowl. Inf. Syst., vol. 36, pp. 537–556, 2013. image-to-image translation,” in CVPR, 2018, pp. 8789–8797. [16] G. Csurka, “Domain adaptation for visual applications: A com- [45] M. Liu, T. Breuel, and J. Kautz, “Unsupervised image-to-image prehensive survey,” in arXiv, 2017. translation networks,” in NIPS, 2017. [17] G. Hinton and R. Salakhutdinov, “Reducing the dimensionality [46] D. Yoo, N. Kim, S. Park, A. Paek, and I. Kweon, “Pixel-level of data with neural networks,” Science, vol. 313, no. 5786, pp. domain transfer,” in ECCV, 2016. 504–507, 2006. [47] L. Gatys, A. Ecker, and M. Bethge, “Image style transfer using [18] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. convolutional neural networks,” in CVPR, 2016, pp. 2414–2423. 521, pp. 436–444, 2015. [48] L. Gatys, A. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman, [19] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classifica- “Controlling perceptual factors in neural style transfer,” in CVPR, tion with deep convolutional neural networks,” in NIPS, 2012. 2017, pp. 3985–3993. [20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for [49] J. Johnson, A. Alahi, and F. Li, “Perceptual losses for real-time image recognition,” in CVPR, 2015, pp. 770–778. style transfer and super-resolution,” in ECCV, 2016, pp. 694–711. [21] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real- [50] Y. Hu, X. Wu, B. Yu, R. He, and Z. 
Sun, “Pose-guided photoreal- time object detection with region proposal networks,” in NIPS, istic face rotation,” in CVPR, 2018. 2015. [51] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker, “Towards [22] J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object detection via region large-pose face frontalization in the wild,” in ICCV, 2017. based fully convolutional networks,” in NIPS, 2016. [52] J. Cao, Y. Hu, B. Yu, R. He, and Z. Sun, “Load balanced gans for [23] W. Liu, D. Anguelov, D. Erhan, S. Christian, S. Reed, C. Fu, and multi-view face image synthesis,” in arXiv:1802.07447, 2018. A. Berg, “Ssd: single shot multibox detector,” in ECCV, 2016. [53] R. Huang, S. Zhang, T. Li, and R. He, “Beyond face rotation: [24] C. Fu, W. Liu, A. Ranga, A. Tyagi, and A. Berg, “Dssd: Deconvo- Global and local perception gan for photorealistic and identity lutional single shot detector,” in arXiv, 2017. preserving frontal view synthesis,” in ICCV, 2017. [25] E. Ahmed, M. Jones, and T. Marks, “An improved deep learning [54] H. Yang, D. Huang, Y. Wang, and A. Jain, “Learning face age architecture for person re-identification,” in CVPR, 2015. progression: A pyramid architecture of gans,” in CVPR, 2018. [26] D. Chung, K. Tahboub, and E. Delp, “A two steam siamese [55] A. Evgeniou and M. Pontil, “Multi-task feature learning,” in convolutional neural network for person re-identification,” in NIPS, 2007. ICCV, 2017. [27] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng, “Person re- [56] L. Zhang, Z. He, Y. Yang, L. Wang, and X. Gao, “Tasks integrated identification by multi-channel parts-based cnn with improved networks: Joint detection and retrieval for image search,” IEEE triplet loss function,” in CVPR, 2016. Trans. Pattern Analysis and Machine Intelligence, 2020. [28] L. Hou, D. Samaras, T. Kurc, Y. Gao, J. Davis, and J. Saltz, “Patch- [57] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, “Analysis of based convolutional neural network for whole slide tissue image representations for domain adaptation,” in NIPS, 2007. classification,” in CVPR, 2016. [58] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman, [29] A. Esteva, B. Kuprel, R. Novoa, J. Ko, S. Swetter, H. Blau, and “Learning bounds for domain adaptation,” in NIPS, 2008, pp. S. Thrun, “Dermatologist-level classification of skin cancer with 129–136. deep neural networks,” Nature, vol. 542, pp. 115–118, 2017. [59] Y. Freund and R. Schapire, “A decision-theoretic generalization [30] D. Shen, G. Wu, and H. Suk, “Deep learning in medical image of on-line learning and an application to boosting,” J. Comput. analysis,” Annual Review of Biomedical Engineering, vol. 19, pp. Syst. Sci, vol. 55, no. 1, pp. 119–139, 1997. 221–248, 2017. [60] T. Dietterich, “Machine learning: Four current directions,” AI [31] M. Xie, N. Jean, M. Burke, D. Lobell, and S. Ermon, “Transfer Mag., vol. 18, no. 4, pp. 97–136, 1997. learning from deep features for remote sensing and poverty [61] X. Wu, V. Kumar, J. Quinlan, J. Ghosh, Q. Yang, H. Motoda, mapping,” in AAAI, 2016. G. McLachlan, A. Ng, B. Liu, P. Yu, Z. Zhou, M. Steinbach, [32] N. Jean, M. Burke, M. Xie, W. Davis, D. Lobell, and S. Ermon, D. Hand, and D. Steinberg, “Top 10 algorithms in data mining,” “Combining satellite imagery and machine learning to predict Knowl. Inf. Syst., vol. 14, pp. 1–37, 2008. poverty,” Science, vol. 353, no. 6301, p. 790–794, 2016. [62] Z.-H. Zhou, “A brief introduction to weakly supervised learn- [33] Y. Chen, H. Jiang, C. Li, X. Jia, and P. 
Ghamisi, “Deep feature ing,” National Science Review, vol. 1, pp. 1–10, 2017. extraction and classification of hyperspectral images based on [63] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, convolutional neural networks,” IEEE Trans. Geoscience and Re- 2006. mote Sensing, vol. 54, no. 10, pp. 6232–6251, 2016. [64] Z. Zhou and M. Li, “Semi-supervised learning by disagreement,” [34] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez, “Convo- Knowledge and Information Systems, vol. 24, no. 3, pp. 415–439, lutional neural networks for large-scale remote-sensing image 2010. classification,” IEEE Trans. Geoscience and Remote Sensing, vol. 55, [65] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell, “Text clas- no. 2, pp. 645–657, 2017. sification from labeled and unlabeled documents using em,” [35] Y. Bengio, “Deep learning of representations for unsupervised Machine Learning, vol. 39, pp. 103–134, 2000. and transfer learning,” JMLR, vol. 27, pp. 17–37, 2012. [66] D. Miller and H. Uyar, “A mixture of experts classifier with [36] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde- learning based on both labeled and unlabeled data,” in NIPS, Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adver- 1997, pp. 571–577. sarial nets,” in arXiv, 2014. [67] D. Kingma, S. Mohamed, D. Jimenez Rezende, and M. Welling, [37] M. Mirza and S. Osindero, “Conditional generative adversarial “Semi-supervised learning with deep generative models,” in nets,” in arXiv, 2014. NIPS, 2014. JOURNAL OF LATEX CLASS FILES, VOL. -, NO. -, MARCH 2019 22

[68] O. Chapelle and A. Zien, “Semi-supervised classification by low [97] F. Li and H. Wechsler, “Open set face recognition using transduc- density separation,” in AISTATS, 2005, pp. 57–64. tion,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, [69] T. Joachims, “Transductive inference for text classication using no. 11, pp. 1686–1697, 2005. support vector machines,” in ICML, 1999, pp. 200–209. [98] H. Zhang and V. Patel, “Sparse representation-based open set [70] Y. Li, I. Tsang, J. Kwok, and Z. Zhou, “Convex and scalable recognition,” IEEE Trans. Pattern Analysis and Machine Intelligence, weakly labeled svms,” Journal of Machine Learning Research, vol. 39, no. 8, pp. 1690–1696, 2016. vol. 14, pp. 2151–2188, 2013. [99] P. Panareda Busto and J. Gall, “Open set domain adaptation,” in [71] Z. Zhou and M. Li, “Tri-training: exploiting unlabeled data using ICCV, 2017, pp. 754–763. three classifiers,” IEEE Trans. Knowledge Data Engineering, vol. 17, [100] P. Panareda Busto, A. Iqbal, and J. Gall, “Open set domain pp. 1529–1541, 2005. adaptation for image and action recognition,” IEEE Trans. Pattern [72] A. Blum and T. Mitchell, “Combining labeled and unlabeled data Analysis and Machine Intelligence, 2018. with co-training,” in The 11th Int’ Conf’ Computational Learning [101] Y. Pan, T. Yao, Y. Li, C. Ngo, and T. Mei, “Exploring category- Theory, 1998, pp. 92–100. agnostic clusters for open-set domain adaptation,” in CVPR, [73] Z. Zhou and M. Li, “Semi-supervised learning by disagreement,” 2020. Knowledge and Information Systems, vol. 24, pp. 415–439, 2010. [102] J. Kundu, N. Venkat, A. Revanur, M. Rahul, and V. Babu, “To- [74] A. Blum and S. Chawla, “Learning from labeled and unlabeled wards inheritable models for open-set domain adaptation,” in data using graph mincuts,” in ICML, 2001, pp. 19–26. CVPR, 2020. [75] M. Belkin and P. Niyogi, “Semi-supervised learning on rieman- [103] V. Kumar Verma, G. Arora, A. Mishra, and P. Rai, “Generalized nian manifolds,” Machine Learning, vol. 56, no. 1-3, pp. 209–239, zero-shot learning via synthesized examples,” in CVPR, 2018. 2004. [104] J. Huang, A. Smola, A. Gretton, K. Borgwardt, and B. Scholkopf, [76] T. Kipf and M. Welling, “Semi-supervised classification with “Correcting sample selection bias by unlabeled data,” in NIPS, graph convolutional networks,” in ICLR, 2017. 2007, pp. 1–8. [77] Z. Yang, W. Cohen, and R. Salakhutdinov, “Revisiting semi- [105] B. Zadrozny, “Learning and evaluating classifiers under sample supervised learning with graph embeddings,” in ICML, 2016. selection bias,” in ICML, 2004. [78] X. Zhu, “Semi-supervised learning literature survey,” Technical [106] J. Jiang and C. Zhai, “Instance weighting for domain adaptation Report 1530, 2008. in nlp,” in ACL, 2007, pp. 264–271. [79] I. Triguero, S. Garcia, and F. Herrera, “Self-labeled techniques [107] R. Wang, M. Utiyama, L. Liu, K. Chen, and E. Sumita, “Instance for semi-supervised learning: taxonomy, software and empirical weighting for neural machine translation domain adaptation,” in ACL study,” Knowledge and Information Systems, vol. 42, no. 2, pp. 245– , 2017, pp. 1482–1488. 284, 2015. [108] W. Dai, Q. Yang, G. R. Xue, and Y. Yu, “Boosting for transfer learning,” in ICML, 2007, pp. 193–200. [80] D. D. Lewis and W. A. Gale, “A sequential algorithm for training text classifiers,” Acm Sigir Forum, vol. 29, no. 2, pp. 13–19, 1994. [109] S. Chen, F. Zhou, and Q. Liao, “Visual domain adaptation using weighted subspace alignment,” in Visual Communications and [81] S. 
Tong and D. Koller, “Support vector machine active learning Image Processing, 2017. with applications to text classification.” in ICML, 2000. [110] W. S. Chu, l. T. F. De, and J. F. Cohn, “Selective transfer machine [82] E. Elhamifar, G. Sapiro, A. Yang, and S. S. Sasrty, “A convex for personalized facial action unit detection,” in CVPR, 2013. optimization framework for active learning,” in ICCV, 2013. [111] H. Yan, Y. Ding, P. Li, Q. Wang, Y. Xu, and W. Zuo, “Mind [83] B. Settles, “Active learning literature survey,” in Technical Report, the class weight bias: Weighted maximum mean discrepancy for 2010. unsupervised domain adaptation,” in CVPR, 2017, pp. 2272–2281. [84] M. Rohrbach, M. Stark, and B. Schiele, “Evaluating knowledge [112] E. H. Zhong, W. Fan, J. Peng, K. Zhang, J. Ren, D. S. Turaga, and transfer and zero-shot learning in a large-scale setting,” in CVPR, O. Verscheure, “Cross domain distribution adaptation via kernel 2011, pp. 1641–1648. mapping,” in ACM SIGKDD, 2009, pp. 1027–1036. [85] C. Lampert, H. Nickishch, and S. Harmeling, “Attribute-based [113] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer joint classification for zero-shot visual object categorization,” IEEE matching for unsupervised domain adaptation,” in CVPR, 2014. Trans. Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. [114] M. Chen, K. Q. Weinberger, and J. C. Blitzer, “Co-training for 453–465, 2014. domain adaptation,” in NIPS, 2011. [86] Y. Fu, T. Hospedales, T. Xiang, and S. Gong, “Transductive multi- [115] Q. Chen, Y. Liu, Z. Wang, I. Wassell, and K. Chetty, “Re-weighted view zero-shot learning,” IEEE Trans. Pattern Analysis and Machine adversarial adaptation network for unsupervised domain adap- Intelligence, vol. 37, no. 11, pp. 2332–2345, 2015. tation,” in CVPR, 2018, pp. 7976–7985. [87] Z. Ding, M. Shao, and Y. Fu, “Generative zero-shot learning via [116] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, “Unsu- low-rank embedded semantic dictionary,” IEEE Trans. Pattern pervised visual domain adaptation using subspace alignment,” Analysis and Machine Intelligence, 2018. in ICCV, 2013, pp. 2960–2967. [88] L. Niu, J. Cai, A. Veeraraghavan, and L. Zhang, “Zero-shot [117] A. Gretton, K. Borgwardt, M. Rasch, B. Schoelkopf, and A. Smola, learning via category-specific visual-semantic mapping and label “A kernel method for the two-sample-problem,” in NIPS, 2006. refinement,” IEEE Trans. Image Processing, 2018. [118] A. Gretton, K. Borgwardt, M. Rasch, B. Scholkopf, and A. Smola, [89] S. Rahman, S. Khan, and F. Porikli, “A unified approach for con- “A kernel two-sample test,” Journal of Machine Learning Research, ventional zero-shot, generalized zero-shot and few-shot learn- pp. 723–773, 2012. ing,” IEEE Trans. Image Processing, pp. 1–16, 2018. [119] R. Gopalan, R. Li, and R. Chellappa, “Domain adaptation for [90] Y. Yu, Z. Ji, J. Guo, and Y. Pang, “Transductive zero-shot learning object recognition: An unsupervised approach,” in ICCV, 2011, with adaptive structural embedding,” IEEE Trans. Neural Net- pp. 999–1006. works and Learning Systems, vol. 29, no. 9, pp. 4116–4127, 2018. [120] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel [91] Y. Yu, Z. Ji, X. Li, J. Guo, Z. Zhang, H. Ling, and F. Wu, “Transduc- for unsupervised domain adaptation,” in CVPR, 2012. tive zero-shot learning with self-training dictionary approach,” [121] B. Sun and K. Saenko, “Subspace distribution alignment for IEEE Trans. Cybernetics, vol. 48, no. 10, pp. 2908–2919, 2018. 
unsupervised domain adaptation,” in BMVC, 2015. [92] X. Xu, T. Hospedales, and S. Gong, “Transductive zero-shot ac- [122] J. Liu and L. Zhang, “Optimal projection guided transfer hashing tion recognition by word-vector embedding,” International Journal for image retrieval,” in AAAI, 2018. of Computer Vision, vol. 123, no. 3, pp. 309–333, 2017. [123] L. Zhang, J. Fu, S. Wang, D. Zhang, Z. Dong, and C. Chen, “Guide [93] X. Li, Y. Guo, and D. Schuurmans, “Semi-supervised zero-shot subspace learning for unsupervised domain adaptation,” IEEE classification with label representation learning,” in ICCV, 2015. Trans. Neural Networks and Learning Systems, 2020. [94] J. Song, C. Shen, Y. Yang, Y. Liu, and M. Song, “Transductive [124] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain unbiased embedding for zero-shot learning,” in CVPR, 2018. adaptation via transfer component analysis,” IEEE Trans. Neural [95] W. Scheirer, A. Rocha, A. Sapkota, and T. Boult, “Towards open Networks, vol. 22, no. 2, p. 199, 2011. set recognition,” IEEE Trans. Pattern Analysis and Machine Intelli- [125] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer feature gence, vol. 35, no. 7, pp. 1757–1772, 2013. learning with joint distribution adaptation,” in ICCV, 2014. [96] W. Scheirer, L. Jain, and T. Boult, “Probability models for open set [126] M. Xiao and Y. Guo, “Feature space independent semi-supervised recognition,” IEEE Trans. Pattern Analysis and Machine Intelligence, domain adaptation via kernel matching,” IEEE Trans. Pattern vol. 36, no. 11, pp. 2317–2324, 2014. Analysis and Machine Intelligence, vol. 37, no. 1, pp. 54–66, 2015. JOURNAL OF LATEX CLASS FILES, VOL. -, NO. -, MARCH 2019 23

[127] S. Si, D. Tao, and B. Geng, “Bregman divergence-based regular- [154] M. Long, J. Wang, J. Sun, and P. Yu, “Domain invariant transfer ization for transfer subspace learning,” IEEE Trans. Knowledge and kernel learning,” IEEE Trans. Knowledge and Data Engineering, Data Engineering, vol. 22, no. 7, pp. 929–942, 2010. vol. 27, no. 6, pp. 1519–1532, 2015. [128] Z. Ding and Y. Fu, “Robust transfer metric learning for image [155] L. Zhang, Y. Liu, Z. He, J. Liu, P. Deng, and X. Zhou, “Anti-drift classification,” IEEE Trans. Image Processing, vol. 26, no. 2, pp. in e-nose: A subspace projection approach with drift reduction,” 660–670, 2017. Sensors and Actuators B: Chemical, vol. 253, pp. 407–417, 2017. [129] H. Wang, W. Wang, C. Zhang, and F. Xu, “Cross-domain metric [156] M. Ghifary, D. Balduzzi, W. Bastiaan Kleijn, and M. Zhang, learning based on information theory,” in Proc. 28th AAAI Conf. “Scatter component analysis: A unified framework for domain Artif. Intell., 2014, pp. 2099–2105. adaptation and domain generalization,” IEEE Trans. Pattern Anal- [130] B. Sun and K. Saenko, “Deep coral: Correlation alignment for ysis and Machine Intelligence, vol. 39, no. 7, pp. 1414–1430, 2017. deep domain adaptation,” in ECCVW, 2016, pp. 443–450. [157] J. Zhang, W. Li, and P. Ogunbona, “Joint geometrical and statisti- [131] S. Herath, M. Harandi, and F. Porikli, “Learning an invariant cal alignment for visual domain adaptation,” CVPR, 2017. hilbert space for domain adaptation,” in CVPR, 2017. [158] L. Zhang, Y. Liu, and P. Deng, “Odor recognition in multiple e- [132] H. Daume III, A. Kumar, and A. Saha, “Frustratingly easy semi- nose systems with cross-domain discriminative subspace learn- supervised domain adaptation,” in ACL, 2010, pp. 53–59. ing,” IEEE Trans. Instrumentation and Measurement, vol. 66, no. 7, pp. 1679–1692, 2017. [133] W. Li, L. Duan, D. Xu, and I. Tsang, “Learning with augmented [159] H. Lu, C. Shen, Z. Cao, Y. Xiao, and A. D. H. Van, “An embarrass- features for supervised and semi-supervised heterogeneous do- ingly simple approach to visual domain adaptation.” IEEE Trans. main adaptation,” IEEE Trans. Pattern Analysis and Machine Intel- Image Processing, 2018. ligence, vol. 36, no. 6, pp. 1134–1148, 2014. [160] S. Li, S. Song, and G. Huang, “Domain invariant and class [134] R. Volpi, P. Morerio, S. Savarese, and V. Murino, “Adversarial discriminative feature learning for visual domain adaptation,” feature augmentation for unsupervised domain adaptation,” in IEEE Trans. Image Processing, vol. 27, no. 9, pp. 4260–4273, 2018. CVPR, 2018, pp. 5495–5504. [161] S. Li, S. Song, G. Huang, and C. Wu, “Cross-domain extreme [135] L. Zhang, S. Wang, G. Huang, W. Zuo, J. Yang, and D. Zhang, learning machines for domain adaptation,” IEEE Trans. Systems, “Manifold criterion guided transfer learning via intermediate Man, and Cybernetics: Systems, 2018. domain generalization,” IEEE Trans. Neural Networks and Learning [162] A. Gretton, O. Bousquet, A. Smola, and B. Scholkopf, “Measuring Systems, 2019. statistical dependence with hilbert-schmidt norms,” in ALT, 2005. [136] I. H. Jhuo, D. Liu, D. T. Lee, and S. F. Chang, “Robust visual [163] K. Yan, L. Kou, and D. Zhang, “Learning domain-invariant sub- domain adaptation with low-rank reconstruction,” in CVPR, space using domain features and independence maximization,” 2012, pp. 2168–2175. IEEE Trans. Cybernetics, vol. 48, no. 1, pp. 288–299, 2017. [137] M. Shao, D. Kit, and Y. Fu, “Generalized transfer subspace [164] Y. Zhang and D. 
Yeung, “Transfer metric learning by learning learning through low-rank constraint,” International Journal of task relationships,” in ACM SIGKDD, 2010, pp. 1199–1208. Computer Vision, vol. 109, no. 1-2, pp. 74–93, 2014. [165] J. Hu, J. Lu, Y.-P. Tan, and J. Zhou, “Deep transfer metric learn- [138] J. Fu, L. Zhang, B. Zhang, and W. Jia, “Guided learning: A new ing,” IEEE Trans. Image Processing, vol. 25, no. 12, 2016. paradigm for multi-task classification,” in CCBR, 2018. [166] B. Geng, D. Tao, and C. Xu, “Daml: Domain adaptation metric [139] L. Zhang, W. Zuo, and D. Zhang, “Lsdt: Latent sparse domain learning,” IEEE Trans. Image Processing, vol. 20, no. 10, pp. 2980– transfer learning for visual adaptation,” IEEE Trans. Image Pro- 2989, 2011. cessing, vol. 25, no. 3, pp. 1177–1191, 2016. [167] Y. Xu, S. Pan, H. Xiong, Q. Wu, R. Luo, H. Min, and H. Song, [140] Y. Xu, X. Fang, J. Wu, X. Li, and D. Zhang, “Discriminative trans- “A unified framework for metric transfer learning,” IEEE Trans. fer subspace learning via low-rank and sparse representation.” Knowledge and Data Engineering, vol. 29, no. 6, pp. 1158–408, 2017. IEEE Trans. Image Processing, vol. 25, no. 2, pp. 850–863, 2016. [168] S. Li, C. Liu, L. Su, B. Xie, Z. Ding, C. P. Chen, and D. Wu, [141] L. Zhang, J. Yang, and D. Zhang, “Domain class consistency “Discriminative transfer feature and label consistency for cross- based transfer learning for image classification across domains,” domain image classification,” IEEE Trans. Neural Networks and Information Sciences, vol. 418-419, pp. 242–257, 2017. Learning Systems, 2020. [142] S. Wang, L. Zhang, and W. Zuo, “Class-specific reconstruction [169] B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy transfer learning via sparse low-rank constraint,” in ICCVW, domain adaptation,” in AAAI, 2016, pp. 153–171. 2017, pp. 949–957. [170] Z. Zhang, M. Wang, Y. Huang, and A. Nehorai, “Aligning [143] S. Shekhar, V. Patel, H. Nguyen, and R. Chellappa, “Generalized infinite-dimensional covariance matrices in reproducing kernel domain-adaptive dictionaries,” in CVPR, 2013, pp. 361–368. hilbert spaces for domain adaptation,” in CVPR, 2018, pp. 3437– [144] H. Xu, J. Zheng, and R. Chellappa, “Bridging the domain shift by 3445. domain adaptive dictionary learning,” in BMVC, 2015. [171] L. Li and Z. Zhang, “Semi-supervised domain adaptation by [145] S. Shekhar, V. Patel, H. Van Nguyen, and R. Chellappa, “Coupled covariance matching,” IEEE Trans. Pattern Analysis and Machine projections for adaptation of dictionaries,” IEEE Trans. Image Intelligence, 2018. Processing, vol. 24, no. 10, pp. 2941–2954, 2015. [172] H. Daume III, “Frustratingly easy domain adaptation,” in arXiv, 2009. [146] Q. Qiu and R. Chellappa, “Compositional dictionaries for domain [173] H. Daume III, A. Kumar, and A. Saha, “Co-regularization based adaptive face recognition,” IEEE Trans. Image Processing, vol. 24, semi-supervised domain adaptation,” in ACL, 2010. no. 12, pp. 5152–5165, 2015. [174] Y. Chen, X. Zhu, W. Zheng, and J. Lai, “Person re-identification [147] Q. Qiu, V. Patel, P. Turaga, and R. Chellappa, “Domain adaptive by camera correlation aware feature augmentation,” IEEE Trans. dictionary learning,” in ECCV, 2012. Pattern Analysis and Machine Intelligence, vol. 40, no. 2, pp. 392– [148] J. Ni, Q. Qiu, and R. Chellappa, “Subspace interpolation via 408, 2018. dictionary learning for unsupervised domain adaptation,” in [175] X. Shi, Q. Liu, W. Fan, P. Yu, and R. Zhu, “Transfer learning CVPR, 2013, pp. 692–699. 
on heterogenous feature spaces via spectral transformation,” in [149] F. Zhu and L. Shao, “Weakly-supervised cross-domain dictionary ICDM, 2010. learning for visual recognition,” International Journal of Computer [176] J. Zhou, S. Pan, I. Tsang, and Y. Yan, “Hybrid heterogeneous Vision, vol. 109, no. 1-2, pp. 42–59, 2014. transfer learning through deep learning,” in AAAI, 2014, pp. [150] B. Lu, R. Chellappa, and N. Nasrabadi, “Incremental dictionary 2213–2219. learning for unsupervised domain adaptation,” in BMVC, 2015. [177] Y. Yan, W. Li, H. Wu, H. Min, M. Tan, and Q. Wu, “Semi- [151] B. Kulis, K. Saenko, and T. Darrell, “What you saw is not what supervised optimal transport for heterogeneous domain adap- you get: Domain adaptation using asymmetric kernel trans- tation,” in IJCAI, 2018. forms,” in CVPR, 2015, pp. 1785–1792. [178] D. Das and C. G. Lee, “Sample-to-sample correspondence for [152] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy, “Optimal unsupervised domain adaptation,” Engineering Applications of transport for domain adaptation,” IEEE Trans. Pattern Analysis Artificial Intelligence, vol. 73, pp. 80–91, 2018. and Machine Intelligence, vol. 39, no. 9, pp. 1853–1865, 2017. [179] D. Das and C. G. Lee, “Unsupervised domain adaptation using [153] M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salz- regularized hyper-graph matching,” 2018, pp. 3758–3762. mann, “Unsupervised domain adaptation by domain invariant [180] D. Das and C. G. Lee, “Graph matching and pseudo-label guided projection,” in ICCV, 2013, pp. 769–776. deep unsupervised domain adaptation,” 2018, pp. 342–352. JOURNAL OF LATEX CLASS FILES, VOL. -, NO. -, MARCH 2019 24

[181] G. Liu, Z. Lin, and Y. Yu, “Robust subspace segmentation by low- [207] M. Gonen and A. Margolin, “Kernelized bayesian transfer learn- rank representation,” in ICML, 2010, pp. 663–670. ing,” in AAAI, 2014, pp. 1831–1839. [182] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery [208] A. Ramachandran, S. Gupta, S. Rana, and S. Venkatesh, of subspace structures by low-rank representation,” IEEE Trans. “Information-theoretic transfer learning framework for bayesian Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 171– optimisation,” in ECML, 2019, pp. 827–842. 184, 2012. [209] B. Liu and N. Vasconcelos, “Bayesian model adaptation for crowd [183] Z. Lin, M. Chen, and Y. Ma, “The augmented lagrange multiplier counts,” in ICCV, 2015, pp. 4175–4183. method for exact recovery of corrupted low-rank matrices,” [210] B. Gholami, O. Rudovic, and V. Pavlovic, “Punda: Probabilistic Eprint Arxiv, vol. 9, 2013. unsupervised domain adaptation for knowledge transfer across [184] J. Wright, A. Ganesh, S. Rao, and Y. Ma, “Robust principal com- visual categories,” in ICCV, 2017, pp. 3581–3590. ponent analysis: Exact recovery of corrupted low-rank matrices [211] A. Karbalayghareh, X. Qian, and E. Dougherty, “Optimal by convex optimization,” in NIPS, 2009. bayesian transfer learning,” IEEE Trans. Signal Processing, vol. 66, [185] E. Elhamifar and R. Vidal, “Sparse subspace clustering,” in no. 14, pp. 3724–3739, 2018. CVPR, 2009, pp. 2790–2797. [212] V. Perrone, R. Jenatton, M. Seeger, and C. Archambeau, “Scalable [186] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algo- hyperparameter transfer learning,” in NIPS, 2018. rithms, theory, and applications,” IEEE Trans. Pattern Analysis and [213] S. Sonnenburg, G. Ratsch, C. Schafer, and B. Scholkopf, “Large Machine Intelligence, vol. 35, no. 11, pp. 2765–2781, 2013. scale multiple kernel learning,” Journal of Machine Learning Re- [187] M. Soltanolkotabi, E. Elhamifar, and E. Candes, “Robust subspace search, vol. 7, pp. 1531–1565, 2006. clustering,” Ann. Statist., vol. 42, no. 2, pp. 669–699, 2014. [214] D. Xu and S. Chang, “Video event recognition using kernel meth- [188] M. Aharon, M. Elad, and A. Bruckstein, “K-svd: An algorithm for ods with multilevel temporal alignment,” IEEE Trans. Pattern designing overcomplete dictionaries for sparse representation,” Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1985–1997, IEEE Trans. Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006. 2008. [189] J. Mairal, F. Bach, and J. Ponce, “Task-driven dictionary learning,” [215] L. Duan, D. Xu, and I. Tsang, “Domain adaptation from multiple IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 4, sources: A domain-dependent regularization approach,” IEEE pp. 791–804, 2012. Trans. Neural Networks and Learning Systems, vol. 23, no. 3, pp. [190] Z. Jiang, Z. Lin, and L. Davis, “Label consistent k-svd: Learning 504–518, 2012. a discriminative dictionary for recognition,” IEEE Trans. Pattern [216] L. Dalton and E. Dougherty, “Optimal classifiers with minimum Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2651–2664, expected error within a bayesian framework-part i: Discrete and 2013. gaussian models,” Pattern Recognition, vol. 46, no. 5, pp. 1288– 1300, 2013. [191] S. Li, M. Shao, and Y. Fu, “Cross-view projective dictionary learning for person re-identification,” in IJCAI, 2015. [217] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks,” in NIPS, 2014. 
[192] L. Duan, D. Xu, and S. F. Chang, “Exploiting web images for [218] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, event recognition in consumer videos: A multiple source domain and T. Darrell, “Decaf: A deep convolutional activation feature adaptation approach,” in CVPR, 2012, pp. 1338–1345. for generic visual recognition,” in ICML, 2014. [193] L. Zhang and D. Zhang, “Domain adaptation extreme learning [219] M. Long, Y. Cao, J. Wang, and M. I. Jordan, “Learning transferable machines for drift compensation in e-nose systems,” IEEE Trans. features with deep adaptation networks,” in ICML, 2015. Instrumentation and Measurement, vol. 64, no. 7, pp. 1790–1801, [220] M. Long, Y. Cao, Z. Cao, J. Wang, and M. Jordan, “Transferable 2015. representation learning with deep adaptation networks,” IEEE [194] A. Royer and C. Lampert, “Classifier adaptation at prediction Trans. Pattern Analysis and Machine Intelligence, 2018. time,” in CVPR, 2015. [221] K. Simonyan and A. Zisserman, “Very deep convolutional net- [195] L. Duan, I. Tsang, D. Xu, and S. Maybank, “Domain transfer svm works for large-scale image recognition,” in arXiv, 2015. for video concept detection,” in CVPR, 2009. [222] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, [196] L. Duan, D. Xu, I. Tsang, and J. Luo, “Visual event recognition in D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with videos by learning from web data,” in CVPR, 2010. convolutions,” in CVPR, 2015. [197] D. Lixin, X. Dong, T. Ivor Wai-Hung, and L. Jiebo, “Visual event [223] Z. Cao, M. Long, J. Wang, and M. Jordan, “Partial transfer recognition in videos by learning from web data,” IEEE Trans. learning with selective adversarial networks,” in CVPR, 2018. Pattern Analysis and Machine Intelligence, vol. 34, pp. 1667–1680, [224] J. Zhang, Z. Ding, W. Li, and P. Ogunbona, “Importance weighted 2012. adversarial nets for partial domain adaptation,” in CVPR, 2018, [198] L. Duan, I. W. Tsang, and D. Xu, “Domain transfer multiple ker- pp. 8156–8164. nel learning,” IEEE Trans. Pattern Analysis and Machine Intelligence, [225] S. Li, C. Liu, Q. Lin, Q. Wen, L. Su, G. Huang, and Z. Ding, “Deep vol. 34, no. 3, p. 465, 2012. residual correction network for partial domain adaptation,” IEEE [199] W. Li, Z. Xu, D. Xu, D. Dai, and L. Van Gool, “Domain gener- Trans. Pattern Analysis and Machine Intelligence, 2020. alization and adaptation using low rank exemplar svms,” IEEE [226] Z. Cao, K. You, M. Long, J. Wang, and Q. Yang, “Learning to Trans. Pattern Analysis and Machine Intelligence, vol. 40, no. 5, pp. transfer examples for partial domain adaptation,” in CVPR, 2019. 1114–1127, 2017. [227] Z. Fang, J. Lu, F. Liu, J. Xuan, and G. Zhang, “Open set domain [200] M. Long, J. Wang, G. Ding, S. Pan, and P. Yu, “Adaptation adaptation: Theoretical bound and algorithm,” IEEE Trans. Neural regularization: a general framework for transfer learning,” IEEE Networks and Learning Systems, 2020. Trans. Knowledge and Data Engineering, vol. 26, no. 5, pp. 1076– [228] K. You, M. Long, Z. Cao, J. Wang, and M. Jordan, “Universal 1089, 2014. domain adaptation,” in CVPR, 2019. [201] L. Zhang and D. Zhang, “Robust visual knowledge transfer [229] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep via extreme learning machine-based domain adaptation,” IEEE domain confusion: Maximizing for domain invariance,” arXiv, Trans. Image Processing, vol. 25, no. 10, pp. 4959–4973, 2016. 2014. [202] J. Wang, W. Feng, Y. Chen, H. Yu, M. Huang, and P. S. 
Yu, “Visual [230] L. Liu, W. Lin, L. Wu, Y. Yu, and M. Yang, “Unsupervised deep domain adaptation with manifold embedded distribution align- domain adaptation for pedestrian detection,” in ECCV, 2016. ment,” in ACM MM, 2018. [231] M. Long, H. Zhu, J. Wang, and M. Jordan, “Deep transfer learning [203] Y. Cao, M. Long, and J. Wang, “Unsupervised domain adaptation with joint adaptation networks,” in ICML, 2017. with distribution matching machines,” in AAAI, 2018. [232] M. Long, J. Wang, and M. I. Jordan, “Unsupervised domain [204] T. Yao, Y. Pan, C. Ngo, H. Li, and T. Mei, “Semi-supervised do- adaptation with residual transfer networks,” NIPS, 2016. main adaptation with subspace learning for visual recognition,” [233] W. Ge and Y. Yu, “Borrowing treasures from the wealthy: Deep in CVPR, 2015, pp. 2142–2150. transfer learning through selective joint fine-tuning,” in CVPR, [205] Y. Zhang, B. Deng, K. Jia, and L. Zhang, “Label propagation with 2017, pp. 1086–1095. augmented anchors: A simple semi-supervised learning baseline [234] X. Zhang, F. Yu, S. Wang, and S. Chang, “Deep transfer network: for unsupervised domain adaptation,” in ECCV, 2020. Unsupervised domain adaptation,” in arXiv, 2015. [206] Y. Luo, C. Ren, D. Dai, and H. Yan, “Unsupervised domain [235] S. Motiian, M. Piccirilli, D. Adjeroh, and G. Doretto, “Unified adaptation via discriminative manifold propagation,” IEEE Trans. deep supervised domain adaptation and generalization,” in Pattern Analysis and Machine Intelligence, 2020. ICCV, 2017, pp. 5715–5725. JOURNAL OF LATEX CLASS FILES, VOL. -, NO. -, MARCH 2019 25

[236] C. Chen, W. Xie, W. Huang, Y. Rong, and X. Ding, “Progres- [265] R. Li, Q. Jiao, W. Cao, H. Wong, and S. Wu, “Model adapta- sive feature alignment for unsupervised domain adaptation,” in tion: Unsupervised domain adaptation without source data,” in CVPR, 2019. CVPR, 2020. [237] Z. Deng, Y. Luo, and J. Zhu, “Cluster alignment with a teacher [266] J. Kundu, N. Venkat, M. Rahul, and R. Babu, “Universal source- for unsupervised domain adaptation,” in ICCV, 2019. free domain adaptation,” in CVPR, 2020. [238] S. Cicek and S. Soatto, “Unsupervised domain adaptation via [267] X. Peng, Z. Huang, X. Sun, and K. Saenko, “Domain agnostic regularized conditional alignment,” in ICCV, 2019. learning with disentangled representations,” in ICML, 2019. [239] G. Kang, L. Jiang, Y. Yang, and A. Hauptmann, “Contrastive [268] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol, “Extracting adaptation network for unsupervised domain adaptation,” in and composing robust features with denoising autoencoders,” in CVPR, 2019. ICML, 2008. [240] S. Wang and L. Zhang, “Self-adaptive re-weighted adversarial [269] S. Kullback, “Letter to editor: the kullback-leibler distance,” 1987. domain adaptation,” in IJCAI, 2020. [270] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation [241] Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou, “Revisiting batch by backpropagation,” in arXiv, 2015. normalization for practical domain adaptation,” in arXiv, 2016. [271] P. Pinheiro, “Unsupervised domain adaptation with similarity [242] F. Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. Bulo, “Autodial: learning,” in CVPR, 2018, pp. 8004–8013. Automatic domain alignment layers,” in ICCV, 2017. [272] Z. Pei, Z. Cao, M. Long, and J. Wang, “Multi-adversarial domain [243] S. Roy, A. Siarohin, E. Sangineto, S. Bulo, N. Sebe, and E. Ricci, adaptation,” in AAAI, 2018. “Unsupervised domain adaptation using feature-whitening and [273] W. Zhang, W. Ouyang, W. Li, and D. Xu, “Collaborative and consensus loss,” in CVPR, 2019, pp. 9471–9480. adversarial networks for unsupervised domain adaptation,” in [244] W. Chang, T. You, S. Seo, S. Kwak, and B. Han, “Domain-specific CVPR, 2018, pp. 3801–3809. batch normalization for unsupervised domain adaptation,” in [274] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool, “Domain CVPR, 2019. adaptive faster r-cnn for object detection in the wild,” in CVPR, [245] X. Wang, Y. Jin, M. Long, J. Wang, and M. Jordan, “Transferable 2018, pp. 3339–3348. normalization: Towards improving transferability of deep neural [275] Q. Duan and L. Zhang, “Advnet: Adversarial contrastive residual networks,” in NeurIPS, 2019. net for 1 million kinship recognition,” in ACM MM Workshop on [246] X. Glorot, A. Bordes, and Y. Bengio, “Domain adaptation for RFIW, 2017, pp. 21–29. large-scale sentiment classification: A deep learning approach,” [276] Q. Duan, L. Zhang, and W. Jia, “Adv-kin: An adversarial convo- in ICML, 2011. lutional network for kinship verification,” in CCBR, 2017. [247] M. Chen, Z. Xu, K. Weinberger, and F. Sha, “Marginalized de- [277] A. Dubey, O. Gupta, P. Guo, R. Raskar, R. Farrell, and N. Naik, noising autoencoders for domain adaptation,” in ICML, 2012. “Pairwise confusion for fine-grained visual classification,” in [248] F. Zhuang, X. Cheng, P. Luo, S. Pan, and Q. He, “Supervised rep- ECCV, 2018, pp. 48–57. resentation learning: Transfer learning with deep autoencoders,” [278] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, “Simultaneous in IJCAI, 2015, pp. 4119–4125. 
deep transfer across domains and tasks,” ICCV, vol. 30, no. 31, [249] L. Wen, L. Gao, and X. Li, “A new deep transfer learning based pp. 4068–4076, 2015. on sparse auto-encoder for fault diagnosis,” IEEE Trans. Systems, [279] H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, and M. Marc- Man, and Cybernetics: Systems, vol. 49, no. 1, pp. 136–144, 2019. hand, “Domain-adversarial neural network,” in arXiv, 2015. [250] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li, “Deep [280] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and reconstruction-classification networks for unsupervised domain D. Erhan, “Domain separation networks,” 2016. adaptation,” in ECCV, 2016, pp. 597–613. [281] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial [251] Y. Pan, T. Yao, Y. Li, Y. Wang, C. Ngo, and T. Mei, “Transferrable discriminative domain adaptation,” in CVPR, 2017. prototypical networks for unsupervised domain adaptation,” in [282] S. Motiian, Q. Jones, S. Iranmanesh, and G. Doretto, “Few-shot CVPR, 2019, pp. 2239–2247. adversarial domain adaptation,” in NIPS, 2017. [252] S. Lee, D. Kim, N. Kim, and S. Jeong, “Drop to adapt: Learning [283] A. Rozantsev, M. Salzmann, and P. Fua, “Residual parameter discriminative features for unsupervised domain adaptation,” in transfer for deep domain adaptation,” in CVPR, 2018. ICCV, 2019. [284] M. Long, Z. Cao, J. Wang, and M. I. Jordan, “Conditional adver- [253] H. Tang, K. Chen, and K. Jia, “Unsupervised domain adaptation sarial domain adaptation,” in NIPS, 2018. via structurally regularized deep clustering,” in CVPR, 2020. [285] S. Xie, Z. Zheng, L. Chen, and C. Chen, “Learning semantic [254] Y. Liu, Y. Yeh, T. Fu, S. Wang, W. Chiu, and Y. Wang, “Detach representations for unsupervised domain adaptation,” in ICML, and adapt: Learning cross-domain disentangled deep represen- 2018. tation,” in CVPR, 2018. [286] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada, “Maximum [255] H. Zhao, R. Tachet des Combes, K. Zhang, and G. Gordon, “On classifier discrepancy for unsupervised domain adaptation,” in learning invariant representations for domain adaptation,” in CVPR, 2018, pp. 3723–3732. ICML, 2019. [287] J. Hoffman, E. Tzeng, T. Park, and J. Zhu, “Cycada: Cycle- [256] C. Lee, T. Batra, M. H. Baig, and D. Ulbricht, “Sliced wasserstein consistent adversarial domain adaptation,” in arXiv, 2017. discrepancy for unsupervised domain adaptation,” in CVPR, [288] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krish- 2019. nan, “Unsupervised pixel-level domain adaptation with genera- [257] Y. Balaji, R. Chellappa, and S. Feizi, “Normalized wasserstein for tive adversarial networks,” in CVPR, 2017, pp. 3722–3731. mixture distributions with applications in adversarial learning [289] Y. Taigman, A. Polyak, and L. Wolf, “Unsupervised cross-domain and domain adaptation,” in ICCV, 2019. image generation,” in ICLR, 2017. [258] R. Xu, G. Li, J. Yang, and L. Lin, “Larger norm more transferable: [290] L. Hu, M. Kan, S. Shan, and X. Chen, “Duplex generative adver- An adaptive feature norm approach for unsupervised domain sarial network for unsupervised domain adaptation,” in CVPR, adaptation,” in ICCV, 2019. 2018, pp. 1498–1507. [259] Y. Zhang, T. Liu, M. Long, and M. Jordan, “Bridging theory and [291] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim, algorithm for domain adaptation,” in ICML, 2019. “Image to image translation for domain adaptation,” in CVPR, [260] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang, “Mo- 2018, pp. 
4500–4509. ment matching for multi-source domain adaptation,” in ICCV, [292] W. Hong, Z. Wang, M. Yang, and J. Yuan, “Conditional generative 2019. adversarial network for structured domain adaptation,” in CVPR, [261] M. Li, Y. Zhai, Y. Luo, P. Ge, and C. Ren, “Enhanced transport 2018, pp. 1498–1507. distance for unsupervised domain adaptation,” in CVPR, 2020. [293] L. Wei, S. Zhang, W. Gao, and Q. Tian, “Person transfer gan to [262] R. Xu, P. Liu, L. Wang, C. Chen, and J. Wang, “Reliable weighted bridge domain gap for person re-identification,” in CVPR, 2018, optimal transport for unsupervised domain adaptation,” in pp. 79–88. CVPR, 2020. [294] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang, “Camera style [263] Y. Grandvalet and Y. Bengio, “Semi-supervised learning by en- adaptation for person re-identification,” in CVPR, 2018, pp. 5157– tropy minimization,” in NIPS, 2004, pp. 529–536. 5166. [264] H. Liu, M. Long, J. Wang, and M. Jordan, “Transferable adversar- [295] Z. Zheng, L. Zheng, and Y. Yang, “Unlabeled samples generated ial training: A general approach to adapting deep classifiers,” in by gan improve the person re-identification baseline in vitro,” in ICML, 2019. ICCV, 2017. JOURNAL OF LATEX CLASS FILES, VOL. -, NO. -, MARCH 2019 26

[296] S. Li, C. Liu, B. Xie, L. Su, Z. Ding, and G. Huang, “Joint [325] Y. Zhai, S. Lu, Q. Ye, X. Shan, J. Chen, R. Ji, and Y. Tian, adversarial domain adaptation,” in ACM MM, 2019. “Ad-cluster: Augmented discriminative clustering for domain [297] Z. He and L. Zhang, “Multi-adversarial faster-rcnn for unre- adaptive person re-identification,” in CVPR, 2020. stricted object detection,” in ICCV, 2019. [326] Y. Jing, W. Wang, L. Wang, and T. Tan, “Cross-modal cross- [298] Z. He and L. Zhang, “Domain adaptive object detection via domain moment alignment network for person search,” in CVPR, asymmetric tri-way faster-rcnn,” in ECCV, 2019. 2020. [299] Z. Lu, Y. Yang, X. Zhu, C. Liu, Y. Song, and T. Xiang, “Stochastic [327] B. Yan, C. Ma, B. Bare, W. Tan, and S. Hoi, “Disparity-aware classifiers for unsupervised domain adaptation,” in CVPR, 2020. domain adaptation in stereo image restoration,” in CVPR, 2020. [300] J. Fu and L. Zhang, “Reliable domain adaptation with classifiers [328] Y. Shao, L. Li, W. Ren, C. Gao, and N. Sang, “Domain adaptation competition for image classification,” IEEE Trans. Circuits and for image dehazing,” in CVPR, 2020. Systems II: Express Briefs, 2020. [329] Z. Wang, Z. Dai, B. Poczos, and J. Carbonell, “Characterizing and [301] S. Sankaranarayanan, Y. Balaji, C. Castillo, and R. Chellappa, avoiding negative transfer,” in CVPR, 2019. “Generate to adapt: Aligning domains using generative adver- [330] M. Mancini, L. Porzi, S. Bulo, B. Caputo, and E. Ricci, “Boosting sarial networks,” in CVPR, 2018. domain adaptation by discovering latent domains,” in CVPR, [302] R. Xu, Z. Chen, W. Zuo, J. Yan, and L. Lin, “Deep cocktail 2018. network: Multi-source unsupervised domain adaptation with [331] Y. Jang, H. Lee, S. Hwang, and J. Shin, “Learning what and where category shift,” in CVPR, 2018. to transfer,” in ICML, 2019. [332] A. Zamir, A. Sax, W. Shen, L. Guibas, J. Malik, and S. Savarese, [303] G. Griffin, A. Holub, and P. Perona, “Caltech-256 object category “Taskonomy: Disentangling task transfer learning,” in CVPR, dataset,” California Institute of Technology, 2007. 2018. [304] C. Rate and C. Retrieval, “Columbia object image library (coil- [333] S. Ben-David, T. Luu, T. Lu, and D. Pal, “Impossibility theorems 20),” Tech. Rep., 2011. for domain adaptation,” in AISTATS, 2010. [305] D. Li, Y. Yang, Y. Z. Song, and T. M. Hospedales, “Deeper, broader and artier domain generalization,” in ICCV, 2017, pp. 5543–5551. [306] M. Ghifary, W. B. Kleijn, M. Zhang, and D. Balduzzi, “Domain generalization for object recognition with multi-task autoen- coders,” in ICCV, 2015, pp. 2551–2559. [307] C. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect Lei Zhang (M’14-SM’18) received his Ph.D de- unseen object classes by between-class attribute transfer,” in gree in Circuits and Systems from the College of CVPR, 2009, pp. 951–958. Communication Engineering, Chongqing Univer- [308] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Pan- sity, Chongqing, China, in 2013. He was selected chanathan, “Deep hashing network for unsupervised domain as a Hong Kong Scholar in China in 2013, and adaptation,” in CVPR, 2017, pp. 5385–5394. worked as a Post-Doctoral Fellow with The Hong Kong Polytechnic University, Hong Kong, from [309] B. Caputo, H. Muller,¨ J. Martinez-Gomez, M. Villegas, B. Acar, 2013 to 2015. He worked as a visiting professor N. Patricia, N. Marvasti, S. Usk¨ udarlı,¨ R. Paredes, and M. Ca- at University of Macau in 2018. 
He is currently zorla, ImageCLEF 2014: Overview and Analysis of the Results. a Professor/Distinguished Research Fellow with Springer International Publishing, 2014. Chongqing University. He has authored more [310] Y. Pei, Y. Huang, Q. Zou, X. Zhang, and S. Wang, “Effects than 100 scientific papers in top journals and conferences, including of image degradation and degradation removal to cnn-based IEEE Transactions (e.g., T-PAMI, T-IP, T-MM, T-CSVT, T-NNLS), CVPR, image classification,” IEEE Trans. Pattern Analysis and Machine ICCV, ECCV, ACM MM, AAAI, IJCAI, etc. He is on the Editorial Boards Intelligence, 2019. of several journals, such as IEEE Transactions on Instrumentation and [311] K. Saito, Y. Ushiku, T. Harada, and K. Saenko, “Strong-weak Measurement, Neural Networks (Elsevier), etc. Dr. Zhang was a recip- distribution alignment for adaptive object detection,” in CVPR, ient of the 2019 ACM SIGAI Rising Star Award, the Best Paper Award 2019. of CCBR2017, Outstanding Doctoral Dissertation Award of Chongqing, [312] X. Zhu, J. Pang, C. Yang, J. Shi, and D. Lin, “Adapting object China, in 2015, and the New Academic Researcher Award for Doctoral detectors via selective cross-domain alignment,” in CVPR, 2019. Candidates from the Ministry of Education, China, in 2012. His cur- [313] C. Xu, X. Zhao, X. Jin, and X. Wei, “Exploring categorical regu- rent research interests include deep learning, transfer learning, domain larization for domain adaptive object detection,” in CVPR, 2020. adaptation and computer vision. He is a Senior Member of IEEE. [314] C. Chen, Z. Zheng, X. Ding, Y. Huang, and Q. Dou, “Harmo- nizing transferability and discriminability for adapting object detectors,” in CVPR, 2020. [315] Y. Zheng, D. Huang, S. Liu, and Y. Wang, “Cross-domain object detection through coarse-to-fine feature adaptation,” in CVPR, Xinbo Gao received the B.Eng., M.Sc. and 2020. Ph.D. degrees in electronic engineering, signal [316] T. Vu, H. Jain, M. Bucher, M. Cord, and P. Perez, “Advent: and information processing from Xidian Univer- Adversarial entropy minimization for domain adaptation,” in sity, Xi’an, China, in 1994, 1997, and 1999, re- CVPR, 2019. spectively. From 1997 to 1998, he was a re- [317] Y. Luo, L. Zheng, T. Guan, J. Yu, and Y. Yang, “Taking a closer search fellow at the Department of Computer look at domain shift: Category-level adversaries for semantics Science, Shizuoka University, Shizuoka, Japan. consistent domain adaptation,” in CVPR, 2019. From 2000 to 2001, he was a post-doctoral re- [318] Y. Li, L. Yuan, and N. Vasconcelos, “Bidirectional learning for search fellow at the Department of Information domain adaptation of semantic segmentation,” in CVPR, 2019. Engineering, the Chinese University of Hong [319] Y. Chen, Y. Lin, M. Yang, and J. Huang, “Crdoco: Pixel-level Kong, Hong Kong. Since 2001, he has been at domain transfer with cross-domain consistency,” in CVPR, 2019. the School of Electronic Engineering, Xidian University. He is currently [320] T. Vu, H. Jain, M. Bucher, M. Cord, and P. Perez, “Dada: Depth- a Cheung Kong Professor of Ministry of Education of P. R. China, a Pro- aware domain adaptation in semantic segmentation,” in ICCV, fessor of Pattern Recognition and Intelligent System of Xidian University 2019. and a Professor of Computer Science and Technology of Chongqing [321] M. Chen, H. Xue, and D. Cai, “Domain adaptation for semantic University of Posts and Telecommunications. His current research inter- segmentation with maximum squares loss,” in ICCV, 2019. 
ests include Image processing, computer vision, multimedia analysis, [322] F. Lv, T. Liang, X. Chen, and G. Lin, “Cross-domain semantic machine learning and pattern recognition. He has published six books segmentation via domain-invariant interactive relation transfer,” and around 300 technical articles in refereed journals and proceedings. in CVPR, 2020. Prof. Gao is on the Editorial Boards of several journals, including Signal [323] Y. Zhang, Z. Qiu, T. Yao, C. Ngo, D. Liu, and T. Mei, “Transfer- Processing (Elsevier) and Neurocomputing (Elsevier). He served as ring and regularizing prediction for semantic segmentation,” in the General Chair/Co-Chair, Program Committee Chair/Co-Chair, or PC CVPR, 2020. Member for around 30 major international conferences. He is a Fellow of [324] F. Huang, L. Zhang, Y. Yang, and X. Zhou, “Probability weighted the Institute of Engineering and Technology and a Fellow of the Chinese compact feature for domain adaptive retrieval,” in CVPR, 2020. Institute of Electronics.