Adversarial Discriminative Domain Adaptation

Eric Tzeng (University of California, Berkeley), Judy Hoffman (Stanford University), Kate Saenko (Boston University), Trevor Darrell (University of California, Berkeley)

Abstract

Adversarial learning methods are a promising approach to training robust deep networks, and can generate complex samples across diverse domains. They can also improve recognition despite the presence of domain shift or dataset bias: several adversarial approaches to unsupervised domain adaptation have recently been introduced, which reduce the difference between the training and test domain distributions and thus improve generalization performance. Prior generative approaches show compelling visualizations, but are not optimal on discriminative tasks and can be limited to smaller shifts. Prior discriminative approaches could handle larger domain shifts, but imposed tied weights on the model and did not exploit a GAN-based loss. We first outline a novel generalized framework for adversarial adaptation, which subsumes recent state-of-the-art approaches as special cases, and we use this generalized view to better relate the prior approaches. We propose a previously unexplored instance of our general framework which combines discriminative modeling, untied weight sharing, and a GAN loss, which we call Adversarial Discriminative Domain Adaptation (ADDA). We show that ADDA is more effective yet considerably simpler than competing domain-adversarial methods, and demonstrate the promise of our approach by exceeding state-of-the-art unsupervised adaptation results on standard cross-domain digit classification tasks and a new more difficult cross-modality object classification task.

Figure 1: We propose an improved unsupervised domain adaptation method that combines adversarial learning with discriminative feature learning. Specifically, we learn a discriminative mapping of target images to the source feature space (target encoder) by fooling a domain discriminator that tries to distinguish the encoded target images from source examples.

arXiv:1702.05464v1 [cs.CV] 17 Feb 2017
1. Introduction

Deep convolutional networks, when trained on large-scale datasets, can learn representations which are generically useful across a variety of tasks and visual domains [1, 2]. However, due to a phenomenon known as dataset bias or domain shift [3], recognition models trained along with these representations on one large dataset do not generalize well to novel datasets and tasks [4, 1]. The typical solution is to further fine-tune these networks on task-specific datasets; however, it is often prohibitively difficult and expensive to obtain enough labeled data to properly fine-tune the large number of parameters employed by deep multilayer networks.

Domain adaptation methods attempt to mitigate the harmful effects of domain shift. Recent domain adaptation methods learn deep neural transformations that map both domains into a common feature space. This is generally achieved by optimizing the representation to minimize some measure of domain shift such as maximum mean discrepancy [5, 6] or correlation distances [7, 8]. An alternative is to reconstruct the target domain from the source representation [9].

Adversarial adaptation methods have become an increasingly popular incarnation of this type of approach which seeks to minimize an approximate domain discrepancy distance through an adversarial objective with respect to a domain discriminator. These methods are closely related to generative adversarial learning [10], which pits two networks against each other: a generator and a discriminator. The generator is trained to produce images in a way that confuses the discriminator, which in turn tries to distinguish them from real image examples. In domain adaptation, this principle has been employed to ensure that the network cannot distinguish between the distributions of its training and test domain examples [11, 12, 13]. However, each algorithm makes different design choices such as whether to use a generator, which loss function to employ, or whether to share weights across domains. For example, [11, 12] share weights and learn a symmetric mapping of both source and target images to the shared feature space, while [13] decouple some layers, thus learning a partially asymmetric mapping.

In this work, we propose a novel unified framework for adversarial domain adaptation, allowing us to effectively examine the different factors of variation between the existing approaches and clearly view the similarities they each share. Our framework unifies design choices such as weight-sharing, base models, and adversarial losses and subsumes previous work, while also facilitating the design of novel instantiations that improve upon existing ones.

In particular, we observe that generative modeling of input image distributions is not necessary, as the ultimate task is to learn a discriminative representation. On the other hand, asymmetric mappings can better model the difference in low level features than symmetric ones. We therefore propose a previously unexplored unsupervised adversarial adaptation method, Adversarial Discriminative Domain Adaptation (ADDA), illustrated in Figure 1. ADDA first learns a discriminative representation using the labels in the source domain and then a separate encoding that maps the target data to the same space using an asymmetric mapping learned through a domain-adversarial loss. Our approach is simple yet surprisingly powerful and achieves state-of-the-art visual adaptation results on the MNIST, USPS, and SVHN digits datasets. We also test its potential to bridge the gap between even more difficult cross-modality shifts, without requiring instance constraints, by transferring object classifiers from RGB color images to depth observations.

2. Related work

There has been extensive prior work on domain transfer learning, see e.g., [3]. Recent work has focused on transferring deep neural network representations from a labeled source dataset to a target domain where labeled data is sparse or non-existent. In the case of unlabeled target domains (the focus of this paper) the main strategy has been to guide feature learning by minimizing the difference between the source and target feature distributions [11, 12, 5, 6, 8, 9, 13].

Several methods have used the Maximum Mean Discrepancy (MMD) [3] loss for this purpose. MMD computes the norm of the difference between two domain means. The DDC method [5] used MMD in addition to the regular classification loss on the source to learn a representation that is both discriminative and domain invariant. The Deep Adaptation Network (DAN) [6] applied MMD to layers embedded in a reproducing kernel Hilbert space, effectively matching higher order statistics of the two distributions. In contrast, the deep Correlation Alignment (CORAL) [8] method proposed to match the mean and covariance of the two distributions.

Other methods have chosen an adversarial loss to minimize domain shift, learning a representation that is simultaneously discriminative of source labels while not being able to distinguish between domains. [12] proposed adding a domain classifier (a single fully connected layer) that predicts the binary domain label of the inputs and designed a domain confusion loss to encourage its prediction to be as close as possible to a uniform distribution over binary labels. The gradient reversal algorithm (ReverseGrad) proposed in [11] also treats domain invariance as a binary classification problem, but directly maximizes the loss of the domain classifier by reversing its gradients. DRCN [9] takes a similar approach but also learns to reconstruct target domain images.

In related work, adversarial learning has been explored for generative tasks. The Generative Adversarial Network (GAN) method [10] is a generative deep model that pits two networks against one another: a generative model G that captures the data distribution and a discriminative model D that distinguishes between samples drawn from G and images drawn from the training data by predicting a binary label. The networks are trained jointly using backprop on the label prediction loss in a mini-max fashion: simultaneously update D to minimize the loss while also updating G to maximize the loss (fooling the discriminator). The advantage of GAN over other generative methods is that there is no need for complex sampling or inference during training; the downside is that it may be difficult to train. GANs have been applied to generate natural images of objects, such as digits and faces, and have been extended in several ways. The BiGAN approach [14] extends GANs to also learn the inverse mapping from the image data back into the latent space, and shows that this can learn features useful for image classification tasks. The conditional generative adversarial net (CGAN) [15] is an extension of the GAN where both networks G and D receive an additional vector of information as input. This might contain, say, information about the class of the training example. The authors apply CGAN to generate a (possibly multi-modal) distribution of tag-vectors conditional on image features.

Recently the CoGAN [13] approach applied GANs to the domain transfer problem by training two GANs to generate the source and target images respectively. The approach achieves a domain invariant feature space by tying the high-level layer parameters of the two GANs, and shows that the same noise input can generate a corresponding pair of images from the two distributions. Domain adaptation was performed by training a classifier on the discriminator output and applied to shifts between the MNIST and USPS digit datasets. However, this approach relies on the generators finding a mapping from the shared high-level layer feature space to full images in both domains. This can work well for, say, digits, but can be difficult in the case of more distinct domains. In this paper, we observe that modeling the image distributions is not strictly necessary to achieve domain adaptation, as long as the latent feature space is domain invariant, and propose a discriminative approach.

Figure 2: Our generalized architecture for adversarial domain adaptation. Existing adversarial adaptation methods can be viewed as instantiations of our framework with different choices regarding their properties.

3. Generalized adversarial adaptation

We present a general framework for adversarial unsupervised adaptation methods. In unsupervised adaptation, we assume access to source images $X_s$ and labels $Y_s$ drawn from a source domain distribution $p_s(x, y)$, as well as target images $X_t$ drawn from a target distribution $p_t(x, y)$, where there are no label observations. Our goal is to learn a target representation, $M_t$, and classifier, $C_t$, that can correctly classify target images into one of $K$ categories at test time, despite the lack of in-domain annotations. Since direct supervised learning on the target is not possible, domain adaptation instead learns a source representation mapping, $M_s$, along with a source classifier, $C_s$, and then learns to adapt that model for use in the target domain.

In adversarial adaptive methods, the main goal is to regularize the learning of the source and target mappings, $M_s$ and $M_t$, so as to minimize the distance between the empirical source and target mapping distributions: $M_s(X_s)$ and $M_t(X_t)$. If this is the case then the source classification model, $C_s$, can be directly applied to the target representations, eliminating the need to learn a separate target classifier and instead setting $C = C_s = C_t$.

The source classification model is then trained using the standard supervised loss below:

$$\min_{M_s, C} \mathcal{L}_{\mathrm{cls}}(X_s, Y_s) = -\mathbb{E}_{(x_s, y_s) \sim (X_s, Y_s)} \sum_{k=1}^{K} \mathbb{1}_{[k = y_s]} \log C(M_s(x_s)) \tag{1}$$

We are now able to describe our full general framework view of adversarial adaptation approaches. We note that all approaches minimize source and target representation distances through alternating minimization between two functions. The first is a domain discriminator, $D$, which classifies whether a data point is drawn from the source or the target domain. Thus, $D$ is optimized according to a standard supervised loss, $\mathcal{L}_{\mathrm{adv}_D}(X_s, X_t, M_s, M_t)$, where the labels indicate the origin domain, defined below:

$$\mathcal{L}_{\mathrm{adv}_D}(X_s, X_t, M_s, M_t) = -\mathbb{E}_{x_s \sim X_s}[\log D(M_s(x_s))] - \mathbb{E}_{x_t \sim X_t}[\log(1 - D(M_t(x_t)))] \tag{2}$$

Second, the source and target mappings are optimized according to a constrained adversarial objective, whose particular instantiation may vary across methods. Thus, we can derive a generic formulation for domain adversarial techniques below:

$$\begin{aligned} &\min_{D} \ \mathcal{L}_{\mathrm{adv}_D}(X_s, X_t, M_s, M_t) \\ &\min_{M_s, M_t} \ \mathcal{L}_{\mathrm{adv}_M}(X_s, X_t, D) \\ &\text{s.t.} \ \ \psi(M_s, M_t) \end{aligned} \tag{3}$$

In the next sections, we demonstrate the value of our framework by positioning recent domain adversarial approaches within our framework. We describe the potential mapping structure, the choices of mapping optimization constraints ($\psi(M_s, M_t)$), and finally the choices of adversarial mapping loss, $\mathcal{L}_{\mathrm{adv}_M}$.

Method                   Base model       Weight sharing   Adversarial loss
Gradient reversal [16]   discriminative   shared           minimax
Domain confusion [12]    discriminative   shared           confusion
CoGAN [13]               generative       unshared         GAN
ADDA (Ours)              discriminative   unshared         GAN

Table 1: Overview of adversarial domain adaptation methods and their various properties. Viewing methods under a unified framework enables us to easily propose a new adaptation method, adversarial discriminative domain adaptation (ADDA).
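Equations 1 and 2 can be made concrete with a small numeric sketch. The snippet below is only an illustration under stated assumptions: the function names and the toy probabilities are hypothetical and not taken from the paper, and the classifier and discriminator outputs are supplied directly as numbers rather than produced by networks.

```python
import math

def classification_loss(class_probs, labels):
    """Eq. 1: mean negative log-probability assigned to the true class.

    class_probs[i][k] stands in for the k-th output of C(M_s(x_i)).
    """
    return -sum(math.log(p[y]) for p, y in zip(class_probs, labels)) / len(labels)

def discriminator_loss(d_source, d_target):
    """Eq. 2: source features are labeled 1 and target features 0.

    d_source[i] stands in for D(M_s(x_s)), d_target[j] for D(M_t(x_t)),
    each a probability strictly between 0 and 1.
    """
    src_term = -sum(math.log(d) for d in d_source) / len(d_source)
    tgt_term = -sum(math.log(1.0 - d) for d in d_target) / len(d_target)
    return src_term + tgt_term

# Hypothetical toy values for a 3-class problem with two source examples.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
cls = classification_loss(probs, labels)

# A discriminator that separates the domains well has a low loss;
# one that outputs 0.5 everywhere has loss 2 * log(2).
confident = discriminator_loss([0.9, 0.8], [0.1, 0.2])
confused = discriminator_loss([0.5, 0.5], [0.5, 0.5])
```

In this toy setting the adversarial mapping objective pushes the discriminator from the `confident` regime toward the `confused` one, which is the sense in which the mapped source and target distributions become hard to tell apart.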

3.1. Source and target mappings

In the case of learning a source mapping $M_s$ alone it is clear that supervised training through a latent space discriminative loss using the known labels $Y_s$ results in the best representation for final source recognition. However, given that our target domain is unlabeled, it remains an open question how best to minimize the distance between the source and target mappings. Thus the first choice to be made is in the particular parameterization of these mappings.

Because unsupervised domain adaptation generally considers target discriminative tasks such as classification, previous adaptation methods have generally relied on adapting discriminative models between domains [12, 16]. With a discriminative base model, input images are mapped into a feature space that is useful for a discriminative task such as image classification. For example, in the case of digit classification this may be the standard LeNet model. However, Liu and Tuzel achieve state-of-the-art results on unsupervised MNIST-USPS using two generative adversarial networks [13]. These generative models use random noise as input to generate samples in image space; generally, an intermediate feature of an adversarial discriminator is then used as a feature for training a task-specific classifier.

Once the mapping parameterization is determined for the source, we must decide how to parametrize the target mapping $M_t$. In general, the target mapping almost always matches the source in terms of the specific functional layer (architecture), but different methods have proposed various regularization techniques. All methods initialize the target mapping parameters with the source, but different methods choose different constraints between the source and target mappings, $\psi(M_s, M_t)$. The goal is to make sure that the target mapping is set so as to minimize the distance between the source and target domains under their respective mappings, while crucially also maintaining a target mapping that is category discriminative.

Consider a layered representation where each layer's parameters are denoted as $M_s^{\ell}$ or $M_t^{\ell}$, for a given set of equivalent layers, $\{\ell_1, \ldots, \ell_n\}$. Then the space of constraints explored in the literature can be described through layerwise equality constraints as follows:

$$\psi(M_s, M_t) \triangleq \{\psi_{\ell_i}(M_s^{\ell_i}, M_t^{\ell_i})\}_{i \in \{1 \ldots n\}} \tag{4}$$

where each individual layer can be constrained independently. A very common form of constraint is source and target layerwise equality:

$$\psi_{\ell_i}(M_s^{\ell_i}, M_t^{\ell_i}) = (M_s^{\ell_i} = M_t^{\ell_i}). \tag{5}$$

It is also common to leave layers unconstrained. These equality constraints can easily be imposed within a convolutional network framework through weight sharing.

For many prior adversarial adaptation methods [16, 12], all layers are constrained, thus enforcing exact source and target mapping consistency. Learning a symmetric transformation reduces the number of parameters in the model and ensures that the mapping used for the target is discriminative at least when applied to the source domain. However, this may make the optimization poorly conditioned, since the same network must handle images from two separate domains.

An alternative approach is instead to learn an asymmetric transformation with only a subset of the layers constrained, thus enforcing partial alignment. Rozantsev et al. [17] showed that partially shared weights can lead to effective adaptation in both supervised and unsupervised settings. As a result, some recent methods have favored untying weights (fully or partially) between the two domains, allowing models to learn parameters for each domain individually.

3.2. Adversarial losses

Once we have decided on a parametrization of $M_t$, we employ an adversarial loss to learn the actual mapping. There are various possible choices of adversarial loss functions, each of which has its own use cases. All adversarial losses train the adversarial discriminator using a standard classification loss, $\mathcal{L}_{\mathrm{adv}_D}$, previously stated in Equation 2. However, they differ in the loss used to train the mapping, $\mathcal{L}_{\mathrm{adv}_M}$.

The gradient reversal layer of [16] optimizes the mapping to maximize the discriminator loss directly:

$$\mathcal{L}_{\mathrm{adv}_M} = -\mathcal{L}_{\mathrm{adv}_D}. \tag{6}$$

This optimization corresponds to the true minimax objective for generative adversarial networks. However, this objective can be problematic, since early on during training the discriminator converges quickly, causing the gradient to vanish.

When training GANs, rather than directly using the minimax loss, it is typical to train the generator with the standard loss function with inverted labels [10]. This splits the optimization into two independent objectives, one for the generator and one for the discriminator, where $\mathcal{L}_{\mathrm{adv}_D}$ remains unchanged, but $\mathcal{L}_{\mathrm{adv}_M}$ becomes:

$$\mathcal{L}_{\mathrm{adv}_M}(X_s, X_t, D) = -\mathbb{E}_{x_t \sim X_t}[\log D(M_t(x_t))]. \tag{7}$$

This objective has the same fixed-point properties as the minimax loss but provides stronger gradients to the target mapping. We refer to this modified loss function as the "GAN loss function" for the remainder of this paper.

Note that, in this setting, we use independent mappings for source and target and learn only $M_t$ adversarially. This mimics the GAN setting, where the real image distribution remains fixed, and the generating distribution is learned to match it.

The GAN loss function is the standard choice in the setting where the generator is attempting to mimic another unchanging distribution. However, in the setting where both distributions are changing, this objective will lead to oscillation: when the mapping converges to its optimum, the discriminator can simply flip the sign of its prediction in response. Tzeng et al. instead proposed the domain confusion objective, under which the mapping is trained using a cross-entropy loss function against a uniform distribution [12]:

$$\mathcal{L}_{\mathrm{adv}_M}(X_s, X_t, D) = -\sum_{d \in \{s, t\}} \mathbb{E}_{x_d \sim X_d}\left[\frac{1}{2} \log D(M_d(x_d)) + \frac{1}{2} \log(1 - D(M_d(x_d)))\right]. \tag{8}$$

This loss ensures that the adversarial discriminator views the two domains identically.

4. Adversarial discriminative domain adaptation

The benefit of our generalized framework for domain adversarial methods is that it directly enables the development of novel adaptive methods. In fact, designing a new method has now been simplified to making three design choices: whether to use a generative or discriminative base model, whether to tie or untie the weights, and which adversarial learning objective to use. In light of this view we can summarize our method, adversarial discriminative domain adaptation (ADDA), as well as its connection to prior work, according to our choices (see Table 1 "ADDA"). Specifically, we use a discriminative base model, unshared weights, and the standard GAN loss. We illustrate our overall sequential training procedure in Figure 3.

First, we choose a discriminative base model, as we hypothesize that many of the parameters required to generate convincing in-domain samples are irrelevant for discriminative adaptation tasks. Most prior adversarial adaptive methods optimize directly in a discriminative space for this reason. One counter-example is CoGANs. However, this method has only shown dominance in settings where the source and target domain are very similar such as MNIST and USPS, and in our experiments we have had difficulty getting it to converge for larger distribution shifts.

Next, we choose to allow independent source and target mappings by untying the weights. This is a more flexible learning paradigm as it allows more domain specific feature extraction to be learned. However, note that the target domain has no label access, and thus without weight sharing a target model may quickly learn a degenerate solution if we do not take care with proper initialization and training procedures. Therefore, we use the pre-trained source model as an initialization for the target representation space and fix the source model during adversarial training.

In doing so, we are effectively learning an asymmetric mapping, in which we modify the target model so as to match the source distribution. This is most similar to the original generative adversarial learning setting, where a generated space is updated until it is indistinguishable from a fixed real space. Therefore, we choose the inverted label GAN loss described in the previous section.

Our proposed method, ADDA, thus corresponds to the following unconstrained optimization:

$$\begin{aligned} \min_{M_s, C} \ \mathcal{L}_{\mathrm{cls}}(X_s, Y_s) &= -\mathbb{E}_{(x_s, y_s) \sim (X_s, Y_s)} \sum_{k=1}^{K} \mathbb{1}_{[k = y_s]} \log C(M_s(x_s)) \\ \min_{D} \ \mathcal{L}_{\mathrm{adv}_D}(X_s, X_t, M_s, M_t) &= -\mathbb{E}_{x_s \sim X_s}[\log D(M_s(x_s))] - \mathbb{E}_{x_t \sim X_t}[\log(1 - D(M_t(x_t)))] \\ \min_{M_s, M_t} \ \mathcal{L}_{\mathrm{adv}_M}(X_s, X_t, D) &= -\mathbb{E}_{x_t \sim X_t}[\log D(M_t(x_t))]. \end{aligned} \tag{9}$$

We choose to optimize this objective in stages. We begin by optimizing $\mathcal{L}_{\mathrm{cls}}$ over $M_s$ and $C$ by training using the labeled source data, $X_s$ and $Y_s$. Because we have opted to leave $M_s$ fixed while learning $M_t$, we can thus optimize $\mathcal{L}_{\mathrm{adv}_D}$ and $\mathcal{L}_{\mathrm{adv}_M}$ without revisiting the first objective term. A summary of this entire training process is provided in Figure 3.
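The three mapping losses of Section 3.2 (Equations 6 to 8) differ only in how they score the discriminator's outputs, which a few lines of code can make explicit. The following is a minimal sketch, assuming the discriminator outputs are given as plain probabilities; all function names are hypothetical. The two gradient helpers assume $D = \sigma(z)$ for a logit $z$ and illustrate why the inverted-label loss gives stronger gradients when the discriminator confidently rejects target features.

```python
import math

def mean(values):
    return sum(values) / len(values)

def discriminator_loss(d_src, d_tgt):
    """Eq. 2, with source labeled 1 and target labeled 0."""
    return (-mean([math.log(d) for d in d_src])
            - mean([math.log(1.0 - d) for d in d_tgt]))

def minimax_mapping_loss(d_src, d_tgt):
    """Eq. 6 (gradient reversal): the negated discriminator loss."""
    return -discriminator_loss(d_src, d_tgt)

def gan_mapping_loss(d_tgt):
    """Eq. 7 (inverted labels): target features should be scored as source."""
    return -mean([math.log(d) for d in d_tgt])

def confusion_mapping_loss(d_src, d_tgt):
    """Eq. 8: cross-entropy against a uniform domain label, summed over domains."""
    def per_domain(ds):
        return mean([0.5 * math.log(d) + 0.5 * math.log(1.0 - d) for d in ds])
    return -(per_domain(d_src) + per_domain(d_tgt))

def grad_wrt_logit_minimax(d):
    """Derivative of log(1 - sigmoid(z)), the target term of Eq. 6, at D = d."""
    return -d

def grad_wrt_logit_gan(d):
    """Derivative of -log(sigmoid(z)), the loss of Eq. 7, at D = d."""
    return d - 1.0

# Early in training the discriminator may assign target features D close to 0.
d_confident = 0.01
weak = abs(grad_wrt_logit_minimax(d_confident))    # nearly vanishes
strong = abs(grad_wrt_logit_gan(d_confident))      # close to 1
```

The confusion loss of Equation 8 is smallest when the discriminator outputs 0.5 for every input, whereas the GAN loss of Equation 7 keeps decreasing as the target outputs approach 1; this is consistent with the text's remark that the GAN loss suits the case where the source distribution stays fixed.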


Figure 3: An overview of our proposed Adversarial Discriminative Domain Adaptation (ADDA) approach. We first pre-train a source encoder CNN using labeled source image examples. Next, we perform adversarial adaptation by learning a target encoder CNN such that a discriminator that sees encoded source and target examples cannot reliably predict their domain label. During testing, target images are mapped with the target encoder to the shared feature space and classified by the source classifier. Dashed lines indicate fixed network parameters.
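The adversarial adaptation stage can be sketched on a deliberately tiny problem. This is a toy illustration under strong assumptions, not the authors' implementation: inputs are scalars, the frozen source mapping is assumed to be already pre-trained as $M_s(x) = x$, the target mapping is $M_t(x) = x + b_t$ with a single learnable shift, and the discriminator is logistic, $D(f) = \sigma(a f + c)$. All names and values are hypothetical. The discriminator follows Equation 2 and the target mapping follows the inverted-label loss of Equation 7.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean(values):
    return sum(values) / len(values)

def discriminator_step(a, c, f_src, f_tgt, lr):
    """One gradient step on Eq. 2 for D(f) = sigmoid(a * f + c)."""
    d_src = [sigmoid(a * f + c) for f in f_src]
    d_tgt = [sigmoid(a * f + c) for f in f_tgt]
    # Gradient of -log(D) w.r.t. the logit is (D - 1); of -log(1 - D) it is D.
    grad_a = (mean([(d - 1.0) * f for d, f in zip(d_src, f_src)])
              + mean([d * f for d, f in zip(d_tgt, f_tgt)]))
    grad_c = mean([d - 1.0 for d in d_src]) + mean(d_tgt)
    return a - lr * grad_a, c - lr * grad_c

def target_mapping_step(b_t, a, c, x_tgt, lr):
    """One gradient step on Eq. 7 for M_t(x) = x + b_t, with D held fixed."""
    d_tgt = [sigmoid(a * (x + b_t) + c) for x in x_tgt]
    grad_b = mean([(d - 1.0) * a for d in d_tgt])
    return b_t - lr * grad_b

def adapt(x_src, x_tgt, rounds, lr=0.1):
    # Pre-training is assumed done: the frozen source mapping is M_s(x) = x + 0.
    b_s = 0.0
    f_src = [x + b_s for x in x_src]
    b_t = b_s          # target mapping initialized from the source mapping
    a, c = 0.0, 0.0    # discriminator parameters
    for _ in range(rounds):
        f_tgt = [x + b_t for x in x_tgt]
        a, c = discriminator_step(a, c, f_src, f_tgt, lr)
        b_t = target_mapping_step(b_t, a, c, x_tgt, lr)
    return b_t, a, c

# Toy domain shift: target inputs are the source inputs moved up by 3.
b_t, a, c = adapt(x_src=[-1.0, 0.0, 1.0], x_tgt=[2.0, 3.0, 4.0], rounds=1)
```

After one round the discriminator weight `a` is negative (larger features look more target-like) and the learned shift `b_t` is slightly negative, so the target features have started to move toward the source features. Adversarial updates of this kind can oscillate, so the sketch makes no claim about convergence over many rounds.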

We note that the unified framework presented in the previ- and SVHN [19] digits datasets, which consist 10 classes of ous section has enabled us to compare prior domain adversar- digits. Example images from each dataset are visualized in ial methods and make informed decisions about the different Figure4 and Table2. For adaptation between MNIST and factors of variation. Through this framework we are able USPS, we follow the training protocol established in [21], to motivate a novel domain adaptation method, ADDA, and sampling 2000 images from MNIST and 1800 from USPS. offer insight into our design decisions. In the next section For adaptation between SVHN and MNIST, we use the full we demonstrate promising results on unsupervised adapta- training sets for comparison against [16]. All experiments tion benchmark tasks, studying adaptation across digits and are performed in the unsupervised settings, where labels in across modalities. the target domain are withheld, and we consider adaptation in three directions: MNIST→USPS, USPS→MNIST, and 5. Experiments SVHN→MNIST. For these experiments, we use the simple modified LeNet We now evaluate ADDA for unsupervised classification architecture provided in the Caffe source code [18, 22]. adaptation across four different domain shifts. We explore When training with ADDA, our adversarial discriminator three digits datasets of varying difficulty: MNIST [18], consists of 3 fully connected layers: two layers with 500 USPS, and SVHN [19]. We additionally evaluate on the hidden units followed by the final discriminator output. Each NYUD [20] dataset to study adaptation across modalities. of the 500-unit layers uses a ReLU activation function. Example images from all experimental datasets are provided Results of our experiment are provided in Table2. On the in Figure4. 
easier MNIST and USPS shifts ADDA achieves comparable For the case of digit adaptation, we compare against mul- performance to the current state-of-the-art, CoGANs [13], tiple state-of-the-art unsupervised adaptation methods, all despite being a considerably simpler model. This provides based upon domain adversarial learning objectives. In 3 of compelling evidence that the machinery required to generate 4 of our experimental setups, our method outperforms all images is largely irrelevant to enabling effective adaptation. competing approaches, and in the last domain shift studied, Additionally, we show convincing results on the challenging our approach outperforms all but one competing approach. SVHN and MNIST task in comparison to other methods, We also validate our model on a real-world modality indicating that our method has the potential to generalize adaptation task using the NYU depth dataset. Despite a large to a variety of settings. In contrast, we were unable to get domain shift between the RGB and depth modalities, ADDA CoGANs to converge on SVHN and MNIST—because the learns a useful depth representation without any labeled domains are so disparate, we were unable to train coupled depth data and improves over the nonadaptive baseline by generators for them. over 50% (relative). 5.2. Modality adaptation 5.1. MNIST, USPS, and SVHN digits datasets We use the NYU depth dataset [20], which contains We experimentally validate our proposed method in an un- bounding box annotations for 19 object classes in 1449 im- supervised adaptation task between the MNIST [18], USPS, ages from indoor scenes. The dataset is split into a train (381 Digits adaptation Cross-modality adaptation (NYUD)

MNIST RGB

USPS

HHA SVHN

Figure 4: We evaluate ADDA on unsupervised adaptation across four domain shifts in two different settings. The first setting is adaptation between the MNIST, USPS, and SVHN datasets (left). The second setting is a challenging cross-modality adaptation task between RGB and depth modalities from the NYU depth dataset (right).

MNIST → USPS USPS → MNIST SVHN → MNIST Method → → → Source only 0.752 ± 0.016 0.571 ± 0.017 0.601 ± 0.011 Gradient reversal 0.771 ± 0.018 0.730 ± 0.020 0.739 [16] Domain confusion 0.791 ± 0.005 0.665 ± 0.033 0.681 ± 0.003 CoGAN 0.912 ± 0.008 0.891 ± 0.008 did not converge ADDA (Ours) 0.894 ± 0.002 0.901 ± 0.008 0.760 ± 0.018

Table 2: Experimental results on unsupervised adaptation among MNIST, USPS, and SVHN. images), val (414 images) and test (654) sets. To perform inator consists of three additional fully connected layers: our cross-modality adaptation, we first crop out tight bound- 1024 hidden units, 2048 hidden units, then the adversarial ing boxes around instances of these 19 classes present in discriminator output. With the exception of the output, these the dataset and evaluate on a 19-way classification task over additionally fully connected layers use a ReLU activation object crops. In order to ensure that the same instance is not function. ADDA training then proceeds for another 20,000 seen in both domains, we use the RGB images from the train iterations, again with a batch size of 128. split as the source domain and the depth images from the val We find that our method, ADDA, greatly improves clas- split as the target domain. This corresponds to 2,186 labeled sification accuracy for this task. For certain categories, like source images and 2,401 unlabeled target images. Figure4 counter, classification accuracy goes from 2.9% under the visualizes samples from each of the two domains. source only baseline up to 44.7% after adaptation. In general, We consider the task of adaptation between these RGB average accuracy across all classes improves significantly and HHA encoded depth images [23], using them as source from 13.9% to 21.1%. However, not all classes improve. and target domains respectively. Because the bounding boxes Three classes have no correctly labeled target images before are tight and relatively low resolution, accurate classification adaptation, and adaptation is unable to recover performance is quite difficult, even when evaluating in-domain. In addi- on these classes. Additionally, the classes of pillow and tion, the dataset has very few examples for certain classes, nightstand suffer performance loss after adaptation. 
such as toilet and bathtub, which directly translates to reduced classification performance.

For this experiment, our base architecture is the VGG-16 architecture, initializing from weights pretrained on ImageNet [24]. This network is then fully fine-tuned on the source domain for 20,000 iterations using a batch size of 128. When training with ADDA, the adversarial discrim-

For additional insight on what effect ADDA has on classification, Figure 5 plots confusion matrices before adaptation, after adaptation, and in the hypothetical best-case scenario where the target labels are present. Examining the confusion matrix for the source only baseline reveals that the domain shift is quite large; as a result, the network is poorly conditioned and incorrectly predicts pillow for the majority of the dataset. This tendency to output pillow also explains why the source only model achieves such abnormally high accuracy on the pillow class, despite poor performance on the rest of the classes.

In contrast, the classifier trained using ADDA predicts a much wider variety of classes. This leads to decreased accuracy for the pillow class, but significantly higher accuracies for many of the other classes. Additionally, comparison with the “train on target” model reveals that many of the mistakes the ADDA model makes are reasonable, such as confusion between the chair and table classes, indicating that the ADDA model is learning a useful representation on depth images.

6. Conclusion

We have proposed a unified framework for unsupervised domain adaptation techniques based on adversarial learning objectives. Our framework provides a simplified and cohesive view by which we may understand and connect the similarities and differences between recently proposed adaptation methods. Through this comparison, we are able to understand the benefits and key ideas from each approach and to combine these strategies into a new adaptation method, ADDA.

We present evaluation across four domain shifts for our unsupervised adaptation approach. Our method generalizes well across a variety of tasks, achieving strong results on benchmark adaptation datasets as well as a challenging cross-modality adaptation task. Additional analysis indicates that the representations learned via ADDA resemble features learned with supervisory data in the target domain much more closely than unadapted features, providing further evidence that ADDA is effective at partially undoing the effects of domain shift.

[Figure 5 image: three confusion-matrix panels titled Source only, ADDA (Ours), and Train on target; axes are True label and Predicted label over the 19 classes listed in Table 3.]

Figure 5: Confusion matrices for source only, ADDA, and oracle supervised target models on the NYUD RGB to depth adaptation experiment. We observe that our unsupervised adaptation algorithm results in a space more conducive to recognition of the most prevalent class of chair.

Table 3: Adaptation results on the NYUD [20] dataset, using RGB images from the train set as source and depth images from the val set as target domains. We report here per class accuracy due to the large class imbalance in our target set (indicated in # instances). Overall our method improves average per category accuracy from 13.9% to 21.1%.

Class          # of instances   Source only   ADDA (Ours)   Train on target
bathtub                    19         0.000         0.000             0.105
bed                        96         0.010         0.146             0.531
bookshelf                  87         0.011         0.046             0.494
box                       210         0.124         0.229             0.295
chair                     611         0.188         0.344             0.619
counter                   103         0.029         0.447             0.573
desk                      122         0.041         0.025             0.057
door                      129         0.047         0.023             0.636
dresser                    25         0.000         0.000             0.120
garbage bin                55         0.000         0.018             0.291
lamp                      144         0.069         0.292             0.576
monitor                    37         0.000         0.081             0.189
night stand                51         0.039         0.020             0.235
pillow                    276         0.587         0.297             0.630
sink                       47         0.000         0.021             0.362
sofa                      129         0.008         0.116             0.248
table                     210         0.010         0.143             0.357
television                 33         0.000         0.091             0.303
toilet                     17         0.000         0.000             0.647
overall                  2401         0.139         0.211             0.468
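As a consistency check on Table 3, the overall column appears to be an instance-weighted mean of the per-class accuracies, that is, accuracy over all 2401 target instances, rather than an unweighted mean over the 19 classes. This is our reading of the numbers, not a statement taken from the evaluation code. The minimal Python sketch below recomputes both quantities from the per-class values in Table 3; the names TABLE3, weighted_accuracy, and unweighted_accuracy are illustrative only.

```python
# Rows transcribed from Table 3:
# (class, # of instances, source only, ADDA, train on target)
TABLE3 = [
    ("bathtub", 19, 0.000, 0.000, 0.105),
    ("bed", 96, 0.010, 0.146, 0.531),
    ("bookshelf", 87, 0.011, 0.046, 0.494),
    ("box", 210, 0.124, 0.229, 0.295),
    ("chair", 611, 0.188, 0.344, 0.619),
    ("counter", 103, 0.029, 0.447, 0.573),
    ("desk", 122, 0.041, 0.025, 0.057),
    ("door", 129, 0.047, 0.023, 0.636),
    ("dresser", 25, 0.000, 0.000, 0.120),
    ("garbage bin", 55, 0.000, 0.018, 0.291),
    ("lamp", 144, 0.069, 0.292, 0.576),
    ("monitor", 37, 0.000, 0.081, 0.189),
    ("night stand", 51, 0.039, 0.020, 0.235),
    ("pillow", 276, 0.587, 0.297, 0.630),
    ("sink", 47, 0.000, 0.021, 0.362),
    ("sofa", 129, 0.008, 0.116, 0.248),
    ("table", 210, 0.010, 0.143, 0.357),
    ("television", 33, 0.000, 0.091, 0.303),
    ("toilet", 17, 0.000, 0.000, 0.647),
]


def weighted_accuracy(rows, col):
    """Mean of per-class accuracies weighted by the number of instances."""
    total = sum(row[1] for row in rows)
    return sum(row[1] * row[col] for row in rows) / total


def unweighted_accuracy(rows, col):
    """Plain mean of per-class accuracies, every class counted equally."""
    return sum(row[col] for row in rows) / len(rows)


# Column indices follow the tuple layout documented above.
for name, col in (("Source only", 2), ("ADDA (Ours)", 3), ("Train on target", 4)):
    print(f"{name:16s} weighted={weighted_accuracy(TABLE3, col):.3f} "
          f"unweighted={unweighted_accuracy(TABLE3, col):.3f}")
```

Under this reading, the weighted values round to 0.139, 0.211, and 0.468, which match the overall column of Table 3. The unweighted means are lower for all three models, since large classes such as chair and pillow dominate the weighted figure.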

References

[1] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML), pages 647–655, 2014.

[2] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Neural Information Processing Systems (NIPS), pages 3320–3328, 2014.

[3] A. Gretton, AJ. Smola, J. Huang, M. Schmittfull, KM. Borgwardt, and B. Schölkopf. Covariate shift and local learning by distribution matching, pages 131–160. MIT Press, Cambridge, MA, USA, 2009.

[4] Antonio Torralba and Alexei A. Efros. Unbiased look at dataset bias. In CVPR'11, June 2011.

[5] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. CoRR, abs/1412.3474, 2014.

[6] Mingsheng Long and Jianmin Wang. Learning transferable features with deep adaptation networks. International Conference on Machine Learning (ICML), 2015.

[7] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[8] Baochen Sun and Kate Saenko. Deep CORAL: correlation alignment for deep domain adaptation. In ICCV workshop on Transferring and Adapting Source Knowledge in Computer Vision (TASK-CV), 2016.

[9] Muhammad Ghifary, W Bastiaan Kleijn, Mengjie Zhang, David Balduzzi, and Wen Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision (ECCV), pages 597–613. Springer, 2016.

[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27. 2014.

[11] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In David Blei and Francis Bach, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1180–1189. JMLR Workshop and Conference Proceedings, 2015.

[12] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In International Conference in Computer Vision (ICCV), 2015.

[13] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. CoRR, abs/1606.07536, 2016.

[14] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. CoRR, abs/1605.09782, 2016.

[15] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.

[16] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.

[17] Artem Rozantsev, Mathieu Salzmann, and Pascal Fua. Beyond sharing weights for deep domain adaptation. CoRR, abs/1603.06432, 2016.

[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

[19] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.

[20] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In European Conference on Computer Vision (ECCV), 2012.

[21] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer feature learning with joint distribution adaptation. In 2013 IEEE International Conference on Computer Vision, pages 2200–2207, Dec 2013.

[22] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[23] Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. Learning rich features from rgb-d images for object detection and segmentation. In European Conference on Computer Vision (ECCV), pages 345–360. Springer, 2014.

[24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.