Domain Generalization for Object Recognition with Multi-Task Autoencoders
Muhammad Ghifary    W. Bastiaan Kleijn    Mengjie Zhang    David Balduzzi
Victoria University of Wellington
{muhammad.ghifary,bastiaan.kleijn,mengjie.zhang}@ecs.vuw.ac.nz, [email protected]

Abstract

The problem of domain generalization is to take knowledge acquired from a number of related domains where training data is available, and to then successfully apply it to previously unseen domains. We propose a new feature learning algorithm, Multi-Task Autoencoder (MTAE), that provides good generalization performance for cross-domain object recognition.

Our algorithm extends the standard denoising autoencoder framework by substituting artificially induced corruption with naturally occurring inter-domain variability in the appearance of objects. Instead of reconstructing images from noisy versions, MTAE learns to transform the original image into analogs in multiple related domains. It thereby learns features that are robust to variations across domains. The learnt features are then used as inputs to a classifier.

We evaluated the performance of the algorithm on benchmark image recognition datasets, where the task is to learn features from multiple datasets and to then predict the image label from unseen datasets. We found that (denoising) MTAE outperforms alternative autoencoder-based models as well as the current state-of-the-art algorithms for domain generalization.

1. Introduction

Recent years have seen dramatic advances in object recognition by deep learning algorithms [23, 11, 32]. Much of the increased performance derives from applying large networks to massive labeled datasets such as PASCAL VOC [14] and ImageNet [22]. Unfortunately, dataset bias – which can include factors such as backgrounds, camera viewpoints and illumination – often causes algorithms to generalize poorly across datasets [35] and significantly limits their usefulness in practical applications. Developing algorithms that are invariant to dataset bias is therefore a compelling problem.

Problem definition. In object recognition, the "visual world" can be considered as decomposing into views (e.g. perspectives or lighting conditions) corresponding to domains. For example, frontal views and 45° rotated views correspond to two different domains. Alternatively, we can associate views or domains with standard image datasets such as PASCAL VOC2007 [14] and Office [31].

The problem of learning from multiple source domains and testing on unseen target domains is referred to as domain generalization [6, 26]. A domain is a probability distribution P_k from which samples {x_i, y_i}_{i=1}^{N_k} are drawn. Source domains provide training samples, whereas distinct target domains are used for testing. In the standard supervised learning framework, it is assumed that the source and target domains coincide. Dataset bias becomes a significant problem when training and test domains differ: applying a classifier trained on one dataset to images sampled from another typically results in poor performance [35, 18]. The goal of this paper is to learn features that improve generalization performance across domains.

Contribution. The challenge is to build a system that recognizes objects in previously unseen datasets, given one or multiple training datasets. We introduce the Multi-task Autoencoder (MTAE), a feature learning algorithm that uses a multi-task strategy [8, 34] to learn unbiased object features, where the task is data reconstruction.

Autoencoders were introduced to address the problem of 'backpropagation without a teacher' by using inputs as labels – and learning to reconstruct them with minimal distortion [28, 5]. Denoising autoencoders in particular are a powerful basic circuit for unsupervised representation learning [36]. Intuitively, corrupting inputs forces autoencoders to learn representations that are robust to noise. This paper proposes a broader view: that autoencoders are generic circuits for learning invariant features. The main contribution is a new training strategy based on naturally occurring transformations such as rotations in viewing angle, dilations in apparent object size, and shifts in lighting conditions. The resulting Multi-Task Autoencoder learns features that are robust to real-world image variability, and therefore generalize well across domains. Extensive experiments show that MTAE with a denoising criterion outperforms the prior state-of-the-art in domain generalization over various cross-dataset recognition tasks.

2. Related work

Domain generalization has recently attracted attention in classification tasks, including automatic gating of flow cytometry data [6, 26] and object recognition [16, 21, 38]. Khosla et al. [21] proposed a multi-task max-margin classifier, which we refer to as Undo-Bias, that explicitly encodes dataset-specific biases in feature space. These biases are used to push the dataset-specific weights to be similar to the global weights. Fang et al. [16] developed Unbiased Metric Learning (UML) based on a learning-to-rank framework.

3.1. Autoencoders

Autoencoders (AE) have become established as a pre-training model for deep learning [5]. Autoencoder training consists of two stages: 1) encoding and 2) decoding. Given an unlabeled input x ∈ R^{d_x}, a single-hidden-layer autoencoder f_Θ(x): R^{d_x} → R^{d_x} can be formulated as

    h = σ_enc(W⊤x)
    x̂ = σ_dec(V⊤h) = f_Θ(x),    (1)
where W ∈ R^{d_x × d_h} and V ∈ R^{d_h × d_x} are the input-to-hidden and hidden-to-output connection weights, respectively¹, h ∈ R^{d_h} is the hidden node vector, and σ_enc(·) = [s_enc(z_1), ..., s_enc(z_{d_h})]⊤ and σ_dec(·) = [s_dec(z_1), ..., s_dec(z_{d_x})]⊤ are element-wise non-linear activation functions; s_enc and s_dec are not necessarily identical. Popular choices for the activation function s(·) are, e.g., the sigmoid s(a) = (1 + exp(−a))^{−1} and the rectified linear (ReLU) s(a) = max(0, a).

Let Θ = {W, V} be the autoencoder parameters and {x_i}_{i=1}^{N} be a set of N input data. Learning corresponds to minimizing the following objective:

    Θ̂ := argmin_Θ Σ_{i=1}^{N} L(f_Θ(x_i), x_i) + η R(Θ),    (2)

where L(·, ·) is the loss function, usually a least-squares or cross-entropy loss, and R(·) is a regularization term used to avoid overfitting. The objective (2) can be optimized by the backpropagation algorithm [29]. If we apply autoencoders to raw pixels of visual object images, the weights W usually form visually meaningful "filters" that can be interpreted qualitatively.

To create a discriminative model from the learnt autoencoder, either of the following options can be considered: 1) the feature map φ(x) := σ_enc(Ŵ⊤x) is extracted and used as input to supervised learning algorithms while keeping the weight matrix Ŵ fixed; 2) the learnt weight matrix Ŵ is used to initialize a neural network model and is updated during supervised neural network training (fine-tuning).

Recently, several variants such as denoising autoencoders (DAE) [37] and contractive autoencoders (CAE) [27] have been proposed to extract features that are more robust to small changes of the input. In DAEs, the objective is to reconstruct a clean input x given its corrupted counterpart x̃ ∼ Q(x̃|x). Commonly used types of corruption are zero-masking, Gaussian, and salt-and-pepper noise. Features extracted by a DAE have been shown to be more discriminative than those extracted by an AE [37].

Validated on weakly-labeled web images, UML produces a less biased distance metric that provides good object recognition performance. More recently, Xu et al. [38] extended an exemplar-SVM to domain generalization by adding a nuclear norm-based regularizer that captures the likelihoods of all positive samples. The proposed model is denoted LRE-SVM.

Other works in object recognition address a similar problem, in the sense of having unknown targets, where the unseen dataset contains noisy images that are not in the training set [17, 33]. However, these were designed to be noise-specific and may suffer from dataset bias when observing objects with different types of noise.

A closely related task to domain generalization is domain adaptation, where unlabeled samples from the target dataset are available during training. Many domain adaptation algorithms have been proposed for object recognition (see, e.g., [2, 31]). Domain adaptation algorithms are not readily applicable to domain generalization, since no information is available about the target domain.

Our proposed algorithm is based on the feature learning approach. Feature learning has been of great interest in the machine learning community since the emergence of deep learning (see [4] and references therein). Some feature learning methods have been successfully applied to domain adaptation or transfer learning [9, 13]. To the best of our knowledge, there is no prior work along these lines on the more difficult problem of domain generalization, i.e., creating useful representations without observing the target domain.

3. The Proposed Approach

Our goal is to learn features that provide good domain generalization. To do so, we extend the autoencoder [7] into a model that jointly learns multiple data-reconstruction tasks taken from related domains. Our strategy is motivated by prior work demonstrating that learning from multiple related tasks can improve performance on a novel,
tiple related tasks can improve performance on a novel, yet related, task – relative to methods trained on a single- 1While the bias terms are incorporated in our experiments, they are task [1, 3, 8, 34]. intentionally omitted from equations for the sake of simplicity. 2552 x1 x2 xM that ( i , i ,... i ) form a category-level correspondence. This configuration enforces the number of samples in a cat- egory to be the same in every domain. Note that such a con- figuration is necessary to ensure that the between-domain reconstruction works (we will discuss how to handle the case with unbalanced samples in Section 3.3). The input and output pairs used to train MTAE can then be written as concatenated matrices X¯ = [X1; X2; ...; XM ], X¯ l = [Xl; Xl; ...; Xl] (3) Figure 1.