Zero Shot Learning via Multi-Scale Manifold Regularization
Shay Deutsch 1,2   Soheil Kolouri 3   Kyungnam Kim 3   Yuri Owechko 3   Stefano Soatto 1
1 UCLA Vision Lab, University of California, Los Angeles, CA 90095
2 Department of Mathematics, University of California, Los Angeles
3 HRL Laboratories, LLC, Malibu, CA
{shaydeu@math., soatto@cs.}ucla.edu   {skolouri, kkim, yowechko}@hrl.com

Abstract

We address zero-shot learning using a new manifold alignment framework based on a localized multi-scale transform on graphs. Our inference approach includes a smoothness criterion for a function mapping nodes on a graph (visual representation) onto a linear space (semantic representation), which we optimize using multi-scale graph wavelets. The robustness of the ensuing scheme allows us to operate with automatically generated semantic annotations, resulting in an algorithm that is entirely free of manual supervision, and yet improves the state of the art as measured on benchmark datasets.

1. Introduction

Zero-shot learning (ZSL) aims to enable decisions about unseen (target) classes by assuming a shared intermediate representation learned from a disjoint set of (source) classes [17]. Early methods assumed ground truth (human annotation) was available for intermediate representations such as object attributes that can be inferred from an image. More recently, methods have emerged that automatically infer such an intermediate "semantic" representation. Among these, some have cast the problem as joint alignment of the data using graph structures [10, 5] or directly using regularized sparse representations [23, 15]. Most automatic methods, however, perform at levels insufficient to support practical applications.

We propose a new alignment algorithm based on Spectral Graph Wavelets (SGWs) [13], a multi-scale graph transform that is localized in the vertex and spectral domains. The nodes in our graph are visual features, for example activations of a convolutional neural network or any other "graph signal." Such graph signals are considered an embedding of semantic attributes in a linear space, automatically computed using Word2Vec. Learning is based on the assumption that nearby visual representations should produce similar semantic representations, which translates into a smoothness criterion for the graph signal. Our approach is performed in a transductive setting, using all unlabeled data, where the classification and learning process on the transfer data is entirely unsupervised.

While our suggested approach is similar in scope to other "visual-semantic alignment" methods (see [5, 10] and references therein), it is, to the best of our knowledge, the first to use multi-scale localized representations that respect the global (non-flat) geometry of the data space and the fine-scale structure of the local semantic representation. Moreover, learning the relationship between visual features and semantic attributes is unified into a single process, whereas in most ZSL approaches it is divided into a number of independent steps [10].

Our tests show improvements on popular benchmarks such as the Animals With Attributes (AWA) dataset, demonstrating the ability of our approach to perform multi-scale manifold alignment using only automated semantic features. In addition, we demonstrate the robustness of our regularization method on the CUB dataset, showing state-of-the-art results.

1.1. Related Work

Currently there are two main approaches in zero-shot learning. In the first, following [16], the semantic attributes of a previously unseen image are given at test time, and one would like to classify the image from those attributes. In the second, the test attributes are unknown; however, all the test images are available at test time, and one wishes to estimate the attributes and classify the test datum simultaneously. Among the first, [27, 24, 28, 2] suggest learning a cross-domain matching function to compare attributes and visual features. Among the second are [10], which proposes a multi-view transductive approach through multi-view alignment using Canonical Correlation Analysis, and the method in [15].

Several other approaches for zero-shot learning have been recently proposed. These include kernel alignment with unsupervised domain-invariant component analysis [11], which addresses the problem from the multi-source domain generalization point of view; it is an extension of Unsupervised Domain-Invariant Component Analysis with centered kernel alignment. Joint Latent Similarity Embedding [30] uses a dictionary-learning-based discriminative framework to learn class-specific classifiers in both the source and target domains jointly. Synthesized Classifiers for zero-shot learning [5] address the problem from the perspective of manifold learning, introducing phantom object classes which live between the low-level feature space and the semantic space, resulting in a better-aligned semantic space.

An important component of ZSL is the choice of attributes. There are two types of semantic representations typically used in ZSL: (i) human-annotated attributes, and (ii) automatically generated attributes. Early methods, starting from [17], used human annotation. Automatic annotation is clearly more practical, but challenging to work with. Typically, the automatically generated attributes are Word2Vec outputs generated using the skip-gram model trained on English Wikipedia articles [20, 4].

Denoising-based alignment methods include [23], which uses the ℓ2,1 norm as an objective function. However, even using deep learning features, the performance of automatic methods lags that of methods using manual annotation.

ZSL is closely related to transfer learning and domain adaptation; we refer the reader to [12, 22, 9, 18] and references therein for details.

Our regularization method is inspired by [8, 7], which recently suggested a new framework for unsupervised manifold denoising based on SGWs [13]. However, our approach differs from [8] in two ways. First, the task in our case is to align the visual and semantic representations, and therefore our graph construction is different: we use a graph attribute signal for the semantic representation, while in [8] the task is to remove noise in the manifold coordinates, which are used as the graph signal, in order to obtain a smooth manifold approximation. Second, we apply a different regularization approach, which is better suited to complex manifolds that are not smooth everywhere. We accomplish this by denoising all SGW bands without using thresholds, thus avoiding the loss of possibly important information that is discarded in [8, 7].

Figure 1. High-level overview of our approach for zero-shot learning: from the training-set visual features, attributes, embedding, and labels, construct an affinity graph of visual features with attributes represented as graph signals; denoise the graph signals in the spectral graph wavelet domain; use the denoised graph signals to predict the test-set attributes; and determine the labels of the unseen test set using Spectral Clustering or Affinity Propagation on the predicted attributes.

1.2. Summary of Contributions

We formalize ZSL as the problem of learning a map from the data space X to a space of semantic descriptions Y (Sect. 3), after making the underlying assumptions and the resulting limitations explicit (Sect. 2).

Our first contribution is to cast the inference process as imposing a differentiable structure on the map h : X → Y, supported on a discrete graph. To address this problem, we use the multi-scale graph transform (Sect. 3.1), which allows us to enforce global regularity without sacrificing local structure.

Our second contribution is to perform such inference in an integrated fashion (Sect. 3.2), which allows us to forgo any manual annotation, even in the source (training) data space. We avoid the independent steps followed by most ZSL work, some of which require supervision, to arrive at an entirely automatic ZSL process.

Despite being entirely automatic, our ZSL approach achieves state-of-the-art results on benchmark datasets (Sect. 4).

2. Problem Formulation and Model Assumptions

We call X the data or visual space (e.g., visual features), Y the semantic or attribute space (e.g., words in the English language), and Z_s, Z_t two disjoint class or label spaces (e.g., different sets of labels). The subscript s denotes the source (or sample, or training) set, and t denotes the target (or transfer, or test) set.

More specifically, let {(x_s^i, z_s^i)}_{i=1}^{n_s} be a sample, where x_s^i ∈ X, X ⊂ R^m, is measured and z_s^i ∈ Z_s is an observed label in a set Z_s of cardinality c_s, which we indicate with an abuse of notation as z_s^i ∈ {0, 1}^{c_s}. Also, for each instance i, let y_s^i ∈ Y be its semantic representation, consisting of D binary attributes, Y = {0, 1}^D (for instance, y_s^i is the indicator vector of a list of D words describing the datum x_s^i). It is also possible to consider y_s^i as a vector of likelihoods,^1 in which case Y ⊂ R_+^D. The given training set consists of the sample S = {(x_s^i, y_s^i, z_s^i)}_{i=1}^{n_s}.

Let {x_t^j}_{j=1}^{n_t} be a test (or transfer) set with x_t^j ∈ X, with unknown semantic attributes y_t^j. We are interested in classifying the test set into a different set of classes z_t^j ∈ Z_t, where the set Z_t is of cardinality c_t and disjoint from Z_s: Z_s ∩ Z_t = ∅. We assume that this can be done with a classifier φ : X → Z_t. However, the labels in the transfer set are not given.

While one could just perform unsupervised learning on X to cluster the test set into c_t classes, ZSL breaks the problem into two: First, use the training set S to learn a map

^1 The k-th component of y_s^i is the likelihood of attribute k describing
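To make the multi-scale machinery concrete, the following is a minimal sketch of a spectral graph wavelet analysis in the sense of Hammond et al. [13], computed by exact eigendecomposition on a toy graph. The band-pass kernel g(x) = x·exp(−x) and the scales are illustrative choices for this sketch, not the kernel or scales used in the paper.

```python
import numpy as np

def sgw_transform(L, f, scales=(0.5, 2.0, 8.0)):
    """Spectral graph wavelet analysis: for each scale t,
    W_t f = U g(t * Lambda) U^T f, where L = U Lambda U^T is the
    eigendecomposition of the graph Laplacian. The kernel
    g(x) = x * exp(-x) satisfies g(0) = 0, so each band responds only
    to variation of the signal across the graph, at a scale set by t.
    Returns one coefficient vector per scale."""
    lam, U = np.linalg.eigh(L)      # graph frequencies / Fourier basis
    f_hat = U.T @ f                 # graph Fourier transform of f
    return [U @ ((t * lam) * np.exp(-t * lam) * f_hat) for t in scales]

# Toy graph: a 4-node path; L = D - W is the combinatorial Laplacian.
W = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W

# A constant graph signal has no variation across edges, so every wavelet
# band vanishes; a signal with a sharp jump yields nonzero coefficients.
flat = sgw_transform(L, np.ones(4))
jump = sgw_transform(L, np.array([0., 0., 1., 1.]))
```

Because the wavelet bands are localized in both vertex and spectral domains, attenuating noise band by band (as in Sect. 3.1) can smooth the attribute signal globally without flattening fine local structure.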
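Once the test-set attributes have been predicted, the unseen classes in Z_t must be decided from them. The paper applies Spectral Clustering or Affinity Propagation to the predicted attributes (Fig. 1); the simpler nearest-prototype rule below, a common ZSL baseline shown here with made-up toy inputs, illustrates how disjoint target classes can be assigned without any target labels.

```python
import numpy as np

def zsl_predict(att_pred, class_attributes):
    """Nearest-prototype zero-shot decision rule: assign each test sample
    to the unseen class whose semantic attribute vector is closest (in
    Euclidean distance) to the sample's predicted attribute vector.

    att_pred:         (n_t, D) predicted attributes of the test images
    class_attributes: (c_t, D) one attribute vector per unseen class
    returns:          (n_t,) indices into the c_t unseen classes
    """
    d2 = ((att_pred[:, None, :] - class_attributes[None, :, :]) ** 2).sum(axis=-1)
    return np.argmin(d2, axis=1)

# Toy example with D = 2 attributes and c_t = 2 unseen classes.
prototypes = np.array([[1.0, 0.0],    # class 0: has attribute 1 only
                       [0.0, 1.0]])   # class 1: has attribute 2 only
predicted = np.array([[0.9, 0.1],     # closest to class 0's prototype
                      [0.2, 0.8]])    # closest to class 1's prototype
labels = zsl_predict(predicted, prototypes)   # -> array([0, 1])
```

Unlike this hard-assignment rule, clustering the predicted attributes exploits the transductive setting: all test points inform the decision boundaries jointly.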