A Unified Approach for Conventional Zero-shot, Generalized Zero-shot and Few-shot Learning

Shafin Rahman, Salman H. Khan and Fatih Porikli

Abstract—Prevalent techniques in zero-shot learning do not generalize well to other related problem scenarios. Here, we present a unified approach for conventional zero-shot, generalized zero-shot and few-shot learning problems. Our approach is based on a novel Class Adapting Principal Directions (CAPD) concept that allows multiple embeddings of image features into a semantic space. Given an image, our method produces one principal direction for each seen class. Then, it learns how to combine these directions to obtain the principal direction for each unseen class such that the CAPD of the test image is aligned with the semantic embedding of the true class, and opposite to the other classes. This allows efficient and class-adaptive information transfer from seen to unseen classes. In addition, we propose an automatic process for selection of the most useful seen classes for each unseen class to achieve robustness in zero-shot learning. Our method can update the unseen CAPD, taking advantage of a few unseen images, to work in a few-shot learning scenario. Furthermore, our method can generalize the seen CAPDs by estimating seen-unseen diversity, which significantly improves the performance of generalized zero-shot learning. Our extensive evaluations demonstrate that the proposed approach consistently achieves superior performance in zero-shot, generalized zero-shot and few/one-shot learning problems.

Index Terms—Zero-Shot learning, Few-shot learning, Generalized Zero-Shot learning, Class Adaptive Principal Direction

I. INTRODUCTION

Being one of the most fundamental tasks in visual understanding, object classification has long been the focus of attention in computer vision. Recently, significant advances have been reported, in particular using deep learning based techniques that are driven by the emergence of large-scale annotated datasets, fast computational platforms, and efficient optimization methods [37], [39].

Towards an ultimate visual object classification, this paper addresses three inherent handicaps of supervised learning approaches. The first one is the dependence on the availability of labeled training data. When object categories grow in number, sufficient annotations cannot be guaranteed for all objects beyond simpler and frequent single-noun classes. For composite and exotic concepts (such as American crow and auto racing paddock), not only do the available images not suffice, as the number of combinations would be unbounded, but often the annotations can be made only by experts [20], [42]. The second challenge is the appearance of new classes after the learning stage. In real world situations, we often need to deal with an ever-growing set of classes without representative images. Conventional approaches, in general, cannot tackle such recognition tasks in the wild. The last shortcoming is that supervised learning, in its customarily contrived forms, disregards the notion of wisdom. This can be exposed in the fact that we can identify a new object by just having a description of it, possibly leveraging its similarities with the previously learned concepts, without requiring an image of the new object [22].

In the absence of object annotations, zero-shot learning (ZSL) aims at recognizing object classes not seen at the training stage. In other words, ZSL intends to bridge the gap between the seen and unseen classes using semantic (and syntactic) information, which is often derived from textual descriptions such as word embeddings and attributes. Emerging work in ZSL attempts to predict and incorporate semantic embeddings to recognize unseen classes [30], [43], [22], [49], [25]. As noted in [18], the semantic embedding itself might be noisy. Instead of a direct embedding, some methods [4], [44], [33], [52] utilize global compatibility functions, e.g., a single projection in [52], that project image features to the corresponding semantic representations. Intuitively, different seen classes contribute differently to describe each unseen class. Enforcing all seen and unseen classes into a single global projection undermines the subtle yet important differences among the seen classes. It eventually limits ZSL approaches by over-fitting to a specific dataset and to specific visual and semantic features (supervised or unsupervised). Besides, incremental learning with newly added unseen classes is also problematic with a global projection due to its limited flexibility.

Traditionally, ZSL approaches (e.g., [7], [51], [35]) assume that only the unseen classes are present in the test set. This is not a realistic setting for recognition in the wild, where both unseen as well as seen classes can appear during the test phase. Recently, [45], [9] tested several ZSL methods in generalized zero-shot learning (GZSL) settings and reported their poor performance in this real world scenario. The main reason for such failure is the strong bias of existing approaches towards seen classes, where almost all test unseen instances are categorized as one of the seen classes. Another obvious extension of ZSL is few/one-shot learning (F/OSL), where a few labeled instances of each unseen class are revealed during training. The existing ZSL approaches, however, do not scale well to the GZSL and FSL settings [1], [7], [44], [52], [24].

To provide a comprehensive and flexible solution to ZSL, GZSL and FSL problem settings, we introduce the concept of principal directions that adapt to classes. In simple terms, a CAPD is an embedding of the input image into the semantic space such that, when projected onto CAPDs, the semantic space embedding of the true class gives the highest response. A visualization of the CAPD concept is presented in Fig. 1. As illustrated, the CAPDs of a Leopard (Fig. 1a) and a Persian cat image (Fig. 1b) point to their true semantic label embeddings, shown in violet and blue respectively, which give the highest projection response in each case.

Fig. 1: Visualization of class adapting principal directions (CAPD) on a 2D tSNE embedding [41] for illustration. Text labels on the plot represent the seen (black) and unseen (colored) semantic space embeddings in 2D space. CAPDs of the unseen classes are drawn with the same color as the unseen class label text. The bars indicate the responses of the semantic space embeddings projected onto their corresponding CAPDs. Our approach classifies an input to the class that has the maximum response. We also introduce an improved approach that uses a reduced set of CAPDs (shown as dashed lines) while obtaining better alignment with the correct unseen class embedding (see Sec. III-B). (a) Leopard and (b) Persian cat.

Our proposed approach utilizes three main sources of knowledge to generalize learning from seen to unseen classes. First, we model the relationships between the visual features and semantics for seen classes using the proposed 'Class Adapting Principal Directions' (CAPDs). CAPDs are computed using class-specific discriminative models which are learned for each seen category in the 'visual domain' (Sec. III-A1). Second, our approach effectively models the relationships between the seen and unseen classes in the 'semantic space' defined by CAPDs. To this end, we introduce a mixing transformation, which learns the optimal combination of seen semantics that is sufficient to reconstruct the semantic embedding of an unseen class (Sec. III-A2). Third, we learn a distance metric for the seen CAPDs such that samples belonging to the same class are clustered together, while different classes are mapped further apart (Sec. III-A2). This learned metric transfers cross-domain knowledge from the visual domain to the semantic embedding space. Such a mapping is necessary because the class semantics, especially those collected from unsupervised sources (e.g., word2vec), can be noisy and highly confusing. The distance metric is then used to robustly estimate the seen-unseen semantic relationships.

While most of the approaches in the literature focus on specific sub-problems and do not generalize well to other related settings, we present a unified solution which can easily adapt to ZSL, GZSL and F/OSL settings. We attribute this strength to two key features in our approach: a) a highly 'modular learning' scheme and b) the two-way inter-domain 'knowledge sharing'. Specifically for GZSL, we present a novel method to generalize seen CAPDs that avoids the inherent bias of prediction towards seen classes (Sec. III-C). The generalized seen CAPD balances the seen-unseen diversity in the semantic space, without any direct supervision from the visual data. In contrast to ZSL and GZSL, the F/OSL setting allows a few or a single training instance of the unseen classes. This information is used to update unseen CAPDs based on the learned relationships between visual and semantic domains for unseen classes (Sec. III-D). The overall pipeline of our learning and prediction process is illustrated in Fig. 2.

We hypothesize that not all seen classes are instrumental in describing a novel unseen category. To validate this claim, we introduce a new constraint during the reconstruction of the semantic embedding of the unseen classes. We show that automatically reducing the number of seen classes in the mixing process to obtain the CAPD of each unseen class results in a significant performance boost (Sec. III-B). We perform extensive experimental evaluations on four benchmark datasets and compare with several state-of-the-art methods. Our results demonstrate that the proposed CAPD based approach provides superior performance in supervised and unsupervised settings of ZSL, GZSL and F/OSL.

To summarize, our main contributions are:
• We present a unified solution by introducing the notion of class adapting principal directions that enable efficient and discriminative embeddings of unseen class images in the semantic space.
• We propose a semantic transformation to link the embeddings of seen and unseen classes based on a learned distance measure.
• We provide an automatic solution to select a reduced set of relevant seen classes, resulting in a better performance.
• Our approach can automatically adapt to the generalized zero-shot setting by generalizing seen CAPDs to match seen-unseen diversity.
• Our approach is easily scalable to the few/one-shot setting by updating the unseen CAPDs with newly available data.

II. RELATED WORK

Class Label Description: It is a common practice to employ class label descriptions to transfer knowledge from seen to unseen classes in ZSL. Such descriptions may come from either supervised or unsupervised settings. For the supervised case, class attributes can be one source [11], [21], [31], [42]. These attributes are often generated manually, which is a laborious task. As a workaround, word semantic space embeddings derived from a large corpus of unannotated text (e.g., from Wikipedia) can be used. Among such unsupervised word semantic embeddings, word2vec [27], [26] and GloVe [32] vectors are frequently employed in ZSL [50], [44]. These ZSL methods are sometimes (arguably confusingly) referred to as unsupervised zero-shot learning [5], [1]. Supervised features tend to provide better performance than the unsupervised ones. Nevertheless, unsupervised features provide more scalability and flexibility since they do not require expert annotation. Recent approaches attempt to advance unsupervised ZSL by mapping textual representations (e.g., word2vec or GloVe) to attribute vectors using heuristic measures [19], [5]. In our work, we use both types of features and evaluate on both supervised and unsupervised ZSL to demonstrate the strength of our approach.

Embedding Space: ZSL strategies aim to map between two different sources of information and two spaces: image and label embeddings. Based on the mapping scheme, ZSL approaches can be grouped into two categories. The first category is attribute/word vector prediction. Given an image, these methods attempt to approximate the label embedding and then classify an unseen class image based on the similarity of the predicted vector with the unseen attribute/word vector. For example, in an early seminal work, [30] introduced a semantic output code classifier using a knowledge base of attributes to predict unseen classes. [43], [22] proposed direct and indirect attribute prediction methods via a probabilistic realization. [49] formulated a design of category-level attributes. [25] proposed an approach of transferring semantic knowledge from seen to unseen classes by a linear combination of classifiers. The main problem with such direct attribute prediction is the poor performance when noisy or biased attribute annotations are available. Jayaraman and Grauman [18] addressed this issue and proposed a discriminative model for ZSL.

Instead of predicting word vectors, the second category of approaches learns a compatibility function between image and label embeddings, which returns a compatibility score. An unseen instance is then assigned to the class that gives the maximum score. For example, [2] proposed a label embedding function that ranks the correct class of an unseen image higher than incorrect classes. In [35], the authors use the same principle but an improved loss and regularizer. Qiao et al. [33] further improved the former approach by incorporating a component for noise suppression. In a similar work, Xian et al. [44] added latent variables in the compatibility function, which can learn a collection of label embeddings and select the correct embedding for prediction. Our method also has a similar compatibility function based on the inner product of a CAPD and the corresponding semantic vector. The use of CAPDs provides an effective avenue to recognition.

Similarity Matching: This type of approach builds linear or nonlinear classifiers for each seen class, and then relates those classifiers with unseen classes based on class-wise similarity measures [7], [10], [15], [25], [34]. Our method finds a similar relation, but instead of classifiers, we relate CAPDs of seen and unseen classes. Moreover, we compute this relation on a learned metric of the semantic embedding space, which lets us consider subtle discriminative details.

Few/One-shot Learning: FSL has a long history of investigation, where few instances of some classes are used as labeled data during training [36], [12]. Although the ZSL problem can easily be extended to FSL, established ZSL methods are not evaluated in FSL settings. A recent work [40] reports the FSL performance of only two ZSL methods, e.g., [38], [14]. In another work, [8] presented FSL results on ImageNet. In this paper, we extend our approach to FSL settings and compare our method with the reported performance in [40].

Generalized Zero-shot Learning: The GZSL setting significantly increases the complexity of the problem by allowing both seen and unseen classes during the testing phase [45], [9], [8]. This idea is related to the open set recognition problem, where methods consider rejecting unseen objects in conjunction with recognizing known objects [6], [17]. In the open set case, methods consider all unseen objects as one outlier class. In contrast, GZSL represents unseen classes as individual separate categories. Very few of the ZSL methods reported results in the GZSL setting [8], [23], [46]. [14] proposed a joint visual-semantic embedding model to facilitate the generalization of ZSL. [38] offered a novelty detection mechanism which can detect whether the test image came from a seen or unseen category. Chao et al. [9] proposed a calibration mechanism to balance the seen-unseen prediction scores, which any ZSL algorithm can adopt at the decision making stage, and proposed an evaluation method called Area Under Seen-Unseen accuracy Curve (AUSUC). Later, several other works [8], [46] adopted this evaluation strategy. In another recent work, Xian et al. [45] reported benchmarking results for both ZSL and GZSL performance of several established methods published in the literature. In this paper, we describe the extension of our ZSL approach to efficiently adapt to GZSL settings.

III. OUR APPROACH

Problem Formulation: Suppose the set of all class labels is $\mathcal{Y} = \mathcal{Y}^S \cup \mathcal{Y}^U$, where $\mathcal{Y}^S = \{1, \dots, S\}$ and $\mathcal{Y}^U = \{S+1, \dots, S+U\}$ are the sets of seen and unseen class labels respectively, with no overlap, i.e., $\mathcal{Y}^S \cap \mathcal{Y}^U = \emptyset$. Here, $S$ and $U$ denote the total number of seen and unseen classes, respectively. For all classes in the seen and unseen class sets, we have associated semantic class embeddings (either attributes or word vectors) denoted by the sets $E^S = \{e_s : s \in \mathcal{Y}^S\}$ and $E^U = \{e_u : u \in \mathcal{Y}^U\}$ respectively, where $e_s, e_u \in \mathbb{R}^d$. For every seen ($s$) and unseen ($u$) class, we have a number of instances denoted by $n_s$ and $n_u$ respectively. The matrices $X_s = [x_s^1, \dots, x_s^{n_s}]$ for $s \in \mathcal{Y}^S$, and $X_u = [x_u^1, \dots, x_u^{n_u}]$ for $u \in \mathcal{Y}^U$ represent the image features for the seen class $s$ and the unseen class $u$, respectively, such that $x_s, x_u \in \mathbb{R}^k$.


Fig. 2: Overall pipeline of the conventional ZSL, FSL and GZSL method. Conventional ZSL: An image is passed through a deep network (VGG or GoogLeNet) to get an image feature $x$. Then, $x$ is fed to seen classifiers $W_s$ to produce seen CAPDs, $p_s$. Afterwards, unseen CAPDs, $p_u$, are computed by linearly combining seen CAPDs using $\alpha$ (or $\beta$ for the reduced case). Finally, prediction is done by computing the maximum projection response of $p_u$ and the unseen semantic embeddings $e_u$, i.e., $\hat{y} = \arg\max_u \langle p_u, e_u \rangle$. FSL: $x$ is fed to unseen classifiers $W_u$ to produce another version of the unseen CAPDs, $p'_u$, which are combined with the previously computed $p_u$ through $\delta_u$ and $\delta'_u$ to find an updated version of the unseen CAPDs, $p_u^f$. Final prediction is done by the maximum response of $p_u^f$ and $e_u$. GZSL: Seen CAPDs $p_s$ of the conventional ZSL setting are generalized using $\gamma$ to produce generalized seen CAPDs, $p_s^g$. For prediction, both $p_s^g$ and $p_u$ are considered for calculating the maximum response of CAPDs and their corresponding semantic embeddings $e_s$ and $e_u$, i.e., $\hat{y} = \arg\max_{l \in \mathcal{Y}} \langle p_l, v_l \rangle$ with $p_l \in p_u \cup p_s^g$ and $v_l \in e_s \cup e_u$.

Below, we define the three problem instances addressed in this paper:

• Zero Shot Learning (ZSL): The image features of the unseen classes $X_u$ are not available during the training stage. The goal is to assign an unseen class label $u \in \mathcal{Y}^U$ to a given unseen image using its feature vector $x_u$.
• Generalized Zero Shot Learning (GZSL): The image features of the unseen classes $X_u$ are not available during the training stage, similar to ZSL. The goal is to assign a class label $l \in \mathcal{Y}$ to a given image using its feature vector $x$. Notice that the true class of $x$ may belong to either a seen or an unseen class.
• Few/One Shot Learning (FSL): Only a few/one randomly chosen image features from $X_u$ are available as labeled examples during the training stage. The goal is the same as the ZSL setting above.

In Secs. III-A and III-B, we first provide a general framework of our approach mainly focused on ZSL. Afterwards, in Secs. III-D and III-C we extend our approach to FSL and GZSL settings, respectively. Before describing our extensive experimental evaluations in Sec. V, we also provide an in-depth comparison with the existing literature in Sec. IV.

A. Class Adapting Principal Direction

We introduce the concept of 'Class Adapting Principal Direction' (CAPD), which is a projection of image features onto the semantic space. The CAPD is computed for both seen and unseen classes; however, the derivation of the CAPD is different for the two cases. In the following, we first introduce our approach to learn CAPDs for seen classes, and then use the learned principal directions to derive CAPDs for unseen classes.

1) Learning CAPD for Seen Classes: For a given image feature $x_s$ belonging to the seen class $s$, we define its CAPD $p_s$ in terms of a linear mapping parametrized by $W_s$ as,

$$p_s = W_s^T x_s. \quad (1)$$

Our goal is to learn the class-specific weights $W_s$ such that the output principal directions are highly discriminative in the semantic space (rather than the image feature space). To this end, we introduce a novel loss function which uses the corresponding semantic space embedding $e_s$ of seen class $s$ to achieve maximum separability.

Proposed Objective Function: Given the training samples $X_s$ for the seen class $s$, $W_s$ is learned such that the projection of $p_s$ on the semantic space embedding $e_s$, defined by the inner product $\langle p_s, e_s \rangle$, generates a strong response. Precisely, the following objective function is minimized:

$$\min_{W_s} \frac{1}{\kappa} \sum_{c=1}^{S} \sum_{m=1}^{n_c} \log\big(1 + \exp\big(L(x_c^m; W_s)\big)\big) + \frac{\lambda_s}{2} \|W_s\|_2^2 \quad (2)$$

where $L$ is the cost for a specific input $x_c^m$, $\lambda_s$ is the regularization weight set using cross validation, and $\kappa = \sum_{c=1}^{S} n_c$. We define the cost $L$ as:

$$L(x_c^m; W_s) = \begin{cases} \langle p_s, e_c \rangle - \langle p_s, e_s \rangle, & c \neq s \\ \big\langle p_s, \frac{1}{S-1} \sum_{t \neq s} e_t \big\rangle - \langle p_s, e_s \rangle, & c = s \end{cases}$$

In the above loss function, two different scenarios are tackled depending on whether the training samples (image features) are from the same (positive) or different (negative) classes. For the negative samples ($c \neq s$), the projection of $p_s$ on the correct semantic embedding $e_s$ is maximized while its projection on the incorrect semantic embedding $e_c$ is minimized. For the positive samples ($c = s$), our proposed formulation directs the projection on the correct semantic embedding $e_s$ to be higher than the average response of projections on the incorrect semantic embeddings. In both cases, $\langle p_s, e_s \rangle$ is constrained to produce a high response. Our loss formulation is motivated by [50], with notable differences such as the class-wise optimization, the explicit handling of positive samples and the extension of their ranking loss for image tagging to the ZSL problem.
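For concreteness, the following is a minimal NumPy sketch of the per-class objective in Eqs. 1 and 2, optimized with plain gradient steps. It is an illustration under stated assumptions, not the paper's released implementation: the function and variable names (`train_capd_classifier`, `feats`, `labels`, `E`) are ours, the softplus-of-margin loss and the $\mathcal{N}(0, 1/k)$ initialization follow the text, and the $\kappa$ normalization is approximated by the sample count.

```python
import numpy as np

def train_capd_classifier(s, feats, labels, E, lam=1e-3, lr=1e-3, iters=100):
    """Learn W_s for seen class `s` by minimizing Eq. 2.
    feats: (n, k) image features; labels: (n,) seen-class ids in [0, S);
    E: (S, d) semantic embeddings, rows are e_c."""
    n, k = feats.shape
    S, d = E.shape
    rng = np.random.default_rng(0)
    W = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))   # N(0, 1/k) init (Sec. V)
    e_s = E[s]
    mean_others = np.delete(E, s, axis=0).mean(axis=0)   # anchor for positive samples
    # per-sample competing embedding: e_c for negatives, mean of others for positives
    A = np.where((labels == s)[:, None], mean_others, E[labels])
    for _ in range(iters):
        P = feats @ W                                     # rows are p_s = W_s^T x
        margins = np.einsum('nd,nd->n', P, A) - P @ e_s   # L(x_c^m; W_s)
        sig = 0.5 * (1.0 + np.tanh(0.5 * margins))        # stable d/dm log(1 + e^m)
        grad = feats.T @ (sig[:, None] * (A - e_s)) / n + lam * W
        W -= lr * grad
    return W
```

Since each $W_s$ has its own independent objective, the loop over seen classes can run in parallel, which matches the class-wise, parallel training discussed next.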

We optimize Eq. 2 by Stochastic Gradient Descent (SGD) to obtain $W_s$ for each seen class. Note that $p_s = W_s^T x_c^m$ in the above cost function, thus for any sample $x_c^m$, $p_s$ changes when $W_s$ is updated at each SGD iteration. Also, the learning process of $W_s$ for each seen class is independent of the other classes. Therefore, all $W_s$ can be learned jointly in a parallel fashion. Once the training process is complete, given an input visual feature $x_c^m$, we generate one CAPD $p_s$ for each seen class using Eq. 1. As a result, $P^S = [p_1 \dots p_S] \in \mathbb{R}^{d \times S}$ accumulates the CAPDs of all the seen classes. Each CAPD is the mapped version of the image feature in the class-specific semantic space. The CAPD vector and its corresponding semantic space embedding vector point in a similar direction if the input feature belongs to the same class.

Fig. 3: Experiments with the farthest away, mid-range, nearest, and randomly chosen seen classes, using one third of the total seen classes in each case. Image features are obtained using VGG-verydeep-19 and semantic space vectors are derived from attributes. As shown, the semantic space embeddings of the seen classes that are near to the embedding of the unseen class provide more discriminative representations.

2) Learning CAPD for Unseen Classes: In ZSL settings, the images of the unseen classes are not observed during the training. For this reason, we cannot directly learn a weight matrix to calculate $p_u$ using the same approach as for $p_s$. Instead, for any unseen sample, we propose to approximate $p_u$ using the seen CAPDs of the same sample. Here, we consider a bilinear map, in particular, a linear combination of the seen class CAPDs, to generate the CAPD of the unseen class $u$:

$$p_u = \sum_{s=1}^{S} \theta_{s,u}\, p_s = P^S \theta_u \quad (3)$$

where $\theta_u = [\theta_{1,u} \dots \theta_{S,u}]^T \in \mathbb{R}^S$ is the coefficient vector that, in a way, aggregates the knowledge of the seen classes into the unseen one. The computation of $\theta_u$ is subject to the relation between CAPDs and semantic embeddings of classes. We detail our approach to approximate $\theta_u$ below.

Metric Learning on CAPDs: The CAPDs reside in the semantic embedding space. In this space, we learn a distance metric to better model the similarities and dissimilarities among the CAPDs. To this end, we assemble the sets of similar $\mathcal{A}$ and dissimilar $\bar{\mathcal{A}}$ pairs of CAPDs that correspond to the pairs of training samples belonging to the same and different seen classes, respectively. Our goal is to learn a distance metric $d_M$ such that the similar CAPDs are clustered together and the dissimilar ones are mapped further apart. We optimize the following objective, which maximizes the squared distance between the minimally separated dissimilar pairs:

$$\max_{M} \min_{(i,j) \in \bar{\mathcal{A}}} d_M^2(p_i, p_j) \quad \text{s.t.} \sum_{(i,j) \in \mathcal{A}} d_M^2(p_i, p_j) \leq 1 \quad (4)$$

where $d_M = \sqrt{(p_i - p_j)^T M (p_i - p_j)}$ is the Mahalanobis distance metric [48]. After training, the most confusing dissimilar CAPD pairs are pulled apart while the similar CAPDs are clustered together by learning an optimal distance matrix $M$.

Our intuition is that, given a learned metric $M$ in the semantic embedding space, the relation between the semantic label embeddings of the seen classes $e_s$ and the unseen classes $e_u$ is analogous to that of their principal directions. Since the semantic label embeddings of unseen classes are available, we can estimate their relation with the seen classes. For simplicity, we consider a linear combination of semantic space embeddings:

$$\hat{e}_u = \sum_{s=1}^{S} \alpha_{s,u}\, e_s = E^S \alpha_u \quad (5)$$

where $\hat{e}_u$ is the approximation of the semantic embedding $e_u$ corresponding to unseen class $u$. We compute $\alpha_u = [\alpha_{1,u} \dots \alpha_{S,u}]^T \in \mathbb{R}^S$ by solving:

$$\min_{\alpha_u} (\hat{e}_u - e_u)^T M (\hat{e}_u - e_u) + \frac{\lambda_u}{2} \|\alpha_u\|_2^2 \quad (6)$$

where $\lambda_u$ is a regularization parameter which is set via cross validation.

As we mentioned above, using the learned metric $M$, the relationship $\alpha_u$ between the seen-unseen semantic embeddings is analogous to the relationship $\theta_u$ between the seen-unseen CAPDs, thus $\theta_u \approx \alpha_u$. Accordingly, we approximate the unseen CAPDs with seen CAPDs by rewriting Eq. 3 as:

$$p_u \approx P^S \alpha_u. \quad (7)$$

We derive a CAPD $p_u$ for each unseen class using Eq. 7. In the test stage of the ZSL setting, we assign a given image feature $x$ to an unseen class using the maximum projection response:

$$\hat{y} = \arg\max_{u} \langle p_u, e_u \rangle \quad (8)$$
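Eq. 6 is quadratic in $\alpha_u$, so it admits a closed-form solution; together with Eqs. 7 and 8 this makes the full conventional-ZSL decision rule only a few lines. The sketch below assumes a precomputed symmetric metric `M` and learned per-class weights `Ws_list`; all names are illustrative rather than taken from any released code.

```python
import numpy as np

def unseen_capd(E_seen, e_u, M, P_seen, lam=1e-2):
    """Solve Eq. 6 in closed form and build the unseen CAPD of Eq. 7.
    E_seen: (d, S) columns are e_s; e_u: (d,) unseen embedding;
    M: (d, d) learned PSD metric; P_seen: (d, S) columns are seen CAPDs p_s."""
    # Setting the gradient of Eq. 6 to zero gives a ridge-style linear system.
    A = E_seen.T @ M @ E_seen + 0.5 * lam * np.eye(E_seen.shape[1])
    b = E_seen.T @ M @ e_u
    alpha_u = np.linalg.solve(A, b)          # coefficients over seen classes
    return P_seen @ alpha_u                  # p_u  (Eq. 7)

def predict_zsl(x, Ws_list, E_seen, E_unseen, M, lam=1e-2):
    """Conventional ZSL prediction (Eq. 8) for one image feature x."""
    P_seen = np.stack([W.T @ x for W in Ws_list], axis=1)      # (d, S)
    scores = [unseen_capd(E_seen, E_unseen[:, u], M, P_seen, lam) @ E_unseen[:, u]
              for u in range(E_unseen.shape[1])]
    return int(np.argmax(scores))            # index into the unseen classes
```

Note that each $\alpha_u$ depends only on the class embeddings and $M$, not on the test image, so the $U$ linear systems can be solved once and cached before prediction.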

B. Reduced Set Description of Unseen Classes

When describing a novel object, we often resort to relating it with similar known object categories. It is intuitive that a subset of the known objects is sufficient for describing an unknown one.

We incorporate this observation by proposing a modified version of Eq. 5. The term $\alpha_u$ contains the contribution of each seen class to describe the unseen class $u \in \mathcal{Y}^U$ by reconstructing $e_u$ using all seen class semantic label embeddings. Instead, we reconstruct $e_u$ with only a small number of seen classes ($N < S$). These $N$ seen classes can be selected using any similarity measure (the Mahalanobis distance in our case). The reconstruction of $e_u$ becomes:

$$\hat{e}_u = \sum_{i=1}^{N} \beta_{i,u}\, e_i \quad (9)$$

Here, $\beta_u \in \mathbb{R}^N$ contains the coefficients of the selected seen classes. We learn $\beta_u$ by a minimization objective similar to Eq. 6. By replacing $\alpha_u$ with $\beta_u$ in Eq. 7, it is possible to compute the CAPD of unseen class $u$ using a reduced set of seen classes. Such CAPDs are shown in Fig. 1 as dashed lines.

Appropriate Choice of Seen Classes: In Fig. 3, we show comparisons when different approaches are used to select a subset of seen classes to describe the unseen ones. The results illustrate that the seen classes whose semantic space embeddings are close to that of a particular unseen class are more suitable to describe it. Here, we considered the $N$ nearest neighbors of the unseen class semantic vector $e_u$ using the Mahalanobis distance. Using a smaller number of seen classes is inspired by the work of Norouzi et al. [29], who applied a convex combination of selected semantic embedding vectors based on the outputs of the softmax classifiers of the corresponding seen classes. The main drawback of their approach is that the softmax classifier output does not take the semantic embeddings into consideration, which can ignore important features when describing the unseen class. Instead, our method performs an independent optimization (Eq. 6) that jointly considers image feature, CAPD and semantic embedding relations via the learned metric $M$. As a result, the proposed strategy is better able to determine the optimal combination of selected seen semantic embeddings (see Sec. V-B).

Automatic N Selection for Each Unseen Class: While [29] proposed a fixed number of selected seen classes to describe an unseen class, we suggest a novel technique to automatically select the number of most informative seen classes ($N$). First, for an unseen class semantic embedding $e_u$, we calculate the Mahalanobis distances (using the learned metric $M$) from $e_u$ to all $e_s$ and perform mean normalization. Then, we apply kernel density estimation to obtain a probability density function (pdf) of the normalized distances. Fig. 4 shows the pdf for each unseen semantic embedding vector of the AwA dataset. For a specific unseen class, the number of seen classes with the highest probability score is assigned as the value of $N$. Unlike [29], this scheme allows choosing a variable number of seen classes for different unseen classes. In Sec. V-A of this paper, we report an estimate of the average numbers of seen classes selected for the tested unseen classes. (A code sketch of this selection procedure is given at the end of this subsection.)

Fig. 4: PDF of distances, obtained with kernel density estimation on the mean-normalized distances (zero mean and unit standard deviation), for each unseen class (GoogLeNet features and the word2vec semantic embedding for the AwA dataset).

Sparsity: Using a reduced number of seen classes in Eq. 9 indirectly imposes sparsity on the coefficient vector $\alpha_u$ of Eq. 5. This is similar to using Lasso ($\ell_1$) regularization (instead of $\ell_2$ regularization) in the loss function of Eq. 6. We observe that the above selection solution is more efficient and accurate than the Lasso-based regularization. This is because the proposed solution is based on the intuition that the semantic embedding of an unseen class can be described by closely embedded seen classes. In contrast, Lasso is a general approach and does not consider any domain specific semantic knowledge.
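A minimal sketch of the automatic $N$ selection follows, assuming a learned metric `M` and a Gaussian kernel density estimate. The exact rule for reading $N$ off the density (scoring each candidate count by the density at the $N$-th nearest normalized distance) is our interpretation of the description above, and the helper name `select_num_seen` is hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde

def select_num_seen(e_u, E_seen, M):
    """Pick N for one unseen class: rank seen classes by Mahalanobis
    distance to e_u, then take the candidate count whose normalized
    distance has the highest estimated density (Sec. III-B)."""
    diff = E_seen - e_u                        # (S, d), rows are e_s - e_u
    dists = np.sqrt(np.einsum('sd,de,se->s', diff, M, diff))
    z = (dists - dists.mean()) / dists.std()   # mean normalization
    order = np.argsort(dists)                  # nearest seen classes first
    pdf = gaussian_kde(z)                      # kernel density estimate of distances
    scores = pdf(z[order])                     # density at the N-th nearest distance
    N = int(np.argmax(scores)) + 1
    return N, order[:N]                        # N and the selected seen classes
```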

Having discussed the ZSL setting in Secs. III-A and III-B above, we now present the extension of CAPDs to the GZSL problem.

C. Generalized Zero-shot Learning

The ZSL setting considers only unseen class images during the test phase. This setting is less realistic, because new images can belong to both seen and unseen classes. To address this scenario, generalized ZSL (GZSL) has recently been introduced as a new line of investigation [45], [9]. Recent works suggest that most of the existing ZSL approaches fail to cope with the GZSL setting. When both seen and unseen classes come into consideration for prediction, the prediction score function becomes highly biased towards seen classes because only seen classes were used for training. As a result, the majority of the unseen test instances are misclassified as seen examples. In other words, this bias notably decreases the classification accuracy on unseen classes while maintaining relatively high accuracy on seen classes. To solve this problem, available techniques attempt to estimate the probability of an input belonging to either a seen or an unseen class [38], [9]. However, this scheme heavily depends on the original data distribution used for training.

Considering the above aspects, a competent GZSL method should possess the following properties:
• Equilibrium: It should be able to balance seen-unseen diversity so that the performances on both seen and unseen classes achieve a balance.
• Reduced data dependency: It should not receive any supervision signal (obtained from either training or validation set images) determining the likelihood of an input belonging to a seen or unseen class.
• Consistency: It should retain its performance in the conventional ZSL setting as well.

In this work, we propose a novel GZSL algorithm to adequately address these challenges.

Generalized CAPD for Seen Class: In Sec. III-A, we described that the CAPDs of the seen classes for a given input image are $P^S = [p_1 \dots p_S]$. Each seen CAPD is obtained using the class-wise learned classifier matrix $W_s$. It is obvious that each $W_s$ is biased towards seen class $s$. For the same reason, each $p_s$ is also biased towards class $s$. Since no seen instance was available during the testing phase of the conventional ZSL setting, seen CAPDs were not used for prediction (Eq. 8). Therefore, the inherent bias of the seen CAPDs was not affecting ZSL performance. In contrast, in GZSL settings, all seen and unseen CAPDs are considered for prediction. Thus, the biased seen CAPDs will dominate, as expected, and significantly affect the unseen class performance. To solve this problem, we propose to develop a generalized version of each seen CAPD as follows:

$$p_s^g = P^S \gamma_s, \quad (10)$$

where $\gamma_s$ denotes a parameter vector for seen class $s$.

Proposed Objective Function: Our hypothesis is that the bias towards seen classes that causes high scores during prediction can be resolved using the semantic information of classes. To elaborate, $\gamma_s$ is computed solely in the semantic label embedding domain and later applied to generalize the CAPDs of seen class instances. We minimize the squared difference of two complementary losses to obtain $\gamma = [\gamma_1 \dots \gamma_S] \in \mathbb{R}^{S \times S}$, as:

$$\min_{\gamma} \Big\| \underbrace{\frac{1}{S} \sum_{s=1}^{S} \big(E^S \gamma_s - e_s\big)}_{\text{mean generalized seen loss}} - \underbrace{\frac{1}{U} \sum_{u=1}^{U} \big(E^S \alpha_u - e_u\big)}_{\text{mean unseen reconst. loss}} \Big\|_2^2 + \frac{\lambda_\gamma}{2} \sum_{s=1}^{S} \|\gamma_s\|_2^2, \quad (11)$$

where $\lambda_\gamma$ is the regularization weight set using cross validation.

The objective function in Eq. 11 minimizes the squared difference between the means of two loss components. The first component is the mean generalized seen loss, which measures the reconstruction accuracy of the seen class embedding $e_s$ using the generalization parameters $\gamma_s$. The second component measures the reconstruction accuracy of the unseen class embedding $e_u$ from the seen classes. By reducing the squared difference between these two components, we indirectly balance the distribution of seen-unseen diversity, which effectively prevents the domination of seen classes in the GZSL setting (the 'equilibrium' property). The interesting fact is that our proposed generalization mechanism does not directly use CAPDs, yet it is strong enough to stabilize the CAPDs of different classes during the prediction stage (the 'reduced data dependency' property). Furthermore, the formulation does not affect the computation of the unseen CAPDs $p_u$, which preserves the conventional ZSL performance (the 'consistency' property).

Prediction: For a given image feature $x$, we can derive the generalized CAPDs of seen classes $p_s^g$ and the CAPDs of unseen classes $p_u$ using the description in Sec. III-B. In the test stage, we consider the projection responses of both seen and unseen classes to predict a class:

$$\hat{y} = \arg\max_{l \in \mathcal{Y}} \langle p_l, v_l \rangle$$

where $p_l \in p_u \cup p_s^g$ and $v_l \in e_s \cup e_u$.

D. Few-shot Learning

Few-shot learning (FSL) is a natural extension of ZSL. While ZSL considers no instance of an unseen class during training, FSL relaxes this restriction by allowing a few instances of an unseen class as labeled data during the training process. Another variant of FSL is called one-shot learning (OSL), which allows exactly one instance of an unseen class (instead of a few) as labeled during training. An ideal ZSL approach should be able to benefit from the newly available labeled data for unseen classes under F/OSL settings. In this section, we explain how our approach is easily adaptable to FSL.

Updated CAPD for Unseen Class: In the ZSL setting, for a given input image feature, we can calculate the unseen CAPD $p_u$ for every unseen class $u$. Now, in the FSL setting, we optimally use the newly available labeled unseen data to update $p_u$. To this end, new classifiers $W_u$ are learned for each unseen class $u$, similar to the case of seen classes (Sec. III-A). For a given image feature $x$, we can calculate unseen CAPDs by $p'_u = W_u^T x$. These CAPDs are fused with $p_u$, which was derived from the linear combination of seen CAPDs (Eq. 7). The updated CAPD for unseen class $u$ is represented as $p_u^f$, given by:

$$p_u^f = \delta_u p_u + \delta'_u p'_u, \quad \text{s.t.} \quad \delta_u + \delta'_u = 1 \quad (12)$$

where $\delta_u$ and $\delta'_u$ are the contributions of the respective CAPDs in forming an updated CAPD of an unseen class. During prediction, we use $p_u^f$ instead of $p_u$ in Eq. 8.

Calculation of $\delta_u$ and $\delta'_u$: The weights $\delta_u$ and $\delta'_u$ are set using training data such that they encode the reliability of $p_u$ and $p'_u$ respectively. Recall that our prediction is based on the strength of the projection of a CAPD on the semantic embedding vector. Therefore, we need to maximize the correspondence between a CAPD and the correct semantic embedding vector, i.e., a high $\langle p_u, e_u \rangle$. The unseen CAPD among $p_u$ and $p'_u$ that provides the higher projection response with the $u$-th unseen class semantic vector gets a stronger weight in the combination of Eq. 12.

We derive $p_u$ and $p'_u$ for each training image feature $x \in X^S = \{X_s : s \in \mathcal{Y}^S\}$ and the classification matrix of unseen class $u$. Then, we find the summation of the maximum projection responses of the CAPDs (either $p_u$ or $p'_u$) with their respective semantic vectors. This maximum projection response finds the response of the most similar (or confusing) unseen class for any image. The summation of this response across all training images estimates the overall quality of the CAPDs from the two sources. Finally, we normalize the summations to get $\delta_u$ and $\delta'_u$ as follows:

$$\delta_u = \frac{\sum_{x \in X^S} \max_u \langle p_u, e_u \rangle}{\sum_{x \in X^S} \max_u \langle p_u, e_u \rangle + \sum_{x \in X^S} \max_u \langle p'_u, e_u \rangle}, \qquad \delta'_u = \frac{\sum_{x \in X^S} \max_u \langle p'_u, e_u \rangle}{\sum_{x \in X^S} \max_u \langle p_u, e_u \rangle + \sum_{x \in X^S} \max_u \langle p'_u, e_u \rangle}.$$
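The two decision rules of this section can be summarized in a short sketch, assuming the $\gamma_s$ (Eq. 10), $\alpha_u$ (Eq. 7) and $\delta$ weights (Eq. 12) have already been estimated; the function and argument names here are illustrative, not from any released implementation.

```python
import numpy as np

def predict_gzsl(x, Ws_list, gamma, alphas, E_all):
    """GZSL decision rule: score every class with its own CAPD.
    gamma: (S, S) columns are gamma_s; alphas: (S, U) columns are alpha_u;
    E_all: (d, S+U) seen embeddings followed by unseen embeddings."""
    P_seen = np.stack([W.T @ x for W in Ws_list], axis=1)   # (d, S) seen CAPDs
    P_gen = P_seen @ gamma                                   # generalized seen CAPDs p_s^g
    P_uns = P_seen @ alphas                                  # unseen CAPDs p_u
    P_all = np.concatenate([P_gen, P_uns], axis=1)           # (d, S+U)
    scores = np.einsum('dl,dl->l', P_all, E_all)             # <p_l, v_l> per class l
    return int(np.argmax(scores))                            # label in the joint set

def fused_unseen_capd(p_u, p_u_prime, delta_u, delta_u_prime):
    """FSL update of Eq. 12 (the two weights sum to one)."""
    return delta_u * p_u + delta_u_prime * p_u_prime
```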

IV. COMPARISON WITH RELATED WORK

A. ZSL Settings

Our method has similarities with two streams of previous efforts on ZSL. Here, we discuss the significant differences.

In terms of class-specific learning, a number of recent studies [7], [29] report competitive performance when they rely on handcrafted attributes (the 'supervised' source). However, we observe that these methods fail when they use an 'unsupervised' source of semantics (e.g., word2vec and GloVe). The underlying reason is that they do not leverage the semantic information during the training of the classifiers. Moreover, the attribute set is less noisy than unsupervised sources of semantics. Although our work follows the same spirit, the main novelty lies in using the semantic embedding vectors explicitly during the learning phase for each individual class. This helps the classifiers to easily adapt themselves to a wide variety of semantic information sources, e.g., attributes, word2vec and GloVe.

Another body of work [44], [4] considers semantic information during the training process. However, these approaches do not take the benefits of class-specific learning. Using a single classifier, they compute a global projection. Generalizing all classes by one projection is restrictive, and it fails to encompass subtle variations among classes. These approaches do not leverage the flexibility of suppressing irrelevant seen classes while describing an unseen class. Besides, the semantic label embeddings are subject to tuning based on the visual image features. As they cannot learn any metric on the semantic embedding space, these methods fail to work accurately across different semantic embeddings. In contrast, by taking the benefits of class-specific learning, our approach computes a CAPD for each classifier, which can significantly enhance the learned discriminative information. In addition, our approach describes the unseen class with automatically selected informative seen classes and learns a metric on the semantic embedding space to further fine-tune the semantic label information.

We also like to point out that the existing methods seem to overfit to a specific dataset, specific image features, and specific semantic features (supervised attributes or unsupervised GloVe). Our method, on the other hand, consistently provides improved performance across all the different problem settings.

B. GZSL Settings

We automatically balance the diversity of seen-unseen classes in an unsupervised way, without strongly relying on CAPDs or image visual features. Previous efforts used a supervision mechanism, either from training or validation image data, to determine if any input image belongs to a seen or an unseen class. Chao et al. [9] proposed a calibration based approach to rescale the seen scores and evaluated it using the Area Under Seen-Unseen accuracy Curve (AUSUC) [7], [46]. As the prediction scores of GZSL are strongly biased towards seen classes, they proposed to calibrate the seen scores by adding a constant negative bias termed a calibration factor. This factor is calculated on a validation set and works as a prior likelihood of a data point being from a seen/unseen class. The drawback of such an approach is that it acts as a post-processing mechanism applied at the decision making stage, not dealing with the generalization at the basic algorithmic level.

Another alternative work, the CMT method [38], incorporates a novelty detection approach which estimates the outlier probability of an input image. Again, the outlier probability is determined using training images, which provides an extra image-based supervision to the GZSL model. In contrast, our method considers the seen-unseen bias in the semantic space at the algorithmic level. The overall prediction scores are then balanced to remove the inherent bias towards the seen classes. We show that such an approach can be useful for both supervised attributes and unsupervised word2vec/GloVe as semantic embedding information. As our approach does not follow a post-processing strategy like [9], [7], [46], we do not evaluate our work with AUSUC. In line with the recommendation in [44], we use the harmonic mean based approach for GZSL evaluation.

V. EXPERIMENTS

Benchmark Datasets: We use four standard datasets for our experiments: aPascal & aYahoo (aPY) [11], Animals with Attributes (AwA) [21], SUN attributes (SUN) [31], and Caltech-UCSD Birds (CUB) [42]. The statistics of these datasets are given in Table I. We follow the standard protocols (seen/unseen splits of classes) used in the literature. To be specific, we have exactly followed [44] for the AwA and CUB datasets, [51], [52] for aPY and SUN-10, and [7] for SUN. To increase the complexity of the GZSL task for SUN, we used a different seen/unseen split introduced in [7]. In line with the standard protocol, the test images correspond to only unseen classes in ZSL settings. In Few/One-shot settings, we randomly choose three/one instances per unseen class to use in training as labeled examples. Again, in GZSL settings, we perform an 80-20% split of each seen class's instances; the 80% portion is used in training and the remaining 20% for testing, in conjunction with all unseen test data. We report the average results of 10 random trials for Few/One-shot and GZSL settings. In a recent work, Xian et al. [45] proposed a different seen/unseen split for the above mentioned four datasets. We perform GZSL experiments on that setting as well.

TABLE I: Statistics of the benchmark datasets.

Dataset | seen/unseen | # image | # train | # test
aPY [11] | 20/12 | 15,339 | 12,695 | 2,644
AwA [21] | 40/10 | 30,475 | 24,518 | 6,180
SUN-10 [31] | 707/10 | 14,340 | 14,140 | 200
SUN [31] | 645/72 | 14,340 | 12,900 | 1,440
CUB [42] | 150/50 | 11,788 | 8,855 | 2,933

Image Features: Previous ZSL approaches use both shallow (SIFT, PHOG, Fisher Vector, color histogram) and deep features [4], [7], [33]. As reported repeatedly, deep features outperform shallow features by a significant margin [7]. For this reason, we consider only deep features from the pretrained GoogLeNet [39] and VGG-verydeep-19 [37] models for our comparisons. For feature extraction from GoogLeNet and VGG-verydeep-19, we exactly follow Changpinyo et al. [7] and Zhang et al. [51], respectively. The dimension of the visual features extracted from GoogLeNet is 1024, and from VGG-verydeep-19 it is 4096. While using the recent Xian et al. [45] seen/unseen split, we use the same 2048-dim features from the top pooling units of the 101-layered ResNet [16] for a fair comparison.

TABLE II: Average number of the seen classes for the reduced set case. Our method automatically selects an optimal number of the nearest seen classes to describe an unseen class.

Using G | aPY | AwA | CUB | SUN-10
Total seen | 20 | 40 | 150 | 717
Reduced seen-att | 10.17 | 20.00 | 74.70 | 344.40
Reduced seen-w2v | - | 21.20 | 70.96 | -
Reduced seen-glo | - | 19.70 | 74.14 | -

Semantic Space Embeddings: We analyze both supervised and unsupervised settings of ZSL. For the supervised case, we use 64, 85, 102 and 312 dimensional continuous valued semantic attributes for the aPY, AwA, SUN, and CUB datasets, respectively. We dismiss the binary version of these attributes since [7] showed that continuous attributes are more useful. For the unsupervised case, we test our approach on the AwA and

CUB datasets. We consider both word vector embeddings, i.e., word2vec (w2v) [27], [26] and GloVe (glo) [32]. We use $\ell_2$ normalized 400-dimensional word vectors, similar to [44].

Evaluation Metrics: This line of investigation naturally applies to two different tasks: recognition and retrieval [45], [24], [40]. We measure the recognition performance by the top-1 accuracy, and the retrieval performance by the mean average precision (mAP). The top-1 accuracy is the percentage of the estimated labels (the ones with the highest scores) that match the correct labels. The mean average precision is computed over the precision scores of the test classes. In addition, [45] proposed to use the Harmonic Mean (HM) of the accuracies of the seen and unseen classes ($acc_s$ and $acc_u$ respectively) to evaluate GZSL performance, as follows:

$$HM = \frac{2 \times acc_s \times acc_u}{acc_s + acc_u}.$$

The main motivation for using HM is its ability to expose the inherent bias of any method towards seen classes. If a method is too biased towards seen classes, then $acc_s$ will be very high compared to $acc_u$, and the harmonic mean based GZSL performance drops down significantly [45], [9]. (A short numerical illustration of this metric is given just before the benchmark tables below.)

Implementation Details: We initialize each classifier $W_s$ from a $\mathcal{N}(0, \frac{1}{k})$ distribution, where $k$ is the dimension of the image feature [44]. We use a constant learning rate over 100 iterations in the training of each class: 0.001 for AwA, and 0.005 for the aPY, SUN and CUB datasets. For each dataset, we select the values of the parameters $\lambda_s$, $\lambda_u$ and $\lambda_\gamma$ using a validation set. We use the same value of $\lambda_s$ and $\lambda_u$ across all seen and unseen classes in the optimization tasks (Eqs. 2 and 6, respectively). To choose the validation dataset, we divide all seen classes into two groups, and use one group as the unseen set (no test data is used in the validations). Our results on multi-fold cross-validation and single-validation are very similar.¹

¹The code of our method will be released.

Fig. 5: Confusion matrices on the AwA dataset using GoogLeNet as image features and the attributes as semantic space vectors. Left: Xian et al. [44] (latEm). Right: CAPD. As seen, CAPD provides better overall and class-wise performance. (Diagonal counts, latEm vs. CAPD: cat 677/629, hippo. 433/519, leop. 581/587, whale 392/708, seal 138/346, chimp. 390/669, rat 703/241, panda 287/680, pig 253/165, racc. 385/427.)

A. Results for Reduced Set

In the reduced set experiment described in Sec. III-B, for each unseen class, we select four subsets of the seen classes, each having one-third of the total number. The first three subsets contain the farthest away, mid-range, and nearest seen classes of each unseen class in the semantic embedding space, and the last one is a random selection. For all subsets, we determine the proximity of the unseen classes by the Mahalanobis distance with the learned metric $M$. In our experiments, a different unseen class will get a different set of seen classes to describe it. We report the top-1 accuracy on the test data of those four subsets in Fig. 3. We observe that only the one-third of seen classes closest to each unseen class perform the best among the four subsets. The farthest away, mid-range and randomly selected subsets fail to describe an unseen class with high accuracy. This experiment suggests that using only some nearest seen classes located in the semantic embedding space can efficiently approximate an unseen class embedding. The nearest case experiment performances are not the best accuracies reported in the paper, because we consider an automatic seen class selection process in our final experiments.

From the discussion in Sec. III-B, we also know that for different unseen classes our method automatically chooses different sets of useful seen classes. The numbers of seen classes in those sets can be different. In Table II, we report the average number of seen classes in the sets. One can observe that the average number of seen classes required is around 50% across different datasets. This means that, in general, only half of the total seen classes are useful to describe one unseen class. Such a reduced set description of the unseen class not only maintains the best performance but also reduces the complexity of the sparse representation of each unseen class.

B. Benchmark Comparisons

We discuss benchmark performances of ZSL recognition and retrieval for both supervised (attributes) and unsupervised semantics (word2vec or GloVe).
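Before turning to the benchmark tables, here is a quick numerical illustration of the HM metric defined in the Evaluation Metrics paragraph above; the accuracy values are hypothetical, chosen only to show how HM penalizes seen-class bias.

```python
def harmonic_mean(acc_s, acc_u):
    """HM of seen/unseen accuracies (in %), as used for GZSL evaluation."""
    return 2 * acc_s * acc_u / (acc_s + acc_u)

# A heavily seen-biased method vs. a balanced one (hypothetical numbers):
print(harmonic_mean(80.0, 5.0))   # ~9.4  -> a high acc_s cannot mask a poor acc_u
print(harmonic_mean(55.0, 45.0))  # 49.5  -> balanced performance scores well
```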

TABLE III: Top-1 accuracy (in %) of the various versions of CAPD using the attributes. V: VGG-verydeep-19, G: GoogLeNet image features.

Method | aPY V | aPY G | AwA V | AwA G | SUN V | SUN G | CUB V | CUB G
Ours [all-seen] | 45.84 | 50.64 | 73.19 | 64.74 | 84.50 | 87.00 | 39.86 | 42.31
Ours [reduced-Lasso] | 36.54 | 37.22 | 74.16 | 75.76 | 78.50 | 84.50 | 37.47 | 37.37
Ours [reduced-auto] | 54.69 | 55.07 | 78.53 | 80.43 | 85.00 | 79.00 | 43.01 | 45.31

Using V aPY AwA SUN CUB AwA CUB Semantic:word2vec Lampert’14 [22] 38.16 57.23 72.00 31.40 V G V G ESZSL’15 [35] 24.22 75.32 82.10 - Akata’15 [4] - 51.20 - 28.40 SSE-ReLU’15 [51] 46.23 76.33 82.50 30.41 Xian’16 [44] - 61.10 - 31.80 Zhang’16 [52] 50.35 80.46 83.30 42.11 Akata’16 [1] - - 33.90 - Bucher’16 [24] 53.15 77.32 84.41 43.29 Changpinyo’16 [7] - 57.50 - - DSRL’17[47] 56.29 77.38 82.00 50.26 SCoRe’17[28] - 60.88 - 31.51 MFMR’17[46] 48.20 79.80 84.00 47.70 DMaP-I’17[23] - - - 26.28 Ours 54.69 78.53 85.00 43.33 Ours 66.26 66.89 34.40 32.42 AwA CUB Using G aPY AwA SUN CUB Semantic: GloVe Lampert’14 [22] 37.10 59.50 - - V G V G Akata’15 [4] - 66.70 - 50.10 Akata’15 [4] - 58.80 - 24.20 Changpinyo’16 [7] - 72.90 - 45.85 Xian’16 [44] - 62.90 - 32.50 Xian’16 [44] - 72.50 - 45.60 DMaP-I’17[23] - - - 23.69 SCoRe’17[28] - 78.30 - 58.40 Ours 62.01 64.73 32.08 29.66 MFMR’17[46] 46.40 76.60 81.50 46.20 TABLE VI: Unsupervised ZSL performance in top-1 accuracy. Ours 55.07 80.83 87.00 45.31 V: VGG-verydeep-19, G: GoogLeNet image features. Only TABLE IV: Supervised ZSL top-1 accuracy (in %) on four very recent SOTA papers are considered for comparison. standard datasets. V: VGG-verydeep-19 and G: GoogLeNet Average Precision-Recall curve of all test classes image features. Results are from the original papers. Only very 0.7 recent SOTA methods are considered for comparison. 0.6

0.5 Using V aPY AwA SUN CUB 0.4 SSE-INT’15 [51] 15.43 46.25 58.94 4.69 0.3

SSE-ReLU’15 [51] 14.09 42.60 44.55 3.70 Precision 0.2 Bucher’16 [24] 36.92 68.10 52.68 25.33 Xian et al.(2016) AUC=47.06 Changpinyo et al.(2016) AUC=41.40 Zhang’16 [52] 38.30 67.66 80.10 29.15 0.1 Ours(lasso) AUC = 53.92 MFMR’17 [46] 45.60 70.80 77.40 30.60 Ours AUC=55.97 0 Ours 43.85 72.87 80.20 36.60 0.5 0.6 0.7 0.8 0.9 1 Recall TABLE V: Supervised ZSL retrieval performance (in mAP). Fig. 6: Average precision recall curve of all test classes of V: VGG-verydeep-19 image features. AwA dataset. GoogLeNet and word2vec are used as image feature and semantic label embedding respectively. of `2 in Eq. 6. The results demonstrate that using a reduced number of seen classes to describe an individual unseen class manually annotated attribute sets in real-world applications. can improve ZSL accuracy. In Table IV, we compare the Therefore, the ZSL research is pushing forward to eliminate overall top-1 accuracy of our method with many recent ZSL dependency on manually assigned attributes [1], [5], [19], [33], approaches. Our approach outperforms other methods in most [44]. In line with this view, we adapt our method to unsuper- of the settings. In Fig. 5, we show confusion matrices of a vised settings by replacing attribute set with word2vec [27] recent approach [44] and ours. Similar to recognition, ZSL and GloVe [32] vectors. Our results on two standard datasets, can also perform retrieval task. ZSL retrieval is to search AwA and CUB, are reported in Table VI. We compare with images of unseen classes using their class label embeddings. very recent approaches keeping same experimental protocol. We test the attributes set as a query to retrieve test images. In One can notice that our approach performs consistently in the Table V, we compare our ZSL retrieval performance with four unsupervised settings as well in a wide variety of feature and recent approaches on four datasets. Our approach performs semantic embedding combinations. We provide the average consistently better or comparable to state-of-the-art methods. precision-recall curves of ours and two very recent approaches 2) Results for ZSL with Unsupervised Semantics: ZSL with using word2vec embeddings in Fig. 6. As shown, our method pretrained word vectors [27], [32] as sematnic embedding is is superior to others by a significant margin. the focus of nowadays since it is difficult to generate Our observation is that ZSL attains better performance JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2017 11 with supervised attributes as semantics than unsupervised ones 4) Results for FSL: As stated earlier, our method can easily because the semantic descriptors (word2vec and GloVe) are take the advantage when new unseen class instances become often noisy and cannot describe a class as good as attributes. available as labeled data for training. To test this scenario, in To address this performance gap, some works investigate FSL settings, we assume three instances of each unseen class ZSL with transductive learning [46], [23], domain adaptation (randomly chosen) are available as labeled during training. techniques [10], [19], and class attribute associations [1], [5]. In Table X, we report our results for FSL on AwA and In our study, we consider these improvements as future work. CUB dataset while using attribute, word2vec and GloVe as semantic information. 
3) Results for GZSL: GZSL is a more realistic scenario than conventional ZSL because the GZSL setting tests a method not only with unseen class instances but also with seen class instances. In this paper, we extend our method to work in the GZSL setting as well. Although GZSL is a more interesting problem than ZSL, standard ZSL methods usually do not report any GZSL results in their original papers. Recently, however, a few efforts have been published to establish a standard testing protocol for GZSL [45], [9]. In the current work, we test our GZSL method on both the protocol of [45] and that of [9].

Xian et al. [45] tested 10 ZSL methods with a new seen-unseen split of the datasets, ensuring that unseen classes are not used during pre-training of the deep network (e.g., GoogLeNet, ResNet) employed to extract image features. They used ResNet image features and attributes as semantic embedding for the SUN, CUB, AwA and aPY datasets. With these exact settings, in Table VII we compare our GZSL results with the results reported in [45]. In terms of the harmonic mean (HM) measure, our results consistently outperform the other methods by a large margin. Moreover, our method balances the seen-unseen diversity in a robust manner, which helps to achieve the best unseen class accuracy (acc_u). In contrast, seen accuracy (acc_s) moves down because of the trade-off made while balancing the bias towards seen classes. In the last row, we report the ZSL performance of this experiment, where unseen class test instances are classified into unseen classes only (not considering seen and unseen classes together). This accuracy is an oracle case (upper bound) for acc_u of the GZSL case of our method: if an instance is misclassified in the ZSL case, it must also be misclassified in the GZSL case. Another important point is that the parameters of our method are tuned for the GZSL setting in this experiment; the ZSL performance in the last row might therefore increase if one tuned the parameters for the ZSL setting.

Chao et al. [9] experimented with GZSL using the standard seen-unseen split from the ZSL literature. Keeping this split, they used a random 80% of the seen class images for training and held out the remaining 20% for the testing stage of GZSL. We perform the same harmonic-mean-based evaluation as in the previous setting. In Table VIII, we compare our results with those reported in [9]. Using the same settings, we also compare with two recent methods, MFMR [46] (Table VIII) and DMaP [23] (Table IX). For the sake of comparison with DMaP [23], we compare mean top-1 accuracy (although not standard) instead of the harmonic mean, because acc_u and acc_s are not reported separately in [23]. Again, our method performs consistently well across datasets. More results on GZSL for the AwA, CUB, SUN and aPY datasets are reported in Tables XII, XIII, XIV and XI.

4) Results for FSL: As stated earlier, our method can easily take advantage of new unseen class instances when they become available as labeled training data. To test this scenario, in the FSL setting we assume that three randomly chosen instances of each unseen class are available as labeled data during training. In Table X, we report our FSL results on the AwA and CUB datasets using attributes, word2vec and GloVe as semantic information. The compared methods, DeViSE [13] and CMT [38], did not report FSL performance in their original papers, but [40] reimplemented the original works to adapt them to FSL. The exact three instances of each unseen class used in [40] are not publicly available; to make our results comparable with others, we therefore report the average performance over 10 random trials. Our method performs consistently better than the compared methods except in one case: mAP of CUB-att (58.0 vs 58.5). Another observation from these results is that the performance gap between unsupervised semantics (word2vec and GloVe) and supervised attribute semantics is significantly reduced compared to the ZSL setting, where unsupervised semantics always performed worse than supervised attributes across all methods. The reason is that the FSL setting alleviates the inherent noise of unsupervised semantics, letting them perform as well as supervised semantics. We also experiment on the OSL task, where all conditions are the same as in the FSL setting except that a single randomly picked labeled instance is available for each unseen class during training. More results for OSL and FSL on the AwA, CUB, SUN and aPY datasets are reported in Tables XII, XIII, XIV and XI.

For any given image, our FSL method described in Sect. III-D utilizes the contributions of unseen CAPDs coming from two sources: one obtained by combining the CAPDs of seen classes as in the zero-shot setting, and another obtained from the unseen classifier of the few-shot setting. In Eq. 12, two constants (δ_u and δ'_u) combine the respective CAPDs to compute the updated CAPD of the unseen class; a hedged sketch of this combination is given after the following list. In this experiment, we visualize the contributions of δ_u and δ'_u for the AwA and CUB datasets in Fig. 7. A few observations from this figure follow:

• In most cases, the few-shot contribution from the classifier (δ'_u) is higher than the zero-shot contribution (δ_u). The reason is that even a few instances of an unseen class yield better generalization than no instance at all during training.
• The zero-shot contribution (δ_u) is higher for supervised attributes than for word2vec or GloVe across both datasets. The reason is that supervised attributes contain less noise, which gives high confidence to the zero-shot-based CAPD.
• Comparing OSL and FSL, the few-shot contribution from the classifier (δ'_u) is higher in FSL than in OSL. The reason is that in the FSL setting any unseen classifier becomes more confident than in the OSL setting, as FSL observes more than one instance during training.
• Comparing word2vec and GloVe for both the OSL and FSL settings, the zero-shot contribution (δ_u) is higher for word2vec than for GloVe semantics. This suggests that word2vec is a better semantic embedding than GloVe for the FSL task.
• Comparing AwA and CUB, the zero-shot contribution (δ_u) is lower than the few-shot contribution from the classifier (δ'_u) for CUB across all semantics used. The reason is that CUB is a more difficult dataset than AwA in the zero-shot setting; indeed, the overall performance on CUB is lower than on AwA in all cases (ZSL, F/OSL and GZSL).
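The combination in Eq. 12 can be pictured with the following minimal sketch (our paraphrase under the assumption that the update is a weighted sum of the two CAPDs; the exact form of Eq. 12 may differ, and all names here, e.g. delta_u, are ours rather than the paper's released code):

```python
import numpy as np

def update_unseen_capd(p_zero, p_few, delta_u, delta_u_prime):
    """Fuse the two CAPD sources for an unseen class (cf. Eq. 12).

    p_zero:  CAPD assembled from seen-class CAPDs (zero-shot source).
    p_few:   CAPD from the classifier trained on the few labeled
             unseen instances (few-shot source).
    delta_u, delta_u_prime: non-negative weights; in the paper they are
             set automatically from the quality of prediction responses
             on training images (Fig. 7 visualizes their magnitudes).
    """
    return delta_u * p_zero + delta_u_prime * p_few

# Toy usage in a 300-dimensional semantic space.
rng = np.random.default_rng(0)
p_zero = rng.standard_normal(300)
p_few = rng.standard_normal(300)
p_updated = update_unseen_capd(p_zero, p_few, delta_u=0.3, delta_u_prime=0.7)
```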

                 SUN                  CUB                  AwA                  aPY
Top-1 (ResNet)   HM    acc_s  acc_u   HM    acc_s  acc_u   HM    acc_s  acc_u   HM    acc_s  acc_u
DAP [22]         7.2   25.1   4.2     3.3   67.9   1.7     0.0   88.7   0.0     9.0   78.3   4.8
CONSE [29]       11.6  39.9   6.8     3.1   72.2   1.6     0.8   88.6   0.4     0.0   91.2   0.0
CMT [38]         13.3  28.0   8.7     8.7   60.1   4.7     15.3  86.9   8.4     19.0  74.2   10.9
SSE [51]         4.0   36.4   2.1     14.4  46.9   8.5     12.9  80.5   7.0     0.4   78.9   0.2
LATEM [44]       19.5  28.8   14.7    24.0  57.3   15.2    13.3  71.7   7.3     0.2   73.0   0.1
ALE [3]          26.3  33.1   21.8    34.4  62.8   23.7    27.5  76.1   16.8    8.7   73.7   4.6
DEVISE [14]      20.9  27.4   16.9    32.8  53.0   23.8    22.4  68.7   13.4    9.2   76.9   4.9
SJE [4]          19.8  30.5   14.7    33.6  59.2   23.5    19.6  74.6   11.3    6.9   55.7   3.7
ESZSL [35]       15.8  27.9   11.0    21.0  63.8   12.6    12.1  75.6   6.6     4.6   70.1   2.4
SYNC [7]         13.4  43.3   7.9     19.8  70.9   11.5    16.2  87.3   8.9     13.3  66.6   7.4
Our GZSL         31.3  27.8   35.8    43.3  41.7   44.9    54.5  68.6   45.2    37.0  59.5   26.8
Our ZSL          49.7                 53.8                 52.6                 39.3

TABLE VII: GZSL performance comparison with other established methods in the literature. The experimental setting is exactly the same as in [45]. Image features are taken from ResNet and attributes are used as semantic information.

                AwA                   CUB
Using G         HM    acc_s  acc_u    HM    acc_s  acc_u
DAP [22]        4.7   77.9   2.4      7.5   55.1   4.0
IAP [22]        3.3   76.8   1.7      2.0   69.4   1.0
ConSE [29]      16.9  75.9   9.5      3.5   69.9   1.8
SynC [7]        0.8   81.0   0.4      22.3  72.0   13.2
MFMR [46]       29.6  75.6   18.4     -     -      -
Our GZSL        50.8  43.2   61.7     29.5  23.4   39.9
Our ZSL         76.2                  44.0

TABLE VIII: GZSL performance comparison with the experiment settings of [9]. Image features are taken from GoogLeNet and attributes are used as semantic information.

                 AwA                  CUB
Top-1 (Using G)  att   w2v   glo      att   w2v   glo
DeViSE [13]      80.9  75.3  79.4     54.0  45.7  46.0
CMT [38]         85.1  83.4  84.3     56.7  53.4  52.0
Our              87.4  84.9  85.8     56.9  55.4  55.8

                 AwA                  CUB
mAP (Using G)    att   w2v   glo      att   w2v   glo
DeViSE [13]      85.0  79.3  84.9     46.4  42.6  42.9
CMT [38]         88.4  88.2  89.2     58.5  54.0  52.7
Our              92.0  89.5  89.6     58.0  56.3  56.2

TABLE X: FSL performance comparison with the experiment settings of [40]. Image features are taken from GoogLeNet.

                AwA              CUB
Top-1 (mean)    att     w2v      att     glo
DMaP [23]       17.23   6.44     13.55   2.07
Our GZSL        52.45   43.70    31.65   18.75

TABLE IX: Comparison with the recent GZSL work DMaP [23].
Fig. 7: Contribution of δ_u and δ'_u to the update of the unseen class CAPD. (Two bar panels, AwA on the left and CUB on the right, show the zero-shot weight δ_u and the classifier weight δ'_u on a 0-1 scale for the att, w2v and glo semantics under both the OSL and FSL settings.)

5) All results at a glance: With the experimental setting of [9], we juxtapose all results of OSL, FSL, ZSL and GZSL for the AwA, CUB, SUN and aPY datasets in Tables XII, XIII, XIV and XI, respectively. Some overall observations from these results follow:
• Performance improves from the OSL to the FSL setting. This is expected because in the FSL setting more than one (three, to be exact) instances of each unseen class are used as labeled data during training.
• The performance gap between supervised attributes and unsupervised word2vec or GloVe is greatly reduced in OSL and FSL. This suggests that obtaining a few labeled instances during training largely compensates for the noise of unsupervised semantics.
• The O/FSL settings should always outperform ZSL because more information about the unseen classes is revealed in O/FSL. However, we find one exception, on the SUN dataset, where OSL performs worse than ZSL. The reason is that SUN has 717 classes, and a single labeled instance per unseen class cannot provide enough discriminative information, which confuses our automatic unseen CAPD weighting process.
• The ZSL results differ from those in Tables IV, V and VI because here our method is tuned for the GZSL case, not for ZSL. In addition, the random selection of 80% of the training instances of the seen classes across 10 different trials affects the results.
• The acc_u performance of GZSL is always lower than the ZSL accuracy, because the ZSL accuracy is the oracle case of acc_u; the harmonic-mean measure reported in these tables is sketched below.
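The HM columns in Tables VII and XI-XIV follow the harmonic-mean protocol of [45]. As a minimal sketch (the formula itself is the standard one; the per-class averaging helper and all names are our assumptions, not the authors' code):

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, classes):
    """Mean of per-class top-1 accuracies over the given set of classes,
    as used for acc_s (seen) and acc_u (unseen) in the GZSL protocol."""
    return float(np.mean([np.mean(y_pred[y_true == c] == c) for c in classes]))

def gzsl_harmonic_mean(acc_s, acc_u):
    """HM = 2 * acc_s * acc_u / (acc_s + acc_u), cf. [45]."""
    return 0.0 if acc_s + acc_u == 0 else 2.0 * acc_s * acc_u / (acc_s + acc_u)

# Sanity check against Table VII (our GZSL on AwA):
print(round(gzsl_harmonic_mean(68.6, 45.2), 1))   # 54.5
```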

Semantic (Using G)   OSL    FSL    ZSL    GZSL-HM  acc_s  acc_u
Top-1: att           71.2   83.6   40.7   35.7     40.5   32.0
mAP: att             77.7   88.3   45.1   27.7     24.1   32.7

TABLE XI: All results on the aPY dataset at a glance.

Semantic (Using G)   OSL    FSL    ZSL    GZSL-HM  acc_s  acc_u
Top-1: att           82.8   87.4   76.2   50.8     43.2   61.7
mAP: att             86.9   92.0   71.7   50.0     41.2   63.6
Top-1: w2v           76.9   84.7   56.4   43.6     42.8   44.6
mAP: w2v             82.0   89.5   50.8   38.5     35.3   42.5
Top-1: glo           78.2   85.8   60.7   44.7     46.4   43.2
mAP: glo             83.7   89.6   54.3   42.2     37.8   47.9

TABLE XII: All results on the AwA dataset at a glance.

Semantic (Using G)   OSL    FSL    ZSL    GZSL-HM  acc_s  acc_u
Top-1: att           46.3   56.9   44.0   29.5     23.4   39.9
mAP: att             46.9   58.0   40.5   31.8     29.2   34.9
Top-1: w2v           41.7   55.4   33.2   14.9     9.8    31.1
mAP: w2v             41.9   56.3   29.5   23.2     21.9   24.6
Top-1: glo           41.2   55.8   31.1   11.7     7.2    30.3
mAP: glo             40.3   56.2   28.3   23.1     22.8   23.4

TABLE XIII: All results on the CUB dataset at a glance.

Semantic (Using G)   OSL    FSL    ZSL    GZSL-HM  acc_s  acc_u
SUN (645/72 seen/unseen split)
Top-1: att           53.7   66.3   59.8   28.3     22.2   39.2
mAP: att             55.2   68.9   60.5   34.1     27.1   45.9
SUN-10 (707/10 seen/unseen split)
Top-1: att           80.8   87.5   77.9   33.6     25.7   48.6
mAP: att             84.3   90.1   76.8   40.0     32.3   52.7

TABLE XIV: All results on the SUN dataset at a glance.

C. Discussion

Based on our experiments, we highlight the following contributions of our work:

Benefits of CAPD: A CAPD points towards the most likely class: if the semantic space embedding vector of a class and the CAPD of the image lie close to each other, there is strong confidence for that class. One important contribution of this paper is the derivation of a CAPD for each unseen class. Conventional ZSL approaches in this vein of thought essentially calculate one principal direction [4], [35], [44], [33], [50]. Generalizing all seen and unseen classes with only one principal direction cannot capture the differences among classes effectively. In our work, each CAPD is obtained with the help of a bilinear mapping (matrix multiplication). One could extend this by incorporating latent variables, in line with the work of Xian et al. [44], where a collection of bilinear maps with a selection criterion is used.

Benefits of Nearest Seen Classes: Intuitively, when we describe a novel object, we use a similar known object as an example rather than a dissimilar one. This hints that we can reconstruct the CAPD of an unseen class from the CAPDs of the similar seen classes. This idea helps to improve the prediction performance.

How Many Seen Classes are Required? The results presented in Fig. 3 support the idea that all seen classes are not always necessary. We propose a simple yet effective solution for adaptively selecting the number of similar seen classes for each unseen class (see the discussion in Sec. III-B). This scheme allows a different set of useful seen classes for describing each unseen class (a schematic sketch of these ideas is given at the end of this discussion).

Extension to GZSL Setting: ZSL methods are biased towards assigning high prediction scores to seen classes when performing the GZSL task. For this reason, conventional ZSL methods fail to achieve good performance in GZSL. Our proposed method solves this problem by adapting the seen-unseen class diversity in a novel manner. Unlike [38], [9], our adaptation technique does not take any extra supervision from training/validation image data. We show that class semantic information can be used to adapt the seen-unseen diversity.

Extension to Few/One Shot Settings: In some applications, a few images of a new class may become available for training. To adapt to such situations, our method can train a model for the new class without disturbing the previous training. The CAPD from the new model is combined with its previous CAPD (from the unseen setting) to obtain an updated CAPD with few-shot refinement. We propose an automatic way of combining the CAPDs from the two sources by measuring the quality of the prediction responses on training images. Our updated CAPD provides a better fitness score for unseen class prediction.
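To make the bilinear-mapping and nearest-seen-class points above concrete, here is a schematic sketch (ours alone, under simplifying assumptions: the real method learns the per-class mappings and the combination weights, e.g. via the sparse solution of Eq. 6, so every name and formula below is illustrative rather than the paper's implementation):

```python
import numpy as np

def capd(W_c, x):
    """CAPD of image feature x for class c via a bilinear map W_c.

    W_c: (d_vis, d_sem) mapping learned for a seen class.
    x:   (d_vis,) image feature; the CAPD lives in semantic space.
    """
    return W_c.T @ x

def nearest_seen_classes(e_u, E_seen, k):
    """Indices of the k seen classes whose embeddings are most
    cosine-similar to the unseen class embedding e_u."""
    sims = (E_seen @ e_u) / (np.linalg.norm(E_seen, axis=1) * np.linalg.norm(e_u))
    return np.argsort(-sims)[:k], sims

def unseen_capd(x, W_seen, E_seen, e_u, k):
    """Assemble an unseen CAPD from the CAPDs of its k nearest seen
    classes. Softmax weighting stands in here for the learned combination."""
    idx, sims = nearest_seen_classes(e_u, E_seen, k)
    w = np.exp(sims[idx])
    w /= w.sum()
    return sum(wi * capd(W_seen[i], x) for wi, i in zip(w, idx))

# Toy usage: 5 seen classes, 100-d visual and 20-d semantic spaces.
rng = np.random.default_rng(1)
W_seen = [rng.standard_normal((100, 20)) for _ in range(5)]
E_seen = rng.standard_normal((5, 20))   # seen class embeddings
e_u = rng.standard_normal(20)           # unseen class embedding
x = rng.standard_normal(100)            # test image feature
p_u = unseen_capd(x, W_seen, E_seen, e_u, k=3)
# A test image is then assigned to the class whose semantic embedding
# aligns best with its CAPD.
```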
VI. CONCLUSION

We propose a novel unified solution to the ZSL, GZSL and F/OSL problems utilizing the concept of class adaptive principal directions (CAPDs), which enables efficient and discriminative embeddings of unseen class images in the semantic space for recognition and retrieval. We introduce an automatic solution to select a reduced set of relevant seen classes. As demonstrated in our extensive experimental analysis, our method works consistently well in both unsupervised and supervised ZSL settings, and achieves superior performance in particular for the unsupervised case. It provides several benefits, including reliable generalization and noise suppression in the semantic space. In addition to ZSL, our method also performs very well in GZSL settings. We propose an easy solution to match the seen-unseen diversity of classes at the algorithmic level. Unlike conventional methods, our GZSL strategy can balance seen-unseen performance to achieve overall better recognition rates. We have extended our CAPD-based ZSL approach to the FSL setting: our approach easily takes advantage of the few examples available in the FSL task to fine-tune the unseen CAPDs and improve classification performance. As future work, we will extend our approach to transductive settings and domain adaptation.

REFERENCES

[1] Z. Akata, M. Malinowski, M. Fritz, and B. Schiele. Multi-cue zero-shot learning with strong supervision. In CVPR, 2016.
[2] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for attribute-based classification. In CVPR, pages 819-826, 2013.
[3] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for image classification. IEEE TPAMI, 38(7):1425-1438, 2016.
[4] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, pages 2927-2936, 2015.
[5] Z. Al-Halah, M. Tapaswi, and R. Stiefelhagen. Recovering the missing link: Predicting class-attribute associations for unsupervised zero-shot learning. In CVPR, 2016.
[6] A. Bendale and T. E. Boult. Towards open set deep networks. In CVPR, pages 1563-1572, 2016.
[7] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha. Synthesized classifiers for zero-shot learning. In CVPR, pages 5327-5336, 2016.
[8] S. Changpinyo, W.-L. Chao, and F. Sha. Predicting visual exemplars of unseen classes for zero-shot learning. In ICCV, 2017.
[9] W.-L. Chao, S. Changpinyo, B. Gong, and F. Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, pages 52-68, 2016.
[10] M. Elhoseiny, B. Saleh, and A. Elgammal. Write a classifier: Zero-shot learning using purely textual descriptions. In ICCV, pages 2584-2591, 2013.
[11] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, pages 1778-1785, 2009.
[12] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE TPAMI, 28(4):594-611, 2006.
[13] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
[14] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato, and T. Mikolov. DeViSE: A deep visual-semantic embedding model. In NIPS, pages 2121-2129, 2013.
[15] E. Gavves, T. Mensink, T. Tommasi, C. G. M. Snoek, and T. Tuytelaars. Active transfer learning with zero-shot priors: Reusing past datasets for future tasks. In ICCV, 2015.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.
[17] L. P. Jain, W. J. Scheirer, and T. E. Boult. Multi-class open set recognition using probability of inclusion. In ECCV, pages 393-409, 2014.
[18] D. Jayaraman and K. Grauman. Zero-shot recognition with unreliable attributes. In NIPS, pages 3464-3472, 2014.
[19] E. Kodirov, T. Xiang, Z. Fu, and S. Gong. Unsupervised domain adaptation for zero-shot learning. In ICCV, 2015.
[20] J. Krause, H. Jin, J. Yang, and L. Fei-Fei. Fine-grained recognition without part annotations. In CVPR, pages 5546-5555, 2015.
[21] C. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR Workshops, pages 951-958, 2009.
[22] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE TPAMI, 36(3):453-465, 2014.
[23] Y. Li, D. Wang, H. Hu, Y. Lin, and Y. Zhuang. Zero-shot recognition using dual visual-semantic mapping paths. In CVPR, 2017.
[24] M. Bucher, S. Herbin, and F. Jurie. Improving semantic embedding consistency by metric learning for zero-shot classification. In ECCV, 2016.
[25] T. Mensink, E. Gavves, and C. G. Snoek. COSTA: Co-occurrence statistics for zero-shot classification. In CVPR, pages 2441-2448, 2014.
[26] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[27] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111-3119, 2013.
[28] P. Morgado and N. Vasconcelos. Semantically consistent regularization for zero-shot recognition. In CVPR, 2017.
[29] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.
[30] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell. Zero-shot learning with semantic output codes. In NIPS, pages 1410-1418, 2009.
[31] G. Patterson, C. Xu, H. Su, and J. Hays. The SUN attribute database: Beyond categories for deeper scene understanding. IJCV, 108(1-2):59-81, 2014.
[32] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, pages 1532-1543, 2014.
[33] R. Qiao, L. Liu, C. Shen, and A. van den Hengel. Less is more: Zero-shot learning from online textual documents with noise suppression. In CVPR, 2016.
[34] M. Rohrbach, M. Stark, and B. Schiele. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In CVPR, pages 1641-1648, 2011.
[35] B. Romera-Paredes and P. Torr. An embarrassingly simple approach to zero-shot learning. In ICML, pages 2152-2161, 2015.
[36] R. Salakhutdinov, J. B. Tenenbaum, and A. Torralba. Learning with hierarchical-deep models. IEEE TPAMI, 35(8):1958-1971, 2013.
[37] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[38] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In NIPS, pages 935-943, 2013.
[39] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1-9, 2015.
[40] Y. H. Tsai, L. Huang, and R. Salakhutdinov. Learning robust visual-semantic embeddings. CoRR, abs/1703.05908, 2017.
[41] L. van der Maaten. Accelerating t-SNE using tree-based algorithms. JMLR, 15(1):3221-3245, 2014.
[42] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[43] X. Wang and Q. Ji. A unified probabilistic approach modeling relationships between attributes and objects. In ICCV, pages 2120-2127, 2013.
[44] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele. Latent embeddings for zero-shot classification. In CVPR, 2016.
[45] Y. Xian, B. Schiele, and Z. Akata. Zero-shot learning - the good, the bad and the ugly. In CVPR, 2017.
[46] X. Xu, F. Shen, Y. Yang, D. Zhang, H. T. Shen, and J. Song. Matrix tri-factorization with manifold regularizations for zero-shot learning. In CVPR, 2017.
[47] M. Ye and Y. Guo. Zero-shot classification with discriminative semantic representation learning. In CVPR, 2017.
[48] Y. Ying and P. Li. Distance metric learning with eigenvalue optimization. JMLR, 13(1):1-26, 2012.
[49] F. X. Yu, L. Cao, R. S. Feris, J. R. Smith, and S. F. Chang. Designing category-level attributes for discriminative visual recognition. In CVPR, pages 771-778, 2013.
[50] Y. Zhang, B. Gong, and M. Shah. Fast zero-shot image tagging. In CVPR, 2016.
[51] Z. Zhang and V. Saligrama. Zero-shot learning via semantic similarity embedding. In ICCV, 2015.
[52] Z. Zhang and V. Saligrama. Zero-shot learning via joint latent similarity embedding. In CVPR, 2016.