Low-Rank Pairwise Alignment Bilinear Network for Few-Shot Fine

JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, JUNE 2019 1 Low-Rank Pairwise Alignment Bilinear Network For Few-Shot Fine-Grained Image Classification Huaxi Huang, Student Member, IEEE, Junjie Zhang, Jian Zhang, Senior Member, IEEE, Jingsong Xu, Qiang Wu, Senior Member, IEEE Abstract—Deep neural networks have demonstrated advanced abilities on various visual classification tasks, which heavily rely on the large-scale training samples with annotated ground-truth. Slot Alaskan Dog However, it is unrealistic always to require such annotation in h real-world applications. Recently, Few-Shot learning (FS), as an attempt to address the shortage of training samples, has made significant progress in generic classification tasks. Nonetheless, it Insect Husky Dog is still challenging for current FS models to distinguish the subtle differences between fine-grained categories given limited training Easy to Recognise A Little Hard to Classify data. To filling the classification gap, in this paper, we address Lion Audi A5 the Few-Shot Fine-Grained (FSFG) classification problem, which focuses on tackling the fine-grained classification under the challenging few-shot learning setting. A novel low-rank pairwise Piano Audi S5 bilinear pooling operation is proposed to capture the nuanced General Object Recognition with Single Sample Fine-grained Object Recognition with Single Sample differences between the support and query images for learning an effective distance metric. Moreover, a feature alignment layer Fig. 1. An example of general one-shot learning (Left) and fine-grained one- is designed to match the support image features with query shot learning (Right). For general one-shot learning, it is easy to learn the ones before the comparison. We name the proposed model Low- concepts of objects with only a single image. However, it is more difficult to Rank Pairwise Alignment Bilinear Network (LRPABN), which is distinguish the sub-classes of specific categories with one sample. trained in an end-to-end fashion. Comprehensive experimental results on four widely used fine-grained classification data sets demonstrate that our LRPABN model achieves the superior e.g., it is hard to collect large-scale samples of endangered performances compared to state-of-the-art methods. species. How to tackle the fine-grained image classification Index Terms—Few-Shot, Fine-Grained, Low-Rank, Pairwise, with limited training data remains an open problem. Bilinear Pooling, Feature Alignment. Human beings can learn novel generic concepts with only one or a few samples easily. To simulate this intelligent I. INTRODUCTION ability, machine few-shot learning is initially identified by INE-GRAINED image classification aims to distinguish Li et al. [14]. They propose to utilize probabilistic models F different sub-categories belong to the same entry-level to represent object categories and update them with a few category such as birds [2], [3], dogs [4], and cars [5]. This training examples. Most recently, inspired by the advanced problem is particularly challenging due to the low inter- representation learning ability of deep neural networks, deep category variance yet high intra-category discordance caused machine few-shot learning [15]–[20] revives and achieves by various object postures, illumination conditions and dis- significant improvements against previous methods. However, tances from the cameras, etc. In general, the majority of considering the cognitive process of human beings, preschool arXiv:1908.01313v3 [cs.CV] 28 May 2020 fine-grained classification approaches need to be fed with a students can easily distinguish the difference between generic large amount of training data before obtaining a trustworthy concepts like the ‘Cat’ and ‘Dog’ after seeing a few exemplary classifier [6]–[11]. However, labeling the fine-grained data re- images of these animals, but they may be confused about fine- quires strong domain knowledge, e.g., only ornithologists can grained dog categories such as the ‘Husky’ and ‘Alaskan’ accurately identify different bird species, which is significantly with limited samples. The undeveloped classification ability expensive compared to the generic object classification task. of children in processing information compared to adults [21], Moreover, in some fine-grained data sets such as the Wild- [22] indicates that generic few-shot methods cannot cope with fish [12] and iNaturalist [13], the data distributions are usually the few-shot fine-grained classification task admirably. To this imbalanced and follow the long-tail distribution, and in some end, in this paper, we focus on dealing with the Few-Shot of the categories, the well-labeled training samples are limited, Fine-Gained (FSFG) classification in a ‘developed’ way. Wei et al. [23] recently introduce the FSFG task. Besides Huaxi Huang and Junjie Zhang are co-first authors. Corresponding author: Jian Zhang, email: [email protected]. establishing the FSFG problem, they propose a deep neural Huaxi Huang, Jian Zhang, Qiang Wu and Jingsong Xu are with the network model named Piece-wise Classifier Mapping (PCM). Faculty of Engineering and Information Technology, University of Technology By adopting the meta-learning strategy on the auxiliary data Sydney, Sydney NSW 2007, Australia. Junjie Zhang is with the School of Computer Science, The University of Adelaide, Adelaide SA 5005, Australia. set, their model can classify different samples in the testing The preliminary version of this work is accepted at IEEE ICME 2019 [1]. data set with a few labeled samples. The most critical issue JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, JUNE 2019 2 Embedded Features Support Set Aligned Features 1 1 1 ? 2 Comparative Features 2 2 ? 3 3 Encoder Alignment Pairwise Layer 3 ? Bilinear Comparator ? = 1 Pooling 4 4 4 ? 5 5 ? 5 ? Query Image ? Fig. 2. The framework of LRPABN under the 5-way-1-shot fine-grained image classification setting. The support set contains five labeled samples for each category (marked with numbers) and the query image labeled with a question mark. The LRPPABN model can be divided into four components: Encoder, Alignment Layer, Pairwise Bilinear Pooling, and Comparator. The Encoder extracts coarse features from raw images. Alignment Layer matches the pairs of support and query. Pairwise Bilinear Pooling acts as a fine-grained extractor that captures the subtle features. The Comparator generates the final results. in FSFG is to acquire subtle and informative image features. to-end FSFG model that captures the fine-grained relations In PCM, the authors adopt the naive self-bilinear pooling among different classes. This subtle comparative ability of our to extract image representations, which widely used in the models is inherently more intelligent than merely modeling the state-of-the-art fine-grained object classification [8], [9], [24]. data distribution [17], [19], [23]. The main contributions are Then with the operation of bilinear feature grouping, the PCM summarized as follows: model can generate low-rank subtle descriptors of the original • Pairwise Bilinear Pooling. Existing second-order based image. Most recently, Li et al. [19] propose a covariance FSFG methods [19], [23] enhance the individual encoded pooling [10] to distillate the image representation of each features by directly applying the self-bilinear pooling category. These matrix-outer-product based bilinear pooling operation. However, such an operation fails to capture operations [19], [23] could extract the second-order image more nuanced relations between similar objects. Instead, features and contains more information than traditional first- we uncover the fine-grained relations between different order features [24], and thus achieve better performance on support and query image pairs by using matrix outer prod- FSFG tasks than generic ones. uct operation, which is called pairwise bilinear pooling. It is worth noting that both [19] and [23] employ bilinear Based on the explicit elicitation of correlative informa- pooling on the input image itself to enhance the information of pair samples, the proposed operation can extract tion of original features, which noted as the self-bilinear more discriminate features than existing approaches [1], pooling operation. However, when a human identifies the [17], [23]. More importantly, we introduce a low-rank similar objects, she/he tends to compare them thoroughly in approximation for the comparative second-order feature, a pairwise way, e.g., comparing the heads of two birds first, where a set of co-variance low-rank transform matrices then the wings and feet last. Therefore, it is natural to enhance are learned to reduce the complexity of the operation. the information during the comparing process when dealing • Effective Feature Alignment. The main advantage of with FSFG classification tasks. Based on this motivation, we self-bilinear based FSFG methods is the enhancement of propose a novel pairwise bilinear pooling operation on the depth information for individual spatial positions in the support and query images to extract the comparative second- image, which is achieved by the matrix outer product order images descriptors for FSFG. operation on convolved feature maps. Inspired by the There are a series of works that address the generic few- self-bilinear pooling operation, we design a simple yet shot classification by learning to compare [15]–[17], among effective alignment mechanism to match the pairwise which the RelationNet [17] achieves state-of-the-art perfor- convolved image features. By exploiting

Low-Rank Pairwise Alignment Bilinear Network for Few-Shot Fine

Low-Rank Tucker Approximation of a Tensor from Streaming Data : ; \ \ } Yiming Sun , Yang Guo , Charlene Luo , Joel Tropp , and Madeleine Udell

Polynomial Tensor Sketch for Element-Wise Function of Low-Rank Matrix

Fast and Guaranteed Tensor Decomposition Via Sketching

High-Order Attention Models for Visual Question Answering

Visual Attribute Transfer Through Deep Image Analogy ∗

Monet: Moments Embedding Network

Multi-Objective Matrix Normalization for Fine-Grained Visual Recognition

High-Order Attention Models for Visual Question Answering

Monet: Moments Embedding Network∗

Polynomial Tensor Sketch for Element-Wise Function of Low-Rank Matrix

Hadamard Product for Low-Rank Bilinear Pooling

Higher-Order Count Sketch: Dimensionality Reduction That Retains Efﬁcient Tensor Operations