LEARNING DEEP AND SPARSE FEATURE REPRESENTATION FOR FINE-GRAINED OBJECT RECOGNITION

M. Srinivas¹, Yen-Yu Lin², Hong-Yuan Mark Liao¹

¹Institute of Information Science, Academia Sinica, Taiwan
²Research Center for Information Technology Innovation, Academia Sinica, Taiwan

ABSTRACT

In this paper, we address fine-grained classification, which is quite challenging due to high intra-class variations and subtle inter-class variations. Most modern approaches to fine-grained recognition are established based on convolutional neural networks (CNN). Despite their effectiveness, these approaches still suffer from two major problems. First, they rely highly on large sets of training data, but manually annotating numerous training data is expensive. Second, the feature representations learned by these approaches are often of high dimensions, leading to less efficiency. To tackle the two problems, we present an approach where on-line dictionary learning is integrated into CNN. The dictionaries can be incrementally learned by leveraging a vast amount of weakly labeled data on the Internet. With these dictionaries, all the training and testing data can be sparsely represented. Our approach is evaluated and compared with the state-of-the-art approaches on the benchmark dataset CUB-200-2011. The promising results demonstrate its superiority in both efficiency and accuracy.

Index Terms— Part-based RCNN, sparse representation, dictionary learning, fine-grained categorization

1. INTRODUCTION

Fine-grained classification (FGC) aims to distinguish fine-level categories of images, such as bird species, airplane or car types [1, 2], and animal breeds [3]. In addition to the difficulties inherited from generic object recognition, such as large intra-class variations, fine-grained classification is much more challenging due to subtle inter-class variations. In this work, we propose to learn a deep and sparse feature representation to address the difficulties of FGC, and illustrate it with the application to fine-grained bird species recognition. This application is considered challenging, since some of the species are difficult to recognize even for humans.

The major difficulties hindering the advances in accurate fine-grained classification come from diverse factors. First, there exist small inter-class variations and large intra-class variations. Second, the training data of a category in the benchmarks of FGC are often too few to reliably represent the data distribution of this category in the feature space. Meanwhile, the number of categories to be recognized is large. Third, the feature representations learned by conventional approaches are often of high dimensions, leading to a high computational cost.

To address the first difficulty, two research trends of FGC arise. The first trend is the use of part-based representations [4]. Part-based models recognize objects by referring to not only the appearances of object parts but also their spatial relationships. Thus, these models are robust to intra-class variations caused by different poses. Meanwhile, the distinct characteristics for fine-grained recognition are often carried by object parts, instead of the whole objects. The second trend is to extract more discriminative feature representations [5] by using convolutional neural networks (CNN) [6]. Recent work based on CNN has shown notable improvement over the work that adopts handcrafted features [7, 8].

However, the two trends of FGC have worsened the other two difficulties: the demand for large training data and the high dimensions of the resultant feature representations. Learning part-based models typically needs training data with part-level annotation, which leads to the expensive cost of manual labeling in collecting training data. The powerful CNN can extract discriminative features and learn non-linear classifiers simultaneously, but the resultant feature representations, i.e., the input to the last decision layer, are of high dimensions. Take part-based RCNNs (PRCNN) [10] as an example. It achieved promising results for FGC by using part annotations and CNN-based feature representations. However, further improvement based on PRCNN is very difficult, because learning part-based CNN requires a large set of strongly annotated training data, which is currently not available in the benchmark datasets for FGC.

The main contribution of this work lies in the development of a fine-grained classification approach that addresses these drawbacks of the part-based models and reduces the computational cost caused by the high-dimension representation. We address the lack of training data by using additional web images, which can be obtained by querying the categories to be recognized in search engines. This additional dataset contains a large volume of images with image-level labels. Weakly supervised learning is conducted on the small strongly-labeled data and the abundant weakly-labeled data.


Specifically, we leverage the strongly labeled dataset to learn the part-based representation, and transfer the learned representation to the object parts in the weakly labeled dataset. In this manner, both the strongly and weakly labeled data can be used to derive a more reliable feature representation.

Using deep convolutional layers can extract features of high quality, but the dimension of the extracted features is quite large. Dimensionality reduction methods [9] can be used to reduce the dimension of feature representations. Specifically, kernel principal component analysis (KPCA) is employed. In this work, we propose a new method for fine-grained categorization that learns robust CNN-based feature representations, and carries out classification based on dictionary learning and sparse representation. With the aid of the extra data borrowed from the Internet, the problem caused by the lack of training data is alleviated. The use of dictionary learning further reduces the computational cost and improves the accuracy of FGC.

Our approach increases the classification performance while reducing the computation cost. The main contribution of this work is three-fold. First, additional training data are used to prevent the problem of overfitting when training CNN with the small dataset available for FGC. Second, we learn category-specific dictionaries using both strongly and weakly labeled training data. Thus, our approach is more efficient compared with the method in [11], since training data can be represented by just a few dictionary atoms. Another advantage of using on-line dictionary learning (ODL) is the reusability of existing dictionaries [12]: the dictionary can be incrementally learned by using the extra weakly labeled data obtained from the Internet, and the learned dictionary can be used to reduce the computational cost for processing web images. Third, we employ l1-lasso sparsity to enhance the performance of predicting test data. Our method can efficiently search for the sparsest representation of a test sample in the trained dictionary, which is composed of training samples of all classes. Thus, there is no need to derive decision boundaries. These sparsely learned dictionaries give better classification performance. An overview of our proposed method is shown in Fig. 1.

Fig. 1. Overview of our approach. (Figure: in the training phase, PRCNN-based feature extraction is applied to the object, body, and head regions, followed by KPCA, augmentation with weakly supervised data, and dictionary construction D = {D1, D2, ..., Dn}; in the testing phase, the same features are extracted and the dictionary giving the sparsest representation determines the classification result.)

The experimental results show the effectiveness of the proposed approach. With the additional weakly labeled dataset acquired from the web, we achieve a recognition rate of 84.3% on the CUB-200-2011 dataset, which is comparable to those of the state-of-the-art approaches. Besides, our approach reduces the computational cost by 30% when compared with the related work [11].

2. RELATED WORK

Significant progress has been made on fine-grained classification (FGC) in the field of computer vision. Recent methods for FGC using CNNs have achieved performance gains over the methods that adopt handcrafted features [7, 8, 13]. The major disadvantage of part-based methods is the high cost of manually annotating object parts in collecting training data. Branson et al. [14] described an architecture for bird species fine-grained classification. In their work, features are computed by applying deep CNNs to image patches, which are located and normalized by poses. For learning a compact space of normalized poses, higher-order functions for geometric warping and graph-based clustering algorithms are included in their work. However, this approach does not give satisfactory classification accuracy in predicting bird species.

For fine-grained recognition, many current methods employ a two-stage pipeline, part detection followed by classification, to accomplish this challenging task. Zhang et al. [10] learned a more robust model for fine-grained classification without using the bounding boxes of objects at the testing phase. Their method adopts the RCNN detector [15] to learn the whole object, locate the parts, and make the prediction. It employs constraints to enforce the learned geometric relationship between the detected parts and the whole objects. The learned pose-normalized representation is used to carry out fine-grained categorization. Huang et al. [16] proposed the part-stacked CNN for fine-grained classification. They leveraged a two-stream structure to capture both object-level and part-level information. Krause et al. [17] presented an approach that carries out fine-grained recognition without using part-level annotation. In their approach, part-level information is derived by using co-segmentation and image alignment. Compared to [10, 16, 17], our proposed method has advantages in both classification accuracy and computational efficiency.

In [18], Lin et al. proposed a bilinear model for representing bird species, and applied it to fine-grained classification. They employed two CNN models and generated the feature representations by computing the outer product of feature maps. Namely, every two feature maps are multiplied at each location and pooled across locations to obtain an image descriptor (a schematic sketch is given at the end of this section). In [19], Shih et al. presented a network layer, called the co-occurrence layer, which explores the co-occurrence of visual patterns to enhance the performance of fine-grained recognition. The work in [11] compiles more discriminative CNN features. It uses a training set augmented by additional part patches obtained from weakly labeled web images, and generates more robust feature representations. A multi-instance learning algorithm is applied to both the strongly and weakly labeled datasets to learn a more robust classifier. The disadvantage of the method in [11] is that the dimension of the feature representation is large, which results in the high computational cost of processing the two CNN models. Our proposed method instead balances classification accuracy and computation cost in fine-grained recognition.
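As a concrete illustration of the bilinear pooling operation described above, the following minimal NumPy sketch computes such a descriptor from two feature maps; the signed square root and l2 normalization are post-processing steps commonly used with this model. The shapes and the function name are our own illustrative assumptions, not code from [18].

```python
import numpy as np

def bilinear_pool(fa, fb):
    # fa: (H*W, Ca) and fb: (H*W, Cb) feature maps from two CNN streams,
    # reshaped so that rows index spatial locations and columns index channels.
    desc = fa.T @ fb                    # sum of outer products over all locations
    desc = desc.ravel()                 # flatten the (Ca, Cb) matrix into a vector
    desc = np.sign(desc) * np.sqrt(np.abs(desc))   # signed square root
    return desc / (np.linalg.norm(desc) + 1e-12)   # l2 normalization
```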
3. THE PROPOSED METHOD

In this section, we present our method for bird species fine-grained categorization. Our method learns a CNN-based feature representation and employs detailed object part annotation in a unified framework. With the aid of weakly labeled web data, it alleviates the problem caused by the lack of training data. Besides, it integrates dictionary learning into CNN. Classification based on sparsity is carried out for bird species prediction.

3.1. Feature Representation

Recently, feature representations based on deep CNNs have become widely used for fine-grained categorization. Specifically, PRCNN has demonstrated the effectiveness of such feature representations in this scenario. The main limitation of PRCNN is that learning such feature representations is almost infeasible with a small set of training data, since CNNs require a large set of training data to learn robust feature representations. By using weakly labeled web images from Flickr, our approach overcomes the lack of training data.

Suppose that we are given a training set with part-level annotation. The set contains images of R fine-grained categories, and each image is annotated with a bounding box x_0 of the whole object and the bounding boxes {x_1, x_2, ..., x_m} of m different parts. We use the training set to learn a detector for the whole object and a detector for each part. Then, the part detectors are used to collect additional part patches from the weakly labeled images by localizing object parts during testing. Specifically, we follow the augmented PRCNN to learn the detectors, where the part-based CNNs with a k-way fc8 classification layer are fine-tuned.

Let {y_0, y_1, ..., y_m} denote the weights of the R-CNN detectors for the whole object x_0 and the m different parts {x_i}_{i=1}^m. The corresponding detector scores {e_0, e_1, ..., e_m} are computed for every region proposal b via

    e_i(b) = σ(y_i^T φ^(i)(b)),    (1)

where σ(·) is the sigmoid function and φ^(i)(b) represents the features extracted by the ith part detector at location b.

Although weakly labeled images contain only image-level annotation, we use the learned part detectors to discover the part patches in these weakly labeled images. After applying the part detectors to all image locations, the detection scores constrained by geometric relations are used to infer the locations of the object and its parts. The detected locations H* = {h_0, ..., h_m} are given by maximizing the following function:

    H* = argmax_H ∏_{i=1}^{m} g_{h_0}(h_i) ∏_{i=0}^{m} e_i(h_i),    (2)

where

    g_{h_0}(h_i) = 1 if h_i falls outside h_0 by at most 10 pixels, and 0 otherwise.    (3)

In this way, the part-based information of the weakly labeled images is revealed. We then use the newly obtained information to fine-tune the CNN-based part detectors. On the one hand, the part detectors can be more reliably trained. On the other hand, the part locations can be more precisely determined owing to the better part detectors.
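To make Eqs. (1)-(3) concrete, the sketch below scores region proposals with the learned detector weights and picks part locations consistent with the object box. It is a simplified, greedy approximation of the joint maximization in Eq. (2): the object box h_0 is fixed first, and each part is then chosen independently. The box format ([x1, y1, x2, y2]) and a single shared feature matrix per image are our own simplifying assumptions, not the authors' code.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def detector_scores(y_i, phi):
    # Eq. (1): e_i(b) = sigma(y_i^T phi_i(b)), vectorized over all proposals;
    # phi: (num_proposals, feat_dim), y_i: (feat_dim,)
    return sigmoid(phi @ y_i)

def geometric_ok(h0, h, margin=10):
    # Eq. (3): g_{h0}(h) = 1 if box h lies outside the object box h0
    # by at most `margin` pixels on every side, 0 otherwise
    return (h[0] >= h0[0] - margin and h[1] >= h0[1] - margin and
            h[2] <= h0[2] + margin and h[3] <= h0[3] + margin)

def localize_parts(boxes, phi, weights):
    # Greedy approximation of Eq. (2): weights = [y_0, y_1, ..., y_m];
    # fix the best object box, then keep each part's best consistent proposal.
    h0 = boxes[int(np.argmax(detector_scores(weights[0], phi)))]
    picks = [h0]
    for y_i in weights[1:]:
        e = detector_scores(y_i, phi)
        mask = np.array([geometric_ok(h0, b) for b in boxes])
        picks.append(boxes[int(np.argmax(np.where(mask, e, -np.inf)))])
    return picks
```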
However, the dimension of the feature vector φ is often too high to build an effective dictionary. To reduce the dimension, kernel principal component analysis (KPCA) is applied. In the following, we discuss dictionary learning and sparse representation for classification.

3.2. Dictionary Learning and Sparse Representation

Sparsity-dominated dictionaries give us an effective representation for fine-grained classification. To process training data, several dictionary learning methods have been developed, such as K-SVD [20], on-line dictionary learning (ODL) [12], and incremental dictionary learning (IDL) [21]. Earlier work on dictionary learning and sparse-representation-based classification typically targets high-level image classification problems, and these methods are not particularly useful for fine-grained classification. In the following, we explain how to construct category-specific dictionaries from training data to resolve this issue.

In fine-grained datasets, the differences between categories are very subtle, and similar parts across categories may degrade the performance of the learned dictionary. When sparse coding techniques are applied, a learned shared dictionary can be dominated by the matching parts of images: most of the dictionary atoms encode common features, while only a few atoms encode the differences. Thus, a shared dictionary degrades the performance of fine-grained classification. By using category-specific dictionaries, most of the dictionary atoms are helpful in encoding the differences between data of different classes.

In the proposed method, ODL is used to train a dictionary for each category. Namely, for the training data of R categories, we construct R dictionaries to represent the categories. Each image is associated with the dictionary that gives the sparsest representation. A given test sample can then be represented as a linear combination of training data from category-specific dictionary atoms. Specifically, given a database of N training images of R classes, the training samples {y_i}_{i=1}^N collected from both the strongly and weakly labeled data are denoted by C = [C_1, ..., C_d, ..., C_R], where C_d is the data matrix of class d. Let L_d denote the number of training data of class d, and let a be an image belonging to the dth class. Then a can be represented as a linear combination of these training samples:

    a = D_d Φ_d,    (4)

where D_d, a matrix of size m × L_d, is a dictionary whose columns (atoms) are the training samples of the dth class, and Φ_d represents the sparse coefficients related to the dth class. The proposed method is a two-step process: the first step is dictionary learning, and the second one is sparse-representation-based classification. These steps are detailed below.

3.2.1. Dictionary Construction

The ODL algorithm is used to construct the dictionary for each class of training samples. The sparse coding stage in ODL is a Cholesky-based implementation of the LARS-lasso algorithm. The dictionaries D = [D_1, ..., D_R] are computed by solving

    (D_i, Φ_i) = argmin_{D_i, Φ_i} (1/2) ||C_i − D_i Φ_i||_2^2 + λ ||Φ_i||_1,  for i = 1, 2, ..., R.    (5)
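To illustrate Eq. (5) and the preceding KPCA step, the sketch below uses scikit-learn, whose MiniBatchDictionaryLearning implements the on-line dictionary learning of [12] with a LARS-lasso sparse coding step. The kernel choice, the λ value (alpha), and all variable names are assumptions for illustration; note that scikit-learn stores atoms as rows of components_, whereas Eq. (4) stacks them as columns.

```python
import numpy as np
from sklearn.decomposition import KernelPCA, MiniBatchDictionaryLearning

def build_dictionaries(feats, labels, num_classes, dim=1000, atoms=80, lam=1.0):
    # feats: (N, 12288) part-based CNN features; labels: (N,) class ids.
    kpca = KernelPCA(n_components=dim, kernel='rbf')   # dimension reduction
    z = kpca.fit_transform(feats)
    dictionaries = []
    for c in range(num_classes):
        # Eq. (5): min_{D,Phi} 1/2 ||C_c - D Phi||_2^2 + lam * ||Phi||_1
        odl = MiniBatchDictionaryLearning(n_components=atoms, alpha=lam,
                                          transform_algorithm='lasso_lars')
        odl.fit(z[labels == c])
        # odl.partial_fit(batch) can absorb newly crawled weakly labeled
        # images incrementally, without retraining from scratch
        dictionaries.append(odl.components_)           # (atoms, dim)
    return kpca, dictionaries
```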
3.2.2. Sparsity-based Classification

In the classification phase, we seek the sparse vector Φ for a given test image g. Using the dictionaries of training samples D = [D_1, ..., D_R], the sparse representation Φ is obtained by solving the following optimization problem:

    Φ = argmin_Φ (1/2) ||g − DΦ||_2^2,  subject to  ||Φ||_1 ≤ T_1,    (6)

where T_1 represents the sparsity threshold. Then, the prediction is made based on

    î = argmax_i ||δ_i(Φ)||_1,  for i = 1, 2, ..., R,    (7)

where δ_i is a characteristic function that selects the coefficients for class i. Define δ_i(Φ) = (Φ_{i,1}, ..., Φ_{i,L_i}), which is the contribution of the ith class to the representation of g in the dictionary. The test image g is assigned to class î if the absolute sum of the sparsity coefficients associated with the îth dictionary is the maximum among all classes. The l1 norm is chosen here as it is found to yield better classification results. According to our implementation, sparsely learned dictionaries give better classification performance.
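A minimal sketch of the classification rule in Eqs. (6)-(7), assuming the per-class dictionaries produced by the hypothetical build_dictionaries above: the test feature is sparsely coded over the concatenated dictionary, and the class whose coefficients carry the largest l1 mass is predicted. sparse_encode solves the penalized (Lagrangian) form of Eq. (6) rather than the constrained form, a standard equivalent reformulation; the alpha value is an assumption.

```python
import numpy as np
from sklearn.decomposition import sparse_encode

def classify(g, dictionaries, alpha=0.1):
    # g: KPCA-reduced test feature of shape (dim,);
    # dictionaries: list of per-class arrays with atoms as rows.
    d = np.vstack(dictionaries)                 # concatenated dictionary D
    phi = sparse_encode(g.reshape(1, -1), d,
                        algorithm='lasso_lars', alpha=alpha).ravel()
    # Eq. (7): delta_i selects the coefficients of class i; predict the
    # class maximizing the l1 norm of its own coefficients
    bounds = np.cumsum([0] + [di.shape[0] for di in dictionaries])
    scores = [np.abs(phi[bounds[i]:bounds[i + 1]]).sum()
              for i in range(len(dictionaries))]
    return int(np.argmax(scores))
```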

4. EXPERIMENTAL RESULTS

In this section, we describe the used dataset and report the experimental results with different dictionary sizes, kernels, and feature dimensions. Experiments are carried out on the CUB-200-2011 birds database [22], which contains 11,788 images of 200 bird classes. About 30 images of each class are used for training, while about 30 of each class are used for testing. Each image is associated with an image-level label, object bounding boxes, and part landmarks. To address the lack of such strongly labeled data in fine-grained classification, we collect an additional weakly labeled dataset from Flickr and augment the strongly labeled dataset with it. This additional dataset contains 100 images for each category to be recognized. These additional web images come with only image-level labels, which may not be correct due to the ambiguity of query words and label noise.

We implement this work using part-based RCNN for robust feature representations. The performance of our approach is evaluated with various settings, such as with or without part-level annotations in the training and/or testing phases, and with or without using the weakly labeled data to fine-tune the networks. Specifically, we follow the work in [11] and use the same settings for ease of comparison in the experiments.

Table 1 shows the performance of the proposed method and the related work [11] under these different settings. When fine-tuning the part CNNs on the training set, the proposed method gives a recognition rate of 82.1%. By augmenting the part patches from the weakly labeled dataset and fine-tuning the part CNNs, the accuracy is improved to 83.3%. The results show the advantage of using the additional dataset to better train the CNNs. A further improvement in classification accuracy can be obtained by denoising the weakly labeled dataset; the resultant accuracy is 83.6%. In these cases, the proposed method gives promising results compared with APRCNN [11]. Finally, our proposed method achieves a recognition rate of 84.3% with the aid of the weakly labeled data.

Table 1. Accuracy comparison between APRCNN [11] (A) and our method (P) on the CUB-200-2011 dataset, with part localization using predicted bounding boxes. ft = fine-tuning, dn = denoising.

                              Train            Train+Weak
    Method                    A       P        A       P
    ft on Train               78.6    82.1     79.9    83.4
    ft on Train/Weak          81.2    83.3     82.2    84.3
    ft on Train/Weak + dn     83.2    83.6     84.6    84.3

Table 2 reports the classification accuracy of the proposed method and the state-of-the-art methods on the CUB-200-2011 dataset. On this challenging dataset, we achieve a promising recognition rate of 84.3%. In addition, our approach integrates online dictionary learning to produce a sparser feature representation. Thus, another advantage of our approach is that it is more computationally efficient than most CNN-based approaches.

Table 2. Performance of different approaches on the CUB-200-2011 dataset. Part = part-level annotation, BBox = bounding box, ACC = accuracy rate (%).

    Method                 Train BBox   Train Part   Test BBox   ACC
    DPD+DeCAF [23]         X            X            X           65.0
    POOF [24]              X            X            X           56.8
    Symbiotic [25]         X                         X           61.0
    Alignment [26]         X                         X           62.7
    CNNaug [27]            X                         X           61.8
    PRCNN [10]             X            X                        73.9
    PoseNorm CNN [14]      X            X                        75.7
    Co-segmentation [17]   X                                     82.0
    Bilinear Model [18]                                          84.1
    Co-occur. Layer [19]                                         85.8
    APRCNN [11]            X            X                        84.6
    Proposed Method        X            X                        84.3

It is worth mentioning that our approach induces less computational cost. The dimensions of CNN-based feature representations are often very high; for instance, the dimension is 12288 in RCNN. Working on high-dimensional data takes more computational time. In this work, we use kernel PCA to reduce the dimension to 1000. We evaluate the performance of the proposed method with three different types of kernels, including the sigmoid kernel, the RBF kernel, and the polynomial kernel. The results are shown in Fig. 2(b).

We conduct experiments with different dictionary sizes; the results are shown in Fig. 2(a). The CUB-200-2011 dataset consists of 200 bird classes, each of which has about 30 images for training. As can be observed in Fig. 2(a), 80 atoms, i.e., D = 80, suffice to achieve satisfactory results. It means that the coefficient vector of each image is of dimension 80, which is much lower than the dimensions of the CNN features and of the features after applying KPCA. Thus, the main advantage of our approach over existing CNN-based methods is that data can be represented in a more compact way while the representation still achieves state-of-the-art classification accuracy. Besides, online dictionary learning is particularly suitable in our case, where weakly labeled images are obtained sequentially.

We evaluate the performance of the proposed method with three different feature dimensions when applying kernel PCA, i.e., Vd = 600, 1000, and 2000. The results are shown in Fig. 2(c). When the dimension is set to 1000, the proposed method gives the best classification accuracy. Fig. 2(d) shows the computational cost of the proposed method compared with the related work APRCNN [11]. Our approach executes with a lower computational cost, which results from the use of KPCA and dictionary learning for sparse data representation. Therefore, our method reduces the computation cost compared with other techniques. In addition, sparsely learned dictionaries give better classification performance in the experiments.

Fig. 2. Performance comparison with (a) different dictionary sizes (D = 10, 20, 30, 60, and 80), (b) different kernel functions, and (c) different feature dimensions (Vd = 600, 1000, and 2000). (d) Comparison of our approach with APRCNN [11] in running time (computational time in hours).

5. CONCLUSIONS

In this paper, we present an approach to compile a deep and sparse feature representation, and illustrate it with the application to fine-grained classification. The approach integrates dictionary learning and sparsity-based classification into the CNN framework. We also alleviate the problem caused by the lack of fully labeled training data by using an additional weakly labeled dataset collected from Flickr. The experimental results show that augmenting the strongly labeled training data with additional part patches from the weakly labeled dataset helps fine-tune the CNN models and achieves better classification results. Using dictionary learning also helps alleviate the issue of the high computational cost required for fine-grained categorization. It follows that a given training or testing image is represented by a compact vector whose dimension is equivalent to the number of dictionary atoms. In our case, the dimension is reduced from 12288, the number of features in the last layer of the CNN, to 80, the dictionary size. More importantly, such a low-dimensional representation still achieves a recognition rate of 84.3% on the CUB-200-2011 database, which is comparable to the state-of-the-art accuracy.

Acknowledgement. The authors would like to thank the anonymous reviewers for their thorough review and constructive comments. This work was supported in part by grants MOST 105-2221-E-001-030-MY2, MOST 105-2218-E-001-006, and MOST 105-2221-E-001-018-MY3.

6. REFERENCES

[1] M. Stark, J. Krause, B. Pepik, D. Meger, J. J. Little, B. Schiele, and D. Koller, "Fine-grained categorization for 3D scene understanding," in British Machine Vision Conference, 2012.

[2] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, "Fine-grained visual classification of aircraft," arXiv preprint arXiv:1306.5151, 2013.

[3] J. Liu, A. Kanazawa, D. W. Jacobs, and P. N. Belhumeur, "Dog breed classification using part localization," in Euro. Conf. on Computer Vision, 2012.

[4] E. Rosch, C. B. Mervis, W. D. Gray, D. M. Johnson, and P. Boyes-Braem, "Basic objects in natural categories," Cognitive Psychology, 1976.

[5] L. Bo, X. Ren, and D. Fox, "Kernel descriptors for visual recognition," in Neural Information Processing Systems, 2010.

[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Neural Information Processing Systems, 2012.

[7] L. D. Bourdev, S. Maji, and J. Malik, "Describing people: A poselet-based approach to attribute classification," in Int. Conf. on Computer Vision, 2011.

[8] R. Farrell, O. Oza, N. Zhang, V. I. Morariu, T. Darrell, and L. S. Davis, "Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance," in Int. Conf. on Computer Vision, 2011.

[9] Y.-Y. Lin, T.-L. Liu, and C.-S. Fuh, "Multiple kernel learning for dimensionality reduction," IEEE Tran. on Pattern Analysis and Machine Intelligence, 2011.

[10] N. Zhang, J. Donahue, R. B. Girshick, and T. Darrell, "Part-based R-CNNs for fine-grained category detection," in Euro. Conf. on Computer Vision, 2014.

[11] Z. Xu, S. Huang, Y. Zhang, and D. Tao, "Augmenting strong supervision using web data for fine-grained categorization," in Int. Conf. on Computer Vision, 2015.

[12] J. Mairal, F. R. Bach, J. Ponce, and G. Sapiro, "On-line dictionary learning for sparse coding," in Int. Conf. on Machine Learning, 2009.

[13] N. Zhang, R. Farrell, and T. Darrell, "Pose pooling kernels for sub-category recognition," in Computer Vision and Pattern Recognition, 2012.

[14] S. Branson, G. van Horn, S. J. Belongie, and P. Perona, "Bird species categorization using pose normalized deep convolutional nets," arXiv preprint arXiv:1406.2952, 2014.

[15] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, "Region-based convolutional networks for accurate object detection and segmentation," IEEE Tran. on Pattern Analysis and Machine Intelligence, 2016.

[16] S. Huang, Z. Xu, D. Tao, and Y. Zhang, "Part-stacked CNN for fine-grained visual categorization," arXiv preprint arXiv:1512.08086, 2015.

[17] J. Krause, H. Jin, J. Yang, and F.-F. Li, "Fine-grained recognition without part annotations," in Computer Vision and Pattern Recognition, 2015.

[18] T.-Y. Lin, A. RoyChowdhury, and S. Maji, "Bilinear CNN models for fine-grained visual recognition," in Int. Conf. on Computer Vision, 2015.

[19] Y.-F. Shih, Y.-M. Yeh, Y.-Y. Lin, M.-F. Weng, Y.-C. Lu, and Y.-Y. Chuang, "Deep co-occurrence feature learning for visual object recognition," in Computer Vision and Pattern Recognition, 2017.

[20] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Tran. on Signal Processing, 2006.

[21] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, "Locality-constrained linear coding for image classification," in Computer Vision and Pattern Recognition, 2010.

[22] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The Caltech-UCSD Birds-200-2011 dataset," Technical Report CNS-TR-2011-001, CalTech, 2011.

[23] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in Int. Conf. on Machine Learning, 2014.

[24] T. Berg and P. N. Belhumeur, "POOF: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation," in Computer Vision and Pattern Recognition, 2013.

[25] Y. Chai, V. S. Lempitsky, and A. Zisserman, "Symbiotic segmentation and part localization for fine-grained categorization," in Int. Conf. on Computer Vision, 2013.

[26] E. Gavves, B. Fernando, C. G. M. Snoek, A. W. M. Smeulders, and T. Tuytelaars, "Fine-grained categorization by alignments," in Int. Conf. on Computer Vision, 2013.

[27] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," in Computer Vision and Pattern Recognition Workshops, 2014.