Local Feature Filtering for 3D Object Recognition

Md. Kamrul Hasan*, Liang Chen†, Christopher J. Pal* and Charles Brown†
* Ecole Polytechnique de Montreal, QC, Canada. E-mail: [email protected], [email protected], Tel: +1 (514) 340-5121, ext. 7110
† University of Northern British Columbia, BC, Canada. E-mail: [email protected], [email protected], Tel: +1 (250) 960-5838

Abstract—Three filtering/sampling approaches for local features, namely random sampling, concentrated sampling, and inversely concentrated sampling, are proposed and investigated. While local features characterize an image with distinctive key-points, the filtering/sampling analogies have a two-fold goal: (i) compressing an object definition, and (ii) reducing the classifier learning bias that is induced by the variance of key-point numbers among object classes. The proposed methodologies have been tested for 3D object recognition on the Columbia Object Image Library (COIL-20) dataset, where Support Vector Machines (SVMs) are applied as the classification tool and a winner-takes-all principle is used for inference. The proposed approaches achieve promising performance, with a 100% recognition rate for objects with an out-of-plane rotational range of less than 30 degrees. The systems also showed little performance variance in the asymmetrical recognition test.

I. INTRODUCTION

Object recognition is the task of identifying object(s) in an image or in a video sequence. The major object recognition approaches can be categorized as: (i) appearance based [1,2,3], (ii) geometry based [4], and (iii) local feature oriented [5-7]. The difficulties that make object recognition challenging are: scaling, rotational, and translation invariance; lighting and illumination change; occlusion; cluttered backgrounds; and different types of noise. Having been found superior in tackling these challenges, object recognition research has gradually become focused on local feature oriented approaches [5,8-10], and researchers have succeeded in extracting various scaling-, rotation-, and translation-invariant local features from images. Following scale-space theory [11], Lowe et al. developed the Scale Invariant Feature Transform (SIFT) [5,6], which uses difference-of-Gaussian functions at different scales to extract distinctive local features. In a series of papers [5,6,12], the effectiveness of SIFT features for the object recognition task has been proven. Matas et al. [8] developed descriptors computed from pixels inside convex hulls, called Invariant Pixel Set Signatures (IPSS). Some of the successful local features in object recognition are [10]: SIFT, PCA-SIFT, Gradient Location and Orientation Histogram (GLOH), shape context, spin images, steerable filters, differential invariants, and moment invariants.

In feature based approaches, each training 2D view image of a 3D object may generate a number of key-points. Direct feature matching has been found to be expensive, as it does not scale with an increasing number of training images and target objects. In [12], Turcot and Lowe proposed to reduce the number of features by selecting a useful feature set. Motivated by the Bag of Words (BOW) model from textual information retrieval, researchers have adapted it as the Bag of Visual Words (BOVW) model [13,14], where visual features are quantized and code-words, named visual words, are generated. These visual words have been successfully tested for feature matching in 2D object images.

In this work, we propose three new local image feature filtering methodologies that are based on some very simple assumptions. Following these filtering approaches, three object recognition models have been formulated, which are described in section two. Section three presents the two major experiments that have been conducted: (i) Object Recognition, and (ii) Asymmetrical Rotational Test. Finally, section four concludes with a discussion.

II. I-SIFT MODELS (M_F)

Our models are based on the following three assumptions:
(i) The representative views of an object O_i share some common features, which concentrate around the mean feature vector.
(ii) Key-points that are distant from the mean feature vector represent highly distinctive feature points for object O_i. These define the inversely concentrated samples.
(iii) Following the Monte Carlo assumptions, key-points collected by random sampling from a set of view instances might represent an object O_i.

For an image I(x, y), we define a SIFT-Image as the set of feature descriptor vectors extracted by the SIFT algorithm [6]. Formally, a SIFT-Image is defined as:

    I^S(ẋ, ẏ, X) = SIFT(I(x, y))

where X is the set of feature vectors detected at positions (ẋ, ẏ) in the image I(x, y); each vector X = (x_1, x_2, ..., x_D) has dimension D = 128 [6]. For simplicity, in the subsequent discussion we will denote I^S(ẋ, ẏ, X) as I^S. For a SIFT-Image I^S, the mean vector µ is calculated from all feature points detected in I^S.

A. µk-SIFT-Image (I^Sµk)

A µk-SIFT-Image is defined as

    I^Sµk(ẍ, ÿ, X_µk) = kNN(µ(I^S(X)), k)

where kNN(µ(I^S(X)), k) denotes the k key-points nearest to the mean feature descriptor vector µ. Therefore, for a SIFT-Image I^S, the µk-SIFT-Image I^Sµk is a filtered image defined by the k key-points nearest to µ.

B. µ̄k-SIFT-Image (I^Sµ̄k)

A filtered SIFT image I^Sµ̄k is defined by the k key-points farthest from the mean feature descriptor vector µ. Formally,

    I^Sµ̄k(ẍ, ÿ, X_µ̄k) = kNN(µ(I^S(X)), k)

where kNN(µ(I^S(X)), k) here denotes the k key-points farthest from the mean feature descriptor vector µ.

C. rand-SIFT-Image (I^Srk)

A rand-SIFT-Image I^Srk is an image with k randomly selected key-points from a SIFT image I^S. Formally,

    I^Srk(ẍ, ÿ, X_rk) = random(I^S(X), k)
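The three filters each reduce a SIFT-Image to k descriptors. Below is a minimal sketch of the three schemes, assuming OpenCV's SIFT implementation and NumPy; the function names (sift_image, mu_k_filter, and so on) are our own illustrative choices, not notation from the paper.

```python
import cv2
import numpy as np

sift = cv2.SIFT_create()

def sift_image(gray):
    """Return the descriptor matrix X of a SIFT-Image I^S (one row per key-point, D = 128)."""
    _, X = sift.detectAndCompute(gray, None)
    return X

def mu_k_filter(X, k):
    """µk filter: the k descriptors nearest to the mean feature vector µ."""
    d = np.linalg.norm(X - X.mean(axis=0), axis=1)
    return X[np.argsort(d)[:k]]

def mu_bar_k_filter(X, k):
    """µ̄k filter: the k descriptors farthest from µ (inversely concentrated samples)."""
    d = np.linalg.norm(X - X.mean(axis=0), axis=1)
    return X[np.argsort(d)[-k:]]

def rand_k_filter(X, k, seed=0):
    """rand filter: k descriptors sampled uniformly at random (Monte Carlo assumption)."""
    rng = np.random.default_rng(seed)
    return X[rng.choice(len(X), size=min(k, len(X)), replace=False)]
```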



In this work, the object recognition problem has been formulated as a supervised classification problem. For each of the filtering setups FS ∈ {µk-SIFT, µ̄k-SIFT, rand-SIFT}, a corresponding model M_F ∈ {M_µk, M_µ̄k, M_rand} is learned. The task of a model M_F is to classify SIFT key-points and to recognize objects in a test image, assuming that the SIFT key-points were generated from the targeted object. We call these models Inverse-SIFT (I-SIFT).

A set of training images {I_1, I_2, ..., I_n}, which are the representative view samples of an object O_i, is grouped into a category C_i. For each class C_i, a corresponding SIFT-Image class C_i^S and a filtered SIFT-Image class F_C ∈ {C_i^Sµk, C_i^Sµ̄k, C_i^Srk} is generated. Each of the feature vectors in F_C is labeled with the class label C_i, and a database of training feature vectors is created. The model M_F has been formulated as a Support Vector Machine with a Radial Basis Function (RBF) kernel, and the model parameters were learned with a ten-fold cross-validation technique. For the three filtering (FS) setups, the corresponding object classification models are denoted M_µk, M_µ̄k, and M_rand.

For testing, a SIFT image I_X^S is generated from a test image I_X. For models M_µk and M_µ̄k, the SIFT image I_X^S is filtered and the corresponding filtered images I^Sµk and I^Sµ̄k are generated. On the other hand, for model M_rand, the whole set of SIFT key-points in I_X^S is selected as the test key-point set. Each key-point is tested with the corresponding I-SIFT model M_F, and classification votes are accumulated for each object class O_i. A winning class, and thus the winning object, is decided by the national voting principle [1].
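The paper specifies an RBF-kernel SVM with ten-fold cross-validation [18,20] but no particular toolkit; the sketch below uses scikit-learn as an assumed stand-in, with the inference step implemented as a plain majority (winner-takes-all) vote over per-key-point predictions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_i_sift_model(F_C, labels):
    """F_C: stacked filtered descriptors (N x 128); labels: object class id per row."""
    search = GridSearchCV(
        SVC(kernel="rbf"),
        {"C": [1, 10, 100], "gamma": ["scale", 1e-2, 1e-3]},  # illustrative grid
        cv=10,  # ten-fold cross validation, as in the paper
    )
    search.fit(F_C, labels)
    return search.best_estimator_

def recognize(model, X_test):
    """Classify each test key-point; the class accumulating the most votes wins."""
    votes = model.predict(X_test)
    classes, counts = np.unique(votes, return_counts=True)
    return classes[np.argmax(counts)]
```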

III. EXPERIMENTS AND RESULTS

We have carried out two experiments on the Columbia Object Image Library (COIL-20) [15] dataset: (i) Object Recognition, and (ii) Asymmetrical Rotational Recognition Test. COIL-20 is a dataset of 20 different 3D objects with 72 views per object, taken at a pose interval of 5° by a moving camera. Fig. 1(a) shows the zero-degree samples, whereas the first thirty-six instances of object one are shown in Fig. 1(b).

Fig. 1. COIL-20 dataset: (a) zero-degree instances from the twenty categories, and (b) the first thirty-six instances from object one.

A. Object Recognition

I-SIFT recognition results and a model comparison graph are shown in Fig. 2. The recognition results provided for the model M_rand are an average over three test runs. A number of conclusions are drawn below on the relationship between the parameter k, the number of training view(s) per object, and their effect on recognition results:

• For k = 10, models M_µk and M_rand achieved the 100% recognition rate with 24 and 36 training views per object, respectively.
• For k = 20, models M_µk and M_rand achieved the 100% recognition rate with 24 and 18 training views per object, respectively.
• For k = 30, all three models were able to recognize the objects fully; the numbers of training instances required for models M_µ̄k, M_µk, and M_rand were 18, 24, and 18, respectively.

Fig. 2. Recognition Rate (Rank 1) of the three models with three k (10, 20, 30) value setups.

From Fig. 2, it can be inferred that models M_µk and M_rand produce similar recognition results for an increasing k. Model M_µ̄k performed the best when fewer than three training instances were available, while its recognition rate was lower than that of the other two models for k = 10. It is evident that, in general, an increasing k increases the recognition rate. However, a large randomly chosen k may not always produce better recognition. The reason is that the number of key-points varies from instance to instance and from class to class. For example, object class one had an average key-point count of 22, while class nine had an average of 191 (the maximum among all 20 classes).

The average key-point count over the twenty classes was 54, whereas, for example, a randomly selected k = 100 may bias a model M_F towards object classes whose average numbers of key-points are around or greater than one hundred (objects five and nine are such examples). Fig. 3 shows the object instances with the minimum and maximum numbers of key-points detected. One byproduct of the proposed approaches is thus the reduction of the classifier learning bias that might be induced by the varying number of key-points among object classes.

Fig. 3. Objects, and corresponding feature vectors, with the (a) minimum and (b) maximum number of key-points.
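Before fixing k, the spread of key-point counts quoted above (an average of 22 for class one, 191 for class nine, 54 over all classes) can be measured directly. A sketch follows, assuming OpenCV's SIFT and the usual COIL-20 file layout (obj<i>__<angle>.png); both are our assumptions, not details from the paper.

```python
import glob
import cv2
import numpy as np

sift = cv2.SIFT_create()

def mean_keypoint_count(image_paths):
    """Average number of SIFT key-points detected over a list of view images."""
    counts = [len(sift.detect(cv2.imread(p, cv2.IMREAD_GRAYSCALE), None))
              for p in image_paths]
    return np.mean(counts)

# Per-class averages guide the choice of k relative to the overall mean.
for obj_id in range(1, 21):
    views = glob.glob(f"coil-20/obj{obj_id}__*.png")
    print(f"object {obj_id}: mean key-points = {mean_keypoint_count(views):.1f}")
```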

A number of works [8,16,17] have claimed a 100% recognition rate on the COIL-20 dataset as an efficiency measure of their object recognition systems. In that sense, all three developed I-SIFT models were able to achieve the targeted 100% recognition rate. To compare models, an additional measure might be the rotational variation a model can learn. Inspired by the famous face recognition work [2], Nagabhushan et al. [16] proposed a methodology applying Principal Component Analysis (PCA) to object recognition. The authors in [16] carried out a set of experiments with one-dimensional PCA (1D-PCA) and its extension, 2D-PCA, on the COIL-20 dataset. The recognition rates (rank 1) of the I-SIFT models have been compared with those of 1D-PCA and 2D-PCA [16]. The different experimental setups, results, and a comparison graph are shown in Fig. 4. The 1440-view training and testing setup from [16] was excluded from this comparison study, as that setup does not reflect the true recognition rate of any system. For all three I-SIFT models, the k = 30 setup was used in this evaluation.

Fig. 4. Recognition Rate (Rank 1) comparison between the PCA and I-SIFT models.

It is evident from Fig. 4 that all three I-SIFT models clearly outperformed both the 1D-PCA and 2D-PCA models. The 100% recognition rates for the 1D-PCA and 2D-PCA models were achieved only when the systems were trained with 50% of the dataset (720 training views); this means the PCA systems can deal with at most 10-degree object rotations, whereas the I-SIFT models were able to deal with approximately 30-degree object rotations. This shows that the developed models can deal with wider object rotational variations than the 1D-PCA and 2D-PCA models [16].

B. Asymmetrical Rotational Recognition Test

To test the robustness of an object recognition system, another metric might be the variance a system shows for asymmetrical object recognition (as most real objects are asymmetrical). Matas et al. proposed a method called the "Invariant Pixel Set Signature (IPSS)" [8] for object recognition, where the authors used certain visual feature points of interest, in addition to the intensity profile, to recognize objects in certain rotational ranges. The IPSS model was trained with one object view (the zero-degree instance) and tested with instances within the -30° to +30° range.

In Table I, the percentage recognition (rank 1) of the I-SIFT models for k = 30 is compared with the IPSS model, with an experiment conducted for object rotations within the -50° to +50° range. Most of the objects in the COIL-20 dataset are not symmetrical; the -50° to +50° range view instances for objects two and nine are shown in Fig. 5.

TABLE I
PERCENTAGE RECOGNITION (RANK 1), IPSS VS I-SIFT

Rotational Range (degrees) | IPSS   | M_µ̄k  | M_µk  | M_rand
+50                        |   -    | 82.00 | 81.50 | 79.00
+40                        |   -    | 86.25 | 88.75 | 83.75
+30                        | 85.00  | 89.17 | 91.67 | 85.00
+20                        | 90.00  | 90.00 | 93.75 | 86.25
+10                        | 95.00  | 90.00 | 92.50 | 87.50
-10                        | 95.00  | 90.00 | 95.00 | 87.50
-20                        | 80.00  | 88.75 | 93.75 | 86.25
-30                        | 60.00  | 87.50 | 93.33 | 84.17
-40                        |   -    | 85.00 | 90.00 | 80.63
-50                        |   -    | 79.50 | 85.00 | 75.50
Variance                   | 174.17 | 13.39 | 18.81 | 15.82
S. Deviation               | 13.20  |  3.66 |  4.34 |  3.98

Fig. 5. Objects two and nine from COIL-20 [15], in the range -50 to +50 degrees.

The comparison results in Table I show that IPSS performed better recognition within the (-10°, +10°) range; however, the I-SIFT recognition rate for the model M_µk is the best in overall performance. Furthermore, the variance and standard deviation of the recognition rates for all three I-SIFT models are far better than those of the IPSS model. In conclusion, I-SIFT object recognition performance is less sensitive to asymmetrical object rotational variance.
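The dispersion rows of Table I follow the sample (n - 1) convention, which a short NumPy check reproduces; the column values below are copied from the table (IPSS over its six reported ranges, M_µ̄k over all ten).

```python
import numpy as np

ipss = np.array([85.00, 90.00, 95.00, 95.00, 80.00, 60.00])  # +30 .. -30 rows
m_mu_bar_k = np.array([82.00, 86.25, 89.17, 90.00, 90.00,
                       90.00, 88.75, 87.50, 85.00, 79.50])   # +50 .. -50 rows

for name, rates in [("IPSS", ipss), ("M_mubar_k", m_mu_bar_k)]:
    print(f"{name}: variance = {rates.var(ddof=1):.2f}, "
          f"std = {rates.std(ddof=1):.2f}")
# IPSS: variance = 174.17, std = 13.20
# M_mubar_k: variance = 13.39, std = 3.66
```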
IV. DISCUSSIONS AND CONCLUSIONS

In this paper, we have proposed three object recognition models based on three local feature filtering methodologies. Major object recognition theories were investigated, and local feature-based approaches with SIFT features were found to be more distinctive than other object recognition methodologies. Based on the methodology of SVMs [18,20], k-NN [19], and SIFT [6], three object recognition models have been proposed: (i) M_µk, (ii) M_µ̄k, and (iii) M_rand. We have named the models Inverse SIFT (I-SIFT), as their task is to classify the feature points in a test image in order to recognize the objects from which the SIFT key-points were generated. The object recognition experiment validates that the developed I-SIFT models perform better object recognition and can deal with a wider range of object rotations. Additionally, the asymmetrical rotational recognition test shows that the recognition performances of the developed models are less sensitive to asymmetrical object rotations.

ACKNOWLEDGMENT

We would like to thank the reviewers for their positive and helpful comments. Finally, we would also like to thank APSIPA ASC for all of their efforts.

REFERENCES

[1] L. Chen and N. Tokuda, "A General Stability Analysis on Regional and National Voting Schemes Against Noise - Why is an electoral college more stable than a direct popular election?", Artificial Intelligence, Vol. 163, No. 1, pp. 47-66, 2005.
[2] M. Turk and A. Pentland, "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, Vol. 3, No. 1, pp. 71-86, 1991.
[3] Md. Kamrul Hasan, L. Chen and C. G. Brown, "Image Data Mining and Classification with DTree Ensembles for Linguistic Tagging", In Proceedings of the IEEE Workshop on Computational Intelligence for Visual Intelligence, pp. 29-36, 2009.
[4] A. R. Pope, "Model-based Object Recognition: A Survey of Recent Research", Technical Report, University of British Columbia, 1994.
[5] D. G. Lowe, "Local Feature View Clustering for 3D Object Recognition", In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 682-688, December 2001.
[6] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints", International Journal of Computer Vision, Vol. 60, No. 2, pp. 91-110, 2004.
[7] Y. Ke and R. Sukthankar, "PCA-SIFT: A More Distinctive Representation for Local Image Descriptors", In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 511-517, 2004.
[8] J. Matas, J. Burianek and J. Kittler, "Object Recognition using the Invariant Pixel Set Signature", In Proceedings of the British Machine Vision Conference, pp. 606-615, 2000.
[9] J. Matas and S. Obdrzalek, "Object Recognition Methods Based on Transformation Covariant Features", In Proceedings of the XII European Signal Processing Conference, pp. 1333-1336, September 2004.
[10] K. Mikolajczyk and C. Schmid, "A Performance Evaluation of Local Descriptors", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 10, pp. 1615-1630, 2005.
[11] T. Lindeberg, "Scale-Space Theory in Computer Vision", Kluwer Academic Publishers, 1994.
[12] P. Turcot and D. G. Lowe, "Better Matching with Fewer Features: The Selection of Useful Features in Large Database Recognition Problems", ICCV Workshop on Emergent Issues in Large Amounts of Visual Data (WS-LAVD), Kyoto, Japan, October 2009.
[13] G. Csurka, C. Bray, C. Dance and L. X. Fan, "Visual Categorization with Bags of Keypoints", In Proceedings of the International Workshop on Statistical Learning in Computer Vision, pp. 1-22, 2004.
[14] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei and M. Jordan, "Matching Words and Pictures", Journal of Machine Learning Research, Vol. 3, pp. 1107-1135, 2003.
[15] "Columbia Object Image Library (COIL-20)", http://www1.cs.columbia.edu/CAVE/software/softlib/coil-20.php
[16] P. Nagabhushan, D. S. Guru and B. H. Shekar, "Visual Learning and Recognition of 3D Objects Using Two-Dimensional Principal Component Analysis: A Robust and an Efficient Approach", Pattern Recognition, Vol. 39, No. 4, pp. 721-725, 2006.
[17] V. N. Pawar and S. N. Talbar, "An Investigation of Significant Object Recognition Techniques", International Journal of Computer Science and Network Security, Vol. 9, No. 5, pp. 17-29, May 2009.
[18] C. W. Hsu, C. C. Chang and C. J. Lin, "A Practical Guide to Support Vector Classification", Technical Report, Department of Computer Science, National Taiwan University, July 2003.
[19] G. Shakhnarovich, T. Darrell and P. Indyk, "Nearest-Neighbor Methods in Learning and Vision", The MIT Press, 2005.
[20] C. Cortes and V. Vapnik, "Support-Vector Networks", Machine Learning, Vol. 20, No. 3, pp. 273-297, September 1995.
