Is Rotation a Nuisance in Shape Recognition?

Qiuhong Ke1,2 Yi Li2,3

1Beijing Forestry University, 2NICTA, 3ANU [email protected], [email protected]

Abstract example, where a shape is uniformly sampled and repre- sented by a series of high dimensional feature vectors. Ro- Rotation in closed contour recognition is a puzzling nui- tation amounts to shifting these vectors in a circular order, sance in most algorithms. In this paper we address three which makes the comparison of two shifted series difficult, fundamental issues brought by rotation in shapes: 1) is because this mis-correspondence phenomenon is not han- alignment among shapes necessary? If the answer is “no”, dled inherently by nearest neighbor methods and classifiers. 2) how to exploit information in different rotations? and Documented solutions of rotation invariance include: 1) 3) how to use rotation unaware local features for rotation define domain specific global rotation angle (a.k.a. “major aware shape recognition? axis”), 2) adopt rotation invariant representation of local de- We argue that the origin of these issues is the use scriptors (e.g., histogram), and 3) brute force search. Global of hand crafted rotation-unfriendly features and measure- rotation angle may be useful for the limited domains, how- ments. Therefore our goal is to learn a set of hierarchi- ever it is also believed that the major axis is not reliable cal features that describe all rotated versions of a shape as [16]. Many solutions frame the shape matching problem as a class, with the capability of distinguishing different such a histogram comparison problem [12]. Unfortunately, use- classes. We propose to rotate shapes as many times as pos- ful information such as structure has to be discarded during sible as training samples, and learn the hierarchical feature histogram computation in order to achieve rotation invari- representation by effectively adopting a convolutional neu- ance in this process. Trying all possible rotations appears to ral network. We further show that our method is very effi- be the last resort, but this is at the expense of efficiency [8]. cient because the network responses of all possible shifted With all the attempts to achieve rotation invariance, it is versions of the same shape can be computed effectively by interesting to note that rotation invariance may even not be re-using information in the overlapping areas. We tested the favorable in some domains. For example, rotation invari- algorithm on three real datasets: Swedish Leaves dataset, ance techniques cannot differentiate between the shapes of ETH-80 Shape, and a subset of the recently collected Leaf- the lowercase letters “d” and “p”, since this pair of shapes snap dataset. Our approach used the curvature scale space only differs in their orientations. Therefore, “rotation limit” and outperformed the state of the art. or “rotation awareness” may be more practical to improve recognition accuracy in these applications. All these difficulties attest that rotation is a nuisance in 1. Background shape recognition. Researchers frequently need to make se- Matching closed contour shapes is an important problem ries decisions of the following three puzzling questions: 1) in computer vision. Dated back to the early age, where ap- shall we rotate shapes? 2) shall we use histogram? and 3) plications are anthropology and biometric [15], it has novel shall we use rotation invariant local feature? The answers applications in domains as diverse as computer graphics and appear elusive, but first of all, why we have these questions? mobile computing [9]. The origin that rotation becomes a bitter and resentful Typical issues in shape recognition include noise, scale, enemy is that we hand craft orientation-unfriendly repre- translation, and rotation. Most of them can be handled sentations and measurements. For instance, representing easily either by carefully chosen classifiers, different data 2D circular points by 1D vectors and feeding them to Sup- representations, and simple normalizations. However, re- port Vector Machine already abandon any hope of han- gardless various techniques are used, rotation invariance dling the rotation, because kernels are hardly shift invari- in shape recognition is particularly difficult to achieve [6]. ant. Even the widely used dynamic time warping is not Take the widely used “landmark point” representation for rotation-friendly, because it neither encodes global mea-

1 • We outperformed state of the art methods in real datasets.

2. Related work Shape is one of the important features in computer vi- sion. Particularly we focus on the closed contour, where landmark points are obtained by uniformly sampling the contours. We review three components in shape recogni- tion: features, distance measurements, and invariance. Figure 1. Two time series representations of the same shape, start- ing from two different contour points. Features Local descriptors can be rotation dependent or independent. In the early age, a large number of papers surement that describes all possible rotations between two extract rotation invariant features such as curvature, entropy, shapes, nor exploits the rotation property. and convex hull [5]. These rotation invariant features are Instead of defining the representations manually, can we less distinctive in literature. learn a set of features for the same shape at all possible rota- Alternatively, Shape Context [2] and Inner Distance tions and use them for distinguish different shapes at differ- Shape Context [12] are two widely used descriptors that ent rotations? In this paper we argue that we can tackle this depend on rotation. In shape recognition, they are usually challenging concerns drastically in one shot. We consider used in dynamic programing framework. Please note that shapes along boundaries as 1D time series and use convo- such descriptors can be modified to be rotation independent lutional neural network [3] as a practical solver, because by computing the “local” orientation, but it is not a common filtering is particularly suitable for time series analysis. practice. In this paper we further suggest that adding more rotated versions of shapes improves shape recognition by nature. In Distance There are many distance measures for shapes. this case, the training will be time consuming with the brute When two point sets are used, Chamfer distance [4] is fre- force solution. While in contrast, we significantly improve quently used in the 2D space measures. the speed without losing accuracy. First, we take a slow but Experiments suggest that 1D representations (time se- intuitive approach to explain the problem, and then we de- ries) achieve superior accuracy. For time series, typical scribe how we make the training and testing more efficient. distance measurements include l2 norm and Dynamic Time Imagine we manually circular shift the time series and Warping [12]. Typically there are implicit assumption to collect all shifted versions as training samples, this naive define starting point [4], thus one may need to check all implementation is unsatisfactory for many scenarios. For possible circular shifts to find the optimal matching. example, it takes 200 times to train a network if a shape is To avoid time consuming distance comparison, align- sampled at 200 points. ment or brute force search may be used to produce the best We demonstrate that we can reduce the complexity dras- accuracy in different domains. tically. The key ingredient is to reuse the redundant in- formation between two shifted versions (Fig. 1). In the Rotation invariance With the elusive starting point, ro- case where average pooling is used, this information reusing tation invariance seems to be always hard to handle in the strategy forms a balanced binary tree. Thus, all responses literature. Major axis appears to be the dominant choice in of the same shape may be computed effectively. This also some areas, but even in Anthropology, a classical area of allows us to perform the “rotation limit” cases. shape matching, [8] argued that the major axis is not useful. We adopt curvature scale space (CSS) as our local de- The points can be manually chosen for training samples, but scriptor [13]. The curvature is a good representation of it can be labor intensive for large dataset [7]. the shape, but numerical method of curvature calculation Histogram is an efficient approach to achieve rotation in- is sensitive to noise. Therefore, we applied integral mea- variance. Kumar et al. [9] concatenated the histogram of the surements proposed by [9] as the curvature of contours. curvatures across multiple scales and achieved impressive Our contributions include results on a large dataset. The drawback is that structural • We introduced an efficient algorithm for shape recog- information may be lost during this histogram computation. nition, based on the recent progress of deep learning. While rotating all shapes is necessarily in some applica- tions, it may be useful to express rotation-limited queries. • The algorithm naturally handles rotation invariance For example, allowing a maximum rotation in a shape or and rotation limit cases. shift in landmark points. Figure 2. Shift-Delay Forward Operation illustrated.

3. Learning rotation robust features for shape signals. In the following derivation, we ignore the bias term, recognition but it can be added using the same idea. We first briefly introduce basic concepts in the Convolu- Shift-delayed Forward Operation tional Neural Network (CNN), and then present our efficient implementation of handling all possible shifts of shapes. Let us consider 1D time series representation for shapes. We first show a slow and naive approach, and then greatly 3.1. Convolutional Neural Network speed it up. The higher dimensional time series can be han- Convolutional Neural Network (CNN) is a multilayer dled using the same principle. learning framework, which may consist of an input layer, In order to generate all possible shifts of a time series, we a few convolutional layers and an output layer for logistic have to shift one curve n times in a circular order, where n is regression. The goal of CNN is to learn a hierarchy of fea- the number of possible start points. Then, each shifted ver- ture representations. Signals in each layer are convolved sion must be convolved with many filters and downsampled. with a number of filters and further downsampled by pool- Considering n = 200 or 400 being the common choices, ing operations, which aggregate values in a small region by this brute force approach can be very slow. functions including max, min, and average. In our work, Instead, we propose to postpone the shifting operation we consider average pooling for simplicity. The learning until necessary. Fig. 2 illustrates our idea using a signal of CNN is based on Stochastic Gradient Descent (SGD), processing diagram. In the downsampling process, we di- which includes two main operations: Forward and Back- vide the signal into 2 groups, which contains points in odd Propagation. Please refer to [10] for details. and even locations. Notice that shifting the signal at the We aim at learning representations for rotated shapes. A beginning of the system is equivalent to shifting its corre- naive implementation is to manually rotate each signal, one sponding downsampled versions properly at any stage, we at a time, to generate rotated versions at different starting can perform shifting before the single layer perception. 0 contour points. However, this is time consuming, and as we Assuming we have K layers CNN and a signal x . The see in this section, a lot of computation is redundant. In this first layer is the input map, and the last layer is the sin- paper we show how to reuse the information and our modi- gle layer perception. Each layer has Nk (k = 1, 2, ..., K) fication of the basic algorithm to speed up the computation. output maps. For simplicity, we define the output map of the first layer is the same as input (e.g., N1 = 1), and k 3.2. Re-using information in Forward operation the filters are denoted by fij ( i = N1, ..., NK−2, and j = N2, ..., NK−1). Denote the output of the 1st layer as Given a CNN, our goal is to generate the outputs for the 1 0 column vector x1 = x , convolution and average pooling shifted signals starting from possible contour points of the from the second layer are same shape in one shot. Considering two time series from different starting point Nk ! k X k k i and j of the same shape (Fig. 1), the Forward operation x¯j = sigm conv(xi , fij) (1) amounts to convolve the times series with CNN filters and i=1 average the outputs in each layer of the convolutional neural  T k k 1 1 network. Obviously, the information in the overlapping area uj = conv(¯xj , ) (2) needs not to be computed twice. 2 2 Therefore, the idea of our efficient algorithm is to re-use where sigm(·) denotes the sigmoid function. Since each information. This dramatically speeds up the computation, downsampling will produce two subbands (Fig. 2), we add k th th which reduces both the time for convolution and for shifting a subscript xj,p denotes the values of p subband at j Figure 3. Average-Advanced BackPropagation illustrated.

th 1 0 output map at the k layer. Initiate x1,1 = x , we down- Algorithm 1 Forward Operation in Circular Time Series sample the signal by two from the second layer, which gives 1: procedure FastForward(x0) 1 0 1 two subbands of output as 2: x1 = x ; id1 = 1; 3: for k = 2 to K − 1 do k−2 ( k+1 k 4: for p = 1 to 2 do x (l) = uj,p(2l − 1) j,2p−1 5: j N k+1 k (3) for =1 to k do x (l) = u (2l) k j,2p j,p 6: x¯j,p = 0; 7: for i=1 to Nk−1 do |x¯k | k k k−1 k−1 j,p 8: x¯j,p = x¯j,p+conv(x ,f , ‘circ’); for l = 1, ... 2 , where |·| is the length of the signal. In j,p i,j k k each subband of the last layer, we only need to circular shift 9: x¯j,p=sigm(x¯j,p); . Sigmoid the elements to obtain the correct response for the single 10: . Downsample th k 1 1 layer perception. Signal starting at r contour point can be 11: u = conv(x¯j,p,[ 2 2 ],‘circ’); k+1 obtained by shifting the original signal r times. If we know 12: xj,2p−1 = u(1 : 2 : end); th k+1 r corresponds to the q element of subband p at a layer, it 13: xj,2p = u(2 : 2 : end); k+1 k is equivalent to take the subband p and shift it q − 1 times. 14: id2p−1 = idp; k+1 k k−1 This can be done by index operation easily. 15: id2p = idp + 2 ; k−1 k−1 r mod (r+2 −1,2 ) r 16: p 2K−2 Define sk = b k−2 c and qk = for = 1 to do 2 K−2 r+2k−1−1 17: for q = 1 to length(x0)/2 do b 2k−1 c, where b·c denotes the floor operation and K−1 K−1 K−1 18: xp =[cshift(x , q); ...;cshift(x , q)]; (·) s 1,p NK−1,p mod is the modular operation. k is a binary variable, K K−2 r r r 19: r = idp + q2 ; so the subband index of r at level k is pk = (s1..sk)10 + 1, K−1 r 20: or = sigm(W xp ); where (·)10 denotes the binary to decimal conversion. pk r and qk define the location of subband at any layer for a given starting point, thus the output is 3.3. Average in BP

h i Backpropagation is a time consuming operation. Every f = cshift (xK−1 ); ...; cshift (xK−1 ) v qK−1 1,pK−1 qK−1 NK−1,pK−1 output from the Forward procedure needs to be compared (4) to the ground truth, and the errors are propagated layer by layer. Since this error depends on each starting point, we o = sigm (W f ) (5) r v cannot reuse the information in a similar way as we used the Forward operation. where cshift (·) denotes the shift of the input q −1 units. q First, we present our naive but slow implementation of For a shape, there will be a set of or, each of which cor- BackPropagation. Then, we significantly speed it up by ad- responds to the different starting points of the same sample. vancing the average operation in gradient calculation. As a result, it generates the output of all rotation versions of For the time series starting at the rth contour point, we a shape in one shot, which reduces the need to align shapes follow the standard procedure to compute error and the gra- when conducting classification. The feature is divided into dient of sigmoid function. odd and even groups in each layer, which forms a balanced binary tree. Assuming the temporal cost of each layer in the er = yr − or (6) network is ck, the total cost of generating responses of all PK samples is n k=1 ck for naive implementation. Instead, PK k−1 ours is k=1 2 ck, which is a significant reduction of ∆or = er ◦ or ◦ (1 − or) (7) the complexity. The complete algorithm is shown in Algo. where y is the true label, o is the prediction output, 1. r r and ◦ is the Hadamard product. Then, the error is back propagate to the convolutional layers as follows   vk = cshift linspace (∆k+1 ) (15) j,pk mod(pk+1,2) 2 j,pk+1 K−1 T ∆r = W ∆or . (8) K−1 K−1 K−1 This shift is because of the alignment for odd and even Let ∆ = [∆ ; ...; ∆ ], where pK−1 r 1,pK−1 NK−1,pK−1 indexes during upsampling. Advancing the average opera- is the same as defined in Forward operation, ∆K−1 has j,pK−1 tion is very useful, because we can treat the averaged results the same size as xK−1 , we back propagate the gradient as new errors. This results in speeding up the BackPropa- j,pK−1 of the sigmoid function in each layer as follows. First, we gation significantly. We show the procedure in Algo. 2. upsample the signal after proper shifting: Algorithm 2 BackPropagation for Circular Time Series 0 k k+1 1: procedure FastBackward(or, yr, r = 1, ..., |x |) vj,p = linspace2(∆j,p ) (9) k k+1 2: for p = 1 to 2K−2 do where linspace2(·) denote the linear upsampling by 2 op- 3: dp = 0; vk 0 eration. j,pk is backpropagated as follows: 4: for r = 1 to |x | do 5: ∆r = (or − yr) ◦ or ◦ (1 − or); 6: Compute pK−1 and qK−1 for r; k k k k ∆¯ = (¯x ) ◦ 1 − x¯  ◦ v 7: d d (o , q ) j,pk cshiftqk j,pk cshiftqk j,pk j,pk pK−1 = pK−1 +cshift r K−1 ; (10) 8: for p = 1 to 2K−2 do 9: dp = mean(dp); Nk−1 X 10: Backpropagate dp using Eq. 8,10,11,15. ∆k = conv(∆¯ k , f¯k ) (11) i,pk j,pk ij 11: . Calculate Gradient j=1 12: for k = 1 to K − 2 do ¯k k 13: for j = 1 to N do where fij is a flipped version of fij. Since SDG operates in k a batch mode, in every BackPropagation procedure we have 14: for i = 1 to Nk − 1 do M δk ∆¯ k 15: ∆f = 0; time series. Denote i,pk as the flipped version of i,pk , k−2 k 16: for p = 1 to 2 do calculating gradient of fij is simply the “valid” region of k k  17: ∆f = ∆f +conv δj,pk , cshiftqk xi,pk k k ∆fij = mean∆fij ; k 1 X k k  ∆f = conv δ , cshiftq x (12) ij M j,pk k i,pk M 3.4. Rotation limit shape recognition Average-advanced BackPropagation As discussed in the introduction, some applications may require rotation limited shape recognition. In our formu- To speed up the back propagation, we first consider the lation, this is equivalent to selecting subsets of outputs in mean gradient in a batch. Forward and only computes the gradients in BackPropaga- tion. To achieve this, one only needs to compute the average k 1 X k k  of the selected samples in each subband of the final layer, ∆f = conv cshift−q (δ ), x (13) i M k j,pk i,pk and the remaining procedure is the same. M ! This could be used to further reduce the computational 1 X k k cost. For instance, in some applications one wants to com- = conv cshift−q (δ ), x (14) M k j,pk i,pk pute rotated versions of every two contour points. This can M be done by selecting proper indexes as well. Another handy This means the average of the filter gradient is equivalent solution is to simply take the samples in the first subband. In to the filter gradient of the average input. Thus, we can our experiments, this practice speeds up the learning with- average the results in each subband by properly shifting the out sacrificing much accuracy. output error. Note that Eq. 9-11 linear, it is possible to shift k 3.5. Integral measures for curvature scale space the gradient ∆j,p in a reverse order, instead of shifting the k xj,p every time. Therefore, all the shifted errors from the Curvature is a fundamental property of natural shapes. k same subband share the same set of xj,p . As a result, we Curvature scale space (CSS) has been proposed in the early advance the average the above mean operation at the early age of computer vision. The theory of CSS is very attractive as possible in the BP (Fig. 3). In this process, the only thing for continuous signals; however, when dealing with discrete needs to be changed is the addition of the following shift in pixelated images and imperfect segmentation results, the Eq. 9: computation of curvature across different scales becomes Algo IDSC T-Dist K-SVM SM Ours % 91.20 93.8 91.47 94.40 94.93/95.33 Table 1. Performance comparison on Swedish leaf dataset. Please see [1] and [7] for the other four algorithms.

Figure 4. Curvature images. The first two are from the same species, but due to rotation they starts from different contour points. very difficult. For example, standard differential techniques may be very sensitive to noise, which makes the CSS less effective in shape recognition. Instead of computing curvature by the definition, a num- ber of literature suggest to use integral measures to compute functions of the curvature [13]. The insight is that the cur- Figure 6. Examples for ETH-80 [11]. vature fundamentally reflects whether a curve is convex or concave, thus we can use other measurements to serve as a aligned. The experiments were run on a standard PC with representation of curvature. 3GHz Quad Core CPU and 32GB memory. The training Following [9], we use the area of intersection of a disk time varies and the testing time is on average 0.0227 second centered at a landmark point and the inside of the silhouette per shape using Matlab. as the measurement. Using filtering, this integral measure is fast, and robust to noise. With different radius of circles, 4.2. Swedish Leaves dataset we can derive the curvature of different scales and gener- ate curvature images. Fig. 4 shows three examples. Each The Swedish Leaves dataset was developed at Linkop- row is the curvature evolving along the boundary for each ing University and the Swedish Museum of Natural History scale, and each column represent the change of curvature [14], which contains segmented leaves from 15 different over scale for every contour points. Swedish tree species (75 leaves per species). We used 25 leaves per class for training and 75 for testing, which is a 4. Experiments standard configuration of this dataset. Table 1 shows our results. Particularly, we showed two 4.1. Settings configurations of our algorithms: 1) use all the contour In this section we show our experiments on three real points as starting points for training, and 2) only use the datasets, namely, the Swedish Leaves [14], the ETH-80 starting points that correspond to the first subband of the Shape [11], and a subset of the recent Leafsnap dataset [9]. last convolutional layer (i.e., 25% of the contour points). The shapes in each dataset were normalized with respect Compared to the state of the art Shape Manifold [7], to the sizes of their convex hulls. Each shape was then where starting points of training images were manually la- represented by 200 uniformly sampled landmarks along beled and testing images were aligned with the training im- boundary (clockwise). At each landmark, we computed the ages, our performance using 25% contour points is already curvature scale space image with 25 scales. better. When more shifts (rotations) are used, the perfor- The curvature image represents real datasets very well, mance increases too. This suggests our method requires less but it is not a practical measurement for the artificial dataset labor in pre-processing shapes and supports our claim that such as the MPEG-7 dataset. This artificial dataset contains rotations facilitate shape recognition. man-made and toy variations, such as inner holes and elon- 4.3. ETH-80 dataset gated branches. Thus integral measurements do not work properly on these artificial variations. We next demonstrate the benefits of our approach on In all the experiments, we used four layers CNN (includ- shape categorization. The ETH-80 dataset [11] consists of ing the input layer and the single layer perception for re- 8 classes with 10 objects in each class. Each object has 41 gression output). The first convolution layer has 11 filters images from different viewing angles. The contour images with the size 13 × 11, and the second convolution layer has are used for many shape recognition algorithms (Fig. 6). 13 filters with size 13×9, respectively. Downsampling only In existing matching approaches, this dataset is used applies to the curvature dimension. in the Leave-One-Out-Cross-Validation (LOOCV) setup. We perform our method on three datasets, which are not Jayasumana et al. [7] proposed to use Leave-One-Object- Figure 5. Examples of the Swedish leaf dataset ([14]). Each column shows two original images and their silhouettes for each class. The landmark points are then sampled from their boundaries.

Nearest Nearest Complex Complex Trans. Shape Algo RIK Ours Mean Neighbor Bingham Normal K-SVM Manifold % 79.02 82.10 86.95 87.50 92.29 87.29 93.75 95.80 Table 2. Performance comparison on ETH-80 dataset. Please see [1] and [7] for the descriptions of other algorithms.

Figure 7. Examples of Leafsnap dataset. Each shape is from a different class, and some of them are very similar to each other.

Out-Cross-Validation for classification tasks: all images 4.4. Leafsnap dataset (subset) from one object are used as test images, and the images of other objects are used to train classifiers. If a test image is Finally, we verified our ideas on a subset of the largest classified to the same class as the training images (i.e., one reported shape dataset. Kumar et al. [9] collected the Leaf- of the remaining 9 objects in the same class for training), it snap dataset. In total, this dataset contains 184 species in is a correct classification. the northeastern United States, which consists of 23,915 In this setup, we reserved a validation set consisting of “clean” lab images (by high quality camera) and 5,192 field 10 images per object that are in the same class as the test images taken by mobile devices. The field images may con- object, and stop the training when the CNN converges. This tain varying amounts of blur and were taken with different process is repeated 80 times, and the average accuracy is viewpoints and illumination. The silhouettes were extracted reported in Table 3. and provided in the dataset. We compared the results of our approach to those re- Recently, the authors prepared a subset of this large ported in [7]. Our result outperformed the state of the art dataset for academic purpose. This leaf subset includes over by 2%. This may be explained by the misaligned start- 5000 leaf images of the 143 trees (Fig. 7). Please note that ing point issue, because there is no simple domain-specific while there may be a well defined starting point for most approaches to label the starting points easily in ETH-80 leaves (e.g., apex), it is labor intensive to label all of them dataset. in this subset, and this labeling task may also be less practi- Algorithm 75% 50% 25% 15% References Ours 60.03 58.50 55.61 48.12 [1] X. Bai, X. Yang, L. J. Latecki, W. Liu, and Z. Tu. Learn- HoCS 57.80 57.30 54.01 48.55 ing context-sensitive shape similarity by graph transduction. IDSC 54.14 56.41 46.09 42.66 IEEE Transactions on Pattern Analysis and Machine Intelli- Table 3. 1-NN results for Leafsnap dataset (subset). gence, 32(5):861–874, 2010. [2] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell., 24(4):509–522, Apr. 2002. [3] Y. Bengio, A. C. Courville, and P. Vincent. Unsupervised feature learning and deep learning: A review and new per- spectives. CoRR, abs/1206.5538, 2012. [4] G. Borgefors. Hierarchical chamfer matching: A parametric edge matching algorithm. IEEE Trans. Pattern Anal. Mach. Intell., 10(6):849–865, Nov. 1988. [5] A. Cardone, R. K. Gupta, and M. Karnik. A survey of shape similarity assessment algorithms for product design and manufacturing applications. Journal of Computing and Information Science in Engineering, 3:109–118, 2003. [6] Y. Gdalyahu and D. Weinshall. Flexible syntactic matching of curves and its application to automatic hierarchical classi- Figure 8. Accuracy comparison on Leafsnap dataset (subset). fication of silhouettes. IEEE Transactions on Pattern Analy- sis and Machine Intelligence, 21:1312–1328. [7] S. Jayasumana, M. Salzmann, H. Li, and M. Harandi. A cal in the original dataset. framework for shape analysis via hilbert space embedding. Kumar et al. [9] proposed to use the histogram of cur- In ICCV, 2013. vature (HoCS) in a LOOCV matching framework. This is [8] E. Keogh, L. Wei, X. Xi, M. Vlachos, S.-H. Lee, and P. Pro- not applicable to our classification method. Therefore, we topapas. Supporting exact indexing of arbitrarily rotated turned to a repeated random sub-sampling validation pro- shapes and periodic time series under euclidean and warp- cedure. We divided the dataset and used 15%, 25%, 50%, ing distance measures. The VLDB Journal, 18(3):611–630, 75% for training, respectively. We also reserved 5% for val- June 2009. [9] N. Kumar, P. N. Belhumeur, A. Biswas, D. W. Jacobs, W. J. idation, and the rest for testing in each case. We repeated Kress, I. Lopez, and J. V. B. Soares. Leafsnap: A computer this procedure and reported the average values. vision system for automatic plant species identification. In We compared our method to the IDSC and the HoCS on The 12th European Conference on Computer Vision (ECCV), the Top 1, 5, 10, 15, 20 accuracies, respectively. Table. 3 October 2012. shows the Top 1 accuracy. Our method outperformed the [10] Y. LeCun and Y. Bengio. The handbook of brain theory and state of the art methods, and is inferior only when training neural networks. chapter Convolutional networks for images, samples are small. Note that 15% of the dataset contains speech, and time series, pages 255–258. MIT Press, Cam- only 700 images, the training may be less effective. bridge, MA, USA, 1998. Fig. 8 further shows the top 5, 10, 15, 20 results, respec- [11] B. Leibe and B. Schiele. Analyzing appearance and con- tively. Our method consistently achieved higher accuracy tour based methods for object categorization. In In IEEE Conference on Computer Vision and Pattern Recognition for most of the test cases, and tied with the HoCS when (CVPR’03, page 409–415, 2003. training samples are small. [12] H. Ling and D. W. Jacobs. Shape classification using the inner-distance. IEEE Transactions on Pattern Analysis and 5. Conclusion Machine Intelligence, 29(2):286–299, 2007. [13] S. Manay, D. Cremers, B. woo Hong, A. J. Yezzi, and In this paper we propose a method to handle rotation in S. Soatto. Integral invariants for shape matching. IEEE shape recognition. We effectively adopted a deep learning PAMI, 28:1602–1618, 2006. framework on the curvature scale space, which computes [14] O. J. O. Soderkvist.¨ Computer vision classification of leaves the forward and backpropation operations efficiently. We from swedish trees. Master’s thesis, Linkoping¨ University, tested the algorithm on three real datasets, and the perfor- SE-581 83 Linkoping,¨ Sweden, September 2001. LiTH-ISY- mance on these challenging dataset concludes that model- EX-3132. ing rotation-friendly features facilitates shape recognition. [15] D. W. Thompson. On Growth and Form. Cambridge Univer- Acknowledgement: we would like to thank Prof. David sity Press, 1983. Jacobs and Arijit Biswas from the University of Maryland at [16] D. Zhang and G. Lu. Review of shape representation and description techniques. Pattern Recognition, 37, 2004. College Park for generously providing the Leafsnap dataset.