2016 23rd International Conference on Pattern Recognition (ICPR), Cancún Center, Cancún, México, December 4-8, 2016

An Efficient Hough Transform for Multi-Instance Object Recognition and Pose Estimation

Erdem Yörük, Kaan Taha Öner and Ceyhun Burak Akgül
Vispera Information Technologies, Istanbul, Turkey

Abstract—Generalized Hough transform, when applied to object detection, recognition and pose estimation, can be susceptible to spurious voting depending on the Hough space to be used and the hypotheses to be voted. This often necessitates additional computational steps like non-maxima suppression and geometric consistency checks, which can be costly and prevent voting based methods from being precise and scalable for large numbers of target classes and crowded scenes. In this paper, we propose an efficient and refined Hough transform for simultaneous detection, recognition and exact pose estimation, which can efficiently accommodate up to multiple tens of co-visible query instances and multiple thousands of visually similar classes. Specifically, we match SURF features from a given query image to a database of model features with known poses, and in contrast to existing techniques, for each matched pair, we analytically compute a concise set of 6 degrees-of-freedom pose hypotheses, for which the geometric relationship of the correspondence remains invariant. We also introduce an indirect but equivalent representation for those correspondence-specific poses, termed feature aligning affine transformations, which results in a Hough voting scheme as cheap and refined as line drawing in raster grids. Owing to minimized voting redundancy, we can obtain a very sparse and stable Hough image, which can be readily used to read off instances and poses without dedicated steps of non-maxima suppression and geometric verification. Evaluated on the extensive Grocery Products dataset, our method significantly outperforms the state-of-the-art with near real-time overall cost.

Keywords. Feature Matching, Hough Transform, Object Instance Recognition, Pose Estimation, Product Recognition

Fig. 1. Overview of our method. Top-Left: Template images of target objects. Top-Right: Query image with its interest points colored according to the class of their matched model feature. Bottom-Left: Pose-class histogram visualized on the query image, with colors and affine frames indicating the class and pose hypotheses, where frame thickness is set proportional to the corresponding vote count (for clarity, only hypotheses with more than 5 votes are shown). Bottom-Right: Final recognition and pose estimation results, obtained after processing the Hough entries of non-overlapping frames according to decreasing vote counts.

I. INTRODUCTION

Object instance detection, recognition and pose estimation are core problems in many applications of computer vision. Solving these related tasks jointly and efficiently remains a great challenge, especially when a large number of visually similar target classes are in question, and many instances from them are simultaneously present in the query image.

Image matching techniques that can associate invariant and distinctive features to object exemplars show a great potential for such a combined problem, especially given their well documented effectiveness in image alignment, 3D reconstruction, content based retrieval and more. However, despite significant advances in saliency detection and stable image descriptors [1]–[3], approaches that use feature correspondences for object detection, recognition and pose estimation do not scale well with large numbers of target object classes and their co-visible instances.

In matching based multi-instance detection, the computational bottleneck is primarily due to the search for disjoint subsets of query features that are consistent in terms of geometry and class, and therefore likely to coincide with distinct objects. Usually, this is handled with some form of generalized Hough transform on object locations and poses using relative geometric attributes of feature correspondences [1], [4], [5]. Nevertheless, depending on the resolution of the Hough space and the precision of the generated hypotheses, the Hough transform approach can suffer from large memory requirements and/or spurious voting. Typically, this necessitates additional refinement heuristics such as non-maxima suppression and checks for geometric consistency.

Such issues can be especially exacerbated in fine-grained product recognition and retail scene understanding, where, despite the convenience of rich and rigid object textures, there are typically thousands of fine-grained classes with visually similar patterns (e.g. common shapes of bottles and cans, common logos etc.) and hundreds of object instances present in a given query image. Given those challenges, product recognition has not only become an interesting research area recently [6]–[13], but it has also penetrated into our everyday life with smart applications like shopper assistance for the visually impaired,
online product retrieval in e-commerce (e.g., Google Goggles, Amazon Flow), automated stock monitoring and planogram (plan for product placement on retail shelves) maintenance.

In this paper, we propose an efficient, matching- and voting-based algorithm for joint detection, recognition and pose estimation of multiple object instances in general, and apply our approach to product recognition in particular (see Figure 1). Our contributions are multi-fold: First, we address the aforementioned issues of the Hough transform (i) by analytically deriving for each feature match a precise set of pose hypotheses, the so-called feature aligning affine transformations, under which the geometric relationship of the correspondence remains invariant; and (ii) by introducing an indirect but equivalent representation for affine transformations, which renders the corresponding Hough voting scheme as cheap as line drawing in raster grids. Second, owing to the minimized redundancy of this voting process, we can obtain a very sparse and stable pose-class histogram, which saves memory and can be readily used to read off instances and their poses without dedicated steps of non-maxima suppression and geometric verification. Third, our method scales well, accommodating up to multiple thousands of target classes and multiple tens of their co-visible instances on high resolution query images. Finally, as tested on the Grocery Products dataset [11] with 3235 fine-grained classes and 27 product categories, our method outperforms the state-of-the-art by almost 23% in product category recognition, which, in contrast to popular but data-intensive methods like deep learning, is achieved with only a single training image per fine-grained class.

Our paper is organized as follows: Section II reviews the relevant literature. Section III analyzes the geometry of feature correspondences, which is used to define an efficient Hough transform on poses, detailed in Section IV. Section V extends the proposed voting scheme to multi-instance detection, recognition and pose estimation, whereas Section VI explains our experimental findings on product recognition. Finally, Section VII gives our concluding remarks.

II. RELATED WORK

In matching based object recognition and image retrieval, the key factor that determines performance is the robustness of features against changes in viewpoint, illumination and noise. There has been a substantial body of work on saliency detection and invariant image descriptors, including the Scale Invariant Feature Transform (SIFT) [1], Harris-Laplace detectors [2] and Speeded-Up Robust Features (SURF) [3]. A notable work that uses invariant features for object detection and recognition is [1], where SIFT descriptors of the query image are matched against a feature database from object exemplars, and detections are obtained using a coarse Hough transform on poses with 4 degrees of freedom (dof). Alternatively, [14] proposes a Bag-of-Words (BoW) model on SIFT vectors to represent the query image by a histogram of global word counts and to match it to template images by tf-idf weighting. While the BoW representation lowers the matching cost, enabling efficient and scalable image retrieval, it suffers from background clutter and ignores spatial configurations, often requiring additional checks for geometric consistency. On the other hand, feature matching followed by Hough voting ensures spatial consistency to a larger degree, but due to spurious voting and limited resolution in spatial encoding, it also requires further heuristics such as non-maxima suppression and pose refinement. Nevertheless, in the case of multi-instance detection, the Hough transform performs better than brute-force search or robust fitting alternatives such as RANSAC. Accordingly, it has been used extensively with interest points [1], [4], image regions [15], and image patches [16], [17]. For improved efficiency, faster alternatives that apply the branch-and-bound (BB) technique on the pose space have also been proposed, especially suited to detection and pose estimation [18], [19]. In contrast to the Hough transform, which casts votes from individual feature matches to individual poses, BB methods rely on computing efficient bounds on feature scores within nested subsets of the pose space. But, aimed at finding the global optimum, these approaches are limited to situations with one object per image, rather than multi-instance reasoning.

Another area of research relevant to our paper is product recognition, which has gained significant interest over recent years. One of the earlier datasets introduced to this end is due to [6], who proposed color histogram matching, SIFT matching and boosted Haar-like features to recognize 120 distinct grocery products from manually segmented video frames of supermarket shelves. Given the much larger volume of available commercial products, works like [7]–[10] addressed the scalability of visual product search and the challenges of its mobile implementations. But their setting is again limited to a single product visible in the query image. More recently, approaches that handle multiple instances have also been studied [5], [11]–[13]. Most notably, [11] published a larger dataset, which, to the best of our knowledge, is the most extensive one to date, containing training images of 3235 fine-grained products and 680 test images from retail scenes. In order to confine the search and multi-label classification, they applied multi-class ranking with decision forests, followed by dense pixel matching and a genetic algorithm for match optimization. In a further study [12], they also address the problem of classifying shelf images into unique product categories by using visual codebooks on SIFT features, discriminative patches and active learning via user feedback.

III. POSES FROM FEATURE CORRESPONDENCES

We render the pose estimation problem as finding the transformation between template images in canonical object poses and object regions in the query image. Assuming the dominant texture in the template comes from an approximately planar object face, which is also visible in the query pose with negligibly small perspective effects, we can establish the template-to-query transformation reasonably well with an affine transformation. In this section, we explain how we can deduce a concise set of such transformations as pose hypotheses from a single feature correspondence.

A. Feature Aligning Affine Transformations

For extracting invariant image features, we use the SURF algorithm [3], which, similar to SIFT, detects interest points at blob-like structures and computes descriptors from the local content of image gradients. In particular, given an image I, each SURF feature can be represented as f = (d_f, p_f, s_f, θ_f), which consists of a descriptor vector d_f, as well as a set of geometric attributes including the corresponding interest point location p_f = (x_f, y_f) in I, the characteristic scale s_f and the reference orientation θ_f, given by the dominant gradient direction around p_f.

Now, suppose I and J = A ◦ I are two images related by an (unknown) affine transformation A, and let f and g be two matched features from I and J, respectively. If (f, g) is a correct match, we can find an analytically refined set of potential estimates for A that align the geometric attributes of f and g. In principle, the relative information s_g/s_f, θ_g − θ_f and p_g − p_f already specifies a 4-dof similarity transformation, which contains an isotropic scaling, a rotation and a translation. Such information is indeed used in [1] for computing, for each correspondence pair (f, g), a similarity transformation S_fg, whose neighborhood is then designated and voted as correspondence-specific pose candidates.

In this approach, the neighborhood size is crucial, especially when chosen arbitrarily around S_fg. If it is too small, the algorithm can hardly capture a wide range of poses; if it is too large, it will spuriously cover transformations that do not satisfy the geometric relation between f and g, and therefore cannot be consistent with the correspondence. As a result, [1] requires an additional geometric verification step, which uses an iterative least squares solver to remove outliers from the processed Hough bins and to obtain or reject the final pose estimate. As a key contribution here, we do not let the neighborhood around S_fg be arbitrary, but restrict it to transformations that preserve the relative geometry between f and g. In this way, given the correct match (f, g), we can avoid spurious voting and thereby save substantially during post-processing of the resulting Hough image.

To make things concrete, let us review the geometric components. Since f encodes a blob structure in I, its scale s_f essentially reflects the radius of I's locally stable neighborhood around p_f, and θ_f specifies the angle of I's dominant image gradient across that neighborhood. Intuitively, we can interpret s_f as the distance of p_f to the strongest nearby edge, which locally coincides with the line ℓ_f : (x − x_f) cos θ_f + (y − y_f) sin θ_f = s_f. Thus, the geometric attributes (p_f, s_f, θ_f) of f can be equivalently represented by the point-line configuration (p_f, ℓ_f). Similarly, feature g in J defines another point-line configuration (p_g, ℓ_g), with the line ℓ_g formulated analogously. Accordingly, if (f, g) is a correct match, we can deduce that the underlying transformation A from I to J should map (p_f, ℓ_f) to (p_g, ℓ_g).

In addition to the similarity components that are precisely captured by S_fg, affine transformations that satisfy (p_f, ℓ_f) ↦ (p_g, ℓ_g) permit a further unidirectional scaling and/or shear parallel to the edge lines ℓ_f and ℓ_g. More specifically, we can decompose such transformations as

$$T_{fg}(\gamma, \delta) = N_g^{-1}\, U(\gamma, \delta)\, N_f \tag{1}$$

with the following components:

$$N_f = \begin{pmatrix} \frac{\sin\theta_f}{s_f} & -\frac{\cos\theta_f}{s_f} & -\frac{x_f \sin\theta_f - y_f \cos\theta_f}{s_f} \\[2pt] \frac{\cos\theta_f}{s_f} & \frac{\sin\theta_f}{s_f} & -\frac{x_f \cos\theta_f + y_f \sin\theta_f}{s_f} \\[2pt] 0 & 0 & 1 \end{pmatrix} \tag{2}$$

normalizes the location, scale and orientation of f, such that its center p_f gets mapped to the origin, and its edge line ℓ_f to the line y = 1. Then,

$$U(\gamma, \delta) = \begin{pmatrix} \gamma & \delta & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \tag{3}$$

applies scaling with γ > 0 and shear with δ, both along the x-direction in the normalized frame, such that the point-line configuration fixed at the origin and the line y = 1 remains unaffected. Finally, N_g^{-1}, the inverse of the normalizing transform for g given as in Equation (2), brings the origin back to p_g and the line y = 1 to ℓ_g, completing the mapping (p_f, ℓ_f) ↦ (p_g, ℓ_g).

We denote the transformations T_fg defined in Equation (1) as feature aligning affine transformations, since they geometrically align the pair (f, g), and have 6 dof with two free parameters (γ, δ), for which the alignment remains invariant. Note that the similarity transformation S_fg is equal to T_fg(1, 0).
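As a concrete illustration of Equations (1)–(3), the following is a minimal numpy sketch that builds T_fg(γ, δ) from the geometric attributes of a matched pair. The (x, y, s, θ) tuple convention and the function names are our own illustrative assumptions, not part of the paper.

```python
import numpy as np

def normalizing_transform(x, y, s, theta):
    """N_f of Eq. (2): maps the feature center (x, y) to the origin and
    the feature's edge line to the line y = 1."""
    sin_t, cos_t = np.sin(theta), np.cos(theta)
    return np.array([
        [sin_t / s, -cos_t / s, -(x * sin_t - y * cos_t) / s],
        [cos_t / s,  sin_t / s, -(x * cos_t + y * sin_t) / s],
        [0.0,        0.0,        1.0],
    ])

def feature_aligning_transform(f, g, gamma=1.0, delta=0.0):
    """T_fg(gamma, delta) = N_g^{-1} U(gamma, delta) N_f of Eq. (1),
    where f and g are (x, y, s, theta) attribute tuples; the defaults
    gamma = 1, delta = 0 reproduce the similarity transformation S_fg."""
    U = np.array([[gamma, delta, 0.0],
                  [0.0,   1.0,   0.0],
                  [0.0,   0.0,   1.0]])
    return np.linalg.inv(normalizing_transform(*g)) @ U @ normalizing_transform(*f)
```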
IV. HOUGH TRANSFORM FOR POSES

In this section, we lay out the details of our Hough space for poses and the corresponding voting mechanism, which efficiently increments votes for feature aligning transformations per processed correspondence.

A. Indirect Representation for T_fg

As a 6-dof affine transformation, T_fg can be directly expressed by its first two rows in homogeneous coordinates, or by decomposing it into individual components of rotation, 2D scaling, shear, and 2D translation. In our case, we prefer an indirect but equivalent representation in the form of a T_fg-transformed triangle, which enables an efficient voting process. In particular, given a triplet of fixed, non-collinear anchor points (a^(1), a^(2), a^(3)), say, any 3 corners of the template image I, we use the transformed anchors b_fg^(j) = T_fg a^(j), j = 1, 2, 3, to encode T_fg.

Since T_fg is a function of (γ, δ), so is each transformed anchor b_fg^(j). Recall that on the target image J, the parameters (γ, δ) correspond to unidirectional scaling and shear, both parallel to the edge line ℓ_g of feature g. Thus, when (γ, δ) are varied, each b_fg^(j) also moves on a line ℓ_fg^(j) parallel to ℓ_g. Moreover, since there are two free parameters and three anchors, we can re-parametrize by arc-lengths along those anchor lines ℓ_fg^(j) and write the third anchor in terms of the first two:

$$b_{fg}^{(3)} = r_{fg}^{(3)} + u_g^{\top} \begin{bmatrix} b_{fg}^{(1)} - r_{fg}^{(1)} & b_{fg}^{(2)} - r_{fg}^{(2)} \end{bmatrix} \begin{bmatrix} a^{(1)} - p_f & a^{(2)} - p_f \end{bmatrix}^{-1} \left( a^{(3)} - p_f \right) u_g, \tag{4}$$

where r_fg^(j) = T_fg(1, 0) a^(j) is the reference anchor position on image J found with the similarity transformation T_fg(1, 0), and u_g = (sin θ_g, − cos θ_g)^⊤ is the unit vector along g's edge line ℓ_g. As explained next, Equation (4) will enable fast bin computation during our Hough voting.
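Under the bounded parameter ranges introduced in Section IV-B below, each transformed anchor sweeps a line segment parallel to g's edge line. The following is a hedged sketch of computing those segment endpoints, reusing the hypothetical feature_aligning_transform helper from above; the endpoint search over the four parameter corners is our own simple realization of "two extreme value combinations of γ and δ".

```python
import numpy as np

def anchor_segments(f, g, anchors, gamma_bar=1.5, delta_bar=0.5):
    """Endpoints of the segment swept by each transformed anchor
    b_fg^(j) = T_fg(gamma, delta) a^(j) as (gamma, delta) range over
    Gamma x Delta; extremes occur at corners of the parameter rectangle,
    and points are ordered along u_g, the direction of g's edge line."""
    corners = [(1.0 / gamma_bar, -delta_bar), (1.0 / gamma_bar, delta_bar),
               (gamma_bar, -delta_bar), (gamma_bar, delta_bar)]
    u_g = np.array([np.sin(g[3]), -np.cos(g[3])])
    segments = []
    for ax, ay in anchors:                         # template corners a^(j)
        pts = np.array([(feature_aligning_transform(f, g, ga, de)
                         @ np.array([ax, ay, 1.0]))[:2]
                        for ga, de in corners])
        t = pts @ u_g                              # arc-length along the anchor line
        segments.append((pts[np.argmin(t)], pts[np.argmax(t)]))
    return segments
```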

B. Efficient Pose Voting

Our representation (b_fg^(1), b_fg^(2), b_fg^(3)) for T_fg consists of three 2D points on the query image J. Thus, letting B denote a 2D lattice of location bins on J, our Hough domain for poses will be P = B^3, where any 6D cell encodes a joint configuration of three anchor-bins from B.

We also assume that the true transformation A between I and J does not severely deviate from a similarity map, such that for any correct feature match (f, g), the solutions γ and δ to A = T_fg(γ, δ) are within some fixed intervals Γ and ∆ around the values 1 and 0, respectively. In particular, we take Γ = [1/γ̄, γ̄] and ∆ = [−δ̄, δ̄] for some γ̄ > 1 and δ̄ > 0. In our experiments we set γ̄ = 1.5 and δ̄ = 0.5, which are enough to cover a wide range of poses.

With finite Γ and ∆, each anchor line ℓ_fg^(j) becomes a segment, whose endpoints on J are directly found at two extreme value combinations of γ and δ. Given the segment endpoints, the set of location bins B_fg^(j) ⊂ B intersected by ℓ_fg^(j) can be directly found by Bresenham's line-drawing algorithm [20], a simple method for approximating lines in raster grids. For each anchor, the algorithm also returns the discrete sequence (b_fg^(j,k)), k = 1, …, K_fg^(j), of intersections made between ℓ_fg^(j) and the bin boundaries of the lattice B.

The subset P_fg ⊂ P of 6D pose cells that correspond to the feature aligning transformation T_fg consists of all 2D bin configurations from B_fg^(1) × B_fg^(2) × B_fg^(3) that are simultaneously visited by the anchors b_fg^(j), j = 1, 2, 3, as γ ∈ Γ and δ ∈ ∆. Since the anchors are linearly dependent, it is enough to use bin combinations from two of the sets B_fg^(j) and obtain the third bin component using Equation (4). Specifically, we first designate the transformed anchor with the shortest line segment among the three, say, without loss of generality ℓ_fg^(3). Then, evaluating Equation (4) at the finite bin intersections {b_fg^(j,k) : k = 1, …, K_fg^(j), j = 1, 2} made by the first two anchors, we obtain the bins visited by b_fg^(3) without skipping any possible 6D cell from P_fg. As a result, for each feature correspondence (f, g), our voting procedure for the transformations T_fg only requires two executions of the Bresenham algorithm to compute the bins crossed by two of the anchor lines, and an evaluation of Equation (4) at their grid intersections to complete the third bin combination.

Fig. 2. Hough pose voting. Top-Left: Putative feature matches between the template image (left), shown with 3 anchors at its corners, and the query image (right). Top-Right: Anchor lines (computed from correspondences and shown with the corresponding anchor color) and the lattice B (dashed lines) for binning each anchor position. Bottom-Left: The Hough image of poses shown as a 3D plot; each axis is the scalar index i^(j) of bins in B intersected by anchor line j = 1, 2, 3, and pose cells are shown as spheres of size and shade proportional to vote counts. Bottom-Right: Correspondences that voted to the highest scoring cell and their least-squares estimated pose (yellow frame).
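The rasterization step can be as simple as the classical integer Bresenham traversal below, here applied to bin indices of the lattice B assuming square bins of size b; the bookkeeping of the boundary-crossing points b_fg^(j,k) is omitted in this sketch.

```python
def bresenham_bins(p0, p1, b):
    """Grid cells of the lattice B (bin size b) visited by the segment
    from point p0 to point p1, via Bresenham's line algorithm [20]
    run on the integer bin indices."""
    x0, y0 = int(p0[0] // b), int(p0[1] // b)
    x1, y1 = int(p1[0] // b), int(p1[1] // b)
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    bins = []
    while True:
        bins.append((x0, y0))
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:        # step along x
            err += dy
            x0 += sx
        if e2 <= dx:        # step along y
            err += dx
            y0 += sy
    return bins
```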
V. DETECTION, RECOGNITION AND POSE ESTIMATION

In the previous sections, we presented the details of computing pose candidates and the corresponding Hough voting steps for a single feature correspondence. In this section, we generalize and wrap the whole process for joint object detection, recognition and pose estimation.

A. Feature Matching

Let C be the set of object classes and let I = {I_c : c ∈ C} be the collection of their template images, containing one exemplar per class in its canonical pose. Accordingly, let F be the collection of model features extracted from I, where for each f ∈ F, the class label c_f and the reference anchor positions {a^(j)_{c_f} : j = 1, 2, 3} are also known. Similarly, let G denote the set of query features from the input image J to be matched to F. In an application like product recognition, the number |C| of distinct products can reach up to several thousands, yielding a very large feature set F. Thus, we use a k-d tree representation of F [1] and best-bin-first search [21], which, for each g ∈ G, determines an approximate nearest neighbor from F.

Again, as in [1], we discard matches for which the ratio of distances to the first and second nearest neighbors is larger than a threshold ρ, indicating that the closest match is not distinctive enough and less likely to be correct. In product images, salient patterns are typically shared across multiple classes (e.g. different products with the same brand logo). Thus, in contrast to [1], we restrict the first-to-second distance ratios to neighbors that belong to the same template, such that each query feature can be allowed multiple matches to different class candidates, while ensuring distinctiveness and geometric stability within each of them. In our experiments, we fix ρ = 0.5.
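Below is a simplified sketch of the per-template ratio test. The paper uses a single k-d tree over all model features with approximate best-bin-first search [21]; for clarity, this version builds one exact scipy tree per class template, which realizes the same acceptance rule at higher cost. All names are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

RHO = 0.5  # first-to-second nearest neighbor distance ratio threshold

def match_features(query_descriptors, model_descriptors_by_class):
    """Each query descriptor may match several class templates, but within
    a template the nearest model feature must be distinctive, i.e. closer
    than RHO times the second nearest one of the same template.
    Assumes every template contributes at least two features."""
    trees = {c: cKDTree(D) for c, D in model_descriptors_by_class.items()}
    matches = []  # tuples (query index, class label, model feature index)
    for qi, d in enumerate(query_descriptors):
        for c, tree in trees.items():
            dists, idxs = tree.query(d, k=2)
            if dists[0] < RHO * dists[1]:
                matches.append((qi, c, int(idxs[0])))
    return matches
```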

B. Pose-Class Histogram

With the set C of fine-grained target classes, our extended Hough space becomes Ω = P × C, whose pose resolution is specified by the step size b of the 2D lattice B for binning anchor positions. Letting M ⊂ F × G denote the set of putative matches between model and query features, we set b as the median of the sample { (s_g/s_f) √|I_{c_f}| : (f, g) ∈ M }, where, as before, s_g and s_f are the characteristic scales of the query feature g and its matched counterpart f, respectively, and |I_{c_f}| is the area of the tightly cropped template image of f. Thus, in pixel units, b gives a rough initial estimate of the typical object size on the query image J.

Given Ω, we repeatedly execute the voting process described in Section IV, where for each correspondence (f, g) ∈ M, votes are incremented only within the corresponding class channel c_f ∈ C. Thanks to the concise computation of the hypotheses P_fg, the resulting pose-class histogram H : Ω → ℕ ends up being very sparse, with few outliers within each cell of a true pose (see Figure 2), even for a large number of object instances present in the query image J and a relatively coarse bin size b. Thus, we only keep record of cells ω ∈ Ω with nonzero entries H(ω) = |M_ω|, along with the corresponding subsets of constituent matches M_ω = {(f, g) ∈ M : ω ∈ P_fg × {c_f}}.

C. Deducing Instances from H

We post-process the entries of H for final detections, recognitions and pose estimations according to a priority queue of decreasing vote counts. While doing so, we also keep track of query features that have already been used for declaring an object instance, so as to avoid their reuse in future detections. Specifically, suppose ω ∈ Ω is the current pose-class cell to be processed from the queue. Let G_ω = {g : (f, g) ∈ M_ω} ⊂ G be the set of query features that voted for ω, and let G′_ω = {g ∈ G_ω : g is unused} ⊆ G_ω be the subset of available ones. Then, for deducing an object instance from ω, we check if all three of the following conditions hold simultaneously: (i) H(ω) ≥ h̄; (ii) |G′_ω| / |G_ω| > τ̄ ∈ [0, 1]; and (iii) the affine solution Â_ω = argmin_A Σ_{(f,g)∈M_ω} ‖A p_f − p_g‖² is plausible, in the sense that its decomposition into scales s_x, s_y satisfies s_x, s_y ∈ [1/s̄, s̄] and exp{|log(s_x/s_y)|} < r̄.

Here, condition (i) eliminates likely false alarms with too few votes, while in parallel (ii) suppresses non-maxima around previous detections in an intuitive and cheap way. If the first two conditions are met, condition (iii) computes the least squares transformation Â_ω from the known coordinates of the matched features in M_ω and ensures that Â_ω falls within plausible ranges of scaling and change in aspect ratio. If ω satisfies all three conditions, it is declared a detection; otherwise it is discarded. If declared, the recognized class is read off from the corresponding class channel, the exact pose is returned as Â_ω, and the corresponding query features G_ω are flagged as used. This process is repeated until the end of the queue, sequentially revealing all detections, recognitions and poses. In our experiments, we set the minimum required vote count h̄ = 3, maximum allowed overlap τ̄ = 0.5, maximum allowed scale s̄ = 5 and maximum aspect ratio change r̄ = 1.5.

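The instance deduction of Section V-C can then be sketched as below, under the same assumptions; the fields m.p_f, m.p_g and m.g_id are illustrative, and the scale decomposition simply uses the column norms of the linear part of the fitted affine map.

```python
import numpy as np

H_MIN, TAU, S_BAR, R_BAR = 3, 0.5, 5.0, 1.5  # parameters from Section V-C

def least_squares_affine(P, Q):
    """Affine A (3x3, homogeneous) minimizing sum ||A p - q||^2."""
    P = np.hstack([np.asarray(P, float), np.ones((len(P), 1))])   # (N, 3)
    X, *_ = np.linalg.lstsq(P, np.asarray(Q, float), rcond=None)  # (3, 2)
    return np.vstack([X.T, [0.0, 0.0, 1.0]])

def deduce_instances(H):
    """Process pose-class cells by decreasing vote count and apply the
    acceptance conditions (i)-(iii) of Section V-C."""
    used, detections = set(), []
    for (cell, c), Ms in sorted(H.items(), key=lambda kv: -len(kv[1])):
        if len(Ms) < H_MIN:
            break                                  # (i) too few votes
        fresh = [m for m in Ms if m.g_id not in used]
        if len(fresh) <= TAU * len(Ms):
            continue                               # (ii) overlaps earlier detections
        A = least_squares_affine([m.p_f for m in Ms], [m.p_g for m in Ms])
        sx, sy = np.linalg.norm(A[:2, 0]), np.linalg.norm(A[:2, 1])
        if not (1.0 / S_BAR <= min(sx, sy) <= max(sx, sy) <= S_BAR
                and max(sx, sy) / min(sx, sy) < R_BAR):
            continue                               # (iii) implausible pose
        detections.append((c, A))                  # class and exact pose
        used.update(m.g_id for m in Ms)
    return detections
```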

VI. EXPERIMENTS

We applied our method to fine-grained product recognition and product category recognition. The details are as follows.

Dataset. In our experiments, we used the Grocery Products dataset introduced in [11]. It contains 3235 fine-grained classes of packaged food products with corresponding template images (one image per class) taken in ideal studio conditions. Fine-grained classes are clustered into 27 food categories such as "biscuits", "soft drinks", "cheese" etc. The test data contain a total of 680 shelf images captured from 5 different retail stores using a smart phone camera under real-world conditions involving various degrees of blur, occlusion and lighting variation. Annotations in test images are provided in terms of bounding boxes around spatial clusters of duplicate instances, their unique fine-grained class label, and a single food category label for the whole image.

Performance Metrics. Given a set 𝒥 of test images, we measure the performance of our method using four metrics: (i) Categorization Accuracy (CA) of classifying test images into one of the 27 food categories, which is done by predicting the category that cumulatively receives the largest Hough vote over detected instances; (ii) Product Accuracy (PA) of retrieving the present labels of fine-grained classes, given by PA = (1/|𝒥|) Σ_{J∈𝒥} |C_J ∩ Ĉ_J| / |C_J ∪ Ĉ_J|, where C_J and Ĉ_J respectively denote the ground-truth and predicted sets of fine-grained product classes in the query image J; (iii) Product Precision (PP); and (iv) Product Recall (PR), defined similarly to PA, with the denominator in the sum replaced by |Ĉ_J| and |C_J|, respectively.

Results. We compare our method with the state-of-the-art on the same dataset. The other methods are shorthanded as FV+RANSAC (Fisher Vector classification re-ranked with RANSAC [11]), RF+PM+GA (initial class ranking with Random Forests followed by dense Pixel Matching and a Genetic Algorithm for match optimization [11]), DP+SP+HS (Discriminative Patches with Spatial Pyramids classified as the class of the Highest Scoring patch [12]), and DP+SP+SVM (Discriminative Patches with Spatial Pyramids classified with SVMs [12]). As summarized in Table I and Figure 3, our recognition performance both at the category level (CA scores over 27 categories) and at the fine-grained class level (PA scores over 3235 classes) is superior to the state-of-the-art, with an almost 23% increase in category classification. Similarly, we perform significantly better in fine-grained product precision, and comparably in terms of product recall, thanks to the synergistic effect of joint pose estimation. While the alternative methods generally use a stepwise strategy, where, for scalability, they first reduce the set of candidate classes and then apply finer processing, we perform one-shot recognition by incorporating all classes at once, yet still operate within a comparable time budget. In our case, the feature extraction and matching steps are the major sources of computation.
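For completeness, the image-averaged metrics defined above amount to the following few lines, assuming nonempty ground-truth sets; gt and pred are lists of per-image label sets, and the names are our own.

```python
def product_metrics(gt, pred):
    """PA, PP, PR of Section VI: Jaccard index, precision and recall between
    ground-truth and predicted fine-grained class sets, averaged over images."""
    n = len(gt)
    pa = sum(len(c & p) / len(c | p) for c, p in zip(gt, pred)) / n
    pp = sum(len(c & p) / max(len(p), 1) for c, p in zip(gt, pred)) / n
    pr = sum(len(c & p) / len(c) for c, p in zip(gt, pred)) / n
    return pa, pp, pr
```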
[Figure: bar chart comparing per-category accuracies (%) of George et al. (2015) and our method over the 27 food categories, Food/Bakery through Food/Tea.]
Fig. 3. Recognition accuracies in individual product categories (our method compared against DP+SP+SVM [12]).

TABLE I
PRODUCT RECOGNITION PERFORMANCES (%)

Method            CA     PA     PP     PR
FV+RANSAC [11]    -      10.1   12.3   24.2
RF+PM+GA [11]     -      21.2   23.5   43.1
DP+SP+HS [12]     49.9   -      -      -
DP+SP+SVM [12]    61.9   -      -      -
Our Method        84.6   32.5   57.0   41.6

Given the database F of more than 1M model features from all 3235 fine-grained classes, and raw test images with 8 MP resolution, our method returns 10K to 20K putative matches per query, where feature extraction and matching take around 9 seconds on average using a single core of a PC with an Intel Core i7 processor. On the other hand, the proposed Hough transform and the subsequent detection, recognition and pose estimation steps, which constitute the contributions of this paper, altogether require less than half a second.

Figure 4 shows some examples of our recognition and pose estimation results, with template images of the detected fine-grained classes shown below each test image. In general, the sources of error include blurry images that yield an insufficient number of interest points, unreliable matching in less textured classes with varying 3D shape (e.g. bakery), confusions between different products that have the same dominant texture (e.g. cans and jars carrying the same brand label), as well as some recurring imperfections in the ground-truth annotations (e.g. images with "choco" objects annotated as "milk" or vice versa, or images with co-visible categories "snacks" and "chips" annotated with only one of them).

Fig. 4. Example results on the Grocery Products dataset. Detections and their predicted classes are shown below each image with frames of matching color.

VII. CONCLUSION

We proposed an efficient approach for the joint detection, recognition and pose estimation of object instances from a large number of target classes in crowded scenes. Our method implements a refined Hough transform over matched feature pairs, which, for each correspondence, votes for a confined set of pose hypotheses identified analytically as feature aligning transformations and described as simple line segments on the corresponding Hough domain. This yields a sparse and stable pose-class histogram, which can be directly processed for object instances without the need for dedicated steps of non-maxima suppression and geometric consistency checks. Our method scales well to a challenging application like product recognition, where there are thousands of fine-grained classes.

REFERENCES

[1] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[2] K. Mikolajczyk and C. Schmid, "Indexing based on scale invariant interest points," in ICCV, 2001.
[3] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
[4] B. Leibe, A. Leonardis, and B. Schiele, "Robust object detection with interleaved categorization and segmentation," International Journal of Computer Vision, vol. 77, no. 1-3, pp. 259–289, 2008.
[5] M. Marder, S. Harary, A. Ribak, Y. Tzur, S. Alpert, and A. Tzadok, "Using image analytics to monitor retail store shelves," IBM Journal of Research and Development, vol. 59, no. 2/3, 2015.
[6] M. Merler, C. Galleguillos, and S. Belongie, "Recognizing groceries in situ using in vitro training data," in SLAM, 2007.
[7] X. Lin, B. Gokturk, B. Sumengen, and D. Vu, "Visual search engine for product images," in Electronic Imaging, 2008.
[8] Y. Jing and S. Baluja, "PageRank for product image search," in International Conference on World Wide Web, 2008.
[9] S. S. Tsai, D. Chen, V. Chandrasekhar, G. Takacs, N.-M. Cheung, R. Vedantham, R. Grzeszczuk, and B. Girod, "Mobile product recognition," in ACM International Conference on Multimedia, 2010.
[10] X. Shen, Z. Lin, J. Brandt, and Y. Wu, "Mobile product image search by automatic query object extraction," in ECCV, 2012.
[11] M. George and C. Floerkemeier, "Recognizing products: A per-exemplar multi-label image classification approach," in ECCV, 2014.
[12] M. George, D. Mircic, G. Sörös, C. Floerkemeier, and F. Mattern, "Fine-grained product class recognition for assisted shopping," in ICCV Workshops, 2015.
[13] S. Advani, B. Smith, Y. Tanabe, K. Irick, M. Cotter, J. Sampson, and V. Narayanan, "Visual co-occurrence network: using context for large-scale object recognition in retail," in ESTIMedia, 2015.
[14] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in ICCV, 2003.
[15] C. Gu, J. J. Lim, P. Arbeláez, and J. Malik, "Recognition using regions," in CVPR, 2009.
[16] R. Okada, "Discriminative generalized Hough transform for object detection," in ICCV, 2009.
[17] J. Gall and V. Lempitsky, "Class-specific Hough forests for object detection," in Decision Forests for Computer Vision and Medical Image Analysis. Springer, 2013, pp. 143–157.
[18] C. H. Lampert, M. B. Blaschko, and T. Hofmann, "Efficient subwindow search: A branch and bound framework for object localization," IEEE PAMI, vol. 31, no. 12, pp. 2129–2142, 2009.
[19] E. Yoruk and R. Vidal, "Efficient object localization and pose estimation with 3D wireframe models," in ICCV Workshops, 2013.
[20] J. E. Bresenham, "Algorithm for computer control of a digital plotter," IBM Systems Journal, vol. 4, no. 1, pp. 25–30, 1965.
[21] J. S. Beis and D. G. Lowe, "Shape indexing using approximate nearest-neighbour search in high-dimensional spaces," in CVPR, 1997.
