2016 23rd International Conference on Pattern Recognition (ICPR), Cancún Center, Cancún, México, December 4-8, 2016

An Efficient Hough Transform for Multi-Instance Object Recognition and Pose Estimation

Erdem Yörük, Kaan Taha Öner and Ceyhun Burak Akgül
Vispera Information Technologies, Istanbul, Turkey

Abstract—Generalized Hough transform, when applied to object detection, recognition and pose estimation, can be susceptible to spurious voting depending on the Hough space to be used and the hypotheses to be voted. This often necessitates additional computational steps like non-maxima suppression and geometric consistency checks, which can be costly and prevent voting based methods from being precise and scalable for large numbers of target classes and crowded scenes. In this paper, we propose an efficient and refined Hough transform for simultaneous detection, recognition and exact pose estimation, which can efficiently accommodate up to multiple tens of co-visible query instances and multiple thousands of visually similar classes. Specifically, we match SURF features from a given query image to a database of model features with known poses, and in contrast to existing techniques, for each matched pair, we analytically compute a concise set of 6 degrees-of-freedom pose hypotheses, for which the geometric relationship of the correspondence remains invariant. We also introduce an indirect but equivalent representation for those correspondence-specific poses, termed feature aligning affine transformations, which results in a Hough voting scheme as cheap and refined as line drawing in raster grids. Owing to minimized voting redundancy, we can obtain a very sparse and stable Hough image, which can be readily used to read off instances and poses without dedicated steps of non-maxima suppression and geometric verification. Evaluated on the extensive Grocery Products dataset, our method significantly outperforms the state-of-the-art with near real-time overall cost.

Keywords. Feature Matching, Hough Transform, Object Instance Recognition, Pose Estimation, Product Recognition

Fig. 1. Overview of our method. Top-Left: Template images of target objects. Top-Right: Query image with its interest points colored according to the class of their matched model feature. Bottom-Left: Pose-class histogram visualized on the query image, with colors and affine frames indicating the class and pose hypotheses, where frame thickness is set proportional to the corresponding vote count (for clarity, only hypotheses with more than 5 votes are shown). Bottom-Right: Final recognition and pose estimation results, obtained after processing the Hough entries of non-overlapping frames according to decreasing vote counts.

I. INTRODUCTION

Object instance detection, recognition and pose estimation are core problems in many applications of computer vision. Solving these related tasks jointly and efficiently remains a great challenge, especially when a large number of visually similar target classes are in question, and many instances from them are simultaneously present in the query image.

Image matching techniques that can associate invariant and distinctive features to object exemplars show a great potential for such a combined problem, especially given their well documented effectiveness in image alignment, 3D reconstruction, content based retrieval and more. However, despite significant advances in saliency detection and stable image descriptors [1]–[3], approaches that use feature correspondences for object detection, recognition and pose estimation do not scale well with large numbers of target object classes and their co-visible instances.

In matching based multi-instance detection, the computational bottleneck is primarily due to the search for disjoint subsets of query features that are consistent in terms of geometry and class, and therefore likely to coincide with distinct objects. Usually, this is handled with some form of generalized Hough transform on object locations and poses using relative geometric attributes of feature correspondences [1], [4], [5]. Nevertheless, depending on the resolution of the Hough space and the precision of the generated hypotheses, the Hough transform approach can suffer from large memory requirements and/or spurious voting. Typically, this necessitates additional refinement heuristics such as non-maxima suppression and checks for geometric consistency.

Such issues can be especially exacerbated in fine-grained product recognition and retail scene understanding, where, despite the convenience of rich and rigid object textures, there are typically thousands of fine-grained classes with visually similar patterns (e.g. common shapes of bottles and cans, common logos etc.) and hundreds of object instances present in a given query image. Given those challenges, product recognition has not only become an interesting research area recently [6]–[13], but it has also penetrated into our everyday life with smart applications like shopper assistance for the visually impaired,
online product retrieval in e-commerce (e.g., Google Goggles, Amazon Flow), automated stock monitoring and planogram (plan for product placement on retail shelves) maintenance.

In this paper, we propose an efficient, matching- and voting-based algorithm for joint detection, recognition and pose estimation of multiple object instances in general, and apply our approach to product recognition in particular (see Figure 1). Our contributions are multi-fold: First, we address the aforementioned issues of the Hough transform (i) by analytically deriving for each feature match a precise set of pose hypotheses, the so-called feature aligning affine transformations, under which the geometric relationship of the correspondence remains invariant; and (ii) by introducing an indirect but equivalent representation for affine transformations, which renders the corresponding Hough voting scheme as cheap as line drawing in raster grids. Second, owing to the minimized redundancy of this voting process, we can obtain a very sparse and stable pose-class histogram, which saves memory and can be readily used to read off instances and their poses without dedicated steps of non-maxima suppression and geometric verification. Third, our method scales well, accommodating up to multiple thousands of target classes and multiple tens of their co-visible instances on high resolution query images. Finally, as tested on the Grocery Products dataset [11] with 3235 fine-grained classes and 27 product categories, our method outperforms the state-of-the-art by almost 23% in product category recognition, which, in contrast to popular but data-intensive methods like deep learning, is achieved with only a single training image per fine-grained class.

Our paper is organized as follows: Section II reviews the relevant literature. Section III analyzes the geometry of feature correspondences, which is used to define an efficient Hough transform on poses, detailed in Section IV. Section V extends the proposed voting scheme to multi-instance detection, recognition and pose estimation, whereas Section VI explains our experimental findings on product recognition. Finally, Section VII gives our concluding remarks.

II. RELATED WORK

In matching based object recognition and image retrieval, the key factor that determines performance is the robustness of features against changes in viewpoint, illumination and noise. There has been a substantial body of work on saliency detection and invariant image descriptors, including the Scale Invariant Feature Transform (SIFT) [1], Harris-Laplace detectors [2] and Speeded-Up Robust Features (SURF) [3]. A notable work that uses invariant features for object detection and recognition is [1], where SIFT descriptors of the query image are matched against a feature database from object exemplars, and detections are obtained using a coarse Hough transform on poses with 4 degrees of freedom (dof). Alternatively, [14] proposes a Bag-of-Words (BoW) model on SIFT vectors to represent the query image by a histogram of global word counts and to match it to template images by tf-idf weighting. While the BoW representation lowers the matching cost, enabling efficient and scalable image retrieval, it suffers from background clutter and ignores spatial configurations, often requiring additional checks for geometric consistency. On the other hand, feature matching followed by Hough voting ensures spatial consistency to a larger degree, but due to spurious voting and limited resolution in spatial encoding, it also requires further heuristics such as non-maxima suppression and pose refinement. Nevertheless, in the case of multi-instance detection, the Hough transform performs better than brute-force search or robust fitting alternatives such as RANSAC. Accordingly, it has been used extensively with interest points [1], [4], image regions [15], and image patches [16], [17]. For improved efficiency, faster alternatives that apply the branch-and-bound (BB) technique on the pose space have also been proposed, especially suited to detection and pose estimation [18], [19]. In contrast to the Hough transform, which casts votes from individual feature matches to individual poses, BB methods rely on computing efficient bounds on feature scores within nested subsets of the pose space. But, aimed at finding the global optimum, these approaches are limited to situations with one object per image, rather than multi-instance reasoning.

Another area of research relevant to our paper is product recognition, which has gained significant interest over recent years. One of the earlier datasets introduced to this end is due to [6], who proposed color histogram matching, SIFT matching and boosted Haar-like features to recognize 120 distinct grocery products from manually segmented video frames of supermarket shelves. Given the much larger volume of available commercial products, works like [7]–[10] addressed the scalability of visual product search and the challenges of its mobile implementations. But their setting is again limited to a single product visible in the query image. More recently, approaches that handle multiple instances have also been studied [5], [11]–[13]. Most notably, [11] published a larger dataset, which, to the best of our knowledge, is the most extensive one to date, containing training images of 3235 fine-grained products and 680 test images from retail scenes. In order to confine the search and multi-label classification, they applied multi-class ranking with decision forests, followed by dense pixel matching and a genetic algorithm for match optimization. In a further study [12], they also address the problem of classifying shelf images into unique product categories by using visual codebooks on SIFT features, discriminative patches and active learning via user feedback.

III. POSES FROM FEATURE CORRESPONDENCES

We render the pose estimation problem as finding the transformation between template images in canonical object poses and object regions in the query image. Assuming the dominant texture in the template comes from an approximately planar object face, which is also visible in the query pose with negligibly small perspective effects, we can establish the template-to-query transformation reasonably well with an affine transformation. In this section, we explain how we can deduce a concise set of such transformations as pose hypotheses from a single feature correspondence.

A. Feature Aligning Affine Transformations

For extracting invariant image features, we use the SURF algorithm [3], which, similar to SIFT, detects interest points at blob-like structures and computes descriptors from the local content of image gradients. In particular, given an image I, each SURF feature can be represented as f = (d_f, p_f, s_f, θ_f), which consists of a descriptor vector d_f, as well as a set of geometric attributes including the corresponding interest point location p_f = (x_f, y_f) in I, the characteristic scale s_f and the reference orientation θ_f, given by the dominant gradient direction around p_f.

Now, suppose I and J = A ◦ I are two images related by an (unknown) affine transformation A, and let f and g be two matched features from I and J, respectively. If (f, g) is a correct match, we can find an analytically refined set of potential estimates for A that align the geometric attributes of f and g. In principle, the relative information s_g/s_f, θ_g − θ_f and p_g − p_f already specifies a 4-dof similarity transformation, which contains an isotropic scaling, a rotation and a translation. Such information is indeed used in [1] for computing, for each correspondence pair (f, g), a similarity transformation S_fg, whose neighborhood is then designated and voted as correspondence-specific pose candidates.

In this approach, the neighborhood size is crucial, especially when chosen arbitrarily around S_fg. If it is too small, the algorithm can hardly capture a wide range of poses; if it is too large, it will spuriously cover transformations that do not satisfy the geometric relation between f and g, and therefore cannot be consistent with the correspondence. As a result, [1] requires an additional geometric verification step, which uses an iterative least squares solver to remove outliers from the processed Hough bins and to obtain or reject the final pose estimate. As a key contribution here, we do not let the neighborhood around S_fg be arbitrary, but restrict it to transformations that preserve the relative geometry between f and g. In this way, given the correct match (f, g), we can avoid spurious voting and thereby save substantially during post-processing of the resulting Hough image.

To make things concrete, let us review the geometric components. Since f encodes a blob structure in I, its scale s_f essentially reflects the radius of I's locally stable neighborhood around p_f, and θ_f specifies the angle of I's dominant image gradient across that neighborhood. Intuitively, we can interpret s_f as the distance of p_f to the strongest nearby edge, which locally coincides with the line ℓ_f : (x − x_f) cos θ_f + (y − y_f) sin θ_f = s_f. Thus, the geometric attributes (p_f, s_f, θ_f) of f can be equivalently represented by the point-line configuration (p_f, ℓ_f). Similarly, feature g in J defines another point-line configuration (p_g, ℓ_g), with the line ℓ_g formulated analogously. Accordingly, if (f, g) is a correct match, we can deduce that the underlying transformation A from I to J should map (p_f, ℓ_f) to (p_g, ℓ_g).

In addition to the similarity components that are precisely captured by S_fg, affine transformations that satisfy (p_f, ℓ_f) ↦ (p_g, ℓ_g) permit a further unidirectional scaling and/or shear parallel to the edge lines ℓ_f and ℓ_g. More specifically, we can decompose such transformations as

$$T_{fg}(\gamma, \delta) = N_g^{-1}\, U(\gamma, \delta)\, N_f \tag{1}$$

with the following components:

$$N_f = \begin{pmatrix} \frac{\sin\theta_f}{s_f} & -\frac{\cos\theta_f}{s_f} & -\frac{x_f \sin\theta_f - y_f \cos\theta_f}{s_f} \\[2pt] \frac{\cos\theta_f}{s_f} & \frac{\sin\theta_f}{s_f} & -\frac{x_f \cos\theta_f + y_f \sin\theta_f}{s_f} \\[2pt] 0 & 0 & 1 \end{pmatrix} \tag{2}$$

normalizes the location, scale and orientation of f, such that its center p_f gets mapped to the origin, and its edge line ℓ_f to the line y = 1. Then,

$$U(\gamma, \delta) = \begin{pmatrix} \gamma & \delta & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \tag{3}$$

applies scaling with γ > 0 and shear with δ, both along the x-direction in the normalized frame, such that the point-line configuration fixed at the origin and the line y = 1 remains unaffected. Finally, N_g^{-1}, the inverse of the normalizing transform for g given as in Equation (2), brings the origin back to p_g and the line y = 1 to ℓ_g, completing the mapping (p_f, ℓ_f) ↦ (p_g, ℓ_g).

We denote the transformations T_fg defined in Equation (1) as feature aligning affine transformations, since they geometrically align the pair (f, g), and have 6 dof with two free parameters (γ, δ), for which the alignment remains invariant. Note that the similarity transformation S_fg is equal to T_fg(1, 0).
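As a concrete illustration of Equations (1)–(3), the following is a minimal numpy sketch that builds T_fg(γ, δ) from the geometric attributes of a matched pair. The (x, y, s, θ) tuple convention and the function names are our own illustrative assumptions, not part of the paper.

```python
import numpy as np

def normalizing_transform(x, y, s, theta):
    """N_f of Eq. (2): maps the feature center (x, y) to the origin and
    the feature's edge line to the line y = 1."""
    sin_t, cos_t = np.sin(theta), np.cos(theta)
    return np.array([
        [sin_t / s, -cos_t / s, -(x * sin_t - y * cos_t) / s],
        [cos_t / s,  sin_t / s, -(x * cos_t + y * sin_t) / s],
        [0.0,        0.0,        1.0],
    ])

def feature_aligning_transform(f, g, gamma=1.0, delta=0.0):
    """T_fg(gamma, delta) = N_g^{-1} U(gamma, delta) N_f of Eq. (1),
    where f and g are (x, y, s, theta) attribute tuples; the defaults
    gamma = 1, delta = 0 reproduce the similarity transformation S_fg."""
    U = np.array([[gamma, delta, 0.0],
                  [0.0,   1.0,   0.0],
                  [0.0,   0.0,   1.0]])
    return np.linalg.inv(normalizing_transform(*g)) @ U @ normalizing_transform(*f)
```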
IV. HOUGH TRANSFORM FOR POSES

In this section, we lay out the details of our Hough space for poses and the corresponding voting mechanism, which efficiently increments votes for feature aligning transformations per processed correspondence.

A. Indirect Representation for T_fg

As a 6-dof affine transformation, T_fg can be directly expressed by its first two rows in homogeneous coordinates, or by decomposing it into individual components of rotation, 2D scaling, shear, and 2D translation. In our case, we prefer an indirect but equivalent representation in the form of a T_fg-transformed triangle, which enables an efficient voting process. In particular, given a triplet of fixed, non-collinear anchor points (a^(1), a^(2), a^(3)), say, any 3 corners of the template image I, we use the transformed anchors b_fg^(j) = T_fg a^(j), j = 1, 2, 3, to encode T_fg.

Since T_fg is a function of (γ, δ), so is each transformed anchor b_fg^(j). Recall that on the target image J, the parameters (γ, δ) correspond to unidirectional scaling and shear, both parallel to the edge line ℓ_g of feature g. Thus, when (γ, δ) are varied, each b_fg^(j) also moves on a line ℓ_fg^(j) parallel to ℓ_g. Moreover, since there are two free parameters and three anchors, we can re-parametrize by arc-lengths along those anchor lines ℓ_fg^(j) and write the third anchor in terms of the first two:

$$b_{fg}^{(3)} = r_{fg}^{(3)} + u_g^{\top} \begin{bmatrix} b_{fg}^{(1)} - r_{fg}^{(1)} & b_{fg}^{(2)} - r_{fg}^{(2)} \end{bmatrix} \begin{bmatrix} a^{(1)} - p_f & a^{(2)} - p_f \end{bmatrix}^{-1} \left( a^{(3)} - p_f \right) u_g, \tag{4}$$

where r_fg^(j) = T_fg(1, 0) a^(j) is the reference anchor position on image J found with the similarity transformation T_fg(1, 0), and u_g = (sin θ_g, − cos θ_g)^⊤ is the unit vector along g's edge line ℓ_g. As explained next, Equation (4) will enable fast bin computation during our Hough voting.
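Under the bounded parameter ranges introduced in Section IV-B below, each transformed anchor sweeps a line segment parallel to g's edge line. The following is a hedged sketch of computing those segment endpoints, reusing the hypothetical feature_aligning_transform helper from above; the endpoint search over the four parameter corners is our own simple realization of "two extreme value combinations of γ and δ".

```python
import numpy as np

def anchor_segments(f, g, anchors, gamma_bar=1.5, delta_bar=0.5):
    """Endpoints of the segment swept by each transformed anchor
    b_fg^(j) = T_fg(gamma, delta) a^(j) as (gamma, delta) range over
    Gamma x Delta; extremes occur at corners of the parameter rectangle,
    and points are ordered along u_g, the direction of g's edge line."""
    corners = [(1.0 / gamma_bar, -delta_bar), (1.0 / gamma_bar, delta_bar),
               (gamma_bar, -delta_bar), (gamma_bar, delta_bar)]
    u_g = np.array([np.sin(g[3]), -np.cos(g[3])])
    segments = []
    for ax, ay in anchors:                         # template corners a^(j)
        pts = np.array([(feature_aligning_transform(f, g, ga, de)
                         @ np.array([ax, ay, 1.0]))[:2]
                        for ga, de in corners])
        t = pts @ u_g                              # arc-length along the anchor line
        segments.append((pts[np.argmin(t)], pts[np.argmax(t)]))
    return segments
```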

B. Efficient Pose Voting

Our representation (b_fg^(1), b_fg^(2), b_fg^(3)) for T_fg consists of three 2D points on the query image J. Thus, letting B denote a 2D lattice of location bins on J, our Hough domain for poses will be P = B^3, where any 6D cell encodes a joint configuration of three anchor-bins from B.

We also assume that the true transformation A between I and J does not severely deviate from a similarity map, such that for any correct feature match (f, g), the solutions γ and δ to A = T_fg(γ, δ) are within some fixed intervals Γ and ∆ around the values 1 and 0, respectively. In particular, we take Γ = [1/γ̄, γ̄] and ∆ = [−δ̄, δ̄] for some γ̄ > 1 and δ̄ > 0. In our experiments we set γ̄ = 1.5 and δ̄ = 0.5, which are enough to cover a wide range of poses.

With finite Γ and ∆, each anchor line ℓ_fg^(j) becomes a segment, whose endpoints on J are directly found at two extreme value combinations of γ and δ. Given the segment endpoints, the set of location bins B_fg^(j) ⊂ B intersected by ℓ_fg^(j) can be directly found by Bresenham's line-drawing algorithm [20], a simple method for approximating lines in raster grids. For each anchor, the algorithm also returns the discrete sequence (b_fg^(j,k)), k = 1, …, K_fg^(j), of intersections made between ℓ_fg^(j) and the bin boundaries of the lattice B.

The subset P_fg ⊂ P of 6D pose cells that correspond to the feature aligning transformation T_fg consists of all 2D bin configurations from B_fg^(1) × B_fg^(2) × B_fg^(3) that are simultaneously visited by the anchors b_fg^(j), j = 1, 2, 3, as γ ∈ Γ and δ ∈ ∆. Since the anchors are linearly dependent, it is enough to use bin combinations from two of the sets B_fg^(j) and obtain the third bin component using Equation (4). Specifically, we first designate the transformed anchor with the shortest line segment among the three, say, without loss of generality ℓ_fg^(3). Then, evaluating Equation (4) at the finite bin intersections {b_fg^(j,k) : k = 1, …, K_fg^(j), j = 1, 2} made by the first two anchors, we obtain the bins visited by b_fg^(3) without skipping any possible 6D cell from P_fg. As a result, for each feature correspondence (f, g), our voting procedure for the transformations T_fg only requires two executions of the Bresenham algorithm to compute the bins crossed by two of the anchor lines, and an evaluation of Equation (4) at their grid intersections to complete the third bin combination.

Fig. 2. Hough pose voting. Top-Left: Putative feature matches between the template image (left), shown with 3 anchors at its corners, and the query image (right). Top-Right: Anchor lines (computed from correspondences and shown with the corresponding anchor color) and the lattice B (dashed lines) for binning each anchor position. Bottom-Left: The Hough image of poses shown as a 3D plot; each axis is the scalar index i^(j) of bins in B intersected by anchor line j = 1, 2, 3, and pose cells are shown as spheres of size and shade proportional to vote counts. Bottom-Right: Correspondences that voted to the highest scoring cell and their least-squares estimated pose (yellow frame).
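The rasterization step can be as simple as the classical integer Bresenham traversal below, here applied to bin indices of the lattice B assuming square bins of size b; the bookkeeping of the boundary-crossing points b_fg^(j,k) is omitted in this sketch.

```python
def bresenham_bins(p0, p1, b):
    """Grid cells of the lattice B (bin size b) visited by the segment
    from point p0 to point p1, via Bresenham's line algorithm [20]
    run on the integer bin indices."""
    x0, y0 = int(p0[0] // b), int(p0[1] // b)
    x1, y1 = int(p1[0] // b), int(p1[1] // b)
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    bins = []
    while True:
        bins.append((x0, y0))
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:        # step along x
            err += dy
            x0 += sx
        if e2 <= dx:        # step along y
            err += dx
            y0 += sy
    return bins
```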
V. DETECTION, RECOGNITION AND POSE ESTIMATION

In the previous sections, we presented the details of computing pose candidates and the corresponding Hough voting steps for a single feature correspondence. In this section, we generalize and wrap the whole process for joint object detection, recognition and pose estimation.

A. Feature Matching

Let C be the set of object classes and let I = {I_c : c ∈ C} be the collection of their template images, containing one exemplar per class in its canonical pose. Accordingly, let F be the collection of model features extracted from I, where for each f ∈ F, the class label c_f and the reference anchor positions {a^(j)_{c_f} : j = 1, 2, 3} are also known. Similarly, let G denote the set of query features from the input image J to be matched to F. In an application like product recognition, the number |C| of distinct products can reach up to several thousands, yielding a very large feature set F. Thus, we use a k-d tree representation of F [1] and best-bin-first search [21], which, for each g ∈ G, determines an approximate nearest neighbor from F.

Again, as in [1], we discard matches for which the ratio of distances to the first and second nearest neighbors is larger than a threshold ρ, indicating that the closest match is not distinctive enough and less likely to be correct. In product images, salient patterns are typically shared across multiple classes (e.g. different products with the same brand logo). Thus, in contrast to [1], we restrict the first-to-second distance ratios to neighbors that belong to the same template, such that each query feature can be allowed multiple matches to different class candidates, while ensuring distinctiveness and geometric stability within each of them. In our experiments, we fix ρ = 0.5.
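Below is a simplified sketch of the per-template ratio test. The paper uses a single k-d tree over all model features with approximate best-bin-first search [21]; for clarity, this version builds one exact scipy tree per class template, which realizes the same acceptance rule at higher cost. All names are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

RHO = 0.5  # first-to-second nearest neighbor distance ratio threshold

def match_features(query_descriptors, model_descriptors_by_class):
    """Each query descriptor may match several class templates, but within
    a template the nearest model feature must be distinctive, i.e. closer
    than RHO times the second nearest one of the same template.
    Assumes every template contributes at least two features."""
    trees = {c: cKDTree(D) for c, D in model_descriptors_by_class.items()}
    matches = []  # tuples (query index, class label, model feature index)
    for qi, d in enumerate(query_descriptors):
        for c, tree in trees.items():
            dists, idxs = tree.query(d, k=2)
            if dists[0] < RHO * dists[1]:
                matches.append((qi, c, int(idxs[0])))
    return matches
```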

B. Pose-Class Histogram

With the set C of fine-grained target classes, our extended Hough space becomes Ω = P × C, whose pose resolution is specified by the step size b of the 2D lattice B for binning anchor positions. Letting M ⊂ F × G denote the set of putative matches between model and query features, we set b as the median of the sample { (s_g/s_f) √|I_{c_f}| : (f, g) ∈ M }, where, as before, s_g and s_f are the characteristic scales of the query feature g and its matched counterpart f, respectively, and |I_{c_f}| is the area of the tightly cropped template image of f. Thus, in pixel units, b gives a rough initial estimate of the typical object size on the query image J.

Given Ω, we repeatedly execute the voting process described in Section IV, where for each correspondence (f, g) ∈ M, votes are incremented only within the corresponding class channel c_f ∈ C. Thanks to the concise computation of the hypotheses P_fg, the resulting pose-class histogram H : Ω → ℕ ends up being very sparse, with few outliers within each cell of a true pose (see Figure 2), even for a large number of object instances present in the query image J and a relatively coarse bin size b. Thus, we only keep record of cells ω ∈ Ω with nonzero entries H(ω) = |M_ω|, along with the corresponding subsets of constituent matches M_ω = {(f, g) ∈ M : ω ∈ P_fg × {c_f}}.

C. Deducing Instances from H

We post-process the entries of H for final detections, recognitions and pose estimations according to a priority queue of decreasing vote counts. While doing so, we also keep track of query features that have already been used for declaring an object instance, so as to avoid their reuse in future detections. Specifically, suppose ω ∈ Ω is the current pose-class cell to be processed from the queue. Let G_ω = {g : (f, g) ∈ M_ω} ⊂ G be the set of query features that voted for ω, and let G′_ω = {g ∈ G_ω : g is unused} ⊆ G_ω be the subset of available ones. Then, for deducing an object instance from ω, we check if all three of the following conditions hold simultaneously: (i) H(ω) ≥ h̄; (ii) |G′_ω| / |G_ω| > τ̄ ∈ [0, 1]; and (iii) the affine solution Â_ω = argmin_A Σ_{(f,g)∈M_ω} ‖A p_f − p_g‖² is plausible, in the sense that its decomposition into scales s_x, s_y satisfies s_x, s_y ∈ [1/s̄, s̄] and exp{|log(s_x/s_y)|} < r̄.

Here, condition (i) eliminates likely false alarms with too few votes, while in parallel (ii) suppresses non-maxima around previous detections in an intuitive and cheap way. If the first two conditions are met, condition (iii) computes the least squares transformation Â_ω from the known coordinates of the matched features in M_ω and ensures that Â_ω falls within plausible ranges of scaling and change in aspect ratio. If ω satisfies all three conditions, it is declared a detection; otherwise it is discarded. If declared, the recognized class is read off from the corresponding class channel, the exact pose is returned as Â_ω, and the corresponding query features G_ω are flagged as used. This process is repeated until the end of the queue, sequentially revealing all detections, recognitions and poses. In our experiments, we set the minimum required vote count h̄ = 3, maximum allowed overlap τ̄ = 0.5, maximum allowed scale s̄ = 5 and maximum aspect ratio change r̄ = 1.5.

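The instance deduction of Section V-C can then be sketched as below, under the same assumptions; the fields m.p_f, m.p_g and m.g_id are illustrative, and the scale decomposition simply uses the column norms of the linear part of the fitted affine map.

```python
import numpy as np

H_MIN, TAU, S_BAR, R_BAR = 3, 0.5, 5.0, 1.5  # parameters from Section V-C

def least_squares_affine(P, Q):
    """Affine A (3x3, homogeneous) minimizing sum ||A p - q||^2."""
    P = np.hstack([np.asarray(P, float), np.ones((len(P), 1))])   # (N, 3)
    X, *_ = np.linalg.lstsq(P, np.asarray(Q, float), rcond=None)  # (3, 2)
    return np.vstack([X.T, [0.0, 0.0, 1.0]])

def deduce_instances(H):
    """Process pose-class cells by decreasing vote count and apply the
    acceptance conditions (i)-(iii) of Section V-C."""
    used, detections = set(), []
    for (cell, c), Ms in sorted(H.items(), key=lambda kv: -len(kv[1])):
        if len(Ms) < H_MIN:
            break                                  # (i) too few votes
        fresh = [m for m in Ms if m.g_id not in used]
        if len(fresh) <= TAU * len(Ms):
            continue                               # (ii) overlaps earlier detections
        A = least_squares_affine([m.p_f for m in Ms], [m.p_g for m in Ms])
        sx, sy = np.linalg.norm(A[:2, 0]), np.linalg.norm(A[:2, 1])
        if not (1.0 / S_BAR <= min(sx, sy) <= max(sx, sy) <= S_BAR
                and max(sx, sy) / min(sx, sy) < R_BAR):
            continue                               # (iii) implausible pose
        detections.append((c, A))                  # class and exact pose
        used.update(m.g_id for m in Ms)
    return detections
```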

VI. EXPERIMENTS

We applied our method to fine-grained product recognition and product category recognition. The details are as follows.

Dataset. In our experiments, we used the Grocery Products dataset introduced in [11]. It contains 3235 fine-grained classes of packaged food products with corresponding template images (one image per class) taken in ideal studio conditions. Fine-grained classes are clustered into 27 food categories such as "biscuits", "soft drinks", "cheese" etc. The test data contain a total of 680 shelf images captured from 5 different retail stores using a smart phone camera under real-world conditions involving various degrees of blur, occlusion and lighting variation. Annotations in test images are provided in terms of bounding boxes around spatial clusters of duplicate instances, their unique fine-grained class label, and a single food category label for the whole image.

Performance Metrics. Given a set 𝒥 of test images, we measure the performance of our method using four metrics: (i) Categorization Accuracy (CA) of classifying test images into one of the 27 food categories, which is done by predicting the category that cumulatively receives the largest Hough vote over detected instances; (ii) Product Accuracy (PA) of retrieving the present labels of fine-grained classes, given by PA = (1/|𝒥|) Σ_{J∈𝒥} |C_J ∩ Ĉ_J| / |C_J ∪ Ĉ_J|, where C_J and Ĉ_J respectively denote the ground-truth and predicted sets of fine-grained product classes in the query image J; (iii) Product Precision (PP); and (iv) Product Recall (PR), defined similarly to PA, with the denominator in the sum replaced by |Ĉ_J| and |C_J|, respectively.

Results. We compare our method with the state-of-the-art on the same dataset. The other methods are shorthanded as FV+RANSAC (Fisher Vector classification re-ranked with RANSAC [11]), RF+PM+GA (initial class ranking with Random Forests followed by dense Pixel Matching and a Genetic Algorithm for match optimization [11]), DP+SP+HS (Discriminative Patches with Spatial Pyramids classified as the class of the Highest Scoring patch [12]), and DP+SP+SVM (Discriminative Patches with Spatial Pyramids classified with SVMs [12]). As summarized in Table I and Figure 3, our recognition performance both at the category level (CA scores over 27 categories) and at the fine-grained class level (PA scores over 3235 classes) is superior to the state-of-the-art, with an almost 23% increase in category classification. Similarly, we perform significantly better in fine-grained product precision, and comparably in terms of product recall, thanks to the synergistic effect of joint pose estimation. While the alternative methods generally use a stepwise strategy, where, for scalability, they first reduce the set of candidate classes and then apply finer processing, we perform one-shot recognition by incorporating all classes at once, yet still operate within a comparable time budget. In our case, the feature extraction and matching steps are the major sources of computation.
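For completeness, the image-averaged metrics defined above amount to the following few lines, assuming nonempty ground-truth sets; gt and pred are lists of per-image label sets, and the names are our own.

```python
def product_metrics(gt, pred):
    """PA, PP, PR of Section VI: Jaccard index, precision and recall between
    ground-truth and predicted fine-grained class sets, averaged over images."""
    n = len(gt)
    pa = sum(len(c & p) / len(c | p) for c, p in zip(gt, pred)) / n
    pp = sum(len(c & p) / max(len(p), 1) for c, p in zip(gt, pred)) / n
    pr = sum(len(c & p) / len(c) for c, p in zip(gt, pred)) / n
    return pa, pp, pr
```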
[Figure: bar chart comparing per-category accuracies (%) of George et al. (2015) and our method over the 27 food categories, Food/Bakery through Food/Tea.]
Fig. 3. Recognition accuracies in individual product categories (our method compared against DP+SP+SVM [12]).

TABLE I
PRODUCT RECOGNITION PERFORMANCES (%)

Method            CA     PA     PP     PR
FV+RANSAC [11]    -      10.1   12.3   24.2
RF+PM+GA [11]     -      21.2   23.5   43.1
DP+SP+HS [12]     49.9   -      -      -
DP+SP+SVM [12]    61.9   -      -      -
Our Method        84.6   32.5   57.0   41.6

Given the database F of more than 1M model features from all 3235 fine-grained classes, and raw test images with 8 MP resolution, our method returns 10K to 20K putative matches per query, where feature extraction and matching take around 9 seconds on average using a single core of a PC with an Intel Core i7 processor. On the other hand, the proposed Hough transform and the subsequent detection, recognition and pose estimation steps, which constitute the contributions of this paper, altogether require less than half a second.

Figure 4 shows some examples of our recognition and pose estimation results, with template images of the detected fine-grained classes shown below each test image. In general, the sources of error include blurry images that yield an insufficient number of interest points, unreliable matching in less textured classes with varying 3D shape (e.g. bakery), confusions between different products that have the same dominant texture (e.g. cans and jars carrying the same brand label), as well as some recurring imperfections in the ground-truth annotations (e.g. images with "choco" objects annotated as "milk" or vice versa, or images with co-visible categories "snacks" and "chips" annotated with only one of them).

Fig. 4. Example results on the Grocery Products dataset. Detections and their predicted classes are shown below each image with frames of matching color.

VII. CONCLUSION

We proposed an efficient approach for the joint detection, recognition and pose estimation of object instances from a large number of target classes in crowded scenes. Our method implements a refined Hough transform over matched feature pairs, which, for each correspondence, votes for a confined set of pose hypotheses identified analytically as feature aligning transformations and described as simple line segments on the corresponding Hough domain. This yields a sparse and stable pose-class histogram, which can be directly processed for object instances without the need for dedicated steps of non-maxima suppression and geometric consistency checks. Our method scales well to a challenging application like product recognition, where there are thousands of fine-grained classes.

REFERENCES

[1] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[2] K. Mikolajczyk and C. Schmid, "Indexing based on scale invariant interest points," in ICCV, 2001.
[3] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
[4] B. Leibe, A. Leonardis, and B. Schiele, "Robust object detection with interleaved categorization and segmentation," International Journal of Computer Vision, vol. 77, no. 1-3, pp. 259–289, 2008.
[5] M. Marder, S. Harary, A. Ribak, Y. Tzur, S. Alpert, and A. Tzadok, "Using image analytics to monitor retail store shelves," IBM Journal of Research and Development, vol. 59, no. 2/3, 2015.
[6] M. Merler, C. Galleguillos, and S. Belongie, "Recognizing groceries in situ using in vitro training data," in SLAM, 2007.
[7] X. Lin, B. Gokturk, B. Sumengen, and D. Vu, "Visual search engine for product images," in Electronic Imaging, 2008.
[8] Y. Jing and S. Baluja, "PageRank for product image search," in International Conference on World Wide Web, 2008.
[9] S. S. Tsai, D. Chen, V. Chandrasekhar, G. Takacs, N.-M. Cheung, R. Vedantham, R. Grzeszczuk, and B. Girod, "Mobile product recognition," in ACM International Conference on Multimedia, 2010.
[10] X. Shen, Z. Lin, J. Brandt, and Y. Wu, "Mobile product image search by automatic query object extraction," in ECCV, 2012.
[11] M. George and C. Floerkemeier, "Recognizing products: A per-exemplar multi-label image classification approach," in ECCV, 2014.
[12] M. George, D. Mircic, G. Sörös, C. Floerkemeier, and F. Mattern, "Fine-grained product class recognition for assisted shopping," in ICCV Workshops, 2015.
[13] S. Advani, B. Smith, Y. Tanabe, K. Irick, M. Cotter, J. Sampson, and V. Narayanan, "Visual co-occurrence network: using context for large-scale object recognition in retail," in ESTIMedia, 2015.
[14] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in ICCV, 2003.
[15] C. Gu, J. J. Lim, P. Arbeláez, and J. Malik, "Recognition using regions," in CVPR, 2009.
[16] R. Okada, "Discriminative generalized Hough transform for object detection," in ICCV, 2009.
[17] J. Gall and V. Lempitsky, "Class-specific Hough forests for object detection," in Decision Forests for Computer Vision and Medical Image Analysis. Springer, 2013, pp. 143–157.
[18] C. H. Lampert, M. B. Blaschko, and T. Hofmann, "Efficient subwindow search: A branch and bound framework for object localization," IEEE PAMI, vol. 31, no. 12, pp. 2129–2142, 2009.
[19] E. Yoruk and R. Vidal, "Efficient object localization and pose estimation with 3D wireframe models," in ICCV Workshops, 2013.
[20] J. E. Bresenham, "Algorithm for computer control of a digital plotter," IBM Systems Journal, vol. 4, no. 1, pp. 25–30, 1965.
[21] J. S. Beis and D. G. Lowe, "Shape indexing using approximate nearest-neighbour search in high-dimensional spaces," in CVPR, 1997.
