Towards Real-Time Object Recognition and Pose Estimation in Point Clouds
Marlon Marcon 1 a, Olga Regina Pereira Bellon 2 b and Luciano Silva 2 c
1 Department of Software Engineering, Federal University of Technology - Paraná, Dois Vizinhos, Brazil
2 Department of Computer Science, Federal University of Paraná, Curitiba, Brazil

Keywords: Transfer Learning, 3D Computer Vision, Feature-based Registration, ICP Dense Registration, RGB-D Images.

Abstract: Object recognition and 6DoF pose estimation are challenging tasks in computer vision applications. Despite their effectiveness on such tasks, standard methods deliver far from real-time processing rates. This paper presents a novel pipeline to estimate a fine 6DoF pose of objects, applied to realistic scenarios in real time. We split our proposal into three main parts. First, a color feature classification leverages pre-trained CNN color features, trained on ImageNet, for object detection. A feature-based registration module then conducts a coarse pose estimation, and finally, a fine-adjustment step performs an ICP-based dense registration. Our proposal achieves, in the best case, an accuracy of almost 83% on the RGB-D Scenes dataset. Regarding processing time, the object detection task runs at a frame processing rate of up to 90 FPS, and the pose estimation at almost 14 FPS in a full execution strategy. We discuss that, due to the proposal's modularity, the full execution can occur only when necessary, and a scheduled execution unlocks real-time processing, even in multitask situations.

1 INTRODUCTION

Object recognition and 6D pose estimation play a central role in a broad spectrum of computer vision applications, such as object grasping and manipulation, bin picking tasks, and industrial assembly verification (Vock et al., 2019). Successful object recognition, highly reliable pose estimation, and near real-time operation are essential capabilities and current challenges for robot perception systems.

A methodology usually employed to estimate full rigid transformations between scenes and objects is centered on a feature-based template matching approach. Assuming we have a known item or a part of an object, this technique searches for all of its occurrences in a larger and usually cluttered scene (Vock et al., 2019). However, due to natural occlusions, such occurrences may be represented only by a partial view of the object. The template is often another point cloud, and the main challenge of the template matching approach is to maintain runtime feasibility while preserving robustness.

Template matching approaches rely on RANSAC-based feature matching algorithms, following the pipeline proposed by (Aldoma et al., 2012b). The RANSAC algorithm has proven to be one of the most versatile and robust options. Unfortunately, for large or dense point clouds, its runtime becomes a significant limitation in several of the example applications mentioned above (Vock et al., 2019). When we seek a 6DoF pose estimation, real-time operation is an even more challenging task (Marcon et al., 2019). In an extensive benchmark of object detection and pose estimation, (Hodan et al., 2018) reported a runtime of a second per test target on average.

Deep learning strategies for object recognition and classification problems have been extensively studied for RGB images. As the demand for good quality labeled data increases, large datasets are becoming available, serving both as a significant benchmark of methods (deep or not) and as training data for real applications. ImageNet (Deng et al., 2009) is, undoubtedly, the most studied dataset and the de-facto standard on such recognition tasks.
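The RANSAC-based matching that underlies these template-matching pipelines can be made concrete with a small sketch: given putative 3D correspondences (in practice obtained by matching local 3D feature descriptors), minimal three-point samples are fitted in closed form via the Kabsch/SVD solution and scored by inlier count. The code below is an illustrative numpy sketch under our own naming (`kabsch`, `ransac_rigid`) and parameter choices, not the implementation of any cited work.

```python
import numpy as np

def kabsch(P, Q):
    """Closed-form least-squares rigid transform (R, t) with Q ~ R @ P + t (Kabsch/SVD)."""
    cp, cq = P.mean(0), Q.mean(0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    # reflection guard: force det(R) = +1
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = cq - R @ cp
    return R, t

def ransac_rigid(src, dst, iters=200, thresh=0.05, seed=0):
    """RANSAC over putative correspondences src[i] <-> dst[i] (both (N, 3) arrays)."""
    rng = np.random.default_rng(seed)
    best_R, best_t = np.eye(3), np.zeros(3)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), 3, replace=False)   # minimal 3-point sample
        R, t = kabsch(src[idx], dst[idx])
        residuals = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = residuals < thresh
        if inliers.sum() > best_inliers.sum():
            best_R, best_t, best_inliers = R, t, inliers
    if best_inliers.sum() >= 3:
        # refit on the whole consensus set for the final coarse pose
        best_R, best_t = kabsch(src[best_inliers], dst[best_inliers])
    return best_R, best_t, best_inliers
```

Because every hypothesis requires a residual evaluation over all correspondences, and the iteration count grows quickly with the outlier ratio, this is precisely the runtime limitation noted above for large or dense point clouds.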
This dataset presents more than 20,000 categories, but a subset with 1,000 categories, known as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), is mostly used.

a https://orcid.org/0000-0001-9748-2824
b https://orcid.org/0000-0003-2683-9704
c https://orcid.org/0000-0001-6341-1323

Marcon, M., Bellon, O. and Silva, L. Towards Real-time Object Recognition and Pose Estimation in Point Clouds. DOI: 10.5220/0010265601640174. In Proceedings of the 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2021) - Volume 5: VISAPP, pages 164-174. ISBN: 978-989-758-488-6. Copyright © 2021 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved.

Training a model on ImageNet is quite a challenging task in terms of computational resources and time consumption. Fortunately, transferring its models offers efficient solutions in different contexts, with the network acting as a black-box feature extractor. Studies such as (Agrawal et al., 2014) explore and corroborate this high capacity of transferring such models to different contexts and applications. Regarding the use of pre-trained CNN features, some approaches handle object recognition on the Washington RGB-D Object dataset: (Zia et al., 2017) use the VGG architecture, and (Caglayan et al., 2020) evaluate several popular deep networks, such as AlexNet, VGG, ResNet, and DenseNet.

This paper introduces a novel pipeline to deal with point cloud pose estimation in uncontrolled environments and cluttered scenes. Our proposed pipeline recognizes the object using color feature descriptors, crops the selected bounding box to reduce the scene's search surface, and finally estimates the object's pose with a traditional local feature-based approach. Despite adopting well-known techniques from the 2D/3D computer vision field, our proposal's novelty centers on the smooth integration between 2D and 3D methods to provide a solution that is efficient in terms of both accuracy and time.

2 BACKGROUND

Recognition systems work with objects, which are digital representations of tangible real-world items that exist physically in a scene. Such systems are unavoidably machine-learning-based approaches that use features to locate and identify objects in a scene reliably. Together with recognition, another task is to estimate the location and orientation of the detected items. In a 3D world, we estimate six degrees of freedom (6DoF), which refers to the geometric transformation representing a rigid body's movement in 3D space, i.e., the combination of a translation and a rotation.

2.1 Color Feature Extraction

As a milestone in deep learning history, (Krizhevsky et al., 2012) presented the first deep convolutional architecture employed in the ILSVRC, an 8-layer architecture dubbed AlexNet. This network was the first to prove that deep learning could beat hand-crafted methods when trained on a large-scale dataset. After that, ConvNets became more accurate, deeper, and bigger in terms of parameters. (Simonyan and Zisserman, 2015) proposed VGG, a network that doubled the depth of AlexNet while exploring tiny (3 × 3) filters, and became the runner-up of the ILSVRC, one step behind GoogLeNet (Szegedy et al., 2015), with 22 layers. GoogLeNet relies on the Inception architecture (Szegedy et al., 2016). Another family of ConvNets, the ResNets (He et al., 2016), uses the concept of residual blocks: skip-connection blocks that learn residual functions with respect to their input. Many architectures have been proposed based on these findings, such as ResNets with 50, 101, and 152 layers (He et al., 2016). Also building on the residual block, (Xie et al., 2017) developed the ResNeXt architecture, whose blocks consist of parallel ResNet-like branches whose outputs are summed before the residual calculation. Some architectures target the use of deep learning features on resource-limited devices, such as smartphones and embedded systems; the most prominent is MobileNet (Sandler et al., 2018). Another family of leading networks is EfficientNet (Tan and Le, 2019), which, relying on such lighter designs, proposes very deep architectures without compromising resource efficiency.

2.2 Pose Estimation

As presented in (Aldoma et al., 2012b), a comprehensive registration process usually consists of two steps: coarse and fine registration. A coarse registration transformation can be produced by manual alignment, by motion tracking or, most commonly, by local feature matching. Local-feature-matching-based algorithms automatically obtain corresponding points from two or more point clouds and coarsely register them by minimizing the distance between those correspondences. These methods have been extensively studied and have been confirmed to be reliable and computationally efficient (Guo et al., 2016). After the point clouds are coarsely registered, a fine-registration algorithm iteratively refines the initial estimate. Examples of fine-registration algorithms include the ICP algorithm, which performs point-to-point alignment (Besl and McKay, 1992), and its point-to-plane variant (Chen and Medioni, 1992). These algorithms are suitable for matching between point clouds of isolated scenes (3D registration) or between a scene and a model (3D object recognition). This proposal adopted two approaches to generate the initial alignment: a traditional feature-based RANSAC and the Fast Global Registration (FGR) (Zhou et al., 2016).

3 PROPOSED APPROACH

In this section, we explain our proposed approach in detail. Our pipeline starts from an RGB image and its corresponding point cloud, generated from the RGB and depth images. These inputs are submitted to our three-stage architecture: color feature classification, feature-based registration, and fine adjustment. The database is composed of information concerning each item, as well as their extracted features. We chose a local-descriptors-based approach to estimate the object's pose. For each instance of an object, we store several partial views of it. Among these views, our method selects the one most likely to correspond to the object in the scene. Based on the predicted objects' classes, we can
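The fine-adjustment stage referred to above is typically the classic point-to-point ICP of (Besl and McKay, 1992): alternate a nearest-neighbour correspondence search with a closed-form least-squares rigid update until the mean error stops improving. The following minimal numpy sketch is our own illustration of that loop, with brute-force neighbour search; an optimized implementation would use a k-d tree and additional correspondence filtering.

```python
import numpy as np

def icp_point_to_point(src, dst, iters=50, tol=1e-8):
    """Point-to-point ICP: iteratively align (N, 3) cloud src onto (M, 3) cloud dst."""
    R, t = np.eye(3), np.zeros(3)
    prev_err = np.inf
    for _ in range(iters):
        moved = src @ R.T + t
        # brute-force nearest-neighbour search (a k-d tree in practice)
        d2 = ((moved[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        nn = d2.argmin(1)
        err = d2[np.arange(len(src)), nn].mean()
        if abs(prev_err - err) < tol:
            break
        prev_err = err
        # closed-form least-squares rigid update (Kabsch/SVD) on current pairs
        P, Q = src, dst[nn]
        cp, cq = P.mean(0), Q.mean(0)
        U, _, Vt = np.linalg.svd((P - cp).T @ (Q - cq))
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T
        t = cq - R @ cp
    return R, t
```

Given a reasonable coarse initialization, for example from feature-based RANSAC or FGR, the loop converges in a few iterations; with a poor initialization it may settle in a local minimum, which is why the coarse stage matters.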