Object Detection and Pose Estimation from RGB and Depth Data for Real-Time, Adaptive Robotic Grasping

Object Detection and Pose Estimation from RGB and Depth Data for Real-time, Adaptive Robotic Grasping Shuvo Kumar Paul1, Muhammed Tawfiq Chowdhury1, Mircea Nicolescu1, Monica Nicolescu1, David Feil-Seifer1 Abstract—In recent times, object detection and pose estimation to the changing environment while interacting with their have gained significant attention in the context of robotic vision surroundings, and a key component of this interaction is the applications. Both the identification of objects of interest as well reliable grasping of arbitrary objects. Consequently, a recent as the estimation of their pose remain important capabilities in order for robots to provide effective assistance for numerous trend in robotics research has focused on object detection and robotic applications ranging from household tasks to industrial pose estimation for the purpose of dynamic robotic grasping. manipulation. This problem is particularly challenging because However, identifying objects and recovering their poses are of the heterogeneity of objects having different and potentially particularly challenging tasks as objects in the real world complex shapes, and the difficulties arising due to background are extremely varied in shape and appearance. Moreover, clutter and partial occlusions between objects. As the main contribution of this work, we propose a system that performs cluttered scenes, occlusion between objects, and variance in real-time object detection and pose estimation, for the purpose lighting conditions make it even more difficult. Additionally, of dynamic robot grasping. The robot has been pre-trained to the system needs to be sufficiently fast to facilitate real-time perform a small set of canonical grasps from a few fixed poses robotic tasks. As a result, a generic solution that can address for each object. When presented with an unknown object in an all these problems remains an open challenge. arbitrary pose, the proposed approach allows the robot to detect the object identity and its actual pose, and then adapt a canonical While classification [1–6], detection [7–12], and segmenta- grasp in order to be used with the new pose. For training, the tion [13–15] of objects from images have taken a significant system defines a canonical grasp by capturing the relative pose step forward - thanks to deep learning, the same has not yet of an object with respect to the gripper attached to the robot’s happened to 3D localization and pose estimation. One primary wrist. During testing, once a new pose is detected, a canonical reason was the lack of labeled data in the past as it is not grasp for the object is identified and then dynamically adapted by adjusting the robot arm’s joint angles, so that the gripper practical to manually infer, thus As a result, the recent research can grasp the object in its new pose. We conducted experiments trend in the deep learning community for such applications using a humanoid PR2 robot and showed that the proposed has shifted towards synthetic datasets [16–20]. Several pose framework can detect well-textured objects, and provide accurate estimation methods leveraging deep learning techniques [21– pose estimation in the presence of tolerable amounts of out-of- 25] use these synthetic datasets for training and have shown plane rotation. The performance is also illustrated by the robot successfully grasping objects from a wide range of arbitrary satisfactory accuracy. poses. Although synthetic data is a promising alternative, capable Index Terms—pose estimation, robotics, robotic grasp, homog- of generating large amounts of labeled data, it requires photo- raphy realistic 3D models of the objects to mirror the real-world scenario. Hence, generating synthetic data for each newly I. INTRODUCTION introduced object needs photo-realistic 3D models and thus arXiv:2101.07347v1 [cs.RO] 18 Jan 2021 significant effort from skilled 3D artists. Furthermore, training Current advances in robotics and autonomous systems have and running deep learning models are not feasible without high expanded the use of robots in a wide range of robotic tasks computing resources as well. As a result, object detection and including assembly, advanced manufacturing, human-robot or pose estimation in real-time with computationally moderate robot-robot collaboration. In order for robots to efficiently machines remain a challenging problem. To address these perform these tasks, they need to have the ability to adapt issues, we have devised a simpler pipeline that does not rely *This work has been supported in part by the Office of Naval Research on high computing resources and focuses on planar objects, award N00014-16-1-2312 and US Army Research Laboratory (ARO) award requiring only an RGB image and the depth information in W911NF-20-2-0084. order to infer real-time object detection and pose estimation. 1Contact author: Shuvo Kumar Paul, Muhammed Tawfiq Chowdhury, Mircea Nicolescu, Monica Nicolescu, and David Feil- In this work, we present a feature-detector-descriptor based Seifer are affiliated with the Department of Computer Science and method for detection and a homography based pose esti- Engineering, University of Nevada, Reno, 1664 North Virginia Street, mation technique where, by utilizing the depth information, Reno, Nevada 89557, USA [email protected], [email protected], [email protected], we estimate the pose of an object in terms of a 2D planar [email protected], [email protected] representation in 3D space. The robot is pre-trained to perform a set of canonical grasps; a canonical grasp describes how a fast execution time and is very useful for matching images. robotic end-effector should be placed relative to an object in a E. Rublee et al. presented ORB [44] (Oriented FAST and fixed pose so that it can securely grasp it. Afterward, the robot Rotated Brief) which is a combination of modified FAST is able to detect objects and estimates their pose in real-time, (Features from Accelerated Segment Test) for feature detection and then adapt the pre-trained canonical grasp to the new pose and BRIEF for description. S. Leutnegger et al. designed of the object of interest. We demonstrate that the proposed BRISK [45] (Binary Robust Invariant Scale Keypoint) that method can detect a well-textured planar object and estimate detects corners using AGAST and filters them using FAST. On its accurate pose within a tolerable amount of out-of-plane the other hand, FREAK (Fast Retina Key-point), introduced rotation. We also conducted experiments with the humanoid by Alahi et al. [46] generates retinal sampling patterns using PR2 robot to show the applicability of the framework where a circular sampling grid and uses a binary descriptor, formed the robot grasped objects by adapting to a range of different by a one bit difference of Gaussians (DoG). Alcantarilla et al. poses. introduced KAZE [47] features that exploit non-linear scale- space using non-linear diffusion filtering and later extended II. RELATED WORK it to AKAZE [48] where they replaced it with a more Our work constitutes of three modules: object detection, computationally efficient method called FED (Fast Explicit planar pose estimation, and adaptive grasping. In the following Diffusion) [49, 50]. sub-sections, several fields of research that are closely related In our work, we have selected four methods to investigate: to our work are reviewed. SIFT, SURF, FAST+BRISK, AKAZE. A. Object Detection B. Planar Pose Estimation Object detection has been one of the fundamental challenges Among the many techniques in literature on pose estimation, in the field of computer vision and in that aspect, the in- we focus our review on those related to planar pose estimation. troduction of feature detectors and descriptors represents a In recent years, planar pose estimation has been increasingly great achievement. Over the past decades, many detectors, becoming popular in many fields, such as robotics and aug- descriptors, and their numerous variants have been presented mented reality. in the literature. The applications of these methods have widely extended to numerous other vision applications such Simon et. al [51] proposed a pose estimation technique as panorama stitching, tracking, visual navigation, etc. for planar structures using homography projection and by computing camera pose from consecutive images. Changhai One of the first feature detectors was proposed by Harris et et. al [52] presented a method to robustly estimate 3D poses of al. [26] (widely known as the Harris corner detector). Later planes by applying a weighted incremental normal estimation Tomasi et al. [27] developed the KLT (Kanade-Lucas-Tomasi) method that uses Bayesian inference. Donoser et al. [53] tracker based on the Harris corner detector. Shi and Tomasi utilized the properties of Maximally Stable Extremal Regions introduced a new detection metric GFTT [28] (Good Features (MSERs [54]) to construct a perspectively invariant frame on To Track) and argued that it offered superior performance. the closed contour to estimate the planar pose. In our approach, Hall et al. introduced the concept of saliency [29] in terms of we applied perspective transformation to approximate a set of the change in scale and evaluated the Harris method proposed corresponding points on the test image for estimating the basis in [30] and the Harris Laplacian corner detector [31] where a vectors of the object surface and used the depth information to Harris detector and a Laplacian function are combined. estimate the 3D pose by computing the normal to the planar Motivated by the need for a scale-invariant feature detector, object. in 2004 Lowe [32] published one of the most influential papers in computer vision, SIFT (Scale Invariant Feature Transform). C. Adaptive Grasping SIFT is both a feature point detector and descriptor. H. Bay et al. [33] proposed SURF (Speeded Up Robust Features) in Designing an adaptive grasping system is challenging due 2008. But both of these methods are computationally expen- to the complex nature of the shapes of objects.

Object Detection and Pose Estimation from RGB and Depth Data for Real-Time, Adaptive Robotic Grasping

3D Object Modeling and Recognition Using Local Affine-Invariant Image Descriptors and Multi-View Spatial Constraints

View Matching with Blob Features

3D Object Modeling and Recognition from Photographs and Image Sequences

Computer Vision Based Human Detection Md Ashikur

Local Feature Filtering for 3D Object Recognition

Object Recognition Using Hough-Transform Clustering of SURF Features

3D Object Processing and Image Processing by Numerical Methods Abdul Rahman El Sayed

2D/3D Object Recognition and Categorization Approaches for Robotic Grasping Nabila Zrira, Mohamed Hannat, El Houssine Bouyakhf, Haris Ahmad Khan

Distinctive Image Features from Scale-Invariant Keypoints

SIFTSIFT -- Thethe Scalescale Invariantinvariant Featurefeature Transformtransform

Lecture 12: Local Descriptors, SIFT & Single Object Recognition

Contributions to a Fast and Robust Object Recognition in Images Jérôme Revaud