Supervised Training of Dense Object Nets Using Optimal Descriptors for Industrial Robotic Applications
Total Page:16
File Type:pdf, Size:1020Kb
The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21) Supervised Training of Dense Object Nets using Optimal Descriptors for Industrial Robotic Applications Andras Gabor Kupcsik1, Markus Spies1, Alexander Klein2, Marco Todescato1, Nicolai Waniek1, Philipp Schillinger1, Mathias Burger¨ 1 1Bosch Center for Artificial Intelligence, Renningen, Germany 2Technische Universitat¨ Darmstadt, Darmstadt, Germany [email protected] Abstract DONs can be readily applied to learn arbitrary objects with relative ease, including non-rigid objects. The self- Dense Object Nets (DONs) by Florence, Manuelli and supervised training objective of DONs uses contrastive loss Tedrake (2018) introduced dense object descriptors as a novel visual object representation for the robotics community. It is (Hadsell, Chopra, and LeCun 2006) between pixels of image suitable for many applications including object grasping, pol- pairs depicting the object and its environment. The pixel- icy learning, etc. DONs map an RGB image depicting an ob- wise contrastive loss minimizes descriptor space distance ject into a descriptor space image, which implicitly encodes between corresponding pixels (pixels depicting the same key features of an object invariant to the relative camera pose. point on the object surface in an image pair) and pushes Impressively, the self-supervised training of DONs can be ap- away non-correspondences. In essence, minimizing the con- plied to arbitrary objects and can be evaluated and deployed trastive loss defined on descriptors of pixels leads to a view within hours. However, the training approach relies on accu- invariant map of the object surface in descriptor space. rate depth images and faces challenges with small, reflective objects, typical for industrial settings, when using consumer The contrastive loss formulation belongs to the broader grade depth cameras. In this paper we show that given a 3D class of projective nonlinear dimensionality reduction tech- model of an object, we can generate its descriptor space im- niques (Van Der Maaten, Postma, and Van Den Herik 2009). age, which allows for supervised training of DONs. We rely The motivation behind most of these methods is to perform on Laplacian Eigenmaps (LE) to embed the 3D model of an a mapping from an input manifold to a typically lower di- object into an optimally generated space. While our approach mensional output space while ensuring that similar inputs uses more domain knowledge, it can be efficiently applied map to similar outputs. While the contrastive loss formula- even for smaller and reflective objects, as it does not rely tion relies on a similarity indicator (e.g., matching vs. non- on depth information. We compare the training methods on matching pixels), other techniques also exploit the magni- generating 6D grasps for industrial objects and show that our tude of similarity implying local information about the data novel supervised training approach improves the pick-and- place performance in industry-relevant tasks. (e.g., ISOMAP (Tenenbaum, De Silva, and Langford 2000) and LE (Belkin and Niyogi 2003)). While generating similarity indicators, or pixel corre- Introduction spondences, is suitable for self-supervised training, its ac- curacy inherently depends on the quality of the recorded Dense object descriptors for perceptual object representa- data. Based on our observations, noisy depth data can de- tion received considerable attention in the robot learning teriorate the quality of correspondence matching especially community (Florence, Manuelli, and Tedrake 2018, 2020; in the case of smaller objects. As an alternative, in this paper Sundaresan et al. 2020). To learn and generate dense vi- we generate an optimal descriptor space embedding given a sual representation of objects, Dense Object Nets (DONs) 3D mesh model, leading to a supervised training approach. were proposed by Florence, Manuelli, and Tedrake (2018). Optimality here refers to embedding the model vertices into DONs map an h × w × 3 RGB image to its descriptor space a descriptor space with minimal distortion of their local con- map of size h × w × D, where D 2 N+ is an arbitrar- nectivity information. We rely on discrete exterior calcu- ily chosen dimensionality. DONs can be trained in a self- lus to compute the Laplacian of the object model (Crane supervised manner using a robot and a wrist-mounted con- et al. 2013), which generates the geometrically accurate lo- sumer grade RGBD camera, and can be deployed within cal connectivity information. We exploit this information in hours. Recently, several impactful applications and exten- combination with Laplacian Eigenmaps to create the corre- sions of the original approach have been shown, includ- sponding optimal descriptor space embedding. Finally, we ing rope manipulation (Sundaresan et al. 2020), behavior render target descriptor space images to input RGB images cloning (Florence, Manuelli, and Tedrake 2020) and con- depicting the object (see Fig. 1). Ultimately, our approach troller learning (Manuelli et al. 2020). generates dense object descriptors akin the original, self- Copyright c 2021, Association for the Advancement of Artificial supervised method without having to rely on depth infor- Intelligence (www.aaai.org). All rights reserved. mation. 6093 Figure 1: Illustration of the proposed approach (best viewed in color). Given the 3D model of an object (left top) we generate its optimal descriptor space embedding using Laplacian Eigenmaps (left bottom). Then, for a given input image (middle) with known object pose we render its descriptor space target (right). For illustration purposes we use 3 dimensional descriptor space. While our approach uses more domain knowledge (3D different RGB images within a scene, Ia and Ib, to evaluate model of the object and its known pose), it has several the contrastive loss. We use the object masks to sample Nm benefits over the original self-supervised method. Primar- corresponding and Nnm non-corresponding pixels. Then, ily, we do not rely on pixel-wise correspondence match- the contrastive loss consists of match loss Lm of correspond- ing based on noisy or lower quality consumer grade depth ing pixels and non-match loss Lnm of non-corresponding cameras. Thus, our approach can be applied straightfor- pixels: wardly to small, reflective, or symmetric objects, which are 1 X 2 often found in industrial processes. Furthermore, we ex- Lm = D(Ia; ua;Ib; ub) ; (1) Nm plicitly separate object and background descriptors, which Nm avoids the problem of amodal correspondence prediction 1 X 2 when using self-supervised training (Florence 2020). We Lnm = max (0;M − D(Ia; ua;Ib; ub)) ; (2) Nnm also provide a mathematical meaning for the descriptor Nnm space, which we generate optimally in a geometrical sense Lc(Ia;Ib) = Lm(Ia;Ib) + Lnm(Ia;Ib); (3) irrespective of the descriptor dimension. We believe that, overall, this improves explainability and reliability for prac- where D(Ia; ua;Ib; ub) = kf(Ia; θ)(ua) − f(Ib; θ)(ub)k2 titioners. A video summary of the paper can be found under is the descriptor space distance between pixels ua and ub of + https://youtu.be/chtswiIIZzQ. images Ia and Ib. M 2 R is an arbitrarily chosen mar- gin and f(·; θ): Rh×w×3 7! Rh×w ×D represents a fully Background convolutional network (Shelhamer, Long, and Darrell 2017) with parameters θ. This chapter briefly reviews self-supervised training of The efficiency of the self-supervised training approach Dense Object Nets (Florence, Manuelli, and Tedrake 2018) lies in the automatic generation of pixel correspondences and nonlinear dimensionality reduction by Laplacian Eigen- from registered RGBD images. Using the object mask in maps. image Ia we can sample a pixel ua and identify correspond- ing pixel ub in image Ib by reprojecting the depth informa- Self-supervised Training of DONs tion. In a similar way we can sample non-correspondences To collect data for the self-supervised training approach, we on the object and on the background. While this approach use a robot and a wrist-mounted RGBD camera. We place automatically labels tens of thousands of pixels in a single the target object to an arbitrary but fixed location in the image pair, its accuracy inherently relies on the quality of workspace of the robot. Then, using a quasi-random mo- the depth image. In practice we noticed that this consider- tion with the robot and the camera pointing towards the ably limits the accuracy of the correspondence matching for workspace, we record a scene with registered RGBD im- smaller objects with consumer grade depth sensors. For fur- ages depicting the object and its environment. To overcome ther details of the training and evaluation of DONs we refer noisy and often missing depth data of consumer grade depth the reader to (Florence, Manuelli, and Tedrake 2018). sensors, all registered RGBD images are then fused into a single 3D model and depth is recomputed for each frame. Laplacian Eigenmaps N While this will improve the overall depth image quality, we Assume a dataset of N data points is given as X = fxigi=1, noticed that in practice this also over-smooths it, which re- where x 2 M lies on a manifold. Furthermore, we are sults in a loss of geometrical details particularly for small given the connectivity information between these points as ≥0 objects. With knowledge of the object location in the fused wij 2 R . For example, if xi is a node in a graph, then model we can also compute the object mask in each frame. we can define wij to be either 1 or 0 given it is connected After recording a handful of scenes with different object to xj or not. The Laplacian Eigenmap method considers the poses, for training we repeatedly and randomly choose two problem of finding an embedding of the data points in X to 6094 N RD fyigi=1, with y 2 , such that the local connectivity wij can handle symmetries. Then we explain the full target im- measured in a Euclidean sense is preserved.