The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

Supervised Training of Dense Object Nets using Optimal Descriptors for Industrial Robotic Applications

Andras Gabor Kupcsik1, Markus Spies1, Alexander Klein2, Marco Todescato1, Nicolai Waniek1, Philipp Schillinger1, Mathias Bürger1
1Bosch Center for Artificial Intelligence, Renningen, Germany
2Technische Universität Darmstadt, Darmstadt, Germany
[email protected]

Abstract

Dense Object Nets (DONs) by Florence, Manuelli and Tedrake (2018) introduced dense object descriptors as a novel visual object representation for the robotics community. It is suitable for many applications including object grasping, policy learning, etc. DONs map an RGB image depicting an object into a descriptor space image, which implicitly encodes key features of an object invariant to the relative camera pose. Impressively, the self-supervised training of DONs can be applied to arbitrary objects and can be evaluated and deployed within hours. However, the training approach relies on accurate depth images and faces challenges with small, reflective objects, typical for industrial settings, when using consumer grade depth cameras. In this paper we show that given a 3D model of an object, we can generate its descriptor space image, which allows for supervised training of DONs. We rely on Laplacian Eigenmaps (LE) to embed the 3D model of an object into an optimally generated space. While our approach uses more domain knowledge, it can be efficiently applied even for smaller and reflective objects, as it does not rely on depth information. We compare the training methods on generating 6D grasps for industrial objects and show that our novel supervised training approach improves the pick-and-place performance in industry-relevant tasks.

Introduction

Dense object descriptors for perceptual object representation received considerable attention in the robot learning community (Florence, Manuelli, and Tedrake 2018, 2020; Sundaresan et al. 2020). To learn and generate dense visual representations of objects, Dense Object Nets (DONs) were proposed by Florence, Manuelli, and Tedrake (2018). DONs map an h × w × 3 RGB image to its descriptor space map of size h × w × D, where D ∈ N+ is an arbitrarily chosen dimensionality. DONs can be trained in a self-supervised manner using a robot and a wrist-mounted consumer grade RGBD camera, and can be deployed within hours. Recently, several impactful applications and extensions of the original approach have been shown, including rope manipulation (Sundaresan et al. 2020), behavior cloning (Florence, Manuelli, and Tedrake 2020) and controller learning (Manuelli et al. 2020).

DONs can be readily applied to learn arbitrary objects with relative ease, including non-rigid objects. The self-supervised training objective of DONs uses contrastive loss (Hadsell, Chopra, and LeCun 2006) between pixels of image pairs depicting the object and its environment. The pixelwise contrastive loss minimizes descriptor space distance between corresponding pixels (pixels depicting the same point on the object surface in an image pair) and pushes away non-correspondences. In essence, minimizing the contrastive loss defined on descriptors of pixels leads to a view invariant map of the object surface in descriptor space.

The contrastive loss formulation belongs to the broader class of projective nonlinear dimensionality reduction techniques (Van Der Maaten, Postma, and Van Den Herik 2009). The motivation behind most of these methods is to perform a mapping from an input manifold to a typically lower dimensional output space while ensuring that similar inputs map to similar outputs. While the contrastive loss formulation relies on a similarity indicator (e.g., matching vs. non-matching pixels), other techniques also exploit the magnitude of similarity implying local information about the data (e.g., ISOMAP (Tenenbaum, De Silva, and Langford 2000) and LE (Belkin and Niyogi 2003)).

While generating similarity indicators, or pixel correspondences, is suitable for self-supervised training, its accuracy inherently depends on the quality of the recorded data. Based on our observations, noisy depth data can deteriorate the quality of correspondence matching, especially in the case of smaller objects. As an alternative, in this paper we generate an optimal descriptor space embedding given a 3D mesh model, leading to a supervised training approach. Optimality here refers to embedding the model vertices into a descriptor space with minimal distortion of their local connectivity information. We rely on discrete exterior calculus to compute the Laplacian of the object model (Crane et al. 2013), which generates the geometrically accurate local connectivity information. We exploit this information in combination with Laplacian Eigenmaps to create the corresponding optimal descriptor space embedding. Finally, we render target descriptor space images for input RGB images depicting the object (see Fig. 1). Ultimately, our approach generates dense object descriptors akin to the original, self-supervised method without having to rely on depth information.

Figure 1: Illustration of the proposed approach (best viewed in color). Given the 3D model of an object (left top) we generate its optimal descriptor space embedding using Laplacian Eigenmaps (left bottom). Then, for a given input image (middle) with known object pose we render its descriptor space target (right). For illustration purposes we use a 3-dimensional descriptor space.

While our approach uses more domain knowledge (3D model of the object and its known pose), it has several benefits over the original self-supervised method. Primarily, we do not rely on pixel-wise correspondence matching based on noisy or lower quality consumer grade depth cameras. Thus, our approach can be applied straightforwardly to small, reflective, or symmetric objects, which are often found in industrial processes. Furthermore, we explicitly separate object and background descriptors, which avoids the problem of amodal correspondence prediction when using self-supervised training (Florence 2020). We also provide a mathematical meaning for the descriptor space, which we generate optimally in a geometrical sense irrespective of the descriptor dimension. We believe that, overall, this improves explainability and reliability for practitioners. A video summary of the paper can be found at https://youtu.be/chtswiIIZzQ.

Background

This chapter briefly reviews self-supervised training of Dense Object Nets (Florence, Manuelli, and Tedrake 2018) and nonlinear dimensionality reduction by Laplacian Eigenmaps.

Self-supervised Training of DONs

To collect data for the self-supervised training approach, we use a robot and a wrist-mounted RGBD camera. We place the target object at an arbitrary but fixed location in the workspace of the robot. Then, using a quasi-random motion with the robot and the camera pointing towards the workspace, we record a scene with registered RGBD images depicting the object and its environment. To overcome noisy and often missing depth data of consumer grade depth sensors, all registered RGBD images are then fused into a single 3D model and depth is recomputed for each frame. While this will improve the overall depth image quality, we noticed that in practice this also over-smooths it, which results in a loss of geometrical details particularly for small objects. With knowledge of the object location in the fused model we can also compute the object mask in each frame.

After recording a handful of scenes with different object poses, for training we repeatedly and randomly choose two different RGB images within a scene, $I_a$ and $I_b$, to evaluate the contrastive loss. We use the object masks to sample $N_m$ corresponding and $N_{nm}$ non-corresponding pixels. Then, the contrastive loss consists of the match loss $L_m$ of corresponding pixels and the non-match loss $L_{nm}$ of non-corresponding pixels:

$$L_m = \frac{1}{N_m} \sum_{N_m} D(I_a, u_a, I_b, u_b)^2, \qquad (1)$$

$$L_{nm} = \frac{1}{N_{nm}} \sum_{N_{nm}} \max\big(0,\, M - D(I_a, u_a, I_b, u_b)\big)^2, \qquad (2)$$

$$L_c(I_a, I_b) = L_m(I_a, I_b) + L_{nm}(I_a, I_b), \qquad (3)$$

where $D(I_a, u_a, I_b, u_b) = \| f(I_a; \theta)(u_a) - f(I_b; \theta)(u_b) \|_2$ is the descriptor space distance between pixels $u_a$ and $u_b$ of images $I_a$ and $I_b$. $M \in \mathbb{R}^+$ is an arbitrarily chosen margin and $f(\cdot\,; \theta): \mathbb{R}^{h \times w \times 3} \mapsto \mathbb{R}^{h \times w \times D}$ represents a fully convolutional network (Shelhamer, Long, and Darrell 2017) with parameters $\theta$.

The efficiency of the self-supervised training approach lies in the automatic generation of pixel correspondences from registered RGBD images. Using the object mask in image $I_a$ we can sample a pixel $u_a$ and identify the corresponding pixel $u_b$ in image $I_b$ by reprojecting the depth information. In a similar way we can sample non-correspondences on the object and on the background. While this approach automatically labels tens of thousands of pixels in a single image pair, its accuracy inherently relies on the quality of the depth image. In practice we noticed that this considerably limits the accuracy of the correspondence matching for smaller objects with consumer grade depth sensors. For further details of the training and evaluation of DONs we refer the reader to (Florence, Manuelli, and Tedrake 2018).
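For concreteness, the following is a minimal sketch of the pixelwise contrastive loss of Eqs. (1)–(3), assuming descriptors at matching and non-matching pixel locations have already been sampled from the two descriptor images; the function and variable names are illustrative and not taken from the original implementation.

```python
import torch

def contrastive_loss(desc_a_m, desc_b_m, desc_a_nm, desc_b_nm, margin=0.5):
    """Pixelwise contrastive loss, Eqs. (1)-(3).

    desc_a_m, desc_b_m:   (N_m, D) descriptors at corresponding pixels of I_a, I_b.
    desc_a_nm, desc_b_nm: (N_nm, D) descriptors at non-corresponding pixels.
    margin:               the margin M of Eq. (2).
    """
    # Eq. (1): pull descriptors of corresponding pixels together.
    d_match = (desc_a_m - desc_b_m).norm(dim=1)
    loss_match = (d_match ** 2).mean()

    # Eq. (2): push descriptors of non-corresponding pixels at least `margin` apart.
    d_nonmatch = (desc_a_nm - desc_b_nm).norm(dim=1)
    loss_nonmatch = (torch.clamp(margin - d_nonmatch, min=0.0) ** 2).mean()

    # Eq. (3): total contrastive loss for the image pair.
    return loss_match + loss_nonmatch
```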

Laplacian Eigenmaps

Assume a dataset of N data points is given as $X = \{x_i\}_{i=1}^N$, where $x_i \in \mathcal{M}$ lies on a manifold. Furthermore, we are given the connectivity information between these points as $w_{ij} \in \mathbb{R}^{\geq 0}$. For example, if $x_i$ is a node in a graph, then we can define $w_{ij}$ to be either 1 or 0 given it is connected to $x_j$ or not. The Laplacian Eigenmap method considers the problem of finding an embedding of the data points in X to $\{y_i\}_{i=1}^N$, with $y_i \in \mathbb{R}^D$, such that the local connectivity $w_{ij}$ measured in a Euclidean sense is preserved. We can express this objective as a constrained optimization problem

$$Y^* = \arg\min_{Y} \frac{1}{2} \sum_{j=1}^{N} \sum_{i=1}^{N} w_{ij} \, \| y_i - y_j \|_2^2 \qquad (4)$$

$$\text{s.t.} \quad Y C Y^T = I, \qquad (5)$$

where $Y = [y_1, \ldots, y_N] \in \mathbb{R}^{D \times N}$, $C_{ii} = \sum_j A_{ij}$ is a diagonal matrix with $A_{ji} = A_{ij} = w_{ij}$ as the connectivity matrix. $C \in \mathbb{R}^{N \times N}$ measures how strongly the data points are connected. The constraint removes bias and enforces unit variance in each dimension of the embedding space. As it was shown in the seminal paper by Belkin and Niyogi (2003), this optimization problem can be expressed using the Laplacian $L \in \mathbb{R}^{N \times N}$ as

$$Y^* = \arg\min_{Y} \mathrm{Tr}\left( Y L Y^T \right) \quad \text{s.t.} \quad Y C Y^T = I, \qquad (6)$$

where $L = C - A$ is a positive semi-definite, symmetric matrix. The solution to this optimization problem can be found by solving a generalized eigenvalue problem

$$L Y^T = \mathrm{diag}(\lambda) \, C Y^T, \qquad (7)$$

with eigenvectors corresponding to the dimensions of y (or rows of Y) and non-decreasing real eigenvalues $\lambda \in \mathbb{R}^{\geq 0}$. Interestingly, the first dimension is constant for each data point, i.e., $Y_{1,:} = \text{const}$ and $\lambda_1 = 0$. This corresponds to embedding every data point in X to a single point. For practical reasons the very first dimension is ignored (D = 0). Therefore, choosing D = 1 corresponds to embedding X on a line, D = 2 to a plane, etc. Given that the Laplacian is symmetric positive semi-definite (when we use the definition L = C − A), the eigenvectors are orthogonal to each other. Additionally, with the optimal solution we have $\mathrm{diag}(\lambda) = Y^* L Y^{*T}$, that is, the eigenvalues represent the embedding error in each dimension. Consequently, as these eigenvalues are non-decreasing, we always arrive at an optimal embedding irrespective of the dimensionality D.

Note that both contrastive learning and Laplacian Eigenmaps achieve a highly related objective, that is, minimizing distances in embedding space between related, or similar, data. However, they differ in how they avoid the trivial solution (mapping to a single point): either by pushing dissimilar data away by a margin, or by normalizing the solution. Finally, the contrastive formulation gives rise to learning parametric models (e.g., neural networks), while LE directly generates the embedding from the Laplacian, which is specific to a manifold (e.g., a 3D mesh).

Method

In this section we describe our proposed solution to generate optimal training targets for input RGB images and describe the supervised training setup. We first describe how we generate an optimal descriptor space map for a 3D object that is represented by a triangle mesh and how we can handle symmetries. Then we explain the full target image generation and supervised training pipeline.

Note that the contrastive loss of the self-supervised training approach is defined over pixel descriptors of RGB images depicting the object. In our case, we exploit the LE embedding directly on the mesh of the model. This allows us to exploit the geometry of the object and to address optimality of the embedding. As a second step we render the descriptor space representation of the object to generate descriptor space images for the supervised training objective.

Embedding Object Models using Laplacian Eigenmaps

We assume that a 3D model of an object is a triangle mesh consisting of N = |V| vertices V, edges E and faces F. We consider the mesh as a Riemannian manifold $\mathcal{M}$ embedded in $\mathbb{R}^3$. The vertices $x_i \in \mathcal{M}$ are points on this manifold. We are looking for the descriptor space maps of the vertices $y_i \in \mathbb{R}^D$. To apply the LE solution we need to compute the Laplacian of the object model. For triangle meshes this can be computed based on discrete exterior calculus (Crane et al. 2013). For an efficient and easy to implement solution we refer to (Sharp, Soliman, and Crane 2019). After computing the Laplacian, we can solve the same generalized eigenvalue problem as in Eq. (7) to compute the D-dimensional descriptor space embedding of the vertices $Y \in \mathbb{R}^{D \times N}$. Fig. 2 illustrates the solution of the eigenvalue problem for the Stanford bunny and the resulting descriptor space embedding with D = 3. We project the solution on the vertices of the object model and use a renderer to color the faces.

Handling Symmetry

So far we map every vertex of the mesh to a unique point in descriptor space. However, for objects with symmetric geometric features, typical in industrial settings, this approach will assign unique descriptors to indistinguishable vertices. Consider the case of the torus in Fig. 3. The mesh is invariant to rotations around the z-axis. If we apply the Laplacian embedding approach, we end up with descriptors that do not preserve this symmetry (top right in Fig. 3). The descriptor values will be assigned purely based on the ordering of the vertices in the mesh. Instead, we are looking for an embedding as in the bottom of Fig. 3, which only differentiates between "inside-outside" vertices and which appears invariant to rotation around the z-axis.

To overcome this problem, we have to (i) detect intrinsic symmetries of shapes and (ii) compress symmetric embeddings, such that symmetric vertices map to the same descriptor. Ovsjanikov, Sun, and Guibas (2008) discuss an approach for detecting intrinsic symmetries for shapes represented as compact (Riemannian) manifolds. In particular, they showed that a shape has intrinsic symmetry if the eigenvectors of the Laplacian, that is, its Euclidean embedding, appear symmetric. Following this result, Wang et al. (2014) defined Global Intrinsic Symmetry Invariant Functions (GISIFs) on vertices that preserve their value in the face of any homeomorphism (such as rotation around the z-axis in the case of the torus).
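To make the embedding step of the previous subsection concrete, the following is a minimal sketch of embedding the vertices of a triangle mesh by solving the generalized eigenvalue problem of Eq. (7) with SciPy. For brevity it builds a simple graph Laplacian with unit edge weights from the mesh faces instead of the geometrically accurate discrete-exterior-calculus Laplacian used in the paper; the function name is illustrative.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def mesh_descriptor_embedding(faces, num_vertices, dim=3):
    """Embed mesh vertices into a `dim`-dimensional descriptor space via Eq. (7).

    faces: (F, 3) integer array of triangle vertex indices.
    Returns a (num_vertices, dim) array of per-vertex descriptors.
    """
    # Build a symmetric adjacency matrix A with unit weights from the mesh edges.
    rows, cols = [], []
    for tri in faces:
        for i, j in ((0, 1), (1, 2), (2, 0)):
            rows += [tri[i], tri[j]]
            cols += [tri[j], tri[i]]
    data = np.ones(len(rows))
    A = sp.coo_matrix((data, (rows, cols)), shape=(num_vertices, num_vertices)).tocsr()
    A.data[:] = 1.0                                   # binary weights w_ij

    C = sp.diags(np.asarray(A.sum(axis=1)).ravel())   # C_ii = sum_j A_ij
    L = C - A                                         # graph Laplacian L = C - A

    # Generalized eigenvalue problem L y = lambda C y (Eq. (7)); a tiny diagonal
    # shift keeps the shift-invert factorization well posed, since L is singular.
    L_reg = L + 1e-8 * sp.identity(num_vertices)
    vals, vecs = eigsh(L_reg, k=dim + 1, M=C, sigma=0, which="LM")
    vecs = vecs[:, np.argsort(vals)]

    # Drop the constant first eigenvector (lambda_1 = 0) and keep the next `dim`.
    return vecs[:, 1:dim + 1]
```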

Figure 2: Illustration of the Laplacian Eigenmap embedding of the Stanford bunny (best viewed in color). (left): the triangle mesh representation, (middle three): the first three eigenvectors projected on the mesh, scaled between [0, 1] and visualized with a hot color map, (right): the descriptor space representation of the mesh visualized with red corresponding to the first eigenvector, or descriptor dimension, green for the second and blue for the third.

Figure 3: (top left) the mesh of the torus, (top right) the asymmetric 3-dimensional descriptor projected on the mesh, (bottom) the symmetric 3-dimensional descriptor projected on the mesh.

They also showed that such a GISIF can be composed of eigenvectors of the Laplacian. In particular, they propose that in case of identical consecutive eigenvalues $\lambda_i = \lambda_{i+1} = \cdots = \lambda_{i+L}$, such a GISIF is the squared sum of the corresponding eigenvectors, $y_{\mathrm{sym}} = \sum_{i}^{i+L} y_i^2$. In practice eigenvalues are rarely identical due to numerical limitations. To resolve this problem we heuristically consider eigenvalues within an ε-ball as identical.

Generating Target Images and Network Training

Given the optimal embedding of an object model in descriptor space and the registered RGB images of a static scene, we can now generate the target descriptor space images for training. In the collected dataset we have K registered RGB images $I_i$ with camera extrinsics $T_i^c \in SE(3)$, $\{I_i, T_i^c\}_{i=1}^K$, depicting the object and its environment. We compute the descriptor space embedding of the model and we assume the object pose in world coordinates $T^o \in SE(3)$ is known. To generate target descriptor space images $I_i^d$ we can use a rendering engine to project the descriptor space model to the image plane of the camera, see Fig. 1 for an illustration. The pose of the object in the camera coordinate system can be computed from the camera poses and the object pose with $T_i^{o,c} = (T_i^c)^{-1} T^o$. To generate the background image we first normalize the descriptor dimensions between [0, 1]. Then, as background we choose the descriptor which is furthest away from the object descriptors within a unit cube. By explicitly separating object and background it becomes more unlikely to predict object descriptors in the background, which may occur when using the self-supervised training approach, as also reported in (Florence 2020).

To train the network we rely on an ℓ2 loss between the DON output and the generated descriptor space images. However, in a given image typically the amount of pixels representing the object is significantly lower than that of the background. Therefore, we separate the total loss into object and background loss and normalize them with the amount of pixels they occupy:

$$\theta^* = \arg\min_{\theta} \sum_{i=1}^{K} \left( \frac{1}{P_{i,\mathrm{obj}}} \left\| M_{i,\mathrm{obj}} \circ \left( f(I_i; \theta) - I_i^d \right) \right\|_2^2 + \frac{1}{P_{i,\mathrm{back}}} \left\| M_{i,\mathrm{back}} \circ \left( f(I_i; \theta) - I_i^d \right) \right\|_2^2 \right), \qquad (8)$$

where $\theta$ are the network parameters, $P_{i,\mathrm{obj/back}}$ is the amount of pixels the object, or background, occupies in the image $I_i$ and $M_{i,\mathrm{obj/back}}$ is the object or background mask. For convenience we overloaded the notation M here, which refers to the margin in the self-supervised training approach and to the mask here. The mask can be generated via rendering the object in the image plane, similar to how we generate the target images $I_i^d$.
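The following is a minimal sketch of the per-image term of the supervised objective in Eq. (8), assuming the rendered target descriptor image and a boolean object mask are available as tensors; names are illustrative rather than taken from the authors' implementation.

```python
import torch

def supervised_descriptor_loss(pred, target, obj_mask):
    """Masked, pixel-count-normalized l2 loss of Eq. (8) for a single image.

    pred, target: (H, W, D) predicted and rendered target descriptor images.
    obj_mask:     (H, W) boolean mask, True on the object.
    """
    sq_err = ((pred - target) ** 2).sum(dim=-1)          # per-pixel squared l2 error
    obj = obj_mask.float()
    back = 1.0 - obj
    # Normalize each term by the number of pixels it covers, as in Eq. (8).
    loss_obj = (sq_err * obj).sum() / obj.sum().clamp(min=1.0)
    loss_back = (sq_err * back).sum() / back.sum().clamp(min=1.0)
    return loss_obj + loss_back
```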

We compare the training pipeline of the original self-supervised DON framework (Florence, Manuelli, and Tedrake 2018) and our proposed supervised approach in Fig. 4. We use the same static scene recording with a wrist-mounted camera for both training approaches. While the self-supervised training approach takes the depth, RGB, and camera extrinsic parameters, ours takes the model pose in world coordinates and its mesh instead of depth images. Then, both approaches perform data preprocessing, 3D model fusion, depth and mask rendering for the contrastive loss, object descriptor generation and descriptor image rendering for the ℓ2 loss. Finally, both training approaches optimize the network parameters.

Figure 4: Comparison of the self-supervised and the supervised training (ours) pipelines. The first utilizes depth images to find pixel correspondences and relies on the contrastive pixel-wise loss. Ours exploits the object model and its pose in world coordinates to render training target images, which will be used in the loss function (Eq. (8)).

View-dependent Descriptors

Note that our framework can be extended with additional descriptor dimensions that do not necessarily minimize the embedding objective defined in Eq. (4). In practice we noticed that the self-supervised training approach learns descriptor dimensions which mask the object and its edge, resulting in view-dependent descriptors. This is likely a consequence of the self-supervised training adapting to slightly inaccurate pixel correspondence generation in the face of camera extrinsic errors and oversmoothed depth images. As our training method relies on the same registered RGBD scene recording, to improve training robustness we considered two view-dependent descriptors as an extension to our optimally generated geometric descriptors.

First, we considered blending descriptors on the edge of the object with the background descriptor. This mimics a view-dependent edge detection descriptor that separates object and background. Second, we can use a descriptor dimension as an object mask that is 0 outside the object and gradually increases to 1 at the center of the object in the image plane. The former option smooths the optimal descriptor images around the edges, while the latter extends or replaces descriptor dimensions with an object mask. These two additional descriptors together with the optimal geometric descriptors are shown in Fig. 5 for a metallic toolcap object. Note that these features can be computed automatically from the object mask of the current frame.

Figure 5: Visualization of view-dependent descriptors. On the left side we can see the original RGB image, depicting the toolcap. Using D = 3 we generate the optimal geometric descriptor image. Then, we can either blend the object edge with the background, or generate an object mask.

We experimented with these extensions to improve grasp pose prediction tasks in practice. This extension shows on one hand the flexibility of using our approach by manipulating descriptor dimensions. On the other hand, it highlights the benefits of using the self-supervised training approach to automatically address inaccuracies in the data.

Evaluation

In this section we provide a comparison of the trained networks. While the DON framework can be applied for different robotic applications, in this paper we consider the oriented grasping of industrial work pieces. To this end we first provide a quantitative evaluation of the prediction accuracy of the networks. We measure accuracy by how consistently the network predicts descriptors for specific points on the object from different points of view. Then, we derive a grasp pose prediction pipeline and use it in a real-world robot grasp and placement scenario.

Hardware, Network and Scene Setup

We use a Franka Emika Panda 7-DoF robot with a parallel gripper (Franka 2018), equipped with a wrist-mounted Intel RealSense D435 camera (Intel 2019). The relative transformation from the end-effector to the camera coordinate system is computed with ChArUco board calibration (Garrido-Jurado et al. 2014) and the forward kinematics of the robot. We record RGB and depth images of size 640 × 480 at 100 µm resolution and at 15 Hz. For a given scene we record ∼500 images with a precomputed robot end-effector trajectory resulting in highly varying camera poses. Our target objects are a metallic shaft and a toolcap, both of which appear invariant to rotation around one axis (see Fig. 6).

Figure 6: The metallic shaft and toolcap objects with predicted grasp poses by tracking 2 descriptors, or points on the objects.

For both self-supervised and supervised training, we use descriptor dimension D = 3, consider only a single object per scene, and train with background randomization. For the supervised approach we consider two ways of generating data. We first compute the optimal embedding, then we either use edge blending or object masking. In the latter case we use the first two geometric descriptor dimensions and add the view-dependent object mask as the third (see Fig. 5). For the self-supervised approach we use the hyper-parameters defined in (Florence, Manuelli, and Tedrake 2018). That is, we use M = 0.5 as margin for the on-object non-match loss and M = 2.5 for the background non-match loss. We do not use normalization of the descriptors and use scaling by hard negatives. The network structure for both methods is a ResNet-34 pretrained on ImageNet.

Quantitative Evaluation of Prediction Accuracy

We now provide an evaluation for how accurately the networks can track points on an object from different views.
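The evaluation protocol detailed below reduces, for every frame, to a nearest-descriptor lookup in the predicted descriptor image. A minimal sketch of that lookup, assuming the descriptor image is available as a NumPy array; the function name is illustrative.

```python
import numpy as np

def track_descriptor(desc_image, ref_descriptor):
    """Find the pixel whose descriptor is closest to a reference descriptor.

    desc_image:     (H, W, D) descriptor image f(I_j; theta).
    ref_descriptor: (D,) reference descriptor d_i.
    Returns the (row, col) pixel location of the best match.
    """
    dist = np.linalg.norm(desc_image - ref_descriptor, axis=-1)
    return np.unravel_index(np.argmin(dist), dist.shape)
```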

Figure 7: Given a scene during evaluation, we select a few pixels on the RGB image corresponding to points on an object (shown for 3 points). We record their descriptors and world coordinates as a tracking reference.

Figure 8: Tracking error distributions evaluated on every scene for both objects. Note the log-scale of the x-axis. While self-supervised training provides higher accuracy, it has a long-tail error distribution, which sometimes leads to significantly higher errors. The supervised approaches have a slightly lower accuracy, but increased robustness.

For a given static scene we choose an RGB image I and select an arbitrary number of reference pixels on the object $(u_i, v_i)$, $i = 1, \ldots, N$ (see Fig. 7 left). Then, we record the 3D world coordinates of the pixels $p_i \in \mathbb{R}^3$ using the registered depth images and their corresponding descriptors $d_i = I^d(u_i, v_i) \in \mathbb{R}^D$, where $I^d = f(I; \theta)$ is the descriptor image evaluated by the network (see Fig. 7 middle and right). Then, we iterate over every image $I_j$, $j = 1, \ldots, M$ in the scene and evaluate the network $I_j^d = f(I_j; \theta)$. In every descriptor space image $I_j^d$ we find the closest descriptors to our reference set $d_i$ and their pixel locations $(u_i, v_i)$ by solving $(u_i, v_i)_j^* = \arg\min_{u_i, v_i} \| I_j^d(u_i, v_i) - d_i \|_2, \ \forall j$. We only consider those reference points which are visible and have valid depth data. Using the registered depth images and pixel values $(u_i, v_i)_j^*$ we can compute the predicted world coordinates of the points $\tilde{p}_i$ and the tracking error $\| \tilde{p}_i - p_i \|_2$ for each frame j and for each chosen descriptor i.

Fig. 8 shows the tracking error distributions of the supervised and self-supervised approaches. For improved visualization, the normalized distributions are summarized with stacked histograms and with log-scale on the x-axis. In general, the self-supervised training method seems more accurate, but we notice a long-tail error distribution that can lead to significant prediction errors. The supervised approaches provide better robustness without the long-tail distribution, albeit being slightly less accurate. We noticed that the self-supervised training leads to a descriptor space that often mistakes parts of the background for the object, which could lead to higher tracking errors. With our supervised training method this occurs less frequently as we explicitly separate object and background in descriptor space, while this separation is less apparent in the self-supervised case.

Figure 9: The summary of the experiment setup.

Grasp Pose Prediction

The dense descriptor representation gives rise to a variety of grasp pose configurations. In the simplest case, consider a 3D top-down grasp: choose a descriptor d that corresponds to a specific point on the object. Then, in a new scene with a registered RGBD image, evaluate the network and find the closest descriptor d with corresponding pixel values (u, v). Finally, project the depth value at (u, v) into world coordinates to identify the grasp point $\tilde{p} \in \mathbb{R}^3$. This method was previously described in Florence, Manuelli, and Tedrake (2018) together with grasp pose optimization.

Following this method, we can also encode orientation by defining an axis grasp. For this purpose we identify 2 descriptors $d_1, d_2$ and during prediction find the corresponding world coordinates $\tilde{p}_1, \tilde{p}_2$. To define the grasp position we use a linear combination of $\tilde{p}_1, \tilde{p}_2$, e.g. taking their mean. To define the grasp orientation we align a given axis with the direction $\tilde{p}_1 - \tilde{p}_2$ and choose an additional axis arbitrarily, e.g. align it with the z-axis of the camera or world coordinates. The third axis can then be computed by taking the cross product. See Fig. 6 for predicted axis-grasps on the two test objects. Finally, to fully specify a given 6D pose we can extend the above technique to track 3 or more descriptors.

For our test objects we have chosen an axis grasp representation by tracking 2 descriptors. We align the x-axis of the grasp pose with the predicted points and we choose as z-axis the axis orthogonal to the image plane of the camera.

Experiment setup. Consider an industrial setting where a human places an object with an arbitrary position and orientation in the workspace of a robot. The robot's goal is to grasp the object and place it at a handover location, such that another, pre-programmed robot (without visual feedback) can grasp and further manipulate the object in an assembly task. The setup is shown in Fig. 9.

For real robot grasping evaluations we have chosen the toolcap as our test object. We observed that the experiment with shaft grasping was not challenging for any of the networks. However, we noticed that due to the smaller size of the toolcap and its symmetric features the self-supervised training approach was pushed to the limit in terms of accuracy. To reduce variance, we have chosen 8 fixed poses for the toolcap and repeated the experiment for each trained DON on each of the 8 poses. We categorize the experiment outcome as follows:

• fail: if the grasp was unsuccessful, the toolcap remains untouched
• grasp only: if the grasp was successful, but placement failed (e.g., the toolcap was dropped, or was grasped with the wrong orientation)
• grasp & place: if grasping and placement with the right orientation were successful

Figure 10: Summary of the real robot experiments. The supervised training approaches always resulted in at least a successful grasp, and most often a successful task completion.

Fig. 10 illustrates the results of the experiment. The self-supervised training resulted in grasps that are often misaligned, which could lead to a failed placement or even a failed grasp. The supervised training approaches show better robustness with at least a successful grasp in all the cases. Most failed placements were due to flipped orientations leading to upside-down final poses. Overall, the results are highly consistent with our previous evaluation on reference point tracking accuracy. The supervised trained approaches provided more robust performance.

Related Work

Dense descriptor learning via DONs was popularized by Florence, Manuelli, and Tedrake (2018) for the robotics community, offering a flexible perceptual world representation with self-supervised training, which is also applicable to non-rigid objects. Recently, this work was extended to learn descriptors in dynamic scenes for visuo-motor policy learning (Florence, Manuelli, and Tedrake 2020). Manuelli et al. (2020) propose to define model predictive controllers based on tracking key descriptors, similar to our grasp pose prediction pipeline. Sundaresan et al. (2020) showed how ropes can be manipulated based on dense descriptor learning in simulation. Similar to these approaches, Vecerik et al. (2020) propose to learn and track human annotated keypoints for robotic cable plugging. Visual representation by dense descriptors is not novel, though. Choy et al. (2016) introduced a deep learning framework for dense correspondence learning of image features using fully convolutional neural networks by Shelhamer, Long, and Darrell (2017). Later, Schmidt, Newcombe, and Fox (2017) showed how self-supervised training can be incorporated when learning descriptors from video streams.

Recently, several authors used dense object descriptors for 6D pose estimation. Zakharov, Shugurov, and Ilic (2019) generate 2D or 3D surface features (colors) of objects via simple spherical or cylindrical projections. Then, they train a network in a supervised manner to predict these dense feature descriptors and a mask of the object. For 6D pose estimation, they project the colored mesh into the predicted feature image. Periyasamy, Schwarz, and Behnke (2019) and Li et al. (2018) propose similar training procedures to estimate dense feature descriptors. Periyasamy, Schwarz, and Behnke (2019) do not report details for generating the dense descriptors and use differentiable rendering for object pose estimation. Li et al. (2018) use relative translations of the object as dense feature descriptors and a combination of rendering and an additional refinement network for pose estimation. The supervised training procedure proposed in all of the aforementioned methods is similar to ours. The main difference is our novel method of computing optimal dense descriptors that locally preserves geometric properties.

Our work also borrows ideas from the geometry processing community. Lee and Verleysen (2005) consider embedding data manifolds with non-linear dimensionality reduction that have specific geometric features. Liu, Prabhakaran, and Guo (2012) propose a spectral processing framework that also works with point clouds to compute a discrete Laplacian. The works by Crane et al. (2013); Crane, Weischedel, and Wardetzky (2017) build on discrete exterior calculus to efficiently compute the LB operator for meshes, which we also use in this work. Finally, detecting and processing symmetrical features of 3D objects represented as Riemannian manifolds is considered by the works of Ovsjanikov, Sun, and Guibas (2008); Wang et al. (2014).

Summary

In this paper we presented a supervised training approach for Dense Object Nets from registered RGB camera streams without depth information. We showed that by embedding an object model into an optimally generated descriptor space we achieve a similar objective as in the self-supervised case. Our experiments show increased grasp pose prediction robustness for smaller, metallic objects. While self-supervised training has obvious benefits, our supervised method improved results for consumer grade cameras by incorporating domain knowledge suitable for industrial processes.

Future work will investigate how to extend supervised training with multiple object classes efficiently and a wider range of applications, for example, 6D pose estimation from RGB images, region of interest detection, and generalizing grasp pose prediction to multiple object instances and classes.

References

Belkin, M.; and Niyogi, P. 2003. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15(6): 1373–1396.

Choy, C. B.; Gwak, J.; Savarese, S.; and Chandraker, M. 2016. Universal Correspondence Network. In Advances in Neural Information Processing Systems 30.

Crane, K.; de Goes, F.; Desbrun, M.; and Schröder, P. 2013. Digital Geometry Processing with Discrete Exterior Calculus. In ACM SIGGRAPH 2013 Courses, SIGGRAPH '13. New York, NY, USA: ACM.

Crane, K.; Weischedel, C.; and Wardetzky, M. 2017. The heat method for distance computation. Communications of the ACM 60(11): 90–99.

Florence, P.; Manuelli, L.; and Tedrake, R. 2018. Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation. Conference on Robot Learning.

Florence, P.; Manuelli, L.; and Tedrake, R. 2020. Self-Supervised Correspondence in Visuomotor Policy Learning. IEEE Robotics and Automation Letters 5(2): 492–499.

Florence, P. R. 2020. Dense Visual Learning for Robot Manipulation. Ph.D. thesis, Massachusetts Institute of Technology.

Franka, E. 2018. Panda Arm. https://www.franka.de. Last accessed on 03.13.2021.

Garrido-Jurado, S.; Muñoz-Salinas, R.; Madrid-Cuevas, F.; and Marín-Jiménez, M. 2014. Automatic Generation and Detection of Highly Reliable Fiducial Markers under Occlusion. Pattern Recognition 47(6): 2280–2292.

Hadsell, R.; Chopra, S.; and LeCun, Y. 2006. Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, 1735–1742.

Intel. 2019. RealSense D435. https://www.intelrealsense.com/depth-camera-d435/. Last accessed on 03.13.2021.

Lee, J. A.; and Verleysen, M. 2005. Nonlinear dimensionality reduction of data manifolds with essential loops. Neurocomputing 67: 29–53.

Li, Y.; Wang, G.; Ji, X.; Xiang, Y.; and Fox, D. 2018. DeepIM: Deep Iterative Matching for 6D Pose Estimation. In Proceedings of the European Conference on Computer Vision (ECCV), 683–698.

Liu, Y.; Prabhakaran, B.; and Guo, X. 2012. Point-based manifold harmonics. IEEE Transactions on Visualization and Computer Graphics 18(10): 1693–1703.

Manuelli, L.; Li, Y.; Florence, P.; and Tedrake, R. 2020. Keypoints into the Future: Self-Supervised Correspondence in Model-Based Reinforcement Learning. Conference on Robot Learning. URL http://arxiv.org/abs/2009.05085.

Ovsjanikov, M.; Sun, J.; and Guibas, L. 2008. Global intrinsic symmetries of shapes. Computer Graphics Forum 27(5): 1341–1348.

Periyasamy, A. S.; Schwarz, M.; and Behnke, S. 2019. Refining 6D Object Pose Predictions using Abstract Render-and-Compare. arXiv preprint arXiv:1910.03412.

Schmidt, T.; Newcombe, R.; and Fox, D. 2017. Self-Supervised Visual Descriptor Learning for Dense Correspondence. IEEE Robotics and Automation Letters 2(2): 420–427.

Sharp, N.; Soliman, Y.; and Crane, K. 2019. The Vector Heat Method. ACM Transactions on Graphics 38(3): 1–19.

Shelhamer, E.; Long, J.; and Darrell, T. 2017. Fully Convolutional Networks for Semantic Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4): 640–651.

Sundaresan, P.; Grannen, J.; Thananjeyan, B.; Balakrishna, A.; Laskey, M.; Stone, K.; Gonzalez, J. E.; and Goldberg, K. 2020. Learning Rope Manipulation Policies Using Dense Object Descriptors Trained on Synthetic Depth Data. URL http://arxiv.org/abs/2003.01835.

Tenenbaum, J. B.; De Silva, V.; and Langford, J. C. 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290(5500): 2319–2323.

Van Der Maaten, L. J. P.; Postma, E. O.; and Van Den Herik, H. J. 2009. Dimensionality Reduction: A Comparative Review. Journal of Machine Learning Research 10: 1–41.

Vecerik, M.; Regli, J.-B.; Sushkov, O.; Barker, D.; Pevceviciute, R.; Rothörl, T.; Schuster, C.; Hadsell, R.; Agapito, L.; and Scholz, J. 2020. S3K: Self-Supervised Semantic Keypoints for Robotic Manipulation via Multi-View Consistency. Conference on Robot Learning. URL http://arxiv.org/abs/2009.14711.

Wang, H.; Simari, P.; Su, Z.; and Zhang, H. 2014. Spectral global intrinsic symmetry invariant functions. In Proceedings of Graphics Interface, 209–215.

Zakharov, S.; Shugurov, I.; and Ilic, S. 2019. DPOD: 6D Pose Object Detector and Refiner. In Proceedings of the IEEE International Conference on Computer Vision, 1941–1950.
