Visual Descriptor Learning from Monocular Video

Umashankar Deekshith, Nishit Gajjar, Max Schwarz and Sven Behnke
Autonomous Intelligent Systems Group, University of Bonn, Germany
[email protected], [email protected]
ORCID: https://orcid.org/0000-0002-0341-5768, https://orcid.org/0000-0001-7610-797X, https://orcid.org/0000-0002-9942-6604, https://orcid.org/0000-0002-5040-7525

Keywords: Dense Correspondence, Deep Learning, Pixel Descriptors

Abstract: Correspondence estimation is one of the most widely researched and yet only partially solved areas of computer vision, with many applications in tracking, mapping, and the recognition of objects and environments. In this paper, we propose a novel way to estimate dense correspondence on an RGB image, where visual descriptors are learned from video examples by training a fully convolutional network. Most deep learning methods solve this by training the network with a large set of expensive labeled data or perform labeling through strong 3D generative models using RGB-D videos. Our method learns from RGB videos using contrastive loss, where relative labeling is estimated from optical flow. We demonstrate the functionality in a quantitative analysis on rendered videos, where ground truth information is available. Not only does the method perform well on test data with the same background, it also generalizes to situations with a new background. The descriptors learned are unique and the representations determined by the network are global. We further show the applicability of the method to real-world videos.

1 INTRODUCTION

Many of the problems in computer vision, like 3D reconstruction, visual odometry, simultaneous localization and mapping (SLAM), and object recognition, depend on the underlying problem of image correspondence (see Fig. 1). Correspondence methods based on sparse visual descriptors like SIFT (Lowe, 2004) have been shown to be useful for applications like camera calibration, panorama stitching, and even robot localization. However, SIFT and similar hand-designed features require a textured image. They perform poorly for images or scenes which lack sufficient texture. In such situations, dense feature extractors are better suited than sparse keypoint-based methods.

Figure 1: Point tracking using the learned descriptors in monocular video. Two different backgrounds are shown to represent the network's capability to generate global correspondences. A square patch of 25 pixels is selected (left) and the nearest neighbor for those pixels based on the pixel representation is shown for the other images.

With the advancement of deep learning in recent years, the general trend is that neural networks can be trained to outperform hand-designed feature methods for almost any task, given sufficient training data. However, supervised training approaches require significant effort because they require labeled training data. Therefore, it is useful to have a way to train the model in a self-supervised fashion, where the training labels are created automatically.

In their work, (Schmidt et al., 2017) and (Florence et al., 2018) have shown approaches for self-supervised visual descriptor learning using raw RGB-D image sequences. Their approaches show that it is possible to generate dense descriptors for an object or a complete scene and that these descriptors are consistent across videos with different backgrounds and alignments. The dense descriptors are learned using contrastive loss (Hadsell et al., 2006). The method unlocks a huge potential for robot manipulation, navigation, and self-learning of the environment. However, for obtaining the required correspondence information for training, the authors rely on 3D reconstruction from RGB-D sensors. This limits the applicability of their method (i.e., shiny and transparent objects cannot be learned).

Figure 2: Learned dense object descriptors. Top: Stills from a monocular video demonstrating full 6D movement of the object. Bottom: Normalized output of the trained network where each pixel is represented uniquely, visualized as an RGB image. The objects were captured in different places under different lighting conditions and viewpoints, giving the object a geometric transformation in 3D space. It can be seen here that the descriptor generated is independent of these effects. This video sequence has not been seen during training.

Every day, a large quantity of videos is generated from basic point-and-shoot cameras. Our aim is to learn the visual descriptors from an RGB video without any additional information.
We follow a similar approach to (Schmidt et al., 2017) by implementing a self-supervised visual descriptor learning network which learns dense object descriptors (see Fig. 2). Instead of the depth map, we rely on the movement of the object of interest. To find an alternative way of self-learning the pixel correspondences, we turn to available optical flow methods. The traditional dense optical flow method of (Farnebäck, 2000) and new deep learning based optical flow methods (Sun et al., 2017; Ilg et al., 2017) provide the information which is the basis of our approach to generate self-supervised training data. Optical flow gives us a mapping of pixel correspondences within a sequence of images. Loosely speaking, our method turns relative correspondence from optical flow into absolute correspondence using our learned descriptors.

To focus the descriptor learning on the object of interest, we employ a generic foreground segmentation method (Chen et al., 2014), which provides a foreground object mask. We use this foreground mask to limit the learning of meaningful visual descriptors to the foreground object, such that these descriptors are as far apart in descriptor space as possible.

In this work, we show that it is possible to learn visual descriptors for monocular images through self-supervised learning, by training with a contrastive loss on images labeled from optical flow information. We further demonstrate applicability in experiments on synthetic and real data.

2 Related Work

Traditionally, dense correspondence estimation algorithms were of two kinds. One kind focuses on learning generative models with strong priors (Hinton et al., 1995), (Sudderth et al., 2005). These algorithms were designed to capture similar occurrences of features. Another set of algorithms uses hand-engineered methods. For example, SIFT or HOG performs clustering over training data to discover the feature classes (Sivic et al., 2005), (C. Russell et al., 2006). Recently, the advances in deep learning and its ability to reliably capture high-dimensional features directly from the data have led to a lot of progress in correspondence estimation, outperforming the traditional methods. (Taylor et al., 2012) have proposed a method where a regression forest is used to predict dense correspondences between image pixels and vertices of an articulated mesh model. Similarly, (Shotton et al., 2013) use a random forest, given only a single acquired image, to deduce the pose of an RGB-D camera with respect to a known 3D scene. (Brachmann et al., 2014) jointly train an objective over both 3D object coordinates and object class labeling to address the problem of estimating the 6D pose of specific objects from a single RGB-D image. Semantic segmentation methods as presented by (Long et al., 2014; Hariharan et al., 2015) use neural networks that produce dense correspondences of images. (Güler et al., 2018) propose a method to establish dense correspondence between an RGB image and a surface-based representation of the human body.
All of these methods rely on labeled data. To avoid the expensive labeling process, we use relative labels for pixels, generated before training with minimal computation and no human intervention.

Relative labeling has been in use for various applications. (Wang et al., 2014) introduced a multi-scale network with a triplet sampling algorithm that learns a fine-grained image similarity model directly from images. For image retrieval using deep hashing, (Zhang et al., 2015) trained a deep CNN where discriminative image features and hash functions are simultaneously optimized using a max-margin loss on triplet units. However, these methods also require labeled data. (Wang and Gupta, 2015) propose a method for image patch detection where they employ relative labeling for data points. They track patches in videos, where two patches connected by a track should have a similar visual representation and a closer distance in deep feature space, and hence can be differentiated from a third patch. In contrast, our method works on the pixel level and is thus able to provide fine-grained correspondence. Also, their work does not focus on representing a specific patch distinctively, so it is not clear if a correspondence generated for a pixel is consistent across different scenarios for an object.

Our work makes use of optical flow estimation (Fischer et al., 2015; Janai et al., 2018; Sun et al., 2017; Ilg et al., 2017), which has made significant progress through deep learning. While these approaches can obviously be used to estimate correspondences for short time frames, our work is applicable for larger geometric transformations and environments with more lighting variance and occlusion. Of course, a better optical flow estimate during training improves results in our method.

Contrastive loss for dense correspondence has been used by (Schmidt et al., 2017), who used relative labeling to train a fully convolutional network. The idea presented by them is for the descriptor to encode the identity of the point that projects onto the pixel, so that it is invariant to lighting, viewpoint, deformation and any other variable other than the identity of the surface that generated the observation. (Florence et al., 2018) train the network for single-object and multi-object descriptors using a modified pixel-wise contrastive loss function which (similar to (Schmidt et al., 2017)) minimizes the feature distance between matching pixels and pushes that of non-matching pixels to be at least a configurable threshold away from each other. Their aim is to use a robotic arm to generate training data consisting of objects of interest and then to train an FCN to distinctively identify different parts of the same object and multiple objects. In these methods, an RGB-D video is used and a strong 3D generative model is used to automatically label correspondences. In the absence of depth information, dense optical flow between subsequent frames of an RGB video can provide the correlation between pixels in different frames. To select the dominant object in the scene, we refer to the solution given by (Chen et al., 2014) and (Jia et al., 2014).

3 Method

Our aim is to train a visual descriptor network using an RGB video, obtaining a non-linear function that maps an RGB image R^(W×H×3) to a descriptor image R^(W×H×D). In the absence of geometry information, two problems need to be solved: the object of interest needs to be segmented for training, and pixel correspondences need to be determined. To segment the object of interest from the scene, we refer to the off-the-shelf solution provided by (Chen et al., 2014). It provides us with a pre-trained network built on Caffe (Jia et al., 2014), from which masks can be generated for the dominant object in the scene. To obtain pixel correspondences, we use dense optical flow between subsequent frames of an RGB video. The network architecture and loss function used for training are explained below.

3.1 Network Architecture

A fully convolutional network (FCN) architecture is used, where the network has a 34-layer ResNet (He et al., 2016) as the encoder block and 6 deconvolutional layers as the decoder block, with ReLU as the activation function.
The ResNet used is pre-trained on the ImageNet (Deng et al., 2009) data-set. The input has the size H×W×3 and the output layer has the size H×W×D, where D is the descriptor size. The network also has skip connections to retain information from earlier layers, and L2 normalization at the output.
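The paper does not include an implementation of this architecture. As an illustration only, a minimal PyTorch sketch of such an encoder-decoder (ResNet-34 encoder, six transposed-convolution decoder layers, skip connections, and an L2-normalized D-dimensional output) could look as follows; the layer widths, the exact skip wiring, and all names are our assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class DenseDescriptorNet(nn.Module):
    """FCN sketch: ResNet-34 encoder, 6 deconvolutional decoder layers, L2-normalized output."""

    def __init__(self, descriptor_dim=3):
        super().__init__()
        resnet = torchvision.models.resnet34(pretrained=True)  # ImageNet pre-training
        # Encoder stages (output strides 4, 4, 8, 16, 32)
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.layer1, self.layer2 = resnet.layer1, resnet.layer2
        self.layer3, self.layer4 = resnet.layer3, resnet.layer4

        def up(cin, cout):  # one deconvolution block: transposed conv (x2 upsampling) + ReLU
            return nn.Sequential(
                nn.ConvTranspose2d(cin, cout, kernel_size=4, stride=2, padding=1),
                nn.ReLU(inplace=True))

        self.up1, self.up2, self.up3 = up(512, 256), up(256, 128), up(128, 64)
        self.up4, self.up5 = up(64, 64), up(64, 32)
        # Final deconvolutional layer maps to the D-dimensional descriptor at full resolution
        self.up6 = nn.ConvTranspose2d(32, descriptor_dim, kernel_size=3, stride=1, padding=1)

    def forward(self, x):          # x: (N, 3, H, W), H and W divisible by 32
        x0 = self.stem(x)          # 1/4 resolution, 64 channels
        x1 = self.layer1(x0)       # 1/4, 64
        x2 = self.layer2(x1)       # 1/8, 128
        x3 = self.layer3(x2)       # 1/16, 256
        x4 = self.layer4(x3)       # 1/32, 512
        d = self.up1(x4) + x3      # skip connections: add encoder features of equal size
        d = self.up2(d) + x2
        d = self.up3(d) + x1
        d = self.up6(self.up5(self.up4(d)))
        return F.normalize(d, p=2, dim=1)   # L2-normalize every pixel descriptor
```

Here the skip connections simply add encoder feature maps of matching resolution to the decoder; the original network may combine them differently.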

3.2 Loss function

A modified version of the pixelwise contrastive loss (Hadsell et al., 2006), as used by (Florence et al., 2018), is used for training. A corresponding pixel pair determined using the optical flow is considered a positive example, and the network is trained to reduce the L2 distance between their descriptors. The negative pixel pairs are selected by randomly sampling pixels from the foreground object, where the previously generated mask helps in limiting the sampling domain to just the object of interest. The descriptors of negative samples are expected to be at least a margin M apart. We thus define the loss function

    L(A, B, u_a, u_b, O_{A→B}) =
        D_AB(u_a, u_b)²                     if O_{A→B}(u_a) = u_b,
        max(0, M − D_AB(u_a, u_b))²         otherwise,                  (1)

where O_{A→B} is the optical flow mapping of pixels from image A to B, M is a margin, and D_AB(u_a, u_b) is the distance metric between the descriptor of image A at pixel u_a and the descriptor of image B at pixel u_b,

    D_AB(u_a, u_b) = || f_A(u_a) − f_B(u_b) ||_2,                       (2)

with the computed descriptor images f_A and f_B.
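As an illustration, Eqs. (1) and (2) translate almost directly into code. The following PyTorch sketch assumes descriptor images of shape (D, H, W) and match/non-match pairs given as flat pixel indices u = y·W + x, as sampled in Sec. 3.3; the names and the margin value are our own choices, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def pixelwise_contrastive_loss(desc_a, desc_b, match_a, match_b,
                               nonmatch_a, nonmatch_b, margin=0.5):
    """Pixelwise contrastive loss of Eq. (1); all pixel indices are flat (u = y*W + x)."""
    D, H, W = desc_a.shape
    flat_a = desc_a.reshape(D, H * W)
    flat_b = desc_b.reshape(D, H * W)

    # Matches: pull corresponding descriptors together (squared L2 distance of Eq. (2))
    d_match = (flat_a[:, match_a] - flat_b[:, match_b]).norm(dim=0)
    loss_matches = (d_match ** 2).mean()

    # Non-matches: push descriptors at least `margin` apart
    d_nonmatch = (flat_a[:, nonmatch_a] - flat_b[:, nonmatch_b]).norm(dim=0)
    loss_nonmatches = (F.relu(margin - d_nonmatch) ** 2).mean()

    return loss_matches + loss_nonmatches
```

Averaging the two terms separately over the numbers of matches and non-matches mirrors the averaging described in Sec. 3.3.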
3.3 Training

For training the descriptor space for a particular object, an RGB video of the object is used. Masks are generated for each image by the pre-trained DeepLab network (Chen et al., 2014) to segment the image into foreground and background. Two subsequent frames are extracted from the video, and dense optical flow is calculated using Farnebäck's method (Farnebäck, 2000) after applying the mask (see Fig. 3). We also used the improvements in optical flow estimation offered by FlowNet2 (Ilg et al., 2017), where we generated the optical flow through pre-trained networks (Reda et al., 2017) for our data-set, and observed that an improvement in optical flow determination does indeed lead to a better correspondence estimation.

Figure 3: Training process. Images A and B are consecutive frames taken from an input video. A.1: Dominant object mask generated for segmentation of image A. Flow: Optical flow between images A and B, represented as an RGB image. This is used for automatic labelling, where the movement of every pixel in image A is given by the calculated optical flow. A.2: Training input with a randomly selected reference point (green). B.2: Horizontally flipped version of image B. Note the corresponding reference point (green), which is calculated using the optical flow offset. The reference point pair is used as a positive example for the contrastive loss. Additional negative samples for the green reference point in A.2 are marked in blue. A.3, B.3: Learned descriptors for images A and B.

The pair of frames is passed through the FCN, and their outputs, along with the optical flow map, are used to calculate the pixel-wise contrastive loss. Since there are millions of pixels in an image, pixel-pair sampling needs to be done to increase the speed of training. For this, only the pixels on the object of interest are taken into consideration. To select the pixel pairs, the pixel movement is measured using the optical flow information. As shown in Fig. 3, once the pixels are sampled from image A, the corresponding pixels in image B are found using the optical flow map, and these become the pixel matches. For each sampled pixel from image A, a number of non-match pixels are sampled uniformly at random from the object in image B. These form the pixel non-matches. We experimented with the number of positive pixel matches in the range of 1000 to 10000 and observed that 2000-3000 positive pixel matches worked best for us; this was used for the results. For each match, 100-150 non-match pairs are considered. The loss function is applied, averaged over the numbers of matches and non-matches, and the loss is then backpropagated through the network. We performed a grid search to find the parameters for the training and noticed that the method is not very sensitive to the used hyperparameters. We also noticed that the network trains in about 10 epochs, with loss values stagnating after that point. Data augmentation techniques such as flipping one image in the pair across its horizontal and vertical axis are also applied.
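The self-labelling step described above can be sketched with OpenCV and NumPy as follows: Farnebäck optical flow between two consecutive masked frames yields the positive pixel pairs, and non-matches are drawn at random from the object region of the second frame. The match counts follow the ranges reported above; the function itself, its parameters, and the Farnebäck settings are our assumptions, not the authors' pipeline.

```python
import cv2
import numpy as np


def sample_pixel_pairs(frame_a, frame_b, mask_a, mask_b,
                       n_matches=2500, nonmatches_per_match=100):
    """Sample match/non-match pixel pairs between two consecutive frames via optical flow."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    # Dense Farnebaeck flow; flow[y, x] = (dx, dy) from frame A to frame B
    flow = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

    h, w = gray_a.shape
    ys, xs = np.nonzero(mask_a)                        # candidate pixels on the object in A
    pick = np.random.choice(len(xs), size=min(n_matches, len(xs)), replace=False)
    ua = np.stack([xs[pick], ys[pick]], axis=1)        # (x, y) pixels in image A

    # Follow the flow to obtain the corresponding pixels in image B (the matches)
    ub = np.round(ua + flow[ua[:, 1], ua[:, 0]]).astype(int)
    inside = (ub[:, 0] >= 0) & (ub[:, 0] < w) & (ub[:, 1] >= 0) & (ub[:, 1] < h)
    ua, ub = ua[inside], ub[inside]
    on_object = mask_b[ub[:, 1], ub[:, 0]] > 0         # keep matches that land on the object
    ua, ub = ua[on_object], ub[on_object]

    # Non-matches: for every match in A, random object pixels in B
    ys_b, xs_b = np.nonzero(mask_b)
    rand = np.random.choice(len(xs_b), size=len(ua) * nonmatches_per_match)
    nonmatch_a = np.repeat(ua, nonmatches_per_match, axis=0)
    nonmatch_b = np.stack([xs_b[rand], ys_b[rand]], axis=1)
    return ua, ub, nonmatch_a, nonmatch_b
```

The flat indices expected by the loss sketch above can then be obtained as u = y * W + x for each returned (x, y) pair.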

4 Experiments

For quantitative analysis, we created rendered videos in which a solid object is moved - both rotated and translated - in 3-dimensional space, from which we extracted the images used for training and evaluation. For all experiments the drill object is used unless specified otherwise. Different environments under different lighting conditions were created for video rendering. From these videos, the frames extracted from the first half of every video are used as the training data set and the rest as the test data set.

Figure 4: Synthetic images used for experiments. Top row: Hulk figure, Bottom row: Drill. In both cases, we render videos with two different background images.

For evaluation, we choose a pixel in one image and compare its pixel representation, which is the output of the network, against the representations of all the pixels in the second image, creating an L2 distance matrix. The percentile of total pixels in the matrix that have a distance smaller than the one to the ground truth pixel is determined (a code sketch of this metric is given at the end of this section).

With these intermediate results, we performed the tests below to demonstrate the performance of our method under different conditions:

1. We compared the percentile of pixels whose feature representation is closer than the representation of the ground truth pixel, whose representation is expected to be the closest. This is averaged over an image, between consecutive images in a test video. This shows the consistency of the pixel representations of the object for small geometric transformations. For this particular experiment, we used both a drill object dataset and a hulk object dataset (see Fig. 4). The hulk dataset was trained with the same hyperparameters that were found to be best for the drill.

2. We selected random images from all the combinations of training and test images with the same and different backgrounds. We performed a test similar to 1, and the results for each train-test combination are averaged together. This shows the consistency of the pixel representations of the object for large geometric transformations across various backgrounds.

3. We present how the model performs pixel-wise between a pair of images, where we plot the percentile of pixels at different error percentiles.

4. We show the accuracy of the pixel representations produced by the presented method for pixels selected through SIFT. This provides a direct comparison of our method with SIFT, where we compare the performance of our method on points that are considered ideal for SIFT.

All these tests are compared against dense SIFT and are plotted together. The method used by (Schmidt et al., 2017), where they use a 3D generative model for relative labeling, is the closest to our method. However, we could not provide a comparison of the results from both methods, as we could not recreate the same setup for a quantitative analysis. Also, their method is applicable to RGB-D videos, whereas our method, as far as we know, is the first to work on RGB videos, and so it is different and cannot be compared directly. We also show visual results from the video recorded and used for training, as ground truth is not available for it. For test 4, we calculated the SIFT keypoints for both images in a comparison pair. We determined SIFT descriptors for these points, and the representations for the corresponding pixels were taken from the network output. We then compared the accuracy of the two methods against the ground truth information.

Figure 5: Cumulative histogram of false positive pixels averaged over a frame, between consecutive frames in a video.

Figure 6: Correspondence error histogram for keypoints selected through SIFT. We show the pixel distance of the estimated corresponding point to the ground truth correspondence for images of size 640×480. The presented result considers different combinations of input images based on train and test images with same and different backgrounds.

Figure 7: Pixel-wise comparison between the presented method and SIFT. The plot shows the fraction of pixels against the percentile of false positives for each pixel, for each of the two approaches.

Table 1: Quantitative evaluation. We show the percentile of false correspondences in each image, averaged over the dataset. Train refers to images which have been exposed to the network during training, Test to images which are previously unseen by the network; the notation is with respect to our network. The results are averaged over images with same and different backgrounds selected in equal proportions.

    Method       Image                   Train-Train   Train-Test   Test-Test
    SIFT         Same Background         61.2          56.9         45.8
    Our Method   Same Background          3.9           8.0          3.2
    SIFT         Different Background    61.6          46.0         39.2
    Our Method   Different Background     4.2           1.1          1.2
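For illustration, the false-positive percentile described at the beginning of this section could be computed for a single query pixel as in the following sketch (PyTorch, with our own naming and (D, H, W) layout; not the authors' evaluation code):

```python
import torch


def false_positive_percentile(desc_a, desc_b, u_a, u_b_gt):
    """Percentage of pixels in B whose descriptor is closer to the query pixel of A
    than the ground-truth corresponding pixel. u_a and u_b_gt are (x, y) coordinates."""
    D, H, W = desc_b.shape
    query = desc_a[:, u_a[1], u_a[0]].reshape(D, 1)           # descriptor of the query pixel
    dists = (desc_b.reshape(D, H * W) - query).norm(dim=0)    # L2 distance to every pixel of B
    d_gt = dists[u_b_gt[1] * W + u_b_gt[0]]                   # distance to the true correspondence
    return 100.0 * (dists < d_gt).float().mean()              # false-positive percentile
```

Averaging this value over the sampled query pixels of an image pair yields per-image error percentiles of the kind reported in Tables 1 and 2.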

5 Evaluation

The result for test 1 is shown in Table 2. It demonstrates the overall capability of tracking a pixel across multiple frames in a video and shows how close the representations remain. As can be seen, our method outperforms SIFT by a large margin when the test is conducted for the drill object. Figure 5 shows the frame-wise comparison for the same. Between most frames, the presented method stays under an error of 1 percentile, whereas the SIFT results are distributed over the whole range. We also evaluated the results for the rendered hulk object. The results show a significant difference between the drill and hulk objects. Our hypothesis for this deviation is that the hulk object has a much better texture than the drill and hence SIFT performs much better. We note that training with optimized hyperparameters for the hulk object would likely improve the result.

Table 2: Image-wise comparison of correspondence results. Here, the error percentile for corresponding images in a video is calculated and averaged over the whole image, and this is repeated for 100 subsequent frame pairs in a video. Images were sampled at 5 frames per second.

    Method   Object   Mean Error   Std Deviation
    SIFT     Drill    37.0         16.4
    Ours     Drill     0.5          0.5
    SIFT     Hulk      9.5          4.8
    Ours     Hulk      1.6          1.9

Test 2 results are presented in Table 1. This test compares the ability of the network to find a global representation and representations for pixels not seen before. As can be seen, even upon selecting previously unseen images with different backgrounds between the images, the presented method is able to identify the pixels with similar accuracy, proving that the pixel representation found through this method is global. It was observed during the test that the accuracy of the prediction depended on the transformation and change between the images, and not on whether an image and environment had previously been seen by the network, proving that the method determines stable features.

Test 3 is performed to show the pixel-wise performance of the presented method; the result is shown in Fig. 7. This shows the overall spread of the accuracy of the representation, the pixels that are represented well, and the outliers. The error percentile of our method is low for most pixels: for over two-thirds of the pixels, the error percentile is under 2 percent, with close to 15% of the pixels under 1 percent error, and it outperforms the compared method. Fig. 7 shows the performance of our method when pitted against SIFT. This shows that the representation uniquely identifies high-contrast pixels more easily, and better than SIFT.

Test 4 is performed to show the capability of the presented method for keypoints determined by the SIFT algorithm; the results are shown in Fig. 6. We expected SIFT to perform well for these points, since SIFT was developed for exactly this setting. But surprisingly, even for these points, our method performed better than SIFT in most cases, as can be seen from the figure.

6 CONCLUSIONS

We introduce a method where images from a monocular camera can be used to recognize visual features in an environment, which is very important for robots. We present an approach where optical flow, calculated using a traditional method such as (Farnebäck, 2000) or using pre-trained networks as given by (Ilg et al., 2017), (Sun et al., 2017), is used to label the training data collected by the robot, enabling it to perform dense correspondence of the surroundings without manual intervention. For recognizing the dominant object, we use a pre-trained network (Chen et al., 2014) which is integrated into this method and can be used without supervision. We present evidence of the global nature of the representation generated by the trained network, where the same representation is generated for the object present in different environments and/or at different viewpoints, transformations, and lighting conditions. We also showed that even in cases where a particular image and environment are previously unseen by the network, our network is able to generate stable features with results similar to those on training images, which will allow the robot to explore new environments. We quantitatively compared our approach to a hand-engineered approach, the scale-invariant feature transform (SIFT), both considering the percentile of false positive pixels averaged over the image and a comparison at the pixel level. We show that in both evaluations, our approach far outperforms the existing approach. We also showed our approach's performance in real-world scenarios, where the network was trained with a video recorded using a hand-held camera, and we were able to show that the network generates visually stable features for the object of interest in both previously seen and unseen environments, thereby showing the extension of our method to real-world scenarios.
Using optical flow for labeling the positive and negative examples needed for contrastive loss-based training provides us with the flexibility of replacing the optical flow generation method with an improved method when it becomes available, and any improvement in optical flow calculation will improve our results further. Similarly, for generating the mask, a more accurate dominant object segmentation will improve the presented results further. The extracted features find application in various fields of robotics and computer vision, such as 3D registration, grasp generation, simultaneous localization and mapping with loop closure detection, and following a human being to a destination, among others.

REFERENCES

Brachmann, E., Krull, A., Michel, F., Gumhold, S., Shotton, J., and Rother, C. (2014). Learning 6D object pose estimation using 3D object coordinates. Volume 8690.

C. Russell, B., T. Freeman, W., A. Efros, A., Sivic, J., and Zisserman, A. (2006). Using multiple segmentations to discover objects and their extent in image collections. Volume 2, pages 1605-1614.

Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. (2014). Semantic image segmentation with deep convolutional nets and fully connected CRFs. CoRR, abs/1412.7062.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR09.

Farnebäck, G. (2000). Fast and accurate motion estimation using orientation tensors and parametric motion models. In ICPR.

Fischer, P., Dosovitskiy, A., Ilg, E., Häusser, P., Hazırbaş, C., Golkov, V., van der Smagt, P., Cremers, D., and Brox, T. (2015). FlowNet: Learning optical flow with convolutional networks.

Florence, P. R., Manuelli, L., and Tedrake, R. (2018). Dense object nets: Learning dense visual object descriptors by and for robotic manipulation. In CoRL.

Güler, R. A., Neverova, N., and Kokkinos, I. (2018). DensePose: Dense human pose estimation in the wild. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7297-7306.

Hadsell, R., Chopra, S., and LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. Pages 1735-1742.

Hariharan, B., Arbelaez, P., Girshick, R., and Malik, J. (2015). Hypercolumns for object segmentation and fine-grained localization. Pages 447-456.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition.

Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158-1161.

Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., and Brox, T. (2017). FlowNet 2.0: Evolution of optical flow estimation with deep networks. Pages 1647-1655.

Janai, J., Guney, F., Ranjan, A., Black, M., and Geiger, A. (2018). Unsupervised learning of multi-frame optical flow with occlusions. In Computer Vision - ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XVI, pages 713-731.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R. B., Guadarrama, S., and Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia.

Long, J., Shelhamer, E., and Darrell, T. (2014). Fully convolutional networks for semantic segmentation. ArXiv.

Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91-110.

Reda, F., Pottorff, R., Barker, J., and Catanzaro, B. (2017). flownet2-pytorch: PyTorch implementation of FlowNet 2.0: Evolution of optical flow estimation with deep networks.

Schmidt, T., Newcombe, R. A., and Fox, D. (2017). Self-supervised visual descriptor learning for dense correspondence. IEEE Robotics and Automation Letters, 2:420-427.

Shotton, J., Glocker, B., Zach, C., Izadi, S., Criminisi, A., and Fitzgibbon, A. (2013). Scene coordinate regression forests for camera relocalization in RGB-D images. Pages 2930-2937.

Sivic, J., Russell, B., Efros, A., Zisserman, A., and Freeman, W. (2005). Discovering objects and their location in images. IEEE International Conference on Computer Vision, pages 370-377.

Sudderth, E. B., Torralba, A., Freeman, W. T., and Willsky, A. S. (2005). Describing visual scenes using transformed Dirichlet processes. In NIPS.

Sun, D., Yang, X., Liu, M.-Y., and Kautz, J. (2017). PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume.

Taylor, J., Shotton, J., Sharp, T., and Fitzgibbon, A. (2012). The Vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation. Volume 10.

Wang, J., Song, Y., Leung, T., Rosenberg, C., Wang, J., Philbin, J., Chen, B., and Wu, Y. (2014). Learning fine-grained image similarity with deep ranking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Wang, X. and Gupta, A. (2015). Unsupervised learning of visual representations using videos. 2015 IEEE International Conference on Computer Vision (ICCV), pages 2794-2802.

Zhang, R., Lin, L., Zhang, R., Zuo, W., and Zhang, L. (2015). Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. IEEE Transactions on Image Processing, 24:4766-4779.