Visual Descriptor Learning from Monocular Video

Visual Descriptor Learning from Monocular Video Umashankar Deekshith a, Nishit Gajjar b, Max Schwarz c and Sven Behnke d Autonomous Intelligent Systems Group of University of Bonn, Germany [email protected], [email protected] Keywords: Dense Correspondence, Deep Learning, Pixel Descriptors. Abstract: Correspondence estimation is one of the most widely researched and yet only partially solved area of computer vision with many applications in tracking, mapping, recognition of objects and environment. In this paper, we propose a novel way to estimate dense correspondence on an RGB image where visual descriptors are learned from video examples by training a fully convolutional network. Most deep learning methods solve this by training the network with a large set of expensive labeled data or perform labeling through strong 3D generative models using RGB-D videos. Our method learns from RGB videos using contrastive loss, where relative labeling is estimated from optical flow. We demonstrate the functionality in a quantitative analysis on rendered videos, where ground truth information is available. Not only does the method perform well on test data with the same background, it also generalizes to situations with a new background. The descriptors learned are unique and the representations determined by the network are global. We further show the applicability of the method to real-world videos. 1 INTRODUCTION Many of the problems in computer vision, like 3D reconstruction, visual odometry, simultaneous localization and mapping (SLAM), object recognition, de- pend on the underlying problem of image correspondence (see Fig. 1). Correspondence methods based on sparse visual descriptors like SIFT (Lowe, 2004) have been shown to be useful for applications like camera calibration, panorama stitching and even robot localization. However, SIFT and similar hand de- Figure 1: Point tracking using the learned descriptors in monocular video. Two different backgrounds are shown to signed features require a textured image. They per- represent the network’s capability to generate global corre- form poorly for images or scenes which lack sufficient spondences. A square patch of 25 pixels are selected (left) texture. In such situations, dense feature extractors and the nearest neighbor for those pixels based on the pixel are better suited than sparse keypoint-based methods. representation is shown for the other images. With the advancement in deep learning in recent years, the general trend is that neural networks can be In their work, (Schmidt et al., 2017) and (Flo- trained to outperform hand designed feature methods rence et al., 2018) have shown approaches for self- for any function using sufficient training data. How- supervised visual descriptor learning using raw RGB- ever, supervised training approaches require signifi- D sequence of images. Their approach shows that it cant effort because they require labeled training data. is possible to generate dense descriptors for an ob- Therefore, it is useful to have a way to train the model ject or a complete scene and that these descriptors are in a self-supervised fashion, where the training labels consistent across videos with different backgrounds are created automatically. and camera alignments. The dense descriptors are learned using contrastive loss (Hadsell et al., 2006). a https://orcid.org/0000-0002-0341-5768 The method unlocks a huge potential for robot ma- b https://orcid.org/0000-0001-7610-797X nipulation, navigation, and self learning of the envi- c https://orcid.org/0000-0002-9942-6604 ronment. However, for obtaining the required corre- d https://orcid.org/0000-0002-5040-7525 spondence information for training, the authors rely Figure 2: Learned dense object descriptors. Top: Stills from a monocular video demonstrating full 6D movement of the object. Bottom: Normalized output of the trained network where each pixel is represented uniquely, visualized as an RGB image. The objects were captured in different places under different lighting conditions, viewpoints giving the object a geometric transformation in a 3D space. It can be seen here that the descriptor generated is independent of these effects. This video sequence has not been seen during training. on 3D reconstruction from RGB-D sensors. This lim- for the foreground object, such that these descriptors its the applicability of their method (i.e. shiny and are as far apart in descriptor space as possible. transparent objects cannot be learned). In this work, we show that it is possible to learn Every day a large quantity of videos are generated visual descriptors for monocular images through self- from basic point and shoot cameras. Our aim is to supervised learning by training them using contrastive learn the visual descriptors from an RGB video with- loss for images labeled from optical flow information. out any additional information. We follow a similar We further demonstrate applicability in experiments approach to (Schmidt et al., 2017) by implementing on synthetic and real data. a self-supervised visual descriptor learning network which learns dense object descriptors (see Fig. 2). In- stead of the depth map, we rely on the movement 2 RELATED WORK of the object of interest. To find an alternate way of self learning the pixel correspondences, we turn Traditionally, dense correspondence estimation algo- to available optical flow methods. The traditional rithms were of two kinds. One with focus on learning dense optical flow method of (Farneback,¨ 2000) and generative models with strong priors (Hinton et al., new deep learning based optical flow methods (Sun 1995), (Sudderth et al., 2005). These algorithms were et al., 2017; Ilg et al., 2017) provide the information designed to capture similar occurrence of features. which is the basis of our approach to generate self- Another set of algorithms use hand-engineered meth- supervised training data. Optical flow gives us a map- ods. For example, SIFT or HOG performs cluster- ping of pixel correspondences within a sequence of ing over training data to discover the feature classes images. Loosely speaking, our method turns relative (Sivic et al., 2005), (C. Russell et al., 2006). Re- correspondence from optical flow into absolute corre- cently, the advances in deep learning and their abil- spondence using our learned descriptors. ity to reliably capture high dimensional features di- To focus the descriptor learning on the object of rectly from the data has made a lot of progress in interest, we employ a generic foreground segmenta- correspondence estimation, outperforming the tradi- tion method (Chen et al., 2014), which provides a tional methods. (Taylor et al., 2012) have proposed foreground object mask. We use this foreground mask a method where a regression forest is used to predict to limit the learning of meaningful visual descriptors dense correspondence between image pixels and ver- tices of an articulated mesh model. Similarly, (Shot- scriptor to be encoded with the identity of the point ton et al., 2013) use a random forest given only a sin- that projects onto the pixel so that it is invariant to gle acquired image, to deduce the pose of an RGB-D lighting, viewpoint, deformation and any other vari- camera with respect to a known 3D scene. (Brach- able other than the identity of the surface that gener- mann et al., 2014) jointly train an objective over both ated the observation. (Florence et al., 2018) train the 3D object coordinates and object class labeling to de- network for single-object and multi-object descriptors termine to address the problem of estimating the 6D using a modified pixel-wise contrastive loss function Pose of specific objects from a single RGB-D im- which (similar to (Schmidt et al., 2017)) minimizes age. Semantic segmentation of images as presented the feature distance between the matching pixels and by (Long et al., 2014; Hariharan et al., 2015) use neu- pushes that of non-matching pixels to be at least a ral network that produce dense correspondence of im- configurable threshold away from each other. Their ages. (Guler¨ et al., 2018) propose a method to estab- aim is to train a robotic arm to generate training data lish dense correspondence between RGB image and consisting of objects of interests and then training a surface based representation of human body. All of FCN network to distinctively identify different parts these methods rely on labeled data. To avoid the ex- of the same object and multiple objects. In these pensive labeling process, we use relative labels for methods, an RGB-D video is used and a strong 3D pixels generated before training with minimum com- generative model is used to automatically label cor- putation and no human intervention. respondences. In absence of the depth information, a Relative labeling has been in use for various appli- dense optical flow between subsequent frames of an cations. (Wang et al., 2014) introduced a multi-scale RGB video can provide the correlation between pix- network with triplet sampling algorithm that learns a els in different frames. For image segmentation to se- fine-grained image similarity model directly from im- lect the dominant object in the scene, we refer to the ages. For image retrieval using deep hashing, (Zhang solution given by (Chen et al., 2014) and (Jia et al., et al., 2015) trained a deep CNN where discrimina- 2014). tive image features and hash functions are simulta- neously optimized using max-margin loss on triplet units. However, these methods also require labeled 3 METHOD data. (Wang and Gupta, 2015) propose a method for image patch detection where they employ rela- Our aim is to train a visual descriptor network us- tive labeling for data points. Here they try to track ing an RGB video to get a non-linear function which patches in videos where two patches connected by W xHx3 can translate an RGB image R to a descriptor a track should have similar visual representation and W xHxD image R .

Visual Descriptor Learning from Monocular Video

Image-Based Query by Example Using MPEG-7 Visual Descriptors

A Comparative Study of Image Descriptors in Recognizing Human Faces Supported by Distributed Platforms

CENTRIST: a VISUAL DESCRIPTOR for SCENE CATEGORIZATION 1 CENTRIST: a Visual Descriptor for Scene Categorization Jianxin Wu, Member, IEEE and James M

Visual Descriptor Learning from Monocular Video

Local Features for Visual Object Matching and Video Scene Detection

A Review on Near-Duplicate Detection of Images Using

Image Classification Using Bag of Visual Words and Novel COSFIRE Descriptors

Indoor Versus Outdoor Scene Recognition for Navigation of a Micro

Event Curation and Classification Technique Based on Object Detection

Empirical Comparison of Visual Descriptors for Ulcer Recognition in Wireless Capsule Endoscopy Video