Deep Image Homography Estimation

Deep Image Homography Estimation Daniel DeTone Tomasz Malisiewicz Andrew Rabinovich Magic Leap, Inc. Magic Leap, Inc. Magic Leap, Inc. Mountain View, CA Mountain View, CA Mountain View, CA [email protected] [email protected] [email protected] Abstract—We present a deep convolutional neural network for The traditional homography estimation pipeline is com- estimating the relative homography between a pair of images. posed of two stages: corner estimation and robust homography Our feed-forward network has 10 layers, takes two stacked estimation. Robustness is introduced into the corner detection grayscale images as input, and produces an 8 degree of freedom homography which can be used to map the pixels from the stage by returning a large and over-complete set of points, first image to the second. We present two convolutional neural while robustness into the homography estimation step shows network architectures for HomographyNet: a regression network up as heavy use of RANSAC or robustification of the squared which directly estimates the real-valued homography parameters, loss function. Since corners are not as reliable as man-made and a classification network which produces a distribution over linear structures, the research community has put considerable quantized homographies. We use a 4-point homography parameterization which maps the four corners from one image into the effort into adding line features [18] and more complicated second image. Our networks are trained in an end-to-end fashion geometries [8] into the feature detection step. What we really using warped MS-COCO images. Our approach works without want is a single robust algorithm that, given a pair of images, the need for separate local feature detection and transformation simply returns the homography relating the pair. Instead of estimation stages. Our deep models are compared to a traditional manually engineering corner-ish features, line-ish features, homography estimator based on ORB features and we highlight etc, is it possible for the algorithm to learn its own set the scenarios where HomographyNet outperforms the traditional technique. We also describe a variety of applications powered by of primitives? We want to go even further, and add the deep homography estimation, thus showcasing the flexibility of transformation estimation step as the last part of a deep a deep learning approach. learning pipeline, thus giving us the ability to learn the entire homography estimation pipeline in an end-to-end fashion. I. INTRODUCTION Recent research in dense or direct featureless SLAM algo- Sparse 2D feature points are the basis of most modern rithms such as LSD-SLAM [6] indicates promise in using a Structure from Motion and SLAM techniques [9]. These full image for geometric computer vision tasks. Concurrently, sparse 2D features are typically known as corners, and in deep convolutional networks are setting state-of-the-art bench- all geometric computer vision tasks one must balance the marks in semantic tasks such as image classification, semantic errors in corner detection methods with geometric estimation segmentation and human pose estimation. Additionally, recent errors. Even the simplest geometric methods, like estimating works such as FlowNet [7], Deep Semantic Matching [1] the homography between two images, rely on the error-prone and Eigen et al.’s Multi-Scale Deep Network [5] present corner-detection method. promising results for dense geometric computer vision tasks Estimating a 2D homography (or projective transformation) like optical flow and depth estimation. Even robotic tasks like from a pair of images is a fundamental task in computer vision. visual odometry are being tackled with convolutional neural The homography is an essential part of monocular SLAM networks [4]. arXiv:1606.03798v1 [cs.CV] 13 Jun 2016 systems in scenarios such as: In this paper, we show that the entire homography estima- Rotation only movements • tion problem can be solved by a deep convolutional neural Planar scenes • network (See Figure 1). Our contributions are as follows: we Scenes in which objects are very far from the viewer • present a new VGG-style [17] network for the homography It is well-known that the transformation relating two im- estimation task. We show how to use the 4-point parameter- ages undergoing a rotation about the camera center is a ization [2] to get a well-behaved deep estimation problem. homography, and it is not surprising that homographies are Because deep networks require a lot of data to be trained essential for creating panoramas [3]. To deal with planar and from scratch, we share our recipe for creating a seemingly AB mostly-planar scenes, the popular SLAM algorithm ORB- infinite dataset of (IA;IB;H ) training triplets from an SLAM [14] uses a combination of homography estimation existing dataset of real images like the MS-COCO dataset. and fundamental matrix estimation. Augmented Reality ap- We present an additional formulation of the homography plications based on planar structures and homographies have estimation problem as classification, which produces a dis- been well-studied [16]. Camera calibration techniques using tribution over homographies and can be used to determine the planar structures [20] also rely on homographies. confidence of an estimated homography. Deep Image Homography Estimation using ConvNets 1.Teaser Figure Conv1 Conv2 Conv3 Conv4 Input Images Conv5 Conv6 Conv7 Conv8 FC FC Softmax 3x3 3x3 3x3 3x3 8x21 3x3 3x3 Max 16x16x128 16x16x128 1024 Pooling H Max 32x32x128 32x32x128 3x3 3x3 Pooling 128x128x2 Max 64x64x64 64x64x64 Pooling 128x128x64 128x128x64 Fig. 1: Deep Image Homography Estimation. HomographyNet is a Deep Convolutional Neural Network which directly produces the Homography relating two images. Our method does net require separate corner detection and homography estimation steps and all parameters are trained in an end-to-end fashion using a large dataset of labeled images. Deep Image Homography Estimation using ConvNets Conv1 Conv2 1 HE POINT OMOGRAPHY ARAMETERIZATION Conv3 Conv4 II. T 4- H P of natural images .Input The Images process is illustrated in Figure 3 and Conv5 Conv6 Conv7 Conv8 described below. FC The simplest way to parameterize a homography is with a FC Softmax 3x3 3x3 matrix and a fixed scale. The homography maps [u; v], To generate a single training example, we first randomly 3x3 3x3 8x11 3x3 3x3 Max 16x16x128 16x16x128 1024 Pooling H Max 32x32x128 32x32x128 crop a square patch from the larger image3x3 I at position3x3 p (we the pixels in the left image, to [u0; v0], the pixels in the right Pooling 128x128x2 Max 64x64x64 64x64x64 image, and is defined up to scale (see Equation 1). avoid the borders to prevent bordering artifacts later inPooling the 128x128x64 128x128x64 data generation pipeline). This random crop is Ip. Then, the u0 H11 H12 H13 u four corners of PatchFig. 1:ADeep are randomly Image Homography perturbed Estimation. by valuesHomographyNet is a Deep Convolutional Neural Network which directly produces the Homography relating two images. Our method does net require separate corner detection and homography v0 H21 H22 H23 v (1) 0 1 ∼ 0 1 0 1 within the range [-estimationρ, ρ]. The steps four and all correspondences parameters are trained define in an end-to-end fashion using a large dataset of labeled images. 1 H31 H32 H33 1 a homography HAB. Then, the inverse of this homography BA AB 1 However, if@ weA unroll@ the 8 (or 9) parametersA @ A of the homog- H = (H )− is applied to the large image to produce II. THE 4-POINT HOMOGRAPHY PARAMETERIZATION applying random projective transformations to a large dataset image I0. A second patch I0 is cropped from I0 at position p. raphy into a single vector, we’ll quickly realize that we are p of natural images 1. This procedure is detailed below. The simplest way to parameterize a homography is with a mixing both rotational and translational terms. For example, The two grayscale patches, Ip and Ip0 are then stacked channel- To generate a single training example, we first randomly 3x3 matrix and a fixed scale (see Equation 1). However, if wise to create the 2-channel image which is fed directly into crop a square patch from the larger image I at position p (we the submatrix [H11 H12; H21 H22], represents the rotational we unroll the 8 (or 9) parameters of the homography into a AB avoid the borders to prevent bordering artifacts later in the terms in the homography, while the vector [H13 H23] is the our ConvNet. Thesingle 4-point vector, parameterization well quickly realize of H that weis are then mixing both data generation pipeline). This random crop is I . Then, the used as the associatedrotational ground-truth and translational training terms. label. For example, the subma- p translational offset. Balancing the rotational and translational four corners of Patch A are randomly perturbed by values trix [H H ; H H ], represents the rotational terms in the terms as part of an optimization problem is difficult. Managing the training11 image12 21 generation22 pipeline gives us within the range [-⇢, ⇢]. The four correspondences define homography, while the vector [H H ] is the translational 13 23 a homography HAB. Then, the inverse of this homography We found that an alternate parameterization, one based on full control over theoffset. kinds Balancing of visual the effects rotational we and want translational to model. terms as part HBA =(HAB) 1 is applied to the large image to produce a single kind of location variable, namely the corner location, For example, to makeof an our optimization method more problem robust is difficult. to motion blur, − image I0. A second patch Ip0 is cropped from I0 at position p. is more suitable for our deep homography estimation task. we can apply such blurs to the image in our training set. The two grayscale patches, I and I are then stacked channel- u H H H u p p0 The 4-point parameterization has been used in traditional If we want the method to be1 robust0 to11 occlusions,12 13 we1 can wise to create the 2-channel image which is fed directly into v H H H v (1) 1 21 22 23 1 our ConvNet.

Load more