
Deep Image Homography Estimation

Daniel DeTone, Tomasz Malisiewicz, Andrew Rabinovich
Magic Leap, Inc., Mountain View, CA
[email protected] · [email protected] · [email protected]

Abstract—We present a deep convolutional neural network for estimating the relative homography between a pair of images. Our feed-forward network has 10 layers, takes two stacked grayscale images as input, and produces an 8-degree-of-freedom homography which can be used to map the pixels from the first image to the second. We present two convolutional neural network architectures for HomographyNet: a regression network which directly estimates the real-valued homography parameters, and a classification network which produces a distribution over quantized homographies. We use a 4-point homography parameterization which maps the four corners from one image into the second image. Our networks are trained in an end-to-end fashion using warped MS-COCO images. Our approach works without the need for separate local feature detection and transformation estimation stages. Our deep models are compared to a traditional homography estimator based on ORB features, and we highlight the scenarios where HomographyNet outperforms the traditional technique. We also describe a variety of applications powered by deep homography estimation, thus showcasing the flexibility of a deep learning approach.

I. INTRODUCTION

Sparse 2D feature points are the basis of most modern Structure from Motion and SLAM techniques [9]. These sparse 2D features are typically known as corners, and in all geometric computer vision tasks one must balance the errors in corner detection methods against geometric estimation errors. Even the simplest geometric methods, like estimating the homography between two images, rely on the error-prone corner-detection method.

Estimating a 2D homography (or projective transformation) from a pair of images is a fundamental task in computer vision.
The homography is an essential part of monocular SLAM systems in scenarios such as:
• Rotation-only movements
• Planar scenes
• Scenes in which objects are very far from the viewer

It is well known that the transformation relating two images undergoing a rotation about the camera center is a homography, and it is not surprising that homographies are essential for creating panoramas [3]. To deal with planar and mostly-planar scenes, the popular SLAM algorithm ORB-SLAM [14] uses a combination of homography estimation and fundamental matrix estimation. Augmented Reality applications based on planar structures and homographies have been well studied [16]. Camera calibration techniques using planar structures [20] also rely on homographies.

The traditional homography estimation pipeline is composed of two stages: corner estimation and robust homography estimation. Robustness is introduced into the corner detection stage by returning a large and over-complete set of points, while robustness shows up in the homography estimation step as heavy use of RANSAC or robustification of the squared loss function. Since corners are not as reliable as man-made linear structures, the research community has put considerable effort into adding line features [18] and more complicated geometries [8] into the feature detection step. What we really want is a single robust algorithm that, given a pair of images, simply returns the homography relating the pair. Instead of manually engineering corner-ish features, line-ish features, etc., is it possible for the algorithm to learn its own set of primitives? We want to go even further and add the transformation estimation step as the last part of a deep learning pipeline, thus giving us the ability to learn the entire homography estimation pipeline in an end-to-end fashion.

Recent research in dense or direct featureless SLAM algorithms such as LSD-SLAM [6] indicates promise in using a full image for geometric computer vision tasks. Concurrently, deep convolutional networks are setting state-of-the-art benchmarks in semantic tasks such as image classification, semantic segmentation, and human pose estimation. Additionally, recent works such as FlowNet [7], Deep Semantic Matching [1], and Eigen et al.'s Multi-Scale Deep Network [5] present promising results for dense geometric computer vision tasks like optical flow and depth estimation. Even robotic tasks like visual odometry are being tackled with convolutional neural networks [4].

In this paper, we show that the entire homography estimation problem can be solved by a deep convolutional neural network (see Figure 1). Our contributions are as follows: we present a new VGG-style [17] network for the homography estimation task. We show how to use the 4-point parameterization [2] to get a well-behaved deep estimation problem. Because deep networks require a lot of data to be trained from scratch, we share our recipe for creating a seemingly infinite dataset of $(I_A, I_B, H^{AB})$ training triplets from an existing dataset of real images like the MS-COCO dataset. We present an additional formulation of the homography estimation problem as classification, which produces a distribution over homographies and can be used to determine the confidence of an estimated homography.


[Figure 1 diagram: input images (128x128x2) pass through Conv1-Conv8 (3x3 kernels; 64 filters in Conv1-4, 128 in Conv5-8) with 2x2 max pooling after every second convolution (128x128x64 to 64x64x64 to 32x32x128 to 16x16x128), followed by a fully connected layer with 1024 units and a softmax output (8x21).]

Fig. 1: Deep Image Homography Estimation. HomographyNet is a deep convolutional neural network which directly produces the homography relating two images. Our method does not require separate corner detection and homography estimation steps, and all parameters are trained in an end-to-end fashion using a large dataset of labeled images.

II. THE 4-POINT HOMOGRAPHY PARAMETERIZATION

The simplest way to parameterize a homography is with a 3x3 matrix and a fixed scale. The homography maps $[u, v]$, the pixels in the left image, to $[u', v']$, the pixels in the right image, and is defined up to scale (see Equation 1):

$$\begin{pmatrix} u' \\ v' \\ 1 \end{pmatrix} \sim \begin{pmatrix} H_{11} & H_{12} & H_{13} \\ H_{21} & H_{22} & H_{23} \\ H_{31} & H_{32} & H_{33} \end{pmatrix} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} \qquad (1)$$
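
To make Equation 1 concrete, mapping a pixel through $H$ is a matrix multiply followed by dividing out the homogeneous coordinate. A minimal sketch (the helper is our own illustration, not from the paper):

```python
import numpy as np

def apply_homography(H, u, v):
    """Map pixel (u, v) through a 3x3 homography H, as in Equation 1."""
    x = H @ np.array([u, v, 1.0])
    # Divide out the third homogeneous coordinate; this is why H is
    # only defined up to scale.
    return x[0] / x[2], x[1] / x[2]
```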

However, if we unroll the 8 (or 9) parameters of the homography into a single vector, we will quickly realize that we are mixing both rotational and translational terms. For example, the submatrix $[H_{11}\ H_{12};\ H_{21}\ H_{22}]$ represents the rotational terms in the homography, while the vector $[H_{13}\ H_{23}]$ is the translational offset. Balancing the rotational and translational terms as part of an optimization problem is difficult.

We found that an alternate parameterization, one based on a single kind of location variable, namely the corner location, is more suitable for our deep homography estimation task. The 4-point parameterization has been used in traditional homography estimation methods [2], and we use it in our modern deep manifestation of the homography estimation problem (see Figure 2). Letting $\Delta u_1 = u'_1 - u_1$ be the u-offset for the first corner, the 4-point parameterization represents the homography as follows:

$$H_{4point} = \begin{pmatrix} \Delta u_1 & \Delta v_1 \\ \Delta u_2 & \Delta v_2 \\ \Delta u_3 & \Delta v_3 \\ \Delta u_4 & \Delta v_4 \end{pmatrix} \qquad (2)$$

Equivalently to the matrix formulation of the homography, the 4-point parameterization uses eight numbers. In other words, once the displacement of the four corners is known, one can easily convert $H_{4point}$ to $H_{matrix}$. This can be accomplished in a number of ways, for example using the normalized Direct Linear Transform (DLT) algorithm [9] or the function getPerspectiveTransform() in OpenCV.

Fig. 2: 4-point parameterization. We use the 4-point parameterization of the homography. There exists a 1-to-1 mapping between the 8-dof "corner offset" matrix and the representation of the homography as a 3x3 matrix.
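
Because four point correspondences exactly determine the 8-dof homography, recovering $H_{matrix}$ from $H_{4point}$ is a one-liner with the OpenCV function named above. A minimal sketch (the helper name and array layout are ours):

```python
import numpy as np
import cv2

def four_point_to_matrix(corners_a, h_4point):
    """Recover the 3x3 homography from the 4-point parameterization.

    corners_a : (4, 2) array with the four corner locations in image A.
    h_4point  : (4, 2) array of per-corner offsets (delta_u, delta_v).
    """
    corners_a = np.float32(corners_a)
    corners_b = corners_a + np.float32(h_4point)
    # Four correspondences exactly determine the 8-dof homography.
    return cv2.getPerspectiveTransform(corners_a, corners_b)
```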

III. DATA GENERATION FOR HOMOGRAPHY ESTIMATION

Training deep convolutional networks from scratch requires a large amount of data. To meet this requirement, we generate a nearly unlimited number of labeled training examples by applying random projective transformations to a large dataset of natural images.¹ The process is illustrated in Figure 3 and described below.

To generate a single training example, we first randomly crop a square patch from the larger image $I$ at position $p$ (we avoid the borders to prevent bordering artifacts later in the data generation pipeline). This random crop is $I_p$. Then, the four corners of Patch A are randomly perturbed by values within the range $[-\rho, \rho]$. The four correspondences define a homography $H^{AB}$. Then, the inverse of this homography, $H^{BA} = (H^{AB})^{-1}$, is applied to the large image to produce image $I'$. A second patch $I'_p$ is cropped from $I'$ at position $p$. The two grayscale patches, $I_p$ and $I'_p$, are then stacked channel-wise to create the 2-channel image which is fed directly into our ConvNet. The 4-point parameterization of $H^{AB}$ is then used as the associated ground-truth training label.

Managing the training image generation pipeline gives us full control over the kinds of visual effects we want to model. For example, to make our method more robust to motion blur, we can apply such blurs to the images in our training set. If we want the method to be robust to occlusions, we can insert random occluding shapes into our training images. We experimented with in-painting random occluding rectangles into our training images as a simple mechanism to simulate real occlusions.

¹In our experiments, we used cropped MS-COCO [13] images, although any large-enough dataset could be used for training.

Fig. 3: Training Data Generation. The process for creating a single training example. Step 1: randomly crop at position $p$; this is Patch A. Step 2: randomly perturb the four corners of Patch A. Step 3: compute $H^{AB}$ given these correspondences. Step 4: apply $(H^{AB})^{-1} = H^{BA}$ to the image and crop again at position $p$; this is Patch B. Step 5: stack Patch A and Patch B channel-wise and feed into the network, with $H^{AB}$ as the target vector. See Section III for more information.
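
The five steps above translate almost directly into code. The following is a minimal sketch of the generation loop, assuming a grayscale input image; the helper name and the crop bookkeeping are ours, and the paper's actual pipeline may differ in details:

```python
import numpy as np
import cv2

def make_training_example(image, patch_size=128, rho=32):
    """Generate one (stacked patches, H_4point label) pair from a grayscale image."""
    h, w = image.shape
    # Step 1: crop Patch A at a random position p, away from the borders.
    x = np.random.randint(rho, w - patch_size - rho)
    y = np.random.randint(rho, h - patch_size - rho)
    corners_a = np.float32([[x, y], [x + patch_size, y],
                            [x + patch_size, y + patch_size],
                            [x, y + patch_size]])
    patch_a = image[y:y + patch_size, x:x + patch_size]

    # Step 2: perturb the four corners by values in [-rho, rho].
    offsets = np.random.randint(-rho, rho + 1, size=(4, 2)).astype(np.float32)

    # Step 3: the four correspondences define H_AB.
    h_ab = cv2.getPerspectiveTransform(corners_a, corners_a + offsets)

    # Step 4: apply H_BA = inv(H_AB) to the whole image and crop again at p.
    warped = cv2.warpPerspective(image, np.linalg.inv(h_ab), (w, h))
    patch_b = warped[y:y + patch_size, x:x + patch_size]

    # Step 5: stack channel-wise; the offsets are the ground-truth label.
    return np.dstack([patch_a, patch_b]), offsets
```
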
IV. CONVNET MODELS

Our networks use 3x3 convolutional blocks with BatchNorm [10] and ReLUs, and are architecturally similar to Oxford's VGG Net [17] (see Figure 1). Both networks take as input a two-channel grayscale image sized 128x128x2. In other words, the two input images, which are related by a homography, are stacked channel-wise and fed into the network. We use 8 convolutional layers with a max pooling layer (2x2, stride 2) after every two convolutions. The 8 convolutional layers have the following number of filters per layer: 64, 64, 64, 64, 128, 128, 128, 128. The convolutional layers are followed by two fully connected layers. The first fully connected layer has 1024 units. Dropout with a probability of 0.5 is applied after the final convolutional layer and the first fully connected layer.

Our two networks share the same architecture up to the last layer, where the first network produces real-valued outputs and the second network produces discrete quantities (see Figure 4). The regression network directly produces 8 real-valued numbers and uses the Euclidean (L2) loss as the final layer during training. The advantage of this formulation is its simplicity; however, without producing any kind of confidence value for the prediction, such a direct approach could be prohibitive in certain applications.

The classification network uses a quantization scheme, has a softmax at the last layer, and we use the cross-entropy loss function during training. While quantization means that there is some inherent quantization error, the network is able to produce a confidence for each of the corners produced by the method. We chose to use 21 quantization bins for each of the 8 output dimensions, which results in a final layer with 168 output neurons. Figure 6 is a visualization of the corner confidences produced by our method; notice how the confidence is not equal for all corners.
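
The paper's experiments use Caffe; purely as an illustration of the layer sizes in Figure 1, here is how the regression variant might be transcribed in PyTorch (our sketch, not the authors' code):

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, pool=False):
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
              nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
    if pool:  # 2x2 max pooling, stride 2
        layers.append(nn.MaxPool2d(2, 2))
    return layers

class HomographyNet(nn.Module):
    """Regression variant: 128x128x2 input -> 8 real-valued corner offsets."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            *conv_block(2, 64), *conv_block(64, 64, pool=True),      # -> 64x64x64
            *conv_block(64, 64), *conv_block(64, 64, pool=True),     # -> 32x32x64
            *conv_block(64, 128), *conv_block(128, 128, pool=True),  # -> 16x16x128
            *conv_block(128, 128), *conv_block(128, 128),            # -> 16x16x128
        )
        self.head = nn.Sequential(
            nn.Dropout(0.5), nn.Flatten(),
            nn.Linear(128 * 16 * 16, 1024), nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(1024, 8),  # classification variant: Linear(1024, 168) + softmax
        )

    def forward(self, x):
        return self.head(self.features(x))
```
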
V. EXPERIMENTS

We train both of our networks for about 8 hours on a single Titan X GPU, using stochastic gradient descent (SGD) with a momentum of 0.9. We use a base learning rate of 0.005 and decrease the learning rate by a factor of 10 after every 30,000 iterations. The networks are trained for 90,000 total iterations using a batch size of 64. We use Caffe [11], a popular open-source deep learning package, for all experiments.

To create the training data, we use the MS-COCO training set. All images are resized to 320x240 and converted to grayscale. We then generate 500,000 pairs of image patches sized 128x128 related by a homography using the method described in Section III. We choose $\rho = 32$, which means that each corner of the 128x128 grayscale image can be perturbed by a maximum of one quarter of the total image edge size. We avoid larger random perturbations to avoid extreme transformations. We did not use any form of pre-training; the weights of the networks were initialized to random values and trained from scratch. We use the MS-COCO validation set to monitor overfitting, of which we found very little.

To our knowledge there are no large, publicly available homography estimation test sets, so we evaluate our homography estimation approach on our own Warped MS-COCO 14 Test Set. To create this test set, we randomly chose 5,000 images from the test set, resized each to 640x480 grayscale, and generated a pair of image patches sized 256x256² with the corresponding ground-truth homography, using the approach described in Figure 3 with $\rho = 64$.

²We found that very few ORB features were detected when the patches were sized 128x128, while the HomographyNets had no issues working at the smaller scale.

We compare the Classification and Regression variants of HomographyNet with two baselines. The first baseline is a classical ORB [15] descriptor + RANSAC + getPerspectiveTransform() OpenCV homography computation. We use the default OpenCV parameters in the traditional homography estimator: it estimates ORB features at multiple scales and uses the top 25 scoring matches as input to the RANSAC estimator. In scenarios where too few ORB features are computed, the ORB+RANSAC approach outputs an identity estimate. In scenarios where the ORB+RANSAC estimate is too extreme, the 4-point homography estimate is clipped at [-64, 64]. The second baseline uses a 3x3 identity matrix for every pair of images in the test set.

Since the HomographyNets expect a fixed-size 128x128x2 input, the image pairs from the Warped MS-COCO 14 Test Set are resized from 256x256x2 to 128x128x2 before being passed through the network. The 4-point parameterized homography output by the network is then multiplied by a factor of two to account for this. When evaluating the Classification HomographyNet, the corner displacement with the highest confidence is chosen.
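
For the classification head, this evaluation procedure amounts to taking the highest-confidence bin per output dimension and rescaling. A sketch of the decoding (the uniform bin-to-offset mapping over $[-\rho, \rho]$ is our assumption; the paper does not spell it out):

```python
import numpy as np

def decode_classification_output(scores, rho=32, scale=2.0):
    """Turn an (8, 21) grid of softmax scores into 8 corner offsets.

    Assumes the 21 bins uniformly quantize [-rho, rho] per dimension.
    """
    bin_centers = np.linspace(-rho, rho, 21)
    best = np.asarray(scores).reshape(8, 21).argmax(axis=1)  # highest-confidence bin
    # Multiply by two to undo the 256x256 -> 128x128 test-time resize.
    return bin_centers[best] * scale
```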



[Figure 4 diagram: the Classification and Regression HomographyNets share the Conv1-Conv8 and FC trunk. Classification head: 8x21 softmax output trained with the cross-entropy loss $-\sum_x p(x)\log q(x)$. Regression head: 8 real-valued outputs trained with the Euclidean (L2) loss $\frac{1}{2}\lVert p(x) - q(x)\rVert^2$.]

Fig. 4: Classification HomographyNet vs. Regression HomographyNet. Our VGG-like network has 8 convolutional layers and two fully connected layers. The final layer is 8x21 for the classification network and 8x1 for the regression network. The 8x21 output can be interpreted as four 21x21 corner distributions. See Section IV for full ConvNet details.

The results are reported in Figure 5. We report the Mean Average Corner Error for each approach. To measure this metric, one first computes the L2 distance between the ground-truth corner position and the estimated corner position. The error is averaged over the four corners of the image, and the mean is computed over the entire test set. While the regression network performs the best, the classification network can produce confidences and thus a meaningful way to visually debug the results. In certain applications, it may be critical to have this measure of certainty.

Method                           Mean Average Corner Error (pixels)
HomographyNet (Regression)        9.2
HomographyNet (Classification)   11.7
ORB + RANSAC                     24.1
Identity Homography              49.1

Fig. 5: Homography Estimation Comparison on Warped MS-COCO 14 Test Set. The mean average corner error is computed for various approaches on the Warped MS-COCO 14 Test Set. The HomographyNet with the regression head performs the best. The last row shows the error computed if the identity transformation is estimated for each test pair.
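
A minimal sketch of this metric as defined above (our implementation; ground-truth and estimated corners given as (N, 4, 2) arrays over N test pairs):

```python
import numpy as np

def mean_average_corner_error(gt_corners, est_corners):
    """Mean Average Corner Error over a test set.

    gt_corners, est_corners: (N, 4, 2) arrays of corner positions.
    """
    per_corner = np.linalg.norm(gt_corners - est_corners, axis=2)  # (N, 4) L2 distances
    return per_corner.mean(axis=1).mean()  # average over corners, then over test pairs
```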

We visualize homography estimations in Figure 7. The blue squares in column 1 are mapped to a blue quadrilateral in column 2 by a random homography generated from the process described in Section III. The green quadrilateral is the estimated homography. The more closely the blue and green quadrilaterals align, the better. The red lines show the top scoring matches of ORB features across the image patches. A similar visualization is shown in columns 3 and 4, except that the Deep Homography Estimator is used.

VI. APPLICATIONS

Our Deep Homography Estimation system enables a variety of interesting applications. Firstly, our system is fast. It runs at over 300 fps with a batch size of one (i.e., real-time inference mode) on an NVIDIA Titan X GPU, which enables a host of applications that are simply not possible with a slower system. The recent emergence of specialized embedded hardware for deep networks will enable applications on many embedded systems or platforms with limited computational power which cannot afford an expensive and power-hungry desktop GPU. These embedded systems are capable of running much larger networks such as AlexNet [12] in real time, and should have no problem running the relatively lightweight HomographyNets.

Secondly, by formulating homography estimation as a machine learning problem, one can build application-specific homography estimation engines. For example, a robot that navigates an indoor factory floor using planar SLAM via homography estimation could be trained solely with images captured from the robot's image sensor of the indoor factory. While it is possible to optimize a feature detector such as ORB to work in specific environments, it is not straightforward. Environment- and sensor-specific noise, motion blur, and occlusions which might restrict the ability of a homography estimation algorithm can be tackled in a similar fashion using a ConvNet. Other classical computer vision tasks such as image mosaicing (as in [19]) and markerless camera tracking systems for augmented reality (as in [16]) could also benefit from HomographyNets trained on image pair examples created from the target system's sensors and environment.

[Figure 6 image: four 2D grids of scores, one per corner (Corner 1, Corner 2, Corner 3, Corner 4).]

Fig. 6: Corner Confidences Measure. Our Classification HomographyNet produces a score for each potential 2D displacement of each corner. Each corner's 2D grid of scores can be interpreted as a distribution.

VII. CONCLUSION

In this paper we asked if one of the most essential computer vision estimation tasks, namely homography estimation, could be cast as a learning problem. We presented two Convolutional Neural Network architectures that are able to perform well on this task. Our end-to-end training pipeline contains two additional insights: using a 4-point corner parameterization of homographies, which makes the parameterization's coordinates operate on the same scale, and using a large dataset of real images to synthetically create a seemingly unlimited-sized training set for homography estimation. We hope that more geometric problems in vision will be tackled using learning paradigms.

REFERENCES

[1] M. Bai, W. Luo, K. Kundu, and R. Urtasun. Deep semantic matching for optical flow. CoRR, abs/1604.01827, April 2016.
[2] Simon Baker, Ankur Datta, and Takeo Kanade. Parameterizing homographies. Technical Report CMU-RI-TR-06-11, Robotics Institute, Pittsburgh, PA, March 2006.
[3] Matthew Brown and David G. Lowe. Automatic panoramic image stitching using invariant features. International Journal of Computer Vision, 74(1):59-73, 2007.
[4] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia. Exploring representation learning with CNNs for frame-to-frame ego-motion estimation. In ICRA, 2016.
[5] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. CoRR, abs/1406.2283, 2014.
[6] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In ECCV, 2014.
[7] Philipp Fischer, Alexey Dosovitskiy, Eddy Ilg, Philipp Häusser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV, 2015.
[8] A. P. Gee, D. Chekhlov, A. Calway, and W. Mayol-Cuevas. Discovering higher level structure in visual SLAM. IEEE Transactions on Robotics, 2008.
[9] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.
[10] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[13] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[14] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 2015.
[15] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In ICCV, 2011.
[16] G. Simon, A. Fitzgibbon, and A. Zisserman. Markerless tracking using planar structures in the scene. In Proc. International Symposium on Augmented Reality, pages 120-128, October 2000.
[17] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[18] Paul Smith, Ian Reid, and Andrew Davison. Real-time monocular SLAM with straight lines. In Proc. British Machine Vision Conference, 2006.
[19] Richard Szeliski. Video mosaics for virtual environments. IEEE Computer Graphics and Applications, 1996.
[20] Zhengyou Zhang. A flexible new technique for camera calibration. PAMI, 22(11):1330-1334, 2000.

[Figure 7 image grid: 12 examples; columns 1-2: Traditional Homography Estimation, columns 3-4: Deep Image Homography Estimation.]

Fig. 7: Traditional Homography Estimation vs. Deep Image Homography Estimation. In each of the 12 examples, blue depicts the ground-truth region. The left column shows the output of ORB-based homography estimation, with the matched features in red and the resulting mapping of the crop in green. The right column shows the output of the HomographyNet (regression head) in green. Rows 1-2: the ORB features either concentrate on small regions or cannot detect enough features, and perform poorly relative to the HomographyNet, which is unaffected by these phenomena. Row 3: both methods give reasonably good homography estimates. Row 4: a small amount of Gaussian noise is added to the image pair from row 3, deteriorating the results produced by the traditional method, while our method is unaffected by the distortion. Rows 5-6: the traditional approach extracts well-distributed ORB features and outperforms the deep method.