
Deep Probabilistic Regression of Elements of SO(3) using Quaternion Averaging and Uncertainty Injection

Valentin Peretroukhin, Brandon Wagstaff, and Jonathan Kelly
University of Toronto
[email protected], [email protected], [email protected]

Abstract

Consistent estimates of SO(3) are crucial to vision-based motion estimation in augmented reality and robotics. In this work, we present a method to extract probabilistic estimates of rotation from deep regression models. First, we build on prior work and develop a multi-headed network structure we name HydraNet that can account for both aleatoric and epistemic uncertainty. Second, we extend HydraNet to targets that belong to the rotation group, SO(3), by regressing unit quaternions and using the tools of rotation averaging and uncertainty injection onto the manifold to produce three-dimensional covariances. Finally, we present results and analysis on a synthetic dataset, learn consistent orientation estimates on the 7-Scenes dataset, and show how we can use our learned covariances to fuse deep estimates of relative orientation with classical stereo visual odometry to improve localization on the KITTI dataset.

Figure 1: Our model outputs a probabilistic representation of rotation as a unit quaternion and a full covariance matrix. A consistent uncertainty estimate allows us to improve classical estimators using pose-graph optimization.

1. Introduction

Accounting for position and orientation, or pose, is at the heart of many computer vision algorithms like visual odometry, structure from motion, and SLAM. By tracking motion, these algorithms form the basis of visual localization pipelines in autonomous ground and aerial vehicles, visual mapping, and augmented reality.

Recent work [4, 15, 12] has attempted to transfer the success of deep neural networks in many areas of computer vision to the task of camera pose estimation. These approaches, however, can produce arbitrarily poor pose estimates if sensor data differs from what is observed during training (i.e., it is 'out of training distribution'), and their monolithic nature makes them difficult to debug. Further, despite much research effort, classical motion estimation algorithms, like stereo visual odometry, still achieve state-of-the-art performance in nominal conditions1. Nevertheless, the representational power of deep regression algorithms makes them an attractive option to complement classical motion estimation when these latter methods perform poorly (e.g., under diverse lighting conditions or low scene texture). By endowing deep regression models with a useful notion of uncertainty, we can account for out-of-training-distribution errors (a critical step to ensure safe operation) and fuse these models with classical methods using probabilistic factor graphs. In this work, we choose to focus on rotation regression, since many motion algorithms are sensitive to rotation errors [18], and good rotation initializations can be critical to robust optimization [3]. Our novel contributions are

1. a deep network structure we call HydraNet that builds on prior work [14, 17, 16] to build covariance matrices that account for both epistemic and aleatoric uncertainty [13],

2. a formulation that extends HydraNet to means and covariances defined on the rotation group SO(3),

3. and open-source code for SO(3) regression2.

1 Based on the KITTI odometry leaderboard [6] at the time of writing.
2 https://github.com/utiasSTARS/so3_learning

2. Approach

2.1. HydraNet

To endow deep networks with uncertainty, we extend prior work on multi-headed networks [17] and ensembles of networks trained through bootstrapped training sets [14] to a single network structure we call HydraNet (see Figure 1). HydraNet is composed of a large, main body with multiple heads attached prior to the output. Each head is trained independently and from different weight initializations to provide a different estimate of the regressed quantity. This diversity allows the model to capture epistemic uncertainty (σ_e²) by computing a sample variance over the heads. The epistemic uncertainty is designed to grow when inputs are 'far' from the training data in some latent space. Further, an additional head is reserved for regressing an aleatoric estimate of uncertainty directly (σ_a²). This aleatoric uncertainty accounts for effects like sensor noise that are present even if a test input is 'close' to the training data. See [13] for a further discussion of aleatoric and epistemic uncertainty.

2.2. Deep Probabilistic SO(3) Regression

In order to extend the ideas of HydraNet to the matrix Lie group SO(3), we investigate different ways to regress and combine several estimates of rotation. Namely, given a network, g(·), and an input I, we consider how to process several outputs, g_i(I), and combine them into an estimate of a 'mean' rotation, R, and an associated 3 × 3 covariance matrix, Σ. To produce estimates of rotation for a given HydraNet head, we choose to normalize the outputs, g(I) ∈ R^4, to produce unit quaternions that reside on S^3, q = g(I) / ||g(I)||_2. The choice of unit quaternions (as opposed to using exponential coordinates, for example) is motivated by a simple analytic mean expression based on the quaternionic metric, which we describe next.

2.2.1 Rotation Averaging

Given several estimates of a rotation, we define the mean as the rotation which minimizes some squared metric defined over the group [10],

    R̄ = argmin_{R ∈ SO(3)} Σ_{i=1}^{n} d(R_i, R)².    (1)

There are three common choices for a bijective metric on SO(3) [10, 3]: the angular, chordal, and quaternionic,

    d_ang(R_a, R_b) = ||Log(R_a R_b^T)||_2,    (2)
    d_chord(R_a, R_b) = ||R_a − R_b||_F,    (3)
    d_quat(q_a, q_b) = min(||q_a − q_b||_2, ||q_a + q_b||_2),    (4)

where Log(·) represents the capitalized matrix logarithm [19], and ||·||_F the Frobenius norm. In the context of Equation (1), using the angular metric leads to the Karcher mean, which requires an iterative solver and has no known analytic expression. Applying the chordal metric leads to an analytic expression for the average, but requires the use of a singular value decomposition. Using the quaternionic metric, however, leads to a simple, analytic expression for the rotation average as the normalized arithmetic mean of a set of unit quaternions [10],

    q̄ = argmin_{R(q) ∈ SO(3)} Σ_{i=1}^{H} d_quat(q_i, q)² = (Σ_{i=1}^{H} q_i) / ||Σ_{i=1}^{H} q_i||_2.    (5)

This expression is simple to evaluate numerically and, if necessary, can be easily differentiated with respect to its constituent parts. For these reasons, we opt to construct our SO(3) HydraNet using unit quaternion outputs, and evaluate the rotation average using the quaternionic metric.

2.2.2 SO(3) Uncertainty

We parametrize uncertainty over SO(3) by injecting uncertainty onto the manifold [5, 2, 1] from a local tangent space about some mean element, q̄,

    q = Exp(φ) ⊗ q̄,  φ ∼ N(0, Σ),    (6)

where ⊗ represents quaternion multiplication. In this formulation, Σ is a 3 × 3 covariance matrix that can express uncertainty in different directions. Given a mean rotation, q̄, and samples, q_i, we use the logarithmic map to compute a sample covariance matrix that expresses epistemic uncertainty as

    Σ_e = (1 / (H − 1)) Σ_{i=1}^{H} φ_i φ_i^T,  φ_i = Log(q_i ⊗ q̄^{−1}).    (7)

2.3. Loss Function

To capture aleatoric uncertainty, we train a direct regression of covariance through a parametrization of positive semi-definite matrices using a Cholesky decomposition [11, 9]. Given the network output of a unit quaternion, q, and a positive semi-definite matrix, Σ_a, we define a loss function as the negative log likelihood of a given rotation under Equation (6) (see [5]) for a given target rotation, q_t, as

    L_NLL(q, q_t, Σ_a) = (1/2) φ^T Σ_a^{−1} φ + (1/2) log det(Σ_a),    (8)

where φ = Log(q ⊗ q_t^{−1}). At test time, we combine both covariance matrices using a simple matrix addition, Σ = Σ_e + Σ_a.
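The quaternion average of Equation (5) and the epistemic covariance of Equation (7) can be sketched with plain numpy. This is a minimal illustration, not the authors' released code; the [x, y, z, w] storage convention and all function names here are our own assumptions.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of quaternions stored as [x, y, z, w]."""
    av, aw = a[:3], a[3]
    bv, bw = b[:3], b[3]
    return np.append(aw * bv + bw * av + np.cross(av, bv), aw * bw - av.dot(bv))

def quat_conj(q):
    """Conjugate (equal to the inverse for unit quaternions)."""
    return np.array([-q[0], -q[1], -q[2], q[3]])

def quat_log(q):
    """Capitalized Log map: unit quaternion -> rotation vector in R^3."""
    v, w = q[:3], q[3]
    n = np.linalg.norm(v)
    if n < 1e-12:
        return np.zeros(3)
    return 2.0 * np.arctan2(n, w) * v / n

def quat_average(quats):
    """Eq. (5): normalized arithmetic mean of unit quaternions.
    Signs are first flipped to a common hemisphere, since q and -q
    represent the same rotation."""
    ref = quats[0]
    aligned = np.array([q if q.dot(ref) >= 0.0 else -q for q in quats])
    s = aligned.sum(axis=0)
    return s / np.linalg.norm(s)

def epistemic_covariance(quats, q_mean):
    """Eq. (7): sample covariance of the per-head tangent-space
    perturbations about the mean rotation."""
    phis = np.array([quat_log(quat_mul(q, quat_conj(q_mean))) for q in quats])
    return phis.T @ phis / (len(quats) - 1)
```

Each HydraNet head contributes one unit quaternion; the average and covariance above then summarize the ensemble. The hemisphere flip in `quat_average` matters in practice: without it, heads that straddle the q/−q ambiguity can cancel in the arithmetic mean.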

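The loss of Equation (8) can likewise be sketched in numpy. Note that the exact Cholesky parametrization below (exponentiated diagonal plus free below-diagonal entries, six outputs for a 3 × 3 matrix) is one common choice and an assumption on our part; see [11, 9] for the parametrizations the paper builds on.

```python
import numpy as np

def quat_mul(a, b):
    av, aw = a[:3], a[3]
    bv, bw = b[:3], b[3]
    return np.append(aw * bv + bw * av + np.cross(av, bv), aw * bw - av.dot(bv))

def quat_conj(q):
    return np.array([-q[0], -q[1], -q[2], q[3]])

def quat_log(q):
    v, w = q[:3], q[3]
    n = np.linalg.norm(v)
    return np.zeros(3) if n < 1e-12 else 2.0 * np.arctan2(n, w) * v / n

def cov_from_cholesky(params):
    """Build a positive-definite 3x3 covariance from 6 unconstrained
    network outputs: L has an exponentiated diagonal (guaranteeing
    positive pivots) and free below-diagonal entries; Sigma = L L^T."""
    L = np.zeros((3, 3))
    L[np.diag_indices(3)] = np.exp(params[:3])
    L[np.tril_indices(3, -1)] = params[3:]
    return L @ L.T

def nll_loss(q, q_t, Sigma_a):
    """Eq. (8): negative log-likelihood of the target rotation q_t,
    with the residual phi = Log(q (x) q_t^{-1}) in the tangent space."""
    phi = quat_log(quat_mul(q, quat_conj(q_t)))
    return 0.5 * phi @ np.linalg.solve(Sigma_a, phi) \
         + 0.5 * np.log(np.linalg.det(Sigma_a))
```

The two terms in the loss trade off against each other: the Mahalanobis term rewards large predicted covariances on hard examples, while the log-determinant term penalizes over-inflating them.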
Figure 2: Rotation estimation errors and reported network covariance elements for HydraNet trained on synthetic data (noisy pixel locations). Epistemic uncertainty grows for out-of-training-distribution test samples (inputs captured at polar angles outside of ±60°).

Figure 3: Orientation regression results for the 7-Scenes chess test set. Our HydraNet structure paired with a resnet-34 body results in mean errors of 8.6 degrees, with consistent uncertainty.

Figure 4: Frame-to-frame rotation regression for KITTI odometry dataset sequence 00. Note how the uncertainty increases when the car turns (φ_2 represents the yaw angle).

Figure 5: Top-down trajectories of KITTI odometry dataset sequence 00 with and without HydraNet fusion.

3. Results & Future Work

To demonstrate our approach, we use three different datasets: synthetic data (wherein we simulate a camera observing a static scene with point landmarks), 7-Scenes [8] RGB data, and KITTI monocular grayscale images [6]. Figure 2 shows the importance of epistemic uncertainty when training a model to compute rotations on the synthetic dataset. Figure 3 shows the consistent orientation regression resulting from training a HydraNet (with a ResNet body) to regress probabilistic estimates of orientation on the 7-Scenes dataset. Finally, Figure 4 shows the result of training a custom HydraNet to regress frame-to-frame rotations on the KITTI self-driving car dataset. By using the rotation estimates and their associated covariances, we fuse the outputs of HydraNet with a classical visual odometry pipeline (based on sparse features [7]) using probabilistic factor graphs, and visualize the resulting increase in accuracy in Figure 5.

In future work, we are looking for ways to obviate the need for supervised training (e.g., by embedding the HydraNet structure within a Bayesian filter like that in [9]), and to compare our notion of epistemic uncertainty to a separate classification model that predicts whether a sample is in or out of the training distribution. We also see great promise in using deep probabilistic rotation regression to aid in initialization within large-scale pose-graph optimization that relies on vision-based sensors.
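The paper fuses HydraNet outputs with classical odometry through full pose-graph optimization; as a much-simplified illustration of the underlying idea, two rotation estimates with covariances can be fused by a single information-weighted update in the tangent space. This sketch is our own simplification, not the factor-graph pipeline used in the experiments, and all names here are hypothetical.

```python
import numpy as np

def quat_mul(a, b):
    av, aw = a[:3], a[3]
    bv, bw = b[:3], b[3]
    return np.append(aw * bv + bw * av + np.cross(av, bv), aw * bw - av.dot(bv))

def quat_conj(q):
    return np.array([-q[0], -q[1], -q[2], q[3]])

def quat_log(q):
    v, w = q[:3], q[3]
    n = np.linalg.norm(v)
    return np.zeros(3) if n < 1e-12 else 2.0 * np.arctan2(n, w) * v / n

def quat_exp(phi):
    """Capitalized Exp map: rotation vector -> unit quaternion [x, y, z, w]."""
    theta = np.linalg.norm(phi)
    if theta < 1e-12:
        return np.array([0.0, 0.0, 0.0, 1.0])
    return np.append(np.sin(theta / 2.0) * phi / theta, np.cos(theta / 2.0))

def fuse_rotations(q_a, Sig_a, q_b, Sig_b):
    """Covariance-weighted fusion of two rotation estimates via one
    information-form update in the tangent space at q_a (a single
    Gauss-Newton-style step, not a full pose-graph solve)."""
    phi = quat_log(quat_mul(q_b, quat_conj(q_a)))   # b expressed relative to a
    I_a, I_b = np.linalg.inv(Sig_a), np.linalg.inv(Sig_b)
    Sig = np.linalg.inv(I_a + I_b)                  # fused covariance
    delta = Sig @ (I_b @ phi)                       # pull the mean toward b
    return quat_mul(quat_exp(delta), q_a), Sig
```

With equal covariances, the fused estimate lands halfway between the two inputs, and its covariance shrinks, which is the qualitative behavior one expects when a consistent HydraNet measurement reinforces a classical odometry estimate.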

References

[1] Timothy D. Barfoot. State Estimation for Robotics. Cambridge University Press, July 2017.
[2] T. D. Barfoot and P. T. Furgale. Associating uncertainty with three-dimensional poses for use in estimation problems. IEEE Trans. Rob., 30(3):679–693, June 2014.
[3] L. Carlone, R. Tron, K. Daniilidis, and F. Dellaert. Initialization techniques for 3D SLAM: A survey on rotation estimation and its use in pose graph optimization. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 4597–4604, May 2015.
[4] Ronald Clark, Sen Wang, Hongkai Wen, Andrew Markham, and Niki Trigoni. VINet: Visual-inertial odometry as a sequence-to-sequence learning problem. In AAAI Conf. on Artificial Intelligence, 2017.
[5] Christian Forster, Luca Carlone, Frank Dellaert, and Davide Scaramuzza. IMU preintegration on manifold for efficient visual-inertial maximum-a-posteriori estimation. 2015.
[6] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. Int. J. Rob. Res., 32(11):1231–1237, Sept. 2013.
[7] A. Geiger, J. Ziegler, and C. Stiller. StereoScan: Dense 3D reconstruction in real-time. In Proc. Intelligent Vehicles Symp. (IV), pages 963–968. IEEE, June 2011.
[8] B. Glocker, S. Izadi, J. Shotton, and A. Criminisi. Real-time RGB-D camera relocalization. In 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 173–179, Oct. 2013.
[9] Tuomas Haarnoja, Anurag Ajay, Sergey Levine, and Pieter Abbeel. Backprop KF: Learning discriminative deterministic state estimators. In Proceedings of Neural Information Processing Systems (NIPS), 2016.
[10] Richard Hartley, Jochen Trumpf, Yuchao Dai, and Hongdong Li. Rotation averaging. Int. J. Comput. Vis., 103(3):267–305, July 2013.
[11] H. Hu and G. Kantor. Parametric covariance prediction for heteroscedastic noise. In Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Syst. (IROS), pages 3052–3057, 2015.
[12] Alex Kendall, Matthew Grimes, and Roberto Cipolla. PoseNet: A convolutional network for real-time 6-DOF camera relocalization. In Proc. of IEEE Int. Conf. on Computer Vision (ICCV), 2015.
[13] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574–5584, 2017.
[14] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems 30, pages 6402–6413. Curran Associates, Inc., 2017.
[15] Iaroslav Melekhov, Juha Ylioinas, Juho Kannala, and Esa Rahtu. Relative camera pose estimation using convolutional neural networks. In Proc. Int. Conf. on Advanced Concepts for Intel. Vision Syst., pages 675–687. Springer, 2017.
[16] R. Minetto, M. P. Segundo, and S. Sarkar. Hydra: An ensemble of convolutional neural networks for geospatial land classification. IEEE Transactions on Geoscience and Remote Sensing, pages 1–12, 2019.
[17] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. CoRR, abs/1602.04621, 2016.
[18] Valentin Peretroukhin, Lee Clement, and Jonathan Kelly. Inferring sun direction to improve visual odometry: A deep learning approach. The International Journal of Robotics Research, 37(9):996–1016, 2018.
[19] Joan Solà, Jérémie Deray, and Dinesh Atchuthan. A micro Lie theory for state estimation in robotics. Dec. 2018.
