UC San Diego Electronic Theses and Dissertations

Title: From Pictures to 3D: Global Optimization for Scene Reconstruction
Permalink: https://escholarship.org/uc/item/8rs3b74c
Author: Chandraker, Manmohan Krishna
Publication Date: 2009
Peer reviewed | Thesis/dissertation

UNIVERSITY OF CALIFORNIA, SAN DIEGO
From Pictures to 3D: Global Optimization for Scene Reconstruction
A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy
in
Computer Science
by
Manmohan Krishna Chandraker
Committee in charge:
Professor David Kriegman, Chair Professor Serge Belongie Professor Samuel Buss Professor Fredrik Kahl Professor Gert Lanckriet Professor Matthias Zwicker
2009

Copyright
Manmohan Krishna Chandraker, 2009
All rights reserved.

The dissertation of Manmohan Krishna Chandraker is approved, and it is acceptable in quality and form for publication on microfilm and electronically:

Chair
University of California, San Diego
2009
DEDICATION
To Papa, for his incomparable example. To Mom, for her innumerable sacrifices. To Didi, for her unbridled sisterly pride.
EPIGRAPH
Somewhere afield here something lies
In Earth’s oblivious eyeless trust
That moved a poet to prophecies -
A pinch of unseen, unguarded dust.
Thomas Hardy, “Shelley’s Skylark”
TABLE OF CONTENTS

Signature Page
Dedication
Epigraph
Table of Contents
List of Figures
List of Tables
Acknowledgements
Vita
Abstract of the Dissertation

Chapter 1  Introduction
  1.1 Multiview Geometry and Optimization
  1.2 3D Reconstruction from 2D Images
    1.2.1 The Projective Ambiguity
    1.2.2 Projective Spaces and Projective Cameras
    1.2.3 Stratification of 3D Reconstruction
    1.2.4 Autocalibration
    1.2.5 Feature Selection and Matching
  1.3 The Optimization Framework
    1.3.1 Global Optimization for 3D Reconstruction
    1.3.2 Optimization for Robust SFM
  1.4 Contributions of the Dissertation
  1.5 How to Read This Dissertation

Chapter 2  Preliminaries: Projective Geometry
  2.1 Axiomatic Projective Geometry
  2.2 Projective Geometry of 2D
  2.3 Projective Geometry of 3D
    2.3.1 Points and Planes
    2.3.2 Lines
    2.3.3 Quadrics
  2.4 The Projective Camera
  2.5 The Plane at Infinity and Its Denizens
    2.5.1 The Absolute Conic
    2.5.2 Image of the Absolute Conic
    2.5.3 The Absolute Dual Quadric

Chapter 3  Preliminaries: Multiview Geometry
  3.1 Feature Selection and Matching
    3.1.1 Corner detection
    3.1.2 Feature matching
    3.1.3 Advanced feature descriptors
  3.2 Epipolar Geometry
  3.3 Projective Reconstruction
    3.3.1 Pairwise reconstruction
    3.3.2 Factorization-based approaches
  3.4 Stratification
  3.5 Chirality
    3.5.1 Bounding the plane at infinity

Chapter 4  Global Optimization
  4.1 Approaches to Global Optimization
  4.2 Convex Optimization
    4.2.1 Convex Sets
    4.2.2 Convex Functions
    4.2.3 Convex Optimization Problems
    4.2.4 Linear Matrix Inequalities
  4.3 Branch and Bound Theory
    4.3.1 Bounding
    4.3.2 Branching
  4.4 Global Optimization for Polynomials

Chapter 5  Triangulation and Resectioning
  5.1 Introduction
    5.1.1 Related Work
    5.1.2 Outline
  5.2 Problem Formulation
  5.3 Traditional Approaches
    5.3.1 Linear Solution
    5.3.2 Bundle Adjustment
  5.4 Fractional Programming
    5.4.1 Bounding
  5.5 Applications to Multiview Geometry
    5.5.1 Triangulation
    5.5.2 Camera Resectioning
    5.5.3 Projections from P^n to P^m
  5.6 Multiview Fractional Programming
    5.6.1 Bounds Propagation
    5.6.2 Initialization
    5.6.3 Coordinate System Independence
  5.7 Experiments
    5.7.1 Synthetic Data
    5.7.2 Real Data
  5.8 Discussions

Chapter 6  Stratified Autocalibration
  6.1 Introduction
  6.2 Background
    6.2.1 The Infinite Homography Relation
    6.2.2 Modulus Constraints
    6.2.3 Chirality Bounds on Plane at Infinity
    6.2.4 Need for Global Optimization
  6.3 Previous Work
  6.4 The Branch and Bound Framework
    6.4.1 Constructing Convex Relaxations
  6.5 Global Estimation of Plane at Infinity
    6.5.1 Traditional Solution
    6.5.2 Problem Formulation
    6.5.3 Convex Relaxation
    6.5.4 Incorporating Bounds on the Plane at Infinity
  6.6 Globally Optimal Metric Upgrade
    6.6.1 Traditional Solution
    6.6.2 Problem Formulation
    6.6.3 Convex Relaxation
  6.7 Experiments
  6.8 Conclusions and Further Discussions

Chapter 7  Direct Autocalibration
  7.1 Introduction
  7.2 Background
    7.2.1 Autocalibration Using the Absolute Dual Quadric
    7.2.2 Chirality
  7.3 Related Work
  7.4 Problem Formulation
    7.4.1 Imposing rank degeneracy and positive semidefiniteness of Q*_∞
    7.4.2 Imposing chirality constraints
    7.4.3 Choice of objective function
  7.5 Experiments with synthetic data
  7.6 Experiments with real data
  7.7 Conclusions

Chapter 8  Bilinear Programming
  8.1 Introduction
  8.2 Related Work
  8.3 Formulation
    8.3.1 LP relaxation for the L1-norm case
    8.3.2 SOCP relaxation for the L2-norm case
    8.3.3 Additional notes for the L2 case
  8.4 Branching strategy
  8.5 Experiments
    8.5.1 Synthetic data
    8.5.2 Applications
  8.6 Discussions

Chapter 9  Line SFM Using Stereo
  9.1 Introduction
  9.2 Related Work
  9.3 Structure and Motion Using Lines
    9.3.1 A Simple Solution?
    9.3.2 Geometry of the Problem
    9.3.3 Linear Solution
    9.3.4 Efficient Solutions for Orthonormality
    9.3.5 Solution for Incremental Motion
    9.3.6 A Note on Number of Lines
  9.4 System Details
    9.4.1 Line Detection, Matching and Tracking
    9.4.2 Efficiently Computing Determinants
  9.5 Experiments
    9.5.1 Synthetic Data
    9.5.2 Real Data
  9.6 Discussions

Chapter 10  Discussions
  10.1 Sequels in the Computer Vision Community
  10.2 Future Directions
  10.3 Conclusions

Appendix A  Fractional Programming

Appendix B  Convex Relaxations for Stratified Autocalibration
  B.1 Functions of the Form f(x) = x^(8/3)
  B.2 Bilinear Functions f(x, y) = xy
  B.3 Functions of the Form f(x, y) = x^(1/3) y
    B.3.1 Case I: x_l > 0 or x_u < 0
    B.3.2 Case II: x_l ≤ 0 ≤ x_u
  B.4 Convergence Proofs
    B.4.1 Errata

Appendix C  Convergence Proof for Bilinear Relaxations

Bibliography
LIST OF FIGURES

Figure 1.1: Various cues in images create a perception of depth
Figure 1.2: Branch and bound for global optimization
Figure 1.3: Progress of projective geometry in Renaissance art
Figure 1.4: Reconstruction up to rotation, translation and scale
Figure 1.5: Projection for imaging and back-projection for reconstruction
Figure 1.6: The projective plane
Figure 1.7: Vanishing points are an everyday phenomenon
Figure 1.8: The perspective projection camera model
Figure 1.9: Stratification in three dimensions
Figure 1.10: Quasi-affine reconstruction preserves the convex hull
Figure 1.11: A visualization of autocalibration
Figure 1.12: Not all corners are created equal
Figure 1.13: Multiview geometry problems are hard to optimize
Figure 1.14: Traditional local optimization in multiview geometry

Figure 2.1: Perspective camera projection
Figure 2.2: Internal and external parameters of the camera

Figure 3.1: Corner detection
Figure 3.2: Types of image neighborhoods
Figure 3.3: Epipolar geometry

Figure 4.1: Branch and bound for non-convex minimization

Figure 5.1: Local minima in three-view triangulation
Figure 5.2: Bounds propagation
Figure 5.3: Triangulation errors with forward motion
Figure 5.4: Comparing optimal (L2, L2) triangulation to bundle adjustment
Figure 5.5: Triangulation errors with outliers
Figure 5.6: Reprojection errors for camera resectioning
Figure 5.7: Dependence of convergence on optimality criterion

Figure 6.1: Plane-induced homography between two cameras
Figure 6.2: Need for globally optimal autocalibration
Figure 6.3: Construction of convex relaxations
Figure 6.4: Errors in calibration parameters across noise level
Figure 6.5: Runtime behavior of affine and metric upgrade algorithms
Figure 6.6: Geometrical errors in affine and metric upgrades
Figure 6.7: Comparison of local and global affine upgrades
Figure 6.8: Comparison of local and global metric upgrades
Figure 6.9: Stratified autocalibration with real data

Figure 7.1: Direct autocalibration with real data
Figure 7.2: The benefit of global optimization

Figure 8.1: Errors in bilinear fitting across noise levels
Figure 8.2: Errors with varying outlier levels
Figure 8.3: Convergence times for optimal bilinear fitting
Figure 8.4: Face reconstruction from 3D exemplars
Figure 8.5: Bilinear fitting for non-rigid structure and motion

Figure 9.1: Motion estimation in challenging indoor environment
Figure 9.2: Geometry of line-based structure and motion
Figure 9.3: Line detection and tracking
Figure 9.4: Errors for small motion using two-line solvers
Figure 9.5: Errors for small motion using three-line solvers
Figure 9.6: Errors for large motion using three-line solvers
Figure 9.7: Line detection and tracking for turntable sequence
Figure 9.8: Line-based structure and motion for a turntable sequence
Figure 9.9: Line-based structure and motion for a corridor sequence
Figure 9.10: Polynomial and incremental solutions for corridor sequence

Figure B.1: Convex and concave relaxations for bilinear functions
Figure B.2: The non-convex function f(x, y) = x^(1/3) y
Figure B.3: Concave overestimator for x^(1/3)
Figure B.4: Convex underestimator for x^(1/3)
LIST OF TABLES

Table 5.1: Cost functions for various error norms
Table 5.2: Optimal resectioning runtimes for various error norms
Table 5.3: Triangulation and resectioning errors with real data
Table 5.4: Branch and bound iterations for real data
Table 5.5: Runtimes with real data

Table 6.1: Stratified autocalibration errors and branching iterations
Table 6.2: Geometric evaluation of stratified autocalibration
Table 6.3: Positive-semidefiniteness violations in linear metric upgrade

Table 7.1: Direct autocalibration errors with synthetic data
Table 7.2: Direct autocalibration with real data
ACKNOWLEDGEMENTS
This dissertation is the product of the constant endeavor of my mentors, colleagues, friends and family, who have steadfastly shaped my perspective towards research, education and life in general. My time in graduate school was not cast in a stereotypical mould, as I was extremely fortunate to have not one, but several wonderful mentors.

A lion’s share of the credit for determining the nature of my PhD experience goes to my adviser, Prof. David Kriegman. Words cannot do justice to his impact in forging my academic and personal outlook, for his efforts far exceed the obligations of merely guiding this dissertation. Not once have I seen David impose any specific demands; rather, his aura compels students to strive to live up to his high standards. The breadth of his knowledge, combined with a meditative understanding of his students’ strengths, has given him the confidence to consider “good taste” an inherently subjective term. Accordingly, the only expectations he has ever had from me are ensuring the paramountcy of quality over quantity and indulging in research that I would myself take pride in being associated with. The variety of research emanating from David’s research group is testimony to the exploratory freedom he allows his students. His style of advising has always been to subtly plant an idea, or gently nudge me in the right direction, rather than push an agenda. On numerous occasions, the significance of David’s suggestions would only dawn upon me much later – his astuteness in diving to the core of any problem is awe-inspiring. In the course of more than five years, I have admired the dexterity with which David has handled simultaneous responsibilities such as his editorship of IEEE PAMI and his fledgling company, while devoting quality time to his advisory role.
Indeed, David’s concern for his international students, far from their own homes and families, goes well beyond mere advising: when I told him I had bought my first car, a second-hand Mitsubishi, the first thing he said was, “Great! Now make sure that you drive safe.” All I can say for my indebtedness to David is that if, one day, I advise students of my own and display half of his vitality and sagacity, I will consider it a success.
Prof. Serge Belongie is another of my mentors who has significantly contributed towards my academic progress, with his generous help, advice and feedback. Some of my love for teaching is attributable to Serge – the amount of preparation he puts into each class and his expertness at weaving an authoritative delivery into a friendly ambiance are invaluable lessons for any graduate student.

Much of the course of this PhD was set during my interactions with Prof. Fredrik Kahl at the beginning of my second year. To him goes the credit for introducing me to convex optimization and rekindling my love for multiview geometry. Working with him, or even merely talking to him, have been profoundly educative experiences.

A mentor and friend who significantly enriched my stay in the Pixel Lab, both intellectually and personally, is Sameer Agarwal, soon to become a professor at the University of Washington. Not only has he enlightened me on innumerable technical topics during our collaborations, he has also set a great example of dedication and uprightness in scientific research. He is a budding chef and postprandial ruminations at his apartment led to lively discussions on every topic imaginable. Right from the first day Sameer saw me in Pixel Lab, he has somehow taken personal responsibility for my well-being. Whenever I have needed support, I have chatted to him, for his immense faith in me has been a constant source of strength.

Satya Mallick also occupies a similar zone between mentorship and friendship. From helping me settle down in UCSD to imparting pithy lessons through his witty anecdotes, his presence has been vital to my graduate school experience. I have always admired the dedication Satya invested into both his work and his long-distance marriage to Sunita. The time I spent in the lab with Satya and Sameer, especially the summer of 2005, constitutes some of the best memories from my stay at UCSD.
Vincent Rabaud spent the last couple of years at the office space next to mine and was very tolerant of any exultant yells or soulful moans. He would sometimes echo them too, or cheer me up by playing funky numbers. It was great to share lab space with some nice people – Neil Alldrin, Andrew Rabinovich, William Beaver, Neel Joshi, Ben Laxton and Will Chang – all of whom, I am glad, have found their respective callings. Likewise,
I wish all the best to the current denizens of Pixel Lab - Steve Branson, Kai Wang, Carolina Galleguillos, Boris Babenko and Catherine Wah.

Importantly, I would like to acknowledge the role of my senior colleagues in my academic development. Kuang-chih Lee was my first year cubicle-mate who was always generous with help and advice. Ben Ochoa, Josh Wills, Jongwoo Lim, Kristin Branson, Craig Donner and Piotr Dollár are all my early lab-mates who I admire for their amazing creativity and work ethic. Many thanks to Virginia McIlwain for help with numerous travel and administrative details.

A summer internship at Microsoft Research Cambridge was an invaluable opportunity to work closely with Prof. Andrew Blake, whose scientific outlook and intensity I greatly admire. I would like to thank my friends Pushmeet Kohli, Srikumar Ramalingam, Anitha Kannan, Ankur Agarwal, Gregory Neverov, Dynal Patel, Varun Gupta and all others for making the stay at Cambridge such a fun-filled experience.

An internship at Honda Research Institute in Mountain View was a wonderful experience in robotics research. It was a pleasure to collaborate with Jongwoo Lim, who taught me many nuances of real-time SFM. I am also grateful to Prof. Ming-Hsuan Yang for his encouragement, as well as Arjun, Ravi and Dipak for their enjoyable company.

There are several friends whose love and support I would like to acknowledge. Since the day I have known him, Praveen Rajurkar has been my best friend, whose pure heart, ready smile and abysmally pathetic jokes have livened up my days for over a decade now. Anish Karandikar (Aka) is a wonderful friend whose frank opinions I greatly value. Kiran (Machi) is a great buddy with an incredibly positive attitude, who needed only the slightest cajoling to accompany me to anything under the sun (or away from it). Suchit Jhunjhunwala, for his equal measures of levelheadedness and neuroticism, as well as Manish Amde, my long-time housemate, also deserve acknowledgment.
Saumya Chandra and Mayank Kabra, with their contrasting slapstick-cynical routines, provided comic relief. I would also like to thank Ragesh, Diwaker, Saurabh (Sina) and Himanshu (Half-cold) for the joy and variety their company brought to my non-academic life. My cricket teammates, tennis and squash partners – Raiyan, Rahul, Kushal, Vikas, Nitin and all others – deserve special thanks for helping me sustain my love for sports at UCSD.
I am thankful to my adviser, Prof. David Kriegman, for meticulously reading this dissertation and suggesting several corrections and improvements. Any errors that still persist are, of course, mine alone. I am grateful to my wonderful brother-in-law, Dr. Shyam Varma, for his love and support.

Finally, no description can adequately quantify the principal ingredients of this dissertation, namely the endeavors and sacrifices of my father, mother and sister. They are my greatest supporters and the people most attuned to the tides of emotions that swept the course of my PhD. Their pain at being oceans apart from me is only surpassed by the unabashed pride and joy they experience with every little seashell I discover. To their blessings, to their tears and their smiles, I owe every success. Hence, to them, I dedicate this dissertation.

Parts of this dissertation are based on papers co-authored with my collaborators:
• Chapter 5 is based on “Practical Global Optimization for Multiview Geometry”, by F. Kahl, S. Agarwal, M. K. Chandraker, D. J. Kriegman and S. Belongie, as it appears in (Kahl et al., 2008) and (Agarwal et al., 2006).

• Chapter 6 is based on “Globally Optimal Affine and Metric Upgrades in Stratified Autocalibration”, by M. K. Chandraker, S. Agarwal, D. J. Kriegman and S. Belongie, as it appears in (Chandraker et al., 2007b).

• Chapter 7 is based on “Autocalibration via Rank-Constrained Estimation of the Absolute Quadric”, by M. K. Chandraker, S. Agarwal, F. Kahl, D. Nistér and D. J. Kriegman, as it appears in (Chandraker et al., 2007a).

• Chapter 8 is based on “Globally Optimal Bilinear Programming for Computer Vision Applications”, by M. K. Chandraker and D. J. Kriegman, as it appears in (Chandraker and Kriegman, 2008).

• Chapter 9 is based on “Moving in Stereo: Efficient Structure and Motion Using Lines”, by M. K. Chandraker, J. Lim and D. J. Kriegman, as it appears in (Chandraker et al., 2009).
VITA
1982 Born, Raipur, India
2003 B.Tech., Indian Institute of Technology, Bombay, India
2009 Ph.D., University of California, San Diego, USA
PUBLICATIONS
M. K. Chandraker, J. Lim and D. Kriegman, “Moving in Stereo: Efficient Structure and Motion Using Lines,” IEEE International Conference on Computer Vision (ICCV), 2009.
M. K. Chandraker, S. Agarwal, D. Kriegman and S. Belongie, “Globally Optimal Algorithms for Stratified Autocalibration,” International Journal of Computer Vision (IJCV, invited), 2009.
M. K. Chandraker and D. Kriegman, “Globally Optimal Bilinear Programming for Computer Vision Applications,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
F. Kahl, S. Agarwal, M. K. Chandraker, D. Kriegman and S. Belongie, “Practical Global Optimization for Multiview Geometry,” International Journal of Computer Vision (IJCV), 79(3):271-284, 2008.
M. K. Chandraker, S. Agarwal, D. Kriegman and S. Belongie, “Globally Optimal Affine and Metric Upgrades in Stratified Autocalibration,” IEEE International Conference on Computer Vision (ICCV), 2007.
A. Agarwal, S. Izadi, M. K. Chandraker and A. Blake, “High Precision Multi-touch Sensing on Surfaces using Overhead Cameras,” IEEE Tabletop and Interactive Surfaces, 2007.
M. K. Chandraker, S. Agarwal and D. Kriegman, “ShadowCuts: Photometric Stereo with Shadows,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
M. K. Chandraker, S. Agarwal, F. Kahl, D. Nistér and D. Kriegman, “Autocalibration via Rank-Constrained Estimation of the Absolute Quadric,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
S. Agarwal, M. K. Chandraker, F. Kahl, D. Kriegman and S. Belongie, “Practical Global Optimization for Multiview Geometry,” European Conference on Computer Vision (ECCV), 2006.
M. K. Chandraker, F. Kahl and D. Kriegman, “Reflections on the Generalized Bas-Relief Ambiguity,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
M. K. Chandraker, C. Stock and A. Pinz, “Real-Time Camera Pose in a Room,” International Conference on Computer Vision Systems (ICVS), 2003.
C. Stock, U. Mühlmann, M. K. Chandraker and A. Pinz, “Subpixel Corner Detection for Tracking Applications using CMOS Camera Technology,” Proceedings of the Austrian Association of Pattern Recognition, 2002.
FIELDS OF STUDY
Major Field: Optimization
Global Optimization, Convex Optimization, Polynomial Optimization, Convex Relaxations, Branch and Bound Search.

Major Field: 3D Reconstruction
Structure from Motion, Shape from Exemplars.

Major Field: Multiple View Geometry
Triangulation, Camera Resectioning, Autocalibration, Projective Geometry.

Minor Fields
Semidefinite Programs, Fractional Programs, Bilinear Programs, Sum-of-Squares Polynomials, Linear Matrix Inequality Relaxations.
ABSTRACT OF THE DISSERTATION
From Pictures to 3D: Global Optimization for Scene Reconstruction
by
Manmohan Krishna Chandraker Doctor of Philosophy in Computer Science
University of California, San Diego, 2009
Professor David Kriegman, Chair
Reconstructing the three-dimensional structure of a scene using images is a fundamental problem in computer vision. The geometric aspects of 3D reconstruction have been well-understood for a decade, but the involved optimization problems are known to be highly non-convex and difficult to solve. Traditionally, these problems are tackled using heuristic initializations followed by local, gradient-based optimization algorithms, which are prone to being enmeshed in local minima. In contrast, this dissertation proposes powerful, global optimization methods to derive provably optimal, yet practical, algorithms for estimating 3D scene structure and camera motion.

This dissertation develops a branch and bound framework to solve several well-established problems in multiview geometry to their global optima, with a certificate of optimality. The framework relies on the construction of efficient and tight relaxations to the involved non-convex problems, using modern convex optimization methods. The underlying geometry of the task is exploited to restrict the search space to a small, fixed number of dimensions, which alleviates the worst case exponential complexity of branch and bound in practice.

The dissertation begins by deriving optimal solutions to triangulation and camera pose estimation for an arbitrary number of views and points, using extensions to the theory of fractional programming. Next, the framework is amplified to solve the conceptually important affine and metric reconstruction stages of stratified autocalibration to their global optima. Additionally, an algorithm for directly upgrading a projective reconstruction to a metric one is proposed, based on elegant real algebraic geometry methods for global optimization of polynomial systems. Further, large-scale bilinear programs that arise in diverse applications such as shape from exemplar models and non-rigid structure from motion, are globally optimized using a novel branching strategy that exploits problem structure typical to 3D reconstruction.

The final part of the dissertation develops a complete pipeline for real-time 3D reconstruction using stereo images and straight line features. The core structure from motion problem constitutes efficient optimization of an overdetermined system of polynomials that is fast enough to be used in a robust hypothesize-and-test framework. The algorithm has already found application in the autonomous navigation system for the well-known humanoid robot, ASIMO.
Chapter 1
Introduction
“.... I have great faith in a seed. Convince me that you have a seed there and I am prepared to expect wonders.”
Henry David Thoreau (American naturalist, 1817-1862 AD), Faith in a Seed
A cogent vignette that stimulates artificial intelligence research and movie box office receipts alike includes a robot autonomously navigating and consciously interacting with the world around it. It is, perhaps, well-accepted that the ability to garner visual input, discerningly extract scene information and sentiently use the same represents a crucial attribute for such a machine.

Since visual input usually comprises two-dimensional images, an important piece of the puzzle involves recovering the depth, that is, inferring the three-dimensional (3D) scene structure represented by the two-dimensional (2D) images. Various cues can be employed to achieve this goal, such as camera motion between images, the extent of defocus or shading variations with change in pose and lighting (Figure 1.1). Recovering 3D scene structure using (possibly unknown) camera motion as a cue, the so-called “Structure and Motion” or “Structure from Motion (SFM)” problem, is one of the principal themes of this dissertation.

Structure from motion is a quintessential computer vision problem, for which robotic navigation is by no means the only real-world application. Organizing vacation
Figure 1.1: (a) Rain, Steam and Speed - The Great Western Railway, by J. M. W. Turner, 1844 AD. Various cues combine to create the illusion of depth, such as linear perspective (parallel lines seem to intersect), aerial perspective (distant regions acquire a bluish hue) and defocus (farther objects are hazier). (b) Motion as a cue to perceive depth, exemplified by Lake Palanskoye in the Kamchatka Peninsula, Russia. The landmass was shifted, left to right, during a landslide, creating the effect of a camera motion between the two images. By crossing the eyes to view this image pair as a stereogram, the reader can perceive depth in the scene (white regions are highest, followed by brown, green and bluish-black).

photographs, augmented reality walk-throughs, 3D city maps and motion capture technology in the movie and gaming industries are but a few examples where progress in 3D reconstruction techniques has already influenced modern society. Given its widespread application, it is unsurprising that a significant part of the computer vision challenge consists of designing robust, reliable algorithms and systems that can infer 3D scene structure and camera motion using 2D images.

Structure and motion problems are, in general, highly non-convex and finding optimal solutions to them is computationally hard (Nistér et al., 2007; Freund and Jarre, 2001). For instance, Figures 1.13 and 5.1 illustrate cost functions for some of the problems we will encounter in this dissertation. Traditionally, these problems are solved by employing a heuristic initialization in conjunction with a gradient-based optimization algorithm to arrive at a local optimum. Needless to say, the possibility of such approaches achieving an acceptable solution quality is contingent on a propitious initialization in the vicinity of the optimum. In contrast, this dissertation presents algorithms for geometric reconstruction that provably converge to the global optimum, regardless of the initialization.
This dissertation makes a strong case for modern optimization methods being more suited to meet the challenge of provably accurate geometric 3D reconstruction than their traditional gradient-based counterparts. Convex programs, for example, are attractive since their local minima are, by definition, also the global minima. Moreover, the past twenty years have seen tremendous activity towards developing fast, reliable and robust solvers for a variety of convex problems. One way of harnessing the power of convex optimization is to develop approximation algorithms that are guaranteed to lie within a fixed distance of the optimum. But even with the assumption that an approximate 3D reconstruction is useful, it is difficult to come up with a provably good one for complex multiview geometry problems.

However, suppose the search space for, say, a minimization problem is subdivided into some regions. Then, as long as the convex approximation can be shown to be a “tight” lower bound, or a relaxation, we can use it to prune away those regions where the lower bound lies above the objective function in some other region (see Figure 1.2). This is precisely the basis for a branch and bound paradigm for global optimization. In this dissertation, we develop provably tight and efficiently minimizable convex relaxations to non-convex geometric reconstruction problems, which can be coupled with a well-designed branch and bound algorithm to achieve the global minimum.

The bane of a traditional branch and bound algorithm, of course, is its worst case complexity that increases exponentially with the number of dimensions, which, for a multiview geometry problem, may easily be a few dozens or hundreds. So, while a certificate of optimality satisfies the theoretical premise of global optimization, practicality demands an informed design that avoids the curse of dimensionality. Undeniably, the primary reason for the success of our global optimization algorithms is the judicious
Figure 1.2: The principles of global optimization using a branch and bound framework, illustrated for a univariate function.

assimilation of the underlying problem structure afforded by multiview geometry to potently restrict the dimensionality of the search space. Indeed, one of the central motifs of this dissertation is an inquiry into the symbiotic congruity of multiview geometry and convex optimization to achieve expeditious convergence in practice. On this note, let us embark on our exploration of practical global optimization methods for geometric 3D reconstruction.
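The pruning principle behind branch and bound can be sketched in a few lines. The following is a minimal, illustrative Python implementation for a univariate function, in the spirit of Figure 1.2; it is not from the dissertation, and it substitutes a simple Lipschitz lower bound on each interval for the tailored convex relaxations developed in later chapters. The test function, Lipschitz constant and tolerance are hypothetical examples.

```python
import heapq
import math

def branch_and_bound(f, lo, hi, lipschitz, tol=1e-6):
    """Globally minimize a univariate function f on [lo, hi].

    Each interval [a, b] gets a lower bound f(m) - L*(b-a)/2, valid
    because f cannot drop faster than its Lipschitz constant L allows;
    the midpoint value f(m) serves as an upper bound. Intervals whose
    lower bound exceeds the best upper bound found so far are pruned.
    """
    def bounds(a, b):
        m = 0.5 * (a + b)
        ub = f(m)                          # feasible point: an upper bound
        lb = ub - lipschitz * (b - a) / 2  # guaranteed lower bound
        return lb, ub, m

    lb, ub, m = bounds(lo, hi)
    best_ub, best_x = ub, m
    queue = [(lb, lo, hi)]                 # explore lowest bound first
    while queue:
        lb, a, b = heapq.heappop(queue)
        if lb > best_ub - tol:             # certificate: gap below tol
            break
        mid = 0.5 * (a + b)
        for sub in ((a, mid), (mid, b)):   # branch: split the interval
            sub_lb, sub_ub, sub_x = bounds(*sub)
            if sub_ub < best_ub:           # bound: improve the incumbent
                best_ub, best_x = sub_ub, sub_x
            if sub_lb < best_ub - tol:     # prune: keep promising regions
                heapq.heappush(queue, (sub_lb, *sub))
    return best_x, best_ub
```

For example, f(x) = sin 3x + x/2 on [-3, 3] (Lipschitz constant 3.5) has several local minima, yet the routine returns the global minimizer near x ≈ -2.67 regardless of where those local minima lie, because regions containing them are certifiably pruned.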
1.1 Multiview Geometry and Optimization
Much of this dissertation examines multiple view geometry problems through the conjugality of projective geometry and modern convex optimization methods. We begin our quest with some observations on the appositeness of those two frameworks.
The utility of projective geometry
In a most basic interpretation, any photograph can be regarded as the output of a projection device, which might be an advanced servo-controlled zoom lens system for modern digital cameras, or an artist's felicity for a Renaissance-era protégé of Raphael's studio. The study of projections of the three-dimensional world onto a planar canvas, thus, predates the advent of modern photography. Consequently, it is not mere fortuitousness that a mature set of mathematical tools, in the form of projective geometry, was readily available to tackle the inverse problem of inferring information about the three-dimensional world from two-dimensional projections (Figure 1.3).
(a) (b)
Figure 1.3: The progress of projective geometry through art. (a) Jesus before Caiaphas, by Giotto di Bondone, ca. 1305, around the beginning of the Italian Renaissance. While Giotto aims to replicate the effect of parallel lines seeming to converge, the painting reflects unawareness of the concept of a vanishing point. (b) The School of Athens, by Raphael Sanzio, ca. 1510, at the zenith of classical Renaissance. Parallel lines of the 3D scene are concurrent between the central characters in the 2D painting. This vanishing point is the image of an ideal point on the plane at infinity. Coincidentally, in this painting, Plato is lecturing Aristotle on idealism and the vanishing point heightens the contrast to the latter’s realism.
Projective geometry lends a variety of useful concepts, which are well-adapted for use in the analysis of the geometry of multiple views as well as implementation in modern computational frameworks. Its distinguishing feature is the uniform treatment of
finite as well as infinite points. The set of all infinite points in projective 3-space defines a plane, aptly designated the plane at infinity. The behavior of the plane at infinity and some mathematical entities that reside on it, such as the absolute conic and the absolute dual quadric, are fundamental to our understanding of image formation and thereby, also indispensable in determining our modus operandi for recovering 3D scene information from those images. In Chapter 2, we briefly review notions from projective geometry that form the mathematical backbone of this dissertation. It might seem quite remarkable that these imaginary habitués of the imaginary abode that is the plane at infinity can contribute so tangibly to our ability to digitally perceive the real world around us. But in the words of Jean-Paul Sartre, the twentieth-century French existentialist philosopher:
“No finite point has meaning without an infinite reference point.”
And therein lies the power of projective geometry.
The utility of convex optimization
The ability to acquire digital images of the world around us is now commonplace. Even more easily accessible is the ability to store, process, share and distribute them. How do we meaningfully analyze and extract relevant information from the plethora of data at our disposal? A natural approach might be to formulate an objective function that encapsulates some notion of satisfaction and, while appropriately taking into account the available data, devise a scheme to maximize the satisfaction. This is precisely the basic premise of optimization. An optimization framework may seem an obvious necessity to us today, but it serves us well to pause and ponder the alternatives, or the lack thereof. Optimization, after all, was a novel concept merely 200 years ago, around when Carl Friedrich Gauss had begun to lay the foundations for the method of steepest descent. The beginnings of modern-day optimization - convex programming in particular - can perhaps be attributed to George Dantzig's mid-twentieth century studies in linear programming. While the term linear programming does not bear relation to computer programs (rather, it refers to scheduling), it is beyond doubt that contemporary study of optimization methods serves and is served by our present-day computers. Linear programming is a special instance of convex programming, which deals with convex functions and sets. Convex programs enjoy some special properties, for instance, a local minimum is guaranteed to be a global minimum. Recognizing elements of convexity in a problem, thus, allows us to deduce fundamental characterizations of the intrinsic difficulty of the problem. But that only partly explains the utility of convex optimization.
The theoretical development of modern-day interior point solvers (Karmarkar, 1984; Nesterov and Nemirovskii, 1994) and their subsequent deployment in the form of efficient, general-purpose computer software (Sturm, 1999; Andersen et al., 2003) is as significant a reason for the popularity of convex optimization. Indeed, implementation problems and choices concerning feasibility detection, stopping criteria and convergence rates that vex traditional optimization methods are readily and satisfactorily handled by the theory of convex optimization.
1.2 3D Reconstruction from 2D Images
Given one or more images of a scene, a verbal description of 3D reconstruction might be "inferring the structure of the scene and the cameras used for imaging". How can we make this description mathematically precise? Clearly, without knowing the absolute coordinates of the scene points (which are not available for a traditional image), it is not possible to recreate a copy of the scene at the same geographical position and orientation as the original. In addition, there is no way to determine the absolute scale of the scene imaged by a camera - a fact that has allowed the directors of many a blockbuster to portray mayhem in a bathtub as a savage shark or a stricken ship (Figure 1.4). Indeed, scale ambiguity is fundamental in biological vision too - it is only with the aid of experience-based priors that we learn to reconcile relative scales in the world around us. For instance, highly myopic, but perfectly peripatetic, people wearing vision-correcting lenses often stumble when objects appear much closer (or larger) upon removing the lenses.
(a) (b)
Figure 1.4: (a) A rotation and translation applied to both the object and the camera results in the same image. (b) Scale information is “lost” in perspective projection - a far away big object appears the same as a nearby small object.
So, it seems reasonable to define a satisfactory 3D reconstruction as one that differs from the true scene by a global rotation, translation and scale factor (a similarity transformation). As it turns out, if we have calibrated cameras, that is, cameras with known internal settings, then it is possible to achieve this goal and in fact, impossible to do any better (Longuet-Higgins, 1981). Such a reconstruction is called a metric reconstruction.
1.2.1 The Projective Ambiguity
Della Pittura, a definitive fifteenth-century treatise on painting by Leon Battista Alberti, the Renaissance artist and polymath, expounds on painting with correct perspective. To achieve this, Alberti recommends observing the scene through a transparent cloth stretched on a wire frame, with one eye closed, and marking points on the veil where they appear to be in the image. This procedure, depicted in Figure 1.5 (a), has come to be known as Alberti's veil and is simply an effectuation of perspective projection. Geometric 3D reconstruction seeks to solve the inverse problem: given the image points on Alberti's veil, back-project along the line of sight to determine the correct locations of the 3D points. It is apparent that instead of points, the geometric primitives for the study of 3D reconstruction from 2D projections must be rays emanating from a center of projection. This is precisely the purview of projective geometry.
(a) Forward projection (b) Inverse projection
Figure 1.5: (a) Man Drawing a Lute, a woodcut by the German artist Albrecht Dürer, ca. 1525, illustrates Alberti's method of painting, which recognized image formation as a perspective projection. (b) The inverse operation of back-projection from the camera center along the rays of sight need not preserve the correct length ratios and angles.
As Figure 1.5 (b) shows, 3D reconstruction from a single 2D image is inherently an ill-posed problem, since information about the true depth of a scene point is lost in the 2D projection. However, images from a greater number of viewpoints should, arguably, better constrain the reconstruction, for ray intersections are localizable, while infinite rays are not. This intuition, which forms the basis of multiview geometry, is valid to a certain extent. While a 3D reconstruction from just image data can indeed be computed when multiple views are available, there exists a so-called projective ambiguity in the reconstruction. Loosely, if image formation is considered as an incidence operation, then a transformation applied to the cameras, along with the inverse transformation applied to the bundle of rays emanating from the camera, can contrive to leave the observed image unchanged. A reconstruction up to a projective ambiguity is called a projective reconstruction. Mathematically, it is related to the true scene by an invertible projective transformation, which can be defined as a linear operator on the space of back-projected rays from the origin. Given image data, it is always possible to compute a projective reconstruction, but no better, without extraneous knowledge of the scene or imaging devices (Faugeras, 1992; Hartley et al., 1992). We review projective reconstruction for computer vision applications in Section 3.3.
1.2.2 Projective Spaces and Projective Cameras
The intuition that the geometry of 3D reconstruction is really a study of inverse projections can be formalized through the machinery of projective geometry. In projective geometry, points are represented by rays through the origin and lines are represented by hyperplanes through the origin, as visualized for the two-dimensional case in Figure 1.6. Stated differently, in projective geometry, all the points on a line passing through the origin are identified. We encounter projective geometry every day when, for instance, we perceive two rails, or the vertical edges of a skyscraper, as intersecting in a finite vanishing point (Figure 1.7). Indeed, one of the principal axioms of projective geometry is that any two lines in a "plane" must meet. In a projective world, concepts such as squares and circles, which are based on some notion of length, are meaningless.
Suppose a 3D point (X, Y, Z)^T is observed at (u, v)^T on the image plane of a camera of focal length f. Then, assuming the imaging geometry as depicted in Figure 1.8, where the world coordinate system origin coincides with the camera's center of
Figure 1.6: Perspective projection can be interpreted naturally using projective geometry. Identifying points along a ray through the projection center means the 3D world can be interpreted as the projective space P^3, while the image plane is interpreted as the projective plane P^2.
Figure 1.7: Vanishing points, where parallel lines of the real world appear to intersect in an image, are a consequence of projective geometry that we observe in daily life. Any two lines in a projective plane must, axiomatically, intersect at a unique point.

projection, the world axes are aligned with the image plane and the camera faces along the negative depth axis, the perspective projection equations for image formation are given by

    \frac{u}{X} = \frac{v}{Y} = -\frac{f}{Z},    (1.1)

which are non-linear relations. Defining u' = wu and v' = wv, where w \neq 0, the image formation equations can be interpreted in a linear framework as

    \begin{pmatrix} u' \\ v' \\ w \end{pmatrix} =
    \begin{pmatrix} -f & 0 & 0 & 0 \\ 0 & -f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}
    \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}.    (1.2)
Figure 1.8: The perspective projection camera model follows naturally from the projective geometry interpretation of perspective image formation.

While we view the world as a projective 3-space for the purpose of image formation, we can also view the image plane as the projective plane. Image formation, which is the projection of an infinite line through the origin to a 2D point on the image plane, can now be interpreted as a linear projective transformation from P^3 to P^2:

    \begin{pmatrix} u' \\ v' \\ w \end{pmatrix} =
    \begin{pmatrix} -f & 0 & 0 & 0 \\ 0 & -f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}
    \begin{pmatrix} X' \\ Y' \\ Z' \\ W \end{pmatrix},    (1.3)
where X' = WX, Y' = WY and Z' = WZ. Notice that to represent a point in projective n-space P^n, we use n + 1 coordinates, but since scale is immaterial in projective space, only the ratios between the coordinate magnitudes are important. That is, to represent a point X \in R^n as a point in P^n, we can use the (n + 1)-dimensional vector k \cdot (X, 1)^T, for any k \neq 0. In this interpretation, transformations of the camera and scene are now projective transformations, which are represented as invertible, but otherwise arbitrary, 4 \times 4 matrices. The projective ambiguity that we discussed in Section 1.2.1 now has an easy mathematical basis: given just the image points in P^2, when we seek to recover the cameras and scene points that form the image, we can always insert a 4 \times 4 invertible transformation and its inverse between the two entities on the right hand side of (1.3) to obtain the same image points.
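The ambiguity is easy to verify numerically. The sketch below uses the camera of equation (1.3) with an illustrative focal length and scene point (both invented for the example), and checks that applying a random invertible 4 x 4 transformation H to the scene points and its inverse to the camera leaves the image point unchanged.

```python
import numpy as np

f = 2.0                                   # illustrative focal length
# The camera of (1.3): maps homogeneous (X', Y', Z', W)^T to (u', v', w)^T.
P = np.array([[-f, 0.0, 0.0, 0.0],
              [0.0, -f, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])

def image_point(P, X):
    u, v, w = P @ X
    return np.array([u / w, v / w])       # dehomogenize to image coordinates

X = np.array([1.0, -2.0, 4.0, 1.0])       # an arbitrary homogeneous scene point

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 4))               # a generic projective transformation
assert abs(np.linalg.det(H)) > 1e-6       # invertible with overwhelming probability

# Transform the scene by H and the camera by H^{-1}: the image is unchanged,
# so image data alone cannot distinguish the two reconstructions.
x_original = image_point(P, X)
x_transformed = image_point(P @ np.linalg.inv(H), H @ X)
assert np.allclose(x_original, x_transformed)
```

The same check also confirms the scale invariance of homogeneous coordinates: replacing X by kX for any nonzero k leaves `image_point` unchanged.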
1.2.3 Stratification of 3D Reconstruction
The most visually noticeable aspect of a projective reconstruction, for example in Figure 1.9(a), is that all parallel lines of the (metric) scene in the same direction are concurrent. Further, angles between coplanar lines or length ratios are not preserved, but incidence relations remain unchanged. Once a projective reconstruction is computed, the goal of 3D reconstruction is to compute a metric upgrade, that is, determine a transformation that recovers the scene and camera configurations up to a similarity transformation. In other words, the task is to upgrade a projective reconstruction to one more in consonance with human experience. One way to achieve this goal is using some prior knowledge pertaining to the scene. For instance, knowledge of a few angles in the scene suffices to fix all the angles in a reconstruction, which amounts to a reconstruction up to a similarity. Extending the intuition, it is reasonable that a little less knowledge should allow us to achieve a reconstruction which is, in some sense, intermediate between a projective and a metric reconstruction. Indeed, such is the case - knowing a few parallel lines in the scene, for example, allows us to recreate it up to a so-called affine transformation, where parallelism between lines is restored, but in general, not the other angles. Defining such a hierarchy of transformations that can progressively upgrade a projective reconstruction to a metric one is the basis for stratification of 3D reconstruction (Figure 1.9), which we elaborate upon in Section 3.4.
(a) Projective (b) Affine (c) Metric
Figure 1.9: Stratification in 3D. Given images of a scene, a projective reconstruction can be computed, in which parallel lines meet at a vanishing point. A projective reconstruction can be upgraded to an affine one, in which parallelism is restored, if a few parallel lines in the scene are known. Knowing a few angles in the affine reconstruction allows it to be upgraded to a metric one.
An important constraint for a reconstruction is chirality, which requires the imaged scene points to lie in front of the cameras. An arbitrary projective reconstruction need not satisfy chirality, which manifests itself as a violation of the convex hull of the scene points. Enforcing the chirality condition yields a quasi-affine reconstruction, which is simply a projective reconstruction that preserves the convex hull of the scene (Figure 1.10). Section 3.5 reviews chirality in 3D reconstruction in greater detail. Since it forms an important part of Chapters 6 and 7, we can consider quasi-affinity as a separate stratum in our reconstruction hierarchy.
(a) Euclidean “scene” (b) A quasi-affine (c) A general projective reconstruction reconstruction
Figure 1.10: Despite appearances to the contrary, (c) is as valid a projective reconstruction of the house as (b). While the quasi-affine reconstruction preserves the convex hull of the scene in (a), a general projective transformation might not. Note that incidence relations are still preserved in (c).
1.2.4 Autocalibration
Rather than relying on any prior knowledge of the scene, an alternate ingress into a metric world from a projective one is by estimating the internal parameters of the camera (in effect, reducing the setting to a calibrated one). Indeed, with minimal assumptions on the imaging set-up, such as constancy of camera settings or rectangularity of image pixels, it is possible to recover the internal parameters of the cameras using image data alone. This process is called autocalibration or camera self-calibration, which is the subject of Chapters 6 and 7 of this dissertation. A typical approach to calibrating a camera involves using several images of a known calibration grid. Once a correspondence can be ascertained between scene points (or higher order features like curves) and their counterparts on the image plane, it is straightforward to recover the camera parameters. The term autocalibration stems from its key premise that it obviates the requirement for an explicit calibration grid. Instead, it tries to locate the image of the so-called absolute conic, which is an imaginary, but fixed, object on the plane at infinity, whose location in a metric reconstruction is known a priori. Its image can be shown to be related to the internal parameters of the camera, so locating the image of the absolute conic is equivalent to calibrating the camera.
Figure 1.11: Autocalibration can be visualized as determining the plane at infinity and the conic on it which projects to image of the absolute conic in each image.
Thus, geometrically, the goal of autocalibration is to recover the plane at infinity and the absolute conic in a projective reconstruction. This can be visualized as depicted in Figure 1.11, which also motivates the optimization paradigm of autocalibration that we pursue in Chapters 6 and 7: suppose we hypothesize the location of the image of the absolute conic in the input images. Then, if the back-projected cones from the camera centers all intersect in a common conic, it might be the absolute conic and we would have hypothesized the correct camera parameters. If we somehow knew the position of the plane at infinity, then the problem reduces to verifying the intersection of these cones with a known plane, which is arguably a simpler problem. Thus, estimating the position of the plane at infinity in a projective reconstruction is considered the most difficult obstacle to a metric reconstruction (Hartley et al., 1999).
Of course, the above is only a sketch to aid visualization of the problem - in reality, the absolute conic and its image are imaginary entities.
1.2.5 Feature Selection and Matching
From the foregoing discussions, while single view geometric reconstruction is, ostensibly, an ill-defined challenge, multiview reconstruction is much more tractable, provided corresponding features can be identified across the multiple images. All the algorithms that we discuss in this dissertation require salient features in the image as input, which must then be matched across the image dataset to detect correspondences. The two steps of feature detection and matching are a basic requirement for most 3D reconstruction frameworks (there are a few exceptions, for example (Dellaert et al., 2000)). With the exception of Chapter 9, this dissertation assumes that these two steps have already been addressed. This is not to suggest that feature detection and matching are trivial stages of 3D reconstruction - quite to the contrary, they present significant computational challenges as image datasets become larger in this Internet age. However, fast and robust algorithms exist to solve these problems, which work very well in practice for small to medium sized image collections, which is the target application domain of our algorithms.
Feature Detection
Salient features are useful when they are sparse, but informative. Moreover, they should be reproducible, ideally persisting across different image scales and pose transformations. Determining good features for particular applications like 3D reconstruction and object recognition is an active area of research, some of which we will survey in Section 3.1.3. In all instances, except Chapter 9, our features of interest in this dissertation are corners. We review the basic principles of corner detection in Section 3.1.1. The features of interest in Chapter 9 are straight lines. Using higher order features is advantageous in some milieux, since they are better localized and can encode a greater amount of information. Not all the detected corners are good features for structure from motion applications. One of the more striking examples of a bad "corner" is a region where the feature detector outputs a favorable response, since it detects the appropriate intensity changes, but the region does not stay "locked down" under a camera motion. This can arise, for example, at the occlusion interface of two regions at different depths or along T-junctions of two edges in relative motion. Figure 1.12 gives examples of good and bad corners for SFM applications.
Figure 1.12: Examples of good and bad corners. The blue square shows a good corner, which stays “locked down” under a camera motion. The red circle, on the other hand, has the characteristics of a corner, but is not suited for SFM since it does not correspond to the same 3D point when the camera moves.
Feature matching
In a practical SFM application, how do we prevent the feature detector from yielding spurious candidates? The short answer is that we need not prematurely discard putative features. The real requirement for SFM is a correspondence between interest points in an image sequence, that is, being able to identify the same point across an image sequence. In most cases, bad corners are weeded out by the feature matching step. To be able to match corresponding features across images, a descriptor is required for each feature. There are two basic requirements for such a descriptor:
1. It must be sufficiently discriminative, that is, it must distinguish one feature from others.
2. It must be invariant to common transformations, such as rotation, translation and scale changes, as well as illumination variations.
Most descriptors are computed using operations on intensity patches around an interest point. Depending upon the application, these operations might be simply computing a squared difference of intensities, or something more advanced. We will review some of the popular feature matching strategies in Section 3.1.2. Matching features across a large image dataset is the time-critical step for many SFM applications. Thus, appropriate attention must be paid to designing features which are amenable to fast matching algorithms. Some recent advances in designing features tailored for rapid matching are discussed in Section 3.1.3.
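The simplest descriptor mentioned above, a raw intensity patch compared by squared differences, already suffices to sketch the matching step. The patch size, the acceptance rule (keep a match only if its cost is clearly below the second-best cost) and the ratio value below are illustrative choices, not prescriptions of the text.

```python
import numpy as np

def ssd(p, q):
    # Sum of squared differences between two intensity patches.
    d = p.astype(float) - q.astype(float)
    return float((d * d).sum())

def match_features(patches_a, patches_b, ratio=0.8):
    # For each patch in image A, find its best SSD match in image B,
    # keeping it only if clearly better than the second-best candidate
    # (a common discriminativeness heuristic; ambiguous matches are dropped).
    matches = []
    for i, p in enumerate(patches_a):
        costs = sorted((ssd(p, q), j) for j, q in enumerate(patches_b))
        if len(costs) >= 2 and costs[0][0] < ratio * costs[1][0]:
            matches.append((i, costs[0][1]))
    return matches
```

Note that plain SSD satisfies neither invariance requirement above (it is not rotation- or illumination-invariant); it illustrates only the discriminativeness requirement, which is why practical systems use the more advanced descriptors surveyed in Section 3.1.2.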
1.3 The Optimization Framework
In this dissertation, multiview geometry problems are posited and solved in an optimization framework. The optimization paradigm for problem solving is ideal for dealing with noisy data and flexibly combining information from disparate sources - akin to the situation for estimating 3D reconstructions from noisy images, possibly acquired using multiple cameras. In particular, the uncertainties associated with noise or inaccuracies in image acquisition and feature matching can be explicitly modeled in an optimization framework. Also, formulating the goal as an optimization problem allows scalability across an arbitrary number of points or views. Moreover, the logical demarcation between the goal (the objective function) and the means to achieve it (the optimization algorithm) is invaluable for analysis and generalization.
1.3.1 Global Optimization for 3D Reconstruction
A globally optimal algorithm, for a specific class of objective functions, is one that provably converges to the global minimum. We refer the reader to Section 4.1 for a survey of prominent global optimization techniques employed in various disciplines of science and mathematics. In this section, we will motivate the need for global optimization in multiview geometry.
3D reconstruction problems are hard
Since image formation and 3D reconstruction involve projective entities, cost functions usually encountered in 3D reconstruction have intrinsically non-linear characteristics. These cost functions are also typically highly non-convex, so the search space is riddled with local minima, as Figure 1.13 illustrates for the autocalibration problem (details in Section 6.2.4). Even for the simplest of 3D reconstruction problems, namely triangulation, the objective function terrain can be quite complex, as shown in Figure 5.1. For several multiview geometry problems, there may also be many critical camera configurations (Kahl, 2001; Kahl et al., 2000; Sturm, 1997), which might result in a problem that is especially arduous to optimize. For some applications, like bilinear programming in Chapter 8, the number of variables might be very large, so a brute force search in high dimensional space is unlikely to converge in a reasonable amount of time.
Traditional methods have limitations
The standard practice for tackling multiview geometry problems is to first solve a simpler problem, usually a linear least squares one. This solution is then used as an initialization for a non-linear optimization routine that minimizes the actual cost function (Figure 1.14). The simple initial problem corresponds to an algebraic cost function, which may give the correct solution to the geometric problem in the absence of noise. But when noise is present, the linear solution can be quite non-intuitive and may bear no resemblance to the actual geometry of the problem. For some problems, like estimating
(a) Top and side view (b) Side view (c) Contour plot
Figure 1.13: A typical cost function for a multiview geometry problem might have several local minima. In addition, the surface terrain can be very rugged, which necessitates a highly accurate estimation algorithm. This figure shows various views and a contour plot of a 2D slice of the 3D volume that characterizes the variation of the cost function for the autocalibration problem with change in the position of the plane at infinity. Blue denotes a low cost and red denotes a high cost. Please see Section 6.2.4 for further details.

the plane at infinity in Section 6.5, there may not even be a useful linear least squares problem for initialization. The traditional practice in such situations is to straightaway use non-linear minimization with multiple random restarts. The non-linear minimizer of choice for multiview geometry applications is Levenberg-Marquardt (Levenberg, 1944; Marquardt, 1963), which behaves like gradient descent when far from the local minimum and like Gauss-Newton when proximal to it. A Levenberg-Marquardt algorithm tailored to exploit the problem structure and sparsity patterns in multiview geometry is called bundle adjustment (Triggs et al., 1999). Bundle adjustment methods are quite powerful and fast, but suffer from the drawback that they are inherently local minimization algorithms. So, they require a good initialization in the vicinity of the global minimum to be able to converge to it, else they are likely to get stuck in local minima. An example of the advantage of global optimization is shown
Figure 1.14: Traditionally, to minimize a non-convex objective, a linear solution corresponding to an algebraic least squares cost function is used to determine an initialization. Subsequent application of a gradient-based local optimization method converges to a local minimum, which might be far away from the global minimum.

in Figure 7.2. Thus, it is not straightforward to come up with good initializations for multiview geometry problems and even then, there is no way to verify whether the solution achieved by the non-linear minimization routine is indeed the global optimum. Finally, when outliers are present in the data, it is beneficial to pose problems using a robust error measure, like the L1-norm, which may be non-differentiable. Clearly, traditional gradient-based methods cannot be applied in such cases.
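The Levenberg-Marquardt damping behavior described above (gradient-descent-like steps when far from a minimum, Gauss-Newton-like steps when near one) can be sketched in a few lines. This is only a schematic loop for a generic least-squares residual, not the sparse bundle adjustment machinery of (Triggs et al., 1999); the damping schedule, iteration budget and the exponential-fitting demo data are illustrative choices.

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, theta, iters=50):
    # Minimal damping loop: a large lambda shrinks the step toward
    # gradient descent; a small lambda approaches the Gauss-Newton step.
    lam = 1e-3
    cost = float(residual(theta) @ residual(theta))
    for _ in range(iters):
        r, J = residual(theta), jacobian(theta)
        A = J.T @ J + lam * np.eye(len(theta))   # damped normal equations
        step = np.linalg.solve(A, -J.T @ r)
        trial = theta + step
        trial_cost = float(residual(trial) @ residual(trial))
        if trial_cost < cost:
            theta, cost, lam = trial, trial_cost, lam * 0.5  # accept, trust more
        else:
            lam *= 10.0                                      # reject, damp more
    return theta

# Illustrative use: fit y = exp(a*x) to synthetic data generated with a = 0.7.
x = np.linspace(0.0, 1.0, 20)
y = np.exp(0.7 * x)
a_hat = levenberg_marquardt(lambda th: np.exp(th[0] * x) - y,
                            lambda th: (x * np.exp(th[0] * x)).reshape(-1, 1),
                            np.array([0.0]))
```

On this well-behaved problem the loop converges from a poor start, but the drawback noted above is visible in the structure of the iteration: every step is local, so nothing prevents convergence to a nearby local minimum of a cost like that of Figure 1.13.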
The choice of objective function
For some multiview geometry problems, the “best” objective function can be defined by due consideration of the error statistics for image measurements. A common assumption is that image measurement errors obey a Gaussian probability distribution.
This leads to the popular notion of an L2-norm reprojection error, which is the usual objective function for several problems like triangulation (Section 5.2) and projective reconstruction (Section 3.3).
While this dissertation does propose algorithms for minimizing these standard objective functions, we note that the L2-norm reprojection error is optimal only under a set of simplifying assumptions on measurement errors, which might not be satisfied by real world error distributions. For instance, the L2-norm reprojection error results in an objective function too sensitive to the presence of outliers in the data. Using a more robust L1-norm formulation renders the cost function non-differentiable and thus not amenable to minimization by traditional gradient-based methods.
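The outlier sensitivity claim has a one-line illustration in the simplest possible setting: when estimating a single location from repeated measurements, the L2-optimal estimate is the mean and the L1-optimal estimate is the median. The measurement values below are invented for the example.

```python
import statistics

measurements = [10.1, 9.9, 10.0, 10.2, 9.8]
with_outlier = measurements + [100.0]   # one gross mismatch among six values

# L2 (least squares) location estimate is the mean; the L1 estimate is the median.
l2_estimate = statistics.mean(with_outlier)     # dragged far from 10 by one outlier
l1_estimate = statistics.median(with_outlier)   # barely moves
```

A single corrupted value pulls the L2 estimate to 25, while the L1 estimate stays at about 10.05, which is the motivation for the L1-norm formulations of Chapters 5 and 8.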
Our global optimization approach
Following the classification of the survey in Section 4.1, the global optimization algorithms that we propose are deterministic, that is, they do not possess any element of randomization. All our algorithms rely on modern convex optimization methods to construct convex relaxations for the non-convex objective functions encountered in multiview geometry. In most cases, we use these convex relaxations in a branch and bound framework to systematically prune away regions of the search space guaranteed to not contain the global optimum (Section 4.3). In other cases, we construct a series of convex relaxations of increasing complexity that converges to the global minimum (Section 4.2.4).
The need for global optimization
Before we delve deeper into the dissertation, a valid question at this stage concerns the need for global optimization. From common experience, local optimization methods seem to perform satisfactorily in many realistic scenarios. Then why should one develop global optimization methods, which are admittedly more sophisticated than traditional approaches? There are several facets to the answer. The first, most obvious one, is that global optimization algorithms always yield the optimal solution, including in situations where traditional algorithms might break down due to a poor initialization or the complex nature of the objective function's terrain. Indeed, a key feature of our algorithms is that they do not require gradient information or even differentiability of the cost function. Further, while our algorithms will achieve global optimality for any given initialization, we also propose geometrically correct initializations that do not compromise optimality. Second, reaching the global optimum is only the first step towards achieving optimality. The more important step is to be able to prove that the solution is, indeed, globally optimal. Typical local minimization methods do not provide any mechanism for ascertaining the quality of the solutions they achieve. Our methods, on the other hand, always terminate with a certificate of optimality which guarantees the solution to lie within a pre-specified tolerance of the global optimum. Indeed, most of the time consumption of our algorithms is directed towards verifying optimality - the optimal point itself is usually recovered in a fraction of that time. Third, while local optimization methods will always be an attractive proposition for system implementations due to their speed advantages, it is important to be able to determine the conditions under which they break down.
Before our global optimization algorithms, there was no mechanism for characterizing the failure modes of popular gradient-based algorithms in the face of real-world data where no ground truth is available. Thus, following the terminology of (Hartley and Zisserman, 2004), some of our algorithms are Gold Standard ones for the concerned multiview geometry problems.

Finally, we note that a global optimization algorithm with poor empirical convergence behavior is tantamount to a brute force search and of little practical utility. The primary reason our algorithms demonstrate reasonable solution times is that special properties of multiview geometry cost functions are exploited at every stage of the optimization. This dissertation is, thus, also an advocacy of the importance of incorporating domain knowledge into the optimization framework. That said, while we provide a unified framework for obtaining optimal solutions to several 3D reconstruction problems, our constructions are in many ways general enough to find application in domains beyond multiple view geometry.
1.3.2 Optimization for Robust SFM
From the discussion on feature selection and matching in Section 1.2.5, it is apparent that even with conservative matching criteria, it is not always possible to produce a perfect set of correspondences. How can we ensure that false matches, or outliers, do not corrupt the structure and motion estimates? This problem falls within the ambit of robust SFM.

There are several ways in which robustness can be achieved in SFM applications. One option is to use a cost function that is relatively insensitive to outliers. For instance, in Chapters 5 and 8, we will propose algorithms for solving problems under the L1-norm, which corresponds to an error distribution with a thicker tail.

An alternative is to use minimal solutions within a hypothesize-and-test framework. A minimal solution to an SFM problem is one achievable using the minimum amount of data. While a minimal solution by itself is susceptible to noise, it is indispensable for rejecting outliers. A hypothesize-and-test framework, such as Random Sample Consensus (RANSAC), is used for robustly solving a model-fitting problem. For instance, suppose we want to find a line that best fits a given set of points; then the model is a straight line and a minimal sample consists of 2 points. RANSAC randomly chooses a minimal sample from the data and uses it to compute the model. Then, it determines the support for this sample, that is, the number of other points in the dataset consistent with its model. A model that retains a large consensus set is more likely to be correct than one that does not. In fact, for a given probability p < 1, one can pre-determine the number of random samples that RANSAC must evaluate to determine the correct model with a probability of success equal to p. In Chapter 9, we will develop a line-based SFM algorithm that computes the displacement of a stereo camera rig using the minimum amount of data.
A minimal solution is required to be fast, but not necessarily robust; a good solution should nevertheless incorporate problem-specific constraints (like orthonormality of the rotational component of the motion). A good minimal solution translates into fewer RANSAC trials, which reduces computation time and might be critical in situations where the feature set is sparse. We use our minimum data solution in a RANSAC framework to achieve a robust solution that operates at close to real-time rates for indoor robotic navigation.
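The number of RANSAC trials mentioned above follows from a standard argument: if w is the inlier fraction and s the minimal sample size, the probability that all of N random samples contain at least one outlier is (1 − w^s)^N. A minimal sketch of this computation (the function name is ours, not from the dissertation):

```python
import math

def ransac_trials(p, inlier_ratio, sample_size):
    """Number of RANSAC samples needed so that, with probability p,
    at least one sample consists entirely of inliers."""
    # P(all N samples contain an outlier) = (1 - w^s)^N <= 1 - p
    w_s = inlier_ratio ** sample_size
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - w_s))

# Line fitting (s = 2) with half the points being outliers:
print(ransac_trials(0.99, 0.5, 2))  # 17 trials suffice
```

Note how weakly the count depends on the total number of points, and how strongly it depends on the sample size s, which is why minimal solutions matter.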
1.4 Contributions of the Dissertation
Let us summarize our discussions so far and portend the upcoming ones by emphasizing the contributions of this dissertation.
1. Global optimization for 3D reconstruction: Our first set of contributions con- sists of globally optimal solutions to several problems that arise in geometric 3D reconstruction. Traditional methods often break down for these problems due to the complex nature of cost functions that arise in multiview geometry or due to the large number of variables that some reconstruction problems involve. Some of the problems that we globally minimize are:
• Triangulation and camera resectioning (Chapter 5)
• Stratified autocalibration (Chapter 6)
• Direct autocalibration (Chapter 7)
• Shape reconstruction from exemplars (Chapter 8)

2. Provably tight convex relaxations: We exploit recent developments in convex optimization theory to develop tight convex relaxations for the non-convex objective functions that commonly arise in multiview geometry. These convex relaxations are shown to be theoretically viable and practically ideal for use in a branch and bound framework for global optimization.
3. Provable strategies for tractable branch and bound: Our next set of contributions demonstrates how incorporating domain knowledge into the branch and bound optimization framework makes seemingly intractable problems solvable. In most multiview geometry problems encountered in this dissertation, the dimensionality of the search space increases with the number of points or views. We propose several extensions to a traditional branch and bound scheme that allow us to achieve fast convergence rates in practice. Some examples of these extensions are:
• A novel bounds propagation scheme that capitalizes on the geometric structure of 3D reconstruction problems to restrict the effective dimensionality of our search space to a small, fixed number, irrespective of problem size.

• A problem-specific procedure for selecting the initial search region which is not inordinately large, yet is guaranteed to contain the global optimum.

• A new, non-exhaustive branching strategy that is demonstrably convergent for some 3D reconstruction problems, even though it performs branching only in the dimensions corresponding to a small subset of the problem variables.
4. Real-time SFM for autonomous robotic navigation: Finally, we turn our attention to developing a real-time SFM system for autonomous robotic navigation, which will be deployed on Honda's humanoid ASIMO robot. This system solves a minimum data problem to localize a stereo rig employing straight line features. Chapter 9 describes a bottom-up system implementation, from feature detection and tracking to solving the minimum data SFM problem and using it in a RANSAC framework for robust estimation.
1.5 How to Read This Dissertation
This dissertation touches upon a broad array of subjects in multiple view geometry and optimization theory, but aims to cater to experts as well as beginners. Therefore, it can be read in multiple ways, according to the reader's background or objectives.

While the foregoing chapter was intended to serve the dual purpose of a historical perspective and an intuitive preview of the challenges we address, a more didactic introduction to the prerequisites for this dissertation is contained in Chapters 2, 3 and 4. These chapters are written with the aim of ensuring the dissertation's self-sufficiency and should be of particular appeal to, say, an entry-level graduate student with an adequate linear algebra background. Specifically, Chapter 2 is a review of concepts from projective geometry that will be used to formulate and solve our 3D reconstruction problems. Chapter 3 presents the relevant background from multiple view geometry. Chapter 4 outlines the convex optimization theory and global optimization frameworks that will be used throughout the dissertation.

The next block within the dissertation, Chapters 5 to 8, presents globally optimal solutions for several well-known problems pertaining to the geometry of multiple views. While the theoretical background and tools required for these chapters may vary, they are organized in order of the apparent complexity of the task. Accordingly, we begin in Chapter 5 with the simplest of multiview geometry problems, namely triangulation and camera resectioning. These "simple" tasks can nevertheless be shown to be NP-hard. We propose globally optimal solutions for these problems, posed in various error norms, based on modern convex relaxations in a branch and bound framework, which is tailored to achieve fast convergence in practice by exploiting the underlying geometric structure.
The same framework is extended to propose globally optimal algorithms for both the affine and metric upgrade stages of stratified autocalibration in Chapter 6. The most elemental, yet powerful, mechanism of chirality in projective geometry is brought into play to serve as a theoretically justified and practically useful initialization for the algorithm. While the subject of Chapter 7, direct autocalibration, is pedagogically more advanced than the stratified approach, its projective geometry representation is more concise, which leads to a simpler formulation. Consequently, it becomes amenable to solution in a framework of convex linear matrix inequality relaxations, based on the elegant theory of positive polynomials from real algebraic geometry.

Chapter 8 tackles bilinear fitting, which arises in yet more complex problems like single-view shape reconstruction using 3D exemplars and non-rigid structure from motion. The apparent difficulty of bilinear programming in computer vision stems from the large dimensionality of its search space. However, we exploit an underlying trait of geometric cost functions to effectively restrict the number of dimensions to a small, fixed number, which makes global optimization practical.

Chapter 9 of the dissertation is of greatest interest to the practitioner. It outlines our construction of a robust and accurate 3D reconstruction pipeline, from feature detection and tracking to full SFM, which will form one of two parallel components in the autonomous navigation system for Honda's humanoid robot, ASIMO. The structure and motion algorithm described here is based on stereo input and uses straight lines as features.

Finally, we conclude the dissertation in Chapter 10 with further discussions of the perceived impact of this work and our outlook towards the future.

Chapter 2
Preliminaries: Projective Geometry
“Beauty itself is but the sensible image of the infinite.”
George Bancroft (American historian, 1800-1891 AD), The Necessity, Reality and Promise of Progress of the Human Race
In this chapter, we will develop some of the terminology and notation useful in formulating the 3D reconstruction problems for which we seek globally optimal solutions. We will begin with a brief review of concepts from projective geometry, which forms the basis for mathematical representation of multiview geometry. Large segments of this chapter borrow liberally from (Hartley and Zisserman, 2004), where a more detailed treatment of the relevant material can be found.
2.1 Axiomatic Projective Geometry
Like any geometry, it is possible to develop the entire framework for projective geometry axiomatically. For completeness, we give a brief preview of the axiomatic framework for projective geometry here and refer the reader interested in a more detailed exposition to texts such as (Beutelspacher and Rosenbaum, 1998). A geometry G = (S, I) can be defined as a pair consisting of a set S and a symmetric and reflexive incidence relation I between the elements of the set. That is,
for any x, y ∈ S, if (x, y) ∈ I, then (y, x) ∈ I and (x, x) ∈ I. For instance, in the classical geometry of 3 dimensions, E, the set S consists of the points, lines and planes of Euclidean 3-space and the incidence relation I is our notion of "contained by" and "passes through".

A flag of G is a set of elements of S that are mutually incident. A flag F is maximal when there is no element in S \ F whose union with F is a flag. G has rank r if S can be partitioned into r subsets, S_1, ..., S_r, such that every maximal flag contains exactly one element of each subset. Each maximal flag in a geometry of rank r must have exactly r elements. For example, in the case of Euclidean geometry of 3-space, the set S itself is the only maximal flag, which can be partitioned into three subsets, which correspond to the set of all points, lines and planes. So, E is a geometry of rank 3.
A subgeometry G′ = (S′, I′) is defined by a subset S′ ⊆ S and a relation I′ induced by I, that is, I′ is the restriction of I to S′. In particular, the geometry G_i = (S \ S_i, I_i), where I_i is induced from G, is a geometry of rank r − 1. It follows that any geometry of rank r ≥ 2 can be studied as a rank 2 geometry. The two subsets S_1 and S_2 of a rank 2 geometry can be considered as corresponding to points and lines, and will now be denoted P and L.

Let G = ({P, L}, I) be a geometry of rank 2. Then the following axioms are satisfied by any projective geometry:
• Axiom 1: For every two distinct points, there is exactly one line incident to them.

• Axiom 2: For points A, B, C, D, where A and D are distinct from B and C, if AB intersects CD, then AC intersects BD.

• Axiom 3: Any line is incident with at least three points.

• Axiom 4: There are at least two lines.

A projective space is a geometry of rank 2 that satisfies Axioms 1, 2 and 3. A non-degenerate projective space also satisfies Axiom 4. A projective plane is a non-degenerate projective space where Axiom 2 is replaced by a stronger one:

• Axiom 2′: Any two lines have at least one point in common.

As an example, in this dissertation, the image plane is considered as a projective plane, so any two image lines are considered intersecting.

In the following chapters, we will be concerned with transformations between projective spaces. This can be formalized as follows. Suppose G = ({P, L}, I) and G′ = ({P′, L′}, I′) are two rank 2 geometries. If there is a map h : {P, L} → {P′, L′} such that P is mapped bijectively to P′ and L is mapped bijectively to L′, then h is an isomorphism from G to G′. An automorphism is an isomorphism from a rank 2 geometry to itself. For a rank 2 geometry with a notion of a line, an automorphism is also called a collineation or a homography. We will frequently encounter homographies in this dissertation in the context of the image plane treated as a projective plane, for example, inter-image homographies or homographies between a scene plane and the image plane.

While the rich geometry of projective spaces can be wholly developed in this axiomatic framework, for ease of computational representation, we will adopt the coordinate approach for the rest of this dissertation.
2.2 Projective Geometry of 2D
Points and Lines in 2D
Since absolute depth is not important in projective geometry, it is convenient to represent entities algebraically in homogeneous coordinates. A 2D point in Cartesian coordinates, (x, y)^T, has the homogeneous representation x = (x, y, 1)^T, which is equivalent to (kx, ky, k)^T for any k ∈ R, k ≠ 0.

A 2D line whose equation is ax + by + c = 0 may be represented in homogeneous coordinates as l = (a, b, c)^T; note that (ka, kb, kc)^T represents the same 2D line. It follows that a point x lies on the line l if and only if x^T l = 0.
The point of intersection of two lines l and l′ is given by the vector cross product x = l × l′. This may be easily verified, since x^T l = x^T l′ = 0. Similarly, the join of two points x and x′ is given by the line l = x × x′.
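Both operations are easy to check numerically; a minimal sketch using NumPy (the example lines and points are ours):

```python
import numpy as np

# Lines x = 1 and y = 2 in homogeneous form (a, b, c) for ax + by + c = 0
l1 = np.array([1.0, 0.0, -1.0])
l2 = np.array([0.0, 1.0, -2.0])

# The intersection of two lines is their cross product
x = np.cross(l1, l2)
x = x / x[2]                      # de-homogenize
print(x)                          # [1. 2. 1.] -- the point (1, 2)

# The join of two points is likewise a cross product
p, q = np.array([0.0, 0.0, 1.0]), np.array([1.0, 1.0, 1.0])
l = np.cross(p, q)
assert abs(l @ p) < 1e-12 and abs(l @ q) < 1e-12  # both points lie on l
```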
Ideal points
The above definition of line-line intersection also serves to illustrate Axiom 2’ in
Section 2.1 and how it distinguishes the projective plane P2 from the Euclidean plane R2. Consider two parallel lines in R2, with the equations ax + by + c = 0 and 2 ax + by + c0 = 0. It is common to say that these two lines “never intersect”. In P , they are represented as homogeneous vectors l = (a, b, c)> and l0 = (a, b, c0)>, whose 2 intersection is x = l l0 = (c c0)( b, a, 0)>, which is a valid point in P . However, ∞ × − − we cannot represent the point x in R2, since its de-homogeneization corresponds to a ∞ division by zero. While this corresponds with our Euclidean notion that parallel lines can never meet, projective geometry provides a framework for uniform treatment of finite and infinite points.
2 A point such as x P , having the form (x1, x2, 0)>, does not have a Euclidean ∞ ∈ analogue and is called an ideal point. From the incidence relation of points and lines in P2, we observe that all ideal points must lie on a common line, the aptly termed line at infinity, given by l = (0, 0, 1)>. Thus, the line at infinity is what distinguishes the ∞ 2 2 S projective plane from the Euclidean plane, that is, P = R l . ∞
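A quick numerical confirmation of this behavior (our own illustrative snippet):

```python
import numpy as np

# Two parallel lines: x + y = 0 and x + y - 3 = 0
l  = np.array([1.0, 1.0,  0.0])
lp = np.array([1.0, 1.0, -3.0])

x_inf = np.cross(l, lp)          # their projective intersection
print(x_inf)                     # [-3.  3.  0.] -- an ideal point
l_inf = np.array([0.0, 0.0, 1.0])
assert x_inf[2] == 0 and x_inf @ l_inf == 0  # it lies on the line at infinity
```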
Duality
It may be apparent to the reader that points and lines enjoy a symmetric relationship in the projective plane P^2. Indeed, this is a consequence of the duality principle, which states that any theorem about incidences between points and lines in the projective plane may be transformed into another theorem about lines and points, by a substitution of the appropriate words. The duality principle in P^2 is a special case of duality for general projective spaces P^n, whereby entities of dimension d can be interchanged with those of codimension d + 1.
Conics
Conics play a central role in multiview geometry, especially in the theory of autocalibration. Geometrically, a conic (or conic section) is obtained as the intersection of a plane in R^3 with a cone in R^3. Algebraically, it is a curve on a 2D plane defined by a degree 2 equation:

ax^2 + 2bxy + cy^2 + 2dx + 2ey + f = 0,    (2.1)
or equivalently,

x^T C x = 0,  where  C = [ a  b  d ]
                         [ b  c  e ]
                         [ d  e  f ]  is symmetric.    (2.2)

Clearly, five points in general position suffice to define a conic.

A conic is said to be degenerate when C is rank deficient. An example of a degenerate conic of rank 2 is a pair of lines: two lines l and l′ can together be represented by the conic C = l l′^T + l′ l^T. An example of a rank 1 conic is a repeated line.
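Since (2.2) is linear in the six coefficients (a, b, c, d, e, f), a conic through five general points can be recovered as the null vector of a 5 × 6 design matrix. A sketch of that computation (NumPy; the function name and example points are ours):

```python
import numpy as np

def conic_through(points):
    """Fit a, b, c, d, e, f in ax^2 + 2bxy + cy^2 + 2dx + 2ey + f = 0
    through five points, via the SVD null vector of the design matrix."""
    rows = [[x * x, 2 * x * y, y * y, 2 * x, 2 * y, 1.0] for x, y in points]
    _, _, Vt = np.linalg.svd(np.array(rows))
    a, b, c, d, e, f = Vt[-1]
    return np.array([[a, b, d], [b, c, e], [d, e, f]])

# Five points on the unit circle x^2 + y^2 = 1
pts = [(1, 0), (0, 1), (-1, 0), (0, -1), (np.sqrt(0.5), np.sqrt(0.5))]
C = conic_through(pts)
for x, y in pts:
    v = np.array([x, y, 1.0])
    assert abs(v @ C @ v) < 1e-9   # every point satisfies x^T C x = 0
```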
Tangent to a conic
A line l is tangent to a conic C at the point x if and only if l = Cx. Note that tangency is defined only for non-degenerate conics. To prove this, note that l does pass through x, since l^T x = x^T C x = 0. Next, suppose l passes through another point y on C. Then y^T C y = 0 and y^T C x = 0. It easily follows that for any α ∈ R, (x + αy)^T C (x + αy) = 0, that is, the entire line joining x and y lies on the conic C, which is not possible unless C is degenerate. Thus, l = Cx is tangent to C at x.
Dual conics
A dual conic is defined as the envelope of all lines tangent to a point conic. The dual to a point conic C is characterized by the relation

l^T C* l = 0,    (2.3)

where C* represents the adjoint of C. Indeed, it is easy to show that the dual of a non-degenerate conic C is given by C* (C* = C^{-1} when C is full rank). Suppose l = Cx is tangent to C at x. Then,

x^T C x = 0  ⇔  (C^{-1} l)^T C (C^{-1} l) = 0  ⇔  l^T C^{-T} l = 0  ⇔  l^T C^{-1} l = 0,

since C is symmetric.
Effect of projective transformations
A general projective transformation H : P^2 → P^2, also called a homography, is represented by a full-rank 3 × 3 matrix H.

When a projective transformation is applied to a space such that points are transformed as x′ = Hx, then lines are transformed as l′ = H^{-T} l. This can be easily verified as a requirement for preserving incidence, since

x^T l = 0  ⇔  x′^T l′ = 0.    (2.4)

Under a point transformation x′ = Hx, a point conic C transforms as C′ = H^{-T} C H^{-1}. To verify this, we note

x^T C x = 0  ⇒  x′^T (H^{-T} C H^{-1}) x′ = 0  ⇒  C′ = H^{-T} C H^{-1}.    (2.5)

Similarly, under a point transformation x′ = Hx, a dual conic C* transforms as C*′ = H C* H^T, which can again be verified as

l^T C* l = 0  ⇒  l′^T (H C* H^T) l′ = 0  ⇒  C*′ = H C* H^T.    (2.6)

Borrowing the terminology of tensor algebra, while conics transform covariantly, dual conics transform contravariantly.
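These transformation rules can be verified numerically; a small sketch (NumPy, with our own example values):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((3, 3))          # a generic full-rank homography
x = np.array([2.0, -1.0, 1.0])           # a point
l = np.array([1.0, 2.0, 0.0])            # a line through it: x + 2y = 0
assert abs(x @ l) < 1e-12

xp = H @ x                               # points transform as x' = Hx
lp = np.linalg.inv(H).T @ l              # lines as l' = H^{-T} l
assert abs(xp @ lp) < 1e-9               # incidence is preserved (2.4)

C = np.diag([1.0, 1.0, -1.0])            # the unit circle as a conic
Cp = np.linalg.inv(H).T @ C @ np.linalg.inv(H)   # rule (2.5)
p = np.array([1.0, 0.0, 1.0])            # a point on the circle
assert abs((H @ p) @ Cp @ (H @ p)) < 1e-9  # transformed point on transformed conic
```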
2.3 Projective Geometry of 3D
The geometry and interactions of the dimension 0 and codimension 1 entities in P3, namely, points and planes, follow rules very similar to those for the corresponding entities in P2. While points and planes are duals in P3, lines are self-dual since they have dimension 1 and codimension 2.
2.3.1 Points and Planes
A 3D point (x, y, z)^T ∈ R^3 is represented in P^3 by a homogeneous four-vector X = (kx, ky, kz, k)^T, where k ∈ R and k ≠ 0. A plane in R^3, with the equation ax + by + cz + d = 0, is represented in P^3 by a homogeneous four-vector π = (ka, kb, kc, kd)^T, which is consistent with the incidence relation π^T X = 0. Under a 3D projective transformation H : P^3 → P^3, if points transform as X′ = HX, then planes transform as π′ = H^{-T} π.

The intersection of three planes π, π′ and π″ is given by the point X = null([π, π′, π″]^T). Similarly, the join of three points X, X′ and X″ is given by the plane π = null([X, X′, X″]^T).
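The null space here can be computed with an SVD; a minimal sketch (the helper name and example planes are ours):

```python
import numpy as np

def null_vector(A):
    """Unit vector spanning the (one-dimensional) null space of A."""
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]

# Planes x = 1, y = 2, z = 3 in homogeneous form (a, b, c, d)
planes = np.array([[1, 0, 0, -1],
                   [0, 1, 0, -2],
                   [0, 0, 1, -3]], dtype=float)
X = null_vector(planes)
X = X / X[3]                       # de-homogenize
print(X)                           # [1. 2. 3. 1.] -- the common point
```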
2.3.2 Lines
Unlike points and planes, the algebraic treatment of 3D lines is not straightforward in projective geometry. Part of the reason is that lines in P^3 can be shown to be in bijective correspondence with points on the so-called Klein quadric in P^5. So, lines in P^3 may be represented as homogeneous vectors in R^6 whose coordinates must satisfy an implicit constraint, making it difficult for them to interact with points and planes, which are simply homogeneous vectors in R^4. In the language of algebraic geometry, lines in R^3, when interpreted as projective lines in P^3, correspond to two-dimensional subspaces of a four-dimensional vector space, which is the basis for the so-called Plücker embedding. The interested reader is referred to (Pottmann and Wallner, 2001) for an excellent treatment of 3D line geometry.
Span and null-space representations
A 3D line may be represented as the join of two points X, Y ∈ P^3 by the 2 × 4 matrix L = [X Y]^T. The span of L^T is the pencil of points λX + µY that define the line. Alternatively, the dual representation of a line is as the intersection of two planes. A line formed by the intersection of planes π_1, π_2 ∈ P^3 can be represented by the 2 × 4 matrix L* = [π_1 π_2]^T, since the null space of L* is the pencil of points on that line.

The two dual representations are related as L L*^T = L* L^T = 0_{2×2}.
Plücker matrix representations

The 3D line joining two points X and Y can be represented as a 4 × 4 skew-symmetric Plücker matrix:

L = X Y^T − Y X^T.    (2.7)

The matrix L has rank 2; in fact, its two-dimensional null space is spanned by the pencil of planes with the 3D line joining X and Y as axis. It can be verified that under a point transformation H : P^3 → P^3, such that X′ = HX, the line L in (2.7) transforms as L′ = H L H^T.

The dual Plücker matrix representation is a 4 × 4 skew-symmetric matrix of rank 2 that parameterizes a 3D line as the intersection of two planes π_1 and π_2:

L* = π_1 π_2^T − π_2 π_1^T.    (2.8)

Under a point transformation X′ = HX, the dual line L* transforms as L*′ = H^{-T} L* H^{-1}.

This skew-symmetric Plücker representation is especially convenient for representing join and incidence relations:
• The plane defined by the join of line L and point X is π = L* X. Further, X lies on L if and only if L* X = 0.

• The point defined by the intersection of line L with the plane π is X = L π. Further, L lies on π if and only if L π = 0.
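A short numerical check of these relations (NumPy; the example line and plane are ours):

```python
import numpy as np

X = np.array([0.0, 0.0, 0.0, 1.0])       # the origin
Y = np.array([1.0, 1.0, 1.0, 1.0])       # the point (1, 1, 1)
L = np.outer(X, Y) - np.outer(Y, X)      # Plücker matrix of the line, as in (2.7)

pi = np.array([1.0, 0.0, 0.0, -2.0])     # the plane x = 2
P = L @ pi                               # intersection of line and plane
P = P / P[3]
print(P)                                 # [2. 2. 2. 1.] -- on the line x = y = z
assert abs(pi @ P) < 1e-12               # the point lies on the plane
```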
Plücker coordinates

Algebraically, the Plücker coordinates of a line correspond to the six strictly upper triangular entries of the skew-symmetric matrix L in (2.7). Since det(L) = 0, these six coordinates must satisfy a quadratic constraint, which defines the Klein quadric.

But Plücker coordinates also arise naturally in line geometry, and it is possible to develop their theory in an axiomatic framework too. Accordingly, the line L spanned by the points X = (x^T, x_0)^T and Y = (y^T, y_0)^T is given by their exterior product, which can be represented as the homogeneous 6-vector

L = X ∧ Y = (l^T, l̄^T)^T = (x_0 y^T − y_0 x^T, (x × y)^T)^T,    (2.9)

where (l^T, l̄^T)^T are the Plücker coordinates of L. The dual Plücker coordinates specify the line L contained in the intersection of planes π_1 = (p^T, p_0)^T and π_2 = (q^T, q_0)^T as

L* = π_1 ∧ π_2 = (l*^T, l̄*^T)^T = (p_0 q^T − q_0 p^T, (p × q)^T)^T.    (2.10)

The Plücker coordinates of L are then given by (l^T, l̄^T)^T = ((p × q)^T, (p_0 q − q_0 p)^T)^T.

The plane spanned by a line (l^T, l̄^T)^T and a point (x^T, x_0)^T is

(p^T, p_0)^T = (−x_0 l̄^T + (x × l)^T, x^T l̄)^T.    (2.11)

The intersection of a line (l^T, l̄^T)^T and a plane (p^T, p_0)^T is the point given by the dual relation:

(x^T, x_0)^T = (−p_0 l^T + (p × l̄)^T, p^T l)^T.    (2.12)

Geometrically, the vector l represents the direction of the line L, that is, (l^T, 0)^T is the ideal point of L. For any non-ideal point (x^T, 1)^T on the line L, the Plücker coordinates of L are computed as the join of (l^T, 0)^T and (x^T, 1)^T, given by (l^T, l̄^T)^T = (l^T, (x × l)^T)^T. The orthogonality of l and l̄ gives a quadratic relation which defines the Klein quadric.
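The construction (2.9) and the Klein quadric constraint can be checked directly; a small sketch (NumPy, our variable names):

```python
import numpy as np

def pluecker(X, Y):
    """Plücker coordinates (l, lbar) of the line joining the homogeneous
    4-vectors X = (x, x0) and Y = (y, y0), as in (2.9)."""
    x, x0 = X[:3], X[3]
    y, y0 = Y[:3], Y[3]
    return x0 * y - y0 * x, np.cross(x, y)

X = np.array([1.0, 2.0, 3.0, 1.0])
Y = np.array([4.0, 6.0, 8.0, 2.0])
l, lbar = pluecker(X, Y)
# The six coordinates satisfy the Klein quadric constraint l . lbar = 0
assert abs(l @ lbar) < 1e-12
```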
In algebraic geometry, Plücker coordinates also provide a natural connection to the rich theory of Grassmannians.
2.3.3 Quadrics
A quadric is a surface that defines the locus of points X ∈ P^3 satisfying

X^T Q X = 0,    (2.13)

where Q is a symmetric 4 × 4 matrix. Quadrics may be ruled (through every point on Q there exists a straight line that lies on Q) or non-ruled. Examples of the latter are some common surfaces like spheres and ellipsoids, while a hyperboloid of one sheet is an example of the former.

The envelope of all planes π tangent to a quadric Q defines the dual quadric Q*, which satisfies the equation

π^T Q* π = 0,    (2.14)

where Q* is the adjoint of Q, equal to Q^{-1} for a full-rank Q.

Under a point transformation X′ = HX, a quadric transforms as

Q′ = H^{-T} Q H^{-1},    (2.15)

while a dual quadric transforms as

Q*′ = H Q* H^T.    (2.16)
2.4 The Projective Camera
Referring to Figure 2.1, we saw in Section 1.2.2 that the image formation process by perspective projection, when the camera is centered and axis-aligned, can be represented as

x = PX,  where  P = [ f  0  0  0 ]
                    [ 0  f  0  0 ]
                    [ 0  0  1  0 ],    (2.17)

where x ∈ P^2 is the image of X ∈ P^3 and f is the focal length of the camera.
Figure 2.1: The centered and axis-aligned projective camera is represented by a perspective projection matrix.
The 3 × 4 camera matrix can be rewritten as

P = K [ I | 0 ],  where  K = [ f  0  0 ]
                             [ 0  f  0 ]
                             [ 0  0  1 ].    (2.18)

The matrix K encodes the internal parameters of the camera. The projection equation in (2.17) assumes that the origin of coordinates in the image plane coincides with the point where the principal axis intersects it, the principal point. However, if the principal point is located at some (u, v)^T instead of (0, 0)^T, then the image formation equation is

x = K [ I | 0 ] X,  where  K = [ f  0  u ]
                               [ 0  f  v ].    (2.19)
                               [ 0  0  1 ]

The effect of any non-rectangularity in the camera pixels can be modeled by introducing a pixel skew, s = cot θ. Finally, we account for the fact that focal lengths in the x and y directions might be slightly different in a real, lens-based camera, to get an upper-triangular internal parameter matrix:

K = [ f_x  s    u ]
    [ 0    f_y  v ].    (2.20)
    [ 0    0    1 ]
The above projection equations are derived under the assumption that the camera is situated at the origin of the world coordinate system, with its principal axis pointing along the positive Z-axis. In general, if the camera center is C̃ in inhomogeneous coordinates, then the camera's coordinate system is related to the world's as X̃_cam = R(X̃_world − C̃), where R is a rotation. This can be written in homogeneous coordinates as

X_cam = [ R    −RC̃ ] X_world.    (2.21)
         [ 0^T   1  ]

The image formation equation for a perspective camera now becomes

x = KR [ I | −C̃ ] X = K [ R | t ] X,    (2.22)

where t = −RC̃. While the upper triangular matrix K encodes the internal parameters, (R, t) represent the exterior pose of the camera (Figure 2.2).

Note that any 3 × 4 matrix, P = [ A | a ], that maps a 3D point to a 2D one as x = PX can be interpreted as a projective camera, as long as A is non-singular, since A can always be decomposed as A = KR using RQ decomposition. The non-singularity condition ensures that the projection is from the 3D world onto a 2D image plane and not onto a 1D line or a single point.
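As a concrete instance of (2.22), with example values of our own choosing:

```python
import numpy as np

f, u, v = 500.0, 320.0, 240.0
K = np.array([[f, 0, u],
              [0, f, v],
              [0, 0, 1.0]])              # internal parameters (zero skew)
R = np.eye(3)                            # camera aligned with the world axes
C = np.array([0.0, 0.0, -10.0])          # camera center, 10 units behind origin
t = -R @ C
P = K @ np.hstack([R, t[:, None]])       # P = K [R | t]

X = np.array([1.0, 0.0, 0.0, 1.0])      # the world point (1, 0, 0)
x = P @ X
x = x / x[2]                             # de-homogenize
print(x)                                 # [370. 240. 1.] -- pixel coordinates
```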
The center of a projective camera is given by C = null(P). Let P_i^T denote row i of the camera matrix. Then P_3 is the principal plane, which is parallel to the image plane and passes through the camera center. If the camera is represented as P = [ A | b ], then the direction of the principal axis, which passes through the camera center and points to the "front" of the camera, is given by

v = det(A) a_3,    (2.23)
Figure 2.2: The internal and external parameters of a projective camera. (a) Besides the focal length, the internal parameters of the camera account for the principal point offset and the pixel skew. (b) The position and orientation of the camera with respect to a world coordinate frame constitute the external parameters of the camera.
where a_3^T is the third row of A.

If X = (X, Y, Z, T)^T is a 3D point, imaged by a camera P = [ A | b ] as the point x = w(x, y, 1)^T, then the depth of the point X with respect to the camera P is defined as

depth(X; P) = sign(det A) w / (T ‖a_3‖),    (2.24)

where ‖a_3‖ is the norm of the third row of A.
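Both the camera center and the depth in (2.24) are easy to verify numerically (our sketch; the null space is computed via an SVD):

```python
import numpy as np

K = np.diag([500.0, 500.0, 1.0])
R = np.eye(3)
t = np.array([0.0, 0.0, 4.0])            # camera center at (0, 0, -4)
P = K @ np.hstack([R, t[:, None]])

# Camera center: the null vector of P
_, _, Vt = np.linalg.svd(P)
C = Vt[-1] / Vt[-1][3]
assert np.allclose(C[:3], [0, 0, -4])

# Depth of a point via (2.24): sign(det A) * w / (T * ||a3||)
A = P[:, :3]
X = np.array([1.0, 1.0, 2.0, 1.0])       # a point in front of the camera
w = (P @ X)[2]
depth = np.sign(np.linalg.det(A)) * w / (X[3] * np.linalg.norm(A[2]))
assert np.isclose(depth, 6.0)            # 2 - (-4) = 6 units in front
```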
2.5 The Plane at Infinity and Its Denizens
The projective space can be considered as the union of real space and the set of all points at infinity, or ideal points. Such points in P^3 are characterized by the last coordinate of the homogeneous representation being 0. It is evident that all ideal points X_∞ = (X, Y, Z, 0)^T in P^3 must lie on the plane at infinity, represented as π_∞ = (0, 0, 0, 1)^T, since they satisfy π_∞^T X_∞ = 0. Thus, P^3 = R^3 ∪ π_∞.

The importance of the plane at infinity in geometric reconstruction problems stems from the fact that while it is moved out of its canonical position by a projective transformation, it stays fixed under an affine transformation. So, once we can locate the plane at infinity in a projective reconstruction, we can compute a transformation that takes it back to its canonical position, which results in a reconstruction that differs by an affinity from the true Euclidean reconstruction. These statements are formalized below.
Theorem 1. The plane at infinity is fixed under a projective transformation H if and only if H is an affinity.
Proof. Suppose H is an affine transformation, represented as

H = [ A_{3×3}  t ]
    [ 0^T      1 ].    (2.25)

Then it is easy to see that the plane at infinity is fixed under H, since

H^{-T} π_∞ = [ A^{-T}       0 ] (0, 0, 0, 1)^T = (0, 0, 0, 1)^T = π_∞.    (2.26)
             [ −t^T A^{-T}  1 ]

For the converse, suppose H is a projective transformation that fixes the plane at infinity. Then H^{-T} (0, 0, 0, 1)^T = (0, 0, 0, 1)^T necessitates that (H^{-T})_{14} = (H^{-T})_{24} = (H^{-T})_{34} = 0, which leads to the conclusion that H must be of the form (2.25).
The above result immediately leads to the utility of the plane at infinity for 3D reconstruction problems:

Corollary 2. Once the plane at infinity is estimated in a projective reconstruction, it is possible to upgrade to a reconstruction where affine properties are restored.

Proof. Suppose the plane at infinity is identified as π_∞ = (p^T, 1)^T. Then π_∞ can be mapped to its canonical position by a transformation H_A^{-T}:

H_A^{-T} = [ I    −p ]     ⇒     H_A^{-T} (p^T, 1)^T = (0^T, 1)^T.    (2.27)
           [ 0^T   1 ]

Consequently, given π_∞, applying a transformation of the form

H_A = [ I      0 ]    (2.28)
      [   π_∞^T  ]

will take the projective reconstruction to an affine one.
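A quick numerical check of (2.27)–(2.28), with our own example plane (planes transform as π′ = H^{-T} π):

```python
import numpy as np

p = np.array([0.2, -0.5, 1.5])           # plane at infinity found as (p, 1)
H_A = np.vstack([np.hstack([np.eye(3), np.zeros((3, 1))]),
                 np.append(p, 1.0)])     # H_A = [I 0; pi_inf^T]

pi_inf = np.append(p, 1.0)
pi_canon = np.linalg.inv(H_A).T @ pi_inf # the estimated plane, transformed
pi_canon = pi_canon / pi_canon[3]
assert np.allclose(pi_canon, [0, 0, 0, 1])  # back to canonical position
```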
2.5.1 The Absolute Conic
The absolute conic is one of the most important geometric entities used in computer vision, primarily for its relationship to camera calibration and autocalibration. The absolute conic, Ω_∞, is a point conic that lives on the plane at infinity and is defined as the locus of points X = (x_1, x_2, x_3, x_4)^T satisfying

x_1^2 + x_2^2 + x_3^2 = 0,
x_4 = 0.    (2.29)

Equivalently, it is the set of all points X = (x̃^T, 0)^T that satisfy x̃^T Ω_∞ x̃ = 0, where Ω_∞ = I_{3×3}. Note that no real point x̃ ∈ R^3 lies on Ω_∞; it is a conic that consists entirely of imaginary points. Part of the utility of the absolute conic for geometric reconstruction problems stems from the following result:
Theorem 3. The absolute conic is fixed under a projective transformation H if and only if H is a similarity.
Proof. Suppose H is a similarity transformation; then it has the form

H = [ R    t ]
    [ 0^T  1 ].    (2.30)

On the plane at infinity, the point conic Ω_∞ transforms under the action of H as given by (2.5), that is,

R^{-T} Ω_∞ R^{-1} = (R R^T)^{-1} = I = Ω_∞.    (2.31)

To prove the converse, we note that a transformation that fixes a conic lying on the plane at infinity must also fix the plane at infinity itself. Thus, H must at least be an affine transform and should have the form (2.25). Then, on the plane at infinity, H would transform Ω_∞ as A^{-T} I A^{-1} = (A A^T)^{-1}. If Ω_∞ remains fixed under H, then (A A^T)^{-1} = I, which means A must be a rotation, or a rotation with reflection. In either case, H represents a similarity transformation.
2.5.2 Image of the Absolute Conic
Consider a point $X_\infty = (\tilde{x}_\infty^\top, 0)^\top$ on the plane at infinity. Under a camera $P = K[\,R \mid t\,]$, it is imaged as
$$x = PX_\infty = KR\,\tilde{x}_\infty, \tag{2.32}$$
so the mapping between the plane at infinity and the image is given by a planar homography $H = KR$. Since the absolute conic lies on the plane at infinity, its image is the conic
$$\omega = H^{-\top}\Omega_\infty H^{-1} = (KR)^{-\top} I (KR)^{-1} = (KK^\top)^{-1}. \tag{2.33}$$
The image of the absolute conic (IAC) is one of the most important entities in camera calibration, since it depends only on the internal parameters of the camera. At times, it is useful to consider its dual, the dual image of the absolute conic (DIAC), which is given simply by
$$\omega^* = \omega^{-1} = KK^\top. \tag{2.34}$$
Estimating the IAC or the DIAC is central to upgrading an affine reconstruction of a scene to a metric reconstruction. This is encapsulated in the following result.
Theorem 4. Given an affine reconstruction $\{P_i^A, X_j^A\}$, knowing the IAC, $\omega$, in one of the images, corresponding to the camera $P^A = [\,A \mid a\,]$, allows the affine reconstruction to be upgraded to a metric one by applying a transformation of the form
$$H = \begin{bmatrix} S^{-1} & 0 \\ 0^\top & 1 \end{bmatrix}, \tag{2.35}$$
where S is obtained by the Cholesky factorization of $SS^\top = (A^\top \omega A)^{-1}$.

Proof. Suppose a transformation H of the form (2.35), for some matrix S, can take the affine reconstruction to the Euclidean one $\{P_i^M, X_j^M\}$. Then $P^M = P^A H^{-1} = [\,AS \mid a\,] = K[\,R \mid t\,]$ for the camera under consideration, which leads to $KR = AS$. However, $\omega^* = (KR)(KR)^\top = ASS^\top A^\top$, which can be rewritten as $SS^\top = (A^\top \omega A)^{-1}$. Thus, an S to upgrade the affine reconstruction to a metric one using a transformation of the form (2.35) can be computed by Cholesky factorization of $(A^\top \omega A)^{-1}$.
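Theorem 4 can be verified numerically. The sketch below fabricates a calibration K, rotation R and a mixing matrix S_true, forms the corresponding affine camera $A = KR\,S_{\mathrm{true}}^{-1}$, and recovers an S by Cholesky factorization; all numbers are illustrative:

```python
import numpy as np

# Illustrative ground-truth calibration and rotation
K = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta), 0.],
              [np.sin(theta),  np.cos(theta), 0.],
              [0., 0., 1.]])

# Pretend the affine reconstruction mixed in an unknown S_true: A = K R S_true^{-1}
S_true = np.array([[1., 0.2, 0.], [0., 1.5, 0.1], [0., 0., 0.8]])
A = K @ R @ np.linalg.inv(S_true)

# Given the IAC omega = (K K^T)^{-1}, recover S by Cholesky of (A^T omega A)^{-1}
omega = np.linalg.inv(K @ K.T)
M = np.linalg.inv(A.T @ omega @ A)
S = np.linalg.cholesky(M)          # lower-triangular S with S S^T = M

# A S equals K R up to a rotation, so (A S)(A S)^T = K K^T exactly
print(np.allclose((A @ S) @ (A @ S).T, K @ K.T))  # True
```

The Cholesky factor is only determined up to an orthogonal matrix, which is why the check compares $(AS)(AS)^\top$ with $KK^\top$ rather than $AS$ with $KR$ directly.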
2.5.3 The Absolute Dual Quadric
The absolute dual quadric (ADQ), denoted $Q^*_\infty$, is a plane quadric which is the envelope of all planes tangent to the absolute conic. Its algebraic representation is
$$Q^*_\infty = \tilde{I} = \begin{bmatrix} I_{3\times 3} & 0 \\ 0^\top & 0 \end{bmatrix}. \tag{2.36}$$
Clearly, the ADQ is a degenerate quadric. Since it is a dual quadric, under a projective transformation H, the ADQ moves from its canonical position to
$$Q^*_\infty = H\tilde{I}H^\top. \tag{2.37}$$
An intuitive understanding of the absolute dual quadric as the dual of the absolute conic can be obtained by considering the one-parameter family of quadrics
$$x_1^2 + x_2^2 + x_3^2 + kx_4^2 = 0, \tag{2.38}$$
equivalently representable as
$$Q(k) = \mathrm{diag}(1, 1, 1, k). \tag{2.39}$$
As $k \to \infty$, the quadric gets closer and closer to the plane at infinity. In the limit, the only points lying on $\lim_{k\to\infty} Q(k)$ are those that satisfy $x_1^2 + x_2^2 + x_3^2 = 0$ and $x_4 = 0$. But, from (2.29), this is precisely the set of points that define the absolute conic. Thus, one can consider the absolute conic as the limiting case of the series of quadrics Q(k). Then, in the limit, the dual quadric for this series is given by
$$\lim_{k\to\infty} Q^*(k) = \lim_{k\to\infty} \mathrm{diag}(1, 1, 1, k^{-1}) = \mathrm{diag}(1, 1, 1, 0), \tag{2.40}$$
which is precisely the definition of the absolute dual quadric, $Q^*_\infty$. The utility of the ADQ for multiview geometry stems from the fact that it is fixed under a similarity transformation. This is formalized below:
Theorem 5. The absolute dual quadric is fixed under a projective transformation H if and only if H is a similarity.
Proof. Let us represent the projective transformation H as
$$H = \begin{bmatrix} A & t \\ v^\top & 1 \end{bmatrix}. \tag{2.41}$$
Under the transformation H, the ADQ transforms as
$$Q^*_\infty = H\tilde{I}H^\top = \begin{bmatrix} AA^\top & Av \\ v^\top A^\top & v^\top v \end{bmatrix}. \tag{2.42}$$
Since the ADQ remains fixed under H, equating the above to $\tilde{I}$ (up to scale) leads to the conclusion that $v = 0$ and A is a scaled rotation matrix. Thus, H must be a similarity transformation.
Another useful property of the ADQ is the following result:
Result 1. The plane at infinity is the null-vector of the absolute dual quadric.
To prove this, we simply note that in their canonical forms, $\pi_\infty = (0, 0, 0, 1)^\top$ is clearly the null vector of $Q^*_\infty = \tilde{I}$. Under any projective transformation H, the plane at infinity transforms to $\pi'_\infty = H^{-\top}\pi_\infty$ and the ADQ transforms to $Q^{*\prime}_\infty = HQ^*_\infty H^\top$. Clearly,
$$Q^{*\prime}_\infty \pi'_\infty = (HQ^*_\infty H^\top)(H^{-\top}\pi_\infty) = HQ^*_\infty\pi_\infty = 0. \tag{2.43}$$
Image of the absolute dual quadric
To determine the image of the ADQ under perspective projection, we require an auxiliary result:
Result 2. An image line l backprojects to the plane $\pi = P^\top l$.
To see that this result must be true, let X be a point on the plane π which is imaged as the point x. Then $x = PX$, and since x lies on l,
$$x^\top l = 0 \;\Rightarrow\; (PX)^\top l = 0 \;\Rightarrow\; X^\top(P^\top l) = 0 \;\Rightarrow\; \pi = P^\top l. \tag{2.44}$$
Next, we also require two concepts from differential geometry to deduce the image of an arbitrary quadric. Under perspective projection, the contour generator of a smooth surface is the set of points where the imaging rays are tangent to the surface. The apparent contour is the image of the contour generator. Now, consider a sphere, for which the contour generator and the apparent contour under a Euclidean camera $\hat{P} = K[\,I \mid 0\,]$ must be circles, by symmetry. Applying a projective transformation to the system, the sphere changes into a quadric, while the circles transform into conics. So, the contour generator and the apparent contour of a quadric under perspective projection are conics. We are now ready to state the second auxiliary result:
Result 3. The image of a quadric Q under a projective camera P is given by the conic C whose dual satisfies
$$C^* = PQ^*P^\top. \tag{2.45}$$
To prove this assertion, let C be the conic that is the image of Q under the projection P. Lines l tangent to C satisfy $l^\top C^* l = 0$. From Result 2, these lines backproject to planes $\pi = P^\top l$, which are tangent to the quadric Q. Thus,
$$\pi^\top Q^*\pi = 0 \;\Rightarrow\; l^\top PQ^*P^\top l = 0 \;\Rightarrow\; C^* = PQ^*P^\top. \tag{2.46}$$
From (2.46), for the particular case of the ADQ, its image under a camera P is given by $PQ^*_\infty P^\top$. Since the ADQ is dual to the absolute conic, its image must be dual to the image of the absolute conic, which is simply the DIAC. Thus, we have the following important result:

Result 4. The image of the absolute dual quadric $Q^*_\infty$ under a projective camera P is given by the dual image of the absolute conic, that is,
$$\omega^* = PQ^*_\infty P^\top. \tag{2.47}$$

Chapter 3
Preliminaries: Multiview Geometry
“Everything we see is a perspective, not the truth.”
Marcus Aurelius (Roman emperor, 121-180 AD), Meditations (on Stoicism)
This chapter is a more detailed treatment of the basic concepts of 3D reconstruction that we informally discussed in Chapter 1. In particular, we will outline standard techniques for feature detection and matching, as well as for computing a projective reconstruction from correspondences. We will formalize the notion of stratification in 3D reconstruction and, finally, review the important but often overlooked concept of chirality. More details on the material in Sections 2.4 to 3.5 can be found in standard texts like (Hartley and Zisserman, 2004).
3.1 Feature Selection and Matching
To harness the power of multiview geometry, the image of the same salient feature, which might be a point, line, curve or some other well-defined scene element, must be visible in multiple views. This requires two operations: first, detecting salient features in images (which usually, but not always, correspond to salient features in the scene); and second, establishing correspondence between salient features across images. Well-established techniques exist in the literature to perform these tasks, so they constitute pre-processing steps that form the input to our algorithms. In the following, we will briefly review the concepts involved in the low-level image processing that yields our input feature correspondences.
3.1.1 Corner detection
With the exception of Chapter 9, the salient features which form the input for our algorithms are corners. For some problems such as autocalibration, we require a projective reconstruction, which might be obtained from higher-order features too. But corners are by far the most popular features in SFM applications, since they abound in natural scenes and are relatively inexpensive to detect and match.

Roughly, corners are locations in the image where the image gradient undergoes a rapid change in direction (Figure 3.1 (a) and (b)). For a 2D image point $x = (x, y)^\top$, with intensity $I(x, y)$, this is encapsulated in the positive semidefinite form
$$A(x) = \begin{bmatrix} \sum_{W(x)} I_x^2 & \sum_{W(x)} I_x I_y \\ \sum_{W(x)} I_x I_y & \sum_{W(x)} I_y^2 \end{bmatrix}, \tag{3.1}$$
where $W(x)$ stands for a local window around x. An intensity corner can be expected at a point x where the eigenvalues of $A(x)$ are both significant and its condition number is not high. That is, if the eigenvalues of A are $\lambda_1$ and $\lambda_2$, with $\lambda_1 \geq \lambda_2 \geq 0$, then the ratio $\frac{\lambda_2}{\lambda_1 + \lambda_2}$ is not close to zero. The intuition for this is visually illustrated in Figure 3.2. This is the basis of corner detection as proposed in (Förstner and Gülch, 1987).

A popular variant is the so-called Harris corner detector (Harris and Stephens, 1988), which declares a putative corner at the local maxima of the following function of A:
$$C(x) = \det(A) - k \cdot \mathrm{trace}^2(A), \tag{3.2}$$
where the constant k depends on the image. In practice, a value of $k \approx 0.04$ usually works well across a large range of imaging conditions.
Figure 3.1: Corner detection. (a) An ideal corner. (b) A realistic corner.
Figure 3.2: Types of image neighborhoods. (a) rank(A) = 0, $\lambda_1 = \lambda_2 = 0$. (b) rank(A) = 1, $\lambda_1 > 0$, $\lambda_2 = 0$. (c) rank(A) = 2, $\lambda_1 \geq \lambda_2 > 0$.
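The response of equation (3.2) is easy to compute directly. The sketch below (a toy 9×9 image and plain box-window sums, instead of the Gaussian weighting a practical detector would use) evaluates the Harris measure at a corner, an edge and a flat region:

```python
import numpy as np

def harris_response_at(img, y, x, k=0.04, half=1):
    """Harris response C = det(A) - k*trace(A)^2 at pixel (y, x), with the
    gradient products of equation (3.1) summed over a (2*half+1)^2 window."""
    Iy, Ix = np.gradient(img.astype(float))
    ys, xs = slice(y - half, y + half + 1), slice(x - half, x + half + 1)
    Ixx = (Ix[ys, xs] ** 2).sum()
    Iyy = (Iy[ys, xs] ** 2).sum()
    Ixy = (Ix[ys, xs] * Iy[ys, xs]).sum()
    A = np.array([[Ixx, Ixy], [Ixy, Iyy]])
    return np.linalg.det(A) - k * np.trace(A) ** 2

# Synthetic 9x9 image: a bright square whose top-left corner sits at (4, 4)
img = np.zeros((9, 9))
img[4:, 4:] = 1.0
corner = harris_response_at(img, 4, 4)   # gradient changes direction rapidly
edge = harris_response_at(img, 6, 4)     # gradient varies in one direction only
flat = harris_response_at(img, 1, 1)     # no gradient at all
print(corner > 0, corner > edge, flat == 0.0)  # True True True
```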
3.1.2 Feature matching
Feature matching establishes correspondences between multiple views of a scene, so it forms the first stage of a multiview geometry algorithm. Since most natural scenes have an abundance of corners and a mismatch might result in an outlier, feature matching tends to be a conservative operation in most implementations. That is, if a match is uncertain, it is usually more prudent to discard a corner in the initial stages rather than indulge in expensive outlier rejection schemes or risk a breakdown later in the SFM pipeline.

As we discussed in Section 1.2.5, feature matching requires a discriminative descriptor associated with each feature, one that remains relatively invariant under rotation, translation and scale transformations and, to some extent, illumination changes too. A common matching criterion is the normalized cross-correlation (NCC), which yields a "match score" between every pair of pixels x and x' of the two images I and I', respectively:
$$\mathrm{NCC}(x, x') = \frac{\sum_{x \in W(x),\, x' \in W(x')} \bigl(I(x) - \bar{I}\bigr)\bigl(I'(x') - \bar{I}'\bigr)}{\sqrt{\sum_{x \in W(x)} \bigl(I(x) - \bar{I}\bigr)^2 \sum_{x' \in W(x')} \bigl(I'(x') - \bar{I}'\bigr)^2}}, \tag{3.3}$$
where $W(x)$ and $W(x')$ are local windows around the points x and x' in the two images and $\bar{I}$ and $\bar{I}'$ are the mean intensities within those windows.

The NCC is invariant to affine transformations of the image intensity, that is, intensity changes of the form $I' = aI + b$. Note that NCC implicitly assumes a translational model of motion between the images being matched, which is a reasonable approximation when the inter-frame motion is small.

In many applications, such as real-time stereo, where illumination variation between subsequent frames is not significant and where matching speed is of utmost importance, a simplified sum of absolute differences (SAD) measure is used to characterize a match:
$$\mathrm{SAD}(x, x') = \sum_{x \in W(x),\, x' \in W(x')} \bigl| I(x) - I'(x') \bigr|. \tag{3.4}$$
One way to build in some robustness to brightness and contrast variations is to subtract out the mean intensity from each window, to get a zero-mean SAD measure:
$$\mathrm{SAD}'(x, x') = \sum_{x \in W(x),\, x' \in W(x')} \bigl| \bigl(I(x) - \bar{I}\bigr) - \bigl(I'(x') - \bar{I}'\bigr) \bigr|. \tag{3.5}$$

3.1.3 Advanced feature descriptors
In recent years, the need to incorporate invariances into the features for robust 3D reconstruction and object recognition has given rise to several feature detection methods.
1. SIFT (Lowe, 2004): The Scale-Invariant Feature Transform (SIFT) descriptor is invariant to image scale and rotation, as well as robust to small illumination and viewpoint changes. Difference-of-Gaussian images are computed to approximate the Laplacian, and keypoint detection is performed by detecting scale-space extrema (Lindeberg, 1998; Lindeberg and Bretzner, 2003). The location of each scale-space extremum is refined by interpolation to achieve subpixel accuracy. Low-contrast keypoints and edge responses are then discarded, using the eigenvalues of the matrix A in equation (3.1). An orientation histogram is computed for the local image gradient directions around each keypoint. For each keypoint, 8-bin histograms are computed in a 4×4 array around the keypoint, to yield a 128-dimensional feature descriptor. The descriptor is normalized to increase its illumination invariance.
2. GLOH (Mikolajczyk and Schmid, 2005): Gradient Location and Orientation Histogram (GLOH) is similar to SIFT, but considers more spatial regions for computing the histograms. Subsequently, dimensionality is reduced by performing principal components analysis (PCA) and retaining the top 64 eigenvectors.
3. LESH (Sarfraz and Hellwich, 2008): The Local Energy-based Shape Histogram (LESH) is motivated by a local energy model of feature perception. Using several local histograms, with different filter orientations, a descriptor is created which can be used for matching across pose variations.
4. SURF (Bay et al., 2006): Speeded Up Robust Features (SURF) defines keypoints using a Haar wavelet approximation to the Monge-Ampère operator (determinant of the Hessian), which is more robust to non-Euclidean transformations than the difference-of-Gaussian approximation to the Laplacian used for SIFT. Integral images are used to make the feature extraction efficient.
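The NCC of equation (3.3) can be written in a few lines of numpy; the window contents below are random made-up data, and the check exercises the affine-intensity invariance noted above. A minimal sketch:

```python
import numpy as np

def ncc(w1, w2):
    """Normalized cross-correlation between two equal-sized windows (equation 3.3)."""
    a = w1 - w1.mean()
    b = w2 - w2.mean()
    return (a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum())

rng = np.random.default_rng(0)
w = rng.random((7, 7))

# NCC is invariant to affine intensity changes I' = a*I + b (for a > 0)
print(np.isclose(ncc(w, 2.5 * w + 10.0), 1.0))  # True
# An unrelated random window generally scores far from 1
print(ncc(w, rng.random((7, 7))))
```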
3.2 Epipolar Geometry
In this section, we consider the geometry of two views and the geometric entity that characterizes uncalibrated two-view reconstruction: the fundamental matrix. 55
For reference, let us call one of the views left and the other right. The left view is represented by the camera matrix P, centered at C, while the right camera is given by P', centered at C'. For any point feature x in the left view, the corresponding feature in the right view, x', must lie on a straight line, which is the image (in the second view) of the backprojected line from C passing through x. This line is called the epipolar line; let us represent it as l'. Similarly, for every feature x' in the right view, there exists an epipolar line, l, in the left view. For this reason, the geometry of two views is also called epipolar geometry (see Figure 3.3).
Figure 3.3: For any point x in the left view, the corresponding point x' in the right view is constrained to lie on a straight line, which is the image of the back-projected line from the left camera's center passing through x.
Let e' be the image of C in the second view and e the image of C' in the first view. Then, from Figure 3.3 and Section 2.2, it is evident that $l' = e' \times x' = [e']_\times x'$, where, for any vector $a \in \mathbb{R}^3$, we define the skew-symmetric matrix
$$[a]_\times = \begin{bmatrix} 0 & -a_3 & a_2 \\ a_3 & 0 & -a_1 \\ -a_2 & a_1 & 0 \end{bmatrix}. \tag{3.6}$$
Suppose π is an arbitrary plane passing through the 3D point $X_\pi$, of which x and x' are images. Then, the plane π induces a planar homography $H_\pi$ between the left and right views. In particular, it must be true that $x' = H_\pi x$. Thus, the epipolar line in the right view is
$$l' = [e']_\times x' = [e']_\times H_\pi x = Fx, \tag{3.7}$$
where
$$F = [e']_\times H_\pi. \tag{3.8}$$
Since x' lies on l', it follows that
$$x'^\top l' = 0 \;\Rightarrow\; x'^\top F x = 0, \tag{3.9}$$
which is the defining equation for uncalibrated two-view geometry. The matrix F is called the fundamental matrix, and it is evident from (3.8) that F must be rank 2, since the skew-symmetric matrix $[e']_\times$ is rank 2, while the homography $H_\pi$ is full rank.

From (3.8), the right epipole e' can be deduced as the left null-space of F, since $e'^\top F = e'^\top [e']_\times H_\pi = 0^\top$. By similar derivations, or simply from symmetry, the left epipole is given by the right null-space of F, that is, $Fe = 0$, and the left epipolar line corresponding to x' is given by $l = F^\top x'$.

Given several views, one can perform 3D reconstruction by considering pairs of views and stitching together transformations between the coordinate systems of different pairs. The challenge there is to reconcile the differences between the reconstructions generated by the same view pair, but arrived at by different traversals of the graph of overlapping camera views. Moreover, for some problems like autocalibration, it can be shown that fundamental matrix based approaches result in weaker constraints and might have additional degeneracies as compared to multiview methods. Yet, the fundamental matrix and its three-view analogue, the trifocal tensor (Shashua and Werman, 1995; Hartley, 1997), remain the principal geometric entities for large scale reconstructions, due to their simplicity and the fact that robust and fast algorithms exist to estimate them.
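As a quick numerical check of (3.7)-(3.9), one can fabricate a camera pair, form F, and verify that every correspondence satisfies the epipolar constraint. A numpy sketch with made-up cameras (for $P_1 = [\,I \mid 0\,]$ and $P_2 = [\,M \mid m\,]$, the second epipole is $e' = m$ and $F = [e']_\times M$):

```python
import numpy as np

def skew(a):
    """[a]_x of equation (3.6): the matrix such that [a]_x b = a x b."""
    return np.array([[0., -a[2], a[1]],
                     [a[2], 0., -a[0]],
                     [-a[1], a[0], 0.]])

# Made-up camera pair: P1 = [I | 0] and P2 = [M | m]
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
M = np.array([[1., 0.1, 0.], [0., 1., 0.2], [0.1, 0., 1.]])
m = np.array([0.5, -0.2, 1.0])
P2 = np.hstack([M, m[:, None]])

# For this pair the second epipole is e' = P2 @ (0,0,0,1)^T = m, and F = [e']_x M
F = skew(m) @ M

# Every 3D point projects to a correspondence satisfying x'^T F x = 0 (equation 3.9)
rng = np.random.default_rng(1)
for _ in range(5):
    X = np.append(rng.random(3), 1.0)
    x, x_prime = P1 @ X, P2 @ X
    assert abs(x_prime @ F @ x) < 1e-9

print("rank of F:", np.linalg.matrix_rank(F))  # -> rank of F: 2
```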
3.3 Projective Reconstruction
Given feature correspondences, with no other a priori information about the scene or the cameras, the best reconstruction possible is a so-called projective reconstruction, which differs from the true scene by an arbitrary 4×4 linear transformation. That is, if $\{\hat{P}_i, \hat{X}_j\}$, for $i = 1, \ldots, m$, $j = 1, \ldots, n$, is the true scene, then the projective reconstruction is
$$P_i = \hat{P}_i H^{-1}, \qquad X_j = H\hat{X}_j, \tag{3.10}$$
where H is an arbitrary 4×4 linear transformation. We refer the reader to Chapter 2 for a review of the terminology and notation used in this section.
3.3.1 Pairwise reconstruction
Given several correspondences $x_j \leftrightarrow x'_j$ between two views, we saw in Section 3.2 that epipolar geometry guarantees the existence of a fundamental matrix, which is a 3×3 matrix of rank 2 that satisfies
$$x'^\top_j F x_j = 0. \tag{3.11}$$
Then, a projective reconstruction consistent with these correspondences is given by
$$P = [\, I \mid 0 \,], \qquad P' = [\, SF \mid e' \,], \tag{3.12}$$
where S is an arbitrary 3×3 skew-symmetric matrix and e' is the epipole in the second view, defined by $F^\top e' = 0$. In practice, a good choice of S is the matrix $[e']_\times$, defined as in (3.6).

Given a pairwise estimate of projective cameras from (3.12), the 3D points $X_j$ can be estimated by triangulation (see Chapter 5).
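Equation (3.12) can be instantiated directly. In the sketch below, a rank-2 F with a known left epipole is fabricated for illustration, and the resulting camera pair is checked to be consistent with F:

```python
import numpy as np

def skew(a):
    return np.array([[0., -a[2], a[1]],
                     [a[2], 0., -a[0]],
                     [-a[1], a[0], 0.]])

# Fabricate a rank-2 fundamental matrix with known left epipole e' (F^T e' = 0)
M = np.array([[1., 0.1, 0.], [0., 1., 0.2], [0.1, 0., 1.]])
e_prime = np.array([0.5, -0.2, 1.0])
F = skew(e_prime) @ M

# Canonical camera pair of equation (3.12), with the choice S = [e']_x
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([skew(e_prime) @ F, e_prime[:, None]])

# Projections of arbitrary 3D points through this pair are consistent with F
rng = np.random.default_rng(2)
ok = True
for _ in range(5):
    X = np.append(rng.random(3), 1.0)
    x, x_prime = P1 @ X, P2 @ X
    ok = ok and abs(x_prime @ F @ x) < 1e-9
print(ok)  # True
```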
3.3.2 Factorization-based approaches
Given more than two views of a scene, an alternative method for projective reconstruction adopts a factorization approach. First, we will take a brief digression to describe the concept of affine factorization, the so-called Tomasi-Kanade factorization (Tomasi and Kanade, 1992), which is the motivation for projective factorization.
Affine factorization
Suppose the image of a 3D point $X_j$, observed under camera $A_i$, is the 2D point $x_{ij}$. Note that for this discussion, $A_i$ is a 2×3 affine camera, $X_j \in \mathbb{R}^3$ is a non-homogeneous point, as is $x_{ij} \in \mathbb{R}^2$, and the image formation equation is
$$x_{ij} = A_i X_j + t_i, \tag{3.13}$$
where $t_i \in \mathbb{R}^2$ is a translation vector and $i = 1, \ldots, m$, $j = 1, \ldots, n$. Then, the goal of affine factorization is to minimize the squared reprojection error
$$\min_{A_i, t_i, X_j} \sum_{ij} \| x_{ij} - (A_i X_j + t_i) \|^2 \tag{3.14}$$
under the assumption that each 3D point is visible in all the views.

It is easy to see that assuming the reconstruction has its origin at the centroid of the points and centering the image points in each view, that is, making their centroid the origin of the local coordinate system, fixes the translations as $t_i = \bar{x}_i$, where $\bar{x}_i = \frac{1}{n}\sum_j x_{ij}$. The problem now reduces to
$$\min_{A_i, X_j} \sum_{ij} \| x_{ij} - A_i X_j \|^2, \tag{3.15}$$
which can be reformulated as the factorization of a 2m×n measurement matrix as
$$W = \begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{bmatrix} = \begin{bmatrix} A_1 \\ \vdots \\ A_m \end{bmatrix} \begin{bmatrix} X_1 & \cdots & X_n \end{bmatrix}. \tag{3.16}$$
In the presence of noise, it can be shown that a rank-3 truncation of the singular value decomposition (SVD) of the measurement matrix W yields the maximum likelihood affine reconstruction. In practice, the assumption that each feature point is visible in all the views is a severe limitation. In applications with missing data, alternating least squares minimization is used to solve problem (3.14), using the fact that once either the cameras or the points are known, estimating the other set of unknowns is a linear least squares problem.
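The factorization of (3.16) can be sketched in a few lines of numpy; the cameras, points and translations below are random made-up data, and with noise-free measurements the centered measurement matrix is exactly rank 3:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 4, 20                                  # cameras, points
A_true = rng.standard_normal((m, 2, 3))       # 2x3 affine cameras
X_true = rng.standard_normal((3, n))
t_true = rng.standard_normal((m, 2))

# Stack measurements into the 2m x n matrix W after centering each view,
# which removes the translations (equation 3.16)
rows = []
for i in range(m):
    x = A_true[i] @ X_true + t_true[i][:, None]
    rows.append(x - x.mean(axis=1, keepdims=True))
W = np.vstack(rows)

# Rank-3 factorization by SVD recovers cameras and structure up to a 3x3 affinity
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A_hat = U[:, :3] * s[:3]
X_hat = Vt[:3]
print(np.allclose(A_hat @ X_hat, W))  # True: noise-free data is exactly rank 3
```

The recovered `A_hat` and `X_hat` differ from the true cameras and points by an unknown invertible 3×3 matrix, which is precisely the affine ambiguity of Section 3.4.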
Projective factorization
Consider the image formation equation in the projective case, $x_{ij} \sim P_i X_j$, where $P_i$ is a 3×4 projective camera, $X_j \in \mathbb{R}^4$ is a homogeneous point in 3D, as is $x_{ij} \in \mathbb{R}^3$ in 2D. Then, the equality can be made exact by introducing an explicit scale factor:
$$\lambda_{ij} x_{ij} = P_i X_j, \tag{3.17}$$
where the $\lambda_{ij}$ are called projective depths. The problem that projective factorization aims to solve is
$$\min_{\lambda_{ij}, P_i, X_j} \sum_{ij} \| \lambda_{ij} x_{ij} - P_i X_j \|^2, \tag{3.18}$$
assuming that all points are visible in all the cameras. Similar to affine factorization, a 3m×n measurement matrix may be constructed, which can be factorized into the cameras and the 3D points:
$$W = \begin{bmatrix} \lambda_{11} x_{11} & \cdots & \lambda_{1n} x_{1n} \\ \vdots & \ddots & \vdots \\ \lambda_{m1} x_{m1} & \cdots & \lambda_{mn} x_{mn} \end{bmatrix} = \begin{bmatrix} P_1 \\ \vdots \\ P_m \end{bmatrix} \begin{bmatrix} X_1 & \cdots & X_n \end{bmatrix}. \tag{3.19}$$
To solve this problem, initial estimates of the projective depths are needed, which may be obtained with the aid of an initial pairwise projective reconstruction, or simply by setting all the $\lambda_{ij}$ to 1. The depths are then normalized by setting all row norms of the measurement matrix to 1 in one pass and all the column norms to 1 in another pass. The nearest rank-4 approximation of the normalized measurement matrix is computed using SVD, which yields an estimate of the cameras and 3D points. This estimate is used to project the points onto the images and obtain new estimates of the depths, and the process is iterated until convergence.

Projective factorization was introduced in (Sturm and Triggs, 1996) as a generalization of affine factorization to the projective case. In theory, it is not a good idea to introduce projective depths on the left hand side of (3.17), since it artificially introduces the all-zeros solution as a global minimum of (3.18). While there is little theoretical justification for projective factorization as described here, in the absence of alternatives, the method is still used with some degree of success since it is straightforward to implement.
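The role of the projective depths can be sanity-checked numerically: with the correct $\lambda_{ij}$, the measurement matrix of (3.19) is exactly rank 4, which is the property the iteration above tries to enforce. A minimal numpy sketch with made-up random cameras and points:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 3, 15
P_true = rng.standard_normal((m, 3, 4))
X_true = np.vstack([rng.standard_normal((3, n)), np.ones((1, n))])

# Projections and the (usually unknown) projective depths lambda_ij of (3.17)
proj = np.stack([P_true[i] @ X_true for i in range(m)])
lam = proj[:, 2, :]                 # true depths: the third coordinate
x = proj / proj[:, 2:3, :]          # observed, scale-normalized image points

# With the correct depths, the 3m x n measurement matrix of (3.19) has rank 4
W = np.vstack([lam[i] * x[i] for i in range(m)])
s = np.linalg.svd(W, compute_uv=False)
print("rank:", int((s > 1e-9 * s[0]).sum()))  # -> rank: 4
```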
3.4 Stratification
An important conceptual paradigm in multiview geometry is stratification, which defines a hierarchy relating the actual scene to its reconstructed versions through transformations of gradually increasing degrees of freedom. These transformations can be geometrically characterized by their invariants, which are the entities that do not change under the transformation. The strata we describe here and deal with in the rest of this thesis correspond to transformations which form subgroups of the projective linear group, PL(n), which is the quotient group of the general linear group, GL(n), obtained by identifying elements related by a scalar multiple.
Isometry
An isometric transformation has the form
$$H_E = \begin{bmatrix} \delta R & t \\ 0^\top & 1 \end{bmatrix}, \tag{3.20}$$
where $\delta = \pm 1$, $R \in SO(n-1)$ is a rotation matrix and $t \in \mathbb{R}^{n-1}$. When $\delta = 1$, it is called a Euclidean transformation, which corresponds to a physical displacement; $\delta = -1$ represents a displacement composed with a reflection. A 3D Euclidean transformation has six degrees of freedom, three for the rotation R and three corresponding to the translation t. Geometrically, a Euclidean transformation preserves the Euclidean characterization of an object, so its invariants are entities like angles, length and area.
Similarity
A similarity transformation has the form
$$H_S = \begin{bmatrix} sR & t \\ 0^\top & 1 \end{bmatrix}, \tag{3.21}$$
where s is a scalar. A 3D similarity transformation has seven degrees of freedom: three for the rotation R, three for the translation t and one for the scale s. Geometrically, a similarity transformation preserves the "shape" of an object, so its invariants are entities like angles, parallelism, length ratios and area ratios. The goal of most 3D reconstruction algorithms is to deduce the scene up to a similarity transformation, which is also called a metric reconstruction.
Affinity
An affine transformation has the form
$$H_A = \begin{bmatrix} A & t \\ 0^\top & 1 \end{bmatrix}, \tag{3.22}$$
where $A \in GL(n-1)$ is non-singular. A 3D affine transformation has 12 degrees of freedom, 9 corresponding to A and 3 corresponding to t. Geometrically, an affine transformation corresponds to a composition of a rotation, a non-isotropic scaling and another rotation, followed by a translation. Some of its invariants are parallelism, length ratios of parallel lines and area ratios.
An affine reconstruction differs from the true scene by an affine transformation. It is useful in practice since obtaining an affine reconstruction from input images is considered the “difficult” part of 3D reconstruction. That is, upgrading from the affine to the metric stratum is the relatively easier part of 3D reconstruction.
Projectivity
A projective transformation is defined on homogeneous coordinates and has the form
$$H_P = \begin{bmatrix} A & t \\ v^\top & k \end{bmatrix}. \tag{3.23}$$
A 3D projective transformation has 15 degrees of freedom, corresponding to the 16 unknowns of $H_P$, less one for scale. A general projective transformation is also called a homography or a collineation. The fundamental invariants of a projective transformation are the incidence structure and the cross ratio. As discussed in Section 3.3, a projective reconstruction is the best one can obtain using point correspondences alone. So, the objective of many 3D reconstruction algorithms can be stated as finding the appropriate projective transformation to upgrade from the projective stratum to the metric one.
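The invariance of the cross ratio is easy to confirm numerically for four collinear points; the coefficients of the 1D projective map below are arbitrary made-up values:

```python
import numpy as np

def cross_ratio(p, q, r, s):
    """Cross ratio of four collinear points given by scalar coordinates."""
    return ((p - r) * (q - s)) / ((p - s) * (q - r))

# A 1D projective transformation x -> (a x + b) / (c x + d), with a d - b c != 0
a, b, c, d = 2.0, 1.0, 0.5, 3.0
h = lambda x: (a * x + b) / (c * x + d)

pts = np.array([0.0, 1.0, 2.0, 5.0])
print(np.isclose(cross_ratio(*pts), cross_ratio(*h(pts))))  # True
```

In contrast, simple length ratios of the same four points are not preserved by h, which is why the cross ratio is the fundamental projective invariant.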
3.5 Chirality
The etymology of the term chirality can be traced to the Greek cheir, which means hand. Chirality thus means "handedness" and, in the context of computer vision, imposing chirality refers to demanding the most basic of imaging constraints: the imaged points must lie in front of the camera. It can be shown that this simple constraint is sufficient to compute a quasi-affine reconstruction, which is a projective reconstruction that preserves the convex hull of the scene.

In $\mathbb{P}^3$, convexity is defined as follows: a subset $\mathcal{S} \subset \mathbb{P}^3$ is convex if and only if $\mathcal{S} \subset \mathbb{R}^3$ and $\mathcal{S}$ is convex in $\mathbb{R}^3$. (See Section 4.2.1 for a definition of convexity in $\mathbb{R}^3$.) The convex hull of a set is the smallest convex set containing it.

A projective transformation $h : \mathbb{P}^3 \to \mathbb{P}^3$ preserves the convex hull of a point set $\{X_i\}$ if $h(X_i)$ is finite for all i and h maps the convex hull of $\{X_i\}$ bijectively to the convex hull of $\{h(X_i)\}$.

Recall that the center of a camera is given by $C = \mathrm{null}(P)$, which can be written as
$$C_P = (c_1, c_2, c_3, c_4)^\top, \quad \text{where } c_i = (-1)^i \det P^{(i)} \tag{3.24}$$
and $P^{(i)}$ is the matrix formed by removing column i of P. For imposing chirality, we are only concerned with the sign of the depth of a point with respect to a camera, so (2.24) can be written as
$$\mathrm{depth}(X, P) \doteq wT\det(A) = w(E_4^\top X)(E_4^\top C_P), \tag{3.25}$$
where $E_4 = (0, 0, 0, 1)^\top$ represents the plane at infinity and $\doteq$ denotes equality of sign. On the application of a projective transformation H, (3.25) transforms as
$$\mathrm{depth}(HX, PH^{-1}) \doteq w(E_4^\top HX)(E_4^\top HC_P)\det(H^{-1}), \tag{3.26}$$
where $H^\top E_4$ can now be interpreted as the plane $\pi_\infty$ which is mapped to infinity by H, that is, $H^{-\top}\pi_\infty = E_4$. Thus,
$$\mathrm{depth}(HX, PH^{-1}) \doteq w(\pi_\infty^\top X)(\pi_\infty^\top C_P)\delta, \tag{3.27}$$
where $\delta = \mathrm{sign}(\det(H^{-1}))$ determines whether the transformation is orientation preserving or reversing. Thus, it can be easily verified that if the determinant of a projective transformation H is positive and $\pi_\infty$ is the plane mapped to infinity by H, then the chirality of a point X with respect to $\pi_\infty$ is preserved under H if and only if X lies on the same side of $\pi_\infty$ as the camera center.

It can be easily shown that for any projective reconstruction from a set of image points $x_{ij}$, it is possible to assign signs to the projective cameras and points such that
$$P_i X_j = w_{ij}(x_{ij}, y_{ij}, 1)^\top \quad \text{with } w_{ij} > 0. \tag{3.28}$$
Then, if H is a quasi-affine transformation, $\delta = \mathrm{sign}(\det(H))$ and $v \in \mathbb{R}^4$ is the plane mapped to infinity by H, then
$$\mathrm{depth}(X_j, P_i) > 0 \;\Rightarrow\; (v^\top X_j)(v^\top C_{P_i})\delta > 0. \tag{3.29}$$
It follows that a quasi-affine reconstruction can be computed using a transformation
$$H = \pm\begin{bmatrix} I & 0 \\ \tilde{v}^\top & v_4 \end{bmatrix}, \qquad v = (\tilde{v}^\top, v_4)^\top, \tag{3.30}$$
where the sign of H is chosen to match the δ which makes v a feasible solution of the following chiral inequalities:
$$X_j^\top v > 0, \;\text{for all } j, \qquad C_{P_i}^\top v > 0, \;\text{for all } i. \tag{3.31}$$
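Checking a candidate plane v against the chiral inequalities (3.31) is a one-liner; the points, camera centres and v below are made up for illustration (in the thesis setting, v would instead be obtained by solving a linear feasibility problem over these inequalities):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical reconstruction: points and camera centres, last coordinate 1
X = np.vstack([rng.random((3, 10)), np.ones((1, 10))])        # points as columns
C = np.vstack([rng.random((3, 3)) + 2.0, np.ones((1, 3))])    # camera centres

# A candidate plane v is feasible if it satisfies the chiral inequalities (3.31)
v = np.array([0.1, 0.1, 0.1, 1.0])
feasible = bool(np.all(v @ X > 0) and np.all(v @ C > 0))
print("chiral inequalities satisfied:", feasible)  # True for this v
```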
3.5.1 Bounding the plane at infinity
Once a quasi-affine transformation has been found, one can normalize all points to have last coordinate 1 and all cameras to have the determinant of their first 3×3 sub-matrix equal to 1. Now, a translation may be applied to move the centroid of the set of points and camera centers to the origin, to obtain a (still quasi-affine) reconstruction $\{P_i^q, X_j^q\}$.

Let H be a further, orientation-preserving quasi-affine transformation and v the plane mapped to infinity by H. Since v lies outside the convex hull of the set of points and camera centers, it cannot pass through the origin, and we can parameterize it as $v = (v_1, v_2, v_3, 1)^\top$, where the $v_i$ must be bounded. Thus, bounds on the plane at infinity can be computed by minimizing and maximizing each $v_i$ subject to the chiral inequalities (3.31) with $\delta = 1$:
$$\min / \max \; v_i, \;\; i = 1, 2, 3, \quad \text{subject to} \quad X_j^{q\top} v > 0, \; j = 1, \ldots, n, \qquad C_k^{q\top} v > 0, \; k = 1, \ldots, m. \tag{3.32}$$

Chapter 4
Global Optimization
“Now upon the rugged top Stands she, on the loftiest height, When the cliffs abruptly stop And the path is lost to sight.”
Friedrich Schiller (German poet, 1759-1805 AD), The Alpine Hunter
Throughout this dissertation, we will encounter constrained optimization problems of the form:
$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & g_i(x) \leq 0, \quad i = 1, \ldots, m, \\ & h_j(x) = 0, \quad j = 1, \ldots, p, \end{aligned} \tag{4.1}$$
where $x \in \mathbb{R}^n$ is the vector of decision variables, $f(x)$ is the objective function to be optimized, while $g_i(x)$ and $h_j(x)$ are the inequality and equality constraints, respectively. In general, such problems are quite difficult to optimize, since they might have several local minima. Indeed, it might be difficult to even find a feasible point, that is, a value of x which satisfies all the constraints. A further problem that riddles general purpose optimization routines is the absence of a principled stopping criterion.
4.1 Approaches to Global Optimization
The need for global optimization arises in several branches of science and math- ematics, so it is a term imbued with differing interpretations depending on the diverse array of contexts it is used in. In this section, we briefly review the various approaches to global optimization prevalent in scientific computing.
Deterministic approaches
A global optimization approach may be considered deterministic when the objective function and constraints are not specified in probabilistic terms and there is no element of randomization in the search algorithm. Two important optimization frameworks that are deterministic in nature are based on branch and bound methods and real algebraic geometry.

Branch and bound minimization algorithms systematically explore the search space using lower and upper bounding functions (Land and Doig, 1960). The search proceeds by branching, a repeated subdivision of the domain, and bounding functions are computed to underestimate and overestimate the objective in each sub-domain. All those regions of the search space where the lower bounding function lies above the upper bounding function of some other region are pruned away. The branch and bound procedure terminates when the entire search space has been explored or discarded, although in noisy estimation problems a termination criterion that depends on the approximation gap may be specified. Branch and bound algorithms provably converge to the global minimum, with a certificate of optimality. In this dissertation, several multiview geometry problems are globally minimized in a branch and bound optimization framework, so we review it in greater detail in Section 4.3. Clearly, the key to ensuring that branch and bound does not degenerate to an exhaustive search is designing an effective branching strategy and constructing bounding functions that tightly approximate the objective. We explore these problem-specific aspects in Chapters 5, 6 and 8 of the dissertation.
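The mechanics of branch and bound can be illustrated on a one-dimensional toy problem, using a Lipschitz constant to construct the lower bounding function; the objective and constant below are illustrative, not from the dissertation:

```python
import heapq

def branch_and_bound(f, a, b, L, tol=1e-6):
    """Minimal 1D branch and bound. For an L-Lipschitz f, a valid lower
    bound on [lo, hi] is min(f(lo), f(hi)) - L * (hi - lo) / 2."""
    def lower(lo, hi):
        return min(f(lo), f(hi)) - L * (hi - lo) / 2

    best = min(f(a), f(b))                       # incumbent (upper bound)
    heap = [(lower(a, b), a, b)]
    while heap:
        lb, lo, hi = heapq.heappop(heap)
        if best - lb < tol:                      # gap closed: optimality certificate
            break
        mid = (lo + hi) / 2                      # branch: split the interval
        best = min(best, f(mid))                 # improve the incumbent
        for child in ((lo, mid), (mid, hi)):
            if lower(*child) < best - tol:       # prune dominated sub-intervals
                heapq.heappush(heap, (lower(*child), *child))
    return best

# A quartic with two local minima on [-2, 2]; |f'| <= 45 there, so L = 45 is valid
f = lambda x: x**4 - 3 * x**2 + x
f_min = branch_and_bound(f, -2.0, 2.0, 45.0)
print(round(f_min, 3))  # -> -3.514 (the global minimum, not the local one near -1.07)
```

Note that a plain local descent started in the wrong basin would return the shallow local minimum; the bounding step is what lets the search discard that basin with a guarantee.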
Algebraic geometry is the study of algebraic varieties, which are geometric manifestations of solutions of systems of polynomial equations (Cox et al., 1992). A popular approach for the global minimization of polynomial systems relies on Gröbner bases, which can be considered a generalization of Gaussian elimination to polynomial systems. More precisely, a Gröbner basis G is a generating subset of an ideal I in a polynomial ring R, such that dividing any polynomial in I by G gives 0 (Buchberger, 1965, 1995). Several problems in multiview geometry, such as three-view triangulation and relative pose estimation, have been solved using Gröbner basis methods (Stewénius et al., 2005; Stewénius et al., 2006). An alternate algebraic geometry approach to optimizing a polynomial system extends the reformulation-linearization technique, whereby linear programming relaxations for polynomial programs are constructed by a lifting procedure, that is, by introducing additional redundant constraints and linearizing using additional variables (Sherali and Adams, 1998). Using the theory of positive polynomials, a sequence of convex linear matrix inequality relaxations of increasing size is constructed, whose optimal values provably converge to the global optimum of the given polynomial system (Lasserre, 2001; Henrion and Lasserre, 2004). Such approaches to global optimization have been used for several multiview geometry problems (Kahl and Henrion, 2005). We review this approach in greater detail in Section 4.2.4 and use it for globally optimizing the direct autocalibration problem in Chapter 7.
Stochastic approaches
Stochastic approaches to optimization are useful in problems where either the modeling is in probabilistic terms, or a degree of randomness is necessary for the algorithm to achieve convergence or optimality. Incorporating the effects of noise often leads to probabilistic features in the objective function, whereby tools of statistical inference may be used to determine optimal solutions or optimal strategies to explore the search space. Stochastic approximation is a popular method used to optimize an unknown, time-varying function (Robbins and Monro, 1951; Kiefer and Wolfowitz, 1952). A naïve incorporation of randomization is to initialize a local gradient descent based method at several randomly chosen locations in the search space, in the hope that one of them will converge to the global minimum. Clearly, no performance guarantee better than exhaustive search can be given for such an approach. Markov Chain Monte Carlo (MCMC) is a technique for sampling from a high-dimensional probability distribution by constructing a rapidly mixing Markov chain whose equilibrium is the target distribution (Robert and Casella, 2004). The Boltzmann distribution, P(x) ∝ e^{−E(x)/T}, is a popular choice for global optimization, since sampling at a low enough temperature T yields a global minimum with arbitrarily high probability. The Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) is a popular way to draw samples from the Boltzmann distribution; a faster but less general variant is the Gibbs sampling procedure (Casella and George, 1992). Simulated annealing, in effect, is the optimization technique that starts with Metropolis-Hastings at a high temperature, then slowly lowers the temperature in accordance with an annealing schedule (Kirkpatrick et al., 1983).
Each step of simulated annealing decides whether a randomly chosen "next state" should replace the "current state", where the decision is a probabilistic one depending on the objective function values and a global temperature parameter. At higher temperatures, simulated annealing approximates random search, while at lower temperatures, it behaves like gradient descent. While a suitably slow annealing schedule can be proven to achieve the global optimum, in most practical cases it takes at least as long to converge as exhaustive search.
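The acceptance rule described above can be sketched in a few lines. The toy objective, proposal width, initial temperature and geometric cooling schedule below are our own illustrative choices, not drawn from the dissertation:

```python
import math
import random

def simulated_annealing(f, x0, T0=2.0, cooling=0.995, steps=4000, seed=0):
    """Minimize a scalar function f by simulated annealing (illustrative sketch)."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    T = T0
    for _ in range(steps):
        # propose a random neighbouring "next state"
        y = x + rng.gauss(0.0, 0.5)
        fy = f(y)
        # Metropolis acceptance rule at the current temperature T:
        # always accept improvements, sometimes accept uphill moves
        if fy < fx or rng.random() < math.exp(-(fy - fx) / T):
            x, fx = y, fy
            if fx < best_f:
                best_x, best_f = x, fx
        T *= cooling  # geometric annealing schedule
    return best_x, best_f
```

At the initial temperature nearly every proposal is accepted (random search); as T shrinks the chain accepts almost only downhill moves (greedy descent), matching the behavior described above.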
Metaheuristic approaches
Metaheuristics refer to heuristic search strategies for solving combinatorial optimization problems using pre-defined black box functions, which might be heuristics themselves, that evaluate the objective value for a given state. They are generally employed when the problem is intractable or difficult to pose in one of the more traditional optimization paradigms. The motivation for metaheuristic approaches stems from mimicking the operations of biological systems, such as evolution by natural selection (Barricelli, 1954), genetic mutation or the behavior of ant colonies (Goss et al., 1989). A typical metaheuristic procedure minimizes an objective function using a series of state transitions that might be a product of a generator function or a mutator function. Genetic algorithms (Fraser, 1957), ant colony optimization (Dorigo et al., 1996) and differential evolution (Storn and Price, 1997) are some well-known examples of metaheuristic approaches. Metaheuristic methods are usually employed with a pre-specified time budget and have weak probabilistic guarantees of reaching the global optimum as the time budget tends to infinity.
4.2 Convex Optimization
Convex optimization represents a large class of problems for which the obstacles mentioned at the beginning of the chapter vanish. A problem such as (4.1) is convex when the objective is a convex function and the constraints define a convex set. Any local minimum of a convex problem is the global minimum, feasibility can be determined unambiguously, and precise stopping criteria are provided by the theory of duality. See (Boyd and Vandenberghe, 2004) for a comprehensive treatment of the topics outlined in this section.
4.2.1 Convex Sets
A set S ⊂ R^n is convex if it contains the line segment joining any two points of the set, that is,

x, y ∈ S, α, β ≥ 0, α + β = 1 ⇒ αx + βy ∈ S. (4.2)
Some convex sets that we will encounter in this dissertation are subspaces, affine sets and convex cones. A set S ⊂ R^n is a subspace if it contains the plane defined by the origin and any two of its points, that is,

x, y ∈ S, α, β ∈ R ⇒ αx + βy ∈ S. (4.3)
An affine set is one that contains the line through any two of its elements, that is, S ⊂ R^n is affine if

x, y ∈ S, α, β ∈ R, α + β = 1 ⇒ αx + βy ∈ S. (4.4)

A set S ⊂ R^n is a convex cone if it contains the line segment joining any two of its points as well as all the rays emanating from the origin and passing through its points. Mathematically,

x, y ∈ S, α, β ≥ 0 ⇒ αx + βy ∈ S. (4.5)

A convex cone that arises frequently in this dissertation is the second order cone, named so since it is the norm cone associated with the Euclidean norm:

S_2^n = { x | ‖x̂‖ ≤ x_n }, (4.6)

where x̂ := (x_1, …, x_{n−1})^T. Constraints in several computer vision applications, with A ∈ R^{m×n}, have the form

‖Ax + b‖ ≤ c^T x + d, (4.7)

which are called second order cone programming constraints, since (4.7) is equivalent to demanding that (Ax + b, c^T x + d)^T ∈ S_2^{m+1}.
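The equivalence between the scalar constraint (4.7) and cone membership is easy to check numerically. The snippet below is our own illustration; the matrices A, b, c, d and the test point x are arbitrary hypothetical values:

```python
import math

def in_second_order_cone(v):
    # v = (v_1, ..., v_n) lies in S_2^n iff ||(v_1, ..., v_{n-1})|| <= v_n
    head, last = v[:-1], v[-1]
    return math.sqrt(sum(t*t for t in head)) <= last

# hypothetical SOCP constraint ||Ax + b|| <= c^T x + d, checked at a point x
A = [[1.0, 0.0], [0.0, 2.0]]
b = [0.5, -0.5]
c = [3.0, 3.0]
d = 1.0
x = [1.0, 1.0]

Ax_b = [sum(A[i][j]*x[j] for j in range(2)) + b[i] for i in range(2)]  # Ax + b
rhs = sum(c[j]*x[j] for j in range(2)) + d                             # c^T x + d
lhs = math.sqrt(sum(t*t for t in Ax_b))                                # ||Ax + b||

# the scalar constraint holds exactly when the stacked (m+1)-vector lies in the cone
constraint_holds = lhs <= rhs
cone_membership = in_second_order_cone(Ax_b + [rhs])
```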
S_+^n = { X ∈ S^n | y^T X y ≥ 0 for all y ∈ R^n }, (4.8)

where S^n is the set of n × n symmetric matrices.
Generalized Inequalities
A proper convex cone is closed, has a non-empty interior and does not contain any line. Any proper convex cone K ⊂ R^n defines a generalized inequality:

x ⪰_K y ⇔ x − y ∈ K. (4.9)

Unless otherwise stated, we will encounter instances of generalized inequalities in the following two contexts:
• Componentwise inequality for n-dimensional vectors, using K = R_+^n:

  Given x, y ∈ R^n, x ⪰ y ⇔ x_i − y_i ≥ 0, i = 1, …, n. (4.10)

• Positive semidefiniteness constraints for n × n matrices, using K = S_+^n:

  Given X, Y ∈ R^{n×n}, X ⪰ Y ⇔ X − Y is PSD. (4.11)
Note that we have dropped the reference to the convex cone K while representing the generalized inequalities above, since it will usually be clear from the context.
4.2.2 Convex Functions
Several objective functions that we will encounter in the following chapters have traditionally been minimized by local methods, but they become amenable to global optimization once their underlying convexities are recognized. A convex function f : D → R is one with a convex domain D := dom f which satisfies

∀ x, y ∈ D, ∀ λ ∈ [0, 1], f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y). (4.12)

A concave function is one whose negation is convex. A necessary and sufficient first-order condition for convexity of a differentiable function is that its first-order approximation is a global underestimator, that is,

∀ x, x₀ ∈ D, f(x) ≥ f(x₀) + ∇f(x₀)^T (x − x₀). (4.13)
A necessary and sufficient second-order condition for convexity of a twice differentiable function is the positive semidefiniteness of its Hessian, that is,

∀ x ∈ D, ∇²f(x) ⪰ 0. (4.14)
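Both conditions are easy to probe numerically for a concrete convex function. As our own illustration, the sketch below verifies the global underestimator property (4.13) for f(x) = x² + eˣ, whose second derivative 2 + eˣ is positive everywhere, so (4.14) holds trivially:

```python
import math

def f(x):
    return x*x + math.exp(x)

def df(x):
    # derivative of f, used to build the tangent (first-order approximation)
    return 2.0*x + math.exp(x)

def first_order_underestimates(x0, xs):
    # the tangent at x0 must lie below the graph everywhere (eq. 4.13)
    return all(f(x) >= f(x0) + df(x0)*(x - x0) for x in xs)

# sample grid on [-3, 3]
grid = [i/10.0 for i in range(-30, 31)]
```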
Some common examples of convex functions are:

• Affine functions f(x) = a^T x + c, where a ∈ R^n, c ∈ R.

• Quadratic functions f(x) = x^T A x + 2b^T x + c, where A ⪰ 0.

• Power functions f(x) = x^a on R_{++}, where a ≥ 1.
Epigraph
A useful notion is the epigraph of a function:

epi f = { (x, t) | x ∈ dom f, f(x) ≤ t }. (4.15)

It is easy to see that f is a convex function if and only if epi f is a convex set. Many properties of convex functions can be better understood in terms of the epigraph. In particular, the following operations can be shown to be convexity-preserving:
• Non-negative summation: α_i ≥ 0, f_i convex ⇒ Σ_i α_i f_i convex.

• Pointwise supremum: f_s convex for s ∈ S ⇒ sup_{s∈S} f_s convex.

• Minimizing over some variables: f(x, y) convex in x ∈ R^m and y ∈ R^n ⇒ g(x) = inf_y f(x, y) convex in x.
4.2.3 Convex Optimization Problems
A convex program is an optimization problem of the form (4.1) with the additional conditions that the objective function f(x) is convex, the inequality constraint functions g_i(x) are all convex and the equality constraint functions h_i(x) are all affine. Besides the advantages mentioned at the beginning of the section, an important reason for the widespread popularity of convex optimization is the rapid development in (publicly available) solver technology. The convex programs whose instances we will study in this dissertation all satisfy the so-called self-concordance property, which has enabled the emergence of interior-point method based solvers (Karmarkar, 1984; Nesterov and Nemirovskii, 1994). These methods are guaranteed to converge to the global minimum in polynomial time and are free of the numerical sensitivity issues that hinder many general-purpose optimization approaches. In the remainder of this section, we will look at a few important examples of convex optimization problems.
Linear Programs
A linear program (LP) has the form

minimize    c^T x + d
subject to  Gx ⪯ h                    (4.16)
            Ax = b,

where x ∈ R^n, G ∈ R^{k×n} and A ∈ R^{l×n}.
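Since a bounded LP attains its optimum at a vertex of the feasible polyhedron, a tiny two-variable instance can be solved by enumerating pairwise constraint intersections. This brute-force sketch is our own illustration only (practical LPs are solved with simplex or interior-point methods); the instance at the bottom is a hypothetical example:

```python
from itertools import combinations

def solve_lp_2d(c, G, h, tol=1e-9):
    """Brute-force 2-variable LP: minimize c^T x subject to G x <= h.
    Enumerates pairwise intersections of the constraint lines (candidate
    vertices) and keeps the feasible one with lowest cost. Assumes the
    problem is bounded; returns (value, vertex) or None if infeasible."""
    best = None
    for (g1, h1), (g2, h2) in combinations(zip(G, h), 2):
        det = g1[0]*g2[1] - g1[1]*g2[0]
        if abs(det) < tol:
            continue  # parallel constraint lines: no vertex here
        # solve the 2x2 linear system g1.x = h1, g2.x = h2 by Cramer's rule
        x = ((h1*g2[1] - h2*g1[1]) / det,
             (g1[0]*h2 - g2[0]*h1) / det)
        if all(g[0]*x[0] + g[1]*x[1] <= hh + tol for g, hh in zip(G, h)):
            val = c[0]*x[0] + c[1]*x[1]
            if best is None or val < best[0]:
                best = (val, x)
    return best

# maximize x + y (i.e. minimize -x - y) s.t. x + 2y <= 4, x <= 3, x >= 0, y >= 0
val, vertex = solve_lp_2d([-1.0, -1.0],
                          [[1.0, 2.0], [1.0, 0.0], [-1.0, 0.0], [0.0, -1.0]],
                          [4.0, 3.0, 0.0, 0.0])
```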
Quadratic Programs
A quadratically constrained quadratic program (QCQP) has the form

minimize    x^T P x + q^T x + r
subject to  x^T P_i x + q_i^T x + r_i ≤ 0,  i = 1, …, m    (4.17)
            Gx ⪯ h
            Ax = b.

While a general QCQP is NP-hard to minimize, convex QCQPs additionally require that P and each P_i be positive semidefinite, and are solvable in polynomial time.
Second Order Cone Programs
A second order cone program (SOCP) has the form

minimize    c^T x + d
subject to  ‖P_i x + q_i‖ ≤ r_i^T x + s_i,  i = 1, …, m    (4.18)
            Gx ⪯ h
            Ax = b.

It can easily be verified that a QCQP is a special case of an SOCP, obtained by setting each r_i = 0 and squaring the second order cone constraints.
Semidefinite Programs
A semidefinite program (SDP) has the form

minimize    c^T x + d
subject to  Σ_{i=1}^n x_i F_i + F_0 ⪯ 0    (4.19)
            Ax = b,

where F_j ∈ S^n for j = 0, …, n. It can be verified that SDPs subsume LPs, convex QPs and SOCPs.
4.2.4 Linear Matrix Inequalities
Linear matrix inequalities (LMIs) have come to play a central role in optimization problems that arise in fields like control theory and signal processing. In recent years, LMIs have proven very successful in seeking globally optimal solutions to systems of polynomial equations. As some of the following chapters will demonstrate, this has great utility for several applications in computer vision.
Given symmetric matrices A_i ∈ S^n, i = 0, …, m, and x ∈ R^m, a linear matrix inequality is an expression of the form

LMI(x) := A_0 + Σ_{i=1}^m x_i A_i ⪰ 0. (4.20)

The solution set of the LMI is convex, since it is the preimage of the semidefinite cone under an affine function. A trick commonly used in uncovering convex LMI constraints is the Schur complement technique. Let X be a symmetric matrix decomposable as

X = [ A    B
      B^T  C ], (4.21)

where A is invertible. Then the Schur complement of A in X is given by

S = C − B^T A^{−1} B (4.22)

and can be shown to have the following properties:

• X ≻ 0 ⇔ A ≻ 0 and S ≻ 0.

• Given A ≻ 0, X ⪰ 0 ⇔ S ⪰ 0.

The Schur complement will be used several times in this dissertation, especially to recognize the convexity of constraints of the form

0 ≤ ‖Ax + b‖² / (c^T x + d) ≤ 1, (4.23)

which, we will see, arise quite commonly in several computer vision problems. Indeed, the epigraph associated with the nonlinear function in (4.23), given by { (x, t) | ‖Ax + b‖² ≤ (c^T x + d) · t }, can be reduced to the convex LMI constraint

[ (c^T x + d) I   Ax + b
  (Ax + b)^T      t      ] ⪰ 0. (4.24)

Similarly, second order cone programming constraints such as

‖Ax + b‖ ≤ c^T x + d (4.25)

can be reduced to the convex LMI constraint

[ (c^T x + d) I   Ax + b
  (Ax + b)^T      c^T x + d ] ⪰ 0. (4.26)
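The first Schur complement property is easy to verify exhaustively in the scalar case, where A, B, C are 1×1 blocks and X is a symmetric 2×2 matrix. This check is our own illustration:

```python
def pd_2x2(a, b, c):
    # the symmetric matrix [[a, b], [b, c]] is positive definite iff its
    # leading principal minors a and a*c - b*b are positive (Sylvester)
    return a > 0 and a*c - b*b > 0

def schur_agrees(a, b, c):
    # with 1x1 blocks A=a, B=b, C=c, the Schur complement is S = c - b*b/a;
    # property checked: X > 0  <=>  A > 0 and S > 0  (A invertible)
    if a == 0:
        return True  # A not invertible: the property does not apply
    s = c - b*b/a
    return pd_2x2(a, b, c) == (a > 0 and s > 0)

# exhaustive check over a small grid of symmetric 2x2 matrices
grid = [v/2.0 for v in range(-6, 7)]
ok = all(schur_agrees(a, b, c) for a in grid for b in grid for c in grid)
```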
4.3 Branch and Bound Theory
Branch and bound algorithms are non-heuristic methods for the global optimization of non-convex problems. They maintain a provable upper and/or lower bound on the (globally) optimal objective value and terminate with a certificate proving that the solution is ε-suboptimal (that is, within ε of the global optimum), for arbitrarily small ε. We will restrict our treatment to minimization problems, as that is the case we will encounter in the pertinent structure and motion problems. Consider a non-convex, scalar-valued objective function f(x), for which we seek a global optimum over a rectangle Q₀. By a rectangle, we mean a region of the search space demarcated by finite intervals along each search dimension. For a rectangle
Q ⊆ Q₀, let f_min(Q) denote the minimum value of the function f over Q. Also, let f_lb(Q) denote the minimum value attained within the rectangle Q by a function f_lb(x), which underestimates the value of f(x) for any x ∈ Q. The function f_lb must satisfy the following conditions:

(L1) f_lb(Q) computes a lower bound on f_min(Q) over the domain Q, that is, f_lb(Q) ≤ f_min(Q).

(L2) The approximation gap f_min(Q) − f_lb(Q) converges to zero as the maximum half-length of the sides of Q, denoted |Q|, tends to zero, that is,

∀ ε > 0, ∃ δ > 0 s.t. ∀ Q ⊆ Q₀, |Q| ≤ δ ⇒ f_min(Q) − f_lb(Q) ≤ ε.
An intuitive technique to determine an ε-suboptimal solution would be to divide the whole search region Q₀ into a grid with cells of side δ and compute the minimum of a lower bounding function f_lb defined over each grid cell, with the presumption that each f_lb(Q) is easier to compute than the corresponding f_min(Q). However, the number of such grid cells increases rapidly as δ → 0, so a clever procedure must be deployed to create as few cells as possible and "prune" away as many of these grid cells as possible (without having to compute the lower bounding function for the pruned cells). This is precisely the aim of a branch and bound algorithm.
The branch and bound algorithm begins by computing f_lb(Q₀) and the point q* ∈ Q₀ which minimizes f_lb over Q₀. If f(q*) − f_lb(Q₀) < ε, for a pre-specified ε, the algorithm terminates. Otherwise Q₀ is partitioned as a union of subrectangles Q₀ = Q₁ ∪ ⋯ ∪ Q_k for some k ≥ 2, and the lower bounds f_lb(Q_i), as well as the points q_i at which these lower bounds are attained, are computed for each Q_i. Let q* = arg min_{q ∈ {q_i}, i=1,…,k} f(q). We deem f(q*) to be the current best estimate of f_min(Q₀). The algorithm terminates when f(q*) − min_{1≤i≤k} f_lb(Q_i) < ε; else the partition of Q₀ is refined by further dividing some subrectangle and repeating the above. The rectangles Q_i for which f_lb(Q_i) > f(q*) cannot contain the global minimum and are not considered for further refinement. A graphical illustration of the algorithm is presented in Figure 4.1. Computation of the lower bounding functions is referred to as bounding, while the procedure that chooses a rectangle and subdivides it is called branching. There can be several possible choices of the rectangle picked for refinement in the branching step and of the actual subdivision itself. We consider the rectangle with the smallest minimum of f_lb as the most promising to contain the global minimum and subdivide it into k = 2 rectangles, as described in the following sections, which can be shown to be a convergent strategy. Algorithm 1 uses the abovementioned functions and presents concise pseudocode for the branch and bound method. Further detailed descriptions of the bounding and branching procedures are given in the next two subsections. Although guaranteed to find the global optimum (or a point arbitrarily close to it), the worst-case complexity of a branch and bound algorithm is exponential. However, we will show in our experiments that the special properties of the geometric reconstruction problems considered in this dissertation lead to fast convergence rates in practice.
Algorithm 1 Branch and Bound
Require: Initial rectangle Q₀ and ε > 0.
 1: Bound: Compute f_lb(Q₀) and minimizer q* ∈ Q₀.
 2: S = {Q₀}   {Initialize the set of candidate rectangles}
 3: loop
 4:   Q′ = arg min_{Q ∈ S} f_lb(Q)   {Choose rectangle with lowest bound}
 5:   if f(q*) − f_lb(Q′) < ε then
 6:     return q*   {Termination condition satisfied}
 7:   end if
 8:   Branch: Q′ = Q_l ∪ Q_r
 9:   S = (S \ {Q′}) ∪ {Q_l, Q_r}   {Update the set of candidate rectangles}
10:   Bound: Compute f_lb(Q_l) and minimizer q_l ∈ Q_l.
11:   if f(q_l) < f(q*) then
12:     q* = q_l   {Update the best feasible solution}
13:   end if
14:   Bound: Compute f_lb(Q_r) and minimizer q_r ∈ Q_r.
15:   if f(q_r) < f(q*) then
16:     q* = q_r   {Update the best feasible solution}
17:   end if
18:   S = {Q | Q ∈ S, f_lb(Q) < f(q*)}   {Discard rectangles with high lower bounds}
19: end loop
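A minimal runnable instance of this scheme is sketched below (our own illustration, not from the dissertation). In place of the convex envelopes developed later, it uses a Lipschitz-constant lower bound f_lb(Q) = f(midpoint) − L·|Q|, which satisfies conditions (L1) and (L2); the toy objective and the constant L are hypothetical choices:

```python
import heapq
import math

def f(x):
    # toy non-convex objective on [-3, 3]
    return math.sin(3.0*x) + 0.3*x*x

LIP = 5.0  # Lipschitz constant: |f'(x)| <= 3 + 0.6*|x| <= 4.8 on [-3, 3]

def bound(lo, hi):
    # Lipschitz lower bound over Q = [lo, hi], attained-point estimate = midpoint
    mid = 0.5*(lo + hi)
    return f(mid) - LIP*0.5*(hi - lo), mid

def branch_and_bound(lo, hi, eps=1e-3):
    lb, q = bound(lo, hi)
    best_x, best_f = q, f(q)
    heap = [(lb, lo, hi)]                      # candidate rectangles keyed by f_lb
    while heap:
        lb, lo, hi = heapq.heappop(heap)       # rectangle with lowest bound
        if best_f - lb < eps:
            return best_x, best_f              # certified eps-suboptimal
        mid = 0.5*(lo + hi)                    # branch: bisect the interval
        for a, b in ((lo, mid), (mid, hi)):
            lbi, qi = bound(a, b)
            if f(qi) < best_f:                 # update the best feasible solution
                best_x, best_f = qi, f(qi)
            if lbi < best_f - eps:             # discard rectangles with high bounds
                heapq.heappush(heap, (lbi, a, b))
    return best_x, best_f
```

Because the bound gap shrinks linearly with the interval half-length, condition (L2) holds and the loop terminates with an ε-suboptimal certificate rather than exhausting a grid.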
Figure 4.1: Illustration of the operation of a branch and bound algorithm on a one-dimensional non-convex minimization problem. Panel (a) shows the function f(x) and the interval l ≤ x ≤ u in which it is to be minimized. Panel (b) shows the convex relaxation of f(x) (dashed), its domain (shaded) and the point at which the relaxation attains its minimum value; q₁* is the corresponding value of the function f. This value, the best current estimate of the minimum of f(x), is used to reject the left subinterval in panel (c), since the minimum value of the convex relaxation there is higher than q₁*. Panel (d) shows the lower bounding operation in the right subinterval, in which a new estimate q₂* of the minimum value of f(x) is obtained.
4.3.1 Bounding
The goal of the bounding procedure is to provide the branch and bound algorithm with a bound on the smallest value the objective function takes in a domain. The function f_lb must possess three properties crucial to the efficiency and convergence of the algorithm: (i) it must be easily computable, (ii) it must provide as tight a bound as possible and (iii) it must be easily minimizable. Precisely these features are inherent in the convex envelope of an objective function, which we define below.
Definition 1 (Convex Envelope). Let f : S → R, where S ⊂ R^n is a non-empty convex set. The convex envelope of f over S (denoted convenv f) is a convex function such that (i) convenv f(x) ≤ f(x) for all x ∈ S and (ii) for any other convex function u satisfying u(x) ≤ f(x) for all x ∈ S, we have convenv f(x) ≥ u(x) for all x ∈ S.

Finding the convex envelope of an arbitrary function may be as hard as finding the global minimum. To be of any advantage, the envelope construction must be cheaper than the optimal estimation. Two of the principal contributions of this dissertation are to construct the tightest possible convex relaxations for objective functions that arise in important multiview geometry problems and to show that they satisfy the conditions (L1) and (L2) enumerated above.
4.3.2 Branching
Branch and bound algorithms can be slow; in fact, the worst-case complexity grows exponentially with problem size. Thus, one must devise a sufficiently sophisticated branching strategy to expedite convergence. Indeed, for the various geometric reconstruction problems considered in this dissertation, we will demonstrate that it is possible to restrict the branching to a small and fixed number of dimensions regardless of the problem size, which substantially enhances the number of views or points our algorithms can handle. There are three issues that must be addressed within the branching phase: the rectangle to branch on, the dimension of the chosen rectangle to split along and the point at which to split the chosen dimension. The choice of rectangle to be partitioned is essentially heuristic: we consider the rectangle with the smallest minimum of f_lb as the most promising to contain the global minimum and subdivide it first. For the remaining two issues, we pick the dimension with the largest interval and employ a simple spatial division procedure, called α-bisection (see Algorithm 2), for a given scalar α, 0 < α ≤ 0.5. It can be shown that for our applications, α-bisection leads to a branch-and-bound algorithm which is convergent; see for instance Appendix C or (Benson, 2002).
Algorithm 2 α-bisection
Require: A rectangle Q ⊂ R^n, defined as Q = [l₁, u₁] × ⋯ × [lₙ, uₙ]
1: j = arg max_{i=1,…,n} (u_i − l_i)
2: v_j = l_j + α(u_j − l_j)
3: Q_l = [l₁, u₁] × ⋯ × [l_j, v_j] × ⋯ × [lₙ, uₙ]
4: Q_r = [l₁, u₁] × ⋯ × [v_j, u_j] × ⋯ × [lₙ, uₙ]
5: return (Q_l, Q_r)
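Algorithm 2 translates directly into a few lines of code. In this sketch of our own, a rectangle is represented as a list of (l, u) interval pairs:

```python
def alpha_bisection(Q, alpha=0.5):
    """Split rectangle Q (a list of (l, u) intervals) along its longest
    dimension at the point l_j + alpha*(u_j - l_j), with 0 < alpha <= 0.5."""
    # pick the dimension with the largest interval
    j = max(range(len(Q)), key=lambda i: Q[i][1] - Q[i][0])
    lj, uj = Q[j]
    vj = lj + alpha*(uj - lj)
    # replace the j-th interval by its two halves
    Ql = Q[:j] + [(lj, vj)] + Q[j+1:]
    Qr = Q[:j] + [(vj, uj)] + Q[j+1:]
    return Ql, Qr
```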
The α-bisection branching strategy does not pay any attention to the values of the convex envelope and the objective function. The next strategy we describe, ω-subdivision, addresses this issue. Intuitively, the ω-subdivision rule is a heuristic for maximum improvement in the convex envelope underestimation for Q_l and Q_r compared to Q. We will illustrate the ω-subdivision rule for the case of fractional programming, considered in Chapter 5, where the goal is to minimize a sum of fractions, Σ_{i=1}^n t_i/s_i. It can be shown that, for a fractional program, it suffices to branch only along the dimensions corresponding to the denominator variables s_i. Let a rectangle Q ⊂ R^{2n} be defined as

Q = [l₁, u₁] × ⋯ × [lₙ, uₙ] × [L₁, U₁] × ⋯ × [Lₙ, Uₙ], (4.27)

where t_i ∈ [l_i, u_i] and s_i ∈ [L_i, U_i]. Let the lower bounding point in a rectangle Q be given by ω(Q) = (t₁*, …, tₙ*, s₁*, …, sₙ*)^T ∈ R^{2n} and let y = (y₁, …, yₙ)^T ∈ R^n be the vector of lower bounds for each individual fraction. Then we branch along the dimension for which the difference between the objective fraction value and its lower bound is maximum. The method is stated more formally in Algorithm 3.
Algorithm 3 ω-subdivision
Require: A rectangle Q ⊂ R^{2n}, ω(Q), y
1: j = arg max_i (t_i*/s_i* − y_i)
2: Q_l = [l₁, u₁] × ⋯ × [lₙ, uₙ] × [L₁, U₁] × ⋯ × [L_j, s_j*] × ⋯ × [Lₙ, Uₙ]
3: Q_r = [l₁, u₁] × ⋯ × [lₙ, uₙ] × [L₁, U₁] × ⋯ × [s_j*, U_j] × ⋯ × [Lₙ, Uₙ]
4: return (Q_l, Q_r)
The ω-subdivision algorithm can also be shown to be convergent for our applica- tions, but we will henceforth restrict our attention to the α-bisection algorithm.
4.4 Global Optimization for Polynomials
Several problems in computer vision can be formulated as minimizing a polynomial objective subject to polynomial equality and inequality constraints:

min_{x ∈ R^n} { p(x) | p_i(x) ≥ 0, i = 1, …, m }. (4.28)

Recent advances in algebraic geometry (Cox et al., 1998) and semidefinite programming (Boyd and Vandenberghe, 2004) have shown that it is possible to find the global minimum for such problems. This review section borrows terminology and notation from (Schweighofer, 2006) and (Lasserre, 2001), to which we direct the interested reader for greater detail and further references.
Theoretical background
For x ∈ R^n, let R[x] denote the ring of n-variate polynomials p(x), with the monomial basis

B = { x^α := x₁^{α₁} ⋯ xₙ^{αₙ} | α ∈ N^n }. (4.29)

Then, for f, g₁, …, g_m ∈ R[x], we want to solve the following constrained minimization problem:

f* := inf { f(x) | x ∈ S },  where S = { x ∈ R^n | g_i(x) ≥ 0, i = 1, …, m }, (4.30)

where S is assumed to be compact to make the problem well-defined. We will use the following notation:

R[x]² := { p ∈ R[x] | p = q² for some q ∈ R[x] }
R[x]²g := { p ∈ R[x] | p = q²g for a given g ∈ R[x] and some q ∈ R[x] }
ΣR[x]²g_i := { p ∈ R[x] | p = Σ_i q_i for some q_i ∈ R[x]²g_i, given g_i ∈ R[x] }    (4.31)

Similarly constructed definitions will also be used in the subsequent paragraphs. The quadratic module generated by g₁, g₂, …, g_m is the set

Q := ΣR[x]² + ΣR[x]²g₁ + ⋯ + ΣR[x]²g_m
  = { Σ_{i=0}^m s_i g_i | s_i ∈ ΣR[x]² } ⊆ R[x],  (with g₀ := 1), (4.32)

which is non-negative on S. Problem (4.30) is clearly non-convex in general, since f is a non-convex function and S is a non-convex set. One way to convexify an arbitrary optimization problem is by lifting it to the infinite-dimensional space of measures: if M(S) is the set of all probability measures with support in the set S, then

f* = inf_{x ∈ S} f(x) = inf_{μ ∈ M(S)} ∫_S f(x) dμ, (4.33)

which is convex since M(S) is convex. The dual formulation of this trivial convexification is
f* = sup { c ∈ R | f − c ≥ 0 on S } = sup { c ∈ R | f − c > 0 on S }, (4.34)

where the constraint set is now the cone of polynomials non-negative on S, which is a convex cone. While both (4.33) and (4.34) are convex optimization problems, they are not efficiently solvable in general, since the constraint sets cannot be characterized tractably. But this does give us the intuition that the problem of globally minimizing a system of polynomials is, in fact, related to the classical problem of moments (Akhiezer, 1965), which can be stated as: given a set S ⊆ R^n and a sequence of numbers {m_α}, α ∈ N^n, does there exist a probability measure μ, with support S(μ) ⊆ S, such that

∫_S x^α dμ = m_α (4.35)

for every given α? The remarkable results in (Putinar, 1993) formalize this connection:
Theorem 6. A linear map L : R[x] → R satisfies L(1) = 1 and L(Q) ⊆ [0, ∞) if and only if there exists a probability measure μ ∈ M(S) such that

L(p) = ∫_S p dμ (4.36)

for every p ∈ R[x].

To see that the above theorem solves the moment problem on S, we note that
(4.36) can be rewritten in terms of the moments {m_α}, since every linear map L : R[x] → R is characterized by its values on the monomial basis B in (4.29). Thus, the primal problem in (4.33) can be reformulated as

f* = inf { L(f) | L : R[x] → R is linear, L(1) = 1, L(Q) ⊆ [0, ∞) }. (4.37)
To reduce the dual problem, we appeal to another result from (Putinar, 1993), whose resemblance in form to Hilbert's Nullstellensatz is the reason this theorem is similarly named the Positivstellensatz:
Theorem 7. If p ∈ R[x] is positive on S, then p ∈ Q.
Given the above result, we replace the dual problem in (4.34) with
f* = sup { c ∈ R | f − c ∈ Q }. (4.38)

Note that there might be polynomials non-negative on S but not contained in Q. Still, the formulation in (4.38) suffices for our purposes as far as a reduction of the dual problem is concerned. Let deg(p) denote the degree of polynomial p, and let the vector space of polynomials of degree up to d be denoted R[x]_d. For an arbitrary k ∈ N such that
k ≥ d_max,  where d_max = max { deg(f), deg(g₁), …, deg(g_m) }, (4.39)

define

d_i = ⌊(k − deg(g_i))/2⌋, (4.40)

where ⌊·⌋ denotes the greatest-integer (floor) function. Then, similar to (4.32), we define an approximation to Q as

Q_k := ΣR[x]²_{d₀} + ΣR[x]²_{d₁} g₁ + ⋯ + ΣR[x]²_{d_m} g_m
    = { Σ_{i=0}^m s_i g_i | s_i ∈ ΣR[x]², deg(s_i g_i) ≤ k } ⊆ R[x]. (4.41)
Now, we consider the following optimization problem, a relaxation of (4.37) called the primal relaxation of order k:

P_k: min  L(f)
     subject to  L : R[x]_k → R is linear,
                 L(1) = 1,
                 L(Q_k) ⊆ [0, ∞). (4.42)

Similarly, a dual relaxation of order k for (4.38) can be constructed as

D_k: max  c
     subject to  c ∈ R,
                 f − c ∈ Q_k. (4.43)
Let P_k* and D_k* be the optimal values of P_k and D_k, respectively. Then P_k* ≤ f*. Further, if L is feasible for P_k and c is feasible for D_k, then, since L is a linear map, we have

L(f) − c = L(f) − cL(1) = L(f − c) ≥ 0, (4.44)

where the last inequality follows from the fact that f − c ∈ Q_k. Thus, in particular, D_k* ≤ P_k*. Moreover, every feasible solution of D_k is also a feasible solution of D_{k+1}, and every feasible solution of P_{k+1} is feasible for P_k when restricted to the subspace R[x]_k of R[x]_{k+1}. Finally, for any ε > 0,

f* − ε ≤ D_k*, (4.45)

since for a sufficiently large k ≥ d_max, it follows from Theorem 7 that f − (f* − ε) ∈ Q_k, that is, f* − ε is feasible for D_k. Putting it all together, we have the following result from (Lasserre, 2001):
Theorem 8. For k = d_max, d_max + 1, ⋯ ∈ N, the sequences {P_k*} and {D_k*} are increasing and converge to f* while satisfying D_k* ≤ P_k* ≤ f*.
Indeed, stronger results are proved for the relaxations P_k and D_k in (Lasserre, 2001), namely, that for an S with a non-empty interior there is zero duality gap, that is, P_k* = D_k*. It is also shown that for all practical purposes, the problems P_k and D_k can be represented as semidefinite programs (SDPs), which can be efficiently minimized using standard solvers. This forms the basis for a primal-dual schema for solving increasingly tighter relaxations of (4.30), whose optimal solutions converge to the global optimum f*. Note that the above result is valid only for polynomials p ∈ R[x] for which there exists some n ∈ N such that n ± p ∈ Q. It can be shown that this is equivalent to demanding the existence of an n ∈ N such that n − ‖x‖² ∈ Q (Schmüdgen, 1991). In practice, if an n* ∈ N is known such that the set S is contained in a closed ball of radius √n* centered at the origin, then this condition can be satisfied by including an additional redundant constraint g_{m+1} := n* − ‖x‖² ≥ 0.
Convex LMI relaxations for polynomial optimization
The above theory is the basis for the algorithm to globally optimize a scalar polynomial objective function subject to polynomial constraints presented in (Henrion and Lasserre, 2005). Let p* denote the minimum objective value (if it exists) of the problem (4.28). A convex relaxation is, by construction, a convex optimization problem with minimum objective value p_r* such that p_r* ≤ p*. By solving the relaxed problem, a lower bound on the original objective function is obtained. Convex linear matrix inequality (LMI) relaxations for (4.28) can be obtained by gradually adding lifting variables and constraints corresponding to linearizations of monomials up to a given degree. The LMI relaxation covering monomials up to a given even degree 2δ is referred to as the LMI relaxation of order δ. The standard Shor relaxation in mathematical programming (Shor, 1998) can be regarded as a first-order LMI relaxation.
To construct an LMI relaxation of order δ, let v_δ(x) be a vector containing all monomials up to degree δ, including the constant term 1. Then an LMI relaxation of order δ for the original problem in (4.28) can be constructed using the following algorithm:
1. Linearize the objective function $p(x)$ by lifting: each monomial $x_1^{k_1} x_2^{k_2} \cdots x_n^{k_n}$ is replaced with a new lifting variable $y_{k_1 k_2 \ldots k_n}$. Thus, the linearized objective function can be written $p^\top y$ for a constant coefficient vector $p$ and the lifting vector $y$.
2. Apply lifting to the LMI constraint $p_i(x)\, v_{\delta-1}(x) v_{\delta-1}(x)^\top \succeq 0$ for each constraint $p_i(x) \ge 0$. Denote the linearized constraint by $C_{\delta-1}(p_i(y)) \succeq 0$.

3. Add the LMI moment matrix constraint, which corresponds to linearizing the trivial constraint $v_\delta(x) v_\delta(x)^\top \succeq 0$. Denote the linearized constraint by $C_\delta(y) \succeq 0$.
To summarize, the following SDP is solved for the LMI relaxation of order $\delta$ of the problem (4.28):

$$\begin{aligned} \min_y \quad & p^\top y \\ \text{subject to} \quad & C_{\delta-1}(p_i(y)) \succeq 0, \quad i = 1, 2, \ldots, m, \\ & C_\delta(y) \succeq 0. \end{aligned} \qquad (4.46)$$

As we saw previously, it is shown in (Lasserre, 2001) that, under certain mild technical conditions, the above hierarchy of relaxations converges asymptotically to $p^*$, that is,
$$\lim_{\delta \to \infty} p_\delta^* = p^*. \qquad (4.47)$$

It turns out that for many non-convex polynomial optimization problems, global optima are reached to a given accuracy with a moderate number of lifting variables and constraints, and hence with an LMI relaxation of moderate order. A sufficient condition for having reached the global optimum is that the moment matrix $C_\delta(y^*)$ has rank one at the optimum $y^*$.
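As a minimal illustration of the lifting (an example of our own, not from the references), consider minimizing $p(x) = x^2 - x$ subject to the single constraint $p_1(x) = 1 - x^2 \ge 0$. For the order-1 relaxation, $v_1(x) = (1, x)^\top$, and lifting $x \mapsto y_1$, $x^2 \mapsto y_2$ gives the SDP

$$\min_{y} \; y_2 - y_1 \quad \text{subject to} \quad 1 - y_2 \ge 0, \quad C_1(y) = \begin{pmatrix} 1 & y_1 \\ y_1 & y_2 \end{pmatrix} \succeq 0.$$

The moment constraint enforces $y_2 \ge y_1^2$, so the relaxation attains $-1/4$ at $y_1 = 1/2$, $y_2 = 1/4$, which matches the true minimum of $x^2 - x$ at $x = 1/2$. Moreover, $C_1(y^*)$ has rank one, certifying that the first-order relaxation is already exact for this problem.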
If the solution to the relaxed problem is not tight, that is, $p_\delta^* < p^*$, then an approximate solution may be obtained by simply keeping the lifting variables corresponding to first-order moments. A Matlab toolbox for LMI relaxations based on the publicly available SDP solver SeDuMi (Sturm, 1999) can be found in (Henrion and Lasserre, 2003).

Chapter 5
Triangulation and Resectioning
"From the centre of the land and water, at a distance of one-quarter of the Earth's circumference lies Laṅkā. And from Laṅkā, at a distance of one-fourth thereof, exactly northwards, lies Ujjayinī."

Āryabhaṭa (Indian astronomer, 476-550 AD), Circumference of Earth, Āryabhaṭīya
5.1 Introduction
With the background of the previous chapters, we now turn our attention to various problems in multiple view geometry, with the aim of developing practical solutions with theoretical guarantees of optimality. As we discussed in Chapter 1, these optimization problems are typically highly non-linear, and finding their global optima has been shown, in general, to be NP-hard (Nistér et al., 2007; Freund and Jarre, 2001). Existing methods for solving these problems are based on a combination of heuristic initialization and local optimization to converge to a locally optimal solution. A common method for finding the initial solution is to use a direct linear transform (for example, the eight-point algorithm (Longuet-Higgins, 1981)) to convert the optimization problem into a linear least squares problem. The solution then serves as the initial point for a non-linear minimization method based on the Jacobian and Hessian of the objective function, for instance, bundle adjustment. As has been
documented, the success of these methods critically depends on the quality of the initial estimate (Hartley and Zisserman, 2004).

In this chapter, we present practical algorithms for finding the globally optimal solution to a variety of problems in multiview geometry, such as general n-view triangulation, camera resectioning (also called camera pose estimation or absolute orientation determination) and the estimation of general projections $\mathbb{P}^n \mapsto \mathbb{P}^m$, for $n \ge m$. We solve each of these problems under three different noise models, including the standard Gaussian distribution and two variants of the bivariate Laplace distribution. Our algorithm is provably optimal, that is, given any tolerance $\epsilon$, if the optimization problem is feasible, the algorithm returns a solution which is at most $\epsilon$ away from the global optimum. The algorithm is a branch and bound style method based on extensions to recent developments in the fractional and convex programming literature (Tawarmalani and Sahinidis, 2001b; Benson, 2002; Boyd and Vandenberghe, 2004). While the worst case complexity of our algorithm is exponential, we show in our experiments that, for a fixed $\epsilon$, the runtime of our algorithm scales almost linearly with problem size, making this a very attractive approach for use in practice. In summary, the main contributions of this chapter are:
• A scalable algorithm for solving a class of multiview problems with a guarantee of global optimality.

• Handling of the standard L2-norm of reprojection errors, as well as the robust L1-norm, for the perspective camera model.

• Introduction of fractional programming to the computer vision community.
5.1.1 Related Work
Recently, there has been some progress towards finding the global solution to a few of the multiview optimization problems. An attempt to generalize the optimal solution of two-view triangulation (Hartley and Sturm, 1997) to three views was made in (Stewénius et al., 2005) based on Gröbner bases. However, the resulting algorithm is numerically unstable and computationally expensive, and it does not generalize to more views or harder problems like resectioning, although more numerically stable results were obtained in (Byröd et al., 2007). In (Kahl and Henrion, 2005), convex linear matrix inequalities were used to estimate the global optimum for several multiple view geometry problems, but no guarantees can be given to certify that the computed solution is indeed the global optimum. Also, there are unsolved problems concerning the numerical stability of the solvers used. Robustification using the L1-norm was presented in (Ke and Kanade, 2005b), but the approach is restricted to the affine camera model. In (Kahl, 2005; Ke and Kanade, 2005a), a wider class of geometric reconstruction problems was solved globally, but with the L∞-norm.

Subsequent to this work, there have been faster solutions that employ similar principles, but are more tailored to the particular case of the triangulation problem (Lu and Hartley, 2007). It is also possible to use convex optimization methods to verify the optimality of a given solution for some multiview geometry problems posed in the L2-norm (Hartley and Seo, 2008).
5.1.2 Outline
We begin by formulating the problems we are interested in solving in the next section. An exposition on fractional programming is given in Section 5.2, with details of the construction of lower bounds (Section 5.4.1). We justify in Section 5.5 the claim that a broad class of multiview geometry problems with different noise models can be cast in the unifying framework of fractional programming. Section 5.6 presents two innovations, crucial to expeditious convergence, that exploit the special properties of structure and motion problems: a novel bounds propagation scheme to restrict the branching process to a small, fixed number of dimensions independent of the problem size, and an intuitive initialization strategy based on reprojection error. Finally, Section 5.7 presents the experimental results of an extensive evaluation of our algorithm on a variety of synthetic and real data sets with different noise levels.
5.2 Problem Formulation
A perspective camera can be modelled as a linear mapping $\mathbb{P}^3 \mapsto \mathbb{P}^2$ from projective 3-space to a projective image plane. In matrix notation, a 3D scene point, represented by a homogeneous 4-vector X, and its projected image point, represented by a homogeneous 3-vector x, are related by

$$x = \lambda P X, \qquad (5.1)$$

where the scalar $\lambda$, called the projective depth, accounts for scale and P is the 3 × 4 camera matrix encoding the intrinsic and extrinsic parameters of the camera. We consider the following two problems under three different noise models, namely the Gaussian and two variants of the bivariate Laplacian.
1. Structure Estimation: Given N images of a point and the corresponding camera matrices, estimation of the position of the point in $\mathbb{P}^3$. This is also known as the N-view triangulation problem.

2. Transformation Estimation: Given the positions of N points in the projective space $\mathbb{P}^n$ and their images in the space $\mathbb{P}^m$, estimation of the projective transformation P that maps these points from $\mathbb{P}^n$ to $\mathbb{P}^m$. When n = 3 and m = 2, that is, when the transformation is a 3 × 4 camera matrix, the problem is also known as camera resectioning.
Let $P = [\pi_1\ \pi_2\ \pi_3]^\top$ denote the 3 × 4 camera matrix, where the $\pi_i$ are 4-vectors, let $(u, v)^\top$ stand for image coordinates and let X be a homogeneous 3D point. Then the reprojection residual vector for one image is given by

$$r = \left( u - \frac{\pi_1^\top X}{\pi_3^\top X},\; v - \frac{\pi_2^\top X}{\pi_3^\top X} \right)^\top. \qquad (5.2)$$
Under an independent, identically distributed (iid) Gaussian noise model, the objective function to minimize is the sum of squared residuals,

$$\sum_{i=1}^N \|r_i\|_2^2, \qquad (5.3)$$

where N is the number of residual terms in the problem, that is, the number of images of the given point. Other noise models will also be considered later in this chapter.

Minimizing the sum-of-squares objective function (5.3) is known to be a troublesome non-convex optimization problem for both structure and transformation estimation (Hartley and Zisserman, 2004). It is known that the two-view triangulation problem can be reduced to finding the roots of a sixth degree polynomial (Hartley and Sturm, 1997), while three-view triangulation can be posed as the solution to a polynomial system which may have up to 47 roots (Stewénius et al., 2005). So, not only do these seemingly simple instances of the triangulation problem have several local minima, the difficulty of obtaining a solution rises sharply with the number of views. This phenomenon causes difficulties for local optimization techniques (such as Newton-based methods), since they may get stuck in local minima. As an example, consider the following three-view triangulation problem (first
published in (Kahl and Hartley, 2008)) in which there are three local L2 minima, all lying in front of all three cameras. In this example, all points lie in the plane z = 0, so we may simplify the problem to a 2-dimensional triangulation problem; adding a third dimension makes no significant difference to the example.

Let the first camera be represented by the 2 × 3 camera matrix

$$P_0 = \begin{pmatrix} -3 & 1 & -8 \\ -1 & -3 & -6 \end{pmatrix}.$$

The centre of this camera is at the point $(-3, -1, 1)^\top$. We obtain two other cameras $P_1$ and $P_2$ by rotating around the origin by ±120°.

Now, for all $i = 0, \ldots, 2$, let $x_i = (3, 1)^\top$; this is simply the point with non-homogeneous coordinate 3 in the image. It is easily seen that all points of the form $(x, -1, 1)^\top$ map to the same point $(3, 1)^\top$ in the $P_0$ image. These points lie along the line $y = -1$, which is therefore the ray corresponding to the image point $x_0 = (3, 1)^\top$ for the $P_0$ camera. The rays corresponding to the points measured in the other images lie on lines rotated by ±120° around the origin. The three rays form a triangle. Since this configuration has three-fold symmetry, if there were a single minimum of the cost function, it could only be at the origin, which is the symmetry centre. It is easily seen that the origin is not the global optimum. One might suspect that the local optima are at the vertices of the triangle. However, the best L2 solutions do not lie exactly at the vertices of the triangle. The contour plot (sublevel-set plot) of the L2 error (of a slightly perturbed problem) is shown in Figure 5.1.
Figure 5.1: A contour plot of the L2 error for a three-view triangulation problem in which there are three local minima for the L2 cost function.
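The multi-modality of this example is easy to check numerically. The following pure-Python sketch evaluates the three-view L2 cost, using the entries of $P_0$ as given in the example above (treat the exact numbers as an assumption of this sketch; the qualitative conclusion, a lower cost near a triangle vertex than at the symmetry centre, is what matters).

```python
import math

# Three-view 2D triangulation example with three local minima (Kahl and
# Hartley, 2008). P0 is the camera from the example; the other two cameras
# are obtained by rotating the world by -/+120 degrees.
P0 = [[-3.0, 1.0, -8.0], [-1.0, -3.0, -6.0]]

def rot(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(len(A))]

# Camera i sees the scene pre-rotated, so its ray through the common
# measurement is the +/-120 degree rotation of the ray y = -1.
cameras = [P0,
           matmul(P0, rot(-2.0 * math.pi / 3.0)),
           matmul(P0, rot(+2.0 * math.pi / 3.0))]

def cost(u, v):
    # Sum of squared 1D reprojection residuals against the measurement 3.
    X = [u, v, 1.0]
    total = 0.0
    for P in cameras:
        num = sum(P[0][j] * X[j] for j in range(3))
        den = sum(P[1][j] * X[j] for j in range(3))
        total += (3.0 - num / den) ** 2
    return total

print(round(cost(0.0, 0.0), 3))               # symmetry centre -> 8.333
print(round(cost(math.sqrt(3.0), -1.0), 3))   # triangle vertex -> 6.25
```

By the three-fold symmetry, the same lower cost is attained near all three vertices, so the cost function cannot have a single minimum at the origin.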
5.3 Traditional Approaches
5.3.1 Linear Solution
Often, it is possible to algebraically solve for, say, the unknown 3D point in the case of triangulation by using a linear method. Let $(x_i, y_i)^\top$ be the image of the unknown point X under camera $P_i$. Then, the following relationships hold true:

$$\begin{aligned} x_i(\pi_{3i}^\top X) - (\pi_{1i}^\top X) &= 0, \\ y_i(\pi_{3i}^\top X) - (\pi_{2i}^\top X) &= 0. \end{aligned} \qquad (5.4)$$

If $m \ge 2$ views are available, these equations can be expressed in matrix form as $AX = 0$, where

$$A = \begin{pmatrix} x_1\pi_{31}^\top - \pi_{11}^\top \\ y_1\pi_{31}^\top - \pi_{21}^\top \\ \vdots \\ x_m\pi_{3m}^\top - \pi_{1m}^\top \\ y_m\pi_{3m}^\top - \pi_{2m}^\top \end{pmatrix}, \qquad (5.5)$$

and the triangulation problem can be solved by minimizing the algebraic error

$$\min_X \|AX\|. \qquad (5.6)$$
Note that X = 0 is a solution to (5.6), so to extract a meaningful solution, we need to fix the scale of X. This can be achieved by demanding that $\|X\| = 1$ or by setting the last coordinate of X to 1. The former, homogeneous version can be solved using the singular value decomposition (SVD), while the latter, inhomogeneous version represents a linear least squares problem. The inhomogeneous version, of course, precludes points at infinity from being a solution. The two versions have, in fact, quite different solution properties in the presence of noise. Note also that neither solution is projectively invariant: replacing A by AH should yield $H^{-1}X$ as the solution for a projectively invariant method, but $H^{-1}X$ will not, in general, satisfy $\|H^{-1}X\| = 1$ or have its last coordinate equal to 1.

Such solutions are often called Direct Linear Transform (DLT) methods in the literature (Hartley and Zisserman, 2004). While they are fast and applicable for a large number of views, they minimize an algebraic distance and not the geometric reprojection error in (5.3). So, the solution yielded by these methods may not correspond to geometric intuition and can vary depending on the choice of normalization. Moreover, the solution quality can degrade dramatically in the presence of noise. In practice, they are used as initialization for a nonlinear minimization routine, called bundle adjustment.
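The inhomogeneous DLT variant is short enough to sketch in full: each view contributes the two equations (5.4), and fixing the last coordinate of X to 1 turns $AX = 0$ into an ordinary least squares problem in three unknowns. The cameras and point below are made up for illustration.

```python
def dlt_triangulate(cameras, points):
    """Inhomogeneous DLT: least squares solution of A[:, :3] w = -A[:, 3]."""
    # Build the rows x_i*pi3 - pi1 and y_i*pi3 - pi2 of A (eqs. 5.4-5.5).
    rows = []
    for P, (x, y) in zip(cameras, points):
        pi1, pi2, pi3 = P
        rows.append([x * pi3[j] - pi1[j] for j in range(4)])
        rows.append([y * pi3[j] - pi2[j] for j in range(4)])
    # Normal equations M w = b for w = (U, V, W), last coordinate fixed to 1.
    M = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [-sum(r[i] * r[3] for r in rows) for i in range(3)]
    # Solve the 3x3 system by Gaussian elimination with partial pivoting.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 3):
            fac = M[r][col] / M[col][col]
            for c in range(col, 3):
                M[r][c] -= fac * M[col][c]
            b[r] -= fac * b[col]
    w = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        w[i] = (b[i] - sum(M[i][j] * w[j] for j in range(i + 1, 3))) / M[i][i]
    return w

# Two noise-free views of the (made-up) point X = (1, 2, 5):
P1 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]   # canonical camera [I | 0]
P2 = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0]]   # translated by (1, 0, 0)
X = (1.0, 2.0, 5.0)
imgs = [(X[0] / X[2], X[1] / X[2]), ((X[0] + 1) / X[2], X[1] / X[2])]
print([round(c, 6) for c in dlt_triangulate([P1, P2], imgs)])  # -> [1.0, 2.0, 5.0]
```

With noise-free measurements the algebraic and geometric errors both vanish, so the DLT recovers the point exactly; with noise, it only minimizes the algebraic error, as discussed above.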
5.3.2 Bundle Adjustment
Bundle adjustment refers to a class of local iterative optimization approaches which can be used to minimize the cost function in (5.3). While bundle adjustment is a local optimization approach, it is quite powerful due to its flexibility and incorporation of features specific to multiview geometry. It is common practice to refine the estimate of any reconstruction algorithm using bundle adjustment. The most notable bundle adjustment methods in computer vision employ variants of the Levenberg-Marquardt iterative algorithm (Levenberg, 1944; Marquardt, 1963). The reader is referred to (Triggs et al., 1999) for an exhaustive treatment of the subject. In brief, bundle adjustment seeks to estimate cameras P and points X that minimize a geometric error criterion:
$$\min_{P_i,\, X_j} \sum_{ij} d(P_i, X_j)^2. \qquad (5.7)$$
In the case of triangulation or camera resectioning, the cost has the form (5.3), where the minimization is only over either the structure variables or the cameras. But in the case of, say, projective reconstruction (Section 3.3), the minimization involves both the 3D points and the cameras. Levenberg-Marquardt is the preferred optimization routine for bundle adjustment, partly because it allows efficient incorporation of the structure of photogrammetric problems. Sparsity patterns are exploited in several ways in any state-of-the-art bundle adjustment implementation. Variable partitioning, as well as exploiting band diagonality, block structure and skyline structure, are a few ways in which problem structure is taken into account by bundle adjustment.

Bundle adjustment had already been popular in the photogrammetry community before it found a new audience and widespread application in computer vision (Brown, 1976; Granshaw, 1980; Slama, 1980). Bundle adjustment minimizes a Maximum Likelihood criterion and has the advantage of being very flexible in incorporating different kinds of variables and constraints, as well as dealing with missing data. The disadvantage, of course, is that it is very prone to getting stuck in local minima, so it requires a strong initialization to produce meaningful estimates. With this background, we turn our attention to methods that yield the globally optimal solution to the geometric error, regardless of the initialization.
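The Levenberg-Marquardt iteration at the heart of bundle adjustment can be illustrated on the smallest possible instance: refining a single 3D point, i.e., minimizing (5.7) over the structure variables only. The sketch below uses a numerical Jacobian and made-up cameras and measurements; real implementations use analytic, sparse Jacobians as discussed above.

```python
# Minimal Levenberg-Marquardt refinement of one 3D point (a toy stand-in for
# bundle adjustment restricted to structure variables). Cameras/measurements
# are made up for illustration.
cameras = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
    [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 1.0, 0.0]],
]
measurements = [(0.21, 0.40), (0.40, 0.39), (0.20, 0.61)]  # noisy images of ~(1,2,5)

def residuals(X):
    Xh = list(X) + [1.0]
    out = []
    for P, (u, v) in zip(cameras, measurements):
        den = sum(P[2][j] * Xh[j] for j in range(4))
        out.append(u - sum(P[0][j] * Xh[j] for j in range(4)) / den)
        out.append(v - sum(P[1][j] * Xh[j] for j in range(4)) / den)
    return out

def cost(X):
    return sum(r * r for r in residuals(X))

def solve3(M, b):
    # Cramer's rule for a 3x3 linear system M x = b.
    det3 = lambda m: (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                    - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                    + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    d = det3(M)
    sol = []
    for i in range(3):
        Mi = [row[:] for row in M]
        for r in range(3):
            Mi[r][i] = b[r]
        sol.append(det3(Mi) / d)
    return sol

def refine(X, iters=50, lam=1e-3):
    X = list(X)
    for _ in range(iters):
        r = residuals(X)
        h = 1e-6
        J = [[0.0] * 3 for _ in r]          # numerical Jacobian of residuals
        for j in range(3):
            Xp = X[:]
            Xp[j] += h
            rp = residuals(Xp)
            for i in range(len(r)):
                J[i][j] = (rp[i] - r[i]) / h
        JTJ = [[sum(J[i][p] * J[i][q] for i in range(len(r))) for q in range(3)]
               for p in range(3)]
        JTr = [sum(J[i][p] * r[i] for i in range(len(r))) for p in range(3)]
        for p in range(3):
            JTJ[p][p] *= 1.0 + lam          # Levenberg-Marquardt damping
        step = solve3(JTJ, JTr)
        Xnew = [X[j] - step[j] for j in range(3)]
        if cost(Xnew) < cost(X):
            X, lam = Xnew, max(lam * 0.5, 1e-9)
        else:
            lam *= 10.0                     # reject step, increase damping
    return X

X0 = [0.8, 2.3, 4.2]        # crude initialization, e.g. from a DLT
Xr = refine(X0)
print(cost(Xr) < cost(X0))  # -> True
```

The accept/reject rule with adaptive damping is what distinguishes Levenberg-Marquardt from plain Gauss-Newton; it also illustrates the limitation stressed above: the iteration only descends from wherever it starts.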
5.4 Fractional Programming
Let us begin with a brief exposition on fractional programming. In its most general form, fractional programming seeks to minimize or maximize the sum of $p \ge 1$ fractions subject to convex constraints. Our interest from the point of view of multiview geometry, however, is specific to the minimization problem

$$\min_x \sum_{i=1}^p \frac{f_i(x)}{g_i(x)} \quad \text{subject to} \quad x \in D, \qquad (5.8)$$

where $f_i : \mathbb{R}^n \to \mathbb{R}$ and $g_i : \mathbb{R}^n \to \mathbb{R}$ are convex and concave functions, respectively, and the domain $D \subset \mathbb{R}^n$ is a convex, compact set. Further, it is assumed that both $f_i$ and $g_i$ are positive with lower and upper bounds over D. Even with these restrictions the above problem is NP-complete (Freund and Jarre, 2001), but we demonstrate that practical and reliable estimation of the global optimum is still possible for the multiview problems considered, through iterative algorithms that solve an appropriate convex optimization problem at each step.

For the purposes of the development of the branch and bound algorithm, let us assume that we have available upper and lower bounds on the functions $f_i(x)$ and $g_i(x)$, denoted by the intervals $[l_i, u_i]$ and $[L_i, U_i]$, respectively. Let $Q_0$ denote the 2p-dimensional rectangle $[l_1, u_1] \times \cdots \times [l_p, u_p] \times [L_1, U_1] \times \cdots \times [L_p, U_p]$. Introducing auxiliary variables $t = (t_1, \ldots, t_p)^\top$ and $s = (s_1, \ldots, s_p)^\top$, consider the following alternate optimization problem:
$$\begin{aligned} \min_{x,t,s} \quad & \sum_{i=1}^p \frac{t_i}{s_i} \\ \text{subject to} \quad & f_i(x) \le t_i, \quad g_i(x) \ge s_i, \\ & x \in D, \quad (t, s) \in Q_0. \end{aligned} \qquad (5.9)$$
We note that the feasible set of problem (5.9) is a convex, compact set and that (5.9) is feasible if and only if (5.8) is. Indeed, the following holds true (Benson, 2002):

Theorem 9. $(x^*, t^*, s^*) \in \mathbb{R}^{n+2p}$ is a global optimal solution for (5.9) if and only if $t_i^* = f_i(x^*)$, $s_i^* = g_i(x^*)$ for $i = 1, \ldots, p$, and $x^* \in \mathbb{R}^n$ is a global optimal solution for (5.8).
Proof. See Appendix A.
Thus, Problems (5.8) and (5.9) are equivalent and henceforth we shall restrict our attention to Problem (5.9). Next, we look at the construction of a convex relaxation for Problem (5.9) that is well-adapted for use in a branch and bound algorithm.
5.4.1 Bounding
As discussed in Section 4.3.1, a useful lower bounding function must be a tight approximation to the objective function that can be efficiently computed and minimized. For the case of a fractional program, the convex envelope, which is the tightest possible convex relaxation, can be shown to satisfy these requirements. Indeed, it is shown in (Tawarmalani and Sahinidis, 2001b) that the convex envelope for a single fraction $t/s$, where $t \in [l, u]$ and $s \in [L, U]$, is given as the solution to the following Second Order Cone Program (SOCP):

$$\begin{aligned} \operatorname{convenv} \frac{t}{s} = \min_{r, r', s'} \quad & r \\ \text{subject to} \quad & r, r', s' \in \mathbb{R}, \\ & \left\| \begin{pmatrix} 2\lambda\sqrt{l} \\ r' - s' \end{pmatrix} \right\| \le r' + s', \quad \left\| \begin{pmatrix} 2(1-\lambda)\sqrt{u} \\ (r - r') - (s - s') \end{pmatrix} \right\| \le (r - r') + (s - s'), \\ & \lambda L \le s' \le \lambda U, \quad (1-\lambda)L \le s - s' \le (1-\lambda)U, \\ & r' \ge 0, \quad r - r' \ge 0, \end{aligned}$$

where we have substituted $\lambda = \frac{u - t}{u - l}$ for ease of notation, and $r, r', s'$ are auxiliary scalar variables.

It is easy to show that the convex envelope of a sum is always greater than (or equal to) the sum of the convex envelopes. That is, if $f = \sum_i t_i/s_i$, then $\operatorname{convenv} f \ge \sum_i \operatorname{convenv}(t_i/s_i)$. It follows that, in order to compute a lower bound on Problem
(5.9), one can compute the sum of convex envelopes for $t_i/s_i$ subject to the convex constraints. Hence, this way of computing a lower bound $f_{lb}(Q)$ amounts to solving a convex SOCP, which can be done efficiently (Sturm, 1999).
In summary, in order to compute a lower bound $f_{lb}(Q)$ on the rectangle $Q = [l_1, u_1] \times \cdots \times [l_p, u_p] \times [L_1, U_1] \times \cdots \times [L_p, U_p]$, the following SOCP is solved:

$$\begin{aligned} \min_{x, r, r', s, s', t} \quad & \sum_{i=1}^p r_i \\ \text{subject to} \quad & x \in \mathbb{R}^n, \quad r, r', s, s', t \in \mathbb{R}^p, \\ & \left\| \begin{pmatrix} 2\lambda_i\sqrt{l_i} \\ r_i' - s_i' \end{pmatrix} \right\| \le r_i' + s_i', \quad \left\| \begin{pmatrix} 2(1-\lambda_i)\sqrt{u_i} \\ (r_i - r_i') - (s_i - s_i') \end{pmatrix} \right\| \le (r_i - r_i') + (s_i - s_i'), \\ & \lambda_i L_i \le s_i' \le \lambda_i U_i, \quad (1-\lambda_i)L_i \le s_i - s_i' \le (1-\lambda_i)U_i, \\ & r_i' \ge 0, \quad r_i - r_i' \ge 0, \\ & l_i \le t_i \le u_i, \quad L_i \le s_i \le U_i, \\ & f_i(x) \le t_i, \quad g_i(x) \ge s_i, \qquad \text{for } i = 1, \ldots, p. \end{aligned}$$
This construction of convex envelopes can be shown to satisfy the conditions (L1) and (L2) of Section 4.3 (Benson, 2002) and is therefore well-suited for our branch and bound algorithm.
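To convey the flavour of branch and bound over a fractional objective without a conic solver, here is a dependency-free toy for a single fraction. Note that it replaces the SOCP convex envelope above with the much cruder interval bound (min f)/(max g) on each subinterval; this still underestimates the objective and shrinks as the interval shrinks, which is all the bounding step needs in one dimension. The functions and tolerance are made up for illustration.

```python
# Toy branch and bound for min f(x)/g(x) over [0, 1], with f convex and g
# concave, both positive. Lower bound on a subinterval: (min f) / (max g),
# a crude substitute for the convex envelope of Section 5.4.1.
import heapq

f = lambda x: (x - 0.8) ** 2 + 0.1   # convex, positive on [0, 1]
g = lambda x: 1.0 + x - x ** 2       # concave, positive on [0, 1]

def fmin_on(a, b):
    return f(min(max(0.8, a), b))    # convex quadratic: clamp the vertex x = 0.8

def gmax_on(a, b):
    return g(min(max(0.5, a), b))    # concave quadratic: clamp the vertex x = 0.5

def branch_and_bound(eps=1e-6):
    best_x = 0.0
    best = f(best_x) / g(best_x)
    heap = [(fmin_on(0.0, 1.0) / gmax_on(0.0, 1.0), 0.0, 1.0)]
    while heap:
        lb, a, b = heapq.heappop(heap)
        if lb > best - eps:
            continue                  # cannot improve the incumbent: prune
        m = 0.5 * (a + b)             # bisection (alpha-bisection, alpha = 0.5)
        for xa, xb in ((a, m), (m, b)):
            cand = f(xa) / g(xa)      # endpoint gives a feasible upper bound
            if cand < best:
                best, best_x = cand, xa
            heapq.heappush(heap, (fmin_on(xa, xb) / gmax_on(xa, xb), xa, xb))
    return best_x, best

bx, bv = branch_and_bound()
print(round(bv, 3))  # -> 0.086
```

At termination, every pruned interval certifies that no point in it beats the incumbent by more than eps, which is exactly the epsilon-optimality guarantee claimed for the full algorithm.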
5.5 Applications to Multiview Geometry
In this section, we elaborate on adapting the theory developed in the previous section to common problems of multiview geometry. In the standard formulation of these problems, based on the Maximum Likelihood Principle, the exact form of the objective function to be optimized depends on the choice of noise model. The noise model describes how the errors in the observations are statistically distributed given the ground truth. The most common noise model is the Gaussian distribution, which has a very thin tail, that is, the probability of a large deviation decreases to zero very rapidly. In practice, however, large errors occur more often than predicted by the Gaussian distribution, for instance, due to erroneous localization of interest points or just bad correspondences. There are two ways of getting around this problem. The first is to robustify the cost function by reducing the penalty for large deviations, and the second is to consider noise models with thicker tails (Huber, 1981). The latter choice then translates into a modified likelihood function. We will consider the Gaussian and two variants of the Laplacian noise model.

In the Gaussian noise model, assuming an isotropic distribution of error with a known standard deviation $\sigma$, the likelihood for two image points, one measured point x and one true point x', is

$$p(x \mid x') = (2\pi\sigma^2)^{-1} \exp\!\left( -\|x - x'\|_2^2 / (2\sigma^2) \right), \qquad (5.10)$$

where $\|\cdot\|_p$ stands for the p-norm. Thus, maximizing the likelihood, assuming iid noise, is equivalent to minimizing $\sum_i \|x_i - x_i'\|_2^2$, which we interpret as a combination of two vector norms: the first for the point-wise error in the image and the second that accumulates the point-wise errors. We call this the (L2, L2)-formulation.
Table 5.1: Different cost functions of reprojection errors.

    Noise model:  Gaussian                       Laplacian I                  Laplacian II
    Cost:         $\sum_i \|x_i - x_i'\|_2^2$    $\sum_i \|x_i - x_i'\|_2$    $\sum_i \|x_i - x_i'\|_1$
    Notation:     (L2, L2)                       (L2, L1)                     (L1, L1)
The exact definition of the Laplace noise model depends on the particular definition of the multivariate Laplace distribution adopted (Kotz et al., 2001). In the current work, we choose two of the simpler definitions. The first one is a special case of the multivariate exponential power distribution, giving us the likelihood function:

$$p(x \mid x') = (2\pi\sigma)^{-1} \exp\!\left( -\|x - x'\|_2 / \sigma \right). \qquad (5.11)$$
An alternative view of the bivariate Laplace distribution is to consider it as the joint distribution of two iid univariate Laplace random variables, where $x = (u, v)^\top$ and $x' = (u', v')^\top$, which gives us the following likelihood function:

$$p(x \mid x') = \frac{1}{2\sigma} e^{-|u - u'|/\sigma} \cdot \frac{1}{2\sigma} e^{-|v - v'|/\sigma} = (4\sigma^2)^{-1} \exp\!\left( -\|x - x'\|_1 / \sigma \right). \qquad (5.12)$$
Maximizing the likelihoods in (5.11) and (5.12) is equivalent to minimizing $\sum_i \|x_i - x_i'\|_2$ and $\sum_i \|x_i - x_i'\|_1$, respectively. Again, in our interpretation of these expressions as a combination of two vector norms, we denote these minimizations as (L2, L1) and (L1, L1), respectively.

We summarize the classification of the overall error under the various noise models in Table 5.1. In this notation, the (L2, L∞)-case of these problems has recently been solved in polynomial time (Kahl, 2005).
5.5.1 Triangulation
The primary concern in triangulation is to recover the 3D scene point given measured image points and known camera matrices in $N \ge 2$ views. Let $P = [\pi_1\ \pi_2\ \pi_3]^\top$ denote the 3 × 4 camera matrix, where each $\pi_i$ is a 4-vector, let $(u, v)^\top$ be image coordinates, and let $X = (U, V, W, 1)^\top$ be the extended 3D point coordinates. Then the reprojection residual vector for this image is given by

$$r = \left( u - \frac{\pi_1^\top X}{\pi_3^\top X},\; v - \frac{\pi_2^\top X}{\pi_3^\top X} \right)^\top \qquad (5.13)$$

and hence the objective function to minimize becomes $\sum_{i=1}^N \|r_i\|_p^q$ for the (Lp, Lq)-case. In addition, one can require that $\pi_3^\top X > 0$, which corresponds to the 3D point being in front of the camera. We now show that by defining $\|r\|_p^q$ as an appropriate ratio $f/g$ of a convex function f and a concave function g, the problem in (5.13) can be identified with the one in (5.9).
(L2, L2). The norm-squared residual of r can be written as

$$\|r\|_2^2 = \frac{(a^\top X)^2 + (b^\top X)^2}{(\pi_3^\top X)^2}, \qquad (5.14)$$

where a, b are 4-vectors dependent on the known image coordinates and the known camera matrix. By setting

$$f = \frac{(a^\top X)^2 + (b^\top X)^2}{\pi_3^\top X} \quad \text{and} \quad g = \pi_3^\top X, \qquad (5.15)$$

a convex-concave ratio is obtained. It is straightforward to verify the convexity of f via the convexity of its epigraph:

$$\operatorname{epi} f = \{ (X, t) \mid t \ge f(X) \} = \left\{ (X, t) \;\middle|\; \tfrac{1}{2}(t + \pi_3^\top X) \ge \left\| \left( a^\top X,\; b^\top X,\; \tfrac{1}{2}(t - \pi_3^\top X) \right) \right\| \right\},$$

which is a second-order convex cone (Boyd and Vandenberghe, 2004).

(L2, L1). Similar to the (L2, L2)-case, the norm of r can be written $\|r\|_2 = f/g$, where $f = \sqrt{(a^\top X)^2 + (b^\top X)^2}$ and $g = \pi_3^\top X$. Again, the convexity of f can be established by noting that the epigraph $\operatorname{epi} f = \{ (X, t) \mid t \ge \|(a^\top X, b^\top X)\| \}$ is a second-order cone.

(L1, L1). Using the same notation as above, the L1-norm of r is given by $\|r\|_1 = f/g$, where $f = |a^\top X| + |b^\top X|$ and $g = \pi_3^\top X$.

In all the cases above, g is trivially concave since it is linear in X.
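The decomposition is easy to sanity-check numerically. Writing out (5.13), the natural choices are $a = u\pi_3 - \pi_1$ and $b = v\pi_3 - \pi_2$, so that $a^\top X / \pi_3^\top X$ is exactly the u-residual. The sketch below verifies that the ratio f/g from (5.15) reproduces the squared residual norm; the camera, point and measurement are made up for illustration.

```python
# Numerical check of the (L2, L2) decomposition ||r||_2^2 = f/g from
# (5.14)-(5.15), with a = u*pi3 - pi1 and b = v*pi3 - pi2.
pi1 = [1.0, 0.0, 0.0, 0.0]
pi2 = [0.0, 1.0, 0.0, 0.0]
pi3 = [0.0, 0.0, 1.0, 0.1]
X = [0.3, -0.7, 4.0, 1.0]     # homogeneous 3D point, in front: pi3.X > 0
u, v = 0.2, -0.1              # measured image point (made up)

dot = lambda p, q: sum(a_ * b_ for a_, b_ in zip(p, q))

# Residual computed directly from (5.13).
ru = u - dot(pi1, X) / dot(pi3, X)
rv = v - dot(pi2, X) / dot(pi3, X)
direct = ru * ru + rv * rv

# Residual computed as the convex-concave ratio f/g of (5.15).
a = [u * p3 - p1 for p1, p3 in zip(pi1, pi3)]
b = [v * p3 - p2 for p2, p3 in zip(pi2, pi3)]
f = (dot(a, X) ** 2 + dot(b, X) ** 2) / dot(pi3, X)
g = dot(pi3, X)
print(abs(direct - f / g) < 1e-12)  # -> True
```

The same a, b vectors serve all three norm combinations; only the way f aggregates $a^\top X$ and $b^\top X$ changes.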
5.5.2 Camera Resectioning
The problem of camera resectioning is the analogous counterpart of triangulation, whereby the aim is to recover the camera matrix given the 3D coordinates of $N \ge 6$ scene points and their corresponding images. The main difference compared to the triangulation problem is that the number of degrees of freedom has increased from 3 to 11. Let $\pi = (\pi_1^\top, \pi_2^\top, \pi_3^\top)^\top$ be the homogeneous 12-vector of the unknown elements of the camera matrix P. Now, the squared norm of the residual vector r in (5.13) can be rewritten in the form

$$\|r\|_2^2 = \frac{(a^\top \pi)^2 + (b^\top \pi)^2}{(X^\top \pi_3)^2},$$

where a, b are 12-vectors determined by the coordinates of the image point x and the scene point X. Recalling the derivations for the (L2, L2)-case of triangulation, it follows that $\|r\|_2^2$ can be written as a fraction $f/g$ with $f = ((a^\top \pi)^2 + (b^\top \pi)^2)/(X^\top \pi_3)$, which is convex, and $g = X^\top \pi_3$, which is concave, in accordance with Problem (5.9). Similar derivations show that the same is true for camera resectioning with the (L2, L1)-norm as well as the (L1, L1)-norm.
5.5.3 Projections from Pn to Pm
Our formulation for the camera resectioning problem is very general and is not restricted by the dimensionality of the world or image points. Thus, camera resectioning can be viewed as a special case of a $\mathbb{P}^n \mapsto \mathbb{P}^m$ projection with n = 3 and m = 2.

When m = n, the mapping is called a homography. Typical applications include the estimation of homographies mapping planar scene points to the image plane, or inter-image homographies (m = n = 2), as well as the estimation of 3D homographies relating different coordinate systems (m = n = 3). For projections (n > m), camera resectioning is the most common application, but numerous other instances appear in the computer vision field (Wolf and Shashua, 2002).
5.6 Multiview Fractional Programming
In this section, we present some important aspects of our implementation, which extend the traditional methods for solving fractional programs by exploiting properties specific to the structure of multiview geometry problems. In fact, these developments form the basis for the excellent convergence rates our implementation achieves, as opposed to the exponential search in several dimensions that a naïve implementation of existing fractional programming techniques results in.
5.6.1 Bounds Propagation
Consider a fractional program with k fractions. Traditional approaches to fractional programming require branching in at least k dimensions, corresponding to the denominators, for the algorithm to converge correctly. For a triangulation problem, k is the number of cameras, and for a resectioning problem, it is the number of points. A branching dimension in a traditional branch and bound algorithm is the denominator of the reprojection error term corresponding to each point (for resectioning) or camera (for triangulation). It is evident that the search space of a branch and bound algorithm that branches in k dimensions can be untenably large even for medium-sized problems. Contemporary literature (Schaible and Shi, 2003) documents reasonable results for practical problems with k at most 10 to 12.

However, we can do much better with the realization that, for all the problems presented in Section 5.5, the denominator is a linear function in the unknowns. To elucidate the concept, let us assume the problem is one of triangulating the location of a (homogeneous) point $X = (U, V, 1)^\top$, so that the branching entity (the denominator $g(X)$) is a linear function in the two variables U and V. Please refer to Figure 5.2 for an illustration. Each bounding constraint restricts the point to lie in a particular half-plane of $\mathbb{R}^2$; thus, a pair of lower and upper bounds on two linearly independent denominators $g_1$ and $g_2$ restricts the feasible values to a convex quadrilateral in the 2D plane. Further, U and V are linear in $g_1$ and $g_2$, and so are the denominators of all the other fractions in the triangulation problem, corresponding to views $3, \ldots, k$. So, the convex polygon that represents the bounds on the denominators $g_1$ and $g_2$ induces bounds on the denominators of all the fractions in the triangulation problem.
Figure 5.2: The red lines indicate the lower and upper bounds on the denominator g1 while the blue lines indicate bounds on the denominator g2. The shaded gray region represents the induced bounds on the variables U and V . Any linear function of U and V restricted to the domain represented by the gray polygon will attain its extremal values at two vertices of this simplex, as illustrated by the thick black points for some linear function g3(U, V ) represented by the green lines.
Extending the analogy to the case of triangulation in three dimensions, the unknown point coordinates $X = (U, V, W, 1)^\top$ are linear in $g_i(X) = \pi_{3i}^\top X$ for $i = 1, \ldots, k$. Suppose $k > 3$ and bounds are given on three denominators, say $g_1, g_2, g_3$, which are not linearly dependent. These bounds then define a convex polytope in $\mathbb{R}^3$. This polytope constrains the possible values of U, V and W, which in turn induce bounds on the other denominators $g_4, \ldots, g_k$. The bounds can be obtained by solving a set of small linear programs each time branching is performed:

$$\min / \max \; g_i(x) \quad \text{subject to} \quad L_j \le g_j(x) \le U_j, \; j = 1, 2, 3, \qquad \text{for } i = 1, \ldots, k. \qquad \text{(R1)}$$
Thus, it is sufficient to branch on three dimensions in the case of triangulation. Similarly, in the case of camera resectioning, the denominator has only three degrees of freedom and, more generally, for projections $\mathbb{P}^n \mapsto \mathbb{P}^m$, the denominator has n degrees of freedom.

The choice of the n dimensions to be used for bounds propagation is, in our opinion, a matter of implementation preference. A simple heuristic is to branch on the dimension along which the rectangle to be split is the widest, and to incorporate that dimension as one of the n dimensions used in the subsequent step of propagating bounds. This can, in principle, avoid issues with committing once and for all to some choice of n particular denominators as the ones branched upon, such as when two or more faces of the bounding constraint polytope are nearly parallel. However, in both our synthetic and real experiments, we have observed no such (numerical) instabilities.

As a practical note, we must point out that as the number of fractions increases, bounds propagation becomes the time-critical step of the algorithm. However, the gains accrued from the reduced dimensionality of the search space more than outweigh any cost involved in solving the large LP which constitutes the bounds propagation step.
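In the 2D setting of Figure 5.2, the propagation step can be sketched without a general LP solver: since every $g_i$ is linear in (U, V), its extrema over the quadrilateral defined by the bounds on $g_1$ and $g_2$ are attained at the quadrilateral's vertices, exactly as the figure caption observes. The coefficients and bounds below are made up for illustration.

```python
# Bounds propagation in 2D: bounds on two linearly independent linear
# denominators g1, g2 induce bounds on a further linear denominator g3.
# Each g is stored as (a, b, c) meaning g(U, V) = a*U + b*V + c.
g1 = (1.0, 0.2, 0.5)
g2 = (0.1, 1.0, 0.3)
g3 = (0.7, -0.4, 1.0)

bounds1 = (1.0, 2.0)   # L1 <= g1(U, V) <= U1
bounds2 = (0.5, 1.5)   # L2 <= g2(U, V) <= U2

def intersect(gA, valA, gB, valB):
    # Solve gA(U, V) = valA, gB(U, V) = valB (a 2x2 linear system).
    a1, b1, c1 = gA
    a2, b2, c2 = gB
    det = a1 * b2 - a2 * b1
    U = ((valA - c1) * b2 - (valB - c2) * b1) / det
    V = (a1 * (valB - c2) - a2 * (valA - c1)) / det
    return U, V

# The four vertices of the feasible quadrilateral (the shaded region of
# Figure 5.2), one per combination of active bounds.
verts = [intersect(g1, v1, g2, v2) for v1 in bounds1 for v2 in bounds2]

lin = lambda g, p: g[0] * p[0] + g[1] * p[1] + g[2]
lo3 = min(lin(g3, p) for p in verts)
hi3 = max(lin(g3, p) for p in verts)
print(round(lo3, 3), round(hi3, 3))  # -> 0.716 2.022
```

For the full 3D algorithm, with three base denominators and arbitrary extra constraints, the same extrema are found by the small linear programs of (R1) rather than by vertex enumeration.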
5.6.2 Initialization
Besides bounds propagation, another component of the algorithm crucial to a rapid convergence is the initialization. In the construction of the algorithm, we assumed that initial bounds are available on the numerator and the denominator of each of the
2k fractions. This initial rectangle Q0 in R is the starting point for the branch and bound algorithm. It is clear that the size of this initial search region will affect the runtime of the search algorithm. However it is not clear how the user should specify the bounds that 107 define the initial region, especially since they depend on the problem geometry and are not straightforward to guess intuitively. What is intuitive, however, is the notion of reprojection error (in pixels) and it is easy for the user to specify a reasonable upper bound on the worst reprojection error. This upper bound can then be used to construct bounds on the numerator and denominator by solving a set of simple optimization problems. Let γ be an upper bound on the reprojection error in pixels (specified by the user), then we can bound the denominators gi(x) by solving the following set of 2k optimization problems:
for i = 1, . . . , k,

    min g_i(x)  and  max g_i(x)
    subject to  f_j(x) / g_j(x) ≤ γ,   j = 1, . . . , k.     (5.16)
Depending on the choice of error norm, the above optimization problems will be instances of linear programming (for L1-L1) or quadratic programming (for L2-L1 and L2-L2). We will call this γ-initialization. If the user-specified reprojection error is too small to lead to a feasible solution, or so large that the SOCP solver is mired in numerical errors, the algorithm defaults to initial bounds which are wide enough for usual problem scales and known to be small enough to be numerically stable. This situation arises sometimes in our experiments, but we have found that the search space shrinks rapidly even with extremely liberal default values for the initial bounds. As a further note on the implementation, while tight bounds on the denominators are crucial for the performance of the overall algorithm, the bounds on the numerators are not. Therefore, we set the numerator bounds to preset values.
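For the L1-L1 case, where each f_j is the absolute value of an affine function and each g_j is affine, the 2k problems in (5.16) are plain linear programs: using g_j(x) > 0, the constraint f_j(x)/g_j(x) ≤ γ expands into two linear inequalities. A schematic Python sketch with scipy (the toy data below is made up; our actual implementation uses Matlab/SeDuMi):

```python
import numpy as np
from scipy.optimize import linprog

def denominator_bounds(U, b, W, c, gamma):
    """Bound each denominator g_i(x) = w_i.x + c_i subject to
    |u_j.x + b_j| / (w_j.x + c_j) <= gamma for all j (L1-L1 case).

    With g_j(x) > 0 the fractional constraint expands to
        +(u_j.x + b_j) - gamma*(w_j.x + c_j) <= 0
        -(u_j.x + b_j) - gamma*(w_j.x + c_j) <= 0
    so each bound is a small LP.
    """
    n = U.shape[1]
    A_ub = np.vstack([U - gamma * W, -U - gamma * W])
    b_ub = np.concatenate([gamma * c - b, gamma * c + b])
    box = [(None, None)] * n
    bnds = []
    for w_i, c_i in zip(W, c):
        lo = linprog(w_i, A_ub=A_ub, b_ub=b_ub, bounds=box)
        hi = linprog(-w_i, A_ub=A_ub, b_ub=b_ub, bounds=box)
        bnds.append((lo.fun + c_i, -hi.fun + c_i))
    return bnds

# Two toy fractions in one variable: |x|/(x+2) and |x-1|/(4-x), gamma = 0.5.
U = np.array([[1.0], [1.0]]); b = np.array([0.0, -1.0])
W = np.array([[1.0], [-1.0]]); c = np.array([2.0, 4.0])
bnds = denominator_bounds(U, b, W, c, 0.5)   # feasible set is x in [-2/3, 2]
```

Here the feasible set for x is the interval [-2/3, 2], so the denominators are bounded to [4/3, 4] and [2, 14/3], respectively.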
5.6.3 Coordinate System Independence
All three error norms (see Table 5.1) are independent of the coordinate system chosen for the scene (or source) points. In the image, one can translate and scale the points without affecting the norms. For all problem instances and all three error norms considered, the coordinate system can be chosen such that the first denominator g1 is a constant equal to one. Thus, there is no need to approximate the first term in the cost function with a convex envelope, since it is a convex function already.
5.7 Experiments
Both triangulation and estimation of projections P^n → P^m have been implemented for all three error norms in Table 5.1 in the Matlab environment using the convex solver SeDuMi (Sturm, 1999) and the code is publicly available1. The optimization is based on the branch and bound procedure as described in Algorithm 1 and α-bisection (see Algorithm 2) with α = 0.5. To compute the initial bounds, γ-initialization is used (see Section 5.6.2) with γ = 15 pixels for both real and synthetic data. The branch and bound terminates when the difference between the global optimum and the underestimator is less than ε = 0.05. In all experiments, the Root Mean Square (RMS) errors of the reprojection residuals are reported regardless of the computation method. In addition to the methods based on fractional programming, the results are also compared to those of bundle adjustment initialized with a linear method (Hartley and Zisserman, 2004).
5.7.1 Synthetic Data
We demonstrate various aspects of our algorithm, such as scalability, runtime and termination, using extensive simulations on synthetic data. Our data is generated by creating random 3D points within the cube [−1, 1]^3 and then projecting them to the images. The image coordinates are corrupted with iid Gaussian noise with different levels of variance. In all graphs, the average over 200 trials is plotted. In the first experiment, we employ a weak camera geometry for triangulation, whereby three cameras are placed along a line at distances 5, 6 and 7 units, respectively,
1See http://www.maths.lth.se/matematiklth/personal/fredrik/download.html.
Figure 5.3: Triangulation with forward motion. Figure (a) compares the reprojection error of the three algorithms with bundle adjustment. Note the degradation in performance of bundle adjustment with increasing noise in the image, further demonstrated in Figure (b), which plots the mean 3D error for the four algorithms.

from the origin. In Figures 5.3(a) and (b), the reprojection errors and the 3D errors are plotted, respectively. The (L2,L2) method, on average, results in a much lower error than bundle adjustment, which can be attributed to bundle adjustment being enmeshed in local minima due to the non-convexity of the problem. The graph in Figure 5.4 depicts the percentage of trials in which (L2,L2) outperforms bundle adjustment in accuracy. It is evident that the higher the noise level, the more likely it is that the bundle adjustment method does not attain the global optimum. In the next experiment, we simulate outliers in the data in the following manner. Varying numbers of cameras, placed 10° apart and viewing toward the origin, are generated in a circular motion of radius 2 units. In addition to Gaussian noise with standard deviation 0.01 pixels for all image points, the coordinates of one of the image points have been perturbed by adding or subtracting 0.1 pixels. This point may be regarded as an outlier. As can be seen from Figures 5.5(a) and (b), the reprojection errors are lowest for the (L2,L2) and bundle methods, as expected. However, in terms of 3D error, the L1 methods perform best and already from two cameras one gets a reasonable estimate of the scene point.
Figure 5.4: For triangulation with forward motion, this figure shows the percentage of trials in which the (L2,L2) algorithm found a better solution than bundle adjustment.
In the third experiment, six 3D points in general position are used to compute the camera matrix. Note that this is a minimal case, as it is not possible to compute the camera matrix from five points. The true camera location is at a distance of two units from the origin. The reprojection errors are graphed in Figure 5.6. Results for bundle adjustment and the (L2,L2) method are identical and thus, the likelihood of local minima is low. No errors on the estimated quantities are given, since it is not meaningful to compare (homogeneous) camera matrices. To demonstrate scalability, Table 5.2 reports the runtime of our algorithm over a variety of problem sizes for resectioning. The tolerance, ε, is set to within 1 percent of the global optimum, the maximum number of iterations to 500, and mean and median runtimes are reported over 200 trials. As can be seen, both median and mean runtimes scale almost linearly with the size of the problem, making this an attractive algorithm for use in practice. Finally, we demonstrate the effect of the optimality tolerance, ε, on the time it takes the branch and bound algorithm to converge. Five cameras were used for the triangulation experiment, placed in a circular arc of radius 1, looking towards the origin,
Figure 5.5: (a) and (b) show reprojection and 3D errors, respectively, for triangulation with one outlier. Despite a higher reprojection error, the L1 algorithms outperform bundle adjustment in terms of 3D error.
Figure 5.6: Reprojection errors for camera resectioning.
Table 5.2: Mean and median runtimes (in seconds) for the three algorithms as the number of points for a resectioning problem is increased. MI is the percentage of trials in which the algorithm reached 500 iterations.
Points   (L2,L2)                (L2,L1)                (L1,L1)
         Mean    Median  MI     Mean    Median  MI     Mean    Median  MI
6        42.8    35.5    0.5    41.6    31.5    1.5    7.9     4.7     0.0
10       51.8    41.9    0.5    105.8   66.6    3.5    20.3    13.5    0.5
20       72.7    50.5    2.5    210.2   121.2   9.0    46.8    28.2    1.0
50       145.5   86.5    4.5    457.9   278.3   8.5    143.0   75.9    2.5
70       172.5   107.8   3.5    616.5   368.7   7.5    173.0   102.8   1.5
100      246.2   148.5   4.5    728.7   472.4   4.0    242.3   133.6   2.0
with an angular separation of 10° between adjacent cameras. The points to be triangulated are generated in the cube [−0.5, 0.5]^3 and Gaussian noise of standard deviation 1% of the image size is added to the image coordinates. Six points in general position are used for the resectioning experiments, with similar additive noise. The mean and median numbers of iterations over 200 trials for the triangulation and resectioning experiments are recorded in Figure 5.7 as ε is varied from 0.001 to 0.1. The dependence of the convergence time on ε is exponential. This is expected, since for a given initial region and fixed number of branching dimensions, d, reducing ε by half increases the discrete volume by a factor of 2^d. Note that a value of ε below 0.01 is, for all practical purposes, too stringent. In all our experiments, a value of ε between 0.01 and 0.1 suffices, and in that range, the exponential behavior is not significantly pronounced.
5.7.2 Real Data
We have evaluated the performance on two publicly available data sets as well: the dinosaur and the corridor sequences. In Table 5.3, the reprojection errors are given for (1) triangulation of all 3D points given pre-computed camera motion and (2) resectioning of cameras given pre-computed 3D points. Both the mean error and the estimated standard deviation are given. There is no difference between the bundle adjustment and
Figure 5.7: (a) and (b) show trends for the mean and median number of iterations, respectively, over 200 trials, for termination of the triangulation algorithm as the optimality tolerance, ε, is varied from 0.001 to 0.1. (c) and (d) show the same for the resectioning experiment.
Table 5.3: Reprojection errors (in pixels) for triangulation and resectioning in the Dinosaur and Corridor data sets. "Dinosaur" has 36 turntable images with 324 tracked points, while "Corridor" has 11 images in forward motion with 737 points. A ∗ denotes triangulation experiments and a † denotes resectioning ones.
Experiment   Bundle        (L2,L2)       (L2,L1)       (L1,L1)
             Mean   Std    Mean   Std    Mean   Std    Mean   Std
Dino∗        0.30   0.14   0.30   0.14   0.18   0.09   0.22   0.11
Corridor∗    0.21   0.16   0.21   0.16   0.13   0.13   0.15   0.12
Dino†        0.33   0.04   0.33   0.04   0.34   0.04   0.34   0.04
Corridor†    0.28   0.05   0.28   0.05   0.28   0.05   0.28   0.05
Table 5.4: Number of branch and bound iterations for triangulation and resectioning on the Dinosaur and Corridor datasets. More parameters are estimated for resectioning, but the main reason for the difference in performance between triangulation and resectioning is that several hundred points are visible to each camera for the latter. A ∗ denotes triangulation experiments and a † denotes resectioning ones.
Experiment   (L2,L2)       (L2,L1)       (L1,L1)
             Mean   Std    Mean   Std    Mean   Std
Dino∗        1.2    1.5    1.0    0.2    6.7    3.4
Corridor∗    8.9    9.4    27.4   26.3   25.9   27.4
Dino†        49.8   40.1   84.4   53.4   54.9   42.9
Corridor†    39.9   2.9    49.2   20.6   47.9   7.9
the (L2,L2) method. Thus, for these particular sequences, bundle adjustment did not get trapped in any local optimum. The L1 methods also result in low reprojection errors as measured by the RMS criterion. More interesting, perhaps, are the number of iterations and execution times on a standard PC (3 GHz); see Tables 5.4 and 5.5, respectively. We must point out that the implementations are (unoptimized) MATLAB functions. The differences in iterations and runtimes are most likely due to the setup: the dinosaur sequence has a circular camera motion and thereby a more well-posed camera geometry compared to the forward-moving camera in the corridor sequence. In the case of triangulation, a point is typically visible in only a few frames, while several hundred points may be visible in each view for the resectioning experiments.
Table 5.5: Triangulation and resectioning runtimes (in seconds) for experiments on real datasets. A denotes triangulation experiments and a denotes resectioning ones. ∗ †
Experiment   Bundle        (L2,L2)          (L2,L1)          (L1,L1)
             Mean   Std    Mean     Std     Mean     Std     Mean     Std
Dino∗        1.0    0.4    5.5      4.5     12.1     4.0     17.0     9.9
Corridor∗    1.0    0.6    18.7     17.7    51.0     47.3    46.4     51.6
Dino†        4.0    3.0    273.1    192.3   640.0    554.1   312.8    304.9
Corridor†    38.3   15.7   1433.5   348.0   1271.6   608.1   1122.7   565.0
5.8 Discussions
In this chapter, we have demonstrated that several problems in multiview geometry can be formulated within the unified framework of fractional programming, in a form amenable to global optimization. A branch and bound algorithm is proposed that provably finds a solution arbitrarily close to the global optimum, with a fast convergence rate in practice. Besides minimizing reprojection error under Gaussian noise, our framework allows the incorporation of robust L1 norms, reducing sensitivity to outliers. Two improvements that exploit the underlying problem structure and are critical for expeditious convergence are branching in a small, constant number of dimensions and bounds propagation. It is inevitable that our solution times be compared with those of bundle adjustment, but we must point out that it is the production of a certificate of optimality that forms the most significant portion of our algorithm's runtime. In fact, it is our empirical observation that the optimal point ultimately reported by the branch and bound is usually obtained within the first few iterations. A distinction must also be made between the accuracy of a solution and the optimality guarantee associated with it. An optimality criterion of, say, ε = 0.05 is only a worst case bound and does not necessarily mean a solution 5% away from optimal. Indeed, as evidenced by our experiments, our solutions consistently equal or better those of bundle adjustment in accuracy. Thus, from a practitioner's viewpoint, it is useful to set a looser criterion for global optimality and use gradient descent in the neighborhood of the resulting solution.
Needless to say, other segments of the computer vision community can also benefit from our approach, as it is general enough to be applicable to any problem formulated as a fractional program in a few independent dimensions. Another avenue for potential future work is the exploration of other algorithms for achieving global optimality in specialized fractional programs; as faster and more reliable such algorithms are designed, we can anticipate corresponding improvements in our solution times. The most significant portions of this chapter are based on "Practical Global Optimization for Multiview Geometry", by F. Kahl, S. Agarwal, M. K. Chandraker, D. J. Kriegman and S. Belongie, as it appears in (Kahl et al., 2008) and (Agarwal et al., 2006).

Chapter 6
Stratified Autocalibration
“To see a world in a grain of sand And heaven in a wild flower Hold infinity in the palm of your hand And eternity in an hour.”
William Blake (English poet, 1757-1827), Auguries of Innocence
In this chapter, we seek to extend the global optimization framework of Chapter 5 to tackle the more difficult problem of autocalibration. Recall from Section 1.2.4 that autocalibration seeks to estimate the plane at infinity and the dual image of the absolute conic in order to upgrade a projective reconstruction to a metric one. As outlined in Section 3.4, a stratified approach to autocalibration estimates the plane at infinity to upgrade the projective reconstruction to an affine one; subsequently, estimating the DIAC yields the metric upgrade. This chapter proposes practical algorithms that provably solve well-known formulations of both the affine and metric upgrade stages of stratified autocalibration to their global optimum. At this stage, we suggest a brief perusal of Sections 2.3 and 2.5 to a reader who wishes to get acquainted with the projective geometry background required for this chapter.
6.1 Introduction
Given feature correspondences across n views of a scene, it is well-known that a projective reconstruction may be computed that differs from the true scene by an arbitrary 4 × 4 linear transformation, or homography (Faugeras, 1992; Hartley et al., 1992). A projective reconstruction may be upgraded to a metric one, which differs from the true scene by a similarity transformation, using a priori knowledge of a few scene characteristics, such as the angles between a few 3D lines. An alternative approach to estimating the 4 × 4 transformation that restores the metric scene is through simple assumptions on the internal parameters of the cameras used to image the scene, such as their constancy across different views, or the rectangularity of the image pixels. The latter approach is called autocalibration, or camera self-calibration, which forms the subject of this chapter. A typical approach to calibrating a camera involves using several images of a known calibration grid. Once a correspondence can be ascertained between scene points (or higher order features like curves) and their counterparts on the image plane, it is straightforward to recover the camera parameters. The term autocalibration stems from its key premise that it obviates the requirement for an explicit calibration grid. Instead, it tries to locate the image of the so-called absolute conic, which is an imaginary object on the plane at infinity, whose location stays fixed in any metric reconstruction. Its image can be shown to be related to the internal parameters of the camera, so locating the image of the absolute conic is equivalent to calibrating the camera. The method of autocalibration presented in this chapter is a stratified one, whose first step upgrades the projective reconstruction to an affine one, while the next step performs the upgrade to a metric reconstruction.
An affine reconstruction restores certain aspects of the scene, such as parallelism between 3D lines, while the metric reconstruction restores characteristics such as exact angles and length ratios. The affine upgrade, which is arguably the more difficult step in stratified autocalibration, is succinctly computable by estimating the position of the plane at infinity in a projective reconstruction, for instance, by solving the modulus constraints (Pollefeys and van Gool, 1999). Previous approaches to minimizing the modulus constraints for several views rely on local, gradient-based methods with random reinitializations. These methods are not guaranteed to perform well for such non-convex problems. Moreover, in our experience, a highly accurate estimate of the plane at infinity is imperative to obtain a usable metric reconstruction. The metric upgrade step involves estimating the intrinsic parameters of the cameras, which is commonly approached by estimating the dual image of the absolute conic (DIAC). A variety of linear methods exist towards this end; however, they are known to perform poorly in the presence of noise (Hartley and Zisserman, 2004). Perhaps more significantly, most methods impose the positive semi-definiteness of the DIAC a posteriori, which might lead to a spurious calibration. Thus, it is important to impose the positive semidefiniteness of the DIAC within the optimization, not as a post-processing step. This chapter proposes global minimization algorithms for both stages of stratified autocalibration that furnish theoretical certificates of optimality. That is, they return a solution at most ε away from the global minimum, for arbitrarily small ε. Our solution approach relies on constructing efficiently minimizable, tight convex relaxations to non-convex programs and using them in a branch and bound framework (Horst and Tuy, 2006; Tawarmalani and Sahinidis, 2002).
A significant drawback of local methods is that they critically depend on the quality of a heuristic initialization. To be considered truly optimal, an algorithm must converge to the global optimum regardless of the choice of initialization. Branch and bound methods require a demarcated region of the search space as initialization. Arbitrarily choosing a small initial region might compromise optimality, since the true solution might lie outside that chosen region. On the other hand, choosing a very large region might lead to a ponderous convergence rate for the branch and bound algorithm. In this chapter, we use chirality constraints derived from the scene to compute a theoretically correct initial search space for the plane at infinity, within which we are guaranteed to find the global minimum (Hartley, 1998a; Hartley et al., 1999). In practice, for a moderate number of cameras, the initial region determined by the chirality constraints is tight enough to allow rapid convergence of the search algorithm. Our initial region for the metric upgrade is intuitively specifiable as conditions on the intrinsic parameters of the camera and can be wide enough to include any practical case. A crucial concern in branch and bound algorithms is the exponential dependence of the worst case time complexity on the number of branching dimensions. The number of branching dimensions in most computer vision problems scales with the number of points and views, which can quickly translate into an impractical branch and bound search. In this chapter, we exploit the inherent problem structure of autocalibration to restrict our branching dimensions to a small, fixed number, independent of the number of views. In our experiments, this allows the runtime of the algorithms proposed in this chapter to scale gracefully with the number of views. In summary, the main contributions of this chapter are the following:
• Highly accurate recovery of the plane at infinity in a projective reconstruction by global minimization of the modulus constraints.

• Highly accurate estimation of the DIAC by globally solving the infinite homography relation.

• A general exposition on novel convexification methods for global optimization of non-convex programs.
The outline of the rest of this chapter is as follows. Section 6.2 describes background relevant to autocalibration and Section 6.3 outlines related prior work. Section 6.4.1 describes the general strategy that we employ for constructing the convex relaxation of a non-convex function, while Sections 6.5 and 6.6 describe our global optimization algorithms for estimating the plane at infinity and the DIAC, respectively. Section 6.7 presents experiments on synthetic and real data and Section 6.8 concludes with a discussion of further extensions.
6.2 Background
As is our convention, unless stated otherwise, we will denote 3D world points X by homogeneous 4-vectors and 2D image points x by homogeneous 3-vectors. Recall that, given the images of n points in m views, a projective reconstruction {P, X} computes the Euclidean scene {P̂, X̂} up to a 4 × 4 homography:

    P_i = P̂_i H^{-1},   i = 1, . . . , m
    X_j = H X̂_j,        j = 1, . . . , n     (6.1)

where P and P̂ denote the 3 × 4 projective and Euclidean camera matrices, respectively. Given a projective reconstruction, autocalibration seeks to estimate the best homography H that upgrades the reconstruction to a metric one. A more detailed discussion of the material in this section can be found in (Hartley and Zisserman, 2004).
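The ambiguity in (6.1) is easy to check numerically: replacing {P̂, X̂} by {P̂ H^{-1}, H X̂} leaves every image projection unchanged. A minimal Python check (the camera, point and homography values below are arbitrary, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# A Euclidean camera and a homogeneous scene point (illustrative values).
P_hat = np.hstack([np.eye(3), rng.normal(size=(3, 1))])   # 3x4 camera [I | t]
X_hat = np.append(rng.normal(size=3), 1.0)                # homogeneous 4-vector

# An arbitrary invertible 4x4 homography.
H = rng.normal(size=(4, 4)) + 4 * np.eye(4)

# Transformed reconstruction: P = P_hat H^{-1}, X = H X_hat.
P = P_hat @ np.linalg.inv(H)
X = H @ X_hat

x1 = P_hat @ X_hat
x2 = P @ X
# The two image points agree (up to numerical round-off),
# so the images cannot distinguish the two reconstructions.
assert np.allclose(x1, x2)
```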
6.2.1 The Infinite Homography Relation
The Euclidean camera is parametrized as P̂ = K [R | t], where the 3 × 3 rotation matrix R and the 3 × 1 translation vector t constitute the exterior orientation, while the 3 × 3 upper-triangular matrix K encodes the intrinsic parameters of the camera. We can always perform the projective reconstruction such that P_1 = [I | 0]. Let the world coordinate system be aligned with the first camera, i.e. P̂_1 = K_1 [I | 0]. Let the homography H that we seek to estimate have the form

    H = [ A    t ]
        [ v^T  1 ]     (6.2)

Then, since P̂_1 = P_1 H, we have

    K_1 [ I | 0 ] = [ I | 0 ] H = [ A | t ]   ⟹   H = [ K_1  0 ]
                                                       [ v^T  1 ]     (6.3)
Further, let the plane at infinity in the given projective reconstruction be π_∞ = (p^T, 1)^T. Then, since the plane at infinity is moved out of its canonical position (0^T, 1)^T in the metric reconstruction by H,

    [ p ]  =  H^{-T} [ 0 ]  =  [ K_1^{-T}   −K_1^{-T} v ] [ 0 ]  =  [ −K_1^{-T} v ]
    [ 1 ]            [ 1 ]     [ 0^T         1          ] [ 1 ]     [  1          ]

It follows that v = −K_1^T p, so H must have the form

    H = [ K_1        0 ]
        [ −p^T K_1   1 ]     (6.4)

This is consistent with the notion that the aim of autocalibration is to recover the plane at infinity and the intrinsic parameters.
Let the cameras in the projective reconstruction be of the form P_i = [A_i | a_i], where A_i is 3 × 3 and a_i is a 3 × 1 vector. Further, let the metric cameras be P̂_i = K_i [R_i | t_i], where R_i is a 3 × 3 rotation matrix. Then, since P̂_i = P_i H, given the form of H in (6.4), we have

    K_i [ R_i | t_i ] = [ A_i | a_i ] [ K_1       0 ]
                                      [ −p^T K_1  1 ]

    ⟹  K_i R_i = (A_i − a_i p^T) K_1.     (6.5)

By post-multiplying both sides of (6.5) with their transposes, we can eliminate the rotations R_i to obtain

    K_i K_i^T = (A_i − a_i p^T) K_1 K_1^T (A_i − a_i p^T)^T,     (6.6)

which can be rewritten as

    ω_i^* = (A_i − a_i p^T) ω_1^* (A_i − a_i p^T)^T,     (6.7)

since, by definition, the dual image of the absolute conic is ω^* = K K^T. This is one of the basic relations for autocalibration, whose aim can now be restated as estimating the plane at infinity and the DIAC. It is important to note that (6.7) relates projective entities, so the equality holds only up to scale.
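The chain of identities (6.4)-(6.7) can be verified numerically: build H from values of K_1 and p, map a metric camera K_2 [R_2 | t_2] into the projective frame via P_2 = P̂_2 H^{-1}, and confirm that A_2 − a_2 p^T reproduces K_2 R_2 K_1^{-1} and satisfies (6.7). A Python sketch (all numbers below are made up for illustration):

```python
import numpy as np

def rot_x(t):
    # rotation about the x-axis by angle t (our own helper)
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

# Illustrative intrinsics, pose and plane-at-infinity coordinates p.
K1 = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])
K2 = np.array([[750., 0., 300.], [0., 760., 250.], [0., 0., 1.]])
R2, t2 = rot_x(0.3), np.array([0.1, -0.2, 1.0])
p = np.array([0.2, -0.1, 0.3])

# H from (6.4), and the projective camera P_2 = P̂_2 H^{-1}.
H = np.block([[K1, np.zeros((3, 1))], [(-p @ K1)[None, :], np.ones((1, 1))]])
P2_hat = K2 @ np.hstack([R2, t2[:, None]])
P2 = P2_hat @ np.linalg.inv(H)
A2, a2 = P2[:, :3], P2[:, 3]

# A_2 - a_2 p^T equals K_2 R_2 K_1^{-1}, and (6.7) holds (here with scale 1).
H_inf = A2 - np.outer(a2, p)
lhs = K2 @ K2.T                       # omega_2*
rhs = H_inf @ (K1 @ K1.T) @ H_inf.T   # H_inf omega_1* H_inf^T
assert np.allclose(lhs, rhs)
```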
The Infinite Homography
Since (6.7) is central to the problem of autocalibration, we will digress slightly to characterize it further. First, we need an auxiliary result:
Theorem 10. Given projective cameras P = [ I | 0 ] and P′ = [ A | a ], the homography x′ = Hx induced from the first view to the second through any plane π = (v^T, 1)^T is given by

    H = A − a v^T.     (6.8)
Figure 6.1: Any plane in the scene induces a homography between two projective cameras.
Proof. Referring to Figure 6.1, let π = (v^T, 1)^T be any plane and X_π be the point where the back-projected ray from point x in the first image intersects the plane π. Then,

    x = [ I | 0 ] X_π   ⟹   X_π = (x^T, λ)^T.

Since X_π lies on the plane π,

    π^T X_π = 0   ⟹   v^T x + λ = 0   ⟹   X_π = (x^T, −v^T x)^T.

Let x′ be the image of X_π in the second view. Then,

    x′ = P′ X_π = [ A | a ] [   x    ]  =  (A − a v^T) x,
                            [ −v^T x ]

which means that for every point x in the first image, there exists a corresponding point x′ related by the planar homography H = A − a v^T.

The homography H above is called the homography from the first view to the second induced via the plane π.
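The construction in the proof is mechanical enough to check numerically: back-project an image point onto the plane, reproject into the second view, and compare against H x. A small Python sketch with arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two projective cameras P = [I | 0], P' = [A | a] and a plane (v^T, 1)^T
# (all values illustrative).
A = rng.normal(size=(3, 3)) + 3 * np.eye(3)
a = rng.normal(size=3)
v = rng.normal(size=3)
H = A - np.outer(a, v)          # plane-induced homography (6.8)

# Back-project an image point x onto the plane, then image it in view two.
x = np.array([0.4, -0.7, 1.0])
X_pi = np.append(x, -v @ x)     # (x^T, -v^T x)^T lies on the plane
x_proj = np.hstack([A, a[:, None]]) @ X_pi   # direct projection P' X_pi
x_hom = H @ x                                # via the homography

# The two results coincide, confirming x' = H x.
assert np.allclose(x_proj, x_hom)
```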
Returning to equation (6.7), it is clear that A_i − a_i p^T is the homography induced from the first view to the i-th via the plane at infinity π_∞ = (p^T, 1)^T. We denote

    H_∞^i = A_i − a_i p^T,     (6.9)

where H_∞^i is called the infinite homography. The autocalibration relation (6.7) can now be written as

    ω_i^* = H_∞^i ω_1^* (H_∞^i)^T     (6.10)

and will be henceforth referred to as the infinite homography relation.
6.2.2 Modulus Constraints
From equation (6.5), it follows that the infinite homography has the form

    H_∞^j = K_j R_j K_1^{-1}     (6.11)

where the equality holds up to a constant scale factor. Let us assume that all the cameras P̂_j have the same internal parameters, that is, K_j = K, for j = 1, . . . , m. Then, equation (6.11) becomes

    H_∞^j = μ_j K R_j K^{-1},     (6.12)
where we have introduced an explicit scale factor μ_j to make the equality exact. Clearly, the infinite homography is conjugate (or similar) to the rotation matrix R_j. It follows that the eigenvalues of H_∞^j must be {μ_j, μ_j e^{iθ_j}, μ_j e^{−iθ_j}}. In particular, the infinite homography has the property that its eigenvalues have equal moduli. Remarkably, this gives us enough leverage to be able to estimate the plane at infinity. The characteristic polynomial of the infinite homography, det(H_∞^j − λI), is a degree three polynomial in λ. Let its roots be λ_1, λ_2 and λ_3. Then,
    det(H_∞^j − λI) = (λ − λ_1)(λ − λ_2)(λ − λ_3)
                    = λ^3 − α_j λ^2 + β_j λ − γ_j     (6.13)

where it can be easily seen that
    α_j = λ_1 + λ_2 + λ_3 = μ_j (1 + 2 cos θ_j)
    β_j = λ_1 λ_2 + λ_2 λ_3 + λ_3 λ_1 = μ_j^2 (1 + 2 cos θ_j)
    γ_j = λ_1 λ_2 λ_3 = μ_j^3.     (6.14)
From the above, it follows that

    γ_j α_j^3 = β_j^3     (6.15)

which gives one constraint for each view, the so-called modulus constraint. Now, considering the determinant det(H_∞^j − λI) and noting from (6.9) that the dependence of the infinite homography on the plane at infinity is through a rank 1 term, it is evident that α_j, β_j and γ_j must all be affine in the coordinates p_1, p_2 and p_3 of the plane at infinity π_∞ = (p, 1)^T. It follows that the modulus constraint is a quartic polynomial in the coordinates of the plane at infinity.
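The derivation (6.13)-(6.15) can be confirmed numerically: for any H_∞^j = μ_j K R_j K^{-1}, the coefficients recovered from its eigenvalues satisfy the modulus constraint exactly. A Python sketch with illustrative values of K, μ_j and the rotation angle:

```python
import numpy as np

def rot_z(t):
    # rotation about the z-axis by angle t (our own helper)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

# An infinite homography conjugate to a rotation, H = mu K R K^{-1}
# (K, mu and theta are illustrative).
K = np.array([[700., 0., 310.], [0., 710., 260.], [0., 0., 1.]])
mu, theta = 1.7, 0.4
H_inf = mu * K @ rot_z(theta) @ np.linalg.inv(K)

# Coefficients of the characteristic polynomial
#   lambda^3 - alpha lambda^2 + beta lambda - gamma
lam = np.linalg.eigvals(H_inf)
alpha = lam.sum().real                                  # sum of roots
beta = (lam[0]*lam[1] + lam[1]*lam[2] + lam[2]*lam[0]).real
gamma = np.prod(lam).real                               # product of roots

# Modulus constraint (6.15): gamma * alpha^3 = beta^3.
assert np.isclose(gamma * alpha**3, beta**3)
```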
Three views provide three such degree 4 polynomials in p_1, p_2, p_3, which suffices to restrict the solution space to 4^3 = 64 possibilities. However, for the case of three views, an additional cubic equation is available from the modulus constraints, which can be used to eliminate several spurious solutions and restrict the solution to 21 possibilities (Schaffalitzky, 2000). In practice, there may be several views available, so the modulus constraints can be solved in a least squares sense.
6.2.3 Chirality Bounds on Plane at Infinity
Chirality constraints demand that the reconstructed scene points lie in front of the cameras. While a general projective transformation may result in the plane at infinity splitting the scene, a quasi-affine transformation is one that preserves the convex hull of the scene points X and camera centers C. A transformation H_q that upgrades a projective reconstruction to a quasi-affine one can be computed by solving the so-called chiral inequalities.
A subsequent affine centering, H_a, guarantees that the plane at infinity in the centered quasi-affine frame, v = (H_a H_q)^{-T} π_∞, cannot pass through the origin. So it can be parametrized as (v_1, v_2, v_3, 1)^T, and bounds on v_i in the centered quasi-affine frame can be computed by solving six linear programs:

    min / max   v_i ,   i = 1, 2, 3
    subject to  (X_j^q)^T v > 0,   j = 1, . . . , n     (6.16)
                (C_k^q)^T v > 0,   k = 1, . . . , m

where X_j^q and C_k^q are the points and camera centers in the quasi-affine frame. We refer the reader to (Hartley, 1998a; Nistér, 2004) for a thorough treatment of the subject.
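Each of the six bounds in (6.16) is a small linear program. A schematic Python sketch with scipy (the points, camera centre and the box limits that keep the LPs bounded are our own illustrative choices; strict inequalities are approximated with a small margin):

```python
import numpy as np
from scipy.optimize import linprog

def plane_bounds(X_q, C_q, eps=1e-3, box=10.0):
    """Bounds on (v1, v2, v3) from the chiral inequalities in (6.16).

    X_q, C_q: homogeneous points / camera centres (one per row) in the
    centred quasi-affine frame. Each row g must satisfy g.v > 0 for
    v = (v1, v2, v3, 1); strict inequality is approximated by '>= eps',
    and a large box keeps the LPs bounded.
    """
    G = np.vstack([X_q, C_q])
    A_ub = -G[:, :3]               # -g[:3].v <= g[3] - eps
    b_ub = G[:, 3] - eps
    limits = [(-box, box)] * 3
    out = []
    for i in range(3):
        e = np.zeros(3)
        e[i] = 1.0
        lo = linprog(e, A_ub=A_ub, b_ub=b_ub, bounds=limits)   # min v_i
        hi = linprog(-e, A_ub=A_ub, b_ub=b_ub, bounds=limits)  # max v_i
        out.append((lo.fun, -hi.fun))
    return out

# Toy scene: six points on the coordinate axes and one camera centre at
# the origin constrain each coordinate of v to (approximately) (-1, 1).
X_q = np.array([[ 1., 0., 0., 1.], [-1., 0., 0., 1.],
                [ 0., 1., 0., 1.], [ 0., -1., 0., 1.],
                [ 0., 0., 1., 1.], [ 0., 0., -1., 1.]])
C_q = np.array([[0., 0., 0., 1.]])
bounds_v = plane_bounds(X_q, C_q)
```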
6.2.4 Need for Global Optimization
Before we venture further, the reader might wonder about the benefits of global optimization, as opposed to a local gradient-based technique, for solving the autocalibration problem. To motivate the need for global optimization, we perform a small experiment similar to (Hartley et al., 1999). For a given scene, bounds on the location of the plane at infinity in a centered quasi-affine frame are computed using chirality constraints. The plane at infinity is parametrized as π_∞ = (p_1, p_2, p_3, 1)^T. Within the bounds on p_1, p_2 and p_3, for each putative location of the plane at infinity, the infinite homography in each view is computed. Subsequently, the infinite homography relation is minimized to compute the DIAC. The minimum of the infinite homography relation for each putative plane at infinity is used to populate a three-dimensional cost cube, one of whose slices is shown in Figure 6.2. A similar example is also shown in Figure 1.13.
Figure 6.2: Various views and a contour plot of a 2D slice of the 3D cube of infinite homography relation minima as a function of position of plane at infinity. Blue denotes a low cost and red denotes a high cost.
The above example indicates that the margin of error for estimating the plane at infinity is not very high, since the cost surface is quite rugged around the optimum. This is consistent with the adage that estimating the plane at infinity is the most difficult step in uncalibrated reconstruction (Hartley et al., 1999). Now, consider an alternative, more direct approach to autocalibration that tries to bound the plane at infinity and then minimizes the cost function corresponding to Figure 6.2, thereby simultaneously estimating both the plane at infinity and the DIAC. The multiple low basins in the search space indicate that a traditional gradient-based local minimization algorithm is bound to get stuck in local minima. Moreover, the initialization needs to be very accurate for the local minimization algorithm to reach the global optimum, which is sharply cloistered within high cliffs on all sides. It is evident that the complex nature of the autocalibration problem merits a more powerful, global optimization approach. The position marked as global minimum in Figure 6.2 corresponds to the plane at infinity estimated using the globally optimal algorithm described in Section 6.5.
6.3 Previous Work
Approaches to autocalibration (Faugeras et al., 1992) can be broadly classified as direct and stratified. Direct methods seek to compute a metric reconstruction by estimating the absolute conic. This is encoded conveniently in the dual quadric formulation of autocalibration (Heyden and Åström, 1996; Triggs, 1997), whereby an eigenvalue decomposition of the estimated dual quadric yields the homography that relates the projective reconstruction to a Euclidean one. Linear methods (Pollefeys et al., 1998) as well as more elaborate SQP-based optimization approaches (Triggs, 1997) have been proposed to estimate the dual quadric, but perform poorly with noisy data. Methods such as (Manning and Dyer, 2001), which are based on the Kruppa equations (or the fundamental matrix), are known to suffer from additional ambiguities (Sturm, 2000). This work primarily deals with a stratified approach to autocalibration (Pollefeys and van Gool, 1999). It is well-established in the literature that, in the absence of prior information about the scene, estimating the plane at infinity represents the most significant challenge in autocalibration (Hartley et al., 1999). The modulus constraints (Pollefeys and van Gool, 1999) are a necessary condition on the coordinates of the plane at infinity. Local techniques are used in (Pollefeys and van Gool, 1999) to estimate the coordinates of the plane at infinity by minimizing a noisy overdetermined system in the multiview case. An alternate approach to estimating the plane at infinity exploits the chirality constraints. The algorithm in (Hartley et al., 1999) computes bounds on the plane at infinity, and a brute force search is used to recover π∞ within this region. It is argued in (Nistér, 2004) that it might be advantageous to use camera centers alone when using chirality constraints. Several linear methods exist for estimating the DIAC (Hartley and Zisserman, 2004) for the metric upgrade, but they do not enforce its positive semi-definiteness.
The only work the authors are aware of which explicitly deals with this issue is (Agrawal, 2004), which is formulated under the assumption of known principal point and zero skew.
The interested reader is referred to (Hartley and Zisserman, 2004) and the references therein for a more detailed overview of literature relevant to autocalibration. Of late, there has been significant activity towards developing globally optimal algorithms for various problems in computer vision. The theory of convex linear matrix inequality (LMI) relaxations (Lasserre, 2001) is used in (Kahl and Henrion, 2005) to find global solutions to several optimization problems in multiview geometry, while (Chandraker et al., 2007a) discusses a direct method for autocalibration using the same techniques. Triangulation and resectioning are solved with a certificate of optimality using convex relaxation techniques for fractional programs in (Agarwal et al., 2006). Several geometric problems in computer vision, when posed in the L∞-norm, can be solved to their global optimum using techniques of quasi-convex optimization (Kahl, 2005; Sim and Hartley, 2006a; Agarwal et al., 2008). A survey of recent work in developing optimal algorithms for multiview geometry can be found in (Hartley and Kahl, 2007b). An interval analysis based branch and bound method for autocalibration is proposed in (Fusiello et al., 2004); however, the fundamental matrix based formulation does not scale well beyond a small number of views. Gröbner basis methods have been used to achieve optimal solutions for several geometric reconstruction problems, such as triangulation (Stewénius et al., 2005), but they do not scale well beyond a very small number of views. Branch and bound as a solution paradigm has been used for a diverse range of applications in computer vision, such as feature selection (Zongker and Jain, 1996), geometric matching (Breuel, 2002), image segmentation (Gat, 2003), contour tracking (Freedman, 2003), object localization (Lampert et al., 2008) and so on. A branch and bound method for Euclidean registration problems is presented in (Olsson et al., 2009b).
6.4 The Branch and Bound Framework
In this section, we provide a general outline of the branch and bound framework that will be employed for stratified autocalibration. Consider a multivariate, non-convex, scalar-valued objective function f(x), for which we seek a global minimum over a rectangle Q0. Branch and bound algorithms require an auxiliary function f_lb(Q) which, for every region Q ⊆ Q0, satisfies two properties:
(L1) The value of f_lb(Q) is always less than or equal to the minimum value f_min(Q) of f(x) for all x ∈ Q.

(L2) Let |Q| denote the size of a rectangle, which in our case is the length of the longest edge. Then the relaxation gap f(x) − f_lb(x) monotonically decreases as a function of |Q|.

Note that while (L1) is a basic stipulation for a convex underestimator, (L2) is a Cauchy continuity requirement specific to branch and bound algorithms. Indeed, several popular convex underestimators such as linear matrix inequality (LMI) relaxations (Lasserre, 2001) and sum-of-squares relaxations for polynomial systems (Prajna et al., 2002) do not satisfy this requirement, which renders them unsuitable for our purposes.
Computing the value of f_lb(Q) is referred to as bounding, while choosing and subdividing a rectangle is called branching. As before, we consider the rectangle with the smallest minimum of f_lb as the most promising to contain the global minimum and subdivide it into two rectangles along the largest dimension. A key consideration when designing bounding functions is the ease with which they can be estimated. So, it is desirable to design f_lb(Q) as the solution of a convex optimization problem for which efficient solvers exist (Boyd and Vandenberghe, 2004). In the following sections, we present branch and bound algorithms based on such constructions.
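To make the framework concrete, the following is a minimal sketch of the branch and bound loop for a toy one-dimensional objective, using a Lipschitz-based lower bounding function that satisfies (L1) and (L2). Both the bound and the objective here are illustrative choices of ours, standing in for the convex relaxations constructed later in the chapter:

```python
import heapq

def branch_and_bound(f, lo, hi, lip, tol=1e-4, max_iter=20000):
    """Best-first branch and bound over the interval [lo, hi].

    Lower bound on an interval [a, b]:
        f_lb = min(f(a), f(b)) - lip * (b - a) / 2
    This satisfies (L1) whenever lip bounds |f'|, and (L2) since the
    relaxation gap shrinks linearly with the interval width."""
    def f_lb(a, b):
        return min(f(a), f(b)) - lip * (b - a) / 2.0

    best_x, best = (lo, f(lo)) if f(lo) <= f(hi) else (hi, f(hi))
    heap = [(f_lb(lo, hi), lo, hi)]
    for _ in range(max_iter):
        if not heap:
            break
        bound, a, b = heapq.heappop(heap)   # most promising rectangle
        if best - bound <= tol:             # global optimality certificate
            break
        mid = 0.5 * (a + b)                 # branch: split the interval
        f_mid = f(mid)
        if f_mid < best:
            best_x, best = mid, f_mid
        for p, q in ((a, mid), (mid, b)):
            lb = f_lb(p, q)
            if lb < best - tol:             # prune hopeless rectangles
                heapq.heappush(heap, (lb, p, q))
    return best_x, best

# Toy two-basin objective; its global minimum lies near x = -1.024.
f = lambda x: (x * x - 1.0) ** 2 + 0.2 * x
x_star, f_star = branch_and_bound(f, -2.0, 2.0, lip=41.0)
```

The loop terminates with a certificate: the incumbent value lies within `tol` of the smallest lower bound over all unexplored rectangles, hence within `tol` of the global minimum.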
6.4.1 Constructing Convex Relaxations
In this section, we outline our general strategy for the construction of a convex underestimator for an arbitrary non-convex function. This strategy will be employed to underestimate the objective functions that arise in both the affine and metric upgrade stages of autocalibration. Let us consider the following unconstrained, non-linear least squares problem:

min_x  Σ_{i=1}^{m} (f_i(x) − µ_i)²    (6.17)

where µ_i ∈ R and x ∈ R^k, k ≥ 1. Then, an equivalent constrained optimization problem is:
min_{x, s_i}  Σ_i (s_i − µ_i)²
subject to  s_i = f_i(x).    (6.18)
Suppose we can construct a convex underestimator conv (fi) and a concave overestimator conc (fi) for the function fi(x). Then, the following convex optimization problem minimizes the same objective as (6.18), but with a “relaxed” constraint set:
min_{x, s_i}  Σ_i (s_i − µ_i)²
subject to  conv(f_i) ≤ s_i ≤ conc(f_i).    (6.19)

Consequently, the minimum attained by the problem (6.19) will always be at least as low as the minimum attained by (6.18). In effect, we have constructed a convex problem whose minimum always underestimates the minimum of the non-convex problem we wished to optimize. The solution to (6.19) corresponds to the construction of the lower bounding function f_lb discussed in Section 4.3. An intuitive illustration of the procedure is depicted for a 1-D function in Figure 6.3. While the variable s is allowed to attain values only on the graph of the function f(x) in the original problem (6.18), it can attain any value within the larger region between the convex and concave relaxations in the relaxed problem (6.19).
Figure 6.3: (a) The objective function of the non-linear least squares problem min (f(x) − µ)² is linearized by replacing the non-linear function f(x) by a scalar variable s and introducing an equality constraint s = f(x). (b) Convex and concave relaxations are constructed for the function f(x). For the functions encountered in this chapter, Appendix B demonstrates the construction of tight, piecewise linear relaxations. (c) s is now allowed to attain values in a relaxed region between the convex under-estimator and the concave over-estimator.
The convex relaxations that are used in a branch and bound framework must satisfy conditions (L1) and (L2) specified above. Appendix B.4 proves the same for the convex relaxations constructed in this chapter.
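For one-dimensional intuition (the actual relaxations used in this chapter, for functions of the form x^{1/3}y, are the piecewise linear ones of Appendix B): a concave function such as x^{1/3} on a positive interval is its own tightest concave over-estimator, and its tightest convex under-estimator is the chord. A quick numeric sanity check of the sandwich property, with illustrative interval endpoints of our choosing:

```python
def chord(f, l, u):
    """Secant line through (l, f(l)) and (u, f(u))."""
    fl, fu = f(l), f(u)
    return lambda x: fl + (fu - fl) * (x - l) / (u - l)

f = lambda x: x ** (1.0 / 3.0)   # concave for x > 0
l, u = 0.5, 8.0
conv_f = chord(f, l, u)          # convex under-estimator (the chord)
conc_f = f                       # concave over-estimator (f itself)

# Sample the interval: conv_f <= f <= conc_f must hold throughout.
xs = [l + (u - l) * k / 200.0 for k in range(201)]
ok = all(conv_f(x) - 1e-12 <= f(x) <= conc_f(x) + 1e-12 for x in xs)
```

The chord agrees with f at both endpoints, which is exactly the behavior that makes the relaxation gap of property (L2) shrink as the rectangle is subdivided.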
6.5 Global Estimation of Plane at Infinity
6.5.1 Traditional Solution
Given exactly three views, the modulus constraints of (6.15) correspond to a system of three quartic polynomials in three variables, whose 64 roots may be found, typically using continuation methods. Also for the three-view case, an additional cubic equation available from the modulus constraints (Schaffalitzky, 2000) can be used to eliminate several spurious solutions, reducing the number of possible solutions to 21. When more than three views are present, the modulus constraints from all the views may be used in a least squares framework for greater accuracy and robustness:
min_{p1, p2, p3}  Σ_{i=1}^{m} (γ_i α_i³ − β_i³)².    (6.20)
A gradient-based optimization routine, such as Levenberg-Marquardt (Levenberg, 1944; Marquardt, 1963), may be used to obtain a locally optimal solution to the above problem. In (Pollefeys and van Gool, 1999), several random initializations for the Levenberg-Marquardt algorithm are used to enhance the chances of converging to a global optimum.
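The behavior of such multi-start local schemes can be sketched on a toy two-basin cost, a hypothetical stand-in of ours for the modulus-constraint objective, with plain gradient descent standing in for Levenberg-Marquardt:

```python
import random

def grad_descent(grad, x0, step=1e-3, iters=4000):
    """Plain gradient descent; stands in for Levenberg-Marquardt here."""
    x = x0
    for _ in range(iters):
        x -= step * grad(x)
    return x

# Hypothetical two-basin scalar cost: global minimum near x = -1.024,
# a spurious local minimum near x = +0.975.
cost = lambda x: (x * x - 1.0) ** 2 + 0.2 * x
grad = lambda x: 4.0 * x * (x * x - 1.0) + 0.2

random.seed(0)
starts = [random.uniform(-2.0, 2.0) for _ in range(50)]
solutions = [grad_descent(grad, s) for s in starts]
x_best = min(solutions, key=cost)   # keep the best of the 50 restarts
```

Roughly half the restarts land in the spurious basin; keeping the best restart recovers the global minimum here, but offers no certificate, which is the gap the branch and bound approach of this chapter closes.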
6.5.2 Problem Formulation
Note that the cost function in (6.20) is a polynomial, and some recent work in computer vision (Kahl and Henrion, 2005; Chandraker et al., 2007a) exploits convex linear matrix inequality (LMI) relaxations to achieve global optimality in polynomial programs. However, this is a degree 8 polynomial in three variables, which is far beyond what present-day solvers can handle (Henrion and Lasserre, 2003; Prajna et al., 2002). We instead consider the equivalent formulation:

min_{p1, p2, p3}  Σ_{i=1}^{m} (γ_i^{1/3} α_i − β_i)²,    (6.21)

for which the global minimum is estimated using the method outlined in this section.
6.5.3 Convex Relaxation
To illustrate the higher-level concepts, we show the construction of convex under-estimators for the non-convex objective in (6.21). The actual objective we minimize incorporates chirality bounds and is derived in Section 6.5.4.
Let us suppose it is possible to derive a convex under-estimator conv(γ_i^{1/3} α_i) and a concave over-estimator conc(γ_i^{1/3} α_i) for γ_i^{1/3} α_i. Then the following convex optimization problem underestimates the solution to (6.21):

min_{p1, p2, p3}  Σ_{i=1}^{m} (s_i − β_i)²
subject to  conv(γ_i^{1/3} α_i) ≤ s_i ≤ conc(γ_i^{1/3} α_i)    (6.22)

As shown in Appendix B.3, our convex and concave relaxations for functions of the form x^{1/3} y are piecewise linear and representable using a small set of linear inequalities. Thus the above optimization problem is a convex quadratic program that can be solved using a quadratic programming (QP) or a second order cone programming (SOCP) solver.
Given bounds on {p1, p2, p3}, a branch and bound algorithm can now be used to obtain a global minimum of the modulus constraints. All that remains to be shown is that it is possible to estimate an initial region which bounds the coordinates of π∞.
6.5.4 Incorporating Bounds on the Plane at Infinity
One way to derive bounds on the coordinates of the plane at infinity is by using the chirality conditions overviewed in Section 6.2.3. Let v = (v1, v2, v3, 1)^T be the plane at infinity in the centered quasi-affine frame, so that we can find bounds on each v_i. However, the modulus constraints require that the first metric camera be of the form K[I | 0] and the first projective camera have the form [I | 0], which might not be satisfiable in a centered quasi-affine frame, in general. Thus, we need to use the bounds derived in the centered quasi-affine frame within the modulus constraints for the original projective frame. The centered quasi-affine reconstruction differs from the projective one by a transformation Hqa = Ha Hq, where Hq takes the projective frame to some quasi-affine frame and Ha is the affine centering in that quasi-affine frame. Let h_i be the i-th column of Hqa; then we have p_i = h_i^T v / h_4^T v. Recall that, for the j-th view, α_j, β_j and γ_j are affine expressions in p1, p2 and p3 (Pollefeys and van Gool, 1999). Then, for instance,
α_j = α_{j1} p_1 + α_{j2} p_2 + α_{j3} p_3 + α_{j4}    (6.23)
    = a_j(v) / d(v),    (6.24)

where a_j(v) = α_{j1} h_1^T v + α_{j2} h_2^T v + α_{j3} h_3^T v + α_{j4} h_4^T v and d(v) = h_4^T v. Similarly, let

β_j = b_j(v) / d(v),    (6.25)
γ_j = c_j(v) / d(v),    (6.26)
where aj(v), bj(v), cj(v), d(v) are linear functions of v. In the following, for the sake of brevity, we will drop the reference to v and just use aj, bj, cj, d. Now the optimization problem (6.21) can be rewritten as
min_{v1, v2, v3}  Σ_{j=1}^{m} (c_j^{1/3} a_j − d^{1/3} b_j)² / d^{8/3}
subject to  l_i ≤ v_i ≤ u_i,  i = 1, 2, 3.    (6.27)
Introducing new scalar variables for some of the non-linear terms, the above is equivalent to
min_{v1, v2, v3}  r
subject to  r · e ≥ Σ_{j=1}^{m} (f_j − g_j)²
            f_j = c_j^{1/3} a_j,  j = 1, …, m
            g_j = d^{1/3} b_j,  j = 1, …, m
            e = d^{8/3}
            l_i ≤ v_i ≤ u_i,  i = 1, 2, 3.    (6.28)
As outlined in our general recipe for constructing convex relaxations (Section 6.4.1), we have reduced the non-convexity in the above optimization problem to a set of equality constraints. The quadratic inequality constraint is convex and is known as a rotated cone (Boyd and Vandenberghe, 2004). Given bounds on v_i, it is easy to calculate bounds on a_j, b_j, c_j, d by solving eight linear programs in three variables. Given these bounds, we can construct convex and concave envelopes of the non-linear functions e, f_j, g_j and use them to construct the following convex program that underestimates the minimum of the problem (6.28):
min_{v1, v2, v3}  r
subject to  r · e ≥ Σ_{j=1}^{m} (f_j − g_j)²,
            conv(c_j^{1/3} a_j) ≤ f_j ≤ conc(c_j^{1/3} a_j),  j = 1, …, m
            conv(d^{1/3} b_j) ≤ g_j ≤ conc(d^{1/3} b_j),  j = 1, …, m
            e ≤ conc(d^{8/3})
            l_i ≤ v_i ≤ u_i,  i = 1, 2, 3.    (6.29)
Notice that the convex envelope of d^{8/3} is not needed. Since (6.29) is a minimization problem, e always takes its maximum possible value and does not require a lower bound. Following Appendix B, our convex relaxation in (6.29) consists of a linear objective subject to linear and SOCP constraints, which can be efficiently minimized (Sturm, 1999). A branch and bound algorithm can now be used to obtain an estimate of {v1, v2, v3} which globally minimizes the modulus constraints. Thereafter, the plane at infinity in the projective frame can be recovered as π∞ = Hqa^T v, which completes the projective to affine upgrade.
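An aside on the bound computations used above: since the feasible set is an axis-aligned box, each of the linear programs bounding a_j, b_j, c_j, d admits a closed-form coordinatewise solution, sketched below (the function name is ours, not from the thesis implementation):

```python
def linear_range(w, lo, hi):
    """Tight bounds of the linear function w . v over the box lo <= v <= hi.

    Each coordinate contributes independently: the minimum takes v_i at
    its lower end when w_i >= 0 and at its upper end otherwise, and the
    maximum does the reverse."""
    lo_val = sum(wi * (li if wi >= 0 else ui) for wi, li, ui in zip(w, lo, hi))
    hi_val = sum(wi * (ui if wi >= 0 else li) for wi, li, ui in zip(w, lo, hi))
    return lo_val, hi_val

# Illustrative coefficients and box, chosen by us.
bounds = linear_range((1.0, -2.0, 3.0), lo=(-1.0, 0.0, 2.0), hi=(2.0, 1.0, 5.0))
```

Constant terms (such as the α_{j4} h_4^T v contribution with v_4 fixed at 1) simply shift both bounds by the same amount.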
6.6 Globally Optimal Metric Upgrade
6.6.1 Traditional Solution
Recall that when the camera intrinsic parameters are held constant, the DIAC satisfies the infinite homography relations ω∗ = H^i_∞ ω∗ (H^i_∞)^T, i = 1, …, m, where equality holds up to a scale factor. The standard technique for estimating the DIAC is to first normalize the infinite homography matrix by dividing it by the cube root of its determinant:

Ĥ = H / (det H)^{1/3}.    (6.30)
Since this normalization “equates” the scale on the two sides of the infinite homography relation, estimating the DIAC can now be posed as a least squares problem:
min_{ω∗}  Σ_{i=1}^{m} ‖ω∗ − Ĥ^i_∞ ω∗ (Ĥ^i_∞)^T‖,    (6.31)

which is typically solved linearly by ignoring the positive semidefiniteness requirement on the DIAC. For the cases where the linear solution does not yield a positive semidefinite DIAC, the closest positive semidefinite matrix is estimated as a post-processing step by dropping the negative eigenvalues. It is well-documented that this may lead to a spurious calibration in practice (Hartley and Zisserman, 2004).
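The post-processing step mentioned above, projecting onto the closest positive semidefinite matrix (in Frobenius norm) by dropping negative eigenvalues, can be sketched as follows; this is the generic construction, not code from the thesis:

```python
import numpy as np

def nearest_psd(A):
    """Frobenius-nearest PSD matrix: symmetrize, then clip the negative
    eigenvalues of the eigendecomposition to zero."""
    S = (A + A.T) / 2.0
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.clip(w, 0.0, None)) @ V.T

# An indefinite symmetric matrix standing in for a bad linear DIAC estimate.
A = np.array([[2.0, 0.0, 0.0],
              [0.0, -1.0, 0.0],
              [0.0, 0.0, 1.0]])
P = nearest_psd(A)
```

The projection can move the estimate far from a meaningful calibration, which is precisely why this chapter enforces ω∗ ⪰ 0 inside the optimization instead.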
6.6.2 Problem Formulation
For the optimal solution to the infinite homography relation, we note that both ω∗ and H^i_∞ are homogeneous entities, so our cost function must correctly account for the scale factor before it can be used to search for the optimal DIAC. Moreover, the optimization algorithm itself must take into account the positive semidefiniteness of the DIAC.

A necessary condition for the matrix ω∗ to be interpreted as ω∗ = KK^T is that ω∗_{33} = 1. Thus, we fix the scale in the infinite homography relation by demanding that both the matrices on the left and the right hand side of the relation have their (3, 3) entry equal to 1. To this end, we introduce additional variables λ_i and pose the minimization problem:
min_{ω∗, λ_i}  Σ_i ‖ω∗ − λ_i H^i_∞ ω∗ (H^i_∞)^T‖²_F
subject to  ω∗_{33} = 1
            λ_i (h^i_3)^T ω∗ h^i_3 = 1
            ω∗ ⪰ 0,  ω∗ ∈ D    (6.32)
Here, h^i_3 denotes the third row of the 3 × 3 infinite homography H^i_∞ and D is some initial convex region whose choice is elucidated later in this section. For the present, it suffices to understand that the individual entries of ω∗ lie within the convex region D.
6.6.3 Convex Relaxation
We begin by introducing a new set of variables ν_i = λ_i ω∗. Each matrix ν_i is a symmetric 3 × 3 matrix with entries ν_{ijk} = λ_i ω∗_{jk}. Also let us assume that the domain D is given in the form of bounds [l_{jk}, u_{jk}] on the five unknown entries ω∗_{jk} of the symmetric matrix ω∗. Then (6.32) can be re-written as
min_{ω∗, ν_i, λ_i}  Σ_i ‖ω∗ − H^i_∞ ν_i (H^i_∞)^T‖²_F
subject to  ν_{ijk} = λ_i ω∗_{jk}
            ω∗_{33} = 1
            λ_i (h^i_3)^T ω∗ h^i_3 = 1
            ω∗ ⪰ 0
            l_{jk} ≤ ω∗_{jk} ≤ u_{jk}    (6.33)
The non-convexity in the above optimization problem has been reduced to the bilinear equality constraints ν_{ijk} = λ_i ω∗_{jk} and λ_i (h^i_3)^T ω∗ h^i_3 = 1.
Given bounds on the entries of ω∗, a relaxation of (6.33) is obtained by replacing the constraint λ_i (h^i_3)^T ω∗ h^i_3 = 1 by a pair of linear inequalities of the form L_i ≤ λ_i ≤ U_i, where L_i and U_i are computed by simply inverting the bounds on (h^i_3)^T ω∗ h^i_3. Thus, the lower bound L_i can be computed as the reciprocal of the result of the maximization problem:
max_{ω∗}  (h^i_3)^T ω∗ h^i_3
subject to  ω∗_{33} = 1
            ω∗ ⪰ 0
            l_{jk} ≤ ω∗_{jk} ≤ u_{jk}    (6.34)
This is a semi-definite program (SDP) in 9 variables and can be solved very efficiently using interior point methods (Boyd and Vandenberghe, 2004). The upper bound U_i can be computed similarly, as the reciprocal of the minimum of the corresponding minimization problem. The relaxed optimization problem can now be stated as:
min_{ω∗, ν_i, λ_i}  Σ_i ‖ω∗ − H^i_∞ ν_i (H^i_∞)^T‖²_F
subject to  ν_{ijk} = λ_i ω∗_{jk}
            ω∗_{33} = 1
            ω∗ ⪰ 0
            l_{jk} ≤ ω∗_{jk} ≤ u_{jk}
            L_i ≤ λ_i ≤ U_i    (6.35)
In effect, the above ensures that the introduction of an additional view does not translate into an increase in the dimensionality of our search space. Instead, the cost is limited to solving a small SDP to compute bounds on λ_i, while the branching variables remain the five unknowns of ω∗. Thus, the search space for the branch and bound algorithm can be restricted to a small, fixed number of dimensions, independent of the number of views. Appendix B.2 discusses the synthesis of convex relaxations of bilinear equalities, which allows us to replace each bilinear equality by a set of linear inequalities. Using them, a convex relaxation of the above optimization problem can be stated as

min_{ω∗, ν_i, λ_i}  Σ_i ‖ω∗ − H^i_∞ ν_i (H^i_∞)^T‖²_F
subject to  ν_{ijk} ≤ U_i ω∗_{jk} + l_{jk} λ_i − U_i l_{jk}
            ν_{ijk} ≤ L_i ω∗_{jk} + u_{jk} λ_i − L_i u_{jk}
            ν_{ijk} ≥ L_i ω∗_{jk} + l_{jk} λ_i − L_i l_{jk}
            ν_{ijk} ≥ U_i ω∗_{jk} + u_{jk} λ_i − U_i u_{jk}
            ω∗_{33} = 1
            ω∗ ⪰ 0
            l_{jk} ≤ ω∗_{jk} ≤ u_{jk}
            L_i ≤ λ_i ≤ U_i    (6.36)
The objective function of the above optimization problem is convex quadratic. The constraint set includes linear inequalities and a positive semi-definiteness constraint. Such problems can be efficiently solved to their global optimum using interior point methods, and a number of software packages exist for doing so; we use SeDuMi in our implementation (Sturm, 1999). The user of the algorithm specifies valid ranges for the entries of the calibration matrix K. From this input, we derive intervals [l_{jk}, u_{jk}] for the entries ω∗_{jk} of the matrix ω∗ using the rules of interval arithmetic (Moore, 1966), which specifies the initial convex region D in (6.32).
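The four linear inequalities on ν_{ijk} in (6.36) are the standard McCormick envelopes of a bilinear term λω under box bounds. A quick numeric check, with illustrative bounds of our choosing, that they indeed sandwich the bilinear product:

```python
import random

def mccormick(lam, w, L, U, l, u):
    """McCormick envelopes of the bilinear term lam * w,
    given L <= lam <= U and l <= w <= u (cf. the inequalities in (6.36))."""
    lower = max(L * w + l * lam - L * l,
                U * w + u * lam - U * u)
    upper = min(U * w + l * lam - U * l,
                L * w + u * lam - L * u)
    return lower, upper

random.seed(1)
L, U, l, u = 0.5, 2.0, -1.0, 3.0   # illustrative bounds, not from the thesis
sandwich = True
for _ in range(1000):
    lam, w = random.uniform(L, U), random.uniform(l, u)
    lo, hi = mccormick(lam, w, L, U, l, u)
    sandwich = sandwich and (lo - 1e-9 <= lam * w <= hi + 1e-9)
```

Each envelope follows from expanding a product of nonnegative terms, e.g. (λ − L)(ω − l) ≥ 0 gives λω ≥ Lω + lλ − Ll, and the envelopes tighten as the boxes shrink during branching, which is property (L2).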
A Note on Normalization
The careful reader would have observed that we do not follow the standard prescription of normalizing the infinite homography by the cube root of its determinant, as discussed in Section 6.6.1, to resolve the scale in the infinite homography relations ω∗ = H^i_∞ ω∗ (H^i_∞)^T, i = 1, …, m. There are two reasons for this. Since the equation we are trying to satisfy with
the optimal estimate of ω∗ is algebraic, there are a number of ways in which the scale ambiguity can be resolved. The method based on normalizing the determinant is just one of them. However, the DIAC is not an arbitrary 3 × 3 symmetric definite matrix; it has a particular geometric and numerical interpretation, which requires that ω∗_{33} = 1. Our choice of the objective function reflects this. If one were to choose the determinant-based normalization, even though technically the scale ambiguity in the infinite homography relation would have been resolved, a further normalization would be needed before the camera parameter matrix K can be estimated from the DIAC. The second reason is that our normalization subsumes the standard one, since the
latter just corresponds to setting λ_i = det(H^i_∞)^{−2/3}. Since we are optimizing over the choice of λ_i as well, the solution returned by our method will correspond to a minimum which is, in general, lower than the one that corresponds to the standard normalization. In other words, our normalization is more consonant with the aim of global optimization, since we estimate this scale factor while the traditional approach chooses one. Thus, our method, at the expense of some computational effort, poses the optimization problem in terms of an interpretable quantity and finds estimates which are at least as good as, or better than, those obtained by using the standard normalization.
6.7 Experiments
In this section, we will describe the experimental evaluation of our algorithms using synthetic and real data.
To evaluate the output of our algorithm, the following metrics are defined:

∆p = √( Σ_{i=1}^{3} (p_i / p_i′ − 1)² )    (6.37)
∆f = |f_1 / f_1′ − 1| + |f_2 / f_2′ − 1|    (6.38)
∆uv = |u − u′| + |v − v′|    (6.39)
∆s = |s − s′|    (6.40)
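In code, the four metrics are a direct transcription (the primed ground-truth quantities are passed as the `*_true` arguments; the function names are ours):

```python
import math

def delta_p(p, p_true):
    """Relative error in plane-at-infinity coordinates, eq. (6.37)."""
    return math.sqrt(sum((pi / pt - 1.0) ** 2 for pi, pt in zip(p, p_true)))

def delta_f(f1, f2, f1_true, f2_true):
    """Relative focal length error, eq. (6.38)."""
    return abs(f1 / f1_true - 1.0) + abs(f2 / f2_true - 1.0)

def delta_uv(u, v, u_true, v_true):
    """Absolute principal point error, eq. (6.39)."""
    return abs(u - u_true) + abs(v - v_true)

def delta_s(s, s_true):
    """Absolute skew error, eq. (6.40)."""
    return abs(s - s_true)
```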
Here, p_i are the estimated coordinates of the plane at infinity, f_1, f_2 are the two focal lengths, (u, v) is the principal point and s the skew; p_i′, f_1′, f_2′, u′, v′ and s′ are the corresponding ground truth quantities. In the first experiment, we simulated a scene where 100 3D points are randomly generated in a cube with sides of length 20, centered at the origin, and a varying number of cameras are randomly placed at a nominal distance of 40 units. Zero mean Gaussian noise of varying standard deviation is added to the image coordinates. A projective transformation is applied to the scene with a known, randomly generated plane at infinity, and the ground truth intrinsic calibration matrix is the identity. All the statistics reported in this section are acquired over 50 trials. Table 6.1 reports, for various numbers of cameras and noise levels, the errors in the estimates of various camera parameters and the number of iterations needed for the algorithm to converge. The column π∞(1) in Table 6.1 reports the number of branch and bound iterations using the algorithm described in Section 6.5.4. However, an additional optimization is possible: we can refine the value of the feasible point f(q∗) using a gradient descent method within the rectangle that contains it. This does not compromise optimality, but allows the value of the current best estimate to be lower than the value corresponding to the minimum of the lower bounding function. The number of iterations with this refinement is tabulated under π∞(2). The error metrics reported are computed using the refined algorithm; however, since both algorithms are run with the same very stringent tolerance (ε = 1e−7), the solutions obtained are comparable. The number
Table 6.1: Error in camera calibration parameters and number of iterations for convergence, using random synthetic data. Calibration errors are reported relative to ground truth. All quantities reported are averaged over 50 trials. σ stands for percentage noise in image coordinates and m stands for number of views.
σ (%)   m    ∆p        ∆f        ∆uv       ∆s        π∞(1)   π∞(2)   ω∗
0       5    3.65e-5   3.28e-5   3.01e-5   2.89e-5   31.1    11.0    32.1
        10   2.51e-6   2.14e-6   2.03e-6   2.00e-6   18.7    4.2     40.9
        20   1.28e-6   1.51e-6   1.33e-6   1.73e-6   21.4    1.6     31.8
        40   9.08e-7   8.24e-7   7.99e-7   7.58e-7   23.8    1.1     27.5
0.1     5    4.76e-4   4.59e-4   4.22e-4   4.05e-4   27.3    9.9     36.3
        10   3.44e-4   3.07e-4   2.73e-4   2.91e-4   17.2    3.5     44.2
        20   2.75e-4   2.92e-4   2.56e-4   2.31e-4   16.1    2.4     33.0
        40   2.55e-4   2.41e-4   2.06e-4   1.85e-4   23.2    7.9     30.1
0.2     5    1.19e-3   1.14e-3   9.92e-4   8.73e-4   41.0    12.5    38.4
        10   7.65e-4   7.13e-4   7.01e-4   6.85e-4   24.7    4.5     47.9
        20   6.03e-4   6.80e-4   5.12e-4   5.79e-4   19.5    7.0     34.5
        40   5.59e-4   6.05e-4   4.29e-4   5.10e-4   33.2    10.6    31.6
0.5     5    3.29e-3   3.22e-3   3.02e-3   2.63e-3   63.2    11.6    42.7
        10   1.66e-3   2.19e-3   1.99e-3   2.11e-3   29.8    6.2     51.1
        20   1.41e-3   1.84e-3   1.43e-3   1.55e-3   22.3    8.2     38.2
        40   1.25e-3   1.50e-3   1.18e-3   9.06e-4   46.6    20.6    32.4
1.0     5    4.68e-3   4.04e-3   3.77e-3   3.36e-3   74.1    9.9     45.5
        10   3.15e-3   2.88e-3   2.52e-3   2.15e-3   36.4    9.2     56.8
        20   2.86e-3   2.45e-3   2.02e-3   1.74e-3   31.0    15.8    40.9
        40   2.79e-3   2.21e-3   1.76e-3   1.30e-3   56.4    23.6    38.9
of iterations for the DIAC estimation (with no refinement) are tabulated under ω∗. The metric upgrade step was performed with ε = 1e−5. The termination criterion measures the gap, ε, between the lowest lower bound and the current best objective function value. Figure 6.4 plots the errors graphically. The accuracy of the algorithm is evident from the very low error rates obtained for reasonable noise levels. It is interesting that the algorithm performs quite well even for noise as high as 1%. In general, the accuracy improves as expected when more cameras are used. To demonstrate scalable runtime behavior, Figure 6.5 plots the runtime for the affine and metric upgrade stages for the random data experiment with 0.1% noise, for
(a) Error in plane at infinity    (b) Error in focal length
(c) Error in principal point    (d) Error in pixel skew
Figure 6.4: Error in camera calibration parameters for random synthetic data. The errors in the graphs are plotted relative to ground truth for the indicated quantities. All quantities reported are averaged over 50 trials.

varying numbers of cameras. These experiments were conducted on a Pentium IV, 1 GHz computer with 1GB of RAM. Note that the graceful variation in the runtime behavior is a direct outcome of our bounds propagation schemes, without which the branch and bound algorithms would display exponential characteristics. Our code is unoptimized MATLAB with an off-the-shelf SDP solver (Sturm, 1999), so the actual magnitude of these timings should be understood only as rough qualitative indicators.1 While the metrics in Table 6.1 are intuitive for evaluating the intrinsic parameters, it is not readily evident how ∆p should be interpreted. Towards that end, we perform a set of experiments, inspired by (Pollefeys and van Gool, 1999), where three mutually orthogonal 5 × 5 grids are observed by varying numbers of randomly placed cameras.

1Prototype code available at http://vision.ucsd.edu/stratum.
(a) Affine upgrade    (b) Metric upgrade
Figure 6.5: Runtime behavior of the branch and bound algorithms for the affine and metric upgrade steps. All timings reported are averaged over 50 trials.
Noise ranging from 0.1 to 1% is added to the image coordinates. The quality of the affine upgrade is indirectly inferred from the deviation from parallelism in the reconstructed grid lines, while the quality of the metric upgrade is inferred from the deviation from orthogonality. Table 6.2 reports the results of this experiment and Figure 6.6 shows the results graphically.
(a) Deviation from parallelism (Affine upgrade)    (b) Deviation from orthogonality (Metric upgrade)
Figure 6.6: Errors in affine and metric properties for three synthetic, mutually orthogonal planar grids. The graphs plot (a) angular deviation from parallelism after the affine upgrade and (b) angular deviation from orthogonality after the metric upgrade, both measured in degrees. All quantities reported are averaged over 50 trials.
Again, we observe that the algorithm achieves very good accuracy for reasonable
Table 6.2: Error in affine and metric properties for three synthetically generated, mutually orthogonal planar grids. The table reports mean angular deviations from parallelism and orthogonality, measured in degrees. All quantities reported are averaged over 50 trials.
Noise (%)   Views   Affine (Parallel)   Metric (Parallel)   Metric (Perpendicular)
0           5       2.40e-6             2.46e-6             3.07e-6
            10      7.90e-7             8.12e-7             1.12e-6
            20      5.22e-7             5.50e-7             8.70e-7
            40      3.67e-7             3.88e-7             6.23e-7
0.1         5       0.40                0.40                0.34
            10      0.27                0.27                0.23
            20      0.19                0.19                0.15
            40      0.13                0.13                0.10
0.2         5       0.79                0.80                0.63
            10      0.54                0.54                0.44
            20      0.36                0.36                0.25
            40      0.25                0.25                0.19
0.5         5       1.95                1.96                1.88
            10      1.31                1.31                1.02
            20      0.89                0.89                0.79
            40      0.64                0.64                0.57
1.0         5       4.05                4.07                3.97
            10      2.63                2.63                2.30
            20      1.83                1.83                1.52
            40      1.27                1.27                1.09
noise and performs quite well even for 1% noise. With just 5 cameras, it is quite likely for the configuration to be ill-conditioned or degenerate, which causes the algorithm to break down in some cases. We also use this experimental setup to compare against traditional local optimization approaches. In the first set of experiments, for a fixed number of views (20) and varying noise levels, we minimize the modulus constraints in (6.21) using a Levenberg-Marquardt scheme with 50 random initializations. It can be seen from Figure 6.7 that the globally optimal solution of Section 6.5 yields more accurate solutions. Finally, once the plane at infinity has been estimated, the metric upgrade of
Figure 6.7: Comparison of the accuracy of the plane at infinity estimated by the globally optimal method (black dotted curve) and a Levenberg-Marquardt routine with 50 random initializations (red curve). The number of views is 20. The accuracy of the affine upgrade is deduced from the extent to which parallelism is preserved in the reconstruction.
Section 6.6 is compared against the traditional linear approach. The two drawbacks of the linear approach discussed previously, namely overlooking the positive semidefiniteness requirement and using a suboptimal scale factor, are illustrated by these experiments. Table 6.3 shows the percentage of trials in which the estimated DIAC is not positive semidefinite. The number of such instances is higher for fewer views and increases with the noise level. In such cases, it is not possible to decompose the DIAC into the intrinsic calibration matrix without a further approximation that projects onto the closest point on the cone of positive semidefinite matrices. The DIAC estimation here is based on an affine reconstruction performed using the optimal plane at infinity estimated by the method in Section 6.5. Next, for a fixed number of views (m = 5), the deviation from orthogonality observed in the globally optimal metric reconstruction is compared to the deviation obtained from the linear method, across varying noise levels (Figure 6.8). Two sets of metric reconstructions are performed for the linear method. In the first case, the metric upgrade is performed starting with the affine reconstruction obtained using the plane at infinity estimated by a local optimization method (Levenberg-Marquardt with 50 random initializations); in the other case, the affine upgrade is computed using the optimal
Table 6.3: Percentage of trials in which the linear method for DIAC estimation returns a solution that is not positive semidefinite. All quantities reported are computed over 50 trials.
                 Noise (%)
    Views    0     0.1    0.2    0.5    1.0
      5      0     4      8      8      14
     10      0     2      6      4      10
     20      0     0      4      4      6
     40      0     0      0      2      4
method in Section 6.5, while the metric upgrade is linear. Clearly, the globally optimal DIAC estimation outperforms the linear method across all noise levels, which shows the importance of estimating the normalization factors λi introduced in Section 6.6.2.
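When a linear DIAC estimate fails to be positive semidefinite, as counted in Table 6.3, the only recourse before a Cholesky decomposition is to project it onto the closest point on the cone of positive semidefinite matrices. A minimal sketch of that Frobenius-nearest projection (illustrative code, not part of our pipeline):

```python
import numpy as np

def nearest_psd(omega):
    """Frobenius-nearest PSD approximation of a symmetric matrix:
    eigendecompose and clamp the negative eigenvalues to zero."""
    sym = 0.5 * (omega + omega.T)
    vals, vecs = np.linalg.eigh(sym)
    return vecs @ np.diag(np.clip(vals, 0.0, None)) @ vecs.T

# An indefinite matrix, such as a noisy linear DIAC estimate might be.
omega = np.array([[1.0, 0.2, 0.1],
                  [0.2, -0.05, 0.0],
                  [0.1, 0.0, 1.0]])
omega_psd = nearest_psd(omega)
assert np.min(np.linalg.eigvalsh(omega_psd)) >= -1e-12
```

Only after such a projection can a factorization ω∗ = KK⊤ succeed, and it is precisely this extra approximation that the globally optimal method avoids.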
Figure 6.8: Comparison of the globally optimal metric upgrade algorithm with a tra- ditional linear method for estimating the DIAC. The accuracy of the reconstruction is deduced by the extent to which orthogonality is preserved in the metric reconstruction. The solid red curve plots the deviation from orthogonality when a local optimization method is used for estimating the plane at infinity and a linear method is used for estimat- ing the DIAC. The dotted black curve uses a linear method for DIAC estimation, but the optimal plane at infinity is used for affine upgrade. The dashed blue curve uses optimal algorithms for both the affine and metric upgrade steps.
An important consideration for noisy situations is the sensitivity of the chirality bounds to outliers. Similar to (Nistér, 2004), for noisy images expected in a real-world scenario, chirality bounds are computed using only the camera centers. The reason is that there are usually far more points than cameras, and the camera centers are likely to be estimated more robustly than 3D points.
(a) Four of the twelve input images used for reconstruction
(b) Two views of the obtained 3D reconstruction
Figure 6.9: In the reconstruction, targets on the same plane are represented by lines of the same color. The second view of the obtained 3D reconstruction shows that the angle between targets on adjacent walls is recovered as nearly 90 degrees.
To demonstrate performance on real data, we consider images of marker targets on two orthogonal walls (Figure 6.9(a)). Using image correspondences from detected corners, we perform a projective reconstruction using the projective factorization algorithm (Sturm and Triggs, 1996) followed by bundle adjustment. The normalization procedure and exact implementation follows the description in (Hartley and Zisserman, 2004). Bounds on the plane at infinity are computed using chirality, c.f. equation (3.32). Focal lengths are assumed to lie in the interval [500, 1500], the principal point within [250, 450] × [185, 385] (image size is 697 × 573) and the skew in [−0.1, 0.1]. The plane at infinity and DIAC are estimated using our algorithm. While ground truth measurements for the scene are not available, we can indirectly infer some observable properties. The coplanarity of the individual targets is indicated by the ratio of the first eigenvalue of the covariance matrix of their points to the sum of eigenvalues. This ratio is
measured to be 3.1 × 10⁻⁶, 4.1 × 10⁻⁵, 6.2 × 10⁻⁵ and 4.1 × 10⁻⁴ for the four polygonal targets. The angle between the normals to the planes represented by two targets on the adjacent walls is 88.1° in our metric reconstruction (Figure 6.9(b)). The same angle is measured as 89.8° in a reconstruction using (Pollefeys et al., 2002). The precise ground truth angle between the targets is unknown. All the results that we have reported are for the raw output of our algorithm. In practice, a few iterations of bundle adjustment following our algorithms might be used to achieve a slightly better estimate.
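The coplanarity measure used above, the ratio of the first (smallest) eigenvalue of the covariance matrix of a target's 3D points to the sum of the eigenvalues, can be computed as follows; this is a sketch on synthetic points, not the actual targets of Figure 6.9:

```python
import numpy as np

def coplanarity_ratio(points):
    """Ratio of the smallest eigenvalue of the 3x3 covariance of a 3D
    point set to the sum of eigenvalues; near zero for coplanar points."""
    centered = points - points.mean(axis=0)
    cov = centered.T @ centered / len(points)
    vals = np.linalg.eigvalsh(cov)     # eigenvalues in ascending order
    return vals[0] / vals.sum()

rng = np.random.default_rng(0)
planar = rng.uniform(-1, 1, size=(50, 3))
planar[:, 2] = 0.3 * planar[:, 0] - 0.2 * planar[:, 1]  # force onto a plane
volumetric = rng.uniform(-1, 1, size=(50, 3))
assert coplanarity_ratio(planar) < 1e-10
assert coplanarity_ratio(volumetric) > 0.01
```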
6.8 Conclusions and Further Discussions
In this chapter, we have presented globally optimal solutions to the affine and metric stages of stratified autocalibration. Although our cost function is algebraic, this is the first work that provides optimality guarantees for scalable stratified autocalibration.

For the success of a branch and bound scheme, it is of utmost importance that the convex relaxations be as tight as possible. The second order cone programming based convex relaxation that we develop for solving the affine upgrade step and the semidefinite programming based convex relaxation for the metric upgrade step satisfy this requirement, while also being very fast to compute in practice.

Sometimes, a consideration of the practicality of the convex relaxation influences the choice of the algebraic form of an objective function. Indeed, the most straightforward way to minimize the modulus constraints would be to use the simpler formulation of (6.20) and construct a multi-level relaxation for the quartic polynomials by successively using the bilinear relaxation of Section B.2. However, in our experience, such multistep relaxations are very loose in practice, so the branch and bound algorithm will not converge in a reasonable amount of time. Thus, the least squares version of the modulus constraints we globally minimize corresponds to the reformulated version in (6.21).

A crucial aspect of designing a global optimization algorithm based on branch and bound is the choice of initial region, which must be principled and guaranteed to contain the optimal solution. Arbitrarily choosing a very large initial region will lead to impractically long convergence times for the branch and bound, while too restrictive a choice might not contain the globally optimum point. Our affine upgrade step addresses this issue by incorporating chirality constraints within the convex relaxation for the modulus constraints. In practice, this limits the location of the plane at infinity to a small region of the search space.
For the metric upgrade step, the entries of the DIAC that we wish to estimate are related to more tangible entities corresponding to the internal parameters of the camera. So, a user can easily specify reasonable bounds on the focal length, pixel skew and principal point, which are propagated to initial bounds on the DIAC using interval arithmetic.

Several important extensions to the methods introduced in this chapter can be envisaged. For instance, an L1-norm formulation will allow us to use an LP solver for the affine upgrade, making it possible to solve larger problems faster. To the best of our knowledge, it remains an open question whether methods similar to those proposed in this chapter can be used for obtaining optimal solutions to the direct autocalibration problem, which is the subject of the next chapter of this dissertation.

Finally, we reiterate that pragmatic application of domain knowledge is important for successfully employing an optimization paradigm on a computer vision problem. Indeed, it is the careful consideration of multiview geometry in choosing the initial region, constructing convex relaxations and restricting the dimensionality of the search space within a branch and bound framework that allows the globally optimal algorithms presented in this chapter to be practical. We are hopeful that, in the near future, methods not unlike ours will be used to exploit underlying convexities to successfully optimize challenging problems in other areas of computer vision, besides multiview geometry.
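The propagation of user-specified intrinsic bounds to initial bounds on the DIAC, mentioned above, can be sketched with elementary interval arithmetic. The bounds below are hypothetical, and the entry ω∗11 = fx² + s² + u² follows from ω∗ = KK⊤:

```python
def interval_add(a, b):
    """Sum of two intervals, represented as (lower, upper) tuples."""
    return (a[0] + b[0], a[1] + b[1])

def interval_sq(a):
    """Square of an interval; the lower bound is 0 if it straddles zero."""
    lo, hi = a[0] ** 2, a[1] ** 2
    if a[0] <= 0.0 <= a[1]:
        return (0.0, max(lo, hi))
    return (min(lo, hi), max(lo, hi))

# Hypothetical user-specified bounds on the intrinsics (in pixels).
f = (500.0, 1500.0)   # focal length
s = (-0.1, 0.1)       # skew
u = (250.0, 450.0)    # principal point, x coordinate

# Bounds on omega*_11 = f^2 + s^2 + u^2 by interval arithmetic.
omega11 = interval_add(interval_add(interval_sq(f), interval_sq(s)),
                       interval_sq(u))
assert omega11[0] == 500.0 ** 2 + 250.0 ** 2
```

Each DIAC entry is a polynomial in the intrinsics, so the same two primitives suffice for all of them.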
Most of the content of this chapter is based on “Globally Optimal Affine and Metric Upgrades in Stratified Autocalibration”, by M. K. Chandraker, S. Agarwal, D. J. Kriegman and S. Belongie, as it appears in (Chandraker et al., 2007b).

Chapter 7
Direct Autocalibration
“Confidence is a good name for what is intended by the term directness. .... It denotes the straightforwardness with which one goes at what he has to do.”
John Dewey (American pragmatist, 1859-1952), The Traits of Individual Method (Democracy and Education)
7.1 Introduction
While a projective reconstruction of the scene can be computed from image coordinates alone, the goal of autocalibration is to compute the projective transformation (homography) that upgrades the projective reconstruction to a metric reconstruction. Chapter 6 achieved this in a stratified manner, with an intermediate affine reconstruction step. In this chapter, we will propose a global optimization method to directly upgrade a projective reconstruction to a metric one. The basic tenet of autocalibration is the constancy of the absolute conic under rigid body motion of the camera. This is encoded conveniently in the absolute dual quadric (ADQ) formulation of autocalibration (Triggs, 1997). Recall from Section 2.5.3 that the absolute dual quadric, also denoted Q∗∞, is a 4 × 4 rank-degenerate quadric
with the canonical representation:

    Q∗∞ = Ĩ = [ I3×3   0 ]
              [ 0⊤     0 ]     (7.1)
The ADQ is the dual of the absolute conic and the plane at infinity is its null vector, so estimating it can directly upgrade a projective reconstruction to a metric one. Indeed, once estimated, an eigenvalue decomposition of the ADQ yields the homography that relates the projective reconstruction to a Euclidean reconstruction. From Section 2.5.3, it is evident that the ADQ must satisfy some important properties:
1. Q∗∞ is a degenerate quadric, so it must be rank-deficient.

2. The null-space of Q∗∞ is the plane at infinity, π∞.

3. Q∗∞ is positive (or negative, depending on scale) semi-definite.

Note that these properties are not mere algebraic conveniences; they are geometric requirements essential for a quadric to be considered as the ADQ.
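These three properties can be verified numerically for an ADQ moved out of its canonical position by an arbitrary projective transformation; a sketch with a randomly generated H (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((4, 4))              # a generic projective transform
I_tilde = np.diag([1.0, 1.0, 1.0, 0.0])      # canonical ADQ
Q = H @ I_tilde @ H.T                        # ADQ in a projective frame

# 1) rank deficiency
assert np.linalg.matrix_rank(Q, tol=1e-10) == 3
# 2) its null space is the plane at infinity: Q pi_inf = 0
pi_inf = np.linalg.svd(Q)[2][-1]
assert np.allclose(Q @ pi_inf, 0.0, atol=1e-8)
# 3) positive semidefinite (up to an overall sign)
assert np.min(np.linalg.eigvalsh(Q)) >= -1e-8
```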
The common practice in estimating Q∗∞ is to enforce its rank degeneracy as a post-processing step by simply dropping the smallest singular value. The rank degeneracy of the absolute quadric has an important physical interpretation: enforcing it is equivalent to demanding a common support plane for the absolute conic over the multiple views. That, indeed, is the real advantage that an absolute quadric based method affords over one based on, say, the Kruppa constraints. Thus, for more than two views, it does not make sense to estimate Q∗∞ without enforcing the rank condition. In this chapter, we propose a method for estimating the absolute quadric where its rank deficiency is imposed within the estimation procedure. A significant drawback of several prior approaches for autocalibration is the difficulty in ensuring a positive (or negative, depending on scale) semidefinite Q∗∞. As has been discussed in the literature (Hartley and Zisserman, 2004), it is usually not correct to simply output the closest positive semidefinite matrix as a post-processing step, since it might lead to a spurious
calibration. Our formulation explicitly demands a positive (or negative) semidefinite Q∗∞ as an output of our optimization problem, which ensures that the resulting dual image of the absolute conic (DIAC) can be decomposed into its Cholesky factors to yield the calibration parameters of the cameras. It is well established that the principal difficulty in autocalibration lies in the affine upgrade step, which involves a precise estimation of the plane at infinity, π∞. In our approach, estimating the plane at infinity is encapsulated in the absolute quadric estimation itself, as π∞ lies in the null-space of Q∗∞. Further, it is reasonable to demand that chirality holds, that is, the points and camera centers lie on one side of the plane at infinity (Hartley, 1998a). It has been argued in recent literature (Nistér, 2004) that it is most important for any reconstruction method to start by satisfying the requirements of chirality, in particular with respect to the camera centers. Intuitively, moving across the plane at infinity requires jumps across large “basins” in the search space. Thus, imposing chirality constraints increases the chances of an autocalibration routine starting in the correct region of the search space. Given reasonable assumptions on the internal parameters of the camera, such as zero skew and unit aspect ratio, we pose the problem of estimating the positive semidefinite, rank-degenerate absolute quadric as one of minimizing a polynomial objective function, subject to polynomial equality and inequality conditions. Chirality conditions translate into polynomial inequality constraints on the entries of the absolute quadric, so they can be readily included in the polynomial system that we seek to minimize. Our formulation allows us to compute the global minimum for such a polynomial system using a series of convex linear matrix inequality relaxations (Kahl and Henrion, 2005; Lasserre, 2001). In summary, the contributions of this chapter are:
• We present a fast and reliable method for autocalibration by estimating the absolute dual quadric, where its rank degeneracy and positive semidefiniteness are imposed within our optimization framework.

• We can demand that our reconstruction satisfy the requirements of chirality by imposing constraints on the plane at infinity during our estimation procedure.

• We globally minimize a reasonable objective function based on camera matrices alone to deduce the entries of the absolute quadric subject to all the above constraints.
7.2 Background
We follow the same notations and conventions as the previous chapters, which we restate here for ease of reference. Unless stated otherwise, we will denote 3D points by homogeneous 4-vectors
(such as X = (X1, X2, X3, X4)⊤) and 2D points by homogeneous 3-vectors (such as x = (x1, x2, x3)⊤). A projective camera is represented as P = K[R | t], where the intrinsic calibration parameters of the camera are encoded in the upper triangular matrix K and (R, t) denote the exterior orientation of the camera. One of the objectives in autocalibration is to recover the matrix K, which is parametrized as

    K = [ fx   s    u ]
        [ 0    fy   v ]     (7.2)
        [ 0    0    1 ]
where fx, fy stand for the focal lengths in the x and y directions, s denotes skew and (u, v) the position of the principal point. It is usual to estimate K by estimating the dual image of the absolute conic, since
    ω∗ = KK⊤.     (7.3)
Suppose we have a projective reconstruction {Pi, Xj} for i = 1, …, m and j = 1, …, n. Then, the goal of 3D reconstruction is to determine the projective transformation H that takes the projective reconstruction back to a Euclidean one {P̂i, X̂j}, where

    P̂i = Pi H ,      i = 1, …, m
    X̂j = H⁻¹ Xj ,    j = 1, …, n.     (7.4)
7.2.1 Autocalibration Using the Absolute Dual Quadric
From Section 2.5.3, under a projective transformation H, the ADQ moves out of its canonical position to Q∗∞ = HĨH⊤. Thus, an eigendecomposition of the estimated ADQ in a projective reconstruction immediately gives us the rectifying homography H that upgrades to a metric reconstruction. The ADQ is the dual of the absolute conic and its image under a camera P is the dual image of the absolute conic (DIAC):
    ω∗ = P Q∗∞ P⊤.     (7.5)
This follows from the projection properties of dual quadrics, see Result 4. From (7.2) and (7.3), it follows that the DIAC has the form

    ω∗ = [ fx² + s² + u²   s·fy + uv   u ]
         [ s·fy + uv       fy² + v²    v ]     (7.6)
         [ u               v           1 ]
Thus, imposing some constraints on the internal parameters of a camera translates to constraints on the entries of the DIAC, which in turn yields a relation between the entries of the ADQ.
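A sketch of how the rectifying homography mentioned above can be read off an eigendecomposition of the estimated ADQ, verified here against a synthetic ground-truth H (the function name is ours, for illustration):

```python
import numpy as np

def rectifying_homography(Q):
    """Recover H with Q = H diag(1,1,1,0) H^T from an eigendecomposition
    of the (PSD, rank-3) ADQ. The zero-eigenvalue column of H is scaled
    arbitrarily; any nonzero value works, since I~ annihilates it."""
    vals, vecs = np.linalg.eigh(Q)
    order = np.argsort(vals)[::-1]                 # descending eigenvalues
    vals = np.clip(vals[order], 0.0, None)
    vecs = vecs[:, order]
    scale = np.sqrt(vals)
    scale[3] = 1.0                                 # arbitrary nonzero value
    return vecs @ np.diag(scale)

# Synthetic check: start from a known homography, rebuild it from Q.
rng = np.random.default_rng(2)
H_true = rng.standard_normal((4, 4))
I_tilde = np.diag([1.0, 1.0, 1.0, 0.0])
Q = H_true @ I_tilde @ H_true.T
H = rectifying_homography(Q)
assert np.allclose(H @ I_tilde @ H.T, Q, atol=1e-8)

# The DIAC of any camera P then follows from (7.5): omega* = P Q P^T.
P = rng.standard_normal((3, 4))
omega = P @ Q @ P.T
assert np.allclose(omega, omega.T)
```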
Linear autocalibration using the ADQ
If the principal point is known, then it is possible to obtain linear equations in the entries of the ADQ. Since the ADQ is representable as a symmetric 4 × 4 matrix, let us parametrize it in terms of the 10 unknowns of its upper-triangular part. Since the principal point defines the origin of the image coordinate system, knowing it is equivalent to assuming that the principal point is at the origin, which can be ensured by a simple translation of the coordinate system. Then, from (7.6),
    u = 0 and v = 0   ⇒   [ω∗i]13 = 0 and [ω∗i]12 = 0
                      ⇒   (Pi Q∗∞ Pi⊤)13 = 0 and (Pi Q∗∞ Pi⊤)12 = 0,     (7.7)

where the latter follows from (7.5). Thus, knowing the principal point in m ≥ 5 views yields 2m equations in the 10 entries of the ADQ, which can be solved for using a singular value decomposition (SVD).
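A sketch of this linear method: each constraint (Pi Q∗∞ Pi⊤)jk = 0 is linear in the 10 upper-triangular entries of the symmetric quadric, so stacking two rows per view and taking the SVD null vector recovers Q∗∞ up to scale. The check below uses synthetic noise-free cameras with principal point at the origin, in a metric frame where the true ADQ is diag(1, 1, 1, 0):

```python
import numpy as np

def row_for_entry(P, j, k):
    """Coefficients of (P Q P^T)_{jk} in the 10 upper-triangular entries
    of a symmetric 4x4 quadric Q, ordered (0,0),(0,1),...,(3,3)."""
    row = []
    for a in range(4):
        for b in range(a, 4):
            c = P[j, a] * P[k, b]
            if a != b:
                c += P[j, b] * P[k, a]   # Q[a,b] and Q[b,a] are equal
            row.append(c)
    return row

def estimate_adq_linear(cameras):
    """Known-principal-point method: (P_i Q P_i^T)_{12} = (..)_{13} = 0
    for every view; least-squares null vector via SVD."""
    A = []
    for P in cameras:
        A.append(row_for_entry(P, 0, 1))
        A.append(row_for_entry(P, 0, 2))
    q = np.linalg.svd(np.asarray(A))[2][-1]
    Q, idx = np.zeros((4, 4)), 0
    for a in range(4):
        for b in range(a, 4):
            Q[a, b] = Q[b, a] = q[idx]
            idx += 1
    return Q

rng = np.random.default_rng(3)
cams = []
for _ in range(6):
    R = np.linalg.qr(rng.standard_normal((3, 3)))[0]   # random rotation
    f = rng.uniform(0.5, 2.0)
    cams.append(np.diag([f, f, 1.0]) @ np.hstack([R, rng.standard_normal((3, 1))]))

Q = estimate_adq_linear(cams)
Q *= np.sign(Q[0, 0]) / np.linalg.norm(Q)              # fix scale and sign
I_tilde = np.diag([1.0, 1.0, 1.0, 0.0])
assert np.allclose(Q, I_tilde / np.linalg.norm(I_tilde), atol=1e-6)
```

Note that this recovers the ADQ without enforcing its rank degeneracy or semidefiniteness, which is exactly the shortcoming addressed later in this chapter.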
Nonlinear autocalibration constraints on the ADQ
Rather than a known principal point, there can be other assumptions on the internal parameters of a camera, which translate into nonlinear constraints on the entries of the ADQ. For instance, assuming zero pixel skew leads to a quadratic equation in the entries of the DIAC:
    [ω∗]12 [ω∗]33 = [ω∗]13 [ω∗]23     (7.8)

which translates into a quadratic relation in the entries of the ADQ. Similarly, assuming that the internal parameters of the cameras are constant for the image sequence, the DIACs for any two views i and j satisfy the condition that
    [ω∗i]kl / [ω∗j]kl = constant ,   for k = 1, 2, 3 and k ≤ l ≤ 3 ,     (7.9)

which again translates into a quadratic relation in the entries of the ADQ.
7.2.2 Chirality
As reviewed in Section 3.5, chirality constraints demand that the reconstruction satisfy a very basic criterion: the imaged scene points must be in front of the camera (Hartley, 1998a). A general projective transformation need not preserve the convex hull of a point set, that is, the scene can be split across the plane at infinity in a projective reconstruction. A quasi-affine reconstruction is one that differs from the Euclidean scene by a projective transformation, but in which the plane at infinity is guaranteed not to split the convex hull of the set of points and camera centers. A quasi-affine reconstruction can be computed from the solution of the so-called “chiral inequalities” determined by all the scene points and camera centers and given by (3.31).
7.3 Related Work
There is a significant body of literature within computer vision that deals with autocalibration, beginning with the introduction of the concept in (Faugeras et al., 1992). Approaches to autocalibration can be broadly classified as stratified and direct. The former is a two-step process, whereby the first step involves estimating the plane at infinity for an upgrade to the affine stratum, and the metric upgrade is typically performed by estimating K in a subsequent step. Estimating the plane at infinity to achieve an affine upgrade is considered the most difficult step in autocalibration (Hartley et al., 1999). The plane at infinity itself has proven to be a rather elusive entity to estimate precisely. A prior approach has been to exhaustively compute all 64 solutions to the modulus constraints (Pollefeys et al., 1996), although only 21 of them are physically realizable (Schaffalitzky, 2000). An alternate approach involves solving a linear program arising from chirality constraints imposed on the points and camera centers to delineate the region in R³ where the first three coordinates of the plane at infinity, parametrized as π∞ = (p, 1)⊤, must lie. Subsequently, p is recovered by a brute force search within this region (Hartley et al., 1999). In our approach, estimating the plane at infinity is encapsulated in the absolute quadric estimation itself, as π∞ lies in the null-space of Q∗∞. A variety of linear methods exists for estimating K for the metric upgrade step, see (Hartley and Zisserman, 2004) for more discussion. A drawback of linear approaches is that they do not enforce positive semidefiniteness of the DIAC. One work that the authors are aware of, where estimation of the DIAC is constrained to be positive semidefinite, is (Agrawal, 2004).
The class of direct approaches to autocalibration are those that directly compute the metric reconstruction from a projective one by estimating the absolute conic. Kruppa equations are view-pairwise constraints on the projection of the absolute quadric. Methods based on the Kruppa equations (or the fundamental matrix), such as (Manning and Dyer, 2001), are known to suffer from additional ambiguities when used for autocalibration with three or more views (Sturm, 2000). The absolute quadric was introduced as a device for autocalibration in (Heyden and Åström, 1996; Triggs, 1997), as it is a convenient representation for both the absolute conic and the plane at infinity. Moreover, it can simultaneously be estimated over multiple views. Constraints on the DIAC can be transferred to those on Q∗∞ using the (known) cameras in the projective reconstruction. The actual solution methods proposed in (Triggs, 1997) are a linear approach and one based on sequential quadratic programming. Linear initializations for estimating the dual quadric are also discussed in (Pollefeys et al., 1998). It is known that these methods, which do not ensure positive semidefiniteness, are liable to perform quite poorly with noisy data. While it can be shown that zero skew alone is sufficient for a metric reconstruction (Heyden and Åström, 1998), a practitioner must use as much of the available information as possible (Hartley and Zisserman, 2004). We will adopt this latter philosophy and make the safe assumptions that the skew is close to zero, the principal point close to the origin and the aspect ratio close to unity. We allow the focal length to vary to account for varying zoom. Very recently, there has been interest in developing globally optimal solutions to several problems in multiview geometry.
A number of simpler problems in multiview geometry can be formulated in terms of systems of polynomial inequalities (Kahl and Henrion, 2005), which can be globally minimized using the theory of convex linear matrix inequality relaxations (Lasserre, 2001). The inherent fractional nature of multiview geometry problems is exploited in (Agarwal et al., 2006) to compute the globally optimal solution to triangulation and resectioning, with a certificate of optimality. A branch-and-bound method is used for autocalibration in (Fusiello et al., 2004); however, their problem is formulated in terms of the fundamental matrix of view pairs and does not scale beyond a small number of views.
7.4 Problem Formulation
Many self-calibration algorithms, such as (Manning and Dyer, 2001; Fusiello et al., 2004), do not estimate the absolute quadric directly. As compared to algorithms such as (Triggs, 1997; Nistér, 2004) which estimate the absolute quadric, our main contribution is that we constrain Q∗∞ to be positive semidefinite and rank degenerate within our optimization framework. Our optimization problems themselves have been designed with the intention of extracting a globally minimal solution. Further, we give the user the option to impose the requirements of chirality within the same unified framework, if needed. It has been argued in prior literature that it is desirable to ensure that chirality is satisfied only with respect to the set of camera centers (Nistér, 2004). The reason is that cameras are estimated using robust techniques from several points, so they exhibit better statistical properties than the points themselves. And a few outliers in the scene points can make it impossible to satisfy the requirements of full chirality. Similar to the quasi-affine reconstruction with respect to camera centers (QUARC) in (Nistér, 2004), a pairwise twist test ensures that the plane at infinity cannot violate the line segment joining any pair of camera centers. By subsequently imposing the condition on our reconstruction that the plane at infinity lie on one side of all the camera centers, we are guaranteed to recover a Q∗∞ consistent with the requirements of chirality.
7.4.1 Imposing rank degeneracy and positive semidefiniteness of Q∗∞
Let us suppose an appropriate objective function, f(Q∗∞), has been defined, which depends on the parameters of Q∗∞ and imposes some desired property on the metric reconstruction. In addition, we demand that the absolute quadric be rank deficient and positive semidefinite. Thus, our optimization problem is of the form:
    min  f(Q∗∞)     (7.10)
    subject to  rank(Q∗∞) < 4 ,   Q∗∞ ⪰ 0.
Since Q∗∞ is a symmetric matrix, it can be parameterized using 10 variables. The condition of rank degeneracy can be imposed by demanding that det Q∗∞ = 0, which is a polynomial of degree 4. The positive semidefiniteness of Q∗∞ can be ensured by asserting that each principal minor of Q∗∞ have a non-negative determinant, which are polynomial equations of degree at most 3. Thus, an equivalent problem is:
    min  f(Q∗∞)     (7.11)
    subject to  det(Q∗∞) = 0
                det |Q∗∞|jk ≥ 0 ,   j = 1, 2, 3 ,   k = 1, …, C(4, j)
                ‖Q∗∞‖F² = 1

where |Q∗∞|jk stands for the k-th j × j principal minor of Q∗∞ and C(4, j) is the number of such minors. Note that one need not impose all of the above inequalities for ensuring semidefiniteness, but doing so may strengthen the convex LMI relaxation. Since Q∗∞ is only defined up to a scale factor, we use the last equality constraint to fix its norm to one. The objective function in the system above is a polynomial and the constraint set consists of polynomial equalities and inequalities, so the program in (7.11) can be minimized globally using the theory in Section 4.2.4.
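Enumerating the principal minors used in (7.11) is straightforward; the sketch below checks positive semidefiniteness of a symmetric 4 × 4 matrix through them (all orders up to 4 are tested here, which is the full classical characterization):

```python
import itertools
import numpy as np

def principal_minors(Q, j):
    """All C(4, j) j x j principal minors (same row and column index sets)."""
    return [np.linalg.det(Q[np.ix_(idx, idx)])
            for idx in itertools.combinations(range(4), j)]

def satisfies_psd_minors(Q, tol=1e-9):
    """A symmetric matrix is PSD iff every principal minor is nonnegative."""
    return all(m >= -tol
               for j in (1, 2, 3, 4)
               for m in principal_minors(Q, j))

I_tilde = np.diag([1.0, 1.0, 1.0, 0.0])
assert satisfies_psd_minors(I_tilde)                        # canonical ADQ is PSD
assert not satisfies_psd_minors(np.diag([1.0, 1.0, 1.0, -0.1]))
```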
7.4.2 Imposing chirality constraints
Next, to impose chirality constraints, recall that the plane at infinity is the null-vector of Q∗∞ and can be expressed (up to scale) as

    π∞ = [ det(Q∗∞⁽¹⁾), det(Q∗∞⁽²⁾), det(Q∗∞⁽³⁾), det(Q∗∞⁽⁴⁾) ]⊤
where Q∗∞⁽ⁱ⁾ represents the 3 × 3 matrix formed by eliminating the fourth row and the i-th column of Q∗∞. The camera center is determined as

    Ci = [ det(Pi⁽¹⁾), det(Pi⁽²⁾), det(Pi⁽³⁾), det(Pi⁽⁴⁾) ]

where Pi⁽ʲ⁾ stands for the i-th camera in the projective reconstruction with its j-th column eliminated.
Now, chirality constraints are of the form π∞⊤ Ci > 0, which is simply a polynomial of degree 3. Thus, even chirality constraints can be included as part of the polynomial system in our formulation for autocalibration.
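Both expressions above are cofactor expansions; a numerical sketch (with the alternating signs of a full cofactor expansion written out explicitly, which the up-to-scale formulas omit):

```python
import numpy as np

def null_vector_by_cofactors(M):
    """Null vector of a rank-deficient 3x4 matrix via cofactor
    expansion: v_j = (-1)^j det(M with column j removed)."""
    return np.array([(-1.0) ** j * np.linalg.det(np.delete(M, j, axis=1))
                     for j in range(M.shape[1])])

rng = np.random.default_rng(4)
H = rng.standard_normal((4, 4))
Q = H @ np.diag([1.0, 1.0, 1.0, 0.0]) @ H.T     # synthetic rank-3 ADQ

# pi_inf from the minors of Q with its fourth row removed.
pi_inf = null_vector_by_cofactors(np.delete(Q, 3, axis=0))
assert np.allclose(Q @ pi_inf, 0.0, atol=1e-6)

# Camera center from the column-deleted determinants of P, then chirality.
P = rng.standard_normal((3, 4))                 # stand-in projective camera
C = null_vector_by_cofactors(P)
assert np.allclose(P @ C, 0.0, atol=1e-6)
chirality_ok = pi_inf @ C > 0                   # the degree-3 sign test
```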
7.4.3 Choice of objective function
Finally, we address the question of the objective function. Over the years, several different objective functions have been specified for autocalibration. The trade-off in designing a suitable objective function, as discussed in (Nistér, 2004), is between retaining geometric meaningfulness and ensuring optimality of the recovered solution. We have looked at various choices of objective function within our polynomial optimization framework, which are described below. In most situations, it is quite reasonable to assume that the skew is close to zero and the aspect ratio close to unity, while a simple transformation of the image coordinates sets the principal point to (0, 0). Let us assume we have enough prior knowledge of the camera intrinsic parameters in our motion sequence to apply a suitable transformation to the coordinate system that brings the intrinsic parameter matrices of the cameras close to identity. With this “pre-conditioning” based on prior knowledge, we can demand that an algebraic condition be satisfied by the entries of the DIAC so that K has the form diag(f, f, 1). One such objective function to be minimized is:
    f(Q∗∞) := Σi [ ([ω∗i]11 − [ω∗i]22)² + [ω∗i]12² + [ω∗i]13² + [ω∗i]23² ].     (7.12)
Recall that ω∗i = Pi Q∗∞ Pi⊤; thus [ω∗i]jk = pi,j⊤ Q∗∞ pi,k, where pi,k⊤ stands for the k-th row of the i-th camera matrix.
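The cost (7.12) is cheap to evaluate given the cameras of the projective reconstruction; the sketch below confirms that it vanishes at the true ADQ for synthetic cameras with K = diag(f, f, 1) in a metric frame:

```python
import numpy as np

def objective(Q, cameras):
    """Cost (7.12): sum over views of ((w11 - w22)^2 + w12^2 + w13^2 + w23^2)
    with w = P Q P^T the DIAC of each view."""
    total = 0.0
    for P in cameras:
        w = P @ Q @ P.T
        total += ((w[0, 0] - w[1, 1]) ** 2 + w[0, 1] ** 2
                  + w[0, 2] ** 2 + w[1, 2] ** 2)
    return total

rng = np.random.default_rng(5)
I_tilde = np.diag([1.0, 1.0, 1.0, 0.0])
cams = []
for _ in range(4):
    R = np.linalg.qr(rng.standard_normal((3, 3)))[0]   # random rotation
    f = rng.uniform(0.5, 2.0)
    cams.append(np.diag([f, f, 1.0]) @ np.hstack([R, rng.standard_normal((3, 1))]))

assert objective(I_tilde, cams) < 1e-12       # zero at the true ADQ
assert objective(np.eye(4), cams) > 1e-12     # generically nonzero elsewhere
```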
Experiments with synthetic data for this objective function are described in Section 7.5 and results are tabulated in Table 7.1. This works well even when we allow focal length to vary such that K is significantly different from identity. Experimental results for this scenario are also given in Table 7.1. A problem with the above objective function is that it is not normalized to account for the scale invariance of ω∗. This can be achieved by dividing each quantity by, say
ω∗33, to obtain the following objective function:
    f(Q∗∞) := Σi [ (([ω∗i]11 − [ω∗i]22)² + [ω∗i]12² + [ω∗i]13² + [ω∗i]23²) / [ω∗i]33² ].     (7.13)

This is a rational objective function, which can be tackled in our polynomial optimization set-up by introducing a new variable corresponding to each view. The optimization problem we address now has the form:
    min  Σ_{i=1..n} ti     (7.14)
    subject to  [ω∗i]33² ti = ([ω∗i]11 − [ω∗i]22)² + [ω∗i]12² + [ω∗i]13² + [ω∗i]23²
                det(Q∗∞) = 0 ,   Q∗∞ ⪰ 0.
The number of variables in the above optimization problem increases linearly with the number of views. So, it can be solved only for a relatively small number of views (three, maybe four) with the current state of the art in polynomial minimization. It is easy to impose further constraints on the problem, such as zero skew, which corresponds to a quadratic polynomial:
    [ω∗i]12 [ω∗i]33 = [ω∗i]13 [ω∗i]23.     (7.15)
Other constraints such as principal point at the origin and unit aspect ratio can be similarly imposed as linear or quadratic polynomial constraints. There can be other principled approaches to formulating a polynomial objective function for autocalibration, assuming constant intrinsic parameters, such as
    f(Q∗∞) := Σi ‖ ω∗ − λi Pi Q∗∞ Pi⊤ ‖.     (7.16)
For a projective reconstruction such that P1 = [ I | 0 ], the ADQ can be parametrized in terms of ω∗ and the three parameters of the plane at infinity, which leads to 10 + n variables for the n-view problem. Our experimental sections will discuss only the results obtained by using the objective functions in (7.12) and (7.14).
7.5 Experiments with synthetic data
We have subjected our algorithm to extensive simulations with synthetic, noisy data to give the reader an idea of its performance in statistical terms. The implementation is done in Matlab with the GloptiPoly toolbox (Henrion and Lasserre, 2003). An LMI relaxation order of δ = 2 (cf. Section 4.2.4) has been used throughout all the experiments and this, in general, yields a global solution to the polynomial optimization problem at hand. Let the ground truth intrinsic calibration matrix be K⁰ and the estimated matrix be K, where

    K⁰ = [ f1⁰   s⁰   u⁰ ]            K = [ f1   s   u ]
         [ 0     f2⁰  v⁰ ]    and         [ 0    f2  v ] .
         [ 0     0    1  ]                [ 0    0   1 ]

Then we define the following metrics to evaluate the performance of the autocalibration algorithm:
    Δf = |f1/f1⁰ − 1| + |f2/f2⁰ − 1|
    Δr = max( r/r⁰ , r⁰/r ) ,   where r = f1/f2 , r⁰ = f1⁰/f2⁰
    Δp = (|u| + |v|)/2 − (|u⁰| + |v⁰|)/2
    Δs = |s − s⁰|.     (7.17)
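The metrics in (7.17) are simple to compute from estimated and ground-truth intrinsics; a sketch (the dictionary-based interface is ours, for illustration):

```python
def calibration_metrics(K, K0):
    """Evaluation metrics of (7.17). K, K0 are dicts with keys
    f1, f2, s, u, v for the estimated and ground-truth intrinsics."""
    df = abs(K['f1'] / K0['f1'] - 1.0) + abs(K['f2'] / K0['f2'] - 1.0)
    r, r0 = K['f1'] / K['f2'], K0['f1'] / K0['f2']
    dr = max(r / r0, r0 / r)
    dp = 0.5 * (abs(K['u']) + abs(K['v'])) - 0.5 * (abs(K0['u']) + abs(K0['v']))
    ds = abs(K['s'] - K0['s'])
    return df, dr, dp, ds

K0 = dict(f1=1.0, f2=1.0, s=0.0, u=0.0, v=0.0)   # diag(f, f, 1) ground truth
assert calibration_metrics(K0, K0) == (0.0, 1.0, 0.0, 0.0)  # the ideal values

K_est = dict(f1=1.02, f2=0.98, s=0.005, u=0.01, v=-0.02)    # hypothetical estimate
df, dr, dp, ds = calibration_metrics(K_est, K0)
```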
For a calibration matrix of the form diag(f, f, 1), the ideal values of these metrics are Δf = 0, Δr = 1, Δp = 0 and Δs = 0. An additional quantity of interest is the chirality check, that is, the percentage of cameras that satisfy the chirality constraints. For the synthetic experiments, there are 12 cameras and 15 points. All the points lie in the cube [−1, 1]³ of side length 2, centered at the origin, and the nominal distance of the cameras from the origin is 2. For the case of constant focal length, the ground truth intrinsic calibration matrix for each camera is K⁰ = diag(1, 1, 1). To simulate variable focal length, the intrinsic calibration matrix is of the form diag(f, f, 1), where f is allowed to attain any random value in the range [0.05, 1.00]. Noise with standard deviation 0.2% of the image size is added to the image coordinates prior to the projective factorization that forms the input to our algorithm. The objective function to be minimized is the one in (7.12). The results are tabulated in Table 7.1, where all the quantities reported are statistics over 100 trials.
The size of the LMI relaxations used to solve for Q∗∞ is a function of the number of variables, the number of constraints and the maximum degree of the polynomials occurring in the objective function and constraints. For the optimization problem (7.12) that we solve in this chapter, none of these quantities depends on the number of views. Thus the time complexity of our algorithm is essentially constant with respect to the number of views. Note that these experiments were conducted with relatively few cameras and points, which sometimes causes the geometry to become ill-conditioned. If that happens, the algorithm can break down due to numerical instabilities, which manifest themselves as constraint violations (such as the few chirality violations above). Another way to detect this is by checking the moment matrix in the LMI relaxation, which should have rank one for globally optimal solutions. A smaller set of experiments was performed using the objective function in (7.14). This was found to perform marginally better than the objective function in (7.12) for experiments conducted with three views. But the algorithm already becomes very computationally intensive for just three views and impossible to solve within reasonable
Table 7.1: Performance on synthetic data. The numbers are mean values over 100 trials for the performance metrics defined in (7.17). For fixed focal length, K0 = diag(1, 1, 1). For variable focal length, K0 = diag(f, f, 1), f ∈ [0.05, 1.00]. 0.2% noise is added in image coordinates. The last row indicates the number of experiments that failed due to numerical issues and were excluded from the results.
Quantity    | Fixed focal length                  | Variable focal length
            | Without chirality | With chirality  | Without chirality | With chirality
∆f          | 0.0086            | 0.0360          | 0.0069            | 0.0328
∆r          | 1.0041            | 1.0315          | 1.0027            | 1.0272
∆p          | 0.0073            | 0.0096          | 0.0055            | 0.0098
∆s          | 0.0051            | 0.0092          | 0.0033            | 0.0050
chirality   | N/A               | 0.9867          | N/A               | 0.9942
failed      | 2                 | 2               | 4                 | 1
memory limits for four or more views. The accuracy gains from this theoretically more correct objective function are not enough to justify the enormous computational expense, so we will restrict ourselves to the objective function (7.12) henceforth.
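The rank-one test on the moment matrix, used above to detect numerical breakdowns, can be sketched as follows (a minimal numerical illustration; the threshold and the synthetic matrices below are our own choices, not taken from the experiments):

```python
import numpy as np

def numerical_rank(M, tol=1e-6):
    """Rank of a (moment) matrix by thresholding singular values
    relative to the largest one."""
    s = np.linalg.svd(M, compute_uv=False)
    if s[0] == 0:
        return 0
    return int(np.sum(s / s[0] > tol))

# A rank-one moment matrix y*y^T signals that the LMI relaxation is
# tight and a globally optimal solution can be extracted; here y is a
# hypothetical vector of monomials evaluated at the minimizer.
y = np.array([1.0, 0.3, -0.7, 0.21])
M_good = np.outer(y, y)
M_bad = M_good + 1e-2 * np.eye(4)   # perturbed: relaxation not tight

print(numerical_rank(M_good))  # 1
print(numerical_rank(M_bad))   # 4
```

In practice the tolerance must be matched to the accuracy of the interior point solver; a moment matrix that fails this test indicates that the relaxation did not certify global optimality.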
7.6 Experiments with real data
In (Nistér, 2004), several image sequences for a variety of scenes are obtained with a hand-held camcorder. The number of images varies from 3 to 125. The images themselves are quite noisy and, although acquired with a constant zoom setting, autofocus effects cause the focal length to vary across the sequence. The resulting projective factorizations were upgraded to metric using a variety of algorithms and their results visually compared. The results are rated on a scale of 0 (severely distorted) to 5 (very good metric reconstruction) according to the qualitative criteria listed in (Nistér, 2004). We evaluate the metric reconstructions obtained by our algorithm for 25 of these sequences using the same qualitative criteria as in (Nistér, 2004). These reconstructions are compared with the output of five prior state-of-the-art methods for autocalibration and
tabulated in Table 7.2. Some sample scene reconstructions using the method proposed in this chapter are depicted in Figure 7.1.¹

Figure 7.1: Metric reconstructions for four real sequences: (a) Flower Pot (61 views), (b) Nissan2 (89 views), (c) David (11 views), (d) Pickup (89 views). The points are plotted in white, the image planes in yellow and the optical axes in green.

Method A is the method of (Nistér, 2004), where a quasi-affine reconstruction is obtained after untwisting the cameras and a non-linear local optimization method is used to minimize an appropriate objective function which specifies some requirements on the intrinsic parameters. Method B is the technique in (Beardsley et al., 1997). Method C is the algorithm in (Hartley, 1994), which uses the full set of chirality constraints to obtain an estimate of the plane at infinity. The method of (Pollefeys et al., 1999) is used to obtain the reconstructions in Method D by minimizing a cost function based on the absolute quadric starting from a linear initialization. Method E is a modified version of (Hartley et al., 1999). More details on the individual methods A-E can be found in the
¹The VRML reconstructions for these and other sequences are available at http://vision.ucsd.edu/quadric.
Table 7.2: Performance on real data, compared to other state-of-the-art approaches. A score of 0 represents a severely distorted metric reconstruction, while 5 stands for a very good one and a “*” denotes cases where numerical errors were encountered. Method F is the algorithm described in this chapter. Method G is the same algorithm, but with chirality imposed with respect to the set of camera centers. Please see the text for references to Methods A-E and details of the qualitative evaluation criteria.
Dataset    Views   A  B  C  D  E  F  G
Basement     9     4  4  4  4  4  2  2
House        9     5  3  5  5  5  5  5
David       11     5  3  5  2  5  5  5
ClockB      13     4  3  4  3  4  4  4
Frode2      15     5  1  5  5  5  5  5
Nissan1     17     5  3  5  5  5  5  5
Stove       19     5  1  5  5  5  5  5
Drunk       21     4  1  4  4  4  *  *
SceneSw     23     5  2  5  3  5  5  5
Wine        28     4  2  4  4  4  4  4
CorrPill    35     5  1  5  5  5  5  5
FlowPt2     43     5  2  5  5  5  5  5
Contai1     57     5  2  -  5  5  5  5
Nissan3     59     5  3  5  0  5  5  5
FlowPt1     61     5  1  5  5  5  5  5
Contai2     65     5  1  -  1  5  5  5
FlowPt3     83     5  1  5  5  5  5  5
StatWk2     85     5  3  5  5  5  5  5
Nissan2     89     5  1  -  5  5  5  5
Pickup      89     5  1  -  4  5  5  5
Bicycles   103     5  5  5  1  5  5  5
GirlsSt2   105     5  1  -  2  5  5  5
StatWk1    107     5  1  5  5  5  5  5
Volvo      117     5  1  -  1  5  5  5
SwBre2     125     5  1  4  5  5  5  5
references above or in (Nistér, 2004). Method F is the method described in this chapter, where we impose the requirement that the estimated dual quadric be positive semidefinite and rank deficient. The optimization problem is given by (7.11) and the objective function used is (7.12).
Method G is the same algorithm, but now chirality is imposed with respect to the set of camera centers. As a general observation, the algorithm performs well for a larger number of cameras. The sequence "Basement" has forward camera motion, while the sequence "Drunk" is a largely planar scene with rotational camera motion. Both these cases are ill-conditioned for our algorithm, and thus the recovered structure is projectively distorted. Apart from these sequences, unless the algorithm breaks down due to numerical instability, the metric reconstructions obtained are on par with those of other state-of-the-art algorithms. Note that all our experiments are performed without imposing any bounds on the camera parameters. If one explicitly enforces prior information, for instance that the aspect ratio typically lies between 0.25 and 3, the reconstruction quality can be improved even in ill-conditioned scenes such as "Basement". Further, the reconstructions evaluated above are the raw output of our algorithm. Since interior point solvers only solve the LMI relaxation up to an ε tolerance, in practice a subsequent bundle adjustment step can be used to further refine the solution. Finally, we demonstrate the importance of global optimization in a comparison with the state-of-the-art, but locally optimal, method of (Pollefeys et al., 2002). Figure 7.2 plots the camera trajectories for a few sequences where the globally optimal method of this chapter clearly outperforms the one in (Pollefeys et al., 2002). Notice the stretching of the inter-camera baseline and the reversal of orientation of some cameras in the trajectories recovered by the local method. This kind of distortion is characteristic of an incorrectly estimated plane at infinity, which leads to a reconstruction inconsistent with the requirements of chirality (Nistér, 2001).
These reconstruction artifacts are not present in the output of our algorithm, which is a direct benefit of global optimization. It is observed that, in general, such chirality violations for the local method are more common in sequences with relatively few views.
[Figure 7.2 panels: sequences SceneSw, ClockB and Nissan1, each comparing the Local Method with the Global Method.]
Figure 7.2: Comparison of the local method of (Pollefeys et al., 2002) with the global method proposed in this chapter. It is clear that the camera trajectories recovered by the local method are sometimes sub-optimal and violate the requirements of chirality. The reconstructions using the algorithm of this chapter are devoid of those artifacts.
7.7 Conclusions
Autocalibration using the ADQ is a mature research topic, and yet none of the previously existing approaches is capable of handling many of the hard, non-convex constraints that should be imposed according to the theory of autocalibration. In this chapter, we have presented a solution that guarantees a theoretically correct estimate of the ADQ by imposing its rank deficiency and positive semidefiniteness. The resulting polynomial system is solved to its global minimum using developments in the theory of convex LMI relaxations. Experiments show that the resulting algorithm is scalable, stable and robust, with performance comparable to other state-of-the-art methods for autocalibration.

At this stage, a comparison is merited between the stratified and direct approaches to autocalibration. From an optimization perspective, the goal of a direct approach is more principled, since it estimates both the plane at infinity and the internal parameters of the camera together using the ADQ, while the stratified approach estimates them sequentially. The stratified approach is important from a pedagogical viewpoint, since an affine reconstruction solves the most difficult part of the transition from a projective to a metric frame. In practice, characterizing the relative benefits of the two approaches is not straightforward, since it is difficult to define a cost function for autocalibration that is purely geometric in nature. The polynomial optimization based approach proposed in this chapter has the benefit of being practical even for a very large number of views, but this scalability comes at the expense of using the theoretically inferior cost function of (7.12). The branch and bound based algorithm for stratified autocalibration is decidedly slower, but it globally minimizes well-accepted cost functions for both the affine and metric upgrade steps. To the best of our knowledge, a branch and bound solution for globally optimal direct autocalibration has not yet been achieved.
Subsequent to the work that contributed to this chapter, there have been advancements that exploit structured sparsity to globally optimize larger polynomial systems. In particular, the work of (Waki et al., 2006) is ideal for situations where the members of a large set of variables interact with another small, fixed set of variables, but not among themselves. This is precisely the case for the more principled optimization problem of (7.14), which we believe is now solvable for a greater number of views (15 to 20) using the software of (Waki et al., 2008). Note that the principle of exploiting structured sparsity is quite similar to the idea of partial relaxations proposed in (Kahl and Henrion, 2007).

The most significant portions of this chapter are based on "Autocalibration via Rank-Constrained Estimation of the Absolute Quadric", by M. K. Chandraker, S. Agarwal, F. Kahl, D. Nistér and D. J. Kriegman, as it appears in (Chandraker et al., 2007a).

Chapter 8
Bilinear Programming
“One Yang and one Yin: this is called the Tao. That which ensues from this is goodness and that which is completed thereby is the nature.”
The Tao of the Production of Things, Appendix III, I Ching (The Book of Changes)
8.1 Introduction
Bilinearity is an oft-encountered phenomenon in computer vision, since observables in a vision system commonly arise due to an interaction between physical aspects well-approximated by linear models (Tenenbaum and Freeman, 2000). For instance, the coordinates of an image feature are determined by a camera matrix acting on a three-dimensional point (Tomasi and Kanade, 1992). Image intensity for a Lambertian object is an inner product between the surface normal and the light source direction. An image in non-rigid structure from motion arises due to a camera matrix observing a linear combination of the elements of a shape basis (Torresani et al., 2003). Thus, in its most general form, bilinear programming subsumes diverse sub-fields of computer vision such as 3D reconstruction, photometric stereo, non-rigid SFM and several others.

This chapter proposes a practical algorithm that provably obtains the globally optimal solution to a class of bilinear programs widely prevalent in computer vision applications. The algorithm constructs tight convex relaxations of the objective function and minimizes them in a branch and bound framework. For an arbitrarily small ε, the algorithm terminates with a guarantee that the solution lies within ε of the global minimum, thus providing a certificate of optimality. One of the key contributions of this chapter is to establish, with theoretical and empirical justification, that it is possible to attain convergence with a non-exhaustive branching strategy that branches on only a particular set of variables. Note that this is different from the bounds propagation schemes proposed in the previous chapters, which are essentially exhaustive (Agarwal et al., 2006; Chandraker et al., 2007b). This has great practical significance for many computer vision problems where bilinearity arises due to the interaction of a small set of variables with a much larger, independent set. For instance, one commonly represents an entity, say shape, as a linear combination of basis entities. The image formation process under this model can be understood as a bilinear interaction between a small number of camera parameters and a large number of coefficients for the basis shapes. As an illustration, we present two applications where the nature of this interaction is suitably exploited: reconstructing the 3D structure of a face from a single input image using exemplar models (Blanz and Vetter, 1999) and determining the cameras and shape coefficients in non-rigid structure from motion (Torresani et al., 2003). In this chapter, we globally minimize bilinear programs under both the standard
L2-norm for Gaussian distributions and the robust L1-norm corresponding to the heavier-tailed Laplacian distribution. The convex relaxations are linear programs (LP) for the
L1-norm case and second order cone programs (SOCP) for the L2-norm, both of which are efficiently solvable by modern interior point methods (Andersen et al., 2003). A traditional counterpoint to our approach would employ linear regression followed by singular value decomposition (SVD), which is suboptimal for this rank-constrained problem in the noisy case. Our experiments clearly demonstrate that our globally optimal algorithms attain lower error rates than SVD. The remainder of this chapter is organized as follows: Section 8.2 describes related work and Section 8.3 presents convex relaxations for the bilinear programs under consideration. Section 8.4 proposes a branching strategy to globally minimize our programs and proves convergence. Experimental results on real and synthetic data are presented in Section 8.5, while Section 8.6 concludes with a discussion and future directions.
8.2 Related Work
Bilinear problems arise in several guises in computer vision (Koenderink and van Doorn, 1997), a fact highlighted by the widespread use of singular value decomposition in diverse vision algorithms. This is, in part, due to the preponderance of linear models in our understanding of several aspects of visual phenomena, such as structure from motion (Tomasi and Kanade, 1992), illumination models (Belhumeur and Kriegman, 1996; Hallinan, 1994), color spectra (Marimont and Wandell, 1992) and 3D appearance (Murase and Nayar, 1995). The bilinear coupling between head pose and facial expression is recovered in (Bascle and Blake, 1998) for actor-driven animation. Disparate applications like typography, face pose estimation and color constancy are tackled in (Freeman and Tenenbaum, 1997) by learning and fitting bilinear models in an EM framework. A perceptual motivation for the abundance of bilinearity in vision is presented in (Tenenbaum and Freeman, 2000).

From the perspective of optimization algorithms, bilinear programming is quite well-studied (McCormick, 1976), particularly as a special case of biconvex programming (Al-Khayyal and Falk, 1983). A variety of approaches, such as cutting-plane algorithms (Konno, 1976) and reformulation linearization techniques (Sherali and Alameddine, 1992), have been proposed to solve bilinear programs. Our approach differs in exploiting structure to achieve optimality in problems which would otherwise be considered too large for global optimization.
Numerous computer vision applications involve norms besides L2. Supergaussian distributions like the Laplacian routinely arise in independent component analysis (Hyvärinen et al., 2001). The L1 norm has been used where robustness is a primary concern, for instance in (Zach et al., 2007) for range image integration. The underlying convexity or quasi-convexity of L1 and L∞ formulations of multiview geometry problems has been studied in (Agarwal et al., 2006; Kahl, 2005; Seo and Hartley, 2007). Note that bilinear programming is a special case of polynomial optimization; however, the problem sizes that concern us in this chapter are far greater than what modern polynomial solvers can handle (Henrion and Lasserre, 2003).
8.3 Formulation
To motivate the derivations and establish a consistent terminology, we will refer to a concrete example, namely reconstructing a 3D shape from a single image, given basis shapes. Modulo suitable variable reorderings, the following derivations hold for other bilinear programs too. For simplicity, we will write expressions for one coordinate of the 2D image; the extension to two coordinates is straightforward. Let $u = \{u_j\}_{j=1}^{N}$ be the observed image, $a \in \mathbb{R}^n$ be a row of the camera matrix ($n = 4$ for 3D shapes) and $\alpha = \{\alpha^i\}_{i=1}^{m}$ be the shape coefficients corresponding to the $m$ basis shapes $\mathcal{B}^i = \{X_j^i \in \mathbb{R}^n\}$. Then the shape is represented as $\sum_i \alpha^i \mathcal{B}^i$ and the affine imaging equation is

$$u_j = a^\top \sum_{i=1}^{m} \alpha^i X_j^i, \qquad j = 1, \cdots, N \qquad (8.1)$$
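As a quick numerical sanity check of (8.1), the sketch below (with synthetic data; all dimensions and values are our own illustrative choices) verifies that the imaging equation is linear in the products $a_k \alpha^i$, which is exactly the bilinear structure exploited in the relaxations that follow:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 4, 3, 10          # camera-row dim, basis shapes, image points

a = rng.standard_normal(n)            # one row of the camera matrix
alpha = rng.standard_normal(m)        # shape coefficients
X = rng.standard_normal((m, N, n))    # X[i, j] = j-th point of i-th basis shape

# Equation (8.1): u_j = a^T sum_i alpha^i X_j^i
u = np.array([a @ sum(alpha[i] * X[i, j] for i in range(m)) for j in range(N)])

# Equivalent bilinear form: u_j = sum_{k,i} a_k alpha^i X_{j,k}^i,
# i.e. linear in the products gamma_k^i = a_k alpha^i
gamma = np.outer(a, alpha)                    # n x m matrix of products
u_bilinear = np.einsum('ki,ijk->j', gamma, X)
assert np.allclose(u, u_bilinear)
```

Replacing each product $a_k \alpha^i$ by a new variable $\gamma_k^i$ is what turns the residuals into linear expressions in the relaxations below.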
8.3.1 LP relaxation for the L1-norm case
The L1-norm bilinear program to find the globally optimal camera and shape coefficients is:

$$\min_{a,\alpha} \; \sum_{j=1}^{N} \left| u_j - a^\top \sum_{i=1}^{m} \alpha^i X_j^i \right| \qquad (8.2)$$

$$\text{subject to} \quad a \in \mathcal{Q}_a, \quad \alpha \in \mathcal{Q}_\alpha, \quad \mathcal{G}(a, \alpha) \geq 0$$

where $X_j^i \in \mathbb{R}^n$ and $\mathcal{Q}_a, \mathcal{Q}_\alpha$ specify rectangular domains for $a$ and $\alpha$. $\mathcal{G}(a, \alpha)$ represents a set of linear constraints (or constraints which can be relaxed into linear ones) on $a$ and/or $\alpha$ to fix the scale of the variables.
Introducing scalar variables $t_j$, $j = 1, \cdots, N$, an equivalent constrained optimization problem is

$$\min_{a,\alpha,t} \; \sum_{j=1}^{N} t_j, \quad \text{s.t.} \quad t_j \geq \left| u_j - \sum_{k=1}^{n} \sum_{i=1}^{m} a_k \alpha^i X_{j,k}^i \right| \qquad (8.3)$$

$$a \in \mathcal{Q}_a, \quad \alpha \in \mathcal{Q}_\alpha, \quad \mathcal{G}(a, \alpha) \geq 0$$
where $a = (a_1, \cdots, a_n)^\top$ and $X_j^i = (X_{j,1}^i, \cdots, X_{j,n}^i)^\top$. Note that the non-convexity in the problem is now contained in the bilinear terms $a_k \alpha^i$ in the constraints. The convex and concave relaxations for a bilinear term $xy$ in the domain $[x_l, x_u] \times [y_l, y_u]$ are given by, respectively, the two pairs of linear inequalities (Al-Khayyal and Falk, 1983; McCormick, 1976):

$$z \geq \max \{\, x_l y + y_l x - x_l y_l, \;\; x_u y + y_u x - x_u y_u \,\} \qquad (8.4)$$

$$z \leq \min \{\, x_u y + y_l x - x_u y_l, \;\; x_l y + y_u x - x_l y_u \,\} \qquad (8.5)$$
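The envelopes (8.4)-(8.5) are easy to check numerically. The sketch below (a minimal illustration; the box bounds are our own choices) samples the box and verifies that the envelopes sandwich $xy$, and that they are exact at the corners:

```python
import numpy as np

def mccormick_bounds(x, y, xl, xu, yl, yu):
    """Convex lower and concave upper envelopes (8.4)-(8.5)
    for the bilinear term z = x*y on [xl, xu] x [yl, yu]."""
    lo = max(xl * y + yl * x - xl * yl, xu * y + yu * x - xu * yu)
    hi = min(xu * y + yl * x - xu * yl, xl * y + yu * x - xl * yu)
    return lo, hi

rng = np.random.default_rng(1)
xl, xu, yl, yu = -1.0, 2.0, 0.5, 3.0

# The envelopes sandwich x*y everywhere on the box ...
for _ in range(1000):
    x = rng.uniform(xl, xu)
    y = rng.uniform(yl, yu)
    lo, hi = mccormick_bounds(x, y, xl, xu, yl, yu)
    assert lo <= x * y + 1e-12 and x * y <= hi + 1e-12

# ... and are tight at the corners of the box
lo, hi = mccormick_bounds(xu, yu, xl, xu, yl, yu)
assert abs(lo - xu * yu) < 1e-12 and abs(hi - xu * yu) < 1e-12
```

The gap between `lo` and `hi` shrinks with the box size, which is why branching on the domain rectangles tightens the relaxation.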
We will collectively refer to the four inequalities above as $\mathrm{conv}(xy) \leq z \leq \mathrm{conc}(xy)$. Each constraint of the form $t_j \geq |\cdot|$ in (8.3) can be equivalently replaced by the pair of constraints $t_j \geq (\cdot)$ and $t_j \geq -(\cdot)$. Substituting $\gamma_k^i = a_k \alpha^i$, we can construct a convex relaxation of (8.3):
$$\min_{a,\alpha,\gamma,t} \; \sum_{j=1}^{N} t_j, \quad \text{s.t.} \quad t_j \geq \sum_{k=1}^{n} \left( \sum_{i=1}^{m} X_{j,k}^i \gamma_k^i \right) - u_j \qquad (8.6)$$

$$t_j \geq - \left( \sum_{k=1}^{n} \left( \sum_{i=1}^{m} X_{j,k}^i \gamma_k^i \right) - u_j \right)$$

$$\mathrm{conv}(a_k \alpha^i) \leq \gamma_k^i \leq \mathrm{conc}(a_k \alpha^i)$$

$$a \in \mathcal{Q}_a, \quad \alpha \in \mathcal{Q}_\alpha, \quad \mathcal{G}(a, \alpha) \geq 0.$$
Introducing new variables $\mu_j = t_j - \sum_{i,k} X_{j,k}^i \gamma_k^i + u_j$ and eliminating $t_j$ from the resulting system of equations, the program (8.6) can be equivalently rewritten as
$$\min_{a,\alpha,\gamma,\mu} \; \sum_{j=1}^{N} \left[ \sum_{k=1}^{n} \left( \sum_{i=1}^{m} X_{j,k}^i \gamma_k^i \right) - u_j + \mu_j \right] \qquad (8.7)$$

$$\text{subject to} \quad -2 \left( \sum_{k=1}^{n} \left( \sum_{i=1}^{m} X_{j,k}^i \gamma_k^i \right) - u_j \right) - \mu_j \leq 0$$

$$\mu_j \geq 0$$

$$\mathrm{conv}(a_k \alpha^i) \leq \gamma_k^i \leq \mathrm{conc}(a_k \alpha^i)$$

$$a \in \mathcal{Q}_a, \quad \alpha \in \mathcal{Q}_\alpha, \quad \mathcal{G}(a, \alpha) \geq 0.$$

While both (8.6) and (8.7) are linear programs, (8.6) has two "general" linear inequalities in the constraint set for each $j = 1, \cdots, N$. In (8.7), one of them has been replaced by a comparison of the scalar variable $\mu_j$ to 0, which can be handled more efficiently by interior point solvers (Andersen et al., 2003). In computer vision applications where $N \gg n$, the importance of this transformation is immense: it improves timings by up to an order of magnitude in some of our experiments.
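To make the construction concrete, the following sketch assembles and solves the LP relaxation (8.7) for a tiny synthetic instance with SciPy's `linprog`. The box domains $[-1, 1]$ and the scale constraint $\mathcal{G}$ (realized here by fixing $\alpha^1 = 1$) are our own illustrative choices; on noiseless data the relaxation's optimal value equals the true zero residual.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
n, m, N = 2, 2, 6                                 # camera-row dim, basis shapes, points

# Ground truth inside the box domains; alpha^1 = 1 fixes the scale
a_true = rng.uniform(-1, 1, n)
al_true = rng.uniform(-1, 1, m)
al_true[0] = 1.0
X = rng.uniform(-1, 1, (m, N, n))                 # basis shapes X_j^i
u = np.einsum('k,i,ijk->j', a_true, al_true, X)   # noiseless image, eq. (8.1)

# Variable layout: [a (n) | alpha (m) | gamma (n*m) | mu (N)]
nv = n + m + n * m + N
g = lambda k, i: n + m + k * m + i                # index of gamma_k^i
mu = lambda j: n + m + n * m + j                  # index of mu_j

# Objective (8.7): sum_j (sum_{k,i} X_{j,k}^i gamma_k^i - u_j + mu_j)
c = np.zeros(nv)
for k in range(n):
    for i in range(m):
        c[g(k, i)] = X[i, :, k].sum()
c[mu(0):] = 1.0

A, b = [], []
# Residual constraints: -2(sum X gamma - u_j) - mu_j <= 0
for j in range(N):
    row = np.zeros(nv)
    for k in range(n):
        for i in range(m):
            row[g(k, i)] = -2.0 * X[i, j, k]
    row[mu(j)] = -1.0
    A.append(row)
    b.append(-2.0 * u[j])

bounds = [(-1, 1)] * n + [(-1, 1)] * m \
       + [(None, None)] * (n * m) + [(0, None)] * N
bounds[n] = (1.0, 1.0)                            # scale: alpha^1 = 1

# McCormick envelopes (8.4)-(8.5) linking gamma_k^i to a_k and alpha^i
for k in range(n):
    xl, xu = bounds[k]
    for i in range(m):
        yl, yu = bounds[n + i]
        for sgn, pa, py, rhs in [(-1, yl, xl, xl * yl),     # z >= xl*y + yl*x - xl*yl
                                 (-1, yu, xu, xu * yu),     # z >= xu*y + yu*x - xu*yu
                                 (+1, -yl, -xu, -xu * yl),  # z <= xu*y + yl*x - xu*yl
                                 (+1, -yu, -xl, -xl * yu)]: # z <= xl*y + yu*x - xl*yu
            row = np.zeros(nv)
            row[g(k, i)] = sgn
            row[k] = pa                           # coefficient of a_k
            row[n + i] = py                       # coefficient of alpha^i
            A.append(row)
            b.append(rhs)

res = linprog(c, A_ub=np.array(A), b_ub=np.array(b), bounds=bounds, method='highs')
assert res.status == 0
# Optimal value of (8.7) is c^T x - sum_j u_j, which is zero here
assert abs(res.fun - u.sum()) < 1e-6
```

In the branch and bound algorithm, an LP of this form is solved at every node of the search tree, with `bounds` shrunk according to the current domain rectangle.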
8.3.2 SOCP relaxation for the L2-norm case
The L2-norm bilinear problem is:

$$\min_{a,\alpha} \; \sqrt{\sum_{j=1}^{N} \left( u_j - a^\top \sum_{i=1}^{m} \alpha^i X_j^i \right)^2} \qquad (8.8)$$

$$\text{subject to} \quad a \in \mathcal{Q}_a, \quad \alpha \in \mathcal{Q}_\alpha, \quad \mathcal{G}(a, \alpha) \geq 0.$$

A convex relaxation for (8.8) can be easily constructed in the form of a second order cone program using the same principles as for the L1 case:
$$\min_{a,\alpha,\gamma} \; \left\| \left( u_1 - \sum_{i,k} X_{1,k}^i \gamma_k^i, \;\; \cdots, \;\; u_N - \sum_{i,k} X_{N,k}^i \gamma_k^i \right) \right\|$$