UC San Diego Electronic Theses and Dissertations

Title: From pictures to 3D: global optimization for scene reconstruction

Permalink: https://escholarship.org/uc/item/8rs3b74c

Author: Chandraker, Manmohan Krishna

Publication Date: 2009

Peer reviewed|Thesis/dissertation

UNIVERSITY OF CALIFORNIA, SAN DIEGO

From Pictures to 3D: Global Optimization for Scene Reconstruction

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy

in

Computer Science

by

Manmohan Krishna Chandraker

Committee in charge:

Professor David Kriegman, Chair
Professor Serge Belongie
Professor Samuel Buss
Professor Fredrik Kahl
Professor Gert Lanckriet
Professor Matthias Zwicker

2009

Copyright
Manmohan Krishna Chandraker, 2009
All rights reserved.

The dissertation of Manmohan Krishna Chandraker is approved and it is acceptable in quality and form for publication on microfilm and electronically:

Chair

University of California, San Diego

2009

DEDICATION

To Papa, for his incomparable example. To Mom, for her innumerable sacrifices. To Didi, for her unbridled sisterly pride.

EPIGRAPH

Somewhere afield here something lies
In Earth’s oblivious eyeless trust
That moved a poet to prophecies -
A pinch of unseen, unguarded dust.

Thomas Hardy, “Shelley’s Skylark”

TABLE OF CONTENTS

Signature Page
Dedication
Epigraph
Table of Contents
List of Figures
List of Tables
Acknowledgements
Vita
Abstract of the Dissertation

Chapter 1  Introduction
  1.1 Multiview Geometry and Optimization
  1.2 3D Reconstruction from 2D Images
    1.2.1 The Projective Ambiguity
    1.2.2 Projective Spaces and Projective Cameras
    1.2.3 Stratification of 3D Reconstruction
    1.2.4 Autocalibration
    1.2.5 Feature Selection and Matching
  1.3 The Optimization Framework
    1.3.1 Global Optimization for 3D Reconstruction
    1.3.2 Optimization for Robust SFM
  1.4 Contributions of the Dissertation
  1.5 How to Read This Dissertation

Chapter 2  Preliminaries: Projective Geometry
  2.1 Axiomatic Projective Geometry
  2.2 Projective Geometry of 2D
  2.3 Projective Geometry of 3D
    2.3.1 Points and Planes
    2.3.2 Lines
    2.3.3 Quadrics
  2.4 The Projective Camera
  2.5 The Plane at Infinity and Its Denizens
    2.5.1 The Absolute Conic
    2.5.2 Image of the Absolute Conic
    2.5.3 The Absolute Dual Quadric

Chapter 3  Preliminaries: Multiview Geometry
  3.1 Feature Selection and Matching
    3.1.1 Corner detection
    3.1.2 Feature matching
    3.1.3 Advanced feature descriptors
  3.2 Epipolar Geometry
  3.3 Projective Reconstruction
    3.3.1 Pairwise reconstruction
    3.3.2 Factorization-based approaches
  3.4 Stratification
  3.5 Chirality
    3.5.1 Bounding the plane at infinity

Chapter 4  Global Optimization
  4.1 Approaches to Global Optimization
  4.2 Convex Optimization
    4.2.1 Convex Sets
    4.2.2 Convex Functions
    4.2.3 Convex Optimization Problems
    4.2.4 Linear Matrix Inequalities
  4.3 Branch and Bound Theory
    4.3.1 Bounding
    4.3.2 Branching
  4.4 Global Optimization for Polynomials

Chapter 5  Triangulation and Resectioning
  5.1 Introduction
    5.1.1 Related Work
    5.1.2 Outline
  5.2 Problem Formulation
  5.3 Traditional Approaches
    5.3.1 Linear Solution
    5.3.2 Bundle Adjustment
  5.4 Fractional Programming
    5.4.1 Bounding
  5.5 Applications to Multiview Geometry
    5.5.1 Triangulation
    5.5.2 Camera Resectioning
    5.5.3 Projections from P^n to P^m
  5.6 Multiview Fractional Programming
    5.6.1 Bounds Propagation
    5.6.2 Initialization
    5.6.3 Coordinate System Independence
  5.7 Experiments
    5.7.1 Synthetic Data
    5.7.2 Real Data
  5.8 Discussions

Chapter 6  Stratified Autocalibration
  6.1 Introduction
  6.2 Background
    6.2.1 The Infinite Homography Relation
    6.2.2 Modulus Constraints
    6.2.3 Chirality Bounds on Plane at Infinity
    6.2.4 Need for Global Optimization
  6.3 Previous Work
  6.4 The Branch and Bound Framework
    6.4.1 Constructing Convex Relaxations
  6.5 Global Estimation of Plane at Infinity
    6.5.1 Traditional Solution
    6.5.2 Problem Formulation
    6.5.3 Convex Relaxation
    6.5.4 Incorporating Bounds on the Plane at Infinity
  6.6 Globally Optimal Metric Upgrade
    6.6.1 Traditional Solution
    6.6.2 Problem Formulation
    6.6.3 Convex Relaxation
  6.7 Experiments
  6.8 Conclusions and Further Discussions

Chapter 7  Direct Autocalibration
  7.1 Introduction
  7.2 Background
    7.2.1 Autocalibration Using the Absolute Dual Quadric
    7.2.2 Chirality
  7.3 Related Work
  7.4 Problem Formulation
    7.4.1 Imposing rank degeneracy and positive semidefiniteness of Q∗∞
    7.4.2 Imposing chirality constraints
    7.4.3 Choice of objective function
  7.5 Experiments with synthetic data
  7.6 Experiments with real data
  7.7 Conclusions

Chapter 8  Bilinear Programming
  8.1 Introduction
  8.2 Related Work
  8.3 Formulation
    8.3.1 LP relaxation for the L1-norm case
    8.3.2 SOCP relaxation for the L2-norm case
    8.3.3 Additional notes for the L2 case
  8.4 Branching strategy
  8.5 Experiments
    8.5.1 Synthetic data
    8.5.2 Applications
  8.6 Discussions

Chapter 9  Line SFM Using Stereo
  9.1 Introduction
  9.2 Related Work
  9.3 Structure and Motion Using Lines
    9.3.1 A Simple Solution?
    9.3.2 Geometry of the Problem
    9.3.3 Linear Solution
    9.3.4 Efficient Solutions for Orthonormality
    9.3.5 Solution for Incremental Motion
    9.3.6 A Note on Number of Lines
  9.4 System Details
    9.4.1 Line Detection, Matching and Tracking
    9.4.2 Efficiently Computing Determinants
  9.5 Experiments
    9.5.1 Synthetic Data
    9.5.2 Real Data
  9.6 Discussions

Chapter 10  Discussions
  10.1 Sequels in the Computer Vision Community
  10.2 Future Directions
  10.3 Conclusions

Appendix A  Fractional Programming

Appendix B  Convex Relaxations for Stratified Autocalibration
  B.1 Functions of the Form f(x) = x^(8/3)
  B.2 Bilinear Functions f(x, y) = xy
  B.3 Functions of the Form f(x, y) = x^(1/3) y
    B.3.1 Case I: [x_l > 0 or x_u < 0]
    B.3.2 Case II: [x_l ≤ 0 ≤ x_u]
  B.4 Convergence Proofs
    B.4.1 Errata

Appendix C  Convergence Proof for Bilinear Relaxations

Bibliography

LIST OF FIGURES

Figure 1.1: Various cues in images create a perception of depth
Figure 1.2: Branch and bound for global optimization
Figure 1.3: Progress of projective geometry in Renaissance art
Figure 1.4: Reconstruction up to rotation, translation and scale
Figure 1.5: Projection for imaging and back-projection for reconstruction
Figure 1.6: The projective plane
Figure 1.7: Vanishing points are an everyday phenomenon
Figure 1.8: The perspective projection camera model
Figure 1.9: Stratification in three dimensions
Figure 1.10: Quasi-affine reconstruction preserves the ...
Figure 1.11: A visualization of autocalibration
Figure 1.12: Not all corners are created equal
Figure 1.13: Multiview geometry problems are hard to optimize
Figure 1.14: Traditional local optimization in multiview geometry

Figure 2.1: Perspective camera projection
Figure 2.2: Internal and external parameters of the camera

Figure 3.1: Corner detection
Figure 3.2: Types of image neighborhoods
Figure 3.3: Epipolar geometry

Figure 4.1: Branch and bound for non-convex minimization

Figure 5.1: Local minima in three-view triangulation
Figure 5.2: Bounds propagation
Figure 5.3: Triangulation errors with forward motion
Figure 5.4: Comparing optimal (L2,L2) triangulation to bundle adjustment
Figure 5.5: Triangulation errors with outliers
Figure 5.6: Reprojection errors for camera resectioning
Figure 5.7: Dependence of convergence on optimality criterion

Figure 6.1: Plane-induced homography between two cameras
Figure 6.2: Need for globally optimal autocalibration
Figure 6.3: Construction of convex relaxations
Figure 6.4: Errors in calibration parameters across noise level
Figure 6.5: Runtime behavior of affine and metric upgrade algorithms
Figure 6.6: Geometrical errors in affine and metric upgrades
Figure 6.7: Comparison of local and global affine upgrades
Figure 6.8: Comparison of local and global metric upgrades
Figure 6.9: Stratified autocalibration with real data

Figure 7.1: Direct autocalibration with real data
Figure 7.2: The benefit of global optimization

Figure 8.1: Errors in bilinear fitting across noise levels
Figure 8.2: Errors with varying outlier levels
Figure 8.3: Convergence times for optimal bilinear fitting
Figure 8.4: Face reconstruction from 3D exemplars
Figure 8.5: Bilinear fitting for non-rigid structure and motion

Figure 9.1: Motion estimation in challenging indoor environment
Figure 9.2: Geometry of line-based structure and motion
Figure 9.3: Line detection and tracking
Figure 9.4: Errors for small motion using two-line solvers
Figure 9.5: Errors for small motion using three-line solvers
Figure 9.6: Errors for large motion using three-line solvers
Figure 9.7: Line detection and tracking for turntable sequence
Figure 9.8: Line-based structure and motion for a turntable sequence
Figure 9.9: Line-based structure and motion for a corridor sequence
Figure 9.10: Polynomial and incremental solutions for corridor sequence

Figure B.1: Convex and concave relaxations for bilinear functions
Figure B.2: The non-convex function f(x, y) = x^(1/3) y
Figure B.3: Concave overestimator for x^(1/3)
Figure B.4: Convex underestimator for x^(1/3)

LIST OF TABLES

Table 5.1: Cost functions for various error norms
Table 5.2: Optimal resectioning runtimes for various error norms
Table 5.3: Triangulation and resectioning errors with real data
Table 5.4: Branch and bound iterations for real data
Table 5.5: Runtimes with real data

Table 6.1: Stratified autocalibration errors and branching iterations
Table 6.2: Geometric evaluation of stratified autocalibration
Table 6.3: Positive-semidefiniteness violations in linear metric upgrade

Table 7.1: Direct autocalibration errors with synthetic data
Table 7.2: Direct autocalibration with real data

ACKNOWLEDGEMENTS

This dissertation is the product of the constant endeavor of my mentors, colleagues, friends and family, who have steadfastly shaped my perspective towards research, education and life in general. My time in graduate school was not cast in a stereotypical mould, as I was extremely fortunate to have not one, but several wonderful mentors.

A lion’s share of the credit for determining the nature of my PhD experience goes to my adviser, Prof. David Kriegman. Words cannot do justice to his impact in forging my academic and personal outlook, for his efforts far exceed the obligations of merely guiding this dissertation. Not once have I seen David impose any specific demands; rather, his aura compels students to strive to live up to his high standards. The breadth of his knowledge, combined with a meditative understanding of his students’ strengths, has given him the confidence to consider “good taste” an inherently subjective term. Accordingly, the only expectations he has ever had from me are ensuring the paramountcy of quality over quantity and indulging in research that I would myself take pride in being associated with. The variety of research emanating from David’s research group is testimony to the exploratory freedom he allows his students. His style of advising has always been to subtly plant an idea, or gently nudge me in the right direction, rather than push an agenda. On numerous occasions, the significance of David’s suggestions would only dawn upon me much later – his astuteness in diving to the core of any problem is awe-inspiring. In the course of more than five years, I have admired the dexterity with which David has handled simultaneous responsibilities such as his editorship of IEEE PAMI and his fledgling company, while devoting quality time to his advisory role.
Indeed, David’s concern for his international students, far from their own homes and families, goes well beyond mere advising: when I told him I had bought my first car, a second-hand Mitsubishi, the first thing he said was, “Great! Now make sure that you drive safe.” All I can say for my indebtedness to David is that if, one day, I advise students of my own and display half of his vitality and sagacity, I will consider it a success.

Prof. Serge Belongie is another of my mentors who has significantly contributed towards my academic progress, with his generous help, advice and feedback. Some of my love for teaching is attributable to Serge – the amount of preparation he puts into each class and his expertness at weaving an authoritative delivery into a friendly ambiance are invaluable lessons for any graduate student.

Much of the course of this PhD was set during my interactions with Prof. Fredrik Kahl at the beginning of my second year. To him goes the credit for introducing me to convex optimization and rekindling my love for multiview geometry. Working with him, or even merely talking to him, has been a profoundly educative experience.

A mentor and friend who significantly enriched my stay in the Pixel Lab, both intellectually and personally, is Sameer Agarwal, soon to become a professor at the University of Washington. Not only has he enlightened me on innumerable technical topics during our collaborations, he has also set a great example of dedication and uprightness in scientific research. He is a budding chef, and postprandial ruminations at his apartment led to lively discussions on every topic imaginable. Right from the first day Sameer saw me in Pixel Lab, he has somehow taken personal responsibility for my well-being. Whenever I have needed support, I have chatted to him, for his immense faith in me has been a constant source of strength.

Satya Mallick also occupies a similar zone between mentorship and friendship. From helping me settle down in UCSD to imparting pithy lessons through his witty anecdotes, his presence has been vital to my graduate school experience. I have always admired the dedication Satya invested into both his work and his long-distance marriage to Sunita. The time I spent in the lab with Satya and Sameer, especially the summer of 2005, constitutes some of the best memories from my stay at UCSD.
Vincent Rabaud spent the last couple of years at the office space next to mine and was very tolerant of any exultant yells or soulful moans. He would sometimes echo them too, or cheer me up by playing funky numbers. It was great to share lab space with some nice people – Neil Alldrin, Andrew Rabinovich, William Beaver, Neel Joshi, Ben Laxton and Will Chang – all of whom, I am glad, have found their respective callings. Likewise,

I wish all the best to the current denizens of Pixel Lab - Steve Branson, Kai Wang, Carolina Galleguillos, Boris Babenko and Catherine Wah.

Importantly, I would like to acknowledge the role of my senior colleagues in my academic development. Kuang-chih Lee was my first-year cubicle-mate who was always generous with help and advice. Ben Ochoa, Josh Wills, Jongwoo Lim, Kristin Branson, Craig Donner and Piotr Dollár are all my early lab-mates whom I admire for their amazing creativity and work ethic. Many thanks to Virginia McIlwain for help with numerous travel and administrative details.

A summer internship at Microsoft Research Cambridge was an invaluable opportunity to work closely with Prof. Andrew Blake, whose scientific outlook and intensity I greatly admire. I would like to thank my friends Pushmeet Kohli, Srikumar Ramalingam, Anitha Kannan, Ankur Agarwal, Gregory Neverov, Dynal Patel, Varun Gupta and all others for making the stay at Cambridge such a fun-filled experience.

An internship at Honda Research Institute in Mountain View was a wonderful experience in robotics research. It was a pleasure to collaborate with Jongwoo Lim, who taught me many nuances of real-time SFM. I am also grateful to Prof. Ming-Hsuan Yang for his encouragement, as well as Arjun, Ravi and Dipak for their enjoyable company.

There are several friends whose love and support I would like to acknowledge. Since the day I have known him, Praveen Rajurkar has been my best friend, whose pure heart, ready smile and abysmally pathetic jokes have livened up my days for over a decade now. Anish Karandikar (Aka) is a wonderful friend whose frank opinions I greatly value. Kiran (Machi) is a great buddy with an incredibly positive attitude, who needed only the slightest cajoling to accompany me to anything under the sun (or away from it). Suchit Jhunjhunwala, for his equal measures of levelheadedness and neuroticism, as well as Manish Amde, my long-time housemate, also deserve acknowledgment.
Saumya Chandra and Mayank Kabra, with their contrasting slapstick-cynical routines, provided comic relief. I would also like to thank Ragesh, Diwaker, Saurabh (Sina) and Himanshu (Half-cold) for the joy and variety their company brought to my non-academic life. My cricket teammates, tennis and squash partners – Raiyan, Rahul, Kushal, Vikas, Nitin and all others – deserve special thanks for helping me sustain my love for sports at UCSD.

I am thankful to my adviser, Prof. David Kriegman, for meticulously reading this dissertation and suggesting several corrections and improvements. Any errors that still persist are, of course, mine alone. I am grateful to my wonderful brother-in-law, Dr. Shyam Varma, for his love and support.

Finally, no description can adequately quantify the principal ingredients of this dissertation, namely the endeavors and sacrifices of my father, mother and sister. They are my greatest supporters and the people most attuned to the tides of emotions that swept the course of my PhD. Their pain at being oceans apart from me is only surpassed by the unabashed pride and joy they experience with every little seashell I discover. To their blessings, to their tears and their smiles, I owe every success. Hence, to them, I dedicate this dissertation.

Parts of this dissertation are based on papers co-authored with my collaborators:

• Chapter 5 is based on “Practical Global Optimization for Multiview Geometry”, by F. Kahl, S. Agarwal, M. K. Chandraker, D. J. Kriegman and S. Belongie, as it appears in (Kahl et al., 2008) and (Agarwal et al., 2006).

• Chapter 6 is based on “Globally Optimal Affine and Metric Upgrades in Stratified Autocalibration”, by M. K. Chandraker, S. Agarwal, D. J. Kriegman and S. Belongie, as it appears in (Chandraker et al., 2007b).

• Chapter 7 is based on “Autocalibration via Rank-Constrained Estimation of the Absolute Quadric”, by M. K. Chandraker, S. Agarwal, F. Kahl, D. Nistér and D. J. Kriegman, as it appears in (Chandraker et al., 2007a).

• Chapter 8 is based on “Globally Optimal Bilinear Programming for Computer Vision Applications”, by M. K. Chandraker and D. J. Kriegman, as it appears in (Chandraker and Kriegman, 2008).

• Chapter 9 is based on “Moving in Stereo: Efficient Structure and Motion Using Lines”, by M. K. Chandraker, J. Lim and D. J. Kriegman, as it appears in (Chandraker et al., 2009).

VITA

1982 Born, Raipur, India

2003 B.Tech., Indian Institute of Technology, Bombay, India

2009 Ph.D., University of California, San Diego, USA

PUBLICATIONS

M. K. Chandraker, J. Lim and D. Kriegman, “Moving in Stereo: Efficient Structure and Motion Using Lines,” IEEE International Conference on Computer Vision (ICCV), 2009.

M. K. Chandraker, S. Agarwal, D. Kriegman and S. Belongie, “Globally Optimal Algorithms for Stratified Autocalibration,” International Journal of Computer Vision (IJCV, invited), 2009.

M. K. Chandraker and D. Kriegman, “Globally Optimal Bilinear Programming for Computer Vision Applications,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

F. Kahl, S. Agarwal, M. K. Chandraker, D. Kriegman and S. Belongie, “Practical Global Optimization for Multiview Geometry,” International Journal of Computer Vision (IJCV), 79(3):271-284, 2008.

M. K. Chandraker, S. Agarwal, D. Kriegman and S. Belongie, “Globally Optimal Affine and Metric Upgrades in Stratified Autocalibration,” IEEE International Conference on Computer Vision (ICCV), 2007.

A. Agarwal, S. Izadi, M. K. Chandraker and A. Blake, “High Precision Multi-touch Sensing on Surfaces using Overhead Cameras,” IEEE Tabletop and Interactive Surfaces, 2007.

M. K. Chandraker, S. Agarwal and D. Kriegman, “ShadowCuts: Photometric Stereo with Shadows,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.

M. K. Chandraker, S. Agarwal, F. Kahl, D. Nistér and D. Kriegman, “Autocalibration via Rank-Constrained Estimation of the Absolute Quadric,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.

S. Agarwal, M. K. Chandraker, F. Kahl, D. Kriegman and S. Belongie, “Practical Global Optimization for Multiview Geometry,” European Conference on Computer Vision (ECCV), 2006.

M. K. Chandraker, F. Kahl and D. Kriegman, “Reflections on the Generalized Bas-Relief Ambiguity,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.

M. K. Chandraker, C. Stock and A. Pinz, “Real-Time Camera Pose in a Room,” International Conference on Computer Vision Systems (ICVS), 2003.

C. Stock, U. Mühlmann, M. K. Chandraker and A. Pinz, “Subpixel Corner Detection for Tracking Applications using CMOS Camera Technology,” Proceedings of the Austrian Association of Pattern Recognition, 2002.

FIELDS OF STUDY

Major Field: Optimization
Global Optimization, Convex Optimization, Polynomial Optimization, Convex Relaxations, Branch and Bound Search.

Major Field: 3D Reconstruction
Structure from Motion, Shape from Exemplars.

Major Field: Multiple View Geometry
Triangulation, Camera Resectioning, Autocalibration, Projective Geometry.

Minor Fields
Semidefinite Programs, Fractional Programs, Bilinear Programs, Sum-of-Squares Polynomials, Linear Matrix Inequality Relaxations.

ABSTRACT OF THE DISSERTATION

From Pictures to 3D: Global Optimization for Scene Reconstruction

by

Manmohan Krishna Chandraker Doctor of Philosophy in Computer Science

University of California, San Diego, 2009

Professor David Kriegman, Chair

Reconstructing the three-dimensional structure of a scene using images is a fundamental problem in computer vision. The geometric aspects of 3D reconstruction have been well-understood for a decade, but the involved optimization problems are known to be highly non-convex and difficult to solve. Traditionally, these problems are tackled using heuristic initializations followed by local, gradient-based optimization algorithms, which are prone to being enmeshed in local minima. In contrast, this dissertation proposes powerful, global optimization methods to derive provably optimal, yet practical, algorithms for estimating 3D scene structure and camera motion.

This dissertation develops a branch and bound framework to solve several well-established problems in multiview geometry to their global optima, with a certificate of optimality. The framework relies on the construction of efficient and tight relaxations to the involved non-convex problems, using modern convex optimization methods. The underlying geometry of the task is exploited to restrict the search space to a small, fixed

number of dimensions, which alleviates the worst case exponential complexity of branch and bound in practice.

The dissertation begins by deriving optimal solutions to triangulation and camera pose estimation for an arbitrary number of views and points, using extensions to the theory of fractional programming. Next, the framework is amplified to solve the conceptually important affine and metric reconstruction stages of stratified autocalibration to their global optima. Additionally, an algorithm for directly upgrading a projective reconstruction to a metric one is proposed, based on elegant real algebraic geometry methods for global optimization of polynomial systems. Further, large-scale bilinear programs that arise in diverse applications such as shape from exemplar models and non-rigid structure from motion are globally optimized using a novel branching strategy that exploits problem structure typical to 3D reconstruction.

The final part of the dissertation develops a complete pipeline for real-time 3D reconstruction using stereo images and straight line features. The core structure from motion problem constitutes efficient optimization of an overdetermined system of polynomials that is fast enough to be used in a robust hypothesize-and-test framework. The algorithm has already found application in the autonomous navigation system for the well-known humanoid robot, ASIMO.

Chapter 1

Introduction

“.... I have great faith in a seed. Convince me that you have a seed there and I am prepared to expect wonders.”

Henry David Thoreau (American naturalist, 1817-1862 AD), Faith in a Seed

A cogent vignette that stimulates artificial intelligence research and movie box office receipts alike includes a robot autonomously navigating and consciously interacting with the world around it. It is, perhaps, well-accepted that the ability to garner visual input, discerningly extract scene information and sentiently use the same represents a crucial attribute for such a machine. Since visual input usually comprises two-dimensional images, an important piece of the puzzle involves recovering the depth, that is, inferring the three-dimensional (3D) scene structure represented by the two-dimensional (2D) images. Various cues can be employed to achieve this goal, such as camera motion between images, the extent of defocus or shading variations with change in pose and lighting (Figure 1.1). Recovering 3D scene structure using (possibly unknown) camera motion as a cue, the so-called “Structure and Motion” or “Structure from Motion (SFM)” problem, is one of the principal themes of this dissertation.

Figure 1.1: (a) Rain, Steam and Speed - The Great Western Railway, by J. M. W. Turner, 1844 AD. Various cues combine to create the illusion of depth, such as linear perspective (parallel lines seem to intersect), aerial perspective (distant regions acquire a bluish hue) and defocus (farther objects are hazier). (b) Motion as a cue to perceive depth, exemplified by Lake Palanskoye in the Kamchatka Peninsula, Russia. The landmass was shifted, left to right, during a landslide, creating the effect of a camera motion between the two images. By crossing the eyes to view this image pair as a stereogram, the reader can perceive depth in the scene (white regions are highest, followed by brown, green and bluish-black).

Structure from motion is a quintessential computer vision problem, for which robotic navigation is by no means the only real-world application. Organizing vacation photographs, augmented reality walk-throughs, 3D city maps and motion capture technology in the movie and gaming industries are but a few examples where progress in 3D reconstruction techniques has already influenced modern society. Given its widespread application, it is unsurprising that a significant part of the computer vision challenge consists of designing robust, reliable algorithms and systems that can infer 3D scene structure and camera motion using 2D images.

Structure and motion problems are, in general, highly non-convex and finding optimal solutions to them is computationally hard (Nistér et al., 2007; Freund and Jarre, 2001). For instance, Figures 1.13 and 5.1 illustrate cost functions for some of the problems we will encounter in this dissertation. Traditionally, these problems are solved by employing a heuristic initialization in conjunction with a gradient-based optimization algorithm to arrive at a local optimum. Needless to say, the possibility of such approaches achieving an acceptable solution quality is contingent on a propitious initialization in the vicinity of the optimum. In contrast, this dissertation presents algorithms for geometric reconstruction that provably converge to the global optimum, regardless of the initialization.
This dissertation makes a strong case for modern optimization methods being more suited to meet the challenge of provably accurate geometric 3D reconstruction than their traditional gradient-based counterparts. Convex programs, for example, are attractive since their local minima are, by definition, also the global minima. Moreover, the past twenty years have seen tremendous activity towards developing fast, reliable and robust solvers for a variety of convex problems. One way of harnessing the power of convex optimization is to develop approximation algorithms that are guaranteed to lie within a fixed distance of the optimum. But even with the assumption that an approximate 3D reconstruction is useful, it is difficult to come up with a provably good one for complex multiview geometry problems.

However, suppose the search space for, say, a minimization problem is subdivided into some regions. Then, as long as the convex approximation can be shown to be a “tight” lower bound, or a relaxation, we can use it to prune away those regions where the lower bound lies above the objective function in some other region (see Figure 1.2). This is precisely the basis for a branch and bound paradigm for global optimization. In this dissertation, we develop provably tight and efficiently minimizable convex relaxations to non-convex geometric reconstruction problems which can be coupled with a well-designed branch and bound algorithm to achieve the global minimum.

The bane of a traditional branch and bound algorithm, of course, is its worst case complexity that increases exponentially with the number of dimensions, which, for a multiview geometry problem, may easily be a few dozen or hundreds. So, while a certificate of optimality satisfies the theoretical premise of global optimization, practicality demands an informed design that avoids the curse of dimensionality. Undeniably, the primary reason for the success of our global optimization algorithms is the judicious


Figure 1.2: The principles of global optimization using a branch and bound framework, illustrated for a univariate function.

assimilation of the underlying problem structure afforded by multiview geometry to potently restrict the dimensionality of the search space. Indeed, one of the central motifs of this dissertation is an inquiry into the symbiotic congruity of multiview geometry and convex optimization to achieve expeditious convergence in practice. On this note, let us embark on our exploration of practical global optimization methods for geometric 3D reconstruction.
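The pruning logic described above can be made concrete for a univariate function. The following is an illustrative sketch only, not the convex relaxations developed later in the dissertation: it assumes a user-supplied `lower_bound(a, b)` that underestimates the objective on [a, b] and tightens as the interval shrinks, here built from a hypothetical Lipschitz constant for a toy objective.

```python
import heapq
import math

def branch_and_bound(f, lower_bound, lo, hi, tol=1e-4):
    """Globally minimize f on [lo, hi], given lower_bound(a, b) <= min of f on [a, b]."""
    best_x, best_val = lo, f(lo)
    heap = [(lower_bound(lo, hi), lo, hi)]          # regions keyed by their lower bound
    while heap:
        bound, a, b = heapq.heappop(heap)
        if bound > best_val - tol:                  # prune: region provably cannot
            continue                                #        beat the incumbent by > tol
        mid = 0.5 * (a + b)
        if f(mid) < best_val:                       # update the incumbent solution
            best_x, best_val = mid, f(mid)
        if b - a > tol:                             # branch: subdivide the region
            heapq.heappush(heap, (lower_bound(a, mid), a, mid))
            heapq.heappush(heap, (lower_bound(mid, b), mid, b))
    return best_x, best_val

# a non-convex toy objective; L = 3.7 bounds |f'| on [-3, 3]
f = lambda x: math.sin(3.0 * x) + 0.1 * x * x
lb = lambda a, b: min(f(a), f(b)) - 3.7 * (b - a)   # valid Lipschitz lower bound
x_star, v_star = branch_and_bound(f, lb, -3.0, 3.0)
```

On termination, every discarded region provably contained no point more than `tol` below the incumbent, which is exactly the certificate of optimality discussed above.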

1.1 Multiview Geometry and Optimization

Much of this dissertation examines multiple view geometry problems through the conjugality of projective geometry and modern convex optimization methods. We begin our quest with some observations on the appositeness of those two frameworks.

The utility of projective geometry

In a most basic interpretation, any photograph can be regarded as the output of a projection device, which might be an advanced servo-controlled zoom lens system for modern digital cameras, or an artist’s felicity for a Renaissance-era protégé of Raphael’s studio. The study of projections of the three-dimensional world onto a planar canvas, thus, predates the advent of modern photography. Consequently, it is not mere fortuitousness that a mature set of mathematical tools, in the form of projective geometry, was readily available to tackle the inverse problem of inferring information about the three-dimensional world from two-dimensional projections (Figure 1.3).


(a) Giotto di Bondone, Jesus before Caiaphas, ca. 1305 (b) Raphael Sanzio, The School of Athens, ca. 1510

Figure 1.3: The progress of projective geometry through art. (a) Jesus before Caiaphas, by Giotto di Bondone, ca. 1305, around the beginning of the Italian Renaissance. While Giotto aims to replicate the effect of parallel lines seeming to converge, the painting reflects unawareness of the concept of a vanishing point. (b) The School of Athens, by Raphael Sanzio, ca. 1510, at the zenith of classical Renaissance. Parallel lines of the 3D scene are concurrent between the central characters in the 2D painting. This vanishing point is the image of an ideal point on the plane at infinity. Coincidentally, in this painting, Plato is lecturing Aristotle on idealism and the vanishing point heightens the contrast to the latter’s realism.

Projective geometry lends a variety of useful concepts, which are well-adapted for use in the analysis of the geometry of multiple views as well as implementation in modern computational frameworks. Its distinguishing feature is the uniform treatment of

finite as well as infinite points. The set of all infinite points in projective 3-space defines a plane, aptly designated the plane at infinity. The behavior of the plane at infinity and some mathematical entities that reside on it, such as the absolute conic and the absolute dual quadric, are fundamental to our understanding of image formation and thereby, also indispensable in determining our modus operandi for recovering 3D scene information from those images. In Chapter 2, we briefly review notions from projective geometry that form the mathematical backbone of this dissertation. It might seem quite remarkable that these imaginary habitués of the imaginary abode that is the plane at infinity can contribute so tangibly to our ability to digitally perceive the real world around us. But in the words of Jean-Paul Sartre, the twentieth-century French existentialist philosopher:

“No finite point has meaning without an infinite reference point.”

And therein lies the power of projective geometry.

The utility of convex optimization

The ability to acquire digital images of the world around us is now commonplace. Even more easily accessible is the ability to store, process, share and distribute them. How do we meaningfully analyze and extract relevant information from the plethora of data at our disposal? A natural approach might be to formulate an objective function that encapsulates some notion of satisfaction and, while appropriately taking into account the available data, devise a scheme to maximize the satisfaction. This is precisely the basic premise of optimization. An optimization framework may seem an obvious necessity to us today, but it is worth pausing to ponder the alternatives, or the lack thereof. Optimization, after all, was a novel concept merely 200 years ago, around when Carl Friedrich Gauss had begun to lay the foundations for the method of steepest descent. The beginnings of modern-day optimization - convex programming in particular - can perhaps be attributed to George Dantzig’s mid-twentieth-century studies in linear programming. While the term linear programming does not bear relation to computer programs (rather, it refers to scheduling), it is beyond doubt that contemporary study of optimization methods serves and is served by our present-day computers. Linear programming is a special instance of convex programming, which deals with convex functions and sets. Convex programs enjoy some special properties, for instance, a local minimum is guaranteed to be a global minimum. Recognizing elements of convexity in a problem, thus, allows us to deduce fundamental characterizations of the intrinsic difficulty of the problem. But that only partly explains the utility of convex optimization.
The theoretical development of modern-day interior point solvers (Karmarkar, 1984; Nesterov and Nemirovskii, 1994) and their subsequent deployment in the form of efficient, general-purpose computer software (Sturm, 1999; Andersen et al., 2003) is as significant a reason for the popularity of convex optimization. Indeed, implementation problems and choices concerning feasibility detection, stopping criteria and convergence rates that vex traditional optimization methods are readily and satisfactorily handled by the theory of convex optimization.

1.2 3D Reconstruction from 2D Images

Given one or more images of a scene, a verbal description of 3D reconstruction might be “inferring the structure of the scene and the cameras used for imaging”. How can we make this description mathematically precise? Clearly, without knowing the absolute coordinates of the scene points (which are not available for a traditional image), it is not possible to recreate a copy of the scene at the same geographical position and orientation as the original. In addition, there is no way to determine the absolute scale of the scene imaged by a camera - a fact that has allowed the directors of many a blockbuster to portray mayhem in a bathtub as a savage shark or a stricken ship (Figure 1.4). Indeed, scale ambiguity is fundamental in biological vision too - it is only with the aid of experience-based priors that we learn to reconcile relative scales in the world around us. For instance, highly myopic but otherwise perfectly peripatetic people wearing vision-correcting lenses often stumble when objects appear much closer (or larger) upon removing the lenses.

(a) (b)

Figure 1.4: (a) A rotation and translation applied to both the object and the camera results in the same image. (b) Scale information is “lost” in perspective projection - a far away big object appears the same as a nearby small object.

So, it seems reasonable to define a satisfactory 3D reconstruction as one that differs from the true scene by a global rotation, translation and scale factor (a similarity transformation). As it turns out, if we have calibrated cameras, that is, cameras with known internal settings, then it is possible to achieve this goal and in fact, impossible to do any better (Longuet-Higgins, 1981). Such a reconstruction is called a metric reconstruction.

1.2.1 The Projective Ambiguity

Della Pittura, a definitive fifteenth-century treatise on painting by Leon Battista Alberti, the Renaissance artist and polymath, expounds on painting with correct perspective. To achieve this, Alberti recommends observing the scene through a transparent cloth stretched on a wire frame, with one eye closed, and marking points on the veil where they appear to be in the image. This procedure, depicted in Figure 1.5 (a), has come to be known as Alberti’s veil and is simply an effectuation of perspective projection. Geometric 3D reconstruction seeks to solve the inverse problem: given the image points on Alberti’s veil, back-project along the line of sight to determine the correct locations of the 3D points. It is apparent that instead of points, the geometric primitives for the study of 3D reconstruction from 2D projections must be rays emanating from a center of projection. This is precisely the purview of projective geometry.

(a) Forward projection (b) Inverse projection

Figure 1.5: (a) Man Drawing a Lute, a woodcut by the German artist Albrecht Dürer, ca. 1525, illustrates Alberti’s method of painting, which recognized image formation as a perspective projection. (b) The inverse operation of back-projection from the camera center along the rays of sight need not preserve the correct length ratios and angles.

As Figure 1.5 (b) shows, 3D reconstruction from a single 2D image is inherently an ill-posed problem, since information about the true depth of a scene point is lost in the 2D projection. However, images from a greater number of viewpoints should, arguably, better constrain the reconstruction, for ray intersections are localizable, while infinite rays are not. This intuition, which forms the basis of multiview geometry, is valid up to a certain extent. While a 3D reconstruction from just image data can indeed be computed when multiple views are available, there exists a so-called projective ambiguity in the reconstruction. Loosely, if image formation is considered as an incidence operation, then a transformation applied to the cameras, along with the inverse transformation applied to the bundle of rays emanating from the camera, can contrive to leave the observed image unchanged. A reconstruction up to a projective ambiguity is called a projective reconstruction. Mathematically, it is related to the true scene by an invertible projective transformation, which can be defined as a linear operator on the space of back-projected rays from the origin. Given image data, it is always possible to compute a projective reconstruction, but no better, without extraneous knowledge of the scene or imaging devices (Faugeras, 1992; Hartley et al., 1992). We review projective reconstruction for computer vision applications in Section 3.3.

1.2.2 Projective Spaces and Projective Cameras

The intuition that the geometry of 3D reconstruction is really a study of inverse projections can be formalized through the machinery of projective geometry. In projective geometry, points are represented by rays through the origin and lines are represented by hyperplanes through the origin, as visualized for the two-dimensional case in Figure 1.6. Stated differently, in projective geometry, all the points on a line passing through the origin are identified. We encounter projective geometry everyday when, for instance, we perceive two rails or the vertical edges of a skyscraper, as intersecting in a finite vanishing point (Figure 1.7). Indeed, one of the principal axioms of projective geometry is that any two lines in a “plane” must meet. In a projective world, concepts such as squares and circles, which are based on some notion of length, are meaningless.
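The axiom that any two lines in a projective plane meet is easy to verify computationally with the homogeneous coordinates formalized below: representing points and lines of the projective plane as 3-vectors, the line through two points and the intersection of two lines are both given by cross products, and two parallel Euclidean lines meet at an ideal point whose third coordinate is zero. A small sketch with hypothetical example lines:

```python
import numpy as np

def join(p, q):
    """Line through two points of the projective plane (homogeneous 3-vectors)."""
    return np.cross(p, q)

def meet(l, m):
    """Intersection point of two lines of the projective plane."""
    return np.cross(l, m)

# the parallel Euclidean lines y = x and y = x + 1, each through two finite points
l1 = join(np.array([0.0, 0.0, 1.0]), np.array([1.0, 1.0, 1.0]))
l2 = join(np.array([0.0, 1.0, 1.0]), np.array([1.0, 2.0, 1.0]))

p = meet(l1, l2)
# the lines meet at the ideal point in direction (1, 1): third coordinate is zero,
# the finite "vanishing point" only appears once this point is imaged by a camera
assert p[2] == 0.0 and p[0] == p[1]
```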

Suppose a 3D point $(X, Y, Z)^\top$ is observed at $(u, v)^\top$ on the image plane of a camera of focal length f. Then, assuming the imaging geometry as depicted in Figure 1.8, where the world coordinate system origin coincides with the camera’s center of


Figure 1.6: Perspective projection can be interpreted naturally using projective geometry. Identifying points along a ray through the projection center means the 3D world can be interpreted as the projective space P3, while the image plane is interpreted as the projective plane P2.

Figure 1.7: Vanishing points, where parallel lines of the real world appear to intersect in an image, are a consequence of projective geometry that we observe in daily life. Any two lines in a projective plane must axiomatically intersect at a unique point.

projection, the world axes are aligned with the image plane and the camera faces along the negative depth axis, the perspective projection equations for image formation are given by
$$\frac{u}{X} = \frac{v}{Y} = -\frac{f}{Z}, \qquad (1.1)$$
which are non-linear relations. Defining $u' = wu$ and $v' = wv$, where $w \neq 0$, the image formation equations can be interpreted in a linear framework as
$$\begin{pmatrix} u' \\ v' \\ w \end{pmatrix} = \begin{pmatrix} -f & 0 & 0 & 0 \\ 0 & -f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}. \qquad (1.2)$$
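As a quick numerical check, with a hypothetical focal length and scene point, the linear form (1.2) reproduces the nonlinear relations (1.1) after dehomogenization:

```python
import numpy as np

f = 2.0                                # hypothetical focal length
X = np.array([1.0, 0.5, -4.0, 1.0])    # homogeneous scene point (Z < 0: in front)

# projection matrix of equation (1.2): diag(-f, -f, 1) plus a zero fourth column
P = np.array([[-f, 0.0, 0.0, 0.0],
              [0.0, -f, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])

up, vp, w = P @ X                      # (u', v', w) = (wu, wv, w)
u, v = up / w, vp / w                  # dehomogenize to recover (u, v)

# agrees with the nonlinear form (1.1): u/X = v/Y = -f/Z
assert np.isclose(u / X[0], -f / X[2])
assert np.isclose(v / X[1], -f / X[2])
```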


Figure 1.8: The perspective projection camera model follows naturally from the projective geometry interpretation of perspective image formation.

While we view the world as a projective 3-space for the purpose of image formation, we can also view the image plane as the projective plane. Image formation, which is the projection of an infinite line through the origin to a 2D point on the image plane, can now be interpreted as a linear projective transformation from P3 to P2:
$$\begin{pmatrix} u' \\ v' \\ w \end{pmatrix} = \begin{pmatrix} -f & 0 & 0 & 0 \\ 0 & -f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X' \\ Y' \\ Z' \\ W \end{pmatrix}, \qquad (1.3)$$

where $X' = WX$, $Y' = WY$ and $Z' = WZ$. Notice that to represent a point in projective n-space Pn, we use n + 1 coordinates, but since scale is immaterial in projective space, only the ratios between the coordinate magnitudes are important. That is, to represent a point X ∈ Rn as a point in Pn, we can use the (n + 1)-dimensional vector $k \cdot (X, 1)^\top$, for any k ≠ 0.

In this interpretation, transformations of the camera and scene are now projective transformations, which are represented as invertible, but otherwise arbitrary 4 × 4 matrices. The projective ambiguity that we discussed in Section 1.2.1 now has an easy mathematical basis: given just the image points in P2, when we seek to recover the cameras and scene points that form the image, we can always insert a 4 × 4 invertible transformation and its inverse between the two entities on the right hand side of (1.3) to obtain the same image points.
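This ambiguity is easy to exhibit numerically. In the illustrative sketch below, a random camera matrix P and random homogeneous scene points X are perturbed by an arbitrary invertible 4 × 4 transform H; the cameras P H⁻¹ and points H X produce (up to rounding) the same image measurements:

```python
import numpy as np

rng = np.random.default_rng(0)

P = rng.normal(size=(3, 4))              # an arbitrary camera matrix
X = rng.normal(size=(4, 20))             # homogeneous scene points

H = rng.normal(size=(4, 4))              # an arbitrary 4x4 projective transform
while abs(np.linalg.det(H)) < 1e-3:      # re-draw in the unlikely singular case
    H = rng.normal(size=(4, 4))

x1 = P @ X                               # original image points
x2 = (P @ np.linalg.inv(H)) @ (H @ X)    # transformed cameras and transformed scene

# identical up to floating-point error: the images alone cannot distinguish
# (P, X) from (P H^-1, H X), which is the projective ambiguity
assert np.allclose(x1, x2)
```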

1.2.3 Stratification of 3D Reconstruction

The most visually noticeable aspect of a projective reconstruction, for example in Figure 1.9(a), is that all parallel lines of the (metric) scene in the same direction are concurrent. Further, angles between coplanar lines or length ratios are not preserved, but incidence relations remain unchanged. Once a projective reconstruction is computed, the goal of 3D reconstruction is to compute a metric upgrade, that is, determine a transformation that recovers the scene and camera configurations up to a similarity transformation. In other words, the task is to upgrade a projective reconstruction to one more in consonance with human experience. One way to achieve this goal is using some prior knowledge pertaining to the scene. For instance, knowledge of a few angles in the scene suffices to fix all the angles in a reconstruction, which amounts to a reconstruction up to a similarity. Extending the intuition, it is reasonable that a little less knowledge should allow us to achieve a reconstruction which is, in some sense, intermediate between a projective and a metric reconstruction. Indeed, such is the case - knowing a few parallel lines in the scene, for example, allows us to recreate it up to a so-called affine transformation, where parallelism between lines is restored, but in general, not the other angles. Defining such a hierarchy of transformations that can progressively upgrade a projective reconstruction to a metric one is the basis for stratification of 3D reconstruction (Figure 1.9), which we elaborate upon in Section 3.4.

(a) Projective (b) Affine (c) Metric

Figure 1.9: Stratification in 3D. Given images of a scene, a projective reconstruction can be computed, in which parallel lines meet at a vanishing point. A projective reconstruction can be upgraded to an affine one, in which parallelism is restored, if a few parallel lines in the scene are known. Knowing a few angles in the affine reconstruction allows it to be upgraded to a metric one.

An important constraint for a reconstruction is chirality, which requires the imaged scene points to lie in front of the cameras. An arbitrary projective reconstruction need not satisfy chirality, which manifests itself as a violation of the convex hull of the scene points. Enforcing the chirality condition yields a quasi-affine reconstruction, which is simply a projective reconstruction that preserves the convex hull of the scene (Figure 1.10). Section 3.5 reviews chirality in 3D reconstruction in greater detail. Since it forms an important part of Chapters 6 and 7, we can consider quasi-affinity as a separate stratum in our reconstruction hierarchy.

(a) Euclidean “scene” (b) A quasi-affine (c) A general projective reconstruction reconstruction

Figure 1.10: Despite appearances to the contrary, (c) is as valid a projective reconstruction of the house as (b). While the quasi-affine reconstruction preserves the convex hull of the scene in (a), a general projective transformation might not. Note that incidence relations are still preserved in (c).

1.2.4 Autocalibration

Rather than relying on any prior knowledge of the scene, an alternate ingress into a metric world from a projective one is by estimating the internal parameters of the camera (in effect, reducing the setting to a calibrated one). Indeed, with minimal assumptions on the imaging set-up, such as constancy of camera settings or rectangularity of image pixels, it is possible to recover the internal parameters of the cameras using image data alone. This process is called autocalibration or camera self-calibration, which is the subject of Chapters 6 and 7 of this dissertation. A typical approach to calibrating a camera involves using several images of a known calibration grid. Once a correspondence can be ascertained between scene points (or higher order features like curves) and their counterparts on the image plane, it is straightforward to recover the camera parameters. The term autocalibration stems from its key premise that it obviates the requirement for an explicit calibration grid. Instead, it tries to locate the image of the so-called absolute conic, which is an imaginary, but fixed object on the plane at infinity, whose location in a metric reconstruction is known a priori. Its image can be shown to be related to the internal parameters of the camera, so locating the image of the absolute conic is equivalent to calibrating the camera.

Figure 1.11: Autocalibration can be visualized as determining the plane at infinity and the conic on it which projects to the image of the absolute conic in each image.

Thus, geometrically, the goal of autocalibration is to recover the plane at infinity and the absolute conic in a projective reconstruction. This can be visualized as depicted in Figure 1.11, which also motivates the optimization paradigm of autocalibration that we pursue in Chapters 6 and 7: suppose we hypothesize the location of the image of the absolute conic in the input images. Then, if the back-projected cones from the camera centers all intersect in a common conic, it might be the absolute conic and we would have hypothesized the correct camera parameters. If we somehow knew the position of the plane at infinity, then the problem reduces to verifying the intersection of these cones with a known plane, which is arguably a simpler problem. Thus, estimating the position of the plane at infinity in a projective reconstruction is considered the most difficult obstacle to a metric reconstruction (Hartley et al., 1999).

Of course, the above is only a sketch to aid visualization of the problem - in reality, the absolute conic and its image are imaginary entities.

1.2.5 Feature Selection and Matching

From the foregoing discussions, while single view geometric reconstruction is, ostensibly, an ill-defined challenge, multiview reconstruction is much more tractable, provided corresponding features can be identified across the multiple images. All the algorithms that we discuss in this dissertation require salient features in the image as input, which must then be matched across the image dataset to detect correspondences. The two steps of feature detection and matching are a basic requirement for most 3D reconstruction frameworks (there are a few exceptions, for example (Dellaert et al., 2000)). With the exception of Chapter 9, this dissertation assumes that these two steps have already been addressed. This is not to suggest that feature detection and matching are trivial stages of 3D reconstruction - quite to the contrary, they present significant computational challenges as image datasets become larger in this Internet age. However, fast and robust algorithms exist to solve these problems, which work very well in practice for small to medium sized image collections, which is the target application domain of our algorithms.

Feature Detection

Salient features are useful when they are sparse, but informative. Moreover, they should be reproducible, ideally persisting across different image scales and pose transformations. Determining good features for particular applications like 3D reconstruction and object recognition is an active area of research, some of which we will survey in Section 3.1.3. In all instances, except Chapter 9, our features of interest in this dissertation are corners. We review the basic principles of corner detection in Section 3.1.1. The features of interest in Chapter 9 are straight lines. Using higher order features is advantageous in some milieu, since they are better localized and can encode a greater amount of information.

Not all the detected corners are good features for structure from motion applications. One of the more striking examples of a bad “corner” is a region where the feature detector outputs a favorable response, since it detects the appropriate intensity changes, but the region does not stay “locked down” under a camera motion. This can arise, for example, at the occlusion interface of two regions at different depths or along T-junctions of two edges in relative motion. Figure 1.12 gives examples of good and bad corners for SFM applications.

Figure 1.12: Examples of good and bad corners. The blue square shows a good corner, which stays “locked down” under a camera motion. The red circle, on the other hand, has the characteristics of a corner, but is not suited for SFM since it does not correspond to the same 3D point when the camera moves.

Feature matching

In a practical SFM application, how do we prevent the feature detector from yielding spurious candidates? The short answer is that we need not prematurely discard putative features. The real requirement for SFM is a correspondence between interest points in an image sequence, that is, being able to identify the same point across an image sequence. In most cases, bad corners are weeded out by the feature matching step. To be able to match corresponding features across images, a descriptor is required for each feature. There are two basic requirements for such a descriptor:

1. It must be sufficiently discriminative, that is, it must distinguish one feature from others.

2. It must be invariant to common transformations, such as rotation, translation and scale changes, as well as illumination variations.

Most descriptors are computed using operations on intensity patches around an interest point. Depending upon the application, these operations might be simply computing a squared difference of intensities, or something more advanced. We will review some of the popular feature matching strategies in Section 3.1.2.

Matching features across a large image dataset is the time-critical step for many SFM applications. Thus, appropriate attention must be paid to designing features which are amenable to fast matching algorithms. Some recent advances in designing features tailored for rapid matching are discussed in Section 3.1.3.
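As a minimal illustration of such a descriptor, the raw intensity patch around a corner can itself serve as the descriptor, compared by the sum of squared differences of intensities mentioned above. This sketch is purely illustrative and the function names are hypothetical; real systems use descriptors with the invariances listed above:

```python
import numpy as np

def patch_descriptor(img, x, y, half=3):
    """Raw-intensity descriptor: the (2*half+1) x (2*half+1) patch around (x, y)."""
    return img[y - half:y + half + 1, x - half:x + half + 1].astype(float).ravel()

def best_match(desc, candidates):
    """Index of the candidate with the smallest sum of squared differences (SSD)."""
    ssd = [np.sum((desc - c) ** 2) for c in candidates]
    return int(np.argmin(ssd))
```

Such a raw-patch descriptor is discriminative but not invariant to rotation or scale changes, which is exactly why the more elaborate descriptors surveyed in Section 3.1.2 are needed in practice.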

1.3 The Optimization Framework

In this dissertation, multiview geometry problems are posited and solved in an optimization framework. The optimization paradigm for problem solving is ideal for dealing with noisy data and flexibly combining information from disparate sources - akin to the situation for estimating 3D reconstructions from noisy images, possibly acquired using multiple cameras. In particular, the uncertainties associated with noise or inaccuracies in image acquisition and feature matching can be explicitly modeled in an optimization framework. Also, formulating the goal as an optimization problem allows scalability across an arbitrary number of points or views. Moreover, the logical demarcation between the goal (the objective function) and the means to achieve it (the optimization algorithm) is invaluable for analysis and generalization.

1.3.1 Global Optimization for 3D Reconstruction

A globally optimal algorithm, for a specific class of objective functions, is one that provably converges to the global minimum. We refer the reader to Section 4.1 for a survey of prominent global optimization techniques employed in various disciplines of science and engineering. In this section, we will motivate the need for global optimization in multiview geometry.

3D reconstruction problems are hard

Since image formation and 3D reconstruction involve projective entities, cost functions usually encountered in 3D reconstruction have intrinsically non-linear characteristics. These cost functions are also typically highly non-convex, so the search space is riddled with local minima, as Figure 1.13 illustrates for the autocalibration problem (details in Section 6.2.4). Even for the simplest of 3D reconstruction problems, namely triangulation, the objective function terrain can be quite complex, as shown in Figure 5.1. For several multiview geometry problems, there may also be many critical camera configurations (Kahl, 2001; Kahl et al., 2000; Sturm, 1997), which might result in a problem that is especially arduous to optimize. For some applications, like bilinear programming in Chapter 8, the number of variables might be very large, so a brute force search in high dimensional space is unlikely to converge in a reasonable amount of time.

Traditional methods have limitations

The standard practice for tackling multiview geometry problems is to first solve a simpler problem, usually a linear least squares one. This solution is then used as an initialization for a non-linear optimization routine that minimizes the actual cost function (Figure 1.14). The simple initial problem corresponds to an algebraic cost function, which may give the correct solution to the geometric problem in the absence of noise. But when noise is present, the linear solution can be quite non-intuitive and may bear no resemblance to the actual geometry of the problem. For some problems, like estimating

(a) Top and side view (b) Side view (c) Contour plot

Figure 1.13: A typical cost function for a multiview geometry problem might have several local minima. In addition, the surface terrain can be very rugged, which necessitates a highly accurate estimation algorithm. This figure shows various views and a contour plot of a 2D slice of the 3D volume that characterizes the variation of the cost function for the autocalibration problem with change in the position of the plane at infinity. Blue denotes a low cost and red denotes a high cost. Please see Section 6.2.4 for further details.

the plane at infinity in Section 6.5, there may not even be a useful linear least squares problem for initialization. The traditional practice in such situations is to straightaway use non-linear minimization with multiple random restarts. The non-linear minimizer of choice for multiview geometry applications is Levenberg-Marquardt (Levenberg, 1944; Marquardt, 1963), which behaves like gradient descent when far from the local minimum and like Gauss-Newton when proximal to it. A Levenberg-Marquardt algorithm tailored to exploit the problem structure and sparsity patterns in multiview geometry is called bundle adjustment (Triggs et al., 1999). Bundle adjustment methods are quite powerful and fast, but suffer from the drawback that they are inherently local minimization algorithms. So, they require a good initialization in the vicinity of the global minimum to be able to converge to it, else they are likely to get stuck in local minima. An example of the advantage of global optimization is shown


Figure 1.14: Traditionally, to minimize a non-convex objective, a linear solution corresponding to an algebraic least squares cost function is used to determine an initialization. Subsequent application of a gradient-based local optimization method converges to a local minimum, which might be far away from the global minimum.

in Figure 7.2. Thus, it is not straightforward to come up with good initializations for multiview geometry problems and even then, there is no way to verify whether the solution achieved by the non-linear minimization routine is indeed the global optimum. Finally, when outliers are present in the data, it is beneficial to pose problems using a robust error measure, like the L1-norm, which may be non-differentiable. Clearly, traditional gradient-based methods cannot be applied for such cases.
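The interpolation between gradient descent and Gauss-Newton mentioned above is governed by a single damping parameter. The following is a generic, textbook-style sketch of Levenberg-Marquardt on a toy curve-fitting problem, not the sparse bundle adjustment machinery itself; the problem and all names are illustrative:

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, theta0, iters=100):
    """Damped Gauss-Newton: large lambda ~ a scaled gradient step,
    small lambda ~ the Gauss-Newton step."""
    theta, lam = np.asarray(theta0, dtype=float), 1e-3
    cost = np.sum(residual(theta) ** 2)
    for _ in range(iters):
        r, J = residual(theta), jacobian(theta)
        step = np.linalg.solve(J.T @ J + lam * np.eye(len(theta)), -J.T @ r)
        new_cost = np.sum(residual(theta + step) ** 2)
        if new_cost < cost:              # accept the step, trust the model more
            theta, cost, lam = theta + step, new_cost, lam * 0.5
        else:                            # reject, fall back toward gradient descent
            lam *= 10.0
    return theta

# fit y = a * exp(b * t) to noiseless data generated with (a, b) = (2, -1)
t = np.linspace(0.0, 2.0, 30)
y = 2.0 * np.exp(-1.0 * t)
res = lambda th: th[0] * np.exp(th[1] * t) - y
jac = lambda th: np.stack([np.exp(th[1] * t), th[0] * t * np.exp(th[1] * t)], axis=1)
theta = levenberg_marquardt(res, jac, [1.0, 0.0])
```

Because every accepted step decreases the cost, the iteration converges reliably from this initialization; from a poor one, it settles into whichever local minimum is nearest, which is precisely the limitation discussed above.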

The choice of objective function

For some multiview geometry problems, the “best” objective function can be defined by due consideration of the error statistics for image measurements. A common assumption is that image measurement errors obey a Gaussian probability distribution.

This leads to the popular notion of an L2-norm reprojection error, which is the usual objective function for several problems like triangulation (Section 5.2) and projective reconstruction (Section 3.3).

While this dissertation does propose algorithms for minimizing these standard objective functions, we note that the L2-norm reprojection error is optimal only under a set of simplifying assumptions on measurement errors which might not be satisfied by real world error distributions. For instance, the L2-norm reprojection error results in an objective function too sensitive to the presence of outliers in the data. Using a more robust L1-norm formulation renders the cost function non-differentiable and thus, not amenable to minimization by traditional gradient-based methods.
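The sensitivity difference is visible even in the simplest estimation problem, fitting a single scalar to repeated measurements: the L2 estimate is the mean and the L1 estimate is the median, and one gross outlier corrupts the former while barely moving the latter (the numbers below are illustrative):

```python
import numpy as np

# repeated measurements of a quantity near 1.0, one of which is a gross outlier
data = np.array([1.0, 1.1, 0.9, 1.05, 0.95, 10.0])

l2_estimate = np.mean(data)     # minimizes the sum of squared residuals
l1_estimate = np.median(data)   # minimizes the sum of absolute residuals

# the L2 estimate is dragged toward the outlier; the L1 estimate is not
assert abs(l1_estimate - 1.0) < 0.1
assert abs(l2_estimate - 1.0) > 1.0
```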

Our global optimization approach

Following the classification of the survey in Section 4.1, the global optimization algorithms that we propose are deterministic, that is, they do not possess any element of randomization. All our algorithms rely on modern convex optimization methods to construct convex relaxations for the non-convex objective functions encountered in multiview geometry. In most cases, we use these convex relaxations in a branch and bound framework to systematically prune away regions of the search space guaranteed to not contain the global optimum (Section 4.3). In other cases, we construct a series of convex relaxations of increasing complexity that converges to the global minimum (Section 4.2.4).

The need for global optimization

Before we delve deeper into the dissertation, a valid question at this stage is about the need for global optimization. From common experience, local optimization methods seem to perform satisfactorily in many realistic scenarios. Then why should one develop global optimization methods, which are admittedly more sophisticated than traditional approaches? There are several facets to the answer.

The first, most obvious one, is that global optimization algorithms always yield the optimal solution, including in situations where traditional algorithms might break down due to a poor initialization or the complex nature of the objective function’s terrain. Indeed, a key feature of our algorithms is that they do not require gradient information or even differentiability of the cost function. Further, while our algorithms will achieve global optimality for any given initialization, we also propose geometrically correct initializations that do not compromise optimality.

Second, reaching the global optimum is only the first step towards achieving optimality. The more important step is to be able to prove that the solution is, indeed, globally optimal. Typical local minimization methods do not provide any mechanism for ascertaining the quality of the solutions they achieve. Our methods, on the other hand, always terminate with a certificate of optimality which guarantees the solution to lie within a pre-specified tolerance of the global optimum. Indeed, most of the time consumption of our algorithms is directed towards verifying optimality - the optimal point itself is usually recovered in a fraction of that time.

Third, while local optimization methods will always be an attractive proposition for system implementations due to their speed advantages, it is important to be able to determine the conditions under which they break down.
Before our global optimization algorithms, there was no mechanism for characterizing the failure modes of popular gradient-based algorithms in the face of real-world data where no ground truth is available. Thus, following the terminology of (Hartley and Zisserman, 2004), some of our algorithms are Gold Standard ones for the concerned multiview geometry problems.

Finally, we note that a global optimization algorithm with poor empirical convergence behavior is tantamount to a brute force search and of little practical utility. The primary reason our algorithms demonstrate reasonable solution times is that special properties of multiview geometry cost functions are exploited at every stage of the optimization. This dissertation is, thus, also an argument for the importance of incorporating domain knowledge into the optimization framework. That said, while we provide a unified framework for obtaining optimal solutions to several 3D reconstruction problems, our constructions are in many ways general enough to find application in domains beyond multiple view geometry too.

1.3.2 Optimization for Robust SFM

From the discussion on feature selection and matching in Section 1.2.5, it is apparent that even with conservative matching criteria, it is not always possible to produce a perfect set of correspondences. How can we ensure that false matches, or outliers, do not corrupt the structure and motion estimates? This problem falls within the ambit of robust SFM. There are several ways in which robustness can be achieved in SFM applications. One option is to use a cost function that is relatively insensitive to outliers. For instance, in Chapters 5 and 8, we will propose algorithms for solving problems under the L1-norm, which corresponds to an error distribution with a thicker tail. An alternative is to use minimal solutions within a hypothesize-and-test framework. A minimal solution to an SFM problem is one achievable using the minimum amount of data. While a minimal solution by itself is susceptible to noise, it is indispensable for rejecting outliers. A hypothesize-and-test framework, such as Random Sample Consensus (RANSAC), is used for robustly solving a model-fitting problem. For instance, suppose we want to find a line that best fits a given set of points; then the model is a straight line and a minimal sample consists of 2 points. RANSAC randomly chooses a minimal sample from the data and uses it to compute the model. Then, it determines the support for this sample, that is, the number of other points in the dataset consistent with its model. A model that retains a large consensus set is more likely to be correct than one that does not. In fact, for a given probability p < 1, one can pre-determine the number of random samples that RANSAC must evaluate to determine the correct model with a probability of success equal to p. In Chapter 9, we will develop a line-based SFM algorithm that computes the displacement of a stereo camera rig using the minimum amount of data.
Even though a minimal solution is only required to be fast, not robust, a good one should incorporate problem-specific constraints (like orthonormality of the rotational component of the motion). A good minimal solution translates into fewer RANSAC trials, which reduces computation time and might be critical in situations where the feature set is sparse. We use our minimum data solution in a RANSAC framework to obtain a robust solution that achieves close to real-time rates for indoor robotic navigation.
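The line-fitting example above can be sketched in a few lines of Python. This is a schematic RANSAC, not the stereo line-based algorithm of Chapter 9; the thresholds, point set and function names are illustrative assumptions.

```python
import math
import random

def num_trials(p, w, s):
    """Samples needed so that, with probability p, at least one sample
    of size s is all inliers, given inlier fraction w."""
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - w ** s))

def fit_line(p1, p2):
    """Minimal solution: the line ax + by + c = 0 through two points,
    normalized so that |ax + by + c| is point-to-line distance."""
    (x1, y1), (x2, y2) = p1, p2
    a, b = y2 - y1, x1 - x2
    c = -(a * x1 + b * y1)
    n = math.hypot(a, b)
    return a / n, b / n, c / n

def ransac_line(points, n_trials, inlier_tol=0.1):
    best, best_support = None, -1
    for _ in range(n_trials):
        p1, p2 = random.sample(points, 2)        # minimal sample of 2 points
        if p1 == p2:
            continue
        a, b, c = fit_line(p1, p2)
        support = sum(abs(a * x + b * y + c) < inlier_tol for x, y in points)
        if support > best_support:               # keep the largest consensus set
            best, best_support = (a, b, c), support
    return best, best_support

random.seed(0)
# 20 collinear inliers on y = x, plus 2 gross outliers
pts = [(float(t), float(t)) for t in range(20)] + [(5.0, -30.0), (10.0, 40.0)]
n = num_trials(0.99, 0.9, 2)          # trial count predicted by the formula
model, support = ransac_line(pts, 100)  # generous trial budget for the demo
```

The `num_trials` function implements the standard pre-determined trial count mentioned in the text: for example, with a 50% inlier fraction and a 2-point sample, 17 trials suffice for 99% confidence.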

1.4 Contributions of the Dissertation

Let us summarize our discussions so far and portend the upcoming ones by emphasizing the contributions of this dissertation.

1. Global optimization for 3D reconstruction: Our first set of contributions consists of globally optimal solutions to several problems that arise in geometric 3D reconstruction. Traditional methods often break down for these problems due to the complex nature of cost functions that arise in multiview geometry or due to the large number of variables that some reconstruction problems involve. Some of the problems that we globally minimize are:

• Triangulation and camera resectioning (Chapter 5)
• Stratified autocalibration (Chapter 6)
• Direct autocalibration (Chapter 7)
• Shape reconstruction from exemplars (Chapter 8)

2. Provably tight convex relaxations: We exploit recent developments in convex optimization theory to develop tight convex relaxations for the non-convex objective functions that commonly arise in multiview geometry. These convex relaxations are shown to be theoretically viable and practically ideal for use in a branch and bound framework for global optimization.

3. Provable strategies for tractable branch and bound: Our next set of contributions demonstrates how incorporating domain knowledge into the branch and bound optimization framework makes seemingly intractable problems solvable. In most multiview geometry problems encountered in this dissertation, the dimensionality of the search space increases with the number of points or views. We

propose several extensions to a traditional branch and bound scheme that allow us to achieve fast convergence rates in practice. Some examples of these extensions are:

• A novel bounds propagation scheme that capitalizes on the geometric structure of 3D reconstruction problems to restrict the effective dimensionality of our search space to a small, fixed number, irrespective of problem size.

• A problem-specific procedure for selecting the initial search region, which is not inordinately large, yet is guaranteed to contain the global optimum.

• A new, non-exhaustive branching strategy that is demonstrably convergent for some 3D reconstruction problems, even though it performs branching only in the dimensions corresponding to a small subset of the problem variables.

4. Real-time SFM for autonomous robotic navigation: Finally, we turn our attention to developing a real-time SFM system for autonomous robotic navigation, which will be deployed on Honda's humanoid ASIMO robot. This system solves a minimum data problem to localize a stereo rig employing straight line features. Chapter 9 describes a bottom-up system implementation, from feature detection and tracking to solving the minimum data SFM problem and using it in a RANSAC framework for robust estimation.

1.5 How to Read This Dissertation

This dissertation touches upon a broad array of subjects in multiple view geometry and optimization theory, but aims to cater to experts as well as beginners. Therefore, it can be read in multiple ways according to the reader's background or objectives. While the foregoing chapter was intended to serve the dual purpose of a historical perspective and an intuitive preview of the challenges we address, a more didactic introduction to the prerequisites required for this dissertation is contained in Chapters 2, 3 and 4. These chapters are written with the aim of ensuring the dissertation's self-sufficiency and should be of particular appeal to, say, an entry-level graduate student with an adequate linear algebra background. Specifically, Chapter 2 is a review of concepts from projective geometry that will be used to formulate and solve our 3D reconstruction problems. Chapter 3 presents the relevant background from multiple view geometry. Chapter 4 outlines the convex optimization theory and global optimization frameworks that will be used throughout the dissertation.

The next block within the dissertation, from Chapters 5 to 8, presents globally optimal solutions for several well-known problems pertaining to the geometry of multiple views. While the theoretical background and tools required for these chapters may vary, they are organized in order of the apparent complexity of the task. Accordingly, we begin in Chapter 5 with the simplest of multiview geometry problems, namely triangulation and camera resectioning. These "simple" tasks can nevertheless be shown to be NP-hard. We propose globally optimal solutions for these problems, posed in various error norms, based on modern convex relaxations in a branch and bound framework, which is tailored to achieve fast convergence in practice by exploiting the underlying geometric structure.
The same framework is extended to propose globally optimal algorithms for both the affine and metric upgrade stages of stratified autocalibration in Chapter 6. The most elemental, yet powerful, mechanism of chirality in projective geometry is brought into play to serve as a theoretically justified and practically useful initialization for the algorithm. While the subject of Chapter 7, direct autocalibration, is pedagogically more advanced than the stratified approach, its projective geometry representation is more concise, which leads to a simpler formulation. Consequently, it becomes amenable to solution in a framework of convex linear matrix inequality relaxations, based on the elegant theory of positive polynomials from real algebraic geometry.

Chapter 8 tackles bilinear fitting, which arises in yet more complex problems like single-view shape reconstruction using 3D exemplars and non-rigid structure from motion. The apparent difficulty of bilinear programming in computer vision stems from the large dimensionality of its search space. However, we exploit an underlying trait of geometric cost functions to effectively restrict the number of dimensions to a small, fixed number, which makes global optimization practical.

Chapter 9 of the dissertation is of greatest interest to the practitioner. It outlines our construction of a robust and accurate 3D reconstruction pipeline, from feature detection and tracking to full SFM, which will form one of two parallel components in the autonomous navigation system for Honda's humanoid robot, ASIMO. The structure and motion algorithm described here is based on stereo input and uses straight lines as features.

Finally, we conclude the dissertation in Chapter 10 with further discussions of the perceived impact of this work and our outlook towards the future.

Chapter 2

Preliminaries: Projective Geometry

“Beauty itself is but the sensible image of the infinite.”

George Bancroft (American historian, 1800-1891 AD), The Necessity, Reality and Promise of Progress of the Human Race

In this chapter, we will develop some of the terminology and notation useful in formulating the 3D reconstruction problems for which we seek globally optimal solutions. We will begin with a brief review of concepts from projective geometry, which forms the basis for mathematical representation of multiview geometry. Large segments of this chapter borrow liberally from (Hartley and Zisserman, 2004), where a more detailed treatment of the relevant material can be found.

2.1 Axiomatic Projective Geometry

As with any geometry, the entire framework of projective geometry can be developed axiomatically. For completeness, we give a brief preview of the axiomatic framework for projective geometry here and refer the reader interested in a more detailed exposition to texts such as (Beutelspacher and Rosenbaum, 1998). A geometry G = (S, I) can be defined as a pair consisting of a set S and a symmetric and reflexive incidence relation I between the elements of the set. That is,

for any x, y ∈ S, if (x, y) ∈ I, then (y, x) ∈ I and (x, x) ∈ I. For instance, in the classical geometry of 3 dimensions, E, the set S consists of the points, lines and planes of Euclidean 3-space and the incidence relation I is our notion of "contained by" and "passes through". A flag of G is a set of elements of S that are mutually incident. A flag F is maximal when there is no element in S \ F whose union with F is a flag. G has rank r if S can be partitioned into r subsets, S1, …, Sr, such that every maximal flag contains exactly one element of each subset. Each maximal flag in a geometry of rank r must have exactly r elements. For example, in the case of Euclidean geometry of 3-space, the set S itself is the only maximal flag, which can be partitioned into three subsets, corresponding to the sets of all points, lines and planes. So, E is a geometry of rank 3.

A subgeometry G′ = (S′, I′) is defined by a subset S′ ⊆ S and a relation I′ induced by I, that is, I′ is the restriction of I to S′. In particular, the geometry Gi = (S \ Si, Ii), where Ii is induced from G, is a geometry of rank r − 1. It follows that any geometry of rank r ≥ 2 can be studied as a rank 2 geometry. The two subsets S1 and S2 of a rank 2 geometry can be considered as corresponding to points and lines, and will now be denoted as P and L. Let G = ({P, L}, I) be a geometry of rank 2. Then the following axioms are satisfied by any projective geometry:

• Axiom 1: For every two distinct points, there is one distinct line incident to them.

• Axiom 2: For points A, B, C, D, where A and D are distinct from B and C, if AB intersects CD, then AC intersects BD.

• Axiom 3: Any line is incident with at least three points.

• Axiom 4: There are at least two lines.

A projective space is a geometry of rank 2 that satisfies Axioms 1, 2 and 3. A non-degenerate projective space also satisfies Axiom 4. A projective plane is a non-degenerate projective space where Axiom 2 is replaced by a stronger one:

Axiom 2’: Any two lines have at least one point in common. • As an example, in this dissertation, the image plane is considered as a projective plane, so any two image lines are considered intersecting. In the following chapters, we will be concerned with transformations between projective spaces. This is can be formalized as follows. Suppose G = ( P,L ,I) and { } G0 = ( P 0,L0 ,I0) are two rank 2 geometries. If there is a map h : P,L P 0,L0 { } { } → { } such that P is mapped bijectively to P 0 and L is mapped bijectively to L0, then h is an isomorphism from G to G0. An automorphism is an isomorphism from a rank 2 geometry to itself. For a rank 2 geometry with a notion of a line, an automorphism is also called a collineation or a homography. We will frequently encounter homographies in this dissertation, in the context of the image plane treated as a projective plane, for example, inter-image homographies or homographies between a scene plane and the image plane. While the rich geometry of projective spaces can be wholly developed in this ax- iomatic framework, for ease of computational representation, we will adopt the coordinate approach for the rest of this dissertation.

2.2 Projective Geometry of 2D

Points and Lines in 2D

Since absolute depth is not important in projective geometry, it is convenient to represent entities algebraically in homogeneous coordinates. A 2D point in Cartesian coordinates, (x, y)^T, has the homogeneous representation x = (x, y, 1)^T, which is equivalent to (kx, ky, k)^T for any k ∈ R, k ≠ 0.

A 2D line whose equation is ax + by + c = 0 may be represented in homogeneous coordinates as l = (a, b, c)^T; note that (ka, kb, kc)^T represents the same 2D line. It follows that a point x lies on the line l if and only if x^T l = 0.

The point of intersection of two lines l and l′ is given by the vector cross product x = l × l′. This may be easily verified, since x^T l = x^T l′ = 0. Similarly, the join of two points x and x′ is given by the line l = x × x′.

Ideal points

The above definition of line-line intersection also serves to illustrate Axiom 2′ in Section 2.1 and how it distinguishes the projective plane P² from the Euclidean plane R². Consider two parallel lines in R², with the equations ax + by + c = 0 and ax + by + c′ = 0. It is common to say that these two lines "never intersect". In P², they are represented as homogeneous vectors l = (a, b, c)^T and l′ = (a, b, c′)^T, whose intersection is x_∞ = l × l′ = (c − c′)(−b, a, 0)^T, which is a valid point in P². However, we cannot represent the point x_∞ in R², since its de-homogenization corresponds to a division by zero. While this corresponds with our Euclidean notion that parallel lines can never meet, projective geometry provides a framework for the uniform treatment of finite and infinite points.

A point such as x_∞ ∈ P², having the form (x1, x2, 0)^T, does not have a Euclidean analogue and is called an ideal point. From the incidence relation of points and lines in P², we observe that all ideal points must lie on a common line, the aptly termed line at infinity, given by l_∞ = (0, 0, 1)^T. Thus, the line at infinity is what distinguishes the projective plane from the Euclidean plane, that is, P² = R² ∪ l_∞.
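These incidence, meet and join rules are easy to verify numerically. The following sketch (using numpy, with made-up coefficients) checks the parallel-line example above:

```python
import numpy as np

# Two parallel lines ax + by + c = 0: x + 2y + 1 = 0 and x + 2y + 5 = 0
l1 = np.array([1.0, 2.0, 1.0])
l2 = np.array([1.0, 2.0, 5.0])

x_inf = np.cross(l1, l2)                # meet of two lines: x = l x l'
assert x_inf[2] == 0                    # an ideal point: last coordinate is 0
assert x_inf @ l1 == 0 and x_inf @ l2 == 0    # incidence: x^T l = 0
assert x_inf @ np.array([0.0, 0.0, 1.0]) == 0 # ideal points lie on l_inf

# Join of two points: the line through (0, 0) and (1, 1) is y = x
p, q = np.array([0.0, 0.0, 1.0]), np.array([1.0, 1.0, 1.0])
line = np.cross(p, q)                   # proportional to (-1, 1, 0), i.e. y = x
```

The same cross-product routine computes both meets of lines and joins of points, which is the duality principle in action.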

Duality

It may be apparent to the reader that points and lines seem to enjoy a symmetric relationship in the projective plane P². Indeed, this is a consequence of the duality principle, which states that any theorem about incidences between points and lines in the projective plane may be transformed into another theorem about lines and points, by a substitution of the appropriate words. The duality principle in P² is a special case of duality for general projective spaces Pⁿ, whereby entities of dimension d can be interchanged with those of codimension d + 1.

Conics

Conics play a central role in multiview geometry, especially the theory of autocalibration. Geometrically, a conic (or a conic section) is obtained as the intersection of a plane in R³ with a cone in R³. Algebraically, it is a curve on a 2D plane defined by a degree 2 equation:

\[ ax^2 + 2bxy + cy^2 + 2dx + 2ey + f = 0, \qquad (2.1) \]

or equivalently,

\[ \mathbf{x}^\top C\, \mathbf{x} = 0, \quad \text{where } C = \begin{pmatrix} a & b & d \\ b & c & e \\ d & e & f \end{pmatrix} \text{ is symmetric.} \qquad (2.2) \]

Clearly, five points in general position suffice to define a conic.

A conic is said to be degenerate when C is rank deficient. An example of a degenerate conic of rank 2 is a pair of lines: two lines l and l′ can be together represented by the conic C = l l′^T + l′ l^T. An example of a rank 1 conic is a repeated line.

Tangent to a conic

A line l is tangent to a conic C at the point x if and only if l = Cx. Note that tangency is defined only for non-degenerate conics. To prove this, note that l does pass through x, since l^T x = x^T C x = 0. Next, suppose l passes through another point y on C. Then, y^T C y = 0 and y^T C x = 0. It easily follows that for any α ∈ R, (x + αy)^T C (x + αy) = 0, that is, the entire line joining x and y lies on the conic C, which is not possible unless C is degenerate. Thus, l = Cx is tangent to C at x.

Dual conics

A dual conic is defined as the envelope of all lines tangent to a point conic. The dual to a point conic C is characterized by the relation

\[ \mathbf{l}^\top C^*\, \mathbf{l} = 0, \qquad (2.3) \]

where C* represents the adjoint of C. Indeed, it is easy to show that the dual of a non-degenerate conic C is given by C* (C* = C⁻¹ when C is full rank). Suppose l = Cx is tangent to C at x. Then,

\[ \mathbf{x}^\top C\, \mathbf{x} = 0 \;\Leftrightarrow\; (C^{-1}\mathbf{l})^\top C\, (C^{-1}\mathbf{l}) = 0 \;\Leftrightarrow\; \mathbf{l}^\top C^{-\top} \mathbf{l} = 0 \;\Leftrightarrow\; \mathbf{l}^\top C^{-1} \mathbf{l} = 0, \]

since C is symmetric.
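A quick numerical check of the tangency and dual-conic relations, using the unit circle as the conic (an illustrative choice):

```python
import numpy as np

C = np.diag([1.0, 1.0, -1.0])          # unit circle: x^2 + y^2 - 1 = 0
x = np.array([1.0, 0.0, 1.0])          # the point (1, 0) lies on it
assert x @ C @ x == 0

l = C @ x                               # tangent at x is l = Cx
# l = (1, 0, -1) is the vertical line x = 1, tangent to the circle at (1, 0)

C_star = np.linalg.inv(C)               # dual conic (C is full rank here)
assert abs(l @ C_star @ l) < 1e-12      # tangent lines satisfy l^T C* l = 0
```

The dual conic thus "contains" exactly the lines tangent to the point conic, as the envelope definition states.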

Effect of projective transformations

A general projective transformation H : P² → P², also called a homography, is represented by a full-rank 3 × 3 matrix H.

When a projective transformation is applied to a space such that points are transformed as x′ = Hx, then lines are transformed as l′ = H^{-T} l. This can be easily verified as a requirement for preserving incidence, since

\[ \mathbf{x}^\top \mathbf{l} = 0 \;\Leftrightarrow\; \mathbf{x}'^\top \mathbf{l}' = 0. \qquad (2.4) \]

Under a point transformation x′ = Hx, a point conic C transforms as C′ = H^{-T} C H⁻¹. To verify this, we note

\[ \mathbf{x}^\top C\, \mathbf{x} = 0 \;\Rightarrow\; \mathbf{x}'^\top H^{-\top} C H^{-1} \mathbf{x}' = 0 \;\Rightarrow\; C' = H^{-\top} C H^{-1}. \qquad (2.5) \]

Similarly, under a point transformation x′ = Hx, a dual line conic C* transforms as C*′ = H C* H^T, which can again be verified as

\[ \mathbf{l}^\top C^* \mathbf{l} = 0 \;\Rightarrow\; \mathbf{l}'^\top H C^* H^\top \mathbf{l}' = 0 \;\Rightarrow\; C^{*\prime} = H C^* H^\top. \qquad (2.6) \]

Borrowing the terminology of tensor algebra, conics transform covariantly, while dual conics transform contravariantly.
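The covariant and contravariant transformation rules can be verified numerically; the homography below is an arbitrary full-rank matrix chosen for illustration.

```python
import numpy as np

C = np.diag([1.0, 1.0, -1.0])                  # unit circle as a point conic
H = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, -1.0],
              [0.3, 0.0, 1.0]])                # an arbitrary full-rank homography
Hinv = np.linalg.inv(H)

C_new = Hinv.T @ C @ Hinv                      # covariant: C' = H^{-T} C H^{-1}
x = np.array([np.cos(0.7), np.sin(0.7), 1.0])  # a point on the circle
xp = H @ x                                     # points map as x' = Hx
assert abs(xp @ C_new @ xp) < 1e-9             # x' lies on the transformed conic

Cs_new = H @ np.linalg.inv(C) @ H.T            # contravariant: C*' = H C* H^T
l = C @ x                                      # tangent line at x
lp = Hinv.T @ l                                # lines map as l' = H^{-T} l
assert abs(lp @ Cs_new @ lp) < 1e-9            # l' is tangent to the image conic
```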

2.3 Projective Geometry of 3D

The geometry and interactions of the dimension 0 and codimension 1 entities in P³, namely, points and planes, follow rules very similar to those for the corresponding entities in P². While points and planes are duals in P³, lines are self-dual since they have dimension 1 and codimension 2.

2.3.1 Points and Planes

A 3D point (x, y, z)^T ∈ R³ is represented in P³ by a homogeneous four-vector X = (kx, ky, kz, k)^T, where k ∈ R and k ≠ 0. A plane in R³, with the equation ax + by + cz + d = 0, is represented in P³ by a homogeneous four-vector π = (ka, kb, kc, kd)^T, which is consistent with the incidence relation π^T X = 0. Under a 3D projective transformation H : P³ → P³, if points transform as X′ = HX, then planes transform as π′ = H^{-T} π.

The intersection of three planes π, π′ and π″ is given by the point X = null([π, π′, π″]^T). Similarly, the join of three points X, X′ and X″ is given by the plane π = null([X, X′, X″]^T).
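The null-space computation above is typically carried out with an SVD; a small numpy sketch with hypothetical planes:

```python
import numpy as np

def meet_of_planes(p1, p2, p3):
    """Point common to three planes: the null vector of [p1, p2, p3]^T."""
    A = np.stack([p1, p2, p3])
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]       # right singular vector of the smallest singular value

# The planes x = 1, y = 2 and z = 3 meet at the point (1, 2, 3)
X = meet_of_planes(np.array([1.0, 0.0, 0.0, -1.0]),
                   np.array([0.0, 1.0, 0.0, -2.0]),
                   np.array([0.0, 0.0, 1.0, -3.0]))
X = X / X[3]            # de-homogenize (safe here: the point is finite)
```

The join of three points is computed identically, with the points stacked in place of the planes.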

2.3.2 Lines

Unlike points and planes, the algebraic treatment of 3D lines is not straightforward in projective geometry. Part of the reason is that lines in P³ can be shown to be in bijective correspondence with points on the so-called Klein quadric in P⁵. So, lines in P³ may be represented as homogeneous vectors in R⁶ whose coordinates must satisfy an implicit constraint, making it difficult for them to interact with points and planes, which are simply homogeneous vectors in R⁴. In the language of algebraic geometry, lines in R³, when interpreted as projective lines in P³, correspond to two-dimensional subspaces of a four-dimensional vector space, which is the basis for the so-called Plücker embedding. The interested reader is referred to (Pottmann and Wallner, 2001) for an excellent treatment of 3D line geometry.

Span and null-space representations

A 3D line may be represented as the join of two points X, Y ∈ P³ by the 2 × 4 matrix L = [X Y]^T. The span of L^T is the pencil of points λX + μY that define the line. Alternatively, the dual representation of a line is the intersection of two planes. A line formed by the intersection of planes π1, π2 ∈ P³ can be represented by the 2 × 4 matrix L* = [π1 π2]^T, since the null-space of L* is the pencil of points on that line.

The two dual representations are related as L L*^T = L* L^T = 0_{2×2}.

Plücker matrix representations

The 3D line joining two points X and Y can be represented as a 4 × 4 skew-symmetric Plücker matrix:

\[ L = \mathbf{X}\mathbf{Y}^\top - \mathbf{Y}\mathbf{X}^\top. \qquad (2.7) \]

The matrix L has rank 2; in fact, its two-dimensional null-space is spanned by the pencil of planes with the 3D line joining X and Y as axis. It can be verified that under a point transformation H : P³ → P³, such that X′ = HX, the line L in (2.7) transforms as L′ = H L H^T.

The dual Plücker matrix representation is a 4 × 4 skew-symmetric matrix of rank 2 that parameterizes a 3D line as the intersection of two planes π1 and π2:

\[ L^* = \pi_1 \pi_2^\top - \pi_2 \pi_1^\top. \qquad (2.8) \]

Under a point transformation X′ = HX, the dual line L* transforms as L*′ = H^{-T} L* H⁻¹. This skew-symmetric Plücker representation is especially convenient for representing join and incidence relations:

• The plane defined by the join of line L and point X is π = L*X. Further, X lies on L if and only if L*X = 0.

• The point defined by the intersection of line L with the plane π is X = Lπ. Further, L lies on π if and only if Lπ = 0.
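The join and incidence relations above are straightforward to check numerically (the line and plane below are chosen for illustration):

```python
import numpy as np

X = np.array([0.0, 0.0, 0.0, 1.0])       # the origin
Y = np.array([0.0, 0.0, 1.0, 1.0])       # the point (0, 0, 1)
L = np.outer(X, Y) - np.outer(Y, X)      # Plücker matrix of the z-axis, eq. (2.7)

pi = np.array([0.0, 0.0, 1.0, -2.0])     # the plane z = 2
Xint = L @ pi                            # meet of line and plane: X = L pi
Xint = Xint / Xint[3]
assert np.allclose(Xint, [0.0, 0.0, 2.0, 1.0])   # z-axis meets z = 2 at (0, 0, 2)

# Dual matrix from two planes containing the z-axis: x = 0 and y = 0
p1 = np.array([1.0, 0.0, 0.0, 0.0])
p2 = np.array([0.0, 1.0, 0.0, 0.0])
L_star = np.outer(p1, p2) - np.outer(p2, p1)     # eq. (2.8)
Z = np.array([0.0, 0.0, 5.0, 1.0])               # a point on the z-axis
assert np.allclose(L_star @ Z, 0.0)              # X lies on L  <=>  L* X = 0
```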

Plücker coordinates

Algebraically, the Plücker coordinates of a line correspond to the six strictly upper triangular entries of the skew-symmetric matrix L in (2.7). Since det(L) = 0, these six coordinates must satisfy a quadratic constraint, which defines the Klein quadric. But Plücker coordinates also arise naturally in line geometry and it is possible to develop their theory in an axiomatic framework too. Accordingly, the line L spanned by the points X = (x^T, x′)^T and Y = (y^T, y′)^T is given by their exterior product, which can be represented as the homogeneous 6-vector

\[ L = \mathbf{X} \wedge \mathbf{Y} = (\mathbf{l}^\top, \bar{\mathbf{l}}^\top)^\top = (x'\mathbf{y}^\top - y'\mathbf{x}^\top,\; (\mathbf{x} \times \mathbf{y})^\top)^\top, \qquad (2.9) \]

where (l^T, l̄^T)^T represent the Plücker coordinates of L. The dual Plücker coordinates specify the line L contained in the intersection of planes π1 = (p^T, p′)^T and π2 = (q^T, q′)^T as

\[ L^* = \pi_1 \wedge \pi_2 = (\mathbf{l}^{*\top}, \bar{\mathbf{l}}^{*\top})^\top = (p'\mathbf{q}^\top - q'\mathbf{p}^\top,\; (\mathbf{p} \times \mathbf{q})^\top)^\top. \qquad (2.10) \]

The Plücker coordinates of L are given by (l^T, l̄^T)^T = ((p × q)^T, (p′q − q′p)^T)^T.

The plane spanned by a line (l^T, l̄^T)^T and a point (x^T, x′)^T is the plane

\[ (\mathbf{p}^\top, p')^\top = (-x'\bar{\mathbf{l}}^\top + (\mathbf{x} \times \mathbf{l})^\top,\; \mathbf{x}^\top \bar{\mathbf{l}})^\top. \qquad (2.11) \]

The intersection of a line (l^T, l̄^T)^T and a plane (p^T, p′)^T is the point given by the dual relation:

\[ (\mathbf{x}^\top, x')^\top = (-p'\mathbf{l}^\top + (\mathbf{p} \times \bar{\mathbf{l}})^\top,\; \mathbf{p}^\top \mathbf{l})^\top. \qquad (2.12) \]

Geometrically, the vector l represents the direction of the line L, that is, l is the ideal point of L. For any non-ideal point (x^T, x′)^T on the line L, the Plücker coordinates of L are computed as the join of (l^T, 0)^T and (x^T, x′)^T, given by (l^T, l̄^T)^T = (l^T, (x × l)^T)^T. The orthogonality of l and l̄ gives a quadratic relation, which defines the Klein quadric.
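A small numpy sketch of the Plücker coordinate formulas (the points and plane are chosen arbitrarily for illustration, and the bar placement in eqs. (2.11)-(2.12) follows the reconstruction above):

```python
import numpy as np

def pluecker_join(X, Y):
    """Plücker coordinates (l, lbar) of the line through X = (x, x'),
    Y = (y, y') in P^3, per eq. (2.9): direction l and moment lbar."""
    x, xs, y, ys = X[:3], X[3], Y[:3], Y[3]
    return xs * y - ys * x, np.cross(x, y)

l, lbar = pluecker_join(np.array([1.0, 0.0, 0.0, 1.0]),
                        np.array([1.0, 1.0, 0.0, 1.0]))
assert np.allclose(l, [0.0, 1.0, 0.0])       # direction of the line
assert l @ lbar == 0                         # Klein quadric constraint

# Intersection with the plane y = 2 via eq. (2.12): x = -p'l + p x lbar
p, ps = np.array([0.0, 1.0, 0.0]), -2.0
x = -ps * l + np.cross(p, lbar)
xs = p @ l
assert np.allclose(x / xs, [1.0, 2.0, 0.0])  # the line crosses y = 2 at (1, 2, 0)
```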

In algebraic geometry, Plücker coordinates also provide a natural connection to the rich theory of Grassmannians.

2.3.3 Quadrics

A quadric is a surface that defines the locus of points X ∈ P³ satisfying

\[ \mathbf{X}^\top Q\, \mathbf{X} = 0, \qquad (2.13) \]

where Q is a symmetric 4 × 4 matrix. Quadrics may be ruled (through every point on Q, there exists a straight line that lies on Q) or non-ruled. Examples of the latter are some common surfaces like spheres and ellipsoids, while a hyperboloid of one sheet is an example of the former.

The envelope of all planes π tangent to a quadric Q defines the dual quadric Q*, which satisfies the equation

\[ \boldsymbol{\pi}^\top Q^* \boldsymbol{\pi} = 0, \qquad (2.14) \]

where Q* is the adjoint of Q, equal to Q⁻¹ for a full-rank Q.

Under a point transformation X′ = HX, a quadric transforms as

\[ Q' = H^{-\top} Q\, H^{-1}, \qquad (2.15) \]

while a dual quadric transforms as

\[ Q^{*\prime} = H Q^* H^\top. \qquad (2.16) \]

2.4 The Projective Camera

Referring to Figure 2.1, we saw in Section 1.2.2 that the image formation process by perspective projection, when the camera is centered and axis-aligned, can be represented as

\[ \mathbf{x} = P\mathbf{X}, \quad \text{where } P = \begin{pmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}, \qquad (2.17) \]

where x ∈ P² is the image of X ∈ P³ and f is the focal length of the camera.


Figure 2.1: The centered and axis-aligned projective camera is represented by a perspective projection matrix.

The 3 × 4 camera matrix can be rewritten as

\[ P = K\,[\,I \mid \mathbf{0}\,], \quad \text{where } K = \begin{pmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{pmatrix}. \qquad (2.18) \]

The matrix K encodes the internal parameters of the camera. The projection equation in (2.17) assumes that the origin of coordinates in the image plane coincides with the point where the principal axis intersects it, the principal point. However, if the principal point is located at some (u, v)^T instead of (0, 0)^T, then the image formation equation is

\[ \mathbf{x} = K\,[\,I \mid \mathbf{0}\,]\,\mathbf{X}, \quad \text{where } K = \begin{pmatrix} f & 0 & u \\ 0 & f & v \\ 0 & 0 & 1 \end{pmatrix}. \qquad (2.19) \]

The effect of any non-rectangularity in the camera pixels can be modeled by introducing a pixel skew, s = cot θ. Finally, we account for the fact that focal lengths in the x and y directions might be slightly different in a real, lens-based camera, to get an upper-triangular internal parameter matrix:

\[ K = \begin{pmatrix} f_x & s & u \\ 0 & f_y & v \\ 0 & 0 & 1 \end{pmatrix}. \qquad (2.20) \]

The above projection equations are derived under the assumption that the camera is situated at the origin of the world coordinate system, with its principal axis pointing along the positive Z-axis. In general, if the camera center is C̃, in inhomogeneous coordinates, then the camera's coordinate system is related to the world's as X̃_cam = R(X̃_world − C̃), where R is a rotation. This can be written in homogeneous coordinates as

\[ \mathbf{X}_{\mathrm{cam}} = \begin{pmatrix} R & -R\tilde{\mathbf{C}} \\ \mathbf{0}^\top & 1 \end{pmatrix} \mathbf{X}_{\mathrm{world}}. \qquad (2.21) \]

The image formation equation for a perspective camera now becomes

\[ \mathbf{x} = KR\,[\,I \mid -\tilde{\mathbf{C}}\,]\,\mathbf{X} = K\,[\,R \mid \mathbf{t}\,]\,\mathbf{X}, \qquad (2.22) \]

where t = −RC̃. While the upper triangular matrix K encodes the internal parameters, (R, t) represent the exterior pose of the camera (Figure 2.2).

Note that any 3 × 4 matrix, P = [A | a], that maps a 3D point to a 2D one as x = PX can be interpreted as a projective camera, as long as A is non-singular, since A can always be decomposed as A = KR using RQ decomposition. The non-singularity condition ensures that the projection is from the 3D world onto a 2D image plane and not onto a 1D line or a single point.
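The K, R factorization via RQ decomposition can be realized with numpy's QR routine and a row-reversal trick; the calibration matrix and rotation below are made up for illustration.

```python
import numpy as np

def rq(A):
    """RQ decomposition of a square matrix via QR on a row-reversed copy,
    with signs fixed so the upper-triangular factor has a positive diagonal."""
    Q, R = np.linalg.qr(A[::-1].T)
    R, Q = R.T[::-1, ::-1], Q.T[::-1]
    D = np.diag(np.sign(np.diag(R)))
    return R @ D, D @ Q                  # A = (R D)(D Q), since D^2 = I

K = np.array([[800.0, 2.0, 320.0],       # a hypothetical calibration matrix
              [0.0, 780.0, 240.0],
              [0.0, 0.0, 1.0]])
th = 0.3                                 # rotation about the z-axis
R = np.array([[np.cos(th), -np.sin(th), 0.0],
              [np.sin(th),  np.cos(th), 0.0],
              [0.0, 0.0, 1.0]])
A = K @ R                                # left 3x3 block of P = K[R | t]
K2, R2 = rq(A)
assert np.allclose(K2, K) and np.allclose(R2, R)   # factors recovered exactly
```

The recovery is unique here because the decomposition is normalized so the triangular factor has a positive diagonal, matching K.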

The center of a projective camera is given by C = null(P). Let P_i^T denote row i of the camera matrix. Then P3 is the principal plane, which is parallel to the image plane and passes through the camera center. If the camera is represented as P = [A | b], then the direction of the principal axis, which passes through the camera center and points to the "front" of the camera, is given by

\[ \mathbf{v} = \det(A)\,\mathbf{a}_3, \qquad (2.23) \]


Figure 2.2: The internal and external parameters of a projective camera. (a) Besides the focal length, the internal parameters of the camera account for the principal point offset and the pixel skew. (b) The position and orientation of the camera with respect to a world coordinate frame constitute the external parameters of the camera.

where a3^T is the third row of A.

If X = (X, Y, Z, T)^T is a 3D point, imaged by a camera P = [A | b] as the point x = w(x, y, 1)^T, then the depth of the point X with respect to the camera P is defined as

\[ \mathrm{depth}(\mathbf{X}, P) = \frac{\mathrm{sign}(\det(A))\, w}{T\, \lVert \mathbf{a}_3 \rVert}, \qquad (2.24) \]

where ‖a3‖ is the norm of the third row of A.
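Equation (2.24) is direct to implement; a sanity check with the canonical camera P = [I | 0], where depth should simply be the Z coordinate:

```python
import numpy as np

def depth(X, P):
    """Signed depth of a homogeneous point X = (X, Y, Z, T) with respect
    to the camera P = [A | b], following eq. (2.24)."""
    A = P[:, :3]
    w = (P @ X)[2]                       # third coordinate of x = w(x, y, 1)
    return np.sign(np.linalg.det(A)) * w / (X[3] * np.linalg.norm(A[2]))

P = np.hstack([np.eye(3), np.zeros((3, 1))])     # canonical camera at the origin
assert np.isclose(depth(np.array([0.0, 0.0, 5.0, 1.0]), P), 5.0)    # in front
assert np.isclose(depth(np.array([0.0, 0.0, -2.0, 1.0]), P), -2.0)  # behind
```

The sign of the depth distinguishes points in front of the camera from points behind it, a fact used later in the chirality arguments.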

2.5 The Plane at Infinity and Its Denizens

The projective space can be considered as the union of real space and the set of all points at infinity, or ideal points. Such points in P³ can be characterized by the last coordinate of the homogeneous representation being 0. It is evident that all ideal points X_∞ = (X, Y, Z, 0)^T in P³ must lie on the plane at infinity, represented as π_∞ = (0, 0, 0, 1)^T, since they satisfy π_∞^T X_∞ = 0. Thus, P³ = R³ ∪ π_∞.

The importance of the plane at infinity in geometric reconstruction problems stems from the fact that while it is moved out of its canonical position by a projective transformation, it stays fixed under an affine transformation. So, once we can locate the plane at infinity in a projective reconstruction, we can compute a transformation that takes it back to its canonical position, which would result in a reconstruction that differs up to an affinity from the true Euclidean reconstruction. These statements are formalized below.

Theorem 1. The plane at infinity is fixed under a projective transformation H if and only if H is an affinity.

Proof. Suppose H is an affine transformation, represented as

\[ H = \begin{pmatrix} A_{3\times 3} & \mathbf{t} \\ \mathbf{0}^\top & 1 \end{pmatrix}. \qquad (2.25) \]

Then, it is easy to see that the plane at infinity is fixed under H, as

\[ H^{-\top}\boldsymbol{\pi}_\infty = \begin{pmatrix} A^{-\top} & \mathbf{0} \\ -\mathbf{t}^\top A^{-\top} & 1 \end{pmatrix} \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}. \qquad (2.26) \]

For the converse, suppose H is a projective transformation that fixes the plane at infinity. Then, H^{-T}[0, 0, 0, 1]^T = [0, 0, 0, 1]^T necessitates that (H^{-T})14 = (H^{-T})24 = (H^{-T})34 = 0, which leads to the conclusion that H must be of the form (2.25).

The above result immediately leads to the utility of the plane at infinity for 3D reconstruction problems:

Corollary 2. Once the plane at infinity is estimated in a projective reconstruction, it is possible to upgrade to a reconstruction where affine properties are restored.

Proof. Suppose the plane at infinity is identified as π_∞ = (p^T, 1)^T. Then, π_∞ can be mapped to its canonical position by a transformation H_A^{-T}:

\[ H_A^{-\top} = \begin{pmatrix} I & -\mathbf{p} \\ \mathbf{0}^\top & 1 \end{pmatrix} \;\Rightarrow\; H_A^{-\top}(\mathbf{p}^\top, 1)^\top = (\mathbf{0}^\top, 1)^\top. \qquad (2.27) \]

Consequently, given π_∞, applying a transformation of the form

\[ H_A = \begin{pmatrix} I & \mathbf{0} \\ \mathbf{p}^\top & 1 \end{pmatrix}, \quad \text{whose last row is } \boldsymbol{\pi}_\infty^\top, \qquad (2.28) \]

will take the projective reconstruction to an affine one.
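The upgrade can be verified numerically; the plane-at-infinity estimate below is made up for illustration.

```python
import numpy as np

p = np.array([0.2, -0.5, 1.3])          # hypothetical estimate: pi_inf = (p, 1)
pi_inf = np.append(p, 1.0)

H_A = np.eye(4)
H_A[3, :] = pi_inf                      # last row of H_A is pi_inf^T, eq. (2.28)

# Planes transform as pi' = H^{-T} pi; the upgrade sends pi_inf to (0, 0, 0, 1)
pi_new = np.linalg.inv(H_A).T @ pi_inf
assert np.allclose(pi_new / pi_new[3], [0.0, 0.0, 0.0, 1.0])
```

Applying H_A to the points and H_A^{-1} to the cameras of a projective reconstruction thus restores affine properties such as parallelism.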

2.5.1 The Absolute Conic

The absolute conic is one of the most important geometric entities used in computer vision, primarily for its relationship to camera calibration and autocalibration. The absolute conic is a point conic that lives on the plane at infinity and is defined as the locus of points X = (x1, x2, x3, x4)^T satisfying

\[ x_1^2 + x_2^2 + x_3^2 = 0, \qquad x_4 = 0. \qquad (2.29) \]

Equivalently, it is the set of all points X = (x̃^T, 0)^T that satisfy x̃^T Ω_∞ x̃ = 0, where Ω_∞ = I_{3×3}. Note that there is no real point x̃ ∈ R³ which lies on Ω_∞; it is a conic that consists entirely of imaginary points. Part of the utility of the absolute conic for geometric reconstruction problems stems from the following result:

Theorem 3. The absolute conic is fixed under a projective transformation H if and only if H is a similarity.

Proof. Suppose H is a similarity transformation; then it has the form

    H = \begin{bmatrix} sR & t \\ 0^\top & 1 \end{bmatrix}.   (2.30)

On the plane at infinity, the point conic Ω_∞ transforms under the action of H as given by (2.5), that is,

    (sR)^{-\top} \Omega_\infty (sR)^{-1} = s^{-2}(RR^\top)^{-1} = s^{-2} I \simeq \Omega_\infty,   (2.31)

where the equality up to scale suffices, since conics are homogeneous entities.

To prove the converse, we note that a transformation that fixes a conic lying on the plane at infinity must also fix the plane at infinity itself. Thus, H must at least be an affine transform and should have the form (2.25). Then, on the plane at infinity, H

would transform Ω_∞ as A^{-⊤} I A^{-1} = (AA^⊤)^{-1}. If Ω_∞ remains fixed under H, then (AA^⊤)^{-1} ≃ I up to scale, which means A must be a scaled rotation, possibly composed with a reflection. In either case, H represents a similarity transformation.

2.5.2 Image of the Absolute Conic

Consider a point X_∞ = (x_∞^⊤, 0)^⊤ on the plane at infinity. Under a camera P = K[R | t], it is imaged as

    x = P X_\infty = K R\, x_\infty,   (2.32)

so the mapping between the plane at infinity and the image is given by a planar homography H = KR. Since the absolute conic lies on the plane at infinity, its image will be a conic transformed as

    \omega = H^{-\top} \Omega_\infty H^{-1} = (KR)^{-\top} I (KR)^{-1} = (KK^\top)^{-1}.   (2.33)

The image of the absolute conic (IAC) is one of the most important entities in camera calibration, since it depends only on the internal parameters of the camera. At times, it is useful to consider its dual, the dual image of the absolute conic (DIAC), which is given simply by

    \omega^* = \omega^{-1} = KK^\top.   (2.34)

Estimating the IAC or the DIAC is central to upgrading an affine reconstruction of a scene to a metric reconstruction. This is encapsulated in the following result.

Theorem 4. Given an affine reconstruction {P_i^A, X_j^A}, knowing the IAC, ω, in one of the images, corresponding to the camera P^A = [A | a], allows the affine reconstruction to be upgraded to a metric one by applying a transformation of the form

    H = \begin{bmatrix} S^{-1} & 0 \\ 0^\top & 1 \end{bmatrix},   (2.35)

where S is obtained by the Cholesky factorization of SS^⊤ = (A^⊤ ω A)^{-1}.

Proof. Suppose a transformation H of the form (2.35), for some matrix S, can take the affine reconstruction to the Euclidean one {P_i^M, X_j^M}. Then, P^M = P^A H^{-1} = K[R | t] for the camera under consideration, which leads to KR = AS. However, ω* = (KR)(KR)^⊤ = A S S^⊤ A^⊤, which can be rewritten as SS^⊤ = (A^⊤ ω A)^{-1}. Thus, an S to upgrade the affine reconstruction to a metric one using a transformation of the form (2.35) can be computed by Cholesky factorization of (A^⊤ ω A)^{-1}.
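As a concrete check, Theorem 4 can be exercised numerically. The sketch below is illustrative and not from the dissertation: the intrinsics K, the rotation R and the affine distortion C are invented so that the affine camera block is A = KRC; the IAC is then ω = (KK^⊤)^{-1}, and the Cholesky factor S of (A^⊤ωA)^{-1} recovers KK^⊤.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical intrinsics K (values made up) and a random rotation R.
K = np.array([[800.0, 2.0, 320.0],
              [0.0, 780.0, 240.0],
              [0.0, 0.0, 1.0]])
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))
if np.linalg.det(R) < 0:
    R = -R

# Affine distortion of the metric frame: the affine camera's leading
# 3x3 block is A = K R C for some unknown invertible C.
C = np.eye(3) + 0.3 * rng.standard_normal((3, 3))
A = K @ R @ C

# IAC from the known calibration: omega = (K K^T)^{-1}, eq. (2.33).
omega = np.linalg.inv(K @ K.T)

# Theorem 4: S S^T = (A^T omega A)^{-1}, with S from a Cholesky factorization.
S = np.linalg.cholesky(np.linalg.inv(A.T @ omega @ A))

# After the upgrade, A S = K R' for some rotation R', hence (AS)(AS)^T = K K^T.
KR = A @ S
assert np.allclose(KR @ KR.T, K @ K.T)
```

Since A^⊤ωA = C^⊤C here, the matrix CS is orthogonal, so AS equals KR up to a rotation, which is exactly the metric ambiguity one expects.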

2.5.3 The Absolute Dual Quadric

The absolute dual quadric (ADQ), denoted Q*_∞, is a plane quadric which is the envelope of all planes tangent to the absolute conic. Its algebraic representation is

    Q^*_\infty = \tilde{I} = \begin{bmatrix} I_{3\times 3} & 0 \\ 0^\top & 0 \end{bmatrix}.   (2.36)

Clearly, the ADQ is a degenerate quadric. Since it is a dual quadric, under a projective transformation H, the ADQ moves from its canonical position to

    Q^*_\infty = H \tilde{I} H^\top.   (2.37)

An intuitive understanding of the absolute dual quadric as the dual of the absolute conic can be obtained by considering the one-parameter family of quadrics

    x_1^2 + x_2^2 + x_3^2 + k x_4^2 = 0,   (2.38)

equivalently representable as

    Q(k) = \mathrm{diag}(1, 1, 1, k).   (2.39)

As k → ∞, the quadric gets closer and closer to the plane at infinity. In the limit, the only points lying on lim_{k→∞} Q(k) are those that satisfy x_1² + x_2² + x_3² = 0 and x_4 = 0.

But, from (2.29), this is precisely the set of points that define the absolute conic. Thus, one can consider the absolute conic as the limiting case of the series of quadrics Q(k). Then, in the limit, the dual quadric for this series is given by

    \lim_{k \to \infty} Q^*(k) = \lim_{k \to \infty} \mathrm{diag}(1, 1, 1, k^{-1}) = \mathrm{diag}(1, 1, 1, 0),   (2.40)

which is precisely the definition of the absolute dual quadric, Q*_∞.

The utility of the ADQ for multiview geometry stems from the fact that it is fixed under a similarity transformation. This is formalized below:

Theorem 5. The absolute dual quadric is fixed under a projective transformation H if and only if H is a similarity.

Proof. Let us represent the projective transformation H as

    H = \begin{bmatrix} A & t \\ v^\top & 1 \end{bmatrix}.   (2.41)

Under the transformation H, the ADQ transforms as

    Q^*_\infty = H \tilde{I} H^\top = \begin{bmatrix} AA^\top & Av \\ v^\top A^\top & v^\top v \end{bmatrix}.   (2.42)

Since the ADQ remains fixed under H, equating the above to Ĩ (up to scale) leads to the conclusion that v = 0 and A is a scaled rotation matrix. Thus, H must be a similarity transformation.

Another useful property of the ADQ is the following result:

Result 1. The plane at infinity is the null-vector of the absolute dual quadric.

To prove this, we simply note that in their canonical forms, π_∞ = (0, 0, 0, 1)^⊤ is clearly the null vector of Q*_∞ = Ĩ. Under any projective transformation H, the plane

at infinity transforms to π′_∞ = H^{-⊤}π_∞ and the ADQ transforms to Q*′_∞ = H Q*_∞ H^⊤. Clearly,

    Q^{*\prime}_\infty \pi'_\infty = (H Q^*_\infty H^\top)(H^{-\top} \pi_\infty) = H Q^*_\infty \pi_\infty = 0.   (2.43)
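Result 1 is straightforward to verify numerically. In this small illustrative sketch, a randomly drawn projectivity H moves the ADQ and the plane at infinity from their canonical positions, and the null-vector relation (2.43) is checked:

```python
import numpy as np

rng = np.random.default_rng(4)

I_tilde = np.diag([1.0, 1.0, 1.0, 0.0])    # canonical ADQ, eq. (2.36)
pi_inf = np.array([0.0, 0.0, 0.0, 1.0])    # canonical plane at infinity

H = rng.standard_normal((4, 4))            # generic projective transformation
Q = H @ I_tilde @ H.T                      # transformed ADQ, eq. (2.37)
pi = np.linalg.inv(H).T @ pi_inf           # transformed plane at infinity

# Result 1: the plane at infinity is the null vector of the ADQ.
assert np.allclose(Q @ pi, 0.0)
```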

Image of the absolute dual quadric

To determine the image of the ADQ under perspective projection, we require an auxiliary result:

Result 2. An image line l backprojects to the plane π = P^⊤ l.

To see that this result must be true, let X be a point on the plane π, which is imaged as the point x. Then, x = PX and since x lies on l,

    x^\top l = 0 \;\Rightarrow\; (PX)^\top l = 0 \;\Rightarrow\; X^\top (P^\top l) = 0 \;\Rightarrow\; \pi = P^\top l.   (2.44)

Next, we also require two concepts from differential geometry to deduce the image of an arbitrary quadric. Under perspective projection, the contour generator of a smooth surface is the set of points where the imaging rays are tangent to the surface. The apparent contour is the image of the contour generator.

Now, consider a sphere, for which the contour generator and the apparent contour under a Euclidean camera P̂ = K[I | 0] must be circles, by symmetry. Applying a projective transformation to the system, the sphere changes into a quadric, while the circles transform into conics. So, the contour generator and the apparent contour of a quadric under perspective projection are conics. We are now ready to state the second auxiliary result:

Result 3. The image of a quadric Q under a projective camera P is given by the conic C whose dual satisfies

    C^* = P Q^* P^\top.   (2.45)

To prove this assertion, let C be the conic that is the image of Q under the projection P. Lines l tangent to C satisfy l^⊤ C* l = 0. From Result 2, these lines

backproject to planes π = P^⊤ l, which are tangent to the quadric Q. Thus,

    \pi^\top Q^* \pi = 0 \;\Rightarrow\; l^\top P Q^* P^\top l = 0 \;\Rightarrow\; C^* = P Q^* P^\top.   (2.46)

From (2.46), for the particular case of the ADQ, its image under a camera P is given by P Q*_∞ P^⊤. Since the ADQ is dual to the absolute conic, its image must be dual to the image of the absolute conic, which is simply the DIAC. Thus, we have the following important result:

Result 4. The image of the absolute dual quadric Q*_∞ under a projective camera P is given by the dual image of the absolute conic, that is,

    \omega^* = P Q^*_\infty P^\top.   (2.47)

Chapter 3

Preliminaries: Multiview Geometry

“Everything we see is a perspective, not the truth.”

Marcus Aurelius (Roman emperor, 121-180 AD), Meditations (on Stoicism)

This chapter is a more detailed perusal of the basic concepts of 3D reconstruction that we informally discussed in Chapter 1. In particular, we will outline standard techniques for feature detection and matching, as well as computing a projective reconstruction from correspondences. We will formalize the notion of stratification in 3D reconstruction and finally, review the important, but often overlooked, concept of chirality. More details for the material in Sections 2.4 to 3.5 can be found in standard texts like (Hartley and Zisserman, 2004).

3.1 Feature Selection and Matching

To harness the power of multiview geometry, the image of the same salient feature, which might be a point, line, curve or some other well-defined scene element, must be visible in multiple views. This requires two operations: detecting salient features in images (which usually, but not always, correspond to salient features in the scene), and establishing correspondences between salient features across images. Well-established

techniques exist in the literature to perform these tasks, so they constitute pre-processing steps that form the input to our algorithms. In the following, we will briefly review the concepts involved in the low-level image processing that yields our input feature correspondences.

3.1.1 Corner detection

With the exception of Chapter 9, the salient features which form the input for our algorithms are corners. For some problems such as autocalibration, we require a projective reconstruction, which might be obtained from higher-order features too. But corners are by far the most popular features in SFM applications, since they abound in natural scenes and are relatively inexpensive to detect and match.

Roughly, corners are locations on the image where the image gradient undergoes a rapid change in direction (Figure 3.1 (a) and (b)). For a 2D image point x = (x, y)^⊤, with intensity I(x, y), this is encapsulated in the positive semidefinite form

    A(\mathbf{x}) = \begin{bmatrix} \sum_{W(\mathbf{x})} I_x^2 & \sum_{W(\mathbf{x})} I_x I_y \\ \sum_{W(\mathbf{x})} I_x I_y & \sum_{W(\mathbf{x})} I_y^2 \end{bmatrix},   (3.1)

where W(x) stands for a local window around x. An intensity corner can be expected at a point x where the eigenvalues of A(x) are both significant and its condition number is not high. That is, if the eigenvalues of A are λ_1 and λ_2, with λ_1 ≥ λ_2 ≥ 0, then the ratio λ_2 / (λ_1 + λ_2) is not close to zero. The intuition for this is visually illustrated in Figure 3.2. This is the basis of corner detection as proposed in (Förstner and Gülch, 1987). A popular variant is the so-called Harris corner detector (Harris and Stephens, 1988), which declares a putative corner at the local maxima of the following function of A:

    C(\mathbf{x}) = \det(A) - k \cdot \mathrm{trace}(A)^2,   (3.2)

where the constant k depends on the image. In practice, a value of k ≈ 0.04 usually works well across a large range of imaging conditions.

Figure 3.1: Corner detection. (a) An ideal corner; (b) a realistic corner.

Figure 3.2: Types of image neighborhoods. (a) rank(A) = 0, λ_1 = λ_2 = 0; (b) rank(A) = 1, λ_1 ≫ 0, λ_2 = 0; (c) rank(A) = 2, λ_1 ≥ λ_2 ≫ 0.
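The quantities in (3.1) and the corner response can be sketched in a few lines. The example below is illustrative and not from the dissertation: a synthetic bright square provides a corner, an edge and a flat region; the windowed second-moment matrix is evaluated with a 3×3 mean filter, and the standard Harris response det(A) − k·trace(A)² is used with k = 0.04.

```python
import numpy as np
from scipy.ndimage import uniform_filter

# Synthetic test image: a bright square on a dark background (illustrative).
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0

Iy, Ix = np.gradient(img)                  # derivatives along rows and columns

# Entries of A(x), eq. (3.1), averaged over a 3x3 window W(x).
Sxx = uniform_filter(Ix * Ix, size=3)
Syy = uniform_filter(Iy * Iy, size=3)
Sxy = uniform_filter(Ix * Iy, size=3)

# Harris response with k = 0.04.
k = 0.04
resp = (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2

# A corner of the square outscores a flat region, which outscores an edge:
# at an edge, det(A) vanishes, so the response is negative.
assert resp[5, 5] > resp[0, 0] > resp[5, 10]
```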

3.1.2 Feature matching

Feature matching establishes correspondences between multiple views of a scene, so it forms the first stage of a multiview geometry algorithm. Since most natural scenes have an abundance of corners and a mismatch might result in an outlier, feature matching tends to be a conservative operation in most implementations. That is, if a match is uncertain, it is usually more prudent to discard a corner in the initial stages rather than indulge in expensive outlier rejection schemes or risk a breakdown later in the SFM pipeline.

As we discussed in Section 1.2.5, feature matching requires a discriminative descriptor associated with each feature that remains relatively invariant under rotation, translation and scale transformations and, to some extent, illumination changes too. A common matching criterion is the normalized cross-correlation (NCC), which yields a "match score" between every pair of pixels x and x′ of the two images I and I′, respectively:

    \mathrm{NCC}(\mathbf{x}, \mathbf{x}') = \frac{\sum_{\mathbf{x} \in W(\mathbf{x}),\, \mathbf{x}' \in W(\mathbf{x}')} \big(I(\mathbf{x}) - \bar{I}\big) \cdot \big(I'(\mathbf{x}') - \bar{I}'\big)}{\sqrt{\sum_{\mathbf{x} \in W(\mathbf{x})} \big(I(\mathbf{x}) - \bar{I}\big)^2 \sum_{\mathbf{x}' \in W(\mathbf{x}')} \big(I'(\mathbf{x}') - \bar{I}'\big)^2}},   (3.3)

where W(x) and W(x′) are local windows around the points x and x′ in the two images and Ī and Ī′ are the mean intensities within those windows.

The NCC is invariant to affine transformations of the image intensity, that is, intensity changes of the form I′ = aI + b. Note that NCC implicitly assumes a translational model of motion between the images being matched, which is a reasonable approximation when the inter-frame motion is small.

In many applications, such as real-time stereo, where illumination variations between subsequent frames are not significant and where matching speed is of utmost importance, a simplified sum of absolute differences (SAD) measure is used to characterize a match:

    \mathrm{SAD}(\mathbf{x}, \mathbf{x}') = \sum_{\mathbf{x} \in W(\mathbf{x}),\, \mathbf{x}' \in W(\mathbf{x}')} \big| I(\mathbf{x}) - I'(\mathbf{x}') \big|.   (3.4)

One way to build in some robustness to brightness and contrast variations is to subtract out the mean intensity from each window, to get a zero-mean SAD measure:

    \mathrm{SAD}'(\mathbf{x}, \mathbf{x}') = \sum_{\mathbf{x} \in W(\mathbf{x}),\, \mathbf{x}' \in W(\mathbf{x}')} \big| (I(\mathbf{x}) - \bar{I}) - (I'(\mathbf{x}') - \bar{I}') \big|.   (3.5)

3.1.3 Advanced feature descriptors

In recent years, the need to incorporate invariances into the features for robust 3D reconstruction and object recognition has given rise to several feature detection methods.

1. SIFT (Lowe, 2004): The Scale-Invariant Feature Transform (SIFT) descriptor is invariant to image scale and rotation, as well as robust to small illumination and 54

viewpoint changes. Difference-of-Gaussian images are computed to approximate the Laplacian, and keypoint detection is performed by detecting scale-space extrema (Lindeberg, 1998; Lindeberg and Bretzner, 2003). The location of each scale-space extremum is interpolated to achieve subpixel accuracy. Low-contrast keypoints and edge responses are then discarded, using the eigenvalues of the matrix A in equation (3.1). An orientation histogram is computed from the local image gradient directions around each keypoint. For each keypoint, 8-bin histograms are computed in a 4×4 array around the keypoint, to yield a 128-dimensional feature descriptor. The descriptor is normalized to increase its illumination invariance.

2. GLOH (Mikolajczyk and Schmid, 2005): Gradient Location and Orientation Histogram (GLOH) is similar to SIFT, but considers more spatial regions for computing the histograms. Subsequently, dimensionality is reduced by performing principal components analysis (PCA) and retaining the top 64 eigenvectors.

3. LESH (Sarfraz and Hellwich, 2008): The Local Energy-based Shape Histogram (LESH) is motivated by a local energy model of feature perception. Using several local histograms, with different filter orientations, a descriptor is created which can be used for matching across pose variations.

4. SURF (Bay et al., 2006): Speeded Up Robust Features (SURF) defines keypoints using a Haar wavelet approximation to the Monge-Ampère operator (determinant of the Hessian), which is more robust to non-Euclidean transformations than the difference-of-Gaussians approximation to the Laplacian used for SIFT. Integral images are used to make the feature extraction efficient.
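Before moving on, the window similarity measures of Section 3.1.2 can be sketched as follows (illustrative code; random windows stand in for image patches). The check confirms that NCC is invariant to an affine intensity change I′ = aI + b, while the zero-mean SAD removes only the offset b:

```python
import numpy as np

def ncc(w1, w2):
    # Normalized cross-correlation of two equal-sized windows, eq. (3.3).
    a = w1 - w1.mean()
    b = w2 - w2.mean()
    return (a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum())

def zero_mean_sad(w1, w2):
    # Zero-mean sum of absolute differences, eq. (3.5).
    return np.abs((w1 - w1.mean()) - (w2 - w2.mean())).sum()

rng = np.random.default_rng(5)
w = rng.uniform(0.0, 1.0, (7, 7))          # a synthetic image window
w_affine = 2.0 * w + 10.0                  # affine intensity change I' = aI + b

assert np.isclose(ncc(w, w_affine), 1.0)           # NCC: fully invariant
assert np.isclose(zero_mean_sad(w, w + 10.0), 0.0) # zero-mean SAD: offset removed
```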

3.2 Epipolar Geometry

In this section, we consider the geometry of two views and the geometric entity that characterizes uncalibrated two-view reconstruction: the fundamental matrix.

For reference, let us call one of the views left and the other right. The left view is represented by the camera matrix, P, centered at C, while the right camera is given by

P′, centered at C′. For any point feature x in the left view, the corresponding feature in the right view, x′, must lie on a straight line, which is the image (in the second view) of the backprojected line from C, passing through x. This line is called the epipolar line; let us represent it as l′. Similarly, for every feature x′ in the right view, there exists an epipolar line, l, in the left view. For this reason, the geometry of two views is also called epipolar geometry (see Figure 3.3).

Figure 3.3: For any point x in the left view, the corresponding point x′ in the right view is constrained to lie on a straight line, which is the image of the back-projected line from the left camera's center passing through x.

Let e′ be the image of C in the second view and e be the image of C′ in the first view. Then, from Figure 3.3 and Section 2.2, it is evident that l′ = e′ × x′ = [e′]_× x′, where, for any vector a ∈ R³, we define the skew-symmetric matrix

    [a]_\times = \begin{bmatrix} 0 & -a_3 & a_2 \\ a_3 & 0 & -a_1 \\ -a_2 & a_1 & 0 \end{bmatrix}.   (3.6)

Suppose π is an arbitrary plane passing through the 3D point X_π, of which x and x′ are images. Then, the plane π induces a planar homography H_π between the left and right views. In particular, it must be true that x′ = H_π x. Thus, the epipolar line in the right view is

    l' = [e']_\times x' = [e']_\times H_\pi x = F x,   (3.7)

    F = [e']_\times H_\pi.   (3.8)

Since x′ lies on l′, it follows that

    x'^\top l' = 0 \;\Rightarrow\; x'^\top F x = 0,   (3.9)

which is the defining equation for uncalibrated two-view geometry. The matrix F is called the fundamental matrix and it is evident from (3.8) that F must be rank 2, since the skew-symmetric matrix [e′]_× is rank 2, while the homography H_π is full rank.

From (3.8), the right epipole e′ can be deduced as the left null-space of F, since e′^⊤F = e′^⊤[e′]_× H_π = 0^⊤. By similar derivations, or simply from symmetry, the left epipole is given by the right null-space of F, that is, Fe = 0, and the left epipolar line corresponding to x′ is given by l = F^⊤x′.

Given several views, one can perform 3D reconstruction by considering pairs of views and stitching together transformations between the coordinate systems of different pairs. The challenge there is to reconcile the differences between the reconstructions generated by the same view pair, but arrived at by different traversals of the graph of overlapping camera views. Moreover, for some problems like autocalibration, it can be shown that fundamental matrix based approaches result in weaker constraints and might have additional degeneracies as compared to multiview methods. Yet, the fundamental matrix and its three-view analogue, the trifocal tensor (Shashua and Werman, 1995; Hartley, 1997), remain the principal geometric entities for large scale reconstructions, due to their simplicity and the fact that robust and fast algorithms exist to estimate them.
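The algebraic properties of F derived above are easy to confirm numerically. In the sketch below (illustrative values, not from the dissertation), F is assembled as [e′]_× H_π per (3.8) from a randomly drawn epipole and a full-rank homography, and the rank and null-space relations are checked:

```python
import numpy as np

rng = np.random.default_rng(6)

def skew(a):
    # [a]_x, eq. (3.6)
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

# Build F = [e']_x H_pi from a hypothetical epipole and full-rank homography.
e_prime = rng.standard_normal(3)
H_pi = rng.standard_normal((3, 3))
F = skew(e_prime) @ H_pi

assert np.linalg.matrix_rank(F) == 2       # F has rank 2
assert np.allclose(e_prime @ F, 0.0)       # e' is the left null-space of F

# The right epipole e satisfies F e = 0; recover it from the SVD of F.
e = np.linalg.svd(F)[2][-1]
assert np.allclose(F @ e, 0.0, atol=1e-10)
```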

3.3 Projective Reconstruction

Given feature correspondences, with no other a priori information about the scene or the cameras, the best reconstruction possible is a so-called projective reconstruction, which differs from the true scene by an arbitrary 4×4 linear transformation. That is, if {P̂_i, X̂_j}, for i = 1, …, m, j = 1, …, n, is the true scene, then the projective reconstruction is

    P_i = \hat{P}_i H^{-1}, \qquad X_j = H \hat{X}_j,   (3.10)

where H is an arbitrary 4×4 linear transformation. We refer the reader to Chapter 2 for a review of the terminology and notation used in this section.

3.3.1 Pairwise reconstruction

Given several correspondences x_j ↔ x′_j between two views, we saw in Section 3.2 that epipolar geometry guarantees the existence of a fundamental matrix, which is a 3×3 matrix of rank 2 that satisfies

    x_j'^\top F x_j = 0.   (3.11)

Then, a projective reconstruction consistent with these correspondences is given by

    P = [\, I \mid 0 \,], \qquad P' = [\, SF \mid e' \,],   (3.12)

where S is an arbitrary 3×3 skew-symmetric matrix and e′ is the epipole in the second view, defined by F^⊤e′ = 0. In practice, a good choice of S is the matrix [e′]_×, defined in (3.6).

Given a pairwise estimate of projective cameras from (3.12), the 3D points X_j can be estimated by triangulation (see Chapter 5).
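A minimal sketch of the canonical camera pair (3.12) with the choice S = [e′]_× (illustrative; the rank-2 F is built synthetically rather than estimated from correspondences):

```python
import numpy as np

rng = np.random.default_rng(8)

def skew(a):
    # [a]_x, eq. (3.6)
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

# A rank-2 fundamental matrix and its left epipole e' (F^T e' = 0).
F = skew(rng.standard_normal(3)) @ rng.standard_normal((3, 3))
e_prime = np.linalg.svd(F)[0][:, -1]

# Canonical projective camera pair, eq. (3.12), with S = [e']_x.
P = np.hstack([np.eye(3), np.zeros((3, 1))])
P_prime = np.hstack([skew(e_prime) @ F, e_prime[:, None]])

# Any 3D point projected by this pair satisfies the epipolar constraint (3.9).
X = rng.standard_normal(4)
x, x_prime = P @ X, P_prime @ X
assert np.isclose(x_prime @ F @ x, 0.0, atol=1e-9)
```

That the constraint holds for every 3D point is what makes this pair a projective reconstruction consistent with the given F.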

3.3.2 Factorization-based approaches

Given more than two views of a scene, an alternative method for projective reconstruction adopts a factorization approach. First, we will take a brief digression to describe the concept of affine factorization, or the so-called Tomasi-Kanade factorization (Tomasi and Kanade, 1992), which is the motivation for projective factorization.

Affine factorization

Suppose the image of a 3D point X_j, observed under camera A_i, is the 2D point x_ij. Note that for this discussion, A is a 2×3 affine camera, X ∈ R³ is a non-homogeneous point, as is x ∈ R², and the image formation equation is

    x_{ij} = A_i X_j + t_i,   (3.13)

where t_i ∈ R² is a translation vector and i = 1, …, m, j = 1, …, n. Then, the goal of affine factorization is to minimize the squared reprojection error

    \min_{A_i, t_i, X_j} \sum_{ij} \| x_{ij} - (A_i X_j + t_i) \|^2   (3.14)

under the assumption that each 3D point is visible in all the views. It is easy to see that assuming the reconstruction has the origin at its centroid, and centering the image points in each view, that is, making their centroid the origin of the local coordinate system, fixes the translations as t_i = x̄_i, where x̄_i = (1/n) Σ_j x_ij. The problem now reduces to

    \min_{A_i, X_j} \sum_{ij} \| x_{ij} - A_i X_j \|^2,   (3.15)

which can be reformulated as the factorization of a 2m×n measurement matrix:

    W = \begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{bmatrix} = \begin{bmatrix} A_1 \\ \vdots \\ A_m \end{bmatrix} \big[\, X_1 \cdots X_n \,\big].   (3.16)

In the presence of noise, it can be shown that the nearest rank-3 approximation of the centered measurement matrix W, computed via a singular value decomposition (SVD), yields the maximum likelihood estimate (MLE) of the affine reconstruction. In practice, the assumption that each feature point is visible in all the views is a severe limitation. In an application with missing data, alternating least squares minimization is used to solve the problem (3.14), using the fact that once either the cameras or the points are known, estimating the other set of unknowns is a linear least squares problem.
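On noise-free synthetic data, the Tomasi-Kanade procedure reduces to a rank-3 truncated SVD of the centered measurement matrix; a minimal illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(9)
m, n = 5, 20                                   # views, points

A = [rng.standard_normal((2, 3)) for _ in range(m)]  # affine cameras
t = [rng.standard_normal(2) for _ in range(m)]       # per-view translations
X = rng.standard_normal((3, n))                      # 3D points

# Measurements x_ij = A_i X_j + t_i, eq. (3.13), stacked into W (2m x n),
# then centered per view, which removes the translations.
W = np.vstack([A[i] @ X + t[i][:, None] for i in range(m)])
W_c = W - W.mean(axis=1, keepdims=True)

# Tomasi-Kanade: rank-3 truncated SVD of the centered measurement matrix.
U, s, Vt = np.linalg.svd(W_c, full_matrices=False)
A_hat = U[:, :3] * s[:3]                       # stacked 2m x 3 cameras
X_hat = Vt[:3]                                 # 3 x n points

# Noise-free data: the centered matrix is exactly rank 3.
assert np.allclose(A_hat @ X_hat, W_c, atol=1e-9)
```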

Projective factorization

Consider the image formation equation in the projective case, x_ij ∼ P_i X_j, where P is a 3×4 projective camera, X ∈ R⁴ is a homogeneous point in 3D, as is x ∈ R³ in 2D. Then, the equality can be made exact by introducing an explicit scale factor:

    \lambda_{ij} x_{ij} = P_i X_j,   (3.17)

where the λ_ij are called projective depths. The problem that projective factorization aims to solve is

    \min_{\lambda_{ij}, P_i, X_j} \sum_{ij} \| \lambda_{ij} x_{ij} - P_i X_j \|^2,   (3.18)

assuming that all points are visible in all the cameras. Similar to affine factorization, a 3m×n measurement matrix may be constructed, which can be factorized into the cameras and the 3D points:

    W = \begin{bmatrix} \lambda_{11} x_{11} & \cdots & \lambda_{1n} x_{1n} \\ \vdots & \ddots & \vdots \\ \lambda_{m1} x_{m1} & \cdots & \lambda_{mn} x_{mn} \end{bmatrix} = \begin{bmatrix} P_1 \\ \vdots \\ P_m \end{bmatrix} \big[\, X_1 \cdots X_n \,\big].   (3.19)

To solve this problem, initial estimates of the projective depths are needed, which may be obtained with the aid of an initial pairwise projective reconstruction, or simply by setting all the λ_ij to 1. The depths are then normalized by setting all row norms of the measurement matrix to 1 in one pass and all the column norms to 1 in another pass. The nearest rank-4 approximation of the normalized measurement matrix is computed using SVD, which yields an estimate of the cameras and 3D points. This estimate is used to project the points on to the images and obtain new estimates of the depths, and the process is iterated until convergence.

Projective factorization was introduced in (Sturm and Triggs, 1996) as a generalization of affine factorization to the projective case. In theory, it is not a good idea to introduce projective depths on the left hand side of (3.17), since it artificially introduces the all-zeros solution as a global minimum of (3.18). While there is almost no theoretical justification for projective factorization as described here, in the absence of alternatives, the method is still used with some degree of success, since it is straightforward to implement.

3.4 Stratification

An important conceptual paradigm in multiview geometry is stratification, which defines a hierarchy relating the actual scene to its reconstructed versions through trans- formations of gradually increasing degrees of freedom. These transformations can be geometrically characterized by their invariants, which are the entities that do not change under the transformation. The strata we describe here and deal with in the rest of this thesis correspond to transformations which are subgroups of the projective linear group, PL(n), which is the quotient group of the general linear group, GL(n), obtained by identifying elements related by a scalar multiple.

Isometry

An isometric transformation has the form

    H_E = \begin{bmatrix} \delta R & t \\ 0^\top & 1 \end{bmatrix},   (3.20)

where δ = ±1, R ∈ SO(n−1) is a rotation matrix and t ∈ R^{n−1}. When δ = 1, it is called a Euclidean transformation, which corresponds to a physical displacement; δ = −1 represents a displacement composed with a reflection. A 3D Euclidean transformation has six degrees of freedom, three for the rotation R and three corresponding to the translation t. Geometrically, a Euclidean transformation preserves the Euclidean characterization of an object, so its invariants are entities like angles, length and area.

Similarity

A similarity transformation has the form

    H_S = \begin{bmatrix} sR & t \\ 0^\top & 1 \end{bmatrix},   (3.21)

where s is a scalar. A 3D similarity transformation has seven degrees of freedom: three for the rotation R, three for the translation t and one for the scale s. Geometrically, a similarity transformation preserves the "shape" of an object, so its invariants are entities like angles, parallelism, length ratios and area ratios. The goal of most 3D reconstruction algorithms is to deduce the scene up to a similarity transformation, which is also called a metric reconstruction.

Affinity

An affine transformation has the form

    H_A = \begin{bmatrix} A & t \\ 0^\top & 1 \end{bmatrix},   (3.22)

where A ∈ GL(n−1) is non-singular. A 3D affine transformation has 12 degrees of freedom, 9 corresponding to A and 3 corresponding to t. Geometrically, an affine transformation corresponds to a composition of a rotation, a non-isotropic scaling and another rotation, followed by a translation. Some of its invariants are parallelism, length ratios of parallel lines and area ratios.

An affine reconstruction differs from the true scene by an affine transformation. It is useful in practice since obtaining an affine reconstruction from input images is considered the “difficult” part of 3D reconstruction. That is, upgrading from the affine to the metric stratum is the relatively easier part of 3D reconstruction.

Projectivity

A projective transformation is defined on homogeneous coordinates and has the form

    H_P = \begin{bmatrix} A & t \\ v^\top & k \end{bmatrix}.   (3.23)

A 3D projective transformation has 15 degrees of freedom, corresponding to the 16 unknowns of H_P, less one for scale. A general projective transformation is also called a homography or a collineation. The fundamental invariants of a projective transformation are the incidence structure and the cross ratio. As discussed in Section 3.3, a projective reconstruction is the best one can obtain using point correspondences alone. So, the objective of many 3D reconstruction algorithms can be stated as finding the appropriate projective transformation to upgrade from the projective stratum to the metric one.
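The cross ratio, mentioned above as a fundamental projective invariant, is easy to verify numerically. In this illustrative sketch, four collinear points in P³ are mapped by a randomly drawn projectivity, and the cross ratio is computed, from signed parameters along the line, before and after the mapping:

```python
import numpy as np

rng = np.random.default_rng(11)

def cross_ratio(p):
    # Cross ratio of four collinear Euclidean points, using signed
    # parameters of the points along the line through p[0] and p[1].
    u = p[1] - p[0]
    u = u / np.linalg.norm(u)
    s = [(q - p[0]) @ u for q in p]
    return ((s[0] - s[2]) * (s[1] - s[3])) / ((s[0] - s[3]) * (s[1] - s[2]))

def dehom(X):
    return X[:3] / X[3]

# Four collinear points in P^3, parameterized as X(t) = A + t B.
A = np.array([1.0, 2.0, 3.0, 1.0])
B = np.array([0.5, -1.0, 2.0, 0.2])
pts = [A + t * B for t in (0.0, 1.0, 2.0, 5.0)]

H = rng.standard_normal((4, 4))            # generic projectivity
mapped = [H @ X for X in pts]

cr_before = cross_ratio([dehom(X) for X in pts])
cr_after = cross_ratio([dehom(X) for X in mapped])
assert np.isclose(cr_before, cr_after)     # the cross ratio is preserved
```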

3.5 Chirality

The etymology of the term chirality can be traced to the Greek cheir, which means hand. Chirality, thus, means "handedness" and, in the context of computer vision, imposing chirality refers to demanding the most basic of imaging constraints: the imaged points must lie in front of the camera. It can be shown that this simple constraint is sufficient to compute a quasi-affine reconstruction, which is a projective reconstruction that preserves the convex hull of the scene.

In P³, convexity is defined as follows: a subset S ⊂ P³ is convex if and only if S ⊂ R³ and S is convex in R³. (See Section 4.2.1 for a definition of convexity in R³.)

The convex hull of a set is the smallest convex set containing it.

A projective transformation h : P³ → P³ preserves the convex hull of a point set {X_i} if h(X_i) is finite for all i and h maps the convex hull of {X_i} bijectively to the convex hull of {h(X_i)}.

Recall that the center of a camera is given by C = null(P), which can be rewritten as

    C_P = (c_1, c_2, c_3, c_4)^\top, \quad \text{where } c_i = (-1)^i \det P^{(i)},   (3.24)

and P^{(i)} is the matrix formed by removing column i of P. For imposing chirality, we are only concerned with the sign of the depth of a point with respect to a camera, so (2.24) can be written as

    \mathrm{depth}(X, P) = w\, T \det(A) = w\, (E_4^\top X)(E_4^\top C_P),   (3.25)

where E_4 = (0, 0, 0, 1)^⊤ represents the plane at infinity, A is the leading 3×3 submatrix of P, T = E_4^⊤X is the last coordinate of X and w is the scale in PX = w(x, y, 1)^⊤. On the application of a projective transformation, (3.25) transforms as

    \mathrm{depth}(X, P) = w\, (E_4^\top H X)(E_4^\top H C_P) \det(H^{-1}),   (3.26)

where H^⊤E_4 can now be interpreted as the plane π_∞ which is mapped to infinity by H, that is, H^{-⊤}π_∞ = E_4. Thus,

    \mathrm{depth}(X, P) = w\, (\pi_\infty^\top X)(\pi_\infty^\top C_P)\, \delta,   (3.27)

where δ = sign(det(H^{-1})) determines whether the transformation is orientation-preserving or reversing. Thus, it can be easily verified that if the determinant of a projective transformation H is positive and π_∞ is the plane mapped to infinity by H, then the chirality of a point X with respect to π_∞ is preserved under H if and only if X lies on the same side of π_∞ as the camera center.

It can be easily shown that for any projective reconstruction from a set of image points x_ij, it is possible to assign signs to the projective cameras and points such that

    P_i X_j = w_{ij}\, (x_{ij}, y_{ij}, 1)^\top \quad \text{with } w_{ij} > 0.   (3.28)

Then, if H is a quasi-affine transformation, δ = sign(det(H)) and v ∈ R⁴ is the plane mapped to infinity by H, then

    \mathrm{depth}(P_i, X_j) > 0 \;\Rightarrow\; (v^\top X_j)(v^\top C_{P_i})\, \delta > 0.   (3.29)

It follows that a quasi-affine reconstruction can be computed using a transformation

    H = \pm \begin{bmatrix} I \;\; 0 \\ v^\top \end{bmatrix},   (3.30)

whose top block is the 3×4 matrix [I | 0] and whose bottom row is v^⊤, where the sign of H is chosen to match the δ which leads to v as a feasible solution for the following chiral inequalities:

    X_j^\top v > 0 \text{ for all } j, \qquad C_{P_i}^\top v > 0 \text{ for all } i.   (3.31)
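Since the chiral inequalities (3.31) are linear in v, a feasible plane can be found, and each coordinate of v bounded, with an off-the-shelf LP solver. The sketch below is illustrative: synthetic points and camera centres stand in for a quasi-affine reconstruction, v₄ is fixed to 1, and a small ε replaces the strict inequalities.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)

# Hypothetical quasi-affine reconstruction: points and camera centres with
# last homogeneous coordinate 1 and centroid near the origin (made up).
points = np.hstack([rng.uniform(-1.0, 1.0, (30, 3)), np.ones((30, 1))])
centres = np.hstack([rng.uniform(-1.0, 1.0, (4, 3)), np.ones((4, 1))])
rows = np.vstack([points, centres])        # each row q must satisfy q . v > 0

# With v = (v1, v2, v3, 1), the chiral inequalities become linear
# constraints on (v1, v2, v3):  -q_{1:3} . v_{1:3} <= q_4 - eps.
eps = 1e-3
A_ub = -rows[:, :3]
b_ub = rows[:, 3] - eps

intervals = []
for i in range(3):
    c = np.zeros(3)
    c[i] = 1.0
    lo = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 3)
    hi = linprog(-c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 3)
    intervals.append((lo.fun, -hi.fun))
    print(f"v{i+1} in [{lo.fun:.3f}, {-hi.fun:.3f}]")

# v = 0 (the canonical plane at infinity) is feasible here, so each
# interval must contain zero.
assert all(l <= 0.0 <= h for l, h in intervals)
```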

3.5.1 Bounding the plane at infinity

Once a quasi-affine transformation has been found, one can normalize all points to have the last coordinate 1 and all cameras to have the determinant of their first 3×3 sub-matrix also 1. Now, a translation may be applied to move the centroid of the set of points and camera centers to the origin, to obtain a (still quasi-affine) reconstruction

{P_q, X_q}.

Let H be a further, orientation-preserving quasi-affine transformation and v be the plane mapped to infinity by H. Since v lies outside the convex hull of the set of points and camera centers, it cannot pass through the origin and we can parameterize it as v = (v_1, v_2, v_3, 1)^⊤, where the v_i must be bounded. Thus, bounds on the plane at infinity can be computed by minimizing and maximizing v_i subject to the chiral inequalities (3.31) with δ = 1:

    \min / \max \; v_i \quad \text{subject to} \quad X_{qj}^\top v > 0,\; j = 1, \dots, n, \quad C_{qk}^\top v > 0,\; k = 1, \dots, m, \qquad i = 1, 2, 3.   (3.32)

Chapter 4

Global Optimization

“Now upon the rugged top Stands she, on the loftiest height, When the cliffs abruptly stop And the path is lost to sight.”

Friedrich Schiller (German poet, 1759-1805 AD), The Alpine Hunter

Throughout this dissertation, we will encounter constrained optimization problems of the form:

    minimize   f(x)   (4.1)
    subject to g_i(x) ≤ 0, i = 1, …, m,
               h_j(x) = 0, j = 1, …, p,

where x ∈ R^n is the vector of decision variables, f(x) is the objective function to be optimized, while g_i(x) and h_j(x) are the inequality and equality constraints, respectively. In general, such problems are quite difficult to optimize, since they might have several local minima. Indeed, it might be difficult to even find a feasible point, that is, a value of x which satisfies all the constraints. A further problem that riddles general purpose optimization routines is the absence of a principled stopping criterion.


4.1 Approaches to Global Optimization

The need for global optimization arises in several branches of science and math- ematics, so it is a term imbued with differing interpretations depending on the diverse array of contexts it is used in. In this section, we briefly review the various approaches to global optimization prevalent in scientific computing.

Deterministic approaches

A global optimization approach may be considered deterministic when the ob- jective function and constraints are not specified in probabilistic terms and there is no element of randomization in the search algorithm. Two important optimization frame- works that are deterministic in nature are based on branch and bound methods and real algebraic geometry. Branch and bound minimization algorithms systematically explore the search space, using lower and upper bounding functions (Land and Doig, 1960). The search proceeds by branching, a repeated subdivision of the domain and the bounding functions are computed to underestimate and overestimate the objective in each sub-domain. All those regions of the search space where the lower bounding function lies above the upper bounding function of some other region are pruned away. The branch and bound procedure terminates when the entire search space has been explored or discarded, although in noisy estimation problems, a termination criterion that depends on the approximation gap may be specified. Branch and bound algorithms provably converge to the global minimum, with a certificate of optimality. In this dissertation, several multiview geometry problems are globally minimized in a branch and bound optimization framework, so we review it in greater detail in Section 4.3. Clearly, the key to ensuring that the branch and bound does not degenerate to an exhaustive search is designing an effective branching strategy and constructing bounding functions that tightly approximate the objective. We explore these problem-specific aspects in Chapters5,6 and8 of the dissertation. 67

Algebraic geometry is the study of algebraic varieties, which are geometric manifestations of solutions of systems of polynomial equations (Cox et al., 1992). A popular approach for the global minimization of polynomial systems relies on Gröbner bases, which can be considered a generalization of Gaussian elimination to polynomial systems. More precisely, a Gröbner basis G is a generating subset of an ideal I in a polynomial ring R, such that dividing any polynomial in I by G gives 0 (Buchberger, 1965, 1995). Several problems in multiview geometry, such as three-view triangulation and relative pose estimation, have been solved using Gröbner basis methods (Stewénius et al., 2005; Stewénius et al., 2006). An alternate algebraic geometry approach to optimizing a polynomial system extends the reformulation-linearization technique, whereby linear programming relaxations for polynomial programs are constructed by a lifting procedure, that is, by introducing additional redundant constraints and linearizing using additional variables (Sherali and Adams, 1998). Using the theory of positive polynomials, a sequence of convex linear matrix inequality relaxations of increasing size is constructed, whose optimal values provably converge to the global optimum of the given polynomial system (Lasserre, 2001; Henrion and Lasserre, 2004). Such approaches to global optimization have been used for several multiview geometry problems (Kahl and Henrion, 2005). We review this approach in greater detail in Section 4.2.4 and use it for globally optimizing the direct autocalibration problem in Chapter 7.

Stochastic approaches

Stochastic approaches to optimization are useful in problems where either the modeling is in probabilistic terms, or a degree of randomness is necessary for the algorithm to achieve convergence or optimality. Incorporating the effects of noise often leads to probabilistic features in the objective function, whereby tools of statistical inference may be used to determine optimal solutions or optimal strategies to explore the search space. Stochastic approximation is a popular method used to optimize an unknown, time-varying function (Robbins and Monro, 1951; Kiefer and Wolfowitz, 1952). A naïve incorporation of randomization is to initialize a local gradient descent method at several randomly chosen locations in the search space, in the hope that one of them will converge to the global minimum. Clearly, no performance guarantee better than exhaustive search can be given for such an approach.

Markov Chain Monte Carlo (MCMC) is a technique for sampling from a high-dimensional probability distribution by constructing a rapidly mixing Markov chain whose equilibrium is the target distribution (Robert and Casella, 2004). The Boltzmann distribution, P(x) ∝ e^{−E(x)/T}, is a popular choice for global optimization, since sampling at a low enough temperature T yields a global minimum with arbitrarily high probability. The Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) is a popular way to draw samples from the Boltzmann distribution; a faster but less general variant is the Gibbs sampling procedure (Casella and George, 1992). Simulated annealing, in effect, is the optimization technique that starts with Metropolis-Hastings at a high temperature, then slowly lowers the temperature in accordance with an annealing schedule (Kirkpatrick et al., 1983).
Each step of simulated annealing decides whether a randomly chosen "next state" should replace the "current state", where the decision is probabilistic, depending on the objective function values and a global temperature parameter. At higher temperatures, simulated annealing approximates random search, while at lower temperatures it behaves like gradient descent. While a suitably slow annealing schedule can be proven to achieve the global optimum, in most practical cases it takes at least as long to converge as exhaustive search.
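The loop described above can be sketched in a few lines of pure Python; the test function, step size and geometric cooling schedule below are illustrative choices of mine, not taken from the text.

```python
import math
import random

def simulated_annealing(f, x0, step=0.5, t0=2.0, t_min=1e-3, cooling=0.995, seed=0):
    """Minimize a 1-D function f with a Metropolis-style acceptance rule.

    At temperature t, a random neighbour of the current state is always
    accepted if it improves the objective, and accepted with probability
    exp(-(f_new - f_cur) / t) otherwise, so uphill moves become rare as
    t decreases.
    """
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    t = t0
    while t > t_min:
        x_new = x + rng.uniform(-step, step)
        f_new = f(x_new)
        if f_new < fx or rng.random() < math.exp(-(f_new - fx) / t):
            x, fx = x_new, f_new
            if fx < best_f:
                best_x, best_f = x, fx      # track the best state ever visited
        t *= cooling                        # geometric annealing schedule
    return best_x, best_f

# An illustrative multimodal test function; its global minimum is near x = -0.30.
f = lambda x: x * x + 2.0 * math.sin(5.0 * x) + 2.0
```

Because the schedule here cools far faster than the logarithmic schedules for which convergence proofs exist, a single run may end in a local basin; in practice one keeps the best state over several randomized restarts.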

Metaheuristic approaches

Metaheuristics refer to heuristic search strategies for solving combinatorial optimization problems using pre-defined black box functions, which might be heuristics themselves, that evaluate the objective value for a given state. They are generally employed when the problem is intractable or difficult to pose in one of the more traditional optimization paradigms. The motivation for metaheuristic approaches stems from mimicking the operations of biological systems, such as evolution by natural selection (Barricelli, 1954), genetic mutation or the behavior of ant colonies (Goss et al., 1989). A typical metaheuristic procedure minimizes an objective function using a series of state transitions that might be the product of a generator function or a mutator function. Genetic algorithms (Fraser, 1957), ant colony optimization (Dorigo et al., 1996) and differential evolution (Storn and Price, 1997) are some well-known examples of metaheuristic approaches. Metaheuristic methods are usually employed with a pre-specified time budget and have weak probabilistic guarantees of reaching the global optimum as the time budget tends to infinity.

4.2 Convex Optimization

Convex optimization represents a large class of problems for which the obstacles mentioned at the beginning of the chapter vanish. A problem such as (4.1) is convex when the objective is a convex function and the constraints define a convex set. Any local minimum of a convex problem is the global minimum, feasibility can be determined unambiguously, and precise stopping criteria are provided by the theory of duality. See (Boyd and Vandenberghe, 2004) for a comprehensive treatment of the topics outlined in this section.

4.2.1 Convex Sets

A set S ⊂ R^n is convex if it contains the line segment joining any two points of the set, that is,

x, y ∈ S, α, β ≥ 0, α + β = 1 ⇒ αx + βy ∈ S. (4.2)

Some convex sets that we will encounter in this dissertation are subspaces, affine sets and convex cones. A set S ⊂ R^n is a subspace if it contains the plane defined by the origin and any two of its points, that is,

x, y ∈ S, α, β ∈ R ⇒ αx + βy ∈ S. (4.3)

An affine set is one that contains the line through any two of its elements, that is, S ⊂ R^n is affine if

x, y ∈ S, α, β ∈ R, α + β = 1 ⇒ αx + βy ∈ S. (4.4)

A set S ⊂ R^n is a convex cone if it contains the line segment joining any two of its points as well as all the rays emanating from the origin and passing through its points. Mathematically,

x, y ∈ S, α, β ≥ 0 ⇒ αx + βy ∈ S. (4.5)

A convex cone that arises frequently in this dissertation is the second order cone, named so since it is the norm cone associated with the Euclidean norm:

S_2^n = { x | ‖x̂‖ ≤ x_n }, (4.6)

where x̂ := (x_1, …, x_{n−1})ᵀ. Constraints in several computer vision applications, with A ∈ R^{m×n}, have the form

‖Ax + b‖ ≤ cᵀx + d, (4.7)

which are called second order cone programming constraints, since (4.7) is equivalent to demanding that (Ax + b, cᵀx + d)ᵀ ∈ S_2^{m+1}. Another convex cone that is important in computer vision and machine learning is the set of symmetric positive semidefinite (PSD) matrices:

S_+^n = { X ∈ S^n | ∀ y ∈ R^n, yᵀXy ≥ 0 }, (4.8)

where S^n is the set of n × n symmetric matrices.
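Membership in these cones is easy to test numerically. The sketch below (pure Python, helper names mine) checks the second order cone condition of (4.6) and, for the 2 × 2 case, PSD-ness via the fact that a symmetric 2 × 2 matrix is PSD exactly when its trace and determinant are both non-negative (the eigenvalues sum to the trace and multiply to the determinant).

```python
import math

def in_second_order_cone(x):
    """True iff x = (x_1, ..., x_n) satisfies ||(x_1, ..., x_{n-1})|| <= x_n."""
    return math.hypot(*x[:-1]) <= x[-1] + 1e-12

def psd_2x2(a, b, c):
    """True iff the symmetric matrix [[a, b], [b, c]] is positive semidefinite.

    For a 2x2 symmetric matrix, both eigenvalues are non-negative iff the
    trace (their sum) and the determinant (their product) are non-negative.
    """
    return a + c >= -1e-12 and a * c - b * b >= -1e-12
```

For example, (3, 4, 5) lies on the boundary of the second order cone since ‖(3, 4)‖ = 5, while (3, 4, 4.9) lies outside it.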

Generalized Inequalities

A proper convex cone is closed, has a non-empty interior and does not contain any line. Any proper convex cone K ⊂ R^n defines a generalized inequality:

x ⪰_K y ⇔ x − y ∈ K. (4.9)

Unless otherwise stated, we will encounter instances of generalized inequalities in the following two contexts:

• Componentwise inequality for n-dimensional vectors, using K = R_+^n:

  Given x, y ∈ R^n, x ⪰ y ⇔ x_i − y_i ≥ 0, i = 1, …, n. (4.10)

• Positive semidefiniteness constraints for n × n matrices, using K = S_+^n:

  Given X, Y ∈ R^{n×n}, X ⪰ Y ⇔ X − Y is PSD. (4.11)

Note that we have dropped the reference to the convex cone K while representing the generalized inequalities above, since it will usually be clear from the context.

4.2.2 Convex Functions

Several objective functions that we will encounter in the following chapters have traditionally been minimized by local methods, but they become amenable to global optimization once the underlying convexities are recognized. A convex function f : D → R is one with a convex domain D := dom f which satisfies

∀ x, y ∈ D, ∀ λ ∈ [0, 1], f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y). (4.12)

A concave function is one whose negation is convex. A necessary and sufficient first-order condition for convexity of a differentiable function is that its first-order approximation is a global underestimator, that is,

∀ x, x_0 ∈ D, f(x) ≥ f(x_0) + ∇f(x_0)ᵀ(x − x_0). (4.13)

A necessary and sufficient second-order condition for convexity of a twice-differentiable function is the positive semidefiniteness of its Hessian, that is,

∀ x ∈ D, ∇²f(x) ⪰ 0. (4.14)
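The characterizations (4.12)-(4.14) can be spot-checked numerically for a univariate convex function; the helper below is an illustrative sketch of mine, exercised on f(x) = e^x, which is its own first and second derivative.

```python
import math

def check_convexity_conditions(f, df, d2f, xs, tol=1e-9):
    """Spot-check Jensen's inequality (4.12), the first-order global
    underestimator condition (4.13) and non-negativity of the second
    derivative (4.14) for a univariate f, given derivatives df and d2f.
    """
    for x in xs:
        assert d2f(x) >= -tol                                    # (4.14)
        for y in xs:
            # (4.13): the tangent line at x underestimates f everywhere
            assert f(y) >= f(x) + df(x) * (y - x) - tol
            for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
                z = lam * x + (1.0 - lam) * y
                assert f(z) <= lam * f(x) + (1.0 - lam) * f(y) + tol  # (4.12)
    return True

# f(x) = e^x is convex on R, and equals its own first and second derivative.
assert check_convexity_conditions(math.exp, math.exp, math.exp,
                                  [-2.0, -0.5, 0.0, 1.0, 2.5])
```

Such sampling only falsifies convexity; it cannot prove it, but it is a quick sanity check when a claimed convexity is surprising.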

Some common examples of convex functions are

• Affine functions f(x) = aᵀx + c, where a ∈ R^n, c ∈ R.

• Quadratic functions f(x) = xᵀAx + 2bᵀx + c, where A ⪰ 0.

• Power functions f(x) = x^a, where x ∈ R_{++}, a > 1.

Epigraph

A useful notion is the epigraph of a function:

epi f = { (x, t) | x ∈ dom f, f(x) ≤ t }. (4.15)

It is easy to see that f is a convex function if and only if epi f is a convex set. Many properties of convex functions are better understood in terms of the epigraph. In particular, the following operations can be shown to be convexity-preserving:

• Non-negative summation: α_i ≥ 0, f_i convex ⇒ Σ_i α_i f_i convex.

• Pointwise supremum: f_s convex for each s ∈ S ⇒ sup_{s∈S} f_s convex.

• Minimizing over some variables: f(x, y) jointly convex in (x, y) ∈ R^m × R^n ⇒ g(x) = inf_y f(x, y) convex in x.

4.2.3 Convex Optimization Problems

A convex program is an optimization problem of the form (4.1) with the additional conditions that the objective function f(x) is convex, the inequality constraint functions g_i(x) are all convex and the equality constraint functions h_i(x) are all affine. Besides the advantages mentioned at the beginning of the section, an important reason for the widespread popularity of convex optimization is the rapid development of (publicly available) solver technology. The convex programs whose instances we will study in this dissertation all satisfy the so-called self-concordance property, which has enabled the emergence of interior-point method based solvers (Karmarkar, 1984; Nesterov and Nemirovskii, 1994). These methods are guaranteed to converge to the global minimum in polynomial time and are free of the numerical sensitivity issues that hinder many general-purpose optimization approaches. In the remainder of this section, we will look at a few important examples of convex optimization problems.

Linear Programs

A linear program (LP) has the form

minimize    cᵀx + d (4.16)
subject to  Gx ⪯ h
            Ax = b

where x ∈ R^n, G ∈ R^{k×n} and A ∈ R^{l×n}.
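For intuition about this form, recall that a bounded, feasible LP attains an optimum at a vertex of the feasible polyhedron. The sketch below (an illustration of mine, not how real solvers work — those use simplex or interior-point methods) solves a tiny two-variable instance of the inequality-constrained form above by enumerating intersections of constraint pairs.

```python
from itertools import combinations

def solve_lp_2d(c, G, h):
    """Minimize c.x subject to G x <= h for x in R^2 by brute force:
    every vertex of the feasible polygon is the intersection of two
    active constraints, so enumerate all pairs and keep the feasible
    intersection with the smallest objective value.
    """
    best = None
    for (g1, h1), (g2, h2) in combinations(zip(G, h), 2):
        det = g1[0] * g2[1] - g1[1] * g2[0]
        if abs(det) < 1e-12:
            continue  # parallel constraints have no unique intersection
        # Cramer's rule for the 2x2 system g1.x = h1, g2.x = h2
        x = ((h1 * g2[1] - h2 * g1[1]) / det,
             (g1[0] * h2 - g2[0] * h1) / det)
        if all(g[0] * x[0] + g[1] * x[1] <= hh + 1e-9 for g, hh in zip(G, h)):
            val = c[0] * x[0] + c[1] * x[1]
            if best is None or val < best[0]:
                best = (val, x)
    return best  # (value, vertex), or None if infeasible/unbounded
```

For instance, minimizing −x₁ − x₂ over the unit box 0 ≤ x₁, x₂ ≤ 1 returns the vertex (1, 1) with value −2.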

Quadratic Programs

A quadratically constrained quadratic program (QCQP) has the form

minimize    xᵀPx + qᵀx + r (4.17)
subject to  xᵀP_i x + q_iᵀx + r_i ≤ 0
            Gx ⪯ h
            Ax = b

where i = 1, …, m. While a general QCQP is NP-hard to minimize, convex QCQPs additionally require that P and each P_i be positive semidefinite, and are solvable in polynomial time.

Second Order Cone Programs

A second order cone program (SOCP) has the form

minimize    cᵀx + d (4.18)
subject to  ‖P_i x + q_i‖ ≤ r_iᵀx + s_i
            Gx ⪯ h
            Ax = b

where i = 1, …, m. It can be easily verified that a QCQP is a special case of an SOCP, obtained by setting each r_i = 0 and squaring the second order cone constraints.

Semidefinite Programs

A semidefinite program (SDP) has the form

minimize    cᵀx + d (4.19)
subject to  Σ_{i=1}^n x_i F_i + F_0 ⪰ 0
            Ax = b

where F_j ∈ S^n for j = 0, 1, …, n. It can be verified that SDPs subsume LPs, convex QPs and SOCPs.

4.2.4 Linear Matrix Inequalities

Linear matrix inequalities (LMIs) have come to play a central role in optimization problems that arise in fields like control theory and signal processing. In recent years, LMIs have proven very successful in seeking globally optimal solutions to systems of polynomial equations. As some of the following chapters will demonstrate, this has great utility for several applications in computer vision.

Given symmetric matrices A_i ∈ S^n, i = 0, …, m, and x ∈ R^m, a linear matrix inequality is an expression of the form

LMI(x) := A_0 + Σ_{i=1}^m x_i A_i ⪰ 0. (4.20)

The solution set of the LMI is convex, since it is the preimage of the semidefinite cone under an affine function. A trick commonly used in uncovering convex LMI constraints is the Schur complement technique. Let X be a symmetric matrix decomposable as

X = | A   B |
    | Bᵀ  C |  (4.21)

where A is invertible. Then the Schur complement of A in X is given by

S = C − BᵀA⁻¹B (4.22)

and can be shown to have the following properties:

• X ≻ 0 ⇔ A ≻ 0 and S ≻ 0.
• Given A ≻ 0, X ⪰ 0 ⇔ S ⪰ 0.

The Schur complement will be used several times in this dissertation, especially to recognize the convexity of constraints of the form

0 ≤ ‖Ax + b‖² / (cᵀx + d) ≤ 1, (4.23)

which, we will see, arise quite commonly in several computer vision problems. Indeed, the epigraph associated with the nonlinear function in (4.23), given by { (x, t) | ‖Ax + b‖² ≤ (cᵀx + d) · t }, can be reduced to the convex LMI constraint

| (cᵀx + d)I   Ax + b |
| (Ax + b)ᵀ    t      |  ⪰ 0. (4.24)

Similarly, second order cone programming constraints such as

‖Ax + b‖ ≤ cᵀx + d (4.25)

can be reduced to the convex LMI constraint

| (cᵀx + d)I   Ax + b  |
| (Ax + b)ᵀ    cᵀx + d |  ⪰ 0. (4.26)
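The second Schur complement property can be verified numerically in the simplest setting of scalar blocks; the sketch below (an illustration of mine) checks that for a > 0 the 2 × 2 matrix [[a, b], [b, c]] is PSD exactly when the Schur complement c − b²/a is non-negative.

```python
import random

def schur_check(trials=1000, seed=0):
    """For scalar blocks A = a > 0, B = b, C = c, verify numerically that
    X = [[a, b], [b, c]] is PSD iff the Schur complement S = c - b^2 / a
    is non-negative (the 1x1 instance of 'given A > 0, X >= 0 <=> S >= 0').
    """
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.uniform(0.1, 5.0)       # A must be positive (hence invertible)
        b = rng.uniform(-3.0, 3.0)
        c = rng.uniform(-3.0, 3.0)
        s = c - b * b / a               # Schur complement of A in X
        # Direct 2x2 PSD test: diagonal entries and determinant non-negative
        x_psd = a >= 0 and c >= 0 and a * c - b * b >= 0
        assert (s >= 0) == x_psd
    return True
```

The equivalence is immediate algebraically: for a > 0, c − b²/a ≥ 0 is the same inequality as ac − b² ≥ 0, which together with a > 0 also forces c ≥ 0.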

4.3 Branch and Bound Theory

Branch and bound algorithms are non-heuristic methods for the global optimization of non-convex problems. They maintain a provable upper and/or lower bound on the (globally) optimal objective value and terminate with a certificate proving that the solution is ε-suboptimal (that is, within ε of the global optimum), for arbitrarily small ε. We will restrict our treatment to minimization problems, as that is the case we will encounter in the pertinent structure and motion problems. Consider a non-convex, scalar-valued objective function f(x), for which we seek a global optimum over a rectangle Q_0. By a rectangle, we mean a region of the search space demarcated by finite intervals along each search dimension. For a rectangle Q ⊆ Q_0, let f_min(Q) denote the minimum value of the function f over Q. Also, let f_lb(Q) denote the minimum value attained within the rectangle Q by a function f_lb(x), which underestimates the value of f(x) for any x ∈ Q. The function f_lb must satisfy the following conditions:

(L1) f_lb(Q) computes a lower bound on f_min(Q) over the domain Q, that is, f_lb(Q) ≤ f_min(Q).

(L2) The approximation gap f_min(Q) − f_lb(Q) converges to zero as the maximum half-length of the sides of Q, denoted |Q|, tends to zero, that is,

∀ ε > 0, ∃ δ > 0 s.t. ∀ Q ⊆ Q_0, |Q| ≤ δ ⇒ f_min(Q) − f_lb(Q) ≤ ε.

An intuitive technique to determine the ε-suboptimal solution would be to divide the whole search region Q_0 into a grid with cells of side δ and compute the minimum of a lower bounding function f_lb defined over each grid cell, with the presumption that each f_lb(Q) is easier to compute than the corresponding f_min(Q). However, the number of such grid cells increases rapidly as δ → 0, so a clever procedure must be deployed to create as few cells as possible and "prune" away as many of them as possible (without having to compute the lower bounding function for these cells). This is precisely the aim of a branch and bound algorithm.

The branch and bound algorithm begins by computing f_lb(Q_0) and the point q* ∈ Q_0 which minimizes f_lb over Q_0. If f(q*) − f_lb(Q_0) < ε, for a pre-specified ε, the algorithm terminates. Otherwise Q_0 is partitioned as a union of subrectangles Q_0 = Q_1 ∪ ⋯ ∪ Q_k for some k ≥ 2, and the lower bounds f_lb(Q_i), as well as the points q_i at which these lower bounds are attained, are computed for each Q_i. Let q* = arg min_{q_i, i=1,…,k} f(q_i). We deem f(q*) to be the current best estimate of f_min(Q_0). The algorithm terminates when f(q*) − min_{1≤i≤k} f_lb(Q_i) < ε; otherwise the partition of Q_0 is refined by further dividing some subrectangle and repeating the above. The rectangles Q_i for which f_lb(Q_i) > f(q*) cannot contain the global minimum and are not considered for further refinement. A graphical illustration of the algorithm is presented in Figure 4.1.

Computation of the lower bounding functions is referred to as bounding, while the procedure that chooses a rectangle and subdivides it is called branching. There can be several possible choices of the rectangle picked for refinement in the branching step and of the actual subdivision itself. We consider the rectangle with the smallest minimum of f_lb as the most promising to contain the global minimum and subdivide it into k = 2 rectangles, as described in the following sections, which can be shown to be a convergent strategy. Algorithm 1 uses the abovementioned functions and presents concise pseudocode for the branch and bound method. Detailed descriptions of the bounding and branching procedures are given in the next two subsections. Although guaranteed to find the global optimum (or a point arbitrarily close to it), the worst case complexity of a branch and bound algorithm is exponential. However, we will show in our experiments that the special properties of the geometric reconstruction problems considered in this dissertation lead to fast convergence rates in practice.

Algorithm 1 Branch and Bound
Require: Initial rectangle Q_0 and ε > 0.
 1: Bound: Compute f_lb(Q_0) and minimizer q* ∈ Q_0.
 2: S = {Q_0}   {Initialize the set of candidate rectangles}
 3: loop
 4:   Q' = arg min_{Q ∈ S} f_lb(Q)   {Choose rectangle with lowest bound}
 5:   if f(q*) − f_lb(Q') < ε then
 6:     return q*   {Termination condition satisfied}
 7:   end if
 8:   Branch: Q' = Q_l ∪ Q_r
 9:   S = (S \ {Q'}) ∪ {Q_l, Q_r}   {Update the set of candidate rectangles}
10:   Bound: Compute f_lb(Q_l) and minimizer q_l ∈ Q_l.
11:   if f(q_l) < f(q*) then
12:     q* = q_l   {Update the best feasible solution}
13:   end if
14:   Bound: Compute f_lb(Q_r) and minimizer q_r ∈ Q_r.
15:   if f(q_r) < f(q*) then
16:     q* = q_r   {Update the best feasible solution}
17:   end if
18:   S = {Q | Q ∈ S, f_lb(Q) < f(q*)}   {Discard rectangles with high lower bounds}
19: end loop
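A minimal Python rendering of Algorithm 1 for one-dimensional problems is sketched below. The Lipschitz-based lower bounding function and the test function are illustrative choices of mine (not the convex envelopes developed later in the dissertation), but they satisfy conditions (L1) and (L2).

```python
import heapq
import math

def branch_and_bound(f, f_lb, l, u, eps=1e-3, max_iter=100000):
    """1-D branch and bound in the style of Algorithm 1: keep a priority
    queue of candidate intervals keyed by their lower bounds, always refine
    the interval with the smallest bound, and stop once the best feasible
    value is within eps of that bound (an eps-suboptimality certificate).

    f_lb(a, b) must return (bound, q) with bound <= min of f over [a, b].
    """
    lb0, q0 = f_lb(l, u)
    best_x, best_f = q0, f(q0)
    heap = [(lb0, l, u)]
    for _ in range(max_iter):
        if not heap:
            break
        lb, a, b = heapq.heappop(heap)     # rectangle with the lowest bound
        if best_f - lb < eps:
            return best_x, best_f          # certified eps-suboptimal
        m = 0.5 * (a + b)                  # branch: bisect the interval
        for child in ((a, m), (m, b)):
            lbi, qi = f_lb(*child)         # bound each child interval
            fqi = f(qi)
            if fqi < best_f:
                best_x, best_f = qi, fqi   # update the best feasible solution
            if lbi < best_f:               # prune: cannot contain the optimum
                heapq.heappush(heap, (lbi,) + child)
    return best_x, best_f

# A Lipschitz lower bound satisfies (L1) and (L2): if |f'| <= L on [a, b],
# then f(x) >= f(m) - L * (b - a) / 2 for the midpoint m.
f = lambda x: x * x + 2.0 * math.sin(5.0 * x) + 2.0   # multimodal test function
L = 20.0                                              # |f'| <= 16 on [-3, 3]
f_lb = lambda a, b: (f(0.5 * (a + b)) - L * (b - a) / 2.0, 0.5 * (a + b))
```

Calling branch_and_bound(f, f_lb, -3.0, 3.0) returns a point whose value is certified to lie within eps of the global minimum over the interval; the loose Lipschitz bound makes pruning slow compared to a tight convex envelope, which is exactly why the envelope constructions matter.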


Figure 4.1: This figure illustrates the operation of a branch and bound algorithm on a one-dimensional non-convex minimization problem. Figure (a) shows the function f(x) and the interval l ≤ x ≤ u in which it is to be minimized. Figure (b) shows the convex relaxation of f(x) (indicated in yellow/dashed), its domain (indicated in blue/shaded) and the point at which it attains its minimum value; q_1* is the corresponding value of the function f. This value, the best estimate of the minimum of f(x), is used to reject the left subinterval in Figure (c), as the minimum value of the convex relaxation there is higher than q_1*. Figure (d) shows the lower bounding operation in the right subinterval, which yields a new estimate q_2* of the minimum value of f(x).

4.3.1 Bounding

The goal of the bounding procedure is to provide the branch and bound algorithm with a bound on the smallest value the objective function takes in a domain. The computation of the function f_lb must possess three properties crucial to the efficiency and convergence of the algorithm: (i) it must be easily computable, (ii) it must provide as tight a bound as possible and (iii) it must be easily minimizable. Precisely these features are inherent in the convex envelope of an objective function, which we define below.

Definition 1 (Convex Envelope). Let f : S → R, where S ⊂ R^n is a non-empty convex set. The convex envelope of f over S (denoted convenv f) is a convex function such that (i) convenv f(x) ≤ f(x) for all x ∈ S and (ii) for any other convex function u satisfying u(x) ≤ f(x) for all x ∈ S, we have convenv f(x) ≥ u(x) for all x ∈ S.

Finding the convex envelope of an arbitrary function may be as hard as finding the global minimum. To be of any advantage, the envelope construction must be cheaper than the optimal estimation. Two of the principal contributions of this dissertation are to construct the tightest possible convex relaxations for objective functions that arise in important multiview geometry problems and to show that they satisfy the conditions (L1) and (L2) enumerated above.

4.3.2 Branching

Branch and bound algorithms can be slow; in fact, the worst case complexity grows exponentially with problem size. Thus, one must devise a sufficiently sophisticated branching strategy to expedite convergence. Indeed, for the various geometric reconstruction problems considered in this dissertation, we will demonstrate that it is possible to restrict the branching to a small and fixed number of dimensions regardless of the problem size, which substantially enhances the number of views or points our algorithms can handle.

Three issues must be addressed within the branching phase: the rectangle to branch on, the dimension of the chosen rectangle to split along and the point at which to split the chosen dimension. The choice of rectangle to be partitioned is essentially heuristic: we consider the rectangle with the smallest minimum of f_lb as the most promising to contain the global minimum and subdivide it first. For the remaining two issues, we pick the dimension with the largest interval and employ a simple spatial division procedure, called α-bisection (see Algorithm 2), for a given scalar α, 0 < α ≤ 0.5. It can be shown that for our applications, α-bisection leads to a branch-and-bound algorithm which is convergent; see for instance Appendix C or (Benson, 2002).

Algorithm 2 α-bisection
Require: A rectangle Q ⊂ R^n, defined as Q = [l_1, u_1] × ⋯ × [l_n, u_n].
1: j = arg max_{i=1,…,n} (u_i − l_i)
2: v_j = l_j + α(u_j − l_j)
3: Q_l = [l_1, u_1] × ⋯ × [l_j, v_j] × ⋯ × [l_n, u_n]
4: Q_r = [l_1, u_1] × ⋯ × [v_j, u_j] × ⋯ × [l_n, u_n]
5: return (Q_l, Q_r)
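Algorithm 2 transcribes directly to code; the sketch below represents a rectangle as a list of (l_i, u_i) pairs, a representation of my own choosing.

```python
def alpha_bisection(rect, alpha=0.5):
    """Split a rectangle (a list of (l_i, u_i) intervals) along its longest
    dimension at the point l_j + alpha * (u_j - l_j), for 0 < alpha <= 0.5,
    returning the two sub-rectangles (Q_l, Q_r) of Algorithm 2.
    """
    j = max(range(len(rect)), key=lambda i: rect[i][1] - rect[i][0])
    lj, uj = rect[j]
    vj = lj + alpha * (uj - lj)            # split point along the longest side
    ql = [rect[i] if i != j else (lj, vj) for i in range(len(rect))]
    qr = [rect[i] if i != j else (vj, uj) for i in range(len(rect))]
    return ql, qr
```

For example, bisecting [0, 4] × [0, 2] with α = 0.5 splits the first (longest) dimension at 2, yielding [0, 2] × [0, 2] and [2, 4] × [0, 2].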

The α-bisection branching strategy does not pay any attention to the values of the convex envelope and the objective function. The next strategy we describe, ω-subdivision, addresses this issue. Intuitively, the ω-subdivision rule is a heuristic for maximum improvement in the convex envelope underestimation for Q_l and Q_r compared to Q. We will illustrate the ω-subdivision rule for the case of fractional programming, considered in Chapter 5, where the goal is to minimize a sum of fractions, Σ_{i=1}^n t_i/s_i. It can be shown that, for a fractional program, it suffices to branch only along the dimensions corresponding to the denominator variables s_i. Let a rectangle Q ⊂ R^{2n} be defined as

Q = [l_1, u_1] × ⋯ × [l_n, u_n] × [L_1, U_1] × ⋯ × [L_n, U_n], (4.27)

where t_i ∈ [l_i, u_i] and s_i ∈ [L_i, U_i]. Let the lower bounding point in a rectangle Q be given by ω(Q) = (t_1*, …, t_n*, s_1*, …, s_n*)ᵀ ∈ R^{2n} and let y = (y_1, …, y_n)ᵀ ∈ R^n be the vector of lower bounds for each individual fraction. Then we branch along the dimension for which the difference between the lower bound and the objective function value is maximum. The method is stated more formally in Algorithm 3.

Algorithm 3 ω-subdivision
Require: A rectangle Q ⊂ R^{2n}, ω(Q), y.
1: j = arg max_i (t_i*/s_i* − y_i)
2: Q_l = [l_1, u_1] × ⋯ × [l_n, u_n] × [L_1, U_1] × ⋯ × [L_j, s_j*] × ⋯ × [L_n, U_n]
3: Q_r = [l_1, u_1] × ⋯ × [l_n, u_n] × [L_1, U_1] × ⋯ × [s_j*, U_j] × ⋯ × [L_n, U_n]
4: return (Q_l, Q_r)

The ω-subdivision algorithm can also be shown to be convergent for our applications, but we will henceforth restrict our attention to the α-bisection algorithm.

4.4 Global Optimization for Polynomials

Several problems in computer vision can be formulated as minimizing a polynomial objective, subject to polynomial equality and inequality constraints:

min_{x ∈ R^n} { p(x) | p_i(x) ≥ 0, i = 1, …, m }. (4.28)

Recent advances in algebraic geometry (Cox et al., 1998) and semidefinite programming (Boyd and Vandenberghe, 2004) have shown that it is possible to find the global minimum for such problems. This review section borrows terminology and notation from (Schweighofer, 2006) and (Lasserre, 2001), where we direct the interested reader for greater detail and further references.

Theoretical background

For x ∈ R^n, let R[x] denote the ring of n-variate polynomials p(x), with the monomial basis

B = { x^α := x_1^{α_1} ⋯ x_n^{α_n} | α ∈ N^n }. (4.29)

Then, for f, g_1, …, g_m ∈ R[x], we want to solve the following constrained minimization problem:

f* := inf { f(x) | x ∈ S }, where S = { x ∈ R^n | g_i(x) ≥ 0, i = 1, …, m }, (4.30)

where S is assumed to be compact to make the problem well-defined. We will use the following notations:

R[x]² := { p ∈ R[x] | p = q² for some q ∈ R[x] }
R[x]g := { p ∈ R[x] | p = q²g for given g ∈ R[x] and for some q ∈ R[x] }
Σ R[x]g_i := { p ∈ R[x] | p = Σ_i q_i for some q_i ∈ R[x]g_i, given g_i ∈ R[x] } (4.31)

Similarly constructed definitions will also be used in the subsequent paragraphs. The quadratic module generated by g_1, g_2, …, g_m is the set

Q := Σ R[x]² + Σ R[x]² g_1 + ⋯ + Σ R[x]² g_m = { Σ_{i=0}^m s_i g_i | s_i ∈ Σ R[x]² } ⊆ R[x], (4.32)

with g_0 := 1, which is non-negative on S. Problem (4.30) is clearly non-convex in general, since f is a non-convex function and S is a non-convex set. One way to convexify an arbitrary optimization problem is to lift it to the infinite-dimensional space of measures: if M(S) is the set of all probability measures with support in the set S, then

f* = inf_{x ∈ S} f(x) = inf_{µ ∈ M(S)} ∫_S f(x) dµ, (4.33)

which is convex since M(S) is convex. The dual formulation of this trivial convexification is

f* = sup { c ∈ R | f − c ≥ 0 on S } = sup { c ∈ R | f − c > 0 on S }, (4.34)

where the constraint set is now the cone of non-negative polynomials on S, which is a convex cone. While both (4.33) and (4.34) are convex optimization problems, they are not efficiently solvable in general, since the constraint sets cannot be characterized tractably. But this does give us the intuition that the problem of globally minimizing a system of polynomials is, in fact, related to the classical problem of moments (Akhiezer, 1965),

which can be stated as: given a set S ⊆ R^n and a sequence of numbers {m_α}, α ∈ N^n, does there exist a probability measure µ, with support S(µ) ⊆ S, such that

∫_S x^α dµ = m_α (4.35)

for every given α? The remarkable results in (Putinar, 1993) formalize this connection:

Theorem 6. A linear map L : R[x] → R satisfies L(1) = 1 and L(Q) ⊆ [0, ∞) if and only if there exists a probability measure µ ∈ M(S) such that

L(p) = ∫_S p dµ (4.36)

for any p ∈ R[x].

To see that the above theorem solves the moment problem on S, we note that

(4.36) can be rewritten in terms of the moments {m_α}, since every linear map L : R[x] → R is characterized by its values on the monomial basis B in (4.29). Thus, the primal problem in (4.33) can be reformulated as

f* = inf { L(f) | L : R[x] → R is linear, L(1) = 1, L(Q) ⊆ [0, ∞) }. (4.37)

To reduce the dual problem, we appeal to another result from (Putinar, 1993), whose resemblance in form to Hilbert's Nullstellensatz is the reason for similarly naming this theorem the Positivstellensatz:

Theorem 7. If p ∈ R[x] is positive on S, then p ∈ Q.

Given the above result, we replace the dual problem in (4.34) with

f* = sup { c ∈ R | f − c ∈ Q }. (4.38)

Note that there might be polynomials non-negative on S but not contained in Q. Still, the formulation in (4.38) suffices for our purposes as far as a reduction of the dual problem is concerned. Let deg(p) denote the degree of polynomial p. Further, let the vector space of polynomials up to degree d be denoted R[x]_d. For an arbitrary k ∈ N such that

k ≥ d_max, where d_max = max { deg(f), deg(g_1), …, deg(g_m) }, (4.39)

define

d_i = ⌊ (k − deg(g_i)) / 2 ⌋, (4.40)

where ⌊·⌋ stands for the greatest integer function. Then, similar to (4.32), we define an approximation to Q as

Q_k := Σ R[x]_{d_0}² + Σ R[x]_{d_1}² g_1 + ⋯ + Σ R[x]_{d_m}² g_m
     = { Σ_{i=0}^m s_i g_i | s_i ∈ Σ R[x]², deg(s_i g_i) ≤ k } ⊆ R[x]. (4.41)

Now, we consider the following optimization problem, which is a relaxation of (4.37) and called the primal relaxation of order k:

P_k : min  L(f)
      subject to  L : R[x]_k → R is linear,
                  L(1) = 1,
                  L(Q_k) ⊆ [0, ∞). (4.42)

Similarly, a dual relaxation of order k for (4.38) can be constructed as

D_k : max  c
      subject to  c ∈ R,
                  f − c ∈ Q_k. (4.43)

Let P_k* and D_k* be the optimal values of P_k and D_k, respectively. Then, P_k* ≤ f*. Further, if L is feasible for P_k and c is feasible for D_k, then, since L is a linear map, we have

L(f) − c = L(f) − cL(1) = L(f − c) ≥ 0, (4.44)

where the last inequality follows from the fact that f − c ∈ Q_k. Thus, in particular, D_k* ≤ P_k*. Moreover, every feasible solution of D_k is also a feasible solution of D_{k+1}, and every feasible solution of P_{k+1} is feasible for P_k when restricted to the subspace R[x]_k of R[x]_{k+1}. Finally, for any ε > 0,

f* − ε ≤ D_k* (4.45)

for sufficiently large k ≥ d_max, since it then follows from Theorem 7 that f − (f* − ε) ∈ Q_k, that is, f* − ε is feasible for D_k. Putting it all together, we have the following result from (Lasserre, 2001):

Theorem 8. For k = d_max, d_max + 1, … ∈ N, the sequences {P_k*} and {D_k*} are increasing and converge to f* while satisfying D_k* ≤ P_k* ≤ f*.

Indeed, stronger results are proved for the relaxations P_k and D_k in (Lasserre, 2001), namely, that for an S with a non-empty interior, there is zero duality gap, that is, P_k* = D_k*. It is also shown that for all practical purposes, the problems P_k and D_k can be represented as semidefinite programs (SDPs), which can be efficiently minimized using standard solvers. This forms the basis for a primal-dual schema for solving increasingly tighter relaxations of (4.30), whose optimal solutions converge to the global optimum f*. Note that the above result is valid only for polynomials p ∈ R[x] for which there exists some n ∈ N such that n ± p ∈ Q. It can be shown that this is equivalent to demanding the existence of an n ∈ N such that n − ‖x‖² ∈ Q (Schmüdgen, 1991). In practice, if an n* ∈ N is known such that the set S is contained in a closed ball of radius √n* centered at the origin, then this condition can be satisfied by including an additional redundant constraint g_{m+1} := n* − ‖x‖² ≥ 0.

Convex LMI relaxations for polynomial optimization

The above theory is the basis for the algorithm to globally optimize a scalar polynomial objective function subject to polynomial constraints, as presented in (Henrion and Lasserre, 2005). Let p* denote the minimum objective value (if it exists) of the problem (4.28). Then, a convex relaxation is, by construction, a convex optimization problem with minimum objective value p_r* such that p_r* ≤ p*. By solving the relaxed problem, a lower bound on the original objective function is obtained. Convex linear matrix inequality (LMI) relaxations for (4.28) can be obtained by gradually adding lifting variables and constraints corresponding to linearizations of monomials up to a given degree. The LMI relaxation covering monomials up to a given even degree 2δ is referred to as the LMI relaxation of order δ. The standard Shor relaxation in mathematical programming (Shor, 1998) can be regarded as a first-order LMI relaxation.

To construct an LMI relaxation of order δ, let v_δ(x) be the vector of all monomials up to degree δ, including the constant term 1. Then an LMI relaxation of order δ for the original problem in (4.28) can be constructed using the following algorithm:

1. Linearize the objective function p(x) by lifting: each monomial x₁^{k₁} x₂^{k₂} ⋯ xₙ^{kₙ} is replaced with a new lifting variable y_{k₁k₂…kₙ}. Thus, the linearized objective function can be written pᵀy for a constant coefficient vector p and the lifting vector y.

2. Apply lifting to the LMI constraint pᵢ(x) v_{δ−1}(x) v_{δ−1}(x)ᵀ ⪰ 0 for each constraint pᵢ(x) ≥ 0. Denote the linearized constraint by C_{δ−1}(pᵢ(y)) ⪰ 0.

3. Add the LMI moment matrix constraint, which corresponds to linearizing the trivial constraint v_δ(x) v_δ(x)ᵀ ⪰ 0. Denote the linearized constraint by C_δ(y) ⪰ 0.

To summarize, the following SDP is solved for the LMI relaxation of order δ of the problem (4.28):

    min  pᵀy

    subject to  C_{δ−1}(pᵢ(y)) ⪰ 0,  i = 1, 2, …, m,    (4.46)
                C_δ(y) ⪰ 0.

As we saw previously, it is shown in (Lasserre, 2001) that under certain mild technical conditions, the above hierarchy of relaxations converges asymptotically to p*, that is,

    lim_{δ→∞} p*_δ = p*.    (4.47)

It turns out that for many of the non-convex polynomial optimization problems, global optima are reached at a given accuracy for a moderate number of lifting variables and constraints, hence for an LMI relaxation of moderate order. A sufficient condition for reaching the global optimum is that the moment matrix C_δ(y*) has rank one at the optimum y*.
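The rank-one condition is easy to appreciate numerically: moments obtained by evaluating the monomial vector v_δ(x) at a single point always form a rank-one moment matrix, while moments that mix several points do not. A small illustrative sketch (the monomial ordering and the two test points below are arbitrary choices, not from the text):

```python
import itertools
import numpy as np

def monomials(x, delta):
    """Vector v_delta(x): all monomials of the variables in x up to degree delta."""
    n = len(x)
    v = []
    # multi-indices (k1, ..., kn) with k1 + ... + kn <= delta, in a fixed order
    for total in range(delta + 1):
        for k in itertools.product(range(total + 1), repeat=n):
            if sum(k) == total:
                v.append(np.prod([xi ** ki for xi, ki in zip(x, k)]))
    return np.array(v)

# Moments from a single point evaluation: rank-one moment matrix
# (the certificate that the relaxation has reached the global optimum).
v = monomials(np.array([0.5, -1.0]), 2)
M_point = np.outer(v, v)
print(np.linalg.matrix_rank(M_point))  # 1

# Moments mixing two distinct points: rank exceeds one (relaxation not tight).
w = monomials(np.array([1.0, 2.0]), 2)
M_mixed = 0.5 * (np.outer(v, v) + np.outer(w, w))
print(np.linalg.matrix_rank(M_mixed))  # 2
```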

If the solution to the relaxed problem is not tight, that is, p*_δ < p*, then an approximate solution may be obtained by simply keeping the lifting variables corresponding to first-order moments. A Matlab toolbox for LMI relaxations based on the publicly available SDP solver SeDuMi (Sturm, 1999) can be found in (Henrion and Lasserre, 2003).

Chapter 5

Triangulation and Resectioning

“From the centre of the land and water, at a distance of one-quarter of the Earth’s circumference lies Laṅkā. And from Laṅkā, at a distance of one-fourth thereof, exactly northwards, lies Ujjayinī.”

Āryabhaṭa (Indian astronomer, 476–550 AD), Circumference of Earth, Āryabhaṭīya

5.1 Introduction

With the background of the previous chapters, we now turn our attention to various problems in multiple view geometry, with the aim of developing practical solutions with theoretical guarantees of optimality. As we have discussed in Chapter 1, these optimization problems are typically highly non-linear, and finding their global optima in general has been shown to be NP-hard (Nistér et al., 2007; Freund and Jarre, 2001). Methods for solving these problems are based on a combination of heuristic initialization and local optimization to converge to a locally optimal solution. A common method for finding the initial solution is to use a direct linear transform (for example, the eight-point algorithm (Longuet-Higgins, 1981)) to convert the optimization problem into a linear least squares problem. The solution then serves as the initial point for a non-linear minimization method based on the Jacobian and Hessian of the objective function, for instance, bundle adjustment. As has been

documented, the success of these methods critically depends on the quality of the initial estimate (Hartley and Zisserman, 2004).

In this chapter, we present practical algorithms for finding the globally optimal solution to a variety of problems in multiview geometry, such as general n-view triangulation, camera resectioning (also called camera pose estimation or absolute orientation determination) and the estimation of general projections Pⁿ → Pᵐ, for n ≥ m. We solve each of these problems under three different noise models, including the standard Gaussian distribution and two variants of the bi-variate Laplace distribution. Our algorithm is provably optimal, that is, given any tolerance ε, if the optimization problem is feasible, the algorithm returns a solution which is at most ε away from the global optimum. The algorithm is a branch and bound style method based on extensions to recent developments in the fractional and convex programming literature (Tawarmalani and Sahinidis, 2001b; Benson, 2002; Boyd and Vandenberghe, 2004). While the worst case complexity of our algorithm is exponential, we will show in our experiments that for a fixed ε, the runtime of our algorithm scales almost linearly with problem size, making this a very attractive approach for use in practice. In summary, the main contributions of this chapter are:

• A scalable algorithm for solving a class of multiview problems with a guarantee of global optimality.

• Handling of the standard L2-norm of reprojection errors, as well as the robust L1-norm, for the perspective camera model.

• Introduction of fractional programming to the computer vision community.

5.1.1 Related Work

Recently there has been some progress towards finding the global solution to a few of the multiview optimization problems. An attempt to generalize the optimal solution of two-view triangulation (Hartley and Sturm, 1997) to three views was made in (Stewénius et al., 2005) based on Gröbner bases. However, the resulting algorithm is numerically unstable, computationally expensive and does not generalize to more views or harder problems like resectioning, although more numerically stable results were obtained in (Byröd et al., 2007). In (Kahl and Henrion, 2005), convex linear matrix inequalities were used to estimate the global optimum for several multiple view geometry problems, but no guarantees can be given to certify that the computed solution is indeed the global optimum. Also, there are unsolved problems concerning the numerical stability of the solvers used. Robustification using the L1-norm was presented in (Ke and Kanade, 2005b), but the approach is restricted to the affine camera model. In (Kahl, 2005; Ke and Kanade, 2005a), a wider class of geometric reconstruction problems was solved globally, but with the L∞-norm.

Subsequent to this work, there have been faster solutions that employ similar principles, but are more tailored to the particular case of the triangulation problem (Lu and Hartley, 2007). It is also possible to use convex optimization methods to verify the optimality of a given solution for some multiview geometry problems posed in the L2 norm (Hartley and Seo, 2008).

5.1.2 Outline

We begin by formulating the problems we are interested in solving in the next section. An exposition on fractional programming is given in Section 5.4, with details of the construction of lower bounds (Section 5.4.1). We justify in Section 5.5 the claim that a broad class of multiview geometry problems with different noise models can be cast in the unifying framework of fractional programming. Section 5.6 presents two innovations, crucial to expeditious convergence, that exploit the special properties of structure and motion problems: a novel bounds propagation scheme to restrict the branching process to a small, fixed number of dimensions independent of the problem size, and an intuitive initialization strategy based on reprojection error. Finally, Section 5.7 presents the experimental results of the extensive evaluation of our algorithm on a variety of synthetic and real data sets with different noise levels.

5.2 Problem Formulation

A perspective camera can be modelled as a linear mapping P³ → P² from projective 3-space to a projective image plane. In matrix notation, a 3D scene point, represented by a homogeneous 4-vector X, and its projected image point, represented by a homogeneous 3-vector x, are related by

    x = λPX,    (5.1)

where the scalar λ, called the projective depth, accounts for scale and P is the 3 × 4 camera matrix encoding the intrinsic and extrinsic parameters of the camera. We consider the following two problems under three different noise models, namely the Gaussian and two variants of the bivariate Laplacian.

1. Structure Estimation: Given N images of a point and the corresponding camera matrices, estimation of the position of the point in P³. This is also known as the N-view triangulation problem.

2. Transformation Estimation: Given the position of N points in the projective space Pⁿ and their images in the space Pᵐ, estimation of the projective transformation P that maps these points from Pⁿ to Pᵐ. When n = 3 and m = 2, that is, when the transformation is a 3 × 4 camera matrix, the problem is also known as camera resectioning.

Let P = [π₁ π₂ π₃]ᵀ denote the 3 × 4 camera matrix, where the πᵢ are 4-vectors, let (u, v)ᵀ stand for image coordinates and let X be a homogeneous 3D point. Then the reprojection residual vector for one image is given by

    r = ( u − (π₁ᵀX)/(π₃ᵀX) ,  v − (π₂ᵀX)/(π₃ᵀX) )ᵀ.    (5.2)

Under an independent, identically distributed (iid) Gaussian noise model, the objective function to minimize is the sum of squared residuals, which becomes

    Σ_{i=1}^{N} ‖rᵢ‖₂²,    (5.3)

where N is the number of residual terms in the problem, that is, the number of images of the given point. Other noise models will also be considered later in this chapter.

Minimizing the sum-of-squares objective function (5.3) is known to be a troublesome non-convex optimization problem for both structure and transformation estimation (Hartley and Zisserman, 2004). It is known that the two-view triangulation problem can be reduced to finding the roots of a sixth degree polynomial (Hartley and Sturm, 1997), while three-view triangulation can be posed as the solution to a polynomial system which may have up to 47 roots (Stewénius et al., 2005). So, not only are these seemingly simple instances of the triangulation problem known to have several local minima, the difficulty of obtaining a solution rises sharply with the number of views. This phenomenon causes difficulties for local optimization techniques (such as Newton-based methods) since they might get stuck in local minima.

As an example, consider the following three-view triangulation problem (first

published in (Kahl and Hartley, 2008)) in which there are three local L2 minima, all lying in front of all three cameras. In this example, all points lie in the plane z = 0, so we may simplify the problem to a 2-dimensional triangulation problem. Adding a third dimension makes no significant difference to the example.

Let P₀ be represented by the camera matrix

    P₀ = [ −3   1  −8 ;
           −1  −3  −6 ].

The centre of this camera is at the point (−3, −1, 1)ᵀ. We obtain two other cameras P₁ and P₂ by rotating around the origin by ±120°.

Now, for all i = 0, …, 2, let xᵢ = (3, 1)ᵀ; this is simply the point with non-homogeneous coordinate 3 in the image. It is easily seen that all points of the form

(x, −1, 1)ᵀ map to the same point (3, 1)ᵀ in the P₀ image. These points lie along the line y = −1, which is therefore the ray corresponding to the image point x₀ = (3, 1)ᵀ for the P₀ camera. The rays corresponding to the points measured in the other images lie on lines rotated by ±120° around the origin. The three rays form a triangle. Since this configuration has three-fold symmetry, if there is to be a single minimum of the cost function, then it could only be the origin, which is the symmetry centre. It is easily seen that the origin is not the global optimum. One might suspect that the local optima are at the vertices of the triangle. However, the best L2 solutions do not lie exactly at the vertices of the triangle. The contour plot (sublevel-set plot) of the L2 error (of a slightly perturbed problem) is shown in Figure 5.1.

Figure 5.1: A contour plot of the L2 error for a three-view triangulation problem in which there are three local minima for the L2 cost function.
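The configuration is simple enough to verify numerically. The sketch below assumes the 2 × 3 camera matrix and measured coordinate 3 given in the example (the rotation convention is my own choice); it evaluates the (L2, L2) cost and confirms that the symmetry centre is beaten by a vertex of the triangle of rays, so the origin cannot be the global optimum:

```python
import numpy as np

# 2x3 camera of the planar example; its ray for measured coordinate 3 is y = -1.
P0 = np.array([[-3.0,  1.0, -8.0],
               [-1.0, -3.0, -6.0]])

def rot_h(theta):
    """Homogeneous 2D rotation about the origin."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Cameras P1, P2 are P0 rotated about the origin by +/-120 degrees.
cameras = [P0 @ rot_h(-t) for t in (0.0, 2 * np.pi / 3, -2 * np.pi / 3)]

def cost(U, V, measured=3.0):
    """(L2, L2) cost: sum of squared reprojection residuals over the three views."""
    X = np.array([U, V, 1.0])
    return sum((measured - (P[0] @ X) / (P[1] @ X)) ** 2 for P in cameras)

# The symmetry centre (the origin) is not the global optimum: the triangle
# vertex (0, 2) already achieves a strictly lower cost.
print(cost(0.0, 0.0))  # 25/3 ~ 8.333
print(cost(0.0, 2.0))  # 6.25
```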

5.3 Traditional Approaches

5.3.1 Linear Solution

Often, it is possible to algebraically solve for, say, the unknown 3D point in the case of triangulation, by using a linear method. Let (xᵢ, yᵢ)ᵀ be the image of the unknown

point X under camera Pi. Then, the following relationships hold true:

    xᵢ(π₃ᵢᵀX) − (π₁ᵢᵀX) = 0,
    yᵢ(π₃ᵢᵀX) − (π₂ᵢᵀX) = 0.    (5.4)

If m ≥ 2 views are available, these equations can be expressed in matrix form as AX = 0, where

    A = [ x₁π₃₁ᵀ − π₁₁ᵀ ;
          y₁π₃₁ᵀ − π₂₁ᵀ ;
              ⋮
          xₘπ₃ₘᵀ − π₁ₘᵀ ;
          yₘπ₃ₘᵀ − π₂ₘᵀ ],    (5.5)

and the triangulation problem can be solved by minimizing the algebraic error

    min_X ‖AX‖.    (5.6)

Note that X = 0 is a solution to (5.6), so to extract a meaningful solution, we need to fix the scale of X. This can be achieved by demanding that ‖X‖ = 1 or by setting the last coordinate of X to 1. The former, homogeneous version can be solved using the singular value decomposition (SVD), while the latter, inhomogeneous version is a linear least squares problem. The inhomogeneous version, of course, precludes points at infinity from being a solution. The two versions have, in fact, quite different solution properties in the presence of noise. Note also that neither solution is projective invariant: replacing A by AH should yield H⁻¹X as the solution for a projective invariant method, but H⁻¹X will not, in general, satisfy ‖H⁻¹X‖ = 1 or have its last coordinate equal to 1.

Such solutions are often called Direct Linear Transform (DLT) methods in the literature (Hartley and Zisserman, 2004). While they are fast and applicable for a large number of views, they minimize an algebraic distance and not the geometric reprojection error in (5.3). So, the solution yielded by these methods may not correspond to geometric intuition and can vary depending on the choice of normalization. Moreover, the solution quality can degrade dramatically in the presence of noise. In practice, they are used as initialization for a nonlinear minimization routine, called bundle adjustment.
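As a concrete sketch, the homogeneous DLT for triangulation can be written in a few lines. The cameras and the 3D point below are hypothetical values chosen only to exercise the code; in the noiseless case the null vector of A recovers the point exactly:

```python
import numpy as np

def triangulate_dlt(cameras, points2d):
    """Homogeneous DLT: stack the rows x_i*pi3 - pi1 and y_i*pi3 - pi2 from (5.4)
    and take the singular vector of A with smallest singular value (||X|| = 1)."""
    rows = []
    for P, (x, y) in zip(cameras, points2d):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.array(rows)
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X / X[3]  # dehomogenize; fails for points at infinity

# Hypothetical noiseless setup: two cameras and a known 3D point.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.3, -0.2, 4.0, 1.0])
imgs = [((P @ X_true)[0] / (P @ X_true)[2], (P @ X_true)[1] / (P @ X_true)[2])
        for P in (P1, P2)]

X_est = triangulate_dlt([P1, P2], imgs)
print(np.allclose(X_est, X_true))  # True
```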

5.3.2 Bundle Adjustment

Bundle adjustment refers to a class of local iterative optimization approaches which can be used to minimize the cost function in (5.3). While bundle adjustment is a local optimization approach, it is quite powerful due to its flexibility and incorporation of features specific to multiview geometry. It is common practice to refine the estimate of any reconstruction algorithm using bundle adjustment. The most notable bundle adjustment methods in computer vision employ variants of the Levenberg-Marquardt iterative algorithm (Levenberg, 1944; Marquardt, 1963). The reader is referred to (Triggs et al., 1999) for an exhaustive treatment of the subject. In brief, bundle adjustment seeks to estimate cameras P and points X that minimize a geometric error criterion:

    min_{Pᵢ,Xⱼ} Σᵢⱼ d(Pᵢ, Xⱼ)².    (5.7)

In the case of triangulation or camera resectioning, the cost has the form (5.3), where the minimization is only over either the structure variables or the cameras. But in the case of, say, projective reconstruction (Section 3.3), the minimization involves both the 3D points and the cameras. Levenberg-Marquardt is the preferred optimization routine for bundle adjustment, partly because it allows efficient incorporation of the structure of photogrammetric problems. Sparsity patterns are exploited in several ways in any state-of-the-art bundle adjustment implementation. Variable partitioning, as well as exploiting band diagonality, block structure and skyline structure, are a few ways in which problem structure is taken into account by bundle adjustment. Bundle adjustment had already been popular in the photogrammetry community before it found a new audience and widespread application in computer vision (Brown,

1976; Granshaw, 1980; Slama, 1980). Bundle adjustment minimizes a Maximum Likelihood criterion and has the advantage of being very flexible in incorporating different kinds of variables and constraints, as well as dealing with missing data. The disadvantage, of course, is that it is very prone to getting stuck in local minima, so it requires a strong initialization to produce meaningful estimates. With this background, we turn our attention to methods that yield the globally optimal solution to the geometric error, regardless of the initialization.
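A minimal sketch of such a local refinement step for triangulation, using SciPy's Levenberg-Marquardt driver in place of a full sparse bundle adjuster (the cameras, observations and starting point are synthetic assumptions, not from the text):

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical noiseless setup: two cameras and the observations of one point.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
cams = [P1, P2]
X_true = np.array([0.3, -0.2, 4.0])
obs = []
for P in cams:
    p = P @ np.append(X_true, 1.0)
    obs.append(p[:2] / p[2])

def residuals(X):
    """Stacked reprojection residuals (u - pi1.X/pi3.X, v - pi2.X/pi3.X) over views."""
    Xh = np.append(X, 1.0)
    r = []
    for P, x in zip(cams, obs):
        p = P @ Xh
        r.extend(x - p[:2] / p[2])
    return np.array(r)

# Levenberg-Marquardt from a perturbed initial point (the role of the DLT estimate).
sol = least_squares(residuals, X_true + 0.2, method='lm')
print(sol.x)
```

With noiseless data and a nearby start the iteration recovers X_true; with a poor start or noisy data it can stop at a local minimum, which is exactly the weakness discussed above.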

5.4 Fractional Programming

Let us begin with a brief exposition on fractional programming. In its most general form, fractional programming seeks to minimize/maximize the sum of p ≥ 1 fractions subject to convex constraints. Our interest from the point of view of multiview geometry, however, is specific to the minimization problem

    min_x Σ_{i=1}^{p} fᵢ(x)/gᵢ(x)   subject to  x ∈ D,    (5.8)

where fᵢ : ℝⁿ → ℝ and gᵢ : ℝⁿ → ℝ are convex and concave functions, respectively, and the domain D ⊂ ℝⁿ is a convex, compact set. Further, it is assumed that both fᵢ and gᵢ are positive with lower and upper bounds over D. Even with these restrictions the above problem is NP-complete (Freund and Jarre, 2001), but we demonstrate that practical and reliable estimation of the global optimum is still possible for the multiview problems considered, through iterative algorithms that solve an appropriate convex optimization problem at each step.

For the purposes of the development of the branch and bound algorithm, let us assume that we have available to us upper and lower bounds on the functions fᵢ(x) and gᵢ(x), denoted by the intervals [lᵢ, uᵢ] and [Lᵢ, Uᵢ], respectively. Let Q₀ denote the 2p-dimensional rectangle [l₁, u₁] × ⋯ × [l_p, u_p] × [L₁, U₁] × ⋯ × [L_p, U_p]. Introducing auxiliary variables t = (t₁, …, t_p)ᵀ and s = (s₁, …, s_p)ᵀ, consider the following alternate optimization problem:

    min_{x,t,s}  Σ_{i=1}^{p} tᵢ/sᵢ

    subject to  fᵢ(x) ≤ tᵢ,  gᵢ(x) ≥ sᵢ,  i = 1, …, p,
                x ∈ D,  (t, s) ∈ Q₀.    (5.9)

We note that the feasible set for problem (5.9) is a convex, compact set and that (5.9) is feasible if and only if (5.8) is. Indeed, the following holds true (Benson, 2002):

Theorem 9. (x*, t*, s*) ∈ ℝ^{n+2p} is a global optimal solution for (5.9) if and only if tᵢ* = fᵢ(x*), sᵢ* = gᵢ(x*), i = 1, …, p, and x* ∈ ℝⁿ is a global optimal solution for (5.8).

Proof. See Appendix A.

Thus, Problems (5.8) and (5.9) are equivalent and henceforth we shall restrict our attention to Problem (5.9). Next, we look at the construction of a convex relaxation for Problem (5.9) that is well-adapted for use in a branch and bound algorithm.
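The useful direction of this equivalence is easy to sanity-check numerically: for any feasible (x, t, s) of (5.9) with positive data, t/s can never undercut f(x)/g(x), so the lifted objective is minimized exactly at t = f(x), s = g(x). A toy single-fraction check (the functions f, g and the interval are arbitrary choices of mine):

```python
import numpy as np

# A toy instance of (5.8) with one fraction: f convex and positive,
# g concave (linear) and positive on D = [0, 3].
f = lambda x: x ** 2 + 1.0
g = lambda x: x + 2.0

xs = np.linspace(0.0, 3.0, 3001)
p_star = np.min(f(xs) / g(xs))  # grid estimate of the optimum of the original problem

# Sample feasible points of the lifted problem (5.9): f(x) <= t and 0 < s <= g(x).
# The objective t/s then never drops below f(x)/g(x), hence never below p*.
rng = np.random.default_rng(0)
ok = True
for _ in range(10000):
    x = rng.uniform(0.0, 3.0)
    t = f(x) + rng.uniform(0.0, 5.0)    # any t above the numerator
    s = g(x) * rng.uniform(0.05, 1.0)   # any positive s below the denominator
    ok &= (t / s >= f(x) / g(x) - 1e-12) and (t / s >= p_star - 1e-9)
print(ok)  # True
```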

5.4.1 Bounding

As discussed in Section 4.3.1, a useful lower bounding function must be a tight approximation to the objective function that can be efficiently computed and minimized. For the case of a fractional program, the convex envelope, which is the tightest possible convex relaxation, can be shown to satisfy these requirements. Indeed, it is shown in (Tawarmalani and Sahinidis, 2001b) that the convex envelope for a single fraction t/s, where t ∈ [l, u] and s ∈ [L, U], is given as the solution to the following Second Order

Cone Program (SOCP):

    convenv(t/s) = min_{r, r₀, s₀}  r

    subject to  r, r₀, s₀ ∈ ℝ,
                ‖( 2λ√l ,  r₀ − s₀ )‖ ≤ r₀ + s₀,
                ‖( 2(1−λ)√u ,  (r − r₀) − (s − s₀) )‖ ≤ (r − r₀) + (s − s₀),
                λL ≤ s₀ ≤ λU,   (1−λ)L ≤ s − s₀ ≤ (1−λ)U,
                r₀ ≥ 0,   r − r₀ ≥ 0,

where we have substituted λ = (u − t)/(u − l) for ease of notation, and r, r₀, s₀ are auxiliary scalar variables.

It is easy to show that the convex envelope of a sum is always greater than (or equal to) the sum of the convex envelopes. That is, if f = Σᵢ tᵢ/sᵢ then convenv f ≥ Σᵢ convenv(tᵢ/sᵢ). It follows that, in order to compute a lower bound on Problem

(5.9), one can compute the sum of convex envelopes for ti/si subject to the convex constraints. Hence, this way of computing a lower bound flb(Q) amounts to solving a convex SOCP problem which can be done efficiently (Sturm, 1999).

In summary, in order to compute a lower bound f_lb(Q) on the rectangle Q = [l₁, u₁] × ⋯ × [l_p, u_p] × [L₁, U₁] × ⋯ × [L_p, U_p], the following SOCP is solved:

    min_{x, r, r₀, s, s₀, t}  Σ_{i=1}^{p} rᵢ

    subject to  x ∈ ℝⁿ,  r, r₀, s, s₀, t ∈ ℝᵖ,
                ‖( 2λᵢ√lᵢ ,  r₀ᵢ − s₀ᵢ )‖ ≤ r₀ᵢ + s₀ᵢ,
                ‖( 2(1−λᵢ)√uᵢ ,  (rᵢ − r₀ᵢ) − (sᵢ − s₀ᵢ) )‖ ≤ (rᵢ − r₀ᵢ) + (sᵢ − s₀ᵢ),
                λᵢLᵢ ≤ s₀ᵢ ≤ λᵢUᵢ,   (1−λᵢ)Lᵢ ≤ sᵢ − s₀ᵢ ≤ (1−λᵢ)Uᵢ,
                r₀ᵢ ≥ 0,   rᵢ − r₀ᵢ ≥ 0,
                lᵢ ≤ tᵢ ≤ uᵢ,   Lᵢ ≤ sᵢ ≤ Uᵢ,
                fᵢ(x) ≤ tᵢ,   gᵢ(x) ≥ sᵢ,   for i = 1, …, p.

This construction of convex envelopes can be shown to satisfy the conditions (L1) and (L2) of Section 4.3 (Benson, 2002) and is, therefore, well-suited for our branch and bound algorithm.
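The cone constraints above are all of the standard form u·v ≥ w² with u, v ≥ 0, which is second-order-cone representable via the identity ‖(2w, u − v)‖₂ ≤ u + v. This identity (standard convex-optimization background, not stated in the text) can be checked numerically:

```python
import numpy as np

def in_soc(u, v, w):
    """SOC form of the rotated-cone constraint u*v >= w^2, u, v >= 0:
    ||(2w, u - v)||_2 <= u + v (small tolerance for floating point)."""
    return np.hypot(2.0 * w, u - v) <= u + v + 1e-12

# Randomized equivalence check of the two formulations.
rng = np.random.default_rng(1)
ok = True
for _ in range(10000):
    u, v, w = rng.uniform(-2.0, 2.0, size=3)
    direct = (u >= 0.0) and (v >= 0.0) and (u * v >= w * w)
    ok &= (direct == bool(in_soc(u, v, w)))
print(ok)  # True
```

The identity follows from (u + v)² − ‖(2w, u − v)‖² = 4(uv − w²), which is why each envelope constraint, and hence the whole lower-bounding problem, is an SOCP.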

5.5 Applications to Multiview Geometry

In this section, we elaborate on adapting the theory developed in the previous section to common problems of multiview geometry. In the standard formulation of these problems based on the Maximum Likelihood Principle, the exact form of the objective function to be optimized depends on the choice of noise model. The noise model describes how the errors in the observations are statistically distributed given the ground truth. The most common noise model is the Gaussian distribution which has a very thin tail, that is, the probability of large deviation decreases to zero very rapidly. In practice, however, large errors occur more often than predicted by the Gaussian distribution, for instance, due to erroneous localization of interest points or just bad correspondences. There are two ways of getting around this problem. The first is to robustify the cost function by reducing the penalty for large deviations and the second is to consider noise models with thicker tails (Huber, 1981). The latter choice then translates into a modified likelihood function. We will consider the Gaussian and two variants of the Laplacian noise model. In the Gaussian noise model, assuming an isotropic distribution of error with a known standard deviation σ, the likelihood for two image points - one measured point x and one true x0 - is

    p(x | x₀) = (2πσ²)⁻¹ exp( −‖x − x₀‖₂² / (2σ²) ),    (5.10)

where ‖·‖_p stands for the p-norm. Thus maximizing the likelihood, assuming iid noise, is equivalent to minimizing Σᵢ ‖xᵢ − x₀ᵢ‖₂², which we interpret as a combination of two vector norms: the first for the point-wise error in the image, and the second that accumulates the point-wise errors. We call this the (L2, L2)-formulation.

Table 5.1: Different cost functions of reprojection errors.

    Gaussian              Laplacian I           Laplacian II
    Σᵢ ‖xᵢ − x₀ᵢ‖₂²       Σᵢ ‖xᵢ − x₀ᵢ‖₂        Σᵢ ‖xᵢ − x₀ᵢ‖₁
    (L2, L2)              (L2, L1)              (L1, L1)

The exact definition of the Laplace noise model depends on the particular definition of the multivariate Laplace distribution (Kotz et al., 2001). In the current work we choose two of the simpler definitions. The first one is a special case of the multivariate exponential power distribution, giving us the likelihood function:

    p(x | x₀) = (2πσ)⁻¹ exp( −‖x − x₀‖₂ / σ ).    (5.11)

An alternative view of the bivariate Laplace distribution is to consider it as the joint distribution of two iid univariate Laplace random variables, where x = (u, v)ᵀ and x₀ = (u₀, v₀)ᵀ, which gives us the following likelihood function:

    p(x | x₀) = (1/(2σ)) e^{−|u − u₀|/σ} · (1/(2σ)) e^{−|v − v₀|/σ} = (4σ²)⁻¹ exp( −‖x − x₀‖₁ / σ ).    (5.12)

Maximizing the likelihoods in (5.11) and (5.12) is equivalent to minimizing Σᵢ ‖xᵢ − x₀ᵢ‖₂ and Σᵢ ‖xᵢ − x₀ᵢ‖₁, respectively. Again, in our interpretation of these expressions as a combination of two vector norms, we denote these minimizations as (L2, L1) and (L1, L1), respectively.

We summarize the classification of the overall error under the various noise models in Table 5.1. In this notation, the (L2, L∞)-case of the problems has recently been solved in polynomial time (Kahl, 2005).

5.5.1 Triangulation

The primary concern in triangulation is to recover the 3D scene point given measured image points and known camera matrices in N ≥ 2 views. Let P = [π₁ π₂ π₃]ᵀ denote the 3 × 4 camera matrix, where each πᵢ is a 4-vector, let (u, v)ᵀ be image coordinates and let X = (U, V, W, 1)ᵀ be the extended 3D point coordinates. Then the reprojection residual vector for this image is given by

    r = ( u − (π₁ᵀX)/(π₃ᵀX) ,  v − (π₂ᵀX)/(π₃ᵀX) )ᵀ,    (5.13)

and hence the objective function to minimize becomes Σ_{i=1}^{N} ‖rᵢ‖_p^q for the (Lp, Lq)-case. In addition, one can require that π₃ᵀX > 0, which corresponds to the 3D point being in front of the camera. We now show that, by defining ‖r‖_p^q as an appropriate ratio f/g of a convex function f and a concave function g, the problem in (5.13) can be identified with the one in (5.9).

(L2, L2). The norm-squared residual r can be written as

    ‖r‖₂² = ( (aᵀX)² + (bᵀX)² ) / (π₃ᵀX)²,    (5.14)

where a, b are 4-vectors dependent on the known image coordinates and the known camera matrix. By setting

    f = ( (aᵀX)² + (bᵀX)² ) / (π₃ᵀX)   and   g = π₃ᵀX,    (5.15)

a convex-concave ratio is obtained. It is straightforward to verify the convexity of f via the convexity of its epigraph:

    epi f = { (X, t) | t ≥ f(X) }
          = { (X, t) | ½(t + π₃ᵀX) ≥ ‖( aᵀX, bᵀX, ½(t − π₃ᵀX) )‖ },

which is a second-order convex cone (Boyd and Vandenberghe, 2004).

(L2, L1). Similar to the (L2, L2)-case, the norm of r can be written ‖r‖₂ = f/g, where f = √( (aᵀX)² + (bᵀX)² ) and g = π₃ᵀX. Again, the convexity of f can be established by noting that the epigraph epi f = { (X, t) | t ≥ ‖(aᵀX, bᵀX)‖ } is a second-order cone.

(L1, L1). Using the same notation as above, the L1-norm of r is given by ‖r‖₁ = f/g, where f = |aᵀX| + |bᵀX| and g = π₃ᵀX.

In all the cases above, g is trivially concave since it is linear in X.
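The three decompositions can be verified mechanically. In the sketch below, the camera, image point and 3D point are hypothetical values chosen so that π₃ᵀX > 0; the identities then hold exactly by construction:

```python
import numpy as np

# Hypothetical 3x4 camera [pi1 pi2 pi3]^T, measured point (u, v) and 3D point X
# with pi3 . X > 0 (point in front of the camera).
P = np.array([[800.0,   0.0, 320.0, 10.0],
              [  0.0, 800.0, 240.0, -5.0],
              [  0.1,   0.2,   1.0,  0.5]])
pi1, pi2, pi3 = P
u, v = 500.0, 300.0
X = np.array([0.2, -1.1, 3.0, 1.0])

# Vectors a, b with a.X = u*(pi3.X) - pi1.X and b.X = v*(pi3.X) - pi2.X,
# so that the residual components are (a.X)/g and (b.X)/g.
a, b = u * pi3 - pi1, v * pi3 - pi2
g = pi3 @ X                     # the concave (linear) denominator
p = P @ X
r = np.array([u - p[0] / p[2], v - p[1] / p[2]])

print(np.isclose(r @ r, ((a @ X) ** 2 + (b @ X) ** 2) / g ** 2))        # (L2, L2)
print(np.isclose(np.linalg.norm(r), np.hypot(a @ X, b @ X) / g))        # (L2, L1)
print(np.isclose(abs(r).sum(), (abs(a @ X) + abs(b @ X)) / g))          # (L1, L1)
```

All three prints are True: each norm of r is the stated ratio of a convex numerator and the linear denominator g.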

5.5.2 Camera Resectioning

The problem of camera resectioning is the analogous counterpart of triangulation, whereby the aim is to recover the camera matrix given the 3D coordinates of N ≥ 6 scene points and their corresponding images. The main difference compared to the triangulation problem is that the number of degrees of freedom has increased from 3 to 11.

2 2 2 2 rewritten in the form r = ((a>π) + (b>π) )/(X>π3) , where a, b are 12-vectors || ||2 determined by the coordinates of the image point x and the scene point X. Recalling the

2 derivations for the (L2,L2)-case of triangulation, it follows that r can be written as || ||2 2 2 a fraction f/g with f = ((a>π) + (b>π) )/(X>π3) which is convex and g = X>π3 concave in accordance with Problem (5.9). Similar derivations show that the same is true for camera resectioning with (L2,L1)-norm as well as (L1,L1)-norm.

5.5.3 Projections from Pⁿ to Pᵐ

Our formulation for the camera resectioning problem is very general and not restricted by the dimensionality of the world or image points. Thus, it can be viewed as a special case of a Pⁿ → Pᵐ projection with n = 3 and m = 2.

When m = n, the mapping is called a homography. Typical applications include homography estimation from planar scene points to the image plane, or inter-image homographies (m = n = 2), as well as the estimation of 3D homographies relating different coordinate systems (m = n = 3). For projections (n > m), camera resection is the most common application, but numerous other instances appear in the computer vision field (Wolf and Shashua, 2002).

5.6 Multiview Fractional Programming

In this section, we present some important aspects of our implementation which extend the traditional methods for solving fractional programs by exploiting properties specific to the structure of multiview geometry problems. In fact, these developments form the basis for the excellent convergence rates our implementation achieves, as opposed to the exponential search in several dimensions that a naïve implementation of existing fractional programming techniques results in.

5.6.1 Bounds Propagation

Consider a fractional program with k fractions. Traditional approaches to fractional programming require branching in at least k dimensions corresponding to the denominators for the algorithm to converge correctly. For a triangulation problem, k is the number of cameras, and for a resectioning problem, it is the number of points. A branching dimension in a traditional branch and bound algorithm is the denominator of the reprojection error term corresponding to each point (for resectioning) or camera (for triangulation). It is evident that the search space of a branch and bound algorithm that branches in k dimensions can be untenably large even for medium-sized problems. Contemporary literature (Schaible and Shi, 2003) documents reasonable results for practical problems with k at most 10 to 12.

However, we can do much better with the realization that, for all problems presented in Section 5.5, the denominator is a linear function in the unknowns. To elucidate the concept, let us assume the problem is one of triangulating the location of a (homogeneous) point X = (U, V, 1)ᵀ ∈ ℝ³, so that each branching entity (a denominator g(X)) is a linear function in the two variables U and V. Please refer to Figure 5.2 for an illustration. Each bounding constraint restricts the denominator to lie in a particular half space in ℝ²; thus, a pair of lower and upper bounds on two linearly independent denominators g₁ and g₂ restricts the feasible values to a convex quadrilateral in the 2D plane. Further, U and V are linear in g₁ and g₂, and so are the denominators of all the other fractions in the triangulation problem corresponding to views 3, …, k. So, the convex polygon that represents the bounds on the denominators g₁ and g₂ induces bounds on the denominators of all the fractions in the triangulation problem.


Figure 5.2: The red lines indicate the lower and upper bounds on the denominator g1 while the blue lines indicate bounds on the denominator g2. The shaded gray region represents the induced bounds on the variables U and V . Any linear function of U and V restricted to the domain represented by the gray polygon will attain its extremal values at two vertices of this simplex, as illustrated by the thick black points for some linear function g3(U, V ) represented by the green lines.

Extending the analogy to the case of triangulation in three dimensions, the unknown point coordinates X = (U, V, W, 1)ᵀ are linear in gᵢ(X) = π₃ᵢᵀX for i = 1, …, k. Suppose k > 3 and bounds are given on three denominators, say g₁, g₂, g₃, which are not linearly dependent. These bounds then define a convex polytope in ℝ³. This polytope constrains the possible values of U, V and W, which in turn induce bounds on the other denominators g₄, …, g_k. The bounds can be obtained by solving a set of small linear programs each time branching is performed:

    for i = 1, …, k:   min gᵢ(x)  and  max gᵢ(x)
    subject to  Lⱼ ≤ gⱼ(x) ≤ Uⱼ,  j = 1, 2, 3.    (R1)

Thus, it is sufficient to branch on three dimensions in the case of triangulation. Similarly, in the case of camera resectioning, the denominator has only three degrees of freedom and, more generally, for projections Pⁿ → Pᵐ, the denominator has n degrees of freedom.

The choice of the n dimensions to be used for bounds propagation is, in our opinion, a matter of implementation preference. A simple heuristic is to branch on the dimension along which the rectangle to be split is the widest and incorporate it as one of the n dimensions used in the subsequent step of propagating bounds. This might, in principle, avoid issues with committing once and for all to some choice of n particular denominators as the ones branched upon, such as the case when two or more faces of the bounding-constraint polytope are nearly parallel. However, in both our synthetic and real experiments, we have observed no such (numerical) instabilities. As a practical note, we must point out that, as the number of fractions increases, bounds propagation becomes the time-critical step of the algorithm. However, the gains accrued from the reduced dimensionality of the search space more than outweigh any cost involved in solving the large LP which constitutes the bounds propagation step.

5.6.2 Initialization

Besides bounds propagation, another component of the algorithm crucial to a rapid convergence is the initialization. In the construction of the algorithm, we assumed that initial bounds are available on the numerator and the denominator of each of the

k fractions. This initial rectangle Q₀ ⊂ ℝ²ᵏ is the starting point for the branch and bound algorithm. It is clear that the size of this initial search region will affect the runtime of the search algorithm. However, it is not clear how the user should specify the bounds that define the initial region, especially since they depend on the problem geometry and are not straightforward to guess intuitively.

What is intuitive, however, is the notion of reprojection error (in pixels), and it is easy for the user to specify a reasonable upper bound on the worst reprojection error. This upper bound can then be used to construct bounds on the numerator and denominator by solving a set of simple optimization problems. Let γ be an upper bound on the reprojection error in pixels (specified by the user); then we can bound the denominators gᵢ(x) by solving the following set of 2k optimization problems:

for i = 1, . . . , k:

    min gi(x)   and   max gi(x)
    subject to   fj(x)/gj(x) ≤ γ,   j = 1, . . . , k.        (5.16)

Depending on the choice of error norm, the above optimization problems will be instances of linear programming (for L1-L1) or quadratic programming (for L2-L1 and L2-L2). We will call this γ-initialization. If the user-specified reprojection error is too small to lead to a feasible solution, or so large that the SOCP solver is mired in numerical errors, the algorithm defaults to initial bounds which are wide enough for usual problem scales and known to be small enough to be numerically stable. This situation arises sometimes in our experiments, but we have found that the search space shrinks rapidly even with extremely liberal default values for the initial bounds. As a further note on the implementation, while tight bounds on the denominators are crucial for the performance of the overall algorithm, the bounds on the numerators are not. Therefore, we set the numerator bounds to preset values.
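To make the γ-initialization concrete, the sketch below solves such a linear program for a toy 2-D case by vertex enumeration, using the fact that a linear objective over a bounded polytope attains its optimum at a vertex. This is purely illustrative (the actual implementation poses these problems to a convex solver); the constraint rows here stand in for the linearized constraints fj(x) − γ gj(x) ≤ 0, and the objective stands in for the affine denominator gi(x):

```python
from itertools import combinations

def lp_minmax_over_polytope(A, b, c):
    """Min and max of c·x over {x : A x <= b} in 2-D by vertex enumeration.

    Intersects each pair of constraint boundaries and keeps the feasible
    intersection points; the optimum of a linear objective over a bounded
    polytope lies at one of these vertices.
    """
    verts = []
    for i, j in combinations(range(len(A)), 2):
        a1, a2 = A[i], A[j]
        det = a1[0] * a2[1] - a1[1] * a2[0]
        if abs(det) < 1e-12:          # parallel boundaries: no vertex
            continue
        x = (b[i] * a2[1] - b[j] * a1[1]) / det
        y = (a1[0] * b[j] - a2[0] * b[i]) / det
        if all(A[k][0] * x + A[k][1] * y <= b[k] + 1e-9 for k in range(len(A))):
            verts.append((x, y))
    vals = [c[0] * x + c[1] * y for (x, y) in verts]
    return min(vals), max(vals)
```

In the actual algorithm the feasible set lives in the full variable space and the LPs (or QPs, depending on the norm) are handed to SeDuMi rather than enumerated.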

5.6.3 Coordinate System Independence

All three error norms (see Table 5.1) are independent of the coordinate system chosen for the scene (or source) points. In the image, one can translate and scale the points without affecting the norms. For all problem instances and all three error norms considered, the coordinate system can be chosen such that the first denominator g1 is a constant equal to one. Thus, there is no need to approximate the first term in the cost function with a convex envelope, since it is already a convex function.

5.7 Experiments

Both triangulation and estimation of projections P^n ↦ P^m have been implemented for all three error norms in Table 5.1 in the Matlab environment using the convex solver SeDuMi (Sturm, 1999), and the code is publicly available¹. The optimization is based on the branch and bound procedure as described in Algorithm 1 and α-bisection (see Algorithm 2) with α = 0.5. To compute the initial bounds, γ-initialization is used (see Section 5.6.2) with γ = 15 pixels for both real and synthetic data. The branch and bound terminates when the difference between the global optimum and the underestimator is less than ε = 0.05. In all experiments, the Root Mean Squares (RMS) errors of the reprojection residuals are reported, regardless of the computation method. In addition to the methods based on fractional programming, the results are also compared to those of bundle adjustment initialized with a linear method (Hartley and Zisserman, 2004).

5.7.1 Synthetic Data

We demonstrate the various aspects of our algorithm, such as scalability, runtime and termination, using extensive simulations on synthetic data. Our data is generated by creating random 3D points within the cube [−1, 1]³ and then projecting them to the images. The image coordinates are corrupted with iid Gaussian noise with different levels of variance. In all graphs, the average of 200 trials is plotted. In the first experiment, we employ a weak camera geometry for triangulation, whereby three cameras are placed along a line at distances 5, 6 and 7 units, respectively,

¹See http://www.maths.lth.se/matematiklth/personal/fredrik/download.html.


Figure 5.3: Triangulation with forward motion. Figure (a) compares the reprojection error of the three algorithms with bundle adjustment. Note the degradation in performance of bundle adjustment with increasing noise in the image, further demonstrated in Figure (b), which plots the mean 3D error for the four algorithms.

from the origin. In Figures 5.3(a) and (b), the reprojection errors and the 3D errors are plotted, respectively. The (L2,L2) method, on average, results in a much lower error than bundle adjustment, which can be attributed to bundle adjustment getting enmeshed in local minima due to the non-convexity of the problem. The graph in Figure 5.4 depicts the percentage of trials in which (L2,L2) outperforms bundle adjustment in accuracy. It is evident that the higher the noise level, the more likely it is that the bundle adjustment method does not attain the global optimum. In the next experiment, we simulate outliers in the data in the following manner. Varying numbers of cameras, placed 10° apart and viewing toward the origin, are generated in a circular motion of radius 2 units. In addition to Gaussian noise with standard deviation 0.01 pixels for all image points, the coordinates of one of the image points have been perturbed by adding or subtracting 0.1 pixels. This point may be regarded as an outlier. As can be seen from Figures 5.5(a) and (b), the reprojection errors are lowest for the (L2,L2) and bundle methods, as expected. However, in terms of 3D error, the L1 methods perform best, and already with two cameras one gets a reasonable estimate of the scene point.


Figure 5.4: For triangulation with forward motion, this figure shows the percentage of trials in which the (L2,L2) algorithm found a better solution than bundle adjustment.

In the third experiment, six 3D points in general position are used to compute the camera matrix. Note that this is a minimal case, as it is not possible to compute the camera matrix from five points. The true camera location is at a distance of two units from the origin. The reprojection errors are graphed in Figure 5.6. Results for bundle adjustment and the (L2,L2) method are identical, and thus the likelihood of local minima is low. No errors on the estimated quantities are given, since it is not meaningful to compare (homogeneous) camera matrices. To demonstrate scalability, Table 5.2 reports the runtime of our algorithm over a variety of problem sizes for resectioning. The tolerance, ε, is set to within 1 percent of the global optimum, the maximum number of iterations to 500, and mean and median runtimes are reported over 200 trials. As can be seen, both median and mean runtimes scale almost linearly with the size of the problem, making this an attractive algorithm for use in practice. Finally, we demonstrate the effect of the optimality tolerance, ε, on the time it takes the branch and bound algorithm to converge. Five cameras were used for the triangulation experiment, placed in a circular arc of radius 1, looking towards the origin,


Figure 5.5: (a) and (b) show reprojection and 3D errors, respectively, for triangulation with one outlier. Despite a higher reprojection error, the L1 algorithms outperform bundle adjustment in terms of 3D error.


Figure 5.6: Reprojection errors for camera resectioning.

Table 5.2: Mean and median runtimes (in seconds) for the three algorithms as the number of points for a resectioning problem is increased. MI is the percentage of trials in which the algorithm reached 500 iterations.

Points    (L2,L2)                  (L2,L1)                  (L1,L1)
          Mean    Median  MI       Mean    Median  MI       Mean    Median  MI
6         42.8    35.5    0.5      41.6    31.5    1.5      7.9     4.7     0.0
10        51.8    41.9    0.5      105.8   66.6    3.5      20.3    13.5    0.5
20        72.7    50.5    2.5      210.2   121.2   9.0      46.8    28.2    1.0
50        145.5   86.5    4.5      457.9   278.3   8.5      143.0   75.9    2.5
70        172.5   107.8   3.5      616.5   368.7   7.5      173.0   102.8   1.5
100       246.2   148.5   4.5      728.7   472.4   4.0      242.3   133.6   2.0

with an angular separation of 10° between adjacent cameras. The points to be triangulated are generated in the cube [−0.5, 0.5]³, and Gaussian noise of standard deviation 1% of the image size is added to the image coordinates. Six points in general position are used for the resectioning experiments, with similar additive noise. The mean and median number of iterations over 200 trials for the triangulation and resectioning experiments are recorded in Figure 5.7 as ε is varied from 0.001 to 0.1. The dependence of the convergence time on ε is exponential. This is expected, since for a given initial region and fixed number of branching dimensions, d, reducing ε by half increases the discrete search volume by a factor of 2^d. Note that a value of ε below 0.01 is, for all practical purposes, too stringent. In all our experiments, a value of ε between 0.01 and 0.1 suffices, and in that range the exponential behavior is not significantly pronounced.
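The dependence on ε can be reproduced with a minimal one-dimensional branch and bound. This is our own toy example, not the chapter's algorithm: the test function, its naive interval lower bound, and the bisection rule below are stand-ins for the fractional cost and the convex underestimators used in the chapter.

```python
import heapq

def f(x):
    # a simple non-convex objective with global minimum -0.25 at x = ±1/sqrt(2)
    return x**4 - x**2

def lower_bound(lo, hi):
    """Valid interval lower bound on f over [lo, hi]:
    min of x^4 minus max of x^2, computed from the range of |x|."""
    m = 0.0 if lo <= 0.0 <= hi else min(abs(lo), abs(hi))
    M = max(abs(lo), abs(hi))
    return m**4 - M**2

def branch_and_bound(lo, hi, eps):
    """Minimize f over [lo, hi] to within eps; returns (value, iterations)."""
    best = f((lo + hi) / 2.0)
    heap = [(lower_bound(lo, hi), lo, hi)]
    iters = 0
    while heap:
        lb, a, b = heapq.heappop(heap)
        if best - lb <= eps:            # optimality gap closed: terminate
            break
        iters += 1
        mid = (a + b) / 2.0             # alpha-bisection with alpha = 0.5
        best = min(best, f(mid))
        for s, t in ((a, mid), (mid, b)):
            l = lower_bound(s, t)
            if l < best - eps:          # keep only intervals that could win
                heapq.heappush(heap, (l, s, t))
    return best, iters
```

Running it with ε = 0.1 and ε = 0.001 over the same initial interval shows the returned value within ε of the true minimum and the iteration count growing as the tolerance shrinks.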

5.7.2 Real Data

We have evaluated the performance on two publicly available data sets as well: the dinosaur and the corridor sequences. In Table 5.3, the reprojection errors are given for (1) triangulation of all 3D points given pre-computed camera motion and (2) resectioning of cameras given pre-computed 3D points. Both the mean error and the estimated standard deviation are given. There is no difference between the bundle adjustment and


Figure 5.7: (a) and (b) show trends for the mean and median number of iterations, respectively, over 200 trials, for termination of the triangulation algorithm as the optimality tolerance, ε, is varied from 0.001 to 0.1. (c) and (d) show the same for the resectioning experiment.

Table 5.3: Reprojection errors (in pixels) for triangulation and resectioning in the Dinosaur and Corridor data sets. "Dinosaur" has 36 turntable images with 324 tracked points, while "Corridor" has 11 images in forward motion with 737 points. A ∗ denotes triangulation experiments and a † denotes resectioning ones.

Experiment    Bundle         (L2,L2)        (L2,L1)        (L1,L1)
              Mean    Std    Mean    Std    Mean    Std    Mean    Std
Dino∗         0.30    0.14   0.30    0.14   0.18    0.09   0.22    0.11
Corridor∗     0.21    0.16   0.21    0.16   0.13    0.13   0.15    0.12
Dino†         0.33    0.04   0.33    0.04   0.34    0.04   0.34    0.04
Corridor†     0.28    0.05   0.28    0.05   0.28    0.05   0.28    0.05

Table 5.4: Number of branch and bound iterations for triangulation and resectioning on the Dinosaur and Corridor datasets. More parameters are estimated for resectioning, but the main reason for the difference in performance between triangulation and resectioning is that several hundred points are visible to each camera for the latter. A ∗ denotes triangulation experiments and a † denotes resectioning ones.

Experiment    (L2,L2)        (L2,L1)        (L1,L1)
              Mean    Std    Mean    Std    Mean    Std
Dino∗         1.2     1.5    1.0     0.2    6.7     3.4
Corridor∗     8.9     9.4    27.4    26.3   25.9    27.4
Dino†         49.8    40.1   84.4    53.4   54.9    42.9
Corridor†     39.9    2.9    49.2    20.6   47.9    7.9

the (L2,L2) method. Thus, for these particular sequences, bundle adjustment did not get trapped in any local optimum. The L1 methods also result in low reprojection errors as measured by the RMS criterion. More interesting, perhaps, are the number of iterations and execution times on a standard PC (3 GHz); see Tables 5.4 and 5.5, respectively. We must point out that the implementations are (unoptimized) MATLAB functions. The differences in iterations and runtimes are most likely due to the setup: the dinosaur sequence has a circular camera motion and thereby a more well-posed camera geometry compared to the forward-moving camera in the corridor sequence. In the case of triangulation, a point is typically visible in only a few frames, while several hundred points may be visible in each view for the resectioning experiments.

Table 5.5: Triangulation and resectioning runtimes (in seconds) for experiments on real datasets. A ∗ denotes triangulation experiments and a † denotes resectioning ones.

Experiment    Bundle          (L2,L2)          (L2,L1)           (L1,L1)
              Mean    Std     Mean     Std     Mean      Std     Mean     Std
Dino∗         1.0     0.4     5.5      4.5     12.1      4.0     17.0     9.9
Corridor∗     1.0     0.6     18.7     17.7    51.0      47.3    46.4     51.6
Dino†         4.0     3.0     273.1    192.3   640.0     554.1   312.8    304.9
Corridor†     38.3    15.7    1433.5   348.0   1271.6    608.1   1122.7   565.0

5.8 Discussions

In this chapter, we have demonstrated that several problems in multiview geometry can be formulated within the unified framework of fractional programming, in a form amenable to global optimization. A branch and bound algorithm is proposed that provably finds a solution arbitrarily close to the global optimum, with a fast convergence rate in practice. Besides minimizing reprojection error under Gaussian noise, our framework allows the incorporation of robust L1 norms, reducing sensitivity to outliers. Two improvements that exploit the underlying problem structure and are critical for expeditious convergence are branching in a small, constant number of dimensions and bounds propagation. It is inevitable that our solution times be compared with those of bundle adjustment, but we must point out that it is producing a certificate of optimality that forms the most significant portion of our algorithm's runtime. In fact, it is our empirical observation that the optimal point ultimately reported by the branch and bound is usually obtained within the first few iterations. A distinction must also be made between the accuracy of a solution and the optimality guarantee associated with it. An optimality criterion of, say, ε = 0.05, is only a worst case bound and does not necessarily mean a solution 5% away from optimal. Indeed, as evidenced by our experiments, our solutions consistently equal or better those of bundle adjustment in accuracy. Thus, from a practitioner's viewpoint, it is useful to set a looser criterion for global optimality and use gradient descent in the neighborhood of the resulting solution.
Needless to say, other segments of the computer vision community can also benefit from our approach, as it is general enough to be applicable to any problem formulated as a fractional program in a few independent dimensions. Another avenue for potential future work is the exploration of other algorithms for achieving global optimality in specialized fractional programs. As faster and more reliable algorithms are designed for achieving global optimality in fractional programs, we can anticipate corresponding improvements in our solution times. The most significant portions of this chapter are based on "Practical Global Optimization for Multiview Geometry", by F. Kahl, S. Agarwal, M. K. Chandraker, D. J. Kriegman and S. Belongie, as it appears in (Kahl et al., 2008) and (Agarwal et al., 2006).

Chapter 6

Stratified Autocalibration

“To see a world in a grain of sand And heaven in a wild flower Hold infinity in the palms of your hand And eternity in an hour.”

William Blake (English poet, 1757-1827 AD), Auguries of Innocence

In this chapter, we seek to extend the global optimization framework of Chapter 5 to tackle the more difficult problem of autocalibration. Recall from Section 1.2.4 that autocalibration seeks to estimate the plane at infinity and the dual image of the absolute conic in order to upgrade a projective reconstruction to a metric one. As outlined in Section 3.4, a stratified approach to autocalibration first estimates the plane at infinity to take the projective reconstruction to an affine one; subsequently, estimating the DIAC results in a metric upgrade. This chapter proposes practical algorithms that provably solve well-known formulations of both the affine and metric upgrade stages of stratified autocalibration to their global optimum. At this stage, we suggest a brief perusal of Sections 2.3 and 2.5 to a reader who wishes to get acquainted with the projective geometry background required for this chapter.


6.1 Introduction

Given n feature correspondences across m views of a scene, it is well-known that a projective reconstruction may be computed that differs from the true scene by an arbitrary 4 × 4 linear transformation, or homography (Faugeras, 1992; Hartley et al., 1992). A projective reconstruction may be upgraded to a metric one, which differs from the true scene by a similarity transformation, using a priori knowledge of a few scene characteristics, such as the angles between a few 3D lines. An alternative approach to estimate the 4 × 4 transformation that restores the metric scene is through simple assumptions on the internal parameters of the cameras used to image the scene, such as their constancy across different views, or the rectangularity of the image pixels. The latter approach is called autocalibration, or camera self-calibration, which forms the subject of this chapter. A typical approach to calibrating a camera involves using several images of a known calibration grid. Once a correspondence can be ascertained between scene points (or higher order features like curves) and their counterparts on the image plane, it is straightforward to recover the camera parameters. The term autocalibration stems from its key premise that it obviates the requirement for an explicit calibration grid. Instead, it tries to locate the image of the so-called absolute conic, an imaginary object on the plane at infinity whose location stays fixed in any metric reconstruction. Its image can be shown to be related to the internal parameters of the camera, so locating the image of the absolute conic is equivalent to calibrating the camera. The method of autocalibration presented in this chapter is a stratified one, whose first step upgrades the projective reconstruction to an affine one, while the next step performs the upgrade to a metric reconstruction.
An affine reconstruction restores certain aspects of the scene, such as parallelism between 3D lines, while the metric reconstruction restores characteristics such as exact angles and length ratios. The affine upgrade, which is arguably the more difficult step in stratified autocalibration, is succinctly computable by estimating the position of the plane at infinity in a projective reconstruction, for instance, by solving the modulus constraints (Pollefeys and van Gool, 1999). Previous approaches to minimizing the modulus constraints for several views rely on local, gradient-based methods with random reinitializations. These methods are not guaranteed to perform well for such non-convex problems. Moreover, in our experience, a highly accurate estimate of the plane at infinity is imperative to obtain a usable metric reconstruction. The metric upgrade step involves estimating the intrinsic parameters of the cameras, which is commonly approached by estimating the dual image of the absolute conic (DIAC). A variety of linear methods exist towards this end; however, they are known to perform poorly in the presence of noise (Hartley and Zisserman, 2004). Perhaps more significantly, most methods impose the positive semi-definiteness of the DIAC a posteriori, which might lead to a spurious calibration. Thus, it is important to impose the positive semidefiniteness of the DIAC within the optimization, not as a post-processing step.
This chapter proposes global minimization algorithms for both stages of stratified autocalibration that furnish theoretical certificates of optimality. That is, they return a solution at most ε away from the global minimum, for arbitrarily small ε. Our solution approach relies on constructing efficiently minimizable, tight convex relaxations to non-convex programs and using them in a branch and bound framework (Horst and Tuy, 2006; Tawarmalani and Sahinidis, 2002). A significant drawback of local methods is that they critically depend on the quality of a heuristic initialization. To be considered truly optimal, an algorithm must converge to the global optimum regardless of the choice of initialization. Branch and bound methods require a demarcated region of the search space as initialization. Arbitrarily choosing a small initial region might compromise optimality, since the true solution might lie outside that chosen region. On the other hand, choosing a very large region might lead to a ponderous convergence rate for the branch and bound algorithm. In this chapter, we use chirality constraints derived from the scene to compute a theoretically correct initial search space for the plane at infinity, within which we are guaranteed to find the global minimum (Hartley, 1998a; Hartley et al., 1999). In practice, for a moderate number of cameras, the initial region determined by the chirality constraints is tight enough to allow rapid convergence of the search algorithm. Our initial region for the metric upgrade is intuitively specifiable as conditions on the intrinsic parameters of the camera and can be wide enough to include any practical case. A crucial concern in branch and bound algorithms is the exponential dependence of the worst case time complexity on the number of branching dimensions. The number of branching dimensions in most computer vision problems scales with the number of points and views, which can quickly translate into an impractical branch and bound search. In this chapter, we exploit the inherent problem structure of autocalibration to restrict our branching dimensions to a small, fixed number, independent of the number of views. In our experiments, this allows the runtime of the algorithms proposed in this chapter to scale gracefully with the number of views. In summary, the main contributions of this chapter are the following:

• Highly accurate recovery of the plane at infinity in a projective reconstruction by global minimization of the modulus constraints.

• Highly accurate estimation of the DIAC by globally solving the infinite homography relation.

• A general exposition on novel convexification methods for global optimization of non-convex programs.

The outline of the rest of this chapter is as follows. Section 6.2 describes background relevant to autocalibration and Section 6.3 outlines the related prior work. Section 6.4.1 describes the general strategy that we employ for constructing the convex relaxation of a non-convex function, while Sections 6.5 and 6.6 describe our global optimization algorithms for estimating the plane at infinity and the DIAC, respectively. Section 6.7 presents experiments on synthetic and real data and Section 6.8 concludes with a discussion of further extensions.

6.2 Background

As is our convention, unless stated otherwise, we will denote 3D world points X by homogeneous 4-vectors and 2D image points x by homogeneous 3-vectors. Recall that, given the images of n points in m views, a projective reconstruction {P, X} computes the Euclidean scene {P̂, X̂} up to a 4 × 4 homography:

    Pi = P̂i H⁻¹,   i = 1, . . . , m
    Xj = H X̂j,    j = 1, . . . , n        (6.1)

where P and P̂ denote the 3 × 4 projective and Euclidean camera matrices, respectively. Given a projective reconstruction, autocalibration seeks to estimate the best homography H that upgrades the reconstruction to a metric one. A more detailed discussion of the material in this section can be found in (Hartley and Zisserman, 2004).
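The ambiguity expressed by (6.1) is easy to verify numerically: replacing (P̂, X̂) with (P̂H⁻¹, HX̂) leaves every image point unchanged, since Pi Xj = P̂i H⁻¹ H X̂j = P̂i X̂j. The sketch below checks this for one camera, one point, and an arbitrary invertible H (all numeric values are made up for illustration):

```python
def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# A Euclidean camera P_hat (3x4) and a homogeneous scene point X_hat.
P_hat = [[800.0, 0.0, 320.0, 10.0],
         [0.0, 800.0, 240.0, -5.0],
         [0.0, 0.0, 1.0, 2.0]]
X_hat = [[1.0], [2.0], [5.0], [1.0]]

# An arbitrary invertible 4x4 homography H and its inverse.
H = [[1.0, 0.0, 0.0, 3.0],
     [0.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, -1.0],
     [0.0, 0.0, 0.0, 1.0]]
H_inv = [[1.0, 0.0, 0.0, -3.0],
         [0.0, 0.5, 0.0, 0.0],
         [0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 0.0, 1.0]]

# The projectively distorted reconstruction: P = P_hat H^-1, X = H X_hat.
P = matmul(P_hat, H_inv)
X = matmul(H, X_hat)

x_true = matmul(P_hat, X_hat)   # image point from the Euclidean pair
x_proj = matmul(P, X)           # image point from the distorted pair
```

The two image points agree exactly, which is precisely why image measurements alone cannot distinguish the true scene from its projectively distorted versions.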

6.2.1 The Infinite Homography Relation

The Euclidean camera is parametrized as P̂ = K[R | t], where the 3 × 3 rotation matrix R and the 3 × 1 translation vector t constitute the exterior orientation, while the 3 × 3 upper-triangular matrix K encodes the intrinsic parameters of the camera. We can always perform the projective reconstruction such that P1 = [I | 0]. Let the world coordinate system be aligned with the first (world) camera, i.e. P̂1 = K1 [I | 0]. Let the homography H that we seek to estimate have the form

    H = ( A    t )
        ( v⊤   1 )        (6.2)

Then, since P̂1 = P1 H, we have

    K1 [ I | 0 ] = [ I | 0 ] ( A    t ) = [ A | t ]   ⇒   H = ( K1   0 ).        (6.3)
                             ( v⊤   1 )                       ( v⊤   1 )

Further, let the plane at infinity in the given projective reconstruction be π∞ = (p⊤, 1)⊤. Then, since the plane at infinity is moved out of its canonical position in the metric reconstruction by H,

    ( p )  =  H⁻⊤ ( 0 )  =  ( K1⁻⊤   −K1⁻⊤v ) ( 0 )  =  ( −K1⁻⊤v )
    ( 1 )         ( 1 )     ( 0⊤         1  ) ( 1 )     (    1   )

It follows that v = −K1⊤ p, so H must have the form

    H = (   K1     0 )        (6.4)
        ( −p⊤K1    1 )

This is consistent with the notion that the aim of autocalibration is to recover the plane at infinity and the intrinsic parameters.

Let the cameras in the projective reconstruction be of the form Pi = [Ai | ai], where Ai is 3 × 3 and ai is a 3 × 1 vector. Further, let the metric cameras be P̂i = Ki [Ri | ti], where Ri is a 3 × 3 rotation matrix. Then, since P̂i = Pi H, given the form of H in (6.4), we have

    Ki [ Ri | ti ] = [ Ai | ai ] (   K1     0 )
                                 ( −p⊤K1    1 )

    ⇒   Ki Ri = (Ai − ai p⊤) K1.        (6.5)

By post-multiplying both sides of (6.5) with their transposes, we can eliminate the rotations Ri to obtain

    Ki Ki⊤ = (Ai − ai p⊤) K1 K1⊤ (Ai − ai p⊤)⊤,        (6.6)

which can be rewritten as

    ωi∗ = (Ai − ai p⊤) ω1∗ (Ai − ai p⊤)⊤,        (6.7)

since, by definition, the dual image of the absolute conic is ω∗ = KK⊤. This is one of the basic relations for autocalibration, whose aim can now be restated as estimating the plane at infinity and the DIAC. It is important to note that (6.7) relates projective entities, so the equality holds only up to scale.

The Infinite Homography

Since (6.7) is central to the problem of autocalibration, we will digress slightly to characterize it further. First, we need an auxiliary result:

Theorem 10. Given projective cameras P = [ I | 0 ] and P′ = [ A | a ], the homography x′ = Hx induced from the first view to the second through any plane π = (v⊤, 1)⊤ is given by

    H = A − a v⊤.        (6.8)

Figure 6.1: Any plane in the scene induces a homography between two projective cameras.

Proof. Referring to Figure 6.1, let π = (v⊤, 1)⊤ be any plane and Xπ be the point where the back-projected ray from point x in the first image intersects the plane π. Then,

    x = [ I | 0 ] Xπ   ⇒   Xπ = (x⊤, λ)⊤.

Since Xπ lies on the plane π,

    π⊤ Xπ = 0   ⇒   v⊤x + λ = 0   ⇒   Xπ = (x⊤, −v⊤x)⊤.

Let x′ be the image of Xπ in the second view. Then,

    x′ = P′ Xπ = [ A | a ] (   x   )  =  (A − a v⊤) x,
                           ( −v⊤x )

which means that for every point x in the first image, there exists a corresponding point x′ related by the planar homography H = A − a v⊤.

The homography H above is called the homography from the first view to the second induced via the plane π.
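The construction in the proof can be checked numerically: pick an arbitrary second camera [A | a] and plane (v⊤, 1)⊤, back-project an image point x onto the plane, and confirm that its image in the second view equals (A − av⊤)x. All numeric values below are arbitrary, chosen only for illustration:

```python
# Second camera P' = [A | a] and a plane pi = (v^T, 1)^T (arbitrary values).
A = [[1.0, 0.1, 0.0],
     [0.0, 1.0, 0.2],
     [0.1, 0.0, 1.0]]
a = [0.5, -0.3, 0.2]
v = [0.2, -0.1, 0.4]

# Homography induced by the plane, equation (6.8): H = A - a v^T.
H = [[A[i][j] - a[i] * v[j] for j in range(3)] for i in range(3)]

# A point x in the first image [I | 0] back-projects to X_pi = (x^T, -v^T x)^T
# on the plane, as derived in the proof.
x = [1.0, 2.0, 1.0]
lam = -sum(vi * xi for vi, xi in zip(v, x))
X_pi = x + [lam]

# Its image in the second view must equal H x (here even the scale matches,
# since no normalization has been applied).
x_second = [sum(A[i][j] * X_pi[j] for j in range(3)) + a[i] * X_pi[3]
            for i in range(3)]
x_mapped = [sum(H[i][j] * x[j] for j in range(3)) for i in range(3)]
```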

Returning to equation (6.7), it is clear that Ai − ai p⊤ is the homography induced from the first view to the i-th via the plane at infinity π∞ = (p⊤, 1)⊤. We denote

    H∞ = A − a p⊤,        (6.9)

where H∞ is called the infinite homography. The autocalibration relation (6.7) can now be written as

    ωi∗ = H∞^i ω1∗ (H∞^i)⊤        (6.10)

and will henceforth be referred to as the infinite homography relation.
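Equation (6.10) can be sanity-checked numerically for the constant-intrinsics case: with H∞ = KRK⁻¹ and ω∗ = KK⊤, we have H∞ ω∗ H∞⊤ = KRK⁻¹ KK⊤ K⁻⊤R⊤K⊤ = KK⊤ exactly. The sketch below uses arbitrary made-up intrinsics and a rotation about the z-axis:

```python
import math

def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

# Shared intrinsics K (upper triangular) and a rotation R about the z-axis;
# all numeric values are arbitrary, for illustration only.
f, u0, v0, th = 700.0, 320.0, 240.0, 0.3
K = [[f, 0.0, u0], [0.0, f, v0], [0.0, 0.0, 1.0]]
K_inv = [[1.0 / f, 0.0, -u0 / f], [0.0, 1.0 / f, -v0 / f], [0.0, 0.0, 1.0]]
c, s = math.cos(th), math.sin(th)
R = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Infinite homography (constant intrinsics, unit scale) and the DIAC.
H_inf = matmul(matmul(K, R), K_inv)
omega = matmul(K, transpose(K))

# The infinite homography relation (6.10): H_inf omega H_inf^T == omega.
lhs = matmul(matmul(H_inf, omega), transpose(H_inf))
```

In a real projective reconstruction the relation only holds up to scale, which is why the algorithms in this chapter must handle the unknown scale factors explicitly.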

6.2.2 Modulus Constraints

From equation (6.5), it follows that the infinite homography has the form

    H∞^j = Kj Rj K1⁻¹        (6.11)

where the equality holds up to a constant scale factor. Let us assume that all the cameras P̂j have the same internal parameters, that is, Kj = K, for j = 1, . . . , m. Then, equation (6.11) becomes

    H∞^j = μj K Rj K⁻¹,        (6.12)

where we have introduced an explicit scale factor μj to make the equality exact. Clearly, the infinite homography is conjugate (or similar) to the rotation matrix Rj. It follows that the eigenvalues of H∞^j must be {μj, μj e^{iθj}, μj e^{−iθj}}. In particular, the infinite homography has the property that its eigenvalues have equal moduli. Remarkably, this gives us enough leverage to be able to estimate the plane at infinity.

The characteristic polynomial of the infinite homography, det(H∞^j − λI), is a degree three polynomial in λ. Let its roots be λ1, λ2 and λ3. Then,

    det(H∞^j − λI) = (λ − λ1)(λ − λ2)(λ − λ3)
                   = λ³ − αj λ² + βj λ − γj        (6.13)

where it can be easily seen that

    αj = λ1 + λ2 + λ3         = μj (1 + 2 cos θj)
    βj = λ1λ2 + λ2λ3 + λ3λ1   = μj² (1 + 2 cos θj)
    γj = λ1λ2λ3               = μj³.        (6.14)

From the above, it follows that

    γj αj³ = βj³        (6.15)

which gives one constraint for each view, the so-called modulus constraint. Now, considering the determinant det(H∞^j − λI) and noting from (6.9) that the dependence of the infinite homography on the plane at infinity is through a rank 1 term, it is evident that αj, βj and γj must all be affine in the coordinates p1, p2 and p3 of the plane at infinity π∞ = (p, 1)⊤. It follows that the modulus constraint is a quartic polynomial in the coordinates of the plane at infinity.
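The modulus constraint (6.15) is easy to verify numerically: build H∞ = μKRK⁻¹ for some arbitrary μ, K and R, read off α, β and γ as the trace, the sum of principal 2 × 2 minors and the determinant of H∞ (the similarity invariants equal to the elementary symmetric functions of the eigenvalues), and check γα³ = β³. All numeric values below are made up for illustration:

```python
import math

def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# H = mu * K R K^{-1}, the exact form (6.12); arbitrary mu, angle, intrinsics.
mu, th = 2.0, 0.4
f, u0, v0 = 700.0, 320.0, 240.0
K = [[f, 0.0, u0], [0.0, f, v0], [0.0, 0.0, 1.0]]
K_inv = [[1.0 / f, 0.0, -u0 / f], [0.0, 1.0 / f, -v0 / f], [0.0, 0.0, 1.0]]
c, s = math.cos(th), math.sin(th)
R = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
H = [[mu * e for e in row] for row in matmul(matmul(K, R), K_inv)]

# Coefficients of the characteristic polynomial (6.13):
alpha = H[0][0] + H[1][1] + H[2][2]              # trace: sum of eigenvalues
beta = (H[0][0] * H[1][1] - H[0][1] * H[1][0]    # sum of principal 2x2 minors
      + H[0][0] * H[2][2] - H[0][2] * H[2][0]
      + H[1][1] * H[2][2] - H[1][2] * H[2][1])
gamma = (H[0][0] * (H[1][1] * H[2][2] - H[1][2] * H[2][1])   # determinant
       - H[0][1] * (H[1][0] * H[2][2] - H[1][2] * H[2][0])
       + H[0][2] * (H[1][0] * H[2][1] - H[1][1] * H[2][0]))
```

Since conjugation by K leaves these invariants unchanged, α = μ(1 + 2 cos θ), β = μ²(1 + 2 cos θ) and γ = μ³, so γα³ = β³ holds to numerical precision.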

Three views can provide three such degree 4 polynomials in p1, p2, p3, which suffices to restrict the solution space to 4³ = 64 possibilities. However, for the case of three views, an additional cubic equation is available from the modulus constraints, which can be used to eliminate several spurious solutions and restrict the solution to 21 possibilities (Schaffalitzky, 2000). In practice, there may be several views available, so the modulus constraints can be solved in a least squares sense.

6.2.3 Chirality Bounds on Plane at Infinity

Chirality constraints demand that the reconstructed scene points lie in front of the camera. While a general projective transformation may result in the plane at infinity splitting the scene, a quasi-affine transformation is one that preserves the convex hull of the scene points X and camera centers C. A transformation Hq that upgrades a projective reconstruction to quasi-affine can be computed by solving the so-called chiral inequalities.

A subsequent affine centering, Ha, guarantees that the plane at infinity in the centered quasi-affine frame, v = (Ha Hq)⁻⊤ π∞, cannot pass through the origin. So it can be parametrized as (v1, v2, v3, 1)⊤, and bounds on vi in the centered quasi-affine frame can be computed by solving six linear programs:

    min / max  vi
    subject to  (Xj^q)⊤ v > 0,   j = 1, . . . , n        i = 1, 2, 3,        (6.16)
                (Ck^q)⊤ v > 0,   k = 1, . . . , m

where Xj^q and Ck^q are the points and camera centers in the quasi-affine frame. We refer the reader to (Hartley, 1998a; Nistér, 2004) for a thorough treatment of the subject.
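The six linear programs in (6.16) are straightforward to assemble: each chirality constraint X⊤v > 0 with v = (v1, v2, v3, 1)⊤ contributes one linear inequality in the unknowns (v1, v2, v3). The helper below (our own illustrative function, with made-up data in the test) builds the constraint matrix and the six objective vectors; the output can then be handed to any LP solver, e.g. `scipy.optimize.linprog`:

```python
def chirality_lp_data(points_q, centers_q):
    """Assemble the six LPs of (6.16) bounding the plane at infinity.

    points_q / centers_q are homogeneous 4-vectors in the centered
    quasi-affine frame.  Each constraint (x, y, z, w)^T v > 0 with
    v = (v1, v2, v3, 1) becomes -(x*v1 + y*v2 + z*v3) < w, i.e. a row of
    A v' <= b (up to strictness) in the unknowns v' = (v1, v2, v3).
    Returns (A, b, objectives): minimizing each objective over
    {v' : A v' <= b} yields the min/max of each coordinate of v.
    """
    rows = points_q + centers_q
    A = [[-x, -y, -z] for (x, y, z, w) in rows]
    b = [w for (x, y, z, w) in rows]
    # six objectives: +/- e_i for i = 1, 2, 3 (minimizing -e_i maximizes v_i)
    objectives = [[float(i == j) * sign for j in range(3)]
                  for i in range(3) for sign in (1.0, -1.0)]
    return A, b, objectives
```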

6.2.4 Need for Global Optimization

Before we venture further, the reader might wonder about the benefits of global optimization, as opposed to a local gradient-based technique, for solving the autocalibration problem. To motivate the need for global optimization, we perform a small experiment similar to (Hartley et al., 1999). For a given scene, bounds on the location of the plane at infinity in a centered quasi-affine frame are computed using chirality constraints. The plane at infinity is parameterized as π∞ = (p1, p2, p3, 1)⊤. Within the bounds on p1, p2 and p3, for each putative location of the plane at infinity, the infinite homography in each view is computed. Subsequently, the infinite homography relation is minimized to compute the DIAC. The minimum of the infinite homography relation for each putative plane at infinity is used to populate a three-dimensional cost cube, one of whose slices is shown in Figure 6.2. A similar example is also shown in Figure 1.13.


Figure 6.2: Various views and a contour plot of a 2D slice of the 3D cube of infinite homography relation minima as a function of position of plane at infinity. Blue denotes a low cost and red denotes a high cost.

The above example indicates that the margin of error for estimating the plane at infinity is not very high, since the cost surface terrain is quite rugged around the optimum. This is consistent with the adage that estimating the plane at infinity is the most difficult step in uncalibrated reconstruction (Hartley et al., 1999). Now, consider an alternative, more direct approach to autocalibration that tries to bound the plane at infinity and then minimizes the cost function corresponding to Figure 6.2, thereby simultaneously estimating both the plane at infinity and the DIAC. However, the multiple low basins in the search space indicate that a traditional gradient-based local minimization algorithm is bound to get stuck in local minima. Moreover, the initialization needs to be very accurate for the local minimization algorithm to reach the global optimum, which is sharply cloistered within high cliffs on all sides. It is evident that the complex nature of the autocalibration problem merits a more powerful, global optimization approach. The position marked as global minimum in Figure 6.2 corresponds to the plane at infinity estimated using the globally optimal algorithm described in Section 6.5.

6.3 Previous Work

Approaches to autocalibration (Faugeras et al., 1992) can be broadly classified as direct and stratified. Direct methods seek to compute a metric reconstruction by estimating the absolute conic. This is encoded conveniently in the dual quadric formulation of autocalibration (Heyden and Åström, 1996; Triggs, 1997), whereby an eigenvalue decomposition of the estimated dual quadric yields the homography that relates the projective reconstruction to the Euclidean one. Linear methods (Pollefeys et al., 1998) as well as more elaborate SQP-based optimization approaches (Triggs, 1997) have been proposed to estimate the dual quadric, but perform poorly with noisy data. Methods such as (Manning and Dyer, 2001), which are based on the Kruppa equations (or the fundamental matrix), are known to suffer from additional ambiguities (Sturm, 2000). This work primarily deals with a stratified approach to autocalibration (Pollefeys and van Gool, 1999). It is well-established in the literature that, in the absence of prior information about the scene, estimating the plane at infinity represents the most significant challenge in autocalibration (Hartley et al., 1999). The modulus constraints (Pollefeys and van Gool, 1999) are a necessary condition on the coordinates of the plane at infinity. Local techniques are used in (Pollefeys and van Gool, 1999) to estimate the coordinates of the plane at infinity by minimizing a noisy overdetermined system in the multiview case. An alternate approach to estimating the plane at infinity exploits the chirality constraints. The algorithm in (Hartley et al., 1999) computes bounds on the plane at infinity and a brute-force search is used to recover π_∞ within this region. It is argued in (Nistér, 2004) that it might be advantageous to use camera centers alone when using chirality constraints. Several linear methods exist for estimating the DIAC (Hartley and Zisserman, 2004) for the metric upgrade, but they do not enforce its positive semidefiniteness.
The only work the authors are aware of which explicitly deals with this issue is (Agrawal, 2004), which is formulated under the assumption of a known principal point and zero skew.

The interested reader is referred to (Hartley and Zisserman, 2004) and the references therein for a more detailed overview of the literature relevant to autocalibration.

Of late, there has been significant activity towards developing globally optimal algorithms for various problems in computer vision. The theory of convex linear matrix inequality (LMI) relaxations (Lasserre, 2001) is used in (Kahl and Henrion, 2005) to find global solutions to several optimization problems in multiview geometry, while (Chandraker et al., 2007a) discusses a direct method for autocalibration using the same techniques. Triangulation and resectioning are solved with a certificate of optimality using convex relaxation techniques for fractional programs in (Agarwal et al., 2006). Several geometric problems in computer vision, when posed in the L∞-norm, can be solved to their global optimum using techniques of quasi-convex optimization (Kahl, 2005; Sim and Hartley, 2006a; Agarwal et al., 2008). A survey of recent work in developing optimal algorithms for multiview geometry can be found in (Hartley and Kahl, 2007b).

An interval analysis based branch and bound method for autocalibration is proposed in (Fusiello et al., 2004); however, the fundamental matrix based formulation does not scale well beyond a small number of views. Gröbner basis methods have been used to achieve optimal solutions for several geometric reconstruction problems, such as triangulation (Stewénius et al., 2005), but they too do not scale beyond a small number of views. Branch and bound as a solution paradigm has been used for a diverse range of applications in computer vision, such as feature selection (Zongker and Jain, 1996), geometric matching (Breuel, 2002), image segmentation (Gat, 2003), contour tracking (Freedman, 2003), object localization (Lampert et al., 2008) and so on. A branch and bound method for Euclidean registration problems is presented in (Olsson et al., 2009b).

6.4 The Branch and Bound Framework

In this section, we provide a general outline of the branch and bound framework that will be employed for stratified autocalibration. Consider a multivariate, non-convex, scalar-valued objective function f(x), for which we seek a global minimum over a rectangle Q_0. Branch and bound algorithms require an auxiliary function flb(Q) which, for every region Q ⊆ Q_0, satisfies two properties:

(L1) The value of flb(Q) is always less than or equal to the minimum value fmin(Q) of f(x) for all x ∈ Q.

(L2) Let |Q| denote the size of a rectangle, which in our case is the length of the longest edge. Then the relaxation gap f(x) − flb(x) monotonically decreases as a function of |Q|.

Note that while (L1) is a basic stipulation for a convex underestimator, (L2) is a Cauchy continuity requirement specific to branch and bound algorithms. Indeed, several popular convex underestimators such as linear matrix inequality (LMI) relaxations (Lasserre, 2001) and sum-of-squares relaxations for polynomial systems (Prajna et al., 2002) do not satisfy this requirement, which renders them unsuitable for our purposes.

Computing the value of flb(Q) is referred to as bounding, while choosing and subdividing a rectangle is called branching. As before, we consider the rectangle with the smallest minimum of flb as the most promising to contain the global minimum and subdivide it into two rectangles along the largest dimension. A key consideration when designing bounding functions is the ease with which they can be computed. So, it is desirable to design flb(Q) as the solution of a convex optimization problem for which efficient solvers exist (Boyd and Vandenberghe, 2004). In the following sections, we present branch and bound algorithms based on such constructions.
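The bounding/branching loop above is generic. As a minimal sketch (not the implementation used in this dissertation, which is MATLAB with an SDP solver), the following Python prototypes best-first branch and bound on a toy 1-D function, using a Lipschitz-constant lower bound, which satisfies both (L1) and (L2):

```python
import heapq
import math

def branch_and_bound(f, f_lb, box0, eps=1e-4, max_iter=20000):
    """Best-first branch and bound over axis-aligned boxes.
    f_lb(box) returns (lower_bound, point): lower_bound satisfies (L1),
    and f(point) provides a feasible upper bound on the minimum."""
    lb0, x0 = f_lb(box0)
    best_val, best_x = f(x0), x0
    heap = [(lb0, box0)]
    while heap and max_iter > 0:
        max_iter -= 1
        lb, box = heapq.heappop(heap)       # most promising rectangle first
        if best_val - lb <= eps:            # certificate of eps-optimality
            break
        # branch: bisect along the longest edge of the rectangle
        k = max(range(len(box)), key=lambda i: box[i][1] - box[i][0])
        lo, hi = box[k]
        mid = 0.5 * (lo + hi)
        for child in (box[:k] + [(lo, mid)] + box[k + 1:],
                      box[:k] + [(mid, hi)] + box[k + 1:]):
            clb, cx = f_lb(child)           # bound
            cval = f(cx)
            if cval < best_val:
                best_val, best_x = cval, cx
            if clb < best_val - eps:        # prune boxes that cannot improve
                heapq.heappush(heap, (clb, child))
    return best_val, best_x

# Toy 1-D example: the bound f(c) - L*r satisfies (L1), and its gap L*r
# shrinks with the box size, satisfying (L2).
def f(x):
    return math.sin(3 * x[0]) + 0.1 * x[0] ** 2

def f_lb(box):
    lo, hi = box[0]
    c = 0.5 * (lo + hi)
    L = 3.0 + 0.2 * max(abs(lo), abs(hi))   # |f'(x)| <= 3 + 0.2|x| on the box
    return f([c]) - L * 0.5 * (hi - lo), [c]

val, x = branch_and_bound(f, f_lb, [(-3.0, 3.0)])
```

The function names and the Lipschitz bound are illustrative choices; the chapters that follow replace f_lb by the solution of a convex program.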

6.4.1 Constructing Convex Relaxations

In this section, we will outline our general strategy for the construction of a convex underestimator for an arbitrary non-convex function. This strategy will be employed to underestimate the objective functions that arise in both the affine and metric upgrade stages of autocalibration. Let us consider the following unconstrained, non-linear least squares problem:

    min_x  Σ_{i=1}^m (f_i(x) − µ_i)²                                   (6.17)

where µ_i ∈ R and x ∈ R^k, k ≥ 1. Then, an equivalent constrained optimization problem is:

    min_{x, s_i}  Σ_i (s_i − µ_i)²

    subject to  s_i = f_i(x).                                          (6.18)

Suppose we can construct a convex underestimator conv(f_i) and a concave overestimator conc(f_i) for the function f_i(x). Then, the following convex optimization problem minimizes the same objective as (6.18), but with a "relaxed" constraint set:

    min_{x, s_i}  Σ_i (s_i − µ_i)²

    subject to  conv(f_i) ≤ s_i ≤ conc(f_i).                           (6.19)

Consequently, the minimum attained by problem (6.19) will always be at least as low as the minimum attained by (6.18). In effect, we have constructed a convex problem whose minimum always underestimates the minimum of the non-convex problem we wished to optimize. The solution to (6.19) corresponds to the construction of the lower bounding function flb discussed in Section 4.3. An intuitive illustration of the procedure is depicted for a 1-D function in Figure 6.3. While the variable s is allowed to attain values only on the graph of the function f(x) in the original problem (6.18), it can attain any value within the larger region between the convex and concave relaxations in the relaxed problem (6.19).
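To make the relaxation concrete, here is an illustrative construction (not the one in Appendix B) for the concave function x^{1/3} on a positive interval: the secant through the endpoints is a valid convex under-estimator, and the pointwise minimum of a few tangent lines is a piecewise-linear concave over-estimator. The interval [a, b] below is arbitrary.

```python
def cbrt(x):
    return x ** (1.0 / 3.0)

def secant_under(a, b):
    """Convex under-estimator of the concave x**(1/3) on [a, b], a > 0:
    the chord lies below the graph of a concave function."""
    fa, fb = cbrt(a), cbrt(b)
    return lambda x: fa + (fb - fa) * (x - a) / (b - a)

def tangent_over(a, b, n=4):
    """Concave over-estimator: pointwise minimum of n tangent lines,
    each of which lies above the graph of a concave function."""
    ts = [a + (b - a) * i / (n - 1) for i in range(n)]
    return lambda x: min(cbrt(t) + (x - t) / (3.0 * cbrt(t) ** 2) for t in ts)

a, b = 0.5, 8.0
under, over = secant_under(a, b), tangent_over(a, b)
# in the relaxed problem (6.19), s may roam anywhere in under(x) <= s <= over(x)
```

Both estimators are piecewise linear, so the constraint under(x) ≤ s ≤ over(x) is expressible by a small set of linear inequalities, as exploited later in the chapter.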

(a) (b) (c)

Figure 6.3: (a) The objective function of the non-linear least squares problem min (f(x) − µ)² is linearized by replacing the non-linear function f(x) by a scalar variable s and introducing an equality constraint s = f(x). (b) Convex and concave relaxations are constructed for the function f(x). For the functions encountered in this chapter, Appendix B demonstrates the construction of tight, piecewise linear relaxations. (c) s is now allowed to attain values in a relaxed region between the convex under-estimator and the concave over-estimator.

The convex relaxations that are used in a branch and bound framework must satisfy conditions (L1) and (L2) specified above. Appendix B.4 proves the same for the convex relaxations constructed in this chapter.

6.5 Global Estimation of Plane at Infinity

6.5.1 Traditional Solution

Given exactly three views, the modulus constraints of (6.15) correspond to a system of three quartic polynomials in three variables, whose 64 roots may be found, typically using continuation methods. Also for the three-view case, an additional cubic equation available from the modulus constraints (Schaffalitzky, 2000) can be used to eliminate several spurious solutions, reducing the number of possible solutions to 21. When more than three views are present, the modulus constraints from all the views may be used in a least squares framework for greater accuracy and robustness:

    min_{p_1, p_2, p_3}  Σ_{i=1}^m (γ_i α_i³ − β_i³)².                 (6.20)

A gradient-based optimization routine, such as Levenberg-Marquardt (Levenberg, 1944; Marquardt, 1963), may be used to obtain a locally optimal solution to the above problem. In (Pollefeys and van Gool, 1999), several random initializations for the Levenberg-Marquardt algorithm are used to enhance the chances of converging to the global optimum.
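To mimic this traditional multi-start approach, the sketch below (with made-up affine coefficient matrices A, B, C and a hypothetical ground-truth p*, not real view data) builds modulus-constraint-style residuals γ_j α_j³ − β_j³ that vanish at p*, and runs a least squares solver from many random starting points, keeping the best local minimum:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
p_true = np.array([0.3, -0.2, 0.5])   # hypothetical ground-truth (p1, p2, p3)
m = 6                                 # number of views

def aff(M, p):                        # affine expression M[:, :3] @ p + M[:, 3]
    return M[:, :3] @ p + M[:, 3]

A = rng.normal(size=(m, 4))           # made-up coefficients for alpha_j
B = rng.normal(size=(m, 4))           # ... for beta_j
C = rng.normal(size=(m, 4))           # ... for gamma_j
C[:, 3] += 2.0 - aff(C, p_true)       # force gamma_j(p_true) = 2 > 0
B[:, 3] += np.cbrt(aff(C, p_true)) * aff(A, p_true) - aff(B, p_true)
# each residual gamma_j * alpha_j^3 - beta_j^3 now vanishes exactly at p_true

def residuals(p):
    a, b, c = aff(A, p), aff(B, p), aff(C, p)
    return c * a ** 3 - b ** 3        # quartic residuals, as in (6.20)

starts = [rng.uniform(-1.5, 1.5, size=3) for _ in range(100)]
best = min((least_squares(residuals, s0) for s0 in starts),
           key=lambda res: res.cost)
```

Many starts land in spurious basins of this degree-8 cost; only some reach the zero-residual solution, which is precisely the failure mode that motivates the globally optimal method of this chapter.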

6.5.2 Problem Formulation

Note that the cost function in (6.20) is a polynomial, and some recent work in computer vision (Kahl and Henrion, 2005; Chandraker et al., 2007a) exploits convex linear matrix inequality (LMI) relaxations to achieve global optimality in polynomial programs. However, this is a degree 8 polynomial in three variables, which is far beyond what present-day solvers can handle (Henrion and Lasserre, 2003; Prajna et al., 2002). We instead consider the equivalent formulation:

    min_{p_1, p_2, p_3}  Σ_{i=1}^m (γ_i^{1/3} α_i − β_i)²,             (6.21)

for which the global minimum is estimated using the method outlined in this section.

6.5.3 Convex Relaxation

As an illustration of higher-level concepts, we show construction of convex under- estimators for the non-convex objective in (6.21). The actual objective we minimize incorporates chirality bounds and is derived in Section 6.5.4.

Let us suppose it is possible to derive a convex under-estimator conv(γ_i^{1/3} α_i) and a concave over-estimator conc(γ_i^{1/3} α_i) for γ_i^{1/3} α_i. Then the following convex optimization problem underestimates the solution to (6.21):

    min_{p_1, p_2, p_3}  Σ_{i=1}^m (s_i − β_i)²

    subject to  conv(γ_i^{1/3} α_i) ≤ s_i ≤ conc(γ_i^{1/3} α_i)        (6.22)

As shown in Appendix B.3, our convex and concave relaxations for functions of the form x^{1/3} y are piecewise linear and representable using a small set of linear inequalities. Thus, the above optimization problem is a convex quadratic program that can be solved using a quadratic programming (QP) or a second order cone programming (SOCP) solver.

Given bounds on {p_1, p_2, p_3}, a branch and bound algorithm can now be used to obtain a global minimum of the modulus constraints. All that remains to be shown is that it is possible to estimate an initial region which bounds the coordinates of π_∞.

6.5.4 Incorporating Bounds on the Plane at Infinity

One way to derive bounds on the coordinates of the plane at infinity is by using the chirality conditions overviewed in Section 6.2.3. Let v be the plane at infinity in the centered quasi-affine frame, where v = (v_1, v_2, v_3, 1)^T, so that we can find bounds on each v_i. However, the modulus constraints require that the first metric camera be of the form K[I | 0] and the first projective camera have the form [I | 0], which might not be satisfiable in a centered quasi-affine frame, in general. Thus, we need to use the bounds derived in the centered quasi-affine frame within the modulus constraints for the original projective frame. The centered quasi-affine reconstruction differs from the projective one by a transformation H_qa = H_a H_q, where H_q takes the projective frame to some quasi-affine frame and H_a is the affine centering in that quasi-affine frame. Let h_i be the i-th column of H_qa; then we have p_i = h_i^T v / h_4^T v. Recall that, for the j-th view, α_j, β_j and γ_j are affine expressions in p_1, p_2 and p_3 (Pollefeys and van Gool, 1999). Then, for instance,

    α_j = α_j1 p_1 + α_j2 p_2 + α_j3 p_3 + α_j4                        (6.23)
        = a_j(v) / d(v),                                                (6.24)

where a_j(v) = α_j1 h_1^T v + α_j2 h_2^T v + α_j3 h_3^T v + α_j4 h_4^T v and d(v) = h_4^T v. Similarly, let

    β_j = b_j(v) / d(v)                                                 (6.25)
    γ_j = c_j(v) / d(v),                                                (6.26)

where a_j(v), b_j(v), c_j(v), d(v) are linear functions of v. In the following, for the sake of brevity, we will drop the reference to v and just use a_j, b_j, c_j, d. Now the optimization problem (6.21) can be rewritten as

    min_{v_1, v_2, v_3}  [ Σ_{j=1}^m (c_j^{1/3} a_j − d^{1/3} b_j)² ] / d^{8/3}

    subject to  l_i ≤ v_i ≤ u_i,  i = 1, 2, 3.                          (6.27)

Introducing new scalar variables for some of the non-linear terms, the above is equivalent to

    min_{v_1, v_2, v_3}  r

    subject to  r · e ≥ Σ_{j=1}^m (f_j − g_j)²
                f_j = c_j^{1/3} a_j,  j = 1, …, m
                g_j = d^{1/3} b_j,   j = 1, …, m
                e = d^{8/3}
                l_i ≤ v_i ≤ u_i,  i = 1, 2, 3.                          (6.28)

As outlined in our general recipe for constructing convex relaxations (Section 6.4.1), we have reduced the non-convexity in the above optimization problem to a set of equality constraints. The quadratic inequality constraint is convex and is known as a rotated cone (Boyd and Vandenberghe, 2004). Given bounds on v_i, it is easy to calculate bounds on a_j, b_j, c_j, d by solving eight linear programs in three variables. Given these bounds, we can construct convex and concave envelopes of the non-linear functions e, f_j, g_j and use them to construct the following convex program that underestimates the minimum of the problem (6.28):

    min_{v_1, v_2, v_3}  r

    subject to  r · e ≥ Σ_{j=1}^m (f_j − g_j)²,
                conv(c_j^{1/3} a_j) ≤ f_j ≤ conc(c_j^{1/3} a_j),  j = 1, …, m
                conv(d^{1/3} b_j) ≤ g_j ≤ conc(d^{1/3} b_j),     j = 1, …, m
                e ≤ conc(d^{8/3})
                l_i ≤ v_i ≤ u_i,  i = 1, 2, 3.                          (6.29)

Notice that the convex envelope of d^{8/3} is not needed: since (6.29) is a minimization problem, e always takes its maximum possible value and does not require a lower bound. Following Appendix B, our convex relaxation in (6.29) consists of a linear objective subject to linear and SOCP constraints, which can be efficiently minimized (Sturm, 1999). A branch and bound algorithm can now be used to obtain an estimate of {v_1, v_2, v_3} which globally minimizes the modulus constraints. Thereafter, the plane at infinity in the projective frame can be recovered as π_∞ = H_qa^T v, which completes the projective to affine upgrade.
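Since a_j, b_j, c_j and d are affine in v, their bounds over the box l ≤ v ≤ u can be found by the small linear programs mentioned above. A sketch with hypothetical coefficients (using SciPy rather than the MATLAB tools of this dissertation), alongside the closed form that an affine function admits on a box:

```python
import numpy as np
from scipy.optimize import linprog

# hypothetical affine function a(v) = w @ v + w0 over the box l <= v <= u
w, w0 = np.array([1.5, -2.0, 0.7]), 0.3
l, u = np.array([-1.0, 0.0, -0.5]), np.array([2.0, 1.0, 0.5])
bounds = list(zip(l, u))

# two linear programs in three variables: minimize and maximize w @ v
lo = linprog(c=w, bounds=bounds).fun + w0
hi = -linprog(c=-w, bounds=bounds).fun + w0

# closed form: an affine function attains its extrema at box corners,
# picking the lower or upper endpoint per coordinate by the sign of w
lo_cf = w0 + np.sum(np.where(w >= 0, w * l, w * u))
hi_cf = w0 + np.sum(np.where(w >= 0, w * u, w * l))
```

Either route gives the same interval; the LP formulation generalizes directly when additional linear constraints on v are present.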

6.6 Globally Optimal Metric Upgrade

6.6.1 Traditional Solution

Recall that when the camera intrinsic parameters are held constant, the DIAC satisfies the infinite homography relations ω* = H_∞^i ω* H_∞^{iT}, i = 1, …, m, where equality holds up to a scale factor. The standard technique for estimating the DIAC is to first normalize the infinite homography matrix by dividing it by the cube root of its determinant:

    Ĥ = H / (det H)^{1/3}.                                              (6.30)

Since this normalization “equates” the scale on the two sides of the infinite homography relation, estimating the DIAC can now be posed as a least squares problem:

    min_{ω*}  Σ_{i=1}^m ‖ω* − Ĥ_∞^i ω* Ĥ_∞^{iT}‖                        (6.31)

which is typically solved linearly by ignoring the positive semidefiniteness requirement on the DIAC. For the cases where the linear solution does not yield a positive semidefinite DIAC, the closest positive semidefinite matrix is estimated as a post-processing step by dropping the negative eigenvalues. It is well-documented that this may lead to a spurious calibration in practice (Hartley and Zisserman, 2004).
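The post-processing step mentioned above — replacing a non-PSD estimate by the closest positive semidefinite matrix in Frobenius norm — amounts to an eigendecomposition with the negative eigenvalues clamped to zero. A small generic sketch (the matrix M below is illustrative, not a DIAC):

```python
import numpy as np

def nearest_psd(M):
    """Frobenius-nearest PSD matrix: symmetrize, then drop negative eigenvalues."""
    S = 0.5 * (M + M.T)
    w, V = np.linalg.eigh(S)
    return (V * np.clip(w, 0.0, None)) @ V.T   # V diag(max(w, 0)) V^T

M = np.array([[2.0, 0.0, 1.0],
              [0.0, -0.5, 0.0],
              [1.0, 0.0, 1.0]])
P = nearest_psd(M)
```

The projection discards whatever information the negative eigendirections carried, which is exactly why this fix-up can produce a spurious calibration.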

6.6.2 Problem Formulation

For the optimal solution to the infinite homography relation, we note that both ω* and H_∞^i are homogeneous entities, so our cost function must correctly account for the scale factor before it can be used to search for the optimal DIAC. Moreover, the optimization algorithm itself must take into account the positive semidefiniteness of the DIAC.

A necessary condition for the matrix ω* to be interpreted as ω* = KK^T is that ω*_33 = 1. Thus, we fix the scale in the infinite homography relation by demanding that both the matrices on the left and the right hand side of the relation have their (3, 3) entry equal to 1. To this end, we introduce additional variables λ_i and pose the minimization problem:

    min_{ω*, λ_i}  Σ_i ‖ω* − λ_i H_∞^i ω* H_∞^{iT}‖²_F

    subject to  ω*_33 = 1
                λ_i h_3^{iT} ω* h_3^i = 1
                ω* ⪰ 0,  ω* ∈ D                                         (6.32)

Here, h_3^i denotes the third row of the 3×3 infinite homography H_∞^i and D is some initial convex region whose choice is elucidated later in this section. For the present, it suffices to understand that the individual entries of ω* lie within the convex region D.

6.6.3 Convex Relaxation

We begin by introducing a new set of variables ν_i = λ_i ω*. Here each matrix ν_i is a symmetric 3×3 matrix with entries ν_ijk = λ_i ω*_jk. Also, let us assume that the domain D is given in the form of bounds [l_jk, u_jk] on the five unknown symmetric entries ω*_jk of ω*. Then (6.32) can be re-written as

    min_{ω*, ν_i, λ_i}  Σ_i ‖ω* − H_∞^i ν_i H_∞^{iT}‖²_F

    subject to  ν_ijk = λ_i ω*_jk
                ω*_33 = 1
                λ_i h_3^{iT} ω* h_3^i = 1
                ω* ⪰ 0
                l_jk ≤ ω*_jk ≤ u_jk                                     (6.33)

The non-convexity in the above optimization problem has been reduced to the bilinear equality constraints ν_ijk = λ_i ω*_jk and λ_i h_3^{iT} ω* h_3^i = 1.
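The convex relaxation of such a bilinear equality (detailed in Appendix B.2) is the classical McCormick envelope: four linear inequalities that bracket λω whenever λ and ω lie in their boxes. The quick numerical check below uses arbitrary illustrative bounds:

```python
import numpy as np

def mccormick_bounds(lam, om, L, U, l, u):
    """Tightest linear lower/upper estimates of the bilinear term lam * om,
    valid for L <= lam <= U and l <= om <= u (McCormick envelopes)."""
    lower = max(L * om + l * lam - L * l,
                U * om + u * lam - U * u)
    upper = min(U * om + l * lam - U * l,
                L * om + u * lam - L * u)
    return lower, upper

rng = np.random.default_rng(1)
L, U, l, u = 0.5, 2.0, -1.0, 3.0          # arbitrary illustrative bounds
samples = [(rng.uniform(L, U), rng.uniform(l, u)) for _ in range(1000)]
```

Replacing each equality ν_ijk = λ_i ω*_jk by these four inequalities is exactly what turns (6.33) into the tractable relaxation (6.36) below.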

Given bounds on the entries of ω*, a relaxation of (6.33) is obtained by replacing the constraint λ_i h_3^{iT} ω* h_3^i = 1 by a pair of linear inequalities of the form L_i ≤ λ_i ≤ U_i, where L_i and U_i are computed by simply inverting the bounds on h_3^{iT} ω* h_3^i. Thus, the lower bound L_i can be computed as the reciprocal of the result of the maximization problem:

    max_{ω*}  h_3^{iT} ω* h_3^i

    subject to  ω*_33 = 1
                ω* ⪰ 0
                l_jk ≤ ω*_jk ≤ u_jk                                     (6.34)

This is a semi-definite program (SDP) in 9 variables and can be solved very efficiently using interior point methods (Boyd and Vandenberghe, 2004). The upper bound U_i can be computed similarly, as the reciprocal of the minimum of the corresponding minimization problem. The relaxed optimization problem can now be stated as:

    min_{ω*, ν_i, λ_i}  Σ_i ‖ω* − H_∞^i ν_i H_∞^{iT}‖²_F

    subject to  ν_ijk = λ_i ω*_jk
                ω*_33 = 1
                ω* ⪰ 0
                l_jk ≤ ω*_jk ≤ u_jk
                L_i ≤ λ_i ≤ U_i                                          (6.35)

In effect, the above ensures that the introduction of an additional view does not translate into an increase in the dimensionality of our search space. Instead, the cost is limited to solving a small SDP to compute bounds on λ_i, while the branching variables remain the five unknowns of ω*. Thus, the search space for the branch and bound algorithm can be restricted to a small, fixed number of dimensions, independent of the number of views. Appendix B.2 discusses the synthesis of convex relaxations of bilinear equalities, which allows us to replace each bilinear equality by a set of linear inequalities. Using them, a convex relaxation of the above optimization problem can be stated as

    min_{ω*, ν_i, λ_i}  Σ_i ‖ω* − H_∞^i ν_i H_∞^{iT}‖²_F

    subject to  ν_ijk ≤ U_i ω*_jk + l_jk λ_i − U_i l_jk
                ν_ijk ≤ L_i ω*_jk + u_jk λ_i − L_i u_jk
                ν_ijk ≥ L_i ω*_jk + l_jk λ_i − L_i l_jk
                ν_ijk ≥ U_i ω*_jk + u_jk λ_i − U_i u_jk
                ω*_33 = 1
                ω* ⪰ 0
                l_jk ≤ ω*_jk ≤ u_jk
                L_i ≤ λ_i ≤ U_i                                          (6.36)

The objective function of the above optimization problem is convex quadratic. The constraint set includes linear inequalities and a positive semi-definiteness constraint. Such problems can be efficiently solved to their global optimum using interior point methods, and a number of software packages exist for doing so; we use SeDuMi (Sturm, 1999) in our implementation. The user of the algorithm specifies valid ranges for the entries of the calibration matrix K. From this input, we derive intervals [l_jk, u_jk] for the entries ω*_jk of the matrix ω* using the rules of interval arithmetic (Moore, 1966), which specifies the initial convex region D in (6.32).
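With K upper triangular, K = [[fx, s, u0], [0, fy, v0], [0, 0, 1]], the entries of ω* = KK^T are short polynomials in the intrinsics, so interval arithmetic propagates the user-specified ranges entry by entry. The sketch below uses hypothetical ranges; the interval routines are a minimal subset of (Moore, 1966):

```python
def iadd(a, b):                        # interval sum
    return (a[0] + b[0], a[1] + b[1])

def imul(a, b):                        # interval product
    p = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
    return (min(p), max(p))

def isq(a):                            # interval for x**2, tighter than imul(a, a)
    if a[0] <= 0.0 <= a[1]:
        return (0.0, max(a[0] * a[0], a[1] * a[1]))
    lo, hi = sorted((a[0] * a[0], a[1] * a[1]))
    return (lo, hi)

# hypothetical user-specified ranges for K = [[fx, s, u0], [0, fy, v0], [0, 0, 1]]
fx = fy = (500.0, 1500.0)
s = (-10.0, 10.0)
u0 = (200.0, 400.0)
v0 = (100.0, 300.0)

# omega* = K K^T entry by entry; the (3, 3) entry is exactly 1
omega = {
    (1, 1): iadd(iadd(isq(fx), isq(s)), isq(u0)),   # fx^2 + s^2 + u0^2
    (1, 2): iadd(imul(s, fy), imul(u0, v0)),        # s*fy + u0*v0
    (1, 3): u0,
    (2, 2): iadd(isq(fy), isq(v0)),                 # fy^2 + v0^2
    (2, 3): v0,
}
```

The resulting intervals are the [l_jk, u_jk] that seed the branch and bound search over ω*.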

A Note on Normalization

The careful reader will have observed that we do not follow the standard prescription of normalizing the infinite homography by the cube root of its determinant, as discussed in Section 6.6.1, to resolve the scale in the infinite homography relations ω* = H_∞^i ω* H_∞^{iT}, i = 1, …, m.

There are two reasons for this. First, since the equation we are trying to satisfy with the optimal estimate of ω* is algebraic, there are a number of ways in which the scale ambiguity can be resolved; the method based on normalizing the determinant is just one of them. However, the DIAC is not an arbitrary 3×3 symmetric positive semidefinite matrix: it has a particular geometric and numerical interpretation, which requires that ω*_33 = 1. Our choice of the objective function reflects this. If one were to choose the determinant-based normalization, even though technically the scale ambiguity in the infinite homography relation would be resolved, a further normalization would be needed before the camera parameter matrix K could be estimated from the DIAC.

The second reason is that our normalization subsumes the standard one, since the latter just corresponds to setting λ_i = det(H_∞^i)^{−2/3}. Since we are also optimizing over the choice of λ_i, the solution returned by our method will correspond to a minimum which is, in general, lower than the one that corresponds to the standard normalization. In other words, our normalization is more consonant with the aim of global optimization, since we estimate this scale factor, while the traditional approach fixes one. Thus, our method, at the expense of some computational effort, poses the optimization problem in terms of an interpretable quantity and finds estimates which are at least as good as, or better than, those obtained by using the standard normalization.

6.7 Experiments

In this section, we describe the experimental evaluation of our algorithms using synthetic and real data.

To evaluate the output of our algorithm, the following metrics are defined:

    ∆p  = sqrt( Σ_{i=1}^3 (p_i / p_i′ − 1)² )                           (6.37)
    ∆f  = |f_1 / f_1′ − 1| + |f_2 / f_2′ − 1|                           (6.38)
    ∆uv = |u − u′| + |v − v′|                                            (6.39)
    ∆s  = |s − s′|                                                       (6.40)
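These metrics translate directly into code; a small illustrative helper (the primed quantities of the definitions appear as the `*0` arguments, and the numbers in the usage example are made up):

```python
import math

def calibration_errors(p, p0, f, f0, uv, uv0, s, s0):
    """Error metrics (6.37)-(6.40); the '0' arguments are ground truth."""
    dp = math.sqrt(sum((pi / pi0 - 1.0) ** 2 for pi, pi0 in zip(p, p0)))
    df = abs(f[0] / f0[0] - 1.0) + abs(f[1] / f0[1] - 1.0)
    duv = abs(uv[0] - uv0[0]) + abs(uv[1] - uv0[1])
    ds = abs(s - s0)
    return dp, df, duv, ds

errs = calibration_errors(p=(1.1, 0.9, 1.0), p0=(1.0, 1.0, 1.0),
                          f=(1010.0, 990.0), f0=(1000.0, 1000.0),
                          uv=(321.0, 239.0), uv0=(320.0, 240.0),
                          s=0.1, s0=0.0)
```

Note that ∆p and ∆f are relative errors, while ∆uv and ∆s are absolute (in pixels), matching how the tables below report them.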

Here, p_i are the estimated coordinates of the plane at infinity, f_1, f_2 represent the two focal lengths, (u, v) stands for the principal point and s for the skew; p_i′, f_1′, f_2′, u′, v′ and s′ are the corresponding ground truth quantities. In the first experiment, we simulated a scene where 100 3D points are randomly generated in a cube with sides of length 20, centered at the origin, and a varying number of cameras are randomly placed at a nominal distance of 40 units. Zero mean Gaussian noise of varying standard deviation is added to the image coordinates. A projective transformation is applied to the scene with a known, randomly generated plane at infinity, and the ground truth intrinsic calibration matrix is the identity. All the statistics reported in this section are acquired over 50 trials. Table 6.1 reports, for various numbers of cameras and noise levels, the errors in the estimates of various camera parameters and the number of iterations needed for the algorithm to converge. The column π_∞(1) in Table 6.1 reports the number of branch and bound iterations using the algorithm described in Section 6.5.4. However, an additional optimization is possible: we can refine the value of the feasible point f(q*) using a gradient descent method within the rectangle that contains it. This does not compromise optimality, but allows the value of the current best estimate to be lower than the value corresponding to the minimum of the lower bounding function. The number of iterations with this refinement is tabulated under π_∞(2). The error metrics reported are computed using the refined algorithm; however, since both algorithms are run with the same very stringent tolerance (ε = 1e−7), the solutions obtained are comparable. The number

Table 6.1: Error in camera calibration parameters and number of iterations for convergence, using random synthetic data. Calibration errors are reported relative to ground truth. All quantities reported are averaged over 50 trials. σ stands for percentage noise in image coordinates and m stands for number of views.

    σ (%)   m    ∆p        ∆f        ∆uv       ∆s        π∞(1)   π∞(2)   ω∗
    0       5    3.65e-5   3.28e-5   3.01e-5   2.89e-5   31.1    11.0    32.1
            10   2.51e-6   2.14e-6   2.03e-6   2.00e-6   18.7     4.2    40.9
            20   1.28e-6   1.51e-6   1.33e-6   1.73e-6   21.4     1.6    31.8
            40   9.08e-7   8.24e-7   7.99e-7   7.58e-7   23.8     1.1    27.5
    0.1     5    4.76e-4   4.59e-4   4.22e-4   4.05e-4   27.3     9.9    36.3
            10   3.44e-4   3.07e-4   2.73e-4   2.91e-4   17.2     3.5    44.2
            20   2.75e-4   2.92e-4   2.56e-4   2.31e-4   16.1     2.4    33.0
            40   2.55e-4   2.41e-4   2.06e-4   1.85e-4   23.2     7.9    30.1
    0.2     5    1.19e-3   1.14e-3   9.92e-4   8.73e-4   41.0    12.5    38.4
            10   7.65e-4   7.13e-4   7.01e-4   6.85e-4   24.7     4.5    47.9
            20   6.03e-4   6.80e-4   5.12e-4   5.79e-4   19.5     7.0    34.5
            40   5.59e-4   6.05e-4   4.29e-4   5.10e-4   33.2    10.6    31.6
    0.5     5    3.29e-3   3.22e-3   3.02e-3   2.63e-3   63.2    11.6    42.7
            10   1.66e-3   2.19e-3   1.99e-3   2.11e-3   29.8     6.2    51.1
            20   1.41e-3   1.84e-3   1.43e-3   1.55e-3   22.3     8.2    38.2
            40   1.25e-3   1.50e-3   1.18e-3   9.06e-4   46.6    20.6    32.4
    1.0     5    4.68e-3   4.04e-3   3.77e-3   3.36e-3   74.1     9.9    45.5
            10   3.15e-3   2.88e-3   2.52e-3   2.15e-3   36.4     9.2    56.8
            20   2.86e-3   2.45e-3   2.02e-3   1.74e-3   31.0    15.8    40.9
            40   2.79e-3   2.21e-3   1.76e-3   1.30e-3   56.4    23.6    38.9

of iterations for the DIAC estimation (with no refinement) is tabulated under ω∗. The metric upgrade step was performed with ε = 1e−5. The termination criterion measures the gap, ε, between the lowest lower bound and the current best objective function value. Figure 6.4 plots the errors graphically. The accuracy of the algorithm is evident from the very low error rates obtained for reasonable noise levels. It is interesting that the algorithm performs quite well even for noise as high as 1%. In general, the accuracy improves, as expected, when more cameras are used. To demonstrate scalable runtime behavior, Figure 6.5 plots the runtime for the affine and metric upgrade stages for the random data experiment with 0.1% noise, for

(a) Error in plane at infinity (b) Error in focal length
(c) Error in principal point (d) Error in pixel skew

Figure 6.4: Error in camera calibration parameters for random synthetic data. The errors in the graphs are plotted relative to ground truth for the indicated quantities. All quantities reported are averaged over 50 trials.

varying numbers of cameras. These experiments were conducted on a Pentium IV, 1 GHz computer with 1 GB of RAM. Note that the graceful variation in the runtime behavior is a direct outcome of our bounds propagation schemes, without which the branch and bound algorithms would display exponential characteristics. Our code is unoptimized MATLAB with an off-the-shelf SDP solver (Sturm, 1999), so the actual magnitude of these timings should be understood only as a rough qualitative indicator.¹ While the metrics in Table 6.1 are intuitive for evaluating the intrinsic parameters, it is not readily evident how ∆p should be interpreted. Towards that end, we perform a set of experiments, inspired by (Pollefeys and van Gool, 1999), where three mutually orthogonal 5×5 grids are observed by varying numbers of randomly placed cameras.

¹Prototype code available at http://vision.ucsd.edu/stratum.

(a) Affine upgrade (b) Metric upgrade

Figure 6.5: Runtime behavior of the branch and bound algorithms for the affine and metric upgrade steps. All timings reported are averaged over 50 trials.

Noise ranging from 0.1 to 1% is added to the image coordinates. The quality of the affine upgrade is indirectly inferred from the deviation from parallelism in the reconstructed grid lines, while the quality of the metric upgrade is inferred from the deviation from orthogonality. Table 6.2 reports the results of this experiment and Figure 6.6 shows the results graphically.

(a) Deviation from parallelism (Affine upgrade) (b) Deviation from orthogonality (Metric upgrade)

Figure 6.6: Errors in affine and metric properties for three synthetic, mutually orthogonal planar grids. The graphs plot (a) angular deviation from parallelism after the affine upgrade and (b) angular deviation from orthogonality after the metric upgrade, both measured in degrees. All quantities reported are averaged over 50 trials.

Again, we observe that the algorithm achieves very good accuracy for reasonable

Table 6.2: Error in affine and metric properties for three synthetically generated, mutually orthogonal planar grids. The table reports mean angular deviations from parallelism and orthogonality, measured in degrees. All quantities reported are averaged over 50 trials.

    Noise (%)   Views   Affine (Parallel)   Metric (Parallel)   Metric (Perpendicular)
    0           5       2.40e-6             2.46e-6             3.07e-6
                10      7.90e-7             8.12e-7             1.12e-6
                20      5.22e-7             5.50e-7             8.70e-7
                40      3.67e-7             3.88e-7             6.23e-7
    0.1         5       0.40                0.40                0.34
                10      0.27                0.27                0.23
                20      0.19                0.19                0.15
                40      0.13                0.13                0.10
    0.2         5       0.79                0.80                0.63
                10      0.54                0.54                0.44
                20      0.36                0.36                0.25
                40      0.25                0.25                0.19
    0.5         5       1.95                1.96                1.88
                10      1.31                1.31                1.02
                20      0.89                0.89                0.79
                40      0.64                0.64                0.57
    1.0         5       4.05                4.07                3.97
                10      2.63                2.63                2.30
                20      1.83                1.83                1.52
                40      1.27                1.27                1.09

noise and performs quite well even for 1% noise. With just 5 cameras, it is quite likely for the configuration to be ill-conditioned or degenerate, which causes the algorithm to break down in some cases. We also use this experimental setup to compare against traditional local optimization approaches. In the first set of experiments, for a fixed number of views (20) and varying noise levels, we minimize the modulus constraints in (6.21) using a Levenberg-Marquardt scheme with 50 random initializations. It can be seen from Figure 6.7 that the globally optimal solution of Section 6.5 yields more accurate solutions. Finally, once the plane at infinity has been estimated, the metric upgrade of


Figure 6.7: Comparison of the accuracy of the plane at infinity estimated by the globally optimal method (black dotted curve) and a Levenberg-Marquardt routine with 50 random initializations (red curve). The number of views is 20. The accuracy of the affine upgrade is deduced from the extent to which parallelism is preserved in the reconstruction.

Section 6.6 is compared against the traditional linear approach. The two drawbacks of the linear approach discussed previously, namely overlooking the positive semidefiniteness requirement and using a suboptimal scale factor, are illustrated by these experiments. Table 6.3 shows the percentage of trials in which the estimated DIAC is not positive semidefinite. The number of such instances is higher for fewer views and increases with the noise level. In such cases, it is not possible to decompose the DIAC into the intrinsic calibration matrix without a further approximation that projects onto the closest point on the cone of positive semidefinite matrices. The DIAC estimation here is based on an affine reconstruction performed using the optimal plane at infinity estimated by the method in Section 6.5. Next, for a fixed number of views (m = 5), the deviation from orthogonality observed in the globally optimal metric reconstruction is compared to the deviation obtained from the linear method, across varying noise levels (Figure 6.8). Two sets of metric reconstructions are performed for the linear method. In the first case, the metric upgrade is performed starting with the affine reconstruction obtained using the plane at infinity estimated by a local optimization method (Levenberg-Marquardt with 50 random initializations); in the other case, the affine upgrade is computed using the optimal

Table 6.3: Percentage of trials in which the linear method for DIAC estimation returns a solution that is not positive semidefinite. All quantities reported are computed over 50 trials.

Noise (%):     0     0.1    0.2    0.5    1.0
 5 views       0      4      8      8     14
10 views       0      2      6      4     10
20 views       0      0      4      4      6
40 views       0      0      0      2      4

method in Section 6.5, while the metric upgrade is linear. Clearly, the globally optimal DIAC estimation outperforms the linear method across all noise levels, which shows the importance of estimating the normalization factors λi introduced in Section 6.6.2.

[Plot: deviation from orthogonality (degrees) vs. noise (percent), for local π∞ + linear DIAC, global π∞ + linear DIAC, and global π∞ + global DIAC.]

Figure 6.8: Comparison of the globally optimal metric upgrade algorithm with a traditional linear method for estimating the DIAC. The accuracy of the reconstruction is deduced by the extent to which orthogonality is preserved in the metric reconstruction. The solid red curve plots the deviation from orthogonality when a local optimization method is used for estimating the plane at infinity and a linear method is used for estimating the DIAC. The dotted black curve uses a linear method for DIAC estimation, but the optimal plane at infinity is used for the affine upgrade. The dashed blue curve uses optimal algorithms for both the affine and metric upgrade steps.

An important consideration for noisy situations is the sensitivity of the chirality bounds to outliers. Similar to (Nistér, 2004), for noisy images expected in a real world scenario, chirality bounds are computed using only the camera centers. The reason is that usually there are far more points than cameras and the camera centers are likely to be estimated more robustly than the 3D points.

(a) Four of the twelve input images used for reconstruction

(b) Two views of the obtained 3D reconstruction

Figure 6.9: In the reconstruction, targets on the same plane are represented by lines of the same color. The second view of the obtained 3D reconstruction shows that the angle between targets on adjacent walls is recovered as nearly 90 degrees.

To demonstrate performance on real data, we consider images of marker targets on two orthogonal walls (Figure 6.9(a)). Using image correspondences from detected corners, we perform a projective reconstruction using the projective factorization algorithm (Sturm and Triggs, 1996) followed by bundle adjustment. The normalization procedure and exact implementation follow the description in (Hartley and Zisserman, 2004). Bounds on the plane at infinity are computed using chirality, cf. equation (3.32). Focal lengths are assumed to lie in the interval [500, 1500], the principal point within [250, 450] × [185, 385] (the image size is 697 × 573) and the skew within [−0.1, 0.1]. The plane at infinity and DIAC are estimated using our algorithm. While ground truth measurements for the scene are not available, we can indirectly infer some observable properties. The coplanarity of the individual targets is indicated by the ratio of the first eigenvalue of the covariance matrix of their points to the sum of eigenvalues. This ratio is

measured to be 3.1 × 10⁻⁶, 4.1 × 10⁻⁵, 6.2 × 10⁻⁵ and 4.1 × 10⁻⁴ for the four polygonal targets. The angle between the normals to the planes represented by two targets on the adjacent walls is 88.1° in our metric reconstruction (Figure 6.9(b)). The same angle is measured as 89.8° in a reconstruction using (Pollefeys et al., 2002). The precise ground truth angle between the targets is unknown. All the results that we have reported are for the raw output of our algorithm. In practice, a few iterations of bundle adjustment following our algorithms might be used to achieve a slightly better estimate.
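The coplanarity measure used above can be sketched as follows; this sketch reads the "first eigenvalue" as the smallest one, which is the quantity that vanishes for a planar set, and uses synthetic stand-in point sets rather than the detected targets:

```python
import numpy as np

def coplanarity_ratio(points):
    """Ratio of the smallest eigenvalue of the covariance matrix of a
    3D point set to the sum of its eigenvalues; near zero for a
    (nearly) planar set."""
    pts = np.asarray(points, dtype=float)
    cov = np.cov(pts.T)              # 3x3 covariance of the point cloud
    w = np.linalg.eigvalsh(cov)      # eigenvalues in ascending order
    return w[0] / w.sum()

# A perfectly planar target (z = 0) and one with small out-of-plane noise.
rng = np.random.default_rng(0)
flat = np.column_stack([rng.uniform(-1, 1, (50, 2)), np.zeros(50)])
noisy = flat + np.column_stack([np.zeros((50, 2)),
                                1e-3 * rng.standard_normal(50)])

print(coplanarity_ratio(flat))   # essentially zero
print(coplanarity_ratio(noisy))  # small but nonzero
```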

6.8 Conclusions and Further Discussions

In this chapter, we have presented globally optimal solutions to the affine and metric stages of stratified autocalibration. Although our cost function is algebraic, this is the first work that provides optimality guarantees for scalable stratified autocalibration. For the success of a branch and bound scheme, it is of utmost importance that the convex relaxations be as tight as possible. The second order cone programming based convex relaxation that we develop for solving the affine upgrade step and the semidefinite programming based convex relaxation for the metric upgrade step satisfy this requirement, while also being very fast to compute in practice. Sometimes, a consideration of the practicality of the convex relaxation influences the choice of the algebraic form of an objective function. Indeed, the most straightforward way to minimize the modulus constraints would be to use the simpler formulation of (6.20) and construct a multi-level relaxation for the quartic polynomials by successively using the bilinear relaxation of Section B.2. However, in our experience, such multistep relaxations are very loose in practice, so the branch and bound algorithm will not converge in a reasonable amount of time. Thus, the least squares version of the modulus constraints we globally minimize corresponds to the reformulated version of (6.21). A crucial aspect of designing a global optimization algorithm based on branch and bound is the choice of initial region, which must be principled and guaranteed to contain the optimal solution. Arbitrarily choosing a very large initial region will lead to impractically long convergence times for the branch and bound, while too restrictive a choice might not contain the global optimum. Our affine upgrade step addresses this issue by incorporating chirality constraints within the convex relaxation for the modulus constraints. In practice, this limits the location of the plane at infinity to a small region of the search space.
For the metric upgrade step, the entries of the DIAC that we wish to estimate are related to more tangible entities corresponding to the internal parameters of the camera. So, a user can easily specify reasonable bounds on the focal length, pixel skew and principal point, which are propagated to initial bounds on the DIAC using interval arithmetic. Several important extensions to the methods introduced in this chapter can be envisaged. For instance, an L1-norm formulation will allow us to use an LP solver for the affine upgrade, making it possible to solve larger problems faster. To the best of our knowledge, it remains an open question whether methods similar to those proposed in this chapter can be used for obtaining optimal solutions to the direct autocalibration problem, which is the subject of the next chapter of this dissertation. Finally, we reiterate that pragmatic application of domain knowledge is important for successfully employing an optimization paradigm to globally optimize a computer vision problem. Indeed, it is the careful consideration of multiview geometry for choosing the initial region, constructing convex relaxations and restricting the dimensionality of the search space within a branch and bound framework that allows the globally optimal algorithms presented in this chapter to be practical. We are hopeful that, in the near future, methods not unlike ours will be used to exploit underlying convexities to successfully optimize challenging problems in other areas of computer vision, besides multiview geometry.

Most of the content of this chapter is based on "Globally Optimal Affine and Metric Upgrades in Stratified Autocalibration", by M. K. Chandraker, S. Agarwal, D. J. Kriegman and S. Belongie, as it appears in (Chandraker et al., 2007b).

Chapter 7

Direct Autocalibration

“Confidence is a good name for what is intended by the term directness. .... It denotes the straightforwardness with which one goes at what he has to do.”

John Dewey (American pragmatist, 1859-1952), The Traits of Individual Method (Democracy and Education)

7.1 Introduction

While a projective reconstruction of the scene can be computed from image coordinates alone, the goal of autocalibration is to compute the projective transformation (homography) that upgrades the projective reconstruction to a metric reconstruction. Chapter 6 achieved this in a stratified manner, with an intermediate affine reconstruction step. In this chapter, we will propose a global optimization method to directly upgrade a projective reconstruction to a metric one. The basic tenet of autocalibration is the constancy of the absolute conic under rigid body motion of the camera. This is encoded conveniently in the absolute dual quadric (ADQ) formulation of autocalibration (Triggs, 1997). Recall from Section 2.5.3 that the absolute dual quadric, also denoted Q∗∞, is a 4×4 rank-degenerate quadric

with the canonical representation:

    Q^*_\infty = \tilde{I} = \begin{bmatrix} I_{3\times 3} & 0 \\ 0^\top & 0 \end{bmatrix}.    (7.1)

The ADQ is the dual of the absolute conic and the plane at infinity is its null vector, so estimating it can directly upgrade a projective reconstruction to a metric one. Indeed, once estimated, an eigenvalue decomposition of the ADQ yields the homography that relates the projective reconstruction to a Euclidean reconstruction. From Section 2.5.3, it is evident that the ADQ must satisfy some important properties:

1. Q∗∞ is a degenerate quadric, so it must be rank-deficient.

2. The null-space of Q∗∞ is the plane at infinity, π∞.

3. Q∗∞ is positive (or negative, depending on scale) semidefinite.

Note that these properties are not mere algebraic conveniences; they are geometric requirements essential for a quadric to be considered as the ADQ.

The common practice in estimating Q∗∞ is to enforce its rank degeneracy as a post-processing step by simply dropping the smallest singular value. The rank degeneracy of the absolute quadric has an important physical interpretation: enforcing it is equivalent to demanding a common support plane for the absolute conic over the multiple views. That, indeed, is the real advantage that an absolute quadric based method affords over one based on, say, the Kruppa constraints. Thus, for more than two views, it does not make sense to estimate Q∗∞ without enforcing the rank condition. In this chapter, we propose a method for estimating the absolute quadric where its rank deficiency is imposed within the estimation procedure. A significant drawback of several prior approaches for autocalibration is the difficulty in ensuring a positive (or negative, depending on scale) semidefinite Q∗∞. As has been discussed in the literature (Hartley and Zisserman, 2004), it is usually not correct to simply output the closest positive semidefinite matrix as a post-processing step, since it might lead to a spurious

calibration. Our formulation explicitly demands a positive (or negative) semidefinite Q∗∞ as an output of our optimization problem, which ensures that the resulting dual image of the absolute conic (DIAC) can be decomposed into its Cholesky factors to yield the calibration parameters of the cameras. It is well-established that the principal difficulty in autocalibration lies in the affine upgrade step, which involves a precise estimation of the plane at infinity, π∞. In our approach, estimating the plane at infinity is encapsulated in the absolute quadric estimation itself, as π∞ lies in the null-space of Q∗∞. Further, it is reasonable to demand that chirality holds, that is, the points and camera centers lie on one side of the plane at infinity (Hartley, 1998a). It has been argued in recent literature (Nistér, 2004) that it is most important for any reconstruction method to start by satisfying the requirements of chirality, in particular, with respect to the camera centers. Intuitively, moving across the plane at infinity requires jumps across large "basins" in the search space. Thus, imposing chirality constraints increases the chances of an autocalibration routine starting in the correct region of the search space. Given reasonable assumptions on the internal parameters of the camera, such as zero skew and unit aspect ratio, we pose the problem of estimating the positive semidefinite, rank-degenerate absolute quadric as one of minimizing a polynomial objective function, subject to polynomial equality and inequality conditions. Chirality conditions translate into polynomial inequality constraints on the entries of the absolute quadric, so they can be readily included in the polynomial system that we seek to minimize. Our formulation allows us to compute the global minimum for such a polynomial system using a series of convex linear matrix inequality relaxations (Kahl and Henrion, 2005; Lasserre, 2001). In summary, the contributions of this chapter are:

• We present a fast and reliable method for autocalibration by estimating the absolute dual quadric, where its rank degeneracy and positive semidefiniteness are imposed within our optimization framework.

• We can demand that our reconstruction satisfy the requirements of chirality by imposing constraints on the plane at infinity during our estimation procedure.

• We globally minimize a reasonable objective function based on camera matrices alone to deduce the entries of the absolute quadric subject to all the above constraints.

7.2 Background

We follow the same notations and conventions as the previous chapters, which we restate here for ease of reference. Unless stated otherwise, we will denote 3D points by homogeneous 4-vectors

(such as X = (X₁, X₂, X₃, X₄)ᵀ) and 2D points by homogeneous 3-vectors (such as x = (x₁, x₂, x₃)ᵀ). A projective camera is represented as P = K[R | t], where the intrinsic calibration parameters of the camera are encoded in the upper triangular matrix K and (R, t) denote the exterior orientation of the camera. One of the objectives in autocalibration is to recover the matrix K, which is parametrized as

    K = \begin{bmatrix} f_x & s & u \\ 0 & f_y & v \\ 0 & 0 & 1 \end{bmatrix}    (7.2)

where fx, fy stand for the focal lengths in the x and y directions, s denotes the skew and (u, v) the position of the principal point. It is usual to estimate K by estimating the dual image of the absolute conic, since

ω∗ = KKᵀ.    (7.3)

Suppose we have a projective reconstruction {Pi, Xj} for i = 1, …, m and j = 1, …, n. Then, the goal of 3D reconstruction is to determine the projective transformation H that takes the projective reconstruction back to the Euclidean one {P̂i, X̂j}, where

    P̂i = Pi H ,  i = 1, …, m
    X̂j = H⁻¹ Xj ,  j = 1, …, n.    (7.4)
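As a concrete illustration (not from the text), the transformation (7.4) can be applied in a few lines of numpy; the sanity check is that the image projections Pi Xj are unchanged by H:

```python
import numpy as np

def upgrade_reconstruction(P_list, X, H):
    """Apply the rectifying homography H as in (7.4):
    P_hat_i = P_i @ H and X_hat_j = inv(H) @ X_j, with the 3D points
    stored as homogeneous 4-vector columns of X."""
    Hinv = np.linalg.inv(H)
    P_hat = [P @ H for P in P_list]
    X_hat = Hinv @ X
    return P_hat, X_hat

# Random cameras/points/homography; projections must be invariant.
rng = np.random.default_rng(1)
P = [rng.standard_normal((3, 4)) for _ in range(2)]
X = rng.standard_normal((4, 5))
H = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # well-conditioned
P_hat, X_hat = upgrade_reconstruction(P, X, H)
print(np.allclose(P[0] @ X, P_hat[0] @ X_hat))    # True
```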

7.2.1 Autocalibration Using the Absolute Dual Quadric

From Section 2.5.3, under a projective transformation H, the ADQ moves out of its canonical position to Q∗∞ = HĨHᵀ. Thus, an eigendecomposition of the estimated ADQ in a projective reconstruction immediately gives us the rectifying homography H that upgrades to a metric reconstruction. The ADQ is the dual of the absolute conic and its image under a camera P is the dual image of the absolute conic (DIAC):

ω∗ = P Q∗∞ Pᵀ.    (7.5)

This follows from the projection properties of dual quadrics; see Result 4. From (7.2) and (7.3), it follows that the DIAC has the form

    \omega^* = \begin{bmatrix} f_x^2 + s^2 + u^2 & s f_y + u v & u \\ s f_y + u v & f_y^2 + v^2 & v \\ u & v & 1 \end{bmatrix}.    (7.6)

Thus, imposing some constraints on the internal parameters of a camera translates to constraints on the entries of the DIAC, which in turn yields a relation between the entries of the ADQ.
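As a quick numerical check of (7.6), one can form ω∗ = KKᵀ for some hypothetical intrinsic parameters and compare it entry by entry with the closed form:

```python
import numpy as np

# Hypothetical intrinsics: focal lengths fx, fy, skew s, principal point (u, v).
fx, fy, s, u, v = 800.0, 780.0, 0.5, 320.0, 240.0
K = np.array([[fx,  s, u],
              [0., fy, v],
              [0., 0., 1.]])

omega = K @ K.T   # the DIAC, equation (7.3)

# Entry-by-entry comparison with the closed form (7.6).
expected = np.array([[fx**2 + s**2 + u**2, s*fy + u*v,   u],
                     [s*fy + u*v,          fy**2 + v**2, v],
                     [u,                   v,            1.]])
print(np.allclose(omega, expected))   # True
```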

Linear autocalibration using the ADQ

If the principal point is known, then it is possible to obtain linear equations in the entries of the ADQ. Since the ADQ is representable as a symmetric 4×4 matrix, let us parametrize it in terms of the 10 unknowns of its upper-triangular part. Since the principal point defines the origin of the image coordinate system, knowing it is equivalent to assuming that the principal point is at the origin, which can be ensured by a simple translation of the coordinate system. Then, from (7.6),

    u = 0 \text{ and } v = 0 \;\Rightarrow\; [\omega^*_i]_{13} = 0 \text{ and } [\omega^*_i]_{12} = 0
    \;\Rightarrow\; (P_i Q^*_\infty P_i^\top)_{13} = 0 \text{ and } (P_i Q^*_\infty P_i^\top)_{12} = 0,    (7.7)

where the latter follows from (7.5). Thus, knowing the principal point in m ≥ 5 views yields 2m equations in the 10 entries of the ADQ, which can be solved using a singular value decomposition (SVD).
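A minimal sketch of this linear method, under the stated known-principal-point assumption (the helper names are ours, and degenerate configurations are not handled):

```python
import numpy as np
from itertools import combinations_with_replacement

# Upper-triangle index pairs of a symmetric 4x4 matrix: the 10 unknowns.
IDX = list(combinations_with_replacement(range(4), 2))

def linear_row(p_r, p_c):
    """Coefficients of the linear form p_r^T Q p_c in the 10
    upper-triangle entries of a symmetric 4x4 matrix Q."""
    return np.array([p_r[a] * p_c[b] + (p_r[b] * p_c[a] if a != b else 0.0)
                     for a, b in IDX])

def estimate_adq_linear(cameras):
    """Known-principal-point linear method, cf. (7.7): each view
    contributes (P Q P^T)_13 = 0 and (P Q P^T)_12 = 0; with m >= 5
    views the ADQ is recovered (up to scale) as the right singular
    vector with smallest singular value."""
    A = np.array([linear_row(P[r], P[c])
                  for P in cameras for r, c in [(0, 2), (0, 1)]])
    q = np.linalg.svd(A)[2][-1]
    Q = np.zeros((4, 4))
    for val, (a, b) in zip(q, IDX):
        Q[a, b] = Q[b, a] = val
    return Q
```

The returned matrix is symmetric by construction; it satisfies the zero-principal-point constraints in the least-squares sense.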

Nonlinear autocalibration constraints on the ADQ

Rather than a known principal point, there can be other assumptions on the internal parameters of a camera, which translate into nonlinear constraints on the entries of the ADQ. For instance, assuming zero pixel skew leads to a quadratic equation in the entries of the DIAC:

    ω∗₁₂ ω∗₃₃ = ω∗₁₃ ω∗₂₃    (7.8)

which translates into a quadratic relation in the entries of the ADQ. Similarly, assuming that the internal parameters of the cameras are constant for the image sequence, the DIACs for any two views i and j satisfy the condition that

    [ω∗i]_kl / [ω∗j]_kl = constant ,  for k = 1, 2, 3 and k ≤ l ≤ 3 ,    (7.9)

which again translates into a quadratic relation in the entries of the ADQ.

7.2.2 Chirality

As reviewed in Section 3.5, chirality constraints demand that the reconstruction satisfy a very basic criterion: the imaged scene points must be in front of the camera (Hartley, 1998a). A general projective transformation need not preserve the convex hull of a point set; that is, the scene can be split across the plane at infinity in a projective reconstruction. A quasi-affine reconstruction is one that differs from the Euclidean scene by a projective transformation, but in which the plane at infinity is guaranteed not to split the convex hull of the set of points and camera centers. A quasi-affine reconstruction can be computed from the solution of the so-called "chiral inequalities" determined by all the scene points and camera centers and given by (3.31).

7.3 Related Work

There is a significant body of literature within computer vision that deals with autocalibration, beginning with the introduction of the concept in (Faugeras et al., 1992). Approaches to autocalibration can be broadly classified as stratified and direct. The former is a two-step process, whereby the first step involves estimating the plane at infinity for an upgrade to the affine stratum, and the metric upgrade is typically performed by estimating K in a subsequent step. Estimating the plane at infinity to achieve an affine upgrade is considered the most difficult step in autocalibration (Hartley et al., 1999). The plane at infinity itself has proven to be a rather elusive entity to estimate precisely. A prior approach has been to exhaustively compute all 64 solutions to the modulus constraints (Pollefeys et al., 1996), although only 21 of them are physically realizable (Schaffalitzky, 2000). An alternate approach involves solving a linear program arising from chirality constraints imposed on the points and camera centers to delineate the region in R³ where the first three coordinates of the plane at infinity parametrized as π∞ = (pᵀ, 1)ᵀ must lie. Subsequently, p is recovered by a brute force search within this region (Hartley et al., 1999). In our approach, estimating the plane at infinity is encapsulated in the absolute quadric estimation itself, as π∞ lies in the null-space of Q∗∞. A variety of linear methods exist for estimating K for the metric upgrade step; see (Hartley and Zisserman, 2004) for more discussion. A drawback of linear approaches is that they do not enforce positive semidefiniteness of the DIAC. One work that the authors are aware of, where estimation of the DIAC is constrained to be positive semidefinite, is (Agrawal, 2004).

The class of direct approaches to autocalibration consists of those that directly compute the metric reconstruction from a projective one by estimating the absolute conic. Kruppa equations are view-pairwise constraints on the projection of the absolute quadric. Methods based on the Kruppa equations (or the fundamental matrix), such as (Manning and Dyer, 2001), are known to suffer from additional ambiguities when used for autocalibration with three or more views (Sturm, 2000). The absolute quadric was introduced as a device for autocalibration in (Heyden and Åström, 1996; Triggs, 1997), as it is a convenient representation for both the absolute conic and the plane at infinity. Moreover, it can simultaneously be estimated over multiple views. Constraints on the DIAC can be transferred to those on Q∗∞ using the (known) cameras in the projective reconstruction. The actual solution methods proposed in (Triggs, 1997) are a linear approach and one based on sequential quadratic programming. Linear initializations for estimating the dual quadric are also discussed in (Pollefeys et al., 1998). It is known that these methods, which do not ensure positive semidefiniteness, are liable to perform quite poorly with noisy data. While it can be shown that zero skew alone is sufficient for a metric reconstruction (Heyden and Åström, 1998), a practitioner must use as much of the available information as possible (Hartley and Zisserman, 2004). We will adopt this latter philosophy and make the safe assumptions that the skew is close to zero, the principal point is close to the origin and the aspect ratio is close to unity. We allow the focal length to vary to account for varying zoom. Very recently, there has been interest in developing globally optimal solutions to several problems in multiview geometry.
A number of simpler problems in multiview geometry can be formulated in terms of systems of polynomial inequalities (Kahl and Henrion, 2005), which can be globally minimized using the theory of convex linear matrix inequality relaxations (Lasserre, 2001). The inherent fractional nature of multiview geometry problems is exploited in (Agarwal et al., 2006) to compute the globally optimal solution to triangulation and resectioning, with a certificate of optimality. A branch-and-bound method is used for autocalibration in (Fusiello et al., 2004); however, their problem is formulated in terms of the fundamental matrix of view pairs and does not scale beyond a small number of views.

7.4 Problem Formulation

Many self-calibration algorithms, such as (Manning and Dyer, 2001; Fusiello et al., 2004), do not estimate the absolute quadric directly. As compared to algorithms such as (Triggs, 1997; Nistér, 2004) which estimate the absolute quadric, our main contribution is that we constrain Q∗∞ to be positive semidefinite and rank degenerate within our optimization framework. Our optimization problems themselves have been designed with the intention of extracting a globally minimal solution. Further, we give the user the option to impose the requirements of chirality within the same unified framework, if needed. It has been argued in prior literature that it is desirable to ensure that chirality is satisfied only with respect to the set of camera centers (Nistér, 2004). The reason is that cameras are estimated using robust techniques from several points, so they exhibit better statistical properties than the points themselves. And a few outliers in the scene points can make it impossible to satisfy the requirements of full chirality. Similar to the quasi-affine reconstruction with respect to camera centers (QUARC) in (Nistér, 2004), a pairwise twist test ensures that the plane at infinity cannot violate the line segment joining any pair of camera centers. By subsequently imposing the condition on our reconstruction that the plane at infinity lie on one side of all the camera centers, we are guaranteed to recover a Q∗∞ consistent with the requirements of chirality.

7.4.1 Imposing rank degeneracy and positive semidefiniteness of Q∗∞

Let us suppose an appropriate objective function, f(Q∗∞), has been defined, which depends on the parameters of Q∗∞ and imposes some desired property on the metric reconstruction. In addition, we demand that the absolute quadric be rank deficient and positive semidefinite. Thus, our optimization problem is of the form:

    \min f(Q^*_\infty) \quad \text{subject to} \quad \mathrm{rank}(Q^*_\infty) < 4, \;\; Q^*_\infty \succeq 0.    (7.10)

Since Q∗∞ is a symmetric matrix, it can be parameterized using 10 variables. The condition of rank degeneracy can be imposed by demanding that det Q∗∞ = 0, which is a polynomial of degree 4. The positive semidefiniteness of Q∗∞ can be ensured by asserting that each principal minor of Q∗∞ have a non-negative determinant, which yields polynomial inequalities of degree at most 3. Thus, an equivalent problem is:

    \min f(Q^*_\infty)    (7.11)
    \text{subject to } \det(Q^*_\infty) = 0,
    \quad \det |Q^*_\infty|_{jk} \ge 0, \;\; j = 1, 2, 3, \;\; k = 1, \dots, \binom{4}{j},
    \quad \|Q^*_\infty\|_F^2 = 1,

where |Q∗∞|_jk stands for the k-th j×j principal minor of Q∗∞. Note that one need not impose all of the above inequalities for ensuring semidefiniteness, but doing so may strengthen the convex LMI relaxation. Since Q∗∞ is only defined up to a scale factor, we use the last equality constraint to fix its norm to one. The objective function in the system above is a polynomial and the constraint set consists of polynomial equalities and inequalities, so the program in (7.11) can be minimized globally using the theory in Section 4.2.4.
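The principal-minor characterization of positive semidefiniteness invoked here can be checked by brute force; the following sketch enumerates all (4 choose j) principal minors for each j and is only meant to illustrate the definition, not the LMI machinery:

```python
import numpy as np
from itertools import combinations

def principal_minors(M, j):
    """Determinants of all j x j principal minors of M, i.e. the
    submatrices whose rows and columns are indexed by the same subset."""
    n = M.shape[0]
    return [np.linalg.det(M[np.ix_(s, s)])
            for s in combinations(range(n), j)]

def is_psd_by_minors(M, tol=1e-9):
    """A symmetric matrix is PSD iff every principal minor is >= 0."""
    n = M.shape[0]
    return all(d >= -tol
               for j in range(1, n + 1)
               for d in principal_minors(M, j))

A = np.diag([2.0, 1.0, 0.5, 0.0])      # PSD, rank 3 (like the ADQ)
B = np.diag([2.0, 1.0, -0.5, 0.0])     # indefinite
print(is_psd_by_minors(A), is_psd_by_minors(B))   # True False
```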

7.4.2 Imposing chirality constraints

Next, to impose chirality constraints, recall that the plane at infinity is the null-vector of Q∗∞ and can be expressed (up to scale) as

    \pi_\infty = \left[ \det(Q^{*(1)}_\infty), \; \det(Q^{*(2)}_\infty), \; \det(Q^{*(3)}_\infty), \; \det(Q^{*(4)}_\infty) \right]^\top

where Q∗∞⁽ⁱ⁾ represents the 3×3 matrix formed by eliminating the fourth row and the i-th column of Q∗∞. The camera center is determined as

    C_i = \left[ \det(P_i^{(1)}), \; \det(P_i^{(2)}), \; \det(P_i^{(3)}), \; \det(P_i^{(4)}) \right]

where Pi⁽ʲ⁾ stands for the i-th camera in the projective reconstruction with its j-th column eliminated.

Now, chirality constraints are of the form π∞ᵀCi > 0, which is simply a polynomial of degree 3. Thus, even chirality constraints can be included as part of the polynomial system in our formulation for autocalibration.
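A small numpy sketch of these quantities; note that recovering the center as the null vector of Pi requires alternating cofactor signs on the four 3×3 minors (the unsigned determinants above agree with the null vector only up to those signs), and the plane at infinity is obtained here by SVD rather than by minors:

```python
import numpy as np

def camera_center(P):
    """Homogeneous center C of a 3x4 camera P, i.e. P C = 0, computed
    from the 3x3 minors with alternating cofactor signs."""
    return np.array([(-1) ** i * np.linalg.det(np.delete(P, i, axis=1))
                     for i in range(4)])

def satisfies_chirality(Q, cameras):
    """Chirality test of Section 7.4.2: pi_inf^T C_i has the same
    (strict) sign for all camera centers, where pi_inf is the null
    vector of the ADQ Q (pi_inf itself is defined only up to sign)."""
    pi_inf = np.linalg.svd(Q)[2][-1]
    dots = [pi_inf @ camera_center(P) for P in cameras]
    s = np.sign(dots[0])
    return all(s * d > 0 for d in dots)
```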

7.4.3 Choice of objective function

Finally, we address the question of the objective function. Over the years, several different objective functions have been specified for autocalibration. The trade-off in designing a suitable objective function, as discussed in (Nistér, 2004), is between retaining geometric meaningfulness and ensuring optimality of the recovered solution. We have looked at various choices of the objective function within our polynomial optimization framework, which are described below. In most situations, it is quite correct to assume that the skew is close to zero and the aspect ratio close to unity, while a simple transformation of the image coordinates sets the principal point to (0, 0). Let us assume we have enough prior knowledge of the camera intrinsic parameters in our motion sequence to apply a suitable transformation to the coordinate system that brings the intrinsic parameter matrices of the cameras close to identity. With this "pre-conditioning" based on prior knowledge, we can demand that an algebraic condition be satisfied by the entries of the DIAC so that K has the form diag(f, f, 1). One such objective function to be minimized is:

    f(Q^*_\infty) := \sum_i \left( ([\omega^*_i]_{11} - [\omega^*_i]_{22})^2 + [\omega^*_i]_{12}^2 + [\omega^*_i]_{13}^2 + [\omega^*_i]_{23}^2 \right).    (7.12)

Recall that ω∗i = Pi Q∗∞ Piᵀ; thus [ω∗i]_jk = p_{i,j}ᵀ Q∗∞ p_{i,k}, where p_{i,k}ᵀ stands for the k-th row of the i-th camera matrix.
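For reference, the cost (7.12) is simple to evaluate for a candidate ADQ; a minimal sketch (the optimization itself is done by the LMI machinery, not by this code):

```python
import numpy as np

def diac(P, Q):
    """DIAC of one view, equation (7.5): omega_i = P_i Q P_i^T."""
    return P @ Q @ P.T

def autocal_cost(Q, cameras):
    """Objective (7.12): penalize, per view, the deviation of the DIAC
    from the form diag(f^2, f^2, 1) expected after pre-conditioning."""
    cost = 0.0
    for P in cameras:
        w = diac(P, Q)
        cost += ((w[0, 0] - w[1, 1]) ** 2
                 + w[0, 1] ** 2 + w[0, 2] ** 2 + w[1, 2] ** 2)
    return cost
```

For metric cameras with K = diag(f, f, 1), the cost vanishes at the canonical ADQ Q∗∞ = Ĩ = diag(1, 1, 1, 0).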

Experiments with synthetic data for this objective function are described in Section 7.5 and results are tabulated in Table 7.1. This works well even when we allow focal length to vary such that K is significantly different from identity. Experimental results for this scenario are also given in Table 7.1. A problem with the above objective function is that it is not normalized to account for the scale invariance of ω∗. This can be achieved by dividing each quantity by, say

ω∗33, to obtain the following objective function:

    f(Q^*_\infty) := \sum_i \frac{([\omega^*_i]_{11} - [\omega^*_i]_{22})^2 + [\omega^*_i]_{12}^2 + [\omega^*_i]_{13}^2 + [\omega^*_i]_{23}^2}{[\omega^*_i]_{33}^2}.    (7.13)

This is a rational objective function, which can be tackled in our polynomial optimization set-up by introducing a new variable corresponding to each view. The optimization problem we address now has the form:

    \min \sum_{i=1}^{n} t_i    (7.14)
    \text{subject to } [\omega^*_i]_{33}^2 \, t_i = ([\omega^*_i]_{11} - [\omega^*_i]_{22})^2 + [\omega^*_i]_{12}^2 + [\omega^*_i]_{13}^2 + [\omega^*_i]_{23}^2,
    \quad \det(Q^*_\infty) = 0, \;\; Q^*_\infty \succeq 0.

The number of variables in the above optimization problem increases linearly with the number of views. So, it can be solved only for a relatively small number of views (three, maybe four) with the current state of the art in polynomial minimization. It is easy to impose further constraints on the problem, such as zero skew, which corresponds to a quadratic polynomial:

[ω∗i]12[ω∗i]33 = [ω∗i]13[ω∗i]23. (7.15)

Other constraints, such as the principal point at the origin and unit aspect ratio, can be similarly imposed as linear or quadratic polynomial constraints. There can be other principled approaches to formulating a polynomial objective function for autocalibration, assuming constant intrinsic parameters, such as

    f(Q^*_\infty) := \sum_i \left\| \omega^* - \lambda_i P_i Q^*_\infty P_i^\top \right\|.    (7.16)

For a projective reconstruction such that P₁ = [I | 0], the ADQ can be parametrized in terms of ω∗ and the three parameters of the plane at infinity, which leads to 10 + n variables for the n-view problem. Our experimental sections will discuss only the results obtained by using the objective functions in (7.12) and (7.14).

7.5 Experiments with synthetic data

We have subjected our algorithm to extensive simulations with synthetic, noisy data to give the reader an idea of its performance in statistical terms. The implementation is done in Matlab with the GloptiPoly toolbox (Henrion and Lasserre, 2003). An LMI relaxation order of δ = 2 (cf. Section 4.2.4) has been used throughout all the experiments and this, in general, yields a global solution to the polynomial optimization problem at hand. Let the ground truth intrinsic calibration matrix be K⁰ and the estimated matrix be K, where

    K^0 = \begin{bmatrix} f_1^0 & s^0 & u^0 \\ 0 & f_2^0 & v^0 \\ 0 & 0 & 1 \end{bmatrix} \quad \text{and} \quad K = \begin{bmatrix} f_1 & s & u \\ 0 & f_2 & v \\ 0 & 0 & 1 \end{bmatrix}.

Then we define the following metrics to evaluate the performance of the autocalibration algorithm:

    \Delta f = \left| f_1 / f_1^0 - 1 \right| + \left| f_2 / f_2^0 - 1 \right|
    \Delta r = \max\left( r / r^0, \; r^0 / r \right), \quad \text{where } r = f_1 / f_2, \; r^0 = f_1^0 / f_2^0
    \Delta p = \tfrac{1}{2}(|u| + |v|) - \tfrac{1}{2}(|u^0| + |v^0|)
    \Delta s = |s - s^0|.    (7.17)

For a calibration matrix of the form diag(f, f, 1), the ideal values of these metrics are ∆f = 0, ∆r = 1, ∆p = 0 and ∆s = 0. An additional quantity of interest is the chirality check, that is, the (percentage) number of cameras that satisfy the chirality constraints. For the synthetic experiments, there are 12 cameras and 15 points. All the points lie in the cube [−1, 1]³ of side length 2, centered at the origin, and the nominal distance of the cameras from the origin is 2. For the case of constant focal length, the ground truth intrinsic calibration matrix for each camera is K⁰ = diag(1, 1, 1). To simulate variable focal length, the intrinsic calibration matrix is of the form diag(f, f, 1), where f is allowed to attain any random value in the range [0.05, 1.00]. Noise with standard deviation 0.2% of the image size is added to the image coordinates prior to the projective factorization that forms the input to our algorithm. The objective function to be minimized is the one in (7.12). The results are tabulated in Table 7.1, where all the quantities reported are statistics over 100 trials.
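The metrics (7.17) are easy to script; the sketch below reads ∆f with absolute values so that the ideal scores are ∆f = 0, ∆r = 1, ∆p = 0, ∆s = 0, as stated above:

```python
import numpy as np

def calib_errors(K, K0):
    """Error metrics of (7.17) between an estimated intrinsic matrix K
    and the ground truth K0 (both upper triangular with K[2,2] = 1)."""
    (f1, s, u), f2, v = K[0], K[1, 1], K[1, 2]
    (f1o, so, uo), f2o, vo = K0[0], K0[1, 1], K0[1, 2]
    r, ro = f1 / f2, f1o / f2o
    return {
        "df": abs(f1 / f1o - 1) + abs(f2 / f2o - 1),
        "dr": max(r / ro, ro / r),
        "dp": 0.5 * (abs(u) + abs(v)) - 0.5 * (abs(uo) + abs(vo)),
        "ds": abs(s - so),
    }

K0 = np.diag([1.0, 1.0, 1.0])
e = calib_errors(K0, K0)          # a perfect estimate
print(e["df"], e["dr"], e["dp"], e["ds"])   # 0.0 1.0 0.0 0.0
```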

The size of the LMI relaxations used to solve for Q∗∞ is a function of the number of variables, the number of constraints and the maximum degree of the polynomials occurring in the objective function and constraints. For the optimization problem (7.12) that we solve in this chapter, none of these quantities depends on the number of views. Thus the time complexity of our algorithm is essentially constant with respect to the number of views. Note that these experiments were conducted with relatively few cameras and points, which sometimes causes the geometry to become ill-conditioned. If that happens, the algorithm can break down due to numerical instabilities, which manifest themselves as constraint violations (such as the few chirality violations above). Another way to detect this is by checking the rank of the moment matrix in the LMI relaxation, which should have rank one for globally optimal solutions. A smaller set of experiments was performed using the objective function in (7.14). This was found to perform marginally better than the objective function in (7.12) for experiments conducted with three views. But the algorithm already becomes very computationally intensive for just three views and impossible to solve within reasonable

Table 7.1: Performance on synthetic data. The numbers are mean values over 100 trials for the performance metrics defined in (7.17). For fixed focal length, K0 = diag(1, 1, 1). For variable focal length, K0 = diag(f, f, 1), f [0.05, 1.00]. 0.2% noise is added in image coordinates. The last row indicates the number∈ of experiments that failed due to numerical issues and were excluded from the results .

Quantity Fixed focal length Variable focal length Without chirality With chirality Without chirality With chirality

∆f 0.0086 0.0360 0.0069 0.0328 ∆r 1.0041 1.0315 1.0027 1.0272 ∆p 0.0073 0.0096 0.0055 0.0098 ∆s 0.0051 0.0092 0.0033 0.0050 chirality N/A 0.9867 N/A 0.9942 failed 2 2 4 1

memory limits for four or more views. The gains in terms of accuracy of solution achieved by using this theoretically more correct objective function are not enough to justify the enormous computational expense, so we will restrict ourselves to the objective function (7.12) henceforth.
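The rank-one certificate check mentioned above can be sketched numerically. This is a generic matrix-rank test, not the chapter's implementation; the function name and tolerance are illustrative:

```python
import numpy as np

def is_numerically_rank_one(M, tol=1e-6):
    """Check the optimality certificate described in the text: the moment
    matrix of the LMI relaxation should have rank one at a global optimum.
    Numerical rank is judged by the ratio of the second singular value
    to the first."""
    s = np.linalg.svd(np.asarray(M, dtype=float), compute_uv=False)
    if s[0] == 0.0:
        return False
    return len(s) == 1 or s[1] / s[0] < tol

# A moment matrix of a rank-one (Dirac) measure is v v^T for a monomial vector v.
v = np.array([1.0, 0.3, -0.7, 0.2])
print(is_numerically_rank_one(np.outer(v, v)))   # True: consistent with global optimality
print(is_numerically_rank_one(np.eye(4)))        # False: the relaxation is not tight
```

In practice the moment matrix would come from the interior point solver's output, so the tolerance must be chosen with the solver's accuracy in mind.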

7.6 Experiments with real data

In (Nistér, 2004), several image sequences for a variety of scenes are obtained with a hand-held camcorder. The number of images varies from 3 to 125. The images themselves are quite noisy and, although acquired with a constant zoom setting, auto-focus effects cause the focal length to vary across each sequence. The resulting projective factorizations were upgraded to metric using a variety of algorithms and their results visually compared. The results are rated on a scale of 0 (severely distorted) to 5 (very good metric reconstruction) according to the qualitative criteria listed in (Nistér, 2004). We evaluate the metric reconstructions obtained by our algorithm for 25 of these sequences using the same qualitative criteria as in (Nistér, 2004). These reconstructions are compared with the output of five prior state-of-the-art methods for autocalibration and

tabulated in Table 7.2. Some sample scene reconstructions using the method proposed in this chapter are depicted in Figure 7.1.

Figure 7.1: Metric reconstructions for four real sequences: (a) Flower Pot (61 views), (b) Nissan2 (89 views), (c) David (11 views), (d) Pickup (89 views). The points are plotted in white, the image planes in yellow and the optical axes in green.

Method A is the method of (Nistér, 2004), where a quasi-affine reconstruction is obtained after untwisting the cameras and a non-linear local optimization method is used to minimize an appropriate objective function which specifies some requirements on the intrinsic parameters. Method B is the technique in (Beardsley et al., 1997). Method C is the algorithm in (Hartley, 1994), which uses the full set of chirality constraints to obtain an estimate of the plane at infinity. The method of (Pollefeys et al., 1999) is used to obtain the reconstructions in Method D by minimizing a cost function based on the absolute quadric starting from a linear initialization. Method E is a modified version of (Hartley et al., 1999). More details on the individual methods A-E can be found in the

1The VRML reconstructions for these and other sequences are available at http://vision.ucsd.edu/ quadric.

Table 7.2: Performance on real data, compared to other state-of-the-art approaches. A score of 0 represents a severely distorted metric reconstruction, while 5 stands for a very good one and a “*” denotes cases where numerical errors were encountered. Method F is the algorithm described in this chapter. Method G is the same algorithm, but with chirality imposed with respect to the set of camera centers. Please see the text for references to Methods A-E and details of the qualitative evaluation criteria.

  Dataset    Views   A  B  C  D  E  F  G
  Basement     9     4  4  4  4  4  2  2
  House        9     5  3  5  5  5  5  5
  David       11     5  3  5  2  5  5  5
  ClockB      13     4  3  4  3  4  4  4
  Frode2      15     5  1  5  5  5  5  5
  Nissan1     17     5  3  5  5  5  5  5
  Stove       19     5  1  5  5  5  5  5
  Drunk       21     4  1  4  4  4  *  *
  SceneSw     23     5  2  5  3  5  5  5
  Wine        28     4  2  4  4  4  4  4
  CorrPill    35     5  1  5  5  5  5  5
  FlowPt2     43     5  2  5  5  5  5  5
  Contai1     57     5  2  -  5  5  5  5
  Nissan3     59     5  3  5  0  5  5  5
  FlowPt1     61     5  1  5  5  5  5  5
  Contai2     65     5  1  -  1  5  5  5
  FlowPt3     83     5  1  5  5  5  5  5
  StatWk2     85     5  3  5  5  5  5  5
  Nissan2     89     5  1  -  5  5  5  5
  Pickup      89     5  1  -  4  5  5  5
  Bicycles   103     5  5  5  1  5  5  5
  GirlsSt2   105     5  1  -  2  5  5  5
  StatWk1    107     5  1  5  5  5  5  5
  Volvo      117     5  1  -  1  5  5  5
  SwBre2     125     5  1  4  5  5  5  5

references above or (Nistér, 2004). Method F is the method described in this chapter, where we impose the requirement that the estimated dual quadric be positive semidefinite and rank deficient. The optimization problem is given by (7.11) and the objective function used is (7.12).

Method G is the same algorithm, but now chirality is imposed with respect to the set of camera centers. As a general observation, the algorithm performs well for a larger number of cameras. The sequence “Basement” has forward camera motion, while the sequence “Drunk” is a largely planar scene with rotational camera motion. Both these cases are ill-conditioned for our algorithm and thus the recovered structure is projectively distorted. Apart from these sequences, unless the algorithm breaks down due to a numerical instability, the metric reconstructions obtained are on par with other state-of-the-art algorithms. Note that all our experiments are performed without imposing any bounds on the camera parameters. If one explicitly enforces prior information, for instance that the aspect ratio typically lies between 0.25 and 3, the reconstruction quality can be improved even in ill-conditioned scenes such as “Basement”. Further, the reconstructions evaluated above are the raw output of our algorithm. Since interior point solvers only solve the LMI relaxation up to an ε tolerance, in practice a subsequent bundle adjustment step can be used to further refine the solution. Finally, we demonstrate the importance of global optimization in a comparison with the state-of-the-art, but locally optimal, method of (Pollefeys et al., 2002). Figure 7.2 plots the camera trajectories for a few sequences where the globally optimal method of this chapter clearly outperforms the one in (Pollefeys et al., 2002). Notice the stretching of the inter-camera baseline and the reversal of orientation of some cameras in the trajectories recovered by the local method. This kind of distortion is characteristic of an incorrectly estimated plane at infinity, which leads to a reconstruction inconsistent with the requirements of chirality (Nistér, 2001). These reconstruction artifacts are not present in the output of our algorithm, which is a direct benefit of global optimization. It is observed that, in general, such chirality violations for the local method are more common for sequences with relatively few views.

Figure 7.2: Comparison of the local method of (Pollefeys et al., 2002) with the global method proposed in this chapter, for the sequences SceneSw, ClockB and Nissan1 (local method and global method shown side by side for each sequence). It is clear that the camera trajectories recovered by the local method are sometimes sub-optimal and violate the requirements of chirality. The reconstructions using the algorithm of this chapter are devoid of those artifacts.

7.7 Conclusions

Autocalibration using the ADQ is a mature research topic, and yet none of the previously existing approaches is capable of handling many of the hard, non-convex constraints that should be imposed according to the theory of autocalibration. In this chapter, we have presented a solution that guarantees a theoretically correct estimate of the ADQ by imposing its rank deficiency and positive semidefiniteness. The resulting polynomial system is solved to its global minimum using developments in the theory of convex LMI relaxations. Experiments show that the resulting algorithm is scalable, stable and robust, with performance comparable to other state-of-the-art methods for autocalibration. At this stage, a comparison is merited between the stratified and direct approaches to autocalibration. From an optimization perspective, the goal of a direct approach is more principled, since it estimates both the plane at infinity and the internal parameters of the camera together using the ADQ, while the stratified approach estimates them sequentially. The stratified approach is important from the pedagogical viewpoint, since an affine reconstruction solves the most difficult part of the transition from a projective to a metric frame. In practice, characterizing the relative benefits of the two approaches is not straightforward, since it is difficult to define a cost function for autocalibration that is purely geometric in nature. The polynomial optimization based approach proposed in this chapter has the benefit of being practical even for a very large number of views, but this scalability is at the expense of using the theoretically inferior cost function of (7.12). The branch and bound based algorithm for stratified autocalibration is decidedly slower, but it globally minimizes well-accepted cost functions for both the affine and metric upgrade steps. To the best of our knowledge, a branch and bound solution for globally optimal direct autocalibration has not yet been achieved.
Subsequent to the work that contributed to this chapter, there have been advancements that exploit structured sparsity to globally optimize larger polynomial systems. In particular, the work of (Waki et al., 2006) is ideal for situations where the members of a large set of variables interact with another small, fixed set of variables, but not among themselves. This is precisely the case for the more principled optimization problem of (7.14), which we believe is now solvable for a greater number of views (15 to 20) using the software of (Waki et al., 2008). Note that the principle of exploiting structured sparsity is quite similar to the idea of partial relaxations proposed in (Kahl and Henrion, 2007). The most significant portions of this chapter are based on “Autocalibration via Rank-Constrained Estimation of the Absolute Quadric”, by M. K. Chandraker, S. Agarwal, F. Kahl, D. Nistér and D. J. Kriegman, as it appears in (Chandraker et al., 2007a).

Chapter 8

Bilinear Programming

“One Yang and one Yin: this is called the Tao. That which ensues from this is goodness and that which is completed thereby is the nature.”

The Tao of the Production of Things, Appendix III, I Ching (The Book of Changes)

8.1 Introduction

Bilinearity is an oft-encountered phenomenon in computer vision, since observables in a vision system commonly arise from an interaction between physical factors that are individually well-approximated by linear models (Tenenbaum and Freeman, 2000). For instance, the coordinates of an image feature are determined by a camera matrix acting on a three-dimensional point (Tomasi and Kanade, 1992). Image intensity for a Lambertian object is an inner product between the surface normal and the light source direction. An image in non-rigid structure from motion arises from a camera matrix observing a linear combination of the elements of a shape basis (Torresani et al., 2003). Thus, in its most general form, bilinear programming subsumes diverse sub-fields of computer vision such as 3D reconstruction, photometric stereo, non-rigid SFM and several others.

This chapter proposes a practical algorithm that provably obtains the globally optimal solution to a class of bilinear programs widely prevalent in computer vision applications. The algorithm constructs tight convex relaxations of the objective function and minimizes them in a branch and bound framework. For an arbitrarily small ε, the algorithm terminates with a guarantee that the solution lies within ε of the global minimum, thus providing a certificate of optimality.

One of the key contributions of this chapter is to establish, with theoretical and empirical justification, that it is possible to attain convergence with a non-exhaustive branching strategy that branches on only a particular set of variables. Note that this is different from the bounds propagation schemes proposed in the previous chapters, which are essentially exhaustive (Agarwal et al., 2006; Chandraker et al., 2007b). This has great practical significance for many computer vision problems where bilinearity arises from the interaction of a small set of variables with a much larger, independent set. For instance, one commonly represents an entity, say shape, as a linear combination of basis entities. The image formation process under this model can be understood as a bilinear interaction between a small number of camera parameters and a large number of coefficients for the basis shapes. As an illustration, we present two applications where the nature of this interaction is suitably exploited: reconstructing the 3D structure of a face from a single input image using exemplar models (Blanz and Vetter, 1999) and determining the cameras and shape coefficients in non-rigid structure from motion (Torresani et al., 2003).

In this chapter, we globally minimize bilinear programs under both the standard L2-norm corresponding to Gaussian noise and the robust L1-norm corresponding to the heavier-tailed Laplacian distribution. The convex relaxations are linear programs (LP) for the L1-norm case and second order cone programs (SOCP) for the L2-norm case, both of which are efficiently solvable by modern interior point methods (Andersen et al., 2003). A traditional counterpoint to our approach would employ linear regression followed by singular value decomposition (SVD), which is suboptimal for this rank-constrained problem in the noisy case. Our experiments clearly demonstrate that our globally optimal algorithms attain lower error rates than SVD.

The remainder of this chapter is organized as follows: Section 8.2 describes related work and Section 8.3 presents convex relaxations for the bilinear programs under consideration. Section 8.4 proposes a branching strategy to globally minimize our programs and proves convergence. Experimental results on real and synthetic data are presented in Section 8.5, while Section 8.6 concludes with a discussion and future directions.

8.2 Related Work

Bilinear problems arise in several guises in computer vision (Koenderink and van Doorn, 1997), a fact highlighted by the widespread use of singular value decomposition in diverse vision algorithms. This is, in part, due to the preponderance of linear models in our understanding of several aspects of visual phenomena, such as structure from motion (Tomasi and Kanade, 1992), illumination models (Belhumeur and Kriegman, 1996; Hallinan, 1994), color spectra (Marimont and Wandell, 1992) and 3D appearance (Murase and Nayar, 1995). The bilinear coupling between head pose and facial expression is recovered in (Bascle and Blake, 1998) for actor-driven animation. Disparate applications like typography, face pose estimation and color constancy are tackled in (Freeman and Tenenbaum, 1997) by learning and fitting bilinear models in an EM framework. A perceptual motivation for the abundance of bilinearity in vision is presented in (Tenenbaum and Freeman, 2000).

From the perspective of optimization algorithms, bilinear programming is quite well-studied (McCormick, 1976), particularly as a special case of biconvex programming (Al-Khayyal and Falk, 1983). A variety of approaches, such as cutting-plane algorithms (Konno, 1976) and reformulation linearization techniques (Sherali and Alameddine, 1992), have been proposed to solve bilinear programs. Our approach differs in exploiting structure to achieve optimality in problems which would otherwise be considered too large for global optimization.

Numerous computer vision applications involve norms besides L2. Supergaussian distributions like the Laplacian routinely arise in independent component analysis (Hyvärinen et al., 2001). The L1 norm has been used where robustness is a primary concern, for instance in (Zach et al., 2007) for range image integration. The underlying convexity or quasi-convexity of L1 and L∞ formulations of multiview geometry problems has been studied in (Agarwal et al., 2006; Kahl, 2005; Seo and Hartley, 2007). Note that bilinear programming is a special case of polynomial optimization; however, the problem sizes that concern us in this chapter are far greater than what modern polynomial solvers can handle (Henrion and Lasserre, 2003).

8.3 Formulation

To motivate the derivations and establish a consistent terminology, we will refer to a concrete example, namely reconstructing a 3D shape from a single image, given basis shapes. Modulo suitable variable reorderings, the following derivations hold for other bilinear programs too. For simplicity, we will write expressions for one coordinate of the 2D image; the extension to two coordinates is straightforward.

Let u = {u_j}_{j=1}^N be the observed image, a ∈ R^n a row of the camera matrix (n = 4 for 3D shapes) and α = {α^i}_{i=1}^m the shape coefficients corresponding to the m basis shapes B^i = {X_j^i ∈ R^n}. Then, the shape is represented as ∑_i α^i B^i and the affine imaging equation is

    u_j = a^T ∑_{i=1}^m α^i X_j^i ,   j = 1, …, N.                     (8.1)
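As a concrete sketch of the imaging model (8.1), assuming NumPy and randomly generated stand-in data (all names and sizes here are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 4, 20, 100        # point dimension, number of basis shapes, number of points

# Hypothetical basis shapes X[i, j] = X_j^i in R^n, with homogeneous coordinate 1.
X = rng.uniform(-1.0, 1.0, size=(m, N, n))
X[:, :, -1] = 1.0
a = rng.uniform(-1.0, 1.0, size=n)          # one row of the affine camera matrix
alpha = rng.uniform(0.0, 1.0, size=m)
alpha /= alpha.sum()                        # convex combination of the exemplars

# Equation (8.1): u_j = a^T sum_i alpha^i X_j^i for j = 1, ..., N
blended = np.einsum("i,ijk->jk", alpha, X)  # the shape sum_i alpha^i B^i, N x n
u = blended @ a                             # one image coordinate per point
```

In the optimization problems that follow, u is observed while a and α are the unknowns coupled bilinearly.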

8.3.1 LP relaxation for the L1-norm case

The L1-norm bilinear program to find the globally optimal camera and shape coefficients is:

    min_{a,α}  ∑_{j=1}^N | u_j − a^T ∑_{i=1}^m α^i X_j^i |             (8.2)

    subject to  a ∈ Q_a ,  α ∈ Q_α ,  G(a, α) ≥ 0,

where X_j^i ∈ R^n and Q_a, Q_α specify rectangular domains for a and α. G(a, α) represents a set of linear constraints (or constraints which can be relaxed into linear ones) on a and/or α to fix the scale of the variables.

Introducing scalar variables t_j, j = 1, …, N, an equivalent constrained optimization problem is

    min_{a,α,t}  ∑_{j=1}^N t_j ,   s.t.  t_j ≥ | u_j − ∑_{k=1}^n ∑_{i=1}^m a_k α^i X_{j,k}^i |   (8.3)

                 a ∈ Q_a ,  α ∈ Q_α ,  G(a, α) ≥ 0,

i i i where a = (a1, , an)> and X = (X , ,X )>. Note that the non-convexity in ··· j j,1 ··· j,n i the problem is now contained in the bilinear terms akα in the constraints. The convex and concave relaxations for a bilinear term xy in the domain [xl, xu] [yl, yu] are given by, × respectively, the two pairs of linear inequalities (Al-Khayyal and Falk, 1983; McCormick, 1976):

z max xly + ylx xlyl, xuy + yux xuyu (8.4) ≥ { − − }

z min xuy + ylx xuyl, xly + yux xlyu (8.5) ≤ { − − }

We will collectively refer to the four inequalities above as conv(xy) ≤ z ≤ conc(xy). Each constraint of the form t_j ≥ |·| in (8.3) can be equivalently replaced by the pair of constraints t_j ≥ (·) and t_j ≥ −(·). Substituting γ_k^i = a_k α^i, we can construct a convex relaxation of (8.3):

    min_{a,α,γ,t}  ∑_{j=1}^N t_j ,   s.t.  t_j ≥ ∑_{k=1}^n ∑_{i=1}^m X_{j,k}^i γ_k^i − u_j       (8.6)

                   t_j ≥ − ( ∑_{k=1}^n ∑_{i=1}^m X_{j,k}^i γ_k^i − u_j )

                   conv(a_k α^i) ≤ γ_k^i ≤ conc(a_k α^i)

                   a ∈ Q_a ,  α ∈ Q_α ,  G(a, α) ≥ 0.

Introducing new variables µ_j = t_j − ∑_{i,k} X_{j,k}^i γ_k^i + u_j and eliminating t_j from the resulting system of equations, the program (8.6) can be equivalently rewritten as

    min_{a,α,γ,µ}  ∑_{j=1}^N ( ∑_{k=1}^n ∑_{i=1}^m X_{j,k}^i γ_k^i − u_j + µ_j )                 (8.7)

    subject to  −2 ( ∑_{k=1}^n ∑_{i=1}^m X_{j,k}^i γ_k^i − u_j ) − µ_j ≤ 0

                µ_j ≥ 0

                conv(a_k α^i) ≤ γ_k^i ≤ conc(a_k α^i)

                a ∈ Q_a ,  α ∈ Q_α ,  G(a, α) ≥ 0.

While both (8.6) and (8.7) are linear programs, (8.6) has two “general” linear inequalities in the constraint set for each j = 1, …, N. In (8.7), one of them has been replaced by a comparison of the scalar variable µ_j to 0, which can be handled more efficiently by interior point solvers (Andersen et al., 2003). In computer vision applications where N ≫ n, the importance of this transformation is immense: it improves timings by up to an order of magnitude in some of our experiments.
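The envelope inequalities (8.4)-(8.5), on which every relaxation in this section rests, can be written down directly; a minimal sketch (the function name is ours):

```python
def mccormick(x, y, xl, xu, yl, yu):
    """McCormick envelopes of the bilinear term x*y over [xl, xu] x [yl, yu]:
    returns (conv, conc) with conv <= x*y <= conc for any (x, y) in the box.
    conv is the pointwise max of the two lower planes in (8.4); conc is the
    pointwise min of the two upper planes in (8.5)."""
    conv = max(xl * y + yl * x - xl * yl, xu * y + yu * x - xu * yu)
    conc = min(xu * y + yl * x - xu * yl, xl * y + yu * x - xl * yu)
    return conv, conc
```

The relaxation is exact at the corners of the box, e.g. mccormick(xl, yl, xl, xu, yl, yu) returns (xl*yl, xl*yl), which is why the relaxations tighten as rectangles shrink during branch and bound.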

8.3.2 SOCP relaxation for the L2-norm case

The L2-norm bilinear problem is:

    min_{a,α}  √( ∑_{j=1}^N ( u_j − a^T ∑_{i=1}^m α^i X_j^i )^2 )      (8.8)

    subject to  a ∈ Q_a ,  α ∈ Q_α ,  G(a, α) ≥ 0.

A convex relaxation for (8.8) can be easily constructed in the form of a second order cone program using the same principles as for the L1 case:

    min_{a,α,γ}  ‖ ( u_1 − ∑_{i,k} X_{1,k}^i γ_k^i ,  …,  u_N − ∑_{i,k} X_{N,k}^i γ_k^i ) ‖

    subject to  conv(a_k α^i) ≤ γ_k^i ≤ conc(a_k α^i)

                a ∈ Q_a ,  α ∈ Q_α ,  G(a, α) ≥ 0.                     (8.9)

The convex relaxations in (8.7) and (8.9) are tight and efficiently computable, so they are ideal for use in a branch and bound algorithm (Section 8.4).

8.3.3 Additional notes for the L2 case

A traditional approach to solving bilinear problems such as (8.8) might use SVD. In particular, one may substitute β_k^i = α^i a_k, for i = 1, …, m and k = 1, …, n, and solve for β_k^i by minimizing the linear least squares problem

    min_{β}  ∑_j ( u_j − ∑_{i,k} X_{j,k}^i β_k^i )^2 .                 (8.10)

Subsequently, one arranges the β_k^i in an n × m matrix β, performs an SVD and retains the first singular vector. However, this is not optimal in the noisy case, as the rank-1 structure of β is ignored by the linear regression in (8.10). Some of the experiments in Section 8.5 will underline the global optimality of our L2-norm solution by comparison against linear regression followed by SVD.

A second observation is that, for a differentiable problem like (8.8), an additional optimization is possible. Once the lower bounding obtains a feasible point, we can refine the objective function value using a gradient descent method within the rectangle under consideration. Note that this does not compromise optimality; it simply allows the best estimate in a rectangle to be pushed closer to the lower bound. Since our experiments are designed to validate the effectiveness of the branch and bound strategy, all of them are conducted without this additional optimization.
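The regression-plus-SVD baseline just described can be sketched as follows. In the noiseless case the least-squares estimate of β is already rank one, so the SVD projection recovers it exactly; the data and sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, N = 4, 6, 200

# Hypothetical rank-1 ground truth beta = a alpha^T (illustrative values).
a_true = rng.uniform(-1.0, 1.0, size=n)
alpha_true = rng.uniform(0.0, 1.0, size=m)
beta_true = np.outer(a_true, alpha_true)          # n x m, rank 1

X = rng.uniform(-1.0, 1.0, size=(N, n, m))        # X[j, k, i] = X_{j,k}^i
u = np.einsum("jki,ki->j", X, beta_true)          # noiseless observations

# Linear least squares (8.10): solve for beta, ignoring its rank-1 structure.
A = X.reshape(N, n * m)
beta_ls = np.linalg.lstsq(A, u, rcond=None)[0].reshape(n, m)

# Project onto the nearest rank-1 matrix via SVD.
U, s, Vt = np.linalg.svd(beta_ls)
beta_rank1 = s[0] * np.outer(U[:, 0], Vt[0])
```

With noisy u, beta_ls is generally full rank and this two-step estimate is no longer optimal, which is the gap the globally optimal algorithms of this chapter close.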

Finally, for an orthographic camera, orthogonality constraints of the form a^T b = 0 are bilinear equalities and can be handled using (8.4) and (8.5). Unit norm constraints such as ‖a‖ = 1, however, can only be imposed indirectly. One approach would be to consider only those rectangles Q feasible for branching that satisfy the condition min_{a∈Q} a^T a ≤ 1 ≤ max_{a∈V(Q)} a^T a, where V(Q) is the set of vertices of the rectangle Q. The left inequality is a small convex quadratic program, while the right inequality is a small enumeration. Thus, at negligible cost, rectangles which are guaranteed not to contain the ‖a‖ = 1 solution can be pruned away.
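The pruning test just described can be sketched directly: the minimum of a^T a over a box has a closed form (clamp 0 into each interval) and the maximum is attained at a vertex. The function name is ours, and the sketch assumes a small dimension since it enumerates 2^n vertices:

```python
import numpy as np
from itertools import product

def may_contain_unit_norm(lo, hi):
    """Keep a rectangle Q = prod_k [lo_k, hi_k] for branching only if
    min_{a in Q} a^T a <= 1 <= max_{a in V(Q)} a^T a, where V(Q) is the
    vertex set of Q. Rectangles failing this cannot contain ||a|| = 1."""
    lo = np.asarray(lo, dtype=float)
    hi = np.asarray(hi, dtype=float)
    # Minimum of the convex function a^T a over a box: clamp 0 into each interval.
    qmin = float(np.sum(np.clip(0.0, lo, hi) ** 2))
    # Maximum of a convex function over a box is attained at a vertex.
    qmax = max(sum(c * c for c in v) for v in product(*zip(lo, hi)))
    return qmin <= 1.0 <= qmax
```

For example, may_contain_unit_norm([0.1, 0.1], [0.5, 0.5]) is False, since the whole box lies strictly inside the unit sphere, while may_contain_unit_norm([0.0, 0.0], [1.0, 1.0]) is True.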

8.4 Branching strategy

Let us recall the branch and bound procedure and revisit some notation and terminology. Suppose f(x) is a multivariate scalar function that we wish to globally minimize over a rectangular domain Q_0. Branching refers to splitting a rectangle into two or more sub-rectangles. Let P_k be a partition of Q_0 after k branching operations. The bounding operation computes provable lower and upper bounds within each Q_i ∈ P_k, denoted by f_lb(Q_i) and f_ub(Q_i). The algorithm also maintains global lower and upper bounds for the function f after k branching operations:

    ψ_k^l = min_{Q_i ∈ P_k} f_lb(Q_i) ,    ψ_k^u = min_{Q_i ∈ P_k} f_ub(Q_i).      (8.11)

The algorithm is convergent if ψ_k^u − ψ_k^l → 0 as k → ∞, which is essentially the condition (L2) in Section 4.3. Choosing a rectangle for branching is essentially heuristic and does not affect convergence. As before, we choose the rectangle with the lowest value of f_lb(Q) and subdivide it along a particular dimension. However, the choice of branching dimension and the actual subdivision do influence convergence. Guided by the underlying problem structure, we choose a branching strategy that significantly expedites the solution. This section describes our approach and proves it to be convergent.

Suppose we wish to globally minimize a bilinear function f(x, y), where x ∈ R^n and y ∈ R^m, which might contain bilinear terms involving x_i y_j and linear terms involving x_i. Given a procedure for constructing the convex relaxations as in Section 8.3, we propose a branching strategy to minimize the bilinear objective. Let ψ_k^l and ψ_k^u be the global lower and upper bounds maintained by the branch and bound algorithm at the end of k iterations, as defined in equation (8.11). The branching strategy of the algorithm is given by:

• k = 0, P_0 = {Q_0}.

• while ψ_k^u − ψ_k^l > ε:

    – choose Q ∈ P_k such that f_lb(Q) = ψ_k^l

    – bisect the longest edge of Q that corresponds only to some x_i, to form two rectangles Q_1 and Q_2

    – P_{k+1} = {P_k \ Q} ∪ {Q_1, Q_2}

    – k = k + 1.
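As an illustration of this loop, the following sketch runs the algorithm on a toy problem: minimizing x·y over a box under the linear constraint x = y (so the true minimum, min x^2, is 0). The lower bound comes from an LP over the McCormick relaxation of Section 8.3, assuming SciPy's HiGHS-backed linprog is available; the branching bisects x-edges only. The toy constraint and all names are ours; this is not the chapter's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def lp_bound(xl, xu, yl, yu):
    """Bound one rectangle: minimize sum_i gamma_i over the McCormick
    relaxation (8.4)-(8.5) of gamma_i = x_i y_i, with the demo's linear
    constraint x = y.  Variable order: [x (n), y (n), gamma (n)]."""
    n = len(xl)
    rows, rhs = [], []
    for i in range(n):
        # (coef on x_i, coef on y_i, coef on gamma_i, right-hand side)
        for cx, cy, cg, b in [
            (yl[i], xl[i], -1.0, xl[i] * yl[i]),    # gamma >= xl*y + yl*x - xl*yl
            (yu[i], xu[i], -1.0, xu[i] * yu[i]),    # gamma >= xu*y + yu*x - xu*yu
            (-yl[i], -xu[i], 1.0, -xu[i] * yl[i]),  # gamma <= xu*y + yl*x - xu*yl
            (-yu[i], -xl[i], 1.0, -xl[i] * yu[i]),  # gamma <= xl*y + yu*x - xl*yu
        ]:
            row = np.zeros(3 * n)
            row[i], row[n + i], row[2 * n + i] = cx, cy, cg
            rows.append(row)
            rhs.append(b)
    A_eq = np.hstack([np.eye(n), -np.eye(n), np.zeros((n, n))])  # x - y = 0
    bounds = ([(xl[i], xu[i]) for i in range(n)]
              + [(yl[i], yu[i]) for i in range(n)]
              + [(None, None)] * n)
    res = linprog(np.r_[np.zeros(2 * n), np.ones(n)],
                  A_ub=np.array(rows), b_ub=np.array(rhs),
                  A_eq=A_eq, b_eq=np.zeros(n), bounds=bounds, method="highs")
    x, y = res.x[:n], res.x[n:2 * n]
    return res.fun, float(x @ y)       # lower bound; feasible value = upper bound

def branch_and_bound(xl, xu, yl, yu, eps=1e-3, max_iter=200):
    """Branch only on x-edges (the y-box is never subdivided)."""
    rects = [lp_bound(xl, xu, yl, yu) + (xl, xu)]
    for _ in range(max_iter):
        lb = min(r[0] for r in rects)          # psi^l_k
        ub = min(r[1] for r in rects)          # psi^u_k
        if ub - lb < eps:
            break
        k = min(range(len(rects)), key=lambda j: rects[j][0])
        _, _, bl, bu = rects.pop(k)            # rectangle attaining psi^l_k
        d = int(np.argmax(bu - bl))            # bisect its longest x-edge
        left_hi, right_lo = bu.copy(), bl.copy()
        left_hi[d] = right_lo[d] = 0.5 * (bl[d] + bu[d])
        rects.append(lp_bound(bl, left_hi, yl, yu) + (bl, left_hi))
        rects.append(lp_bound(right_lo, bu, yl, yu) + (right_lo, bu))
    return lb, ub

box_lo, box_hi = np.array([-1.0]), np.array([1.0])
lb, ub = branch_and_bound(box_lo, box_hi, box_lo, box_hi)
```

At the root, the relaxation gives lb = −1 while the true minimum is 0; repeatedly bisecting x-edges shrinks the gap below eps, demonstrating that branching on one variable set suffices here.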

The algorithm differs from traditional approaches in the choice of branching dimensions - we restrict the branching dimensions to those corresponding to only one set of variables that make up the bilinear forms. To borrow the terminology of (Tenenbaum and Freeman, 2000), we branch over either the style variables or the content ones. We will prove below that this branching strategy results in a convergent algorithm. The practical advantage in computer vision applications is that, typically, one set of vari- ables in a bilinear model has far fewer elements than the other. For instance, Section 8.5 exploits this fact to branch over just the camera parameters while optimizing for both the camera and the linear coefficients of a shape basis consisting of tens to hundreds of exemplars. The convergence is also empirically demonstrated in the experiments in Section 8.5.1.

Proposition 1. The above branch and bound algorithm is convergent.

We will prove the claim for bilinear functions of the form f(x, y) = ∑_{i=1}^n x_i y_i. The proof can be easily extended to more general bilinear forms by replacing y_i with linear functions of y and appending linear functions of x to f. The program we solve in, say, (8.8) contains the bilinearity in the constraint set rather than the objective; with continuity arguments, the following analysis holds for that case too.

Let the initial region be specified as Q_0 = [l_1, u_1] × ⋯ × [l_n, u_n] × [L_1, U_1] × ⋯ × [L_n, U_n], where x_i ∈ [l_i, u_i] and y_i ∈ [L_i, U_i]. Define χ(Q) to be the length of the longest edge of rectangle Q corresponding only to some x_i, that is, χ(Q) = max_i (u_i − l_i). We show that the branch and bound algorithm is convergent if the sub-division rule at each branching step is to bisect an x_i-edge of length χ(Q).

We first prove the following main result:

Lemma 1. As χ(Q) → 0, f_ub(Q) − f_lb(Q) → 0.

Proof. Let us begin with a single bilinear term x_i y_i. For ease of notation, we drop the subscript and denote x_i by x and y_i by y. The corresponding rectangle is denoted [l, u] × [L, U]. We define

    λ = ( (L − U) x + (uU − lL) ) / (u − l) ,    λ' = ( (U − L) x + (uL − lU) ) / (u − l).     (8.12)

Noting the form of the convex and concave relaxations of xy in (8.4) and (8.5), respectively, the following four cases arise:

    y ≥ λ, y ≤ λ'   ⇒   f_ub − f_lb = (u − x)(U − L)
    y ≥ λ, y ≥ λ'   ⇒   f_ub − f_lb = (u − l)(U − y)
    y ≤ λ, y ≤ λ'   ⇒   f_ub − f_lb = (u − l)(y − L)                   (8.13)
    y ≤ λ, y ≥ λ'   ⇒   f_ub − f_lb = (x − l)(U − L)

The lemma is proved for a single bilinear term if, ∀ε > 0, ∃δ > 0 such that

    u − l < δ   ⇒   f_ub − f_lb < ε ,   ∀x ∈ [l, u], ∀y ∈ [L, U].      (8.14)

And indeed, given ε > 0, choosing δ < ε / (U − L) satisfies the condition (8.14) for each of the above four cases.
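The case analysis can be checked numerically: since f_ub − f_lb is a min of two planes minus a max of two planes, the gap at any point equals the smallest of the four pairwise differences, which simplify to exactly the four case expressions, and the gap is bounded by (u − l)(U − L), which is what makes δ < ε/(U − L) work. A small check (box values are arbitrary):

```python
import random

random.seed(0)
l, u, L, U = -0.7, 1.3, -0.5, 2.0      # an arbitrary rectangle [l, u] x [L, U]

for _ in range(1000):
    x = random.uniform(l, u)
    y = random.uniform(L, U)
    conv = max(l * y + L * x - l * L, u * y + U * x - u * U)   # (8.4)
    conc = min(u * y + L * x - u * L, l * y + U * x - l * U)   # (8.5)
    gap = conc - conv
    # min(a, b) - max(c, d) = min of the four pairwise differences, which
    # simplify to the four case expressions in (8.13):
    cases = [(u - x) * (U - L), (u - l) * (U - y),
             (u - l) * (y - L), (x - l) * (U - L)]
    assert abs(gap - min(cases)) < 1e-9
    assert gap <= (u - l) * (U - L) + 1e-9   # so delta < eps/(U - L) suffices
```

The final bound also shows the gap vanishes at the corners of the rectangle, where the relaxation is tight.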

For the case where f is a sum of bilinear terms, we appeal to a result from (Horst and Tuy, 2006): let Φ_lb^i and Φ_ub^i be the convex and concave envelopes of x_i y_i in the domain [l_i, u_i] × [L_i, U_i]. Then the convex and concave envelopes of f(x, y) = ∑_i x_i y_i in the domain Q = ∏_i [l_i, u_i] × ∏_i [L_i, U_i] are given by f_lb(Q) = ∑_i Φ_lb^i and f_ub(Q) = ∑_i Φ_ub^i.

Consequently, Lemma 1 stands proved if ∀ε > 0, ∃δ > 0 such that, ∀x_i ∈ [l_i, u_i], ∀y_i ∈ [L_i, U_i],

    χ(Q) < δ   ⇒   ∑_i f_ub^i − ∑_i f_lb^i < ε.                       (8.15)

And indeed, since u_i − l_i ≤ χ(Q) for every i, choosing δ such that nδ max_i (U_i − L_i) < ε always satisfies (8.15).

With Lemma 1 proved, the convergence of our algorithm can be derived by standard arguments (Balakrishnan et al., 1991; Horst and Tuy, 2006). Our choice of branching strategy induces some minor technical differences, so Appendix C includes a short proof for completeness.

8.5 Experiments

We conduct experiments for performance evaluation with synthetic data, with and without outliers, for the optimal L2 and the optimal L1 methods, as well as the (suboptimal) linear regression followed by SVD. Experiments are also reported for two applications: face reconstruction from exemplar 3D models (with real data) and bilinear fitting for non-rigid structure from motion (with synthetic data). Both the LP and SOCP relaxations are implemented using the MOSEK solver (Andersen et al., 2003), in an unoptimized MATLAB environment1. The branch and bound terminates when the difference between the global optimum and the convex underestimator is less than a stringent tolerance of 0.001. Regardless of the problem norm, any reprojection (image) errors we report are computed as root mean square (RMS) errors.

8.5.1 Synthetic data

The purpose of these experiments is to understand the behavior of the algorithm across varying noise levels, number of exemplars and number of point correspondences. For each experiment, the exemplar models are generated randomly in the cube [−1, 1]^3, while the α^i are random numbers in [0, 1] such that ∑_i α^i = 1. The camera, which is a ∈ R^4 in equations (8.2) and (8.8), is generated randomly with each a_i ∈ [−1, 1]. Zero mean Gaussian noise of varying standard deviation is added to the image coordinates. We define:

    Reprojection error :      √( ∑_{j=1}^N (u_j − û_j)^2 / N )
    Camera error :            √( ∑_{k=1}^n (a_k − â_k)^2 / (n ‖â‖) )
    Shape coefficient error : √( ∑_{i=1}^m (α^i − α̂^i)^2 / (m ∑_i α̂^i) )
    3D error :                √( ∑_{j=1}^N (X_j − X̂_j)^2 / N )

1Prototype code at http://vision.ucsd.edu/ bilinear

Figure 8.1: Errors with varying noise levels for linear regression + SVD (black, dotted curve), the L2-norm bilinear algorithm (red, dashed curve) and the L1-norm bilinear algorithm (blue, solid curve). (a) Reprojection error (b) Error in camera entries (c) Error in shape coefficients (d) Error in 3D coordinates of the reconstructed shape. All quantities plotted are averages of 50 trials.

The hat denotes ground truth entities. All the quantitative errors that we report are averaged over 50 trials for each combination of experimental parameters.

We also compare the errors obtained by the optimal L1 and L2 algorithms to those obtained by linear regression followed by SVD. Note that the camera translation cannot be recovered by SVD, as the data needs to be centered. There is also a sign and scale ambiguity in the SVD solution, which is corrected by demanding that all the α^i are non-negative and imposing, post-optimization, the condition that ∑_i α^i = 1.
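The sign and scale correction just described can be sketched as follows. The helper name is ours; it flips the common sign so the coefficients sum to a positive value, then rescales them to sum to one, leaving the rank-1 product unchanged:

```python
import numpy as np

def fix_svd_ambiguity(a, alpha):
    """Resolve the sign/scale ambiguity of a rank-1 factorization a alpha^T:
    flip the common sign if needed, then rescale so that sum_i alpha^i = 1.
    The outer product a alpha^T is invariant under this correction."""
    a = np.asarray(a, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    if alpha.sum() < 0.0:
        a, alpha = -a, -alpha
    s = alpha.sum()
    return a * s, alpha / s
```

If noise drives individual α^i negative even after the sign flip, an additional projection onto the non-negative orthant would be needed, as the text's non-negativity demand suggests.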

Figure 8.2: Errors with varying percentage of outliers in the data for linear regression + SVD (black, dotted curve), the optimal L2-norm algorithm (red, dashed curve) and the optimal L1-norm algorithm (blue, solid curve). (a) Reprojection error (b) Camera error (c) Shape coefficient error (d) 3D error. All quantities plotted are averages of 50 trials.

In the first set of experiments, we compute the above errors with m = 20 exemplars and N = 100 points, for noise levels from 0.1% to 1% of the image size. The observations are plotted in Figure 8.1, where the L1 and L2 algorithms clearly outperform the SVD method, especially at higher noise levels. Note that the L1 algorithm does not minimize the L2 reprojection error, so it is not meaningful to compare its image error to that of the other two methods.

Figure 8.3: Time taken to converge to a pre-specified optimality gap by the optimal L2-norm algorithm (red, dashed curve) and the optimal L1-norm algorithm (blue, solid curve). (a) 10 exemplars, 100 points and varying noise levels (b) 0.5% noise, 200 points and varying number of exemplars (c) 0.5% noise, 25 exemplars and varying number of points. All quantities plotted are averages of 50 trials.

The next set of experiments is performed with outliers in the data. For these experiments, m = 20, N = 100 and 0.5% Gaussian noise was added to the image coordinates. To simulate an outlier, 10% of the image size is added to the image point. The various errors are recorded as the percentage of data points corrupted by outliers varies, and are plotted in Figure 8.2. It can be seen that the optimal L2-norm algorithm achieves a lower reprojection error than the SVD method. However, as expected, the robust L1-norm algorithm significantly outperforms the other two in terms of the geometrically meaningful camera, shape coefficient and 3D errors.

In the next three sets of experiments, we vary the noise levels, number of exemplars and number of points, and for each case record the time for the algorithms to converge to a pre-specified optimality gap (0.001). The graphs in Figure 8.3 show that the empirical performance of our algorithms scales quite gracefully with increases in problem size or difficulty. This illustrates the utility of our branching strategy, which greatly alleviates the worst case complexity of branch and bound by restricting the number of branching dimensions to a small, fixed number irrespective of problem size.

8.5.2 Applications

The experiments in this section are merely suggestive of a couple of scenarios - namely, face reconstruction using exemplar models and non-rigid structure from motion - where an accurate bilinear fitting is desirable. Both these problems are mature areas of computer vision in their own right. So, rather than as attempts to beat the state-of-the-art in these specific fields, these experiments must be viewed as demonstrations of the practicality of a fast and robust globally optimal solution in real-world applications.

Face reconstruction using 3D exemplars

We perform real data experiments on the Yale Face Database. 3D exemplar models are constructed by photometric stereo using 9 frontal images of each subject, followed by surface normal integration. The faces were roughly aligned to a common grid prior to photometric stereo using the coordinates of the eye centers and the nose tip. One of the 35 subjects is chosen for testing, while 3D models of the other 34 subjects are considered exemplars for the bilinear fitting problem. Correspondences are established between a non-frontal image of the test subject and the 3D exemplar models. (Since this is just a demonstration, we circumvent the feature matching stage and simply select 50 correspondences between the test image and a frontal image of the same subject, which gives us 2D-3D correspondences as the exemplars are on the same grid as the frontal image.) We assume the 3D representation of the input face is a convex combination of the exemplar faces, which turns out to be a good assumption in practice. It is, however, not a requirement of the algorithm, and the input shape can as well be assumed to be a linear combination of the elements of a PCA basis derived from the exemplars. The high quality reconstructions obtained by globally minimizing the L1-norm objective are displayed in Figure 8.4.

Figure 8.4: Reconstruction of human faces using 3D exemplars. (a) One of 9 frontal images used for obtaining “ground truth” from photometric stereo. (b) Non-frontal pose input image for subject 1, with correspondences marked to the frontal image. (c) “Ground truth” photometric stereo reconstruction for subject 1. (d) Reconstruction for subject 1 obtained using the L1-norm algorithm in this chapter. (e) Non-frontal pose input image for subject 2. (f) Reconstruction for subject 2. (g) Non-frontal pose input image for subject 3. (h) Reconstruction for subject 3.

Non-rigid structure from motion

A popular approach to non-rigid structure from motion is to explain the observed structure as a linear combination of elements of a rigid shape basis. Given a shape basis, we can solve the bilinear fitting problem to determine the optimal camera projection and coefficients of the linear combination. For experimental validation, we use the synthetic shark dataset from (Torresani et al., 2003), which consists of 240 images, in various camera orientations, of a non-rigid shape consisting of 91 points. A shape basis consisting of 2 elements is obtained using (Torresani et al., 2003). The algorithm of this chapter is applied to minimize reprojection error in the L2-norm. Sample scatter plots to demonstrate the goodness of the fit are displayed in Figure 8.5. The average 3D shape error for our algorithm over the whole length of the sequence is 0.9%.

(a) Frame 20 (b) Frame 50 (c) Frame 115 (d) Frame 148

Figure 8.5: 3D scatter plots demonstrating the bilinear fit obtained by the optimal L2 algorithm for representing a deforming shark as a linear combination of rigid basis shapes. Red circles indicate ground truth and blue dots indicate the reconstruction.

8.6 Discussions

In this chapter, we have proposed a globally optimal algorithm for solving bilinear programs in computer vision, under the standard L2-norm as well as the robust L1-norm. The driving forces of the algorithm are efficient convex relaxations and a branch and bound framework that exploits the problem structure to restrict the branching dimensions to just one set of variables in the bilinear program.

Our branching strategy is provably convergent and particularly advantageous in computer vision problems where the cardinality of one set of variables is much smaller than that of the other. Note that the uniform convergence requirement proved in Lemma 1 is non-trivial - several popular lower bounding schemes, such as LMI relaxations (Lasserre, 2001) or sum of squares decompositions (Parrilo, 2003), do not possess this property and thus cannot be used in a branch and bound framework. One avenue for future research is extension to multilinearity. While multilinear expressions can always be broken down into a series of bilinear ones, a more sophisticated approach to address trilinearity has a few applications in computer vision. For instance, it would allow texture (or reflectance) to be incorporated into the exemplar-based face reconstruction problem.

This chapter is based on “Globally Optimal Bilinear Programming for Computer Vision Applications”, by M. K. Chandraker and D. J. Kriegman, as it appears in (Chandraker and Kriegman, 2008).

Chapter 9

Line SFM Using Stereo

“Straight? What is straight? A line can be straight, or a street, but the human heart, oh, no, it’s curved like a road through mountains.”

Tennessee Williams (American playwright, 1911-1983), A Streetcar Named Desire

9.1 Introduction

In the final chapter of this dissertation, let us return to the scenario we envisioned in Chapter 1, namely, a robot autonomously navigating in the real world. To realize this dream, the classic computer vision problem of simultaneous estimation of the structure of a scene and the motion of a camera rig using only visual input data must be solved accurately, reliably and swiftly. This chapter describes our ongoing endeavor to achieve that goal.

As outlined in Section 3.1, the first step of any structure from motion algorithm is to detect and match “interest points” as the input to a sparse scene reconstruction algorithm. Undoubtedly, corners are the primary features of interest in most applications, since they abound in natural scenes. Consequently, significant research efforts in recent times have been devoted to sparse 3D reconstructions from points (Pollefeys et al., 2008). However, it is often the case, especially in man-made scenes like office interiors



Figure 9.1: Camera motion estimated using the line-based structure-from-motion algorithms proposed in this chapter, with narrow baseline stereo and without any bundle adjustment. The scene is an office environment, the length of the loop is about 50 meters and the entire system runs at 10 frames per second. The camera starts inside a room, moves outside a door and returns to the room through the same door. Notice the accuracy in closing the loop (about 2% accumulated error after 1600 frames) and in recovering challenging purely forward motions. (a) The left column shows an intermediate snapshot of line tracks and the recovered structure and motion at that stage. (b) The right column shows the top and side view of the recovered motion for the entire loop.

or urban cityscapes, that not enough point features can be reliably detected, while an abundant number of lines are visible. While structure from motion using lines has, in the past, received considerable attention in its own right, not much of it has been directed towards developing algorithms usable in a robust real-time system. On the other hand, important breakthroughs in solving minimal problems for points, such as (Nistér, 2004; Nistér and Stewénius, 2006), have led to the emergence of efficient point-based reconstruction methods.

This chapter proposes algorithms for estimating structure and motion using lines which are ideal for use in a hypothesize-and-test framework (Fischler and Bolles, 1981) crucial for designing a robust, real-time system. In deference to our intended application of autonomous navigation, the visual input for our algorithms stems from a calibrated stereo pair, rather than a monocular camera. This has the immediate advantage that only two stereo views (as opposed to three monocular views) are sufficient to recover the structure and motion using lines. The cost functions and representations we use in this chapter are adapted to exploit the benefits of stereo input.
Our entire system, although not fully optimized yet, captures stereo images, detects straight lines, tracks them and estimates the structure and motion in complex scenes at 10 frames per second.

Besides their availability in man-made scenes, an advantage of using lines is accurate localization due to their multi-pixel support. However, several prior works rely on using finite line segments for structure and motion estimation, which has the drawback that the quality of reconstruction is limited by the accuracy of detecting end-points. In the worst case scenario, such as a cluttered environment where lines are often occluded, the end-points might be false corners or unstable T-junctions. Our approach largely alleviates these issues by using infinite lines as features and employing a problem formulation that depends on backprojected planes rather than reprojected end-points.

Commonly used multifocal tensor based approaches involve extensive book-keeping and use a larger number of line correspondences to produce a linear estimate, which can be arbitrarily far from a rigid-body motion in the presence of noise. In contrast, our solutions use only two or three lines and can account for orthonormality constraints on the recovered motion. Rather than resorting to a brute force nonlinear minimization, we reduce the problem to a small polynomial system of low degree, which can be solved expeditiously.

In theory, the use of calibrated stereo cameras also makes it possible to come up with simpler solutions that first reconstruct the 3D lines for each stereo pair and then compute a rotation and translation to align them. However, as we discuss in Section 9.3.1 and demonstrate in Section 9.5, aligning the noisy 3D lines obtained from narrow baseline stereo is especially prone to errors in real-life scenarios.
Moreover, explicit reconstruction followed by alignment breaks down when scene lines lie close to the epipolar plane, while our methods are comparatively robust to such cases. In summary, the main contributions of this chapter are:

• Algorithms for estimating structure and motion in a stereo setup using a few lines, which are fast enough to be used in a hypothesize-and-test framework and more robust than traditional approaches.

• A system to detect straight lines, track them and estimate structure and motion in a RANSAC framework (with optional nonlinear refinement) at very fast rates.

We begin with a brief overview of relevant prior work in Section 9.2. Our algorithms for estimating structure and motion are described in Section 9.3, while the system details such as line detection and tracking are provided in Section 9.4. Synthetic and real data experiments are presented in Section 9.5 and we finish with discussions in Section 9.6.

9.2 Related Work

It can be argued that points provide “stronger” constraints on a reconstruction than lines (Hartley, 1997). For instance, an analog of the eight-point algorithm for two-view reconstruction using lines requires three views and thirteen correspondences (Liu and Huang, 1988). Yet, using lines as primary features is beneficial in some scenarios, so there has been tremendous interest in the problem.

Multifocal tensors are a common approach to estimate the structure of a scene from several line correspondences. A linear algorithm for estimating the trifocal tensor from three views and thirteen correspondences was presented in (Hartley, 1997), while the properties of the quadrifocal tensor are elucidated in (Shashua and Wolf, 2000). It is apparent from these works that the book-keeping required to enforce non-linear dependencies within the tensor indices can be substantial (Hartley, 1998b; Heyden, 1995). Consequently, approaches such as (Comport et al., 2007), which compute the quadrifocal tensor between two calibrated stereo pairs, are too cumbersome for estimating just a 6-dof motion.

An algebraic construct, called the 3D line motion matrix, is used to align the Plücker coordinates of two line reconstructions in (Bartoli and Sturm, 2004); however, rigid body constraints can be accounted for only through nonlinear minimization. Extended Kalman Filter based techniques for estimating incremental displacements from stereo frames have been proposed (Zhang and Faugeras, 1992), but the inherent linearization assumption might lead to inaccuracies.

For various reasons, it is not reliable to use end-points for designing a structure and motion algorithm with lines. The issue of representing 3D lines is addressed in detail in (Bartoli and Sturm, 2005), where an orthonormal representation well-suited for bundle adjustment is proposed.
The method in (Taylor and Kriegman, 1995) minimizes a notion of reprojection error in the image plane which is robust to variable end-points across views. However, the cost function can be optimized only by iterative local minimization, which requires initialization and is not amenable to a real-time implementation. Further, a minimum of six line correspondences in three views are required, which can be burdensome for a hypothesize-and-test framework. A similar procedure for 3D reconstruction of urban environments is presented in (Schindler et al., 2006).

In recent years, there has been significant progress in the development of visual odometry systems (Nistér et al., 2004), which rely on real-time solutions to important minimal problems for point-based structure-from-motion, such as 5-point relative orientation (Nistér, 2004; Stewénius et al., 2006) or 3-point generalized pose (Nistér and Stewénius, 2006). Similar to those works, our focus is to develop an efficient solution with lines that can be used in a RANSAC framework, but this leads to an over-determined system of polynomials. An over-constrained system, albeit an easier one, is solved in (Quan and Lan, 1999) for the pose estimation problem.

9.3 Structure and Motion Using Lines

Unless stated otherwise, 3D points are denoted as homogeneous column vectors $X \in \mathbb{R}^4$ and 2D image points by homogeneous column vectors $x \in \mathbb{R}^3$. The perspective camera matrix is represented by a $3 \times 4$ matrix $P = K[R \,|\, t]$, where $K$ is the $3 \times 3$ upper triangular matrix that encodes the internal parameters of the camera and $(R, t)$ represents the exterior orientation. We refer the reader to Chapter 2 for details of projective geometry and in particular, Section 2.3.2 for the background on 3D line geometry.

9.3.1 A Simple Solution?

Since we use calibrated stereo cameras, a first approach to the problem would naturally be to reconstruct the 3D lines and align them. There are several different ways to achieve this; one possibility is the following. First, reconstruct the 3D lines in the frame of each stereo pair - this is easily achieved by intersecting the planes obtained by backprojection of each corresponding pair of 2D lines. Next, the rotation that aligns the 3D lines is computed, for instance, by first aligning the mean directions of the lines and then computing the residual in-plane rotation. Finally, the translation can be computed by aligning the points that lie midway between the 3D lines in the two rotationally aligned frames.

The attractions of such approaches are that they are simple and geometrically intuitive. However, the quality of reconstruction obtained from such methods is critically dependent on the accuracy of the reconstructed 3D lines, which is known to be quite poor for the practically important case of narrow baseline stereo. Moreover, if some scene lines lie close to the epipolar plane, then such strategies are liable to break down. The algorithms that we describe next perform robustly in real-world scenes where situations such as these commonly arise. The need for stereo-based line reconstruction algorithms that look beyond the simple multistep reconstruction and alignment approach is also empirically borne out by the experiments in Section 9.5.
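As a concrete illustration of the rotation-alignment step, the rotation between two sets of corresponding line directions can be recovered with the SVD-based orthogonal Procrustes (Kabsch) solve. This is a stand-in for the mean-direction heuristic described above, assuming consistently oriented (noiseless) direction vectors; all names are our own.

```python
import numpy as np

def align_directions(d_src, d_dst):
    """Proper rotation R with R @ d_src[i] ~ d_dst[i] (Kabsch/Procrustes).

    d_src, d_dst: (n, 3) arrays of unit line directions, assumed to be
    consistently oriented. Illustrative sketch of the 'reconstruct then
    align' baseline, not the algorithm proposed in this chapter.
    """
    H = d_src.T @ d_dst                    # 3x3 correlation matrix
    U, _, Vt = np.linalg.svd(H)
    # Sign correction guarantees det(R) = +1 (a rotation, not a reflection).
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    return Vt.T @ D @ U.T

# Noiseless check: recover a known rotation from six line directions.
rng = np.random.default_rng(1)
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R_true = q * np.sign(np.linalg.det(q))     # force a proper rotation
d = rng.normal(size=(6, 3))
d /= np.linalg.norm(d, axis=1, keepdims=True)
R_est = align_directions(d, d @ R_true.T)
```

In the noiseless, well-distributed case the recovered rotation is exact; the text above explains why this simple strategy degrades badly for narrow-baseline, noisy 3D lines.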

9.3.2 Geometry of the Problem

We will assume that the stereo pair is calibrated, so the intrinsic parameter matrices, $K$, can be assumed to be identity. The left and right stereo cameras can be parametrized as $P_1 = [I \,|\, 0]$ and $P_2 = [R_0 \,|\, t_0]$, where $R_0$ is a known rotation matrix and $t_0$ is a known translation vector. Then, the camera matrices in a stereo pair related by a rigid body motion $(R, t)$ are given by $P_3 = [R \,|\, t]$ and $P_4 = [R_0 R \,|\, R_0 t + t_0]$.

The back-projected plane through the camera center and containing the 3D line $L$, imaged as a 2D line $l$ by a camera $P$, is given by $\pi = P^\top l$. Suppose $l_1, l_2, l_3$ and $l_4$ are images of the same 3D line in the four cameras $P_1, \ldots, P_4$. See Figure 9.2. Then, the four back-projected planes, stacked row-wise, form a $4 \times 4$ matrix which is the primary geometric object of interest for our algorithms:
$$
W = \begin{bmatrix} \pi_1^\top \\ \pi_2^\top \\ \pi_3^\top \\ \pi_4^\top \end{bmatrix}
  = \begin{bmatrix}
      l_1^\top & 0 \\
      l_2^\top R_0 & l_2^\top t_0 \\
      l_3^\top R & l_3^\top t \\
      l_4^\top (R_0 R) & l_4^\top (R_0 t + t_0)
    \end{bmatrix} \qquad (9.1)
$$

Since the four backprojected planes in Figure 9.2 intersect in a straight line, there are two independent points $X_1$ and $X_2$ that satisfy $\pi_i^\top X_j = 0$, for $i = 1, 2, 3, 4$ and $j = 1, 2$. Consequently, the matrix $W$ has a 2D null-space, that is, rank 2, when the lines in the 4 images correspond.
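The rank-2 property of $W$ is easy to verify numerically. The sketch below builds a synthetic configuration (all numbers are arbitrary illustrative choices, not values from this chapter's experiments), images a 3D line in all four cameras, forms $W$ as in (9.1) and inspects its singular values.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def rodrigues(w):
    """Rotation matrix from an axis-angle vector (Rodrigues' formula)."""
    th = np.linalg.norm(w)
    if th < 1e-12:
        return np.eye(3)
    K = skew(w / th)
    return np.eye(3) + np.sin(th) * K + (1 - np.cos(th)) * (K @ K)

def plane_matrix(lines, cams):
    """Stack the back-projected planes pi_i = P_i^T l_i into the 4x4 W."""
    return np.stack([P.T @ l for l, P in zip(lines, cams)])

# Two stereo pairs related by a rigid motion (R, t); baseline (R0, t0).
R0, t0 = rodrigues(np.array([0.0, 0.02, 0.0])), np.array([0.1, 0.0, 0.0])
R, t = rodrigues(np.array([0.1, -0.05, 0.08])), np.array([0.3, 0.1, -0.2])
cams = [np.hstack([np.eye(3), np.zeros((3, 1))]),
        np.hstack([R0, t0[:, None]]),
        np.hstack([R, t[:, None]]),
        np.hstack([R0 @ R, (R0 @ t + t0)[:, None]])]

# A 3D line through two homogeneous points, imaged as l = x_a x x_b.
A, B = np.array([0.2, -0.3, 3.0, 1.0]), np.array([-0.4, 0.5, 4.0, 1.0])
lines = [np.cross(P @ A, P @ B) for P in cams]
s = np.linalg.svd(plane_matrix(lines, cams), compute_uv=False)
# s[2]/s[0] is effectively zero: W has numerical rank 2.
```

Each back-projected plane annihilates both $A$ and $B$ (since $(P A \times P B)^\top P A = 0$), so the null-space of $W$ contains the two points spanning the line, exactly as the text argues.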

9.3.3 Linear Solution

Asserting that $W$ has rank 2 is equivalent to demanding that each of its $3 \times 3$ sub-determinants vanish, which amounts to four independent constraints. One way to extract these constraints is to perform two steps of Gauss-Jordan elimination (using


Figure 9.2: Geometry of image formation. If the 4 image lines in the 4 cameras of the 2 stereo pairs actually correspond, the back-projected planes joining the respective camera centers to these straight lines intersect in a straight line (the 3D line which was imaged by the 2 stereo pairs).

Householder rotations) on the matrix $W^\top$, to get a system of the form
$$
W_{gj} = \begin{bmatrix}
\times & 0 & 0 & 0 \\
\times & \times & 0 & 0 \\
\times & \times & f_1(R, t) & f_2(R, t) \\
\times & \times & f_3(R, t) & f_4(R, t)
\end{bmatrix} \qquad (9.2)
$$
where the $f_i(R, t)$ are affine functions of $R$ and $t$. Since the rank of a matrix is preserved by elementary operations, the matrix $W_{gj}$ must also be rank 2 and thus, its third and fourth rows are linear combinations of the first two. It easily follows that
$$
f_i(R, t) = 0, \quad i = 1, 2, 3, 4 \qquad (9.3)
$$
Thus, for $n \geq 3$, a linear estimate for the motion can then be obtained as the solution to a $4n \times 12$ system of the form $A v = b$, formed by stacking up the linear constraints $f_i$, where $v = [r_1^\top\, r_2^\top\, r_3^\top\, t^\top]^\top$ and $r_1, r_2, r_3$ are the columns of the rotation matrix.

Note that the algebraic degeneracies of our cost function are apparent here, namely, when the $(1, 1)$ entry is zero or the determinant of the first $2 \times 2$ sub-matrix vanishes. In practice, it is possible to detect these situations from the input data and perform column reordering to avoid the degeneracies. The drawbacks that such a linear procedure suffers from in the presence of noise are similar to those of, say, the DLT algorithm for estimating the fundamental matrix using eight points. In particular, since orthonormality constraints have been ignored, the solution can be arbitrarily far from a rigid body motion.

9.3.4 Efficient Solutions for Orthonormality

In the frequently encountered case of narrow baseline stereo with nearly parallel camera principal axes, an additional drawback is that the matrix $A$ is close to rank-deficient, which makes its inversion unstable. A more practical approach is to work with a low-rank projection of $A$, that is, to assume the solution lies in the space spanned by the last $k$ singular vectors of $[A, -b]$. While more complex than the prior methods, the following method has the advantage of being able to impose orthonormality for general camera motions. We express the desired solution as a linear combination of the singular vectors, call them $v_i$:
$$
\begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ t \\ 1 \end{bmatrix}
= x_1 v_1 + \cdots + x_k v_k \qquad (9.4)
$$
and the problem reduces to determining the coefficients $x_1, \ldots, x_k$ of the above linear combination, subject to orthonormality conditions:
$$
\|r_1\|^2 = 1, \quad \|r_2\|^2 = 1, \quad \|r_3\|^2 = 1 \qquad (9.5)
$$
$$
r_1^\top r_2 = 0, \quad r_2^\top r_3 = 0, \quad r_3^\top r_1 = 0. \qquad (9.6)
$$

Substituting for $(R, t)$ from (9.4) in (9.5) and (9.6), we get six polynomials of degree 2 in the $k$ variables $x_1, \ldots, x_k$. This system of polynomial equations will have no solution in the general, noisy case and we must instead resort to a principled “least-squares” approach to extract the solution.

A Global Optimization Approach

Let the six quadratic polynomials in (9.5) and (9.6) be $Q_i(x_1, \ldots, x_k)$, for $i = 1, \ldots, 6$. Then, a straightforward “least squares” approach is to solve
$$
\min_{x_1, \ldots, x_k} \; \sum_{i=1}^{6} Q_i^2(x_1, \ldots, x_k) \qquad (9.7)
$$
where the objective function is a polynomial of degree 4 in $k$ variables. The global optimum for this unconstrained minimization problem can be found using an LMI relaxation based polynomial solver (see Section 4.2.4), such as GloptiPoly (Henrion and Lasserre, 2003). In general, greater numerical precision and speed can be obtained by reducing the degree of the system and the number of variables. One way to do so is to drop the scale, that is, divide each equation in (9.4) by $x_k$ and replace the unit norm constraints in (9.5) by equal norm constraints:

$$
\|r_1\|^2 - \|r_2\|^2 = 0, \quad \|r_2\|^2 - \|r_3\|^2 = 0. \qquad (9.8)
$$
Now, the five polynomials in (9.8) and (9.6) have $k - 1$ variables each; call them $q_i(x_1, \ldots, x_{k-1})$, for $i = 1, \ldots, 5$. Then, the corresponding reduced least squares problem is to minimize
$$
q(x_1, \ldots, x_{k-1}) = \sum_{i=1}^{5} q_i^2(x_1, \ldots, x_{k-1}). \qquad (9.9)
$$
A different polynomial system can be obtained by observing that, at the minimum, this optimization problem is equivalent to solving the following feasibility search

$$
\text{find } x_1, \ldots, x_{k-1} \quad \text{s.t.} \quad \frac{\partial q}{\partial x_1} = 0, \; \ldots, \; \frac{\partial q}{\partial x_{k-1}} = 0 \qquad (9.10)
$$
which is a system of $k - 1$ polynomials of degree 3 in $k - 1$ variables. Once a solution for $x_1, \ldots, x_{k-1}$ is obtained, the scale factor $x_k$ can be recovered by imposing the unit determinant condition on $R$ as a post-processing step.
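As a sanity check on this construction, the six orthonormality residuals of (9.5) and (9.6) can be evaluated at a candidate coefficient vector. The numpy sketch below uses a synthetic basis with the true motion embedded in the first basis vector, rather than the actual singular vectors of $[A, -b]$; all names are our own.

```python
import numpy as np

def quadratic_constraints(x, V):
    """Evaluate the six orthonormality residuals of (9.5)-(9.6) at x.

    V is a (13, k) basis whose columns play the role of the vectors v_i
    in (9.4); here it is synthetic, for illustration only.
    """
    sol = V @ x                              # stacked [r1; r2; r3; t; 1]
    r1, r2, r3 = sol[0:3], sol[3:6], sol[6:9]
    return np.array([r1 @ r1 - 1.0, r2 @ r2 - 1.0, r3 @ r3 - 1.0,
                     r1 @ r2, r2 @ r3, r3 @ r1])

# Embed a ground-truth rigid motion in the first basis vector.
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = q * np.sign(np.linalg.det(q))            # force a proper rotation
t = rng.normal(size=3)
v_true = np.concatenate([R[:, 0], R[:, 1], R[:, 2], t, [1.0]])
V = np.column_stack([v_true] + [rng.normal(size=13) for _ in range(3)])

res = quadratic_constraints(np.array([1.0, 0.0, 0.0, 0.0]), V)
# res vanishes (to machine precision) at the true motion's coefficients.
```

The residuals vanish exactly at the coefficients of the true motion and are large at a generic coefficient vector, which is what makes (9.7) and (9.9) meaningful least-squares objectives.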

An Efficient Solution

In the following, we will restrict our attention to the scale-free case in (9.9); the case of fixed scale is a straightforward extension. For ease of exposition, we will assume $k = 4$, but the following can be easily extended to other numbers of variables too. Then, each of the five polynomial equations consists of the ten monomials $[x_1^2, x_2^2, x_1 x_2, x_1, x_2, x_1 x_3, x_2 x_3, x_3^2, x_3, 1]$. Since relatively low degree polynomials in a single variable can be solved very fast using methods like Sturm sequences (Gellert et al., 1975), we will attempt to solve for a single variable first, say $x_3$. Then, each of the polynomials $q_i(x_1, x_2, x_3)$ can be rewritten as

$$
c_1 x_1^2 + c_2 x_2^2 + c_3 x_1 x_2 + [1]\, x_1 + [1]\, x_2 + [2] = 0 \qquad (9.11)
$$
where we use the notation $[n]$ to denote some polynomial of degree $n$ in the single variable $x_3$. Our system of polynomials, written out in a matrix format, now has the form
$$
\begin{bmatrix}
c_1 & c_2 & c_3 & [1] & [1] & [2] \\
c_1' & c_2' & c_3' & [1] & [1] & [2] \\
c_1'' & c_2'' & c_3'' & [1] & [1] & [2] \\
c_1''' & c_2''' & c_3''' & [1] & [1] & [2] \\
c_1'''' & c_2'''' & c_3'''' & [1] & [1] & [2]
\end{bmatrix}
\begin{bmatrix} x_1^2 \\ x_2^2 \\ x_1 x_2 \\ x_1 \\ x_2 \\ 1 \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}. \qquad (9.12)
$$

Let the $5 \times 6$ matrix above be denoted as $G$. Then, the $i$-th component of its null-vector, denoted as $u = [x_1^2, x_2^2, x_1 x_2, x_1, x_2, 1]^\top$, can be obtained as
$$
u_i = (-1)^{i-1} \det(\hat{G}_i) \qquad (9.13)
$$
where $\hat{G}_i$ stands for the $5 \times 5$ matrix obtained by deleting the $i$-th column of $G$. Thus, the vector $u$ can be obtained, up to scale, with each of its components a polynomial of degree 4 (for $x_1^2, x_2^2, x_1 x_2$), 3 (for $x_1, x_2$) or 2 (for $1$) in the single variable $x_3$.

Now, we note that the components of the vector $u$ are not all independent. In fact, in the noiseless case, they must satisfy three constraints:
$$
\begin{aligned}
{[u_4 u_5 = u_3 u_6]} &: \; (x_1) \times (x_2) = (x_1 x_2) \times (1) \\
{[u_4^2 = u_1 u_6]} &: \; (x_1) \times (x_1) = (x_1^2) \times (1) \\
{[u_5^2 = u_2 u_6]} &: \; (x_2) \times (x_2) = (x_2^2) \times (1)
\end{aligned} \qquad (9.14)
$$

Substituting the expressions for $u$ obtained from (9.13), we obtain three polynomials of degree 6 in the single variable $x_3$. Let us call these polynomials $p_i(x_3)$, $i = 1, 2, 3$. Then, we have a system of three univariate polynomials that must be solved in a “least-squares” sense. We approach this as an unconstrained optimization problem

$$
\min_{x_3} \; p_1^2(x_3) + p_2^2(x_3) + p_3^2(x_3). \qquad (9.15)
$$

At the minimum, the first-order derivative of the above objective function must vanish, so the optimal $x_3$ is a root of the polynomial

$$
p(x_3) = p_1(x_3)\, p_1'(x_3) + p_2(x_3)\, p_2'(x_3) + p_3(x_3)\, p_3'(x_3). \qquad (9.16)
$$

Note that $p(x_3)$ is a degree 11 univariate polynomial, whose roots can be determined very fast in practice. There can be up to 11 real roots of $p(x_3)$, all of which must be tested as candidate solutions. Also worth noting is that the odd degree of $p(x_3)$ guarantees at least one real solution.
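This univariate step can be sketched with numpy's polynomial utilities. In the sketch below, three random degree-6 polynomials stand in for the actual $p_i(x_3)$ of (9.14); the function name is our own.

```python
import numpy as np

def best_root(p_list):
    """Pick the real critical point minimizing sum_i p_i(x)^2.

    p_list: numpy coefficient arrays (highest degree first) standing in
    for the three degree-6 polynomials p_i(x3). Forms p = sum_i p_i p_i'
    as in (9.16), takes its real roots and returns the cheapest one.
    """
    p = np.zeros(1)
    for pi in p_list:
        p = np.polyadd(p, np.polymul(pi, np.polyder(pi)))
    roots = np.roots(p)                            # degree 11: >= 1 real root
    real = roots[np.abs(roots.imag) < 1e-8].real
    cost = lambda x: sum(np.polyval(pi, x) ** 2 for pi in p_list)
    return min(real, key=cost)

rng = np.random.default_rng(0)
p_list = [rng.normal(size=7) for _ in range(3)]    # random degree-6 stand-ins
x3 = best_root(p_list)
```

Since the objective is a coercive degree-12 polynomial, its global minimizer is among the critical points, so testing all real roots of $p(x_3)$ and keeping the cheapest one is sufficient.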

Finally, once the candidates for $x_3$ have been obtained, the corresponding candidates for $x_1$ and $x_2$ are obtained by substituting the value of $x_3$ in the expression for $u$ and reading off the values $x_1 = u_4 / u_6$ and $x_2 = u_5 / u_6$. The rotation and translation can now be recovered, up to a common scale factor, by substituting in (9.4). We can fix the scale by, for instance, fixing the determinant of $R$ to 1.

9.3.5 Solution for Incremental Motion

For our structure from motion application, the input is a high frame-rate video sequence, so it is reasonable to assume that the motion between subsequent frames is very small. Approximating the incremental displacement along the manifold of rotation matrices by that along its tangent space (the $\mathfrak{so}(3)$ space of $3 \times 3$ skew-symmetric matrices), the incremental rotation can be parametrized as $R \approx I + [s]_\times$, where $s \in \mathbb{R}^3$. Now, $n$ lines give $8n$ linear constraints of the form (9.3) on the 6 unknowns $(s_1, s_2, s_3, t)$, which can be estimated using a singular value decomposition (SVD).
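A hedged sketch of this linearized solve follows, with a random synthetic constraint matrix standing in for the actual stacked line constraints, and with the linear estimate $I + [s]_\times$ projected back onto a rotation via SVD; all names and numbers are our own.

```python
import numpy as np

def solve_incremental(M, b):
    """Least-squares solve of stacked constraints M @ (s, t) = b.

    M is (8n, 6): the first three unknowns parametrize the incremental
    rotation R ~ I + [s]_x, the last three the translation. The linear
    rotation estimate is projected back onto SO(3) via SVD.
    """
    z, *_ = np.linalg.lstsq(M, b, rcond=None)
    s, t = z[:3], z[3:]
    Rlin = np.eye(3) + np.array([[0.0, -s[2], s[1]],
                                 [s[2], 0.0, -s[0]],
                                 [-s[1], s[0], 0.0]])
    U, _, Vt = np.linalg.svd(Rlin)
    # Sign correction keeps det(R) = +1 (nearest proper rotation).
    R = U @ np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))]) @ Vt
    return R, t

rng = np.random.default_rng(0)
z_true = np.array([0.01, -0.02, 0.015, 0.05, -0.03, 0.02])  # small (s, t)
M = rng.normal(size=(24, 6))        # n = 3 lines -> 8n = 24 constraints
R, t = solve_incremental(M, M @ z_true)
```

For small inter-frame motion the first-order rotation is already very close to a rotation matrix, so the SVD projection changes it only marginally.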

9.3.6 A Note on Number of Lines

The careful reader might notice that two lines are the minimum required to fix the six degrees of freedom of the motion of the stereo pair. Indeed, all the solutions proposed in the preceding sections (except, of course, the linear solution of Section 9.3.3) can work with just two lines. For a hypothesize-and-test framework, a solution that relies on the least possible amount of data is indispensable to minimize computational complexity. However, given the narrow baseline configuration we use in practice, it is desirable to use more than the minimum number of lines to enhance robustness. While we demonstrate the possibility of using only two lines in synthetic experiments, the real data experiments that we report use three lines within the RANSAC implementation. Although using even more lines would increase robustness, the trade-off would be a more expensive RANSAC procedure.

9.4 System Details

9.4.1 Line Detection, Matching and Tracking

The success of any SFM system critically depends on the quality of feature detection and tracking. To detect lines, we first construct Sobel edge images and then


Figure 9.3: Line detection and tracking. (a) After non-maximal suppression. (b) Line segments in initial cells (level-0). (c) Line segments in level-2 cells. (d) Final line segments. (e) Detected lines after merging cells and filtering short line segments (shown with line id’s). (f) Tracked (green) and newly detected (red) line segments after 15 frames.

apply non-maximal suppression to accurately locate the peak/ridge location of both strong and weak edges. The following non-maximal suppression method is used to build the edge map:

$$
I_x = \max_{dx} \; \sigma\!\left( \frac{S_x - (S_{x+dx} + S_{x-dx})/2}{\max\big( (S_{x+dx} + S_{x-dx})/2, \; \theta_{edge} \big)}; \; s, m \right)
$$
where $S_x$ is the Sobel edge response at a pixel $x = (x, y)^\top$, $\sigma(x; s, m) = \frac{1}{1 + \exp(-(x - m)/s)}$ is a sigmoid function, $dx \in \{(1, 0)^\top, (0, 1)^\top, (1, 1)^\top, (1, -1)^\top\}$ is the testing direction (Figure 9.3 a) and $\theta_{edge}$ is a free parameter.

Well-known straight line detection algorithms like the Hough transform or its variants do not perform well in complex indoor scenes. Instead, motivated by divide-and-conquer schemes, we divide the edge map into small $8 \times 8$-pixel cells (Fig. 9.3 b) and detect edge blobs in each cell. To decide if an edge blob is likely to be a line segment, we build the convex hull of the points and check its thickness, which is approximated by its area divided by the distance between the two farthest points in the blob. We maintain the convex hull and the mean and covariance of the point coordinates for each line segment. When two line segments are merged, this information can be combined without having to refer to the original points. At the next level, each cell contains 4 sub-cells of the previous level. In each cell, pairs of line segments with similar directions are merged if the thickness of the combined convex hull is within a threshold, and this process is repeated until the cell covers the entire image.

To establish stereo correspondence, we use dense stereo matching with shrunk frames and a sum-of-absolute-differences measure. Lines are tracked from one frame to the next using multi-level Lucas-Kanade optical flow (Lucas and Kanade, 1981).

Low-level image processing, including image rectification, Sobel edge computation and non-maximal suppression, is implemented using the NVidia CUDA framework and runs very fast on a GPU. Our line segment detection runs on the CPU and is fast, but since it is inherently parallel, a further speed-up can be obtained by implementing it on the GPU. The stereo matching and multi-level optical flow computation are also implemented in CUDA.
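The non-maximal suppression rule above can be sketched with numpy. The parameter values and the wrap-around border handling via `np.roll` are our own simplifications, not the thesis settings.

```python
import numpy as np

def edge_map(S, s=0.05, m=0.5, theta_edge=1e-3):
    """Sigmoid-weighted non-maximal suppression of a Sobel magnitude map S.

    For each testing direction dx, compare S at a pixel with the average of
    its two neighbors along dx, normalize by that average (floored at
    theta_edge), squash through a sigmoid and keep the best direction.
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-(x - m) / s))
    best = np.zeros_like(S)
    for dx in [(0, 1), (1, 0), (1, 1), (1, -1)]:
        fwd = np.roll(S, shift=dx, axis=(0, 1))
        bwd = np.roll(S, shift=(-dx[0], -dx[1]), axis=(0, 1))
        avg = 0.5 * (fwd + bwd)
        resp = sigmoid((S - avg) / np.maximum(avg, theta_edge))
        best = np.maximum(best, resp)
    return best

# A single vertical ridge: the ridge column should score highest.
S = np.zeros((7, 7))
S[:, 3] = 1.0
I = edge_map(S)
```

On the synthetic ridge, the response peaks exactly on the ridge column and is suppressed on its neighbors, which is the localization behavior described above.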

9.4.2 Efficiently Computing Determinants

A critical step of the algorithm in Section 9.3.4 is to compute the determinants of the $5 \times 5$ sub-matrices of $G$. It is important to carefully design the computation of these determinants, since the number of arithmetic operations can explode very quickly, which might adversely affect the numerical behavior of the algorithm.

As an illustration, to compute the polynomial corresponding to $u_1$ in (9.13), consider the matrix $G_1'$, which is $\hat{G}_1$ with the column order reversed. Then, $\det(G_1'^\top) = \det(\hat{G}_1)$. Each of the lower $4 \times 4$ sub-determinants of $G_1'^\top$ is a polynomial of degree 2. It now requires relatively little book-keeping to determine the coefficients of $x_3^4, \ldots, x_3^0$ in $u_1$ by appropriately convolving the coefficients of the degree 2 polynomials in the first row of $G_1'^\top$ with those from its lower $4 \times 4$ sub-determinants. The symbolic interactions of the coefficients are pre-computed and only need to be evaluated for values of $v_1, \ldots, v_4$. Similar steps can be taken to compute the coefficients of the various powers of $x_3$ in $u_2, \ldots, u_6$.
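A convenient way to cross-check such pre-computed coefficient interactions is evaluation-interpolation: evaluate the determinant numerically at enough sample values of $x_3$ and fit the coefficient vector. The numpy sketch below uses a random matrix of per-entry quadratics; the degree bound and all names are our own.

```python
import numpy as np

def det_poly(G_of_x, deg):
    """Coefficients (highest first) of x -> det(G_of_x(x)), degree <= deg.

    Evaluation-interpolation: a degree-deg polynomial is determined by
    deg+1 samples, so evaluate the determinant at deg+1 points and fit.
    """
    xs = np.linspace(-1.0, 1.0, deg + 1)
    dets = np.array([np.linalg.det(G_of_x(x)) for x in xs])
    return np.polyfit(xs, dets, deg)

# A 5x5 matrix whose entries are degree-2 polynomials in x3, so the
# determinant has degree at most 10; random coefficients for illustration.
rng = np.random.default_rng(0)
C = rng.normal(size=(5, 5, 3))                     # per-entry quadratics
G = lambda x: (C * np.array([x * x, x, 1.0])).sum(axis=2)
coeffs = det_poly(G, 10)
```

The fitted polynomial can then be compared against a direct determinant evaluation at a fresh sample point, which makes this a useful unit test for the hand-coded convolution bookkeeping.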

9.5 Experiments

We report experimental results for synthetic and real data, comparing the results obtained using the simple solution (Section 9.3.1), the efficient polynomial system based solution (Section 9.3.4) and the incremental solution (Section 9.3.5).

9.5.1 Synthetic Data

Our synthetic data experiments are designed to quantify the performance of the various algorithms discussed in this chapter across various noise levels, for small and large camera motions. The stereo baseline is fixed at 0.1 units, while lines are randomly generated in the cube $[-1, 1]^3$. The first stereo pair is randomly generated and the second one is displaced by a random motion that depends on the type of experiment. Zero mean Gaussian noise of varying standard deviations is added to the coordinates of the image lines in each view. All the experiments are based on 1000 trials.

For the first set of experiments, the motion is kept small. The performances of the simple solver, the incremental solver and the polynomial system based solver, each using the minimum two lines, are shown in Figure 9.4. Similar to (Nistér, 2004), the lower quartile of the error distribution is used for all the algorithms, since their targeted use is in a RANSAC framework, where finding a fraction of good hypotheses is more important than consistency. It can be seen that the incremental and polynomial solvers yield lower errors than the simple solver, although all three solvers perform well.
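The evaluation protocol above (many noisy trials, lower quartile of the error distribution) can be sketched as follows; `solver` and `make_trial` are placeholders for a minimal solver and a synthetic-scene generator, not names from the actual system:

```python
import numpy as np

def lower_quartile_error(solver, make_trial, trials=1000, sigma=0.001):
    """Run `solver` on many noisy synthetic trials and report the lower
    quartile of its error distribution -- the customary statistic for
    minimal solvers intended for RANSAC, where a fraction of good
    hypotheses matters more than consistency."""
    rng = np.random.default_rng(0)
    errors = []
    for _ in range(trials):
        truth, measurements = make_trial(rng)
        noisy = measurements + rng.normal(0.0, sigma, measurements.shape)
        errors.append(np.linalg.norm(solver(noisy) - truth))
    return np.percentile(errors, 25)

# Toy instance: the "solver" estimates a constant vector from noisy copies.
def make_trial(rng):
    truth = rng.uniform(-1.0, 1.0, 3)
    return truth, np.tile(truth, (5, 1))

q25 = lower_quartile_error(lambda m: m.mean(axis=0), make_trial)
```

In the chapter's experiments the trial generator would instead produce random lines in $[-1, 1]^3$, two stereo pairs and noisy image-line coordinates, with rotation and translation errors measured separately.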

(a) Rotation error (b) Translation error

Figure 9.4: Rotation and translation errors with small camera motion, for the simple solution (red curve), the polynomial solution (black curve) and the incremental solution (blue curve), each using two lines.

In the second set of experiments, again for small camera motion, the performance of each of the solvers is evaluated using three lines. As discussed in Section 9.3, while sampling three lines is still relatively inexpensive in RANSAC, the added robustness can be valuable in a real-world application. As Figure 9.5 shows, the incremental solver gives much lower error rates than the other two solvers in this case.

In the next set of experiments, the camera motion is allowed to be large. The incremental solver cannot be used in this case, while the polynomial solver performs slightly better than the simple solver (Figure 9.6).

While all the solvers achieve reasonable error rates in the synthetic data experiments, the utility of the proposed algorithms is evident in real-world situations, where noise is large and the stereo baseline can be quite narrow compared to the scene depth.

(a) Rotation error (b) Translation error

Figure 9.5: Rotation and translation errors with small camera motion, for the simple solution (red curve), polynomial solution (black curve) and incremental solution (blue curve), each using three lines.

(a) Rotation error (b) Translation error

Figure 9.6: Rotation and translation errors with large camera motion, for the simple solution (red curve) and the polynomial solution (black curve), each using three lines.

9.5.2 Real Data

Our system is extensively tested in the indoor office environments where it is intended to be deployed. The image sequences for our experiments are acquired using a stereo pair with a baseline of 7.4 cm, a $640 \times 360$-pixel resolution and a wide field of view of $110° \times 80°$ for each camera. All the experiments are conducted in a RANSAC framework. The RANSAC estimate can optionally be refined by a within-frame, nonlinear local refinement procedure using all the inliers. (Note that this is not bundle adjustment, as no information is aggregated over multiple frames.)

The first dataset is obtained by imaging a set of boxes rotating on a turntable. This is an interesting dataset because there are very few straight lines for estimation, with some outliers in the form of background lines that do not move with the turntable. The ground truth rotation of the turntable is slightly more than 360°. Note that the stereo baseline here is wide relative to the scene depth. The line detection and tracking results for this dataset are shown in Figure 9.7.

(a) Sobel edge map (b) After non-maximal suppression

(c) Detected lines (d) Tracked lines after 25 frames

Figure 9.7: Line detection and tracking for the turntable sequence.
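The hypothesize-and-test estimation used throughout these experiments follows the standard RANSAC pattern, sketched generically below. The minimal solver, residual function and threshold are placeholders; note that a minimal solver may return several candidate motions (polynomial roots), so all of them are scored:

```python
import random

def ransac(correspondences, minimal_solver, residual, threshold,
           min_sample=3, iterations=500, rng=None):
    """Generic hypothesize-and-test loop: repeatedly fit hypotheses from a
    minimal sample and keep the one with the largest inlier set."""
    rng = rng or random.Random(0)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        sample = rng.sample(correspondences, min_sample)
        for model in minimal_solver(sample):  # solvers may return several roots
            inliers = [c for c in correspondences
                       if residual(model, c) < threshold]
            if len(inliers) > len(best_inliers):
                best_model, best_inliers = model, inliers
    return best_model, best_inliers

# Toy use: estimate the slope of y = 2x from points with ~30% outliers.
pts = [(x, 2.0 * x) for x in range(1, 30)] + [(x, 17.0) for x in range(1, 13)]
solver = lambda s: [sum(y for _, y in s) / sum(x for x, _ in s)]
resid = lambda m, c: abs(c[1] - m * c[0])
model, inliers = ransac(pts, solver, resid, threshold=0.1)
```

In the system itself, the model is a relative stereo motion hypothesized from two or three line correspondences, and the inliers found by RANSAC feed the optional within-frame nonlinear refinement.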

The results obtained using the various algorithms discussed in the chapter are shown in Figure 9.8. First, it can be seen that the output of the simple solution is jerky and does not yield a complete circle, even after nonlinear refinement (Fig. 9.8 c, d). On the other hand, the incremental solver as well as the polynomial system solver yield fairly good reconstructions without any refinement (Fig. 9.8 b), but almost perfect results with

(a) (b)

(c) (d) (e)

Figure 9.8: Structure and motion estimation for a turntable sequence. (a) The last frame overlaid with lines. Solid lines are from the left camera, dotted lines are correspondences in the right camera. Red lines indicate inliers of the estimated motion. (b-e) Top and side views of the reconstructed camera motions. (b) Result of the 3-line polynomial solver after RANSAC and within-frame refinement. (c) Raw result of the 3-line simple solver in RANSAC. (d) Result of the 3-line simple solver after (within-frame) refinement using inliers from RANSAC. (e) Raw result of the 3-line incremental solver in RANSAC.

nonlinear refinement (Fig. 9.8 a,e). These results are obtained at 15 frames per second.

The next dataset involves a loop traversal in an office environment. This scene is challenging, since there are significant portions of purely forward motion and there are several false edges due to glass surfaces, as well as short-lived tracks due to varying illumination.

Most importantly, the stereo baseline is very narrow compared to the depth of the scene lines (several meters). The results on one such sequence are shown in Figure 9.1; here, we describe the results on another sequence obtained in the same space. The simple solution completely breaks down in this situation, which is expected, as explicit reconstruction produces noisy 3D lines and aligning them results in bad motion estimates. By pruning away a large number of failed frames, it is possible to discern the trajectory, but the resulting shape is still inaccurate (Figure 9.9 d). In comparison, the results obtained with the incremental solver closely mimic the ground truth, with no failure cases (Figure 9.9 b and c). The camera trajectories for a different loop traversal in the same space, for the polynomial solver and the incremental solver, are shown in Figure 9.10. For both the incremental and polynomial solvers, note the accuracy of the loop closure and the correctly recovered trajectory even in the forward motion segments of the traversal.

The length of the loop in all the office corridor sequences is about 60 meters, with an additional 20-meter forward motion overlap in the experiments of Figure 9.9. There are about 1500 frames in the sequence of Figure 9.10 and about 1750 frames in the sequence of Figure 9.9. The total error in camera position for either experiment, relative to ground truth, is about 1 meter. The entire system, from acquisition to motion estimation, runs at about 10 frames per second; RANSAC estimation of structure and motion accounts for only 5% of the time spent on each frame. Note that these results are obtained without any inter-frame bundle adjustment.

9.6 Discussions

In this chapter, we have reported developments in the construction of a robust, real-time stereo-based system for performing structure and motion estimation using infinite lines. One of our primary contributions is a set of fast algorithms that require only two or three lines and are more robust than simple approaches requiring explicit 3D reconstruction. Since they use a minimal amount of data, these algorithms are well-suited for use in a RANSAC framework. Our experiments demonstrate that the system, although not fully optimized yet, performs robustly at high frame rates in challenging indoor environments. While bundle adjustment was not used for any of the sequences in this chapter, to better illustrate the properties of our algorithms, it can conceivably lead to further improvements in the motion estimates.

Figure 9.9: Structure and motion estimation for an indoor office sequence. (a) The last frame overlaid with lines. Red lines indicate inliers of the estimated motion, dotted lines are stereo correspondences. (b) Camera motion recovered from the incremental solver after refinement. (c) Raw result from the 3-line incremental solver after RANSAC. (d) Camera motion recovered from the 3-line simple solver after nonlinear refinement using inliers and filtering out several mis-estimates.

Figure 9.10: Structure and motion estimation for another indoor office sequence. (a) Top and side views of camera motion recovered using the 3-line polynomial solver, after within-frame refinement. (b) Top and side views of camera motion recovered using the 3-line incremental solver, after within-frame refinement.

This chapter is based on “Moving in Stereo: Efficient Structure and Motion Using Lines”, by M. K. Chandraker, J. Lim and D. J. Kriegman, as it appears in (Chandraker et al., 2009).

Chapter 10

Discussions

“Yet all experience is an arch wherethrough Gleams that untraveled world, whose margin fades For ever and for ever when I move.”

Alfred, Lord Tennyson (English poet, 1809–1892), Ulysses

In the wake of this dissertation, some of the principles discussed here have also been applied to the global optimization of other problems in multiview geometry, and even to various other areas of computer vision. We list a few such sequels in Section 10.1. The content of this dissertation is inherently interdisciplinary, so we envisage an extension of our methods and results along several dimensions, a few of which are described in Section 10.2. Finally, we conclude the dissertation with a brief discussion in Section 10.3 that lends some perspective to this work and shapes an outlook for the future.

10.1 Sequels in the Computer Vision Community

One of the key outcomes of our work is an understanding of the relative difficulty of obtaining and verifying global optima for multiview geometry problems. Local minimization methods such as bundle adjustment have long been used for satisfactory 3D

reconstruction in practical situations. But before the works that constitute this dissertation, there was no mechanism for verifying the quality of those solutions. It has been our consistent observation that, in most cases, our algorithms converge to the global optimum in a fraction of the time required to verify optimality. Building on this observation, methods for verifying the optimality of a local optimum of multiview geometry cost functions are presented in (Hartley and Seo, 2008; Olsson et al., 2009a).

While our emphasis in this dissertation has been on developing the tightest possible convex relaxations to non-convex 3D reconstruction problems, an alternative approach is to develop relaxations which are looser, but faster to minimize. For example, the SOCP relaxations for the triangulation problem in Chapter 5 are provably tight, but as demonstrated in the follow-up work of (Lu and Hartley, 2007), a branch and bound algorithm based on a looser LP relaxation empirically converges much faster to the global optimum.

Given feature correspondences in two views, an important problem in computer vision is Euclidean registration, that is, computing the transformation between the two coordinate systems. In the spirit of this dissertation, convex relaxations for such problems are derived in (Olsson et al., 2009b) and globally minimized in a branch and bound framework.

Principles similar to those in this dissertation have also been used in areas of computer vision besides multiview geometry. For instance, a global optimization method for image segmentation is presented in (Lempitsky et al., 2008), where lower bounds on the image segmentation energy functionals are computed using graph cuts and used for pruning a branch and bound search space. A branch and bound method for hypothesis selection in multiple structure and motion segmentation is presented in (Thakoor and Gao, 2008).

10.2 Future Directions

The problems that we globally optimize in this dissertation are triangulation, camera resectioning, autocalibration and bilinear fitting. Each of these is an important part of the 3D reconstruction pipeline, but falls short of the full SFM problem. The immediate course we envisage for future research lies towards a globally optimal projective reconstruction, which would complete an optimal stratification of 3D reconstruction. Another direction that we seek to explore is optimal 3D reconstruction with calibrated cameras, which amounts to global optimization with orthonormality constraints. In fact, some promising work with similar objectives has already appeared in the literature (Hartley and Kahl, 2007a).

One of the central tenets of this dissertation is an illustration of the effectiveness of the intensive use of problem structure in global minimization. But there is also merit in seeking generic global optimization algorithms that are applicable to a wide variety of problems. For example, approaches such as the αBB framework construct convex relaxations for any non-convex objective function by compensating for any negative eigenvalues in the Hessian (Adjiman et al., 1998b,a). A similar approach in machine learning has been termed the Concave-Convex Procedure (Yuille and Rangarajan, 2001). The drawback of such approaches is that relaxations for even slightly complex problems tend to be very loose, so any gains in generality are accompanied by a loss in felicity. It might be worthwhile to explore whether a judicious compromise exists that generalizes to a variety of 3D reconstruction problems, while retaining enough problem structure for expeditious convergence.

A practical nuisance in estimation problems is the presence of outliers, which can significantly distort a reconstruction if unaccounted for. Indeed, one of the main utilities of a hypothesize-and-test framework such as the one in Chapter 9 is its robustness to the presence of outliers. Similarly, the recently developed $L_\infty$-norm optimization methods (Hartley and Schaffalitzky, 2004; Kahl and Hartley, 2008) have also found application as outlier removal algorithms (Sim and Hartley, 2006b). An alternate approach is to use a cost function robust to outliers, such as the $L_1$-norm objective functions in Chapters 5 and 8. An important avenue for further research is to study the viability of other well-known robust norms (Black and Rangarajan, 1996) in a convex optimization framework. It might be fruitful to pursue global optimization of piecewise continuous robust norms, such as the truncated quadratic, in a disjunctive programming framework (Grossmann and Lee, 2003).

Finally, it is fairly straightforward to derive global optimization methods similar to ours in very different areas of computer vision, such as photometric stereo or face recognition. We invite interested readers to explore these directions to improve the state-of-the-art in their respective fields.

10.3 Conclusions

How far are we, today, from realizing the dream we pictured at the beginning of this dissertation? By some reckoning, the challenges are numerous, but no longer insurmountable. The Mars rovers, which so captivated the imagination of scientists and dilettantes alike, use visual odometry to support inertial navigation for maneuvering across the terrain of an alien land. The application of structure from motion techniques to autonomous robotic navigation has now found resonance with the vision-based simultaneous localization and mapping (Visual SLAM) community. Chapter 9 of this dissertation is the product of our ongoing effort to equip the well-known humanoid robot ASIMO with a real-time SFM system. Visual odometry systems are now advanced enough to be reliably used in conjunction with Global Positioning Systems (GPS) and inertial navigation sensors (INS) to navigate unconstrained outdoor environments.

Structure from motion is now widely considered a mature sub-field of computer vision. Indeed, current challenges in SFM research have begun to reflect the technological developments and societal evolution of the past twenty years. The present-day ubiquity of high quality digital cameras has combined with the rapid advent of an Internet culture to make available an unprecedented amount of imaging data. Consequently, there has been substantial interest in adapting 3D reconstruction algorithms and architectures to deal with large-scale datasets.

This dissertation concerns a more fundamental aspect of SFM, that is, solving multiview geometry problems reliably and scalably, with a guarantee of obtaining the correct solution regardless of the quality of initialization. It is only with the emergence of modern convex optimization that we have been able to make progress on these problems, hitherto considered too difficult. But it is also important to realize that a capricious application of convex techniques is inadequate for addressing these challenges.
Rather, it is a careful consideration of problem geometry, combined with a prudent application of modern optimization methods, that has allowed us to arrive at optimal solutions to these complex problems. Most structure from motion problems are at least NP-hard in general, so any global optimization algorithm will have an exponential worst-case complexity. However, this dissertation aspires to convince the reader that this worst-case complexity is not commonly manifested, and that it is indeed possible to achieve a globally optimal solution in practice. The task requires unearthing a politic alliance between the problem structure and the solution method at several stages within the optimization framework, which is not a trivial proposition, but as this dissertation shows, ultimately not an infeasible one either.

The use of convex optimization within the computer vision community is still in its infancy. As convex solvers become more efficient and general-purpose, we expect a surge of interest even in segments of the community not directly related to 3D reconstruction. We are hopeful that, in the near future, methods not unlike ours will be used to exploit underlying convexities to successfully optimize challenging problems in other areas of computer vision, besides multiview geometry. And thus, as did Ulysses in Tennyson's tale of yore, we stand under the arch of geometry and convexity and conclude this dissertation with the intent “to strive, to seek”.

Appendix A

Fractional Programming

We include in this appendix, for completeness, the proof from (Benson, 2002) that replacing the numerators of a fractional program by overestimating scalars and the denominators by underestimating scalars results in an equivalent program that preserves the global optimum.

Theorem 11. $(x^*, t^*, s^*) \in \mathbb{R}^{n+2p}$ is a global optimal solution for (5.9) if and only if $t_i^* = f_i(x^*)$, $s_i^* = g_i(x^*)$, $i = 1, \cdots, p$ and $x^* \in \mathbb{R}^n$ is a global optimal solution for (5.8).

Proof. First, notice that the feasible region for (5.9), namely $\{(x, t, s) \in \mathbb{R}^{n+2p} \mid -f_i(x) + t_i \geq 0,\; g_i(x) - s_i \geq 0,\; 0 < l_i \leq t_i \leq u_i,\; 0 < L_i \leq s_i \leq U_i,\; x \in D,\; (t, s) \in Q\}$, is a compact convex set and is non-empty if and only if $D$ is non-empty. Thus, (5.9) is feasible if and only if (5.8) is.

Let $(x^*, t^*, s^*)$ be a global optimal solution for (5.9). Then, by construction, for each $i = 1, \cdots, p$,

$$\frac{f_i(x^*)}{g_i(x^*)} \leq \frac{t_i^*}{s_i^*} \;\Rightarrow\; h(x^*) = \sum_{i=1}^{p} \frac{f_i(x^*)}{g_i(x^*)} \leq \sum_{i=1}^{p} \frac{t_i^*}{s_i^*} \tag{A.1}$$

Let $\hat{t}_i = f_i(x^*)$ and $\hat{s}_i = g_i(x^*)$. Then $(x^*, \hat{t}, \hat{s})$ is a feasible solution for (5.9) with an objective function value

$$\sum_i \frac{\hat{t}_i}{\hat{s}_i} = \sum_i \frac{f_i(x^*)}{g_i(x^*)} = h(x^*) \tag{A.2}$$

Since $(x^*, t^*, s^*)$ is a global optimal solution for (5.9), (A.2) implies that

$$h(x^*) \geq \sum_i \frac{t_i^*}{s_i^*} \tag{A.3}$$

From (A.1) and (A.3),

$$h(x^*) = \sum_i \frac{t_i^*}{s_i^*} \tag{A.4}$$

Since, by definition,

$$\frac{f_i(x^*)}{g_i(x^*)} \leq \frac{t_i^*}{s_i^*}, \qquad t_i^* \geq f_i(x^*) > 0, \qquad g_i(x^*) \geq s_i^* > 0, \qquad \forall\, i = 1, \cdots, p,$$

it follows from (A.4) that $t_i^* = f_i(x^*)$ and $s_i^* = g_i(x^*)$. If $x$ is a feasible solution for

(5.8), then setting $t_i = f_i(x)$ and $s_i = g_i(x)$ for $i = 1, \cdots, p$, $(x, t, s)$ is a feasible solution for (5.9) with an objective function value of $h(x)$. However, since $(x^*, t^*, s^*)$ is a global optimal solution for (5.9),

$$h(x) \geq \sum_i \frac{t_i^*}{s_i^*} \tag{A.5}$$

and since $x^* \in D$, from (A.4) and (A.5), $x^*$ is a global optimal solution for problem (5.8).

Conversely, let $x^*$ be the global optimal solution for (5.8) and let $t_i^* = f_i(x^*)$ and $s_i^* = g_i(x^*)$. Then $(x^*, t^*, s^*)$ is a feasible solution for (5.9) with objective function value $h(x^*)$. Let $(x, t, s)$ be a feasible solution for (5.9). Then, by construction, for each $i = 1, \cdots, p$, $t_i \geq f_i(x) > 0$ and $g_i(x) \geq s_i > 0$. Thus, for each $i = 1, \cdots, p$,

$$\frac{f_i(x)}{g_i(x)} \leq \frac{t_i}{s_i} \;\Rightarrow\; h(x) \leq \sum_i \frac{t_i}{s_i} \tag{A.6}$$

Since $x \in D$ and $x^*$ is a global optimal solution for (5.8), $h(x) \geq h(x^*)$. Thus, from (A.6),

$$h(x^*) \leq \sum_i \frac{t_i}{s_i} \tag{A.7}$$

Since $t_i^* = f_i(x^*)$ and $s_i^* = g_i(x^*)$ for $i = 1, \cdots, p$,

$$\sum_i \frac{t_i^*}{s_i^*} = h(x^*) \tag{A.8}$$

From (A.7) and (A.8), it is established that

$$\sum_i \frac{t_i^*}{s_i^*} \leq \sum_i \frac{t_i}{s_i}$$

and thus, $(x^*, t^*, s^*)$ is a global optimal solution for (5.9).

Appendix B

Convex Relaxations for Stratified Autocalibration

In this appendix, we briefly describe the convex and concave relaxations of the intermediate non-linear terms that were relaxed as part of the various convex relaxations in the main text of Chapter 6. In each instance, the variables $x$ and $y$ take values in the intervals $[x_l, x_u]$ and $[y_l, y_u]$, respectively.

B.1 Functions of the Form $f(x) = x^{8/3}$

The function $x^{8/3}$ is convex, and thus the line joining $(x_l, x_l^{8/3})$ and $(x_u, x_u^{8/3})$ is a tight concave over-estimator; the relaxation is therefore given by

$$z \leq x_l^{8/3} + \frac{x - x_l}{x_u - x_l} \left( x_u^{8/3} - x_l^{8/3} \right) \tag{B.1}$$

B.2 Bilinear Functions $f(x, y) = xy$

We begin by considering convex and concave relaxations of the bilinear function $f(x, y) = xy$. It can be shown (Al-Khayyal and Falk, 1983) that the tightest convex lower bound for $f(x, y)$ is given by the function

$$z = \max\{\, x_l y + y_l x - x_l y_l,\; x_u y + y_u x - x_u y_u \,\}. \tag{B.2}$$

Similarly, the tightest concave upper bounding function is given by

$$z = \min\{\, x_u y + y_l x - x_u y_l,\; x_l y + y_u x - x_l y_u \,\}. \tag{B.3}$$

Thus, the convex relaxation of the equality constraint $z = xy$ over the domain $[x_l, x_u] \times [y_l, y_u]$ is given by

$$\begin{aligned}
z &\geq x_l y + y_l x - x_l y_l \\
z &\geq x_u y + y_u x - x_u y_u \\
z &\leq x_u y + y_l x - x_u y_l \\
z &\leq x_l y + y_u x - x_l y_u
\end{aligned} \tag{B.4}$$

These relaxations are graphically illustrated in Figure B.1.

(a) (b)

Figure B.1: (a) Construction of the convex relaxation for the bilinear function $f(x, y) = xy$. The function is drawn in the blue shade, while the convex envelope is the pointwise maximum of the two gray shaded planes. (b) Construction of the concave relaxation for the bilinear function $f(x, y) = xy$. The function is drawn in the blue shade, while the concave envelope is the pointwise minimum of the two gray shaded planes. A different view of the bilinear function is shown to better illustrate the relaxation.
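The McCormick envelopes (B.2)-(B.3) are simple enough to check numerically; a minimal sketch:

```python
import numpy as np

def mccormick(x, y, xl, xu, yl, yu):
    """Convex under- and concave over-estimators of the bilinear term x*y
    on the box [xl, xu] x [yl, yu], following (B.2)-(B.3)."""
    lower = np.maximum(xl * y + yl * x - xl * yl, xu * y + yu * x - xu * yu)
    upper = np.minimum(xu * y + yl * x - xu * yl, xl * y + yu * x - xl * yu)
    return lower, upper

# The envelopes sandwich x*y everywhere on the box.
xl, xu, yl, yu = -1.0, 2.0, 0.5, 3.0
xs, ys = np.meshgrid(np.linspace(xl, xu, 50), np.linspace(yl, yu, 50))
lo, hi = mccormick(xs, ys, xl, xu, yl, yu)
```

The sandwich property follows from the factorizations $xy - (x_l y + y_l x - x_l y_l) = (x - x_l)(y - y_l) \geq 0$ and $(x_u y + y_l x - x_u y_l) - xy = (x_u - x)(y - y_l) \geq 0$ (and their counterparts at the opposite corners).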

B.3 Functions of the Form $f(x, y) = x^{1/3} y$

We now consider the construction of the convex relaxation for the bivariate function $f(x, y) = x^{1/3} y$. As illustrated in Figure B.2, this is a non-convex function whose convex relaxation is not straightforward.

Figure B.2: The non-convex function $f(x, y) = x^{1/3} y$.

B.3.1 Case I: [$x_l > 0$ or $x_u < 0$]

Suppose $x_l > 0$; then $f(x, y)$ is concave in $x$ and convex in $y$. It can be shown (Tawarmalani and Sahinidis, 2002) that the convex envelope for $f(x, y)$ is given by

$$\begin{aligned}
\min \quad & z && \text{(B.5)} \\
\text{subject to} \quad & z \geq (1 - \lambda) f(x_l, y_a) + \lambda f(x_u, y_b) \\
& x = x_l + (x_u - x_l)\lambda \\
& y = (1 - \lambda) y_a + \lambda y_b \\
& 0 \leq \lambda \leq 1 \\
& y_l \leq y_a, y_b \leq y_u && \text{(B.6)}
\end{aligned}$$

Noting that $f(x, y) = x^{1/3} y$, substituting $y_p = (1 - \lambda) y_a$ and simplifying results in the following convex relaxation:

$$\begin{aligned}
z &\geq x_l^{1/3} y_p + x_u^{1/3} (y - y_p) \\
(1 - \lambda) y_l &\leq y_p \leq (1 - \lambda) y_u \\
\lambda y_l &\leq y - y_p \leq \lambda y_u \\
\lambda &= \frac{x - x_l}{x_u - x_l}.
\end{aligned} \tag{B.7}$$

A concave relaxation for $x^{1/3} y$ can be constructed by considering the negative of the convex envelope of $x^{1/3}(-y)$. This leads to

$$\begin{aligned}
z &\leq \left( x_u^{1/3} - x_l^{1/3} \right) y_p' + x_u^{1/3} y \\
\lambda y_u &\geq y - y_p' \geq \lambda y_l \\
(1 - \lambda) y_u &\geq y_p' \geq (1 - \lambda) y_l \\
\lambda &= \frac{x - x_l}{x_u - x_l}.
\end{aligned} \tag{B.8}$$

For the case when $x_u < 0$, we observe that a convex relaxation for $x^{1/3} y$ is given by the negative of the concave relaxation of $t^{1/3} y$, where $t = -x$. Appropriate manipulation of (B.7) gives us the convex and concave envelopes for this case too.

B.3.2 Case II: [$x_l \leq 0 \leq x_u$]

The function $x^{1/3}$ is convex for $x < 0$ and concave for $x > 0$. The derivation in (B.7) depends critically on the convexity of $x^{1/3}$ over its domain, and thus cannot be used for the case when $x_l \leq 0 \leq x_u$. So instead of a one-step relaxation, we will consider the two equality constraints $t = x^{1/3}$, $z = ty$ and relax each of them individually. Once we have the relaxation for $t = x^{1/3}$, we can then apply the bilinear relaxation to $z = ty$.

To construct a concave overestimator for $x^{1/3}$, we notice that when $-x_l/8 < x_u$, a line which upper bounds the curve $x^{1/3}$ is the tangent at $(x_u, x_u^{1/3})$, given by

$$t = \frac{1}{3} x_u^{-2/3} x + \frac{2}{3} x_u^{1/3} \tag{B.9}$$

While this line is a concave overestimator suitable for branch and bound in principle, we can make it a tighter relaxation. Notice that $x^{1/3}$ is convex in the region $(x_l, 0)$ and the line in (B.9) passes through $(0, \frac{2}{3} x_u^{1/3})$, which lies above the curve. Thus, the line segment joining $(x_l, x_l^{1/3})$ and $(0, \frac{2}{3} x_u^{1/3})$, given by

$$t = \left( x_l^{-2/3} - \frac{2}{3} x_l^{-1} x_u^{1/3} \right) x + \frac{2}{3} x_u^{1/3} \tag{B.10}$$

is a tighter overestimator when $x \in (x_l, 0)$. Thus, the overestimator for $t = x^{1/3}$ is given by the minimum of the two lines (B.9) and (B.10).

Further, when $-x_l/8 \geq x_u$, the straight line passing through $(x_l, x_l^{1/3})$ and $(x_u, x_u^{1/3})$, given by

$$t = \frac{(x_u^{1/3} - x_l^{1/3})\, x + (x_u x_l^{1/3} - x_l x_u^{1/3})}{x_u - x_l}, \tag{B.11}$$

always lies above the curve $x^{1/3}$, which gives the strongest possible concave overestimator. Thus, the unified concave overestimator for $x^{1/3}$ is given by

$$t \leq \begin{cases} \min\left\{ \left( x_l^{-2/3} - \dfrac{2}{3} x_l^{-1} x_u^{1/3} \right) x + \dfrac{2}{3} x_u^{1/3},\; \dfrac{x + 2x_u}{3 x_u^{2/3}} \right\}, & -x_l/8 < x_u \\[2ex] \dfrac{(x_u^{1/3} - x_l^{1/3})\, x + (x_u x_l^{1/3} - x_l x_u^{1/3})}{x_u - x_l}, & -x_l/8 \geq x_u \end{cases} \tag{B.12}$$

The construction of the concave overestimator for $x^{1/3}$ is graphically illustrated in Figure B.3. By similar arguments, we can derive the convex underestimator for $x^{1/3}$ when $x_l \leq 0 \leq x_u$ as

$$t \geq \begin{cases} \max\left\{ \left( x_u^{-2/3} - \dfrac{2}{3} x_u^{-1} x_l^{1/3} \right) x + \dfrac{2}{3} x_l^{1/3},\; \dfrac{x + 2x_l}{3 (-x_l)^{2/3}} \right\}, & -x_u/8 > x_l \\[2ex] \dfrac{(x_u^{1/3} - x_l^{1/3})\, x + (x_u x_l^{1/3} - x_l x_u^{1/3})}{x_u - x_l}, & -x_u/8 \leq x_l \end{cases} \tag{B.13}$$

The construction of the convex underestimator for $x^{1/3}$ is graphically illustrated in Figure B.4.

(a) (b)

Figure B.3: Construction of the concave overestimator for $x^{1/3}$, illustrated with the solid black curve. (a) The concave overestimator when $-x_l/8 < x_u$ is given by the minimum of the dashed red and blue lines. (b) The concave overestimator when $-x_l/8 \geq x_u$ is given by the dashed blue line.
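The piecewise overestimator (B.12) can be checked numerically. The sketch below implements its two cases for an interval with $x_l \leq 0 \leq x_u$ and verifies that the resulting function dominates $x^{1/3}$ on sampled points (a numerical sanity check, not a proof):

```python
import numpy as np

def cbrt_overestimator(x, xl, xu):
    """Concave overestimator of x**(1/3) on [xl, xu] with xl <= 0 <= xu,
    following the two cases of (B.12)."""
    if -xl / 8.0 >= xu:
        # Secant through both endpoints, (B.11).
        return np.cbrt(xl) + (x - xl) * (np.cbrt(xu) - np.cbrt(xl)) / (xu - xl)
    # Minimum of the tangent at x_u, (B.9), and the segment from
    # (xl, xl**(1/3)) to (0, (2/3) * xu**(1/3)), (B.10).
    tangent = x / (3.0 * np.cbrt(xu) ** 2) + (2.0 / 3.0) * np.cbrt(xu)
    secant = (np.cbrt(xl) - (2.0 / 3.0) * np.cbrt(xu)) * x / xl \
             + (2.0 / 3.0) * np.cbrt(xu)
    return np.minimum(tangent, secant)

# One interval per case of (B.12).
intervals = [(-1.0, 2.0), (-8.0, 0.5)]
```

The same code, mirrored through $x \mapsto -x$, checks the underestimator (B.13).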

B.4 Convergence Proofs

To be usable in a branch and bound algorithm, the convex relaxations must satisfy the technical conditions (L1) and (L2) described in Section 4.3. Condition (L1) is evidently true by construction. That the Cauchy continuity condition (L2) holds for the various bilinear relaxations follows from (McCormick, 1976; Al-Khayyal and Falk, 1983; Horst and Tuy, 2006). We refer the reader to (Tawarmalani and Sahinidis, 2001a, 2002) for a proof of the same for Case I in Section B.3. Below, we provide a brief proof of the Cauchy continuity condition (L2) for Case II in Section B.3.

First, we must prove that as the interval within which $x$ lies becomes smaller, the concave overestimator given by (B.12) becomes tighter. Let us consider the line given by (B.10). The condition (L2) is satisfied when the RHS in (B.10) decreases with an increase in $x_l$ or a decrease in $x_u$, that is, when $\frac{dt}{dx_l} < 0$ and $\frac{dt}{dx_u} > 0$. Now,

$$\frac{dt}{dx_l} = \frac{2}{3} \frac{x}{x_l^2} \left( x_u^{1/3} - x_l^{1/3} \right) \tag{B.14}$$

which is negative when $x \in [x_l, 0)$, which is the region where (B.10) is active. Further,

$$\frac{dt}{dx_u} = \frac{2}{9} x_u^{-2/3} \left( 1 - \frac{x}{x_l} \right) \tag{B.15}$$

(a) (b)

Figure B.4: Construction of the convex underestimator for $x^{1/3}$, illustrated with the solid black curve. (a) The convex underestimator when $-x_u/8 > x_l$ is given by the maximum of the dashed red and blue lines. (b) The convex underestimator when $-x_u/8 \leq x_l$ is given by the dashed blue line.

which is non-negative when $x$ lies in the interval $[x_l, x_u]$. Next, we notice that the part of the overestimator given by (B.9) is independent of the value of $x_l$, since it is active in the region $[0, x_u]$. In that interval,

$$\frac{dt}{dx_u} = \frac{2}{9} x_u^{-2/3} \left( 1 - \frac{x}{x_u} \right) \tag{B.16}$$

which is non-negative since $x_u > 0$ and $x \leq x_u$.

When $-x_l/8 \geq x_u$, the overestimator given by the line (B.11) must also satisfy the same conditions. Indeed,

$$\begin{aligned}
\frac{dt}{dx_l} &= \frac{1}{x_u - x_l} \cdot \left[ \frac{1}{3} x_l^{-2/3} (x_u - x_l) - \left( x_u^{1/3} - x_l^{1/3} \right) \right] \\
&= \frac{1}{3(x_u - x_l)} \cdot \left[ \left( \frac{x_u}{x_l} \right)^{2/3} + \left( \frac{x_u}{x_l} \right)^{1/3} - 2 \right] \\
&= \frac{1}{3(x_u - x_l)} \cdot \left[ \left( \frac{x_u}{x_l} \right)^{1/3} - 1 \right] \cdot \left[ \left( \frac{x_u}{x_l} \right)^{1/3} + 2 \right]
\end{aligned}$$

where we have used the identity that, for any numbers $a$ and $b$, $a^3 - b^3 = (a - b)(a^2 + ab + b^2)$. Since $x_u - x_l > 0$, it follows that the above expression is negative when

$$-2 < \left( \frac{x_u}{x_l} \right)^{1/3} < 1 \tag{B.17}$$

or equivalently, when $-8 < x_u/x_l < 1$, which is identically true since the interval under consideration satisfies $-x_l/8 \geq x_u$, $x_l < 0$ and $x_u > 0$.

Finally, it can be shown by similar derivations, or deduced from the symmetry of (B.11), that $\frac{dt}{dx_u} > 0$ when $-x_l/8 \geq x_u$.

By similar arguments as above, it can be shown that the convex underestimator of $x^{1/3}$ given by (B.13) becomes tighter as the size of the interval decreases. This concludes the proof of the Cauchy continuity condition (L2) for Case II in Section B.3.

B.4.1 Errata

Note that the relaxations in (B.12) and (B.13) correct a mistake in a prior version of Chapter 6, as it appears in (Chandraker et al., 2007b). There, when $-x_l/8 < x_u$, one of the segments of the concave overestimator is the tangent line from $(x_l, x_l^{1/3})$, given by

$$t = \frac{4x - x_l}{3 x_l^{2/3}} \tag{B.18}$$

instead of the line (B.10). It can be shown that the concave overestimator constructed using (B.18) instead of (B.10) violates the condition (L2). Indeed, as presented in (Chandraker et al., 2007b), (B.18) would have been active in the interval $[x_l, x_a]$, where $x_a$ is the point of intersection of (B.18) and (B.9), given by

$$x_a = \frac{x_u^{2/3} x_l^{2/3} \left( 2 x_u^{1/3} + x_l^{1/3} \right)}{4 x_u^{2/3} - x_l^{2/3}}. \tag{B.19}$$

One of the necessary conditions for Cauchy continuity, as above, is that

$$\frac{dt}{dx_l} = g(x) = -\frac{1}{9} x_l^{-5/3} \left( 8x + x_l \right) \tag{B.20}$$

must be negative for all $x \in [x_l, x_a]$. Clearly, $g(x_l) = -x_l^{-2/3} < 0$, since $x_l < 0$. However,

$$g(x_a) = -\frac{1}{9} x_l^{-2/3} \, \frac{(2\lambda + 1)(4\lambda - 1)}{2\lambda - 1} \tag{B.21}$$

where we have defined $\lambda = \left( \frac{x_u}{x_l} \right)^{1/3}$. Since the interval under consideration is constrained by $-x_l/8 < x_u$, $x_l < 0$ and $x_u > 0$, it easily follows that the above expression

for $g(x_a)$ is positive. Thus, the derivative of $t$ with respect to $x_l$ is not always negative, which is a violation of condition (L2). It can be shown that a similar error in the construction of the convex underestimator is also corrected by the expression in (B.13).

Appendix C

Convergence Proof for Bilinear Relaxations

In this addendum, we prove that, given Lemma 1 in Section 8.4, the non-exhaustive branch and bound algorithm proposed in Chapter 8 converges. The proof is motivated by similar discussions, albeit for exhaustive branching strategies, in (Balakrishnan et al., 1991; Horst and Tuy, 2006).

Define the condition number of a rectangle as $\mathcal{C}(Q) = \max_i (u_i - l_i) / \min_i (u_i - l_i)$, where $l_i$ and $u_i$ are the bounds on the $x_i$ terms only. Then, the following is true for our choice of sub-division rule:

Lemma 2. $\mathcal{C}(Q) \leq \max\{\mathcal{C}(Q_0), 2\}$

Proof. Let some $Q \in \mathcal{P}$ be split into $Q_1$ and $Q_2$ by bisection along the longest $x_i$-edge. Let $\chi'(Q) = \min_i (u_i - l_i)$ and, as before, $\chi(Q) = \max_i (u_i - l_i)$. Then $\chi(Q_1) \leq \chi(Q)$ and $\chi(Q_2) \leq \chi(Q)$, while the bisected edge has length $\chi(Q)/2$ in each. Moreover, $\chi'(Q_1) = \chi'(Q_2) = \min\{\chi(Q)/2, \chi'(Q)\}$. It follows that $\mathcal{C}(Q_1) \leq \max\{\mathcal{C}(Q), 2\}$ and $\mathcal{C}(Q_2) \leq \max\{\mathcal{C}(Q), 2\}$, and as a consequence, the lemma is proved.

Lemma 3. After a “large enough” number of branching operations, there exists at least one “small enough” rectangle. More precisely, there exists a rectangle for which the maximum length along the dimensions corresponding to $x_i$, $i = 1, \cdots, n$, can be bounded in terms of the number of branching operations.

Proof. The volume of a rectangle is given by

$$\mathrm{vol}(Q) = \prod_i (u_i - l_i) \prod_i (U_i - L_i) = V_0 \prod_{i=1}^n (u_i - l_i) \ge V_0 \cdot \chi(Q) \, [\chi_0(Q)]^{n-1} \ge V_0 \cdot \frac{[\chi(Q)]^n}{[\mathcal{C}(Q)]^{n-1}} \ge V_0 \left( \frac{\chi(Q)}{\mathcal{C}(Q)} \right)^n, \quad [\text{since } \mathcal{C}(Q) \ge 1]$$

where $V_0 = \prod_i (U_i - L_i)$ remains constant throughout the branching process. Thus,

$$\chi(Q) \le \frac{1}{V_0^{1/n}} \, \mathcal{C}(Q) \, \mathrm{vol}(Q)^{1/n}.$$

Since a new rectangle is created by bisection at every branching stage, after $k$ branching steps, there exists some $Q' \in \mathcal{P}_k$ such that $\mathrm{vol}(Q') \le \mathrm{vol}(Q_0)/k$. Using Lemma 2, it follows that there exists some $Q' \in \mathcal{P}_k$ such that

$$\chi(Q') \le \frac{1}{V_0^{1/n}} \max\{\mathcal{C}(Q_0), 2\} \, [\mathrm{vol}(Q_0)/k]^{1/n} \tag{C.1}$$

and the lemma stands proved.
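Lemma 3 can likewise be checked by simulation: after $k$ branching operations the rectangles partition $Q_0$, so by pigeonhole the smallest one has volume at most $\mathrm{vol}(Q_0)/k$, and combining this with Lemma 2 and $\chi(Q) \le \mathcal{C}(Q)\,\mathrm{vol}(Q)^{1/n}$ gives (C.1). The sketch below checks both facts, using only $x$-dimensions so that $V_0 = 1$; the box sizes and random selection rule are arbitrary choices for the check.

```python
import random

def bisect_longest(box):
    # bisection along the longest x_i-edge, as in the sub-division rule
    i = max(range(len(box)), key=lambda j: box[j][1] - box[j][0])
    l, u = box[i]
    q1, q2 = list(box), list(box)
    q1[i], q2[i] = (l, 0.5 * (l + u)), (0.5 * (l + u), u)
    return q1, q2

def chi(box):
    return max(u - l for l, u in box)

def cond(box):
    return chi(box) / min(u - l for l, u in box)

def vol(box):
    v = 1.0
    for l, u in box:
        v *= u - l
    return v

random.seed(1)
q0 = [(0.0, random.uniform(1.0, 3.0)) for _ in range(3)]  # Q0; V0 = 1 here
n, v0, c0 = len(q0), vol(q0), max(cond(q0), 2.0)
boxes = [q0]
for k in range(1, 1501):
    q1, q2 = bisect_longest(boxes.pop(random.randrange(len(boxes))))
    boxes += [q1, q2]
    small = min(boxes, key=vol)            # smallest rectangle after k branchings
    assert vol(small) <= v0 / k + 1e-12    # pigeonhole: vol(Q') <= vol(Q0)/k
    # the bound of (C.1): chi(Q') <= max{C(Q0), 2} * [vol(Q0)/k]^(1/n)
    assert chi(small) <= c0 * (v0 / k) ** (1.0 / n) + 1e-9
```

Both assertions hold at every step, so after enough branchings the longest $x_i$-edge of some rectangle can be made arbitrarily small, which is exactly what the final convergence argument needs.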

We are now in a position to prove Proposition 1. Specifically, let $\psi^l_{k'}$ and $\psi^u_{k'}$ in equation (8.11) be the global lower and upper bounds maintained by the branch and bound algorithm at the end of $k'$ iterations. Then, we show that given $\epsilon > 0$, there exists $k$ such that for some $k' < k$, it was the case that $\psi^u_{k'} - \psi^l_{k'} \le \epsilon$.

Proof. Lemma 1 guarantees the existence of $\delta > 0$ such that $\chi(Q) \le \delta \Rightarrow f_{ub}(Q) - f_{lb}(Q) \le \epsilon$. From Lemma 3, for a large enough $k$ such that

$$\frac{1}{V_0^{1/n}} \max\{\mathcal{C}(Q_0), 2\} \, [\mathrm{vol}(Q_0)/k]^{1/n} \le \delta/2, \tag{C.2}$$

there exists a rectangle $Q' \in \mathcal{P}_k$ such that $\chi(Q') \le \delta/2$. Let $Q'$ be formed by sub-division of some rectangle $Q''$. Then, $\chi(Q'') \le \delta$; thus, from Lemma 1, $f_{ub}(Q'') - f_{lb}(Q'') \le \epsilon$. Since the algorithm chose $Q''$ for branching at some iteration $k'$, it must have satisfied $f_{lb}(Q'') = \psi^l_{k'}$. Thus,

$$\psi^u_{k'} - \psi^l_{k'} \le f_{ub}(Q'') - \psi^l_{k'} \le f_{ub}(Q'') - f_{lb}(Q'') \le \epsilon \tag{C.3}$$

and the convergence proof stands completed.

Bibliography

Adjiman, C., Androulakis, I., and Floudas, C., 1998a: A global optimization method, αBB, for general twice-differentiable constrained NLPs - II. Implementation and computational results. Computers and Chemical Engineering, 22, 1159–1178.

Adjiman, C., Dallwig, S., Floudas, C., and Neumaier, A., 1998b: A global optimization method, αBB, for general twice-differentiable constrained NLPs - I. Theoretical advances. Computers and Chemical Engineering, 22, 1137–1158.

Agarwal, S., Chandraker, M., Kahl, F., Belongie, S., and Kriegman, D., 2006: Practical global optimization for multiview geometry. In European Conference on Computer Vision, 592–605.

Agarwal, S., Snavely, N., and Seitz, S., 2008: Fast algorithms for L∞ problems in multiview geometry. In IEEE Conference on Computer Vision and Pattern Recognition.

Agrawal, M., 2004: On automatic determination of varying focal lengths using semidefinite programming. In IEEE International Conference on Image Processing.

Akhiezer, N., 1965: The Classical Moment Problem and Some Related Questions in Analysis. Oliver and Boyd, London.

Al-Khayyal, F., and Falk, J., 1983: Jointly constrained biconvex programming. Mathematics of Operations Research, 8, 273–286.

Andersen, E., Roos, C., and Terlaky, T., 2003: On implementing a primal-dual interior-point method for conic quadratic optimization. Mathematical Programming, 95(2), 249–277.

Balakrishnan, V., Boyd, S., and Balemi, S., 1991: Branch and bound algorithm for computing minimum stability degree of parameter-dependent linear systems. International Journal of Robust and Nonlinear Control, 1(4), 295–317.

Barricelli, N. A., 1954: Esempi numerici di processi di evoluzione. Methodos, 45–68.

Bartoli, A., and Sturm, P., 2004: The 3D line motion matrix and alignment of line reconstructions. International Journal of Computer Vision, 57(3), 159–178.


Bartoli, A., and Sturm, P., 2005: Structure-from-motion using lines: Representation, triangulation and bundle adjustment. Computer Vision and Image Understanding, 100(3), 416–441.

Bascle, B., and Blake, A., 1998: Separability of pose and expression in facial tracking and animation. In IEEE International Conference on Computer Vision, 323–328.

Bay, H., Tuytelaars, T., and van Gool, L., 2006: SURF: Speeded Up Robust Features. In European Conference on Computer Vision, 404–417.

Beardsley, P., Zisserman, A., and Murray, D., 1997: Sequential updating of projective and affine structure from motion. International Journal of Computer Vision, 23(3), 235–259.

Belhumeur, P., and Kriegman, D., 1996: What is the set of images of an object under all possible lighting conditions? In IEEE Conference on Computer Vision and Pattern Recognition, 270–277.

Benson, H. P., 2002: Using concave envelopes to globally solve the nonlinear sum of ratios problem. Journal of Global Optimization, 22(2-4), 343–364.

Beutelspacher, A., and Rosenbaum, U., 1998: Projective Geometry: From Foundations to Applications. Cambridge University Press.

Black, M. J., and Rangarajan, A., 1996: On the unification of line processes, outlier rejection and robust statistics with applications in early vision. International Journal of Computer Vision, 19(1), 57–91.

Blanz, V., and Vetter, T., 1999: A morphable model for the synthesis of 3D faces. In ACM SIGGRAPH, 187–194.

Boyd, S., and Vandenberghe, L., 2004: Convex Optimization. Cambridge University Press.

Breuel, T., 2002: A comparison of search strategies for geometric branch and bound algorithms. In European Conference on Computer Vision, 837–850.

Brown, D., 1976: The bundle adjustment - progress and prospects. International Archives of Photogrammetry, 21(3), Paper 3 03: 33 pages.

Buchberger, B., 1965: An Algorithm for Finding the Basis Elements of the Residue Class Ring of a Zero Dimensional Polynomial Ideal. Ph.D. thesis, University of Innsbruck.

Buchberger, B., 1995: Introduction to Gröbner bases. Logic of Computation, 157.

Byröd, M., Josephson, K., and Åström, K., 2007: Fast optimal three view triangulation. In Asian Conference on Computer Vision, 549–559.

Casella, G., and George, E. I., 1992: Explaining the Gibbs sampler. The American Statistician, 46, 167–174.

Chandraker, M., Agarwal, S., Kahl, F., Nistér, D., and Kriegman, D., 2007a: Autocalibration via rank-constrained estimation of the absolute quadric. In IEEE Conference on Computer Vision and Pattern Recognition.

Chandraker, M., Agarwal, S., Kriegman, D., and Belongie, S., 2007b: Globally optimal affine and metric upgrades in stratified autocalibration. In IEEE International Conference on Computer Vision.

Chandraker, M., and Kriegman, D., 2008: Globally optimal bilinear programming for computer vision applications. In IEEE Conference on Computer Vision and Pattern Recognition.

Chandraker, M., Lim, J., and Kriegman, D., 2009: Moving in stereo: Efficient structure and motion using lines. In IEEE International Conference on Computer Vision.

Comport, A., Malis, E., and Rives, P., 2007: Accurate quadrifocal tracking for robust 3D visual odometry. In International Conference on Robotics and Automation, 40–45.

Cox, D., Little, J., and O’Shea, D., 1992: Ideals, Varieties and Algorithms. Springer-Verlag, New York, NY, USA.

Cox, D., Little, J., and O’Shea, D., 1998: Using Algebraic Geometry. Springer-Verlag.

Dellaert, F., Seitz, S. M., Thorpe, C. E., and Thrun, S., 2000: Structure from motion without correspondence. In IEEE Conference on Computer Vision and Pattern Recognition, 557–564.

Dorigo, M., Maniezzo, V., and Colorni, A., 1996: Ant System: Optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man and Cybernetics - Part B, 26(1), 29–41.

Faugeras, O., Luong, Q.-T., and Maybank, S., 1992: Camera self-calibration: Theory and experiments. In European Conference on Computer Vision, 321–334. Springer-Verlag.

Faugeras, O. D., 1992: What can be seen in three dimensions with an uncalibrated stereo rig? In European Conference on Computer Vision, 563–578. Springer-Verlag.

Fischler, M. A., and Bolles, R. C., 1981: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24, 381–395.

Förstner, W., and Gülch, E., 1987: A fast operator for detection and precise location of distinct points, corners and centres of circular features. In ISPRS Intercommission Conference on Fast Processing of Photogrammetric Data, 281–305.

Fraser, A., 1957: Simulation of genetic systems by automatic digital computers. Australian Journal of Biological Sciences, 10, 484–491.

Freedman, D., 2003: Effective tracking through tree-search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5), 604–615.

Freeman, W., and Tenenbaum, J., 1997: Learning bilinear models for two-factor problems in vision. In IEEE Conference on Computer Vision and Pattern Recognition, 554–560.

Freund, R. W., and Jarre, F., 2001: Solving the sum-of-ratios problem by an interior-point method. Journal of Global Optimization, 19(1), 83–102.

Fusiello, A., Benedetti, A., Farenzena, M., and Busti, A., 2004: Globally convergent autocalibration using interval analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(12), 1633–1638.

Gat, Y., 2003: A branch-and-bound technique for nano-structure image segmentation. In Computer Vision and Pattern Recognition Workshop.

Gellert, W., Küstner, K., Hellwich, M., and Kästner, H., 1975: The VNR Concise Encyclopedia of Mathematics. Van Nostrand Reinhold Company.

Goss, S., Aron, S., Deneubourg, J.-L., and Pasteels, J.-M., 1989: The self-organized exploratory pattern of the Argentine ant. Naturwissenschaften, 76, 579–581.

Granshaw, S., 1980: Bundle adjustment methods in engineering photogrammetry. Photogrammetric Record, 10(56), 181–207.

Grossmann, I., and Lee, S., 2003: Generalized disjunctive programming: Nonlinear convex hull relaxation and algorithms. Computational Optimization and Applications, 26, 83–100.

Hallinan, P., 1994: A low-dimensional representation of human faces for arbitrary lighting conditions. In IEEE Conference on Computer Vision and Pattern Recognition, 995–999.

Harris, C., and Stephens, M., 1988: A combined corner and edge detector. In Alvey Vision Conference, 147–151.

Hartley, R., 1994: Euclidean reconstruction from uncalibrated views. Applications of Invariance in Computer Vision, 825, 237–256.
Hartley, R., Gupta, R., and Chang, T., 1992: Stereo from uncalibrated cameras. In IEEE Conference on Computer Vision and Pattern Recognition, 761–764. Champaign, USA.

Hartley, R., and Schaffalitzky, F., 2004: L∞ minimization in geometric reconstruction problems. In IEEE Conference on Computer Vision and Pattern Recognition, volume I, 504–509. Washington DC, USA.

Hartley, R., and Sturm, P., 1997: Triangulation. Computer Vision and Image Understanding, 68(2), 146–157.

Hartley, R. I., 1997: Lines and points in three views and the trifocal tensor. International Journal of Computer Vision, 22(2), 125–140.

Hartley, R. I., 1998a: Chirality. International Journal of Computer Vision, 26(1), 41–61.

Hartley, R. I., 1998b: Computation of the trifocal tensor. In European Conference on Computer Vision, 20–35.

Hartley, R. I., Hayman, E., de Agapito, L., and Reid, I., 1999: Camera calibration and the search for infinity. In IEEE International Conference on Computer Vision, 510–517. Kerkyra, Greece.

Hartley, R. I., and Kahl, F., 2007a: Global optimization through searching rotation space and optimal estimation of the essential matrix. In IEEE International Conference on Computer Vision.

Hartley, R. I., and Kahl, F., 2007b: Optimal algorithms in multiview geometry. In Asian Conference on Computer Vision, 13–34.

Hartley, R. I., and Seo, Y.-D., 2008: Verifying global minima for L2 minimization problems. In IEEE Conference on Computer Vision and Pattern Recognition.

Hartley, R. I., and Zisserman, A., 2004: Multiple View Geometry in Computer Vision. Cambridge University Press, second edition.

Hastings, W., 1970: Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1), 97–109.

Henrion, D., and Lasserre, J. B., 2003: GloptiPoly: Global optimization over polynomials with Matlab and SeDuMi. ACM Transactions on Mathematical Software, 29(2), 165–194.

Henrion, D., and Lasserre, J. B., 2004: Solving nonconvex optimization problems - how GloptiPoly is applied to problems in robust and nonlinear control. IEEE Control Systems Magazine, 24(3), 72–83.

Henrion, D., and Lasserre, J. B., 2005: Detecting global optimality and extracting solutions in GloptiPoly. In Positive Polynomials in Control. Springer-Verlag.

Heyden, A., 1995: Geometry and Algebra of Multiple Projective Transformations. Ph.D. thesis, Lund University.

Heyden, A., and Åström, K., 1996: Euclidean reconstruction from constant intrinsic parameters. In IEEE International Conference on Pattern Recognition, volume I, 339–343. Vienna, Austria.

Heyden, A., and Åström, K., 1998: Minimal conditions on intrinsic parameters for Euclidean reconstruction. In Asian Conference on Computer Vision, volume II, 169–176.

Horst, R., and Tuy, H., 2006: Global Optimization: Deterministic Approaches. Springer Verlag.

Huber, P., 1981: Robust Statistics. Addison-Wesley, New York.

Hyvärinen, A., Karhunen, J., and Oja, E., 2001: Independent Component Analysis. John Wiley & Sons, Inc.

Kahl, F., 2001: Geometry and Critical Configurations of Multiple Views. Ph.D. thesis, Lund Institute of Technology, Sweden.

Kahl, F., 2005: Multiple view geometry and the L∞-norm. In IEEE International Conference on Computer Vision, 1002–1009.

Kahl, F., Agarwal, S., Chandraker, M., Belongie, S., and Kriegman, D., 2008: Practical global optimization for multiview geometry. International Journal of Computer Vision, 79(3), 271–284.

Kahl, F., and Hartley, R. I., 2008: Multiple view geometry under the L∞-norm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(9), 1603–1617.

Kahl, F., and Henrion, D., 2005: Globally optimal estimates for geometric reconstruction problems. In IEEE International Conference on Computer Vision, 978–985.

Kahl, F., and Henrion, D., 2007: Globally optimal estimates for geometric reconstruction problems. International Journal of Computer Vision, 74(1), 3–15.

Kahl, F., Triggs, W., and Åström, K., 2000: Critical motions for auto-calibration when some intrinsic parameters can vary. Journal of Mathematical Imaging and Vision, 13(2), 131–146.

Karmarkar, N., 1984: A new polynomial-time algorithm for linear programming. Combinatorica, 4(4), 373–395.

Ke, Q., and Kanade, T., 2005a: Quasiconvex optimization for robust geometric reconstruction. In IEEE International Conference on Computer Vision, 986–993. Beijing, China.

Ke, Q., and Kanade, T., 2005b: Robust L1 norm factorization in the presence of outliers and missing data by alternative convex programming. In IEEE Conference on Computer Vision and Pattern Recognition, 739–746. San Diego, USA.

Kiefer, J., and Wolfowitz, J., 1952: Stochastic estimation of the maximum of a regression function. Annals of Mathematical Statistics, 23, 462–466.

Kirkpatrick, S., Gelatt, C., and Vecchi, M., 1983: Optimization by simulated annealing. Science, New Series, 220(4598), 671–680.

Koenderink, J., and van Doorn, A., 1997: The generic bilinear calibration-estimation problem. International Journal of Computer Vision, 23(3), 217–234.

Konno, H., 1976: A cutting plane algorithm for solving bilinear programs. Mathematical Programming, 11, 14–27.

Kotz, S., Kozubowski, T. J., and Podgorski, K., 2001: The Laplace Distribution and Generalizations. Birkhäuser.

Lampert, C., Blaschko, M., and Hofmann, T., 2008: Beyond sliding windows: Object localization by efficient subwindow search. In IEEE Conference on Computer Vision and Pattern Recognition.

Land, A. H., and Doig, A. G., 1960: An automatic method of solving discrete program- ming problems. Econometrica, 28(3), 497–520.

Lasserre, J. B., 2001: Global optimization with polynomials and the problem of moments. SIAM Journal of Optimization, 11, 796–817.

Lempitsky, V., Blake, A., and Rother, C., 2008: Image segmentation by branch-and-mincut. In European Conference on Computer Vision, 15–29.

Levenberg, K., 1944: A method for the solution of certain non-linear problems in least squares. The Quarterly of Applied Mathematics, 2, 164–168.

Lindeberg, T., 1998: Feature detection with automatic scale selection. International Journal of Computer Vision, 30(2), 79–116.

Lindeberg, T., and Bretzner, L., 2003: Real-time scale selection in hybrid multi-scale representations. In Scale Space, 148–163.

Liu, Y., and Huang, T. S., 1988: A linear algorithm for motion estimation using straight line correspondences. CVGIP, 44(1), 33–57.

Longuet-Higgins, H. C., 1981: A computer algorithm for reconstructing a scene from two projections. Nature, 293, 133–135.

Lowe, D., 2004: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.

Lu, F.-F., and Hartley, R. I., 2007: A fast optimal algorithm for L2 triangulation. In Asian Conference on Computer Vision, 279–288.

Lucas, B., and Kanade, T., 1981: An iterative image registration technique with an application to stereo vision. In Image Understanding Workshop, 121–130.

Manning, R., and Dyer, C., 2001: Metric self calibration from screw-transform manifolds. In IEEE Conference on Computer Vision and Pattern Recognition, volume I, 590–597.

Marimont, D., and Wandell, B., 1992: Linear models of surface and illuminant spectra. Journal of the Optical Society of America, 9(11), 1905–1913.

Marquardt, D., 1963: An algorithm for least-squares estimation of nonlinear parameters. SIAM Journal on Applied Mathematics, 11, 431–441.

McCormick, G., 1976: Computability of global solutions to factorable nonconvex programs. Mathematical Programming, 10, 147–175.

Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E., 1953: Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21(6), 1087–1092.

Mikolajczyk, K., and Schmid, C., 2005: A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10), 1615–1630.

Moore, R. E., 1966: Interval Analysis. Prentice-Hall.

Murase, H., and Nayar, S. K., 1995: Visual learning and recognition of 3-d objects from appearance. International Journal of Computer Vision, 14(1).

Nesterov, Y., and Nemirovskii, A., 1994: Interior Point Polynomial Methods in Convex Programming. Society for Industrial and Applied Mathematics.

Nistér, D., 2001: Automatic dense reconstruction from uncalibrated video sequences. Ph.D. thesis, Royal Institute of Technology KTH, Sweden.

Nistér, D., 2004: An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6), 756–770.

Nistér, D., 2004: Untwisting a projective reconstruction. International Journal of Computer Vision, 60(2), 165–183.

Nistér, D., Kahl, F., and Stewénius, H., 2007: Structure from Motion with missing data is NP-Hard. In IEEE Conference on Computer Vision and Pattern Recognition.

Nistér, D., Naroditsky, O., and Bergen, J., 2004: Visual odometry. In IEEE Conference on Computer Vision and Pattern Recognition, 652–659.

Nistér, D., and Stewénius, H., 2006: A minimal solution to the generalized 3-point pose problem. Journal of Mathematical Imaging and Vision, 27(1), 67–79.

Olsson, C., Kahl, F., and Hartley, R. I., 2009a: Projective least squares: Global solutions through local optimization. In IEEE International Conference on Pattern Recognition.

Olsson, C., Kahl, F., and Oskarsson, M., 2009b: Branch-and-bound methods for Euclidean registration problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 783–794.

Parrilo, P., 2003: Sum of squares decompositions for structured polynomials via semidefinite programming. Ph.D. thesis, TU München.

Pollefeys, M., Koch, R., and van Gool, L., 1998: Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters. In IEEE International Conference on Computer Vision, 90–95. Mumbai, India.

Pollefeys, M., Koch, R., and van Gool, L., 1999: Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters. International Journal of Computer Vision, 32(1), 7–25.

Pollefeys, M., Nistér, D., Frahm, J.-M., Akbarzadeh, A., Mordohai, P., Clipp, B., Engels, C., Gallup, D., Kim, S. J., Merrell, P., Salmi, C., Sinha, S. N., Talton, B., Wang, L., Yang, Q., Stewénius, H., Yang, R., Welch, G., and Towles, H., 2008: Detailed real-time urban 3D reconstruction from video. International Journal of Computer Vision, 78(2-3), 143–167.

Pollefeys, M., and van Gool, L., 1999: Stratified self-calibration with the modulus constraint. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8), 707–724.

Pollefeys, M., van Gool, L., and Oosterlinck, M., 1996: The modulus constraint: A new constraint for self-calibration. In IEEE International Conference on Pattern Recognition, volume I, 349–353. Vienna, Austria.

Pollefeys, M., Verbiest, F., and van Gool, L., 2002: Surviving dominant planes in uncalibrated structure and motion recovery. In European Conference on Computer Vision, 837–851.

Pottmann, H., and Wallner, J., 2001: Computational Line Geometry. Springer.

Prajna, S., Papachristodoulou, A., and Parrilo, P., 2002: Introducing SOSTOOLS: a general purpose sum of squares programming solver. In IEEE Conference on Decision and Control.

Putinar, M., 1993: Positive polynomials on compact semi-algebraic sets. Indiana University Mathematics Journal, 42(3), 969–984.

Quan, L., and Lan, Z., 1999: Linear N-point camera pose determination. IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 21(8), 774–780.

Robbins, H., and Monro, S., 1951: A stochastic approximation method. Annals of Mathematical Statistics, 22, 400–407.

Robert, C., and Casella, G., 2004: Monte Carlo Statistical Methods. Springer-Verlag, New York.

Sarfraz, S., and Hellwich, O., 2008: Head pose estimation in face recognition across pose scenarios. In International Conference on Computer Vision Theory and Applications, 235–242.

Schaffalitzky, F., 2000: Direct solution of modulus constraints. In Proc. Indian Conf. on Computer Vision, Graphics and Image Processing, 314–321.

Schaible, S., and Shi, J., 2003: Fractional programming: the sum-of-ratios case. Opti- mization Methods and Software, 18, 219–229.

Schindler, G., Krishnamurthy, P., and Dellaert, F., 2006: Line-based structure from motion for urban environments. In 3D Data Processing, Visualization and Transmission, 846– 853.

Schmüdgen, K., 1991: The K-moment problem for compact semi-algebraic sets. Mathematische Annalen, 289(2), 203–206.

Schweighofer, M., 2006: Optimization of polynomials on compact semialgebraic sets. SIAM Journal of Optimization, 15(3), 920–942.

Seo, Y., and Hartley, R. I., 2007: A fast method to minimize L-infinity error norm for geometric vision problems. In IEEE International Conference on Computer Vision.

Shashua, A., and Werman, M., 1995: Trilinearity of three perspective views and its associated tensor. In IEEE International Conference on Computer Vision, 920–925. IEEE Computer Society Press, Cambridge, MA, USA.

Shashua, A., and Wolf, L., 2000: On the structure and properties of the quadrifocal tensor. In European Conference on Computer Vision, 710–724.

Sherali, H., and Adams, W., 1998: A Reformulation-Linearization Technique for Solving Discrete and Continuous Nonconvex Problems. Springer.

Sherali, H., and Alameddine, A., 1992: A new reformulation-linearization technique for bilinear programming problems. Journal of Global Optimization, 2(4), 379–410.

Shor, N., 1998: Nondifferentiable Optimization and Polynomial Problems. Kluwer Academic Publishers.

Sim, K., and Hartley, R. I., 2006a: Recovering camera motion using L∞ minimization. In IEEE Conference on Computer Vision and Pattern Recognition, 1230–1237.

Sim, K., and Hartley, R. I., 2006b: Removing outliers using the L∞-norm. In IEEE Conference on Computer Vision and Pattern Recognition, 485–494.

Slama, C., 1980: Manual of Photogrammetry. American Society of Photogrammetry and Remote Sensing, Falls Church, Virginia, USA.

Stewénius, H., Engels, C., and Nistér, D., 2006: Recent developments on direct relative orientation. ISPRS Journal of Photogrammetry and Remote Sensing, 60, 284–294.

Stewénius, H., Schaffalitzky, F., and Nistér, D., 2005: How hard is three-view triangulation really? In IEEE International Conference on Computer Vision, 686–693.

Storn, R., and Price, K., 1997: Differential Evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11(4), 341–359.

Sturm, J., 1999: Using SeDuMi 1.02, a Matlab toolbox for optimization over symmetric cones. Optimization Methods and Software, 11-12, 625–653.

Sturm, P., 1997: Vision 3D Non Calibrée: Contributions à la Reconstruction Projective et Étude des Mouvements Critiques pour l'Auto-Calibrage. Ph.D. thesis, Institut National Polytechnique de Grenoble.

Sturm, P., 2000: A case against Kruppa’s equations for camera self-calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10), 1199–1204.

Sturm, P., and Triggs, W., 1996: A factorization based algorithm for multi-image pro- jective structure and motion. In European Conference on Computer Vision, 709–720. Cambridge, UK.

Tawarmalani, M., and Sahinidis, N., 2001a: Semidefinite relaxations of fractional programs via novel convexification techniques. Journal of Global Optimization, 20, 137–158.

Tawarmalani, M., and Sahinidis, N., 2002: Convex extensions and envelopes of lower semi-continuous functions. Mathematical Programming, 93(2), 247–263.

Tawarmalani, M., and Sahinidis, N. V., 2001b: Semidefinite relaxations of fractional programs via novel convexification techniques. Journal of Global Optimization, 20(2), 137–158.

Taylor, C. J., and Kriegman, D. J., 1995: Structure and motion using line segments in multiple images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(11), 1021–1032.

Tenenbaum, J., and Freeman, W., 2000: Separating style and content with bilinear models. Neural Computation, 12(6), 1247–1283.

Thakoor, N., and Gao, J., 2008: Branch-and-bound hypothesis selection for two-view multiple structure and motion segmentation. In IEEE Conference on Computer Vision and Pattern Recognition.

Tomasi, C., and Kanade, T., 1992: Shape and motion from image streams under orthography: A factorization method. International Journal of Computer Vision, 9(2), 137–154.

Torresani, L., Hertzmann, A., and Bregler, C., 2003: Learning non-rigid 3D shape from 2D motion. In Advances in Neural Information Processing Systems, 1555–1562.

Triggs, W., 1997: Autocalibration and the absolute quadric. In IEEE Conference on Computer Vision and Pattern Recognition, 609–614.

Triggs, W., McLauchlan, P., Hartley, R., and Fitzgibbon, A., 1999: Bundle adjustment - a modern synthesis. In Vision Algorithms ’99, 298–372. in conjunction with ICCV, Kerkyra, Greece.

Waki, H., Kim, S., Kojima, M., and Muramatsu, M., 2006: Sums of squares and semidefinite programming relaxations for polynomial optimization problems with structured sparsity. SIAM Journal of Optimization, 17(1), 218–242.

Waki, H., Kim, S., Kojima, M., Muramatsu, M., and Sugimoto, H., 2008: Algorithm 883: SparsePOP – a sparse semidefinite programming relaxation of polynomial optimization problems. ACM Transactions on Mathematical Software, 35(2), 1–13.

Wolf, L., and Shashua, A., 2002: On projection matrices $P^k \mapsto P^2$, $k = 3, \ldots, 6$, and their applications in computer vision. International Journal of Computer Vision, 48(1), 53–67.

Yuille, A. L., and Rangarajan, A., 2001: The concave-convex procedure (CCCP). In Advances in Neural Information Processing Systems, 1033–1040.

Zach, C., Pock, T., and Bischof, H., 2007: A globally optimal algorithm for robust TV-L1 range image integration. In IEEE International Conference on Computer Vision.

Zhang, Z., and Faugeras, O. D., 1992: Estimation of displacements from two 3-D frames obtained from stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(12), 1141–1156.

Zongker, D., and Jain, A., 1996: Algorithms for feature selection: An evaluation. In IEEE International Conference on Pattern Recognition, 18–22.