Inaugural-Dissertation

Dissertation submitted for the degree of Doctor of Natural Sciences to the Combined Faculty of Natural Sciences and Mathematics of Ruprecht-Karls-Universität Heidelberg

presented by Andreas Neufeld, Master of Mathematics and Computer Science, from Ramenskoye, Russia

Date of oral examination:

Variational Approaches for Motion and Structure from Monocular Video

Advisors: Prof. Dr. Christoph Schnörr, Prof. Dr. Björn Ommer

Zusammenfassung

Estimating motion and spatial structure from image sequences are fundamental problems of image processing. These problems remain relevant, even though the first methods were published several decades ago. We present new approaches to motion and structure estimation for autonomous driving. An autonomous car requires accurate knowledge of its environment, since a misjudgment can have severe consequences. Specifically, we consider monocular methods, where only a single camera is available on the vehicle.

Image sequences from road traffic are particularly challenging for motion estimation. Due to the high speed, the motion is very large, the lighting conditions are unstable, and there can be distortions caused by reflections and weather conditions. We present new discrete methods for computing optical flow, which define probabilistic graphical models for optical flow.

In the first approach, we select a number of locations in the reference image and match them against the second image. The best correspondences, which are also consistent with a monocular motion, are selected as candidates for a graphical model. In a further method, we match all locations in the reference image, but do not solve the graphical model directly; instead, we approximate it with a sequence of smaller models.

Since we have a monocular image sequence, the camera positions must be reconstructed in addition to the scene. Due to the projective geometry there are blind spots in the images and ambiguities with respect to scale, which do not exist in a system with multiple cameras.

We present two methods for structure estimation. The first method determines the optimal path of a camera with respect to an energy function of optical flow and depth estimates. The second method estimates the camera motion and a planar scene description from a single flow field.

We evaluate the methods on various real and synthetic data. For the evaluation of the planar reconstructions, we created our own dataset, which contains depth and plane information.

Abstract

Motion and structure estimation are elementary problems of computer vision. They are active areas of research, even though the first methods were proposed several decades ago. We develop new approaches to motion and structure estimation for autonomous driving. An autonomous vehicle requires an accurate model of its environment, since wrong decisions made by an autonomous car can have severe consequences. We assume the monocular setup, where only a single camera is mounted on the car.

Outdoor traffic sequences are challenging for optical flow estimation. The high speed of the car causes large displacements in the optical flow field, the lighting conditions are unstable, and there can be strong distortions due to reflections and difficult weather conditions. We propose new discrete methods, which determine optical flow as the optimal configuration of a probabilistic graphical model.

The first approach selects sparse locations in the reference frame, and matches them across the second image. The best correspondences, which satisfy constraints from a multiple view configuration, are considered as motion vector candidates in a graphical model. In a second approach, we solve for dense optical flow by approximating the original infeasible graphical model with a sequence of reduced models.

The monocular configuration poses challenges to the estimation of scene structure; camera positions and scene parameters need to be estimated jointly. The geometry of multiple views creates blind spots in the images, and adds a scale ambiguity, both of which do not exist in a setup with multiple cameras.

We propose two methods for structure estimation. The first approach determines the energy optimal camera track, given optical flow and depth observations. A further approach estimates camera motion and a piecewise planar scene description jointly from a single optical flow field. The scene description contains depth and plane normal information.

We evaluate our approaches to motion and structure estimation on different real world and rendered datasets. In addition to publicly available evaluation data, we evaluate on a new rendered dataset with ground truth plane normals.

Acknowledgments

During the PhD there were many challenging and uncertain paths to explore, a few of which led to a dead end. Completing this work would not have been possible without the support and advice of many people, to whom I am very grateful.

The first person to mention is my main supervisor Prof. Christoph Schnörr, who introduced me to the field of image processing, and in large parts to mathematical modeling and optimization in general. Through his supervision I gained a great deal of personal experience during my PhD, in addition to subject specific knowledge.

Furthermore, I would like to thank my colleagues at Heidelberg University, in particular the members of the Image & Pattern Analysis Group. Special thanks go to Florian Becker and Frank Lenzen, who proposed interesting new ideas and were always prepared for questions and discussions. I also thank Jörg Kappes and Bogdan Savchynskyy for the discussions and very helpful feedback, as well as Evelyn Wilhelm and Barbara Werner for helping me with administrative issues.

Finally, I would like to thank my parents, Ludmila and Alexander Neufeld, and my sister Helene Gehrmann for their support and encouragement.

Contents

List of Figures

List of Tables

Chapter 1. Introduction
1.1. Motivation
1.2. Related Work
1.2.1. Feature Matching and Optical Flow
1.2.2. Structure Estimation
1.3. Contribution
1.4. Organization

Chapter 2. Mathematical Optimization
2.1. Convex Analysis
2.1.1. Duality
2.1.2. Lagrangian Multipliers
2.1.3. Optimality Conditions
2.1.4. Linear Programming
2.2. Continuous Optimization Algorithms
2.2.1. Differentiable Functions
2.2.2. Splitting Methods
2.3. Elements of Optimal Control

Chapter 3. Mathematical Models
3.1. Probabilistic Graphical Models
3.1.1. Exponential Models
3.1.2. Marginalization
3.1.3. Maximum A-Posteriori Inference
3.1.4. Submodular Graphical Models: Graph Cuts
3.1.5. The Local Polytope Relaxation
3.1.6. Dual Decomposition
3.1.7. Tree Based Belief Propagation Algorithms
3.1.8. Continuous Labeling
3.2. Elementary Manifolds
3.2.1. Differential Geometry
3.2.2. Riemannian Manifolds
3.2.3. Lie Groups
3.2.4. Gradient Flows on Manifolds

Chapter 4. Multiple View Geometry
4.1. Projective Geometry
4.1.1. The Camera Model
4.1.2. Homographies
4.1.3. The Fundamental Matrix
4.1.4. Essential Matrix Estimation
4.2. Stereo Matching
4.3. Structure Estimation with Bundle Adjustment
4.4. Monocular Localization and Mapping

Chapter 5. Variational Discrete Optical Flow Estimation
5.1. Optical Flow Estimation
5.1.1. Robust Optical Flow
5.1.2. Discrete Approaches
5.1.3. Functional Lifting and Convex Relaxation
5.2. Sparse Discrete Optical Flow
5.2.1. Learning Patch-based Matching
5.2.2. Graphical Model Construction and Inference
5.2.3. Evaluation
5.3. Dense Hierarchical Label Reduction
5.3.1. Reduced Graphical Model Parameters
5.3.2. Label Reduction Based on the α-Divergence
5.3.3. Evaluation

Chapter 6. Structure from Motion
6.1. Monocular Reconstruction based on Filtering Techniques
6.1.1. Variational Structure from Motion
6.1.2. A Novel Minimum Energy Filter for Visual Odometry
6.2. Depth and Normal Regularization
6.2.1. Piecewise Planar Depth Map Smoothing
6.2.2. Scene Estimation from Dense Optical Flow
6.2.3. Plane Estimation on Superpixels
6.2.4. Optimization Framework
6.2.5. Evaluation on the KITTI Dataset
6.2.6. Evaluation on Rendered Data

Chapter 7. Conclusion

Bibliography

List of Figures

1.1 SIFT Keypoints
1.2 Optical Flow Pyramid
2.1 Fenchel Duality
2.2 Line Search
2.3 Optimal Control Example: The Rocket Car
3.1 Factor Graph
3.2 Maximum Flow
3.3 Graph Cuts
3.4 Ishikawa's Method
4.1 The Pinhole Camera Model
4.2 The Fundamental Matrix
4.3 Epipole Locations
4.4 Bundle Adjustment: Panorama Stitching
5.1 Sparse Discrete Flow Examples
5.2 Example Matches
5.3 Example Matches Details
5.4 Sparse Candidate Filtering
5.5 Sparse Optical Flow Results
5.6 Optical Flow Label Hierarchy
5.7 Graphical Model Reduction
5.8 Label Reduction Example
5.9 Tsukuba Results
5.10 Hierarchical Reduction Runtime Comparison
5.11 KITTI Stereo Results 1
5.12 KITTI Stereo Results 2
5.13 Sintel Discrete Flow Results 1
5.14 Sintel Discrete Flow Results 2
6.1 Inverse Depth Parameterization

6.2 TGV Regularized Depth Smoothing
6.3 Extended TGV Regularized Depth Smoothing
6.4 Parameterized Point Correspondence
6.5 Superpixel Discretization
6.6 The Charbonnier Penalty Function
6.7 Optical Flow Data Energy
6.8 Monocular Reprojective Error
6.9 KITTI Rendered Reconstructions
6.10 KITTI Examples, Sequences 4 and 9
6.11 KITTI Examples, Sequences 23 and 32
6.12 KITTI Examples, Sequences 68 and 74
6.13 KITTI Examples, Sequences 116 and 157
6.14 Rendered Scenes, Sequences 1 and 2
6.15 Rendered Scenes, Sequences 3 and 4
6.16 Rendered Scenes, Sequences 5 and 6
6.17 Rendered Scenes, Sequences 7 and 8
6.18 Rendered Scenes, Sequences 9 and 10

List of Tables

5.1 Sparse Optical Flow KITTI Evaluation
5.2 Dense Hierarchical KITTI Stereo Evaluation
5.3 Dense Hierarchical Optical Flow EPE Evaluation
5.4 Dense Hierarchical Optical Flow 2px Evaluation
6.1 KITTI Reprojective Error Evaluation
6.2 Rendered Scenes Reprojective Error Evaluation
6.3 Rendered Scenes Camera Motion Evaluation
6.4 Rendered Scenes Normal Evaluation


CHAPTER 1

Introduction

1.1. Motivation

Autonomous vehicles are about to revolutionize individual mobility in the near future. Prototypes are already driving autonomously; these are equipped with multiple cameras and other sensors such as GPS, laser, radar and ultrasound sensors. Detection of drivable area, traffic signs, traffic lights, other cars and pedestrians is vitally important for autonomous driving, which is why many redundant sensors are applied. Cameras have the advantage that they are passive: they record light from the environment without any active interaction, while radar and ultrasonic sensors emit their own signal and have difficulties with interference and other noise.

A configuration with multiple cameras provides more information than a single camera, but needs to be calibrated. If the calibration breaks down due to deformation, or due to outage of individual cameras, the whole system no longer operates correctly. Additionally, any part inside the car comes at a cost, and consumes power and space. Therefore, we investigate the monocular setup, which consists of a single forward facing camera.

Human drivers prove that driving a car based on visual cues is possible, but computer algorithms are still much worse at image abstraction than humans. Motion and structure estimation are intuitive and instantaneous operations for us, whereas they have been active research areas for decades. Algorithms, however, have the advantage that they can make faster and more accurate decisions than humans, and are not influenced by distraction. We believe that autonomous cars will outperform human drivers in the future, and reduce the number of car accidents drastically while increasing comfort for the passengers at the same time.

This work consists of two major parts: motion estimation under realistic outdoor conditions, and structure from motion estimation. Outdoor traffic sequences are challenging for motion estimation, due to difficult lighting conditions and fast vehicle motion. The large range of depth in an outdoor environment induces a nonlinearly curved motion field, even if the underlying scene is perfectly flat and simple. Traditional motion estimation methods struggle with this type of motion, which leads to errors in scene reconstruction.


The monocular camera setup provides less information than a calibrated stereoscopic or multiple view camera system. We investigate the limits, and propose new methods for monocular structure estimation.

1.2. Related Work

There is a large body of literature on motion and structure estimation, which are fundamental problems of computer vision. Point correspondences provide an abstraction of raw images, which is required by many computer vision systems. Motion can be defined differently, depending on the task it is meant to solve. The tracking problem consists of estimating the motion of a small set of interest points (or keypoints) over several frames. If only two frames are used, tracking is usually referred to as feature matching, while pixel wise dense correspondence between two frames is referred to as optical flow.

Similarly, the formulation of structure estimation is application dependent. Early algorithms for monocular structure estimation operate on a small set of keypoints, and aim at accurate camera localization. Later methods introduced dense depth maps for each view, yielding three dimensional models of the environment.

1.2.1. Feature Matching and Optical Flow. While both feature matching and optical flow relate to image correspondence, they are addressed with very different means. Feature matching is performed only at few locations in the image, therefore rich features can be computed and matched efficiently. One of the most widely used feature descriptors is the scale invariant feature transform (SIFT) [55], which computes image keypoints consisting of location, scale and orientation, as well as a 128 dimensional feature descriptor. Figure 1.1 shows SIFT keypoints for two example images from a traffic sequence. An approximate version of SIFT is given by the speeded-up robust features (SURF) [8]. SIFT and SURF have motivated further image descriptors, for example binary robust independent elementary features (BRIEF) [19] and oriented FAST and rotated BRIEF (ORB) [75]. Histogram of oriented gradients (HOG) features [22] were originally proposed for object recognition, but have been applied in the context of point correspondence as well.

In contrast to sparse feature matching, dense optical flow cannot rely on distinct keypoints with precomputed scale and orientation, but has to use a simpler matching criterion. The first optical flow methods were proposed by Horn and Schunck [40] and Lucas and Kanade [56], and both are based on the brightness constancy assumption (BCA), assuming that image intensity does not change between two frames along the flow field. This condition alone does not define a unique optical flow field, since it only provides one constraint on two variables.

Figure 1.1. Detected SIFT keypoints on two consecutive frames from the KITTI dataset. The keypoints provide orientation and scale, which allows the feature extractor to compensate distortions due to rotation and scaling.

The SIFT keypoint detector is very robust in practice, and can compensate strong perspective distortions. The approach in [2] aligns random tourist pictures using SIFT and computes a three dimensional model of the sights, which demonstrates the robustness of the SIFT matching algorithm. Since the number of keypoints is small compared to the number of pixels in the image, rich features can be extracted and matched in real time on a single core CPU, or on a mobile device.

Therefore, Horn and Schunck add a quadratic regularity term, which yields an energy function with a unique minimum. Lucas and Kanade impose implicit regularity through the assumption that neighboring pixels have similar optical flow, which does not define a dense flow field in general.

In practical outdoor conditions, the BCA often does not hold, due to the complex lighting conditions in outdoor environments and the fast movement of the camera. Additionally, the quadratic regularity term does not preserve sharp motion discontinuities and tends to oversmooth the flow field. The original formulation of Horn and Schunck has been the basis for many later approaches, where data and regularity terms have been replaced by more robust energy terms [73, 90] or data terms based on feature descriptors such as SIFT and HOG [22, 16, 52]. The authors of [25] estimate illumination changes explicitly based on training data.

A further limitation of the BCA is that it can only be enforced if the optical flow field is very small, ideally on a subpixel level, which is related to the aperture problem [10].
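For reference, the classical Horn and Schunck energy couples the linearized BCA data term with the quadratic regularizer; written here in standard notation with image derivatives $I_x, I_y, I_t$, flow field $(u, v)$, and regularization weight $\alpha$ (the notation used later in Chapter 5 may differ):
\[ E(u, v) = \int_{\Omega} \left( I_x u + I_y v + I_t \right)^2 + \alpha \left( \| \nabla u \|^2 + \| \nabla v \|^2 \right) dx. \]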

Figure 1.2 (panels: (a) reference frame, (b) warped frame, (c) optical flow). Pyramid setup for dense optical flow computation. The optical flow field is scaled up and refined when proceeding to the next layer in the image pyramid. The difference between the reference and the warped frame is used for the data energy, and there usually is a regularity energy on the optical flow field as well. The red parts of the warped frame are undefined, because the flow vectors point outside of the second image; these regions are either ignored or extrapolated during optical flow estimation. Due to the hierarchical nature of the approach, objects may be lost on a small scale, and cannot be recovered at a later stage.

Therefore, an image pyramid with hierarchical refinement of the estimated optical flow field is required, as depicted in Figure 1.2. There is no guarantee that the optical flow field minimizes the respective energy on the full image scale. Some methods improve the initialization by patch matching [93, 6] and by additional information, such as edge detection [74].

Several optical flow methods have been proposed which avoid the image pyramid by modeling optical flow as a discrete optimization problem, where the labels are given by displacement vectors at each pixel. The main drawback of discrete optical flow lies in the large label space: depending on the maximum flow vector norm there may be thousands of displacement candidates, and combined with the large number of pixels in the image, we get an intractable graphical model. The DiscreteFlow approach by Menze et al. [60] reduces the number of displacement candidates by two to three orders of magnitude using patch matching, which turns the discrete model into a tractable size. Other methods extend the optical flow problem with occlusion detection [5, 3] or segmentation of the flow field into layers with independent motion [79, 78, 84, 85].

Even though continuous methods using the heuristic image pyramid do not guarantee global optimality, they perform very well in practice. However, these methods have been optimized over several decades, while discrete optical flow methods are relatively new. With increasing computational resources and improvements in discrete inference methods, we see potential in discrete optical flow methods, and focus on them.

1.2.2. Structure Estimation. Monocular structure estimation, in particular simultaneous localization and mapping (SLAM), consists of the estimation of camera poses and three dimensional scene parameters, which are determined simultaneously. Several approaches to monocular structure computation have been presented, which address the task with different techniques and scene representations.

The parallel tracking and mapping (PTAM) method [43] tracks a set of keypoints over several frames, and computes the camera location as well as point depth values for each view. In the dense tracking and mapping (DTAM) approach [67], camera locations are computed first, followed by a dense depth map estimation and camera location refinement for each frame. While PTAM and DTAM are primarily designed for reconstruction of small objects in a confined space, large scale direct monocular SLAM (LSD-SLAM) [28] performs localization and mapping in a large scale outdoor environment. Loop closure ensures accurate alignment of the point cloud if the same location is visited repeatedly.

The authors of [46] use a voxel representation in a sparse octree data structure, allowing them to represent large outdoor environments. They combine their approach with a semantic assignment of the scene into road, car, building, pedestrian and similar classes. Recent approaches apply convolutional neural networks (CNNs) for monocular scene reconstruction. The CNN-SLAM approach [87] integrates depth predictions from a single view into a SLAM architecture, jointly with applying a CNN for semantic labeling. Similarly, the authors of [27] estimate depth and plane normal information from a single view using a pre-trained CNN.

A stereo setup with two calibrated cameras provides more information than a single camera system. In the monocular case depth information needs to be accumulated over time, while stereo matching corresponds to direct depth observation. The approach presented in [68] fits stixels to the estimated depth map. Stixels are vertical segments which cover the entire vertical span of an object; they reflect the assumption that objects in a traffic scene are most likely positioned on the ground. Similarly, [53] assigns the classes sky, building, vehicle or pedestrian, and street from top to bottom in fixed order for each pixel column of the reference image. The approach in [47] performs stereo matching and semantic labeling jointly with a graphical model.

Related to stereo estimation is the problem of scene flow estimation, where both camera and scene movement are reconstructed from a sequence of stereo image pairs. In a static scene, the scene movement is zero, or inverse to the camera movement, depending on the reference coordinate system. In scenes with dynamic objects, however, scene flow defines the motion of each object in the scene. The approach by Vogel et al. [91] estimates rigid motion and plane parameters on image superpixels. Menze et al. [59] compute a segmentation of dynamic objects. Cars are recognized by comparison with detailed computer aided design (CAD) models.

1.3. Contribution

We present new approaches to discrete optical flow estimation and structure computation from monocular video. Both tasks are addressed with variational methods: we define probabilistic graphical models for discrete optical flow, and apply optimal control and continuous energy minimization to the structure computation task.

Our proposed discrete optical flow approaches are based on exhaustive matching in a search window, which yields very large, intractable label spaces. In a sparse approach, we only match distinct regions in the reference frame, and keep very few displacement candidates for consideration in the graphical model. This strategy reduces the size of the graphical model, but it also removes valuable information from the system. Therefore, we present a dense approach, which reduces the large dense model by clustering flow vectors and applying a reduction technique based on a probabilistic divergence. The reduction procedure is iterated until a flow field is defined.

For the task of structure computation, we propose two approaches with different objectives. We present a new minimum energy filter, which computes the energy optimal trajectory of the camera from optical flow and depth observations. Additionally, we present an energy minimization framework, which computes camera motion and a scene description using planar segments from two frames only. In order to evaluate the piecewise planar scene description, we generated a dataset with ground truth normal information.

1.4. Organization

In Chapter 2, we present the mathematical principles behind optimization. Starting with an introduction to convex analysis, we proceed to continuous optimization algorithms for convex functions. A brief introduction to optimal control concludes the chapter. Chapter 3 on mathematical models introduces probabilistic graphical models and manifolds. We discuss different approaches to inference on graphical models, and demonstrate how discrete inference relates to convex optimization. We define manifolds and tangent spaces, and discuss optimization of functions defined on manifolds.

Chapter 4 explains the concepts of computer vision that we need for our algorithms. We introduce epipolar geometry, which explains the geometry of point correspondences between multiple views of the same scene, and how point correspondences are used for structure computation.

We explain variational optical flow estimation, in particular discrete optical flow, in Chapter 5. Discrete optical flow methods mainly struggle with the large label space of optical flow models. We present related methods, and propose new strategies for reducing the models to a tractable size. In Chapter 6, we present filtering techniques for monocular scene reconstruction, and compare the advantages of filtering to batch approaches. We present our minimum energy filter framework and the two frame scene estimation approach. Chapter 7 concludes the thesis.

CHAPTER 2

Mathematical Optimization

Physical processes obey the principle of minimum potential energy. Similarly, mathematical problems can in many cases be expressed as energy minimization or variational problems. Depending on the formulation of the objective function, different techniques need to be applied, of which we present a brief overview in this chapter.

A major property of an objective function is convexity. Convex optimization problems are convenient from the viewpoint of optimization, since there is no risk of converging to a local instead of a global minimum. From duality theory, we get optimality bounds on the current state. There are many optimization methods available for convex problems, which can incorporate constraints and do not require the energy to be continuous or differentiable. However, many practical optimization tasks are non convex. Those can be relaxed to a convex problem, or approximated by a sequence of convex optimization problems. Optimal control describes the scenario where we cannot influence the state directly, but only through control variables which satisfy a system of differential equations.

We start with an introduction to convex analysis in Section 2.1, in particular linear programming, followed by a survey of optimization algorithms suitable for convex minimization in Section 2.2. We distinguish continuous optimization techniques and splitting methods, which do not require smoothness of the objective function. In Section 2.3 we present Pontryagin's maximum principle, defining optimality conditions on the control variables in optimal control problems.

2.1. Convex Analysis

We briefly summarize the key concepts of convex analysis that provide the basis for optimization of convex functions. Convex functions generally provide convergence to a global optimum, while for non convex functions convergence can only be achieved to a local optimum, if convergence can be achieved at all. Before we define convex functions, we define convexity for sets.

Definition 2.1.1 (convex set). A set $C \subset \mathbb{R}^n$ is convex if the line connecting any two points in $C$ is fully contained in $C$ as well,
\[ (1 - \lambda) x + \lambda y \in C, \quad \forall x, y \in C, \; \forall \lambda \in [0, 1]. \]

Example 2.1.1 (basic convex sets).

halfspaces: A hyperplane divides $\mathbb{R}^n$ into two halfspaces,
\[ H^+_{p,\alpha} = \{ x \in \mathbb{R}^n : \langle p, x \rangle \ge \alpha \}, \qquad H^-_{p,\alpha} = \{ x \in \mathbb{R}^n : \langle p, x \rangle \le \alpha \}. \]

polyhedral set: The intersection of finitely many halfspaces is called a polyhedral set. The halfspace equations can be collected in a matrix $A$ and a vector $b$,
\[ P = \{ x \in \mathbb{R}^n : Ax \le b \}. \]
Bounded polyhedral sets are called polytopes. Note that an equality constraint can be expressed as the intersection of two inequality constraints.

convex hull: Given a set of points $\{ x_i \in \mathbb{R}^n \}$, the set of all convex combinations forms a convex set,
\[ H = \Big\{ \sum_{i=1}^{d} \lambda_i x_i : \sum_{i=1}^{d} \lambda_i = 1, \; \lambda_i \ge 0 \; \forall i \Big\}. \]

probability simplex: The ($n-1$ dimensional) probability simplex represents the set of all discrete probability distributions in $\mathbb{R}^n$,
\[ \Delta_n = \{ x \in \mathbb{R}^n : x \ge 0, \; \langle \mathbb{1}, x \rangle = 1 \}. \]

convex cones: A set $K \subset \mathbb{R}^n$ is called a convex cone if
\[ \lambda x \in K, \quad \forall \lambda \in \mathbb{R}_+, \; \forall x \in K, \]
and $K$ is convex. The polar cone $K^*$ of the cone $K$ is given by
\[ K^* = \{ y \in \mathbb{R}^n : \langle y, x \rangle \le 0, \; \forall x \in K \}. \tag{2.1} \]
The normal cone $N_C(x)$ to a convex set $C$ is defined as
\[ N_C(x) = \{ v \in \mathbb{R}^n : \langle v, y - x \rangle \le 0, \; \forall y \in C \}. \]
The normal cone represents all hyperplanes supporting $C$ at the point $x$. The polar cone can be expressed as the normal cone of $K$ at zero, $K^* = N_K(0)$.

Definition 2.1.2 (convex function). A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if and only if, for all $x, y \in \mathbb{R}^n$,
\[ f((1 - \lambda) x + \lambda y) \le (1 - \lambda) f(x) + \lambda f(y), \quad \forall \lambda \in [0, 1]. \]

There is a direct connection between convex sets and convex functions: a function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if and only if its epigraph
\[ \operatorname{epi} f = \{ (x, \alpha) \in \mathbb{R}^n \times \mathbb{R} : f(x) \le \alpha \} \]
is a convex set. Level sets of convex functions are convex sets as well,
\[ \operatorname{lev}_\alpha f = \{ x \in \mathbb{R}^n : f(x) \le \alpha \}. \]
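As a concrete link between the two notions above, consider the pointwise maximum of affine functions: its epigraph is an intersection of halfspaces, hence a polyhedral and in particular convex set, so the function is convex,
\[ f(x) = \max_{i=1,\dots,m} \langle a_i, x \rangle + b_i, \qquad \operatorname{epi} f = \bigcap_{i=1}^{m} \left\{ (x, \alpha) \in \mathbb{R}^n \times \mathbb{R} : \langle a_i, x \rangle - \alpha \le -b_i \right\}. \]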

The directional derivative of a function $f : \mathbb{R}^n \to \mathbb{R}$ is defined as
\[ Df(x)[v] = \lim_{h \to 0} \frac{f(x + hv) - f(x)}{h}. \]
The gradient $\nabla f(x)$ is given by
\[ Df(x)[v] = \langle \nabla f(x), v \rangle, \quad \forall v \in \mathbb{R}^n. \]
For any convex and differentiable function $f$, the affine approximation around $x_0$ provides a lower bound of $f$,
\[ f(x_0) + \langle \nabla f(x_0), x - x_0 \rangle \le f(x), \quad \forall x \in \mathbb{R}^n. \tag{2.2} \]
For non-differentiable convex $f$, this bound generalizes to subgradients: a vector $\hat{p}$ is a subgradient of $f$ at $\hat{x}$, written $\hat{p} \in \partial f(\hat{x})$, if
\[ f(x) \ge f(\hat{x}) + \langle \hat{p}, x - \hat{x} \rangle, \quad \forall x \in \mathbb{R}^n. \tag{2.3} \]

2.1.1. Duality. The conjugate (dual) function of $f$ is defined as
\[ f^*(p) = \sup_{x \in \mathbb{R}^n} \{ \langle p, x \rangle - f(x) \}, \]
and the biconjugate as

\[ f^{**}(x) = (f^*)^*(x) = \sup_{p \in \mathbb{R}^n} \{ \langle p, x \rangle - f^*(p) \}. \]

The dual function $f^*$ is always convex, hence $f^{**}$ is convex as well; $f^{**}$ is the tightest convex approximation of $f$.

Lemma 2.1.2 (Fenchel's inequality). From the definition of the conjugate function, we see

\[ f^*(p) + f(x) \ge \langle p, x \rangle. \]
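A worked example, anticipating Figure 2.1: for $f(x) = x^2$ the supremum in the conjugate is attained where the derivative of $px - x^2$ vanishes, at $x = p/2$,
\[ f^*(p) = \sup_{x \in \mathbb{R}} \{ px - x^2 \} = \frac{p^2}{4}, \]
and Fenchel's inequality $\frac{p^2}{4} + x^2 \ge px$ reduces to $(x - p/2)^2 \ge 0$.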


Figure 2.1 (plotted for $f(x) = x^2$ and $p = 1$). The dual function $f^*(p)$ represents the optimal (negative) shift of the hyperplane $\langle p, x \rangle$, so that the epigraph of $f$ is supported. Any convex epigraph can be described as an infinite set of points, or as an infinite intersection of hyperplanes.

Furthermore

\[ f^{**}(x) \le f(x), \]
with equality if and only if $f$ is convex. Hence, for convex functions there is a one to one correspondence between $f$ and its dual $f^*$, which is called the Legendre-Fenchel transform.

The duality between points and bounding hyperplanes can be illustrated by looking at the dual of the indicator function of a convex set $C \subset \mathbb{R}^n$,
\[ \delta_C(x) = \begin{cases} 0 & x \in C \\ \infty & \text{otherwise} \end{cases}. \]
The dual is given by

\[ \delta_C^*(p) = \sup_{x \in \mathbb{R}^n} \{ \langle p, x \rangle - \delta_C(x) \} = \sup_{x \in C} \langle p, x \rangle =: \sigma_C(p), \]
which is the support function. $\sigma_C(p)$ represents the bounding hyperplane in direction $p$,
\[ C \subset \{ x \in \mathbb{R}^n \mid \langle p, x \rangle \le \sigma_C(p) \}. \]
The dual of the support function is again the indicator function $\delta_C(p)$,

\[ \sigma_C^*(p) = \sup_{x \in \mathbb{R}^n} \{ \langle p, x \rangle - \sigma_C(x) \} = \begin{cases} 0 & p \in C \\ \infty & \text{otherwise} \end{cases}. \]
If $C = K$ is a cone, recalling the definition of the polar cone (2.1), we observe that the support function of the polar cone $K^*$ is identical to the indicator function $\delta_K(x)$ of $K$,

\[ \sigma_{K^*}(x) = \sup_{y \in K^*} \langle y, x \rangle = \begin{cases} 0 & x \in K \\ \infty & \text{otherwise} \end{cases}. \]

Vector norms can be expressed as support functions; for instance, the 1-norm can be written as

\[ \| x \|_1 = \sigma_{[-1,1]^n}(x). \]
Hence the dual is given by

\[ \| \cdot \|_1^*(p) = \delta_{[-1,1]^n}(p). \]

2.1.2. Lagrangian Multipliers. Many functions we want to minimize in practice are of the form
\[ E(x) = f(x) + g(G(x)), \tag{2.4} \]
where $f : \mathbb{R}^n \to \mathbb{R}$ may represent the actual model, and $g : \mathbb{R}^m \to \mathbb{R}$, $G : \mathbb{R}^n \to \mathbb{R}^m$ can enforce constraints or assumptions on regularity. We perturb eq. (2.4) with auxiliary variables $u$,
\[ \varphi(x, u) = f(x) + g(G(x) + u), \]
and introduce $v(u) = \inf_x \varphi(x, u)$. Now we form the dual

\begin{align*} v^*(p) &= \sup_{u \in \mathbb{R}^m} \langle p, u \rangle - \inf_{x \in \mathbb{R}^n} \{ f(x) + g(G(x) + u) \} \\ &= \sup_{u \in \mathbb{R}^m, \, x \in \mathbb{R}^n} -f(x) + \langle p, u \rangle - g(G(x) + u) \\ &= -\inf_{x \in \mathbb{R}^n} \{ f(x) + \langle p, G(x) \rangle - g^*(p) \}. \end{align*}
Forming the biconjugate yields

\begin{align*} v^{**}(u) &= \sup_{p \in \mathbb{R}^m} \langle p, u \rangle - v^*(p) \\ &= \sup_{p \in \mathbb{R}^m} \langle p, u \rangle + \inf_{x \in \mathbb{R}^n} \{ f(x) + \langle p, G(x) \rangle - g^*(p) \}. \end{align*}
At $u = 0$, we have

\[ v^{**}(0) = \sup_{p \in \mathbb{R}^m} \inf_{x \in \mathbb{R}^n} \{ f(x) + \langle p, G(x) \rangle - g^*(p) \}. \]
From the original objective and the biconjugate, we get the following system of primal and dual objectives,

\[ \begin{aligned} (P) \quad & \inf_{x \in \mathbb{R}^n} \sup_{p \in \mathbb{R}^m} f(x) + \langle p, G(x) \rangle - g^*(p), \\ (D) \quad & \sup_{p \in \mathbb{R}^m} \inf_{x \in \mathbb{R}^n} f(x) + \langle p, G(x) \rangle - g^*(p). \end{aligned} \tag{2.5} \]
The function
\[ L(x, p) = f(x) + \langle p, G(x) \rangle \]
is called the Lagrangian with multiplier $p$. In many practical cases, $g^*(p)$ will be an indicator function, and can be transformed into constraints on $p$.

2.1.3. Optimality Conditions. We can obtain verified global optimality from duality theory, since the duality gap between the primal and the dual energies vanishes only at the globally optimal solution. Subgradients of primal and dual conjugate functions can be related to each other.

Theorem 2.1.3 (Inversion Rule for Subgradients). Let $f : \mathbb{R}^n \to \mathbb{R}$ be convex, then

\[ \hat{p} \in \partial f(\hat{x}) \iff \langle \hat{p}, \hat{x} \rangle = f(\hat{x}) + f^*(\hat{p}). \tag{2.6} \]
Furthermore, if strong duality holds,

\[ \hat{p} \in \partial f(\hat{x}) \iff \hat{x} \in \partial f^*(\hat{p}). \tag{2.7} \]
Proof. In order to show (2.6), we rearrange eq. (2.3):
\begin{align*} \hat{p} \in \partial f(\hat{x}) &\iff f(x) \ge f(\hat{x}) + \langle \hat{p}, x - \hat{x} \rangle, \; \forall x \in \mathbb{R}^n \\ &\iff f(x) - \langle \hat{p}, x \rangle \ge f(\hat{x}) - \langle \hat{p}, \hat{x} \rangle, \; \forall x \in \mathbb{R}^n. \end{align*}
We may take the infimum with respect to $x$, remembering that equality is achieved at $x = \hat{x}$,

\[ \iff -f^*(\hat{p}) = f(\hat{x}) - \langle \hat{p}, \hat{x} \rangle, \]
which shows eq. (2.6). Applying (2.6) to the dual $f^*(p)$, assuming that $f^{**} = f$, yields

\begin{align*} \hat{x} \in \partial f^*(\hat{p}) &\iff \langle \hat{p}, \hat{x} \rangle = f^*(\hat{p}) + f^{**}(\hat{x}) \\ &\iff \langle \hat{p}, \hat{x} \rangle = f^*(\hat{p}) + f(\hat{x}). \end{align*}
The right hand side is identical to (2.6), which proves eq. (2.7). □

Applying the subgradient inversion rule to our system of primal and dual objectives in (2.5), we get the following optimality conditions on $x^*$ and $p^*$,
\[ x^* \in \arg\min_x L(x, p^*), \qquad p^* \in \partial g(G(x^*)). \]

Corollary 2.1.4 (KKT conditions). From these optimality conditions, we can derive the Karush-Kuhn-Tucker (KKT) conditions. We look at minimization problems of the form
\[ \min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad G(x) \le 0, \; H(x) = 0. \]
The Lagrangian reads
\[ L(x, p, q) = f(x) + \langle p, G(x) \rangle + \langle q, H(x) \rangle, \]
and we have the optimality conditions

\[ \begin{aligned} & x^* \in \arg\min_x L(x, p^*, q^*), \quad G(x^*) \le 0, \quad H(x^*) = 0, \\ & p^* \ge 0, \quad \langle p^*, G(x^*) \rangle = 0. \end{aligned} \tag{2.8} \]
Proof. The subdifferential of the indicator function of a cone $K$ is the normal cone $N_K$. In the case above we have

\[ \partial \delta_K(G(x)) = N_K(G(x)). \]

Since both $\mathbb{R}_-$ and $\{0\}$ are cones with normal cones $\mathbb{R}_+$ and $\mathbb{R}$, respectively, the result (2.8) follows. □
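A small worked instance of the KKT conditions (2.8): minimize $f(x) = x^2$ over $x \in \mathbb{R}$ subject to $G(x) = 1 - x \le 0$. The Lagrangian is $L(x, p) = x^2 + p(1 - x)$, so stationarity gives $2x^* - p^* = 0$. Complementarity $\langle p^*, G(x^*) \rangle = 0$ with $p^* \ge 0$ rules out $p^* = 0$, since that would force the infeasible $x^* = 0$; hence $x^* = 1$ and $p^* = 2$. The constraint is active, and the multiplier measures its marginal cost.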

2.1.4. Linear Programming. We introduce linear programming briefly, since we will encounter it in the context of inference in graphical models. Linear programming consists of minimizing a linear objective under linear constraints,
\[ \min_{x \in \mathbb{R}^n} \langle c, x \rangle \quad \text{s.t.} \quad Ax \le b. \]
The constraints limit $x$ to a polyhedral set. We define the Lagrangian $L(x, p) = \langle c, x \rangle + \langle p, b - Ax \rangle$ and consider
\[ \min_{x \in \mathbb{R}^n} \max_{p \in \mathbb{R}^m_-} L(x, p). \]
By exchanging the min and max operations and rearranging, we get

\[ \max_{p \in \mathbb{R}^m_-} \min_{x \in \mathbb{R}^n} \langle p, b \rangle + \langle c - A^\top p, x \rangle, \]
which corresponds to the dual LP
\[ \max_{p \in \mathbb{R}^m} \langle b, p \rangle \quad \text{s.t.} \quad A^\top p = c, \; p \le 0. \]
It can be shown that if the primal LP is feasible and bounded, so is the dual, and there is no duality gap. Let $x^*$ and $p^*$ denote the optimal primal and dual solutions; then $\langle p^*, b - Ax^* \rangle$ and $\langle x^*, c - A^\top p^* \rangle$ both have to vanish, which is called complementary slackness.

There are two main approaches to solving linear programs in practice, the simplex and the interior point methods. The simplex algorithm searches the vertices of the feasible set until an optimal solution is found, while interior point methods perform updates within the feasible set. Even though the number of vertices is exponential in the number of defining equations, the simplex method shows performance similar to polynomial-time interior point methods in practice. If in addition the variables have to be integer, we have an integer linear program (ILP), which is an NP-hard problem.
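The vanishing duality gap can be checked numerically. The following minimal sketch, assuming NumPy and SciPy with hypothetical example data, solves a small primal LP and the dual derived above, in the convention $A^\top p = c$, $p \le 0$:

import numpy as np
from scipy.optimize import linprog

# Primal LP:  min <c, x>  s.t.  A x <= b
c = np.array([1.0, 2.0])
A = np.array([[-1.0, 0.0],   # -x1 <= -1  (i.e. x1 >= 1)
              [0.0, -1.0],   # -x2 <= -1  (i.e. x2 >= 1)
              [1.0, 1.0]])   #  x1 + x2 <= 4
b = np.array([-1.0, -1.0, 4.0])
primal = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2)

# Dual LP:  max <b, p>  s.t.  A^T p = c,  p <= 0
# (expressed as minimization of -<b, p> for linprog)
dual = linprog(-b, A_eq=A.T, b_eq=c, bounds=[(None, 0.0)] * 3)

print(primal.fun)   # optimal primal value (3.0 here)
print(-dual.fun)    # optimal dual value, equal to the primal: no duality gap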

2.2. Continuous Optimization Algorithms

In the following we present different algorithms for different kinds of minimization problems. The algorithms presented in this section all operate in the primal domain without evaluating the dual function, and are iterative: the variable $x$ is updated repeatedly until convergence to $x^*$.

2.2.1. Differentiable Functions. In addition to differentiability, we assume our function $f : \mathbb{R}^n \to \mathbb{R}$ to have a Lipschitz continuous gradient with constant $L$ satisfying
\[ \| \nabla f(x) - \nabla f(y) \| \le L \| x - y \|, \quad \forall x, y. \]
Assuming that we know the gradient of the function we aim to minimize, we may move along the direction of maximum descent, which is along the negative gradient, as shown in Algorithm 1. This algorithm

Input: step size $\alpha \in \mathbb{R}$, accuracy $\epsilon$
Output: $x^*$ satisfying $\| \nabla f(x^*) \| < \epsilon$
while $\| \nabla f(x) \| \ge \epsilon$ do
  $x \leftarrow x - \alpha \nabla f(x)$;
end
Algorithm 1: Gradient descent with fixed step size.

is simple and straightforward to implement, but also shows slow convergence. Since the step size is fixed, the updates are either too small in the beginning, or too large when $x$ is close to the optimum $x^*$.

Applying a line search method results in a better choice of updates, and accelerates convergence significantly. The Armijo condition accepts a step if the function decreases at least by a fixed fraction of the decrease predicted by the linear expansion around the current position,
\[ f(x + ty) \le f(x) + \alpha t \langle \nabla f(x), y \rangle, \qquad \alpha \in \big(0, \tfrac{1}{2}\big). \]
The Armijo backtracking line search is shown in Algorithm 2; Figure 2.2 illustrates the approach.

Input: $x \in \mathbb{R}^n$, $y \in \mathbb{R}^n$, $\alpha \in (0, \frac{1}{2})$, $\beta \in (0, 1)$
Output: step size $t$
$t \leftarrow 1$;
while $f(x + ty) > f(x) + \alpha t \langle \nabla f(x), y \rangle$ do
  $t \leftarrow \beta t$;
end
Algorithm 2: Backtracking line search in direction $y$.
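A minimal Python sketch combining Algorithms 1 and 2; the quadratic test function is an illustrative assumption, not taken from the thesis:

import numpy as np

def armijo_step(f, grad_f, x, y, alpha=0.25, beta=0.9):
    """Backtracking line search (Algorithm 2) in descent direction y."""
    t = 1.0
    while f(x + t * y) > f(x) + alpha * t * np.dot(grad_f(x), y):
        t *= beta
    return t

def gradient_descent(f, grad_f, x0, eps=1e-8, max_iter=1000):
    """Gradient descent (Algorithm 1) with Armijo step sizes."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:
            break
        t = armijo_step(f, grad_f, x, -g)
        x = x - t * g
    return x

# Example: minimize the convex quadratic f(x) = <x, Q x>/2 - <q, x>.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
q = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ Q @ x - q @ x
grad_f = lambda x: Q @ x - q
print(gradient_descent(f, grad_f, np.array([5.0, 5.0])))  # ~ Q^{-1} q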

The gradient may be more sensitive to changes in particular directions, so that steepest descent is not necessarily optimal. But the update direction $y$ needs to be gradient related in order to decrease the value of $f$ in each iteration, $\langle \nabla f, y \rangle < 0$. If we have the (positive


Figure 2.2. Armijo line search for $f(x) = x^2$ around $x = 1$ with $\alpha = 0.25$ and $\beta = 0.9$. The search direction is $y = -1$, the calculated step size is $t = 0.81$.

definite) Hessian available, Newton's method can be applied, as shown in Algorithm 3; using this second order information can improve optimization further. Even if $f$ is twice differentiable, however, storing the $n \times n$ elements of the Hessian matrix in computer memory may not be feasible. For non-convex $f$, the Hessian may not be positive definite, so that the update direction is not guaranteed to be gradient related. Negative components of the Hessian would need to be removed in order to guarantee convergence to a local minimum.

Input: accuracy $\epsilon$
Output: $x^*$ satisfying $\| \nabla f(x^*) \| < \epsilon$
while $\| \nabla f(x) \| \ge \epsilon$ do
  $y \leftarrow -(\nabla^2 f(x))^{-1} \nabla f(x)$;
  determine optimal step size $t$;
  $x \leftarrow x + ty$;
end
Algorithm 3: Newton method.

If we assume the function $f$ to be a quadratic vector norm $\frac{1}{2} \| F(x) \|^2$ (with $F(x)$ once differentiable), and apply the Newton method, we obtain the Levenberg-Marquardt algorithm as presented in Algorithm 4. With the Jacobian $\nabla F(x) = J$, the first and second order derivatives are $\nabla f(x) = J^\top F(x)$ and $\nabla^2 f(x) \approx J^\top J$. This is the Gauss-Newton approximation, where the Hessian matrix is approximated using only the Jacobian, $H = J^\top J$; the Levenberg-Marquardt algorithm additionally damps this approximation, $H = J^\top J + \lambda I$.

The Newton method fits a quadratic model to $f$, and minimizes the model function for the update. But it may be desirable to restrict the maximum update step to a radius $\tau$, since the model loses accuracy with increasing distance. The quadratic model function reads
\[ m_x(y) = f(x) + \langle \nabla f(x), y \rangle + \frac{1}{2} \langle y, \nabla^2 f(x) \, y \rangle. \]

Input: accuracy $\epsilon$
Output: $x^*$ satisfying $\| \nabla f(x^*) \| < \epsilon$
while $\| \nabla f(x) \| \ge \epsilon$ do
  $y \leftarrow -(J^\top J)^{-1} J^\top F(x)$;
  determine optimal step size $t$;
  $x \leftarrow x + ty$;
end
Algorithm 4: Levenberg-Marquardt.
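A minimal Python sketch of a damped Gauss-Newton step in the spirit of Algorithm 4, with a fixed damping weight in place of the line search or an adaptive update; the exponential fitting example is an illustrative assumption:

import numpy as np

def levenberg_marquardt(F, J, x0, lam=1e-3, eps=1e-10, max_iter=100):
    """Minimize f(x) = ||F(x)||^2 / 2 with damped Gauss-Newton steps.

    J(x) is the Jacobian of the residual F(x); lam is a fixed damping
    weight (a full implementation would adapt lam per iteration).
    """
    x = x0.astype(float)
    for _ in range(max_iter):
        r, Jx = F(x), J(x)
        g = Jx.T @ r                          # gradient of f
        if np.linalg.norm(g) < eps:
            break
        H = Jx.T @ Jx + lam * np.eye(len(x))  # damped Hessian approximation
        x = x - np.linalg.solve(H, g)
    return x

# Example: fit y = a * exp(b * t) to noisy samples.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(1.5 * t) + 0.01 * rng.standard_normal(20)

F = lambda x: x[0] * np.exp(x[1] * t) - y      # residual vector
J = lambda x: np.stack([np.exp(x[1] * t),
                        x[0] * t * np.exp(x[1] * t)], axis=1)
print(levenberg_marquardt(F, J, np.array([1.0, 1.0])))  # ~ (2.0, 1.5)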

The trust region size $\tau$ is updated at each iteration based on the quotient of the actual and the model energy difference. At iteration $k$, this quotient is given by
\[ \rho = \frac{f(x^k) - f(x^{k+1})}{m_x(0) - m_x(x^{k+1} - x^k)}. \tag{2.9} \]
$\rho$ determines whether the trust region should be increased or decreased. If $\rho$ is small, the model is not accurate, and the region should be decreased. On the other hand, if $\rho$ is close to one, the model is accurate, and the region should be expanded in order to allow larger updates. Additionally, depending on $\rho$, an update may be rejected. This approach is called the trust region method, shown in Algorithm 5. Instead of the

Input: accuracy $\epsilon$, threshold $\rho_{\text{accept}}$, maximum size $\hat{\tau}$
Output: $x^*$ satisfying $\| \nabla f(x^*) \| < \epsilon$
while $\| \nabla f(x) \| \ge \epsilon$ do
  $y \leftarrow \arg\min_y m_x(y) + \delta_{\| y \| < \tau}(y)$;
  calculate $\rho$ as in eq. (2.9);
  if $\rho < \frac{1}{4}$ then
    $\tau \leftarrow \frac{\tau}{4}$;
  else if $\rho > \frac{3}{4}$ and $\| y \| = \tau$ then
    $\tau \leftarrow \min(2\tau, \hat{\tau})$;
  end
  if $\rho > \rho_{\text{accept}}$ then
    $x \leftarrow x + y$;
  end
end
Algorithm 5: Trust region method.

hard constraint $\| y \| < \tau$, a soft constraint penalizing $\kappa \| y \|^2$ may be utilized.

2.2.2. Splitting Methods. All of the approaches in the previous section require a Lipschitz continuous gradient, and thus cannot be used if $f$ is not differentiable or not continuous. Non-differentiable functions

TV(u)= u(x) 1. x Ωk∇ k Z ∈ Splitting methods can minimize these functions effectively. We will need the proximal operator, which is defined as

\[ \operatorname{Prox}_{\lambda f}(x) = \arg\min_y \left\{ \frac{1}{2} \| x - y \|^2 + \lambda f(y) \right\}. \]
If $f$ is not continuous, the proximal mapping may still be feasible to evaluate in closed form. For example, if $f = \delta_C$ is the indicator function of a convex set $C$, the proximal map is the orthogonal projection onto $C$.

The basic forward backward splitting method, shown in Algorithm 6, can minimize functions of the form
\[ f(x) = g(x) + h(x), \]
with $g$ and $h$ both being proper and convex, but only $g$ being differentiable with Lipschitz continuous gradient with constant $L$.

Input: $\tau \in (0, 2/L)$
Output: approximate minimizer $x^*$
repeat
  $x \leftarrow \operatorname{Prox}_{\tau h}(x - \tau \nabla g(x))$;
until convergence;
Algorithm 6: Forward backward splitting.

An extension to minimizing a sum of non-differentiable functions $f(x) = g(x) + \sum_i h_i(x)$ is presented in [70], called generalized forward-backward splitting.
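A minimal Python sketch of Algorithm 6 for the common instance $g(x) = \frac{1}{2} \| Ax - b \|^2$, $h(x) = \lambda \| x \|_1$, where the proximal map of $h$ has the closed form soft thresholding; the example data is hypothetical:

import numpy as np

def prox_l1(x, lam):
    """Proximal operator of lam * ||.||_1 (soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def forward_backward(A, b, lam, tau, n_iter=500):
    """Algorithm 6 for f(x) = ||A x - b||^2 / 2 + lam * ||x||_1.

    The smooth part g has gradient A^T (A x - b) with L = ||A^T A||;
    the nonsmooth part h is handled through its proximal operator.
    """
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = prox_l1(x - tau * A.T @ (A @ x - b), tau * lam)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))
b = A @ np.array([1.0, -2.0] + [0.0] * 8) + 0.01 * rng.standard_normal(30)
tau = 1.0 / np.linalg.norm(A.T @ A, 2)   # step size within (0, 2/L)
print(np.round(forward_backward(A, b, lam=0.5, tau=tau), 2))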

If $f$ has the form $f(x) = h(x) + g(Ax)$, the fast primal dual algorithm can be used, shown in Algorithm 7. Here $h(x)$ needs to be differentiable with Lipschitz continuous gradient with constant $L$, while $g(Ax)$ only needs to be proper and convex. As shown in Section 2.1.2, minimizing $f$ corresponds to the saddle point problem
\[ \min_{x \in \mathbb{R}^n} \max_{p \in \mathbb{R}^m} \langle Ax, p \rangle + h(x) - g^*(p). \]
For guaranteed convergence, we require $\tau_1 \tau_2 L^2 < 1$. If $h(x)$ is strongly convex, an accelerated version is available, where $\tau_1$ and $\tau_2$ are updated in each iteration, thus achieving quadratic instead of linear convergence. Details can be found in [20].

Input: $\tau_1, \tau_2 > 0$
Output: primal minimizer $x^*$ and dual maximizer $p^*$
$z \leftarrow x$;
repeat
  $p \leftarrow \operatorname{Prox}_{\tau_1 g^*}(p + \tau_1 A z)$;
  $\tilde{x} \leftarrow x$;
  $x \leftarrow \operatorname{Prox}_{\tau_2 h}(x - \tau_2 A^\top p)$;
  $z \leftarrow 2x - \tilde{x}$;
until convergence;
Algorithm 7: Fast primal dual algorithm.

If we have the more general problem formulation
\[ \min_{x, z} \; g(x) + h(z) \quad \text{s.t.} \quad Ax + Bz = c, \tag{2.10} \]

we introduce the augmented Lagrangian
\[ L_\rho(x, z, y) = g(x) + h(z) + \langle y, Ax + Bz - c \rangle + \frac{\rho}{2} \| Ax + Bz - c \|^2. \tag{2.11} \]
The augmented Lagrangian can be minimized using the alternating direction method of multipliers (ADMM), shown in Algorithm 8.

Input: $\rho > 0$
Output: approximate optimal point $(x^*, z^*)$ of eq. (2.10)
repeat
  $x \leftarrow \arg\min_x L_\rho(x, z, y)$;
  $z \leftarrow \arg\min_z L_\rho(x, z, y)$;
  $y \leftarrow y + \rho (Ax + Bz - c)$;
until convergence;
Algorithm 8: Alternating direction method of multipliers (ADMM).

While ADMM can be slow in obtaining the exact solution, it may find a good approximation very quickly [13].
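A minimal Python sketch of Algorithm 8 for the instance $g(x) = \frac{1}{2} \| Ax - b \|^2$, $h(z) = \lambda \| z \|_1$ with the splitting constraint $x - z = 0$; the example data is hypothetical:

import numpy as np

def admm_lasso(A, b, lam, rho=1.0, n_iter=200):
    """Algorithm 8 for min ||A x - b||^2 / 2 + lam ||z||_1  s.t.  x - z = 0.

    The x-update is a ridge problem with closed form solution, the
    z-update is the l1 proximal map, and y is the dual variable.
    """
    n = A.shape[1]
    x = z = y = np.zeros(n)
    AtA, Atb = A.T @ A, A.T @ b
    M = np.linalg.inv(AtA + rho * np.eye(n))   # factor once, reuse
    for _ in range(n_iter):
        x = M @ (Atb + rho * z - y)            # argmin_x L_rho(x, z, y)
        v = x + y / rho
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        y = y + rho * (x - z)                  # dual ascent step
    return z

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 10))
b = A @ np.array([3.0, 0.0, -1.0] + [0.0] * 7) + 0.01 * rng.standard_normal(30)
print(np.round(admm_lasso(A, b, lam=0.5), 2))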

2.3. Elements of Optimal Control

In the previous sections on continuous and discrete energy minimization, we assumed that the variable can be chosen freely, and that the associated energy is given immediately. In practice we may not be able to choose the state arbitrarily, but only change it according to control variables. Such an optimization task is called an optimal control problem. We present an optimal control formulation for visual odometry in Section 6.1.2.

If we want to find the optimal path of a car, the path is subject to acceleration and steering, which are constrained as well. Simplifying the setup to one dimension (the rocket car), the state is given by distance $s$ and velocity $v$, and the control variable is the acceleration $u$. We have state constraints $s(0) = a$, $v(0) = 0$, $s(T) = b$, $v(T) = 0$, where the start time is $t_0 = 0$ and the end time $T$ is unknown. The control variable $u$ is

constrained to the interval $[-1, 1]$. The optimal trajectory of the car depends on the criterion we minimize; two possibilities are
\[ \min_{u(t)} \; T \quad \text{(time optimal)} \tag{2.12a} \]
\[ \min_{u(t)} \; \frac{1}{2} \int_0^T u(t)^2 \, dt \quad \text{(energy optimal)} \tag{2.12b} \]

Definition 2.3.1 (Optimal Control Problem). Find a control function $u : [t_0, T] \to \mathcal{U} \subset \mathbb{R}^k$ and corresponding state variables $x : [t_0, T] \to \mathbb{R}^n$ that satisfy a system of (ordinary) differential equations (ODE),
\[ \dot{x} = f(t, x(t), u(t)). \]
There may exist restrictions on $u(t)$ and $x(t)$. In many cases the initial state may be fixed, $x(0) = x_0$, and $u$ may be restricted to an interval, or to a discrete label set. The control variable $u$ needs to be optimized subject to an objective function

\[ \mathcal{J}(u) = \Phi(T, x(T)) + \int_{t_0}^{T} L(x(t), u(t)) \, dt. \]
Pontryagin's maximum principle [4] states the optimality conditions on the optimal control variable $u$, which also depend on the constraints on $u$ and $x$. Let the constraints be given by

\[ r(T, x(T)) = 0, \quad r : \mathbb{R} \times \mathbb{R}^n \to \mathbb{R}^{m_1}, \]
\[ s(T, x(T)) \ge 0, \quad s : \mathbb{R} \times \mathbb{R}^n \to \mathbb{R}^{m_2}. \]
Then, with the adjoint variables $\lambda : [t_0, T] \to \mathbb{R}^n$, $\alpha \in \mathbb{R}^{m_1}$, $\beta \in \mathbb{R}^{m_2}$, we can define the Hamiltonian and the augmented cost function

\begin{align*} H(x(t), u(t), \lambda(t)) &= -L(x(t), u(t)) + \lambda(t)^\top f(t, x(t), u(t)), \\ \Psi(T, x(T), \alpha, \beta) &= \Phi(T, x(T)) - \alpha^\top r(T, x(T)) - \beta^\top s(T, x(T)). \end{align*}

Let $(x^*, u^*)$ be a feasible local minimum of the objective function. Then:

• There exist adjoint variables $\lambda^* : [t_0, T] \to \mathbb{R}^n$, $\alpha^* \in \mathbb{R}^{m_1}$, $\beta^* \in \mathbb{R}^{m_2}$.
• $\lambda^*$ satisfies the adjoint differential equation

\[ \dot{\lambda}^* = -f_x(t, x^*(t), u^*(t))^\top \lambda^* + L_x(x^*(t), u^*(t))^\top = -H_x(x^*(t), u^*(t), \lambda^*(t)). \]
• The Hamiltonian is constant,

\[ H(x^*(t), u^*(t), \lambda^*(t)) \equiv C. \]


Figure 2.3 (panels: (a) large $T$, (b) small $T$, (c) minimum $T$). Optimal acceleration $u(t) \in [-1, 1]$ with respect to the minimum energy objective (2.12b) and constraints $s(0) = a$, $s(T) = b$, $v(0) = v(T) = 0$. If we choose $T = T^*$ time optimal, there is only one feasible trajectory, which is maximum acceleration followed by maximum deceleration.

• The transversality conditions hold,
\begin{align*} \lambda^*(T^*)^\top &= -\Psi_x(T^*, x^*(T^*), \alpha^*, \beta^*) \\ &= -\Phi_x(T^*, x^*(T^*)) + \alpha^{*\top} r_x(T^*, x^*(T^*)) + \beta^{*\top} s_x(T^*, x^*(T^*)), \\ H(x^*(T^*), u^*(T^*), \lambda^*(T^*)) &= \frac{\partial}{\partial T} \Psi(T^*, x^*(T^*), \alpha^*, \beta^*). \end{align*}
It follows that $H(x^*(t), u^*(t), \lambda^*(t)) \equiv 0$ if $T$ is free.
• Furthermore, $u^*$ maximizes the Hamiltonian,
\[ H(x^*(t), u^*(t), \lambda^*(t)) \ge H(x^*(t), v, \lambda^*(t)), \quad \forall v : [t_0, T] \to \mathcal{U}. \]

H(x∗(t),u∗(t),λ∗(t)) H(x∗(t),v,λ∗(t)), v :[t , T ] . ≥ ∀ 0 →U From these optimality conditions, we can infer the optimal solution of the one dimensional car trajectory mentioned above. Depending on the value of T , compared to the distance d = b a, we get different solutions of the energy optimal case (2.12b), as depicted− in Figure 2.3. In most practical examples, we do not get a closed form solution, but need to integrate the corresponding variables numerically. CHAPTER 3

CHAPTER 3

Mathematical Models

In the previous chapter, we assumed a function or control variable which is either free or constrained by equality or inequality constraints. Some mathematical models assume the corresponding variables to lie in particular non Euclidean spaces. In this chapter, we present discrete graphical models and smooth manifolds.

Discrete graphical models define a probability for each configuration of a graph with vertices and edges, where each vertex can be assigned a finite number of values. Discrete optimization exploits the structure of the underlying graph, because the discrete minimization problem is NP-hard in general. However, particular types of discrete optimization problems can be solved efficiently.

In some cases, our variables cannot take arbitrary values, but are restricted to a manifold. Manifolds define their own differential geometry, which influences the optimization strategy.

We introduce graphical models and the inference problem in Section 3.1. Since general inference on graphical models is an NP-hard problem, we present different methods for different types of graphs, utilizing the respective graph structure. In some cases, discrete inference can be reformulated as a convex continuous optimization problem. In Section 3.2, we introduce manifolds with their respective differential geometries, in order to perform optimization in non Euclidean spaces.

3.1. Probabilistic Graphical Models

Many real world problems are structured. There may be a hierarchical structure between latent and observed variables, or they may be structured as a collection of smaller problems, such as an image segmentation problem, which consists of many individual pixel segmentation problems depending on each other. Probabilistic graphical models can represent such dependencies explicitly, hence offering wide flexibility. However, performing inference on graphical models can be challenging in some cases.

3.1.1. Exponential Models. A probabilistic graphical model assigns a probability to each possible configuration of the variables. For example, the multivariate Gaussian distribution $\mathcal{N}(x \mid \mu, \Sigma)$ with mean

$\mu$ and covariance $\Sigma$ assigns a probability to each $x \in \mathbb{R}^n$,

\[ p_1(x) = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}} \exp\left( -\frac{1}{2} \langle x - \mu, \Sigma^{-1}(x - \mu) \rangle \right). \tag{3.1} \]
If we assume $\mu = (\mu_1, \mu_2)$ and $\Sigma = \operatorname{diag}(\sigma_1^2, \sigma_2^2)$ to be diagonal, we may rewrite the probability above as a product of two univariate normal distributions,
\[ p_2(x) = \mathcal{N}(x_1 \mid \mu_1, \sigma_1^2) \, \mathcal{N}(x_2 \mid \mu_2, \sigma_2^2). \tag{3.2} \]
$p_1(x)$ and $p_2(x)$ differ in the structure of the underlying graph: $p_2$ factorizes into two distributions of simpler form. This factorization into simpler parts is utilized when performing inference on graphical models; direct inference on the large model is typically not feasible. The matrix multiplication in equation (3.1) can be expanded and inserted into eq. (3.2), yielding

\[ p_2(x) = \frac{1}{Z} \exp\left( \Big\langle \Big( -\tfrac{1}{2\sigma_1^2},\; \tfrac{\mu_1}{\sigma_1^2},\; -\tfrac{\mu_1^2}{2\sigma_1^2},\; -\tfrac{1}{2\sigma_2^2},\; \tfrac{\mu_2}{\sigma_2^2},\; -\tfrac{\mu_2^2}{2\sigma_2^2} \Big), \; \phi(x) \Big\rangle \right), \]
with $\phi(x) = \begin{pmatrix} x_1^2 & x_1 & 1 & x_2^2 & x_2 & 1 \end{pmatrix}^\top$. The Gaussian distribution is a special case of an exponential model, which is defined as
\[ p(x \mid \theta) = \frac{1}{Z(\theta)} \exp( \langle \theta, \phi(x) \rangle ), \tag{3.3} \]
where $Z(\theta)$ is the partition function ensuring normalization,

\[ Z(\theta) = \int_{\mathcal{X}} \exp( \langle \theta, \phi(x) \rangle ). \]
We may rewrite eq. (3.3) as
\[ p(x \mid \theta) = \exp( \langle \theta, \phi(x) \rangle - \psi(\theta) ), \qquad \psi(\theta) = \log Z(\theta). \]
An energy function can be associated with the probability defined in eq. (3.3) by applying the negative logarithm,
\[ E(x) = -\log p(x) = \psi(\theta) - \langle \theta, \phi(x) \rangle. \tag{3.4} \]
The normalizing constant $\psi(\theta)$ does not depend on the state $x$, and hence can be omitted in the context of inference.

Another important instance of the exponential family are discrete graphical models. The probability of a configuration $x$ in any (second order) graphical model with vertices $\mathcal{V}$ and edges $\mathcal{E}$ is given by
\[ p(x \mid \theta) = \frac{1}{Z(\theta)} \exp\Big( \sum_{v \in \mathcal{V}} \theta_v(x_v) + \sum_{(v,w) \in \mathcal{E}} \theta_{v,w}(x_v, x_w) \Big). \]
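A minimal Python sketch of such a discrete second order model on a three vertex chain, evaluating $\langle \theta, \phi(x) \rangle$, the partition function, and the MAP state by brute force enumeration; the parameter values are an illustrative assumption:

import itertools
import numpy as np

# A tiny second-order model: a chain v1 - v2 - v3, three labels each.
L = 3
rng = np.random.default_rng(3)
theta_v = {v: rng.standard_normal(L) for v in range(3)}
theta_vw = {(0, 1): np.eye(L), (1, 2): np.eye(L)}  # reward equal labels

def score(x):
    """The inner product <theta, phi(x)> for configuration x."""
    s = sum(theta_v[v][x[v]] for v in range(3))
    s += sum(t[x[v], x[w]] for (v, w), t in theta_vw.items())
    return s

# Brute-force partition function and probabilities; feasible only
# because the model has 3^3 = 27 configurations.
configs = list(itertools.product(range(L), repeat=3))
Z = sum(np.exp(score(x)) for x in configs)
p = {x: np.exp(score(x)) / Z for x in configs}
x_map = max(configs, key=score)   # MAP state maximizes <theta, phi(x)>
print(x_map, p[x_map])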

Figure 3.1. Example factor graph: vertices are shown as black dots, and factors as gray squares. This example shows a $2 \times 2$ grid graph, where factors are added to each vertex, and on horizontal and vertical edges. Note the bipartite graph structure: no two vertices and no two factors are directly connected.

The sum can be defined as an inner product between model parameters $\theta$ and an indicator vector $\phi(x)$ given by

\[ \phi_v(x, l) = \begin{cases} 1 & \text{if } x_v = l \\ 0 & \text{otherwise} \end{cases}, \]

\[ \phi_{v,w}(x_v, x_w, l_v, l_w) = \begin{cases} 1 & \text{if } x_v = l_v, \; x_w = l_w \\ 0 & \text{otherwise} \end{cases}. \]
A discrete graphical model can be represented by a factor graph, which is a bipartite graph separating variables and factors. A factor node is connected to a set of variable nodes, and assigns a local probability to each possible configuration.

Definition 3.1.1 (Factor Graphs). A factor graph $(V, F, E)$ consists of vertices $V$, factors $F$ and edges $E$; Figure 3.1 shows a simple example. The graph is bipartite: there are no edges between any two vertex or factor nodes. Let $a \in F$ denote a factor node, then $x(a)$ denotes the connected vertices and $f_a(x(a))$ the probabilities contained in factor $a$. The probability of configuration $x$ is given by
\[ p(x) = \frac{1}{Z} \prod_{a \in F} f_a(x(a)). \]
In order to express a factor graph in the form of (3.3), we introduce $\phi(x)$ as a vector of indicator variables for each factor component. The model parameters $\theta$ contain the logarithms of the corresponding factor probabilities,
\[ p(x) = \frac{1}{Z(\theta)} \exp\Big( \sum_{a \in F} \langle \theta_a, \phi_a(x) \rangle \Big). \tag{3.5} \]
Given a graphical model, we may be interested in determining its optimal configuration, where optimality can be defined in different ways.

3.1.2. Marginalization. One way is to maximize the marginal probability for each label at each vertex, which is defined as

\[ p(x_n = l) = \sum_{x : \, x_n = l} p(x). \]
Marginals can be computed in linear time with the sum product algorithm if the underlying graphical model is a tree. The algorithm consists of sending messages between variable and factor nodes. Let $a \in F$ denote a factor, and $v \in V$ a vertex; then the corresponding messages are given by

\[ n_{v \to a}(x_v) = \prod_{c \in N(v) \setminus a} m_{c \to v}(x_v), \tag{3.6a} \]
\[ m_{a \to v}(x_v) = \sum_{x(a) \setminus x_v} f_a(x(a)) \prod_{w \in N(a) \setminus v} n_{w \to a}(x_w). \tag{3.6b} \]
Initially, all messages can be set to 1. This algorithm forms the basis for the belief propagation class of inference algorithms, which will be described in more detail below. If the model does have cycles, the algorithm is also called loopy belief propagation and may not converge. In general it is not possible to determine in advance whether loopy belief propagation will converge on a given graphical model.
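A minimal Python sketch of the sum product recursion on a three vertex chain, collapsing the factor-to-variable and variable-to-factor messages of eqs. (3.6a)/(3.6b) into one sweep per direction, as is standard on chains; the potentials are randomly generated for illustration, and the result is validated against brute force marginalization:

import itertools
import numpy as np

L = 3
rng = np.random.default_rng(4)
psi = [np.exp(rng.standard_normal(L)) for _ in range(3)]          # unary
psi_pw = [np.exp(rng.standard_normal((L, L))) for _ in range(2)]  # pairwise

# Forward and backward message sweeps along the chain.
fwd = [np.ones(L) for _ in range(3)]
bwd = [np.ones(L) for _ in range(3)]
for v in range(1, 3):
    fwd[v] = psi_pw[v - 1].T @ (psi[v - 1] * fwd[v - 1])
for w in (1, 0):
    bwd[w] = psi_pw[w] @ (psi[w + 1] * bwd[w + 1])

marg = [psi[v] * fwd[v] * bwd[v] for v in range(3)]
marg = [m / m.sum() for m in marg]

# Check against brute-force marginalization over all 27 configurations.
p = np.zeros((L, L, L))
for x in itertools.product(range(L), repeat=3):
    p[x] = (psi[0][x[0]] * psi[1][x[1]] * psi[2][x[2]]
            * psi_pw[0][x[0], x[1]] * psi_pw[1][x[1], x[2]])
p /= p.sum()
print(np.allclose(marg[0], p.sum(axis=(1, 2))))  # True
print(np.allclose(marg[1], p.sum(axis=(0, 2))))  # True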

3.1.3. Maximum A-Posteriori Inference. Choosing the label with the largest marginal for each vertex does not necessarily give the optimal configuration; it may not even be feasible. Thus, we may prefer maximum a-posteriori (MAP) inference,

\[ x^* = \arg\max_{x \in \mathcal{X}} p(x). \tag{3.7} \]
Note that the optimum of eq. (3.7) does not depend on the scale of $p(x)$; normalization is not required for MAP inference. If the model is cycle-free, changing the sum into a max operation in eq. (3.6b) yields the max product algorithm, performing MAP estimation.

\[ n_{v \to a}(x_v) = \prod_{c \in N(v) \setminus a} m_{c \to v}(x_v), \tag{3.8a} \]
\[ m_{a \to v}(x_v) = \max_{x(a) \setminus x_v} f_a(x(a)) \prod_{w \in N(a) \setminus v} n_{w \to a}(x_w). \tag{3.8b} \]
As in marginalization, there is no convergence guarantee on cyclic graphs. The max product algorithm may diverge in practice, but it has been applied successfully in optical flow and stereo matching models [31].


Figure 3.2. Example network for the maximum flow problem. The flow is emitted from the source node $s$ and absorbed by the target node $t$. Each edge can only transmit a limited capacity, and flow has to be preserved at each vertex.

3.1.4. Submodular Graphical Models: Graph Cuts. While inference in graphical models is computationally hard in general, it can be simpler for specific types of models. One example are submodular graphical models, where inference can be related to the maximum flow problem. An example maximum flow network is shown in Figure 3.2.

Definition 3.1.2 (Maximum Flow Problem). Given a network graph $(V, E, c, s, t)$, we want to maximize the flow subject to constraints. $V$ and $E$ represent graph vertices and (undirected) edges, $c$ represents positive costs on each edge, and $s$, $t$ are the source and target vertices. The flow has to satisfy the following two constraints.

capacity constraint: $f_{uv} \le c_{uv}$. The flow along edge $(u, v)$ must not exceed the edge capacity.
flow conservation: $\sum f_{\cdot u} = \sum f_{u \cdot}, \; \forall u \in V \setminus \{s, t\}$. For each vertex (apart from $s$ and $t$) the incoming and outgoing flows have to be equal.

Definition 3.1.3 (Graph Cut). A cut is a partition of a graph with vertex set $V$ into two subsets $A, B \subset V$. The sum of the costs of the edges crossing $A$ and $B$ is called the cost of the cut.

Theorem 3.1.1 (Max Flow Min Cut Theorem (Ford & Fulkerson 1956)). The maximum flow of a graph equals its minimum graph cut.

The maximum flow problem can be modeled as a linear program, of which the dual can be interpreted as solving for the minimum cut. Assuming that we have a binary graphical model, we may reformulate the inference problem as a maximum flow problem on the graph; an illustration is shown in Figure 3.3. The maximum flow problem is well understood, and a variety of polynomial time algorithms exist to solve it, thus enabling fast and exact inference.

For each vertex $v$, we define a relaxed labeling variable $x_v \in [0, 1]$, allowing soft assignments. Let $J_v(x), \, x \in \{0, 1\}$, denote the

Figure 3.3. Reformulating the inference problem as a graph cut. The flow from $s$ to $t$ has to pass through the graphical model, where $s$ and $t$ represent the labels. Unary costs are represented in the edges between the original graph and $s$ and $t$, while pairwise costs are represented in the graph edges.

The relaxed cost then reads
\[
\tilde{J}_v(x_v) = (1 - x_v)J_v(0) + x_v J_v(1) = x_v(J_v(1) - J_v(0)) + J_v(0) = (1 - x_v)(J_v(0) - J_v(1)) + J_v(1).
\]
Depending on the sign of $J_v(1) - J_v(0)$, this value has to be added as a cost to the edge connecting $s$ or (with reversed sign) $t$. The constant part $J_v(0)$ or $J_v(1)$ (depending on the sign of the difference) does not affect the optimal configuration and is ignored. The pairwise terms can be inserted into the network in a similar way. The cost is given by

\[
\tilde{J}_{vw}(x_v, x_w) = \begin{pmatrix} 1 - x_v & x_v \end{pmatrix} \begin{pmatrix} A & B \\ C & D \end{pmatrix} \begin{pmatrix} 1 - x_w \\ x_w \end{pmatrix} = A + (D - B)x_v + (B - A)x_w + (B + C - A - D)\,x_v(1 - x_w).
\]
This expression factorizes into a constant part $A$, which can be removed, two unary parts $D - B$ and $B - A$, which are processed as described above, and a pairwise part $B + C - A - D$. Since a flow network requires positive edge costs, the term $J(0,1) + J(1,0) - J(0,0) - J(1,1)$ has to be positive. This property is called submodularity, which also extends to the non-binary case.

Definition 3.1.4 (Submodularity). A function $f : [N] \times [M] \to \mathbb{R}$ is submodular, if for every $1 \le i < i' \le N$ and $1 \le j < j' \le M$,

\[
f(i, j) + f(i', j') \le f(i, j') + f(i', j).
\]
Note that submodularity may depend on the order of labels; there is also a notion of permuted submodular functions [77].
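To make the construction concrete, the following sketch solves a hypothetical two-vertex binary model by reduction to a minimum cut, using the edge assignment described above; the networkx library and all cost values are illustrative choices, not part of the original text.

```python
import networkx as nx

# A sketch of exact MAP inference for a hypothetical binary submodular
# model with two vertices via minimum cut. Unary costs J_v(0), J_v(1)
# and a pairwise table (A, B; C, D) with B + C - A - D >= 0 are assumed.
J = {0: (1.0, 3.0), 1: (2.5, 0.5)}
A, B, C, D = 0.0, 2.0, 2.0, 0.0          # submodular: B + C - A - D = 4

G = nx.DiGraph()
def add_cap(u, v, w):                    # accumulate positive capacities
    if w > 0:
        old = G.get_edge_data(u, v, {"capacity": 0.0})["capacity"]
        G.add_edge(u, v, capacity=old + w)

for v, (j0, j1) in J.items():            # unary part: J_v(1) - J_v(0)
    add_cap("s", v, j1 - j0)             # paid if v takes label 1
    add_cap(v, "t", j0 - j1)             # paid if v takes label 0
# pairwise decomposition from the text: unary parts D - B (vertex 0)
# and B - A (vertex 1), plus B + C - A - D on the pair
add_cap("s", 0, D - B); add_cap(0, "t", B - D)
add_cap("s", 1, B - A); add_cap(1, "t", A - B)
add_cap(1, 0, B + C - A - D)             # paid if x_0 = 1 and x_1 = 0

value, (s_side, t_side) = nx.minimum_cut(G, "s", "t")
print({v: int(v in t_side) for v in (0, 1)})   # t-side vertices get label 1
```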

Figure 3.4. Ishikawa’s method for MAP inference in a chain graph with four vertices and three labels, counting vertices from left to right. The original graph is replicated for each label, the edge weights are determined by the unary and pairwise costs of the origi- nal graphical model. The labels need to be ordered, and the pairwise term has to be a convex function of the label difference. Inference is performed by applying graph cuts to the extended graph.

If the graphical model is binary, but not submodular, Quadratic Pseudo-Boolean Optimization (QPBO) [35] can be applied, but it is not guaranteed to find a label for each vertex. Submodularity may be viewed as the discrete counterpart of convexity, since submodular functions can usually be minimized to global optimality. There are extensions of the procedure above to non-binary submodular models, which solve a sequence of binary submodular problems, either by expanding a certain label $\alpha$ ($\alpha$-expansion) or by swapping two labels $\alpha$ and $\beta$ ($\alpha$-$\beta$ swap), as described in [14].

In [41], Ishikawa presents inference on graphical models with more than two labels as a graph cut formulation. The labels have to be linearly ordered, and the pairwise term has to be a convex function of the label difference,

\[
E(x) = \sum_{v \in V} D(x_v) + \sum_{(v,w) \in E} P(x_v - x_w).
\]
$D(\cdot)$ is the unary energy, which is arbitrary, and $P(\cdot)$ is the convex pairwise term. An illustration of the graph construction is shown in Figure 3.4. The flow network replicates the original graph for each label, and edges have to be stored explicitly when solving the max flow problem, which causes a large memory requirement of the approach.

3.1.5. The Local Polytope Relaxation. Rather than maximizing the probability of a certain state, we may instead minimize the corresponding energy,
\[
\arg\min_{x \in \mathcal{X}} \langle \theta, \phi(x) \rangle.
\]
In order to simplify the expression further, we may replace the function $\phi$ with a vector $\mu$, which is constrained to all possible vectors $\phi(x)$ that correspond to a configuration $x \in \mathcal{X}$,
\[
\arg\min_{\mu \in M} \langle \theta, \mu \rangle. \tag{3.9}
\]
Each component of the unary and pairwise factors has a corresponding component in the vector $\mu$. $M$ is called the marginal polytope, since it also represents all possible marginals of the original graphical model with varying parameters $\theta$. While the objective function is simple, the constraint set $M$ is complicated and contains the computational difficulty of the original inference problem. $\phi(x)$ is of higher dimension than $x$, and more importantly, it is subject to interdependent constraints that arise from the graph structure.

Minimizing eq. (3.9) is not easier than maximizing the original probability as in eq. (3.7), but the highly complicated polytope $M$ can be replaced by a simpler set $L$ called the local polytope. Instead of modeling the entire graph structure, the local polytope only imposes constraints that arise on each factor independently. The local polytope is a superset of the marginal polytope, since those local constraints are satisfied by any valid marginal. Elements of the local polytope are also referred to as pseudomarginals [92]. Let $f_a$ be a unary factor, and $v_a$ the corresponding vertex. Then the corresponding part of $\mu$ is subject to the following constraints.

\[
\mu_a(i) \in [0, 1] \ \forall i \in v_a, \qquad \sum_{i \in v_a} \mu_a(i) = 1.
\]
These constraints imply that $\mu_a$ is a valid probability distribution on vertex $v_a$. In case of a second order factor $f_{ab}$, with corresponding vertices $v_a$ and $v_b$, the constraints read as follows:
\[
\mu_{ab}(i,j) \in [0, 1] \ \forall i \in v_a, j \in v_b, \qquad \sum_{j \in v_b} \mu_{ab}(i,j) = \mu_a(i), \qquad \sum_{i \in v_a} \mu_{ab}(i,j) = \mu_b(j).
\]
The constraints ensure that label selections based on first order factors are consistent with selections in second order ones. Higher order factors can be handled similarly.

The local polytope provides a relaxation of the marginal polytope which remains tractable in many, even high dimensional, cases. While the extreme points of the marginal polytope are always binary vectors, the local polytope may contain fractional extreme points, for which the optimal state is not well defined. If the underlying graph is a tree, the local polytope is identical to the marginal polytope, and exact inference is feasible.
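The local polytope constraints can be written down directly as a small linear program. The following sketch does this for a hypothetical model with two vertices, two labels and a single pairwise factor, using scipy; since this toy graph is a tree, the LP solution is integral.

```python
import numpy as np
from scipy.optimize import linprog

# A minimal local polytope LP sketch; all costs theta are hypothetical.
theta_a = np.array([0.0, 1.0])
theta_b = np.array([0.5, 0.0])
theta_ab = np.array([[0.0, 2.0],          # theta_ab[i, j]
                     [2.0, 0.0]])

# variable order: mu_a(0), mu_a(1), mu_b(0), mu_b(1), mu_ab(00,01,10,11)
c = np.concatenate([theta_a, theta_b, theta_ab.ravel()])

A_eq, b_eq = [], []
A_eq.append([1, 1, 0, 0, 0, 0, 0, 0]); b_eq.append(1)   # sum_i mu_a(i) = 1
A_eq.append([0, 0, 1, 1, 0, 0, 0, 0]); b_eq.append(1)   # sum_j mu_b(j) = 1
# marginalization: sum_j mu_ab(i, j) = mu_a(i)
A_eq.append([-1, 0, 0, 0, 1, 1, 0, 0]); b_eq.append(0)
A_eq.append([0, -1, 0, 0, 0, 0, 1, 1]); b_eq.append(0)
# marginalization: sum_i mu_ab(i, j) = mu_b(j)
A_eq.append([0, 0, -1, 0, 1, 0, 1, 0]); b_eq.append(0)
A_eq.append([0, 0, 0, -1, 0, 1, 0, 1]); b_eq.append(0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, 1))
print(res.x)   # integral on this tree-structured toy model
```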

3.1.6. Dual Decomposition. We may write the local polytope relaxation as a linear program,
\[
\min_\mu \langle \theta, \mu \rangle \quad \text{s.t.} \quad \mu(i) \in [0, 1] \ \forall i, \qquad \sum_{j \in v_b} \mu_{ab}(\cdot, j) = \mu_a, \quad \sum_{i \in v_a} \mu_{ab}(i, \cdot) = \mu_b \quad \text{for all second order factors } f_{ab}.
\]
Rather than listing constraints factor-wise, we may change to edges in the factor graph:
\[
\min_\mu \langle \theta, \mu \rangle \quad \text{s.t.} \quad \mu_f(i) \in [0, 1] \ \forall i, \qquad A^f \mu_f = \mu_v \quad \text{for all edges } (v, f),\ v \in V,\ f \in F.
\]
Note that unary factors do not provide restrictive constraints. Since $\mu_v$ is defined by its neighboring higher order factors, it is restricted to $[0, 1]$ implicitly. Also, for any unary factor $f$, $A^f = I$, so that the consistency constraint reads $\mu^f = \mu_v$, thus $\mu^f$ may simply be removed from the LP.

In order to solve the local polytope relaxation, we introduce Lagrangian variables $\lambda^{vf}$,
\begin{align*}
\mathcal{L}(\mu, \lambda) &= \langle \theta, \mu \rangle + \sum_{f \in F} \delta_{\Delta^{|f|}}(\mu_f) + \sum_{(v,f) \in E} \langle \lambda^{vf}, A^f \mu_f - \mu_v \rangle \\
&= \langle \theta, \mu \rangle + \sum_{f \in F} \delta_{\Delta^{|f|}}(\mu_f) + \sum_{(v,f) \in E} \left( \langle \lambda^{vf}, A^f \mu_f \rangle - \langle \lambda^{vf}, \mu_v \rangle \right).
\end{align*}
Since $\mu_v$ is unconstrained, we need to impose constraints on $\lambda^{vf}$ for a well defined Lagrangian,

\[
\sum_{(v, \cdot) \in E} \lambda^{v \cdot} = 0, \quad \forall v \in V.
\]

The Lagrangian can now be optimized with any suitable algorithm, such as ADMM, leading to the dual decomposition approach. In [42], the Lagrangian is optimized with a bundle method, where a simple approximation of the energy function is maintained and updated during optimization.

3.1.7. Tree Based Belief Propagation Algorithms. As we have seen many times in previous sections, inference on trees is much simpler than inference on cyclic graphs. Therefore, it appears plausible to approximate a cyclic graph by a collection of trees. This approach has been introduced in [92], presenting the tree-reweighted belief propagation (TRBP) algorithm, and was later extended to the sequential tree-reweighted (TRW-S) algorithm in [44].

Instead of regarding the model parameters as fixed values, we introduce the MAP estimation problem as a function of $\theta$,
\[
\Phi(\theta) := \min_{x \in \mathcal{X}} \langle \theta, \phi(x) \rangle.
\]
$\Phi(\theta)$ is concave, being a pointwise minimum of linear functions, hence Jensen's inequality can be applied,
\[
\sum_i \alpha_i \Phi(\theta^i) \le \Phi(\theta), \quad \text{where } \sum_i \alpha_i \theta^i = \theta,\ \alpha_i > 0\ \forall i.
\]
Since inference on trees is inherently simpler than inference on cyclic models, we may choose the $\theta^i$ vectors to represent trees. Components of $\theta^i$ that correspond to factors not contained in the respective tree are set to zero. The $\theta^i$ vectors are stacked into a matrix $\Theta \in \mathbb{R}^{d \times |\phi(x)|}$, and the following problem needs to be solved,

\[
\max_\Theta \sum_i \alpha_i \min_{x \in \mathcal{X}} \langle \theta^i, \phi(x) \rangle \quad \text{s.t.} \quad \sum_i \alpha_i \theta^i = \theta.
\]
In order to find the dual, we set up the Lagrangian,

\begin{align*}
\mathcal{L}(\Theta, \tau) &= \sum_i \alpha_i \Phi(\theta^i) + \langle \tau, \theta - \sum_i \alpha_i \theta^i \rangle \\
&= \sum_i \alpha_i \left( \Phi(\theta^i) - \langle \theta^i, \tau \rangle \right) + \langle \tau, \theta \rangle.
\end{align*}
The dual function $\min_\Theta \mathcal{L}(\Theta, \tau)$ of $\tau$ is precisely the LP relaxation over the local polytope as presented in section 3.1.5; a proof is shown in [92]. This holds independently of the choice of trees, as long as each graph edge is covered by some tree. The paper presents a belief propagation approach over the tree decomposition, which is guaranteed to converge, but the duality gap may not be tight at the optimum. If a set of vertices can be found for which the optimal configurations agree in all trees, then this configuration is guaranteed to be globally optimal.

The same graphical model can be represented with different vectors $\theta$, e.g. by shifting unary energies into the pairwise terms. TRW-S generalizes eq. (3.1.7) by allowing any reparameterization of $\theta$. Through this flexibility, TRW-S achieves monotonicity in its updates, while TRBP may increase the energy.

3.1.8. Continuous Labeling. We can simplify the local polytope relaxation from section 3.1.5 further, if the pairwise energy satisfies certain conditions. We introduce a simplex variable $s(v) \in \Delta^{|L(v)|}$ for each vertex $v$, which represents the probabilities over all labels $L(v)$. The unary energy can be written as an inner product,
\[
E_{\text{unary}} = \sum_{v \in V} \langle F(v), s(v) \rangle,
\]
where $F(v)$ contains the unary energies. If the pairwise term of two neighboring vertices $v$ and $w$ is a convex function of $s(v) - s(w)$, the overall minimization is convex in the simplex variables, and we do not need variables for the pairwise factors. The authors of [49] use this formulation for image labeling with TV regularization. We stack the simplex variables and corresponding unaries into vectors $s$ and $f$, and let $A$ denote the vector valued discrete gradient operator. The overall energy reads
\[
E(s) = \langle s, f \rangle + \alpha \|As\|_1 + \delta_{\Delta_L}(s). \tag{3.10}
\]
Rewriting the L1 norm as an inner product with dual variable $p$,

\[
\alpha \|As\|_1 = \sup_{p \in [-\alpha, \alpha]^{|p|}} \langle p, As \rangle,
\]
we can minimize energy (3.10) with splitting algorithms from convex optimization. Projection onto the simplex, which will most likely be required, can be computed in a finite number of steps as shown in [61].
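The simplex projection can be sketched as follows; this is the standard sorting-based routine in the spirit of [61], with hypothetical input values.

```python
import numpy as np

# A minimal sketch of Euclidean projection onto the probability simplex
# via the sorting-based algorithm; runs in O(n log n).
def project_simplex(y):
    u = np.sort(y)[::-1]                        # sort descending
    css = np.cumsum(u) - 1.0                    # cumulative sums minus 1
    k = np.arange(1, len(y) + 1)
    rho = np.max(np.where(u - css / k > 0)[0])  # last index with positive gap
    tau = css[rho] / (rho + 1)
    return np.maximum(y - tau, 0.0)

s = project_simplex(np.array([0.8, 1.4, -0.3]))
print(s, s.sum())    # nonnegative entries summing to one
```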

3.2. Elementary Manifolds

In the previous chapter on mathematical optimization, we assumed the variable $x \in \mathbb{R}^n$ to be free in the Euclidean space. Constraints can be imposed by an indicator function on a set $C \subset \mathbb{R}^n$, but the structure of the set $C$ is not represented adequately with this approach. For example, if we restrict our variables to the unit sphere $S^n$, the distance between any two points should be given by the arc between them rather than the straight line. For a formal definition of manifolds, we first need to introduce charts and atlases, as described in [1].

Definition 3.2.1 (Chart). Let $\mathcal{M}$ be a set in $\mathbb{R}^n$. Then a bijection $\phi$ between a subset $\mathcal{U} \subset \mathcal{M}$ and $\mathbb{R}^d$ is called a chart, denoted $(\mathcal{U}, \phi)$. $\phi(x)$ are the local coordinates of $x \in \mathcal{M}$, and the inverse mapping $\phi^{-1}$ is called a local parameterization of $\mathcal{M}$.

Definition 3.2.2 (Atlas). An atlas is a set of charts $\{\mathcal{U}_\alpha\}$ covering the entire set $\mathcal{M}$, $\bigcup_\alpha \mathcal{U}_\alpha = \mathcal{M}$. Also, the given charts need to be compatible on their intersections: for any $\alpha$ and $\beta$ with $\mathcal{U}_\alpha \cap \mathcal{U}_\beta \ne \emptyset$, the sets $\phi_\alpha(\mathcal{U}_\alpha \cap \mathcal{U}_\beta)$ and $\phi_\beta(\mathcal{U}_\alpha \cap \mathcal{U}_\beta)$ are open subsets of $\mathbb{R}^d$, and the function $\phi_\beta \circ \phi_\alpha^{-1}$ is smooth.

An atlas defines a differentiable structure on $\mathcal{M}$. The set $\mathcal{M}$ together with a differentiable structure is called a manifold.

3.2.1. Differential Geometry. If we have a real valued function $f$ defined on a manifold $\mathcal{M}$, we need to incorporate the geometry of $\mathcal{M}$ in order to define derivatives. Let $\gamma(t) : \mathbb{R} \to \mathcal{M}$ denote a curve on $\mathcal{M}$. Then the Euclidean derivative
\[
\dot{\gamma}(0) = \lim_{h \to 0} \frac{\gamma(h) - \gamma(0)}{h}
\]
may not be well defined, since $\mathcal{M}$ is not necessarily a vector space, which is required for evaluating the difference $\gamma(h) - \gamma(0)$. However, if we compose $\gamma$ with a function $f : \mathcal{M} \to \mathbb{R}$, we can construct the derivative
\[
\frac{d}{dt}(f \circ \gamma) = Df(\gamma(0))[\gamma'(0)],
\]
which is the derivative of $f$ at $\gamma(0)$ in direction $\gamma'(0)$. $\gamma'(0)$ is a tangent vector to $\mathcal{M}$ at $\gamma(0)$. The span of all tangent vectors that arise from different paths $\gamma$ with identical starting point $\gamma(0)$ is called the tangent space of $\mathcal{M}$ at $\gamma(0)$, denoted $T_{\gamma(0)}\mathcal{M}$. Unlike the manifold itself, the tangent space is a vector space by construction. The union of all tangent spaces is denoted the tangent bundle,

\[
T\mathcal{M} = \bigcup_{x \in \mathcal{M}} T_x \mathcal{M}. \tag{3.11}
\]
A function $\xi : \mathcal{M} \to T\mathcal{M}$ is called a vector field on $\mathcal{M}$; the set of all smooth vector fields is denoted $\mathfrak{X}(\mathcal{M})$. Let $f : \mathcal{M} \to \mathbb{R}$ be a smooth real valued function on $\mathcal{M}$, then the derivative of $f$ along vector field $\xi$ is given by $\xi f : \mathcal{M} \to \mathbb{R}$,
\[
(\xi f)(x) = \xi_x(f) = Df(x)[\xi_x].
\]
The multiplication of $\xi$ by a function $f$ is given by

\[
(f\xi)_x = f(x)\xi_x,
\]
and we may add two vector fields $\xi$ and $\zeta$,
\[
(\xi + \zeta)_x = \xi_x + \zeta_x.
\]
In section 2.2, we have seen second order optimization methods, for example the Newton method. Defining the second order derivative of a function on a manifold $\mathcal{M}$ encounters a similar issue as in the first order case: tangent vectors at different locations live in different tangent spaces, hence they are not directly comparable. In the case of the

Euclidean space, the classical directional derivative of vector field ξ with respect to vector field ζ is given by

\[
(\nabla_\zeta \xi)_x = \lim_{h \to 0} \frac{\xi_{x + h\zeta_x} - \xi_x}{h}. \tag{3.12}
\]
$\nabla_\zeta \xi$ is called the covariant derivative of $\xi$ with respect to $\zeta$. On non-Euclidean manifolds, covariant derivatives can be established through affine connections. An affine connection $\nabla : \mathfrak{X}(\mathcal{M}) \times \mathfrak{X}(\mathcal{M}) \to \mathfrak{X}(\mathcal{M})$ defines a derivative of a vector field on $\mathcal{M}$; it is not unique, and has to satisfy the following properties,

\begin{align*}
\nabla_{f\zeta + g\chi}\,\xi &= f \nabla_\zeta \xi + g \nabla_\chi \xi && \text{(linearity in } \zeta\text{)}\\
\nabla_\zeta (a\xi + b\eta) &= a \nabla_\zeta \xi + b \nabla_\zeta \eta && \text{(linearity in } \xi\text{)}\\
\nabla_\zeta (f\xi) &= (\zeta f)\xi + f \nabla_\zeta \xi && \text{(Leibniz' rule)}
\end{align*}
The affine connection for the Euclidean case (3.12), where tangent vectors do not need to be transformed, is called the Euclidean or canonical connection.

3.2.2. Riemannian Manifolds. While the tangent space of a general manifold is a vector space, and hence can be equipped with the Euclidean inner product, this inner product may not reflect the local geometry well. Riemannian manifolds are equipped with a Riemannian metric, which is an inner product $\langle \cdot, \cdot \rangle_x$ on the tangent space at $x \in \mathcal{M}$. The corresponding norm is given by

\[
\|\xi\|_x = \sqrt{\langle \xi, \xi \rangle_x}.
\]
If $\mathcal{M}$ is a submanifold of another Riemannian manifold, the Riemannian metric is inherited from the embedding manifold. Most of our manifolds are submanifolds of $\mathbb{R}^{n \times p}$ with its standard Riemannian metric

\[
\langle A, B \rangle = \operatorname{tr}(A^\top B).
\]
We now define the Riemannian gradient,

\[
\langle \nabla_{\mathcal{M}} f, \xi \rangle_x = Df(x)[\xi],
\]
and the Riemannian connection

\[
(\nabla_\zeta \xi)_x = P_x(D\xi(x)[\zeta_x]).
\]
$P_x$ denotes the orthogonal projection onto the tangent space at $x$. The derivative $D\xi(x)[\zeta_x]$ is well defined in the embedding manifold $\mathbb{R}^{n \times p}$. The Riemannian connection defines the Hessian operator

\[
\operatorname{Hess} f(x)[\xi_x] = \nabla_{\xi_x} \nabla_{\mathcal{M}} f.
\]
With gradient and Hessian operators available, we can now formulate second order optimization algorithms on general Riemannian manifolds. In contrast to the Euclidean space, the update $x_{k+1} = x_k + y$ may not be well defined if $x$ lies in a manifold $\mathcal{M}$. Therefore we need retractions, which map a tangent vector back onto the manifold. At a point $x \in \mathcal{M}$, the retraction $R_x : T_x\mathcal{M} \to \mathcal{M}$ has to satisfy the following properties,

\begin{align}
R_x(0) &= x \tag{3.13}\\
DR_x(0)[\xi] &= \xi, \quad \forall \xi \in T_x\mathcal{M}
\end{align}
The first condition ensures that if the update step is $y = 0$, then $x$ does not change after the update. The second condition is called local rigidity, and ensures that the direction of the update is (approximately) preserved. We can now define the pullback of $f$ through the retraction $R$,

\[
\hat{f}_x = f \circ R_x.
\]
If an algorithm requires evaluating $f(x + ty)$, as for example line search does, we need to pull back the function $f$ from the manifold into the tangent space $T_x\mathcal{M}$. There can be many different retractions for the same manifold that satisfy the conditions in (3.13). For a unique definition, we may request that the induced curve on the manifold has zero acceleration,
\[
\frac{d^2}{dt^2} R_x(ty) = 0.
\]
With this retraction, distance is preserved and there is a unique solution, at least for a particular choice of affine connection, which is required for the second order derivative. This particular retraction is called the exponential map, and forms a local bijection between $T_x\mathcal{M}$ and $\mathcal{M}$.

The unit sphere $S^{n-1}$ forms a submanifold of $\mathbb{R}^n$, and is defined by
\[
S^{n-1} = \{x \in \mathbb{R}^n \mid x^\top x = 1\}.
\]
Let $\gamma(t) \in S^{n-1}$ be a curve on the sphere. Then the derivative of $\gamma(t)^\top \gamma(t)$ has to vanish,
\[
\frac{d}{dt}\gamma(t)^\top \gamma(t) = 0 \quad \Leftrightarrow \quad \dot{\gamma}(t)^\top \gamma(t) + \gamma(t)^\top \dot{\gamma}(t) = 0.
\]

This has to hold for all paths $\gamma(t)$, hence $\dot{\gamma}(t)$ has to satisfy $\dot{\gamma}(t)^\top \gamma(t) = 0$, and the tangent space of $S^{n-1}$ is given by
\[
T_x(S^{n-1}) = \{z \in \mathbb{R}^n \mid z^\top x = 0\}.
\]
The orthogonal projection onto the tangent space is given by the Euclidean projection in $\mathbb{R}^n$,

\[
P_x = I - xx^\top.
\]
The exponential map is given by
\[
\operatorname{Exp}_x(t\xi) = x \cos(\|\xi\| t) + \frac{\xi}{\|\xi\|} \sin(\|\xi\| t).
\]
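The tangent space projection and the exponential map of the sphere translate directly into code; the following sketch is illustrative, with an arbitrary point and tangent direction.

```python
import numpy as np

# A minimal sketch of the sphere operations above: tangent projection
# P_x = I - x x^T and the exponential map Exp_x(t xi).
def project_tangent(x, v):
    return v - x * (x @ v)            # remove the radial component

def exp_map(x, xi, t=1.0):
    nrm = np.linalg.norm(xi)
    if nrm == 0.0:                    # Exp_x(0) = x, cf. eq. (3.13)
        return x
    return x * np.cos(nrm * t) + (xi / nrm) * np.sin(nrm * t)

x = np.array([1.0, 0.0, 0.0])
xi = project_tangent(x, np.array([0.0, 0.5, 0.0]))
y = exp_map(x, xi)
print(np.linalg.norm(y))              # stays on the unit sphere: 1.0
```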

We can check that the (covariant) second derivative of the curve $\gamma(t) = \operatorname{Exp}_x(t\xi)$ vanishes,
\[
\frac{D^2}{dt^2} \operatorname{Exp}_x(t\xi) = P_{\gamma(t)} \frac{d^2}{dt^2}\gamma(t) = (I - \gamma(t)\gamma(t)^\top)(-\|\xi\|^2)\gamma(t) = 0.
\]
The orthogonal Stiefel manifold is defined as
\[
\operatorname{St}(p, n) = \{X \in \mathbb{R}^{n \times p} \mid X^\top X = I_p\}.
\]
We can form the derivative of $X(t)^\top X(t)$ for $X(t) \in \operatorname{St}(p, n)$, and find the tangent space of the Stiefel manifold,
\[
T_X \operatorname{St}(p, n) = \{Z \in \mathbb{R}^{n \times p} \mid X^\top Z + Z^\top X = 0\}. \tag{3.14}
\]
In order to simplify the expression, we decompose $\dot{X}(t)$ into the part spanned by $X(t)$ and the orthogonal part, $\dot{X} = X\Omega + X_\perp K$. Inserting into eq. (3.14) yields

\[
X^\top(X\Omega + X_\perp K) + (X\Omega + X_\perp K)^\top X = 0 \quad \Leftrightarrow \quad \Omega + \Omega^\top = 0.
\]
Hence, $\Omega \in \mathcal{S}_{\text{skew}}(p)$ and $K \in \mathbb{R}^{(n-p) \times p}$. The dimension of $\operatorname{St}(p, n)$ is
\[
\dim \operatorname{St}(p, n) = \frac{1}{2}p(p - 1) + (n - p)p = np - \frac{1}{2}p(p + 1).
\]

3.2.3. Lie Groups. Before we define Lie groups, we need to define groups first. Let the set $G$ be equipped with an operation $\cdot$ satisfying

closure: let $a, b \in G$, then $a \cdot b \in G$
associativity: $a \cdot (b \cdot c) = (a \cdot b) \cdot c$
identity element: $\exists e \in G$ with $e \cdot a = a \cdot e = a$, $\forall a \in G$
inverse element: $\forall a \in G$, $\exists a^{-1} \in G$ s.t. $a^{-1} \cdot a = a \cdot a^{-1} = e$.

Then $(G, \cdot)$ is a group. Lie groups are groups which are also differentiable manifolds. We assume that the group operation is given by matrix multiplication; the following sets are examples of Lie groups:

• The general linear group $GL(n) = \{X \in \mathbb{R}^{n \times n} \mid \det X \ne 0\}$, which contains all invertible matrices.
• The special linear group $SL(n) = \{X \in \mathbb{R}^{n \times n} \mid \det X = 1\}$.
• The orthogonal group $O(n) = \operatorname{St}(n, n) = \{X \in \mathbb{R}^{n \times n} \mid X^\top X = I_n\}$, representing all isometries in $\mathbb{R}^n$.
• The special orthogonal group $SO(n) = \{X \in \mathbb{R}^{n \times n} \mid X^\top X = I_n, \det(X) = 1\}$, representing all rotations in $\mathbb{R}^n$.

Three dimensional rotations form a Lie group with respect to matrix multiplication, and $\mathbb{R}^3$ is a Lie group with respect to addition. We define the special Euclidean group describing all rigid motions,
\[
SE(3) = \left\{ \begin{pmatrix} R & t \\ 0 & 1 \end{pmatrix} \;\middle|\; R \in SO(3),\ t \in \mathbb{R}^3 \right\}.
\]

The composition of two elements of $SE(3)$ is given by matrix multiplication; therefore the inverse rigid transform is given by the matrix inverse,
\[
\begin{pmatrix} R & t \\ 0 & 1 \end{pmatrix}^{-1} = \begin{pmatrix} R^\top & -R^\top t \\ 0 & 1 \end{pmatrix}. \tag{3.15}
\]
In the following, we introduce the general algebra and the Lie algebra, which is closely related to a Lie group.

Definition 3.2.3 (Algebra). An algebra $\mathcal{A}$ is a vector space which is extended with a product operation, written $uv$ for $u, v \in \mathcal{A}$. The product has to satisfy the following conditions, with $u, v, w \in \mathcal{A}$ and $\lambda \in \mathbb{R}$,
\begin{align*}
\lambda(uv) &= (\lambda u)v = u(\lambda v)\\
(u + v)w &= uw + vw\\
u(v + w) &= uv + uw.
\end{align*}
If in addition $u(vw) = (uv)w$ holds, $\mathcal{A}$ is an associative algebra.

The Lie bracket of two vector fields, $[X, Y] : \mathfrak{X}(\mathcal{M}) \times \mathfrak{X}(\mathcal{M}) \to \mathfrak{X}(\mathcal{M})$, is again a vector field, satisfying
\[
[X, Y]f = X(Yf) - Y(Xf).
\]
This vector field is also called the commutator product of $X$ and $Y$.

Definition 3.2.4 (Lie Algebra). A Lie algebra is a vector space $\mathfrak{g}$, equipped with a Lie bracket operation $[\cdot, \cdot] : \mathfrak{g} \times \mathfrak{g} \to \mathfrak{g}$ satisfying
\begin{align*}
[\alpha u + \beta v, w] &= \alpha[u, w] + \beta[v, w]\\
[u, \alpha v + \beta w] &= \alpha[u, v] + \beta[u, w]\\
[u, v] &= -[v, u]\\
0 &= [u, [v, w]] + [v, [w, u]] + [w, [u, v]],
\end{align*}
where $u, v, w \in \mathfrak{g}$ and $\alpha, \beta \in \mathbb{R}$. The last condition is called the Jacobi identity. Any associative algebra with the matrix commutator
\[
[X, Y] = XY - YX \tag{3.16}
\]
defines a Lie algebra. The tangent space of a Lie group at the identity element forms a Lie algebra [48]. The corresponding Lie algebras for the example Lie groups above are given by

• $\mathfrak{gl}(n) = \mathbb{R}^{n \times n}$.

• $\mathfrak{sl}(n) = \{X \in \mathbb{R}^{n \times n} \mid \operatorname{tr}(X) = 0\}$.
• $O(n)$ and $SO(n)$ share the same tangent space at the identity, which is given by the skew symmetric matrices, $\mathfrak{so}(n) = \{X \in \mathbb{R}^{n \times n} \mid X^\top = -X\}$. $O(n)$ consists of $SO(n)$ and a second connected component with $\det(X) = -1$, representing reflections.

If the group operation is matrix multiplication, the exponential map is given by the matrix exponential. The matrix exponential may not be efficient to calculate, but in some cases we get closed form solutions. For example, in the case of three dimensional rotations, the matrix exponential is given in closed form by Rodrigues' formula, which can be shown by expanding and re-arranging the exponential series,

\[
\exp([a]_\times) = I_3 + \frac{\sin(\|a\|)}{\|a\|}[a]_\times + \frac{1 - \cos(\|a\|)}{\|a\|^2}[a]_\times^2.
\]
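A small sketch of Rodrigues' formula, checked against the generic matrix exponential; the test vector is arbitrary.

```python
import numpy as np
from scipy.linalg import expm

# A minimal sketch of Rodrigues' formula for exp([a]_x).
def hat(a):
    """Skew symmetric matrix [a]_x with [a]_x b = a x b."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

def rodrigues(a):
    th = np.linalg.norm(a)
    if th == 0.0:
        return np.eye(3)
    A = hat(a)
    return np.eye(3) + (np.sin(th) / th) * A + ((1 - np.cos(th)) / th**2) * A @ A

a = np.array([0.1, -0.4, 0.2])
R = rodrigues(a)
print(np.allclose(R, expm(hat(a))))      # True: closed form matches expm
print(np.allclose(R.T @ R, np.eye(3)))   # True: R is a rotation
```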

3.2.4. Gradient Flows on Manifolds. Many algorithms we presented in 2.2 can be extended to functions over manifolds [1]. Details for the Stiefel and quotient manifolds are shown in [26]. In this section we show that matrix factorizations, including eigenvalue computation and singular value decomposition, can be reformulated as gradient flows on matrix manifolds.

We briefly present the double bracket isospectral flow from [38], which diagonalizes a matrix while preserving its eigenvalues. Let $Q = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$ be any diagonal matrix. Then the manifold of orthogonally equivalent matrices is given by
\[
\mathcal{M}(Q) = \{\Theta^\top Q \Theta \mid \Theta \in O(n)\}.
\]
Any matrix in $\mathcal{M}(Q)$ has eigenvalues $\lambda_1, \ldots, \lambda_n$, hence any gradient flow on the manifold will be isospectral. Let $H \in \mathcal{S}_{\text{sym}}(n)$ be a symmetric matrix, then there exists a diagonal matrix $Q$ such that $H \in \mathcal{M}(Q)$. The tangent space of $\mathcal{M}(Q)$ at $H$ is given by
\[
T_H \mathcal{M}(Q) = \{H\Omega - \Omega H = [H, \Omega] \mid \Omega \in \mathcal{S}_{\text{skew}}(n)\}.
\]
The tangent space can be found by differentiating $\Theta^\top H \Theta$ at $\Theta = I_n$. Let $N \in \mathbb{R}^{n \times n}$ be a diagonal matrix, then the double bracket isospectral flow
\[
\frac{d}{dt}H(t) = [H(t), [H(t), N]]
\]
converges to a diagonal matrix. If $N = \operatorname{diag}(1, \ldots, n)$, the limit will contain the sorted eigenvalues of $H$ on its diagonal. Note that $[A, B]$ is skew symmetric if both $A$ and $B$ are symmetric. In [38], the authors show that this gradient flow corresponds to the Riemannian gradient of the function
\[
\min_{H \in \mathcal{M}(Q)} \frac{1}{2}\|N - H\|^2 = \min_{H \in \mathcal{M}(Q)} \frac{1}{2}\|N\|^2 + \frac{1}{2}\|H\|^2 - \operatorname{tr}(NH)
\]
for a suitable choice of the Riemannian metric. The terms $\|N\|^2$ and $\|H\|^2$ are constant, because $N$ is constant and $H$ has constant eigenvalues.

Similar to the manifold of isospectral matrices, the matrices of constant rank form a smooth manifold as well. The following parametrization of the fixed rank manifold is given in [89],
\[
\mathcal{M}_k = \{X \in \mathbb{R}^{m \times n} \mid \operatorname{rank}(X) = k\} = \{U\Sigma V^\top \mid U \in \operatorname{St}(k, m),\ V \in \operatorname{St}(k, n),\ \Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_k)\},
\]
with $\sigma_1 > \cdots > \sigma_k > 0$. Rank constraints appear in multiview geometry, but also in other areas of computer vision [30].
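The double bracket flow can be illustrated numerically; the following sketch integrates it with explicit Euler steps on a random symmetric matrix, where the (small) step size and iteration count are ad hoc choices.

```python
import numpy as np

# A minimal numerical sketch of the double bracket flow
# dH/dt = [H, [H, N]]: the continuous flow is isospectral, and the
# Euler iterates approach a diagonal matrix with the eigenvalues
# sorted according to N = diag(1, ..., n).
def bracket(A, B):
    return A @ B - B @ A

rng = np.random.default_rng(0)
S = rng.standard_normal((4, 4))
H = S + S.T                              # symmetric start matrix
H /= np.linalg.norm(H)                   # normalize for a stable step size
N = np.diag(np.arange(1.0, 5.0))
w0 = np.sort(np.linalg.eigvalsh(H))

for _ in range(50000):
    H += 1e-3 * bracket(H, bracket(H, N))

print(np.round(H, 4))                    # off-diagonal entries vanish
print(w0)
print(np.sort(np.diag(H)))               # close to the initial eigenvalues
```

CHAPTER 4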

Multiple View Geometry

In contrast to laser and radar sensors, cameras cannot measure depth directly. The scene can only be estimated by aligning images from different cameras whose positions are known. In our setup, the exact camera locations are not known a priori, but need to be estimated as well, based on point correspondences. In this chapter, we present the geometry of a multiple camera system, where in the monocular case the different views are generated by the same camera in different locations.

We introduce projective geometry in section 4.1, where we present the camera model, how homographies can represent planar motion between two views, and epipolar geometry. Section 4.2 introduces the stereo matching problem, where point correspondences between two calibrated cameras are estimated. Multiple frame scene reconstruction is presented in section 4.3. Section 4.4 gives an overview of monocular reconstruction approaches.

4.1. Projective Geometry

Estimation of point correspondences can be simplified if the images are taken by cameras in different, ideally known locations. We introduce the camera model in section 4.1.1. If the scene consists of a plane, the induced motion can be calculated by multiplication with a homography matrix, which is described in section 4.1.2. In the case of a general scene with unknown depth, there is no direct point correspondence, but the motion is constrained to epipolar lines by the Essential matrix. Epipolar geometry and estimation of the Essential matrix are described in sections 4.1.3 and 4.1.4.

4.1.1. The Camera Model. We assume the pinhole camera model, where viewing rays are captured on a planar image plane at $z = 1$. Figure 4.1 shows the pinhole camera setup and the projection of a point onto the image plane. Since the pinhole camera cannot distinguish different scales of the same point, any point represents its entire equivalence class, which we refer to as homogeneous coordinates. The projection $\pi$ onto the image plane, where each viewing direction is represented uniquely, is given by

\[
\pi\big((x\ y\ z)^\top\big) = \left(\tfrac{x}{z}\ \tfrac{y}{z}\ 1\right)^\top.
\]


Figure 4.1. The pinhole camera. Points along viewing rays can- not be distinguished, hence they are identified through homogeneous coordinates. The projection π maps each point onto its unique rep- resentative on the image plane at z = 1.

Points and lines on the projective plane can be viewed as dual objects [36]. The roles of points and lines can be exchanged in any theorem of 2-dimensional projective geometry. As an example, any two points $x_1$ and $x_2$ define a line $l$ with line equation $l^\top x = 0$, and any two lines $l_1$ and $l_2$ define a point $x$. Both relations can be expressed as cross products,

\begin{align}
l &= x_1 \times x_2 \tag{4.1a}\\
x &= l_1 \times l_2. \tag{4.1b}
\end{align}
The duality principle extends to three dimensions, where points and planes take dual roles. Homogeneous coordinates enable us to represent infinite points compactly as $x = (x_1\ x_2\ 0)^\top$. The points at infinity lie on the line at infinity, given by $l_\infty = (0\ 0\ 1)^\top$.

The standard camera described above needs to be calibrated before it can be applied to a three dimensional scene. Focal lengths and principal point are denoted intrinsic, and the camera position and orientation extrinsic parameters. The intrinsic parameters are given by the camera matrix

\[
K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix},
\]
where $f_x$ and $f_y$ denote the focal lengths (in pixels) and $(c_x\ c_y\ 1)^\top$ is the principal point, which is the pixel position corresponding to the point $(0\ 0\ 1)^\top$. Inverting a camera matrix results in another camera matrix,

\[
K^{-1} = \begin{pmatrix} \frac{1}{f_x} & 0 & -\frac{c_x}{f_x} \\ 0 & \frac{1}{f_y} & -\frac{c_y}{f_y} \\ 0 & 0 & 1 \end{pmatrix}.
\]

The extrinsic parameters are given by rotation $R$ and translation $t$, which need to be applied (inversely) to the scene points,
\[
x = K \begin{pmatrix} R^\top & -R^\top t \end{pmatrix} \begin{pmatrix} X \\ 1 \end{pmatrix}.
\]
The inversion of the egomotion parameters $R$ and $t$ is given by eq. (3.15). In later sections, we will assume the intrinsic parameters to be known, and perform calculations in the 3D model rather than in the rasterized image. We ignore other effects of physical cameras, such as lens distortion.
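The projection equation can be sketched in a few lines; the intrinsics and pose below are hypothetical values.

```python
import numpy as np

# A minimal sketch of the pinhole projection x = K [R^T | -R^T t] (X; 1).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                      # camera orientation
t = np.array([0.0, 0.0, -2.0])     # camera center two units behind origin

def project(X):
    Xc = R.T @ (X - t)             # transform into camera coordinates
    x = K @ Xc
    return x[:2] / x[2]            # dehomogenize to pixel coordinates

print(project(np.array([0.2, 0.1, 1.0])))
```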

4.1.2. Homographies. Assuming that we know the camera locations, and the viewed scene consists of a single plane, point correspondences can be computed by applying a homography matrix $H \in \mathbb{R}^{3 \times 3}$. The representation with homographies is convenient, since it replaces explicit modeling of camera and scene parameters with a simple matrix multiplication. The homography matrix is scale invariant, hence it has eight degrees of freedom, and can be estimated uniquely from four correspondences. We assume that the extrinsic parameters of the two cameras are given by

\[
P_1 = \begin{pmatrix} I_3 & 0 \end{pmatrix}, \qquad P_2 = \begin{pmatrix} R & t \end{pmatrix},
\]
and the plane is defined by parameters $v$ satisfying

\[
v^\top X = 1.
\]
For a point $x$ in the image plane of the first camera, we can compute the depth $d(x)$ with $X = d(x)x$ by
\[
v^\top X = v^\top (d(x)x) = 1 \quad \Rightarrow \quad d(x) = \frac{1}{v^\top x}.
\]

From scene point X, we can find the corresponding point x′,

\[
x' = \pi(R^\top (X - t)).
\]
Using the scale invariance of the projection $\pi$, we can reformulate this equation to $x' = \pi(R^\top (I_3 - tv^\top)x)$. We observe that point correspondences on a plane can be calculated with the homography $H = R^\top(I_3 - tv^\top)$. Again, we have eight degrees of freedom: three for each of $R$, $t$ and $v$, minus one for the scale ambiguity between the translation vector and the plane parameters. The homography can be extended with the camera intrinsics, in order to map in the pixel domain by

\[
H_a = K_2 H K_1^{-1} = A + av^\top. \tag{4.2}
\]
The components $A$ and $a$ depend on the camera motion, and $v$ defines the plane.
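The following sketch verifies the relation $x' = \pi(Hx)$ numerically for a hypothetical motion and plane; both the homography transfer and the explicit reprojection of the scene point give the same result.

```python
import numpy as np

# A minimal sketch of the plane-induced homography H = R^T (I - t v^T);
# motion (R, t) and plane parameters v are invented for the example.
def rot_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def pi(x):                              # projection onto the plane z = 1
    return x / x[2]

R = rot_z(0.05)
t = np.array([0.1, 0.0, 0.02])
v = np.array([0.0, 0.0, 0.25])          # plane v^T X = 1, i.e. z = 4

H = R.T @ (np.eye(3) - np.outer(t, v))

x = np.array([0.3, -0.2, 1.0])          # reference pixel on the z = 1 plane
X = x / (v @ x)                         # scene point with depth 1 / v^T x
print(pi(H @ x))                        # homography transfer
print(pi(R.T @ (X - t)))                # explicit transfer, same result
```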


Figure 4.2. The viewing ray of the first camera along pixel $x$ and the second camera center define a plane, which intersects with the second image plane and defines the corresponding epipolar line. The Fundamental matrix $F$ provides immediate access to the corresponding line in the second camera, $x'^\top l = 0$, of a point $x$ in the first camera by $l = Fx$.

In [29], the manifold of multiple homographies is presented, where multiple homographies between the same frames are assumed to share the same camera motion. We can vectorize eq. (4.2),
\[
\vec{H}_a = \vec{A} + (I_3 \otimes a)v = \begin{pmatrix} I_3 \otimes a & \vec{A} \end{pmatrix} \begin{pmatrix} v \\ 1 \end{pmatrix}.
\]
Extending the plane vector to a matrix of multiple planes yields a matrix of the corresponding homographies, which all share the same components $A$ and $a$. There are ambiguities in the representation in eq. (4.2), which can be utilized to reduce the dimension to $4n + 7$, where $n$ is the number of homographies; see [29] for details. Hence, the factorization reduces the degrees of freedom already for two homographies.

While the calculation of the homography given camera motion and plane parameters is straightforward, the inverse relation is not. There are different approaches to factorize a given homography into its components, see [57] for a survey. The decomposition is inherently not unique; there exist two different decompositions (plus scale ambiguities) that describe the same homography matrix.

4.1.3. Epipolar Geometry. Assuming a setup of two cameras with known positions and calibration, epipolar geometry describes how point correspondences relate to 3D objects. The viewing ray from the camera center through pixel x in the first camera renders itself as a line in the image plane of the second camera, as shown in Figure 4.2. This point to line mapping can be represented by a single matrix F , called the Fundamental matrix, which depends on relative camera pose and the intrinsic camera parameters. The corresponding line in the second view can be calculated by l = Fx, hence any two corresponding points x and x′ in the two cameras have to satisfy

\[
x'^\top F x = 0.
\]


Figure 4.3. The epipoles are given by intersecting the line con- necting both camera centers with the respective image plane. Since the camera center is part of every viewing ray, its point correspon- dence has to lie on every epipolar line.

Conversely, $F^\top$ describes the mapping from the second into the reference camera; $x^\top F^\top x' = 0$ holds as well. An important concept of epipolar geometry is the location of the epipoles in both cameras. The epipole is the intersection of the line connecting the two camera centers with the respective image plane. Figure 4.3 depicts where the epipole $e$ in the reference frame and $e'$ in the second frame are located. If the relative camera pose $R$ and $t$ is known, we can derive both epipoles immediately,
\[
e = t,
\]

\[
e' = R^\top t.
\]
Since the epipole is contained in all epipolar lines, it must hold that

\[
e'^\top F x = 0 \ \forall x \quad \Rightarrow \quad e'^\top F = 0.
\]
Therefore, the Fundamental matrix has rank two. The Fundamental matrix can be calculated from the intrinsics $K_1$ and $K_2$ and the relative rotation $R$ and translation $t$ by
\[
F = K_2^{-\top} R^\top [t]_\times K_1^{-1}. \tag{4.3}
\]
This expression can be derived using the expression for the epipole and eq. (4.1a). Since we assume the intrinsic parameters to be known and compensated for, we may drop the camera matrices from expression (4.3), and are left with the Essential matrix

\[
E = R^\top [t]_\times. \tag{4.4}
\]
$R$ and $t$ can be recovered from a given Essential matrix using the SVD, but the solution is not unique. There are two distinct solutions for $R$, and a sign ambiguity for $t$, which yields four possible solutions of the decomposition. In later chapters, we will only refer to the Essential matrix, and not model the camera parameters explicitly.

There are degenerate configurations where the Essential matrix cannot be defined. The first case is when all points lie on a scene plane; then there exists a homography $H$ such that $x_i' = Hx_i$, $\forall i$. If we then choose $E = SH$ for any skew-symmetric matrix $S$, we have

\[
x_i'^\top E x_i = x_i^\top H^\top S H x_i = 0.
\]
There are infinitely many Essential matrices describing the point correspondences perfectly. The second degenerate case is $t = 0$: eq. (4.4) would simply yield $E = 0$, which is a trivial solution of the equation $x'^\top E x = 0$ and does not satisfy the rank two constraint. Scene depth cannot be recovered from a stationary camera.

4.1.4. Essential Matrix Estimation. So far, we assumed the relative camera pose to be known and derived the Essential matrix, but in practice this is rarely the case. Usually, the Essential matrix needs to be estimated from point correspondences $(x_i, x_i')$, where the rank 2 constraint needs to be enforced during estimation. We can rewrite the Essential matrix constraint in linear form,

\[
x_i'^\top E x_i = 0 \quad \Leftrightarrow \quad (x_i^\top \otimes x_i'^\top)\operatorname{vec}(E) = 0.
\]
We then define $A$ to contain the Kronecker products of all corresponding points,

\[
A = \begin{pmatrix} x_1^\top \otimes x_1'^\top \\ \vdots \\ x_N^\top \otimes x_N'^\top \end{pmatrix}, \qquad A \operatorname{vec}(E) = 0. \tag{4.5}
\]
Viewed as a $3 \times 3$ matrix defined up to scale, the Essential matrix has eight free parameters. Since each point correspondence introduces a linear constraint, we need at least eight points for estimating the Essential matrix. The baseline approach is the normalized 8-point algorithm, which is based on the SVD. The algorithm does not require normalization to Essential matrices, but can estimate arbitrary Fundamental matrices. The 8-point algorithm is fast, since no iterative updates are required.

Input: Point correspondences $\{(x_i, x_i')\}$
Output: Fundamental matrix $F$
Find normalizing transforms $T$ and $T'$, where $\hat{x}_i = Tx_i$, $\hat{x}_i' = T'x_i'$;
Construct $A$ as in eq. (4.5) in terms of $\hat{x}_i$ and $\hat{x}_i'$. Define $\hat{F}$ as the vector corresponding to the smallest singular value of $A$;
Enforce the rank constraint on $\hat{F}$ by setting its smallest singular value to zero;
De-normalize the Fundamental matrix, $F = T'^\top \hat{F} T$;
Algorithm 9: Normalized 8-point algorithm

The 8-point algorithm may not be accurate, however. The choice of normalization influences the outcome, and it is not clear which normalization should be utilized. The rank constraint is enforced by projecting the estimated matrix onto the manifold of rank two matrices; it is not clear how much accuracy is lost at that step.

Another approach is to perform algebraic minimization. In [88], the energy $\Psi(A \operatorname{vec}(E))$ is minimized subject to $\|\operatorname{vec}(E)\| = 1$, where $\Psi$ is a convex error function. The norm one constraint avoids the trivial solution $E = 0$, but does not enforce the correct rank, hence the matrix needs to be projected again after iterative optimization.
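A compact sketch of Algorithm 9, with the usual isotropic normalization (centroid at the origin, mean distance $\sqrt{2}$) as one possible choice of normalizing transform; on calibrated coordinates the output is, up to scale, an estimate of the Essential matrix.

```python
import numpy as np

# A minimal sketch of the normalized 8-point algorithm (Algorithm 9);
# x1, x2 are (N, 3) arrays of homogeneous correspondences, N >= 8.
def normalize(x):
    m = x[:, :2].mean(axis=0)                    # centroid
    s = np.sqrt(2.0) / np.linalg.norm(x[:, :2] - m, axis=1).mean()
    T = np.array([[s, 0.0, -s * m[0]],
                  [0.0, s, -s * m[1]],
                  [0.0, 0.0, 1.0]])
    return x @ T.T, T

def eight_point(x1, x2):
    x1n, T1 = normalize(x1)
    x2n, T2 = normalize(x2)
    # each row is the Kronecker product x_i^T (x) x_i'^T of eq. (4.5)
    A = np.stack([np.kron(a, b) for a, b in zip(x1n, x2n)])
    f = np.linalg.svd(A)[2][-1]                  # smallest singular vector
    F = f.reshape(3, 3).T                        # undo the column-major vec
    U, s, Vt = np.linalg.svd(F)
    F = U @ np.diag([s[0], s[1], 0.0]) @ Vt      # enforce rank two
    return T2.T @ F @ T1                         # de-normalize
```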

In contrast to the above methods, the approach presented in [37] introduces the manifold of essential matrices $\mathcal{E}$, and performs the estimation task as energy minimization on the manifold. The Essential matrix can be factorized as
\[
E = U E_s V^\top, \quad E_s = \operatorname{diag}(s, s, 0), \quad U, V \in SO(3), \quad s > 0.
\]
Since scale cannot be determined from point correspondences, we restrict ourselves to normalized Essential matrices with $s = 1$. Differentiating with respect to $U$ and $V$, with respective tangent spaces $U\Omega^U$ and $V\Omega^V$ and $\Omega = [(\omega_1\ \omega_2\ \omega_3)^\top]_\times$, we find the tangent space
\[
T_E \mathcal{E} = \{U(\Omega^U E_1 - E_1 \Omega^V)V^\top\} = \left\{ U \begin{pmatrix} 0 & \omega_3^V - \omega_3^U & -\omega_2^V \\ \omega_3^U - \omega_3^V & 0 & \omega_1^V \\ -\omega_2^U & \omega_1^U & 0 \end{pmatrix} V^\top \right\}.
\]
We observe that $\omega_3^U$ and $\omega_3^V$ appear only in their difference, hence we have an over-parametrization with the six components of $\omega^U$ and $\omega^V$; the Essential manifold is only five dimensional. This matches the geometric interpretation, where we would expect three degrees of freedom from the rotation and three from the translation part, minus one for the scale ambiguity. Given a vector $x \in \mathbb{R}^5$, we define the tangent vector $\xi_E(x)$ by
\[
\xi_E(x) = U \begin{pmatrix} 0 & -x_3 & -x_5 \\ x_3 & 0 & x_4 \\ -x_2 & x_1 & 0 \end{pmatrix} V^\top.
\]
We can translate the tangent vector back to $\Omega^U$ and $\Omega^V$, in order to update $U$ and $V$ accordingly during optimization,

\[
\Omega^U = \left[\begin{pmatrix} x_1 \\ x_2 \\ -\frac{x_3}{\sqrt{2}} \end{pmatrix}\right]_\times, \qquad \Omega^V = \left[\begin{pmatrix} x_4 \\ x_5 \\ \frac{x_3}{\sqrt{2}} \end{pmatrix}\right]_\times.
\]
The authors derive the Riemannian gradient and Hessian of the quadratic energy
\[
J(E) = \frac{1}{2n}\sum_{i=1}^n (x_i'^\top E x_i)^2,
\]
and apply the Newton method for minimization, under consideration of different retractions. The energy function is convex, but the set of valid Essential matrices is not. Hence, we are not guaranteed global minimization, but the algorithm turns out to be fast and accurate in practice.

In [18], the problem of estimating the fundamental matrix is lifted into the set of probability measures on all valid Essential matrices. Even though this set is infinite dimensional, the authors propose a method to minimize the algebraic error as a sequence of convex polynomial optimization problems, and demonstrate state of the art performance.

4.2. Stereo Matching

In a camera setup where two cameras are positioned with a horizontal offset, face in the same direction, and take pictures simultaneously, we have a stereo camera system. Due to epipolar geometry, the optical flow field between both images is restricted to be horizontal, which reduces the stereo matching problem to one dimension rather than two. As in the monocular case, the horizontal epipolar lines intersect in the epipole, which lies at infinity. Such a horizontal optical flow field is called a disparity map, and the disparities are connected to depth by
\[
\text{depth} = \frac{\text{baseline}}{\text{disparity}}.
\]
The baseline has to be measured in pixels, and thus depends on the intrinsic camera parameters.

Even though stereo is similar to optical flow in its formulation, it is addressed with very different methods. The search space is smaller, and the labels are naturally ordered, which simplifies the application of discrete approaches to stereo vision. Early papers in computer vision [31, 86], as well as recent ones [95, 96, 12], solve the stereo estimation problem by discrete MAP inference. A continuous approach to stereo estimation as a labeling problem is presented in [72]. Other approaches are local (independent at each pixel) [76], or semiglobal [39], which means that an energy is formulated, but not minimized to global optimality.

We present semiglobal matching in more detail, as it is a well established stereo matching algorithm due to its fast runtime and high accuracy. The approach does not require rectified stereo images, but the camera poses need to be known in order to obtain a one dimensional correspondence problem. The energy of the disparity field $D$ is given by
\[
E(D) = \sum_{x \in \Omega} \Big( C(x, D(x)) + \sum_{y \in \mathcal{N}(x)} p_1 \mathbb{1}_{|D(x) - D(y)| = 1} + p_2 \mathbb{1}_{|D(x) - D(y)| > 1} \Big). \tag{4.6}
\]
The parameters $p_1$ and $p_2$ penalize disparity differences of $1$ and $> 1$, respectively, where $p_1 \le p_2$ should hold. Parameter $p_1$ represents small changes in disparity due to slanted or curved surfaces, while $p_2$ models depth discontinuities.

The data energy is based on the information theoretic mutual information, which measures how well the first patch predicts the second one. The mutual information of an image $I$ and the warped image $I_w$ is given by the entropy of the images minus their joint entropy,
\[
MI(I, I_w) = H(I) + H(I_w) - H(I, I_w).
\]
Two images get a high matching score if they contain much information and are well predictable from each other. The entropy is given by

\[
H(I) = -\sum_c P_I(c) \log P_I(c),
\]
where $P_I(c)$ denotes the probability of gray value $c$ in image $I$. Similarly, the joint entropy is given by the gray value pairs $(c_1, c_2)$ as they occur in $I$ and $I_w$. Note that the data energy is a function of the entire labeling, and cannot be evaluated pixel wise. The unary cost $C(x, D(x))$ in eq. (4.6) is defined based on an initial disparity map.

Exact inference on the energy (4.6) is computationally non trivial, since the problem is NP hard. However, if the vertices were linearly ordered, fast exact inference would be possible with dynamic programming. Therefore, semiglobal matching approximates the original problem on the two dimensional grid with a sum of one dimensional problems with different slopes.

The stereo setup comes with advantages in three dimensional scene reconstruction as well. Since the camera positions are known, the depth can be estimated in correct scale, and the stereo camera positions provide more reconstruction accuracy than a monocular camera traveling into the scene. With the epipole at infinity, it does not have any influence on the depth reconstruction accuracy; there are no blind spots in the image. In addition to this, both stereo cameras take pictures at the same time, thus the static scene assumption is valid, while it may not hold in a monocular setting. The stereo setup is inherently better posed for reconstruction than the monocular one.
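The dynamic programming argument can be made concrete on a single scanline; the following sketch performs exact inference for the energy (4.6) restricted to one line, with a hypothetical cost volume and penalties.

```python
import numpy as np

# A minimal sketch of exact 1D inference for energy (4.6) along one
# scanline by dynamic programming; C is an (n_pixels, n_disp) unary cost
# volume (random here), p1 and p2 the usual penalties.
def scanline_dp(C, p1=0.2, p2=1.0):
    n, d = C.shape
    gap = np.abs(np.arange(d)[:, None] - np.arange(d)[None, :])
    P = np.where(gap == 0, 0.0, np.where(gap == 1, p1, p2))
    cost = C[0].copy()
    back = np.zeros((n, d), dtype=int)
    for i in range(1, n):                  # forward pass
        total = cost[:, None] + P          # transition from k to l
        back[i] = np.argmin(total, axis=0)
        cost = C[i] + np.min(total, axis=0)
    labels = np.empty(n, dtype=int)        # backtracking
    labels[-1] = np.argmin(cost)
    for i in range(n - 1, 0, -1):
        labels[i - 1] = back[i][labels[i]]
    return labels

rng = np.random.default_rng(1)
print(scanline_dp(rng.random((8, 5))))
```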

4.3. Structure Estimation with Bundle Adjustment

Given a set of images, bundle adjustment consists of estimating the corresponding scene geometry. Depending on the scene setup, the variables may differ, but the set of images is assumed to be fixed; bundle adjustment is a batch approach. Typically, point correspondences on the images are computed first, which are then explained by camera poses and 3D scene points.

A slightly simplified scenario lies in the assumption that all cameras share the same center point, and only differ in orientation, as for example in panorama stitching [15].

Figure 4.4. Panorama stitching with bundle adjustment. The 3D orientations and intrinsic camera parameters need to be estimated, in order to render the pictures into a common artificial view (top row). The bottom image shows the final panorama, which is the result of further postprocessing. This particular panorama illustrates the difficulty that arises with dynamic objects in the scene. A car appears twice, because it was photographed in different locations in the input images.

Due to the lack of camera translation, the induced motion is depth invariant, and no 3D points need to be computed. Each input image is equipped with a rotation $R_i$ and intrinsic camera parameters $K_i$. The homography mapping view $j$ into view $i$ is given by
\[
H_{ij} = K_i R_i R_j^\top K_j^{-1}.
\]

Given point correspondences $\{(u_i^k, u_j^k)\}$, where $i$ and $j$ define an image pair and $k$ enumerates the respective matches, bundle adjustment lies in the minimization of the energy

\[
E = \sum_{i,j,k} \frac{1}{2}\big\|u_i^k - \pi(H_{ij} u_j^k)\big\|^2
\]
over image rotations and intrinsic camera parameters. The energy can be minimized with Levenberg-Marquardt. While the energy is convex, the set of rotation matrices is not, and we are only guaranteed convergence to a local minimum. Furthermore, no global orientation is enforced: rotating all cameras jointly does not change the objective, since the homography $H_{ij}$ only depends on the relative orientation between any two views.

Figure 4.4 shows an example panorama image, where postprocessing has been applied after bundle adjustment. Differences in brightness need to be reduced (gain compensation), and the transition between the different views needs to be blurred.
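The following sketch refines a relative rotation from synthetic correspondences by minimizing the reprojection energy with a Levenberg-Marquardt solver; intrinsics, rotation and point data are invented for the example.

```python
import numpy as np
from scipy.optimize import least_squares

# A minimal sketch of rotation-only bundle adjustment for two views with
# known shared intrinsics K: we refine the relative rotation (Rodrigues
# vector a) so that pi(H_12 u_2) matches u_1 on synthetic data.
def hat(a):
    return np.array([[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]])

def rodrigues(a):
    th = np.linalg.norm(a)
    if th == 0.0:
        return np.eye(3)
    A = hat(a)
    return np.eye(3) + np.sin(th) / th * A + (1 - np.cos(th)) / th**2 * A @ A

K = np.diag([400.0, 400.0, 1.0])
K_inv = np.linalg.inv(K)
a_true = np.array([0.0, 0.1, 0.02])

rng = np.random.default_rng(2)
u2 = np.column_stack([rng.uniform(-100, 100, 20),
                      rng.uniform(-100, 100, 20), np.ones(20)])
u1 = (K @ rodrigues(a_true) @ K_inv @ u2.T).T
u1 = u1[:, :2] / u1[:, 2:]                       # pi(): dehomogenize

def residuals(a):
    x = (K @ rodrigues(a) @ K_inv @ u2.T).T
    return (x[:, :2] / x[:, 2:] - u1).ravel()

res = least_squares(residuals, np.zeros(3), method="lm")
print(res.x)                                     # close to a_true
```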

4.4. Monocular Localization and Mapping

Structure from Motion for monocular videos is referred to as simultaneous localization and mapping (SLAM), where the camera position inside a virtual world is computed, while the world map is updated simultaneously. In the following, we present a short survey of monocular SLAM approaches.

An early work in the field is the parallel tracking and mapping (PTAM) approach [43], where the video is split into segments using keyframes, and bundle adjustment is performed on each segment. The tracking and mapping parts are handled in parallel in two independent threads. The method tracks a very sparse set of feature points, and is targeted towards small closed environments, such as small objects in an indoor setting. A similar approach, which does not split into two parallel threads, and uses a different set of image features, is presented in [23].

Large Scale Direct Monocular SLAM (LSD-SLAM) [28] focuses on accurate odometry and semi-dense depth maps for long video sequences in outdoor environments. The keyframes are aligned according to the estimated camera parameters and depth values; minimizing the photometric error yields accurate poses and depths, and reduces sensitivity to scale drift. Loop closure detects previously visited locations, and corrects drift errors in previous keyframes. The approach runs in real time on a single core CPU.

A dense approach is given by the Dense Tracking and Mapping (DTAM) method [67], which still runs in real time, but on parallel GPU hardware. Each keyframe is equipped with a dense inverse depth map $\xi$. Auxiliary variables $\alpha$ are introduced, which are coupled to $\xi$,

\[
E_{\xi,\alpha} = \int_\Omega g(u) \|\nabla \xi\|_\epsilon + \lambda C(u, \alpha) + \frac{1}{2\theta}(\xi - \alpha)^2 \, du. \tag{4.7}
\]
$g(u)$ denotes a weighting term, which depends on the image gradient. The $\epsilon$ norm is defined similarly to a scaled version of the Huber norm (5.3),
\[
\|x\|_\epsilon = \begin{cases} \frac{\|x\|_2^2}{2\epsilon} & \text{if } \|x\|_2 < \epsilon \\ \|x\|_1 - \frac{\epsilon}{2} & \text{otherwise,} \end{cases}
\]
but the choice between L1 and L2 norm depends on the respective case. Selecting the L2 norm for small values of $\|x\|$ reduces the staircasing effect, which tends to occur with TV regularization. $C(u, \alpha)$ is the photometric error term, which depends on $\alpha$ and the camera parameters, which are estimated in a separate tracking method. Utilizing the dense depth estimates improves pose estimation compared to the sparse PTAM tracking algorithm. Energy (4.7) is minimized with the primal-dual method. The non convexity of the data term $C(u, \alpha)$ is handled by discretizing the inverse depth interval $[\xi_{\min}, \xi_{\max}]$ into equally spaced

E(u)= w (x) u(x) f (x) dx + α TV(u). i | − i | Ω i Z X∈I fi(x) denotes the vote of frame i for voxel x, which is 1 for air and 1 for solid, the object surface is given by the isosurface u 0. Occluded− ≡ voxels are ignored by setting wi(x) = 0 for a certain depth behind the observed object. The TV term is three dimensional, including x, x and z derivatives. The approach in [94] describes a setup, where a camera is mounted to a flying drone, while a compute server estimates a voxel re- construction simultaneously. The dense voxel representation is limited to small range objects and grows cubic with increased resolution. CHAPTER 5

Variational Discrete Optical Flow Estimation

The original optical flow methods based on the brightness constancy assumption are successful on many optical flow benchmarks, but they also have limitations. The brightness constancy assumption does not always hold in outdoor scenarios, especially in the case of a camera mounted on a car. Modifying the optical flow method to minimize a more robust data term may not be trivial. The warping scheme adds a further source of estimation error: small objects tend to disappear on a coarse scale, and recovering an object later on a finer scale is unlikely.

Discrete optical flow methods can overcome these limitations. The data term is evaluated for each pixel independently, hence there is no requirement on convexity or differentiability, and there is no need for a warping scheme. However, depending on the search space of optical flow candidates in each pixel, the model may become infeasible. Another drawback of discrete optical flow methods is the regular grid structure of the displacement labels. Discrete optical flow methods usually apply a continuous postprocessing in order to achieve subpixel accurate optical flow.

In this chapter, we summarize continuous and discrete optical flow methods with their respective advantages. We propose two new approaches to discrete optical flow: a sparse approach with filtered flow candidates, and a dense one. The sparse version is particularly targeted to the static outdoor environment, making use of epipolar geometry. The dense version approximates the intractably large flow model by a sequence of smaller models, where the approximation strategy is justified by probabilistic means. We choose discrete optical flow models because we think that the limitations of discrete approaches will become less important in the future, when computers will have more memory and computing resources available, and discrete inference algorithms are likely to improve as well.

5.1. Optical Flow Estimation

The classical optical flow estimation methods are presented by Horn and Schunck in [40] and Lucas and Kanade [56]. Both approaches are based on the assumption that the image brightness does not change over time along any track in the sequence,
\[
\frac{dI}{dt} = 0.
\]

The time derivative decomposes into $x$, $y$ and $t$ components,
\[
\frac{dI}{dt} = \frac{\partial I}{\partial x}\underbrace{\frac{dx}{dt}}_{:=u} + \frac{\partial I}{\partial y}\underbrace{\frac{dy}{dt}}_{:=v} + \frac{\partial I}{\partial t} = 0. \tag{5.1}
\]
The partial derivatives of the intensity are given by the respective image gradients. At each location, we get a single constraint on two variables, thus the problem formulation is underdetermined. The brightness constancy assumption can only determine the part of the optical flow which lies along the image gradient; the orthogonal part, along constant brightness, cannot be recovered from the image data alone. Therefore, regularity between neighboring pixels is imposed by adding the squared optical flow gradient to the objective, which reads

\[
E_{HS}(u) = \int_{x \in \Omega} (I_x(x)u + I_y(x)v + I_t(x))^2 + \alpha\left( \|\nabla u\|^2 + \|\nabla v\|^2 \right) dx. \tag{5.2}
\]
From the Euler-Lagrange equations, we get optimality conditions on the optical flow field,
\begin{align*}
I_x^2 u + I_x I_y v &= \alpha^2 \Delta u - I_x I_t\\
I_x I_y u + I_y^2 v &= \alpha^2 \Delta v - I_y I_t.
\end{align*}
An iterative scheme solves for $u$ and $v$ satisfying the Euler-Lagrange equations, using a discrete approximation of the Laplace operator. The classical model by Horn and Schunck has been the basis for several optical flow methods, which extended the original formulation to different data and regularity terms, e.g. [16, 83].

The approach by Lucas and Kanade avoids the requirement of an explicit regularization by assuming that neighboring pixels $\{x^i\}$ have very similar optical flow. This assumption generates several brightness constancy equations, where the exact number depends on the choice of neighborhood, such that the resulting system is overdetermined,
\[
\underbrace{\begin{pmatrix} I_x(x^0) & I_y(x^0) \\ I_x(x^1) & I_y(x^1) \\ \vdots & \vdots \\ I_x(x^n) & I_y(x^n) \end{pmatrix}}_{:=A} \begin{pmatrix} u \\ v \end{pmatrix} = \underbrace{\begin{pmatrix} -I_t(x^0) \\ -I_t(x^1) \\ \vdots \\ -I_t(x^n) \end{pmatrix}}_{:=b}.
\]
We compute $u$ and $v$ using the least squares solution
\[
A^\top A \begin{pmatrix} u \\ v \end{pmatrix} = A^\top b \quad \Rightarrow \quad \begin{pmatrix} u \\ v \end{pmatrix} = (A^\top A)^{-1} A^\top b.
\]
We may multiply the image patch with a weighting matrix $W$,

\[
\begin{pmatrix} u \\ v \end{pmatrix} = (A^\top W A)^{-1} A^\top W b.
\]

The matrix $A^\top W A$ is also called the structure tensor, which we will encounter again in the section on sparse optical flow estimation. In homogeneous regions, however, the resulting linear system may not be invertible; there is no guarantee for estimating a dense optical flow field. In contrast to the method by Horn and Schunck, there is no need for an iterative update scheme.

The drawback of both approaches is that the relation in eq. (5.1) only holds if the motion is very small compared to the image resolution. Displacements over several pixels can only be estimated by refining the optical flow field over different scales of the image. Since details are lost after blurring and downscaling, the motion of small objects may not be estimated correctly.
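The least squares solve can be sketched as follows, on a synthetic image pair with a known subpixel shift; the derivative filters and test pattern are arbitrary choices.

```python
import numpy as np

# A minimal sketch of the Lucas-Kanade least squares solve on one patch:
# gradients via central differences, flow from (A^T A)(u, v)^T = A^T b.
def lucas_kanade(I1, I2):
    Ix = 0.5 * (np.roll(I1, -1, axis=1) - np.roll(I1, 1, axis=1))
    Iy = 0.5 * (np.roll(I1, -1, axis=0) - np.roll(I1, 1, axis=0))
    It = I2 - I1
    A = np.column_stack([Ix[1:-1, 1:-1].ravel(), Iy[1:-1, 1:-1].ravel()])
    b = -It[1:-1, 1:-1].ravel()
    return np.linalg.solve(A.T @ A, A.T @ b)     # (u, v)

x, y = np.meshgrid(np.arange(32, dtype=float), np.arange(32, dtype=float))
I1 = np.sin(0.3 * x) * np.cos(0.2 * y)           # smooth test pattern
I2 = np.sin(0.3 * (x - 0.4)) * np.cos(0.2 * (y - 0.1))   # shift (0.4, 0.1)
print(lucas_kanade(I1, I2))                      # approximately [0.4, 0.1]
```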

5.1.1. Robust Optical Flow. The traditional approaches to optical flow estimation as presented in the previous section lack robustness, in both the data and the regularity term. The gray value constancy assumption is not robust against changes in illumination, and the quadratic smoothing prior (5.2) does not preserve sharp motion discontinuities.

Robustness against changes of illumination can be achieved by normalized cross correlation (NCC) [90] on image patches. Given two vectors of gray values, the NCC energy reads
\[
E_{\text{NCC}}(p, q) = 1 - \frac{(p - \bar{p})^\top (q - \bar{q})}{\|p - \bar{p}\|\,\|q - \bar{q}\|},
\]
where $\bar{p}$ and $\bar{q}$ denote the average gray values of the respective patch. We observe that NCC is invariant against additive and multiplicative changes in $p$ or $q$. However, the energy is not differentiable and not convex.

Another robust matching term is based on the CENSUS transform [97, 34], which lists the sign differences of the respective pixels in comparison to the patch center $x^*$,
\[
\operatorname{CENSUS}(p, x) = \operatorname{sign}_\epsilon(p(x) - p(x^*)), \qquad \operatorname{sign}_\epsilon(\alpha) = \begin{cases} -1 & \text{if } \alpha < -\epsilon \\ 0 & \text{if } |\alpha| \le \epsilon \\ 1 & \text{if } \alpha > \epsilon. \end{cases}
\]
This is a modified version of CENSUS, which is also used in [90]; the original transform does not have the $\epsilon$ ball around zero. The CENSUS matching of patches $p$ and $q$ consists of counting differing signs in the CENSUS transformed patches,

\[
E_{\text{CENSUS}}(p, q) = \sum_x \mathbb{1}_{\operatorname{CENSUS}(p,x) \ne \operatorname{CENSUS}(q,x)}.
\]
The CENSUS transform is robust against both additive and multiplicative changes as well; it reduces the gray value information to a binary or ternary sign function, thus removing most of the information. The energy is piecewise constant in terms of the center pixel gray value.
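Both properties are easy to check numerically; the following sketch applies the $\epsilon$-CENSUS transform to a hypothetical patch and an affinely rescaled copy, for which the matching cost vanishes.

```python
import numpy as np

# A minimal sketch of the epsilon-CENSUS transform and the matching
# cost E_CENSUS on two small gray value patches.
def census(p, eps=2.0):
    center = p[p.shape[0] // 2, p.shape[1] // 2]
    d = p - center
    return np.where(d > eps, 1, np.where(d < -eps, -1, 0))

def census_cost(p, q, eps=2.0):
    return int(np.sum(census(p, eps) != census(q, eps)))

p = np.array([[10.0, 50.0, 90.0],
              [20.0, 40.0, 80.0],
              [30.0, 60.0, 70.0]])
q = 2.0 * p + 15.0                 # affine illumination change
print(census_cost(p, q))           # 0: the transform is invariant
```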

The quadratic regularity term (5.2) can be replaced by the more robust L1 norm or total variation,

\[
\operatorname{TV}(u) = \|\nabla u_x\|_1 + \|\nabla u_y\|_1.
\]
The L1 norm is more robust, since outliers get a linear rather than quadratic weight; it is convex, but not differentiable. Therefore, it may be approximated by the Huber norm,

\[
\|\alpha\|_\delta^H = \begin{cases} \frac{1}{2}\alpha^2 & \text{if } |\alpha| < \delta \\ \delta(|\alpha| - \frac{1}{2}\delta) & \text{otherwise,} \end{cases} \tag{5.3}
\]
which is locally quadratic around zero, but grows linearly for large values of $\alpha$. A similar approximation is the Charbonnier functional [17], which does not require a piecewise definition and has a continuous second derivative,
\[
\|\alpha\|_\delta^C = \sqrt{\alpha^2 + \delta}.
\]
This approach has been applied in [16, 83].

The non-differentiability of the L1 regularity term can be addressed by applying optimization techniques which do not require a smooth objective. E.g., the DataFlow approach [90] computes the optical flow field using the fast primal dual method for the data terms mentioned above and total generalized variation (TGV),
\[
E_{\text{TGV}}(u, w) = E_{\text{data}}(u) + \alpha_1 \|\nabla u - w\|_1 + \alpha_0 \|\nabla w\|_1. \tag{5.4}
\]
The auxiliary variable $w$ contains affine parameters for the optical flow field, whose gradient is assumed to be sparse. Thus, TGV regularization encourages piecewise affine optical flow, while TV regularization assumes a piecewise constant flow field $u$. The fast primal dual method requires evaluating the proximal operator of the data term, which is possible for NCC and CENSUS matching. Also, the proximal operator of the data term can be evaluated for each pixel independently, which makes the approach parallelizable.

A similar approach is presented in [71], but the TGV term is extended to a non local neighborhood,

\[
\operatorname{NLTGV}(u, w) = \sum_{x \in \Omega}\sum_{y \in \Omega} \alpha_1(x, y)\, |u(x) - u(y) - \langle w(x), x - y \rangle| + \sum_{x \in \Omega}\sum_{y \in \Omega} \alpha_0(x, y)\, \|w(x) - w(y)\|_1.
\]
The CENSUS matching is evaluated across different scales, in order to increase robustness against scale changes as well. Energy minimization is performed over the image pyramid in a fast primal dual framework.

5.1.2. Discrete Approaches. The methods for optical flow estimation presented in the previous section require the creation of a multiscale pyramid, in order to warp and refine the flow field from coarse to fine. Small objects, which cannot be detected at a coarse scale, will not be rediscovered at a finer level due to the greediness of the approach. Different approaches have been proposed to avoid the coarse to fine scheme, which operate on the reference images in full size.

The approach of Steinbrücker [80] separates the data and regularity energies by introducing auxiliary variables $u$ and $v$,
\[
E(u, v) = E_{\text{data}}(u) + E_{\text{reg.}}(v) + \frac{1}{2\theta}(u - v)^2.
\]
The energy is minimized with respect to $u$ and $v$ alternatingly until convergence. The regularity plus coupling term is convex, and can be minimized for example with the fast primal dual method. The data plus coupling term can be minimized by exhaustive search. The method does not have any requirement on the data term. A drawback of this approach is that it requires a long sequence of independent optimization tasks, and the parameter $\theta$ needs to be decreased over time in order to couple $u$ and $v$.

Estimating optical flow by means of discrete optimization is difficult due to the large number of labels. The approach by Mozerov [65] filters optical flow candidates by the symmetric phase only filter (SPOF), a frame correlation method based on the Fourier transform. The constrained search area is input to a graphical model with gray value and gradient constancy as data energy and a truncated TV regularity term, followed by a (discrete) subpixel refinement.

The approach "DiscreteFlow" by Menze et al. [60] addresses this issue by reducing the optical flow labels to a small set of proposals. These proposals are generated by a nearest neighbor search in a feature space, similar to the PatchMatch approach [7]. The nearest neighbor search is implemented efficiently on a randomized k-d tree data structure. Optical flow proposals are also shared among neighbors, since they are likely to have similar motion. Inference on the graphical model with optical flow proposals is implemented with block coordinate descent (BCD) [21], which is a fast but approximate inference method.

FusionFlow [50] takes a similar approach, by fusing the outcome of different continuous methods into a single optical flow field. The labels are generated by the Horn and Schunck and Lucas and Kanade approaches with different parameters; the problem is then reformulated into a binary graphical model and optimized with graph cuts. The final optical flow field is postprocessed with a continuous method, which is very common in discrete optical flow estimation.

5.1.3. Functional Lifting and Convex Relaxation. As in the discrete approaches presented so far, the problem of optical flow estimation is viewed as a labeling problem, where labels correspond to displacement vectors. This strategy increases the problem size, since the unary energies have to be precomputed for all possible displacements. Inference may increase the memory requirement further, e.g. if Ishikawa's method is applied, to which we introduce a more memory-efficient alternative in the following.
We may interpret the (second order) labeling problem as a continuous energy minimization task, assuming that the labels are ordered and the label assignment can change continuously between them. If all unary and pairwise energies were convex, the overall energy would be convex and could be minimized to global optimality. The assumption that all unary energies are convex is unrealistic in the context of optical flow estimation, but the convexity assumption on the pairwise term is satisfied by an L2 or L1 penalty.
We now shift from the discrete to the continuous problem formulation, in order to reformulate the labeling problem into a convex energy at the expense of increasing the dimension. Let the (one-dimensional) optical flow energy be given by

    E(u) = ∫_{x∈Ω} E_data(x, u) dx + ∫_{x∈Ω} |∇u| dx.                 (5.5)

Let Γ denote the possible range of optical flow. We now introduce the lifted function φ : Ω × Γ → {0, 1},

    φ(x, γ) = { 1   if u(x) > γ,
              { 0   otherwise,

with the constraint that φ(x, γ_min) = 1 and φ(x, γ_max) = 0 for all x ∈ Ω. The valid constraint set is denoted by Ω_Γ. Then we can reformulate energy (5.5) in terms of φ,

    E′(φ) = ∫_{(x,γ)∈Ω_Γ} |∇φ(x, γ)| + E_data(x, γ) |∂_γ φ(x, γ)| d(x, γ).   (5.6)

A proof is presented in [69]. However, we do not yet have a convex energy in eq. (5.6), since Ω_Γ is not convex. We need to relax the binary set {0, 1} to the interval [0, 1] in order to obtain a convex relaxed functional in φ, which can be optimized with the primal-dual algorithm. The approach requires an ordered label set, hence it is evaluated on stereo cases only. The authors of [33] present an extension in which the general flow problem is modeled with a similar strategy, under the assumption that the pairwise term is separable (e.g. the L1 norm),

    E_reg(u, v) = E_reg(u) + E_reg(v).

The data energy needs to be computed for all possible displacements, which limits the approach to small motions.
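To make the lifting step concrete, the following toy sketch (NumPy; our own construction, for a discretized label range) builds the lifted indicator φ from a flow field u and recovers u from φ again:

    import numpy as np

    def lift(u, gammas):
        # phi[i, k] = 1 if u[i] > gammas[k], else 0 (super-level set indicator).
        return (u[:, None] > gammas[None, :]).astype(float)

    def unlift(phi, gammas):
        # Recover u by integrating phi over the label range.
        dgamma = np.diff(gammas, prepend=gammas[0])
        return gammas[0] + (phi * dgamma).sum(axis=1)

    u = np.array([0.0, 1.5, 3.0])
    gammas = np.linspace(0.0, 4.0, 81)
    assert np.allclose(unlift(lift(u, gammas), gammas), u, atol=0.1)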

In the following sections, we describe two approaches that express optical flow as a labeling problem and address the problem of large displacements.

5.2. Sparse Discrete Optical Flow

We express optical flow estimation as a labeling problem, where labels correspond to precomputed sparse optical flow candidates. The flow candidates have to satisfy several properties: they need to match well according to visual features, and they need to meet constraints that arise from epipolar geometry. The workflow of our algorithm is presented in algorithm 10.

Input : reference and secondary images I₁ and I₂
Output: sparse flow field u
    learn patch characteristics on I₁;
    determine distinct locations in I₁;
    perform matching, keep a limited number of matches per pixel;
    estimate epipolar geometry and filter matches;
    infer optical flow by means of discrete optimization;
Algorithm 10: Our sparse discrete optical flow approach.

In the following sections, we explain the individual steps in detail.

5.2.1. Learning Patch-based Matching. Given a patch p, we subtract the mean gray value p̄ to gain robustness against illumination changes, and define the vector

    f(p) := p − p̄.

Given a set of training patches P = {p₁, ..., pₙ} extracted from the reference frame at each pixel (sufficiently distant from the image border), we apply principal component analysis (PCA) by computing the respective empirical mean μ_P and covariance matrix C_P,

    μ_P = (1/|P|) Σ_{p∈P} f(p),
    C_P = (1/|P|) Σ_{p∈P} (f(p) − μ_P)(f(p) − μ_P)^⊤.

Since C_P is a covariance matrix, it has only real and non-negative eigenvalues. We perform the spectral decomposition

    C_P = B^⊤ Λ B,

where B contains the eigenvectors as its rows, and the diagonal matrix Λ = diag(λ) contains the eigenvalues λ₁ ≥ λ₂ ≥ ··· . We only keep the first N eigenvectors, which represent the components that capture most of the variance of the data,

    B′ := (b_ij), i = 1, ..., N, j = 1, ..., D,

    Λ′ = diag(λ₁, ..., λ_N).

We can reduce the dimension of our features by projecting onto the subspace of N components which capture most variance,

    f′ = B′ f(p),

and compute the data energy, e.g. NCC, on the projected features.
The a-contrario model [76] is based on the principal components as well, but takes a probabilistic approach to the matching problem. Image statistics are calculated on the reference frame by building the histograms of the respective principal components. The cumulative histogram of the i-th principal component of patch q is denoted H_i(q). The resemblance probability p̂ⁱ_{qq′} represents the probability that the i-th principal components of q and q′ lie this close to each other by chance,

    p̂ⁱ_{qq′} = { H_i(q′)                if H_i(q′) − H_i(q) > H_i(q),
              { 1 − H_i(q′)            if H_i(q) − H_i(q′) > 1 − H_i(q),
              { 2 |H_i(q) − H_i(q′)|   otherwise.

If both components are equal, the resemblance probability is zero, since equality cannot happen by chance. The mismatch probabilities are computed for each component, quantized into a non-decreasing sequence of negative powers of two, and multiplied into a single mismatch probability for the pair of patches. The quantization step ensures that mismatches in early components get a higher weight than later ones, since these are visually more perceivable. The full method is presented in algorithm 11.

Input : principal components q and q′ and cumulative histograms H_i
Output: mismatch probability p
    permute q and q′ jointly, so that q is sorted in decreasing order;
    compute p̂ⁱ_{qq′} for each component;
    define pⁱ_{qq′} as the quantization of p̂ⁱ_{qq′};
    p = Π_i pⁱ_{qq′};
Algorithm 11: a-contrario matching.

Discrete patch matching can only be applied successfully in distinct or feature-rich regions. We select the most distinct points in the reference frame and only apply matching on them. Our model can directly evaluate the distinctness of a patch by using its Euclidean distance to the average patch; this is similar to the distinctness defined in [58]. An example of distinct patches is shown in Figure 5.1.
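A compact sketch of the feature learning and projection step (NumPy; the variable names follow the text, the scaffolding is our own):

    import numpy as np

    def learn_projection(patches, n_components):
        # patches: (n, D) matrix, one flattened training patch per row.
        F = patches - patches.mean(axis=1, keepdims=True)  # f(p) = p - mean(p)
        C = np.cov(F, rowvar=False, bias=True)             # empirical covariance C_P
        lam, B = np.linalg.eigh(C)                         # ascending eigenvalues
        order = np.argsort(lam)[::-1][:n_components]       # keep the N largest
        return B[:, order].T                               # rows = eigenvectors (B')

    def project(Bp, patch):
        f = patch - patch.mean()
        return Bp @ f                                      # f' = B' f(p)

The NCC data energy is then evaluated on these low-dimensional features instead of the raw patches.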

[Figure 5.1 panels: (a) reference frame, (b) a-contrario 50, (c) distinct patches, (d) NCC 100.]

Figure 5.1. Comparison of the ground truth with our approach and different data terms. Our method recovers large scale motion in distinct areas.

Figure 5.2. Matches found by NCC with 100 principal components. Multiple matches which differ by only one pixel are joined to one here for illustration. In two of three cases, the correct correspondence is included in the candidate list. The example where the correct match is not found is a case of disocclusion. Figure 5.3 shows a more detailed view of the patches.

Figures 5.2 and 5.3 show example matches found by NCC and the a-contrario model. We can observe that the correct match is found if it is not occluded, and that the wrong matches are similar in appearance.

5.2.2. Graphical Model Construction and Inference. We estimate optical flow by defining a graphical model on optical flow candidates, which are found by matching the patch around each pixel in the reference image with the second image.


Figure 5.3. Close-up view of the example matches shown in Figure 5.2 (columns: patch, ground truth, full-rank NCC, NCC 100, a-contrario 50). All matchings find the correct correspondence for the first two patches, which are highlighted in green. The third patch is changing due to a disocclusion, so that we cannot expect a correct match. We can see that the original (full-rank) NCC finds a subset of the approximate NCC data term. The a-contrario model finds exactly the correct match on the first two patches by looking at only the first 50 of 625 principal components, but returns a wrong match on the third patch. We can also see that patch-wide changes in illumination are correctly ignored by all matchings.

In order to accelerate the matching, we only match distinct locations in the first image, and restrict the search space to a maximum displacement. The resulting matches are filtered by the matching score, and limited in number for each pixel. A distinct patch may not appear in the graphical model if no suitable match can be found.
In addition to the appearance-based filtering, we assume a monocular scene setup, and filter matches based on epipolar geometry. We calculate the essential matrix with the Helmke algorithm [37] on keypoint matches found by SIFT [55]. For increased robustness against outliers, the essential matrix estimation runs inside a random sample consensus (RANSAC) framework: the essential matrix is estimated on a small sample from the input data and evaluated against the whole dataset. If a sample describes many input points well, the current estimate is returned; otherwise the procedure is repeated. Even though this approach appears heuristic, it has been applied successfully in many contexts in computer vision [36]. If the epipolar line constraint is violated, we delete the respective match from the list of candidates.
The epipolar line constraint alone is not strong enough to filter out wrong matches effectively, since only a small portion of the epipolar line contains matches with positive depth. In order to find scene depth, we factorize the essential matrix into rotation R and translation t, under the assumption that the camera moves forward and has only a small rotation. For a given match, we calculate the respective depth d at location x in the first image, and depth d′ for the corresponding point x′, by solving the linear system


    d x = t + d′ x′   ⟺   ( x  −x′ ) (d, d′)^⊤ = t.

Since the viewing rays do not have to intersect, but can be skew, this system of equations is overdetermined. We find d and d′ by calculating the least-squares minimum. If depth d is negative, the match is deleted from the list of candidates. Figure 5.4 illustrates the match filtering. Since the essential matrix estimate is noisy, a match is only deleted at a large deviation from the epipolar line.

Figure 5.4. The candidates of the middle patch in Figure 5.2; x is the patch location in the reference frame. Candidates which are further away from the epipolar line than a given threshold are not admitted by our model. Also, the depth corresponding to a candidate must be positive, where the projected points y_i are used for the depth computation. Candidate y₂ fails both tests; y₃ and y₄ are within the margin but do not correspond to a positive depth. Only the correct match y₁ is admitted for further consideration in the graphical model.

The remaining matches form the labels of a graphical model, whose optimal state is our estimated flow field. We define two pixels as neighbors if they are closer than a certain radius r. Since we only have labels on a subset of pixels, choosing a local neighborhood may result in many disconnected components. Also, a non-local neighborhood reduces the staircasing effect that TV regularization is known for. Neighboring pixels which are further apart get a lower weighting. The weight for pixels x and y is given by

    w(x, y) = exp( −(1/2) σ⁻² ‖x − y‖² ).

The overall energy is given by

    E(l) = Σ_{x∈Ω} E_match(l(x)) + Σ_{(x,y)∈N} w(x, y) ‖ u_{l(x)} − u_{l(y)} ‖.

The label at pixel x is denoted l(x), and u_l denotes the respective displacement vector. The corresponding unary and pairwise factors are constructed as tables, and inference is performed using tree-reweighted belief propagation (TRBP) [92].
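The positive-depth test above reduces to a small 3×2 least-squares solve per candidate. A minimal sketch (NumPy; our own helper, under the convention that the second ray has already been rotated into the reference camera frame):

    import numpy as np

    def triangulate_depths(x, x_prime, t):
        # Solve d*x = t + d'*x' in the least-squares sense; x and x' are the
        # homogeneous viewing rays of the match, x' assumed to be expressed
        # in the reference camera frame (a convention we choose here).
        A = np.stack([x, -x_prime], axis=1)      # (x  -x')(d d')^T = t
        (d, d_prime), *_ = np.linalg.lstsq(A, t, rcond=None)
        return d, d_prime                        # keep the match only if d > 0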

           a-contrario 50            NCC 100
    seq.   dens.    C++     ours    dens.    C++     ours
    11     5.28     13.33   21.01   2.39     10.78   9.91
    12     6.69     4.73    9.52    5.18     3.14    5.53
    49     5.99     4.99    6.59    1.63     0.68    1.63
    74     4.96     44.27   33.91   0.32     55.56   54.91

Table 5.1. Sparse optical flow evaluation of the 3px error on training sequences of the KITTI benchmark versus the Classic++ approach ("dens." denotes the density of our sparse solution). We evaluated in the image regions where both ground truth flow and estimated flow are available. The data term of the a-contrario model gives a denser solution than the NCC, but is inferior in accuracy.

5.2.3. Evaluation. We test our method by comparing the estimated optical flow field against the Classic++ method by Sun et al. [83], which is a highly ranked method on the KITTI benchmark. The patch size is 25 × 25 pixels. Figure 5.5 shows four example outputs on the KITTI flow benchmark. Our optical flow fields are very sparse; homogeneous regions are not matched due to their low distinctness value.
Table 5.1 shows a comparison of our approach to Classic++. We observe that NCC yields higher accuracy, while the a-contrario model returns denser flow fields. In three cases, our approach achieves a lower error, but we should note that the evaluation scheme slightly favors our method: our approach can choose the locations which are evaluated, while the Classic++ method has to find a point correspondence for each pixel.
Figure 5.5 shows the output of our method versus Classic++. We see that our flow fields are very sparse, and do not represent the scene structure well. Our matching term is robust against changes in illumination, but does not model changes in scale or orientation, which appear frequently in outdoor traffic sequences. In the next section, we propose a dense framework, which does not require the filtering of sparse matches, but instead approximates the infeasible dense model by a sequence of small graphical models.

5.3. Dense Hierarchical Label Reduction

In the sparse optical flow framework, we inferred optical flow by minimizing an energy function over a small set of filtered optical flow candidates. For dense output, we wish to minimize the same energy with a dense label set on the pixel grid. Current optical flow methods are required to estimate accurate flow with large motion.

[Figure 5.5: optical flow color encoding; panels (a) sequence 11, (b) sequence 12, (c) sequence 49, (d) sequence 74.]

Figure 5.5. Comparison of our sparse optical flow approach to the dense variational method Classic++ [83]. For each example we show, from top to bottom, the reference frame, the (sparse) ground truth optical flow, the output of Classic++, and our approach with the NCC data term and 20 principal components. There are difficult lighting conditions, especially in sequence 74, where the Classic++ method produces erroneous optical flow in large regions. Our method produces very sparse optical flow fields.

Figure 5.6. Label selection mechanism for a 2D optical flow field. The square represents the possible displacement vectors for a fixed pixel. In the first iteration the displacement grid is partitioned into a 3 × 3 grid (3-tiling). Inference in the reduced graphical model is performed, and the best two labels are chosen and refined. This procedure iterates until an individual label has been selected, here filled in black.

Figure 5.7. A graphical model with two vertices and an edge between them. The first vertex has three possible labels, and the second five. The labels are clustered into a blue and an orange set. The reduction is performed independently on each factor.

Discretizing the search space with pixelwise resolution generates large graphical models, where already the unary energy values will not fit into the memory of a modern computer. We aim at approximating an infeasible model with a sequence of small models. LogCut [51] achieves logarithmic complexity by solving for the variable bitwise, where each bit is found by inference on a binary graphical model with graph cuts. Similarly, we define a hierarchy of labels, and find a path through the tree by solving a small graphical model at each step. The reduction is based on the divergence between the respective probability distributions.
In the case of optical flow, there is a natural clustering of labels into a regular grid structure; we refer to the grid size as the tiling. Figure 5.6 shows a hierarchical path through label clusters, until a single optical flow label is selected. Instead of selecting the single optimal cluster in each step, we allow the selection of multiple clusters, in order to track multimodal distributions more accurately. In our experiments, we perform inference with TRW-S, which calculates approximate marginals that allow us to select multiple best labels. Figure 5.7 shows an example reduction into a binary graphical model.
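Ignoring the pairwise terms for a moment, the per-pixel selection mechanism of Figure 5.6 can be sketched as follows (NumPy; a unary-only toy version of the hierarchy, with min-reduction per tile and single-cluster selection):

    import numpy as np

    def hierarchical_argmin(E, tiling=3):
        # E: (S, S) unary energies over an S x S displacement grid,
        # S a power of `tiling`.  At each level, reduce each tile to its
        # minimum energy, pick the best tile, and descend into it.
        r0 = c0 = 0
        S = E.shape[0]
        while S > 1:
            s = S // tiling                        # side length of one tile
            tiles = E[r0:r0+S, c0:c0+S].reshape(tiling, s, tiling, s)
            red = tiles.min(axis=(1, 3))           # min-reduction per tile
            i, j = np.unravel_index(np.argmin(red), red.shape)
            r0, c0, S = r0 + i * s, c0 + j * s, s  # descend into the best tile
        return r0, c0                              # selected displacement label

    E = np.random.rand(27, 27)
    assert hierarchical_argmin(E) == np.unravel_index(np.argmin(E), E.shape)

For purely unary energies the min-reduction is exact, which is why the assertion holds; with pairwise terms the procedure becomes the approximation analyzed in the following sections.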

5.3.1. Reduced Graphical Model Parameters. In this part we derive the corresponding energy function for the proposed label reduction technique. We define the clustering C = {c_k : k = 1, ..., K} to be the collection of all clusters. To each of these clusters, we associate a probability denoted by q_{c_k}. Each factor of the given graphical model is reduced separately. Let p_l denote the probability of label l in the respective factor. A suitable measure of distance is given by the Kullback-Leibler (KL) divergence [45],

    KL_p(q) = Σ_{c_k∈C} Σ_{l∈c_k} p_l log( p_l / q_{c_k} )                          (5.7a)
            = −Σ_{c_k∈C} a_{c_k} log q_{c_k} + const.,   a_{c_k} = Σ_{l∈c_k} p_l.   (5.7b)

The second equality follows from the properties of the logarithm and the fact that the probability summed within each cluster is constant. In order to minimize the KL distance, we need to differentiate (5.7b) and solve for a vanishing gradient. However, since q represents a probability distribution, the derivative needs to be projected onto the tangent space of the probability simplex. The projection is given by

    P_Δ = I − (𝟙𝟙^⊤)/(𝟙^⊤𝟙),   𝟙 = (1, ..., 1)^⊤ ∈ ℝ^K,

where I is the identity matrix. Let a = (a_{c₁}, ..., a_{c_K}) be the vector of the a_{c_k} introduced in (5.7b); then differentiating (5.7b) with respect to q results in

    ∇_q KL_p(q) = −P_Δ ( a / q ),

where the division is done element-wise. We note that the stationary point is attained if and only if a/q ∝ 𝟙. Equivalently, we obtain the exact reduced labeling scheme if the cluster probabilities correspond to the probabilities of attaining a label of the cluster in the original label space, i.e.,

    q_{c_k} = Σ_{l∈c_k} p_l,   ∀ c_k.                                  (5.8)

Furthermore, we see that the reduction scheme is invariant to additive shifts of the model energies. Moving from the probabilistic viewpoint to the energy formulation, we obtain the following reduced energy for a cluster c_k:

    E_{c_k} = −log( Σ_{l∈c_k} exp(−E_l) ).

Adding a constant to each energy will increase the reduced energy by the same constant. However, the reduced energies are not invariant to multiplicative scaling of the energies. The identification of the proposed reduction scheme as an instance of the generalized α-divergence, presented next, gives an explicit interpretation of this scale dependency.

5.3.2. Label Reduction Based on the α-Divergence. The introduced label reduction technique can be viewed as an instance of the general α-divergence [62]. Let α ∈ ℝ \ {0, 1}; then the α-divergence is defined as

    D_α(p‖q) = 1/(α(1−α)) Σ_l [ α p_l + (1−α) q_{c(l)} − p_l^α q_{c(l)}^{1−α} ].

Note that in the limit α → 1, D_α corresponds to the standard KL divergence between p and q. Assuming we have found the optimal labeling in the original label space, we need to find the corresponding optimal labeling of the clusters. We minimize D_α(p‖q) over the probabilities q. There exists a closed-form solution for the global minimizer with vanishing gradient,

    P_Δ( 1/(α(1−α)) Σ_l [ (1−α) − (1−α) p_l^α q_{c(l)}^{−α} ] ) = 0,

with the projection P_Δ as in section 5.3.1. At optimality, the projection is invariant under additive and multiplicative constants; therefore optimality is given by

    ( q_{c_k}^{−α} Σ_{l∈c_k} p_l^α )_{c_k∈C} ∝ 𝟙.                      (5.9)

Equation (5.9) implies the following relation between the cluster probabilities and the label space probabilities. For all clusters c_k ∈ C, we have

    q_{c_k}^{−α} Σ_{l∈c_k} p_l^α = 1  ∀ c_k ∈ C   ⟹   q_{c_k} = ( Σ_{l∈c_k} p_l^α )^{1/α}.   (5.10)

Expression (5.10) matches our previous result (5.8) as α → 1. Since the probability values p_l are calculated from the energies, as given by (3.4), p_l^α corresponds to the probability of the scaled energy αE_l. The scaling factor α can be viewed as a normalization constant, i.e., the operation in (5.10) scales the objective function and, after performing the KL reduction, restores the original scale. As α → ∞, the expression (5.10) converges to the maximum norm.
As α → 0, the α-divergence converges to the KL divergence between q and p, lim_{α→0} D_α(p‖q) = KL(q‖p).

[Figure 5.8: bar plots of energies and probabilities for nine labels l₁–l₉ and three clusters c₁–c₃, for the original distribution and for α = 0, 1, 10, ∞.]

Figure 5.8. Reduction output on nine random labels, which are clustered into three clusters of equal size. We minimize the energy, which corresponds to a maximization of the probability. In this case, the optimal label shares its cluster with high-energy labels, so that the correct cluster is found only for α = ∞.

This divergence measure and its minimum with respect to the cluster probabilities are given by

    KL(q‖p) = Σ_{c∈C} Σ_{l∈c} q_c ( log q_c − log p_l ),

    ∇_{q_{c_i}} KL(q‖p) = Σ_{l∈c_i} ( log q_{c_i} − log p_l + 1 ),

whereby we get, after solving ∇_q KL(q‖p) = 0,

    q_{c_i} = exp( (Σ_{l∈c_i} log p_l) / |c_i| ) = exp( −(Σ_{l∈c_i} E_l) / |c_i| ).

In this case, the minimum corresponds to averaging the respective energies.
Based on the interpretation of (5.10), we have shown that minimizing the α-divergence between given label probabilities and cluster probabilities corresponds to an energy mean for α = 0, a soft-min for positive α, and the hard min for α = ∞. Figure 5.8 shows the reduction output on a small example of nine random labels, merged into three clusters of equal size. The figure shows the energy values and the corresponding probabilities for different values of α. As can be clearly seen, only α = ∞ yields the correct labeling in this unary case. This example illustrates that single labels with low energy clustered together with high-energy labels are lost with a small value of α.
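The whole family of reductions fits in a few lines (NumPy; our own sketch). The rule q_c = (Σ_{l∈c} p_l^α)^{1/α} is evaluated in the energy domain for numerical stability:

    import numpy as np

    def reduce_cluster_energy(E, alpha):
        # Reduced energy of one cluster with label energies E; eq. (5.10)
        # in the energy domain: E_c = -(1/alpha) log sum_l exp(-alpha*E_l).
        if np.isinf(alpha):
            return E.min()                     # hard min
        if alpha == 0.0:
            return E.mean()                    # energy mean
        m = E.min()                            # shifted log-sum-exp; alpha = 1
        return m - np.log(np.exp(-alpha * (E - m)).sum()) / alpha

    E = np.array([1.0, 0.5, 0.1])
    print([reduce_cluster_energy(E, a) for a in (0.0, 1.0, 10.0, np.inf)])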

                   global   3-tiling           5-tiling
                            inf.     2-marg.   inf.     2-marg.
    EPE   GT noc   1.24     2.23     2.05      2.04     2.13
          GT occ   1.38     2.80     2.40      2.51     2.39
          global   0        4.66     3.98      4.12     3.74
    2px   GT noc   9.22     11.94    11.19     11.69    14.37
          GT occ   10.64    13.82    12.91     13.54    15.97
          global   0        17.28    13.93     16.14    12.68

Table 5.2. KITTI stereo errors. The 2-marginals version achieves lower error compared to the global optimal solution, but does not necessarily outperform the single cluster selection when compared against the ground truth.

Choosing a strictly positive value for α introduces a scale dependency: scaling the energy function can be compensated by adjusting α accordingly. By selecting α = ∞, we eliminate this additional degree of freedom and do not need to rescale our energy values.

5.3.3. Evaluation. We evaluate different aspects of the proposed dense flow computation method, starting with the choice of α and the runtime, and proceeding to stereo and optical flow estimation performance. All optical flow and stereo models we evaluate use the NCC data term and TV as pairwise energy.
In Figure 5.9, we see the impact of changing α on a small stereo model. We observe that a finer tiling improves the result for all values of α, and that increasing α generally improves the accuracy of the reconstruction. At α = 0, the reduction is the mean operation on the energies inside a cluster, and the single correct label inside a cluster does not have enough influence on the mean energy. At α = ∞, however, the reduction is the min operation, so that the energy of the best label is selected. We also see that following two peaks using the marginals, instead of applying inference at each level, introduces jitter in the labeling.
Figure 5.10 shows the inference runtimes of direct inference and the proposed dense hierarchical approach as a function of the label count. We observe that our algorithm shows a logarithmic increase in runtime, while being slower for small models. The global model uses an optimized solver for the truncated L1 pairwise term, because global inference would otherwise not be feasible for the large models. Even though the global model uses this dedicated solver, the reduced model performs better in terms of runtime.
In the case of stereo estimation, inference on the global model is feasible, and we can compare our method to both the ground truth and the global model optimum.

[Figure 5.9: stereo images and global model solution; rows for the 3-, 5-, and 10-tiling, each with inference (top) and 2-marginals (bottom); columns for α = 0, 1, 10, ∞.]

Figure 5.9. Results on the Tsukuba dataset for different values of α. We can observe that larger values of α and a finer tiling improve the accuracy. Especially with α = 0 and the 3-tiling, the scene structure is lost to a large extent; with increasing α we capture increasingly more details of the scene. There are visible artifacts introduced by using the 2-marginals selection.

[Figure 5.10: plot of the inference runtime t over the number of labels |L| for the global model and the 3-, 5-, and 10-tilings.]

Figure 5.10. Inference runtime as a function of the number of labels for the proposed reduction scheme and direct inference. While the original model shows linear runtime, our approach is logarithmic. We make use of an optimized version of TRW-S for the truncated L1 penalty function for the global model, but not for the reduced models. Therefore the 10-tiling performs worse for ten labels, even though the reduced model is identical to the global one. If the tiling happens to align well with the number of labels, the reduced method can perform better, which explains the positive outliers for the reduced models.

In Table 5.2, we can see that the finer 5-tiling and the tracking of two clusters rather than one increase the accuracy with respect to the global model, but not with respect to the ground truth data. Jitter artifacts are introduced by the multimodal tracking of two clusters, as we can also see in Figures 5.11 and 5.12. The greedy reduced method compares better against the ground truth than against the global model. The TV model does not necessarily reflect the scene structure well; assuming a piecewise affine rather than a piecewise constant scene would probably perform better. The greedy selection based on the matching score improves the end point error.
In the case of optical flow, the global model is infeasible for direct inference, and we can compare our reduction approach only to the ground truth flow. Figures 5.13 and 5.14 show example results on the Sintel optical flow benchmark. Due to the large label space and the previously seen jittering effects, we do not evaluate the 2-marginal reduction, but only inference on the reduced models. Our approach has difficulties with strong motion blur and low texture.
Tables 5.3 and 5.4 show optical flow errors in comparison to the DiscreteFlow method by Menze et al. [60] for each sequence of the Sintel training set. The nearest neighbor flow proposals of DiscreteFlow run on multiple scales, and thus achieve a better unary energy, which leads to lower errors in both measures. Postprocessing improves the estimated flow field slightly by subpixel refinement and removal of outliers.

[Figure 5.11: disparity and error maps (disparity 0–max, error 0–5px); rows: reference frame and global model disparity, 3-tiling inference, 3-tiling 2-marginals, 5-tiling inference, 5-tiling 2-marginals; columns: sequences 9, 18, 23.]

Figure 5.11. KITTI stereo examples with computed disparity and error in comparison to the global model. The 2-marginals strategy introduces jitter artifacts, as we can see in the errors. Most reconstruction errors occur in the lower left part of the images, where no correspondence exists, especially when the region has low texture.

[Figure 5.12: further KITTI stereo examples; rows: reference frame and global model disparity, 3-tiling inference, 3-tiling 2-marginals, 5-tiling inference, 5-tiling 2-marginals; columns: sequences 44, 68, 69.]

Figure 5.12. Further KITTI stereo examples. In some cases, the greedier reduction approach outperforms the global TV-based model: the pole in the left image is reconstructed by the reduced, but not by the global model. The global model can fill regions with little unary information, while the reduced model filters out the correct label cluster too early, as we can see at the windows in sequence 68.

[Figure 5.13: Sintel optical flow examples with error maps (error 0–5px); rows: reference frame and ground truth flow, 3-tiling, 3-tiling postprocessed, 5-tiling, 5-tiling postprocessed; columns: alley 1, shaman 2, bandage 1, cave 2.]

Figure 5.13. Sintel optical flow examples with error maps. Most errors occur in occluded regions; in the bandage 1 and cave 2 sequences there are grid artifacts due to the label discretization.

[Figure 5.14: further Sintel optical flow examples; rows: reference frame and ground truth flow, 3-tiling, 3-tiling postprocessed, 5-tiling, 5-tiling postprocessed; columns: market 5, ambush 6, mountain 1, temple 2.]

Figure 5.14. Further Sintel optical flow examples. In the market 5 sequence, strong motion blur leads to an unreliable data term. Similarly, low texture leads to optical flow errors, as seen in the ambush 6 example. The mountain 1 sequence has low texture, but very uniform motion, which simplifies optical flow estimation.

                  3-tiling         5-tiling        DF
    postproc.     off     on       off     on
    alley 1       0.67    0.58     0.65    0.57    0.17
    alley 2       0.74    0.60     0.68    0.54    0.13
    ambush 2      44.27   44.20    42.67   42.85   27.04
    ambush 4      26.99   25.60    26.52   25.35   21.09
    ambush 5      2.33    2.18     2.13    2.01    1.08
    ambush 6      24.07   21.44    22.52   20.36   11.32
    ambush 7      1.03    0.86     0.93    0.78    0.32
    bamboo 1      0.67    0.55     0.65    0.54    0.24
    bamboo 2      0.62    0.52     0.59    0.50    0.34
    bandage 1     0.98    0.82     0.93    0.79    0.52
    bandage 2     0.99    0.94     1.00    0.97    0.65
    cave 2        3.84    3.70     3.43    3.32    1.43
    cave 4        3.85    3.69     3.60    3.48    2.21
    market 2      1.43    1.20     1.41    1.20    0.70
    market 5      20.69   19.78    19.45   18.76   8.13
    market 6      4.23    3.94     3.33    3.12    1.23
    mountain 1    1.80    1.42     1.64    1.34    0.52
    shaman 2      0.62    0.41     0.59    0.41    0.19
    shaman 3      0.91    0.74     0.84    0.69    0.28
    sleeping 1    1.31    0.79     1.29    0.81    0.10
    sleeping 2    0.46    0.34     0.44    0.32    0.06
    temple 2      5.28    3.40     5.35    3.64    1.06
    temple 3      3.12    2.84     3.01    2.79    1.00

Table 5.3. Sintel optical flow end point errors in pixels. Postprocessing improves the optical flow field, but the DiscreteFlow (DF) method has lower errors in almost all cases. It has the advantage of a multiresolution data term, which is more robust against motion blur and other deformations. Due to its filtering of optical flow candidates, the DiscreteFlow method also runs faster than our approach.

                  3-tiling         5-tiling        DF
    postproc.     off     on       off     on
    alley 1       3.50    2.85     3.38    2.75    1.35
    alley 2       2.90    2.55     2.61    2.26    0.81
    ambush 2      88.24   87.67    87.12   86.32   71.53
    ambush 4      62.81   61.56    61.08   59.75   49.26
    ambush 5      21.37   18.59    19.90   17.44   12.50
    ambush 6      64.23   62.93    62.70   61.34   48.50
    ambush 7      5.46    4.04     5.03    3.87    2.38
    bamboo 1      4.46    3.56     4.42    3.54    2.02
    bamboo 2      2.60    1.99     2.46    1.93    1.19
    bandage 1     8.70    7.35     8.16    6.98    5.09
    bandage 2     10.92   9.78     10.79   9.64    8.26
    cave 2        15.74   15.22    15.30   14.84   8.71
    cave 4        15.81   15.54    15.54   15.45   10.20
    market 2      10.00   8.93     10.03   9.11    6.21
    market 5      49.39   46.92    48.61   46.45   36.35
    market 6      23.47   21.51    22.11   20.07   12.31
    mountain 1    14.91   10.27    13.96   10.13   5.00
    shaman 2      4.36    1.96     4.05    1.95    0.76
    shaman 3      5.26    3.55     4.41    2.94    2.09
    sleeping 1    5.50    4.16     5.33    4.12    0.01
    sleeping 2    0.55    0.36     0.42    0.27    0.00
    temple 2      22.59   19.67    22.90   20.91   9.00
    temple 3      25.39   22.40    25.04   22.12   8.35

Table 5.4. Sintel optical flow 2px errors in percent. Similarly to the end point error, the DiscreteFlow method performs better than the proposed hierarchical approach. The 2px error reflects the error in the image domain, whereas the end point error measures the error in the flow domain.

CHAPTER 6

Structure from Motion

Assuming that point correspondences are known, we want to reconstruct the world that was recorded by the camera. Many methods select a set of keyframes and compute bundle adjustment on the frames between them; these are batch approaches. This separation into keyframes increases latency, since many frames need to be captured before bundle adjustment can be performed. An alternative approach is given by filtering, where the reconstruction of the current frame is computed directly, using prior information from previous frames. The filter remains fully responsive, since the reconstruction is computed on each input frame immediately. In the case of car driving assistance, fast response and dense reconstruction are vital requirements.
There are different mathematical models for the filtering problem, of which we present the Gauss-Newton and the minimum energy filters in the next section. Afterwards, we present a piecewise planar dense scene reconstruction method for two frames, which can form the basis of a filter for long sequences.

6.1. Monocular Reconstruction based on Filtering Techniques

We briefly present the Gauss-Newton filter, which is the basis for the approach presented in the next section. The state at frame k+1 depends on the new observation and the states at frames 1 to k, thus achieving a smooth transition between frames. For example, we may have a quadratic prior x^k for our variables x^{k+1}, which is added to a quadratic data-driven energy with observation z^{k+1},

    E(x^{k+1}) = (1/2) (x^{k+1} − x^k)^⊤ H_prior (x^{k+1} − x^k)
               + (1/2) (z^{k+1} − z(x^{k+1}))^⊤ H_data (z^{k+1} − z(x^{k+1})).

An application to monocular SLAM is presented in [81], where each frame is mapped into the same reference view.
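For a linear observation model z(x) = Zx (an assumption we make only for this illustration), one filter update reduces to a single regularized least-squares solve. A sketch (NumPy; the names are ours):

    import numpy as np

    def gauss_newton_filter_step(x_prev, z_obs, Z, H_prior, H_data):
        # Minimize E(x) = 0.5 (x - x_prev)^T H_prior (x - x_prev)
        #              + 0.5 (z_obs - Z x)^T H_data (z_obs - Z x)
        # by setting the gradient to zero; a nonlinear z(x) would be
        # relinearized around x_prev instead.
        A = H_prior + Z.T @ H_data @ Z
        b = H_prior @ x_prev + Z.T @ H_data @ z_obs
        return np.linalg.solve(A, b)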

6.1.1. Variational Structure from Motion. In [9], camera movement and dense scene depth are propagated and updated at each frame. The approach formulates inference in a probabilistic framework, where the state X^k = (d^k, C^k), consisting of depth and camera parameters, is expressed in terms of the observations Y^k as

    p(X^k | Y^{1:k}) ∝ p(Y^k | X^k) · p(X^k | X^{k−1}),

where the first factor is the observation likelihood and the second one the prior. The respective energy is given by a photometric unary energy and a quadratic regularity term,

    E(X^k) = −log p(X^k | Y^{1:k}) = E_u(d^k, C^k) + E_C(C^k) + E_d(d^k).

The probabilistic formulation provides the method with uncertainties, which improve the prediction and may also be of interest in the output. The uncertainty is defined via the second derivative of the corresponding energy,

    σ_x^{−2} = ∂²/∂x² ( −log p(x) ).

We explain the components only briefly here; more details are given in the paper. E_u(d^k, C^k) penalizes the quadratic difference to the observed optical flow,

    E_u(d^k, C^k) = (û^k − u(C^k, d^k))^⊤ (Σ_û^k)^{−1} (û^k − u(C^k, d^k)).

The variables û and Σ_û are determined by the Lucas and Kanade optical flow algorithm with Gaussian weight G_ρ,

    û = Σ_û ( G_ρ(x) ∗ ∇I(x) )^⊤ ∂_t I,
    Σ_û = ( ( G_ρ(x) ∗ ∇I(x) )^⊤ ∇I(x) )^{−1}.

E_C(C^k) is small if the camera motion is close to its predicted value,

    E_C(C^k) = (1/2) dist²_{SE(3)}( C^k, C^{k−1}, Σ_C^k ).

Σ_C^k denotes the current standard deviation on SE(3), which defines the distance between the extrinsic camera parameters on the special Euclidean manifold. The depth energy E_d(d^k) is given by

    E_d(d^k) = (1/2) Σ_{x∈Ω} [ (d^k(x) − d̂^k(x))² / σ_d^k(x) + (1/σ_s²) ‖∇d^k‖² ].

The predicted depth d̂^k is determined from the previous frame, together with the depth uncertainty σ_d^k; σ_s is a constant regularity parameter. Optimization is performed using a Newton method with backtracking line search. The method operates on a multiscale pyramid, as required by the Lucas and Kanade approach.

6.1.2. A Novel Minimum Energy Filter for Visual Odometry. In contrast to the approach presented above, which implements temporal smoothing with local prediction, the minimum energy filter provides a global optimality criterion for the entire sequence. The minimum energy filter can be computed recursively, using the reformulation as an optimal control problem presented by Mortensen [64].
In our approach [11], we describe a minimum energy filter on the special Euclidean group SE(3). The camera movement is modeled as a continuous curve E(t) ∈ SE(3), assuming zero acceleration,

    Ė(t) = E(t) V(t),
    V̇(t) = 0.                                                          (6.1)

Our state variable x(t) is given by the pair (E,V ). Due to the Lie group structure of SE(3), the derivatives are subject to the respective tangent spaces, and we can rewrite the equation system (6.1),

    ẋ(t) = f(x) = ( E(t) V(t), 0 ).

More details are given in [11]. The zero acceleration assumption does not explain the camera poses perfectly, due to acceleration and noise in the process; hence we add a noise variable δ(t) to our state equation. We cannot observe the state variable directly, but only through the estimated optical flow û, which is expected to be close to the parameterized optical flow field as given by eq. (6.3). Again, both flow fields will not match perfectly in general, and we require a noise variable ε(t). The state and observation system is given by

    (state)        ẋ(t) = x(t)( f(x(t)) + δ(t) ),   x(0) = x₀,
    (observation)  û(t) = u(x(t), d(t)) + ε(t).

Note that the depth d(t) is not part of the variables, but assumed to be known or estimated independently. We introduce the energy function J(δ, ε, t₀, t), in order to penalize the noise variables δ(t) and ε(t) as well as the deviation from the initial state x₀ = (I₄, 0),

    J(δ, ε, t₀, t) = (1/2) ( e^{−α(t−t₀)} ‖x(t₀) − x₀‖² + ∫_{t₀}^{t} c(δ, ε, τ, t) dτ ),
    c(δ, ε, τ, t) = ‖δ(τ)‖²_S + ‖ε(τ)‖²_Q.

S ∈ ℝ^{12×12} and Q ∈ ℝ^{2×2} are arbitrary positive definite weighting matrices. This objective forms the basis of the minimum energy filter, which is reformulated into the recursive framework in [11]. The approach also extends to constant acceleration rather than constant velocity, and to higher order models.

6.2. Depth and Normal Regularization

We model the scenes as piecewise planar, which is especially well suited for urban environments. A plane in 3D can be described by the plane equation

    n^⊤ X = q,   with n ∈ ℝ³, q ∈ ℝ.

Capital letters represent 3D scene points, lowercase letters denote homogeneous points that are represented on the image plane at z = 1. Assuming that q ≠ 0, we can divide this equation by q and, writing v = n/q, we get

    v^⊤ X = 1.

With q = 0, we would have a degenerate plane that intersects the camera center; hence the assumption q ≠ 0 does not restrict our model. There is an affine relation between the (inverse) scene depth z and the plane parameters,

    X = x/z   ⟹   1 = v^⊤X = (v^⊤x)/z   ⟹   z = v^⊤x.

We can also see here that the first two components of v are the affine parameters of the inverse depth in the image plane when viewing the same plane. The absolute depth of a plane is not affine, which makes it more natural to represent depth by its inverse, as shown in Figure 6.1.
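As a tiny numerical illustration (our own example; NumPy): for a ground plane n = (0, 1, 0)^⊤ at q = 1.5, the inverse depth of any pixel on that plane is the affine function z = v^⊤x:

    import numpy as np

    n, q = np.array([0.0, 1.0, 0.0]), 1.5   # plane Y = 1.5, camera at the origin
    v = n / q                               # plane parameters, v^T X = 1

    x = np.array([0.2, 0.5, 1.0])           # homogeneous pixel on the z = 1 plane
    z = v @ x                               # inverse depth, affine in x
    X = x / z                               # back-projected scene point
    assert np.isclose(n @ X, q)             # X indeed lies on the plane
    print(z, 1.0 / z)                       # inverse depth and metric depth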

[Figure 6.1 panels: (a) reference frame, (b) depth map, (c) depth d(x), (d) inverse depth z(x) = 1/d(x).]

Figure 6.1. Simple rendered scene consisting of a few planes. The inverse depth parameterization z is affine on planar scene segments, while the absolute depth d induces a curved relation.

6.2.1. Piecewise Planar Depth Map Smoothing. We assume that we are given an estimated depth map z̃, and compare different piecewise planar regularity energies, in order to recover the original scene. The data energy is given by the quadratic depth difference,

    E_data = (1/2) ‖z − z̃‖².

Total generalized variation (TGV) is known to preserve piecewise affine structures, which in our case corresponds to a piecewise planar scene. The TGV model is given by

    TGV(z, w) = α₁ ‖∇z − w‖₁ + α₀ ‖∇w‖₁.

We rewrite the energy as a saddle point problem,

    E(z, w, p, q) = (1/2) ‖z − z̃‖² + p^⊤ ( A  −I ) (z, w)^⊤ + q^⊤ C w
                  − δ_{α₁[−1,1]^{|p|}}(p) − δ_{α₀[−1,1]^{|q|}}(q).

The matrix A denotes the gradient operator, with the x and y gradients stacked. C is the second order gradient; it therefore contains the mixed gradients as well. We can then apply the primal-dual algorithm to this energy, as shown in Algorithm 12.

Input : z̃, parameters α₁, α₀, τ₁, τ₂, τ₃, τ₄
Output: (inverse) depth z
repeat
    p^{k+1} = Π_{α₁[−1,1]^{|p|}}( p^k + τ₁ (A −I)(z̄, w̄)^⊤ );
    q^{k+1} = Π_{α₀[−1,1]^{|q|}}( q^k + τ₂ C w̄ );
    z^{k+1} = Prox_{τ₃ E_data}( z^k − τ₃ A^⊤ p^{k+1} );
    w^{k+1} = w^k − τ₄ ( C^⊤ q^{k+1} − p^{k+1} );
    z̄ = 2 z^{k+1} − z^k;
    w̄ = 2 w^{k+1} − w^k;
until convergence;
Algorithm 12: Depth map smoothing with TGV regularization.

The projection onto the set α[−1, 1]^m can be performed element-wise by clipping to the interval [−α, α]. The proximal operator of the quadratic data energy is a weighted average,

    Prox_{τ E_data}(y) = (y + τ z̃) / (1 + τ).

Figure 6.2 shows an example output of the TGV-regularized depth reconstruction with the ground truth depth as reference. We can see that depth discontinuities are not preserved, but interpolated with planar surfaces, leading to blur in the depth map.
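A one-dimensional sketch of Algorithm 12 (NumPy; our own simplification with forward differences for A and C and equal step sizes) makes the update structure explicit:

    import numpy as np

    def tgv_smooth_1d(z_tilde, a1=2.0, a0=4.0, tau=0.1, iters=2000):
        n = len(z_tilde)
        z, w = z_tilde.copy(), np.zeros(n)
        zb, wb = z.copy(), w.copy()
        p, q = np.zeros(n - 1), np.zeros(n - 1)
        D = lambda f: np.diff(f)                   # forward difference
        def DT(g):                                 # its adjoint
            out = np.zeros(n); out[1:] += g; out[:-1] -= g; return out
        for _ in range(iters):
            # dual ascent with element-wise clipping (the projections)
            p = np.clip(p + tau * (D(zb) - wb[:-1]), -a1, a1)
            q = np.clip(q + tau * D(wb), -a0, a0)
            # primal descent; prox of the quadratic data term is an average
            z_new = (z - tau * DT(p) + tau * z_tilde) / (1 + tau)
            w_new = w - tau * DT(q)
            w_new[:-1] += tau * p                  # coupling term
            zb, wb = 2 * z_new - z, 2 * w_new - w  # extrapolation
            z, w = z_new, w_new
        return z

    # piecewise affine signal plus noise: TGV recovers the ramps
    t = np.linspace(0, 1, 200)
    clean = np.where(t < 0.5, 2 * t, 1.5 - t)
    smooth = tgv_smooth_1d(clean + 0.05 * np.random.randn(200))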

[Figure 6.2 panels: (a) normal color encoding (ceiling/left/rear/right/floor), (b) scalar colormaps (depth min–max, disparity error 0–3 px), (c) rendered frame, (d) depth, (e) disparity error, (f) scene normals, (g) x gradient (w_x), (h) y gradient (w_y).]

Figure 6.2. TGV reconstruction with the corresponding color encoding. Hinges are reconstructed accurately, but depth discontinuities get interpolated in the reconstruction. The spherical normal color encoding is based on the LAB color space: floor and ceiling are shown in dark and light gray, and the colors have their strongest intensities on the equator. The gradients in x and y direction are defined on shifted domains; they would need to be mapped to the pixel grid in order to generate a pixel-wise planar representation.

In order to encourage sharp depth discontinuities, we introduce variables j_x and j_y, representing depth jumps along the x and y components. We assume depth discontinuities to be sparse, thus we put an L1 penalty on j = (j_x, j_y). The total energy reads

    E(z, w, j) = (1/2) ‖z − z̃‖² + α₁ ‖∇z − w − j‖₁ + α₀ ‖∇w‖₁ + β ‖j‖₁.

The algorithm is very similar to the original TGV formulation. We do not require additional dual variables; the proximal operator of the L1 norm is given by the shrinkage operator,

    Prox_{τβ|·|}(x) = argmin_y (1/(2τ)) (y − x)² + β|y|
                    = { x − τβ   if x > τβ,
                      { x + τβ   if x < −τβ,
                      { 0        if |x| ≤ τβ.

For vector-valued input, the shrinkage operation is performed element-wise; the full method is shown in Algorithm 13.

Input : z̃, parameters α₁, α₀, β, τ₁, τ₂, τ₃, τ₄, τ₅
Output: (inverse) depth z
repeat
    p^{k+1} = Π_{α₁[−1,1]^{|p|}}( p^k + τ₁ (A −I −I)(z̄, w̄, j̄)^⊤ );
    q^{k+1} = Π_{α₀[−1,1]^{|q|}}( q^k + τ₂ C w̄ );
    z^{k+1} = Prox_{τ₃ E_data}( z^k − τ₃ A^⊤ p^{k+1} );
    w^{k+1} = w^k − τ₄ ( C^⊤ q^{k+1} − p^{k+1} );
    j^{k+1} = Prox_{τ₅ β‖·‖₁}( j^k + τ₅ p^{k+1} );
    z̄ = 2 z^{k+1} − z^k;
    w̄ = 2 w^{k+1} − w^k;
    j̄ = 2 j^{k+1} − j^k;
until convergence;
Algorithm 13: Depth map smoothing with extended TGV regularization. The new variable j estimates depth discontinuities.
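The shrinkage step is one line in code (NumPy; our own helper):

    import numpy as np

    def shrink(x, thresh):
        # Soft-thresholding: prox of thresh * |.|, applied element-wise.
        return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

    # e.g. the j-update of Algorithm 13: j = shrink(j + tau5 * p, tau5 * beta)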

Figure 6.3 shows the output on the test frame. The depth is reconstructed with higher accuracy, and we can observe sharp depth discontinuities. Also, the artifacts in the gradients are not as strong as in the original TGV model. Depth discontinuities are represented well by the new variables j_x and j_y.

6.2.2. Scene Estimation from Dense Optical Flow. Given an optical flow field, we are interested in the corresponding camera motion and scene parameters. A similar method is presented in [66]. Synthesizing optical flow from egomotion and scene depth relies on a static scene; dynamic objects would need to be modeled explicitly. The translation vector t is restricted to the unit sphere, since the scene scale cannot be estimated from a monocular video alone, hence a normalization needs to be imposed. A scene point X can be transformed into the coordinate system of the second camera,

    X′ = R^⊤ (X − t).

An illustration of the camera setup is shown in Figure 6.4. We can determine x′ by projection onto the image plane,

    x′ = π(X′),   π(x₁, x₂, x₃) = ( x₁/x₃, x₂/x₃, 1 ).

Since the projection π is invariant under scaling, we can express x′ as a function of R, t and the inverse depth z at x,

    x′ = π( R^⊤ (x − t z) ).

[Figure 6.3 panels: (a) rendered frame, (b) depth, (c) disparity error, (d) scene normals, (e) x gradient (w_x), (f) y gradient (w_y), (g) scene objects, (h) x jumps (j_x), (i) y jumps (j_y).]

Figure 6.3. Extended TGV reconstruction. Depth discontinuities are reflected well by the new variable j. The overall error in the depth reconstruction and the artifacts in w decreased as well.


Figure 6.4. If relative camera poses and scene depth are known, the induced optical flow can be calculated. In three dimensions, viewing rays do not have to intersect, but can be skew.

If we also have plane parameters v, z is given by v⊤x, and we can define a homography mapping H(R,t,v),

    x′ = π( H(R, t, v) x ),   H(R, t, v) = R^⊤ (I − t v^⊤).            (6.2)

The induced optical flow is given by the difference between x′ and x,

    u(R, t, v) = π( H(R, t, v) x ) − x.                                (6.3)
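A direct transcription of (6.2) and (6.3) (NumPy; our own helper, assuming calibrated homogeneous coordinates):

    import numpy as np

    def induced_flow(R, t, v, x):
        # x: homogeneous image point (x1, x2, 1) in the reference view.
        H = R.T @ (np.eye(3) - np.outer(t, v))  # H(R,t,v) = R^T (I - t v^T)
        Hx = H @ x
        x_prime = Hx / Hx[2]                    # projection pi, eq. (6.2)
        return x_prime - x                      # induced flow, eq. (6.3)

    # forward motion along the optical axis over a ground plane:
    R, t = np.eye(3), np.array([0.0, 0.0, 1.0])
    v = np.array([0.0, 2.0 / 3.0, 0.0])
    print(induced_flow(R, t, v, np.array([0.2, 0.5, 1.0])))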

6.2.3. Plane Estimation on Superpixels. We fit scene planes to superpixels rather than individual pixels, since a planar description at each pixel would yield an overparametrization. Merging pixels into superpixels reduces the problem size. At the same time, the superpixel segmentation provides scene regularization, since it defines the granularity of the planar reconstruction.

[Figure 6.5 panels: (a) superpixel discretization, (b) plane segment.]

Figure 6.5. Superpixel discretization of our approach. Pixels are marked in red, and provide a flow vector, which is used by the data energy. Blue points on the superpixel boundary are used for regularization; we assume a small deviation in depth between the two neighboring planes at these locations.

The superpixels are denoted by Ω_i, and they form a partition of the image domain Ω,

    Ω_i ∩ Ω_j = ∅   ∀ i, j ∈ I_Ω,      ∪_i Ω_i = Ω.

∂Ω_i denotes the boundary of superpixel Ω_i. The shared boundary of superpixels Ω_i and Ω_j is denoted ∂^{ij},

    ∂^{ij} = ∂Ω_i ∩ ∂Ω_j.

Figure 6.5 illustrates the locations of the boundary points on the pixel grid. The pixels are represented by red points; the blue boundary points are shifted by half a pixel. We define and minimize an energy function over the scene and the vehicle egomotion, which compares measured and parameterized optical flow and imposes scene regularity,

    E(R, t, v) = E_u(R, t, v) + λ_z E_z(v) + λ_v E_v(v) + λ_p E_p(v).   (6.4)

The fidelity term E_u(R, t, v) in our optimization problem is the deviation of the observed optical flow û(x) from our model (6.3), and is

[Figure 6.6: plots of |x|, ρ_{1/2}(x), ρ_{1/4}(x)², and their second derivatives, over x ∈ [−5, 5] and zoomed to x ∈ [−1, 1].]

Figure 6.6. The L1 norm, the Charbonnier norm with α = 1/2, and the squared Charbonnier with α = 1/4; in both cases ε = 0.01. Both versions of the Charbonnier function are smooth at x = 0 and approximate the L1 norm. The squared Charbonnier can be minimized with the Levenberg-Marquardt method, which requires a quadratic energy. The second derivative of the squared Charbonnier vanishes at zero, which improves numerical stability.

defined as

    E_u(R, t, v) := Σ_{i=1}^{n} Σ_{x∈Ω_i} w_û(x) ‖ u(x; R, t, v_i) − û(x) ‖²₂.

Here, w_û(x) ≥ 0 denotes a spatially varying weighting of the data term, which is provided by a confidence measure of the optical flow algorithm, as detailed next. The optical flow û between images I₁ and I₂, as required by the data term, is computed in a preprocessing step using the DeepFlow algorithm [93], because the method is accurate and also very fast.
We complement the output obtained from DeepFlow with a confidence map w_û(x), which suppresses the influence of flow vectors that are considered incorrect. To this end, we also estimate the backward flow between I₂ and I₁, providing an estimate û⁻¹(x) of the inverse mapping of û(x). Only points that are consistently mapped forth and back are considered correct, and we define the confidence map as

    w_û(x) := exp( −(1/2) ‖ x − (û⁻¹ ∘ û)(x) ‖²₂ / σ_û² )

with σ_û > 0. Experimentally, we found the value σ_û = 1/(2√2) to be suitable.
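A sketch of this forward-backward consistency check (NumPy; our own discretization, with the backward flow looked up at the rounded target pixel):

    import numpy as np

    def confidence_map(flow_fw, flow_bw, sigma=1.0 / (2.0 * np.sqrt(2.0))):
        # flow_fw, flow_bw: (H, W, 2) forward and backward flow fields.
        H, W, _ = flow_fw.shape
        ys, xs = np.mgrid[0:H, 0:W]
        # where does each pixel land under the forward flow?
        xt = np.clip(np.rint(xs + flow_fw[..., 0]).astype(int), 0, W - 1)
        yt = np.clip(np.rint(ys + flow_fw[..., 1]).astype(int), 0, H - 1)
        # follow the backward flow from the target; a consistent match
        # returns to the start, i.e. flow_bw(target) = -flow_fw(source).
        back = flow_bw[yt, xt]
        err2 = ((flow_fw[..., 0] + back[..., 0])**2
                + (flow_fw[..., 1] + back[..., 1])**2)
        return np.exp(-0.5 * err2 / sigma**2)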

In order to enforce that the planes of neighboring superpixels form a seamlessly connected surface in most parts of the image, we introduce the prior E_z(v) as follows. We consider points x_∂ ∈ ∂^{ij} on the common boundary of superpixels i and j, and penalize deviations of their inverse depth z(x_∂, v) = x_∂^⊤ v according to the two plane models v_i and v_j; Figure 6.5 illustrates the boundary points on the pixel grid. In order to encourage sharp depth edges, we make use of the generalized Charbonnier functional

    ρ_α(x) := (x² + ε)^α − ε^α.                                        (6.5)

Figure 6.6 shows a comparison of the Charbonnier functional to the original L1 norm. We choose ε = 10⁻¹⁰ and α = 1/4, so that ρ_α(x)² smoothly approximates the L1 norm. The energy function for one boundary ∂^{ij} then reads

    E_z^{ij}(v) := Σ_{x∈∂^{ij}} ρ_α( x^⊤ v_i − x^⊤ v_j )².

The global smoothness term consists of a weighted sum of E_z^{ij} over all neighboring superpixels (i, j) ∈ N_Ω:

    E_z(v) := Σ_{(i,j)∈N_Ω} E_z^{ij}(v).

In addition to seamless surfaces at superpixel boundaries, we aim at plane parameters which, up to a small set of discontinuities, are constant over the image domain. This property encourages large connected planar structures. For the plane smoothness prior we employ again the Charbonnier function ρ_α (see eq. (6.5), here applied component-wise),

    E_v(v) = Σ_{(i,j)∈N_Ω} ‖ ρ_α(v_i − v_j) ‖²₂.

As a further constraint, we require all observed space points to lie in front of the camera. Thus, we introduce an additional prior E_p, where we apply the soft hinge function

    ρ₊(x) := { 1 − 2x     if x ≤ 0,
             { (1 − x)²   if 0 < x ≤ 1,
             { 0          if x > 1,

to the inverse depth z(x_c^i, v_i), evaluated at the superpixel centers x_c^i ∈ Ω_i. The center point is defined as the mean of all pixels in Ω_i,

    x_c^i = (1/|Ω_i|) Σ_{x∈Ω_i} x.

Summing over all superpixels, this leads to

    E_p(v) = Σ_{i=1}^{n} ρ₊( z(x_c^i, v_i) )².

6.2.4. Optimization Framework. The considered optimization task comprises a sum of a non-convex data energy and convex regularity terms (6.4), where the manifold constraints R ∈ SO(3) and t ∈ S² need to be respected. In order to find a local minimum of E(R, t, v), we choose the Levenberg-Marquardt method [63], which has been adapted to Riemannian manifolds in [1].
The proposed energy function E(R, t, v) can be decomposed into a sum of m squared functions f_j(R, t, v),

    E(R, t, v) = Σ_{j=1}^{m} ( f_j(R, t, v) )² = ‖ f(R, t, v) ‖²₂,      (6.6)

with f(R, t, v) := (f₁(R, t, v), ..., f_m(R, t, v))^⊤ ∈ ℝ^m and m = 2|Ω| + Σ_{(i,j)∈N_Ω} |∂^{ij}| + 3|N_Ω| + n.
We combine the variables into a vector Y := (R, t, v) and locally re-parametrize Y near (R^k, t^k, v^k) by parameters η := (ω, δt, δv)^⊤ ∈ ℝ^{3+3+3n} as

    Y(η) := ( R^k Exp([ω]_×), Π_{S²}(t^k + δt), v^k + δv ).

Exp(·) denotes the matrix exponential function, which is applied to the skew-symmetric matrix

    [ω]_× = (  0    −ω₃    ω₂
               ω₃    0    −ω₁
              −ω₂    ω₁    0  ).

A closed-form expression for the matrix exponential is given by Rodrigues' rotation formula [36]. Furthermore,

    Π_{S²}(t) := t / ‖t‖₂

denotes the orthogonal projection of t onto S². Using a first order Taylor expansion, we obtain an approximation of f(Y(η)) at Y = (R^k, t^k, v^k),

    f̃^k(η) := f^k(0) + J(f^k) η,

with the Jacobian J(f^k) of f^k. For the rotation and translation, the Jacobian is obtained by differentiating the function compositions ∂/∂ω (f ∘ Exp)(ω) and ∂/∂t (f ∘ Π_{S²})(t), respectively. Substituting this approximation in (6.6) yields a model of the original energy function,

    E_m^k(η) = ‖ f̃^k(η) ‖²₂,

which is minimized over η. We add a soft trust region term (μ^k/2) ‖η‖², in order to control the step size of the update; the resulting objective is quadratic in η and can thus be solved efficiently. The update of the state variables Y^{k+1} and the trust region parameter μ^{k+1} is presented in algorithm 14. The update is based on the quotient ρ of the change in the original and model energies. If ρ is close to one, the model is accurate, and μ can be decreased. If ρ is close to zero, the model is inaccurate, and the trust region needs to be reduced. For negative ρ the energy has increased, and we omit the update.

Input : current state Y^k, update vector η^k, parameter μ^k
Output: Y^{k+1}, μ^{k+1}
ρ ← ( E(Y^k) − E(Y^k(η)) ) / ( E_m(0) − E_m(η) );
if ρ < 1/4 then
    μ^{k+1} = 4 μ^k;
else if ρ > 3/4 then
    μ^{k+1} = (1/2) μ^k;
end
if ρ > 0 then
    Y^{k+1} = Y^k(η);
else
    Y^{k+1} = Y^k;
end
Algorithm 14: Update rule for the variables Y^{k+1} and the trust region μ^{k+1}.
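In code, one Levenberg-Marquardt iteration with this update rule might look as follows (NumPy; `residual`, `jacobian` and `retract` stand for problem-specific callbacks, our own abstraction of f, J(f) and Y(η)):

    import numpy as np

    def lm_step(Y, mu, residual, jacobian, retract):
        F, J = residual(Y), jacobian(Y)
        eta = np.linalg.solve(J.T @ J + mu * np.eye(J.shape[1]), -J.T @ F)
        # actual vs. model decrease, as in Algorithm 14
        E_old = F @ F
        F_new = residual(retract(Y, eta))
        model_new = F + J @ eta
        rho = (E_old - F_new @ F_new) / (E_old - model_new @ model_new)
        if rho < 0.25:
            mu *= 4.0            # model inaccurate: shrink the trust region
        elif rho > 0.75:
            mu *= 0.5            # model accurate: allow larger steps
        return (retract(Y, eta) if rho > 0 else Y), mu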

A limit of 80 iterations was used as the stopping criterion, which was sufficient for most of the considered data. The complete method is summarized in algorithm 15. We describe the construction of the Jacobian in detail in the following.

Input : reference frame I₁, optical flow û, reverse flow û⁻¹
Output: scene planes {v_i}, normalized egomotion R ∈ SO(3), t ∈ S²
Compute superpixels {Ω_i} using SLIC and the weights w_û;
Initialize R⁰ ← I₃, t⁰ ← (0 0 1)^⊤, v_i⁰ ← (0 0 0.001)^⊤ ∀ i ∈ I_Ω;
μ⁰ ← 40;
while k ≤ 80 do
    Determine F = f^k and J = J(f^k);
    η ← (J^⊤ J + μ^k I_m)^{−1} ( −J^⊤ F );
    Determine Y^{k+1} and μ^{k+1} using algorithm 14;
    k ← k + 1;
end
Algorithm 15: Our method for monocular two-frame piecewise planar scene reconstruction.

Calculating the derivative of the pixel correspondence in eq. (6.2) is challenging in two ways. Firstly, the projection π is not convex, not differentiable everywhere, and not bounded. Secondly, the variables R ∈ SO(3) and t ∈ S² are subject to manifold constraints, so that the respective differential geometry needs to be considered.

[Figure 6.7 panels: (a) the optical flow energy as a function of the inverse depth, (b) the pole location.]

Figure 6.7. Optical flow energy for fixed R and t. While the function is not well behaved on large scale, it is convex in the region of interest. The pole occurs at infinite optical flow, when the viewing ray of the second camera is parallel to its image plane.

Figure 6.7 shows the flow energy (1/2) ‖u(R, t, z) − û‖² for a single pixel as a function of z. There is a unique minimum at the true inverse depth, and a pole at which u(R, t, z) = ∞. Since we consider the inverse depth here, at z = 0 the actual depth passes continuously from +∞ to −∞, and as z → ±∞ we approach the camera center.
Inserting the definition of u(R, t, v) into the data energy, we get

    E_flow = Σ_{i∈I_Ω} Σ_{x∈Ω_i} (w_x/2) ‖ π(H(R, t, v_i) x) − x̂′ ‖²,       (6.7)

where x̂′ = x + û(x) denotes the observed point correspondence. By applying the rule vec(ABC) = (C^⊤ ⊗ A) vec(B), we can vectorize the homography matrix,

    E_flow = Σ_{i∈I_Ω} Σ_{x∈Ω_i} (w_x/2) ‖ π( (x^⊤ ⊗ I₃) H⃗_i ) − x̂′ ‖².     (6.8)

We can construct the derivative in terms of the vectorized homography,

    ∂E_flow/∂H⃗_i = Σ_{x∈Ω_i} w_x ( π(H_i x) − x̂′ )^⊤ (∂π/∂p) (x^⊤ ⊗ I₃).    (6.9)

The derivative of the projection π at p = (p_x, p_y, p_z)^⊤ = H_i x is given by

    ∂π/∂p = ( 1/p_z    0       −p_x/p_z²
              0        1/p_z   −p_y/p_z²
              0        0        0        ).                             (6.10)

For the inner derivative of the homography matrix, we can vectorize the homography matrix in different ways, writing R′ for R^⊤,

    H_i = R′ (I₃ − t v_i^⊤) = R′ − R′ t v_i^⊤,

    H_i = I₃ R′ (I₃ − t v_i^⊤)   ⟹   ∂H⃗_i/∂R⃗′ = (I₃ − v_i t^⊤) ⊗ I₃,

    H_i = R′ − R′ t v_i^⊤        ⟹   ∂H⃗_i/∂t = −(v_i ⊗ R′),

    H_i = R′ − (R′ t) v_i^⊤ I₃   ⟹   ∂H⃗_i/∂v_i = −(I₃ ⊗ R′ t).

We need to extend the derivative further, in order to respect the manifold constraints on R′ and t. The tangent space of R′ is given by R′[ω]_×; thus we get the derivative

    ∂R⃗′/∂ω = vec( R′ [ω]_× ).

We project the gradient onto the sphere's tangent space,

    ∂t/∂t₀ = I₃ − t t^⊤.

The total derivative is given by applying the respective inner derivative to eq. (6.9).
The regularity terms do not depend on the egomotion parameters, but only on the plane parameters v_i. Therefore, we do not need to consider manifold constraints, and can compute the Jacobian using the Euclidean derivatives. The derivative of the depth smoothness term is given by

    ∂/∂v_i E_z^{ij} = Σ_{x∈∂^{ij}} ∂/∂v_i ρ_{0.25}( x^⊤(v_i − v_j) )²
                    = Σ_{x∈∂^{ij}} ( (s² + ε)^{0.25} − ε^{0.25} ) (s² + ε)^{−0.75} s x,   s = x^⊤(v_i − v_j),

    ∂/∂v_j E_z^{ij} = −∂/∂v_i E_z^{ij}.

The derivative of the plane smoothness term is very similar, since v_i − v_j = I(v_i − v_j), which corresponds to the depth smoothness case with x ∈ {e₁, e₂, e₃}. The positive depth energy consists of linear and quadratic pieces, which can be differentiated directly.
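Derivatives of this kind are easy to get wrong, so a finite-difference check is worthwhile. A sketch for the projection Jacobian (6.10) (NumPy; our own test):

    import numpy as np

    def pi(p):
        return np.array([p[0] / p[2], p[1] / p[2], 1.0])

    def dpi(p):
        # Analytic Jacobian of the projection, eq. (6.10).
        return np.array([[1/p[2], 0.0, -p[0]/p[2]**2],
                         [0.0, 1/p[2], -p[1]/p[2]**2],
                         [0.0, 0.0, 0.0]])

    p, eps = np.array([0.3, -0.2, 2.0]), 1e-6
    num = np.column_stack([(pi(p + eps*e) - pi(p - eps*e)) / (2*eps)
                           for e in np.eye(3)])
    assert np.allclose(num, dpi(p), atol=1e-6)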

[Figure 6.8: reprojection error maps, error colormap from 0 to max.]

Figure 6.8. Reprojection error between z ≡ 0.01 (depth 100) and z_gt ≡ 0 (infinite depth); the reference camera is shaded in gray. In the stereo case, the optical flow (or disparity) difference is constant, but in the monocular case we get a nonlinear optical flow deviation, with zero error at the epipole. The epipole has zero error independent of the estimated depth. In stereo estimation, the epipole lies at infinity and does not influence the error measure.

6.2.5. Evaluation on the KITTI Dataset. Before we evaluate our approach, we discuss our evaluation strategy. In stereo evaluation, the error is defined as the difference between ground truth and estimated disparities. This error measure respects the stereo camera setup well, since the disparity is directly observed by the camera system.

Similar to the disparity comparison in stereo evaluation, we compare the difference in optical flow as given by ground truth and estimated inverse depth. In contrast to stereo, the camera setup is unknown, and hence the estimated camera displacement needs to be used for evaluation. Also, the scene scale is unknown and needs to be estimated, which we implement by computing the median scale between the estimated and ground truth inverse depth maps.

Figure 6.8 shows the reprojection error between two constant depth maps for different egomotion parameters. While the error is constant in the stereo case, the epipole lies within the image for a forward camera motion, and is clearly visible. The induced flow at the epipole is independent of the depth value, hence the error is zero between any two depth maps. The authors of [9] propose a similar depth evaluation scheme, defining an uncertainty measure on the estimated depth which relates to the induced optical flow field.

The KITTI stereo/flow dataset [32] contains traffic scenes with ground truth depth. The ground truth depth is sparse and has been acquired with a Velodyne laser scanner. Due to the rotation frequency of the laser scanner, the camera frame rate is rather low at 10 frames per second, which implies large displacements between successive frames.
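To make the evaluation protocol concrete, the following sketch performs the median-scale alignment and the reprojective error computation described above. flow_from_depth is a placeholder for a routine that renders the optical flow induced by an inverse depth map under egomotion (R, t); it is an assumption of this sketch, not part of the thesis code.

```python
import numpy as np

def median_scale_align(z_est, z_gt, valid):
    """Resolve the monocular scale ambiguity by median-scaling inverse depth."""
    s = np.median(z_gt[valid] / z_est[valid])
    return s * z_est

def reprojective_epe(flow_from_depth, z_est, z_gt, R, t, valid):
    """End point error between the flow fields induced by the estimated and
    ground truth inverse depth under egomotion (R, t).

    `flow_from_depth(z, R, t)` is assumed to return an (H, W, 2) flow field.
    Returns the mean EPE and the fraction of pixels with error above 2 px.
    """
    z_est = median_scale_align(z_est, z_gt, valid)
    du = flow_from_depth(z_est, R, t) - flow_from_depth(z_gt, R, t)
    epe = np.linalg.norm(du, axis=2)
    return epe[valid].mean(), (epe[valid] > 2.0).mean() * 100
```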

[Figure: depth colormap 0 to 80; sequences 4, 9, 23, 32, 68, 74, 116, 157]

Figure 6.9. Rendered KITTI reconstructions found by our monocular approach. The scene is not normalized to metric units, but scaled according to $\|t\| = 1$. The reconstructed depth depends on the vehicle speed: the scene will appear closer at higher speed. The color encoding is shown on top. In the last two cases (116 and 157), the static scene assumption is violated. Therefore, the car coming from the right in seq. 116, and the truck approaching the camera in seq. 157, are not reconstructed correctly.

As we can see in Figure 6.9, our reconstruction algorithm captures the main layout of the respective scenes. The ground surface is also reconstructed well in most cases, and appears as a closed surface. However, our regularity term also introduces artifacts: objects tend to be merged with the background at their edges, while sharp depth discontinuities would be preferable. In sequence 74, the estimated optical flow field is incorrect in large regions due to illumination changes and reflections. Our method can compensate small errors in the optical flow field, but it would require separate input data to compensate large errors.

Sequences 116 and 157 show dynamic scenes; the car to the right in sequence 116 and the truck in sequence 157 are moving forward. In case of the truck in sequence 157, its motion happens to align with the camera egomotion, thus our method is tricked by an optical illusion: the truck appears to the observer as a very close and very small object. The large scene scale in sequence 157 is due to the low vehicle speed in this example.

Figures 6.10, 6.11, 6.12 and 6.13 show our test frames in more detail, with a direct comparison to the sps-stereo method. Both methods reconstruct most test images well; in particular, the ground surface is oriented correctly in most locations.

We can observe that our approach lacks reconstruction accuracy in the image center, which is approximately the location of the epipole. Especially in sequences 4 and 32, we see that the trees are blurred into the background, while other trees at the sides are reconstructed with better detail. The stereo method does not have this disadvantage in camera setup, and provides constant reconstruction accuracy. In addition to the better suited camera setup, the stereo method has calibrated cameras, and does not need to estimate camera poses.

Our approach favors smooth transitions between different regions, while the stereo approach produces sharper discontinuities. In some cases, we oversmooth the scene and merge foreground and background, but sps-stereo detects too many discontinuities in some cases, and produces noisy estimates of homogeneous regions. The flow weights mostly vanish in occluded regions, as we would expect, since these cannot be estimated accurately from the images alone. The weights of sequence 74 show that the flow field is very inaccurate, which implies an inaccurate scene reconstruction as well. Homogeneous regions, for example the sky, usually have a small weight as well, since they cannot be matched reliably.

Table 6.1 shows depth reconstruction errors on the KITTI dataset, in comparison with sps-stereo. Our approach copes well with occluded parts of the image, but sps-stereo shows higher accuracy in depth reconstruction. The calibrated side-by-side camera setup is better suited for depth estimation.

[Figure: colormaps depth [m] 0 to 50, weights 0 to 1; KITTI stereo sequences 4 and 9; panels: reference frame, optical flow, flow weights, planes; methods: sps-stereo, ours]

Figure 6.10. Results on example frames from KITTI sequences 4 and 9. In sequence 4, the tree in the image center is almost completely blurred, since it lies in the region of the epipole. In the other example, the region around the epipole does not contain any particular object, but shows the horizon, and can be interpolated from the neighborhood without much loss in accuracy. The flow weights represent the occlusion pattern well in both cases. Our approach favors objects, especially trees, standing out of the scene as sharp objects, while sps-stereo favors noisier estimates in the plane reconstructions, with sharper discontinuities in depth.

[Figure: colormaps depth [m] 0 to 50, weights 0 to 1; KITTI stereo sequences 23 and 32; panels: reference frame, optical flow, flow weights, planes; methods: sps-stereo, ours]

Figure 6.11. Results on example frames from KITTI sequences 23 and 32. Sequence 23 shows that our method merges the garages into a uniform planar surface, while the sps-stereo reconstruction is noisier. In sequence 32, we can again observe blurriness at the epipole: the tree in front of the car is unclear, whereas the parked car on the left is reconstructed in detail. Especially the normals provide a good representation of the car, with sharp edges inside the object. The optical flow field is not accurate in the underexposed right part of the image, therefore our reconstruction lacks detail in the bottom right corner. The ground surface is reconstructed well by both methods in both examples.

[Figure: colormaps depth [m] 0 to 50, weights 0 to 1; KITTI stereo sequences 68 and 74; panels: reference frame, optical flow, flow weights, planes; methods: sps-stereo, ours]

Figure 6.12. Results on example frames from KITTI sequences 68 and 74. In the first example, the large building is reconstructed well, but both methods show errors in depth reconstruction due to reflections in the windows. The example frame from sequence 74 has strong changes in lighting conditions and optical lens effects in the monocular, but not in the stereo case. The computed optical flow field has large errors in large parts of the image, as shown by the weight mask. Our method detects a few parts of the scene, but does not reflect the surrounding environment well. In contrast, the stereo approach shows similar accuracy as on the previous examples.

[Figure: colormaps depth [m] 0 to 50, weights 0 to 1; KITTI stereo sequences 116 and 157; panels: reference frame, optical flow, flow weights, planes; methods: sps-stereo, ours]

Figure 6.13. Results on example frames from KITTI sequences 116 and 157. In sequence 116, the motion of the dynamic car does not align with the epipolar lines, and we get a wrong reconstruction. Sequence 157, however, shows an inherent limitation of the monocular camera setup: the movement with respect to the truck and with respect to the environment only differ in scale, hence the motion aligns with the epipolar lines given by the egomotion. The truck appears as a much smaller and much closer object than it actually is. The front of the truck is reconstructed at a similar distance as the tree to the right of the camera, and as a sharp pointed object.

end point error [px]

 seq.   sps-st.   flow (occ)   ours (occ)   flow (noc)   ours (noc)
 4      0.22      1.06         0.63         0.36         0.62
 9      0.40      1.51         0.59         0.92         0.56
 19     0.80      2.40         1.41         1.63         1.05
 23     0.26      1.00         0.35         0.39         0.33
 32     0.88      3.62         1.38         1.08         1.34
 68     0.17      1.24         0.73         0.67         0.72
 74     0.78      21.21        13.96        15.14        13.64
 116    0.67      1.77         1.34         1.08         1.08
 157    0.07      0.19         0.16         0.19         0.16
 all    0.97      6.98         3.93         1.97         3.50

2px [%]

 seq.   sps-st.   flow (occ)   ours (occ)   flow (noc)   ours (noc)
 4      0.93      6.05         2.73         1.31         2.68
 9      2.68      14.05        4.90         5.34         4.35
 19     4.80      23.05        16.00        15.33        13.62
 23     0.95      8.55         0.28         1.38         0.28
 32     8.65      21.87        14.29        9.30         13.15
 68     0.31      7.21         4.27         2.51         4.17
 74     5.83      62.10        84.01        56.36        83.60
 116    4.89      16.33        11.97        8.75         8.19
 157    0.36      0.14         0.09         0.14         0.09
 all    6.47      21.11        17.59        10.67        16.41

Table 6.1. KITTI monocular reprojective error and input optical flow error for all pixels (occ) and non-occluded ones (noc). We do not include the sps-stereo method in the non-occluded case, since the occlusion pattern is different in the stereo camera setup. Our approach shows similar accuracy on the visible and on all locations, which suggests that occluded regions are detected correctly and filled with accurate data. However, on non-occluded pixels the parameterized flow has a larger error than the input flow; we lose some information by imposing regularity terms. The stereo approach performs better, as we would expect from a stereo system.

6.2.6. Evaluation on Rendered Data. Plane normals are not trivial to measure directly, and most structure estimation methods do not evaluate, or even estimate, plane normals. Since we did not find any dataset providing the data we need, we decided to generate our own dataset with ground truth egomotion, optical flow, depth and normals. We render artificial scenes, where the ground truth normals are known from the scene setup. In addition to estimated optical flow, we evaluate our method on ground truth input, in order to remove the influence of the optical flow model and estimation errors from our reconstruction. We do not need to estimate the scene scale with the median, since the exact scale is known.

end point error [px]

 seq.   sps-st. (stereo)   flow (DeepFlow)   ours (DeepFlow)   ours (GT)
 1      0.12               0.54              0.15              0.02
 2      0.17               1.75              0.32              0.01
 3      0.11               2.14              0.43              0.02
 4      0.09               1.34              0.28              0.01
 5      0.57               5.34              5.25              0.04
 6      0.16               0.38              0.18              0.02
 7      0.59               6.67              27.69             0.01
 8      0.32               27.13             42.29             0.01
 9      0.10               0.35              0.25              0.04
 10     0.11               4.48              6.65              2.01

2px [%]

 seq.   sps-st. (stereo)   flow (DeepFlow)   ours (DeepFlow)   ours (GT)
 1      0.43               6.44              1.11              0.15
 2      1.86               10.08             4.45              0.09
 3      0.20               11.14             5.87              0.08
 4      0.14               9.51              3.36              0.08
 5      1.32               12.78             24.96             0.35
 6      0.40               3.86              1.47              0.15
 7      3.30               12.12             13.26             0.08
 8      0.74               11.79             31.30             0.01
 9      0.28               2.87              1.37              0.23
 10     0.23               12.92             41.24             14.18

Table 6.2. Rendered scenes depth evaluation. We observe that our method improves on the input optical flow in all sequences except 7, 8 and 10, which are sequences with large errors in the estimated optical flow. The rendered scenes consist of only very few planar objects, which fits our piecewise planar model well. The stereo approach outperforms our monocular method, as expected. On ground truth input flow, we have very low error except for sequence 10, which has large gaps in scene depth.

The dataset consists of four scene environments, with multiple camera tracks for each environment, each containing 25 frames, which yields 24 optical flow fields per track. Figures 6.14 to 6.18 show the results of our reconstruction for frame 10 of each sequence, in comparison to sps-stereo and ground truth. We note only slight fuzziness at the epipole, which is again approximately in the image center. The effect is weaker than on the KITTI benchmark, since the rendered scenes consist of fewer objects, and piecewise planar interpolation describes the scenes better. Sequences 7 and 8, depicted in Figure 6.17, contain very fine texture, which is difficult for optical flow estimation. We get strong artifacts in the reconstruction from estimated flow, while the reconstruction from ground truth optical flow is accurate. The gap in depth reconstruction accuracy between estimated and ground truth flow is shown in Table 6.2. In most cases, the estimated depth has lower error than the input optical flow; our piecewise planar scene assumption represents the rendered sequences well.

Similarly, we observe a discrepancy between estimated and ground truth optical flow for the estimated camera motion in Table 6.3. As we have seen on the KITTI dataset, foreground objects are merged with the background. This results in reconstruction errors, especially in sequence 10, where the depth discontinuities are particularly large, as we can see in Figure 6.18. Tables 6.2 and 6.3 show large reconstruction errors for sequence 10 even from ground truth input. Table 6.4 shows the normal reconstruction errors of our approach and sps-stereo. The average error is similar to the stereo reconstruction, but the error distributions differ between the two cases.

        estimated flow                        ground truth
 seq.   eR [deg.]  et [met.]  eα [deg.]      eR [deg.]  et [met.]  eα [deg.]
 1      0.00       0.00       0.03           0.00       0.00       0.01
 2      0.00       0.00       0.02           0.00       0.00       0.00
 3      0.00       0.00       0.00           0.00       0.00       0.01
 4      0.00       0.00       0.03           0.00       0.00       0.01
 5      0.24       0.20       5.38           0.00       0.00       0.02
 6      0.00       0.00       0.08           0.00       0.00       0.06
 7      1.70       0.52       8.40           0.00       0.00       0.01
 8      2.60       0.93       29.04          0.00       0.00       0.03
 9      0.00       0.00       0.14           0.00       0.00       0.16
 10     0.82       0.61       25.52          0.25       0.17       7.05

Table 6.3. Rendered scenes camera motion evaluation. The table shows the rotation error in degrees, and the translation error both in metric units and as an angle in degrees. We evaluate the translation error as an angle, because the translation norm cannot be estimated from a single optical flow field without further assumptions. Most sequences are reconstructed accurately from ground truth input. However, the depth discontinuities of sequence 10 are not represented well by our regularity term, which results in errors in the egomotion estimation.
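For reference, the error measures of Table 6.3 can be computed as follows; this is a sketch of the standard definitions (geodesic rotation angle, translation direction angle), which we take to be the intended metrics.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle of the relative rotation R_est^T R_gt, in degrees."""
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def translation_angle_deg(t_est, t_gt):
    """Angle between translation directions; the translation norm is not
    observable from a single monocular flow field."""
    c = t_est @ t_gt / (np.linalg.norm(t_est) * np.linalg.norm(t_gt))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
```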

[Figure: environment 1, track 1 (forward motion) and track 2 (forward motion and rolling); panels: frame, optical flow, depth, normals; methods: sps-stereo, ours (DeepFlow), ours (ground truth)]

Figure 6.14. Rendered sequences 1 and 2, showing environment 1 with straight forward and forward rolling egomotion. The sps-stereo method runs on stereo images, independently of the egomotion. Our method reconstructs both scenes accurately, with small artifacts on estimated flow. Also, the optical flow estimation does not work reliably in occluded areas. As in the KITTI benchmark, we can observe that our method merges objects with the background.

[Figure: environment 1, track 3 (right turn) and track 4 (left turn); panels: frame, optical flow, depth, normals; methods: sps-stereo, ours (DeepFlow), ours (ground truth)]

Figure 6.15. Rendered sequences 3 and 4, showing the same environment as in Fig. 6.14 with left and right turn motion. We see a similar behavior as in the previous case; the occlusion pattern differs due to the new camera motion, which explains reconstruction errors on the left edge in the right turn case, and vice versa.

[Figure: environment 2, track 1 (forward motion) and track 2 (forward motion and rolling); panels: frame, optical flow, depth, normals; methods: sps-stereo, ours (DeepFlow), ours (ground truth)]

Figure 6.16. Rendered sequences 5 and 6. The environment is more complex than the previous one; it has more and smaller objects. Our method reconstructs details such as the table, the top-facing wall and the entrance, while the large superpixels of sps-stereo do not support reconstruction at the same level of detail. In the second case, using DeepFlow, the maximum depth is estimated too small, which suggests an error in the estimated camera translation.

[Figure: environment 3, track 1 (right turn) and track 2 (forward motion); panels: frame, optical flow, depth, normals; methods: sps-stereo, ours (DeepFlow), ours (ground truth)]

Figure 6.17. Rendered sequences 7 and 8. The environment is very large compared to the camera motion, and the wall has very fine texture. These are difficult conditions for optical flow estimation, since the wall texture is lost at small image scales. As a consequence, our reconstruction is less accurate on estimated flow. On ground truth flow we get high quality depth and normal reconstructions, but objects tend to be blended with the background.

[Figure: environment 4, track 1 (forward motion) and track 2 (right turn); panels: frame, optical flow, depth, normals; methods: sps-stereo, ours (DeepFlow), ours (ground truth)]

Figure 6.18. Rendered sequences 9 and 10; the environment contains large depth discontinuities and close objects. In this case, we have low discrepancy between estimated and ground truth flow, but our regularity term oversmooths the scene, which leads to strong artifacts at depth discontinuities. In case of estimated flow, we have an error in the depth scale.

ours (ground truth)

 seq.   mean [°]   >2° [%]   >5° [%]   >10° [%]   >20° [%]
 1      2.99       12.66     8.36      5.93       4.02
 2      2.93       12.85     8.98      6.04       4.00
 3      3.64       15.01     9.07      6.64       4.86
 4      2.61       12.62     7.34      4.95       3.34
 5      2.66       12.18     7.20      4.86       3.33
 6      2.61       14.58     7.27      4.49       3.04
 7      2.11       11.02     5.30      3.60       2.42
 8      2.48       18.52     8.21      4.58       2.81
 9      2.30       11.50     5.30      3.74       2.53
 10     9.24       23.61     21.04     19.71      18.11

ours

 seq.   mean [°]   >2° [%]   >5° [%]   >10° [%]   >20° [%]
 1      10.27      46.74     33.00     22.39      14.31
 2      10.00      48.77     33.27     23.17      15.43
 3      13.46      46.35     34.62     26.47      18.38
 4      11.55      52.39     35.48     24.50      15.83
 5      20.15      54.93     46.23     39.00      29.63
 6      7.14       33.56     18.68     13.38      9.90
 7      18.48      67.77     52.00     40.00      29.06
 8      38.25      84.67     75.53     65.27      51.23
 9      8.83       35.68     22.64     16.24      11.86
 10     29.78      63.99     57.55     52.93      48.10

sps-stereo

 seq.   mean [°]   >2° [%]   >5° [%]   >10° [%]   >20° [%]
 1      11.49      60.85     38.91     25.62      17.47
 2      11.23      61.50     42.86     28.48      16.48
 3      12.85      64.30     46.92     30.55      19.24
 4      10.84      60.60     37.27     24.73      16.84
 5      13.84      69.19     50.20     33.90      19.42
 6      10.44      60.15     34.44     23.59      15.91
 7      27.43      81.24     68.95     60.93      46.93
 8      34.99      89.77     75.97     67.81      56.67
 9      7.18       53.79     28.19     16.09      8.88
 10     9.03       68.06     44.27     25.72      11.24

Table 6.4. Rendered scenes normal evaluation. Sequences 7 and 8 show large errors due to erroneous optical flow estimates, and sequence 10 has a larger error on ground truth input data as well. In the other cases, our method shows a similar average error as sps-stereo. When comparing the fraction of pixels whose error exceeds a certain threshold, our approach shows larger variance, while the sps-stereo plane error distribution is more uniform.
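A sketch of the normal evaluation of Table 6.4, reading the threshold columns as the fraction of pixels whose angular error exceeds the threshold (consistent with the decreasing values); the function names are illustrative.

```python
import numpy as np

def normal_errors_deg(n_est, n_gt):
    """Per-pixel angle in degrees between estimated and ground truth
    normals, given as (H, W, 3) arrays of unit vectors."""
    c = np.clip(np.sum(n_est * n_gt, axis=2), -1.0, 1.0)
    return np.degrees(np.arccos(c))

def normal_statistics(err):
    """Mean error and fraction of pixels above each threshold, as in Table 6.4."""
    stats = {"mean [deg]": err.mean()}
    for th in (2, 5, 10, 20):
        stats[f">{th} deg [%]"] = (err > th).mean() * 100
    return stats
```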

CHAPTER 7

Conclusion

We presented novel approaches to discrete optical flow in chapter 5 and monocular structure estimation in chapter 6. The discrete optical flow problem is addressed in two ways: with sparse optical flow for outdoor traffic scenes, and with a dense approach for general optical flow. Monocular structure estimation is approached with two methods as well. We formulate a minimum energy filter, which computes the optimal camera trajectory in a long monocular video sequence, and present a structure estimation approach that takes only two frames as input for estimating camera motion and a scene representation. The scene is assumed to be piecewise planar, and is estimated with depth and normal information.

The sparse optical flow approach performs dense matching in a search window around distinct locations in the reference image. We compare different data energies which are robust against the changes in illumination that occur frequently in outdoor traffic scenes. A limited number of matches is preserved, and further filtered according to distances from epipolar lines and parameterized depth. A discrete graphical model based on matching scores and a regularity energy on the optical flow field finds the globally optimal displacement vectors. The approach achieves accurate but very sparse results.

In order to overcome the sparsity of the resulting optical flow fields, we designed a dense estimation method based on exhaustive matching and inference on discrete graphical models. Direct inference in the large graphical model is not feasible. We cluster the optical flow labels into a tree structure, and navigate through the tree by means of discrete optimization, where different strategies for label reduction are compared. Our method can track multiple branches in the label tree; however, following the single best cluster and applying the most greedy reduction method shows the best performance.

The recursive minimum energy filter computes the optimal camera trajectory in terms of a frame-wise objective function depending on optical flow and depth observations. The filter is applied frame by frame, and does not require caching of multiple frames, thus it remains responsive. Our scene estimation method fits egomotion parameters, depth and plane normals to estimated optical flow in a continuous framework. The scene is assumed to be smooth in most parts and piecewise planar.


Future Work

The approaches we proposed in this thesis have limitations. The matching energies we applied do not model changes in scale and orientation, which appear often in outdoor sequences. Furthermore, the exhaustive matching is very time consuming. An efficient hierarchical matching over different image resolutions may improve the runtime and performance of the approach, while maintaining the advantages of discrete optimization. A discrete optical flow method can incorporate further related tasks more easily than a continuous approach, such as occlusion detection or the segmentation of optical flow into independent layers.

The recursive minimum energy filter for localization in a monocular sequence has the drawback that optical flow and depth are assumed to be known. Especially depth is not directly available in a monocular camera configuration. The two frame scene reconstruction approach estimates camera motion, depth and plane normals from a single optical flow field. Two frames provide only very limited information; as future work, the approach can be extended to multiple frames by maintaining and updating a global model of the scene. Explicit modeling of depth discontinuities could increase the reconstruction accuracy of both depth and camera motion. A photometric error term could improve image alignment, and is another aspect for future research. The restriction to static scenes is unrealistic: traffic scenes are not static, and the dynamic objects are of particular interest to the (autonomous) driver. Incorporating a segmentation of dynamic objects, potentially with monocular scene flow estimation, is a very interesting direction for further work.

Bibliography

[1] P.-A. Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
[2] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M. Seitz, and Richard Szeliski. Building Rome in a Day. Commun. ACM, 54(10):105-112, October 2011.
[3] Luis Alvarez, Rachid Deriche, Théo Papadopoulo, and Javier Sánchez. Symmetrical Dense Optical Flow Estimation with Occlusions Detection. Int. J. Comput. Vision, 2007.
[4] Michael Athans and Peter L. Falb. Optimal Control. An Introduction to the Theory and Its Applications. McGraw-Hill, 1966.
[5] A. Ayvaci, M. Raptis, and S. Soatto. Sparse Occlusion Detection with Optical Flows. Int. J. Comp. Vision, 2013.
[6] Christian Bailer, Bertram Taetz, and Didier Stricker. Flow Fields: Dense Correspondence Fields for Highly Accurate Large Displacement Optical Flow Estimation. In Proc. Int. Conf. Comput. Vision, 2015.
[7] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B. Goldman. PatchMatch: A Randomized Correspondence Algorithm for Structural Image Editing. ACM Trans. Graph., 28(3):24:1-24:11, July 2009.
[8] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-Up Robust Features (SURF). Comput. Vis. Image Underst., 110(3):346-359, June 2008.
[9] Florian Becker, Frank Lenzen, Jörg H. Kappes, and Christoph Schnörr. Variational Recursive Joint Estimation of Dense Scene Structure and Camera Motion from Monocular High Speed Traffic Sequences. Int. J. Comput. Vision, 105:269-297, 2013.
[10] Florian Becker, Stefania Petra, and Christoph Schnörr. Optical Flow. In O. Scherzer, editor, Handbook of Mathematical Methods in Imaging, pages 1945-2004. Springer, 2nd edition, 2015.
[11] Johannes Berger, Frank Lenzen, Florian Becker, Andreas Neufeld, and Christoph Schnörr. Second-Order Recursive Filtering on the Rigid-Motion Lie Group SE(3) Based on Nonlinear Observations. J. Math. Imag. Vision, 58(1):102-129, 2017.
[12] Michael Bleyer, Christoph Rhemann, and Carsten Rother. PatchMatch Stereo - Stereo Matching with Slanted Support Windows. In Proc. Brit. Mach. Vision Conf., 2011.
[13] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Found. Trends Mach. Learn., 2010.
[14] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast Approximate Energy Minimization via Graph Cuts. IEEE Trans. Pattern Anal. Mach. Intell., 23(11):1222-1239, November 2001.
[15] Matthew Brown and David G. Lowe. Automatic Panoramic Image Stitching Using Invariant Features. Int. J. Comput. Vision, 74(1):59-73, August 2007.


[16] T. Brox and J. Malik. Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation. IEEE Trans. Pattern Anal. Mach. Intell., 33(3):500-513, 2011.
[17] Andrés Bruhn, Joachim Weickert, and Christoph Schnörr. Lucas/Kanade Meets Horn/Schunck: Combining Local and Global Optic Flow Methods. Int. J. Comp. Vision, 61(3):211-231, 2005.
[18] Florian Bugarin, Adrien Bartoli, Didier Henrion, Jean-Bernard Lasserre, Jean-José Orteu, and Thierry Sentenac. Rank-Constrained Fundamental Matrix Estimation by Polynomial Global Optimization Versus the Eight-Point Algorithm. J. Math. Imag. Vision, 2015.
[19] Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. BRIEF: Binary Robust Independent Elementary Features. In Proc. Europ. Conf. Comp. Vision, 2010.
[20] Antonin Chambolle and Thomas Pock. A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging. J. Math. Imag. Vision, 2010.
[21] Qifeng Chen and Vladlen Koltun. Fast MRF Optimization with Application to Depth Reconstruction. In Proc. Conf. Comp. Vision Patt. Rec., June 2014.
[22] Navneet Dalal and Bill Triggs. Histograms of Oriented Gradients for Human Detection. In Proc. Conf. Comp. Vision Patt. Rec., 2005.
[23] Andrew J. Davison, Ian D. Reid, Nicholas D. Molton, and Olivier Stasse. MonoSLAM: Real-Time Single Camera SLAM. IEEE Trans. Patt. Anal. Mach. Intell., 29(6):1052-1067, June 2007.
[24] Amaël Delaunoy and Marc Pollefeys. Photometric Bundle Adjustment for Dense Multi-View 3D Modeling. In Proc. Conf. Comp. Vision Patt. Rec., pages 1486-1493, 2014.
[25] Oliver Demetz, David Hafner, and Joachim Weickert. The Complete Rank Transform: A Tool for Accurate and Morphologically Invariant Matching of Structure. In Proc. Brit. Mach. Vision Conf., 2013.
[26] Alan Edelman, Tomás A. Arias, and Steven T. Smith. The Geometry of Algorithms with Orthogonality Constraints. SIAM J. Matrix Anal. Appl., 20(2):303-353, 1998.
[27] David Eigen and Rob Fergus. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture. In Proc. Int. Conf. Comput. Vision, 2015.
[28] Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-Scale Direct Monocular SLAM. In Proc. Europ. Conf. Comput. Vision, 2014.
[29] Anders Eriksson and Anton van den Hengel. Optimization on the Manifold of Multiple Homographies. In Proc. Works. Int. Conf. Comp. Vision, 2009.
[30] Anders Eriksson and Anton van den Hengel. Efficient Computation of Robust Weighted Low-Rank Matrix Approximations Using the L1 Norm. IEEE Trans. Pattern Anal. Mach. Intell., 34(9):1681-1690, 2012.
[31] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Efficient Belief Propagation for Early Vision. Int. J. Comp. Vision, 70(1):41-54, 2006.
[32] Andreas Geiger. Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proc. Conf. Comput. Vision Patt. Rec., 2012.
[33] Tom Goldstein, Xavier Bresson, and Stan Osher. Global Minimization of Markov Random Fields with Applications to Optical Flow. J. Inv. Probl. Imag., 6(4):623-644, 2012.
[34] David Hafner, Oliver Demetz, and Joachim Weickert. Why Is the Census Transform Good for Robust Optic Flow Computation? In Proc. Scale Space Var. Methods, 2013.

[35] P. L. Hammer, P. Hansen, and B. Simeone. Roof Duality, Complementation and Persistency in Quadratic 0-1 Optimization. Math. Progr., 28(2):121-155, 1984.
[36] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, New York, NY, USA, 2nd edition, 2003.
[37] Uwe Helmke, Knut Hüper, Pei Y. Lee, and John Moore. Essential Matrix Estimation Using Gauss-Newton Iterations on a Manifold. Int. J. Comp. Vision, 74(2):117-136, 2007.
[38] Uwe Helmke and John B. Moore. Optimization and Dynamical Systems. Springer, 1996.
[39] Heiko Hirschmüller. Stereo Processing by Semiglobal Matching and Mutual Information. IEEE Trans. Patt. Anal. Mach. Intell., 2008.
[40] Berthold K. P. Horn and Brian G. Schunck. Determining Optical Flow. Art. Intell., 17:185-203, 1981.
[41] Hiroshi Ishikawa. Exact Optimization for Markov Random Fields with Convex Priors. IEEE Trans. Patt. Anal. Mach. Intell., 25(10):1333-1336, October 2003.
[42] Jörg Hendrik Kappes, Bogdan Savchynskyy, and Christoph Schnörr. A Bundle Approach To Efficient MAP-Inference by Lagrangian Relaxation. In Int. Conf. Comp. Vision, 2012.
[43] Georg Klein and David Murray. Parallel Tracking and Mapping for Small AR Workspaces. In Proc. Int. Symp. Mix. Augm. Real., 2007.
[44] Vladimir Kolmogorov. Convergent Tree-Reweighted Message Passing for Energy Minimization. IEEE Trans. Pattern Anal. Mach. Intell., 28(10):1568-1583, October 2006.
[45] Solomon Kullback and Richard A. Leibler. On Information and Sufficiency. Ann. Math. Statist., 22(1):79-86, March 1951.
[46] Abhijit Kundu, Yin Li, Frank Dellaert, Fuxin Li, and James M. Rehg. Joint Semantic Segmentation and 3D Reconstruction from Monocular Video. In Proc. Europ. Conf. Comput. Vision, 2014.
[47] Lubor Ladicky, Paul Sturgess, Christopher Russell, Sunando Sengupta, Yalin Bastanlar, William F. Clocksin, and Philip H. S. Torr. Joint Optimization for Object Class Segmentation and Dense Stereo Reconstruction. Int. J. Comp. Vision, 100(2):122-133, 2012.
[48] J. M. Lee. Introduction to Smooth Manifolds. Graduate Texts in Mathematics. Springer, 2003.
[49] Jan Lellmann and Christoph Schnörr. Continuous Multiclass Labeling Approaches and Algorithms. J. Imag. Sci., 4(4):1049-1096, 2011.
[50] Victor Lempitsky, Stefan Roth, and Carsten Rother. FusionFlow: Discrete-Continuous Optimization for Optical Flow Estimation. In Proc. Conf. Comput. Vision Patt. Rec., 2008.
[51] Victor Lempitsky, Carsten Rother, and Andrew Blake. LogCut - Efficient Graph Cut Optimization for Markov Random Fields. In Proc. Int. Conf. Comput. Vision, 2007.
[52] Ce Liu, Jenny Yuen, and Antonio Torralba. SIFT Flow: Dense Correspondence Across Scenes and Its Applications. IEEE Trans. Pattern Anal. Mach. Intell., 33(5):978-994, May 2011.
[53] Ming-Yu Liu, Shuoxin Lin, Srikumar Ramalingam, and Oncel Tuzel. Layered Interpretation of Street View Images. In Proc. Robot. Sci. Sys., 2015.
[54] Stephan Liwicki, Christopher Zach, Ondrej Miksik, and Philip H. S. Torr. Coarse-to-fine Planar Regularization for Dense Monocular Depth Estimation. In Proc. Europ. Conf. Comput. Vision, 2016.

[55] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vision, 60(2):91-110, November 2004.
[56] Bruce D. Lucas and Takeo Kanade. An Iterative Image Registration Technique with an Application to Stereo Vision. In Proc. Int. Conf. Art. Intell., pages 674-679, 1981.
[57] Ezio Malis and Manuel Vargas. Deeper Understanding of the Homography Decomposition for Vision-Based Control. Research Report RR-6303, INRIA, 2007.
[58] Ran Margolin, Ayellet Tal, and Lihi Zelnik-Manor. What Makes a Patch Distinct? In Proc. Conf. Comput. Vision Patt. Rec., 2013.
[59] Moritz Menze and Andreas Geiger. Object Scene Flow for Autonomous Vehicles. In Proc. Conf. Comput. Vision Patt. Rec., 2015.
[60] Moritz Menze, Christian Heipke, and Andreas Geiger. Discrete Optimization for Optical Flow. In Proc. German Conf. Patt. Rec., 2015.
[61] C. Michelot. A Finite Algorithm for Finding the Projection of a Point onto the Canonical Simplex of R^n. J. Opt. Theory Applic., 1986.
[62] Thomas Minka. Divergence Measures and Message Passing. Technical report, Microsoft Research Ltd., Cambridge, UK, 2005.
[63] Jorge J. Moré. The Levenberg-Marquardt Algorithm: Implementation and Theory. In Numerical Analysis, pages 105-116. Springer, 1978.
[64] R. E. Mortensen. Maximum-Likelihood Recursive Nonlinear Filtering. J. Opt. Theory Appl., 2(6):386-394, 1968.
[65] Mikhail G. Mozerov. Constrained Optical Flow Estimation as a Matching Problem. IEEE Trans. Imag. Process., 2013.
[66] Andreas Neufeld, Johannes Berger, Florian Becker, Frank Lenzen, and Christoph Schnörr. Estimating Vehicle Ego-Motion and Piecewise Planar Scene Structure from Optical Flow in a Continuous Framework. In Proc. German Conf. Patt. Rec., 2015.
[67] Richard A. Newcombe, Steven J. Lovegrove, and Andrew J. Davison. DTAM: Dense Tracking and Mapping in Real-Time. In Proc. Int. Conf. Comput. Vision, 2011.
[68] David Pfeiffer and Uwe Franke. Efficient Representation of Traffic Scenes by Means of Dynamic Stixels. In Proc. Intell. Vehic. Symp., 2010.
[69] Thomas Pock, Thomas Schoenemann, Gottfried Graber, Horst Bischof, and Daniel Cremers. A Convex Formulation of Continuous Multi-label Problems. In Proc. Europ. Conf. Comput. Vision, 2008.
[70] Hugo Raguet, Jalal Fadili, and Gabriel Peyré. A Generalized Forward-Backward Splitting. SIAM J. Imaging Sci., 2012.
[71] René Ranftl, Kristian Bredies, and Thomas Pock. Non-Local Total Generalized Variation for Optical Flow Estimation. In Proc. Europ. Conf. Comput. Vision, 2014.
[72] René Ranftl, Stefan Gehrig, Thomas Pock, and Horst Bischof. Pushing the Limits of Stereo Using Variational Stereo Estimation. In Proc. Intell. Vehic. Symp., 2012.
[73] Hatem A. Rashwan, Mahmoud A. Mohamed, Miguel Angel Garcia, Bärbel Mertsching, and Domenec Puig. Illumination Robust Optical Flow Model Based on Histogram of Oriented Gradients. In Proc. German Conf. Patt. Rec., 2013.
[74] Jérôme Revaud, Philippe Weinzaepfel, Zaïd Harchaoui, and Cordelia Schmid. EpicFlow: Edge-Preserving Interpolation of Correspondences for Optical Flow. In Proc. Comp. Vision Patt. Rec., 2015.

[75] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An Efficient Alternative to SIFT or SURF. In Proc. Int. Conf. Comput. Vision, 2011.
[76] Neus Sabater, Andrés Almansa, and Jean-Michel Morel. Meaningful Matches in Stereovision. IEEE Trans. Patt. Anal. Mach. Intell., 34(5):930-942, 2012.
[77] Dmitrij Schlesinger. Exact Solution of Permuted Submodular MinSum Problems. In Proc. Energy Min. Methods Comput. Vision Patt. Rec., 2007.
[78] Laura Sevilla-Lara, Deqing Sun, Varun Jampani, and Michael J. Black. Optical Flow with Semantic Segmentation and Localized Layers. In Proc. Conf. Comput. Vision Patt. Rec., 2016.
[79] Paul Smith, Tom Drummond, and Roberto Cipolla. Layered Motion Segmentation and Depth Ordering by Tracking Edges. IEEE Trans. Pattern Anal. Mach. Intell., 26(4):479-494, 2004.
[80] Frank Steinbruecker, Thomas Pock, and Daniel Cremers. Large Displacement Optical Flow Computation without Warping. In Proc. Int. Conf. Comput. Vision, Kyoto, Japan, 2009.
[81] Hauke Strasdat, J. M. M. Montiel, and Andrew J. Davison. Visual SLAM: Why Filter? Image Vision Comput., 2012.
[82] Jan Stühmer, Stefan Gumhold, and Daniel Cremers. Real-Time Dense Geometry from a Handheld Camera. In Patt. Rec. DAGM Symp., 2010.
[83] Deqing Sun, Stefan Roth, and Michael J. Black. Secrets of Optical Flow Estimation and Their Principles. In Proc. Conf. Comput. Vision Patt. Rec., 2010.
[84] Deqing Sun, Erik B. Sudderth, and Michael J. Black. Layered Image Motion with Explicit Occlusions, Temporal Consistency, and Depth Ordering. In Proc. Advan. Neural Infor. Process. Sys., 2010.
[85] Deqing Sun, Jonas Wulff, Erik Sudderth, Hanspeter Pfister, and Michael Black. A Fully-Connected Layered Model of Foreground and Background Flow. In Proc. Conf. Comput. Vision Patt. Rec., 2013.
[86] Jian Sun, Nan-Ning Zheng, and Heung-Yeung Shum. Stereo Matching Using Belief Propagation. IEEE Trans. Patt. Anal. Mach. Intell., 2003.
[87] Keisuke Tateno, Federico Tombari, Iro Laina, and Nassir Navab. CNN-SLAM: Real-Time Dense Monocular SLAM with Learned Depth Prediction. In Proc. Comput. Vision Patt. Rec., 2017.
[88] Levi Valgaerts, Andrés Bruhn, Markus Mainberger, and Joachim Weickert. Dense versus Sparse Approaches for Estimating the Fundamental Matrix. Int. J. Comput. Vision, 96(2):212-234, 2012.
[89] Bart Vandereycken. Low-Rank Matrix Completion by Riemannian Optimization. SIAM J. Optim., 23(2):1214-1236, 2013.
[90] Christoph Vogel, Stefan Roth, and Konrad Schindler. An Evaluation of Data Costs for Optical Flow. In Proc. German Conf. Patt. Rec., 2013.
[91] Christoph Vogel, Konrad Schindler, and Stefan Roth. 3D Scene Flow Estimation with a Piecewise Rigid Scene Model. Int. J. Comput. Vision, 115(1):1-28, October 2015.
[92] Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. MAP Estimation Via Agreement on Trees: Message-Passing and Linear Programming. IEEE Trans. Inf. Theory, 2005.
[93] Philippe Weinzaepfel, Jérôme Revaud, Zaïd Harchaoui, and Cordelia Schmid. DeepFlow: Large Displacement Optical Flow with Deep Matching. In Proc. Int. Conf. Comput. Vision, 2013.
[94] Andreas Wendel, Michael Maurer, Gottfried Graber, Thomas Pock, and Horst Bischof. Dense Reconstruction On-the-Fly. In Conf. Comput. Vision Patt. Rec., 2012.

[95] Koichiro Yamaguchi, Tamir Hazan, David McAllester, and Raquel Urtasun. Continuous Markov Random Fields for Robust Stereo Estimation. In Proc. Europ. Conf. Comput. Vision, 2012.
[96] Koichiro Yamaguchi, David A. McAllester, and Raquel Urtasun. Efficient Joint Segmentation, Occlusion Labeling, Stereo and Flow Estimation. In Proc. Europ. Conf. Comput. Vision, 2014.
[97] Ramin Zabih and John Woodfill. Non-parametric Local Transforms for Computing Visual Correspondence. In Proc. Europ. Conf. Comput. Vision, 1994.
[98] Christopher Zach. Fast and High Quality Fusion of Depth Maps. In Proc. 3D Data Process. Visual. Transm., 2008.
[99] Christopher Zach, Thomas Pock, and Horst Bischof. A Globally Optimal Algorithm for Robust TV-L1 Range Image Integration. In Proc. Int. Conf. Comput. Vision, 2007.