Toward 3D Reconstruction of Static and Dynamic Objects

TOWARD 3D RECONSTRUCTION OF STATIC AND DYNAMIC OBJECTS Enliang Zheng A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science. Chapel Hill 2016 Approved by: Jan-Michael Frahm Enrique Dunn Tamara L. Berg Vladimir Jojic Yaser Sheikh ©2016 Enliang Zheng ALL RIGHTS RESERVED ii ABSTRACT Enliang Zheng: Toward 3D Reconstruction of Static and Dynamic Objects (Under the direction of Jan-Michael Frahm and Enrique Dunn) The goal of image-based 3D reconstruction is to construct a spatial understanding of the world from a collection of images. For applications that seek to model generic real-world scenes, it is important that the reconstruction methods used are able to characterize both static scene elements (e.g. trees and buildings) as well as dynamic objects (e.g. cars and pedestrians). However, due to many inherent ambiguities in the reconstruction problem, recovering this 3D information with accuracy, robustness, and efficiency is a considerable challenge. To advance the research frontier for image-based 3D modeling, this dissertation focuses on three challenging problems in static scene and dynamic object reconstruction. We first target the problem of static scene depthmap estimation from crowd-sourced datasets (i.e. photos collected from the Internet). While achieving high-quality depthmaps using images taken under a controlled environment is already a difficult task, heterogeneous crowd-sourced data presents a unique set of challenges for multi-view depth estimation, including varying illumination and occasional occlusions. We propose a depthmap estimation method that demonstrates high accuracy, robustness, and scalability on a large number of photos collected from the Internet. Compared to static scene reconstruction, the problem of dynamic object reconstruction from monocular images is fundamentally ambiguous when not imposing any additional assumptions. This is because having only a single observation of an object is insufficient for valid 3D triangulation, which typically requires concurrent observations of the object from multiple viewpoints. Assuming that dynamic objects of the same class (e.g. all the pedestrians walking on a sidewalk) move in a common path in the real world, we develop a method that estimates the 3D positions of the dynamic iii objects from unstructured monocular images. Experiments on both synthetic and real datasets illustrate the solvability of the problem and the effectiveness of our approach. Finally, we address the problem of dynamic object reconstruction from a set of unsynchronized videos capturing the same dynamic event. This problem is of great interest because, due to the increased availability of portable capture devices, captures using multiple unsynchronized videos are common in the real world. To resolve the challenges that arises from non-concurrent captures and unknown temporal overlap among video streams, we propose a self-expressive dictionary learning framework, where the dictionary entries are defined as the collection of temporally varying structures. Experiments demonstrate the effectiveness of this approach to the previously unsolved problem. iv ACKNOWLEDGEMENTS My deepest gratitude is to my advisors Jan-Michael Frahm and Enrique Dunn. I have been amazingly fortunate to have advisors who gave me the guidance and encouragement when my steps faltered, and the freedom to explore on my own. I would also like to thank my committee members, Tamara L. Berg, Vladimir Jojic, and Yaser Sheikh, for their feedback and advice. Additionally, I would like to thank my labmates, as their company and discussion made my time more fruitful and enjoyable: Philip Ammirato, Akash Bapat, Sangwoo Cho, Marc Eder, Pierre Fite-Georgel, Yunchao Gong, Rohit Gupta, Shubham Gupta, Xufeng Han, Jared Heinly, Junpyo Hong, Yi-Hung Jen, Dinghuang Ji, Alex Keng, Hadi Kiapour, Hyo Jin Kim, Wei Liu, Jie Lu, Licheng Yu, Vicente Ordońez-Rom˜ an,´ David Perra, True Price, Rahul Raguram, Patrick Reynolds, Johannes Schonberger,¨ Meng Tan, Joseph Tighe, Sirion Vittayakorn, Ke Wang, Yilin Wang, Yi Xu, and Hongsheng Yang. I would like to thank my parents, who encouraged me to explore, learn and pursue for a PhD. Finally, I would like to thank my wife, Lingling Zheng, for her understanding and support. v TABLE OF CONTENTS LIST OF TABLES . x LIST OF FIGURES . xi LIST OF ABBREVIATIONS . xiii 1 Introduction . 1 1.1 Thesis Statement . 4 1.2 Outline of Contributions . 4 2 Related work . 7 2.1 Camera Parameter Estimation . 7 2.2 Static Scene Reconstruction . 8 2.2.1 Multiview Depthmap Estimation. 8 2.2.2 Robustness . 9 2.2.3 Efficiency . 10 2.2.4 Point Cloud and Mesh Generation . 11 2.3 Dynamic Object Reconstruction . 12 2.3.1 Trajectory Triangulation. 12 2.3.2 Sequencing and Synchronization . 13 2.3.3 Articulated Object Reconstruction . 13 2.3.4 Non-rigid SfM . 14 2.3.5 Single Image Reconstruction . 15 3 PatchMatch Based Joint View Selection and Depthmap Estimation . 17 3.1 Introduction . 17 vi 3.2 Joint View Selection and Depth Estimation . 19 3.2.1 PatchMatch Propagation for Stereo . 19 3.2.2 Graphical Model . 20 3.2.3 Variational Inference . 24 3.2.4 Update Schedule . 26 3.2.5 Algorithm Integration . 28 3.3 Experiments . 29 3.4 Conclusion . 35 4 Joint Object Class Sequencing and Trajectory Triangulation (JOST) . 37 4.1 Introduction . 37 4.2 Joint Object Class Sequencing and Trajectory Triangulation . 39 4.2.1 Spatial Registration . 39 4.2.2 Object Detection and Motion Tangent Estimation . 40 4.2.3 Object Class Trajectory Triangulation . 41 4.2.4 Generalized Trajectory Graph . 42 4.2.5 GMST . 45 4.2.6 Continuous Refinement . 46 4.2.7 Reconstructability Analysis . 47 4.3 Object Detector and Motion Tangent Estimation . 49 4.4 Experiments . 51 4.5 Conclusion . 56 5 Self-expressive Dictionary Learning for Dynamic 3D Reconstruction . 57 5.1 Introduction . 57 5.2 Problem and Notation . 59 5.3 Principle . 61 5.4 Method . 64 vii 5.4.1 Cost Function. 65 5.4.2 Dictionary Space Reduction in Self-representation . 65 5.4.3 Coefficient Relationships: Ψ1(W) ........................................ 66 5.4.4 Sequencing Information: Ψ2(X) ......................................... 67 5.5 Parameterization of X ........................................................... 68 5.5.1 Noisy Observations . 68 5.5.2 Missing Data . ..

Load more