3D Reconstruction Is Not Just a Low-Level Task: Retrospect and Survey
Total Page:16
File Type:pdf, Size:1020Kb
3D reconstruction is not just a low-level task: retrospect and survey Jianxiong Xiao Massachusetts Institute of Technology [email protected] Abstract 3D reconstruction is in obtaining more accurate depth maps [44, 45, 55] or 3D point clouds [47, 48, 58, 50]. We now Although an image is a 2D array, we live in a 3D world. have reliable techniques [47, 48] for accurately computing The desire to recover the 3D structure of the world from 2D a partial 3D model of an environment from thousands of images is the key that distinguished computer vision from partially overlapping photographs (using keypoint match- the already existing field of image processing 50 years ago. ing and structure from motion). Given a large enough set of For the past two decades, the dominant research focus for views of a particular object, we can create accurate dense 3D reconstruction is in obtaining more accurate depth maps 3D surface models (using stereo matching and surface fit- or 3D point clouds. However, even when a robot has a depth ting [44, 45, 55, 58, 50, 59]). In particular, using Microsoft map, it still cannot manipulate an object, because there is Kinect (also Primesense and Asus Xtion), a reliable depth no high-level representation of the 3D world. Essentially, map can be obtained straightly out of box. 3D reconstruction is not just a low-level task. Obtaining However, despite all of these advances, the dream of hav- a depth map to capture a distance at each pixel is analo- ing a computer interpret an image at the same level as a two- gous to inventing a digital camera to capture the color value year old (for example, counting all of the objects in a pic- at each pixel. The gap between low-level depth measure- ture) remains elusive. Even when we have a depth map, we ments and high-level shape understanding is just as large as still cannot manipulate an object because there is no high- the gap between pixel colors and high-level semantic per- level representation of the 3D world. Essentially, we would ception. Moving forward, we would like to argue that we like to argue that 3D reconstruction is not just a low-level need a higher-level intelligence for 3D reconstruction. We task. Obtaining a depth map to capture a distance at each would like to draw attention of the 3D reconstruction re- pixel is analogous to inventing a digital camera to capture search community to put greater emphasis on mid-level and the color value at each pixel. The gap between low-level high-level 3D understanding, instead of exclusively focus depth measurements and high-level shape understanding is on improving of low-level reconstruction accuracy, as is the just as large as the gap between pixel colors and high-level current situation. In this report, we retrospect the history semantic perception. Moving forward, we need a higher- and analyze some recent efforts in the community, to argue level intelligence for 3D reconstruction. that a new era to study 3D reconstruction at higher level is starting to come. This report aims to draw the attention of the 3D recon- struction research community to put greater emphasis on mid-level and high-level 3D understanding, instead of ex- 1. Introduction clusively focus on improving of low-level reconstruction accuracy, as is the current situation. In Section 2, we retro- Although an image is a 2D array, we live in a 3D world spect the history to study the different views of paradigms in where scenes have volume, affordances, and are spatially ar- the filed that makes 3D reconstruction a complete low-level ranged with objects occluding each other. The ability to rea- task, apart from the view at the very beginning of 3D re- son about these 3D properties would be useful for tasks such construction research. We highlighted the point that draws as navigation and object manipulation. As humans, we per- a clear difference for the field, and analyze the long-term ceive the three-dimensional structure of the world around us implication for subconscious changing in the view for 3D with apparent ease. But for computers, this has been shown reconstruction. In Section 3, we review the widely accepted to be a very difficult task, and have been studied for about “two-streams hypothesis” model of the neural processing 50 years in the 3D reconstruction community in computer of vision in human brain, in order to draw the link between vision, which has made significant progress. Especially, computer and human vision system. This link allows us to in the past two decades, the dominant research focus for conjecture recognition in computer vision to be the counter- 1 part of ventral stream in human vision system, and recon- three decades, there are a lot more success we achieve for struction to be the counterpart of dorsal stream in human the low-level 3D reconstruction for stereo correspondence vision system. In Section 4, we provide brief survey on and structure from motion, than for the high level 3D un- some recent efforts in the community that can be regarded derstanding. For low-level 3D reconstruction, thanks to as studies for 3D reconstruction beyond low level. Finally, the better understanding of geometry, more realistic image We highlight some recent efforts to unify recognition and features, more sophisticated optimization routine and faster reconstruction, and argue that a new era to study 3D recon- computers, we can obtain a reliable depth map or 3D point struction at higher-level is starting to come. cloud together with camera poses. In contrast, for higher- level 3D interpretation, because the line-based approaches 2. History and Retrospect hardly work for real images, this field diminished after a short burst. Nowadays, the research for 3D reconstruction Physics (radiometry, optics, and sensor design) and com- almost exclusively focuses on only low-level reconstruc- puter graphics study the forward models about how light re- tion, in obtaining better accuracy and improving robust- flects off objects’ surfaces, is scattered by the atmosphere, ness for stereo matching and structure from motion. People refracted through camera lenses (or human eyes), and fi- seems to have forgetten the end goal of such low-level re- nally projected onto a 2D image plane. In computer vision, construction, i.e. to reach a full interpretation of the scenes i.e we are trying to do the inverse [49], . to describe the world and objects. Given that we can obtain very good result on that we see in one or more images and to reconstruct its low-level reconstruction now, we would like to remind the properties, such as shape. In fact, the desire to recover the community and draw attention to put greater emphasis on three-dimensional structure of the world from images and to mid-level and high-level 3D understanding. use this as a stepping stone towards full scene understand- We should separate the approach and the task. The ing is what distinguished computer vision from the already less success of line-based approach for high-level 3D un- existing field of digital image processing 50 years ago. derstanding should only indicate that we need a better ap- Early attempts at 3D reconstruction involved extracting proach. It shouldn’t mean that higher-level 3D understand- edges and then inferring the 3D structure of an object or ing is not important and we can stop working on it. In an- a “blocks world” from the topological structure of the 2D other word, we should focus on designing better approaches lines [41]. Several line labeling algorithms were developed for high-level 3D understanding, which is independent of at that time [29, 8, 53, 42, 30, 39]. Following that, three- the fact that line based approach is less successful than key- dimensional modeling of non-polyhedral objects was also point and feature based approach. being studied [5, 2], using generalized cylinders [1, 7, 35, 40, 26, 36] or geon [6]. 3. Two-streams Hypothesis :: Computer Vision Staring from late 70s, more quantitative approaches to 3D were starting to emerge, including the first of many In parallel to the computer vision researchers’ effort feature-based stereo correspondence algorithms [13, 37, 21, to develop engineering solutions for recovering the three- 38], and simultaneously recovering 3D structure and cam- dimensional shape of objects in imagery, perceptual psy- era motion [51, 52, 34], i.e. structure from motion. After chologists have spent centuries trying to understand how three decades of active research, nowadays, we can achieve the human visual system works. The two-streams hypothe- very good performance with high accuracy and robustness, sis is a widely accepted and influential model of the neu- for both stereo matching [44, 45, 55] and structure from mo- ral processing of vision [14]. The hypothesis, given its tion [47, 48, 58, 50]. most popular characterization in [17], argues that humans However, there is a significantly difference between possess two distinct visual systems. As visual information these two groups of approaches. The first group represented exits the occipital lobe, it follows two main pathways, or by “block world”, targets on high-level reconstruction of “streams”. The ventral stream (also known as the “what objects and scenes. The second group, i.e. stereo corre- pathway”) travels to the temporal lobe and is involved with spondence and structure from motion, targets on very low- object identification and recognition. The dorsal stream (or, level 3D reconstruction. For example, the introduction of “how pathway”) terminates in the parietal lobe and is in- structure from motion was inspired by “the remarkable fact volved with processing the objects spatial location relevant that this interpretation requires neither familiarity with, nor to the viewer.