On the Importance of Stereo for Accurate Depth Estimation: An Efficient Semi-Supervised Deep Neural Network Approach
Nikolai Smolyanskiy, Alexey Kamenev, Stan Birchfield
NVIDIA
{nsmolyanskiy, akamenev, sbirchfield}@nvidia.com

Abstract

We revisit the problem of visual depth estimation in the context of autonomous vehicles. Despite the progress on monocular depth estimation in recent years, we show that the gap between monocular and stereo depth accuracy remains large—a particularly relevant result due to the prevalent reliance upon monocular cameras by vehicles that are expected to be self-driving. We argue that the challenges of removing this gap are significant, owing to fundamental limitations of monocular vision. As a result, we focus our efforts on depth estimation by stereo. We propose a novel semi-supervised learning approach to training a deep stereo neural network, along with a novel architecture containing a machine-learned argmax layer and a custom runtime that enables a smaller version of our stereo DNN to run on an embedded GPU. Competitive results are shown on the KITTI 2015 stereo dataset. We also evaluate the recent progress of stereo algorithms by measuring the impact upon accuracy of various design criteria. (A video of the system is available at https://youtu.be/0FPQdVOYoAU.)

1. Introduction

Estimating depth from images is a long-standing problem in computer vision. Depth perception is useful for scene understanding, scene reconstruction, virtual and augmented reality, obstacle avoidance, self-driving cars, robotics, and other applications.

Traditionally, multiple images have been used to estimate depth. Techniques that fall within this category include stereo, photometric stereo, depth from focus, depth from defocus, time-of-flight, and structure from motion. (Although time-of-flight does not, in theory, require multiple images, in practice multiple images are collected with different bandwidths in order to achieve high accuracy over long ranges.) The reasons for using multiple images are twofold: 1) absolute depth estimates require at least one known distance in the world, which can often be provided by some knowledge regarding the multi-camera rig (e.g., the baseline between stereo cameras); and 2) multiple images provide geometric constraints that can be leveraged to overcome the many ambiguities of photometric data.

The alternative is to use a single image to estimate depth. We argue that this alternative—due to its fundamental limitations—is not likely to be able to achieve high-accuracy depth estimation at large distances in unfamiliar environments. As a result, in the context of self-driving cars we believe monocular depth estimation is not likely to yield results with sufficient accuracy. In contrast, we offer a novel, efficient deep-learning stereo approach that achieves compelling results on the KITTI 2015 dataset by leveraging a semi-supervised loss function (using LIDAR and photometric consistency), concatenating cost volume, 3D convolutions, and a machine-learned argmax function.

The contributions of the paper are as follows:

• Quantitative and qualitative demonstration of the gap in depth accuracy between monocular and stereoscopic depth.

• A novel semi-supervised approach (combining lidar and photometric losses) to training a deep stereo neural network. To our knowledge, ours is the first deep stereo network to do so. (Similarly, Kuznietsov et al. [17] use a semi-supervised approach for training a monocular network.)

• A smaller version of our network, and a custom runtime, that runs at near real-time (∼20 fps) on a standard GPU, and runs efficiently on an embedded GPU. To our knowledge, ours is the first stereo DNN to run on an embedded GPU.

• Quantitative analysis of various network design choices, along with a novel machine-learned argmax layer that yields smoother disparity maps (a simplified sketch of the loss and the argmax idea follows this list).
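To make the semi-supervised loss and the argmax-style disparity regression mentioned above more concrete, the sketch below shows both ideas in simplified form. This is our own illustrative code, not the authors' released implementation: the tensor shapes, the plain L1 photometric term, the function names, and the weighting factor lam are assumptions, and the paper's machine-learned argmax replaces the fixed softmax weighting shown here with learned layers.

import torch
import torch.nn.functional as F

def soft_argmax_disparity(cost_volume):
    # cost_volume: (B, D, H, W) matching costs over D disparity hypotheses;
    # lower cost means a better match.
    probs = F.softmax(-cost_volume, dim=1)
    disps = torch.arange(cost_volume.shape[1], dtype=cost_volume.dtype,
                         device=cost_volume.device).view(1, -1, 1, 1)
    return (probs * disps).sum(dim=1)          # (B, H, W) sub-pixel disparity

def warp_right_to_left(right, disparity):
    # Reconstruct the left view by sampling the right image at x - d.
    _, _, h, w = right.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=right.device),
                            torch.arange(w, device=right.device), indexing="ij")
    xs = xs.unsqueeze(0).float() - disparity   # (B, H, W) source x-coordinates
    ys = ys.unsqueeze(0).float().expand_as(xs)
    grid = torch.stack((2 * xs / (w - 1) - 1,  # normalize to [-1, 1] for grid_sample
                        2 * ys / (h - 1) - 1), dim=-1)
    return F.grid_sample(right, grid, align_corners=True)

def semi_supervised_loss(disparity, left, right, lidar_disp, lidar_mask, lam=0.5):
    # Sparse supervised term (lidar projected to disparity) plus a dense
    # photometric consistency term on the reconstructed left image.
    supervised = F.l1_loss(disparity[lidar_mask], lidar_disp[lidar_mask])
    photometric = F.l1_loss(warp_right_to_left(right, disparity), left)
    return supervised + lam * photometric

In practice the photometric term is often combined with a structural-similarity (SSIM) component and a smoothness prior, and the supervised term is evaluated only where lidar returns exist, as the mask above suggests.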
2. Motivation

The undeniable success of deep neural networks in computer vision has encouraged researchers to pursue the problem of estimating depth from a single image [5, 20, 6, 9, 17]. This is, no doubt, a noble endeavor: if it were possible to accurately estimate depth from a single image, then the complexity (and hence cost) of the hardware needed would be dramatically reduced, which would broaden the applicability substantially. An excellent overview of existing work on monocular depth estimation can be found in [9].

Nevertheless, there are reasons to be cautious about the reported success of monocular depth. To date, monocular depth solutions, while yielding encouraging preliminary results, are not at the point where reliable information (from a robotics point of view) can be expected from them. And although such solutions will continue to improve, monocular depth will never overcome well-known fundamental limitations, such as the need for a world measurement to infer absolute depth, and the ambiguity that arises when a photograph is taken of a photograph (an important observation for biometric and security systems).

One of the motivations for monocular depth is a long-standing belief that stereo is only useful at close range. It has been widely reported, for example in [10], that beyond about 6 meters the human visual system is essentially monocular. But there is mounting evidence that the human stereo system is actually much more capable than that. Multiple studies have shown metric depth estimation up to 20 meters [18, 1]; and, although error increases as disparity increases [13], controlled experiments have confirmed that scaled disparity can be estimated up to 300 m, even without any depth cues from monocular vision [22]. Moreover, since the human visual system is capable of estimating disparity as small as a few seconds of arc [22], there is reason to believe that the distance could be 1 km or greater, with some evidence supporting such a claim provided by the experiments of [4]. Note that an artificial stereo system whose baseline is wider than the average 65 mm interpupillary distance of the human visual system has the potential to provide even greater accuracy.
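A rough back-of-the-envelope calculation illustrates why a wider baseline helps; the focal length, baselines, and sub-pixel matching error below are assumed values for illustration, not figures from the paper. For a rectified rig with focal length f (in pixels) and baseline B, depth follows Z = f * B / d, so a fixed disparity error dd maps to a depth error of roughly dZ ~ Z^2 * dd / (f * B), which grows quadratically with distance and shrinks as the baseline grows.

# Illustrative only: depth error vs. baseline, with assumed camera parameters.
f_px = 720.0       # focal length in pixels (assumed)
err_px = 0.25      # assumed sub-pixel disparity error of the matcher

for baseline_m in (0.065, 0.54):           # human-like vs. car-mounted rig
    for depth_m in (20.0, 100.0):
        disparity_px = f_px * baseline_m / depth_m
        depth_err_m = depth_m ** 2 * err_px / (f_px * baseline_m)
        print(f"B={baseline_m:.3f} m  Z={depth_m:5.1f} m  "
              f"d={disparity_px:5.2f} px  dZ~{depth_err_m:5.1f} m")

With these assumed numbers, a 0.54 m baseline keeps the error at 100 m to roughly 6 m, whereas a 65 mm human-like baseline yields on the order of 50 m of uncertainty at the same range.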
This question takes on renewed significance in the context of self-driving cars, since most automobile manufacturers and experimental autonomous vehicles do not install stereo cameras in their vehicles. (To the authors' knowledge, at the time of this writing stereo cameras can be found only on certain models of Mercedes and Subaru vehicles; no ...) Rather, these systems rely on various combinations of monocular cameras, lidar, radar, and sonar sensors. For detecting static obstacles such as trees, poles, railings, and concrete barriers, most systems rely on cameras and/or lidar. Although it is beyond the scope of this paper whether monocular cameras are sufficient for self-driving behavior (certainly people with monocular vision can drive safely in most situations), or whether stereo is better than lidar, we argue that the proper engineering approach to such a safety-critical system is to leverage all available sensors rather than assume they are not needed; thus, we believe that it is important to accurately assess the increased error in depth estimation when relying upon monocular cameras.

At typical highway speeds, the braking distance required to completely stop before impact necessitates observing an unforeseen stopped object approximately 100 m away. Intrigued by the reported success of monocular depth, we tried some recent algorithms, only to discover that monocular depth is not able to achieve accuracies anywhere close to that requirement. We then turned our attention to stereo, where significant progress has been made in recent years in applying deep learning to the problem [25, 24, 11, 27, 26, 29, 8, 15, 23]. An excellent overview of recent stereo algorithms can be found in [15]. In this flurry of activity, a variety of architectures have been proposed, but there has been no systematic study as to how these design choices impact quality. One purpose of this paper is thus to investigate several of these options in order to quantify their impact, which we do in Sec. 5. In the context of this study, we develop a novel semi-supervised stereo approach, which we present in Sec. 4. First, however, we illustrate the limitations of monocular depth estimation in the next section.

3. Difficulties of Monocular Depth Estimation

To appreciate the gap between mono and stereo vision, consider the image of Fig. 1, with several points of interest highlighted. Without knowing the scene, if you were to ask yourself whether the width of the near road (on which the car (A) sits) is greater than the width of the far tracks (the distance between the near and far poles (E and F)), you might be tempted to answer in the affirmative. After all, the road not only occupies more pixels in the image (which is to be expected, since it is closer to the camera), but it occupies orders of magnitude more pixels. We showed this image to several people in our lab, and they all reached the same conclusion: the road indeed appears to be significantly wider. As it turns out, if this image is any indication, people are not very good at estimating metric depth from a single image. (Specifically, we asked 8 people to estimate the distance to the fence (ground truth 14 m) and the distance to the building (ground truth 30 m).)

The output of a leading monocular depth algorithm, called MonoDepth [9], is shown in Fig. 2, along with the output of our stereo depth algorithm. At first glance, both results appear plausible. Although the stereo algorithm preserves crisper object boundaries and appears at least slightly more accurate, it is difficult to tell from the grayscale im-