Visually Imbalanced Stereo Matching

Yicun Liu∗1,3†, Jimmy Ren∗1, Jiawei Zhang1, Jianbo Liu1,2, Mude Lin1
1 SenseTime Research, 2 The Chinese University of Hong Kong, 3 Columbia University
{rensijie,zhangjiawei,linmude}@sensetime.com1 [email protected] [email protected]

∗Equal contribution. Code will be available at github.com/DandilionLau/Visually-Imbalanced-Stereo
†Work was done during internship at SenseTime Research.

Abstract

Understanding of the human visual system (HVS) has inspired many computer vision algorithms. Stereo matching, which borrows the idea from human stereopsis, has been extensively studied in the existing literature. However, scant attention has been paid to a typical scenario where the binocular inputs are qualitatively different (e.g., a high-res master camera and a low-res slave camera in a dual-lens module). Recent advances in human optometry reveal the capability of the human visual system to maintain coarse stereopsis under such visually imbalanced conditions. Inspired by this biological finding, it is natural to ask: do stereo machines share the same capability? In this paper, we carry out a systematic comparison to investigate the effect of various imbalanced conditions on currently popular stereo matching algorithms. We show that, resembling the human visual system, those algorithms can handle limited degrees of monocular downgrading but are also prone to collapse beyond a certain threshold. To avoid such collapse, we propose a solution that recovers the stereopsis by a joint guided-view-restoration and stereo-reconstruction framework. We show the superiority of our framework on the KITTI dataset and its extension to real-world applications.

Figure 1. Illustration of visually imbalanced scenarios: (a) Input left view. (b) Input downgraded right view, from top to bottom: monocular blur, monocular blur with rectification error, monocular noise. (c) Disparity predicted by mainstream monocular depth (only left view as input) and stereo matching (stereo views as input) algorithms: from top to bottom are DORN [9], PSMNet [7] and CRL [31]. (d) Disparity generated by our proposed framework.

1. Introduction

There has been remarkable progress in understanding and mimicking the human visual system, and many works focus on sensing the 3D structures surrounding us. In the human visual brain, depth perception is mediated by a set of scale-variant spatial filters, where low-frequency stimuli establish coarse stereopsis and high-frequency visual cues then escalate stereo acuity [5]. Early researchers in computer vision defined this problem as searching for corresponding pixels [3], edges [1], or patches [2] among different views. A taxonomy and benchmark were later constructed in [35]. With large datasets becoming available, NN-based stereo algorithms exhibited superior performance [45].

However, little attention has been paid to imbalanced conditions between the stereo views. In many real-world cases, the visual quality of the left view and the right view is not guaranteed to match. It is common for human stereopsis to suffer from different degrees of anisometropia and astigmatism in binocular vision [6, 21]; likewise, in computer vision, the master and slave cameras in a dual-lens module often differ in resolution, lens blur, imaging modality, noise tolerance, and rectification accuracy [43].

Only recently have discoveries in optometry revealed that humans can maintain decent stereo acuity with imbalanced binocular signals. In fact, monocular downgrading jeopardizes stereo acuity for high-spatial-frequency components, but low-frequency targets such as structures are barely affected, through a natural process called spatial frequency tuning [24]. With this finding in mind, we ask: are stereo machines able to handle qualitatively imbalanced inputs?

Figure 2. Intuition behind our proposed guided view synthesis framework: predicting the latent view based solely on a single view is an ill-posed problem, as there exist many plausible novel views $I_R^i$ with different disparities. However, with the geometric information in the downgraded right view $I_r$ as a guide, the task is achievable. Even though the high-frequency component of $I_r$ is missing, a rough object contour can still be inferred, as shown in our toy example. The contour provides a positional hint for the later displacement prediction.

We design a systematic comparison to answer this question. In a controlled-variable setting, we test several major monocular degradation effects on current mainstream stereo matching algorithms, including both NN-based and heuristics-based methods. By selectively increasing the corruption level of each monocular downgrading factor, we show that existing stereo matching frameworks tolerate only mild degrees of monocular downgrading, and that stereo matching accuracy quickly degenerates as the monocular downgrading increases. Similar to human stereopsis, all tested algorithms are observed to 'collapse' beyond a specific downgrading threshold, leading to unreasonable disparity predictions.

There exist potential cures to alleviate such collapse, but each of them has its limits. One intuitive approach is to conduct depth estimation based only on the high-quality monocular view. However, it cannot generalize well to unseen scenarios because it relies on prior knowledge of object size and other physical attributes. Another approach is to conduct stereo matching at a lower resolution as a compromise for the information loss in the downgraded view, but low-res solutions cannot satisfy the demand for sharp disparity in tasks like portrait defocusing.

Taking a detour, instead of directly predicting the disparity from imbalanced views, it is easier to first restore the corrupted view using the high-quality textures in the main view and then conduct stereo matching. Given the vague object contour observed in the corrupted view, human beings are good at hallucinating the missing textures by 'moving' the objects from the high-quality view to the corresponding position in the corrupted view. The problem of predicting the dense disparity map between imbalanced binocular inputs can thus be decomposed into two sub-problems: restoring the view, guided by the limited structural information in the corrupted view, and reconstructing stereopsis based on the restored view. We formulate the first sub-problem as a guided view-synthesis process and design a structure-aware displacement prediction network to solve it. Our approach achieves unprecedented performance and demonstrates impressive generalization capabilities on a dataset with real-world imbalanced factors.

The main contributions of our work are threefold:

• We discover that constructing stereopsis from imbalanced views is not only feasible for human visual systems but also achievable for computer vision. It is the first work to consider the imbalanced condition for stereo matching.
• We explore the potential of current stereo machines on the task of visually imbalanced stereo matching and examine the threshold of 'stereo collapse' on different models and various imbalanced conditions.
• We exploit a guided view synthesis framework to restore the corrupted view and tackle scenarios beyond 'stereo collapse', which is even out of the capability of human stereopsis.

2. Related Work

Depth Perception. Diverse methods have been proposed to predict depth from a single view [26], stereo views [27], and multiple views [19]. Among these variants, stereo matching is currently the popular way to achieve low-cost depth perception. For the traditional stereo matching setup, a representative taxonomy was proposed in [35]. Many comprehensive testbeds were later introduced for quantitative evaluation of different stereo frameworks [10, 35]. Analyses of subpixel calibration error and radiometric differences appear in [15, 14]. Significant progress in single-view depth estimation emerged in [9], but estimating depth from a single view remains hard given the ill-posed nature of the problem.

Human Stereopsis. Early theoretical frameworks of binocular disparity detection mechanisms were proposed in [41, 11]. Neural models of frequency-variant spatial filters in the human visual brain were then proposed and developed in [5, 4]. To better analyze the neural network subserving stereopsis, a large body of work has been conducted to characterize the functional attributes of these basic operators in the visual brain [32, 33, 38].

| Model | All / Est | 1X | 2X | 3X | 5X | 8X | 10X | 15X | 20X | SVS |
|---|---|---|---|---|---|---|---|---|---|---|
| SGBM | D1-bg | 14.13% | 15.66% | 19.53% | 24.20% | 62.60% | 79.49% | 83.98% | 89.11% | 25.18% |
| SGBM | D1-fg | 21.99% | 22.35% | 25.36% | 32.32% | 58.36% | 80.16% | 85.48% | 90.39% | 20.77% |
| SGBM | D1-all | 15.88% | 16.76% | 20.49% | 25.88% | 61.89% | 79.60% | 84.23% | 89.32% | 24.44% |
| DispNetC | D1-bg | 3.09% | 3.12% | 3.31% | 4.69% | 11.40% | 24.38% | 89.51% | 98.16% | 25.18% |
| DispNetC | D1-fg | 3.16% | 3.21% | 3.28% | 4.23% | 11.08% | 24.72% | 89.75% | 98.35% | 20.77% |
| DispNetC | D1-all | 3.10% | 3.13% | 3.30% | 4.62% | 11.35% | 24.44% | 89.55% | 98.18% | 24.44% |
| CRL | D1-bg | 3.02% | 3.05% | 3.25% | 4.84% | 12.24% | 29.16% | 94.47% | 99.42% | 25.18% |
| CRL | D1-fg | 2.89% | 3.00% | 3.18% | 4.41% | 12.36% | 28.76% | 95.27% | 99.66% | 20.77% |
| CRL | D1-all | 3.00% | 3.04% | 3.24% | 4.77% | 12.26% | 29.09% | 94.60% | 99.64% | 24.44% |
| PSMNet | D1-bg | 2.36% | 2.75% | 5.63% | 8.23% | 20.86% | 91.81% | 99.32% | 99.89% | 25.18% |
| PSMNet | D1-fg | 5.72% | 5.78% | 8.42% | 10.25% | 18.85% | 100.00% | 100.00% | 100.00% | 20.77% |
| PSMNet | D1-all | 2.92% | 3.25% | 6.21% | 10.01% | 20.52% | 92.93% | 99.91% | 99.97% | 24.44% |

Table 1. Performance of stereo algorithms under different levels of monocular blur: we mark the observed 'turning point' in red. D1-bg/D1-fg/D1-all refer to the average percentage of outliers over background/foreground/full regions, respectively.
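To make the D1 columns of Table 1 concrete, the following is a minimal sketch of how such an outlier percentage could be computed. It assumes the usual KITTI 2015 convention, which the text above does not restate: a pixel counts as an outlier when its disparity error exceeds both 3 px and 5% of the ground-truth disparity, and a ground-truth value of 0 marks a pixel without annotation. The function name `d1_outlier_rate`, its parameters, and the toy arrays are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def d1_outlier_rate(pred, gt, region=None, abs_thresh=3.0, rel_thresh=0.05):
    """Percentage of outlier pixels among valid ground-truth pixels.

    Assumed convention (KITTI 2015 style): a pixel is an outlier if its
    absolute disparity error exceeds abs_thresh pixels AND rel_thresh
    times the ground-truth disparity. gt == 0 is treated as "no label".
    """
    valid = gt > 0
    if region is not None:
        # Restrict evaluation to a boolean mask, e.g. foreground objects.
        valid = valid & region.astype(bool)
    n_valid = int(valid.sum())
    if n_valid == 0:
        return float("nan")
    err = np.abs(pred - gt)
    outlier = (err > abs_thresh) & (err > rel_thresh * gt)
    return 100.0 * int((outlier & valid).sum()) / n_valid

# Toy 2x3 disparity maps (values in pixels); the 0 in gt is an unlabeled pixel.
gt = np.array([[10.0, 20.0, 0.0],
               [100.0, 50.0, 40.0]])
pred = np.array([[10.5, 24.0, 7.0],
                 [104.0, 50.0, 36.0]])
fg = np.array([[True, True, False],
               [False, False, True]])  # hypothetical foreground mask

d1_all = d1_outlier_rate(pred, gt)        # full region
d1_fg = d1_outlier_rate(pred, gt, fg)     # foreground only
d1_bg = d1_outlier_rate(pred, gt, ~fg)    # background only
print(f"D1-all={d1_all:.2f}%  D1-fg={d1_fg:.2f}%  D1-bg={d1_bg:.2f}%")
```

In the toy example, the pixel with ground truth 100 and error 4 px is not an outlier because 4 px is below 5% of 100, whereas the same 4 px error at ground truth 20 or 40 is. This relative term is why large-disparity (near) regions are judged more leniently in absolute terms under this convention.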
