What Makes 2D-to-3D Stereo Conversion Perceptually Plausible?

Petr Kellnhofer    Thomas Leimkühler    Tobias Ritschel    Karol Myszkowski    Hans-Peter Seidel
MPI Informatik
[Figure 1 panels: a) Reference, b) Smooth remapping, c) Spatial blur, d) Object removal, e) Temporal blur]

Figure 1: We intentionally introduce the depth distortions typically produced by 2D-to-3D conversion into close-to-natural computer-generated images (a, top) such as the one from the MPI Sintel dataset [Butler et al. 2012] where ground-truth depth is available (a, bottom). User response to stereo images (b–e, top) showing typical disparity distortions (b–e, bottom) gives an indication whether a certain amount of distortion results in functional equivalence for natural images or not. According to numerical measures such as PSNR or perceptual disparity metrics, the depth is considered very different (inequality sign, bottom), whereas it is functionally equivalent (equivalence sign, top).

Abstract

Different from classic reconstruction of physical depth in computer vision, depth for 2D-to-3D stereo conversion is assigned by humans using semi-automatic painting interfaces and, consequently, is often dramatically wrong. Here we seek to better understand why it still does not fail to convey a sensation of depth. To this end, four typical disparity distortions resulting from manual 2D-to-3D stereo conversion are analyzed: i) smooth remapping, ii) spatial smoothness, iii) motion-compensated, temporal smoothness, and iv) completeness. A perceptual experiment is conducted to quantify the impact of each distortion on the plausibility of the 3D impression relative to a reference without distortion. Close-to-natural videos with known depth were distorted in one of the four above-mentioned aspects, and subjects had to indicate if the distortion still allows for a plausible 3D effect. The smallest amount of distortion that results in a significant rejection suggests a conservative upper bound on the quality requirement of 2D-to-3D conversion.

CR Categories: I.3.3 [Computer Graphics]: Three-Dimensional Graphics and Realism—Display Algorithms

Keywords:

1 Introduction

The majority of images and videos available is 2D, and automatic conversion to 3D is a long-standing challenge [Zhang et al. 2011]. The requirements imposed on the precise meaning of "3D" might differ: For applications such as view synthesis, surveillance, autonomous driving, human body tracking, relighting or fabrication, accurate physical depth is mandatory. Obviously, binocular disparity can be computed from such accurate physical depth, allowing for the synthesis of a stereo image pair using image-based rendering. However, it is not clear what depth fidelity is required to produce plausible disparity in natural images, which include other monocular cues.

In this paper we argue that physically accurate depth is not required to produce plausible disparity. Instead, we provide evidence that as long as four main properties of the disparity hold, it is perceived as plausible. First, the absolute scale of disparity is not relevant, and any reasonable smooth remapping [Jones et al. 2001; Lang et al. 2010; Didyk et al. 2012] is perceived as equally plausible and may even be preferred in terms of viewing comfort and realism. Therefore, we can equally well use disparity that agrees with the physical one only up to a smooth remapping. Second, not every detail in the scene can be augmented with plausible depth information, resulting in isolated objects that remain 2D or lack disparity relative to their content. We will see that, unless those objects are large or salient, this defect often remains largely unnoticed. Third, the natural statistics of depth and luminance indicate that depth is typically spatially smooth, except at luminance discontinuities [Yang and Purves 2003; Merkle et al. 2009]. Therefore, not reproducing disparity details can be acceptable and is often not even perceived, except at luminance edges [Kane et al. 2014]. Fourth and finally, the temporal perception of disparity allows for a temporally coarse disparity map, as fine temporal variations of disparity are not perceivable [Howard and Rogers 2012; Kane et al. 2014]. Consequently, as long as the error is 2D-motion compensated [Shinya 1993], depth from one point in time can be used to replace depth at a different, nearby point in time.
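To make the first property concrete, a smooth remapping can be as simple as a monotonic compressive curve applied to a normalized disparity map. The following minimal sketch uses an illustrative log curve and an output range of our own choosing; it is not the specific operator of any of the cited works [Jones et al. 2001; Lang et al. 2010; Didyk et al. 2012]:

import numpy as np

def smooth_remap(disparity, out_min=-10.0, out_max=10.0):
    # Normalize the input disparity map (any units) to [0, 1].
    lo, hi = float(disparity.min()), float(disparity.max())
    t = (disparity - lo) / max(hi - lo, 1e-6)
    # A smooth, monotonic compressive curve (one illustrative choice of many).
    t = np.log1p(9.0 * t) / np.log(10.0)
    # Rescale into a target range, e.g. screen disparity in pixels.
    return out_min + t * (out_max - out_min)

Any smooth, monotonic curve of this kind preserves depth ordering while changing absolute scale, which is exactly the degree of freedom the first property claims viewers tolerate.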
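The fourth property likewise lends itself to a small sketch: given dense 2D optical flow, a depth map from a nearby frame can stand in for the current one after motion compensation. The nearest-neighbor backward warp below is a minimal illustration under that assumption, not the actual method of Shinya [1993]:

import numpy as np

def reuse_depth(prev_depth, flow):
    # flow[y, x] = (dx, dy): 2D motion from the previous frame to the current one.
    # Backward warp: each current pixel fetches depth from where it came from.
    h, w = prev_depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs - flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys - flow[..., 1]).astype(int), 0, h - 1)
    return prev_depth[src_y, src_x]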
2 Previous work

In this section, we review the three main approaches to 2D-to-3D conversion (manual, automatic and real-time), the use of luminance and depth edges in computational stereo, as well as perceptual modeling of binocular and monocular depth cues.

2D-to-3D conversion
Manual conversion produces high-quality results but requires human intervention, which can result in substantial cost. Such methods are based on painting depth annotations [Guttmann et al. 2009] with special user interfaces [Ward et al. 2011] and propagation in space and time [Lang et al. 2012]. The semi-supervised method of Assa and Wolf [2007] combines cues extracted from an image with user intervention to create depth parallax.

Automatic conversion does not need manual effort, but does require lengthy computation to produce results of medium quality. The system of Hoiem et al. [2005] infers depth from monocular images using a low number of labels. Make3D [Saxena et al. 2009] is based on learning appearance features to infer depth. Both approaches show good results for static street-level scenes at super-pixel resolution but require substantial computation. Non-parametric approaches rely on a large collection of 3D images [Konrad et al. 2012] or 3D videos [Karsch et al. 2014] that have to contain an exemplar similar to the 2D input. For cel animation with outlines, T-junctions have been shown to provide sufficient information to add approximate depth [Liu et al. 2013].

Real-time methods to produce disparity from 2D input videos usually come at low visual quality. A simple and computationally cheap solution is to time-shift the image sequence independently for each eye, such that a space-shift provides a stereo image pair [Murata et al. 1998]. This requires an estimate of the camera velocity and only works for horizontal motions. For other rigid motions, structure-from-motion (SfM) can directly be used to produce depth maps [Zhang et al. 2007]. SfM makes strong assumptions about the scene content, such as a rigid scene with camera motion. Additionally, individual cues such as color [Cheng et al. 2010], motion [Huang et al. 2009] or templates [Yamada and Suzuki 2009] were combined into a disparity estimate in an ad-hoc fashion.

Commercial 2D-to-3D solutions [Zhang et al. 2011] based on custom hardware (e.g., JVC's IF-2D3D1 Stereoscopic Image Processor) and software (e.g., DDD's Tri-Def-Player) reveal little about the techniques they use, but anecdotal testing shows room for improvement [Karsch et al. 2014].

Perception of luminance and depth
Since luminance and depth edges often coincide, e.g., at object silhouettes, full-resolution RGB images have been used to guide depth map upsampling both in the spatial [Kopf et al. 2007] and the spatio-temporal [Richardt et al. 2012; Pajak et al. 2014] domain. Analysis of a database of range images for natural scenes reveals that depth maps mostly consist of piecewise smooth patches separated by edges at object boundaries [Yang and Purves 2003]. This property is used in depth compression, where depth edge positions are explicitly encoded, e.g., by using piecewise-constant or linearly-varying depth representations between edges [Merkle et al. 2009]. This in turn leads to significantly better depth-image-based rendering (DIBR) [Fehn 2004] quality compared to what is possible at the same bandwidth with MPEG-style compressed depth, which preserves more depth features at the expense of blurring depth edges.
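To illustrate the guided-upsampling idea, the brute-force sketch below follows the spirit of joint bilateral upsampling [Kopf et al. 2007]: spatial weights are Gaussian on the coarse depth grid, while range weights compare full-resolution colors, so interpolated depth snaps to luminance edges rather than blurring across them. The parameter values and the nearest-site color lookup are simplifications of our own:

import numpy as np

def joint_bilateral_upsample(depth_lo, rgb_hi, factor, sigma_s=2.0, sigma_r=0.1):
    # depth_lo: (lh, lw) coarse depth; rgb_hi: (h, w, 3) guide image in [0, 1].
    # factor: upsampling factor, so h is approximately lh * factor.
    lh, lw = depth_lo.shape
    h, w = rgb_hi.shape[:2]
    r = int(2 * sigma_s)                     # spatial radius on the coarse grid
    out = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            cy, cx = y / factor, x / factor  # pixel position on the coarse grid
            num = den = 0.0
            for j in range(max(0, int(cy) - r), min(lh, int(cy) + r + 1)):
                for i in range(max(0, int(cx) - r), min(lw, int(cx) + r + 1)):
                    ds2 = (j - cy) ** 2 + (i - cx) ** 2      # spatial distance
                    gy = min(h - 1, int(j * factor))         # coarse sample's
                    gx = min(w - 1, int(i * factor))         # site in the guide
                    dr2 = float(np.sum((rgb_hi[y, x] - rgb_hi[gy, gx]) ** 2))
                    wgt = np.exp(-ds2 / (2 * sigma_s ** 2) - dr2 / (2 * sigma_r ** 2))
                    num += wgt * depth_lo[j, i]
                    den += wgt
            out[y, x] = num / max(den, 1e-12)
    return out

Real implementations vectorize or approximate this per-pixel loop; the point here is only how a luminance guide keeps depth edges sharp while smooth regions are interpolated.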
Psychophysical measurements show how much disparity maps can be spatially and temporally filtered without causing visible depth differences [Kane et al. 2014]. In this work, we conduct similar experiments for natural scenes involving monocular cues.

Surprisingly, depth edges appear sharp, even though the human ability to resolve them in space and time is low. One explanation for this is that the perceived depth edge location is determined mostly by the position of the corresponding luminance edge [Robinson and MacLeod 2013].

In previous work, perception was taken into account for stereography when disparity is given [Didyk et al. 2012], but it was routinely ignored when inferring disparity from monocular input for 2D-to-3D conversion. Interestingly, depth discontinuities that are not accompanied by luminance edges of sufficient contrast contribute poorly to depth perception and do not require precise reconstruction in stereo 3D rendering [Didyk et al. 2012].

3 Experiment

In this experiment we would like to find out how typical 2D-to-3D stereo conversion distortions affect the plausibility of a stereo image or movie. To this end, we intentionally reduce physical disparity in one of four aspects and collect the users' responses.

3.1 Materials

Stimuli
Stimuli were distorted variants of a given stereo video content with known, undistorted disparity. We used four video sequences from the MPI Blender Sintel movie dataset [Butler et al. 2012], and the Big Buck Bunny movie by The Blender Foundation, which provide close-to-natural image statistics combined with ground-truth depth and optical flow. Additionally, we used four rendered stereo image sequences with particularly discontinuous motion that are especially susceptible to temporal filtering. Stimuli were presented as videos for temporal distortions and as static frames for spatial distortions, to prevent threshold elevation by the presence of motion, which would underestimate the effect for a theoretical completely static scene. The scenes did not show any prominent specular areas that required special handling [Dabala et al. 2014].

Distortions were performed in linear space with a normalized range of (0, 1). For stereo display, this normalized depth was mapped to vergence angles corresponding to a depth range of 57–65 cm surrounding a display at 60 cm distance. This distribution around the display plane reduces the vergence-accommodation conflict.
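As a worked example of this display mapping, the sketch below converts normalized depth to an angular disparity relative to the screen plane using basic vergence geometry. The interocular distance of 6.5 cm and the linear placement of depth within the 57–65 cm range are assumptions for illustration; only the viewing distance and depth range come from the text above:

import math

IPD_CM = 6.5                   # interocular distance: an assumed value, not from the paper
SCREEN_CM = 60.0               # viewing distance, as in the experiment
NEAR_CM, FAR_CM = 57.0, 65.0   # displayed depth range around the screen

def vergence_deg(distance_cm):
    # Angle between the two eyes' lines of sight when fixating at this distance.
    return math.degrees(2.0 * math.atan2(IPD_CM / 2.0, distance_cm))

def disparity_deg(z):
    # Normalized depth z in (0, 1) -> angular disparity relative to the screen.
    # Linear placement within the depth range is an illustrative assumption.
    d = NEAR_CM + z * (FAR_CM - NEAR_CM)
    return vergence_deg(d) - vergence_deg(SCREEN_CM)

# Extremes of the range: about +0.33 deg (nearest) and -0.48 deg (farthest).
print(disparity_deg(0.0), disparity_deg(1.0))

For this assumed interocular distance, the entire stimulus range stays within roughly half a degree of the screen plane, which is consistent with the stated goal of keeping vergence close to accommodation.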