
Multimedia at Work
Editor: Wenjun Zeng, University of Missouri-Columbia

Converting 2D Video to 3D: An Efficient Path to a 3D Experience

Xun Cao, Tsinghua University
Alan C. Bovik, University of Texas at Austin
Yao Wang, Polytechnic Institute of New York University
Qionghai Dai, Tsinghua University

Editor's Note
The 3D experience is poised to be the future of entertainment. The path to that future, however, is bumpy. Despite widely available 3D display devices, a lack of 3D content impedes the 3D industry's rapid growth. Here the authors discuss some common 3D content-creation processes and the key elements that affect quality, and present an approach that converts 2D content to 3D content as an efficient path to a 3D multimedia experience.
—Wenjun Zeng

Wide-scale deployment of 3D video technologies continues to experience rapid growth in such high-visibility areas as cinema, TV, and mobile devices. Of course, the visualization of 3D videos is actually a 4D experience, because three spatial dimensions are perceived as the video changes over the fourth dimension of time. However, because it's common to describe these videos as simply "3D," we shall do the same, and understand that the time dimension is being ignored. So why is 3D suddenly so popular? For many, watching a 3D video allows for a highly realistic and immersive perception of dynamic scenes, with more deeply engaging experiences, as compared to traditional 2D video. This, coupled with great advances in 3D technologies and the appearance of the most successful movie of all time in vivid 3D (Avatar), has apparently put 3D video production on the map for good.

A typical 3D video system includes the steps of content creation, coding and transmission, reconstruction, and display. Three-dimensional processing and display technologies have dramatically advanced—for example, glasses-free (auto-stereoscopic) 3D displays are commercially available on small-screen mobile devices. Large-format 3DTV cable receivers and display systems are now commonly found in the home, which is regarded as the biggest development in home entertainment since high-definition TV. Yet, aside from the wide availability of theatrical 3D productions, there's still surprisingly little 3D content (3D image and video) in view of the tremendous number of 3D displays that have been produced and sold. The lack of good-quality 3D content has become a severe bottleneck to the growth of the 3D industry.

There is, of course, an enormous amount of high-quality 2D video content available, and it's alluring to think that this body of artistic and entertainment data might be converted into 3D formats of similar high quality. Our goal in this article is to introduce and explain emerging theories and algorithms for 2D-to-3D video conversion that seek to convert ordinary 2D videos into 3D.

3D content generation
An obvious approach toward ameliorating the current shortage of good-quality 3D video content is to explore effective 3D content-creation methods. Most contemporary 3D content (for example, commercials and movies) is created using commercial modeling software, such as Autodesk's 3ds Max and Maya. The basic mode of operation is to first build the target object's 3D structure (a graphical mesh representation), then render the meshes using surface texture and lighting models to create realistic appearances (see the middle of Figure 1a). Computer graphics animation is a successful example of this category of 3D content production (for example, Shrek 3). The advantage of modeling software is that the synthesized 3D models are accurate and easily manipulated. The drawback is that artificial 3D models are quite time consuming to create, particularly when highly realistic or otherwise complex scenes are required.
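At toy scale, this build-then-render pipeline can be illustrated with a triangle mesh shaded by a simple Lambertian lighting model. This is only a sketch of the idea, not how production tools such as 3ds Max or Maya actually render; the mesh data and light direction are hypothetical example values.

```python
import numpy as np

# Toy version of the model-then-render pipeline: a mesh is a set of
# vertices plus triangular faces; "rendering" here is just Lambertian
# shading per face (real tools add textures, materials, and cameras).

# A single triangle lying in the z = 0 plane (hypothetical example data).
vertices = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
faces = np.array([[0, 1, 2]])

def face_normals(verts, faces):
    """Unit normal of each triangular face."""
    v0, v1, v2 = (verts[faces[:, i]] for i in range(3))
    n = np.cross(v1 - v0, v2 - v0)
    return n / np.linalg.norm(n, axis=1, keepdims=True)

def lambertian_shade(verts, faces, light_dir):
    """Brightness of each face under a directional light (Lambert's law)."""
    light = np.asarray(light_dir, dtype=float)
    light = light / np.linalg.norm(light)
    return np.clip(face_normals(verts, faces) @ light, 0.0, 1.0)

# A light shining along the +z axis hits the triangle head-on.
brightness = lambertian_shade(vertices, faces, light_dir=[0.0, 0.0, 1.0])
```

A face whose normal points directly at the light receives full brightness; faces angled away darken smoothly, which is what gives a rendered mesh its realistic shading.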

1070-986X/11/$26.00 © 2011 IEEE. Published by the IEEE Computer Society.

Beyond graphical modeling, 3D capture devices have also greatly progressed in recent years. Stereoscopic cameras and time-of-flight (TOF) cameras, both consumer and professional, are increasingly common and used in a variety of applications. Similar to the binocular vision system found in humans and many other animals, a stereoscopic camera setup with two cameras, located in nearly the same place and pointing in almost the same direction, captures the same visual scene (see the top of Figure 1a). In this way, the video is captured from two viewpoints with slightly different perspectives, yielding slightly different projections onto the cameras' recording apparatus. The left and right views are then integrated in specific visual centers of the brain to form the 3D depth experience, which is called the cyclopean image. This 3D experience is quite different from viewing the 2D images separately. The sense of depth, 3D perspective, and 3D shape is profoundly amplified beyond the 3D experience delivered by other visual cues, such as motion, size, surface-luminance gradients, and so on.

Figure 1. Generating 3D content. (a) Contemporary methods used for 3D content creation. (b) Stereoscopic camera and the concept of disparity.

Alternatively, TOF 3D cameras directly capture depth information using a single camera. The basic principle of TOF cameras is as follows: the camera emits acoustic, electromagnetic, or other waveforms; these are reflected from object(s) in the scene, some of which are incident upon the sensor's detector. The distance between object and sensor, at each pixel, can be computed from the travel time of the emitted, reflected, and sensed waveform. Once the depth information for each pixel is known, left and right views can be rendered for 3D reproduction using a depth-image-based rendering technique.1

While 3D capture, processing, and display techniques have evolved rapidly in recent years, there's a relative dearth of 3D content compared to potential market demand. The ocean of older 2D videos captured during past decades contains a treasure trove of 3D possibilities. Because generating native 3D content is a relatively slow process, 2D-to-3D video conversion offers a tremendous opportunity to alleviate this shortage.

The target accomplishment of 2D-to-3D conversion techniques is to be able to estimate the true 3D information present in the scene (for example, in the form of depths relative to some baseline) from single, monocular video streams. Once the 3D information has been estimated, it can be deployed to create a 3D (stereoscopic) video from the existing 2D video. The successful development of efficient and reliable 2D-to-3D video-conversion algorithms could make it possible to create large quantities of high-quality 3D content at a low cost in terms of time and resources. By analogy, just as many classic movies were once colorized, and as they are now being up-sampled to high definition using super-resolution techniques, we believe that much of the available corpus of attractive archived videos will be converted to 3D formats.

Perceptual depth cues
The real world is 3D; however, during the process of projection through the lens, one of the three dimensions is lost: depth. To obtain the survival advantage afforded by depth information, humans have evolved two frontally placed eyes separated by an interocular distance of about 65 millimeters (in adults). Spatial points in the scene project to slightly different positions in the 2D projections on the left and right eyes; this displacement between the projected positions is called disparity, as Figure 1b illustrates. If the left and right views are placed so that the optical axes of the eyes (or cameras) are parallel, then the disparity between the locations of points that were projected from the 3D scene onto the left and right 2D image planes is inversely proportional to depth. Such a camera configuration is called a parallel optical axis binocular geometry. This camera arrangement simplifies algorithm design in certain ways but doesn't exploit other advantages



available if the cameras can toe-in, or verge, as human eyes are normally able to do.

A normal human viewer has eyes that are highly mobile and that engage in a number of types of eye movements, including binocular vergence, whereby the optical axes of the two eyes fixate on the same point. Cameras can also be made to verge under computer control, provided that points of stereo fixation can be selected in the 3D scene. A vergent stereo imaging geometry is a powerful means for concentrating the visual attention of a vision system on a particular region of 3D interest. However, a more complex geometric relationship exists between points that are projected to a vergent stereo pair of videos. For simplicity, we will assume that the acquisition cameras are arranged in parallel-axis geometry in the following discussion. In practice, this is not always the case, as the videographer will typically wish to capture the stereo content using a geometry that will create stereo pairs that are comfortable to view in 3D. This is often correctly done using cameras in a vergent geometry that can capture images that are subsequently more consistent with those captured by vergent eyes.

While stereopsis, or binocular depth perception, is one of the most important modes of human visual-depth sensing, it's not the only source of 3D information, nor is it necessarily the most powerful 3D sensing cue. Indeed, people might still perceive depth even with a single eye. Other people, although equipped with two working eyes, have partial or complete stereo deficiency. The number of people who have imperfect (or nonexistent) stereo vision lies between 4 and 10 percent of the overall population, yet these people still experience varying degrees of 3D visual sensation. Depth perception is stimulated not only by binocular depth disparity, but also by a variety of monocular depth cues, such as occlusion (where one visible object covers another), motion parallax (changes in pose with motion), diminution in size or brightness with distance, shading, and texture gradients, and so on.2 In the absence of binocular information, 2D-to-3D video conversion would enable the creation of 3D video content by using monocular depth cues such as those just mentioned. In the following, we describe how these monocular depth cues can be incorporated into 2D-to-3D video-conversion systems.

Fully automatic 2D-to-3D video conversion
Fully automatic 2D-to-3D video conversion means that 3D information is estimated from 2D video clips automatically, without a priori knowledge of depths in the scene. This technique aims at scenarios where humans are unable to participate in the depth-computation process, as well as real-time applications where human participation is impossible. For example, a desirable goal is the development of real-time, 2D-to-3D video-conversion modules suitable for incorporation into 3DTV receivers. In this situation, algorithms must be able to extract information from the input 2D video that can be used to directly compute the depth dimension. All of the aforementioned monocular depth cues can contribute toward estimating a scene's 3D structure. However, it's unlikely that any single monocular depth cue could prove adequate for estimating depths from videos of all kinds of scenes.

One approach to simplifying this issue is to execute a preprocessing step, where we pick a monocular cue category we think will prove most valuable in representing the video's 3D content. The most commonly used monocular depth cues in fully automatic 2D-to-3D conversion algorithms are occlusion/relative size, texture gradient, perspective/horizon, focus/defocus, and motion parallax. A detailed explanation and illustration of these depth cues can be found at http://media.au.tsinghua.edu.cn/2Dto3D/monocularcues.html.

However, all of these depth cues are imperfect, and in a given situation, none are necessarily available or will supply an adequate amount of information to allow for depth computation. Therefore, an overall, fully automatic 2D-to-3D video-conversion system should, ideally, seek to combine multiple monocular depth cues to compute a stable and reliable depth map. For example, when a piece of video containing a sufficient number of frames is acquired, the incoming frames might be categorized into static (no- or low-motion) and dynamic (high-motion) frames using a change-detection algorithm that employs statistical motion information computed from neighboring frames. A third category could be frames where the scene abruptly changes from one situation to another, such as the occurrence of a commercial in ordinary TV programming.
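The change-detection categorization just described can be sketched with simple frame differencing. The thresholds and the differencing statistic below are our illustrative choices, not values from the article; a production system would use richer motion statistics.

```python
import numpy as np

# Sketch of change-detection frame categorization: classify each frame as
# static, dynamic, or a scene change based on the mean absolute luminance
# difference against its predecessor. Thresholds are assumed tuning values.

STATIC_T, SCENE_T = 2.0, 40.0  # illustrative thresholds, not from the article

def categorize_frames(frames):
    """frames: sequence of 2D luminance images of identical shape."""
    labels = ["static"]  # the first frame has no predecessor
    for prev, cur in zip(frames, frames[1:]):
        diff = np.mean(np.abs(cur.astype(float) - prev.astype(float)))
        if diff >= SCENE_T:
            labels.append("scene_change")   # e.g., a cut to a commercial
        elif diff >= STATIC_T:
            labels.append("dynamic")        # high motion: favor motion parallax
        else:
            labels.append("static")         # low motion: favor occlusion/focus cues
    return labels

# Synthetic clip: two near-identical frames, then an abrupt cut.
f0 = np.zeros((4, 4))
f1 = f0 + 1.0                 # tiny change -> static
f2 = np.full((4, 4), 100.0)   # abrupt change -> scene change
labels = categorize_frames([f0, f1, f2])
```

The resulting labels can then steer which monocular cues are applied to each frame, as the processing sequence in the next section describes.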


One possible sequence of processing could be to first parse the video at scene changes; then, compute motion information and use it to categorize frames. For static frames, deploy depth cues such as occlusion/relative size and focus/defocus to initially estimate the scene depths, whereas for dynamic scenes, use motion parallax for the first estimates. In both situations, depth cues such as texture gradient and perspective/horizon can be used to provide supplementary depth information. For complete scenes (between scene changes), consistency over time can be used to improve and affirm the depth computations. Depth post-processing can then be applied to smooth the estimated depth information in both space (within connected objects or textured regions) and over time, to alleviate flicker and other artifacts that might become evident when watching the 3D video. Finally, given the 2D color video and the newly computed depth video, we can create an associated second color video from the associated disparities (presuming a viewing distance and geometry), so that the two color videos can be viewed stereoscopically.

Semiautomatic 2D-to-3D conversion
Because 2D-to-3D conversion attempts to recover the depth information lost during camera projection, it's a poorly posed problem, primarily because an insufficient reservoir of physical constraints is placed on the possible 3D reconstructed scene by the available projection data. A convenient solution is to involve human participation as a means for injecting prior knowledge about the true 3D scene structure into the video shot to be converted, thereby enabling both more efficient conversion and more accurate results. In most semiautomatic 2D-to-3D video-conversion systems, specific frames (keyframes) of the video sequence are identified and annotated with 3D (depth or disparity) information via human intervention. The other frames (non-keyframes) are converted to 3D automatically, using the information in the keyframes as constraints. This strategy is feasible because the frames in a video taken of a given scene typically have a high degree of correlation, containing objects and surfaces that appear across many video frames, albeit from changing observation points. This high correlation is one of the fundamental attributes of video that allows for high compression: in algorithms such as H.264, an I-frame (similar to a keyframe) is transmitted faithfully using a larger bandwidth, while non-I frames (P-frames, similar to non-keyframes) are compared with other frames, with the differences being compressed and transmitted. Similarly, in semiautomatic 2D-to-3D video conversion, keyframes are given carefully measured, human-annotated depth information that indicates partial, or even complete, information about the scene.

The depth annotations can then be used to initialize and constrain the depths computed from subsequent non-keyframes (see Figure 2). As a design starting point, keyframes might always be assigned at scene changes, as well as at other significant temporal instants in a video—for example, at category changes. Figure 2 shows a typical overall framework for semiautomatic 2D-to-3D video conversion.

Figure 2. Workflow of semiautomatic 2D-to-3D conversion: (a) reading 2D video into frames, (b) keyframe detection and depth annotation, (c) depth propagation, and (d) stereoscopic rendering and display.

Among the four steps of the 2D-to-3D video-conversion framework depicted in Figure 2, the depth-propagation stage is the most important, because it determines the depth quality of most frames of the video. Depth propagation is an inference problem, wherein the known colors (RGB values of all the pixels) from all frames are combined with depth information from just the keyframes to infer the depth information of the non-keyframes.

A large number of different depth-propagation kernels have been proposed in recent years. Some methods infer the depth information by exploring the relationships between depth and color. This is because pixels of similar color and texture between neighboring frames are likely to come from the same object(s) if their spatial locations are also similar.3

The so-called bilateral filter is a powerful tool for approaching such problems. This device combines two attributes, such as color and space, with linear weighting. In our problem, similarity in color space and spatial proximity can be used to define the bilateral filter.
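A minimal sketch of how such a bilateral kernel might propagate keyframe depths: each output depth is a weighted average of keyframe depths, with weights that fall off with both color difference and spatial distance, in the style of Tomasi and Manduchi's filter. The parameter values, function name, and brute-force loop are our illustrative choices, not the authors' implementation.

```python
import numpy as np

# Bilateral depth propagation sketch: the depth at each pixel is a weighted
# average of keyframe depths, where the weight is the product of a color-
# similarity Gaussian and a spatial-proximity Gaussian. Parameters are
# illustrative, not values from the article.

SIGMA_COLOR, SIGMA_SPACE = 10.0, 3.0

def propagate_depth(key_color, key_depth, tgt_color, radius=2):
    """key_color/tgt_color: HxWx3 float images; key_depth: HxW depth map."""
    h, w = key_depth.shape
    out = np.zeros_like(key_depth)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            patch_c = key_color[y0:y1, x0:x1]
            patch_d = key_depth[y0:y1, x0:x1]
            dy, dx = np.mgrid[y0:y1, x0:x1]
            # Color similarity between the target pixel and the keyframe patch.
            wc = np.exp(-np.sum((patch_c - tgt_color[y, x]) ** 2, axis=2)
                        / (2 * SIGMA_COLOR ** 2))
            # Spatial proximity to the target pixel position.
            ws = np.exp(-((dy - y) ** 2 + (dx - x) ** 2) / (2 * SIGMA_SPACE ** 2))
            wgt = wc * ws
            out[y, x] = np.sum(wgt * patch_d) / np.sum(wgt)
    return out

# Two-region toy frame: left half red at depth 1, right half blue at depth 5.
color = np.zeros((6, 6, 3))
color[:, :3, 0] = 255.0   # red region
color[:, 3:, 2] = 255.0   # blue region
depth = np.where(np.arange(6)[None, :] < 3, 1.0, 5.0) * np.ones((6, 6))
prop = propagate_depth(color, depth, color)  # identical target frame
```

Because the color term collapses for pixels on the other side of the red/blue boundary, the propagated depths stay sharp at the object edge instead of blurring across it, which is exactly why a bilateral kernel suits depth propagation.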



Although a bilateral filter can be used to effectively and smoothly propagate depth information based on color and spatial similarity, problems can occur when there's fast movement in the video. Movement causes luminance to change position relative to the video frame, which can invalidate the assumption that neighboring pixels share similar depths. We therefore advocate semiautomatic 2D-to-3D video-conversion methods that add a third factor into the bilateral depth-propagation framework: motion. Motion estimation is first implemented to establish temporal correlation between the pixels in successive frames. In this way, target pixels being examined in non-keyframes can be shifted, or motion compensated, by displacement relative to the reference keyframes. Bilateral filtering is then performed on the shifted pixels. Such a shifted bilateral filter (SBF) can account for correlations across three attributes: color, space, and time. With an SBF, the motion estimation might proceed by assuming that the shifted pixels occur at the same depth layer; however, in practice there might also be displacements in depth (along the camera's optical axis, forward or backward from the camera). In such cases, the SBF approach can lead to incorrect depth propagation. To solve this problem, a bidirectional propagation strategy can be used to further update the SBF depth-propagation kernel.4

Quality of experience
At this point, it's natural to wonder whether it's possible to estimate the perceptual quality of a converted 3D video, and to likewise compare different conversion algorithms. This question leads us to an important field: image/video quality assessment (QA). The goal of QA is to develop algorithms that can accurately predict the quality of input images and videos as humans will judge them. Therefore, QA contains two important aspects: one is human subjective opinions about converted 3D data, and the other is designed QA models that lead to efficient and accurate algorithms. While the ideal arbiter of converted video quality would be recorded human judgments, such subjective evaluation requires many participants and a controlled test environment, which is usually infeasible in practice. Therefore, objective QA algorithms that can accurately predict human judgments of converted video quality are highly desirable. The topic of QA on 2D images and videos has been studied for many years.5 However, understanding the quality of experience (QoE) of human viewers of 3D videos is a much more complicated problem, and promising results are still forthcoming.

First of all, distortions of stereoscopic videos might be perceived differently than distortions of 2D videos, even if the distortions are of the same type (compression, packet loss, and so on). Indeed, these 2D distortions can lead to distortions in the perception of the third dimension. The geometry of stereopsis can lead to other visual artifacts such as depth-plane curvature, shearing, and keystone distortion.6 Distortions can also arise from imperfect stereoscopic capture, such as poor synchronization of the captured left and right views, or from faulty left-right image-rendering procedures during 3D content generation.

Even disregarding possible distortions in the 3D video content, a viewer might still feel visual discomfort in seeing the third dimension in a 3D video. This can arise from problems in stereography if the geometry of the captured images (as defined by the distances to objects), the vergence angle between the cameras, and the presumed point of 3D fixation don't match what could be viewed by a human observer. In a converted video there's only a single camera, but geometric parameters must be defined to create the second view.

Some 3D displays can also result in visual distortions. For example, commonly used 3D glasses separate left and right images using different polarization modes; however, some amount of the left image might leak into the right-eye view and vice versa, which is a form of cross-talk. This might occur because of poor synchronization or separation of the two views. Other 3D display artifacts7 can reduce the quality or the realism of the 3D viewing experience.

Last, despite decades of research, there's much that remains unknown regarding stereo perception. In particular, little is known regarding the perception of 3D distortions, or of the particulars of the way in which geometric inconsistencies affect the QoE. It's particularly difficult to create algorithms that automatically predict the QoE of a stereo 3D video, because there's no ground-truth reference signal available against which to compare the experience. Indeed, the only 3D signal that occurs is the so-called 3D cyclopean image in the brain.


There are only crude methods available to estimate this distal signal, for example, by computer-vision stereo algorithms.

We acquired ten 3D video clips with associated ground-truth depth maps, which we used to evaluate the aforementioned semiautomatic 2D-to-3D video-conversion algorithms. We obtained both subjective evaluations and objective QA algorithm results on these data sets. The subjective test invited 22 viewing subjects of various ages and backgrounds to watch and rate the 3D videos by ranking and score. Here, the percentage of times subjects selected a method as best (first choice) or second best (second choice) is tabulated. In the absence of a standard 3D QA index, and because we believe that 3D QA algorithms will operate using principles different from 2D QA algorithms, we simply measured the mean squared error between depth maps from converted video and ground truth. These results (listed in Table 1) suggest that the efficacy of the algorithms can be distinguished. We hope this data set will help bridge the gap between 2D-to-3D video conversion and 3D QoE research, benefiting both disciplines.

Table 1. The subjective and objective tests on different 2D-to-3D conversion methods.*

Conversion method                                        | Subjective first choice (%) | Subjective second choice (%) | Mean squared error
Bilateral filter                                         | 4.55                        | 0                            | 4.51
Shifted bilateral filter                                 | 18.2                        | 22.7                         | 4.42
Shifted bilateral filter with bidirectional propagation4 | 22.7                        | 45.5                         | 3.89
Ground truth                                             | 54.5                        | 31.8                         | 0

* The data sets are available at http://media.au.tsinghua.edu.cn/2Dto3D/Testsequence.html.

Conclusions
The cost of creating 3D video content remains high, and the amount of quality 3D content being delivered is far from meeting the market demand. Breakthroughs might come with the introduction of efficient and effective 2D-to-3D conversion algorithms. Further studies on the perception of stereoscopic distortions and artifacts are likely to deepen our understanding of the 3D QoE. The database we discussed should prove useful for potential efforts toward creating effective 2D-to-3D conversion algorithms and for assessing the QoE delivered by them. MM

Acknowledgment
The authors would like to thank the National Basic Research Project (no. 2010CB731800) and the Key Project of NSFC (nos. 61035002, 60932007, and 61021063).

References
1. C. Fehn, "Depth-Image-Based Rendering (DIBR), Compression, and Transmission for a New Approach on 3DTV," Proc. SPIE, vol. 5291, 2004, pp. 93-104.
2. I.P. Howard and B.J. Rogers, Seeing in Depth: Depth Perception, Porteous, 2002.
3. C. Tomasi and R. Manduchi, "Bilateral Filtering for Gray and Color Images," Proc. Int'l Conf. Computer Vision (ICCV), IEEE CS Press, 1998, pp. 839-846.
4. X. Cao, Z. Li, and Q. Dai, "Semi-Automatic 2D-to-3D Conversion Using Disparity Propagation," IEEE Trans. Broadcasting, vol. 57, no. 2, 2011, pp. 491-499.
5. H.R. Sheikh, M.F. Sabir, and A.C. Bovik, "A Statistical Evaluation of Recent Full Reference Image Quality Assessment Algorithms," IEEE Trans. Image Processing, vol. 15, no. 11, 2006, pp. 3440-3451.
6. A. Woods, T. Docherty, and R. Koch, "Image Distortions in Stereoscopic Video Systems," Proc. SPIE, vol. 1915, 1993, pp. 36-48.
7. L. Meesters, W. IJsselsteijn, and P. Seuntiens, "A Survey of Perceptual Evaluations and Requirements of Three-Dimensional TV," IEEE Trans. Circuits and Systems for Video Technology, vol. 14, no. 3, 2004, pp. 381-391.

Contact author Xun Cao at [email protected].

Contact editor Wenjun Zeng at [email protected].
