Dense Disparity Estimation in Multiview Video Coding

Dense Disparity Estimation in Multiview Video Coding I. Daribo, M. Kaaniche, W. Miled, M. Cagnazzo, B. Pesquet-Popescu Telecom ParisTech Signal and Image Proc. Dept. 37-39, rue Dareau, 75014 Paris, France [daribo,kaaniche,miled,cagnazzo,pesquet]@telecom-paristech.fr Abstract—Multiview video coding is an emerging application In order to better deal with inter-view redundancy, the where, in addition to classical temporal prediction, an efficient joint video team (JVT) is developing a multiview extension disparity prediction should be performed in order to achieve the of H.264/AVC standard, known as multiview video coding best compression performance. A popular coder is the multiview video coding (MVC) extension of H.264/AVC, which uses a (MVC) extension [3]. The aim of this extension is to provide block-based disparity estimation (just like temporal prediction new techniques improving coding efficiency taking advantage in H.264/AVC). In this paper, we propose to improve the MVC of both temporal and inter-view redundancies, thus leading to extension by using a dense estimation method that generates a additional coding gain compared to the H.264/AVC simulcast smooth disparity map with ideally infinite precision. The obtained solution. disparity is then segmented and efficiently encoded by using a rate-distortion optimization technique. Experimental results show On the other hand, adding the inter-view prediction to that significant gains can be obtained compared to the block- the temporal one requires more computational and memory based disparity estimation technique used in the MVC extension. resource. Nevertheless, it has been shown [4] that most of the coding gain of MVC comes from the inter-view prediction I. INTRODUCTION of the temporal intra picture, while for the inter pictures the temporal prediction is the most efficient prediction mode. A multiview video system consists in generating multiple As a consequence, limiting the inter-view prediction to the views by capturing from different viewpoints the same scene intra pictures is commonly reputed as a reasonable complex- via a set of multiple cameras. A set of slightly different views ity/efficiency trade-off. can be used to reproduce the scene in three dimensions. The inter-view correlation between adjacent cameras is The improvement of 3D technologies raised interest in 3D removed via the so-called disparity estimation (DE) and com- television (3DTV) [1] and in free viewpoint video (FVV) [2]. pensation. Two main approaches, block-based and dense, have While 3DTV offers depth perception of program entertain- been used to estimate the disparity vectors (DV). A survey of ments without wearing special additional glasses, FVV allows the different techniques proposed in the literature can be found the user to freely change his viewpoint position and viewpoint in [5]. The MVC extension employs a variable block-based direction around a 3D reconstructed scene. Other target fields disparity estimation, assuming that within each partition of the are expected, like Digital Cinema, IMAX theaters, medicine, current macroblock the disparity vector is constant. However, dentistry, air-traffic control, military technologies, computer this assumption does not always hold, especially around depth games, etc. discontinuities and in textureless regions. Dense pixel-based In the meantime, the digital TV technology and 3D displays approaches attempt to overcome this drawback by assigning have largely improved recently, making even more relevant the one disparity vector to each pixel. Of course this means that problem of multiview applications. Capturing, processing and the disparity map would require a very high bit-rate to be coding multiview video are now very active research topics. encoded: for this reason, dense DVs have rarely been consider In particular, in sight of the huge amount of data concerned, for compression. The basic idea of this paper is to reduce the compression assumes a paramount importance. coding cost of the dense DV map by operating a RD-driven A straight method to compress multiview video is to encode segmentation on it. each view independently using the state-of-the-art H.264/AVC In particular, we propose to improve the disparity prediction encoder [?]. This approach is denoted, in the literature, as unit in the MVC extension by using the dense DE (DDE) simulcast coding. However, since all the cameras capture the method described in [6] (because it achieves good results same scene through different viewpoints, there is an inter-view compared with the state-of-art methods, such as graph cuts statistical expected dependencies between adjacent cameras, and belief propagation based methods) followed by the seg- which is not exploited in the simulcast case. mentation step. Based on a set theoretic framework, this DDE approach incorporates various convex constraints correspond- MMSP’09, October 5-7, 2009, Rio de Janeiro, Brazil. ing to a priori information such as the range of DVs or the 978-1-4244-4464-9/09/$25.00 c 2009 IEEE. total variation regularization constraint which assures a smooth disparity field while preserving discontinuities. Assuming that the magnitude difference of both fields is relatively small, the warped reference frame is approximated reference frames reference frames current frame current frame d¯ (previously coded) (previously coded) around by a Taylor expansion: (n−1,t) block-based dense I (x + d, y) ≃ disparity disparity I(n−1,t)(x + d,¯ y) + ∇I(n−1,t)(x + d,¯ y)(d − d¯) (2) estimation estimation x (n−1,t) where ∇Ix (x + d,¯ y) is the horizontal gradient of the warped reference frame. Note that in (2), we have not made DV per pixel per explicit that d and d¯ are functions of s = (x,y) for notation concision. Using the linearization (2), the criterion J˜ in (1) DV RD can be approximated by the quadratic convex functional J in segmentation per macroblock d: J(d) = [r(s) − L(s) d(s)]2 (3) X DV s∈D per macroblock per where block-based block-based (n−1,t) ¯ disparity disparity L(s) = ∇Ix (x + d(s),y) - - (n,t) (n−1,t) compensation compensation r(s) = I (s) − I (x + d¯(s),y) + d¯(s) L(s) residual frame residual frame The minimization of this quadratic functional is an ill-posed problem as the components of L may locally vanish. Thus, Fig. 1. Disparity prediction: (left) block-based estimation, (right) enhanced to convert this problem to a well-posed one, we incorporate by a dense estimation. additional constraints reflecting the prior knowledge about the disparity field. In this work, we address the problem through The proposed scheme is summarized in Fig. 1: we replace a set theoretic framework [6]. Firstly, each constraint is rep- the block-based disparity estimation (BDE) stage by a dense resented by a closed convex set S with m ∈ {1,...,M}, disparity estimation (DDE) one, and then, we apply a rate- m in a Hilbert space H. The intersection S of all the M sets distortion segmentation to the generated disparity map. This S constitutes the family of possible solutions. Therefore, the is performed by optimizing a Lagrangian cost function which m constrained problem amounts to find the solution in S which takes into account the accuracy and the coding cost of the minimizes the functional J: disparity map. The remainder of this paper is organized as follows. Sec- M Find dˆ∈ S = Sm such that J(dˆ) = min J(d). (4) tion II provides details about the DDE method. In Section III, \ d∈S we address the problem of the RD-optimized segmentation m=1 and encoding of the disparity field. Finally, in Section IV, The constraint sets are modeled as level sets: we give experimental results confirming the effectiveness of ∀m ∈ {1,...,M}, S = {d ∈ H | f (d) ≤ δ } (5) the proposed method, while Section V draws conclusions and m m m outlines future work. where fm : H → R is a continuous convex function for all m ∈ {1,...,M} and (δm)1≤m≤M are real-valued parameters II. DENSE DISPARITY ESTIMATION M such that S = Sm 6= ∅. Tm=1 A. Problem statement Hence, it is required to define the convex sets Sm to proceed Let I(n,t) and I(n−1,t) be two frames taken respectively to the DDE algorithm within the set theoretic framework. At by the n-th and (n − 1)-th cameras at time t. We assume this level, it is important to emphasize the great flexibility in that cameras are rectified, so that the disparity vectors are incorporating any set of arbitrary convex constraints. In what restricted to the horizontal component, that will be denoted follows, we will focus on M = 2 constraints. The first one by d. DDE methods attempt to determine, for each pixel in consists of restricting the variation of the disparity d within n,t the current frame I( ), the best corresponding pixel in the a specified range [dmin, dmax]. It can be expressed by the n− ,t reference frame I( 1 ). Generally, the estimation is obtained following constraint set S1: by minimizing a given cost functional, formulated in terms of S = {d ∈ H | d ≤ d ≤ d } (6) the sum of squared differences: 1 min max Most importantly, a constraint can be incorporated in order dˆ= arg min [I(n,t)(x,y) − I(n−1,t)(x + d, y)]2 (1) d∈Ω X to strengthen the smoothness of the disparity field in the (x,y)∈D homogeneous areas while preserving edges. Indeed, neighbor- where D is the picture support and Ω is the range of candidate ing pixels belonging to the same object should have similar disparity values. Generally, an initial estimate d¯ of d is disparities. This can be achieved by considering the total available, for example using a dense correlation-based method.

Load more