
Convolutional Networks for Shape from Light Field

Stefan Heber1   Thomas Pock1,2
1Graz University of Technology   2Austrian Institute of Technology
[email protected]   [email protected]

Abstract

Convolutional Neural Networks (CNNs) have recently been successfully applied to various Computer Vision (CV) applications. In this paper we utilize CNNs to predict depth information for given Light Field (LF) data. The proposed method learns an end-to-end mapping between the 4D light field and a representation of the corresponding 4D depth field in terms of 2D hyperplane orientations. The obtained prediction is then further refined in a post-processing step by applying a higher-order regularization.

Existing LF datasets are not sufficient for the purpose of the training scheme tackled in this paper. This is mainly due to the fact that the ground truth depth of existing datasets is inaccurate and/or the datasets are limited to a small number of LFs. This made it necessary to generate a new synthetic LF dataset, which is based on the raytracing software POV-Ray. This new dataset provides floating point accurate ground truth depth fields, and due to a random scene generator the dataset can be scaled as required.

Figure 1. Illustration of LF data. (a) shows a sub-aperture image with vertical and horizontal EPIs. The EPIs correspond to the positions indicated with dashed lines in the sub-aperture image. (b) shows the corresponding depth field, where red regions are close to the camera and blue regions are further away. In the EPIs a set of 2D hyperplanes is indicated with yellow lines, where corresponding scene points are highlighted with the same color in the sub-aperture representation in (b).

1. Introduction

A 4D light field [23, 14] provides information of all light rays that are emitted from a scene and hit a predefined surface. Contrary to a traditional image, a LF contains not only intensity information, but also directional information. This additional directional information inherent in the LF implicitly defines the geometry of the observed scene.

It is common practice to use the so-called two-plane or light slab parametrization to describe the LF. This type of parametrization defines a ray by its intersection points with two planes, which are usually referred to as image plane Ω ⊆ R² and lens plane Π ⊆ R². Hence the LF can be defined in mathematical terms as the mapping

    L : Ω × Π → R,  (p, q) ↦ L(p, q),    (1)

where p = (x, y)⊤ ∈ Ω and q = (ξ, η)⊤ ∈ Π represent the spatial and directional coordinates.

There are different ways to visualize the 4D LF. One way of visualizing the LF is as a flat 2D array of 2D arrays, which can be arranged position major or direction major. The direction major representation can be interpreted as a set of pinhole views, where the viewpoints are arranged on a regular grid parallel to a common image plane (c.f. Figure 2). Those pinhole views are called sub-aperture images, and they clearly show that the LF provides information about the scene geometry. When considering Equation (1), a sub-aperture image is obtained by holding q fixed and varying over all spatial coordinates p. Another visualization of LF data is the Epipolar Plane Image (EPI) representation, which is a more abstract visualization, where one spatial coordinate and one directional coordinate are held constant. For example, if we fix y and η, then we restrict the LF to a 2D function

    Σ_{y,η} : R² → R,  (x, ξ) ↦ L(x, y, ξ, η),    (2)

that defines a 2D EPI. The EPI represents a 2D slice through the LF, and it also shows that the LF space is largely linear, i.e. a 3D scene point always maps to a 2D hyperplane in the LF space (c.f. Figure 1).

There are basically two ways of capturing a dynamic LF. First, there are camera arrays [34], which are bulky and expensive, but allow to capture high resolution LFs. Second, more recent efforts in this field focus on plenoptic cameras [1, 24, 25], which are able to capture the LF in a single shot. Although LFs describe a powerful concept that is well-established in CV, commercially available LF capturing devices (e.g. Lytro, Raytrix, or Pelican) currently only fill a market niche, and are by far outweighed by traditional 2D cameras.

LF image processing is highly interlinked with the development of efficient and reliable shape extraction methods. Those methods are the foundation of all kinds of applications, like digital refocusing [18, 24], image segmentation [33], or super-resolution [4, 31], to name but a few. In the context of Shape from Light Field (SfLF) authors mainly focus on robustness w.r.t. depth discontinuities or occlusion boundaries. These occlusion effects occur when near objects hide parts of objects that are further away from the observer. In the case of binocular stereo, occlusion handling is a tough problem, because it is basically impossible to establish correspondences between points that are observed in one image but occluded in the other image. In this case only prior knowledge about the scene can be used to resolve those problems. This prior knowledge is usually added in terms of a regularizer. In the case of multi-view stereo those occlusion ambiguities can be addressed by using the different viewpoints. This somehow suggests to select for each image position a subset of viewpoints that, when used for shape estimation, reduce the occlusion artifacts in the final result. In this paper we propose a novel method that implicitly learns the pixelwise viewpoint selection and thus allows to reduce occlusion artifacts in the final reconstruction.

Contribution. The contribution of the presented work is twofold. First, we propose a novel method to estimate the shape of a given LF by utilizing deep learning strategies. More specifically, we propose a method that predicts for each imaged scene point the orientation of the corresponding 2D hyperplane in the domain of the LF. After the pointwise prediction, we use a 4D regularization step to overcome prediction errors in textureless or uniform regions, where we use a confidence measure to gauge the reliability of the estimate. For this purpose we formulated a convex optimization problem with higher-order regularization, which also uses a 4D anisotropic diffusion tensor to guide the regularization.

Second, we present a dataset of synthetic LFs, which provides highly accurate ground-truth depth fields, and where scenes can be randomly generated. On the one hand, the generated LFs are used to train a CNN for the hyperplane prediction. On the other hand, we use the generated data to analyze the results and compare to other SfLF algorithms. Our experiments show that the proposed method works for synthetic and real-world LF data.

2. Related Work

Extracting geometric information from LF data is one of the most important problems in LF image processing. We briefly review the publications most relevant in this field. As already mentioned in the introduction, LF imaging can be seen as an extreme case of a multi-view stereo system, where a large amount of highly overlapping views is available. Hence, it is hardly surprising that the increasing popularity of LFs renewed the interest in specialized multi-view reconstruction methods [2, 3, 17, 8]. For instance, in [2] Bishop and Favaro proposed a multi-view stereo method that theoretically utilizes the information of all possible combinations of sub-aperture images. Anyhow, the paper mainly focuses on anti-aliasing filters, which are used as a pre-processing step for the actual depth estimation. In a further work [3] they propose a method that performs the matching directly on the raw image of a plenoptic camera by using a specifically designed photoconsistency constraint. Heber et al. [17] proposed a variational multi-view stereo method, where they use a circular sampling scheme that is inspired by a technique called Active Wavefront Sampling (AWS) [11]. In [8] Chen et al. introduced a bilateral consistency metric on the surface camera to indicate the probability of occlusions. This occlusion probability is then used for LF stereo matching.

Another, more classical way of extracting the depth information from LF data is to analyze the line directions in the EPIs [30, 13, 16]. Wanner and Goldluecke [30, 13] for example applied the 2D structure tensor to measure the direction of each position in the EPIs. The estimated line directions are then fused using variational methods, where they incorporate additional global visibility constraints. Heber and Pock [16] recently proposed a method for SfLF that shears the 4D light field by applying a low-rank assumption. The amount of shearing then allows to estimate the depth map of a predefined sub-aperture image.

Unlike all the above mentioned methods, we suggest to train a CNN that allows to predict for each imaged 3D scene point the corresponding 2D hyperplane orientation in the LF domain. This is achieved by extracting information from vertical and horizontal EPIs around a given position in the LF domain, and feeding this information to a CNN. In order to handle textureless regions a 4D regularization is applied to obtain the final result, where a confidence measure is used to gauge the reliability of the CNN prediction. Our approach incorporates higher-order regularization, which avoids surface flattening. Moreover, we also make use of a 4D anisotropic diffusion tensor, which is calculated based on the intensity information in the LF. This tensor weights and orients the gradient during the optimization process.
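The slicing in Equations (1) and (2) is easy to state in code. The following sketch assumes a direction-major numpy array with a toy resolution; the array layout and all names are our illustrative assumptions, not part of the paper:

```python
import numpy as np

# Hypothetical direction-major layout: a grid of sub-aperture images,
# indexed [eta, xi, y, x] (layout and toy resolution are assumptions).
n_eta, n_xi, n_y, n_x = 11, 11, 48, 64
L = np.random.rand(n_eta, n_xi, n_y, n_x).astype(np.float32)

def sub_aperture(L, xi, eta):
    """Hold q = (xi, eta) fixed and vary over all spatial coordinates p."""
    return L[eta, xi]            # shape (n_y, n_x)

def epi(L, y, eta):
    """Fix y and eta: the 2D EPI Sigma_{y,eta}(x, xi) of Equation (2)."""
    return L[eta, :, y, :]       # shape (n_xi, n_x)

view = sub_aperture(L, xi=5, eta=5)
slice_ = epi(L, y=24, eta=5)
print(view.shape, slice_.shape)  # (48, 64) (11, 64)
```

Each row of the EPI is the same scanline seen from a different viewpoint, which is exactly why scene points trace out lines (2D hyperplanes in 4D) in this slice.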

Figure 3. Illustration of the network architecture.

Figure 2. Illustration of the patch extraction. The figure to the left illustrates the LF as a direction major 2D array of 2D arrays, where the coordinate (p0, q0) is marked. The corresponding vertical and horizontal patches at that location are shown to the right.
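The patch extraction illustrated in Figure 2 can be sketched as follows; this is a simplified 2D version with hypothetical names, and the [spatial, directional] array layout is our assumption:

```python
import numpy as np

def extract_patch(epi, s, d, size=(31, 11)):
    """Cut a size[0] x size[1] (spatial x directional) patch from a 2D
    EPI, centered at spatial index s and directional index d. Values
    outside the LF domain are set to zero, as stated in the paper."""
    hs, hd = size[0] // 2, size[1] // 2
    padded = np.pad(epi, ((hs, hs), (hd, hd)), mode="constant")
    return padded[s:s + size[0], d:d + size[1]]

epi_arr = np.arange(640 * 11, dtype=np.float32).reshape(640, 11)
patch = extract_patch(epi_arr, s=100, d=5)
print(patch.shape)                       # (31, 11)
print(patch[15, 5] == epi_arr[100, 5])   # center pixel is preserved: True
```

One such patch is taken from the vertical and one from the horizontal EPI per position, and the pair is fed to the network.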

3. Methodology

In this section we describe the methodology of the proposed approach, which can be divided into three main areas: (1) utilizing deep learning to predict 2D hyperplane orientations in the LF space (c.f. Section 3.1), (2) formulating a convex energy functional to refine the predicted orientations (c.f. Section 3.2), and (3) solving the resulting optimization problem using a first-order primal-dual algorithm (c.f. Section 3.3).

Figure 4. Illustration of a rendered LF. (a) provides a 3D view of a randomly generated scene, where foreground, midground, and image plane are highlighted in green, blue and purple. (b) shows a sub-aperture image of the obtained LF.

3.1. Hyperplane Prediction

The popularity of CNNs trained in a supervised manner via backpropagation [22] increased drastically after Krizhevsky et al. [21] utilized them effectively for the task of large-scale image classification. Inspired by the good performance on the image classification task, authors proposed numerous works that apply CNNs to different CV problems, including depth prediction [10], keypoint localization [15], edge detection [12], and image matching [35]. Zbontar and LeCun [35] for example proposed to train a CNN on pairs of small image patches to predict stereo matching costs. Those costs were then refined using cross-based cost aggregation and semiglobal matching.

In the case of LF data the depth information of an imaged scene point is encoded in the orientation of the corresponding 2D hyperplane in the LF domain. In order to be able to predict this orientation we extract information from a predefined neighborhood of a given point (p, q) ∈ Ω × Π. More specifically, a training example comprises two image patches of size 31 × 11 centered at (p, q), where the first patch P_v(p, q) ⊆ Σ_{x,ξ} is extracted from the vertical EPI, and the second patch P_h(p, q) ⊆ Σ_{y,η} is extracted from the horizontal EPI. Note that values outside the domain of the LF are set to zero. Figure 2 illustrates this patch extraction step. The figure shows a pair of horizontal and vertical patches, where the orientation of the line that intersects the center of the patch defines the orientation of the 2D hyperplane.

Network Architecture. The used network architecture is depicted in Figure 3. The network consists of five layers, denoted as Li, i ∈ [5]¹. The first four layers are convolutional layers, followed by one fully-connected layer. Each convolutional layer is followed by a Rectified Linear Unit (ReLU) nonlinearity. The first and third layer are padded such that the width and height do not change between input and output. The kernel size of the convolutional layers decreases towards deeper layers. More precisely, we use kernels of size 7 × 7 for the first two layers, and kernels of size 5 × 5 for layers three and four. The number of feature maps increases towards deeper layers, i.e. we use 64 feature maps for the first two layers and double them for the following two layers. Note that there is no pooling involved in the used network architecture, and the inputs are two RGB image patches of size 31 × 11.

¹Notation: [N] := {1, ..., N}

POV-Ray Dataset. Despite the success of CNNs, there are also some drawbacks. One main drawback is the need of huge labeled datasets that can be used for the supervised training. In order to fulfill this requirement we generated a synthetic LF dataset using POV-Ray [28]. Compared to the widely used Light Field Benchmark Dataset (LFBD) [32], which is generated with Blender [5], POV-Ray allows to calculate floating point accurate ground truth depth maps without discretization artifacts. In order to be able to increase the dataset as required, we also implemented a random scene generator. This scene generator divides the entire scene in foreground, midground, and background, as illustrated in Figure 4(a).
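The layer sizes stated in the network architecture description can be checked with a little shape arithmetic. This is a sketch; stride 1 throughout and "valid" convolutions for the unpadded layers two and four are our assumptions:

```python
def conv_out(hw, k, pad):
    """Height/width after a stride-1 convolution with kernel size k."""
    return (hw[0] + 2 * pad - k + 1, hw[1] + 2 * pad - k + 1)

hw = (31, 11)                              # the two input EPI patches
layers = [(7, 3), (7, 0), (5, 2), (5, 0)]  # (kernel, padding) for L1..L4;
                                           # L1 and L3 keep the size
for k, pad in layers:
    hw = conv_out(hw, k, pad)
print(hw)  # (21, 1): the extent entering the fully-connected layer L5
```

The 11-pixel directional extent is thus consumed entirely by the unpadded convolutions, so the fully-connected layer sees a single column of deep features per patch.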

The foreground and midground regions are randomly filled with comparatively small and large objects, respectively. Those objects are heavily occluding each other. The resulting occlusion effects lead to a high degree of occlusion and disocclusion of hyperplane intersections in the LF domain. The 3D objects used for the foreground and midground are obtained from the Stanford 3D scanning repository [9] and from the Oyonale dataset [26]. We use 20 different 3D objects, where about half of them come with random textures from categories like, for instance, wood, stone, metal, mountain, city, street, and landscape. Moreover, we also use random finish properties. Among other things, those finish properties define the reflectance characteristics of the non-Lambertian surfaces. The backgrounds are represented by images downloaded from Google image search that are labeled for reuse. We use background images with various resolutions from the above categories.

After creating a random scene we render it from various viewpoints, where those viewpoints are placed on a regular grid (c.f. Figure 2, left). All rendered images use the same image plane, and the optical axes converge at a random point that is chosen somewhere between the predefined image plane and the background. Note that this results in non-parallel viewing directions due to the non-perpendicular camera base vectors, which is intended. Using this procedure we generate 25 LFs, where 20 LFs are used for training and 5 LFs are used for testing. The spatial resolution of the rendered LFs is set to 640 × 480, and the directional resolution is set to 11 × 11, which results in 121 sub-aperture images per LF.

Figure 5. Illustration of data augmentation. The figure shows the original patches to the left and three different augmentation results to the right.

Data Augmentation. Data augmentation is a widely used strategy to generalize neural networks [21, 10]. Although we could simply increase the training dataset, augmentation during training seems to be important to avoid overfitting. The used augmentation includes changes in brightness and color, as well as additive Gaussian noise. More specifically, the additive Gaussian noise has a sigma of 0.05, and the multiplicative and additive color changes for each RGB channel are randomly sampled from the intervals [0.5, 2] and [−0.15, 0.15], respectively. Figure 5 provides some augmentation examples.

Network Training. In order to train the CNN we make use of the caffe framework [19], where we use Adam [20] as the optimization method to minimize the Euclidean loss. From the 20 LFs rendered for training we extract 8e6 training examples, which are doubled using data augmentation. We pre-process each patch by subtracting the mean and dividing by the standard deviation of the pixel intensities. In order to monitor overfitting we use a test set of 2e6 examples. The results presented in this paper are obtained after 150k iterations of backpropagation.

3.2. Refinement Model

In order to refine the predicted orientations we utilize variational techniques and formulate the following optimization problem

    minimize_u  D(u, f) + R(u),    (3)

where u and f denote tensors of order four. The objective function in Equation (3) is a combination of the data term D(u, f), which measures the fidelity of the argument u w.r.t. the predicted measurements f, and the regularization term R(u), which incorporates prior-knowledge about the solution.

The data term in our model ensures that the final solution is close to the predicted measurements of the CNN and is thus defined as

    D(u, f) = (µ/2) ‖c ⊙ (u − f)‖²₂,    (4)

where c is a confidence measure (c.f. (8)), and ⊙ denotes the Hadamard product. The scalar µ is used to balance the influence of the data term w.r.t. the regularization term.

The regularization term has to meet the challenges of removing artifacts and noise and simultaneously preserving sharp discontinuities in the sub-aperture images and in the EPIs as well. Common regularization terms are based on the first-order smoothness assumption. A famous example is the Total Variation (TV) semi-norm [29], TV(u) = ‖∇u‖₁. This type of regularization favors piecewise constant solutions and is thus well suited for intensity data used e.g. in image denoising. However, when used for depth data this property results in piecewise fronto-parallel reconstructions, i.e. the solution tends to be piecewise constant in the spatial domain of the reconstruction, which is not desirable. In order to avoid this effect we use a generalization of TV called Total Generalized Variation (TGV) introduced by Bredies et al. [6]. TGV of order k incorporates smoothness from the first up to the kth derivative. In other words, TGV of order k favors piecewise polynomial solutions of order k − 1. For our purpose TGV of second order is sufficient, since most objects can be well approximated by piecewise affine surfaces.

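The intuition that TGV, unlike TV, does not penalize affine parts can be illustrated on a 1D toy signal. This is our sketch of the second-order TGV objective; in 1D the symmetrized derivative E reduces to an ordinary derivative, and the weights are arbitrary:

```python
import numpy as np

def d(v):
    """Forward differences (1D discrete derivative)."""
    return v[1:] - v[:-1]

def tgv2_cost(z, w, a1=1.0, a0=2.0):
    """Second-order TGV objective for a candidate field w (1D sketch):
    alpha1 * ||grad z - w||_1 + alpha0 * ||grad w||_1."""
    return a1 * np.abs(d(z) - w).sum() + a0 * np.abs(d(w)).sum()

z = np.linspace(0.0, 1.0, 11)          # a ramp, i.e. a piecewise affine signal
cost_tv = tgv2_cost(z, np.zeros(10))   # w = 0 recovers plain TV: 1.0
cost_affine = tgv2_cost(z, d(z))       # w equal to the slope: ~0
print(cost_tv, cost_affine)
```

Minimizing over w therefore assigns a ramp essentially zero cost, whereas TV charges it its full height, which is exactly why TV staircases sloped depth surfaces and TGV does not.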
The primal form of TGV of second order is given as

    TGV²_α(z) = min_w { α₁ ‖∇z − w‖₁ + α₀ ‖Ew‖₁ },    (5)

where Ew is the distributional symmetrized derivative of w, and αᵢ, i ∈ {0, 1}, are weighting factors. The objective function in Equation (5) has the following intuitive interpretation. Before the TV of z is measured, a vector field w of low variation is subtracted from the gradient. We choose to apply the TGV regularization w.r.t. the spatial coordinates p of the LF, and use a TV regularization w.r.t. the directional coordinates q of the LF, which results in the following regularization term

    R(u) = TGV²_α(u|_{x,y}) + β TV(u|_{ξ,η}),    (6)

where β is a scalar that allows to weight the TV component, and u|_{x,y} and u|_{ξ,η} denote the restrictions of u to the coordinates (x, y) and (ξ, η), respectively.

Assuming that intensity discontinuities in the LF correspond to depth discontinuities, we will make use of the intensity information to guide the regularization. More specifically, we will include an anisotropic diffusion tensor Γ, which is calculated by analyzing the 4D structure tensor at each point (p, q) in the discrete domain of the LF. Therefore we will first calculate the eigenvalues λᵢ and eigenvectors vᵢ, i ∈ [4], of the 4D structure tensor at position (p, q). Assuming that the eigenvalues are given in ascending order, λ₁ ≤ … ≤ λ₄, the anisotropic diffusion tensor Γ(p, q) is given as

    Γ(p, q) = Σ_{i∈[2]} vᵢvᵢ⊤ + exp(−γ ‖∇L(p, q)‖^δ) Σ_{j∈[4]\[2]} vⱼvⱼ⊤,    (7)

where γ and δ adjust the magnitude and the sharpness of the tensor. Γ will orientate and weight the gradient direction during the optimization process, which leads to sharp depth transitions at regions with high intensity differences. Note that a similar strategy was used in [17] to regularize the depth map of the 2D center view of the LF. The confidence measure c is also calculated based on the information derived from the structure tensor, i.e. the confidence at position (p, q) is calculated as

    c(p, q) = Σ_{i∈[3]} Σ_{j∈[4]\[i]} (λᵢ − λⱼ)².    (8)

The final energy term combines the data term (4), the confidence measure (8), the regularization term (6), and the anisotropic diffusion tensor (7), and is given as

    min_{u,w} { (µ/2) ‖c ⊙ (u − f)‖²₂ + ‖ Γ (α₁(∇u|_{x,y} − w), β ∇u|_{ξ,η})⊤ ‖₁ + α₀ ‖Ew‖₁ }.    (9)

3.3. Optimization

The optimization problem in Equation (9) is convex but non-smooth. In order to find a global optimal solution we will utilize the primal-dual algorithm proposed by Chambolle and Pock [7]. Therefore we reformulate (9) as the following convex-concave saddle-point problem

    min_{u,w} max_{du,dw} { (µ/2) ‖c ⊙ (u − f)‖²₂ + ⟨ Γ (α₁(∇u|_{x,y} − w), β ∇u|_{ξ,η})⊤, du ⟩
        − χ_{B∞(0,1)}(du|_{x,y}) − χ_{B∞(0,1)}(du|_{ξ,η}) + ⟨α₀ Ew, dw⟩ − χ_{B∞(0,1)}(dw) },    (10)

where we introduced the dual variables du and dw. Moreover, we denote by B∞(0, 1) the ℓ_{2,∞} norm ball centered at zero with radius one, and χ_A denotes the characteristic function of a set A. The saddle-point formulation in Equation (10) allows to directly apply the primal-dual algorithm. Moreover, using adequate symmetric and positive definite preconditioning matrices as suggested in [27], the convergence speed of the algorithm can be further improved. Note however that the diagonal preconditioning results in dimension-dependent step lengths, instead of global step lengths, i.e. the global complexity of the algorithm does not change. The final algorithm is iterated for a fixed number of iterations or until a suitable convergence criterion is fulfilled. The involved gradient and divergence operators are approximated using forward/backward differences with Neumann and Dirichlet boundary conditions, respectively.

4. Experiments

In this section we will evaluate the proposed method on synthetic and real world LF data. For the synthetic evaluation we will use the test set of the generated POV-Ray dataset. For the real world evaluation we use the Stanford Light Field Archive (SLFA), which includes LFs captured with a multi-camera array [34], where each LF contains 289 sub-aperture images on a 17 × 17 grid.

Synthetic Evaluation. We start with the synthetic evaluation. When considering Figure 6, which provides network predictions and refinement results for two examples of the POV-Ray test set, we see that the CNN is able to predict reasonable 2D hyperplane orientations. The predicted orientations are accurate in well textured regions, but degrade in regions with less texture. Note that the predicted orientations are barely affected by depth discontinuities. When comparing the network predictions with the refinement results, we see that the additional refinement model allows to reduce the errors in textureless regions and simultaneously preserves sharp depth discontinuities, as expected.
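For illustration, the Chambolle-Pock iteration used in Section 3.3 can be run on a 1D TV-L2 toy problem. This is our sketch with our own step sizes and boundary handling, not the paper's (much larger) 4D model:

```python
import numpy as np

def grad(u):
    """Forward differences with Neumann boundary (last entry zero)."""
    g = np.zeros_like(u)
    g[:-1] = u[1:] - u[:-1]
    return g

def div(p):
    """Negative adjoint of grad (backward differences)."""
    d = np.empty_like(p)
    d[0] = p[0]
    d[1:] = p[1:] - p[:-1]
    d[-1] = -p[-2]          # p[-1] stays zero here, so this is consistent
    return d

def tv_denoise_pd(f, lam=10.0, n_iter=500):
    """Chambolle-Pock iterations for min_u max_{|p|<=1} <grad u, p>
    + (lam/2)||u - f||^2, i.e. 1D TV-L2 denoising."""
    u, u_bar = f.copy(), f.copy()
    p = np.zeros_like(f)
    tau = sigma = 0.5                      # tau * sigma * ||grad||^2 <= 1
    for _ in range(n_iter):
        # dual ascent step followed by projection onto the unit ball
        p = np.clip(p + sigma * grad(u_bar), -1.0, 1.0)
        u_old = u
        # primal descent step: proximal map of the quadratic data term
        u = (u + tau * (div(p) + lam * f)) / (1.0 + tau * lam)
        u_bar = 2.0 * u - u_old            # over-relaxation
    return u

f = np.concatenate([np.zeros(20), np.ones(20)])  # a clean step edge
u = tv_denoise_pd(f)                             # the edge survives denoising
```

The projection onto the unit ball is the 1D analogue of the characteristic functions χ_{B∞(0,1)} in (10), and the quadratic proximal step plays the role of the confidence-weighted data term.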

Figure 6. Illustration of reconstruction results for example scenes from the POV-Ray dataset. The figure shows, from left to right, the EPI representation of the LF data, the color-coded ground truth, the CNN prediction, and the refinement result.

In Figure 7 we compare our method to the works of Wanner and Goldluecke [30] and Heber and Pock [16]. The method by Wanner and Goldluecke [30] makes use of the EPI representation of the LF and calculates a globally consistent depth labeling. Heber and Pock [16] proposed a variational multi-view stereo model, where they use ideas from Robust Principal Component Analysis (RPCA) based on low rank minimization to define an all vs. all matching term. Compared to the method by Wanner and Goldluecke [30] we observe that the proposed method provides a more accurate reconstruction. This is mainly due to the fact that the proposed method provides continuous depth estimates, whereas the method of Wanner and Goldluecke only provides a discrete depth labeling. Also note that the method by Wanner and Goldluecke fails if the hyperplane orientations are too close to the orientation of the xy plane. In this case the lines in the EPIs disconnect and the 2D structure tensor fails to estimate the correct orientation of the line. This is for example the reason for the large reconstruction errors of this method for scene1 (c.f. Figure 7). Compared to the variational model by Heber and Pock [16] the proposed method provides more details and sharper depth discontinuities for objects close to the camera. Hence the proposed method is especially useful in areas with severe occlusion effects. The method by Heber and Pock [16] suffers from a loss of detail mainly due to the required coarse to fine warping scheme.

Table 1. Quantitative results in terms of the root mean squared error (RMSE) for the POV-Ray synthetic test set. The table shows RMSE results, scaled by a factor of 100, for the different scenes shown in Figure 7 and for the entire test set, comparing Wanner and Goldluecke [30], Heber and Pock [16], the plain CNN prediction, and the proposed method.

Quantitative results in terms of the RMSE are presented in Table 1. Note that the results for the methods proposed by Wanner and Goldluecke [30] and Heber and Pock [16] are obtained by running the source code provided by the authors. When considering the results of the individual scenes we see that the proposed method is able to outperform the
competing methods on all but one scene. Furthermore, on average the proposed method is able to clearly outperform the other state-of-the-art methods. However, it should be emphasized that a main drawback of the presented method is to apply the CNN in a sliding window fashion, which results in considerably high computational costs.

Figure 7. Comparison to state-of-the-art methods on the synthetic POV-Ray dataset. The figure shows, from left to right, the center view of the LF, the color-coded ground truth, the results for two state-of-the-art SfLF methods [30, 16], followed by the refinement result of the proposed method.

Figure 8. Qualitative comparison for a LF from the SLFA. The figure shows, from left to right, the center view of the LF, the results for two state-of-the-art SfLF methods [30, 16], and the refinement result of the proposed method, where we also show the network prediction in the purple sub-window.

Real World Evaluation. We continue with a short real world evaluation. Figure 8 provides a qualitative comparison of different SfLF methods, where a LF from the SLFA is used as input. Note that the scene is quite challenging because of the high degree of specularity. The result of the proposed method basically shows that the trained model can be applied to reconstruct real world LF data.

5. Conclusion

In this paper we proposed a novel method for SfLF. Our method is a combination of deep learning and variational techniques. We trained a CNN to predict 2D hyperplane orientations in the LF domain. Knowing these orientations allows to reconstruct the geometry of the scene. In addition to the learning approach we formulated a global energy optimization problem with a higher-order regularization to refine the network predictions. For numerical optimization of the variational model we use a first-order primal-dual algorithm. Overall the presented method demonstrates the possibility to use deep learning strategies for the task of shape estimation in the LF setting.

In order to provide enough data to train the network we generated a synthetic dataset by using the raytracing software POV-Ray. To generate an arbitrary amount of scenes we also implemented a random scene generator. The generated data was not just used to train the CNN, but also to provide quantitative and qualitative comparisons to existing SfLF methods. The qualitative evaluation of reconstruction results of synthetic and real world LF data showed that the proposed method is able to provide accurate reconstructions with sharp depth discontinuities. Moreover, our quantitative experiments showed that the proposed method is able to outperform existing methods on the POV-Ray test set in terms of the RMSE.

Acknowledgment. This work was supported by the FWF-START project Bilevel optimization for Computer Vision, No. Y729 and the Vision+ project Integrating visual information with independent knowledge, No. 836630.

References

[1] E. H. Adelson and J. Y. A. Wang. Single lens stereo with a plenoptic camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):99–106, 1992.
[2] T. Bishop and P. Favaro. Plenoptic depth estimation from multiple aliased views. In 12th International Conference on Computer Vision Workshops, pages 1622–1629. IEEE, 2009.
[3] T. E. Bishop and P. Favaro. Full-resolution depth map estimation from an aliased plenoptic light field. In Proceedings of the 10th Asian Conference on Computer Vision - Volume Part II, pages 186–200, Berlin, Heidelberg, 2011. Springer-Verlag.
[4] T. E. Bishop and P. Favaro. The light field camera: Extended depth of field, aliasing, and superresolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(5):972–986, 2012.
[5] Blender. https://www.blender.org.
[6] K. Bredies, K. Kunisch, and T. Pock. Total generalized variation. SIAM Journal on Imaging Sciences, 3(3):492–526, 2010.
[7] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40:120–145, 2011.
[8] C. Chen, H. Lin, Z. Yu, S. Bing Kang, and J. Yu. Light field stereo matching using bilateral statistics of surface cameras. June 2014.
[9] B. Curless and M. Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '96, pages 303–312, New York, NY, USA, 1996. ACM.
[10] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2366–2374. Curran Associates, Inc., 2014.
[11] F. Frigerio. 3-dimensional Surface Imaging Using Active Wavefront Sampling. PhD thesis, Massachusetts Institute of Technology, 2006.
[12] Y. Ganin and V. Lempitsky. N4-fields: Neural network nearest neighbor fields for image transforms. In D. Cremers, I. Reid, H. Saito, and M.-H. Yang, editors, Computer Vision – ACCV 2014, volume 9004 of Lecture Notes in Computer Science, pages 536–551. Springer International Publishing, 2015.
[13] B. Goldluecke and S. Wanner. The variational structure of disparity and regularization of 4d light fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[14] S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen. The lumigraph. In SIGGRAPH, pages 43–54, 1996.
[15] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Computer Vision and Pattern Recognition (CVPR), 2015.
[16] S. Heber and T. Pock. Shape from light field meets robust PCA. In Proceedings of the 13th European Conference on Computer Vision, 2014.
[17] S. Heber, R. Ranftl, and T. Pock. Variational Shape from Light Field. In International Conference on Energy Minimization Methods in Computer Vision and Pattern Recognition, 2013.
[18] A. Isaksen, L. McMillan, and S. J. Gortler. Dynamically reparameterized light fields. In SIGGRAPH, pages 297–306, 2000.
[19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. CoRR, abs/1408.5093, 2014.
[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[22] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, Dec. 1989.
[23] M. Levoy and P. Hanrahan. Light field rendering. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '96, pages 31–42, New York, NY, USA, 1996. ACM.
[24] R. Ng.
[33] S. Wanner, C. Straehle, and B. Goldluecke. Globally consistent multi-label assignment on the ray space of 4d light fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[34] B. Wilburn, N. Joshi, V. Vaish, E.-V. Talvala, E. Antunez, A. Barth, A. Adams, M. Horowitz, and M. Levoy. High performance imaging using large camera arrays. ACM Trans. Graph., 24(3):765–776, July 2005.
[35] J. Zbontar and Y. LeCun. Computing the stereo matching cost with a convolutional neural network. CoRR, abs/1409.4326, 2014.
Digital Light Field Photography. Phd thesis, Stanford University, 2006. 2 [25] R. Ng, M. Levoy, M. Bredif,´ G. Duval, M. Horowitz, and P. Hanrahan. Light field photography with a hand-held plenoptic camera. Technical report, Stanford University, 2005. 2 [26] Oyonale. http://www.oyonale.co. 4 [27] T. Pock and A. Chambolle. Diagonal preconditioning for first order primal-dual algorithms in convex optimization. In International Conference on Computer Vision (ICCV 2011), 2011. 5 [28] Pov-ray. http://www.povray.org. 3 [29] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total varia- tion based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, Nov. 1992. 4 [30] S. Wanner and B. Goldluecke. Globally consistent depth la- beling of 4D lightfields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012. 2, 6, 7 [31] S. Wanner and B. Goldluecke. Spatial and angular varia- tional super-resolution of 4d light fields. In European Con- ference on Computer Vision (ECCV), 2012. 2 [32] S. Wanner, S. Meister, and B. Goldluecke. Datasets and benchmarks for densely sampled 4d light fields. In Vision, Modelling and Visualization (VMV), 2013. 3 [33] S. Wanner, C. Straehle, and B. Goldluecke. Globally con- sistent multi-label assignment on the ray space of 4d light