Depth from Gradients in Dense Light Fields for Object Reconstruction

Kaan Yücer 1,2    Changil Kim 1    Alexander Sorkine-Hornung 2    Olga Sorkine-Hornung 1

1 ETH Zurich    2 Disney Research

Abstract

Objects with thin features and fine details are challenging for most multi-view stereo techniques, since such features occupy small volumes and are usually only visible in a small portion of the available views. In this paper, we present an efficient algorithm to reconstruct intricate objects using densely sampled light fields. At the heart of our technique lies a novel approach to compute per-pixel depth values by exploiting local gradient information in densely sampled light fields. This approach can generate accurate depth values for very thin features, and can be run for each pixel in parallel. We assess the reliability of our depth estimates using a novel two-sided photoconsistency measure, which can capture whether a pixel lies on a texture or a silhouette edge. This information is then used to propagate the depth estimates at high-gradient regions to smooth parts of the views efficiently and reliably using edge-aware filtering. In the last step, the per-image depth values and confidence information are aggregated in 3D space using a voting scheme, allowing the reconstruction of a globally consistent mesh for the object. Our approach can process large video datasets very efficiently and at the same time generates high quality object reconstructions that compare favorably to the results of state-of-the-art multi-view stereo methods.

Figure 1: Given a dense light field around a foreground object, our technique reconstructs the object in 3D with its fine details. Left: An input view from the AFRICA dataset. Right: The reconstructed mesh rendered with and without the texture. Note the reconstructed details, such as the stick, the legs and the arms of the figurines.

1. Introduction

Reconstructing objects in 3D from a set of pictures is a long-standing problem in computer vision, since it enables many applications, including (but not limited to) digitizing real-world objects, localization and navigation, and generating 3D content for computer graphics and virtual reality. Due to the breadth of these possible use cases, image-based 3D reconstruction has been well studied, and many different solutions have been proposed, ranging from generating meshes for small objects [15] to reconstructing entire cities [1].

Despite significant research efforts, objects with thin features and fine details still pose a problem to multi-view stereo (MVS) techniques for numerous reasons. First of all, these features occupy only a small number of pixels in the views in which they are visible, making them difficult to locate. They are also usually only visible in a small number of views, such that matching between different views becomes challenging. Moreover, most patch-based MVS techniques miss thin features, since they require the patches on the objects to be several pixels wide, which is not always the case in real-life capture scenarios. Graph-cut based reconstruction techniques [21, 35] face difficulties with textureless thin features, because it is hard for such methods to localize the features using photoconsistency values inside a volumetric discretization, often resulting in the elimination of these features in the reconstructions.

In this work, we propose an algorithm that uses light fields with high spatio-angular sampling, such that fine features become more prominent thanks to the increased coherency and redundancy in the data. The features are visible in more views and occupy a larger number of pixels inside the light field compared to sparsely sampled images. The increased coherency of the data also makes it possible to work directly at the pixel level without requiring patch-based matching, such that thin object parts can be identified (Figure 1). At the same time, dense light fields are challenging to process due to the sheer amount of available data. Existing multi-view stereo methods evolved largely towards handling sparse, imperfect input and thus cannot handle such dense data effectively.

Recent approaches to light field processing have already shown an increase in the quality of depth maps, where finer details are visible in the reconstructions [24, 37]. However, the resulting scene representations are per-view depth maps, which are not always globally consistent and require an additional surface reconstruction step to generate more widely used representations, e.g. triangle meshes. Furthermore, such techniques often specialize in regularly sampled light fields (such as regular 2D grids) and do not easily generalize to more casual, hand-held capture scenarios. A more recent method [40] generates pixel-accurate segmentations for casually captured light fields using pixel-level approaches. Unfortunately, segmentations can only be used to describe the objects in the form of visual hulls, where concavities are missing.

Our approach makes the following contributions. Firstly, we propose an efficient and reliable technique for computing per-pixel depth values by utilizing the gradient directions inside the light field volume. In this way, depths at image edges can be computed without the need to match features or optimize for 3D patches. Secondly, we utilize a novel directional photoconsistency computation, which makes it possible to differentiate between texture and boundary edges in the images. This computation is instrumental in propagating depth values from high-gradient regions to smooth areas. Lastly, we propose a robust depth aggregation technique that thrives with a large number of depth maps and extracts precise and detailed object surfaces from the gathered information.

Every step of our technique is designed to take advantage of the highly coherent light field data. Typically, we capture an HD video containing thousands of frames to sample the light field surrounding an object. Since such dense light fields contain more samples of every scene point compared to traditional MVS techniques, we can work at the pixel level and make robust decisions about the pixels independently. Most existing MVS methods cannot make use of such data in full. We designed our pipeline to handle the sheer amount of data and avoid long run times or large memory requirements. These design decisions make it possible to manage light fields consisting of thousands of views efficiently and generate high quality 3D reconstructions for objects with fine details.

2. Related work

3D reconstruction from light fields. Beyond the early application of light fields in image-based rendering [6, 9, 19, 26], the dense representation of the scene they attain has provided new perspectives to a variety of existing computer vision problems. The high coherency in light fields can be utilized to extract higher level information about the scene, such as understanding specular properties of objects [8], segmenting scenes into multiple components [38], computing scene flow [2], synthesizing superresolution images [34], and generating high quality per-view depth maps [24, 39]. The use of image statistics [7] and explicitly trying to identify occlusion edges [36] can enhance the quality of depth maps, especially around object boundaries. Most recently, convolutional neural networks [20] have been applied to solve the depth reconstruction problem. Similar ideas have been used to reconstruct depth maps from light fields captured with lenslet arrays [3, 27, 33]. Our aim in this paper is complementary to these works: we strive to reconstruct the 3D shape of the object using a dense light field captured from viewpoints around the object, making use of the local gradient information to compute fast and reliable depth maps without the need for optimization techniques.

When capturing a light field of a scene, the scene points' projections to the views create trajectories depending on the 3D position of the scene point and the camera positions. Exploiting such trajectories for the direct reconstruction of depth or segmentations has been a promising research direction since its first introduction by Bolles et al. [4], who analyzed several different camera alignments, including linear alignment. Feldmann et al. [10] compute point clouds by fitting analytical curves to the trajectories, but are limited to perfectly circular camera motion. Fitting lines to the epipolar plane images can reveal the depths of scene points [37], but this computation requires the camera positions to be aligned on a regular 2D grid. More recent work [40] has shown that light field gradients can be utilized to segment the foreground object from the background clutter both in 2D, as segmentation masks, and in 3D, described by a visual hull, without requiring stringent capture scenarios. In our work, we extend the gradient computations to estimate depth values instead of segmentations, and the depths are then used to generate 3D surface meshes with concavities and fine details.

Multi-view stereo. The aim of multi-view stereo is to reconstruct objects in 3D using a set of images. We discuss the previous work on MVS that is most related to ours. For a more detailed survey and evaluation of MVS, we refer the reader to the comprehensive studies [13, 29].

A common approach to object reconstruction is to start with a visual hull of the object and carve it out to get the actual shape with concavities [14, 21, 31, 35]. The carving operation is done either via volumetric graph-cuts inside the visual hull volume, or by deforming the visual hull mesh via an energy optimization. In order to get the initial surface, per-image segmentations need to be computed either by hand or by background subtraction, which is time consuming with thousands of images and cluttered backgrounds. With high voxel and image resolutions, computing a global solution can become infeasible. Also, silhouette extraction is prone to errors, which requires special handling for thin objects [32].

Another general direction is to create reliable oriented point clouds by applying patch-based feature matching, and then expand around these points to cover the object [1, 15, 18]. The resulting dense point cloud is turned into a mesh by a surface reconstruction technique, e.g. Poisson reconstruction [23]. Thin features, which cannot be matched via feature matching, and textureless regions pose problems, since without a sufficient number of points in these areas, they cannot be accurately represented in the final mesh.

Computing per-image depth maps and merging them later in 3D space has been shown to generate high quality object reconstructions [5, 17, 28, 30]. Usually, depth maps are computed using dense correspondence matching, which is susceptible to image noise or large smooth regions without regularization, while applying regularization often destroys fine details. Further, correspondence matching and the subsequent triangulation become inaccurate in narrow baseline scenarios, as is the case with densely sampled light fields. In this work, we propose an efficient narrow baseline depth computation technique using light field gradients, which can be applied for each pixel independently, and a depth map aggregation technique that can merge a large number of depth maps efficiently.

3. Depth from gradient

Our goal is to reconstruct objects in 3D that are captured densely in unstructured light fields [9]. As in Davis et al. [9], we do not try to reparameterize the captured light field, but work directly on a sequence of images and associated camera poses. Such unstructured light fields can be easily captured at a very dense sampling rate by a video camera moving around the objects. We do not assume a particular camera path nor a machine gantry, but assume only that the camera moves roughly along a path surrounding the object to sample as much of it as possible. In a more regularly sampled light field, each scene point will leave a distinctive trajectory in the spatio-angular volume of the light field depending on the particular camera motion and the 3D position of the scene point: lines with different slopes for linear camera motion [4] or helical structures for circular camera motions [10]. The direction of such trajectories encodes the depths of scene points, and has been used to reconstruct scene depth.

As the spatio-angular sampling rate of the light field increases, the parallax between views gets smaller, and each 3D point's trajectory forms a more continuous curve, whose direction can be estimated locally (see Figure 3) without requiring any specific camera motion. The gradient direction over a few adjacent frames gives a local, linear approximation of the trajectory of each pixel, from which the depths of scene points can be computed. This observation opens up the possibility to compute per-pixel depth values efficiently by solely measuring light field gradients. Since this does not require any patches or features to be computed and matched between images but uses local information only, pixel-wise depth values can be computed more efficiently. Further, since the depth estimation is done on a per-pixel basis, potentially finer object details can be revealed and the computation can be trivially parallelized.

We first compute the per-view depth maps using the gradient, only in high-gradient regions. The depth values are then propagated within each image, guided by a bidirectional photoconsistency. All depth maps are subsequently aggregated into a voxelized 3D scene and further refined, resulting in an oriented point cloud, which is turned into a triangle mesh using Poisson surface reconstruction.

Figure 2: Steps of our technique. From left to right: (a) one input frame, (b) depth estimates using depth from gradient (Section 3.1), (c) the filtered depth map (Section 3.2), (d) the dense depth map after edge-aware depth propagation (Section 4), and (e) the final meshes after depth aggregation (Section 5), with and without texture. The depth maps and meshes are rendered from a slightly different viewpoint for a better 3D visualization.

3.1. Depth computation

Given a dense 3D light field L represented by a set of images (I_1, ..., I_n), where L(x, y, i) refers to the pixel p = (x, y) in image I_i, and their camera projection matrices P_i, we define the light field gradient ∇L_{i,j} around p in image I_i and q in image I_j as:

∇L_{i,j}(p, q) = ∇s_{i,j}(p, q),    (1)

where s_{i,j}(p, q) is a 2 × 5 image patch constructed by stacking together 5-pixel-long light field segments centered at p in I_i and at q in I_j. We construct s_{i,j}(p, q) using the epipolar geometry between the views I_i and I_j: we know that the actual scene point at p will appear on its epipolar line ℓ in I_j. Given a reference point q in I_j along ℓ, we sample a 5-pixel-long segment in I_j along ℓ centered at q to generate s_j(p, q). q also corresponds to a depth value d_q for p as a result of the epipolar geometry. We project the sampled points in I_j back to I_i using a fronto-parallel plane placed at depth d_q, and sample I_i at these locations to generate s_i(p, q). Note that this computation is reliable only when I_i and I_j face similar directions. If the depth value d_q is the actual depth value for p, then we expect s_i(p, q) and s_j(p, q) to be identical. If the actual depth deviates from d_q, then the colors in s_j(p, q) will be a shifted version of the colors in s_i(p, q). In both cases, we can use ∇s_{i,j}(p, q) to compute the trajectory of the points between the two segments using the direction perpendicular to the gradient direction:

γ_{i,j}(p, q) = tan^{-1}(−∇_x s_{i,j}(p, q) / ∇_y s_{i,j}(p, q)).    (2)

Using γ_{i,j}(p, q), we can find the mapping p_j^s of p in s_j(p, q):

p_j^s = 1 / tan(γ_{i,j}(p, q)).    (3)

Please refer to our supplemental material for the derivation of this formula. We then map p_j^s back to ℓ to compute the mapping p_j, from which the actual depth d_p can be computed via triangulation (see Figure 3 for a visualization).
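To make this step concrete, the following Python/NumPy sketch (our own illustration, not code from the paper) estimates the in-segment shift p_j^s of Eq. (3) from a stacked 2 × 5 patch. It assumes that ∇_x is taken along the 5-pixel segments and ∇_y across the two stacked rows; the epipolar sampling of the segments and the final triangulation to d_p are assumed to happen elsewhere in the pipeline, and the function name trajectory_shift is hypothetical.

import numpy as np

def trajectory_shift(seg_i, seg_j):
    """Estimate the shift p_j^s of p inside s_j(p, q), following Eqs. (1)-(3).

    seg_i, seg_j: (5,) intensity segments sampled along the epipolar geometry,
    centered at p in I_i and at the reference point q in I_j, respectively.
    """
    s = np.stack([seg_i, seg_j]).astype(float)   # 2 x 5 patch s_{i,j}(p, q)
    gy, gx = np.gradient(s)                      # gy: across the two rows, gx: along the segments
    gx_c = gx[:, 2].mean()                       # gradients evaluated at the patch center
    gy_c = gy[:, 2].mean()
    gamma = np.arctan2(-gx_c, gy_c)              # Eq. (2): local trajectory direction
    return 1.0 / np.tan(gamma)                   # Eq. (3): shift of p in s_j(p, q)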

The question then becomes how to sample q in I_j. If d_q is close to the actual depth of the scene point at p, we can expect a reliable depth computation. However, if the difference between q and p_j is larger than a pixel, the gradient computation becomes unstable, leading to erroneous results. To that end, we sample ℓ multiple times between q_min and q_max, which correspond to reference points for the minimum and maximum depths of the scene, and obtain a set of reference points q^k, k ∈ 1, ..., K, where K is the number of samples. We sample the q^k one pixel apart from each other, compute a mapping p_j^k for each reference point q^k (see Figure 3), and choose the depth d_p^k that maximizes two confidence measures. First, we expect the colors of p and p_j^k to be similar:

C_i^c(p, p_j^k) = exp(−‖I_i(p) − I_j(p_j^k)‖_2^2 / (2σ_c^2)).    (4)

For our experiments, we used σ_c = 0.025. The gradient computation results in more robust depth estimates if q^k and p_j^k are close, i.e. the depth d_p^k of p is close to the depth d_q^k of the plane used for the gradient computation:

C_i^d(p, p_j^k) = exp(−(d_q^k − d_p^k)^2).    (5)

The final confidence measure is computed by multiplying the individual components:

C_i(p, p_j^k) = C_i^c(p, p_j^k) · C_i^d(p, p_j^k).    (6)

For each p, we choose the p_j^k which maximizes this confidence measure as the mapping p_j, and store the depth value d_p and the confidence value c_p.

We start by computing the depth maps for I_i using the nearest neighbors I_{i−1} and I_{i+1}, and hierarchically move to the next images in L to adjust the depth estimates further. After the initial step, we use the depth estimate d_p for p as the initial guess, and sample K reference points q^k around the new q in I_{i+2} corresponding to d_p. We again sample the reference points one pixel apart from each other. Note that, as we move further away from I_i in L, the motion of a point along the epipolar line relative to the change in depth gets faster. However, since we again sample K reference points, we implicitly make the depth range smaller at each step, leading to more precise depth estimates. We compute depths using the views whose viewing directions differ from that of I_i by no more than 5° for each viewpoint, and store the final depth maps D_i with their confidence maps C_i. Since the scene points' trajectories are only visible around high-gradient regions, we compute the depths only around regions with enough gradient response, i.e. ∀p, ‖∇I_i(p)‖ > g, where g = 0.05.

Figure 3: Top: Two close-up views of two images (I_i and I_j) from the BASKET dataset. The highlighted lines represent the lines that are sampled to generate s_{i,j}(p, q). The marked dot in I_i is p, whose depth is computed. The blue and magenta dots in I_j are two different reference points q^1 and q^2, around which s_j(p, q) is sampled. Bottom: s_{i,j}(p, q) around the two different q, on which the light field gradient ∇L_{i,j}(p, q) is computed. The overlaid lines show the trajectory directions γ_{i,j}(p, q). Note that different q result in different trajectory directions.
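Before moving on to filtering, here is a minimal sketch of the candidate selection of Eqs. (4)-(6), assuming the K candidate mappings p_j^k, their colors, and the associated depths have already been computed; the function name best_candidate is ours, not the paper's.

import numpy as np

def best_candidate(color_p, colors_pjk, d_q, d_p, sigma_c=0.025):
    """Return the index and confidence of the candidate maximizing Eq. (6).

    color_p:    (3,) color of p in I_i.
    colors_pjk: (K, 3) colors I_j(p_j^k) of the candidate mappings.
    d_q, d_p:   (K,) depths of the reference planes and of the mapped points.
    """
    c_color = np.exp(-np.sum((colors_pjk - color_p) ** 2, axis=1) / (2 * sigma_c ** 2))  # Eq. (4)
    c_depth = np.exp(-(np.asarray(d_q) - np.asarray(d_p)) ** 2)                          # Eq. (5)
    conf = c_color * c_depth                                                             # Eq. (6)
    k = int(np.argmax(conf))
    return k, conf[k]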
In a we sum up the contributions of all back-projected 3D fore- second view, say Ij, the pixels within the projected patches ground points x using a voting scheme defined as follows: are sampled, forming g+ and g−, also vectorized. In our implementation, for each direction, we sample three pixels X  1 2  H(v) = cx · exp − 2 kv − xk2 , (7) along θ(p) in Ii and three other pixels in Ij at the locations 2σv x that are projected from the three pixels of Ii. One side of the p I I where cx is the confidence value associated to x. We used photoconsistency for between i and j is then defined as f g σv equal to the length of a voxel in our experiments. the patch difference between + and +: We reassign the depth Di(p) of each foreground pixel p  1  ρ(f , g ) = exp − kf − g k2 . (9) to the depth value of the most voted voxel along the viewing + + 2σ2 + + 2 ray from the camera center through p; see Figure2(c). Since p we are interested in the shape of the foreground object, we The other side of the photoconsistency is defined simi- filter only the foreground points, while background depths larly for f− and g−. We chose σp to be the same as σc are kept as they are. in Eq. (4). The bidirectional photoconsistency values C+(p) and C−(p) are computed by averaging all pairwise photo- 4. Depth propagation consistency values among the views in L whose viewing directions are no more different than 5◦ from that of I . The acquired point cloud, as described in Section3, al- i The bidirectional photoconsistency indicates the likeli- ready reveals the accurate shape of the object, but only in hood of both sides being on the same depth as p: if p is on highly textured regions and over the silhouettes. In order to the silhouette, the background seen in one side will move at generate a more complete 3D object reconstruction, we prop- a different speed with respect to the camera, leading to a low agate the depth information towards low-gradient regions, consistency value for that side (Figure4). The differentiation for which we use the information about whether any high- between texture and silhouette edges helps decide on the gradient region corresponds to texture or object boundaries. direction to which the depth is propagated. Thanks to the dense sampling of the data, we can dif- ferentiate between texture and silhouette edges by looking 4.2. Edge-aware depth propagation at the trajectory of 3D points corresponding the edges. For a texture edge, scene points on both sides of the edge will The depth maps Di are sparsely sampled, since the depths have similar trajectories, whereas for silhouette edges, only and the consistencies are computed only on high gradient one side of the edge follows the same trajectory. To utilize regions. In this step, we propagate them to smooth regions this observation, we propose to use the bidirectional photo- using edge-aware filtering, thereby exploiting the computed consistency, which assesses how consistently both sides of photoconsistencies. However, each p on a high gradient re- an edge move in different views. The use of bidirectional gion has two photoconsistency values, one for each direction photoconsistency has two advantages. First, we can differen- along θ(p), which needs special care during filtering. Since tiate texture edges from silhouette edges, since only texture we know that the direct neighbors in these directions should edges will be consistent on both sides of the edge. 
4. Depth propagation

The acquired point cloud, as described in Section 3, already reveals the accurate shape of the object, but only in highly textured regions and over the silhouettes. In order to generate a more complete 3D object reconstruction, we propagate the depth information towards low-gradient regions, for which we use information about whether a high-gradient region corresponds to texture or an object boundary.

Thanks to the dense sampling of the data, we can differentiate between texture and silhouette edges by looking at the trajectories of the 3D points corresponding to the edges. For a texture edge, scene points on both sides of the edge will have similar trajectories, whereas for silhouette edges, only one side of the edge follows the same trajectory. To utilize this observation, we propose to use the bidirectional photoconsistency, which assesses how consistently both sides of an edge move in different views. The use of bidirectional photoconsistency has two advantages. First, we can differentiate texture edges from silhouette edges, since only texture edges will be consistent on both sides of the edge. Second, when propagating the depth values from edges to smoother regions, we can use it to decide on which side to propagate.

4.1. Bidirectional photoconsistency

The bidirectional photoconsistency measures the texture variation on both sides of the image edge separately. For a pixel p in I_i, whose depth value is d_p = D_i(p), we first compute its image gradient direction:

θ(p) = tan^{-1}(∇_y I_i(p) / ∇_x I_i(p)).    (8)

Note that θ(p) is different from γ_{i,j}(p, q); θ(p) is computed per image, whereas γ_{i,j}(p, q) is computed between different images inside L. Then, we sample a thin rectangular patch on each side of p along θ(p); see Figure 4. We vectorize the sampled pixels within the two patches, and denote the one taken in the positive θ(p) direction by f_+ and the other by f_−. The two patches are then projected to the neighboring views in L through a fronto-parallel plane placed at d_p. In a second view, say I_j, the pixels within the projected patches are sampled, forming g_+ and g_−, also vectorized. In our implementation, for each direction, we sample three pixels along θ(p) in I_i and three other pixels in I_j at the locations that are projected from the three pixels of I_i. One side of the photoconsistency for p between I_i and I_j is then defined as the patch difference between f_+ and g_+:

ρ(f_+, g_+) = exp(−‖f_+ − g_+‖_2^2 / (2σ_p^2)).    (9)

The other side of the photoconsistency is defined similarly for f_− and g_−. We chose σ_p to be the same as σ_c in Eq. (4). The bidirectional photoconsistency values C_+(p) and C_−(p) are computed by averaging all pairwise photoconsistency values among the views in L whose viewing directions differ by no more than 5° from that of I_i.

The bidirectional photoconsistency indicates the likelihood of both sides being at the same depth as p: if p is on the silhouette, the background seen on one side will move at a different speed with respect to the camera, leading to a low consistency value for that side (Figure 4). The differentiation between texture and silhouette edges helps decide the direction in which the depth is propagated.

Figure 4: Top: One view of the AFRICA dataset in the center, with close-ups of two regions where the bidirectional photoconsistency is computed. The leg of the person and the texture on the giraffe, shown in the cyan and green boxes, are sampled along the gradient directions, shown in matching colors. These patches and the patches sampled from nearby views are stacked together for both edges and shown in boxes of matching colors. Note that only the texture edge is consistent on both sides of the image edge. Bottom: Reconstructed points marked as silhouette edges (left) and texture edges (right). For this visualization, we used the ratio |c_+ − c_−| / (c_+ + c_−), and marked points with ratios higher than 0.3 as silhouette edges.
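A small sketch of the bidirectional test, assuming the vectorized patches f_± and g_± have already been sampled as described above; the 0.3 ratio is the one used for the Figure 4 visualization, and the helper names are ours.

import numpy as np

def one_sided_consistency(f, g, sigma_p=0.025):
    """Eq. (9): photoconsistency of one side of the edge between I_i and I_j."""
    return np.exp(-np.sum((np.asarray(f) - np.asarray(g)) ** 2) / (2.0 * sigma_p ** 2))

def edge_type(c_plus, c_minus, ratio_threshold=0.3):
    """Classify an edge pixel from the averaged consistencies C_+(p), C_-(p)."""
    ratio = abs(c_plus - c_minus) / (c_plus + c_minus)
    return "silhouette" if ratio > ratio_threshold else "texture"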
4.2. Edge-aware depth propagation

The depth maps D_i are sparsely sampled, since the depths and the consistencies are computed only in high-gradient regions. In this step, we propagate them to smooth regions using edge-aware filtering, thereby exploiting the computed photoconsistencies. However, each p in a high-gradient region has two photoconsistency values, one for each direction along θ(p), which needs special care during filtering. Since we know that the direct neighbors in these directions should share the depth and confidence values with the edge regions, we propose a simple splatting strategy to avoid this special case: the neighboring pixel p′ in the positive θ(p) direction from p is assigned C_+(p), whereas the neighboring pixel in the negative θ(p) direction is assigned C_−(p). The depth values D_i(p′) are initialized with D_i(p). If a pixel p′ is affected by multiple pixels in high-gradient regions, we choose the depth and confidence values from the neighbor with the highest confidence value. For the high-gradient regions themselves, we keep the higher of C_+(p) and C_−(p) as C_i(p).

Now that we have per-pixel depth and confidence maps for each view, we employ confidence-weighted joint-edge-aware filtering using the images I_i inside L as the joint domains, as proposed by Lang et al. [25], which makes use of the geodesic filter by Gastal and Oliveira [16]. First, we multiply D_i and C_i element-wise and filter them using the geodesic filter with I_i as the joint domain, which generates (C_i ⊙ D_i)′, where ⊙ represents element-wise multiplication. This process gives higher emphasis to depth values with higher confidence. We then normalize the results by dividing (C_i ⊙ D_i)′ by C_i′, the filtered version of the confidence map, again element-wise. The final depth map is computed as

D_i′ = (C_i ⊙ D_i)′ / C_i′.    (10)

We refer the readers to the original papers for a more in-depth discussion of this filtering operation.

In order to avoid depth values that lie vaguely between the foreground object and the background clutter, we apply the filtering operation for the foreground and background depth maps separately. If the confidence at p is larger in the foreground depth map, we keep this depth value for that pixel, and vice versa. The final confidence map is then the absolute difference between the confidence maps for the foreground and background depth maps. From this point on, we will refer to D_i′ and C_i′ as D_i and C_i, respectively.
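As an illustration of the normalization in Eq. (10) only, the sketch below uses a Gaussian filter as a stand-in for the joint-edge-aware geodesic filter of [25, 16], which would additionally be guided by the image I_i; it shows the confidence-weighted structure, not the actual filter used in our pipeline.

import numpy as np
from scipy.ndimage import gaussian_filter  # placeholder for the geodesic filter of [16]

def propagate_depth(D, C, sigma=5.0):
    """Confidence-weighted propagation, Eq. (10): D' = (C * D)' / C'."""
    num = gaussian_filter(C * D, sigma)        # (C_i ⊙ D_i)'
    den = gaussian_filter(C, sigma)            # C_i'
    return num / np.maximum(den, 1e-8)         # avoid division by zero in empty regions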

5. Depth aggregation

The depth propagation step generates dense depth maps for each input image I_i independently, where smooth regions are assigned depth values by interpolating known depth values at image edges. These depth maps already describe the object's shape as a point cloud, but can have inconsistencies due to the inexact image-space filtering operation. We aggregate these depth maps in 3D space to reconstruct a globally consistent object representation in the form of a mesh.

Since the number of views is on the order of thousands, computing globally consistent depth maps is not a viable option due to the time complexity. On the other hand, having a very large number of depth maps has the advantage that their consensus in 3D space provides enough information to infer the actual surface. Noisy estimates from a small number of views can be tolerated thanks to correct estimates from other views that see the same scene point. Our solution to compute the 3D surface relies on these observations and makes use of an efficient depth aggregation scheme via a dense voxel grid.

We use the same voxel grid V as in Section 3.2, but this time we utilize both foreground and background points. For each v ∈ V, we compute the probability H(v) of that voxel being on the surface of the object. In order to compute these probabilities, we project every voxel v to the images, and interpolate D_i and C_i. Given that a voxel v projects to a subpixel location p_i in I_i, with interpolated depth value d_i and confidence value c_i, we compute the per-view probability of having the surface at v by differentiating between two cases. If d_i falls inside V, it is a foreground point. We then compute the confidence c_{v,i} of having the surface at v using an exponential decay function, depending on the difference between d_i and d_{v,i}, the depth of v with respect to I_i:

c_{v,i} = c_i · exp(−|d_i − d_{v,i}|^2 / (2σ_v^2)).    (11)

If d_i is outside V, i.e. it is a background point, then we directly assign c_{v,i} = −c_i, since all voxels on this viewing ray should be in free space and affected with the same magnitude.

Using these confidence values, we directly apply Bayes' rule to compute the per-view probability P_i(v ∈ S | c_{v,i}) of having the surface at v, given the confidence value:

P_i(v ∈ S | c_{v,i}) = P(c_{v,i} | v ∈ S) · P(v ∈ S) / P(c_{v,i}),    (12)

where S stands for the set of voxels on the object surface. We model P(c_{v,i} | v ∈ S), i.e. the confidence value of a surface voxel v, using N(1, σ_s), a normal distribution with mean 1, to handle the noise of the per-view depth maps. The confidence value of a voxel in free space, denoted by P(c_{v,i} | v ∈ F), is modeled with a normal distribution N(−1, σ_s), i.e. with mean −1. The denominator in Eq. (12) can be computed as follows:

P(c_{v,i}) = P(c_{v,i} | v ∈ S) · P(v ∈ S) + P(c_{v,i} | v ∈ F) · P(v ∈ F).    (13)

We modeled P(v ∈ F) and P(v ∈ S) to be of equal probability, 0.5, due to no prior knowledge about the scene.

Our aggregation scheme is similar to that of Yücer et al. [40] in that we accumulate the per-image probabilities using a geometric mean. Given all P_i(v ∈ S | c_{v,i}), we compute the probability H(v) using the following formula:

H(v) = (∏_{i=1}^{n} P_i(v ∈ S | c_{v,i}))^{1/n}.    (14)

We generate the surface by thresholding H(v) at 0.2 and applying marching cubes. We use a small value for thresholding, since our surface probabilities are generally of smaller magnitude compared to the free space probabilities.
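A compact sketch of Eqs. (11)-(14) for a single voxel, with equal priors P(S) = P(F) = 0.5; sigma_s and sigma_v are the standard deviations introduced above, and the function names are illustrative rather than part of our implementation.

import numpy as np
from scipy.stats import norm

def voxel_confidence(c_i, d_i, d_vi, sigma_v, is_foreground):
    """Eq. (11) for foreground samples; background samples vote negatively."""
    if is_foreground:
        return c_i * np.exp(-(d_i - d_vi) ** 2 / (2.0 * sigma_v ** 2))
    return -c_i

def surface_probability(c_vi, sigma_s):
    """Eqs. (12)-(13): per-view probability of the voxel lying on the surface."""
    p_s = norm.pdf(c_vi, loc=1.0, scale=sigma_s)   # P(c | v in S)
    p_f = norm.pdf(c_vi, loc=-1.0, scale=sigma_s)  # P(c | v in F)
    return (0.5 * p_s) / (0.5 * p_s + 0.5 * p_f)

def aggregate(per_view_probs):
    """Eq. (14): geometric mean over all views; threshold H(v) at 0.2 for meshing."""
    p = np.maximum(np.asarray(per_view_probs, dtype=float), 1e-12)
    return float(np.exp(np.mean(np.log(p))))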

Voxel carving. The resulting mesh already captures most details of the object and is ready to be used as is. In order to pronounce the fine details further, we examine the photoconsistency of the voxels inside the surface.

A general solution for refining the mesh would be to apply volumetric graph-cuts inside the voxel grid. However, untextured thin features, like the legs and arms in the AFRICA dataset or the straw details of the BASKET dataset, pose a problem for graph-cuts. Around such features, photoconsistency measures do not clearly point to the object boundary, and the graph-cut result can remove them from the final reconstruction altogether. Instead, we use an efficient voxel carving approach, which only carves out inconsistent voxels and keeps the thin features intact.

First, we compute a region of interest R inside the mesh, which is 3 voxels deep from the surface, and propagate the mesh normals inside R. The visibility of the voxels is computed using the current mesh as a prior and rendering the back-facing faces from each viewpoint I_i. If a voxel's depth is smaller than the depth of the mesh seen from I_i, then it is counted as visible from I_i [22]. After all voxels v ∈ R are projected to all images I_i, we gather for each voxel v the color values {c_v(i)} and the weights {w_v(i)} from all images I_i to which it projects. The weights of views that do not see the voxel are set to 0. For all other views, we compute the weight as the dot product of the voxel normal n_v and the viewing ray from I_i to v, namely r_{v,i}:

w_v(i) = n_v · r_{v,i} if n_v · r_{v,i} > 0, and 0 otherwise.    (15)

Given the colors and weights, we compute a weighted variance of the colors as the photoconsistency PC(v):

PC(v) = Σ_{i=1}^{n} w_v(i) (c_v(i) − µ_v)^2 / Σ_{i=1}^{n} w_v(i),    (16)

where µ_v is the weighted average of c_v. We carve out all voxels that have PC(v) lower than a threshold, which we set to 0.95. The carving is repeated until no voxels are carved out. Our voxel carving approach is very efficient in removing unnecessary voxels from the surface, and converges very quickly. We performed a maximum of 3 iterations for all results in the paper. Finally, we supply all voxels v and their normals n_v on the boundary of R to Poisson surface reconstruction [23] to generate our final results (see Figure 2).
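The per-voxel photoconsistency of Eqs. (15)-(16) is a view-weighted color variance; the following sketch assumes the colors and the dot products n_v · r_{v,i} have already been gathered for one voxel, and that at least one view sees it (photoconsistency is a name of our choosing).

import numpy as np

def photoconsistency(colors, normal_dot_rays):
    """Eqs. (15)-(16): weighted color variance PC(v) of one voxel.

    colors: (n, 3) colors gathered from the views; normal_dot_rays: (n,)
    dot products n_v . r_{v,i}; non-seeing views should carry values <= 0.
    """
    w = np.maximum(np.asarray(normal_dot_rays, dtype=float), 0.0)  # Eq. (15)
    mu = np.average(colors, axis=0, weights=w)                     # weighted mean color mu_v
    return float(np.average(np.sum((np.asarray(colors) - mu) ** 2, axis=1), weights=w))  # Eq. (16)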
6. Experiments and Results

In order to assess the quality and performance of our reconstruction method, we used the dense light field datasets of Yücer et al. [40]. The datasets feature various objects with fine details comprising many thin, intricate features and complicated topology, which we aim to reconstruct. Since we intend to demonstrate our algorithm in a more realistic setup, we used the hand-held datasets. We compare our reconstructions with those from well-known reconstruction methods whose implementations are publicly available, namely PMVS [15] and MVE [12]. Additionally, we compared with the visual hull results accompanying the datasets.

Figure 5 shows the results on DECORATION, BASKET, and AFRICA obtained using PMVS, MVE, Yücer et al. (denoted by LFS), and our method. For the two multi-view stereo methods, we used as many views as possible until the results stopped improving, which was about 200 views, and hand-picked the parameters to obtain the best results. The screened Poisson surface reconstruction (PSR) [23] was used to extract meshes from the PMVS point clouds, and the floating scale surface reconstruction (FSSR) [11] was used for the MVE point clouds, as it is bundled with MVE. For our method, we used the same set of parameters as reported in the paper regardless of the dataset.

Figure 5: Comparison of our method to PMVS [15], MVE [12] and LFS [40] on 3 different datasets (DECORATION, BASKET, and AFRICA). The point clouds of PMVS are meshed with Poisson surface reconstruction [23], and the MVE results are meshed using floating scale surface reconstruction [11] as part of their pipeline. For each dataset, we show two close-ups around thin object features. Note the quality our method achieves around such features.

While PMVS presents largely smooth results, it lacks detail around object features and shows irregular surfaces in the regions where the objects have complicated shapes. MVE results in clean surfaces in smooth regions, but in the regions with intricate detail, the reconstructed surfaces become noisy, which we believe is attributable to FSSR's relative deficiency in tolerating outlier points. While LFS reveals a certain amount of detail and topologically accurate surfaces, e.g. in the reconstructed BASKET, it is not able to handle concavities and the surfaces look carved out, e.g. see the insets of DECORATION. On the other hand, our method reveals the fine details of the objects possessing many intricate features and topologically complicated structures; see the thin features and narrow openings in the insets.

n n (ACTS) [41] which is tailored to dense video input. Since X 2. X ACTS produces per-view depth maps, we merged them into PC(v) = wv(i)(cv(i) − µv) wv(i), (16) i=1 i=1 point clouds for a better visualization. Compared to ACTS, ihmn hneeet,weesAT eut contain results ACTS whereas elements, thin many with hc okaon 5minutes. 25 around took which ts n h eto iei pn o h grgto step, aggregation the for spent is time of rest min- the 40 and about utes, require minutes. propagation 55 depth about and taking computation parts, PC consuming time filtering most the and are computation video Depth 720p frames. a from 3000 cloud comprising point to oriented implementation final our the for exper- minutes reconstruct the 120 for about RAM 3.2 takes of It GB with iments. 32 PC and CPU desktop Intel a Quad-core GHz on OpenMP- implementation an ran parallel We entirely. based parallelized be can aggrega- steps and tion computation depth particular in parallelization, Performance. required. are results resolution down- high a extremely be when can complexity side cubic to Its tied resolution. is grid resolution voxel reconstruction the in final efficient the space but and steps, time most both may is step algorithm aggregation Our subsequent enough holes. the leave effective and be areas not these fill may to propagation In the estimates. cases, depth such without regions large leave al- will enough calculation gorithm possess depth not gradient-based does our scene variations, the texture if work, to method our for Limitations. datasets. input and results the at for look material closer supplementary accompanying the the to in refer as Please space consistently free even in method points the reconstructs and features the around noise plants the the of Note boxes. geometry blue faithful and more cyan in the results inside method of surface close-ups our clouds scale two features. floating point show object using The we thin meshed result, datasets. around are each different achieves MVE For method 3 of pipeline. our on results their quality [40] the of LFS and part and [23] as [18] reconstruction [11] MVE surface reconstruction Poisson [15], with PMVS meshed to are method PMVS our of Comparison 5: Figure Africa Basket Decoration hl otojcsfaueeog texture enough feature objects most While ayseso u loih r utbefor suitable are algorithm our of steps Many PMVS T RUNKS MVE dataset. eso hs rmtodfeetviewpoints. different two from clouds, these point show produces we mesh ACTS Since resulting texture. our without show and we with algorithm, our For [41]. ware .Conclusions 7. References research. further inspire will and methods existing stereo the complexmulti-view complement and contributions intricate our believe with We objects shape. highly of the models extremely reveals 3D utilize detailed and fully information, and spatio-angular benefit dense to Our designed algorithm. is aggregation depth method robust and efficient photoconsistency an bidirectional propa- and by depth assisted a to scheme addition gation in estimation, accurate and depth efficient pixel-wise novel for A proposed fields. is light method sampled gradient-based densely from models 3D tailed soft- ACTS the to technique our of Comparison 6: Figure

References

[1] S. Agarwal, Y. Furukawa, N. Snavely, I. Simon, B. Curless, S. M. Seitz, and R. Szeliski. Building Rome in a day. Communications of the ACM, pages 105–112, 2011.
[2] T. Basha, S. Avidan, A. Hornung, and W. Matusik. Structure and motion from scene registration. In CVPR, 2012.
[3] T. E. Bishop and P. Favaro. Full-resolution depth map estimation from an aliased plenoptic light field. In ICCV, 2010.
[4] R. C. Bolles, H. H. Baker, and D. H. Marimont. Epipolar-plane image analysis: An approach to determining structure from motion. IJCV, 1(1):7–55, 1987.
[5] D. Bradley, T. Boubekeur, and W. Heidrich. Accurate multi-view reconstruction using robust binocular stereo and surface meshing. In CVPR, 2008.
[6] C. Buehler, M. Bosse, L. McMillan, S. Gortler, and M. Cohen. Unstructured lumigraph rendering. In SIGGRAPH, 2001.
[7] C. Chen, H. Lin, Z. Yu, S. Bing Kang, and J. Yu. Light field stereo matching using bilateral statistics of surface cameras. In CVPR, 2014.
[8] A. Criminisi, S. B. Kang, R. Swaminathan, R. Szeliski, and P. Anandan. Extracting layers and analyzing their specular properties using epipolar-plane-image analysis. CVIU, 97(1):51–85, 2005.
[9] A. Davis, M. Levoy, and F. Durand. Unstructured light fields. Comput. Graph. Forum, 31(2):305–314, 2012.
[10] I. Feldmann, P. Kauff, and P. Eisert. Extension of epipolar image analysis to circular camera movements. In ICIP, 2003.
[11] S. Fuhrmann and M. Goesele. Floating scale surface reconstruction. ACM Trans. Graph., 33(4):46, 2014.
[12] S. Fuhrmann, F. Langguth, and M. Goesele. MVE – a multi-view reconstruction environment. In Eurographics Workshop on Graphics and Cultural Heritage, 2014.
[13] Y. Furukawa and C. Hernández. Multi-view stereo: A tutorial. Foundations and Trends in Computer Graphics and Vision, 9(1-2):1–148, 2015.
[14] Y. Furukawa and J. Ponce. Carved visual hulls for image-based modeling. IJCV, 81(1):53–67, 2009.
[15] Y. Furukawa and J. Ponce. Accurate, dense, and robust multi-view stereopsis. PAMI, 32(8):1362–1376, 2010.
[16] E. S. L. Gastal and M. M. Oliveira. Domain transform for edge-aware image and video processing. ACM Trans. Graph., 30(4):69:1–69:12, 2011.
[17] M. Goesele, B. Curless, and S. M. Seitz. Multi-view stereo revisited. In CVPR, 2006.
[18] M. Goesele, N. Snavely, B. Curless, H. Hoppe, and S. M. Seitz. Multi-view stereo for community photo collections. In ICCV, 2007.
[19] S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen. The Lumigraph. In SIGGRAPH, 1996.
[20] S. Heber and T. Pock. Convolutional networks for shape from light field. In CVPR, 2016.
[21] A. Hornung and L. Kobbelt. Hierarchical volumetric multi-view stereo reconstruction of manifold surfaces based on dual graph embedding. In CVPR, 2006.
[22] A. Hornung and L. Kobbelt. Robust and efficient photoconsistency estimation for volumetric 3D reconstruction. In ECCV, 2006.
[23] M. Kazhdan and H. Hoppe. Screened Poisson surface reconstruction. ACM Trans. Graph., 32(3), 2013.
[24] C. Kim, H. Zimmer, Y. Pritch, A. Sorkine-Hornung, and M. Gross. Scene reconstruction from high spatio-angular resolution light fields. ACM Trans. Graph., 32(4), 2013.
[25] M. Lang, O. Wang, T. Aydin, A. Smolic, and M. Gross. Practical temporal consistency for image-based graphics applications. ACM Trans. Graph., 31(4):34:1–34:8, 2012.
[26] M. Levoy and P. Hanrahan. Light field rendering. In ACM SIGGRAPH, pages 31–42, 1996.
[27] C.-K. Liang and R. Ramamoorthi. A light transport framework for lenslet light field cameras. ACM Trans. Graph., pages 16:1–16:19, 2015.
[28] P. Merrell, A. Akbarzadeh, L. Wang, P. Mordohai, J.-M. Frahm, R. Yang, D. Nistér, and M. Pollefeys. Real-time visibility-based fusion of depth maps. In ICCV, 2007.
[29] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In CVPR, pages 519–528, 2006.
[30] Q. Shan, B. Curless, Y. Furukawa, C. Hernández, and S. M. Seitz. Occluding contours for multi-view stereo. In CVPR, pages 4002–4009, 2014.
[31] S. N. Sinha and M. Pollefeys. Multi-view reconstruction using photo-consistency and exact silhouette constraints: A maximum-flow formulation. In ICCV, 2005.
[32] A. Tabb. Shape from silhouette probability maps: Reconstruction of thin objects in the presence of silhouette extraction and calibration error. In CVPR, pages 161–168, 2013.
[33] M. Tao, P. Srinivasa, J. Malik, S. Rusinkiewicz, and R. Ramamoorthi. Depth from shading, defocus, and correspondence using light-field angular coherence. In CVPR, 2015.
[34] K. Venkataraman, D. Lelescu, J. Duparré, A. McMahon, G. Molina, P. Chatterjee, R. Mullis, and S. Nayar. PiCam: An ultra-thin high performance monolithic camera array. ACM Trans. Graph., 32(6):166:1–166:13, 2013.
[35] G. Vogiatzis, C. Hernández Esteban, P. H. S. Torr, and R. Cipolla. Multiview stereo via volumetric graph-cuts and occlusion robust photo-consistency. PAMI, 29(12), 2007.
[36] T.-C. Wang, A. A. Efros, and R. Ramamoorthi. Occlusion-aware depth estimation using light-field cameras. In ICCV, 2015.
[37] S. Wanner and B. Goldluecke. Globally consistent depth labeling of 4D light fields. In CVPR, 2012.
[38] S. Wanner, C. N. Straehle, and B. Goldluecke. Globally consistent multi-label assignment on the ray space of 4D light fields. In CVPR, pages 1011–1018, 2013.
[39] Z. Yu, X. Guo, H. Ling, A. Lumsdaine, and J. Yu. Line assisted light field triangulation and stereo matching. In ICCV, pages 2792–2799, 2013.
[40] K. Yücer, A. Sorkine-Hornung, O. Wang, and O. Sorkine-Hornung. Efficient 3D object segmentation from densely sampled light fields with applications to 3D reconstruction. ACM Trans. Graph., 35(3), 2016.
[41] G. Zhang, J. Jia, T. Wong, and H. Bao. Consistent depth maps recovery from a video sequence. PAMI, 31(6), 2009.
2 [13] Y. Furukawa and C. Hernandez.´ Multi-view stereo: A tutorial. Foundations and Trends in Computer Graphics and Vision, [34] K. Venkataraman, D. Lelescu, J. Duparre,´ A. McMahon, 9(1-2):1–148, 2015.2 G. Molina, P. Chatterjee, R. Mullis, and S. Nayar. PiCam: An ultra-thin high performance monolithic camera array. ACM [14] Y. Furukawa and J. Ponce. Carved visual hulls for image- Trans. Graph, 32(6):166:1–166:13, 2013.2 based modeling. IJCV, 81(1):53–67, 2009.2 [35] G. Vogiatzis, C. Hernandez´ Esteban, P. H. S. Torr, and [15] Y. Furukawa and J. Ponce. Accurate, dense, and robust multi- R. Cipolla. Multiview stereo via volumetric graph-cuts and view stereopsis. PAMI, 32(8):1362–1376, 2010.1,2,7,8 occlusion robust photo-consistency. PAMI, 29(12), 2007.1,2 [16] E. S. L. Gastal and M. M. Oliveira. Domain transform for [36] T.-C. Wang, A. A. Efros, and R. Ramamoorthi. Occlusion- edge-aware image and video processing. ACM Trans. Graph., aware depth estimation using light-field cameras. In ICCV, 30(4):69:1–69:12, 2011.5 2015.2 [17] M. Goesele, B. Curless, and S. M. Seitz. Multi-view stereo [37] S. Wanner and B. Goldluecke. Globally consistent depth revisited. In CVPR, 2006.3 labeling of 4D lightfields. In CVPR, 2012.1,2 [18] M. Goesele, N. Snavely, B. Curless, H. Hoppe, and S. M. [38] S. Wanner, C. N. Straehle, and B. Goldluecke. Globally Seitz. Multi-view stereo for community photo collections. In consistent multi-label assignment on the ray space of 4D light ICCV, 2007.2,8 fields. In CVPR, pages 1011–1018, 2013.2 [19] S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen. [39] Z. Yu, X. Guo, H. Ling, A. Lumsdaine, and J. Yu. Line SIGGRAPH The Lumigraph. In , 1996.2 assisted light field triangulation and stereo matching. In ICCV, [20] S. Heber and T. Pock. Convolutional networks for shape from pages 2792–2799, 2013.2 light field. In CVPR, 2016.2 [40] K. Yucer,¨ A. Sorkine-Hornung, O. Wang, and O. Sorkine- [21] A. Hornung and L. Kobbelt. Hierarchical volumetric multi- Hornung. Efficient 3D object segmentation from densely view stereo reconstruction of manifold surfaces based on dual sampled light fields with applications to 3D reconstruction. graph embedding. In CVPR, 2006.1,2 ACM Trans. Graph., 35(3), 2016.2,7,8 [22] A. Hornung and L. Kobbelt. Robust and efficient photo- [41] G. Zhang, J. Jia, T. Wong, and H. Bao. Consistent depth maps consistency estimation for volumetric 3d reconstruction. In recovery from a video sequence. PAMI, 31(6), 2009.7,8 ECCV, 2006.7