Surface Light Field Fusion

Jeong Joon Park, University of Washington, [email protected]
Richard Newcombe, Facebook Reality Labs, [email protected]
Steve M. Seitz, University of Washington, [email protected]

Figure 1: Synthetic renderings of highly reflective objects reconstructed with a hand-held commodity RGBD sensor. Note the faithful texture, specular highlights, and global effects such as interreflections (2nd image) and shadows (3rd image).

Abstract

We present an approach for interactively scanning highly reflective objects with a commodity RGBD sensor. In addition to shape, our approach models the surface light field, encoding scene appearance from all directions. By factoring the surface light field into view-independent and wavelength-independent components, we arrive at a representation that can be robustly estimated with IR-equipped commodity depth sensors, and achieves high quality results.

1. Introduction

The advent of commodity RGBD sensing has led to great progress in 3D shape reconstruction [9, 30, 40]. However, the appearance of scanned models is often unconvincing, as view-dependent reflections (BRDFs) and global light transport (shadows, interreflections, subsurface scattering) are very difficult to model.

Rather than attempt to fit BRDFs and invert global light transport, an alternative is just to model the radiance (in every direction) coming from each surface point. This representation, known as a surface light field [41], captures the scene as it appears from any viewpoint in its native lighting environment.

We present the first Kinect Fusion-style approach designed to recover surface light fields. As such, we are able to convincingly reproduce view-dependent effects such as specular highlights, while retaining global effects such as diffuse interreflections and shadows that affect object appearance in its native environment (Fig. 1). To achieve this capability, we approximate the full surface light field by factoring it into two components: 1) a local specular BRDF model that accounts for view-dependent effects, and 2) a diffuse component that factors in global light transport. The resulting approach reconstructs both shape and high quality appearance, with about the same effort needed to recover shape alone – we simply require a handful of IR images taken during the usual scanning process. In particular, we exploit the IR sensor in commodity depth cameras to capture not just the shape, but also the view-dependent properties of the BRDF in the IR channel. Our renderings under environment lighting capture glossy and shiny objects, and provide limited support for anisotropic BRDFs.

We make the following contributions. The first is our novel, factored, surface light field-based problem formulation. Unlike traditional BRDF-fitting methods which enable scene relighting, we trade off relighting for the ability to capture diffuse global illumination effects like shadows and diffuse interreflections that significantly improve realism. Second, we present the first end-to-end system for capturing full (non-planar) surface light field object models with a hand-held scanner. Third, we introduce the first reconstructions of anisotropic surfaces using commodity RGBD sensors. Fourth, we introduce a high-resolution texture tracking and modeling approach that better models high frequency details, relative to prior Kinect Fusion [30] approaches, while retaining real-time performance. Fifth, we apply semantic segmentation techniques for appearance modeling, a first in the appearance modeling literature.

2. Problem Formulation and Related Work

The input to our system is an RGBD stream and infrared (IR) video from a consumer-grade sensor.
The output is a shape reconstruction and a model of scene appearance represented by a surface light field SL, which describes the radiance of each surface point x as a function of local outgoing direction ω_o and wavelength λ: SL(x, ω_o, λ). We introduce a novel formulation designed to effectively represent and estimate SL. We start by writing SL in terms of the rendering equation [17]:

    SL(x, ω_o, λ) = ∫_Ω E(x, ω_i, λ) f(x, ω_o, ω_i, λ) (n_x · ω_i) dω_i,    (1)

where E denotes global incident illumination, f the surface BRDF, ω_i the incident direction, and n_x the surface normal.

Prior work on modeling SL uses either parametric BRDF estimation methods or nonparametric image-based rendering methods.

BRDF estimation methods [12, 22, 42, 43] solve for an analytical BRDF f by separating out the global lighting E. Accurately modeling the global light transport, however, is extremely challenging; hence most prior art [12, 22, 42, 43] simply ignores global effects such as shadows and interreflections, assuming the scene surface is convex. Occlusions and interreflections cause artifacts under this assumption, so some methods [42, 43] require a black sheet of paper below the target object to minimize these effects.

Image-based rendering techniques [5, 28, 41] avoid this problem by nonparametrically modeling the surface light field SL via densely captured views on a hemisphere. However, the dense hemispherical capture requirement is laborious, and therefore not suited to casual hand-held scanning scenarios.

We address these two issues by 1) empirically capturing the diffuse component of the appearance that factors in the global lighting, and 2) analytically solving for specular parameters leveraging the built-in IR projector.

Specifically, we further decompose SL into a diffuse term D, a specular term S with wavelength-independent BRDF (as in the Dichromatic model [33]), and a residual R:

    SL(x, ω_o, λ) = ∫_Ω E(x, ω_i, λ) f_d(x, λ) (n_x · ω_i) dω_i            [ = D(x, λ) ]
                  + ∫_Ω E_d(x, ω_i, λ) f_s(x, ω_o, ω_i; β) (n_x · ω_i) dω_i   [ = S(x, ω_o, λ) ]
                  + R(x, ω_o, λ),    (2)

where f_d is the diffuse albedo, f_s the specular component of the BRDF with parameters β, and E_d the direct illumination; we ignore specular interreflection.

Rather than infer the diffuse albedo f_d as in [12, 22, 42, 43], we simply capture the diffuse appearance D directly in a texture map (the concept referred to as a "lightmap" in the gaming industry). We thereby make use of the rich global effects like shadows and interreflections that are captured in D, but avoid the difficult problem of factoring out reflectance and multi-bounce lighting. In doing so, we give up the ability to relight the scene in exchange for the ability to realistically capture global illumination effects.

The specular term S is approximated by recovering a parametric BRDF f_s with an active-light IR system, assuming that f_s is wavelength-independent. S captures the specular properties of dielectric materials like plastic or wood and of non-colored metals like steel or aluminum [33, 36, 46].

We assume the residual R to be negligible; as such, our approach does not accurately model colored metals like bronze or gold, and omits specular interreflections.

Overall, our formulation has two main advantages. First, we realistically capture diffuse global illumination effects in SL without explicitly simulating global light transport. Second, the specular BRDF is analytically recovered from only a sparse set of IR images, removing the need for controlled lighting rigs [14, 22, 46] and extensive view sampling [5, 15, 28, 41].

Other Related Works. Some authors explicitly model global light transport, as in [24] and, to a lesser extent, [32], which assumes low-frequency lighting and material. However, this requires solving a much more difficult problem, and the authors note difficulties in recovering highly specular surfaces and limit their results to simple, mostly uniformly textured objects.

[15] captures a planar surface light field with a monocular camera using feature-based pose tracking.
Although [15] removes the need for the special gantries used in [23, 41], it still requires a dense sampling of the surface light field, making it hard to scale beyond a small planar surface.

A few other authors utilize IR light and sensors to enhance the estimation of surface properties. [6] observes that many interesting materials exhibit limited texture variations in the IR channel, and uses this to estimate geometric details and BRDF. [7] refines a reconstructed mesh with shading cues in the IR channel. [18] leverages an IR sensor to add constraints to the intrinsic image decomposition task.

3. System Overview

Our system operates similarly to Kinect Fusion [30] in that a handheld commodity RGBD sensor is moved over a scene to reconstruct its geometry and appearance. The diffuse component D is obtained by interactively fusing high resolution texture onto the base geometry [30] and subsequently removing baked-in specular highlights using a min-composite [35] approach (Sec. 5).

Section 4 describes recovering S. We use a learning-based classifier to segment the scene into regions with similar surface properties and estimate the specular BRDF f_s for each segment using a calibrated IR system. S is then computed by taking the product of the recovered f_s and environment lighting captured from a low-end 360° camera.
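As a concrete illustration of this last step, the sketch below shows one way S could be assembled at render time by summing the recovered specular BRDF against samples of the captured environment map. The function name, the sample format, and the Monte-Carlo-style summation are assumptions for exposition, not the paper's actual shader.

import numpy as np

def specular_radiance(n, w_o, env_samples, brdf_s):
    """Approximate S(x, w_o) = sum_i E(w_i) f_s(w_o, w_i) (n . w_i) d_omega.
    env_samples: iterable of (w_i (3,), radiance (3,), solid_angle); brdf_s: callable."""
    S = np.zeros(3)
    for w_i, radiance, d_omega in env_samples:
        cos_i = float(np.dot(n, w_i))
        if cos_i > 0.0:                      # only front-facing incident directions contribute
            S += np.asarray(radiance) * brdf_s(w_o, w_i, n) * cos_i * d_omega
    return S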

4. Computing S

In this section, we assume that the geometry is already reconstructed and focus on estimating S, the view-dependent component of the surface light field. We assume that the material varies over the surface of the object or scene, but that there is a relatively small number of material clusters over which estimates can be aggregated. We begin by calibrating the infrared (IR) system attached to the depth sensor and describe how to measure material properties in the IR channel for each material.

4.1. IR Projector Modeling

BRDF Estimation in the IR Channel. We leverage the IR projector built into most depth sensors as a point light source. Because we know the projector location relative to the camera, along with the scanned geometry, the surface BRDF can be estimated.

One challenge, however, is that the IR projector is typically not uniform – we use a Primesense structured light sensor that generates complex speckle patterns (Fig. 2). We take a simple approach to select pixels that are lit by the projector: we divide the IR image into a grid structure and keep the brightest pixel within each 5x5 grid cell (Fig. 2). The intensity of such a pixel is averaged with its four-neighborhood. Throughout the paper, a pixel in the IR image refers to this local maximum intensity. For simplicity, we consider only the central 192x192 region, to minimize the effects of vignetting and radial distortion.

Despite our usage of the structured light sensor, our system can be easily generalized to other depth sensors with a projector and receiver pair. For example, we refer readers to [18] for a time-of-flight projector model.

Figure 2: IR image (left), zoomed in (top-right), and local maximum subsampled image (bottom-right). Best viewed digitally.

IR Projector and Camera Calibration. With constant exposure time, the IR projector's light intensity κ and the camera's compression parameter γ are optimized as follows, by capturing a white piece of paper whose albedo is assumed to be 1:

    min_{κ,γ} Σ_{x∈P} ( L(x) − ( κ (n_x · l_x) / (π d_x²) )^γ )²,    (3)

where L(x) is the observed IR intensity at a pixel x in a collated pixel set P over multiple images, and n_x, l_x, and d_x are the normal, light direction, and distance to the light source of the surface point seen by x, respectively. Following [7], we assume indoor ambient IR light is negligible.
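A minimal sketch of the calibration in Eq. (3), assuming the per-pixel observations of the white target have already been collated. The helper name and the use of scipy's bounded least squares are illustrative choices, not the authors' implementation.

import numpy as np
from scipy.optimize import least_squares

def calibrate_ir(L, n, l, d):
    """L: (N,) observed IR intensities; n, l: (N,3) unit vectors; d: (N,) distances."""
    n_dot_l = np.clip(np.sum(n * l, axis=1), 1e-6, None)
    irradiance = n_dot_l / (np.pi * d ** 2)           # white paper => diffuse term is 1/pi

    def residuals(params):
        kappa, gamma = params
        return L - (kappa * irradiance) ** gamma       # per-pixel residual of Eq. (3)

    fit = least_squares(residuals, x0=[1.0, 1.0],
                        bounds=([1e-6, 0.1], [np.inf, 5.0]))
    return fit.x                                       # (kappa, gamma)

# toy usage with synthetic data (true kappa = 2.0, gamma = 0.8)
rng = np.random.default_rng(0)
n = np.tile([0.0, 0.0, 1.0], (100, 1))
l = n.copy()
d = rng.uniform(0.5, 1.5, 100)
L = (2.0 / (np.pi * d ** 2)) ** 0.8 + 0.001 * rng.standard_normal(100)
print(calibrate_ir(L, n, l, d))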

4.2. Per-Segment BRDF Estimation

Specular Reflection Model. Renaming the directions ω_o and ω_i in Eq. (2) as the local view direction v_x and light direction l_x respectively, we denote the specular component of the BRDF as f_s(v_x, l_x; β), where β is a vector of BRDF parameters that decides the reflectance distribution. In this work, we use the Ward BRDF model [39] for its simplicity and applicability to a variety of materials.

Material Segmentation. Accurately fitting BRDF parameters for each individual point requires observing specular highlights for all surface points, which is prohibitively expensive. Therefore, estimating a single BRDF for a region with shared reflectance properties is essential for a practical and robust system. To identify regions of the surface with similar material, we apply an image segmentation method [1] using a Convolutional Neural Network (CNN) for semantic material classification, adapted to operate on a mesh rather than an image (Figure 3). We first convert the patch classifier into a Fully Convolutional Network [25] for dense class predictions (as in [1]), but increase the resolution of the output signal using Dilated Convolution [45]. At each input frame, the output probability map from the softmax layer is projected and averaged into the vertices of our mesh M = {V, E}. We consider the mesh connectivity as a Markov Random Field and obtain the final vertex material labels y by minimizing the following energy function:

    Φ(y) = Σ_{v∈V} ψ_v(y_v) + Σ_{{m,n}∈E} ψ_{m,n}(y_m, y_n),    (4)

where the data term ψ_v(y_v) of the MRF is the standard negative log likelihood at vertex v: ψ_v(y_v) = −log(p_v(y_v)), and ψ_{m,n}(y_m, y_n) is the pairwise term. Please refer to the supplementary material for a description of the pairwise term.

Figure 3: Material segmentation results. The first row shows pictures of target scenes, and the second row shows 3D material segmentation results (labels: Fabric, Wood, Polished Stone, Stone, Metal, Painted Wall, Plastic, No Geometry). Note that the segmentation algorithm works on the reconstructed mesh instead of a single image.
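The paper minimizes Eq. (4) with graph cuts [3]; as a lightweight stand-in that illustrates the same energy, the sketch below uses iterated conditional modes (ICM) and a constant Potts-style pairwise weight in place of the full pairwise term from the supplement. Both substitutions are assumptions.

import numpy as np

def label_mesh(unary, edges, pairwise_weight=0.5, iters=10):
    """unary: (V, K) array of -log p_v(k); edges: list of (m, n) vertex index pairs."""
    V, K = unary.shape
    labels = unary.argmin(axis=1)                        # data-term-only initialization
    neighbors = [[] for _ in range(V)]
    for m, n in edges:
        neighbors[m].append(n)
        neighbors[n].append(m)
    for _ in range(iters):
        changed = False
        for v in range(V):
            # energy of each candidate label at v, with all neighbor labels held fixed
            cost = unary[v].copy()
            for u in neighbors[v]:
                cost += pairwise_weight * (np.arange(K) != labels[u])
            best = int(cost.argmin())
            if best != labels[v]:
                labels[v], changed = best, True
        if not changed:                                  # converged
            break
    return labels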
Specular BRDF Optimization. For each material segment, the specular parameters β that best explain the IR video are estimated. Each pixel x in an IR frame constrains the shared specular parameters β and the IR diffuse albedo ρ(x) of a surface point. We thus minimize the difference between the actual intensity L(x) and the prediction from our model to obtain the optimal β and IR diffuse albedo ρ:

    min_{β,ρ} Σ_{x∈P} ( L(x) − ( κ (n_x · l_x) / d_x² · ( ρ(x)/π + f_s(v_x, l_x; β) ) )^γ )²,    (5)

where the set P collates pixels over all frames. The resulting β is used to describe f_s, and thus S.

The full optimization involves estimating hundreds of thousands of parameters from millions of pixels. So, in practice, we exploit the interactive nature of the scanning process to have the user choose a small subset of frames over which to optimize. These frames are chosen such that a specular highlight is observed in the IR channel and each material is captured at least once. When a material is captured in only one frame, Eq. (5) is solved with a spatially constant diffuse albedo. Choosing reference views of convex surface regions improves the specularity estimation by reducing the impact of interreflections. The optimization is solved with the Levenberg-Marquardt method [27] (Fig. 4).

Figure 4: Specular BRDF fitting results for two scenes ((a) Dog scene, (b) Corncho scene), plotting specular components versus half angles of pixel observations along with the fitted prediction curves.
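A sketch of the per-segment fit of Eq. (5) under simplifying assumptions: an isotropic Ward lobe with β = (ρ_s, α), a single shared IR diffuse albedo (the single-frame case described above), and scipy's bounded least squares standing in for the paper's Levenberg-Marquardt solver. κ and γ come from the calibration of Eq. (3).

import numpy as np
from scipy.optimize import least_squares

def ward_specular(n, v, l, rho_s, alpha):
    """Isotropic Ward specular lobe [39]; n, v, l: (N,3) unit vectors."""
    h = v + l
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    cos_h = np.clip(np.sum(n * h, axis=1), 1e-6, 1.0)
    cos_i = np.clip(np.sum(n * l, axis=1), 1e-6, 1.0)
    cos_o = np.clip(np.sum(n * v, axis=1), 1e-6, 1.0)
    tan2 = (1.0 - cos_h ** 2) / cos_h ** 2
    return rho_s * np.exp(-tan2 / alpha ** 2) / (4 * np.pi * alpha ** 2 * np.sqrt(cos_i * cos_o))

def fit_segment(L, n, v, l, d, kappa, gamma):
    """Fit (rho_d, rho_s, alpha) of one segment by minimizing the residual of Eq. (5)."""
    n_dot_l = np.clip(np.sum(n * l, axis=1), 1e-6, None)

    def residuals(p):
        rho_d, rho_s, alpha = p
        radiance = kappa * n_dot_l / d ** 2 * (rho_d / np.pi + ward_specular(n, v, l, rho_s, alpha))
        return L - np.clip(radiance, 1e-9, None) ** gamma

    return least_squares(residuals, x0=[0.3, 0.1, 0.2],
                         bounds=([0.0, 0.0, 1e-3], [1.0, 5.0, 2.0])).x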

Anisotropic Surfaces. Many common reflective surfaces (especially wood and metal) reflect light anisotropically. We therefore extend our approach to model anisotropy for a limited class of surfaces (planes and cylinders) that occur frequently in man-made scenes.

For an anisotropic surface, solving Eq. (5) requires estimating tangent and binormal vectors at each surface point. Among other methods in the literature [11, 38], [14] suggests a photometric approach to estimate the per-point tangent vector by projecting half vectors onto the tangent plane and finding the direction that maximizes symmetry. Applying this technique in our context means densely observing each surface point from many different viewpoints, which is infeasible in a casual hand-held scanning session. We therefore assume that the tangent vector is shared across a material segment, i.e., the surface is planar or cylindrical.

Each pixel x in an IR frame has an associated half-vector h_x, the normal vector n_x of the surface point represented in global coordinates, and an intensity L(x). To determine a single representative tangent plane P_R, we choose the pixel with the smallest half-angle (the angle between the half-vector and local normal) as a reference, with normal n_R. The half-vectors of all other pixels, relative to their local normals, are projected onto the tangent plane of the reference point:

    P_R( Proj_{n_R}( n_R + h_x − n_x ) ) ← L(x).    (6)

The discretized tangent plane is then smoothed with Gaussian filtering to get a dense BRDF slice as in Figure 5. We then use the Nelder-Mead Simplex Method [29] to find the tangent and binormal vectors that maximize the symmetry [14]. Given the estimated tangent vector, the BRDF parameters β for the anisotropic surface are computed through Eq. (5), but with a spatially constant diffuse albedo (Fig. 5).

Figure 5: Anisotropic surface fitting results. From left to right: photo of the target surface (brushed metal cylinder and planar wood surface); image captured from the IR camera; our rendering from the same camera pose; tangent plane of the reference point where half-vectors of observations are projected (the yellow line is the optimal tangent vector that maximizes symmetry). Best viewed digitally.
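One possible reading of the tangent search above, sketched below: splat the projected half-vectors of Eq. (6) into a 2D BRDF slice, smooth it, and use Nelder-Mead to find the in-plane mirror axis that maximizes symmetry. The discretization, the asymmetry measure, and the single-angle parameterization are all assumptions made for this sketch.

import numpy as np
from scipy.ndimage import gaussian_filter, rotate
from scipy.optimize import minimize

def brdf_slice(uv, intensity, res=64, extent=0.5):
    """uv: (N,2) projected half-vectors in the reference tangent plane (Eq. (6))."""
    img = np.zeros((res, res))
    idx = np.clip(((uv + extent) / (2 * extent) * res).astype(int), 0, res - 1)
    np.add.at(img, (idx[:, 1], idx[:, 0]), intensity)   # splat observed intensities
    return gaussian_filter(img, sigma=2.0)               # dense, smoothed BRDF slice

def asymmetry(angle_deg, img):
    """Rotate so the candidate tangent is vertical, then measure mirror mismatch."""
    r = rotate(img, float(angle_deg[0]), reshape=False, order=1)
    return np.sum((r - np.fliplr(r)) ** 2)

def find_tangent_angle(uv, intensity):
    img = brdf_slice(uv, intensity)
    best = minimize(asymmetry, x0=[0.0], args=(img,), method="Nelder-Mead")
    return best.x[0]    # in-plane angle (degrees) of the estimated tangent direction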
5. Computing D

In this section, we describe how to obtain high quality scene texture in real time and iteratively remove baked-in specular highlights as a post-processing step.

Real-time High Resolution Texture (HRT) Fusion. High quality texture is essential to the perceived realism of scene appearance. While interactive RGBD scanning has seen great progress on delivering quality geometry [9, 40], the state-of-the-art systems still produce low-resolution reconstructions with blurring and ghosting artifacts.

Recently, global pose optimization approaches [47] and their patch-based variants [2] have produced impressive texture mapping results. However, these methods work offline and thus cannot provide interactive visual feedback to users on which part of the current model is missing texture or needs close-up scanning.

We show that a high resolution texture representation combined with GPU-accelerated dense photometric pose optimization [19, 40] can greatly improve the real-time scanning quality (Figure 6). We also propose a gradient-based objective function and a generalized specular highlight removal algorithm for handling non-Lambertian surfaces.

Figure 6: Comparison of texture quality with a state-of-the-art method [40]. The first and third columns show reconstruction results of [40], while the second and fourth show our results. The second row shows magnified patches of the green boxes. Best viewed digitally.

5.1. Preliminaries

We adopt Kinect Fusion [30] to obtain the scene geometry and refine the extracted mesh using the method of [16]. We denote the final mesh, vertex set, and vertex normal set as M, V, and N, respectively.

The focal length f and principal point c of the camera are estimated to form a calibration matrix K. We define a projection operator π : ℝ⁴ → ℝ² from a 3D homogeneous point to a 2D point on the image plane: π(p) := ( f_x p_x / p_z + c_x, f_y p_y / p_z + c_y ). The rigid body transformation of the camera at time i relative to the first frame is:

    T_i = [ R  t ; 0  1 ],  R ∈ SO(3), t ∈ ℝ³,    (7)

where a 3D point p in local camera coordinates can be transformed to a point in global coordinates as p′ = T_i p. We calibrate low-order polynomial radial and decentering distortion parameters [4], and each RGB frame is undistorted accordingly. All RGB images are compensated for vignetting after radiometric calibration [20].

5.2. Texture Representation

We represent the texture of the model as a 2D texture atlas [26] A that stores RGB values and associated weights. We parameterize our mesh M by simply laying each triangle on A without overlap. For each pixel u on the texture plane, its 3D position can be computed through barycentric interpolation, denoted as u^{3d}.

5.3. Rendering for Appearance Prediction

Using the rasterization pipeline, we render a high resolution prediction of the scene appearance R_i, along with a depth map Z_i and a normal map, into a camera with estimated pose T_i. This high resolution rendering provides constraints for pose optimization for the upcoming input frame.
5.4. Texture Update

Given its estimated pose T_i, each RGB frame I_i is densely fused into the texture atlas. We compute the color input of a texture pixel u ∈ Ω by projecting the corresponding 3D point u^{3d} onto the bilinearly interpolated image I_i with depth testing. Specifically, let x be the point on the image plane of I_i obtained from perspective projection of u^{3d}. The color measurement I_i(x) is blended into A(u) through weighted averaging, with a weight w_i(x) associated with x at time i. We assign high weights to reliable and important color observations:

    w_i(x) = m_i z_i(x) s_i(x).    (8)

Here, m_i accounts for motion blur and rolling shutter effects that occur during fast camera motions, measured by the L² distance between the translation and rotation components of T_{i−1} and T_i. z_i(x) rejects pixels on depth discontinuities, where the estimated geometry is unreliable. Finally, measurements close to the surface and near perpendicular to it provide sharper texture, yielding the importance term s_i(x):

    s_i(x) = (n_x · v_x) / Z_i(x)².    (9)
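A small sketch of the running weighted average implied by Eqs. (8) and (9) for a single texel; the atlas layout and argument names are assumptions.

import numpy as np

def observation_weight(m_i, z_ok, n_dot_v, depth):
    """w_i(x) = m_i * z_i(x) * s_i(x), with s_i = (n . v) / Z_i^2  (Eqs. (8)-(9))."""
    s_i = max(n_dot_v, 0.0) / (depth ** 2)
    return m_i * float(z_ok) * s_i

def blend(atlas_rgb, atlas_weight, u, color, w):
    """Fold one color observation into texel u of the atlas by weighted averaging."""
    total = atlas_weight[u] + w
    if total > 0:
        atlas_rgb[u] = (atlas_rgb[u] * atlas_weight[u] + np.asarray(color) * w) / total
        atlas_weight[u] = total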

5.5. Frame-to-Model Photometric Pose Refinement

To estimate the camera pose T_i, we use a frame-to-model RGBD dense alignment approach, minimizing the photometric error between the live frame I_i and the previous frame's rendering R_{i−1}, on top of the traditional dense point-plane Iterative Closest Point algorithm [30].

Unlike previous real-time RGBD tracking methods [19, 40], our 2D texture atlas can store higher frequency photometric details compared to commonly used surfels or volumetric textures, providing tighter pose constraints.

Dense RGBD alignment techniques aim to find a relative rigid body transformation T° = T_i⁻¹ T_{i−1} that minimizes photometric error when I_i is warped into the reference image R_{i−1}, with the corresponding geometry represented by the reference depth map Z_{i−1}. For a pixel x on the image plane Λ of R_{i−1}, we get a 3D point p_x in local coordinates through back-projection: p_x = Z_{i−1}(x) K⁻¹ x (with x in homogeneous pixel coordinates). We can then look up how this point appears in the live input image: I_i(π(T° p_x)). Modern dense alignment techniques then minimize the greyscale photometric error between R_{i−1}(x) and I_i(π(T° p_x)) under a quadratic loss.

For our application, however, maximizing photoconsistency does not necessarily result in accurate camera tracking, since specular surfaces can appear significantly different across viewpoints. We find that applying a high-pass filter to the images improves tracking accuracy, and we therefore modify the objective to operate on the gradients of the predicted and live images. The optimal transformation T̂° is then:

    T̂° = argmin_{T°} Σ_{x∈Λ} ( ||∇R^G_{i−1}(x)|| − ||∇I^G_i(π(T° p_x))|| )²,    (10)

where the superscript G denotes greyscale images.

The intuition is that the high-pass filter reduces each specular highlight (which can be large) to its boundary, and the boundary tends to be lower frequency than the underlying diffuse texture, as the specular BRDF acts as a low-pass filter on the incident illumination [31].
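For illustration, the sketch below evaluates the residual vector of Eq. (10) for one candidate relative pose, using nearest-neighbour sampling of precomputed Sobel gradient magnitudes. The sampling scheme, the dense vectorization, and the variable names are assumptions rather than the paper's GPU implementation.

import numpy as np

def pose_residuals(T, grad_ref, grad_live, Z_ref, K):
    """T: (4,4) candidate relative pose; grad_*: gradient magnitude images; Z_ref: reference depth."""
    h, w = Z_ref.shape
    ys, xs = np.mgrid[0:h, 0:w]
    valid = Z_ref > 0
    # back-project reference pixels: p_x = Z(x) * K^-1 * x (homogeneous)
    pix = np.stack([xs[valid], ys[valid], np.ones(int(valid.sum()))])
    pts = np.linalg.inv(K) @ pix * Z_ref[valid]
    # transform into the live frame and project with pi()
    pts = T[:3, :3] @ pts + T[:3, 3:4]
    proj = K @ pts
    u, v = proj[0] / proj[2], proj[1] / proj[2]
    inside = (u >= 0) & (u < w - 1) & (v >= 0) & (v < h - 1) & (proj[2] > 0)
    # nearest-neighbour lookup of the live gradient magnitude
    live = grad_live[np.round(v[inside]).astype(int), np.round(u[inside]).astype(int)]
    return grad_ref[valid][inside] - live      # residual vector of Eq. (10)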

5.6. Specular Highlight Removal

The resulting texture has specular highlights "baked in", which need to be removed to obtain D. This problem can be posed as a layer separation problem [8, 44]: separating the stationary diffuse texture from the view-dependent specular layer. Specifically, [35] proposes the min-composite, i.e., the minimum intensity value across all viewpoints, to approximate (via a least upper bound) the diffuse layer.

We use Iteratively Reweighted Least Squares (IRLS) [13] to robustly compute the min-composite. For a given pixel on the texture map, computing its weighted average q̂ as in Section 5.4 is equivalent to finding the solution of a weighted least squares problem argmin_{q̂} Σ_i w_i (q_i − q̂)², where q_i is the i-th intensity input and w_i is the weight of the input computed in Equation (8). IRLS filters out the highlights by down-weighting bright outliers in each iteration:

    q̂^{t+1} = argmin_{q̂} Σ_i μ_{τ,υ}(q̂^t, q_i) w_i (q_i − q̂)²,    (11)

    μ_{τ,υ}(q̂^t, q_i) = 1                                 if q_i ≤ q̂^t + τ,
                        exp( −(q̂^t + τ − q_i)² / υ² )      if q_i > q̂^t + τ.    (12)

In practice, one or two iterations are sufficient (Figure 7).

Figure 7: Specular highlights are removed from (a) by computing a min-composite via Iteratively Reweighted Least Squares. (b) shows the estimated D and (c) the difference heat map.

While this approach is effective for surfaces with significant diffuse components, it fails for metallic surfaces with negligible diffuse reflection. Fortunately, we can leverage our IR BRDF measurements (Sec. 4) to identify such surfaces; we set D to zero for any surface whose estimated diffuse albedo is less than 0.03 and specular albedo is greater than 0.15.
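A direct sketch of the IRLS min-composite of Eqs. (11) and (12) for a single texel, using the closed-form weighted-mean solution of each reweighted least-squares step; the default values of τ and υ are assumptions.

import numpy as np

def min_composite(q, w, tau=0.05, upsilon=0.1, iters=2):
    """q: (N,) intensity observations of one texel; w: (N,) fusion weights from Eq. (8)."""
    q, w = np.asarray(q, float), np.asarray(w, float)
    q_hat = np.sum(w * q) / np.sum(w)                     # initial weighted average
    for _ in range(iters):
        mu = np.where(q <= q_hat + tau, 1.0,
                      np.exp(-((q_hat + tau - q) ** 2) / upsilon ** 2))   # Eq. (12)
        q_hat = np.sum(mu * w * q) / np.sum(mu * w)       # weighted mean solves Eq. (11)
    return q_hat

print(min_composite([0.30, 0.31, 0.29, 0.90], [1, 1, 1, 1]))   # the 0.90 highlight is suppressed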
6. Experiments

We conducted qualitative and quantitative experiments to assess our method. All experiments use a calibrated Primesense Carmine RGBD sensor capturing 640x480 video at 30 Hz for the RGB and IR channels, synchronized with depth video of the same size. Exposure time was set to be constant within each sequence. The system runs on a laptop with an NVIDIA GTX 980M GPU and an Intel i7 CPU.

6.1. HRT Tracking Evaluation

We evaluate the performance of our texture fusion method by measuring photometric error between the rendered and ground truth images for a held-out sample of frames. For comparison, we replace our HRT tracking with existing real-time camera tracking methods and evaluate the photometric error on a video of a highly textured Lambertian scene. We test RGBD tracking methods using a combination of frame-to-model Iterative Closest Point (ICP) [30], frame-to-frame (f2f) RGBD tracking [19, 34], and frame-to-model (f2m) RGBD tracking [40]. We did not implement the loop closure techniques introduced in state-of-the-art approaches such as [9, 40], as our target scenes are of smaller scale where tracking drift is limited.

Figure 8 compares the quality of texture fused using the camera poses provided by each algorithm. Our high resolution frame-to-model tracking yields the sharpest result. For a quantitative measure, we adopt the evaluation method introduced by [37] and compute the pixel-wise root-mean-square error (RMSE) and 1-NCC error (for different patch sizes) between the ground-truth and rendered texture reconstructions. We refer to [37] for the precise procedure. Results can be found in Table 2.

Error/Method    ICP      f2f      f2m      ours
RMSE            0.1361   0.1177   0.1184   0.0902
1-NCC (3x3)     0.7220   0.6124   0.6521   0.4692
1-NCC (5x5)     0.6510   0.5430   0.5689   0.3877
1-NCC (7x7)     0.5999   0.5043   0.5138   0.3491
Table 2: Photometric errors of tracking methods.

Figure 8: Ground-truth image and predicted renderings of 3D models reconstructed with different tracking methods ((a) Ground-truth, (b) Ours, (c) Frame-to-Model [40], (d) Frame-to-Frame [19]). The visualization shows magnified patches of the green box.
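For reference, a sketch of the two image-space error measures reported in Table 2, following our reading of the protocol in [37]; the non-overlapping patch tiling is an assumption.

import numpy as np

def rmse(gt, pred):
    """Pixel-wise root-mean-square error between ground truth and rendering."""
    return float(np.sqrt(np.mean((gt - pred) ** 2)))

def one_minus_ncc(gt, pred, patch=3):
    """Mean (1 - normalized cross-correlation) over non-overlapping patch x patch tiles."""
    errs = []
    for y in range(0, gt.shape[0] - patch + 1, patch):
        for x in range(0, gt.shape[1] - patch + 1, patch):
            a = gt[y:y + patch, x:x + patch].ravel().astype(float)
            b = pred[y:y + patch, x:x + patch].ravel().astype(float)
            a, b = a - a.mean(), b - b.mean()
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            if denom > 1e-8:
                errs.append(1.0 - np.dot(a, b) / denom)
    return float(np.mean(errs))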
6.2. Surface Light Field Rendering Results

To render a reconstructed model, we use D represented as a texture map and implement a custom shader to evaluate S with HDR environment lighting, which is captured from a consumer-grade 360 degree camera by taking several photos with varying exposure times [10]. Figure 9 compares the reconstructed surface light fields of test scenes with real photographs. We highly encourage readers to watch the supplementary video for in-depth results.

These results demonstrate that our method can successfully recover surface light fields under a wide range of geometry, material, and lighting variations. The Rooster scene features challenging high-frequency textures and shiny ceramic surfaces, captured under office fluorescent lights. Our method can accurately reproduce sharp diffuse textures and specular highlights. The Corncho scene contains a bumpy specular vinyl surface captured near a large window. Notice the faithful soft shadows and strong interreflections on the white floor expressed in the renderings. The Dog scene has significant self-occlusions. A bright directional lamp illuminates the object from a short distance, inducing strong specularities and sharp shadows. Our system faithfully reconstructs these effects. The Bottle scene contains a highly anisotropic brushed metal cylinder and a glossy black plastic cap. The results show that our system successfully recovers the vertically elongated specular highlights on the metal surface. Also notice that the relative specular reflectances of the two glossy materials are estimated correctly.

Figure 9: Ground truth images and renderings. First column: ground truth photographs. Second column: synthetic rendering of the same pose. Third and fourth columns: renderings of different camera poses.

7. Limitations and Conclusions

Our system is unable to model near-perfect mirrors or surfaces with micro-structure, due to inaccurate sensor shape measurement; e.g., tiny bumps on the Corncho model were not captured, causing a noticeable difference in the sharpness of the reflection. Moreover, our surface light field formulation does not model colored metals such as bronze and omits specular interreflections.

We presented the first end-to-end system for computing surface light field object models using a hand-held, commodity RGBD sensor. We leverage a novel factorization of the surface light field that simplifies capture requirements, yet enables high quality results for a wide range of materials, including anisotropic ones. Our approach captures global illumination effects (like shadows and interreflections) and high resolution textures that greatly improve realism relative to prior interactive RGBD scanning methods.

Acknowledgement

This work was supported by funding from the National Science Foundation grant IIS-1538618, Intel, Google, and the UW Reality Lab.

References

[1] S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material recognition in the wild with the materials in context database. In CVPR, pages 3479–3487, 2015.
[2] S. Bi, N. K. Kalantari, and R. Ramamoorthi. Patch-based optimization for image-based texture mapping. ACM Transactions on Graphics (Proc. SIGGRAPH), 36(4), 2017.
[3] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE TPAMI, 23(11):1222–1239, 2001.
[4] D. C. Brown. Close-range camera calibration. Photogrammetric Engineering, 1971.
[5] W.-C. Chen, J.-Y. Bouguet, M. H. Chu, and R. Grzeszczuk. Light field mapping: efficient representation and hardware rendering of surface light fields. ACM Transactions on Graphics, 21(3):447–456, 2002.
[6] G. Choe, S. G. Narasimhan, and I. So Kweon. Simultaneous estimation of near IR BRDF and fine-scale surface geometry. In CVPR, pages 2452–2460, 2016.
[7] G. Choe, J. Park, Y.-W. Tai, and I. So Kweon. Exploiting shading cues in Kinect IR images for geometry refinement. In CVPR, pages 3922–3929, 2014.
[8] A. Criminisi, S. B. Kang, R. Swaminathan, R. Szeliski, and P. Anandan. Extracting layers and analyzing their specular properties using epipolar-plane-image analysis. Computer Vision and Image Understanding, 97(1):51–85, 2005.
[9] A. Dai, M. Nießner, M. Zollhöfer, S. Izadi, and C. Theobalt. BundleFusion: Real-time globally consistent 3D reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics, 36(3):24, 2017.
[10] P. E. Debevec and J. Malik. Recovering high dynamic range radiance maps from photographs. In ACM SIGGRAPH 2008 classes, page 31, 2008.
[11] A. Ghosh, T. Chen, P. Peers, C. A. Wilson, and P. Debevec. Estimating specular roughness and anisotropy from second order spherical gradient illumination. Computer Graphics Forum, 28:1161–1170, 2009.
[12] D. B. Goldman, B. Curless, A. Hertzmann, and S. M. Seitz. Shape and spatially-varying BRDFs from photometric stereo. IEEE TPAMI, 32(6):1060–1071, 2010.
[13] P. J. Green. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society, Series B, pages 149–192, 1984.
[14] M. Holroyd, J. Lawrence, G. Humphreys, and T. Zickler. A photometric approach for estimating normals and tangents. ACM Transactions on Graphics, 27(5):133, 2008.
[15] J. Jachnik, R. A. Newcombe, and A. J. Davison. Real-time surface light-field capture for augmentation of planar specular surfaces. In ISMAR, pages 91–97, 2012.
[16] W. Jakob, M. Tarini, D. Panozzo, and O. Sorkine-Hornung. Instant field-aligned meshes. ACM Transactions on Graphics, 34(6):189, 2015.
[17] J. T. Kajiya. The rendering equation. ACM SIGGRAPH Computer Graphics, 20:143–150, 1986.
[18] C. Kerl, M. Souiai, J. Sturm, and D. Cremers. Towards illumination-invariant 3D reconstruction using ToF RGB-D cameras. In 3DV, volume 1, pages 39–46, 2014.
[19] C. Kerl, J. Sturm, and D. Cremers. Dense visual SLAM for RGB-D cameras. In IROS, pages 2100–2106, 2013.
[20] S. J. Kim and M. Pollefeys. Robust radiometric calibration and vignetting correction. IEEE TPAMI, 30(4):562–576, 2008.
[21] K. Lai, L. Bo, X. Ren, and D. Fox. Detection-based object labeling in 3D scenes. In ICRA, pages 1330–1337, 2012.
[22] H. Lensch, J. Kautz, M. Goesele, W. Heidrich, and H.-P. Seidel. Image-based reconstruction of spatial appearance and geometric detail. ACM Transactions on Graphics, 22(2):234–257, 2003.
[23] M. Levoy and P. Hanrahan. Light field rendering. In SIGGRAPH, pages 31–42, 1996.
[24] S. Lombardi and K. Nishino. Radiometric scene decomposition: Scene reflectance, illumination, and geometry from RGB-D images. In 3DV, pages 305–313, 2016.
[25] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[26] J. Maillot, H. Yahia, and A. Verroust. Interactive texture mapping. In SIGGRAPH, pages 27–34, 1993.
[27] D. W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963.
[28] G. Miller, S. Rubin, and D. Ponceleon. Lazy decompression of surface light fields for precomputed global illumination. In Rendering Techniques '98, pages 281–292. Springer, 1998.
[29] J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965.
[30] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In ISMAR, pages 127–136, 2011.
[31] R. Ramamoorthi and P. Hanrahan. A signal-processing framework for inverse rendering. In SIGGRAPH, pages 117–128, 2001.
[32] T. Richter-Trummer, D. Kalkofen, J. Park, and D. Schmalstieg. Instant mixed reality lighting from casual scanning. In ISMAR, pages 27–36, 2016.
[33] S. A. Shafer. Using color to separate reflection components. Color Research & Application, 10(4):210–218, 1985.
[34] F. Steinbrücker, J. Sturm, and D. Cremers. Real-time visual odometry from dense RGB-D images. In ICCV Workshops, pages 719–722, 2011.
[35] R. Szeliski, S. Avidan, and P. Anandan. Layer extraction from multiple images containing reflections and transparency. In CVPR, volume 1, pages 246–253, 2000.
[36] R. T. Tan and K. Ikeuchi. Separating reflection components of textured surfaces using a single image. IEEE TPAMI, 27(2):178–193, 2005.
[37] M. Waechter, M. Beljan, S. Fuhrmann, N. Moehrle, J. Kopf, and M. Goesele. Virtual rephotography: Novel view prediction error for 3D reconstruction. ACM Transactions on Graphics, 36(1):8:1–8:11, 2017.
[38] J. Wang, S. Zhao, X. Tong, J. Snyder, and B. Guo. Modeling anisotropic surface reflectance with example-based microfacet synthesis. ACM Transactions on Graphics, 27(3):41, 2008.
[39] G. J. Ward. Measuring and modeling anisotropic reflection. ACM SIGGRAPH Computer Graphics, 26(2):265–272, 1992.
[40] T. Whelan, S. Leutenegger, R. Salas-Moreno, B. Glocker, and A. J. Davison. ElasticFusion: Dense SLAM without a pose graph. In Robotics: Science and Systems, 2015.
[41] D. N. Wood, D. I. Azuma, K. Aldinger, B. Curless, T. Duchamp, D. H. Salesin, and W. Stuetzle. Surface light fields for 3D photography. In SIGGRAPH, pages 287–296, 2000.
[42] H. Wu, Z. Wang, and K. Zhou. Simultaneous localization and appearance estimation with a consumer RGB-D camera. IEEE Transactions on Visualization and Computer Graphics, 22(8):2012–2023, 2016.
[43] H. Wu and K. Zhou. AppFusion: Interactive appearance acquisition using a Kinect sensor. Computer Graphics Forum, 34:289–298, 2015.
[44] T. Xue, M. Rubinstein, C. Liu, and W. T. Freeman. A computational approach for obstruction-free photography. ACM Transactions on Graphics, 34(4):79, 2015.
[45] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[46] Y. Yu, P. Debevec, J. Malik, and T. Hawkins. Inverse global illumination: Recovering reflectance models of real scenes from photographs. In SIGGRAPH, pages 215–224, 1999.
[47] Q.-Y. Zhou and V. Koltun. Color map optimization for 3D reconstruction with consumer depth cameras. ACM Transactions on Graphics, 33(4):155, 2014.

Appendix

Details on Pairwise Term (Sec. 4.2)

The pairwise term incorporates both a photometric and a geometric smoothness prior:

    ψ_{m,n}(y_m, y_n) = [y_m ≠ y_n] ( λ_p exp(−θ_p ||c_m − c_n||²) + λ_g (g(m, n) + ε) / ||N(m) − N(n)||² ),

where λ_p, λ_g, θ_p, ε are balancing parameters, and N and c are the vertex normal and color, respectively. Here, g(m, n) is an indicator function designed to penalize concave surface transitions as in [21]:

    g(m, n) = [ (N(m) − N(n)) · (V(m) − V(n)) > 0 ],

where V is the vertex location. The optimal labeling y for Φ(y) is estimated via graph cuts [3].
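A direct transcription of the pairwise term as a function; the default parameter values and the small denominator epsilon are assumptions.

import numpy as np

def pairwise(y_m, y_n, c_m, c_n, n_m, n_n, v_m, v_n,
             lam_p=1.0, lam_g=1.0, theta_p=5.0, eps=1e-3):
    """Pairwise MRF term between mesh vertices m and n (colors c, normals n, positions v)."""
    if y_m == y_n:
        return 0.0
    g = float(np.dot(n_m - n_n, v_m - v_n) > 0)                       # concavity indicator g(m, n)
    photometric = lam_p * np.exp(-theta_p * np.sum((c_m - c_n) ** 2))
    geometric = lam_g * (g + eps) / (np.sum((n_m - n_n) ** 2) + 1e-12)  # guard against zero denominator
    return photometric + geometric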
Details on Photometric Pose Refinement (Sec. 5.5)

On top of applying the high-pass filter to the input and reference images, we mitigate the remaining specular effects by ignoring pixels whose gradient error in Equation (10) is larger than 0.2. We preprocess the images with a Sobel filter to compute the gradients. We initialize T° with ICP and refine the pose by minimizing Equation (10) using a coarse-to-fine Gauss-Newton optimization operating on a three-level image pyramid.

Additional Scene Reconstruction Result

We show a reconstruction result of another test scene (Figure 10). Note the high quality reconstruction of the multi-object scene, with texture and specular highlights faithfully reproduced.

Figure 10: Ground-truth images and our surface light field renderings of the same camera poses ((a), (c): ground truth; (b), (d): our reconstruction).