TextureFusion: High-Quality Texture Acquisition for Real-Time RGB-D Scanning

Joo Ho Lee∗1, Hyunho Ha1, Yue Dong2, Xin Tong2, Min H. Kim1
1KAIST   2Microsoft Research Asia

Abstract

Real-time RGB-D scanning has become widely used to progressively scan objects with a hand-held sensor. Existing online methods restore color information per voxel, and thus their quality is often limited by the trade-off between spatial resolution and time performance. Also, such methods often suffer from blurred artifacts in the captured texture. Traditional offline methods with non-rigid warping assume that the reconstructed geometry and all input views are obtained in advance, and the optimization takes a long time to compute mesh parameterization and warp parameters, which prevents them from being used in real-time applications. In this work, we propose a progressive texture-fusion method specially designed for real-time RGB-D scanning. To this end, we first devise a novel texture-tile voxel grid, where texture tiles are embedded in the voxel grid of the signed distance function, allowing for high-resolution texture mapping on the low-resolution geometry volume. Instead of using expensive mesh parameterization, we associate vertices of implicit geometry directly with texture coordinates. Second, we introduce real-time texture warping that applies a spatially-varying perspective mapping to input images so that texture warping efficiently mitigates the mismatch between the intermediate geometry and the current input view. It allows us to enhance the quality of texture over time while updating the geometry in real time. The results demonstrate that the quality of our real-time texture mapping is highly competitive with that of exhaustive offline texture warping methods. Our method is also capable of being integrated into existing RGB-D scanning frameworks.

Figure 1: We compare per-voxel color representation (a) with conventional texture representation without optimization (b). Compared with global optimization only (c), our texture fusion with spatially-varying perspective optimization (d) achieves high-quality color texture in real-time RGB-D scanning. Refer to the supplemental video for a real-time demo.

∗Part of this work was done during an internship at MSRA.

1. Introduction

Real-time RGB-D scanning has become widely used to progressively scan objects or scenes with a hand-held RGB-D camera, such as Kinect. The depth stream from the camera is used to determine the camera pose and is accumulated into a voxel grid that contains surface distance. The truncated depth values in the grid become a surface representation through a marching cubes algorithm [24]. The color stream, in general, is accumulated into the voxels in a manner similar to the way depth is captured. However, the current color representation in real-time RGB-D scanning remains suboptimal due to the following challenges: first, since the color information is stored in each voxel of the grid, there is an inevitable tradeoff between spatial resolution and time performance [6, 13, 25, 24].

For instance, when we decrease the spatial resolution of the grid for fast performance, we need to sacrifice the spatial resolution of color information. See Figure 1(a) for an example. Second, the imperfection of the RGB-D camera introduces several artifacts: depth noise, distortion in both depth and color images, and asynchronization between the depth and color frames [31, 3, 10, 14, 9]. These lead to inaccurate estimation of imperfect geometry and color camera pose, and to the mismatch between the geometry and color images (Figures 1a and 1b).

These challenges could be mitigated by applying a texture mapping method that includes global/local warping of texture to geometry. The traditional texture mapping methods assume that the reconstructed geometry and all input views are known for mesh parameterization that builds a texture atlas [18, 27, 30]. Moreover, the existing texture optimization methods calculate local texture warping to register texture and geometry accurately [10, 14, 9, 31, 3]. However, these methods also require long computational time for optimization and are not suitable for real-time RGB-D scanning. For example, a state-of-the-art texture optimization method [31] takes five to six minutes to calculate non-rigid warp parameters of 30 images for a given 3D model. It is inapplicable to real-time 3D imaging applications.

In this work, we propose a progressive texture fusion method specially designed for real-time RGB-D scanning. To this end, we first develop a novel texture-tile voxel grid, where texture tiles are embedded in the structure of the signed distance function. We integrate input color views as warped texture tiles into the texture-tile voxel grid. Doing so allows us to establish and update the relationship between geometry and texture information in real time without computing a high-resolution texture atlas. Instead of using expensive mesh parameterization, we associate vertices of implicit geometry directly with texture coordinates. Second, we introduce an efficient texture warping method that applies the spatially-varying perspective mapping of each camera view to the current geometry. This mitigates the mismatch between geometry and texture effectively, achieving a good tradeoff between quality and performance in texture mapping. Our local perspective warping allows us to register each color frame to the canonical texture space precisely and seamlessly so that it can enhance the quality of texture over time (Figure 1d). The quality of our real-time texture fusion is highly competitive with exhaustive offline texture warping methods. In addition, the proposed method can be easily adapted to existing RGB-D scanning frameworks. All codes and demos are published online to ensure reproducibility (https://github.com/KAIST-VCLAB/texturefusion.git).

2. Related Work

In this section, we provide a brief overview of color reconstruction methods in RGB-D scanning.

2.1. Online Color Per Voxel

Existing real-time RGB-D scanning methods [24, 23, 6, 11, 12] reconstruct surface geometry by accumulating the depth stream into a voxel grid of the truncated surface distance function (TSDF) [5], where the camera pose of each frame is simultaneously estimated by the iterative closest point (ICP) method [26]. To obtain the color information together, existing methods inherit the data structure of the TSDF, storing a color vector in each voxel of the volumetric grid. While this approach can be integrated easily into the existing real-time geometry reconstruction workflow, the quality of color information is often limited by the long-lasting tradeoff between spatial resolution and time performance. In addition, with the objective of real-time performance, neither global nor local registration of texture has been considered in real-time RGB-D scanning. Due to the imperfectness of the RGB-D camera, these real-time methods often suffer from blurred artifacts in the captured color information. In contrast, we perform real-time texture optimization to register texture to geometry progressively. To the best of our knowledge, our method is the first work that enables non-rigid texture optimization for real-time RGB-D scanning.

2.2. Offline Texture Optimization

The mismatch between texture and geometry is a long-lasting problem, which has triggered a multitude of studies. First, to avoid the mismatch problem of multi-view input images, an image segmentation approach was introduced to divide surface geometry into small segments and map them to the best input view [17, 28, 10, 9, 19]. However, the initial segmentation introduces seam artifacts over surface segment boundaries, requiring an additional optimization step to relax the misalignment. These methods optimize small textures on each surface unit by accounting for photometric consistency [10, 9], or by estimating a flow field from the boundaries to the inner regions on each surface unit [19]. However, the image segmentation per surface unit easily loses the low-frequency base image structure. It often requires additional global optimization over all the multi-view images after the scanning completes, making online scanning of these images infeasible. Also, this process cannot be applied to the progressive optimization of real-time RGB-D scanning. In contrast, our spatially-varying perspective warping method is capable of being applied to online scanning, allowing for seamless local warping of input images without misalignment artifacts.

Second, a warping approach was proposed to transform input images to mitigate the misalignment between geometry and texture via local warp optimization. Aganj et al. [1] leverage SIFT image features to estimate a warp field. The sparse field is then interpolated through thin plate geometry.

Figure 2: Overview. Our workflow consists of two main steps, geometry update and texture update, with five stages: camera tracking, geometry integration, texture-geometry correspondence, texture optimization, and texture blending. Once a new depth and color image are acquired, the camera parameters and geometry information are updated. After that, the correspondence between new geometry and texture is computed. Finally, the new image is blended into the texture space via perspective warping.

Dellepiane et al. [8] make use of hierarchical optical flow to achieve non-local registration of multi-view input. However, the former is less accurate in the interpolated region of the flow field, and the latter requires complex flow interpolation, so the computational cost increases accordingly.

Recently, the global optimization of multi-view input was proposed to achieve high-quality texture mapping by not only optimizing the local mismatch of texture but also adjusting the global alignment of multi-view input. Zhou et al. [31] optimize the camera poses and apply affine-based warp fields to align them to the surface geometry. Bi et al. [3] synthesize novel intermediate views with photometric consistency by backprojecting neighboring views. Again, they also require global optimization over all the multi-view images after the scanning completes in order to estimate optical flow, making them incapable of being applied to online scanning. In contrast, we devise a spatially-varying projective warping method that is more efficient than the flow-based approaches, enabling real-time texture fusion. Our non-rigid warping field aligns per-frame color information to the texture-voxel grid progressively, resulting in high-quality texture mapping competitive with offline warping optimization.

3. Online Texture Fusion

Our online texture fusion method is enabled by two main contributions: a novel texture-tile voxel grid (Section 3.1) and an efficient, non-rigid texture optimization method (Section 3.3).

Overview. Our real-time RGB-D scanning iteratively repeats two main steps for geometry and texture to progressively accumulate both pieces of information in a canonical space. The geometric reconstruction part is based on an existing real-time 3D scanning method [25], and our texture reconstruction part consists of three main sub-steps: (1) finding the texture-geometry correspondence after updating geometry in the canonical space, (2) calculating the spatially-varying perspective warping of the current view, and (3) blending the warped view with texture in the canonical space. Figure 2 provides an overview of our workflow.
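To make the workflow concrete, the following is a minimal sketch of the per-frame loop described above, written in Python. The stage functions are passed in as callables because this is only a structural illustration under our own naming assumptions (ScanState, track_camera, integrate_depth, and so on are hypothetical), not the API of the released implementation.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ScanState:
    """Hypothetical container for the canonical-space state (names illustrative)."""
    tsdf: np.ndarray            # signed-distance volume with per-cell tile indices
    texture_tiles: list         # tile texels: weight, color, depth per pixel
    prev_pose: np.ndarray = field(default_factory=lambda: np.eye(4))

def process_frame(depth, color, state, track_camera, integrate_depth,
                  update_correspondences, optimize_warp, blend_texture):
    """One iteration of the Figure-2 workflow; the five stage callables are placeholders."""
    # Geometry update
    pose = track_camera(depth, state)           # camera tracking (ICP against the TSDF)
    integrate_depth(depth, pose, state)         # geometry integration: S -> S'
    # Texture update
    update_correspondences(state)               # texture-geometry correspondence: Q -> Q~
    warp = optimize_warp(color, pose, state)    # spatially-varying perspective warp
    blend_texture(color, warp, pose, state)     # weighted blending into the tiles: Q~ -> Q'
    state.prev_pose = pose
    return state
```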
Figure 3: Our texture-tile data structure is integrated into the grid of the SDF. For every zero-crossing cell, we assign a 2D regular texture tile. To associate surface and texture, the texture tile is projected along the orthogonal axis direction.

3.1. Texture-Tile Voxel Grid

The first challenge for real-time texture fusion is that both geometry and texture need to be updated progressively every frame, and thus the existing mesh parameterization and texture atlas are not suitable for real-time scanning.

In previous studies, the tile-shaped texture architecture has been proposed to achieve the optimal compactness of a texture atlas, often using a tree structure for rendering with texture [16, 7, 2]. One benefit is that no mesh parameterization is required. With the objective of real-time scanning, we are inspired to combine the concept of texture tiles with the voxel grid of the signed distance function (SDF). By doing so, the relationship between an implicit surface and a texture tile within a voxel can be maintained through orthographic projection, rather than relying on expensive meshing and mesh parameterization. This key idea allows us to update the texture information progressively in real-time scanning.

We introduce a novel texture-tile voxel grid that consists of a regular lattice grid in 3D, where each grid point has an SDF value and the index of a texture tile. A cell, a basic unit of surface reconstruction, includes eight adjacent grid points. An implicit surface in the cell is defined by connecting a set of zero-crossing points through the trilinear interpolation of SDF values. We assign fixed-size texture patches, so-called texture tiles, to zero-crossing cells. A pixel of a texture tile contains three properties: a weight, a color, and a depth. We update these properties of each tile dynamically over time.

The mapping between texture and geometry is performed by orthogonal projection. To minimize the sampling loss, we select the axis direction along which the projected area is the maximum. We assume that the size of a cell is small enough that at most one facet exists and is associated with one texture tile in a cell. This holds because our texture cells share the same resolution as the SDF that contains the reconstructed geometry. Figure 3 depicts the data structure of the texture-tile grid and the orthographic projection.
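The following is a minimal sketch of the per-cell data that a texture-tile voxel grid needs to hold, together with the dominant-axis selection used for the orthographic tile projection. It is an illustration under our own naming assumptions (GridPoint, TextureTile, dominant_axis are hypothetical names), not the data layout of the released code.

```python
import numpy as np
from dataclasses import dataclass, field

TILE_RES = 4  # e.g., a 4x4 texture tile per zero-crossing cell

@dataclass
class GridPoint:
    sdf: float = 1.0      # truncated signed distance value at this lattice point
    tile_index: int = -1  # index of the texture tile assigned to the cell (-1: none)

@dataclass
class TextureTile:
    # Each texel stores a blending weight, a color, and a depth value.
    weight: np.ndarray = field(default_factory=lambda: np.zeros((TILE_RES, TILE_RES)))
    color:  np.ndarray = field(default_factory=lambda: np.zeros((TILE_RES, TILE_RES, 3)))
    depth:  np.ndarray = field(default_factory=lambda: np.zeros((TILE_RES, TILE_RES)))

def dominant_axis(normal: np.ndarray) -> int:
    """Pick the axis (0=x, 1=y, 2=z) along which the orthographic projection of the
    in-cell facet has the largest area, i.e., the largest |normal| component."""
    return int(np.argmax(np.abs(normal)))

# Example: a facet whose normal points mostly along +z is projected onto the xy-plane.
print(dominant_axis(np.array([0.1, 0.2, 0.97])))  # -> 2
```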

Figure 4: We first update the geometry S with the input depth D, yielding the new geometry S′. By computing new texture correspondences, searched from the new surface, we update the texture Q to the intermediate texture Q̃. We then warp the input image I via non-rigid texture optimization and integrate it into the new texture Q′.

3.2. Texture-Geometry Correspondence

Updating both geometry and texture is a chicken-and-egg problem because the texture correspondence no longer holds when either of them changes. This is the reason why all the texture optimization methods choose a global optimization manner; however, this is also offline. We, therefore, propose a workflow that makes it possible to update both pieces of information progressively. See Figure 4 for an overview.

Suppose we estimate the current camera pose C through ICP using the current depth D. We first update the surface geometry from S to S′ by integrating the current depth values D into the texture-tile voxel grid. In consequence, the correspondence between the current texture Q (associated with S) and the new geometry S′ is no longer valid. There is no ground truth nor any certain prior in searching for the new correspondences. Therefore, we find new texture correspondences by exhaustively searching corresponding points on the previous geometry S from the new surface S′ along its normal directions, yielding an intermediate texture Q̃ that corresponds to S′.

3.3. Efficient Texture Optimization

The second challenge for high-quality texture fusion is that the perfect geometric correspondence between geometry and multi-view input does not hold in real RGB-D scanning. First, in conventional RGB-D cameras, such as Kinect, PrimeSense, and Xtion, the color and depth camera modules are not fully synchronized, i.e., the color and depth frames are captured at different times [31, 3]. The global non-synchronization of depth and color can be easily fixed by a global perspective warp. Second, rolling shutter artifacts often occur, and the input frames are spatially warped [15]. This means that asynchronous time offsets could exist in different regions; therefore, a local perspective warp can mitigate this offset problem. Lastly, in 3D scanning, some geometry errors, especially those of fixed low frequency, can be fixed by a perspective warp. Of course, all those mismatch problems can be fixed by a free-form warp. We realized that these solutions essentially become estimating the camera motions of local perspective warps; thus, we devise a novel, efficient texture optimization based on spatially-varying perspective warping.

Figure 5: To register the current camera image I to the texture in the canonical space, we first estimate a camera motion field over pixel windows ν(x) (the yellow boxes about pixel x) on a 2D grid so as to increase the photo-consistency of each local window. We then integrate the warped I into the texture in the canonical space using the inverse motion field.

Spatially-varying perspective warp. To address local errors after the global registration, many local warp optimization methods, such as optical flow, free-form warping, or affine transformation, have been proposed [8, 31, 3]. Patch-based optical flow is relatively slow, and the local affine transformation is powerful but often introduces errors. There is an inevitable tradeoff between quality and performance in local texture optimization. We therefore propose an alternative approach of warp windows, inspired by the moving direct linear transformation (DLT) [29].

To address local warping, we assume that the current color frame (captured by the RGB camera) consists of local windows (the yellow boxes in Figure 5) and that each local window is captured by a virtual sub-camera with a minor pose difference. We approximate this minor camera motion in 3D as a form of linearized 6-DOF transformation using the twist representation [22]. This minor transformation T(x) can be multiplied with the camera pose C to account for the local motion of sub-windows.

The camera pose C allows us to register the current camera image to the canonical space of the texture-tile voxel grid in a certain direction. When the color camera captures a part of the scene from the camera pose C, the partial geometry information of S′ can be presented as a point cloud P̃ such that x = π(C P̃(x)), where π(·) is the perspective projection function with known camera intrinsics. See Figure 5.¹ Our main objective is to register the current camera view I to the texture information with respect to the newly updated geometry in the canonical space. We first render an image Ĩ at the current camera pose C using the newly updated geometry S′ with the intermediate texture Q̃. Now we want to enforce the photometric consistency of the current image with the texture map through the local window ν by formulating the following objective function:

E(T(x)) = \sum_{y \in \nu(x)} w(y) \, \big\| I\big(\pi(T(x)\, C\, \tilde{P}(y))\big) - \tilde{I}(y) \big\|_2^2,   (1)

where y is a neighbor pixel within the local window ν(x) (see Figure 5), and w(y) is a Gaussian weight of the distance, exp(−‖y − x‖²/σ²), with a spatial coefficient σ. The window size ν and the spatial coefficient σ control the regularity of the estimated camera motion. If the spatial coefficient is higher and the window size becomes larger, the estimated camera motion reduces gradually to the global camera motion.

¹For every pixel, we cast a ray to find an intersection point with the newly updated implicit surface S′. We find the corresponding cell that contains the intersection point and sample the color in the texture tile using bilinear interpolation.

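As a concrete illustration of Eq. (1), the sketch below evaluates the Gaussian-weighted photometric residual of one local window for a candidate per-window transformation T(x). It assumes simple pinhole intrinsics, NumPy images, and nearest-neighbor sampling; the function names (project, window_residual) are our own and do not come from the paper's released code.

```python
import numpy as np

def project(K, X):
    """Pinhole projection pi(.) of 3D camera-space points X (N,3) to pixels (N,2)."""
    uv = (K @ (X.T / X[:, 2])).T      # divide by depth, apply intrinsics
    return uv[:, :2]

def window_residual(T_x, C, K, P_tilde, I, I_tilde, x, half, sigma):
    """Gaussian-weighted SSD of Eq. (1) over the window nu(x) centered at pixel x.
    T_x, C: 4x4 transforms; P_tilde: HxWx3 canonical-space points of the rendered
    geometry; I, I_tilde: input and rendered images of matching size."""
    cx, cy = x
    E = 0.0
    for v in range(cy - half, cy + half + 1):
        for u in range(cx - half, cx + half + 1):
            p = np.append(P_tilde[v, u], 1.0)          # homogeneous 3D point at pixel y
            q = (T_x @ C @ p)[:3]                      # apply local warp T(x) and pose C
            uv = project(K, q[None, :])[0]             # pi(T(x) C P~(y))
            ui, vi = int(round(uv[0])), int(round(uv[1]))
            if not (0 <= vi < I.shape[0] and 0 <= ui < I.shape[1]):
                continue                               # skip samples falling outside I
            w = np.exp(-((u - cx) ** 2 + (v - cy) ** 2) / sigma ** 2)  # Gaussian weight
            r = I[vi, ui] - I_tilde[v, u]              # photometric residual
            E += w * float(np.dot(r, r))
    return E
```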
To optimize the camera motion field from global camera motion to local camera motion seamlessly, we compute it in a hierarchical manner. We build multi-scale pyramids of I, Ĩ, and P̃ and then estimate a motion field from the coarse to the fine level. To initialize the next level, we conduct bilinear sampling from the current level of the estimated motion field. For online performance, instead of estimating camera motions for every pixel, we estimate camera motions T(x) at the points of a regular lattice grid at each level and then interpolate them for every pixel. We empirically found that three levels are sufficient and that the width between grid points can be set to 64 pixels.

Different from the original moving DLT, we estimate the local camera motion as a twist matrix (with six unknown parameters), later converted to SE(3), instead of estimating a homography matrix (with eight unknown parameters), because we already know the 3D coordinates of the corresponding geometry. In terms of optimization, it is beneficial to reduce the number of unknowns by two. In addition, our perspective warp can provide more accurate local warping with more efficient computation. Particularly, it is more powerful when the surface geometry of the scene consists of various surface orientations, thanks to the benefits of the perspective warp (see Figure 1 for an example).

3.4. Online Texture Blending

Once we know the texture correspondences and the warp transformation for the input image with respect to the latest surface geometry, we are ready to blend the image into the current texture in the canonical space. For each texel, we evaluate each image pixel by spatial resolution and registration certainty.

Resolution sensitivity. We estimate the projected areas of corresponding image pixels and compute blending weights, following Buehler et al. [4]. The size of the projected area is related to the depth and to the angle between the view direction and the normal direction, i.e., the projected area is inversely proportional to the depth squared and proportional to the cosine of the view angle. We compute the area weight factor based on the factor ρ with the minimum threshold γ_area:

w_{\mathrm{area}}(p) = \max\big(e^{-((1-\rho)/\sigma_{\mathrm{area}})^2}, \gamma_{\mathrm{area}}\big),   (2)

where ρ = (z_min / z) · n · (p − c)/‖p − c‖. Here z_min is the minimum depth of scanning, z is the depth of p, p is the 3D point corresponding to each pixel, n is the surface normal at p, and c is the camera center.

Registration certainty. The registration certainty between the image and the geometry is crucial w.r.t. the texture accuracy. The texture inaccuracy caused by imperfect geometry is critical for two types of surface areas: skewed surfaces and near-occlusion surfaces. When the angle between the view direction and the surface normal becomes large, the displacement error greatly increases. Also, near the occlusion edges, the inaccurate object boundaries cause severe artifacts, such as texture bleeding to forward/backward objects. To attenuate these artifacts, we first estimate occlusion edges and evaluate the soft-occlusion weight w_occ(p) at the point p that corresponds to the camera pixel x = π(Cp), by eroding occlusion edges with a discrete box kernel and a Gaussian kernel. We evaluate the angle weight factor as:

w_{\mathrm{angle}}(p) = \max\big(e^{-((1 - \mathbf{n}\cdot\frac{\mathbf{p}-\mathbf{c}}{\|\mathbf{p}-\mathbf{c}\|})/\sigma_{\mathrm{angle}})^2}, \gamma_{\mathrm{angle}}\big).   (3)

Finally, our total blending weight for a given camera is computed as

w(p) = w_{\mathrm{area}}(p) \cdot w_{\mathrm{occ}}(p) \cdot w_{\mathrm{angle}}(p).   (4)

We update the weights and colors of the texture in an integrated manner. The previous texture Q̃ and the image I are blended as

Q(p) = \frac{W(p)\,\tilde{Q}(p) + w(p)\, I(x')}{W(p) + w(p)}, \qquad W(p) \leftarrow \min\big(W(p) + w(p), \psi\big),   (5)

where ψ is a weight upper bound.
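To illustrate Eqs. (2)-(5), the following sketch computes a per-pixel blending weight and applies the weight-capped running average that updates one texel. The occlusion weight w_occ is taken as a precomputed input, and all names and default thresholds are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def blend_weight(p, n, c, z_min, w_occ,
                 sigma_area=0.5, gamma_area=0.05,
                 sigma_angle=0.5, gamma_angle=0.05):
    """Total blending weight w(p) of Eq. (4) from the area (2) and angle (3) factors."""
    v = p - c
    z = float(np.linalg.norm(v))               # depth of p (distance to camera center)
    cos_view = float(np.dot(n, v / z))         # n . (p - c)/|p - c|, as in the paper
    rho = (z_min / z) * cos_view
    w_area = max(np.exp(-((1.0 - rho) / sigma_area) ** 2), gamma_area)
    w_angle = max(np.exp(-((1.0 - cos_view) / sigma_angle) ** 2), gamma_angle)
    return w_area * w_occ * w_angle

def blend_texel(Q_prev, W_prev, color, w, psi=30.0):
    """Weight-capped running average of Eq. (5): returns the updated (color, weight)."""
    Q_new = (W_prev * Q_prev + w * color) / (W_prev + w)
    W_new = min(W_prev + w, psi)
    return Q_new, W_new

# Example texel update with a hypothetical sample color.
Q, W = blend_texel(Q_prev=np.array([0.4, 0.3, 0.2]), W_prev=5.0,
                   color=np.array([0.5, 0.4, 0.25]), w=0.8)
```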

4. Results

Experimental Setup. We implemented our method by integrating our texture fusion algorithm into an existing real-time 3D scanning method [25]. We used a desktop machine equipped with an Intel Core i7-7700K at 4.20 GHz and an NVIDIA Titan V graphics card. We used a commercial RGB-D camera, an Asus Xtion Pro Live, which provides imperfect synchronization of color and depth frames at VGA resolution (640×480) at 30 Hz.

Figure 6: We compare our online method (d) with three existing RGB-D scanning methods: (a) the online voxel representation method [25] (per-voxel representation), (b) the offline traditional texture mapping method [21] (texture but no texture optimization), and (c) the offline texture optimization method [31] (global perspective + affine warp); (d) is ours (spatially-varying perspective warp). Our method achieves high-quality texture representation without sacrificing scanning performance, thanks to our spatially-varying perspective warping and the texture-tile voxel grid. The two scenes from the top are our captures, where an 8×8 texture tile per 10 mm voxel is used. The last reference scene is obtained from [31], where a 4×4 texture tile per 4 mm voxel is used.

Comparison with Other RGB-D Scanning. We compare the texture quality of our method with three other methods. See Figure 6. The first column shows scanning results by an online RGB-D scanning method [25] that uses per-voxel color representation in its implementation. The voxel-based representation in the online method suffers from the tradeoff between quality and performance. The second column presents results by a traditional offline texture mapping method [21], where no texture optimization is conducted. Owing to the correspondence mismatch between geometry and texture, the image structures are blurred severely. The third column shows an offline texture optimization method [31] that includes global warping of the color camera motion and local affine transformation. We experimented with our implementation of [31] on a GPU for this experiment. The results of the offline texture optimization still present misalignments in local regions despite long computation time. The last column shows the results of our online texture optimization method. Thanks to the efficient spatially-varying perspective warp, we can achieve both high-quality texture and real-time performance.

Figure 7: Comparison of our online texture optimization against an offline texture optimization method [31]: (a) Zhou et al. [31] (low-res); (b) Zhou et al. [31] (high-res); (c) scene image; (d) ours. Rows show the reconstruction result and a zoom.

Implementation details. For the backward compatibility of our method, we implement a conversion from our texture model to a conventional texture atlas associated with an ordinary surface model. We first create a mesh representation by the marching cubes algorithm [20] and then create a lattice-shaped texture atlas by finding the zero-crossing cells in the SDF and packing the texture tiles of each cell into the texture atlas. By computing the local 3D coordinates within the cell and projecting them in the selected axis direction, we compute the local 2D texture coordinates of the vertices of the facets. The local texture coordinates are converted to global texture coordinates associated with the generated texture map. Note that the origin of the local texture coordinates is determined by the new tile index. To avoid texture bleeding on the boundary between discrete tiles, we add one-pixel padding for every tile. See the inset figure for an example of a converted 3D model and its texture atlas (the scene model shown in Figure 1).
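As an illustration of the tile-to-atlas conversion described above, the sketch below packs fixed-size tiles into a lattice-shaped atlas with one-pixel padding and converts a local tile coordinate into a global atlas coordinate. The layout policy and names (pack_tiles, to_global_uv) are our own assumptions, not the exported format of the released implementation.

```python
import numpy as np

TILE = 4   # texel resolution of one tile (e.g., 4x4)
PAD = 1    # one-pixel padding around each tile to avoid bleeding across tile borders

def pack_tiles(tiles, atlas_tiles_per_row=256):
    """Copy a list of (TILE, TILE, 3) tiles into a lattice-shaped atlas image."""
    cell = TILE + 2 * PAD
    rows = int(np.ceil(len(tiles) / atlas_tiles_per_row))
    atlas = np.zeros((rows * cell, atlas_tiles_per_row * cell, 3), dtype=np.float32)
    for idx, tile in enumerate(tiles):
        r, c = divmod(idx, atlas_tiles_per_row)
        y0, x0 = r * cell + PAD, c * cell + PAD
        atlas[y0:y0 + TILE, x0:x0 + TILE] = tile
    return atlas

def to_global_uv(tile_index, local_uv, atlas_tiles_per_row=256):
    """Map a local tile coordinate (u, v) in [0, TILE) to a global atlas pixel coordinate;
    the tile origin in the atlas is determined by the tile index."""
    cell = TILE + 2 * PAD
    r, c = divmod(tile_index, atlas_tiles_per_row)
    return (c * cell + PAD + local_uv[0], r * cell + PAD + local_uv[1])

# Example: where does texel (1.5, 2.0) of tile 300 land in the atlas?
print(to_global_uv(300, (1.5, 2.0)))
```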
Comparison with other offline optimization. We compared our online method with an offline texture optimization method [31]. For this experiment, we used the authors' implementation [32]. See Figure 7. For the fairness of comparison, we share the same geometry model. Our method used a voxel size of 4 mm in each dimension with a 4×4 texture tile in each cell, and since Zhou's method calculates the final output color per vertex, we increased the spatial resolution of the 3D model, where each vertex distance is around 1–2 mm. Even though Zhou's method has a two- to four-times higher spatial resolution than ours, the spatial resolution of the texture information of our method is significantly higher, given the same input images.

Figure 8: Impacts of resolution parameters (voxel size of 10 mm vs. 4 mm, and 4×4 vs. 8×8 texture tiles). Our method can achieve high-resolution texture even with a low-resolution voxel grid. The texture quality of our method is significantly higher than that of the existing voxel representation.

Table 1: Per-function performance measures of our method with different resolutions. We mainly use the first-row configuration of a 4×4 tile and 4 mm voxel size.

Texture tile, voxel unit | Geometry integration | Texture correspondence | Texture optimization | Texture blending | Total time per frame
4×4 tile, 4 mm   | 3.7 ms | 4.1 ms | 30.5 ms | 2.3 ms | 40.6 ms
8×8 tile, 4 mm   | 3.8 ms | 4.0 ms | 40.6 ms | 2.4 ms | 50.7 ms
4×4 tile, 10 mm  | 3.7 ms | 2.3 ms | 25.8 ms | 1.4 ms | 33.3 ms
8×8 tile, 10 mm  | 3.7 ms | 2.3 ms | 27.1 ms | 1.5 ms | 34.6 ms

Impacts of resolution. Figure 8 shows the impact of the resolution parameters (the resolutions of the voxel grid and the texture tile) on the texture quality of our method. In our method, the configuration of an 8×8 tile and a 4 mm voxel does not increase texture quality any further. We empirically choose the configuration of a 4×4 tile and a 4 mm voxel size. Table 1 provides averaged per-function performance measures of our method with different resolutions. Even including texture optimization for every frame, we achieve 25 frames per second (fps) at runtime. We found that texture optimization does not need to be computed every frame in general scanning, and thus we computed the optimization process every fourth frame so that we achieved 25–35 fps in our video results. As the texture resolution increases, the rendering time increases, and so does the optimization time. Those numbers are trade-offs based on the scene, the RGB-D sensor, and GPU computation. Our method is not limited to a specific camera, VGA input, etc., but is flexible to extend to other devices as well.

Figure 9: Comparison of different warping methods: no correction, affine warp only, global perspective warp, global perspective warp + affine warp, and spatially-varying perspective warp. Different warp methods are tested with the same parameters, such as the number of pyramid levels, the width of the lattices, and others, as far as possible. The affine-based warp method often fails to correct the global camera motion. While even the combination of affine warp and global perspective warp still has errors, our perspective warp method reduces the ghosting artifact significantly.

Comparison of warping methods. Figure 9 compares our perspective warping method with different warping methods in terms of accuracy. For a fair comparison, we use the same hyperparameters, such as the same cell width and hierarchical levels. The local affine warping cannot correct the pose error due to the limitation of its model. The global perspective warping method reduces the misalignment globally, but local errors still remain. The spatially-varying perspective warping model robustly corrects local errors of registration, while the combination of global perspective warp and local affine warp cannot correct them perfectly.

Figure 10: Comparison of quantization errors between the volumetric representation and our texture-tile representation (4×4 texture tile), for 10 mm and 4 mm voxel sizes. We unproject an image onto each representation and render it to compare quantization errors. In our experiment, a 4×4 patch size with a 4 mm voxel size is sufficient in terms of image quality.

Comparison of quantization error. Figure 10 compares quantization errors with different configurations. We back-project an image onto the texture space and then render the texture in the same image space. For the VGA image resolution, a 4×4 texture tile in a 4 mm voxel unit is sufficient to project high-frequency information onto the texture space.

5. Discussion

The proposed method is not free from limitations, which can lead to interesting future work. Our method evaluates the photometric consistency of the texture and input images. View-dependent appearance or severe exposure change might degrade the quality of texture reconstruction. Particularly, scanning new geometry with a blunt change of exposure could cause a problem in the texture-to-image registration step. In addition, we just capture objects' color directly as texture. There is no process that separates the incident color into diffuse color and scene illumination. Capturing diffuse color as texture is interesting future work.

We found that our method is particularly effective when the captured scene is large and includes various surface orientations. Capturing even larger-scale scenes would be worthy to explore as future work.

6. Conclusions

We have presented a high-quality texture acquisition method for real-time RGB-D scanning. We have proposed two main contributions: First, our novel texture-tile voxel grid combines the texture space and the signed distance function together, allowing for high-resolution texture mapping on the low-resolution geometry volume. Second, our spatially-varying perspective mapping efficiently mitigates the mismatch between geometry and texture. While updating the geometry in real time, it allows us to enhance the quality of texture over time. Our results demonstrated that the quality of our real-time texture mapping is highly competitive with that of existing offline texture warping methods. We anticipate that our method could be used broadly, as it is capable of being integrated into existing RGB-D scanning frameworks.

Acknowledgements

Min H. Kim acknowledges Korea NRF grants (2019R1A2C3007229, 2013M3A6A6073718), Samsung Research, KOCCA in MCST of Korea, and Cross-Ministry Giga KOREA (GK17P0200).

References

[1] E. Aganj, P. Monasse, and R. Keriven. Multi-view texturing of imprecise mesh. Mar. 2009.
[2] D. Benson and J. Davis. Octree textures. ACM Trans. Graph., 21(3):785–790, July 2002.
[3] S. Bi, N. K. Kalantari, and R. Ramamoorthi. Patch-based optimization for image-based texture mapping. ACM Trans. Graph., 36(4), July 2017.
[4] C. Buehler, M. Bosse, L. McMillan, S. Gortler, and M. Cohen. Unstructured lumigraph rendering. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '01, pages 425–432, 2001.
[5] B. Curless and M. Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '96, pages 303–312, 1996.
[6] A. Dai, M. Niessner, M. Zollhöfer, S. Izadi, and C. Theobalt. BundleFusion: Real-time globally consistent 3D reconstruction using on-the-fly surface reintegration. ACM Trans. Graph., 36(4), May 2017.
[7] D. (g.) DeBry, J. Gibbs, D. D. Petty, and N. Robins. Painting and rendering textures on unparameterized models. ACM Trans. Graph., 21(3):763–768, July 2002.
[8] M. Dellepiane, R. Marroquim, M. Callieri, P. Cignoni, and R. Scopigno. Flow-based local optimization for image-to-geometry projection. IEEE Transactions on Visualization and Computer Graphics, 18(3), Mar. 2012.
[9] Y. Fu, Q. Yan, L. Yang, J. Liao, and C. Xiao. Texture mapping for 3D reconstruction with RGB-D sensor. In Proc. IEEE Conf. Comput. Vis. Pattern Recogn., June 2018.
[10] R. Gal, Y. Wexler, E. Ofek, H. Hoppe, and D. Cohen-Or. Seamless montage for texturing models. Computer Graphics Forum, 29(2):479–486, 2010.
[11] K. Guo, F. Xu, T. Yu, X. Liu, Q. Dai, and Y. Liu. Real-time geometry, albedo, and motion reconstruction using a single RGB-D camera. ACM Trans. Graph., 36(3), June 2017.
[12] K. Guo, F. Xu, T. Yu, X. Liu, Q. Dai, and Y. Liu. Real-time high-accuracy 3D reconstruction with consumer RGB-D cameras. ACM Trans. Graph., 1(1), May 2018.
[13] M. Innmann, M. Zollhöfer, M. Niessner, C. Theobalt, and M. Stamminger. VolumeDeform: Real-time volumetric non-rigid reconstruction. In Proc. Euro. Conf. Comput. Vis., Mar. 2016.
[14] J. Jeon, Y. Jung, H. Kim, and S. Lee. Texture map generation for 3D reconstructed scenes. The Visual Computer, 32(6-8):955–965, 2016.
[15] C. Kerl, J. Stückler, and D. Cremers. Dense continuous-time tracking and mapping with rolling shutter RGB-D cameras. In Proc. IEEE Int. Conf. Comput. Vis., pages 2264–2272, 2015.
[16] S. Lefebvre and C. Dachsbacher. TileTrees. In ACM SIGGRAPH Symp. Interactive 3D Graph. Games, May 2007.
[17] V. Lempitsky and D. Ivanov. Seamless mosaicing of image-based texture maps. In Proc. IEEE Conf. Comput. Vis. Pattern Recogn., pages 1–6, June 2007.
[18] B. Lévy, S. Petitjean, N. Ray, and J. Maillot. Least squares conformal maps for automatic texture atlas generation. In ACM Transactions on Graphics (TOG), volume 21, pages 362–371. ACM, 2002.
[19] W. Li, H. Gong, and R. Yang. Fast texture mapping adjustment via local/global optimization. IEEE Trans. Visualization Comput. Graph., pages 1–1, 2018.
[20] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. SIGGRAPH Comput. Graph., 21(4), Aug. 1987.
[21] J. Maillot, H. Yahia, and A. Verroust. Interactive texture mapping. In Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques, pages 27–34. ACM, 1993.
[22] R. M. Murray, S. S. Sastry, and L. Zexiang. A Mathematical Introduction to Robotic Manipulation. CRC Press, Inc., Boca Raton, FL, USA, 1st edition, 1994.
[23] R. A. Newcombe, D. Fox, and S. M. Seitz. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proc. IEEE Conf. Comput. Vis. Pattern Recogn., June 2015.
[24] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In IEEE Int. Symp. Mixed and Augmented Reality, Oct. 2011.
[25] M. Niessner, M. Zollhöfer, S. Izadi, and M. Stamminger. Real-time 3D reconstruction at scale using voxel hashing. ACM Trans. Graph., 32(6):169:1–169:11, Nov. 2013.
[26] S. Rusinkiewicz and M. Levoy. Efficient variants of the ICP algorithm. In Proceedings Third International Conference on 3-D Digital Imaging and Modeling, pages 145–152, May 2001.
[27] P. V. Sander, S. Gortler, J. Snyder, and H. Hoppe. Signal-specialized parameterization. 2002.
[28] M. Waechter, N. Moehrle, and M. Goesele. Let there be color! Large-scale texturing of 3D reconstructions. In Proc. Euro. Conf. Comput. Vis., pages 836–850, 2014.
[29] J. Zaragoza, T. Chin, M. S. Brown, and D. Suter. As-projective-as-possible image stitching with moving DLT. In Proc. IEEE Conf. Comput. Vis. Pattern Recogn., pages 2339–2346, 2013.
[30] K. Zhou, J. Snyder, B. Guo, and H.-Y. Shum. Iso-charts: Stretch-driven mesh parameterization using spectral analysis. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH Symposium on Geometry Processing, pages 45–54. ACM, 2004.
[31] Q. Zhou and V. Koltun. Color map optimization for 3D reconstruction with consumer depth cameras. ACM Trans. Graph., 33(4), July 2014.
[32] Q.-Y. Zhou, J. Park, and V. Koltun. Open3D: A modern library for 3D data processing. arXiv:1801.09847, 2018.
