Iso-Points: Optimizing Neural Implicit Surfaces with Hybrid Representations

Wang Yifan1 Shihao Wu1 Cengiz Oztireli¨ 2 Olga Sorkine-Hornung1

1ETH Zurich 2University of Cambridge

vanilla optimization with Abstract input optimization iso-points Neural implicit functions have emerged as a powerful representation for surfaces in 3D. Such a function can en- code a high quality with intricate details into the parameters of a deep neural network. However, optimiz- ing for the parameters for accurate and robust reconstruc- tions remains a challenge, especially when the input data is noisy or incomplete. In this work, we develop a hy- brid neural surface representation that allows us to impose geometry-aware sampling and regularization, which signif- icantly improves the fidelity of reconstructions. We propose to use iso-points as an explicit representation for a neural implicit function. These points are computed and updated on-the-fly during training to capture important geometric features and impose geometric constraints on the optimiza- tion. We demonstrate that our method can be adopted to im- prove state-of-the-art techniques for reconstructing neural implicit surfaces from multi-view images or point clouds. Figure 1: We propose a hybrid neural surface representation with implicit Quantitative and qualitative evaluations show that, com- functions and iso-points, which leads to accurate and robust surface re- construction from imperfect data. The iso-points allows us to augment pared with existing sampling and optimization methods, our existing optimization pipelines in a variety of ways: geometry-aware reg- approach allows faster convergence, better generalization, ularizers are incorporated to reconstruct a surface from a noisy point cloud and accurate recovery of details and topology. (first row); geometric details are preserved in multi-view reconstruction via feature-aware sampling (second row); iso-points can serve as a 3D 1. Introduction prior to improve the topological accuracy of the reconstructed surface (third row). The input data are respectively: reconstructed point cloud [12] Reconstructing surfaces from real-world observations is of model 122 of DTU-MVS dataset [21], multi-view rendered images of DOG-WINGED model from Sketchfab dataset [55] and multi-view images a long-standing task in computer vision. Recently, repre- of model 55 of DTU-MVS dataset senting 3D geometry as a neural implicit function has re-

arXiv:2012.06434v2 [cs.CV] 9 Apr 2021 ceived much attention [40, 36,7]. Compared with other tures, e.g. sine activations [45] and Fourier features [41]. 3D shape representations, such as point clouds [29, 20, 54], We show examples of problems in fitting neural implicit polygons [24, 31, 39, 32], and voxels [8, 47, 35, 22], it pro- functions in Figure1. When fitting a neural surface to vides a versatile representation with infinite resolution and a noisy point cloud, “droplets” and bumps emerge where unrestricted topology. there are outlier points and holes (first row); when fitting a Fitting a neural implicit surface to input observations is surface to image observations, fine-grained geometric fea- an optimization problem. Some common application exam- tures are not captured due to under-sampling (second row); ples include surface reconstruction from point clouds and topological noise is introduced when inadequate views are multi-view reconstruction from images. For most cases, the available for reconstruction (third row). observations are noisy and incomplete. This leads to fun- In this work, we propose to alleviate these problems by damental geometric and topological problems in the final introducing a hybrid neural surface representation using reconstructed surface, as the network overfits to the imper- iso-points. The technique converts from an implicit neural fect data. We observe that this problem remains, and can surface to an explicit one via sampling iso-points, and goes become more prominent with the recent powerful architec- back to the implicit representation via optimization. The

1 two-way conversion is performed on-the-fly during training geometric regularization and provide a theoretical analysis to introduce geometry-aware regularization and optimiza- of the reproduction property possessed by the neural tion. This approach unlocks a large set of fundamental tools zero level set surfaces. Erler et al. [11] propose a patch- from geometry processing to be incorporated for accurate based framework that learns both the local geometry and the and robust fitting of neural surfaces. global inside/outside information, which outperforms exist- A key challenge is to extract the iso-points on-the-fly ef- ing data-driven methods. None of these methods exploit an ficiently and flexibly during the training of a neural surface. explicit sampling of the implicit function to improve the op- Extending several techniques from point-based geometry timization. Poursaeed et al. [42] use two different encoder- processing, we propose a multi-stage strategy, consisting of decoder networks to simultaneously predict both an explicit projection, resampling, and upsampling. We first obtain a atlas [15] and an implicit function. In contrast, we propose sparse point cloud on the implicit surface via projection, a hybrid representation using a single network. then resample the iso-points to fix severely under-sampled When the input observations are in the form of 2D im- regions, and finally upsample to obtain a dense point cloud ages, differentiable rendering allows us to use 2D pixels to that covers the surface. As all operations are GPU friendly supervise the learning of 3D implicit surfaces through au- and the resampling and upsampling steps require only local tomatic differentiation and approximate gradients [23, 46]. point distributions, the entire procedure is fast and practical The main challenge is to render the implicit surface and for training. compute reliable gradients at every optimization step effi- We illustrate the utility of the new representation with ciently. Liu et al. [34] accelerate the ray tracing process via a variety of applications, such as multi-view reconstruction a coarse-to-fine tracing algorithm [16], and use an and surface reconstruction from noisy point clouds. Quan- approximate gradient in back propagation. In [33], a ray- titative and qualitative evaluations show that our approach based field probing and an importance sampling technique allows for fast convergence, robust optimization, and accu- are proposed for efficient sampling of the object space. Al- rate reconstruction of details and topology. though these methods greatly improve rendering efficiency, the sampling of ray-based algorithms, i.e., the intersection 2. Related work between the ray and the iso-surface, are intrinsically irregu- lar and inefficient. Most of the above differentiable render- We begin by discussing existing representations of im- ers use ray casting to generate the supervision points. We plicit surfaces, move on to the associated optimization and propose another type of supervision points by sampling the differentiable rendering techniques given imperfect input implicit surface in-place. observations, and finally review methods proposed to sam- Sampling implicit surfaces. In 1992, Figueiredo et ple points on implicit surfaces. al. [10] proposed a powerful way to sample implicit sur- Implicit surface representations. Implicit functions faces using dynamic particle systems that include attraction are a flexible representation for surfaces in 3D. Tradi- and repulsion forces. Witkin and Heckbert [50] further de- tionally, implicit surfaces are represented globally or lo- veloped this concept by formulating an adaptive repulsion cally with radial basis functions (RBF) [6], moving least force. While the physical relaxation process is expensive, squares (MLS) [28], volumetric representation over uni- better initialization techniques have been proposed, such as form grids [9], or adaptive octrees [25]. Recent works in- using seed flooding on the partitioned space [27] or the oc- vestigate neural implicit surface representations, i.e., using tree cells [43]. Huang et al. [19] resample point set sur- deep neural networks to encode implicit function [40, 45], faces to preserve sharp features by pushing points away which achieves promising results in reconstructing surfaces from sharp edges before upsampling. When sampling a from 3D point clouds [3, 44, 11] or images [30, 52, 38]. neural implicit surface, existing works such as Atzmon et Compared with simple polynomial or Gaussian kernels, al. [2] project randomly generated 3D points onto the iso- implicit functions defined by nested activation functions, surface along with the gradient of the neural level set. How- e.g., MLPs [7] or SIREN [45], have more capability in ever, such sampled points are unevenly distributed, and may representing complex structures. However, the fiting of leave parts of the surface under-sampled or over-sampled. such neural implicit function requires clean supervision points [51] and careful optimization to prevent either over- 3. Method fitting to noise or underfitting to details and structure. Given a neural implicit function ft(p; θt) representing Optimizing neural implicit surfaces with partial ob- the surface St, where θt are the network parameters at the servations. Given raw 3D data, Atzmon et al. [3,4] use t-th training iteration, our goal is to efficiently generate and sign agnostic regression to learn neural implicit surfaces utilize a dense and uniformly distributed point set on the without using a ground truth implicit function for supervi- zero level set, called iso-points, which faithfully represents sion. Gropp et al. [14] use the Eikonal term for implicit the current implicit surface St. Intuitively, we can deploy

2 the iso-points back into the ongoing optimization to serve sampling , s = f (p; 휃) various purposes, e.g. improving the sampling of training t t regularization data and providing regularization for the optimization, lead- ing to a substantial improvement in the convergence rate project resample upsample and the final optimization quality. In this section, we first focus on how to extract the iso- points via projection and uniform resampling. We then ex- plain how to utilize the iso-points for better optimization in practical scenarios. St Pt

3.1. Iso-surface sampling Figure 2: Overview of our hybrid representation. We efficiently extract a dense, uniformly distributed set of iso-points as an explicit representation As shown in Figure2, our iso-surface sampling consists for a neural implicit surface. Since the extraction is fast, iso-points can be of three stages. First, we project a set of initial points Qt integrated back into the optimization as a 3D geometric prior, enhancing the optimization. onto the zero level set to get a set of base iso-points Q˜t. We then resample Q˜t to avoid clusters of points and fill large Uniform resampling. The projected base iso-points Q˜t holes. Finally, we upsample the points to obtain dense and can be sparse and hole-ridden due to the irregularity present uniformly distributed iso-points Pt. in the neural distance field, as shown in Figure2. Such Projection. Projecting a point onto the iso-surface can irregular sample distribution prohibits us from many down- be seen as using Newton’s method [5] to approximate the stream applications described later. root of the function starting from a given point. Atzmon et al.[2] derive the projection formula for functions modeled by generic networks. For completeness, we recap the steps The resampling step aims at avoiding over- and under- here, focusing on f : R3 → R. sampling by iteratively moving the base iso-points away Given an implicit function f(p) representing a signed from high-density regions, i.e. R3 distance field and an initial point q0 ∈ , we can find q˜ ← q˜ − αr, (2) an iso-point p on the zero level set of f using Newton q + + where α = D/|Q˜t| is the step size. The update direc- iterations: qk+1 = qk − Jf (qk) f(qk), where Jf is the Moore-Penrose pseudoinverse of the Jacobian. In our case, tion r is a weighted average of the normalized translations T + Jf (qk) between q˜ and its K-nearest points (we set K = 8): Jf is a row 3-vector, so that J (qk) = 2 , where f kJf (qk)k J (q ) X q˜i − q˜ the Jacobian f k can be conveniently evaluated in the r = w(q˜i, q˜) . (3) network via backpropagation. kq˜i − q˜k q˜i∈N (q˜) However, for some contemporary network designs, such The weighting function is designed to gradually reduce the as sine activation functions [44] and positional encod- influence of faraway neighbors, specifically ing [37], the signed distance field can be very noisy and the 2 kq˜i−q˜k − σ gradient highly non-smooth. Directly applying Newton’s w(q˜i, q˜) = e p , (4) method then causes overshooting and oscillation. While one where the density bandwidth σp is set to be 16D/|Q˜t|. could attempt more sophisticated line search algorithms, we instead address this issue with a simple clipping operation to bound the length of the update, i.e. Upsampling. Next, we upsample the point set to the  J T(q )  f k desired density while further improving the point distribu- qk+1 = qk − τ kJ (q )k2 f(qk) , (1) f k tion. Our upsampling step is based on edge-aware resam- v D where τ(v) = min(kvk, τ0). We set τ0 = with D pling (EAR) proposed by Huang et al.[19]. We explain the kvk 2|Qt| denoting the diagonal length of the shape’s bounding box. key steps and our main differences to EAR as follow. In practice, we initialize Qt with randomly sampled points at the beginning of the training and then with iso- points Pt−1 from the previous training iteration. Similar to First, we compute the normals as the normalized Jaco- [2], at each training iteration, we perform a maximum of 10 bians and apply bilateral filtering, just as in EAR. Newton iterations and terminate as soon as all points have Then, the points are pushed away from the edges to create converged, i.e. |f(qk)| < , ∀q ∈ Qt. The termination a clear separation. We modify the original optimization for- threshold  is set to 10−4 and gradually reduced to 10−5 mulation with a simpler update consisting of an attraction during training. and a repulsion term. The former pulls points away from

3 the edge and the latter prevents the points from clustering. P φ(ni, pi − p)(p − pi) ∆p = pi∈N (p) , (5) attraction P φ(n , p − p) pi∈N (p) i i P w(pi, q)(pi − p) ∆p = 0.5 pi∈N (p) , (6) repulsion P w(p , q) pi∈N (p) i

p ← p − τ(∆prepulsion) − τ(∆pattraction), (7) T 2 (ni (p−pi)) Figure 3: Examples of importance sampling based on different saliency − σ where φ(ni, p−pi) = e p is the anisotropic pro- metrics. The default uniform iso-points treat different regions of the iso- jection weight, ni is the point normal of neighbor pi and w surface equally. We compute different saliency metrics on the uniform is the spatial weight defined in (4). We use the same direc- iso-points, based on which more samples can be gathered in the salient τ regions. The curvature-based metric emphasizes geometric details, while tional clipping function as before to bound the two update the loss-based metric allows for hard example mining. terms individually, which improves the stability of the algo- rithm for sparse point clouds. applications where the explicit 3D geometry is not avail- By design, new points are inserted in areas with low den- able during training, the question of how to generate train- sity or high curvature. The trade-off is controlled by a uni- ing samples remains mostly unexplored. fying priority score P (p) = maxpi∈N (p) B(p, pi), where We exploit the geometry information and the predic- B is a distance measure (see [19] for the exact definition). tion uncertainty carried by the iso-points during training. Denoting the point with the highest priority as p∗, a new The main idea is to compute a saliency metric on iso- point is inserted at the midpoint between p∗ and neighbor points, then add more samples in those regions with high ∗ ∗ ∗ ∗ pi∗ , where pi∗ = arg maxpi∈N (p ) B(p , pi). In the orig- saliency. To this end, we experiment with two types of met- inal EAR method, the insertion is done iteratively, requiring rics: curvature-based and loss-based. The former aims at an update of the neighborhood information and recalcula- emphasizing geometric features, typically reflected by high tion of B after every insertion. Instead, to allow parallel curvature. The latter is a form of hard example mining, as computation at GPU, we approximate this process by si- we sample more densely where the higher loss occurs, as multaneously inserting a maximum of |P|/10 points each shown in Figure3. step. In this way, however, inserting the midpoint of point Since the iso-points are uniformly distributed, the cur- pairs would lead to duplicated new points. Thus, we insert vature can be approximated by the norm of the Lapla- 1 ∗ ∗ P asymmetrically at (p ∗ + 2p ) instead. cian, i.e. R (p) = kp − p k. For the 3 i curvature pi∈N (p) i We then project the upsampled iso-points to the iso- loss-based metric, we project the iso-points on all train- surface. As shown in Figure2, the final iso-points success- ing views and compute the average loss at each point, i.e. 1 PN fully reflect the 3D geometry of the current implicit surface. Rloss(p) = N i loss(p), where N is the number of oc- Compared with using marching cubes to extract the iso- currences of p in all views. surface, our adaptive sampling is efficient. Since both the As both metrics evolve smoothly, we need not update resampling and upsampling steps only require information them in each training iteration. Denoting the iso-points at of local neighborhood, we implement it on the GPU. Fur- which we computed the metric as T and the subset of tem- thermore, since we use the iso-points from the previous it- plate points with high metric values by T ∗ = {t∗}, the eration for initialization, the overall point distribution im- metric-based insertion for each point p in the current iso- proves as the training stabilizes, requiring fewer (or even point set Pt can be written as zero) resampling and upsampling steps at later stages. 2 1 ∗ pnew,i = 3 p + 3 pi, ∀pi ∈ N (p), if mint∗ kp − t k ≤ σ. 3.2. Utilizing iso-points in optimization The neighborhood radius σ is the same one used in (3). We introduce two scenarios of using iso-points to guide Iso-points for regularization. The access to an explicit neural implicit surface optimization: (i) importance sam- representation of the implicit surface also enables us to in- pling for multi-view reconstruction and (ii) regularization corporate geometry-motivated priors into the optimization when reconstructing neural implicit surfaces from raw in- objective, exerting finer control of the reconstruction result. put point clouds. Let us consider fitting a neural implicit surface to a Iso-points for importance sampling. Optimizing a neu- point cloud. Depending on the acquisition process, the ral implicit function to represent a high-resolution 3D shape point cloud may be sparse, distorted by noise, or distributed requires abundant training samples – specifically, many su- highly unevenly. Similar to previous works [48, 49], we ob- pervision points sampled close to the iso-surface to capture serve that the neural network tends to reconstruct a smooth the fine-grained geometry details accurately. However, for surface at the early stage of learning, but then starts to pick

4 Figure 4: Progression of overfitting. When optimizing a neural implicit Figure 5: Iso-points for regularization. At the early stage of the training, surface on a noisy point cloud, the network initially outputs a smooth sur- the implicit surface is smooth (left), and we extract iso-points (middle) face, but increasingly overfits to the noise in the data. Shown here are as a reference shape, which can facilitate various regularization terms. In the reconstructed surfaces after 1000, 2000 and 5000 iterations. The input the example on the right, we use the iso-points to reduce the influence of point cloud is acquired in-house using an Artec Eva scanner. outliers (shown in red). up higher frequency signals from the corrupted observa- and those computed from the gradient of the network, i.e. tions and overfits to noise and other caveats in the data, as 1 X L = (1 − | cos(J T(p), n )|). (9) shown in Figure4. This characteristic is consistent across isoNormal |P| f PCA network architectures, including those designed to accom- p∈P modate high-frequency information, such as SIREN. The larger the PCA neighborhood is, the smoother the re- construction becomes. Optionally, additional normal filter- Existing methods that address overfitting include early ing can be applied after PCA to reduce over-smoothing and stopping, weight decay, drop-out etc.[13]. However, enhance geometric features. whereas these methods are generic tools designed to Finally, we can use the iso-points to filter outliers in the improve network generalizability for training on large training data. Specifically, given a batch of training points datasets, we propose a novel regularization approach that Q = {q}, we compute a per-point loss weight based on is specifically tailored to training neural implicit surfaces their alignment with the iso-points. Here, we choose to use and serves as a platform to integrate a plethora of 3D priors bilateral weights to take both the Euclidean and the direc- from classic geometry processing. tional distance into consideration. Denoting the normalized Our main idea is to use the iso-points as a consistent, gradient of an iso-point p and a training point q as np and smooth, yet dynamic approximation of the reference geom- nq, respectively, the bilateral weight can written as etry. Consistency and smoothness ensure that the optimiza- v(q) = min φ(np, p − q)ψ(np, nq), with (10) tion does not fluctuate and overfit to noise; the dynamic p∈P nature lets the network pick up consistent high-frequency T 2  1−n nq  − 1− p signals governed by underlying geometric details. 1−cos(σn) ψ(np, nq) = e , (11) To this end, we extract iso-points after a short warmup where σn regulates the sensitivity to normal difference; we training (e.g. 300 iterations). Because of the aforemen- ◦ set σn = 60 in our experiments. This loss weight can be tioned smooth characteristic of the network, the noise level incorporated into the existing loss functions to reduce the in the initial iso-points is minimal. Then, during subsequent impact of outliers. A visualization of the outliers detected training, we update the iso-points periodically (e.g. every by this weight is shown in Figure5. 1000 iterations) to allow them to gradually evolve as the network learns more high-frequency information. 4. Results The utility of the iso-points includes, but is not lim- ited to 1) serving as additional training data to circumvent Iso-points can be incorporated into the optimization data scarcity, 2) enforcing additional geometric constraints, pipelines of existing state-of-the-art methods for implicit 3) filtering outliers in the training data. surface reconstruction. In this section, we demonstrate the Specifically, for sparse or hole-ridden input point clouds, benefits of the specific techniques introduced in Section 3.2. we take advantage of the uniform distribution of iso-points We choose state-of-the-art methods as the baselines, then and augment supervision in undersampled areas by enforc- augment the optimization with iso-points. Results show that ing the signed distance value on all iso-points to be zero: the augmented optimization outperforms the baseline quan- titatively and qualitatively. 1 X L = |f(p)|. (8) isoSDF |P| p∈P 4.1. Sampling with iso-points Given the iso-points, we compute their normals from We evaluate the benefit of utilizing iso-points to generate their local neighborhood using principal component analy- training samples for multi-view reconstruction. As the base- sis (PCA) [18]. We then increase surface smoothness by en- line, we employ the ray-tracing algorithm from a state-of- forcing consistency between the normals estimated by PCA the-art neural implicit renderer IDR [53], which generates

5 sampling with ray-tracing sampling with iso-points

on-surface sample in-surface sample out-surface sample

Figure 6: A 2D illustration of two sampling strategies for multi-view reconstruction. Ray-tracing (left) generates training samples by shoot- ing rays from camera C through randomly sampled pixels; depending on whether an intersection is found and whether the pixel lies inside the ob- Figure 7: Qualitative comparison between sampling strategies for multi- ject silhouette, three types of samples are generated: on-surface, in-surface view reconstruction. Using the same optimization time and similar total and out-surface points. We generate on-surface samples directly from iso- sample count, sampling the surface points with uniformly distributed iso- points, obtaining evenly distributed samples on the implicit surface, and points considerably improves the reconstruction accuracy. A loss based also use the iso-points to generate more reliable in-surface samples. importance sampling further improves the recovery of small-scale struc- tures. The models shown here are the COMPRESSOR and ANGEL2 from the Sketchfab dataset [55]. baseline uniform curvature- loss- (ray-tracing) based based Below, we demonstrate two benefits of the proposed CD ·10−4(position) 17.24 1.80 1.83 1.71 sampling scheme. CD ·10−1(normal) 1.51 1.10 0.99 0.95 Surface details from importance sampling. First, we Table 1: Quantitative effect of importance sampling with iso-points. Com- examine the effect of drawing on-surface samples using iso- pared to the baseline, which generates training points via ray-marching, we points by comparing the optimization results under fixed use iso-points to draw more attention on the implicit surface. The result is optimization time and the same total sample count. averaged over 10 models selected from the Sketchfab dataset [55]. As inputs, we render 512 × 512 images per object un- training samples by ray-marching from the camera center der known lighting and material from varied camera posi- through uniformly sampled pixels in the image. As shown tions. When training with the iso-points, we extract 4,000 in Figure6, three types of samples are used for different iso-points after 500 iterations, then gradually increase the types of supervision: on-surface samples, which are ray- density until reaching 20,000 points. To match the sample surface intersections inside the object’s 2D silhouette, in- count, in the ray-tracing variation, we randomly draw 2,048 surface samples, which are points with the lowest signed pixels per image and then increase the sample count until distance on the non-intersecting rays inside 2D silhouette, reaching 10,000 pixels. We use a 3-layer SIREN model and out-surface samples, which are on the rays outside the with the frequency multipliers set to 30, and optimize with 2D silhouette either at the surface intersection or at the po- a batch size of 4. sition with the lowest signed distance. We evaluate our method quantitatively using 10 water- On this basis, we incorporate the iso-points directly as tight models from the Sketchfab dataset [55]. As shown in on-surface samples. We can direct the learning attention Table1, we compute 2-way chamfer point-to-point distance 2 by varying the distribution of iso-points using the saliency (kpi −pjk ) and normal distance (1 − cos(ni, nj)) on 50K metrics described in Section 3.2. The iso-points also pro- points, uniformly sampled from the reconstructed meshes. vide us prior knowledge to generate more reliable in-surface Results show that using uniform iso-points as on-surface samples. More specifically, as shown in Figure6 (right), samples compares favorably against the baseline, especially we generate the three types of samples as follows: a) on- in the normal metric. It suggests that we achieve higher fi- surface samples: we remove occluded iso-points by visi- delity on the finer geometric features, as our surface sam- bility testing using a point rasterizer, and select those iso- ples overcome under-sampling issues occurring at small points whose projections are inside the object silhouette; scale details. We also see that importance sampling with b) in-surface samples: on the camera rays that pass through iso-points and loss based upsampling exhibits a substantial the on-surface samples, we determine the point with the advantage over other variations, demonstrating the effec- lowest signed distance on the segment between the on- tiveness of smart allocation of the training samples accord- surface sample and the farther intersection with the object’s ing to the current learning state. In comparison, curvature- bounding sphere. c) out-surface samples: we shoot camera based sampling performs similarly to the baseline, but no- rays through pixels outside the object silhouette, and choose tability worse than with the uniform iso-points. We observe the point with the lowest signed distance. that the iso-points, in this case, are highly concentrated on a

6 apparent. In contrast, our sampling enables more accurate reconstruction of the signed distance field inside and outside the surface with more faithful topological structure. 4.2. Regularization with iso-points We evaluate the benefit of using iso-points for regular- ization. As an example application, we consider surface reconstruction from a noisy point cloud. As our baseline method, we use the publicly available SIREN codebase and adopt their default optimization pro- tocol. The noisy input point clouds are either acquired with a 3D scanner (Artec Eva) or reconstructed [12] from the multi-view images in the DTU dataset. In each optimization iteration, the baseline method ran- domly samples an equal number of oriented surface points Qs = {qs, ns} from the input point cloud and unoriented Figure 8: Topological correctness of the reconstructed surface in multi- off-surface points Qo = {qo} from bounding cube’s inte- view reconstruction. The erroneous inner structures of IDR are indicated rior. The optimization objective is comprised of four parts: by the artifacts in the images rendered by Blender [17] (using glossy and transparent material) and by the contours of the signed distance field on L = γonSDFLonSDF + γnormalLnormal + γoffSDFLoffSDF + γeikonalLeikonal, cross-sections. where few spots on the surface and ignore regions where the cur- P |f(q )| P |1−cos(J T(q ),n )| qs∈Qs s qs∈Qs f s s LonSDF = , Lnormal = , rent reconstruction is problematic (Figure3). |Qs| |Qs| The improvement is more pronounced qualitatively, as P e−α|f(qo)| P |1−kJ T(q)k| qo∈Qo q∈Qo∪Qs f LoffSDF = , Leikonal = and shown in Figure7. Sampling on-surface with uniform iso- |Qo| |Qs|+|Po| points clearly enhances reconstruction accuracy compared to the baseline with ray-tracing. The finer geometric details γonSDF = 1000, γnormal = 100, γoffSDF = 50, γeikonal = 100. further improve with loss-based importance sampling. We alter this objective with the outlier-aware loss weight de- Topological correctness from 3D prior. IDR can recon- fined in (11), and then add the regularizations on iso-points struct impressive geometric details on the DTU dataset [21], LisoSDF (8) and LisoNormal (9). The final objective becomes but a closer inspection shows that the reconstructed surface L = γ (L + L ) + γ (L + L )+ contains a considerable amount of topological errors inside onSDF onSDF isoSDF normal normal isoNormal the visible surface. We use iso-points to improve the topo- γoffSDFLoffSDF + γeikonalLeikonal, logical accuracy of the reconstruction. where the loss terms with on-surface points are weighted as We use the same network architecture and training pro- follows: tocol as IDR, which samples 2048 pixels from a randomly 1 P L = v(qs)|f(qs)|, onSDF |Qs| qs∈Qs chosen view in each optimization iteration. We use uniform 1 P T Lnormal = v(qs)|1 − cos(J (qs), ns)|. iso-points in this experiment. To keep a comparable sam- |Qs| qs∈Qs f ple count, we subsample the visible iso-points to obtain a The iso-points are initialized by subsampling the input point maximum of 1500 on-surface samples per iteration. Since cloud by 1/8 and updated every 2000 iterations. our strategy automatically creates more in-surface samples The comparison with the baseline, i.e., vanilla optimiza- (as shown in Figure6), we halve the loss weight on the in- tion without regularization, is shown in Figure9. For surface samples compared to the original implementation. the DTU-MVS data, we also conduct quantitative evalua- We visualize the topology of the reconstructed surface tion following the standard DTU protocol as shown in Ta- in Figure8. To show the inner structures of the sur- ble2, i.e., L1-Chamfer distance between the reconstructed face, we render it in transparent and glossy material with and reference point cloud within a predefined volumetric a physically-based renderer [17] and show the back faces mask [1]. For clarity, we also show the results of screened of the mesh. Dark patches in the rendered images indi- Poisson reconstruction [26] and Points2Surf [11]. The for- cate potentially erroneous light transmission caused by in- mer reconstructs a watertight surface from an oriented point ner structures. Similarly, we also show the contour lines set by solving local Poisson equations; the latter fits an of the iso-surface on a cross-section to indicate the irregu- implicit neural function to an unoriented point set in a larity of the reconstructed implicit function. In both visu- global-to-local manner. Compared with the baseline and alizations, the incorrect topology in IDR reconstruction is screened Poisson, our proposed regularizations significantly

7 input ours baseline screened Poisson points2surf is much less notable. As discussed, we filter the occluded iso-points before the projection, which also saves opti- mization time. The trade-off between optimization speed and quality is de- picted in a concrete ray-marching example in Fig. 10, 10−2 iso-points where we plot the evolution of 10−3 the chamfer dis- chamfer distance tance during the CD:0.42 CD:0.50 CD:0.46 CD:0.46 10−4 optimization of 0 2,000 4,000 6,000 the COMPRESSOR time (s) model (first row of Figure 10: Validation error in relation to op- Fig.7). Compared timization time. The first time stamp is at the to the baseline opti- 100-th iteration. CD:0.56 CD:0.69 CD:0.69 CD:0.98 mization with ray-tracing, it is evident that the iso-points augmented optimization consistently achieves better results Figure 9: Implicit surface reconstruction from noisy and sparse point at every timestamp. In other words, with iso-points we can clouds. From left to right: input, reconstruction with our proposed reg- ularizations, baseline reconstruction without our regularizations, screened reach a given quality threshold faster. Poisson reconstruction and Points2Surf. CD denotes L1-Chamfer distance. The sparse point cloud in the first row is acquired with an Artec Eva scan- 5. Conclusion and Future Work ner, whereas the inputs in the second row and third row are reconstructed from DTU dataset (model 105 and 122) using [12]. In this paper, we leverage the advantages of two inter- dependent representations: neural implicit surfaces and iso- ID ours baseline point2surf screened Poisson points, for 3D learning. Implicit surfaces can represent 3D 55 0.37 0.41 0.56 0.42 structures in arbitrary resolution and topology, but lack an 69 0.59 0.65 0.61 0.63 105 0.56 0.69 0.98 0.69 explicit form to adapt the optimization process to input data. 110 0.54 0.51 0.61 0.55 Iso-points, as a point cloud adaptively distributed on the un- 114 0.38 0.45 0.45 0.37 118 0.45 0.49 0.59 0.55 derlying surface, are fairly straightforward and efficient to 122 0.42 0.50 0.46 0.46 manipulate and analyze the underlying 3D geometry. Average 0.53 0.61 0.52 0.47 We present effective algorithms to extract and utilize iso-points. Extensive experiments show the power of our Table 2: Quantitative evaluation for surface reconstruction from a noisy sparse point cloud. We evaluate the two-way L1-chamfer distance on a hybrid representation. We demonstrate that iso-points can subset of the DTU-MVS dataset. be readily employed by state-of-the-art neural 3D recon- struction methods to significantly improve optimization ef- suppresses noise. Points2Surf can handle noisy input well, ficiency and reconstruction quality. but the sign propagation appears to be sensitive to the point A limitation of the proposed sampling strategy is that it is distribution, leading to holes in the reconstructed mesh. mainly determined by the geometry of the underlying sur- Moreover, since their model does not use the points’ nor- face and does not explicitly model the appearance. In the mal information, the reconstruction lacks detail. future, we would like to extend our hybrid representation to model the joint space of geometry and appearance, which 4.3. Performance analysis can in turn allow us to apply path-tracing for global illumi- The main overhead in our approach is the projection step. nation, bridging the gap between existing neural rendering One newton iteration requires a forward and a backward approaches and classic physically based rendering. Another pass. On average, the projection terminates within 4 iter- promising direction is utilizing the new representation for ations. This procedure is optimized by only considering consistent space-time reconstructions of articulated objects. points that are not yet converged at each iteration. Empirically, the computation time of extracting the iso- Acknowledgement points once is typically equivalent to running 3 training it- We would like to thank Viviane Yang for her help erations. In practice, as we extract iso-points only periodi- with the point2surf code. This work was supported in cally, the total optimization time only increases marginally. parts by Apple scholarship, SWISSHEART Failure Net- In the case of multi-view reconstruction, since the ray- work (SHFN), and UKRI Future Leaders Fellowship [grant marching itself is an expensive operation, involving multi- number MR/T043229/1]. ple forward passes per ray, the overhead of our approach

8 References [16] John C Hart. Sphere tracing: A geometric method for the antialiased ray tracing of implicit surfaces. The Visual Com- [1] Henrik Aanæs, Rasmus Ramsbøl Jensen, George Vogiatzis, puter, 12(10):527–545, 1996.2 Engin Tola, and Anders Bjorholm Dahl. Large-scale data for [17] Roland Hess. Blender Foundations: The Essential Guide to multiple-view stereopsis. International Journal of Computer Learning Blender 2.6. Focal Press, 2010.7 Vision, pages 1–16, 2016.7 [2] Matan Atzmon, Niv Haim, Lior Yariv, Ofer Israelov, Haggai [18] Hugues Hoppe, Tony DeRose, Tom Duchamp, John McDon- Maron, and Yaron Lipman. Controlling neural level sets. ald, and Werner Stuetzle. Surface reconstruction from un- In Proc. IEEE Int. Conf. on Neural Information Processing organized points. ACM Trans. on Graphics (Proc. of SIG- Systems (NeurIPS), pages 2034–2043, 2019.2,3 GRAPH), pages 71–78, 1992.5 [3] Matan Atzmon and Yaron Lipman. Sal: Sign agnostic learn- [19] Hui Huang, Shihao Wu, Minglun Gong, Daniel Cohen-Or, ing of shapes from raw data. In Proc. IEEE Conf. on Com- Uri Ascher, and Hao Zhang. Edge-aware point set resam- puter Vision & Pattern Recognition, pages 2565–2574, 2020. pling. ACM Trans. on Graphics (Proc. of SIGGRAPH Asia), 2 32(1):1–12, 2013.2,3,4 [4] Matan Atzmon and Yaron Lipman. Sald: Sign agnostic [20] Eldar Insafutdinov and Alexey Dosovitskiy. Unsupervised learning with derivatives, 2020.2 learning of shape and pose with differentiable point clouds. [5] Adi Ben-Israel. A newton-raphson method for the solution In Proc. IEEE Int. Conf. on Neural Information Processing of systems of equations. Journal of Mathematical analysis Systems (NeurIPS), pages 2802–2812, 2018.1 and applications, 15(2):243–252, 1966.3 [21] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engil Tola, [6] Jonathan C Carr, Richard K Beatson, Jon B Cherrie, Tim J and Henrik Aanæs. Large scale multi-view stereopsis eval- Mitchell, W Richard Fright, Bruce C McCallum, and Tim R uation. In Proc. IEEE Conf. on Computer Vision & Pattern Evans. Reconstruction and representation of 3D objects with Recognition, pages 406–413. IEEE, 2014.1,7 radial basis functions. In Proc. of SIGGRAPH, pages 67–76, [22] Yue Jiang, Dantong Ji, Zhizhong Han, and Matthias Zwicker. 2001.2 SDFDiff: Differentiable rendering of signed distance fields [7] Zhiqin Chen and Hao Zhang. Learning implicit fields for for 3D shape optimization. In Proc. IEEE Conf. on Computer generative shape modeling. In Proc. IEEE Conf. on Com- Vision & Pattern Recognition, pages 1251–1261, 2020.1 puter Vision & Pattern Recognition, pages 5939–5948, 2019. [23] Hiroharu Kato, Deniz Beker, Mihai Morariu, Takahiro Ando, 1,2 Toru Matsuoka, Wadim Kehl, and Adrien Gaidon. Differen- [8] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin tiable rendering: A survey. arXiv preprint arXiv:2006.12057, Chen, and Silvio Savarese. 3D-R2N2: A unified approach 2020.2 for single and multi-view 3D object reconstruction. In Proc. [24] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neu- Euro. Conf. on Computer Vision, pages 628–644. Springer, ral 3D mesh renderer. In Proc. IEEE Conf. on Computer 2016.1 Vision & Pattern Recognition, pages 3907–3916, 2018.1 [9] Brian Curless and Marc Levoy. A volumetric method for [25] Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. building complex models from range images. In ACM Trans. Poisson surface reconstruction. In Proc. Eurographics Symp. on Graphics (Proc. of SIGGRAPH), pages 303–312, 1996.2 on Geometry Processing, volume 7, 2006.2 [10] Luiz Henrique de Figueiredo, Jonas de Miranda Gomes, [26] Michael Kazhdan and Hugues Hoppe. Screened poisson sur- Demetri Terzopoulos, and Luiz Velho. Physically-based face reconstruction. ACM Trans. on Graphics, 32(3):1–13, methods for polygonization of implicit surfaces. In Graphics 2013.7 Interface, volume 92, pages 250–257, 1992.2 [27] Florian Levet, Xavier Granier, and Christophe Schlick. Fast [11] Philipp Erler, Paul Guerrero, Stefan Ohrhallinger, Niloy J sampling of implicit surfaces by particle systems. In IEEE Mitra, and Michael Wimmer. Points2surf learning implicit International Conference on Shape Modeling and Applica- surfaces from point clouds. In Proc. Euro. Conf. on Com- tions, pages 39–39. IEEE, 2006.2 puter Vision, pages 108–124. Springer, 2020.2,7 [12] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and [28] David Levin. The approximation power of moving least- robust multiview stereopsis. IEEE Trans. Pattern Analysis & squares. Mathematics of computation, 67(224):1517–1531, Machine Intelligence, 32(8):1362–1376, 2009.1,7,8 1998.2 [13] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and [29] Chen-Hsuan Lin, Chen Kong, and Simon Lucey. Learning Yoshua Bengio. Deep learning. MIT press Cambridge, 2016. efficient point cloud generation for dense 3D object recon- 5 struction. arXiv preprint arXiv:1706.07036, 2017.1 [14] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and [30] Chen-Hsuan Lin, Chaoyang Wang, and Simon Lucey. SDF- Yaron Lipman. Implicit geometric regularization for learning SRN: Learning signed distance 3D object reconstruction shapes. arXiv preprint arXiv:2002.10099, 2020.2 from static images. arXiv preprint arXiv:2010.10505, 2020. [15] Thibault Groueix, Matthew Fisher, Vladimir G Kim, 2 Bryan C Russell, and Mathieu Aubry. A papier-machˆ e´ ap- [31] Hsueh-Ti Derek Liu, Michael Tao, and Alec Jacobson. Pa- proach to learning 3D surface generation. In Proc. IEEE parazzi: surface editing by way of multi-view image process- Conf. on Computer Vision & Pattern Recognition, pages ing. ACM Trans. on Graphics (Proc. of SIGGRAPH Asia), 216–224, 2018.2 37(6):221–1, 2018.1

9 [32] Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft ras- IEEE Int. Conf. on Neural Information Processing Systems terizer: A differentiable renderer for image-based 3D rea- (NeurIPS), pages 1121–1132, 2019.1,2 soning. In Proc. Int. Conf. on Computer Vision, pages 7708– [46] Ayush Tewari, Ohad Fried, Justus Thies, Vincent Sitzmann, 7717, 2019.1 Stephen Lombardi, Kalyan Sunkavalli, Ricardo Martin- [33] Shichen Liu, Shunsuke Saito, Weikai Chen, and Hao Li. Brualla, Tomas Simon, Jason Saragih, Matthias Nießner, Learning to infer implicit surfaces without supervision. In et al. State of the art on neural rendering. arXiv preprint Proc. IEEE Int. Conf. on Neural Information Processing Sys- arXiv:2004.03805, 2020.2 tems (NeurIPS), pages 8295–8306, 2019.2 [47] Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Ji- [34] Shaohui Liu, Yinda Zhang, Songyou Peng, Boxin Shi, Marc tendra Malik. Multi-view supervision for single-view recon- Pollefeys, and Zhaopeng Cui. DIST: Rendering deep im- struction via differentiable ray consistency. In Proc. IEEE plicit signed distance function with differentiable sphere Conf. on Computer Vision & Pattern Recognition, pages tracing. In Proc. IEEE Conf. on Computer Vision & Pattern 2626–2634, 2017.1 Recognition, pages 2019–2028, 2020.2 [48] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. [35] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Deep image prior. In Proc. IEEE Conf. on Computer Vision Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural vol- & Pattern Recognition, pages 9446–9454, 2018.4 umes: Learning dynamic renderable volumes from images. [49] Francis Williams, Teseo Schneider, Claudio Silva, Denis arXiv preprint arXiv:1906.07751, 2019.1 Zorin, Joan Bruna, and Daniele Panozzo. Deep geomet- [36] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Se- ric prior for surface reconstruction. In Proc. IEEE Conf. bastian Nowozin, and Andreas Geiger. Occupancy networks: on Computer Vision & Pattern Recognition, pages 10130– Learning 3D reconstruction in function space. In Proc. IEEE 10139, 2019.4 Conf. on Computer Vision & Pattern Recognition, pages [50] Andrew P Witkin and Paul S Heckbert. Using particles to 4460–4470, 2019.1 sample and control implicit surfaces. In ACM Trans. on [37] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Graphics (Proc. of SIGGRAPH), pages 269–277, 1994.2 Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: [51] Yifan Xu, Tianqi Fan, Yi Yuan, and Gurprit Singh. Lady- Representing scenes as neural radiance fields for view syn- bird: Quasi-Monte Carlo sampling for deep implicit field thesis. arXiv preprint arXiv:2003.08934, 2020.3 based 3D reconstruction with symmetry. arXiv preprint [38] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and arXiv:2007.13393, 2020.2 Andreas Geiger. Differentiable volumetric rendering: Learn- [52] Lior Yariv, Matan Atzmon, and Yaron Lipman. Univer- ing implicit 3D representations without 3D supervision. In sal differentiable renderer for implicit neural representations, Proc. IEEE Conf. on Computer Vision & Pattern Recogni- 2020.2 tion, pages 3504–3515, 2020.2 [53] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan [39] Merlin Nimier-David, Delio Vicini, Tizian Zeltner, and Wen- Atzmon, Ronen Basri, and Yaron Lipman. Multiview neu- zel Jakob. Mitsuba 2: A retargetable forward and inverse ren- ral surface reconstruction by disentangling geometry and ap- derer. ACM Trans. on Graphics (Proc. of SIGGRAPH Asia), pearance. Proc. IEEE Int. Conf. on Neural Information Pro- 38(6):1–17, 2019.1 cessing Systems (NeurIPS), 2020.5 [40] Jeong Joon Park, Peter Florence, Julian Straub, Richard [54] Wang Yifan, Felice Serena, Shihao Wu, Cengiz Oztireli,¨ and Newcombe, and Steven Lovegrove. DeepSDF: Learning Olga Sorkine-Hornung. Differentiable surface splatting for continuous signed distance functions for shape representa- point-based geometry processing. ACM Trans. on Graphics tion. In Proc. IEEE Conf. on Computer Vision & Pattern (Proc. of SIGGRAPH Asia), 38(6):1–14, 2019.1 Recognition, pages 165–174, 2019.1,2 [55] Wang Yifan, Shihao Wu, Hui Huang, Daniel Cohen-Or, and [41] Matt Pharr, Wenzel Jakob, and Greg Humphreys. Physically Olga Sorkine-Hornung. Patch-based progressive 3d point based rendering: From theory to implementation. Morgan set upsampling. In Proceedings of the IEEE Conference Kaufmann, 2016.1 on Computer Vision and Pattern Recognition, pages 5958– [42] Omid Poursaeed, Matthew Fisher, Noam Aigerman, and 5967, 2019.1,6 Vladimir G Kim. Coupling explicit and implicit surface representations for generative . arXiv preprint arXiv:2007.10294, 2020.2 [43] Joao˜ Proenc¸a, Joaquim A Jorge, and Mario Costa Sousa. Sampling point-set implicits. In Eurographics Symposium on Point-Based Graphics, pages 11–18, 2007.2 [44] Vincent Sitzmann, Julien NP Martel, Alexander W Bergman, David B Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. arXiv preprint arXiv:2006.09661, 2020.2,3 [45] Vincent Sitzmann, Michael Zollhofer,¨ and Gordon Wet- zstein. Scene representation networks: Continuous 3D- structure-aware neural scene representations. In Proc.

10