Monocular Neural Image Based Rendering with Continuous View Control

Xu Chen∗ Jie Song∗ Otmar Hilliges AIT Lab, ETH Zurich {xuchen, jsong, otmar.hilliges}@inf.ethz.ch


Figure 1: Interactive novel view synthesis: Given a single source view, our approach can generate a continuous sequence of geometrically accurate novel views under fine-grained control. Top: Given a single street-view-like input, a user may specify a continuous camera trajectory and our system generates the corresponding views in real time. Bottom: An unseen high-resolution internet image is used to synthesize novel views while the camera is controlled interactively. Please refer to our project homepage†.

Abstract

We propose a method to produce a continuous stream of novel views under fine-grained (e.g., 1° step-size) camera control at interactive rates. A novel learning pipeline determines the output pixels directly from the source colors. Injecting geometric transformations, including perspective projection, 3D rotation and translation, into the network forces implicit reasoning about the underlying geometry. The latent 3D geometry representation is compact and meaningful under 3D transformation, being able to produce geometrically accurate views for both single objects and natural scenes. Our experiments show that both proposed components, the transforming encoder-decoder and the depth-guided appearance mapping, lead to significantly improved generalization beyond the training views and in consequence to more accurate view synthesis under continuous 6-DoF camera control. Finally, we show that our method outperforms state-of-the-art baseline methods on public datasets.

1. Introduction

3D immersive experiences can benefit many application scenarios. For example, in an online store one would often like to view products interactively in 3D rather than from discrete view angles. Likewise, in mapping applications it is desirable to explore the vicinity of street-view-like images beyond the position at which the image was taken. This is often not possible because either only 2D imagery exists, or because storing and rendering of full 3D information does not scale. To overcome this limitation, we study the problem of interactive view synthesis with 6-DoF view control, taking only a single image as input. We propose a method that can produce a continuous stream of novel views under fine-grained (e.g., 1° step-size) camera control (see Fig. 1).

Producing a continuous stream of novel views in real-time is a challenging task. To be able to synthesize high-quality images, one needs to reason about the underlying geometry. However, with only a monocular image as input, the task of 3D reconstruction is severely ill-posed. Traditional image-based rendering techniques do not apply to the real-time monocular setting since they rely on multiple input views and can also be computationally expensive.

Recent work has demonstrated the potential of learning to predict novel views from monocular inputs by leveraging a training set of viewpoint pairs [52, 62, 40, 8].

∗Equal contribution. †https://ait.ethz.ch/projects/2019/cont-view-synth/

This is achieved either by directly synthesizing the pixels in the target view [52, 8] or by predicting flow to warp the input pixels to the output [62, 51]. However, we experimentally show that such approaches are prone to over-fitting to the training views and do not generalize well to free-form, non-training viewpoints. If the camera is moved continuously in small increments, the image quality of such methods quickly degrades. One possible solution is to incorporate much denser training pairs, but this is not practical for many real applications. Explicit integration of geometry representations such as meshes [26, 34] or voxel grids [58, 6, 16, 54] could be leveraged for view synthesis. However, such representations would limit applicability to settings where the camera orbits a single object.

In this paper, we propose a novel learning pipeline that determines the output pixels directly from the source colors but forces the network to implicitly reason about the underlying geometry. This is achieved by injecting geometric transformations, including perspective projection, 3D rotations and translations, into an end-to-end trainable network. The latent 3D geometry representation is compact and memory efficient, is meaningful under explicit 3D transformation, and can be used to produce geometrically accurate views for both single objects and natural scenes.

More specifically, we propose a geometry-aware neural architecture consisting of a 3D transforming auto-encoder (TAE) network [21] and subsequent depth-guided appearance warping. In contrast to existing work that directly concatenates viewpoint parameters with latent codes, we first encode the image into a latent representation which is explicitly rotated and translated in Euclidean space. We then decode the transformed latent code, which is assumed to implicitly represent the 3D geometry, into a depth map in the target view. From the depth map we compute dense correspondences between pixels in the source and target views via perspective projection, and subsequently the final output image via pixel warping. All operations involved are differentiable, allowing for end-to-end training.

Detailed experiments are performed on synthetic objects [3] and natural images [15]. We assess the image quality, the granularity and precision of continuous viewpoint control, and the implicit recovery of scene geometry qualitatively and quantitatively. Our experiments demonstrate that both components, the TAE and the depth-guided warping, drastically improve the robustness and accuracy of continuous view synthesis.

In conclusion, our main contributions are:
• We propose the task of continuous view synthesis from monocular inputs under fine-grained view control.
• This goal is achieved via a proposed novel architecture that integrates a transforming encoder-decoder network and depth-guided image mapping.
• Thorough experiments are conducted, demonstrating the efficacy of our method compared to prior art.

2. Related Work

View synthesis with multi-view images. The task of synthesizing new views given a sequence of images as input has been studied intensely in both the vision and graphics communities. Strategies can be classified into those that explicitly compute a 3D representation of the scene [42, 28, 41, 7, 47, 46, 65, 4, 30] and those in which the 3D geometry is handled implicitly [12, 35, 36]. Others have deployed full 4D light fields [18, 31], albeit at the cost of complex hardware setups and increased computational cost. Recently, deep learning techniques have been applied in similar settings to fill holes and eliminate artifacts caused by the sampling gap, dis-occlusions and inaccurate 3D reconstructions [14, 19, 61, 55, 49, 13, 37]. While improving results over traditional methods, such approaches rely on multi-view input and are hence limited to the same setting.

View synthesis with monocular input. Recent work leverages deep neural networks to learn a monocular image-to-image mapping between source and target view from data [29, 52, 8, 62, 40, 51, 59]. One line of work [29, 52, 8, 39] directly generates image pixels. Given the difficulty of the task, direct image-to-image translation approaches struggle with the preservation of local details and often produce blurry images. Zhou et al. [62] estimate flow maps in order to warp source view pixels to their location in the output. Others further refine the results by image completion [40] or by fusing multiple views [51].

Typically, the desired view is controlled by concatenating latent codes with a flattened viewpoint transform. However, the exact mapping between viewpoint parameters and images is difficult to learn due to sparse training pairs from the continuous viewpoint space. We show experimentally that this leads to snapping to training views, with image quality quickly degrading under continuous view control. Recent works demonstrate the potential for fine-grained view synthesis, but are either limited to single instances of objects [48] or require additional supervision in the form of depth maps [63, 33], surface normals [33] or even light field images [50], which are cumbersome to acquire in real settings. In contrast, our method consists of a fully differentiable network which is trained with image pairs and associated transformations as sole supervision.

3D from single image. Reasoning about 3D shape can serve as an implicit step of free-form view synthesis. Given the severely under-constrained case of recovering 3D shapes from a single image, recent works have deployed neural networks for this task. They can be categorized by their output representation into mesh [26, 34], point cloud [11, 32, 23], voxel [58, 6, 16, 54, 44] or depth map based [9, 60, 53]. Mesh-based approaches are still not accurate enough due to the indirect learning process. Point clouds are often sparse and cannot be directly leveraged to project dense color information into the output image, and voxel-based methods are limited in resolution and in the number and type of objects due to memory constraints. Depth maps become sparse and incomplete when projected into other views due to the sampling gap and occlusions. Layered depth map representations [53] have been used to alleviate this problem; however, a large number of layers would be necessary, which poses significant hurdles in terms of scalability and runtime efficiency. In contrast to explicit models, our latent 3D geometry representation is compact and memory efficient, is meaningful under explicit 3D transformation, and can be used to render dense images.

Deep generative models. View synthesis can also be seen as an image generation process, which is related to the field of deep generative modelling of images [27, 17]. Recent models [2, 25] are able to generate high-fidelity images with diversity in many aspects including viewpoint, shape and appearance, but offer little to no exact control over the underlying parameters. Disentangling latent factors has been studied in [5, 20] to provide control over image attributes. In particular, recent work [64, 38] demonstrates inspiring results of viewpoint disentanglement by reasoning about the geometry. Although such methods can be used for view synthesis, the generated views lack consistency and, moreover, one cannot control which object to synthesize.

3. Method

Our main contribution is a novel geometry-aware network design, shown in Fig. 2, that consists of four components: a 3D transforming auto-encoder (TAE), self-supervised depth map prediction, depth map projection, and appearance warping. The source view is first encoded into a latent code z_s = E_{θ_e}(I_s). This latent code is encouraged by our learning scheme to be meaningful in 3D metric space. After encoding, we apply the desired transformation between the source and target views to the latent code. The transformed code z_t = T_{s→t}(z_s) is decoded by a neural network to predict a depth map D_t as observed from the target viewpoint. D_t is projected back into the source view based on the known camera intrinsics K and extrinsics T_{s→t}, yielding dense correspondences between the target and source views, encoded as a dense backward flow map C_{t→s}. This flow map is used to warp the source view pixel-by-pixel into the target view. Note that attaining backward flow, and hence predicting depth maps in the target view, is a crucial difference to prior work: forward mapping of pixel values into the target view would incur discretization artifacts when moving between ray and pixel space, visible as banding after re-projection of the (source view) depth map. The whole network is trained end-to-end with a simple per-pixel reconstruction loss as sole guidance. Overall, we want to learn a mapping M : X → Y, which in our case can be decomposed as:

M(I_s) = B(P_{t→s}(D_{θ_d}(T_{s→t}(E_{θ_e}(I_s)))), I_s) = Î_t,   (1)

where B is the bilinear sampling function, P_{t→s} is the perspective projection, and E_{θ_e}, D_{θ_d} are the encoder and decoder networks respectively. This decomposition is an important contribution of our work. By asking the network to predict a depth map D_t in the target view, we implicitly encourage the TAE encoder E_{θ_e} to produce position predictions for features, and the decoder D_{θ_d} learns to generate features at corresponding positions by rendering the transformed representation from the specified view angle.

3.1. Transforming Auto-encoder

We take inspiration from recent work [45, 22, 57, 43], which itself builds upon earlier work by Hinton et al. [21], that uses encoder-decoder architectures to learn representations that are transformation equivariant, establishing a direct correspondence between image and feature spaces. We leverage such a latent space to model the relationship between viewpoint and implicit 3D shape.

To this end, we represent the latent code z_s as a vectorized set of points z_s ∈ R^{n×3}, where n is a hyper-parameter. This representation is then multiplied with the ground-truth transformation T_{s→t} = [R|t]_{s→t} describing the viewpoint change between source view I_s and target view I_t to attain the rotated code z_t:

z_t = [R|t]_{s→t} · z̃_s,   (2)

where z̃_s is the homogeneous representation of z_s. In this way the network is trained to encode position predictions for features, which can then be decoded into images. All functions in the TAE module, including encoding, vector reshaping, matrix multiplication and decoding, are differentiable and hence amenable to training via backpropagation.

3.2. Depth Guided Appearance Mapping

We decode z_t into a 3D shape in the target view, represented as a depth image D_t. From D_t we compute the dense correspondence field C_{t→s} deterministically via perspective projection P_{t→s}. The dense correspondences are then used to warp the pixels of the texture (source view) I_s into the target view Î_t. This allows the network to warp the source view into the target view and makes the prediction of the target view invariant to the texture of the input, resulting in sharp and detail-preserving outputs.

Establishing correspondences. The per-pixel correspondences C_{t→s} are attained from the depth image D_t in the target view by conversion from the depth map to 3D coordinates [X, Y, Z] and perspective projection:

[X, Y, Z]^T = D_t(x_t, y_t) K^{-1} [x_t, y_t, 1]^T   (3)
[x_s, y_s, 1]^T ∼ K T_{t→s} [X, Y, Z, 1]^T,   (4)

where each target pixel (x_t, y_t) is assigned its corresponding pixel position (x_s, y_s) in the source view. Furthermore, K is the camera intrinsic matrix describing the normalized focal lengths f_x, f_y along both axes and the image center c_x, c_y. Note that only the focal length ratio f_x/f_y as well as the image center affect view synthesis, while the absolute scale of the focal length is only important to predict geometry at the correct scale.
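To make the geometric core concrete, the following PyTorch sketch illustrates Eq. (2) and Eqs. (3)-(4) under assumed tensor shapes; the function names (transform_latent, depth_to_correspondences) are our own illustration and are not taken from the paper or its released code.

```python
import torch

def transform_latent(z_s, R, t):
    """Eq. (2): rigidly transform the latent point set z_s (B, n, 3)
    with rotation R (B, 3, 3) and translation t (B, 3),
    i.e. the homogeneous product [R|t] * z~_s applied point-wise."""
    return torch.einsum('bij,bnj->bni', R, z_s) + t[:, None, :]

def depth_to_correspondences(D_t, K, T_t2s):
    """Eqs. (3)-(4): back-project the target-view depth D_t (B, 1, H, W)
    into 3D and project into the source view.
    K: (B, 3, 3) intrinsics, T_t2s: (B, 3, 4) extrinsics [R|t], target -> source.
    Returns per-target-pixel source coordinates (x_s, y_s) of shape (B, H, W, 2)."""
    B, _, H, W = D_t.shape
    dtype, device = D_t.dtype, D_t.device
    # pixel grid [x_t, y_t, 1]^T
    y, x = torch.meshgrid(torch.arange(H, dtype=dtype, device=device),
                          torch.arange(W, dtype=dtype, device=device),
                          indexing='ij')
    pix = torch.stack([x, y, torch.ones_like(x)], dim=0).view(1, 3, -1)   # (1, 3, H*W)
    # Eq. (3): [X, Y, Z]^T = D_t(x_t, y_t) K^{-1} [x_t, y_t, 1]^T
    rays = torch.inverse(K) @ pix                                          # (B, 3, H*W)
    P = rays * D_t.view(B, 1, -1)                                          # 3D points, target frame
    # Eq. (4): [x_s, y_s, 1]^T ~ K T_{t->s} [X, Y, Z, 1]^T
    P_h = torch.cat([P, torch.ones(B, 1, H * W, dtype=dtype, device=device)], dim=1)
    p_s = K @ (T_t2s @ P_h)                                                # (B, 3, H*W)
    p_s = p_s[:, :2] / p_s[:, 2:3].clamp(min=1e-6)                         # dehomogenize
    return p_s.permute(0, 2, 1).view(B, H, W, 2)
```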


Figure 2: Pipeline overview. 2D source views are encoded and the latent code is explicitly rotated before a decoder network predicts the depth map in the target view. Dense correspondences are attained via perspective projection and used to warp pixels from the source view to the target view with bilinear sampling. All operations are differentiable and trained end-to-end without ground-truth depth or flow maps. The only supervision is an L1 reconstruction loss between the target view and the ground-truth image.

Warping with correspondences. With the dense correspondences obtained, we are now able to warp the source view to the target view. This operation propagates texture and local details. Since the corresponding pixel positions derived from Eq. (4) are non-integer, this is done via differentiable bilinear sampling as proposed in [24]:

Î_t(x_t, y_t) = Σ_{x_s} Σ_{y_s} I_s(x_s, y_s) · max(0, 1 − |x_s − C_x(x_t, y_t)|) · max(0, 1 − |y_s − C_y(x_t, y_t)|).   (5)

The use of the backward flow C_{t→s}, computed from the predicted depth map D_t, makes the approach amenable to gradient-based optimization, since the gradient of the per-pixel reconstruction loss provides meaningful information to correct erroneous correspondences. The gradients also flow back to provide useful information to the TAE network, owing to the fact that the correspondences are computed deterministically from the predicted depth maps. While bearing similarity to [62], we introduce the intermediate step of predicting depth instead of predicting the correspondences directly. This forces the network to obey geometric constraints, resolving ambiguous correspondences.

3.3. Training

All steps in our network, namely the 3D transforming auto-encoder (TAE), self-supervised depth map prediction, depth map projection and appearance warping, are differentiable, which enables end-to-end training. Among all modules, only the TAE module contains trainable parameters (θ_e, θ_d). To train the network, only pairs of source and target views and their transformation are required. The network weights are optimized via minimization of the L1 loss between the predicted target view Î_t and the ground truth I_t:

L_recon = ‖I_t − Î_t‖_1.   (6)

Minimizing this reconstruction loss, the network learns to produce realistic novel views, to predict the necessary flow and depth maps, and to form a geometric latent space.
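Eq. (5) is exactly the bilinear kernel implemented by torch.nn.functional.grid_sample once the correspondences are normalized to [-1, 1], and Eq. (6) is a per-pixel L1 penalty. A minimal sketch, again with our own (hypothetical) function names:

```python
import torch
import torch.nn.functional as F

def warp_source_to_target(I_s, corr, H, W):
    """Eq. (5): differentiable bilinear warp of the source view I_s (B, 3, H, W)
    using the per-pixel source coordinates corr (B, H, W, 2) from Eqs. (3)-(4)."""
    # grid_sample expects coordinates normalized to [-1, 1]
    grid_x = 2.0 * corr[..., 0] / (W - 1) - 1.0
    grid_y = 2.0 * corr[..., 1] / (H - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)          # (B, H, W, 2)
    # bilinear kernel: max(0, 1 - |x_s - C_x|) * max(0, 1 - |y_s - C_y|)
    return F.grid_sample(I_s, grid, mode='bilinear',
                         padding_mode='zeros', align_corners=True)

def reconstruction_loss(I_t_hat, I_t):
    """Eq. (6): per-pixel L1 loss between warped prediction and target view,
    averaged over pixels."""
    return (I_t - I_t_hat).abs().mean()
```

Note that align_corners=True matches the (W−1), (H−1) normalization used above.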
4. Experiments

We now evaluate our method quantitatively and qualitatively. We are especially interested in assessing the image quality and the granularity and precision of fine-grained viewpoint control. First, we conduct detailed experiments on synthetic objects, where ground truth for continuous viewpoints is easy to obtain, to numerically assess the reconstruction quality. Notably, we vary the viewpoints in much smaller step sizes than what is observed in the training data. Second, to evaluate generalizability, we test our system on natural city scenes. In this setting, given an input image, we specify desired ground-truth camera trajectories along which the system generates novel views. We then run an existing visual odometry system on these synthesized continuous views to recover the camera trajectory. By comparing the recovered trajectory with the ground truth, we can evaluate the geometric properties of the synthesized images under consideration of granularity and continuous view control. Finally, to better understand the mechanism of our proposed network, we further conduct studies on its two key components, namely the depth-guided texture mapping and the transforming auto-encoder. We evaluate the intermediate depth and flow, and qualitatively verify the meaningfulness of the latent space of the TAE.

4.1. Datasets

We conduct our experiments on two challenging datasets: synthetic objects [3] and real natural scenes [15].

Figure 3: Qualitative results for granularity and precision of viewpoint control on ShapeNet. In the top two rows, we generate and overlay 80 continuous views with a step size of 1° from a single input. Our method exhibits a spin pattern similar to the ground truth, whereas other methods mostly converge to the fixed training views (see the wheels of the car and chair indicated by the box). In the bottom row, a close look at specific views is given, which reveals that previous methods display distortions or converge to neighboring training views (Zhou et al. [62], Sun et al. [51]). The image generated by Tatarchenko et al. [52] is blurry. Corresponding error maps are also depicted. Best viewed in color.

ShapeNet [3] is a large collection of 3D synthetic objects from various categories. Similar to [62, 40, 51], we choose the car and chair categories to evaluate our method. We use the same train/test split as proposed in [62]. For training, we render each model from 54 viewpoints with different azimuth and elevation: the azimuth goes from 0° to 360° with a step size of 20° and the elevation from 0° to 30° with a step size of 10°. Each training pair consists of two views of the same instance, with a difference in azimuth within ±40°.

KITTI [15] is a standard dataset for autonomous driving, containing complex city scenes in uncontrolled environments. We conduct experiments on the KITTI odometry subset, which contains image sequences as well as the global camera poses of each frame. In total there are 18560 images for training and 4641 images for testing. We construct training pairs by randomly selecting the target view among the 10 nearest frames of the source view. The relative transformation is obtained from the global camera poses.

4.2. Metrics

In our evaluations we report the following metrics:

Mean Absolute Error (L1) measures per-pixel value differences between the ground truth and the predictions.

Structural SIMilarity (SSIM) Index [56] takes values in [-1, 1] and measures the structural similarity between the synthesized image and the ground truth. We report SSIM in addition to the L1 loss since it i) gives an indication of perceptual image quality and ii) serves as a further metric that is not directly optimized during training.

Percentage of correctness under threshold δ (Acc). The predicted flow/depth ŷ_i at pixel i, given ground truth y_i, is regarded as correct if max(y_i/ŷ_i, ŷ_i/y_i) < δ is satisfied. We count the portion of correctly predicted pixels. Here δ = 1.05.

Rotation error (RE) and translation error (TE) are defined as:

RE = arccos((Tr(R̃ · R^T) − 1) / 2),   TE = arccos((t̃ · t^T) / (‖t̃‖_2 · ‖t‖_2)),   (7)

where Tr denotes the trace of the matrix.
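These metrics are straightforward to compute; a NumPy sketch of Acc and of Eq. (7), using our own helper names:

```python
import numpy as np

def threshold_accuracy(y_pred, y_true, delta=1.05):
    """Acc: fraction of pixels where max(y/y_hat, y_hat/y) < delta."""
    ratio = np.maximum(y_true / y_pred, y_pred / y_true)
    return float(np.mean(ratio < delta))

def rotation_error(R_est, R_gt):
    """RE = arccos((Tr(R_est R_gt^T) - 1) / 2): geodesic angle between rotations."""
    cos = (np.trace(R_est @ R_gt.T) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_error(t_est, t_gt):
    """TE: angle between the estimated and ground-truth translation directions."""
    cos = np.dot(t_est, t_gt) / (np.linalg.norm(t_est) * np.linalg.norm(t_gt) + 1e-12)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))
```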

4.3. Comparison with other methods

We compare with several representative state-of-the-art learning-based view synthesis methods. Tatarchenko et al. [52] treat view synthesis as an image-to-image translation task and generate pixels directly; in their framework the viewpoint is directly concatenated with the latent code. Zhou et al. [62] generate flow instead of pixels; the view information is also directly concatenated. Sun et al. [51] combine both pixel generation [52] and image warping [62]. The original implementations of Zhou et al. [62] and Sun et al. [51] do not support continuous viewpoint input for objects. To allow continuous input for comparison, we replace their encoded discrete one-hot viewpoint representation with the cosine and sine values of the view angles. The same encoder and decoder are used for all comparisons.

4.4. ShapeNet Evaluation

To test the granularity and precision of viewpoint control, for each test object, given a source view I_s, the network synthesizes 80 views around the source view with a step size of 1°, which is much denser than the 20° step size used for training (and much denser than previously reported experiments). In total the test set contains 100,000 view pairs of objects.

To study the effectiveness of the transformation-aware latent space, we introduce Ours (w/o TAE), which concatenates the viewpoint analogously to [52, 8, 62, 40] while still keeping the depth-guided texture mapping process. To evaluate the depth-guided texture mapping process, we introduce Ours (w/o depth), which directly predicts flow without the depth guidance but does deploy the TAE.

Viewpoint dependent error. Fig. 4 plots the L1 reconstruction error between [−40°, 40°] for all methods. Note that 0° here means no transformation is applied to the source view. Ours consistently produces lower errors. More importantly, it yields much lower variance between non-training and training views (±40°, ±20° are training views). While previous methods can achieve similar performance to ours at training views, their performance significantly decreases for non-training views. Notably, both of our designs (TAE and depth-based appearance) contribute to the final performance, and the problem of snapping to training views persists with either of the two components discarded (Ours (w/o TAE) and Ours (w/o depth)). Tab. 1 summarizes the average L1 error and SSIM for all generated views between [−40°, 40°]. In line with Fig. 4, our method significantly outperforms previous methods on both car and chair. In addition, both of our ablative methods also perform better than previous methods, demonstrating the effectiveness of both modules.

Figure 4: Comparison of L1 reconstruction error as a function of view rotation on car. Ours outperforms other state-of-the-art baselines over the entire range and yields a smoother loss progression. Note that 0° here means no transformation applied to the source view (±40°, ±20° are training views, indicated by black boxes).

Table 1: Quantitative analysis of fine-grained view control on ShapeNet. Average L1 error and SSIM for all generated views between [−40°, 40°] from the source view.

                             Car               Chair
                             L1      SSIM      L1      SSIM
  Tatarchenko et al. [52]    0.084   0.919     0.110   0.917
  Zhou et al. [62]           0.062   0.924     0.074   0.920
  Sun et al. [51]            0.056   0.926     0.070   0.921
  Ours (w/o depth)           0.052   0.932     0.066   0.926
  Ours (w/o TAE)             0.045   0.943     0.065   0.930
  Ours (full)                0.039   0.949     0.056   0.938

Qualitative results. The qualitative results in Fig. 3 confirm the quantitative findings. To demonstrate the capability of continuous viewpoint control, we generate and overlay 80 views with a step size of 1° from a single input. Compared to previous approaches, our method exhibits a spin pattern similar to the ground truth, whereas other methods mostly snap to the fixed training views (Zhou et al. [62], Sun et al. [51]). This suggests that overfitting occurs, limiting the granularity and precision of view control. A close look at specific views reveals that previous methods display distortions at non-training views, highlighted in red. The image generated by Tatarchenko et al. [52] is blurry.

4.5. KITTI Evaluation

We now evaluate our method in the more realistic setting of the KITTI dataset. Note that the dataset only contains fairly linear forward motion recorded from a car's dash. This setting is a good testbed for the envisioned application scenarios in which one desires to extract 3D information retroactively.

Qualitative results. In Fig. 5 we show qualitative results from novel views synthesized along a straight camera trajectory: Zhou et al. [62] and Sun et al. [51] both have difficulties dealing with viewpoints outside of the training setting and produce distorted images, while ours are sharp and geometrically correct. Ours also reproduces the desired motion more faithfully than [62] and [51], which remain largely stationary.

Complex trajectory recovery. To simulate real use cases, we introduce a new experimental setting. We specify arbitrary desired trajectories, chosen so that the camera moves away from the car's original motion. From this specification we generate sequences of 100 images along the trajectories. Subsequently we run a state-of-the-art visual odometry system [10] to estimate the camera pose based on the synthesized views. If the view synthesis approach is geometrically accurate, the visual odometry system should recover the desired trajectory. Fig. 6 illustrates one such experiment. The estimated trajectory from ours aligns well with the ground truth. In contrast, views from [62] result in a wrong trajectory and [51] mostly produces straight forward motion, possibly due to overfitting to the training trajectories.
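A trajectory experiment of this kind can be scripted as below; synthesize_view is a hypothetical wrapper around the trained model's forward pass (not an API defined in the paper), and we assume the waypoints are camera-to-world poses expressed in the source camera frame.

```python
import numpy as np

def render_trajectory(synthesize_view, I_s, K, waypoints):
    """Generate a continuous sequence of views from a single source image I_s.
    waypoints: list of 4x4 camera-to-world poses along the desired trajectory,
    with waypoints[0] being the source camera (identity in its own frame)."""
    frames = []
    T_world_src = waypoints[0]
    for T_world_cur in waypoints:
        # relative transformation source -> current view, fed to the network
        T_s2t = np.linalg.inv(T_world_cur) @ T_world_src
        frames.append(synthesize_view(I_s, K, T_s2t[:3, :4]))
    return frames
```

The exact direction of the relative transform depends on the extrinsics convention of the implementation; the sketch assumes points are mapped from the source camera frame into the current (target) camera frame.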


Figure 5: Simple camera motion. Setting: Given a source view, we synthesize linear forward motion over 0.6 m. Our method produces sharp and correct images, while [62, 51] produce distorted ones. The motion of Zhou et al. [62] is incorrect, while Sun et al. [51] stays stationary. Ours reflects a plausible straight forward translation.


Figure 6: Complex camera motion. Setting: Given a source view and an input trajectory, a continuous sequence of views is synthesized along the user-defined trajectory (green). Trajectories are estimated via a state-of-the-art visual odometry system [10] and compared to the desired trajectory. The trajectory estimated from ours aligns well with the ground truth, while [62, 51] mostly produce straight forward or wrong motion regardless of the input.

Quantitative results. To evaluate the geometric properties quantitatively, we generate new views with randomly sampled transformations T = [R|t]. We then estimate the relative transformation T̃ = [R̃|t̃] between the input and the synthesized view and compare it to the ground truth T. This is done by first detecting and matching SURF features [1] in both views, and then computing and decomposing the essential matrix. We report the numerical errors in Tab. 2. Our method produces drastically lower errors in rotation and translation, indicating accurate viewpoint control. Note that we had to remove [52] from this comparison since SURF feature detection fails due to the very blurry images.

Table 2: Precision evaluation of viewpoint control by camera pose estimation on KITTI.

                      TE       RE
  Zhou et al. [62]    0.557    0.086
  Sun et al. [51]     0.435    0.080
  Ours                0.108    0.019
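This pose-recovery protocol maps to standard OpenCV calls; the sketch below substitutes SIFT for the SURF detector used in the paper (SURF requires opencv-contrib), and the helper names are ours. The recovered R̃ and t̃ can then be scored with the rotation/translation error helpers sketched earlier.

```python
import cv2
import numpy as np

def estimate_relative_pose(img_src, img_tgt, K):
    """Recover (R, t) between a source image and a synthesized view by
    matching local features and decomposing the essential matrix."""
    det = cv2.SIFT_create()                     # the paper uses SURF [1]
    kp1, des1 = det.detectAndCompute(img_src, None)
    kp2, des2 = det.detectAndCompute(img_tgt, None)
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    E, _ = cv2.findEssentialMat(pts1, pts2, K, cv2.RANSAC, 0.999, 1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
    return R, t.ravel()   # t is recovered up to scale; only its direction enters TE
```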

4.6. Depth and Flow Evaluation

The quality of the predicted depth map and the warping flow is essential to produce geometrically correct views. We evaluate the accuracy of the depth and flow prediction with two metrics (L1 and Acc). Tab. 3 summarizes the results for ShapeNet. Ours achieves the best accuracy in both flow and depth prediction, which directly benefits view synthesis (cf. Tab. 1). The relative ranking of the ablative baselines furthermore indicates that both the TAE and the depth-guided texture mapping help to improve the flow accuracy. The TAE furthermore guides the depth prediction. To illustrate that the reconstructed depth maps are indeed meaningful, we predict depth in different target views and visualize the extracted normal maps, as shown in Fig. 7.

Discussion. Together these experiments indicate that the proposed self-supervision indeed forces the network to infer the underlying 3D structure (yielding good depth, which is necessary for accurate flow maps) and that it helps the final task without requiring additional labels.

Table 3: Quantitative analysis of flow and depth prediction on car. Average L1 error and accuracy for all predicted flow and depth in target views between [−40°, 40°] from the source view. Ours significantly outperforms the baselines.

                      Flow               Depth
                      L1      Acc        L1      Acc
  Zhou et al. [62]    0.035   69.1%      -       -
  Ours (w/o depth)    0.029   76.3%      -       -
  Ours (w/o TAE)      0.022   84.6%      0.134   89.0%
  Ours (full)         0.021   85.7%      0.132   91.1%

Figure 7: Unsupervised depth prediction. The depth map is predicted from the source view and visualized as point clouds depicted from different viewing angles.

4.7. Latent Space Analysis

To verify that the learned latent space is indeed interpretable and meaningful under geometric transformation, we i) linearly interpolate between the latent points of two objects and ii) rotate each interpolated latent point set. These point sets are then decoded into depth maps, visualized as normal maps in the global frame. Fig. 8 shows that interpolated samples exhibit a smooth shape transition while the viewpoint remains constant (i). Moreover, rotating the latent points only changes the viewpoint without affecting the shape (ii).

Figure 8: Latent space analysis showing consistency of embeddings. Left-to-right: latent space interpolation between different objects. Top-to-bottom: rotation of the same latent code. (Normals in the global frame, extracted from depth.)

4.8. Generalization to unseen data

We find that our model generalizes well to unseen data thanks to the use of depth-based warping. Interestingly, our model trained on 256×256 images can be directly applied to high-resolution (1024×1024) images without additional training. Inference takes 50 ms per frame on a Titan X GPU, allowing for real-time rendering of synthesized views. This enables many appealing application scenarios. For example, our model, trained on ShapeNet only, can be used in an app where downloaded 2D images are brought to life and a user may browse the depicted object in 3D. With a model trained on KITTI, a user may explore a 3D scene from a single image, via generation of free-viewpoint videos or AR/VR content (see Fig. 1).

5. Conclusion

We have presented a novel learning pipeline for continuous view synthesis. At its core lies a depth-based image prediction network that is forced to satisfy explicitly formulated geometric constraints. The latent representation is meaningful under explicit 3D transformation and can be used to produce geometrically accurate views for both single objects and natural scenes. We have conducted thorough experiments on synthetic and natural images and have demonstrated the efficacy of our approach.

Acknowledgement. We thank Nvidia for the donation of GPUs used in this work. We would like to express our gratitude to Olivier Saurer, Velko Vechev, Manuel Kaufmann, Adrian Spurr, Yinhao Huang, Xucong Zhang and David Lindlbauer for the insightful discussions, and to James Bern and Seonwook Park for the video voice-over.
References

[1] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. In Proc. of the European Conf. on Computer Vision (ECCV), 2006.
[2] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In Proc. of the International Conf. on Learning Representations (ICLR), 2018.
[3] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
[4] Gaurav Chaurasia, Sylvain Duchene, Olga Sorkine-Hornung, and George Drettakis. Depth synthesis and local warps for plausible image-based navigation. ACM Transactions on Graphics (TOG), 32(3):30, 2013.
[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2016.
[6] Christopher B. Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In Proc. of the European Conf. on Computer Vision (ECCV), 2016.
[7] Paul E. Debevec, Camillo J. Taylor, and Jitendra Malik. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. In Conference on Computer Graphics and Interactive Techniques, pages 11–20. ACM, 1996.
[8] Alexey Dosovitskiy, Jost Tobias Springenberg, Maxim Tatarchenko, and Thomas Brox. Learning to generate chairs, tables and cars with convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):692–705, 2016.
[9] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems (NIPS), 2014.
[10] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(3):611–625, 2017.
[11] Haoqiang Fan, Hao Su, and Leonidas J. Guibas. A point set generation network for 3D object reconstruction from a single image. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[12] Andrew Fitzgibbon, Yonatan Wexler, and Andrew Zisserman. Image-based rendering using image-based priors. International Journal of Computer Vision, 63(2):141–151, 2005.
[13] John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck, Noah Snavely, and Richard Tucker. DeepView: View synthesis with learned gradient descent. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
[14] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. DeepStereo: Learning to predict new views from the world's imagery. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.
[15] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2012.
[16] Rohit Girdhar, David F. Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In Proc. of the European Conf. on Computer Vision (ECCV), 2016.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.
[18] Steven J. Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F. Cohen. The lumigraph. In Conference on Computer Graphics and Interactive Techniques, volume 96, pages 43–54, 1996.
[19] Peter Hedman, Julien Philip, True Price, Jan-Michael Frahm, George Drettakis, and Gabriel Brostow. Deep blending for free-viewpoint image-based rendering. In SIGGRAPH Asia 2018 Technical Papers, page 257. ACM, 2018.
[20] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In Proc. of the International Conf. on Learning Representations (ICLR), 2017.
[21] Geoffrey E. Hinton, Alex Krizhevsky, and Sida D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, 2011.
[22] Geoffrey E. Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with EM routing. In Proc. of the International Conf. on Learning Representations (ICLR), 2018.
[23] Eldar Insafutdinov and Alexey Dosovitskiy. Unsupervised learning of shape and pose with differentiable point clouds. In Advances in Neural Information Processing Systems (NIPS), 2018.
[24] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[25] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
[26] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3D mesh renderer. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.
[27] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In Proc. of the International Conf. on Learning Representations (ICLR), 2013.
[28] Johannes Kopf, Fabian Langguth, Daniel Scharstein, Richard Szeliski, and Michael Goesele. Image-based rendering in the gradient domain. ACM Transactions on Graphics (TOG), 32(6):199, 2013.
[29] Tejas D. Kulkarni, William F. Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems (NIPS), 2015.
[30] Lubor Ladicky, Jianbo Shi, and Marc Pollefeys. Pulling things out of perspective. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
[31] Marc Levoy and Pat Hanrahan. Light field rendering. In Conference on Computer Graphics and Interactive Techniques, pages 31–42. ACM, 1996.
[32] Chen-Hsuan Lin, Chen Kong, and Simon Lucey. Learning efficient point cloud generation for dense 3D object reconstruction. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[33] Miaomiao Liu, Xuming He, and Mathieu Salzmann. Geometry-aware deep network for single-image novel view synthesis. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.
[34] Shichen Liu, Weikai Chen, Tianye Li, and Hao Li. Soft rasterizer: Differentiable rendering for unsupervised single-view mesh reconstruction. arXiv preprint arXiv:1901.05567, 2019.
[35] Wojciech Matusik, Hanspeter Pfister, Addy Ngan, Paul Beardsley, Remo Ziegler, and Leonard McMillan. Image-based 3D photography using opacity hulls. In ACM Transactions on Graphics (TOG), volume 21, pages 427–437. ACM, 2002.
[36] Leonard McMillan and Gary Bishop. Plenoptic modeling: An image-based rendering system. In Conference on Computer Graphics and Interactive Techniques, pages 39–46, 1995.
[37] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 2019.
[38] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. HoloGAN: Unsupervised learning of 3D representations from natural images. arXiv preprint arXiv:1904.01326, 2019.
[39] Kyle Olszewski, Sergey Tulyakov, Oliver Woodford, Hao Li, and Linjie Luo. Transformable bottleneck networks. arXiv preprint arXiv:1904.06458, 2019.
[40] Eunbyung Park, Jimei Yang, Ersin Yumer, Duygu Ceylan, and Alexander C. Berg. Transformation-grounded image generation network for novel 3D view synthesis. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[41] Eric Penner and Li Zhang. Soft 3D reconstruction for view synthesis. ACM Transactions on Graphics (TOG), 36(6):235, 2017.
[42] Konstantinos Rematas, Chuong H. Nguyen, Tobias Ritschel, Mario Fritz, and Tinne Tuytelaars. Novel views of objects from a single image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(8):1576–1590, 2016.
[43] Helge Rhodin, Mathieu Salzmann, and Pascal Fua. Unsupervised geometry-aware representation for 3D human pose estimation. In Proc. of the European Conf. on Computer Vision (ECCV), 2018.
[44] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. OctNet: Learning deep 3D representations at high resolutions. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[45] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems (NIPS), 2017.
[46] Steven M. Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 519–528. IEEE, 2006.
[47] Jonathan Shade, Steven Gortler, Li-wei He, and Richard Szeliski. Layered depth images. In Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '98, pages 231–242, New York, NY, USA, 1998. ACM.
[48] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhofer. DeepVoxels: Learning persistent 3D feature embeddings. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
[49] Pratul P. Srinivasan, Richard Tucker, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng, and Noah Snavely. Pushing the boundaries of view extrapolation with multiplane images. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
[50] Pratul P. Srinivasan, Tongzhou Wang, Ashwin Sreelal, Ravi Ramamoorthi, and Ren Ng. Learning to synthesize a 4D RGBD light field from a single image. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2017.
[51] Shao-Hua Sun, Minyoung Huh, Yuan-Hong Liao, Ning Zhang, and Joseph J. Lim. Multi-view to novel view: Synthesizing novel views with self-learned confidence. In Proc. of the European Conf. on Computer Vision (ECCV), 2018.
[52] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Multi-view 3D models from single images with a convolutional network. In Proc. of the European Conf. on Computer Vision (ECCV). Springer, 2016.
[53] Shubham Tulsiani, Richard Tucker, and Noah Snavely. Layer-structured 3D scene inference via view synthesis. In Proc. of the European Conf. on Computer Vision (ECCV), 2018.
[54] Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[55] Yunlong Wang, Fei Liu, Zilei Wang, Guangqi Hou, Zhenan Sun, and Tieniu Tan. End-to-end view synthesis for light field imaging with pseudo 4DCNN. In Proc. of the European Conf. on Computer Vision (ECCV), pages 333–348, 2018.
[56] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, Eero P. Simoncelli, et al. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[57] Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, and Gabriel J. Brostow. Interpretable transformations with encoder-decoder networks. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2017.
[58] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
[59] Jimei Yang, Scott E. Reed, Ming-Hsuan Yang, and Honglak Lee. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In Advances in Neural Information Processing Systems (NIPS), 2015.
[60] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[61] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. In Conference on Computer Graphics and Interactive Techniques, 2018.
[62] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A. Efros. View synthesis by appearance flow. In Proc. of the European Conf. on Computer Vision (ECCV), 2016.
[63] Hao Zhu, Hao Su, Peng Wang, Xun Cao, and Ruigang Yang. View extrapolation of human body from a single image. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.
[64] Jun-Yan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun Wu, Antonio Torralba, Josh Tenenbaum, and Bill Freeman. Visual object networks: Image generation with disentangled 3D representations. In Advances in Neural Information Processing Systems (NIPS), 2018.
[65] C. Lawrence Zitnick, Sing Bing Kang, Matthew Uyttendaele, Simon Winder, and Richard Szeliski. High-quality video view interpolation using a layered representation. In ACM Transactions on Graphics (TOG), volume 23, pages 600–608. ACM, 2004.