Foveated Neural Radiance Fields for Real-Time and Egocentric Virtual Reality

NIANCHEN DENG, Shanghai Jiao Tong University
ZHENYI HE, New York University
JIANNAN YE, Shanghai Jiao Tong University
PRANEETH CHAKRAVARTHULA, University of North Carolina at Chapel Hill
XUBO YANG, Shanghai Jiao Tong University
QI SUN, New York University

[Figure 1 panels: (a) scene representation; (b) overall quality — Our Method: 0.5MB storage, 20ms/frame; Foveated Rendering: 100MB storage; Existing Synthesis: 9s/frame; (c) foveal quality.]

Fig. 1. Illustration of our gaze-contingent neural synthesis method. (a) Visualization of our active-viewing-tailored egocentric neural scene representation. (b) Our method allows for extremely low data storage of high-quality 3D scene assets and achieves high perceptual image quality and low latency for interactive virtual reality. Here, we compare our method with the state-of-the-art alternative neural synthesis [Mildenhall et al. 2020] and foveated rendering solutions [Patney et al. 2016; Perry and Geisler 2002] and demonstrate that our method reduces time consumption (from 9s to 20ms) and data storage (from a 100MB mesh to a 0.5MB neural model) by more than 99% for first-person immersive viewing. The orange circle indicates the viewer's gaze position. We also show zoomed-in views of the foveal images generated by the various methods in (c) and show that our method is superior to existing methods in achieving high perceptual image fidelity.

Traditional high-quality 3D graphics requires large volumes of fine-detailed scene data for rendering. This demand compromises computational efficiency and local storage resources. Specifically, it becomes more concerning for future wearable and portable virtual and augmented reality (VR/AR) displays. Recent approaches to combat this problem include remote rendering/streaming and neural representations of 3D assets. These approaches have redefined the traditional local storage-rendering pipeline by distributed computing or compression of large data. However, these methods typically suffer from high latency or low quality for practical visualization of large immersive virtual scenes, notably with the extra high resolution and refresh rate requirements of VR applications such as gaming and design. Tailored for future portable, low-storage, and energy-efficient VR platforms, we present the first gaze-contingent 3D neural representation and view synthesis method. We incorporate the human psychophysics of visual- and stereo-acuity into an egocentric neural representation of 3D scenery. Furthermore, we jointly optimize latency/performance and visual quality, while mutually bridging human perception and neural scene synthesis, to achieve perceptually high-quality immersive interaction. Both objective analysis and subjective study demonstrate the effectiveness of our approach in significantly reducing local storage volume and synthesis latency (up to 99% reduction in both data size and computational time), while simultaneously presenting high-fidelity rendering, with perceptual quality identical to that of fully locally stored and rendered high-quality imagery.

Authors' addresses: Nianchen Deng, Shanghai Jiao Tong University; Zhenyi He, New York University; Jiannan Ye, Shanghai Jiao Tong University; Praneeth Chakravarthula, University of North Carolina at Chapel Hill; Xubo Yang, Shanghai Jiao Tong University; Qi Sun, New York University.

1 INTRODUCTION

Efficient storage is a core challenge faced by today's computational systems, especially due to the fast-growing high-resolution acquisition/display devices, the demand for 3D data quality, and machine learning. Thus, streaming services and remote cloud-based rendering (such as Google Stadia and Nvidia Shield) have rapidly emerged in recent years. These storage-friendly systems have shown an unprecedented potential to reshape the traditional computer graphics pipeline by completely relieving the end user's computational workload and storage pressure, thus advancing device portability. Future spatial and immersive computing platforms such as VR and AR are potentially major beneficiaries due to their extensive demands of portability and energy-efficiency.¹

However, existing low-storage services have only been successfully deployed in relatively latency-insensitive platforms such as TV-based gaming consoles due to the inevitable latency from transmission (more than 100ms²). In comparison, VR and AR require remarkably low latency and significantly higher fidelity/speed for a sickness-free and natural experience. As a consequence, there is a demand for massive high-quality spatial-temporal data. Unfortunately, no solution exists today to deliver high-quality and dynamic/interactive immersive content at low latency and with low local storage.

Recent years have witnessed a surge in neural representations as an alternative to compact 3D representations such as point clouds and polygonal meshes, enabling compact assets. Applications of neural representations include voxelization [Sitzmann et al. 2019a], distance fields [Mildenhall et al. 2020; Park et al. 2019; Sitzmann et al. 2020a], and many more. However, existing solutions suffer from either high time consumption or low image fidelity with large scenes.

Acceleration without compromising perceptual quality is also an ultimate goal of real-time rendering. Gaze-contingent rendering approaches have shown remarkable effectiveness in presenting imagery of perceptual quality identical to full resolution rendering [Guenter et al. 2012; Patney et al. 2016; Tursun et al. 2019]. This is achieved by harnessing high field-of-view (FoV) head-mounted displays and high precision eye-tracking technologies. However, existing foveated rendering approaches still require locally stored full 3D assets, resulting in long transmission time and storage volume, whose availability is unfortunately limited in wearable devices.

To overcome these open challenges, we present the first gaze-contingent and interactive neural synthesis for high-quality 3D scenes. Aiming at the next-generation wearable, low-latency, and low-storage VR/AR platforms, our compact neural model predicts stereo frames given users' real-time head and gaze motions. The predicted frames are perceptually identical to the traditional real-time rendering of high-quality scenes. With wide-FoV and high-resolution eye-tracked VR displays, our method significantly reduces 3D data storage (from 100MB mesh to 0.5MB neural model) compared with traditional rendering pipelines; it also remarkably improves responsiveness (from 9s to 20ms) compared with alternative neural view synthesis methods. That is, we enable instant immersive viewing without long loading/transmission waits while preserving visual fidelity. Tailored to future lightweight, power- and memory-efficient wearable display systems, expected to be active all day long akin to today's everyday-use mobile phones, our method achieves minimal local storage for high-quality 3D scenes.

To achieve immersive viewing with low storage, high visual quality, and low systematic latency, we introduce spatial-angular human psychophysical characteristics to neural representation and synthesis. Specifically, we represent 3D scenes with concentric spherical coordinates. Our representation also allows for depicting the foveated color sensitivity and stereopsis during both the training and inference phases. Furthermore, we derive an analytical spatial-temporal perception model to optimize our neural scene representation towards imperceptible loss in image quality and latency.

We validate our approach by conducting numerical analysis and user evaluation on commercially available AR/VR display devices. The series of experiments reveals our method's effectiveness in delivering an interactive VR viewing experience, perceptually identical to an arbitrarily high quality rendered 3D scene, with significantly lowered data storage. We will open source our implementation and dataset. We make the following major contributions:

• A low-latency, interactive, and significantly low-storage immersive application. It also offers full perceptual quality with instant high resolution and high FoV first-person VR viewing.
• A scene parameterization and vision-matched neural representation/synthesis method considering both visual- and stereo-acuity.
• A spatiotemporally analytical model for jointly optimizing systematic latency and perceptual quality.

¹https://www.vxchnge.com/blog/virtual-reality-data-storage-concern
²https://www.pcgamer.com/heres-how-stadias-input-lag-compares-to-native-pc-gaming/

2 RELATED WORK

2.1 Image-based View Synthesis

Image-based rendering (IBR) has been proposed in computer graphics to complement the traditional 3D assets and ray propagation pipelines [Gortler et al. 1996; Levoy and Hanrahan 1996; Shum and Kang 2000]. Benefiting from the uniform 2D representation, it delivers pixel-wise high quality without compromising performance. However, one major limitation of IBR is the demand for large and dense data acquisition, resulting in limited and sparse view ranges. To address this problem, recent years have witnessed extensive research on synthesizing novel image-based views instead of fully capturing or storing them. Examples include synthesizing light fields [Kalantari et al. 2016; Li and Khademi Kalantari 2020; Mildenhall et al. 2019] and multi-layer volumetric videos [Broxton et al. 2020]. However, despite being more compact, the synthesis is usually locally interpolated rather than globally applicable with highly dynamic VR camera motions.

2.2 Implicit Scene Representation for Neural Rendering

To fully represent a 3D object and environment without large explicit storage, neural representations have drawn extensive attention in the computer graphics and computer vision communities. With a deep neural network that depicts a 3D world as an implicit function, the neural networks may directly approximate an object's appearance given a camera pose [Mildenhall et al. 2020; Park et al. 2017; Sitzmann et al. 2020b, 2019a,b]. Prior arts also investigated implicitly representing shapes [Park et al. 2019] and surfaces [Mescheder et al. 2019] to improve the visual quality of 3D objects. Inspired by [Park et al. 2019], signed distance functions have been deployed for representing raw data such as point clouds [Atzmon and Lipman 2020; Gropp et al. 2020] with better efficiency [Chabra et al. 2020]. The input information ranges from 2D images [Choi et al. 2019; Lin et al. 2020a; Yariv et al. 2020], 3D context [Oechsle et al. 2019; Saito et al. 2019; Zhang et al. 2020], and time-varying vector fields [Niemeyer et al. 2019] to local features [Liu et al. 2020; Tretschk et al. 2020].

Current implicit representations primarily focus on locally "outside-in" viewing of individual objects, with low field-of-view, speed, and resolution. For large scenes, the lack of coverage may cause the quality to drop. However, in virtual reality, the cameras are typically first-person and highly dynamic, requiring low latency, high resolution, and 6DoF coverage.

2.3 Gaze-Contingent Rendering

Both rendering and neural synthesis suffer from heavy computational load, and thus systematic latency and lags. With a full mesh representation, foveated rendering has been proposed to accelerate local rendering performance, including mesh-based [Guenter et al. 2012; Patney et al. 2016] or image-based [Kaplanyan et al. 2019; Sun et al. 2017] methods. The accelerations are typically achieved through high field-of-view displays (such as VR head-mounted displays) and eye-tracking technologies. However, these all require full access to the original 3D assets or per-frame images. Consequently, the systems usually require large local storage, long transmission waits, or constant frame-based transmissions with latency. Our method bridges human vision and deep synthesis. It accelerates neural synthesis beyond local rendering, enabling both computational and storage efficiency. It is also suitable for future cloud-based virtual reality with instant full-quality viewing and interaction.

2.4 Panorama-Based 6DoF Immersive Viewing

Panorama images and videos fit VR applications well thanks to their 360-degree FoV coverage. However, the main challenge is their single projection center in each frame, limiting free-form camera translation and thus the 6DoF immersive viewing experience. Recently, extensive research has been proposed to address this problem. Serrano et al. [2019] presented a depth-based dis-occlusion method that dynamically reprojects to 6DoF cameras, enabling natural viewing with motion parallax. Pozo et al. [2019] jointly optimize the capture and viewing processes. Machine learning approaches have advanced the robustness to various geometric and lighting conditions [Attal et al. 2020a,b; Lin et al. 2020b]. As spherical images, panoramas are designed for capturing physical worlds with flexible reprojections to displays. The 6DoF viewing typically suffers from trade-offs among performance, achievable resolution, and dis-occlusion effects due to the lack of knowledge of the 3D space. With an orthogonal mission, our model represents and reproduces arbitrary 3D virtual worlds for low-storage VR. It lowers memory usage by replacing traditional mesh- or volume-based data rather than predicting image-space occlusions.

3 METHOD

Based on a concentric spherical representation and a trained network predicting RGBDs on the spheres (Section 3.1), our system predominantly comprises two main steps in real time: visual-/stereo-acuity adaptive elemental image synthesis with ray marching for the fovea, mid-, and far-peripheries, followed by image-based rendering to composite displayed frames (Section 3.2). For the desired precision-performance balance, we further craft an analytical spatial-temporal model to optimize the determination of our intra-system variables, including coordinates and neural-network parameters (Section 3.3).

Fig. 2. Coordinate system and variable annotations (top view and side view). We partially visualize our full spherical 3D representation with an example of N = 3 spheres. Variables and equations are annotated at the corresponding positions.

3.1 Egocentric Neural Representation

The recent single-object-oriented "outside-in" view synthesis methods [Mildenhall et al. 2020; Sitzmann et al. 2019a] typically represent the training targets using uniform voxelization. However, immersive VR environments introduce open challenges for such parameterization due to the rapid variation of egocentric (first-person) viewing (e.g., Figure 1a). As a consequence, neural representations of large virtual environments typically suffer from ghosting artifacts, low resolution, or slow speed (e.g., the white cube and red wall are blurred in Figure 1b and Figure 1c).

Egocentric coordinates. To tackle this problem, we are inspired by the recent panorama dis-occlusion methods [Attal et al. 2020b; Broxton et al. 2020; Lin et al. 2020b] to depict the rapidly varying first-person views with concentric spherical coordinates. This representation enables robust rendering at real-time rates and allows for 6DoF interaction to navigate complex immersive environments. As visualized in Figure 2, our representation is parameterized with the number of concentric spheres (N) and their respective radii (r = {r_i}, i ∈ [1, N]). Our network aims at predicting the function C(q̂) ≜ (r, g, b, d) of any position q̂ ≜ (r_i, θ, φ), i ∈ [1, N], on these N spherical surfaces. Here, (r, g, b) and d are the predicted color and density, respectively. We then employ ray marching through this intermediate function to produce images of specified views.

Training and synthesis. For each scene, we approximate the function C by a multilayer perceptron (MLP) network, similar to [Mildenhall et al. 2020]. The network is composed of N_m fully-connected layers of N_c channels, each using ReLU as the activation function. We further determine N_m, N_c, and N via a spatial-temporal optimization in Section 3.3. Given the rays of pixels in an image, the network predicts the colors and densities along these rays on every sphere. The ray marching then integrates the weighted predictions in back-to-front order to obtain the actual colors of the corresponding pixels. Similar to [Mildenhall et al. 2020], we apply a trigonometric input encoding and compute the L2-distance between the volume-rendered and mesh-rendered images as the training loss function. We discuss the specifics of training data creation and implementation in Section 4.
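To make the representation concrete, the sketch below evaluates a single camera ray against the N concentric spheres and composites the per-sphere predictions back to front. It is an illustrative approximation under stated assumptions, not the authors' released implementation: `predict_rgbd` is a hypothetical stand-in for the trained MLP C, and the opacity mapping from density is assumed, since the paper only specifies a weighted back-to-front integration.

```python
# Minimal sketch (assumed, not the authors' code): evaluate one camera ray against
# N concentric spheres centered at the origin and composite the per-sphere (r,g,b,d)
# predictions back to front, as described in Section 3.1.
import numpy as np

def intersect_sphere(x, v, r):
    """Forward intersection of ray (origin x, unit direction v) with a sphere of radius r."""
    b = np.dot(x, v)
    disc = b * b - np.dot(x, x) + r * r
    if disc < 0:
        return None  # ray misses the sphere
    t = np.sqrt(disc) - b  # Eq. (2): p = x + (sqrt((x.v)^2 - |x|^2 + r^2) - x.v) v
    return x + t * v if t > 0 else None

def render_ray(x, v, radii, predict_rgbd):
    """Back-to-front compositing of per-sphere predictions along one pixel ray."""
    color = np.zeros(3)
    for r in sorted(radii, reverse=True):      # outermost sphere first (back to front)
        p = intersect_sphere(x, v, r)
        if p is None:
            continue
        theta = np.arccos(p[2] / r)            # spherical angles of the hit point
        phi = np.arctan2(p[1], p[0])
        rgb, density = predict_rgbd(r, theta, phi)   # hypothetical MLP query C(q_hat)
        alpha = 1.0 - np.exp(-max(density, 0.0))     # assumed density-to-opacity mapping
        color = (1.0 - alpha) * color + alpha * np.asarray(rgb)
    return color
```

In a full renderer, this per-ray loop is what the network evaluates in batch for every pixel of an elemental image before the images are composited into the displayed frame (Section 3.2).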

3.2 Gaze-Aware Synthesis

Our concentric spherical coordinate, as described in Section 3.1, addresses the view variance problem in VR. However, it still suffers from significant rendering latency (several seconds per frame). This is one of the core challenges that make neural representations not yet suitable for VR. Instead of the typical single image prediction, we synthesize multiple elemental images to enable real-time responsiveness. The elemental images are generated based on the viewer's head and gaze motions and are adapted to the retinal acuity in resolution and stereo.

3.2.1 Adaptive visual acuity. Human vision is foveated. Inspired by this, we significantly accelerate the computation by integrating the characteristic of spatially-adaptive visual acuity without compromising perceptual quality.

Specifically, given the device-tracked camera position (x), direction (R), and gaze position (v_g), we train two orthogonal networks (i.e., functions C) to synthesize foveal (I_f(x, R, v_g), 0-20 deg) and peripheral elemental images (i.e., two Cs deployed for ray marching) for each eye. The periphery includes the mid- (I_m(x, R, v_g), 0-45 deg) and high-eccentricity visual fields (I_p(x, R), 0-110 deg); see Figure 3. Note that I_p is independent of the gaze direction v_g. Specifically, to incorporate the display capabilities and perceptual effects, we define I_f/I_m/I_p with resolutions of 128², 256², and 230×256, respectively. That is, I_f has the highest spatial resolution of 6.4 pixels per degree (PPD), higher than those of I_m (5.7 PPD) and I_p (2.33 PPD).

3.2.2 Adaptive stereo-acuity. Head-mounted VR displays require stereo rendering, and thus the elemental images I_f^{l,r}, I_m^{l,r}, and I_p^{l,r} for each (left and right) eye. This, however, doubles the inference computation cost, which is critical for the latency- and frame-rate-sensitive VR experience (please refer to Section 5.3 for breakdown comparisons).

To further accelerate stereo rendering towards real-time experience, we accommodate the inference process with adaptive stereo-acuity in the periphery. In fact, besides visual acuity, psychophysical studies have also revealed humans' significantly declined stereopsis while receding from the gaze point [Mochizuki et al. 2012]. Inspired by this characteristic, we perform the computation with I_f^{l,r}, I_m^c, and I_p^c instead of inferring 6 elemental images, where c indicates the view at the central eye (the midpoint of the left and right eyes). Figure 4 visualizes the stereopsis changes from this adaptation using an anaglyph.
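The resolutions and eccentricity ranges above fix each layer's angular resolution and how many network inferences one frame costs. The following is a small worked check of those numbers (our own illustration; the dictionary layout and layer names are hypothetical, while the resolutions, fields of view, and view counts are taken from the paper).

```python
# Worked check of the elemental-image configuration in Sections 3.2.1-3.2.2.
LAYERS = {
    # name: (pixels across, field of view in deg, views inferred per frame)
    "fovea":         (128, 20, 2),   # left + right eye
    "mid-periphery": (256, 45, 1),   # central eye only (stereo-acuity adaptation)
    "far-periphery": (256, 110, 1),  # central eye only, gaze-independent
}

for name, (pixels, fov_deg, views) in LAYERS.items():
    ppd = pixels / fov_deg  # pixels per degree
    print(f"{name:14s} {ppd:4.2f} PPD, {views} view(s) per frame")
# -> fovea ~6.4 PPD, mid-periphery ~5.7 PPD, far-periphery ~2.33 PPD;
#    4 network evaluations per frame instead of 6 with naive per-eye stereo.
```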


Fig. 4. Visualization of the adaptive stereo-acuity with an anaglyph. (a) shows the rendered image with 6 retinal sub-images (I_f^{l,r}, I_m^{l,r}, and I_p^{l,r}). (b) shows the rendered image with our adaptive and accelerated inference considering foveated stereoacuity (I_f^{l,r}, I_m^c, and I_p^c). Our method preserves full stereopsis in the fovea while reducing the angular resolution in the periphery for accelerated inference.

3.2.3 Real-time frame composition. With the obtained elemental images as input, image-based rendering in the fragment shader is then executed to generate the final frames for each eye (Figure 3d). The output frames are displayed on the stereo VR HMD. Two adjunct layers are blended using a smooth-step function across 40% of the inner layer. This enhances visual consistency on the edges between layers [Guenter et al. 2012]. To accommodate the mono-view I_m^c and I_p^c, they are shifted towards each eye according to the approximated foveal depth range. Lastly, we enhance the contrast following the mechanism of [Patney et al. 2016] to further preserve the peripheral elemental images' visual fidelity despite their low PPD. The details are visualized in Figure 5.

Fig. 3. Visual acuity adaptive synthesis and rendering mechanism. (a)/(b)/(c) show our individual gaze-contingent synthesis for the fovea (within 20 deg), mid-periphery (within 45 deg), and far-periphery (within 110 deg, the capability of our VR display), respectively. (d) illustrates our real-time rendering system that dynamically blends the individual images into the final displayed frame.
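The sketch below illustrates one possible reading of the cross-layer blending step, written in Python rather than shader code for clarity. The interpretation that the smooth-step ramp spans the outer 40% of the inner layer's radius is our assumption; the paper only states that adjunct layers are blended across 40% of the inner layer.

```python
# Illustrative sketch (not the authors' shader) of the cross-layer blending in
# Section 3.2.3: an inner, higher-acuity layer is mixed into the surrounding layer
# with a smooth-step ramp over the outer 40% of the inner layer's angular radius.
def smoothstep(edge0, edge1, x):
    t = min(max((x - edge0) / (edge1 - edge0), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)

def blend_layers(inner_rgb, outer_rgb, ecc_deg, inner_fov_deg):
    """Per-pixel blend weight as a function of eccentricity (deg) from the gaze."""
    inner_radius = inner_fov_deg / 2.0          # e.g., 10 deg for the 20-deg foveal layer
    band_start = 0.6 * inner_radius             # assumed: blend band = outer 40% of the layer
    w = smoothstep(band_start, inner_radius, ecc_deg)  # 0 well inside, 1 at the layer edge
    return tuple((1.0 - w) * i + w * o for i, o in zip(inner_rgb, outer_rgb))
```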


Fig. 5. Our multi-layer real-time rendering system. (a) shows the original 3-layer image output overlaid. (b) shows the result with cross-layer blending that enables smooth appearance on the edges. (c) shows the contrast-enhanced rendering pass in the periphery.

3.3 Latency-Quality Joint Optimization

As a view synthesis system based on a sparse egocentric representation (the N spheres) and neural networks (the N_m, N_c), the method inevitably introduces approximation errors. On the other hand, these variables also determine the online computational time, which involves inferring the function C and ray marching the N spheres. However, VR strictly demands both quality and performance. We present a spatial-temporal model that analytically depicts these correlations and optimizes the variables for an ideal user experience.

Precision loss of a 3D scene. As shown in Figure 2, under the egocentric representation, a 3D point q is re-projected as the nearest point on a sphere along the ray that connects it to the origin:

$$\mathbf{q}'(N, \mathbf{r}, \mathbf{q}) \triangleq r_k \frac{\mathbf{q}}{\lVert\mathbf{q}\rVert}, \qquad k = \operatorname*{argmin}_{j \in [1, N]} \big|\, \lVert\mathbf{q}\rVert - r_j \big|. \tag{1}$$

Similar to volume-based representations, the multi-spherical system is also defined in the discrete domain. The sparsity thus naturally introduces an approximation error that compromises the synthesis quality. To analytically model the precision loss, we investigate the geometric relationship among the camera, the scene, and the representation. As illustrated in Figures 1a and 2, for a sphere (located at the origin) with radius r, its intersection (if it exists) with a directional ray {x, v} is

$$\mathbf{p}(r, \mathbf{x}, \mathbf{v}) = \mathbf{x} + \left( \left( (\mathbf{x} \cdot \mathbf{v})^2 - \lVert\mathbf{x}\rVert^2 + r^2 \right)^{\frac{1}{2}} - \mathbf{x} \cdot \mathbf{v} \right) \mathbf{v}, \tag{2}$$

where x and v are the ray's origin point and normalized direction, respectively. Inversely, given a view point x observing a spatial location q, the ray connecting them has the direction v(q, x) = (q − x)/‖q − x‖. This ray may intersect with more than one sphere. Among them, the closest one to q is

$$\hat{\mathbf{q}}(\mathbf{x}, N, \mathbf{r}, \mathbf{q}) \triangleq \mathbf{p}\big(r_k, \mathbf{x}, \mathbf{v}(\mathbf{q}, \mathbf{x})\big), \qquad k = \operatorname*{argmin}_{j \in [1, N]} \big|\, \lVert\mathbf{q}\rVert - r_j \big|. \tag{3}$$

In the 3D space, the offset distance ‖q' − q̂‖ indicates the precision loss at q from the representation. By integrating over all views and scene points, we obtain

$$E_{scene}(N, \mathbf{r}) = \iint \big\lVert \mathbf{q}'(N, \mathbf{r}, \mathbf{q}) - \hat{\mathbf{q}}(\mathbf{x}, N, \mathbf{r}, \mathbf{q}) \big\rVert \, d\mathbf{q}\, d\mathbf{x}, \tag{4}$$

for all {x, q} pairs without occlusion in between. By integrating over all 3D vertices q and camera positions x in our dataset sampling, E_scene depicts the generic representation precision of a scene, given a coordinate system defined by N and r.

In comparison, a neural representation aims at predicting a projected image given an x and R. Thus, we further extend Equation (4) to image space to analyze the error given a camera's projection matrix M(x, R) as

$$E_{image}(N, \mathbf{r}, \mathbf{x}, \mathbf{R}) = \int \big\lVert \mathbf{M}(\mathbf{x}, \mathbf{R}) \cdot \big( \mathbf{q} - \hat{\mathbf{q}}(\mathbf{x}, N, \mathbf{r}, \mathbf{q}) \big) \big\rVert \, d\mathbf{q}. \tag{5}$$

From Figure 2 and Equation (5), we observe: given a fixed min/max range of r, N is negatively correlated to E_image; with a fixed N, the correlation between the distribution of r_j and the scene content (i.e., the distribution of the qs) also determines E_image.

However, for a neural scene representation, infinitely enlarging N would significantly raise the challenges in training precision and inference performance. Likewise, increasing the representation density (i.e., higher N) and/or network complexity (i.e., higher N_m/N_c) naturally improves the image output quality (lower Equation (5)). However, this significantly increases the computation during ray marching, causing a quality drop stretched along time. In the performance-sensitive VR scenario, the latency breaks the continuous viewing experience and may cause simulator sickness. Thus, with content-aware optimization, we further optimize the system towards an ideal quality-speed balance.

Spatial-temporal modeling. Inspired by [Albert et al. 2017; Li et al. 2020], we perform spatial-temporal joint modeling to determine the optimal coordinate system (N) for positional precision and network complexity (N_m, N_c) for color precision that adapt to individual computational resources and scene content. This is achieved via latency-precision modeling in the spatial-temporal domain:

$$E(N, N_m, N_c) = \sum_{t} \int C_{N_m, N_c}(\mathbf{q}) \times \big\lVert \mathbf{M}(\mathbf{x}_t, \mathbf{R}_t)\cdot\mathbf{q} - \mathbf{M}(\mathbf{x}_{t-l}, \mathbf{R}_{t-l})\cdot\mathbf{p}(r_k, \mathbf{x}_{t-l}, \mathbf{v}_{t-l}) \big\rVert \, d\mathbf{q}, \tag{6}$$

where l ≜ l(N, N_m, N_c) is the system latency with a given coordinate and network setting, and C_{N_m,N_c}(q) is the L1-distance of the four (r, g, b, a) output channels between a given network setting and the highest-quality setting (N = 64, N_m = 8, N_c = 256). For simplicity, we assumed a uniformly distributed r within a fixed range of spherical coverage.

As suggested by Albert et al. [2017], the latency of a foveated system shall remain below ~50ms for undetectable artifacts. Given our test device's eye-tracking latency (~12ms) and photon submission latency (~14ms) [Albert et al. 2017], the synthesis and rendering latency shall be less than L_0 = 24ms. Thus, we determine the optimal (N, N_m, N_c) to balance latency and precision as

$$\operatorname*{argmin}_{N, N_m, N_c} E(N, N_m, N_c), \quad \text{s.t. } l(N, N_m, N_c) < L_0. \tag{7}$$

Figure 6 visualizes an example of the optimization mechanism for a foveal image I_f. The optimized results (N, N_m, N_c) for the individual networks are used for training. The optimization outcomes are detailed in Section 4. The visual quality is validated by the psychophysical study (Section 5.1) and objective analysis (Section 5.2). The latency breakdown of our system is reported in Section 5.3.

[Figure 6 data — rows: N; columns: N_c x N_m = 64x4, 64x8, 128x4, 128x8, 256x4, 256x8]
(a) quality (E_image):
N=64: 5.74, 2.96, 2.65, 1.28, 1.01, 0.7
N=32: 6.28, 4.11, 3.39, 2.37, 2.07, 1.68
N=16: 8.79, 6.88, 6.42, 5.48, 5.4, 4.73
N=4: 27.6, 26.0, 25.7, 26.3, 24.8, 23.1
(b) latency (ms):
N=64: 9.4, 14.8, 14.1, 23.2, 27.5, 51.7
N=32: 4.9, 7.3, 6.5, 11.9, 13.4, 25.1
N=16: 3.0, 4.4, 4.2, 6.6, 7.5, 13.5
N=4: 1.6, 1.9, 1.8, 2.5, 2.6, 4.1

Fig. 6. Latency-quality joint optimization. (a)/(b) plot E_image in Equation (5) and the latency (in milliseconds) of an example foveal network during the optimization process. The values are computed with various settings of N_c x N_m (X-axis) and N (Y-axis). The second row shows the corresponding foveal image under different settings: (c) quality-only, (d) latency-only, and (e) ours. Our method balances both perceptual quality and latency.
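As a concrete illustration of Equation (7), the sketch below runs a brute-force search over candidate settings, keeping only those under the 24ms budget and returning the one with the lowest error. The helpers `estimate_image_error` and `measure_latency` are hypothetical placeholders for evaluating E(N, N_m, N_c) and profiling the per-frame latency l(N, N_m, N_c); they are not functions from the paper's implementation.

```python
# Minimal grid-search sketch of the latency-quality optimization in Eq. (7)
# (our illustration under stated assumptions, not the authors' code).
from itertools import product

L0_MS = 24.0  # synthesis + rendering budget derived from the ~50 ms end-to-end target

def select_config(estimate_image_error, measure_latency,
                  n_spheres=(4, 16, 32, 64),
                  n_layers=(4, 8),
                  n_channels=(64, 128, 256)):
    """Pick (N, N_m, N_c) minimizing the error proxy subject to the latency budget."""
    best, best_err = None, float("inf")
    for N, Nm, Nc in product(n_spheres, n_layers, n_channels):
        if measure_latency(N, Nm, Nc) >= L0_MS:
            continue  # violates l(N, N_m, N_c) < L0
        err = estimate_image_error(N, Nm, Nc)  # spatial-temporal error E(N, N_m, N_c)
        if err < best_err:
            best, best_err = (N, Nm, Nc), err
    return best
```

In practice, the search space is small enough (a few dozen settings per network) that exhaustive evaluation is feasible once the error and latency of each candidate have been measured, as in Figure 6.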

4 IMPLEMENTATION

In this section, we elaborate on how we collect data for training, the specific parameters we choose for the two orthogonal neural networks, and the software/hardware environment we use.

Data generation. For each scene, we generate two datasets separately for training the foveal and periphery networks. The foveal dataset is composed of 256 x 256 images rendered at different views with a 20 deg field-of-view, while the periphery dataset uses a 45 deg field-of-view for rendering. The view translations (x) are sampled every 5 cm on each axis. The rotations (R) are sampled so that adjacent samples are seamlessly connected with no overlap.

Optimized network parameters. Guided by the latency-quality joint optimization described in Section 3.3, we implemented the foveal network with N_m = 4, N_c = 128, and N = 32. For the periphery neural network, the values are N_m = 4, N_c = 96, and N = 16. These optimized parameters achieve a proper balance between quality and latency, as shown in Figure 8 and Table 1.

Ray marching bounding. We bound the ray marching within the region in which r is sampled, which thus impacts E_scene in Equation (4). As shown in Figure 7a, results with an infinite upper bound (r ∈ [r_near, +∞)) incur blurry and ghosting artifacts due to a larger E_scene. In contrast, a limited bound significantly improves the quality (Figure 7b).

Fig. 7. The effect of ray marching bounding. Compared to infinite ray marching (a), bounding the ray marching further improves the visual quality (b).

Environment. The system was implemented through the OpenGL framework using CUDA and accelerated with TensorRT, on an Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz (256GB RAM) and one NVIDIA RTX 3090 graphics card. For each scene, we trained around 200 epochs for experiment use (about a day).
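The data-generation step above can be sketched as a simple camera-pose sampler. The snippet below is a hypothetical illustration under the stated assumptions (5cm translation grid, per-dataset field of view); `render_view` and the translation box bounds are stand-ins, not names from the paper's implementation.

```python
# Hypothetical data-generation sketch: sample training cameras on a 5 cm grid
# inside a translation box and render the foveal / peripheral datasets.
import itertools
import numpy as np

def sample_translations(box_min, box_max, step=0.05):
    """All grid positions (meters) spaced `step` apart along each axis."""
    axes = [np.arange(lo, hi + 1e-6, step) for lo, hi in zip(box_min, box_max)]
    return [np.array(p) for p in itertools.product(*axes)]

def generate_dataset(render_view, rotations, box_min, box_max, fov_deg, res):
    """render_view(x, R, fov_deg, res) -> image; rotations tile the view sphere without overlap."""
    dataset = []
    for x in sample_translations(box_min, box_max):
        for R in rotations:
            dataset.append((x, R, render_view(x, R, fov_deg, res)))
    return dataset

# e.g., foveal dataset:    generate_dataset(render_view, rotations, (-0.15,)*3, (0.15,)*3, 20, 256)
#       periphery dataset: same camera samples with fov_deg=45
```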

[Figure 8 panels: (a) gas scene, (b) minecraft scene, (c) bedroom scene, (d) gallery scene; each panel shows F-GT, NeRF, and OURS.]

Fig. 8. Visualization of scenes rendered/synthesized with different methods. The corresponding foveal images from OURS and NeRF are indicated in the dashed orange rectangles.

5 EVALUATION

With a variety of scenes, as shown in Figure 8, we conduct a subjective study (Section 5.1) and calculate objective measurements (Section 5.2) to evaluate the perceptual quality. Further, we analyze the intra- and inter-system performance in Section 5.3.

5.1 User Study

We conducted a psychophysical experiment to investigate how users perceive our solution (OURS) compared with an existing neural view synthesis method ([Mildenhall et al. 2020], NeRF) and foveated rendering of the original mesh scene ([Perry and Geisler 2002], F-GT).

Stimuli. For precise comparison and to accommodate the low frame rate of the alternative solutions (compared in Section 5.3), each group of stimuli consisted of static stereo images rendered via F-GT or synthesized via OURS/NeRF with the same views, as shown in Figure 8. They were generated with the same, randomly defined gaze fixations across conditions. The resolution of the image per eye was 1440 x 1600. During the study, the target gaze position was indicated as a green cross on the stimulus images. We used the gas and minecraft scenes for diverse indoor/outdoor environments. Each condition from an individual scene consists of 3 different gaze positions.

The foveal kernel size in F-GT was designed to match our display size and eye-display distance. For generating the NeRF condition, we retrained the model from [Mildenhall et al. 2020] on our dataset with N_m = 4, N_c = 128, and N = 16. The bottom right of each scene figure in Figure 8 shows the resulting NeRF images.

Setup. Each participant wore an eye-tracked HTC Vive Pro Eye headset and remained seated to examine the stimuli during the experiment. Ten users participated in and completed the study (3 females, M = 23.7, SD = 1.49). No participants were aware of the research, the experimental hypothesis, or the number of methods. All participants had normal or corrected-to-normal vision.

Task. The task was a two-alternative forced choice (2AFC). Each trial consisted of a pair of stimuli generated from two of the three methods (OURS / NeRF / F-GT) with a sampled view and gaze position. Each stimulus appeared for 300ms on the display. A forced 1.5-second break (black screen) was introduced between conditions to clear the vision. During the study, the participants were instructed to fix their gazes on the green cross. To prevent the fixation from shifting away and ensure accuracy, we tracked the users' gaze throughout the experiment. Whenever the gaze was more than 5 deg away from the target, the trial was dropped immediately with a black screen informing the participant. After each trial, the participants were instructed to select which of the two stimuli appeared with higher visual quality using a keyboard. Before each experiment, a warm-up session with 6 trials was provided to familiarize the participants with the study procedure. The orders of conditions among trials were randomized and counter-balanced. The entire experiment for each user consisted of 36 trials, 6 trials per pair of ordered conditions. To minimize the effect of accumulated fatigue, we also enforced breaks between trials (at least 2 seconds) and after each scene (60 seconds). Meanwhile, the participants were allowed to take as much time as needed.

Results. Figure 9 plots the results considering both the methods and their orders in the 2AFC experiments. Here we analyzed and reported the results regardless of the orders. Among all three conditions, we observed close-to-random-guess results among trials that compared F-GT and OURS (53.3% voted for OURS, SD = 14.9%). Meanwhile, a significantly higher ratio of voting F-GT over NeRF was observed (92.5% voted for F-GT, SD = 11.4%, binomial test showed p < 0.005***). The preference applies to OURS vs. NeRF as well (91.7% voted for OURS, SD = 14.8%, binomial test showed p < 0.005***).

[Figure 9 data — vote ratio for the first condition of each ordered pair: F-GT vs. OURS 0.4333; OURS vs. F-GT 0.5000; F-GT vs. NeRF 0.9333; NeRF vs. F-GT 0.0833; OURS vs. NeRF 0.9500; NeRF vs. OURS 0.1167.]

Fig. 9. The users' preference votes from our evaluation experiment (Section 5.1). The X-axis shows all pairs of ordered conditions (rendering/synthesis methods and their orders in the 2AFC); the Y-axis shows the ratio of votes for the first one in each pair.

Discussion. The close-to-random-guess result indicated statistically similar perceptual quality between OURS and the foveated ground truth (F-GT). Given the perceptual identicality between foveated and full-resolution rendering, the subjective study reveals that OURS can achieve similar quality to rendering with locally fully stored high-quality 3D data. Meanwhile, both of these two conditions showed a significant quality preference over NeRF. That is, with immersive, high-FoV, and first-person viewing constraints, OURS synthesizes gaze-contingent retinal images with superior perceptual quality compared to alternative synthesis solutions, which typically aim to reconstruct the full image over the entire field-of-view. To further validate the perceptual quality at individual eccentricities across the whole visual field, we analyze with objective metrics in the following section.

5.2 Visual Quality

Complementary to the subjective measurement (Section 5.1), we evaluate the perceived quality in an objective fashion. Specifically, under all four different scenes (indoor/outdoor, artificial/realistic), we compare OURS, F-GT, and NeRF at individual eccentricity ranges (up to 110 deg, the capability of the VR HMD) via partial images. For each eccentricity range, we compare the deep perceptual similarity (LPIPS) [Zhang et al. 2018] across all scenes. LPIPS uses deep neural networks to estimate perceptual similarities between the image provided and a reference image. Smaller values indicate higher perceptual similarity. For each scene, we sample 20 views with gazes at the middle of the display, resulting in 20 data points per eccentricity value and 21 eccentricity values per scene (5 deg step size), thus 420 data points per scene. We used one-way repeated measures ANOVAs to compare effects across the three stimuli on each eccentricity range and together. Paired t-tests with Holm correction were used for all pairwise comparisons between stimuli. All tests for significance were made at the α = 0.05 level. The error bars in the graphs show the 95% confidence intervals of the means.

Results. Figure 10 plots LPIPS values across all scenes and eccentricity ranges (5 deg step size). From foveal to near eccentricity (≤ 40 deg), we observed significant effects of the stimuli on LPIPS with a "large" effect size (η² = 0.26). That is, OURS shows significantly lower LPIPS than NeRF (p < .001***), although higher than F-GT (p < .001***). For example, the main effect of stimuli (F(2, 38) = 210.76, p < .001***) was significant on eccentricity = [0, 25] deg in the scene gallery (Figure 10c). OURS was significantly lower than NeRF (t(19) = −19.54, p < .001***) and F-GT (t(19) = −4.32, p < .001***), both with a "large" effect size (Cohen's d > 0.8). This example observation generally applies to all 4 scenes being validated.

From near- to far-eccentricity (> 40 deg), we observed significant effects of the stimuli on LPIPS with a "large" effect size (ω² = 0.16). OURS shows significantly higher LPIPS than NeRF, whereas, compared with F-GT, we observed significantly lower scores (p < .001***). For instance, the main effect of stimuli was significant on eccentricity = 60 deg in the scene bedroom (F(2, 38) = 411.93, p < .001***, Figure 10d). OURS was significantly lower than F-GT (t(19) = −3.78, p < .001***), and higher than NeRF (t(19) = 22.75, p < .001***), both with a "large" effect size (Cohen's d > 0.8). This example observation generally applies to all 4 scenes being validated.
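To illustrate how such a per-eccentricity comparison could be computed, the sketch below masks an annular eccentricity band around the gaze in a synthesized and a reference frame and scores it with the publicly available lpips PyTorch package. This is an assumed evaluation harness, not the authors' released code; zeroing pixels outside the band is a simplification of "partial images", and the pixels-per-degree value and band width are hypothetical parameters.

```python
# Hypothetical per-eccentricity LPIPS evaluation sketch (not the authors' code).
# Requires: pip install lpips torch
import torch
import lpips

loss_fn = lpips.LPIPS(net="alex")  # deep perceptual similarity metric [Zhang et al. 2018]

def eccentricity_mask(h, w, center, ecc_deg, band_deg, ppd):
    """Boolean mask of pixels whose eccentricity falls in [ecc_deg, ecc_deg + band_deg)."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist_deg = ((xs - center[0]) ** 2 + (ys - center[1]) ** 2).float().sqrt() / ppd
    return (dist_deg >= ecc_deg) & (dist_deg < ecc_deg + band_deg)

def band_lpips(img, ref, gaze_px, ecc_deg, band_deg=5.0, ppd=13.0):
    """LPIPS over one eccentricity band; img/ref are 1x3xHxW tensors scaled to [-1, 1]."""
    mask = eccentricity_mask(img.shape[2], img.shape[3], gaze_px, ecc_deg, band_deg, ppd)
    with torch.no_grad():
        return loss_fn(img * mask, ref * mask).item()  # zero out pixels outside the band
```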

[Figure 10: LPIPS vs. eccentricity (0-110 deg, 5 deg steps) for OURS, NeRF, and F-GT in the (a) gas, (b) minecraft, (c) gallery, and (d) bedroom scenes, with inset crops at example eccentricities.]

Fig. 10. LPIPS visualization of all scenes, comparing OURS, NeRF, and F-GT over the visual field. The X-axis indicates the eccentricity range (0 to 110 deg). The Y-axis shows the LPIPS loss [Zhang et al. 2018] with a full-resolution rendering as reference. Lower values mean perceptual quality more similar to the reference. The semi-transparent layers indicate the first and third quartiles. The inset figures' corresponding eccentricity ranges are indicated by the dashed lines.

Discussion. In the foveal and near-periphery, the observations revealed our method's significantly higher visual quality than the alternative neural representation solution. This is evidenced by the significantly stronger perceptual similarity to F-GT in the comparison between OURS and NeRF. The subjective study (Section 5.1) evidenced the marginally lower similarity than F-GT as statistically unnoticeable among users.

In the far-periphery, OURS showed increased LPIPS compared with F-GT beyond 40 deg. The latter has been shown to display identical perceptual similarity to full-resolution rendering in VR [Patney et al. 2016]. Thus, the findings show that OURS does not compromise the peripheral vision's quality with its significantly enhanced synthesis acuity in the fovea.

Note that these discoveries also agree with the observations from Section 5.1. That is, in addition to the significantly faster performance (99.8% time reduction per frame, as in Section 5.3), our method showed superior perceptual quality to NeRF under first-person, high resolution, and immersive viewing of 3D scenes. Meanwhile, our synthesized views are perceptually identical to the traditionally rendered images with a complete 3D mesh, yet achieve significant data storage savings with only about a 0.5MB neural model (up to 99.5% saving from the 3D mesh in our experiment).

5.3 Performance

Virtual and augmented reality demand high frame rates along with high quality to ensure an immersive and comfortable experience. Our neural synthesis method introduces both spatial and angular (stereo) visual perception for real-time performance (Section 3.1). Further, our precision-latency joint optimization for the representation and network also balances quality and performance. Here, we evaluate the performance of each component in our system and compare with existing neural synthesis solutions.

For the high resolution (1440 x 1600), high field-of-view (110 deg) stereo images required by the VR display, our system completes all computation (including gaze-contingent neural inference and elemental image composition) in 31.8ms per frame without stereo foveation, as shown in Table 1.

Compared with NeRF (the bottom row of Table 1), our spatial foveation significantly improves the system performance from offline computation (in seconds) to interactive speed (about 30 FPS). Our stereo foveation further reduces the computation time and achieves a high performance of more than 50 frames per second, contributing to a temporally continuous viewing experience without loss of perceived resolution and quality.

Table 1. Time consumption breakdown and comparison. The numbers show the average time consumption of each component and of the overall system per frame. All units are in milliseconds.

foveal infer (per eye): 3.8
periphery infer (per eye): 11.9
blending & contrast enhancement: 0.2
OUR method without stereo foveation: 31.8
OUR full method: 19.7
NeRF [Mildenhall et al. 2020]: 9 x 10^3

6 CONCLUSION

In this paper, we present a gaze-aware neural scene representation and view synthesis method tailored for future portable virtual reality viewing experiences. Specifically, we overcome the limitations of existing neural rendering approaches for high performance, resolution, and fidelity. Our network individually synthesizes foveal, mid-, and far-periphery retinal images, which are then blended to form a wide field-of-view image matching retinal acuity. This is achieved by adapting the neural network model to the foveated visual acuity and stereopsis. Compared with traditional rendering, our method requires less than 1% of the data storage yet delivers identically high perceptual quality. Compared with alternative neural view synthesis approaches, our method creates significantly faster and higher fidelity viewing for high resolution, high FoV, and stereo VR head-mounted displays.

Limitations and future work. While our method achieves unprecedented performance compared to existing approaches, it suffers from a few limitations. Our multi-spherical representation introduces higher quality when the VR camera is within the first sphere. As shown in Figure 11, when the virtual camera translates largely within a highly occluded scene, our method still synthesizes the accurate views but with declined foveal image quality. This is due to the accumulated error of the concentric spherical representation (Equation (5)). Multiscale coordinates and networks that consider various levels-of-detail of the 3D space have shown their effectiveness in interpolating geometries [Winkler et al. 2010]. Developing a multiscale network that synthesizes the image from global to local levels-of-detail would be an interesting direction for future work, extending the applicability of our method to large-scale scenes with strong occlusions.

Fig. 11. Comparing foveal quality with regard to translation range. Using our method, (a) shows a foveal image with a small-scale translation box (0.3m); (b) shows a foveal image with a large-scale translation box (0.9m).

Currently, we sample the scene fully based on eccentricity, considering acuity and stereopsis. However, we envision that fine-grained visual sensitivity analysis, such as luminance [Tursun et al. 2019] or depth [Sun et al. 2020], would provide more insights on achieving even higher quality and/or faster performance. Also, our multi-spherical representation is specialized to the scene for optimal quality. Thus, our network is trained per scene. Exploring potential means of generalized representation may further reduce storage.

For simplicity, the spatial-temporal joint optimization in Section 3.3 connects the output precision and latency to the number of spheres (N) but not their radii r. This is due to the significant amount of training required for parameter sampling. Incorporating the parameters into a single training process may significantly reduce the time consumption of the optimization. With such an adaptive training process, a content-aware distribution (i.e., r) of the spheres would further improve the synthesis quality and performance.

REFERENCES

Rachel Albert, Anjul Patney, David Luebke, and Joohwan Kim. 2017. Latency requirements for foveated rendering in virtual reality. ACM Transactions on Applied Perception (TAP) 14, 4 (2017), 1–13.
Benjamin Attal, Selena Ling, Aaron Gokaslan, Christian Richardt, and James Tompkin. 2020a. MatryODShka: Real-time 6DoF Video View Synthesis using Multi-Sphere Images. In European Conference on Computer Vision (ECCV).
Benjamin Attal, Selena Ling, Aaron Gokaslan, Christian Richardt, and James Tompkin. 2020b. MatryODShka: Real-time 6DoF Video View Synthesis Using Multi-sphere Images. In Computer Vision – ECCV 2020. Springer International Publishing, Cham, 441–459.
Matan Atzmon and Yaron Lipman. 2020. Sal: Sign agnostic learning of shapes from raw data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2565–2574.
Michael Broxton, John Flynn, Ryan Overbeck, Daniel Erickson, Peter Hedman, Matthew Duvall, Jason Dourgarian, Jay Busch, Matt Whalen, and Paul Debevec. 2020. Immersive Light Field Video with a Layered Mesh Representation. ACM Trans. Graph. 39, 4, Article 86 (July 2020), 15 pages. https://doi.org/10.1145/3386569.3392485
Rohan Chabra, Jan E Lenssen, Eddy Ilg, Tanner Schmidt, Julian Straub, Steven Lovegrove, and Richard Newcombe. 2020. Deep local shapes: Learning local sdf priors for detailed 3d reconstruction. In European Conference on Computer Vision. Springer, 608–625.
Inchang Choi, Orazio Gallo, Alejandro Troccoli, Min H Kim, and Jan Kautz. 2019. Extreme view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7781–7790.
Steven J Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F Cohen. 1996. The lumigraph. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques. 43–54.
Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. 2020. Implicit geometric regularization for learning shapes. arXiv preprint arXiv:2002.10099 (2020).
Brian Guenter, Mark Finch, Steven Drucker, Desney Tan, and John Snyder. 2012. Foveated 3D Graphics. ACM Trans. Graph. 31, 6, Article 164 (Nov. 2012), 10 pages. https://doi.org/10.1145/2366145.2366183

Nima Khademi Kalantari, Ting-Chun Wang, and Ravi Ramamoorthi. 2016. Learning-Based View Synthesis for Light Field Cameras. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2016) 35, 6 (2016).
Anton S. Kaplanyan, Anton Sochenov, Thomas Leimkühler, Mikhail Okunev, Todd Goodall, and Gizem Rufo. 2019. DeepFovea: Neural Reconstruction for Foveated Rendering and Video Compression Using Learned Statistics of Natural Videos. ACM Trans. Graph. 38, 6, Article 212 (Nov. 2019), 13 pages. https://doi.org/10.1145/3355089.3356557
Marc Levoy and Pat Hanrahan. 1996. Light field rendering. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques. 31–42.
Mengtian Li, Yu-Xiong Wang, and Deva Ramanan. 2020. Towards Streaming Perception. In Computer Vision – ECCV 2020. Springer International Publishing, Cham, 473–488.
Qinbo Li and Nima Khademi Kalantari. 2020. Synthesizing Light Field From a Single Image with Variable MPI and Two Network Fusion. ACM Transactions on Graphics 39, 6 (12 2020). https://doi.org/10.1145/3414685.3417785
Chen-Hsuan Lin, Chaoyang Wang, and Simon Lucey. 2020a. SDF-SRN: Learning Signed Distance 3D Object Reconstruction from Static Images. arXiv preprint arXiv:2010.10505 (2020).
Kai-En Lin, Zexiang Xu, Ben Mildenhall, Pratul P. Srinivasan, Yannick Hold-Geoffroy, Stephen DiVerdi, Qi Sun, Kalyan Sunkavalli, and Ravi Ramamoorthi. 2020b. Deep Multi Depth Panoramas for View Synthesis. In Computer Vision – ECCV 2020. Springer International Publishing, Cham, 328–344.
Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. 2020. Neural sparse voxel fields. arXiv preprint arXiv:2007.11571 (2020).
Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. 2019. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4460–4470.
Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. 2019. Local Light Field Fusion: Practical View Synthesis with Prescriptive Sampling Guidelines. ACM Transactions on Graphics (TOG) (2019).
Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2020. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV.
Hiroshi Mochizuki, Nobuyuki Shoji, Eriko Ando, Maiko Otsuka, Kenichiro Takahashi, Tomoya Handa, et al. 2012. The magnitude of stereopsis in peripheral visual fields. Kitasato Med J 41 (2012), 1–5.
Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. 2019. Occupancy flow: 4d reconstruction by learning particle dynamics. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5379–5389.
Michael Oechsle, Lars Mescheder, Michael Niemeyer, Thilo Strauss, and Andreas Geiger. 2019. Texture fields: Learning texture representations in function space. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4531–4540.
Eunbyung Park, Jimei Yang, Ersin Yumer, Duygu Ceylan, and Alexander C Berg. 2017. Transformation-grounded image generation network for novel 3d view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3500–3509.
Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. 2019. DeepSDF: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 165–174.
Anjul Patney, Marco Salvi, Joohwan Kim, Anton Kaplanyan, Chris Wyman, Nir Benty, David Luebke, and Aaron Lefohn. 2016. Towards Foveated Rendering for Gaze-Tracked Virtual Reality. ACM Trans. Graph. 35, 6, Article 179 (Nov. 2016), 12 pages. https://doi.org/10.1145/2980179.2980246
Jeffrey S Perry and Wilson S Geisler. 2002. Gaze-contingent real-time simulation of arbitrary visual fields. In Human vision and electronic imaging VII, Vol. 4662. International Society for Optics and Photonics, 57–69.
Albert Parra Pozo, Michael Toksvig, Terry Filiba Schrager, Joyce Hsu, Uday Mathur, Alexander Sorkine-Hornung, Rick Szeliski, and Brian Cabral. 2019. An Integrated 6DoF Video Camera and System Design. ACM Trans. Graph. 38, 6, Article 216 (Nov. 2019), 16 pages. https://doi.org/10.1145/3355089.3356555
Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. 2019. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2304–2314.
Ana Serrano, Incheol Kim, Zhili Chen, Stephen DiVerdi, Diego Gutierrez, Aaron Hertzmann, and Belen Masia. 2019. Motion parallax for 360° RGBD video. IEEE Transactions on Visualization and Computer Graphics (2019).
Harry Shum and Sing Bing Kang. 2000. Review of image-based rendering techniques. In Visual Communications and Image Processing 2000, Vol. 4067. International Society for Optics and Photonics, 2–13.
Vincent Sitzmann, Eric R. Chan, Richard Tucker, Noah Snavely, and Gordon Wetzstein. 2020a. MetaSDF: Meta-Learning Signed Distance Functions. In arXiv.
Vincent Sitzmann, Julien N.P. Martel, Alexander W. Bergman, David B. Lindell, and Gordon Wetzstein. 2020b. Implicit Neural Representations with Periodic Activation Functions. In Proc. NeurIPS.
Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhöfer. 2019a. DeepVoxels: Learning Persistent 3D Feature Embeddings. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE.
Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. 2019b. Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations. In Advances in Neural Information Processing Systems.
Qi Sun, Fu-Chung Huang, Joohwan Kim, Li-Yi Wei, David Luebke, and Arie Kaufman. 2017. Perceptually-Guided Foveation for Light Field Displays. ACM Trans. Graph. 36, 6, Article 192 (Nov. 2017), 13 pages. https://doi.org/10.1145/3130800.3130807
Qi Sun, Fu-Chung Huang, Li-Yi Wei, David Luebke, Arie Kaufman, and Joohwan Kim. 2020. Eccentricity effects on blur and depth perception. Opt. Express 28, 5 (March 2020), 6734–6739. https://doi.org/10.1364/OE.28.006734
Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Carsten Stoll, and Christian Theobalt. 2020. PatchNets: Patch-Based Generalizable Deep Implicit 3D Shape Representations. In European Conference on Computer Vision. Springer, 293–309.
Okan Tarhan Tursun, Elena Arabadzhiyska-Koleva, Marek Wernikowski, Radosław Mantiuk, Hans-Peter Seidel, Karol Myszkowski, and Piotr Didyk. 2019. Luminance-Contrast-Aware Foveated Rendering. ACM Trans. Graph. 38, 4, Article 98 (July 2019), 14 pages. https://doi.org/10.1145/3306346.3322985
Tim Winkler, Jens Drieseberg, Marc Alexa, and Kai Hormann. 2010. Multi-scale geometry interpolation. In Computer Graphics Forum, Vol. 29. Wiley Online Library, 309–318.
Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. 2020. Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems 33 (2020).
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 586–595.
Zaiwei Zhang, Zhenpei Yang, Chongyang Ma, Linjie Luo, Alexander Huth, Etienne Vouga, and Qixing Huang. 2020. Deep generative modeling for scene synthesis via hybrid representations. ACM Transactions on Graphics (TOG) 39, 2 (2020), 1–21.