Deep Shading

Team Cucumbers: Joey Chiu, Subash Chebolu, Mirkamil Mijit, Bharat Suri

1 Object Representation

Source: http://www.cmap.polytechnique.fr/~peyre/geodesic_computations/

2 Graphics Pipeline: Terminology

1. Local space (object space)
2. World space
3. View space
4. Clip space
5. Screen space (a worked example follows)
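To make these spaces concrete, here is a minimal NumPy sketch (illustrative only, not from the slides) that follows one vertex through them; the matrix builders and values are assumptions chosen for the example:

```python
import numpy as np

def translate(tx, ty, tz):
    # Homogeneous 4x4 translation matrix.
    m = np.eye(4)
    m[:3, 3] = [tx, ty, tz]
    return m

def perspective(fov_deg, aspect, near, far):
    # OpenGL-style perspective projection matrix.
    f = 1.0 / np.tan(np.radians(fov_deg) / 2.0)
    m = np.zeros((4, 4))
    m[0, 0] = f / aspect
    m[1, 1] = f
    m[2, 2] = (far + near) / (near - far)
    m[2, 3] = 2.0 * far * near / (near - far)
    m[3, 2] = -1.0
    return m

v_local = np.array([0.5, 0.5, 0.0, 1.0])   # 1. local (object) space
model = translate(0.0, 0.0, -1.0)          # local -> 2. world space
view = translate(0.0, 0.0, -4.0)           # world -> 3. view space
proj = perspective(50.0, 1.0, 0.1, 100.0)  # view -> 4. clip space

v_clip = proj @ view @ model @ v_local
v_ndc = v_clip[:3] / v_clip[3]             # perspective divide

w, h = 512, 512                            # 5. screen space (viewport transform)
x = (v_ndc[0] * 0.5 + 0.5) * w
y = (1.0 - (v_ndc[1] * 0.5 + 0.5)) * h
print(x, y)
```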

3, 4 Rasterization (figures)

Source: https://tayfunkayhan.wordpress.com/2018/12/16/rasterization-in-one-weekend-part-ii/

5 Phong Lighting

6 Graphics Pipeline

Source: http://web.cse.ohio-state.edu/~shen.94/5542/Site/Slides_files/hardware5542.pdf

7 Forward Shading vs. Deferred Shading

8 More about Deferred Shading

9 More about Deferred Shading

10 Deferred Shading

1. Lighter lighting calculations (shading cost is decoupled from scene complexity; a sketch of the two passes follows this list)

2. Heavy memory usage (the G-buffer stores several full-screen attribute maps)

3. No support for transparent objects
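For context, a minimal sketch of the two deferred-shading passes; this is an assumed structure for illustration, and the `rasterize` helper is hypothetical:

```python
import numpy as np

W, H = 256, 256
# Geometry pass output: a G-buffer of per-pixel attributes instead of colors.
gbuffer = {
    "position": np.zeros((H, W, 3)),
    "normal":   np.zeros((H, W, 3)),
    "albedo":   np.zeros((H, W, 3)),
}

def geometry_pass(scene):
    # Rasterize every object once, writing attributes into the G-buffer.
    for obj in scene:
        for x, y, pos, nrm, alb in obj.rasterize(W, H):  # hypothetical helper
            gbuffer["position"][y, x] = pos
            gbuffer["normal"][y, x] = nrm
            gbuffer["albedo"][y, x] = alb

def lighting_pass(lights):
    # One full-screen pass per light: cost scales with pixels * lights,
    # regardless of how many triangles the scene contains.
    out = np.zeros((H, W, 3))
    for light in lights:
        to_light = light.position - gbuffer["position"]
        to_light /= np.linalg.norm(to_light, axis=-1, keepdims=True) + 1e-8
        ndotl = np.clip(np.sum(gbuffer["normal"] * to_light, axis=-1), 0.0, 1.0)
        out += gbuffer["albedo"] * ndotl[..., None] * light.color
    return out
```

The heavy memory usage is visible directly: three full-resolution float maps here, and real engines store more.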

11 Ray Tracing

● Completely different from the GPU graphics pipeline described above
● Can capture effects that rasterization cannot
● Computationally expensive

12 Features

13 Effects

● Certain effects make rendered images appear more realistic
○ Ambient Occlusion (AO)
○ Directional Occlusion (DO)
○ Indirect Light (GI)
○ Subsurface Scattering (SSS)
○ Depth-of-Field (DOF)
○ Motion Blur (MB)
○ Image-Based Lighting (IBL)
○ Anti-Aliasing (AA)
● These effects are generally produced during rendering and require significant computational resources

14 Directional Occlusion (Indirect Light)

● Generalized application of Ambient Occlusion
● Accounts for the direction of light when approximating global illumination
● Screen Space Directional Occlusion (SSDO, RGS09)
○ Accounts for the direction of incoming light
○ Adds one bounce of indirect illumination
○ Minor additional computation time (compared to SSAO)
○ Avoids ray tracing and only uses normals and geometry

15 Ambient Occlusion

● Approximates global illumination by calculating how much ambient light is “occluded” from a point by the surrounding geometry
● Two approaches to computing it (a toy sketch of the screen-space variant follows this list)
○ Screen Space AO (SSAO)
■ Developed by Crytek
■ Uses pixel depth from the Z-buffer
○ Ray Tracing
■ Casts rays to determine if geometry is in the way
■ Very slow (until recently)
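A toy, CPU-side sketch of the SSAO idea, assuming a flat-offset simplification of the Crytek-style approach (real implementations sample a view-space hemisphere and add range checks):

```python
import numpy as np

rng = np.random.default_rng(0)

def ssao(depth, n_samples=16, radius=8):
    """depth: (H, W) Z-buffer, smaller = closer. Returns occlusion factor in [0, 1]."""
    occlusion = np.zeros_like(depth)
    offsets = rng.integers(-radius, radius + 1, size=(n_samples, 2))
    for dy, dx in offsets:
        # Look up the depth of a nearby pixel.
        neighbor = np.roll(np.roll(depth, dy, axis=0), dx, axis=1)
        # The neighbor occludes this pixel if it is closer to the camera;
        # the small bias avoids self-occlusion on flat surfaces.
        occlusion += (neighbor < depth - 1e-3).astype(depth.dtype)
    return 1.0 - occlusion / n_samples  # 1 = fully lit, 0 = fully occluded

# The ambient term is then modulated per pixel: ambient * ssao(depth)[..., None]
```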

16, 17 Sub-Surface Scattering (SSS)

● Phenomenon where light passes through a translucent material and is scattered, bouncing around before exiting at a different point
● Visible on leaves, skin, wax, etc.
● Screen Space SSS (JSG09)
○ Done in screen space rather than texture space
○ Does not require a diffusion profile or irradiance map

18 Depth of Field

● Physical property specifying the range of objects that are in focus
● A realistic and often desired feature in rendered images
● Screen-Space DOF
○ Post-processing with specific filters

19 Motion Blur

● Physical phenomenon that occurs when object movement is faster than the shutter speed of a camera
● Similar to DOF, this feature is often desirable in rendered images to make them look more realistic
● Implementation in graphics
○ Filtering in post-processing

20 Anti-Aliasing

● Aliasing occurs when the sampling rate is below the Nyquist rate (see the small demo after this list)
○ Fourier transform
○ Texture magnification/minification during pixel lookup
● Depth information can be used to blur discontinuities and reduce aliasing
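A tiny worked example of the Nyquist point above (frequencies chosen for illustration): a 6 Hz sine sampled at 8 Hz, below its 12 Hz Nyquist rate, is indistinguishable from a 2 Hz alias:

```python
import numpy as np

t = np.arange(0, 1, 1 / 8)               # 8 samples per second

samples = np.sin(2 * np.pi * 6 * t)      # true 6 Hz signal, undersampled
alias = np.sin(2 * np.pi * (6 - 8) * t)  # the -2 Hz alias it folds onto

print(np.allclose(samples, alias))       # True: the sampler cannot tell them apart
```

In rendering, the same folding shows up spatially as jagged edges and texture shimmer, which is what anti-aliasing suppresses.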

21 Introduction to Paper

22 Data Structure

● 61,000 pairs with a train-validation-test split of 54,000-6,000-6,000
● Data generation
○ 1,000 base images from each of 10 scenes
○ Each base image is flipped horizontally and vertically, and rotated in steps of 90 degrees (a sketch of this augmentation follows this list)
○ 170 hours of computation on one GPU
● The base images are from a perspective camera with a 50 degree FOV
● 512 x 512 px for AO
● 256 x 256 px for other effects
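A sketch of that augmentation in NumPy (the exact set of variants is an assumption; six variants per base image applied to the 10,000 base images lands near the stated ~61,000 pairs):

```python
import numpy as np

def augment(img):
    # Original, horizontal flip, vertical flip, and 90/180/270-degree rotations.
    variants = [img, np.fliplr(img), np.flipud(img)]
    variants += [np.rot90(img, k) for k in (1, 2, 3)]
    return variants

base = np.random.rand(512, 512, 3)  # stand-in for one rendered attribute map
print(len(augment(base)))           # 6 variants per base image
```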

23 Attributes

Attribute                                             Space          Notation
Position                                              Screen         Ps
Normals                                               Camera/World   Ns/Nw
Depth                                                 Screen         Ds
Distance to the focal plane                           Screen         Dfocal
Radius of the circle of confusion of the lens system  //             B
Normalized direction to the camera                    World          Cw
Material parameters                                   //             R
RGB diffuse and specular colors                       //             Rdiff/Rspec
Scalar glossiness                                     //             Rgloss
Scatter                                               //             Rscatt
Direct light                                          //             L
Direct light for diffuse                              //             Ldiff

(// indicates no associated space.)

24 Appearance

AO and DO do not directly compute a final RGB appearance

● Instead, they output radiance, which is then multiplied by the albedo

RGB vs Mono networks

● RGB networks are trained on all 3 channels of the input at the same time
● Mono networks are trained on a single channel at a time

25 Network Architecture

26 Network Architecture

Network architecture is based on the U-Net architecture

27 Network Architecture


A new network was created for each type of effect.

● Up to 6 “levels” of downsampling and upsampling
● Resolutions range from 512 x 512 px down to 16 x 16 px through the network
● Downsampling is 2 x 2 mean-pooling
● Upsampling is bilinear upsampling
● Every level effectively doubles the number of feature maps
● Every level effectively halves the dimensions of the image
● The activation function is always LeakyReLU (a sketch of one level follows this list)
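An illustrative PyTorch sketch of one down/up level with its skip connection (the paper used Caffe; the channel counts, LeakyReLU slope, and layer details here are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Level(nn.Module):
    """One U-Net-style level: pool, convolve, recurse, upsample, concatenate."""
    def __init__(self, ch_in, ch_out):
        super().__init__()
        self.down_conv = nn.Conv2d(ch_in, ch_out, 3, padding=1)  # doubles features
        self.up_conv = nn.Conv2d(ch_out + ch_in, ch_in, 3, padding=1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x, inner):
        skip = x                                          # kept for the skip connection
        x = self.act(self.down_conv(F.avg_pool2d(x, 2)))  # 2x2 mean-pooling halves H, W
        x = inner(x)                                      # deeper levels go here
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = torch.cat([x, skip], dim=1)                   # concatenate same-level features
        return self.act(self.up_conv(x))

level = Level(16, 32)
x = torch.randn(1, 16, 512, 512)
print(level(x, inner=lambda t: t).shape)  # torch.Size([1, 16, 512, 512])
```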

28 Network Architecture

The architecture uses grouped convolutions, which parallelize well; a one-line comparison follows the figure placeholder below.

(Figure: regular convolution vs. grouped convolution)
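Grouped convolution in PyTorch for comparison (the sizes are arbitrary examples): with `groups=2`, each half of the input channels gets its own filters, so the halves can run in parallel and the weight count drops by the group factor:

```python
import torch
import torch.nn as nn

regular = nn.Conv2d(16, 32, kernel_size=3, padding=1)            # 32*16*3*3 weights
grouped = nn.Conv2d(16, 32, kernel_size=3, padding=1, groups=2)  # 32*(16/2)*3*3 weights

x = torch.randn(1, 16, 64, 64)
print(regular(x).shape, grouped(x).shape)            # same output: (1, 32, 64, 64)
print(sum(p.numel() for p in regular.parameters()),  # 4640 (incl. 32 biases)
      sum(p.numel() for p in grouped.parameters()))  # 2336 (incl. 32 biases)
```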

29 Training

The networks were trained using the Caffe framework.

● Input has 3 to 18 channels
● ADADELTA optimizer with a momentum of 0.9
● The loss function is based on SSIM (Structural Similarity Index); a minimal version is sketched below
○ Ranges from -1 to 1
○ The image is tiled into 8 x 8 px tiles and the per-tile SSIMs are combined
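A minimal NumPy version of SSIM on one 8 x 8 tile (the standard formula with the usual constants for images in [0, 1]; forming the training loss from it, e.g. as 1 - SSIM, is an assumption here):

```python
import numpy as np

def ssim_tile(x, y, c1=0.01**2, c2=0.03**2):
    # Means, variances, and covariance of the two tiles.
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2*mx*my + c1) * (2*cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

a = np.random.rand(8, 8)
print(ssim_tile(a, a))                     # identical tiles -> 1.0
print(ssim_tile(a, np.random.rand(8, 8)))  # dissimilar tiles -> noticeably lower
```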

30 Analysis

31 Analysis

This section discusses the following:

● Shortcomings of the method, illustrated through typical artifacts
● Structural choices and trade-offs in training

32 Visual Analysis

The method produces characteristic artifacts, which the authors use to analyze its shortcomings. Their visual analysis covers:

● Typical Artifacts
● Range of Values
● Effect Radius
● Internal Camera Parameters

33 Visual Analysis (Typical Artifacts)

● Light transport can be highly complex, so mapping attributes to shading becomes ambiguous
● Patterns may resemble correct shading but be inconsistent with the laws of optics
● Capturing high frequencies is a challenge; the network needs enough capacity and training
● Results might over-blur, as seen in the figure, but this is preferable to the ringing and Monte Carlo noise seen in man-made shaders

Source: http://deep-shading-datasets.mpi-inf.mpg.de/deep-shading.pdf

34 Visual Analysis (Effect Radius)

● Screen-space shading is faded out based on a distance term
● Training is done at one resolution but later applied at different resolutions
● As the resolution changes, the effect radius should change accordingly
● The effect radius is not an input to the network but fixed in the training data
● It can still be adjusted at test time by scaling the attributes that determine the spatial scale of the effect

Source: http://deep-shading-datasets.mpi-inf.mpg.de/deep-shading.pdf

35 Visual Analysis (Internal Camera Parameters)

● It is unclear how the network performs on framebuffers rendered with a FOV different from the training data
● The figure investigates the influence of a FOV mismatch on image quality
● To keep the image content as similar as possible while changing the FOV, the authors performed a dolly zoom
● The network is robust to FOV mismatches, as seen from the minimal fluctuation of error in the figure

Source: http://deep-shading-datasets.mpi-inf.mpg.de/deep-shading.pdf

36 Network Structure

This section investigates how structural parameters of the CNN architecture affect its expressiveness and computational demand.

Two modes of variation were studied:

● The spatial extent of the kernels
● The number of kernels on the first level

The goal was to find the smallest network with adequate learning capacity that still generalizes well to unseen data.

37 Spatial Kernel Size and Initial Number of Kernels

There was no noticeable difference in performance across kernel sizes: 3x3 and 5x5 kernels had the same capacity, but the smaller kernel was faster to train, so it was used in the end. The network does seem to lose some capacity when 4 kernels are used instead of 8.

38 Choice of Loss Function

39 Choice of Loss Function

● The loss function has a significant impact on how Deep Shading performs
● The same network structure was trained using the common L1 and L2 losses as well as the perceptual SSIM metric
● Combinations of SSIM with L1 and L2 were also tested
● L1 and L2 alone are prone to halo artifacts
● L1+SSIM and SSIM alone produce the best results overall

40 Training Data Trade-offs

Ideally, a training set consists of a vast collection of images free of imperfections such as Monte Carlo noise.

However, producing such training data is typically impractical, so trade-offs must be made when the network is trained.

Two trade-offs were considered:

● Amount of noise vs. image set size
● Scene diversity

41 Noise vs Image Set Size

● The time spent generating a training set is roughly linear in the total sample count
● Twice as many images can be rendered if each uses half as many samples per pixel
● Generally, a larger number of views per scene is more desirable than noiseless images
● Excessive noise, however, hinders network training

42 Scene Diversity

● The diversity of scenes also has a major influence on the quality of trained Deep Shaders
● Directional Occlusion was sensitive to scene diversity and was used to investigate this trade-off
● Increasing the number of scenes from 1 to 5 results in a 5% increase in performance
● Going from 5 scenes to 10 yields only a further 1% increase
● For DO, the difference in loss visually translates to a more correct placement of darkening

43 Results

● SSIM: metric to evaluate similarity between images
○ Range -1 to 1, where 1 is best
○ Part of the perceptual loss

44 Results

● Ambient Occlusion
○ Ground truth generated by a ray tracer
○ The AO term is multiplied into the ambient lighting in post-processing
■ Decreases computation time
○ Achieved higher SSIM than HBAO, another screen-space AO technique
■ Fewer high-frequency artifacts
■ Fewer indirect shadows
● Diffuse Indirect Light
○ Only diffuse reflection is considered
○ Ground truth considers light arriving at a pixel after ONE interaction with a surface
○ The light source position is randomized
○ The Deep Shader achieves higher SSIM, partly due to Monte Carlo noise in the reference

45, 46 Results

● Depth of Field (DOF)
○ Ground truth images generated in screen space
■ Average multiple images with different circles of confusion and focal lengths
■ Parameters are randomly sampled and multiplied to create a single attribute
● Motion Blur
○ Similar to DOF, but in the temporal domain
○ For ground truth, objects move in random directions parallel to the image plane
● Sub-Surface Scattering
○ Ground truth produced by a screen-space sub-surface scattering algorithm
○ Trained on each color channel separately
● Anti-Aliasing (of geometry, not textures)
○ Learns a filter that replaces sharp edges with smooth ones
○ Does not perform as well as the reference (MSAA)

47, 48 Results

● Image-Based Lighting
○ Shaded by sampling directions in an environment map
● Directional Occlusion
○ Generalization of AO
○ More challenging for Deep Shading
■ Issues with high-frequency shadows
○ Performs better than the reference screen-space DO algorithm with the same time budget
● Multi Shading
○ Combines several shading effects
○ AO is applied directly by the CNN, whereas the standalone AO architecture multiplied the AO term in afterwards

49, 50 Results

● Real Shading
○ Deep Shaders can also learn from real photographs
○ The experiment used 12,000 RGB + depth photographs
■ Camera positions and camera-space normals are derived from them
○ The network is designed to map normals to RGB data
○ Limitations arise from the resolution of the depth camera

51, 52 Questions

1. What factors should be considered in an application when choosing between Forward Shading and Deferred Shading?

2. The authors covered many different effects in the Deep Shader. Ambient Occlusion is one of the toughest to implement in the conventional way. The final results show that the SSIM score on AO was lower than for some of the other effects; does that affect the overall performance of the shader? What other factors should be considered when assessing the performance of the model on AO?

3. Why is concatenating the downsampling output of a level with the input of the corresponding upsampling layer important?

4. The authors mention that the Deep Shader achieved better results than other screen-space shading algorithms for Ambient Occlusion. Why do you think the network can do better given that it is not using more attributes than other algorithms?

5. In the paper, the authors used visual analysis as a tool to assess the performance of the same network under different conditions to investigate different trade-offs. How can this assessment be made more effective?