
Rendering Natural Camera Bokeh Effect with Deep Learning

Andrey Ignatov    Jagruti Patel    Radu Timofte

ETH Zurich, Switzerland

Abstract

Bokeh is an important artistic effect used to highlight the main object of interest on the photo by blurring all out-of-focus areas. While DSLR and system cameras can render this effect naturally, mobile cameras are unable to produce shallow depth-of-field photos due to the very small diameter of their optics. Unlike the current solutions simulating bokeh by applying Gaussian blur to the image background, in this paper we propose to learn a realistic bokeh rendering technique directly from the photos produced by DSLR cameras. For this, we present a large-scale bokeh dataset consisting of 5K shallow / wide depth-of-field image pairs captured using the Canon 7D DSLR with a 50mm f/1.8 lens. We use these images to train a deep learning model to reproduce a natural bokeh effect based on a single narrow-aperture image. The experimental results show that the proposed approach is able to render a plausible non-uniform bokeh even in case of complex input data with multiple objects. The dataset, pre-trained models and code used in this paper are available on the project website: https://people.ee.ethz.ch/~ihnatova/pynet-bokeh.html

1. Introduction

Figure 1. The original shallow depth-of-field image and the image produced with our method.

Bokeh effect is a very popular technique used to make the subject in the shot stand out sharply against a blurred background (Fig. 1). It is achieved by focusing the camera on the selected area or object and shooting the photo with a wide aperture lens. This produces a shallow depth-of-field image where only objects located within a narrow image plane are visible clearly, while all other parts of the image are blurred. This effect often leads to very pleasing visual results, and besides that allows one to wash out unnecessary, distracting or unattractive background details, which is especially useful in case of mobile photography. In order to get good bokeh, fast lenses with a large aperture are needed, which makes this effect unattainable for mobile cameras with compact optics and tiny sensors. As a result, the bokeh effect can only be simulated computationally on smartphones and other devices with small mobile cameras.

Synthetic bokeh effect rendering is a relatively new machine learning topic that has emerged in the past years. [15, 16] were one of the first papers exploring shallow depth-of-field simulation based on a single image. These works were focused on portrait photos only: they first segmented out people from the image with a convolutional neural network, and then uniformly blurred the remaining image background. The same approach was also considered in [21], where the authors proposed an efficient solution for person segmentation. There also exists a separate group of works [8, 2, 19] that are blurring the image based on the predicted depth map, however they require the camera to move in order to get the parallax effect, thus the practical applicability of these approaches is quite limited. An accurate depth map estimation can also be obtained with stereo vision [1] using synchronously captured photos from multiple cameras.

Figure 2. Sample wide and shallow depth-of-field image pairs from the EBB! dataset.

Since 2015, the majority of flagship and mid-range mobile devices are getting special hardware for accelerating deep learning models [5]. The performance of such hardware is already approaching the results of mid-range Nvidia GPUs presented in the past years [6], making the use of deep learning-based methods for bokeh effect simulation on mobile devices especially relevant. Paper [17] describes the synthetic depth-of-field rendering approach used on Google Pixel devices that is able to achieve very impressive results for the majority of photo scenes. In this work, the authors are predicting a depth map based on a single image and are using the camera's dual-pixel auto-focus system when possible for refinement of the results. The weakest part of the paper is the actual bokeh rendering method: a simple approximated disk blur is used for this, making the visual results quite different from the natural bokeh photos produced by DSLR cameras.

In this paper, we present a different approach to the considered problem. Since the properties of the real bokeh depend on a large number of parameters such as the focal length, the distance to the subject, the acceptable circle of confusion, the aperture size, type and construction, various optical lens characteristics, etc., it is almost impossible to render this effect precisely with the existing software tools and methods. Therefore, in this work we propose to learn the bokeh technique directly from the photos produced by a high-end DSLR camera using deep learning. The presented approach is camera independent, does not require any special hardware, and can also be applied to existing images.

Our main contributions are:

• A novel end-to-end deep learning solution for realistic bokeh effect rendering. The model is trained to map the standard narrow-aperture images into shallow depth-of-field photos captured with a DSLR camera.

• A large-scale dataset containing 5K shallow / wide depth-of-field image pairs collected in the wild with the Canon 7D DSLR camera and 50mm f/1.8 fast lens.

• A comprehensive set of experiments evaluating the quantitative and perceptual quality of the rendered bokeh photos. We demonstrate that the proposed solution can process one 1024×1536 px image in under 5 s on all high-end mobile chipsets when running on GPU.

2. Everything is Better with Bokeh! Dataset

One of the biggest challenges in the bokeh rendering task is to get high-quality real data that can be used for training deep models. To tackle this problem, a large-scale Everything is Better with Bokeh! (EBB!) dataset containing more than 10 thousand images was collected in the wild during several months. By controlling the aperture size of the lens, images with shallow and wide depth-of-field were taken. In each photo pair, the first image was captured with a narrow aperture (f/16) that results in a normal sharp photo, whereas the second one was shot using the highest aperture (f/1.8) leading to a strong bokeh effect. The photos were taken during the daytime in a wide variety of places and in various illumination and weather conditions. The photos were captured in automatic mode, and the default settings were used throughout the entire collection procedure. An example set of collected images is presented in Figure 2.

The captured image pairs are not aligned exactly, therefore they were first matched using SIFT keypoints and the RANSAC method, same as in [3]. The resulting images were then cropped to their intersection part and downscaled so that their final height is equal to 1024 pixels. Finally, we computed a coarse depth map for each wide depth-of-field image using the Megadepth model proposed in [12]. These maps can be stacked directly with the input images and used as an additional guidance for the trained model. From the resulting 10 thousand images, 200 image pairs were reserved for testing, while the other 4.8 thousand photo pairs can be used for training and validation.
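To make the alignment step concrete, below is a minimal sketch of the matching and rescaling procedure described above. It assumes OpenCV is used for SIFT matching and homography estimation; the ratio-test threshold and RANSAC reprojection error are illustrative placeholders rather than the values used to build EBB!.

```python
import cv2
import numpy as np

def align_pair(wide_dof, shallow_dof):
    """Roughly align a wide / shallow depth-of-field pair with SIFT keypoints + RANSAC."""
    sift = cv2.SIFT_create()
    gray_w = cv2.cvtColor(wide_dof, cv2.COLOR_BGR2GRAY)
    gray_s = cv2.cvtColor(shallow_dof, cv2.COLOR_BGR2GRAY)
    kp_w, des_w = sift.detectAndCompute(gray_w, None)
    kp_s, des_s = sift.detectAndCompute(gray_s, None)

    matches = cv2.BFMatcher().knnMatch(des_w, des_s, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # Lowe's ratio test

    src = np.float32([kp_w[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_s[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    h, w = shallow_dof.shape[:2]
    return cv2.warpPerspective(wide_dof, H, (w, h)), shallow_dof

def downscale_to_height(img, target_height=1024):
    """Downscale the aligned crop so that its final height equals 1024 pixels."""
    scale = target_height / img.shape[0]
    return cv2.resize(img, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
```

The coarse Megadepth depth map can then be stacked with the aligned RGB image as an extra input channel, e.g. `np.concatenate([img, depth[..., None]], axis=-1)`, to provide the guidance mentioned above.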
Figure 3. The architecture of the PyNET-based model. Concat and Sum ops are applied to the outputs of the adjacent layers.

3. Proposed Method

The bokeh effect simulation problem belongs to a group of tasks dealing with both global and local image processing. High-level image analysis is needed here to detect the areas on the photo where the bokeh effect should be applied, whereas low-level processing is used for rendering the actual shallow depth-of-field images and refining the results. Therefore, in this work we base our solution on the PyNET architecture [7] designed specifically for this kind of tasks: it processes the image at different scales and combines the learned global and local features together.

3.1. PyNET CNN Architecture

Figure 3 illustrates a schematic representation of the PyNET-based architecture used in this work. The model has an inverted pyramidal shape and processes the images at seven different scales. The proposed architecture has a number of blocks that are processing feature maps in parallel with convolutional filters of different size (from 3×3 to 9×9), and the outputs of the corresponding convolutional layers are then concatenated, which allows the network to learn a more diverse set of features at each level. The outputs obtained at lower scales are upsampled with transposed convolutional layers, stacked with feature maps from the upper level and then subsequently processed in the following convolutional layers. The Leaky ReLU activation function is applied after each convolutional operation, except for the output layers that are using the tanh function to map the results to the (-1, 1) interval. Instance normalization is used in all convolutional layers that are processing images at lower scales (levels 2-5). We are additionally using two transposed convolutional layers on top of the main model that upsample the images to their target size.

The model is trained sequentially, starting from the lowest layer. This allows achieving good semantically-driven reconstruction results at smaller scales that are working with images of very low resolution and thus performing mostly global image manipulations. After the bottom layer is pre-trained, the same procedure is applied to the next level till the training is done at the original resolution. Since each higher level is getting upscaled high-quality features from the lower part of the model, it mainly learns to reconstruct the missing low-level details and refines the results. Note that the input layer is always the same (and is getting images of size 512×512 pixels during the training), though only a part of the model graph (all layers participating in producing the outputs at the corresponding scale) is trained. It should also be noted that the resolution of the produced images is twice higher than the size of the input data, which was done to increase the training and inference speed.
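To illustrate the multi-scale design, the following minimal TensorFlow sketch shows one PyNET-style level block: parallel convolutions with 3×3 to 9×9 kernels whose outputs are concatenated, Leaky ReLU after every convolution, instance normalization as used on the lower levels, a transposed convolution for upsampling, and a tanh output head. The filter counts and the wiring between levels are simplified placeholders, not the configuration of the released model.

```python
import tensorflow as tf

def instance_norm(x, eps=1e-5):
    # Per-image, per-channel normalization over the spatial dimensions (levels 2-5).
    mean, var = tf.nn.moments(x, axes=[1, 2], keepdims=True)
    return (x - mean) / tf.sqrt(var + eps)

def multi_kernel_block(x, filters=32, kernels=(3, 5, 7, 9), norm=True):
    # Parallel convolutions with different receptive fields, outputs concatenated;
    # Leaky ReLU follows every convolution.
    branches = []
    for k in kernels:
        b = tf.keras.layers.Conv2D(filters, k, padding="same")(x)
        if norm:
            b = instance_norm(b)
        b = tf.nn.leaky_relu(b, alpha=0.2)
        branches.append(b)
    return tf.concat(branches, axis=-1)

# One level applied to a dummy 512x512 training crop (RGB + depth guidance channel).
x = tf.random.uniform([1, 512, 512, 4])
features = multi_kernel_block(x)
# Transposed convolution upscaling the features towards the next (finer) level.
upsampled = tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding="same")(features)
# Output head mapping the result to the (-1, 1) interval with tanh.
bokeh = tf.keras.layers.Conv2D(3, 3, padding="same", activation="tanh")(upsampled)
```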

Figure 4. Sample visual results obtained with the proposed deep learning method. Best zoomed on screen.

3.2. Training Details

Our initial results demonstrated that though adding the pre-computed depth map to the input data does not change the results radically (Fig. 6), it helps to refine the shape of the blurred image area and to deal with the most complex photo scenes. Therefore, in all subsequent experiments the estimated depth map was used by default.

All model levels were trained with the L1 loss except for the output level 1, where the following combination of the loss functions was used:

L_{Level 1} = L_{L1} + (1 − L_{SSIM}) + 0.01 · L_{VGG},

where L_{SSIM} is the structural similarity (SSIM) loss [18] and L_{VGG} is the perceptual VGG-based [9] loss function. The above coefficients were chosen based on the results of the preliminary experiments on the considered EBB! dataset. We should emphasize that each level is trained together with all (already pre-trained) lower levels to ensure a deeper connection between the layers.

The model was implemented in TensorFlow¹ and was trained on a single Nvidia Tesla V100 GPU with a batch size ranging from 8 to 50 depending on the training scale. The parameters of the model were optimized for 20–100 epochs using the ADAM [10] algorithm with a learning rate of 5e-5. The entire PyNET model consists of 47.5M parameters, and it takes on average 143 milliseconds to produce one bokeh image of size 1024×1536 pixels on the above mentioned GPU.

¹ https://github.com/aiff22/PyNET-Bokeh
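A minimal sketch of the level-1 loss above, assuming images scaled to [0, 1]. tf.image.ssim is used for the SSIM term (reference [18] describes the multiscale variant, which could be substituted via tf.image.ssim_multiscale), and the VGG19 layer chosen for the perceptual term is our assumption, as the paper only states that a VGG-based loss [9] is used.

```python
import tensorflow as tf

# Frozen VGG19 features for the perceptual term; the specific layer is an assumption.
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
vgg.trainable = False
feat = tf.keras.Model(vgg.input, vgg.get_layer("block5_conv4").output)

def level1_loss(pred, target):
    """L_Level1 = L_L1 + (1 - L_SSIM) + 0.01 * L_VGG, for images scaled to [0, 1]."""
    l1 = tf.reduce_mean(tf.abs(pred - target))
    ssim = tf.reduce_mean(tf.image.ssim(pred, target, max_val=1.0))
    pred_f = feat(tf.keras.applications.vgg19.preprocess_input(pred * 255.0))
    target_f = feat(tf.keras.applications.vgg19.preprocess_input(target * 255.0))
    vgg_term = tf.reduce_mean(tf.square(pred_f - target_f))
    return l1 + (1.0 - ssim) + 0.01 * vgg_term

# Lower levels would use the plain L1 term; optimization uses Adam at the stated rate.
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
```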

Figure 5. The quality of the rendered boundary regions between in and out of focus areas for different image scenes.

4. Experiments

In this section, we evaluate the quantitative and qualitative performance of the proposed solution on the real bokeh effect rendering problem. In particular, our goal is to answer the following questions:

• How good is the visual quality of the rendered bokeh photos and how delicate are the transitions to the blurred areas on the images.

• Is the network able to deal with complex image scenes, and what is the percentage of failures in such cases.

• How well our solution performs numerically and perceptually compared to the commonly used deep learning models tuned for this problem.

• How fast can the considered PyNET-based model run and render bokeh photos on mobile devices.

• How well the proposed solution performs compared to the commonly used “Portrait Mode” in the Google Pixel Camera app that is using dual-pixel auto-focus hardware for estimating the depth map of the image.

In order to answer these questions, we performed a wide range of experiments, whose results are described in detail in the following five sections.

4.1. Analyzing the Results

Since the perceptual quality of the rendered bokeh images is our primary target, we start the experiments with the analysis of the obtained visual results. Figure 4 shows sample bokeh images rendered with the proposed PyNET model, the original wide depth-of-field and the target Canon photos. The produced images exhibit a pronounced bokeh effect that is globally very close to the one on DSLR photos, though some local low-level details are rendered differently. The main focus subject and the strength of the blur are selected correctly based on the distance to the in-focus image plane. The bokeh on the reconstructed images looks quite natural, without any strong artifacts or corruptions at both the local and global levels. There is also no notable sharpness degradation in the in-focus image regions.

Figure 5 displays the boundary regions between in and out of focus image parts. As one can see, the transitions to the blurred areas are rendered quite accurately and gently, without mixing the two regions and without halo artifacts. The problems occur only in relatively complex borderline areas where it is hard to distinguish between the fore- and background objects, which results in distinctive under- or over-blurred image segments.

According to the obtained visual results, around 5-8% of the rendered bokeh images might contain strong visible artifacts. The problems usually happen in complex scenes with multiple objects that either look similar or are occluded by each other (Fig. 7). In such cases, it is difficult to separate the corresponding objects without knowing the detailed semantic information or an accurate depth map estimation that can be potentially inferred from multiple cameras with the conventional stereo vision algorithms.

Figure 6. Sample visual results obtained without (2nd row) and with (3rd row) the estimated depth map.

4.2. Analyzing the PyNET Architecture

When analyzing the obtained image results, one of the key questions that we had is the role of different PyNET levels in rendering the final bokeh image. To understand this, an experiment was conducted where we were disabling PyNET levels from the 7th (lowest) to the 4th one by setting their outputs / activations to zeros, and observing the changes in the produced images. The corresponding results are presented in Fig. 8. It turned out that this is one of the rare cases when it is possible to directly identify the role of different layers in data processing and generation of the reconstructed image. We provide the description of our findings below:

Level 7 is responsible for the overall image brightness, blur strength in the background areas and blur shift. When disabling this level, the final image results do not change significantly except for a slightly shifted and less / more blurred background, and a bit different global image brightness (Fig. 8, b).

Level 6 is generally refining the outputs obtained from level 7, the same changes can be also observed here (Fig. 8, c).

Level 5 produces a coarse blur mask that is defining what image areas should be blurred and what regions should stay in focus (Fig. 8, d).

Level 4 provides a detailed refined blur mask that is used by the upper level to generate the actual results. When disabling this level, the network produces images that are completely blurred (Fig. 8, d).

Level 3 is responsible for generating the actual bokeh effect. When it does not receive any information from the lower layer, it is just blurring the entire image with the default blur (Fig. 8, d). We should emphasize that the out-of-focus areas on the final images (Fig. 8, a) are looking quite different from the ones on the totally blurred images, which shows that the type and strength of the bokeh effect is also learned and provided by levels 4 and 5.

The obtained results are generally following our expectations described in Section 3.1. A more advanced study might also reveal how one can alter the features from the lower layers to change the final reconstructed image in a predictable way, which can be used, e.g., for explicitly adjusting the strength of the bokeh effect.
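The ablation above zeroes the activations of a given level before they reach the next one. The toy example below only illustrates the mechanism on a two-layer stand-in model (not the real PyNET graph): the lower "level" is disabled by replacing its output with zeros.

```python
import tensorflow as tf

# Toy stand-in for two adjacent PyNET levels: the lower level produces coarse
# features that are consumed by the upper level.
lower_level = tf.keras.layers.Conv2D(8, 3, padding="same", activation=tf.nn.leaky_relu)
upper_level = tf.keras.layers.Conv2D(3, 3, padding="same", activation="tanh")

def forward(image, disable_lower=False):
    feats = lower_level(image)
    if disable_lower:
        # "Disable" the level by setting its activations to zeros,
        # analogous to the ablation described above.
        feats = tf.zeros_like(feats)
    return upper_level(tf.concat([image, feats], axis=-1))

img = tf.random.uniform([1, 64, 64, 3])
normal = forward(img)
ablated = forward(img, disable_lower=True)
```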

Figure 7. Examples of strong visual artifacts present on the EBB! test photos.

Figure 8. Visual results obtained by disabling PyNET’s levels. From left to right: a) all levels are enabled, b) level 7 is disabled, c) levels 6+ are disabled, d) levels 5+ are disabled, e) levels 4+ are disabled. Best zoomed on screen.

4.3. Quantitative and Qualitative Evaluation

In this section, we compare our solution to the current state-of-the-art architectures that were designed and tuned specifically for the considered problem. The following models are used in the next experiments:

1. Zheng et al. [4]: a multiscale predictive filter CNN with the gate fusion and the constrained predictive filter blocks;

2. Dutta et al. [4]: a multiscale CNN trained to combine several images blurred with different Gaussian filters;

3. Purohit et al. [14]: a modified U-Net based architecture;

4. Xiong et al. [4]: an ensemble of five U-Net based models with a residual attention mechanism;

5. Yang et al. [4]: two stacked bokehNet CNN models with additional memory blocks.

All models were trained on the same EBB! dataset. For each network, we computed the PSNR, SSIM [18] and LPIPS [20] metrics on the test subset of the dataset. Besides that, we conducted a user study involving several hundred participants (using Amazon's MTurk platform) evaluating the visual results of all considered methods. The users were asked to rate the full resolution bokeh images rendered with each method by selecting one of the five quality levels (0 - almost identical, 4 - mostly different) in comparison with the corresponding target Canon photos. The expressed preferences were then averaged per each approach to obtain the final Mean Opinion Scores (MOS). The numerical results of these experiments are presented in Table 1, and sample visual results of all methods are shown in Fig. 9.

The first thing that should be noted is that, contrary to our expectations, none of the numerical metrics works well on the considered bokeh effect rendering problem. In particular, the best LPIPS results were obtained by Purohit et al., however, this solution is producing images with very strong artifacts in the bokeh area. The best PSNR and SSIM results were achieved by Xiong et al., though this approach is unfortunately also blurring in-focus objects and has significant issues with rendering the boundaries between the in and out of focus areas. Therefore, we had to rely on the actual visual results produced by all methods and the user study. According to both criteria, the PyNET model was able to significantly outperform the rest of the solutions. Compared to the second best approach, it is rendering the borders of in-focus objects considerably more accurately, especially in complex image scenes, and produces a slightly stronger bokeh effect. We should also highlight that the PyNET model is taking the input images downscaled by a factor of 2, thus also learning to perform image upscaling. Though training on the original resolution images might lead to slightly better numerical results, this would also significantly increase the training and inference times, while we did not observe any notable difference in the produced visual results.
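For reference, the fidelity metrics reported in Table 1 can be computed with tf.image; this is a minimal sketch assuming images in [0, 1]. LPIPS [20] requires its separate reference implementation, and the MOS values come from the user study, so neither is shown here.

```python
import tensorflow as tf

def fidelity_metrics(pred, target, max_val=1.0):
    """PSNR and SSIM [18] between a rendered and a target bokeh image in [0, 1]."""
    psnr = tf.reduce_mean(tf.image.psnr(pred, target, max_val=max_val))
    ssim = tf.reduce_mean(tf.image.ssim(pred, target, max_val=max_val))
    return float(psnr), float(ssim)

# Illustration on random tensors; the real evaluation runs over the 200 EBB! test pairs.
pred = tf.random.uniform([1, 1024, 1536, 3])
target = tf.random.uniform([1, 1024, 1536, 3])
print(fidelity_metrics(pred, target))
```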

Team                 Hardware, GPU          Runtime, s   Normalized Runtime, s   PSNR↑    SSIM↑     LPIPS↓    MOS↓
PyNET                Nvidia Tesla V100      0.14         0.14                    23.28    0.8780    0.2438    0.85 (1)
Zheng et al. [4]     GeForce GTX 2080 Ti    0.27         0.26                    23.44    0.8874    0.2452    0.87 (2)
Dutta et al. [4]     GeForce GTX 1080 Ti    0.8          0.57                    22.14    0.8633    0.2811    0.88 (3)
Purohit et al. [14]  Nvidia TITAN X         2.5          1.15                    23.62    0.8836    0.2248    0.93 (4)
Xiong et al. [4]     GeForce GTX 1080 Ti    0.89         0.63                    23.93    0.8917    0.2300    0.98 (5)
Yang et al. [4]      Nvidia TITAN           0.6          0.15                    23.18    0.8851    0.2467    1.02 (6)

Table 1. The results on the EBB! test subset obtained with different solutions. The runtime of all methods was normalized based on the aggregated GPU performance index reported at: http://ai-benchmark.com/ranking_deeplearning.html

Figure 9. Visual results obtained with 6 different methods. From left to right, top to bottom: the original wide depth-of-field image, Yang et al.[4], Xiong et al.[4], Purohit et al.[14], Dutta et al.[4], Zheng et al.[4], our PyNET-based solution and the target Canon photo.

4.4. The Latency on Mobile Devices

While the deployment of the proposed solution on mobile devices is not the main focus of this paper, we still decided to perform the corresponding experiment to show the feasibility of running the PyNET on smartphones. For this, we chose four mainstream high-end mobile chipsets: the Qualcomm Snapdragon 855, the Samsung Exynos 9820 and the HiSilicon Kirin 980, all released at the end of 2018, and the Qualcomm Snapdragon 845 presented one year earlier. To run the pre-trained PyNET model, it was first converted to the TensorFlow Lite format [13], and then launched on the above chipsets using the corresponding Android library. The TensorFlow Lite GPU delegate [11] was chosen to accelerate the model on GPUs: since it is independent of the vendors' NN drivers, it can run on any mobile device with OpenCL support. We used the same settings as in [6] to measure the runtime of the model; the obtained results are presented in Table 2.

When running the model without any modifications, it took almost 13 and 17 seconds on the Kirin 980 and the Exynos 9820 SoCs, respectively, while on both Qualcomm chipsets the model failed with an out-of-memory (OOM) error. The reason for this is that instance normalization layers are still not supported adequately by the TensorFlow Lite framework, and the corresponding ops are computed on the CPU, thus increasing the inference time and memory consumption dramatically due to additional CPU-GPU synchronization. After disabling instance normalization, the model was able to process one 1024×1536 pixel photo in less than 5.5 seconds on all chipsets. The results on the current flagship mobile SoCs would be even better and should generally not exceed 3 seconds per image, making the solution ready to be deployed on mobile devices. We should also note that with several small modifications of the current architecture, one can achieve a huge speed-up at the cost of a slightly degraded accuracy, however this topic is out of the scope of this paper.
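A minimal sketch of the conversion step described above, assuming the trained network has been exported as a TensorFlow SavedModel (the path below is a placeholder). The GPU delegate itself is attached on the device through the Android library; the Python interpreter here only verifies the converted graph on the host.

```python
import tensorflow as tf

# Convert the trained network to the TensorFlow Lite format [13];
# "pynet_bokeh_saved_model" is a placeholder path for an exported SavedModel.
converter = tf.lite.TFLiteConverter.from_saved_model("pynet_bokeh_saved_model")
tflite_model = converter.convert()
with open("pynet_bokeh.tflite", "wb") as f:
    f.write(tflite_model)

# On-device acceleration uses the TensorFlow Lite GPU delegate [11]; the Python
# interpreter below only sanity-checks the converted graph on the host.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
```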

Mobile Chipset                       Snapdragon 855   Exynos 9820   Kirin 980   Snapdragon 845
GPU Model                            Adreno 640       Mali-G76      Mali-G76    Adreno 630
GPU Cores                            -                12            10          -
GPU Frequency                        600 MHz          600 MHz       750 MHz     710 MHz
PyNET with Instance Norm, seconds    -                17.3          13.4        -
PyNET w/o Instance Norm, seconds     4.1              3.6           4.3         5.4

Table 2. Average processing time for images of resolution 1024×1536 pixels obtained on several mainstream high-end mobile SoCs. In each case, the model was running directly on the corresponding GPU using the OpenCL-based TensorFlow Lite GPU delegate [11].

Figure 10. Visual results obtained using the proposed PyNET model (top) and the Google Pixel Camera application (bottom). Both approaches were applied to the same photos, Pixel Camera was additionally using dual-pixel auto-focus hardware. Best zoomed on screen.

4.5. Comparison to the Google Pixel Camera

In the last section of the paper, we compare bokeh images rendered using the PyNET model with the photos obtained using the Google Pixel Camera capturing images in the “Portrait Mode” [17]. The latter solution is utilizing the camera's dual-pixel auto-focus system for an accurate estimation of the depth map that is then used to render the bokeh effect on the photos. To compare these two approaches, we captured a large number of photos with the Google Pixel Camera installed on the Google Pixel 2 smartphone: it was saving both the original wide depth-of-field image and the rendered bokeh photo, and the first image from each pair was then processed with the PyNET model trained on the EBB! dataset. The visual results produced by both solutions are presented in Figure 10.

When considering the close-up photos (Fig. 10, 1st row), PyNET is able to render images with a much stronger and more pleasing bokeh effect than the Google Pixel Camera. For more distant objects, the Pixel Camera is often producing better results due to the more accurate estimation of the distance to the objects and their boundaries (2nd row). When shooting more complex distant scenes, both the PyNET and the Google Pixel Camera are often unable to provide good results (3rd row), though the latter one is producing unnaturally blurred photos more often. The conducted user study considering all collected photos demonstrated that, when rescaled to the same resolution, the images produced by the PyNET were preferred over the results of the Google Pixel Camera in 53.7% of the cases. We should note that the above comparison is not entirely fair since the Google Pixel Camera is able to process much larger images, though it is also using additional camera hardware and was specifically tuned for the considered Pixel phone. However, the results indicate that the PyNET model is potentially able to outperform the existing commercial solutions when modifying its architecture for processing larger images, especially if using an accurate depth map that can be estimated using data from multiple cameras installed in the majority of modern mobile devices. We leave this challenge for future work that can be done taking into account the above conditions.

5. Conclusions

In this paper, we presented a novel approach for the realistic bokeh effect rendering task. To simulate the natural camera bokeh effect, we proposed to learn it directly from the real photos captured by a high-end DSLR camera. For this, we first collected a large-scale EBB! dataset consisting of aligned wide and shallow depth-of-field image pairs captured using the Canon 7D camera and 50mm f/1.8 fast lens. Then, we trained the proposed PyNET-based solution on the considered data and achieved significant qualitative improvements over the existing deep learning solutions tuned for this problem, as confirmed by the conducted user study. We demonstrated that the proposed solution requires less than 5 seconds for processing one 1024×1536 pixel image on all high-end mobile chipsets, and additionally showed that the rendered bokeh photos are comparable to the results of the Google Pixel Camera application when comparing both methods on images of the same resolution. We conclude that the results show the viability of our approach of an end-to-end bokeh effect rendering model, though further study is required to make it process high-resolution photos under the constraints imposed by mobile hardware.

Acknowledgements

This work was partly supported by ETH Zurich General Fund (OK) and by Amazon AWS and Nvidia grants.

References

[1] Jonathan T. Barron, Andrew Adams, YiChang Shih, and Carlos Hernández. Fast bilateral-space stereo for synthetic defocus. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4466–4474, 2015.

[2] Hyowon Ha, Sunghoon Im, Jaesik Park, Hae-Gon Jeon, and In So Kweon. High-quality depth from uncalibrated small motion clip. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5413–5421, 2016.

[3] Andrey Ignatov, Nikolay Kobyshev, Radu Timofte, Kenneth Vanhoey, and Luc Van Gool. DSLR-quality photos on mobile devices with deep convolutional networks. In the IEEE International Conference on Computer Vision (ICCV), 2017.

[4] Andrey Ignatov, Jagruti Patel, Radu Timofte, Bolun Zheng, Xin Ye, Li Huang, Xiang Tian, Saikat Dutta, Kuldeep Purohit, Praveen Kandula, et al. AIM 2019 challenge on bokeh effect synthesis: Methods and results. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3591–3598. IEEE, 2019.

[5] Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. AI benchmark: Running deep neural networks on android smartphones. In Proceedings of the European Conference on Computer Vision (ECCV), pages 0–0, 2018.

[6] Andrey Ignatov, Radu Timofte, Andrei Kulik, Seungsoo Yang, Ke Wang, Felix Baum, Max Wu, Lirong Xu, and Luc Van Gool. AI benchmark: All about deep learning on smartphones in 2019. arXiv preprint arXiv:1910.06663, 2019.

[7] Andrey Ignatov, Luc Van Gool, and Radu Timofte. Replacing mobile camera ISP with a single deep learning model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2020.

[8] Lens Blur in the new Google Camera app. https://ai.googleblog.com/2014/04/lens-blur-in-new-google-camera-app.html.

[9] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.

[10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[11] Juhyun Lee, Nikolay Chirkov, Ekaterina Ignasheva, Yury Pisarchyk, Mogan Shieh, Fabio Riccardi, Raman Sarokin, Andrei Kulik, and Matthias Grundmann. On-device neural net inference with mobile GPUs. arXiv preprint arXiv:1907.01989, 2019.

[12] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.

[13] TensorFlow Lite. https://www.tensorflow.org/lite.

[14] Kuldeep Purohit, Maitreya Suin, Praveen Kandula, and Rajagopalan Ambasamudram. Depth-guided dense dynamic filtering network for bokeh effect rendering. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3417–3426. IEEE, 2019.

[15] Xiaoyong Shen, Aaron Hertzmann, Jiaya Jia, Sylvain Paris, Brian Price, Eli Shechtman, and Ian Sachs. Automatic portrait segmentation for image stylization. In Computer Graphics Forum, volume 35, pages 93–102. Wiley Online Library, 2016.

[16] Xiaoyong Shen, Xin Tao, Hongyun Gao, Chao Zhou, and Jiaya Jia. Deep automatic portrait matting. In European Conference on Computer Vision, pages 92–107. Springer, 2016.

[17] Neal Wadhwa, Rahul Garg, David E. Jacobs, Bryan E. Feldman, Nori Kanazawa, Robert Carroll, Yair Movshovitz-Attias, Jonathan T. Barron, Yael Pritch, and Marc Levoy. Synthetic depth-of-field with a single-camera mobile phone. ACM Transactions on Graphics (TOG), 37(4):64, 2018.

[18] Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, 2003, volume 2, pages 1398–1402, Nov 2003.

[19] Fisher Yu and David Gallup. 3D reconstruction from accidental motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3986–3993, 2014.

[20] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.

[21] Bingke Zhu, Yingying Chen, Jinqiao Wang, Si Liu, Bo Zhang, and Ming Tang. Fast deep matting for portrait animation on mobile phone. In Proceedings of the 25th ACM International Conference on Multimedia, pages 297–305, 2017.