Multi-View Image Fusion
Marc Comino Trinidad¹, Ricardo Martin Brualla², Florian Kainz², Janne Kontkanen²
¹Polytechnic University of Catalonia, ²Google

Figure 1: We present a method for multi-view image fusion that is applicable to a variety of scenarios: a higher-resolution monochrome image is colorized with a second color image (top row), two color images with different exposures are fused into a lower-noise HDR image (middle row), and a high-quality DSLR image is warped to the lower-quality stereo views captured by a VR camera (bottom row).

Abstract

We present an end-to-end learned system for fusing multiple misaligned photographs of the same scene into a chosen target view. We demonstrate three use cases: 1) color transfer for inferring color for a monochrome view, 2) HDR fusion for merging misaligned bracketed exposures, and 3) detail transfer for reprojecting a high-definition image to the point of view of an affordable VR180 camera. While the system can be trained end-to-end, it consists of three distinct steps: feature extraction, image warping, and fusion. We present a novel cascaded feature extraction method that enables us to synergetically learn optical flow at different resolution levels. We show that this significantly improves the network's ability to learn large disparities. Finally, we demonstrate that our alignment architecture outperforms a state-of-the-art optical flow network on the image warping task when both systems are trained in an identical manner.

1. Introduction

In this paper we focus on the problem of fusing multiple misaligned photographs into a chosen target view. Multi-view image fusion has become increasingly relevant with the recent influx of multi-camera mobile devices.

The form factor of these devices constrains the size of lenses and sensors, which limits their light-capturing ability. Cameras with larger lens apertures and larger pixels capture more photons per pixel and thus show less prominent photon shot noise. This is why mobile cameras have lagged behind large DSLR systems in quality. In recent years computational photography has narrowed the gap significantly [11, 17, 19], but the fundamentals have not changed: more light means better images.

Lately it has become common to fit two, three, or even five cameras [1, 6] into a single mobile device. Using multiple cameras significantly improves the light-gathering ability of the device. At a minimum, two cameras capture twice the light of a single camera, but often it is possible to do better by recording different aspects of the scene with each camera and then fusing the results to get the best of both worlds.
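The shot-noise argument above can be made concrete with a short simulation. The following NumPy sketch is ours, not the authors': photon arrivals are modeled as Poisson, so the signal-to-noise ratio grows with the square root of the captured photon count, and a device that gathers twice the light gains roughly a factor of 1.4 in SNR.

```python
import numpy as np

rng = np.random.default_rng(0)
pixels = 1_000_000          # number of simulated pixels
mean_photons = 100.0        # photons per pixel for a single small camera

def snr(photons_per_pixel):
    # Photon arrivals are Poisson distributed, so SNR = mean / std ~ sqrt(N).
    samples = rng.poisson(photons_per_pixel, pixels)
    return samples.mean() / samples.std()

print("one camera :", snr(mean_photons))       # ~ sqrt(100)  = 10
print("two cameras:", snr(2 * mean_photons))   # ~ sqrt(200) ~ 14.1, about 1.4x better
```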
One can envision a number of applications that fall into this class, for example fusing infrared and color images [33], HDR fusion using bracketed exposures [28], or fusing wide-angle and telephoto views for super-resolution within the central region of the wide-angle image. In this paper we show an end-to-end learned system that is suitable for a number of multi-view fusion applications. We demonstrate its effectiveness in three compelling multi-camera designs:

Color transfer: Monochrome cameras, such as those available in some smartphones [1, 3], capture roughly three times the number of photons compared to cameras with color mosaic sensors, and they do not exhibit artifacts introduced by the mosaic. We explore fusing a color image and a monochrome image taken from slightly different viewpoints to combine the desirable aspects of both cameras.

HDR fusion: We explore an HDR camera design in which two cameras take photographs simultaneously but with different exposure settings. We show that fusing the images reduces noise and increases the dynamic range.

Detail transfer: We explore a novel architecture for building a high-quality VR180 [2] camera, where a high-resolution image taken by a DSLR camera is warped to the points of view of a cheaper VR180 camera with a field of view close to 180 degrees and a lens separation that matches the human interpupillary distance (IPD). The sizes of the lenses and bodies of DSLR cameras make it difficult to record VR180 with a pair of DSLRs and achieve a small enough IPD; our design sidesteps this issue.

Our system can be trained end-to-end, but it consists of three conceptual stages: feature extraction, warping, and fusion. We use a novel cascaded feature pyramid that enables synergetic learning of image alignment across different scales. We show that this architecture has a dramatic impact on learning alignment over large disparities. Instead of training the network to predict optical flow and using that for alignment, we employ the idea of task-oriented flow [36] and optimize directly for our use cases, since this has proven to produce better results.
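Because the warp is trained jointly with fusion rather than supervised as optical flow, the key requirement is that the warping operation be differentiable, so the fusion loss can shape the predicted flow. The sketch below is a minimal PyTorch illustration of such a differentiable bilinear warp; the function name warp_with_flow and all implementation details are our assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(source, flow):
    """Bilinearly warp `source` (N, C, H, W) by a per-pixel `flow` (N, 2, H, W) in pixels.

    grid_sample is differentiable, so gradients from a downstream fusion loss
    reach whatever network predicted `flow`; the flow is therefore optimized
    for the end task rather than for flow accuracy itself.
    """
    _, _, h, w = source.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=source.dtype, device=source.device),
        torch.arange(w, dtype=source.dtype, device=source.device),
        indexing="ij",
    )
    x = xs.unsqueeze(0) + flow[:, 0]   # horizontal displacement in pixels
    y = ys.unsqueeze(0) + flow[:, 1]   # vertical displacement in pixels
    # Normalize sampling locations to [-1, 1] as grid_sample expects.
    grid = torch.stack((2.0 * x / (w - 1) - 1.0, 2.0 * y / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(source, grid, mode="bilinear", align_corners=True)

# Sanity check: a zero flow field leaves the image unchanged.
img = torch.rand(1, 3, 64, 64)
assert torch.allclose(warp_with_flow(img, torch.zeros(1, 2, 64, 64)), img, atol=1e-5)
```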
We demonstrate the performance of our system with an ablation study and compare it with a state-of-the-art optical flow network [32]. We also compare our HDR fusion technique against Kalantari et al. [23], obtaining comparable results. Finally, we provide a large number of high-resolution examples in the supplementary material.

To summarize, the main contributions of this work are: 1) a novel end-to-end CNN architecture for merging information from multiple misaligned images; 2) an image warping module that employs a cascaded feature pyramid to learn optical flow on multiple resolution levels simultaneously, which we show produces better results than state-of-the-art optical flow for multi-view fusion; and 3) a demonstration of the proposed architecture in three different scenarios: color transfer, HDR fusion, and detail transfer.

2. Related Work

2.1. High-Dynamic Range Imaging

The seminal work of Debevec and Malik [14] presented a model of a camera's pixel response that allows fusing multiple exposures into an HDR image. Although they assumed a static camera and scene, the technique has recently been introduced to mobile cameras, where a stack of frames is fused to generate an HDR composite [11, 17, 19]. This works best if the misalignment between the frames is moderate, which is not the case in some of our applications (we show this in the supplementary material).

Kalantari and Ramamoorthi [23] use a neural network to generate HDR images from exposure stacks of dynamic scenes and corresponding precomputed flow fields. Wu et al. [35] propose a similar technique that does not require computing optical flow. Others have focused on burst image fusion using either recurrent networks [18] or permutation-invariant networks [8]. In contrast, our proposed method jointly estimates a warp and fuses the different images to generate a high-quality composite.

2.2. Image Colorization

There is a large body of literature on single-image colorization [21, 37]. Most of the presented methods attempt to generate artificial but plausible colors for grayscale images.

Jeon et al. [22] study stereo matching between a color and a monochrome image in order to compute pixel disparity. They convert the monochrome image to YUV (luminance/chroma) format and populate the chroma (U and V) channels with information from the color input, using the previously computed disparity.

Wang et al. [33] propose colorizing infrared and ultraviolet flash images in order to obtain low-noise pictures in low-light conditions. However, their alignment is based on optical flow [9], and their neural network also needs to learn to account for misregistration artifacts, whereas our network aligns and colorizes at the same time.

2.3. VR Imaging

For virtual reality applications one would ideally capture a complete light field video of a scene. Multiple camera designs have been proposed towards this end, including rings [9] or spheres [26] of outward-facing cameras, or planar camera arrays [34]. Many of these systems do not directly produce stereo views that match the human interpupillary distance, but rely on view interpolation to generate novel views of the scene using computational methods. Using our proposed method for multi-lens fusion, we envision creating a VR camera where we use detail transfer to project a high-quality DSLR image into the viewpoints of a VR camera that captures images with a baseline that matches the human IPD.

Figure 2: Our architecture takes inspiration from U-Net [30], with an encoder (feature extraction, Section 3.1) on the left and a decoder (fusion, Section 3.3) on the right. Since U-Net cannot efficiently align images, an additional warping module is inserted in the middle (Section 3.2). The green blocks A_n and F_k are kernels, whereas the blue blocks represent features. Blocks A_n for n = 0, 1, 2 are feature extraction kernels that are sequentially applied to each level of the image pyramid. For each level k, we concatenate the features obtained by applying A_0 on the current level, A_1 on the previous level, and A_2 on the level before the previous one, yielding features s_k for the source image and t_k for the target.
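The cascaded feature extraction of Figure 2 can be sketched in a few lines. The PyTorch code below is a schematic reading of the caption (apply A_0 at every pyramid level, cascade A_1 and A_2 over coarser levels, and concatenate per level); the class name, layer widths, and the exact placement of the pooling are our assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch):
    # Stand-in for a feature extraction kernel A_n in Figure 2 (sizes are guesses).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class CascadedFeaturePyramid(nn.Module):
    """Level k concatenates A0 applied to the current level, A1 applied to the
    previous (finer) level, and A2 applied to the level before that, each
    pooled down to level k's resolution, yielding s_k / t_k for an image."""

    def __init__(self, levels=4, ch=16):
        super().__init__()
        self.levels = levels
        self.A0 = conv_block(3, ch)
        self.A1 = conv_block(ch, ch)
        self.A2 = conv_block(ch, ch)

    def forward(self, image):
        # Image pyramid built with 2x2 average pooling, as in Figure 2.
        pyramid = [image]
        for _ in range(self.levels - 1):
            pyramid.append(F.avg_pool2d(pyramid[-1], 2))

        a0 = [self.A0(p) for p in pyramid]   # A0 at every level
        a1 = [self.A1(f) for f in a0]        # cascade: one step deeper
        a2 = [self.A2(f) for f in a1]        # cascade: two steps deeper

        features = []
        for k in range(self.levels):
            parts = [a0[k]]
            if k >= 1:
                parts.append(F.avg_pool2d(a1[k - 1], 2))   # from the previous level
            if k >= 2:
                parts.append(F.avg_pool2d(a2[k - 2], 4))   # from two levels up
            features.append(torch.cat(parts, dim=1))
        return features   # features[k] has spatial size H/2^k x W/2^k

feats = CascadedFeaturePyramid()(torch.rand(1, 3, 128, 128))
print([tuple(f.shape) for f in feats])
```

In the full system, these per-level features would be computed for both the source and the target image and passed to the flow modules F_k of the warping stage.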