
Video Depth-From-Defocus

Hyeongwoo Kim 1 Christian Richardt 1, 2, 3 Christian Theobalt 1

1 Max Planck Institute for Informatics 2 Intel Visual Computing Institute 3 University of Bath

arXiv:1610.03782v1 [cs.CV] 12 Oct 2016

Abstract

Many compelling video post-processing effects, in particular aesthetic focus editing and refocusing effects, are feasible if per-frame depth information is available. Existing computational methods to capture RGB and depth either purposefully modify the optics (coded aperture, light-field imaging), or employ active RGB-D cameras. Since these methods are less practical for users with normal cameras, we present an algorithm to capture all-in-focus RGB-D video of dynamic scenes with an unmodified commodity video camera. Our algorithm turns the often unwanted defocus blur into a valuable signal. The input to our method is a video in which the focus plane is continuously moving back and forth during capture, and thus defocus blur is provoked and strongly visible. This can be achieved by manually turning the focus ring of the lens during recording. The core algorithmic ingredient is a new video-based depth-from-defocus algorithm that computes space-time-coherent depth maps, deblurred all-in-focus video, and the focus distance for each frame. We extensively evaluate our approach, and show that it enables compelling video post-processing effects, such as different types of refocusing.

1. Introduction

Many cinematographically important video effects that normally require specific camera control during capture, such as focus and depth-of-field control, can be computationally achieved in post-processing if depth and focus distance are available for each video frame. This post-capture control is in high demand by video professionals and amateurs alike.

In order to capture video and depth in general, and specifically to enable post-capture focus effects, several methods were proposed that use specialized camera hardware, such as active depth cameras [33], light-field cameras [28], or cameras with coded optics [19].

We follow a different path and propose one of the first end-to-end approaches for depth estimation from, and focus manipulation in, videos captured with an unmodified commodity consumer camera. Our approach turns the often unwanted defocus blur, which can be controlled by the lens aperture, into a valuable signal. In theory, smaller apertures produce sharp images for scenes covering a large depth range. When using a larger aperture, only scene points close to a certain focus distance project to a single point on the image sensor and appear in focus (see Figure 2). Scene points at other distances are imaged as a circle of confusion [32]. The limited region of sharp focus around the focus distance is known as depth of field; outside of it, the increasing defocus blur is an important depth cue [22].

Unfortunately, depth of field in normal videos cannot be changed after recording, unless a method like ours is applied. Our approach (Figure 1) takes as input a video in which the focus sweeps across the scene, e.g. by manual change of lens focus. This means temporally changing defocus blur is purposefully provoked (see Equation 2). Each frame thus has a different focus setting, and no frame is entirely in focus.

Our core algorithmic contribution uses this provoked blur and performs space-time coherent depth, all-in-focus color, and focus distance estimation at each frame. We first segment the input video into multiple focus ramps, where the focus plane sweeps across the scene in one direction. The first stage of our approach (Section 4.1) constructs a focus stack video for each of them. Focus stack videos consist of a focus stack at each video frame, built by aligning adjacent video frames to the current frame using a new defocus-preserving warping technique. At each frame, the focus stack video comprises multiple images with a range of approximately known focus distances (see Appendix A), which are used to estimate a depth map in the second stage (Section 4.2) using depth-from-defocus with filtering-based regularization. The third stage (Section 4.3) performs spatially varying deconvolution to remove the defocus blur and produce all-in-focus images. The fourth stage of our approach (Section 4.4) further minimizes the remaining error by refining the focus distances for each frame, which significantly improves the depth maps and all-in-focus images in the next iteration of our algorithm. Our end-to-end method requires no sophisticated calibration process for focus distances, which allows it to work robustly in practical scenarios.

Figure 1. [Pipeline: Video Capture (continuous focus change) → Focus Sweep Video → All-In-Focus RGB-D Video → Focus Editing & Other Applications (Video Refocusing, e.g. from far to near; Scene Reconstruction; Tilt-Shift Videography).] We capture focus sweep videos by continuously moving the focus plane across a scene, and then estimate per-frame all-in-focus RGB-D videos. This enables a wide range of video editing applications, in particular video refocusing. Please refer to the paper's electronic version and supplemental video to more clearly see the defocus effects we show in our paper.

In a nutshell, the main algorithmic contributions of our paper are: (1) a new hierarchical alignment scheme between video frames of different focus settings and dynamic scene content; (2) a new approach to estimate per-frame depth maps and deblurred all-in-focus color images in a space-time coherent way; (3) a new image-guided algorithm for focus distance initialization; and (4) a new optimization method for refining focus distances. We extensively validate our method, compare against related work, and show high-quality refocusing, dolly-zoom and tilt-shift editing results on videos captured with different cameras.

2. Related Work

RGB-D Video Acquisition  Many existing approaches for RGB-D video capture use special hardware, such as projectors or time-of-flight depth cameras [33], or use multiple cameras at different viewpoints. Moreno-Noguer et al. [25] use defocus blur and attenuation of a projected dot pattern to estimate depth. Coded aperture optics enable single-image RGB-D estimation [2, 6, 19, 20, 21], but require more elaborate hardware modification. Stereo correspondence [4] or multi-view stereo approaches [46] require multiple views, for instance by exploiting motions. There are also single-view methods based on non-rigid structure-from-motion [35], which interpret clear motion cues (in particular out-of-plane) under strong scene priors, and learning-based depth estimation methods [8, 11, 36, 40]. Shroff et al. [39] shift the camera's sensor along the optical axis to change the focus within a video. They align consecutive video frames using optical flow to form a focus stack, and then apply depth from defocus to the stack. Unlike all mentioned approaches, ours works with a single unmodified video camera without custom hardware.

Depth from Focus/Defocus  Focus stacking combines multiple differently focused images into a single all-in-focus (or extended depth of field) image [31]. Depth-from-(de)focus techniques exploit (de)focus cues within a focus stack to compute depth. Focus stacking is popular in macro photography, where the large lens magnification results in a very small depth of field. By sweeping the focus plane across a scene or an object, each part of it will be sharpest in one photo, and these sharp regions are then combined into the all-in-focus image. Depth from focus additionally determines the depth of a pixel from the focus setting that produced the sharpest focus [9, 26]; however, this requires a densely sampled focus stack and a highly textured scene. Depth from defocus, on the other hand, exploits the varying degree of defocus blur of a scene point for computing depth from just a few defocused images [30, 41]. The all-in-focus image is then recovered by deconvolving the input images with the spatially varying point spread function of the defocus blur. Obviously, techniques relying on focus stacks only work well for scenes without camera or scene motion.

Suwajanakorn et al. [43] proposed a hybrid approach that stitches an all-in-focus image using motion-compensated focus stacking, and then optimizes for the depth map using depth-from-defocus. Their approach is completely automatic and even estimates camera parameters (up to an inherent affine ambiguity) from the input images. However, their approach is limited to reconstructing a single frame from a focus stack, and cannot easily be extended to videos, as this would require stitching per-frame all-in-focus images. Our approach is tailored for videos, not just single images.

Refocusing Images and Videos  Defocus blur is an important perceptual depth cue [10, 22], thus refocusing is a powerful, appearance-altering effect. However, just like RGB-D video capture, all approaches suitable for refocusing videos require some sort of custom hardware, such as special lenses [24, 28], coded apertures [19] or active lighting [25]. Single-image refocusing methods are difficult to extend to videos as they rely on multiple captures from the same view, for example for depth from (de)focus [43]. A single image captured with a light-field camera with special optics [28, 44] can be virtually refocused [14, 27], but the light-field image often has a drastically reduced spatial resolution. Miau et al. [24] use a deformable lens to quickly and repeatedly sweep the focus plane across the scene while recording video at 120 Hz. They refocus video by selecting appropriately focused frames. Some approaches exploit residual micro-blur that is hard to avoid even in a photograph set to be in focus everywhere; it can be used for coarse depth estimation [37], or removed entirely [38]. Defocus deblurring is also related to motion deblurring [7, 13, 45], but their characteristics differ; motion blur is, for instance, mostly depth-independent.

3. Preliminaries

Both our lens model and aspects of video recording influence algorithm design.

Defocus Model  We assume a standard video camera with a finite-aperture lens that produces a limited depth of field. [Figure 2. Thin-lens model and circle of confusion (object, focus plane, lens, image sensor; circle of confusion (coc), aperture, focus distance, object distance).] According to the thin-lens model (Figure 2), the amount of defocus blur is quantified by the diameter c of the circle of confusion [32]:

    c = A f |D − F| / (D (F − f)) = f² |D − F| / (N D (F − f)),    (1)

where A = f/N is the diameter of the aperture, f is the focal length of the lens, N is the f-number of the aperture, D is the depth of a scene point, and F is the focus distance. We assume fixed aperture and focal length, and a focus distance F changing over time. Therefore, the defocus blur of a 3D point only depends on its depth D and the focus distance F, which we express as the point-spread function K(D, F) corresponding to the circle of confusion in Equation 1. We model the color of a defocused image V at a pixel x using

    V(x) = (K(D(x), F) ∗ I)(x),    (2)

where I denotes the all-in-focus image. Note that K is spatially varying, because each pixel x may have a different depth D(x). For brevity, we hence omit the pixel index x.

Video Recording  We record our focus sweep input video simply by manually adjusting the focus on the lens, roughly following a sinusoidal focus distance curve, leading to several focus ramps (see Figure 3). The exact focus distance for each video frame is unknown; reading it out from the camera is difficult in practice and may require low-level modification of the firmware. We use the Magic Lantern software¹ for Canon EOS DSLR cameras to record timestamped lens information at about 4 Hz. In practice, focus distance values are not measured for most frames, timestamps may not be exactly aligned with the time of frame capture, and the recorded focus distances are quantized and not fully accurate. There is also natural variation in the focus distance curves, as people cannot exactly reproduce a curve. Therefore, our algorithm uses the sparsely recorded lens information only as a guide and explicitly optimizes for the dense, correct focus distances at every frame (Section 4.4).

¹http://www.magiclantern.fm

4. All-In-Focus RGB-D Video Recovery

Given a video V with frames {V_t}_{t∈T} containing one or more focus sweeps, we formulate our algorithm as a joint optimization framework that seeks the optimal depth maps D_t, all-in-focus images I_t, and focus distances F_t for all frames t∈T. Let us assume that W_{s→t}(·) is a warping function that aligns an image at time s with time t, while preserving the original defocus blur (explained in Section 4.1). Then, we can construct a focus stack at each frame t by warping all input video frames to it using {W_{s→t}(V_s)}_{s∈T} (in practice, we only warp keyframes, as explained later). We seek the optimal depth map D_t, all-in-focus image I_t, and focus distances {F_t}_{t∈T} which best reproduce the focus stack at frame t with the defocus model in Equation 2. D_t, I_t and F_t, for t∈T, are the unknowns we solve for. The core ingredient of our joint optimization is a data term that penalizes the defocus model error of the focus stack at all frames:

    E_data = Σ_{t∈T} Σ_{s∈T} w_{t,s} ‖K(D_t, F_s) ∗ I_t − W_{s→t}(V_s)‖².    (3)

We introduce the weighting term w_{t,s} to give lower weights to pairs of frames that are further apart, and which hence need warping over longer temporal distances. In our implementation, we use a Gaussian function w_{t,s} = exp(−|t−s|²/2σ_w²) with σ_w set to 85 percent of the length of each focus ramp.

Simultaneously estimating depth, deblurring the input video and optimizing focus distances from purposefully defocused and temporally misaligned images is highly challenging; many invariance assumptions used by correspondence-finding approaches break down in this case. To solve this joint optimization problem efficiently, we decompose it into four subproblems, or stages, that we solve iteratively: defocus-preserving alignment (Section 4.1), depth estimation (4.2), defocus deblurring (4.3), and focus refinement (4.4). Initialization and implementation details can be found in Appendix A. Each subproblem requires solving for a subset of the unknowns by minimizing a cost functional like Equation 3, with additional regularization terms explained later.

We adopt a multi-scale, coarse-to-fine approach. At each resolution level, we perform three iterations of the four stages, each of which is solved for the entire length of the input video. [Figure 4. Downsampling (×8) effectively reduces defocus difference, helping correspondence finding; shown for near-focus (first frame) and far-focus (last frame) input frames.]
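The thin-lens relations above can be made concrete in a few lines. The following sketch (our illustration, not the paper's implementation) evaluates the circle-of-confusion diameter of Equation 1 and builds the corresponding disc PSF K(D, F) of Equation 2; the 50 mm f/2 lens parameters are illustrative assumptions, not the paper's camera settings.

```python
import numpy as np

def coc_diameter(D, F, f=0.05, N=2.0):
    """Circle-of-confusion diameter c (Equation 1), thin-lens model.

    D: scene depth, F: focus distance, f: focal length (all in metres),
    N: f-number. f = 50 mm and N = 2.0 are illustrative values only."""
    return f**2 * np.abs(D - F) / (N * D * (F - f))

def disc_psf(c_px, radius_max=15):
    """Normalised disc point-spread function with diameter c_px in pixels,
    a simple model of the aperture shape used in Equation 2."""
    r = max(c_px / 2.0, 1e-6)
    y, x = np.mgrid[-radius_max:radius_max + 1, -radius_max:radius_max + 1]
    k = (x**2 + y**2 <= r**2).astype(np.float64)
    return k / k.sum()

# A point at the focus distance is imaged sharply (c = 0) ...
assert coc_diameter(2.0, 2.0) == 0.0
# ... and blur grows monotonically as depth departs from F.
assert coc_diameter(3.0, 2.0) < coc_diameter(4.0, 2.0)
```

As the assertions indicate, c vanishes exactly at the focus distance and grows with |D − F|, which is the depth cue the rest of the pipeline exploits.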

Figure 3. Overview of our approach. [Pipeline: Focus Sweep Video (per focus ramp) → Focus Alignment (Section 4.1) → Depth Estimation (Section 4.2) → Defocus Deblurring (Section 4.3) → Focus Refinement (Section 4.4) → All-In-Focus RGB-D Video; intermediate results over time are correspondence fields, depth maps, all-in-focus images and focus distances, computed over a coarse-to-fine pyramid from initial focus distances.] For each video frame, we first align neighboring frames to it to construct a focus stack. We then estimate spatially and temporally consistent depth maps from the focus stacks, and compute all-in-focus images using non-blind deconvolution. Finally, we refine the focus distances for all frames. We perform these steps in a coarse-to-fine manner and iterate until convergence.

The multi-resolution approach improves convergence, but more importantly, any focus difference between two video frames is reduced when the images are downsampled in the pyramid (see Figure 4). This insight enables us to compute reliable initial correspondence fields with less influence from different defocus blurs. Once all parameters are estimated at a coarse level, the next-higher level of the pyramid uses them as initialization for its iterations.

4.1. Patch-Based Defocus-Preserving Alignment

Here, we construct a focus stack for each frame of the input video using patch-based, defocus-preserving image alignment. The results are the warping functions W_{s→t} for pairs (s, t) of frames, while all other unknowns (I, D and F) remain constant. Two frames in the focus sweep video, V_s and V_t, generally differ in defocus blur and maybe scene or camera motion. The main challenge of the defocus-preserving alignment is to compute a reliable correspondence field that is robust to both complex motion and defocus changes between the frames.

Using standard correspondence techniques, such as optical flow, to directly warp the input video frame V_s to V_t is prone to failure, because the different defocus blurs in the two images are not modeled by standard matching costs. Optical flow will try to explain differences in defocus blur using flow displacements, which produces erroneous correspondences (see Figure 5).

Figure 5. Comparison of focus stack alignment from near to far focus (Figure 4): correspondences for the BOOK dataset (from the first frame to the last frame), for optical flow [Sun et al.] and PatchMatch [Barnes et al.], without and with refocusing. Without refocusing to match blur levels, both optical flow and PatchMatch fail. With refocusing, PatchMatch improves on flow (used by Shroff et al. [39]).

The solution is to compensate any focus differences before computing correspondences [39]. We therefore refocus the target frame V_t to match the focus distance F_s of the source frame s using the refocusing operator R(V_t, F_s) = K(D_t, F_s) ∗ I_t. In the first iteration at the coarsest resolution level, the refocusing operator returns the input frame V_t unchanged, as the downsampling in the pyramid has already removed most of the defocus blur. In subsequent iterations, the refocusing operator uses the current estimates of frame t's depth map D_t and all-in-focus image I_t to perform the refocusing. The embedding of this focus difference compensation and alignment process into the overarching coarse-to-fine scheme enables reliable focus stack alignment even for large scene motions and notable defocus blur differences.

We use PatchMatch [3] to robustly compute the warping function W_{s→t} between the source frame V_s and the refocused target frame R(V_t, F_s). PatchMatch handles complex motions and is fairly robust to the remaining focus differences, while traditional optical flow techniques tend to fail in such cases. PatchMatch correspondences are not always geometrically correct (see Figure 5), as they exploit visually similar patches from other regions of the image which have similar defocus blur (note the purple and yellow regions on the left, which indicate vertical motion along the edge of the books). In our case, this is an advantage, as it improves the warping quality while preserving defocus blur. At the coarsest level of the pyramid, we initialize the PatchMatch search using optical flow [42]; at this level, focus-induced appearance differences are minimal. We also constrain the size of the search window to find the best matches around the initial correspondences. This encourages the estimated correspondence field to be more spatially consistent. Since the warping is computed by refocusing the target frame, the defocus blur in the source frame is preserved, which is crucial for constructing valid focus stacks from a dynamic focus sweep video.

We apply the estimated defocus-preserving warping operators W_{s→t} to create a focus stack video with per-frame focus stacks, as shown in Figure 6. However, we do not warp all frames to all others, to prevent artifacts introduced by aligning temporally distant video frames in which the scene may have changed drastically. Instead, we first segment the input video into contiguous focus ramps (see Figure 3), T_i ⊂ T for i∈R, which contain only temporally close frames. For each input video frame t, we then create a focus stack by warping the other frames in its ramp to it using our defocus-preserving alignment. This reduces the computational complexity of alignment from O(|T|²) for all-pairs warping, to O(|T|²/|R|) for all-pairs warping within each focus ramp.

Figure 6. Defocus-preserving alignment. We refocus the input frame V_t (center) to match the focus distances of neighboring frames F_{t±1} (red arrows), and compute correspondences (green arrows) between the neighboring frames, V_{t±1}, and the corresponding refocused image R(V_t, F_{t±1}).
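The layer-wise refocusing operator R(V_t, F_s) at the heart of this alignment stage can be sketched as follows. This is a simplified illustration and not the paper's code: a hypothetical px_per_unit factor stands in for the full thin-lens mapping of Equation 1, images are assumed grayscale, and convolution uses the FFT with circular boundaries.

```python
import numpy as np

def fft_convolve2d(img, kernel):
    """2-D convolution via the FFT (circular boundary): in the frequency
    domain, convolution with the PSF is an element-wise product."""
    kh, kw = kernel.shape
    pad = np.zeros_like(img, dtype=np.float64)
    pad[:kh, :kw] = kernel
    pad = np.roll(pad, (-(kh // 2), -(kw // 2)), axis=(0, 1))  # centre kernel
    return np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(pad)))

def disc_kernel(radius):
    """Normalised disc PSF; collapses to a delta for in-focus pixels."""
    r = int(np.ceil(radius))
    if r < 1:
        return np.ones((1, 1))
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    k = (x**2 + y**2 <= radius**2).astype(np.float64)
    return k / k.sum()

def refocus(I, D, F_target, layers, px_per_unit=4.0):
    """Refocusing operator R = K(D, F_target) * I, approximated layer-wise:
    each discrete depth layer is blurred with its PSF, and the results are
    composited by the per-pixel nearest layer. px_per_unit converts
    |d - F_target| to a blur radius in pixels (an illustrative stand-in
    for Equation 1)."""
    idx = np.argmin(np.abs(D[..., None] - np.asarray(layers)), axis=-1)
    out = np.zeros_like(I, dtype=np.float64)
    for l, d in enumerate(layers):
        blurred = fft_convolve2d(I, disc_kernel(px_per_unit * abs(d - F_target)))
        out[idx == l] = blurred[idx == l]
    return out
```

When F_target equals a pixel's depth, the kernel degenerates to a delta and the pixel is returned sharp; elsewhere, the synthetic blur mimics the target frame's defocus so that patch matching compares like with like.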

4.2. Filtering-Based Depth Estimation

The second stage of our approach estimates spatially and temporally consistent depth maps D_t for all focus stacks, while keeping all other variables constant. In this case, our data term E_data from Equation 3 measures how well the estimated depth maps fit the defocus observations in all focus stacks, which is equivalent to depth from defocus [30] applied per video frame. This step requires the pixel-wise alignment across each focus stack, computed in the previous stage, to measure the fitting error. Since this error is individually penalized at each pixel, it can lead to spatial inconsistencies in the depth map. To avoid this issue, we introduce a long-range linear Potts model. In contrast to the pairwise Potts model, which compares depth values only between immediately adjacent pixels, our version performs long-range comparisons which benefit globally consistent depth estimation, yet prevent erroneous smoothing of actual features in the depth map:

    E_smoothness^spatial = Σ_{t∈T} Σ_x Σ_{y≠x} min(α(x, y) |D_t(x) − D_t(y)|, τ_d),    (4)

where τ_d is the truncation value of the depth difference. We use the bilateral weight α between two pixels x and y, α(x, y) = exp(−‖x−y‖²/2σ_s² − ‖I(x)−I(y)‖²/2σ_r²), to encourage consistent depth estimation between nearby pixels with similar colors, where σ_s and σ_r denote the standard deviations of the spatial and range terms, respectively. We use σ_s = 0.075 × the image width, and σ_r = 0.05.

In addition, we want the depth maps to be temporally coherent across all frames. We minimize the discrepancy between depth maps using

    E_smoothness^temporal = Σ_{t∈T} Σ_{s∈T\t} ‖D_t − W_{s→t}(D_s)‖²,    (5)

which encourages temporal consistency over extended depth map sequences. The total cost function for depth map estimation is defined by combining the data term from Equation 3 and the smoothness terms from Equations 4 and 5:

    arg min_D  E_data + λ_ss E_smoothness^spatial + λ_ts E_smoothness^temporal,    (6)

where λ_ss = 1 and λ_ts = 0.2 are balancing weights.

The direct minimization of Equation 6 requires global optimization with respect to all depth images, which is computationally expensive. Instead, we solve an efficient approximation of the global optimization problem. We pose the minimization task as a labeling problem, and first estimate spatially consistent depth maps for all frames by applying a variant of cost-volume filtering [12], and then refine the per-frame depth maps to enforce temporal consistency [17].

We start by computing per-frame depth maps D_t in three steps. (1) We evaluate the data term (Equation 3) for n predefined, uniformly spaced depth layers, and store the error for each pixel x and depth label d in the cost volume C(x, d). As in previous depth-from-defocus techniques [30], we perform this evaluation in the frequency domain, where convolution can be efficiently computed using element-wise multiplication. (2) We apply fast joint-bilateral filtering [29] on each depth-cost slice, to minimize the long-range spatial smoothness term in Equation 4. For this, we use the all-in-focus image I_t as the guide image in computing the bilateral weight α. As in the previous section, we take the estimated all-in-focus image I_t from the previous iteration, and assume I_t = V_t in the first iteration at the coarsest resolution level. (3) We select the spatially optimal depth for each pixel using D_t(x) = arg min_d C(x, d).

After computing depth maps independently from each focus stack, we apply temporal smoothing to make the depth maps consistent over time. We use a keyframe-based approach with a sliding temporal window. For each frame t, we align the depth maps of the previous and following two keyframes to the current depth map D_t using our warping operator W_{s→t} computed on the all-in-focus images. The updated depth map D_t is the Gaussian-weighted mean of the aligned depth maps. The used keyframes are not restricted to be from the same focus ramp as the frame t; this enforces temporal consistency also across focus ramp boundaries. In Figure 7, we show that our approach successfully produces temporally coherent depth maps, compared to the unfiltered input depth maps and also to the simpler local filtering of adjacent frames, where some bias remains due to the short temporal range of the filtering.

Figure 7. Depth profile of a selected pixel over time, without temporal smoothing, with local temporal smoothing, and with our keyframe-based temporal smoothing. Our keyframe-based smoothing produces more consistent depth than local smoothing of adjacent frames. (The pixel should have constant depth in this scene.)
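Steps (1)–(3) of the per-frame depth estimation can be sketched as follows. This is an illustration under simplifying assumptions, not the paper's implementation: sim_fn is a caller-supplied stand-in for the per-pixel defocus-model error ‖K(d, F_s) ∗ I_t − W_{s→t}(V_s)‖², and the optional smooth callback stands in for the joint-bilateral filtering of each cost slice.

```python
import numpy as np

def depth_from_cost_volume(stack, focus_dists, depth_layers, sim_fn, smooth=None):
    """Evaluate the data term for n predefined depth layers into a cost
    volume, optionally filter each depth-cost slice (the paper uses
    joint-bilateral filtering guided by the all-in-focus image), and
    select the per-pixel winner: D_t(x) = argmin_d C(x, d)."""
    h, w = stack[0].shape
    C = np.zeros((len(depth_layers), h, w))
    for di, d in enumerate(depth_layers):
        # Accumulate the per-pixel model error over all frames of the stack.
        C[di] = sum(sim_fn(frame, F, d) for frame, F in zip(stack, focus_dists))
    if smooth is not None:
        C = np.stack([smooth(slice_) for slice_ in C])
    return np.asarray(depth_layers)[np.argmin(C, axis=0)]

# Toy check with a cost that prefers depth 1.0: every pixel selects 1.0.
frames = [np.full((4, 4), 1.2)]
depth = depth_from_cost_volume(frames, [0.5], [1.0, 2.0],
                               lambda fr, F, d: np.abs(fr - d))
assert np.all(depth == 1.0)
```

A real sim_fn would refocus the current all-in-focus estimate to the hypothesised depth and compare it with the warped observation; the winner-take-all selection then mirrors step (3) above.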
4.3. Defocus Deblurring

Now that we have computed the depth maps D_t, we estimate all-in-focus images I_t using non-blind deconvolution with the spatially varying point-spread function (PSF) corresponding to the depth-dependent circle of confusion (Equation 1). While the disc shape of the PSF is a good approximation of the actual shape of the camera aperture, in practice, its sharp boundary causes ringing artifacts in the deconvolution process due to zero-crossings in the frequency domain [19]; see Figure 8. We therefore adopt the smoothness term introduced by Zhou et al. [47] to prevent ringing artifacts:

    E_smoothness^all-in-focus = Σ_{t∈T} ‖H ∗ I_t‖²,    (7)

where H is an image statistics prior.

Figure 8. Deconvolution can result in ringing artifacts, which are suppressed by our deconvolution smoothness prior (shown on an input frame crop, without and with the prior).

The smoothness term uses a learning approach to capture natural image statistics. We first take sample images at the same resolution as the input image from a database of natural images, and then apply the Fourier transform to the samples to compute the frequency distribution of image statistics. The final image statistics H of the natural image dataset are obtained by averaging the squared per-frequency modulus of all sample distributions. The smoothness term in Equation 7 encourages the all-in-focus image I_t to follow a frequency distribution similar to the learned one in H. The image statistics prior H only needs to be computed once for each video resolution to be processed, and can then be reused for new videos. For the details of the smoothness term, we refer the reader to Zhou et al. [47].

The total cost function of the defocus deblurring is a combination of the data term in Equation 3 and the learned smoothness term in Equation 7:

    arg min_I  E_data + λ_as E_smoothness^all-in-focus,    (8)

where λ_as = 10⁻³ balances the two cost terms. We compute the optimal all-in-focus image I_t by performing Wiener deconvolution independently on a range of n depth layers, each with a fixed, depth-dependent point spread function, and then composite the sub-images to obtain the all-in-focus image I_t.

4.4. Focus Distance Refinement

This step refines the focus distances to reduce the data term E_data (Equation 3). As explained in Section 3, we can at best read out temporally sparse focus distance values from the camera, which are moreover subject to inaccuracies. To overcome this difficulty, we refine the focus distances for all frames in the final stage of our approach. By rearranging the terms associated with the focus distance F_t in Equation 3, we define the focus refinement subproblem as

    arg min_{F_t}  Σ_{s∈T} w_{s,t} ‖R(V_s, F_t) − W_{t→s}(V_t)‖².    (9)

We optimize this equation by gradient descent. Since the cost function is highly nonlinear in F_t, we compute the gradient numerically by examining the costs for focus distances F_t ± δ with δ = 5 mm. In practice, we refocus each source frame V_s to focus distances F_t ± δ, and compare it to the aligned target frame W_{t→s}(V_t) (see Figure 15). We then set the focus distance F_t to the new minimum.

We demonstrate the performance of our focus distance refinement in Figure 9. It improves the depth estimation as well as the visual quality of the all-in-focus images by suppressing excessive edge contrasts. Because this strategy frees us from requiring artificial patterns or special hardware for the accurate calibration of focus distances, it allows for our flexible and simple acquisition of the focus sweep video.

Figure 9. Focus distance refinement improves depth maps (top) by reducing texture copy artifacts, and also removes halos in all-in-focus images (middle). The refined focus distances (bottom) correctly reflect the constant focus at the beginning and end of the video.
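This probe-and-move refinement can be sketched as follows. It is our simplified illustration, not the paper's code: cost_fn(t, F) is a caller-supplied stand-in for the residual of Equation 9, and each frame's focus distance is greedily moved to the cheapest of F_t − δ, F_t, F_t + δ until no frame improves.

```python
def refine_focus_distances(F, cost_fn, delta=0.005, iters=50):
    """Numerical focus refinement: probe each frame's cost at F_t - delta,
    F_t and F_t + delta (delta = 5 mm in the paper), move F_t to the
    cheapest candidate, and sweep over all frames until convergence."""
    F = list(F)
    for _ in range(iters):
        moved = False
        for t in range(len(F)):
            candidates = [F[t] - delta, F[t], F[t] + delta]
            best = min(candidates, key=lambda f: cost_fn(t, f))
            if best != F[t]:
                F[t], moved = best, True
        if not moved:  # no frame improved in a full sweep
            break
    return F

# With a toy quadratic cost centred at 1.0 m, both frames converge there.
F = refine_focus_distances([0.9, 1.1], lambda t, f: (f - 1.0) ** 2)
assert all(abs(x - 1.0) < 1e-9 for x in F)
```

In the full method, evaluating cost_fn amounts to refocusing the warped source frames with the candidate focus distance, so each probe reuses the refocusing operator from Section 4.1.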
It estimates the optimal all-in-focus image I_t by performing Wiener deconvolution independently on a range of n depth layers, each with a fixed, depth-dependent point spread function, and then composites the sub-images to obtain the all-in-focus image I_t.

Our depth maps capture the gist of each scene, including the main depth layers and their silhouettes, and the depth gradients of slanted planes with sufficient texture. As demonstrated by the results, our approach works for dynamic scenes, and handles a fair degree of occlusions, dis-occlusions and out-of-plane motions. It also properly reconstructs the depth and all-in-focus appearance of small objects, like the earrings in sequence TALKING2 (Figure 14), which is highly challenging. Note that our approach also works if scene and camera are rather static, where approaches requiring notable disparity for depth estimation would fail, even on unblurred footage. Similar to previous depth-from-(de)focus techniques, our approach works best for textured scenes that are captured in a full focus stack. Although our depth maps are not perfect, they are temporally coherent and enable visually plausible video refocusing applications (see Appendix D).

Comparison to Shroff et al. [39] This work moves a camera's sensor along the optical axis to compute all-in-focus RGB-D videos in an approach similar to ours. However, our approach improves on theirs in several important ways: (1) we use a commodity consumer video camera that does not require any hardware modifications like in their approach, (2) our defocus-preserving alignment finds more reliable correspondences than optical flow, (3) our depth maps are more detailed and temporally coherent, and (4) our all-in-focus images and hence refocusing results improve on theirs. We simulate their focus stack alignment approach by replacing PatchMatch in our implementation with optical flow [42]. Figure 5 shows that PatchMatch achieves visually better alignment results.

Comparison to Suwajanakorn et al. [43] This recent depth-from-focus technique computes a single depth map with all-in-focus image from quick focus sweeps of around 30 photos of static scenes with little camera motion. Their approach first reconstructs the all-in-focus image by aligning the input photos and stitching the sharpest regions. This will fail for videos, as dynamic scenes break their alignment strategy of concatenating the optical flows. Any estimated per-frame depth maps are also most likely not temporally coherent. Our approach, on the other hand, computes temporally coherent all-in-focus RGB-D videos of dynamic scenes. Our robust defocus-preserving alignment (Section 4.1) enables us to construct per-frame focus stacks for dynamic scenes (moving scene and camera), and hence to compute per-frame depth maps and all-in-focus images. On top, we implement keyframe-based temporal consistency filtering (Section 4.2) to remove any flickering from the depth maps.

Figure 10. RGB-D video results. We show reconstructed all-in-focus images and depth maps for three focus sweep videos (CHART: static scene with moving camera, 30 frames, 1 focus ramp; BOOK: dynamic scene with static camera, 28 frames, 1 focus ramp; TALKING1: dynamic scene with moving camera, 304 frames, 4 focus ramps) with various combinations of scene and camera motion. The image crops (top: input frame cropped, bottom: all-in-focus images cropped) focus on regions at the near, middle and far end (from left to right) of the scene's depth range. Note that each input frame is in focus in only one of the three crops, while our all-in-focus images are in focus everywhere. Please zoom in to see more details.

Figure 11. Comparison of our approach to Suwajanakorn et al. [43] on their dynamic dataset. Left: Input focus stack, focused near (left) to far (right), and their estimated depth map for the last frame only. Right: We reconstruct all-in-focus images and depth maps for all frames of this dynamic sequence (also see our video).
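As an illustration of the spatially varying deconvolution stage summarized above, the following is a minimal sketch of per-layer Wiener deconvolution followed by depth-based compositing. All function names are hypothetical, and an isotropic Gaussian PSF stands in for the paper's calibrated thin-lens defocus PSFs:

```python
import numpy as np

def gaussian_psf(shape, sigma):
    """Centered 2D Gaussian PSF, normalized to sum to 1 (stand-in for a defocus PSF)."""
    h, w = shape
    y, x = np.mgrid[:h, :w]
    psf = np.exp(-((x - w // 2) ** 2 + (y - h // 2) ** 2) / (2.0 * max(sigma, 1e-3) ** 2))
    return psf / psf.sum()

def wiener_deconvolve(image, psf, snr_inv=0.01):
    """Wiener deconvolution in the Fourier domain: H* G / (|H|^2 + 1/SNR)."""
    H = np.fft.fft2(np.fft.ifftshift(psf))  # PSF centered at the origin
    G = np.fft.fft2(image)
    return np.real(np.fft.ifft2(np.conj(H) * G / (np.abs(H) ** 2 + snr_inv)))

def all_in_focus(frame, depth, depth_layers, sigma_of_depth):
    """Deconvolve the frame once per depth layer with that layer's fixed PSF,
    then composite the sub-images according to the per-pixel depth."""
    labels = np.argmin(np.abs(depth[..., None] - np.asarray(depth_layers)), axis=-1)
    result = np.zeros_like(frame)
    for i, d in enumerate(depth_layers):
        sharp = wiener_deconvolve(frame, gaussian_psf(frame.shape, sigma_of_depth(d)))
        result[labels == i] = sharp[labels == i]
    return result
```

Hard compositing along layer boundaries is a simplification; the single-PSF-per-layer structure is what makes the per-layer deconvolutions independent and cheap.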

Figure 12. Validation of design choices using an ablation study (lower RMSE of depth maps, all-in-focus images and refocused images is better). Compared are: our full approach, without refocusing, without deconvolution prior, without temporal smoothing, without spatial smoothness, single-image deconvolution, and without pyramid. Our approach is best overall, and each component is required for achieving the best results.
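The scoring and noise model behind this ablation (estimates compared to ground truth by RMSE, with additive Gaussian noise of σ = 3/255 simulating camera imaging noise) can be sketched as follows; the helper names are hypothetical:

```python
import numpy as np

def rmse(estimate, ground_truth):
    """Root-mean-squared error between two arrays of equal shape."""
    diff = np.asarray(estimate, dtype=float) - np.asarray(ground_truth, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def add_imaging_noise(frames, sigma=3.0 / 255.0, seed=0):
    """Simulate camera imaging noise on [0, 1] frames with additive Gaussian noise."""
    rng = np.random.default_rng(seed)
    return [np.clip(f + rng.normal(0.0, sigma, f.shape), 0.0, 1.0) for f in frames]
```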

We visually compare our results to theirs on one of their datasets in Figure 11. We use their provided camera parameters without further focus distance refinement (Section 4.4).

Validation of Design Choices We performed a quantitative ablation study to analyze the influence of the design choices in our algorithm. For this, we synthetically defocus 10 frames from the MPI-Sintel dataset 'alley_1' [5] using two focus ramps, and apply additive Gaussian noise with σ = 3/255 to simulate camera imaging noise. We then process the resulting video while disabling or replacing individual components of our approach. In Figure 12, we evaluate the accuracy of the estimated depth maps and all-in-focus images using the root-mean-squared error (RMSE) compared to the ground truth. Our full approach produces overall the best results. One can clearly see the importance of each component in our approach, as leaving any of them out significantly degrades the quality of the estimated depth maps or all-in-focus images, or both. We also evaluate how accurately each alternative explains the input defocus images when refocusing the all-in-focus image using the estimated depth map. Without temporal smoothing, refocused images have lower RMSE than our full approach, but the images lack temporal consistency (which is not measured by RMSE).

6. Discussion and Conclusion

Limitations Our approach relies on aligning all frames of a focus ramp to each other. This works well for focus ramps of up to around 30 frames, but becomes more difficult for longer ramps, as more motion needs to be compensated. This is significantly more difficult than, for example, the alignment required for HDR video reconstruction [15], which only needs to align three subsequent frames instead of 30–100. While our alignment approach produces good results within one ramp, even a long one, consistency across long ramps becomes more difficult to enforce. This may lead to popping artifacts in the all-in-focus video.

As in previous depth-from-defocus methods, each aligned focus stack yields a single depth map. However, this limits objects in adjacent video frames to have similar depths, which restricts their motion in depth. Large occlusions are also problematic, as the focus stack alignment degrades in quality when part of the scene is not visible during a focus ramp, for example at the image boundaries. Like static depth-from-defocus methods, we assume that the appearance of objects as well as the lighting remain constant. Additionally, untextured regions are harder to reconstruct, and may show some temporal flickering, similar to previous depth-from-defocus methods but also passive, image-based depth reconstruction approaches in general.

We employ a simple blur model, which uses a spatially varying convolution with a point spread function. This may cause blurs across depth boundaries, which can create halos in the depth maps (see discussion in Lee et al. [18]). A potential solution is more sophisticated, multi-layer defocus blur models [16], which are harder to integrate into our optimization. Our depth maps are plausible. They may not entirely match the quality of depth maps from RGB-D cameras or multi-camera systems, but they were recorded with a completely unaltered camera, along with focus distances and all-in-focus frames, and they enable video focus post-processing at good quality. Our goal was to explore what is possible with unaltered hardware and what information may lie in typical artifacts, even auto-focus pulls.

Conclusion We presented the first algorithm for space-time-coherent depth-from-defocus from video. It reconstructs all-in-focus RGB-D video of dynamic scenes with an unmodified commodity video camera. We open a different view on RGB-D video capture by turning the often unwanted defocus blur artifact into a valuable signal. From an input video with purposefully provoked defocus blur, e.g. by simply turning the lens focus ring, we compute space-time-coherent depth maps, deblurred all-in-focus video and per-frame focus distances. Our end-to-end approach relies on several algorithmic contributions, including an alignment scheme robust to strongly varying focus settings, an image-based method for accurate focus distance estimation, and a space-time-coherent depth estimation and deblurring approach. We have extensively evaluated our method and its components, and show that it enables compelling video refocusing effects.

Acknowledgements We thank the authors of the used datasets. Funded by ERC Starting Grant 335545 CapReal.

A. Initialization and Implementation

The input to our video depth-from-defocus approach is a radiometrically linearized video with temporally changing focus distances containing one or more focus ramps, but with otherwise constant camera settings. We assume that we know camera properties such as the focal length, aperture f-number and sensor size, as well as temporally sparse readings of the camera's focus distances for some video frames.

Focus Distance Initialization Before the start of our algorithm in Section 4 of the main paper, we compute an initial set of temporally dense focus distance values. We use the sparse timestamped focus distance readings from the Magic Lantern firmware as a starting point (see Equation 2 in the main paper).
We then solve for the per-frame focus distances F using an energy minimization with the recorded focus data as data term, and additional smoothness and focus-consistency regularization terms:

arg min_F  E^focus_data + λ_fs E^focus_smoothness + λ_focus E_focus.   (10)

The recorded focus distances F^rec_t are available only for some frames t ∈ T_rec, so we constrain the unknown focus distances F_t at those frames to lie close to them:

E^focus_data = Σ_{t ∈ T_rec} ‖F_t − F^rec_t‖².   (11)

As the focus is assumed to change smoothly over time, we enforce this by penalizing the second derivative of the focus distances:

E^focus_smoothness = Σ_t ‖F_{t−1} − 2F_t + F_{t+1}‖².   (12)

The focus-consistency term exploits the observation that similar focus distances result in similar depth of field and hence similar images, so if video frames appear very similar, then their focus distances should also be similar (see Figure 13):

E_focus = Σ_t Σ_{s ≠ t} s_{t,s} ‖F_t − F_s‖²,   (13)

where s_{t,s} measures the (symmetric) similarity of the input video frames V_t and V_s, so that more similar frames enforce consistency constraints more strongly. We compute the similarity using

s_{t,s} = max(0, 1 − min(d_{t,s}, d_{s,t}) / τ_sim),   (14)

based on the image dissimilarity d_{t,s}, which we compute using the RMSE between input frames V_t and V_s warped to V_t using low-resolution (160×90) optical flow to compensate for camera and scene motion. The similarity threshold τ_sim determines which pairs of input frames result in consistency constraints and how strongly they are enforced. Typical parameter values are λ_fs = 10, λ_focus = 0.1 and τ_sim ∈ [0.01, 0.05].

Figure 13. Focus distance initialization yields a smooth initial focus curve from sparse Magic Lantern log values (with uncertainty) and flow-based image similarities. Similar images according to s_{t,s} (Equation 14) enforce consistent focus distances.

A.1. Implementation of Video Depth-From-Defocus Algorithm

To implement our method from Section 4, we use a multi-resolution approach with three levels to improve the convergence and visual quality of our results, as image defocus is more similar at coarser image resolutions. At each pyramid level, we perform three iterations of the stages described in Sections 4.1 to 4.4. At the coarsest level, we start the first iteration assuming that the all-in-focus image I_t is the input video frame V_t, and also initialize our PatchMatch correspondences using optical flow [42] to provide a good starting point for our alignment computations. Between pyramid levels, we bilinearly upsample the all-in-focus images I_t, the depth maps D_t and all computed flow fields. We use scale-adjusted patch sizes for PatchMatch, using 25×25 pixels at the finest level and 7×7 at the coarsest.

Computation Times Our all-in-focus RGB-D video estimation approach processes 30 video frames at 854×480 resolution in 4 hours on a 30-core 2.8 GHz processor with 256 GB memory. This runtime breaks down as follows, per video frame: 8.6 minutes for defocus-preserving alignment, 19 minutes for depth estimation, 2.4 minutes for defocus deblurring, and 5 seconds for focus distance refinement. Our MATLAB implementation is unoptimized, but parallelized over the input video frames. We believe an optimized, possibly GPU-assisted implementation would yield significant speed-ups.

B. Results

We show additional all-in-focus images and depth map results on a range of datasets in Figure 14.

Figure 14. Additional RGB-D video results. We show reconstructed all-in-focus images and depth maps for three focus sweep videos with various combinations of scene and camera motion. The image crops (top: input frame cropped, bottom: all-in-focus images cropped) focus on regions at the near, middle and far end (from left to right) of the scene's depth range. Note that each input frame is in focus in only one of the three crops, while our all-in-focus images are in focus everywhere. Please zoom in to see more details.
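Because Equations 10–13 are quadratic in the unknown focus distances, the initialization reduces to a single linear least-squares solve of the normal equations. A minimal dense sketch (hypothetical names; a sparse solver would be used in practice, and the paper's implementation details may differ):

```python
import numpy as np

def similarity(d_ts, d_st, tau_sim=0.03):
    """Symmetric frame similarity s_{t,s} (Equation 14), clamped to be non-negative."""
    return max(0.0, 1.0 - min(d_ts, d_st) / tau_sim)

def init_focus_distances(n, readings, S=None, lam_fs=10.0, lam_focus=0.1):
    """Minimize Equation 10 by solving the normal equations A F = b.

    n        -- number of video frames
    readings -- dict {frame index t: recorded focus distance F_t^rec}
    S        -- optional (n, n) matrix of similarities s_{t,s}, zero diagonal
    """
    A = np.zeros((n, n))
    b = np.zeros(n)
    for t, f in readings.items():  # data term (Equation 11)
        A[t, t] += 1.0
        b[t] += f
    for t in range(1, n - 1):      # second-derivative smoothness (Equation 12)
        stencil = ((t - 1, 1.0), (t, -2.0), (t + 1, 1.0))
        for i, ci in stencil:
            for j, cj in stencil:
                A[i, j] += lam_fs * ci * cj
    if S is not None:              # focus-consistency term (Equation 13)
        for t in range(n):
            for s in range(n):
                w = lam_focus * S[t, s]
                A[t, t] += w
                A[s, s] += w
                A[t, s] -= w
                A[s, t] -= w
    return np.linalg.solve(A, b)
```

With readings at both ends of a ramp and no consistency term, the minimizer is the linear interpolation between them, since a linear ramp zeroes both the data and smoothness terms.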

Figure 14 panels: PLANT dataset (static scene with moving camera, 37 frames, 1 focus ramp); KEYBOARD dataset (static scene with moving camera, 38 frames, 1 focus ramp); TALKING2 dataset (dynamic scene with moving camera, 317 frames, 4 focus ramps).

C. Evaluation of Focus Distance Refinement

Here, we investigate the contribution of the focus distance refinement (Figure 15 and Section 4.4) to estimating all-in-focus images and to better recovering from inaccurate initial focus distances. For this, we process the synthetically refocused 'alley_1' dataset from MPI-Sintel [5] with initial focus distances perturbed by varying degrees of additive Gaussian noise (but without imaging noise), with and without our focus distance refinement, and compare the estimated all-in-focus images and focus distances to the ground truth. Figure 16 shows that our focus distance refinement consistently reduces the errors in the estimated focus distances. This in turn leads to better refocusing results for our defocus-preserving alignment (Section 4.1), which produces cleaner all-in-focus images and improves the overall performance of our approach.

Figure 15. We refine focus distances by refocusing the estimated all-in-focus images to the input frames, and minimizing the difference of each refocused image to the corresponding input frame (Equation 9).

Figure 16. Focus distance refinement improves the estimated focus distances and all-in-focus images when the initial focus distances are inaccurate or noisy. Top: Plot of noise level versus RMSE of focus distances compared to the ground truth; note that the refinement consistently reduces the error. Bottom: Crops of a single frame for noise σ = 10 cm. Without refinement, the all-in-focus images are distorted and lack details; with refinement, the image is close to the ground truth.

D. Applications of Video Depth-From-Defocus

Video Refocusing Given the estimated all-in-focus images I_t and depth maps D_t, we can now freely refocus the input video according to the user's wishes by simply rendering the video with the appropriate defocus blur in a post-process. For this, we use the same thin-lens defocus model as in Figure 2, and blur each pixel's neighborhood with the blur kernel K(D(x), F) corresponding to its depth D(x) and the focus distance F [34]. This approach provides complete freedom, as the camera's aperture, focal length and focus distance can be changed independently and arbitrarily. The user can, for example, change the focus settings while keeping the original aperture, or reduce or magnify the defocus blur (see Figure 17), similar to Bae and Durand [2]. The focus can also be fixed on an object of interest, follow it through the video using a 'focus pull', or be controlled interactively using a 'focus-follow' function that keeps the region under the user's mouse pointer in focus. The reconstructed focus settings can also be smoothed to correct auto-focus failures and produce a more professional-looking result.

Tilt-Shift Videography The tilt-shift effect is created by tilting the camera's lens relative to its image plane, which results in a slanted focus plane with a wedge-shaped depth of field that produces the iconic miniature look [23]. (The purpose of shift is to correct for perspective distortions like converging parallel lines; it does not, however, affect the focus plane or depth of field.) While the flexible bellows between lens and film in most view cameras can be tilted and shifted freely, lenses in most modern cameras are fixed to be parallel to the image sensor, which prevents this effect. There are some special-purpose tilt-shift lenses for modern cameras, e.g. from Canon, Nikon or Lensbaby, which can be expensive, but the tilt-shift look is baked into the recorded footage and cannot be modified after capture. We show virtual tilt-shift videography in our video and in Figure 17, by refocusing with a virtual tilted lens [23]. This provides ultimate flexibility, as the desired look can be modified and tweaked interactively.

Dolly Zoom Depth maps also enable other applications such as limited novel-view synthesis. When combined with the video refocusing presented earlier, this provides the two ingredients required for a dolly zoom (or 'Hitchcock Zoom'): a camera on a virtual dolly that moves towards or away from the scene, and a carefully controlled virtual zoom that keeps an object of interest at constant size (see supplemental video). Assuming thin-lens optics, this is achieved by varying the focal length f such that the magnification M = f/(u − f) remains constant for the selected object, where u is the object-to-lens distance.

Figure 17. Video refocusing results. We first synthetically refocus the input video, then increase the defocus blur by increasing the aperture (smaller f-number), and finally apply a virtual tilt-shift effect, which results in a slanted focus plane. Please see our video for full results.

References

[1] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (SIGGRAPH), 28(3):24, 2009.
[2] S. Bae and F. Durand. Defocus magnification. Computer Graphics Forum (Eurographics), 26(3):571–579, 2007.
[3] Y. Bando, B.-Y. Chen, and T. Nishita. Extracting depth and matte using a color-filtered aperture. ACM Transactions on Graphics (SIGGRAPH Asia), 27(5):134:1–9, 2008.
[4] J. T. Barron, A. Adams, Y. Shih, and C. Hernández. Fast bilateral-space stereo for synthetic defocus. In CVPR, 2015.
[5] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In ECCV, 2012.
[6] A. Chakrabarti and T. Zickler. Depth and deblurring from a spectrally-varying depth-of-field. In ECCV, pages 648–661, 2012.
[7] S. Cho, J. Wang, and S. Lee. Video deblurring for hand-held cameras using patch-based synthesis. ACM Transactions on Graphics (SIGGRAPH), 31(4):64:1–9, 2012.
[8] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.
[9] P. Grossmann. Depth from focus. Pattern Recognition Letters, 5(1):63–69, 1987.
[10] R. T. Held, E. A. Cooper, J. F. O'Brien, and M. S. Banks. Using blur to affect perceived distance and size. ACM Transactions on Graphics, 29(2):19:1–16, 2010.
[11] D. Hoiem, A. A. Efros, and M. Hebert. Automatic photo pop-up. ACM Transactions on Graphics (SIGGRAPH), 24(3):577–584, 2005.
[12] A. Hosni, C. Rhemann, M. Bleyer, C. Rother, and M. Gelautz. Fast cost-volume filtering for visual correspondence and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(2):504–511, 2013.
[13] Z. Hu, L. Xu, and M.-H. Yang. Joint depth estimation and camera shake removal from single blurry image. In CVPR, pages 2893–2900, 2014.

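The thin-lens quantities behind these applications can be sketched directly. The circle-of-confusion formula below is the standard thin-lens expression (an assumption here, since the paper's Equation 2 is not reproduced in this excerpt), the tilt helper approximates a tilted lens by a focus distance that varies across image rows, and the dolly-zoom helper keeps M = f/(u − f) constant; all names are hypothetical:

```python
import numpy as np

def circle_of_confusion(depth, focus_dist, focal_len, f_number):
    """Thin-lens circle-of-confusion diameter on the sensor.

    All distances are in the same unit (e.g. mm); depth may be an array (a depth map).
    """
    aperture = focal_len / f_number  # aperture diameter
    return aperture * np.abs(depth - focus_dist) / depth \
        * focal_len / (focus_dist - focal_len)

def tilted_focus_distance(y, focus_at_center, slope, height):
    """Virtual tilt-shift approximation: focus distance slanted across image rows."""
    return focus_at_center + slope * (y - height / 2.0)

def dolly_zoom_focal_length(f0, u0, u):
    """Focal length that keeps the magnification M = f/(u - f) constant while the
    virtual dolly changes the object-to-lens distance from u0 to u."""
    M = f0 / (u0 - f0)
    return M * u / (1.0 + M)
```

For refocusing, the circle-of-confusion map (scaled by the pixel pitch) would drive the spatially varying blur kernel K(D(x), F); doubling the object distance in the dolly-zoom helper yields the focal length that exactly compensates the change in magnification.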
[14] A. Isaksen, L. McMillan, and S. J. Gortler. Dynamically reparameterized light fields. In SIGGRAPH, pages 297–306, 2000.
[15] N. K. Kalantari, E. Shechtman, C. Barnes, S. Darabi, D. B. Goldman, and P. Sen. Patch-based high dynamic range video. ACM Transactions on Graphics (SIGGRAPH Asia), 32(6):202:1–8, 2013.
[16] M. Kraus and M. Strengert. Depth-of-field rendering by pyramidal image processing. Computer Graphics Forum (Eurographics), 26(3):645–654, 2007.
[17] M. Lang, O. Wang, T. O. Aydın, A. Smolic, and M. Gross. Practical temporal consistency for image-based graphics applications. ACM Transactions on Graphics (SIGGRAPH), 31(4):34:1–8, 2012.
[18] S. Lee, E. Eisemann, and H.-P. Seidel. Real-time lens blur effects and focus control. ACM Transactions on Graphics (SIGGRAPH), 29(4):65:1–7, 2010.
[19] A. Levin, R. Fergus, F. Durand, and W. T. Freeman. Image and depth from a conventional camera with a coded aperture. ACM Transactions on Graphics (SIGGRAPH), 26(3):70, 2007.
[20] C.-K. Liang, T.-H. Lin, B.-Y. Wong, C. Liu, and H. H. Chen. Programmable aperture photography: multiplexed light field acquisition. ACM Transactions on Graphics (SIGGRAPH), 27(3):55:1–10, 2008.
[21] M. Martinello, A. Wajs, S. Quan, H. Lee, C. Lim, T. Woo, W. Lee, S.-S. Kim, and D. Lee. Dual aperture photography: Image and depth from a mobile camera. In ICCP, 2015.
[22] G. Mather. Image blur as a pictorial depth cue. Proceedings of the Royal Society of London. Series B: Biological Sciences, 263(1367):169–172, 1996.
[23] H. M. Merklinger. Focusing the View Camera. 1.6.1 edition, 2010.
[24] D. Miau, O. Cossairt, and S. K. Nayar. Focal sweep videography with deformable optics. In ICCP, 2013.
[25] F. Moreno-Noguer, P. N. Belhumeur, and S. K. Nayar. Active refocusing of images and videos. ACM Transactions on Graphics (SIGGRAPH), 26(3):67, 2007.
[26] S. K. Nayar and Y. Nakagawa. Shape from focus. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(8):824–831, 1994.
[27] R. Ng. Fourier slice photography. ACM Transactions on Graphics (SIGGRAPH), 24(3):735–744, 2005.
[28] R. Ng, M. Levoy, M. Brédif, G. Duval, M. Horowitz, and P. Hanrahan. Light field photography with a hand-held plenoptic camera. CSTR 2005-02, Stanford University, 2005.
[29] S. Paris and F. Durand. A fast approximation of the bilateral filter using a signal processing approach. International Journal of Computer Vision, 81:24–52, 2009.
[30] A. P. Pentland. A new sense for depth of field. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(4):523–531, 1987.
[31] S. Pertuz, D. Puig, M. A. Garcia, and A. Fusiello. Generation of all-in-focus images by noise-robust selective fusion of limited depth-of-field images. IEEE Transactions on Image Processing, 22(3):1242–1251, 2013.
[32] M. Potmesil and I. Chakravarty. Synthetic image generation with a lens and aperture camera model. ACM Transactions on Graphics, 1(2):85–108, 1982.
[33] C. Richardt, C. Stoll, N. A. Dodgson, H.-P. Seidel, and C. Theobalt. Coherent spatiotemporal filtering, upsampling and rendering of RGBZ videos. Computer Graphics Forum (Eurographics), 31(2):247–256, 2012.
[34] G. Riguer, N. Tatarchuk, and J. Isidoro. Real-time depth of field simulation. In ShaderX2, chapter 4. Wordware, 2003.
[35] C. Russell, R. Yu, and L. Agapito. Video pop-up: Monocular 3D reconstruction of dynamic scenes. In ECCV, pages 583–598, 2014.
[36] A. Saxena, M. Sun, and A. Y. Ng. Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):824–840, 2009.
[37] J. Shi, X. Tao, L. Xu, and J. Jia. Break Ames room illusion: Depth from general single images. ACM Transactions on Graphics (SIGGRAPH Asia), 34(6):225:1–11, 2015.
[38] J. Shi, L. Xu, and J. Jia. Just noticeable defocus blur detection and estimation. In CVPR, 2015.
[39] N. Shroff, A. Veeraraghavan, Y. Taguchi, O. Tuzel, A. Agrawal, and R. Chellappa. Variable focus video: Reconstructing depth and video for dynamic scenes. In ICCP, 2012.
[40] S. Srivastava, A. Saxena, C. Theobalt, S. Thrun, and A. Y. Ng. i23 – rapid interactive 3D reconstruction from a single image. In Vision, Modeling, and Visualization Workshop (VMV), 2009.
[41] M. Subbarao and G. Surya. Depth from defocus: A spatial domain approach. International Journal of Computer Vision, 13(3):271–294, 1994.
[42] D. Sun, S. Roth, and M. J. Black. A quantitative analysis of current practices in optical flow estimation and the principles behind them. International Journal of Computer Vision, 106(2):115–137, 2014.
[43] S. Suwajanakorn, C. Hernandez, and S. M. Seitz. Depth from focus with your mobile phone. In CVPR, 2015.
[44] A. Veeraraghavan, R. Raskar, A. Agrawal, A. Mohan, and J. Tumblin. Dappled photography: mask enhanced cameras for heterodyned light fields and coded aperture refocusing. ACM Transactions on Graphics (SIGGRAPH), 26(3):69, 2007.
[45] J. Wulff and M. J. Black. Modeling blurred video with layers. In ECCV, pages 236–252, 2014.
[46] F. Yu and D. Gallup. 3D reconstruction from accidental motion. In CVPR, 2014.
[47] C. Zhou, S. Lin, and S. K. Nayar. Coded aperture pairs for depth from defocus and defocus deblurring. International Journal of Computer Vision, 93(1):53–72, 2011.