<<

Learning to Reduce Defocus Blur by Realistically Modeling Dual- Data

Abdullah Abuolaim1* Mauricio Delbracio2 Damien Kelly2 Michael S. Brown1 Peyman Milanfar2 1York University 2Google Research

Abstract Cameras Pixel Canon DSLR Recent work has shown impressive results on data- driven defocus deblurring using the two-image views avail- Input able on modern dual-pixel (DP) sensors. One significant Dual-pixel images with defocus blur challenge in this line of research is access to DP data. De- (Only a single DP spite many cameras having DP sensors, only a limited num- view is shown) ber provide access to the low-level DP sensor images. In addition, capturing training data for defocus deblurring in- Deblurred results [1] volves a time-consuming and tedious setup requiring the

camera’s aperture to be adjusted. Some cameras with DP Encoder Decoder sensors (e.g., ) do not have adjustable aper- tures, further limiting the ability to produce the necessary DNN trained with Canon data only training data. We address the data capture bottleneck by proposing a procedure to generate realistic DP data syn- Our results thetically. Our synthesis approach mimics the optical im- Encoder RCNR Decoder … age formation found on DP sensors and can be applied to virtual scenes rendered with standard computer . Encoder RCNR Decoder Leveraging these realistic synthetic DP images, we intro- OurO RCRCNN ttraineddith with duce a recurrent convolutional network (RCN) architecture our synthetic data that improves deblurring results and is suitable for use with Figure 1. Deblurring results on images from a Pixel 4 smartphone single-frame and multi-frame data (e.g., video) captured by and a Canon 5D Mark IV. Third row: results of the DNN pro- DP sensors. Finally, we show that our synthetic DP data is posed in [1] trained with DP data from the Canon camera. Last row: results from our proposed network trained on synthetically useful for training DNN models targeting video deblurring generated data only. applications where access to DP data remains challenging.

tortion, and optical aberrations. Most existing deblurring methods [6, 21, 24, 33, 40] approach the defocus deblur- 1. Introduction and related work ring problem by first estimating a defocus image map. The arXiv:2012.03255v2 [eess.IV] 17 Aug 2021 Defocus blur occurs in scene regions captured outside defocus map is then used with an off-the-shelf non-blind the camera’s depth of field (DoF). Although the effect can deconvolution method (e.g., [7, 23]). This strategy to de- be intentional (e.g., the effect in portrait photos), in focus deblurring is greatly limited by the accuracy of the many cases defocus blur is undesired as it impacts image estimated defocus map. quality due to the loss of sharpness of image detail (e.g., Recently, work in [1] was the first to propose an interest- Fig.1, second row). Recovering defocused image details ing approach to the defocus deblurring problem by leverag- is challenging due to the spatially varying nature of the ing information available on dual-pixel (DP) sensors found defocus point spread function (PSF) [26, 43], which not on most modern cameras. The DP sensor was originally only is scene depth dependent, but also varies based on the designed to facilitate auto-focusing [2,3, 19]; however, re- camera aperture, focal length, focus distance, radial dis- searchers have found DP sensors to be useful in broader ap- plications, including depth map estimation [10, 32, 36, 49], *This work was done while Abdullah was an intern at . defocus deblurring [4, 45, 25, 32], reflection removal [37],

ݍ Traditional sensor Dual-pixel (DP) sensor Focal plane { Lens Scene points DP unit { Left photodiode Right photodiode ܲଵƬܲଶ Left view Right view Left view Right view ݎ

ܲଵ ܲଶ Circle of Half CoC ݏ ݂ confusion Half CoC Flipped Flipped ݀ ݏᇱ (CoC) Front focus case: ࡼ૚ Back focus case: ࡼ૛ Figure 2. Thin lens model illustration and dual-pixel image formation. The circle of confusion (CoC) size is calculated for a given scene point using its distance from the lens, camera focal length, and aperture size. On the two dual-pixel views, the half-CoC PSF flips if the scene point is in front or back of the focal plane. and synthetic DoF [46]. DP sensors consist of two photo- 1st capture w. DoF blur 2nd capture w/o DoF blur Local misalignment diodes at each pixel location effectively providing the func- tionality of a simple two-sample light-field camera (Fig.2, DP sensor). Light rays coming from scene points within the camera’s DoF will have no difference in phase, whereas light rays from scene points outside the camera’s DoF will ଶ۾ଵǡ۾ ଶ ࣲଶୈ۾ ଵ Image patch۾ Image patch have a relative shift that is directly correlated to the amount Figure 3. Misalignment in the Canon DP dataset [1] due to the of defocus blur. Recognizing this, the work in [1] proposed physical capture procedure. Patches P1 and P2 are cropped from a deep neural network (DNN) framework to recover a de- in-focus regions from two captures: the first using a wide-aperture blurred image from a DP image pair using ground-truth data (w/ defocus blur) and the second capture using a narrow-aperture captured from a Canon DSLR. (w/o defocus blur). The 2nd capture is intended to serve as the Although the work of [1] demonstrates state-of-the-art ground truth for this image pair. The third column shows the deblurring results, it is restricted by the requirement for ac- 2D cross correlation between the patches X2D(P1, P2), which re- veals the local misalignment that occurs in such data capture. curate ground truth data, which requires DP images to be captured in succession at different apertures. As well as being labor intensive, the process requires careful control real-image data sets through data augmentation. To demon- to minimize the and motion differences between strate the generality of the model, we explore a new appli- captures (e.g., see local misalignment in Fig.3). Another cation domain of video defocus deblurring using DP data significant drawback is that data capture is limited to a sin- and propose a recurrent convolutional network (RCN) ar- gle commercial camera, the Canon 5D Mark IV, the only chitecture that scales from single-image deblurring to video device that currently provides access to raw DP data and deblurring applications. Additionally, our proposed RCN has a controllable aperture. addresses the issue of patch-wise training by incorporat- ing radial distance learning and improves the deblurring re- While datasets exist for defocus estimation, including sults with a novel multi-scale edge loss. Our comprehen- CUHK [39], DUT [50], and SYNDOF [24], as well as light- sive experiments demonstrate the power of our synthetic DP field datasets for defocus deblurring [6] and depth estima- data generation procedure and show that we can achieve the tion [14, 42], none provide DP image views. The work of state-of-the-art results quantitatively and qualitatively with [1] is currently the only source of ground truth DP data suit- a novel network design trained with this data. able for defocus deblur applications but is limited to a single device. This lack of data is a severe limitation to contin- ued research on data-driven DP-based defocus deblurring, 2. 
Modeling defocus blur in dual-pixel sensors in particular to applications where the collection of ground Synthetically generating realistic blur has been shown to truth data is not possible (e.g., fixed aperture smartphones). improve data-driven approaches to both defocus map [24] Contributions. This work aims to overcome the challenges and depth map estimation [31]. We follow a similar ap- in gathering ground-truth DP data for data-driven defo- proach in this work but tackle the problem of generating re- cus deblurring. In particular, we propose a generalized alistic defocus blur targeting DP image sensors. For this, we model of the DP image acquisition process that allows re- comprehensively model the complete DP image acquisition alistic generation of synthetic DP data using standard com- process with spatially varying PSFs, radial lens distortion, puter graphics-generated imagery. We demonstrate that we and image noise. Fig.4 shows an overview of our DP data can achieve state-of-the-art defocus deblurring results us- generator that enables the generation of realistic DP views ing synthetic data only (see Fig.1), as well as complement from an all-in-focus image and corresponding depth map. point m.Ra PPF xii ou-hpddpeini the in depletion donut-shaped a exhibit col- PSFs left 5 ’s DP Fig. Real in umn. illustrated ex- as structure PSFs, true real-world the the by PSF, hibited reflect DP sufficiently real not cap- a did to in PSF able overall observed size is property model CoC symmetry the the the Though ture to column). correlated middle directly 5, (Fig. was that only rameter the between property right symmetry the nice in a using occurs sensor that PSF a ing PSFs Dual-.2. sensor and lens and distance, cus is where model 2, fo- This Fig. distance). (i.e., in focus on illustrated parameters and based camera size, aperture point and length, given confusion lens cal the a of from for circle distance PSF its the the Through represents approximate that can [34]. (CoC) we calculations to model, tracing helping this ray thickness, lens optical negligible simplify assumes that model model lens Thin PSFs. 2.1. measured real-world correlation. the with cross correlation 2D higher the achieves by modeling measured Our is filter. PSFs Butterworth 2D two DSLR. modified between IV a similarity Mark on The 5D Canon PSFs. the DP from focus PSFs back and Front 5. Figure the to related PSFs defocus, noise. scene sensor model and we distortion data, radial this including artifacts, from additional Starting computer-generated and with package. formation, starts graphics approach image computer Our sensor standard views. DP (DP) a dual-pixel with generate produced synthetically to imagery used (CG) framework our of overview An 4. Figure as All-in-focus Depth map s

nrdcdamdlfrapproximat- for model a introduced [36] in work Recent lens thin a using optics camera virtual our model We Back focus Front focus 0 = Ss oee,temdlivle igefe pa- free single a involved model the However, PSFs. P

Measured PSF CG input 1 DP combined s s f oae tdistance at located − f and s q 0 F n h pruediameter aperture the and , r = stefso.Tedsac ewe the between distance The f-stop. the is = Plf DP right DP left F f 2 q hn h o radius CoC the Then, .

× ThinThin lens model f DefocusD d s s efocus stefcllength, focal the is map 0 Section 2.1 left rmtecmr is: camera the from × d and − d s right Middle . iw faDP a of views PPF smdldb [36]. by modeled as PSFs DP : q Pcmie DP left DP combined

r PSF model in [36] r defined are s ݐݕ ͲǤͷͻ݅ݎ݈ܽ݅݉݅ݏǤͻ ͵ݐݕ ͲǤ͸݅ݎ݈ܽ݅݉݅ݏ͵Ǥ fascene a of stefo- the is left DPDP PSF modelingmodeling LeftL and eft (1) Section 2.2 with the measured PSF with the measured PSF Plf DP right DP left ButterworthBuB ol aa.Wt u rpsdmdl h parameterized the model, proposed PSF our real- With from measured data). PSFs world of observation pos- our always on is based (which itive center kernel’s the at depletion imum parameter the where parameter the by notation The ( 1. Both Eq. product. Hadamard in the calculated notes as radius CoC the where PSF parametric a filter define Butterworth the we on PSF, based model the of structure shaped the ling where Butterworth 2D the on based filter model parametric a troduce [43]. aberrations optical to attributed is that CoC hc sfre as formed is which deviation standard of smooth we fore, tt x e o rw opoiemr elsi SsfrteD iw,w in- we views, DP the for PSFs realistic more provide To u oeigof modeling Our y , ൌ B … o H B rt o ( C n ) h ,y x, enda follows: as defined , a hr alofaottecrufrne There- circumference. the about fall-off sharp a has . 3 stefitrodr and order, filter the is Bctf oiin iigt atr h donut- the capture to Aiming position. cutoff dB D ersnsacrua ikwt radius with disk circular a represents Right = ) o

safnto fteradius the of function a is Right Left u el rpsdmdlfrD Ssbased PSFs DP for model proposed newly Our :

H + 1 α DP views h ausof values The . H ycnovn twt asinkernel Gaussian a with it convolving by H > β H κ  =

= Ours × DP combined √ ersnstecmie PPSF, DP combined the represents ݐݕ ͲǤͺͻ݅ݎ݈ܽ݅݉݅ݏǤͻ ݐݕ ͲǤͺ͹݅ݎ݈ܽ݅݉݅ݏǤ͹ B 0 ( H r x where , sitoue ocnrltemin- the control to introduced is ◦ − l C x + o ( ) D D x H 2 o +( o o r B y , B 0 where , y saprmtrcontrol- parameter a is − o << κ < r esae to re-scaled are and with the measured PSF with the measured PSF B Camera artifacts ) y , o r ) sfollows: as 2 Left C Radial distortion Radial Rad Section 2.3 n scontrolled is and  H 2 r etrdat centered are esrdDP Measured : n i a l Noise N 1 ! l and . d o r − i DP right i s se matching t 1 o , rti H [ on ◦ β, r de- are (2) (3) 1] , the left and right DP PSFs, respectively. Similar to the work is proportional to image intensity [9, 27]. Let I be the in [36], we enforce the constraint of horizontal symmetry noiseless image and N a zero-mean Guassian noise layer; f between Hl and Hr, and express Hr as Hr = Hl , where then our modeling of the signal-dependent Gaussian noise f is I = I+I◦N, where N ∼ N (0, σ2Id), and σ controls Hl represents the left PSF flipped about the vertical axis. noise Hl can be shown as H with a gradual fall-off towards the the noise strength. right direction (see front-focus DP left in Fig.5). Mathe- matically, we denote Hl as: 3. Generating dual-pixel views P 1 In this section we introduce the synthetic dataset used, Hl = H ◦ M, s.t. Hl ≥ 0, with Hl = 2 , (4) followed by a description of the procedure to synthetically where M is a 2D ramp mask with a constant decay. This generate the DP left and right views. The source of our syn- decay can be considered as an intensity fall-off (inten- thetic example data comes from the street view SYNTHIA sity/pixel) in a given direction. The direction is determined dataset [16], which contains image sequences of photo- by the sign of the CoC radius calculated based on the thin realistic GC-rendered images from a virtual city. Each se- lens model. The positive sign represents the front focus quence has 400 frames on average. The dataset contains (i.e., blurring of objects behind the focal plane), whereas a large diversity in scene setups, involving many objects, the negative sign represents the back focus (i.e., blurring cities, seasons, weather conditions, day/night time, and so of objects in front of the focal plane). Our PSF model, forth. The SYNTHIA dataset also includes the depth-buffer parameterized with five parameters, facilitates synthesizing and labelled segmentation maps. In our framework, we use PSF shapes more similar to what we measured in real cam- the depth map to apply synthetic defocus blur in the process eras under different scenarios (see Fig.5, right column). of generating the DP views. From this model, we can generate a bank of representa- To blur an image based on the computed CoC radius r, tive PSFs based on actual observations from real cameras. we first decompose the image into discrete layers according Additional details about the calibration procedure, PSF es- to per-pixel depth values, where the maximum number of timation method, and parameter searching are provided in layers is set to 500. Then, we convolve each layer with our the supplemental material. parameterized PSF, blurring both the image and mask of the 2.3. Modeling additional camera artifacts depth layer. Next, we alpha-blend the blurred layer images in order of back-to-front, using the blurred masks as alpha Radial lens distortion. Radial lens distortion occurs due values. 
For each all-in-focus video frame Is, we generate to lens curvature imperfections causing straight lines in the two images – namely, the left Il and right Ir sub-aperture real world to map to circular arcs in the image plane. This DP views – as follows (for simplicity, let Is be a patch with is a well-studied topic with many methods for modelling all from the same depth layer): and correcting the radial distortion (e.g., [5,8, 13, 35]). In our framework, we consider applying radial distortion to the Il = Is ∗ Hl, Ir = Is ∗ Hr, (6) synthetically generated images to mimic this effect found where ∗ denotes the convolution operation. Afterwards, the in real cameras. We adopt the widely used division model radial distortion is applied on I , I , and I based on the introduced in [8], as follows: l r s camera’s focal length. Finally, we add signal-dependent (xu − xo, yu − yo) noise layers (i.e., Nl and Nr) for the two DP views that (xd, yd) = (xo, yo) + 2 4 , (5) 1 + c1R + c2R + ··· have the same σ, but are drawn independently. The final output defocus blurred image I is equal to I + I . where (x , y ) and (x , y ) are the undistorted and dis- b l r u u d d Our synthetically generated DP views exhibit a similar torted points respectively, and c is the ith radial distortion i focus disparity to what we find in real data, where the in- coefficient. R is the radial distance from the image plane focus regions show no disparity and the out-of-focus re- center (x , y ). This model enables different types of radial o o gions have defocus disparity. In the supplemental material, distortion, including barrel and pincushion. We generate we provide animated examples that alternate between the representative radial distortion sets at different focal lengths left and right DP views to assist in visual comparisons be- found on cameras. A detailed description of this procedure tween synthetic and real data. is provided in the supplemental material. Noise. Image noise is the undesirable occurrence of random 4. Defocus deblurring image sequences variations in intensity or color information in images. Our initial input is CG-generated data that is noise-free. In or- With the ability to generate synthetic DP data, we can der to synthesize realistic images, we add signal-dependent shift our attention to training new RCN-based architectures noise as the last step. We model the noise using a signal- addressing image sequences (e.g., video) captured with DP dependent Gaussian distribution where the variance of noise sensors. This is possible only by using our synthetic DP … … …

࣢௧ିଵ ܥ௧ିଵ

convLSTM 512 frames (time) 128 256 256 128 64 64 ࣢ ܥ CNN-Encoder ௧ ௧ CNN-Decoder 32 3 ۷ௗሺݐሻ 32 Video Video … … … 3×3 convolutional ReLU 2×2 max-pooling 2×2 up-sampling convolutional 1×1 convolutional Linear + clamping [0,1] Skip connections Figure 6. Our recurrent dual-pixel deblurring (RDPD) architecture. Our model takes a blurred image sequence, where each image at time t is fed as left Il(t) and right Ir(t) DP views. The DP views are encoded at the encoder part to feed the convLSTM that outputs the hidden state Ht and memory cell Ct to the next time point. The convLSTM unit also outputs a feature map ot that is processed through the decoder part to give the deblurred sharp image Id(t). Note: the number of output filters is shown under each convolution operation. data, as no current device allows video DP data capture. As encoded at the CNN bottleneck – namely, X(t) = we will show, our method can be used for both image se- CNN-Encoder(Il(t), Ir(t)). Then, the features are fed to quences and single-image inputs. In the context of image a convLSTM as shown in Fig.6. We utilize the convL- sequences, the amount of defocus blur changes based on STM to learn of the temporal dynamics of the sequential the motion of the camera and scene’s objects over time. In inputs. This is achieved by incorporating memory units the presence of such motion, sample depth variation over a with the gated operations. The convLSTM also preserves sequence of frames provides useful information for deblur- spatial information by replacing dot products with convo- ring. Our work is the first to explore the domain of defocus lutional operations, which is essential for making spatially deblurring on image sequences (e.g., video). variant estimation align with the spatially varying DP PSFs. We adopt a data-driven approach for correcting defo- We choose LSTM over RNN because standard RNNs are cus blur. We leverage a symmetric encoder-decoder CNN- known to have difficulty in learning long-time dependen- based architecture with skip connections between corre- cies [17], whereas LSTMs have shown the capability to sponding feature maps [30, 38]. Skip connections are learn long- and short-time dependencies [18]. widely used in encoder-decoder CNNs and have been found For the input feature X(t) at time t, our convLSTM to be effective for image deblurring tasks [1, 11]. Our pro- leverages three convolution gates – input it, output ot, and posed network is also coupled with convLSTM units [44, forget Ft – in order to control the signal flow within the cell. 47, 48] to better learn temporal dependencies between mul- The convLSTM outputs a hidden state Ht and maintains a tiple frames and to allow variable sequence size. With memory cell Ct for controlling the state update and output: convLSTM units, the same network remains fully convo- X H C lutional and can successfully deblur a single image or a se- it = Σ(Wi ∗ Xt + Wi ∗ Ht−1 + Wi ◦ Ct−1 + bi), (7) quence of images of arbitrary number. Fig.6 shows a de- F = Σ(W X ∗ X + W H ∗ H + W C ◦ C + b ), (8) tailed overview of our proposed CNN-convLSTM architec- t F t F t−1 F t−1 F X H C ture, which we refer to as recurrent dual-pixel deblurring ot = Σ(Wo ∗ Xt + Wo ∗ Ht−1 + Wo ◦ Ct−1 + bo), (9) (RDPD). 
C = F ◦C +i ◦ τ(W X ∗X +W H ∗H +b ), (10) Our architecture is similar to the one in [1], but with the t t t−1 t C t C t−1 C following modifications: (1) convLSTM units are added to Ht = ot ◦ τ(Ct), (11) the network bottleneck, (2) we train the network using the where the W terms denote the different weight matrices, radial distance patch to address the patch-wise training is- and the b terms represent the different bias vectors. Σ and sue, (3) we introduce a multi-scale edge loss function that τ are the activation functions of logistic sigmoid and hy- helps in recovering sharp edges, (4) the number of nodes are perbolic tangent, respectively. Afterwords, the output de- reduced to half at each block to make the model lighter, and blurred image I is obtained by decoding o through the (5) the last layer is replaced by a linear layer with a [0,1] d t decoder part of our encoder-decoder CNN as follows: clamping as it is found to be more effective in [11].

RDPD architecture. Given an input video of j consecutive Id(t) = CNN-Decoder(ot). (12) frames that have defocus blur {{Il(t), Ir(t)}, ··· , {Il(t + j), Ir(t + j)}} (such that Il(t) and Ir(t) are the DP views Radial distance patch. Radial distortion and lens aberra- of the given frame at time t), we first obtain a sequence tion make the PSFs vary in radial directions away from the of compact convolutional features {X(t), ··· ,X(t + j)} image center. Similar to [1, 33], we perform patch-wise Table 1. Results on the Canon DP dataset from [1]. DPDNet is the pre-trained model on Canon data provided by [1]. DPDNet+ and our RDPD+ are trained with Canon and our synthetically generated DP data. Bold numbers are the best and highlighted in green. The second-best performing results are highlighted in yellow. The testing set consists of 37 indoor and 39 outdoor scenes. Indoor Outdoor Indoor & Outdoor Method Time ↓ PSNR ↑ SSIM ↑ MAE ↓ PSNR ↑ SSIM ↑ MAE ↓ PSNR ↑ SSIM ↑ MAE ↓ NIQE ↓ EBDB [21] 25.77 0.772 0.040 21.25 0.599 0.058 23.45 0.683 0.049 5.42 929.7 DMENet [24] 25.70 0.789 0.036 21.51 0.655 0.061 23.55 0.720 0.049 4.85 613.7 JNB [40] 26.73 0.828 0.031 21.10 0.608 0.064 23.84 0.715 0.048 5.11 843.1

DPDNet [1] 27.48 0.849 0.029 22.90 0.726 0.052 25.13 0.786 0.041 3.77 0.5 DPDNet+ [1] 27.65 0.852 0.028 22.72 0.719 0.054 25.12 0.784 0.042 3.73 0.5 Our RDPD+ 28.10 0.843 0.027 22.82 0.704 0.053 25.39 0.772 0.040 3.19 0.3 training to avoid the redundancies of full image training and (DPDNet) [1], the edge-based defocus blur (EBDB) [21], ensure that the input has enough variance. However, this the defocus map estimation network (DMENet) [24], and approach breaks the spatial correlation between the image the just noticeable blur (JNB) [40] estimation. DPDNet [1] patches as they are fed independently with no knowledge of is the only method that utilizes DP data for deblurring, and their location on the image plane. As a result, in addition to the others [21, 24, 40] use only a single image as input (i.e., the six-channel RGB DP views, we include a single-channel Il) and estimate the defocus map in order to feed it to an off- patch that represents the relative radial distance. the-shelf deconvolution method (i.e., [7, 23]). EBDB [21] Multi-scale edge loss. In addition to the MSE loss, we in- and JNB [40] are not learning-based methods; thus, we can troduce a multi-scale edge loss using a Sobel gradient to test them directly. For the learning-based DMENet method, guide the network to encourage sharper edges. Our new we cannot retrain it with the Canon data [1], as it does not loss is similar in principle to the single-scale (i.e., 3 × 3) provide the ground truth defocus map. However, with our Sobel loss used in [28], but we modified this loss in two data generator we are able to generate defocus maps, which ways: first, we added multiple scales of the Sobel operator allows us to retrain DMENet with our synthetically gener- (i.e., kernel sizes) in order to capture different edge sizes. ated data. Second, we minimized for the horizontal and vertical direc- Settings to generate DP data. For our DP data gen- tions separately, to concentrate more on the direction that erator, we define five camera parameter sets – namely, is perpendicular to the imaging sensor orientation. For our {4, 5, 6}, {5, 8, 6}, {7, 5, 8}, {10, 13, 12}, {22, 10, 30} multi-scale modified edge loss, the vertical Gx and horizon- – such that each set represents focal length, aperture tal Gy derivative approximations of the deblurred output I d size, and focus distance. Given the depth range found and its ground truth I are: s in the SYNTHIA dataset [16], these camera sets cover x x y y Gd = Id ∗ Sm×m,Gd = Id ∗ Sm×m, (13) a wide range of front- as well as back-focus CoC sizes. x x y y For each image sequence in the SYNTHIA dataset, Gs = Is ∗ Sm×m,Gs = Is ∗ Sm×m, (14) we generate five sequences based on the predefined x y where Sm×m and Sm×m are the vertical and horizontal So- camera sets. The radial distortion coefficients are set bel operators of size m, respectively. The derivative oper- accordingly for each camera set. For the DP PSFs, ations are performed at multiple filter sizes. Our new edge we generate many representative PSF shapes by vary- loss Ledge is the mean of multiple scales for each direction ing the parameters in the given ranges n ∈ {3, 6, 9}, x/y and denoted as: α ∈ {0.4, 0.6, 0.8, 1}, β ∈ {0.1, 0.2, 0.3, 0.4}, and {x,y} {x,y} κ = 0.14. Image noise layer strength is chosen randomly, L = [MSE(G{x,y},G )]. (15) edge E s d where σ ∈ {5e−2, 5.5e−2, ··· , 5e−1}. These parameters Then the final loss function L is: are set empirically to model real camera hardware. 
More x y detail is provided in the supplemental material. L = LMSE + λxLedge + λyLedge, (16) We divide the SYNTHIA dataset [16] into training and such that L is the typical MSE loss between the output MSE testing sequences. We generate five sets of blurred images estimated I and its ground truth I . The λ terms are added d s for each image sequence. In total, we synthesize 2023 train- to control our final loss. ing and 201 testing blurred DP views. Though our synthetic 5. Experiments DP data generator enables an unlimited number of images to be generated, we found this number of images sufficient We evaluate our proposed RDPD and other existing for training. In addition to our synthetically generated DP defocus deblurring methods: the DP deblurring network data, we use the DP ground truth data from [1] with 300 Input DPDNet [1] DPDNet+ [1] Our RDPD Our RDPD+ Ground truth Canon 5D DSLR

No ground truth available Pixel 4 Smartphone

Figure 7. Qualitative results. DPDNet [1] is trained on Canon DP data. RDPD is our method trained on synthetically generated DP data only. DPDNet+ and RDPD+ are trained on both Canon and synthetic DP data. In general, RDPD and RDPD+ are able to recover more image details. Interestingly, RDPD trained on synthetic data generalizes well to real data from the two tested cameras. Note that there is no ground truth sharp image for Pixel 4, due to the fact smartphones have fixed aperture and thus a narrow-aperture image cannot be captured to serve as a ground truth image. Additionally, we note that the DP data currently available from the Pixel smartphones are not full-frame, but are limited to only one of the green channels in the raw-Bayer frame. training, 74 validation, and 76 testing pairs of blurred im- for our edge loss – namely, m ∈ {3, 7, 11}. The λ terms are ages (with DP views) and corresponding sharp images. found to be effective at λx = 0.03 and λy = 0.02. RDPD settings and training procedure. We set the size of To avoid overfitting, the dropout layer in the convLSTM the convLSTM to 512 units. For patch-wise training, we fix is set to 0.4. Our model converges after 140 epochs. Al- the size of input and output layers to 512 × 512 × 7 and though we train on image patches, our RDPD (with con- 512 × 512 × 3, respectively. We initialize the weights of vLSTM) is fully convolutional and enables testing on full- the convolutional layers using He’s initialization [15] and resolution inputs. To demonstrate the effectiveness of each use the Adam optimizer [22] to train the model. The initial component in our model, an ablation study of different learning rate is 5 × 10−5, which is decreased by half every training settings is provided in the supplemental material. 40 epochs. Single image results. We evaluate our proposed RDPD For domain generalization from synthetic to real against existing defocus deblurring methods for single- data [12, 41], we train our model iteratively using mini- image inputs. For methods that utilize DP views for the batches of real (i.e., single image) as well as synthetic input image (i.e., RDPD and DPDNet [1]), we introduce data (i.e., image sequence), where the patches are ran- variations on the training data used for more comprehensive domly cropped at each iteration. This type of iterative im- evaluations. The variations are RDPD+ and DPDNet+ that age/image sequence training becomes feasible since our re- are trained on both Canon DP data from [1] combined with current model RDPD allows training and testing with any synthetic DP data generated by the process described in number of frames, and it does not need to be preset before- Sec.2. The RDPD without the + sign is our baseline trained hand. We set the mini-batch size for the real data iteration with synthetically generated DP data only. The DPDNet to eight batches, because the dataset of real data has only without the + is trained on Canon data only. single-image examples (i.e., no image sequences). For the In Table1, we report quantitative results on real Canon synthetic data iteration, we set the mini-batch size to two DP data from [1] using standard metrics – namely, MAE, sequences each of size four frames. We define three scales PSNR, SSIM, and time. We also report the Naturalness Table 2. Results on our synthetically generated DP data. sRDPD+ … is a variation that is trained with single-frame data (green=best, yellow=second best). Our RDPD+, trained with image sequences, Input achieves the best results. 
Method PSNR ↑ SSIM ↑ MAE ↓ ૛ૠǤ ૛૜ ૛૟Ǥ ૢૡ ૛ૠǤ ૝૚ DPDNet [1] 26.38 0.782 0.034 … DPDNet+ [1] 29.84 0.828 0.025 sRDPD+ 30.26 0.849 0.020 DPDNet [1] RDPD+ 31.09 0.861 0.016 ૛ૡǤ ૙૚ ૛ૡǤ ૞૝ ૛ૢǤ ૟૚ …

our synthetically generated data (i.e., DPDNet+). ૛ૡǤ ૠૡ ૛ૡǤ ૡ૝ ૛ૢǤ ૠ૚ … In Fig.7, we also provide qualitative results of RDPD compared to other methods on data captured by Canon DSLR and Pixel 4 cameras. In general, RDPD+ is able Our RDPD+ Our sRDPD+ to recover more details from the input deblurred image. … Additionally, Fig.7 demonstrates that the baseline RDPD achieves good deblurring results on Canon and Pixel 4 data despite being trained with synthetic data only. This result Ground truth Time demonstrates the accuracy of the proposed framework for Figure 8. Results on a Canon 5D DSLR image sequence. PSNR synthetic DP data generation and the ability of the recurrent is shown for each deblurred image. sRDPD+ has a 0.4dB lower model to generalize to different cameras. It can also be seen PSNR on average when it is trained with a single frame compared that DPDNet+ has improved results compared to DPDNet, to our multi-frame method – that is, RDPD+. demonstrating the benefit gained by DPDNet+ through the addition of synthetic DP data on training. The supplemental Image Quality (NIQE) metric of the output deblurred im- material contains more quantitative results, visual compar- ages with respect to a reference model derived from the DP isons, and animated deblurring examples for both Canon GT images. In general, our RDPD+ has the best overall and Pixel 4 cameras. PSNR compared to other methods. Particularly, RDPD+ achieves the best PSNR and MAE for both indoor and com- Image sequence results. Our RDPD is designed to handle bined categories, and all with our lighter-weight network input image sequences. Here, we investigate the improve- that enabled the fastest inference time. ment gained by training with image sequences vs. single For the Outdoor dataset, the PSNR of RDPD+ is slightly frames. For this comparison, we introduce the RDPD+ vari- lower (i.e., 0.08dB) due to the fact that the Outdoor dataset ant sRDPD+, which is trained with single-frame inputs. is imperfect as a result of the capturing process (see Fig.3). As previously mentioned, there is no camera that enables DP cameras do not enable simultaneous capture of the DP access to DP views for video data. Nevertheless, we mimic images and corresponding ground truth sharp image (i.e., the same capturing procedure in [1] in order to capture a se- the image pairs can be captured only in succession at dif- quence of images. We performed four captures of the same ferent times). As a result, the Outdoor ground-truth is im- scene with small camera motion introduced between the perfect with small local motion and illumination variations. captures. Each image has its own DP views and is captured The Indoor ground-truth is captured in more controlled at narrow and wide apertures. Fig.8 presents the results on conditions and has fewer imperfections. The slightly bet- the sequence of images. The effectiveness of training with ter performance of DPDNet for outdoor scenes is because image sequences with RDPD+ can be seen from the average DPDNet is learning to compensate for imperfections in the PSNR gain (i.e., +0.4dB) compared to sRDPD+ trained us- Outdoor dataset. A key strength of our work is the abil- ing single-image inputs. ity to synthetically generate DP data that is not impeded by imperfections of manual capture. RDPD+ is trained on the Table2 shows the quantitative results on our syntheti- Outdoor dataset as well as the synthetically generated data cally generated DP image sequences. 
Our method RDPD+ (without such imperfections) debiasing the result from the (trained on multiple frames) achieves the best results as it imperfect ground-truth. The consequence is reduced fidelity utilizes the convLSTM architecture to better model the tem- to the imperfect ground-truth (in terms of PSNR/SSIM), but poral dependencies in an image sequence. Recall that our better defocus deblur performance overall. Similar behavior RDPD network is lighter and has a much lower number of is observed when DPDNet is trained with Canon data and weights compared to DPDNet. 6. Conclusion We proposed a novel framework to generate realistic DP data by modeling the image formation steps present on cam- eras with DP sensors. Our framework helps to address the current challenges in capturing DP data. Utilizing our syn- thetic DP data, we also proposed a new recurrent convolu- tional architecture that is designed to reduce defocus blur in image sequences. We performed a comprehensive evalua- tion of existing deblurring methods, and demonstrated that our synthetically generated DP data and recurrent convolu- tional model achieve state-of-the-art results quantitatively and qualitatively. Furthermore, our proposed framework demonstrates the ability to generalize across different cam- eras by training on synthetic data only. We believe our DP data generator will help spur additional ideas about defo- cus deblurring and applications that leverage DP data. Our dataset, code, and trained models are available at https: //github.com/Abdullah-Abuolaim/recurre nt-defocus-deblurring-synth-dual-pixel. Acknowledgments. The authors would like to thank Shu- mian Xin, Yinxiao Li, Neal Wadhwa, and Rahul Garg for fruitful comments and discussions. Supplemental Material

This supplemental material introduces the calibration procedure used to estimate estimate the point spread func- tions (PSFs) (Sec. S1) along with the parameter range searching for our parametric dual-pixel (DP) PSF model (Sec. S2). Next, another calibration procedure is described for estimating the radial distortion coefficients (Sec. S3). Afterward, an ablation study is provided to demonstrate the effectiveness of each component added to our proposed re- current dual-pixel deblurring architecture (RDPD). Then, Sec. S5 provides a discussion about the difference between the defocus and motion blur. Additional qualitative re- (A) Computer-generated calibration pattern. A grid of disks of known radius sults for different deblurring methods are also presented in Sec. S6. For further visual assessment, we provide animated ex- amples that alternate between the left and right DP views of real and synthetic data in the “animated-dp-views” direc- tory. We also provide animated examples of our deblurring results in the “animated-results” directory. Both “animated- dp-views” and “animated-results” are located in the project GitHub repository: https://github.com/Abdulla h-Abuolaim/recurrent-defocus-deblurrin g-synth-dual-pixel. (B) Out-of-focus calibration pattern as imaged by Canon 5D Mark IV DSLR camera

S1. PSFs on a real dual-pixel sensor Figure S1. Calibration patterns are used for estimating dual-pixel As mentioned in Sec. 2.2 of the main paper, we aim to es- (DP) point spread functions (PSFs). A: our synthesized computer- timate more accurate and realistic DP-PSFs; thus, we follow generated pattern of a grid of small equal-size disks. B: the same calibration pattern as imaged by Canon 5D Mark IV DSLR cam- the same calibration practice as in [20, 29, 36]. The calibra- era. Note that the pattern captured in (B) is out-of-focus. tion pattern contains a grid of small disks with known radius and spacing as shown in Fig. S1-A. The pattern in Fig. S1-A is computer-generated, and we display it on a 27-inch LED ing blurred patch B by solving: display of resolution 1920 × 1080. Then, we use the Canon 2 5D Mark IV DSLR camera for our capturing procedure, as P arg min i Di(S ∗ E − B) + E , it facilitates reading out DP data. The LED display is placed E 2 1 parallel to the image plane and at a fixed distance of about subject to E ≥ 0, (1) one meter. where D ∈ {D , D , D , D , D } denote the spatial We captured many images by varying the camera pa- i x y xx yy xy vertical and horizontal derivatives. The ` -norm of E en- rameters – namely, focus distance, aperture size, and focal 1 courages sparsity of the PSF entries. Additionally, another length – covering a wide range of shape varying PSFs (an non-negativity constraint is imposed on the entries of E.A example image is shown in Fig. S1-B). Since we apply ra- wide range of PSFs for each DP view is estimated indepen- dial distortion to our synthetically generated data, we seek dently. Fig. S2 shows examples of the estimated DP- PSFs. to estimate the least radially-distorted PSF, that in practice, is found close to the image center. Patches containing the S2. Parameter search for our dual-pixel PSFs disks are identified, and the center of the disks is estimated by finding these patches’ centroid. The radius of computer- In Sec. 2.2 and 5 of the main paper, we also mentioned a generated disks is a known fraction of the distance between mechanism to select the effective range of parameters (i.e., disk centers. n, α, β, κ) for our parametric DP-PSF modeling. Recall that Similar to [29, 36], we adopt a non-blind PSF estimation, n is the Butterworth filter order, and α is used to control its in which the latent sharp disk S is known. Then, the PSF 3db cutoff position. β is the filter lower bound scale, and κ from the camera E is estimated using S and the correspond- is the Gaussian smoothing factor. h siae Ssaeniyadntpretycircular. phys- , S2 perfectly camera-specific Fig. to not in due and shown expected is noisy as observation and are is This camera, PSFs reason main single estimated The a the used estimated. is we what that that to PSFs similar representative very find to are rather but cameras in found is PSF estimated ones. the estimated the with with similarity parametric similarity The higher our obtains and independently. model [36] DP-PSF estimated parametric in are Our model right case. PSF each and below (DP) left, shown dual-pixel are combined, the DP-PSF cross-correlation DP with 2D the the comparison for using in PSFs measured (PSFs) The functions spread model. point DP-PSF estimated The S2. Figure u oli o oeatymthtemaue PSFs measured the match exactly to not is goal Our

DP rightDP left DP combined DP rightDP left DP combined DP rightDP left DP combined Estimated PSF Front focus

ࣲ ൌ ͲǤͷͺ ࣲ ൌ ͲǤͷͻ ࣲ ൌ ͲǤ͸͹ ࣲ ൌ ͲǤ͸͵ ࣲ ൌ ͲǤ͸ʹ ࣲൌͲǤ͸ͳ ࣲ ൌ ͲǤͷ͹ ࣲ ൌ ͲǤͷͷ ࣲൌͲǤͷʹ PSF model in [ X . n ssono h etfrec ae h aaees(i.e., parameters The case. each for left the on shown is and (.) 36 ǡߙൌͲ͸ ൌ ߚ ͲǤ͸ǡ ൌ ߙ ͵ǡ ൌ ݊ ǡߙൌͲ͸ ൌ ߚ ͲǤ͸ǡ ൌ ߙ ͻǡ ൌ ݊ Ours ] ǡߙൌͳ ൌ ߚ ͳǡ ൌ ߙ ͵ǡ ൌ ݊ ࣲ ൌ ͲǤͺͳ ࣲ ൌ ͲǤͺ͸ ࣲ ൌ ͲǤͺͷ ࣲ ൌ ͲǤͺͳ ࣲ ൌ ͲǤͺ͵ ࣲൌͲǤ͹͹ ࣲ ൌ ͲǤ͹ͻ ࣲ ൌ ͲǤͺʹ ࣲൌͲǤͺͳ 0.4 0.4 0.3 0.4 mefcin.Teeoe elmtteprmtrsac to search manufacturing parameter the optics limit other the we and Therefore, microlens, wells, the imperfections. sensor of the positioning of depth the like constraints ical

DP rightDP left DP combined DP rightDP left DP combined DP rightDP left DP combined Estimated PSF Back focus

ࣲ ൌ ͲǤ͸Ͳ ࣲ ൌ ͲǤ͸͵ ࣲൌͲǤͷͺ ࣲ ൌ ͲǤͶ͹ ࣲ ൌ ͲǤͶ͸ ࣲൌͲǤͶ͹ ࣲൌͲǤͷͶ ࣲ ൌ ͲǤͷͳ ࣲൌͲǤͷ͵ PSF model in [ 36 ǡߙൌͲͺ ൌ ߚ ͲǤͺǡ ൌ ߙ ͻǡ ൌ ݊ ൌ ߚ ͲǤͺǡ ൌ ߙ ͸ǡ ൌ ݊ ,α β α, n, Ours ] ǡߙൌͳ ൌ ߚ ͳǡ ൌ ߙ ͸ǡ ൌ ݊ ࣲ ൌ ͲǤ͹ͻ ࣲ ൌ ͲǤͺͲ ࣲൌͲǤ͹ͺ ࣲ ൌ ͲǤͻͳ ࣲ ൌ ͲǤͺͻ ࣲൌͲǤͺͺ ࣲൌͲǤͺ͹ ࣲ ൌ ͲǤͺ͹ ࣲൌͲǤͺ͸ sdt eeaeour generate to used ) 0.1 0.3 0.3 a range of discrete values for each parameter sampled as:

n ∈ {1, 2, 3,..., 15},, α, β ∈ {0.1, 0.2,..., 1.0}, κ ∈ {0.14, 0.21,..., 0.42}. (2)

Then, we perform a brute-force search in this bounded (A) Input: computer-generated calibration pattern (no radial distortion) space by solving the following equation: 2 P … arg min i Di(E − H(n, α, β, κ)) , (3) n,α,β,κ 2 where E is the estimated PSF from a real camera and H(n, α, β, κ) is our parameterized PSF. We found the pa- (B) In-focus pattern as imaged by Canon 5D Mark IV DSLR camera rameters achieve the highest similarity at:

n ∈ {3, 6, 9}, … α ∈ {0.4, 0.6, 0.8, 1.0}, β ∈ {0.1, 0.2, 0.3, 0.4}, κ = 0.14. (4) (C) Output: computer-generated pattern after applying radial distortion Barrel radial distortion Pincushion radial distortion Given the best values we defined for each parameter, we can generate 48 combinations of possible PSFs, and those Focal length represent our bank of DP-PSFs used to generate our syn- Figure S3. Calibration pattern used for estimating the radial dis- thetic DP data. Fig. S2 shows examples of our parameter- tortion coefficients. A: the input computer-generated pattern with ized PSFs in comparison with the estimated ones. Our PSFs no radial distortion applied. B: the same pattern as imaged by the Canon 5D Mark IV camera at different focal lengths. C: the demonstrate a much higher correlation with the estimated computer-generated pattern after applying radial distortion. real PSFs compared to the DP-PSF model in [36]. The sim- ilarity is measured using the 2D cross-correlation X (.). Such comprehensive PSF calibration (Sec. S1) and pa- and is followed by the coefficient search procedure used to rameter search (Sec. S2) is not possible for smartphone mimic real-world radial distortion. Recall that radial distor- cameras due to uncontrollable camera factors like aperture tion is associated with zoom lenses and depends mainly on size and focal length. Additionally, up to our knowledge, the camera’s focal length. only and 4 smartphones allow direct access to the DP data, and the Pixel-DP API provided in [10] does not We synthesize a uniform pattern of squares, as shown in facilitate manual focus, which limits us from controlling the Fig. S3-A. We follow the same setup described in Sec. S1. focus distance as well. The work in [36] estimated two PSFs Still, with the following changes: (1) we capture an in- from a Pixel 3 smartphone, and we found that they achieve focus calibration pattern, (2) the focal length is changed a high correlation with our parametric PSF model. across captures and the aperture remains fixed, (3) the dis- tance between the display and camera is adjusted accord- S3. Radial distortion coefficients ingly to make sure the full-resolution image is within cam- era’s field of view, sine increasing the focal length intro- Based on Sec. 2.3 of the main paper, radial distortion duces zoom/magnification effect, and (4) the focus distance is applied for more realistic imagery since the input images is also adjusted accordingly to make sure the pattern is in and our parametric DP-PSFs are not radially distorted. To focus. Fig. S3-B shows examples of the calibration pattern this aim, we use the division model [8], as follows: as imaged by the Canon camera at different focal lengths. (x − x , y − y ) We performed five captures at five different fo- (x , y ) = (x , y ) + u o u o , (5) d d o o 1 + c R2 + c R4 + ··· cal lengths ranging from min to max. Each is 1 2 mapped to a focal length in our predefined pa- where (xu, yu) and (xd, yd) are the undistorted and dis- rameter sets of the virtual five cameras – namely: th torted pixel coordinates respectively, and ci is the i radial {4, 5, 6}, {5, 8, 6}, {7, 5, 8}, {10, 13, 12}, {22, 10, 30} distortion coefficient. R is the radial distance from the im- — such that each set represents focal length, aperture age plane center (xo, yo). This section introduces the cal- size, and focus distance. 
With these five representative ibration used to capture real-world radial distortion cases radial distortions that cover barrel as well as pincushion Table S1. Results on indoor and outdoor scenes combined from distortions, a brute-force search is performed to find the ci coefficients that satisfy the following: the Canon DP dataset [1]. Bold numbers are the best. RDPD+ (PSF [36]) is a variation trained on DP data generated using the PSF model from [36]. Our RDPD+, trained on DP data generated arg min X (If , Id(c1, c2, c3)), (6) c1,c2,c3 using our parametric DP-PSF, demonstrates +0.7db higher PSNR and reflects the power of our realistic PSF modeling. where If is the calibration pattern as imaged by the Variation PSNR ↑ SSIM ↑ MAE ↓ Canon camera at a certain focal length f (e.g., Fig. S3- RDPD+ (PSF [36]) 24.69 0.752 0.044 B). I (c , c , c ) is the computer-generated input pattern d 1 2 3 RDPD+ (our PSF) 25.39 0.772 0.040 but after applying radial distortion based on the coefficients c , c , c (see example in Fig. S3-C). Since we have few ex- 1 2 3 Table S2. Results on indoor and outdoor scenes combined from I I (c , c , c ) amples, f is re-centered manually to match d 1 2 3 ’s the Canon DP dataset [1]. Bold numbers are the best. RDPD+ center. While Eq.5 can be defined with more coefficients, trained with radially distorted DP data achieves +0.2db higher we found three coefficients sufficient to approximate our PSNR when tested on real images, in which applying radial dis- real-world distortion examples. The final optimal five sets tortion on the synthetically generated DP data helped RDPD+ to of coefficients are: learn the spatially varying PSFs shapes found in real cameras. Variation PSNR ↑ SSIM ↑ MAE ↓ {2 × 10−2, 2 × 10−2, 3 × 10−2}, RDPD+ (w/o distortion) 25.19 0.758 0.041 {8 × 10−3, 2 × 10−3, 2.2 × 10−3}, RDPD+ (w/ distortion) 25.39 0.772 0.040 {−4 × 10−3, 9 × 10−4, −9 × 10−4}, {−7 × 10−3, −3.8 × 10−3, −3.6 × 10−3}, Table S3. Results on indoor and outdoor scenes combined from the Canon DP dataset [1]. Bold numbers are the best. RSPD+: re- {−8 × 10−3, −5 × 10−3, −4.5 × 10−3}. (7) current single-pixel deblurring trained with a single DP view (i.e., left view). RDPD+: trained with left and right DP views. Utilizing S4. Ablation study DP views to train our RDPD+ leads to +1.15db PSNR gain and is This section investigates the usefulness of our syntheti- essential for better defocus deblurring. cally generated DP data along with our novel recurrent dual- Variation PSNR ↑ SSIM ↑ MAE ↓ pixel deblurring architecture (RDPD) — this is related to RSPD+ 24.24 0.726 0.045 Sec. 4 and Sec. 5 of the main paper. To this aim, we divide RDPD+ 25.39 0.772 0.040 the ablation study into two parts:

• Explore the effectiveness of our DP data generator on RDPD+ when trained with data generated using other components (Sec. S4.1), including: (1) our paramet- DP-PSFs. In this study, we compare our RDPD+ that is ric DP-PSF model vs. the DP-PSF model presented trained with DP data generated using our parametric DP- in [36], (2) radially distorted DP data vs. undistorted PSF model against the DP data generated using the DP-PSF ones, and (3) training with dual views vs. training with model in [36]. a single DP view. The quantitative results reported in Table S1 demon- • Investigate the effectiveness of each component added strates the power of training RDPD+ with DP data that is to our RDPD model (Sec. S4.2), including: (1) the util- generated using our realistically modeled PSFs, where there ity of adding the radial patch distance mask to the input is an increase in PSNR of +0.7db. and (2) our new edge loss vs. traditional Sobel loss. Radial distortion. For more realistic modeling, we consid- ered applying radial distortion in the proposed DP data gen- Note that all subsequent experiments are conducted with erator. In this study, we examine the proposed deblurring variations of RDPD+. Each variation represents a single RDPD+ model’s behavior when it is also trained with data change, where the rest remains similar to RDPD+ as de- that is not radially distorted. In Table S2, we present the scribed in the main paper Sec. 5. All the variations are quantitative results of training RDPD+ with data generated tested on Canon DP data from [1], since it is the only data with and without radial distortion. The results demonstrate that we can have access to real ground truth DP data and the effectiveness of modeling the radial distortion during thus enables us to report quantitative results. synthesizing the DP images, where RDPD+ trained with ra- dially distorted data leads to a +0.2db PSNR gain. S4.1. Utility of DP data generator components Dual views vs. single view. Following [1], we explore Our parametric DP-PSF. While our parametric DP-PSF the effect of training RDPD+ with DP views vs. training model already has a higher correlation with the estimated with a single view (i.e., the traditional approach of single PSFs from real cameras, we further investigate the effect image deblurring [21, 24, 40]). To this aim, we introduce Table S4. Results on indoor and outdoor scenes combined from the Table S5. Results on indoor and outdoor scenes combined from the Canon DP dataset [1]. Bold numbers are the best. RDPD+ has an Canon DP dataset [1]. Bold numbers are the best. RDPD+(0): improved results when it is trained with the radial distance patch. trained without the edge loss. RDPD+(1): trained with a 3 × 3 Variation PSNR ↑ SSIM ↑ MAE ↓ single-scale Sobel loss similar to [28]. RDPD+(3): trained with our three-scale edge loss. RDPD+ trained with our multi-scale RDPD+(w/o radial distance) 25.00 0.756 0.042 edge loss has the best results for all metrics. RDPD+(w/ radial distance) 25.39 0.772 0.040 Variation PSNR ↑ SSIM ↑ MAE ↓ RSPD+(0) 25.06 0.765 0.042 RSPD+(1) [28] 25.11 0.763 0.042 RDPD+(3) 25.39 0.772 0.040

Table S6. Results on indoor and outdoor scenes combined from the Canon DP dataset [1]. Bold numbers are the best. RDPD+ has about -0.5db PSNR drop in performance, when both our ra- dial distance patch and multi-scale edge loss are removed from the RDPD+’s training. Variation PSNR ↑ SSIM ↑ MAE ↓ RDPD+(w/o both) 24.90 0.754 0.042 RDPD+(w/ both) 25.39 0.772 0.040

and RDPD+(3) (trained with our three-scale edge loss). Ta- ble S5 shows the quantitative results, in which training with Figure S4. The radial distance patch used to assist RDPD+ training our multi-scale edge loss achieves the best results for all and address the issue of patch-wise training. metrics. Effect of radial distance and edge loss together. We also a recurrent single-pixel deblurring model variant (RSPD+) examine the performance when our radial distance patch and compare it with our proposed RDPD+ model. Quanti- and multi-scale edge loss are removed from the RDPD+’s tative results in Table S3 demonstrates that training with DP training. Table S6 shows the results, where there is a PSNR views is crucial in order to perform better defocus deblur- drop of -0.5db, indicating the usefulness of the additional ring where there is a +1.15db PSNR gain. proposed components.

S4.2. Effectiveness of RDPD components S5. Defocus vs. motion deblurring Radial distance patch for patch-wise training. We intro- While defocus and motion both lead to image blur, the duced training with the radial distance patch to feed each physical formation and appearance of these two blur types pixel’s spatial location in the cropped patch and address the are significantly different. Therefore, methods that solve for patch-wise training issue (i.e., the network does not see the motion blur are not expected to perform well when applied full image or the relative position of the patch in the full to defocus deblurring. This is shown by Abuolaim et al. [1], image). Training in this manner is important as the PSFs where the motion deblurring method of Tao et al. [44] that are spatially varying in the radial direction away from the is evaluated on the same Canon test set achieves an average image center. Fig. S4 shows an example of the cropped ra- PSNR of 20.12dB, which is significantly lower than ours dial distance patch used to assist our training. To investigate (i.e., 25.39) and all other defocus deblurring methods. the effectiveness of training with the radial distance patch, we also train RDPD+ without it and report the results in Ta- S6. Additional qualitative results ble S4. RDPD+ has a gain in PSNR of +0.4db when trained with the radial distance patch. As mentioned in Sec. 5 of the main paper, we pro- Our multi-scale edge loss. As mentioned in Sec. 4 of the vide more qualitative results for all existing defocus deblur- main paper, we introduced the multi-scale edge loss based ring methods including a variations of ours. The methods on the Sobel gradient operator to recover sharper details are: the DP deblurring network (DPDNet) [1], the edge- at different edge sizes. To examine our edge loss func- based defocus blur (EBDB) [21], the defocus map esti- tion’s effectiveness, we train RDPD+ with different vari- mation network (DMENet) [24], and the just noticeable ations of edge loss scales denoted as RDPD+(m), where blur (JNB) [40] estimation. Fig. S5, Fig. S6, and Fig. S7 m is the number of scales used. In particular, we intro- show more qualitative results on images from the Canon duce RDPD+(0) (trained without the edge loss), RDPD+(1) dataset [1]. Fig. S8 shows results on images from a Pixel (trained with a 3×3 single-scale Sobel loss similar to [28]), smartphone. Input EBDB DMENet JNB DPDNet-Single

DPDNet DPDNet+ RSPD+ RDPD RDPD+ Ground truth

Figure S5. Qualitative results on images from Canon dataset [1]. We compare with other defocus deblurring methods: EBDB [21], DMENet [24], JNB [40], and DPDNet [1]. RDPD is our method trained on synthetically generated DP data only. DPDNet+ and RDPD+ are trained on both Canon and synthetic DP data. DPDNet-Single and RSPD+ are trained on a single DP view (i.e., Il). In general, RDPD and RDPD+ are able to recover more image details. Interestingly, RDPD trained on synthetic data generalizes well to real data from Canon camera. Input EBDB DMENet JNB DPDNet-Single

[Figure S6 panels, left to right: Input, EBDB, DMENet, JNB, DPDNet-Single; DPDNet, DPDNet+, RSPD+, RDPD, RDPD+, Ground truth.]

Figure S6. Qualitative results on images from the Canon dataset [1]. We compare with other defocus deblurring methods: EBDB [21], DMENet [24], JNB [40], and DPDNet [1]. RDPD is our method trained on synthetically generated DP data only. DPDNet+ and RDPD+ are trained on both Canon and synthetic DP data. DPDNet-Single and RSPD+ are trained on a single DP view (i.e., Il). In general, RDPD and RDPD+ are able to recover more image details. Interestingly, RDPD, trained on synthetic data only, generalizes well to real data from the Canon camera.

[Figure S7 panels, left to right: Input, EBDB, DMENet, JNB, DPDNet-Single; DPDNet, DPDNet+, RSPD+, RDPD, RDPD+, Ground truth.]

Figure S7. Qualitative results on images from the Canon dataset [1]. We compare with other defocus deblurring methods: EBDB [21], DMENet [24], JNB [40], and DPDNet [1]. RDPD is our method trained on synthetically generated DP data only. DPDNet+ and RDPD+ are trained on both Canon and synthetic DP data. DPDNet-Single and RSPD+ are trained on a single DP view (i.e., Il). In general, RDPD and RDPD+ are able to recover more image details. Interestingly, RDPD, trained on synthetic data only, generalizes well to real data from the Canon camera.

[Figure S8 panels, left to right: Input, DPDNet, DPDNet+, Our RDPD+.]

Figure S8. Qualitative results on images captured by a Pixel smartphone. DPDNet [1] is trained on Canon DP data only. DPDNet+ and RDPD+ are trained on both Canon and synthetic DP data. In general, RDPD+ is able to recover more image details. Interestingly, DPDNet+ achieves better results when its training data is augmented with our synthetic data. Note that there is no ground truth sharp image for the Pixel smartphone because smartphones have a fixed aperture; as a result, a narrow-aperture image cannot be captured to serve as a ground truth image. Additionally, we note that the DP data currently available from Pixel smartphones are not full-frame but are limited to only one of the green channels in the raw-Bayer frame.

References

[1] Abdullah Abuolaim and Michael S. Brown. Defocus deblurring using dual-pixel data. In ECCV, 2020.
[2] Abdullah Abuolaim and Michael S. Brown. Online lens motion smoothing for video autofocus. In WACV, 2020.
[3] Abdullah Abuolaim, Abhijith Punnappurath, and Michael S. Brown. Revisiting autofocus for smartphone cameras. In ECCV, 2018.
[4] Abdullah Abuolaim, Radu Timofte, Michael S. Brown, et al. NTIRE 2021 challenge for defocus deblurring using dual-pixel images: Methods and results. In CVPR Workshops, 2021.
[5] Faisal Bukhari and Matthew N. Dailey. Automatic radial distortion estimation from a single image. Journal of Mathematical Imaging and Vision, 45(1):31–45, 2013.
[6] Laurent D'Andrès, Jordi Salvador, Axel Kochale, and Sabine Süsstrunk. Non-parametric blur map regression for depth of field extension. TIP, 25(4):1660–1673, 2016.
[7] D. A. Fish, A. M. Brinicombe, E. R. Pike, and J. G. Walker. Blind deconvolution by means of the Richardson–Lucy algorithm. Journal of the Optical Society of America A, 12(1):58–65, 1995.
[8] Andrew W. Fitzgibbon. Simultaneous linear estimation of multiple view geometry and lens distortion. In CVPR, 2001.
[9] Alessandro Foi, Mejdi Trimeche, Vladimir Katkovnik, and Karen Egiazarian. Practical Poissonian–Gaussian noise modeling and fitting for single-image raw-data. TIP, 17(10):1737–1754, 2008.
[10] Rahul Garg, Neal Wadhwa, Sameer Ansari, and Jonathan T. Barron. Learning single camera depth estimation using dual-pixels. In ICCV, 2019.
[11] Jochen Gast and Stefan Roth. Deep video deblurring: The devil is in the details. In ICCV Workshops, 2019.
[12] Xiaoyang Guo, Hongsheng Li, Shuai Yi, Jimmy Ren, and Xiaogang Wang. Learning monocular depth by distilling cross-domain stereo networks. In ECCV, 2018.
[13] Richard Hartley and Sing Bing Kang. Parameter-free radial distortion correction with center of distortion estimation. TPAMI, 29(8):1309–1321, 2007.
[14] Caner Hazirbas, Sebastian Georg Soyer, Maximilian Christian Staab, Laura Leal-Taixé, and Daniel Cremers. Deep depth from focus. In ACCV, 2018.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
[16] Daniel Hernandez-Juarez, Lukas Schneider, Antonio Espinosa, David Vazquez, Antonio M. Lopez, Uwe Franke, Marc Pollefeys, and Juan Carlos Moure. Slanted stixels: Representing San Francisco's steepest streets. In BMVC, 2017.
[17] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.
[18] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[19] Jinbeum Jang, Yoonjong Yoo, Jongheon Kim, and Joonki Paik. Sensor-based auto-focusing system using multi-scale feature extraction and phase correlation matching. Sensors, 15(3):5747–5762, 2015.
[20] Neel Joshi, Richard Szeliski, and David J. Kriegman. PSF estimation using sharp edge prediction. In CVPR, 2008.
[21] Ali Karaali and Claudio Rosito Jung. Edge-based defocus blur estimation with adaptive scale selection. TIP, 27(3):1126–1137, 2017.
[22] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[23] Dilip Krishnan and Rob Fergus. Fast image deconvolution using hyper-Laplacian priors. In NeurIPS, 2009.
[24] Junyong Lee, Sungkil Lee, Sunghyun Cho, and Seungyong Lee. Deep defocus map estimation using domain adaptation. In CVPR, 2019.
[25] Junyong Lee, Hyeongseok Son, Jaesung Rim, Sunghyun Cho, and Seungyong Lee. Iterative filter adaptive network for single image defocus deblurring. In CVPR, 2021.
[26] Anat Levin, Yair Weiss, Fredo Durand, and William T. Freeman. Understanding blind deconvolution algorithms. TPAMI, 33(12):2354–2367, 2011.
[27] Ce Liu, Richard Szeliski, Sing Bing Kang, C. Lawrence Zitnick, and William T. Freeman. Automatic estimation and removal of noise from a single image. TPAMI, 30(2):299–314, 2007.
[28] Zhengyang Lu and Ying Chen. Single image super resolution based on a modified U-Net with mixed gradient loss. arXiv preprint arXiv:1911.09428, 2019.
[29] Fahim Mannan and Michael S. Langer. Blur calibration for depth from defocus. In Conference on Computer and Robot Vision (CRV), 2016.
[30] Xiaojiao Mao, Chunhua Shen, and Yu-Bin Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In NeurIPS, 2016.
[31] Maxim Maximov, Kevin Galim, and Laura Leal-Taixé. Focus on defocus: bridging the synthetic to real domain gap for depth estimation. In CVPR, 2020.
[32] Liyuan Pan, Shah Chowdhury, Richard Hartley, Miaomiao Liu, Hongguang Zhang, and Hongdong Li. Dual pixel exploration: Simultaneous depth estimation and image restoration. In CVPR, 2021.
[33] Jinsun Park, Yu-Wing Tai, Donghyeon Cho, and In So Kweon. A unified approach of multi-scale deep and hand-crafted features for defocus estimation. In CVPR, 2017.
[34] Michael Potmesil and Indranil Chakravarty. A lens and aperture camera model for synthetic image generation. SIGGRAPH, 15(3):297–305, 1981.
[35] B. Prescott and G. F. McLean. Line-based correction of radial lens distortion. Graphical Models and Image Processing, 59(1):39–47, 1997.
[36] Abhijith Punnappurath, Abdullah Abuolaim, Mahmoud Afifi, and Michael S. Brown. Modeling defocus-disparity in dual-pixel sensors. In ICCP, 2020.
[37] Abhijith Punnappurath and Michael S. Brown. Reflection removal using a dual-pixel sensor. In CVPR, 2019.
[38] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
[39] Jianping Shi, Li Xu, and Jiaya Jia. Discriminative blur detection features. In CVPR, 2014.
[40] Jianping Shi, Li Xu, and Jiaya Jia. Just noticeable defocus blur detection and estimation. In CVPR, 2015.
[41] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017.
[42] Pratul P. Srinivasan, Tongzhou Wang, Ashwin Sreelal, Ravi Ramamoorthi, and Ren Ng. Learning to synthesize a 4D RGBD light field from a single image. In ICCV, 2017.
[43] Huixuan Tang and Kiriakos N. Kutulakos. Utilizing optical aberrations for extended-depth-of-field panoramas. In ACCV, 2012.
[44] Xin Tao, Hongyun Gao, Xiaoyong Shen, Jue Wang, and Jiaya Jia. Scale-recurrent network for deep image deblurring. In CVPR, 2018.
[45] Tu Vo. Attention! Stay focus! In CVPR Workshops, 2021.
[46] Neal Wadhwa, Rahul Garg, David E. Jacobs, Bryan E. Feldman, Nori Kanazawa, Robert Carroll, Yair Movshovitz-Attias, Jonathan T. Barron, Yael Pritch, and Marc Levoy. Synthetic depth-of-field with a single-camera mobile phone. ACM Transactions on Graphics, 37(4):64, 2018.
[47] Wenguan Wang, Jianbing Shen, Fang Guo, Ming-Ming Cheng, and Ali Borji. Revisiting video saliency: A large-scale benchmark and a new model. In CVPR, 2018.
[48] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.
[49] Yinda Zhang, Neal Wadhwa, Sergio Orts-Escolano, Christian Häne, Sean Fanello, and Rahul Garg. Du2Net: Learning depth estimation from dual-cameras and dual-pixels. In ECCV, 2020.
[50] Wenda Zhao, Fan Zhao, Dong Wang, and Huchuan Lu. Defocus blur detection via multi-stream bottom-top-bottom fully convolutional network. In CVPR, 2018.