The “Vertigo Effect” on Your Smartphone: Dolly Zoom via Single Shot View Synthesis

Yangwen Liang, Rohit Ranade, Shuangquan Wang, Dongwoon Bai, Jungwon Lee
Samsung Semiconductor Inc.
{liang.yw, rohit.r7, shuangquan.w, dongwoon.bai, jungwon2.lee}@samsung.com

Abstract

Dolly zoom is a technique where the camera is moved either forwards or backwards from the subject under focus while simultaneously adjusting the field of view in order to maintain the size of the subject in the frame. This results in an effect in which the subject in focus appears stationary while the background field of view changes. The effect is frequently used in films and requires skill, practice and equipment. This paper presents a novel technique to model the effect given a single shot capture from a single camera. The proposed synthesis pipeline, based on camera geometry, simulates the effect by producing a sequence of synthesized views. The technique also accommodates simultaneous captures from multiple cameras as inputs and can be easily extended to video sequence captures. Our pipeline consists of efficient image warping along with depth-aware image inpainting, making it suitable for smartphone applications. The proposed method opens up new avenues for view synthesis applications in modern smartphones.

Figure 1: Single camera single shot dolly zoom view synthesis example. (a) Input image I_1; (b) input image I_2; (c) dolly zoom synthesized image with SDoF produced by our method. Here, (a) is generated from (b) through digital zoom.

1. Introduction

The “dolly zoom” effect was first conceived in Alfred Hitchcock's 1958 film “Vertigo” and, since then, has been frequently used by film makers in numerous other films. The photographic effect is achieved by zooming in or out in order to adjust the field of view (FoV) while simultaneously moving the camera away from or towards the subject. This leads to a continuous perspective effect, with the most directly noticeable feature being that the background appears to change size relative to the subject [13]. Execution of the effect requires skill and equipment, due to the necessity of simultaneous zooming and camera movement. It is especially difficult to execute on mobile phone cameras because of the required fine control of zoom, object tracking and movement.

Previous attempts at automatically simulating this effect required the use of a specialized light field camera [19], while the process in [20, 1] involved capturing images while moving the camera, tracking interest points and then applying a calculated scaling factor to the images. Generation of views from different camera positions given a sequence of images through view interpolation is mentioned in [8], while [10] mentions methods to generate images given a 3D scene. Some of the earliest methods for synthesizing images through view interpolation include [6, 23, 31]. More recent methods have applied deep convolutional networks for producing novel views from a single image [30, 17], from multiple input images [28, 9], or for producing new video frames in existing videos [15].

In this paper, we model the effect using camera geometry and propose a novel synthesis pipeline to simulate the effect given a single shot of single or multi-camera captures, where a single shot is defined as an image and depth capture collected from each camera at a particular time instant and location. The depth map can be obtained from passive sensing methods, cf. e.g. [2, 11], or active sensing methods, cf. e.g. [22, 24], and may also be inferred from a single image through convolutional neural networks, cf. e.g. [5]. The synthesis pipeline handles occlusion areas through depth-aware image inpainting. Traditional methods for image inpainting include [3, 4, 7, 26], while others like [16] adopt the method in [7] for depth-based inpainting. Recent methods like [21, 14] apply convolutional networks to this task. However, since these methods have high complexity, we implement a simpler algorithm suitable for smartphone applications. Our pipeline also includes the application of the shallow depth of field (SDoF) [27] effect for image enhancement. An example result of our method, with the camera simulated to move towards the object under focus while simultaneously changing the FoV, is shown in Figure 1. In this example, Figures 1a and 1b are the input images and Figure 1c is the dolly zoom synthesized image with the SDoF effect applied. Notice that the dog remains in focus and the same size while the background shrinks with an increase in FoV.

The rest of the paper is organized as follows: the system model and view synthesis with camera geometry are described in Section 2, while experiment results are given in Section 3. Conclusions are drawn in Section 4.

Notation Scheme In this paper, a matrix is denoted as H, and (·)^T denotes transpose. The projection of a point P, defined as P = (X, Y, Z)^T in R^3, is denoted as a point u, defined as u = (x, y)^T in R^2. Scalars are denoted as X or x. Correspondingly, I is used to represent an image. I(x, y), or alternately I(u), is the intensity of the image at location (x, y). Similarly, for a matrix H, H(x, y) denotes the element at pixel (x, y) in that matrix. J_n and 0_n denote the n × n identity matrix and the n × 1 zero vector, respectively. diag{x_1, x_2, ..., x_n} denotes a diagonal matrix with elements x_1, x_2, ..., x_n on the main diagonal.

Figure 2: Single camera system setup under dolly zoom.

2. View Synthesis based on Camera Geometry

Consider two pin-hole cameras A and B with camera centers at locations C_A and C_B, respectively. From [12], based on the coordinate system of camera A, the projections of any point P ∈ R^3 onto the camera image planes are

\[
\begin{bmatrix} \mathbf{u}_A \\ 1 \end{bmatrix} = \frac{1}{D_A} K_A \begin{bmatrix} J_3 & \mathbf{0}_3 \end{bmatrix} \begin{bmatrix} \mathbf{P} \\ 1 \end{bmatrix}, \qquad
\begin{bmatrix} \mathbf{u}_B \\ 1 \end{bmatrix} = \frac{1}{D_B} K_B \begin{bmatrix} R & \mathbf{T} \end{bmatrix} \begin{bmatrix} \mathbf{P} \\ 1 \end{bmatrix}
\]

for cameras A and B, respectively. Here, the 2 × 1 vector u_X, the 3 × 3 matrix K_X, and the scalar D_X are the pixel coordinates on the image plane, the intrinsic parameters, and the depth of P for camera X, X ∈ {A, B}, respectively. The 3 × 3 matrix R and the 3 × 1 vector T are the relative rotation and translation of camera B with respect to camera A. In general, the relationship between u_B and u_A can be obtained in closed form as

\[
\begin{bmatrix} \mathbf{u}_B \\ 1 \end{bmatrix} = \frac{D_A}{D_B} K_B R (K_A)^{-1} \begin{bmatrix} \mathbf{u}_A \\ 1 \end{bmatrix} + \frac{K_B \mathbf{T}}{D_B}, \tag{1}
\]

where T can also be written as T = R(C_A − C_B).
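For concreteness, the reprojection of Eq. 1 can be written in a few lines of NumPy. This is a minimal sketch added for illustration, not code from the paper; the camera parameters below (K, R, T and the depths) are placeholder values chosen only to show the mapping.

    import numpy as np

    def reproject(u_A, D_A, D_B, K_A, K_B, R, T):
        """Map a pixel u_A seen by camera A at depth D_A to camera B (Eq. 1)."""
        x_A = np.array([u_A[0], u_A[1], 1.0])            # homogeneous pixel of camera A
        x_B = (D_A / D_B) * K_B @ R @ np.linalg.inv(K_A) @ x_A + (K_B @ T) / D_B
        return x_B[:2] / x_B[2]                          # back to inhomogeneous pixel coords

    # Example with placeholder parameters (not from the paper).
    K = np.array([[800.0, 0.0, 320.0],
                  [0.0, 800.0, 240.0],
                  [0.0, 0.0, 1.0]])
    R = np.eye(3)                                        # no relative rotation
    T = np.array([0.0, 0.0, -0.2])                       # T = C_A - C_B: camera B sits 0.2 m forward
    u_B = reproject(u_A=(400.0, 300.0), D_A=2.0, D_B=1.8, K_A=K, K_B=K, R=R, T=T)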

2.1. Single Camera System

Consider the system setup for a single camera under dolly zoom as shown in Figure 2. Here, camera 1 is at an initial position C_1^A with a FoV of θ_1^A and focal length f_1^A (the relationship between the camera FoV θ and its focal length f, in pixel units, may be given as f = (W/2)/tan(θ/2), where W is the image width). In order to achieve the dolly zoom effect, we assume that it undergoes a translation by a distance t to position C_1^B along with a change in its focal length to f_1^B and, correspondingly, a change of FoV to θ_1^B (θ_1^B ≥ θ_1^A). D_1^A is the depth of a 3D point P and D_0 is the depth of the focus plane from camera 1 at the initial position C_1^A. Our goal is to create a synthetic view at location C_1^B from a capture at location C_1^A such that any object in the focus plane is projected at the same pixel location in the 2D image plane regardless of camera location. For the same 3D point P, u_1^A is its projection onto the image plane of camera 1 at its initial position C_1^A, while u_1^B is its projection onto the image plane of camera 1 after it has moved to position C_1^B. We make the following assumptions:

1. The translation of the camera center is along the principal axis Z. Accordingly, C_1^A − C_1^B = (0, 0, −t)^T, while the depth of P with respect to camera 1 at position C_1^B is D_1^B = D_1^A − t.

2. There is no relative rotation during the camera translation. Therefore, R is the identity matrix J_3.

3. Assuming there is no shear factor, the camera intrinsic matrix K_1^A of the camera at location C_1^A can be modeled as [12]

\[
K_1^A = \begin{bmatrix} f_1^A & 0 & u_0 \\ 0 & f_1^A & v_0 \\ 0 & 0 & 1 \end{bmatrix}, \tag{2}
\]

where u_0 = (u_0, v_0)^T is the principal point in terms of pixel dimensions. Assuming the resulting image resolution does not change, the intrinsic matrix K_1^B at position C_1^B is related to that at C_1^A through a zooming factor k and can be obtained as K_1^B = K_1^A diag{k, k, 1}, where k = f_1^B / f_1^A = (D_0 − t) / D_0.

4. The depth of P satisfies D_1^A > t. Otherwise, P is excluded from the image plane of the camera at position C_1^B.

From Eq. 1, we can obtain the closed-form solution for u_1^B in terms of u_1^A as

\[
\mathbf{u}_1^B = \frac{D_1^A (D_0 - t)}{D_0 (D_1^A - t)}\, \mathbf{u}_1^A + \frac{t (D_1^A - D_0)}{D_0 (D_1^A - t)}\, \mathbf{u}_0. \tag{3}
\]
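As an illustration only (not code from the paper), the mapping in Eq. 3 reduces to a per-pixel scaling about the principal point and can be vectorized over a whole depth map; the function below is a NumPy sketch written under the four assumptions above.

    import numpy as np

    def dolly_zoom_coords(u, depth, d0, t, principal_point):
        """Eq. 3: map pixels u (N x 2) at depths `depth` (N,) to the translated,
        re-zoomed camera so that points on the focus plane (depth == d0) stay put."""
        u = np.asarray(u, dtype=np.float64)
        depth = np.asarray(depth, dtype=np.float64)
        u0 = np.asarray(principal_point, dtype=np.float64)
        scale = (depth * (d0 - t)) / (d0 * (depth - t))   # coefficient of u_1^A
        offset = (t * (depth - d0)) / (d0 * (depth - t))  # coefficient of u_0
        return scale[:, None] * u + offset[:, None] * u0

    # Points exactly on the focus plane map to themselves, the defining dolly zoom property:
    pts = np.array([[100.0, 50.0], [400.0, 300.0]])
    print(dolly_zoom_coords(pts, depth=np.array([2.0, 2.0]), d0=2.0, t=0.3,
                            principal_point=(320.0, 240.0)))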

A generalized equation for camera movements along the horizontal and vertical directions, together with the translation along the principal axis and the change of FoV/focal length, is derived in the supplementary material.

Let I_1 be the input image from camera 1 and D_1 be the corresponding depth map, so that for each pixel u = (x, y) the corresponding depth D_1(u) may be obtained. I_1 can now be warped using Eq. 3 with D_1 for a camera translation t to obtain the synthesized image I_1^DZ. Similarly, we can warp D_1 with the known t and obtain the corresponding depth D_1^DZ. This step is implemented through z-buffering [25] and forward warping. An example is shown in Figure 3.

Figure 3: Single camera image synthesis under dolly zoom. (a) Input image I_1 with FoV θ_1^A = 45°; (b) synthesized image I_1^DZ with FoV θ_1^B = 50°. Epipoles (green dots) and sample point correspondences (red, blue, yellow, magenta dots) along with their epipolar lines are shown.

Epipolar Geometry Analysis In epipolar geometry, pixel movement is along the epipolar lines, which are related by the fundamental matrix between the two camera views. The fundamental matrix F_1 relates corresponding pixels in the images of the two views without knowledge of pixel depth information [12]. This is a necessary condition for corresponding points and can be given as x_1^{B T} F_1 x_1^A = 0, where x_1^A = ((u_1^A)^T, 1)^T and x_1^B = ((u_1^B)^T, 1)^T. From [12], it is easy to show that the fundamental matrix can be obtained as

\[
F_1 = \begin{bmatrix} 0 & -1 & v_0 \\ 1 & 0 & -u_0 \\ -v_0 & u_0 & 0 \end{bmatrix},
\]

and the corresponding epipoles and epipolar lines can be obtained accordingly [12], as shown in Figure 3. The epipoles are e_1^A = e_1^B = (u_0, v_0, 1)^T for both locations C_1^A and C_1^B, since the camera moves along the principal axis [12].

Digital Zoom It is worth noting that θ_1^A may be a partial FoV of the actual camera FoV θ_1 at the initial position C_1^A. Straightforward digital zoom can be employed to obtain the partial FoV image. Given the actual intrinsic matrix K_1, the intrinsic matrix K_1^A for the partial FoV can be obtained as K_1^A = K_1 diag{k_0, k_0, 1}, where k_0 = f_1^A / f_1 = tan(θ_1/2) / tan(θ_1^A/2). Subsequently, a closed-form equation may be obtained for the zoomed pixel coordinates u_1^A in terms of the actual image pixel location u_1 (with the camera rotation R being the identity matrix J_3):

\[
\mathbf{u}_1^A = (f_1^A / f_1)\, \mathbf{u}_1 + \bigl(1 - (f_1^A / f_1)\bigr)\, \mathbf{u}_0. \tag{4}
\]

Eq. 4 may be used to digitally zoom I_1 and D_1 to the required FoV θ_1^A.
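The forward warp with z-buffering mentioned above can be sketched as follows. This is an illustrative NumPy fragment written for this article rather than the authors' implementation; it assumes a dense per-pixel target coordinate map (e.g., Eq. 3 evaluated at every pixel), splats to the nearest target pixel, resolves collisions by keeping the smallest-depth source, and returns a boolean mask of the pixels that received no data.

    import numpy as np

    def forward_warp_zbuffer(image, depth, map_x, map_y):
        """Forward-warp `image` using per-pixel target coordinates (map_x, map_y).
        When several source pixels land on the same target, the smallest depth wins."""
        h, w = depth.shape
        warped = np.zeros_like(image)
        warped_depth = np.full((h, w), np.inf)
        hole_mask = np.ones((h, w), dtype=bool)           # True where nothing was written

        ys, xs = np.mgrid[0:h, 0:w]
        tx = np.round(map_x).ravel().astype(int)          # nearest-pixel splatting
        ty = np.round(map_y).ravel().astype(int)
        sx, sy, sd = xs.ravel(), ys.ravel(), depth.ravel()

        inside = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
        for x_t, y_t, x_s, y_s, d in zip(tx[inside], ty[inside],
                                         sx[inside], sy[inside], sd[inside]):
            if d < warped_depth[y_t, x_t]:                # z-buffer test
                warped_depth[y_t, x_t] = d
                warped[y_t, x_t] = image[y_s, x_s]
                hole_mask[y_t, x_t] = False
        return warped, warped_depth, hole_mask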

2.2. Introducing a Second Camera to the System

Applying the synthesis formula of Eq. 3 for a single camera results in many missing and occluded areas as the FoV increases. Some of these areas can be filled using projections from other available cameras with different FoVs. We now introduce a second camera to the system for this purpose. Consider the system shown in Figure 4, where a second camera with focal length f_2 is placed at position C_2. As an example, we assume that both cameras are well calibrated [29], i.e. the two cameras are on the same plane and their principal axes are perpendicular to that plane. Let b be the baseline between the two cameras. The projection of point P on the image plane of camera 2 is at pixel location u_2.

We once again assume that there is no relative rotation between the two cameras (or that it has been corrected during camera calibration [29]). The translation of the second camera from position C_2 to position C_1^B can be given as C_2 − C_1^B = (b, 0, −t)^T. Here, we assume the baseline is along the X-axis, but it is simple to extend to any direction. For the same point P, the corresponding depth relationship can be given as D_1^B = D_2 − t, where D_2 denotes the depth of P seen by camera 2 at position C_2. Assuming the image resolutions are the same, the intrinsic matrix K_2 of camera 2 can be related to the intrinsic matrix of camera 1 at position C_1^A as K_2 = K_1^A diag{k′, k′, 1}, where the zooming factor k′ can be given as k′ = f_2 / f_1^A = tan(θ_1^A/2) / tan(θ_2/2).
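To make the intrinsic relationships of Sections 2.1 and 2.2 concrete, the short helper below (a sketch written for this article, with a principal point at the image center and example FoVs taken from the experiments) builds the pinhole intrinsic matrix from a FoV via f = (W/2)/tan(θ/2) and computes the zoom factors k and k′.

    import numpy as np

    def focal_from_fov(fov_deg, width):
        """f = (W/2) / tan(theta/2), with f in pixel units."""
        return 0.5 * width / np.tan(np.radians(fov_deg) / 2.0)

    def intrinsics(f, principal_point):
        u0, v0 = principal_point
        return np.array([[f, 0.0, u0],
                         [0.0, f, v0],
                         [0.0, 0.0, 1.0]])

    W, H = 640, 480
    f1 = focal_from_fov(45.0, W)                  # camera 1, theta_1^A = 45 deg
    f2 = focal_from_fov(77.0, W)                  # camera 2, theta_2 = 77 deg
    K1 = intrinsics(f1, (W / 2, H / 2))

    d0, t = 2.0, 0.3
    k = (d0 - t) / d0                             # zoom factor between C_1^A and C_1^B
    K1_B = K1 @ np.diag([k, k, 1.0])              # K_1^B = K_1^A diag{k, k, 1}
    k_prime = f2 / f1                             # = tan(theta_1^A/2) / tan(theta_2/2)
    K2 = K1 @ np.diag([k_prime, k_prime, 1.0])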

Figure 4: Two camera system setup under dolly zoom.

Figure 6: Image fusion. (a) Synthesized image I_1^DZ with FoV θ_1^B = 50°; (b) synthesized image I_2^DZ with FoV θ_2^B = 50°; (c) binary mask B; (d) synthesized fused image I_F.

A closed-form solution for u_1^B can then be obtained as

\[
\mathbf{u}_1^B = \frac{D_2 k}{(D_2 - t)\,k'} (\mathbf{u}_2 - \mathbf{u}_0) + \mathbf{u}_0 + \frac{1}{D_2 - t}\begin{bmatrix} b f_1^A k \\ 0 \end{bmatrix}. \tag{5}
\]

Let I_2 be the input image from camera 2 and D_2 be the corresponding depth map. I_2 can now be warped using Eq. 5 with D_2 for a camera translation t to obtain the synthesized image I_2^DZ. We once again use forward warping with z-buffering for this step. An example is shown in Figure 5. This derivation can be easily extended to include any number of additional cameras in the system.

A generalized equation for camera movements along the horizontal and vertical directions, together with the translation along the principal axis and the change of FoV/focal length, is derived in the supplementary material for this case as well.

Epipolar Geometry Analysis Similar to the single camera case, we can derive the fundamental matrix F_2 in closed form as

\[
F_2 = \begin{bmatrix}
0 & -t & t v_0 \\
t & 0 & b f_1^A k' - t u_0 \\
-t v_0 & t u_0 - b f_1^A k & b f_1^A v_0 (k - k')
\end{bmatrix}, \tag{6}
\]

such that the pixel location relationship x_1^{B T} F_2 x_2 = 0 is satisfied. Here, x_2 = (u_2^T, 1)^T is the homogeneous representation of pixels of the camera at location C_2. Therefore, the corresponding epipoles are e_1 = (u_0 − b f_1^A k / t, v_0, 1)^T and e_2 = (u_0 − b f_1^A k′ / t, v_0, 1)^T for the cameras at locations C_1^B and C_2, respectively. The epipolar lines can also be obtained accordingly [12], as shown in Figure 5.

Figure 5: Image synthesis for the second camera under dolly zoom. (a) Input image I_2 with FoV θ_2 = 77°; (b) synthesized image I_2^DZ with FoV θ_1^B = 50°. Epipoles (green dots) and sample point correspondences (red, blue, yellow, magenta dots) along with their epipolar lines are shown.

2.3. Image Fusion

We now intend to use the synthesized image I_2^DZ from the second camera to fill in missing/occluded areas in the synthesized image I_1^DZ from the first camera. This is achieved through image fusion with the following steps:

1. The first step is to identify missing areas in the synthesized view I_1^DZ. Here, we implement the simple scheme given below to create a binary mask by checking the validity of I_1^DZ at each pixel location (x, y):

\[
B(x, y) = \begin{cases} 1, & I_1^{DZ}(x, y) \in O_1^{m,c} \\ 0, & I_1^{DZ}(x, y) \notin O_1^{m,c} \end{cases} \tag{7}
\]

where O_1^{m,c} denotes the set of missing/occluded pixels of I_1^DZ due to warping.

2. With the binary mask B, the synthesized images I_1^DZ and I_2^DZ are fused to generate I_F:

\[
I_F = B \cdot I_2^{DZ} + (1 - B) \cdot I_1^{DZ}, \tag{8}
\]

where · is the element-wise matrix product.

The depths of the synthesized views, D_1^DZ and D_2^DZ, are fused in a similar manner to obtain D_F. An example is shown in Figure 6. In this example, the FoV of the second camera is greater than that of the first camera (i.e. θ_2 > θ_1^A) in order to fill a larger missing area. For image fusion, we could also apply more advanced methods which are capable of handling photometric differences between the input images, e.g. Poisson fusion [18].
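The fusion in Eqs. 7–8 amounts to a masked selection. The snippet below is a minimal sketch (not the paper's implementation) that assumes a boolean hole mask produced by the forward warp, uses it as the binary mask B, and blends the two synthesized views and their depths.

    import numpy as np

    def fuse_views(img1_dz, img2_dz, depth1_dz, depth2_dz, hole_mask1):
        """Eqs. 7-8: fill pixels that are missing in view 1 with the warped view 2."""
        B = hole_mask1.astype(img1_dz.dtype)              # 1 where I_1^DZ is missing/occluded
        if img1_dz.ndim == 3:                             # broadcast mask over color channels
            B = B[..., None]
        fused_img = B * img2_dz + (1.0 - B) * img1_dz     # Eq. 8
        fused_depth = hole_mask1 * depth2_dz + (~hole_mask1) * depth1_dz
        return fused_img, fused_depth

    # `hole_mask1` could be the boolean mask returned by the forward-warp sketch above.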

2.4. Depth Aware Image Occlusion Handling

In order to handle occlusions, for each synthesized dolly zoom view we identify occlusion areas and fill them in using neighboring information for satisfactory subjective viewing. Occlusions occur due to the nature of the camera movement and depth discontinuities along the epipolar lines [12]. Therefore, one constraint in filling occlusion areas is that, whenever possible, they should be filled only with the background and not the foreground.

Occlusion Area Identification The first step is to identify occlusion areas. Let I_F be the generated view after image fusion and let M be a binary mask depicting occlusion areas. Similar to Section 2.3, M is generated by checking the validity of I_F at each pixel location (x, y):

\[
M(x, y) = \begin{cases} 1, & I_F(x, y) \in O_F^{c} \\ 0, & I_F(x, y) \notin O_F^{c} \end{cases} \tag{9}
\]

where O_F^c denotes the set of occluded pixels of I_F after the image fusion of Section 2.3.

Depth Hole-Filling A critical piece of information is the fused depth D_F for the synthesized view, which allows us to distinguish between foreground and background. D_F will also have holes due to occlusion. If we intend to use the depth for image hole-filling, we need to first fill the holes in the depth itself. We implement the simple nearest neighbor hole filling scheme described in Algorithm 1.

Algorithm 1: Depth map hole filling
Input: Fused depth D_F, occlusion mask M, dimensions (width W and height H)
Output: The hole-filled depth D̄_F
1. Initialize D̄_F = D_F.
2. for x = 1 to H do
     for y = 1 to W do
       if M(x, y) = 1 then
         2.1) Find the four nearest valid neighbors (left, right, bottom, top).
         2.2) Take the neighbor with the maximum value d_max, since we intend to fill in the missing values with background values.
         2.3) Set D̄_F(x, y) = d_max.
       end
     end
   end

Depth-Aware Image Inpainting Hole filling for the synthesized view needs to propagate from the background towards the foreground. The hole filling strategy is described in Algorithm 2.

Algorithm 2: Image hole filling
Input: Synthesized view I_F, hole-filled depth D̄_F, occlusion mask M, depth segment mask M_prev initialized to zeros, and dimensions (width W and height H)
Output: The hole-filled synthesized view Ī_F
1. Initialize Ī_F = I_F.
2. Determine all unique values in D̄_F. Let d^u be the array of unique values in ascending order and S be the number of unique values.
for s = S to 2 do
  3.1) Depth mask D_s corresponding to the depth step:
       D_s = (D̄_F > d^u(s − 1)) & (D̄_F ≤ d^u(s)),
       where >, ≤ and & are the element-wise greater-than, less-than-or-equal-to and AND operations.
  3.2) Image segment I_s corresponding to the depth mask:
       I_s = Ī_F · D_s,
       where · is the element-wise matrix product.
  3.3) Current occlusion mask for the depth step:
       M_curr = M · D_s.
  3.4) Update M_curr with the previous mask:
       M_curr = M_curr ∥ M_prev,
       where ∥ is the element-wise matrix OR operation.
  for x = 1 to H do
    for y = 1 to W do
      if M_curr(x, y) = 1 then
        3.5.1) Find the nearest valid pixel on the same row, I_s(x′, y′), where (x′, y′) is the location of the valid pixel.
        3.5.2) Update Ī_F(x, y) = I_s(x′, y′).
        3.5.3) Update M_curr(x, y) = 0.
        3.5.4) Update M(x, y) = 0.
      end
    end
  end
  3.6) Propagate the current occlusion mask to the next step: M_prev = M_curr.
end
4. Apply simple low-pass filtering on the filled-in occluded areas in Ī_F.

This strategy is implemented in a back-to-front order, with the intuition being that holes in the image should be filled in from parts of the image at the same depth or the next closest depth. In this simple hole filling algorithm we search for valid image pixels along the same row, but this could be extended to finding more than one valid pixel in both the horizontal and vertical directions, or to using the epipolar analysis described in Sections 2.1 and 2.2 to define the search directions. The results of the hole filling process are shown in Figure 7.

Figure 7: Occlusion handling. (a) Depth map D_F for the fused image I_F; (b) depth map after hole filling, D̄_F; (c) synthesized fused image I_F; (d) synthesized fused image after hole filling, Ī_F.
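For readers who prefer code, below is a compact NumPy sketch of the back-to-front filling of Algorithm 2, written for this article. It simplifies the bookkeeping, omits the final low-pass filtering step, and assumes the depth map has already been hole-filled as in Algorithm 1.

    import numpy as np

    def depth_aware_fill(image, depth_filled, occlusion_mask):
        """Back-to-front hole filling: for each depth layer (farthest first), fill
        occluded pixels from the nearest valid pixel of that layer on the same row."""
        filled = image.copy()
        M = occlusion_mask.copy().astype(bool)
        M_prev = np.zeros_like(M)
        levels = np.unique(depth_filled)                    # ascending depth values

        for s in range(len(levels) - 1, 0, -1):             # back to front
            D_s = (depth_filled > levels[s - 1]) & (depth_filled <= levels[s])
            M_curr = (M & D_s) | M_prev
            valid = D_s & ~M                                 # usable source pixels of this layer
            for x, y in zip(*np.nonzero(M_curr)):
                cols = np.nonzero(valid[x])[0]               # valid pixels on the same row
                if cols.size:
                    y_src = cols[np.argmin(np.abs(cols - y))]  # nearest valid column
                    filled[x, y] = image[x, y_src]
                    M_curr[x, y] = False
                    M[x, y] = False
            M_prev = M_curr
        return filled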
2.5. Shallow Depth of Field (SDoF)

After view synthesis and occlusion handling, we can apply the shallow depth of field (SDoF) effect to Ī_F. This effect involves the application of depth-aware blurring. The diameter c of the blur kernel on the image plane is called the circle of confusion (CoC). Assuming a thin-lens camera model [12], the relation between c, the lens aperture A, the magnification factor m, the distance D_0 to the object under focus and the distance D to another object can be given as [25] c = A m |D − D_0| / D. Under the dolly zoom condition, the magnification factor m and the relative distance |D − D_0| remain constant. However, after a camera translation of t, the diameter of the CoC changes to

\[
c(t) = A m \frac{|D - D_0|}{D - t} = c(0)\,\frac{D}{D - t}, \tag{10}
\]

where c(t) is the CoC for an object at depth D after a camera translation t. The detailed derivation can be found in the supplementary material. The usage of the SDoF effect is two-fold: 1) it enhances viewer attention on the objects in focus, and 2) it hides imperfections due to the image warping, image fusion and hole filling steps.
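A small sketch (written for this article, with arbitrary aperture and magnification constants) of the per-pixel CoC map that would drive the depth-aware blur of Eq. 10; how the blur kernel itself is applied is left open.

    import numpy as np

    def coc_map(depth, d0, t, aperture=1.0, magnification=0.05, eps=1e-6):
        """Per-pixel circle-of-confusion diameter after a camera translation t (Eq. 10).
        Pixels near the focus plane (depth ~ d0) get a CoC close to zero."""
        depth = np.asarray(depth, dtype=np.float64)
        return aperture * magnification * np.abs(depth - d0) / np.maximum(depth - t, eps)

    # c(t) = c(0) * D / (D - t): the background blur grows as the camera moves forward.
    d = np.array([2.0, 4.0, 8.0])        # object depths, focus plane at d0 = 2.0
    print(coc_map(d, d0=2.0, t=0.0), coc_map(d, d0=2.0, t=0.5))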

2.6. Dolly Zoom View Synthesis Pipeline

Single Camera Single Shot View Synthesis The single camera single shot synthesis pipeline is shown in Figure 8 and described below:

1. The input is the image I with FoV θ, its depth map D and the known intrinsic matrix K.
2. We apply digital zoom to both I and D according to Eq. 4 (described in Section 2.1) through inverse warping to a certain angle θ_1 (in the example experiments, θ_1 is set to 30°) to obtain the zoomed-in image I_1 and the corresponding depth map D_1.
3. The original input image I, depth map D and intrinsic matrix K are re-used as I_2, D_2 and K_2, respectively.
4. A synthesized image I_1^DZ and its depth D_1^DZ are produced from I_1 and D_1 with Eq. 3 through forward warping and z-buffering for a given t.
5. A synthesized image I_2^DZ and its depth D_2^DZ are produced from I_2 and D_2 with Eq. 5 through forward warping and z-buffering for the given t. The baseline b is set to 0 for this case.
6. The synthesized images I_1^DZ and I_2^DZ are fused together (as described in Section 2.3) to form the fused image I_F, while the synthesized depth maps D_1^DZ and D_2^DZ are similarly fused together to form the fused depth map D_F.
7. Occlusion areas in I_F (and D_F) are handled (according to Section 2.4) to obtain Ī_F (and D̄_F).
8. The shallow depth of field effect is applied to Ī_F to obtain the final dolly zoom synthesized image I_F^DZ.

A restriction of this setup is that the maximum FoV for the synthesized view is limited to the original FoV θ.

Extending the pipeline for multiple camera inputs The single camera single shot dolly zoom synthesis pipeline may be extended to input images captured from multiple cameras at the same time instant with minor modifications. Consider a dual camera setup, where the inputs are the image I_1 with FoV θ_1, its depth map D_1 and the known intrinsic matrix K_1 from the first camera and, correspondingly, the image I_2 with FoV θ_2, its depth map D_2 and the known intrinsic matrix K_2 from the second camera. In this case, the application of digital zoom in Step 2 of the single camera pipeline is no longer required. Instead, we only apply Steps 4–8 with the baseline b set to the representative value, to obtain the synthesized view I_F^DZ for the dual camera case. A restriction of such a setup is that the maximum FoV for the synthesized view is now limited to θ_2 (and, in general, to the FoV of the camera with the largest FoV in the multi-camera system).
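Step 2's digital zoom (Eq. 4 applied through inverse warping) can be sketched as below. This is an illustrative NumPy fragment written for this article, using nearest-neighbor sampling rather than whatever interpolation a production pipeline would use.

    import numpy as np

    def digital_zoom(image, depth, fov_full_deg, fov_partial_deg, principal_point):
        """Step 2: digitally zoom image and depth from the full FoV to a partial FoV
        by inverse warping with Eq. 4 (nearest-neighbor sampling for brevity)."""
        h, w = depth.shape
        u0, v0 = principal_point
        # r = f_1^A / f_1 = tan(theta_full/2) / tan(theta_partial/2) >= 1
        r = np.tan(np.radians(fov_full_deg) / 2) / np.tan(np.radians(fov_partial_deg) / 2)
        xs, ys = np.meshgrid(np.arange(w), np.arange(h))
        # Invert Eq. 4: source pixel in the original image for every zoomed pixel.
        src_x = np.clip(np.round((xs - (1 - r) * u0) / r).astype(int), 0, w - 1)
        src_y = np.clip(np.round((ys - (1 - r) * v0) / r).astype(int), 0, h - 1)
        return image[src_y, src_x], depth[src_y, src_x]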


Figure 8: Single camera single shot synthesis pipeline

3. Experiment Results

3.1. Datasets

Synthetic Dataset We generated a synthetic image dataset using the commercially available graphics software Unity 3D. For the experiment, we assume a dual-camera collinear system with the following parameters: Camera 1 with FoV θ_1 = 45° and Camera 2 with FoV θ_2 = 77°. This setup is simulated in the Unity3D software. In this synthetic dataset, each image set includes: I_1 from Camera 1 and I_2 from Camera 2, the depth maps D_1 for Camera 1 and D_2 for Camera 2, and the intrinsic matrices K_1 and K_2 for Cameras 1 and 2, respectively. In addition, each image set also includes ground truth dolly zoom views, which are also generated with Unity3D for objective comparisons.

Smartphone Dataset We also created a second dataset with dual camera images from a representative smartphone device. For this dataset, the depth was estimated using a stereo module, so that each image set includes I from Camera 1 with a FoV θ = 45°, its depth map D and the intrinsic matrix K.

3.2. Experiment Setup

From the input images, we generate a sequence of images using the synthesis pipeline described in Section 2. For each image set, the depth to the object under focus (D_0) is set manually. The relationship between the dolly zoom camera translation distance t, the required dolly zoom camera FoV θ_DZ and the initial FoV θ_1 of I_1 can be obtained from Section 2.1 as

\[
t = D_0\, \frac{\tan(\theta_{DZ}/2) - \tan(\theta_1/2)}{\tan(\theta_{DZ}/2)}. \tag{11}
\]

Initializing the dolly zoom angle θ_DZ to θ_1, we increment it by a set amount δ up to a certain maximum angle, set to θ_2. For each increment, we obtain the corresponding distance t with Eq. 11. We then apply the synthesis pipeline described in Section 2.6 to obtain the synthesized image I_F^DZ for that increment. The synthesized images for all the increments are then compiled to form a single sequence.

3.3. Quantitative Evaluation

The synthesis pipeline modified for a dual camera input, as described in Section 2.6, is applied to each image set in the synthetic dataset. In order to produce a sequence of images, we initialize the dolly zoom angle θ_DZ = 45° and increment it with δ = 1° up to a maximum angle of θ_2 = 77°. The dolly zoom camera translation distance t at each increment is calculated according to Eq. 11. We then objectively measure the quality of our view synthesis by computing the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) for each synthesized view against the corresponding ground truth image at each increment, before and after the application of the SDoF effect. The mean metric values for each increment are then computed across all the image sets in the synthetic dataset and are shown in Figure 9. As the dolly zoom angle θ_DZ increases, the area of the image that needs to be inpainted due to occlusion increases, which corresponds to the drop in PSNR and SSIM.

Figure 9: Quantitative evaluation. (a) Objective metric: mean PSNR (dB) versus θ_DZ, before and after SDoF; (b) objective metric: mean SSIM versus θ_DZ, before and after SDoF.
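The sweep used to build a sequence can be reproduced in a few lines; this sketch (not the authors' code) converts each dolly zoom angle into the translation of Eq. 11 and would hand the pair to the synthesis pipeline.

    import numpy as np

    def dolly_zoom_schedule(d0, theta_1_deg, theta_max_deg, delta_deg=1.0):
        """Eq. 11: translation t for each dolly zoom angle from theta_1 to theta_max."""
        angles = np.arange(theta_1_deg, theta_max_deg + 1e-9, delta_deg)
        tan_dz = np.tan(np.radians(angles) / 2)
        t = d0 * (tan_dz - np.tan(np.radians(theta_1_deg) / 2)) / tan_dz
        return list(zip(angles, t))

    # 45 deg -> 77 deg in 1 deg steps, focus plane at 2 m: t grows from 0 towards d0.
    for theta_dz, t in dolly_zoom_schedule(d0=2.0, theta_1_deg=45.0, theta_max_deg=77.0)[:3]:
        print(f"theta_DZ = {theta_dz:.0f} deg, t = {t:.3f} m")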

3.4. Qualitative Evaluation

Figure 10 shows the results of the dual camera input synthesis pipeline applied to the image sets in the synthetic dataset. The input images I_1 and I_2 are used to produce the dolly zoom image I_F, while the rightmost column shows the corresponding ground truth dolly zoom image. Both I_F and the ground truth are shown here before the application of the SDoF effect.

Figure 10: Dual camera dolly zoom view synthesis with the synthetic data set (columns: I_1, I_2, I_F, ground truth; rows: Person and Balloon scenes).


Figure 11: Single camera single shot dolly zoom view synthesis with the smartphone data set (columns: I_1, I_2, I_1 with SDoF, I_F^DZ with SDoF; rows: Dog and Toy scenes).

Figure 11 shows the results of the single camera single shot synthesis pipeline described in Section 2.6 applied to the image sets in the smartphone dataset. Here, the input images I_1 and I_2 (formed from the image I of each image set as described in Section 2.6, Steps 2–3) are used to synthesize the dolly zoom image I_F^DZ (shown after the application of the SDoF effect). For comparison, we also show I_1 with the SDoF effect applied. Under dolly zoom, objects in the background appear to undergo depth-dependent movement: the objects under focus stay the same size while the background FoV increases. This effect is apparent in our view synthesis results. In both Figures 10 and 11, the foreground objects (balloon, person, dog and toy) remain in focus and of the same size, the background objects are warped according to their depths, and the synthesized images (I_F and I_F^DZ) have a larger background FoV than I_1.

4. Conclusion and Future Work

We have presented a novel modelling pipeline based on camera geometry to synthesize the dolly zoom effect. The synthesis pipeline presented in this paper can be applied to single camera or multi-camera image captures (where the cameras may or may not be on the same plane) and to video sequences. Generalized equations for camera movement, not just along the principal axis with a change of focal length/FoV but also along the horizontal or vertical directions, have been derived in the supplementary material as well. The focus of future work will be on advanced occlusion handling schemes to provide synthesized images with subjectively greater image quality.

References

[1] Abhishek Badki, Orazio Gallo, Jan Kautz, and Pradeep Sen. Computational zoom: a framework for post-capture image composition. ACM Transactions on Graphics (TOG), 36(4):46, 2017.
[2] Jonathan T. Barron, Andrew Adams, YiChang Shih, and Carlos Hernández. Fast bilateral-space stereo for synthetic defocus. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[3] Marcelo Bertalmio, Guillermo Sapiro, Vincent Caselles, and Coloma Ballester. Image inpainting. Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '00), pages 417–424, 2000.
[4] Marcelo Bertalmio, Luminita Vese, Guillermo Sapiro, and Stanley Osher. Simultaneous structure and texture image inpainting. IEEE Transactions on Image Processing, 12(8):882–889, 2003.
[5] Po-Yi Chen, Alexander H. Liu, Yen-Cheng Liu, and Yu-Chiang Frank Wang. Towards scene understanding: Unsupervised monocular depth estimation with semantic-aware representation. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[6] Shenchang Eric Chen and Lance Williams. View interpolation for image synthesis. Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '93), pages 279–288, 1993.
[7] Antonio Criminisi, Patrick Pérez, and Kentaro Toyama. Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on Image Processing, 13(9):1200–1212, 2004.
[8] Maha El Choubassi, Yan Xu, Alexey M. Supikov, and Oscar Nestares. View interpolation for visual storytelling. US Patent US20160381341A1, June, 2015.
[9] John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck, Noah Snavely, and Richard Tucker. DeepView: View synthesis with learned gradient descent. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[10] Orazio Gallo, Jan Kautz, and Abhishek Haridas Badki. System and methods for computational zoom. US Patent US20160381341A1, December, 2016.
[11] Rahul Garg, Neal Wadhwa, Sameer Ansari, and Jonathan T. Barron. Learning single camera depth estimation using dual-pixels. The IEEE International Conference on Computer Vision (ICCV), 2019.
[12] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Wiley, 2007.
[13] Steven Douglas Katz. Film directing shot by shot: visualizing from concept to screen. Gulf Professional Publishing, 1991.
[14] Hongyu Liu, Bin Jiang, Yi Xiao, and Chao Yang. Coherent semantic attention for image inpainting. The IEEE International Conference on Computer Vision (ICCV), October 2019.
[15] Ziwei Liu, Raymond A. Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. The IEEE International Conference on Computer Vision (ICCV), October 2017.
[16] Patrick Ndjiki-Nya, Martin Koppel, Dimitar Doshkov, Haricharan Lakshman, Philipp Merkle, Karsten Muller, and Thomas Wiegand. Depth image-based rendering with advanced texture synthesis for 3-D video. IEEE Transactions on Multimedia, 13(3):453–465, 2011.
[17] Simon Niklaus, Long Mai, Jimei Yang, and Feng Liu. 3D Ken Burns effect from a single image. ACM Transactions on Graphics (TOG), 38(6):1–15, 2019.
[18] Patrick Pérez, Michel Gangnet, and Andrew Blake. Poisson image editing. ACM Transactions on Graphics (TOG), 22(3):313–318, July 2003.
[19] Colvin Pitts, Timothy James Knight, Chia-Kai Liang, and Yi-Ren Ng. Generating dolly zoom effect using light field data. US Patent US8971625, March, 2015.
[20] Timo Pekka Pylvanainen and Timo Juhani Ahonen. Method and apparatus for automatically rendering dolly zoom effect. WO Patent Application WO2014131941, September, 2014.
[21] Yurui Ren, Xiaoming Yu, Ruonan Zhang, Thomas H. Li, Shan Liu, and Ge Li. StructureFlow: Image inpainting via structure-aware appearance flow. The IEEE International Conference on Computer Vision (ICCV), October 2019.
[22] Daniel Scharstein and Richard Szeliski. High-accuracy stereo depth maps using structured light. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2003.
[23] Steven M. Seitz and Charles R. Dyer. View morphing. Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '96), pages 21–30, 1996.
[24] Shuochen Su, Felix Heide, Gordon Wetzstein, and Wolfgang Heidrich. Deep end-to-end time-of-flight imaging. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[25] Richard Szeliski. Computer Vision: Algorithms and Applications. Springer, 2011.
[26] Alexandru Telea. An image inpainting technique based on the fast marching method. Journal of Graphics Tools, 9(1):23–34, 2004.
[27] Neal Wadhwa, Rahul Garg, David E. Jacobs, Bryan E. Feldman, Nori Kanazawa, Robert Carroll, Yair Movshovitz-Attias, Jonathan T. Barron, Yael Pritch, and Marc Levoy. Synthetic depth-of-field with a single-camera mobile phone. ACM Transactions on Graphics (TOG), 37(4):64:1–64:13, July 2018.
[28] Zexiang Xu, Sai Bi, Kalyan Sunkavalli, Sunil Hadap, Hao Su, and Ravi Ramamoorthi. Deep view synthesis from sparse photometric images. ACM Transactions on Graphics (TOG), 38(4):76, 2019.
[29] Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330–1334, November 2000.
[30] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A. Efros. View synthesis by appearance flow. European Conference on Computer Vision (ECCV), pages 286–301, 2016.
[31] C. Lawrence Zitnick, Sing Bing Kang, Matthew Uyttendaele, Simon Winder, and Richard Szeliski. High-quality video view interpolation using a layered representation. ACM Transactions on Graphics (TOG), 23(3):600–608, 2004.