<<

Computational Light Field Photography: Depth Estimation, Demosaicing, and Super-Resolution

Yongwei Li

Department of Information Systems and Technology
Mid Sweden University
Doctoral Thesis No. 327
Sundsvall, Sweden, 2020

Mittuniversitetet, Informationssystem och teknologi
SE-851 70 Sundsvall, SWEDEN
ISBN 978-91-88947-57-4
ISSN 1652-893X

Academic dissertation which, with the permission of Mid Sweden University, will be presented for public examination for the degree of Doctor of Technology on 10 June 2020 at 09:00 in room C-312, Mid Sweden University, Holmgatan 10, Sundsvall. The seminar will be held in English.

© Yongwei Li, June 2020
Printed by: Mid Sweden University Printing Office

To Medical Professionals during COVID-19 Pandemic

Abstract

The transition of camera technology from film-based cameras to digital cameras has been witnessed in the past twenty years, along with impressive technological advances in processing massively digitized media content. Today, a new evolution has emerged: the migration from 2D content to immersive perception. This rising trend has a profound and long-term impact on our society, fostering technologies such as teleconferencing and remote surgery. The trend is also reflected in the scientific research community, and more attention has been drawn to the light field and its applications.

The purpose of this dissertation is to develop a better understanding of light field structure by analyzing its sampling behavior and to address three problems concerning the light field processing pipeline: 1) How to address the depth estimation problem when there is limited color and texture information. 2) How to improve the rendered image quality by using the inherent depth information. 3) How to solve the interdependence conflict of demosaicing and depth estimation.

The first problem is solved by a hybrid depth estimation approach that combines the advantages of correspondence matching and depth-from-focus, where occlusion is handled by involving multiple depth maps in a voting scheme. The second problem is divided into two specific tasks, demosaicing and super-resolution, where depth-assisted light field analysis is employed to surpass the competence of traditional image processing. The third problem is tackled with an inferential graph model that encodes the connections between demosaicing and depth estimation explicitly, and jointly performs a global optimization for both tasks.

The proposed depth estimation approach shows a noticeable improvement in point clouds and depth maps, compared with reference methods. Furthermore, the objective metrics and visual quality are compared with classical sensor-based demosaicing and multi-image super-resolution to show the effectiveness of the proposed depth-assisted light field processing methods. Finally, a multi-task graph model is proposed to challenge the performance of the sequential light field image processing pipeline. The proposed method is validated with various kinds of light fields, and outperforms the state-of-the-art in both demosaicing and depth estimation tasks.

The works presented in this dissertation raise a novel view of the light field data structure in general, and provide tools to solve specific image processing problems. The impact of the outcome can be manifold: to support scientific research with light field microscopes, to stabilize the performance of range cameras for industrial applications, as well as to provide individuals with a high-quality immersive experience.


Sammanfattning

Over the past twenty years there has been a transition from film-based to digital camera technology, in parallel with an impressive technical development in the processing of large amounts of digitized media content. More recently, a new line of development has also emerged: the migration from 2D content to immersive perception. This development has a far-reaching and long-lasting impact on society and fosters practices such as teleconferencing and remote surgery. This trend is also reflected in the scientific research community, where increasing attention has been devoted to the light field and its various fields of application.

The purpose of this thesis is to reach a better understanding of the structure of the light field by analyzing how the light field is sampled, and to solve three problems within the light field processing chain: 1) How the depth estimation problem can be solved with limited color and texture information. 2) How rendered image quality can be improved by exploiting the inherent depth information. 3) How the interdependence conflict between demosaicing (color filter interpolation) and depth estimation can be resolved.

The first problem has been solved with a hybrid depth estimation method that combines the strengths of correspondence matching and depth-from-focus, where occlusion is handled by using several depth maps in a voting scheme. The second problem is divided into two separate steps, demosaicing and super-resolution, where depth-assisted analysis of the light field is used to surpass the capability of traditional image processing. The third problem has been addressed with an inferential graph model that explicitly couples demosaicing and depth estimation, and jointly performs a global optimization for both of these processing steps.

The proposed depth estimation method produces visually appealing point clouds and depth maps compared with other reference methods. Objective metrics and visual quality are further compared with classical sensor-based demosaicing and multi-image super-resolution, to show the effectiveness of the proposed methods for depth-assisted light field processing. A multi-task graph model is also proposed to match and surpass the performance of sequential light field image processing. The proposed method is validated with various kinds of light fields and outperforms the best existing methods in both demosaicing and depth estimation.

The works presented in this thesis constitute a new way of viewing the general data structure of the light field, and provide tools for solving specific image processing problems. The effects of these results can be many, for example as support for scientific research with light field microscopes, to improve the performance of range-measuring cameras in industrial applications, as well as to offer high-quality immersive media experiences.


Acknowledgements

I would like to thank my supervisors, Prof. Mårten Sjöström and Dr. Roger Olsson, for offering me the opportunity to be a Ph.D. student in the Realistic 3D group. I want to thank them for their support, kindness, encouragement, and trust throughout my education, and their wisdom and positive attitude when I had tough moments. This thesis and many other research works would undoubtedly never have achieved fruition without your dedication.

The support of my friends and colleagues is vital to me. I thank all the members of the Realistic 3D group that I have had the pleasure of working with, especially Waqas Ahmad and Elijs Dima, who have helped me consistently from the very beginning of this journey and shared four years of happy memories with me. At Chinese festivals, more than ever we think of our family and friends far away. I want to thank my compatriots from EKS and their families who organized cultural events and gatherings. I thank Jan Lundgren for his review of this thesis, and all the staff members of Mid Sweden University I met along the way. I wish you all the best in your future endeavors.

I am grateful to my family and my wife's family. Without their constant support, I would never have had a chance to start and finish a Ph.D., or even to come to study in Sweden. I want to thank my wife, Jun Wang. There is so much I want to say, but so little can be expressed. She quit her job as a teacher in China and came to me with great courage. She took care of me, spared me from tremendous housework, and even managed to be my Swedish translator after SFI and grundläggande. Anywhere I go, I know it will be my sweet home as long as you are with me.

This work has received funding from the European Union's Horizon 2020 research and innovation program under the Marie Sklodowska-Curie Actions, and specifically the European Training Network on Full Parallax Imaging. The network has provided me with numerous exciting training schools, workshops, internal conferences, online seminars, and secondments. I am greatly thankful to the coordinators and researchers from all our academic and industrial partners and organizations. They made this project both fruitful and enjoyable. Special thanks to Prof. Manuel Martinez from University of Valencia, Prof. Filiberto Pla from Jaume I University, and their research groups for hosting me during my research secondments. Thank you for your assistance and motivation.

Table of Contents

Abstract

Sammanfattning

Acknowledgements

List of Papers

Terminology

1 Introduction
   1.1 Motivation
   1.2 Purpose
   1.3 Objectives
   1.4 Scope
   1.5 Contributions
   1.6 Outline

2 Computational Photography and Light Field
   2.1 Computational photography
      2.1.1 Image processing
      2.1.2 Digital photography
   2.2 Plenoptic function and light field
      2.2.1 Definition
      2.2.2 Light field sampling

3 Light Field Depth Estimation, Demosaicing and Super-Resolution
   3.1 View-based depth estimation
      3.1.1 Correspondence matching
      3.1.2 Depth-from-focus/defocus
      3.1.3 Other techniques
   3.2 Light field demosaicing
   3.3 Light field spatial super-resolution
   3.4 Markov random field in computer vision

4 Related Works
   4.1 Light field depth estimation
      4.1.1 View-based initial depth estimation
      4.1.2 Markov random field depth refinement
      4.1.3 Concluding remarks
   4.2 Light field demosaicing
      4.2.1 Sampling-based approaches
      4.2.2 Other approaches
      4.2.3 Concluding remarks
   4.3 Light field spatial super-resolution
      4.3.1 Projection-based approaches
      4.3.2 Optimization-based approaches
      4.3.3 Learning-based approaches
      4.3.4 Concluding remarks

5 Depth-Assisted Light Field Processing
   5.1 Light field microscopic depth estimation
      5.1.1 Introduction
      5.1.2 Proposed solution
      5.1.3 Methodology
      5.1.4 Results and analysis
   5.2 Depth-assisted demosaicing
      5.2.1 Introduction
      5.2.2 Proposed solution
      5.2.3 Methodology
      5.2.4 Results and analysis
   5.3 Depth-assisted super-resolution
      5.3.1 Introduction
      5.3.2 Proposed solution
      5.3.3 Methodology
      5.3.4 Results and analysis
   5.4 Concluding remarks

6 A Joint Model of Depth Estimation and Demosaicing
   6.1 Introduction
   6.2 Initialization
      6.2.1 Initial color demosaicing
      6.2.2 Initial depth estimation
   6.3 Collaborative graph model for demosaicing and depth estimation
      6.3.1 Data term
      6.3.2 Smoothness term
   6.4 Methodology
   6.5 Results and analysis
      6.5.1 Demosaicing performance analysis
      6.5.2 Depth estimation performance analysis
   6.6 Concluding remarks

7 Conclusions and Future Work
   7.1 Overview
   7.2 Objective outcome
   7.3 Impact
   7.4 Risks and ethical aspects
   7.5 Future work

Bibliography

Paper I

Paper II

Paper III

Paper IV

Paper V

List of Papers

This thesis is based on the following papers, herein referred to by their Roman numerals:

PAPER I
An Analysis of Demosaicing for Plenoptic Capture Based on Ray
Yongwei Li, Roger Olsson, and Mårten Sjöström,
3DTV-Conference (3DTV-Con), 2018

PAPER II
Area-Based Depth Estimation for Monochromatic Feature-Sparse Orthographic Capture
Yongwei Li, Gabriele Scrofani, Mårten Sjöström, and Manuel Martinez-Corral,
European Signal Processing Conference (EUSIPCO), 2018

PAPER III
Depth-Assisted Demosaicing for Light Field Data in Layered Object Space
Yongwei Li, Mårten Sjöström,
IEEE International Conference on Image Processing (ICIP), 2019

PAPER IV
Depth-Assisted Light Field Super-Resolution in Layered Object Space
Yongwei Li, Mårten Sjöström,
In manuscript, 2020

PAPER V
A Collaborative Graph Model for Light Field Demosaicing and Depth Estimation
Yongwei Li, Filiberto Pla, and Mårten Sjöström,
Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020

List of Figures

1.1 The sequential processing steps from 3D color content capturing to applications.

2.1 The relationships between computer vision, digital photography, computational photography and image processing.

2.2 Pinhole camera model.

2.3 Epipolar geometry: two views capturing the scene content, left: without rectification, right: after rectification.

2.4 Different geometrical optic models of MLA-based plenoptic cameras. Left: conventional plenoptic camera, middle: focused plenoptic camera with Keplerian configuration, right: focused plenoptic camera with Galilean configuration.

3.1 The Bayer pattern of CFA (left), and the sensitivity of the human visual system towards red, green and blue (right).

3.2 Undirected graph models (left) and example cliques shown alongside them, random variables are shown as red dots. A clique is a complete graph, and a complete graph is its own maximal clique.

5.1 Light field captured by FiMic microscope: (a) raw sensor image labeled with (u, v) coordinates; (b) refocused images by shift-and-sum, focusing at different depths in µm.

5.2 Depth estimation results of (a) the proposed method; (b) [NN94] and (c) [PPG13]. Top row shows the 3D point cloud, and the bottom is the depth map of the center view.

5.3 An example of pinhole plenoptic sampling: (a) pixel samples back-projected onto the conjugate planes of the object space; (b) from left to right: simulated back-projection of pixels based on depth Z1, Z2 and Z3 respectively, color blocks at the bottom show different simulated CFAs. Note the simulated CFA fashion is not necessarily the same as the Bayer pattern.

5.4 A layered object space diagram, mosaiced sensor pixels are back-projected onto several layers according to their depths.

5.5 Visual comparison between the proposed demosaicing scheme and comparison methods: (a) [DPW13], (b) [DLPG17], and (c) the proposed scheme.

6.1 Flow chart of CGMDD framework.

List of Tables

5.1 Performance of demosaicing methods in terms of mean PSNR value and SSIM index

5.2 Evaluated Super-resolution Methods for Light Field Data

5.3 Mean PSNR and SSIM for the SR factor α = 2

Terminology

Abbreviations and Acronyms

2D      Two-Dimensional
3D      Three-Dimensional
4D      Four-Dimensional
ADC     Analog-to-Digital Converter
BRDF    Bidirectional Reflectance Distribution Function
CCD     Charge-Coupled Device
CFA     Color Filter Array
CI      Confidence Interval
CNN     Convolutional Neural Network
DfD     Depth-from-Defocus
DfF     Depth-from-Focus
DIBR    Depth-Image-Based Rendering
EI      Elemental Image
EPI     Epipolar Plane Image
FOV     Field of View
GMM     Gaussian Mixture Model
HVS     Human Visual System
MAP     Maximum a Posteriori
MISR    Multi-Image Super-Resolution
MLA     Microlens Array
MRF     Markov Random Field
NCC     Normalized Cross-Correlation
PSNR    Peak Signal-to-Noise Ratio
RGB     Color-Only (from Red-Green-Blue mosaicing model)
RMSE    Root Mean Square Error
RTM     Ray Transfer Matrix
SAD     Sum of Absolute Differences
SAI     Subaperture Image
SfF     Shape-from-Focus

SISR    Single-Image Super-Resolution
SR      Super-Resolution
SSIM    Structural Similarity Index

Mathematical Notation

a             Distance between image and lens
B             Baseline
b             Distance between lens and object
c             Color component: R, G or B
d             Disparity
E             Energy function
e             Euler's number
f             Focal length
I             Reference image
K             Camera calibration matrix
L             4D light field
O             Object point in world coordinate system
P             Probability
Q             Projective transformation matrix
R             Rotation matrix
S             Smoothness score
t             Time
t             Translation vector
(u, v)        Coordinates of camera plane
U             Prior energy
V             Factorized energy terms with respect to cliques
w             Weighting factor, scale factor
(x, y)        Coordinates of image plane
(X, Y, Z)     World coordinate system
z             Depth
\mathcal{C}   Clique
(θ, ϕ)        Orientation vector
∥·∥₂          L2 norm
0             Zero vector
∆             Offset
ϵ             Cost function
Γ             7D plenoptic function
λ             Wavelength
\mathcal{F}   Low-pass filtering and downsampling
\mathcal{I}   Refocused image
\mathcal{N}   Local window
µ             Average of a local window \mathcal{N}
∇             Gradient operator
ω             Weighting function
σ             Standard deviation
C̃             Camera center in world coordinates

Chapter 1

Introduction

The continuous development of multimedia technology has led to increased demands on the fidelity of the scene recurrence and its applications to quite a few aspects of everyday life, such as 3D television, immersive gaming, remote surgery and so forth. To enable these applications, stereoscopic views need to be generated for the human visual system to perceive a computer-aided reconstruction of the scene. Such reconstruction is a challenging task that involves a variety of processing technologies, ranging from computer vision to digital photography. Both industry and academia are pursuing research in related areas, including data acquisition, processing, transmission and view rendering. This dissertation mainly addresses the image processing aspects of light field data, with a particular focus on the vital role of depth information in different processing steps.

1.1 Motivation

Conventional photography fails to meet the requirement of reconstructing the stereoscopic view since it captures only a projective transformation of the three-dimensional world. Stemming from integral photography [Lip08], the light field [LH96], which is able to represent both spatial and angular information of the scene, has developed rapidly over the last decade [YGL+13, AOS17].

The light field captures rich light information compared with conventional cameras, enabling processing techniques such as depth-image-based rendering [Feh04] and 3D reconstruction [VMCS15]. These light field processing techniques often require known depth. As one of the fundamental problems, the extraction of depth cues is crucial for computer vision tasks, and both the quality and usage of depth estimation have been widely investigated [WG13, YYLG12]. In this dissertation, the processing chain including the capturing, pre-processing and post-processing stages of the light field (Fig. 1.1) is analyzed. Then, using that knowledge, depth-assisted processing of light field data is introduced for different stages of various capturing setups. Finally, the unknown depth is considered and a joint model is proposed to perform both depth estimation and image quality improvements.

Figure 1.1: The sequential processing steps from 3D color content capturing to applications (capturing: lenses, CFA, ADC, raw data; pre-processing: devignetting, demosaicing, denoising, RGB; post-processing: depth, refocusing, compression, applications).

1.2 Purpose

The purpose of this dissertation is to develop a profound understanding of the light field structure by analyzing its sampling behavior, and to propose independent solutions to depth estimation, demosaicing and super-resolution, as well as a joint framework that challenges the conventional light field processing pipeline.

1.3 Objectives

The research work presented in this dissertation aims to enhance the light field information by taking depth information into account in light field data processing. The approach is to understand the shortcomings of a sequential processing chain and the benefits of depth-assisted processing methods. This aim is divided into the following list of objectives:

• O1: To investigate geometrical optical models, and the sampling behavior of different light field capturing systems.

• O2: To investigate the diversity of light field capturing systems, and their potential challenges regarding the depth estimation process.

• O3: To investigate the role of depth in the pre-processing of raw light field data, and to propose a depth-assisted demosaicing method to reduce artifacts.

• O4: To investigate the influence of depth in the post-processing of the light field, and to propose depth-assisted super-resolution for light fields.

• O5: To propose a solution for light field quality enhancement with unknown depth, and to solve the interdependence between depth and light field enhancement techniques using a graph model.

1.4 Scope

This work is conducted within the empirical, post-positivist research paradigm, and relies on quantitative research methods. The work in this dissertation falls within the field of light field research. Specifically, this dissertation focuses on the image processing and computational photography aspects of light field data. The work is mainly based on geometrical optical models of the capturing devices. Wave optics and high-order optical aberrations are beyond the scope of this work. This work contributes knowledge to estimating and applying depth information to boost the subjective and objective performance of light field pre-processing and post-processing. Demosaicing and super-resolution, which are typical steps of the classical light field processing pipeline, are studied considering the influence of depth. The proposed methods in this dissertation allow for both small and large baseline scenarios, in other words, multi-camera and plenoptic camera setups. The methods in the dissertation are validated on real and synthetic datasets.

1.5 Contributions

The contributions on which this dissertation is based are the papers listed below, included in full at the end of this work. As the first author of the listed papers, I am actively involved in the ideas, methods, implementations, analyses, writing, and presentation of the research work and results. The remaining co-authors contributed with advice and guidance regarding the conception of the work and the interpretation of data throughout the work, and assisted in the revision of the respective papers. The contributions of this dissertation are:

Paper I addresses objective O1; it presents the sampling analysis of general light field capturing systems based on geometrical optics and ray-tracing.

Paper II addresses objective O2; it presents the intractable depth estimation under the limitation of scarce color and features, and proposes an area-based depth estimation scheme with occlusion handling for the microscope.

Paper III addresses objective O3; it proposes a demosaicing approach in the layered object space for recovering high color fidelity of raw light field data.

Paper IV addresses objective O4; it proposes a super-resolution approach in the layered object space that involves in-depth warping and cross-depth learning for light field resolution enhancement.

Paper V addresses objective O5; it proposes a Markov Random Field graph model for solving the interdependence of depth estimation and image demosaicing simultaneously.

The above contributions in terms of ideas, formulations, implementations and evaluations are described in Chapters 5 and 6. The dissertation is structured as follows: Papers I to IV are presented in Chapter 5, and Paper V is presented in Chapter 6.

1.6 Outline

The following chapters in this dissertation are organized as follows: Chapter 2 provides an overview of light field theory, starting from computational photography to capturing devices and their optical models. In Chapter 3, an overview of the light field processing chain is provided, and a detailed theoretical basis is introduced for demosaicing, depth estimation and super-resolution. In continuation of the theory introduced in Chapter 3, Chapter 4 describes the related work in the context of depth estimation, demosaicing and super-resolution. The proposed depth estimation, demosaicing and super-resolution work is presented in Chapter 5, together with the sampling analysis of the light field representation. Based on the understanding of depth-assisted light field processing, the study is extended to tackle the challenging joint problem of demosaicing and depth estimation in Chapter 6. Finally, in Chapter 7 the dissertation is concluded and future work is discussed.

Chapter 2

Computational Photography and Light Field

This chapter provides a general background on the research field in which this dissertation is positioned. The required theory includes computational photography, as well as light field and the relevant geometrical optical models for various capturing systems. The strengths and shortcomings of different light field capturing systems are introduced, which will lead to research problems discussed in the next chapters.

2.1 Computational photography

The context of this dissertation lies in computational photography [SJD12], which is a multi-disciplinary area that integrates computer vision, digital photography, and other related fields as shown in Fig. 2.1. Computational photography aims to acquire rich information (beyond conventional image processing) by combining the knowledge of image formation of digital photography with the development of computer vision tasks: 1) Digital photography focuses on the data processing between the scene and the pixelized sensor information, including optics, projective transformation, optical aberrations, and multiple view geometry (if multiple views are taken of the same scene content) [HZ03]. 2) Computer vision covers topics related to the computational understanding of visual content, such as optical character recognition, face detection, and robotics, to help the computer perceive the world by projecting images to models [Sze10].

Figure 2.1: The relationships between computer vision, digital photography, computational photography and image processing.

2.1.1 Image processing

Image processing is a prerequisite for computer vision tasks. When an image is stored on a computer using finite pixels with limited precision, it can be regarded as a function I(x, y) of two discrete variables x and y. Essentially, image processing manipulates the function I(x, y) to achieve the following goals:

• Image enhancement: To improve the quality of I(x, y) based on heuristic approaches so that image details can be perceived easily by both computer vision and the human visual system. For example, tone mapping, sharpening, and super-resolution.

• Image restoration: To recover the original signal of I(x, y) before sampling, quantization and image corruption, based on statistical or mathematical assumptions. For example, demosaicing and denoising.

• Image coding: To encode the image information with reduced redundancy and a compact representation. For example, lossy/lossless image compression.

• Image analysis: To extract high-level semantic information by analyzing image features. For example, object recognition, face detection, stereo matching.

The scope of this dissertation includes image enhancement, image restoration, and image analysis. However, image coding is beyond the scope of our discussion.

2.1.2 Digital photography

Digital photography focuses on the image formation process of digital cameras. In general, digital cameras use media such as charge-coupled devices (CCD) to capture the image created by lenses. In this section, the image formation process is introduced under geometrical optics, and the view geometry is classified based on the number of captured images: single-view, epipolar, or multiple view geometry.

Figure 2.2: Pinhole camera model.

Single-view geometry is known as camera geometry. It describes a mapping between the 3D world and a 2D image. A pinhole camera model is depicted in Fig. 2.2. The optical center of a pinhole camera is C, Z is the principal axis, and an object point O = (X_o, Y_o, Z_o)^T is mapped onto the sensor plane with its image O' = (X_{o'}, Y_{o'}, Z_{o'})^T. By using homogeneous coordinates, such a projective transformation can be written as:

\[
\begin{pmatrix} X_{o'} \\ Y_{o'} \\ Z_{o'} \end{pmatrix}
=
\begin{pmatrix} fX_o \\ fY_o \\ Z_o \end{pmatrix}
=
\begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{pmatrix} X_o \\ Y_o \\ Z_o \\ 1 \end{pmatrix},
\tag{2.1}
\]

where f is the focal length of the pinhole camera. The principal axis Z intersects the sensor plane at the principal point. Assuming that the principal point is not the origin of the image coordinate system, but is shifted by a vector (∆x, ∆y)^T, a more general projective transformation can be formulated as:

\[
\begin{pmatrix} X_{o'} \\ Y_{o'} \\ Z_{o'} \end{pmatrix}
=
\begin{pmatrix} fX_o + Z_o\Delta x \\ fY_o + Z_o\Delta y \\ Z_o \end{pmatrix}
=
\begin{bmatrix} f & 0 & \Delta x & 0 \\ 0 & f & \Delta y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{pmatrix} X_o \\ Y_o \\ Z_o \\ 1 \end{pmatrix},
\tag{2.2}
\]

where

\[
K = \begin{bmatrix} f & 0 & \Delta x \\ 0 & f & \Delta y \\ 0 & 0 & 1 \end{bmatrix}
\]

is called the camera calibration matrix, and its parameters f, ∆x, ∆y are the internal parameters. To further generalize the pinhole camera model, assume that the camera center C is not the origin of the world coordinate system, but is shifted by a translation vector t and rotated by a rotation matrix R; the generalized projective transformation matrix is then written as:

\[
Q = K\,[R \mid t],
\tag{2.3}
\]

where t = −RC̃, and C̃ consists of the coordinates of the camera center in the world coordinate system. The matrix [R | t] is often referred to as the extrinsic parameters of the camera projection matrix.

Stemming from single-view geometry, epipolar geometry accounts for the study of two views capturing the same scene.

Figure 2.3: Epipolar geometry: two views capturing the scene content, left: without rectification, right: after rectification.

Given a pair of views as shown in Fig. 2.3, the object point O is captured by cameras C_l and C_r at image points O'_l and O'_r, respectively. The epipolar plane OC_rC_l intersects the sensors along the epipolar lines (shown in red). Epipolar geometry includes three specific topics: 1) Correspondence geometry, which constrains the stereoscopic pair of image points O'_l and O'_r. 2) Camera geometry, which describes the relationship between the camera matrices of C_l and C_r. 3) Structure geometry, which triangulates the 3D point O of the scene.

Epipolar geometry is widely used in processing 3D information of the scene. For example, the depth estimation task can be formulated as a correspondence matching problem and solved with the constraints of epipolar geometry. It can be seen from Fig. 2.3 that the correspondence pair O'_l and O'_r must lie on their corresponding epipolar lines. Furthermore, it is feasible to determine the epipolar line by knowing the two camera centers and one image point, without the exact position of the object point O [HZ03]. A triangulation is then performed to find the 3D position of O. The 3D reconstruction of O can be further simplified with rectification, as shown to the right in Fig. 2.3. After image rectification, i.e. enforcing parallel optical axes of the cameras and horizontal epipolar lines, the correspondence matching task can be constrained to the same row of a pair of images, and the depth of a 3D point can be triangulated as:

z = fB/d, (2.4)

where B is the baseline distance between the two camera centers and d is the horizontal disparity. Multiple view depth estimation can be solved in an identical manner, as multiple pairs of stereo systems [HZ03].
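To make the relations in Eqs. (2.1)-(2.4) concrete, the following sketch builds a calibration matrix K, projects a world point into two rectified pinhole views through Q = K[R | t], and recovers the point's depth from the resulting horizontal disparity via z = fB/d. The focal length, baseline and point coordinates are illustrative values, not parameters taken from this thesis.

```python
# Project a world point into two rectified pinhole views and recover its depth
# from the horizontal disparity (a numerical sketch of Eqs. (2.1)-(2.4)).
import numpy as np

f = 800.0                # focal length in pixels (assumed)
dx, dy = 320.0, 240.0    # principal point offset (assumed image centre)
B = 0.1                  # baseline between the two camera centres in metres

# Camera calibration matrix K (internal parameters)
K = np.array([[f, 0.0, dx],
              [0.0, f, dy],
              [0.0, 0.0, 1.0]])

def project(K, R, t, X):
    """Project a 3D point X with Q = K [R | t] and return pixel coordinates."""
    Q = K @ np.hstack([R, t.reshape(3, 1)])
    Xh = np.append(X, 1.0)           # homogeneous world coordinates
    x = Q @ Xh
    return x[:2] / x[2]              # dehomogenise

# Two rectified views: identical rotation, second camera shifted by the baseline.
R = np.eye(3)
t_left = np.zeros(3)
t_right = np.array([-B, 0.0, 0.0])   # t = -R C~ with the camera centre at (B, 0, 0)

O = np.array([0.2, -0.05, 2.5])      # world point at depth 2.5 m

o_left = project(K, R, t_left, O)
o_right = project(K, R, t_right, O)

d = o_left[0] - o_right[0]           # horizontal disparity in pixels
z = f * B / d                        # Eq. (2.4): z = fB/d
print(f"disparity = {d:.3f} px, recovered depth = {z:.3f} m")  # ~2.5 m
```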

2.2 Plenoptic function and light field

Generally speaking, computational photography aims to restore a subset of the 7D plenoptic function by using computer vision so that it outperforms conventional digital photography. This section introduces the plenoptic function and its most important representation: the light field.

2.2.1 Definition

The plenoptic function is an ideal function which formulates light information with a set of properties [AB91]. The full plenoptic function usually takes the form:

I = Γ(θ, ϕ, λ, t, x, y, z), (2.5)

where I is the light intensity of the light ray from direction (θ, ϕ) at spatial position (x, y, z), of wavelength λ at time t. If the full plenoptic function is known, then it is possible to recreate the whole scene content without information loss. Unfortunately, it is not practical to record the full plenoptic function due to the limitations of digital cameras, such as color filtering, quantization, and discrete pixel locations [AT12]. The full plenoptic function is usually transformed according to specific applications; for example, t can be ignored when sampling a static scene, conventional single-view photography preserves a perspective transformation I(x, y) of a 3D scene, and λ relates to red, green and blue components when filtered by a Bayer pattern color filter array (CFA).

The light field representation is one of the most important derivations from the plenoptic function, using a two-plane representation [LH96]. In the 4D light field representation, only four coordinates are registered, namely L(u, v, x, y), where (x, y) are the spatial coordinates related to viewing position (u, v). Conventionally, (x, y) relates to the spatial resolution of the light field, and (u, v) relates to the angular resolution.
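As a toy illustration of the two-plane parameterization, the sketch below stores a sampled light field L(u, v, x, y) as a 4D array and extracts a sub-aperture image by fixing the angular coordinates. The (u, v, x, y) index order and the array sizes are assumptions made only for this example, not a prescribed data layout.

```python
# Represent a sampled 4D light field L(u, v, x, y) as an array and extract one
# sub-aperture image (SAI) by fixing the angular coordinates (u, v).
import numpy as np

U, V = 9, 9              # angular resolution (number of viewpoints), assumed
X, Y = 128, 128          # spatial resolution of each view, assumed
L = np.random.rand(U, V, X, Y).astype(np.float32)  # stand-in for captured samples

# The central view is a common reference for depth estimation and rendering.
u0, v0 = U // 2, V // 2
sai = L[u0, v0]          # all spatial samples seen from viewpoint (u0, v0)
print(sai.shape)         # (128, 128): spatial resolution of one rendered view
```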

2.2.2 Light field sampling

Several capturing systems are often used to enable the acquisition of a sampled light field. Plenoptic cameras utilize a microlens array (MLA) structure to decouple the spatial-angular information of the 4D light field between the main lens and the image sensor. Based on different optical designs, plenoptic cameras can be divided into two categories. The conventional plenoptic camera (also referred to as plenoptic 1.0, shown in Fig. 2.4(a)) focuses the 3D scene onto the MLA. The distance between the sensor and the MLA is set to the microlens focal length f_MLA, and the main lens is focused at infinity. The microlenses demultiplex the angular information of the same spatial point and record it into elemental images (EI) behind the microlenses. The number of microlenses defines the spatial resolution (x, y), and the pixels of an EI set the angular resolution (u, v). The focused plenoptic camera (also referred to as plenoptic 2.0) focuses the MLA onto either a real image (Fig. 2.4(b)) or a virtual image (Fig. 2.4(c)) that is created by the main lens, following the thin lens equation:

\[
\frac{1}{f} = \frac{1}{a} + \frac{1}{b},
\tag{2.6}
\]

where a and b are the distances from the lens to the image and to the object, respectively.

Figure 2.4: Different geometrical optic models of MLA-based plenoptic cameras. Left: conventional plenoptic camera, middle: focused plenoptic camera with Keplerian configuration, right: focused plenoptic camera with Galilean configuration.

The Keplerian configuration requires shorter microlens focal lengths, which often leads to stronger distortions. The effective spatial resolution drops when the object is distant from the main lens. In contrast, the Galilean configuration reduces distortions by using a long focal length MLA and increases the effective spatial resolution for distant objects [PW10]. In focused plenoptic cameras, each microlens observes a fraction of the object, and the effective spatial resolution is coupled with depth, meaning that the spatial resolution for focused plenoptic cameras depends on the proportion of overlapping field of view (FOV). Both conventional and focused plenoptic cameras are able to capture a sampled 4D light field with a single exposure. However, they sacrifice sensor resolution to the trade-off between spatial and angular resolution.

In addition to plenoptic cameras, a conventional camera can also be used to sample a 4D light field if multiplexed in the spatial or temporal domain. One way is to utilize a camera array, which is equivalent to integrating a 4D light field L(u, v, x, y) by sampling (x, y) from adjacently placed camera positions (u, v) [WJV+05]. Another way is to integrate a light field over time t with a moving gantry [ZohVKZ17]. Compared with a camera array, a camera gantry is more affordable and has adjustable (u, v) steps, but it can only capture the plenoptic function if the scene is static. Other methods to capture a light field include integral photography, which can be seen as a plenoptic camera without the main lens [GZC+06], synthetic light fields created with virtual cameras [HJKG16], as well as special optical systems like light field microscopes [LNA+06].
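Before moving on, a small numeric illustration of the thin lens relation (2.6), which governs the focused plenoptic configurations described above, may be helpful; the focal length and object distances below are arbitrary values chosen only for the example.

```python
# Evaluate the thin-lens relation 1/f = 1/a + 1/b for a few object distances.
def image_distance(f, b):
    """Given focal length f and object distance b, return the image distance a."""
    return 1.0 / (1.0 / f - 1.0 / b)

f_main = 0.05                        # 50 mm main lens (assumed)
for b in (10.0, 1.0, 0.2):           # object distances in metres (assumed)
    a = image_distance(f_main, b)
    print(f"object at {b} m -> intermediate image at {a * 100:.2f} cm behind the lens")

# In a focused plenoptic camera the MLA re-images this intermediate image plane:
# a real intermediate image in the Keplerian configuration, a virtual one in the
# Galilean configuration (see Fig. 2.4).
```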

Chapter 3

Light Field Depth Estimation, Demosaicing and Super-Resolution

An explication of the context of this work, computational light field photography, was provided in the previous chapter, and this chapter continues with a brief summary of three specific light field processing steps in that context: light field depth estimation, demosaicing, and super-resolution. The purpose of this dissertation is to investigate the above aspects and to propose novel solutions for the light field. This chapter acts as a theoretical basis for the subsequent three chapters.

3.1 View-based depth estimation

Depth estimation is one of the oldest and most widely studied topics in computer vision. With the emergence of the light field, new challenges and opportunities have been brought into this topic. In this section, I introduce depth estimation methods that will be further investigated in the next chapters.

View images can be called by different names depending on the capturing system. They can be referred to as elemental images in integral imaging [CJ09] or subaperture images from plenoptic cameras, essentially representing the same concept: the integration I_{u,v}(x, y) of the light field L(u, v, x, y) from a virtual pinhole camera located at a fixed (u, v) coordinate. View-based methods rely on an evaluation calculated from pixel correlation or focus/defocus cues. This section focuses on local methods such as correspondence matching and depth-from-focus/defocus (DfF/DfD). The global method, particularly using the Markov Random Field (MRF), is explained in Section 3.4.


3.1.1 Correspondence matching

Most light field systems capture the scene with parallel optical axes and known relative positions, such as plenoptic cameras and multi-camera arrays, meaning that rectification is no longer required except for lens distortion and non-square pixels. Therefore, the light field depth estimation problem can be turned into the classic stereo matching problem discussed in Section 2.1.2, and the depth can be inferred from (2.4) for a reference view position (u, v). The only difference in the light field case is that this process needs to be repeated for multi-view stereo. Correspondence matching often works on the principle of local windows or identified features in computer vision:

• Area-based: Consider a local window centered at the reference pixel location (x, y) to handle uncertainty in the matching process. This generates a dense depth map, except at the image border if not padded, and various matching metrics can be chosen, such as the sum of absolute differences (SAD) [PU14] and normalized cross-correlation (NCC) [LCDX09]. Additionally, a Lambertian scene is often assumed, so that the radiance emitted by the same object is identical regardless of viewing position, which provides better matching performance.

• Feature-based: Extract features that are stable, using feature detection algorithms such as Sobel edge detection [GMC+16], corner detection [HS+88] and the SIFT algorithm [Low04]. This approach is more robust when occlusion occurs, but can only result in a sparse depth map where features are detected.

In addition to the demand for correspondence matching accuracy, occlusion is also a troublesome issue for stereo matching because it violates photo-consistency, and it is often addressed with prior knowledge of the scene [WER15, WER16, ZWY17, CHNC18].
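The following sketch shows the area-based principle in its simplest form: a SAD cost over a local window, evaluated for a range of candidate disparities in a rectified stereo pair. The window size, disparity range and synthetic test data are assumptions made for illustration, not the configuration used in the appended papers.

```python
# Area-based matching with SAD: for one reference pixel in a rectified pair,
# score candidate disparities over a local window and keep the minimum cost.
import numpy as np

def sad_disparity(left, right, x, y, max_disp=32, half_win=3):
    """Return the integer disparity minimising SAD for pixel (x, y) of `left`.

    Assumes (x, y) lies far enough from the image border for the window to fit.
    """
    best_d, best_cost = 0, np.inf
    ref = left[y - half_win:y + half_win + 1, x - half_win:x + half_win + 1]
    for d in range(max_disp + 1):
        if x - d - half_win < 0:
            break
        cand = right[y - half_win:y + half_win + 1,
                     x - d - half_win:x - d + half_win + 1]
        cost = np.abs(ref - cand).sum()   # SAD matching cost
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d

# Synthetic sanity check: the right view is the left view shifted by 5 pixels.
left = np.random.rand(64, 64)
right = np.roll(left, -5, axis=1)
print(sad_disparity(left, right, x=40, y=32))  # expected: 5
```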

3.1.2 Depth-from-focus/defocus

Depth-from-focus refers to methods that estimate the depth from a focal stack using a focus measure which describes the sharpness of the image [ST98]. The focus measure attains the extremum when focused on the depth of an object. Thanks to the rich angular information of light fields L(u, v, x, y), which enables after-shot refocusing [NLB+05], a focal stack can be generated by integrating angular patches over different depths z. The refocusing of angular patches can be calculated as:

\[
\mathcal{I}(x, y) = \frac{1}{uv}\int_{u}\int_{v} L\!\left(u, v,\ x - \frac{fB_x}{z},\ y - \frac{fB_y}{z}\right)\mathrm{d}v\,\mathrm{d}u,
\tag{3.1}
\]

where B_x and B_y indicate the baseline in the x- and y-direction. A focal stack is generated by shifting and averaging over the angular domain; the angular patch in a focal stack is centrosymmetric with respect to the focused depth z.

This observation is often used when insufficient depths z are available in the focal stack: instead of taking a focus measure, a blurring (defocus) factor is measured to infer the approximate depth between two slices of the focal stack. This method is referred to as depth-from-defocus (DfD). Though a full 4D light field enables refocusing at any depth z, the principles of DfD can still be effective for fast convergence in finding the local extremum for DfF.
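A simplified shift-and-sum implementation in the spirit of Eq. (3.1) is sketched below, using integer pixel shifts and a per-view slope parameter as an assumed stand-in for fB/z; a focal stack is then just this refocusing evaluated at several slopes, with a basic sharpness measure computed per slice.

```python
# Shift-and-sum refocusing of a 4D light field (integer-pixel shifts only).
import numpy as np

def refocus(L, slope):
    """L: light field of shape (U, V, X, Y); slope: pixel shift per view step."""
    U, V, X, Y = L.shape
    uc, vc = (U - 1) / 2.0, (V - 1) / 2.0      # reference (central) view
    out = np.zeros((X, Y), dtype=np.float64)
    for u in range(U):
        for v in range(V):
            du = int(round(slope * (u - uc)))   # shift proportional to the baseline
            dv = int(round(slope * (v - vc)))
            out += np.roll(L[u, v], shift=(du, dv), axis=(0, 1))
    return out / (U * V)                        # average over the angular domain

# A focal stack is the refocused image evaluated at several slopes (depths).
L = np.random.rand(5, 5, 64, 64)
focal_stack = [refocus(L, s) for s in (-1.0, 0.0, 1.0)]

# A simple focus measure per slice (variance of the image gradient).
sharpness = [np.var(np.gradient(img)[0]) for img in focal_stack]
print(sharpness)
```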

3.1.3 Other techniques

In addition to correspondence matching on subaperture images (SAI) and DfF/DfD on the focal stack, the dense angular sampling of plenoptic cameras also gives rise to the epipolar plane image (EPI). An EPI is obtained by slicing the 4D light field L(u, v, x, y) in a horizontal manner as I_{v_0, y_0}(u, x) = L(u, v = v_0, x, y = y_0), or in a vertical manner as I_{u_0, x_0}(v, y) = L(u = u_0, v, x = x_0, y). Thus, a 3D object is projected as a tilted line across SAIs, whereby the slope encodes depth information [LGZD15]. More recently, researchers have also presented techniques based on convolutional neural networks (CNN) and sparse coding.
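EPI slicing reduces to simple array indexing when the light field is stored as in the earlier toy example; the sketch below extracts one horizontal and one vertical EPI. The (u, v, x, y) layout and the sign convention of the slope-disparity relation are assumptions of the example.

```python
# Extract horizontal and vertical epipolar plane images (EPIs) from L(u, v, x, y).
import numpy as np

U, V, X, Y = 9, 9, 128, 128
L = np.random.rand(U, V, X, Y)

v0, y0 = 4, 64
epi_h = L[:, v0, :, y0]     # horizontal EPI  I_{v0, y0}(u, x), shape (U, X)

u0, x0 = 4, 64
epi_v = L[u0, :, x0, :]     # vertical EPI    I_{u0, x0}(v, y), shape (V, Y)

# For a Lambertian scene point with disparity d, the EPI trace follows
# x ≈ x_ref + d * (u - u_ref) (up to sign convention), so estimating the line
# slope in epi_h directly yields a depth estimate.
print(epi_h.shape, epi_v.shape)
```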

3.2 Light field demosaicing

A conventional digital camera performs several pre-processing steps with built-in processors, such as the analog-to-digital converter (ADC) and the CFA, in the capturing stage, as shown in Fig. 1.1. Therefore, red, green and blue components are quantized and filtered before being recorded at each pixel location. One way to capture a color pixel is to use a combination of beam splitters and three sensors, each recording a full-resolution color channel. However, this reduces the light intensity in each split, requires precise alignment among the sensors, and is costly. A more general solution to this problem is to apply the CFA in front of the sensor, passing only one color component for each pixel, and demosaicing is performed afterwards to recover the other two missing channels. Because of the mostly used Bayer pattern [Bay76] (Fig. 3.1), demosaicing is also known as debayering.

Figure 3.1: The Bayer pattern of CFA (left), and the sensitivity of the human visual system towards red, green and blue (right).

The Bayer pattern uses half of the image resolution to measure the green channel, and only a quarter of the pixels stores each of red and blue. This is because the human visual system (HVS) is more sensitive to the medium wavelengths (green), as can be seen in Fig. 3.1.

As a light field sampling device, plenoptic cameras [GYLG13, PW10] also employ a Bayer pattern CFA like conventional digital cameras. Therefore, demosaicing is also required. Though light field demosaicing can be achieved with simple interpolation techniques like bilinear interpolation, the real challenge lies in finding proper image priors so that small reconstruction errors can be achieved. Intuitively, classic demosaicing approaches for 2D images can be applied directly to the perspective image or the EI. However, this is suboptimal because the rich and abundant information that can be provided by the 4D light field is discarded. Specifically for light field demosaicing, whether the demosaicing approach makes the most of the 4D light field and can be adapted to various capturing systems is of significant importance.

The accuracy of light field demosaicing is critical to the performance of the subsequent processing steps, which take color images as input. Therefore, significant effort needs to be devoted to the demosaicing process before error propagation happens in the post-processing chain. However, demosaicing has been overlooked by most researchers in this field, and only a few solutions have been proposed [YYLG12, HC14, DLPG17].
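For reference, a minimal bilinear demosaicing baseline for an RGGB Bayer mosaic is sketched below; it is a generic textbook interpolation, not the depth-assisted method proposed in this dissertation, and the RGGB phase of the pattern is assumed.

```python
# Bilinear demosaicing of an RGGB Bayer mosaic: each sparse colour plane is
# filled by convolving it with a small interpolation kernel.
import numpy as np
from scipy.ndimage import convolve

def bayer_masks(h, w):
    """Boolean masks of an RGGB pattern: R at (0,0), G at (0,1)/(1,0), B at (1,1)."""
    r = np.zeros((h, w), bool); r[0::2, 0::2] = True
    b = np.zeros((h, w), bool); b[1::2, 1::2] = True
    g = ~(r | b)
    return r, g, b

def demosaic_bilinear(raw):
    h, w = raw.shape
    r_m, g_m, b_m = bayer_masks(h, w)
    k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], float) / 4.0  # R/B kernel
    k_g = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]], float) / 4.0   # G kernel
    out = np.zeros((h, w, 3))
    for c, (mask, k) in enumerate([(r_m, k_rb), (g_m, k_g), (b_m, k_rb)]):
        plane = np.where(mask, raw, 0.0)         # keep only samples of channel c
        out[..., c] = convolve(plane, k, mode='mirror')
    return out

raw = np.random.rand(8, 8)            # stand-in for a mosaiced sensor image
rgb = demosaic_bilinear(raw)
print(rgb.shape)                       # (8, 8, 3)
```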

3.3 Light field spatial super-resolution

Plenoptic cameras sacrifice image resolution to the spatial-angular trade-off, in exchange for the ability to capture a light field in a single shot. For example, the raw image captured by a Lytro Illum is of size 7728 × 5368, but the rendered SAI shrinks to 625 × 434 pixels [RE16]. To enhance the image resolution, many researchers have worked on light field super-resolution (SR) [BZF09, WG13, YJY+17]. SR is the process of obtaining high-resolution images from low-resolution ones. Super-resolution can be classified in terms of the number of low-resolution images involved: single-image super-resolution (SISR) and multi-image super-resolution (MISR). Most SISR algorithms employ some learning scheme to search for relevance between low-resolution and high-resolution image patches. The missing information is then hallucinated to super-resolve the image. MISR methods often assume a warping relationship between low-resolution images, and that relative displacements among these images can be found to fill in the missing pixels.

This dissertation focuses on the super-resolution for light field SAIs in spatial domain and distinguish LFSR from conventional SR methods by measuring the ren- dering performance. For super-resolution in the angular domain, I refer readers to the field of view synthesis [KWR16]. It is also worth noting that the usedprinci- ples are quite different between demosaicing and super-resolution: demosaicing is an image restoration technique that recovers original color by modeling intra- and inter-channel dependencies [DLPG17], whereas super-resolution belongs to image enhancement that uses approaches such as projection and learning [FVA17], even though the common goal of both is to upsample images,

3.4 Markov random field in computer vision

Most computer vision problems can be posed as a labeling model using constraints based on prior knowledge and observations, ranging from low-level demosaicing [ST12] to high-level semantic segmentation [LLL+17]. Such a labeling problem can be described with a set of random variables (pixels) with sparse local interactions, and solved by finding the maximum a posteriori (MAP) configuration, which is the best estimate from observations B [Li94]. According to Bayes' theorem:

\[
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)},
\tag{3.2}
\]

where A is a configuration of labels for pixels (x, y), P(A) is the a priori probability of configuration A, and P(B | A) is the likelihood density of the observation B. Therefore, the vision problem can be mathematically solved by finding a configuration Ã such that Ã = arg max P(A | B). However, the a priori term is difficult to obtain, as it reflects how individual random variables affect one another. This makes Bayesian labeling an intractable task, especially when applied to computer vision, where numerous pixels need to be considered at the same time. Fortunately, researchers in computer vision observe that pixels tend to depend on a local neighborhood rather than on irrelevant distant pixels. Based on this observation, the MRF is employed to handle the a priori probability by using a potential function. Consider a neighborhood system for an image I:

\[
\mathcal{N} = \{\mathcal{N}_{(x_i, y_i)} \mid \forall (x_i, y_i) \in I\},
\tag{3.3}
\]

where \mathcal{N}_{(x_i, y_i)} is the collection of pixels in a local window centered at (x_i, y_i), and the neighborhood is subject to the condition:

\[
(x_i, y_i) \in \mathcal{N}_{(x_j, y_j)} \iff (x_j, y_j) \in \mathcal{N}_{(x_i, y_i)},
\tag{3.4}
\]

meaning that if a local window centered at (x_j, y_j) includes a pixel (x_i, y_i), then vice versa. In other words, the dependencies between a pair of pixels are always mutual, and the corresponding graphical model is undirected. Based on the defined neighborhood, a set in which every pixel is a neighbor of all the others is called a clique. A clique can be seen as a complete graph, as shown in Fig. 3.2.

Figure 3.2: Undirected graph models (left) and example cliques shown alongside them, random variables are shown as red dots. A clique is a complete graph, and a complete graph is its own maximal clique.

The interactions between pixels are restricted by the cliques, meaning that the direct mutual impact on the labeling task is no longer considered if two pixels are not in a clique. A set of random variables is considered an MRF model if and only if: 1) the positivity constraint holds, which requires the probability of every configuration to be strictly positive for all pixels (x, y), so that it can be considered a random field:

\[
P(A) > 0,
\tag{3.5}
\]

and 2) the Markovian locality constraint holds, which requires that the probability of a pixel (x_i, y_i) conditioned on the whole graph is equivalent to the probability conditioned on its neighbors:

\[
P\big(A_{(x_i, y_i)} \mid A_{(x_j, y_j)},\ (x_j, y_j) \in I\big) = P\big(A_{(x_i, y_i)} \mid A_{(x_j, y_j)},\ (x_j, y_j) \in \mathcal{N}_{(x_i, y_i)}\big).
\tag{3.6}
\]

Additionally, according to the Hammersley-Clifford theorem, an MRF can be factorized using the cliques of its corresponding graph model, which has the form:

\[
P(A) = \frac{1}{N}\, e^{-\frac{U(A)}{T}},
\tag{3.7}
\]

where N is the normalizing factor, T is the temperature parameter, which is often taken to be 1, and U(A) is the prior energy term that needs to be minimized in order to find the most stable state of the defined neighborhood system. The prior energy term U(A) can be further factorized as a summation of energies with respect to the cliques \mathcal{C}:

\[
U(A) = \sum_{\mathcal{C}} V_{\mathcal{C}}(A) = \sum_{\mathcal{C}_1} V_1\!\left(A_{(x_i, y_i)}\right) + \lambda \sum_{\mathcal{C}_2} V_2\!\left(A_{(x_i, y_i)}, A_{(x_j, y_j)}\right) + \ldots,
\tag{3.8}
\]

where \mathcal{C}_1, \mathcal{C}_2, ... indicate cliques of \mathcal{C} with a different number of pixels, as shown in Fig. 3.2. In computer vision, the energy minimization can be solved with various techniques, such as graph cut [BVZ01] and min-cut/max-flow algorithms [BK04]. \mathcal{C}_1 and \mathcal{C}_2 are referred to as the data term (or unary term) and the smoothness term (or binary term), respectively, and higher orders of potential terms are rarely used. The data term is mostly used to penalize the inconsistency of event A with the observed data, whereas the smoothness term often penalizes changes in the local neighborhood, except at boundaries (if they are defined), so that regularization is performed to prevent outliers and remove noise.
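The sketch below evaluates the energy of Eq. (3.8) for a small discrete labeling with unary and pairwise cliques on a 4-connected grid; the absolute-difference data cost and the Potts-style smoothness cost are illustrative choices, not the terms used in this dissertation.

```python
# Evaluate U(A) = sum_{C1} V1 + lambda * sum_{C2} V2 for a labeling A on a grid.
import numpy as np

def mrf_energy(labels, observation, lam=0.5):
    # Data term (unary cliques C1): inconsistency between labels and observation.
    v1 = np.abs(labels - observation).sum()
    # Smoothness term (pairwise cliques C2): Potts penalty on label changes
    # between horizontal and vertical neighbours.
    v2 = (labels[:, 1:] != labels[:, :-1]).sum() + (labels[1:, :] != labels[:-1, :]).sum()
    return v1 + lam * v2

obs = np.random.randint(0, 4, size=(8, 8))                   # noisy observed labels
smooth = np.full((8, 8), np.bincount(obs.ravel()).argmax())   # one constant labeling

print(mrf_energy(obs, obs), mrf_energy(smooth, obs))
# The observation itself has zero data cost but a high smoothness cost; a graph-cut
# or ICM style minimiser would search for a labeling that balances the two terms.
```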

Chapter 4

Related Works

This chapter summarizes the related works conducted in the three areas of this dissertation: depth estimation, demosaicing and super-resolution in the context of light field processing. The chapter begins with a positioning and description of the works in each area, and ends with a discussion of the identified knowledge gaps.

4.1 Light field depth estimation

A number of contributions focus on light field depth estimation because depth is critical to a wide range of applications, e.g. light field super-resolution [BF11] and image rendering [ZDdW10]. Typically, most light field depth estimation tasks are managed with a two-step strategy: an initial depth is estimated from views or EPIs and then refined with global methods [WER15, WER16, LXRC17].

4.1.1 View-based initial depth estimation

Early stereo matching algorithms are based on feature detection and feature matching to generate a stable yet sparse depth map [MKS89]. However, researchers are not keen on such techniques because they fail to meet the requirements of other processing steps like depth-image-based rendering (DIBR). In [EE10], a new area-based correspondence matching was proposed using the NCC algorithm. An NCC filter is first applied with a small window size and then aggregated to both preserve the fine image structure and reduce the noise. Yu et al. [YGL+13] imposed 3D line geometry as constraints in their framework so that line structure can be encoded into a correspondence matching algorithm. To solve the sub-pixel disparity problem for plenoptic cameras, Jeon et al. [JPC+15] proposed to build a cost volume based on SAD and gradient differences for the stereo matching task. It is further refined by a feature-guided labeling framework for low-structure regions.

Several other methods have been proposed to specifically handle the occlusion problem of correspondence matching.


In [WER16], an occlusion-aware depth estimation algorithm is proposed based on the principle that approximately half the viewpoints still conform to the photo-consistency criterion when occlusion exists. Zhu et al. extended the depth estimation framework in [WER16] to address multiple occlusions based on the partial photometric consistency of occluders in spatial and angular patches [ZWY17]. Instead of improving correspondence matching measures, [CHNC18] focused on solving depth uncertainty in partially occluded border regions with superpixel regularization.

Based on the computational refocusing of the light field [NLB+05], Tao et al. [THMR13] combined depth cues from the focal stack and correspondence cues from stereo images to improve the depth map quality. A novel cue which compares the difference between a synthesized focal stack and the ground truth focal stack is proposed in [LCBKY15]. The depth is estimated by matching and merging both the focus cue and the proposed cue with iterative refinements. Moreover, neural networks have also been introduced to learn depth cues from a focal stack. Zhou et al. formulated a pixel-wise classification problem and generated discrete depth labels from a CNN that is pre-trained on local structure information and high-level semantic features [ZZY+19].

4.1.2 Markov random field depth refinement

The initial depth estimates have outliers because of noise from insufficient lighting, depth discontinuities at object borders, the inherent uncertainty of matching, and occlusions. Therefore, global optimization such as an MRF architecture is usually applied to refine the depth maps [YGL+13, THMR13, JPC+15, WER16, MYXA19]. In [YGL+13], the endpoints of line segments are assigned high-consistency disparity. The smooth transition of disparity on such line segments is explicitly encoded as a part of the energy function, and the multi-view graph cut is employed to find the global minimum energy. Tao et al. propagated both defocus and correspondence cues to the MRF framework and stop the optimization when the error between two iterations is below a threshold [THMR13]. Jeon et al. [JPC+15] set the reliable matching correspondences as additional constraints of the energy function to propagate confident disparity estimates while preventing oversmoothing across borders. In [WER16], the occlusion constraints are encoded in the smoothness term so that the graph cut algorithm does not oversmooth the depth map when possible occlusion occurs. Mo et al. [MYXA19] divided the initially estimated depths into two classes: valid and invalid. The valid estimations are depths which are not occluded or are in homogeneous intensity patches. The unary term is then applied only to the valid estimations, while the binary term constrains the smoothness on both valid and invalid depths.

4.1.3 Concluding remarks

Depth estimation is a very active research field with numerous proposed methods. Local depth estimation includes correspondence matching and DfF/DfD methods.

Correspondence matching is sensitive to the vignetting problem of the plenoptic camera. NCC outperforms SAD and SSD because it compensates for lighting differences mathematically. Compared with correspondence matching, DfF/DfD is robust to noise and can handle a sparse feature space, but it requires additional computational power to perform the refocusing step necessary for focal stack generation. In general, local depth estimation suffers from occlusion, noise and light field sparsity. In contrast, global optimization uses explicit priors instead of local aggregation. The priors can exclusively handle smoothness, occlusion and other intractable problems of local estimation. Thus, a global method is often used as a refinement process for the initial depth.

4.2 Light field demosaicing

There have been relatively few efforts to address light field demosaicing in the literature. These works can be divided into two categories based on whether they explicitly consider the sampling model of the light field.

4.2.1 Sampling-based approaches

Georgiev et al. [GCL11] analyzed the sampling model of plenoptic cameras and proposed to combine super-resolution with demosaicing. The method interleaves raw images on a rendering plane, so that information from other EIs can be used directly. Based on this work, Yu et al. [YYLG12] proposed a similar approach that postpones demosaicing until after a 2D integral projection which resamples the light field; demosaicing is then performed before rendering according to the chosen depth in order to preserve more high-frequency information. Seifi et al. reported a method that first estimates disparity from the raw images, after which the missing color components are propagated to other EIs based on a confidence measure [SSDP14].

4.2.2 Other approaches

Most of the post-processing schemes are developed on top of a common light field toolbox [DPW13], which applies bilinear interpolation to the raw lenslet images. A learning-based approach was proposed by Huang et al. that uses the K-SVD algorithm to train a dictionary on spatial and angular correlations from a set of reordered sample vectors of the light field; the demosaicing is then done by solving the sparse reconstruction [HC14]. Lian et al. assumed sparse gradients and low illumination, and formulated the demosaicing problem as an optimization process for simulated light field data [LC16]. In order to solve the vignetting problem, David et al. used a pre-captured white image to weigh the reliability of each pixel, and demosaicing is performed within the EI to avoid crosstalk effects [DLPG17].

4.2.3 Concluding remarks

Classic demosaicing methods interpolate missing colors based on neighboring pixels on the sensor. However, this yields artifacts for MLA-based plenoptic cameras [GCL11] and will propagate erroneous information into the processing pipeline. Furthermore, the important role of depth is barely discussed and a general solution is still lacking for various light field systems.

4.3 Light field spatial super-resolution

Light field capturing systems, especially plenoptic cameras, suffer from the spatial-angular trade-off. This makes the rendered view quality not comparable to that of conventional digital cameras, which dedicate the whole sensor to spatial resolution. Existing spatial super-resolution methods for the light field can be divided into three groups: projection-based, optimization-based and learning-based approaches [CXCL19].

4.3.1 Projection-based approaches

The spatial resolution of an SAI is not limited by a single SAI alone, but can be enhanced by projection from all other SAIs [LOP+09, GCL11, DOS+13]. By taking advantage of disparity or depth information, the other views can propagate pixels into a reference view. If sub-pixel disparity exists, the reference view is super-resolved. Liang et al. [LR15] reported the same conclusion, namely that lenslet plenoptic cameras can surpass the Nyquist limit when making use of cross-view information. The conclusion is derived from a geometrical optical analysis of a prefiltering optics model that considers the full spatial and angular sensitivity of the sensor. Another simple projection-based approach, which requires no prior information, is proposed in [WHS+16] based on a redefinition of the projection. Alain et al. proposed to iteratively apply a high-dimensional form of block-matching and 3D filtering together with back-projection to achieve spatial super-resolution [AS18].

4.3.2 Optimization-based approaches

Optimization-based methods solve the super-resolution problem with different models and priors. [BF11] introduced a variational Bayesian framework to super-resolve the light field by merging multi-view information; Lambertian reflectance and texture-preserving priors are employed to avoid aliasing. Mitra et al. [MV12] proposed a patch-based approach using a Gaussian mixture model (GMM), in which each patch is considered a Gaussian random variable conditioned on its disparity. The GMM patch prior is then used to perform multiple tasks, including super-resolution. Wanner et al. [WG13] employed a total variation prior, not only to obtain a favorable inpainting result but also to minimize the computational effort in the variational framework. The framework can achieve both spatial super-resolution and angular view synthesis. An undirected weighted graph model was constructed in [RF17] with three terms: the first term constrains the correlation between HR and LR views, the second term collects information from other views, and the third term regularizes the HR views by their geometric structure. The work was further extended to replace the quadratic regularizer (which tends to be low-pass) with a non-smooth regularizer in order to preserve high-frequency information [REGF18].

4.3.3 Learning-based approaches

Recently, a vast number of data-driven learning-based approaches have been applied to most fields of computer vision due to their outstanding performance. CNNs were introduced to the light field in [YJY+15], where two networks are trained independently for spatial super-resolution and angular view synthesis. Gul et al. proposed to collect four neighboring pixels from an SAI and use a shallow CNN to predict three in-between super-resolved pixels to achieve a higher spatial resolution [GG18].

4.3.4 Concluding remarks

Optimization-based and learning-based super-resolution methods are limited by the required computational power and datasets. Projection-based light field super-resolution relies on simple inherent light field image formation models and on disparity with sub-pixel accuracy. As concluded above, light field depth estimation is still an active field with unsolved problems. Therefore, how to make use of uncertain depth and how to adapt SISR/MISR techniques are the keys to the light field spatial super-resolution problem.

Chapter 5

Depth-Assisted Light Field Processing

The previous chapter presented related works, with overlooked and unsolved light field processing problems stated in their respective concluding remarks. This chapter discusses the critical role of depth and presents a study of a depth-assisted light field processing pipeline. Firstly, a microscopic example is presented in Section 5.1 to show how problematic the depth estimation can be. This is followed by Sections 5.2 and 5.3, which present depth-assisted methods designed to address light field demosaicing and super-resolution respectively.

5.1 Light field microscopic depth estimation

5.1.1 Introduction

Section 2.2.2 discussed various optical arrangements for capturing the light field, and they pose various challenges to depth estimation. For example, the light field microscope poses unique depth estimation challenges: 1) the microscope captures orthographic views due to its telecentric optical design [LNA+06, SSPL+18]; 2) specimen images are sparse in features and color compared with macroscopic scenes; 3) microscopes have a shallow depth of field [SSPL+18], and this optical sectioning phenomenon breaks down the accuracy of stereo matching.

5.1.2 Proposed solution

The proposed solution is based on a hybrid strategy of DfF and correspondence matching, because neither one alone can manage the posed challenges. On the one hand, DfF is limited by the sharpness cue but benefits from the orthographic projection. On the other hand, correspondence matching suffers from the sparse-featured monochromatic scene content, but it provides specimen details given the all-in-focus optical sectioning.

Figure 5.1: Light field captured by the FiMic microscope: (a) raw sensor image labeled with (u, v) coordinates; (b) refocused images by shift-and-sum, focusing at different depths in µm.

The microscope is inherently orthographic, and this introduces a linear dependence between depth and disparity. One of the most important induced features is the simple shift-and-sum implementation of the light field refocusing algorithm (see Fig. 5.1(b)). The shift amount $(\Delta x', \Delta y')$ of orthographic view $I(u', v')$ is calculated with respect to its relative position from the reference view $I(u, v)$:

$$(\Delta x', \Delta y') = \left( d \times (u' - u) \times \cos\frac{\pi}{6},\; d \times (v' - v) \times \sin\frac{\pi}{6} \right), \tag{5.1}$$

where $d$ is the disparity, and an example of the $(u, v)$ naming convention is shown in Fig. 5.1(a). The average intensity of the shifted views is calculated to generate a refocused image, as shown in Fig. 5.1(b).

Unlike other heuristic depth cues used in DfF, the extended depth of field of FiMic is exploited to develop a matching approach. Assuming the reference view $I(u, v)$ is sharp, the NCC measure is calculated at $(x, y)$ between a refocused image $\mathcal{I}(u, v)$ and the reference image $I(u, v)$ as:

$$\epsilon(I, \mathcal{I}) = \frac{1}{N} \sum_{i,j} \frac{\big(I(x+i, y+j) - \mu_I\big)\big(\mathcal{I}(x+i, y+j) - \mu_{\mathcal{I}}\big)}{\sigma_{I(x,y)} \cdot \sigma_{\mathcal{I}(x,y)}}, \tag{5.2}$$

where $\sigma$ and $\mu$ indicate the standard deviation and average of the considered window, and $N$ is the number of pixels in the window. The maximum NCC value between the reference image $I$ and the refocused image $\mathcal{I}$ is attained when pixel $(x, y)$ is in focus. A voting scheme is proposed to handle the occlusion, based on the partial occlusion theory presented in [WER16]. Partial occlusion refers to the assumption that the occluder blocks only a few of the views, so the non-occluded views can still converge to a spatial point. The depth of a spatial point is registered by setting a threshold on the minimum number of views that show photo-consistency. Note that the threshold is determined empirically by the number of views and the scene content.
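As an illustration, the sketch below implements the shift-and-sum refocusing of Eq. (5.1) and the windowed NCC measure of Eq. (5.2). It is a minimal Python sketch, not the thesis implementation: the view container, the function names, and the use of scipy for sub-pixel shifting are assumptions made here for clarity.

```python
import numpy as np
from scipy.ndimage import shift as subpixel_shift

def refocus(views, d, ref=(0, 0)):
    """Shift-and-sum refocusing of orthographic FiMic views (Eq. 5.1).
    views: dict mapping (u, v) -> 2D grayscale image; d: disparity hypothesis;
    ref: coordinates of the reference view I(u, v)."""
    acc = np.zeros_like(next(iter(views.values())), dtype=float)
    for (u, v), img in views.items():
        dx = d * (u - ref[0]) * np.cos(np.pi / 6)
        dy = d * (v - ref[1]) * np.sin(np.pi / 6)
        # shift takes (row, column) offsets; order=1 is bilinear interpolation
        acc += subpixel_shift(img.astype(float), (dy, dx), order=1)
    return acc / len(views)

def ncc(ref_win, refocused_win):
    """Windowed normalized cross-correlation of Eq. (5.2)."""
    a = ref_win - ref_win.mean()
    b = refocused_win - refocused_win.mean()
    n = ref_win.size
    return (a * b).sum() / (n * ref_win.std() * refocused_win.std() + 1e-12)
```

In use, the NCC score would be evaluated per pixel window for every refocus depth, and the depth maximizing the score would be kept, subject to the occlusion voting described above.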


Figure 5.2: Depth estimation results of (a) the proposed method, (b) [NN94], and (c) [PPG13]. The top row shows the 3D point clouds, and the bottom row shows the depth maps of the center view.

5.1.3 Methodology

The performance of depth estimation is generally evaluated with both objective metrics and visual comparisons. However, objective metrics are limited to datasets generated with computer graphics and do not reflect all aspects of real scenes and the perception of the human visual system. Furthermore, to the best of my knowledge there is no existing microscopic light field with ground truth depth, and depth estimation performance is therefore evaluated visually in light field microscopy [LNA+06, LWZD15]. The data used in the validation is captured by a state-of-the-art Fourier integral light field microscope (FiMic) [SSPL+18] that has an extended depth of field and single-shot light field capture capability. The specimen is a stained fluorescent cotton fiber, which exhibits complex occlusions, a distinct line structure and a certain sparsity in features. The pitch of each view is 683 pixels, and the central view is chosen as the reference view to perform depth estimation and occlusion handling, as shown in Fig. 5.1. Finally, both the 3D point cloud and the depth map of the proposed solution are compared against classic shape-from-focus (SfF) [NN94] and the state-of-the-art robust SfF (R-SfF) [PPG13].

5.1.4 Results and analysis

The visual comparison in Fig. 5.2 shows that the proposed method outperforms the depth estimation quality achieved by the reference methods for the monochromatic sparse features captured by the microscope. Noticeable artifacts and shadowing can be seen in the point clouds (top) generated by [NN94] and [PPG13], meaning that the occlusion is not properly handled and the occluded scene cannot be reconstructed. In addition, the fine fiber structure is lost in the depth maps (bottom) generated by [NN94, PPG13], showing that the correspondences are mistakenly matched. In contrast, our method keeps reasonable details of the fiber structure and handles the occlusion.

5.2 Depth-assisted demosaicing

5.2.1 Introduction

In a conventional plenoptic camera, it is often assumed that each microlens captures one single spatial point of the scene and distributes its angular information over the pixels of the EI (Fig. 2.4). However, this assumption no longer holds if depth is taken into account, implying the existence of disparity and an inconsistent sampling order between sensor pixels and the scene content. In this section, an object space light field demosaicing method is proposed with the aid of roughly estimated depths.

5.2.2 Proposed solution

Conventional demosaicing algorithms are based on the regular pixel grid of the sensor, because the sampling frequency of the scene is essentially defined by adjacent pixels. However, this theoretical basis is violated in MLA-based plenoptic cameras. Considering a conventional plenoptic camera model as presented in Section 2.2.2, a natural way of finding the minimal sampling interval is to back-project all pixels to a point cloud in the object space by the ray transfer matrix (RTM). Therefore, the nearest samples captured by plenoptic cameras are determined by correspondences rather than by pixel neighbors (Fig. 2.4). In other words, once the depth for each pixel is known, it is possible to make full use of the potential of the light field samples to recover the missing color information.

In general, demosaicing among scattered monochromatic points in the 3D object space requires more computational power and sophisticated interpolation algorithms such as trilinear interpolation. Fortunately, there is a case that is simple to cope with: a planar scene that lies at a constant depth. In this case, the 3D object space collapses back to a 2D demosaicing problem. However, it should be noted that the mosaiced back-projected pixels at such a depth do not necessarily follow the Bayer pattern and the pixel grid can be irregular, as shown in Fig. 5.3. Therefore, an inverse distance weighting function was adopted as the color interpolation method for each channel at pixel (x, y):

$$I(x, y) = \frac{\sum_{i=1}^{N} \omega_i I(x_i, y_i)}{\sum_{i=1}^{N} \omega_i}, \tag{5.3}$$

where $(x_i, y_i)$ are the coordinates of the $N$ nearest neighbors of the same color (R, G or B) on a certain depth plane. If pixel $(x, y)$ and $(x_i, y_i)$ do not converge to an identical spatial position, the inverse weighting factor $\omega_i$ has the form:

$$\omega_i = \frac{1}{\left\| (x_i - x,\; y_i - y) \right\|_2^{\,k}}, \tag{5.4}$$


Figure 5.3: An example of pinhole plenoptic sampling: (a) pixel samples back-projected onto the conjugate planes of the object space; (b) from left to right: simulated back-projection of pixels based on depths Z1, Z2 and Z3 respectively; the color blocks at the bottom show the different simulated CFAs. Note that the simulated CFA arrangement is not necessarily the same as the Bayer pattern.

where $\| \cdot \|_2$ indicates the $L_2$ norm, and $k$ is a positive real number called the power parameter, taken as $k = 2$.

When the scene content lies on the focal plane of the main lens, no demosaicing is required, because the focal plane will be focused directly onto the MLA, distributing the color information of a single spatial point to different RGB-sensitive pixels of the EI behind a lenslet. Therefore, if $\| (x_i - x,\; y_i - y) \| = 0$, the missing color components can be extracted directly by averaging the neighboring pixels of the specific color under the same microlens:

$$I(x, y) = \frac{1}{N} \sum_{i=1}^{N} I(x_i, y_i). \tag{5.5}$$

Note that projecting pixels onto several layers (Fig. 5.4) not only reduces the computational complexity by avoiding interpolation in a high-dimensional space, but also endows the demosaicing method with the capability of handling uncertain depths by keeping the proximity of pixels through the projection process. Additionally, the layered object space can handle the depth ambiguity of homogeneous regions and repetitive patterns quite well. This is due to the fact that the depth errors of such regions are generally consistent and smooth, so that they will be mapped onto the same layer, preserving sampling adjacency in the object space.
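A compact sketch of the interpolation of Eqs. (5.3)-(5.5) on a single depth layer and for a single color channel is given below. It assumes the back-projected sample positions are already available; the helper names and the use of a k-d tree for the neighbor search are illustrative choices, not the thesis implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def idw_channel(sample_xy, sample_val, query_xy, n_neighbors=4, k=2.0):
    """Inverse distance weighting on one depth layer (Eqs. 5.3-5.4).
    sample_xy: (M, 2) back-projected positions of pixels carrying this channel;
    sample_val: (M,) their intensities; query_xy: (Q, 2) positions missing it."""
    tree = cKDTree(sample_xy)
    dist, idx = tree.query(query_xy, k=n_neighbors)
    w = 1.0 / np.maximum(dist, 1e-9) ** k                    # Eq. (5.4), power k
    est = (w * sample_val[idx]).sum(axis=1) / w.sum(axis=1)  # Eq. (5.3)
    # Eq. (5.5): if a query coincides with samples, a plain average suffices
    coincident = dist[:, 0] < 1e-9
    est[coincident] = sample_val[idx[coincident]].mean(axis=1)
    return est
```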

5.2.3 Methodology

The performance of demosaicing methods is commonly assessed with both objective metrics and visual quality [LGZ08]. Among the objective metrics, peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) are widely used to evaluate the difference between the demosaiced image and the ground truth [HC14, LC16, DLPG17, KHKP16]. However, available datasets are mostly pre-processed by the built-in processors of digital cameras before being saved to storage, and such color images cannot be considered as the ground truth.

Figure 5.4: A layered object space diagram; mosaiced sensor pixels are back-projected onto several layers according to their depths.

Table 5.1: Performance of demosaicing methods in terms of mean PSNR value and SSIM index

        Proposed method   [DPW13]   [DLPG17]
PSNR    36.43             24.06     31.35
SSIM    0.968             0.812     0.952

Therefore, a benchmark synthetic light field dataset [HJKG16] has been used to verify the proposed scheme. The dataset includes 28 light fields of size 9 × 9 × 512 × 512. The proposed demosaicing scheme has been compared against two other methods: a widely used light field decoding pipeline [DPW13] and a state-of-the-art light field demosaicing approach [DLPG17]. Note that the performance of light field demosaicing methods degrades drastically at EI borders. This is because crosstalk and noticeable aliasing artifacts occur when demosaicing across different EIs using sensor-based demosaicing methods. Thus, peripheral views were rendered for visual comparison instead of the center view, to emphasize these significant visual artifacts and reveal the shortcomings of previous methods.
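For reference, a minimal PSNR helper, as used in such comparisons, is sketched below; SSIM can be computed with a standard implementation such as the one in scikit-image. The peak value of 255 assumes 8-bit images and is an assumption of this sketch.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio between a ground-truth image and a
    demosaiced estimate of the same shape and value range."""
    mse = np.mean((reference.astype(float) - estimate.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / (mse + 1e-12))
```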

5.2.4 Results and analysis

The performance of the proposed method is evaluated on the public dataset [HJKG16] and compared against a widely used light field toolbox [DPW13] and a state-of-the-art approach [DLPG17]. The average PSNR and SSIM are shown in Table 5.1. As shown in Fig. 5.5, both reference methods degrade the image quality on the repetitive pattern, while the proposed solution preserves the high-frequency components. The black stripes on the basketball are replicated in the image demosaiced by [DPW13] due to crosstalk, while the image demosaiced by [DLPG17] shows zipper artifacts. False colors are seen in the third sub-image for both of the reference methods because they overlook pixel correspondences and use suboptimal pixels (even though they are neighboring pixels on the sensor) to perform demosaicing.


Figure 5.5: Visual comparison between the proposed demosaicing scheme and the reference methods: (a) [DPW13], (b) [DLPG17], and (c) the proposed scheme.

The fourth image (details of a floor) shows how well the high-frequency component is preserved by each method. Noticeable aliasing can be seen for both reference methods, whereas such aliasing did not appear in the results of the proposed solution.

5.3 Depth-assisted super-resolution

5.3.1 Introduction

A layered object space framework was introduced in the previous section to handle depth uncertainty. This section addresses the super-resolution problem for light field data as an extension of the same framework.

Natural images tend to contain repetitive visual content. In particular, small image patches in natural images tend to recur many times inside the image, both within the same scale and across different scales [NM14]. Such information redundancy can readily be found especially in the light field context, where multiple views of the scene are available. Therefore, a super-resolution method which involves in-depth warping and cross-depth learning is proposed in this section.

5.3.2 Proposed solution

A high-resolution image can be directly calculated with the inverse distance weighting function (5.3) in the layered object space by applying the demosaicing scheme proposed in Section 5.2. However, analytic interpolations such as bilinear or bicubic interpolation can only generate an image with the desired number of pixels but without high-resolution details. Therefore, a light field SR framework that employs two techniques, in-depth warping and cross-depth learning, is proposed in this section.

The in-depth warping aims to recover the high-resolution image $I_{HR}$ from a set of low-resolution images $I_{LR}$. Formally, the relationship between the high-resolution image $I_{HR}$ and a low-resolution image $I_{LR}$ is modeled as follows:

$$I_{LR} = \mathcal{F}(I_{HR}), \tag{5.6}$$

where $\mathcal{F}$ denotes low-pass filtering (to avoid aliasing) and downsampling. First, the depth for each pixel is estimated (using [LS19]) and used to project all sensor pixels into a 3D point cloud. Then the 3D points are warped to the nearest depth layer following their optical paths. If the depth is accurate, angular patches from different views are projected onto the same depth layer, forming similar patches (the patches are identical if the scene is Lambertian). On the other hand, if similar patches are not found from other views on the same projected depth layer, the depth is erroneous and the corresponding patches are projected onto different layers, thus stopping further error propagation. In this case, the number of layers essentially sets the trade-off between accurate patch correlation and erroneous depths: the more accurate a depth map is, the more layers should be adopted to ensure a high patch similarity score. This means that $I_{HR}$ can be recovered in a well-defined manner.

The cross-depth learning works as a supplement to in-depth warping. In the worst-case scenario, the estimated depth map is so defective that it fails to warp corresponding patches onto the same layer. In such a case, cross-depth learning employs an approximate nearest neighbor search [ML09] over high-resolution and low-resolution patch pairs generated from the same light field, and migrates fine details directly from a high-resolution patch across layers. Note that in-depth warping takes precedence over cross-depth learning, because learning techniques tend to hallucinate erroneous high-resolution details.
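The following sketch illustrates the cross-depth learning fallback: an approximate nearest neighbor look-up of the most similar low-resolution patch from the same light field, whose high-resolution counterpart supplies the missing details. The exact ANN search of [ML09] is replaced here by a k-d tree for brevity, and all names are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def migrate_details(lr_patches, hr_details, query_patch):
    """Cross-depth learning fallback (sketch).
    lr_patches: (M, p*p) vectorized low-resolution patches from the light field;
    hr_details: (M, q*q) high-frequency detail layers of their co-located
    high-resolution counterparts; query_patch: (p*p,) patch to be enhanced."""
    tree = cKDTree(lr_patches)
    _, idx = tree.query(query_patch, k=1)   # most similar LR patch
    return hr_details[idx]                   # migrate its HR details
```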

5.3.3 Methodology

The method presented in this chapter is an extension of the layered object space framework in the context of the projection-based super-resolution described in Section 4.3.1. In image super-resolution, prior knowledge of the scene is often assumed. The fundamental assumption of MISR is that natural images contain repetitive visual content [LR15], and this knowledge of redundancy forms the basis of the proposed light field super-resolution scheme.

Table 5.2: Evaluated Super-resolution Methods for Light Field Data

Method              Language       Category            Time (s)
Bicubic interp.     Matlab         Single image        0.22
LFBM5D [AS18]       Matlab & C++   Projection-based    601.17
GB [REGF18]         Matlab         Optimization-based  9 × 10^5
LFCNN [GG18]        Python         Learning-based      68.93
Proposed solution   Matlab         Projection-based    170.19

The proposed scheme was quantitatively evaluated using metrics that are standard in the super-resolution field, such as PSNR and SSIM, and the results were validated on public benchmarks: the HCI dataset [HJKG16] and the Stanford dataset [Lab08]. The used datasets cover a wide range of data, including both synthetic and real captures as well as small and large baselines. All the data is processed with the light field parameterization L(u, v, x, y). The Stanford dataset was cropped from its original 17 × 17 views to the central 5 × 5 array of views in order to save computational time. All the light fields were downsampled by a factor α = 2 and then brought back to the original resolution by the different SR methods. The proposed method was compared against three state-of-the-art reference schemes [AS18, REGF18, GG18], covering the projection-, optimization- and learning-based categories defined in Section 4.3.

5.3.4 Results and analysis

Table 5.2 shows a summary of all SR methods that were applied in the evaluation process. In addition to the state-of-the-art approaches, bicubic interpolation was also included as a naive SISR method which cannot restore any high-frequency information, even though it is the fastest upsampling method. LFBM5D [AS18] has an iterative refining process which corrects residuals until convergence. The graph model approach GB [REGF18] takes a huge computational effort. LFCNN is fast in computation time compared with the others, but it requires a long training process before its parameters are well-tuned. The proposed method has a relatively low computational complexity, while attaining the second-best objective scores among all the methods, as shown in Table 5.3.

5.4 Concluding remarks

This chapter presented three solutions to light field processing steps. Firstly, a simple and effective depth estimation method for the light field microscope was presented, followed by a visual comparison with reference methods to demonstrate that general depth estimation becomes suboptimal when challenged by monochromatic sparse-feature content or unique optical designs.

Table 5.3: Mean PSNR and SSIM for the SR factor α = 2

                      HCI               Stanford
                    PSNR     SSIM     PSNR     SSIM
BIC                 35.22    0.912    35.59    0.961
LFBM5D              37.73    0.972    37.93    0.966
GB                  38.22    0.977    24.18    0.952
LFCNN               39.00    0.981    39.19    0.986
Proposed solution   38.59    0.973    38.81    0.970

Nonetheless, depth information is a vital feature and it plays a critical role in both the pre- and post-processing stages, as discussed in Sections 5.2 and 5.3. Secondly, the object space sampling was studied and an inverse distance weighted interpolation in the layered object space was proposed to solve the 3D demosaicing problem. It was shown that the proposed demosaicing reduces aliasing and crosstalk artifacts by exploiting the potential high-frequency samples of the light field.

The usage of the layered object space was further extended to solve the light field SR problem. Inspired by SISR and MISR techniques, a unified framework which consists of in-depth warping and cross-depth learning was proposed. Although the CNN-based approach outperforms the other methods in the objective evaluation, the result is still encouraging because the proposed method requires no training on vast amounts of data while achieving comparable performance. Additionally, there are common processing steps shared between the proposed demosaicing and super-resolution methods, which can be combined to further save computational effort. With regards to this chapter, my main contributions are:

• A depth estimation method for the monochromatic microscope, designed to be capable of estimating accurate depth from occluded sparse features.
• A demosaicing approach that utilizes roughly estimated depths to circumvent the classical edge detection problem, and prevents crosstalk artifacts with a layered object space approach that forbids cross-depth interpolation.
• A super-resolution method that involves a combination of in-depth warping and cross-depth learning to super-resolve based on a single light field.

The work of this chapter was presented in Papers I, II, III and IV.

Chapter 6

A Joint Model of Depth Estimation and Demosaicing

Different methods were introduced in Chapter 5 to provide solutions to light field data processing in a classic sequential manner. However, the limitations of such a step-by-step light field processing pipeline were not discussed. In fact, the post-processing steps are highly dependent on the pre-processing of the data and vice versa. For example, in most cases the depth is estimated from color intensities, whereas demosaicing is optimized with depth.

This chapter targets the identified flaws in the classic sequential processing pipeline and proposes a novel model to handle light field data. The model addresses the interdependence between demosaicing and depth estimation by encoding the correlation between color and depth explicitly in the data term and smoothness term of an MRF framework, which is then optimized jointly and globally.

6.1 Introduction

Depth serves as a basis for a variety of light field applications [JLPG17, SRY+18, EJM19]. Although numerous depth estimation methods were introduced in Sections 3.1 and 4.1, it is worth noting that they mostly work on demosaiced color light fields, as shown in the general processing pipeline in Fig. 1.1. Therefore, if the demosaicing process is ill-posed, the error inevitably propagates to depth estimation. In particular, such demosaicing artifacts are destructive for EPI and stereo matching depth estimation methods, which use photo-consistency as a fundamental criterion. It was also shown in Section 5.2 that demosaicing can be improved if depth is available, making the problems of depth estimation and demosaicing interdependent.

A straightforward approach to solve the interdependence problem is to iteratively alternate between a demosaicing step and a depth estimation step, improving either one at a time to boost the other, until the desired results are obtained.


However, this approach can be extremely time-consuming, and convergence cannot be guaranteed, as erroneous information also accumulates after each iteration. Apparently, a more sophisticated and organized solution is required.

Unlike local methods, global optimization methods try to find the global minimum of a cost function instead of performing erroneous local aggregation. To address the aforementioned problems, an MRF-based collaborative graph model is introduced in this chapter, solving both demosaicing and depth estimation jointly in a unified framework. The data term and the smoothness term are used to explicitly encode the interdependence as constraints for the global optimization.

6.2 Initialization

Before the MRF optimization, a preliminary guess of color and depth is required. In this section, the initial demosaicing and depth estimation are discussed, and the intermediate parameters calculated in the initialization process are propagated to form the energy terms in Section 6.3.

6.2.1 Initial color demosaicing

The two-plane parameterization light field L(u, v, x, y) is adopted for the original light field data regardless of the capturing system, where (u, v) and (x, y) are the camera plane and the focal plane respectively. To refocus at a depth z, the shift amount of pixels from other views can be calculated as:

$$(\Delta x(u, z),\; \Delta y(v, z)) = \left( \frac{f \cdot B_u \cdot (u - u_0)}{z},\; \frac{f \cdot B_v \cdot (v - v_0)}{z} \right), \tag{6.1}$$

where $I(u_0, v_0)$ is the reference view, $(\Delta x, \Delta y)$ is the disparity offset, and $B_u$, $B_v$ are the unit baseline distances in the $u$ and $v$ directions respectively. For each refocused depth z, the color image is recovered either by: 1) propagating color from other views according to the correspondences indicated by the disparity; or 2) interpolating from the nearest samples using simple bilinear interpolation, as in the toolbox [DPW13].
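The per-view shift of Eq. (6.1) can be computed directly; the small helper below is a sketch with hypothetical names, where f, Bu and Bv follow the notation of the equation.

```python
def disparity_offset(u, v, u0, v0, z, f, Bu, Bv):
    """Pixel shift of view (u, v) relative to the reference (u0, v0)
    when refocusing at depth z, following Eq. (6.1)."""
    dx = f * Bu * (u - u0) / z
    dy = f * Bv * (v - v0) / z
    return dx, dy
```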

6.2.2 Initial depth estimation

A DfD approach is proposed in this section to obtain the initial depth information based on the initial color focal stack. To make the focus metric insensitive to illumination noise, the median of the absolute difference of each color component c is taken, as follows:

$$\epsilon(x_i, y_i, z) = \sum_{c} \left\{ \left| c(x, y) - \hat{c}_z(x_i, y_i) \right| \right\}_{\text{median}}, \tag{6.2}$$


Figure 6.1: Flow chart of the CGMDD framework.

where $\{\cdot\}_{\text{median}}$ indicates the median filter, and $\hat{c}_z$ is the estimated color in channel $c$ calculated from the initial demosaicing at depth $z$. $c(x, y)$ is the color correspondence from a random view $I(u, v)$ to the demosaiced pixel $(x_i, y_i)$ of view $I(u_0, v_0)$, where $x = x_i + \Delta x$ and $y = y_i + \Delta y$ are functions of the depth $z$ as shown in (6.1). Essentially, $\epsilon$ indicates how reliable the estimation is based on the photo-consistency criterion: the greater $\epsilon$ is, the less confidence an estimated depth has. Therefore, the depth can be estimated as an error minimization problem:

$$\hat{z}(x_i, y_i) = \operatorname*{argmin}_{z}\; \epsilon(x_i, y_i, z), \tag{6.3}$$

where $\hat{z}(x_i, y_i)$ is the estimated depth for pixel $(x_i, y_i)$.
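A sketch of the initial DfD estimation of Eqs. (6.2)-(6.3) is given below, assuming the color correspondences for every candidate depth have already been gathered into arrays; the median over views and the sum over color channels follow the description above, and all names are hypothetical.

```python
import numpy as np

def photo_consistency_error(warped_colors, demosaiced):
    """epsilon(x_i, y_i, z) for one depth slice z (Eq. 6.2).
    warped_colors: (V, H, W, 3) colors propagated from all V views to the
    reference grid at depth z; demosaiced: (H, W, 3) initial color estimate."""
    abs_diff = np.abs(warped_colors - demosaiced[None])   # (V, H, W, 3)
    med = np.median(abs_diff, axis=0)                      # median over views
    return med.sum(axis=-1)                                # sum over channels

def initial_depth(errors):
    """Eq. (6.3): per-pixel argmin over the depth axis.
    errors: (Z, H, W) stack of epsilon values for all candidate depths."""
    return np.argmin(errors, axis=0)
```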

6.3 Collaborative graph model for demosaicing and depth estimation

It has been shown in Chapter 5 that demosaicing and depth estimation are interdependent: 1) an ill-posed demosaicing breaks the photo-consistency criterion for the correspondence matching of depth estimation, and 2) depth-assisted demosaicing outperforms classical sensor-based methods. To solve this joint problem, a collaborative graph model for demosaicing and depth estimation (CGMDD) is proposed, as shown in Fig. 6.1.

The fundamentals of MRFs were introduced in Section 3.4. Higher-order terms are not often considered in solving computer vision problems, because distant pixels have significantly weaker and negligible connections compared with neighbors. Thus, in this work only the data term and the smoothness term are employed. According to (3.8), the simplified energy function of the proposed MRF framework can be written as:

$$E = \sum_{i} E_{\text{data}}(i) + \lambda \sum_{(i,j) \in N} E_{\text{smoothness}}(i, j), \tag{6.4}$$

where $E$ defines the energy function of a joint probability distribution which follows the Gibbs distribution, and $N$ denotes the set of neighboring pixel pairs. $E_{\text{data}}(i)$ and $E_{\text{smoothness}}(i, j)$ are the data term and the smoothness term, representing the factorized energy functions of the single-node cliques and the pairwise cliques respectively.
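For orientation, the sketch below evaluates Eq. (6.4) for a candidate labeling over a 4-connected grid. The label here stands for a joint (color, depth) assignment; the cost array and the smoothness callback are hypothetical inputs, and a graph cut or message passing solver would be used for the actual minimization.

```python
import numpy as np

def mrf_energy(data_cost, labels, smoothness, lam):
    """Total energy of Eq. (6.4) for a given labeling.
    data_cost: (L, H, W) cost of assigning label l to pixel (y, x);
    labels:    (H, W) integer label map;
    smoothness(a, b): element-wise pairwise cost of neighboring labels;
    lam: weight balancing the data and smoothness terms."""
    H, W = labels.shape
    rows, cols = np.arange(H)[:, None], np.arange(W)[None, :]
    energy = data_cost[labels, rows, cols].sum()
    # pairwise cliques over the 4-neighborhood: right and down neighbors
    for dy, dx in [(0, 1), (1, 0)]:
        a = labels[:H - dy, :W - dx]
        b = labels[dy:, dx:]
        energy += lam * smoothness(a, b).sum()
    return energy
```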

6.3.1 Data term

The data term models the joint relationship between color intensities and depth as:

$$E_{\text{data}}(i) = -\log P_i(c, z), \tag{6.5}$$

where $P_i(c, z)$ is the joint probability for a pixel $i = (x_i, y_i)$ that its color component is $c$ and its depth is $z$. According to the chain rule, the joint probability $P_i(c, z)$ can be written as:

$$P_i(c, z) = P(c, z \mid (x_i, y_i)) = P(c \mid z, (x_i, y_i))\, P(z \mid (x_i, y_i)), \tag{6.6}$$

where $P(c \mid z, (x_i, y_i))$ is the probability of the color intensity being $c$ when the depth is taken as $z$, and $P(z \mid (x_i, y_i))$ is the probability of pixel $i$ being assigned a depth $z$. To compute $P(c \mid z, (x_i, y_i))$, all the views are refocused at depth $z$, and the corresponding pixels of $(x_i, y_i)$ from the other views are used to calculate the probability of each specific $c$ being assigned:

$$P(c \mid z, (x_i, y_i)) = \frac{1}{N} \sum_{u,v}^{N} \delta\big(c(x_j, y_j), c(x_i, y_i)\big) \cdot c(x_j, y_j), \tag{6.7}$$

where $(x_j, y_j)$ are the correspondences of pixel $i$ with respect to depth $z$, and $\delta = 1$ if $c(x_j, y_j)$ and $c(x_i, y_i)$ capture the same color component. For the conditional $P(z \mid (x_i, y_i))$, the confidence measure $\epsilon$ from (6.2) is used, because it implies how probable a depth is to be assigned to the pixel. The probability of a pixel $(x_i, y_i)$ being sampled at depth $Z$ is defined as:

$$P(Z \mid (x_i, y_i)) = 1 - \frac{\epsilon(x_i, y_i, Z)\, w_1(x_i, y_i, Z)}{\sum_{z} \epsilon(x_i, y_i, z)\, w_1(x_i, y_i, z)}, \tag{6.8}$$

where $w_1(x_i, y_i, z)$ is a weighting factor that evaluates the absolute difference between the ground truth color $c(x_i, y_i)$ in one channel of the raw sensor image and the estimated color intensity $\hat{c}_z(x_i, y_i)$ at depth $Z$:

$$w_1(x_i, y_i, z) = \left| c(x_i, y_i) - \hat{c}_z(x_i, y_i) \right|. \tag{6.9}$$

Thus, the single-pixel color and depth relationship is encoded in the data term.
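A sketch of the data term of Eqs. (6.5)-(6.9) is shown below; it assumes that the photo-consistency errors of Eq. (6.2) and the raw-color differences of Eq. (6.9) have been precomputed for all candidate depths, and all function and variable names are hypothetical.

```python
import numpy as np

def depth_prior(eps_all, w1_all, z_index):
    """P(Z | (x_i, y_i)) of Eq. (6.8) for one candidate depth index.
    eps_all: (Z, H, W) photo-consistency errors (Eq. 6.2);
    w1_all:  (Z, H, W) raw-color difference weights (Eq. 6.9)."""
    num = eps_all[z_index] * w1_all[z_index]
    den = (eps_all * w1_all).sum(axis=0) + 1e-12
    return 1.0 - num / den

def data_term(p_color_given_depth, p_depth, eps=1e-12):
    """E_data(i) = -log P_i(c, z) with P_i(c, z) = P(c | z, i) P(z | i)
    (Eqs. 6.5-6.6), evaluated per pixel for one (c, z) hypothesis."""
    return -np.log(p_color_given_depth * p_depth + eps)
```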

6.3.2 Smoothness term

In addition to the data term, the smoothness constraint for the regularization process is modeled in the pairwise clique. The smoothness term is defined as:

$$E_{\text{smoothness}}(i, j) = w_2(i, j)\, S(i, j), \tag{6.10}$$

where $w_2(i, j)$ is a weighting factor calculated from the refocused image $\mathcal{I}$:

$$\mathcal{I}(x_i, y_i) = \frac{1}{N} \sum_{u,v} I\big(x_i + \Delta x(u, \hat{z}(x_i, y_i)),\; y_i + \Delta y(v, \hat{z}(x_i, y_i))\big), \tag{6.11}$$

$$w_2(i, j) = \exp\left( -0.5 \cdot \frac{|\nabla \mathcal{I}(i, j)|}{\max\{|\nabla \mathcal{I}|\}} \right), \tag{6.12}$$

where $\nabla$ is the gradient operator, and $w_2$ is a weight factor that reflects how closely two pixels are correlated based on the intensity variation; a large gradient implies a weak mutual impact. To enforce smooth color and depth transitions between neighboring pixels, a smoothness score is calculated as follows:

$$S(i, j) = 1 - \exp\left\{ -\frac{\| c_i - c_j \|_2 \, | z_i - z_j |}{\sigma_s} \right\}, \tag{6.13}$$

where $\sigma_s$ is a normalizing constant that indicates the allowed deviation in intensity and depth differences.
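The pairwise term of Eqs. (6.10)-(6.13) can be sketched as below for horizontal neighbor pairs; taking the gradient of the refocused image at the first pixel of each pair is a simplification of this sketch rather than the exact definition used in Paper V.

```python
import numpy as np

def smoothness_term(refocused, color, depth, sigma_s):
    """Pairwise cost for right-hand neighbors (Eqs. 6.10-6.13).
    refocused: (H, W) refocused intensity (Eq. 6.11); color: (H, W, 3);
    depth: (H, W) depth labels; sigma_s: allowed deviation."""
    gy, gx = np.gradient(refocused)
    grad = np.hypot(gx, gy)
    w2 = np.exp(-0.5 * grad / (grad.max() + 1e-12))              # Eq. (6.12)
    dc = np.linalg.norm(color[:, :-1] - color[:, 1:], axis=-1)   # ||c_i - c_j||_2
    dz = np.abs(depth[:, :-1] - depth[:, 1:])
    S = 1.0 - np.exp(-(dc * dz) / sigma_s)                       # Eq. (6.13)
    return w2[:, :-1] * S                                         # Eq. (6.10)
```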

6.4 Methodology

This section presents an MRF framework that challenges the sequential light field processing pipeline by combining pre-processing and post-processing into a unified probabilistic model, so that the implicit mutual impact across processing steps is explicitly expressed with energy terms and the potential of the light field data is handled properly.

The effectiveness of the proposed framework has been evaluated on three public datasets [Lab08, RE16, MULCS+16], covering real and synthetic scenes, large and small baselines, camera gantries and plenoptic cameras. The performance was validated on the following aspects:

• Objective metrics: structural similarity index (SSIM), root mean square error (RMSE), and confidence interval (CI);
• Visual quality: demosaiced color image, color difference map, depth map, and depth error.

Specifically, the demosaicing performance is compared against the widely used light field toolbox and the state-of-the-art WLIG algorithm [DLPG17], and the depth estimation result is compared against three methods, namely OCC [WER15], SPO [ZSL+16] and EPI [HAS+18]. To keep the comparison fair, the original data is reparameterized to the consistent light field representation L(u, v, x, y), and then the same initial demosaicing process is applied by using the light field toolbox [DPW13] to generate input color images for the depth estimation task.

6.5 Results and analysis

6.5.1 Demosaicing performance analysis

Table 2 of Paper V summarizes the SSIM and CI results for the reference methods on different light field data. The SSIM results show that the proposed method achieves a comparable result when the baseline is small (scenes Bathroom, Pillars and Bikes), and outperforms the other demosaicing methods significantly when the baseline increases (scenes Chess and Truck). This is due to the fact that large baseline scenarios induce an enlarged disparity range and a stronger color-depth interdependence. Thus, performing demosaicing on its own, irrespective of the impact of depth, often results in heavy color artifacts, as shown in the scene Bathroom (see Fig. 4 of Paper V). Furthermore, under the assumption that the SSIM values of pixels obey a normal distribution, the CI results show that the differences between the proposed solution and the investigated demosaicing methods are statistically significant.
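As a side note, a confidence interval for the mean per-pixel SSIM under the normality assumption mentioned above can be computed as in the sketch below; the exact CI construction used in Paper V is not specified here, so this is only an illustrative helper.

```python
import numpy as np

def mean_confidence_interval(ssim_values, z_score=1.96):
    """95% confidence interval for the mean of per-pixel SSIM values,
    assuming they follow a normal distribution."""
    v = np.asarray(ssim_values, dtype=float).ravel()
    mean = v.mean()
    sem = v.std(ddof=1) / np.sqrt(v.size)   # standard error of the mean
    return mean - z_score * sem, mean + z_score * sem
```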

6.5.2 Depth estimation performance analysis

Table 3 of Paper V shows the RMSE and SSIM of the depths estimated using OCC [WER15], EPI [HAS+18], SPO [ZSL+16] and the proposed method. The results show that the proposed method obtains a depth with less information loss and higher accuracy compared with the reference depth estimation methods. The visual comparison is shown in Fig. 6 of Paper V as a supplement. Furthermore, it can be seen that the depth estimation results of the reference methods deteriorate heavily when provided with the same initially demosaiced light fields. This is because the chromatic errors propagate from demosaicing to depth estimation through the classic sequential processing pipeline. Such error propagation not only breaks the photo-consistency criterion for correspondence matching, but also smears out object borders, making depth estimation a formidable problem if solved independently.

6.6 Concluding remarks

This chapter presented an MRF framework called CGMDD, which encoded the interdependence between demosaicing and depth estimation into a global energy minimization problem and solved them jointly. The results showed the superiority of the proposed method over conventional single-task light field image processing methods. One of the reasons why conventional methods are suboptimal is that researchers have overlooked the particular 4D structure of the light field and its inherently complex correlations between processing steps. Furthermore, the proposed framework can be generalized to light fields of different characteristics, regardless of small or large baselines and synthetic or real content. Just like other MRF-based graph models, the weights between the data term and the smoothness term need to be assigned empirically to avoid over-regularization and under-regularization, because over-regularization results in a loss of high-frequency details whereas under-regularization induces outliers. With regards to this chapter, my main contributions are:

• An MRF-based statistical framework for light field image processing, which is designed to jointly handle the demosaicing and depth estimation problems.

• The novel data term and smoothness term that explicitly model the interdependence of different light field properties.
• An analysis of the flaw of the classic light field image processing pipeline, by comparing the joint model with the sequential model.

The content of this chapter was presented in Paper V.

Chapter 7

Conclusions and Future Work

This final chapter provides an overview of the light field processing work presented in this thesis, followed by discussions about how the presented work fulfills the research objectives and about the impact of the work. Finally, the chapter concludes with the ethical aspects of the work and its conceivable future extensions.

7.1 Overview

The purpose of this dissertation, as stated in Chapter 1, is to provide insight into light field sampling and to develop solutions for its various processing problems. The scope of this thesis is confined in Chapter 2 to computational photography, with a particular focus on the image processing and the geometrical models of light field sampling devices. To distinguish 4D light field processing from classic 2D image processing techniques, the fundamental theory and basic knowledge of light field demosaicing, depth estimation and super-resolution were provided in Chapter 3, followed by the corresponding related works in Chapter 4.

To fulfill the purpose of this dissertation, three light field image processing aspects were investigated and a summary is reported in Chapter 5. Firstly, a novel area-based DfF depth estimation method was proposed to demonstrate the complexity of the depth problem with a light field microscope. The experimental results showed that the extracted depth tends to be erroneous if: 1) the scene is constrained with limited vision cues, and 2) it is solved with methods that do not take the specific structure of light field data into account. Secondly, a layered object space was introduced to cope with such depth errors, and it was extended to apply to both the demosaicing and super-resolution tasks. A new depth-assisted demosaicing method was proposed, which considers the geometrical optic model and the pixel sampling correlations. The experimental results verified the necessity of considering depth when demosaicing the 4D light field, so that pixel proximity can be defined and identified properly in the sampled object space rather than on the sensor. Finally, the layered object space was adopted for light field super-resolution, which enabled the in-depth warping and cross-depth learning strategies. Unlike conventional SISR and MISR, the proposed scheme was developed on the back-projection process of the layered object space, achieving both the capability of handling uncertain depth and the super-resolution of light field images with little effort.

A new joint model was proposed in Chapter 6 based on the findings on the interdependence of demosaicing and depth estimation from Chapter 5. The model employed an MRF framework to optimize the color and depth results in a collaborative manner, so that the conflict within the sequential processing pipeline can be avoided. Moreover, both demosaicing and depth estimation can benefit from this joint approach in terms of quantitative metrics and visual quality compared with state-of-the-art independent methods, especially when the diversity of light field configurations is considered.

7.2 Objective outcome

Five objectives were formulated in Section 1.3 to serve the purpose of this dissertation. The details concerning the fulfillment of these objectives are summarized as follows:

O1 To investigate geometrical optical models, and the sampling behavior of different light field capturing systems.
The geometrical optical models of light field capturing systems are first introduced in Chapter 2, including the image formation process of conventional cameras and of different plenoptic cameras. However, such simplified models are incomplete, as they only consider the sampling behavior in an ideal case: when the scene is on the focal plane. To investigate how the sampling behavior changes with respect to different depths, a depth-assisted sampling analysis is performed. Rooted in geometrical optics and the ray transfer matrix, the sampling behavior is simulated by a pixel back-projection process with respect to the object space in Section 5.2, along with the detailed analysis of light field demosaicing presented in Paper I.

O2 To investigate the diversity of light field capturing systems, and its potential challenges to the depth estimation process.
Various light field acquisition setups are briefly introduced in Chapter 2, and they pose different challenges to depth estimation methods. The EPI-based approach is limited by a small baseline, the view-based approach suffers severely from noise, DfF/DfD requires elaborate focus cues, and the learning-based approach requires huge amounts of data for training. The fluorescent cotton fiber specimen captured by the FiMic microscope is adopted as a case study. The experimental results showed that the estimated depth tends to be defective if other general depth estimation methods are applied, even though depth is of critical importance to light field processing. The results are presented in Section 5.1, with more details in Paper II.

O3 To investigate the role of depth in the pre-processing of raw light field data, and to propose a depth-assisted demosaicing method to reduce artifacts.
Based on the findings from Objectives O1 and O2, a layered object space method is proposed in Section 5.2 and Paper III to address light field demosaicing under uncertain depths. The method utilizes depth information to search for neighboring samples that are distributed across different elemental images. The spatial samples are then projected onto several layers and an inverse distance weighting function is applied in order to preserve fine details as well as to cope with depth uncertainty. In the visual comparison, the proposed method manages to reduce artifacts, especially for peripheral views, and outperforms state-of-the-art methods which overlook the significance of depth.

O4 To investigate the influence of depth in the post-processing of the light field, and to propose depth-assisted super-resolution for light fields.
In addition to pre-processing such as demosaicing, the impact of depth in the post-processing is studied in Section 5.3 and Paper IV. A depth-assisted super-resolution method is proposed to extend the usage of the layered object space framework. Two specific schemes are employed based on the high redundancy of the light field. The in-depth warping process tries to find similar patches on the same layer, using the layered structure to ensure high-quality matching and to avoid error propagation from erroneous depths. The cross-depth learning migrates fine details from similar patches of a higher resolution across layers as an enhancement to in-depth warping.

O5 To propose a solution for light field quality enhancement with unknown depth, and to solve the interdependence between depth and light field enhancement techniques using a graph model.
Objectives O1, O2, and O3 can be condensed into a dilemma: depth estimation requires faithful color information, whereas color demosaicing needs to be assisted by depth. Generally, a sequential pipeline solves the demosaicing problem first, because depth is overlooked and images are demosaiced on the sensor instead of in the object space. To solve the aforementioned problems, an MRF-based model that provides a joint solution for demosaicing and depth estimation is proposed in Chapter 6. The interdependence of color and depth is explicitly expressed and globally optimized. The experimental results show that the proposed method achieves remarkable results in both tasks in terms of quantitative metrics and visual assessment; the details of this work are presented in Paper V.

7.3 Impact

This dissertation proposed computational light field photography techniques related to light field data processing. Firstly, the depth estimation algorithm was designed to meet the characteristics of the light field microscope and the particularities of monochromatic feature-sparse specimens, allowing medical professionals to reconstruct 3D models of fluorescent specimens and to view the specimens from different angles without needing to be concerned with the underlying light field knowledge. Secondly, the depth-assisted demosaicing and super-resolution algorithms can be integrated into 3D vision devices, such as head-mounted gear and autostereoscopic displays, to improve the rendering quality of views. Furthermore, the proposed methods can be considered when the transmission bandwidth is limited such that downsampling is required before light field transmission. Lastly, the proposed joint solution for demosaicing and depth estimation yields desirable results compared with independent sequential processing. This framework opens up new possibilities for the collaborative processing of light field data, and may replace the classic light field processing pipeline.

7.4 Risks and ethical aspects

The research of this dissertation complies with the ethical guidelines provided by the Swedish Research Council [Sta17]. The work is intended to improve processing techniques for light field data. No risky human experiments were involved at the current stage, even though the ultimate goal of this dissertation is to improve the user experience of 3D content. The research activities were primarily performed on computers. During the research, there were occasions when visual artifacts needed to be identified from images displayed on a standard 2D screen using subjective assessment. In these cases, great care was taken to make sure that all the participants were at least 18 years old, that the tests were performed on a voluntary basis, that everyone was briefed with the necessary information beforehand, and that anyone was able to stop the test at any time. The proposed solutions are implemented and tested in a research environment. Therefore, they should not be deployed in stringent services or applications, such as medical imaging systems, in order to avoid possible risks.

7.5 Future work

This dissertation is positioned in the context of computational light field photography. Given this context, there are still challenges and overlooked problems that can be focused on in the future.

The importance of depth in multiple image processing steps of light field data is manifested in this dissertation with the proposed demosaicing and super-resolution methods. The results demonstrated the advantages of the proposed layered object space when handling uncertain depth. This capability of handling depth uncertainty may be further verified with commercial range cameras, such as Kinect and light detection and ranging (Lidar) cameras, which provide noisy depths. On top of that, some of the optical aberrations, such as axial chromatic aberration, may also be corrected with the layered depth information.

Another possible track lies in the unexplored potential of the 4D structure of the light field. A Lambertian scene is assumed in most super-resolution works. However, one can benefit from the abundant angular samples of an identical spatial point if pixels are traced back onto the object surface, making it feasible to model the bidirectional reflectance distribution function (BRDF). This can enable the proposed methods to cope with non-Lambertian scenes, and drastically improve their performance compared with view-based super-resolution. Additionally, the spatial super-resolution method can be extended to involve view synthesis for non-Lambertian scenes with small effort, by upscaling angular samples instead of spatial samples.

Lastly, but not least, the proposed MRF-based model shows the significance of combining different processing steps into a joint solution. So far, the framework has incorporated demosaicing and depth estimation; it is promising to see how other correlated problems and concepts of the light field, such as super-resolution, can be solved and modeled jointly with depth estimation. One can also extend the color-depth interdependence by adapting the energy terms so that they account for more sophisticated assumptions or prior knowledge of the scene.

Bibliography

[AB91] Edward H Adelson and James R Bergen. The Plenoptic Function and the Elements of Early Vision. Vision and Modeling Group, Media Laboratory, Massachusetts Institute of Technology, 1991.

[AOS17] Waqas Ahmad, Roger Olsson, and Mårten Sjöström. Interpreting plenoptic images as multi-view sequences for improved compression. In 2017 IEEE International Conference on Image Processing (ICIP), pages 4557–4561. IEEE, 2017.

[AS18] Martin Alain and Aljosa Smolic. Light field super-resolution via lfbm5d sparse coding. In 2018 25th IEEE international conference on image processing (ICIP), pages 2501–2505. IEEE, 2018.

[AT12] Elizabeth Allen and Sophie Triantaphillidou. The manual of photography and digital imaging. CRC Press, 2012.

[Bay76] Bryce E Bayer. Color imaging array, July 20 1976. US Patent 3,971,065.

[BF11] Tom E Bishop and Paolo Favaro. The light field camera: Extended depth of field, aliasing, and superresolution. IEEE transactions on pattern analysis and machine intelligence, 34(5):972–986, 2011.

[BK04] Yuri Boykov and Vladimir Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE transactions on pattern analysis and machine intelligence, 26(9):1124–1137, 2004.

[BVZ01] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on pattern analysis and machine intelligence, 23(11):1222–1239, 2001.

[BZF09] Tom E Bishop, Sara Zanetti, and Paolo Favaro. Light field superresolution. In 2009 IEEE International Conference on Computational Photography (ICCP), pages 1–9. IEEE, 2009.

[CHNC18] Jie Chen, Junhui Hou, Yun Ni, and Lap-Pui Chau. Accurate light field depth estimation with superpixel regularization over partially occluded regions. IEEE Transactions on Image Processing, 27(10):4889–4900, 2018.


[CJ09] Myungjin Cho and Bahram Javidi. Computational reconstruction of three-dimensional integral imaging by rearrangement of elemental image pixels. Journal of display technology, 5(2):61–65, 2009.

[CXCL19] Zhen Cheng, Zhiwei Xiong, Chang Chen, and Dong Liu. Light field super-resolution: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.

[DLPG17] Pierre David, Mikaël Le Pendu, and Christine Guillemot. White lenslet image guided demosaicing for plenoptic cameras. In Multimedia Signal Processing (MMSP), 2017 IEEE 19th International Workshop on, pages 1–6. IEEE, 2017.

[DOS+13] Mitra Damghanian, Roger Olsson, Mårten Sjöström, Hector Navarro Fructuoso, and Manuel Martinez-Corral. Investigating the lateral resolution in a plenoptic capturing system using the SPC model. In Digital Photography IX, volume 8660, page 86600T. International Society for Optics and Photonics, 2013.

[DPW13] Donald G Dansereau, Oscar Pizarro, and Stefan B Williams. Decoding, calibration and rectification for lenselet-based plenoptic cameras. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1027–1034, 2013.

[EE10] Nils Einecke and Julian Eggert. A two-stage correlation method for stereoscopic depth estimation. In 2010 International Conference on Digital Image Computing: Techniques and Applications, pages 227–234. IEEE, 2010.

[EJM19] Fatima El Jamiy and Ronald Marsh. Survey on depth perception in head mounted displays: distance estimation in virtual reality, augmented reality, and mixed reality. IET Image Processing, 13(5):707–712, 2019.

[Feh04] Christoph Fehn. Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3d-tv. In Electronic Imaging 2004, pages 93–104. International Society for Optics and Photonics, 2004.

[FVA17] Saber Farag, Vladan Velisavljevic, and Amar Aggoun. A hybrid ap- proach for image super-resolution of light field images. In 2017 IEEE 19th International Workshop on Multimedia Signal Processing (MMSP), pages 1–6. IEEE, 2017.

[GCL11] Todor Georgiev, Georgi Chunev, and Andrew Lumsdaine. Superres- olution with the focused plenoptic camera. In Computational Imaging IX, volume 7873, page 78730X. International Society for Optics and Photonics, 2011. BIBLIOGRAPHY 51

[GG18] M Shahzeb Khan Gul and Bahadir K Gunturk. Spatial and angular resolution enhancement of light fields using convolutional neural net- works. IEEE Transactions on Image Processing, 27(5):2146–2159, 2018.

[GMC+16] Claudia I Gonzalez, Patricia Melin, Juan R Castro, Olivia Mendoza, and Oscar Castillo. An improved sobel edge detection method based on generalized type-2 fuzzy logic. Soft Computing, 20(2):773–784, 2016.

[GYLG13] Todor Georgiev, Zhan Yu, Andrew Lumsdaine, and Sergio Goma. Lytro camera technology: theory, algorithms, performance analysis. In Multimedia Content and Mobile Devices, volume 8667, page 86671J. International Society for Optics and Photonics, 2013.

[GZC+06] Todor G Georgiev, Ke Colin Zheng, Brian Curless, David Salesin, Shree K Nayar, and Chintan Intwala. Spatio-angular resolution trade- offs in integral photography. Rendering Techniques, 2006(263-272):21, 2006.

[HAS+18] Xinpeng Huang, Ping An, Liang Shan, Ran Ma, and Liquan Shen. View synthesis for light field coding using depth estimation. In 2018 IEEE International Conference on Multimedia and Expo (ICME), pages 1– 6. IEEE, 2018.

[HC14] Xiang Huang and Oliver Cossairt. Dictionary learning based color de- mosaicing for plenoptic cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 449–454, 2014.

[HJKG16] Katrin Honauer, Ole Johannsen, Daniel Kondermann, and Bastian Goldluecke. A dataset and evaluation methodology for depth estima- tion on 4d light fields. In Asian Conference on Computer Vision, pages 19–34. Springer, 2016.

[HS+88] Christopher G Harris, Mike Stephens, et al. A combined corner and edge detector. In Alvey vision conference, volume 15, pages 10–5244. Citeseer, 1988.

[HZ03] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.

[JLPG17] Xiaoran Jiang, Mikael Le Pendu, and Christine Guillemot. Light field compression using depth image based view synthesis. In 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pages 19–24. IEEE, 2017.

[JPC+15] Hae-Gon Jeon, Jaesik Park, Gyeongmin Choe, Jinsun Park, Yunsu Bok, Yu-Wing Tai, and In So Kweon. Accurate depth map estimation from a lenslet light field camera. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1547–1555, 2015. 52 BIBLIOGRAPHY

[KHKP16] Teresa Klatzer, Kerstin Hammernik, Patrick Knobelreiter, and Thomas Pock. Learning joint demosaicing and denoising based on sequential energy minimization. In 2016 IEEE International Conference on Compu- tational Photography (ICCP), pages 1–11. IEEE, 2016.

[KWR16] Nima Khademi Kalantari, Ting-Chun Wang, and Ravi Ramamoorthi. Learning-based view synthesis for light field cameras. ACM Transac- tions on Graphics (TOG), 35(6):1–10, 2016.

[Lab08] The Computer Graphics Laboratory. The (New) Stanford Light Field Archive. http://lightfield.stanford.edu/index.html, 2008. [On- line; accessed 10-October-2017].

[LC16] Trisha Lian and Kyle Chiang. Demosaicing and denoising on simulated light field images. http://stanford.edu/class/ee367/ Winter2016/projects2016.html, 2016. [Online].

[LCBKY15] Haiting Lin, Can Chen, Sing Bing Kang, and Jingyi Yu. Depth re- covery from light field using focal stack symmetry. In Proceedings of the IEEE International Conference on Computer Vision, pages 3451–3459, 2015.

[LCDX09] Yebin Liu, Xun Cao, Qionghai Dai, and Wenli Xu. Continuous depth estimation for multi-view stereo. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2121–2128. IEEE, 2009.

[LGZ08] Xin Li, Bahadir Gunturk, and Lei Zhang. Image demosaicing: A sys- tematic survey. In Visual Communications and Image Processing 2008, volume 6822, page 68221J. International Society for Optics and Pho- tonics, 2008.

[LGZD15] Huijin Lv, Kaiyu Gu, Yongbing Zhang, and Qionghai Dai. Light field depth estimation exploiting linear structure in epi. In 2015 IEEE Inter- national Conference on Multimedia & Expo Workshops (ICMEW), pages 1–6. IEEE, 2015.

[LH96] Marc Levoy and Pat Hanrahan. Light field rendering. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Tech- niques, pages 31–42. ACM, 1996.

[Li94] Stan Z Li. Markov random field models in computer vision. In Euro- pean conference on computer vision, pages 361–370. Springer, 1994.

[Lip08] . Epreuves reversibles donnant la sensation du re- lief. J. Phys. Theor. Appl., 7(1):821–825, 1908.

[LLL+17] Ziwei Liu, Xiaoxiao Li, Ping Luo, Chen Change Loy, and Xiaoou Tang. Deep learning markov random field for semantic segmentation. IEEE transactions on pattern analysis and machine intelligence, 40(8):1814–1828, 2017. BIBLIOGRAPHY 53

[LNA+06] Marc Levoy, Ren Ng, Andrew Adams, Matthew Footer, and Mark Horowitz. Light field microscopy. ACM Transactions on Graphics (TOG), 25(3):924–934, 2006. [LOP+09] JaeGuyn Lim, HyunWook Ok, ByungKwan Park, JooYoung Kang, and SeongDeok Lee. Improving the spatail resolution based on 4d light field data. In 2009 16th IEEE International Conference on Image Processing (ICIP), pages 1173–1176. IEEE, 2009. [Low04] David G Lowe. Distinctive image features from scale-invariant key- points. International Journal of Computer Vision, 60(2):91–110, 2004. [LR15] Chia-Kai Liang and Ravi Ramamoorthi. A light transport framework for lenslet light field cameras. ACM Transactions on Graphics (TOG), 34(2):1–19, 2015. [LS19] Yongwei Li and Mårten Sjöström. Depth-assisted demosaicing for light field data in layered object space. In 2019 IEEE International Con- ference on Image Processing (ICIP), pages 3746–3750. IEEE, 2019. [LWZD15] Xing Lin, Jiamin Wu, Guoan Zheng, and Qionghai Dai. Camera array based light field microscopy. Biomedical optics express, 6(9):3179–3189, 2015. [LXRC17] Zhengqin Li, Zexiang Xu, Ravi Ramamoorthi, and Manmohan Chan- draker. Robust energy minimization for brdf-invariant shape from light fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5571–5579, 2017. [MKS89] Larry Matthies, Takeo Kanade, and Richard Szeliski. Kalman filter- based algorithms for estimating depth from image sequences. Inter- national Journal of Computer Vision, 3(3):209–238, 1989. [ML09] Marius Muja and David G Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP (1), 2(331-340):2, 2009. [MULCS+16] Adolfo Martínez-Usó, Pedro Latorre-Carmona, Jose M Sotoca, Filib- erto Pla, and Bahram Javidi. Depth estimation in integral imaging based on a maximum voting strategy. Journal of Display Technology, 12(12):1715–1723, 2016. [MV12] Kaushik Mitra and Ashok Veeraraghavan. Light field denoising, light field superresolution and stereo camera based refocussing usinga gmm light field patch prior. In 2012 IEEE Computer Society Confer- ence on Computer Vision and Pattern Recognition Workshops, pages 22–28. IEEE, 2012. [MYXA19] Yu Mo, Jungang Yang, Chao Xiao, and Wei An. Toward real-world light field depth estimation: A noise-aware paradigm using multi- stereo disparity integration. IEEE Access, 7:94391–94399, 2019. 54 BIBLIOGRAPHY

[NLB+05] Ren Ng, Marc Levoy, Mathieu Brédif, Gene Duval, Mark Horowitz, and Pat Hanrahan. Light field photography with a hand-held plenop- tic camera. Computer Science Technical Report CSTR, 2(11):1–11, 2005.

[NM14] Kamal Nasrollahi and Thomas B Moeslund. Super-resolution: a com- prehensive survey. Machine vision and applications, 25(6):1423–1468, 2014.

[NN94] Shree K Nayar and Yasuo Nakagawa. Shape from focus. IEEE Trans- actions on Pattern analysis and machine intelligence, 16(8):824–831, 1994.

[PPG13] Said Pertuz, Domenec Puig, and Miguel Angel Garcia. Relia- bility measure for shape-from-focus. Image and Vision Computing, 31(10):725–734, 2013.

[PU14] Chirag S Panchal and Abhay B Upadhyay. Depth estimation analysis using sum of absolute difference algorithm. Int. J. Adv. Res. Electr. Electron. Instrum. Eng., 3:6761–6767, 2014.

[PW10] C Perwass and Lennart Wietzke. The next generation of photography. White Paper, www. raytrix. de, 2010.

[RE16] Martin Rerabek and Touradj Ebrahimi. New light field image dataset. In 8th International Conference on Quality of Multimedia Experience (QoMEX), pages 1–2, 2016.

[REGF18] Mattia Rossi, Mireille El Gheche, and Pascal Frossard. A nonsmooth graph-based approach to light field super-resolution. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 2590– 2594. IEEE, 2018.

[RF17] Mattia Rossi and Pascal Frossard. Graph-based light field super- resolution. In 2017 IEEE 19th International Workshop on Multimedia Sig- nal Processing (MMSP), pages 1–6. IEEE, 2017.

[SJD12] JinLi Suo, XiangYang Ji, and QiongHai Dai. An overview of compu- tational photography. Science China Information Sciences, 55(6):1229– 1248, 2012.

[SRY+18] James Steven Supanˇciˇc,Gregory Rogez, Yi Yang, Jamie Shotton, and Deva Ramanan. Depth-based hand pose estimation: methods, data, and challenges. International Journal of Computer Vision, 126(11):1180– 1198, 2018.

[SSDP14] Mozhdeh Seifi, Neus Sabater, Valter Drazic, and Patrick Perez. Disparity-guided demosaicking of light field images. In 2014 IEEE International Conference on Image Processing (ICIP), pages 5482–5486. IEEE, 2014. BIBLIOGRAPHY 55

[SSPL+18] G Scrofani, Jorge Sola-Pikabea, A Llavador, Emilio Sanchez-Ortiga, JC Barreiro, G Saavedra, J Garcia-Sucerquia, and Manuel Martínez- Corral. Fimic: design for ultimate 3d-integral microscopy of in-vivo biological samples. Biomedical optics express, 9(1):335–346, 2018. [ST98] Muralidhara Subbarao and J-K Tyan. Selecting the optimal focus mea- sure for autofocusing and depth-from-focus. IEEE transactions on pat- tern analysis and machine intelligence, 20(8):864–870, 1998. [ST12] Jian Sun and Marshall F Tappen. Separable markov random field model and its applications in low level vision. IEEE transactions on image processing, 22(1):402–407, 2012. [Sta17] Sven Stafström. God forskningssed. Stockholm: Vetenskapsrådet. Till- gänglig på internet: https://publikationer.vr.se/produkt/god-forskningssed, 2017. [Online; accessed 10-October-2017]. [Sze10] Richard Szeliski. Computer vision: algorithms and applications. Springer Science & Business Media, 2010. [THMR13] Michael W Tao, Sunil Hadap, Jitendra Malik, and Ravi Ramamoorthi. Depth from combining defocus and correspondence using light-field cameras. In Proceedings of the IEEE International Conference on Computer Vision, pages 673–680, 2013. [VMCS15] Karthik Vaidyanathan, Jacob Munkberg, Petrik Clarberg, and Marco Salvi. Layered light field reconstruction for defocus blur. ACM Trans- actions on Graphics (TOG), 34(2):1–12, 2015. [WER15] Ting-Chun Wang, Alexei A Efros, and Ravi Ramamoorthi. Occlusion- aware depth estimation using light-field cameras. In Proceedings of the IEEE International Conference on Computer Vision, pages 3487–3495, 2015. [WER16] Ting-Chun Wang, Alexei A Efros, and Ravi Ramamoorthi. Depth esti- mation with occlusion modeling using light-field cameras. IEEE trans- actions on pattern analysis and machine intelligence, 38(11):2170–2181, 2016. [WG13] Sven Wanner and Bastian Goldluecke. Variational light field analy- sis for disparity estimation and super-resolution. IEEE transactions on pattern analysis and machine intelligence, 36(3):606–619, 2013. [WHS+16] Yunlong Wang, Guangqi Hou, Zhenan Sun, Zilei Wang, and Tieniu Tan. A simple and robust super resolution method for light field im- ages. In 2016 IEEE International Conference on Image Processing (ICIP), pages 1459–1463. IEEE, 2016. [WJV+05] Bennett Wilburn, Neel Joshi, Vaibhav Vaish, Eino-Ville Talvala, Emilio Antunez, Adam Barth, Andrew Adams, Mark Horowitz, and Marc Levoy. High performance imaging using large camera arrays. In ACM Transactions on Graphics (TOG), volume 24, pages 765–776. ACM, 2005. 56 BIBLIOGRAPHY

[YGL+13] Zhan Yu, Xinqing Guo, Haibing Lin, Andrew Lumsdaine, and Jingyi Yu. Line assisted light field triangulation and stereo matching. In Pro- ceedings of the IEEE International Conference on Computer Vision, pages 2792–2799, 2013. [YJY+15] Youngjin Yoon, Hae-Gon Jeon, Donggeun Yoo, Joon-Young Lee, and In So Kweon. Learning a deep convolutional network for light-field image super-resolution. In Proceedings of the IEEE international confer- ence on computer vision workshops, pages 24–32, 2015. [YJY+17] Youngjin Yoon, Hae-Gon Jeon, Donggeun Yoo, Joon-Young Lee, and In So Kweon. Light-field image super-resolution using convolutional neural network. IEEE Signal Processing Letters, 24(6):848–852, 2017. [YYLG12] Zhan Yu, Jingyi Yu, Andrew Lumsdaine, and Todor Georgiev. An analysis of color demosaicing in plenoptic cameras. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 901– 908. IEEE, 2012.

[ZDdW10] Sveta Zinger, Luat Do, and PHN de With. Free-viewpoint depth im- age based rendering. Journal of visual communication and image repre- sentation, 21(5-6):533–541, 2010. [ZohVKZ17] Matthias Ziegler, Ron op het Veld, Joachim Keinert, and Frederik Zilly. Acquisition system for dense lightfield of large scenes. In 2017 3DTV Conference: The True Vision-Capture, Transmission and Display of 3D Video (3DTV-CON), pages 1–4. IEEE, 2017. [ZSL+16] Shuo Zhang, Hao Sheng, Chao Li, Jun Zhang, and Zhang Xiong. Ro- bust depth estimation for light field via spinning parallelogram oper- ator. Computer Vision and Image Understanding, 145:148–159, 2016. [ZWY17] Hao Zhu, Qing Wang, and Jingyi Yu. Occlusion-model guided antioc- clusion depth estimation in light field. IEEE Journal of Selected Topics in Signal Processing, 11(7):965–978, 2017. [ZZY+19] Wenhui Zhou, Enci Zhou, Yuxiang Yan, Lili Lin, and Andrew Lums- daine. Learning depth cues from focal stack for light field depth estimation. In 2019 IEEE International Conference on Image Processing (ICIP), pages 1074–1078. IEEE, 2019.