LIGHT-FIELD FEATURES FOR ROBOTIC VISION IN THE PRESENCE OF REFRACTIVE OBJECTS

Dorian Yu Peng Tsai MSc (Technology) BASc (Engineering Science with Honours)

Submitted in fulfilment of the requirements for the degree of Doctor of Philosophy, 2020

School of Electrical Engineering and Computer Science
Science and Engineering Faculty
Queensland University of Technology

Abstract

Robotic vision is an integral aspect of robot navigation and human-robot interaction, as well as object recognition, grasping and manipulation. Visual servoing is the use of computer vision for closed-loop control of a robot’s motion and has been shown to increase the accuracy and performance of robotic grasping, manipulation and control tasks. However, many robotic vision algorithms (including those focused on solving the problem of visual servoing) find refractive objects particularly challenging. This is because these types of objects are difficult to perceive. They are transparent and their appearance is essentially a distorted view of the background, which can change significantly with small changes in viewpoint. What is often overlooked is that most robotic vision algorithms implicitly assume that the world is Lambertian—that the appearance of a point on an object does not change significantly with respect to small changes in viewpoint1. Refractive objects violate the Lambertian assumption and this can lead to image matching inconsistencies, pose errors and even failures of modern robotic vision systems.

This thesis investigates the use of light-field cameras for robotic vision to enable vision-based motion control in the presence of refractive objects. Light-field cameras are a novel camera technology that use multi-aperture optics to capture a set of dense and uniformly-sampled views of the scene from multiple viewpoints. Light-field cameras capture the light field, which simultaneously encodes texture, depth and multiple viewpoints. Light-field cameras are a promising alternative to conventional robotic vision sensors, because of their unique ability to capture view-dependent effects, such as occlusion, specular reflection and, in particular, refraction.

First, we investigate using input from the light-field camera to directly control robot motion, a process known as image-based visual servoing, in Lambertian scenes. We propose a novel light-field feature for Lambertian scenes and develop the relationships between feature motion and camera motion for the purposes of visual servoing. We also illustrate, both in simulation and using a custom mirror-based light-field camera, that our method of light-field image-based visual servoing is more tolerant to small and distant targets and partially-occluded scenes than monocular and stereo-based methods.

Second, we propose a method to detect refractive objects using a single light field. Specifically, we define refracted image features as those image features whose appearance has been distorted by a refractive object. We discriminate between refracted image features and the surrounding Lambertian image features. We also show that using our method to ignore the refracted image features enables monocular structure from motion in scenes containing refractive objects, where traditional methods fail.

Third, we combine and extend our two previous contributions to develop a light-field feature capable of enabling visual servoing towards refractive objects without needing a 3D geometric model of the object. We show in experiments that this feature can be reliably detected and extracted from the light field. The feature appears to be continuous with respect to viewpoint, and is therefore suitable for visual servoing towards refractive objects.

1 This Lambertian assumption is also known as the photo-consistency or brightness constancy assumption.

This thesis represents a unique contribution toward our understanding of refractive objects in the light field for robotic vision. Application areas that may benefit from this research include manipulation and grasping of household objects, medical equipment, and in-orbit satellite servicing equipment. It could also benefit quality assurance and manufacturing pick-and-place robots. The advances constitute a critical step to enabling robots to work more safely and reliably with everyday refractive objects.

Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

QUT Verified Signature

Dorian Tsai

March 2, 2020

Acknowledgements

To my academic advisors, Professor Peter Ian Corke, Dr. Donald Gilbert Dansereau and Associate Professor Thierry Peynot, I would like to offer my most heartfelt gratitude. They shared with me amazing knowledge, insight, creativity and enthusiasm. I am grateful for the resources and opportunities they provided, as well as their guidance, support and patience.

In addition, I wish to convey my appreciation to Douglas Palmer and Thomas Coppin, my fellow plenopticists, for many helpful and stimulating discussions. Thanks to Dr. Steven Martin who helped with much of the technical engineering aspects of building and mounting light-field cameras to various robots over the years. Thanks to Dominic Jack and Ming Xu for being excellent desk buddies. Thanks to Prof. Tristan Perez, Associate Professor Jason Ford and Dr. Timothy Molloy for helping to get me started on my PhD journey in inverse differential game theory applied to the birds and the bees, until I changed topics to light fields and robotic vision six months later.

Thanks to Kate Aldridge, Sarah Allen and all of the other administrative staff in the Australian Centre for Robotic Vision (ACRV) for organising so many conferences and workshops, and keeping things running smoothly.

This research was funded in part by the Queensland University of Technology (QUT) Postgraduate Research Award, the QUT Higher Degree Tuition Fee Sponsorship, the QUT Excellent Top-Up Scholarship, and the ACRV Top-Up Scholarship, as well as financial support in the form of employment as a course mentor and research assistant. The ACRV scholarship was supported in part by the Australian Research Council Centre of Excellence for Robotic Vision.

Lastly, a very special thanks goes to the many faithful friends, family and colleagues whose backing and constant encouragement sustained me through this academic marathon to graduation. I am especially indebted to Robin Tunley and Miranda Cherie Fittock for their camaraderie and steady moral support. Thank you all very much.

Contents

Abstract

List of Tables vii

List of Figures ix

List of Acronyms xiii

List of Symbols xv

1 Introduction 1

1.1 Motivation ...... 1

1.1.1 Limitations of Robotic Vision for Refractive Objects ...... 3

1.1.2 Seeing and Servoing Towards Refractive Objects ...... 5

1.2 Statement of Research ...... 8

1.3 Contributions ...... 9

1.4 Significance ...... 10

1.5 Structure of the Thesis ...... 11

2 Background on Light Transport & Capture 15

2.1 Light Transport ...... 15


2.1.1 Specular Reflections ...... 16

2.1.2 Diffuse Reflections ...... 16

2.1.3 Lambertian Reflections ...... 17

2.1.4 Non-Lambertian Reflections ...... 17

2.1.5 Refraction ...... 19

2.2 Monocular Cameras ...... 21

2.2.1 Central Projection Model ...... 21

2.2.2 Thin Lenses and Depth of Field ...... 23

2.3 Stereo Cameras ...... 26

2.4 Multiple Cameras ...... 28

2.5 Light-Field Cameras ...... 29

2.5.1 Plenoptic Function ...... 30

2.5.2 4D Light Field Definition ...... 32

2.5.3 Light Field Parameterisation ...... 34

2.5.4 Light-Field Camera Architectures ...... 36

2.6 4D Light-Field Visualization ...... 42

2.7 4D Light-Field Geometry ...... 44

2.7.1 Geometric Primitive Definitions ...... 44

2.7.2 From 2D to 4D ...... 46

2.7.3 Point-Plane Correspondence ...... 56

2.7.4 Light-Field Slope ...... 58

3 Literature Review 61

3.1 Image Features ...... 61

3.1.1 2D Geometric Image Features ...... 62

3.1.2 3D Geometric Image Features ...... 65

3.1.3 4D Geometric Image Features ...... 66

3.1.4 Direct Methods ...... 69

3.1.5 Image Feature Correspondence ...... 70

3.2 Visual Servoing ...... 72

3.2.1 Position-based Visual Servoing ...... 73

3.2.2 Image-based Visual Servoing ...... 75

3.3 Refractive Objects in Robotic Vision ...... 81

3.3.1 Detection & Recognition ...... 82

3.3.2 Shape Reconstruction ...... 85

3.4 Summary ...... 92

4 Light-Field Image-Based Visual Servoing 95

4.1 Light-Field Cameras for Visual Servoing ...... 95

4.2 Related Work ...... 97

4.3 Lambertian Light-Field Feature ...... 99

4.4 Light-Field Image-Based Visual Servoing ...... 100

4.4.1 Continuous-domain Image Jacobian ...... 100

4.4.2 Discrete-domain Image Jacobian ...... 102

4.5 Implementation & Experimental Setup ...... 104

4.5.1 Light-Field Features ...... 104

4.5.2 Mirror-Based Light-Field Camera Adapter ...... 105

4.5.3 Control Loop ...... 106

4.6 Experimental Results ...... 107

4.6.1 Camera Array Simulation ...... 108

4.6.2 Arm-Mounted MirrorCam Experiments ...... 110

4.7 Conclusions ...... 117

5 Distinguishing Refracted Image Features 119

5.1 Related Work ...... 122

5.2 Lambertian Points in the Light Field ...... 126

5.3 Distinguishing Refracted Image Features ...... 128

5.3.1 Extracting Image Feature Curves ...... 130

5.3.2 Fitting 4D Planarity to Image Feature Curves ...... 132

5.3.3 Measuring Planar Consistency ...... 137

5.3.4 Measuring Slope Consistency ...... 138

5.4 Experimental Results ...... 140

5.4.1 Experimental Setup ...... 140

5.4.2 Refracted Image Feature Discrimination with Different LF Cameras . . 141

5.4.3 Rejecting Refracted Image Features for Structure from Motion . . . . . 148

5.5 Conclusions ...... 153

6 Light-Field Features for Refractive Objects 157

6.1 Refracted LF Features for Vision-based Control ...... 158

6.2 Related Work ...... 159

6.3 Optics of a Lens ...... 161

6.3.1 Spherical Lens ...... 162

6.3.2 Cylindrical Lens ...... 163

6.3.3 Toric Lens ...... 163

6.4 Methodology ...... 164

6.4.1 Refracted Light-Field Feature Definition ...... 166

6.4.2 Refracted Light-Field Feature Extraction ...... 170

6.5 Experimental Results ...... 174

6.5.1 Implementations ...... 174

6.5.2 Feature Continuity in Single-Point Ray Simulation ...... 177

6.5.3 Feature Continuity in Ray Tracing Simulation ...... 179

6.6 Visual Servoing Towards Refractive Objects ...... 186

6.7 Conclusions ...... 188

7 Conclusions and Future Work 191

7.1 Conclusions ...... 191

7.2 Future Work ...... 193

Bibliography 197

A Mirrored Light-Field Video Camera Adapter I

A.1 Introduction ...... II

A.2 Background ...... IV

A.3 Methods ...... V

A.3.1 Design & Optimization ...... V

A.3.2 Construction ...... VI

A.3.3 Decoding & Calibration ...... VIII

A.4 Conclusions and Future Work ...... X

List of Tables

2.1 Minimum Number of Parameters to Describe Geometric Primitives from 2D to 4D...... 46

4.1 Comparison of camera systems’ capabilities and tolerances for VS ...... 98

5.1 Comparison of our method and the state of the art using two LF camera arrays and a lenslet-based camera for discriminating refracted image features . . . . . 145

5.2 Comparison of mean relative instantaneous pose error for unfiltered and filtered SfM-reconstructed trajectories ...... 151

A.1 Comparison of Accessibility for Different LF Camera Systems ...... VI

List of Figures

1.1 Robot applications with refractive objects ...... 3

1.2 An example of unreliable RGB-D camera output for a refractive object . . . . . 5

1.3 Light-field camera as an array of cameras ...... 6

1.4 Gradual changes in a refractive object's appearance in an image can be programmatically detected ...... 7

2.1 Surface reflections ...... 16

2.2 Lambertian and non-Lambertian reflections ...... 18

2.3 Non-Lambertian reflection ...... 19

2.4 Snell’s law of refraction at the interface of two media...... 19

2.5 The central projection model ...... 22

2.6 Image formation for a thin lens ...... 25

2.7 Depth of field and focus ...... 26

2.8 Epipolar geometry for a stereo camera system ...... 27

2.9 The plenoptic function ...... 31

2.10 The two-plane parameterisation of the 4D LF ...... 34

2.11 Example 4D LF ...... 36

2.12 Light-field camera architectures ...... 37


2.13 Monocular versus plenoptic camera ...... 40

2.14 Raw plenoptic imagery ...... 41

2.15 Visualization of the light-field ...... 43

2.16 A line in 3D ...... 49

2.17 4D point example ...... 53

2.18 4D line example ...... 53

2.19 4D hyperplane example ...... 54

2.20 4D plane example ...... 56

2.21 Illustrating the depth of a point in the 2PP ...... 58

2.22 Light-field slope ...... 60

3.1 Architectures for visual servoing ...... 73

3.2 Light path through a refractive object ...... 87

4.1 MirrorCam setup ...... 97

4.2 Visual servoing control loop ...... 106

4.3 Simulation results for LF-IBVS ...... 109

4.4 Simulation of views for LF-IBVS ...... 110

4.5 Experimental results of LF-IBVS trajectories ...... 112

4.6 Experimental results of stereo-IBVS ...... 113

4.7 Setup for occlusion experiment ...... 116

4.8 Example views from occlusion experiments ...... 116

4.9 Experimental results from occlusion experiments ...... 118

5.1 LF camera mounted on a robot arm ...... 121

5.2 Lambertian versus non-Lambertian feature in the ...... 130

5.3 Example epipolar planar images ...... 131

5.4 Extraction of the image feature curve from the correlation EPI using simulated data ...... 132

5.5 Example Lambertian and refracted image feature curves ...... 143

5.6 Example Lambertian and refracted feature curves from a small-baseline LF camera ...... 144

5.7 Discrimination of refracted image features ...... 144

5.8 Refracted image features detected in sample images ...... 147

5.9 Rejecting refracted image features for SfM ...... 150

5.10 Sample images where monocular SfM failed by not rejecting refracted features 151

5.11 Comparison of camera trajectories for monocular structure from motion . . . . 152

5.12 Point cloud reconstructions of scenes with refracted objects ...... 154

6.1 Toric lens cut from a torus ...... 164

6.2 The visual effect of the toric lens on a background circle ...... 165

6.3 Light-field geometry depth and projections of a lens into a light field ...... 167

6.4 Orientation for the toric lens ...... 168

6.5 Illustration of 3D line segment projected by a toric lens ...... 170

6.6 Ray tracing of a refractive object using Blender ...... 176

6.7 Single point ray trace simulation ...... 178

6.8 Slope estimates for changing z-translation of the LF camera ...... 179

6.9 Orientation estimates for changing z-rotation of the LF camera ...... 179

6.10 Refracted LF feature approach towards a toric lens ...... 180

6.11 Refracted light-field feature slopes during approach towards a toric lens . . . . 181

6.12 Orientation estimate from a Blender simulation of an ellipsoid that was rotated about the principal axis of the LF camera...... 182

6.13 Refracted light-field features for a toric lens ...... 183

6.14 Refracted light-field features for different objects ...... 185

6.15 Concept for visual servoing towards a refractive object ...... 187

A.1 MirrorCam ...... III

A.2 MirrorCam field of view overlap ...... VII

A.3 Rendering of MirrorCam ...... VIII

A.4 MirrorCam v0.4c Kinova arm mount diagrams ...... IX

A.5 MirrorCam v0.4c mirror holder diagrams ...... XII

A.6 MirrorCam v0.4c camera clip diagrams ...... XIII

Acronyms

2PP two-plane parameterisation.

BRIEF binary robust independent elementary features.

CNN convolutional neural networks.

DOF degree-of-freedom.

DSP-SIFT domain-size pooled SIFT.

FAST features from accelerated segment test.

FOV field of view.

GPS global positioning system.

HoG histogram of gradients.

IBVS image-based visual servoing.

IOR index of refraction.

LF light-field.

LF-IBVS light-field image-based visual servoing.

LIDAR light detection and ranging.


M-IBVS monocular image-based visual servoing.

MLESAC maximum likelihood estimator sampling and consensus.

ORB oriented FAST and rotated BRIEF.

PBVS position-based visual servoing.

RANSAC random sampling and consensus.

RMS root mean square.

S-IBVS stereo image-based visual servoing.

SfM structure from motion.

SIFT scale invariant feature transform.

SLAM simultaneous localisation and mapping.

SURF speeded-up robust feature.

SVD singular value decomposition.

VS visual servoing.

List of Symbols

θi angle of incidence

θr angle of reflection
N surface normal
n index of refraction
zi distance to image along the camera's z-axis
zo distance to object along the camera's z-axis
d disparity
b camera baseline
P 3D world point

Px 3D world point’s x-coordinate

Py 3D world point’s y-coordinate

Pz 3D world point's z-coordinate
C P world point with respect to the camera frame of reference
p 2D image coordinate
p∗ initial/observed image coordinates
p# desired/goal image coordinates
p˜ homogeneous image plane point
f focal length
R radius of curvature
K camera matrix


T translation vector
R rotation matrix
F fundamental matrix
J Jacobian
J+ left Moore-Penrose pseudo-inverse of the Jacobian

ν camera spatial velocity
v translational velocity

ω rotational velocity

NMIN minimum number of sub-images in which feature matches must be found

KP proportional control gain

KI integral control gain

KD derivative control gain
s light-field horizontal viewpoint coordinate
t light-field vertical viewpoint coordinate
u light-field horizontal image coordinate
v light-field vertical image coordinate
D light-field plane separation distance
w light-field slope from the continuous domain (unit-less)
m light-field slope from the discrete domain (views/)
σ light-field slope as an angle
s0 light-field central view horizontal viewpoint coordinate
t0 light-field central view vertical viewpoint coordinate
u0 light-field central view horizontal image coordinate
v0 light-field central view vertical image coordinate
L(s,t,u,v) 4D light field
I(u,v) 2D image
(·)∗ indicates a variable is fixed while others may vary
W Lambertian light-field feature

H intrinsic matrix for an LF camera

Rn real coordinate space of n dimensions

Π1 a plane
φ a ray
n normal of a 4D hyperplane

∆u differences between u and u0
ξ singular vector from SVD

λ singular value from SVD
c slope consistency
tplanar threshold for planar consistency
tslope threshold for slope consistency
ei relative instantaneous pose error
etr instantaneous translation pose error
erot instantaneous rotation pose error
C the focal point or focal line
A the projection of point P through a toric lens

ΣA the scaling matrix, containing the singular values of A

Chapter 1

Introduction

In this chapter, we introduce the motivation for this research and outline the research goals and questions this thesis seeks to address. Then we provide our list of contributions, their significance and an overview of this thesis.

1.1 Motivation

Robots are changing the world. Their use for automating the dull, dirty and dangerous tasks in the modern world has increased economic growth, improved quality of life and empowered people. For example, robots assist in manipulating heavy car components in the automotive manufacturing industry. Robots are being used to survey underwater ruins, sewage pipes, collapsed buildings, and other planets in space. At home, robots are also starting to be used for delivery services, home cleaning, and assisting people with reduced mobility [Christensen, 2016]. Traditionally, many robots have operated in isolation from humans, but through the gradual availability of inexpensive computing, user interfaces, integrated sensors, and improved algorithms, robots are quickly improving in function and capability. The confluence of technologies is enabling a robot revolution that will lead to the adoption of robotic technologies for all aspects of daily life. As such, robots are gradually venturing into less constrained environments to work with humans and an entirely new set of challenging objects to interact with. These more complex and unstructured working environments require richer perceptual information for safer interaction.

Historically, roboticists have had success with a variety of sensing modalities, from light detection and ranging (LIDAR), global positioning system (GPS), radar, acoustic imaging, infrared range-finding sensors, to time-of-flight and structured-light depth sensors, as well as cameras. In particular, cameras uniquely measure both dense colour and textural detail that other sensors do not normally provide, which enables robots to use vision to perceive the world. Vision as a robotic sensor is particularly useful because it mimics human vision and allows for non-contact measurement of the environment. Much of the human world has been engineered around our sense of sight, and a significant amount of our communication and interaction relies on vision. Robotic vision has proven effective in terms of object detection, localization, and scene understanding for robotic grasping and manipulation [Kemp et al., 2007]. Furthermore, directly using visual feedback information extracted from the camera to control robot motion, a technique known as visual servoing (VS), has proven useful for real-time, high-precision robotic manipulation tasks [Kragic and Christensen, 2002]. However, refractive objects, which are common throughout the human environment, are one of the areas where modern robotic vision algorithms and conventional camera systems still encounter difficulties [Ihrke et al., 2010a].

One novel camera technology that may enable robots to better perceive refractive objects is the light-field (LF) camera, which uses multiple-aperture optics to implicitly encode both texture and depth. In this thesis, we explore LF cameras as a means of seeing and servoing towards refractive objects. By seeing, we refer to detecting refractive objects using only the LF. By servoing, we refer to visual servoing using LF camera measurements to directly control the camera’s relative pose. Combining the two, this research may enable a robotic manipulator to detect and move towards, grasp, and manipulate refractive objects—for example a glass of beer or wine. The principal motivation for this topic lies in improving our understanding of how refractive objects behave in the LF and how to exploit this knowledge to enable more reliable motion towards refractive objects.

1.1.1 Limitations of Robotic Vision for Refractive Objects

Robots for the real world will inevitably interact with refractive objects, as in Fig. 1.1. Future robots will contend with wine glasses and clear water bottles in domestic applications [Kemp et al., 2007]; glass objects and clear plastic packaging for quality assessment and packing in manufacturing [Ihrke et al., 2010a]; glass windows throughout the urban environment, as well as water and ice for outdoor operations [Dansereau, 2014]. For example, a household robot must be able to pick up, wash and place glassware; a bartender robot must serve drinks from bottles of wine and spirits; an outdoor maintenance robot may want to avoid falling into the swimming pool or nearby fountains. Other examples of robots interacting with refractive objects include medical robots performing ophthalmic (eye) surgery, or servicing satellites working with telescopic lenses or shiny and transparent surface coverings. Automating these applications typically requires knowledge of either object structure and/or robot motion. Yet objects such as those just described are particularly challenging for robots, largely because they are transparent.

Figure 1.1: Robots will have to interact with refractive objects. (a) In domestic applications, such as cleaning and putting away dishes. (b) In manufacturing, assessing the quality of glass objects, or picking and placing such objects in warehouses.

Refractive objects are particularly challenging for robots primarily because these types of objects do not typically have texture of their own. Instead their appearance depends on the object’s shape and the surrounding background and lighting conditions. Robotic methods for localization, manipulation and control exist to deal with refractive objects when accurate 3D geometric models of the refractive objects themselves are available [Choi and Christensen, 2012, Luo et al., 2015, Walter et al., 2015, Zhou et al., 2018]. However, these models are often difficult, time-consuming and expensive to obtain, or simply not available [Ihrke et al., 2010a]. When 3D geometric models of the refractive objects are not available, localization, manipulation and control around refractive objects become much harder problems.

In robotic vision, a common approach when no models are available, regardless of whether the scene contains refractive objects, is to use features. Features are distinct aspects of interest in the scene that can be repeatedly and reliably identified from different viewpoints. Image features are those features recorded in the image as a set of pixels by the camera. Image features can then be automatically detected and extracted as a vector of numbers, which we refer to as the image feature vector. Features are often chosen because their appearances do not change significantly with small changes in viewpoint. The same features are matched from image to image as the robot moves, which enables the robot to establish a consistent geometric relationship between its observed image pixels and the 3D world.

Feature-based matching strategies form the basis for many robotic vision algorithms, such as object recognition, image segmentation, structure from motion (SfM), VS, and simultaneous localisation and mapping (SLAM). However, many of these algorithms implicitly assume that the scene or object is Lambertian—that the object’s (or feature’s) appearance remains the same despite moderate changes in viewpoint. Instead, refractive objects are non-Lambertian because their appearance often varies significantly with viewpoint. The violation of the Lambertian assumption can cause inconsistencies, errors and even failures for modern robotic vision systems.
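To make the notions of image features, feature vectors and feature matching concrete, the following is a minimal sketch of detecting and matching features between two views. It assumes the OpenCV library is available; the image file names are placeholders and the snippet is illustrative only, not part of the methods developed in this thesis.

```python
# Minimal sketch of feature detection and matching between two views.
# Assumes OpenCV (cv2) is installed; 'view_a.png' and 'view_b.png' are placeholders.
import cv2

img1 = cv2.imread('view_a.png', cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread('view_b.png', cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute binary descriptors (the image feature vectors).
orb = cv2.ORB_create(nfeatures=500)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Match descriptors between the two views. Under the Lambertian assumption, the
# same world point should yield near-identical descriptors in both images, so
# small descriptor distances indicate likely correspondences.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
```

For a refractive object, the background seen through the object shifts with viewpoint, so such descriptor matches become inconsistent or fail outright, which is precisely the problem this thesis addresses.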

1.1.2 Seeing and Servoing Towards Refractive Objects

Humans are able to discern refractive objects visually by looking at the objects from different perspectives and observing that the appearance of refractive objects changes differently from the rest of the scene. The robotic analogue to human eyes are cameras, which have proven extremely useful as low-cost and compact sensors. Monocular vision systems are by far the most common amongst robots today, but suffer from the ambiguity that small and close objects appear the same size as distant and large objects. Moreover, a single view from a monocular camera does not provide sufficient information to detect the presence of refractive objects. Stereo cameras, which provide two views of a scene, do not work well with refractive objects without prior knowledge of the scene, because triangulation relies heavily on appearance matching. RGB-D cameras and LIDAR sensors do not work reliably on refractive objects because the emitted light is either partially reflected or travels through these objects. Fig. 1.2 shows an example of unreliable depth measurements from an RGB-D camera for a refractive sphere. Robots can move to gain better understanding of a refractive object over time; however, physically moving a robot can be time-consuming, expensive and potentially hazardous. A more efficient approach would be to instantaneously capture multiple views of the refractive object.

Figure 1.2: An example of unreliable RGB-D camera (Intel Realsense D415) output for a refractive sphere. This RGB-D camera uses the structured-light approach to measure depth, which works well for Lambertian surfaces, but not for refractive objects. (a) The colour image, which a monocular camera would also provide. (b) Incorrect and missing depth information around the refractive sphere.

The light field describes all the light flowing in every direction through every point in free space at a certain instant in time [Levoy and Hanrahan, 1996]. LF cameras are a novel camera technology that use multi-aperture optics to measure the LF by capturing a dense and uniformly-sampled set of views of the same scene from multiple viewpoints in a single capture from a single sensor position. We refer to a view as what one would see from a particular viewing pose or viewpoint. Conceptually, unlike looking through a peephole, an LF “image” is similar to an instantaneous window that one can look through to see how a refractive object’s appearance changes smoothly with viewpoint. As illustrated in Fig. 1.3, compared to a monocular camera, which uses a single aperture to capture a single view of the scene from a single viewpoint, an LF camera is analogous to having an array of cameras all tightly packed together which provide multiple views of the scene from multiple viewpoints. Within the LF camera array, a single view can be selected and changed from one to the other in a way that can be described as virtual motion within the single shot of the LF.

Figure 1.3: (a) A monocular camera acts as a peephole with a single aperture to capture a single view of the scene. Light from the scene (yellow) passes through the aperture (red) and is recorded on the image sensor (green). (b) An LF camera can be thought of as a window, or an equivalent camera array that uses multi-aperture optics to capture multiple views of the scene. As a result, the LF camera can capture much more information of the scene from a single sensor capture than a monocular camera. (c) An example LF camera array by Wilburn [Wilburn et al., 2004].

For example, compared to Fig. 1.4a, Fig. 1.4b shows the gradual change in appearance of the refractive sphere from a much denser and more regular or uniform sampling of views from an LF. Perhaps one of the reasons why humans can somewhat reliably perceive refractive objects is that we may unconsciously move a little bit side-to-side using our continuous stream of vision—which is a very dense sampling of the scene. Humans may be able to detect the inconsistent motions of the background caused by the refractive object with respect to their viewpoints.

Therefore, the dense sampling of the LF camera captures the behaviour of the refractive object with a high level of redundancy that is needed to differentiate refractive objects from normal scene content. The uniform sampling pattern of the LF camera induces patterns and algorithmic simplifications that would be unavailable to a set of non-uniformly-sampled views1. Additionally, while the same set of images could be obtained with a single moving camera, LF cameras can capture this information from a single sensor position, reducing the amount of motion required by the robot to perceive a refractive object. Therefore, LF cameras could allow robots to more reliably and efficiently capture the behaviour of refractive objects.


Figure 1.4: In this scene, a refractive sphere has been placed amongst cards. A camera has captured images of the scene along a horizontal rail at (a) 3 cm intervals, and (b) 1 cm intervals. The end images (blue border) are taken from the same positions. In (a), the change in appearance of the refractive sphere is significant and perhaps very challenging to recognize without the prior knowledge that there is a refractive sphere in the middle of the scene. In (b), a more frequent sampling of the scene reveals the gradual change in appearance of the refractive sphere, which may be programmatically detected. Images from the New Stanford Light Field Archive.

1Consider a conventional monocular camera and its dense and uniformly-sampled array of pixels that produce a detailed 2D image. Often, more pixels yield more detail in a single image. Additionally, if the pixels were oriented in different directions and at a variety of positions, interpreting the scene would be a much more complex task.

Returning to the original theme of this section, robots must not just perceive refractive objects; they must be able to precisely control their relative pose around these objects as well. In the traditional open-loop “look then move” approach, the accuracy of the operation depends directly on the accuracy of the visual sensor and robot end-effector. VS is a robot control technique that uses the camera output to directly control the robot motion in a feedback loop, which is referred to as a closed-loop approach. VS has proven to be reliable at controlling robot motion with respect to visible objects without requiring a geometric model of the object, or an accurate robot. While refractive objects are challenging because the objects are not always directly visible, they leave fingerprints based on how the background is distorted. LF cameras capture some of this distortion, which we show can be exploited to visually servo towards refractive objects for further grasping and manipulation.

1.2 Statement of Research

Based on the previous section, there exists a clear opportunity to advance robotic vision in the area of visual control with respect to refractive objects using LF cameras. The main research question of this thesis is thus:

How can we enable robotic vision systems to visually control their motion around refractive objects in real-world environments, using a light-field camera and without prior models of the objects?

The primary research question can be decomposed into sub-questions:

1. How can we visually servo using a light-field camera?

2. How can we detect refractive objects using a light-field camera?

3. How can we servo towards a refractive object?

Our hypothesis is that we can develop a novel light-field feature based on the dense and uniform observations of a Lambertian point captured by an LF camera, which we refer to as a Lambertian light-field feature. We can use this Lambertian light-field feature to perform visual servoing in Lambertian scenes. We can observe how this light-field feature becomes distorted by a refractive object and use these changes to distinguish refracted image features from Lambertian image features. Using insight from visual servoing with the Lambertian light-field feature and distinguishing refracted image features in the LF, we can propose a novel refracted light-field feature to directly control the robot pose with respect to a refractive object, without needing a prior model of the object. We define a refracted light-field feature as the projection of a feature in the LF that has been distorted by a refractive object. The key challenges in showing this will be in understanding how the Lambertian light-field feature changes with respect to camera pose, how to characterise the changes in our light-field feature caused by a refractive object, and how the LF changes as the robot moves towards a refractive object.

1.3 Contributions

The broad topics addressed in this thesis are (1) image-based visual servoing using a light-field camera, (2) detecting refracted features, and (3) visual servoing towards refractive objects. The specific contributions are as follows:

Light-field image-based visual servoing – partially published as [Tsai et al., 2017]

1. We propose the first derivation, implementation and experimental validation of light-field image-based visual servoing (LF-IBVS). In particular, we define an appropriate compact representation of an LF feature that is close to the form measured directly by LF cameras for Lambertian scenes. We derive continuous- and discrete-domain image Jacobians for the light field. Our LF feature enforces LF geometry in feature detection and correspondence. We experimentally validate LF-IBVS in simulation and on a custom LF camera adapter, called the MirrorCam, mounted on a robot arm.

2. We show that our method of LF-IBVS outperforms conventional monocular and stereo image-based visual servoing in the presence of occlusions.

Distinguishing refracted image features – partially published as [Tsai et al., 2019]

1. We develop an LF feature discriminator for refractive objects. In particular, we develop a method to distinguish a Lambertian image feature from a feature whose rays have been distorted by a refractive object, which we refer to as a refracted image feature. Our discriminator can distinguish refractive objects more reliably than previous work. We also extend refracted image feature discrimination capabilities to lenslet-based LF cameras which typically have much smaller baselines than conventional LF camera arrays.

2. We show that using our method to reject most of the refracted image feature content enables monocular SfM in scenes containing refractive objects, where traditional methods otherwise fail.

Light-field features for refractive objects

1. We define a representation for a refracted LF feature that approximates the local surface area of the refractive object as two orthogonal surface curvatures. We can then model the local part of the refractive object as a toric lens. The properties of the local projections can then be observed by and extracted from the light field.

2. We evaluate the feature’s continuity with respect to LF camera pose for a variety of different refractive objects to demonstrate the potential for our refracted LF feature’s use in vision-based control tasks, such as visual servoing.

1.4 Significance

This research is significant because it will provide robots with hand-eye coordination skills for objects that are difficult to perceive. It is a critical step towards enabling robots to see and interact with refractive objects. Specifically, with an improved understanding of how refractive objects behave in a single light field, robots can now distinguish refractive objects and reject the refracted feature content. Robots can then move in scenes containing refractive objects without having their pose estimates corrupted by the refracted scene content. Being able to describe refractive objects in the light field and then servo towards them enables more advanced grasping and manipulation tasks for robots.

Furthermore, the applications of understanding the behaviour of refractive objects in the light field as robots move are not limited to structure from motion and visual servoing. This theory could help improve visual navigation and even SLAM applications for domestic and manufacturing robots. Ultimately, this research will enable manufacturing robots to quickly manipulate objects encased in clear plastic packaging. Domestic robots will be able to more reliably clean glasses and serve drinks. Medical robots will more safely operate on transparent objects, such as human eyes. Overall, this research is a significant step towards opening up an entirely new class of objects for manipulation that have been largely ignored by the robotics community until now.

1.5 Structure of the Thesis

This thesis in robotic vision draws on theory from both computer vision and robotics research communities. Chapter 2 provides the necessary background relevant to the remainder of this thesis, including a description of light transport and light capture. Specifically, we explain the difference between specular and diffuse reflections, as well as Lambertian and non-Lambertian reflections and refraction. We discuss image formation with respect to monocular, stereo, multiple camera and LF camera systems. We then discuss visualization of 4D LFs and 4D LF geometry, which are built on in the following chapters.

In Chapter 3, we provide a review of the relevant literature surrounding three topics: image features, VS and refractive objects. Because VS systems typically rely on tracking image features in a sequence of images, we first include a review of image features, how they have been used in VS systems and how image features have been used with LF cameras. Second, we discuss the major classes of VS systems, position-based and image-based systems, in the context of LF cameras and refractive objects. Third, we review a variety of methods that have been explored to automatically detect and perceive refractive objects in both computer and robotic vision. Altogether, this chapter explains how traditional image features are insufficient for dealing with refractive objects, how LF cameras have not yet been considered for VS systems, and how, for refractive objects, other methods for perceiving these objects are impractical for most mobile robotic platforms or rely on assumptions that significantly narrow their application window. Thus, there is a gap for methods that do not rely on 3D geometric models of the refractive objects and that can apply to a wide variety of object shapes. Using LF cameras for VS towards refractive objects therefore carves out a niche in the research community that leaves room for scientific exploration.

As mentioned previously, LF cameras are of interest for VS because they can capture the behaviour of view-dependent light transport effects, such as occlusions, specular reflections and refraction within a single shot. However, VS with an LF camera for basic Lambertian scenes has not yet been explored. As an initial investigation, we first focus on using an LF camera to servo in Lambertian scenes. Chapter 4 develops a light-field feature for Lambertian scenes, which we later refer to as a Lambertian light-field feature. This feature exploits the fact that a Lambertian point in the world induces a plane in the 4D LF. Afterwards, we derive the relations between differential feature changes and resultant robot motion. Using this feature, we then present the first development of light-field image-based VS for Lambertian scenes and compare its performance to traditional monocular and stereo VS systems.

Next, Chapter 5 presents our method to distinguish a Lambertian image feature from a feature whose rays have been distorted by a refractive object, which we refer to as a refracted feature. We do this by characterising the apparent motion of an image feature in the light field and comparing it to how well this apparent motion matches the model of an ideal Lambertian image feature (which is based upon the plane in the 4D LF). We apply this method to the problem of SfM, allowing us to reject most of the refracted feature content, which enables monocular SfM using the Lambertian parts of the scene, in scenes containing refractive objects where traditional methods would normally fail.

In Chapter 6, we combine the Lambertian light-field feature definition and LF-IBVS framework from Chapter 4, with the concept of the refracted image feature from Chapter 5, to explore the concept of a refracted light-field feature. This chapter is largely focused on investigating how the 4D planar structure of a light-field feature can be extended to refractive objects, extracted from a single light field, and how this structure changes with respect to viewing pose. We demonstrate this feature’s suitability for VS with respect to pose change, and lay the groundwork for a system to visual servo towards refractive objects.

The unifying theme underlying the contributions of this thesis is exploring and exploiting the properties of the light field for robotic vision. In Chapter 4, we develop a Lambertian light-field feature for visual servoing, and in Chapter 5, we propose a method to detect refractive objects. Both of these investigations exploit the fact that a Lambertian point in the world induces a plane in the 4D LF. In Chapter 6, we use the induced plane to propose a method that enables visual servoing towards refractive objects. Throughout this thesis, the dense and uniform sampling of the light field induces patterns that we exploit to improve robotic vision algorithms. Finally, conclusions and suggestions for further work are presented in Chapter 7.

Chapter 2

Background on Light Transport & Capture

This chapter begins with a background on how light is transported through scenes, including reflection and refraction. We then discuss single image formation with conventional monocular cameras and extend the discussion to LF cameras. Finally, we illustrate how we typically visualize LFs and discuss the theory of 4D LF geometry.

2.1 Light Transport

In order to understand refractive objects and LF cameras, it is important to first understand light transport: the nature of light and how it interacts with matter. Light is an electromagnetic wave, but when the wavelength of light is small relative to the size of the structures it interacts with, we can neglect its more complex wave-like behaviour and focus on its particle-like behaviour, where light can be described as rays that move in straight lines within a constant medium [Pedrotti, 2008]. This approximation describes most phenomena measured by human eyes, most cameras and most robotic vision systems.


2.1.1 Specular Reflections

When light rays hit a surface, light is reflected. The law of reflection states that for a smooth, flat and mirror-like surface, the reflected light ray is on a plane formed by the incident light ray and the surface normal. Additionally, the angle of reflection θr is equal to the angle of incidence θi [Lee, 2005], as shown in Fig. 2.1a. If we know the surface geometry and the incident light ray, then we can recover the direction of the reflected ray. Alternatively, if we know the incident and reflected light rays, then we can determine the geometry (normals) of the reflective surface. For surfaces that are not perfect mirrors, specular reflections can still occur, taking the form of a narrow angular distribution of the reflected light. The ratio of reflected light to incident light is known as the reflectance and values of more than 99% can be achieved through a combination of surface polishing and advanced coatings [Freeman and Fincham, 1990]. Examples of specular reflective materials are metal, mirrors, glossy plastics and shiny surfaces of transparent objects.
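As a small illustration of the law of reflection in vector form, the following sketch computes the reflected ray direction; the direction vectors and the helper function name are our own and not taken from this thesis.

```python
# Reflect a unit incident direction d about a unit surface normal n:
# r = d - 2 (d . n) n, which keeps r in the plane of incidence with the
# angle of reflection equal to the angle of incidence.
import numpy as np

def reflect(d, n):
    return d - 2.0 * np.dot(d, n) * n

d = np.array([1.0, -1.0, 0.0]) / np.sqrt(2.0)   # incident ray at 45 degrees to the surface
n = np.array([0.0, 1.0, 0.0])                   # surface normal
r = reflect(d, n)                               # [0.707, 0.707, 0.0]: reflected at 45 degrees
```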

2.1.2 Diffuse Reflections

Most real surfaces are not perfect mirrors. Instead, they are often rough and produce diffuse reflections. Light interacts with rough surfaces via penetration, scattering, absorption and re-emission from the surface. These surfaces are commonly modelled using a distribution of micro-facets. Each facet acts like a small smooth surface that has its own, single surface normal, which varies from facet to facet, as in Fig. 2.1b. The extent to which the micro-facet normals differ from the smooth surface normal is a measure of surface roughness. The distribution of the micro-facet normals creates a broad angular distribution of reflected light, which is known as a diffuse reflection. Some examples of diffuse materials include wood and felt.

Figure 2.1: (a) The law of reflection for a smooth surface. The angle of incidence θi is equal to the angle of reflection θr about the surface normal N on the plane of incidence. The reflection off a smooth surface illustrates a specular reflection. (b) The reflections from a rough surface of micro-facets illustrate a diffuse reflection.

2.1.3 Lambertian Reflections

The Lambertian surface model is often referred to as the isotropic radiance constraint, the brightness constancy assumption, or the photo consistency assumption in computer graphics and robotic vision. Each point on a Lambertian surface reflects light with a cosine angular distribution, as shown in Fig. 2.2a, where θ is the viewing angle relative to the surface normal. However, when a surface is viewed with a finite field of view (FOV), the surface area seen by the observer is proportional to 1/ cos θ. As θ approaches 90◦, more surface points become visible to the observer. The observed radiance (amount of reflected light) comes from the reflected intensity from each surface point (∝ cos θ) multiplied by the number of points seen (∝ 1/ cos θ), which cancels out and is thus independent of θ. This results in the observed radiance being roughly equal in all directions [Lee, 2005], as shown in Fig. 2.2b. The Lambertian model is very common in computer graphics, and often implicitly assumed in many robotic vision algorithms. However, this assumption is invalid for specular reflections and refractive objects, which motivates us to consider non-Lambertian reflections and refraction.
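The cancellation described above can be written compactly. Using I0 for the reflected intensity at normal incidence and A for the area of the observed surface patch (both symbols introduced here only for illustration), the observed radiance is

\[
E(\theta) \;\propto\; \underbrace{I_0 \cos\theta}_{\text{intensity per surface point}} \;\times\; \underbrace{\frac{A}{\cos\theta}}_{\text{surface area seen}} \;=\; I_0 A,
\]

which is independent of the viewing angle θ, consistent with Fig. 2.2b.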

2.1.4 Non-Lambertian Reflections

Although the Lambertian assumption has been shown to work quite well in practice for most scenes and surfaces [Levin and Durand, 2010], there remains a variety of surfaces, such as those polished or shiny, that reflect light in a manner that does not follow the Lambertian model.

Figure 2.2: (a) The cosine distribution of a Lambertian reflection at a point with an observer at viewing angle θ. (b) A Lambertian reflection has an observed radiance approximately equal in all directions. The appearance of the ray stays the same regardless of viewing angle.

These non-Lambertian reflections occur when at least part of the reflected light is dependent on viewing angle, as shown in Fig. 2.3. Such surfaces are not perfectly smooth because of the molecular structure of materials; however, when the irregularities are less than the wavelength of incident light, the reflected light becomes increasingly specular. This means that even rough surfaces can exhibit some degree of non-Lambertian reflections when viewed at a sufficiently sharp angle.

Shiny surfaces involve both specular and diffuse surface reflections. A common approach to dealing with these non-Lambertian surfaces is to use the dichromatic reflection model [Shafer, 1985], which separates the reflections into specular and diffuse components. The diffuse component is modelled as Lambertian and the rest is attributed to a non-Lambertian reflection. The relative amount of these two components depends on material properties, geometry of the light source, observer viewing pose and surface normal [Corke, 2017]. The model is valid for materials such as woods, paints, papers and plastics, but excludes materials such as metals. In the graphics community, there are more advanced models; Schlick [Schlick, 1994], Lee [Lee, 2005] and Kurt [Kurt and Edwards, 2009] provide good surveys of modelling non-Lambertian light reflection.

Figure 2.3: A non-Lambertian reflection has an uneven reflected light distribution. The reflected intensity, and thus appearance of the ray changes with viewing angle.

2.1.5 Refraction

Refractive objects pose a major challenge for robotic vision because they typically do not have an appearance of their own. Rather, they allow light to pass through them and in the process distort or change the direction of the light. When light passes through an interface—a boundary that separates one medium from another—light is partially reflected and transmitted. Refraction occurs when light rays are bent at the interface. Assuming the media are isotropic, the amount of bending is determined by the media’s index of refraction (IOR) n and Snell’s Law.

Snell’s law of refraction, illustrated in Fig. 2.4, relates the sines of the angles of incidence θi and refraction θr at an interface between two optical media based on their IOR,

ni sin θi = nr sin θr, (2.1)

Figure 2.4: Snell’s law of refraction at the interface of two media.

where θi and θr are measured with respect to the surface normal, and ni and nr are the IOR of the incident and refracting medium, respectively. The IOR of a medium is defined as the ratio of the speed of light in a vacuum c over the speed of light in the medium v, given by n = c/v. For air and most gases, n is taken as 1.0, while for other solid materials such as glass, n =1.52. As light passes from a lower to higher n, the light ray is bent towards the normal, while light is bent away from the normal when it passes from higher to lower n.
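As a minimal sketch of Eq. (2.1) in code, the following computes the refraction angle for light entering glass from air; the values follow the text above, while the function and variable names are our own illustrative choices.

```python
# Compute the refraction angle from Snell's law: n_i sin(theta_i) = n_r sin(theta_r).
import numpy as np

def refraction_angle(theta_i, n_i, n_r):
    s = n_i * np.sin(theta_i) / n_r
    if abs(s) > 1.0:
        return None      # no transmitted ray (total internal reflection, discussed below)
    return np.arcsin(s)

# Light entering glass (n = 1.52) from air (n = 1.0) at 30 degrees incidence
# is bent towards the normal, to roughly 19.2 degrees.
theta_r = refraction_angle(np.deg2rad(30.0), 1.0, 1.52)
```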

Because the bending of light depends on the surface normal, the shape of an object as well as its IOR play an important role in the appearance of visual features on the surface of a refractive object. The larger the angle between the incident light and the object’s surface normal, the larger the change in direction of the refracted light. And the thicker an object is, the longer the light can travel through the refractive medium, resulting in a larger change in appearance.

To complicate matters more, the surfaces of transparent objects are often both reflective and refractive. This means that a portion of the light is reflected at the surface, while another portion is refracted through the surface. Fresnel’s equations describe the reflection and transmission of light at the boundary of two different optical media [Hecht, 2002]. The amount of reflected light depends on the media’s n and angle of incidence.
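As an illustrative sketch of how the reflected fraction can be computed, the following uses the standard Fresnel formulas for the s- and p-polarisations [Hecht, 2002], averaged for unpolarised light; the function is our own and not part of the methods in this thesis.

```python
# Fraction of incident light reflected at a dielectric interface, averaged over
# the s- and p-polarisations; theta_i is the angle of incidence in radians.
import numpy as np

def fresnel_reflectance(theta_i, n_i, n_t):
    sin_t = n_i * np.sin(theta_i) / n_t
    if abs(sin_t) > 1.0:
        return 1.0       # total internal reflection: all light is reflected
    theta_t = np.arcsin(sin_t)
    r_s = (n_i * np.cos(theta_i) - n_t * np.cos(theta_t)) / \
          (n_i * np.cos(theta_i) + n_t * np.cos(theta_t))
    r_p = (n_i * np.cos(theta_t) - n_t * np.cos(theta_i)) / \
          (n_i * np.cos(theta_t) + n_t * np.cos(theta_i))
    return 0.5 * (r_s**2 + r_p**2)

# Roughly 4.3% of light is reflected at normal incidence on glass; the rest is refracted.
R = fresnel_reflectance(np.deg2rad(0.0), 1.0, 1.52)
```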

Furthermore, when light travels from a medium of higher to lower n, internal reflections can occur within refractive objects. Light travelling along the interface’s surface normal does not change direction. Light moving at an angle large enough to cause the refracted ray to bend 90◦ from the normal travels along the interface itself. Such an angle is referred to as the critical angle θc. Any incident light that has an angle greater than θc is totally reflected back into the original medium, as per the law of reflection. This phenomenon is known as total internal reflection, which is typically exploited in propagating light through fibre optics. Internal reflection can cause light sources to appear within transparent objects from unexpected angles and even disappear entirely. This further adds to the viewpoint-dependent nature of refractive objects.

In all, refractive objects are texture-less on their own. They can refract, magnify (scale), flip and distort the background, and even cause it to vanish at certain angles. All of these effects depend heavily on the refractive object’s surface normals, thickness, material properties, as well as the object’s background, so it is no surprise that refractive objects easily confuse most robotic vision techniques that do not account for more than simple Lambertian reflections.

2.2 Monocular Cameras

Cameras are excellent sensors for robotic vision; they are compact, affordable sensors that have low power consumption and provide a wealth of visual information. Camera systems are very flexible in application, owing to the variety of computer vision and image processing algorithms available. In this section, we look at monocular cameras, the central projection model and the loss of depth information. We do this to better understand LF cameras, which can sometimes be considered as an array of monocular cameras.

2.2.1 Central Projection Model

Image formation using a conventional monocular camera projects a 3D world onto a 2D surface. The central projection model is often used to perform this transformation and is also referred to as the central perspective or pinhole camera model. An illustration of how the model works is shown in Fig. 2.5. It assumes an infinitely small aperture for light to pass through to the image plane and sensor. The camera’s optical axis is defined as the centre of the field of view. The geometry of similar triangles describes the projective relationships for world coordinates P = (Px, Py, Pz) onto the image plane p = (x, y) as

\[ x = f\frac{P_x}{P_z}, \qquad y = f\frac{P_y}{P_z}. \tag{2.2} \]

The image plane point can be written in homogeneous form p˜ =(x′,y′,z′) as

\[ x' = f P_x, \qquad y' = f P_y, \qquad z' = P_z, \tag{2.3} \]

such that the image-plane coordinates are recovered as x = x'/z' and y = y'/z'.

′ If we consider the homogeneous world coordinates P = (Px,Py,Pz, 1), then the central pro- jection model can be written linearly in matrix form as

\[
    \tilde{p} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} P_x \\ P_y \\ P_z \\ 1 \end{bmatrix} = K P', \qquad (2.4)
\]
where K is a 3 × 4 matrix known as the camera matrix and p̃ is the coordinate of the point with respect to the camera frame [Hartley and Zisserman, 2003].
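To make the projection concrete, the short Python sketch below applies (2.2)–(2.4) to a hypothetical world point using numpy; the focal length and point coordinates are arbitrary example values, not taken from any experiment in this thesis.

```python
import numpy as np

# Arbitrary example values: focal length and a world point in front of the camera.
f = 0.05                                 # focal length
K = np.array([[f, 0, 0, 0],
              [0, f, 0, 0],
              [0, 0, 1, 0]])             # 3x4 camera matrix K from (2.4)

P_h = np.array([0.2, -0.1, 2.0, 1.0])    # homogeneous world point P' = (Px, Py, Pz, 1)

p_tilde = K @ P_h                        # homogeneous image point (x', y', z')
x, y = p_tilde[:2] / p_tilde[2]          # dehomogenise: x = x'/z', y = y'/z'
print(x, y)                              # equals f*Px/Pz and f*Py/Pz, as in (2.2)
```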

Figure 2.5: The central projection model. The image plane is a focal length f in front of the camera's origin. A non-inverted image of the scene is formed: world point P(P_x, P_y, P_z) is captured at image point p(x, y) on the image plane.

The central projection model is relatively simple, requires no lenses to focus the light, and is thus commonly used throughout robotic vision [Siciliano and Khatib, 2016]. However, the infinitely small aperture of the central projection model has the practical problem of admitting very little light from the scene onto the image sensor. This can result in dark images that are not useful, or impractically long exposure times for many robotic applications.

In practice, most modern cameras use optical lenses to achieve reasonable image exposure. However, the central projection model does not include the geometric distortions or blurring caused by lenses and finite-sized apertures. Thus, the central projection model is often augmented with additional terms to account for the image distortion caused by lenses. See Sturm et al. for a survey of other camera models, including models for catadioptric and omnidirectional cameras [Sturm et al., 2011].

2.2.2 Thin Lenses and Depth of Field

The infinitely small aperture of the central projection model is a mathematical approximation only and does not physically exist. In fact, as the aperture size approaches a certain limit (related to the wavelength of the observed light and the shape of the aperture), diffraction effects increase, resulting in a blurrier image; this prevents the use of arbitrarily small apertures. Thus all apertures have a nonzero diameter. In practice, optical lenses are used to allow for a much larger aperture so that more light from the scene can reach the image sensor.

It is typical to assume that the axial thickness of the lens is small relative to its radius of curvature, which means that the lens is "thin". It is also common to assume that the angles the light rays make with the optical axis of the lens are small, which is known as the paraxial ray approximation. Under the thin-lens and paraxial-ray assumptions, the mathematics describing the behaviour of lenses simplifies significantly. As shown in Fig. 2.6, the light rays emanating from a point P in the scene pass through the lens and converge to a point behind the lens according to the thin lens formula,

\[
    \frac{1}{z_i} + \frac{1}{z_o} = \frac{1}{f}, \qquad (2.5)
\]

where z_o is the distance to the subject, z_i is the distance to the image, and f is the focal length of the lens. Therefore, we can determine the distance along the z-axis of P, given the lens' focal length and the distance of the image formed by the lens.
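As a quick numerical illustration of (2.5), the snippet below solves the thin lens formula for the subject distance given an assumed focal length and image distance; the values are made up for the example.

```python
# Thin lens formula (2.5): 1/z_i + 1/z_o = 1/f, solved here for the subject distance.
# The focal length and image distance below are illustrative values only.
f = 0.050     # focal length [m]
z_i = 0.055   # image distance at which the subject appears in focus [m]

z_o = 1.0 / (1.0 / f - 1.0 / z_i)
print(z_o)    # ~0.55 m: a sharp image formed 55 mm behind the lens implies a subject ~0.55 m away
```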

The trade-off with the larger aperture of the lens is that the incoming light rays can only be focused at a certain depth. Objects at different depths produce rays that converge at different points behind the lens. A cone of light rays that converges to a point on the image plane is considered to be in focus, as shown in Fig. 2.7. When the point of convergence does not lie on the image plane, the rays occupy an area on the image plane and appear blurred. This area is known as the circle of confusion c and is useful for describing how "sharp", or in focus, a world point appears in an image.

Real lenses are not able to focus all rays to perfect points; the smallest circle of confusion that a lens can produce is often referred to as the circle of least confusion. Whether a blur circle is distinguishable from a point also depends on pixel size: if c is smaller than a pixel, it is usually indistinguishable from a point on the image plane and is thus considered to be in focus, even if the focused light does not converge to a point that strictly lies on the image plane. This gives cameras a nonzero depth of field, the range of distances at which objects in the scene appear in focus on a discrete, digital sensor.

A small aperture provides a large depth of field, which is desirable to keep more of the scene in focus. However, the small aperture admits less light. Less light can lead to issues with noise and these issues cannot always be compensated for by simply increasing the exposure time, due to motion blur. There is therefore an integral link between depth of field, motion blur and signal-to-noise ratio, with relationships determined by exposure duration and aperture diameter.

Figure 2.6: Image formation for a thin lens, shown as a 2D cross section. By convention, the camera’s optical axis is the z-axis with the origin at the centre of the thin lens.

2.2.2.1 Monocular Depth Estimation

As the 3D world is projected onto a 2D surface, the mapping is not one-to-one and depth information is lost. A unique inverse of the central projection model does not exist. Given an image point p(x,y), we cannot uniquely determine its corresponding world point P (Px,Py,Pz). In fact, P can lie at any distance along the projecting ray CP in Fig. 2.5. This is known as the scale ambiguity and is a significant challenge for robots striving to interact in a 3D world using only 2D images.

A variety of strategies can be applied to compensate for this loss, such as active vision [Krotkov and Bajcsy, 1993], depth from focus [Grossmann, 1987, Krotkov and Bajcsy, 1993], monocular SfM [Hartley and Zisserman, 2003, Schoenberger and Frahm, 2016], monocular SLAM [Civera et al., 2008], and learnt monocular depth estimation [Saxena et al., 2006, Godard et al., 2017]. However, without prior geometric models, very few of these methods apply to refractive objects as we will later discuss in Chapter 3. It is therefore worth considering stereo and other camera systems to exploit more views for depth information.

Figure 2.7: Diagram illustrating the circle of confusion c for a point source passing through a lens of diameter d. The point source is focused behind the image plane (top), in focus on the image plane (middle) and focused in front of the image plane (bottom).

2.3 Stereo Cameras

Stereo camera systems use two cameras and the known geometry between them to obtain depth through triangulation in a single sensor capture. Given the corresponding image points p_1, p_2 and both camera poses, the 3D location of world point P can be determined. Epipolar geometry defines the geometric relationship between the two images captured by the stereo camera system, as illustrated in Fig. 2.8, and can be used to simplify the stereo matching process required for stereo triangulation.

As in Fig. 2.8, the centre of projection for each camera is given as {1} and {2}. The three points P, {1} and {2} define a plane known as the epipolar plane. The intersection of the epipolar plane with the image planes of cameras 1 and 2 defines the epipolar lines, l_1 and l_2, respectively.

These lines constrain where P is projected into each image at p_1 and p_2. Given p_1, we seek p_2 in I_2. Rather than searching the entire image, we need only search along l_2. Conversely, given p_2, we can find p_1 on l_1.

Figure 2.8: Epipolar geometry used for stereo camera systems. The epipolar plane is defined by point P and the camera centres {1} and {2}. Note that {1} and {2} define the reference frames of cameras 1 and 2, respectively. The intersection of the epipolar plane with the two image planes I_1 and I_2 defines the epipolar lines l_1 and l_2, respectively. Given p_1, knowledge of l_1 and l_2 can reduce the search for the corresponding image point in the second image plane from a 2D to a 1D problem. Seven pairs of corresponding image points are required to estimate the fundamental matrix, in order to recover the translation ¹T₂ and rotation ¹R₂ of {2} in the reference frame of {1}.

This important geometrical relationship can be encapsulated algebraically in a single matrix known as the fundamental matrix F [Corke, 2013],

\[
    F = K^{-T}\, T_{\times}\, R\, K^{-1}, \qquad (2.6)
\]
where K is the camera matrix, T_× is the skew-symmetric matrix of the translation vector T, and R is the rotation between the two camera poses. For any pair of corresponding image points ¹x and ²x, F satisfies

\[
    {}^{2}x^{T} F\, {}^{1}x = 0. \qquad (2.7)
\]

The fundamental matrix is a 3 × 3 matrix with 7 degrees of freedom (DOF). Thus a minimum of 7 unique pairs of points is required to compute F.

A typical stereo camera arrangement uses two cameras with parallel optical axes, both orthogonal to their baseline. This yields horizontally-aligned epipolar lines and further simplifies the correspondence search from image lines to image rows. For this setup, it is assumed that both cameras have the same focal length f and a baseline separation b. For such a horizontally-aligned stereo camera system with image points p_1(u_1, v_1) and p_2(u_2, v_2), the disparity is given as d = u_2 − u_1. The disparity is a measure of motion parallax. The depth Z can then be computed using
\[
    Z = \frac{fb}{d}, \qquad (2.8)
\]
which shows that disparity is inversely proportional to depth.
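The following sketch computes depth from disparity using (2.8) for a rectified, horizontally-aligned stereo pair; the focal length (in pixels), baseline and feature coordinates are illustrative values only.

```python
import numpy as np

# Depth from disparity for a rectified, horizontally-aligned stereo pair, as in (2.8).
f = 800.0                      # focal length [pixels]
b = 0.12                       # baseline [m]

u1 = np.array([320.0, 400.0])  # feature columns in the first image
u2 = np.array([352.0, 410.0])  # corresponding columns in the second image

d = u2 - u1                    # disparity d = u2 - u1 [pixels]
Z = f * b / d                  # depth Z = f*b/d [m]
print(Z)                       # [3.0, 9.6]: larger disparity implies a closer point
```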

However, stereo methods are limited to a single fixed baseline. In this configuration, edges parallel to the baseline do not yield easily observable disparity (especially when the edge spans the width of the image), and thus their depths are not easily determined. Additionally, with only two views of the same scene, feature correspondence must be performed on just two images. In Lambertian scenes this is sufficient; however, stereo vision can fail under significant appearance changes, for example in the presence of occlusions and non-Lambertian objects.

2.4 Multiple Cameras

Additional cameras yield more views and with them more redundancy; however, the configuration of their relative poses is important. Introducing a third camera creates a trinocular camera system. While stereo uses the 3×3 fundamental matrix, tri-camera methods use a 3×3×3 tensor, known as the trifocal tensor. These tensors can be determined from a set of corresponding image points from 2 and 3 views, respectively. These tensors can then be decomposed into the cameras' projection matrices, after which triangulation can be used to recover the 3D positions of the points. According to Hartley and Zisserman, the quadrifocal tensor exists for 4 views, but it is difficult to compute, and the tensor method does not extend to n views [Hartley and Zisserman, 2003].

Furthermore, multi-camera vision systems are not necessarily limited to regularly-sampled grids aimed at the same scene. For example, a common commercial multi-camera configuration is to have six 90◦ FOV cameras mounted together to provide a 360◦ panoramic view. In this configuration, there is very little scene overlap between the cameras, which provides very little redundancy. As we will later show in Ch. 5, redundancy of views from different perspectives at regular intervals is extremely important for characterising the appearance of refractive objects as a function of viewpoint.

2.5 Light-Field Cameras

LF cameras are based on the idea of computational photography, in which a large part of the image capture process is performed in software rather than hardware. LF cameras belong to the greater class of generalised cameras [Li et al., 2008, Comport et al., 2011]. In this section, we begin by introducing the plenoptic function as a means of modelling light from all possible views in space. Under certain assumptions and restrictions, we explain how we can reduce the plenoptic function to a 4D LF that captures a more manageable representation of multiple views. We then discuss the most common LF parameterisation and the architectures of cameras that capture LFs, how captured LFs are decoded from raw sensor measurements into this parameterisation, and why we expect light-field cameras to be suitable for dealing with refractive objects.

2.5.1 Plenoptic Function

Light is more than a "2D image plus depth" for a single perspective; it is a much higher-dimensional phenomenon. Adelson and Bergen [Adelson and Bergen, 1991] introduced the plenoptic function as a means of representing light and encapsulating all possible views in space. The term "plenoptic" was coined from the root words for "all" and "seeing", so the plenoptic function conceptualizes all the properties of light in a scene. Light is modelled as rays, each of which can be described using seven parameters: three spatial coordinates (p_x, p_y, p_z) that define the position of the ray's source, two orientation coordinates (θ, φ) that define the ray's elevation and azimuth, the wavelength λ that accounts for the colour of the light, and time t. Together, these 7 parameters yield the plenoptic function,

\[
    P(p_x, p_y, p_z, \theta, \phi, \lambda, t), \qquad (2.9)
\]
which is the intensity of the ray as a function of space, time and colour, as illustrated in Fig. 2.9. Thus the plenoptic function represents all the light flowing through every point in a scene through all space and time. The significance of the plenoptic function is best put in Adelson and Bergen's own words [Adelson and Bergen, 1991]:

The world is made of three-dimensional objects, but these objects do not commu- nicate their properties directly to an observer. Rather, the objects fill the space around them with the pattern of light rays that constitutes the plenoptic function, and the observer takes samples from this function. The plenoptic function serves as the sole communication link between physical objects and their corresponding retinal images. It is the intermediary between the world and the eye.

To explain how cameras typically sample the plenoptic function, we consider a monocular, monochrome camera. First, time is sampled by setting a small shutter time on the camera. The camera's photosensor integrates over a small amount of time as the photosites are exposed to the scene, and incoming photons are counted by the sensor. Exposure time and aperture size directly affect the exposure of an image by establishing a trade-off between depth of field, image brightness and motion blur.

Figure 2.9: The plenoptic function models all the light flowing through a scene in 7 dimensions: 3 for position, 2 for direction, 1 for time and 1 for wavelength.

Second, the wavelength is sampled from the plenoptic function by integrating the incoming light over a small band of wavelengths. Each photosite uses a filter to select a specific range of wavelengths (typically red, green, or blue), although in practice, the luminosity curves overlap, especially for red and green.

Third, the position is sampled in a camera by setting an aperture. The aperture determines the positions of the rays seen by the camera. This is typically idealized as an infinitely small pinhole, whereby all the light in the scene passes through to project an inverted image of the scene at one focal length from the pinhole. The location of this pinhole is the camera origin, a 3D point known as the nodal point.

Finally, direction is sampled from the plenoptic function in a conventional camera. Each pixel integrates the scene luminance over a range of both direction angles. The range of directions that the camera can capture is called the FOV. The parameters that determine the FOV are the focal length and the pixel size and number in both the x- and y-directions. As we integrate over the directions, we also project the scene onto the camera sensor, which is where scale information is lost. Additionally, only unoccluded objects are projected onto the sensor in a Lambertian scene. Occluded objects are therefore not measured by a conventional camera, with the exception of transparency and translucency.

Although not part of the original 7D plenoptic function, the 8D plenoptic function includes polarisation and is worth mentioning [Georgiev et al., 2011]. Linearly polarised light has vibrations in only a single plane, while unpolarised light has vibrations that occur in many planes, or equivalently many directions that can also change rapidly. Unpolarised light may be better thought of as a mixture of randomly polarised light. In terms of sampling polarisation, a camera can measure different polarisations by employing polarisers that pass only certain polarisations. Since humans do not have natural polarisation filters built into their eyes, regular cameras are not typically built with polarisers. Thus most cameras measure unpolarised light by integrating over all the different polarisations.

Therefore, with a monocular camera we are integrating over small intervals of position, time and wavelength and over a larger range of direction. This means we are sampling only 2 dimensions of the 7D plenoptic function, which results in a 2D image. We note that RGB cameras provide colour images, which may seem like a sampling over wavelength. Wavelength is measured at 3 (relatively) small intervals that are unevenly spaced and overlap to some degree; however, wavelength is not sampled in the signal processing sense of multiple, regularly-spaced measurements along the spectrum of wavelengths. Thus we consider images from RGB cameras as 2D images. In order to overcome the aforementioned limitations of 2D images, we must consider how to capture multiple views within a single sensor capture. This can be achieved by capturing the light field.

2.5.2 4D Light Field Definition

The light field was first defined by Gershun in 1936 as the amount of light travelling in every direction through every point in space [Gershun, 1936], but was only reduced from the 7D plenoptic function to the more tractable 4D LF as a function of both position and direction in free space by Levoy and Hanrahan [Levoy and Hanrahan, 1996] and Gortler et al. [Gortler et al., 1996] in 1996. Interestingly, the 4D LF was initially developed by the computer graphics community to render new views of a scene given several views of the existing scene, without involving the complexities associated with geometric, lighting and surface models [Levoy and Hanrahan, 1996]. However, the 4D LF has recently proved useful for computer vision and robotics for solving the inverse problem: extracting scene structure given several images of the scene.

In order to reduce the 7D plenoptic function to a 4D light field, we first integrate over small intervals of time and wavelength. This reduces the plenoptic function to 5D. Since the radiance along rays in free space is constant in non-attenuating media, we can further reduce the plenoptic function from 5D to 4D. This means that the light rays do not change their value as they pass through the scene, implying that rays do not pass through objects1 and do not change in their intensity as they pass through the air.

An alternate geometric way of understanding the dimensional reduction from 5D to 4D is to consider a ray defined by a point in 3D space, and a normalized direction. The ray is thus defined by 5 parameters. In free space, the value of the ray does not change as we move the point along the ray’s axis and thus the value of the plenoptic function is the same for many combinations of these 5 parameters. If we fix our point to be on the xy-plane, i.e. set z = 0, then we have 4 independent parameters that describe the ray; thus we have reduced the 7D plenoptic function to a 4D LF representation.

The 4D LF is the smallest sample of the plenoptic function needed to encode multiple views of the scene. Multiple views are of interest because they contain much more information about the scene. As illustrated in Fig. 1.3, the classic pinhole camera model can be considered a tiny peephole through a wall that grants only a single view of the scene from a single viewpoint, while an LF can be thought of as a window through the wall that grants multiple views of the scene as we move behind the window. In relation to conventional cameras, an LF image can be thought of as a set of 2D images of the same scene, taken from a range of 3D positions in space. Typically these 3D positions are constrained to a planar array for simplicity. The LF is valid in non-attenuating media. Novel views can be rendered from the LF. Occlusions are reproduced correctly in the LF, but we cannot render views behind occluding objects.

1Seemingly, this implies that the 4D LF cannot capture the behaviour of refractive objects; however, in the subsequent chapters of this thesis, we will show that this is not the case. We can look at the relative changes between views in the LF to infer the distortion caused by refractive objects.

2.5.3 Light Field Parameterisation

There are many different parameterisations of the LF, but the simplest and most common is the two-plane parameterisation (2PP) [Levoy and Hanrahan, 1996]. With this parameterisation, a ray of light is described by a set of coordinates φ = [s, t, u, v]^T, which are the ray's points of intersection with two parallel reference planes separated by an arbitrary distance D, where the superscript T denotes the vector transpose. The two reference planes are denoted by (s,t) and (u,v). By convention, the (s,t) plane is closest to the camera and the (u,v) plane is closer to the scene [Gu et al., 1997], as shown in Fig. 2.10.

Figure 2.10: The two-plane parameterisation (2PP) of the 4D LF. Shown here is the relative parameterisation, where u,v are defined relative to s,t between the two planes, separated by distance D. From a Lambertian point P, a light ray φ passes through both planes, and can be represented by the four coordinates from the two planes, s, t, u and v.

In the relative parameterisation, u and v are expressed relative to s and t, respectively. In the absolute parameterisation, u and v are expressed in absolute coordinates. We note that all four dimensions are required to define position and direction. It is a matter of convention to discuss which plane defines position or direction. For the purposes of this work, we choose (s,t) as spatial (position) and (u,v) as angular (direction) dimensions, respectively. In this sense, s,t fix a ray’s position and u,v fix its direction. A convenient way to interpret the 2PP is as an array of cameras with parallel optical axes and orthogonal baselines, as illustrated in Fig. 1.3. The camera apertures are on the s,t plane facing the u,v plane. The s,t plane can be thought of as a collection of all the viewpoints available within the LF camera. If the separation distance

D is chosen to be the focal length f of the cameras, then (u,v) correspond to the image plane coordinates of the physical camera sensor.
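The small helper below sketches this interpretation in Python: it converts relative 2PP coordinates (s, t, u, v) into a 3D ray, assuming (purely for illustration) that the s,t plane lies at z = 0 and the u,v plane at z = D. The function name and these geometric conventions are illustrative choices, not taken from the thesis.

```python
import numpy as np

def ray_from_2pp(s, t, u, v, D):
    """Convert relative 2PP coordinates (s, t, u, v) into a 3D ray.

    A minimal sketch, assuming the s,t plane lies at z = 0 and the u,v plane
    at z = D, with u,v measured relative to s,t.
    """
    origin = np.array([s, t, 0.0])           # intersection with the s,t plane
    through = np.array([s + u, t + v, D])    # intersection with the u,v plane
    direction = through - origin
    return origin, direction / np.linalg.norm(direction)

origin, direction = ray_from_2pp(s=0.01, t=-0.02, u=0.003, v=0.001, D=0.05)
print(origin, direction)
```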

Therefore, we can consider the 4D LF as a 2D array of 2D images, as shown in Fig. 2.11. In the literature, these 2D images are sometimes referred to as sub-views, sub-images, or sub-aperture images of the LF. Each view looks at the same scene, but from a slightly shifted viewpoint. The key intuition is that, in comparison to other robotic vision sensors such as monocular cameras, stereo cameras and RGB-D cameras, LFs capture the behaviour of refractive objects across these multiple views much more efficiently, which we can exploit.

There are alternate parameterisations that characterise the LF, such as the spherical-Cartesian parameterisation [Neumann and Fermuller, 2003]. This method describes a ray's position by its point of intersection with a plane and its direction by two angles, which also yields a 4D LF. The advantage of this parameterisation is that it can describe rays passing through the plane in all directions, and it may be well-suited to the design of wide-FOV cameras. Although the 2PP cannot describe rays that are parallel to the reference planes, it is the most common parameterisation because of its simplicity, and it is easily transferable to traditional camera design and robotic vision [Chan, 2014]. A possible solution to this limitation is to use multiple 2PPs, oriented perpendicular to each other.

Figure 2.11: Example 4D LF as a 2D array of 2D images for a refractive sphere amongst a pile of cards. Here, only 3 × 3 images are shown, while the actual light-field is a 17 × 17 array of 2D images. The views are indexed by s and t. Typically, we refer to the view for s0 = 0 and t0 = 0 as the central view of the LF camera. The pixels within each view are indexed by their u,v image coordinates. Therefore, a single light ray emitting from the scene can be indexed by four numbers, s,t,u and v. Light field courtesy of the New Stanford Light Field Archive.

2.5.4 Light-Field Camera Architectures

Light-field cameras capture multiple views of the same scene from slightly different viewpoints in a dense and regularly-sampled manner. The most common LF camera architectures are the light-field gantry, the camera array and the plenoptic camera, shown in Fig. 2.12. CHAPTER 2. BACKGROUND ON LIGHT TRANSPORT & CAPTURE 37

2.5.4.1 Light-Field Camera Gantry

The camera gantry captures the LF using a single camera, moving it to different positions over time. Thus the positions of the camera map to s,t and the image coordinates of each 2D image map to u,v. Although Yamamoto was one of the earliest to consider camera gantries for 3D reconstruction [Yamamoto, 1986], Levoy and Hanrahan were some of the first to consider computer-assisted camera gantries for recording light fields [Levoy and Hanrahan, 1996]. The camera gantry used to help digitize ten statues by Michelangelo is shown in Fig. 2.12a [Levoy et al., 2000]. The camera gantry can offer much finer angular resolution in the LF than camera arrays, because camera positioning is limited only by the mechanical precision of its actuators, while the spatial sampling interval in a camera array is limited by the physical size of the cameras. Additionally, there is only one camera to calibrate. However, there are high precision requirements for camera placement, and in particular, the LF is not captured within a single shot. This usually limits camera gantries to static scenes.

Figure 2.12: Different light-field camera architectures: (a) a camera gantry [Levoy et al., 2000], (b) a camera array [Wilburn et al., 2005], and (c) a lenslet-based camera [Ng et al., 2005]. These architectures all capture 4D LFs.

2.5.4.2 Light-Field Camera Array

The camera array is probably the most easily understood architecture for light-field cameras. The array uses multiple cameras arranged in a grid to capture the LF. The 2D images collected by the array map straightforwardly to a 4D LF, with camera position giving s,t and pixel position within each camera image giving u,v. A typical configuration is to arrange the cameras on a plane with regular spacing. This architecture was first developed by Wilburn et al. [Wilburn et al., 2005], shown in Fig. 2.12b. Camera arrays do not require special optics like plenoptic cameras; however, there are synchronization, bandwidth, calibration and image correction challenges to contend with. The discrete nature of the image capture can also cause aliasing artefacts in the rendered images. Camera arrays have historically been physically large, requiring several discrete sensors, although this also allows for relatively large baselines in comparison to plenoptic cameras.

Additionally, arrays of cameras can be created virtually. A single monocular camera pointed at an array of mirrors has been used to capture LFs [Fuchs et al., 2013, Song et al., 2015]. This LF camera design trades off mass, bandwidth and synchronization issues for a different set of calibration issues and a limited FOV, depending on the design. In Chapter 4, we use an array of planar mirrors to create a virtual array of cameras to collect LFs for visual servoing. The use of an additional array of small lenses to create virtual camera arrays leads the discussion to plenoptic cameras.

2.5.4.3 Lenslet-based LF Camera

The lenslet-based LF camera, sometimes referred to as a plenoptic camera, is a type of light-field camera with an array of micro-lenses, often referred to as lenslets, mounted between the main lens and the image sensor. The lenslets split the image from the main aperture into smaller components based on the incoming direction of the light rays, as shown in Fig. 2.13.

Lippmann first proposed the use of lens arrays to create crude integral photographs in 1908 [Lippmann, 1908]. It was not until 1992 that Adelson and Wang placed the microlenses at the focal plane of the camera's main lens [Adelson and Wang, 1992]. Ng et al. [Ng et al., 2005] designed and commercialized the "standard plenoptic camera" design of the lenslet-based LF camera, making it hand-held and accessible to a large user base.

In the standard plenoptic camera, the main lens focuses the scene onto the lenslet array, and the lenslets are focused at infinity (effectively on the main lens). Fig. 2.13b shows how the angular components of the incoming light rays are divided by the lenslets. Each pixel underneath each lenslet corresponds to part of the image from a particular direction. This arrangement results in a virtual camera array in front of the main lens. For a camera with N × N pixels underneath each lenslet, there are N × N virtual cameras. This yields a series of lenslet images, as in Fig. 2.14, which can be decoded into discrete sub-views to obtain the 4D LF structure previously discussed [Dansereau et al., 2013].
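A highly simplified sketch of this decoding step is shown below. It assumes an idealized monochrome sensor with a square lenslet grid exactly aligned to the pixel grid and an integer pitch of N pixels per lenslet; real decoding pipelines, such as the toolbox of Dansereau et al. [2013], must additionally handle demosaicing, lenslet rotation, hexagonal packing and calibration.

```python
import numpy as np

def decode_lenslet_image(raw, N):
    """Rearrange an idealized raw lenslet image into an N x N grid of sub-views.

    A minimal sketch only: assumes a monochrome sensor, a square lenslet grid
    aligned with the pixel grid, and exactly N pixels per lenslet.
    """
    H, W = raw.shape
    H, W = (H // N) * N, (W // N) * N              # crop to a whole number of lenslets
    raw = raw[:H, :W]
    # Axes after reshape: (lenslet row, pixel-in-lenslet row, lenslet col, pixel-in-lenslet col).
    lf = raw.reshape(H // N, N, W // N, N)
    # Reorder to (view row, view col, pixel row, pixel col): an N x N grid of sub-views.
    return lf.transpose(1, 3, 0, 2)

raw = np.random.rand(130, 130)                     # stand-in for a raw plenoptic image
lf = decode_lenslet_image(raw, N=13)
print(lf.shape)                                    # (13, 13, 10, 10)
```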

One of the drawbacks of the standard plenoptic camera is that the final resolution of each decoded sub-view is limited by the number of lenslets. Georgiev and Lumsdaine developed the focused plenoptic camera, also known as the plenoptic camera 2.0, which also places the lenslets behind the main lens, but the main lens focuses the scene inside the camera before the light reaches the lenslets [Lumsdaine and Georgiev, 2008]. The focused plenoptic camera forms a focused sub-image on the sensor, allowing for higher spatial resolution at the cost of angular resolution. Equivalently, there are fewer s,t samples in exchange for more u,v samples. Although the lower angular resolution can produce undesirable aliasing artefacts, the key contribution of this camera design was to decouple the trade-off between the number of lenslets and the achievable resolution [Lumsdaine and Georgiev, 2009]. Commercial cameras that utilize the plenoptic camera 2.0 design include the Raytrix [Perwass and Wietzke, 2012].


Figure 2.13: (a) For a conventional monocular camera, light rays from a point source are integrated over all the directions that pass through the main aperture into a single pixel value, such that the pixel's value depends only on pixel position. (b) For a lenslet-based LF camera, a microlens array is placed in front of the sensor, such that pixel values depend on pixel position as well as incoming ray angle. Decoded sub-images equivalent to those of an LF camera array can be obtained by combining pixels from similar ray directions behind each microlens (or lenslet).

2.5.4.4 Light-Field Cameras vs Stereo & Multi-Camera Systems

The difference between LF cameras and general multi-camera or stereo systems is not apparent at first glance. Stereo systems, multi-camera systems and LF camera arrays all typically use multiple cameras to capture multiple views of the scene. The main difference is the level of sampling of the plenoptic function. Stereo samples the plenoptic function only twice, along one direction with a fixed baseline. This means stereo can only measure motion parallax (and thus depth) along the direction of its fixed baseline. Multi-camera systems and LF cameras sample the plenoptic function from multiple viewpoints, and so have both short and long baselines, as well as baselines in multiple directions (typically vertical and horizontal). This yields depth measurements with more redundancy and thus more reliability. However, the density and uniformity of sampling the plenoptic function matters. Multi-camera systems are not limited to physical camera configurations where each camera is aimed at the same scene in a regular and tightly-spaced manner.

On the other hand, LF cameras sample densely and uniformly. This simplifies the processing in the same way that uniformly sampled signals are easier to process than non-uniformly sampled signals. For example, consider 2D imaging devices: non-uniform 2D imaging devices are extremely rare. A few designs have been proposed, such as the foveated vision sensor, where the pixel density is varied similarly to the non-uniform distribution of cones in the human eye [Yeasin and Sharma, 2005]. However, such designs are not common in industrial applications or the consumer marketplace. The dominant 2D imaging devices use a rectangular, uniform distribution of pixels, which is much simpler to manufacture and process algorithmically. Therefore, LF cameras can be considered a specific class of multi-camera systems that exploit the camera geometry to simplify the image processing.

Figure 2.14: (a) A raw plenoptic image of a climber's helmet captured using a Lytro Illum. This cropped section consists of roughly 100×100 lenslets. (b) Zoomed in on the raw plenoptic image, each lenslet is visible. Each pixel in the lenslet image (roughly 13×13 pixels) corresponds to the directional component of a measured light ray. (c) A decoded 100×100 pixel sub-image from the light field: the central view of the 4D light field, comprised roughly of the central pixel from each lenslet image across the entire raw image. There are 13×13 decoded sub-images in this 4D light field.

In particular, the dense and regular sampling of LF cameras motivates their use for visual servoing and dealing with refractive objects. As we will show in Ch. 4, LF cameras can be used for visual servoing towards small and distant targets in Lambertian scenes and enable better performance in occluded scenes. Later, in Ch. 5, we show that capturing these slightly different views is sufficient to differentiate changes in texture from camera motion versus distortion from refractive objects. Finally, in Ch. 6, we show that LF cameras can be used to servo towards refractive objects.

2.6 4D Light-Field Visualization

Visualizing the data is an important part of understanding the problem. While visualizing 2D and 3D data has become common in modern robotics research, visualizing 4D data is significantly less intuitive. In order to examine the characteristics of the 4D LF, the conventional approach is to slice the LF into 2D images. For example, a u,v slice of the LF fixes s and t to depict the LF as u varies with respect to v. Recalling the 2PP in Fig. 2.10, it is clear that this 2D slice is analogous to viewpoint selection and corresponds to what is captured by a single camera in a camera array. Nine different examples of u,v slices depicting the 4D LF as a 3 × 3 grid of 2D images are shown in Fig. 2.15 for different values of s and t, although the actual LF is comprised of 17 × 17 2D images.

Further insight can be gained from the LF by considering different pairings of dimensions from the LF. Consider the horizontal s,u slice, shown at the top of Fig. 2.15. This 2D image is taken by stacking the rows of image pixels (all the u) from the highlighted yellow, red and green lines (all the s), while holding t and v constant. Similarly, the vertical t,v slice is taken by stacking all of the columns of pixels (all the v) from the highlighted turquoise, blue and purple lines (all the t), while holding s and u constant, shown on the right side of Fig. 2.15.

Visualizing slices using this stacking approach is only meaningful due to the uniform and dense sampling of the LF. This method was employed by Bolles et al. [Bolles et al., 1987] for a single monocular camera translating linearly and capturing images with a uniform and dense sampling. Their volume of light was 3D and they referred to the 2D slices of light as epipolar planar images (EPIs). They were able to simplify the image feature correspondence problem from performing multiple normalized cross-correlation searches across each image to simply finding lines in the EPIs. Furthermore, for Lambertian scenes, these lines are characteristic straight lines whose slopes, as discussed in Section 2.7, reflect depth in the scene. However, as we will show in Ch. 5, these lines can be distorted into nonlinear curves by refractive objects, which can be exploited for refractive object detection.
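Because the LF is sampled on a regular 4D grid, these slices reduce to simple array indexing. The numpy sketch below assumes (as an illustration only) that the light field is stored as an array indexed L[t, s, v, u], mirroring the 17 × 17 grid of views used in Fig. 2.15 with made-up image dimensions.

```python
import numpy as np

# A 4D light field stored as an array indexed L[t, s, v, u]:
# (view row, view column, pixel row, pixel column). Random data stands in for real images.
L = np.random.rand(17, 17, 256, 256)

t0, v0 = 8, 128
epi_su = L[t0, :, v0, :]    # s,u slice (horizontal EPI): shape (17, 256)

s0, u0 = 8, 128
epi_tv = L[:, s0, :, u0]    # t,v slice (vertical EPI): shape (17, 256)

central_view = L[8, 8]      # a u,v slice: the central 2D sub-view
print(epi_su.shape, epi_tv.shape, central_view.shape)
```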

Figure 2.15: Visualizing subsets of the 4D LF as 2D slices. The u,v slice can be seen as a conventional image from a camera positioned on the s,t plane. In this figure, there are 3 × 3 u,v slices for different s,t depicting the 4D LF; the full LF is a 17 × 17 grid of images. The s,u slice, illustrated by stacking the yellow/red/pink rows of image pixels of u with respect to s, and the t,v slice, depicted by stacking the turquoise/blue/purple columns of image pixels of v with respect to t, are sometimes referred to as EPIs. For Lambertian scenes, EPIs show characteristic straight lines with slopes that reflect depth. However, these lines can be distorted into nonlinear curves by refractive objects, such as the refractive sphere in the centre of these images. LF courtesy of the New Stanford Light-Field Archive.

2.7 4D Light-Field Geometry

In this section, we discuss the geometry of the LF. We start by defining geometric primitives in 2D and follow their extensions to 3D and 4D. We then go into detail with the 2PP of the LF and discuss the point-plane correspondence and the concept of slope and depth in the LF. We show that a ray in 3D intersects the two planes of parameterisation twice, defined by two pairs of image coordinates in s,t,u and v, and subsequently that a Lambertian point in 3D induces a plane in the 4D LF. This theory serves as the basis for understanding the properties of the 4D LF, which we exploit throughout this thesis for the purposes of visual servoing and discriminating against refractive objects.

2.7.1 Geometric Primitive Definitions

First, we provide the definitions of several typical geometric primitives, including a point, a line, a plane and a hyperplane. The definitions of dimensions and manifolds are also included for clarity.

• Dimension: The definition of dimension, or dimensionality, varies somewhat across mathematics. The dimension of an object is often thought of as the minimum number of coordinates needed to specify any point within the object. More formally, dimension is defined in linear algebra as the cardinal number of a maximal linearly independent subset for a vector space over a field, i.e. the number of vectors in its basis.

• Degree(s) of Freedom (DOF): The number of degrees of freedom in a problem is the number of parameters that may be independently varied. Informally, degrees of freedom are independent ways of moving, while dimensions are independent extents of space. Thus, a rigid three-dimensional object has zero DOF if it is not allowed to change its pose, six DOF if it is allowed to translate and rotate, or any combination of translation and rotation in between.

• Point: A point is a 0-DOF object that can be specified in n dimensions as an n-tuple of coordinates. For example, a 2D point is defined as (x,y), a 3D point as (x,y,z), and a 4D point as (x,y,z,w), which can be described by a minimum of 2, 3 and 4 parameters, respectively. Points are synonymous with coordinate vectors. Basic structures of geometry (e.g. lines, planes, etc.) are built from an infinite number of points in a particular arrangement. One might go as far as to say life without geometry is pointless.

• Manifold: A manifold is a topological space that is locally Euclidean, in that around every point there is a neighbourhood that is topologically equivalent to an open region of Euclidean space.

• Line: A line is a 1-DOF object that has no thickness and extends uniformly and infinitely in both directions. A line is a specific case of a 1D manifold. Informally, a line extends in both directions with no wiggles.

• Plane: A plane is a 2-DOF object that is spanned by two linearly independent vectors. A plane is a specific case of a 2D manifold.

• Hyperplane: In an n-dimensional space, a hyperplane is any vector subspace that has

n − 1 dimensions [Weisstein, 2017]. For example, in 1D, a hyperplane is a point. In 2D, a hyperplane is a line. In 3D, a hyperplane is a plane. In 4D, the hyperplane has 3 DOF and the standard form of a hyperplane is given as

ax + by + cz + dw + e =0. (2.10)

In n dimensions, for a space X = [x_1, x_2, ..., x_n], x_i ∈ R, let a_1, a_2, ..., a_n be scalars not all equal to 0. Then the hyperplane in R^n is given as

\[
    a_1 x_1 + a_2 x_2 + \ldots + a_n x_n = c, \qquad (2.11)
\]

where c is a constant. There are n + 1 parameters, but we can divide through by one of the non-zero coefficients, leaving a minimum of n parameters to describe the hyperplane in nD.

2.7.2 From 2D to 4D

In this section, we describe the geometry of primitives in increasing dimension and discuss the minimum number of parameters each primitive can be described by in the different dimensions. The parameters and their primitives are categorized in Table 2.1. In the rest of the section, we explain why each primitive requires a certain number of parameters to be described. Note that the minimum number of parameters to fully describe a primitive is different from its DOF. For example, a point has 0 DOF. In 2D, a point requires a minimum of two parameters, but in 4D, a point requires four parameters to describe. We also discuss the equations used to describe these geometric primitives.

Table 2.1: Minimum Number of Parameters to Describe Geometric Primitives from 2D to 4D

Primitive      2D   3D   4D
Point           2    3    4
Line            2    4    6
Plane           –    3    6
Hyperplane      2    3    4

2.7.2.1 2 Dimensions

A Point in 2D In 2D, the space is defined by x and y. A 2D point is defined with two equations

x = a, y = b, (2.12) where a, b ∈ R. Thus a point in 2D requires a minimum of two parameters to be fully described.

A Line in 2D A line in 2D has the standard form of

ax + by + c =0, (2.13) where a, b, c ∈ R are three parameters. We can re-write (2.13) as

\[
    \frac{a}{c}x + \frac{b}{c}y = -1, \qquad (2.14)
\]
which has two free parameters if we consider a/c and b/c to be two parameters. Thus a line in 2D requires a minimum of two parameters.

Intersection of 2D Hyperplanes We note that a line in 2D is a hyperplane. Consider the two lines,

ax + by + c =0, (2.15) and

dx + ey + f =0, (2.16) where a, b, c, d, e, f ∈ R. Thus, we can describe a 2D point by the intersection of 2 lines.

\[
    \begin{bmatrix} a & b \\ d & e \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} -c \\ -f \end{bmatrix}. \qquad (2.17)
\]

This is a 2 × 2 system of equations for the 2D intersection of two 2D hyperplanes. Assuming that these two lines are neither collinear nor parallel, we can solve this system of equations to yield a 2D point. We will refer back to this observation as we journey through the intersection of three 3D hyperplanes, and four 4D hyperplanes.
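As a small numerical illustration, the snippet below solves the 2 × 2 system in (2.17) with numpy for two example lines; np.linalg.solve raises an error if the lines are parallel (singular matrix).

```python
import numpy as np

# Intersect the 2D hyperplanes (lines) ax + by + c = 0 and dx + ey + f = 0,
# as in (2.15)-(2.17). The coefficients are arbitrary example values.
a, b, c = 1.0, -2.0, 3.0
d, e, f = 2.0, 1.0, -4.0

A = np.array([[a, b],
              [d, e]])
rhs = np.array([-c, -f])
point = np.linalg.solve(A, rhs)   # the 2D intersection point
print(point)                      # [1. 2.]
```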

2.7.2.2 3 Dimensions

A point in 3D A point in 3D is defined by three equations as

x = a, y = b, z = c, (2.18) where a, b, c ∈ R. Thus a point in 3D requires a minimum of three parameters to be completely described.

A Line in 3D As illustrated in Fig. 2.16, a line in 3D can be described by two points p1 and p2

x = p1 +(p2 − p1)k, (2.19)

where x = [x, y, z] ∈ R³, p_1, p_2 ∈ R³ and k ∈ R. With p_1 and p_2, we have six parameters to describe the line; however, these parameters are not independent. Since the line is one dimensional, we can imagine sliding either p_1 or p_2 along the line and retaining the line's definition. There are infinitely many pairs of 3D points along the line that can describe the line. Thus for each point, we can hold one of its three coordinates constant and describe the same line without any loss of generality. Therefore, a line in 3D can be described by a minimum of four parameters.

We can also describe a line in 3D as a point p1 and a direction (vector) r. In this case, a similar argument holds: both the point and direction can be reduced to two parameters each, yielding a total of four parameters.

Plücker coordinates have also been used to specify a line in 3D. Two points on the line can specify the direction (vector) of the line d. Another vector describes the direction to a point on the line from the origin p. The cross product between these two vectors is independent of the chosen point, and uniquely defines the line. Plücker coordinates are defined as the line's direction vector d together with the cross product, given by

(d; p × d), (2.20) where d is normalised to unit length, p × d is the cross product, often known as the ‘moment’, and p is an arbitrary point on the line. We can then select this point relative to a constant without any loss of generality. Thus, even with Plücker coordinates, four parameters are required to describe a line in 3D.
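The snippet below illustrates the Plücker representation (2.20) for an example line and verifies that the moment p × d is independent of which point on the line is chosen; the specific points are arbitrary.

```python
import numpy as np

# Plücker coordinates (d; p x d) for a line through two example 3D points.
p1 = np.array([1.0, 0.0, 2.0])
p2 = np.array([2.0, 1.0, 3.0])

d = (p2 - p1) / np.linalg.norm(p2 - p1)   # unit direction of the line
m = np.cross(p1, d)                       # moment p x d, independent of the chosen point

# Sliding the point along the line leaves the moment unchanged:
p3 = p1 + 0.7 * d
print(np.allclose(np.cross(p3, d), m))    # True
```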

Figure 2.16: Describing a line in 3D with (a) two points, (b) a point and a vector, and (c) the intersection of two planes.

A Plane in 3D The standard form for a plane in 3D is

ax + by + cz + d = 0, (2.21) where a, b, c, d ∈ R. Similar to a line in 2D (2.14), we can describe the plane in 3D with a minimum of three parameters. The direction of a plane's normal can be described by two parameters and the plane must intersect some axis at some scalar distance from the origin; therefore, we have three parameters.

Intersection of 3D Hyperplanes We also note that a hyperplane in 3D is a 2D plane. As with the 2D case, consider the two hyperplanes in 3D

ax + by + cz + d =0, (2.22) and

ex + fy + gz + h = 0, (2.23) where a, b, c, d, e, f, g, h ∈ R. We can then describe the intersection of these two hyperplanes in 3D as a 2 × 3 system of equations

\[
    \begin{bmatrix} a & b & c \\ e & f & g \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} -d \\ -h \end{bmatrix}. \qquad (2.24)
\]

We can row-reduce this system of equations to

\[
    \begin{bmatrix} 1 & 0 & \frac{c}{a} - \frac{b}{a}\frac{ga - ce}{fa - be} \\[4pt] 0 & 1 & \frac{ga - ce}{fa - be} \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} -\frac{d}{a} - \frac{b}{a}\frac{de - ha}{fa - be} \\[4pt] \frac{de - ha}{fa - be} \end{bmatrix}. \qquad (2.25)
\]

If we let

\[
    \alpha = \frac{c}{a} - \frac{b}{a}\frac{ga - ce}{fa - be}, \qquad (2.26)
\]
\[
    \beta = -\frac{d}{a} - \frac{b}{a}\frac{de - ha}{fa - be}, \qquad (2.27)
\]
\[
    \gamma = \frac{ga - ce}{fa - be}, \qquad (2.28)
\]
\[
    \eta = \frac{de - ha}{fa - be}, \qquad (2.29)
\]

then we can rewrite (2.25) as

\[
    \begin{bmatrix} 1 & 0 & \alpha \\ 0 & 1 & \gamma \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} \beta \\ \eta \end{bmatrix}. \qquad (2.30)
\]

Clearly, the intersection of two non-coplanar, non-parallel planes in 3D describes a line in 3D, which depends on a minimum of four parameters. Fig. 2.16c shows the intersection of two such

planes in 3D, Π_1 and Π_2 ∈ R³, forming a line in 3D.

Another way to consider the minimum number of parameters to describe a line in 3D is that from (2.21), each plane can be described by three parameters. If we consider the intersection of Π1 and Π2, we have two equations (six parameters total); however, each equation constrains the system by one. Thus, six minus two yields four parameters that are required to describe a line in 3D.

Additionally, similar to the 2D case in Section 2.7.2.1, we can also describe the intersection of three 3D hyperplanes as a 3 × 3 system of equations, which intersect at a 3D point.

\[
    \begin{bmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \\ a_3 & b_3 & c_3 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \\ d_3 \end{bmatrix}, \qquad (2.31)
\]
where the three hyperplanes are indexed by their subscripts. In other words, three planes that have unique normals intersect at a point in 3D. In Section 2.7.2.3, we will show that the intersection of two hyperplanes in 4D describes a plane in 4D, and that the intersection of four hyperplanes in 4D forms a 4D point.

2.7.2.3 4 Dimensions

The journey to the fourth dimension can be intimidating and shrouded in mystery due to the limitations of our perception and understanding. If we can only perceive the world in 3 spatial dimensions, how can we understand a fourth spatial dimension? In physics and mathematics,

4D geometry is often discussed in terms of a fourth spatial dimension. Consider that the axes x, y and z form a basis for R³. The fourth spatial dimension along w in 4D is orthogonal to the other 3 axes. Such discussions lead to speculation about the limitations of human perception and the 4D equivalent of a cube, known as a tesseract [Hinton, 1884]. Fortunately, in this thesis we are not concerned with a 4th spatial dimension, but rather 4 dimensions with respect to the sampling of light, as per the plenoptic function in Section 2.5.1. Our 4 dimensions in the LF, s, t, u and v, differ slightly from the four spatial dimensions x, y, z and w, in that s, t, u and v are constrained via the 2PP. However, much of the geometry from 4 spatial dimensions carries over to dealing with the 4D LF. In this section, we illustrate the geometric primitives in 4D with respect to the 2PP for light fields. We illustrate these primitives as projections on a grid of 2D images, similar to how they would appear in the LF.

A Point in 4D A point in 4D can be described by four equations as

x = a, y = b, z = c, w = d, (2.32) where a, b, c, d ∈ R. A point in 4D requires a minimum of four parameters to be completely described. Examples of two 4D points are shown in Fig. 2.17. In the 2PP, a point in 4D describes a ray in 3D. However, not all rays can be represented by the 2PP, because the 2PP cannot describe rays that are parallel to the two planes.

Figure 2.17: The projection of two different points in 4D using the 2PP, shown in red. The 2PP is illustrated as a grid of squares. Each square is considered to be a view. Each view has its own set of coordinates that describe a location within the view. Both (a) and (b) show 4D points, defined for specific values of s, t, u and v. Note that the s and t values correspond to which view, while u and v correspond to a specific view's coordinates (similar to image coordinates). In our case, a single 4D point must be defined by its view and its coordinates within the view; hence, a minimum of 4 parameters is needed to describe a 4D point.

A Line in 4D Similar to the 3D case, a line in 4D can be written as a function of two 4D points, which require eight parameters in total. There are an infinite number of pairs of 4D points that can describe the 4D line. Each of these points can be "fixed" in the same manner as a line in 3D (Section 2.7.2.2), reducing the minimum number of parameters to six to describe a line in 4D. Several examples of 4D lines are shown in Fig. 2.18.

Figure 2.18: The projection of four different lines in 4D using the 2PP. A 4D line still has one DOF. (a) t, u and v are held constant, while s is allowed to vary. (b) s, u, and v are held constant while t is allowed to vary. (c) s, t and u are held constant, while v is allowed to vary. (d) s and t are held constant, while u and v vary linearly.

A Hyperplane in 4D A hyperplane in 4D is given as

ax + by + cz + dw + e = 0, (2.33) where a, b, c, d, e ∈ R. In the 2PP, this can be written as as + bt + cu + dv + e = 0. Similar to the 3D case, this equation can be divided through by e, yielding a minimum of four parameters to describe the hyperplane in 4D. Alternatively, the hyperplane in 4D can be described by its normal n = [a, b, c, d] and a distance to the origin. Four examples of 4D hyperplanes are shown in Fig. 2.19.

Figure 2.19: The projection of four different hyperplanes in 4D using the 2PP. A hyperplane is only constrained along 1 dimension. (a) The hyperplane is constrained along u, such that cu + e = 0. (b) The hyperplane is constrained along v, such that dv + e = 0. (c) The hyperplane is constrained along u and v through a linear relation, such that cu + dv + e = 0. (d) The hyperplane is constrained along s, such that as + e = 0.

A Plane in 4D & Intersection of 4D Hyperplanes Analogous to a line in 3D, which can be represented by the intersection of two 3D planes, a plane in 4D can be represented by the intersection of two hyperplanes in 4D that have unique normals. With unique normals, the hyperplanes are not parallel and one hyperplane is not entirely contained within the other. Mathematically, let us assume we have two 4D hyperplanes, given as

ax + by + cz + dw + e =0, (2.34) and

fx + gy + hz + iw + j = 0, (2.35) where a, b, c, d, e, f, g, h, i, j ∈ R. From (2.34), we can isolate w as

\[
    w = \frac{-1}{d}(e + ax + by + cz), \qquad (2.36)
\]
and substitute this expression into (2.35) as

\[
    fx + gy + hz + i\left(\frac{-1}{d}\right)(e + ax + by + cz) + j = 0, \qquad (2.37)
\]
which can be simplified to

\[
    \left(f - \frac{ia}{d}\right)x + \left(g - \frac{ib}{d}\right)y + \left(h - \frac{ic}{d}\right)z + \left(j - \frac{ie}{d}\right) = 0. \qquad (2.38)
\]

This equation matches the standard form of a plane in 3D, given in (2.21). From this, it is clear that the intersection of two hyperplanes in 4D forms a plane in 4D. Each hyperplane can be described using four parameters, for a total of eight. In Section 5.2, we show equivalently that the intersection of two 4D hyperplanes can be described with the two equations in (5.5).

We note that (2.38) appears to imply that we can describe a plane in 4D with just four numbers; however, each coefficient in (2.38) is itself an expression of the original hyperplane parameters, subject to two constraints in total: d ≠ 0, and at least two of the coefficients in front of x, y or z must be non-zero. We can further illustrate this relation by drawing two hyperplanes in the 2PP, as in Fig. 2.20. Two different hyperplanes are pictured in green and purple. Their intersection, highlighted in red, represents the plane in the 4D LF.

Additionally, we can show that four hyperplanes in 4D intersect at a point.

Figure 2.20: The projection of two different planes in 4D using the 2PP. In both (a) and (b), the two hyperplanes are shown in green and purple. Their intersection, shown in red, represents the plane in the 4D LF.

In 2D, the intersection of two 2D hyperplanes resulted in a 2 × 2 system of equations, which could be solved for a 2D point. In 3D, the intersection of three 3D hyperplanes resulted in a 3 × 3 system of equations, which could be solved for a 3D point. In 4D, the intersection of four 4D hyperplanes results in a 4 × 4 system of equations, which can be solved for a 4D point,

\[
    \begin{bmatrix} a_1 & b_1 & c_1 & d_1 \\ a_2 & b_2 & c_2 & d_2 \\ a_3 & b_3 & c_3 & d_3 \\ a_4 & b_4 & c_4 & d_4 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix} = \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \end{bmatrix}, \qquad (2.39)
\]
where the four hyperplanes are represented by their subscripts.

2.7.3 Point-Plane Correspondence

A particularly relevant question for robots striving to interact in a 3D world is: how do observations in the LF translate to the 3D world? In this section, we will further discuss the intersections of hyperplanes in 4D to show that a point in 3D manifests itself as a plane in the 4D LF. This manifestation was coined the point-plane correspondence in [Dansereau and Bruton, 2007], although a similar relationship was determined for translating monocular cameras in [Bolles et al., 1987].

Recall the relative two-plane parameterisation (2PP) [Levoy and Hanrahan, 1996]. A ray with coordinates φ = [s, t, u, v] is described by two points of intersection with two parallel reference planes. An s,t plane is conventionally closest to the camera, and a u,v plane is conventionally closer to the scene, separated by arbitrary distance D. The rays emitting from a Lambertian

point in 3D space, P = [P_x, P_y, P_z]^T, can be illustrated in the xz-plane, shown in Fig. 2.21. The same ray can be shown in the su-plane in Fig. 2.22.

For the xz-plane, if we define θ as the angle between the intersecting ray and the z-axis direction, then by similar triangles, we have

\tan\theta = \frac{P_x - u - s}{P_z - D} = \frac{P_x - s}{P_z}.    (2.40)

Then

u = P_x - \left( \frac{P_x - s}{P_z}(P_z - D) + s \right),

which simplifies to

u = \frac{D}{P_z}(P_x - s).    (2.41)

We can also plot (2.41) to yield projections in the 2PP similar to Fig. 2.19a. Plotting (2.42) yields projections similar to Fig. 2.19b.

We can follow a similar procedure for the yz-plane, resulting in

v = \frac{D}{P_z}(P_y - t).    (2.42)

We can combine (2.41) and (2.42) into a single equation as

\begin{bmatrix} u \\ v \end{bmatrix} = \frac{D}{P_z}\begin{bmatrix} P_x - s \\ P_y - t \end{bmatrix}.    (2.43)

We can recognize (2.43) as two hyperplanes in 4D that intersect to describe a plane in 4D, as well as a point in 3D. Therefore, light rays from a Lambertian point in 3D manifest as a plane in the 4D LF.

Figure 2.21: Light-field geometry for a point in space for a single view (black), and other views (grey), whereby u is defined relative to s and varies linearly with s for all rays originating from P (Px,Pz).

We can re-write (2.43) into the form,

\begin{bmatrix} \frac{D}{P_z} & 0 & 1 & 0 \\ 0 & \frac{D}{P_z} & 0 & 1 \end{bmatrix} \begin{bmatrix} s \\ t \\ u \\ v \end{bmatrix} = \begin{bmatrix} \frac{DP_x}{P_z} \\ \frac{DP_y}{P_z} \end{bmatrix}.    (2.44)

From (2.44), we note that the hyperplane normals only depend on Pz, and not Px or Py. The normals of the two hyperplanes for a Lambertian point are similar in that their elements have the same values, but appear in different columns (such that the two normals are still linearly independent) in su and tv, respectively. Equation (2.43), and thus (2.44), maps out the ray space (all rays) emitting from the point P.
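To make the point-plane correspondence concrete, the following minimal sketch (assuming Python with NumPy; the point coordinates and plane separation are arbitrary illustrative values) samples rays from a single Lambertian point using (2.43) and checks that every sampled ray satisfies the two hyperplane equations of (2.44).

# A minimal sketch (not thesis code): rays from a Lambertian point P under the
# relative 2PP all satisfy the two hyperplane equations of (2.44), i.e. they
# span a plane in the 4D LF.
import numpy as np

D = 1.0                          # separation between the s,t and u,v planes (assumed)
P = np.array([0.3, -0.2, 5.0])   # Lambertian point [Px, Py, Pz] (assumed)

# Sample viewpoints (s,t) on the camera plane and project via (2.43).
s, t = np.meshgrid(np.linspace(-1, 1, 5), np.linspace(-1, 1, 5))
u = D / P[2] * (P[0] - s)
v = D / P[2] * (P[1] - t)
rays = np.stack([s.ravel(), t.ravel(), u.ravel(), v.ravel()], axis=1)

# The two hyperplane normals of (2.44); note they depend only on Pz.
H = np.array([[D / P[2], 0.0, 1.0, 0.0],
              [0.0, D / P[2], 0.0, 1.0]])
rhs = np.array([D * P[0] / P[2], D * P[1] / P[2]])

residual = rays @ H.T - rhs      # ~0 for every sampled ray
print(np.abs(residual).max())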

2.7.4 Light-Field Slope

In 2D, a line's direction and steepness, i.e. its rate of change of one coordinate with respect to the other coordinate, is referred to as the slope. In the 4D LF, if we consider two different measurements from a Lambertian point P (Px, Py, Pz) as (s1, t1, u1, v1) and (s2, t2, u2, v2), the difference between these two measurements for the xz-plane can be written as

u_2 - u_1 = \frac{D}{P_z}(P_x - s_2) - \frac{D}{P_z}(P_x - s_1),    (2.45)

which simplifies to

u_2 - u_1 = -\frac{D}{P_z}(s_2 - s_1).    (2.46)

We then refer to the rate of change of u with respect to s as the slope w, which is often visualized in a 2D EPI slice of the LF, as in Fig. 2.15, and is given as

w = \frac{u_2 - u_1}{s_2 - s_1} = -\frac{D}{P_z}.    (2.47)

We note that a similar procedure follows for the yz-plane and yields an identical expression,

w = \frac{v_2 - v_1}{t_2 - t_1} = -\frac{D}{P_z}.    (2.48)

The slope w relates the image plane coordinates for all rays emitting from a particular 3D point in the scene. Fig. 2.21 shows the geometry of the LF for a single view of P. As the viewpoint changes, that is, as s and t change, the image plane coordinates vary linearly according to (2.43). In Fig. 2.22, we show how u varies as a function of s, noting that v varies as a similar function of t. The slope of this line, w, comes directly from (2.43), and is given by

w = -\frac{D}{P_z}.    (2.49)

By working with slope, akin to disparity from stereo algorithms, we deal more closely with the structure of the light field.
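As a quick numerical check of (2.47)-(2.49) (a minimal sketch in Python; the values of D, Pz and Px are arbitrary assumptions), the slope measured from any two samples of the same Lambertian point recovers its depth as Pz = -D/w.

# A minimal sketch (not from the thesis) of the slope-depth relation (2.47)-(2.49):
# two light-field samples of the same Lambertian point give the EPI slope
# w = (u2 - u1)/(s2 - s1) = -D/Pz, from which depth is recovered as Pz = -D/w.
D, Pz, Px = 1.0, 5.0, 0.3        # illustrative plane separation, depth and x-coordinate

def u_of_s(s):
    # Relative 2PP projection of the point for viewpoint s, as in (2.41).
    return D / Pz * (Px - s)

s1, s2 = -0.5, 0.5
w = (u_of_s(s2) - u_of_s(s1)) / (s2 - s1)   # EPI slope
Pz_recovered = -D / w                        # depth from slope

print(w, Pz_recovered)                       # -0.2 and 5.0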

In this section, we explored geometric primitives such as points, lines, planes and hyperplanes from 2D to 4D. We explained the number of parameters that are required to describe the primitives. By describing the underlying mathematics behind these geometric primitives, we gain insight into how the light rays emitted from a Lambertian point are represented in the 2PP of the LF. We showed that a light ray emitted from a Lambertian point can be described as a 4D point in the LF. A Lambertian point induces a plane in the 4D LF, and a plane in the 4D LF can be described by the intersection of two 4D hyperplanes. In future chapters, we will use these relations to propose light-field features for visual servoing, to detect refracted image features using an LF camera, and to servo towards refractive objects.

Figure 2.22: For the situation illustrated in Fig. 2.21, the corresponding line in the s,u plane has a slope w.

Chapter 3

Literature Review

In this chapter, we provide a review of the literature relevant to this thesis. First, we introduce image features, from 2D to 4D. Second, we review visual servoing in the context of LF cameras and refractive objects. Third, we investigate the state of the art for how refractive objects are handled in robotics. Finally, we summarize the review by identifying the research gaps that this thesis seeks to address.

3.1 Image Features

Features are distinct aspects of the scene that can be reliably and repeatedly identified from different viewpoints and/or across different viewing conditions. Image features are those features recorded in the image as a set of pixels by the camera that can then be automatically detected and extracted as a vector of numbers, which is referred to as an image feature vector. Image feature vectors abstract raw and dense image information into a simpler, smaller and more compact representation of the relevant data. Much of the literature does not make a significant distinction between these three concepts. Good image features to track are those that can repeatedly be detected and matched across multiple images [Shi and Tomasi, 1993]. There are typically

two aspects to finding an image feature vector: an image feature detector and an image feature descriptor. For brevity, we refer to these as a detector and descriptor, respectively. The detector is a method of determining whether there is a suitable image feature at a given image location. The detected feature is usually represented by a pair of image coordinates, a set of curves, a connected region, or an area [Corke, 2013]. The descriptor is a method of describing the image feature's neighbourhood. The descriptor typically takes the form of a vector used for correspondence. In this section, we review geometric image features, as well as photometric image features in the context of refractive objects and light fields from 2D to 4D. We then briefly discuss image feature correspondence and why refractive objects are particularly challenging for image feature correspondence.

3.1.1 2D Geometric Image Features

Traditionally, the most common image features are geometric image features that represent a 2D or 3D geometric shape in a 2D image. Most robotic vision methods use 2D geometric image features, such as regions and lines [Andreff et al., 2002], line segments [Bista et al., 2016], moments (such as the image area, the coordinates of the centre of mass and the orientation of an image feature) [Mahony et al., 2002, Tahri and Chaumette, 2003, Chaumette, 2004], and interest points (sometimes referred to as keypoints) [Chaumette and Hutchinson, 2006, McFadyen et al., 2017]. For image points, Cartesian coordinates are normally used, though polar and cylindrical coordinates have also been developed [Iwatsuki and Okiyama, 2005]. Interest points are better suited to handle large changes in appearance, which may be caused by refractive objects. One of the earliest and most popular interest point detectors is the Harris corner detector [Harris and Stephens, 1988]; however, Harris corners do not distinguish interest points of different scale—they operate at a single scale, determined by the internal parameters of the detector. In the context of wide baseline matching and object recognition, there is an interest in features that can cope with scale and viewpoint changes. Harris corners are computationally cheap, but do not provide accurate feature matches across different scales and viewpoints [Tuytelaars et al., 2008, Le et al., 2011].

To achieve scale invariance, a straightforward approach is to extract points over a range of scales and use all of these points together to represent the image, giving rise to multi-scaled features. Of particular note, Lowe developed the scale invariant feature transform (SIFT) feature detector based on finding the extrema of a multi-scaled pyramid of the Difference of Gaussian (DoG) responses [Lowe, 2004]. Bay et al. further reduced the computational cost of SIFT features by considering the Hessian of Gaussians and other numerically-efficient approximations to create the speeded-up robust features (SURF) detector [Bay et al., 2008].

SIFT and SURF features also include descriptors that are based on using histograms. These histograms describe the distribution of gradients and orientations of the feature's support region in the image for illumination and rotational invariance. Dalal et al. developed the more advanced histogram of gradients (HoG) feature descriptor, which uses normalized weights based on nearby image gradients for each sub-region, making HoG descriptors less sensitive to changes in contrast than SIFT and SURF, and better at matching in cluttered scenes [Dalal and Triggs, 2005]. While Lowe's SIFT descriptor was limited to a single scale, Dong et al. recently improved the SIFT descriptors by pooling (combining) the gradient histograms over all the sampled scales, calling the new descriptor domain-size pooled SIFT (DSP-SIFT) [Dong and Soatto, 2015], which represents the state of the art in terms of point feature descriptors for SfM tasks [Schoenberger et al., 2017].

Features from accelerated segment test (FAST) were developed by Rosten et al. [Rosten et al., 2009] as an exceptionally cheap binary feature detector that directly exploits the relative relationship of nearby pixel values. Binary robust independent elementary features (BRIEF) descriptors select random pixels within the neighbourhood of the feature to make binary comparisons in sequence, forming a binary descriptor that is computationally cheap and reliable except under in-plane rotation [Calonder et al., 2010]. Oriented FAST and rotated BRIEF

(ORB) features were developed as computationally cheaper alternatives to SIFT and SURF feature detectors for real-time robotics. The ORB features build on the FAST detector by using Harris corner strength for ranking, and SIFT's multi-scale pyramids for scale invariance [Rublee et al., 2011]. ORB descriptors use the BRIEF descriptor and augment it with an intensity-weighted centroid. This assumes a small offset between the corner's intensity centroid and its geometric centre, which defines a measure of orientation to provide rotational invariance. Overall, the result is that ORB is a much more computationally efficient detector and descriptor with comparable performance to SURF, and it has proven very successful in the robotics literature.

Recent work in machine learning and convolutional neural networks (CNNs) has also given rise to learned 2D features. Verdie et al. [Verdie et al., 2015] developed a temporally invariant learned detector (TILDE) that detects keypoints in outdoor scenes, despite lighting and seasonal changes. They demonstrated better repeatability over three different datasets than hand-crafted feature detectors, such as SIFT. Unfortunately, Verdie's approach only considered a single scale and did not consider viewpoint changes.

Yi et al. trained a deep network to learn the thresholds of the entire SIFT feature detection and description pipeline in a unified manner [Yi et al., 2016]. They called this method the Learned Invariant Feature Transform (LIFT). LIFT outperformed all other hand-crafted features in terms of repeatability and the nearest neighbour mean average precision, a metric that captures how discriminating the descriptor is by evaluating it at multiple descriptor distance thresholds.

An extensive experimental evaluation comparing hand-crafted and learned local feature descriptors showed that learned descriptors often surpassed basic SIFT and other hand-crafted descriptors on all evaluation metrics in SfM tasks [Schoenberger et al., 2017]. However, more advanced hand-crafted descriptors, such as DSP-SIFT, performed on par with or better than the state-of-the-art learned feature descriptors, including LIFT, for SfM tasks; unlike the hand-crafted features, the learned descriptors showed high variance across different datasets and applications.

Many robotic vision systems have used hand-crafted and learned features [Kragic and Christensen, 2002, Bourquardez et al., 2009, Low et al., 2007, Tsai et al., 2017, Lee et al., 2017, Pages et al., 2006]. The majority of them assume a similar appearance for the support region during correspondence. For non-Lambertian scenes where the support regions change significantly in appearance with respect to viewing pose, incorrect matches can occur. Moreover, refractive objects can cause features to distort, rotate, scale and flip. Feature descriptors that only account for scale and rotation will not reliably match refracted content because the additional distortion and flips caused by refractive objects change the very neighbourhood that the descriptors attempt to describe. Thus, these 2D image features may not perform well for scenes containing refractive objects.

3.1.2 3D Geometric Image Features

The fundamental limitation with using 2D features to describe the 3D world is that significant information is lost during the image formation process of conventional cameras. The perspective transformation is an irreversible process that projects the 3D world into a 2D image. Full 3D information can greatly improve the ability of robotic vision algorithms to reliably handle changes due to viewing position and lighting conditions. We refer to incorporating 3D information into image features as 3D geometric image features.

Measurements of 3D data can come from a variety of sensors, including stereo, RGB-D or LIDAR. Sensor measurements are then turned into one of many different 3D feature representations. Most conventional 3D feature descriptors are based on histograms, similar to SIFT's 2D gradient-based histogram descriptors. The most common is Johnson's spin image [Johnson and Hebert, 1999]. For a given point, a cylindrical support volume is divided into volumetric ring slices. The number of points in each slice is counted and summed about the longitudinal axis of the volume. This makes the spin image rotationally invariant about this axis. Finally, the spin image is binned into a 2D histogram.

Tombari et al. built on spin images by using a spherical support volume and examining the surface normals of all the points within the support, referred to as the Signature of Histograms of OrienTations (SHOT) feature descriptor [Tombari et al., 2010]. All of these approaches use a similar strategy: geometric measurements are taken over a support volume and binned into a histogram. The shape of the histogram is used to compare the similarity of given points. Salti et al. extended the SHOT descriptors to include both surface geometry as well as colour texture [Salti et al., 2014]. They demonstrated improved repeatability by including texture; however, their method remains untested for refractive objects and we anticipate reduced performance, since colour texture may change with viewpoint for refractive objects.

Quadros recently developed 3D features from LIDAR, defined by ray-tracing a set of 3D line segments in space [Quadros, 2014]. If these lines reach behind a surface or encounter a large gap in the data, unobserved space is registered by the method. Unobserved space is assumed to correspond to occlusions in the 3D point cloud. The authors report that accounting for occlusions facilitates more robust object recognition, although their method does not consider refractive objects.

Recently, Gupta et al. learned 3D features from RGB-D images for object detection and segmentation [Gupta et al., 2014] and Gao et al. for SLAM [Gao and Zhang, 2015]. However, none of these methods have been implemented for visual servoing and these methods rely on 3D data from RGB-D and LIDAR sensors, which return erroneous measurements for refractive objects and other view-dependent effects.

3.1.3 4D Geometric Image Features

All of the previous features have been developed for 2D images or 3D representations. LFs are parameterised in 4D, which requires a re-evaluation of feature detectors and descriptors.

Most previous work using LFs has only used 2D image features [Johannsen et al., 2015, Smith et al., 2009], or has simply used the LF camera as an alternative 3D depth sensor in structure-from-motion and SLAM-based applications [Dong et al., 2013, Marto et al., 2017]. These works do not take advantage of all the information contained within the full 4D LF, which can capture not only shape and texture, but also elements of occlusion, specular reflection and, in particular, refraction.

Ghasemi et al. proposed a global feature using a modified Hough transform to detect changes in the lines of slope within an EPI [Ghasemi and Vetterli, 2014]. However, their method produces a global feature that describes the entire scene, which is inappropriate for most SfM and IBVS methods that require local features. More recently, Tosic et al. focused on developing a SIFT-like feature detector for LFs by incorporating both scale invariance and depth into a combined feature space, called LISAD space [Tosic and Berkner, 2014]. Extrema of the first derivative of the LISAD space were taken as 3D feature points, yielding a feature described by image position (u,v), scale and slope (equivalently depth). However, we note that Tosic's work assumes no occlusions or specular reflections and does not discuss feature description to facilitate correspondence over multiple light fields. Furthermore, Tosic's choice of using an edge detector in the epipolar plane images (EPIs) amounts to a 3D edge detector in Cartesian space, which is a poor choice when unique points are required by SfM and IBVS. Edge points are not unique and are easily confused with their neighbours. Additionally, we anticipate that these LF features may not perform well for refractive objects, because the depth analysis assumes Lambertian scenes.

Also pursuing more reliable LF features, Teixeira et al. found SIFT features in all sub-views of the LF and projected them into their corresponding EPIs [Teixeira et al., 2017]. These projections were filtered and grouped into straight lines in their respective EPIs, and then counted. Features with higher counts were observed in more views and thus considered more reliable. In other words, Teixeira imposed 2D epipolar constraints on 2D SIFT features, which does not take full advantage of the geometry of the 4D LF.

Similarly, Johannsen et al. considered 3D line features based on Plücker coordinates and imposed 4D light-field constraints in relation to LF-based SfM [Johannsen et al., 2015]. Zhang et al. considered the geometry of 3D points and lines transforming under light-field pose changes [Zhang et al., 2017]. They derived line and plane-based correspondence methods between sub-views of the LF and imposed these correspondences in LF-based SfM. Doing so resulted in improved accuracy and reliability over conventional SfM, especially in challenging scenes where image feature points were sparse, but lines and planes were still visible. These previous LF-based works largely focused on matching between large differences in viewpoint. However, incremental pose changes, such as those found in visual servoing and video applications, also warrant consideration. How the LF changes with respect to these small pose changes is similar in concept to the image Jacobian for IBVS, which has not yet been well explored.

In considering LF cameras with respect to refractive objects, Maeno et al. proposed to model an object's refraction pattern as image distortion and developed the light-field distortion (LFD) feature based on the differences in corresponding points in the 4D LF [Maeno et al., 2013]. The authors used the LFD for transparent object recognition. However, their method did not impose any LF geometry constraints, leading to poor performance with respect to changes in camera position. Xu et al. built on Maeno's LFD to develop a method for refractive object image segmentation [Xu et al., 2015]. Each pixel was matched between each sub-view of the light field and then fitted to a single 4D hyperplane normal that is characteristic of a Lambertian point. A threshold was applied to the fitting error to distinguish refracted pixels. However, we will show in Chapter 5 that a 3D point is not described by a single hyperplane in 4D. Rather, a 3D point manifests as a plane in 4D, which can be described as the intersection of two hyperplanes. Both hyperplanes must be considered when assessing a feature's potentially refracted nature as it passes through a refractive object.

3.1.4 Direct Methods

In contrast to geometric image features that represent a geometric primitive from the image(s), direct methods establish some geometrical relationship between two images using pixel intensities directly. For this reason, they are also known as featureless, intensity-based, or photometric methods [Collewet and Marchand, 2011]. These methods avoid image feature detection, extraction and correspondence entirely by directly using the image intensities, minimising the error between the current and desired image to servo towards the goal pose. A common measure of photometric image error is the sum of squared differences. Although this operation involves many calculations over the image as a whole, it involves very few calculations per pixel, each of which is relatively simple and easily computed in parallel. This allows many direct methods to potentially run faster than feature-based VS methods.

Despite these benefits, VS methods using photometric image features typically suffer from small convergence domains compared to geometric feature-based methods [Collewet and Marchand, 2011]. Recently, to improve the convergence domain, Bateux et al. projected the current image to several poses, which were tracked by a particle filter. The error between the projected and current images drove the robot towards the convergence area, whereupon the method switched to conventional photometric VS [Bateux and Marchand, 2015]. Although the error was minimised between the current and next images, the poses projected by the particle filter were random, resulting in a path towards the goal pose that was not necessarily smooth or optimal with respect to the amount of physical motion required to reach the goal.

Furthermore, photometric image features typically assume that the scene's appearance does not change significantly with respect to viewpoint. Thus, they do not perform well for changes in pose which result in large changes in scene appearance [Irani and Anandan, 1999]. Refractive objects tend to have large changes in appearance with respect to viewing pose and therefore photometric VS methods are ill-suited for scenes with refractive objects.

Collewet et al. recently extended photometric IBVS to scenes with specular reflections [Collewet and Marchand, 2009]. This was accomplished by considering the Phong light reflection model, which provides image intensity as a function of a diffuse, specular and ambient component, given a point light source [Phong, 1975]. Collewet's approach compared the derivative of the light reflection model to the image Jacobian from photometric VS [Collewet and Marchand, 2011] to arrive at an analytical description of the image Jacobian relating pixel values to the light reflection from the Phong model. However, their approach requires a light reflection model, which in turn requires complete knowledge of all the lighting sources and their relative geometry. A similar strategy would likely only be viable for refractive objects if a 3D geometric model of the object were available.
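For reference, the sketch below illustrates the Phong model referred to above: intensity is the sum of ambient, diffuse and specular components, and the specular term depends on the viewing direction, which is the view dependence such photometric methods must model. This is a generic illustration (not Collewet's implementation); the vectors, reflection coefficients and shininess exponent are assumed values.

# A minimal sketch of the Phong reflection model for a point light source:
# intensity = ambient + diffuse + specular, where the specular term depends
# on the viewing direction. All inputs below are illustrative assumptions.
import numpy as np

def normalize(x):
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x)

def phong_intensity(n, l, v, ka=0.1, kd=0.6, ks=0.3, shininess=32.0):
    """Scalar intensity at a surface point.
    n: surface normal, l: direction to the light, v: direction to the viewer."""
    n, l, v = normalize(n), normalize(l), normalize(v)
    r = 2.0 * np.dot(n, l) * n - l            # mirror reflection of the light direction
    diffuse = kd * max(np.dot(n, l), 0.0)
    specular = ks * max(np.dot(r, v), 0.0) ** shininess
    return ka + diffuse + specular

print(phong_intensity(n=[0, 0, 1], l=[1, 1, 1], v=[0, 0, 1]))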

3.1.5 Image Feature Correspondence

The classical approach to many robotic vision algorithms involves detecting, extracting and then matching image features to compare the current and goal image feature sets. Often, the success of the algorithm depends significantly on accurate feature correspondence. Correspondence is a data association problem: finding the same set of image features in a pair of images. Corresponding image features is typically divided into two categories: large-baseline and small-baseline matching. Large-baseline matching aims to correspond features between two images that were taken from relatively different viewpoints, large baselines, or different viewing conditions. Small-baseline matching aims to correspond features between two images that were taken from relatively similar viewpoints, or narrow baselines. While both approaches aim to match image features between two images, the general assumptions and approaches differ. However, large-baseline matching can also apply to small-baseline situations, and in the context of VS, the image feature error that VS seeks to minimise relies on corresponding image features between the current and goal images. The goal image may have been captured from a relatively different viewpoint. Thus, in this thesis we focus on large-baseline image feature matching.

For matching, the nearest neighbour distance between two feature descriptor vectors is commonly used to generate putative matches; however, exhaustive methods are inefficient for large feature databases. Advanced features, like SIFT, use search data structures, such as k-d trees, to find matches more efficiently [Lowe, 2004]. Muja et al. proposed multiple randomized k-d trees to approximately find the nearest neighbour at much faster speeds than linear search, with only a minor loss in accuracy [Muja and Lowe, 2009]. However, the actual comparison in traditional image feature correspondence methods is often based on some abstraction of the image feature's appearance. This inherently assumes that the appearance of the image feature does not change significantly between views. Refractive objects can significantly change the appearance of a feature, which makes matching based on appearance particularly challenging.

To reduce the possibility of mismatches and remove outliers, putative matches are refined according to some consistency measure (or model). For example, in two-view geometry, the image reprojection error from the fundamental matrix is used. The standard approach is random sample consensus (RANSAC) [Bolles and Fischler, 1981], where candidate points are randomly chosen to form a hypothesis, which is tested for consistency against the remaining data. The hypothesis process is iteratively repeated until a thresholded number of inliers is reached. Building on RANSAC, Torr et al. proposed maximum likelihood estimation sample consensus (MLESAC), which maximises the likelihood that the data was generated from the hypothesis [Torr and Zisserman, 2000].
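The hypothesise-and-test loop described above can be summarised with the following generic sketch (not the exact formulation of the cited works); fit_model and residuals are hypothetical hooks that, for two-view geometry, would fit a fundamental matrix to a minimal sample and return the reprojection (epipolar) errors.

# A minimal, generic RANSAC-style sketch for refining putative feature matches
# against a consistency model. `fit_model` and `residuals` are placeholder
# hooks supplied by the caller.
import numpy as np

def ransac(data, fit_model, residuals, n_sample, n_iters=1000,
           inlier_thresh=1.0, min_inliers=30, rng=np.random.default_rng(0)):
    best_model, best_inliers = None, np.zeros(len(data), dtype=bool)
    for _ in range(n_iters):
        # Randomly choose a minimal subset of candidate matches to form a hypothesis.
        idx = rng.choice(len(data), size=n_sample, replace=False)
        model = fit_model(data[idx])
        # Test the hypothesis for consistency against the remaining data.
        inliers = residuals(model, data) < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_model, best_inliers = model, inliers
        if best_inliers.sum() >= min_inliers:   # stop once enough inliers are found
            break
    return best_model, best_inliers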

Most outlier rejection methods, such as RANSAC and MLESAC, are based on two assumptions: first, that there are sufficient data to describe the model and second, that the data are mostly inliers—there are few outliers. Most robotic vision algorithms do not account for refraction and thus rely on these outlier rejection methods to remove these inconsistent features (such as refracted features) from the inlier set. In a scene that has mostly Lambertian features with only a small number of refracted features, outlier rejection methods work well. However, for scenes that are mostly covered by a refractive object, such as when a robot or camera directly approaches a refractive object, outlier rejection methods are much less reliable because the second assumption is broken [Kompella and Sturm, 2011]. Therefore, traditional feature correspondence methods may not work reliably for features that pass through refractive objects.

3.2 Visual Servoing

Visual servoing (VS) is a form of closed-loop feedback control that uses a camera in the loop to directly control robot motion. The term VS was introduced in 1979 by Hill & Park to distinguish their approach from the common "look-then-move", or equivalently "sense-then-act", approach to robotics [Hill, 1979], and has since covered a wide range of applications, from controlling robot manipulators in manufacturing and agricultural fruit/vegetable picking [Mehta and Burks, 2014, Baeten et al., 2008, Han et al., 2012], to flying quadrotors [Bourquardez et al., 2009], and even docking of planetary rovers [Tsai et al., 2013]. VS is a promising technique for robotics because it does not necessarily require a 3D geometric model of its target, the accuracy of its operations does not entirely depend on accurate robot control and calibration, and historically, the simplicity of the VS approach has led to faster interaction in docking, manipulation and grasping tasks, as well as shorter time cycles in sensing the environment, which have translated to more reliable robot performance.

Hutchinson et al. were some of the first researchers to clearly distinguish the different types of VS systems in 1996 [Hutchinson et al., 1996]. This classification was based on how the visual input was used and what computation was involved, grouping them into either position-based visual servoing (PBVS) or image-based visual servoing (IBVS) systems. In this section, we provide a comparison and review of PBVS and IBVS systems.

3.2.1 Position-based Visual Servoing

The purpose of PBVS is to minimise the relative pose error between the target (some desired pose) and the camera's pose. Image features are extracted from the image and used with a geometric model of the target and a known camera model to estimate the relative pose of the target with respect to the camera, as shown in Fig. 3.1a. Feedback is then computed to reduce the error in the estimated relative pose. PBVS is traditionally referred to as position-based VS, although the approach may be more realistically referred to as pose-based VS. The main advantage of PBVS is that it is straightforward to incorporate physical constraints, spatial knowledge and direct manoeuvres (such as obstacle avoidance).

PBVS requires an estimate of the target object pose in order to derive feedback control in the task space. The approach can be computationally demanding, sensitive to noise and highly dependent on camera calibration. Most research involving PBVS has focused on Lambertian scenes, i.e. scenes that are predominantly Lambertian and so do not contain refractive objects, specular reflections, or other surfaces or materials that cause non-Lambertian light transfer. PBVS has been demonstrated in full 6 DOF control by Wilson et al. [Wilson et al., 1996] and in real time using object models with a monocular camera by Drummond et al. [Drummond and Cipolla, 1999]. More recently, Tsai et al. implemented PBVS using a stereo camera for a tether-assisted docking system [Tsai et al., 2013]. Teulière et al. demonstrated successful PBVS using an RGB-D camera even when partial occlusions were present [Teulière and Marchand, 2014].

Figure 3.1: Architectures for (a) position-based visual servoing and (b) image-based visual servoing, which does not require explicit pose estimation. Image courtesy of [Corke, 2017].

PBVS towards refractive objects was recently considered by Mezouar et al. for transparent protein and crystal manipulation under a microscope [Mezouar and Allen, 2002]. However, the 2D nature of the microscope workspace greatly simplified the visual servoing process. More importantly, the microscope and the backlighting reduced the image processing to a thresholding problem, making the objects' refractive nature irrelevant.

Recently, Bergeles et al. used PBVS for controlling the pose of a microrobotic device inside a transparent human eye for surgery by accounting for the visual distortions caused by the eye [Bergeles et al., 2012]. Their method required extremely precise model calibration of both the eye and the robot in order to avoid potential injury. In our application of servoing towards refractive objects of more general shapes, models of the cameras are not always accurate, and prior models of the objects are not necessarily available or can be difficult to obtain. Because they rely on such models, PBVS methods are sometimes referred to as model-based VS [Kragic and Christensen, 2002].

PBVS is not commonly used in practice because the visual features used for servoing are not guaranteed to stay in the FOV during the approach, and more importantly, it requires estimation of the target pose, which in turn requires a geometric model of the target object and a model of the camera. As we will discuss in Section 3.3.2, 3D information of refractive objects, and in particular their 3D models and 3D pose information, is extremely difficult to obtain. Experimental setups that can obtain the required 3D measurements of refractive objects are likely too bulky for mobile robot applications such as VS. Additionally, monocular pose estimation is poorly conditioned numerically [Kragic and Christensen, 2002]. Therefore, there is real interest in compact IBVS systems that tend to keep the target in the FOV by the very nature of the algorithm, that avoid the ill-conditioned pose estimation, and that do not necessarily require 3D geometric models of the refractive objects.

3.2.2 Image-based Visual Servoing

In IBVS, robot control values are directly computed based on image features, as shown in Fig. 3.1b. Typically, image features from the current view of the robot are detected and extracted. These image feature vectors are matched to a set of goal image feature vectors. The image feature error is computed as the difference between the two image feature sets. Then the estimated camera velocity that attempts to drive the image feature error to zero is computed. This cycle is repeated until the image feature error is sufficiently small. The negative feedback helps to reduce system fluctuations and promotes settling to equilibrium, which makes IBVS more robust to uncertainty, noise and camera/robot modelling and calibration errors that often plague traditional open-loop sense-then-act approaches. IBVS works because the camera pose is implicit in the image feature values. This eliminates the need for an explicit 3D geometric model of the goal object, as well as an explicit pose-based motion planner [Chaumette and Hutchinson, 2006].

3.2.2.1 Image Jacobian for Monocular IBVS

At the core of IBVS systems is the interaction matrix, which is sometimes referred to as a visual-motor model, but more commonly referred to as an image Jacobian [Kragic and Christensen, 2002]. The image Jacobian J represents a first-order partial derivative function that relates the rate of change of image features to camera velocity. Consider

\dot{p} = J(p, {}^{c}P; K)\,{}^{c}\nu,    (3.1)

where {}^{c}P ∈ R^3 is the coordinate of a world point in the camera reference frame, p ∈ R^2 is its image plane projection, K ∈ R^{3×3} is the camera intrinsic matrix, and {}^{c}\nu = [v; ω] ∈ R^6 is the camera's spatial velocity in the camera reference frame, which is the concatenation of the camera's translational velocity v = [v_x, v_y, v_z]^T and rotational velocity ω = [ω_x, ω_y, ω_z]^T in the camera reference frame.

The control problem is defined by the initial (observed) and desired image coordinates, p^{#} and p^{*} respectively, from which the required optical flow

\dot{p}^{*} = \lambda(p^{*} - p^{\#})    (3.2)

can be determined, where λ > 0 is a constant. This equation implies straight-line motion in the image because the image feature error is only taken as the difference between initial and desired image coordinates. Combining both equations, we can write

J(p, {}^{c}P; K)\,\nu = \lambda(p^{*} - p^{\#}),    (3.3)

which relates camera velocity to observed and desired image plane coordinates. It is important to note that VS is a local method based on J, the linearisation of the perspective projection equation. In practice, it is found to have a wide basin of attraction.

The monocular image-based Jacobian for image point features p = (u, v) is given as [Chaumette and Hutchinson, 2006]

J = \begin{bmatrix} -\frac{f_x}{P_z} & 0 & \frac{u}{P_z} & \frac{uv}{f_y} & -\frac{f_x^2 + u^2}{f_x} & \frac{f_x v}{f_y} \\ 0 & -\frac{f_y}{P_z} & \frac{v}{P_z} & \frac{f_y^2 + v^2}{f_y} & -\frac{uv}{f_x} & -\frac{f_y u}{f_x} \end{bmatrix},    (3.4)

where fx and fy are the x and y focal lengths¹, respectively, and Pz is the depth of the point. We note that the first three columns of J depend on depth, implying that the image feature velocity due to camera translation is inversely proportional to depth, while the feature velocity due to the angular velocity of the camera is largely unaffected by depth.

¹ Typically, fx and fy are equal. These terms are in units of pixels, i.e. pixel size is included.

Equation (3.3) suggests we can solve for the camera velocity ν, but for a single point the system is under-determined. Thus it is not possible to uniquely determine the elements of ν for a single observation p. To address this issue, the typical approach is to stack (3.4) for each of N image features,

\begin{bmatrix} J(p_1, {}^{c}P_1; K) \\ \vdots \\ J(p_N, {}^{c}P_N; K) \end{bmatrix} \nu = \lambda \begin{bmatrix} p_1^{*} - p_1^{\#} \\ \vdots \\ p_N^{*} - p_N^{\#} \end{bmatrix},    (3.5)

and if N ≥ 3 we can solve uniquely for ν

\nu = -\lambda \begin{bmatrix} J_1 \\ \vdots \\ J_N \end{bmatrix}^{+} \begin{bmatrix} p_1 - p_1^{*} \\ \vdots \\ p_N - p_N^{*} \end{bmatrix},    (3.6)

where J^{+} represents the left Moore-Penrose pseudo-inverse of J. Equation (3.6) is similar to the classical proportional control law for VS [Hutchinson et al., 1996], except that we use the pseudo-inverse because we may have noisy observations forming a non-square matrix; the pseudo-inverse finds the least-squares solution for the camera velocity. The constant λ is the control loop's gain, which serves to scale the resulting control.
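To illustrate how (3.4)-(3.6) fit together, the following minimal sketch (assuming Python with NumPy, pixel coordinates expressed relative to the principal point, and known point depths; it is not the thesis implementation) stacks the point-feature Jacobians and computes the camera velocity with the pseudo-inverse control law.

# A minimal IBVS sketch: the point-feature Jacobian of (3.4), stacked as in
# (3.5), and the proportional control law of (3.6). All numerical inputs are
# illustrative assumptions.
import numpy as np

def point_jacobian(u, v, Pz, fx, fy):
    """2x6 image Jacobian for a single point feature, as in (3.4)."""
    return np.array([
        [-fx / Pz, 0.0, u / Pz, u * v / fy, -(fx**2 + u**2) / fx, fx * v / fy],
        [0.0, -fy / Pz, v / Pz, (fy**2 + v**2) / fy, -u * v / fx, -fy * u / fx]])

def ibvs_velocity(p, p_star, Pz, fx, fy, gain=0.5):
    """Camera velocity nu = -gain * J^+ (p - p*), with J stacked over N >= 3 points."""
    J = np.vstack([point_jacobian(ui, vi, zi, fx, fy)
                   for (ui, vi), zi in zip(p, Pz)])
    error = (p - p_star).ravel()
    return -gain * np.linalg.pinv(J) @ error

# Illustrative example: four observed points and their desired locations.
fx = fy = 800.0
p      = np.array([[50.0, 40.0], [-60.0, 45.0], [55.0, -50.0], [-45.0, -55.0]])
p_star = np.array([[40.0, 30.0], [-50.0, 35.0], [45.0, -40.0], [-35.0, -45.0]])
Pz = [2.0, 2.0, 2.5, 2.5]
print(ibvs_velocity(p, p_star, Pz, fx, fy))   # [vx, vy, vz, wx, wy, wz]

Repeating this computation at each control step drives the image feature error towards zero at a rate set by the gain.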

There are two important issues with (3.4) and (3.6) with respect to the lack of depth information and stability. First, (3.4) depends on the depth Pz. Any method that uses this form of Jacobian must therefore estimate or approximate Pz. However, monocular cameras do not measure depth directly. A common assumption is to fix Pz, which then acts as a control gain for the translational velocities [Chaumette and Hutchinson, 2006]. A variety of other approaches exist to estimate depth online [Papanikolopoulos and Khosla, 1993, Jerian and Jain, 1991, De Luca et al., 2008]; however, monocular depth estimation techniques are often nonlinear and difficult to solve because they are typically ill-posed [Kragic and Christensen, 2002]. Moreover, visual servoing stability issues can arise from these approaches because variable depth can lead to local minima and ultimately unstable behaviour of the robot system [Chaumette, 1998].

Second, Chaumette showed that the IBVS system is only guaranteed to be stable near Pz, since J is a linear approximation to the nonlinear robotic vision system [Chaumette and Hutchinson, 2006]. Local asymptotic stability is possible for IBVS, but global asymptotic stability cannot be ensured. Determining the size of the neighbourhood where stability and convergence are ensured is still an open issue, even though this neighbourhood is large in practice. Furthermore, (3.6) relies on stacking N image point feature Jacobians Ji, each of which may have a different Pz,i, depending on the scene geometry. Malis et al. showed that incorrect Pz,i can cause the system to fail [Malis and Rives, 2003]. In other words, the depth distribution affects IBVS convergence and stability, and in the case of unknown target geometry, accurate depth estimates are actually needed.

One example of undesirable behaviour in IBVS is camera retreat, where the camera may move backwards for large rotations [Chaumette, 1998]. Camera retreat is caused by the coupled nature of the rotation and translation components in the image Jacobian. This poses a performance issue because, in real systems, such backwards manoeuvres may not be feasible. Corke et al. showed that camera retreat was a consequence of requiring straight-line motion on the image plane with a rotating camera (as in (3.3)). This was then addressed by decoupling the translation motion components from the z-axis rotation components into two separate image Jacobians [Corke and Hutchinson, 2001]. Recently, Keshmiri et al. proposed to decouple all six of the camera's velocity screw elements [Keshmiri and Xie, 2017]. Their approach enables better Cartesian trajectory control compared to traditional IBVS systems at the cost of more computation.

Almost all IBVS methods rely on accurate image feature correspondence in order to accurately compute the image feature error. McFadyen et al. recently proposed an IBVS method that jointly solves the image feature correspondence and motion control problem in an optimal control framework [McFadyen et al., 2017]. Image feature error is computed for different feature correspondence permutations. As the robot moves closer to the desired pose, the system converges towards smaller error and the correct permutation. However, their method exhaustively evaluates correspondence permutations and thus does not scale well to a large number of image features, such as those detected when using the natural features typically found in most robotic vision algorithms.

3.2.2.2 IBVS on Non-Lambertian Objects

An interesting approach to IBVS on featureless objects was proposed by Pages et al. whereby coded, structured light was projected into the scene to create geometric visual features for feature correspondence [Pages et al., 2006]. By defining the projection pattern as a particular grid of coloured dots, many point features were quickly and unambiguously detected and matched. However, the structured light required that the ambient light did not overpower the projector, limiting usage to indoor applications. Additionally, this method may not work reliably for refractive objects, because the projected pattern would be severely distorted, scaled, or flipped, which would greatly complicate the feature detection and correspondence problem.

Recently, Marchand and Chaumette used planar mirrored reflections to overcome the limited FOV of a single camera in IBVS [Marchand and Chaumette, 2017]. They derived the image Jacobian for servoing the mirror relative to the camera to track an object. However, only Lambertian features were tracked through the mirror and it was assumed that the image features were always within the mirror (thus all the reflected features always showed consistent motion). Furthermore, image feature distortion that could arise from non-planar mirrors, somewhat similar to the distortion from refractive objects, was also not considered. In summary, this approach may not be directly transferable to tracking image features through refractive objects because nonlinear image feature motion—potentially caused by inconsistent feature/mirror motion or non-planar mirrors—was not considered in their approach.

3.2.2.3 IBVS using Multiple Cameras

IBVS has been extended to stereo and multi-camera vision systems. Assuming the pose between both cameras is known, each camera's Jacobian can be transformed into a common reference frame. Stacking the same type of image features from both cameras and solving the system yields camera motion [Chaumette and Hutchinson, 2007]. Malis et al. [Malis et al., 2000] extended this concept to multiple cameras with a similar stacking of image features; more cameras yielded more features. Comport et al. derived an IBVS framework for generalised cameras [Comport et al., 2011], though the focus was on non-overlapping FOV camera configurations, rather than the overlapping FOV camera configurations of LF cameras. Additional IBVS systems were discussed in Section 3.1.2. All of these previous works rely on accurate feature correspondence. They assume Lambertian point correspondences, which do not necessarily apply in the case of refractive objects. Therefore, we expect that none of these systems would perform reliably in the presence of refractive objects.

In the area of VS, Malis et al. first proposed to use 3D geometric image features, but referred to this concept as 2.5D visual servoing [Malis et al., 1999]. Using 3D geometric image features does not necessarily require any geometric 3D model of the target object and is less limited by the relatively small convergence domain and depth issues that plague monocular image-based visual servoing (M-IBVS). In a slightly different manner, Chaumette proposed that it may be advantageous for robot systems to plan large steps using PBVS, while small intermediate steps are maintained by IBVS [Chaumette and Hutchinson, 2007].

For stereo vision systems, Cervera et al. used the 3D coordinates of points [Cervera et al., 2003], and Bernardes et al. used 3D lines [Bernardes and Borges, 2010] for visual servoing. Malis et al. used homographies [Malis et al., 1999, Malis and Chaumette, 2000] and both Mariottini et al. and Cai et al. used epipolar geometry [Mariottini et al., 2007, Cai et al., 2013] in visual servoing. Both homography and epipolar-based approaches determine a geometric relationship between the current and desired views to control robot motion. The geometric relationship is either the homography matrix or the fundamental matrix, both of which can be determined using corresponding feature points from different views. However, decomposing the homography matrix only applies to planar scenes and stereo epipolar geometry becomes ill-conditioned for short baselines as well as planar scenes [López-Nicolás et al., 2010].

Recently, Zhang et al. developed a trifocal tensor-based approach for visual servoing [Zhang et al., 2018]. In simulation, a trinocular camera system was used to estimate the trifocal tensor based on point feature correspondences, as in [Hartley and Zisserman, 2003]. Instead of directly computing the camera pose via singular value decomposition (SVD), the authors chose to use elements of the trifocal tensor, augmented with elements of scale and rotation, as visual features. However, these methods relied on accurate feature correspondences, which are fundamentally based on Lambertian assumptions. Therefore, these approaches are not likely to perform reliably in the presence of non-Lambertian scenes, such as those containing refractive objects.

3.3 Refractive Objects in Robotic Vision

Refractive objects are particularly challenging in computer and robotic vision because these objects do not have any obvious visible features of their own. Their appearance tends to be largely dependent on the background, the object's shape and the lighting conditions. Although refractive objects have been largely ignored by the bulk of the robotics community, we review the previous research on detecting and recognizing refractive objects and on reconstructing their shape. Although shape reconstruction is not an explicit goal of this thesis, observed structure and camera motion are integrally linked, and it is important to review what information has been extracted from refractive objects.

3.3.1 Detection & Recognition

There have been a variety of approaches to detecting and recognizing refractive objects. In this review, we have divided the different approaches into model- and image-based approaches, based on whether or not the method in question relies on a prior 3D geometric model of the refractive objects.

3.3.1.1 Model-based Approaches

One of the earliest model-based approaches to refractive object detection was proposed by Choi and Christensen, where a database of 2D edge templates of projected 3D refractive object models with known poses was used to match edge contours from 2D images [Choi and Christensen, 2012]. Image edges were extracted and matched using particle filters to provide coarse pose estimates, which were refined via RANSAC. The authors achieved real-time refractive object detection and tracking with 3D pose information. However, this approach required a large database of edge templates for every conceivable model and pose, which does not scale well for general-purpose robots, although this is becoming less significant with the increasing computational abilities of modern computers.

Most subsequent approaches adopted RGB-D cameras as a means of making putative refractive object detections. While depth measurements of refractive objects from RGB-D cameras were known to be inconsistent, partial depth around the refractive objects was usually observed in the RGB-D images. Luo et al. applied a variety of morphological operations to identify 3D regions of inconsistent depth, which were assumed to be refractive [Luo et al., 2015]. These 3D regions were then compared to 3D object models for recognition. However, Luo obtained the 3D models of the refractive objects by first painting them so that the refractive objects became Lambertian, which is not a practical approach for most robotic applications.

Recently, LF cameras have been considered for refractive object recognition with models. Walter et al. also used an RGB-D camera for object recognition, but combined their system with an LF camera array to detect and replace the inconsistent depth estimates caused by specular reflections on glass objects [Walter et al., 2015]. This was accomplished by comparing a known 3D model of the refractive object to the observed depth measurements in order to identify the inconsistent depths. Given that LF cameras implicitly encode depth, it is possible that the RGB-D camera was redundant in this approach.

In a particularly recent and interesting work, Zhou et al. developed an LF-based depth descriptor for object recognition and grasping [Zhou et al., 2018]. For a Lambertian point, the light field yields one highly redundant depth estimate, but for a refracted image feature, the light field can yield a wide distribution of depths. Zhou proposed to use a 3D array of depth likelihoods within a certain image region and depth range, creating a 3D descriptor for the refractive object. By comparing this depth-based descriptor to a 3D geometric model, refractive object pose was estimated using Monte Carlo localization. This method was sufficiently accurate for coarse manipulation of glass objects in water and Lambertian objects behind a stained-glass window.

However, all of the previously-mentioned methods required prior accurate 3D geometric models of the refractive objects. For a small set of simple objects this approach may be feasible, but in general, models of refractive objects are challenging to acquire, potentially time-consuming and expensive to obtain, or simply not available [Ihrke et al., 2010a]. Therefore, there is great interest in methods that do not rely on 3D geometric models.

3.3.1.2 Image-based Approaches

Early work on detecting refractive objects in 2D images started with Adelson and Anandan in 1990, focusing on finding occluding edges caused by refractive objects [Adelson and Anandan, 1990]. However, their method was limited to 2D layered scenes with no visual texture on planar refractive shapes, such as circles and triangles. Szeliski et al. extended this concept of layered depth images to detect reflective and refractive objects in more general images [Szeliski et al., 2000]; however, their approach was still limited to scenes that could be described as a collection of planar layers. McHenry et al. noted that refractive objects tended to distort and blur image edges, as well as appear slightly darker in the image [McHenry et al., 2005]. Thus, their method focused on finding image edges and then compared the image gradients and overall intensity values on either side of each edge to detect refractive parts of the image. Snake contours were then used to merge components of refractive object edges into overall refractive object segments. However, their method assumed that the background was similar on all sides of the glass edges, which was not true for very refractive elements or those containing bright surface highlights.

Kompella et al. extended this work by finding regions in the image that contained even more visual characteristics related to refractive objects [Kompella and Sturm, 2011]. In addition to the reduced image intensity and blurred image gradients, their method also searched for an abundance of highlights and caustics caused by the specular surface of most refractive objects, and for lower saturation values, as some light and colour is lost as it passes through refractive objects. These characteristics were combined into a function to detect and avoid refractive objects during navigation. However, their method only provided extremely coarse estimates of where the refractive objects were located in the image and still assumed that the background was similar on all sides of the glass edges. Therefore, we anticipate that their approach would not perform well if the object were not in front of a uniform background, which is not practical for mobile robots working in cluttered scenes.

Recently, Klank et al. used an RGB-D camera for detection, but noted that most refractive objects appeared much darker in the depth images when placed on a flat table [Klank et al., 2011]. This was likely due to the different absorption properties of glass. They segmented dark regions in the depth images as candidate refractive objects and then identified depth inconsistencies within the dark regions as refractive. However, dark regions in depth images do not necessarily correspond to glass objects. Depending on the type of RGB-D camera used, dark regions in depth images can also appear at regions that are actually farther away (since intensity is correlated to depth), as well as at other material types, such as felt, and sometimes at occlusion boundaries. Thus, their algorithms may not perform well in cluttered and occluded scenes containing refractive objects.

LF cameras have only recently been considered for image-based refractive object detection. Maeno et al. proposed to model an object's refraction pattern as image distortion, based on differences in corresponding points in the 4D LF [Maeno et al., 2013]. However, the authors noted poor performance due to changes in appearance from both the specular reflections on the refractive objects and the camera viewing pose. Xu et al. built on Maeno's work to develop a transparent object image segmentation method from a single light-field capture [Xu et al., 2015]. However, as we will discuss in more detail in Ch. 5, their method does not fully describe how a 3D point manifests in the light field. We address this to improve detection and recognition rates.

3.3.2 Shape Reconstruction

Although shape reconstruction is not an explicit goal of this thesis, observed structure and motion are intricately linked; thus it is important to understand what has been done in this area. Shape reconstruction of refractive objects is a particularly challenging task. Ihrke et al. proposed a taxonomy of objects according to their increasing complexity with respect to light transport (reflections, refractions, sub-surface scattering, etc.) [Ihrke et al., 2010a]. Most techniques have focused on opaque objects (Class 1) and have demonstrated good performance using a sequence of images from a monocular camera relying on dense pixel correspondences [Engel et al., 2014, Newcombe et al., 2011]. However, shiny and transparent objects are still difficult for the state of the art because these methods assume Lambertian surfaces. Additionally, traditional methods rely on rejecting inconsistent correspondences using RANSAC [Fischler and Bolles, 1981], which can be robust to a few small specular highlights, but is insufficient for dealing with more complex light transport phenomena (Class 3+), including refractive objects [Ihrke et al., 2010a, Tsai et al., 2019], as we will show in Ch. 5. In order to reliably deal with shiny and transparent objects, researchers have developed a variety of methods to reconstruct the shape of refractive objects.

3.3.2.1 Shape from Light Path Triangulation

Kutulakos et al. presented the seminal work on using light-ray correspondences to estimate the shape of refractive objects [Kutulakos and Steger, 2007]. The shape of specular and transparent objects, defined by depths and surface normals, can be estimated by mapping the light rays that enter and exit the object. As shown in Fig. 3.2, we can consider a convex, two-interface refractive object and draw a ray2 originating from background point P̄ (two parameters) in some direction r (two) for some distance dPA (one). At Ā, the ray intersects the refractive object and changes direction. We estimate this direction change using Snell's Law, which requires an estimate of the surface normal Ni (two) and the ratio of refractive indices n1/n2 (one). The light ray then travels a distance dAB (one) through the object and changes direction at the exiting interface at B̄, which is defined by surface normal N (two). Finally, the light ray travels a distance dBL (one) to the camera. Altogether, a basic light path can be described by a minimum of twelve characteristics of the scene. Alternatively, one can describe the light path as three rays (four parameters each) linked in series, which also requires a minimum of twelve parameters. As we will describe below, many approaches, such as shape from distortion [Ben-Ezra and Nayar, 2003] and shape from reflection [Han et al., 2015], apply assumptions which limit or define many of these parameters to simplify shape recovery.

Shape from distortion is an approach based on capturing multiple images from known poses, finding visual features that correspond to the same 3D point from behind the refractive object, and then examining how the light path has been distorted by the refractive object. For example,

2Recall that a ray can be described by four parameters.

Figure 3.2: Light paths can describe the behaviour of light as it passes through a refractive object. Most methods rely on light path correspondence and triangulation to solve for the depths and surface normals of the refractive object. In general, for two-interface refractive objects, light paths are described by at least twelve characteristics, from the point of origin, through the intersections at the refractive object boundaries, to the camera sensor. Many approaches apply assumptions or constraints to simplify the problem.

Kim et al. acquired the shape of axially-symmetric transparent objects, such as wine glasses, by placing an LCD monitor in the background and emitting several known lighting patterns [Kim et al., 2017]. However, most such methods rely on a bulky device to project a calibrated pattern through the object [Murase, 1990, Hata et al., 1996, Kim et al., 2017] and so are not immediately applicable to mobile robotics. Ben-Ezra et al. tracked features over a sequence of monocular camera images to capture the distortion pattern [Ben-Ezra and Nayar, 2003]. Starting with an unknown parametric model, shape and pose were simultaneously found in an iterative, nonlinear, multi-parameter optimisation scheme. However, their method could only handle quadratic-shaped refractive objects and, importantly, the features were manually tagged because automatically detecting and matching image points through a refractive medium from single images was considered a very hard problem.

Alternatively, shape from reflection or refraction approaches typically solve light-ray correspondences by controlling the background behind the refractive object. Han et al. used a single camera fixed in position with a refractive object placed in front of a checkerboard background [Han et al., 2015, Han et al., 2018]. The method only required two images with the background pattern in different known positions; however, a change of refractive index was required, which meant immersing the object in water, a major limitation for most robots.

In addition to background scene control, constraints on the refractive object itself can further simplify the light path correspondence problem. For example, Tsai et al. imposed a planar surface constraint on one side of a refractive object. With a monitor controlling the background image, they were able to reconstruct a diamond's shape with a single monocular image [Tsai et al., 2015] without having to place the object in water.

Without explicit control of the background, shape can also be obtained by controlling the incoming light rays using a mobile light source. Morris et al. used a static monocular camera with a grid of known moving lights to map different reflectance values to the same surface point, from which they reconstructed very challenging shiny and transparent structures [Morris and Kutulakos, 2007]. Miyazaki and Ikeuchi used a rotating polariser in front of a monocular camera to capture multiple images of different polarisation settings, but also required a known background surface and known lighting distribution to estimate the shape of the transparent object [Miyazaki and Ikeuchi, 2005]. However, both Morris' and Miyazaki's methods require known light sources with bulky configurations that are impractical for mobile robotic applications.

The majority of state-of-the-art methods for refractive object shape reconstruction based on light paths rely on feature correspondence between multiple views to find common features for triangulation. Because of the complexity and sheer number of unknowns in the problem, most of these approaches apply assumptions and constraints to make it more tractable. In doing so, the application window of these methods becomes too narrow, making them too fragile and unreliable for practical robot applications that must contend with many conditions and environments; alternatively, the methods require equipment too bulky to be considered for most mobile robot applications.

3.3.2.2 Shape from Learning

Recent work in robotics has seen an explosion in the area of learning features using convolutional neural networks (CNNs). CNNs use a large number of images to train several layers of parameters to minimise some cost function. CNNs apply the convolution operation to input images to approximate how neurons in the brain respond to visual stimulus in the receptive field of the visual cortex [Krizhevsky et al., 2012]. By feeding the network large training sets of images and an objective function, the CNN is able to “learn” the visual stimulus relevant for a given task (such as image classification or object detection).

Deep learning approaches use networks with many layers to handle more complex tasks and achieve advanced recognition performance. Deep learning has achieved state-of-the-art performance for many classification and recognition tasks, but few works have explored its use for refractive objects. Saxena et al. demonstrated a data-driven method for recognizing grasping points on a variety of objects, including some refractive objects [Saxena et al., 2008]. However, recovering the shape of such objects still remains a challenge due to the large amount of ground-truthed images required to train CNNs. For learning approaches on opaque objects, ground truth comes from RGB-D cameras; however, RGB-D cameras are unable to provide reliable depth information on refractive objects, and 3D models of refractive objects are not always available.

3.3.2.3 Shape (Structure) from Motion

Shape estimation techniques based on multiple viewpoints are closely related to structure from motion techniques [Wei et al., 2013]. For shape estimation, scene depth is usually determined given the viewing pose for each viewpoint (although surface normals are also often computed). For SfM, on the other hand, scene depth and viewing pose are simultaneously computed from multiple 2D images3.

3SfM is also closely related to visual servoing, which we review in Section 3.2.

SfM is generally considered to be a well-understood problem in theory [Hartley and Zisserman, 2003]. The typical SfM pipeline includes detecting image features, establishing image correspondences, filtering outliers, estimating camera poses and the locations of 3D points, followed by optional refinement with bundle adjustment [Triggs et al., 2000]. However, classical SfM does not produce reliable results for refractive objects because of poor feature correspondence performance [Ihrke et al., 2010b].

Ham et al. present a shape estimation method that may be loosely described as structure from motion on the occluding edges of refractive objects [Ham et al., 2017]. The authors use multiple views with known pose to extract the position and orientation of occluding edge features. Occluding edge features are visible edges in an image that lie on the boundary of an occlusion or depth discontinuity. They appear as edges in the image but, unlike textural edges (flat patterns on a surface), they are view-dependent and their surfaces are tangential to the camera view. Ham's method can handle very general object shapes, does not require pre-existing knowledge of the object, and does not require bulky equipment setups. However, their method relies on a monocular camera, which must be moved to different poses to acquire multiple views, which may make dynamic scenes more challenging. An LF camera may be able to capture a sufficient number of views in a single shot from a single sensor position (i.e., without moving the camera). Furthermore, Ham's method is focused on reconstructing the scene and requires full pose information, whereas our methods aim to detect refracted features and servo towards them; they are entirely image-based and thus do not require full pose information.

3.3.2.4 Shape from Light Fields

Using LF cameras for estimating the shape of objects is a relatively recent development. Most research has been focused on reconstructing Lambertian objects. Tao et al. recently used cues from both defocus and correspondence within a single LF exposure to obtain depth. The two measures of depth were combined to provide more accurate dense depth maps than either method alone [Tao et al., 2013]. Luke et al. provided a framework to estimate depth by working directly with the 4D LF in terms of gradients, as opposed to other methods that only exploited 2D epipolar slices of the 4D LF [Luke et al., 2014]. Wanner & Goldluecke formulated a structure tensor for each pixel to give local estimates of the slopes of lines in the epipolar plane images. A global optimisation method was used to combine these local depth estimates in a consistent manner [Wanner and Goldluecke, 2012]. Their approach yielded high quality, dense depth maps, but required significant computation time, easily over four hours for a single light field, which may not be practical for online robotics applications. Recently, Strecke et al. developed a method to jointly estimate depth and normal maps from a 4D light field on Lambertian surfaces using focal stacks generated from a single light field [Strecke et al., 2017]. However, none of these methods considered refractive objects.

Wanner et al. were the first to recover the shape of planar specular and transparent surfaces from an LF [Wanner and Goldluecke, 2013]. They assumed that the observed light was a linear combination of the real surface and the reflected or refracted image. The epipolar plane image can then be described as a superposition of two lines whose slopes are related to depth. Both depths were determined and used to separate the scene into a layer closer to the camera and a layer farther from the camera. However, Wanner's method was limited to single-reflection cases and planar reflective or transparent surfaces. Our interest is in interacting with more general object shapes.

Furthering the work towards slightly more general shapes, Wetzstein et al. reconstructed the shape of transparent surfaces based on the distortion of the light field's background light paths [Wetzstein et al., 2011]. Their method relied on a light-field probe that consisted of a lenslet array in front of a monitor to encode two dimensions in position and two dimensions in direction. Thus a monocular camera could measure a 4D LF in a single 2D image. The thin refractive object was placed between the probe and the monocular camera. Since the start of each light path was known by calibration, the difference between the incoming and exiting angles θi and θo could be computed assuming known refractive indices of the two media, and the surface normals were subsequently determined. However, this approach relied on placing the light-field probe behind the object while photographing its front, and only applied to thin objects. Thus the general placement of refractive objects in cluttered scenes and mobile applications would be problematic for this approach.

Recently, Ideguchi et al. proposed an interesting approach to transparent shape estimation based on comparing the different disparities between sub-images for a given visual feature in the light field, which they called light-field convergency [Ideguchi et al., 2017]. It is known that as a visual feature approaches an occluding edge of a refractive object in a light field, it appears increasingly Lambertian. A deeper analysis of their approach suggests that their method performs an approximate Lambertian depth estimate similar to focus stacking and then fills in inconsistent or missing depths using traditional hole-filling methods that assume smooth surfaces. This approximation is only valid near occluding edges of refractive objects; thus, their method was unable to handle thick and wide shapes, such as spheres.

Overall, the bulk of shape estimation techniques using LF cameras has been focused on Lambertian cases, leaving the topic of refractive objects little explored. Those works that have addressed refractive objects have been limited in terms of the types of objects they apply to, or require bulky equipment that is not practical for mobile robots.

3.4 Summary

In summary, we have reviewed the topics that have been explored in the realm of features and visual servoing in the context of LF cameras and refractive objects. Our motivation is to enable visual control around refractive objects using LF cameras.

Most image features in robotic vision have been limited to 2D and 3D and rely heavily on the Lambertian assumption. Recent 4D LF-specific features have been proposed, but still predominantly only consider Lambertian or occluded scenes. LF features in relation to refractive objects are not yet well explored.

For visual servoing, PBVS methods appear to be impractical because they require a model of the refractive object. Various IBVS methods have been developed, but the focus has been largely on Lambertian scenes. To the best of our knowledge, IBVS in the context of refractive objects or LF cameras remains unexplored.

Finally, model-based solutions for refractive object detection have been explored; however, 3D geometric models of refractive objects are time-consuming and difficult to obtain accurately, or simply not available. Thus there is interest in approaches that do not require models. Image-based detection methods are so far limited in their application, unreliable under changes in viewing pose, or incomplete in describing a refracted feature's behaviour in the light field. Additionally, most solutions require bulky equipment that is impractical for mobile robotic platforms, while others rely on assumptions that significantly narrow their application window. Clearly there is a gap for methods that are compact and apply to a wide variety of object shapes.

Chapter 4

Light-Field Image-Based Visual Servoing

In the background section, we introduced LF cameras and saw that they were good for capturing scene texture, depth and view-dependent lighting effects, such as occlusion, specular reflection and refraction. In the following chapters, we will elaborate on how we will use them to reliably perceive refractive objects and servo towards them for grasping and manipulation. However, the first practical issue that must be addressed is how to actually perform visual servoing (VS) with an LF camera in Lambertian scenes. This chapter focuses on how to directly control robot motion using observations from an LF camera via image-based visual servoing (IBVS) for Lambertian scenes. This work was published in [Tsai et al., 2017].

4.1 Light-Field Cameras for Visual Servoing

VS is a robot control technique that makes direct use of visual information by placing the camera in the control loop. VS is widely applicable and generally robust to errors in camera calibration, robot calibration and image measurement [Hutchinson et al., 1996, Chaumette, 1998, Cai et al., 2013]. Most VS techniques fall into one of two categories. Position-based visual servoing (PBVS) uses observed image features and a geometric object model to estimate the camera-object relative pose and adjust the camera pose accordingly; however, geometric object models are not always available. In contrast, image-based visual servoing (IBVS) uses observed image features and a reference image, from which a set of reference image features are extracted, to directly estimate the required rate of change of camera pose, which does not necessarily require a geometric model.

However, most IBVS algorithms are focused on conventional monocular cameras that inherently suffer from lack of depth information, provide limited observations of small or distant targets relative to the camera's FOV, and struggle with occlusions, specular highlights and refractive objects. LF cameras offer a potential solution to these problems. As a first step in exploring LF for IBVS, this chapter considers the multiple views and depth information implicit in the LF structure. To the best of our knowledge, light-field image-based visual servoing (LF-IBVS) has not yet been proposed.

The main contributions of this chapter are as follows:

• We provide the first derivation, implementation and experimental validation of LF-IBVS.

• We derive image Jacobians for the LF.

• We define an appropriate compact representation for LF features that is close to the form measured directly by LF cameras.

• In addition, we take a step towards truly 4D plenoptic feature extraction by enforcing LF geometry in feature detection and correspondence.

We assume a Lambertian scene and sufficient scene texture for classical 2D image features, such as SIFT and SURF. We validate our proposed method for LF-IBVS using both a simulated camera array and a custom LF camera adapter, shown in Fig. 4.1, which we refer to as the MirrorCam, mounted on a robot manipulator.

Figure 4.1: (a) MirrorCam mounted on the Kinova MICO robot manipulator. Nine mirrors of different shape and orientation reflect the scene into the upwards-facing camera to create 9 virtual cameras, which provides video frame-rate LFs. (b) A whole image captured by the MirrorCam and (c) the same decoded into a light-field parameterisation of 9 sub-images, visualized as a 2D tiling of 2D images. The non-rectangular sub-images allow for greater FOV overlap.

We describe MirrorCam in detail in Appendix A. Finally, we show that LF-IBVS outperforms conventional monocular and stereo IBVS for objects occupying the same FOV and in the presence of occlusions.

The remainder of this chapter is organized as follows. Section 4.2 discusses the related work. Section 4.3 formulates our Lambertian light-field feature. Section 4.4 explains the derivations of the LF image Jacobians. Section 4.5 describes our implementation, including feature correspondence and the control loop, and our experimental setup with the MirrorCam. Section 4.6 shows our results, and provides a comparison to conventional monocular and stereo IBVS. Lastly, in Section 4.7, we conclude the chapter and explore future work for LF-IBVS.

4.2 Related Work

LF cameras offer extra capabilities for robotic vision. Table 4.1 compares conventional and LF camera systems for different capabilities and tolerances related to VS, given similar configurations, such as sensor size and number of pixels. Notably, stereo provides depth for a single baseline along a single direction (typically horizontal), but multi-camera and LF systems provide more detailed depth information. They can have both small and long baselines, and have baselines in multiple directions (typically vertical and horizontal). LF cameras have an advantage over conventional multi-camera systems for tolerating occlusions and specular reflections (or more generally non-Lambertian surfaces). This is largely due to the regular sampling, and because only LF cameras capture refraction, transparency and specular reflections natively. As such, LF cameras can benefit from methods that exploit these capabilities [Dansereau, 2014].

Table 4.1: Comparison of camera systems’ capabilities and tolerances for VS

System             Perspectives  Field of View  Baseline  Baseline Direction  Aperture Problem  Occlusion Tolerance  Specular Tolerance

Conventional cameras
  Mono             1             wide           zero      none                significant       no                   no
  Stereo           2             wide           wide      single              moderate          weak                 no
  Trinocular       3             wide           wide      two                 moderate          moderate             no
  Multiple cameras n             wide           wide      multiple            minor             moderate             no

Light-field cameras
  Array            n^2           wide           wide      multiple            minor             strong               yes
  MLA (a)          n^2           wide           narrow    multiple            minor             moderate             yes
  MirrorCam (b)    n^2           narrow         wide      multiple            minor             strong               yes

(a) Based on n^2 pixels per lenslet.  (b) Based on n^2 mirrors.

Johannsen et al. recently applied LFs in structure from motion [Johannsen et al., 2015]. They derived a linear relationship using the LF to solve the correspondence problem and compute a 3D point cloud. They achieved an increase in accuracy and robustness, although their 3D-3D approach did not take full advantage of the 4D LF. Dong et al. focused on Simultaneous Localization and Mapping (SLAM), and demonstrated that an optimally-designed low-resolution LF camera allowed them to develop a SLAM implementation that is more computationally efficient, and more accurate, than SLAM for a single high-resolution camera [Dong et al., 2013]. Dansereau et al. derived “plenoptic flow” for closed-form, computationally efficient visual odometry with a fixed operation time regardless of scene complexity [Dansereau et al., 2011]. Zeller et al. extended Dansereau's plenoptic flow to narrow-FOV visual odometry and showed how LF cameras can enable SLAM for narrow-FOV systems, where monocular SLAM normally fails [Zeller et al., 2015]. That work also showed that using LF cameras with their visual odometry method improved the depth estimation error by an order of magnitude. Recently, Walter et al. used LF cameras to analyse specular reflection and detect features specific to specular reflections, which enabled robots to interact with glossy objects and outperform their stereo counterparts [Walter et al., 2015]. These works motivate the application of LFs to robotics and LF-IBVS.

4.3 Lambertian Light-Field Feature

Recall from Section 2.7 that the rays emitted from a point in space, cP = [Px, Py, Pz]^T, follow a pair of linear relationships [Bolles et al., 1987, Dansereau and Bruton, 2007], as shown in Figs. 2.21 and 2.22,

\[
\begin{bmatrix} u \\ v \end{bmatrix} = \frac{D}{P_z}\begin{bmatrix} P_x - s \\ P_y - t \end{bmatrix}, \tag{4.1}
\]

where each equation describes a hyperplane in 4D, F(s,t,u,v) ∈ R^3, and their intersection describes a plane L(s,t,u,v) ∈ R^2.

We define our LF feature with respect to the central view of the LF as W = [u0, v0, w]^T, where (u0, v0) is the direction of the ray entering the central view of the LF, i.e.

\[
\begin{bmatrix} u_0 \\ v_0 \end{bmatrix} = \begin{bmatrix} u \\ v \end{bmatrix}\bigg|_{s,t=0} = \frac{D}{P_z}\begin{bmatrix} P_x \\ P_y \end{bmatrix}. \tag{4.2}
\]

As discussed in Section 2.7.4, the slope w relates the image plane coordinates for all rays emitted from a point in the scene. Fig. 2.21 shows the geometry of the LF for a single view of cP. As the viewpoint changes, that is, as s and t change, the image plane coordinates vary linearly according to (4.1), as in Fig. 2.22. The slope w of this line comes directly from (4.1), and is given by

\[
w = -\frac{D}{P_z}, \tag{4.3}
\]

noting that this slope is identical in the s,u and t,v planes. We exploit this aspect of the LF in the feature matching and correspondence process, described in Section 4.5.1. By working with slope, akin to disparity in stereo algorithms, we deal more closely with the structure of the LF.
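As a quick numerical check of (4.1)–(4.3), the sketch below projects a Lambertian point into several views of an ideal two-plane-parameterised LF and confirms that its image-plane coordinate varies linearly in s with slope w = -D/Pz; the plane separation D and the point coordinates are illustrative values only.

```python
import numpy as np

D = 0.1                          # reference plane separation [m] (illustrative)
P = np.array([0.2, -0.1, 1.5])   # Lambertian point cP = [Px, Py, Pz] (illustrative)

def project(P, s, t):
    """Project a 3D point into the (u, v) plane of the view at (s, t), per (4.1)."""
    u = D * (P[0] - s) / P[2]
    v = D * (P[1] - t) / P[2]
    return u, v

u0, v0 = project(P, 0.0, 0.0)    # central-view feature, per (4.2)
w = -D / P[2]                    # slope, per (4.3)

# In every other view the projection lies on the plane defined by that slope
for s in (-0.02, 0.0, 0.02):
    u, _ = project(P, s, 0.0)
    assert np.isclose(u, u0 + w * s)   # u varies linearly in s with slope w
print(u0, v0, w)
```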

Our LF feature is similar to the Augmented Image Space of [Jang et al., 1991] for perspective images where the image plane coordinates are augmented with Cartesian depth. Also similar are the plenoptic disk features developed for the calibration of lenslet-based LF cameras in [O’Brien et al., 2018]. In plenoptic disk features, image feature coordinates are augmented with the radius of the plenoptic disk, which is related (by similar triangles) to a Lambertian point’s depth.

4.4 Light-Field Image-Based Visual Servoing

In this section, we derive the image Jacobians for our LF feature, which are used for image-based visual servoing. Image Jacobians relate image feature velocity (in image space) to camera velocity in translation and rotation. We first consider the continuous domain, where s,t,u,v are distances. Then we consider the discrete domain, where i, j and k,l are discrete versions of s,t and u,v, and typically correspond to different views and pixels, respectively.

4.4.1 Continuous-domain Image Jacobian

Following the derivation for conventional IBVS, we wish to relate the camera’s velocity to the resulting change in an observed feature W through a continuous-domain image Jacobian JC

\[
\dot{W} = J_C \, \nu, \tag{4.4}
\]

where ν = [v; ω] ∈ R^6 is the camera spatial velocity in the camera reference frame: the concatenation of the camera's translational velocity v = [vx, vy, vz]^T and rotational velocity ω = [ωx, ωy, ωz]^T.

Differentiation of (4.2) and (4.3) yields

\begin{align}
\dot{u}_0 &= D(\dot{P}_x P_z - P_x \dot{P}_z)/P_z^2, \tag{4.5}\\
\dot{v}_0 &= D(\dot{P}_y P_z - P_y \dot{P}_z)/P_z^2, \tag{4.6}\\
\dot{w}   &= D\,\dot{P}_z/P_z^2, \tag{4.7}
\end{align}

where u0, v0 and w are the feature components, and their derivatives are the feature velocities, expressed with respect to the central camera frame.

We can write the apparent motion of a 3D point as

\[
{}^{c}\dot{P} = -(\omega \times {}^{c}P) - v, \tag{4.8}
\]

yielding the three components of cṖ expressed in terms of cP and ν. Substituting these expressions into (4.5)–(4.7) allows us to factor out the continuous-domain Jacobian

\[
J_C = \begin{bmatrix}
w & 0 & -\dfrac{w u_0}{D} & \dfrac{u_0 v_0}{D} & -\left(D + \dfrac{u_0^2}{D}\right) & v_0 \\[2mm]
0 & w & -\dfrac{w v_0}{D} & D + \dfrac{v_0^2}{D} & -\dfrac{u_0 v_0}{D} & -u_0 \\[2mm]
0 & 0 & -\dfrac{w^2}{D} & \dfrac{w v_0}{D} & -\dfrac{w u_0}{D} & 0
\end{bmatrix}. \tag{4.9}
\]

While conventional image Jacobians require an estimate of depth, we note that JC instead contains the slope w, an inverse measure of depth, which we can observe directly in the LF. The slope w is explicit in all columns of (4.9) except the last one, because the LF camera array spans both the x- and y-axes, and can therefore only observe motion parallax with respect to the camera's x- and y-axes. The optical flow for the final column is due to rotation about the optical axis, and is therefore invariant to depth. In contrast, depth is not explicit in the monocular image Jacobian for rotations about the x- and y-axes. Trinocular and multi-camera system image Jacobians would have similar depth dependencies to JC. Multiple views make parallax, and thus depth, observable in rotations about the x- and y-axes for the LF camera array. We note that the derivation for JC is for the central view of the LF camera array; Jacobians derived for the off-axis in-plane views would contain elements of slope in the last column. Additionally, JC has a rank of three, which implies that the stacked image Jacobian (as in (3.5)) will be full rank with a minimum of two points for LF-IBVS, in contrast to a minimum of three image points for M-IBVS.
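For reference, a small sketch that assembles J_C from (4.9) for a single feature W = [u0, v0, w] and checks its rank numerically; the feature values and D below are placeholders, not calibration data.

```python
import numpy as np

def continuous_jacobian(u0, v0, w, D):
    """Continuous-domain image Jacobian J_C of (4.9) for one LF feature."""
    return np.array([
        [w, 0, -w * u0 / D,  u0 * v0 / D,     -(D + u0**2 / D),  v0],
        [0, w, -w * v0 / D,  D + v0**2 / D,   -u0 * v0 / D,     -u0],
        [0, 0, -w**2 / D,    w * v0 / D,      -w * u0 / D,        0],
    ])

Jc = continuous_jacobian(u0=0.01, v0=-0.005, w=-0.07, D=0.1)
print(np.linalg.matrix_rank(Jc))   # 3, so stacking two features gives a full-rank Jacobian
```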

4.4.2 Discrete-domain Image Jacobian

In the discrete domain, we define i, j and k,l as the discrete versions of s,t and u,v, in units of “views” and pixels, respectively. We observe our discrete-domain feature M as the discrete position and slope M = [k0, l0, mx, my]^T, where [k0, l0] are observations taken from the central view in i, j, and mx and my are the separate slopes in the i,k and j,l dimensions, respectively. The general plenoptic camera is described by an intrinsic matrix H relating a homogeneous ray φ̃ = [s, t, u, v, 1]^T to the corresponding sample in the LF ñ = [i, j, k, l, 1]^T as in

\[
\tilde{\phi} = H\tilde{n}, \tag{4.10}
\]

where in general H is of the form

\[
H = \begin{bmatrix}
h_{11} & 0 & h_{13} & 0 & h_{15} \\
0 & h_{22} & 0 & h_{24} & h_{25} \\
h_{31} & 0 & h_{33} & 0 & h_{35} \\
0 & h_{42} & 0 & h_{44} & h_{45} \\
0 & 0 & 0 & 0 & 1
\end{bmatrix}, \tag{4.11}
\]

and the matrix H is found through plenoptic camera calibration [Dansereau et al., 2013].

However, we limit our development to the case of a rectified camera array, for which only the diagonal entries and the final column are nonzero [Dansereau, 2014]. In this case h11 and h22 are the horizontal and vertical camera array spacing, in meters, and h33 and h44 are given by D/fx and D/fy, i.e. the inverse of the horizontal and vertical focal lengths of the cameras, expressed in pixels, scaled by the reference plane separation. The final column encodes the centre of the LF, e.g. for Nk samples in k, h15 = -h11(Nk/2 + 1/2) and k = Nk/2 + 1/2 is the centre sample in k. We also note that mx and my encode the same information following the relationship

\[
m_x = \frac{h_{11} h_{44}}{h_{22} h_{33}}\, m_y. \tag{4.12}
\]

We wish to express the image Jacobian of (4.4) in the discrete domain,

\[
\dot{M} = [\dot{\bar{k}}_0, \dot{\bar{l}}_0, \dot{m}_x]^T = J_D\, \nu, \tag{4.13}
\]

where the observation is expressed relative to the LF centre, with k̄0 = k0 + h35/h33 and l̄0 = l0 + h45/h44. From (4.10), we can relate the discrete and continuous-domain observations as

\[
u_0 = h_{33}\bar{k}_0, \qquad v_0 = h_{44}\bar{l}_0, \qquad w = \frac{h_{33}}{h_{11}} m_x = \frac{h_{44}}{h_{22}} m_y, \tag{4.14}
\]

from which it is trivial to express the derivatives of the discrete observation in terms of the continuous variables:

\[
\dot{\bar{k}}_0 = h_{33}^{-1}\dot{u}_0, \qquad \dot{\bar{l}}_0 = h_{44}^{-1}\dot{v}_0, \qquad \dot{m}_x = \frac{h_{11}}{h_{33}}\dot{w}, \qquad \dot{m}_y = \frac{h_{22}}{h_{44}}\dot{w}. \tag{4.15}
\]

Substituting the continuous-domain derivatives of (4.4) and (4.9), together with the discrete/continuous relationships of (4.14), into (4.15) allows us to factor out the discrete-domain Jacobian

\[
J_D = \begin{bmatrix}
\dfrac{m_x}{h_{11}} & 0 & -\dfrac{h_{33}}{h_{11}}\dfrac{\bar{k}_0 m_x}{D} & h_{44}\dfrac{\bar{k}_0 \bar{l}_0}{D} & -h_{33}\dfrac{\bar{k}_0^2}{D} - \dfrac{D}{h_{33}} & \dfrac{h_{44}}{h_{33}}\bar{l}_0 \\[2mm]
0 & \dfrac{m_y}{h_{22}} & -\dfrac{h_{44}}{h_{22}}\dfrac{\bar{l}_0 m_y}{D} & h_{44}\dfrac{\bar{l}_0^2}{D} + \dfrac{D}{h_{44}} & -h_{33}\dfrac{\bar{k}_0 \bar{l}_0}{D} & -\dfrac{h_{33}}{h_{44}}\bar{k}_0 \\[2mm]
0 & 0 & -\dfrac{h_{33}}{h_{11}}\dfrac{m_x^2}{D} & h_{44}\dfrac{\bar{l}_0 m_x}{D} & -h_{33}\dfrac{\bar{k}_0 m_x}{D} & 0
\end{bmatrix}. \tag{4.16}
\]
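The same check in the discrete domain: a sketch of (4.16) for a rectified array, using h11, h22 for the camera spacing and h33, h44 for the scaled inverse focal lengths; every numeric value below is a placeholder.

```python
import numpy as np

def discrete_jacobian(k0b, l0b, mx, my, D, h11, h22, h33, h44):
    """Discrete-domain image Jacobian J_D of (4.16) for a rectified camera array.
    k0b, l0b are the central-view coordinates relative to the LF centre;
    mx, my are the slopes observed in the (i, k) and (j, l) dimensions."""
    return np.array([
        [mx / h11, 0, -(h33 / h11) * k0b * mx / D, h44 * k0b * l0b / D,
         -h33 * k0b**2 / D - D / h33, (h44 / h33) * l0b],
        [0, my / h22, -(h44 / h22) * l0b * my / D, h44 * l0b**2 / D + D / h44,
         -h33 * k0b * l0b / D, -(h33 / h44) * k0b],
        [0, 0, -(h33 / h11) * mx**2 / D, h44 * l0b * mx / D,
         -h33 * k0b * mx / D, 0],
    ])

# Placeholder rectified intrinsics and observation
h11, h22 = 0.01, 0.01        # camera spacing [m]
h33, h44 = 1e-4, 1e-4        # D / focal length [m/pixel]
D, w = 0.1, -0.07
Jd = discrete_jacobian(k0b=12.0, l0b=-8.0,
                       mx=w * h11 / h33, my=w * h22 / h44,   # from (4.14)
                       D=D, h11=h11, h22=h22, h33=h33, h44=h44)
print(Jd.shape, np.linalg.matrix_rank(Jd))
```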

4.5 Implementation & Experimental Setup

In this section, we discuss the implementation details of our LF-IBVS approach, including how we exploit the LF structure for feature matching and correspondence. We then validate our proposed derivation of LF-IBVS using closed loop control and the experimental setup described below.

4.5.1 Light-Field Features

To our knowledge, all prior work on LF features operates by applying 2D feature detectors to 2D slices in the u,v dimensions [Johannsen et al., 2015]. In this chapter, we do the same. Our implementation employs Speeded-Up Robust Features (SURF) [Bay et al., 2008], though the proposed method is agnostic to feature type. However, as a first step towards truly 4D features, we augment the 2D feature location with the local light-field slope, implicitly encoding depth.

Operating on 2D slices of the LF, feature matches are found between the central view and all other sub-images. Each pair of matched 2D features is treated as a potential 4D feature. A single feature pair yields a slope estimate, which defines an expected feature location in all other sub-images. We introduce a tunable constant that determines the maximum distance between observed and expected feature locations, in pixels, and reject all matches exceeding this limit. We also reject features that break the point-plane correspondence discussed in Section 4.3. By selecting only features that adhere to the planar relationship (4.1), we can remove spurious and inconsistent detections.

A second constant NMIN imposes the minimum number of sub-images in which feature matches must be found. In the absence of occlusions, this can be set to require feature matches in all sub-images. Any feature that is below the maximum distance criterion in at least NMIN images is accepted as a 4D feature, and a mean slope estimate is formed based on all passing sub-images.

NMIN was set to 4 out of 8 sub-image matches for our experiments.
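A minimal sketch of this matching procedure, assuming a placeholder 2D detector/matcher `match_2d_features` (e.g. SURF) that is not part of the thesis code; matches from the central view are used to estimate a slope, matches too far from the location predicted by the planar relationship (4.1) are rejected, and a feature is accepted only if it survives in at least NMIN sub-images.

```python
import numpy as np

MAX_PIX_ERR = 1.5   # tunable: max distance between observed and predicted location [pixels]
N_MIN = 4           # minimum number of sub-images with a consistent match

def match_2d_features(view_a, view_b):
    """Placeholder for a 2D detector/matcher such as SURF: returns a list of
    matched pixel pairs ((k0, l0), (k, l)) between the two views."""
    raise NotImplementedError

def extract_lf_features(views, centre=(1, 1)):
    """views: dict mapping view index (i, j) -> image.
    Returns accepted 4D features (k0, l0, slope) that obey the planar relationship (4.1)."""
    ci, cj = centre
    candidates = {}                         # (k0, l0) -> list of (di, dj, k, l)
    for (i, j), view in views.items():
        if (i, j) == (ci, cj):
            continue
        for (k0, l0), (k, l) in match_2d_features(views[(ci, cj)], view):
            candidates.setdefault((k0, l0), []).append((i - ci, j - cj, k, l))

    accepted = []
    for (k0, l0), obs in candidates.items():
        # a single match pair gives a slope estimate ...
        di, dj, k, l = obs[0]
        m = (k - k0) / di if di else (l - l0) / dj
        # ... which predicts where the feature should appear in every other sub-image
        good = [(di, dj, k, l) for di, dj, k, l in obs
                if np.hypot(k - (k0 + m * di), l - (l0 + m * dj)) < MAX_PIX_ERR]
        if len(good) >= N_MIN:
            slopes = [(k - k0) / di for di, dj, k, l in good if di]
            accepted.append((k0, l0, float(np.mean(slopes))))
    return accepted
```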

Feature matching between two LFs again starts with conventional 2D methods. A conventional 2D feature match finds putative correspondences between the central sub-images of the two LFs. Outlier rejection is performed using the M-estimator Sample Consensus algorithm [Torr and Zisserman, 2000].

4.5.2 Mirror-Based Light-Field Camera Adapter

There is a scarcity of commercially available LF cameras appropriate for robotics applications. Notably, no commercial camera delivers 4D LFs at video frame rates. Therefore, we constructed our own LF video camera, the MirrorCam, by employing a mirror-based adapter based on previous work [Fuchs et al., 2013, Song et al., 2015]. The MirrorCam is depicted in Fig. 4.1a. The MirrorCam design, optimisation, construction, calibration, and image decoding processes are described in Appendix A [Tsai et al., 2016]. This approach splits the camera's field of view into sub-images using an array of planar mirrors, as shown in Fig. 4.1c. By appropriately positioning the mirrors, a grid of virtual views with overlapping fields of view can be constructed, effectively capturing an LF. We 3D-printed the mount based on our optimization, and populated it with laser-cut acrylic mirrors. Note that the LF-IBVS method described in this chapter does not rely on this particular LF camera design, and applies to 4D LFs in general.

4.5.3 Control Loop

The proposed LF-IBVS control loop is depicted in Fig. 4.2. Notably, this control loop is similar to that of standard VS. Goal light-field features f ∗ ∈ R3 are compared to observed light-field features f ∈ R3 to produce a light-field feature error. The camera spatial velocity ν can then be calculated as in (3.6) by multiplying the light-field feature error with the pseudo-inverse of the stacked image Jacobians and then multiplying it by a gain λ.

Velocity control is formulated in (3.6). We assume infinitesimal motion to convert ν into a homogeneous transform cT that we use to update the camera’s pose. A motion controller moves the robot arm. After finishing the motion, a new light field is taken and the feedback loop repeats until the light-field feature error converges to zero.

An important consideration in LF-IBVS is the feature representation, because the choice of feature representation in IBVS influences the Cartesian motion of the camera [Mahony et al., 2002]. We have the option of computing the 3D positions of the points obtained from the LF; however, this would be very similar to PBVS. Instead, we chose to work more closely to the native LF representation, working with projected feature position, augmented by slope. Doing so avoids unnecessary computation, and is more numerically stable as depth computation involves inverting slope.
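A minimal sketch of one iteration of this loop, reusing the `discrete_jacobian` sketch shown earlier and assuming that matched goal and observed features are already available as (k̄0, l̄0, mx) triples; the gain and sign convention follow the description above and may need adjusting for a particular robot.

```python
import numpy as np

LAMBDA = 0.1   # control gain; small values give a smoother trajectory

def lf_ibvs_step(goal_features, observed_features, D, intrinsics):
    """One LF-IBVS iteration: returns the commanded camera spatial velocity nu.
    goal_features, observed_features: (N, 3) arrays of matched [k0_bar, l0_bar, mx].
    intrinsics: (h11, h22, h33, h44) of the rectified LF camera."""
    h11, h22, h33, h44 = intrinsics
    error = (goal_features - observed_features).reshape(-1)   # stacked feature error f* - f
    J = np.vstack([
        discrete_jacobian(k0b, l0b, mx,
                          (h22 * h33) / (h11 * h44) * mx,      # my from mx, per (4.12)
                          D, h11, h22, h33, h44)
        for k0b, l0b, mx in observed_features
    ])
    return LAMBDA * np.linalg.pinv(J) @ error   # nu = [vx, vy, vz, wx, wy, wz]
```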


Figure 4.2: The control loop for the VS system. Goal features f* are given. Then f* and f are compared, the pseudo-inverse of the stacked image Jacobian, J+, is computed, and the camera velocity ν is determined with gain λ and converted into a motion cT. A motion controller moves the robot arm. After finishing the motion, a new image is taken and the feedback loop repeats until the image features match.

We define the terminal condition for LF-IBVS as a threshold on the root mean square (RMS) error between all of the observed LF features and the goal LF features. We combine all components of M, and note that (u0, v0) are in meters and (k̄0, l̄0) are in pixels, but the slope w is unit-less. This issue can be addressed by weighting the components; however, for the discrete case, in practice we found that mx and my had similar relative magnitudes. The relative magnitudes of the light-field feature elements are important because they define the error term, which in turn drives the system to minimise the light-field feature error. Extremely large magnitudes for slope could potentially place more emphasis on z-axis, or depth-related, camera motion than on x- or y-axis camera motion. Additionally, we typically use a small λ of 0.1 in order to generate a smooth trajectory towards the goal view.

For the robotic manipulator, we found that the manufacturer's built-in inverse kinematics software became unresponsive for small pose adjustments1. Therefore we implemented a resolved-rate motion control method using the manipulator Jacobian to convert commanded camera spatial velocities into desired joint velocities [Corke, 2013]. We also changed the proportional, integral and derivative controller gains for all joints to KP = 2.0, KI = 4.8, and KD = 0.0, respectively. With these implementations, we achieved sufficient positional accuracy and resolution to demonstrate LF-IBVS.
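A sketch of the resolved-rate layer, assuming a user-supplied `manipulator_jacobian(q)` (from a robotics toolbox or an analytic model) expressed in the same frame as the commanded camera velocity, i.e. with the hand-eye transform already applied; the control period is a placeholder.

```python
import numpy as np

DT = 0.05   # control period [s] (placeholder)

def resolved_rate_step(q, nu_camera, manipulator_jacobian):
    """Map a commanded camera spatial velocity to a joint-space command.
    q: current joint angles; manipulator_jacobian(q): 6 x n Jacobian in the
    same frame as nu_camera."""
    J = manipulator_jacobian(q)
    q_dot = np.linalg.pinv(J) @ nu_camera   # resolved-rate: joint rates from camera velocity
    return q + q_dot * DT                   # simple Euler update of the joint command
```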

4.6 Experimental Results

In this section, we present our experimental results from camera array simulation and arm-mounted experiments using a custom mirror-based light-field camera. First, we show the LF camera and light-field feature trajectories over the sequence of a typical visual servoing manoeuvre in simulation. Second, we compare LF-IBVS to monocular and stereo IBVS in a typical unoccluded scene. Finally, we compare the same three VS systems in an occluded scene.

1Limits were determined experimentally and confirmed by the manufacturer.

4.6.1 Camera Array Simulation

In order to verify our LF-IBVS algorithm, we first simulated a 3 × 3 array of pinhole camera models from the Machine Vision Toolbox [Corke, 2013]. Four planar world points in 3D were projected into the image planes of the 9 cameras. A typical example of LF-IBVS is shown in Fig. 4.3. For this example, a small gain λ = 0.1 was used to enforce small steps and produce smooth plots, as shown in Fig. 4.3a. The Cartesian positions and orientations relative to the goal pose converge smoothly to zero, as shown in Fig. 4.3b. Similarly, the camera velocity profiles in Fig. 4.3c converge to zero. Fig. 4.3d shows the image Jacobian condition number first increases, and then decreases to a constant lower value, indicating that the Jacobian becomes worse and then better conditioned, as the features move closer and then further apart, respectively. Together, these figures show the system converges, indicating that LF-IBVS was successful in simulation. Similar to conventional IBVS, a large λ results in faster convergence, but a less smooth trajectory.

Fig. 4.4a shows the view of the central camera, and the image feature paths as the camera array servos to the goal view. We see that the image feature paths are almost straight due to the linearisation of the Jacobian. Fig. 4.4b shows the trajectories of the top-left corner of the target relative to the goal features, which also converge to zero. We note the slope profile matches the inverse of the z-position profile in the top red line of Fig. 4.3b, as it encodes depth.

For large initial angular displacements, we note that like regular IBVS, this formulation of LF-IBVS exhibited camera retreat issues. Instead of taking the straight-forward screw motion towards the goal, the camera retreats backwards before moving forwards to reach the goal view. In these situations, the Jacobian linearisation is no longer valid, since the image feature paths are optimally curved, rather than linear. This poses a performance issue because in real systems, such backwards manoeuvres may not be feasible; however, retreat can be addressed [Corke and Hutchinson, 2001] by decoupling the translation components from the z-axis rotation into two separate image Jacobians, and will be considered in future work.

Figure 4.3: Simulation of LF-IBVS, with (a) the error (RMS of f − f*) decreasing over time, (b) the camera motion profiles relative to the goal pose, (c) the Cartesian velocities, and (d) the image Jacobian's condition number, for λ = 0.1. Ideally, the condition number is low (or decreases over time), which means the system is well-conditioned and therefore less sensitive to changes or errors in the input. Error, relative pose and velocities all converge to zero.

Figure 4.4: Simulation of (a) the view of the initial target points (blue), servoing along the image plane feature paths (green) to the target goal (red), and (b) the feature trajectory profile of M − M*, corresponding to the top left corner of the target, which converges to zero.

4.6.2 Arm-Mounted MirrorCam Experiments

We also validated LF-IBVS using the MirrorCam mounted to the end of a Kinova MICO arm robot, shown in Fig. 4.1a. The robot arm and camera were controlled using the architecture outlined in Fig. 4.2. We then performed two experiments for M-IBVS, stereo image-based visual servoing (S-IBVS) and LF-IBVS. The first involved a typical approach manoeuvre in a Lambertian scene to evaluate the nominal performance of our LF-IBVS system. The second involved adding occlusions after the goal image/light field was captured in order to explore the effect of occlusions on LF-IBVS.

4.6.2.1 Lambertian Scene Experiment

We first tested the MirrorCam on a Lambertian scene similar to Fig. 4.1b, with complex motion involving all 6 DOF from the initial pose. In a typical VS sequence, we move the robot to a goal pose, record the camera pose and goal features, then move the robot to an initial pose and use the features to servo back to the goal.

Fig. 4.5 shows the performance of our LF-IBVS algorithm for the scene with λ =0.15. Fig. 4.5a shows the error decreasing over time as the camera approaches the goal view, and converges after 20 time steps. We attribute the non-zero error to the arm’s limited performance, which we address at the end of this section. Fig. 4.5b shows the relative pose of the camera to the goal in the camera frame converging smoothly to zero. Note that the goal pose is never the objective of LF-IBVS; rather, the image features captured at the goal pose drive LF-IBVS. Fig. 4.5c shows the commanded camera velocities also converge to zero. Fig. 4.5d shows the condition number for the image Jacobian, which decreases slightly as the system converges. We also note that despite only an approximate camera-to-end-effector calibration, the system converged, which suggests the robustness of the system against modelling errors.

We compared LF-IBVS against conventional M-IBVS and S-IBVS. Using the sub-images from the MirrorCam in Fig. 4.1c, we used the view through the central mirror for M-IBVS, and the two horizontally-adjacent views to the centre from the MirrorCam for S-IBVS. This maintained the same FOV and pixel resolution. Implementations were based on [Corke, 2013, Chaumette and Hutchinson, 2006]. The average scene depth was provided for M-IBVS and S-IBVS to compute the Jacobian, although we note depth, or disparity can be measured directly from stereo. All three IBVS methods were tested ten times on the same goal scene and initial pose.

A typical case for S-IBVS is shown in Fig. 4.6. The image feature error is not uniformly decreasing at the start, but eventually converges after 25 time steps. The camera moves in an erratic motion at the start in the x- and y-axes, but still manages to converge to the goal pose, as seen in the relative pose trajectories and camera velocities in Fig. 4.6b and 4.6c. This is probably not because λ was too high for S-IBVS; smaller gains were tested for S-IBVS, but yielded the same poor performance.

Instead, we observe that the S-IBVS Jacobian condition number in Fig. 4.6d was an order of magnitude higher than that of LF-IBVS, producing an almost rank-deficient Jacobian; such a Jacobian becomes an inaccurate approximation of the spatial velocities, and yields erratic motion.

Figure 4.5: Experimental results of LF-IBVS with the MirrorCam on the robot arm, illustrating (a) the error (RMS of M − M*), which converges after 20 time steps, (b) the camera motion profiles relative to the goal, which converge to zero, (c) the camera velocity profiles, which converge to zero, and (d) the image Jacobian condition number. Referring also to Fig. 4.6, we note that LF-IBVS outperforms S-IBVS; the motion profiles are much smoother, and the velocities and condition numbers are an order of magnitude smaller than those from S-IBVS.

Figure 4.6: Experimental results of S-IBVS with narrow-FOV sub-images from the MirrorCam, on the robot arm, illustrating (a) the error (RMS of p − p*), which eventually converges after 25 time steps, although its scale is almost double that of Fig. 4.5a, (b) the camera motion profiles relative to the goal, which show an erratic trajectory at the start, (c) the camera velocity profiles, which also vary greatly, and (d) the extremely large image Jacobian condition number, indicating a potentially unstable system (it can exhibit very large changes in camera velocity output for very small changes in image feature error).

We attribute this poor performance to the narrow FOV of the MirrorCam, which is approximately 20 degrees horizontally. The poor S-IBVS performance can therefore be attributed to the lack of perspective change, which is required to differentiate rotation from translation, particularly about the x- and y-axes.

During the experiments, M-IBVS exhibited much worse performance than S-IBVS, to the extent that such erratic motion caused the robot to completely lose view of the goal scene within two or three time steps. Therefore, M-IBVS velocity profiles are not shown in the results.

Equivalently, the projected scale of the object being servoed against affects the performance of IBVS; smaller or more distant objects yield poorly-conditioned image Jacobians. These observations are not new or surprising [Dong et al., 2013]. LF-IBVS outperformed both of our constrained implementations of M-IBVS and S-IBVS, as LF-IBVS converged with a smooth trajectory regardless of the narrow FOV constraints of the MirrorCam. These improvements were likely due to a much lower Jacobian condition number in LF-IBVS, which we attribute to the LF camera providing the perspective change required to differentiate rotation from translation, unlike the stereo and monocular systems. Therefore, the narrow-FOV constraint of the MirrorCam generalises to other camera systems viewing targets that are small or distant relative to the camera, where increasing the FOV would not help the system converge to the target.

4.6.2.2 Occluded Scene Experiment

Experiments with occlusions were also conducted using a series of black wires to partially occlude the scene. The setup is illustrated in Fig. 4.7 and 4.8. The goal, or reference, image was captured without the occlusions at a specified goal pose. An example image is shown in Fig. 4.8a. Next, the robot was moved to an initial pose, where the occlusions did not obscure the scene. Then the robot was allowed to servo towards the goal, along a path where the occlusions gradually obscured the goal view. The final goal image was partially occluded, as shown in Fig. 4.8b. M-IBVS, S-IBVS and LF-IBVS were run using the same setup. With the partially occluded views, M-IBVS and S-IBVS failed, whereas the LF-IBVS method servoed to the original goal pose.

Fig. 4.9 compares the number of features matched by LF-IBVS, M-IBVS, and S-IBVS in the occlusion experiment. Without any occlusions, we note that all three methods have a similar number of matched features at the goal view, although stereo and mono have slightly more matches than LF-IBVS throughout the experiment. This is likely because all three methods used similar 2D feature detection methods; however, our LF-IBVS approach also rejected those features that were inconsistent with LF geometry. In our experiment with occlusions, M-IBVS failed at time step 5, when it was unable to match sufficient features. Similarly, the performance of S-IBVS in our experiment quickly degraded at time step 10, as the occlusions covered most of the left view and significant portions of the right view.

On the other hand, in the presence of occlusions, LF-IBVS had fewer matches than in the unoccluded case, but still matched a consistent and sufficient number of features throughout its trajectory to converge. It was therefore apparent that LF-IBVS could utilize the LF camera's multiple views and baseline directions to handle partial occlusions. To further illustrate this, consider a scene where a 3D point is occluded from one of the LF camera's sub-views, but still visible in at least one other sub-view. A single LF is captured; thus there is no physical camera motion. Conventional image feature matching would fail for stereo vision systems in this situation, because the 3D point is occluded from one of the two views, and therefore not viable for image matching. However, our LF-camera-based method would still be able to perform matching using the other unoccluded views, provided a sufficient baseline. By setting a minimum number of views in which an image feature must be visible (NMIN), we have made it harder for image features to be matched (thus there are fewer image feature matches). Those that are matched are therefore more consistent for motion estimation applications, such as visual servoing. Thus, our feature extension from 2D to 4D enables our method to better deal with the presence of occlusions. Trinocular camera systems may also benefit from the occlusion tolerance that we demonstrated in Fig. 4.9 (albeit with far less tolerance due to significantly fewer views: three compared to n × n views, where n is typically three or greater), but they would lack tolerance to specular highlights and other non-Lambertian surfaces, as discussed in Table 4.1.


Figure 4.7: Occlusion experimental setup, showing the initial view of the scene (red) with no occlusions, the camera trajectory that gradually becomes more occluded, and converging to the goal view with partial occlusions (green).

Figure 4.8: Occlusion experiments showing (a) the goal view with no occlusions from the MirrorCam, and (b) the goal view, partially occluded by a box of black wires. The arm was able to reach the partially-occluded goal view using LF-IBVS, but not M-IBVS or S-IBVS. Images shown are flipped vertically.

4.7 Conclusions

In this chapter, we have proposed the first derivation, implementation, and validation of light-field image-based visual servoing. We have derived the image Jacobian for LF-IBVS based on an LF feature representation that is augmented by the local light-field slope. We have exploited the LF in our feature detection, correspondence, and matching processes. Using a basic VS control loop, we have shown in simulation and on a robotic platform that LF-IBVS is viable for controlling robot motion. Further research into alternative feature types and Jacobian decoupling strategies may address camera retreat and improve the performance of LF-IBVS.

Our implementation takes 5 seconds per frame to operate as unoptimized MATLAB code. The decoding and correspondence processes are the current bottlenecks. Through optimization, real-time LF-IBVS should be possible.

Our experimental results demonstrate that LF-IBVS is more tolerant than monocular and stereo methods to narrow-FOV constraints and partially-occluded scenes. Robotic applications operating in narrow, constrained and occluded environments, or those aimed at small or distant targets, such as household grasping, medical robotics, and in-orbit satellite servicing, would benefit from LF-IBVS. In future work, we will investigate other LF camera systems, explore how to further exploit the 4D nature of the light-field features, and examine LF-IBVS in the context of refractive objects, where the method should benefit significantly from the light field.


Figure 4.9: Experimental results for the number of features matched over time with occlusions (dashed) and without (solid), for LF-IBVS (red), S-IBVS (blue), and M-IBVS (black). The monocular and stereo methods fail at time steps 5 and 10, respectively, but LF-IBVS maintains enough feature matches to converge to the goal pose, which demonstrates that LF-IBVS is more robust to occlusions.

Chapter 5

Distinguishing Refracted Image Features with Application to Structure from Motion

Robots for the real world will inevitably have to perceive, grasp and manipulate refractive objects. However, refractive objects are particularly challenging for robots because these objects are difficult to perceive: they are often transparent and their appearance is essentially a distorted view of the background, which can change significantly with respect to small changes in viewpoint. The amount of distortion depends on the scene geometry, as well as the shape and refractive indices of the objects involved. As the robot approaches the refractive object, the refracted background can move differently compared to the rest of the non-refracted scene. Intuitively, the key to detecting refractive objects is to understand and characterise the background distortion caused by the refractive object.

This chapter is concerned with discriminating image features whose appearance has been distorted by a refractive object (refracted image features) from the surrounding Lambertian features. Robots will need to reliably operate in scenes with refractive objects in a variety of applications. Unfortunately, refractive objects can cause many robotic vision algorithms, such as structure from motion (SfM), to become unreliable or even fail. This is because these algorithms assume a Lambertian world, and do not know not to use refracted image features when estimating structure and motion.

Outlier rejection methods such as RANSAC have been used to remove refracted image features (outliers compared to the perceived relative motion of the robot) when the number of refracted image features is small relative to the number of Lambertian image features. However, there is a trade-off between computation and robustness when dealing with outlier-rich image feature sets1. More computation is required to deal with increasingly outlier-rich image feature sets. With limited computation, outlier rejection may return a sub-optimal inlier set, potentially leading to failure of the robotic vision system. Therefore, starting with a higher-quality set of image features for applications such as SfM is preferable, to reduce computation, power consumption and the probability of failure.

In this chapter, we propose a novel method to distinguish between refracted and Lambertian image features using a light-field camera. On the large-baseline light-field cameras to which previous refracted feature detection methods are limited (baselines that are large relative to the refractive object), our method achieves state-of-the-art performance. We extend these capabilities to light-field cameras with much smaller baselines than previously considered, where we achieve up to 50% higher refracted feature detection rates. Specifically, we propose to use textural cross-correlation to characterise apparent feature motion in a single LF, and compare this motion to its Lambertian equivalent based on 4D light-field geometry.

1 Outliers are by definition samples that significantly differ from other observations; normally they appear with low probability at the far end of distributions. Thus the term “outlier-rich” may appear contradictory, as it implies a distribution in which many of the removed image feature points do not follow the assumed distribution. However, by “outlier-rich”, we mean that there is a much higher concentration of outliers than normal. In our context, we might obtain an outlier-rich image feature set when the cameras’ views are dominated by a refractive object, such that a large number of image features are refracted, and only a few image features follow a consistent (Lambertian) image motion within the light field, or due to the robot’s own motion.

We show the effectiveness of our discriminator in the application of structure from motion (SfM) when reconstructing scenes containing a refractive object, such as Fig. 5.1. Structure from motion is a technique to recover both scene structure and camera pose from 2D images, and is widely applicable to many systems in computer and robotic vision [Hartley and Zisserman, 2003, Wei et al., 2013]. Many of these systems assume the scene is Lambertian, in that a 3D point’s appearance in an image does not change significantly with viewpoint. However, non-Lambertian effects, including specular reflections, occlusions, and refraction, violate this assumption, which can cause these systems to become unreliable or even fail. We demonstrate that rejecting refracted features using our discriminator yields lower reprojection error, lower failure rates, and more accurate pose estimates when the robot is approaching refractive objects. Our method is a critical step towards allowing robots to operate in the presence of refractive objects. This work has been published in [Tsai et al., 2019].

Figure 5.1: (Left) An LF camera mounted on a robot arm was used to distinguish refracted features in a scene in SfM experiments. (Right) SIFT features that were distinguished as Lambertian (blue) and refracted (red), revealing the presence of the refractive cylinder in the middle of the scene.

In this chapter, our main contributions are the following.

• We extend previous work to develop a light-field feature discriminator for refractive objects. In particular, we detect the differences between the apparent motion of non-Lambertian and Lambertian features in the 4D LF to distinguish refractive objects more reliably than previous work.

• We propose a novel approach to describe the apparent motion of a feature observed within the 4D light-field based on textural cross-correlation.

• We extend refracted feature distinguishing capabilities to lenslet-based LF cameras that are limited to much smaller baselines by considering non-Lambertian apparent motion in the LF. All LFs captured for these experiments are available at https://tinyurl.com/LFRefractive.

• We show that by distinguishing and rejecting refracted features with our discriminator, SfM performs better in scenes that include refractive objects.

The main limitation of our method is that it requires background visual texture to be distorted by the refractive object. Our method’s effectiveness depends on the extent to which the appearance of the object is warped in the LF. This depends on the geometry, shape, and the refractive indices of the object involved.

The remainder of this chapter is organized as follows. Section 5.1 describes the related work and Section 5.2 provides background on LF geometry. In Section 5.3, we explain our method for discriminating refracted features in the LF. We show our experimental results for detection with different LF cameras, and validation in the context of monocular SfM in Section 5.4. Lastly in Section 5.5, we conclude the chapter and explore future work for the detection of refracted features.

5.1 Related Work

A variety of strategies for detecting and reconstructing refractive objects using vision have been investigated [Ihrke et al., 2010a]. For example, reflectivity has been used to reconstruct refractive object shape. A single monocular camera with a light source moving to points in a square grid has been used to densely reconstruct complex refractive objects by tracing the specular reflections from different, known lighting positions over multiple monocular images [Morris and Kutulakos, 2007]. Additionally, light refracted by transparent objects tends to be polarised, and thus a rotating polariser in front of a monocular camera has been used to reconstruct the front surface of glass objects that faces the camera [Miyazaki and Ikeuchi, 2005], but their method requires prior knowledge of the object’s refractive index, shape of the back surface, and illumination distribution, which for a robot are not necessarily available. Refractive object shape has also been obtained by measuring the distortion of the light field’s background light paths, using a monocular camera image of the refractive object placed in front of a special optical sheet and lighting system, known as a light-field probe [Wetzstein et al., 2011], but this method also required knowledge of the initial direction of light rays emitting from a planar background. Furthermore, many of these methods require known light sources with bulky configurations that are impractical for robotic applications in everyday environments.

Recent work has been aimed at finding refractive objects within a single monocular image. SIFT features and a learning-based approach have been used to detect refractive objects [Fritz et al., 2009]. They trained a linear, binary support-vector machine to classify glasses versus a Lambertian background. Their approach required many hand-labelled training images from a variety of refractive objects under different lighting environments and backgrounds, and only returned a bounding box, providing little to no insight into the nature of the refractive object itself. Monocular image sequences from moving cameras have been used to recover refractive object shape and pose [Ben-Ezra and Nayar, 2003]; however, image feature correspondence was established manually throughout camera motion, emphasizing the difficulty of automatically identifying and tracking refracted image features due to the severe magnification of the background and image distortion from the object.

LFs have been used to obtain better depth maps for Lambertian and occluded scenes [Johannsen et al., 2017]; however, their depth estimation performance suffers for refractive objects. Jachnik et al. considered using light fields to estimate scene lighting configurations and then remove specular reflections from images of planar surfaces [Jachnik et al., 2012]. Tao et al. recently applied a similar concept using LF cameras to simultaneously estimate depth and remove specular reflections from more general 3D surfaces (not limited to planar scenes) [Tao et al., 2016]. Wanner et al. recently considered planar refractive objects and reconstructed two different depth layers [Wanner and Golduecke, 2013]. For example, their method provided the depth of a thin sheet of frosted glass and the depth of the background Lambertian scene. In another example, their method provided the depth of a reflective mirror and the apparent depth of the reflected scene. However, this work was limited to thin planar surfaces and single reflections. Although our work does not determine the dense structure of the refractive object, our approach can distinguish image features from objects that significantly distort the LF.

Refractive object recognition is the problem of finding or identifying a refractive object from vision. In this area, Maeno et al. proposed a light-field distortion feature (LFD), which models an object’s refraction pattern as image distortion based on differences in the corresponding image points between the multiple views encoded within the LF, captured by a large-baseline (relative to the refractive object) LF camera array [Maeno et al., 2013]. Several LFDs were combined in a bag-of-words representation for a single refractive object. However, the authors observed significantly degraded recognition performance due to specular reflections, as well as changes in camera pose. Xu et al. used the LFD as a basis for refractive object image segmentation [Xu et al., 2015]. Corresponding image features from all views in the LF (s,t,u,v) were fitted to the normal of a 4D hyperplane using singular value decomposition (SVD). The smallest singular value was taken as a measure of error to the hyperplane of best fit, for which a threshold was applied to distinguish refracted image features. However, we will show that a 3D point cannot be described by a single hyperplane in 4D. Instead, it manifests as a plane in 4D that has two orthogonal normal vectors. Our approach builds on Xu’s method and solves for both normals to find the plane of best fit in 4D, thus allowing us to discriminate refractive objects more reliably.

A key difficulty in image feature-based approaches in the LF is obtaining the corresponding image feature locations between multiple views. It is possible to use traditional multi-view geometry approaches for image feature correspondence, such as epipolar geometry, optical flow and RANSAC. In fact, both Maeno and Xu used optical flow between two views for correspondence. However, these approaches do not exploit the unique geometry of the LF, which can lead to algorithmic simplifications or reduced computational complexity [Dansereau, 2014]. We propose a novel textural cross-correlation method to associate image features in the LF by describing their apparent motion in the LF, which we refer to as image feature curves. This method directly exploits LF geometry and provides insight on the 4D nature of image features in the LF.

Our interest in LF cameras stems from robot applications that often have mass, power and size constraints. Thus, we are interested in employing compact lenslet-based LF cameras to deal with refractive objects. However, most previous works have utilized gantries [Wanner and Golduecke, 2013], or large camera arrays [Maeno et al., 2013, Xu et al., 2015]; their results do not reliably transfer to LF cameras with much smaller baselines, where distortion is less apparent, as we show later. We demonstrate the performance of our method using two different LF camera architectures with different baselines. Ours is the first method, to our knowledge, capable of identifying RFs using lenslet-based LF cameras.

For LF cameras, LF-specific image features have been investigated. SIFT features augmented with “slope”, an LF-based property related to depth, were proposed by the author of this thesis for visual servoing using an LF camera [Tsai et al., 2017]; however, in Chapter 4, refractive objects were not considered. Ghasemi proposed a scale-invariant global image feature descriptor based on a modified Hough transform [Ghasemi and Vetterli, 2014]; however, we are interested in local image features whose positions encode the distortion observed in the refracted background. More recently, Tosic developed a scale-invariant, single-pixel-edge detector by finding local extrema in a combined scale, depth, and image space [Tosic and Berkner, 2014]. However, these LF image features did not differentiate between Lambertian and refracted image features, nor were they designed for reliable matching between LFs captured from different viewpoints.

Recent work by Teixeira et al. projected SIFT features found in all views into their corresponding epipolar plane images (EPIs). Example EPIs are shown in Fig. 2.15. These projections were filtered, grouped onto straight lines in their respective EPIs, and then counted. Features with higher counts were observed in more views, and were thus considered to be more reliable Lambertian image features [Teixeira et al., 2017]. However, this approach only looked for SIFT features that were consistently projected onto lines in their respective EPIs and intentionally filtered out any nonlinear image feature behaviour. In contrast, our method aims to detect these non-Lambertian image features and is focused on characterising them. Clearly, there is a gap in the literature for identifying and characterising refracted image features using LF cameras.

In this chapter, we detect unique image features that allow us to reject distorted content and work well for SfM. This could be useful for many other common feature-based algorithms, including recognition, segmentation, visual servoing, simultaneous localization and mapping, visual odometry, and SfM, making these algorithms more robust to the presence of refractive objects. We are interested in exploring the impact of our refracted image feature discriminator in an SfM framework. While there has been significant development in SfM in recent years for conventional monocular and stereo cameras [Wei et al., 2013], Johannsen et al. were the first to consider LFs in the SfM framework [Johannsen et al., 2015]. Although our work does not yet explore LF-based SfM, we investigate SfM’s performance with respect to RFs, which has not yet been fully explored. We show that rejecting RFs reduces reprojection error and failure rate near refractive objects, improving camera pose estimates.

5.2 Lambertian Points in the Light Field

In this section, we provide a brief reminder of the LF geometry background provided in Section 2.7.3; however, we have re-written (2.34) and (2.35) in the context of our refracted image feature discriminator. Using the two-plane parameterisation, a ray φ can be described by φ = [s, t, u, v]^T ∈ R^4. A Lambertian point in space P = [P_x, P_y, P_z]^T ∈ R^3 emits rays in many directions, which follow a linear relationship

\begin{bmatrix} u \\ v \end{bmatrix} = \frac{D}{P_z} \begin{bmatrix} P_x - s \\ P_y - t \end{bmatrix},   (5.1)

where each row describes a hyperplane in 4D. A hyperplane in 4D is a 3D manifold and can be described by a single equation

n_1 s + n_2 t + n_3 u + n_4 v + n_5 = 0,   (5.2)

where n = [n_1, n_2, n_3, n_4]^T is the normal of the hyperplane. A plane is defined as a 2D manifold and can be spanned by two linearly-independent vectors. In 4D, a plane can be described by the intersection of two 4D hyperplanes

n_1 s + n_2 t + n_3 u + n_4 v + n_5 = 0   (5.3)

m_1 s + m_2 t + m_3 u + m_4 v + m_5 = 0,   (5.4)

where m is the normal of a second hyperplane in 4D. Equations (5.3) and (5.4) can be written in matrix form,

\begin{bmatrix} n_1 & n_2 & n_3 & n_4 \\ m_1 & m_2 & m_3 & m_4 \end{bmatrix} \begin{bmatrix} s \\ t \\ u \\ v \end{bmatrix} = \begin{bmatrix} -n_5 \\ -m_5 \end{bmatrix}.   (5.5)

Equation (5.1) can be written in a form similar to (5.5) as

\underbrace{\begin{bmatrix} \frac{D}{P_z} & 0 & 1 & 0 \\ 0 & \frac{D}{P_z} & 0 & 1 \end{bmatrix}}_{N} \begin{bmatrix} s \\ t \\ u \\ v \end{bmatrix} = \begin{bmatrix} \frac{D P_x}{P_z} \\ \frac{D P_y}{P_z} \end{bmatrix},   (5.6)

where N contains the two linearly-independent normals to the plane in 4D. The plane is defined as the set of all s, t, u, v that follow (5.6). Therefore, a Lambertian point in 3D induces a plane in 4D, which is characterised by two linearly-independent normal vectors that each define a hyperplane in 4D. In the literature, this relationship is sometimes referred to as the point-plane correspondence, as discussed in Section 2.7.3, because a point in 3D corresponds to a plane in 4D.
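To make the point-plane correspondence concrete, the following minimal NumPy sketch (our own illustration with an assumed plane separation D and 3D point P, not code from the thesis) builds the matrix N of (5.6) and verifies that every ray emitted by a Lambertian point satisfies it.

```python
import numpy as np

# Assumed example values (not from the thesis): plane separation D and a 3D point P.
D = 1.0
P = np.array([0.2, -0.1, 2.5])           # [Px, Py, Pz]

# Two linearly-independent hyperplane normals stacked as N, as in (5.6).
N = np.array([[D / P[2], 0.0, 1.0, 0.0],
              [0.0, D / P[2], 0.0, 1.0]])
rhs = np.array([D * P[0] / P[2], D * P[1] / P[2]])

# Rays from the point to a grid of viewpoints (s, t); u, v follow (5.1).
for s in np.linspace(-0.05, 0.05, 3):
    for t in np.linspace(-0.05, 0.05, 3):
        u = D / P[2] * (P[0] - s)
        v = D / P[2] * (P[1] - t)
        phi = np.array([s, t, u, v])
        # Each ray lies on the same 4D plane: N @ phi equals the constant right-hand side.
        assert np.allclose(N @ phi, rhs)
print("All rays from the Lambertian point lie on the same plane in 4D.")
```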

The light-field slope w relates the rate of change of image plane coordinates, with respect to viewpoint position, for all rays emitting from a point in the scene. In the literature, slope is sometimes referred to as “orientation” [Wanner and Golduecke, 2013], and other works compute slope as an angle [Tosic and Berkner, 2014]. We recall that the slope comes directly from (2.43) as

w = -D/P_z,   (5.7)

and is clearly related to depth.

5.3 Distinguishing Refracted Image Features

Epipolar plane images (EPIs) graphically illustrate the apparent motion of a feature across multiple views. If the entire light field is given as L(s,t,u,v) ∈ R^4, the central view is an image I(u,v) = L(s0,t0,u,v), and is equivalent to what a monocular camera would provide from the same camera viewpoint. EPIs represent a 2D slice of the 4D LF. A horizontal EPI is given as L(s,t*,u,v*), and a vertical EPI is denoted as L(s*,t,u*,v), where * indicates a variable that is fixed while the others may vary.

In practice, we construct the EPI by plotting all u pixels for view s, as illustrated in Fig. 2.15. Then we plot all u pixels for view s + 1, stacking the row of pixels on top of the previous plot, and repeating for all s. As each view is horizontally shifted by some baseline, the scene captured by the u pixels shifts accordingly. As shown in Fig. 5.2a, image features or rays from a Lambertian point are linearly distributed with respect to viewpoint due to the uniform sampling of the LF camera.
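As a concrete illustration of this slicing (assuming the light field is stored as a NumPy array indexed L[s, t, u, v]; this layout is our assumption, not a requirement of the thesis), horizontal and vertical EPIs are single array slices:

```python
import numpy as np

# Hypothetical light field with 9x9 views of 64x64 pixels, indexed as L[s, t, u, v].
n_views, n_pix = 9, 64
L = np.random.rand(n_views, n_views, n_pix, n_pix)

s0 = t0 = n_views // 2          # central view indices
v_star = u_star = n_pix // 2    # fixed pixel row/column to slice through

# Horizontal EPI L(s, t*, u, v*): stack the u-row at v* from every view along s.
epi_h = L[:, t0, :, v_star]     # shape (n_views, n_pix)

# Vertical EPI L(s*, t, u*, v): stack the v-column at u* from every view along t.
epi_v = L[s0, :, u_star, :]     # shape (n_views, n_pix)

print(epi_h.shape, epi_v.shape)
```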

Points with similar depths yield lines with similar slopes in the EPI. Points with different depths yield lines with different slopes. Similar behaviour is observed considering the vertical viewing direction along t and v. Equivalently, linear parallax motion manifests itself as straight lines for Lambertian image features. Image features for highly-distorting refractive objects are nonlinear, as illustrated in Fig. 5.2b. We can thus compare this difference in apparent motion between Lambertian and non-Lambertian features to distinguish RFs.

Fig. 5.3a shows the central view and an example EPI of a crystal ball LF (large baseline) from the New Stanford Light-Field Archive, captured by a camera array. The physical size of cameras often necessitates larger baselines for LF capture. A Lambertian point forms a straight line in the EPI, shown in Fig. 5.3b. The relation between slope and depth is also apparent in this EPI.

Refracted image features appear as nonlinear curves in the EPI, as seen in Fig. 5.3b. Refracted image feature detection in the LF simplifies to finding image features that violate (5.1) via identifying nonlinear feature curves in the EPIs and/or inconsistent slopes between two linearly-independent EPI lines (i.e., EPIs sampled from two linearly-independent motions), such as the vertical (along t) and horizontal (along s) EPIs. We note that occlusions and specular reflections also violate (5.1), and so can potentially cause many vision algorithms to fail as well. Occlusions appear as straight lines, but have intersections in the EPI, indicated in green. Edges of the refractive objects, and objects with low distortion, also appear Lambertian. Specular reflections appear as a superposition of lines in the EPI, which may be addressed in future work.

Figure 5.2: A Lambertian point emits rays of light that are captured by the LF camera. (a) Projection of the linear behaviour of a Lambertian image feature (orange), and (b) the nonlinear behaviour of a refracted image feature with respect to linear motion along the viewpoints of an LF (blue).

5.3.1 Extracting Image Feature Curves

In this section, we discuss how we extract these 4D image feature curves and how we identify refracted image features. For a given image feature from the central view (s0,t0) at coordinates (u0,v0), we must determine the feature correspondences (u′,v′) from the other views, which is equivalent to finding the feature’s apparent motion in the LF. In this chapter, we start by detecting SIFT features [Lowe, 2004] in the central view, although the proposed method is agnostic to any scale-based image feature type.

Next, we select a template surrounding the feature which is k times the feature’s scale. We determined k = 5 to yield the most consistent results. 2D Gaussian-weighted normalized cross-correlation (WNCC) is used across views to yield correlation images, such as Fig. 5.4. To reduce computation, we only apply WNCC along the central row and column of LF views.

For Lambertian image features, peaks in the correlation space for each view correspond to the feature’s image coordinates in that view. We create another EPI by plotting the image feature’s correlation response with respect to the views, which we call the correlation EPI. As illustrated in Fig. 5.4, the ridge of the correlation EPI has the same shape as the image feature curve from the original EPI.

Figure 5.3: (a) The central view of the crystal ball LF from the New Stanford Light Field Archive. (b) A vertical EPI sampled from a column of pixels (yellow), where nonlinear apparent motion caused by the crystal ball is seen in the middle (blue). Straight lines correspond to Lambertian features (orange). Occlusions (green) appear as intersections of straight lines.

For refracted image features, we hypothesize that the distortion of the feature’s appearance between views will not be so strong as to make the correlation response unusable. Thus, the correlation response will be sufficiently strong that the ridge of the correlation EPI will still correspond to the desired feature curve. Our textural cross-correlation method allows us to focus on the image structure, as opposed to the image intensities. Our method can be applied to any LF camera, and directly exploits the geometry of the LF.
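The sketch below illustrates the correlation-EPI idea with plain normalized cross-correlation; the Gaussian weighting, the k = 5 template scaling and sub-pixel ridge extraction used in the thesis are omitted, and the array layout and function names are our own assumptions.

```python
import numpy as np

def ncc(patch, template):
    """Normalized cross-correlation between two equally-sized patches."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.linalg.norm(p) * np.linalg.norm(t) + 1e-12
    return float((p * t).sum() / denom)

def correlation_epi(views, template, v0):
    """Slide a feature template along the row band at v0 in each horizontal view.

    views: (n_views, H, W) grey-scale views along s; template: (h, w) patch
    centred on the feature in the central view. Returns the correlation EPI
    (n_views, W - w + 1) and the per-view peak locations, i.e. the ridge that
    traces the extracted image feature curve.
    """
    n_views, _, W = views.shape
    h, w = template.shape
    half = h // 2
    epi = np.zeros((n_views, W - w + 1))
    for s in range(n_views):
        strip = views[s, v0 - half: v0 - half + h, :]   # row band around the feature
        for u in range(W - w + 1):
            epi[s, u] = ncc(strip[:, u:u + w], template)
    peaks = epi.argmax(axis=1)
    return epi, peaks
```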

There are many other strategies to find and match image features across a sequence of views, such as stereo matching and optical flow. Such approaches have previously been used in LF-related work [Maeno et al., 2013, Xu et al., 2015]. However, both stereo matching and optical flow typically rely on pair-wise image comparisons and must therefore be iterated across other views of the LF. It is often more efficient and robust against noise to consider all views simultaneously when attempting to characterise a trend across an image sequence. Although we only consider 2D EPIs in the LF in this chapter, we are interested in considering full 4D approaches for image feature extraction in future work.

Figure 5.4: The image feature curve extraction process. (Top left) The simulated horizontal views of a yellow circle and (top right) the corresponding horizontal EPI taken along the middle of the views from the green pixel row. A feature template is taken and used for textural cross-correlation (bottom left). The resulting cross-correlation response is computed and shown as the cross-correlation views for a typical scene. Yellow indicates a high response, while blue indicates a low response. The resulting correlation EPI (bottom right) is created by stacking the red pixel row of adjacent views. The ridge (yellow) along this correlation EPI corresponds to the desired, extracted image feature curve (red). Note that only 3 views are shown, but the simulated LF actually contains 9 views.

5.3.2 Fitting 4D Planarity to Image Feature Curves

For Lambertian image features, the image feature disparities are linear with respect to linear camera translation, as in (5.7). The disparities from refracted image features deviate from this linear relation. In this section, we explain that fitting these disparities in the least squares sense to (5.1) yields the plane of best fit in 4D. The plane in 4D can be estimated from the image feature curves that we extracted in the previous section. The error of the 4D planar fit provides a measure of whether or not our image feature is Lambertian.

Similar to [Xu et al., 2015], we consider the ray passing through the central view φ0(s0,t0,u0,v0). The corresponding ray coordinates in the view (s,t) are defined as φ(s,t,u,v). The LFD is then defined as the set of relative differences between φ0 and φ as in [Maeno et al., 2013]:

LFD(u_0, v_0) = \{ (s, t, \Delta u, \Delta v) \;|\; \forall (s,t) \neq (s_0, t_0) \},   (5.8)

where ∆u = u(s,t) − u0 and ∆v = v(s,t) − v0 are image feature disparities. We note that the LFD uses φ from all other views (s,t) ≠ (s0,t0). This differs from our proposed image feature curves extracted from EPIs, which only sample views along the two orthogonal viewing directions from which the EPIs are first created; this represents a minimal sampling of the LF in order to discriminate against refracted image features.
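For illustration, a minimal sketch of assembling the LFD of (5.8), assuming the per-view feature locations have already been found (e.g. from the correlation EPIs); the data structure is a hypothetical choice of ours.

```python
import numpy as np

def light_field_distortion(uv_per_view, s0, t0):
    """Build the LFD set {(s, t, du, dv)} from per-view feature coordinates.

    uv_per_view: dict mapping a view index (s, t) to the feature location (u, v).
    (s0, t0): central view index; its (u0, v0) is the reference.
    """
    u0, v0 = uv_per_view[(s0, t0)]
    lfd = [(s, t, u - u0, v - v0)
           for (s, t), (u, v) in uv_per_view.items()
           if (s, t) != (s0, t0)]
    return np.array(lfd)
```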

As discussed in Section 5.2, our plane in 4D has two linearly-independent normals, n and m.

Then considering the LFD, we compare central view ray φ0 to ray φ. Recall that each ray is represented by a point in 4D. Substituting each coordinate into (5.3), we can write

n_1 s_0 + n_2 t_0 + n_3 u_0 + n_4 v_0 = -n_5   (5.9)

n_1 s + n_2 t + n_3 u + n_4 v = -n_5.   (5.10)

Subtracting (5.9) from (5.10) yields

n_1 s + n_2 t + n_3 ∆u + n_4 ∆v = 0,   (5.11)

which is expressed in terms of the LFD. Recall that s0 = 0 and t0 = 0.

We can write this in matrix form as

\begin{bmatrix} n_1 & n_2 & n_3 & n_4 \end{bmatrix} \begin{bmatrix} s \\ t \\ \Delta u \\ \Delta v \end{bmatrix} = 0.   (5.12)

We can estimate n by fitting rays according to

\underbrace{\begin{bmatrix} s & t & \Delta u & \Delta v \end{bmatrix}}_{N} \underbrace{\begin{bmatrix} n_1 \\ n_2 \\ n_3 \\ n_4 \end{bmatrix}}_{n} = 0.   (5.13)

Note that the constants on the right-hand side of (5.6) cancel out in (5.13) because we consider the differences relative to u0. We also note that N is a matrix of rank one. We require a minimum of four additional rays (equivalently, four views of the image feature) relative to φ0 to estimate n by solving the system Nn = 0.

For an LF that can be represented by an M × N camera array, we can use all MN views to estimate n. However, to reduce the required computation involved in the image feature curve extraction process, we can use all N views from the horizontal image feature curve, which were extracted from the horizontal EPI, u = fh(s; tj,vl − v0). This represents the set of all values of u that follow the horizontal image feature curve as a function of s, given the constant tj, vl and v0. Similarly, the vertical image feature curve can be expressed as v = fv(t; si,uk − u0), for constant si, uk and u0. CHAPTER 5. DISTINGUISHING REFRACTED IMAGE FEATURES 135

We can substitute the image feature curve fh into N as a set of stacked rays,

\underbrace{\begin{bmatrix} s_1 & t_j & \Delta u_1 & v_l - v_0 \\ \vdots & \vdots & \vdots & \vdots \\ s_N & t_j & \Delta u_N & v_l - v_0 \end{bmatrix}}_{\bar{N}} \begin{bmatrix} n_1 \\ n_2 \\ n_3 \\ n_4 \end{bmatrix} = \mathbf{0}.   (5.14)

The matrix N̄ is a singular N × 4 matrix, and (5.14) is an overdetermined system. We can solve this system using SVD to estimate n in the least squares sense.

It is interesting to note that for a Lambertian point, we can reduce each row of N̄ to a function of the other rows. We know that t_j and v_l − v_0 form constant columns, and the column s_1 to s_N simply increments over the horizontal viewpoints, which are linearly-spaced in the LF. Using (5.1), we can write

\Delta u = u - u_0 = \frac{D}{P_z}(P_x - s) - \frac{D}{P_z}(P_x - s_0) = -\frac{D}{P_z}(s - s_0) = -\frac{D}{P_z}\Delta s   (5.15)

\frac{\Delta u}{\Delta s} = -\frac{D}{P_z}.   (5.16)

The change in u is linear with respect to changes in s, which matches our expression for LF slope in (5.7). Therefore, N¯ has a rank of 1 and can only yield a single hyperplane.

However, recall that a Lambertian point can be described by two hyperplanes. Equation (5.11), and consequently (5.12), must hold for both hyperplanes for a Lambertian point in 3D. We are interested in estimating the hyperplane normals n and m from the LFD. Therefore, we can write (5.11) in

matrix form as the 4D plane containing φ0 and φ,

\underbrace{\begin{bmatrix} n_1 & n_2 & n_3 & n_4 \\ m_1 & m_2 & m_3 & m_4 \end{bmatrix}}_{[\,n\;\;m\,]^T} \begin{bmatrix} s \\ t \\ \Delta u \\ \Delta v \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix},   (5.17)

where T is the transpose. The positions for s, t can be obtained by calibration [Dansereau et al., 2013], although the nonlinear behaviour still holds when working with uncalibrated units of “views”. We can then write

\underbrace{\begin{bmatrix} s & t_j & \Delta u & v_l - v_0 \\ s_i & t & u_k - u_0 & \Delta v \end{bmatrix}}_{A} \underbrace{\begin{bmatrix} n_1 \\ n_2 \\ n_3 \\ n_4 \end{bmatrix}}_{n} = \begin{bmatrix} 0 \\ 0 \end{bmatrix},   (5.18)

where the first row (s, t_j, ∆u, v_l − v_0) represents a single ray from f_h, and the second row (s_i, t, u_k − u_0, ∆v) represents a single ray from f_v. A is a singular matrix and has a rank of at least two. We still require a minimum of five rays to solve (5.18) (four plus φ0). As before, we can stack all MN rays over the entire LF; however, we use a smaller set of M + N rays from f_h and f_v. The system of equations can be written,

\underbrace{\begin{bmatrix} s_1 & t_j & \Delta u_1 & v_l - v_0 \\ \vdots & \vdots & \vdots & \vdots \\ s_N & t_j & \Delta u_N & v_l - v_0 \\ s_i & t_1 & u_k - u_0 & \Delta v_1 \\ \vdots & \vdots & \vdots & \vdots \\ s_i & t_M & u_k - u_0 & \Delta v_M \end{bmatrix}}_{\bar{A}} \underbrace{\begin{bmatrix} n_1 \\ n_2 \\ n_3 \\ n_4 \end{bmatrix}}_{n} = \mathbf{0}.   (5.19)

Equation (5.19) is of the form Ān = 0, which is a homogeneous system of equations. Since Ā is an (M + N) × 4 matrix, the system is overdetermined. We use SVD to solve the system in a least-squares sense to compute the four singular vectors ξi and corresponding singular values λi, i = 1 ... 4, where the λi are sorted in ascending order according to their magnitude.

Additionally, Ā has a rank of two for a Lambertian point. We can show this by following a similar argument to that for the rank of N̄, applied to the rays from f_v. Thus we expect two non-zero singular values and two trivial solutions for a system with no noise. With image noise and noise from the image feature curve extraction process, it is possible to get four non-zero singular values, whereupon the magnitudes of λ1 and λ2 are much smaller than λ3 and λ4. Importantly, distortion caused by a refractive object can also cause non-zero singular values, and it is this effect that we are primarily interested in.

The two smallest singular values, λ1 and λ2, and their corresponding singular vectors are related to the two normals n and m that best satisfy (5.19) in the least-squares sense. The magnitude of these singular values provides a measure of error of the planar fit. Smaller errors imply stronger linearity, while larger errors imply that the feature deviates from the 4D plane.
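A minimal sketch of this plane fit, under our own naming and the assumption that the horizontal and vertical feature curves have already been extracted: the rays of (5.19) are stacked into Ā and the two smallest singular values and their singular vectors are returned.

```python
import numpy as np

def fit_plane_4d(s_vals, du_h, t_vals, dv_v, t_j, v_off, s_i, u_off):
    """Fit the 4D plane of (5.19) to rays sampled from the two feature curves.

    s_vals, du_h : viewpoint positions and disparities along the horizontal curve f_h
    t_vals, dv_v : viewpoint positions and disparities along the vertical curve f_v
    t_j, v_off   : constant t and (v_l - v_0) entries of the horizontal rows
    s_i, u_off   : constant s and (u_k - u_0) entries of the vertical rows
    Returns the two smallest singular values (planarity errors) and their vectors.
    """
    s_vals = np.asarray(s_vals, dtype=float)
    t_vals = np.asarray(t_vals, dtype=float)
    rows_h = np.column_stack([s_vals, np.full_like(s_vals, t_j),
                              np.asarray(du_h, dtype=float),
                              np.full_like(s_vals, v_off)])
    rows_v = np.column_stack([np.full_like(t_vals, s_i), t_vals,
                              np.full_like(t_vals, u_off),
                              np.asarray(dv_v, dtype=float)])
    A_bar = np.vstack([rows_h, rows_v])           # the stacked matrix of (5.19)
    _, sing_vals, Vt = np.linalg.svd(A_bar)
    lam1, lam2 = sing_vals[-1], sing_vals[-2]     # two smallest singular values
    n, m = Vt[-1], Vt[-2]                         # corresponding hyperplane normals
    return lam1, lam2, n, m
```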

5.3.3 Measuring Planar Consistency

From the two smallest singular values λ1 and λ2, we have two measures of error for the planar fit. The Euclidean norm of λ1 and λ2, \sqrt{\lambda_1^2 + \lambda_2^2}, may be taken as a single measure of planarity; however, doing so masks the case where λ1 ≫ λ2, or λ1 ≪ λ2. This can occur when observing a feature through a 1D refractive object (glass cylinder) that causes severe distortion along one direction, but relatively little along the other. Therefore, we reject those features that have large errors in either of the two hyperplanes in a manner similar to a logical OR gate. This planar consistency, along with the slope consistency discussed in the following section, makes the proposed method more sensitive to distorted texture than prior work that considered only the smallest singular value, which we refer to as hyperplanar consistency [Xu et al., 2015].

5.3.4 Measuring Slope Consistency

Equation (5.1) shows that a Lambertian point has a single value of slope for both hyperplanes. However, the case for a refracted image feature can be locally approximated as

\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} w_1 (P_x - s) \\ w_2 (P_y - t) \end{bmatrix},   (5.20)

where w1 and w2 are the two slopes for the same image feature. Each row in (5.20) is still a hyperplane in 4D. The intersection of these two hyperplanes also represents a plane in 4D, as the normals are still linearly-independent.

We are interested in the horizontal and vertical hyperplanes, which are aligned to the horizontal and vertical viewpoints along t0 and s0, respectively. We can compute the slopes for each hyperplane given their normals. For the first hyperplane, we solve for the in-plane vector q = [q_s, q_u]^T by taking the inner product of the two vectors n and m from (5.17) in

\begin{bmatrix} n_1 & n_3 \\ m_1 & m_3 \end{bmatrix} \begin{bmatrix} q_s \\ q_u \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix},   (5.21)

where q is constrained to the s,u plane, because we choose the first and third elements of n and m. This system is solved using SVD, and the minimum singular vector yields q. The slope for the horizontal hyperplane, w_su, is then w_su = q_s/q_u. The slope for the vertical hyperplane, w_tv, is similarly computed from the second and fourth elements of n and m.
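A small sketch of this slope computation (our own helper, assuming n and m come from the plane fit above):

```python
import numpy as np

def hyperplane_slope(n, m, idx):
    """Slope of the EPI line in the plane selected by idx.

    idx = (0, 2) gives the s,u (horizontal) plane; idx = (1, 3) gives the t,v (vertical) plane.
    """
    M = np.array([[n[idx[0]], n[idx[1]]],
                  [m[idx[0]], m[idx[1]]]])
    _, _, Vt = np.linalg.svd(M)
    q = Vt[-1]                      # minimum singular vector: in-plane direction
    return q[0] / q[1]              # w = q_s / q_u  (or q_t / q_v)

# Example usage (n, m from fit_plane_4d):
# w_su = hyperplane_slope(n, m, (0, 2))
# w_tv = hyperplane_slope(n, m, (1, 3))
```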

We define slope consistency c as a measure of how different the slopes are between the two hyperplanes for a given image feature. It is possible to compute this difference as

c = (w_1 - w_2)^2.   (5.22)

However, in practice, we plot the EPIs as ds/du (rather than du/ds) because there are significantly fewer views in typical LF cameras than there are pixels per view. We thus measure the inverses of w1 and w2, which can approach infinity as the lines of slope become vertical. We therefore convert each slope to an angle

\sigma_1 = \tan^{-1}(w_1/1), \qquad \sigma_2 = \tan^{-1}(w_2/1).   (5.23)

We compute c as the Euclidean norm of the difference of the two slope angles,

c = \sqrt{ | \sigma_1 - \sigma_2 |^2 }.   (5.24)

Large values of c imply a large difference in slopes between the horizontal and vertical EPIs, which in turn implies a refractive object. Overall, image features with large planar errors and inconsistent slopes are identified as belonging to a highly-distorting refractive object.

Two thresholds, for planar consistency t_planar and slope consistency t_slope, are used to determine if an image feature has been distorted. If true, we refer to it as a refracted image feature,

\text{refracted image feature} = \begin{cases} 1, & \text{if } (\lambda_1 > t_{planar}) \vee (\lambda_2 > t_{planar}) \vee (c > t_{slope}) \\ 0, & \text{otherwise}, \end{cases}   (5.25)

where ∨ is the logical OR operator. Note that our method is not limited to detecting distortion aligned with the horizontal and vertical axes of the LF. Although not implemented in this work, we can further check λ1, λ2 and c along other axes by rotating the LF’s s,t,u,v frame and repeating the check. In future work, we aim to consider all of the LF in order to estimate this rotation.
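Combining the planar and slope consistency checks, a hedged sketch of the decision rule (5.25); the threshold values below are illustrative placeholders, not the values tuned in the experiments.

```python
import numpy as np

def is_refracted(lam1, lam2, w1, w2, t_planar=0.05, t_slope=0.1):
    """Flag a feature as refracted using planar and slope consistency, as in (5.25).

    lam1, lam2 : two smallest singular values of the 4D plane fit
    w1, w2     : slopes of the horizontal and vertical hyperplanes
    Thresholds are illustrative placeholders only.
    """
    sigma1, sigma2 = np.arctan(w1), np.arctan(w2)   # convert slopes to angles, (5.23)
    c = np.sqrt((sigma1 - sigma2) ** 2)             # slope consistency, (5.24)
    return (lam1 > t_planar) or (lam2 > t_planar) or (c > t_slope)
```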

5.4 Experimental Results

In this section, we present our experimental setup for refracted image feature detection and show how our methods extend from large-baseline LF camera arrays to small-baseline lenslet-based LF cameras. Finally, we use our method to reject refracted image features for monocular SfM in the presence of refractive objects, and demonstrate improved reconstruction and pose estimates.

5.4.1 Experimental Setup

To obtain LFs captured by a camera array, we used the Stanford New Light Field Archive2, which provided LFs captured from a gantry with a 17 × 17 grid of rectified 1024 × 1024-pixel images that were down-sampled to 256 × 256 pixels to reduce computation. We focused on two LFs that captured the same scene of a crystal ball surrounded by textured tarot cards. The first LF was captured with a large baseline (16.1 mm/view over 275 mm), which exhibited significant distortions in the LF caused by the crystal ball. The second LF was captured with a smaller baseline (3.7 mm/view over 64 mm). This allowed us to compare the effect of LF camera baseline for refracted image feature discrimination.

Smaller baselines were considered using a lenslet-based LF camera. These LF cameras are of interest in robotics due to their simultaneous capture of multiple views, and typically lower size and mass compared to LF camera arrays and gantries. In this section, the Lytro Illum was used to capture LFs with 15 × 15 views, each 433 × 625 pixels. Dansereau’s Light-Field Toolbox [Dansereau et al., 2013] was used to decode and rectify the LFs from raw LF imagery to the 2PP, thereby converting the Illum to an equivalent camera array with a baseline of 1.1 mm/view over 16.6 mm. To compensate for the extreme lens distortion of the Illum, we removed the outer views, reducing our LF to 13 × 13 views. The LF camera was fixed at 100 mm focal length. All LFs were captured in ambient indoor lighting conditions without the need for specialized lighting equipment. The refractive objects were placed within a textured scene in order to create textural details for SIFT features. For repeatability, the lenslet-based camera was mounted to the end-effector of a 6-DOF Kinova Jaco robotic manipulator, shown in Fig. 5.1. The arm was controlled using the Robot Operating System (ROS) framework.

2 The (New) Stanford Light Field Archive is available at http://lightfield.stanford.edu/lfs.html.
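As a small illustration of the view cropping described above (assuming the decoded LF is available as a 5D array with the two view indices first, which is our assumption rather than the toolbox's documented output):

```python
import numpy as np

# Hypothetical decoded Illum light field: 15 x 15 views of 433 x 625 RGB pixels.
lf = np.zeros((15, 15, 433, 625, 3), dtype=np.float32)

# Drop the outermost ring of views to avoid the strongest lens distortion,
# leaving the 13 x 13 grid used in the experiments.
lf_cropped = lf[1:-1, 1:-1]
print(lf_cropped.shape[:2])   # (13, 13)
```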

It is important to remember that our results depend on a number of factors. The geometry and refractive index of a transparent object affect its appearance. Higher curvature and thickness yield more distortion. Second, the distance between the LF camera and refractive object, as well as the distance between the refractive object and the background, directly affect how much distortion can be observed. Similarly, a larger camera baseline captures more distortion. When possible, these factors were held constant throughout the different experiments.

5.4.2 Refracted Image Feature Discrimination with Different LF Cameras

In this section, we provide a qualitative comparison of our discrimination methods for the large-baseline and small-baseline LF camera setups. Then we provide quantitative results over a larger variety of LFs for our refracted image feature discriminator.

5.4.2.1 Large-Baseline LF Camera Observations

The large-baseline crystal ball LF was captured by a camera array. Lambertian image features were captured by our textural cross-correlation approach as straight lines, while refracted image features were captured as nonlinear curves, as shown in Fig. 5.5. We observed that while the refracted image feature’s WNCC response was weaker compared to the Lambertian case, local maxima were observed near the image feature’s corresponding location in the central view. Thus, taking the local maxima of the correlation EPI yielded the desired feature curves. Our textural cross-correlation method enables us to extract image feature curves without focusing on image intensities.

5.4.2.2 Small-Baseline LF Camera Observations

Fig. 5.6 shows the horizontal and vertical EPIs for a refracted image feature taken from the small-baseline crystal ball LF. The image feature curves appear straight, despite being distorted by the crystal ball. However, we observed that the slopes were inconsistent, which could still be used to discriminate refracted image features.

5.4.2.3 Discrimination of Refracted Image Features

To discriminate refracted image features, thresholds for planarity and slope consistency were selected by exhaustive search over a set of training LFs and evaluated on a different set of LFs, with the exception of the crystal ball LFs, where only one was available for each baseline b from the New Stanford Light Field Archive. For comparison to the state of the art, the parameter search was performed for both Xu’s method and our method independently, to allow for the best performance of each method.

The ground truth refracted image features were identified via hand-drawn masks in the central view. It was assumed that all features visible and passing through the refractive object were distorted. Detecting a refracted image feature was considered positive, while returning a Lambertian image feature was negative. Thus a true positive (TP) is a correctly identified refracted image feature, while a true negative (TN) is a correctly identified Lambertian image feature. A false positive (FP) is an incorrectly identified refracted image feature. A false negative (FN) is an incorrectly identified Lambertian image feature, as shown in Fig. 5.7.

Figure 5.5: Comparison of sample image feature curves extracted for a Lambertian (top) and refracted (bottom) feature from the large-baseline LF. (a) Sample Lambertian SIFT feature with template used for WNCC (red). (b) A 3D view of the vertical correlation EPI overlaid with the straight Lambertian image feature curve (red). (c) The same straight Lambertian feature curve (red) overlaid in the original vertical EPI. (d) Sample refracted SIFT feature with template used for WNCC (red). (e) The refracted image feature curve (red) in the vertical correlation EPI can still be extracted, despite more complex “terrain”, and still matches (f) the refracted image feature curve, which exhibits nonlinear behaviour in the original vertical EPI. For reference, the image feature location is shown at (t0,v0) by the red dot in the vertical EPIs.

Figure 5.6: Sample (a) horizontal and (b) vertical EPIs from the crystal ball LF with small baseline. From the image feature’s location (u0,v0) in the central view (red), extracted image feature curves (green) match the plane of best fit (dashed blue). In the small-baseline LF, refracted image features appear almost linear and are thus much more difficult to detect.

Figure 5.7: Illustrating true positive, true negative, false positive and false negative in the context of refracted image feature discrimination.

From these definitions, we can compute precision and recall as performance measures. Precision is the fraction of detected refracted image features that are truly refracted,

Pr = \frac{TP}{TP + FP}.   (5.26)

Recall is the fraction of true refracted image features that are correctly identified,

Re = \frac{TP}{TP + FN}.   (5.27)
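For completeness, a short sketch of how these detection metrics can be computed from boolean predictions and hand-labelled ground truth (our own helper; it assumes both classes occur in the data so the denominators are non-zero):

```python
import numpy as np

def detection_metrics(pred_refracted, gt_refracted):
    """Compute TPR, TNR, FPR, FNR, precision and recall for refracted-feature detection.

    Assumes both refracted and Lambertian features are present in the ground truth.
    """
    pred = np.asarray(pred_refracted, dtype=bool)
    gt = np.asarray(gt_refracted, dtype=bool)
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    return {"TPR": tp / (tp + fn), "TNR": tn / (tn + fp),
            "FPR": fp / (fp + tn), "FNR": fn / (fn + tp),
            "Pr": tp / (tp + fp), "Re": tp / (tp + fn)}
```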

Table 5.1: Comparison of our method and the state of the art using two LF camera arrays and a lenslet-based camera for discriminating refracted image features

                              State of the Art [Xu et al., 2015]             Proposed
                  b [mm]   TPR   TNR   FPR   FNR   Pr    Re        TPR   TNR   FPR   FNR   Pr    Re
array   crystal ball 275   0.58  0.97  0.02  0.41  0.83  0.59      0.66  0.95  0.05  0.34  0.71  0.66
array   crystal ball  68   0.42  0.91  0.08  0.89  0.35  0.42      0.63  0.94  0.05  0.37  0.55  0.63
lenslet sphere       1.1   0.43  0.36  0.64  0.58  0.18  0.08      0.48  0.95  0.04  0.52  0.79  0.83
lenslet cylinder     1.1   0.08  0.80  0.20  0.92  0.72  0.43      0.82  0.81  0.13  0.24  0.97  0.48

Two LF camera setups were used for the crystal ball LF, a 275 mm baseline and a 68 mm baseline. For the lenslet-based camera, ten LFs from a variety of different backgrounds were used for each object type. The discrimination results are shown in Table 5.1, which we discuss in the following paragraphs. Fig. 5.8 shows sample views of refracted features (red) and Lambertian features (blue).

Large-baseline LF Cameras For large-baseline LF cameras, such as the LF camera array with the 275 mm baseline, our approach had comparable performance to the state of the art, shown by only a 14% lower precision, but an 11% increase in recall. For large baselines, a significant amount of apparent motion for many of the refracted image features was observed in the EPIs; thus, refracted image features yielded nonlinear curves which strongly deviated from both 4D hyperplanes. Therefore, a single threshold (that only accounted for a single hyperplane) was sufficient to discriminate refracted image features.

The FPs included some occlusions, which appeared nonlinear in the EPI [Wanner and Golduecke, 2014], but were not discriminated by our implementation. However, this may still be beneficial as occlusions often cause unreliable depth estimates, and are thus undesirable for most robotic vision feature-based algorithms. Sampling from all the views in the LF would likely improve the results for both methods, as more data would improve the planar fit. Interestingly, more accurate depth estimates near occlusions are a common motivation to use LF cameras over conventional vision sensors [Ham et al., 2017, Tao et al., 2013].

Small-baseline LF Cameras For small-baseline LF cameras, such as the LF camera array with a 68 mm baseline, and the lenslet-based plenoptic camera, we observed improved performance with our method over the state of the art. For the crystal ball LF, our method had up to a 50% higher TP rate (TPR), up to a 58% lower FN rate (FNR), similar FP rates (FPR) and TN rates (TNR), and generally better precision and recall compared to Xu’s method for the camera array. We attributed these improvements to more accurately fitting the plane in 4D, as opposed to a single hyperplane.

For the lenslet-based LF camera, we investigated two different types of refractive objects: a glass sphere and an acrylic cylinder, shown in the bottom two rows of Fig. 5.8. The sphere exhibited significant distortion along both the horizontal and vertical viewing axes, while the cylinder only exhibited significant distortion perpendicular to its longitudinal axis.

When using the small-baseline lenslet-based LF camera, we observed significant improvement in performance over the state of the art for all object types. As shown in Table 5.1, Xu’s method was unable to detect the refractive cylinder (TPR of 0.08), while our method succeeded with a TPR 10 times higher. Our method had a 3.4 times increase in precision and a 9.4 times increase in recall for the sphere. The higher precision and recall imply that our method provides fewer incorrect detections and misses fewer correct refracted image features compared to previous work. We attribute this to accounting for slope consistency, which Xu’s method did not address. In shorter-baseline LFs, the nonlinear characteristics of refracted image feature curves were much less apparent, as in Fig. 5.6, but could still be distinguished by their inconsistent slopes.

Figure 5.8: Comparison of the state of the art (Xu’s method) (left) and our method (right) for discriminating between Lambertian (blue) and refracted (red) SIFT features. The top row shows the crystal ball captured with a large-baseline LF (cropped). Both methods detect refracted image features; however, our method outperforms Xu’s. The second and third rows show a cylinder and sphere captured with a small-baseline lenslet-based LF camera. Our method successfully detects more refracted image features with fewer false positives and negatives.

We observed that features that were located close to the edge of the sphere appeared more linear, and thus were not always detected. Other FPs were due to specular reflections that appeared like well-behaved Lambertian points. Finally, there were some FNs near the middle of the sphere, where there is identical apparent motion in the horizontal and vertical hyperplanes. This is a degenerate case for the current method, due to the symmetry of the refractive object. Principal rays that are directly aligned with the camera are not significantly refracted (their hyperplanes would therefore appear linear and consistent with each other). However, the image of these features appears flipped, and the scale of the object is also often changed. These indicators may be considered in future work to address this issue.

5.4.3 Rejecting Refracted Image Features for Structure from Motion

Since too many refracted image features in a set of input image features can cause SfM to fail, we examine the impact of rejecting refracted image features in an SfM pipeline. We captured 10 sequences of LFs where the camera gradually approached a refractive object using the same lenslet-based LF camera. These sequences were captured on a robot so that the sequences were repeatable and the ground truth of the LF camera poses was known. An OptiTrack motion capture system was used for ground truth camera pose. We used Colmap, a publicly-available SfM implementation which included its own outlier rejection and bundle adjustment [Schoenberger and Frahm, 2016]. Incremental monocular SfM using the central view of the LF was performed on the sequences of images. Each successive image had an increasing number of refracted image features, making it increasingly difficult for SfM to converge. If SfM converged, a sparse reconstruction was produced, and the estimated poses were further analysed. The scene is shown in Fig. 5.1a with a textured, slanted background plane behind a refractive cylinder.

For each LF, SIFT features in the central view were detected, creating an unfiltered set of features, some of which were refracted. Our discriminator was then used to remove refracted image features, creating a filtered set of (ideally) only Lambertian features. Both sets were imported separately into the SfM pipeline. This produced respective “unfiltered” and “filtered” SfM results for comparison. The unfiltered case used all of the available image features, while our method was applied to the filtered case to remove most of the refracted image features from the SfM pipeline.
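A high-level sketch of this filtering step, with hypothetical callables (detect_sift, classify_refracted) standing in for the detector and the discriminator of Section 5.3; the export into the SfM pipeline is not reproduced here.

```python
def filter_lambertian_features(light_field, detect_sift, classify_refracted):
    """Return only the central-view features that the discriminator labels Lambertian.

    detect_sift(view) -> list of features; classify_refracted(light_field, feature) -> bool.
    Both callables are placeholders for the steps described in Section 5.3.
    """
    s0 = light_field.shape[0] // 2
    t0 = light_field.shape[1] // 2
    central_view = light_field[s0, t0]
    features = detect_sift(central_view)
    return [f for f in features if not classify_refracted(light_field, f)]
```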

We note that outlier rejection schemes, such as RANSAC, are often used to reject inconsistent features, including refracted image features. While RANSAC successfully rejected many refracted image features, we observed more than 53% of inlier features used for reconstruction were actually refracted image features in some unfiltered cases. This suggested that in the presence of refractive objects, RANSAC is insufficient on its own for robust and accurate structure and motion estimation.

We measured the ratio of refracted image features r = i_r/i_t, where i_r is the number of refracted image features in the image, and i_t is the total number of features detected in the image. We considered the reprojection error as it varied with r. As shown in Fig. 5.9, the error for the unfiltered case was consistently higher than the filtered case (up to 42.4% higher for r < 0.6 in the red case). Additionally, the unfiltered case often failed to converge, while the filtered case was successful, suggesting better convergence. Sample scenes that caused the unfiltered SfM to fail are shown in Fig. 5.10a and 5.10b. These scenes could not be used for SfM without our method to find consistent image features for reconstruction.

For the monocular SfM, scale was obtained by solving the absolute orientation problem using Horn’s method between the estimated pose p_s and ground truth pose p_g, and only using the scale. Fig. 5.11a shows example pose trajectories reconstructed by SfM for a filtered and unfiltered LF sequence with the ground truth. The filtered trajectory had a more accurate absolute pose over the entire sequence of images. Fig. 5.11b and 5.11c show the relative instantaneous pose error e_i, computed as

e_i = (p_{s,i} - p_{s,i-1}) - (p_{g,i} - p_{g,i-1})   (5.28)

for image i, split into translation and rotation components. To do this, we considered the position of the camera origin at image i as h_i = [P_x, P_y, P_z]^T. We can then write the translation error e_tr for a sequence of images as the L2-norm of the instantaneous translation error

e_{tr} = \sqrt{ \sum_{i=1}^{n_{LF}} \left( (h_i - h_{i-1}) - (h_{g,i} - h_{g,i-1}) \right)^2 },   (5.29)

where n_LF is the number of LFs in the image sequence, and h_{g,i} is the ground truth position at image i.

Figure 5.9: Rejecting refracted image features with our method yielded lower reprojection errors and better convergence for the same image sequences. SfM reprojection error vs refracted image feature ratio for the unfiltered case containing all the features, including refracted image features (dashed), and the filtered case excluding refracted image features (solid). The spike in error at r = 0.6 for filtered sequence 2 was due to insufficient inlier matches for SfM to provide reliable results.

Similarly, we consider the orientation of the camera at image i as θ = [θ_r, θ_p, θ_y]^T for roll, pitch and yaw in Euler angles (XYZ ordering). The rotation error e_rot for a sequence of images is then the L2-norm of the instantaneous rotation error

e_{rot} = \sqrt{ \sum_{i=1}^{n_{LF}} \left( (\theta_i - \theta_{i-1}) - (\theta_{g,i} - \theta_{g,i-1}) \right)^2 }.   (5.30)

Although e_rot was ≈ 0.02°, e_tr had larger errors, up to 0.01 m higher than the filtered case. This suggested that filtering for refracted image features yielded more accurate pose estimates from SfM.
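A small sketch of the error metrics (5.29) and (5.30), assuming the estimated and ground-truth trajectories are given as aligned arrays of positions and Euler angles:

```python
import numpy as np

def relative_pose_errors(pos_est, pos_gt, rot_est, rot_gt):
    """Translation and rotation errors of (5.29) and (5.30).

    pos_est, pos_gt: (n_LF, 3) camera positions; rot_est, rot_gt: (n_LF, 3) Euler angles.
    """
    d_pos = np.diff(pos_est, axis=0) - np.diff(pos_gt, axis=0)   # instantaneous translation error
    d_rot = np.diff(rot_est, axis=0) - np.diff(rot_gt, axis=0)   # instantaneous rotation error
    e_tr = np.sqrt(np.sum(d_pos ** 2))
    e_rot = np.sqrt(np.sum(d_rot ** 2))
    return e_tr, e_rot
```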

In Table 5.2, we show that filtering refracted image features leads to an average of 4.28 mm lower e_tr and 0.48° lower e_rot relative instantaneous pose errors over 5 LF sequences with different objects, poses and backgrounds, excluding Seq. 6, where the number of inlier feature matches for SfM dropped below 50. The number of LFs in each sequence varied, because the unfiltered case could not converge with more images at the end of the sequence where r was higher. Seq. 7 and 8 are examples of where only our filtered case converged, so that SfM produced a trajectory for analysis. Thus, filtering refracted image features using our method yielded more consistent (non-refracted) image features that improved the accuracy of the SfM pose estimates compared to not filtering for refracted image features, and made SfM more robust in the presence of refractive objects.

Figure 5.10: Both (a) and (b) show example images for the refractive cylinder and sphere, respectively, where SfM could not converge without filtering out refracted image features using our method.

Table 5.2: Comparison of mean relative instantaneous pose error for unfiltered and filtered SfM-reconstructed trajectories

                    Unfiltered                              Filtered
Seq.  #LFs   e_tr [mm]  e_rot [°]  #inliers      e_tr [mm]  e_rot [°]  #inliers
1     10     18.86      5.72       160           8.09       4.52       127
2     10     10.45      4.66       285           7.10       4.29       140
3     10     10.17      4.52       281           6.94       4.09       186
4     9      11.13      4.70       296           7.50       4.37       224
5     8      6.07       4.47       201           5.66       4.39       196
6     10     6.52       0.74       207           15.21      1.58       50
7     10     N/A        N/A        N/A           8.51       4.02       155
8     10     N/A        N/A        N/A           6.95       4.16       230

Figure 5.11: For cases where SfM converged, filtering out the refracted image features yielded more accurate pose estimates. (a) Sample pose trajectory with the filtered (red) closer to ground truth (blue), compared to the unfiltered case (green). Relative instantaneous pose error for translation (b) and rotation (c) are shown over a sample LF sequence, where the filtered case was consistently lower than the unfiltered case. (d) With our method, the refractive feature ratio for the filtered case was lower than the unfiltered case.

For the cases where SfM converged in the presence of refractive objects, we created a sparse reconstruction of the scene of Fig. 5.1, which was primarily the Lambertian background plane, since we attempted to remove refracted image features distorted by the cylinder. Sample reconstructions for both the unfiltered and filtered cases are shown in Fig. 5.12. Both point clouds were centered about the origin and rotated into a common frame. For visualization, an overlay of the scene geometry’s best fit to the background plane is provided. The unfiltered case had to be re-scaled according to the scene geometry (as opposed to via the poses as done in Fig. 5.12) for comparison. Scaling via scene geometry resulted in severely worse pose trajectories for the unfiltered case, although similar observations were made: with our method, there were fewer points placed within the empty space between the refracted object and the plane. This is an important difference since the absence of information is treated very differently from incorrect information in robotics. For example, estimated refracted points might incorrectly fill an occupancy map, preventing a robot from grasping refractive objects.

5.5 Conclusions

In this chapter, we proposed a method to discriminate refracted image features based on a planar fit in 4D and slope consistency. To achieve this, we introduced a novel textural cross-correlation technique to extract feature curves from the 4D LF. Our approach demonstrated higher precision and recall than previous work for LF camera arrays, and extended the detection capability to lenslet-based LF cameras. For these cameras, slope consistency proved to be a much stronger indicator of distortion than planar consistency. This is appealing for mobile robot applications, such as domestic robots that are limited in size and mass, but will have to navigate and eventually interact with refractive objects. Future work will examine in more detail the impact of thresholds on the discriminator through the use of precision-recall curves, as well as relate image feature slopes to surface curvature to aid grasping.

It is important to note that while we have developed a set of criteria for refracted image features in the LF, these criteria are not necessarily limited to refracted image features. Depending on the surface, specular reflections may appear as nonlinear in the EPI. Such image features are typically undesirable, and so we retain image features that are strongly Lambertian, and thus good candidates for matching, which ultimately leads to more robust robot performance in the presence of refractive objects.

Figure 5.12: For the scene shown in Fig. 5.1a, (a,c) the unfiltered case resulted in a sparse reconstruction where many points were generated between the refractive cylinder (red) and the background plane (blue). In contrast, (b,d) the filtered case resulted in a reconstruction with fewer such points, and the resulting camera pose estimates were more accurate. (a) Side view, unfiltered. (b) Side view, filtered. (c) Top view, unfiltered. (d) Top view, filtered. The cylinder and plane are shown to help with visualization only. The camera (green) represents the general viewpoint of the scene, not the actual position of the camera.

Our experiments have shown that we can exclude refracted image features in a scene containing spherical and cylindrical refractive objects; however, it is likely that not all planar objects, such as windows, would be detected by our method. Some types of glass with a homogeneous refractive index may not be detected by our method because, by design, they do not significantly distort the LF; a glass rectangular prism is one example. However, features viewed through curved surfaces or non-homogeneous refractive indices, such as those commonly seen through privacy glass and stained glass windows, should be detected based on the nonlinearities created by the distortions of the object.

In this chapter we have explored the effect of removing the refractive content from the scene. We have demonstrated that rejecting refracted image features for monocular SfM yields lower reprojection errors and more accurate pose estimates in scenes that contain refractive objects. The ability to more reliably perceive refractive objects is a critical step towards enabling robots to reliably recognize, grasp and manipulate refractive objects. In the next chapter, we exploit the refractive content to control robot motion.

Chapter 6

Light-Field Features for Refractive Objects

For an eye-in-hand robot manipulator and a refractive object surrounded by Lambertian scene elements, we can use the Lambertian elements in the scene to approach the refractive object using the LF-IBVS for Lambertian scenes developed in Chapter 4. The refractive object can be partially detected via a variety of methods, such as the refracted image features of Chapter 5, or another technique, such as using the occluding edges of the refractive object [Ham et al., 2017]. However, as the camera's FOV becomes increasingly dominated by the refractive object, the Lambertian scene content occupies an increasingly small portion of the image, to the point where it is no longer available. In this situation, we must consider using the refractive object itself (and thus the refracted image features) for positioning control tasks, such as visual servoing. In this chapter, we combine the two previous chapters to develop a refracted light-field feature (a light-field feature whose rays have been distorted by a refractive object) that will enable control tasks, such as visual servoing towards refractive objects.


6.1 Refracted LF Features for Vision-based Control

If we consider the physics of a two-interface refractive object, the light paths tracing from the point of origin, along the intersecting lines at the refractive object's boundaries, to the camera sensor can be described by over twelve characteristics (see Fig. 3.2). The problem of completely reconstructing this light path is severely under-constrained for a single LF camera observation. However, the problem is more constrained for the task of position control, where only a few DOFs need to be controlled with respect to the object (as opposed to recovering the complete object/scene geometry). Therefore, we approximate the local surface curvature in two orthogonal directions, which allows us to model that part of the refractive object as a type of lens. With an LF camera, we can observe the background projections caused by this lens. We can describe these observations with at least five parameters in the LF, which we use as our refracted light-field feature for refractive objects. This local description of the refractive object is much simpler than a complete surface reconstruction of the refractive object. While it may not be sufficient to fully reconstruct the shape of a refractive object, it will be sufficient for vision-based position control tasks, such as visual servoing.

The main contributions of this chapter are as follows:

• We propose a compact representation for a refracted LF feature, which is based on the local projections of the background through the refractive object. We assume that the surface of the refractive object can be locally approximated as having two orthogonal surface curvatures. We can then model the local part of the refractive object as a toric lens. The properties of the local projections can then be observed and extracted from the light field.

• We provide an analysis of our refracted LF feature's behaviour in the LF in simulation. In particular, we illustrate the feature's continuity with respect to LF camera pose. Doing so shows the potential for the feature's use in vision-based control tasks towards refractive objects.

The rest of this chapter is organised as follows. We discuss related work in Section 6.2. In Section 6.3, we discuss the optics of the lens elements that describe the behaviour of our refracted LF feature. The formulation of our refracted LF feature and the method of extracting it from observations captured by the LF camera are described in Section 6.4. In Section 6.5, we describe our implementation and discuss experimental results that illustrate the continuity and suitability of our feature for a variety of refractive objects in simulation, for the purposes of visual servoing. In Section 6.6, we provide an illustrative example of visual servoing towards a refractive object. Lastly, in Section 6.7, we conclude the chapter and explore future work.

6.2 Related Work

Grasping and manipulation of refractive objects have been considered in previous work. Choi et al. developed a method to localise refractive objects in real-time with a monocular camera [Choi and Christensen, 2012]. Their method extracted contours from a given image and efficiently matched them to a database of refractive object contours with known poses. Walter et al. did so with an LF camera combined with an RGB-D sensor [Walter et al., 2015]. Lysenkov et al. recognised and estimated the pose of rigid transparent objects using an RGB-D (structured-light) sensor [Lysenkov, 2013]. Recently, Zhou et al. used an LF camera to recognise and grasp a refractive object by developing a light-field descriptor based on the distribution of depths observed by the LF camera [Zhou et al., 2018]. However, all of these previous works rely on having a 3D model of the object a priori. Complete and accurate geometric models of refractive objects are extremely difficult or time-consuming to acquire.

While the reconstruction of opaque surfaces with Lambertian reflectance is a well-studied problem in computer and robotic vision, reconstructing the shape of refractive objects poses challenging problems. Ihrke et al. provide an excellent survey on transparent and specular object reconstruction [Ihrke et al., 2010a]. Kutulakos et al. developed light path theory on refractive objects and performed refractive object reconstruction on complex inhomogeneous refractive objects [Kutulakos and Steger, 2007, Morris and Kutulakos, 2007]. If the light paths can be fully determined, the shape reconstruction is solved. However, from this work, it is clear that for a two-interface object, there are many more parameters needed than can be measured directly by an LF camera. We are left with an underdetermined system of equations, which is insufficient for shape reconstruction.

Taking a slightly different approach, Ben-Ezra et al. used multiple monocular images to recover a parameterised refractive object shape and pose [Ben-Ezra and Nayar, 2003], while Wanner et al. used LF cameras to reconstruct planar reflective and refractive surfaces [Wanner and Golduecke, 2013]. There are many other prior works that rely on controlling background patterns [Kim et al., 2017, Kutulakos and Steger, 2007, Morris and Kutulakos, 2007, Wetzstein et al., 2011], and shape assumptions [Kim et al., 2017, Tsai et al., 2015]. Many of these approaches rely on known lighting systems, large displays behind the refractive object in question and other bulky setups that are impractical for real-world robots in general unstructured scenes. We are interested in an approach that does not require large apparatus surrounding the refractive object and does not require models of the entire refractive object. Our work is different from these previous works in that we are not focused on the problem of reconstructing refractive object surfaces. Rather, we aim to develop a refracted LF feature that will enable us to use visual servoing to approach refractive objects.

In Chapter 4, we developed the first light-field image-based visual servoing algorithm by using a feature based on central view image coordinates, augmented with slope [Tsai et al., 2017]; however, like many previous works, the implementation was limited to Lambertian scenes. We revisit the Lambertian light-field feature and LF-IBVS in the context of refractive objects by proposing a novel LF feature for refractive objects. To the best of our knowledge, a refracted light-field feature for image-based visual servoing towards refractive objects has not yet been proposed.

For LF features, Tosic et al. developed LF-edge features [Tosic and Berkner, 2014]; however, our interest is in keypoint features, which tend to be more uniquely identifiable and are more commonly applied to visual servoing and structure from motion tasks. Teixeira et al. used EPIs to detect reliable Lambertian image features [Teixeira et al., 2017]. Similarly, Dansereau recently proposed the Light-Field Feature (LIFF) detector and descriptor [Dansereau et al., 2019], which focuses on detecting and describing reliable Lambertian image features in a scale-invariant manner. However, all of these LF features are designed for Lambertian scenes, and are not suitable for describing refracted image features.

Maeno et al. proposed the light-field distortion feature (LFD) [Maeno et al., 2013]. Xu et al. built on the LFD and used it for transparent object image segmentation, but only characterised a refracted feature as a single hyperplane [Xu et al., 2015]. In Chapter 5, we then developed a classifier for refracted image features using an LF camera [Tsai et al., 2019]. A Lambertian point feature was identified as a planar structure in the 4D LF, which can be described by the intersection of two 4D hyperplanes. The nature of this 4D planar structure changes in the light field when distorted by a refractive object, and was used for discriminating refracted image features. Previously, only a limited subset of views (the central cross of the LF) was used to describe the 4D planar structure. In this chapter, we use feature correspondences from all of the LF views and extend the theory of how we can observe, extract and estimate the 4D planar structure of a refracted light-field feature in the LF for the purposes of visual servoing.

6.3 Optics of a Lens

We first assume that a large, complex refractive object can be sufficiently approximated by several smaller parts. These parts are smooth and we constrain the surface to directionally-varying curvature by choosing two orthogonal directions on the surface. A surface defined in this manner is similar to a type of astigmatic lens, known as a toric lens, which is commonly used by optometrists to describe and correct astigmatisms [Hecht, 2002]. Thus, we can approximate small local parts of the refractive object as a toric lens. In general, refractive objects can project the background into space, and lenses do this in a predictable manner. In this section, we provide a brief background in the optics of a spherical, cylindrical and finally toric lens, in order to better understand how the appearance of a feature may be distorted by such a lens, and how it may be observed in the light field. We describe our reasons for choosing the toric lens for our refracted LF feature in Section 6.4.

6.3.1 Spherical Lens

One of the most common and simple lenses is the spherical lens. A convex spherical lens surface is derived from a slice of a sphere, such that it has equal focal lengths in all orientations (it has a single focal length) and thus focuses collimated light to a single point. As in geometrical optics, we assume the light acts as rays (no waves). We assume we are in air, such that the index of refraction n_air = 1. We assume the lens is thin and we assume paraxial rays. The lens formula is then given as

\frac{1}{f} = (n - 1)\left[\frac{1}{R_1} - \frac{1}{R_2} + \frac{(n - 1)d}{n R_1 R_2}\right], \qquad (6.1)

where n is the index of refraction of the lens material, R1 and R2 are the radii of curvature of the front and back surfaces, and d is the thickness of the lens. For thin lenses, d is much smaller than R1 and R2 and approaches zero. Equation (6.1) is useful because it relates surface curvature to focal length, and can be used to derive the equation describing image formation, sometimes called the lensmaker's formula. As discussed in Section 2.2.2, the lensmaker's formula is given as

\frac{1}{f} = \frac{1}{z_o} + \frac{1}{z_i}, \qquad (6.2)

where z_o and z_i describe the distance of the object and image, respectively, along the optical axis of the lens. Therefore, given focal length f and z_o, we can determine the z_i formed by the lens.
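To make (6.1) and (6.2) concrete, the following is a minimal Python sketch (not code from the thesis) that computes the thin-lens focal length from the two surface radii and then the image distance for a given object distance; the numeric values are illustrative assumptions only.

```python
def thin_lens_focal_length(n, r1, r2):
    """Thin-lens limit of (6.1): 1/f = (n - 1) * (1/R1 - 1/R2), with d -> 0."""
    return 1.0 / ((n - 1.0) * (1.0 / r1 - 1.0 / r2))

def image_distance(f, z_o):
    """Lensmaker's formula (6.2): 1/f = 1/z_o + 1/z_i, solved for z_i."""
    return 1.0 / (1.0 / f - 1.0 / z_o)

# Illustrative values only: a biconvex lens in air with the rendered glass index.
f = thin_lens_focal_length(n=1.45, r1=0.10, r2=-0.10)   # radii in metres
z_i = image_distance(f, z_o=1.0)                         # object 1 m from the lens
print(f"f = {f:.3f} m, image forms at z_i = {z_i:.3f} m")
```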

6.3.2 Cylindrical Lens

Cylindrical lenses are sliced from the shape of a cylinder. Cylindrical lenses also have a single focal length, but focus collimated light into a line instead of a point. We refer to this line as the focal line. The focal line is parallel to the longitudinal axis of the lens. Effectively, the lens compresses the image of the background in the direction perpendicular to the focal line. The background image is unaltered in the direction parallel to the focal line.

6.3.3 Toric Lens

A toric lens has two focal lengths in two orientations perpendicular to each other. As shown in Fig. 6.1, the surface of a toric lens can be formed from a slice out of a torus. The surface of a torus can be formed by revolving a circle of radius R2, about a circle of radius R1. A slice, shown in dashed red, forms the surface of a toric lens. The radii of curvature are related to the focal length, as in (6.1). An astigmatic lens is the more general form of the toric lens, where (for the astigmatic lens) the axes of the two focal lengths are not constrained to be perpendicular to each other.

The two focal lengths cause a toric lens to focus light at two different distances from the lens, resulting in two focal lines. A toric lens has the same optical effect as two perpendicular cylindrical lenses combined. Visually, this is seen as a "flattening" of rays with respect to their respective axes at these two distances [Freeman and Fincham, 1990]. The shape of the bundle of rays passing through the astigmatic lens is known as an astigmatic pencil. The mathematician Jacques Sturm (1838) investigated the properties of the astigmatic pencil, and thus the astigmatic pencil is also known as Sturm's conoid. The distance between the focal lines is known as the interval of Sturm. The circular cross-section where the pencil has the smallest area is known as the circle of least confusion. Fig. 6.2 shows a rendering of the visual effect of a toric lens on a background circle.
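As an illustration of the interval of Sturm, the sketch below (a hypothetical example, not the thesis implementation) treats the toric lens as two orthogonal focal lengths f1 and f2, places each focal line using the lensmaker's formula (6.2), and reports their separation.

```python
def image_distance(f, z_o):
    # Lensmaker's formula (6.2): 1/f = 1/z_o + 1/z_i, solved for z_i.
    return 1.0 / (1.0 / f - 1.0 / z_o)

def sturm_interval(f1, f2, z_o):
    """Depths of the two focal lines of a toric lens, and their separation."""
    z1 = image_distance(f1, z_o)   # focal line from the first curvature
    z2 = image_distance(f2, z_o)   # focal line from the orthogonal curvature
    return z1, z2, abs(z1 - z2)

# Illustrative numbers only: two focal lengths and a distant background point.
z1, z2, interval = sturm_interval(f1=0.10, f2=0.15, z_o=10.0)
print(f"focal lines at {z1:.3f} m and {z2:.3f} m; interval of Sturm = {interval:.3f} m")
```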


Figure 6.1: (a) A torus can be defined by two radii, R1 and R2. The surface of a toric lens can be sliced (dashed red) from a torus. (b) The toric lens surface is defined by the two radii of curvature, and therefore two focal lengths f1 and f2. The directions of these two curvatures are perpendicular to each other.

6.4 Methodology

There are three reasons for choosing the toric lens for locally modeling a large, complex refractive object. First, it is reasonable to assume local orthogonal surface curvatures as a first-order approximation to any Euclidean surface. Second, it is one of the simplest refractive objects that we can unambiguously use to describe a feature in relation to camera pose. Third, the toric lens is more descriptive than a spherical lens in terms of describing the location and orientation of the image created by projecting a Lambertian point through the lens; a spherical lens is ambiguous in its orientation. In this section, we propose our refracted LF feature, which is based on the background projections of a toric lens. We first define our refracted LF feature, and then describe our method to extract it from the LF.

A Lambertian point P emits rays of light that pass through a toric lens and into the LF camera. Toric lenses project the background into 3D space through two focal lines, located at two different distances from the lens that depend on the local surface curvatures. We can recover where these focal lines occur in 3D based on the ray observations captured by the LF camera. Furthermore, we can show that these vary continuously with respect to LF camera viewing pose, which makes these measurements suitable for positioning control tasks, such as visual servoing. In sum, we propose a refracted LF feature based on the projections produced by local toric lenses, which will be suitable for vision-based position and control tasks in scenes dominated by a refractive object.

Figure 6.2: A rendering of the visual effect of a toric lens on a blue background circle. In this scene, a toric lens is aligned with the principal axis of a camera. The camera is moved along this axis towards the lens. The toric lens is the transparent circular disk in the middle of the images (1-9). For reference, the background is a checkerboard with a blue circle in the centre. Far away (1), the blue circle appears as a flattened ellipse. At (3), the image of the blue circle is almost completely flattened, and appears as a line at one of the focal lengths of the lens. As the camera progresses closer, the effect of the two focal lengths acting on orthogonal axes balances out. Image (6) shows the blue dot as a circle at the circle of least confusion. Moving forwards, the circle is stretched vertically at the second focal length of the toric lens at (9). Finally, the image appears almost undistorted at (12) when the camera is directly in front of the toric lens.

For our approach, we assume that the local surface curvatures of the refractive object can be described by a toric lens. The validity of this assumption, and thus the effectiveness of our method, depends on how smooth the surface of the refractive object is compared to the baseline of the LF camera. A high-frequency surface curvature may make the background image unmatchable and not locally well approximated by a toric lens. We also assume a thin lens, although thick lenses can be considered in future work for more general refractive objects. We assume that the background is infinitely far from the refractive object, such that we are dealing with collimated light. Lastly, we assume that there is sufficient background texture to facilitate image feature correspondence within the LF (i.e., between sub-images of the LF), which applies to most feature-based robotic vision methods.

6.4.1 Refracted Light-Field Feature Definition

As described in Section 2.7, a Lambertian point in 3D induces a plane in 4D. This plane can be described by the intersection of two 4D hyperplanes. Mathematically, the relation between the 3D point and the LF observations can be described by (4.1). Each hyperplane can be described by a normal vector. In Chapter 5, we showed that these normal vectors are related to the light-field slope, which is inversely proportional to the depth of the point. For a Lambertian point, the apparent motion profiles of the feature in the LF are linear and the two slopes from the two hyperplanes are consistent with each other; that is, they are equal in magnitude.

However, for a refracted image feature, these two motion profiles can be nonlinear and/or the slopes can be inconsistent with each other. The latter implies that they can have different magnitudes. We showed this to be sufficient to discriminate Lambertian image features from refracted image features in Chapter 5. Astonishingly, a Lambertian point projected through a toric lens also yields a plane in 4D. Although the normals are not necessarily equal in magnitude, as they are in the Lambertian case, the apparent motion profiles are still linear. We can therefore describe the projections from a toric lens using two slopes. We can also include a measure of the orientation of the toric lens with respect to the LF camera. In this section, we show how the 4D plane is still formed through the projections of a toric lens, and how we can use this insight to develop a refracted LF feature.

6.4.1.1 Two Slopes

As in Chapters 4 and 5, we parameterise the LF using the relative two-plane parameterisation

(2PP) [Levoy and Hanrahan, 1996]. A light ray φ emitted from a point P in the scene has coordinates φ = [s, t, u, v], and is described by its two points of intersection with two parallel reference planes. The s,t plane is conventionally closest to the camera, and the u,v plane is conventionally closer to the scene, separated by an arbitrary distance D.

Figure 6.3: (a) Light-field geometry for a point in space for a single view (black), and other views (grey) in the xz-plane, whereby u varies linearly with s for all rays originating from P = (Px, Pz). (b) A 2D (xz-plane) illustration of a background feature P that is projected through a toric lens (blue). The lens is characterised by focal length f and converges at the focal line C. Note that C appears as a point here because C is a line along y into the page. C is created by the rays (red). From P to C, the image created by the lens is upright, but from C to the LF camera, the image flips and an inverted image is observed by the 2PP of the LF camera (green). In relation to Fig. 6.3a, it is clear that the LF camera's slope measurements capture the depth of the toric lens' formed image.

Considering the xz-plane, when a Lambertian point P is projected through a thin toric lens, it forms a line at C, which is subsequently captured by the LF camera. Fig. 6.3 illustrates the rays traced from P to the observations captured by the light-field camera. It is important to note that in the xz-plane, C appears as a focal point; however, in 3D, C actually represents a focal line. In relation to Fig. 6.3a, Fig. 6.3b shows that an LF camera captures the location of the toric lens' image formation point C. The rays are arranged in such a way that the LF camera captures C's slope for both the xz- and yz-planes. Additionally, the position of C depends on the position of P in the background behind the lens, as in (6.2).

Although much of this discussion has been focused on the positions of the two orthogonal focal lines, we note that the light is focused on a continuum of distances from the toric lens along Sturm’s conoid. However, the most salient aspects of Sturm’s conoid that can be directly observed in the LF are its end points. Therefore, light rays emitted from P are refracted by the toric lens and converge to two different and orthogonal focal lines. These focal lines occur at two different depths from the LF camera’s perspective.

6.4.1.2 Orientation

We can describe the orientation of the focal lines with respect to the LF camera. In ophthalmology, the optical axis of the toric lens is typically aligned with the principal axis of the eye (the LF camera in our situation). The lens' orientation is then described with a single angle θ as the rotation about the principal axis from the x-axis of the LF camera to the xy-axes of the toric lens. Fig. 6.4 illustrates the orientation of the toric lens θ with respect to the LF camera. If we define f1 and f2 as the two focal lengths of the toric lens, we note that as the difference between f1 and f2 becomes small, the interval of Sturm approaches a point and the lens approaches a spherical lens. The focal lines then intersect at a focal point, and the orientation information becomes poorly-defined and unusable.

Figure 6.4: The blue ellipse represents the toric lens. The lens orientation θ is defined as the angle between the refractive object frame (xr, yr) and the camera frame (xc, yc), i.e., the orientation of the axis-aligned focal lines relative to the camera frame. For notation, s,t are aligned with xc,yc.

6.4.1.3 Combined Slopes and Orientation

Our previous LF feature for Lambertian points in Chapter 4 was p = [u_0, v_0, w], where u_0 and v_0 were the image coordinates of the feature in the central view of the LF (s = 0, t = 0), and w was the slope. Accounting for both slopes and the orientation of the toric lens, we can augment our Lambertian LF feature as a refracted LF feature described by

\mathrm{RLF} = [u_0, v_0, w_1, w_2, \theta], \qquad (6.3)

where w_1 and w_2 are the two slopes related to the distances to the two focal lines of the toric lens from the LF camera.
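As a simple illustration, the five-parameter feature of (6.3) could be held in a container such as the following sketch; the class and field names are ours, not an implementation from the thesis.

```python
from typing import NamedTuple

class RefractedLFFeature(NamedTuple):
    """Refracted light-field feature of (6.3): central-view image coordinates,
    the two slopes of the toric-lens projections, and the lens orientation."""
    u0: float      # central-view image coordinate u (s = 0, t = 0)
    v0: float      # central-view image coordinate v
    w1: float      # slope related to the first focal line
    w2: float      # slope related to the second (orthogonal) focal line
    theta: float   # orientation of the focal lines about the principal axis [rad]

# A Lambertian point is the special case w1 == w2 (orientation then undefined).
feature = RefractedLFFeature(u0=0.01, v0=-0.02, w1=-0.12, w2=-0.10, theta=0.3)
```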

Notably, for the axis-aligned case, where the principal axis of the LF camera is aligned with the toric lens' optical axis, our refracted LF feature follows the chief ray1 from the centre of the toric lens to the centre of the LF camera. For the off-axis case (where the two axes are not necessarily aligned), the refracted LF feature follows the LF camera's chief ray to (u0, v0) in the image plane. Regardless, each focal line must intersect the optical axis of the toric lens. In either case, we can determine the 3D location of each of the two points of intersection, C1 and C2, using similar triangles. The rays passing through the focal lines and into the LF camera all pass through the line segment C1C2, which is known as the interval of Sturm. The line segment C1C2 may be sufficient for visual servoing, as illustrated in Fig. 6.5; as with many local feature-based approaches, it is also possible to consider multiple refracted LF features at the same time for visual servoing.
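The following sketch illustrates one way the end points C1 and C2 of the interval of Sturm could be placed in 3D. It assumes (our assumption, consistent with the Lambertian slope-depth relation w = −D/Pz in (6.5)) that each slope maps to a depth z = −D/w and that both points lie on the central chief ray through (u0, v0); the helper name and values are hypothetical.

```python
import numpy as np

def focal_line_points(u0, v0, w1, w2, D):
    """Approximate 3D end points C1, C2 of the interval of Sturm.

    Assumes depth z = -D / w for each slope (as in the Lambertian relation of
    (6.5)) and that both points lie on the central chief ray through the
    uv-plane coordinates (u0, v0), which sit at distance D from the st-plane.
    """
    points = []
    for w in (w1, w2):
        z = -D / w                      # depth of the focal line from the camera
        scale = z / D                   # similar triangles along the chief ray
        points.append(np.array([u0 * scale, v0 * scale, z]))
    return points                       # [C1, C2]; C1C2 is the interval of Sturm

C1, C2 = focal_line_points(u0=0.01, v0=-0.02, w1=-0.12, w2=-0.10, D=0.5)
```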

Additionally, our refracted LF feature is not limited to the application of refractive objects. For Lambertian points, the two slopes of the refracted LF feature are equal in magnitude. The 3D line segment of the refracted LF feature therefore reduces to a 3D point. By ignoring the orientation, our refracted LF feature generalises the Lambertian LF feature developed in Chapter 4.

1 In optics, the chief ray, or principal ray, is the ray that passes through the centre of the aperture. Thus, chief rays are equivalent to rays observed by a pinhole camera.

Figure 6.5: A Lambertian point P emits a ray of light that passes through the toric lens (blue). The ray reaches the central view of the LF camera at {L}. The refracted light-field feature (red) is shown as the 3D line segment created by the position of the two focal lines, rotated by an orientation with respect to the LF camera's xy-axes along the chief ray. The central view image coordinates u0, v0, the slopes w1 and w2, and the orientation θ define our refracted LF feature.

6.4.2 Refracted Light-Field Feature Extraction

In this section, we explain our method to extract the refracted LF feature from the LF. Using the observations captured by the LF camera, we solve for the 4D plane as a 2D projection matrix. We then decompose the projection matrix into scaling and rotation components, which allow us to extract the slopes and orientation of the projections formed by the toric lens.

6.4.2.1 LF Observations through a Toric Lens

For the scenario outlined in Fig. 6.3b, a Lambertian point P in the background emits rays of light that project through a toric lens to produce a plane in the continuous-domain LF. In the discrete domain, where we sample s,t in a uniform grid of points, the projections appear as a rectangular grid of points on the uv-plane. As in Ch. 5, we consider the Light-Field Distortion feature [Maeno et al., 2013] as a set of u,v relative to (u0, v0), the image coordinates of an image feature in the central view (s0, t0). Then we can generally write the projection of P through a toric lens as

\begin{bmatrix} \Delta u \\ \Delta v \end{bmatrix} = A \begin{bmatrix} s \\ t \end{bmatrix} = \begin{bmatrix} a_1 & a_2 \\ a_3 & a_4 \end{bmatrix} \begin{bmatrix} s \\ t \end{bmatrix}, \qquad (6.4)

where A is a 2 × 2 matrix. We note that if we have a spherical lens, or simply a Lambertian point, then (6.4) simplifies to

\begin{bmatrix} \Delta u \\ \Delta v \end{bmatrix} = A_L \begin{bmatrix} s \\ t \end{bmatrix} = \begin{bmatrix} w & 0 \\ 0 & w \end{bmatrix} \begin{bmatrix} s \\ t \end{bmatrix}, \qquad (6.5)

where w = −D/P_z. For the case of P projecting through a toric lens, in (6.4), we can factorise A into three components of SVD as

A = A_L \Sigma_A A_R^T, \qquad (6.6)

where AL is a 2 × 2 matrix, ΣA is a diagonal matrix with non-negative real numbers on the diagonal, and AR is also a 2 × 2 matrix. The diagonal entries of ΣA are the singular values of A and represent the two slopes of the projections of the toric lens, as seen by the LF camera.

The columns of AL and AR are the left-singular and right-singular vectors of A, respectively. Intuitively, we can interpret this factorisation as three geometrical transformations: a rotation or reflection (AL), a scaling (ΣA), followed by another rotation or reflection (AR). The orientation recovered from AL should be the same as that from AR. We can later extract the slopes and orientation from these three matrices. Therefore, in order to extract the slopes and orientation of the toric lens, we must first recover the projection matrix A.

6.4.2.2 Projection Matrix

We can write (6.4) in terms of the elements of A as

\underbrace{\begin{bmatrix} \Delta u \\ \Delta v \end{bmatrix}}_{b} = \underbrace{\begin{bmatrix} s & t & 0 & 0 \\ 0 & 0 & s & t \end{bmatrix}}_{F} \underbrace{\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix}}_{x}. \qquad (6.7)

F is a matrix of at most rank two, because s = kt, where k ∈ R, which means we can reduce the columns of F to a minimum of two independent columns. This equation has the common form F x = b. We can stack LF observations of s, t, ∆u and ∆v for each corresponding point in all n × n views of the LF and estimate a1, a2, a3, and a4 in the least-squares sense. We can then form A and subsequently solve for the two slopes and the orientation.
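A minimal sketch of this least-squares step is shown below: the per-view observations (s, t, Δu, Δv) are stacked into F and b as in (6.7) and solved for the entries of A. The function name and the use of numpy's least-squares solver are our choices, not the thesis implementation.

```python
import numpy as np

def estimate_projection_matrix(s, t, du, dv):
    """Estimate the 2x2 matrix A of (6.4) from stacked LF observations.

    s, t   : view coordinates of each sub-aperture observation (length N)
    du, dv : image coordinates relative to the central view (length N)
    """
    s, t, du, dv = map(np.asarray, (s, t, du, dv))
    zeros = np.zeros_like(s)
    # Two rows of F per observation, as in (6.7).
    F = np.vstack([np.column_stack([s, t, zeros, zeros]),
                   np.column_stack([zeros, zeros, s, t])])
    b = np.concatenate([du, dv])
    x, *_ = np.linalg.lstsq(F, b, rcond=None)   # x = [a1, a2, a3, a4]
    return x.reshape(2, 2)
```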

6.4.2.3 Slope Extraction

We can extract the slopes from Σ_A as the negatives of its diagonal entries. We note that singular values are non-negative because the matrix A^T A has non-negative eigenvalues, and the singular values are the square roots of these eigenvalues,

\Sigma_A = \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix}, \qquad (6.8)

where σ1 and σ2 are the singular values of A. However, we know that the slopes for a point in front of the LF camera with a positive D should be negative, based on (5.7). Then the slopes of the toric lens projections are given as

w_1 = -\sigma_1, \qquad (6.9)

w_2 = -\sigma_2. \qquad (6.10)

6.4.2.4 Orientation Extraction

In order to extract θ, we must first consider a 2D rotation and a 2D reflection. A 2D rotation matrix has the form

\mathrm{Rot}(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}. \qquad (6.11)

For a 2D reflection, we can generally reflect vectors perpendicularly over a line that makes an angle γ with the positive x-axis. The 2D reflection matrix then has the form

\mathrm{Ref}(\gamma) = \begin{bmatrix} \cos 2\gamma & \sin 2\gamma \\ \sin 2\gamma & -\cos 2\gamma \end{bmatrix}. \qquad (6.12)

In our case, θ = γ, so the combined reflection and rotation matrix R is given as

R = \mathrm{Ref}(\theta)\,\mathrm{Rot}(\theta) = \mathrm{Ref}\!\left(\tfrac{\theta}{2} - \theta\right). \qquad (6.13)

This reduces to

R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ -\sin\theta & -\cos\theta \end{bmatrix}. \qquad (6.14)

Applying (6.14) to AR and AL yields two angles. The first angle represents a rotation and reflection to the principal axes of the LF observations on the uv-plane. The singular values represent scaling along the principal axes of the LF observations. The last angle represents the same rotation and reflection back to the original LF observations. Since we are dealing with 2D rotations, these two angles should be equal. Thus, we only have to extract a single angle θ.
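The decomposition of A into two slopes and an orientation can be sketched as follows. Negating the singular values follows (6.9) and (6.10); reading the angle from the first right-singular vector is our simplification of the reflection-and-rotation reasoning in (6.11)-(6.14), and it shares the sign/axis ambiguity discussed in the experimental results.

```python
import numpy as np

def extract_slopes_and_orientation(A):
    """Decompose the estimated 2x2 projection matrix A as in (6.6).

    Returns the two slopes (negated singular values, per (6.9)-(6.10))
    and an orientation angle recovered from the singular vectors.
    """
    U, sigma, Vt = np.linalg.svd(A)          # A = U * diag(sigma) * Vt
    w1, w2 = -sigma[0], -sigma[1]            # slopes of the toric-lens projections
    # Angle of the first right-singular vector relative to the camera x-axis;
    # one way to read off theta, up to the SVD's inherent sign/axis ambiguity.
    theta = np.arctan2(Vt[0, 1], Vt[0, 0])
    return w1, w2, theta
```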

Unlike previous work, where we only considered the central cross (horizontal and vertical) of the sub-views in the LF, in this work we consider all of the sub-views. This improvement allows us to better characterise refractive objects of different orientations (which was also not accounted for in previous work), and to use more of the information captured by the LF, reducing uncertainty in the fit.

6.5 Experimental Results

For position control tasks, we are primarily interested in feature continuity. Continuity implies that there are no abrupt breaks or jumps in a function. For our refracted LF feature, continuity means that u0, v0, w1, w2 and θ all vary smoothly with respect to viewing pose, locally on the surface of the refractive object. Methods such as visual servoing typically rely on feature continuity to incrementally step towards the goal pose. In this section, we describe two implementations and preliminary experimental results for investigating the continuity of our refracted LF feature with respect to a variety of viewing poses and different refractive object types.

6.5.1 Implementations

We developed two implementations for investigating refracted LF feature continuity. First, we developed a single-point ray simulation for a single Lambertian point through a toric lens. Note that this is not a ray-tracing method in the classic sense, in which rays are propagated from the source to the camera sensor. The purpose of this setup was to provide a useful figure to illustrate the nature of the toric lens and the focal lines, and to act as a proof of concept for the refracted LF feature.

Second, we performed a ray-tracing simulation of a background scene, refractive object and LF camera using Blender, a popular and freely-available rendering tool. We used the Cycles renderer, which performed physics-based ray tracing for accurate renderings through refractive objects. Additionally, the Light-Field Blender Add-On [Honauer et al., 2016] was used to capture a set of LF camera array views. Geometric models were rendered as refractive by assigning Blender's "glass BSDF" material property, which used an index of refraction of 1.450. In our ray-tracing simulation, we assessed the validity of the toric lens assumption for more general refractive object shapes, in order to explore the limitations of the refracted LF feature.
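For reference, assigning a glass material with an index of refraction of 1.45 through Blender's Python API might look like the sketch below; the object name is hypothetical and this is not the exact script used to produce the renderings.

```python
import bpy

# Create a Cycles glass material (Blender's "Glass BSDF") with IOR 1.45.
mat = bpy.data.materials.new(name="ThesisGlass")
mat.use_nodes = True
nodes, links = mat.node_tree.nodes, mat.node_tree.links
glass = nodes.new("ShaderNodeBsdfGlass")
glass.inputs["IOR"].default_value = 1.45
links.new(glass.outputs["BSDF"], nodes["Material Output"].inputs["Surface"])

# Assign the material to a (hypothetical) refractive object and use Cycles.
obj = bpy.data.objects["RefractiveObject"]
obj.data.materials.append(mat)
bpy.context.scene.render.engine = 'CYCLES'
```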

A rendered sample LF reduced to 3 × 3 views is shown in Fig. 6.6. In this environment, we simulated and tested our method against a variety of different object types and poses, shown later in Fig. 6.14. The background was kept up to 100 times farther from the LF camera than the refractive object, in order to approximate collimated light from a point source. We used a flat checkerboard background to provide a visual reference of the amount of distortion caused by the refractive objects. However, our implementation is agnostic to the background pattern, because we rely on a uniquely-coloured, solid blue circle on the top surface of the background plane in order to aid image feature correspondence between different LFs captured from different poses. Future work will involve different backgrounds, including more realistic, non-planar scenes.

We ensured that the tracked circle was visible in all the views of the LF through the refractive object, as in Fig. 6.6. Segmentation for the refracted blue circle was accomplished by transforming the red, green and blue (RGB) colour representation to the hue, saturation and value (HSV) colour space. Thresholds were used to segment the angular value for the blue hue, which ranged from approximately 240 to 300 degrees in the HSV colour space. The centre of mass of the largest blue-coloured dot was used as the centre of the circle, which was taken as the same background Lambertian point for feature correspondence.
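A sketch of this segmentation step using OpenCV is shown below. Note that OpenCV scales hue to 0-179, so the approximate 240-300 degree blue range maps to roughly 120-150; the saturation and value thresholds are illustrative assumptions, not the exact values used.

```python
import cv2
import numpy as np

def blue_circle_centroid(image_bgr):
    """Return the centroid (u, v) of the largest blue blob in a sub-view."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    # Hue 240-300 degrees corresponds to roughly 120-150 on OpenCV's 0-179 scale.
    lower = np.array([120, 80, 80], dtype=np.uint8)
    upper = np.array([150, 255, 255], dtype=np.uint8)
    mask = cv2.inRange(hsv, lower, upper)
    num, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    if num < 2:
        return None                                        # no blue blob found
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])   # label 0 is background
    return tuple(centroids[largest])                       # (u, v) centre of mass
```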

Figure 6.6: Ray tracing of a refractive object using Blender. Here, a toric lens is simulated for 5 × 5 LF views, although only 3 × 3 views are shown. The nature of the toric lens is visible: the square checkerboard background is elongated in the v-direction, indicating the longer focal length along the vertical axis of the toric lens. The large blue circular dot was used to aid feature correspondence; it appears as an ellipse due to the nature of the toric lens.

We note that the centre of mass of a blob (for example, the ellipses in Fig. 6.6) that has been distorted by a refractive object does not always reflect the precise centre of the circle. There may be cases where extreme curvature and inhomogeneous structures in the refractive object (such as bubbles or holes) can result in significant distortion, such that the circle's centre no longer matches the blob's centroid in the rendered image. However, for homogeneous (no holes or bubbles) and relatively smooth refractive objects, the centroid provides a reasonable approximation to the coordinates of the centre of the circle in the rendered image.

6.5.2 Feature Continuity in Single-Point Ray Simulation

For the single-point ray simulation, we know the location of the Lambertian point, as well as the pose and optical properties of the toric lens and LF camera. We can therefore determine the location of the focal lines. The rays can then be projected from the st viewpoints of the LF camera, which we know the chief rays must pass through. We assume paraxial rays. Fig. 6.7 illustrates rays of light emitted from a Lambertian point, projected through a toric lens and into an LF camera. The pencil-like shape of the rays is known as an astigmatic pencil. The rays are colour-coded according to the two focal lengths of the toric lens. The 2D side views clearly indicate that the rays pass through the two focal lines according to the two focal lengths of the toric lens. Feature correspondence is known because we are tracing the rays individually through the scene from the camera viewpoints.

Fig. 6.8 shows the estimated slopes for a pure translation along the z-axis, changing the distance between the refractive object and the LF camera. In this motion sequence, the LF camera was moved closer to the refractive object. The ground truth was calculated from the slope equations in Fig. 6.5. As expected, both slopes increased in magnitude (but decreased due to the negative sign) as the LF camera moved closer towards the focal lines, and matched the ground truth.

Orientation was correctly estimated as a constant and so is not shown. Translations in x and y also yielded constant slopes, and are therefore not shown. Similarly, Fig. 6.9 shows the correct estimated orientation for a pure rotation about the z-axis of the LF camera. The slopes were also correctly estimated as constant and so are not shown. In all of these plots, the refracted LF feature is continuous with camera pose. This experiment also demonstrated that we can correctly extract the refracted LF feature from simulated LF observations.

Figure 6.7: Single point ray trace simulation. (a) 3D view of a Lambertian point (black) emanating light rays through the toric lens (light blue, blue). The rays are refracted and pass through the focal lines (red, magenta). The rays pass through the uv-plane (green) and into the LF camera viewpoints (blue). (b) The xz-plane, showing all the light rays passing through the magenta focus line induced by fx. (c) The yz-plane, showing all the light rays passing through the red focus line induced by fy.

Figure 6.8: Our method correctly estimates the two slopes, w1,est and w2,est, of the refracted light-field feature, compared to the ground truth w1,gt and w2,gt, for changing z-translation of the LF camera.

Figure 6.9: Our method correctly estimates the orientation θ1 of the refracted light-field feature for changing z-rotation of the LF camera.

6.5.3 Feature Continuity in Ray Tracing Simulation

In the ray-tracing simulation experiments, we extracted our refracted LF feature from rendered LFs. Similar to the plots from Section 6.5.2, we considered basic motion sequences of the LF camera, and plotted the elements of the refracted LF feature with respect to camera motion to show continuity. Fig. 6.10 depicts an LF camera starting from the left and a toric lens (blue) on the right. The LF camera approaches the lens in a straight line. The refracted LF feature is shown in red. Out of the eight poses in this sequence, only three instances are shown for brevity. As the LF camera moves closer to the lens, the refracted LF feature slopes decrease in magnitude accordingly; however, the feature's position in 3D space remains constant. This is because the decrease in slope (and thus decrease in distance of the feature from the camera) is offset by the forwards motion of the position of the LF camera. Fig. 6.11 shows the two slopes as a function of LF camera displacement from the starting position for this motion sequence. The trends in Fig. 6.11 matched what we anticipated, based on Fig. 6.8. In this case, the refracted LF feature's two slopes were continuous with respect to forwards and backwards motion along the z-axis.

Figure 6.10: Refracted LF feature (red) for the approach of an LF camera (left, blue) towards a toric lens (right, blue). For visualization, a straight line connecting the refracted LF feature and the LF camera is shown (dashed green). As the LF camera moves closer (top to bottom), the feature’s 3D line segment position remains constant, as we are measuring the same pencil of light rays. Only three of the eight positions from the sequence are shown.

Figure 6.11: Slope estimates for the entire approach towards the toric lens that was illustrated in Fig. 6.10. Again, w1,est and w2,est represent the two estimated slopes for the toric lens. As we approach the toric lens (decreasing z), we expect the slope to decrease in magnitude, which we observe. We also note that the slopes appear continuous for z-translation.

Similarly, Fig. 6.12 shows the recovered orientation estimates for rotating an ellipsoid about the principal axis of the LF camera. The ellipsoid was aligned with the same axis and rotated from -30 to 30 degrees. In this graph, we note that although the correct relative angles are recovered, the entire line is centred about 90 degrees, instead of zero. This was likely due to the inherent ambiguity from the SVD, where a 30-degree rotation from one axis is equivalently a 60-degree rotation from the other axis of the toric lens. This ambiguity may be addressed by considering the heuristics of the problem, or by only considering small changes in orientation, and will be addressed in future work.

Fig. 6.13 shows refracted LF features (red/orange/yellow) for a toric lens (light blue, right) plotted in 3D from a grid of LF camera poses (blue squares, left). Note that a single blue square represents an entire LF camera, as opposed to a single monocular camera. The regularity of the grid of LF camera poses was an experimental design choice. The dashed lines (green) connect each LF camera to its corresponding refracted LF feature. The refracted LF features are between the LF camera poses and the toric lens, as expected. Interestingly, in traditional robotic vision, Lambertian features do not move in 3D on their own. They are anchored in space (or attached to some object), and are therefore clearly useful for localisation and image registration, among other tasks. In Fig. 6.13, and in many of the refracted LF feature visualisations shown in the following section, we note that our refracted LF features are not simply stationary. They appear to move with the LF camera pose.

Figure 6.12: (a) An elongated ellipsoid that was rotated about the principal axis of the LF camera to capture orientation change. (b) Orientation estimate, which reflects the orientation of the principal axis of the ellipsoid relative to the horizontal. Here, the z-rotation is rotation about the principal axis of the camera. We note that even though the ellipsoid is not an ideal toric lens, the orientation was still correctly recovered and it was also continuous with respect to the camera rotation.

However, the feature's movement due to camera pose was well-defined. The slopes define the distance of the feature to the LF camera, and these appear to be consistently at -0.2 m and 0.4 m on the z-axis. We note that the layout of the cluster of refracted LF features closely mirrors the ray patterns of the astigmatic pencil from Fig. 6.7. The uniform grid of LF camera poses mimics the sampling pattern of an LF camera array. The direction of the refracted LF features is clearly dictated by the toric lens' two orthogonal focal lines and the LF camera pose. Although one can think of the interval of Sturm as simply a line segment along the principal axis of the toric lens, Fig. 6.13 reminds us that the interval of Sturm is actually a collection of rays along a continuum defined by the two focal lines of the toric lens. We also note that the direction of the refracted LF feature appears to change in a continuous manner with camera pose. Therefore, the alignment of our refracted LF feature implies a corresponding alignment of the LF camera pose to the toric lens. A position and alignment task in this case could take the form of line-segment alignment.

Figure 6.13: (a) Refracted LF features (red/orange/yellow) for a toric lens (right, light blue) from a grid of LF camera positions (left, dark blue). Note that each blue square represents an entire LF camera, not a single monocular camera. (b) Central view of the central LF, showing the blue circle flattened by the lens. The centre of the blue circle (red star) was the image feature that was tracked across the different LFs. (c) The top and (d) side views of the refracted LF features, which clearly illustrate the focal lines of the toric lens at z of -0.2 and 0.4 m. Note that the scale of the z-axis is much larger than that of the x- and y-axes, in order to clearly show the refracted LF feature.

6.5.3.1 Different Object Types

We considered a variety of refractive object types from a set of poses in order to visualise our refracted LF feature in 3D, along with the respective LF camera poses and the refractive object itself. Fig. 6.14 shows several of the objects (a sphere, a cylinder, and a "tornado"), along with their corresponding refracted LF features sampled by a sequence or grid of LF camera poses. First, Fig. 6.14b shows the case of a refractive sphere. As expected, the sphere focused the refracted LF feature into a single 3D point. A spherical lens model, used instead of a toric lens model, would therefore also yield a viable refracted LF feature. A spherical refracted LF feature would be analogous to a 3D Lambertian point for position and control tasks; however, it would not be valid or as accurate for as many refractive object surfaces as the toric refracted LF feature.

Second, Fig. 6.14d shows the refracted LF features for a horizontal translation in x along a cylinder. The features spread out in a fan at z = 12 m, which is the location of the cylinder’s focal line. The single focal line is due to the curvature of the cylinder. The cylinder acts as a 1D refractive element, and therefore the other slope is simply a shifted measure of the Lambertian background. As we can see, the end points of the refracted LF features are approximately the same for this reason.

Finally, a tornado-shaped refractive object was rendered in Blender to represent a more complex, but still relatively smooth, type of object, shown in Fig. 6.14f. The refracted LF features were estimated to be in front of the tornado; however, the features did not appear to have common focal lines, as all of the refracted LF features of the toric lens in Fig. 6.13 did. Despite the initial intention, we also noticed that the curvature of the tornado model was surprisingly bumpy. This led to significant distortion caused by the refractive object. Several times, the bumpiness of the refractive object separated the blue circle's image (through the refractive object) into two or more separate blobs, which greatly impacted the centroid measurements. LF camera poses were selected so as to minimise and avoid this impact in our experiments. Multiple projections of the same point, caused by internal reflection and total refraction, may also need to be considered in future work on image feature correspondence through refractive objects.

Figure 6.14: (a) For a sphere, the centroid of the blue circle (red star) was tracked throughout the LF as a means of feature correspondence. (b) The sphere, with equal focal lengths in all directions, forms an image of the background blue circle at a single point in space, which is shown in the refracted LF features (red) that also encapsulate a point. Note that each blue LF camera illustrated here represents a full LF camera, as opposed to a single monocular camera. The dashed green lines indicate which refracted LF feature matches to which LF camera pose. (c) For a cylinder, (d) the projections of the blue circle appear at the physical location for the cylinder-aligned focal direction, as expected. (e) For a "tornado", (f) the refracted LF features from a grid of LF camera poses appear almost straight, as if the focal lines of the approximated local toric lenses are far away. The tornado represented a complex refractive object, but still yielded a continuous set of refracted LF features.

It is important to note that although our refracted LF feature is based on the assumption of local surface curvatures, we cannot solve for the surface curvatures themselves given only our refracted LF feature. Considering the lensmaker’s equation in (6.2), our method yields the distance of image formation zi from the lens. We know that focal length f is intrinsically linked to the surface curvature r. Therefore, in order to recover f, we require zo, the distance of the object to the lens along the lens’ optical axis. However, despite this lack of knowledge, our refracted LF feature is sufficient for the purposes of position control with respect to refractive objects.

6.6 Visual Servoing Towards Refractive Objects

To put the refracted LF feature into context, in this section we provide an illustrative example of visual servoing towards a refractive object, shown in Fig. 6.15. This experiment has not yet been investigated, and is further discussed as future work. An LF camera is mounted on the end-effector of a robotic manipulator. A refractive object is placed in the scene with sufficient visual texture in the background. The LF camera is moved to the goal pose in order to capture a (set of) goal refracted LF feature(s). Then the LF camera is moved to an initial pose that is close to the goal pose, so that the relevant refracted LF feature(s) can still be observed within the camera's FOV. The robotic system uses a control loop similar to Fig. 4.2 in order to visually servo towards the goal pose. At each iteration, a refracted LF feature Jacobian, which relates LF feature changes to camera spatial velocities, is computed and used to iteratively step towards the goal pose until the difference(s) between the current and goal refracted LF feature(s) is/are sufficiently small, thereby completing a visual servo towards a refractive object. Some approaches for computing this refracted LF feature Jacobian are mentioned in Section 6.7 as future work.
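Although the required Jacobian has not yet been derived, the classical IBVS update that such a control loop would use can be sketched as follows; here J stands in for the (future) refracted LF feature Jacobian and lam is a user-chosen gain, both assumptions for illustration.

```python
import numpy as np

def ibvs_step(feature_current, feature_goal, J, lam=0.5):
    """One iteration of a classical image-based visual servoing law.

    feature_current, feature_goal : stacked refracted LF feature vectors
    J   : hypothetical Jacobian, d(feature)/d(camera velocity)
    lam : proportional gain
    Returns a 6-DOF camera velocity command [vx, vy, vz, wx, wy, wz].
    """
    error = np.asarray(feature_current) - np.asarray(feature_goal)
    return -lam * np.linalg.pinv(J) @ error
```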

Figure 6.15: Concept for visual servoing towards a refractive object. At the start pose (red), a starting refracted LF feature (red) is captured by observing the distorted images of the yellow rubber duck in the LF. The LF camera moves (green) in order to align the current and the goal (blue) refracted LF feature. Owing to the continuity of the refracted LF feature, feature alignment corresponds with pose alignment, enabling the robot to reach the goal pose without requiring a 3D geometric model of the refractive object.

6.7 Conclusions

Overall, we have developed a refracted light-field feature that may be used for positioning tasks in robotic vision, such as visual servoing. Our feature approximates the surface of a refractive object with two local orthogonal surface curvatures. We can describe this part of the refractive object's surface with a toric lens. The locations of the focal lines created by such a lens can be measured by an LF camera. We have demonstrated that the location of these focal lines can be extracted from rendered light fields. We have also illustrated the continuity of our refracted light-field feature over a variety of LF camera poses and for a variety of different refractive objects, which suggests that this feature can enable visual servoing and other positioning tasks without the need for a geometric model of the refractive object.

For future work, we are interested in deriving Jacobians for our refracted LF feature. Doing so would allow us to close the loop for servoing towards refractive objects, since a Jacobian for the refracted light-field feature is required to complete visual servoing towards them. Part of our feature extraction process relies on the SVD, which potentially complicates the Jacobian derivation. It may be possible to derive analytical expressions for w1, w2 and θ via analytical expressions for the derivatives of singular values and singular vectors [Magnus, 1985]. Numerical methods could also be employed to estimate the Jacobian online [Jägersand, 1995]. An alternative approach is to simply derive Jacobians for the 3D line segments induced by the refracted LF feature, as illustrated in Figs. 6.10, 6.13 and 6.14. Deriving analytical expressions for 3D points and line segments is likely more intuitive and straightforward.

Further investigation into denser LF camera pose sweeps to illustrate feature continuity in graphical form on a larger variety of refractive objects and surface curvatures would be useful to test the limitations of the toric lens assumption. It is also worth noting that the slopes recovered in this chapter are related to the position of focal lines, and that these focal lines are a function of surface curvature. Thus, it may be possible to use our refracted LF features to augment techniques for refractive object surface reconstruction. Finally, it may be possible to extend our refracted LF feature concept to include reflections, which also induce multiple depth observations (multiple slopes) in the LF, and our orientation already provides a measure for reflection.

Chapter 7

Conclusions and Future Work

7.1 Conclusions

At the start of this thesis, we identified an opportunity to advance robotic vision in the area of perceiving refractive objects. Although many robotic vision algorithms have been successful assuming a Lambertian world, the real world is far from Lambertian. Water, ice, glass and clear plastic in a variety of shapes and forms are common throughout the environments that robots must operate within. Our goal in this research was to help remove the Lambertian assumption in order to broaden the range of operable scenes and perceivable objects for robots.

We considered light-field cameras as a technology unique in their ability to capture scene texture, depth and view-dependent phenomena, such as occlusion, specular reflection and refraction. Furthermore, image-based visual servoing was chosen as a particularly interesting robotic vision technique for its wide range of applicability, robustness against modelling and calibration errors, and because it did not necessarily require a 3D geometric model of the target object to perform positioning and control tasks. Thus, the overall aim of this thesis was to use LF cameras to advance robotic vision in the area of visual servoing towards refractive objects.


We decomposed this broad goal into the more manageable and specific objectives of demonstrating (1) image-based visual servoing using light-field cameras for Lambertian scenes; (2) detecting refracted image features using LF cameras; and (3) developing refracted LF features for visual servoing towards refractive objects.

In addressing these objectives, the key developments were a result of exploring the properties of the LF and developing algorithms to exploit them. The first objective was accomplished in Chapter 4, where LF cameras were used for image-based visual servoing. Specifically, we proposed a novel Lambertian light-field feature and used it to derive image Jacobians from the light field that were then used to control robot motion. To deal with the lack of available real-time LF cameras, we designed a custom mirror-based light-field camera adapter. To the best of our knowledge, this was the first published light-field image-based visual servoing algorithm. Our method enabled more reliable VS compared to monocular and stereo IBVS approaches for small or distant targets that occupy a narrow part of the camera's FOV and in the presence of occlusions. Areas in robotics that may benefit from this contribution include vision-based grasping, manipulation and docking problems in household, medical and in-orbit satellite servicing applications.

For the second objective, discrimination of refracted image features from Lambertian image features was accomplished in Chapter 5. We developed a discriminator based on detecting the differences between the apparent motion of non-Lambertian and Lambertian image features in the LF using textural cross-correlation, which was more reliable than previous work. We were able to extend these distinguishing capabilities to lenslet-based LF cameras, which are typically limited to much smaller baselines than conventional LF camera arrays. Using our method to reject refracted image features, we also enabled monocular SfM in the presence of refractive objects, where traditional methods would normally fail. Domestic robots that clean dishes or serve glasses, as well as manufacturing robots attempting to interact with or near clear plastic packaging or heavily distorting refractive objects, such as stained glass or bottles of water, may benefit from this research.

Finally, for the third objective, development of a refracted light-field feature to enable visual servoing towards a refractive object was accomplished in Chapter 6. In particular, we proposed and extracted a novel refracted LF feature that could be described by the local projections of the refractive object. We demonstrated that our feature's characteristics were continuous with respect to LF camera pose, showing that our feature is suitable for visual servoing without requiring a 3D geometric model of the target refractive object.

7.2 Future Work

Over the course of this thesis, we have only scratched the surface of the unknown, uncovering more questions and ideas that might answer them. In this section, we propose directions for future research that might build upon and improve the current state of the art for the robotic vision community.

In Chapter 6, we demonstrated the viability of our refracted light-field image feature for visual servoing towards refractive objects. Further research in this direction is needed to achieve a complete visual servoing system. Following our development of LF-IBVS in Chapter 4, derivations for the refracted light-field feature Jacobian need to be performed. LF-IBVS can also be implemented on a lenslet-based LF camera for comparison. Together, these tasks will finally close the loop on visual servoing towards refractive objects.

Additionally, we recognise that VS only addresses part of the problem in enabling robots to work with refractive objects. VS does not touch upon the area of interaction: grasping and manipulation. We consider recent works that have enabled grasping of refractive objects, such as [Zhou et al., 2018], which describes a refractive object as a distribution of depths obtained from an LF camera; comparisons to a 3D geometric model are then made for object localization and grasping. Zhou's method relies on 3D models of refractive objects, while our method does not require such explicit 3D models. Thus, there is interest in combining our two contributions to further the functionality of robotic perception for vision-based manipulation of refractive objects.

The performance and behaviour of VS strongly depend on the choice of image feature. The LF feature used in Chapter 4 for LF-IBVS constrains a Lambertian point in the scene to a plane in the 4D LF with equal slopes in all directions. We showed in Chapter 5 that the planar model in 4D can be used to discriminate refracted from Lambertian image features, which exhibit nonlinear feature curves and unequal slopes in the horizontal and vertical directions of the LF. In Chapter 6, we extracted a more general 4D planar light-field feature from the entire LF as a point with multiple depths and an orientation, and demonstrated the feature's potential use for VS. However, it may be possible to servo on the more general 4D planar structure within the LF directly. Specifically, servoing based on the parameters that describe the plane in 4D (such as the plane's two linearly independent normals) provides a larger structure to estimate and track, compared to individual point features, which may make the approach more robust in low-light (night time) and low-contrast (foggy) conditions. This may also lead to analytical expressions of image Jacobians for visual servoing towards refractive objects. Furthermore, recent advances in LF-specific features, such as the Light-Field Feature detector and descriptor (LiFF) [Dansereau et al., 2019], may similarly lead to improved performance and accuracy in VS.
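As a simplified illustration of the planar LF structure discussed above, the Python sketch below fits the linear model u = u0 + w_u s, v = v0 + w_v t to a feature tracked across the sub-views, and flags the feature as potentially refracted when the horizontal and vertical slopes disagree or the planar fit is poor. This is not the textural cross-correlation discriminator of Chapter 5; the sign conventions, thresholds and interfaces are illustrative assumptions only.

import numpy as np

def fit_lf_slopes(s, t, u, v):
    """Least-squares fit of the planar LF model u = u0 + w_u*s, v = v0 + w_v*t.

    s, t : view indices of each observation of the feature
    u, v : corresponding image coordinates in each sub-view
    Returns (w_u, w_v, rms), the horizontal and vertical slopes and the RMS
    residual of the planar fit.
    """
    s, t, u, v = map(np.asarray, (s, t, u, v))
    A_u = np.column_stack([s, np.ones_like(s)])
    A_v = np.column_stack([t, np.ones_like(t)])
    (w_u, u0), res_u = np.linalg.lstsq(A_u, u, rcond=None)[:2]
    (w_v, v0), res_v = np.linalg.lstsq(A_v, v, rcond=None)[:2]
    rms = np.sqrt((np.sum(res_u) + np.sum(res_v)) / max(s.size, 1))
    return w_u, w_v, rms

def looks_refracted(w_u, w_v, rms, slope_tol=0.1, rms_tol=0.5):
    """Heuristic flag: unequal slopes or a poor planar fit suggest the feature
    has been distorted by a refractive surface (thresholds are hypothetical)."""
    return abs(w_u - w_v) > slope_tol or rms > rms_tol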

An interesting research direction is refractive object shape reconstruction using LF cameras. Previous work has shown that occlusion boundaries provide reliable depth information for refractive objects; however, these approaches have relied on monocular cameras and motion to collect multiple views [Ham et al., 2017]. Occlusion boundaries of refractive objects may provide areas in the LF where depth can be estimated. Local surface curvatures may be estimated by comparing the depths of the occlusion boundaries to the corresponding depths of the refracted LF feature from Chapter 6. These local surface curvatures and occlusion boundary depths may then be combined to approximately reconstruct the refractive object's shape.

Alternatively, a deep learning approach might be considered to relate the characteristic image feature curves from Chapter 5 to object depth and surface curvature. Deep learning techniques might also be used to separate diffuse, specular and refracted image features. Such approaches are typically reliant upon large amounts of ground truth data; however, ground truth data for refractive objects is difficult and labour intensive to obtain. To address this issue, it may be possible to rely on simulated ground truth data generated with realistic ray-tracing to form the bulk of the training data, and then rely on only a small amount of real-world data for fine-tuning the network. We may draw on the literature from the sim-to-real field, where this approach is referred to as a domain adaptation technique.

Another interesting direction of research is to use the LF camera for virtual exploration. In this thesis, image Jacobians were computed analytically for Lambertian scenes based on point features that ultimately relied on an approximate model of the LF camera. In visual servoing, there exist a variety of methods to compute the image Jacobian online without prior camera models using a set of "test movements", which are not part of the manipulation task [Jägersand, 1995, Piepmeier et al., 2004]. However, LF cameras capture a small amount of virtual motion by virtue of their multiple views, similar to a local image-based derivative of robot motion. Recently, a variety of deep learning approaches to monocular VS have also emerged [Lee et al., 2017, Bateux et al., 2018]. Thus, an LF camera may be used to estimate the image Jacobian by comparing its multiple views and the central view of the LF to some goal image. In a related project, our recent work demonstrated that gradients from a multi-camera array could be used to servo towards a target object in highly-occluded scenarios [Lehnert et al., 2019], although a non-planar grid of cameras was used, as opposed to the planar grid of cameras in a traditional LF camera array. Further research into these avenues may result in faster and simpler visual servoing algorithms that can still operate in cluttered and non-Lambertian environments, possibly without the need for LF camera calibration.

This work has largely addressed problems in the context of robotic vision. Taking a broader view outside the field of robotics, this research will hopefully inspire directions in other fields, such as cinematography, virtual reality, mixed or augmented reality, video games and consumer photography. In particular, augmented reality is an emerging technology that may rely on light-field imaging. Augmented reality must address a problem similar to that faced by robots: how to perceive the real world using limited sensor technologies, whilst still enabling safe and reliable interaction. However, as humans are an integral part of augmented reality, these interactions must also be real-time and realistic in appearance. An improved understanding of how refractive objects behave in the light field may lead to more realistic and faster rendering of scenes with refractive objects, as well as safer and more reliable interaction.

Bibliography

[Adelson and Anandan, 1990] Adelson, E. H. and Anandan, P. (1990). Ordinal characteristics of transparency. Vision and Modeling Group, Media Laboratory, Massachusetts Institute of Technology.

[Adelson and Bergen, 1991] Adelson, E. H. and Bergen, J. R. (1991). The plenoptic function and the elements of early vision. Computational models of visual processing, 91(1):3–20.

[Adelson and Wang, 1992] Adelson, E. H. and Wang, J. Y. A. (1992). Single lens stereo with a plenoptic camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):99–106.

[Adelson and Wang, 2002] Adelson, E. H. and Wang, J. Y. A. (2002). Single lens stereo with a plenoptic camera. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 14(2):99–106.

[Andreff et al., 2002] Andreff, N., Espiau, B., and Horaud, R. (2002). Visual servoing from lines. The International Journal of Robotics Research, 21(8):679–699.

[Baeten et al., 2008] Baeten, J., Donné, K., Boedrij, S., Beckers, W., and Claesen, E. (2008). Autonomous fruit picking machine: A robotic apple harvester. In Field and Service Robotics, pages 531–539. Springer.


[Bateux and Marchand, 2015] Bateux, Q. and Marchand, E. (2015). Direct visual servoing based on multiple intensity histograms. In IEEE International Conference on Robotics and Automation.

[Bateux et al., 2018] Bateux, Q., Marchand, E., Leitner, J., Chaumette, F., and Corke, P. (2018). Training deep neural networks for visual servoing. In IEEE International Conference on Robotics and Automation, pages 3307–3314.

[Bay et al., 2008] Bay, H., Ess, A., Tuytelaars, T., and Gool, L. V. (2008). Speeded-up robust features (SURF). Computer Vision and image understanding, 110(3):346–359.

[Ben-Ezra and Nayar, 2003] Ben-Ezra, M. and Nayar, S. K. (2003). What does motion reveal about transparency. In Intl. Conference on Computer Vision (ICCV). IEEE Computer Society.

[Bergeles et al., 2012] Bergeles, C., Kratochvil, B. E., and Nelson, B. J. (2012). Visually ser- voing magnetic intraocular microdevices. IEEE Transactions on Robotics, 28(4):798–809.

[Bernardes and Borges, 2010] Bernardes, M. C. and Borges, G. A. (2010). 3D line estimation for mobile robotics visual servoing. In Congresso Brasileiro de Automática (CBA).

[Bista et al., 2016] Bista, S. R., Giordano, P. R., and Chaumette, F. (2016). Appearance-based indoor navigation by ibvs using line segments. IEEE Robotics and Automation Letters, 1(1):423–430.

[Bolles et al., 1987] Bolles, R., Baker, H., and Marimont, D. (1987). Epipolar-plane image analysis: An approach to determining structure from motion. Intl. Journal of Computer Vision (IJCV), 1(1):7–55.

[Bolles and Fischler, 1981] Bolles, R. C. and Fischler, M. A. (1981). A ransac-based approach to model fitting and its application to finding cylinders in range data. In IJCAI, volume 1981, pages 637–643.

[Bourquardez et al., 2009] Bourquardez, O., Mahony, R., Guenard, N., Chaumette, F., Hamel, T., and Eck, L. (2009). Image-based visual servo control of the translation kinematics of a quadrotor aerial vehicle. Trans. on Robotics, 25(3).

[Cai et al., 2013] Cai, C., Dean-Leon, E., Mendoza, D., Somani, N., and Knoll, A. (2013). Uncalibrated 3D stereo image-based dynamic visual servoing for robot manipulators. In Intl. Conference on Intelligent Robots and Systems (IROS), pages 63–70. IEEE.

[Calonder et al., 2010] Calonder, M., Lepetit, V., Strecha, C., and Fua, P. (2010). Brief: Binary robust independent elementary features. In European conference on computer vision, pages 778–792. Springer.

[Cervera et al., 2003] Cervera, E., Del Pobil, A. P., Berry, F., and Martinet, P. (2003). Improv- ing image-based visual servoing with three-dimensional features. The International Journal of Robotics Research, 22(10-11):821–839.

[Chan, 2014] Chan, S. C. (2014). Light field. In Computer Vision A Reference Guide, pages 447–453. Springer Link.

[Chaumette, 1998] Chaumette, F. (1998). Potential problems of stability and convergence in image-based and position-based visual servoing. Lecture Notes in Control and Information Sciences, 237:66–78.

[Chaumette, 2004] Chaumette, F. (2004). Image moments: a general and useful set of features for visual servoing. IEEE Transactions on Robotics, 20(4):713–723.

[Chaumette and Hutchinson, 2006] Chaumette, F. and Hutchinson, S. (2006). Visual servo control part 1: Basic approaches. Robotics and Automation Magazine, 6:82–90.

[Chaumette and Hutchinson, 2007] Chaumette, F. and Hutchinson, S. (2007). Visual servo control part 2: Advanced approaches. IEEE Robotics and Automation Magazine, pages 109–118.

[Choi and Christensen, 2012] Choi, C. and Christensen, H. (2012). 3D textureless object detection and tracking: An edge-based approach.

[Christensen, 2016] Christensen, H. I. (2016). A roadmap for US robotics (2016) from internet to robotics.

[Civera et al., 2008] Civera, J., Davison, A. J., and Montiel, J. M. (2008). Inverse depth parametrization for monocular slam. IEEE transactions on robotics, 24(5):932–945.

[Collewet and Marchand, 2009] Collewet, C. and Marchand, E. (2009). Photometry-based visual servoing using light reflexion models. In 2009 IEEE International Conference on Robotics and Automation, pages 701–706. IEEE.

[Collewet and Marchand, 2011] Collewet, C. and Marchand, E. (2011). Photometric visual servoing. Trans. on Robotics, 27(4).

[Comport et al., 2011] Comport, A. I., Mahony, R., and Spindler, F. (2011). A visual servoing model for generalised cameras: Case study of non-overlapping cameras. In 2011 IEEE International Conference on Robotics and Automation, pages 5683–5688. IEEE.

[Corke, 2013] Corke, P. (2013). Robotics, Vision and Control. Springer.

[Corke and Hutchinson, 2001] Corke, P. and Hutchinson, S. (2001). A new partitioned ap- proach to image-based visual servo control. Transactions on Robotics and Automation, 17(4):507–515.

[Corke, 2017] Corke, P. I. (2017). Robotics, Vision and Control. Springer, 2 edition.

[Dalal and Triggs, 2005] Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human detection. In Intl. Conference on Computer Vision and Pattern Recognition (CVPR).

[Dansereau, 2014] Dansereau, D. G. (2014). Plenoptic Signal Processing for Robust Vision in Field Robotics. PhD thesis, University of Sydney.

[Dansereau and Bruton, 2007] Dansereau, D. G. and Bruton, L. T. (2007). A 4-D dual-fan filter bank for depth filtering in light fields. IEEE Transactions on Signal Processing (TSP), 55(2):542–549.

[Dansereau et al., 2019] Dansereau, D. G., Girod, B., and Wetzstein, G. (2019). LiFF: Light field features in scale and depth. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8042–8051.

[Dansereau et al., 2011] Dansereau, D. G., Mahon, I., Pizarro, O., and Williams, S. B. (2011). Plenoptic flow: Closed-form visual odometry for light field cameras. In Intl. Conference on Intelligent Robots and Systems (IROS), pages 4455–4462. IEEE.

[Dansereau et al., 2013] Dansereau, D. G., Pizarro, O., and Williams, S. B. (2013). Decoding, calibration and rectification for lenselet-based plenoptic cameras. In Intl. Conference on Computer Vision and Pattern Recognition (CVPR), pages 1027–1034. IEEE.

[De Luca et al., 2008] De Luca, A., Oriolo, G., and Robuffo Giordano, P. (2008). Feature depth observation for image-based visual servoing: Theory and experiments. The International Journal of Robotics Research, 27(10):1093–1116.

[Dong et al., 2013] Dong, F., Ieng, S.-H., Savatier, X., Etienne-Cummings, R., and Benosman, R. (2013). Plenoptic cameras in real-time robotics. The Intl. Journal of Robotics Research, 32(2):206–217.

[Dong and Soatto, 2015] Dong, J. and Soatto, S. (2015). Domain-size pooling in local de- scriptors: Dsp-sift. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5097–5106.

[Drummond and Cipolla, 1999] Drummond, T. and Cipolla, R. (1999). Visual tracking and control using lie algebras. In Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149), volume 2, pages 652–657. IEEE.

[Engel et al., 2014] Engel, J., Schoeps, T., and Cremers, D. (2014). LSD-SLAM: Large-scale direct monocular SLAM. European Conference on Computer Vision (ECCV).

[Fischler and Bolles, 1981] Fischler, M. and Bolles, R. (1981). Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395.

[Freeman and Fincham, 1990] Freeman, M. H. and Fincham, W. H. A. (1990). Optics. Butterworths, London, 10th edition.

[Fritz et al., 2009] Fritz, M., Bradski, G., Karayev, S., Darrell, T., and Black, M. (2009). An additive latent feature model for transparent object recognition.

[Fuchs et al., 2013] Fuchs, M., Kächele, M., and Rusinkiewicz, S. (2013). Design and fabrica- tion of faceted mirror arrays for light field capture. In Computer Graphics Forum, volume 32, pages 246–257. Wiley Online Library.

[Gao and Zhang, 2015] Gao, X. and Zhang, T. (2015). Robust rgb-d simultaneous localization and mapping using planar point features. Robotics and Autonomous Systems, 72:1–14.

[Georgiev et al., 2011] Georgiev, T., Lumsdaine, A., and Chunev, G. (2011). Using focused plenoptic cameras for rich image capture. IEEE Computer Graphics and Applications, 31(1):62–73.

[Gershun, 1936] Gershun, A. (1936). Fundamental ideas of the theory of a light field (vector methods of photometric calculations). Journal of Mathematics and Physics, 18.

[Ghasemi and Vetterli, 2014] Ghasemi, A. and Vetterli, M. (2014). Scale-invariant represen- tation of light field images for object recognition and tracking. In IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics.

[Godard et al., 2017] Godard, C., Mac Aodha, O., and Brostow, G. J. (2017). Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 270–279.

[Gortler et al., 1996] Gortler, S., Grzeszczuk, R., Szeliski, R., and Cohen, M. (1996). The lumigraph. In SIGGRAPH, pages 43–54. ACM.

[Grossmann, 1987] Grossmann, P. (1987). Depth from focus. Pattern recognition letters, 5(1):63–69.

[Gu et al., 1997] Gu, X., Gortler, S., and Cohen, M. (1997). Polyhedral geometry and the two- plane parameterisation. In Proc. Eurographics Workshop on Rendering Techniques, pages 1–12. Springer.

[Gupta et al., 2014] Gupta, S., Girshick, R., Arbeláez, P., and Malik, J. (2014). Learning rich features from rgb-d images for object detection and segmentation. In European Conference on Computer Vision, pages 345–360. Springer.

[Ham et al., 2017] Ham, C., Singh, S., and Lucey, S. (2017). Occlusions are fleeting - texture is forever: Moving past brightness constancy. In WACV.

[Han et al., 2015] Han, K., Wong, K.-Y. K., and Liu, M. (2015). A fixed viewpoint approach for dense reconstruction of transparent objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4001–4008.

[Han et al., 2018] Han, K., Wong, K.-Y. K., and Liu, M. (2018). Dense reconstruction of trans- parent objects by altering incident light paths through refraction. International Journal of Computer Vision, 126(5):460–475.

[Han et al., 2012] Han, K.-S., Kim, S.-C., Lee, Y.-B., Kim, S.-C., Im, D.-H., Choi, H.-K., and Hwang, H. (2012). Strawberry harvesting robot for bench-type cultivation. Journal of Biosystems Engineering, 37(1):65–74.

[Harris and Stephens, 1988] Harris, C. and Stephens, M. (1988). A combined corner and edge detector. In Alvey vision conference, volume 15, page 50.

[Hartley and Zisserman, 2003] Hartley, R. and Zisserman, A. (2003). Multiple View Geometry in Computer Vision. Cambridge University Press.

[Hata et al., 1996] Hata, S., Saitoh, Y., Kumamura, S., and Kaida, K. (1996). Shape extraction of transparent object using genetic algorithm. In Proceedings of 13th International Confer- ence on Pattern Recognition, volume 4, pages 684–688. IEEE.

[Hecht, 2002] Hecht, E. (2002). Optics. Addison-Wesley, 4th edition.

[Hill, 1979] Hill, J. (1979). Real time control of a robot with a mobile camera. In 9th Int. Symp. on Industrial Robots, 1979, pages 233–246.

[Hinton, 1884] Hinton, C. H. (1884). What is the fourth dimension? Scientific Romances, 1:1–22.

[Honauer et al., 2016] Honauer, K., Johannsen, O., Kondermann, D., and Goldluecke, B. (2016). A dataset and evaluation methodology for depth estimation on 4D light fields. In Asian Conference on Computer Vision, pages 19–34. Springer.

[Hutchinson et al., 1996] Hutchinson, S., Hager, G., and Corke, P. (1996). A tutorial on visual servo control. Transactions on Robotics and Automation, 12(5):651–670.

[Ideguchi et al., 2017] Ideguchi, Y., Uranishi, Y., Yoshimoto, S., Kuroda, Y., and Oshiro, O. (2017). Light field convergency: Implicit photometric consistency on transparent surface. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Work- shops, pages 41–49.

[Ihrke et al., 2010a] Ihrke, I., Kutulakos, K., Lensch, H., Magnor, M., and Heidrich, W. (2010a). Transparent and specular object reconstruction. Computer Graphics forum, 29:2400–2426.

[Ihrke et al., 2010b] Ihrke, I., Wetzstein, G., and Heidrich, W. (2010b). A theory of plenoptic multiplexing. In Intl. Conference on Computer Vision and Pattern Recognition (CVPR), pages 483–490. IEEE.

[Irani and Anandan, 1999] Irani, M. and Anandan, P. (1999). About direct methods. In Workshop on Vision Algorithms. Springer.

[Iwatsuki and Okiyama, 2005] Iwatsuki, M. and Okiyama, N. (2005). A new formulation of visual servoing based on cylindrical coordinate system. IEEE Transactions on Robotics, 21(2):266–273.

[Jachnik et al., 2012] Jachnik, J., Newcombe, R. A., and Davison, A. J. (2012). Real-time surface light field capture for augmentation of planar specular surfaces. In Mixed and Aug- mented Reality (ISMAR), 2012 IEEE Intl. Symposium on, pages 91–97. IEEE.

[Jägersand, 1995] Jägersand, M. (1995). Visual servoing using trust region methods and estimation of the full coupled visual-motor jacobian. image, 11:1.

[Jang et al., 1991] Jang, W., Kim, K., Chung, M., and Bien, Z. (1991). Concepts of augmented image space and transformed feature space for efficient visual servoing of an “eye-in-hand robot”. Robotica, 9:203–212.

[Jerian and Jain, 1991] Jerian, C. P. and Jain, R. (1991). Structure from motion-a critical anal- ysis of methods. IEEE Transactions on systems, Man, and Cybernetics, 21(3):572–588.

[Johannsen et al., 2017] Johannsen, O. et al. (2017). A taxonomy and evaluation of dense light field depth estimation algorithms. In CVPR Workshop.

[Johannsen et al., 2015] Johannsen, O., Sulc, A., and Goldluecke, B. (2015). On linear struc- ture from motion for light field cameras. In Intl. Conference on Computer Vision (ICCV), pages 720–728.

[Johnson and Hebert, 1999] Johnson, A. and Hebert, M. (1999). Using spin images for efficient object recognition in cluttered 3D scenes.

[Kemp et al., 2007] Kemp, C. C., Edsinger, A., and Torres-Jara, E. (2007). Challenges for robot manipulation in human environments.

[Keshmiri and Xie, 2017] Keshmiri, M. and Xie, W.-F. (2017). Image-based visual servoing using an optimized trajectory planning technique. IEEE Transactions on Mechatronics, 22(1):359–370.

[Kim et al., 2017] Kim, J., Reshetouski, I., and Ghosh, A. (2017). Acquiring axially-symmetric transparent objects using single-view transmission imaging. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3559–3567.

[Klank et al., 2011] Klank, U., Carton, D., and Beetz, M. (2011). Transparent object detection and reconstruction on a mobile platform. In 2011 IEEE International Conference on Robotics and Automation, pages 5971–5978. IEEE.

[Kompella and Sturm, 2011] Kompella, V. R. and Sturm, P. (2011). Detection and avoidance of semi-transparent obstacles using a collective-reward based approach. In 2011 IEEE Inter- national Conference on Robotics and Automation, pages 3469–3474. IEEE.

[Kragic and Christensen, 2002] Kragic, D. and Christensen, H. (2002). Survey on visual ser- voing for manipulation.

[Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.

[Krotkov and Bajcsy, 1993] Krotkov, E. and Bajcsy, R. (1993). Active vision for reliable rang- ing: Cooperating focus, stereo, and vergence. Int. Journal of Computer Vision, 11(2):187– 203.

[Kurt and Edwards, 2009] Kurt, M. and Edwards, D. (2009). A survey of brdf models for computer graphics. ACM SIGGRAPH Computer Graphics, 43(2):4.

[Kutulakos and Steger, 2007] Kutulakos, K. N. and Steger, E. (2007). A theory of refractive and specular 3D shape by light-path triangulation. 76(1).

[Le et al., 2011] Le, M.-H., Woo, B.-S., and Jo, K.-H. (2011). A comparison of sift and harris conner features for correspondence points matching. In 2011 17th Korea-Japan Joint Workshop on Frontiers of Computer Vision (FCV), pages 1–4. IEEE.

[Lee et al., 2017] Lee, A. X., Levine, S., and Abbeel, P. (2017). Learning visual servoing with deep features and fitted q-iteration. arXiv preprint arXiv:1703.11000.

[Lee, 2005] Lee, H.-C. (2005). Introduction to Color Imaging Science. Cambridge University Press.

[Lehnert et al., 2019] Lehnert, C., Tsai, D., Eriksson, A., and McCool, C. (2019). 3D Move to See: Multi-perspective visual servoing for improving object views with semantic segmenta- tion. In Intl. Conference on Intelligent Robots and Systems (IROS).

[Levin and Durand, 2010] Levin, A. and Durand, F. (2010). Linear view synthesis using a dimensionality gap light field prior. In Intl. Conference on Computer Vision and Pattern Recognition (CVPR), pages 1831–1838. IEEE.

[Levoy and Hanrahan, 1996] Levoy, M. and Hanrahan, P. (1996). Light field rendering. In SIGGRAPH, pages 31–42. ACM.

[Levoy et al., 2000] Levoy, M., Pulli, K., Curless, B., Rusinkiewicz, S., Koller, D., Pereira, L., Ginzton, M., Anderson, S., Davis, J., Ginsberg, J., et al. (2000). The digital michelangelo project: 3D scanning of large statues. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 131–144. ACM Press/Addison-Wesley Publishing Co.

[Li et al., 2008] Li, H., Hartley, R., and Kim, J.-h. (2008). A linear approach to motion estima- tion using generalized camera models. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE.

[Lippmann, 1908] Lippmann, G. (1908). Epreuves reversibles. photographies integrals. Comptes-Rendus Academie des Sciences, 146:446–451.

[López-Nicolás et al., 2010] López-Nicolás, G., Guerrero, J. J., and Sagüés, C. (2010). Visual control through the trifocal tensor for nonholonomic robots. Robotics and Autonomous Systems, 58(2):216–226.

[Low et al., 2007] Low, E. M., Manchester, I. R., and Savkin, A. V. (2007). A biologically in- spired method for vision-based docking of wheeled mobile robots. Robotics and Autonomous Systems, 55(10):769–784.

[Lowe, 2004] Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. Intl. Journal of Computer Vision (IJCV), 60(2):91–110.

[Luke et al., 2014] Luke, J., Rosa, F., Marichal, J., Sanluis, J., Dominguez Conde, C., and Rodriguez-Ramos, J. (2014). Depth from light fields analyzing 4D local structure. Display Technology, Journal of.

[Lumsdaine and Georgiev, 2008] Lumsdaine, A. and Georgiev, T. (2008). Full resolution light field rendering. Technical report, Adobe Systems.

[Lumsdaine and Georgiev, 2009] Lumsdaine, A. and Georgiev, T. (2009). The focused plenop- tic camera. In Computational Photography (ICCP), pages 1–8. IEEE.

[Luo et al., 2015] Luo, R., Lai, P.-J., and Ee, V. W. S. (2015). Transparent object recognition and retrieval for robotic bio-laboratory automation applications. Intl. Conference on Intelli- gent Robots and Systems (IROS).

[Lysenkov, 2013] Lysenkov, I. (2013). Recognition and pose estimation of rigid transparent objects with a kinect sensor. Robotics: Science and Systems VIII, page 273.

[Lytro, 2015] Lytro (2015). Lytro Illum User Manual. Lytro Inc., Mountain View, CA.

[Maeno et al., 2013] Maeno, K., Nagahara, H., Shimada, A., and Taniguchi, R.-I. (2013). Light field distortion feature for transparent object recognition. In Intl. Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.

[Magnus, 1985] Magnus, J. R. (1985). On differentiating eigenvalues and eigenvectors. Econometric Theory, 1(2):179–191.

[Mahony et al., 2002] Mahony, R., Corke, P., and Chaumette, F. (2002). Choice of image fea- tures for depth-axis control in image based visual servo control. In Intl. Conference on Intelligent Robots and Systems (IROS), pages 390–395. IEEE.

[Malis and Chaumette, 2000] Malis, E. and Chaumette, F. (2000). 2 1/2 d visual servoing with respect to unknown objects through a new estimation scheme of camera displacement. In- ternational Journal of Computer Vision, 37(1):79–97.

[Malis et al., 1999] Malis, E., Chaumette, F., and Boudet, S. (1999). 2 1/2 d visual servoing. IEEE Transactions on Robotics and Automation, 15(2):238–250.

[Malis et al., 2000] Malis, E., Chaumette, F., and Boudet, S. (2000). Multi-cameras visual servoing. In Robotics and Automation (ICRA), pages 3183–3188. IEEE.

[Malis and Rives, 2003] Malis, E. and Rives, P. (2003). Robustness of image-based visual servoing with respect to depth distribution errors. In 2003 IEEE International Conference on Robotics and Automation (Cat. No. 03CH37422), volume 1, pages 1056–1061. IEEE.

[Marchand and Chaumette, 2017] Marchand, E. and Chaumette, F. (2017). Visual servoing through mirror reflection. In 2017 IEEE International Conference on Robotics and Automa- tion (ICRA), pages 3798–3804. IEEE.

[Mariottini et al., 2007] Mariottini, G. L., Oriolo, G., and Prattichizzo, D. (2007). Image-based visual servoing for nonholonomic mobile robots using epipolar geometry. IEEE Transactions on Robotics, 23(1):87–100.

[Marto et al., 2017] Marto, S. G., Monteiro, N. B., Barreto, J. P., and Gaspar, J. A. (2017). Structure from plenoptic imaging. In 2017 Joint IEEE International Conference on Devel- opment and Learning and Epigenetic Robotics (ICDL-EpiRob), pages 338–343. IEEE.

[McFadyen et al., 2017] McFadyen, A., Jabeur, M., and Corke, P. (2017). Image-based visual servoing with unknown point feature correspondence. IEEE Robotics and Automation Letters, 2(2):601–607.

[McHenry et al., 2005] McHenry, K., Ponce, J., and Forsyth, D. (2005). Finding glass. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 973–979. IEEE.

[Mehta and Burks, 2014] Mehta, S. and Burks, T. (2014). Vision-based control of robotic ma- nipulator for citrus harvesting. Computers and Electronics in Agriculture, 102:146–158.

[Mezouar and Allen, 2002] Mezouar, Y. and Allen, P. K. (2002). Visual servoed microposi- tioning for protein manipulation tasks. In IEEE/RSJ International Conference on Intelligent Robots and Systems, volume 2, pages 1766–1771. IEEE.

[Miyazaki and Ikeuchi, 2005] Miyazaki, D. and Ikeuchi, K. (2005). Inverse polarisation ray- tracing: estimating surface shapes of transparent objects. Intl. Conference on Computer Vision and Pattern Recognition (CVPR).

[Morris and Kutulakos, 2007] Morris, N. J. W. and Kutulakos, K. N. (2007). Reconstructing the surface of inhomogeneous transparent scenes by scatter-trace photography. 76(1).

[Muja and Lowe, 2009] Muja, M. and Lowe, D. G. (2009). Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP (1), 2(331-340):2.

[Mukaigawa et al., 2010] Mukaigawa, Y., Tagawa, S., Kim, J., Raskar, R., Matsushita, Y., and Yagi, Y. (2010). Hemispherical confocal imaging using turtleback reflector. In Computer Vision–ACCV 2010, pages 336–349. Springer.

[Murase, 1990] Murase, H. (1990). Surface shape reconstruction of an undulating transparent object. In [1990] Proceedings Third International Conference on Computer Vision, pages 313–317. IEEE.

[Neumann and Fermuller, 2003] Neumann, J. and Fermuller, C. (2003). Polydioptric camera design and 3D motion estimation. Intl. Conference on Computer Vision and Pattern Recognition (CVPR).

[Newcombe et al., 2011] Newcombe, R. A., Lovegrove, S., and Davison, A. J. (2011). DTAM: dense tracking and mapping in real-time. In Intl. Conference on Computer Vision (ICCV), pages 2320–2327.

[Ng et al., 2005] Ng, R., Levoy, M., Bredif, M., Duval, G., Horowitz, M., and Hanrahan, P. (2005). Light field photography with a hand-held plenoptic camera. Technical report, Stan- ford University Computer Science.

[O’Brien et al., 2018] O’Brien, S., Trumpf, J., Ila, V., and Mahony, R. (2018). Calibrating light field cameras using plenoptic disc features. In 2018 International Conference on 3D Vision (3DV), pages 286–294. IEEE.

[Pages et al., 2006] Pages, J., Collewet, C., Chaumette, F., and Salvi, J. (2006). An approach to visual servoing based on coded light. In Proceedings 2006 IEEE International Conference on Robotics and Automation, 2006. ICRA 2006., pages 4118–4123. IEEE.

[Papanikolopoulos and Khosla, 1993] Papanikolopoulos, N. P. and Khosla, P. K. (1993). Adap- tive robotic visual tracking: Theory and experiments. IEEE Transactions on Automatic Con- trol, 38(3):429–445.

[Pedrotti, 2008] Pedrotti, L. S. (2008). Fundamentals of Photonics.

[Perwass and Wietzke, 2012] Perwass, C. and Wietzke, L. (2012). Single lens 3D-camera with extended depth-of-field. In IST/SPIE Electronic Imaging, pages 829108–829108. Interna- tional Society for Optics and Photonics.

[Phong, 1975] Phong, B. T. (1975). Illumination for computer generated pictures. Communi- cations of the ACM, 18(6):311–317.

[Piepmeier et al., 2004] Piepmeier, J. A., McMurray, G. V., and Lipkin, H. (2004). Uncalibrated dynamic visual servoing. IEEE Transactions on Robotics and Automation, 20(1):143–147.

[Quadros, 2014] Quadros, A. J. (2014). Representing 3D Shape in Sparse Range Images for Urban Object Classification. Thesis, University of Sydney.

[Raytrix, 2015] Raytrix (2015). Raytrix light field sdk.

[Rosten et al., 2009] Rosten, E., Porter, R., and Drummond, T. (2009). Faster and better: A machine learning approach to corner detection. IEEE Trans. Pattern Analysis and Machine Intelligence (to appear).

[Rublee et al., 2011] Rublee, E., Rabaud, V., Konolige, K., and Bradski, G. (2011). Orb: an efficient alternative to sift or surf. In Intl. Conference on Computer Vision (ICCV).

[Salti et al., 2014] Salti, S., Tombari, F., and Stefano, L. D. (2014). SHOT: Unique signatures of histograms for surface and texture description. Computer Vision and Image Understand- ing, 125:251–264.

[Saxena et al., 2006] Saxena, A., Chung, S. H., and Ng, A. Y. (2006). Learning depth from single monocular images. In Advances in neural information processing systems, pages 1161–1168.

[Saxena et al., 2008] Saxena, A., Driemeyer, J., and Ng, A. (2008). Robotic grasping of novel objects using vision. International Journal of Robotics Research.

[Schlick, 1994] Schlick, C. (1994). A survey of shading and reflectance models. In Computer Graphics Forum, volume 13, pages 121–131. Wiley Online Library.

[Schoenberger and Frahm, 2016] Schoenberger, J. and Frahm, J.-M. (2016). Structure-from- motion revisited. CVPR.

[Schoenberger et al., 2017] Schoenberger, J., Hardmeier, H., Sattler, T., and Pollefeys, M. (2017). Comparative evaluation of hand-crafted and learned local features. Intl. Conference on Computer Vision and Pattern Recognition (CVPR).

[Shafer, 1985] Shafer, S. A. (1985). Using color to separate reflection components. Color Research & Application, 10(4):210–218.

[Shi and Tomasi, 1993] Shi, J. and Tomasi, C. (1993). Good features to track. Technical report, Cornell University.

[Siciliano and Khatib, 2016] Siciliano, B. and Khatib, O. (2016). Springer handbook of robotics. Springer.

[Smith et al., 2009] Smith, B. M., Zhang, L., Jin, H., and Agarwala, A. (2009). Light field video stabilization. In Intl. Conference on Computer Vision (ICCV).

[Song et al., 2015] Song, W., Liu, Y., Li, W., and Wang, Y. (2015). Light-field acquisition using a planar catadioptric system. Optics Express, 23(24):31126–31135.

[Strecke et al., 2017] Strecke, M., Alperovich, A., and Goldluecke, B. (2017). Accurate depth and normal maps from occlusion-aware focal stack symmetry. In Intl. Conference on Com- puter Vision and Pattern Recognition (CVPR).

[Sturm et al., 2011] Sturm, P., Ramalingam, S., Tardif, J.-P., Gasparini, S., Barreto, J., et al. (2011). Camera models and fundamental concepts used in geometric computer vision. Foundations and Trends in Computer Graphics and Vision, 6(1–2):1–183.

[Szeliski et al., 2000] Szeliski, R., Avidan, S., and Anandan, P. (2000). Layer extraction from multiple images containing reflections and transparency. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No. PR00662), volume 1, pages 246–253. IEEE.

[Tahri and Chaumette, 2003] Tahri, O. and Chaumette, F. (2003). Application of moment invariants to visual servoing. In 2003 IEEE International Conference on Robotics and Automation (Cat. No. 03CH37422), volume 3, pages 4276–4281. IEEE.

[Tao et al., 2013] Tao, M. W., Hadap, S., Malik, J., and Ramamoorthi, R. (2013). Depth from combining defocus and correspondence using light field cameras. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 673–680. IEEE.

[Tao et al., 2016] Tao, M. W., Su, J.-C., Wang, T.-C., Malik, J., and Ramamoorthi, R. (2016). Depth estimation and specular removal for glossy surfaces using point and line consistency with light field cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(6):1155–1168.

[Teixeira et al., 2017] Teixeira, J. A., Brites, C., Pereira, F., and Ascenso, J. (2017). Epipolar based light field key-location detector. In Multimedia Signal Processing.

[Teulière and Marchand, 2014] Teulière, C. and Marchand, E. (2014). A dense and direct ap- proach to visual servoing using depth maps. IEEE Transactions on Robotics, 30(5):1242– 1249.

[Tombari et al., 2010] Tombari, F., Salti, S., and Stefano, L. D. (2010). Unique signatures of histograms for local surface description. ECCV.

[Torr and Zisserman, 2000] Torr, P. H. and Zisserman, A. (2000). Mlesac: A new robust es- timator with application to estimating image geometry. Computer vision and image under- standing, 78(1):138–156.

[Tosic and Berkner, 2014] Tosic, I. and Berkner, K. (2014). 3D keypoint detection by light field scale-depth space analysis. In Image Processing (ICIP). IEEE.

[Triggs et al., 2000] Triggs, B., McLauchlan, P., Hartley, R., and Fitzgibbon, A. (2000). Bundle adjustment - a modern synthesis. Vision Algorithms, pages 298–372.

[Tsai et al., 2015] Tsai, C.-Y., Veeraraghavan, A., and Sankaranarayanan, A. C. (2015). What does a single light-ray reveal about a transparent object? In 2015 IEEE International Conference on Image Processing (ICIP), pages 606–610. IEEE.

[Tsai et al., 2016] Tsai, D., Dansereau, D., Martin, S., and Corke, P. (2016). Mirrored Light Field Video Camera Adapter. Technical report, Queensland University of Technology.

[Tsai et al., 2017] Tsai, D., Dansereau, D. G., Peynot, T., and Corke, P. (2017). Image-based visual servoing with light field cameras. IEEE Robotics and Automation Letters, 2(2):912– 919.

[Tsai et al., 2019] Tsai, D., Dansereau, D. G., Peynot, T., and Corke, P. (2019). Distinguishing refracted features using light field cameras with application to structure from motion. IEEE Robotics and Automation Letters, 4(2):177–184.

[Tsai et al., 2013] Tsai, D., Nesnas, I., and Zarzhitsky, D. (2013). Autonomous vision-based tether-assisted rover docking. In Intl. Conference on Intelligent Robots and Systems (IROS). IEEE.

[Tuytelaars et al., 2008] Tuytelaars, T., Mikolajczyk, K., et al. (2008). Local invariant feature detectors: a survey. Foundations and trends in computer graphics and vision, 3(3):177–280.

[Vaish et al., 2006] Vaish, V., Levoy, M., Szeliski, R., Zitnick, C., and Kang, S. (2006). Re- constructing occluded surfaces using synthetic apertures: Stereo, focus and robust measures. In Intl. Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 2331–2338. IEEE.

[Verdie et al., 2015] Verdie, Y., Yi, K. M., Fua, P., and Lepetit, V. (2015). TILDE: A temporally invariant learned DEtector. Intl. Conference on Computer Vision and Pattern Recognition (CVPR).

[Walter et al., 2015] Walter, C., Penzlin, F., Schulenburg, E., and Elkmann, N. (2015). Enabling multi-purpose mobile manipulators: Localization of glossy objects using a light field camera. In Conference on Emerging Technologies & Factory Automation (ETFA), pages 1–8. IEEE.

[Wanner and Goldeluecke, 2014] Wanner, S. and Goldeluecke, B. (2014). Variational light field analysis for disparity estimation and super-resolution. IEEE Trans. on Pattern Analysis and Machine Intelligence, 36(3).

[Wanner and Goldluecke, 2012] Wanner, S. and Goldluecke, B. (2012). Globally consistent depth labeling of 4D light fields. In Intl. Conference on Computer Vision and Pattern Recog- nition (CVPR).

[Wanner and Golduecke, 2013] Wanner, S. and Golduecke, B. (2013). Reconstructing reflec- tive and transparent surfaces from epipolar plane images. Proc. 35th German Conf. Pattern Recog.

[Wei et al., 2013] Wei, Y., Kang, L., Yang, B., and Wu, L. (2013). Applications of structure from motion: a survey. Journal of Zhejiang University-SCIENCE C (Computers & Electron- ics), 14(7).

[Weisstein, 2017] Weisstein, E. W. (2017). Hyperplane. http://mathworld.wolfram.com/Hyperplane.html. [Online; accessed 19-July-2017].

[Wetzstein et al., 2011] Wetzstein, G., Roodnick, D., Heidrich, W., and Raskar, R. (2011). Re- fractive shape from light field distortion. In Intl. Conference on Computer Vision (ICCV), pages 1180–1186. IEEE.

[Wilburn et al., 2004] Wilburn, B., Joshi, N., Vaish, V., Levoy, M., and Horowitz, M. (2004). High-speed videography using a dense camera array. In Intl. Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages II–294. IEEE.

[Wilburn et al., 2005] Wilburn, B., Joshi, N., Vaish, V., Talvala, E., Antunez, E., Barth, A., Adams, A., Horowitz, M., and Levoy, M. (2005). High performance imaging using large camera arrays. ACM Transactions on Graphics (TOG), 24(3):765–776.

[Wilson et al., 1996] Wilson, W. J., Hulls, C. W., and Bell, G. S. (1996). Relative end-effector control using cartesian position based visual servoing. IEEE Transactions on Robotics and Automation, 12(5):684–696.

[Xu et al., 2015] Xu, Y., Nagahara, H., Shimada, A., and ichiro Taniguchi, R. (2015). Transcut: Transparent object segmentation from a light field image. Intl. Conference on Computer Vision and Pattern Recognition (CVPR).

[Yamamoto, 1986] Yamamoto, M. (1986). Determining three-dimensional structure from image sequences given by horizontal and vertical moving camera. Denshi Tsushin Gakkai Ronbunshi (Transactions of the Institute of Electronics, Information and Communication Engineers of Japan), pages 1631–1638.

[Yeasin and Sharma, 2005] Yeasin, M. and Sharma, R. (2005). Foveated vision sensor and im- age processing–a review. In Machine Learning and Robot Perception, pages 57–98. Springer.

[Yi et al., 2016] Yi, K. M., Trulls, E., Lepetit, V., and Fua, P. (2016). LIFT: Learned invariant feature transform. arXiv.

[Zeller et al., 2015] Zeller, N., Quint, F., and Stilla, U. (2015). Narrow field-of-view visual odometry based on a focused plenoptic camera. In ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences.

[Zhang et al., 2018] Zhang, K., Chen, J., and Chaumette, F. (2018). Visual servoing with tri- focal tensor. In 2018 IEEE Conference on Decision and Control (CDC), pages 2334–2340. IEEE.

[Zhang et al., 2017] Zhang, Y., Yu, P., Yang, W., Ma, Y., and Yu, J. (2017). Ray space features for plenoptic structure-from-motion. In Proceedings of the IEEE International Conference on Computer Vision, pages 4631–4639.

[Zhou et al., 2018] Zhou, Z., Sui, Z., and Jenkins, O. C. (2018). Plenoptic Monte Carlo object localization for robot grasping under layered translucency. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–8. IEEE.

Appendix A

Mirrored Light-Field Video Camera Adapter

This appendix proposes the design of a custom mirror-based light-field camera adapter that is cheap, simple in construction, and accessible. Mirrors of different shape and orientation reflect the scene into an upwards-facing camera to create an array of virtual cameras with overlapping fields of view at specified depths, and deliver light fields at video frame rates. We describe the design, construction, decoding and calibration processes of our mirror-based light-field camera adapter in preparation for an open-source release to benefit the robotic vision community.

The latest report, computer-aided design models, diagrams and code can be obtained from the following repository: https://bitbucket.org/acrv/mirrorcam.


A.1 Introduction

Light-field cameras are a new paradigm in imaging technology that may greatly augment the computer vision and robotics fields. Unlike conventional cameras that only capture spatial information in 2D, light-field cameras capture both spatial and angular information in 4D using multiple views of the same scene within a single shot [Ng et al., 2005]. Doing so implicitly encodes geometry and texture, and allows for depth extraction. Capturing multiple views of the same scene also allows light-field cameras to handle occlusions [Walter et al., 2015] and non-Lambertian (glossy, shiny, reflective, transparent) surfaces, which often break most modern computer vision and robotic techniques [Vaish et al., 2006].

Robots must operate in continually changing environments on relatively constrained platforms. As such, the robotics community is interested in low-cost, computationally inexpensive cameras with real-time performance. Unfortunately, there is a scarcity of commercially available light-field cameras appropriate for robotics applications; specifically, no commercial camera delivers 4D light fields at video frame rates [1]. Creating a full camera array brings additional synchronization, bulk, input-output and bandwidth issues. In contrast, the advantages of our approach (a mirror-based adapter, introduced below) are video frame-rate LF capture allowing real-time performance, the ability to customize the design to optimize the key performance metrics required for the application, and ease of fabrication. The main disadvantages of our approach are a lower resolution, a lower FOV [2], and a more complex decoding process.

Therefore, we constructed our own LF video camera by employing a mirror-based adapter. This approach splits the camera's field of view into sub-images using an array of planar mirrors. By appropriately positioning the mirrors, a grid of virtual views with overlapping fields of view can be constructed, effectively capturing a light field. We 3D-printed the mount based on our design, and populated the mount with laser-cut acrylic mirrors.

[1] Though one manufacturer provides video, it does not provide a 4D LF, only 2D, RGBD or raw lenslet images with no method for decoding to 4D. [2] A 3 × 3 array will have 1/3 the FOV of the base camera.

Figure A.1: (a) MirrorCam mounted on the Kinova MICO robot manipulator. Nine mirrors of different shape and orientation reflect the scene into the upwards-facing camera to create nine virtual cameras, which provide video frame-rate light fields. (b) A whole image captured by the MirrorCam and (c) the same decoded into a light-field parameterisation of nine sub-images, visualized as a 2D tiling of 2D images. The non-rectangular sub-images allow for greater FOV overlap [Tsai et al., 2017].

The main contribution of this appendix is the design and construction of a mirror-based adapter like the one shown in Fig. A.1a, which we refer to as MirrorCam. We provide a novel optimization routine for the design of the custom mirror-based camera that models each mirror using a 3-Degree-of-Freedom (DOF) reflection matrix. The calibration step uses 3-DOF mirrors as well; the design step allows non-rectangular projected images. We aim to make the design, methodology and code open-source to benefit the robotic vision research community.

The remainder of this appendix is organized as follows. Section A.2 provides background on light-field cameras in relation to the MirrorCam. Section A.3 explains our methods for designing, optimizing, constructing, decoding and calibrating the MirrorCam. Finally, Section A.4 concludes the appendix and explores future work.

A.2 Background

Light-field cameras measure the amount of light travelling along each ray that intersects the sensor by acquiring multiple views of a single scene. Doing so allows these cameras to obtain geometry, texture and depth information within a single light-field image. Some excellent references for light fields are [Adelson and Wang, 2002, Chan, 2014, Dansereau, 2014].

Table A.1 compares some of the most common LF camera architectures. The most prevalent are the camera array [Wilburn et al., 2005] and the micro-lens array (MLA) [Ng et al., 2005]. However, the commercially available light-field cameras are insufficient for providing light fields for real-time robotics. Notably, the Lytro Illum does not provide light fields at a video frame rate [Lytro, 2015]. The Raytrix R10 is a light-field camera that captures at 7-30 frames per second (FPS); however, the camera uses lenslets with different focal lengths, which makes decoding the raw image extremely difficult, and it only provides 3D depth maps [Raytrix, 2015]. Furthermore, as commercial products, the light-field camera companies have not disclosed details on how to access and decode the light-field camera images, forcing researchers to hack solutions with limited success. All of these reasons motivate a customizable, easy-to-access, easy-to-construct, and open-source video frame-rate light-field camera.

A.3 Methods

We constructed our own LF video camera by employing a mirror-based adapter based on previous works [Fuchs et al., 2013, Song et al., 2015, Mukaigawa et al., 2010]. This approach slices the original camera image into sub-images using an array of planar mirrors. Curved mirrors may produce better optics; however, such mirrors are difficult to produce, whereas planar mirrors are much more accessible and customizable. A grid of virtual views with overlapping fields of view can be constructed by carefully aligning the mirrors; these multiple views effectively capture a light field. Our approach differs from previous work by reducing the optimization routine to a single tunable parameter, and by identifying the fundamental trade-off between depth of field and field of view in the design of mirrored LF cameras. Additionally, we utilize non-square mirror shapes.

A.3.1 Design & Optimization

Because an array of mirrors has insufficient degrees of freedom to provide both perfectly overlapping FOVs and perfectly positioned projective centres, we employ an optimization algorithm to strike a balance between these factors, as in [Fuchs et al., 2013]. A tunable parameter determines the relative importance of closeness to a perfect grid of virtual poses versus field-of-view overlap, which is evaluated at a set of user-defined depths. The grid of virtual poses is allowed to be rectangular, to better exploit rectangular camera FOVs.
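A minimal Python sketch of such a weighted objective is given below, assuming hypothetical helper functions for computing the virtual camera centres and the FOV overlap at a given depth; the actual optimization routine released with MirrorCam may differ in its parameterisation and constraints.

import numpy as np

def design_cost(mirror_params, alpha, eval_depths,
                virtual_poses_fn, fov_overlap_fn, target_grid):
    """Weighted design cost: alpha trades off grid regularity vs FOV overlap.

    mirror_params    : flattened mirror positions/normals/extents (decision variables)
    alpha            : tunable weight in [0, 1] (the single tunable parameter)
    eval_depths      : user-defined depths at which FOV overlap is evaluated
    virtual_poses_fn : returns the virtual camera centres for these mirrors (assumed)
    fov_overlap_fn   : returns fractional FOV overlap at a given depth (assumed)
    target_grid      : ideal (possibly rectangular) grid of virtual centres
    """
    centres = virtual_poses_fn(mirror_params)
    grid_err = np.mean(np.linalg.norm(centres - target_grid, axis=1))
    overlap = np.mean([fov_overlap_fn(mirror_params, d) for d in eval_depths])
    # Minimise grid error, maximise overlap.
    return alpha * grid_err + (1.0 - alpha) * (1.0 - overlap)

A constrained optimiser (for example, scipy.optimize.minimize with inequality constraints preventing mirror-mirror occlusion and enforcing a minimum mirror spacing) could then search over mirror_params; this is only a sketch of the objective, not the thesis code.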

The optimization routine begins with a faceted parabola at a user-defined scale and mirror count. Optimization is allowed to manipulate the positions and normals of the mirror planes, as well as their extents. Optimization constraints prevent mirrors from occluding their neighbours, and allow a minimum spacing between mirrors to be imposed for manufacturability.

Table A.1: Comparison of Accessibility for Different LF Camera Systems

LF Systems              Sync    FPS [1]   Customizability   Open-Source
Camera Array            poor    7-30      significant       yes
MLA (Lytro Illum)       good    0.5       none              limited
MLA (Raytrix R8/R10)    good    7-30      minor             limited
MirrorCam               good    2-30      significant       yes
[1] Frames per second

Fig. A.2 shows an example 3 × 3 mirror array before and after optimization. The FOV overlap was evaluated at 0.3 and 0.5 m. Fig. A.1a shows an assembled model mounted on a robot arm, and Fig. A.1b shows an example image taken from the camera. Note that the optimized design does not yield rectangular sub-images, as allowing a general quadrilateral shape allows for greater FOV overlap. In future work, we will explore the use of non-quadrilateral sub-images.

A.3.2 Construction

For the construction of the MirrorCam, we aimed to use easily accessible materials and methods. We 3D-printed the mount based on our design, and populated it with laser-cut flat acrylic mirrors. Figure A.3 shows a computer rendering of the MirrorCam before 3D printing. The reflections in the nine mirrors show the upwards-facing camera, which is secured at the base of the MirrorCam. This design was built for the commonly available Logitech C920 webcam. More detailed diagrams of the design are supplied in the Appendix.

Mirror thickness and quality proved to be an issue for the construction of the MirrorCam. Since the mirrors are quite close to the camera, the thickness of the mirrors occludes a significant portion of the image, which greatly reduces the resolution of each sub-image. Thus, we opted for thin mirrors, but encountered problems with warping and poor flatness in the cheap acrylic mirrors. By inspecting the mirrors before purchase, and by handling them very carefully (without flexing them) during cutting, adhesion and construction, we were able to minimise image warping.

Figure A.2: (a) A parabolic mirror array reflects images from the scene at right into a camera, shown in blue at bottom; each mirror yields a virtual view, shown in red – note that these are far from an ideal grid; (b) the FOV overlap evaluated at 0.5 m, with the region of full overlap highlighted in green; (c) and (d) the same after optimization, showing better virtual camera placement and FOV overlap.

Figure A.3: Rendered image of the MirrorCam version 0.4C, (a) from the front, showing the single camera lens that is visible in all nine mirrored surfaces, and (b) an isometric view showing how the camera is attached to the mirrors.

A.3.3 Decoding & Calibration

Our MirrorCam calibration has two steps: first, the base camera is calibrated following a conventional intrinsic calibration, e.g. using MATLAB's built-in camera calibration tool. Next, the camera is assembled with the mirrors, and the mirror geometry is estimated using a Levenberg-Marquardt optimization of the error between expected and observed checkerboard corner locations. Initialization of the mirror geometry is based on the array design, and sub-image segmentation is manually specified.
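As a rough illustration of how this second step could be wired up in MATLAB, the snippet below uses lsqnonlin with its Levenberg-Marquardt algorithm; the residual function, the design-based initial guess and the checkerboard data are hypothetical placeholders standing in for our actual calibration code.

% Illustrative sketch only; reprojectionResiduals and initialMirrorParamsFromDesign
% are hypothetical placeholders for the actual calibration code.
mirrorParams0 = initialMirrorParamsFromDesign();   % initial guess taken from the array design
residualFcn   = @(p) reprojectionResiduals(p, intrinsics, boardPoints, detectedCorners);
opts          = optimoptions('lsqnonlin', 'Algorithm', 'levenberg-marquardt');
mirrorParams  = lsqnonlin(residualFcn, mirrorParams0, [], [], opts);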


Figure A.4: MirrorCam v0.4c Kinova mount.

One point of difference with prior work is that rather than employing a 6-DOF transformation for each virtual camera view, our calibration models each mirror using a 3-DOF reflection matrix. This reduces the DOF in the camera model and more closely matches the physical camera, speeding convergence and improving robustness.
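For concreteness, a 3-DOF mirror can be parameterised by a unit plane normal n (two DOF) and a plane offset d (one DOF), so the mirror plane is n·x + d = 0; reflecting the real camera through this plane gives the corresponding virtual view. The MATLAB sketch below shows the standard Householder form of this reflection; the function name and argument layout are illustrative, not our exact implementation.

% Illustrative sketch of the 3-DOF mirror model (standard plane-reflection geometry).
% n : 3x1 mirror plane normal, d : plane offset, so the mirror plane is n.x + d = 0
% C : 3x1 real camera centre,  R : 3x3 real camera orientation
function [Cv, Rv] = virtualCameraFromMirror(n, d, C, R)
    n  = n(:) / norm(n);          % ensure a unit normal
    H  = eye(3) - 2 * (n * n');   % 3x3 Householder reflection matrix
    Cv = H * C - 2 * d * n;       % reflected (virtual) camera centre
    Rv = H * R;                   % reflected orientation axes (handedness is flipped)
end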

A limitation of our calibration technique is that the images taken without mirrors are only considered when initializing the camera intrinsics. A better solution, left as future work, would jointly consider all images, with and without mirrors.

Based on the calibrated mirror geometry, the nearest grid of parallel cameras is estimated, and decoding proceeds as:

1. Remove 2D radial distortion,

2. Slice 2D image into a 4D array, and

3. Reproject each 2D sub-image into central camera view orientation.

Here, we assume the central camera view is aligned with the center mirror.

The final step corrects for rotational differences between the calibrated and desired virtual camera arrays using 2D projective transformations. There is no compensation for translational error, though in practice the virtual cameras are very close to an ideal grid. An example input image and decoded light field are shown in Fig. 4.1c. Our calibration routine reported a 3D spatial reprojection RMS error of 1.80 mm. The spatial reprojection error is the 3D distance from the projected ray to the expected feature location during camera calibration, where pixel projections are traced through the camera model into space. This small error confirms that the camera design, manufacture and calibration have yielded observations close to an ideal light field.
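A minimal MATLAB sketch of these three decoding steps is given below, assuming the per-mirror sub-image rectangles and the per-view homographies into the central view have already been derived from the calibration; the variable names, sub-image size and data layout are illustrative assumptions rather than our exact decoder.

% Illustrative decoding sketch (assumed data layout, not the exact MirrorCam decoder).
% rects : 3x3 cell array of [x y w h] sub-image rectangles from the manual segmentation
% H     : 3x3 cell array of 3x3 homographies mapping each sub-image to the central view
img  = undistortImage(rawFrame, baseCameraParams);      % 1. remove 2D radial distortion
subH = 240;  subW = 320;                                %    assumed common sub-image size
[Nv, Nu] = size(rects);
LF = zeros(Nv, Nu, subH, subW, 3, 'like', img);         % 2. slice into a 4D (+ colour) array
for i = 1:Nv
    for j = 1:Nu
        sub = imcrop(img, rects{i, j});                 %    extract one mirror's sub-image
        tf  = projective2d(H{i, j}');                   %    MATLAB's transposed convention
        sub = imwarp(sub, tf, 'OutputView', imref2d([subH subW]));  % 3. reproject to the
        LF(i, j, :, :, :) = permute(sub, [4 5 1 2 3]);  %    central view; store as LF(i,j,v,u,c)
    end
end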

It is important to note that our current calibration did not account for the manufacturing aspects of the camera, such as the thickness of the acrylic mirrors, or the additional thickness of the epoxy used to secure the mirrors to the mount. The acrylic mirrors we used also exhibited some bending and rippling, causing image distortion unaccounted for in the calibration process.

A.4 Conclusions and Future Work

In this appendix, we have presented the design optimisation, construction, decoding and calibration process of a mirror-based light-field camera. We have shown that our 3D-printed MirrorCam, optimized for overlapping FOV, reproduces a light field. This implies that the mirror-based LF camera is a viable, low-cost and accessible alternative to commercially-available LF cameras. Our implementation, written as unoptimized MATLAB code, takes 5 seconds per frame; the decoding and correspondence processes are the current bottlenecks. With further optimization, real-time light-field capture should be possible, pushing the technology towards real-time light-field cameras for robotics. In future work, we will validate the MirrorCam in terms of image refocusing, depth estimation and perspective shift, in comparison to other commercially-available light-field cameras.


Figure A.5: MirrorCam v0.4c mirror holder.


Figure A.6: MirrorCam v0.4c camera clip.