Deep Learning Approaches for 3D Inference from Monocular Vision

SUBMITTED IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Dominic Jack Bachelor of Applied Science (Hons)

School of Electrical Engineering and Robotics Science and Engineering Faculty Queensland University of Technology

2020

Copyright in Relation to This Thesis © Copyright 2020 by Dominic Jack

Bachelor of Applied Science (Hons). All rights reserved.

Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signature: QUT Verified Signature

Date: 3 September 2020

Abstract

Deep learning has contributed significant advances in the last decade. This thesis looks at two problems involving 3D inference from 2D inputs: human pose estimation, and single-view object reconstruction. Each of our methods considers a different type of 3D representation and seeks to take advantage of that representation's strengths, including keypoints, occupancy grids, deformable meshes and point clouds. We additionally investigate methods for learning from unstructured 3D data directly, including point clouds and event streams.

In particular, we focus on methods targeted towards applications on moderately-sized mobile robotics platforms with modest computational power on board. We prioritize methods that operate in real-time with relatively low memory footprint and power usage compared to those tuned purely for accuracy-like performance metrics.

Our first contribution looks at 2D-to-3D human pose keypoint lifting, i.e. how to infer a 3D human pose from 2D keypoints. We use a generative adversarial network to learn a latent space corresponding to feasible 3D poses, and optimize this latent space at inference time to find an element corresponding to the 3D pose which is most consistent with the 2D observation using a known camera model. This results in competitive accuracies using a very small generator model.

Our second contribution looks at single-view object reconstruction using deformable mesh models. We learn to simultaneously choose a template mesh from a small number of candidates and infer a continuous deformation to apply to that mesh based on an input image.

We tackle both problems of human pose estimation and single-view object reconstruction in our third contribution. Through a reformulation of the model presented in our first contribution, we combine multiple separate optimization steps into a single multi-level optimization problem that takes into account the feasibility of the 3D representation and its consistency with observed 2D features. We show that approximate solutions to the inner optimization process can be expressed as a learnable layer and propose problem-specific networks which we call Inverse Graphics Energy Networks (IGE-Nets). For human pose estimation, we achieve comparable results to benchmark deep learning models with a fraction of

the number of operations and memory footprint, while our object reconstruction model achieves state-of-the-art results at high resolution on a standard desktop GPU.

Our final contribution was initially intended to extend our IGE-Net architecture to accommodate point clouds. However, a search of the literature found no simple network architectures which were both hierarchical in cloud density and continuous in coordinates – both necessary conditions for efficient IGE-Nets. As such, we investigate various approaches that improve the performance of existing point cloud methods, and present a modification which is not only hierarchical and continuous, but also runs significantly faster and requires significantly less memory than existing methods. We further extend this work for use with event camera streams, producing networks that take advantage of the asynchronous nature of the input format and achieve state-of-the-art results on multiple classification benchmarks.

Acknowledgments

To my supervisors, thank you for your guidance and the freedom to chase my ideas – even when you knew they would not lead anywhere; to other students and academics in the lab, thank you for the advice, the water-cooler chats, and listening to my rants of frustrations; to my parents, thank you for your unquestioning love and support whether I’m starting a circus school or finishing a PhD; and finally to my friends, who kept me sane, accepted when I was distant, and welcomed me back after each of my deadline-induced disappearances. This project has pushed me to my limit, and I could not have got through without each and every one of you.

I would also like to thank the Australian government for the Research Training Program (RTP) of which I was a recipient, along with QUT’s Deputy Vice-Chancellor’s Initiative Scholarship.

Table of Contents

Abstract iii

Acknowledgments v

Acronyms xi

List of Figures xiii

List of Publications 1

1 Introduction 3

1.1 Computer Vision in Robotics ...... 4

1.2 Artificial Intelligence and the Deep Learning Era ...... 5

1.3 Problem Descriptions ...... 6

1.3.1 Human Pose Estimation ...... 6

1.3.2 Single View Object Reconstruction ...... 8

1.3.3 Point Cloud Classification ...... 10

1.3.4 Event Stream Classification ...... 10

1.4 Research Questions ...... 11

1.5 Thesis Outline ...... 11

2 Literature Review 13

2.1 General Deep Learning ...... 13

2.1.1 The Exploding/Vanishing Gradient Problem ...... 13

2.1.2 Optimizers ...... 14

2.1.3 Initialization ...... 15

2.1.4 Regularization ...... 15

2.1.5 Implementation and Training ...... 16

2.2 Generative Adversarial Networks ...... 18

2.3 Learned Energy Networks ...... 19

2.4 Deep Learning in Computer Vision ...... 19

2.4.1 Object detection ...... 20

2.4.2 Semantic Segmentation ...... 20

2.4.3 Instance Segmentation ...... 21

2.4.4 Human Pose Estimation ...... 21

2.4.5 Single View Object Reconstruction ...... 22

2.4.6 Point Cloud Networks ...... 23

2.4.7 Event Stream Networks ...... 24

3 Adversarially Parameterized Optimization for 3D Human Pose Estimation 27

4 Learning Free-Form Deformations for 3D Object Reconstruction 41

5 IGE-Net: Inverse Graphics Energy Networks for Human Pose Estimation and Single-View Reconstruction 59

6 Sparse Convolutions on Continuous Domains for Point Cloud and Event Stream Networks 73

7 Conclusion 99

7.1 Contributions ...... 99

7.2 Future Work ...... 100

7.2.1 Sparse Point Cloud Convolutions ...... 100

7.2.2 Sparse Event Stream Convolutions ...... 100

7.2.3 IGE-Nets ...... 101

7.2.4 Isosurface Extraction Models ...... 101

7.2.5 Universal Latent Shape Representation ...... 102

References 118

Acronyms

AI Artificial Intelligence

DL Deep Learning

CNN Convolutional Neural Network

CPU Central Processing Unit

FFD Free Form Deformation

GAN Generative Adversarial Network

GPU Graphics Processing Unit

IGE-Net Inverse Graphics Energy Network

ML Machine Learning

NLP Natural Language Processing

QUT Queensland University of Technology

ReLU Rectified Linear Unit

R-CNN Region-based CNN

SGD Stochastic Gradient Descent

TPU Tensor Processing Unit

List of Figures

1.1 Left-to-right: fleets of driverless cars are already being built [1]; doctor-assisted medical diagnosis improves accuracy and reduces doctor stress and fatigue [2]; a robotic “dog” is in use at San Francisco airport, taking and collating construction photographs [3]; Samsung’s SGR-A1 sentry gun on the South Korean side of the demilitarized zone [4]. ...... 4

1.2 Left-to-right: image classification, object detection and localization, semantic segmentation, and instance segmentation are common tasks in 2D computer vision...... 4

1.3 Human pose representations. Left-to-right: 2D heatmap and keypoint detections [5] and 3D keypoint detections [6]...... 7

1.4 3D object representations. Left-to-right: triangular mesh, point cloud, voxel occupancy grid, level sets slice...... 9

1.5 Output from an event camera can be thought of as a point-cloud in x, y, t space...... 10

List of Publications

Following is the list of peer-reviewed publications that form a part of this thesis.

• Dominic Jack, Frederic Maire, Anders Eriksson, and Sareh Shirazi. “Adversarially Parameterized Optimization for 3D Human Pose Estimation,” in International Conference on 3D Vision (3DV). Qingdao, China, 2017.

• Dominic Jack, Jhony K. Pontes, Sridha Sridharan, Clinton Fookes, Sareh Shirazi, Frederic Maire, Anders Eriksson. “Learning Free-Form Deformations for 3D Object Reconstruction,” in the Asian Conference on Computer Vision (ACCV). Perth, Australia, 2018.

• Dominic Jack, Frederic Maire, Sareh Shirazi, and Anders Eriksson. “IGE-Net: Inverse Graphics Energy Networks for Human Pose Estimation and Single-View Reconstruction,” in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA, 2019.

• Dominic Jack, Frederic Maire, Simon Denman, and Anders Eriksson. “Sparse Convolutions on Continuous Domains for Point Cloud and Event Stream Networks,” submitted to the European Conference on Computer Vision (ECCV). Glasgow, UK, 2020.

The following publication is related to this thesis but not included.

• Mateusz Michalkiewicz, Jhony K. Pontes, Dominic Jack, Mahsa Baktashmotlagh, Anders Eriksson. “Implicit Surface Representations As Layers in Neural Networks,” in International Conference on Computer Vision (ICCV). Seoul, Korea, 2019.

Chapter 1

Introduction

“As I dug into research on Artificial Intelligence, I could not believe what I was reading. It hit me pretty quickly that what’s happening in the world of AI is not just an important topic, but by far THE most important topic for our future.”

— Tim Urban, Wait but Why [7]

Driverless cars. Aerial drone-based parcel delivery. Robots doing housework. Military sentry guns automatically selecting and eliminating targets. These are not concepts predicted by futurists for the far-off future, or even experimental proof-of-concept implementations in highly controlled labs. All are in operation today across the globe, and progress in the area is only speeding up.

The potential (and realized) benefits of artificial intelligence (AI) and robotics are clear. Doctor-assisted disease diagnosis is already saving lives, automatic collision avoidance systems in semi-autonomous cars have already prevented numerous road accidents, and few owners of robotic vacuum cleaners report missing the manual experience.

On the other hand, self-driving cars have been involved in at least six deaths as of August 2019 [8]. Science fiction has a long history of telling stories of rogue artificial agents destroying humanity, and many high-profile figures are concerned this could become more than a fantasy. Even if humans maintain control over their creations, there is concern that today’s capitalist model is fundamentally incompatible with a post-robotic-revolution environment of widespread automation, and that without an adequate transition plan, the resulting unemployment and inequality could spark mass riots.

Whether you foresee a utopian robotic socialist future free of disease, war and hardship; or an apocalyptic end to humanity sparked by nefarious super-intelligent beings or world-wide societal collapse,


one thing is certain: we live in exciting times.

This thesis looks at methods for solving common problems that arise in computer vision for robotics, either motivated by or directly related to performing 3D inference from monocular vision.

Figure 1.1: Left-to-right: fleets of driverless cars are already being built [1]; doctor-assisted medical diagnosis improves accuracy and reduces doctor stress and fatigue [2]; a robotic “dog” is in use at San Francisco airport, taking and collating construction photographs [3]; Samsung’s SGR-A1 sentry gun on the South Korean side of the demilitarized zone [4].

1.1 Computer Vision in Robotics

“Robots... I think that is a hot topic.”

— Bill Budge

We loosely define computer vision as the process by which high-level information is extracted from low-level sensor data similar to that perceived by the human visual system. This includes images and video sequences, but also data from LIDAR or infrared cameras. Conceptually, the simplest of these tasks include image classification, object detection and localization, semantic segmentation and instance segmentation, as illustrated in Figure 1.2.

Figure 1.2: Left-to-right: image classification, object detection and localization, semantic segmentation, and instance segmentation are common tasks in 2D computer vision.

On their own, solutions to such tasks are not particularly useful for robotics applications. Robots operate in the 3-dimensional world, so the pixel coordinates of a bounding box cannot be used directly. They can however be used as inputs to higher-level tasks. For example, by knowing a camera’s position

and orientation, having a ground plane estimate, and assuming some part of the object is on the ground, the distance to the base of the object can be computed and used for collision avoidance. Other important tasks in mobile robotics with a heavy visual component include visual navigation, path planning and simultaneous localization and mapping.

Getting around is only one problem in robotics. Beyond driverless cars, most robots need to interact with their environment somehow, be it through grasping objects or opening doors. Solutions to these tasks generally involve a lot more than simple pixel-wise analysis, requiring an understanding of not only the 3D geometry of the objects involved but also how they respond to interaction. A plastic bag must be grasped differently to a hard ball, and a door handle on an otherwise relatively blank wall implies the presence of a door.

Interacting with humans presents additional challenges. Ignoring the challenges of natural language processing, a lot can be understood from reading body language and facial expression. For example, there are many consumer-grade drones on the market that can track targets for filming and also recognize hand gestures to know when to land.

1.2 Artificial Intelligence and the Deep Learning Era

“The key to artificial intelligence has always been the representation.”

— Jeff Hawkins

Deep Learning can trace its roots back to the early 1940s, when Landahl et al. [9] attempted to model neural networks of the human brain. With notable exceptions, advances over the subsequent 70 years were slow, stymied by a general lack of interest during the infamous AI winters and poor performance compared to competing methods such as support vector machines.

This all changed in 2012 when AlexNet [10] won a number of international competitions. This was made possible predominantly by two independent factors.

Cheap parallel computation High performance graphics processing units (GPUs) developed predominantly for gaming could be repurposed for scientific use thanks to general-purpose GPU libraries like CUDA [11]. This drastically reduced training times.

Large publicly available datasets It is generally accepted that deep learning methods are data-hungry. Well-organised datasets like ImageNet [12] with millions of labelled images allowed these

networks to reach their full potential.

Since then, deep learning research has boomed with more algorithmic advances, a proliferation of large, publicly available datasets and further hardware advances.

Deep Learning approaches now boast state-of-the-art performance in areas including computer vision, natural language processing, medical segmentation/diagnosis, drug discovery, graph analysis, game playing, speech recognition/synthesis and code generation.

1.3 Problem Descriptions

In this thesis we initially look at two problems of particular relevance to robotics platforms: human pose estimation and single view object reconstruction. As a result of this research we develop tools for extracting features from point clouds and event camera streams, so we additionally look at applications of these methods to classification tasks with these input formats.

1.3.1 Human Pose Estimation

Estimating human pose – the image or volumetric coordinates of specific body parts – is a common problem with applications in areas such as human-computer interaction, scene understanding, virtual/augmented reality and human action recognition/prediction. There are many variations of the problem, depending on the available input data and desired outputs. Models most often infer poses in one of two formats.

Keypoints Keypoints are ordered sets of coordinates – either 2D or 3D – corresponding to specific points such as hands, ankles or the face. This format is sparse and continuous in space, allowing for very precise inferences and simple, cheap calculations for most common losses and evaluation metrics. Unfortunately there is no natural way of expressing uncertainty, which is significant in the ambiguous circumstances that arise from occlusions and depth estimation. Keypoints are usually the format used in ground-truth annotations.

Heatmaps We can model uncertainty in joint locations by discretizing space and inferring a continuous pseudo-probability for each joint at each grid point. While this is useful for expressing known unknowns (like those resulting from occlusions or depth ambiguities), it greatly increases the size of the output space. As a result, the resolution of such heatmaps is often relatively low – particularly in 3D – greatly limiting the precision of each inference.

These output formats are applicable in both 2D and 3D and are illustrated in Figure 1.3. While it is possible to change from one form to the other – e.g. by constructing a discretized Gaussian centered at the keypoint, or by applying some argmax-like operation to heatmaps – the benefits of the original form are generally lost.
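The conversion between the two formats described above is compact enough to sketch directly. The following is a minimal, purely illustrative Python/NumPy example (grid size, sigma and function names are assumptions, not from any particular codebase): a keypoint is rendered as a discretized Gaussian heatmap and recovered with an argmax, at which point the sub-pixel precision of the original keypoint is lost.

```python
import numpy as np

def keypoint_to_heatmap(kp, shape=(64, 64), sigma=2.0):
    """Render a 2D keypoint (x, y) as a discretized Gaussian heatmap."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - kp[0]) ** 2 + (ys - kp[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def heatmap_to_keypoint(hm):
    """Recover an (x, y) estimate via argmax; sub-pixel precision is lost."""
    y, x = np.unravel_index(np.argmax(hm), hm.shape)
    return np.array([x, y], dtype=float)

kp = np.array([20.3, 41.7])
hm = keypoint_to_heatmap(kp)
print(heatmap_to_keypoint(hm))  # ~[20., 42.] -- quantized to the grid
```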

Figure 1.3: Human pose representations. Left-to-right: 2D heatmap and keypoint detections [5] and 3D keypoint detections [6].

Perhaps the simplest form of the problem involves inferring the 2D pose of a single target from a single image. Learned models face a couple of key problems. Firstly, most images will contain some form of occlusion – either from external objects or from other body parts (self-occlusions). As such, the network responsible for finding a left hand cannot simply look for regions that resemble left hands – it must also be aware of things that often lead to left hands, like left arms, in case the hand itself is completely occluded. A similar issue is distinguishing visually similar joints resulting from left-right symmetries. For example, left and right knees are often indistinguishable locally, and differentiation must be done by looking for other features – potentially quite far away – like visibility of the face or the relative location of other knee-like features.

One extension of this 2D pose inference problem is to infer poses of an unknown number of targets in each image. While conceptually simple, this extension raises many questions that must be resolved in terms of output format and choice of loss function. Keypoints can be used, but the number of sets of keypoints must be variable, and these must be matched to the labels in some way before loss can be calculated. Heatmaps can be used, but this raises the question of how to associate a unique instance of each joint with a unique human.

Another 2D variation is to add temporal information by considering a sequence of images. Conceptually there is much more information available to resolve ambiguities related to temporary occlusions, but the question remains how best to incorporate this information in a deep learning context.

3D variants of each of the above exist, though the extra depth dimension adds its own problems. The most obvious challenge is the resulting depth ambiguities. Furthermore, volumetric heatmaps (voxel heatmaps) become expensive to compute.

A significant factor in designing any deep learning model is the availability of data. This is greatly influenced by the cost of gathering such data – particularly in the case of human pose estimation. In 2D, the internet provides an effectively limitless source of images and videos of people in highly varied positions and environments. A simple labelling application is all that is required for a relatively unskilled human annotator to mark keypoints with a relatively high degree of accuracy. 3D labels in such varied environments are significantly more difficult to acquire. Most 3D datasets are produced using expensive motion-capture setups in very controlled environments with a relatively small number of participants. While the number of examples these setups can generate is large, the variety is limited, so deep learning methods must be carefully designed to allow resulting models to generalize to different environments.

A common way around this problem is to train two separate modules: an image-to-2D pose inference module, and a 2D-to-3D lifting module. The first module can be trained using only the highly varied 2D datasets, while the 2D-to-3D lifting module operates only on the high-level pose or image features extracted, rather than the raw pixel values.

In this thesis we focus on the problem of 2D-to-3D lifting. This allows us to iterate quickly on new ideas without requiring excessive training time or computational power due to the relatively low dimensionality of the problem.

1.3.2 Single View Object Reconstruction

The task of object reconstruction involves inferring a 3D representation of an object based on its image. From a robotics perspective, this is important for understanding an object’s affordances, planning how to interact with the object, or predicting how other agents might interact with the object. A key consideration in designing models for object reconstruction is the type of 3D representation used. We discuss the strengths and weaknesses of the main candidates below, and illustrate them in Figure 1.4.

Point Clouds Point clouds are unordered sets of coordinates generated by sampling the surface of an object, potentially with associated features like normal direction or colour. Like all surface-based approaches, they scale quadratically with resolution. Unlike keypoints, point clouds can be used to express any number of different object classes. On the downside, they don’t contain any explicit surface connectivity information. Their unordered nature also introduces complications in constructing deep networks, since operations should ideally be invariant to point order.

Deformable Meshes Meshes are defined by an ordered set of vertices and face data encoding connectedness. New meshes can be generated by inferring adjustments to vertex coordinates without changing the connectedness information. Since vertices are only on the surface of objects, this approach scales better with resolution than volumetric methods. Unfortunately, new meshes generated this way necessarily have the same topology as the input mesh. In some situations this is fine – even desired. For example, barring disfiguration or some forms of body augmentation, human faces are all topologically equivalent, and deforming vertices can allow for different expressions based on the same input face mesh. On the other hand, for some classes different instances will have different topologies. A modern passenger jet has a fundamentally different topology to a bi-plane, and while the surface of one could be deformed to have a similar appearance to the other, the underlying topology would be fundamentally incorrect and likely involve self-intersections.

Voxel Occupancy Grids A simple volumetric representation involves discretizing space into a regular grid of volume elements – voxels – and assigning each voxel a label of empty or occupied. Deep learning methods usually infer continuous pseudo-probabilities of occupancy to allow for differentiability. While this allows for arbitrary topologies up to the grid resolution, these methods scale cubically with resolution.

Level Sets Level set methods define a surface implicitly as the set of points at which some embedding function takes the value zero (or some other fixed value). Signed distance functions, for example, are embedding functions that give the signed distance to the surface, with the sign indicating whether the point is inside or outside the shape. A simple way of implementing these methods is to linearly interpolate function values inferred on a regular voxel grid. This allows for efficient isosurface extraction using well-established marching methods, but results in poor scaling with resolution, consistent with all voxel-based methods. Other embedding functions like parameterized neural networks may be able to express sparse shapes more efficiently by focusing detail towards the surface, but these introduce difficulties for efficient isosurface extraction.
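A minimal sketch of the voxel-grid level-set pipeline described above is shown below, assuming NumPy and scikit-image are available. The analytic sphere signed distance function is purely illustrative – in a learned model the grid values would be inferred by a network – and the marching cubes call extracts the zero level set as a triangular mesh.

```python
import numpy as np
from skimage import measure  # assumed available; provides marching cubes

# Evaluate a signed distance function (a sphere of radius 0.5) on a regular grid.
n = 32
lin = np.linspace(-1.0, 1.0, n)
x, y, z = np.meshgrid(lin, lin, lin, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 0.5  # negative inside, positive outside

# Extract the zero level set; vertices and faces define a triangular mesh.
verts, faces, normals, values = measure.marching_cubes(sdf, level=0.0)
print(verts.shape, faces.shape)  # O(n^2) surface elements from an O(n^3) grid
```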

Figure 1.4: 3D object representations. Left-to-right: triangular mesh, point cloud, voxel occupancy grid, level set slice.

1.3.3 Point Cloud Classification

Extracting high level features from low-level point cloud data is valuable for a range of tasks including object classification, 3D semantic segmentation, object completion or point cloud re-meshing. As we discuss in Chapter 6, a possible approach to single-view object reconstruction with point clouds requires point cloud operations with certain desirable mathematical properties. This feature extractor is applicable in areas beyond single-view reconstruction however, so we demonstrate its effectiveness in a multi-class point cloud classification setting.

1.3.4 Event Stream Classification

Unlike standard cameras that accumulate incident photons over a fixed time window and report intensities for all pixels at the same time, event cameras fire asynchronous events from individual pixels in response to changes in local intensity levels. This is illustrated in Figure 1.5. These event cameras are capable of running with very low power usage and at very high temporal resolutions, making them particularly promising for robotics applications. Unfortunately, most learning approaches are based on work in computer vision designed for image processing, meaning a lot of the advantages of this new data format are lost. We demonstrate that our work in point cloud processing is also applicable here by looking at various classification problems using event camera data.

Figure 1.5: Output from an event camera can be thought of as a point-cloud in x, y, t space.
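To make the analogy in Figure 1.5 concrete, the sketch below (Python/NumPy; the toy event list and field ordering are assumptions, not from any particular camera driver) stacks asynchronous events into an N × 3 array of (x, y, t) coordinates that point-cloud-style operators can consume, carrying polarity along as a per-point feature.

```python
import numpy as np

# Toy events as (x, y, t, polarity); real cameras stream these asynchronously.
events = [(12, 40, 0.0010, 1), (13, 40, 0.0012, -1), (12, 41, 0.0015, 1)]
ev = np.asarray(events, dtype=np.float64)

xyt = ev[:, :3].copy()
xyt[:, 2] *= 1e3        # rescale time so the three axes are comparable
polarity = ev[:, 3]     # per-point feature, analogous to colour on a point cloud

print(xyt.shape)        # (N, 3): a point cloud in x, y, t space
```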

The first three contributions relate to 2D-to-3D human pose lifting and/or single-view reconstruction. Our fourth looks at classification tasks involving 3D point clouds and event camera streams.

1.4 Research Questions

Throughout this thesis we seek to answer the following questions.

1. Can we formulate a principled approach to 3D generative models conditioned on 2D data that respects concepts of consistency and feasibility?

2. How can we best utilize techniques to simplify computer vision problems or enhance model performance? More specifically, can we use known device parameters and physics models to reduce what has to be learned? Can we create deep learning models that are reconfigurable with respect to input device(s) without the need for retraining?

3. Can we produce models targeted towards mobile robotics platforms that require fast inference times with moderate compute resources? Can we design those models to take advantage of the heterogeneous architectures (i.e. CPUs + modest GPU) typically on board?

4. How are these models best implemented in modern deep learning frameworks?

1.5 Thesis Outline

The rest of this thesis by publication is laid out as follows. In Chapter 2 we conduct a review of the literature relevant to generic deep learning, as well as techniques specific to 2D and 3D problems, generative adversarial networks, multi-level optimization/energy-based networks and point cloud networks.

Chapter 3 details our first contribution – a two-stage adversarially parameterized optimization approach to 2D-to-3D human pose lifting. Chapter 4 describes our model for performing single-view object reconstruction using deformable meshes via free-form deformation. In Chapter 5 we tackle both problems – human pose lifting and single-view reconstruction – using an energy-based framework that takes advantage of simple graphics techniques, before Chapter 6 discusses a continuous domain sparse convolution operator with applications to point cloud and event stream classification problems.

We conclude in Chapter 7 by summarizing our contributions and discussing future research directions.

Chapter 2

Literature Review

“If I have seen further than others, it is by standing upon the shoulders of giants.”

— Isaac Newton

In this chapter we briefly review literature relevant to the core concepts investigated in this thesis. We begin with an overview of general advances in deep learning before looking at specifics related to generative adversarial networks, learned energy networks, and techniques specific to the computer vision problems tackled in this work.

2.1 General Deep Learning

Neural networks have been around since the early 1940s [9]. That said, it wasn’t until 1962 that Dreyfus et al. [13] presented a simple, numerically efficient method for calculating gradients – backpropagation – allowing training via gradient descent, by far the most popular method of training networks today. Even then, it took an additional 50 years before deep learning rose to popularity.

2.1.1 The Exploding/Vanishing Gradient Problem

Poor performance in early networks was due primarily to poor propagation of gradients. As depth increased, the learning signal generally either blew up exponentially (exploded), or decreased to zero (vanished). This limited the depth of viable networks, handicapping their overall performance. Recent success in the area is due largely to a number of works that address this problem.

One major reason for the exploding/vanishing gradient problem is the saturation of biologically inspired sigmoid and hyperbolic tangent activation functions used in early networks. Rectified linear


units (ReLUs) [14] were introduced to address this, and continue to be used extensively in modern networks. This spawned many variants, including Leaky ReLUs [15], Exponential linear units [16], Parametric ReLUs [17], self-normalizing exponential linear units [18], swish [19] and mish [20].

Another approach to combating vanishing/exploding gradients is through activation normalization. Batch normalization [21] was the first popular method of this type and remains a critical component of many networks prominent in the literature. Variations include layer normalization [22] and group normalization [23].

Residual networks [24] allowed for a massive jump in network depth, demonstrating networks of 1,000 or more layers could be trained without experiencing exploding/vanishing gradients. The key idea was to add identity pathways – or skip connections – between blocks of layers that propagate residuals through the network. Variants include a pre-activation version [25], a multi-path version (ResNeXt, [26]) and a version that densely connected different blocks with skip connections (DenseNet [27]).
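The identity-pathway idea is simple enough to show in a few lines. The following is a minimal sketch written with PyTorch modules (that framework, the `ResidualBlock` name and the layer sizes are illustrative assumptions, not the reference implementation of any of the cited works): the block computes y = x + F(x), so gradients always have an identity path back through the sum.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the skip connection gives gradients an identity path."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))

x = torch.randn(2, 16, 32, 32)
print(ResidualBlock(16)(x).shape)  # torch.Size([2, 16, 32, 32])
```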

2.1.2 Optimizers

Neural networks are generally highly non-linear functions. Even if care is taken to ensure gradients do not vanish or explode, the time taken to train and the quality of the final network are highly dependent on the algorithm used to navigate the loss landscape towards a good minimum. In this subsection we discuss the algorithms – or optimizers – common in the literature.

Stochastic gradient descent (SGD) SGD with or without momentum has been used since at least 1964 [28], popularized in deep learning by Sutskever et al. [29].

Adagrad Adagrad [30] modifies SGD by adding a unique adaptive learning rate based on an accumulation of squared gradient values for each trained parameter. This is good for datasets with sparse features, as the network learns more from uncommon features. One key weakness is that the learning rate for each parameter is monotonically decreasing with examples, meaning all learning eventually plateaus – potentially before the solution has reached a minimum.

RMSprop First introduced by Hinton in an online lecture [31], RMSprop avoids Adagrad’s learning plateau by considering a moving average of squared gradient values rather than the sum.

Adadelta Adadelta [32] attempts to resolve a theoretical dimension mismatch in RMSprop by approximating second-order derivatives in a very principled manner.

Adam One of the most popular optimizers used in modern deep learning, Adam [33] combines the ideas of adaptive learning rates used in RMSprop and Adadelta with first-order momentum. The authors also introduced Adamax – a variant based on the infinity norm rather than the 2-norm.

Yogi Zaheer et al. [34] present a minimal modification to Adam that exhibits superior performance with minimal hyperparameter tuning.
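The moving-average machinery shared by RMSprop, Adadelta and Adam is compact enough to write out. Below is a minimal NumPy sketch of a single Adam update with bias-corrected first and second moments; the function name and default hyperparameters are illustrative, not taken from any particular library.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: momentum on the gradient plus a per-parameter scale."""
    m = b1 * m + (1 - b1) * grad          # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2     # second moment (adaptive learning rate)
    m_hat = m / (1 - b1 ** t)             # bias correction for step t >= 1
    v_hat = v / (1 - b2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```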

2.1.3 Initialization

Initialization is critical to most iterative solutions to optimization problems, and deep learning is no exception. While the various normalization techniques above can go some way to minimizing the significance of this, good initialization has nonetheless been shown to drastically improve both convergence times and the quality of converged solutions.

The most straightforward initialization schemes are based on sampling weights from a distribution that leaves the expected mean and variance of activations unchanged across a layer. Glorot and Bengio [35] show that doing so results in improved performance when using sigmoid or hyperbolic tangent functions, while He et al. [17] give a modified form for networks using ReLU activations.
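The two fan-based schemes above reduce to choosing the variance of the sampling distribution. A minimal NumPy sketch (function names are illustrative): Glorot scaling keeps activation variance roughly constant for tanh/sigmoid layers, while the He variant compensates for ReLU zeroing roughly half of the pre-activations.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out):
    """Glorot/Xavier: Var(W) = 2 / (fan_in + fan_out)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    """He: Var(W) = 2 / fan_in, accounting for ReLU halving the variance."""
    return np.random.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
```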

More recently, experiments on residual networks have shown they can be trained without batch normalization with only minor modifications. By using a principled scaling of standard kernel initializations and simple regularization, Zhang et al. [36] train models with up to 10,000 layers. ZeroInit [37] is a similar approach that replaces batch normalization by a multiplication of the convolution activation by a single learned parameter initialized to zero, resulting in improved performance compared to batch normalized networks for small or modest batch sizes.

Transfer learning – the process of using part of a model trained in one problem area as the starting point of a solution in another – can be thought of as an initialization scheme and has shown great success in deep learning [38]. Its success has led popular frameworks (see subsection 2.1.5) to provide state-of-the-art pretrained models for common problems like image classification, object detection and segmentation.

2.1.4 Regularization

Regularization techniques aim to ensure trained networks generalize to data not seen in training.

One of the simplest techniques is weight decay [39], where weights are reduced by a constant factor

after each optimizer update step. While straightforward to understand and implement, there are a couple of nuances to this approach that have been shown to have a non-negligible effect on training.

Firstly, early implementations of weight decay involved adding a squared term to the loss for each decayed parameter. This is equivalent to weight decay when using SGD without momentum, but is not the same when using other optimizers. While this loss term is confusingly often described as weight decay, the more standard term is L2 regularization. The difference was demonstrated by Loshchilov and Hutter [40], who showed models trained with Adam [33] and weight decay outperformed those trained with Adam and L2 regularization on image classification problems.
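The distinction is easiest to see in code. A minimal NumPy sketch (function names, simplified update and hyperparameters are illustrative; bias correction is omitted for brevity): with L2 regularization the penalty is folded into the gradient and therefore rescaled by the adaptive denominator, whereas decoupled weight decay shrinks the weights by a fixed factor independent of that scaling. With plain SGD the two coincide; with Adam-style updates they do not.

```python
import numpy as np

def adaptive_update(grad, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Simplified Adam-like update direction (no bias correction)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    return lr * m / (np.sqrt(v) + eps), m, v

def step_l2(w, grad, m, v, lam=1e-2):
    """L2 regularization: lam * w is added to the gradient and then rescaled."""
    delta, m, v = adaptive_update(grad + lam * w, m, v)
    return w - delta, m, v

def step_decoupled(w, grad, m, v, lam=1e-2, lr=1e-3):
    """Decoupled weight decay (AdamW-style): shrink weights outside the update."""
    delta, m, v = adaptive_update(grad, m, v)
    return w - delta - lr * lam * w, m, v
```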

Another surprising aspect of weight decay is the effect it has on layers with outputs subjected to batch/layer normalization. These normalization layers scale their input to unit variance, so uniformly scaling multiplicative factors that affect these inputs (as is the case for weight decay) has no effect on output. Nonetheless, weight decay has been shown to improve network performance in the presence of batch normalization. This was investigated by Zhang et al. [41], who show that without weight decay parameters grow in scale over time, resulting in a reduced effective learning rate. Significantly, they also showed that the regularizing effect of weight decay or L2 regularization was due almost entirely to its influence on the effective learning rate and the resulting increase in noise in gradient-based update steps – even for networks without batch normalization.

A separate form of regularization is dropout [42], where activations are randomly dropped during training. This was motivated by a desire to discourage co-adaptation of features, and is equivalent to training an exponential number of smaller networks with shared weights in an ensemble. Maxout networks [43] were proposed as natural companions to dropout by changing the activation between layers to a maximum over feature groups.
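A minimal sketch of the (inverted) dropout variant commonly used in practice, written in NumPy purely for illustration: activations are zeroed with probability `rate` during training and the survivors are rescaled so that expected magnitudes match those seen at test time, when the function is a no-op.

```python
import numpy as np

def dropout(x, rate=0.5, training=True):
    """Inverted dropout: zero activations with probability `rate`, then rescale."""
    if not training or rate == 0.0:
        return x
    mask = np.random.rand(*x.shape) >= rate
    return x * mask / (1.0 - rate)
```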

Guo and Gould [44] took the idea of dropout further and applied it to entire layers of residual networks. Not only did this have the desired regularization effect, it also drastically improved training times as computation of dropped layers could be skipped entirely, rather than computed and ignored as in the per-value implementation.

2.1.5 Implementation and Training

A key reason research in deep learning has developed so rapidly recently is the ease with which researchers can prototype ideas in terms of both implementation elegance and training time. This can be attributed to a number of factors.

High level frameworks A large number of frameworks have sprung up for deep learning with support for various languages and programming paradigms. These include Caffe [45], Theano [46], Torch [47], MXNet [48], ONNX [49], [50], Chainer [51] and CNTK (formerly Microsoft Cognitive Toolkit) [52]. Recently, the academic research community has largely converged on using either TensorFlow [53] or PyTorch [54], either directly or through higher-level interfacing libraries like Keras [55] and fastai [56].

Accelerator Hardware Compared to most machine learning algorithms, training deep learning models is computationally expensive. General purpose GPU programming [11] marked a turning point in deep learning research. More recent hardware developments include integrated circuits with in-memory processing [57] and purpose-built tensor processing units (TPUs) [58] designed to address the increased energy demands anticipated for data centers. Quantum computers are also slowly coming online [59], and while much work is required to design algorithms that can take advantage of this paradigm, the potential impact on the machine learning landscape is massive [60].

Distributed computing Many state-of-the-art deep learning architectures depend on their large size for their expressivity, and hence predictive performance. For example, NVidia recently released MegatronLM [61], a language model whose parameters alone require 33GB of space. Very few accelerators have sufficient memory to hold these, and that’s before considering space for input data and activations. Models like these are trained by distributing parameters and computation across multiple devices. While this distributed training introduces implementation complexities, NVidia was able to achieve almost linear performance improvements with respect to the number of GPUs used, allowing MegatronLM to be trained across 512 GPUs in less than 10 days.

Cloud computing Of course, many researchers don’t have access to their own cluster of 512 GPUs or other large-scale distributed accelerators. Cloud computing providers like Amazon Web Services [62], Microsoft Azure [63] and Google Cloud [64] fill this gap, providing deep learning specific products with support for major frameworks allowing researchers to rent various hardware configurations and run experiments quickly and affordably without the need to purchase, maintain and manage their own networks [65].

2.2 Generative Adversarial Networks

Since their introduction by Goodfellow et al. [66] in 2014, Generative Adversarial Networks (GANs) have seen an explosion in interest. At the time of writing, there are 14,000 search results containing the exact phrase, and the original paper has over 15,000 citations. We provide a brief overview here as it relates to our first contribution (Chapter 3). For a more thorough review, we direct the interested reader to the recent survey by Pan et al. [67].

While a large number of variations exists, GANs are generally paired networks intended to learn distributions of data. A generator network is responsible for mapping a known random distribution to an element in the target domain, and a separate discriminator network is tasked with classifying an element of the target domain as originating either from the generator or the training dataset. These networks are trained in conjunction using a modified mini-max loss, with the generator aiming to produce output indistinguishable from the dataset as judged by the discriminator [66].
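A minimal sketch of one alternating training step of this mini-max game, written with PyTorch (that framework, the `gan_step` name and the latent dimension are illustrative assumptions; `generator` and `discriminator` stand in for problem-specific networks that output samples and logits respectively):

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, real, latent_dim=64):
    """One alternating update of the discriminator and the generator."""
    ones = torch.ones(real.size(0), 1)
    zeros = torch.zeros(real.size(0), 1)

    # Discriminator: classify dataset samples as 1 and generated samples as 0.
    fake = generator(torch.randn(real.size(0), latent_dim)).detach()
    d_loss = (F.binary_cross_entropy_with_logits(discriminator(real), ones)
              + F.binary_cross_entropy_with_logits(discriminator(fake), zeros))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: produce samples the discriminator classifies as real.
    fake = generator(torch.randn(real.size(0), latent_dim))
    g_loss = F.binary_cross_entropy_with_logits(discriminator(fake), ones)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```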

Applications of GANs are many and varied. In computer vision, GANs have been applied to problems including image super-resolution [68, 69], image translation [70–72], texture synthesis [73–75] and deformable mesh inference [76]. Other notable uses include speech/poetry/music generation [77], anomaly detection in medical images [78], generating DNA sequences [79] and malware creation [80].

While modifying network architectures to generate/discriminate elements of new target domains is relatively straightforward, vanilla implementations often exhibit unstable training and mode collapse [81]. Many methods to combat these phenomena are based on careful selection of objective functions, regularization techniques and training schemes. Wasserstein GANs [82] were introduced to take advantage of the superior gradient behaviour of the earth-mover distance for distribution learning, with an implementation based on a modified loss function and weight clipping. This approach was modified by Gulrajani et al. [83] by replacing the weight clipping with a gradient penalty term. Training stability was further improved by Petzka et al. [84] via an additional penalty term.

Other techniques that have been demonstrated to improve performance include feature matching, historical averaging and one-sided label smoothing [85]; separate learning rates [86]; and spectral normalization [87, 88].

We use GANs extensively in Chapter 3.

2.3 Learned Energy Networks

At a high level, the problems investigated in this thesis involve inferring values of unknown variables from observations. Energy-based models describe relationships between sets of variables by mapping each combination to a scalar energy value, where realistic or likely combinations correspond to lower energies than their less viable counterparts. Inferences are made by fixing values of known variables and seeking unknown values which minimize the energy [89].

Energy-based models have been combined with deep learning in the past. Zheng et al. [90] formulated conditional random fields (CRFs) as a recurrent neural network layer which, combined with a standard convolutional neural network (CNN), achieved state-of-the-art results for semantic segmentation. Amos and Kolter [91] considered energy functions based on quadratic programs. Their implementation solved the inner optimization problem efficiently and exactly, and demonstrated it was able to learn hard constraints like those associated with the number game Sudoku.

Domke [92] presented a number of implementations for efficiently computing and differentiating approximate optimizations – solutions where the energy minimization process is based on a fixed number of steps of some optimization algorithm. While the algorithms did not find the exact solution to the energy minimization problem, these truncated optimization processes still yielded good results for image denoising and labeling problems. Belanger et al. [93] took a similar approach and showed inexact optimization of complex energy functions outperformed exact solutions using simpler functions for image denoising and natural language semantic role labeling.
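The truncated-optimization idea is compact enough to sketch: the inner energy minimization is replaced by a fixed number of gradient steps, and because each step is differentiable the whole loop can be treated as a layer and trained end-to-end. The sketch below uses PyTorch autograd (assumed); the quadratic energy and the `unrolled_argmin` name are purely illustrative stand-ins for a learned feasibility/consistency measure.

```python
import torch

def unrolled_argmin(energy, y0, steps=5, lr=0.1):
    """Approximate argmin_y E(y) with a fixed, differentiable unrolled loop."""
    y = y0
    for _ in range(steps):
        (grad,) = torch.autograd.grad(energy(y), y, create_graph=True)
        y = y - lr * grad
    return y

# Illustrative energy: squared distance to a learnable target.
target = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
energy = lambda y: ((y - target) ** 2).sum()

y0 = torch.zeros(2, requires_grad=True)
y_hat = unrolled_argmin(energy, y0)   # approximate inner minimizer
outer_loss = (y_hat ** 2).sum()       # some downstream training loss
outer_loss.backward()                 # gradients flow back into `target`
print(target.grad)
```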

While the ideas and implementations behind learned energy networks are well established, we feel there is potential for significant improvement by tailoring energy functions to specific applications. In the area of computer vision, learned energy networks let us take advantage of computer graphics techniques like projection and learn a consistency measure in the projected space, rather than attempt the ambiguous task of learning the inverse of projection. Chapter 5 addresses this gap.

2.4 Deep Learning in Computer Vision

Virtually all advances in computer vision using neural networks have involved convolution operations, where local features are extracted from pixel neighborhoods using parameterized kernels. While the idea of convolutional neural networks (CNNs) has been around since the 1980s [94], it wasn’t until AlexNet [10] that their potential began to be realized. Since then, a number of different CNN architectures have been proposed and evaluated. Families of particular note include VGG [95], ResNet [25, 26], Inception

[96, 97], Xception [98], MobileNet [99, 100], DenseNet [27], NASNet [101], EfficientNet [102] and MixNet [103].

Image classification is the de facto standard for evaluating new CNN architectures due to the problem’s simplicity and the availability of many well-constructed and widely used datasets of varying sizes and dimensions [12, 104, 105].

2.4.1 Object detection

Object detection is another problem CNNs have come to dominate in recent years. Unlike image classification, the output of object detection – some location information (usually bounding boxes) for an unbounded number of objects of different classes – cannot be trivially represented as a fixed size tensor as required by standard CNNs. Region-based CNNs (R-CNN) [106] address this by solving a classification problem at a fixed number of regions proposed by selective search [107]. While R-CNN gave promising results, inference speed was slow. Spatial pyramid pooling networks [108] addressed the speed issue by extracting convolutional features once and reusing these in evaluating each region proposal. Fast R-CNN [109] modified the spatial pooling aspect to be differentiable, and hence made the entire network trainable end-to-end, while Faster R-CNN [110] replaced the region proposal component with a separate smaller CNN.

YOLO (you only look once) networks [111–113] are fundamentally different to R-CNN networks in that they pass the entire image through a single CNN and output confidences of object existence at predefined grid points and resolutions, rather than running inference on a number of proposed regions. This makes them much faster than region-based methods, though the finite resolution of the discrete grid means accuracy is generally lower for small objects.

Single shot detectors (SSD) [114] can be thought of as a hybrid approach. Like YOLO networks, the entire image is passed through a single CNN, the output of which is passed into a small Faster R-CNN-inspired bounding box offset/confidence prediction network.

2.4.2 Semantic Segmentation

Semantic segmentation – the problem of classifying individual pixels of an image as belonging to one or more classes – is an important problem in areas such as medical diagnosis/treatment [115], driverless cars [116] and scene understanding [117].

U-net [118] was one of the first CNN-based approaches to tackle this problem by combining a

standard convolution/pooling architecture with upsampling layers and skip connections to perform medical image segmentation. FCN [119] took a similar approach, though also demonstrated fractionally strided convolutions (or transposed convolutions) could be used instead of upsampling to form a fully convolutional network.

Chen et al. [120, 121] introduced atrous convolutions and used conditional random fields to refine predictions. They later extended this approach to using an encoder-decoder style base network [122]. Takikawa et al. [123] used explicit shape priors, while Zhu et al. [124] improved results in video datasets by propagating information across frames.

2.4.3 Instance Segmentation

Similar to semantic segmentation, instance segmentation problems require networks to classify each pixel in an image as belonging to a certain class. In addition, those inferences must differentiate between pixels corresponding to different objects of the same class.

Mask R-CNN [125] takes a simple approach to the problem, modifying Faster R-CNN by adding a mask head to the existing classification network. Mask-scoring R-CNN [126] extends this by penalizing low-quality masks (as measured by IoU) corresponding to high classification accuracies. The authors show this approach consistently outperforms the baseline across all CNN model architectures.

Facebook AI Research published a series of works on instance segmentation. Their initial model, DeepMask [127], focused on segmenting the central object in a given image patch. They extended this work with SharpMask [128], which uses features at multiple resolution levels to refine segmentations provided by DeepMask. Their MultiPath network [129] combined a modified Fast R-CNN model for image patch proposal with the DeepMask/SharpMask models.

2.4.4 Human Pose Estimation

Inferring human pose in two or three dimensions from images is an important part of many tasks including human-computer interaction and action recognition. For the 2D problem, traditional approaches combine visual features and image descriptors with a tree-structure of the body and known invariants and proportions [130]. More recently, deep learning’s wave of success in other image processing applications such as image classification and segmentation has flowed into pose estimation, with fully-convolutional approaches achieving exceptionally accurate 2D inferences by regressing heatmaps rather than the joint coordinates themselves [131–135].

The 3D problem is considerably more challenging. In addition to problems involved in the 2D variant, the main difficulty in training 3D pose inference systems that work in the wild is the availability of varied datasets. While 2D datasets can be annotated manually, 3D information is generally gathered using special motion-capture systems. Although these systems are capable of generating massive volumes of data, the examples within such datasets are usually limited in variety. For example, the Human3.6M dataset (H3.6M) [136] contains millions of frames, but all images are collected in the same room with only a handful of subjects. By contrast, the popular 2D dataset COCO [117] features over 50,000 human pose annotations with very few duplicates.

To get around this lack of varied 3D data, many methods use a 2-stage approach to 3D inference by inferring 2D poses from images, then lifting these 2D poses to 3D separately [137–139]. These approaches benefit from the varied image features in 2D datasets, but the separate stages mean any “lifting” module is unable to take advantage of contextual information learned in the first stage.

The other main difficulty with 3D pose estimation is the inherent ambiguity associated with depth inference and occlusions. Adversarial approaches tackle this by introducing loss terms which are themselves learned in a modified mini-max game [140, 141] that encourage feasible solutions when multiple consistent possibilities exist. While these approaches penalize networks which produce inferences inconsistent with observation, they still require the network to learn to avoid these inconsistencies and thus learn some concept of projection. Given that we understand the concepts involved in projection almost perfectly and it is an unambiguous operation, we feel this is wasted effort, and seek to address this weakness in Chapter 3.

2.4.5 Single View Object Reconstruction

Reconstructing 3D objects from a single view is a common problem in computer vision and robotics. Fundamental to any approach is the choice of 3D representation for the inference. Volumetric methods are the most widely used in 3D learning [142–151]. These approaches generally use 3D analogues of ideas and operations that have proven successful in image processing, including convolutions, deconvolutions and feature pooling. Recent advances in auto-encoders [152, 153] and GANs [154–157] have also shown promising results on regular 3D grids, while Tulsiani et al. [158] showed object shape and pose can be learned simultaneously and without 3D labels using only depth maps or silhouettes to encourage view consistency across multiple views.

Unfortunately, the additional dimension inherent to 3D representations means these methods scale poorly with resolution, resulting in generally coarse outputs – typically 32³ or 64³. To overcome this scaling issue, octree networks [159–162] recursively divide regions of interest into octants. By focusing

only on regions near the object surface, these methods operate with complexity proportional to surface area rather than volume.

Another approach is to represent the surface of the object as a level set of a 3D embedding function. Park et al. [163] show that a single continuous deep network can be learned to represent a large number of embedding functions by conditioning the network on a latent-space embedding of the shape. Michalkiewicz et al. [164] showed that training based on voxels targeting continuous signed distance function values rather than binary occupancy values resulted in better surfaces at higher resolutions.

Other approaches to high-resolution inference keep the regular volumetric data structure but use operations that scale better to higher resolutions [165, 166].

Template deformation approaches instead infer a constant-sized space warping that can be applied to an arbitrarily dense cloud or mesh. This comes at a cost however, as the topology of the output shape is intrinsically coupled with that of the deformed template. The two extremes of these are DeformNet [167] and FoldingNet [168]. DeformNet uses a latent representation of the coarse voxelization of a template mesh selected from a large database. While this makes choosing relatively close template meshes possible, the coarse discretization required to make the network computationally feasible limits the precision of the deformation the network can hope to infer. FoldingNet uses the exact same template for all inputs – a 2D plane – meaning all solutions are topologically equivalent to a plane regardless of the input.

The work we present in Chapter 4 seeks to combine the strengths of these two deformable mesh approaches by training separate decoder networks for each of a small number of templates.

2.4.6 Point Cloud Networks

Point cloud methods avoid the need to discretize space, instead working on continuous coordinates of points on the object surface [169, 170]. However, the variable size and unordered nature of point clouds introduce their own complexity in deep learning frameworks.

Early works in point cloud processing – Pointnet [171] and Deep Sets [172] – use point-wise shared subnetworks and order invariant pooling operations. The successor to Pointnet, Pointnet++ [173] was (to the best of our knowledge) the first to take a hierarchical approach, applying Pointnet submodels to local neighborhoods.
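A minimal sketch of the order-invariant design behind Pointnet and Deep Sets, written with PyTorch (that framework, the `TinyPointNet` name and the layer sizes are illustrative assumptions): the same MLP is applied to every point independently and a symmetric (max) pooling collapses the set into a fixed-size feature, so permuting the input points leaves the output unchanged.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Per-point shared MLP followed by symmetric (max) pooling."""
    def __init__(self, in_dim=3, feat_dim=128, num_classes=10):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, points):                 # points: (batch, num_points, 3)
        per_point = self.point_mlp(points)     # (batch, num_points, feat_dim)
        pooled = per_point.max(dim=1).values   # order-invariant global feature
        return self.head(pooled)

x = torch.randn(2, 1024, 3)
perm = torch.randperm(1024)
model = TinyPointNet()
assert torch.allclose(model(x), model(x[:, perm]), atol=1e-5)  # permutation-invariant
```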

SO-net [174] takes a similar hierarchical approach to Pointnet++, though uses a different method for sampling and grouping based on self-organizing maps. DGCNN [175] applies graph convolutions to point clouds with edges based on spatial proximity. KCNet [176] uses dynamic kernel points in

correlation layers that aim to learn features that encapsulate the relationships between those kernel points and the input cloud. While most approaches treat point clouds as unordered sets by using order-invariant operations, PointCNN [177] takes the approach of learning a canonical ordering over which an order- dependent operation is applied. SpiderCNN [178] and FlexConv [179] each bring their own unique interpretation to generalizing image convolutions to irregular grids. While SpiderCNN focuses on large networks for relatively small classification and segmentation problems, FlexConv utilizes a specialized GPU kernel to apply their method to point clouds with millions of points.

Recent work on ensemble methods [180] showed groups of networks working together can significantly outperform lone models. In doing so, it also highlighted a trend in published results whereby reported metrics are consistently above the average performance of the model when the training process is repeated multiple times.

Assisting the development of deep learning methods for point clouds is the availability of large, quality datasets. Probably the most popular for point cloud classification is Modelnet [143], a classification dataset made up of 3D CAD models from which point clouds are artificially sampled. ShapeNet [181] also provides a large set of CAD models of different categories, with a derivative competition [182] in 2017 featuring a point-cloud-based semantic segmentation challenge, and more recently a much higher resolution variant [183]. In a recently released dataset, Uy et al. [184] also performed a survey of existing methods and found models trained on artificial data performed poorly on real-world data.

We identify the following gaps in the point cloud feature extraction literature:

• there seem to be no network structures which are both hierarchical and continuous in coordinates; and

• those networks advertising themselves as convolutional [177–179] implement operations significantly different to the mathematical definition of convolution.

We address these gaps in Chapter 6.

2.4.7 Event Stream Networks

Compared to standard images, relatively little deep learning research has targeted event streams. Interest has started to grow recently with the availability of a number of event-based cameras [185, 186] and publicly available datasets [186–190].

A number of approaches utilize the extensive research in standard image processing by converting event streams to images [188, 191]. While these can leverage existing libraries and cheap hardware optimized for image processing, the necessity to accumulate synchronous frames prevents them from taking advantage of many potential benefits of the data format. Other approaches look to biologically-inspired spiking neural networks (SNNs) [192–194]. While promising, these networks are difficult to train due to the discrete nature of the spikes.

Other notable approaches include the work of Lagorce et al. [195], who introduce hierarchical time-surfaces to describe spatio-temporal patterns; Sironi et al. [189], who show histograms of these time surfaces can be used for object classification; and Bi et al. [190], who use graph convolution techniques operating over graphs formed by connecting close events in space-time.
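A toy version of the time-surface idea is sketched below: for each incoming event, the recency of neighbouring events is summarized by an exponentially decayed patch. The patch radius and decay constant are arbitrary assumptions, and real implementations typically separate polarities and add further processing.

```python
import numpy as np

def time_surfaces(events, height, width, radius=3, tau=0.05):
    """For each event (x, y, t), build an exponentially decayed patch of the
    most recent event times in its spatial neighbourhood."""
    last = np.full((height, width), -np.inf)        # time of last event per pixel
    surfaces = []
    for x, y, t in events:                           # events assumed sorted by time
        x, y = int(x), int(y)
        last[y, x] = t
        y0, y1 = max(0, y - radius), min(height, y + radius + 1)
        x0, x1 = max(0, x - radius), min(width, x + radius + 1)
        patch = np.exp((last[y0:y1, x0:x1] - t) / tau)  # in (0, 1], 0 where no event yet
        surfaces.append(patch)
    return surfaces

events = [(10, 12, 0.001), (11, 12, 0.004), (40, 3, 0.010)]  # (x, y, t) samples
surfaces = time_surfaces(events, height=64, width=64)
```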

These approaches generally suffer from a need to accumulate a large number of events before inferences can be made – either by creating images and using image-based techniques, or by building up graphs or spike-trains and training an encoder that operates over the full graph or spike-train. We address this gap in Chapter 6 with a slight modification of our approach designed for point clouds.

Chapter 3

Adversarially Parameterized Optimization for 3D Human Pose Estimation

“Generative Adversarial Networks is the most interesting idea in the last ten years in machine learning, in my opinion.”

— Yann LeCun, Director, Facebook AI

Inferring 2D human poses from images can be framed as a regression problem. For deep learning approaches, some architectural decisions have to be made – prediction format (keypoints, heatmaps etc.), what loss to use, how to deal with multiple targets etc. – but beyond these, off-the-shelf CNNs perform quite well.

3D pose inference presents additional challenges and opportunities. At its simplest it can be tackled in a similar way to 2D pose inference with an additional depth dimension. However, there is significantly more ambiguity in this dimension, and this fails to take advantage of stronger priors over 3D joint distributions. For example, we can enforce a much stronger prior over limb lengths in 3D space compared to 2D pixel distances. 3D datasets are also generally significantly less varied than their 2D counterparts due to the cost of collection, so any model that performs well on such a limited 3D dataset will likely fail to generalize to different environments.

Instead of performing 3D inference directly from images, our first contribution looks at inferring 3D pose from a 2D pose inference or observation. Conceptually, we infer 3D keypoints by searching a feasible pose space for a solution which is most consistent with the observation. We perform this in two stages.


Learning feasibility. Rather than trying to enumerate conditions which make poses feasible, we use a generative adversarial network (GAN) to learn a mapping from a fixed distribution to the distribution of normalized feasible poses.

Optimizing consistency. At inference time, we optimize a sample from the fixed input distribution and denormalization parameters with respect to a consistency measure between the projected feasible pose and the 2D observation using an off-the-shelf optimizer.
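As a toy illustration of these two stages, the sketch below stands in a random linear map for the trained GAN generator and an orthographic projection for the known camera model, then runs the consistency optimization with an off-the-shelf optimizer. All names, dimensions and the generator itself are illustrative assumptions; only the structure of the search reflects the method.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_joints, n_z = 17, 8
G_weights = 0.1 * rng.standard_normal((n_z, n_joints * 3))  # stand-in for a trained generator

def generator(z):
    """Map a latent code to an (n_joints, 3) pose; a trained GAN in practice."""
    return (z @ G_weights).reshape(n_joints, 3)

def project(pose):
    """Known camera model; a simple orthographic projection onto the x-y plane here."""
    return pose[:, :2]

def reprojection_loss(z, observed_2d):
    return np.sum((project(generator(z)) - observed_2d) ** 2)

observed_2d = project(generator(rng.standard_normal(n_z)))   # synthetic 2D observation
result = minimize(reprojection_loss, x0=np.zeros(n_z),
                  args=(observed_2d,), method="L-BFGS-B")
pose_3d = generator(result.x)   # feasible pose most consistent with the observation
```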

The resulting model has the following properties:

• feasibility is invariant to normalization factors like scale, rotation about the vertical axis and uniform horizontal displacement;

• learned feasibility is independent of camera parameters, so the same learned GAN can be used with any number of different cameras (so long as intrinsic parameters are known); and

• projection calculations are done explicitly using known camera models, meaning the learned component does not need to learn or approximate this process. This results in significantly smaller/faster networks.

Statement of Contribution of Co-Authors for Thesis by Published Paper

The authors listed below have certified that:

1. they meet the criteria for authorship in that they have participated in the conception, execution, or interpretation, of at least that part of the publication in their field of expertise;

2. they take public responsibility for their part of the publication, except for the responsible author who accepts overall responsibility for the publication;

3. there are no other authors of the publication according to these criteria;

4. potential conflicts of interest have been disclosed to (a) granting bodies, (b) the editor or publisher of journals or other publications, and (c) the head of the responsible academic unit; and

5. they agree to the use of the publication in the student’s thesis and its publication on the QUT’s ePrints site consistent with any limitations set by publisher requirements.

In the case of this chapter: Adversarially Parameterized Optimization for 3D Human Pose Estimation, presented at the International Conference on 3D Vision (3DV) in Qingdao, China, 2017.

Dominic Jack (QUT Verified Signature, 26 May 2020): Experiment design, model implementation and paper drafting.
Frederic Maire: Advised on model design and paper editing.
Anders Eriksson: Advised on model design and paper editing.
Sareh Shirazi: Advised on model design and paper editing.

Principal Supervisor Confirmation

I have sighted email or other correspondence from all Co-authors confirming their certifying authorship. (If the Co-authors are not able to sign the form please forward their email or other correspondence confirming the certifying authorship to the RSC.)

Frederic Maire (QUT Verified Signature), 26 May 2020

Adversarially Parameterized Optimization for 3D Human Pose Estimation

Dominic Jack, Frederic Maire, Anders Eriksson, Sareh Shirazi
[email protected], [email protected], [email protected], [email protected]
QUT, 2 George St, Brisbane, Australia

Abstract

We propose Adversarially Parameterized Optimization, a framework for learning low-dimensional feasible parameterizations of human poses and inferring 3D poses from 2D input. We train a Generative Adversarial Network to 'imagine' feasible poses, and search this imagination space for a solution that is consistent with observations. The framework requires no scene/observation correspondences and enforces known geometric invariances without dataset augmentation. The algorithm can be configured at run time to take advantage of known values such as intrinsic/extrinsic camera parameters or target height when available without additional training. We demonstrate the framework by inferring 3D human poses from projected joint positions for both single frames and sequences. We show competitive results with extremely simple shallow network architectures and make the code publicly available¹.

¹github.com/jackd/adversarially parameterized optimization

Keywords: GAN, Generative Adversarial Network, Human Pose Estimation, Non-rigid Body Transformation, Inverse Graphics, Ill-Posed Function Inversion

1. Introduction

Recovering the pose of a person from an image or a sequence of images is an important research problem for areas including human computer interaction, computer animation and biometric analysis for sport and rehabilitation. At its core is an inverse projection problem: mapping a two dimensional representation into a three dimensional space. While advances in computing over the last several decades have resulted in systems that can render realistic 2D images based on 3D scenes at very high frame rates, the inverse problem remains difficult. Not only is it ill-posed – many 3D scenes correspond to the same 2D image – it cannot be expressed as a simple combination of geometric transformations.

For the purpose of this paper we separate the problem into two parts: an image processing problem, which involves inferring 2D image coordinates of joints; and an inverse projection problem, responsible for mapping these 2D pixel coordinates to 3D joint positions. This decoupling has several advantages. In particular:

• image processing modules can be switched in and out without the need for retraining;

• neither module is required to be trainable end-to-end; and

• the image processing system does not require 3D joint information, and the 3D inverse projection system does not require image information.

This paper's contribution focuses on the second of these problems.

The rest of this paper is laid out as follows: we formally define the problem in Section 2 before discussing previous work in the area in Section 3. We introduce our proposed method in Section 4 and the experimental setup in Section 5. Results are covered in Section 6 before we conclude and discuss future work in Section 7.

We provide a summary of notation conventions in Table 1.

2. Problem Description

Consider a parameterization of a 3D human pose p ∈ P and a mapping function Π : P → Pπ that yields its 2D projection. Of all possible poses, only a small subset Pf ⊂ P are feasible. This subset is not explicitly given and only described by a sample Pd ⊂ Pf.


Table 1. Notation summary.

    tilde (p̃)             inferred/proposed/imagined
    hat (p̂)               normalized
    π subscript (pπ)       projected
    i subscript (pi)       pertaining to the i-th joint
    (t) superscript (p(t)) pertaining to time step t
    ∗ superscript (z∗)     optimal value
    p, p̃, p̂               pose parameterization
    λπ, λf                 loss terms
    αc, αs, αs0            loss scaling hyperparameters

The task is to find an inverse projection function Π̃† : Pπ → P such that Π̃†(Π(p)) ≈ p for all p ∈ Pf.

3. Related Work

3.1. Human Pose Estimation

Approaches to human pose estimation are many and varied, each introducing its own advantages and limitations. Marker suits are used extensively in animation [26] to produce highly accurate poses in tightly controlled environments. Different/additional sensors such as RGB-D cameras [38] and multi-view rigs [27] have also been used to capture poses of naturally dressed participants, though these require more extensive setups.

Monocular vision approaches are attractive due to the ubiquity of cheap, high quality cameras. No special suits or camera calibration is required, and the internet provides an effectively infinite source of highly varied, unlabeled data.

Unfortunately, 3D labelling of this data is difficult and error prone due to the ill-posed nature of inferring 3D data from a 2D image. While datasets with a large number of frames exist [39][17], the variation between frames is generally poor. For example, the largest of these datasets – Human 3.6 million (H3M) [17] – features 3.6 million frames, but all are recorded from only a handful of camera locations and a dozen different actors, and all sequences are recorded in the same room.

Despite this, recent advances in deep learning have been used to regress 3D joint locations directly [4][42][32]. The lack of variety in datasets has also been tackled by augmenting existing datasets with realistic renderings of real poses in synthetic environments [31][37][6].

Other approaches avoid the scarcity of data and ill-posedness entirely by looking at the simpler problem of 2D pose estimation [33][28][8][3][34][15][29]. These approaches benefit from large amounts of cheap, highly varied hand-labeled data [18][1], and modern systems achieve near human-level performance in real-time [46][5].

These 2D approaches are often used as intermediate representations for 3D pose estimation. Optimization methods have been used with anthropomorphic constraints (e.g. constant/symmetric limb lengths, known body proportions) to resolve ambiguities [35]. Zhou et al. [50] use temporal and spatial information with expectation-maximization, while Martinez et al. [23] show that a very simple network can learn to 'lift' 2D data to 3D. Tome et al. [44] fuse 2D feature learning with 3D plausibility to train a single network end-to-end, and Mehta et al. [25] take a similar approach in their real-time model.

3.2. Generative Adversarial Networks

Generative Adversarial Networks (GANs) are coupled network systems that attempt to simultaneously learn the distribution from which some dataset is sampled and features that distinguish membership of that set [11]. The idea is to train a deterministic generator function to produce realistic output based on a random seed, while a critic/discriminator network is trained to distinguish between this output and real data samples.

Since their introduction, GANs have seen an explosion in popularity in areas such as image synthesis [30], super-resolution image inpainting [20], unsupervised disentanglement [7], text-to-image translation [36][48], 3D reconstruction and inverse graphics [13][45].

4. Proposed Method

We describe the proposed method in the context of inverse graphics – more specifically, human pose estimation – though it is equally valid for any ill-posed function inversion problem where the feasible solution space is poorly defined but is small with respect to the possible solution space.

We define the reconstruction loss λr of a proposed solution p̃ ∈ P given an actual pose p ∈ P as

    λr(p̃; p) = |p − p̃|    (1)

and the reprojection loss λπ given a projection pπ ∈ Pπ as

    λπ(p̃; pπ) = |pπ − Π(p̃)|².    (2)

The method is agnostic to the precise pose representation p and norm |·|. For simplicity, we use the 3D cartesian coordinates of nj joints, P = R^(nj×3), and the 2-norm.

We describe poses with zero reprojection loss as consistent and denote the set of such objects {p̃ | λπ(p̃; pπ) = 0} = Pc ⊂ P.

The perfect reconstruction must be consistent. Unfortunately, consistency does not imply accurate reconstruction. The consistent set is large and varied, but there is only one perfectly reconstructed solution.

Since our inverse projection function is only required to reconstruct feasible poses Pf, we seek solutions that are both feasible and consistent, p̃ ∈ Ps = Pf ∩ Pc.

The hope is that this set will be significantly smaller than the consistent set and contain more densely packed solutions.

This corresponds to solving the reprojection optimization problem over the feasible solution space,

    Π̃†(pπ) = argmin_{p̃ ∈ Pf} λπ(p̃; pπ).    (3)

However, the feasible set Pf is poorly defined. To resolve this, we propose learning an approximation of the feasible space, or an imagination space P̃f ≈ Pf, and optimizing over this space of imaginations. We do this by learning a generator function G : Z → P̃f and solving the optimization problem over this generator's domain,

    z∗(pπ) = argmin_{z ∈ Z} λπ(G(z); pπ).    (4)

The inverse projection is thus the generator applied to the optimal parameterization,

    Π̃†(pπ) = G(z∗(pπ)).    (5)

The primary role of the generator function G is to filter out infeasible poses. As a fortunate side effect, this often results in a significant dimensionality reduction, making the subsequent optimization problem easier.

In addition, we consider using a feasibility loss function λf : P → R to prevent the generated imaginations becoming too infeasible and minimize the modified total loss function

    λm(p̃; pπ) = λπ(p̃; pπ) + λf(p̃).    (6)

The combination of these ideas results in solving the optimization problem

    Π̃†(pπ) = G( argmin_{z ∈ Z} λm(G(z); pπ) ).    (7)

We discuss the generator and feasibility functions more in Section 4.2.

4.1. Pose Normalization

The feasibility of a 3D human pose is invariant to horizontal shifts and rotations about the vertical axis, and to a lesser extent scale (within a certain range). Learning-based methods have commonly dealt with these invariances by augmenting the training dataset with random perturbations in these values. This does not guarantee invariance however, and makes the learning problem harder. Additionally, the resulting parameterization often loses any semantic meaning associated with these invariances.

Instead, we propose using a deterministic invertible normalization function N : P → P̂ × N to normalize poses prior to training, where p̂ ∈ P̂ is the normalized pose parameterization and n ∈ N is the normalization vector.

We can combine this with a normalized generator Ĝ : Ẑ → P̂f by applying the functions in series,

    G([ẑ, ñ]) = N⁻¹(Ĝ(ẑ), ñ),    (8)

where ẑ ∈ Ẑ is the parameterization of the normalized imagination space P̂f ⊂ P̂ and ñ ∈ N is the proposed normalization vector.

To enforce the invariances listed above, we normalize by rotating all poses about the vertical axis such that the hips are aligned with the x-axis, the pelvis is directly above the origin and the height of the target is 1. The corresponding normalization vector is thus the angle between the vector joining the hips and the positive x-axis, the horizontal coordinates of the pelvis and the height of the target.

This formulation assumes the intrinsic and extrinsic camera properties are known. In situations where this is not the case, the framework is capable of inferring these properties by including them in the normalization vector.

4.2. Learning Feasibility

GANs provide us with the tools to learn both the generator and feasibility loss functions jointly by training on the normalized dataset P̂d = {p̂ | [p̂, n] = N(p), p ∈ Pd}.

For a base feasibility loss we use a linear scaling of the logits from the critic trained in conjunction with the generator, λc,

    λf = −αc λc,    (9)

where αc is a non-negative hyperparameter.

We illustrate the training and optimization processes in Figure 1.

Figure 1. (a), (b) Generator/critic functions Ĝ and Ĉ with parameters θG and θC are optimized adversarially during training using normalized poses p̂ ∈ P̂d and randomly sampled ẑ. Loss functions λG, λC and λ̃C vary across GAN implementations. (c) At run time, the modified reprojection loss is optimized with respect to the generator input ẑ and normalization parameters ñ. Loss terms (red) are optimized with respect to yellow parameters.

4.2.1 Hand-crafted Feasibility

While learned feasibility is attractive in that it requires no expert knowledge and does not introduce artificial bias, the framework does not preclude hand-crafted feasibility terms augmenting or replacing the GAN critic loss. For example, anthropomorphic constraints such as symmetric limb length consistency or bone length proportions can be encouraged by additional loss terms.

In other situations, the feasible space of solutions may be too large to learn directly. Inferring pose sequences for example involves inferring a pose at each frame. Experiments show single frame pose estimation is unstable with respect to depth estimation, so sequences of independently inferred poses tend to exhibit observable depth-shuddering. One approach would be to train a GAN to generate sequences rather than individual poses, but this significantly increases the dimensionality of the problem.

Alternatively, depth shuddering can be reduced by introducing a penalty on fast motion between frames, handled during optimization. To demonstrate the extensibility of the system, we consider a general feasibility loss that combines the logits from the critic with hand-crafted temporal smoothness terms,

    λf = −αc λc + αs λs + αs0 λs0,    (10)

where λs is a smoothing term given by

    λs = (1 / (T − 1)) Σ_{t=1}^{T−1} |p̃(t+1) − p̃(t)|²    (11)

which discourages flickering between valid poses with ambiguous projections and whole-body depth-shuddering, while λs0 is a friction term given by

    λs0 = (1 / (T − 1)) Σ_{t=1}^{T−1} min_i |p̃i(t+1) − p̃i(t)|²    (12)

to encourage at least one joint to be mostly stationary. p̃(t) is the imagined pose at the t-th frame and p̃i(t) is the i-th joint of that pose. αs and αs0 are constant non-negative hyperparameters.
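The temporal terms in Equations 11 and 12 translate directly to code. The sketch below assumes a sequence of imagined poses stored as a (T, nj, 3) array; the example weights at the end mirror the αs and αs0 values used for the sequence experiments, but are illustrative only.

```python
import numpy as np

def smoothness_loss(poses):
    """Equation 11: mean squared displacement of the whole pose between frames."""
    diffs = poses[1:] - poses[:-1]                 # (T-1, n_joints, 3)
    return np.mean(np.sum(diffs ** 2, axis=(1, 2)))

def friction_loss(poses):
    """Equation 12: for each frame pair, penalize only the joint that moves
    least, encouraging at least one near-stationary joint."""
    diffs = poses[1:] - poses[:-1]
    per_joint = np.sum(diffs ** 2, axis=2)         # (T-1, n_joints)
    return np.mean(per_joint.min(axis=1))

rng = np.random.default_rng(0)
poses = np.cumsum(0.01 * rng.standard_normal((60, 17, 3)), axis=0)  # toy sequence
total = 10.0 * smoothness_loss(poses) + 1e3 * friction_loss(poses)  # example weights
```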

4.3. Domain Independent Training

One key advantage of the proposed method is that the GAN training process and resulting generator/critic functions are independent of the inverse projection input space. Thus far we have discussed mapping 2D pixel coordinates to 3D spatial coordinates, but we can easily modify this if our means of sensing the scene changes.

For example, many approaches to 2D pose inference output a pseudo-probability distribution or heatmap for each joint [28][46][5]. By modifying our reprojection loss, we can adapt our method to use such input without the need to retrain our GAN.

For heatmaps hi(x), i = 1, 2, ..., nj, we might be tempted to use the probability-weighted average reprojection loss,

    λπ(avg) = Σ_{i=1}^{nj} Σ_{pπi} hi(pπi) |pπi − Π(p̃i)|²,    (13)

where the inner summation is over the domain of hi. However, the average of competing feasible hypotheses is itself not guaranteed to be feasible. 2D joint detectors often have difficulty distinguishing joints from their symmetric counterpart, e.g. left and right hands. In these situations the heatmaps are strongly bi-modal, but it would be a mistake to choose the midpoint of these contending hypotheses as a result. To counter this, we consider a loss which seeks to maximize the intersection of the heatmap of each joint with a Gaussian centered at the inferred 2D position,

    λπ(g) = −Σ_{i=1}^{nj} Σ_{pπi} hi(pπi) g(pπi − Π(p̃i); σ),    (14)

where g is a unit 2D Gaussian with standard deviation σ. The differences are illustrated in 1D in Figure 2.

Figure 2. Optimal reprojection values based on probability distributions (black). Optimal values with respect to Gaussian reprojection loss (blue, Equation 14) tend towards locally confident regions compared to those attained by average loss (red, Equation 13).

4.4. Optimization

While the results will vary depending on the choice of optimizer, the framework itself is agnostic to the manner in which Equation 4 is solved. Generator and critic functions will presumably be differentiable assuming standard GAN training approaches, as are standard projection functions, so gradient-based optimizers should be applicable assuming any hand-crafted losses introduced are differentiable.

On the other hand, standard gradient-based optimizers do not lend themselves to easy parallelization and are unable to represent multiple optimal solutions. The set of feasible and consistent solutions is small, but ambiguities will still arise. Population-based methods resolve both these issues trivially, though may take more iterations to converge. In the interest of simplicity we limit the scope of this paper to using a standard LBFGS gradient-based optimizer [22], though leave this open for future investigation.

4.5. Comparison with Other GAN-based Methods

Recent work has looked at using GANs for inverse graphics [45]. Their approach trains a network to map from projections to 3D poses using a loss function augmented with a GAN critic to promote feasible solutions. Our approach differs from this in that we use a feasibility measure during inference rather than just training. We also search the generator input space for solutions based on their consistency with observations rather than starting from the observations themselves.

5. Experiments

5.1. Dataset

We evaluate performance on the Human 3.6 Million dataset [17]. We trained GANs using the 24 unique joints provided in the dataset, using subjects S1, S5, S6, S7 and S8 for training and S9 and S11 for evaluation. We considered every 5th frame, equivalent to a frame rate of 10 frames per second.

We ran experiments starting from 2D ground truth poses (h3m p2). We also used inferences from the OpenPose framework [5] trained on the COCO dataset [21]. We inferred joint heatmaps based on the entire image rescaled to 128 × 128. During optimization, we considered raw heatmaps (op hm) downscaled to a 64 × 64 grid and point inferences (op p2) made by the same framework. For the heatmap case we used Gaussian reprojection with a standard deviation of 10% of the image size.

These experiments considered a variety of hyperparameter sets. To ensure the results are valid in general and not overfitting to the evaluation set, we evaluated the most successful model on the Human Eva 1 (EVA) [39] dataset. This smaller dataset features a different skeleton and camera setup. For comparison with Yasin et al. [47] we train the architecture from random initialization twice: once with training data from all actions, and a second time with training data related to walking sequences removed. Apart from this, we used the standard train/evaluation split. For consistency, we used the 14 joint skeleton used by Yasin et al.

For completeness we consider optimization both with and without temporal smoothing. Yasin et al. do not use temporal information, so values provided with temporal information are for self-comparison only.

Due to the smaller size of the dataset, we process frames at the full 60 frames per second provided.

Most optimizations are run without using temporal information. We recover the temporally independent optimization problem by setting αs and αs0 in Equation 10 to zero.

5.2. Reconstruction Evaluation

To evaluate performance of the entire system, we use known 2D/3D correspondences (pπ, p) to evaluate the reconstruction loss from Equation 1. However, small inaccuracies in depth estimation tend to drown out errors in relative joint positions. We thus report the per-joint error after the inferred pose undergoes an optimal rigid body transformation, as is common in the literature [47][45][50],

    λrt = (1 / nj) min_{T ∈ T} |T(p̃) − p|,    (15)

where T is the set of all rigid body transformations. Except where otherwise stated, transformations are calculated independently for each frame.

In all experiments we assume camera properties (intrinsic and extrinsic) are known. This decreases optimization time, but makes little difference to the errors after transformation.

For H3M dataset results we use the 17 joint skeleton used in [45] for pose inference, alignment and average reported values. EVA results are evaluated on the 14 joint skeleton of Yasin et al. [47].

5.3. GAN Architectures

We investigated a number of small architectures consisting of simple sequences of fully connected layers. We restricted networks to have only 1 or 2 hidden layers. In all cases, critic and generator networks were constructed identically except for the number of input/output nodes. No batch normalization, dropout or weight regularization was used in any networks. Rectified linear units were used as activations at the output of all layers except the last in each case.

We experimented with both weight-clipped (WC) and gradient-normalized (GN) [12] Wasserstein GAN training regimes. The critic optimization step was run 5 times per generator update step. WC models were trained using RMSProp optimization with a learning rate of 10⁻⁴, while GN models used ADAM optimization with λ = 10, α = 10⁻⁴, β1 = 0 and β2 = 0.9. All networks were trained with a batch size of 128 for 1e7 critic optimization steps and 2e6 generator optimization steps.

While both WC and GN methods gave similar results, WC models tended to do marginally better across most measures. All values reported are for the WC GANs.

All models were implemented using Tensorflow 1.2. Batches ran at 100–150 batches per second on an NVidia GTX-1070 GPU.

5.4. Optimization

All optimizations were performed using the limited-memory BFGS algorithm [22] for consistency and simplicity.

Compared to many neural-network-based pose-inference models, our independent frame optimization is a relatively slow process, taking approximately 1 second per frame on an NVidia GTX-1070. In order to better leverage the parallel capabilities of modern GPUs, we solve the independent optimization problem for each sequence using every frame jointly. While solving independent problems jointly may require more operations and introduce subtle differences due to termination criteria, these effects were found to be negligible, and the batch capabilities of the hardware resulted in a speedup of roughly a factor of 5.

6. Results

We begin by looking at reconstruction losses for a number of different GAN architectures. We tried a number of different sizes for the input space nz and number of hidden nodes nh. All feasibility weights were set to zero and ground truth 2D joint positions were used with the reprojection loss.

Parameter values and average reconstruction losses after optimal rigid transformation are given in Table 2. We see that even the smallest model performs reasonably well despite having only roughly 1,500 trainable weights. The larger networks – all with less than 30,000 trainable parameters – offer modest improvements. To put things in perspective, the standard VGG-16 architecture [40] common in image processing contains 138 million weights, so these networks are three-to-five orders of magnitude smaller.

Table 2. Average reconstruction loss using ground truth 2D poses and reprojection loss and optimal per-frame alignment. Lower is better. IDs are for models referenced in analysis.

    ID       nz     nh          λrt
    small    8      32          85.9
             8      64          90.6
             16     64          82.8
             16     128         80.8
             32     128         76.1
             64     128         78.8
             128    128         84.3
             32     64, 64      75.7
    big      32     128, 64     68.9
             32     128, 128    74.0

Selected results at a more granular level are given in Table 3. We compare against Tung et al. [45] and Yasin et al. [47] due to the similarities in approach and evaluation method. The literature is awash with models that achieve better raw numbers. Some are undoubtedly more accurate [23][43][32][25], while others use slightly different metrics, training data and/or report different metrics [4][24][49][50]. We do not claim the results presented to be state-of-the-art in this regard.

Table 3. Single-frame average reconstruction loss from optimal reprojection loss for each sequence in mm. Lower is better. GT: ground truth 2D poses. I2: inferred 2D points used. IH: inferred heatmaps used.

    Model      Direct   Discuss   Eat     Greet   Phone
    [47]GT     60.0     54.7      71.6    67.5    63.8
    [45]GT     53.7     71.5      82.3    58.6    86.9
    smallGT    64.6     75.3      80.0    80.3    81.4
    bigGT      45.4     53.1      65.3    56.6    68.2
    [45]I2     77.6     91.4      89.9    88.0    107.3
    bigI2      88.2     100.1     94.4    99.8    109.0
    bigIH      75.0     82.5      82.7    87.2    92.7

    Model      Photo    Pose      Purch.  Sit     SitD.
    [47]GT     96.9     61.9      55.7    73.9    110.8
    [45]GT     98.4     57.6      104.2   100.0   112.5
    smallGT    93.8     78.9      79.6    108.0   135.9
    bigGT      74.0     50.0      55.5    106.0   152.8
    [45]I2     110.1    75.9      107.5   124.2   137.8
    bigI2      119.5    95.0      117.6   128.2   184.4
    bigIH      108.3    85.2      88.1    116.7   166.3

    Model      Smoke    Wait      Walk    W.Dog   W.Tog
    [47]GT     78.9     67.9      47.5    89.3    53.4
    [45]GT     83.3     68.9      57.0    -       -
    smallGT    78.7     78.5      78.6    95.4    85.6
    bigGT      58.4     60.6      60.3    58.4    59.5
    [45]I2     102.2    90.3      78.6    -       -
    bigI2      99.5     96.0      102.4   105.1   103.1
    bigIH      85.4     83.6      86.1    91.5    91.6

Our smallest model is unsurprisingly worse in almost all categories compared to other methods, but still performs relatively well. The larger model performs at or better than the other models considered in most categories. The obvious exception to this is the SitD. category, and to a lesser extent the Sit category, where our models perform poorly even when using ground truth 2D joint positions. The sequences themselves feature actors sitting and lying on the floor, and are quite unlike other categories in the dataset. This is likely an example of mode collapse [10].

Similarly to Tung et al. [45], we observe a significant drop in accuracy as a result of using poses inferred from images as opposed to the ground-truth labels provided in the dataset. This is lessened when using heatmaps rather than the collapsed 2D joint positions as expected, since the optimizer is able to take uncertainty into account.

It should be noted the image-to-2D pose inference model we used [5] was trained on 'in-the-wild' images and not fine-tuned on this dataset. As such, we expect similar performance of our model on 'in-the-wild' images, and that fine-tuning this model on the H3M dataset could improve
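All of the errors above are reported after the optimal rigid alignment of Equation 15. A sketch of one common way to compute such an alignment is given below (a Kabsch/SVD fit of rotation and translation; scale is not optimized here, which may differ from the exact convention used in a given benchmark).

```python
import numpy as np

def aligned_error(pred, gt):
    """Mean per-joint error after optimally rotating and translating the
    predicted pose onto the ground truth (Kabsch alignment via SVD)."""
    pc, gc = pred - pred.mean(0), gt - gt.mean(0)
    u, _, vt = np.linalg.svd(pc.T @ gc)
    d = np.sign(np.linalg.det(u @ vt))                  # avoid reflections
    r = u @ np.diag([1.0, 1.0, d]) @ vt
    aligned = pc @ r + gt.mean(0)
    return np.mean(np.linalg.norm(aligned - gt, axis=1))

rng = np.random.default_rng(0)
gt = rng.standard_normal((17, 3))
angle = 0.3
rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                [np.sin(angle),  np.cos(angle), 0.0],
                [0.0,            0.0,           1.0]])
pred = gt @ rot.T + np.array([0.5, -0.2, 1.0]) + 0.01 * rng.standard_normal((17, 3))
print(aligned_error(pred, gt))   # small residual remains after alignment
```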

results due to improvements in the 2D joint inferences. Figure 3 shows sample output from our big model using ground truth 2D pose.

Figure 3. Sample output for H3M dataset using 24 unique joint skeleton before alignment. Thick: ground truth. Thin: inferred pose. Red/blue/black: right/left/central limbs. Left: 2D projected points from camera view point. Right: view from a different angle. Top/middle: the model generally succeeds in finding a relatively close pose. The offset is the result of failing to precisely infer depth/scale. Bottom: the model struggles with sequences involving kneeling and lying down.

To investigate the extent to which the critic feasibility loss can assist or hinder the reconstruction, we show the average result of optimizing for various values of αc in Figure 4. It shows little change to the reconstruction loss until the reprojection loss is largely drowned out.

Figure 4. For the big model using ground truth 2D poses, the effect of the critic weight makes little difference to the reconstruction loss for small values.

This is a somewhat surprising result. Due to the generator sampling regime, outputs are unlikely to be feasible for inputs far from the origin. Presumably infeasible poses exist outside the feasible zone, and without any feasibility loss there is nothing to constrain the search to a space near the origin. The fact the critic loss is largely unnecessary suggests local minima occur sufficiently close to the origin that the gradient-based optimizer is unable to escape in order to find a more distant infeasible pose which better matches the observation.

The effect of temporal smoothness is clearly evident from videos, removing almost all depth shuddering. Reconstruction losses are shown in Figure 5 and show a clear decrease until the point at which the temporal losses overpower the reprojection loss. It is interesting to note that these improvements still occurred – albeit to a lesser extent – when considering per-frame optimal reconstruction. This suggests not only is temporal smoothing removing the depth shuddering, but also helping to disambiguate poses with the same projection based on neighbouring frames.

Figure 5. Reconstruction losses for varying smoothness weights αs. Values shown had αs0 = 100αs and αc = 0. Dashed lines correspond to αs = 0. Blue lines denote values with optimal reconstruction across the entire sequence. Red lines denote values optimally reconstructed per frame.

We use the big model with a critic weight of αc = 0 for evaluation on the EVA dataset. Reconstruction losses for the aligned inferences are given in Table 4. Optimization with temporal information uses αs = 10 and αs0 = 1e3. Sample results are visualized in Figure 6.

Table 4. Average reconstruction loss for Human Eva dataset. t: uses temporal information. w: uses 'Walking' data for training.

    ID        Walking                Jog                    All
              S1      S2      S3     S1      S2      S3
    [47]w     40.1    33.1    47.5   48.6    43.6    40.0   42.1
    bigw      37.1    42.1    75.1   59.5    62.2    57.0   54.3
    bigwt     35.4    38.3    65.6   48.4    53.1    49.5   47.7
    [47]      70.5    60.4    86.9   46.5    40.4    38.8   57.3
    big       52.7    52.5    95.0   55.8    58.4    63.5   62.8
    bigt      50.9    50.5    83.2   52.0    55.8    60.0   58.6

Figure 6. Sample output for EVA dataset using 14 unique joint skeleton before alignment. Thick: ground truth. Thin: inferred pose. Red/blue/black: right/left/central limbs. Left: 2D projected points from camera view point. Right: view from a different angle. Top: an example of a good inference (the projections on left are indistinguishably aligned). Bottom: side-on views are sometimes aligned incorrectly.

Clearly these results are less competitive than those attained for the H3M dataset. Qualitatively, we observe the system often fails to correctly infer the orientation for side-on views, resulting in inferred skeletons spinning on the spot, even when temporal information was used. This may be a case of poor initialization, since GAN inputs are initialized at zero. When using temporal data, this might be reduced by the introduction of a 'rotational smoothness loss' term that penalizes fast rotations (since rotation about the major axis is only lightly penalized by the standard smoothing loss term).

Alternatively, due to the smaller skeleton and dataset compared to H3M, this network architecture may simply be larger than optimal.

Unsurprisingly, performance degrades if the walking sequence data is removed from training. We note this degradation is less than what is seen in Yasin et al. [47], bringing our model closer to parity.

The small improvement in accuracy with the addition of temporal information is consistent with experiments on the H3M dataset.

7. Conclusion

We have shown that GAN parameterization can effectively be used to learn a feasible space in which to search for solutions to ill-posed problems. We have demonstrated that tiny, primitive networks of only a couple thousand trainable parameters perform well, and only slightly larger networks can achieve comparable results to recently published approaches in most categories.

On the down side, the framework requires solving an optimization problem at each frame. While this optimization problem is relatively low-dimensional and can be solved at roughly 1 frame per second on a modern computer, this likely precludes its use in online systems. Additionally, solutions exhibit observable depth-shuddering unlike approaches which are continuous with respect to their inputs [23].

Despite this, we believe these are limitations of the implementation, rather than the framework itself. In particular, we highlight four areas for further investigation.

Firstly, we draw attention to the fact that the pose parameterization p does not necessarily need to be a vector of cartesian coordinates for each joint. All the framework requires is that there is a mapping from this representation to a reprojection loss given some observation. As a simple example, polar coordinates could be used, or a kinematic model which enforces limb length consistency over time and/or appropriate body proportions. Feasibility may be easier to learn for a different representation, and every generator presents a different surface over which to optimize at run time.

We also expect larger and more sophisticated networks that take advantage of network ideas like dropout [41], batch/layer normalization [16][2] and weight regularization [9] could perform much better at this task, and expect these ideas to be crucial to parameterizing more complex scenes like sequences and/or multiple targets.

Similarly, we have largely ignored the question of efficient optimization. The small size of the networks lends itself to optimization algorithms that leverage parallelism, and we anticipate real-time online processing should be possible with an appropriate optimizer.

Finally, we introduced hand-crafted temporal loss terms mostly to demonstrate extensibility. Greater improvements should be possible with more targeted approaches to leveraging temporal information. In particular, we suggest learning to generate pose sequences directly using long short-term memory [14] or temporal convolutional neural networks [19] rather than introducing artificial loss terms.

Acknowledgements

This research was supported by the Australian Research Council through the grant ARC FT170100072.

International Conference on Machine Learning, pages 448– 456, 2015. 8 References [17] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Hu- man3.6m: Large scale datasets and predictive methods for 3d [1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human sensing in natural environments. IEEE Transactions human pose estimation: New benchmark and state of the art on Pattern Analysis and Machine Intelligence, 36(7):1325– analysis. In IEEE Conference on Computer Vision and Pat- 1339, jul 2014. 2, 5 tern Recognition (CVPR), June 2014. 2 [18] S. Johnson and M. Everingham. Clustered pose and nonlin- [2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. ear appearance models for human pose estimation. In Pro- arXiv preprint arXiv:1607.06450, 2016. 8 ceedings of the British Conference, 2010. [3] V. Belagiannis and A. Zisserman. Recurrent human pose doi:10.5244/C.24.12. 2 estimation. In Automatic Face & (FG [19] Y. LeCun, Y. Bengio, et al. Convolutional networks for im- 2017), 2017 12th IEEE International Conference on, pages ages, speech, and time series. The handbook of brain theory 468–475. IEEE, 2017. 2 and neural networks, 3361(10):1995, 1995. 8 [4] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, [20] C. Ledig, L. Theis, F. Huszar,´ J. Caballero, A. Cunningham, and M. J. Black. Keep it smpl: Automatic estimation of 3d A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. human pose and shape from a single image. arXiv preprint Photo-realistic single image super-resolution using a gener- arXiv:1607.08128, 2016. 2, 6 ative adversarial network. arXiv preprint arXiv:1609.04802, [5] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi- 2016. 2 person 2d pose estimation using part affinity fields. In CVPR, [21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra- 2017. 2, 4, 5, 6 manan, P. Dollar,´ and C. L. Zitnick. Microsoft coco: Com- [6] W. Chen, H. Wang, Y. Li, H. Su, Z. Wang, C. Tu, D. Lischin- mon objects in context. In European conference on computer ski, D. Cohen-Or, and B. Chen. Synthesizing training images vision, pages 740–755. Springer, 2014. 5 for boosting human 3d pose estimation. In 3D Vision (3DV), [22] D. C. Liu and J. Nocedal. On the limited memory bfgs 2016 Fourth International Conference on, pages 479–488. method for large scale optimization. Mathematical program- IEEE, 2016. 2 ming, 45(1):503–528, 1989. 5, 6 [7] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning [23] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple by information maximizing generative adversarial nets. In yet effective baseline for 3d human pose estimation. arXiv Advances in Neural Information Processing Systems, pages preprint arXiv:1705.03098, 2017. 2, 6, 8 2172–2180, 2016. 2 [24] D. Mehta, H. Rhodin, D. Casas, O. Sotnychenko, W. Xu, [8] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and and C. Theobalt. Monocular 3d human pose estimation us- X. Wang. Multi-context attention for human pose estima- ing transfer learning and improved cnn supervision. arXiv tion. arXiv preprint arXiv:1702.07432, 2017. 2 preprint arXiv:1611.09813, 2016. 6 [9] H. B. Demuth, M. H. Beale, O. De Jess, and M. T. Hagan. [25] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, Neural network design. Martin Hagan, 2014. 8 M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt. [10] I. Goodfellow. 
Nips 2016 tutorial: Generative adversarial Vnect: Real-time 3d human pose estimation with a single networks. arXiv preprint arXiv:1701.00160, 2016. 6 rgb camera. arXiv preprint arXiv:1705.01583, 2017. 2, 6 [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, [26] A. Menache. Understanding for computer D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Gen- animation and video games. Morgan kaufmann, 2000. 2 erative adversarial nets. In Advances in neural information [27] T. B. Moeslund, A. Hilton, and V. Kruger.¨ A survey of ad- processing systems, pages 2672–2680, 2014. 2 vances in vision-based human motion capture and analysis. [12] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and Computer vision and image understanding, 104(2):90–126, A. Courville. Improved training of wasserstein gans. arXiv 2006. 2 preprint arXiv:1704.00028, 2017. 5 [28] A. Newell, K. Yang, and J. Deng. Stacked Hourglass Net- [13] J. Gwak, C. B. Choy, A. Garg, M. Chandraker, and works for Human Pose Estimation, pages 483–499. Springer S. Savarese. Weakly supervised generative adversar- International Publishing, Cham, 2016. 2, 4 ial networks for 3d reconstruction. arXiv preprint [29] G. Ning, Z. Zhang, and Z. He. Knowledge-guided deep arXiv:1705.10904, 2017. 2 fractal neural networks for human pose estimation. arXiv [14] S. Hochreiter and J. Schmidhuber. Long short-term memory. preprint arXiv:1705.02407, 2017. 2 Neural computation, 9(8):1735–1780, 1997. 8 [30] A. Odena, C. Olah, and J. Shlens. Conditional image [15] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and synthesis with auxiliary classifier gans. arXiv preprint B. Schiele. Deepercut: A deeper, stronger, and faster multi- arXiv:1610.09585, 2016. 2 person pose estimation model. In European Conference on [31] D. Park and D. Ramanan. Articulated pose estimation with Computer Vision (ECCV), May 2016. 2 tiny synthetic videos. In Proceedings of the IEEE Confer- [16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating ence on Computer Vision and Pattern Recognition Work- deep network training by reducing internal covariate shift. In shops, pages 58–66, 2015. 2 39

[32] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. [48] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and Coarse-to-fine volumetric prediction for single-image 3d hu- D. Metaxas. Stackgan: Text to photo-realistic image syn- man pose. arXiv preprint arXiv:1611.07828, 2016. 2, 6 thesis with stacked generative adversarial networks. arXiv [33] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele. Pose- preprint arXiv:1612.03242, 2016. 2 let conditioned pictorial structures. In Proceedings of the [49] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and IEEE Conference on Computer Vision and Pattern Recogni- K. Daniilidis. Sparseness meets deepness: 3d human pose tion, pages 588–595, 2013. 2 estimation from monocular video. In Proceedings of the [34] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. An- IEEE Conference on Computer Vision and Pattern Recog- driluka, P. Gehler, and B. Schiele. Deepcut: Joint subset nition, pages 4966–4975, 2016. 6 partition and labeling for multi person pose estimation. In [50] X. Zhou, M. Zhu, G. Pavlakos, S. Leonardos, K. G. Derpa- IEEE Conference on Computer Vision and Pattern Recogni- nis, and K. Daniilidis. Monocap: Monocular human motion tion (CVPR), June 2016. 2 capture using a cnn coupled with a geometric prior. arXiv [35] V. Ramakrishna, T. Kanade, and Y. Sheikh. Reconstructing preprint arXiv:1701.02354, 2017. 2, 5, 6 3d human pose from 2d image landmarks. Computer Vision– ECCV 2012, pages 573–586, 2012. 2 [36] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016. 2 [37] G. Rogez and C. Schmid. Mocap-guided data augmentation for 3d pose estimation in the wild. In Advances in Neural Information Processing Systems, pages 3108–3116, 2016. 2 [38] J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore, P. Kohli, A. Criminisi, A. Kipman, et al. Efficient human pose estimation from single depth im- ages. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2821–2840, 2013. 2 [39] L. Sigal, A. O. Balan, and M. J. Black. Humaneva: Syn- chronized video and motion capture dataset and baseline al- gorithm for evaluation of articulated human motion. Inter- national journal of computer vision, 87(1):4–27, 2010. 2, 5 [40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 6 [41] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neu- ral networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014. 8 [42] B. Tekin, I. Katircioglu, M. Salzmann, V. Lepetit, and P. Fua. Structured prediction of 3d human pose with deep neural net- works. arXiv preprint arXiv:1605.05180, 2016. 2 [43] B. Tekin, P. Marquez-Neila,´ M. Salzmann, and P. Fua. Learn- ing to fuse 2d and 3d image cues for monocular body pose estimation. arXiv preprint arXiv:1611.05708v3, 2017. 6 [44] D. Tome, C. Russell, and L. Agapito. Lifting from the deep: Convolutional 3d pose estimation from a single image. arXiv preprint arXiv:1701.00295, 2017. 2 [45] H.-Y. F. Tung, A. Harley, W. Seto, and K. Fragkiadaki. Ad- versarial inversion: Inverse graphics with adversarial priors. arXiv preprint arXiv:1705.11166, 2017. 2, 5, 6 [46] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Con- volutional pose machines. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4724–4732, 2016. 2, 4 [47] H. Yasin, U. Iqbal, B. Kruger, A. Weber, and J. Gall. A dual-source approach for 3d pose estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4948–4956, 2016. 5, 6, 8

Chapter 4

Learning Free-Form Deformations for 3D Object Reconstruction

“Where there is matter, there is geometry.” — Johannes Kepler

Our second contribution looks at a very different 3D problem: single-view object reconstruction. Unlike our human pose methods, keypoints are unsuitable for detailed object reconstruction due to variations in topology and in the presence or absence of certain features. Even objects of the same class do not necessarily share the same features. For example, while most planes have wings, bi-planes have two sets. Some planes have two propeller engines, while others have four jet engines. Military jets in our dataset often have missiles and other weapons, and some models have undercarriages down, while others are missing them entirely.

Instead of keypoints, we train our model based on point cloud inferences. Each inferred point cloud is generated by deforming a cloud sampled from the surface of a template mesh. To cater for differing geometries, we infer multiple clouds – each from deforming a different template mesh from a fixed set – along with a ranking of these inferences. We investigate various different losses that combine the multiple point cloud inferences with this ranking and show varied success at learning both tasks simultaneously.

Rather than deform the point clouds point-by-point, we decompose each point in the cloud into a linear combination of a fixed number of basis vectors and deform those basis vectors. In this way, the learned component of the network outputs a fixed size deformation which can be applied to point clouds of varying sizes. While all training was performed with the same number of points, at inference time we can increase or decrease the number of points depending on computational budget.
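A minimal free-form deformation in the style of a control-point lattice illustrates this decomposition: each point is a fixed combination of control points via trivariate Bernstein polynomials, so moving the control points (the fixed-size quantity a network could predict) moves every point, however dense the sampling. The 4×4×4 lattice size below is an assumption for illustration, not necessarily the configuration used in the chapter.

```python
import numpy as np
from math import comb

def bernstein(n, t):
    """Values of the Bernstein basis B_{i,n}(t) for i = 0..n and t in [0, 1]."""
    i = np.arange(n + 1)
    coeff = np.array([comb(n, k) for k in i], dtype=float)
    return coeff * t[:, None] ** i * (1.0 - t[:, None]) ** (n - i)

def ffd(points, control):
    """Deform points (given in [0,1]^3 lattice coordinates) using a lattice of
    control points with shape (l+1, m+1, n+1, 3)."""
    l, m, n = (s - 1 for s in control.shape[:3])
    bu = bernstein(l, points[:, 0])
    bv = bernstein(m, points[:, 1])
    bw = bernstein(n, points[:, 2])
    basis = np.einsum('pi,pj,pk->pijk', bu, bv, bw)    # fixed per-point coefficients
    return np.einsum('pijk,ijkd->pd', basis, control)  # moving control points moves every point

# A regular lattice of control points reproduces the identity mapping.
grid = np.stack(np.meshgrid(*(np.linspace(0.0, 1.0, 4),) * 3, indexing='ij'), axis=-1)
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, (2048, 3))
assert np.allclose(ffd(points, grid), points)
# A network would predict the (fixed-size) control point offsets.
deformed = ffd(points, grid + 0.05 * rng.standard_normal(grid.shape))
```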

Additionally, we use the connectivity information of the undeformed mesh to infer deformed meshes rather than point clouds, and transfer semantic information from the input template to the deformed mesh.

This was our first foray into the problem of single-view object reconstruction. While it did not seek to take advantage of computer graphics techniques as other contributions in this thesis did, the process of tackling the problem proved invaluable in terms of identifying strengths and weaknesses of point clouds and mesh representations and influencing the direction of the rest of the thesis.


Statement of Contribution of Co-Authors for Thesis by Published Paper

The authors listed below have certified that:

1. they meet the criteria for authorship in that they have participated in the conception, execution, or interpretation, of at least that part of the publication in their field of expertise;

2. they take public responsibility for their part of the publication, except for the responsible author who accepts overall responsibility for the publication;

3. there are no other authors of the publication according to these criteria;

4. potential conflicts of interest have been disclosed to (a) granting bodies, (b) the editor or publisher of journals or other publications, and (c) the head of the responsible academic unit; and

5. they agree to the use of the publication in the student’s thesis and its publication on the QUT’s ePrints site consistent with any limitations set by publisher requirements.

In the case of this chapter: Learning Free-Form Deformations for 3D Object Reconstruction, presented at the Asian Con- ference on Computer Vision (ACCV), Perth, Australia, 2018.

Dominic Jack (QUT Verified Signature, 26 May 2020): Experiment design, learned model implementation and experiments.
Jhony K. Pontes: Wrote most of introduction and background, FFD implementation.
Frederic Maire: Advised on model design and paper editing.
Anders Eriksson: Advised on model design and paper editing.
Sareh Shirazi: Advised on model design and paper editing.
Clinton Fookes: Liaised with Jhony K. Pontes.
Sridha Sridharan: Liaised with Jhony K. Pontes.

Principal Supervisor Confirmation

I have sighted email or other correspondence from all Co-authors confirming their certifying authorship. (If the Co-authors are not able to sign the form please forward their email or other correspondence confirming the certifying authorship to the RSC.)

Frederic Maire (QUT Verified Signature), 26 May 2020

Learning Free-Form Deformations for 3D Object Reconstruction ⋆

Dominic Jack, Jhony K. Pontes, Sridha Sridharan, Clinton Fookes, Sareh Shirazi, Frederic Maire, and Anders Eriksson

Queensland University of Technology, Brisbane QLD 4000, Australia
{d1.jack, s.sridharan, c.fookes, s.shirazi, f.maire, anders.ariksson}@qut.edu.au, [email protected]

Abstract. Representing 3D shape in deep learning frameworks in an accurate, efficient and compact manner still remains an open challenge. Most existing work addresses this issue by employing voxel-based representations. While these approaches benefit greatly from advances in computer vision by generalizing 2D convolutions to the 3D setting, they also have several considerable drawbacks. The computational complexity of voxel-encodings grows cubically with the resolution, thus limiting such representations to low-resolution 3D reconstruction. In an attempt to solve this problem, point cloud representations have been proposed. Although point clouds are more efficient than voxel representations as they only cover surfaces rather than volumes, they do not encode detailed geometric information about relationships between points. In this paper we propose a method to learn free-form deformations (Ffd) for the task of 3D reconstruction from a single image. By learning to deform points sampled from a high-quality mesh, our trained model can be used to produce arbitrarily dense point clouds or meshes with fine-grained geometry. We evaluate our proposed framework on synthetic data and achieve state-of-the-art results on surface and volumetric metrics. We make our implementation publicly available¹.

1 Introduction

Imagine one wants to interact with objects from the real world, say a chair, but in an augmented reality (AR) environment. The 3D reconstruction from the seen images should appear as realistic as possible so that one may not even perceive the chair as being virtual. The future of highly immersive AR and virtual reality (VR) applications depends heavily on the representation and reconstruction of high-quality 3D models. This is obviously challenging and the computer vision and graphics communities have been working hard on such problems [1,2,3].

⋆ This research was supported by the Australian Research Council through the grant ARC FT170100072. Computational resources used in this work were provided by the HPC and Research Support Group, QUT. 1 TensorFlow implementation available at github.com/jackd/template_ffd.


The impact that recent developments in deep learning approaches have had on computer vision has been immense. In the 2D domain, convolutional neural networks (CNNs) have achieved state-of-the-art results in a wide range of applications [4,5,6]. Motivated by this, researchers have been applying the same techniques to represent and reconstruct 3D data. Most of them rely on volumetric shape representation so one can perform 3D convolutions on the structured voxel grid [7,8,9,10,11,12]. A drawback is that convolutions on the 3D space are computationally expensive and grow cubically with resolution, thus typically limiting the 3D reconstruction to exceedingly coarse representations.

A recent shape representation that has been investigated to make the learning more efficient is point clouds [13,14,15,16]. However, such representations still lack the ability to describe finely detailed structures. Applying surfaces, texture and lighting to unstructured point clouds is also challenging, especially in the case of noisy, incomplete and sparse data.

The most extensively used shape representation in computer graphics is that of polygon meshes, in particular using triangular faces. This parameterisation has largely been unexplored in the machine learning domain for the 3D reconstruction task. This is in part a consequence of most machine learning algorithms requiring regular representations of input and output data such as voxels and point clouds. Meshes are highly unstructured and their topological structure usually differs from one to another, which makes their 3D reconstruction from 2D images using neural networks challenging.

In this paper, we tackle this problem by exploring the well-known free-form deformation (Ffd) technique [17] widely used for 3D mesh modelling. Ffd allows us to deform any 3D mesh by repositioning a few predefined control points while keeping its topological aspects. We propose an approach to perform 3D mesh reconstruction from single images by simultaneously learning to select and deform template meshes. Our method uses a lightweight CNN to infer the low-dimensional Ffd parameters for multiple templates, and it learns to apply large deformations to topologically different templates to produce inferred meshes with similar surfaces. We extensively demonstrate that relatively small CNNs can learn these deformations well, and achieve compelling mesh reconstructions. An overview of the proposed method is illustrated in Figure 1.

Our contributions are summarized as follows:

• We propose a novel learning framework to reconstruct continuous 3D meshes from single images;
• we quantitatively and qualitatively demonstrate that relatively small neural networks require minimal adaptation to learn to simultaneously select appropriate models from a number of templates and deform these templates to perform 3D mesh reconstruction; and
• we extensively investigate simple changes to training and loss functions to promote variation in template selection.


Fig. 1: Given a single image, our method uses a CNN to infer Ffd parameters ∆P (red arrows) for multiple templates T (middle meshes). The ∆P parameters are then used to deform the template vertices to infer a 3D mesh for each template (right meshes). Trained only with surface-sampled point-clouds, the model learns to apply large deformations to topologically different templates to produce inferred meshes with similar surfaces. Likelihood weightings γ are also inferred by the network but not shown. FC stands for fully connected layer.

2 Related Work

Interest in analysing 3D models has increased tremendously in recent years. This development has been driven in part by a rapid growth of the amount of readily available 3D data, the astounding progress made in the field of machine learning, as well as a substantial rise in the number of potential application areas, i.e. Virtual and Augmented Reality.

To address 3D vision problems with deep learning techniques a good shape representation should be found. Volumetric representation has been the most widely used for 3D learning [18,19,20,7,21,22,8,9,12,10,11]. Convolutions, pooling, and other techniques that have been successfully applied to the 2D domain can be naturally applied to the 3D case for the learning process. Volumetric autoencoders [23,21] and generative adversarial networks (GANs) have been introduced [24,25,26] to learn probabilistic latent spaces of 3D objects for object completion, classification and 3D reconstruction. Volumetric representation however grows cubically in terms of memory and computational complexity as the voxel grid resolution increases, thus limiting it to low-quality 3D reconstructions.

To overcome these limitations, octree-based neural networks have been presented [27,28,29,30], where the volumetric grid is split recursively by dividing it into octants. Octrees reduce the computational complexity of the 3D convolution since the computations are focused only on regions where most of the object's geometry information is located. They allow for higher resolution 3D reconstructions and more efficient training; however, the outputs still lack fine-scaled geometry. A more efficient 3D representation using point clouds was


recently proposed to address some of these drawbacks [13,14,15,16]. In [13] a generative neural network was presented to directly output a set of unordered 3D points that can be used for the 3D reconstruction from single image and shape completion tasks. To date, such architectures have only been demonstrated for the generation of relatively low-resolution outputs, and scaling these networks to higher resolutions is yet to be explored.

3D shapes can be efficiently represented by polygon meshes, which encode both geometrical (point cloud) and topological (surface connectivity) information. However, it is difficult to parametrize meshes to be used within learning frameworks [31]. A deep residual network to generate 3D meshes has been proposed in [32]. A limitation however is that they adopted the geometry image representation for generative modelling of 3D surfaces, so it can only manage simple (i.e. genus-0) and low-quality surfaces. In [2], the authors reconstruct 3D mesh objects from single images by jointly analysing a collection of images of different objects along with a smaller collection of existing 3D models. While the method yields impressive results, it suffers from scalability issues and is sensitive to semantic segmentation of the image and dense correspondences.

Ffd has also been explored for 3D mesh representation, where one can intrinsically represent an object by a set of polynomial basis functions and a small number of coefficients known as control points used for cage-like deformation. A 3D mesh editing tool proposed in [33] uses a volumetric generative network to infer per-voxel deformation flows using Ffd. Their method takes a volumetric representation of a 3D mesh as input and a high-level deformation intention label (e.g. sporty car, fighter jet, etc.) to learn the Ffd displacements to be applied to the original mesh. In [34,35] a method for 3D mesh reconstruction from a single image was proposed based on a low-dimensional parametrization using Ffd and sparse linear combinations given the image silhouette and class-specific landmarks. Recently, DeformNet was proposed in [36], where they employed Ffd as a differentiable layer in their 3D reconstruction framework. The method builds upon two networks, one 2D CNN for 3D shape retrieval and one 3D CNN to infer Ffd parameters to deform the 3D point cloud of the shape retrieved. In contrast, our proposed method reconstructs 3D meshes using a single lightweight CNN with no 3D convolutions involved to infer a 3D mesh template and its deformation flow in one shot.

3 Problem Statement

We focus on the problem of inferring a 3D mesh from a single image. We represent a 3D mesh c by a list of vertex coordinates V ∈ R^{n_v × 3} and a set of triangular faces F ∈ Z^{n_f × 3}, 0 ≤ F_{ij} < n_v, defined such that f_i = [p, q, r] indicates there is a face connecting the vertices v_p, v_q and v_r, i.e. c = {V, F}.

Given a query image, the task is to infer a 3D mesh c̃ which is close by some measure to the actual mesh c of the object in the image. We employ the Ffd technique to deform the 3D mesh to best fit the image.

FFD for 3D Reconstruction 5

3.1 Comparing 3D Meshes

There are a number of metrics which can be used to compare 3D meshes. We consider three: Chamfer distance and earth mover distance between point clouds, and the intersection-over-union (IoU) of their voxelized representations.

Chamfer distance. The Chamfer distance λ_c between two point clouds A and B is defined as

    \lambda_c(A, B) = \sum_{a \in A} \min_{b \in B} \|a - b\|^2 + \sum_{b \in B} \min_{a \in A} \|b - a\|^2 .    (1)

Earth mover distance. The earth mover distance [37] λ_em between two point clouds of the same size is the sum of distances between a point in one cloud and a corresponding partner in the other, minimized over all possible 1-to-1 correspondences. More formally,

    \lambda_{em}(A, B) = \min_{\phi: A \to B} \sum_{a \in A} \|a - \phi(a)\| ,    (2)

where φ is a bijective mapping.

Point cloud metrics evaluated on vertices of 3D meshes can give misleading results, since large planar regions will have very few vertices, and hence contribute little. Instead, we evaluate these metrics using a point cloud sampled uniformly from the surface of each 3D mesh.

Intersection over union. As the name suggests, the intersection-over-union (IoU) of volumetric representations is defined by the ratio of the volume of the intersection over that of the union,

    \mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} .    (3)
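To make these comparisons concrete, the following is a minimal NumPy/SciPy sketch of the three metrics above. The function names and the brute-force pairwise-distance formulation are illustrative only, not the implementation used in our experiments.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def chamfer_distance(a, b):
    # a: (Na, 3), b: (Nb, 3) surface-sampled point clouds.
    # Sum of squared nearest-neighbour distances in both directions, Equation (1).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # (Na, Nb)
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

def earth_mover_distance(a, b):
    # Equal-sized clouds: minimum-cost 1-to-1 matching, Equation (2).
    d = np.sqrt(np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1))
    rows, cols = linear_sum_assignment(d)
    return d[rows, cols].sum()

def voxel_iou(a, b):
    # a, b: boolean occupancy grids of identical shape, Equation (3).
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()
```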

3.2 Deforming 3D Meshes

We deform a 3D object by freely manipulating some control points using the Ffd technique. Ffd creates a grid of control points whose axes are defined by the orthogonal vectors s, t and u [17]. The control points are then defined by l, m and n, which divide the grid into l + 1, m + 1, n + 1 planes in the s, t, u directions, respectively. A local coordinate for each object vertex is then imposed.

In this work, we deform an object through a trivariate Bernstein tensor – a weighted sum of the control points – as in Sederberg and Parry [17]. The deformed position of any arbitrary point is given by

    s(s, t, u) = \sum_{i=0}^{l} \sum_{j=0}^{m} \sum_{k=0}^{n} B_{il}(s) B_{jm}(t) B_{kn}(u) \, p_{ijk} ,    (4)

where s contains the coordinates of the displaced point, B_{·N}(x) is the Bernstein polynomial of degree N which sets the influence of each control point on every vertex, and p_{ijk} is the (i, j, k)-th control point. This can be expressed in matrix form as


    S = BP ,    (5)

where the rows of P ∈ R^{M × 3} and S ∈ R^{N × 3} are the coordinates of the M control points and N displaced points respectively, and B ∈ R^{N × M} is the deformation matrix.
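As a concrete illustration of Equations (4) and (5), the sketch below builds the Bernstein deformation matrix B with NumPy. It assumes the points have already been expressed in local lattice coordinates (s, t, u) ∈ [0, 1]³, and the function names are ours rather than those of the released implementation.

```python
import numpy as np
from scipy.special import comb

def bernstein_basis(degree, x):
    # x: (N,) values in [0, 1]; returns (N, degree + 1) Bernstein polynomial values.
    i = np.arange(degree + 1)
    return comb(degree, i) * x[:, None] ** i * (1.0 - x[:, None]) ** (degree - i)

def deformation_matrix(stu, l=3, m=3, n=3):
    # stu: (N, 3) local lattice coordinates of the decomposed points.
    bs = bernstein_basis(l, stu[:, 0])
    bt = bernstein_basis(m, stu[:, 1])
    bu = bernstein_basis(n, stu[:, 2])
    # Outer product over the three axes, flattened to (N, (l+1)(m+1)(n+1)) columns.
    B = np.einsum('ni,nj,nk->nijk', bs, bt, bu)
    return B.reshape(stu.shape[0], -1)

# With l = m = n = 3 there are 64 control points, so a learned offset dP has
# 64 x 3 = 192 values and the deformed points are S = B @ (P + dP), as in Equation (5).
```

Because B depends only on the template geometry, it can be precomputed once per template, so applying a learned deformation reduces to a single matrix multiplication at training and inference time.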

4 Learning Free-Form Deformations

Our method involves applying deformations encoded with a parameter ∆P^(t) to T different template models c^(t), with 0 ≤ t < T.

    t^* = \arg\max_t \gamma^{(qt)} ,    (9)

    \tilde{c}^{(q)} = \{\tilde{S}^{(qt^*)}, F^{(t^*)}\} .    (10)

Previous work using learned Ffd [36] infers deformations based on learned embeddings of voxelized templates. By learning distinct sub-networks for each template we forgo the need to learn this embedding explicitly, and our sub-networks are able to attain a much more intimate knowledge of their associated template without losing information via voxelization.

Key advantages of the architecture are as follows:
• no 3D convolutions are involved, meaning the network scales well with increased resolution;
• no discretization occurs, allowing higher precision than voxel-based methods;


• the output ∆P̃ can be used to generate an arbitrarily dense point cloud – not necessarily the same density as that used during training; and
• a mesh can be inferred by applying the deformation to the Bernstein decomposition of the vertices while maintaining the same face connections.

Drawbacks include:
• the network size scales linearly with the number of templates considered; and
• there is at this time no mechanism to explicitly encourage topological or semantic similarity.

4.1 Diversity in Model Selection

Preliminary experiments showed training using standard optimizers with an identity weighting function f resulted in a small number of templates being selected frequently. This is at least partially due to a positive feedback loop caused by the interaction between the weighting sub-network and the deformation sub-networks. If a particular template deformation sub-network performs particularly well initially, the weighting sub-network learns to assign increased weight to this template. This in turn affects the gradients which flow through the deformation sub-network, resulting in faster learning, improved performance and hence higher weight in subsequent batches. We experimented with a number of network modifications to reduce this.

Non-linear Weighting. One problem with the identity weighting scheme (f(γ) = γ) is that there is no penalty for over-confidence. A well-trained network with a slight preference for one template over all the rest will be inclined to put all weight into that template. By using an f with positive curvature, we discourage the network from making overly confident inferences. We experimented with an entropy-inspired weighting f(γ) = −log(1 − γ).

Explicit Entropy Penalty. Another approach is to penalize the lack of diversity directly by introducing an explicit entropy loss term,

    \lambda_e = \sum_t \bar{\gamma}^{(t)} \log \bar{\gamma}^{(t)} ,    (11)

where γ̄^(t) is the weight value of template t averaged over the batch. This encourages an even distribution over the batch but still allows confident estimates for the individual inferences. For these experiments, the network was trained with a linear combination of the weighted Chamfer loss λ_0 and the entropy penalty,

    \lambda_e' = \lambda_0 + \kappa_e \lambda_e .    (12)

While a large entropy error term encourages all templates to be assigned weight and hence all subnetworks to learn, it also forces all subnetworks to try and learn all possible deformations. This works against the idea of specialization, where each subnetwork should learn to deform its template to match query models close to that template. To alleviate this, we anneal the entropy over time


    \kappa_e = e^{-b/b_0} \kappa_{e0} ,    (13)

where κ_{e0} is the initial weighting, b is the batch index and b_0 is some scaling factor.

Deformation Regularization. In order to encourage the network to select a template requiring minimal deformation, we introduce a deformation regularization term,

    \lambda_r = \sum_{q,t} \gamma^{(qt)} \, |\Delta \tilde{P}^{(qt)}|^2 ,    (14)

where |·|² is the squared 2-norm of the vectorized input.

Large regularization encourages a network to select the closest matching template, though it punishes subnetworks for deforming their template a lot, even if the result is a better match to the query mesh. We combine this regularization term with the standard loss in a similar way to the entropy loss term,

    \lambda_r' = \lambda_0 + \kappa_r \lambda_r ,    (15)

where κ_r is an exponentially annealed weighting with initial value κ_{r0}.
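The sketch below summarises how the auxiliary terms of Equations (11)–(15) can be combined with the weighted Chamfer loss under exponential annealing. It is written against TensorFlow for illustration, with hypothetical tensor shapes, and is not the released training code.

```python
import tensorflow as tf

def regularized_loss(weighted_chamfer, gamma, delta_p, batch_index,
                     kappa_e0=100.0, kappa_r0=1.0, b0=10_000.0):
    # gamma: (batch, T) template weightings; delta_p: (batch, T, 192) deformations.
    anneal = tf.exp(-tf.cast(batch_index, tf.float32) / b0)                   # Equation (13)
    gamma_bar = tf.reduce_mean(gamma, axis=0)                                 # batch-averaged weights
    entropy = tf.reduce_sum(gamma_bar * tf.math.log(gamma_bar + 1e-8))        # Equation (11)
    deform_reg = tf.reduce_sum(gamma * tf.reduce_sum(delta_p ** 2, axis=-1))  # Equation (14)
    return (weighted_chamfer
            + anneal * kappa_e0 * entropy       # Equation (12)
            + anneal * kappa_r0 * deform_reg)   # Equation (15)
```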

4.2 Deformed Mesh Inference

For the algorithm to result in high-quality 3D reconstructions, it is important that the vertex density of each template mesh is approximately equivalent to (or higher than) the point cloud density used during training. To ensure this is the case, we subdivide edges in the template mesh such that no edge length is greater than some threshold ϵ_e. Example cases where this is particularly important are illustrated in Figure 2.


Fig. 2: Two examples of poor mesh model output (chair and table) as a result of low vertex density. (a) Original low vertex-density mesh. (b) Original mesh deformed according to inferred Ffd. (c) Subdivided mesh. (d) Subdivided mesh deformed according to same Ffd. (e) Ground truth.


4.3 Implementation Details

We employed a MobileNet architecture that uses depthwise separable convolutions to build lightweight deep neural networks for mobile and embedded vision applications [38], without the final fully connected layers and with width α = 0.25. Weights were initialized from the convolutional layers of a network trained on the 192 × 192 ImageNet dataset [39]. To reduce dimensionality, we add a single 1 × 1 convolution after the final MobileNet convolution layer. After flattening the result, we have one shared fully connected layer with 512 nodes followed by a fully connected layer for each template. A summary of layers and output dimensions is given in Table 1.

Layer               Output size
Input image         192 × 256 × 3
MobileNet CNN       6 × 8 × 256
1 × 1 convolution   6 × 8 × 64
Flattened           3,072
Shared FC           512
Template FC (t)     192 + 1

Table 1: Output size of network layers. Each template fully connected (FC) layer output is interpreted as 3 × 4³ = 192 values for ∆P̃^(qt) and a single γ^(qt) value.

We used a subset of the ShapeNet Core dataset [40] over a number of categories, using an 80/20 train/evaluation split. All experiments were conducted using 4 control points in each dimension (l = m = n = 3) for the free-form parametrizations. To balance computational cost with loss accuracy, we initially sampled all model surfaces with 16,384 points for both labels and free-form decomposition. At each step of training, we sub-sampled a different 1,024 points for use in the Chamfer loss.

All input images were 192 × 256 × 3 and were the result of rendering each textured mesh from the same view, 30° above the horizontal, 45° away from front-on, and well-lit by a light above and on each side of the model. We trained a different network with 30 templates for each category. Templates were selected manually to ensure good variety. Models were trained using a standard Adam optimizer with learning rate 10⁻³, β₁ = 0.9, β₂ = 0.999 and ϵ = 10⁻⁸. Mini-batches of 32 were used, and training was run for 100,000 steps. Exponential annealing used b₀ = 10,000. For each training regime, a different model was trained for each category.

Hyper-parameters for specific training regimes are given in Table 2. To produce meshes and subsequent voxelizations and IoU scores, template meshes had edges subdivided to a maximum length of ϵ_e = 0.02. We voxelize on a 32³ grid

Training Regime   ID       ϵ_γ     f(γ)           κ_e0   κ_r0
base              base     0.1     γ              0      0
log-weighted      log-w.   0.001   −log(1 − γ)    0      0
entropy           ent.     0.1     γ              100    0
regularized       reg.     0.1     γ              0      1

Table 2: Hyper-parameters for the primary training regimes.


by initially assigning any voxel containing part of the mesh as occupied, and subsequently filling in any unoccupied voxels with no free path to the outside.
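For clarity, a minimal Keras-style sketch of the inference head described in Table 1 follows. The layer arrangement, the softmax normalization of the template weightings and other details are illustrative assumptions rather than a transcription of the released model.

```python
import tensorflow as tf

def template_ffd_head(flat_features, num_templates=30, num_control_points=64):
    # flat_features: (batch, 3072) flattened MobileNet features (see Table 1).
    shared = tf.keras.layers.Dense(512, activation='relu')(flat_features)
    delta_ps, gamma_logits = [], []
    for _ in range(num_templates):
        # One fully connected sub-network per template: 192 deformation values + 1 weighting.
        out = tf.keras.layers.Dense(3 * num_control_points + 1)(shared)
        delta_ps.append(tf.reshape(out[:, :-1], (-1, num_control_points, 3)))
        gamma_logits.append(out[:, -1])
    gammas = tf.nn.softmax(tf.stack(gamma_logits, axis=-1), axis=-1)   # template weightings
    delta_p = tf.stack(delta_ps, axis=1)                               # (batch, T, 64, 3)
    return delta_p, gammas
```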

5 Experimental Results

Qualitatively, we observe the network's preference for applying relatively large deformations to geometrically simple templates, and it does not shy away from merging separate features of template models. For example, models frequently select the bi-plane template and merge the wings together to match single-wing aircraft, or warp standard 4-legged chairs and tables into structurally different objects as shown in Figure 3.

5.1 Quantitative Comparison

For point cloud comparison, we compare against the works of Kuryenkov et al. [36] (DN) and Fan et al. [13] (PSGN) for 5 categories. We use the results for the improved PSGN model reported in [36]. We use the same scaling as in these papers, finding transformation parameters that transform the ground-truth meshes to a minimal bounding hemisphere z ≥ 0 of radius 3.2 and applying this transformation to the inferred clouds. We also compare IoU values with PSGN [13] on an additional 8 categories for voxelized inputs on a 32³ grid. Results for all 13 categories with each different training regime are given in Table 3.

Category     base          log-w.        ent.          reg.          DN           PSGN
plane        31/306/292    31/310/297    31/300/289    33/304/307    100/560/-    140/115/399
bench        42/280/431    39/280/425    40/284/418    45/275/445    100/550/-    210/980/450
car          58/328/210    60/333/219    58/325/207    59/324/216    90/520/-     110/380/169
chair        36/280/407    35/275/393    35/277/392    37/277/401    130/510/-    330/770/456
sofa         64/329/275    63/320/275    64/324/271    65/319/276    210/770/-    230/600/292
mean5        46/305/323    46/304/322    46/300/315    48/292/329    130/580/-    200/780/353
cabinet      37/249/282    37/250/282    36/251/264    37/246/279    -            -/-/229
monitor      38/253/369    37/250/367    37/255/367    43/255/380    -            -/-/448
lamp         52/402/514    49/393/480    44/384/473    55/425/520    -            -/-/538
speaker      72/312/301    68/309/304    71/313/301    73/308/315    -            -/-/263
firearm      39/312/332    30/279/281    32/288/326    39/301/345    -            -/-/396
table        47/352/447    46/331/432    46/342/420    49/319/450    -            -/-/394
cellphone    16/159/241    15/150/224    15/154/192    15/154/222    -            -/-/251
watercraft   83/408/493    48/296/340    49/304/361    53/317/367    -            -/-/389
mean13       47/305/353    43/290/332    43/292/329    46/294/348    -            250/800/360

Table 3: 1000 × (λc / λem / 1 − IoU) for our different training regimes, compared against state-of-the-art models DN [36] and PSGN [13]. Lower is better. λc and λem values for PSGN are from the latest version as reported by Kuryenkov et al. [36], while IoU values are from the original paper. mean5 is the mean value across the plane, bench, car, chair and sofa categories, while mean13 is the average across all 13.

All our training regimes out-perform the other methods by a significant margin on all categories for point-cloud metrics (λc and λem). We also outperform PSGN on IoU for most categories and on average. The categories for which the method performs worst in terms of IoU – tables and chairs – typically feature large, flat surfaces



Fig. 3: Representative results for different categories. Column (a) shows the input image. Column (b) shows the selected template model. Column (c) shows the deformed point cloud by Ffd. The deformed voxelized model is shown in column (d). Column (e) shows our final 3D mesh reconstruction, and the ground truth is shown in column (f).


and thin structures. Poor IoU scores can largely be attributed to poor width or depth inference (a difficult problem given the single view provided) and small, consistent offsets that do not induce large Chamfer losses. An example is shown in Figure 4. We acknowledge this experimental setup gives us a slight advantage

Fig. 4: An example of a qualitatively accurate inference with a low IoU score, λ_IoU = 0.33. Small errors in depth/width inference correspond to small Chamfer losses. For comparison, black voxels are true positives (intersection), red voxels are false negatives and green voxels are false positives.

over DN and PSGN. Most notably, these approaches used renderings from different angles, compared to our uniform viewing angles. In practice, we found using varied viewing angles had a small negative effect on our results, though this regression could be partially offset by a higher capacity image network. DN and PSGN also only trained a single model, whereas we trained a new model for each category (though DN used category information at evaluation time). For a fairer comparison, we trained a single network using two templates from each of the 13 categories investigated on a dataset of all 13 categories. Again we note a regression in performance, though we still significantly outperform the other methods. We present this approach purely for simple, fair comparison with PSGN – we do not suggest it is the optimal approach to extending our model to multiple categories, and leave further investigation to future work. Chamfer losses for varied views and multiple categories are given in Table 4.

Category   ent.   8-view   8-view-full   DN [36]   13c   13c-long   PSGN [13]
plane      31     38       37            100       54    53         140
bench      40     50       43            100       53    52         210
car        58     61       59            90        71    68         110
chair      35     44       42            130       48    46         330
sofa       64     72       71            210       79    77         230
mean5      46     53       50            130       61    59         204
mean13     43     52       50            -         67    62         250

Table 4: 1000 × λc. ent.: same as original paper. 8-view: same as ent. but trained on a random view of 8. 8-view-full: same as 8-view but uses a full-features MobileNet architecture. 13c: same as 8-view, but a single model for all 13 categories, 2 templates per category. 13c-long: same as 13c but trained for twice as long.

5.2 Template Selection

We begin our analysis by investigating the number of times each template was selected across the different training regimes, and the quality of the match of the undeformed template to the query model. Results for the sofa and table categories are given in Figure 5. We illustrate the typical behaviour of our framework with the sofa and table categories, since these are categories with topologically



Fig. 5: Normalized count of the number of times each template was selected, sorted descending (a, b), and cumulative Chamfer error (c, d) for the deformed models (dashed) and undeformed template models (dotted) for the sofa category (a, c) and table category (b, d). The b, w, e, r legend entries correspond to the base, log-weighted, entropy and regularized training regimes respectively.

similar models and topologically different models respectively. In both cases, the base training regime resulted in a model with template selection dominated by a small number of templates, while additional loss terms in the form of deformation regularization and entropy succeeded in smearing out this distribution to some extent. The behaviour of the non-linear weighting regime is starkly different across the two categories however, reinforcing template dominance for the category with less topological difference across the dataset, and encouraging variety for the table category.

In terms of the Chamfer loss, all training regimes produced deformed models with virtually equivalent results. The difference is apparent when inspecting the undeformed models. Unsurprisingly, penalizing large deformation via regularization results in the best results for the undeformed template, while the other two non-base methods selected templates slightly better than the base regime.

To further investigate the effect of template selection on the model, we trained a base model with a single template (T = 1), and entropy models with T ∈ {2, 4, 8, 16} templates for the sofa dataset. In each case, the top N templates selected by the 30-template regularized model were used. Cumulative Chamfer losses and IoU scores are shown in Figure 6.

Surprisingly, the deformation networks manage to achieve almost identical results on these metrics regardless of the number of templates available. Additional templates do improve accuracy of the undeformed model up to a point, suggesting the template selection mechanism is not fundamentally broken.

5.3 Semantic Label Transfer

While no aspect of the training related to semantic information, applying the inferred deformations to a semantically labelled point cloud allows us to infer another semantically labelled point cloud. Some examples are shown in Figure 7. For cases where the template is semantically similar to the query object, the additional semantic information is retained in the inferred cloud. However, some templates either do not have points of all segmentation types, or have points of segmentation types that are not present in the query object. In these cases, while the inferred point cloud matches the surface relatively well, the semantic information is unreliable.



Fig. 6: Cumulative Chamfer loss (left) and IoU results (right) for models with limited templates. All models with T > 1 were trained under the entropy regime (e) on the sofa category. The T = 1 model was trained with the base training regime. Dotted: undeformed selected template values; dashed: deformed model values.

Fig. 7: Good (left block) and bad (right block) examples from the chair and plane categories. For each block, left to right: input image; selected template's semantically segmented cloud; deformed segmented cloud; deformed mesh. Models trained with additional entropy loss term (ent.).

6 Conclusion

We have presented a simple framework for combining modern CNN approaches with detailed, unstructured meshes by using Ffd as a fixed-sized intermediary and learning to select and deform template point clouds based on minimally adjusted off-the-shelf image processing networks. We out-perform state-of-the-art methods with respect to point cloud generation, and perform at or above state-of-the-art on the volumetric IoU metric, despite our network not being optimized for it. We present various mechanisms by which the diversity of templates selected can be increased and demonstrate these result in modest improvements. We demonstrate that the main component of the low metric scores is the ability of the network to learn deformations tailored to specific templates, rather than the precise selection of these templates. Models with only a single template to select from achieve comparable results to those with a greater selection at their disposal. This indicates the choice of template – and hence any semantic or topological information – makes little difference to the resulting point cloud, diminishing the trustworthiness of topological or semantic information.


References

1. Penner, E., Zhang, L.: Soft 3D reconstruction for view synthesis. In: ACM Transactions on Graphics. Volume 36. (2017)
2. Huang, Q., Wang, H., Koltun, V.: Single-view reconstruction via joint analysis of image and shape collections. In: ACM Transactions on Graphics. Volume 34. (2015)
3. Maier, R., Kim, K., Cremers, D., Kautz, J., Nießner, M.: Intrinsic3D: High-quality 3D reconstruction by joint appearance and geometry optimization with spatially-varying lighting. In: ICCV. (2017)
4. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS. (2012) 1097–1105
5. Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning hierarchical features for scene labeling. TPAMI 35 (2013) 1915–1929
6. Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., Schmidhuber, J.: A novel connectionist system for unconstrained handwriting recognition. TPAMI 31 (2009) 855–868
7. Choy, C.B., Xu, D., Gwak, J., Chen, K., Savarese, S.: 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In: ECCV. (2016)
8. Yan, X., Yang, J., Yumer, E., Guo, Y., Lee, H.: Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In: NIPS. (2016)
9. Qi, C.R., Su, H., Nießner, M., Dai, A., Yan, M., Guibas, L.J.: Volumetric and multi-view CNNs for object classification on 3D data. In: CVPR. (2016)
10. Kar, A., Häne, C., Malik, J.: Learning a multi-view stereo machine. In: NIPS. (2017)
11. Zhu, R., Galoogahi, H.K., Wang, C., Lucey, S.: Rethinking reprojection: Closing the loop for pose-aware shape reconstruction from a single image. In: NIPS. (2017)
12. Wu, J., Wang, Y., Xue, T., Sun, X., Freeman, W.T., Tenenbaum, J.B.: MarrNet: 3D shape reconstruction via 2.5D sketches. In: NIPS. (2017)
13. Fan, H., Su, H., Guibas, L.J.: A point set generation network for 3D object reconstruction from a single image. In: CVPR. (2017)
14. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: Deep learning on point sets for 3D classification and segmentation. In: CVPR. (2017)
15. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: Deep hierarchical feature learning on point sets in a metric space. In: NIPS. (2017)
16. Lin, C.H., Kong, C., Lucey, S.: Learning efficient point cloud generation for dense 3D object reconstruction. In: AAAI. (2018)
17. Sederberg, T., Parry, S.: Free-form deformation of solid geometric models. In: SIGGRAPH. (1986)
18. Ulusoy, A.O., Geiger, A., Black, M.J.: Towards probabilistic volumetric reconstruction using ray potentials. In: 3DV. (2015)
19. Wu, Z., Song, S., Khosla, A., Tang, X., Xiao, J.: 3D ShapeNets: A deep representation for volumetric shapes. In: CVPR. (2015)
20. Cherabier, I., Häne, C., Oswald, M.R., Pollefeys, M.: Multi-label semantic 3D reconstruction using voxel blocks. In: 3DV. (2016)
21. Sharma, A., Grau, O., Fritz, M.: VConv-DAE: Deep volumetric shape learning without object labels. In: ECCVW. (2016)


22. Rezende, D.J., Eslami, S.M.A., Mohamed, S., Battaglia, P., Jaderberg, M., Heess, N.: Unsupervised learning of 3D structure from images. In: NIPS. (2016)
23. Girdhar, R., Fouhey, D.F., Rodriguez, M., Gupta, A.: Learning a predictable and generative vector representation for objects. In: ECCV. (2016)
24. Wu, J., Zhang, C., Xue, T., Freeman, W.T., Tenenbaum, J.B.: Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In: NIPS. (2016)
25. Liu, J., Yu, F., Funkhouser, T.A.: Interactive 3D modeling with a generative adversarial network. In: 3DV. (2017)
26. Gwak, J., Choy, C.B., Garg, A., Chandraker, M., Savarese, S.: Weakly supervised generative adversarial networks for 3D reconstruction. In: 3DV. (2017)
27. Riegler, G., Ulusoy, A.O., Geiger, A.: OctNet: Learning deep 3D representations at high resolutions. In: CVPR. (2017)
28. Wang, P.S., Liu, Y., Guo, Y.X., Sun, C.Y., Tong, X.: O-CNN: Octree-based convolutional neural networks for 3D shape analysis. In: SIGGRAPH. (2017)
29. Häne, C., Tulsiani, S., Malik, J.: Hierarchical surface prediction for 3D object reconstruction. In: 3DV. (2017)
30. Tatarchenko, M., Dosovitskiy, A., Brox, T.: Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs. In: ICCV. (2017)
31. Lun, Z., Gadelha, M., Kalogerakis, E., Maji, S., Wang, R.: 3D shape reconstruction from sketches via multi-view convolutional networks. In: 3DV. (2017)
32. Sinha, A., Unmesh, A., Huang, Q., Ramani, K.: SurfNet: Generating 3D shape surfaces using deep residual networks. In: CVPR. (2017)
33. Yumer, M.E., Mitra, N.J.: Learning semantic deformation flows with 3D convolutional networks. In: ECCV. (2016)
34. Kong, C., Lin, C.H., Lucey, S.: Using locally corresponding CAD models for dense 3D reconstructions from a single image. In: CVPR. (2017)
35. Pontes, J.K., Kong, C., Eriksson, A., Fookes, C., Sridharan, S., Lucey, S.: Compact model representation for 3D reconstruction. In: 3DV. (2017)
36. Kurenkov, A., Ji, J., Garg, A., Mehta, V., Gwak, J., Choy, C.B., Savarese, S.: DeformNet: Free-form deformation network for 3D shape reconstruction from a single image. Volume abs/1708.04672. (2017)
37. Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40 (2000) 99–121
38. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
39. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR. (2009) 248–255
40. Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., Yu, F.: ShapeNet: An information-rich 3D model repository. Technical Report arXiv:1512.03012 [cs.GR] (2015)

Chapter 5

IGE-Net: Inverse Graphics Energy Networks for Human Pose Estimation and Single-View Reconstruction

“One geometry cannot be more true than another; it can only be more convenient.” — Henri Poincaré

Recall our first contribution (Chapter 3), which sought solutions that were both feasible and consistent via a two-stage process: feasibility was learned in the first phase, and consistency was optimized by standard optimization techniques in a separate inference step. While we were able to use GANs to learn a feasible solution space, there is no guarantee this space will behave nicely with respect to the optimization in the inference phase. It also requires us to use hard-coded reprojection and temporal smoothing losses, rather than allowing these to be learned.

Our third contribution seeks to address these issues by combining these two sequential optimization steps into a single supervised learning framework. Conceptually, rather than learning a feasible pose space, we learn an energy function which reflects both the feasibility of the solution and its consistency with the 2D observation. Our inference is the result of partially optimizing this energy function with respect to a proposed solution, and the parameters of the energy function are optimized by minimizing the loss between the result of this energy-minimizing inference and supervised labels. This idea of multi-level optimization/energy networks is not original – our contribution is based on the use of simple graphics techniques and novel energy functions. We coin the resulting models Inverse Graphics Energy Networks, or IGE-Nets.
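To make the multi-level structure concrete, the following is a minimal TensorFlow sketch of the unrolled inner optimization at the heart of an IGE-Net: a learned energy function is partially minimized by a fixed number of gradient steps, and every intermediate proposal is returned so the outer loss can be applied to each step. The momentum, clipping and step counts shown are illustrative placeholders rather than the values used in the published models.

```python
import tensorflow as tf

def unrolled_inference(energy_fn, features, y0, steps=8, lr=1.0, momentum=0.1, clip=1.0):
    # energy_fn(y, features) -> per-example scalar energy; its parameters are
    # trained by the outer optimizer by back-propagating through this loop.
    y, velocity = y0, tf.zeros_like(y0)
    proposals = [y]
    for _ in range(steps):
        with tf.GradientTape() as tape:
            tape.watch(y)
            energy = tf.reduce_sum(energy_fn(y, features))
        grad = tf.clip_by_norm(tape.gradient(energy, y), clip)
        velocity = momentum * velocity - lr * grad
        y = y + velocity
        proposals.append(y)
    # The outer loss is a decaying weighted sum of per-step losses over `proposals`.
    return proposals
```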

We demonstrate the effectiveness of IGE-Nets on both problems investigated thus far in this thesis: 2D-to-3D human pose lifting and single-view image reconstruction. Our human pose models have two orders of magnitude fewer parameters and an order of magnitude fewer multiply-add operations than baseline deep learning approaches, with comparable performance. Our object reconstruction networks are capable of inferring voxel grids at resolutions of 256³ on standard desktop GPUs with state-of-the-art performance with respect to intersection-over-union.


Statement of Contribution of Co-Authors for Thesis by Published Paper

The authors listed below have certified that:

1. they meet the criteria for authorship in that they have participated in the conception, execution, or interpretation, of at least that part of the publication in their field of expertise;

2. they take public responsibility for their part of the publication, except for the responsible author who accepts overall responsibility for the publication;

3. there are no other authors of the publication according to these criteria;

4. potential conflicts of interest have been disclosed to (a) granting bodies, (b) the editor or publisher of journals or other publications, and (c) the head of the responsible academic unit; and

5. they agree to the use of the publication in the student’s thesis and its publication on the QUT’s ePrints site consistent with any limitations set by publisher requirements.

In the case of this chapter: IGE-Net: Inverse Graphics Energy Networks for Human Pose Estimation and Single-View Reconstruction, presented at the Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, USA, 2019.

Dominic Jack: Experiment design, model implementation, write-up. (QUT Verified Signature, 26 May 2020)

Frederic Maire: Advised on model design and paper editing.
Sareh Shirazi: Advised on model design and paper editing.
Anders Eriksson: Advised on model design and paper editing.

Principal Supervisor Confirmation

I have sighted email or other correspondence from all Co-authors confirming their certifying authorship. (If the Co-authors are not able to sign the form please forward their email or other correspondence confirming the certifying authorship to the RSC.)

Frederic Maire (QUT Verified Signature), 26 May 2020

IGE-Net: Inverse Graphics Energy Networks for Human Pose Estimation and Single-View Reconstruction

Dominic Jack¹  Frederic Maire¹  Sareh Shirazi¹  Anders Eriksson²
[email protected]  [email protected]  [email protected]  [email protected]
¹School of Electrical Engineering and Computer Science, Queensland University of Technology
²School of Information Technology and Electrical Engineering, University of Queensland

Abstract

Inferring 3D scene information from 2D observations is an open problem in computer vision. We propose using a deep-learning based energy minimization framework to learn a consistency measure between 2D observations and a proposed world model, and demonstrate that this framework can be trained end-to-end to produce consistent and realistic inferences. We evaluate the framework on human pose estimation and voxel-based object reconstruction benchmarks and show competitive results can be achieved with relatively shallow networks with drastically fewer learned parameters and floating point operations than conventional deep-learning approaches.

1. Introduction

Computer graphics involves reducing 3D scene information to 2D using well-understood physics-based arguments and mathematical operations like frame transformations and projections. Computer vision can be thought of as the inverse problem – inferring 3D scene information from some 2D representation.

Unlike graphics, computer vision is inherently ill-posed. While it is straight-forward enough to obtain an inference which is consistent with a given 2D representation using standard graphics and optimization techniques, there is no guarantee this inference will be realistic. To resolve this, we propose using simple optimization techniques on a learned energy function which combines graphics operations with a learned realism component.

We learn this energy function itself using deep-learning optimization techniques, resulting in a multi-level optimization framework which can be trained end-to-end. We apply our framework to two common problems: 3D human pose estimation, and single-view voxel-based object reconstruction.

2. Main Contributions

Our main contributions are as follows.

1. We propose simple parameterized energy functions that capture both consistency and feasibility for the problems of human pose estimation and object reconstruction based on 2D features and well-understood computer-graphics principles.

2. For the case of human pose estimation, we show the proposed energy function can be used to lift 2D pose inferences to 3D at competitive accuracies with significantly fewer learned parameters and computational requirements.

3. For object reconstruction, we demonstrate the framework can produce high-resolution voxel grids from single images on standard desktop GPUs without the need for 3D convolutions or deconvolutions, out-performing state-of-the-art high-resolution methods in terms of accuracy.

3. Prior Work

3.1. Multi-Level Optimization

Many problems in machine learning involve inferring values of unknown variables from observations. Energy-based models describe relationships between sets of variables by mapping each combination to a scalar energy value, where realistic combinations correspond to lower energies than their less viable counterparts. Inferences are made by fixing values of known variables and seeking unknown values which minimize the energy [27].

Energy-based models have been combined with deep learning in the past. Zheng et al. [59] formulated conditional random fields (CRFs) as a recurrent neural network layer, which combined with a standard convolutional neural network (CNN) achieved state-of-the-art results for image


segmentation. Amos and Kolter [1] considered energy functions based on quadratic programs. Their implementation solved the inner optimization problem efficiently and exactly, and demonstrated it was able to learn hard constraints like those associated with the number-game Sudoku.

Domke [13] presented a number of implementations for efficiently computing and differentiating approximate optimizations – solutions where the energy minimization process is based on a fixed number of steps of some optimization algorithm. While the algorithms did not find the exact solution to the energy minimization problem, these truncated optimization processes still yielded good results for image denoising and labeling problems. Belanger et al. [3] took a similar approach and showed inexact optimization of complex energy functions outperformed exact solutions using simpler functions for image denoising and natural language semantic role labeling.

3.2. Human Pose Estimation

Inferring human pose in two or three dimensions from images is an important part of many tasks including human-computer interaction and action recognition. For the 2D problem, traditional approaches combine visual features and image descriptors with a tree-structure of the body and known invariants and proportions [58]. More recently, deep learning's wave of success in other image processing applications such as image classification and segmentation has flowed into pose estimation, with fully-convolutional approaches achieving exceptionally accurate 2D inferences by regressing heatmaps rather than the joint coordinates themselves [50, 35, 10, 5, 11].

The 3D problem is considerably more challenging. In addition to problems involved in the 2D variant, the main difficulty in training 3D pose inference systems that work in the wild is the availability of varied datasets. While 2D datasets can be annotated manually, 3D information is generally gathered using special motion-capture systems. Although these systems are capable of generating massive volumes of data, the examples within such datasets are usually limited in variety. For example, the human 3.6 million dataset (H3M) [19] contains millions of frames, but all images are collected in the same room with only a handful of subjects. By contrast, the popular 2D dataset COCO [30] features over 50,000 human pose annotations with very few duplicates.

To get around this lack of varied 3D data, many methods use a 2-stage approach to 3D inference by inferring 2D poses from images, then lifting these 2D poses to 3D separately [4, 7, 33]. These approaches benefit from the varied image features in 2D datasets, but the separate stages mean any "lifting" module is unable to take advantage of contextual information learned in the first stage.

The other main difficulty with 3D pose estimation is the inherent ambiguity associated with depth inference and occlusions. Adversarial approaches tackle this by introducing loss terms which are themselves learned in a modified minimax game [21, 47, 56].

3.3. Single View 3D Object Reconstruction

Reconstructing 3D objects from a single view is a common problem in computer vision and robotics. Fundamental to any approach is the representation of the output object. Volumetric methods are the most widely used in 3D learning [48, 54, 8, 9, 20, 55, 37, 51, 24, 60]. These approaches generally use 3D analogues of ideas and operations that have proven successful in image processing, including convolutions, deconvolutions and feature pooling. Recent advances in auto-encoders [15, 43] and GANs [53, 52, 31, 17] have also shown promising results on regular 3D grids, while Tulsiani et al. [46] showed object shape and pose can be learned simultaneously and without 3D labels using only depth maps or silhouettes to encourage view consistency across multiple views.

Unfortunately, the additional dimension inherent to 3D representations means these methods scale poorly with resolution, resulting in generally coarse outputs – typically 32³ or 64³. To overcome this scaling issue, octree networks [40, 49, 18, 45] recursively divide regions of interest into octants. By focusing only on regions near the object surface, these methods operate with complexity proportional to surface area rather than volume.

Other approaches to high-resolution inference keep the regular volumetric data structure but use operations that scale better to higher resolutions [23, 39].

Point cloud methods avoid the need to discretize space, instead working on continuous coordinates of points on the object surface [14, 36, 38, 28]. However, the variable size and unordered nature of point clouds introduce their own complexity in deep learning frameworks. Template deformation approaches [26, 22, 57] instead infer a constant-sized space warping that can be applied to an arbitrarily dense cloud or mesh. This comes at a cost however, as the topology of the output shape is intrinsically coupled with that of the deformed template.

4. Method Overview

Our approach is based on energy minimization networks which have been discussed previously in the literature [27, 13, 2, 3]. We base our notation on the work of Belanger, McCallum and Yang [3], where we seek the minimizer of some energy function

    \arg\min_{\tilde{y}} E(\tilde{y}; x, \theta_E) .    (1)

We implement the energy function E as a neural network which takes as inputs a proposed solution ỹ and extracted

features x with learned parameters θE . For generic non- 5. Human Pose Estimation convex energies, calculating the exact argmin is intractible, We begin by considering the problem of lifting human hence we approximate the result by the output of some iter- N 2 N 3 joint information from 2D (x R J2 × ) to 3D (y R J3 × ). ative strategy ∈ ∈ Note we do not require the number of joints to be the same, (t) (t) (t 1) y˜ = f(y˜ ,E(y˜ − ;x,θE );θopt), (2) nor do we require any known correspondences between the two sets. This allows us to pair 2D inferences from a model for some fixed number of steps t [1,T], where θopt are trained on one dataset with 3D pose data with different joint ∈ (0) hyper-parameters of the optimization strategy and y˜ is an annotations. initial proposal. For example, basic gradient-descent with Recent progress in this area has resulted in a number of learning rate η is implemented as algorithms performing very well on standard benchmarks, split on accuracy metrics by a matter of millimeters. For f(y˜,E(y˜;x,θE );η) = y˜ η∇y˜ E. (3) − many applications, such error rates are well and truly satis- In this investigation we also considered gradient-descent factory, so we approach this problem with the aim to mini- with momentum and gradient-clipping, where the momen- mize memory requirements and computational costs – fac- tum term and clip value were trained as part of θopt. tors more important in areas such as mobile robotics and autonomous systems – while maintaining reasonable accu- racy. We also limit our methods to perform well as defined by scale-invariant metrics. While scale can be learned based on contextual information, errors in scale inference tend to drown-out those associated with relative positions. 5.1. Network Structure Figure 1: Unrolled optimization involves iteratively updat- We base our feature extractor module on the work of (t) ing a proposed value y˜ to minimize some energy function Martinez et al.[33]. The proposed network is made up of E according to an update step f. Parameters of E and f (blue) two residual blocks each containing two dense layers along are learned in the outer optimization process. with an input and output layer for a total of six, as well as batch normalization, rectified linear activations, weight This process is illustrated in Figure1. We refer to this clipping, residual connections and dropout. While this net- scheme as unrolled gradient descent or inner optimization. work is small by modern standards, we reduce it further by To train our network we use a loss λ made up of a removing one of the internal blocks and dropping the num- weighted sum of losses applied to all steps of the optimiza- ber of units in each remaining inner layer by a factor of 8. tion process, This reduces the number of trainable parameters by roughly T λ = ∑ λˆ (y˜(t),y). (4) a factor of 100. Since our losses and evaluation are scale- t=0 agnostic, we also remove the weight clipping. We consider an energy function E as the combination of where kt is a scalar weighting value, y is the example label and λˆ is some per-proposal loss function dependent on the a reprojection energy Ex and a feasibility energy Ey, problem. In all experiments we use exponential weighting T t E(y˜;x) = Ex (x˜(y˜);x) + Ey(y˜), (5) kt = 0.9 − . Assuming E and f are piecewise-doubly-differentiable where x˜(y˜) is the projection of the proposed solution. 
We and λ0 is piecewise differentiable, the parameters θE and assume the intrinsic camera parameters are known, and in- θopt can be learned using any standard optimization strategy fer 3D poses in the camera’s reference frame. referred to as the outer optimizer. For brevity, we drop the Each energy function makes use of pairwise squared eu- parameters θE and θopt in equations and diagrams hereafter. clidean distances similar to Moreno-Noguer [34], To summarise, our inverse graphics energy networks ∆2 (z) = z z 2, j > i, (6) (IGE-Net) are made up of: i j || i − j||2 a feature extractor module that provides a (possibly N • where z is an ordered set of points in R . This transforma- empty) set of features as well as an initial estimate; tion has many desirable properties, including invariance to rotation, translation and reflection. Unlike Moreno-Noguer, an energy module which reduces a proposed solution • we use the squared distance rather than the actual differ- and observed features to a scalar value; and ence, as this avoids a square root operation causes problems an inner optimization strategy. with gradients near zero. • 64 CHAPTER 5. INVERSE GRAPHICS ENERGY NETWORKS

5. Human Pose Estimation

We begin by considering the problem of lifting human joint information from 2D (x ∈ R^(N_J2 × 2)) to 3D (y ∈ R^(N_J3 × 3)). Note we do not require the number of joints to be the same, nor do we require any known correspondences between the two sets. This allows us to pair 2D inferences from a model trained on one dataset with 3D pose data with different joint annotations.

Recent progress in this area has resulted in a number of algorithms performing very well on standard benchmarks, split on accuracy metrics by a matter of millimeters. For many applications, such error rates are well and truly satisfactory, so we approach this problem with the aim to minimize memory requirements and computational costs – factors more important in areas such as mobile robotics and autonomous systems – while maintaining reasonable accuracy. We also limit our methods to perform well as defined by scale-invariant metrics. While scale can be learned based on contextual information, errors in scale inference tend to drown out those associated with relative positions.

5.1. Network Structure

We base our feature extractor module on the work of Martinez et al. [33]. The proposed network is made up of two residual blocks each containing two dense layers, along with an input and output layer for a total of six, as well as batch normalization, rectified linear activations, weight clipping, residual connections and dropout. While this network is small by modern standards, we reduce it further by removing one of the internal blocks and dropping the number of units in each remaining inner layer by a factor of 8. This reduces the number of trainable parameters by roughly a factor of 100. Since our losses and evaluation are scale-agnostic, we also remove the weight clipping.

We consider an energy function E as the combination of a reprojection energy E_x and a feasibility energy E_y,

E(ỹ; x) = E_x(x̃(ỹ); x) + E_y(ỹ),   (5)

where x̃(ỹ) is the projection of the proposed solution. We assume the intrinsic camera parameters are known, and infer 3D poses in the camera's reference frame.

Each energy function makes use of pairwise squared euclidean distances similar to Moreno-Noguer [34],

∆²_ij(z) = ||z_i − z_j||²₂,  j > i,   (6)

where z is an ordered set of points in R^N. This transformation has many desirable properties, including invariance to rotation, translation and reflection. Unlike Moreno-Noguer, we use the squared distance rather than the actual difference, as this avoids a square root operation that causes problems with gradients near zero.

Figure 2: For lifting 2D pose information to 3D, we split the energy into 2 parts: a reprojection loss E_x which measures how consistent the projected proposed pose is with the observed 2D information, and a feasibility loss E_y which operates on the normalized proposed pose.

We parameterize our reprojection loss as a 2-layer dense network DN_x with softplus and softabs activations. Inputs are given by the pairwise squared distances between all points in x and x̃, i.e.

E_x(x̃; x) = DN_x(∆²(x̃ ⊕ x)),   (7)

where ⊕ is the concatenation operator along the joint dimension.

While a perfect proposal will yield a perfect reprojection (ỹ = y ⇒ x̃ = x), the reverse implication does not hold. As the name suggests, the feasibility energy E_y is intended to promote feasible proposals independently of the appearance x̃. To make this scale-invariant, we normalize the proposed pose ŷ = N(ỹ) by dividing by the distance between the hip joints, then consider the pairwise squared distances,

E_y(ỹ) = DN_y(∆²(ŷ)).   (8)

This energy architecture is illustrated in Figure 2.

To train our model, we use a per-step outer loss function

λ̂(ỹ^(t), y) = ||k ỹ^(t) − y||₂,   (9)

where k is the optimal scaling factor with respect to the squared error.
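As a concrete illustration of the energy terms above, the following hedged NumPy sketch computes the pairwise squared distances of Equation 6 and a feasibility energy in the spirit of Equation 8. Here dn_y, hip_left and hip_right are assumed placeholders for the learned 2-layer dense network and the hip joint indices, not names from the thesis.

    import numpy as np

    def pairwise_sq_dists(z):
        """Upper-triangular pairwise squared euclidean distances (Eq. 6).

        z: (J, D) array of J joint coordinates in D dimensions.
        Returns a 1D array of ||z_i - z_j||^2 for all j > i.
        """
        diff = z[:, None, :] - z[None, :, :]          # (J, J, D)
        sq = np.sum(diff ** 2, axis=-1)               # (J, J)
        iu = np.triu_indices(z.shape[0], k=1)
        return sq[iu]

    def feasibility_energy(y_prop, hip_left, hip_right, dn_y):
        """Scale-invariant feasibility energy E_y (Eq. 8) for a proposed
        3D pose y_prop of shape (J3, 3); dn_y is any callable standing in
        for the small dense network DN_y."""
        scale = np.linalg.norm(y_prop[hip_left] - y_prop[hip_right])
        y_hat = y_prop / scale                        # normalized pose
        return dn_y(pairwise_sq_dists(y_hat))         # scalar energy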

5.2. Implementation Details

We pretrain our initial pose estimation network independently for 200 epochs with a batch size of 64 as per the original [33].

For our inner loss networks, we initialized the hidden layer weights using Glorot initialization [16], and the loss layer weights with a version scaled down by 10⁻³. This resulted in the inner optimizer starting with little effect and growing, which smooths learning in the very early stages. We used the same learning-rate decay schedule as Martinez et al. [33] except with initial learning rate lowered by a factor of 10 and training to convergence.

We initialize our inner optimizer's learning rate, gradient clip value and momentum at 1, 1 and 0.1 respectively. To prevent negative momentum early due to spurious gradients in the initial loss function, we used the absolute value of a learned parameter rather than the learned parameter itself.

We run experiments on the popular Human 3.6 Million (H3M) dataset [19]. We use 2D pose inferences provided by Martinez et al. [33] which come from the stacked hourglass networks of Newell et al. [35]: one trained entirely on varied 2D poses in-the-wild, and another fine-tuned on H3M. We also experiment with ground-truth 2D poses. All training and evaluation uses inputs with the 16 joints used in COCO [30], and we infer a slightly different set of 16 joints in 3D. Evaluation is on a 17-joint skeleton with an additional pelvis joint as per Martinez et al. [33]. We train on subjects 1, 5, 6, 7 and 8 and evaluate on subjects 9 and 11.

5.3. Results

Sample results for our 2-step network trained on 2D ground-truth inputs are shown in Figure 3. We see the network learns to reconcile inconsistent 2D data in a single step. The subsequent step has a smaller impact, but still makes minor adjustments to the 3D pose without losing consistency with the observation.

Figure 3: Camera view (top) and novel view (bottom) of inferred pose (solid) and ground truth (dotted) after 0, 1 and 2 steps. Note the observed 2D pose (dotted, top) has one fewer joint in the head. The model uses camera view 2D joint coordinates (top, dotted) as inputs.

We evaluate our models using two metrics: the mean per-joint error after scaling as per Equation 9, and the per-joint error after an optimal rigid body transformation. We refer to these as Protocol 1a and Protocol 2 respectively (Martinez et al. [33] define Protocol 1 to be a slightly different metric; it is largely analogous in meaning to our Protocol 1a, though not equivalent).

We begin analysis by looking at the performance of our networks using ground truth 2D poses with different numbers of inner optimizer steps. We compare against different versions of the base model without the IGE component by varying the number of residual blocks as well as the number of hidden units in each layer. Protocol 2 results are shown in Figure 4.

[Figure 4: plot of Protocol 2 error (mm) against the number of multiply-adds for IGE and baseline models – see caption below.]

Figure 4: Protocol 2 scores (lower is better) and the number of multiply-adds due to dense layers in inference. Base model values are for networks with (left-to-right) 128, 256, 512 and 1024 units in each dense layer and 1 (red) or 2 (blue) residual blocks. IGE values (black) are for (left-to-right) 0, 1, 2, 4, 6, 8, 12 and 16 steps. The size of each dot represents the number of trainable parameters of the model.

Our IGE networks can achieve competitive results in a handful of steps, with performance comparable to the full base model with significantly fewer operations. Unlike the baseline, our networks also have a constant number of trained parameters, resulting in a significantly smaller memory footprint.

Results for experiments based on inferred 2D poses are shown in Table 1. Interestingly, our baseline method appears to overfit certain displacements, resulting in a relatively high Protocol 1a loss, though achieves a loss consistent with Martinez et al. after optimal translation. Our IGE network performs slightly worse than Martinez et al. on inferred detections, though given the reduced computational and memory burden we believe this would be an acceptable trade-off in many scenarios.

Table 1: Average Protocol 1a / Protocol 2 scores for inferences based on stacked hourglass detections (SH), fine-tuned stacked hourglass detections (FT) and ground truth projections (GT). Baseline models had 1024 hidden units and 2 residual blocks. IGE networks were trained for 4 and 8 steps. Lower is better.

                  Protocol 1a            Protocol 2
    2D source     SH     FT     GT       SH     FT     GT
    Mart. [33]    -      -      -        52.5   47.7   37.1
    Base 1024/2   79.0   75.1   61.6     52.2   47.9   35.8
    IGE4          75.1   67.8   45.1     56.1   51.5   39.4
    IGE8          72.8   66.0   42.6     55.1   50.5   37.7

6. Single View 3D Object Reconstruction

For the problem of 3D object reconstruction we parameterize shapes as voxel occupancy grids and seek a method that will scale well to high resolutions.

6.1. Energy Formulation

Theoretically, the approach of separating reprojection and feasibility losses can be applied to object reconstruction by comparing silhouettes and using some 3D convolutional encoder respectively. However, initial experiments showed this approach suffered from a number of issues. These included issues with formulating projections of continuous-valued proposed solutions and scaling issues associated with the cubic nature of the grid.

Instead, we propose a very different energy function formulation for single view reconstruction. We consider the inner optimizer input x to be the progressive outputs of some 2D convolutional network with N_C output feature-banks of different resolutions x = {x_1, x_2, ..., x_{N_C}}.

We consider an energy function made up of the sum of energy functions at each resolution. For each image feature map x_i of shape (h_i, w_i, f_i) we consider a voxel grid in the camera's viewing frustum ȳ_i of shape (h_i, w_i, d_i) by averaging the proposed voxel grid values in world coordinates ỹ over the frustum voxel volumes. Our energy function seeks to learn the consistency between all voxel values along a ray and the image features of the associated pixel,

E(ỹ; x) = Σ_i CNN_i(x_i ⊕ ȳ_i),   (10)

where concatenation is along the feature dimension and CNN_i is some short 2D convolutional neural network.

Figure 5: Energy function for single view reconstruction.

By setting the depth of the averaged frustum voxel grid d_i and the number of filters in each layer of CNN_i to be proportional to the number of image features f_i, and assuming those image features roughly double in depth as they halve in spatial resolution, we ensure the number of operations at each image resolution is the same. This allows for much better resolution scaling than typical 3D convolution/deconvolution networks.

In practice, averaging a voxel grid in world coordinates over voxels corresponding to a frustum grid is a non-trivial operation and must be done at each step and resolution of the inner optimizer across all examples. Instead, we transform the labels of our dataset into the frustum space in a preprocessing step. During inference, the proposed solution ỹ is a voxel grid in frustum coordinates which is average pooled anisotropically with different pool sizes for each image resolution.
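A minimal NumPy sketch of this multi-resolution energy (Equation 10), assuming the frustum grid dimensions divide evenly by the pooling factors and treating each CNN_i as an arbitrary callable returning a scalar:

    import numpy as np

    def avg_pool_3d(y, pool):
        """Anisotropic average pooling of a frustum voxel grid.

        y: (H, W, D) occupancy grid in frustum coordinates.
        pool: (ph, pw, pd) pooling factors; H, W, D must be divisible by them.
        """
        H, W, D = y.shape
        ph, pw, pd = pool
        return y.reshape(H // ph, ph, W // pw, pw, D // pd, pd).mean(axis=(1, 3, 5))

    def reconstruction_energy(y_prop, image_features, pools, cnns):
        """Multi-resolution energy of Eq. 10: pool the proposed frustum grid to
        each feature-map resolution, concatenate along the feature/depth axis
        and reduce with a small 2D CNN (any callable here)."""
        total = 0.0
        for x_i, pool, cnn_i in zip(image_features, pools, cnns):
            y_bar = avg_pool_3d(y_prop, pool)          # (h_i, w_i, d_i)
            total += cnn_i(np.concatenate([x_i, y_bar], axis=-1))
        return total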

This means only the average pooling must be done at each inner optimization step and resolution. Although this pooling operation still scales proportionally to the number of voxels and inner optimization steps (O(TN³)), GPU pooling implementations are relatively fast and the operation introduces no additional parameters.

While this means our method requires knowledge of the intrinsic camera parameters, we argue the choice of frame is arbitrary. Our method does not explicitly use the pose of the camera in its inference, and while the dataset transformation discussed above results in a slightly different problem compared to other approaches in the literature, we do not believe this puts us at an unfair advantage. On the contrary, the transformation results in a more varied dataset, and we demonstrate experimentally that traditional approaches perform slightly worse in this environment.

Our energy architecture is illustrated in Figure 5.

6.2. Outer Loss

For training, we experimented with two different per-step outer losses. Firstly, we consider an α-balanced focal loss [29] based on cross-entropy,

λ̂_CE(ỹ, y) = −Σ_v [ y_v (1 − ỹ_v)^γ (1 + α) log(ỹ_v) + (1 − y_v) ỹ_v^γ (1 − α) log(1 − ỹ_v) ],   (11)

where summation is over all voxels v. This is a generalization of standard cross-entropy (which is recovered by setting γ = α = 0) designed to alleviate issues with class imbalance. α ∈ (0, 1) results in additional focus on positive examples, while γ > 0 results in reduced focus on easy examples like those associated with the outside (usually empty) or very center (usually filled) of the voxel grid.

Secondly, we experiment with a continuous intersection-over-union implementation similar to that proposed by Richter and Roth [39],

λ̂_IoU(ỹ, y) = 1 − (ỹ · y) / (||ỹ + y||₁ − ỹ · y).   (12)
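For reference, both per-step outer losses (Equations 11 and 12) can be written in a few lines; the sketch below assumes flattened voxel probability and label arrays:

    import numpy as np

    def alpha_balanced_focal_loss(y_pred, y_true, alpha=0.7, gamma=2.0, eps=1e-7):
        """Per-step outer loss of Eq. 11 on flattened voxel probabilities.
        Setting alpha = gamma = 0 recovers standard cross-entropy."""
        y_pred = np.clip(y_pred, eps, 1.0 - eps)
        pos = y_true * (1.0 - y_pred) ** gamma * (1.0 + alpha) * np.log(y_pred)
        neg = (1.0 - y_true) * y_pred ** gamma * (1.0 - alpha) * np.log(1.0 - y_pred)
        return -np.sum(pos + neg)

    def continuous_iou_loss(y_pred, y_true):
        """Continuous intersection-over-union loss of Eq. 12."""
        intersection = np.sum(y_pred * y_true)
        return 1.0 - intersection / (np.sum(y_pred + y_true) - intersection)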

6.3. Implementation Details

We experimented with two architectures: a small network with an encoder based on MobilenetV2 (MN) [42], and another larger network based on Inception-V4 (I4) [44]. Image decoding networks built off the encoder network following a typical U-Net architecture common in the literature [41, 32, 35]. For the initial estimate, we used the output of a 3D deconvolution network based on the generator of Wu et al. [53] with one fewer layer, producing an output of resolution 32³. We then trilinearly upsampled to the required resolution.

The inner-loop CNNs (CNN_i) each consist of two 3×3 2D convolutions without padding, except at the lowest resolution, which was a 3×3 followed by a 2×2, with softplus and softabs activations.

Our inner optimizer used a learned learning rate and gradient clip value. We observed no significant difference with momentum, so did not include it in experiments.

We used a baseline 3D deconvolutional network for low-resolution comparison (32³) similar to the initial estimate network, except we doubled the number of features to keep the number of trained parameters comparable.

An overview of feature sizes and parameter counts is given in Table 2. Additional details and network diagrams are provided in the supplementary material.

Table 2: Network specification summary. Parameter counts are for 32³ networks – image decoder and inner-loop CNN parameter counts increase negligibly for higher resolutions.

                                    Base                          IGE
                                    MN2         I4                MN2         I4
    Image Encoder    Output size    1280        1536              4²×320      4²×1536
                     Parameters     2,223,872   54,276,192        1,811,712   54,276,192
    3D Decoder       Initial size   4³×128      4³×512            4³×64       4³×256
                     Parameters     2,656,113   14,159,297        238,009     3,802,849
    Image Decoder    Initial size   -           -                 4²×128      4²×512
                     Parameters     -           -                 140,992     2,928,384
    Inner-loop CNN   Initial size   -           -                 4²×256      4²×1024
                     Parameters     -           -                 1,109,840   16,573,760
    Inner Optimizer  Parameters     -           -                 2           2
    Total            Parameters     4,879,985   68,435,489        3,300,555   77,581,187

6.4. Dataset

We conduct experiments on the 13 categories of the popular ShapeNet dataset [6] popularized by Choy et al. [9]. Due to difficulties reconciling the rendering parameters, images and models supplied by the authors, we use our own renderings and voxelizations. As per Choy et al., each model was rendered from 24 different camera positions with azimuth angle uniformly sampled from [0°, 360°) and elevation angle in [25°, 30°], at a resolution of 128×128.

We created voxel grids by defining any voxel intersected by a face as filled. This means thin structures take up disproportionately large volumes at low resolutions. This is different to approaches which take a less strict approach, which may preserve a better overall volume ratio but risk losing thin structures entirely. This difference can affect low resolution grids significantly, though the difference becomes insignificant at higher resolutions. After initial voxelization, grids were filled in consistent with the approach used by Johnston et al. [23].

6.5. Results

Images of two of our models' inferences are shown in Figure 6 – our smaller model trained with α-balanced cross-entropy loss and the larger model with continuous IoU. Unsurprisingly, both models learn to space-carve very well, featuring virtually no voxels along rays that miss the object. The IoU-trained model appears to be more conservative when it comes to thin structures, while the α-balanced inferences often display slight shadowing along rays. This often results in more realistic looking inferences despite a lower average IoU score.

Quantitatively, we first investigate the performance of the models and the effect of the frustum grid at 32³ resolution. We compare against R2N2 [9] – a standard benchmark – along with other approaches designed for high resolution inference: Octree-Generating Networks (OGN) [45] and Matryoshka networks (Mat.) [39].

Table 3: IoU values (in %) at 32³ resolution. IGE models were trained with continuous IoU loss from Equation 12. Mean values are calculated by class. A single model was trained across all categories for each of our columns.

                  Frustum Dataset                   World Aligned Dataset
                  IGE-MN  IGE-I4  Base-MN  Base-I4  Base-MN  Base-I4  R2N2 [9]  OGN [45]  Mat. [39]
    plane         59.6    62.4    49.2     50.2     55.0     62.6     51.3      58.7      64.7
    bench         52.4    55.2    47.3     47.9     52.8     58.1     42.1      48.1      57.7
    cabinet       73.6    74.9    70.6     71.3     72.1     74.9     71.6      72.9      77.6
    car           78.4    79.9    74.2     73.5     77.2     76.9     79.8      81.6      85.0
    telephone     69.9    72.2    65.4     64.5     70.9     70.3     66.1      70.2      75.6
    chair         57.0    60.1    52.4     53.6     55.0     60.7     46.6      48.3      54.7
    sofa          69.6    71.2    65.9     66.8     66.7     69.8     62.8      64.6      68.1
    rifle         60.6    62.6    47.8     50.0     55.0     60.2     54.4      59.3      61.6
    lamp          54.0    56.5    47.5     50.1     48.7     50.8     38.1      39.8      40.8
    monitor       58.8    60.7    53.5     55.4     54.7     60.0     46.8      50.2      53.2
    speaker       74.5    76.5    72.4     72.8     70.6     72.4     66.2      63.7      70.1
    table         57.4    60.6    52.9     54.3     57.8     61.0     51.3      53.6      57.3
    watercraft    61.9    64.0    55.5     56.6     54.8     60.0     51.3      63.0      59.1
    mean          63.7    65.9    58.0     59.0     60.8     64.4     56.0      59.5      63.5

Intersection-over-union (IoU) values are shown in Table 3. Baseline models trained on the world-aligned grid consistently out-perform those trained on the frustum grid by a small margin. This suggests the patterns present in the frustum dataset are harder to learn than those in the regular dataset. This is not surprising, as there is significantly more variety in the frustum voxel grid dataset (1 grid per view, rather than 1 grid per model). For example, almost all planes in the world-aligned dataset have long fuselages and angled wings. A model that learns to identify planes could do reasonably well at low resolutions by simply inferring the class average rather than taking into account fine-level detail. To do similarly well on the frustum dataset, the model would need to additionally infer the camera position and learn to transform the average grid values accordingly.

While this means subsequent comparison to other methods trained on world-aligned grids is not truly fair, we include their results anyway. We believe this is more informative than only using self-comparisons, so long as they are interpreted with this disclaimer in mind.

Our multi-level optimization approach clearly out-performs the baseline on the same dataset across all categories and both image networks. It also out-performs the base method on the easier world-aligned dataset, along with all other competing methods considered on average.

Unsurprisingly, our larger model out-performs the smaller one in all categories, regardless of the model architecture.

To better understand the effect of the loss functions involved, we trained models at various resolutions with continuous IoU loss compared to models trained with different versions of Equation 11: base cross entropy (α = 0, γ = 0), reweighted cross entropy (α = 0.7, γ = 0) and focal loss (α = 0, γ = 2). Results are provided in Table 4.

Table 4: Mean IoU (in %, averaged over categories) for our IGE models trained with different losses.

            IGE-MN                                  IGE-I4
            cont. IoU  α=γ=0  α=0.7  γ=2            cont. IoU  α=γ=0  α=0.7  γ=2
    32³     63.5       60.7   58.0   59.8           66.0       61.9   59.6   64.0
    64³     61.5       56.8   57.1   56.9           64.7       60.8   59.3   59.3
    128³    58.9       51.6   54.5   53.9           62.2       56.4   56.8   57.1

Continuous IoU loss gives superior metric scores to all variations of cross-entropy. There is no clear winner amongst the cross-entropy variants.

Finally, we consider how our continuous IoU model performs at a resolution of 256³. Results for models trained at different resolutions and then linearly interpolated are given in Table 5. We trained a single model on all 13 categories, as well as a separate model for each of cars, planes and tables for fair comparison with other work.

Table 5: Mean IoU (in %) trained at different resolutions and evaluated at 256³ for models trained across all categories (13) and per-category (1). Per-category break-down of 13-category models available in supplementary.

                    car                       plane                     table
    Resolution      32    64    128   256     32    64    128   256     32    64    128   256
    OGN [45]        64.1  77.1  78.2  76.6    -     -     -     -       -     -     -     -
    Mat. [39]       68.3  78.4  79.4  79.6    36.7  48.8  58.0  59.6    38.6  42.3  43.5  41.3
    IGE-MN (13)     57.8  68.8  72.8  73.3    29.6  44.8  52.9  54.4    33.6  44.0  47.8  48.2
    IGE-I4 (13)     57.9  70.9  74.0  75.2    30.5  47.8  57.5  57.3    34.8  46.5  52.7  50.5
    IGE-MN (1)      57.0  70.3  76.2  75.2    30.7  47.9  58.7  58.1    33.6  45.9  50.6  50.2
    IGE-I4 (1)      58.4  71.2  76.5  76.5    30.1  49.2  60.5  62.0    35.0  46.4  52.2  52.1

Our networks all perform comparably on cars and planes, with our larger network performing slightly better, and category-specific training also improving things slightly. We significantly out-perform other methods on the table category, where the space-carving ability of our network can extract high-precision corners and edges and accurately reconstruct many thin structures.

Poor performance of low resolution models when evaluated at high resolutions is clear for our models. We attribute this to the large change in the volume of these structures as resolution increases, a result of our voxelization strategy.

A small performance regression is observed going from 128³ to 256³ in most experiments on our 13-category models. This is consistent with the observation made in OGN [45], who demonstrate that training on a more limited dataset results in improved performance with resolution, while more varied datasets are hindered by increased resolution.

Figure 6: Sample results for the IGE-MN model trained with α = 0.7 loss at 128³ resolution and the IGE-I4 model trained with continuous IoU at 256³. For each block of 6 – top row: (left) input image, (middle) inference (blue) with projected ground truth silhouettes (gray), (right) same as middle except the I4 network trained with cont. IoU loss. Bottom row: (left) ground truth object, (middle) MN inference, (right) I4 inference.

Unlike OGN, our regression occurs when training on the cross-category dataset, whereas theirs is apparent when training on the cars dataset.

7. Conclusion

We have demonstrated that energy-based multi-level optimization networks can take advantage of computer graphics principles to infer 3D information from 2D inputs. Our human pose dimension-lifting model performed comparably to networks with orders of magnitude more parameters and with a fraction of the number of operations. We investigated two 3D reconstruction networks, and showed competitive results could be achieved with a relatively small network, while a larger network could out-perform other state-of-the-art high-resolution networks.

This research was supported by the Australian Research Council through the grant ARC FT170100072.

References of the thirteenth international conference on artificial intel- ligence and statistics, pages 249–256, 2010.4, 11 [1] B. Amos and J. Z. Kolter. Optnet: Differentiable opti- [17] J. Gwak, C. B. Choy, A. Garg, M. Chandraker, and mization as a layer in neural networks. arXiv preprint S. Savarese. Weakly supervised generative adversarial net- arXiv:1703.00443, 2017.2 works for 3D reconstruction. In 3DV, 2017.2 [2] D. Belanger and A. McCallum. Structured prediction energy [18]C.H ane,¨ S. Tulsiani, and J. Malik. Hierarchical surface pre- networks. In International Conference on Machine Learn- diction for 3D object reconstruction. In 3DV, 2017.2 ing, pages 983–992, 2016.2 [19] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. [3] D. Belanger, B. Yang, and A. McCallum. End-to-end learn- Human3. 6m: Large scale datasets and predictive meth- ing for structured prediction energy networks. In Proceed- ods for 3d human sensing in natural environments. IEEE ings of the 34th International Conference on Machine Learn- transactions on pattern analysis and machine intelligence, ing - Volume 70, ICML’17, pages 429–439. JMLR.org, 2017. 36(7):1325–1339, 2014.2,4 2 [20] D. J. Rezende, S. M. A. Eslami, S. Mohamed, P. Battaglia, [4] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, M. Jaderberg, and N. Heess. Unsupervised learning of 3D and M. J. Black. Keep it smpl: Automatic estimation of 3d structure from images. In NIPS, 2016.2 human pose and shape from a single image. In European [21] D. Jack, F. Maire, A. Eriksson, and S. Shirazi. Adversarially Conference on Computer Vision, pages 561–578. Springer, parameterized optimization for 3d human pose estimation. 2016.2 In 3D Vision (3DV), 2017 Fifth International Conference on. [5] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi- IEEE, 2017.2 person 2d pose estimation using part affinity fields. In CVPR, [22] D. Jack, J. K. Pontes, S. Sridharan, C. Fookes, S. Shi- 2017.2 razi, F. Maire, and A. Eriksson. Learning free-form de- [6] A. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, formations for 3d object reconstruction. arXiv preprint Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. arXiv:1803.10932, 2018.2 Shapenet: An information-rich 3d model repository., corr [23] A. Johnston, R. Garg, G. Carneiro, I. D. Reid, and A. van den abs/1512.03012. URL http://arxiv. org/abs/1512.03012.6 Hengel. Scaling cnns for high resolution volumetric recon- [7] C.-H. Chen and D. Ramanan. 3d human pose estimation= struction from a single image. In ICCV Workshops, pages 2d pose estimation+ matching. In CVPR, volume 2, page 6, 930–939, 2017.2,6 2017.2 [24] A. Kar, C. Hane,¨ and J. Malik. Learning a multi-view stereo [8] I. Cherabier, C. Hane,¨ M. R. Oswald, and M. Pollefeys. machine. In NIPS, 2017.2 Multi-label semantic 3D reconstruction using voxel blocks. [25] D. P. Kingma and J. Ba. Adam: A method for stochastic In 3DV, 2016.2 optimization. arXiv preprint arXiv:1412.6980, 2014. 11 [9] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d- [26] A. Kurenkov, J. Ji, A. Garg, V. Mehta, J. Gwak, C. B. Choy, r2n2: A unified approach for single and multi-view 3d object and S. Savarese. DeformNet: Free-form deformation net- reconstruction. In European conference on computer vision, work for 3d shape reconstruction from a single image. vol- pages 628–644. Springer, 2016.2,6,7 ume abs/1708.04672, 2017.2 [10] X. Chu, W. Ouyang, H. Li, and X. Wang. Structured feature [27] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang. 
learning for pose estimation. In Proceedings of the IEEE A tutorial on energy-based learning. Predicting structured Conference on Computer Vision and Pattern Recognition, data, 1(0), 2006.1,2 pages 4715–4723, 2016.2 [28] C.-H. Lin, C. Kong, and S. Lucey. Learning efficient point [11] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and cloud generation for dense 3D object reconstruction. In X. Wang. Multi-context attention for human pose estima- AAAI, 2018.2 tion. In Proceedings of the IEEE Conference on Computer [29] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar.´ Vision and Pattern Recognition, pages 1831–1840, 2017.2 Focal loss for dense object detection. arXiv preprint [12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei- arXiv:1708.02002, 2017.6 Fei. Imagenet: A large-scale hierarchical image database. [30] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra- In Computer Vision and Pattern Recognition, 2009. CVPR manan, P. Dollar,´ and C. L. Zitnick. Microsoft coco: Com- 2009. IEEE Conference on, pages 248–255. Ieee, 2009. 11 mon objects in context. In European conference on computer [13] J. Domke. Generic methods for optimization-based model- vision, pages 740–755. Springer, 2014.2,4 ing. In Artificial Intelligence and Statistics, pages 318–326, [31] J. Liu, F. Yu, and T. A. Funkhouser. Interactive 3D modeling 2012.2 with a generative adversarial network. In 3DV, 2017.2 [14] H. Fan, H. Su, and L. J. Guibas. A point set generation net- [32] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional work for 3D object reconstruction from a single image. In networks for semantic segmentation. In Proceedings of the CVPR, 2017.2 IEEE conference on computer vision and pattern recogni- [15] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta. tion, pages 3431–3440, 2015.6 Learning a predictable and generative vector representation [33] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple for objects. In ECCV, 2016.2 yet effective baseline for 3d human pose estimation. In Inter- [16] X. Glorot and Y. Bengio. Understanding the difficulty of national Conference on Computer Vision, volume 1, page 5, training deep feedforward neural networks. In Proceedings 2017.2,3,4,5 70 CHAPTER 5. INVERSE GRAPHICS ENERGY NETWORKS

[34] F. Moreno-Noguer. 3d human pose estimation from a single [49] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong. O- image via distance matrix regression. In Computer Vision CNN: Octree-based convolutional neural networks for 3D and Pattern Recognition (CVPR), 2017 IEEE Conference on, shape analysis. In SIGGRAPH, 2017.2 pages 1561–1570. IEEE, 2017.3 [50] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Con- [35] A. Newell, K. Yang, and J. Deng. Stacked hourglass net- volutional pose machines. In Proceedings of the IEEE Con- works for human pose estimation. In European Conference ference on Computer Vision and Pattern Recognition, pages on Computer Vision, pages 483–499. Springer, 2016.2,4,6 4724–4732, 2016.2 [36] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep [51] J. Wu, Y. Wang, T. Xue, X. Sun, W. T. Freeman, and J. B. learning on point sets for 3D classification and segmentation. Tenenbaum. MarrNet: 3D Shape Reconstruction via 2.5D In CVPR, 2017.2 Sketches. In NIPS, 2017.2 [37] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. [52] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Tor- Guibas. Volumetric and multi-view CNNs for object clas- ralba, and W. T. Freeman. Single image 3D interpreter net- sification on 3D data. In CVPR, 2016.2 work. In ECCV, 2016.2 [38] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep [53] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. hierarchical feature learning on point sets in a metric space. Learning a probabilistic latent space of object shapes via 3d In NIPS, 2017.2 generative-adversarial modeling. In Advances in Neural In- [39] S. R. Richter and S. Roth. Matryoshka networks: Predicting formation Processing Systems, pages 82–90, 2016.2,6 3d geometry via nested shape layers. In Proceedings of the [54] Z. Wu, S. Song, A. Khosla, X. Tang, and J. Xiao. 3D IEEE Conference on Computer Vision and Pattern Recogni- ShapeNets: A deep representation for volumetric shapes. In tion, pages 1936–1944, 2018.2,6,7 CVPR, 2015.2 [40] G. Riegler, A. O. Ulusoy, and A. Geiger. OctNet: Learning [55] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspec- deep 3D representations at high resolutions. In CVPR, 2017. tive transformer nets: Learning single-view 3D object recon- 2 struction without 3D supervision. In NIPS, 2016.2 [41] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convo- [56] W. Yang, W. Ouyang, X. Wang, J. Ren, H. Li, and X. Wang. lutional networks for biomedical image segmentation. In 3d human pose estimation in the wild by adversarial learning. International Conference on Medical image computing and In Proceedings of the IEEE Conference on Computer Vision computer-assisted intervention, pages 234–241. Springer, and Pattern Recognition, volume 1, 2018.2 2015.6 [57] Y. Yang, C. Feng, Y. Shen, and D. Tian. Foldingnet: Point [42] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. cloud auto-encoder via deep grid deformation. In Proc. IEEE Chen. Mobilenetv2: Inverted residuals and linear bottle- Conf. on Computer Vision and Pattern Recognition (CVPR), necks. In Proceedings of the IEEE Conference on Computer volume 3, 2018.2 Vision and Pattern Recognition, pages 4510–4520, 2018.6, [58] Y. Yang and D. Ramanan. Articulated pose estimation with 11 flexible mixtures-of-parts. In Computer Vision and Pat- [43] A. Sharma, O. Grau, and M. Fritz. VConv-DAE: Deep vol- tern Recognition (CVPR), 2011 IEEE Conference on, pages umetric shape learning without object labels. In ECCVW, 1385–1392. IEEE, 2011.2 2016.2 [59] S. 
Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, [44] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional ran- Inception-v4, inception-resnet and the impact of residual dom fields as recurrent neural networks. In Proceedings of connections on learning. In AAAI, volume 4, page 12, 2017. the IEEE international conference on computer vision, pages 6 1529–1537, 2015.1 [45] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree gen- [60] R. Zhu, H. K. Galoogahi, C. Wang, and S. Lucey. Rethinking erating networks: Efficient convolutional architectures for reprojection: Closing the loop for pose-aware shape recon- high-resolution 3d outputs. In Proc. of the IEEE Interna- struction from a single image. In NIPS, 2017.2 tional Conf. on Computer Vision (ICCV), volume 2, page 8, 2017.2,7 [46] S. Tulsiani, A. A. Efros, and J. Malik. Multi-view con- sistency as supervisory signal for learning shape and pose prediction. In Computer Vision and Pattern Regognition (CVPR), 2018.2 [47] H.-Y. F. Tung, A. W. Harley, W. Seto, and K. Fragkiadaki. Adversarial inverse graphics networks: Learning 2d-to-3d lifting and image-to-image translation from unpaired super- vision. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’17), volume 2, 2017.2 [48] A. O. Ulusoy, A. Geiger, and M. J. Black. Towards prob- abilistic volumetric reconstruction using ray potential. In 3DV, 2015.2 71

8. Supplementary Material

8.1. Voxelization

Initial voxelization was done in world coordinates and used exact voxelization, i.e. any voxel partially intersected by a face was defined as filled. We subsequently filled in hollow models by switching the state of any empty voxel without any free path to the exterior bounding box along any of the 6 rays pointing in the ±î, ±ĵ and ±k̂ directions starting at the voxel.

Frustum voxels were calculated based on these filled-in world-coordinate voxel grids of the same resolution using nearest neighbour sampling. Near and far planes were based on viewing a sphere of radius 0.5 centred at the origin. Note all reported intersection-over-union values on frustum grids have been reweighted to account for the non-uniform volume of the elements.

Table 6: Mean IoU (in %) evaluated at 256³ resolution. A single model is trained across all categories. Lower resolution inferences are trilinearly upsampled.

                  IGE-MN                        IGE-I4
                  32³    64³    128³   256³     32³    64³    128³   256³
    plane         29.6   44.8   52.9   54.4     30.5   47.8   57.5   57.3
    bench         25.5   32.8   35.4   34.6     26.1   37.1   41.2   38.4
    cabinet       58.0   66.7   69.0   69.4     59.5   68.6   71.9   70.2
    car           57.8   68.8   72.8   73.3     57.9   70.9   74.0   75.2
    telephone     54.9   66.3   69.7   68.5     55.0   66.6   73.3   70.0
    chair         37.4   46.3   49.2   48.1     38.7   50.1   53.7   51.3
    sofa          55.5   63.9   66.5   65.8     56.0   66.7   70.0   68.2
    rifle         32.3   44.2   51.0   50.1     32.1   47.4   54.6   51.9
    lamp          27.6   35.9   39.5   37.8     29.2   38.7   42.4   39.6
    monitor       41.1   48.7   51.2   51.0     41.7   52.2   54.9   51.2
    speaker       61.6   69.1   71.5   71.2     63.6   71.4   74.4   73.3
    table         33.6   44.0   47.8   48.2     34.8   46.5   52.7   50.5
    watercraft    41.5   52.1   56.0   55.4     41.6   54.6   59.2   57.1
    mean (cat.)   42.8   52.6   56.3   56.0     43.6   55.3   60.0   58.0
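A hedged NumPy sketch of the cavity-filling rule described above – an empty voxel is switched to filled if each of the six axis-aligned rays starting at it hits a filled voxel before leaving the bounding box:

    import numpy as np

    def fill_hollow(occ):
        """Fill internal cavities in a boolean occupancy grid (Section 8.1).

        An empty voxel is filled if every one of the 6 axis-aligned rays
        (+/-i, +/-j, +/-k) starting at it hits a filled voxel before leaving
        the bounding box, i.e. there is no free path to the exterior.
        """
        occ = occ.astype(bool)
        blocked = np.ones_like(occ)
        for axis in range(3):
            # any filled voxel at a smaller index along this axis (ray in -direction)?
            blocked &= np.maximum.accumulate(occ, axis=axis)
            # any filled voxel at a larger index along this axis (ray in +direction)?
            blocked &= np.flip(np.maximum.accumulate(np.flip(occ, axis=axis), axis=axis), axis=axis)
        return occ | blocked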

8.2. Training Details

All kernels of all networks except the inner-loop CNNs had L2 regularization with weight 5×10⁻⁴. No regularization was applied to the inner-loop CNNs.

Networks were trained at 32³, 64³, 128³ and 256³. All except the highest resolution were trained with a batch size of 16 for 100,000 steps. The image encoder for the 32³ model was initialized with publicly available weights trained on ImageNet [12]. Other convolutional kernels were initialized with Glorot uniform initialization [16], except the final kernel of each loss network, which was initially a factor of 10⁻³ lower as before. Higher resolution networks were initialized from their lower resolution counterparts.

To allow models to be trained on a desktop GPU (Nvidia GTX 1080-Ti), the highest resolution networks (256³) used a batch size of 4. We turned off batch normalization for this final model to avoid spurious batch statistics due to the reduced batch size. We observed very little improvement beyond the first few thousand steps, so terminated training after 20,000 steps.

We used Adam [25] for our outer optimizer. Our 32³ models used a learning rate of 5×10⁻⁵, while higher resolution networks used 2×10⁻⁵.

[Figure 7: network architecture diagrams – per-layer feature sizes omitted here.]

Figure 7: Left: Image Feature Network for the IGE-MN model. The left column is the standard MobilenetV2 convolutional network [42] and is shared with the initial estimate network. Dashed arrows represent bilinear resizing. x_i values are the result of a 1×1 convolution on the concatenated inputs followed by batch normalization and rectified linear activation. Right: Initial Estimate Network for the MN model. The left column is shared with the image feature network. Dashed arrows represent 4³ deconvolutions with isotropic stride 2 followed by batch normalization and a rectified linear activation. The dotted line is a reshape.

[Figure 8: diagram of the initial decoding transformation – layer sizes omitted here.]

Figure 8: Initial decoding transformation for the baseline voxel MN decoder. Operations left-to-right from the embedding layer: dense layer, reshape, split/reshape, addition with dimension broadcasting. Note there are no learned parameters due to any dotted arrows.

Chapter 6

Sparse Convolutions on Continuous Domains for Point Cloud and Event Stream Networks

“It’s fine to work on any problem, so long as it generates interesting mathematics along the way – even if you don’t solve it at the end of the day.” — Andrew Wiles

On the surface, our fourth contribution has a fairly tenuous link to the rest of this thesis. While our other contributions have all focused on inferring 3D representations from 2D inputs, this work looks at extracting features from irregular 3D data sources – point clouds and event streams – useful for tasks such as classification and semantic segmentation.

To understand the relevance to this thesis one must appreciate this was not the publication that was originally envisioned. Rather, the initial plan was to implement an object-reconstruction IGE-Net using point clouds rather than volumetric occupancy grids. This would alleviate scaling issues by focusing on the surface rather than the entire volume, and greatly simplify the projection operator compared to probabilistic occupancy grids.

While we believe point cloud representations are well suited for 3D reconstruction with IGE-Nets, it quickly became apparent there was one major problem: there is no universally accepted point cloud convolution operation similar to voxel/image convolutions in deep learning, and popular methods in the literature are non-hierarchical or discontinuous in cloud coordinates, making them incompatible with IGE-Net energy functions.

This contribution originally aimed to plug that gap in the literature, presenting a point cloud convolution operator that is both hierarchical and continuous with respect to input coordinates. Upon implementing and appreciating its capacity to scale, we decided to extend it from point clouds to event camera streams. While point clouds and event streams might seem quite different input types, we can think of events as 3D points in (x, y, t)-space. While there are obvious differences between the two sources, both represent irregular sparse data types with at least one continuous dimension.


Statement of Contribution of Co-Authors for Thesis by Published Paper

The authors listed below have certified that:

1. they meet the criteria for authorship in that they have participated in the conception, execution, or interpretation, of at least that part of the publication in their field of expertise;

2. they take public responsibility for their part of the publication, except for the responsible author who accepts overall responsibility for the publication;

3. there are no other authors of the publication according to these criteria;

4. potential conflicts of interest have been disclosed to (a) granting bodies, (b) the editor or publisher of journals or other publications, and (c) the head of the responsible academic unit; and

5. they agree to the use of the publication in the student’s thesis and its publication on the QUT’s ePrints site consistent with any limitations set by publisher requirements.

In the case of this chapter: Sparse Convolutions on Continuous Domains for Point Cloud and Event Stream Networks, submitted to the European Conference on Computer Vision (ECCV), Glasgow, Scotland, 2020.

Dominic Jack – Experiment design, model implementation, write-up. Signature: QUT Verified Signature, 26 May 2020.
Frederic Maire – Advised on model design and paper editing.
Simon Denman – Advised on model design and paper editing.
Anders Eriksson – Advised on model design and paper editing.

Principal Supervisor Confirmation

I have sighted email or other correspondence from all Co-authors confirming their certifying authorship. (If the Co-authors are not able to sign the form please forward their email or other correspondence confirming the certifying authorship to the RSC.)

Name: Frederic Maire    Signature: QUT Verified Signature    Date: 26 May 2020

Sparse Convolutions on Continuous Domains for Point Cloud and Event Stream Networks

Dominic Jack¹ (0000-0002-8371-3502), Frederic Maire¹ (0000-0002-6212-7651), Simon Denman¹ (0000-0002-0983-5480), and Anders Eriksson² (0000-0003-2652-7110)

¹ Queensland University of Technology, QLD, Australia
{d1.jack, f.maire, s.denman}@qut.edu.au
² University of Queensland, QLD, Australia
[email protected]

Abstract. Image convolutions have been a cornerstone of a great number of deep learning advances in computer vision. The research community is yet to settle on an equivalent operator for sparse, unstructured continuous data like point clouds and event streams however. We present an elegant sparse matrix-based interpretation of the convolution operator for these cases, which is consistent with the mathematical definition of convolution and efficient during training. On benchmark point cloud classification problems we demonstrate networks built with these operations can train an order of magnitude or more faster than top existing methods, whilst maintaining comparable accuracy and requiring a tiny fraction of the memory. We also apply our operator to event stream processing, achieving state-of-the-art results on multiple tasks with streams of hundreds of thousands of events.

Keywords: Convolution, Point Clouds, Event Cameras, Deep Learning

1 Introduction

Deep learning has exploded in popularity since AlexNet [1] achieved ground-breaking results in image classification [2]. The field now boasts state-of-the-art performance in fields as diverse as [3], natural language processing [4], and molecular design [5].

Robotics [6] applications are of particular interest due to their capacity to revolutionize society in the near future. Driverless cars [7] specifically have attracted enormous amounts of research funding, with advanced systems being built with multi-camera setups [8], active LiDAR sensors [9], and sensor fusion approaches [10].

At the other end of the spectrum, small mobile robotics applications and mobile devices benefit from an accurate 3D understanding of the world. These platforms generally don't have access to large battery stores or computationally hefty hardware, so efficient computation is essential. Even where compute is available, the cost of energy alone can be prohibitive, and the research community is beginning to appreciate the environmental cost of training massive power-hungry algorithms in data centres [11].


The convolution operator has been a critical component of almost all recent advances in deep learning for computer vision. However, implementations designed for use with images cannot be used for data types that are not defined on a regular grid. Consider for example event cameras, a new type of sensor which shows great promise, particularly in the area of mobile robotics. Rather than reporting average intensities of every pixel at a uniform frame rate, pixels in an event camera fire individual events when they observe an intensity change. The result is a sparse signal with very fast response time, high dynamic range and low power usage. Despite the potential, this vastly different data encoding means that a traditional 2D convolution operation is no longer appropriate.

Fig. 1: Learned image convolutions can be thought of as linear combinations of static basis convolutions, where the linear combination is learned. Each basis convolution can be expressed as a sparse-dense matrix product. We take the same approach with point clouds and event streams.

In this work, we investigate how the convolution operator can be applied to two non-image input sources: point clouds and event streams. In particular, our contributions are as follows.

1. We implement a convolution operator for sparse inputs on continuous domains using only matrix products and addition during training. While others have proposed such an operator, we believe we are the first to implement one without compromising the mathematical definition of convolution.
2. We discuss implementation details essential to the feasible training and deployment of networks using our operator on modest hardware. We demonstrate speed-ups of an order of magnitude or more compared to similar methods with a memory footprint that allows for batch sizes in the thousands.
3. For point clouds, we discuss modifications that lead to desirable properties like robustness to varying density and continuity, and demonstrate that relatively small convolutional networks can perform competitively with much larger, more expensive networks.


4. For event streams, we demonstrate that convolutions can be used to learn features from spiking network spike trains. By principled design of our kernels, we propose two implementations of the same networks: one for learning that takes advantage of modern accelerator hardware, and another for asynchronous deployment which can provide features or inferences associated with events as they arrive. We demonstrate the effectiveness of our learned implementation by achieving state-of-the-art results on multiple classification benchmarks, including a 44% reduction in error rate on sign language classification [12].

2 Prior Work

Point Clouds  Early works in point cloud processing – Pointnet [13] and Deep Sets [14] – use point-wise shared subnetworks and order invariant pooling operations. The successor to Pointnet, Pointnet++ [15], was (to the best of our knowledge) the first to take a hierarchical approach, applying Pointnet submodels to local neighborhoods. SO-Net [16] takes a similar hierarchical approach to Pointnet++, though uses a different method for sampling and grouping based on self-organizing maps.

DGCNN [17] applies graph convolutions to point clouds with edges based on spatial proximity. KCNet [18] uses dynamic kernel points in correlation layers that aim to learn features that encapsulate the relationships between those kernel points and the input cloud. While most approaches treat point clouds as unordered sets by using order-invariant operations, PointCNN [19] takes the approach of learning a canonical ordering over which an order-dependent operation is applied. SpiderCNN [20] and FlexConv [21] each bring their own unique interpretation to generalizing image convolutions to irregular grids. While SpiderCNN focuses on large networks for relatively small classification and segmentation problems, FlexConv utilizes a specialized GPU kernel to apply their method to point clouds with millions of points.

Event Stream Networks  Compared to standard images, relatively little research has been done with event networks. Interest has started to grow recently with the availability of a number of event-based cameras [22, 23] and publicly available datasets [23–26, 12].

A number of approaches utilize the extensive research in standard image processing by converting event streams to images [25, 27]. While these can leverage existing libraries and cheap hardware optimized for image processing, the necessity to accumulate synchronous frames prevents them from taking advantage of many potential benefits of the data format. Other approaches look to biologically-inspired spiking neural networks (SNNs) [28–30]. While promising, these networks are difficult to train due to the discrete nature of the spikes.

Other notable approaches include the work of Lagorce et al. [31], who introduce hierarchical time-surfaces to describe spatio-temporal patterns; Sironi et al. [26], who show histograms of these time surfaces can be used for object classification; and Bi et al. [12], who use graph convolution techniques operating over graphs formed by connecting close events in space-time.

Sparse Convolutions  Sparse convolutions have been used in a variety of ways in deep learning before. Liu et al. [32] and Park et al. [33] demonstrate improved speed from using implementations optimized for sparse kernels on discrete domains, while there are various voxel-based approaches [34–36] that look at convolutions on discrete sparse inputs and dense kernels. Other approaches involve performing dense discrete convolutions on interpolated projections [37, 38].

3 Method Overview

For simplicity, we formulate continuous domain convolutions in the context of physical point clouds in Section 3.1, before modifying the approach for event streams in Section 3.2. A summary of notation used in this section is provided in the supplementary material.

3.1 Point Cloud Convolutions

We begin by considering the mathematical definition of a convolution of a function h with a kernel g,

(h ∗ g)(t) = ∫_D h(τ) g(t − τ) dτ.   (1)

We wish to evaluate the convolution of a function with values defined at fixed points x_j in an input cloud X of size S, at a finite set of points x'_i in an output cloud X' of size S'. We denote a single feature for each point in these clouds f ∈ R^S and f' ∈ R^(S') respectively. For the moment we assume coordinates for both input and output clouds are given. In practice it is often the case that only the input coordinates are given. We discuss choices of output clouds in subsequent sections.

By considering our convolved function h to be the sum of scaled Dirac delta functions δ centred at the point coordinates,

h(x) = Σ_j f_j δ(x − x_j),   (2)

Equation 1 reduces to

f'_i = Σ_{x_j ∈ N_i} f_j g(x'_i − x_j),   (3)

where N_i is the set of points in the input cloud within some neighborhood of the output point x'_i. We refer to pairs of points {x_j, x'_i} where x_j ∈ N_i as an edge, and the difference in coordinates ∆x_ij = x'_i − x_j as the edge vector.
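The following is a direct (non-vectorized) NumPy/SciPy sketch of Equation 3 with a ball neighborhood and the kernel expansion introduced below; the degree-1 monomial basis shown is only an example choice, and all function names are illustrative assumptions.

    import numpy as np
    from scipy.spatial import cKDTree

    def monomials(dx):
        """Example degree <= 1 monomial basis p_m evaluated at a 3D edge
        vector dx: [1, x, y, z]. The basis choice is a hyper-parameter."""
        return np.concatenate([[1.0], dx])

    def point_conv(x_in, f_in, x_out, theta, radius):
        """Direct evaluation of Eq. 3 with a ball neighborhood.

        x_in: (S, 3) input coordinates, f_in: (S,) input features,
        x_out: (S', 3) output coordinates, theta: (M,) kernel weights.
        """
        tree = cKDTree(x_in)
        f_out = np.zeros(len(x_out))
        for i, xi in enumerate(x_out):
            for j in tree.query_ball_point(xi, radius):
                dx = xi - x_in[j]                      # edge vector
                f_out[i] += f_in[j] * monomials(dx) @ theta
        return f_out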


Like Groh et al. [21], we use a kernel made up of a linear combination of M unlearned basis functions p_m,

g(∆x; θ) = Σ_m p_m(∆x) θ_m,   (4)

where θ_m are learnable parameters. As with Groh et al., we use geometric monomials for our basis functions. Substituting this into Equation 3 and reordering summations yields

f'_i = Σ_m Σ_{x_j ∈ N_i} p_m(∆x_ij) f_j θ_m.   (5)

We note the inner summation can be expressed as a sparse-dense matrix product,

f' = Σ_m N^(m) f θ_m.   (6)

This is visualized in Figure 1. Neighborhood matrices N^(m) have the same sparsity structure for all m. Values n_ij^(m) are given by the corresponding basis functions evaluated at edge vectors,

n_ij^(m) = p_m(∆x_ij) if x_j ∈ N_i, and 0 otherwise.   (7)

Generalizing to multi-channel input and output features F ∈ R^(S×Q) and F' ∈ R^(S'×P) respectively, this can be expressed as a sum of matrix products,

F' = Σ_m N^(m) F Θ^(m),   (8)

where Θ^(m) ∈ R^(Q×P) is a matrix of learned parameters.

The elegance of this representation should not be understated. N^(m) is a sparse matrix defined purely by relative point coordinates and the choice of basis functions. Θ^(m) is a dense matrix of parameter weights much like traditional convolutional layers, and F' and F are feature matrices with the same structure, allowing networks to be constructed in much the same way as image CNNs.

We now identify three implementations with analogues to common image convolutions. A summary is provided in Table 1.

Down-Sampling Convolutions  Convolutions in which there are fewer output points than input points and more output channels than input channels are more efficiently computed left-to-right, i.e. as Σ_m (N^(m) F) Θ^(m). These are analogous to conventional strided image convolutions.

Up-Sampling Convolutions  Convolutions in which there are more output points than input points and fewer output channels than input channels are more efficiently computed right-to-left, i.e. Σ_m N^(m) (F Θ^(m)). These are analogous to conventional fractionally strided or transposed image convolutions.
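A sketch of the sparse-matrix form of Equations 7 and 8 using SciPy sparse matrices; in practice these products run on accelerators via framework sparse ops, but the structure is identical. Names and the edge-list format are assumptions for illustration.

    import numpy as np
    import scipy.sparse as sp

    def build_neighborhood_matrices(x_in, x_out, edges, basis_fns):
        """Build the M sparse neighborhood matrices N^(m) of Eq. 7.

        edges: list of (i, j) pairs with x_in[j] in the neighborhood of x_out[i].
        basis_fns: list of M functions, each mapping an edge vector to a scalar.
        """
        rows = np.array([i for i, _ in edges])
        cols = np.array([j for _, j in edges])
        dx = x_out[rows] - x_in[cols]                        # edge vectors
        return [sp.csr_matrix((np.apply_along_axis(p, 1, dx), (rows, cols)),
                              shape=(len(x_out), len(x_in))) for p in basis_fns]

    def sparse_conv(neighborhoods, F, thetas, down_sample=True):
        """F' = sum_m N^(m) F Theta^(m) (Eq. 8). For down-sampling layers the
        product is evaluated left-to-right, for up-sampling right-to-left."""
        out = 0
        for N, Theta in zip(neighborhoods, thetas):
            out = out + ((N @ F) @ Theta if down_sample else N @ (F @ Theta))
        return out

The choice of evaluation order in sparse_conv is exactly the distinction between the down- and up-sampling variants summarised in Table 1.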


Featureless Convolutions  The initial convolutions in image CNNs typically have large receptive fields and a large increase in the number of filters. For point clouds, there are often no input features at all – just coordinates. In this instance the convolution reduces to a sum of kernel values over the neighborhood. In the multi-input/output channel context this is given by

Z = G̃ Φ̃₀,   (9)

where Φ̃₀ is the learned parameter matrix and G̃ is a dense matrix of summed monomial values,

g̃_im = Σ_j n̂_ij^(m).   (10)

Table 1: Time complexity of different point cloud convolution operations and theoretical space complexity of intermediate terms (Mem). The matrix product for in-place convolutions can be evaluated in either order.

    Op            Cond.            Form                    Mult. Adds      Mem.
    In Place      Q = P, S' = S    Σ_m N^(m) F Θ^(m)       MP(E + SP)      SP
    Down-Sample   Q < P, S' < S    Σ_m (N^(m) F) Θ^(m)     MQ(E + S'P)     S'Q
    Up-Sample     Q > P, S' > S    Σ_m N^(m) (F Θ^(m))     MP(E + SQ)      SP
    Featureless   F = 1            G̃ Φ̃₀                    MSP             -

Neighborhoods  To be consistent with the mathematical definition of convolution, the neighborhood of each point should be fixed, which precludes the use of k-nearest neighbors (kNN), despite its prevalence in the literature [21, 20, 15, 39]. The obvious choice of a neighborhood is a ball. Equation 8 can be implemented trivially using either kNN or ball neighborhoods, though from a deep learning perspective each neighborhood has its own advantages and disadvantages.

Predictable computation time: The sparse-dense matrix products have computation proportional to the number of edges. For kNN this is proportional to the output cloud size, but is less predictable when using ball-searches.

Robustness to point density: Implementations based on each neighborhood react differently to variations in point density. As the density increases, kNN implementations shrink their receptive field. On the other hand, ball-search implementations suffer from increased computation time and output values proportional to the density.

Discontinuity in point coordinates: Both neighborhood types result in operations that are discontinuous in point coordinates. kNN convolutions are discontinuous as the kth and (k + 1)th neighbors of each point pass each other. Ball-search convolutions have a discontinuity at the ball-search radius.


Symmetry: Connectedness in ball-neighborhoods is symmetric – i.e. if x'_i ∈ N_j then x_j ∈ N_i for neighborhood functions with the same radius. This means the neighborhood matrix N_IJ between sets X_I and X_J is related to the reversed neighborhood by N_IJ = N_JIᵀ (up to a possible difference in sign due to the monomial value). This allows for shared computation between different layers.

Transposed Neighborhood Occupancy: For kNN, all neighborhood matrices are guaranteed to have k entries in each row. This guarantees there will be no empty rows, and hence no empty neighborhoods. Ball search neighborhoods do not share this property, and there is no guarantee points will have any neighbors. This is important for transposed convolutions, where points may rely on neighbors from a lower resolution cloud to seed their features.

Subsampling  Thus far we have remained silent as to how the S' output points making up X' are chosen. In-place convolutions can be performed with the same input and output clouds, but to construct networks we would like to reduce the number of points as we increase the number of channels, in a similar way to image CNNs. We adopt a similar approach to Pointnet++ [15] in that we sample a set of points from the input cloud.

Pointnet++ [15] selects points based on the first S' points in iterative farthest point (IFP) ordering, whereby a fixed number of points are iteratively selected based on being farthest from the currently selected points. For each point selected, the distance to all other points has to be computed, resulting in an O(S'S) implementation. To improve upon this, we begin by updating distances only to points within a ball neighborhood – a neighborhood that may have already been computed for the previous convolution. By storing the minimum distances in a priority queue, this sampling process still gives relatively uniform coverage like the original IFP, but can be computed in O(S'k̄), where k̄ is the average number of neighbors of each point.

We also propose to terminate searching once this set of neighborless candidates has been exhausted, rather than iterating for a fixed number of steps. We refer to this as rejection sampling. This results in point clouds of different sizes, but leads to a more consistent number of edges in subsequent neighborhood matrices. It also guarantees all points in the input cloud will have a neighbor in the output cloud. We provide pseudo-code for these algorithms and illustrations in the supplementary material.
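A simplified Python sketch of the priority-queue sampler with rejection termination described above; this is an illustrative assumption-laden helper, not the reference pseudo-code from the supplementary material.

    import heapq
    import numpy as np

    def ball_ifp_sample(coords, neighbors, radius, max_out=None):
        """Approximate farthest-point sampling with distances maintained only
        within precomputed ball neighborhoods.

        coords: (S, D) point coordinates.
        neighbors: neighbors[j] = indices of points within `radius` of point j.
        Stops once every unselected point has a selected point within its ball
        (rejection sampling), or after max_out selections.
        """
        S = len(coords)
        min_d = np.full(S, np.inf)                # distance to nearest selected point
        heap = [(-np.inf, j) for j in range(S)]   # max-heap via negated distances
        heapq.heapify(heap)
        selected = []
        while heap and (max_out is None or len(selected) < max_out):
            neg_d, j = heapq.heappop(heap)
            if -neg_d != min_d[j]:
                continue                          # stale heap entry
            if not np.isinf(min_d[j]):
                break                             # all remaining candidates are covered
            selected.append(j)
            min_d[j] = 0.0
            for k in neighbors[j]:                # update distances within the ball only
                d = np.linalg.norm(coords[k] - coords[j])
                if d < min_d[k]:
                    min_d[k] = d
                    heapq.heappush(heap, (-d, k))
        return selected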

Weighted Convolutions  To address both the discontinuity at the ball radius and the neighbor count variation inherent to using balls, we propose using a weighted average convolution by weighting neighboring values by some continuous function w which decreases to zero at the ball radius,

n̂_ij^(m) = (1 / W_i) w_ij n_ij^(m),   (11)

where w_ij = w(|∆x_ij|) and W_i = Σ_j w_ij. We use w(x) = 1 − x/r for our experiments, where r is the search radius.


Comparison to Existing Methods  We are not the first to propose hierarchical convolution-like operators for learning on point clouds. In this section we look at a number of other implementations and identify key differences.

Pointnet++ [15] and SpiderCNN [20] each use feature kernels which are non-linear with respect to the learned parameters. This means these methods have a large memory usage which increases as they create edge features from point features, before reducing those edge features back to point features.

Pointnet++ claims to use a ball neighborhood – and shows results are improved using this over kNN. However, their implementation is based on a truncated kNN search with fixed k, meaning padding edges are created in sparse regions and meaningful edges are cropped out in dense regions. The cropping is partially offset by the use of max pooling over the neighborhood and IFP ordering, since the first k neighbors found are relatively spread out over the neighborhood. As discussed however, IFP is O(SS') in time, and removing it means the results of the truncated ball search will no longer necessarily be evenly distributed. Also, the padding of sparse neighborhoods leads to an inefficient implementation, as edge features are computed despite never being used.

FlexConv [21] presents a very similar derivation to our own. However, they implement Equation 5 with a custom GPU kernel that only supports kNN.

On the whole, we are unable to find any existing learned approaches that perform true ball searches, nor make any attempt to deal with the discontinuity inherent to kNN. We accept models are capable of learning robustness to such discontinuities, but feel enforcing it at the design stage warrants consideration.

Data Pipeline There are two aspects of the data processing that are critical to the efficient implementation of our point cloud convolution networks.

Neighborhood Preprocessing The neighborhood matrices N^(m) are functions of relative point coordinates and the choice of unlearned basis functions – they do not depend on any learned parameters. This means they can be pre-computed, either online on CPUs as the previous batch utilizes available accelerators, or offline prior to training. In practice we only pre-compute the neighborhood indices and calculate the relative coordinates and basis functions on the accelerator. This additional computation on the accelerator(s) takes negligible time and reduces the amount of data that needs to be stored, loaded and shipped.
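A minimal sketch of this split is shown below, assuming a SciPy KD-tree for the CPU-side search and first-order monomial basis functions; names are illustrative and the accelerator-side step is written in NumPy purely for readability. Only integer neighbor indices are precomputed, while relative coordinates and basis values are recomputed from the raw coordinates later.

    import numpy as np
    from scipy.spatial import cKDTree

    def precompute_neighbors(in_points, out_points, radius):
        """CPU-side preprocessing: indices of in_points within `radius` of each output point."""
        tree = cKDTree(in_points)
        return tree.query_ball_point(out_points, radius)  # one index list per output point

    def edge_basis(in_points, out_points, neighbors):
        """Accelerator-side step (sketched in NumPy): relative coordinates and
        first-order monomial basis values for each output point's neighborhood."""
        features = []
        for i, idx in enumerate(neighbors):
            dx = out_points[i] - in_points[idx]  # [k_i, D] edge vectors
            # monomials up to 1st order: [1, dx_1, ..., dx_D]
            features.append(np.concatenate([np.ones((len(idx), 1)), dx], axis=1))
        return features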

Ragged Batching During the batching process, the clouds for each example – which generally contain different numbers of points – can be concatenated rather than padded to a uniform size, and the sparse neighborhood matrices block-diagonalized. For environments where fixed-sized inputs are required, additional padding can occur at the batch level rather than the individual example level, where variance in the average size will be smaller.

Unlike standard dataset preprocessing, our networks require network-specific preprocessing – preprocessing dependent on e.g. the size of the ball searches at each layer, the number of layers etc. To facilitate testing and rapid prototyping, we developed a meta-network module for creating separate pre- and post-batch preprocessing, while simultaneously building learned network stages based on a single conceptual network architecture. This is illustrated in Figure 2.
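The batching step itself amounts to block-diagonalizing the per-example sparse neighborhood matrices and concatenating the per-example features, as in the sketch below (illustrative only; it assumes each example's neighborhood matrix is stored as a SciPy sparse matrix, which is not necessarily the layout used in our implementation).

    import numpy as np
    import scipy.sparse as sp

    def batch_examples(neighborhood_mats, feature_arrays):
        """neighborhood_mats: list of [S'_b, S_b] sparse matrices, one per example
        feature_arrays:     list of [S_b, Q] dense arrays, one per example
        Returns a block-diagonal sparse matrix and the concatenated features,
        with no per-example padding required."""
        batched_n = sp.block_diag(neighborhood_mats, format="csr")
        batched_f = np.concatenate(feature_arrays, axis=0)
        return batched_n, batched_f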

[Figure 2 diagram: a single conceptual network specification is split into pre-batch (CPU) KDTree searches and sampling, post-batch (CPU) block-diagonalization, and learned featureless/down-sample/in-place convolutions on the accelerator.]

Fig. 2: (a) Conceptual network vs (b) separate computation graphs.

3.2 Event Stream Convolutions

Event streams from cameras can be thought of as 3D point clouds in (x, y, t) space. However, only the most fundamental of physicists would consider space and time equivalent dimensions, and in practice their use cases are significantly different. For event cameras in particular,

– spatial coordinates of events are discrete and occur on a fixed size grid corresponding to the pixels of the camera;
– the time coordinate is effectively continuous and unbounded; and
– events come in time-sorted order.

We aim to formulate a model with the following requirements:

– Intermediate results: we would like our model to provide a stream of predictions, updated as more information comes in, rather than having to wait for the end of a sequence before making an inference.
– Run indefinitely: we would like to deploy these models in systems which can run for long periods of time. As such, our memory and computational requirements must be O(1) and O(E) respectively, with respect to the number of events E.

Unfortunately, these requirements are difficult to enforce while making good use of modern deep learning frameworks and hardware accelerators. That said, just because we desire these properties in our end system does not mean we need them during training. By using convolutions with a domain of integration extending backwards in time only and using an exponentially decaying kernel, we can train using accelerators on sparse matrices and have an alternative deployment implementation which respects the above requirements. Formally, we propose neighborhoods defined over a small spatial neighborhood of size M_u – similar to image convolutions – extending backwards in time, with a kernel given by

g(u, Δt) = Σ_v exp(−λ_uv Δt) θ_uv,   (12)

where u corresponds to the pixel offset between events and v sums over some fixed temporal kernel size M_v, λ_uv is a learned temporal decay term enforced to be positive, θ_uv is a learned parameter and u extends over the spatial neighborhood. A temporal domain of integration extending backwards in time only ensures Δt ≥ 0, hence we ensure the effects of events on features decay over time.
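For concreteness, the sketch below evaluates Equation 12 for a single pixel offset u and a single non-negative time difference, given that offset's learned decay rates and weights. Names are hypothetical and the sketch is illustrative only.

    import numpy as np

    def event_kernel(delta_t, decay_rates, weights):
        """Equation 12 for one pixel offset u.

        delta_t:     non-negative time since the neighboring event
        decay_rates: [M_v] positive temporal decay terms lambda_uv
        weights:     [M_v] learned parameters theta_uv
        Returns the scalar kernel value g(u, delta_t)."""
        return np.sum(np.exp(-decay_rates * delta_t) * weights)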

Dual Implementations For training, the kernel function of Equation 12 can be used in Equation 5 and reduced to a form similar to Equation 8, where M = M_u M_v. This can be implemented in the same way as our point cloud convolutions. Unfortunately, this requires us to construct the entire sparse matrix, removing any chance of getting intermediate results when they are relevant, and also breaks our O(1) memory constraint with respect to the number of events. As such, we additionally propose a deployment implementation that updates features at pixels using exponential moving averages in response to events. As input events come in, we decay the current values of the corresponding pixel by the time since it was last updated and add the new input features. When the features for an output event are required, the features of the pixels in the receptive field can be decayed and then transformed by Θ^(uv), and reduced like a standard image convolution. Formally, we initialize z_x^(uv) = 0 ∈ R^Q and τ_x = 0 for all pixels x. For each input event (x, t) with features f, we perform the following updates:

z_x^(uv) ← f + exp(−λ_uv (t − τ_x)) z_x^(uv)   (13)
τ_x ← t.   (14)

Features f′ for an output event at (x′, t′) can thus be computed by

f′ = Σ_{u,v} exp(−λ_uv (t′ − τ_{x′−u})) z_{x′−u}^(uv)ᵀ Θ^(uv).   (15)

This requires O(M_u M_v Q) operations per input event, O(M_u M_v P Q) operations per output event and O(M_u M_v Q) space per pixel. Alternatively, the linear transform can be applied to f during the z_x^(uv) update (equivalent to up-sampling convolutions) for subtly different space and computational requirements. Either way, our requirements are satisfied.
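A minimal sketch of the deployment-time state and update rules (Equations 13–15) is given below, tracking a running feature z and last-update time τ for one pixel and one (u, v) kernel index. Names and data layout are illustrative assumptions rather than our released implementation.

    import numpy as np

    class StreamingPixel:
        """Exponential moving average state for one pixel and one (u, v) kernel index."""

        def __init__(self, num_channels, decay_rate):
            self.z = np.zeros(num_channels)  # z_x^(uv)
            self.tau = 0.0                   # tau_x, time of last update
            self.decay_rate = decay_rate     # lambda_uv

        def update(self, f, t):
            """Equations 13-14: decay the stored features, then add the new event's features."""
            self.z = f + np.exp(-self.decay_rate * (t - self.tau)) * self.z
            self.tau = t

        def read(self, t_out):
            """Decayed features at output time t_out, as used inside the sum of Equation 15."""
            return np.exp(-self.decay_rate * (t_out - self.tau)) * self.z

Features for an output event are then obtained by transforming each read(t′) by the corresponding Θ^(uv) and summing over the spatial and temporal kernel indices, exactly as in Equation 15.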


Subsampling As with our point cloud formulation, we would like a hierarchical model with convolutions joining multiple streams with successively lower resolution and higher dimensional features. We propose using an unlearned leaky-integrate-and-fire (LIF) model due to the simplicity of the implementation and its prevalence in SNN literature [40]. LIF models transform input spike trains by tracking a theoretical voltage at each location or “neuron”. These voltages exponentially decay over time, but are increased discontinuously by input events in some receptive field. If the voltage at a location exceeds a certain threshold, the voltage at that neuron is reset and an output event is fired. SNNs generally learn the sensitivity of each output neuron to input spikes. We take a simpler approach, using a fixed voltage increase of 1/n as a result of an input spike, where n is the number of output neurons affected by the input event. Note we do not suggest this is optimal for our use case – particularly the unlearned nature of it – though we leave additional investigation of this idea to future work.

4 Experiments

We perform experiments on various classification tasks across point clouds and event streams. We provide a brief overview of network structures here. Model diagrams and technical details about the training procedure are provided in the supplementary material.

We investigate our point cloud operator in the context of Modelnet40 [41], a 40-class classification problem with 9840 training examples and 2468 testing examples. We use the first 1024 points provided by Pointnet++ [15] and use the same point dropout, random jittering and off-axis rotation, uniform scaling and shuffling data augmentation policies.

We construct two networks based loosely on Resnet [42]. Our larger model consists of an in-place convolution with 32 output channels, followed by 3 alternating down-sampling and in-place residual blocks, with the number of filters increasing by a factor of 2 in each down-sampling block. Our in-place ball radii start at 0.1125 and increase by a factor of 2 each level. Our down-sample radii are √2 larger than the previous in-place convolution. This results in sampled point clouds with roughly 25% of the original points on average, roughly 10 neighbors per in-place output point and 20 neighbors per down-sample output point. After our final in-place convolution block we use a single point-wise convolution to increase the number of filters by a factor of 4 before max pooling across the point dimension. We feed this into a single hidden layer classifier with 256 hidden units. All convolutions use monomial basis functions up to 2nd order. We use dropout, batch normalization and L2 regularization throughout. Our smaller model is similar, but skips the initial in-place convolution and has half the number of filters at each level. Both are trained using a batch size of 128 using Adam optimizer [43] and with the learning rate reduced by a factor of 5 after 20 epochs without an improvement to training accuracy.


For event streams, we consider 5 classification tasks – N-MNIST and N-Caltech101 from Orchard et al. [24], MNIST-DVS and CIFAR10-DVS from Serrano et al. [23], and ASL-DVS from Bi et al. [12].

All our event models share the same general structure, with an initial 3×3 convolution with stride 2 followed by alternating in-place resnet/inception-inspired convolution blocks and down-sample convolutions (also 3×3 with stride 2), finishing with a final in-place block. We doubled the number of filters and the LIF decay time at each down sampling. The result is multiple streams, with each event in each stream having its own features. The features of any event in any stream could be used as inputs to a classifier, but in order to compare to other work we choose to pool our streams by averaging over (x, y, t) voxels at our three lowest resolutions. For example, our CIFAR-10 model had streams with learned features at 64×64 down to 4×4. We voxelized the 16×16 stream to 16×16×4, the 8×8 stream into an 8×8×2 grid and the final stream into a 4×4×1 grid. Each voxel grid beyond the first receives inputs from the preceding, higher-resolution voxel grid (via a strided 2×2×2 voxel convolution), and from the average of the event stream. In this way, examples with relatively few events that result in zero events at the final stream still resulted in predictions (empty voxels are assigned the value 0). Hyperparameters associated with stream propagation (decay rate, spike threshold and reset potential) were hand-selected via an extremely crude search that aimed to achieve a modest number of events in the final stream for most examples. These hyperparameters, along with further details on training and data augmentation, are provided in the supplementary material.

5 Results

5.1 Point Clouds

We begin by benchmarking our implementations of Equation 8. We implement the outer summation in two ways: a parallel implementation which unstacks the relevant tensors and computes matrix-vector products in parallel, and a map-reduce variant which accumulates intermediate values. Both are written entirely in the high-level Python interface to Tensorflow 2.0. We compare with the work of Groh et al. [21], who provide benchmarks for their own Tensorflow implementation, as well as a custom CUDA implementation that only supports kNN neighborhoods. Computation time and memory requirements are shown in Table 2. Values do not include neighborhood calculations. Despite our implementation being more flexible, our forward pass is 3–5.4× faster, while our full training pass is sped up by factors of up to 27.6. Our implementation does require significantly more memory – particularly in the backwards pass – though this is less severe using our Map-Reduce implementation. We also see significant improvements by using Tensorflow’s accelerated linear algebra (XLA) just-in-time compilation module, particularly in terms of memory usage.
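For reference, a sketch of the map-reduce variant in TensorFlow 2 is shown below. It assumes the convolution takes the form F′ = Σ_m N^(m) F Θ^(m) with sparse N^(m), consistent with the notation summarised in Table 6; names are illustrative and this is not the benchmarked code verbatim.

    import tensorflow as tf

    def conv_map_reduce(neighborhoods, features, theta):
        """neighborhoods: list of M tf.sparse.SparseTensor, each [S', S]
        features:       [S, Q] dense tensor of input features F
        theta:          [M, Q, P] learned kernel parameters Theta^(m)
        Returns [S', P] output features F'."""
        out = None
        for m, n in enumerate(neighborhoods):
            # accumulate N^(m) (F Theta^(m)) one basis function at a time
            term = tf.sparse.sparse_dense_matmul(n, tf.matmul(features, theta[m]))
            out = term if out is None else out + term
        return out

The parallel variant instead evaluates the M terms without sequential accumulation, trading additional memory for reduced wall-clock time, as reflected in Table 2.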

                  Time (ms)                      GPU Mem (MB)
                  Forward  Back-θ  Back-full     Forward  Back-θ  Back-full
 TF [21]          1829     -       2738          34G      -       63G
 Custom [21]      24.0     -       265.0         8.4      -       8.7
 Map-Reduce       7.4      12.2    21.7          37.5     77.7    237.6
 Map-Reduce JIT   6.8      11.8    14.5          37.6     58.1    69.6
 Parallel         4.5      8.1     16.5          66.2     142.8   728.1
 Parallel JIT     4.4      8.6     9.6           62.5     62.5    62.5

Table 2: FlexConv benchmarks on an Nvidia GTX-1080Ti. M = 4, P = Q = 64, S = S′ = 4096, 9 neighbors and batch size of 8. Forward, Back-θ and Back-full are forward pass, forward pass plus gradients w.r.t. learned parameters, and forward pass plus gradients w.r.t. all inputs respectively. Note only Back-θ is required during training. JIT rows correspond to using just-in-time compilation (excluding compile time).

Next we look at training times and capacity of our model on the Modelnet40 classification task using 1024 input points. Table 3 shows epoch training times for our models at various possible batch sizes, compared to various other methods. For fair comparison, we do not use XLA compilation.

                                  Epoch time (s)
 Model             Batch Size     Online    Offline
 SpiderCNN [20]    24             196       -
 Pointnet++ [15]   32             56        -
                   64             56        -
 PointCNN [19]     32             35        -
                   64             33        -
                   128            33        -
 Ours (large)      32             20.3      9.24
                   64             18.6      6.32
                   128            18.1      4.85
                   1024           17.5      3.70
                   4096           17.7      3.47
 Ours (small)      32             11.4      6.15
                   64             10.3      3.84
                   128            9.4       2.62
                   1024           8.2       1.55
                   4096           9.1       1.39
                   9840           9.3       1.52

Table 3: Time to train 1 epoch of Modelnet40 classification on an Nvidia GTX-1080Ti. Online/offline refers to preprocessing.

 Model             Reported/Best   Mean
 Ours (small)      88.77           87.94
 Pointnet [13]     89.20           88.65
 KCNet [18]        91.00           89.62
 DeepSets [14]     90.30           89.71
 Pointnet++ [15]   90.70           90.14
 Ours (large)      91.08           90.34
 DGCNN [17]        92.20           91.55
 PointCNN [19]     92.20           91.82
 SO-Net [16]       93.40           92.65

Table 4: Top-1 instance accuracy on Modelnet40, sorted by mean of 10 runs according to Koguciuk et al. [44], for our large model with batch size 128. Reported/Best are those values reported by other papers, and the best of 10 runs for our models.


Clearly our model runs significantly faster than those we compare to. Just as clear is the fact that our models which compute neighborhood information online are CPU-constrained. This preprocessing is not particularly slow – a modest 8-core desktop is capable of completing the 7 neighborhood searches and 3 rejection samplings associated with each example at over 500 Hz – but in the context of an accelerator-based training loop that runs at over 3000 Hz this is a major bottleneck. Our smaller network can preprocess examples at twice this speed (1000 Hz), though the corresponding increase in model training speed leaves the situation unchanged. One might expect such a speed-up to come at the cost of inference accuracy. Top-1 accuracy is given in Table 4. We observe a slight drop in performance compared to recent state-of-the-art methods, though our large model is still competitive with well established methods like Pointnet++. Our small model performs distinctly worse, though still respectably.

5.2 Event Camera Streams

Table 5 shows results for our method on the selected classification tasks. We see minor improvements over current state-of-the-art methods on the straightforward MNIST variants, though acknowledge the questionable value of such minor improvements on datasets like these. We see a modest improvement on CIFAR-10, though perform relatively poorly on N-Caltech101. Our ASL-DVS model significantly out-performs the current state-of-the-art, with a 44% reduction in error rate. We attribute the greater success on this last dataset compared to others to the significantly larger number of examples available during training (∼80,000 vs ∼10,000).

 Model         N-MNIST   MNIST-DVS   CIFAR-DVS   N-Caltech101   ASL-DVS
 HATS [26]     99.1      98.4        52.4        64.2           -
 RG-CNN [12]   99.0      98.6        54.0        65.7           90.1
 Ours          99.2      99.1        56.6        63.0           94.6

Table 5: Top-1 classification accuracy (%) for event stream classification tasks.

6 Conclusion

We have presented an elegant interpretation of convolutions for applications to sparse inputs on continuous domains. Combined with efficient sampling methods, our resulting convolutional networks achieve competitive accuracies with top point cloud classification models with a fraction of the time and memory requirements. Our method can also scale to operate on event streams with hundreds of thousands of events, achieving new state-of-the-art results on a number of classification benchmarks.


References

1. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q., eds.: Advances in Neural Information Processing Systems 25. Curran Associates, Inc. (2012) 1097–1105
2. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE (2009) 248–255
3. Erickson, B.J., Korfiatis, P., Akkus, Z., Kline, T.L.: Machine learning for medical imaging. Radiographics 37 (2017) 505–515
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
5. Elton, D.C., Boukouvalas, Z., Fuge, M.D., Chung, P.W.: Deep learning for molecular design – a review of the state of the art. Molecular Systems Design & Engineering (2019)
6. Pierson, H.A., Gashler, M.S.: Deep learning in robotics: a review of recent research. Advanced Robotics 31 (2017) 821–835
7. Grigorescu, S., Trasnea, B., Cocias, T., Macesanu, G.: A survey of deep learning techniques for autonomous driving. arXiv preprint arXiv:1910.07738 (2019)
8. Heng, L., Choi, B., Cui, Z., Geppert, M., Hu, S., Kuan, B., Liu, P., Nguyen, R., Yeo, Y.C., Geiger, A., et al.: Project autovision: Localization and 3d scene perception for an autonomous vehicle with a multi-camera system. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE (2019) 4695–4702
9. Li, B., Zhang, T., Xia, T.: Vehicle detection from 3d lidar using fully convolutional network. arXiv preprint arXiv:1608.07916 (2016)
10. Gao, H., Cheng, B., Wang, J., Li, K., Zhao, J., Li, D.: Object classification using cnn-based fusion of vision and lidar in autonomous vehicle environment. IEEE Transactions on Industrial Informatics 14 (2018) 4224–4231
11. García-Martín, E., Rodrigues, C.F., Riley, G., Grahn, H.: Estimation of energy consumption in machine learning. Journal of Parallel and Distributed Computing 134 (2019) 75–88
12. Bi, Y., Chadha, A., Abbas, A., Bourtsoulatze, E., Andreopoulos, Y.: Graph-based spatial-temporal feature learning for neuromorphic vision sensing. arXiv preprint arXiv:1910.03579 (2019)
13. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017) 652–660
14. Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R.R., Smola, A.J.: Deep sets. In: Advances in Neural Information Processing Systems. (2017) 3391–3401
15. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems. (2017) 5099–5108
16. Li, J., Chen, B.M., Hee Lee, G.: So-net: Self-organizing network for point cloud analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 9397–9406
17. Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG) 38 (2019) 146


18. Shen, Y., Feng, C., Yang, Y., Tian, D.: Mining point cloud local structures by kernel correlation and graph pooling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 4548–4557
19. Li, Y., Bu, R., Sun, M., Wu, W., Di, X., Chen, B.: Pointcnn: Convolution on x-transformed points. In: Advances in Neural Information Processing Systems. (2018) 820–830
20. Xu, Y., Fan, T., Xu, M., Zeng, L., Qiao, Y.: Spidercnn: Deep learning on point sets with parameterized convolutional filters. CoRR abs/1803.11527 (2018)
21. Groh, F., Wieschollek, P., Lensch, H.P.A.: Flex-convolution (deep learning beyond grid-worlds). CoRR abs/1803.07289 (2018)
22. Posch, C., Matolin, D., Wohlgenannt, R.: A qvga 143 db dynamic range frame-free pwm image sensor with lossless pixel-level video compression and time-domain cds. IEEE Journal of Solid-State Circuits 46 (2010) 259–275
23. Serrano-Gotarredona, T., Linares-Barranco, B.: A 128×128 1.5% contrast sensitivity 0.9% fpn 3 µs latency 4 mw asynchronous frame-free dynamic vision sensor using transimpedance preamplifiers. IEEE Journal of Solid-State Circuits 48 (2013) 827–838
24. Orchard, G., Jayawant, A., Cohen, G.K., Thakor, N.: Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in Neuroscience 9 (2015) 437
25. Maqueda, A.I., Loquercio, A., Gallego, G., García, N., Scaramuzza, D.: Event-based vision meets deep learning on steering prediction for self-driving cars. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 5419–5427
26. Sironi, A., Brambilla, M., Bourdis, N., Lagorce, X., Benosman, R.: Hats: Histograms of averaged time surfaces for robust event-based object classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 1731–1740
27. Nguyen, A., Do, T.T., Caldwell, D.G., Tsagarakis, N.G.: Real-time 6dof pose relocalization for event cameras with stacked spatial lstm networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. (2019)
28. Bohte, S.M., Kok, J.N., La Poutre, H.: Error-backpropagation in temporally encoded networks of spiking neurons. Neurocomputing 48 (2002) 17–37
29. Russell, A., Orchard, G., Dong, Y., Mihalas, S., Niebur, E., Tapson, J., Etienne-Cummings, R.: Optimization methods for spiking neurons and networks. IEEE Transactions on Neural Networks 21 (2010) 1950–1962
30. Cao, Y., Chen, Y., Khosla, D.: Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision 113 (2015) 54–66
31. Lagorce, X., Orchard, G., Galluppi, F., Shi, B.E., Benosman, R.B.: Hots: a hierarchy of event-based time-surfaces for pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2016) 1346–1359
32. Liu, B., Wang, M., Foroosh, H., Tappen, M., Pensky, M.: Sparse convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 806–814
33. Park, J., Li, S., Wen, W., Tang, P.T.P., Li, H., Chen, Y., Dubey, P.: Faster cnns with direct sparse convolutions and guided pruning. arXiv preprint arXiv:1608.01409 (2016)
34. Graham, B., van der Maaten, L.: Submanifold sparse convolutional networks. arXiv preprint arXiv:1706.01307 (2017)


35. Graham, B., Engelcke, M., Van Der Maaten, L.: 3d semantic segmentation with submanifold sparse convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 9224–9232
36. Choy, C., Gwak, J., Savarese, S.: 4d spatio-temporal convnets: Minkowski convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2019) 3075–3084
37. Jampani, V., Kiefel, M., Gehler, P.V.: Learning sparse high dimensional filters: Image filtering, dense crfs and bilateral neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 4452–4461
38. Su, H., Jampani, V., Sun, D., Maji, S., Kalogerakis, E., Yang, M.H., Kautz, J.: Splatnet: Sparse lattice networks for point cloud processing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 2530–2539
39. Boulch, A.: Convpoint: continuous convolutions for point cloud processing. Computers & Graphics (2020)
40. Koch, C., Segev, I., et al.: Methods in neuronal modeling: from ions to networks. MIT Press (1998)
41. Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J.: 3d shapenets: A deep representation for volumetric shapes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 1912–1920
42. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016) 770–778
43. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
44. Koguciuk, D., Chechliński, Ł., El-Gaaly, T.: 3d object recognition with ensemble learning – a study of point cloud-based deep learning models. In: International Symposium on Visual Computing, Springer (2019) 100–114


7 Supplementary Material

7.1 Summary of Notation

Dimensions
  D    Physical dimensionality of point cloud
  Q    Number of input channels
  P    Number of output channels
  S    Size of input cloud
  S′   Size of output cloud
  E    Number of edges
  M    Number of basis functions

Sets
  X ⊂ R^D    Input cloud coordinates
  X′ ⊂ R^D   Output cloud coordinates
  N_i ⊆ X    Set of inputs in neighborhood of x′_i

Tensors
  x_j ∈ X              jth input coordinate
  x′_i ∈ X′            ith output coordinate
  Δx_ij ∈ R^D          Edge vector: x′_i − x_j, x_j ∈ N_i
  f_j ∈ R              Input feature associated with x_j
  f′_i ∈ R             Output feature associated with x′_i
  f ∈ R^S              Single-channel input features for input cloud X
  f′ ∈ R^{S′}          Single-channel output features for output cloud X′
  F ∈ R^{S×Q}          Multi-channel input features
  F′ ∈ R^{S′×P}        Multi-channel output features
  Θ^(m) ∈ R^{Q×P}      Kernel parameters associated with mth basis fn
  N^(m) ∈ R^{S′×S}     Neighborhood matrix

Table 6: Summary of notation.


7.2 Additional Point Cloud Network Details

Pseudo-code for Iterative Farthest Point (IFP) variants and rejection sampling is given in Algorithms 1 through 5. Differences between rejection sampling and random sampling are illustrated in Figure 3. A diagram of our large point cloud network is given in Figure 4.

[Figure 3: output clouds produced by random sampling (left) and rejection sampling (right).]

Fig. 3: Output cloud (red dots) resulting from different sampling schemes applied to input clouds (blue) and the corresponding neighborhoods (light red circles). From the top left image, we can see random sampling can result in some regions being under-sampled. This is particularly problematic for networks with subsequent up-sampling, where some blue points have no red points in their own neighborhood. The number of sampled points is not fixed for rejection sampling, so significantly fewer points will be sampled from pointy surfaces (bottom). By construction, none of the dark red circles (top right, half base radius) overlap, so the total number of possible sample points is limited by ball packing theorems.


Algorithm 1: IFP
  Inputs: input point cloud X; S′ output size
  Result: X′: sampled points
  X′ ← []
  S ← size(X)
  d_min ← ∞ × ones(S)
  for i in range(S′) do
      j ← argmax(d_min)
      X′.append(x_j)
      d_min ← min(d_min, d(X, x_j))
  end

Algorithm 2: Approx. IFP
  Inputs: input point cloud X; S′ output size; N(·) neighborhood fn; Q priority queue
  Result: X′: sampled points
  X′ ← []
  for j in range(S′) do
      x′_i ← Q.pop()
      X′.append(x′_i)
      for x_n in N(x′_i) do
          Q.update(x_n, d(x_n, x′_i))
      end
  end

Algorithm 3: Approx. IFP (without rej.)
  Inputs: input point cloud X; S′ output size; N(·) neighborhood fn
  Result: X′: sampled points
  S ← size(X)
  Q ← PriorityQueue(∞ × ones(S), X)
  X′ ← Approx. IFP(X, S′, N, Q)

Algorithm 4: Rejection Sampling
  Inputs: input point cloud X; N(·) neighborhood fn
  Result: X′: sampled points; d_min: distance from each input point to closest output point
  X′ ← []
  S ← size(X)
  d_min ← ∞ × ones(S)
  visited ← False × ones(S)
  for x_i in X do
      if visited[i] then
          continue
      end
      X′.append(x_i)
      N_i ← N(x_i)
      for x_j in N_i do
          visited[j] ← True
          d_min[j] ← min(d_min[j], d(x_i, x_j))
      end
  end

Algorithm 5: Approx. IFP (with rej.)
  Inputs: input point cloud X; S′ output size; N(·) neighborhood fn
  Result: X′: sampled points
  X′_0, d_min ← Rejection Sampling(X, N)
  Q ← PriorityQueue(d_min, X)
  S′_1 ← S′ − size(X′_0)
  X′_1 ← Approx. IFP(X, S′_1, N, Q)
  X′ ← concatenate(X′_0, X′_1)


7.3 Additional Event Stream Network Details

The Leaky Integrate and Fire (LIF) algorithm we used is given in Algorithm 6.

Algorithm 6: Leaky Integrate and Fire (LIF)
  Inputs: X input grid shape
          t times for input events, sorted ascending
          x coordinates for input events, same order as t
          N(·) spatial neighborhood fn giving coordinates of receptive field
          t̃ decay time
          v_thresh spike threshold
          v_reset reset potential
  Result: t_out, x_out: time and coordinates of output stream
  x_out ← []
  t_out ← []
  V ← zeros(X)
  T ← zeros(X)
  S ← size(x)
  for i in range(S) do
      t_i ← t[i]
      x_i ← x[i]
      n ← size(N(x_i))
      for x_j in N(x_i) do
          v ← V[x_j] · exp(−(t_i − T[x_j]) / t̃) + 1/n
          if v > v_thresh then
              v ← v_reset
              t_out.append(t_i)
              x_out.append(x_j)
          end
          V[x_j] ← v
          T[x_j] ← t_i
      end
  end
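For concreteness, a dense NumPy version of Algorithm 6 is sketched below. It assumes a 3×3 spatial receptive field for N(·) and uses illustrative names; it is not the implementation used in our experiments.

    import numpy as np

    def leaky_integrate_and_fire(times, coords, grid_shape, decay_time,
                                 v_thresh=1.5, v_reset=-3.0):
        """Dense sketch of Algorithm 6 with a 3x3 spatial receptive field."""
        H, W = grid_shape
        V = np.zeros((H, W))   # voltages
        T = np.zeros((H, W))   # last update times
        t_out, x_out = [], []
        for t_i, (y, x) in zip(times, coords):
            # 3x3 receptive field clipped to the grid
            ys = range(max(y - 1, 0), min(y + 2, H))
            xs = range(max(x - 1, 0), min(x + 2, W))
            n = len(ys) * len(xs)
            for yj in ys:
                for xj in xs:
                    v = V[yj, xj] * np.exp(-(t_i - T[yj, xj]) / decay_time) + 1.0 / n
                    if v > v_thresh:
                        v = v_reset
                        t_out.append(t_i)
                        x_out.append((yj, xj))
                    V[yj, xj] = v
                    T[yj, xj] = t_i
        return t_out, x_out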

We down-sampled examples from the two highest-resolution datasets – N-Caltech101 and ASL-DVS – by a factor of 2 in each dimension. We performed basic data augmentation involving small rotations (−22.5° to 22.5°), time/polarity reversal for all datasets except ASL-DVS, and left-right flips for CIFAR-10-DVS and N-Caltech101. No data augmentation was applied to ASL-DVS. We computed neighborhood information for N-MNIST online, and offline with 8 augmented repeats for MNIST-DVS, CIFAR10-DVS and N-Caltech101. For the small number of examples with more than 300,000 events we took the first 300,000. Apart from this infrequent cropping, we use all events in all examples. All models were trained with Adam optimizer, initial learning rate 1e−3, β1 = 0.9, β2 = 0.999, ε = 1e−7. We trained our ASL-DVS model for 100 epochs with a fixed learning rate. For all others, we decay the learning rate by a factor of 5 after the training accuracy fails to increase for 10 epochs, and run until learning ceases as a result of several such decays. Dataset summary statistics and select model hyperparameters are given in Table 7. A diagram of the model used for CIFAR10-DVS is given in Figure 5.

 Dataset                      N-MNIST     MNIST-DVS   CIFAR10-DVS   N-Caltech101   ASL-DVS
 # Classes                    10          10          10            101            24
 Resolution                   34×34       128×128     128×128       174×234        180×240
 # Train examples             60,000      9,000       9,000         7,838          80,640
 Median # events              4,196       70,613      203,301       104,904        17,078
 Mean # events                4,171       73,704      204,979       115,382        28,120
 Max # events                 8,183       151,124     422,550       428,595        470,435
 Batch Size                   32          32          8             8              8
 Spike Threshold, v_thresh    1.5         1.5         1.6           1.2            1.0
 Reset Potential, v_reset     -3.0        -2.0        -3.0          -2.0           -3.0
 Initial Decay Time, t_0      2,000       10,000      4,000         1,000          1,000
 Initial Filters, f_0         32          8           8             16             16
 # Down Samples               3           5           5             5              5
 Data Augmentation
   Rotation up to ±22.5°      Yes         Yes         Yes           Yes            No
   Flip left-right            No          No          Yes           Yes            No
   Flip time/polarity         Yes         Yes         Yes           Yes            No
 Preprocessing repeats        ∞ (online)  8           8             8              1

Table 7: Event stream dataset summary statistics and model/data augmentation hyperparameters.


[Figure 4(a): in-place residual block; Figure 4(b): down-sample residual block. Block diagrams omitted; each combines dense and convolution layers with ReLU, batch normalization and dropout through a residual connection.]

[Figure 4(c): diagram of the large point cloud network – alternating ball-neighborhood searches, sampling and residual convolution blocks, followed by a dense layer, global pooling and a two-layer classifier.]

(c) Large Point Cloud Network, r_0 = 0.1125. Numbers in brackets represent output example dimensions. Dimensions with question marks (?) correspond to the approximate number of points when using no point dropout. The dashed line corresponds to the preprocessing/batching divide. BN is batch normalization, and Dropout uses a rate of 0.5.

Fig. 4


[Figure 5 diagram: event stream network for CIFAR10-DVS – an initial strided convolution on the polarity input, alternating inception-style in-place blocks and strided down-sampling convolutions, mean voxelization of the three lowest-resolution streams joined by strided 3D convolutions, then max pooling, a dense layer and a softmax classifier.]

Fig. 5: Network architecture for event stream inference for CIFAR10-DVS. Conv h×w×t/S, t̃ is a down-sampling convolution with spatial stride S, spatial kernel shape h×w and temporal kernel size t, i.e. M_u = hw, M_v = t. The output stream is the result of LIF subsampling with the same spatial kernel size and decay time t̃. Edges with Δt > 4t̃ are cropped, and convolutions use Δt scaled by t̃. Each down-sampling convolution, the pre-max-pooling dense layer and the final dense layer are all followed by ReLU, batch normalization and dropout with rate 0.5. Each mean voxelization is followed by batch normalization, and each Conv3D is followed by ReLU and batch normalization.

Chapter 7

Conclusion

“By far, the greatest danger of Artificial Intelligence is that people conclude too early that they understand it.” — Eliezer Yudkowsky

7.1 Contributions

This thesis has addressed three problems in computer vision: 3D human pose estimation, single-view object reconstruction and learnable feature extraction from irregular 3D data sources. In particular, we have focused our efforts on reducing the computational complexity of network architectures by taking advantage of relatively simple, well-established techniques used in computer graphics.

Our first two contributions looked at the former two problems independently. Taking inspiration from the distinctly different optimization procedures used in our adversarial optimization paper, our IGE-Net model tackled both problems in a similar framework. In the interest of extending this work to accommodate point cloud inferences, we developed a new point cloud convolution operator with desirable mathematical properties that is more consistent with the mathematical definition of convolution, and showed that networks using this operation could be trained using a fraction of the resources and time of similar hierarchical point cloud networks. In particular, we showed the following.

• Chapter 3: Distributions generated by GANs can be used to parameterize search spaces for inferring 3D solutions that are consistent with 2D observations.

• Chapter 4: Free form deformation networks can learn large, accurate deformations to transform fixed template point clouds to match query clouds, though the deformed meshes suffer from inherent topological constraints.

• Chapter 5: Similar optimization processes to the two involved in our Adversarial Parameterization approach can be coupled. Combined with a similar idea of consistency, the resulting energy-based models can achieve good results with a fraction of the computational/memory budget of standard deep networks, or scaled to high resolutions to achieve state-of-the-art results.


• Chapter 6: There is significant potential for optimization when it comes to point cloud networks and the constituent operations. These same operations can be applied to event stream networks, and with slight modifications can be implemented to produce asynchronous predictions as events arrive.

With respect to the specific research questions raised in Section 1.4, we believe both our Adversarial Parameterization and IGE-Net formulations are principled approaches to 3D generation. Both formulations applied to human pose estimation explicitly accommodated consistency and feasibility separately, while the inference results from our IGE-Net for object reconstruction demonstrate the network architecture had an implicit understanding of consistency by space-carving almost perfectly. We also believe these formulations did a good job of utilizing computer graphics techniques to minimize what needed to be learned, thus allowing for performant models applicable to moderately resourced robotics platforms. Our IGE-Net and sparse convolution implementations also demonstrated that careful use of preprocessing – both offline and that which takes advantage of heterogeneous computer architectures – can deliver significant savings in terms of training time.

7.2 Future Work

7.2.1 Sparse Point Cloud Convolutions

Our fourth contribution was primarily motivated by the need for a continuous and hierarchical point cloud operator for use in an IGE-Net energy function. In the publication we elected to focus on the formulation and implementation of such an operator so as not to introduce too many new ideas at once. Implementation into an IGE-Net is yet to be done, but we believe all of the building blocks are there. One possible stumbling block is that our implementation currently uses a CPU implementation for KDTree neighborhood queries. This is fine – even advantageous – when the point cloud is static, as it can be preprocessed either while the previous batch is training on accelerators or offline before any training begins. A point-cloud based IGE-Net would produce a dynamic point cloud for each step of optimization however. While a CPU implementation would still work here – and rough calculations suggest the computation time would not be prohibitive – this would require shipping of data between CPU and accelerator multiple times. Our thoughts are that the resultant model would be slow but viable – though exactly how slow and viable is difficult to say. A GPU implementation of KDTree querying would be preferable, and while publicly available GPU KDTree implementations exist, their integration into a deep learning framework may be non-trivial. This would have benefits beyond IGE-Nets however, as other methods exist that use dynamic point clouds as part of their learned models.

7.2.2 Sparse Event Stream Convolutions

The networks introduced in the event stream experiments in Chapter 6 were relatively simplistic. In particular, the subsampling mechanism – leaky integrate and fire – was based on static, near-uniform responses to stimuli. Given the convolutions give us a way of propagating features associated with events, we wonder whether this could somehow be used to make this firing process learnable.

7.2.3 IGE-Nets

In addition to a point-cloud based implementation as discussed above, we have two additional directions we would like to take with our IGE-Nets.

Firstly, our networks focused on accommodating consistency and feasibility. Another important consideration in mobile robotics is uncertainty. In many situations there is simply not enough information to make an accurate prediction, e.g. as the result of heavy occlusions. In these cases it is important that models can communicate what they know they don’t know. While our inferences and metrics made no effort to quantify this, energy networks implicitly encode this information in the energy landscape itself. For example, heavy occlusions may result in large flat regions, while equally likely but incompatible solutions – e.g. resolving left-right ambiguities in human pose estimation – may manifest as two separate local minima.

A simple way of quantifying this uncertainty with energy networks would be to consider our energy minimization as a distribution refinement module by optimizing samples from an initial distribution, rather than an initial point estimate. This initial distribution could come from either a conditional GAN or a variational encoder.

Another strength of IGE-Nets which wasn’t exploited in our publication is the ease with which additional loss terms can be added. For example, we could:

• add data for multiple views in order to tackle the multi-view variant of each problem;
• use multiple frames for human pose estimation with a per-frame consistency/feasibility loss and a single additional temporal feasibility loss; and/or

• add a GAN-based feasibility loss to our object reconstruction model to encourage realistic-looking inferences.

Critically, none of these additional data sources/loss terms would require changes to the underlying framework, and existing components could be reused.

7.2.4 Isosurface Extraction Models

Some tangential work to this thesis [164] looked at representing shapes as level sets of embedding functions parameterized as trilinear interpolations of inferred values at fixed voxel coordinates using 3D convolutions. 3D convolutions are good in that they re-use computation very efficiently. However, they inherently scale cubically with resolution, despite our application – surface extraction – scaling quadratically. DeepSDF [163] learned continuous functions which could be evaluated at an arbitrary number of points – however, this fails to take advantage of significant computation re-use as in the convolution case. Both of these approaches also seek to learn a signed distance function – an unnecessarily complex task if only the isosurface is desired.

We propose resolving this by considering a DeepSDF-like continuous embedding function that jointly maps a point cloud to a corresponding set of embedding values using a convolutional point cloud network. Furthermore, rather than trying to learn the embedding function directly, we could use a Newton-like root-finding algorithm to iterate points towards the isosurface of this embedding function, much like the inner optimization loop in energy networks. The resulting point cloud could then be compared to the ground truth point cloud using something like a Chamfer distance as used in our FFD contribution, modified to account for inexact isosurface extraction.

7.2.5 Universal Latent Shape Representation

Our work on object reconstruction demonstrates that different shape representations (voxel occupancy grids, point clouds, deformable meshes) each have their own advantages and disadvantages. We propose learning a universal latent shape representation along with a set of encoders and decoders – one for each representation – in a similar way to multi-lingual neural machine translation systems [196, 197]. This could be generalized to more abstract input/output representations – e.g. an image encoder coupled with a 3D decoder could perform single-view object reconstruction, or a natural language variational encoder could produce a distribution of objects from just its textual description.

Energy models could be used for decoder implementations for representations without a natural generator architecture. For example, our sparse point cloud convolutions presented in Chapter 6 rely on having input and output point clouds. Experiments used output point clouds that were sampled versions of the input cloud, but no generative model for producing a high resolution point cloud from a lower-resolution counterpart was proposed. We could implement an energy-based generator that, given a latent representation, searches the space of possible point clouds for one which has an encoding closest to the latent representation.

References

[1] W. Team, 2019, [Online; accessed 09-Jan-2020]. [Online]. Available: https://waymo.com/

[2] I. Salian, “Medical imaging startup uses ai to classify conditions from sinus and brain scans,” 2019, [Online; accessed 09-Jan-2020]. [Online]. Available: https://blogs.nvidia.com/blog/2019/04/11/medical-imaging-informai/

[3] G. Staff, “Robot dog begins work on san francisco airport project,” 2019, [Online; accessed 09-Jan-2020]. [Online]. Available: http://www.globalconstructionreview.com/news/robot-dog- begins-work-san-francisco-airport-projec/

[4] B. Stilwell, “These insane robot machine guns guard the korean dmz,” 2019, [Online; accessed 09-Jan-2020]. [Online]. Available: https://www.wearethemighty.com/gear-tech/robot-machine- guns-guard-dmz

[5] Z. Cao, M. G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh, “Openpose: Realtime multi-person 2d pose estimation using part affinity fields.” IEEE transactions on pattern analysis and machine intelligence, 2019.

[6] G. Moon, J. Y. Chang, and K. M. Lee, “Camera distance-aware top-down approach for 3d multi- person pose estimation from a single rgb image,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 10 133–10 142.

[7] T. Urban, “The ai revolution: The road to superintelligence,” 2015, [Online; accessed 08-Jan- 2020]. [Online]. Available: https://waitbutwhy.com/2015/01/artificial-intelligence-revolution- 1.html

[8] W. contributors, “List of self-driving car fatalities,” 2020, [Online; accessed 08-Jan-2020]. [Online]. Available: https://en.wikipedia.org/wiki/List of self-driving car fatalities

[9] H. Landahl, W. S. McCulloch, and W. Pitts, “A statistical consequence of the logical calculus of nervous nets,” Bulletin of Mathematical Biology, vol. 5, no. 4, pp. 135–137, 1943.

[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.

[11] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel programming with cuda,” Queue, vol. 6, no. 2, pp. 40–53, 2008.


[12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255.

[13] S. Dreyfus, “The numerical solution of variational problems,” Journal of Mathematical Analysis and Applications, vol. 5, no. 1, pp. 30–45, 1962.

[14] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807– 814.

[15] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. icml, vol. 30, no. 1, 2013, p. 3.

[16] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” arXiv preprint arXiv:1511.07289, 2015.

[17] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.

[18] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” in Advances in neural information processing systems, 2017, pp. 971–980.

[19] P. Ramachandran, B. Zoph, and Q. V. Le, “Searching for activation functions,” arXiv preprint arXiv:1710.05941, 2017.

[20] D. Misra, “Mish: A self regularized non-monotonic neural activation function,” arXiv preprint arXiv:1908.08681, 2019.

[21] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.

[22] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.

[23] Y. Wu and K. He, “Group normalization,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.

[24] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770– 778.

[25] ——, “Identity mappings in deep residual networks,” in European conference on computer vision. Springer, 2016, pp. 630–645.

[26] S. Xie, R. Girshick, P. Dollar,´ Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492–1500.

[27] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.

[28] B. T. Polyak, “Some methods of speeding up the convergence of iteration methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964.

[29] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International conference on machine learning, 2013, pp. 1139– 1147.

[30] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of machine learning research, vol. 12, no. Jul, pp. 2121–2159, 2011.

[31] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning, vol. 4, no. 2, pp. 26–31, 2012.

[32] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.

[33] D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization. corr abs/1412.6980 (2014),” 2014.

[34] M. Zaheer, S. Reddi, D. Sachan, S. Kale, and S. Kumar, “Adaptive methods for nonconvex optimization,” in Advances in neural information processing systems, 2018, pp. 9793–9803.

[35] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010, pp. 249–256.

[36] H. Zhang, Y. N. Dauphin, and T. Ma, “Fixup initialization: Residual learning without normalization,” arXiv preprint arXiv:1901.09321, 2019.

[37] Anonymous, “Batch normalization has multiple benefits: An empirical study on residual networks,” in Submitted to International Conference on Learning Representations, 2020, under review. [Online]. Available: https://openreview.net/forum?id=BJeVklHtPr

[38] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.

[39] A. Krogh and J. A. Hertz, “A simple weight decay can improve generalization,” in Advances in neural information processing systems, 1992, pp. 950–957.

[40] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.

[41] G. Zhang, C. Wang, B. Xu, and R. Grosse, “Three mechanisms of weight decay regularization,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=B1lz-3Rct7

[42] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.

[43] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” arXiv preprint arXiv:1302.4389, 2013.

[44] J. Guo and S. Gould, “Depth dropout: efficient training of residual convolutional neural networks,” in 2016 International Conference on Computing: Techniques and Applications (DICTA). IEEE, 2016, pp. 1–7.

[45] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.

[46] R. Al-Rfou, G. Alain, A. Almahairi, C. Angermueller, D. Bahdanau, N. Ballas, F. Bastien, J. Bayer, A. Belikov, A. Belopolsky, Y. Bengio, A. Bergeron, J. Bergstra, V. Bisson, J. Bleecher Snyder, N. Bouchard, N. Boulanger-Lewandowski, X. Bouthillier, A. de Brebisson,´ O. Breuleux, P.-L. Carrier, K. Cho, J. Chorowski, P. Christiano, T. Cooijmans, M.-A. Cotˆ e,´ M. Cotˆ e,´ A. Courville, Y. N. Dauphin, O. Delalleau, J. Demouth, G. Desjardins, S. Dieleman, L. Dinh, M. Ducoffe, V. Dumoulin, S. Ebrahimi Kahou, D. Erhan, Z. Fan, O. Firat, M. Germain, X. Glorot, I. Goodfellow, M. Graham, C. Gulcehre, P. Hamel, I. Harlouchet, J.-P. Heng, B. Hidasi, S. Honari, A. Jain, S. Jean, K. Jia, M. Korobov, V. Kulkarni, A. Lamb, P. Lamblin, E. Larsen, C. Laurent, S. Lee, S. Lefrancois, S. Lemieux, N. Leonard,´ Z. Lin, J. A. Livezey, C. Lorenz, J. Lowin, Q. Ma, P.-A. Manzagol, O. Mastropietro, R. T. McGibbon, R. Memisevic, B. van Merrienboer,¨ V. Michalski, M. Mirza, A. Orlandi, C. Pal, R. Pascanu, M. Pezeshki, C. Raffel, D. Renshaw, M. Rocklin, A. Romero, M. Roth, P. Sadowski, J. Salvatier, F. Savard, J. Schluter,¨ J. Schulman, G. Schwartz, I. V. Serban, D. Serdyuk, S. Shabanian, E. Simon, S. Spieckermann, S. R. Subramanyam, J. Sygnowski, J. Tanguay, G. van Tulder, J. Turian, S. Urban, P. Vincent, F. Visin, H. de Vries, D. Warde-Farley, D. J. Webb, M. Willson, K. Xu, L. Xue, L. Yao, S. Zhang, and Y. Zhang, “Theano: A Python framework for fast computation of mathematical expressions,” arXiv e-prints, vol. abs/1605.02688, May 2016. [Online]. Available: http://arxiv.org/abs/1605.02688

[47] R. Collobert, S. Bengio, and J. Mariéthoz, “Torch: a modular machine learning software library,” Idiap, Tech. Rep., 2002.

[48] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.

[49] J. Bai, F. Lu, K. Zhang et al., “Onnx: Open neural network exchange,” https://github.com/onnx/onnx, 2019.

[50] E. D. D. Team, “DL4J: Deep Learning for Java,” 2016. [Online]. Available: https://github.com/eclipse/deeplearning4j

[51] S. Tokui, R. Okuta, T. Akiba, Y. Niitani, T. Ogawa, S. Saito, S. Suzuki, K. Uenishi, B. Vogel, and H. Yamazaki Vincent, “Chainer: A deep learning framework for accelerating the research cycle,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2019, pp. 2002–2011.

[52] F. Seide and A. Agarwal, “Cntk: Microsoft’s open-source deep-learning toolkit,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 2135–2135.

[53] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.

[54] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.

[55] F. Chollet et al., “Keras,” https://keras.io, 2015.

[56] J. Howard et al., “fastai,” https://github.com/fastai/fastai, 2018.

[57] H. Kase, R. Negishi, M. Arifuku, N. Kiyoyanagi, and Y. Kobayashi, “Biosensor response from target molecules with inhomogeneous charge localization,” Journal of Applied Physics, vol. 124, no. 6, p. 064502, 2018.

[58] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing unit,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 1–12.

[59] N. M. Linke, D. Maslov, M. Roetteler, S. Debnath, C. Figgatt, K. A. Landsman, K. Wright, and C. Monroe, “Experimental comparison of two quantum computing architectures,” Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3305–3310, 2017.

[60] N. Mishra, M. Kapil, H. Rakesh, A. Anand, N. Mishra, A. Warke, S. Sarkar, S. Dutta, S. Gupta, A. Dash, R. Gharat, Y. Chatterjee, S. Roy, S. Raj, V. Jain, S. Bagaria, S. Chaudhary, V. Singh, R. Maji, and P. Panigrahi, “Quantum machine learning: A review and current status,” Sep. 2019.

[61] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-lm: Training multi-billion parameter language models using gpu model parallelism,” arXiv preprint arXiv:1909.08053, 2019.

[62] “Deep learning on aws,” 2020, [Online; accessed 22-Jan-2020]. [Online]. Available: https://aws.amazon.com/deep-learning/

[63] “Azure machine learning,” 2020, [Online; accessed 22-Jan-2020]. [Online]. Available: https://azure.microsoft.com/en-au/services/machine-learning/

[64] “Turn ideas into reality with google cloud ai,” 2020, [Online; accessed 22-Jan-2020]. [Online]. Available: https://cloud.google.com/solutions/ai/

[65] B. Saeta, “Cloud tpu now offers preemptible pricing and global availability,” 2018, [Online; accessed 22-Jan-2020]. [Online]. Available: https://cloudplatform.googleblog.com/2018/06/Cloud-TPU-now-offers-preemptible- pricing-and-global-availability.html

[66] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2672–2680. [Online]. Available: http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf

[67] Z. Pan, W. Yu, X. Yi, A. Khan, F. Yuan, and Y. Zheng, “Recent progress on generative adversarial networks (gans): A survey,” IEEE Access, vol. 7, pp. 36 322–36 333, 2019.

[68] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4681–4690.

[69] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy, “Esrgan: Enhanced super-resolution generative adversarial networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 0–0.

[70] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125–1134.

[71] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional gans,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8798–8807.

[72] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232.

[73] C. Li and M. Wand, “Precomputed real-time texture synthesis with markovian generative adversarial networks,” in European conference on computer vision. Springer, 2016, pp. 702– 716.

[74] N. Jetchev, U. Bergmann, and R. Vollgraf, “Texture synthesis with spatial generative adversarial networks,” arXiv preprint arXiv:1611.08207, 2016.

[75] U. Bergmann, N. Jetchev, and R. Vollgraf, “Learning texture manifolds with the periodic spatial gan,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 469–477.

[76] S. Bouaziz, B. Amberg, T. Weise, P. Snape, S. Brugger, A. Mansfield, R. Knothe, and T. Kiser, “Generating animated three-dimensional models from captured images,” Oct. 1, 2019, US Patent 10,430,642.

[77] L. Yu, W. Zhang, J. Wang, and Y. Yu, “Seqgan: Sequence generative adversarial nets with policy gradient,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[78] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs, “Unsupervised anomaly detection with generative adversarial networks to guide marker discovery,” in International conference on information processing in medical imaging. Springer, 2017, pp. 146–157.

[79] N. Killoran, L. J. Lee, A. Delong, D. Duvenaud, and B. J. Frey, “Generating and designing dna with deep generative models,” arXiv preprint arXiv:1712.06148, 2017.

[80] W. Hu and Y. Tan, “Generating adversarial malware examples for black-box attacks based on gan,” arXiv preprint arXiv:1702.05983, 2017.

[81] Z. Zhang, M. Li, and J. Yu, “On the convergence and mode collapse of gan,” in SIGGRAPH Asia 2018 Technical Briefs, 2018, pp. 1–4.

[82] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.

[83] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” in Advances in neural information processing systems, 2017, pp. 5767–5777.

[84] H. Petzka, A. Fischer, and D. Lukovnicov, “On the regularization of wasserstein gans,” arXiv preprint arXiv:1709.08894, 2017.

[85] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in Advances in neural information processing systems, 2016, pp. 2234–2242.

[86] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in Advances in neural information processing systems, 2017, pp. 6626–6637.

[87] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” arXiv preprint arXiv:1802.05957, 2018.

[88] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” arXiv preprint arXiv:1805.08318, 2018.

[89] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang, “A tutorial on energy-based learning,” Predicting structured data, vol. 1, no. 0, 2006.

[90] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr, “Conditional random fields as recurrent neural networks,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1529–1537.

[91] B. Amos and J. Z. Kolter, “Optnet: Differentiable optimization as a layer in neural networks,” arXiv preprint arXiv:1703.00443, 2017.

[92] J. Domke, “Generic methods for optimization-based modeling,” in Artificial Intelligence and Statistics, 2012, pp. 318–326.

[93] D. Belanger, B. Yang, and A. McCallum, “End-to-end learning for structured prediction energy networks,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ser. ICML’17. JMLR.org, 2017, pp. 429–439. [Online]. Available: http://dl.acm.org/citation.cfm?id=3305381.3305426

[94] K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological cybernetics, vol. 36, no. 4, pp. 193–202, 1980.

[95] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[96] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.

[97] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Thirty-first AAAI conference on artificial intelligence, 2017.

[98] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258.

[99] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.

[100] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.

[101] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8697–8710.

[102] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” arXiv preprint arXiv:1905.11946, 2019.

[103] ——, “Mixconv: Mixed depthwise convolutional kernels,” CoRR, vol. abs/1907.09595, 2019.

[104] L. Deng, “The mnist database of handwritten digit images for machine learning research [best of the web],” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.

[105] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.

[106] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.

[107] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International journal of computer vision, vol. 104, no. 2, pp. 154–171, 2013.

[108] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.

[109] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.

[110] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91– 99.

[111] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.

[112] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7263–7271.

[113] ——, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.

[114] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.

[115] “Drive: Digital retinal images for vessel extraction,” [Online; accessed 23-Jan-2020]. [Online]. Available: https://drive.grand-challenge.org/

[116] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223.

[117] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.

[118] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.

[119] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.

[120] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.

[121] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.

[122] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 801–818.

[123] T. Takikawa, D. Acuna, V. Jampani, and S. Fidler, “Gated-scnn: Gated shape cnns for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5229–5238.

[124] Y. Zhu, K. Sapra, F. A. Reda, K. J. Shih, S. Newsam, A. Tao, and B. Catanzaro, “Improving semantic segmentation via video propagation and label relaxation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8856–8865.

[125] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.

[126] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang, “Mask scoring r-cnn,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6409–6418.

[127] P. O. Pinheiro, R. Collobert, and P. Dollár, “Learning to segment object candidates,” in Advances in Neural Information Processing Systems, 2015, pp. 1990–1998.

[128] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár, “Learning to refine object segments,” in European Conference on Computer Vision. Springer, 2016, pp. 75–91.

[129] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollár, “A multipath network for object detection,” arXiv preprint arXiv:1604.02135, 2016.

[130] Y. Yang and D. Ramanan, “Articulated pose estimation with flexible mixtures-of-parts,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 1385–1392.

[131] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4724–4732.

[132] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in European Conference on Computer Vision. Springer, 2016, pp. 483–499.

[133] X. Chu, W. Ouyang, H. Li, and X. Wang, “Structured feature learning for pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4715–4723.

[134] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in CVPR, 2017.

[135] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang, “Multi-context attention for human pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1831–1840.

[136] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, “Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 7, pp. 1325–1339, 2014.

[137] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black, “Keep it smpl: Automatic estimation of 3d human pose and shape from a single image,” in European Conference on Computer Vision. Springer, 2016, pp. 561–578.

[138] C.-H. Chen and D. Ramanan, “3d human pose estimation = 2d pose estimation + matching,” in CVPR, vol. 2, no. 5, 2017, p. 6.

[139] J. Martinez, R. Hossain, J. Romero, and J. J. Little, “A simple yet effective baseline for 3d human pose estimation,” in International Conference on Computer Vision, vol. 1, no. 2, 2017, p. 5.

[140] H.-Y. F. Tung, A. W. Harley, W. Seto, and K. Fragkiadaki, “Adversarial inverse graphics networks: Learning 2d-to-3d lifting and image-to-image translation from unpaired supervision,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV’17), vol. 2, 2017.

[141] W. Yang, W. Ouyang, X. Wang, J. Ren, H. Li, and X. Wang, “3d human pose estimation in the wild by adversarial learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, 2018.

[142] A. O. Ulusoy, A. Geiger, and M. J. Black, “Towards probabilistic volumetric reconstruction using ray potential,” in 3DV, 2015.

[143] Z. Wu, S. Song, A. Khosla, X. Tang, and J. Xiao, “3D ShapeNets: A deep representation for volumetric shapes,” in CVPR, 2015.

[144] I. Cherabier, C. Häne, M. R. Oswald, and M. Pollefeys, “Multi-label semantic 3D reconstruction using voxel blocks,” in 3DV, 2016.

[145] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese, “3d-r2n2: A unified approach for single and multi-view 3d object reconstruction,” in European conference on computer vision. Springer, 2016, pp. 628–644.

[146] D. J. Rezende, S. M. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess, “Unsupervised learning of 3D structure from images,” in NIPS, 2016.

[147] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee, “Perspective transformer nets: Learning single- view 3D object reconstruction without 3D supervision,” in NIPS, 2016.

[148] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas, “Volumetric and multi-view CNNs for object classification on 3D data,” in CVPR, 2016.

[149] J. Wu, Y. Wang, T. Xue, X. Sun, W. T. Freeman, and J. B. Tenenbaum, “MarrNet: 3D Shape Reconstruction via 2.5D Sketches,” in NIPS, 2017.

[150] A. Kar, C. Häne, and J. Malik, “Learning a multi-view stereo machine,” in NIPS, 2017.

[151] R. Zhu, H. K. Galoogahi, C. Wang, and S. Lucey, “Rethinking reprojection: Closing the loop for pose-aware shape reconstruction from a single image,” in NIPS, 2017.

[152] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta, “Learning a predictable and generative vector representation for objects,” in ECCV, 2016.

[153] A. Sharma, O. Grau, and M. Fritz, “Vconv-dae: Deep volumetric shape learning without object labels,” in European Conference on Computer Vision. Springer, 2016, pp. 236–250.

[154] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum, “Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling,” in Advances in Neural Information Processing Systems, 2016, pp. 82–90.

[155] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman, “Single image 3D interpreter network,” in ECCV, 2016.

[156] J. Liu, F. Yu, and T. A. Funkhouser, “Interactive 3D modeling with a generative adversarial network,” in 3DV, 2017.

[157] J. Gwak, C. B. Choy, A. Garg, M. Chandraker, and S. Savarese, “Weakly supervised generative adversarial networks for 3D reconstruction,” in 3DV, 2017.

[158] S. Tulsiani, A. A. Efros, and J. Malik, “Multi-view consistency as supervisory signal for learning shape and pose prediction,” in Computer Vision and Pattern Regognition (CVPR), 2018.

[159] G. Riegler, A. O. Ulusoy, and A. Geiger, “OctNet: Learning deep 3D representations at high resolutions,” in CVPR, 2017.

[160] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong, “O-CNN: Octree-based convolutional neural networks for 3D shape analysis,” in SIGGRAPH, 2017.

[161] C. Häne, S. Tulsiani, and J. Malik, “Hierarchical surface prediction for 3D object reconstruction,” in 3DV, 2017.

[162] M. Tatarchenko, A. Dosovitskiy, and T. Brox, “Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs,” in Proc. of the IEEE International Conf. on Computer Vision (ICCV), vol. 2, 2017, p. 8.

[163] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove, “Deepsdf: Learning continuous signed distance functions for shape representation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 165–174.

[164] M. Michalkiewicz, J. K. Pontes, D. Jack, M. Baktashmotlagh, and A. Eriksson, “Implicit surface representations as layers in neural networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 4743–4752.

[165] A. Johnston, R. Garg, G. Carneiro, I. D. Reid, and A. van den Hengel, “Scaling cnns for high resolution volumetric reconstruction from a single image.” in ICCV Workshops, 2017, pp. 930– 939.

[166] S. R. Richter and S. Roth, “Matryoshka networks: Predicting 3d geometry via nested shape layers,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1936–1944.

[167] A. Kurenkov, J. Ji, A. Garg, V. Mehta, J. Gwak, C. B. Choy, and S. Savarese, “DeformNet: Free- form deformation network for 3d shape reconstruction from a single image,” vol. abs/1708.04672, 2017.

[168] Y. Yang, C. Feng, Y. Shen, and D. Tian, “Foldingnet: Point cloud auto-encoder via deep grid deformation,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), vol. 3, 2018.

[169] H. Fan, H. Su, and L. J. Guibas, “A point set generation network for 3D object reconstruction from a single image,” in CVPR, 2017.

[170] C.-H. Lin, C. Kong, and S. Lucey, “Learning efficient point cloud generation for dense 3D object reconstruction,” in AAAI, 2018.

[171] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” in CVPR, 2017.

[172] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola, “Deep sets,” in Advances in neural information processing systems, 2017, pp. 3391–3401.

[173] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical feature learning on point sets in a metric space,” in NIPS, 2017.

[174] J. Li, B. M. Chen, and G. Hee Lee, “So-net: Self-organizing network for point cloud analysis,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 9397–9406.

[175] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph cnn for learning on point clouds,” ACM Transactions on Graphics (TOG), vol. 38, no. 5, p. 146, 2019.

[176] Y. Shen, C. Feng, Y. Yang, and D. Tian, “Mining point cloud local structures by kernel correlation and graph pooling,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4548–4557.

[177] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “Pointcnn: Convolution on x-transformed points,” in Advances in Neural Information Processing Systems, 2018, pp. 820–830.

[178] Y. Xu, T. Fan, M. Xu, L. Zeng, and Y. Qiao, “Spidercnn: Deep learning on point sets with parameterized convolutional filters,” CoRR, vol. abs/1803.11527, 2018. [Online]. Available: http://arxiv.org/abs/1803.11527

[179] F. Groh, P. Wieschollek, and H. P. A. Lensch, “Flex-convolution (deep learning beyond grid- worlds),” CoRR, vol. abs/1803.07289, 2018. [Online]. Available: http://arxiv.org/abs/1803.07289

[180] D. Koguciuk, Ł. Chechliński, and T. El-Gaaly, “3d object recognition with ensemble learning - a study of point cloud-based deep learning models,” in International Symposium on Visual Computing. Springer, 2019, pp. 100–114.

[181] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “Shapenet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015.

[182] L. Yi, L. Shao, M. Savva, H. Huang, Y. Zhou, Q. Wang, B. Graham, M. Engelcke, R. Klokov, V. Lempitsky et al., “Large-scale 3d shape reconstruction and segmentation from shapenet core55,” arXiv preprint arXiv:1710.06104, 2017.

[183] K. Mo, S. Zhu, A. X. Chang, L. Yi, S. Tripathi, L. J. Guibas, and H. Su, “Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 909–918.

[184] M. A. Uy, Q.-H. Pham, B.-S. Hua, T. Nguyen, and S.-K. Yeung, “Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1588–1597.

[185] C. Posch, D. Matolin, and R. Wohlgenannt, “A qvga 143 dB dynamic range frame-free pwm image sensor with lossless pixel-level video compression and time-domain cds,” IEEE Journal of Solid-State Circuits, vol. 46, no. 1, pp. 259–275, 2010.

[186] T. Serrano-Gotarredona and B. Linares-Barranco, “A 128 × 128 1.5% contrast sensitivity 0.9% fpn 3 µs latency 4 mW asynchronous frame-free dynamic vision sensor using transimpedance preamplifiers,” IEEE Journal of Solid-State Circuits, vol. 48, no. 3, pp. 827–838, 2013.

[187] G. Orchard, A. Jayawant, G. K. Cohen, and N. Thakor, “Converting static image datasets to spiking neuromorphic datasets using saccades,” Frontiers in neuroscience, vol. 9, p. 437, 2015.

[188] A. I. Maqueda, A. Loquercio, G. Gallego, N. García, and D. Scaramuzza, “Event-based vision meets deep learning on steering prediction for self-driving cars,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5419–5427.

[189] A. Sironi, M. Brambilla, N. Bourdis, X. Lagorce, and R. Benosman, “Hats: Histograms of averaged time surfaces for robust event-based object classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1731–1740.

[190] Y. Bi, A. Chadha, A. Abbas, E. Bourtsoulatze, and Y. Andreopoulos, “Graph-based spatial- temporal feature learning for neuromorphic vision sensing,” arXiv preprint arXiv:1910.03579, 2019.

[191] A. Nguyen, T.-T. Do, D. G. Caldwell, and N. G. Tsagarakis, “Real-time 6dof pose relocalization for event cameras with stacked spatial lstm networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0.

[192] S. M. Bohte, J. N. Kok, and H. La Poutre, “Error-backpropagation in temporally encoded networks of spiking neurons,” Neurocomputing, vol. 48, no. 1-4, pp. 17–37, 2002.

[193] A. Russell, G. Orchard, Y. Dong, Ş. Mihalas, E. Niebur, J. Tapson, and R. Etienne-Cummings, “Optimization methods for spiking neurons and networks,” IEEE transactions on neural networks, vol. 21, no. 12, pp. 1950–1962, 2010.

[194] Y. Cao, Y. Chen, and D. Khosla, “Spiking deep convolutional neural networks for energy-efficient object recognition,” International Journal of Computer Vision, vol. 113, no. 1, pp. 54–66, 2015.

[195] X. Lagorce, G. Orchard, F. Galluppi, B. E. Shi, and R. B. Benosman, “Hots: a hierarchy of event- based time-surfaces for pattern recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 7, pp. 1346–1359, 2016.

[196] Y. Lu, P. Keung, F. Ladhak, V. Bhardwaj, S. Zhang, and J. Sun, “A neural interlingua for multilingual machine translation,” arXiv preprint arXiv:1804.08198, 2018.

[197] R. Vázquez, A. Raganato, J. Tiedemann, and M. Creutz, “Multilingual nmt with a language-independent attention bridge,” arXiv preprint arXiv:1811.00498, 2018.