Deep Learning Approaches for 3D Inference from Monocular Vision

SUBMITTED IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Dominic Jack Bachelor of Applied Science (Hons)

School of Electrical Engineering and Robotics Science and Engineering Faculty Queensland University of Technology

2020

Copyright in Relation to This Thesis © Copyright 2020 by Dominic Jack

Bachelor of Applied Science (Hons). All rights reserved.

Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signature: QUT Verified Signature

Date: 3 September 2020

Abstract

Deep learning has contributed significant advances in the last decade. This thesis looks at two problems involving 3D inference from 2D inputs: human pose estimation, and single-view object reconstruction. Each of our methods considers a different type of 3D representation and seeks to take advantage of that representation's strengths, including keypoints, occupancy grids, deformable meshes and point clouds. We additionally investigate methods for learning from unstructured 3D data directly, including point clouds and event streams.

In particular, we focus on methods targeted towards applications on moderately-sized mobile robotics platforms with modest computational power on board. We prioritize methods that operate in real-time with relatively low memory footprint and power usage compared to those tuned purely for accuracy-like performance metrics.

Our first contribution looks at 2D-to-3D human pose keypoint lifting, i.e. how to infer a 3D human pose from 2D keypoints. We use a generative adversarial network to learn a latent space corresponding to feasible 3D poses, and optimize this latent space at inference time to find an element corresponding to the 3D pose which is most consistent with the 2D observation using a known camera model. This results in competitive accuracies using a very small generator model.

Our second contribution looks at single-view object reconstruction using deformable mesh models. We learn to simultaneously choose a template mesh from a small number of candidates and infer a continuous deformation to apply to that mesh based on an input image.

We tackle both problems of human pose estimation and single-view object reconstruction in our third contribution. Through a reformulation of the model presented in our first contribution, we combine multiple separate optimization steps into a single multi-level optimization problem that takes into account the feasibility of the 3D representation and its consistency with observed 2D features. We show that approximate solutions to the inner optimization process can be expressed as a learnable layer and propose problem-specific networks which we call Inverse Graphics Energy Networks (IGE-Nets). For human pose estimation, we achieve comparable results to benchmark deep learning models with a fraction of

the number of operations and memory footprint, while our object reconstruction model achieves state-of-the-art results at high resolution on a standard desktop GPU.

Our final contribution was initially intended to extend our IGE-Net architecture to accommodate point clouds. However, a search of the literature found no simple network architectures which were both hierarchical in cloud density and continuous in coordinates – both necessary conditions for efficient IGE-Nets. As such, we investigate various approaches that improve the performance of existing point cloud methods, and present a modification which is not only hierarchical and continuous, but also runs significantly faster and requires significantly less memory than existing methods. We further extend this work for use with event camera streams, producing networks that take advantage of the asynchronous nature of the input format and achieve state-of-the-art results on multiple classification benchmarks.

Acknowledgments

To my supervisors, thank you for your guidance and the freedom to chase my ideas – even when you knew they would not lead anywhere; to other students and academics in the lab, thank you for the advice, the water-cooler chats, and listening to my rants of frustrations; to my parents, thank you for your unquestioning love and support whether I’m starting a circus school or finishing a PhD; and finally to my friends, who kept me sane, accepted when I was distant, and welcomed me back after each of my deadline-induced disappearances. This project has pushed me to my limit, and I could not have got through without each and every one of you.

I would also like to thank the Australian government for the Research Training Program (RTP) of which I was a recipient, along with QUT’s Deputy Vice-Chancellor’s Initiative Scholarship.

Table of Contents

Abstract iii

Acknowledgments v

Acronyms xi

List of Figures xiii

List of Publications 1

1 Introduction 3

1.1 Computer Vision in Robotics ...... 4

1.2 Artificial Intelligence and the Deep Learning Era ...... 5

1.3 Problem Descriptions ...... 6

1.3.1 Human Pose Estimation ...... 6

1.3.2 Single View Object Reconstruction ...... 8

1.3.3 Point Cloud Classification ...... 10

1.3.4 Event Stream Classification ...... 10

1.4 Research Questions ...... 11

1.5 Thesis Outline ...... 11

2 Literature Review 13

2.1 General Deep Learning ...... 13

2.1.1 The Exploding/Vanishing Gradient Problem ...... 13

2.1.2 Optimizers ...... 14

2.1.3 Initialization ...... 15

2.1.4 Regularization ...... 15

2.1.5 Implementation and Training ...... 16

2.2 Generative Adversarial Networks ...... 18

2.3 Learned Energy Networks ...... 19

2.4 Deep Learning in Computer Vision ...... 19

2.4.1 Object detection ...... 20

2.4.2 Semantic Segmentation ...... 20

2.4.3 Instance Segmentation ...... 21

2.4.4 Human Pose Estimation ...... 21

2.4.5 Single View Object Reconstruction ...... 22

2.4.6 Point Cloud Networks ...... 23

2.4.7 Event Stream Networks ...... 24

3 Adversarially Parameterized Optimization for 3D Human Pose Estimation 27

4 Learning Free-Form Deformations for 3D Object Reconstruction 41

5 IGE-Net: Inverse Graphics Energy Networks for Human Pose Estimation and Single-View Reconstruction 59

6 Sparse Convolutions on Continuous Domains for Point Cloud and Event Stream Networks 73

7 Conclusion 99

7.1 Contributions ...... 99

7.2 Future Work ...... 100

7.2.1 Sparse Point Cloud Convolutions ...... 100

7.2.2 Sparse Event Stream Convolutions ...... 100

7.2.3 IGE-Nets ...... 101

7.2.4 Isosurface Extraction Models ...... 101

7.2.5 Universal Latent Shape Representation ...... 102

References 118

Acronyms

AI Artificial Intelligence

DL Deep Learning

CNN Convolutional Neural Network

CPU Central Processing Unit

FFD Free Form Deformation

GAN Generative Adversarial Network

GPU Graphics Processing Unit

IGE-Net Inverse Graphics Energy Network

ML Machine Learning

NLP Natural Language Processing

QUT Queensland University of Technology

ReLU Rectified Linear Unit

R-CNN Region-based CNN

SGD Stochastic Gradient Descent

TPU Tensor Processing Unit

List of Figures

1.1 Left-to-right: fleets of driverless cars are already being built [1]; doctor-assisted medical diagnosis improves accuracy and reduces doctor stress and fatigue [2]; a robotic “dog” is in use at San Francisco airport, taking and collating construction photographs [3]; Samsung’s SGR-A1 sentry gun on the South Korean side of the demilitarized zone [4]. ...... 4

1.2 Left-to-right: image classification, object detection and localization, semantic segmentation, and instance segmentation are common tasks in 2D computer vision...... 4

1.3 Human pose representations. Left-to-right: 2D heatmap and keypoint detections [5] and 3D keypoint detections [6]...... 7

1.4 3D object representations. Left-to-right: triangular mesh, point cloud, voxel occupancy grid, level sets slice...... 9

1.5 Output from an event camera can be thought of as a point-cloud in x, y, t space...... 10

List of Publications

Following is the list of peer-reviewed publications that form a part of this thesis.

• Dominic Jack, Frederic Maire, Anders Eriksson, and Sareh Shirazi. “Adversarially Parameterized Optimization for 3D Human Pose Estimation,” in International Conference on 3D Vision (3DV). Qingdao, China, 2017.

• Dominic Jack, Jhony K. Pontes, Sridha Sridharan, Clinton Fookes, Sareh Shirazi, Frederic Maire, Anders Eriksson. “Learning Free-Form Deformations for 3D Object Reconstruction,” in the Asian Conference on Computer Vision (ACCV). Perth, Australia, 2018.

• Dominic Jack, Frederic Maire, Sareh Shirazi, and Anders Eriksson. “IGE-Net: Inverse Graphics Energy Networks for Human Pose Estimation and Single-View Reconstruction,” in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA, 2019.

• Dominic Jack, Frederic Maire, Simon Denman, and Anders Eriksson. “Sparse Convolutions on Continuous Domains for Point Cloud and Event Stream Networks,” submitted to the European Conference on Computer Vision (ECCV). Glasgow, UK, 2020.

The following publication is related to this thesis but not included.

• Mateusz Michalkiewicz, Jhony K. Pontes, Dominic Jack, Mahsa Baktashmotlagh, Anders Eriksson. “Implicit Surface Representations As Layers in Neural Networks,” in International Conference on Computer Vision (ICCV). Seoul, Korea, 2019.

Chapter 1

Introduction

“As I dug into research on Artificial Intelligence, I could not believe what I was reading. It hit me pretty quickly that what’s happening in the world of AI is not just an important topic, but by far THE most important topic for our future.”

— Tim Urban, Wait but Why [7]

Driverless cars. Aerial drone-based parcel delivery. Robots doing housework. Military sentry guns automatically selecting and eliminating targets. These are not concepts predicted by futurists for the far-off future, or even experimental proof-of-concept implementations in highly controlled labs. All are in operation today across the globe, and progress in the area is only speeding up.

The potential (and realized) benefits of artificial intelligence (AI) and robotics are clear. Doctor-assisted disease diagnosis is already saving lives, automatic collision avoidance systems in semi-autonomous cars have already prevented numerous road accidents, and few owners of robotic vacuum cleaners report missing the manual experience.

On the other hand, self-driving cars have been involved in at least six deaths as of August 2019 [8]. Science fiction has a long history of telling stories of rogue artificial agents destroying humanity, and many high-profile figures are concerned this could become more than a fantasy. Even if humans maintain control over their creations, there is concern that today’s capitalist model is fundamentally incompatible with a post-robotic-revolution environment of widespread automation, and that without an adequate transition plan, the resulting unemployment and inequality could spark mass riots.

Whether you foresee a utopian robotic socialist future free of disease, war and hardship; or an apocalyptic end to humanity sparked by nefarious super-intelligent beings or world-wide societal collapse,


one thing is certain: we live in exciting times.

This thesis looks at methods for solving common problems that arise in computer vision for robotics, either motivated by or directly related to performing 3D inference from monocular vision.

Figure 1.1: Left-to-right: fleets of driverless cars are already being built [1]; doctor-assisted medical diagnosis improves accuracy and reduces doctor stress and fatigue [2]; a robotic “dog” is in use at San Francisco airport, taking and collating construction photographs [3]; Samsung’s SGR-A1 sentry gun on the South Korean side of the demilitarized zone [4].

1.1 Computer Vision in Robotics

“Robots... I think that is a hot topic.”

— Bill Budge

We loosely define computer vision as the process by which high-level information is extracted from low-level sensor data similar to that perceived by the human visual system. This includes images and video sequences, but also data from LIDAR or infrared cameras. Conceptually, the simplest of these tasks include image classification, object detection and localization, semantic segmentation and instance segmentation, as illustrated in Figure 1.2.

Figure 1.2: Left-to-right: image classification, object detection and localization, semantic segmentation, and instance segmentation are common tasks in 2D computer vision.

On their own, solutions to such tasks are not particularly useful for robotics applications. Robots operate in the 3-dimensional world, so the pixel coordinates of a bounding box cannot be used directly. They can however be used as inputs to higher-level tasks. For example, by knowing a camera’s position

and orientation, having a ground plane estimate, and assuming some part of the object is on the ground, the distance to the base of the object can be computed and used for collision avoidance. Other important tasks in mobile robotics with a heavy visual component include visual navigation, path planning and simultaneous localization and mapping.

Getting around is only one problem in robotics. Beyond driverless cars, most robots need to interact with their environment somehow, be it through grasping objects or opening doors. Solutions to these tasks generally involve a lot more than simple pixel-wise analysis, requiring an understanding of not only the 3D geometry of the objects involved but also how they respond to interaction. A plastic bag must be grasped differently to a hard ball, and a door handle on an otherwise relatively blank wall implies the presence of a door.

Interacting with humans presents additional challenges. Ignoring the challenges of natural language processing, a lot can be understood from reading body language and facial expression. For example, there are many consumer-grade drones on the market that can track targets for filming and also recognize hand gestures to know when to land.

1.2 Artificial Intelligence and the Deep Learning Era

“The key to artificial intelligence has always been the representation.”

— Jeff Hawkins

Deep Learning can trace its roots back to the early 1940s, when Landahl et al. [9] attempted to model neural networks of the human brain. With notable exceptions, advances over the subsequent 70 years were slow, stymied by a general lack of interest during the infamous AI winters and poor performance compared to competing methods such as support vector machines.

This all changed in 2012 when AlexNet [10] won a number of international competitions. This was made possible predominantly by two independent factors.

Cheap parallel computation High performance graphics processing units (GPUs) developed predominantly for gaming could be repurposed for scientific use thanks to general-purpose GPU libraries like CUDA [11]. This drastically reduced training times.

Large publicly available datasets It is generally accepted that deep learning methods are data-hungry. Well-organised datasets like ImageNet [12] with millions of labelled images allowed these

networks to reach their full potential.

Since then, deep learning research has boomed with more algorithmic advances, a proliferation of large, publicly available datasets and further hardware advances.

Deep Learning approaches now boast state-of-the-art performance in areas including computer vision, natural language processing, medical segmentation/diagnosis, drug discovery, graph analysis, game playing, speech recognition/synthesis and code generation.

1.3 Problem Descriptions

In this thesis we initially look at two problems of particular relevance to robotics platforms: human pose estimation and single view object reconstruction. As a result of this research we develop tools for extracting features from point clouds and event camera streams, so we additionally look at applications of these methods to classification tasks with these input formats.

1.3.1 Human Pose Estimation

Estimating human pose – the image or volumetric coordinates of specific body parts – is a common problem with applications in areas such as human-computer interaction, scene understanding, virtual/augmented reality and human action recognition/prediction. There are many variations of the problem, depending on the available input data and desired outputs. Models most often infer poses in one of two formats.

Keypoints Keypoints are ordered sets of coordinates – either 2D or 3D – corresponding to specific points such as hands, ankles or the face. This format is sparse and continuous in space, allowing for very precise inferences and simple, cheap calculations for most common losses and evaluation metrics. Unfortunately there is no natural way of expressing uncertainty, which is significant in the ambiguous circumstances that arise from occlusions and depth estimation. Keypoints are usually the format used in ground-truth annotations.

Heatmaps We can model uncertainty in joint locations by discretizing space and inferring a continuous pseudo-probability for each joint at each grid point. While this is useful for expressing known unknowns (like those resulting from occlusions or depth ambiguities), it greatly increases the size of the output space. As a result, the resolution of such heatmaps is often relatively low – particularly in 3D – greatly limiting the precision of each inference.

These output formats are applicable in both 2D and 3D and are illustrated in Figure 1.3. While it is possible to change from one form to the other – e.g. by constructing a discretized Gaussian centered at the keypoint, or by applying some argmax-like operation to heatmaps – the benefits of the original form are generally lost.
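The conversion between the two formats described above is compact enough to sketch directly. The following is a minimal, purely illustrative Python/NumPy example (grid size, sigma and function names are assumptions, not from any particular codebase): a keypoint is rendered as a discretized Gaussian heatmap and recovered with an argmax, at which point the sub-pixel precision of the original keypoint is lost.

```python
import numpy as np

def keypoint_to_heatmap(kp, shape=(64, 64), sigma=2.0):
    """Render a 2D keypoint (x, y) as a discretized Gaussian heatmap."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - kp[0]) ** 2 + (ys - kp[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def heatmap_to_keypoint(hm):
    """Recover an (x, y) estimate via argmax; sub-pixel precision is lost."""
    y, x = np.unravel_index(np.argmax(hm), hm.shape)
    return np.array([x, y], dtype=float)

kp = np.array([20.3, 41.7])
hm = keypoint_to_heatmap(kp)
print(heatmap_to_keypoint(hm))  # ~[20., 42.] -- quantized to the grid
```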

Figure 1.3: Human pose representations. Left-to-right: 2D heatmap and keypoint detections [5] and 3D keypoint detections [6].

Perhaps the simplest form of the problem involves inferring the 2D pose of a single target from a single image. Learned models face a couple of key problems. Firstly, most images will contain some form of occlusion – either from external objects or from other body parts (self-occlusions). As such, the network responsible for finding a left hand cannot simply look for regions that resemble left hands – it must also be aware of things that often lead to left hands, like left arms, in case the hand itself is completely occluded. A similar issue is distinguishing visually similar joints resulting from left-right symmetries. For example, left and right knees are often indistinguishable locally, and differentiation must be done by looking for other features – potentially quite far away – like visibility of the face or the relative location of other knee-like features.

One extension of this 2D pose inference problem is to infer poses of an unknown number of targets in each image. While conceptually simple, this extension raises many questions that must be resolved in terms of output format and choice of loss function. Keypoints can be used, but the number of sets of keypoints must be variable, and these must be matched to the labels in some way before loss can be calculated. Heatmaps can be used, but this raises the question of how to associate a unique instance of each joint with a unique human.

Another 2D variation is to add temporal information by considering a sequence of images. Conceptually there is much more information available to resolve ambiguities related to temporary occlusions, but the question remains how best to incorporate this information in a deep learning context.

3D variants of each of the above exist, though the extra depth dimension adds its own problems. The most obvious challenge is the resulting depth ambiguities. Furthermore, volumetric heatmaps (voxel heatmaps) become expensive to compute.

A significant factor in designing any deep learning model is the availability of data. This is greatly influenced by the cost of gathering such data – particularly in the case of human pose estimation. In 2D, the internet provides an effectively limitless source of images and videos of people in highly varied positions and environments. A simple labelling application is all that is required for a relatively unskilled human annotator to mark keypoints with a relatively high degree of accuracy. 3D labels in such varied environments are significantly more difficult to acquire. Most 3D datasets are produced using expensive motion-capture setups in very controlled environments with a relatively small number of participants. While the number of examples these setups can generate is large, the variety is limited, so deep learning methods must be carefully designed to allow resulting models to generalize to different environments.

A common way around this problem is to train two separate modules: an image-to-2D pose inference module, and a 2D-to-3D lifting module. The first module can be trained using only the highly varied 2D datasets, while the 2D-to-3D lifting module operates only on the high-level pose or image features extracted, rather than the raw pixel values.

In this thesis we focus on the problem of 2D-to-3D lifting. This allows us to iterate quickly on new ideas without requiring excessive training time or computational power due to the relatively low dimensionality of the problem.

1.3.2 Single View Object Reconstruction

The task of object reconstruction involves inferring a 3D representation of an object based on its image. From a robotics perspective, this is important for understanding an object’s affordances, planning how to interact with the object, or predicting how other agents might interact with the object. A key consideration in designing models for object reconstruction is the type of 3D representation used. We discuss the strengths and weaknesses of the main candidates below, and illustrate them in Figure 1.4.

Point Clouds Point clouds are unordered sets of coordinates generated by sampling the surface of an object, potentially with associated features like normal direction or colour. Like all surface-based approaches, they scale quadratically with resolution. Unlike keypoints, point clouds can be used to express any number of different object classes. On the downside, they don’t contain any explicit surface connectivity information. Their unordered nature also introduces complications in constructing deep networks, since operations should ideally be invariant to point order.

Deformable Meshes Meshes are defined by an ordered set of vertices and face data encoding connectedness. New meshes can be generated by inferring adjustments to vertex coordinates without changing the connectedness information. Since vertices are only on the surface of objects, this approach scales better with resolution than volumetric methods. Unfortunately, new meshes generated this way necessarily have the same topology as the input mesh. In some situations this is fine – even desired. For example, barring disfiguration or some forms of body augmentation, human faces are all topologically equivalent, and deforming vertices can allow for different expressions based on the same input face mesh. On the other hand, for some classes different instances will have different topologies. A modern passenger jet has a fundamentally different topology to a bi-plane, and while the surface of one could be deformed to have a similar appearance to the other, the underlying topology would be fundamentally incorrect and likely involve self-intersections.

Voxel Occupancy Grids A simple volumetric representation involves discretizing space into a regular grid of volume elements – voxels – and assigning each voxel a label of empty or occupied. Deep learning methods usually infer continuous pseudo-probabilities of occupancy to allow for differentiability. While this allows for arbitrary topologies up to the grid resolution, these methods scale cubically with resolution.

Level Sets Level set methods define a surface implicitly as the set of points at which some embedding function takes the value zero (or some other fixed value). Signed distance functions, for example, are embedding functions that give the signed distance to the surface, with the sign indicating whether the point is inside or outside the shape. A simple way of implementing these methods is to linearly interpolate function values inferred on a regular voxel grid. This allows for efficient isosurface extraction using well-established marching methods, but results in poor scaling with resolution, consistent with all voxel-based methods. Other embedding functions like parameterized neural networks may be able to express sparse shapes more efficiently by focusing detail towards the surface, but these introduce difficulties for efficient isosurface extraction.
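A minimal sketch of the voxel-grid level-set pipeline described above is shown below, assuming NumPy and scikit-image are available. The analytic sphere signed distance function is purely illustrative – in a learned model the grid values would be inferred by a network – and the marching cubes call extracts the zero level set as a triangular mesh.

```python
import numpy as np
from skimage import measure  # assumed available; provides marching cubes

# Evaluate a signed distance function (a sphere of radius 0.5) on a regular grid.
n = 32
lin = np.linspace(-1.0, 1.0, n)
x, y, z = np.meshgrid(lin, lin, lin, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 0.5  # negative inside, positive outside

# Extract the zero level set; vertices and faces define a triangular mesh.
verts, faces, normals, values = measure.marching_cubes(sdf, level=0.0)
print(verts.shape, faces.shape)  # O(n^2) surface elements from an O(n^3) grid
```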

Figure 1.4: 3D object representations. Left-to-right: triangular mesh, point cloud, voxel occupancy grid, level set slice.

1.3.3 Point Cloud Classification

Extracting high level features from low-level point cloud data is valuable for a range of tasks including object classification, 3D semantic segmentation, object completion or point cloud re-meshing. As we discuss in Chapter 6, a possible approach to single-view object reconstruction with point clouds requires point cloud operations with certain desirable mathematical properties. This feature extractor is applicable in areas beyond single-view reconstruction however, so we demonstrate its effectiveness in a multi-class point cloud classification setting.

1.3.4 Event Stream Classification

Unlike standard cameras that accumulate incident photons over a fixed time window and report intensities for all pixels at the same time, event cameras fire asynchronous events from individual pixels in response to changes in local intensity levels. This is illustrated in Figure 1.5. These event cameras are capable of running with very low power usage and at very high temporal resolutions, making them particularly promising for robotics applications. Unfortunately, most learning approaches are based on work in computer vision designed for image processing, meaning a lot of the advantages of this new data format are lost. We demonstrate that our work in point cloud processing is also applicable here by looking at various classification problems using event camera data.

Figure 1.5: Output from an event camera can be thought of as a point-cloud in x, y, t space.
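To make the analogy in Figure 1.5 concrete, the sketch below (Python/NumPy; the toy event list and field ordering are assumptions, not from any particular camera driver) stacks asynchronous events into an N × 3 array of (x, y, t) coordinates that point-cloud-style operators can consume, carrying polarity along as a per-point feature.

```python
import numpy as np

# Toy events as (x, y, t, polarity); real cameras stream these asynchronously.
events = [(12, 40, 0.0010, 1), (13, 40, 0.0012, -1), (12, 41, 0.0015, 1)]
ev = np.asarray(events, dtype=np.float64)

xyt = ev[:, :3].copy()
xyt[:, 2] *= 1e3        # rescale time so the three axes are comparable
polarity = ev[:, 3]     # per-point feature, analogous to colour on a point cloud

print(xyt.shape)        # (N, 3): a point cloud in x, y, t space
```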

The first three contributions relate to 2D-to-3D human pose lifting and/or single-view reconstruction. Our fourth looks at classification tasks involving 3D point clouds and event camera streams.

1.4 Research Questions

Throughout this thesis we seek to answer the following questions.

1. Can we formulate a principled approach to 3D generative models conditioned on 2D data that respects concepts of consistency and feasibility?

2. How can we best utilize techniques to simplify computer vision problems or enhance model performance? More specifically, can we use known device parameters and physics models to reduce what has to be learned? Can we create deep learning models that are reconfigurable with respect to input device(s) without the need for retraining?

3. Can we produce models targeted towards mobile robotics platforms that require fast inference times with moderate compute resources? Can we design those models to take advantage of the heterogeneous architectures (i.e. CPUs + modest GPU) typically on board?

4. How are these models best implemented in modern deep learning frameworks?

1.5 Thesis Outline

The rest of this thesis by publication is laid out as follows. In Chapter 2 we conduct a review of the literature relevant to generic deep learning, as well as techniques specific to 2D and 3D problems, generative adversarial networks, multi-level optimization/energy-based networks and point cloud networks.

Chapter 3 details our first contribution – a two-stage adversarially parameterized optimization approach to 2D-to-3D human pose lifting. Chapter 4 describes our model for performing single-view object reconstruction using deformable meshes via free-form deformation. In Chapter 5 we tackle both problems – human pose lifting and single-view reconstruction – using an energy-based framework that takes advantage of simple graphics techniques, before Chapter 6 discusses a continuous domain sparse convolution operator with applications to point cloud and event stream classification problems.

We conclude in Chapter 7 by summarizing our contributions and discussing future research directions.

Chapter 2

Literature Review

“If I have seen further than others, it is by standing upon the shoulders of giants.”

— Isaac Newton

In this chapter we briefly review literature relevant to the core concepts investigated in this thesis. We begin with an overview of general advances in deep learning before looking at specifics related to generative adversarial networks, learned energy networks, and techniques specific to the computer vision problems tackled in this work.

2.1 General Deep Learning

Neural networks have been around since the early 1940s [9]. That said, it wasn’t until 1962 that Dreyfus et al. [13] presented a simple, numerically efficient method for calculating gradients – backpropagation – allowing training via gradient descent, by far the most popular method of training networks today. Even then, it took an additional 50 years before deep learning rose to popularity.

2.1.1 The Exploding/Vanishing Gradient Problem

Poor performance in early networks was due primarily to poor propagation of gradients. As depth increased, the learning signal generally either blew up exponentially (exploded), or decreased to zero (vanished). This limited the depth of viable networks, handicapping their overall performance. Recent success in the area is due largely to a number of works that address this problem.

One major reason for the exploding/vanishing gradient problem is the saturation of biologically inspired sigmoid and hyperbolic tangent activation functions used in early networks. Rectified linear


units (ReLUs) [14] were introduced to address this, and continue to be used extensively in modern networks. This spawned many variants, including Leaky ReLUs [15], Exponential linear units [16], Parametric ReLUs [17], self-normalizing exponential linear units [18], swish [19] and mish [20].

Another approach to combating vanishing/exploding gradients is through activation normalization. Batch normalization [21] was the first popular method of this type and remains a critical component of many networks prominent in the literature. Variations include layer normalization [22] and group normalization [23].

Residual networks [24] allowed for a massive jump in network depth, demonstrating networks of 1,000 or more layers could be trained without experiencing exploding/vanishing gradients. The key idea was to add identity pathways – or skip connections – between blocks of layers that propagate residuals through the network. Variants include a pre-activation version [25], a multi-path version (ResNeXt, [26]) and a version that densely connected different blocks with skip connections (DenseNet [27]).
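The identity-pathway idea is simple enough to show in a few lines. The following is a minimal sketch written with PyTorch modules (that framework, the `ResidualBlock` name and the layer sizes are illustrative assumptions, not the reference implementation of any of the cited works): the block computes y = x + F(x), so gradients always have an identity path back through the sum.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the skip connection gives gradients an identity path."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))

x = torch.randn(2, 16, 32, 32)
print(ResidualBlock(16)(x).shape)  # torch.Size([2, 16, 32, 32])
```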

2.1.2 Optimizers

Neural networks are generally highly non-linear functions. Even if care is taken to ensure gradients do not vanish or explode, the time taken to train and the quality of the final network are highly dependent on the algorithm used to navigate the loss landscape towards a good minimum. In this subsection we discuss the algorithms – or optimizers – common in the literature.

Stochastic gradient descent (SGD) SGD with or without momentum has been used since at least 1964 [28], popularized in deep learning by Sutskever et al. [29].

Adagrad Adagrad [30] modifies SGD by adding a unique adaptive learning rate based on an accumulation of squared gradient values for each trained parameter. This is good for datasets with sparse features, as the network learns more from uncommon features. One key weakness is that the learning rate for each parameter is monotonically decreasing with examples, meaning all learning eventually plateaus – potentially before the solution has reached a minimum.

RMSprop First introduced by Hinton in an online lecture [31], RMSprop avoids Adagrad’s learning plateau by considering a moving average of squared gradient values rather than the sum.

Adadelta Adadelta [32] attempts to resolve a theoretical dimension mismatch in RMSprop by approximating second-order derivatives in a very principled manner.

Adam One of the most popular optimizers used in modern deep learning, Adam [33] combines the ideas of adaptive learning rates used in RMSprop and Adadelta with first-order momentum. The authors also introduced Adamax – a variant based on the infinity norm rather than the 2-norm.

Yogi Zaheer et al. [34] present a minimal modification to Adam that exhibits superior performance with minimal hyperparameter tuning.
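The moving-average machinery shared by RMSprop, Adadelta and Adam is compact enough to write out. Below is a minimal NumPy sketch of a single Adam update with bias-corrected first and second moments; the function name and default hyperparameters are illustrative, not taken from any particular library.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: momentum on the gradient plus a per-parameter scale."""
    m = b1 * m + (1 - b1) * grad          # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2     # second moment (adaptive learning rate)
    m_hat = m / (1 - b1 ** t)             # bias correction for step t >= 1
    v_hat = v / (1 - b2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```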

2.1.3 Initialization

Initialization is critical to most iterative solutions to optimization problems, and deep learning is no exception. While the various normalization techniques above can go some way to minimizing the significance of this, good initialization has nonetheless been shown to drastically improve both convergence times and the quality of converged solutions.

The most straightforward initialization schemes are based on sampling weights from a distribution that leaves the expected mean and variance of activations unchanged across a layer. Glorot and Bengio [35] show that doing so results in improved performance when using sigmoid or hyperbolic tangent functions, while He et al. [17] give a modified form for networks using ReLU activations.
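The two fan-based schemes above reduce to choosing the variance of the sampling distribution. A minimal NumPy sketch (function names are illustrative): Glorot scaling keeps activation variance roughly constant for tanh/sigmoid layers, while the He variant compensates for ReLU zeroing roughly half of the pre-activations.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out):
    """Glorot/Xavier: Var(W) = 2 / (fan_in + fan_out)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    """He: Var(W) = 2 / fan_in, accounting for ReLU halving the variance."""
    return np.random.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
```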

More recently, experiments on residual networks have shown they can be trained without batch normalization with only minor modifications. By using a principled scaling of standard kernel initializations and simple regularization, Zhang et al. [36] train models with up to 10,000 layers. ZeroInit [37] is a similar approach that replaces batch normalization by a multiplication of the convolution activation by a single learned parameter initialized to zero, resulting in improved performance compared to batch normalized networks for small or modest batch sizes.

Transfer learning – the process of using part of a model trained in one problem area as the starting point of a solution in another – can be thought of as an initialization scheme and has shown great success in deep learning [38]. Its success has led popular frameworks (see subsection 2.1.5) to provide state-of-the-art pretrained models for common problems like image classification, object detection and segmentation.

2.1.4 Regularization

Regularization techniques aim to ensure trained networks generalize to data not seen in training.

One of the simplest techniques is weight decay [39], where weights are reduced by a constant factor

after each optimizer update step. While straightforward to understand and implement, there are a couple of nuances to this approach that have been shown to have a non-negligible effect on training.

Firstly, early implementations of weight decay involved adding a squared term to the loss for each decayed parameter. This is equivalent to weight decay when using SGD without momentum, but is not the same when using other optimizers. While this loss term is confusingly often described as weight decay, the more standard term is L2 regularization. The difference was demonstrated by Loshchilov and Hutter [40], who showed models trained with Adam [33] and weight decay outperformed those trained with Adam and L2 regularization on image classification problems.
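The distinction is easiest to see in code. A minimal NumPy sketch (function names, simplified update and hyperparameters are illustrative; bias correction is omitted for brevity): with L2 regularization the penalty is folded into the gradient and therefore rescaled by the adaptive denominator, whereas decoupled weight decay shrinks the weights by a fixed factor independent of that scaling. With plain SGD the two coincide; with Adam-style updates they do not.

```python
import numpy as np

def adaptive_update(grad, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Simplified Adam-like update direction (no bias correction)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    return lr * m / (np.sqrt(v) + eps), m, v

def step_l2(w, grad, m, v, lam=1e-2):
    """L2 regularization: lam * w is added to the gradient and then rescaled."""
    delta, m, v = adaptive_update(grad + lam * w, m, v)
    return w - delta, m, v

def step_decoupled(w, grad, m, v, lam=1e-2, lr=1e-3):
    """Decoupled weight decay (AdamW-style): shrink weights outside the update."""
    delta, m, v = adaptive_update(grad, m, v)
    return w - delta - lr * lam * w, m, v
```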

Another surprising aspect of weight decay is the effect it has on layers with outputs subjected to batch/layer normalization. These normalization layers scale their input to unit variance, so uniformly scaling multiplicative factors that affect these inputs (as is the case for weight decay) has no effect on output. Nonetheless, weight decay has been shown to improve network performance in the presence of batch normalization. This was investigated by Zhang et al. [41], who show that without weight decay parameters grow in scale over time, resulting in a reduced effective learning rate. Significantly, they also showed that the regularizing effect of weight decay or L2 regularization was due almost entirely to its influence on the effective learning rate and the resulting increase in noise in gradient-based update steps – even for networks without batch normalization.

A separate form of regularization is dropout [42], where activations are randomly dropped during training. This was motivated by a desire to discourage co-adaptation of features, and is equivalent to training an exponential number of smaller networks with shared weights in an ensemble. Maxout networks [43] were proposed as natural companions to dropout by changing the activation between layers to a maximum over feature groups.
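A minimal sketch of the (inverted) dropout variant commonly used in practice, written in NumPy purely for illustration: activations are zeroed with probability `rate` during training and the survivors are rescaled so that expected magnitudes match those seen at test time, when the function is a no-op.

```python
import numpy as np

def dropout(x, rate=0.5, training=True):
    """Inverted dropout: zero activations with probability `rate`, then rescale."""
    if not training or rate == 0.0:
        return x
    mask = np.random.rand(*x.shape) >= rate
    return x * mask / (1.0 - rate)
```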

Guo and Gould [44] took the idea of dropout further and applied it to entire layers of residual networks. Not only did this have the desired regularization effect, it also drastically improved training times as computation of dropped layers could be skipped entirely, rather than computed and ignored as in the per-value implementation.

2.1.5 Implementation and Training

A key reason research in deep learning has developed so rapidly recently is the ease with which researchers can prototype ideas in terms of both implementation elegance and training time. This can be attributed to a number of factors.

High level frameworks A large number of frameworks have sprung up for deep learning with support for various languages and programming paradigms. These include Caffe [45], Theano [46], Torch [47], MXNet [48], ONNX [49], [50], Chainer [51] and CNTK (formerly Microsoft Cognitive Toolkit) [52]. Recently, the academic research community has largely converged on using either TensorFlow [53] or PyTorch [54], either directly or through higher-level interfacing libraries like Keras [55] and fastai [56].

Accelerator Hardware Compared to most machine learning algorithms, training deep learning models is computationally expensive. General purpose GPU programming [11] marked a turning point in deep learning research. More recent hardware developments include integrated circuits with in-memory processing [57] and purpose-built tensor processing units (TPUs) [58] designed to address the increased energy demands anticipated for data centers. Quantum computers are also slowly coming online [59], and while much work is required to design algorithms that can take advantage of this paradigm, the potential impact on the machine learning landscape is massive [60].

Distributed computing Many state-of-the-art deep learning architectures depend on their large size for their expressivity, and hence predictive performance. For example, NVidia recently released MegatronLM [61], a language model whose parameters alone require 33GB of space. Very few accelerators have sufficient memory to hold these, and that’s before considering space for input data and activations. Models like these are trained by distributing parameters and computation across multiple devices. While this distributed training introduces implementation complexities, NVidia was able to achieve almost linear performance improvements with respect to the number of GPUs used, allowing MegatronLM to be trained across 512 GPUs in less than 10 days.

Cloud computing Of course, many researchers don’t have access to their own cluster of 512 GPUs or other large-scale distributed accelerators. Cloud computing providers like Amazon Web Services [62], Microsoft Azure [63] and Google Cloud [64] fill this gap, providing deep learning specific products with support for major frameworks allowing researchers to rent various hardware configurations and run experiments quickly and affordably without the need to purchase, maintain and manage their own networks [65].

2.2 Generative Adversarial Networks

Since their introduction by Goodfellow et al. [66] in 2014, Generative Adversarial Networks (GANs) have seen an explosion in interest. At the time of writing, there are 14,000 search results containing the exact phrase, and the original paper has over 15,000 citations. We provide a brief overview here as it relates to our first contribution (Chapter 3). For a more thorough review, we direct the interested reader to the recent survey by Pan et al. [67].

While a large number of variations exists, GANs are generally paired networks intended to learn distributions of data. A generator network is responsible for mapping a known random distribution to an element in the target domain, and a separate discriminator network is tasked with classifying an element of the target domain as originating either from the generator or the training dataset. These networks are trained in conjunction using a modified mini-max loss, with the generator aiming to produce output indistinguishable from the dataset as judged by the discriminator [66].
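A minimal sketch of one alternating training step of this mini-max game, written with PyTorch (that framework, the `gan_step` name and the latent dimension are illustrative assumptions; `generator` and `discriminator` stand in for problem-specific networks that output samples and logits respectively):

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, real, latent_dim=64):
    """One alternating update of the discriminator and the generator."""
    ones = torch.ones(real.size(0), 1)
    zeros = torch.zeros(real.size(0), 1)

    # Discriminator: classify dataset samples as 1 and generated samples as 0.
    fake = generator(torch.randn(real.size(0), latent_dim)).detach()
    d_loss = (F.binary_cross_entropy_with_logits(discriminator(real), ones)
              + F.binary_cross_entropy_with_logits(discriminator(fake), zeros))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: produce samples the discriminator classifies as real.
    fake = generator(torch.randn(real.size(0), latent_dim))
    g_loss = F.binary_cross_entropy_with_logits(discriminator(fake), ones)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```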

Applications of GANs are many and varied. In computer vision, GANs have been applied to problems including image super-resolution [68, 69], image translation [70–72], texture synthesis [73–75] and deformable mesh inference [76]. Other notable uses include speech/poetry/music generation [77], anomaly detection in medical images [78], generating DNA sequences [79] and malware creation [80].

While modifying network architectures to generate/discriminate elements of new target domains is relatively straightforward, vanilla implementations often exhibit unstable training and mode collapse [81]. Many methods to combat these phenomena are based on careful selection of objective functions, regularization techniques and training schemes. Wasserstein GANs [82] were introduced to take advantage of the superior gradient behaviour of the earth-mover distance for distribution learning, with an implementation based on a modified loss function and weight clipping. This approach was modified by Gulrajani et al. [83] by replacing the weight clipping with a gradient penalty term. Training stability was further improved by Petzka et al. [84] via an additional penalty term.

Other techniques that have been demonstrated to improve performance include feature matching, historical averaging and one-sided label smoothing [85]; separate learning rates [86]; and spectral normalization [87, 88].

We use GANs extensively in Chapter 3.

2.3 Learned Energy Networks

At a high level, the problems investigated in this thesis involve inferring values of unknown variables from observations. Energy-based models describe relationships between sets of variables by mapping each combination to a scalar energy value, where realistic or likely combinations correspond to lower energies than their less viable counterparts. Inferences are made by fixing values of known variables and seeking unknown values which minimize the energy [89].

Energy-based models have been combined with deep learning in the past. Zheng et al. [90] formulated conditional random fields (CRFs) as a recurrent neural network layer which, combined with a standard convolutional neural network (CNN), achieved state-of-the-art results for semantic segmentation. Amos and Kolter [91] considered energy functions based on quadratic programs. Their implementation solved the inner optimization problem efficiently and exactly, and demonstrated it was able to learn hard constraints like those associated with the number game Sudoku.

Domke [92] presented a number of implementations for efficiently computing and differentiating approximate optimizations – solutions where the energy minimization process is based on a fixed number of steps of some optimization algorithm. While the algorithms did not find the exact solution to the energy minimization problem, these truncated optimization processes still yielded good results for image denoising and labeling problems. Belanger et al. [93] took a similar approach and showed inexact optimization of complex energy functions outperformed exact solutions using simpler functions for image denoising and natural language semantic role labeling.
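The truncated-optimization idea is compact enough to sketch: the inner energy minimization is replaced by a fixed number of gradient steps, and because each step is differentiable the whole loop can be treated as a layer and trained end-to-end. The sketch below uses PyTorch autograd (assumed); the quadratic energy and the `unrolled_argmin` name are purely illustrative stand-ins for a learned feasibility/consistency measure.

```python
import torch

def unrolled_argmin(energy, y0, steps=5, lr=0.1):
    """Approximate argmin_y E(y) with a fixed, differentiable unrolled loop."""
    y = y0
    for _ in range(steps):
        (grad,) = torch.autograd.grad(energy(y), y, create_graph=True)
        y = y - lr * grad
    return y

# Illustrative energy: squared distance to a learnable target.
target = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
energy = lambda y: ((y - target) ** 2).sum()

y0 = torch.zeros(2, requires_grad=True)
y_hat = unrolled_argmin(energy, y0)   # approximate inner minimizer
outer_loss = (y_hat ** 2).sum()       # some downstream training loss
outer_loss.backward()                 # gradients flow back into `target`
print(target.grad)
```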

While the ideas and implementations behind learned energy networks are well established, we feel there is potential for significant improvement by tailoring energy functions to specific applications. In the area of computer vision, learned energy networks let us take advantage of computer graphics techniques like projection and learn a consistency measure in the projected space, rather than attempt the ambiguous task of learning the inverse of projection. Chapter 5 addresses this gap.

2.4 Deep Learning in Computer Vision

Virtually all advances in computer vision using neural networks have involved convolution operations, where local features are extracted from pixel neighborhoods using parameterized kernels. While the idea of convolutional neural networks (CNNs) has been around since the 1980s [94], it wasn’t until AlexNet [10] that their potential began to be realized. Since then, a number of different CNN architectures have been proposed and evaluated. Families of particular note include VGG [95], ResNet [25, 26], Inception

[96, 97], Xception [98], MobileNet [99, 100], DenseNet [27], NASNet [101], EfficientNet [102] and MixNet [103].

Image classification is the de facto standard for evaluating new CNN architectures due to the problem’s simplicity and the availability of many well-constructed and widely used datasets of varying sizes and dimensions [12, 104, 105].

2.4.1 Object detection

Object detection is another problem CNNs have come to dominate in recent years. Unlike image classification, the output of object detection – some location information (usually bounding boxes) for an unbounded number of objects of different classes – cannot be trivially represented as a fixed size tensor as required by standard CNNs. Region-based CNNs (R-CNN) [106] address this by solving a classification problem at a fixed number of regions proposed by selective search [107]. While R-CNN gave promising results, inference speed was slow. Spatial pyramid pooling networks [108] addressed the speed issue by extracting convolutional features once and reusing these in evaluating each region proposal. Fast R-CNN [109] modified the spatial pooling aspect to be differentiable, and hence made the entire network trainable end-to-end, while Faster R-CNN [110] replaced the region proposal component with a separate smaller CNN.

YOLO (you only look once) networks [111–113] are fundamentally different to R-CNN networks in that they pass the entire image through a single CNN and output confidences of object existence at predefined grid points and resolutions, rather than running inference on a number of proposed regions. This makes them much faster than region-based methods, though the finite resolution of the discrete grid means accuracy is generally lower for small objects.

Single shot detectors (SSD) [114] can be thought of as a hybrid approach. Like YOLO networks, the entire image is passed through a single CNN, the output of which is passed into a small Faster R-CNN-inspired bounding box offset/confidence prediction network.

2.4.2 Semantic Segmentation

Semantic segmentation – the problem of classifying individual pixels of an image as belonging to one or more classes – is an important problem in areas such as medical diagnosis/treatment [115], driverless cars [116] and scene understanding [117].

U-net [118] was one of the first CNN-based approaches to tackle this problem by combining a

standard convolution/pooling architecture with upsampling layers and skip connections to perform medical image segmentation. FCN [119] took a similar approach, though also demonstrated fractionally strided convolutions (or transposed convolutions) could be used instead of upsampling to form a fully convolutional network.

Chen et al. [120, 121] introduced atrous convolutions and used conditional random fields to refine predictions. They later extended this approach to using an encoder-decoder style base network [122]. Takikawa et al. [123] used explicit shape priors, while Zhu et al. [124] improved results in video datasets by propagating information across frames.

2.4.3 Instance Segmentation

Similar to semantic segmentation, instance segmentation problems require networks to classify each pixel in an image as belonging to a certain class. In addition, those inferences must differentiate between pixels corresponding to different objects of the same class.

Mask R-CNN [125] takes a simple approach to the problem, modifying Faster R-CNN by adding a mask head to the existing classification network. Mask-scoring R-CNN [126] extends this by penalizing low-quality masks (as measured by IoU) corresponding to high classification accuracies. The authors show this approach consistently outperforms the baseline across all CNN model architectures.

Facebook AI Research published a series of works on instance segmentation. Their initial model, DeepMask [127], focused on segmenting the central object in a given image patch. They extended this work with SharpMask [128], which uses features at multiple resolution levels to refine segmentations provided by DeepMask. Their MultiPath network [129] combined a modified Fast R-CNN model for image patch proposal with the DeepMask/SharpMask models.

2.4.4 Human Pose Estimation

Inferring human pose in two or three dimensions from images is an important part of many tasks including human-computer interaction and action recognition. For the 2D problem, traditional approaches combine visual features and image descriptors with a tree-structure of the body and known invariants and proportions [130]. More recently, deep learning’s wave of success in other image processing applications such as image classification and segmentation has flowed into pose estimation, with fully-convolutional approaches achieving exceptionally accurate 2D inferences by regressing heatmaps rather than the joint coordinates themselves [131–135].

The 3D problem is considerably more challenging. In addition to problems involved in the 2D variant, the main difficulty in training 3D pose inference systems that work in the wild is the availability of varied datasets. While 2D datasets can be annotated manually, 3D information is generally gathered using special motion-capture systems. Although these systems are capable of generating massive volumes of data, the examples within such datasets are usually limited in variety. For example, the Human3.6M dataset (H3.6M) [136] contains millions of frames, but all images are collected in the same room with only a handful of subjects. By contrast, the popular 2D dataset COCO [117] features over 50,000 human pose annotations with very few duplicates.

To get around this lack of varied 3D data, many methods use a 2-stage approach to 3D inference by inferring 2D poses from images, then lifting these 2D poses to 3D separately [137–139]. These approaches benefit from the varied image features in 2D datasets, but the separate stages mean any “lifting” module is unable to take advantage of contextual information learned in the first stage.

The other main difficulty with 3D pose estimation is the inherent ambiguity associated with depth inference and occlusions. Adversarial approaches tackle this by introducing loss terms which are themselves learned in a modified mini-max game [140, 141] that encourage feasible solutions when multiple consistent possibilities exist. While these approaches penalize networks which produce inferences inconsistent with observation, they still require the network to learn to avoid these inconsistencies and thus learn some concept of projection. Given that we understand the concepts involved in projection almost perfectly and it is an unambiguous operation, we feel this is wasted effort, and seek to address this weakness in Chapter 3.

2.4.5 Single View Object Reconstruction

Reconstructing 3D objects from a single view is a common problem in computer vision and robotics. Fundamental to any approach is the choice of 3D representation for the inference. Volumetric methods are the most widely used in 3D learning [142–151]. These approaches generally use 3D analogues of ideas and operations that have proven successful in image processing, including convolutions, deconvolutions and feature pooling. Recent advances in auto-encoders [152, 153] and GANs [154–157] have also shown promising results on regular 3D grids, while Tulsiani et al. [158] showed object shape and pose can be learned simultaneously and without 3D labels using only depth maps or silhouettes to encourage view consistency across multiple views.

Unfortunately, the additional dimension inherent to 3D representations means these methods scale poorly with resolution, resulting in generally coarse outputs – typically 32³ or 64³. To overcome this scaling issue, octree networks [159–162] recursively divide regions of interest into octants. By focusing

only on regions near the object surface, these methods operate with complexity proportional to surface area rather than volume.

Another approach is to represent the surface of the object as a level set of a 3D embedding function. Park et al. [163] show that a single continuous deep network can be learned to represent a large number of embedding functions by conditioning the network on a latent-space embedding of the shape. Michalkiewicz et al. [164] showed that training based on voxels targeting continuous signed distance function values rather than binary occupancy values resulted in better surfaces at higher resolutions.

Other approaches to high-resolution inference keep the regular volumetric data structure but use operations that scale better to higher resolutions [165, 166].

Template deformation approaches instead infer a constant-sized space warping that can be applied to an arbitrarily dense cloud or mesh. This comes at a cost however, as the topology of the output shape is intrinsically coupled with that of the deformed template. The two extremes of these are DeformNet [167] and FoldingNet [168]. DeformNet uses a latent representation of the coarse voxelization of a template mesh selected from a large database. While this makes choosing relatively close template meshes possible, the coarse discretization required to make the network computationally feasible limits the precision of the deformation the network can hope to infer. FoldingNet uses the exact same template for all inputs – a 2D plane – meaning all solutions are topologically equivalent to a plane regardless of the input.

The work we present in Chapter 4 seeks to combine the strengths of these two deformable mesh approaches by training separate decoder networks for each of a small number of templates.

2.4.6 Point Cloud Networks

Point cloud methods avoid the need to discretize space, instead working on continuous coordinates of points on the object surface [169, 170]. However, the variable size and unordered nature of point clouds introduce their own complexity in deep learning frameworks.

Early works in point cloud processing – Pointnet [171] and Deep Sets [172] – use point-wise shared subnetworks and order invariant pooling operations. The successor to Pointnet, Pointnet++ [173] was (to the best of our knowledge) the first to take a hierarchical approach, applying Pointnet submodels to local neighborhoods.
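A minimal sketch of the order-invariant design behind Pointnet and Deep Sets, written with PyTorch (that framework, the `TinyPointNet` name and the layer sizes are illustrative assumptions): the same MLP is applied to every point independently and a symmetric (max) pooling collapses the set into a fixed-size feature, so permuting the input points leaves the output unchanged.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Per-point shared MLP followed by symmetric (max) pooling."""
    def __init__(self, in_dim=3, feat_dim=128, num_classes=10):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, points):                 # points: (batch, num_points, 3)
        per_point = self.point_mlp(points)     # (batch, num_points, feat_dim)
        pooled = per_point.max(dim=1).values   # order-invariant global feature
        return self.head(pooled)

x = torch.randn(2, 1024, 3)
perm = torch.randperm(1024)
model = TinyPointNet()
assert torch.allclose(model(x), model(x[:, perm]), atol=1e-5)  # permutation-invariant
```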

SO-net [174] takes a similar hierarchical approach to Pointnet++, though uses a different method for sampling and grouping based on self-organizing maps. DGCNN [175] applies graph convolutions to point clouds with edges based on spatial proximity. KCNet [176] uses dynamic kernel points in

correlation layers that aim to learn features that encapsulate the relationships between those kernel points and the input cloud. While most approaches treat point clouds as unordered sets by using order-invariant operations, PointCNN [177] takes the approach of learning a canonical ordering over which an order- dependent operation is applied. SpiderCNN [178] and FlexConv [179] each bring their own unique interpretation to generalizing image convolutions to irregular grids. While SpiderCNN focuses on large networks for relatively small classification and segmentation problems, FlexConv utilizes a specialized GPU kernel to apply their method to point clouds with millions of points.

Recent work on ensemble methods [180] showed groups of networks working together can significantly outperform lone models. In doing so, it also highlighted a trend in published results whereby reported metrics are consistently above the average performance of the model when the training process is repeated multiple times.

Assisting the development of deep learning methods for point clouds is the availability of large, quality datasets. Probably the most popular for point cloud classification is Modelnet [143], a classification dataset made up of 3D CAD models from which point clouds are artificially sampled. ShapeNet [181] also provides a large set of CAD models of different categories, with a derivative competition [182] in 2017 featuring a point-cloud-based semantic segmentation challenge, and more recently a much higher resolution variant [183]. In a recently released dataset, Uy et al. [184] also performed a survey of existing methods and found models trained on artificial data performed poorly on real-world data.

We identify the following gaps in the point cloud feature extraction literature:

• there seem to be no network structures which are both hierarchical and continuous in coordinates; and

• those networks advertising themselves as convolutional [177–179] implement operations significantly different to the mathematical definition of convolution.

We address these gaps in Chapter 6.

2.4.7 Event Stream Networks

Compared to standard images, relatively little deep learning research has targeted event streams. Interest has started to grow recently with the availability of a number of event-based cameras [185, 186] and publicly available datasets [186–190].

A number of approaches utilize the extensive research in standard image processing by converting event streams to images [188, 191]. While these can leverage existing libraries and cheap hardware optimized for image processing, the necessity to accumulate synchronous frames prevents them from taking advantage of many potential benefits of the data format. Other approaches look to biologically-inspired spiking neural networks (SNNs) [192–194]. While promising, these networks are difficult to train due to the discrete nature of the spikes.

Other notable approaches include the work of Lagorce et al. [195], who introduce hierarchical time-surfaces to describe spatio-temporal patterns; Sironi et al. [189], who show histograms of these time surfaces can be used for object classification; and Bi et al. [190], who use graph convolution techniques operating over graphs formed by connecting close events in space-time.
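A toy version of the time-surface idea is sketched below: for each incoming event, the recency of neighbouring events is summarized by an exponentially decayed patch. The patch radius and decay constant are arbitrary assumptions, and real implementations typically separate polarities and add further processing.

```python
import numpy as np

def time_surfaces(events, height, width, radius=3, tau=0.05):
    """For each event (x, y, t), build an exponentially decayed patch of the
    most recent event times in its spatial neighbourhood."""
    last = np.full((height, width), -np.inf)        # time of last event per pixel
    surfaces = []
    for x, y, t in events:                           # events assumed sorted by time
        x, y = int(x), int(y)
        last[y, x] = t
        y0, y1 = max(0, y - radius), min(height, y + radius + 1)
        x0, x1 = max(0, x - radius), min(width, x + radius + 1)
        patch = np.exp((last[y0:y1, x0:x1] - t) / tau)  # in (0, 1], 0 where no event yet
        surfaces.append(patch)
    return surfaces

events = [(10, 12, 0.001), (11, 12, 0.004), (40, 3, 0.010)]  # (x, y, t) samples
surfaces = time_surfaces(events, height=64, width=64)
```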

These approaches generally suffer from a need to accumulate a large number of events before inferences can be made – either by creating images and using image-based techniques, or by building up graphs or spike-trains and training an encoder that operates over the full graph or spike-train. We address this gap in Chapter 6 with a slight modification of our approach designed for point clouds.

Chapter 3

Adversarially Parameterized Optimization for 3D Human Pose Estimation

“Generative Adversarial Networks is the most interesting idea in the last ten years in machine learning, in my opinion.”

— Yann LeCun, Director, Facebook AI

Inferring 2D human poses from images can be framed as a regression problem. For deep learning approaches, some architectural decisions have to be made – prediction format (keypoints, heatmaps etc.), what loss to use, how to deal with multiple targets etc. – but beyond these, off-the-shelf CNNs perform quite well.

3D pose inference presents additional challenges and opportunities. At its simplest it can be tackled in a similar way to 2D pose inference with an additional depth dimension. However, there is significantly more ambiguity in this dimension, and this fails to take advantage of stronger priors over 3D joint distributions. For example, we can enforce a much stronger prior over limb lengths in 3D space compared to 2D pixel distances. 3D datasets are also generally significantly less varied than their 2D counterparts due to the cost of collection, so any model that performs well on such a limited 3D dataset will likely fail to generalize to different environments.

Instead of performing 3D inference directly from images, our first contribution looks at inferring 3D pose from a 2D pose inference or observation. Conceptually, we infer 3D keypoints by searching a feasible pose space for a solution which is most consistent with the observation. We perform this in two stages.


Learning feasibility. Rather than trying to enumerate conditions which make poses feasible, we use a generative adversarial network (GAN) to learn a mapping from a fixed distribution to the distribution of normalized feasible poses.

Optimizing consistency. At inference time, we optimize a sample from the fixed input distribution and denormalization parameters with respect to a consistency measure between the projected feasible pose and the 2D observation using an off-the-shelf optimizer.
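As a toy illustration of these two stages, the sketch below stands in a random linear map for the trained GAN generator and an orthographic projection for the known camera model, then runs the consistency optimization with an off-the-shelf optimizer. All names, dimensions and the generator itself are illustrative assumptions; only the structure of the search reflects the method.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_joints, n_z = 17, 8
G_weights = 0.1 * rng.standard_normal((n_z, n_joints * 3))  # stand-in for a trained generator

def generator(z):
    """Map a latent code to an (n_joints, 3) pose; a trained GAN in practice."""
    return (z @ G_weights).reshape(n_joints, 3)

def project(pose):
    """Known camera model; a simple orthographic projection onto the x-y plane here."""
    return pose[:, :2]

def reprojection_loss(z, observed_2d):
    return np.sum((project(generator(z)) - observed_2d) ** 2)

observed_2d = project(generator(rng.standard_normal(n_z)))   # synthetic 2D observation
result = minimize(reprojection_loss, x0=np.zeros(n_z),
                  args=(observed_2d,), method="L-BFGS-B")
pose_3d = generator(result.x)   # feasible pose most consistent with the observation
```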

The resulting model has the following properties:

• feasibility is invariant to normalization factors like scale, rotation about the vertical axis and uniform horizontal displacement;

• learned feasibility is independent of camera parameters, so the same learned GAN can be used with any number of different cameras (so long as intrinsic parameters are known); and

• projection calculations are done explicitly using known camera models, meaning the learned component does not need to learn or approximate this process. This results in significantly smaller/faster networks.

Statement of Contribution of Co-Authors for Thesis by Published Paper

The authors listed below have certified that:

1. they meet the criteria for authorship in that they have participated in the conception, execution, or interpretation, of at least that part of the publication in their field of expertise;

2. they take public responsibility for their part of the publication, except for the responsible author who accepts overall responsibility for the publication;

3. there are no other authors of the publication according to these criteria;

4. potential conflicts of interest have been disclosed to (a) granting bodies, (b) the editor or publisher of journals or other publications, and (c) the head of the responsible academic unit; and

5. they agree to the use of the publication in the student’s thesis and its publication on the QUT’s ePrints site consistent with any limitations set by publisher requirements.

In the case of this chapter: Adversarially Parameterized Optimization for 3D Human Pose Estimation, presented at the International Conference on 3D Vision (3DV) in Qingdao, China, 2017.

Dominic Jack (QUT Verified Signature, 26 May 2020): Experiment design, model implementation and paper drafting.
Frederic Maire: Advised on model design and paper editing.
Anders Eriksson: Advised on model design and paper editing.
Sareh Shirazi: Advised on model design and paper editing.

Principal Supervisor Confirmation

I have sighted email or other correspondence from all Co-authors confirming their certifying authorship. (If the Co-authors are not able to sign the form please forward their email or other correspondence confirming the certifying authorship to the RSC.)

Frederic Maire (QUT Verified Signature), 26 May 2020

Adversarially Parameterized Optimization for 3D Human Pose Estimation

Dominic Jack, Frederic Maire, Anders Eriksson, Sareh Shirazi
[email protected], [email protected], [email protected], [email protected]
QUT, 2 George St, Brisbane, Australia

Abstract

We propose Adversarially Parameterized Optimization, a framework for learning low-dimensional feasible parameterizations of human poses and inferring 3D poses from 2D input. We train a Generative Adversarial Network to 'imagine' feasible poses, and search this imagination space for a solution that is consistent with observations. The framework requires no scene/observation correspondences and enforces known geometric invariances without dataset augmentation. The algorithm can be configured at run time to take advantage of known values such as intrinsic/extrinsic camera parameters or target height when available without additional training. We demonstrate the framework by inferring 3D human poses from projected joint positions for both single frames and sequences. We show competitive results with extremely simple shallow network architectures and make the code publicly available¹.

¹github.com/jackd/adversarially parameterized optimization

Keywords: GAN, Generative Adversarial Network, Human Pose Estimation, Non-rigid Body Transformation, Inverse Graphics, Ill-Posed Function Inversion

1. Introduction

Recovering the pose of a person from an image or a sequence of images is an important research problem for areas including human computer interaction, computer animation and biometric analysis for sport and rehabilitation. At its core is an inverse projection problem: mapping a two dimensional representation into a three dimensional space. While advances in computing over the last several decades have resulted in systems that can render realistic 2D images based on 3D scenes at very high frame rates, the inverse problem remains difficult. Not only is it ill-posed – many 3D scenes correspond to the same 2D image – it cannot be expressed as a simple combination of geometric transformations.

For the purpose of this paper we separate the problem into two parts: an image processing problem, which involves inferring 2D image coordinates of joints; and an inverse projection problem, responsible for mapping these 2D pixel coordinates to 3D joint positions. This decoupling has several advantages. In particular:

• image processing modules can be switched in and out without the need for retraining;

• neither module is required to be trainable end-to-end; and

• the image processing system does not require 3D joint information, and the 3D inverse projection system does not require image information.

This paper's contribution focuses on the second of these problems.

The rest of this paper is laid out as follows: we formally define the problem in Section 2 before discussing previous work in the area in Section 3. We introduce our proposed method in Section 4 and the experimental setup in Section 5. Results are covered in Section 6 before we conclude and discuss future work in Section 7.

We provide a summary of notation conventions in Table 1.

2. Problem Description

Consider a parameterization of a 3D human pose p ∈ P and a mapping function Π : P → Pπ that yields its 2D projection. Of all possible poses, only a small subset Pf ⊂ P are feasible. This subset is not explicitly given and only described by a sample Pd ⊂ Pf.


Table 1. Notation summary.

    tilde (p̃)             inferred/proposed/imagined
    hat (p̂)               normalized
    π subscript (pπ)       projected
    i subscript (pi)       pertaining to the i-th joint
    (t) superscript (p(t)) pertaining to time step t
    ∗ superscript (z∗)     optimal value
    p, p̃, p̂               pose parameterization
    λπ, λf                 loss terms
    αc, αs, αs0            loss scaling hyperparameters

The task is to find an inverse projection function Π̃† : Pπ → P such that Π̃†(Π(p)) ≈ p for all p ∈ Pf.

3. Related Work

3.1. Human Pose Estimation

Approaches to human pose estimation are many and varied, each introducing its own advantages and limitations. Marker suits are used extensively in animation [26] to produce highly accurate poses in tightly controlled environments. Different/additional sensors such as RGB-D cameras [38] and multi-view rigs [27] have also been used to capture poses of naturally dressed participants, though these require more extensive setups.

Monocular vision approaches are attractive due to the ubiquity of cheap, high quality cameras. No special suits or camera calibration is required, and the internet provides an effectively infinite source of highly varied, unlabeled data.

Unfortunately, 3D labelling of this data is difficult and error prone due to the ill-posed nature of inferring 3D data from a 2D image. While datasets with a large number of frames exist [39][17], the variation between frames is generally poor. For example, the largest of these datasets – Human 3.6 million (H3M) [17] – features 3.6 million frames, but all are recorded from only a handful of camera locations and a dozen different actors, and all sequences are recorded in the same room.

Despite this, recent advances in deep learning have been used to regress 3D joint locations directly [4][42][32]. The lack of variety in datasets has also been tackled by augmenting existing datasets with realistic renderings of real poses in synthetic environments [31][37][6].

Other approaches avoid the scarcity of data and ill-posedness entirely by looking at the simpler problem of 2D pose estimation [33][28][8][3][34][15][29]. These approaches benefit from large amounts of cheap, highly varied hand-labeled data [18][1], and modern systems achieve near human-level performance in real-time [46][5].

These 2D approaches are often used as intermediate representations for 3D pose estimation. Optimization methods have been used with anthropomorphic constraints (e.g. constant/symmetric limb lengths, known body proportions) to resolve ambiguities [35]. Zhou et al. [50] use temporal and spatial information with expectation-maximization, while Martinez et al. [23] show that a very simple network can learn to 'lift' 2D data to 3D. Tome et al. [44] fuse 2D feature learning with 3D plausibility to train a single network end-to-end, and Mehta et al. [25] take a similar approach in their real-time model.

3.2. Generative Adversarial Networks

Generative Adversarial Networks (GANs) are coupled network systems that attempt to simultaneously learn the distribution from which some dataset is sampled and features that distinguish membership of that set [11]. The idea is to train a deterministic generator function to produce realistic output based on a random seed, while a critic/discriminator network is trained to distinguish between this output and real data samples.

Since their introduction, GANs have seen an explosion in popularity in areas such as image synthesis [30], super-resolution image inpainting [20], unsupervised disentanglement [7], text-to-image translation [36][48], 3D reconstruction and inverse graphics [13][45].

4. Proposed Method

We describe the proposed method in the context of inverse graphics – more specifically, human pose estimation – though it is equally valid for any ill-posed function inversion problem where the feasible solution space is poorly defined but is small with respect to the possible solution space.

We define the reconstruction loss λr of a proposed solution p̃ ∈ P given an actual pose p ∈ P as

    λr(p̃; p) = |p − p̃|    (1)

and the reprojection loss λπ given a projection pπ ∈ Pπ as

    λπ(p̃; pπ) = |pπ − Π(p̃)|².    (2)

The method is agnostic to the precise pose representation p and norm |·|. For simplicity, we use the 3D cartesian coordinates of nj joints, P = R^(nj×3), and the 2-norm.

We describe poses with zero reprojection loss as consistent and denote the set of such objects {p̃ | λπ(p̃; pπ) = 0} = Pc ⊂ P.

The perfect reconstruction must be consistent. Unfortunately, consistency does not imply accurate reconstruction. The consistent set is large and varied, but there is only one perfectly reconstructed solution.

Since our inverse projection function is only required to reconstruct feasible poses Pf, we seek solutions that are both feasible and consistent, p̃ ∈ Ps = Pf ∩ Pc.

The hope is that this set will be significantly smaller than the consistent set and contain more densely packed solutions.

This corresponds to solving the reprojection optimization problem over the feasible solution space,

    Π̃†(pπ) = argmin_{p̃ ∈ Pf} λπ(p̃; pπ).    (3)

However, the feasible set Pf is poorly defined. To resolve this, we propose learning an approximation of the feasible space, or an imagination space P̃f ≈ Pf, and optimizing over this space of imaginations. We do this by learning a generator function G : Z → P̃f and solving the optimization problem over this generator's domain,

    z∗(pπ) = argmin_{z ∈ Z} λπ(G(z); pπ).    (4)

The inverse projection is thus the generator applied to the optimal parameterization,

    Π̃†(pπ) = G(z∗(pπ)).    (5)

The primary role of the generator function G is to filter out infeasible poses. As a fortunate side effect, this often results in a significant dimensionality reduction, making the subsequent optimization problem easier.

In addition, we consider using a feasibility loss function λf : P → R to prevent the generated imaginations becoming too infeasible and minimize the modified total loss function

    λm(p̃; pπ) = λπ(p̃; pπ) + λf(p̃).    (6)

The combination of these ideas results in solving the optimization problem

    Π̃†(pπ) = G( argmin_{z ∈ Z} λm(G(z); pπ) ).    (7)

We discuss the generator and feasibility functions more in Section 4.2.

4.1. Pose Normalization

The feasibility of a 3D human pose is invariant to horizontal shifts and rotations about the vertical axis, and to a lesser extent scale (within a certain range). Learning-based methods have commonly dealt with these invariances by augmenting the training dataset with random perturbations in these values. This does not guarantee invariance however, and makes the learning problem harder. Additionally, the resulting parameterization often loses any semantic meaning associated with these invariances.

Instead, we propose using a deterministic invertible normalization function N : P → P̂ × N to normalize poses prior to training, where p̂ ∈ P̂ is the normalized pose parameterization and n ∈ N is the normalization vector.

We can combine this with a normalized generator Ĝ : Ẑ → P̂f by applying the functions in series,

    G([ẑ, ñ]) = N⁻¹(Ĝ(ẑ), ñ),    (8)

where ẑ ∈ Ẑ is the parameterization of the normalized imagination space P̂f ⊂ P̂ and ñ ∈ N is the proposed normalization vector.

To enforce the invariances listed above, we normalize by rotating all poses about the vertical axis such that the hips are aligned with the x-axis, the pelvis is directly above the origin and the height of the target is 1. The corresponding normalization vector is thus the angle between the vector joining the hips and the positive x-axis, the horizontal coordinates of the pelvis and the height of the target.

This formulation assumes the intrinsic and extrinsic camera properties are known. In situations where this is not the case, the framework is capable of inferring these properties by including them in the normalization vector.

4.2. Learning Feasibility

GANs provide us with the tools to learn both the generator and feasibility loss functions jointly by training on the normalized dataset P̂d = {p̂ | [p̂, n] = N(p), p ∈ Pd}.

For a base feasibility loss we use a linear scaling of the logits from the critic trained in conjunction with the generator, λc,

    λf = −αc λc,    (9)

where αc is a non-negative hyperparameter.

We illustrate the training and optimization processes in Figure 1.

Figure 1. (a), (b) Generator/critic functions Ĝ and Ĉ with parameters θG and θC are optimized adversarially during training using normalized poses p̂ ∈ P̂d and randomly sampled ẑ. Loss functions λG, λC and λ̃C vary across GAN implementations. (c) At run time, the modified reprojection loss is optimized with respect to the generator input ẑ and normalization parameters ñ. Loss terms (red) are optimized with respect to yellow parameters.

4.2.1 Hand-crafted Feasibility

While learned feasibility is attractive in that it requires no expert knowledge and does not introduce artificial bias, the framework does not preclude hand-crafted feasibility terms augmenting or replacing the GAN critic loss. For example, anthropomorphic constraints such as symmetric limb length consistency or bone length proportions can be encouraged by additional loss terms.

In other situations, the feasible space of solutions may be too large to learn directly. Inferring pose sequences for example involves inferring a pose at each frame. Experiments show single frame pose estimation is unstable with respect to depth estimation, so sequences of independently inferred poses tend to exhibit observable depth-shuddering. One approach would be to train a GAN to generate sequences rather than individual poses, but this significantly increases the dimensionality of the problem.

Alternatively, depth shuddering can be reduced by introducing a penalty on fast motion between frames, handled during optimization. To demonstrate the extensibility of the system, we consider a general feasibility loss that combines the logits from the critic with hand-crafted temporal smoothness terms,

    λf = −αc λc + αs λs + αs0 λs0,    (10)

where λs is a smoothing term given by

    λs = (1 / (T − 1)) Σ_{t=1}^{T−1} |p̃(t+1) − p̃(t)|²    (11)

which discourages flickering between valid poses with ambiguous projections and whole-body depth-shuddering, while λs0 is a friction term given by

    λs0 = (1 / (T − 1)) Σ_{t=1}^{T−1} min_i |p̃i(t+1) − p̃i(t)|²    (12)

to encourage at least one joint to be mostly stationary. p̃(t) is the imagined pose at the t-th frame and p̃i(t) is the i-th joint of that pose. αs and αs0 are constant non-negative hyperparameters.
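The temporal terms in Equations 11 and 12 translate directly to code. The sketch below assumes a sequence of imagined poses stored as a (T, nj, 3) array; the example weights at the end mirror the αs and αs0 values used for the sequence experiments, but are illustrative only.

```python
import numpy as np

def smoothness_loss(poses):
    """Equation 11: mean squared displacement of the whole pose between frames."""
    diffs = poses[1:] - poses[:-1]                 # (T-1, n_joints, 3)
    return np.mean(np.sum(diffs ** 2, axis=(1, 2)))

def friction_loss(poses):
    """Equation 12: for each frame pair, penalize only the joint that moves
    least, encouraging at least one near-stationary joint."""
    diffs = poses[1:] - poses[:-1]
    per_joint = np.sum(diffs ** 2, axis=2)         # (T-1, n_joints)
    return np.mean(per_joint.min(axis=1))

rng = np.random.default_rng(0)
poses = np.cumsum(0.01 * rng.standard_normal((60, 17, 3)), axis=0)  # toy sequence
total = 10.0 * smoothness_loss(poses) + 1e3 * friction_loss(poses)  # example weights
```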

4.3. Domain Independent Training

One key advantage of the proposed method is that the GAN training process and resulting generator/critic functions are independent of the inverse projection input space. Thus far we have discussed mapping 2D pixel coordinates to 3D spatial coordinates, but we can easily modify this if our means of sensing the scene changes.

For example, many approaches to 2D pose inference output a pseudo-probability distribution or heatmap for each joint [28][46][5]. By modifying our reprojection loss, we can adapt our method to use such input without the need to retrain our GAN.

For heatmaps hi(x), i = 1, 2, ..., nj, we might be tempted to use the probability-weighted average reprojection loss,

    λπ(avg) = Σ_{i=1}^{nj} Σ_{pπi} hi(pπi) |pπi − Π(p̃i)|²,    (13)

where the inner summation is over the domain of hi. However, the average of competing feasible hypotheses is itself not guaranteed to be feasible. 2D joint detectors often have difficulty distinguishing joints from their symmetric counterpart, e.g. left and right hands. In these situations the heatmaps are strongly bi-modal, but it would be a mistake to choose the midpoint of these contending hypotheses as a result. To counter this, we consider a loss which seeks to maximize the intersection of the heatmap of each joint with a Gaussian centered at the inferred 2D position,

    λπ(g) = −Σ_{i=1}^{nj} Σ_{pπi} hi(pπi) g(pπi − Π(p̃i); σ),    (14)

where g is a unit 2D Gaussian with standard deviation σ. The differences are illustrated in 1D in Figure 2.

Figure 2. Optimal reprojection values based on probability distributions (black). Optimal values with respect to Gaussian reprojection loss (blue, Equation 14) tend towards locally confident regions compared to those attained by average loss (red, Equation 13).

4.4. Optimization

While the results will vary depending on the choice of optimizer, the framework itself is agnostic to the manner in which Equation 4 is solved. Generator and critic functions will presumably be differentiable assuming standard GAN training approaches, as are standard projection functions, so gradient-based optimizers should be applicable assuming any hand-crafted losses introduced are differentiable.

On the other hand, standard gradient-based optimizers do not lend themselves to easy parallelization and are unable to represent multiple optimal solutions. The set of feasible and consistent solutions is small, but ambiguities will still arise. Population-based methods resolve both these issues trivially, though may take more iterations to converge. In the interest of simplicity we limit the scope of this paper to using a standard LBFGS gradient-based optimizer [22], though leave this open for future investigation.

4.5. Comparison with Other GAN-based Methods

Recent work has looked at using GANs for inverse graphics [45]. Their approach trains a network to map from projections to 3D poses using a loss function augmented with a GAN critic to promote feasible solutions. Our approach differs from this in that we use a feasibility measure during inference rather than just training. We also search the generator input space for solutions based on their consistency with observations rather than starting from the observations themselves.

5. Experiments

5.1. Dataset

We evaluate performance on the Human 3.6 Million dataset [17]. We trained GANs using the 24 unique joints provided in the dataset, using subjects S1, S5, S6, S7 and S8 for training and S9 and S11 for evaluation. We considered every 5th frame, equivalent to a frame rate of 10 frames per second.

We ran experiments starting from 2D ground truth poses (h3m p2). We also used inferences from the OpenPose framework [5] trained on the COCO dataset [21]. We inferred joint heatmaps based on the entire image rescaled to 128 × 128. During optimization, we considered raw heatmaps (op hm) downscaled to a 64 × 64 grid and point inferences (op p2) made by the same framework. For the heatmap case we used Gaussian reprojection with a standard deviation of 10% of the image size.

These experiments considered a variety of hyperparameter sets. To ensure the results are valid in general and not overfitting to the evaluation set, we evaluated the most successful model on the Human Eva 1 (EVA) [39] dataset. This smaller dataset features a different skeleton and camera setup. For comparison with Yasin et al. [47] we train the architecture from random initialization twice: once with training data from all actions, and a second time with training data related to walking sequences removed. Apart from this, we used the standard train/evaluation split. For consistency, we used the 14 joint skeleton used by Yasin et al.

For completeness we consider optimization both with and without temporal smoothing. Yasin et al. do not use temporal information, so values provided with temporal information are for self-comparison only.

Due to the smaller size of the dataset, we process frames at the full 60 frames per second provided.

Most optimizations are run without using temporal information. We recover the temporally independent optimization problem by setting αs and αs0 in Equation 10 to zero.

5.2. Reconstruction Evaluation

To evaluate performance of the entire system, we use known 2D/3D correspondences (pπ, p) to evaluate the reconstruction loss from Equation 1. However, small inaccuracies in depth estimation tend to drown out errors in relative joint positions. We thus report the per-joint error after the inferred pose undergoes an optimal rigid body transformation, as is common in the literature [47][45][50],

    λrt = (1 / nj) min_{T ∈ T} |T(p̃) − p|,    (15)

where T is the set of all rigid body transformations. Except where otherwise stated, transformations are calculated independently for each frame.

In all experiments we assume camera properties (intrinsic and extrinsic) are known. This decreases optimization time, but makes little difference to the errors after transformation.

For H3M dataset results we use the 17 joint skeleton used in [45] for pose inference, alignment and average reported values. EVA results are evaluated on the 14 joint skeleton of Yasin et al. [47].

5.3. GAN Architectures

We investigated a number of small architectures consisting of simple sequences of fully connected layers. We restricted networks to have only 1 or 2 hidden layers. In all cases, critic and generator networks were constructed identically except for the number of input/output nodes. No batch normalization, dropout or weight regularization was used in any networks. Rectified linear units were used as activations at the output of all layers except the last in each case.

We experimented with both weight-clipped (WC) and gradient-normalized (GN) [12] Wasserstein GAN training regimes. The critic optimization step was run 5 times per generator update step. WC models were trained using RMSProp optimization with a learning rate of 10⁻⁴, while GN models used ADAM optimization with λ = 10, α = 10⁻⁴, β1 = 0 and β2 = 0.9. All networks were trained with a batch size of 128 for 1e7 critic optimization steps and 2e6 generator optimization steps.

While both WC and GN methods gave similar results, WC models tended to do marginally better across most measures. All values reported are for the WC GANs.

All models were implemented using Tensorflow 1.2. Batches ran at 100–150 batches per second on an NVidia GTX-1070 GPU.

5.4. Optimization

All optimizations were performed using the limited-memory BFGS algorithm [22] for consistency and simplicity.

Compared to many neural-network-based pose-inference models, our independent frame optimization is a relatively slow process, taking approximately 1 second per frame on an NVidia GTX-1070. In order to better leverage the parallel capabilities of modern GPUs, we solve the independent optimization problem for each sequence using every frame jointly. While solving independent problems jointly may require more operations and introduce subtle differences due to termination criteria, these effects were found to be negligible, and the batch capabilities of the hardware resulted in a speedup of roughly a factor of 5.

6. Results

We begin by looking at reconstruction losses for a number of different GAN architectures. We tried a number of different sizes for the input space nz and number of hidden nodes nh. All feasibility weights were set to zero and ground truth 2D joint positions were used with the reprojection loss.

Parameter values and average reconstruction losses after optimal rigid transformation are given in Table 2. We see that even the smallest model performs reasonably well despite having only roughly 1,500 trainable weights. The larger networks – all with less than 30,000 trainable parameters – offer modest improvements. To put things in perspective, the standard VGG-16 architecture [40] common in image processing contains 138 million weights, so these networks are three-to-five orders of magnitude smaller.

Table 2. Average reconstruction loss using ground truth 2D poses and reprojection loss and optimal per-frame alignment. Lower is better. IDs are for models referenced in analysis.

    ID       nz     nh          λrt
    small    8      32          85.9
             8      64          90.6
             16     64          82.8
             16     128         80.8
             32     128         76.1
             64     128         78.8
             128    128         84.3
             32     64, 64      75.7
    big      32     128, 64     68.9
             32     128, 128    74.0

Selected results at a more granular level are given in Table 3. We compare against Tung et al. [45] and Yasin et al. [47] due to the similarities in approach and evaluation method. The literature is awash with models that achieve better raw numbers. Some are undoubtedly more accurate [23][43][32][25], while others use slightly different metrics, training data and/or report different metrics [4][24][49][50]. We do not claim the results presented to be state-of-the-art in this regard.

Table 3. Single-frame average reconstruction loss from optimal reprojection loss for each sequence in mm. Lower is better. GT: ground truth 2D poses. I2: inferred 2D points used. IH: inferred heatmaps used.

    Model      Direct   Discuss   Eat     Greet   Phone
    [47]GT     60.0     54.7      71.6    67.5    63.8
    [45]GT     53.7     71.5      82.3    58.6    86.9
    smallGT    64.6     75.3      80.0    80.3    81.4
    bigGT      45.4     53.1      65.3    56.6    68.2
    [45]I2     77.6     91.4      89.9    88.0    107.3
    bigI2      88.2     100.1     94.4    99.8    109.0
    bigIH      75.0     82.5      82.7    87.2    92.7

    Model      Photo    Pose      Purch.  Sit     SitD.
    [47]GT     96.9     61.9      55.7    73.9    110.8
    [45]GT     98.4     57.6      104.2   100.0   112.5
    smallGT    93.8     78.9      79.6    108.0   135.9
    bigGT      74.0     50.0      55.5    106.0   152.8
    [45]I2     110.1    75.9      107.5   124.2   137.8
    bigI2      119.5    95.0      117.6   128.2   184.4
    bigIH      108.3    85.2      88.1    116.7   166.3

    Model      Smoke    Wait      Walk    W.Dog   W.Tog
    [47]GT     78.9     67.9      47.5    89.3    53.4
    [45]GT     83.3     68.9      57.0    -       -
    smallGT    78.7     78.5      78.6    95.4    85.6
    bigGT      58.4     60.6      60.3    58.4    59.5
    [45]I2     102.2    90.3      78.6    -       -
    bigI2      99.5     96.0      102.4   105.1   103.1
    bigIH      85.4     83.6      86.1    91.5    91.6

Our smallest model is unsurprisingly worse in almost all categories compared to other methods, but still performs relatively well. The larger model performs at or better than the other models considered in most categories. The obvious exception to this is the SitD. category, and to a lesser extent the Sit category, where our models perform poorly even when using ground truth 2D joint positions. The sequences themselves feature actors sitting and lying on the floor, and are quite unlike other categories in the dataset. This is likely an example of mode collapse [10].

Similarly to Tung et al. [45], we observe a significant drop in accuracy as a result of using poses inferred from images as opposed to the ground-truth labels provided in the dataset. This is lessened when using heatmaps rather than the collapsed 2D joint positions as expected, since the optimizer is able to take uncertainty into account.

It should be noted the image-to-2D pose inference model we used [5] was trained on 'in-the-wild' images and not fine-tuned on this dataset. As such, we expect similar performance of our model on 'in-the-wild' images, and that fine-tuning this model on the H3M dataset could improve
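All of the errors above are reported after the optimal rigid alignment of Equation 15. A sketch of one common way to compute such an alignment is given below (a Kabsch/SVD fit of rotation and translation; scale is not optimized here, which may differ from the exact convention used in a given benchmark).

```python
import numpy as np

def aligned_error(pred, gt):
    """Mean per-joint error after optimally rotating and translating the
    predicted pose onto the ground truth (Kabsch alignment via SVD)."""
    pc, gc = pred - pred.mean(0), gt - gt.mean(0)
    u, _, vt = np.linalg.svd(pc.T @ gc)
    d = np.sign(np.linalg.det(u @ vt))                  # avoid reflections
    r = u @ np.diag([1.0, 1.0, d]) @ vt
    aligned = pc @ r + gt.mean(0)
    return np.mean(np.linalg.norm(aligned - gt, axis=1))

rng = np.random.default_rng(0)
gt = rng.standard_normal((17, 3))
angle = 0.3
rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                [np.sin(angle),  np.cos(angle), 0.0],
                [0.0,            0.0,           1.0]])
pred = gt @ rot.T + np.array([0.5, -0.2, 1.0]) + 0.01 * rng.standard_normal((17, 3))
print(aligned_error(pred, gt))   # small residual remains after alignment
```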

results due to improvements in the 2D joint inferences. Figure 3 shows sample output from our big model using ground truth 2D pose.

Figure 3. Sample output for H3M dataset using 24 unique joint skeleton before alignment. Thick: ground truth. Thin: inferred pose. Red/blue/black: right/left/central limbs. Left: 2D projected points from camera view point. Right: view from a different angle. Top/middle: the model generally succeeds in finding a relatively close pose. The offset is the result of failing to precisely infer depth/scale. Bottom: the model struggles with sequences involving kneeling and lying down.

To investigate the extent to which the critic feasibility loss can assist or hinder the reconstruction, we show the average result of optimizing for various values of αc in Figure 4. It shows little change to the reconstruction loss until the reprojection loss is largely drowned out.

Figure 4. For the big model using ground truth 2D poses, the effect of the critic weight makes little difference to the reconstruction loss for small values.

This is a somewhat surprising result. Due to the generator sampling regime, outputs are unlikely to be feasible for inputs far from the origin. Presumably infeasible poses exist outside the feasible zone, and without any feasibility loss there is nothing to constrain the search to a space near the origin. The fact the critic loss is largely unnecessary suggests local minima occur sufficiently close to the origin that the gradient-based optimizer is unable to escape in order to find a more distant infeasible pose which better matches the observation.

The effect of temporal smoothness is clearly evident from videos, removing almost all depth shuddering. Reconstruction losses are shown in Figure 5 and show a clear decrease until the point at which the temporal losses overpower the reprojection loss. It is interesting to note that these improvements still occurred – albeit to a lesser extent – when considering per-frame optimal reconstruction. This suggests not only is temporal smoothing removing the depth shuddering, but also helping to disambiguate poses with the same projection based on neighbouring frames.

Figure 5. Reconstruction losses for varying smoothness weights αs. Values shown had αs0 = 100αs and αc = 0. Dashed lines correspond to αs = 0. Blue lines denote values with optimal reconstruction across the entire sequence. Red lines denote values optimally reconstructed per frame.

We use the big model with a critic weight of αc = 0 for evaluation on the EVA dataset. Reconstruction losses for the aligned inferences are given in Table 4. Optimization with temporal information uses αs = 10 and αs0 = 1e3. Sample results are visualized in Figure 6.

Table 4. Average reconstruction loss for Human Eva dataset. t: uses temporal information. w: uses 'Walking' data for training.

    ID        Walking                Jog                    All
              S1      S2      S3     S1      S2      S3
    [47]w     40.1    33.1    47.5   48.6    43.6    40.0   42.1
    bigw      37.1    42.1    75.1   59.5    62.2    57.0   54.3
    bigwt     35.4    38.3    65.6   48.4    53.1    49.5   47.7
    [47]      70.5    60.4    86.9   46.5    40.4    38.8   57.3
    big       52.7    52.5    95.0   55.8    58.4    63.5   62.8
    bigt      50.9    50.5    83.2   52.0    55.8    60.0   58.6

Figure 6. Sample output for EVA dataset using 14 unique joint skeleton before alignment. Thick: ground truth. Thin: inferred pose. Red/blue/black: right/left/central limbs. Left: 2D projected points from camera view point. Right: view from a different angle. Top: an example of a good inference (the projections on left are indistinguishably aligned). Bottom: side-on views are sometimes aligned incorrectly.

Clearly these results are less competitive than those attained for the H3M dataset. Qualitatively, we observe the system often fails to correctly infer the orientation for side-on views, resulting in inferred skeletons spinning on the spot, even when temporal information was used. This may be a case of poor initialization, since GAN inputs are initialized at zero. When using temporal data, this might be reduced by the introduction of a 'rotational smoothness loss' term that penalizes fast rotations (since rotation about the major axis is only lightly penalized by the standard smoothing loss term).

Alternatively, due to the smaller skeleton and dataset compared to H3M, this network architecture may simply be larger than optimal.

Unsurprisingly, performance degrades if the walking sequence data is removed from training. We note this degradation is less than what is seen in Yasin et al. [47], bringing our model closer to parity.

The small improvement in accuracy with the addition of temporal information is consistent with experiments on the H3M dataset.

7. Conclusion

We have shown that GAN parameterization can effectively be used to learn a feasible space in which to search for solutions to ill-posed problems. We have demonstrated that tiny, primitive networks of only a couple thousand trainable parameters perform well, and only slightly larger networks can achieve comparable results to recently published approaches in most categories.

On the down side, the framework requires solving an optimization problem at each frame. While this optimization problem is relatively low-dimensional and can be solved at roughly 1 frame per second on a modern computer, this likely precludes its use in online systems. Additionally, solutions exhibit observable depth-shuddering unlike approaches which are continuous with respect to their inputs [23].

Despite this, we believe these are limitations of the implementation, rather than the framework itself. In particular, we highlight four areas for further investigation.

Firstly, we draw attention to the fact that the pose parameterization p does not necessarily need to be a vector of cartesian coordinates for each joint. All the framework requires is that there is a mapping from this representation to a reprojection loss given some observation. As a simple example, polar coordinates could be used, or a kinematic model which enforces limb length consistency over time and/or appropriate body proportions. Feasibility may be easier to learn for a different representation, and every generator presents a different surface over which to optimize at run time.

We also expect larger and more sophisticated networks that take advantage of network ideas like dropout [41], batch/layer normalization [16][2] and weight regularization [9] could perform much better at this task, and expect these ideas to be crucial to parameterizing more complex scenes like sequences and/or multiple targets.

Similarly, we have largely ignored the question of efficient optimization. The small size of the networks lends itself to optimization algorithms that leverage parallelism, and we anticipate real-time online processing should be possible with an appropriate optimizer.

Finally, we introduced hand-crafted temporal loss terms mostly to demonstrate extensibility. Greater improvements should be possible with more targeted approaches to leveraging temporal information. In particular, we suggest learning to generate pose sequences directly using long short-term memory [14] or temporal convolutional neural networks [19] rather than introducing artificial loss terms.

Acknowledgements

This research was supported by the Australian Research Council through the grant ARC FT170100072.

International Conference on Machine Learning, pages 448– 456, 2015. 8 References [17] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Hu- man3.6m: Large scale datasets and predictive methods for 3d [1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human sensing in natural environments. IEEE Transactions human pose estimation: New benchmark and state of the art on Pattern Analysis and Machine Intelligence, 36(7):1325– analysis. In IEEE Conference on Computer Vision and Pat- 1339, jul 2014. 2, 5 tern Recognition (CVPR), June 2014. 2 [18] S. Johnson and M. Everingham. Clustered pose and nonlin- [2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. ear appearance models for human pose estimation. In Pro- arXiv preprint arXiv:1607.06450, 2016. 8 ceedings of the British Conference, 2010. [3] V. Belagiannis and A. Zisserman. Recurrent human pose doi:10.5244/C.24.12. 2 estimation. In Automatic Face & (FG [19] Y. LeCun, Y. Bengio, et al. Convolutional networks for im- 2017), 2017 12th IEEE International Conference on, pages ages, speech, and time series. The handbook of brain theory 468–475. IEEE, 2017. 2 and neural networks, 3361(10):1995, 1995. 8 [4] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, [20] C. Ledig, L. Theis, F. Huszar,´ J. Caballero, A. Cunningham, and M. J. Black. Keep it smpl: Automatic estimation of 3d A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. human pose and shape from a single image. arXiv preprint Photo-realistic single image super-resolution using a gener- arXiv:1607.08128, 2016. 2, 6 ative adversarial network. arXiv preprint arXiv:1609.04802, [5] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi- 2016. 2 person 2d pose estimation using part affinity fields. In CVPR, [21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra- 2017. 2, 4, 5, 6 manan, P. Dollar,´ and C. L. Zitnick. Microsoft coco: Com- [6] W. Chen, H. Wang, Y. Li, H. Su, Z. Wang, C. Tu, D. Lischin- mon objects in context. In European conference on computer ski, D. Cohen-Or, and B. Chen. Synthesizing training images vision, pages 740–755. Springer, 2014. 5 for boosting human 3d pose estimation. In 3D Vision (3DV), [22] D. C. Liu and J. Nocedal. On the limited memory bfgs 2016 Fourth International Conference on, pages 479–488. method for large scale optimization. Mathematical program- IEEE, 2016. 2 ming, 45(1):503–528, 1989. 5, 6 [7] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning [23] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple by information maximizing generative adversarial nets. In yet effective baseline for 3d human pose estimation. arXiv Advances in Neural Information Processing Systems, pages preprint arXiv:1705.03098, 2017. 2, 6, 8 2172–2180, 2016. 2 [24] D. Mehta, H. Rhodin, D. Casas, O. Sotnychenko, W. Xu, [8] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and and C. Theobalt. Monocular 3d human pose estimation us- X. Wang. Multi-context attention for human pose estima- ing transfer learning and improved cnn supervision. arXiv tion. arXiv preprint arXiv:1702.07432, 2017. 2 preprint arXiv:1611.09813, 2016. 6 [9] H. B. Demuth, M. H. Beale, O. De Jess, and M. T. Hagan. [25] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, Neural network design. Martin Hagan, 2014. 8 M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt. [10] I. Goodfellow. 
Nips 2016 tutorial: Generative adversarial Vnect: Real-time 3d human pose estimation with a single networks. arXiv preprint arXiv:1701.00160, 2016. 6 rgb camera. arXiv preprint arXiv:1705.01583, 2017. 2, 6 [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, [26] A. Menache. Understanding for computer D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Gen- animation and video games. Morgan kaufmann, 2000. 2 erative adversarial nets. In Advances in neural information [27] T. B. Moeslund, A. Hilton, and V. Kruger.¨ A survey of ad- processing systems, pages 2672–2680, 2014. 2 vances in vision-based human motion capture and analysis. [12] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and Computer vision and image understanding, 104(2):90–126, A. Courville. Improved training of wasserstein gans. arXiv 2006. 2 preprint arXiv:1704.00028, 2017. 5 [28] A. Newell, K. Yang, and J. Deng. Stacked Hourglass Net- [13] J. Gwak, C. B. Choy, A. Garg, M. Chandraker, and works for Human Pose Estimation, pages 483–499. Springer S. Savarese. Weakly supervised generative adversar- International Publishing, Cham, 2016. 2, 4 ial networks for 3d reconstruction. arXiv preprint [29] G. Ning, Z. Zhang, and Z. He. Knowledge-guided deep arXiv:1705.10904, 2017. 2 fractal neural networks for human pose estimation. arXiv [14] S. Hochreiter and J. Schmidhuber. Long short-term memory. preprint arXiv:1705.02407, 2017. 2 Neural computation, 9(8):1735–1780, 1997. 8 [30] A. Odena, C. Olah, and J. Shlens. Conditional image [15] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and synthesis with auxiliary classifier gans. arXiv preprint B. Schiele. Deepercut: A deeper, stronger, and faster multi- arXiv:1610.09585, 2016. 2 person pose estimation model. In European Conference on [31] D. Park and D. Ramanan. Articulated pose estimation with Computer Vision (ECCV), May 2016. 2 tiny synthetic videos. In Proceedings of the IEEE Confer- [16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating ence on Computer Vision and Pattern Recognition Work- deep network training by reducing internal covariate shift. In shops, pages 58–66, 2015. 2 39

[32] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. [48] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and Coarse-to-fine volumetric prediction for single-image 3d hu- D. Metaxas. Stackgan: Text to photo-realistic image syn- man pose. arXiv preprint arXiv:1611.07828, 2016. 2, 6 thesis with stacked generative adversarial networks. arXiv [33] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele. Pose- preprint arXiv:1612.03242, 2016. 2 let conditioned pictorial structures. In Proceedings of the [49] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and IEEE Conference on Computer Vision and Pattern Recogni- K. Daniilidis. Sparseness meets deepness: 3d human pose tion, pages 588–595, 2013. 2 estimation from monocular video. In Proceedings of the [34] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. An- IEEE Conference on Computer Vision and Pattern Recog- driluka, P. Gehler, and B. Schiele. Deepcut: Joint subset nition, pages 4966–4975, 2016. 6 partition and labeling for multi person pose estimation. In [50] X. Zhou, M. Zhu, G. Pavlakos, S. Leonardos, K. G. Derpa- IEEE Conference on Computer Vision and Pattern Recogni- nis, and K. Daniilidis. Monocap: Monocular human motion tion (CVPR), June 2016. 2 capture using a cnn coupled with a geometric prior. arXiv [35] V. Ramakrishna, T. Kanade, and Y. Sheikh. Reconstructing preprint arXiv:1701.02354, 2017. 2, 5, 6 3d human pose from 2d image landmarks. Computer Vision– ECCV 2012, pages 573–586, 2012. 2 [36] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016. 2 [37] G. Rogez and C. Schmid. Mocap-guided data augmentation for 3d pose estimation in the wild. In Advances in Neural Information Processing Systems, pages 3108–3116, 2016. 2 [38] J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore, P. Kohli, A. Criminisi, A. Kipman, et al. Efficient human pose estimation from single depth im- ages. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2821–2840, 2013. 2 [39] L. Sigal, A. O. Balan, and M. J. Black. Humaneva: Syn- chronized video and motion capture dataset and baseline al- gorithm for evaluation of articulated human motion. Inter- national journal of computer vision, 87(1):4–27, 2010. 2, 5 [40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 6 [41] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neu- ral networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014. 8 [42] B. Tekin, I. Katircioglu, M. Salzmann, V. Lepetit, and P. Fua. Structured prediction of 3d human pose with deep neural net- works. arXiv preprint arXiv:1605.05180, 2016. 2 [43] B. Tekin, P. Marquez-Neila,´ M. Salzmann, and P. Fua. Learn- ing to fuse 2d and 3d image cues for monocular body pose estimation. arXiv preprint arXiv:1611.05708v3, 2017. 6 [44] D. Tome, C. Russell, and L. Agapito. Lifting from the deep: Convolutional 3d pose estimation from a single image. arXiv preprint arXiv:1701.00295, 2017. 2 [45] H.-Y. F. Tung, A. Harley, W. Seto, and K. Fragkiadaki. Ad- versarial inversion: Inverse graphics with adversarial priors. arXiv preprint arXiv:1705.11166, 2017. 2, 5, 6 [46] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Con- volutional pose machines. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4724–4732, 2016. 2, 4 [47] H. Yasin, U. Iqbal, B. Kruger, A. Weber, and J. Gall. A dual-source approach for 3d pose estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4948–4956, 2016. 5, 6, 8

Chapter 4

Learning Free-Form Deformations for 3D Object Reconstruction

“Where there is matter, there is geometry.” — Johannes Kepler

Our second contribution looks at a very different 3D problem: single-view object reconstruction. Unlike our human pose methods, keypoints are unsuitable for detailed object reconstruction due to variations in topology and in the presence or absence of certain features. Even objects of the same class do not necessarily share the same features. For example, while most planes have wings, bi-planes have two sets. Some planes have two propeller engines, while others have four jet engines. Military jets in our dataset often have missiles and other weapons, and some models have undercarriages down, while others are missing them entirely.

Instead of keypoints, we train our model based on point cloud inferences. Each inferred point cloud is generated by deforming a cloud sampled from the surface of a template mesh. To cater for differing geometries, we infer multiple clouds – each from deforming a different template mesh from a fixed set – along with a ranking of these inferences. We investigate various different losses that combine the multiple point cloud inferences with this ranking and show varied success at learning both tasks simultaneously.

Rather than deform the point clouds point-by-point, we decompose each point in the cloud into a linear combination of a fixed number of basis vectors and deform those basis vectors. In this way, the learned component of the network outputs a fixed size deformation which can be applied to point clouds of varying sizes. While all training was performed with the same number of points, at inference time we can increase or decrease the number of points depending on computational budget.
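A minimal free-form deformation in the style of a control-point lattice illustrates this decomposition: each point is a fixed combination of control points via trivariate Bernstein polynomials, so moving the control points (the fixed-size quantity a network could predict) moves every point, however dense the sampling. The 4×4×4 lattice size below is an assumption for illustration, not necessarily the configuration used in the chapter.

```python
import numpy as np
from math import comb

def bernstein(n, t):
    """Values of the Bernstein basis B_{i,n}(t) for i = 0..n and t in [0, 1]."""
    i = np.arange(n + 1)
    coeff = np.array([comb(n, k) for k in i], dtype=float)
    return coeff * t[:, None] ** i * (1.0 - t[:, None]) ** (n - i)

def ffd(points, control):
    """Deform points (given in [0,1]^3 lattice coordinates) using a lattice of
    control points with shape (l+1, m+1, n+1, 3)."""
    l, m, n = (s - 1 for s in control.shape[:3])
    bu = bernstein(l, points[:, 0])
    bv = bernstein(m, points[:, 1])
    bw = bernstein(n, points[:, 2])
    basis = np.einsum('pi,pj,pk->pijk', bu, bv, bw)    # fixed per-point coefficients
    return np.einsum('pijk,ijkd->pd', basis, control)  # moving control points moves every point

# A regular lattice of control points reproduces the identity mapping.
grid = np.stack(np.meshgrid(*(np.linspace(0.0, 1.0, 4),) * 3, indexing='ij'), axis=-1)
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, (2048, 3))
assert np.allclose(ffd(points, grid), points)
# A network would predict the (fixed-size) control point offsets.
deformed = ffd(points, grid + 0.05 * rng.standard_normal(grid.shape))
```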

Additionally, we use the connectivity information of the undeformed mesh to infer deformed meshes rather than point clouds, and transfer semantic information from the input template to the deformed mesh.

This was our first foray into the problem of single-view object reconstruction. While it did not seek to take advantage of computer graphics techniques as other contributions in this thesis did, the process of tackling the problem proved invaluable in terms of identifying strengths and weaknesses of point clouds and mesh representations and influencing the direction of the rest of the thesis.


Statement of Contribution of Co-Authors for Thesis by Published Paper

The authors listed below have certified that:

1. they meet the criteria for authorship in that they have participated in the conception, execution, or interpretation, of at least that part of the publication in their field of expertise;

2. they take public responsibility for their part of the publication, except for the responsible author who accepts overall responsibility for the publication;

3. there are no other authors of the publication according to these criteria;

4. potential conflicts of interest have been disclosed to (a) granting bodies, (b) the editor or publisher of journals or other publications, and (c) the head of the responsible academic unit; and

5. they agree to the use of the publication in the student’s thesis and its publication on the QUT’s ePrints site consistent with any limitations set by publisher requirements.

In the case of this chapter: Learning Free-Form Deformations for 3D Object Reconstruction, presented at the Asian Con- ference on Computer Vision (ACCV), Perth, Australia, 2018.

Dominic Jack (QUT Verified Signature, 26 May 2020): Experiment design, learned model implementation and experiments.
Jhony K. Pontes: Wrote most of introduction and background, FFD implementation.
Frederic Maire: Advised on model design and paper editing.
Anders Eriksson: Advised on model design and paper editing.
Sareh Shirazi: Advised on model design and paper editing.
Clinton Fookes: Liaised with Jhony K. Pontes.
Sridha Sridharan: Liaised with Jhony K. Pontes.

Principal Supervisor Confirmation

I have sighted email or other correspondence from all Co-authors confirming their certifying authorship. (If the Co-authors are not able to sign the form please forward their email or other correspondence confirming the certifying authorship to the RSC.)

Frederic Maire (QUT Verified Signature), 26 May 2020

Learning Free-Form Deformations for 3D Object Reconstruction ⋆

Dominic Jack, Jhony K. Pontes, Sridha Sridharan, Clinton Fookes, Sareh Shirazi, Frederic Maire, and Anders Eriksson

Queensland University of Technology, Brisbane QLD 4000, Australia
{d1.jack, s.sridharan, c.fookes, s.shirazi, f.maire, anders.ariksson}@qut.edu.au, [email protected]

Abstract. Representing 3D shape in deep learning frameworks in an accurate, efficient and compact manner still remains an open challenge. Most existing work addresses this issue by employing voxel-based representations. While these approaches benefit greatly from advances in computer vision by generalizing 2D convolutions to the 3D setting, they also have several considerable drawbacks. The computational complexity of voxel-encodings grows cubically with the resolution, thus limiting such representations to low-resolution 3D reconstruction. In an attempt to solve this problem, point cloud representations have been proposed. Although point clouds are more efficient than voxel representations as they only cover surfaces rather than volumes, they do not encode detailed geometric information about relationships between points. In this paper we propose a method to learn free-form deformations (Ffd) for the task of 3D reconstruction from a single image. By learning to deform points sampled from a high-quality mesh, our trained model can be used to produce arbitrarily dense point clouds or meshes with fine-grained geometry. We evaluate our proposed framework on synthetic data and achieve state-of-the-art results on surface and volumetric metrics. We make our implementation publicly available¹.

1 Introduction

Imagine one wants to interact with objects from the real world, say a chair, but in an augmented reality (AR) environment. The 3D reconstruction from the seen images should appear as realistic as possible so that one may not even perceive the chair as being virtual. The future of highly immersive AR and virtual reality (VR) applications depends heavily on the representation and reconstruction of high-quality 3D models. This is obviously challenging and the computer vision and graphics communities have been working hard on such problems [1,2,3].

⋆ This research was supported by the Australian Research Council through the grant ARC FT170100072. Computational resources used in this work were provided by the HPC and Research Support Group, QUT. 1 TensorFlow implementation available at github.com/jackd/template_ffd.


The impact that recent developments in deep learning approaches have had on computer vision has been immense. In the 2D domain, convolutional neural networks (CNNs) have achieved state-of-the-art results in a wide range of applications [4,5,6]. Motivated by this, researchers have been applying the same techniques to represent and reconstruct 3D data. Most of them rely on volumetric shape representation so one can perform 3D convolutions on the structured voxel grid [7,8,9,10,11,12]. A drawback is that convolutions on the 3D space are computationally expensive and grow cubically with resolution, thus typically limiting the 3D reconstruction to exceedingly coarse representations.

A recent shape representation that has been investigated to make the learning more efficient is point clouds [13,14,15,16]. However, such representations still lack the ability to describe finely detailed structures. Applying surfaces, texture and lighting to unstructured point clouds is also challenging, especially in the case of noisy, incomplete and sparse data.

The most extensively used shape representation in computer graphics is that of polygon meshes, in particular using triangular faces. This parameterisation has largely been unexplored in the machine learning domain for the 3D reconstruction task. This is in part a consequence of most machine learning algorithms requiring regular representations of input and output data such as voxels and point clouds. Meshes are highly unstructured and their topological structure usually differs from one to another, which makes their 3D reconstruction from 2D images using neural networks challenging.

In this paper, we tackle this problem by exploring the well-known free-form deformation (Ffd) technique [17] widely used for 3D mesh modelling. Ffd allows us to deform any 3D mesh by repositioning a few predefined control points while keeping its topological aspects. We propose an approach to perform 3D mesh reconstruction from single images by simultaneously learning to select and deform template meshes. Our method uses a lightweight CNN to infer the low-dimensional Ffd parameters for multiple templates, and it learns to apply large deformations to topologically different templates to produce inferred meshes with similar surfaces. We extensively demonstrate that relatively small CNNs can learn these deformations well, and achieve compelling mesh reconstructions. An overview of the proposed method is illustrated in Figure 1.

Our contributions are summarized as follows:

• We propose a novel learning framework to reconstruct continuous 3D meshes from single images;
• we quantitatively and qualitatively demonstrate that relatively small neural networks require minimal adaptation to learn to simultaneously select appropriate models from a number of templates and deform these templates to perform 3D mesh reconstruction; and
• we extensively investigate simple changes to training and loss functions to promote variation in template selection.


Fig. 1: Given a single image, our method uses a CNN to infer Ffd parameters ∆P (red arrows) for multiple templates T (middle meshes). The ∆P parameters are then used to deform the template vertices to infer a 3D mesh for each template (right meshes). Trained only with surface-sampled point-clouds, the model learns to apply large deformations to topologically different templates to produce inferred meshes with similar surfaces. Likelihood weightings γ are also inferred by the network but not shown. FC stands for fully connected layer.

2 Related Work

Interest in analysing 3D models has increased tremendously in recent years. This development has been driven in part by a rapid growth of the amount of readily available 3D data, the astounding progress made in the field of machine learning, as well as a substantial rise in the number of potential application areas, i.e. Virtual and Augmented Reality.

To address 3D vision problems with deep learning techniques a good shape representation should be found. Volumetric representation has been the most widely used for 3D learning [18,19,20,7,21,22,8,9,12,10,11]. Convolutions, pooling, and other techniques that have been successfully applied to the 2D domain can be naturally applied to the 3D case for the learning process. Volumetric autoencoders [23,21] and generative adversarial networks (GANs) have been introduced [24,25,26] to learn probabilistic latent spaces of 3D objects for object completion, classification and 3D reconstruction. Volumetric representation however grows cubically in terms of memory and computational complexity as the voxel grid resolution increases, thus limiting it to low-quality 3D reconstructions.

To overcome these limitations, octree-based neural networks have been presented [27,28,29,30], where the volumetric grid is split recursively by dividing it into octants. Octrees reduce the computational complexity of the 3D convolution since the computations are focused only on regions where most of the object's geometry information is located. They allow for higher resolution 3D reconstructions and more efficient training; however, the outputs still lack fine-scaled geometry. A more efficient 3D representation using point clouds was


recently proposed to address some of these drawbacks [13,14,15,16]. In [13] a generative neural network was presented to directly output a set of unordered 3D points that can be used for the 3D reconstruction from single image and shape completion tasks. To date, such architectures have only been demonstrated for the generation of relatively low-resolution outputs, and scaling these networks to higher resolutions is yet to be explored.

3D shapes can be efficiently represented by polygon meshes, which encode both geometrical (point cloud) and topological (surface connectivity) information. However, it is difficult to parametrize meshes to be used within learning frameworks [31]. A deep residual network to generate 3D meshes has been proposed in [32]. A limitation however is that they adopted the geometry image representation for generative modelling of 3D surfaces, so it can only manage simple (i.e. genus-0) and low-quality surfaces. In [2], the authors reconstruct 3D mesh objects from single images by jointly analysing a collection of images of different objects along with a smaller collection of existing 3D models. While the method yields impressive results, it suffers from scalability issues and is sensitive to semantic segmentation of the image and dense correspondences.

Ffd has also been explored for 3D mesh representation, where one can intrinsically represent an object by a set of polynomial basis functions and a small number of coefficients known as control points used for cage-like deformation. A 3D mesh editing tool proposed in [33] uses a volumetric generative network to infer per-voxel deformation flows using Ffd. Their method takes a volumetric representation of a 3D mesh as input and a high-level deformation intention label (e.g. sporty car, fighter jet, etc.) to learn the Ffd displacements to be applied to the original mesh. In [34,35] a method for 3D mesh reconstruction from a single image was proposed based on a low-dimensional parametrization using Ffd and sparse linear combinations given the image silhouette and class-specific landmarks. Recently, DeformNet was proposed in [36], where they employed Ffd as a differentiable layer in their 3D reconstruction framework. The method builds upon two networks, one 2D CNN for 3D shape retrieval and one 3D CNN to infer Ffd parameters to deform the 3D point cloud of the shape retrieved. In contrast, our proposed method reconstructs 3D meshes using a single lightweight CNN with no 3D convolutions involved to infer a 3D mesh template and its deformation flow in one shot.

3 Problem Statement

We focus on the problem of inferring a 3D mesh from a single image. We represent a 3D mesh c by a list of vertex coordinates V ∈ R^{n_v × 3} and a set of triangular faces F ∈ Z^{n_f × 3}, 0 ≤ F_{ij} < n_v, defined such that f_i = [p, q, r] indicates there is a face connecting the vertices v_p, v_q and v_r, i.e. c = {V, F}.

Given a query image, the task is to infer a 3D mesh c̃ which is close by some measure to the actual mesh c of the object in the image. We employ the Ffd technique to deform the 3D mesh to best fit the image.

FFD for 3D Reconstruction 5

3.1 Comparing 3D Meshes

There are a number of metrics which can be used to compare 3D meshes. We consider three: Chamfer distance and earth mover distance between point clouds, and the intersection-over-union (IoU) of their voxelized representations.

Chamfer distance. The Chamfer distance λ_c between two point clouds A and B is defined as

    \lambda_c(A, B) = \sum_{a \in A} \min_{b \in B} \|a - b\|^2 + \sum_{b \in B} \min_{a \in A} \|b - a\|^2 .    (1)

Earth mover distance. The earth mover distance [37] λ_em between two point clouds of the same size is the sum of distances between a point in one cloud and a corresponding partner in the other, minimized over all possible 1-to-1 correspondences. More formally,

    \lambda_{em}(A, B) = \min_{\phi: A \to B} \sum_{a \in A} \|a - \phi(a)\| ,    (2)

where φ is a bijective mapping.

Point cloud metrics evaluated on vertices of 3D meshes can give misleading results, since large planar regions will have very few vertices, and hence contribute little. Instead, we evaluate these metrics using a point cloud sampled uniformly from the surface of each 3D mesh.

Intersection over union. As the name suggests, the intersection-over-union (IoU) of volumetric representations is defined by the ratio of the volume of the intersection over that of the union,

    \mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} .    (3)
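To make these comparisons concrete, the following is a minimal NumPy/SciPy sketch of the three metrics above. The function names and the brute-force pairwise-distance formulation are illustrative only, not the implementation used in our experiments.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def chamfer_distance(a, b):
    # a: (Na, 3), b: (Nb, 3) surface-sampled point clouds.
    # Sum of squared nearest-neighbour distances in both directions, Equation (1).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # (Na, Nb)
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

def earth_mover_distance(a, b):
    # Equal-sized clouds: minimum-cost 1-to-1 matching, Equation (2).
    d = np.sqrt(np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1))
    rows, cols = linear_sum_assignment(d)
    return d[rows, cols].sum()

def voxel_iou(a, b):
    # a, b: boolean occupancy grids of identical shape, Equation (3).
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()
```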

3.2 Deforming 3D Meshes

We deform a 3D object by freely manipulating some control points using the Ffd technique. Ffd creates a grid of control points whose axes are defined by the orthogonal vectors s, t and u [17]. The control points are then defined by l, m and n, which divide the grid into l + 1, m + 1, n + 1 planes in the s, t, u directions, respectively. A local coordinate for each object vertex is then imposed.

In this work, we deform an object through a trivariate Bernstein tensor – a weighted sum of the control points – as in Sederberg and Parry [17]. The deformed position of any arbitrary point is given by

    s(s, t, u) = \sum_{i=0}^{l} \sum_{j=0}^{m} \sum_{k=0}^{n} B_{il}(s) B_{jm}(t) B_{kn}(u) \, p_{ijk} ,    (4)

where s contains the coordinates of the displaced point, B_{·N}(x) is the Bernstein polynomial of degree N which sets the influence of each control point on every vertex, and p_{ijk} is the (i, j, k)-th control point. This can be expressed in matrix form as


    S = BP ,    (5)

where the rows of P ∈ R^{M × 3} and S ∈ R^{N × 3} are the coordinates of the M control points and N displaced points respectively, and B ∈ R^{N × M} is the deformation matrix.
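As a concrete illustration of Equations (4) and (5), the sketch below builds the Bernstein deformation matrix B with NumPy. It assumes the points have already been expressed in local lattice coordinates (s, t, u) ∈ [0, 1]³, and the function names are ours rather than those of the released implementation.

```python
import numpy as np
from scipy.special import comb

def bernstein_basis(degree, x):
    # x: (N,) values in [0, 1]; returns (N, degree + 1) Bernstein polynomial values.
    i = np.arange(degree + 1)
    return comb(degree, i) * x[:, None] ** i * (1.0 - x[:, None]) ** (degree - i)

def deformation_matrix(stu, l=3, m=3, n=3):
    # stu: (N, 3) local lattice coordinates of the decomposed points.
    bs = bernstein_basis(l, stu[:, 0])
    bt = bernstein_basis(m, stu[:, 1])
    bu = bernstein_basis(n, stu[:, 2])
    # Outer product over the three axes, flattened to (N, (l+1)(m+1)(n+1)) columns.
    B = np.einsum('ni,nj,nk->nijk', bs, bt, bu)
    return B.reshape(stu.shape[0], -1)

# With l = m = n = 3 there are 64 control points, so a learned offset dP has
# 64 x 3 = 192 values and the deformed points are S = B @ (P + dP), as in Equation (5).
```

Because B depends only on the template geometry, it can be precomputed once per template, so applying a learned deformation reduces to a single matrix multiplication at training and inference time.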

4 Learning Free-Form Deformations

Our method involves applying deformations encoded with a parameter ∆P^(t) to T different template models c^(t), with 0 ≤ t < T.

    t^* = \arg\max_t \gamma^{(qt)} ,    (9)

    \tilde{c}^{(q)} = \{\tilde{S}^{(qt^*)}, F^{(t^*)}\} .    (10)

Previous work using learned Ffd [36] infers deformations based on learned embeddings of voxelized templates. By learning distinct sub-networks for each template we forgo the need to learn this embedding explicitly, and our sub-networks are able to attain a much more intimate knowledge of their associated template without losing information via voxelization.

Key advantages of the architecture are as follows:
• no 3D convolutions are involved, meaning the network scales well with increased resolution;
• no discretization occurs, allowing higher precision than voxel-based methods;


• the output ∆P̃ can be used to generate an arbitrarily dense point cloud – not necessarily the same density as that used during training; and
• a mesh can be inferred by applying the deformation to the Bernstein decomposition of the vertices while maintaining the same face connections.

Drawbacks include:
• the network size scales linearly with the number of templates considered; and
• there is at this time no mechanism to explicitly encourage topological or semantic similarity.

4.1 Diversity in Model Selection

Preliminary experiments showed training using standard optimizers with an identity weighting function f resulted in a small number of templates being selected frequently. This is at least partially due to a positive feedback loop caused by the interaction between the weighting sub-network and the deformation sub-networks. If a particular template deformation sub-network performs particularly well initially, the weighting sub-network learns to assign increased weight to this template. This in turn affects the gradients which flow through the deformation sub-network, resulting in faster learning, improved performance and hence higher weight in subsequent batches. We experimented with a number of network modifications to reduce this.

Non-linear Weighting. One problem with the identity weighting scheme (f(γ) = γ) is that there is no penalty for over-confidence. A well-trained network with a slight preference for one template over all the rest will be inclined to put all weight into that template. By using an f with positive curvature, we discourage the network from making overly confident inferences. We experimented with an entropy-inspired weighting f(γ) = −log(1 − γ).

Explicit Entropy Penalty. Another approach is to penalize the lack of diversity directly by introducing an explicit entropy loss term,

    \lambda_e = \sum_t \bar{\gamma}^{(t)} \log \bar{\gamma}^{(t)} ,    (11)

where γ̄^(t) is the weight value of template t averaged over the batch. This encourages an even distribution over the batch but still allows confident estimates for the individual inferences. For these experiments, the network was trained with a linear combination of the weighted Chamfer loss λ_0 and the entropy penalty,

    \lambda_e' = \lambda_0 + \kappa_e \lambda_e .    (12)

While a large entropy error term encourages all templates to be assigned weight and hence all subnetworks to learn, it also forces all subnetworks to try and learn all possible deformations. This works against the idea of specialization, where each subnetwork should learn to deform its template to match query models close to that template. To alleviate this, we anneal the entropy over time


    \kappa_e = e^{-b/b_0} \kappa_{e0} ,    (13)

where κ_{e0} is the initial weighting, b is the batch index and b_0 is some scaling factor.

Deformation Regularization. In order to encourage the network to select a template requiring minimal deformation, we introduce a deformation regularization term,

    \lambda_r = \sum_{q,t} \gamma^{(qt)} \, |\Delta \tilde{P}^{(qt)}|^2 ,    (14)

where |·|² is the squared 2-norm of the vectorized input.

Large regularization encourages a network to select the closest matching template, though it punishes subnetworks for deforming their template a lot, even if the result is a better match to the query mesh. We combine this regularization term with the standard loss in a similar way to the entropy loss term,

    \lambda_r' = \lambda_0 + \kappa_r \lambda_r ,    (15)

where κ_r is an exponentially annealed weighting with initial value κ_{r0}.
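The sketch below summarises how the auxiliary terms of Equations (11)–(15) can be combined with the weighted Chamfer loss under exponential annealing. It is written against TensorFlow for illustration, with hypothetical tensor shapes, and is not the released training code.

```python
import tensorflow as tf

def regularized_loss(weighted_chamfer, gamma, delta_p, batch_index,
                     kappa_e0=100.0, kappa_r0=1.0, b0=10_000.0):
    # gamma: (batch, T) template weightings; delta_p: (batch, T, 192) deformations.
    anneal = tf.exp(-tf.cast(batch_index, tf.float32) / b0)                   # Equation (13)
    gamma_bar = tf.reduce_mean(gamma, axis=0)                                 # batch-averaged weights
    entropy = tf.reduce_sum(gamma_bar * tf.math.log(gamma_bar + 1e-8))        # Equation (11)
    deform_reg = tf.reduce_sum(gamma * tf.reduce_sum(delta_p ** 2, axis=-1))  # Equation (14)
    return (weighted_chamfer
            + anneal * kappa_e0 * entropy       # Equation (12)
            + anneal * kappa_r0 * deform_reg)   # Equation (15)
```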

4.2 Deformed Mesh Inference

For the algorithm to result in high-quality 3D reconstructions, it is important that the vertex density of each template mesh is approximately equivalent to (or higher than) the point cloud density used during training. To ensure this is the case, we subdivide edges in the template mesh such that no edge length is greater than some threshold ϵ_e. Example cases where this is particularly important are illustrated in Figure 2.


Fig. 2: Two examples of poor mesh model output (chair and table) as a result of low vertex density. (a) Original low vertex-density mesh. (b) Original mesh deformed according to inferred Ffd. (c) Subdivided mesh. (d) Subdivided mesh deformed according to same Ffd. (e) Ground truth.


4.3 Implementation Details

We employed a MobileNet architecture that uses depthwise separable convolutions to build lightweight deep neural networks for mobile and embedded vision applications [38], without the final fully connected layers and with width α = 0.25. Weights were initialized from the convolutional layers of a network trained on the 192 × 192 ImageNet dataset [39]. To reduce dimensionality, we add a single 1 × 1 convolution after the final MobileNet convolution layer. After flattening the result, we have one shared fully connected layer with 512 nodes followed by a fully connected layer for each template. A summary of layers and output dimensions is given in Table 1.

Layer               Output size
Input image         192 × 256 × 3
MobileNet CNN       6 × 8 × 256
1 × 1 convolution   6 × 8 × 64
Flattened           3,072
Shared FC           512
Template FC (t)     192 + 1

Table 1: Output size of network layers. Each template fully connected (FC) layer output is interpreted as 3 × 4³ = 192 values for ∆P̃^(qt) and a single γ^(qt) value.

We used a subset of the ShapeNet Core dataset [40] over a number of categories, using an 80/20 train/evaluation split. All experiments were conducted using 4 control points in each dimension (l = m = n = 3) for the free-form parametrizations. To balance computational cost with loss accuracy, we initially sampled all model surfaces with 16,384 points for both labels and free-form decomposition. At each step of training, we sub-sampled a different 1,024 points for use in the Chamfer loss.

All input images were 192 × 256 × 3 and were the result of rendering each textured mesh from the same view, 30° above the horizontal, 45° away from front-on, and well-lit by a light above and on each side of the model. We trained a different network with 30 templates for each category. Templates were selected manually to ensure good variety. Models were trained using a standard Adam optimizer with learning rate 10⁻³, β₁ = 0.9, β₂ = 0.999 and ϵ = 10⁻⁸. Mini-batches of 32 were used, and training was run for 100,000 steps. Exponential annealing used b₀ = 10,000. For each training regime, a different model was trained for each category.

Hyper-parameters for specific training regimes are given in Table 2. To produce meshes and subsequent voxelizations and IoU scores, template meshes had edges subdivided to a maximum length of ϵ_e = 0.02. We voxelize on a 32³ grid

Training Regime   ID       ϵ_γ     f(γ)           κ_e0   κ_r0
base              base     0.1     γ              0      0
log-weighted      log-w.   0.001   −log(1 − γ)    0      0
entropy           ent.     0.1     γ              100    0
regularized       reg.     0.1     γ              0      1

Table 2: Hyper-parameters for the primary training regimes.


by initially assigning any voxel containing part of the mesh as occupied, and subsequently filling in any unoccupied voxels with no free path to the outside.
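For clarity, a minimal Keras-style sketch of the inference head described in Table 1 follows. The layer arrangement, the softmax normalization of the template weightings and other details are illustrative assumptions rather than a transcription of the released model.

```python
import tensorflow as tf

def template_ffd_head(flat_features, num_templates=30, num_control_points=64):
    # flat_features: (batch, 3072) flattened MobileNet features (see Table 1).
    shared = tf.keras.layers.Dense(512, activation='relu')(flat_features)
    delta_ps, gamma_logits = [], []
    for _ in range(num_templates):
        # One fully connected sub-network per template: 192 deformation values + 1 weighting.
        out = tf.keras.layers.Dense(3 * num_control_points + 1)(shared)
        delta_ps.append(tf.reshape(out[:, :-1], (-1, num_control_points, 3)))
        gamma_logits.append(out[:, -1])
    gammas = tf.nn.softmax(tf.stack(gamma_logits, axis=-1), axis=-1)   # template weightings
    delta_p = tf.stack(delta_ps, axis=1)                               # (batch, T, 64, 3)
    return delta_p, gammas
```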

5 Experimental Results

Qualitatively, we observe the network's preference for applying relatively large deformations to geometrically simple templates, and it does not shy away from merging separate features of template models. For example, models frequently select the bi-plane template and merge the wings together to match single-wing aircraft, or warp standard 4-legged chairs and tables into structurally different objects as shown in Figure 3.

5.1 Quantitative Comparison

For point cloud comparison, we compare against the works of Kuryenkov et al. [36] (DN) and Fan et al. [13] (PSGN) for 5 categories. We use the results for the improved PSGN model reported in [36]. We use the same scaling as in these papers, finding transformation parameters that transform the ground-truth meshes to a minimal bounding hemisphere z ≥ 0 of radius 3.2 and applying this transformation to the inferred clouds. We also compare IoU values with PSGN [13] on an additional 8 categories for voxelized inputs on a 32³ grid. Results for all 13 categories with each different training regime are given in Table 3.

Category     base          log-w.        ent.          reg.          DN           PSGN
plane        31/306/292    31/310/297    31/300/289    33/304/307    100/560/-    140/115/399
bench        42/280/431    39/280/425    40/284/418    45/275/445    100/550/-    210/980/450
car          58/328/210    60/333/219    58/325/207    59/324/216    90/520/-     110/380/169
chair        36/280/407    35/275/393    35/277/392    37/277/401    130/510/-    330/770/456
sofa         64/329/275    63/320/275    64/324/271    65/319/276    210/770/-    230/600/292
mean5        46/305/323    46/304/322    46/300/315    48/292/329    130/580/-    200/780/353
cabinet      37/249/282    37/250/282    36/251/264    37/246/279    -            -/-/229
monitor      38/253/369    37/250/367    37/255/367    43/255/380    -            -/-/448
lamp         52/402/514    49/393/480    44/384/473    55/425/520    -            -/-/538
speaker      72/312/301    68/309/304    71/313/301    73/308/315    -            -/-/263
firearm      39/312/332    30/279/281    32/288/326    39/301/345    -            -/-/396
table        47/352/447    46/331/432    46/342/420    49/319/450    -            -/-/394
cellphone    16/159/241    15/150/224    15/154/192    15/154/222    -            -/-/251
watercraft   83/408/493    48/296/340    49/304/361    53/317/367    -            -/-/389
mean13       47/305/353    43/290/332    43/292/329    46/294/348    -            250/800/360

Table 3: 1000 × (λc / λem / 1 − IoU) for our different training regimes, compared against state-of-the-art models DN [36] and PSGN [13]. Lower is better. λc and λem values for PSGN are from the latest version as reported by Kuryenkov et al. [36], while IoU values are from the original paper. mean5 is the mean value across the plane, bench, car, chair and sofa categories, while mean13 is the average across all 13.

All our training regimes out-perform the other methods by a significant margin on all categories for point-cloud metrics (λc and λem). We also outperform PSGN on IoU for most categories and on average. The categories for which the method performs worst in terms of IoU – tables and chairs – typically feature large, flat surfaces



Fig. 3: Representative results for different categories. Column (a) shows the input image. Column (b) shows the selected template model. Column (c) shows the deformed point cloud by Ffd. The deformed voxelized model is shown in column (d). Column (e) shows our final 3D mesh reconstruction, and the ground truth is shown in column (f).


and thin structures. Poor IoU scores can largely be attributed to poor width or depth inference (a difficult problem given the single view provided) and small, consistent offsets that do not induce large Chamfer losses. An example is shown in Figure 4. We acknowledge this experimental setup gives us a slight advantage

Fig. 4: An example of a qualitatively accurate inference with a low IoU score, λ_IoU = 0.33. Small errors in depth/width inference correspond to small Chamfer losses. For comparison, black voxels are true positives (intersection), red voxels are false negatives and green voxels are false positives.

over DN and PSGN. Most notably, these approaches used renderings from different angles, compared to our uniform viewing angles. In practice, we found using varied viewing angles had a small negative effect on our results, though this regression could be partially offset by a higher capacity image network. DN and PSGN also only trained a single model, whereas we trained a new model for each category (though DN used category information at evaluation time). For a fairer comparison, we trained a single network using two templates from each of the 13 categories investigated on a dataset of all 13 categories. Again we note a regression in performance, though we still significantly outperform the other methods. We present this approach purely for simple, fair comparison with PSGN – we do not suggest it is the optimal approach to extending our model to multiple categories, and leave further investigation to future work. Chamfer losses for varied views and multiple categories are given in Table 4.

Category   ent.   8-view   8-view-full   DN [36]   13c   13c-long   PSGN [13]
plane      31     38       37            100       54    53         140
bench      40     50       43            100       53    52         210
car        58     61       59            90        71    68         110
chair      35     44       42            130       48    46         330
sofa       64     72       71            210       79    77         230
mean5      46     53       50            130       61    59         204
mean13     43     52       50            -         67    62         250

Table 4: 1000 × λc. ent.: same as original paper. 8-view: same as ent. but trained on a random view of 8. 8-view-full: same as 8-view but uses a full-features MobileNet architecture. 13c: same as 8-view, but a single model for all 13 categories, 2 templates per category. 13c-long: same as 13c but trained for twice as long.

5.2 Template Selection

We begin our analysis by investigating the number of times each template was selected across the different training regimes, and the quality of the match of the undeformed template to the query model. Results for the sofa and table categories are given in Figure 5. We illustrate the typical behaviour of our framework with the sofa and table categories, since these are categories with topologically



Fig. 5: Normalized count of the number of times each template was selected, sorted descending (a, b), and cumulative Chamfer error (c, d) for the deformed models (dashed) and undeformed template models (dotted) for the sofa category (a, c) and table category (b, d). The b, w, e, r legend entries correspond to the base, log-weighted, entropy and regularized training regimes respectively.

similar models and topologically different models respectively. In both cases, the base training regime resulted in a model with template selection dominated by a small number of templates, while additional loss terms in the form of deformation regularization and entropy succeeded in smearing out this distribution to some extent. The behaviour of the non-linear weighting regime is starkly different across the two categories however, reinforcing template dominance for the category with less topological difference across the dataset, and encouraging variety for the table category.

In terms of the Chamfer loss, all training regimes produced deformed models with virtually equivalent results. The difference is apparent when inspecting the undeformed models. Unsurprisingly, penalizing large deformation via regularization results in the best results for the undeformed template, while the other two non-base methods selected templates slightly better than the base regime.

To further investigate the effect of template selection on the model, we trained a base model with a single template (T = 1), and entropy models with T ∈ {2, 4, 8, 16} templates for the sofa dataset. In each case, the top N templates selected by the 30-template regularized model were used. Cumulative Chamfer losses and IoU scores are shown in Figure 6.

Surprisingly, the deformation networks manage to achieve almost identical results on these metrics regardless of the number of templates available. Additional templates do improve accuracy of the undeformed model up to a point, suggesting the template selection mechanism is not fundamentally broken.

5.3 Semantic Label Transfer

While no aspect of the training related to semantic information, applying the inferred deformations to a semantically labelled point cloud allows us to infer another semantically labelled point cloud. Some examples are shown in Figure 7. For cases where the template is semantically similar to the query object, the additional semantic information is retained in the inferred cloud. However, some templates either do not have points of all segmentation types, or have points of segmentation types that are not present in the query object. In these cases, while the inferred point cloud matches the surface relatively well, the semantic information is unreliable.



Fig. 6: Cumulative Chamfer loss (left) and IoU results (right) for models with limited templates. All models with T > 1 were trained under the entropy regime (e) on the sofa category. The T = 1 model was trained with the base training regime. Dotted: undeformed selected template values; dashed: deformed model values.

Fig. 7: Good (left block) and bad (right block) examples from the chair and plane categories. For each block, left to right: input image; selected template's semantically segmented cloud; deformed segmented cloud; deformed mesh. Models trained with additional entropy loss term (ent.).

6 Conclusion

We have presented a simple framework for combining modern CNN approaches with detailed, unstructured meshes by using Ffd as a fixed-sized intermediary and learning to select and deform template point clouds based on minimally adjusted off-the-shelf image processing networks. We out-perform state-of-the-art methods with respect to point cloud generation, and perform at or above state-of-the-art on the volumetric IoU metric, despite our network not being optimized for it. We present various mechanisms by which the diversity of templates selected can be increased and demonstrate these result in modest improvements. We demonstrate that the main component of the low metric scores is the ability of the network to learn deformations tailored to specific templates, rather than the precise selection of these templates. Models with only a single template to select from achieve comparable results to those with a greater selection at their disposal. This indicates the choice of template – and hence any semantic or topological information – makes little difference to the resulting point cloud, diminishing the trustworthiness of topological or semantic information.


References

1. Penner, E., Zhang, L.: Soft 3D reconstruction for view synthesis. In: ACM Transactions on Graphics. Volume 36. (2017)
2. Huang, Q., Wang, H., Koltun, V.: Single-view reconstruction via joint analysis of image and shape collections. In: ACM Transactions on Graphics. Volume 34. (2015)
3. Maier, R., Kim, K., Cremers, D., Kautz, J., Nießner, M.: Intrinsic3D: High-quality 3D reconstruction by joint appearance and geometry optimization with spatially-varying lighting. In: ICCV. (2017)
4. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS. (2012) 1097–1105
5. Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning hierarchical features for scene labeling. TPAMI 35 (2013) 1915–1929
6. Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., Schmidhuber, J.: A novel connectionist system for unconstrained handwriting recognition. TPAMI 31 (2009) 855–868
7. Choy, C.B., Xu, D., Gwak, J., Chen, K., Savarese, S.: 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In: ECCV. (2016)
8. Yan, X., Yang, J., Yumer, E., Guo, Y., Lee, H.: Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In: NIPS. (2016)
9. Qi, C.R., Su, H., Nießner, M., Dai, A., Yan, M., Guibas, L.J.: Volumetric and multi-view CNNs for object classification on 3D data. In: CVPR. (2016)
10. Kar, A., Häne, C., Malik, J.: Learning a multi-view stereo machine. In: NIPS. (2017)
11. Zhu, R., Galoogahi, H.K., Wang, C., Lucey, S.: Rethinking reprojection: Closing the loop for pose-aware shape reconstruction from a single image. In: NIPS. (2017)
12. Wu, J., Wang, Y., Xue, T., Sun, X., Freeman, W.T., Tenenbaum, J.B.: MarrNet: 3D shape reconstruction via 2.5D sketches. In: NIPS. (2017)
13. Fan, H., Su, H., Guibas, L.J.: A point set generation network for 3D object reconstruction from a single image. In: CVPR. (2017)
14. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: Deep learning on point sets for 3D classification and segmentation. In: CVPR. (2017)
15. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: Deep hierarchical feature learning on point sets in a metric space. In: NIPS. (2017)
16. Lin, C.H., Kong, C., Lucey, S.: Learning efficient point cloud generation for dense 3D object reconstruction. In: AAAI. (2018)
17. Sederberg, T., Parry, S.: Free-form deformation of solid geometric models. In: SIGGRAPH. (1986)
18. Ulusoy, A.O., Geiger, A., Black, M.J.: Towards probabilistic volumetric reconstruction using ray potentials. In: 3DV. (2015)
19. Wu, Z., Song, S., Khosla, A., Tang, X., Xiao, J.: 3D ShapeNets: A deep representation for volumetric shapes. In: CVPR. (2015)
20. Cherabier, I., Häne, C., Oswald, M.R., Pollefeys, M.: Multi-label semantic 3D reconstruction using voxel blocks. In: 3DV. (2016)
21. Sharma, A., Grau, O., Fritz, M.: VConv-DAE: Deep volumetric shape learning without object labels. In: ECCVW. (2016)


22. Rezende, D.J., Eslami, S.M.A., Mohamed, S., Battaglia, P., Jaderberg, M., Heess, N.: Unsupervised learning of 3D structure from images. In: NIPS. (2016)
23. Girdhar, R., Fouhey, D.F., Rodriguez, M., Gupta, A.: Learning a predictable and generative vector representation for objects. In: ECCV. (2016)
24. Wu, J., Zhang, C., Xue, T., Freeman, W.T., Tenenbaum, J.B.: Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In: NIPS. (2016)
25. Liu, J., Yu, F., Funkhouser, T.A.: Interactive 3D modeling with a generative adversarial network. In: 3DV. (2017)
26. Gwak, J., Choy, C.B., Garg, A., Chandraker, M., Savarese, S.: Weakly supervised generative adversarial networks for 3D reconstruction. In: 3DV. (2017)
27. Riegler, G., Ulusoy, A.O., Geiger, A.: OctNet: Learning deep 3D representations at high resolutions. In: CVPR. (2017)
28. Wang, P.S., Liu, Y., Guo, Y.X., Sun, C.Y., Tong, X.: O-CNN: Octree-based convolutional neural networks for 3D shape analysis. In: SIGGRAPH. (2017)
29. Häne, C., Tulsiani, S., Malik, J.: Hierarchical surface prediction for 3D object reconstruction. In: 3DV. (2017)
30. Tatarchenko, M., Dosovitskiy, A., Brox, T.: Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs. In: ICCV. (2017)
31. Lun, Z., Gadelha, M., Kalogerakis, E., Maji, S., Wang, R.: 3D shape reconstruction from sketches via multi-view convolutional networks. In: 3DV. (2017)
32. Sinha, A., Unmesh, A., Huang, Q., Ramani, K.: SurfNet: Generating 3D shape surfaces using deep residual networks. In: CVPR. (2017)
33. Yumer, M.E., Mitra, N.J.: Learning semantic deformation flows with 3D convolutional networks. In: ECCV. (2016)
34. Kong, C., Lin, C.H., Lucey, S.: Using locally corresponding CAD models for dense 3D reconstructions from a single image. In: CVPR. (2017)
35. Pontes, J.K., Kong, C., Eriksson, A., Fookes, C., Sridharan, S., Lucey, S.: Compact model representation for 3D reconstruction. In: 3DV. (2017)
36. Kurenkov, A., Ji, J., Garg, A., Mehta, V., Gwak, J., Choy, C.B., Savarese, S.: DeformNet: Free-form deformation network for 3D shape reconstruction from a single image. Volume abs/1708.04672. (2017)
37. Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40 (2000) 99–121
38. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
39. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR. (2009) 248–255
40. Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., Yu, F.: ShapeNet: An information-rich 3D model repository. Technical Report arXiv:1512.03012 [cs.GR] (2015)

Chapter 5

IGE-Net: Inverse Graphics Energy Networks for Human Pose Estimation and Single-View Reconstruction

“One geometry cannot be more true than another; it can only be more convenient.” — Henri Poincaré

Recall our first contribution (Chapter 3), which sought solutions that were both feasible and consistent via a two-stage process: feasibility was learned in the first phase, and consistency was optimized by standard optimization techniques in a separate inference step. While we were able to use GANs to learn a feasible solution space, there is no guarantee this space will behave nicely with respect to the optimization in the inference phase. It also requires us to use hard-coded reprojection and temporal smoothing losses, rather than allowing these to be learned.

Our third contribution seeks to address these issues by combining these two sequential optimization steps into a single supervised learning framework. Conceptually, rather than learning a feasible pose space, we learn an energy function which reflects both the feasibility of the solution and its consistency with the 2D observation. Our inference is the result of partially optimizing this energy function with respect to a proposed solution, and the parameters of the energy function are optimized by minimizing the loss between the result of this energy-minimizing inference and supervised labels. This idea of multi-level optimization/energy networks is not original – our contribution is based on the use of simple graphics techniques and novel energy functions. We coin the resulting models Inverse Graphics Energy Networks, or IGE-Nets.
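To make the multi-level structure concrete, the following is a minimal TensorFlow sketch of the unrolled inner optimization at the heart of an IGE-Net: a learned energy function is partially minimized by a fixed number of gradient steps, and every intermediate proposal is returned so the outer loss can be applied to each step. The momentum, clipping and step counts shown are illustrative placeholders rather than the values used in the published models.

```python
import tensorflow as tf

def unrolled_inference(energy_fn, features, y0, steps=8, lr=1.0, momentum=0.1, clip=1.0):
    # energy_fn(y, features) -> per-example scalar energy; its parameters are
    # trained by the outer optimizer by back-propagating through this loop.
    y, velocity = y0, tf.zeros_like(y0)
    proposals = [y]
    for _ in range(steps):
        with tf.GradientTape() as tape:
            tape.watch(y)
            energy = tf.reduce_sum(energy_fn(y, features))
        grad = tf.clip_by_norm(tape.gradient(energy, y), clip)
        velocity = momentum * velocity - lr * grad
        y = y + velocity
        proposals.append(y)
    # The outer loss is a decaying weighted sum of per-step losses over `proposals`.
    return proposals
```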

We demonstrate the effectiveness of IGE-Nets on both problems investigated thus far in this thesis: 2D-to-3D human pose lifting and single-view image reconstruction. Our human pose models have two orders of magnitude fewer parameters and an order of magnitude fewer multiply-add operations than baseline deep learning approaches, with comparable performance. Our object reconstruction networks are capable of inferring voxel grids at resolutions of 256³ on standard desktop GPUs with state-of-the-art performance with respect to intersection-over-union.


Statement of Contribution of Co-Authors for Thesis by Published Paper

The authors listed below have certified that:

1. they meet the criteria for authorship in that they have participated in the conception, execution, or interpretation, of at least that part of the publication in their field of expertise;

2. they take public responsibility for their part of the publication, except for the responsible author who accepts overall responsibility for the publication;

3. there are no other authors of the publication according to these criteria;

4. potential conflicts of interest have been disclosed to (a) granting bodies, (b) the editor or publisher of journals or other publications, and (c) the head of the responsible academic unit; and

5. they agree to the use of the publication in the student’s thesis and its publication on the QUT’s ePrints site consistent with any limitations set by publisher requirements.

In the case of this chapter: IGE-Net: Inverse Graphics Energy Networks for Human Pose Estimation and Single-View Reconstruction, presented at the Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, USA, 2019.

Dominic Jack: Experiment design, model implementation, write-up. (QUT Verified Signature, 26 May 2020)

Frederic Maire: Advised on model design and paper editing.
Sareh Shirazi: Advised on model design and paper editing.
Anders Eriksson: Advised on model design and paper editing.

Principal Supervisor Confirmation

I have sighted email or other correspondence from all Co-authors confirming their certifying authorship. (If the Co-authors are not able to sign the form please forward their email or other correspondence confirming the certifying authorship to the RSC.)

Frederic Maire (QUT Verified Signature), 26 May 2020

IGE-Net: Inverse Graphics Energy Networks for Human Pose Estimation and Single-View Reconstruction

Dominic Jack¹  Frederic Maire¹  Sareh Shirazi¹  Anders Eriksson²
[email protected]  [email protected]  [email protected]  [email protected]
¹School of Electrical Engineering and Computer Science, Queensland University of Technology
²School of Information Technology and Electrical Engineering, University of Queensland

Abstract

Inferring 3D scene information from 2D observations is an open problem in computer vision. We propose using a deep-learning based energy minimization framework to learn a consistency measure between 2D observations and a proposed world model, and demonstrate that this framework can be trained end-to-end to produce consistent and realistic inferences. We evaluate the framework on human pose estimation and voxel-based object reconstruction benchmarks and show competitive results can be achieved with relatively shallow networks with drastically fewer learned parameters and floating point operations than conventional deep-learning approaches.

1. Introduction

Computer graphics involves reducing 3D scene information to 2D using well-understood physics-based arguments and mathematical operations like frame transformations and projections. Computer vision can be thought of as the inverse problem – inferring 3D scene information from some 2D representation.

Unlike graphics, computer vision is inherently ill-posed. While it is straight-forward enough to obtain an inference which is consistent with a given 2D representation using standard graphics and optimization techniques, there is no guarantee this inference will be realistic. To resolve this, we propose using simple optimization techniques on a learned energy function which combines graphics operations with a learned realism component.

We learn this energy function itself using deep-learning optimization techniques, resulting in a multi-level optimization framework which can be trained end-to-end. We apply our framework to two common problems: 3D human pose estimation, and single-view voxel-based object reconstruction.

2. Main Contributions

Our main contributions are as follows.

1. We propose simple parameterized energy functions that capture both consistency and feasibility for the problems of human pose estimation and object reconstruction based on 2D features and well-understood computer-graphics principles.

2. For the case of human pose estimation, we show the proposed energy function can be used to lift 2D pose inferences to 3D at competitive accuracies with significantly fewer learned parameters and computational requirements.

3. For object reconstruction, we demonstrate the framework can produce high-resolution voxel grids from single images on standard desktop GPUs without the need for 3D convolutions or deconvolutions, out-performing state-of-the-art high-resolution methods in terms of accuracy.

3. Prior Work

3.1. Multi-Level Optimization

Many problems in machine learning involve inferring values of unknown variables from observations. Energy-based models describe relationships between sets of variables by mapping each combination to a scalar energy value, where realistic combinations correspond to lower energies than their less viable counterparts. Inferences are made by fixing values of known variables and seeking unknown values which minimize the energy [27].

Energy-based models have been combined with deep learning in the past. Zheng et al. [59] formulated conditional random fields (CRFs) as a recurrent neural network layer, which combined with a standard convolutional neural network (CNN) achieved state-of-the-art results for image


segmentation. Amos and Kolter [1] considered energy functions based on quadratic programs. Their implementation solved the inner optimization problem efficiently and exactly, and demonstrated it was able to learn hard constraints like those associated with the number-game Sudoku.

Domke [13] presented a number of implementations for efficiently computing and differentiating approximate optimizations – solutions where the energy minimization process is based on a fixed number of steps of some optimization algorithm. While the algorithms did not find the exact solution to the energy minimization problem, these truncated optimization processes still yielded good results for image denoising and labeling problems. Belanger et al. [3] took a similar approach and showed inexact optimization of complex energy functions outperformed exact solutions using simpler functions for image denoising and natural language semantic role labeling.

3.2. Human Pose Estimation

Inferring human pose in two or three dimensions from images is an important part of many tasks including human-computer interaction and action recognition. For the 2D problem, traditional approaches combine visual features and image descriptors with a tree-structure of the body and known invariants and proportions [58]. More recently, deep learning's wave of success in other image processing applications such as image classification and segmentation has flowed into pose estimation, with fully-convolutional approaches achieving exceptionally accurate 2D inferences by regressing heatmaps rather than the joint coordinates themselves [50, 35, 10, 5, 11].

The 3D problem is considerably more challenging. In addition to problems involved in the 2D variant, the main difficulty in training 3D pose inference systems that work in the wild is the availability of varied datasets. While 2D datasets can be annotated manually, 3D information is generally gathered using special motion-capture systems. Although these systems are capable of generating massive volumes of data, the examples within such datasets are usually limited in variety. For example, the human 3.6 million dataset (H3M) [19] contains millions of frames, but all images are collected in the same room with only a handful of subjects. By contrast, the popular 2D dataset COCO [30] features over 50,000 human pose annotations with very few duplicates.

To get around this lack of varied 3D data, many methods use a 2-stage approach to 3D inference by inferring 2D poses from images, then lifting these 2D poses to 3D separately [4, 7, 33]. These approaches benefit from the varied image features in 2D datasets, but the separate stages mean any "lifting" module is unable to take advantage of contextual information learned in the first stage.

The other main difficulty with 3D pose estimation is the inherent ambiguity associated with depth inference and occlusions. Adversarial approaches tackle this by introducing loss terms which are themselves learned in a modified minimax game [21, 47, 56].

3.3. Single View 3D Object Reconstruction

Reconstructing 3D objects from a single view is a common problem in computer vision and robotics. Fundamental to any approach is the representation of the output object. Volumetric methods are the most widely used in 3D learning [48, 54, 8, 9, 20, 55, 37, 51, 24, 60]. These approaches generally use 3D analogues of ideas and operations that have proven successful in image processing, including convolutions, deconvolutions and feature pooling. Recent advances in auto-encoders [15, 43] and GANs [53, 52, 31, 17] have also shown promising results on regular 3D grids, while Tulsiani et al. [46] showed object shape and pose can be learned simultaneously and without 3D labels using only depth maps or silhouettes to encourage view consistency across multiple views.

Unfortunately, the additional dimension inherent to 3D representations means these methods scale poorly with resolution, resulting in generally coarse outputs – typically 32³ or 64³. To overcome this scaling issue, octree networks [40, 49, 18, 45] recursively divide regions of interest into octants. By focusing only on regions near the object surface, these methods operate with complexity proportional to surface area rather than volume.

Other approaches to high-resolution inference keep the regular volumetric data structure but use operations that scale better to higher resolutions [23, 39].

Point cloud methods avoid the need to discretize space, instead working on continuous coordinates of points on the object surface [14, 36, 38, 28]. However, the variable size and unordered nature of point clouds introduce their own complexity in deep learning frameworks. Template deformation approaches [26, 22, 57] instead infer a constant-sized space warping that can be applied to an arbitrarily dense cloud or mesh. This comes at a cost however, as the topology of the output shape is intrinsically coupled with that of the deformed template.

4. Method Overview

Our approach is based on energy minimization networks which have been discussed previously in the literature [27, 13, 2, 3]. We base our notation on the work of Belanger, McCallum and Yang [3], where we seek the minimizer of some energy function

    \arg\min_{\tilde{y}} E(\tilde{y}; x, \theta_E) .    (1)

We implement the energy function E as a neural network which takes as inputs a proposed solution ỹ and extracted

features x with learned parameters θE . For generic non- 5. Human Pose Estimation convex energies, calculating the exact argmin is intractible, We begin by considering the problem of lifting human hence we approximate the result by the output of some iter- N 2 N 3 joint information from 2D (x R J2 × ) to 3D (y R J3 × ). ative strategy ∈ ∈ Note we do not require the number of joints to be the same, (t) (t) (t 1) y˜ = f(y˜ ,E(y˜ − ;x,θE );θopt), (2) nor do we require any known correspondences between the two sets. This allows us to pair 2D inferences from a model for some fixed number of steps t [1,T], where θopt are trained on one dataset with 3D pose data with different joint ∈ (0) hyper-parameters of the optimization strategy and y˜ is an annotations. initial proposal. For example, basic gradient-descent with Recent progress in this area has resulted in a number of learning rate η is implemented as algorithms performing very well on standard benchmarks, split on accuracy metrics by a matter of millimeters. For f(y˜,E(y˜;x,θE );η) = y˜ η∇y˜ E. (3) − many applications, such error rates are well and truly satis- In this investigation we also considered gradient-descent factory, so we approach this problem with the aim to mini- with momentum and gradient-clipping, where the momen- mize memory requirements and computational costs – fac- tum term and clip value were trained as part of θopt. tors more important in areas such as mobile robotics and autonomous systems – while maintaining reasonable accu- racy. We also limit our methods to perform well as defined by scale-invariant metrics. While scale can be learned based on contextual information, errors in scale inference tend to drown-out those associated with relative positions. 5.1. Network Structure Figure 1: Unrolled optimization involves iteratively updat- We base our feature extractor module on the work of (t) ing a proposed value y˜ to minimize some energy function Martinez et al.[33]. The proposed network is made up of E according to an update step f. Parameters of E and f (blue) two residual blocks each containing two dense layers along are learned in the outer optimization process. with an input and output layer for a total of six, as well as batch normalization, rectified linear activations, weight This process is illustrated in Figure1. We refer to this clipping, residual connections and dropout. While this net- scheme as unrolled gradient descent or inner optimization. work is small by modern standards, we reduce it further by To train our network we use a loss λ made up of a removing one of the internal blocks and dropping the num- weighted sum of losses applied to all steps of the optimiza- ber of units in each remaining inner layer by a factor of 8. tion process, This reduces the number of trainable parameters by roughly T λ = ∑ λˆ (y˜(t),y). (4) a factor of 100. Since our losses and evaluation are scale- t=0 agnostic, we also remove the weight clipping. We consider an energy function E as the combination of where kt is a scalar weighting value, y is the example label and λˆ is some per-proposal loss function dependent on the a reprojection energy Ex and a feasibility energy Ey, problem. In all experiments we use exponential weighting T t E(y˜;x) = Ex (x˜(y˜);x) + Ey(y˜), (5) kt = 0.9 − . Assuming E and f are piecewise-doubly-differentiable where x˜(y˜) is the projection of the proposed solution. 
We and λ0 is piecewise differentiable, the parameters θE and assume the intrinsic camera parameters are known, and in- θopt can be learned using any standard optimization strategy fer 3D poses in the camera’s reference frame. referred to as the outer optimizer. For brevity, we drop the Each energy function makes use of pairwise squared eu- parameters θE and θopt in equations and diagrams hereafter. clidean distances similar to Moreno-Noguer [34], To summarise, our inverse graphics energy networks ∆2 (z) = z z 2, j > i, (6) (IGE-Net) are made up of: i j || i − j||2 a feature extractor module that provides a (possibly N • where z is an ordered set of points in R . This transforma- empty) set of features as well as an initial estimate; tion has many desirable properties, including invariance to rotation, translation and reflection. Unlike Moreno-Noguer, an energy module which reduces a proposed solution • we use the squared distance rather than the actual differ- and observed features to a scalar value; and ence, as this avoids a square root operation causes problems an inner optimization strategy. with gradients near zero. • 64 CHAPTER 5. INVERSE GRAPHICS ENERGY NETWORKS

5. Human Pose Estimation

We begin by considering the problem of lifting human joint information from 2D (x ∈ R^(N_J2 × 2)) to 3D (y ∈ R^(N_J3 × 3)). Note we do not require the number of joints to be the same, nor do we require any known correspondences between the two sets. This allows us to pair 2D inferences from a model trained on one dataset with 3D pose data with different joint annotations.

Recent progress in this area has resulted in a number of algorithms performing very well on standard benchmarks, split on accuracy metrics by a matter of millimeters. For many applications, such error rates are well and truly satisfactory, so we approach this problem with the aim to minimize memory requirements and computational costs – factors more important in areas such as mobile robotics and autonomous systems – while maintaining reasonable accuracy. We also limit our methods to perform well as defined by scale-invariant metrics. While scale can be learned based on contextual information, errors in scale inference tend to drown out those associated with relative positions.

5.1. Network Structure

We base our feature extractor module on the work of Martinez et al. [33]. The proposed network is made up of two residual blocks each containing two dense layers, along with an input and output layer for a total of six, as well as batch normalization, rectified linear activations, weight clipping, residual connections and dropout. While this network is small by modern standards, we reduce it further by removing one of the internal blocks and dropping the number of units in each remaining inner layer by a factor of 8. This reduces the number of trainable parameters by roughly a factor of 100. Since our losses and evaluation are scale-agnostic, we also remove the weight clipping.

We consider an energy function E as the combination of a reprojection energy E_x and a feasibility energy E_y,

E(ỹ; x) = E_x(x̃(ỹ); x) + E_y(ỹ),   (5)

where x̃(ỹ) is the projection of the proposed solution. We assume the intrinsic camera parameters are known, and infer 3D poses in the camera's reference frame.

Each energy function makes use of pairwise squared euclidean distances similar to Moreno-Noguer [34],

∆²_ij(z) = ||z_i − z_j||²₂,  j > i,   (6)

where z is an ordered set of points in R^N. This transformation has many desirable properties, including invariance to rotation, translation and reflection. Unlike Moreno-Noguer, we use the squared distance rather than the actual difference, as this avoids a square root operation that causes problems with gradients near zero.

Figure 2: For lifting 2D pose information to 3D, we split the energy into 2 parts: a reprojection loss E_x which measures how consistent the projected proposed pose is with the observed 2D information, and a feasibility loss E_y which operates on the normalized proposed pose.

We parameterize our reprojection loss as a 2-layer dense network DN_x with softplus and softabs activations. Inputs are given by the pairwise squared distances between all points in x and x̃, i.e.

E_x(x̃; x) = DN_x(∆²(x̃ ⊕ x)),   (7)

where ⊕ is the concatenation operator along the joint dimension.

While a perfect proposal will yield a perfect reprojection (ỹ = y ⇒ x̃ = x), the reverse implication does not hold. As the name suggests, the feasibility energy E_y is intended to promote feasible proposals independently of the appearance x̃. To make this scale-invariant, we normalize the proposed pose ŷ = N(ỹ) by dividing by the distance between the hip joints, then consider the pairwise squared distances,

E_y(ỹ) = DN_y(∆²(ŷ)).   (8)

This energy architecture is illustrated in Figure 2.

To train our model, we use a per-step outer loss function

λ̂(ỹ^(t), y) = ||k ỹ^(t) − y||₂,   (9)

where k is the optimal scaling factor with respect to the squared error.
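As a concrete illustration of the energy terms above, the following hedged NumPy sketch computes the pairwise squared distances of Equation 6 and a feasibility energy in the spirit of Equation 8. Here dn_y, hip_left and hip_right are assumed placeholders for the learned 2-layer dense network and the hip joint indices, not names from the thesis.

    import numpy as np

    def pairwise_sq_dists(z):
        """Upper-triangular pairwise squared euclidean distances (Eq. 6).

        z: (J, D) array of J joint coordinates in D dimensions.
        Returns a 1D array of ||z_i - z_j||^2 for all j > i.
        """
        diff = z[:, None, :] - z[None, :, :]          # (J, J, D)
        sq = np.sum(diff ** 2, axis=-1)               # (J, J)
        iu = np.triu_indices(z.shape[0], k=1)
        return sq[iu]

    def feasibility_energy(y_prop, hip_left, hip_right, dn_y):
        """Scale-invariant feasibility energy E_y (Eq. 8) for a proposed
        3D pose y_prop of shape (J3, 3); dn_y is any callable standing in
        for the small dense network DN_y."""
        scale = np.linalg.norm(y_prop[hip_left] - y_prop[hip_right])
        y_hat = y_prop / scale                        # normalized pose
        return dn_y(pairwise_sq_dists(y_hat))         # scalar energy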

5.2. Implementation Details

We pretrain our initial pose estimation network independently for 200 epochs with a batch size of 64 as per the original [33].

For our inner loss networks, we initialized the hidden layer weights using Glorot initialization [16], and the loss layer weights with a version scaled down by 10⁻³. This resulted in the inner optimizer starting with little effect and growing, which smooths learning in the very early stages. We used the same learning-rate decay schedule as Martinez et al. [33] except with initial learning rate lowered by a factor of 10 and training to convergence.

We initialize our inner optimizer's learning rate, gradient clip value and momentum at 1, 1 and 0.1 respectively. To prevent negative momentum early due to spurious gradients in the initial loss function, we used the absolute value of a learned parameter rather than the learned parameter itself.

We run experiments on the popular Human 3.6 Million (H3M) dataset [19]. We use 2D pose inferences provided by Martinez et al. [33] which come from the stacked hourglass networks of Newell et al. [35]: one trained entirely on varied 2D poses in-the-wild, and another fine-tuned on H3M. We also experiment with ground-truth 2D poses. All training and evaluation uses inputs with the 16 joints used in COCO [30], and we infer a slightly different set of 16 joints in 3D. Evaluation is on a 17-joint skeleton with an additional pelvis joint as per Martinez et al. [33]. We train on subjects 1, 5, 6, 7 and 8 and evaluate on subjects 9 and 11.

5.3. Results

Sample results for our 2-step network trained on 2D ground-truth inputs are shown in Figure 3. We see the network learns to reconcile inconsistent 2D data in a single step. The subsequent step has a smaller impact, but still makes minor adjustments to the 3D pose without losing consistency with the observation.

Figure 3: Camera view (top) and novel view (bottom) of inferred pose (solid) and ground truth (dotted) after 0, 1 and 2 steps. Note the observed 2D pose (dotted, top) has one fewer joint in the head. The model uses camera view 2D joint coordinates (top, dotted) as inputs.

We evaluate our models using two metrics: the mean per-joint error after scaling as per Equation 9, and the per-joint error after an optimal rigid body transformation. We refer to these as Protocol 1a and Protocol 2 respectively (Martinez et al. [33] define Protocol 1 to be a slightly different metric; it is largely analogous in meaning to our Protocol 1a, though not equivalent).

We begin analysis by looking at the performance of our networks using ground truth 2D poses with different numbers of inner optimizer steps. We compare against different versions of the base model without the IGE component by varying the number of residual blocks as well as the number of hidden units in each layer. Protocol 2 results are shown in Figure 4.

[Figure 4: plot of Protocol 2 error (mm) against the number of multiply-adds for IGE and baseline models – see caption below.]

Figure 4: Protocol 2 scores (lower is better) and the number of multiply-adds due to dense layers in inference. Base model values are for networks with (left-to-right) 128, 256, 512 and 1024 units in each dense layer and 1 (red) or 2 (blue) residual blocks. IGE values (black) are for (left-to-right) 0, 1, 2, 4, 6, 8, 12 and 16 steps. The size of each dot represents the number of trainable parameters of the model.

Our IGE networks can achieve competitive results in a handful of steps, with performance comparable to the full base model with significantly fewer operations. Unlike the baseline, our networks also have a constant number of trained parameters, resulting in a significantly smaller memory footprint.

Results for experiments based on inferred 2D poses are shown in Table 1. Interestingly, our baseline method appears to overfit certain displacements, resulting in a relatively high Protocol 1a loss, though achieves a loss consistent with Martinez et al. after optimal translation. Our IGE network performs slightly worse than Martinez et al. on inferred detections, though given the reduced computational and memory burden we believe this would be an acceptable trade-off in many scenarios.

Table 1: Average Protocol 1a / Protocol 2 scores for inferences based on stacked hourglass detections (SH), fine-tuned stacked hourglass detections (FT) and ground truth projections (GT). Baseline models had 1024 hidden units and 2 residual blocks. IGE networks were trained for 4 and 8 steps. Lower is better.

                  Protocol 1a            Protocol 2
    2D source     SH     FT     GT       SH     FT     GT
    Mart. [33]    -      -      -        52.5   47.7   37.1
    Base 1024/2   79.0   75.1   61.6     52.2   47.9   35.8
    IGE4          75.1   67.8   45.1     56.1   51.5   39.4
    IGE8          72.8   66.0   42.6     55.1   50.5   37.7

6. Single View 3D Object Reconstruction

For the problem of 3D object reconstruction we parameterize shapes as voxel occupancy grids and seek a method that will scale well to high resolutions.

6.1. Energy Formulation

Theoretically, the approach of separating reprojection and feasibility losses can be applied to object reconstruction by comparing silhouettes and using some 3D convolutional encoder respectively. However, initial experiments showed this approach suffered from a number of issues. These included issues with formulating projections of continuous-valued proposed solutions and scaling issues associated with the cubic nature of the grid.

Instead, we propose a very different energy function formulation for single view reconstruction. We consider the inner optimizer input x to be the progressive outputs of some 2D convolutional network with N_C output feature-banks of different resolutions x = {x_1, x_2, ..., x_{N_C}}.

We consider an energy function made up of the sum of energy functions at each resolution. For each image feature map x_i of shape (h_i, w_i, f_i) we consider a voxel grid in the camera's viewing frustum ȳ_i of shape (h_i, w_i, d_i) by averaging the proposed voxel grid values in world coordinates ỹ over the frustum voxel volumes. Our energy function seeks to learn the consistency between all voxel values along a ray and the image features of the associated pixel,

E(ỹ; x) = Σ_i CNN_i(x_i ⊕ ȳ_i),   (10)

where concatenation is along the feature dimension and CNN_i is some short 2D convolutional neural network.

Figure 5: Energy function for single view reconstruction.

By setting the depth of the averaged frustum voxel grid d_i and the number of filters in each layer of CNN_i to be proportional to the number of image features f_i, and assuming those image features roughly double in depth as they halve in spatial resolution, we ensure the number of operations at each image resolution is the same. This allows for much better resolution scaling than typical 3D convolution/deconvolution networks.

In practice, averaging a voxel grid in world coordinates over voxels corresponding to a frustum grid is a non-trivial operation and must be done at each step and resolution of the inner optimizer across all examples. Instead, we transform the labels of our dataset into the frustum space in a preprocessing step. During inference, the proposed solution ỹ is a voxel grid in frustum coordinates which is average pooled anisotropically with different pool sizes for each image resolution.
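A minimal NumPy sketch of this multi-resolution energy (Equation 10), assuming the frustum grid dimensions divide evenly by the pooling factors and treating each CNN_i as an arbitrary callable returning a scalar:

    import numpy as np

    def avg_pool_3d(y, pool):
        """Anisotropic average pooling of a frustum voxel grid.

        y: (H, W, D) occupancy grid in frustum coordinates.
        pool: (ph, pw, pd) pooling factors; H, W, D must be divisible by them.
        """
        H, W, D = y.shape
        ph, pw, pd = pool
        return y.reshape(H // ph, ph, W // pw, pw, D // pd, pd).mean(axis=(1, 3, 5))

    def reconstruction_energy(y_prop, image_features, pools, cnns):
        """Multi-resolution energy of Eq. 10: pool the proposed frustum grid to
        each feature-map resolution, concatenate along the feature/depth axis
        and reduce with a small 2D CNN (any callable here)."""
        total = 0.0
        for x_i, pool, cnn_i in zip(image_features, pools, cnns):
            y_bar = avg_pool_3d(y_prop, pool)          # (h_i, w_i, d_i)
            total += cnn_i(np.concatenate([x_i, y_bar], axis=-1))
        return total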

This means only the average pooling must be done at each inner optimization step and resolution. Although this pooling operation still scales proportionally to the number of voxels and inner optimization steps (O(TN³)), GPU pooling implementations are relatively fast and the operation introduces no additional parameters.

While this means our method requires knowledge of the intrinsic camera parameters, we argue the choice of frame is arbitrary. Our method does not explicitly use the pose of the camera in its inference, and while the dataset transformation discussed above results in a slightly different problem compared to other approaches in the literature, we do not believe this puts us at an unfair advantage. On the contrary, the transformation results in a more varied dataset, and we demonstrate experimentally that traditional approaches perform slightly worse in this environment.

Our energy architecture is illustrated in Figure 5.

6.2. Outer Loss

For training, we experimented with two different per-step outer losses. Firstly, we consider an α-balanced focal loss [29] based on cross-entropy,

λ̂_CE(ỹ, y) = −Σ_v [ y_v (1 − ỹ_v)^γ (1 + α) log(ỹ_v) + (1 − y_v) ỹ_v^γ (1 − α) log(1 − ỹ_v) ],   (11)

where summation is over all voxels v. This is a generalization of standard cross-entropy (which is recovered by setting γ = α = 0) designed to alleviate issues with class imbalance. α ∈ (0, 1) results in additional focus on positive examples, while γ > 0 results in reduced focus on easy examples like those associated with the outside (usually empty) or very center (usually filled) of the voxel grid.

Secondly, we experiment with a continuous intersection-over-union implementation similar to that proposed by Richter and Roth [39],

λ̂_IoU(ỹ, y) = 1 − (ỹ · y) / (||ỹ + y||₁ − ỹ · y).   (12)
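For reference, both per-step outer losses (Equations 11 and 12) can be written in a few lines; the sketch below assumes flattened voxel probability and label arrays:

    import numpy as np

    def alpha_balanced_focal_loss(y_pred, y_true, alpha=0.7, gamma=2.0, eps=1e-7):
        """Per-step outer loss of Eq. 11 on flattened voxel probabilities.
        Setting alpha = gamma = 0 recovers standard cross-entropy."""
        y_pred = np.clip(y_pred, eps, 1.0 - eps)
        pos = y_true * (1.0 - y_pred) ** gamma * (1.0 + alpha) * np.log(y_pred)
        neg = (1.0 - y_true) * y_pred ** gamma * (1.0 - alpha) * np.log(1.0 - y_pred)
        return -np.sum(pos + neg)

    def continuous_iou_loss(y_pred, y_true):
        """Continuous intersection-over-union loss of Eq. 12."""
        intersection = np.sum(y_pred * y_true)
        return 1.0 - intersection / (np.sum(y_pred + y_true) - intersection)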

6.3. Implementation Details

We experimented with two architectures: a small network with an encoder based on MobilenetV2 (MN) [42], and another larger network based on Inception-V4 (I4) [44]. Image decoding networks built off the encoder network following a typical U-Net architecture common in the literature [41, 32, 35]. For the initial estimate, we used the output of a 3D deconvolution network based on the generator of Wu et al. [53] with one fewer layer, producing an output of resolution 32³. We then trilinearly upsampled to the required resolution.

The inner-loop CNNs (CNN_i) each consist of two 3×3 2D convolutions without padding, except at the lowest resolution, which was a 3×3 followed by a 2×2, with softplus and softabs activations.

Our inner optimizer used a learned learning rate and gradient clip value. We observed no significant difference with momentum, so did not include it in experiments.

We used a baseline 3D deconvolutional network for low-resolution comparison (32³) similar to the initial estimate network, except we doubled the number of features to keep the number of trained parameters comparable.

An overview of feature sizes and parameter counts is given in Table 2. Additional details and network diagrams are provided in the supplementary material.

Table 2: Network specification summary. Parameter counts are for 32³ networks – image decoder and inner-loop CNN parameter counts increase negligibly for higher resolutions.

                                    Base                          IGE
                                    MN2         I4                MN2         I4
    Image Encoder    Output size    1280        1536              4²×320      4²×1536
                     Parameters     2,223,872   54,276,192        1,811,712   54,276,192
    3D Decoder       Initial size   4³×128      4³×512            4³×64       4³×256
                     Parameters     2,656,113   14,159,297        238,009     3,802,849
    Image Decoder    Initial size   -           -                 4²×128      4²×512
                     Parameters     -           -                 140,992     2,928,384
    Inner-loop CNN   Initial size   -           -                 4²×256      4²×1024
                     Parameters     -           -                 1,109,840   16,573,760
    Inner Optimizer  Parameters     -           -                 2           2
    Total            Parameters     4,879,985   68,435,489        3,300,555   77,581,187

6.4. Dataset

We conduct experiments on the 13 categories of the popular ShapeNet dataset [6] popularized by Choy et al. [9]. Due to difficulties reconciling the rendering parameters, images and models supplied by the authors, we use our own renderings and voxelizations. As per Choy et al., each model was rendered from 24 different camera positions with azimuth angle uniformly sampled from [0°, 360°) and elevation angle in [25°, 30°], at a resolution of 128×128.

We created voxel grids by defining any voxel intersected by a face as filled. This means thin structures take up disproportionately large volumes at low resolutions. This is different to approaches which take a less strict approach, which may preserve a better overall volume ratio but risk losing thin structures entirely. This difference can affect low resolution grids significantly, though the difference becomes insignificant at higher resolutions. After initial voxelization, grids were filled in consistent with the approach used by Johnston et al. [23].

6.5. Results

Images of two of our models' inferences are shown in Figure 6 – our smaller model trained with α-balanced cross-entropy loss and the larger model with continuous IoU. Unsurprisingly, both models learn to space-carve very well, featuring virtually no voxels along rays that miss the object. The IoU-trained model appears to be more conservative when it comes to thin structures, while the α-balanced inferences often display slight shadowing along rays. This often results in more realistic looking inferences despite a lower average IoU score.

Quantitatively, we first investigate the performance of the models and the effect of the frustum grid at 32³ resolution. We compare against R2N2 [9] – a standard benchmark – along with other approaches designed for high resolution inference: Octree-Generating Networks (OGN) [45] and Matryoshka networks (Mat.) [39].

Table 3: IoU values (in %) at 32³ resolution. IGE models were trained with continuous IoU loss from Equation 12. Mean values are calculated by class. A single model was trained across all categories for each of our columns.

                  Frustum Dataset                   World Aligned Dataset
                  IGE-MN  IGE-I4  Base-MN  Base-I4  Base-MN  Base-I4  R2N2 [9]  OGN [45]  Mat. [39]
    plane         59.6    62.4    49.2     50.2     55.0     62.6     51.3      58.7      64.7
    bench         52.4    55.2    47.3     47.9     52.8     58.1     42.1      48.1      57.7
    cabinet       73.6    74.9    70.6     71.3     72.1     74.9     71.6      72.9      77.6
    car           78.4    79.9    74.2     73.5     77.2     76.9     79.8      81.6      85.0
    telephone     69.9    72.2    65.4     64.5     70.9     70.3     66.1      70.2      75.6
    chair         57.0    60.1    52.4     53.6     55.0     60.7     46.6      48.3      54.7
    sofa          69.6    71.2    65.9     66.8     66.7     69.8     62.8      64.6      68.1
    rifle         60.6    62.6    47.8     50.0     55.0     60.2     54.4      59.3      61.6
    lamp          54.0    56.5    47.5     50.1     48.7     50.8     38.1      39.8      40.8
    monitor       58.8    60.7    53.5     55.4     54.7     60.0     46.8      50.2      53.2
    speaker       74.5    76.5    72.4     72.8     70.6     72.4     66.2      63.7      70.1
    table         57.4    60.6    52.9     54.3     57.8     61.0     51.3      53.6      57.3
    watercraft    61.9    64.0    55.5     56.6     54.8     60.0     51.3      63.0      59.1
    mean          63.7    65.9    58.0     59.0     60.8     64.4     56.0      59.5      63.5

Intersection-over-union (IoU) values are shown in Table 3. Baseline models trained on the world-aligned grid consistently out-perform those trained on the frustum grid by a small margin. This suggests the patterns present in the frustum dataset are harder to learn than those in the regular dataset. This is not surprising, as there is significantly more variety in the frustum voxel grid dataset (1 grid per view, rather than 1 grid per model). For example, almost all planes in the world-aligned dataset have long fuselages and angled wings. A model that learns to identify planes could do reasonably well at low resolutions by simply inferring the class average rather than taking into account fine-level detail. To do similarly well on the frustum dataset, the model would need to additionally infer the camera position and learn to transform the average grid values accordingly.

While this means subsequent comparison to other methods trained on world-aligned grids is not truly fair, we include their results anyway. We believe this is more informative than only using self-comparisons, so long as they are interpreted with this disclaimer in mind.

Our multi-level optimization approach clearly out-performs the baseline on the same dataset across all categories and both image networks. It also out-performs the base method on the easier world-aligned dataset, along with all other competing methods considered on average.

Unsurprisingly, our larger model out-performs the smaller one in all categories, regardless of the model architecture.

To better understand the effect of the loss functions involved, we trained models at various resolutions with continuous IoU loss compared to models trained with different versions of Equation 11: base cross entropy (α = 0, γ = 0), reweighted cross entropy (α = 0.7, γ = 0) and focal loss (α = 0, γ = 2). Results are provided in Table 4.

Table 4: Mean IoU (in %, averaged over categories) for our IGE models trained with different losses.

            IGE-MN                                  IGE-I4
            cont. IoU  α=γ=0  α=0.7  γ=2            cont. IoU  α=γ=0  α=0.7  γ=2
    32³     63.5       60.7   58.0   59.8           66.0       61.9   59.6   64.0
    64³     61.5       56.8   57.1   56.9           64.7       60.8   59.3   59.3
    128³    58.9       51.6   54.5   53.9           62.2       56.4   56.8   57.1

Continuous IoU loss gives superior metric scores to all variations of cross-entropy. There is no clear winner amongst the cross-entropy variants.

Finally, we consider how our continuous IoU model performs at a resolution of 256³. Results for models trained at different resolutions and then linearly interpolated are given in Table 5. We trained a single model on all 13 categories, as well as a separate model for each of cars, planes and tables for fair comparison with other work.

Table 5: Mean IoU (in %) trained at different resolutions and evaluated at 256³ for models trained across all categories (13) and per-category (1). Per-category break-down of 13-category models available in supplementary.

                    car                       plane                     table
    Resolution      32    64    128   256     32    64    128   256     32    64    128   256
    OGN [45]        64.1  77.1  78.2  76.6    -     -     -     -       -     -     -     -
    Mat. [39]       68.3  78.4  79.4  79.6    36.7  48.8  58.0  59.6    38.6  42.3  43.5  41.3
    IGE-MN (13)     57.8  68.8  72.8  73.3    29.6  44.8  52.9  54.4    33.6  44.0  47.8  48.2
    IGE-I4 (13)     57.9  70.9  74.0  75.2    30.5  47.8  57.5  57.3    34.8  46.5  52.7  50.5
    IGE-MN (1)      57.0  70.3  76.2  75.2    30.7  47.9  58.7  58.1    33.6  45.9  50.6  50.2
    IGE-I4 (1)      58.4  71.2  76.5  76.5    30.1  49.2  60.5  62.0    35.0  46.4  52.2  52.1

Our networks all perform comparably on cars and planes, with our larger network performing slightly better, and category-specific training also improving things slightly. We significantly out-perform other methods on the table category, where the space-carving ability of our network can extract high-precision corners and edges and accurately reconstruct many thin structures.

Poor performance of low resolution models when evaluated at high resolutions is clear for our models. We attribute this to the large change in the volume of these structures as resolution increases, a result of our voxelization strategy.

A small performance regression is observed going from 128³ to 256³ in most experiments on our 13-category models. This is consistent with the observation made in OGN [45], who demonstrate that training on a more limited dataset results in improved performance with resolution, while more varied datasets are hindered by increased resolution.

Figure 6: Sample results for the IGE-MN model trained with α = 0.7 loss at 128³ resolution and the IGE-I4 model trained with continuous IoU at 256³. For each block of 6 – top row: (left) input image, (middle) inference (blue) with projected ground truth silhouettes (gray), (right) same as middle except the I4 network trained with cont. IoU loss. Bottom row: (left) ground truth object, (middle) MN inference, (right) I4 inference.

Unlike OGN, our regression occurs when training on the cross-category dataset, whereas theirs is apparent when training on the cars dataset.

7. Conclusion

We have demonstrated that energy-based multi-level optimization networks can take advantage of computer graphics principles to infer 3D information from 2D inputs. Our human pose dimension-lifting model performed comparably to networks with orders of magnitude more parameters and with a fraction of the number of operations. We investigated two 3D reconstruction networks, and showed competitive results could be achieved with a relatively small network, while a larger network could out-perform other state-of-the-art high-resolution networks.

This research was supported by the Australian Research Council through the grant ARC FT170100072.

References of the thirteenth international conference on artificial intel- ligence and statistics, pages 249–256, 2010.4, 11 [1] B. Amos and J. Z. Kolter. Optnet: Differentiable opti- [17] J. Gwak, C. B. Choy, A. Garg, M. Chandraker, and mization as a layer in neural networks. arXiv preprint S. Savarese. Weakly supervised generative adversarial net- arXiv:1703.00443, 2017.2 works for 3D reconstruction. In 3DV, 2017.2 [2] D. Belanger and A. McCallum. Structured prediction energy [18]C.H ane,¨ S. Tulsiani, and J. Malik. Hierarchical surface pre- networks. In International Conference on Machine Learn- diction for 3D object reconstruction. In 3DV, 2017.2 ing, pages 983–992, 2016.2 [19] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. [3] D. Belanger, B. Yang, and A. McCallum. End-to-end learn- Human3. 6m: Large scale datasets and predictive meth- ing for structured prediction energy networks. In Proceed- ods for 3d human sensing in natural environments. IEEE ings of the 34th International Conference on Machine Learn- transactions on pattern analysis and machine intelligence, ing - Volume 70, ICML’17, pages 429–439. JMLR.org, 2017. 36(7):1325–1339, 2014.2,4 2 [20] D. J. Rezende, S. M. A. Eslami, S. Mohamed, P. Battaglia, [4] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, M. Jaderberg, and N. Heess. Unsupervised learning of 3D and M. J. Black. Keep it smpl: Automatic estimation of 3d structure from images. In NIPS, 2016.2 human pose and shape from a single image. In European [21] D. Jack, F. Maire, A. Eriksson, and S. Shirazi. Adversarially Conference on Computer Vision, pages 561–578. Springer, parameterized optimization for 3d human pose estimation. 2016.2 In 3D Vision (3DV), 2017 Fifth International Conference on. [5] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi- IEEE, 2017.2 person 2d pose estimation using part affinity fields. In CVPR, [22] D. Jack, J. K. Pontes, S. Sridharan, C. Fookes, S. Shi- 2017.2 razi, F. Maire, and A. Eriksson. Learning free-form de- [6] A. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, formations for 3d object reconstruction. arXiv preprint Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. arXiv:1803.10932, 2018.2 Shapenet: An information-rich 3d model repository., corr [23] A. Johnston, R. Garg, G. Carneiro, I. D. Reid, and A. van den abs/1512.03012. URL http://arxiv. org/abs/1512.03012.6 Hengel. Scaling cnns for high resolution volumetric recon- [7] C.-H. Chen and D. Ramanan. 3d human pose estimation= struction from a single image. In ICCV Workshops, pages 2d pose estimation+ matching. In CVPR, volume 2, page 6, 930–939, 2017.2,6 2017.2 [24] A. Kar, C. Hane,¨ and J. Malik. Learning a multi-view stereo [8] I. Cherabier, C. Hane,¨ M. R. Oswald, and M. Pollefeys. machine. In NIPS, 2017.2 Multi-label semantic 3D reconstruction using voxel blocks. [25] D. P. Kingma and J. Ba. Adam: A method for stochastic In 3DV, 2016.2 optimization. arXiv preprint arXiv:1412.6980, 2014. 11 [9] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d- [26] A. Kurenkov, J. Ji, A. Garg, V. Mehta, J. Gwak, C. B. Choy, r2n2: A unified approach for single and multi-view 3d object and S. Savarese. DeformNet: Free-form deformation net- reconstruction. In European conference on computer vision, work for 3d shape reconstruction from a single image. vol- pages 628–644. Springer, 2016.2,6,7 ume abs/1708.04672, 2017.2 [10] X. Chu, W. Ouyang, H. Li, and X. Wang. Structured feature [27] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang. 
learning for pose estimation. In Proceedings of the IEEE A tutorial on energy-based learning. Predicting structured Conference on Computer Vision and Pattern Recognition, data, 1(0), 2006.1,2 pages 4715–4723, 2016.2 [28] C.-H. Lin, C. Kong, and S. Lucey. Learning efficient point [11] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and cloud generation for dense 3D object reconstruction. In X. Wang. Multi-context attention for human pose estima- AAAI, 2018.2 tion. In Proceedings of the IEEE Conference on Computer [29] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar.´ Vision and Pattern Recognition, pages 1831–1840, 2017.2 Focal loss for dense object detection. arXiv preprint [12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei- arXiv:1708.02002, 2017.6 Fei. Imagenet: A large-scale hierarchical image database. [30] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra- In Computer Vision and Pattern Recognition, 2009. CVPR manan, P. Dollar,´ and C. L. Zitnick. Microsoft coco: Com- 2009. IEEE Conference on, pages 248–255. Ieee, 2009. 11 mon objects in context. In European conference on computer [13] J. Domke. Generic methods for optimization-based model- vision, pages 740–755. Springer, 2014.2,4 ing. In Artificial Intelligence and Statistics, pages 318–326, [31] J. Liu, F. Yu, and T. A. Funkhouser. Interactive 3D modeling 2012.2 with a generative adversarial network. In 3DV, 2017.2 [14] H. Fan, H. Su, and L. J. Guibas. A point set generation net- [32] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional work for 3D object reconstruction from a single image. In networks for semantic segmentation. In Proceedings of the CVPR, 2017.2 IEEE conference on computer vision and pattern recogni- [15] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta. tion, pages 3431–3440, 2015.6 Learning a predictable and generative vector representation [33] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple for objects. In ECCV, 2016.2 yet effective baseline for 3d human pose estimation. In Inter- [16] X. Glorot and Y. Bengio. Understanding the difficulty of national Conference on Computer Vision, volume 1, page 5, training deep feedforward neural networks. In Proceedings 2017.2,3,4,5 70 CHAPTER 5. INVERSE GRAPHICS ENERGY NETWORKS

[34] F. Moreno-Noguer. 3d human pose estimation from a single [49] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong. O- image via distance matrix regression. In Computer Vision CNN: Octree-based convolutional neural networks for 3D and Pattern Recognition (CVPR), 2017 IEEE Conference on, shape analysis. In SIGGRAPH, 2017.2 pages 1561–1570. IEEE, 2017.3 [50] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Con- [35] A. Newell, K. Yang, and J. Deng. Stacked hourglass net- volutional pose machines. In Proceedings of the IEEE Con- works for human pose estimation. In European Conference ference on Computer Vision and Pattern Recognition, pages on Computer Vision, pages 483–499. Springer, 2016.2,4,6 4724–4732, 2016.2 [36] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep [51] J. Wu, Y. Wang, T. Xue, X. Sun, W. T. Freeman, and J. B. learning on point sets for 3D classification and segmentation. Tenenbaum. MarrNet: 3D Shape Reconstruction via 2.5D In CVPR, 2017.2 Sketches. In NIPS, 2017.2 [37] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. [52] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Tor- Guibas. Volumetric and multi-view CNNs for object clas- ralba, and W. T. Freeman. Single image 3D interpreter net- sification on 3D data. In CVPR, 2016.2 work. In ECCV, 2016.2 [38] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep [53] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. hierarchical feature learning on point sets in a metric space. Learning a probabilistic latent space of object shapes via 3d In NIPS, 2017.2 generative-adversarial modeling. In Advances in Neural In- [39] S. R. Richter and S. Roth. Matryoshka networks: Predicting formation Processing Systems, pages 82–90, 2016.2,6 3d geometry via nested shape layers. In Proceedings of the [54] Z. Wu, S. Song, A. Khosla, X. Tang, and J. Xiao. 3D IEEE Conference on Computer Vision and Pattern Recogni- ShapeNets: A deep representation for volumetric shapes. In tion, pages 1936–1944, 2018.2,6,7 CVPR, 2015.2 [40] G. Riegler, A. O. Ulusoy, and A. Geiger. OctNet: Learning [55] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspec- deep 3D representations at high resolutions. In CVPR, 2017. tive transformer nets: Learning single-view 3D object recon- 2 struction without 3D supervision. In NIPS, 2016.2 [41] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convo- [56] W. Yang, W. Ouyang, X. Wang, J. Ren, H. Li, and X. Wang. lutional networks for biomedical image segmentation. In 3d human pose estimation in the wild by adversarial learning. International Conference on Medical image computing and In Proceedings of the IEEE Conference on Computer Vision computer-assisted intervention, pages 234–241. Springer, and Pattern Recognition, volume 1, 2018.2 2015.6 [57] Y. Yang, C. Feng, Y. Shen, and D. Tian. Foldingnet: Point [42] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. cloud auto-encoder via deep grid deformation. In Proc. IEEE Chen. Mobilenetv2: Inverted residuals and linear bottle- Conf. on Computer Vision and Pattern Recognition (CVPR), necks. In Proceedings of the IEEE Conference on Computer volume 3, 2018.2 Vision and Pattern Recognition, pages 4510–4520, 2018.6, [58] Y. Yang and D. Ramanan. Articulated pose estimation with 11 flexible mixtures-of-parts. In Computer Vision and Pat- [43] A. Sharma, O. Grau, and M. Fritz. VConv-DAE: Deep vol- tern Recognition (CVPR), 2011 IEEE Conference on, pages umetric shape learning without object labels. In ECCVW, 1385–1392. IEEE, 2011.2 2016.2 [59] S. 
Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, [44] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional ran- Inception-v4, inception-resnet and the impact of residual dom fields as recurrent neural networks. In Proceedings of connections on learning. In AAAI, volume 4, page 12, 2017. the IEEE international conference on computer vision, pages 6 1529–1537, 2015.1 [45] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree gen- [60] R. Zhu, H. K. Galoogahi, C. Wang, and S. Lucey. Rethinking erating networks: Efficient convolutional architectures for reprojection: Closing the loop for pose-aware shape recon- high-resolution 3d outputs. In Proc. of the IEEE Interna- struction from a single image. In NIPS, 2017.2 tional Conf. on Computer Vision (ICCV), volume 2, page 8, 2017.2,7 [46] S. Tulsiani, A. A. Efros, and J. Malik. Multi-view con- sistency as supervisory signal for learning shape and pose prediction. In Computer Vision and Pattern Regognition (CVPR), 2018.2 [47] H.-Y. F. Tung, A. W. Harley, W. Seto, and K. Fragkiadaki. Adversarial inverse graphics networks: Learning 2d-to-3d lifting and image-to-image translation from unpaired super- vision. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’17), volume 2, 2017.2 [48] A. O. Ulusoy, A. Geiger, and M. J. Black. Towards prob- abilistic volumetric reconstruction using ray potential. In 3DV, 2015.2 71

8. Supplementary Material

8.1. Voxelization

Initial voxelization was done in world coordinates and used exact voxelization, i.e. any voxel partially intersected by a face was defined as filled. We subsequently filled in hollow models by switching the state of any empty voxel without any free path to the exterior bounding box along any of the 6 rays pointing in the ±î, ±ĵ and ±k̂ directions starting at the voxel.

Frustum voxels were calculated based on these filled-in world-coordinate voxel grids of the same resolution using nearest neighbour sampling. Near and far planes were based on viewing a sphere of radius 0.5 centred at the origin. Note all reported intersection-over-union values on frustum grids have been reweighted to account for the non-uniform volume of the elements.

Table 6: Mean IoU (in %) evaluated at 256³ resolution. A single model is trained across all categories. Lower resolution inferences are trilinearly upsampled.

                  IGE-MN                        IGE-I4
                  32³    64³    128³   256³     32³    64³    128³   256³
    plane         29.6   44.8   52.9   54.4     30.5   47.8   57.5   57.3
    bench         25.5   32.8   35.4   34.6     26.1   37.1   41.2   38.4
    cabinet       58.0   66.7   69.0   69.4     59.5   68.6   71.9   70.2
    car           57.8   68.8   72.8   73.3     57.9   70.9   74.0   75.2
    telephone     54.9   66.3   69.7   68.5     55.0   66.6   73.3   70.0
    chair         37.4   46.3   49.2   48.1     38.7   50.1   53.7   51.3
    sofa          55.5   63.9   66.5   65.8     56.0   66.7   70.0   68.2
    rifle         32.3   44.2   51.0   50.1     32.1   47.4   54.6   51.9
    lamp          27.6   35.9   39.5   37.8     29.2   38.7   42.4   39.6
    monitor       41.1   48.7   51.2   51.0     41.7   52.2   54.9   51.2
    speaker       61.6   69.1   71.5   71.2     63.6   71.4   74.4   73.3
    table         33.6   44.0   47.8   48.2     34.8   46.5   52.7   50.5
    watercraft    41.5   52.1   56.0   55.4     41.6   54.6   59.2   57.1
    mean (cat.)   42.8   52.6   56.3   56.0     43.6   55.3   60.0   58.0
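A hedged NumPy sketch of the cavity-filling rule described above – an empty voxel is switched to filled if each of the six axis-aligned rays starting at it hits a filled voxel before leaving the bounding box:

    import numpy as np

    def fill_hollow(occ):
        """Fill internal cavities in a boolean occupancy grid (Section 8.1).

        An empty voxel is filled if every one of the 6 axis-aligned rays
        (+/-i, +/-j, +/-k) starting at it hits a filled voxel before leaving
        the bounding box, i.e. there is no free path to the exterior.
        """
        occ = occ.astype(bool)
        blocked = np.ones_like(occ)
        for axis in range(3):
            # any filled voxel at a smaller index along this axis (ray in -direction)?
            blocked &= np.maximum.accumulate(occ, axis=axis)
            # any filled voxel at a larger index along this axis (ray in +direction)?
            blocked &= np.flip(np.maximum.accumulate(np.flip(occ, axis=axis), axis=axis), axis=axis)
        return occ | blocked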

8.2. Training Details

All kernels of all networks except the inner-loop CNNs had L2 regularization with weight 5×10⁻⁴. No regularization was applied to the inner-loop CNNs.

Networks were trained at 32³, 64³, 128³ and 256³. All except the highest resolution were trained with a batch size of 16 for 100,000 steps. The image encoder for the 32³ model was initialized with publicly available weights trained on ImageNet [12]. Other convolutional kernels were initialized with Glorot uniform initialization [16], except the final kernel of each loss network, which was initially a factor of 10⁻³ lower as before. Higher resolution networks were initialized from their lower resolution counterparts.

To allow models to be trained on a desktop GPU (Nvidia GTX 1080-Ti), the highest resolution networks (256³) used a batch size of 4. We turned off batch normalization for this final model to avoid spurious batch statistics due to the reduced batch size. We observed very little improvement beyond the first few thousand steps, so terminated training after 20,000 steps.

We used Adam [25] for our outer optimizer. Our 32³ models used a learning rate of 5×10⁻⁵, while higher resolution networks used 2×10⁻⁵.

[Figure 7: network architecture diagrams – per-layer feature sizes omitted here.]

Figure 7: Left: Image Feature Network for the IGE-MN model. The left column is the standard MobilenetV2 convolutional network [42] and is shared with the initial estimate network. Dashed arrows represent bilinear resizing. x_i values are the result of a 1×1 convolution on the concatenated inputs followed by batch normalization and rectified linear activation. Right: Initial Estimate Network for the MN model. The left column is shared with the image feature network. Dashed arrows represent 4³ deconvolutions with isotropic stride 2 followed by batch normalization and a rectified linear activation. The dotted line is a reshape.

[Figure 8: diagram of the initial decoding transformation – layer sizes omitted here.]

Figure 8: Initial decoding transformation for the baseline voxel MN decoder. Operations left-to-right from the embedding layer: dense layer, reshape, split/reshape, addition with dimension broadcasting. Note there are no learned parameters due to any dotted arrows.

Chapter 6

Sparse Convolutions on Continuous Domains for Point Cloud and Event Stream Networks

“It’s fine to work on any problem, so long as it generates interesting mathematics along the way – even if you don’t solve it at the end of the day.” — Andrew Wiles

On the surface, our fourth contribution has a fairly tenuous link to the rest of this thesis. While our other contributions have all focused on inferring 3D representations from 2D inputs, this work looks at extracting features from irregular 3D data sources – point clouds and event streams – useful for tasks such as classification and semantic segmentation.

To understand the relevance to this thesis one must appreciate this was not the publication that was originally envisioned. Rather, the initial plan was to implement an object-reconstruction IGE-Net using point clouds rather than volumetric occupancy grids. This would alleviate scaling issues by focusing on the surface rather than the entire volume, and greatly simplify the projection operator compared to probabilistic occupancy grids.

While we believe point cloud representations are well suited for 3D reconstruction with IGE-Nets, it quickly became apparent there was one major problem: there is no universally accepted point cloud convolution operation similar to voxel/image convolutions in deep learning, and popular methods in the literature are non-hierarchical or discontinuous in cloud coordinates, making them incompatible with IGE-Net energy functions.

This contribution originally aimed to plug that gap in the literature, presenting a point cloud convolution operator that is both hierarchical and continuous with respect to input coordinates. Upon implementing and appreciating its capacity to scale, we decided to extend it from point clouds to event camera streams. While point clouds and event streams might seem quite different input types, we can think of events as 3D points in (x, y, t)-space. While there are obvious differences between the two sources, both represent irregular sparse data types with at least one continuous dimension.


Statement of Contribution of Co-Authors for Thesis by Published Paper

The authors listed below have certified that:

1. they meet the criteria for authorship in that they have participated in the conception, execution, or interpretation, of at least that part of the publication in their field of expertise;

2. they take public responsibility for their part of the publication, except for the responsible author who accepts overall responsibility for the publication;

3. there are no other authors of the publication according to these criteria;

4. potential conflicts of interest have been disclosed to (a) granting bodies, (b) the editor or publisher of journals or other publications, and (c) the head of the responsible academic unit; and

5. they agree to the use of the publication in the student’s thesis and its publication on the QUT’s ePrints site consistent with any limitations set by publisher requirements.

In the case of this chapter: Sparse Convolutions on Continuous Domains for Point Cloud and Event Stream Networks, submitted to the European Conference on Computer Vision (ECCV), Glasgow, Scotland, 2020.

Dominic Jack – Experiment design, model implementation, write-up. Signature: QUT Verified Signature, 26 May 2020.
Frederic Maire – Advised on model design and paper editing.
Simon Denman – Advised on model design and paper editing.
Anders Eriksson – Advised on model design and paper editing.

Principal Supervisor Confirmation

I have sighted email or other correspondence from all Co-authors confirming their certifying authorship. (If the Co-authors are not able to sign the form please forward their email or other correspondence confirming the certifying authorship to the RSC.)

Name: Frederic Maire    Signature: QUT Verified Signature    Date: 26 May 2020

Sparse Convolutions on Continuous Domains for Point Cloud and Event Stream Networks

Dominic Jack¹ (0000-0002-8371-3502), Frederic Maire¹ (0000-0002-6212-7651), Simon Denman¹ (0000-0002-0983-5480), and Anders Eriksson² (0000-0003-2652-7110)

¹ Queensland University of Technology, QLD, Australia
{d1.jack, f.maire, s.denman}@qut.edu.au
² University of Queensland, QLD, Australia
[email protected]

Abstract. Image convolutions have been a cornerstone of a great number of deep learning advances in computer vision. The research community is yet to settle on an equivalent operator for sparse, unstructured continuous data like point clouds and event streams however. We present an elegant sparse matrix-based interpretation of the convolution operator for these cases, which is consistent with the mathematical definition of convolution and efficient during training. On benchmark point cloud classification problems we demonstrate networks built with these operations can train an order of magnitude or more faster than top existing methods, whilst maintaining comparable accuracy and requiring a tiny fraction of the memory. We also apply our operator to event stream processing, achieving state-of-the-art results on multiple tasks with streams of hundreds of thousands of events.

Keywords: Convolution, Point Clouds, Event Cameras, Deep Learning

1 Introduction

Deep learning has exploded in popularity since AlexNet [1] achieved ground-breaking results in image classification [2]. The field now boasts state-of-the-art performance in fields as diverse as [3], natural language processing [4], and molecular design [5].

Robotics [6] applications are of particular interest due to their capacity to revolutionize society in the near future. Driverless cars [7] specifically have attracted enormous amounts of research funding, with advanced systems being built with multi-camera setups [8], active LiDAR sensors [9], and sensor fusion approaches [10].

At the other end of the spectrum, small mobile robotics applications and mobile devices benefit from an accurate 3D understanding of the world. These platforms generally don't have access to large battery stores or computationally hefty hardware, so efficient computation is essential. Even where compute is available, the cost of energy alone can be prohibitive, and the research community is beginning to appreciate the environmental cost of training massive power-hungry algorithms in data centres [11].


The convolution operator has been a critical component of almost all recent advances in deep learning for computer vision. However, implementations designed for use with images cannot be used for data types that are not defined on a regular grid. Consider for example event cameras, a new type of sensor which shows great promise, particularly in the area of mobile robotics. Rather than reporting average intensities of every pixel at a uniform frame rate, pixels in an event camera fire individual events when they observe an intensity change. The result is a sparse signal with very fast response time, high dynamic range and low power usage. Despite the potential, this vastly different data encoding means that a traditional 2D convolution operation is no longer appropriate.

Fig. 1: Learned image convolutions can be thought of as linear combinations of static basis convolutions, where the linear combination is learned. Each basis convolution can be expressed as a sparse-dense matrix product. We take the same approach with point clouds and event streams.

In this work, we investigate how the convolution operator can be applied to two non-image input sources: point clouds and event streams. In particular, our contributions are as follows.

1. We implement a convolution operator for sparse inputs on continuous domains using only matrix products and addition during training. While others have proposed such an operator, we believe we are the first to implement one without compromising the mathematical definition of convolution.
2. We discuss implementation details essential to the feasible training and deployment of networks using our operator on modest hardware. We demonstrate speed-ups of an order of magnitude or more compared to similar methods with a memory footprint that allows for batch sizes in the thousands.
3. For point clouds, we discuss modifications that lead to desirable properties like robustness to varying density and continuity, and demonstrate that relatively small convolutional networks can perform competitively with much larger, more expensive networks.


4. For event streams, we demonstrate that convolutions can be used to learn features from spiking network spike trains. By principled design of our kernels, we propose two implementations of the same networks: one for learning that takes advantage of modern accelerator hardware, and another for asynchronous deployment which can provide features or inferences associated with events as they arrive. We demonstrate the effectiveness of our learned implementation by achieving state-of-the-art results on multiple classification benchmarks, including a 44% reduction in error rate on sign language classification [12].

2 Prior Work

Point Clouds  Early works in point cloud processing – Pointnet [13] and Deep Sets [14] – use point-wise shared subnetworks and order invariant pooling operations. The successor to Pointnet, Pointnet++ [15], was (to the best of our knowledge) the first to take a hierarchical approach, applying Pointnet submodels to local neighborhoods. SO-Net [16] takes a similar hierarchical approach to Pointnet++, though uses a different method for sampling and grouping based on self-organizing maps.

DGCNN [17] applies graph convolutions to point clouds with edges based on spatial proximity. KCNet [18] uses dynamic kernel points in correlation layers that aim to learn features that encapsulate the relationships between those kernel points and the input cloud. While most approaches treat point clouds as unordered sets by using order-invariant operations, PointCNN [19] takes the approach of learning a canonical ordering over which an order-dependent operation is applied. SpiderCNN [20] and FlexConv [21] each bring their own unique interpretation to generalizing image convolutions to irregular grids. While SpiderCNN focuses on large networks for relatively small classification and segmentation problems, FlexConv utilizes a specialized GPU kernel to apply their method to point clouds with millions of points.

Event Stream Networks  Compared to standard images, relatively little research has been done with event networks. Interest has started to grow recently with the availability of a number of event-based cameras [22, 23] and publicly available datasets [23–26, 12].

A number of approaches utilize the extensive research in standard image processing by converting event streams to images [25, 27]. While these can leverage existing libraries and cheap hardware optimized for image processing, the necessity to accumulate synchronous frames prevents them from taking advantage of many potential benefits of the data format. Other approaches look to biologically-inspired spiking neural networks (SNNs) [28–30]. While promising, these networks are difficult to train due to the discrete nature of the spikes.

Other notable approaches include the work of Lagorce et al. [31], who introduce hierarchical time-surfaces to describe spatio-temporal patterns; Sironi et al. [26], who show histograms of these time surfaces can be used for object classification; and Bi et al. [12], who use graph convolution techniques operating over graphs formed by connecting close events in space-time.

Sparse Convolutions  Sparse convolutions have been used in a variety of ways in deep learning before. Liu et al. [32] and Park et al. [33] demonstrate improved speed from using implementations optimized for sparse kernels on discrete domains, while there are various voxel-based approaches [34–36] that look at convolutions on discrete sparse inputs and dense kernels. Other approaches involve performing dense discrete convolutions on interpolated projections [37, 38].

3 Method Overview

For simplicity, we formulate continuous domain convolutions in the context of physical point clouds in Section 3.1, before modifying the approach for event streams in Section 3.2. A summary of notation used in this section is provided in the supplementary material.

3.1 Point Cloud Convolutions

We begin by considering the mathematical definition of a convolution of a function h with a kernel g,

(h ∗ g)(t) = ∫_D h(τ) g(t − τ) dτ.   (1)

We wish to evaluate the convolution of a function with values defined at fixed points x_j in an input cloud X of size S, at a finite set of points x'_i in an output cloud X' of size S'. We denote a single feature for each point in these clouds f ∈ R^S and f' ∈ R^(S') respectively. For the moment we assume coordinates for both input and output clouds are given. In practice it is often the case that only the input coordinates are given. We discuss choices of output clouds in subsequent sections.

By considering our convolved function h to be the sum of scaled Dirac delta functions δ centred at the point coordinates,

h(x) = Σ_j f_j δ(x − x_j),   (2)

Equation 1 reduces to

f'_i = Σ_{x_j ∈ N_i} f_j g(x'_i − x_j),   (3)

where N_i is the set of points in the input cloud within some neighborhood of the output point x'_i. We refer to pairs of points {x_j, x'_i} where x_j ∈ N_i as an edge, and the difference in coordinates ∆x_ij = x'_i − x_j as the edge vector.
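The following is a direct (non-vectorized) NumPy/SciPy sketch of Equation 3 with a ball neighborhood and the kernel expansion introduced below; the degree-1 monomial basis shown is only an example choice, and all function names are illustrative assumptions.

    import numpy as np
    from scipy.spatial import cKDTree

    def monomials(dx):
        """Example degree <= 1 monomial basis p_m evaluated at a 3D edge
        vector dx: [1, x, y, z]. The basis choice is a hyper-parameter."""
        return np.concatenate([[1.0], dx])

    def point_conv(x_in, f_in, x_out, theta, radius):
        """Direct evaluation of Eq. 3 with a ball neighborhood.

        x_in: (S, 3) input coordinates, f_in: (S,) input features,
        x_out: (S', 3) output coordinates, theta: (M,) kernel weights.
        """
        tree = cKDTree(x_in)
        f_out = np.zeros(len(x_out))
        for i, xi in enumerate(x_out):
            for j in tree.query_ball_point(xi, radius):
                dx = xi - x_in[j]                      # edge vector
                f_out[i] += f_in[j] * monomials(dx) @ theta
        return f_out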


Like Groh et al. [21], we use a kernel made up of a linear combination of M unlearned basis functions p_m,

g(∆x; θ) = Σ_m p_m(∆x) θ_m,   (4)

where θ_m are learnable parameters. As with Groh et al., we use geometric monomials for our basis functions. Substituting this into Equation 3 and reordering summations yields

f'_i = Σ_m Σ_{x_j ∈ N_i} p_m(∆x_ij) f_j θ_m.   (5)

We note the inner summation can be expressed as a sparse-dense matrix product,

f' = Σ_m N^(m) f θ_m.   (6)

This is visualized in Figure 1. Neighborhood matrices N^(m) have the same sparsity structure for all m. Values n_ij^(m) are given by the corresponding basis functions evaluated at edge vectors,

n_ij^(m) = p_m(∆x_ij) if x_j ∈ N_i, and 0 otherwise.   (7)

Generalizing to multi-channel input and output features F ∈ R^(S×Q) and F' ∈ R^(S'×P) respectively, this can be expressed as a sum of matrix products,

F' = Σ_m N^(m) F Θ^(m),   (8)

where Θ^(m) ∈ R^(Q×P) is a matrix of learned parameters.

The elegance of this representation should not be understated. N^(m) is a sparse matrix defined purely by relative point coordinates and the choice of basis functions. Θ^(m) is a dense matrix of parameter weights much like traditional convolutional layers, and F' and F are feature matrices with the same structure, allowing networks to be constructed in much the same way as image CNNs.

We now identify three implementations with analogues to common image convolutions. A summary is provided in Table 1.

Down-Sampling Convolutions  Convolutions in which there are fewer output points than input points and more output channels than input channels are more efficiently computed left-to-right, i.e. as Σ_m (N^(m) F) Θ^(m). These are analogous to conventional strided image convolutions.

Up-Sampling Convolutions  Convolutions in which there are more output points than input points and fewer output channels than input channels are more efficiently computed right-to-left, i.e. Σ_m N^(m) (F Θ^(m)). These are analogous to conventional fractionally strided or transposed image convolutions.
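A sketch of the sparse-matrix form of Equations 7 and 8 using SciPy sparse matrices; in practice these products run on accelerators via framework sparse ops, but the structure is identical. Names and the edge-list format are assumptions for illustration.

    import numpy as np
    import scipy.sparse as sp

    def build_neighborhood_matrices(x_in, x_out, edges, basis_fns):
        """Build the M sparse neighborhood matrices N^(m) of Eq. 7.

        edges: list of (i, j) pairs with x_in[j] in the neighborhood of x_out[i].
        basis_fns: list of M functions, each mapping an edge vector to a scalar.
        """
        rows = np.array([i for i, _ in edges])
        cols = np.array([j for _, j in edges])
        dx = x_out[rows] - x_in[cols]                        # edge vectors
        return [sp.csr_matrix((np.apply_along_axis(p, 1, dx), (rows, cols)),
                              shape=(len(x_out), len(x_in))) for p in basis_fns]

    def sparse_conv(neighborhoods, F, thetas, down_sample=True):
        """F' = sum_m N^(m) F Theta^(m) (Eq. 8). For down-sampling layers the
        product is evaluated left-to-right, for up-sampling right-to-left."""
        out = 0
        for N, Theta in zip(neighborhoods, thetas):
            out = out + ((N @ F) @ Theta if down_sample else N @ (F @ Theta))
        return out

The choice of evaluation order in sparse_conv is exactly the distinction between the down- and up-sampling variants summarised in Table 1.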


Featureless Convolutions  The initial convolutions in image CNNs typically have large receptive fields and a large increase in the number of filters. For point clouds, there are often no input features at all – just coordinates. In this instance the convolution reduces to a sum of kernel values over the neighborhood. In the multi-input/output channel context this is given by

Z = G̃ Φ̃₀,   (9)

where Φ̃₀ is the learned parameter matrix and G̃ is a dense matrix of summed monomial values,

g̃_im = Σ_j n̂_ij^(m).   (10)

Table 1: Time complexity of different point cloud convolution operations and theoretical space complexity of intermediate terms (Mem). The matrix product for in-place convolutions can be evaluated in either order.

    Op            Cond.            Form                    Mult. Adds      Mem.
    In Place      Q = P, S' = S    Σ_m N^(m) F Θ^(m)       MP(E + SP)      SP
    Down-Sample   Q < P, S' < S    Σ_m (N^(m) F) Θ^(m)     MQ(E + S'P)     S'Q
    Up-Sample     Q > P, S' > S    Σ_m N^(m) (F Θ^(m))     MP(E + SQ)      SP
    Featureless   F = 1            G̃ Φ̃₀                    MSP             -

Neighborhoods  To be consistent with the mathematical definition of convolution, the neighborhood of each point should be fixed, which precludes the use of k-nearest neighbors (kNN), despite its prevalence in the literature [21, 20, 15, 39]. The obvious choice of a neighborhood is a ball. Equation 8 can be implemented trivially using either kNN or ball neighborhoods, though from a deep learning perspective each neighborhood has its own advantages and disadvantages.

Predictable computation time: The sparse-dense matrix products have computation proportional to the number of edges. For kNN this is proportional to the output cloud size, but is less predictable when using ball-searches.

Robustness to point density: Implementations based on each neighborhood react differently to variations in point density. As the density increases, kNN implementations shrink their receptive field. On the other hand, ball-search implementations suffer from increased computation time and output values proportional to the density.

Discontinuity in point coordinates: Both neighborhood types result in operations that are discontinuous in point coordinates. kNN convolutions are discontinuous as the kth and (k + 1)th neighbors of each point pass each other. Ball-search convolutions have a discontinuity at the ball-search radius.


Symmetry: Connectedness in ball-neighborhoods is symmetric – i.e. if x'_i ∈ N_j then x_j ∈ N_i for neighborhood functions with the same radius. This means the neighborhood matrix N_IJ between sets X_I and X_J is related to the reversed neighborhood by N_IJ = N_JIᵀ (up to a possible difference in sign due to the monomial value). This allows for shared computation between different layers.

Transposed Neighborhood Occupancy: For kNN, all neighborhood matrices are guaranteed to have k entries in each row. This guarantees there will be no empty rows, and hence no empty neighborhoods. Ball search neighborhoods do not share this property, and there is no guarantee points will have any neighbors. This is important for transposed convolutions, where points may rely on neighbors from a lower resolution cloud to seed their features.

Subsampling  Thus far we have remained silent as to how the S' output points making up X' are chosen. In-place convolutions can be performed with the same input and output clouds, but to construct networks we would like to reduce the number of points as we increase the number of channels, in a similar way to image CNNs. We adopt a similar approach to Pointnet++ [15] in that we sample a set of points from the input cloud.

Pointnet++ [15] selects points based on the first S' points in iterative farthest point (IFP) ordering, whereby a fixed number of points are iteratively selected based on being farthest from the currently selected points. For each point selected, the distance to all other points has to be computed, resulting in an O(S'S) implementation. To improve upon this, we begin by updating distances only to points within a ball neighborhood – a neighborhood that may have already been computed for the previous convolution. By storing the minimum distances in a priority queue, this sampling process still gives relatively uniform coverage like the original IFP, but can be computed in O(S'k̄), where k̄ is the average number of neighbors of each point.

We also propose to terminate searching once this set of neighborless candidates has been exhausted, rather than iterating for a fixed number of steps. We refer to this as rejection sampling. This results in point clouds of different sizes, but leads to a more consistent number of edges in subsequent neighborhood matrices. It also guarantees all points in the input cloud will have a neighbor in the output cloud. We provide pseudo-code for these algorithms and illustrations in the supplementary material.
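A simplified Python sketch of the priority-queue sampler with rejection termination described above; this is an illustrative assumption-laden helper, not the reference pseudo-code from the supplementary material.

    import heapq
    import numpy as np

    def ball_ifp_sample(coords, neighbors, radius, max_out=None):
        """Approximate farthest-point sampling with distances maintained only
        within precomputed ball neighborhoods.

        coords: (S, D) point coordinates.
        neighbors: neighbors[j] = indices of points within `radius` of point j.
        Stops once every unselected point has a selected point within its ball
        (rejection sampling), or after max_out selections.
        """
        S = len(coords)
        min_d = np.full(S, np.inf)                # distance to nearest selected point
        heap = [(-np.inf, j) for j in range(S)]   # max-heap via negated distances
        heapq.heapify(heap)
        selected = []
        while heap and (max_out is None or len(selected) < max_out):
            neg_d, j = heapq.heappop(heap)
            if -neg_d != min_d[j]:
                continue                          # stale heap entry
            if not np.isinf(min_d[j]):
                break                             # all remaining candidates are covered
            selected.append(j)
            min_d[j] = 0.0
            for k in neighbors[j]:                # update distances within the ball only
                d = np.linalg.norm(coords[k] - coords[j])
                if d < min_d[k]:
                    min_d[k] = d
                    heapq.heappush(heap, (-d, k))
        return selected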

Weighted Convolutions  To address both the discontinuity at the ball radius and the neighbor count variation inherent to using balls, we propose using a weighted average convolution by weighting neighboring values by some continuous function w which decreases to zero at the ball radius,

n̂_ij^(m) = (1 / W_i) w_ij n_ij^(m),   (11)

where w_ij = w(|∆x_ij|) and W_i = Σ_j w_ij. We use w(x) = 1 − x/r for our experiments, where r is the search radius.


Comparison to Existing Methods  We are not the first to propose hierarchical convolution-like operators for learning on point clouds. In this section we look at a number of other implementations and identify key differences.

Pointnet++ [15] and SpiderCNN [20] each use feature kernels which are non-linear with respect to the learned parameters. This means these methods have a large memory usage which increases as they create edge features from point features, before reducing those edge features back to point features.

Pointnet++ claims to use a ball neighborhood – and shows results are improved using this over kNN. However, their implementation is based on a truncated kNN search with fixed k, meaning padding edges are created in sparse regions and meaningful edges are cropped out in dense regions. The cropping is partially offset by the use of max pooling over the neighborhood and IFP ordering, since the first k neighbors found are relatively spread out over the neighborhood. As discussed however, IFP is O(SS') in time, and removing it means the results of the truncated ball search will no longer necessarily be evenly distributed. Also, the padding of sparse neighborhoods leads to an inefficient implementation, as edge features are computed despite never being used.

FlexConv [21] presents a very similar derivation to our own. However, they implement Equation 5 with a custom GPU kernel that only supports kNN.

On the whole, we are unable to find any existing learned approaches that perform true ball searches, nor make any attempt to deal with the discontinuity inherent to kNN. We accept models are capable of learning robustness to such discontinuities, but feel enforcing it at the design stage warrants consideration.

Data Pipeline There are two aspects of the data processing that are critical to the efficient implementation of our point cloud convolution networks.

Neighborhood Preprocessing The neighborhood matrices N^(m) are functions of relative point coordinates and the choice of unlearned basis functions – they do not depend on any learned parameters. This means they can be pre-computed, either online on CPUs as the previous batch utilizes available accelerators, or offline prior to training. In practice we only pre-compute the neighborhood indices and calculate the relative coordinates and basis functions on the accelerator. This additional computation on the accelerator(s) takes negligible time and reduces the amount of data that needs to be stored, loaded and shipped.
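A minimal sketch of this split is shown below, assuming a SciPy KD-tree for the CPU-side search and first-order monomial basis functions; names are illustrative and the accelerator-side step is written in NumPy purely for readability. Only integer neighbor indices are precomputed, while relative coordinates and basis values are recomputed from the raw coordinates later.

    import numpy as np
    from scipy.spatial import cKDTree

    def precompute_neighbors(in_points, out_points, radius):
        """CPU-side preprocessing: indices of in_points within `radius` of each output point."""
        tree = cKDTree(in_points)
        return tree.query_ball_point(out_points, radius)  # one index list per output point

    def edge_basis(in_points, out_points, neighbors):
        """Accelerator-side step (sketched in NumPy): relative coordinates and
        first-order monomial basis values for each output point's neighborhood."""
        features = []
        for i, idx in enumerate(neighbors):
            dx = out_points[i] - in_points[idx]  # [k_i, D] edge vectors
            # monomials up to 1st order: [1, dx_1, ..., dx_D]
            features.append(np.concatenate([np.ones((len(idx), 1)), dx], axis=1))
        return features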

Ragged Batching During the batching process, the clouds for each example – which generally contain different numbers of points – can be concatenated rather than padded to a uniform size, and the sparse neighborhood matrices block-diagonalized. For environments where fixed-sized inputs are required, additional padding can occur at the batch level rather than the individual example level, where variance in the average size will be smaller.

Unlike standard dataset preprocessing, our networks require network-specific preprocessing – preprocessing dependent on e.g. the size of the ball searches at each layer, the number of layers etc. To facilitate testing and rapid prototyping, we developed a meta-network module for creating separate pre- and post-batch preprocessing, while simultaneously building learned network stages based on a single conceptual network architecture. This is illustrated in Figure 2.
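The batching step itself amounts to block-diagonalizing the per-example sparse neighborhood matrices and concatenating the per-example features, as in the sketch below (illustrative only; it assumes each example's neighborhood matrix is stored as a SciPy sparse matrix, which is not necessarily the layout used in our implementation).

    import numpy as np
    import scipy.sparse as sp

    def batch_examples(neighborhood_mats, feature_arrays):
        """neighborhood_mats: list of [S'_b, S_b] sparse matrices, one per example
        feature_arrays:     list of [S_b, Q] dense arrays, one per example
        Returns a block-diagonal sparse matrix and the concatenated features,
        with no per-example padding required."""
        batched_n = sp.block_diag(neighborhood_mats, format="csr")
        batched_f = np.concatenate(feature_arrays, axis=0)
        return batched_n, batched_f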

[Figure 2 diagram: a single conceptual network specification is split into pre-batch (CPU) KDTree searches and sampling, post-batch (CPU) block-diagonalization, and learned featureless/down-sample/in-place convolutions on the accelerator.]

Fig. 2: (a) Conceptual network vs (b) separate computation graphs.

3.2 Event Stream Convolutions

Event streams from cameras can be thought of as 3D point clouds in (x, y, t) space. However, only the most fundamental of physicists would consider space and time equivalent dimensions, and in practice their use cases are significantly different. For event cameras in particular,

– spatial coordinates of events are discrete and occur on a fixed size grid corresponding to the pixels of the camera;
– the time coordinate is effectively continuous and unbounded; and
– events come in time-sorted order.

We aim to formulate a model with the following requirements:

– Intermediate results: we would like our model to provide a stream of predictions, updated as more information comes in, rather than having to wait for the end of a sequence before making an inference.
– Run indefinitely: we would like to deploy these models in systems which can run for long periods of time. As such, our memory and computational requirements must be O(1) and O(E) respectively, with respect to the number of events E.

Unfortunately, these requirements are difficult to enforce while making good use of modern deep learning frameworks and hardware accelerators. That said, just because we desire these properties in our end system does not mean we need them during training. By using convolutions with a domain of integration extending backwards in time only and using an exponentially decaying kernel, we can train using accelerators on sparse matrices and have an alternative deployment implementation which respects the above requirements. Formally, we propose neighborhoods defined over a small spatial neighborhood of size M_u – similar to image convolutions – extending backwards in time, with a kernel given by

g(u, Δt) = Σ_v exp(−λ_uv Δt) θ_uv,   (12)

where u corresponds to the pixel offset between events and v sums over some fixed temporal kernel size M_v, λ_uv is a learned temporal decay term enforced to be positive, θ_uv is a learned parameter and u extends over the spatial neighborhood. A temporal domain of integration extending backwards in time only ensures Δt ≥ 0, hence we ensure the effects of events on features decay over time.
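For concreteness, the sketch below evaluates Equation 12 for a single pixel offset u and a single non-negative time difference, given that offset's learned decay rates and weights. Names are hypothetical and the sketch is illustrative only.

    import numpy as np

    def event_kernel(delta_t, decay_rates, weights):
        """Equation 12 for one pixel offset u.

        delta_t:     non-negative time since the neighboring event
        decay_rates: [M_v] positive temporal decay terms lambda_uv
        weights:     [M_v] learned parameters theta_uv
        Returns the scalar kernel value g(u, delta_t)."""
        return np.sum(np.exp(-decay_rates * delta_t) * weights)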

Dual Implementations For training, the kernel function of Equation 12 can be used in Equation 5 and reduced to a form similar to Equation 8, where M = M_u M_v. This can be implemented in the same way as our point cloud convolutions. Unfortunately, this requires us to construct the entire sparse matrix, removing any chance of getting intermediate results when they are relevant, and also breaks our O(1) memory constraint with respect to the number of events. As such, we additionally propose a deployment implementation that updates features at pixels using exponential moving averages in response to events. As input events come in, we decay the current values of the corresponding pixel by the time since it was last updated and add the new input features. When the features for an output event are required, the features of the pixels in the receptive field can be decayed and then transformed by Θ^(uv), and reduced like a standard image convolution. Formally, we initialize z_x^(uv) = 0 ∈ R^Q and τ_x = 0 for all pixels x. For each input event (x, t) with features f, we perform the following updates:

z_x^(uv) ← f + exp(−λ_uv (t − τ_x)) z_x^(uv)   (13)
τ_x ← t.   (14)

Features f′ for an output event at (x′, t′) can thus be computed by

f′ = Σ_{u,v} exp(−λ_uv (t′ − τ_{x′−u})) z_{x′−u}^(uv)ᵀ Θ^(uv).   (15)

This requires O(M_u M_v Q) operations per input event, O(M_u M_v P Q) operations per output event and O(M_u M_v Q) space per pixel. Alternatively, the linear transform can be applied to f during the z_x^(uv) update (equivalent to up-sampling convolutions) for subtly different space and computational requirements. Either way, our requirements are satisfied.
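A minimal sketch of the deployment-time state and update rules (Equations 13–15) is given below, tracking a running feature z and last-update time τ for one pixel and one (u, v) kernel index. Names and data layout are illustrative assumptions rather than our released implementation.

    import numpy as np

    class StreamingPixel:
        """Exponential moving average state for one pixel and one (u, v) kernel index."""

        def __init__(self, num_channels, decay_rate):
            self.z = np.zeros(num_channels)  # z_x^(uv)
            self.tau = 0.0                   # tau_x, time of last update
            self.decay_rate = decay_rate     # lambda_uv

        def update(self, f, t):
            """Equations 13-14: decay the stored features, then add the new event's features."""
            self.z = f + np.exp(-self.decay_rate * (t - self.tau)) * self.z
            self.tau = t

        def read(self, t_out):
            """Decayed features at output time t_out, as used inside the sum of Equation 15."""
            return np.exp(-self.decay_rate * (t_out - self.tau)) * self.z

Features for an output event are then obtained by transforming each read(t′) by the corresponding Θ^(uv) and summing over the spatial and temporal kernel indices, exactly as in Equation 15.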


Subsampling As with our point cloud formulation, we would like a hierarchical model with convolutions joining multiple streams with successively lower resolution and higher dimensional features. We propose using an unlearned leaky-integrate-and-fire (LIF) model due to the simplicity of the implementation and its prevalence in SNN literature [40]. LIF models transform input spike trains by tracking a theoretical voltage at each location or “neuron”. These voltages exponentially decay over time, but are increased discontinuously by input events in some receptive field. If the voltage at a location exceeds a certain threshold, the voltage at that neuron is reset and an output event is fired. SNNs generally learn the sensitivity of each output neuron to input spikes. We take a simpler approach, using a fixed voltage increase of 1/n as a result of an input spike, where n is the number of output neurons affected by the input event. Note we do not suggest this is optimal for our use case – particularly the unlearned nature of it – though we leave additional investigation of this idea to future work.

4 Experiments

We perform experiments on various classification tasks across point clouds and event streams. We provide a brief overview of network structures here. Model diagrams and technical details about the training procedure are provided in the supplementary material.

We investigate our point cloud operator in the context of Modelnet40 [41], a 40-class classification problem with 9840 training examples and 2468 testing examples. We use the first 1024 points provided by Pointnet++ [15] and use the same point dropout, random jittering and off-axis rotation, uniform scaling and shuffling data augmentation policies.

We construct two networks based loosely on Resnet [42]. Our larger model consists of an in-place convolution with 32 output channels, followed by 3 alternating down-sampling and in-place residual blocks, with the number of filters increasing by a factor of 2 in each down-sampling block. Our in-place ball radii start at 0.1125 and increase by a factor of 2 each level. Our down-sample radii are √2 larger than the previous in-place convolution. This results in sampled point clouds with roughly 25% of the original points on average, roughly 10 neighbors per in-place output point and 20 neighbors per down-sample output point. After our final in-place convolution block we use a single point-wise convolution to increase the number of filters by a factor of 4 before max pooling across the point dimension. We feed this into a single hidden layer classifier with 256 hidden units. All convolutions use monomial basis functions up to 2nd order. We use dropout, batch normalization and L2 regularization throughout. Our smaller model is similar, but skips the initial in-place convolution and has half the number of filters at each level. Both are trained using a batch size of 128 using Adam optimizer [43] and with the learning rate reduced by a factor of 5 after 20 epochs without an improvement to training accuracy.


For event streams, we consider 5 classification tasks – N-MNIST and N-Caltech101 from Orchard et al. [24], MNIST-DVS and CIFAR10-DVS from Serrano et al. [23], and ASL-DVS from Bi et al. [12].

All our event models share the same general structure, with an initial 3×3 convolution with stride 2 followed by alternating in-place resnet/inception-inspired convolution blocks and down-sample convolutions (also 3×3 with stride 2), finishing with a final in-place block. We doubled the number of filters and the LIF decay time at each down sampling. The result is multiple streams, with each event in each stream having its own features. The features of any event in any stream could be used as inputs to a classifier, but in order to compare to other work we choose to pool our streams by averaging over (x, y, t) voxels at our three lowest resolutions. For example, our CIFAR-10 model had streams with learned features at 64×64 down to 4×4. We voxelized the 16×16 stream to 16×16×4, the 8×8 stream into an 8×8×2 grid and the final stream into a 4×4×1 grid. Each voxel grid beyond the first receives inputs from the preceding, higher-resolution voxel grid (via a strided 2×2×2 voxel convolution), and from the average of the event stream. In this way, examples with relatively few events that result in zero events at the final stream still resulted in predictions (empty voxels are assigned the value 0). Hyperparameters associated with stream propagation (decay rate, spike threshold and reset potential) were hand-selected via an extremely crude search that aimed to achieve a modest number of events in the final stream for most examples. These hyperparameters, along with further details on training and data augmentation, are provided in the supplementary material.

5 Results

5.1 Point Clouds

We begin by benchmarking our implementations of Equation 8. We implement the outer summation in two ways: a parallel implementation which unstacks the relevant tensors and computes matrix-vector products in parallel, and a map-reduce variant which accumulates intermediate values. Both are written entirely in the high-level Python interface to Tensorflow 2.0. We compare with the work of Groh et al. [21], who provide benchmarks for their own Tensorflow implementation, as well as a custom CUDA implementation that only supports kNN neighborhoods. Computation time and memory requirements are shown in Table 2. Values do not include neighborhood calculations. Despite our implementation being more flexible, our forward pass is 3–5.4× faster, while our full training pass is sped up by factors of up to 27.6. Our implementation does require significantly more memory – particularly in the backwards pass – though this is less severe using our Map-Reduce implementation. We also see significant improvements by using Tensorflow’s accelerated linear algebra (XLA) just-in-time compilation module, particularly in terms of memory usage.
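For reference, a sketch of the map-reduce variant in TensorFlow 2 is shown below. It assumes the convolution takes the form F′ = Σ_m N^(m) F Θ^(m) with sparse N^(m), consistent with the notation summarised in Table 6; names are illustrative and this is not the benchmarked code verbatim.

    import tensorflow as tf

    def conv_map_reduce(neighborhoods, features, theta):
        """neighborhoods: list of M tf.sparse.SparseTensor, each [S', S]
        features:       [S, Q] dense tensor of input features F
        theta:          [M, Q, P] learned kernel parameters Theta^(m)
        Returns [S', P] output features F'."""
        out = None
        for m, n in enumerate(neighborhoods):
            # accumulate N^(m) (F Theta^(m)) one basis function at a time
            term = tf.sparse.sparse_dense_matmul(n, tf.matmul(features, theta[m]))
            out = term if out is None else out + term
        return out

The parallel variant instead evaluates the M terms without sequential accumulation, trading additional memory for reduced wall-clock time, as reflected in Table 2.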

                  Time (ms)                      GPU Mem (MB)
                  Forward  Back-θ  Back-full     Forward  Back-θ  Back-full
 TF [21]          1829     -       2738          34G      -       63G
 Custom [21]      24.0     -       265.0         8.4      -       8.7
 Map-Reduce       7.4      12.2    21.7          37.5     77.7    237.6
 Map-Reduce JIT   6.8      11.8    14.5          37.6     58.1    69.6
 Parallel         4.5      8.1     16.5          66.2     142.8   728.1
 Parallel JIT     4.4      8.6     9.6           62.5     62.5    62.5

Table 2: FlexConv benchmarks on an Nvidia GTX-1080Ti. M = 4, P = Q = 64, S = S′ = 4096, 9 neighbors and batch size of 8. Forward, Back-θ and Back-full are forward pass, forward pass plus gradients w.r.t. learned parameters, and forward pass plus gradients w.r.t. all inputs respectively. Note only Back-θ is required during training. JIT rows correspond to using just-in-time compilation (excluding compile time).

Next we look at training times and capacity of our model on the Modelnet40 classification task using 1024 input points. Table 3 shows epoch training times for our models at various possible batch sizes, compared to various other methods. For fair comparison, we do not use XLA compilation.

                                  Epoch time (s)
 Model             Batch Size     Online    Offline
 SpiderCNN [20]    24             196       -
 Pointnet++ [15]   32             56        -
                   64             56        -
 PointCNN [19]     32             35        -
                   64             33        -
                   128            33        -
 Ours (large)      32             20.3      9.24
                   64             18.6      6.32
                   128            18.1      4.85
                   1024           17.5      3.70
                   4096           17.7      3.47
 Ours (small)      32             11.4      6.15
                   64             10.3      3.84
                   128            9.4       2.62
                   1024           8.2       1.55
                   4096           9.1       1.39
                   9840           9.3       1.52

Table 3: Time to train 1 epoch of Modelnet40 classification on an Nvidia GTX-1080Ti. Online/offline refers to preprocessing.

 Model             Reported/Best   Mean
 Ours (small)      88.77           87.94
 Pointnet [13]     89.20           88.65
 KCNet [18]        91.00           89.62
 DeepSets [14]     90.30           89.71
 Pointnet++ [15]   90.70           90.14
 Ours (large)      91.08           90.34
 DGCNN [17]        92.20           91.55
 PointCNN [19]     92.20           91.82
 SO-Net [16]       93.40           92.65

Table 4: Top-1 instance accuracy on Modelnet40, sorted by mean of 10 runs according to Koguciuk et al. [44], for our large model with batch size 128. Reported/Best are those values reported by other papers, and the best of 10 runs for our models.


Clearly our model runs significantly faster than those we compare to. Just as clear is the fact that our models which compute neighborhood information online are CPU-constrained. This preprocessing is not particularly slow – a modest 8-core desktop is capable of completing the 7 neighborhood searches and 3 rejection samplings associated with each example at over 500 Hz – but in the context of an accelerator-based training loop that runs at over 3000 Hz this is a major bottleneck. Our smaller network can preprocess examples at twice this speed (1000 Hz), though the corresponding increase in model training speed leaves the situation unchanged. One might expect such a speed-up to come at the cost of inference accuracy. Top-1 accuracy is given in Table 4. We observe a slight drop in performance compared to recent state-of-the-art methods, though our large model is still competitive with well established methods like Pointnet++. Our small model performs distinctly worse, though still respectably.

5.2 Event Camera Streams

Table 5 shows results for our method on the selected classification tasks. We see minor improvements over current state-of-the-art methods on the straightforward MNIST variants, though acknowledge the questionable value of such minor improvements on datasets like these. We see a modest improvement on CIFAR-10, though perform relatively poorly on N-Caltech101. Our ASL-DVS model significantly out-performs the current state-of-the-art, with a 44% reduction in error rate. We attribute the greater success on this last dataset compared to others to the significantly larger number of examples available during training (∼80,000 vs ∼10,000).

 Model         N-MNIST   MNIST-DVS   CIFAR-DVS   N-Caltech101   ASL-DVS
 HATS [26]     99.1      98.4        52.4        64.2           -
 RG-CNN [12]   99.0      98.6        54.0        65.7           90.1
 Ours          99.2      99.1        56.6        63.0           94.6

Table 5: Top-1 classification accuracy (%) for event stream classification tasks.

6 Conclusion

We have presented an elegant interpretation of convolutions for applications to sparse inputs on continuous domains. Combined with efficient sampling methods, our resulting convolutional networks achieve competitive accuracies with top point cloud classification models with a fraction of the time and memory requirements. Our method can also scale to operate on event streams with hundreds of thousands of events, achieving new state-of-the-art results on a number of classification benchmarks.


References

1. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q., eds.: Advances in Neural Information Processing Systems 25. Curran Associates, Inc. (2012) 1097–1105
2. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE (2009) 248–255
3. Erickson, B.J., Korfiatis, P., Akkus, Z., Kline, T.L.: Machine learning for medical imaging. Radiographics 37 (2017) 505–515
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
5. Elton, D.C., Boukouvalas, Z., Fuge, M.D., Chung, P.W.: Deep learning for molecular design – a review of the state of the art. Molecular Systems Design & Engineering (2019)
6. Pierson, H.A., Gashler, M.S.: Deep learning in robotics: a review of recent research. Advanced Robotics 31 (2017) 821–835
7. Grigorescu, S., Trasnea, B., Cocias, T., Macesanu, G.: A survey of deep learning techniques for autonomous driving. arXiv preprint arXiv:1910.07738 (2019)
8. Heng, L., Choi, B., Cui, Z., Geppert, M., Hu, S., Kuan, B., Liu, P., Nguyen, R., Yeo, Y.C., Geiger, A., et al.: Project autovision: Localization and 3d scene perception for an autonomous vehicle with a multi-camera system. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE (2019) 4695–4702
9. Li, B., Zhang, T., Xia, T.: Vehicle detection from 3d lidar using fully convolutional network. arXiv preprint arXiv:1608.07916 (2016)
10. Gao, H., Cheng, B., Wang, J., Li, K., Zhao, J., Li, D.: Object classification using cnn-based fusion of vision and lidar in autonomous vehicle environment. IEEE Transactions on Industrial Informatics 14 (2018) 4224–4231
11. García-Martín, E., Rodrigues, C.F., Riley, G., Grahn, H.: Estimation of energy consumption in machine learning. Journal of Parallel and Distributed Computing 134 (2019) 75–88
12. Bi, Y., Chadha, A., Abbas, A., Bourtsoulatze, E., Andreopoulos, Y.: Graph-based spatial-temporal feature learning for neuromorphic vision sensing. arXiv preprint arXiv:1910.03579 (2019)
13. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017) 652–660
14. Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R.R., Smola, A.J.: Deep sets. In: Advances in Neural Information Processing Systems. (2017) 3391–3401
15. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems. (2017) 5099–5108
16. Li, J., Chen, B.M., Hee Lee, G.: So-net: Self-organizing network for point cloud analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 9397–9406
17. Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG) 38 (2019) 146


18. Shen, Y., Feng, C., Yang, Y., Tian, D.: Mining point cloud local structures by kernel correlation and graph pooling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 4548–4557
19. Li, Y., Bu, R., Sun, M., Wu, W., Di, X., Chen, B.: Pointcnn: Convolution on x-transformed points. In: Advances in Neural Information Processing Systems. (2018) 820–830
20. Xu, Y., Fan, T., Xu, M., Zeng, L., Qiao, Y.: Spidercnn: Deep learning on point sets with parameterized convolutional filters. CoRR abs/1803.11527 (2018)
21. Groh, F., Wieschollek, P., Lensch, H.P.A.: Flex-convolution (deep learning beyond grid-worlds). CoRR abs/1803.07289 (2018)
22. Posch, C., Matolin, D., Wohlgenannt, R.: A qvga 143 db dynamic range frame-free pwm image sensor with lossless pixel-level video compression and time-domain cds. IEEE Journal of Solid-State Circuits 46 (2010) 259–275
23. Serrano-Gotarredona, T., Linares-Barranco, B.: A 128×128 1.5% contrast sensitivity 0.9% fpn 3 µs latency 4 mw asynchronous frame-free dynamic vision sensor using transimpedance preamplifiers. IEEE Journal of Solid-State Circuits 48 (2013) 827–838
24. Orchard, G., Jayawant, A., Cohen, G.K., Thakor, N.: Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in Neuroscience 9 (2015) 437
25. Maqueda, A.I., Loquercio, A., Gallego, G., García, N., Scaramuzza, D.: Event-based vision meets deep learning on steering prediction for self-driving cars. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 5419–5427
26. Sironi, A., Brambilla, M., Bourdis, N., Lagorce, X., Benosman, R.: Hats: Histograms of averaged time surfaces for robust event-based object classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 1731–1740
27. Nguyen, A., Do, T.T., Caldwell, D.G., Tsagarakis, N.G.: Real-time 6dof pose relocalization for event cameras with stacked spatial lstm networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. (2019)
28. Bohte, S.M., Kok, J.N., La Poutre, H.: Error-backpropagation in temporally encoded networks of spiking neurons. Neurocomputing 48 (2002) 17–37
29. Russell, A., Orchard, G., Dong, Y., Mihalas, S., Niebur, E., Tapson, J., Etienne-Cummings, R.: Optimization methods for spiking neurons and networks. IEEE Transactions on Neural Networks 21 (2010) 1950–1962
30. Cao, Y., Chen, Y., Khosla, D.: Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision 113 (2015) 54–66
31. Lagorce, X., Orchard, G., Galluppi, F., Shi, B.E., Benosman, R.B.: Hots: a hierarchy of event-based time-surfaces for pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2016) 1346–1359
32. Liu, B., Wang, M., Foroosh, H., Tappen, M., Pensky, M.: Sparse convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 806–814
33. Park, J., Li, S., Wen, W., Tang, P.T.P., Li, H., Chen, Y., Dubey, P.: Faster cnns with direct sparse convolutions and guided pruning. arXiv preprint arXiv:1608.01409 (2016)
34. Graham, B., van der Maaten, L.: Submanifold sparse convolutional networks. arXiv preprint arXiv:1706.01307 (2017)


35. Graham, B., Engelcke, M., Van Der Maaten, L.: 3d semantic segmentation with submanifold sparse convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 9224–9232
36. Choy, C., Gwak, J., Savarese, S.: 4d spatio-temporal convnets: Minkowski convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2019) 3075–3084
37. Jampani, V., Kiefel, M., Gehler, P.V.: Learning sparse high dimensional filters: Image filtering, dense crfs and bilateral neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 4452–4461
38. Su, H., Jampani, V., Sun, D., Maji, S., Kalogerakis, E., Yang, M.H., Kautz, J.: Splatnet: Sparse lattice networks for point cloud processing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 2530–2539
39. Boulch, A.: Convpoint: continuous convolutions for point cloud processing. Computers & Graphics (2020)
40. Koch, C., Segev, I., et al.: Methods in neuronal modeling: from ions to networks. MIT Press (1998)
41. Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J.: 3d shapenets: A deep representation for volumetric shapes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 1912–1920
42. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016) 770–778
43. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
44. Koguciuk, D., Chechliński, Ł., El-Gaaly, T.: 3d object recognition with ensemble learning – a study of point cloud-based deep learning models. In: International Symposium on Visual Computing, Springer (2019) 100–114


7 Supplementary Material

7.1 Summary of Notation

Dimensions
  D    Physical dimensionality of point cloud
  Q    Number of input channels
  P    Number of output channels
  S    Size of input cloud
  S′   Size of output cloud
  E    Number of edges
  M    Number of basis functions

Sets
  X ⊂ R^D    Input cloud coordinates
  X′ ⊂ R^D   Output cloud coordinates
  N_i ⊆ X    Set of inputs in neighborhood of x′_i

Tensors
  x_j ∈ X              jth input coordinate
  x′_i ∈ X′            ith output coordinate
  Δx_ij ∈ R^D          Edge vector: x′_i − x_j, x_j ∈ N_i
  f_j ∈ R              Input feature associated with x_j
  f′_i ∈ R             Output feature associated with x′_i
  f ∈ R^S              Single-channel input features for input cloud X
  f′ ∈ R^{S′}          Single-channel output features for output cloud X′
  F ∈ R^{S×Q}          Multi-channel input features
  F′ ∈ R^{S′×P}        Multi-channel output features
  Θ^(m) ∈ R^{Q×P}      Kernel parameters associated with mth basis fn
  N^(m) ∈ R^{S′×S}     Neighborhood matrix

Table 6: Summary of notation.


7.2 Additional Point Cloud Network Details

Pseudo-code for Iterative Farthest Point (IFP) variants and rejection sampling is given in Algorithms 1 through 5. Differences between rejection sampling and random sampling are illustrated in Figure 3. A diagram of our large point cloud network is given in Figure 4.

[Figure 3: output clouds produced by random sampling (left) and rejection sampling (right).]

Fig. 3: Output cloud (red dots) resulting from different sampling schemes applied to input clouds (blue) and the corresponding neighborhoods (light red circles). From the top left image, we can see random sampling can result in some regions being under-sampled. This is particularly problematic for networks with subsequent up-sampling, where some blue points have no red points in their own neighborhood. The number of sampled points is not fixed for rejection sampling, so significantly fewer points will be sampled from pointy surfaces (bottom). By construction, none of the dark red circles (top right, half base radius) overlap, so the total number of possible sample points is limited by ball packing theorems.


Algorithm 1: IFP
  Inputs: input point cloud X; S′ output size
  Result: X′: sampled points
  X′ ← []
  S ← size(X)
  d_min ← ∞ × ones(S)
  for i in range(S′) do
      j ← argmax(d_min)
      X′.append(x_j)
      d_min ← min(d_min, d(X, x_j))
  end

Algorithm 2: Approx. IFP
  Inputs: input point cloud X; S′ output size; N(·) neighborhood fn; Q priority queue
  Result: X′: sampled points
  X′ ← []
  for j in range(S′) do
      x′_i ← Q.pop()
      X′.append(x′_i)
      for x_n in N(x′_i) do
          Q.update(x_n, d(x_n, x′_i))
      end
  end

Algorithm 3: Approx. IFP (without rej.)
  Inputs: input point cloud X; S′ output size; N(·) neighborhood fn
  Result: X′: sampled points
  S ← size(X)
  Q ← PriorityQueue(∞ × ones(S), X)
  X′ ← Approx. IFP(X, S′, N, Q)

Algorithm 4: Rejection Sampling
  Inputs: input point cloud X; N(·) neighborhood fn
  Result: X′: sampled points; d_min: distance from each input point to closest output point
  X′ ← []
  S ← size(X)
  d_min ← ∞ × ones(S)
  visited ← False × ones(S)
  for x_i in X do
      if visited[i] then
          continue
      end
      X′.append(x_i)
      N_i ← N(x_i)
      for x_j in N_i do
          visited[j] ← True
          d_min[j] ← min(d_min[j], d(x_i, x_j))
      end
  end

Algorithm 5: Approx. IFP (with rej.)
  Inputs: input point cloud X; S′ output size; N(·) neighborhood fn
  Result: X′: sampled points
  X′_0, d_min ← Rejection Sampling(X, N)
  Q ← PriorityQueue(d_min, X)
  S′_1 ← S′ − size(X′_0)
  X′_1 ← Approx. IFP(X, S′_1, N, Q)
  X′ ← concatenate(X′_0, X′_1)


7.3 Additional Event Stream Network Details

The Leaky Integrate and Fire (LIF) algorithm we used is given in Algorithm 6.

Algorithm 6: Leaky Integrate and Fire (LIF)
  Inputs: X input grid shape
          t times for input events, sorted ascending
          x coordinates for input events, same order as t
          N(·) spatial neighborhood fn giving coordinates of receptive field
          t̃ decay time
          v_thresh spike threshold
          v_reset reset potential
  Result: t_out, x_out: time and coordinates of output stream
  x_out ← []
  t_out ← []
  V ← zeros(X)
  T ← zeros(X)
  S ← size(x)
  for i in range(S) do
      t_i ← t[i]
      x_i ← x[i]
      n ← size(N(x_i))
      for x_j in N(x_i) do
          v ← V[x_j] · exp(−(t_i − T[x_j]) / t̃) + 1/n
          if v > v_thresh then
              v ← v_reset
              t_out.append(t_i)
              x_out.append(x_j)
          end
          V[x_j] ← v
          T[x_j] ← t_i
      end
  end
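For concreteness, a dense NumPy version of Algorithm 6 is sketched below. It assumes a 3×3 spatial receptive field for N(·) and uses illustrative names; it is not the implementation used in our experiments.

    import numpy as np

    def leaky_integrate_and_fire(times, coords, grid_shape, decay_time,
                                 v_thresh=1.5, v_reset=-3.0):
        """Dense sketch of Algorithm 6 with a 3x3 spatial receptive field."""
        H, W = grid_shape
        V = np.zeros((H, W))   # voltages
        T = np.zeros((H, W))   # last update times
        t_out, x_out = [], []
        for t_i, (y, x) in zip(times, coords):
            # 3x3 receptive field clipped to the grid
            ys = range(max(y - 1, 0), min(y + 2, H))
            xs = range(max(x - 1, 0), min(x + 2, W))
            n = len(ys) * len(xs)
            for yj in ys:
                for xj in xs:
                    v = V[yj, xj] * np.exp(-(t_i - T[yj, xj]) / decay_time) + 1.0 / n
                    if v > v_thresh:
                        v = v_reset
                        t_out.append(t_i)
                        x_out.append((yj, xj))
                    V[yj, xj] = v
                    T[yj, xj] = t_i
        return t_out, x_out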

We down-sampled examples from the two highest-resolution datasets – N-Caltech101 and ASL-DVS – by a factor of 2 in each dimension. We performed basic data augmentation involving small rotations (−22.5° to 22.5°), time/polarity reversal for all datasets except ASL-DVS, and left-right flips for CIFAR-10-DVS and N-Caltech101. No data augmentation was applied to ASL-DVS. We computed neighborhood information for N-MNIST online, and offline with 8 augmented repeats for MNIST-DVS, CIFAR10-DVS and N-Caltech101. For the small number of examples with more than 300,000 events we took the first 300,000. Apart from this infrequent cropping, we use all events in all examples. All models were trained with Adam optimizer, initial learning rate 1e−3, β1 = 0.9, β2 = 0.999, ε = 1e−7. We trained our ASL-DVS model for 100 epochs with a fixed learning rate. For all others, we decay the learning rate by a factor of 5 after the training accuracy fails to increase for 10 epochs, and run until learning ceases as a result of several such decays. Dataset summary statistics and select model hyperparameters are given in Table 7. A diagram of the model used for CIFAR10-DVS is given in Figure 5.

 Dataset                      N-MNIST     MNIST-DVS   CIFAR10-DVS   N-Caltech101   ASL-DVS
 # Classes                    10          10          10            101            24
 Resolution                   34×34       128×128     128×128       174×234        180×240
 # Train examples             60,000      9,000       9,000         7,838          80,640
 Median # events              4,196       70,613      203,301       104,904        17,078
 Mean # events                4,171       73,704      204,979       115,382        28,120
 Max # events                 8,183       151,124     422,550       428,595        470,435
 Batch Size                   32          32          8             8              8
 Spike Threshold, v_thresh    1.5         1.5         1.6           1.2            1.0
 Reset Potential, v_reset     -3.0        -2.0        -3.0          -2.0           -3.0
 Initial Decay Time, t_0      2,000       10,000      4,000         1,000          1,000
 Initial Filters, f_0         32          8           8             16             16
 # Down Samples               3           5           5             5              5
 Data Augmentation
   Rotation up to ±22.5°      Yes         Yes         Yes           Yes            No
   Flip left-right            No          No          Yes           Yes            No
   Flip time/polarity         Yes         Yes         Yes           Yes            No
 Preprocessing repeats        ∞ (online)  8           8             8              1

Table 7: Event stream dataset summary statistics and model/data augmentation hyperparameters.


[Figure 4(a): in-place residual block; Figure 4(b): down-sample residual block. Block diagrams omitted; each combines dense and convolution layers with ReLU, batch normalization and dropout through a residual connection.]

[Figure 4(c): diagram of the large point cloud network – alternating ball-neighborhood searches, sampling and residual convolution blocks, followed by a dense layer, global pooling and a two-layer classifier.]

(c) Large Point Cloud Network, r_0 = 0.1125. Numbers in brackets represent output example dimensions. Dimensions with question marks (?) correspond to the approximate number of points when using no point dropout. The dashed line corresponds to the preprocessing/batching divide. BN is batch normalization, and Dropout uses a rate of 0.5.

Fig. 4


[Figure 5 diagram: event stream network for CIFAR10-DVS – an initial strided convolution on the polarity input, alternating inception-style in-place blocks and strided down-sampling convolutions, mean voxelization of the three lowest-resolution streams joined by strided 3D convolutions, then max pooling, a dense layer and a softmax classifier.]

Fig. 5: Network architecture for event stream inference for CIFAR10-DVS. Conv h×w×t/S, t̃ is a down-sampling convolution with spatial stride S, spatial kernel shape h×w and temporal kernel size t, i.e. M_u = hw, M_v = t. The output stream is the result of LIF subsampling with the same spatial kernel size and decay time t̃. Edges with Δt > 4t̃ are cropped, and convolutions use Δt scaled by t̃. Each down-sampling convolution, the pre-max-pooling dense layer and the final dense layer are all followed by ReLU, batch normalization and dropout with rate 0.5. Each mean voxelization is followed by batch normalization, and each Conv3D is followed by ReLU and batch normalization.

Chapter 7

Conclusion

“By far, the greatest danger of Artificial Intelligence is that people conclude too early that they understand it.” — Eliezer Yudkowsky

7.1 Contributions

This thesis has addressed three problems in computer vision: 3D human pose estimation, single-view object reconstruction and learnable feature extraction from irregular 3D data sources. In particular, we have focused our efforts on reducing the computational complexity of network architectures by taking advantage of relatively simple, well-established techniques used in computer graphics.

Our first two contributions looked at the former two problems independently. Taking inspiration from the distinctly different optimization procedures used in our adversarial optimization paper, our IGE-Net model tackled both problems in a similar framework. In the interest of extending this work to accommodate point cloud inferences, we developed a new point cloud convolution operator with desirable mathematical properties that is more consistent with the mathematical definition of convolution, and showed that networks using this operation could be trained using a fraction of the resources and time of similar hierarchical point cloud networks. In particular, we showed the following.

• Chapter 3: Distributions generated by GANs can be used to parameterize search spaces for inferring 3D solutions that are consistent with 2D observations.

• Chapter 4: Free form deformation networks can learn large, accurate deformations to transform fixed template point clouds to match query clouds, though the deformed meshes suffer from inherent topological constraints.

• Chapter 5: Similar optimization processes to the two involved in our Adversarial Parameterization approach can be coupled. Combined with a similar idea of consistency, the resulting energy-based models can achieve good results with a fraction of the computational/memory budget of standard deep networks, or scaled to high resolutions to achieve state-of-the-art results.


• Chapter 6: There is significant potential for optimization when it comes to point cloud networks and the constituent operations. These same operations can be applied to event stream networks, and with slight modifications can be implemented to produce asynchronous predictions as events arrive.

With respect to the specific research questions raised in Section 1.4, we believe both our Adversarial Parameterization and IGE-Net formulations are principled approaches to 3D generation. Both formulations applied to human pose estimation explicitly accommodated consistency and feasibility separately, while the inference results from our IGE-Net for object reconstruction demonstrate the network architecture had an implicit understanding of consistency by space-carving almost perfectly. We also believe these formulations did a good job of utilizing computer graphics techniques to minimize what needed to be learned, thus allowing for performant models applicable to moderately resourced robotics platforms. Our IGE-Net and sparse convolution implementations also demonstrated that careful use of preprocessing – both offline and that which takes advantage of heterogeneous computer architectures – can deliver significant savings in terms of training time.

7.2 Future Work

7.2.1 Sparse Point Cloud Convolutions

Our fourth contribution was primarily motivated by the need for a continuous and hierarchical point cloud operator for use in an IGE-Net energy function. In the publication we elected to focus on the formulation and implementation of such an operator so as not to introduce too many new ideas at once. Implementation into an IGE-Net is yet to be done, but we believe all of the building blocks are there. One possible stumbling block is that our implementation currently uses a CPU implementation for KDTree neighborhood queries. This is fine – even advantageous – when the point cloud is static, as it can be preprocessed either while the previous batch is training on accelerators or offline before any training begins. A point-cloud based IGE-Net would produce a dynamic point cloud for each step of optimization however. While a CPU implementation would still work here – and rough calculations suggest the computation time would not be prohibitive – this would require shipping of data between CPU and accelerator multiple times. Our thoughts are that the resultant model would be slow but viable – though exactly how slow and viable is difficult to say. A GPU implementation of KDTree querying would be preferable, and while publicly available GPU KDTree implementations exist, their integration into a deep learning framework may be non-trivial. This would have benefits beyond IGE-Nets however, as other methods exist that use dynamic point clouds as part of their learned models.

7.2.2 Sparse Event Stream Convolutions

The networks introduced in the event stream experiments in Chapter 6 were relatively simplistic. In particular, the subsampling mechanism – leaky integrate and fire – was based on static, near-uniform responses to stimuli. Given the convolutions give us a way of propagating features associated with events, we wonder whether this could somehow be used to make this firing process learnable.

7.2.3 IGE-Nets

In addition to a point-cloud based implementation as discussed above, we have two additional directions we would like to take with our IGE-Nets.

Firstly, our networks focused on accommodating consistency and feasibility. Another important consideration in mobile robotics is uncertainty. In many situations there is simply not enough information to make an accurate prediction, e.g. as the result of heavy occlusions. In these cases it is important that models can communicate what they know they don’t know. While our inferences and metrics made no effort to quantify this, energy networks implicitly encode this information in the energy landscape itself. For example, heavy occlusions may result in large flat regions, while equally likely but incompatible solutions – e.g. resolving left-right ambiguities in human pose estimation – may manifest as two separate local minima.

A simple way of quantifying this uncertainty with energy networks would be to consider our energy minimization as a distribution refinement module by optimizing samples from an initial distribution, rather than an initial point estimate. This initial distribution could come from either a conditional GAN or a variational encoder.

Another strength of IGE-Nets which wasn’t exploited in our publication is the ease with which additional loss terms can be added. For example, we could:

• add data for multiple views in order to tackle the multi-view variant of each problem;
• use multiple frames for human pose estimation with a per-frame consistency/feasibility loss and a single additional temporal feasibility loss; and/or

• add a GAN-based feasibility loss to our object reconstruction model to encourage realistic-looking inferences.

Critically, none of these additional data sources/loss terms would require changes to the underlying framework, and existing components could be reused.

7.2.4 Isosurface Extraction Models

Some tangential work to this thesis [164] looked at representing shapes as level sets of embedding functions parameterized as trilinear interpolations of inferred values at fixed voxel coordinates using 3D convolutions. 3D convolutions are good in that they re-use computation very efficiently. However, they inherently scale cubically with resolution, despite our application – surface extraction – scaling quadratically. DeepSDF [163] learned continuous functions which could be evaluated at an arbitrary number of points – however, this fails to take advantage of significant computation re-use as in the convolution case. Both of these approaches also seek to learn a signed distance function – an unnecessarily complex task if only the isosurface is desired.

We propose resolving this by considering a DeepSDF-like continuous embedding function that jointly maps a point cloud to a corresponding set of embedding values using a convolutional point cloud network. Furthermore, rather than trying to learn the embedding function directly, we could use a Newton-like root-finding algorithm to iterate points towards the isosurface of this embedding function, much like the inner optimization loop in energy networks. The resulting point cloud could then be compared to the ground truth point cloud using something like a Chamfer distance as used in our FFD contribution, modified to account for inexact isosurface extraction.

7.2.5 Universal Latent Shape Representation

Our work on object reconstruction demonstrates that different shape representations (voxel occupancy grids, point clouds, deformable meshes) each have their own advantages and disadvantages. We propose learning a universal latent shape representation along with a set of encoders and decoders – one for each representation – in a similar way to multi-lingual neural machine translation systems [196, 197]. This could be generalized to more abstract input/output representations – e.g. an image encoder coupled with a 3D decoder could perform single-view object reconstruction, or a natural language variational encoder could produce a distribution of objects from just its textual description.

Energy models could be used for decoder implementations for representations without a natural generator architecture. For example, our sparse point cloud convolutions presented in Chapter 6 rely on having input and output point clouds. Experiments used output point clouds that were sampled versions of the input cloud, but no generative model for producing a high resolution point cloud from a lower-resolution counterpart was proposed. We could implement an energy-based generator that, given a latent representation, searches the space of possible point clouds for one which has an encoding closest to the latent representation.

References

[1] W. Team, 2019, [Online; accessed 09-Jan-2020]. [Online]. Available: https://waymo.com/

[2] I. Salian, “Medical imaging startup uses ai to classify conditions from sinus and brain scans,” 2019, [Online; accessed 09-Jan-2020]. [Online]. Available: https://blogs.nvidia.com/blog/2019/04/11/medical-imaging-informai/

[3] G. Staff, “Robot dog begins work on san francisco airport project,” 2019, [Online; accessed 09-Jan-2020]. [Online]. Available: http://www.globalconstructionreview.com/news/robot-dog- begins-work-san-francisco-airport-projec/

[4] B. Stilwell, “These insane robot machine guns guard the korean dmz,” 2019, [Online; accessed 09-Jan-2020]. [Online]. Available: https://www.wearethemighty.com/gear-tech/robot-machine- guns-guard-dmz

[5] Z. Cao, M. G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh, “Openpose: Realtime multi-person 2d pose estimation using part affinity fields.” IEEE transactions on pattern analysis and machine intelligence, 2019.

[6] G. Moon, J. Y. Chang, and K. M. Lee, “Camera distance-aware top-down approach for 3d multi- person pose estimation from a single rgb image,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 10 133–10 142.

[7] T. Urban, “The ai revolution: The road to superintelligence,” 2015, [Online; accessed 08-Jan- 2020]. [Online]. Available: https://waitbutwhy.com/2015/01/artificial-intelligence-revolution- 1.html

[8] W. contributors, “List of self-driving car fatalities,” 2020, [Online; accessed 08-Jan-2020]. [Online]. Available: https://en.wikipedia.org/wiki/List of self-driving car fatalities

[9] H. Landahl, W. S. McCulloch, and W. Pitts, “A statistical consequence of the logical calculus of nervous nets,” Bulletin of Mathematical Biology, vol. 5, no. 4, pp. 135–137, 1943.

[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.

[11] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel programming with cuda,” Queue, vol. 6, no. 2, pp. 40–53, 2008.


[12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255.

[13] S. Dreyfus, “The numerical solution of variational problems,” Journal of Mathematical Analysis and Applications, vol. 5, no. 1, pp. 30–45, 1962.

[14] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807– 814.

[15] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. icml, vol. 30, no. 1, 2013, p. 3.

[16] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” arXiv preprint arXiv:1511.07289, 2015.

[17] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.

[18] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” in Advances in neural information processing systems, 2017, pp. 971–980.

[19] P. Ramachandran, B. Zoph, and Q. V. Le, “Searching for activation functions,” arXiv preprint arXiv:1710.05941, 2017.

[20] D. Misra, “Mish: A self regularized non-monotonic neural activation function,” arXiv preprint arXiv:1908.08681, 2019.

[21] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.

[22] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.

[23] Y. Wu and K. He, “Group normalization,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.

[24] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770– 778.

[25] ——, “Identity mappings in deep residual networks,” in European conference on computer vision. Springer, 2016, pp. 630–645.

[26] S. Xie, R. Girshick, P. Dollar,´ Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492–1500.

[27] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.

[28] B. T. Polyak, “Some methods of speeding up the convergence of iteration methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964.

[29] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International conference on machine learning, 2013, pp. 1139– 1147.

[30] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of machine learning research, vol. 12, no. Jul, pp. 2121–2159, 2011.

[31] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning, vol. 4, no. 2, pp. 26–31, 2012.

[32] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.

[33] D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization. corr abs/1412.6980 (2014),” 2014.

[34] M. Zaheer, S. Reddi, D. Sachan, S. Kale, and S. Kumar, “Adaptive methods for nonconvex optimization,” in Advances in neural information processing systems, 2018, pp. 9793–9803.

[35] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010, pp. 249–256.

[36] H. Zhang, Y. N. Dauphin, and T. Ma, “Fixup initialization: Residual learning without normalization,” arXiv preprint arXiv:1901.09321, 2019.

[37] Anonymous, “Batch normalization has multiple benefits: An empirical study on residual networks,” in Submitted to International Conference on Learning Representations, 2020, under review. [Online]. Available: https://openreview.net/forum?id=BJeVklHtPr

[38] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.

[39] A. Krogh and J. A. Hertz, “A simple weight decay can improve generalization,” in Advances in neural information processing systems, 1992, pp. 950–957.

[40] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.

[41] G. Zhang, C. Wang, B. Xu, and R. Grosse, “Three mechanisms of weight decay regularization,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=B1lz-3Rct7

[42] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.

[43] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” arXiv preprint arXiv:1302.4389, 2013.

[44] J. Guo and S. Gould, “Depth dropout: efficient training of residual convolutional neural networks,” in 2016 International Conference on Computing: Techniques and Applications (DICTA). IEEE, 2016, pp. 1–7.

[45] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.

[46] R. Al-Rfou, G. Alain, A. Almahairi, C. Angermueller, D. Bahdanau, N. Ballas, F. Bastien, J. Bayer, A. Belikov, A. Belopolsky, Y. Bengio, A. Bergeron, J. Bergstra, V. Bisson, J. Bleecher Snyder, N. Bouchard, N. Boulanger-Lewandowski, X. Bouthillier, A. de Brebisson,´ O. Breuleux, P.-L. Carrier, K. Cho, J. Chorowski, P. Christiano, T. Cooijmans, M.-A. Cotˆ e,´ M. Cotˆ e,´ A. Courville, Y. N. Dauphin, O. Delalleau, J. Demouth, G. Desjardins, S. Dieleman, L. Dinh, M. Ducoffe, V. Dumoulin, S. Ebrahimi Kahou, D. Erhan, Z. Fan, O. Firat, M. Germain, X. Glorot, I. Goodfellow, M. Graham, C. Gulcehre, P. Hamel, I. Harlouchet, J.-P. Heng, B. Hidasi, S. Honari, A. Jain, S. Jean, K. Jia, M. Korobov, V. Kulkarni, A. Lamb, P. Lamblin, E. Larsen, C. Laurent, S. Lee, S. Lefrancois, S. Lemieux, N. Leonard,´ Z. Lin, J. A. Livezey, C. Lorenz, J. Lowin, Q. Ma, P.-A. Manzagol, O. Mastropietro, R. T. McGibbon, R. Memisevic, B. van Merrienboer,¨ V. Michalski, M. Mirza, A. Orlandi, C. Pal, R. Pascanu, M. Pezeshki, C. Raffel, D. Renshaw, M. Rocklin, A. Romero, M. Roth, P. Sadowski, J. Salvatier, F. Savard, J. Schluter,¨ J. Schulman, G. Schwartz, I. V. Serban, D. Serdyuk, S. Shabanian, E. Simon, S. Spieckermann, S. R. Subramanyam, J. Sygnowski, J. Tanguay, G. van Tulder, J. Turian, S. Urban, P. Vincent, F. Visin, H. de Vries, D. Warde-Farley, D. J. Webb, M. Willson, K. Xu, L. Xue, L. Yao, S. Zhang, and Y. Zhang, “Theano: A Python framework for fast computation of mathematical expressions,” arXiv e-prints, vol. abs/1605.02688, May 2016. [Online]. Available: http://arxiv.org/abs/1605.02688

[47] R. Collobert, S. Bengio, and J. Mariéthoz, “Torch: a modular machine learning software library,” Idiap, Tech. Rep., 2002.

[48] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.

[49] J. Bai, F. Lu, K. Zhang et al., “Onnx: Open neural network exchange,” https://github.com/onnx/onnx, 2019.

[50] E. D. D. Team, “DL4J: Deep Learning for Java,” 2016. [Online]. Available: https://github.com/eclipse/deeplearning4j

[51] S. Tokui, R. Okuta, T. Akiba, Y. Niitani, T. Ogawa, S. Saito, S. Suzuki, K. Uenishi, B. Vogel, and H. Yamazaki Vincent, “Chainer: A deep learning framework for accelerating the research cycle,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2019, pp. 2002–2011.

[52] F. Seide and A. Agarwal, “Cntk: Microsoft’s open-source deep-learning toolkit,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 2135–2135.

[53] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.

[54] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.

[55] F. Chollet et al., “Keras,” https://keras.io, 2015.

[56] J. Howard et al., “fastai,” https://github.com/fastai/fastai, 2018.

[57] H. Kase, R. Negishi, M. Arifuku, N. Kiyoyanagi, and Y. Kobayashi, “Biosensor response from target molecules with inhomogeneous charge localization,” Journal of Applied Physics, vol. 124, no. 6, p. 064502, 2018.

[58] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing unit,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 1–12.

[59] N. M. Linke, D. Maslov, M. Roetteler, S. Debnath, C. Figgatt, K. A. Landsman, K. Wright, and C. Monroe, “Experimental comparison of two quantum computing architectures,” Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3305–3310, 2017.

[60] N. Mishra, M. Kapil, H. Rakesh, A. Anand, N. Mishra, A. Warke, S. Sarkar, S. Dutta, S. Gupta, A. Dash, R. Gharat, Y. Chatterjee, S. Roy, S. Raj, V. Jain, S. Bagaria, S. Chaudhary, V. Singh, R. Maji, and P. Panigrahi, “Quantum machine learning: A review and current status,” Sep. 2019.

[61] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-lm: Training multi-billion parameter language models using gpu model parallelism,” arXiv preprint arXiv:1909.08053, 2019.

[62] “Deep learning on aws,” 2020, [Online; accessed 22-Jan-2020]. [Online]. Available: https://aws.amazon.com/deep-learning/

[63] “Azure machine learning,” 2020, [Online; accessed 22-Jan-2020]. [Online]. Available: https://azure.microsoft.com/en-au/services/machine-learning/

[64] “Turn ideas into reality with google cloud ai,” 2020, [Online; accessed 22-Jan-2020]. [Online]. Available: https://cloud.google.com/solutions/ai/

[65] B. Saeta, “Cloud tpu now offers preemptible pricing and global availability,” 2018, [Online; accessed 22-Jan-2020]. [Online]. Available: https://cloudplatform.googleblog.com/2018/06/Cloud-TPU-now-offers-preemptible- pricing-and-global-availability.html

[66] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2672–2680. [Online]. Available: http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf

[67] Z. Pan, W. Yu, X. Yi, A. Khan, F. Yuan, and Y. Zheng, “Recent progress on generative adversarial networks (gans): A survey,” IEEE Access, vol. 7, pp. 36 322–36 333, 2019.

[68] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4681–4690.

[69] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy, “Esrgan: Enhanced super-resolution generative adversarial networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 0–0.

[70] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125–1134.

[71] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional gans,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8798–8807.

[72] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232.

[73] C. Li and M. Wand, “Precomputed real-time texture synthesis with markovian generative adversarial networks,” in European conference on computer vision. Springer, 2016, pp. 702– 716.

[74] N. Jetchev, U. Bergmann, and R. Vollgraf, “Texture synthesis with spatial generative adversarial networks,” arXiv preprint arXiv:1611.08207, 2016.

[75] U. Bergmann, N. Jetchev, and R. Vollgraf, “Learning texture manifolds with the periodic spatial gan,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 469–477.

[76] S. Bouaziz, B. Amberg, T. Weise, P. Snape, S. Brugger, A. Mansfield, R. Knothe, and T. Kiser, “Generating animated three-dimensional models from captured images,” Oct. 1, 2019, US Patent 10,430,642.

[77] L. Yu, W. Zhang, J. Wang, and Y. Yu, “Seqgan: Sequence generative adversarial nets with policy gradient,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[78] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs, “Unsupervised anomaly detection with generative adversarial networks to guide marker discovery,” in International conference on information processing in medical imaging. Springer, 2017, pp. 146–157.

[79] N. Killoran, L. J. Lee, A. Delong, D. Duvenaud, and B. J. Frey, “Generating and designing dna with deep generative models,” arXiv preprint arXiv:1712.06148, 2017.

[80] W. Hu and Y. Tan, “Generating adversarial malware examples for black-box attacks based on gan,” arXiv preprint arXiv:1702.05983, 2017.

[81] Z. Zhang, M. Li, and J. Yu, “On the convergence and mode collapse of gan,” in SIGGRAPH Asia 2018 Technical Briefs, 2018, pp. 1–4.

[82] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.

[83] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” in Advances in neural information processing systems, 2017, pp. 5767–5777.

[84] H. Petzka, A. Fischer, and D. Lukovnicov, “On the regularization of wasserstein gans,” arXiv preprint arXiv:1709.08894, 2017.

[85] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in Advances in neural information processing systems, 2016, pp. 2234–2242.

[86] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in Advances in neural information processing systems, 2017, pp. 6626–6637.

[87] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” arXiv preprint arXiv:1802.05957, 2018.

[88] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” arXiv preprint arXiv:1805.08318, 2018.

[89] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang, “A tutorial on energy-based learning,” Predicting structured data, vol. 1, no. 0, 2006.

[90] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr, “Conditional random fields as recurrent neural networks,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1529–1537.

[91] B. Amos and J. Z. Kolter, “Optnet: Differentiable optimization as a layer in neural networks,” arXiv preprint arXiv:1703.00443, 2017.

[92] J. Domke, “Generic methods for optimization-based modeling,” in Artificial Intelligence and Statistics, 2012, pp. 318–326.

[93] D. Belanger, B. Yang, and A. McCallum, “End-to-end learning for structured prediction energy networks,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ser. ICML’17. JMLR.org, 2017, pp. 429–439. [Online]. Available: http://dl.acm.org/citation.cfm?id=3305381.3305426

[94] K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological cybernetics, vol. 36, no. 4, pp. 193–202, 1980.

[95] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[96] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.

[97] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Thirty-first AAAI conference on artificial intelligence, 2017.

[98] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258.

[99] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.

[100] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.

[101] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8697–8710.

[102] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” arXiv preprint arXiv:1905.11946, 2019.

[103] ——, “Mixconv: Mixed depthwise convolutional kernels,” CoRR, vol. abs/1907.09595, 2019.

[104] L. Deng, “The mnist database of handwritten digit images for machine learning research [best of the web],” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.

[105] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.

[106] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.

[107] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International journal of computer vision, vol. 104, no. 2, pp. 154–171, 2013.

[108] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.

[109] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.

[110] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91– 99.

[111] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.

[112] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7263–7271.

[113] ——, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.

[114] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.

[115] “Drive: Digital retinal images for vessel extraction,” [Online; accessed 23-Jan-2020]. [Online]. Available: https://drive.grand-challenge.org/

[116] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223.

[117] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.

[118] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.

[119] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.

[120] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.

[121] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.

[122] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 801–818.

[123] T. Takikawa, D. Acuna, V. Jampani, and S. Fidler, “Gated-scnn: Gated shape cnns for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5229–5238.

[124] Y. Zhu, K. Sapra, F. A. Reda, K. J. Shih, S. Newsam, A. Tao, and B. Catanzaro, “Improving semantic segmentation via video propagation and label relaxation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8856–8865.

[125] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.

[126] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang, “Mask scoring r-cnn,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6409–6418.

[127] P. O. Pinheiro, R. Collobert, and P. Dollár, “Learning to segment object candidates,” in Advances in Neural Information Processing Systems, 2015, pp. 1990–1998.

[128] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár, “Learning to refine object segments,” in European Conference on Computer Vision. Springer, 2016, pp. 75–91.

[129] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollár, “A multipath network for object detection,” arXiv preprint arXiv:1604.02135, 2016.

[130] Y. Yang and D. Ramanan, “Articulated pose estimation with flexible mixtures-of-parts,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 1385–1392.

[131] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4724–4732.

[132] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in European Conference on Computer Vision. Springer, 2016, pp. 483–499.

[133] X. Chu, W. Ouyang, H. Li, and X. Wang, “Structured feature learning for pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4715–4723.

[134] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in CVPR, 2017.

[135] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang, “Multi-context attention for human pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1831–1840.

[136] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, “Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 7, pp. 1325–1339, 2014.

[137] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black, “Keep it smpl: Automatic estimation of 3d human pose and shape from a single image,” in European Conference on Computer Vision. Springer, 2016, pp. 561–578.

[138] C.-H. Chen and D. Ramanan, “3d human pose estimation = 2d pose estimation + matching,” in CVPR, vol. 2, no. 5, 2017, p. 6.

[139] J. Martinez, R. Hossain, J. Romero, and J. J. Little, “A simple yet effective baseline for 3d human pose estimation,” in International Conference on Computer Vision, vol. 1, no. 2, 2017, p. 5.

[140] H.-Y. F. Tung, A. W. Harley, W. Seto, and K. Fragkiadaki, “Adversarial inverse graphics networks: Learning 2d-to-3d lifting and image-to-image translation from unpaired supervision,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV’17), vol. 2, 2017.

[141] W. Yang, W. Ouyang, X. Wang, J. Ren, H. Li, and X. Wang, “3d human pose estimation in the wild by adversarial learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, 2018.

[142] A. O. Ulusoy, A. Geiger, and M. J. Black, “Towards probabilistic volumetric reconstruction using ray potential,” in 3DV, 2015.

[143] Z. Wu, S. Song, A. Khosla, X. Tang, and J. Xiao, “3D ShapeNets: A deep representation for volumetric shapes,” in CVPR, 2015.

[144] I. Cherabier, C. Häne, M. R. Oswald, and M. Pollefeys, “Multi-label semantic 3D reconstruction using voxel blocks,” in 3DV, 2016.

[145] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese, “3d-r2n2: A unified approach for single and multi-view 3d object reconstruction,” in European conference on computer vision. Springer, 2016, pp. 628–644.

[146] D. J. Rezende, S. M. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess, “Unsupervised learning of 3D structure from images,” in NIPS, 2016.

[147] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee, “Perspective transformer nets: Learning single- view 3D object reconstruction without 3D supervision,” in NIPS, 2016.

[148] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas, “Volumetric and multi-view CNNs for object classification on 3D data,” in CVPR, 2016.

[149] J. Wu, Y. Wang, T. Xue, X. Sun, W. T. Freeman, and J. B. Tenenbaum, “MarrNet: 3D Shape Reconstruction via 2.5D Sketches,” in NIPS, 2017.

[150] A. Kar, C. Häne, and J. Malik, “Learning a multi-view stereo machine,” in NIPS, 2017.

[151] R. Zhu, H. K. Galoogahi, C. Wang, and S. Lucey, “Rethinking reprojection: Closing the loop for pose-aware shape reconstruction from a single image,” in NIPS, 2017.

[152] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta, “Learning a predictable and generative vector representation for objects,” in ECCV, 2016.

[153] A. Sharma, O. Grau, and M. Fritz, “Vconv-dae: Deep volumetric shape learning without object labels,” in European Conference on Computer Vision. Springer, 2016, pp. 236–250.

[154] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum, “Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling,” in Advances in Neural Information Processing Systems, 2016, pp. 82–90.

[155] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman, “Single image 3D interpreter network,” in ECCV, 2016.

[156] J. Liu, F. Yu, and T. A. Funkhouser, “Interactive 3D modeling with a generative adversarial network,” in 3DV, 2017.

[157] J. Gwak, C. B. Choy, A. Garg, M. Chandraker, and S. Savarese, “Weakly supervised generative adversarial networks for 3D reconstruction,” in 3DV, 2017.

[158] S. Tulsiani, A. A. Efros, and J. Malik, “Multi-view consistency as supervisory signal for learning shape and pose prediction,” in Computer Vision and Pattern Regognition (CVPR), 2018.

[159] G. Riegler, A. O. Ulusoy, and A. Geiger, “OctNet: Learning deep 3D representations at high resolutions,” in CVPR, 2017.

[160] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong, “O-CNN: Octree-based convolutional neural networks for 3D shape analysis,” in SIGGRAPH, 2017.

[161] C. Häne, S. Tulsiani, and J. Malik, “Hierarchical surface prediction for 3D object reconstruction,” in 3DV, 2017.

[162] M. Tatarchenko, A. Dosovitskiy, and T. Brox, “Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs,” in Proc. of the IEEE International Conf. on Computer Vision (ICCV), vol. 2, 2017, p. 8.

[163] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove, “Deepsdf: Learning continuous signed distance functions for shape representation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 165–174.

[164] M. Michalkiewicz, J. K. Pontes, D. Jack, M. Baktashmotlagh, and A. Eriksson, “Implicit surface representations as layers in neural networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 4743–4752.

[165] A. Johnston, R. Garg, G. Carneiro, I. D. Reid, and A. van den Hengel, “Scaling cnns for high resolution volumetric reconstruction from a single image.” in ICCV Workshops, 2017, pp. 930– 939.

[166] S. R. Richter and S. Roth, “Matryoshka networks: Predicting 3d geometry via nested shape layers,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1936–1944.

[167] A. Kurenkov, J. Ji, A. Garg, V. Mehta, J. Gwak, C. B. Choy, and S. Savarese, “DeformNet: Free- form deformation network for 3d shape reconstruction from a single image,” vol. abs/1708.04672, 2017.

[168] Y. Yang, C. Feng, Y. Shen, and D. Tian, “Foldingnet: Point cloud auto-encoder via deep grid deformation,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), vol. 3, 2018.

[169] H. Fan, H. Su, and L. J. Guibas, “A point set generation network for 3D object reconstruction from a single image,” in CVPR, 2017.

[170] C.-H. Lin, C. Kong, and S. Lucey, “Learning efficient point cloud generation for dense 3D object reconstruction,” in AAAI, 2018.

[171] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” in CVPR, 2017.

[172] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola, “Deep sets,” in Advances in neural information processing systems, 2017, pp. 3391–3401.

[173] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical feature learning on point sets in a metric space,” in NIPS, 2017.

[174] J. Li, B. M. Chen, and G. Hee Lee, “So-net: Self-organizing network for point cloud analysis,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 9397–9406.

[175] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph cnn for learning on point clouds,” ACM Transactions on Graphics (TOG), vol. 38, no. 5, p. 146, 2019.

[176] Y. Shen, C. Feng, Y. Yang, and D. Tian, “Mining point cloud local structures by kernel correlation and graph pooling,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4548–4557.

[177] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “Pointcnn: Convolution on x-transformed points,” in Advances in Neural Information Processing Systems, 2018, pp. 820–830.

[178] Y. Xu, T. Fan, M. Xu, L. Zeng, and Y. Qiao, “Spidercnn: Deep learning on point sets with parameterized convolutional filters,” CoRR, vol. abs/1803.11527, 2018. [Online]. Available: http://arxiv.org/abs/1803.11527

[179] F. Groh, P. Wieschollek, and H. P. A. Lensch, “Flex-convolution (deep learning beyond grid- worlds),” CoRR, vol. abs/1803.07289, 2018. [Online]. Available: http://arxiv.org/abs/1803.07289

[180] D. Koguciuk, Ł. Chechliński, and T. El-Gaaly, “3d object recognition with ensemble learning - a study of point cloud-based deep learning models,” in International Symposium on Visual Computing. Springer, 2019, pp. 100–114.

[181] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “Shapenet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015.

[182] L. Yi, L. Shao, M. Savva, H. Huang, Y. Zhou, Q. Wang, B. Graham, M. Engelcke, R. Klokov, V. Lempitsky et al., “Large-scale 3d shape reconstruction and segmentation from shapenet core55,” arXiv preprint arXiv:1710.06104, 2017.

[183] K. Mo, S. Zhu, A. X. Chang, L. Yi, S. Tripathi, L. J. Guibas, and H. Su, “Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 909–918.

[184] M. A. Uy, Q.-H. Pham, B.-S. Hua, T. Nguyen, and S.-K. Yeung, “Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1588–1597.

[185] C. Posch, D. Matolin, and R. Wohlgenannt, “A qvga 143 dB dynamic range frame-free pwm image sensor with lossless pixel-level video compression and time-domain cds,” IEEE Journal of Solid-State Circuits, vol. 46, no. 1, pp. 259–275, 2010.

[186] T. Serrano-Gotarredona and B. Linares-Barranco, “A 128 × 128 1.5% contrast sensitivity 0.9% fpn 3 µs latency 4 mW asynchronous frame-free dynamic vision sensor using transimpedance preamplifiers,” IEEE Journal of Solid-State Circuits, vol. 48, no. 3, pp. 827–838, 2013.

[187] G. Orchard, A. Jayawant, G. K. Cohen, and N. Thakor, “Converting static image datasets to spiking neuromorphic datasets using saccades,” Frontiers in neuroscience, vol. 9, p. 437, 2015.

[188] A. I. Maqueda, A. Loquercio, G. Gallego, N. García, and D. Scaramuzza, “Event-based vision meets deep learning on steering prediction for self-driving cars,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5419–5427.

[189] A. Sironi, M. Brambilla, N. Bourdis, X. Lagorce, and R. Benosman, “Hats: Histograms of averaged time surfaces for robust event-based object classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1731–1740.

[190] Y. Bi, A. Chadha, A. Abbas, E. Bourtsoulatze, and Y. Andreopoulos, “Graph-based spatial- temporal feature learning for neuromorphic vision sensing,” arXiv preprint arXiv:1910.03579, 2019.

[191] A. Nguyen, T.-T. Do, D. G. Caldwell, and N. G. Tsagarakis, “Real-time 6dof pose relocalization for event cameras with stacked spatial lstm networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0.

[192] S. M. Bohte, J. N. Kok, and H. La Poutre, “Error-backpropagation in temporally encoded networks of spiking neurons,” Neurocomputing, vol. 48, no. 1-4, pp. 17–37, 2002.

[193] A. Russell, G. Orchard, Y. Dong, Ş. Mihalas, E. Niebur, J. Tapson, and R. Etienne-Cummings, “Optimization methods for spiking neurons and networks,” IEEE transactions on neural networks, vol. 21, no. 12, pp. 1950–1962, 2010.

[194] Y. Cao, Y. Chen, and D. Khosla, “Spiking deep convolutional neural networks for energy-efficient object recognition,” International Journal of Computer Vision, vol. 113, no. 1, pp. 54–66, 2015.

[195] X. Lagorce, G. Orchard, F. Galluppi, B. E. Shi, and R. B. Benosman, “Hots: a hierarchy of event- based time-surfaces for pattern recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 7, pp. 1346–1359, 2016.

[196] Y. Lu, P. Keung, F. Ladhak, V. Bhardwaj, S. Zhang, and J. Sun, “A neural interlingua for multilingual machine translation,” arXiv preprint arXiv:1804.08198, 2018.

[197] R. Vázquez, A. Raganato, J. Tiedemann, and M. Creutz, “Multilingual nmt with a language-independent attention bridge,” arXiv preprint arXiv:1811.00498, 2018.