RICE UNIVERSITY

Max-Affine Splines Insights Into Deep Learning

By

Randall Balestriero

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE

Doctor of Philosophy

APPROVED, THESIS COMMITTEE

Richard Baraniuk

Ankit Patel

Behnaam Aazhang

Stephane Mallat

Moshe Vardi

Albert Cohen

HOUSTON, TEXAS
April 2021

ABSTRACT

Max-Affine Splines Insights Into Deep Learning

by

Randall Balestriero

We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings. For instance, conditioned on the spline partition region containing the input signal, the output of a MASO DN can be written as a simple affine transformation of the input. Studying the geometry of those regions allows us to obtain novel insights into different regularization techniques, different layer configurations and different initialization schemes. Going further, this spline viewpoint allows us to obtain precise geometric insights in various domains, such as the characterization of the Deep Generative Network's generated manifold, the understanding of Deep Network pruning as a means of simplifying the DN input space partition, or the relationship between different nonlinearities, e.g., ReLU and the Sigmoid Gated Linear Unit, as simply corresponding to different MASO region membership inference algorithms. The spline partition of the input signal space that is implicitly induced by a MASO directly links DNs to the theory of vector quantization (VQ) and K-means clustering, which opens up new geometric avenues to study how DNs organize signals in a hierarchical fashion.

ACKNOWLEDGEMENTS

I would like to thank Prof. Herve Glotin for giving me the opportunity to enter the research world during my Bachelor's degree with topics of the greatest interest. Needless to say, without Herve's passion and love for his academic profession, I would not be doing research in this exciting field of machine and deep learning. Herve has done much more than just provide me with an opportunity. He has molded me into a curious dreamer, a quality that I hope to hold for as long as possible in order to one day follow in Herve's footsteps. I would also like to especially thank Prof. Sebastien Paris for considering me as an equal colleague during my Bachelor's research internships and thereafter. Sebastien's rigor, knowledge, and pragmatism have influenced me greatly in the most positive way. I also want to acknowledge the countless invaluable encounters I have had within the LSIS team, such as with Prof. Ricard Marxer, within the LJLL team, such as with Prof. Frederic Hecht and Prof. Albert Cohen, and within the DI ENS team, such as with Prof. Stephane Mallat and Prof. Vincent Lostanlen, all sharing two common traits: a limitless expertise of their field, and an unbounded desire to share their knowledge. I would also like to thank Prof. Richard Baraniuk for taking me into his group and for constantly inspiring me to produce work of the highest quality. Rich's influence has allowed me to considerably improve my ability not only to conduct research but also to communicate research. I would have been an incomplete PhD candidate without this primordial skill. I also want to thank Prof. Rudolf Riedi for taking me on a mathematical tour. Rolf's ability to seamlessly bridge the most abstract theoretical concepts and the most intuitive observations will never cease to amaze me and to fuel my desire to learn. I also thank Sina Alemohammad, CJ Barberan, Yehuda Dar, Ahmed Imtiaz, Hamid Javadi, Daniel Lejeune, Lorenzo Luzi, Tan Nguyen, Jasper Tan, and Zichao Wang, who are part of the Deep Learning group at Rice and with whom I have been collaborating, discussing and brainstorming. I also want to thank, beyond words, my family, from whom I have never stopped learning, and my partner Dr. Kerda Varaku for mollifying the world around me (while performing a multi-year-long reinforcement learning experiment on me, probably still in progress today). I also want to give a special word to Dr. Romain Cosentino, with whom I have been blindly pursuing ideas that led us to novel fields, and to Dr. Leonard Seydoux, with whom I have discovered geophysics in the most interesting and captivating way. This work was partially supported by NSF grants IIS-17-30574 and IIS-18-38177, AFOSR grant FA9550-18-1-0478, ARO grant W911NF-15-1-0316, ONR grants N00014-17-1-2551 and N00014-18-12571, DARPA grant G001534-7500, a DOD Vannevar Bush Faculty Fellowship (NSSEFF) grant N00014-18-1-2047, and a BP fellowship from the Ken Kennedy Institute.

Contents

Abstract
Acknowledgments
List of Illustrations
List of Tables
Notations

1 Introduction
  1.1 Motivation
  1.2 Deep Networks
    1.2.1 Layers
    1.2.2 Training
    1.2.3 Approximation Results
  1.3 Related Works
    1.3.1 Mathematical Formulations of Deep Networks
    1.3.2 Training of Deep Generative Networks
    1.3.3 Batch-Normalization Understandings
    1.3.4 Deep Network Pruning
  1.4 Contributions

2 Max-Affine Splines for Convex Function Approximation
  2.1 Spline Functions
  2.2 Max-Affine Splines
  2.3 (Max-)Affine Spline Fitting

3 Deep Networks: Composition of Max-Affine Spline Operators
  3.1 Max-Affine Spline Operators
  3.2 From Deep Network Layers to Max-Affine Spline Operators
  3.3 Composition of Max-Affine Spline Operators
  3.4 Deep Networks Input Space Partition: Power Diagram Subdivision
    3.4.1 Voronoi Diagrams and Power Diagrams
    3.4.2 Single Layer: Power Diagram
    3.4.3 Composition of Layers: Power Diagram Subdivision
  3.5 Discussions

4 Insights Into Deep Generative Networks
  4.1 Introduction
    4.1.1 Related Works
    4.1.2 Contributions
  4.2 Deep Generative Network Latent and Intrinsic Dimension
    4.2.1 Input-Output Space Partition and Per-Region Mapping
    4.2.2 Generated Manifold Angularity
    4.2.3 Generated Manifold Intrinsic Dimension
    4.2.4 Effect of Dropout/Dropconnect
  4.3 Per-Region Affine Mapping Interpretability and Manifold Tangent Space
    4.3.1 Per-Region Mapping as Local Coordinate System and Disentanglement
    4.3.2 Tangent Space Regularization
  4.4 Density on the Generated Manifold
    4.4.1 Analytical Output Density
    4.4.2 On the Difficulty of Generating Low Entropy/Multimodal Distributions
  4.5 Discussions

5 Expectation-Maximization for Deep Generative Networks
  5.1 Introduction
    5.1.1 Related Works
    5.1.2 Contributions
  5.2 Posterior and Marginal Distributions of Deep Generative Networks
    5.2.1 Conditional, Marginal and Posterior Distributions of Deep Generative Networks
    5.2.2 Obtaining the DGN Partition
    5.2.3 Gaussian Integration on the Deep Generative Network Latent Partition
  5.3 Expectation-Maximization Learning of Deep Generative Networks
    5.3.1 Expectation Step
    5.3.2 Maximization Step
    5.3.3 Empirical Validation and VAE Comparison
  5.4 Discussions

6 Insights Into Deep Network Pruning
  6.1 Introduction
    6.1.1 Related Works
    6.1.2 Contributions
  6.2 Winning Tickets and DN Initialization
    6.2.1 The Initialization Dilemma and the Importance of Overparametrization
    6.2.2 Better DN Initialization: An Alternative to Pruning
  6.3 Pruning Continuous Piecewise Affine DNs
    6.3.1 Interpreting Pruning from a Spline Perspective
    6.3.2 Spline Early-Bird Tickets Detection
    6.3.3 Spline Pruning Policy
  6.4 Experiment Results
    6.4.1 Proposed Layerwise Spline Pruning over SOTA Pruning Methods
    6.4.2 Proposed Global Spline Pruning over SOTA Pruning Methods
  6.5 Discussions

7 Insights into Batch-Normalization
  7.1 Introduction
    7.1.1 Related Works
    7.1.2 Contributions
  7.2 Batch Normalization: Unsupervised Layer-Wise Fitting
    7.2.1 Batch-Normalization Updates
    7.2.2 Layer Input Space Hyperplanes and Partition
    7.2.3 Translating the Hyperplanes
  7.3 Multiple Layer Analysis: Following the Data Manifold
    7.3.1 Deep Network Partition and Boundaries
    7.3.2 Interpreting Each Batch-Normalization Parameter
    7.3.3 Experiments: Batch-Normalization Focuses the Partition onto the Data
  7.4 Where is the Decision Boundary
    7.4.1 Batch-Normalization is a Smart Initialization
    7.4.2 Experiments: Batch-Normalization Initialization Jump-Starts Training
  7.5 The Role of the Batch-Normalization Learnable Parameters
  7.6 Batch-Normalization Noisiness
  7.7 Discussions

8 Insights Into (Smooth) Deep Networks Nonlinearities
  8.1 Introduction
  8.2 Max-Affine Splines meet Gaussian Mixture Models
    8.2.1 From MASO to GMM via K-Means
    8.2.2 Hard-VQ Inference
    8.2.3 Soft-VQ Inference
    8.2.4 Soft-VQ MASO Nonlinearities
  8.3 Hybrid Hard/Soft Inference via Entropy Regularization
  8.4 Discussions

A Insights into Generative Networks
  A.1 Architecture Details
  A.2 Proofs
    A.2.1 Proof of Thm 4.1
    A.2.2 Proof of Proposition 4.1
    A.2.3 Proof of Proposition 4.2
    A.2.4 Proof of Theorem 4.2
    A.2.5 Proof of Theorem 4.3

B Expectation Maximization Training of Deep Generative Networks
  B.1 Computing the Latent Space Partition
  B.2 Analytical Moments for Truncated Gaussian
  B.3 Implementation Details
  B.4 Algorithms
  B.5 Proofs
    B.5.1 Proof of Lemma 5.1
    B.5.2 Proof of Proposition 5.1
    B.5.3 Proof of Theorem 5.1
    B.5.4 Proof of Lemma 5.2
    B.5.5 Proof of Moments
  B.6 Proof of EM-step
    B.6.1 E-step Derivation
    B.6.2 Proof of M-step
  B.7 Regularization
  B.8 Computational Complexity
  B.9 Additional Experiments

C Deep Network Pruning
  C.1 Additional Results on Initialization and Pruning
    C.1.1 Winning Tickets and Overparameterization
    C.1.2 Additional Results for Layerwise Pretraining
  C.2 Additional Early-Bird Visualizations
    C.2.1 Early-Bird Visualization for VGG-16 and PreResNet-101
  C.3 Additional Experimental Details and Results
    C.3.1 Experiments Settings
    C.3.2 Additional Results of Our Global Spline Pruning
    C.3.3 Ablation Studies of Our Spline Pruning Method

D Batch-Normalization
  D.1 Proofs
    D.1.1 Proof of Theorem 7.1
    D.1.2 Proof of Corollary 7.1
    D.1.3 Proof of Theorem 7.2
    D.1.4 Proof of Proposition 7.1

Illustrations

3.1 Two equivalent representations of a power diagram (PD). Top: The grey circles have centers $[\mu]_{k,:}$ and radii $[\mathrm{rad}]_k$; each point $x$ is assigned to a specific region/cell according to the Laguerre distance from the centers, which is defined as the length of the segment tangent to and starting on the circle and reaching $x$. Bottom: A PD in $\mathbb{R}^D$ (here $D = 2$) is constructed by lifting the centroids $[\mu]_{k,:}$ up into an additional dimension in $\mathbb{R}^{D+1}$ by the distance $[\mathrm{rad}]_k$ and then finding the Voronoi diagram (VD) of the augmented centroids $([\mu]_{k,:}, [\mathrm{rad}]_k)$ in $\mathbb{R}^{D+1}$. The intersection of this higher-dimensional VD with the originating space $\mathbb{R}^D$ yields the PD.

3.2 Visual depiction of the subdivision process that occurs when a deeper layer $\ell$ refines/subdivides the already-built up-to-layer-$(\ell-1)$ partition $\Omega^{(1,\dots,\ell-1)}$. We depict here a toy model (2-layer DN) with 3 units at the first layer (leading to 4 regions) and 8 units at the second layer, with random weights and biases. The colors show the DN input space partitioning with respect to the first layer. Then, for each color (or region), the composition of layer 1 and layer 2 defines a specific PD that subdivides this region (first row), where the region is colored and the PD is depicted for the whole input space. This subdivision is then applied onto the corresponding first-layer region only, as it only subdivides its own region (second row, right). Finally, grouping this process over each of the 4 regions, we obtain the layer 1-layer 2 input space partitioning (second row, left).

4.1 Visual depiction of Thm. 4.1 with a (random) generator $G: \mathbb{R}^2 \mapsto \mathbb{R}^3$. Left: generator input space partition $\Omega$ made of polytopal regions. Right: generator image $\mathrm{Im}(G)$, which is a continuous piecewise affine surface composed of the polytopes obtained by affinely transforming the polytopes from the input space partition (left); the colors are per-region and correspond between the left and right plots. This input-space-partition / generator-image / per-region-affine-mapping relation holds for any architecture employing piecewise affine activation functions. Understanding each of the three brings insights into the others, as we demonstrate in this chapter.

4.2 The columns represent different widths $D_\ell \in \{6, 8, 16, 32\}$ and the rows correspond to repetitions of the learning for different random initializations of the DGNs with consecutive seeds.

4.3 Histograms of the DGN adjacent region angles for DGNs with two hidden layers, $S = 16$ and $D = 17$, $D = 32$ respectively, and varying width $D_\ell$ on the y-axis. Three trends to observe: increasing the width increases the bimodality of the distribution while favoring near-0 angles; increasing the output space dimension increases the number of angles near orthogonal; the $A_\omega$ and $A_{\omega'}$ of adjacent regions $\omega$ and $\omega'$ are highly similar, making most angles smaller than if they were independent (depicted in blue).

4.4 DGN with dropout trained (GAN) on a circle dataset (blue dots); dropout turns a DGN into an ensemble of DGNs (each dropout realization is drawn in a different color).

4.5 Impact of dropout and dropconnect on the intrinsic dimension of the noise-induced generators for two "drop" probabilities 0.1 and 0.3 and for a generator $G$ with $S = 6$, $D = 10$, $L = 3$, with varying width $D_1 = D_2$ ranging from 6 to 48 (x-axis). The boxplot represents the distribution of the per-region intrinsic dimensions over 2000 sampled regions and 2000 different noise realizations. Recall that the intrinsic dimension is upper bounded by $S = 6$ in this case. Two key observations: first, dropconnect tends to produce DGNs with intrinsic dimension preserving the latent dimension ($S = 6$) even for narrow models ($D_1, D_2 \approx S$), as opposed to dropout, which tends to produce DGNs with much smaller intrinsic dimension than $S$. As a result, if the DGN is much wider than $S$, both techniques can be used, while in narrow models either none or dropconnect should be preferred.

4.6 Deep Autoencoder experiment when equipping the DGN (decoder) with dropout, where we employ an MLP with $S = D_1 = D_2 = 32$ and $D_3 = D_4 = 1024$, $D_5 = D$; the test set reconstruction error is displayed for multiple datasets and training settings. The architecture purposefully maintains a narrow width for the first two layers to highlight that in those cases dropout is detrimental regardless of the dropout rate. We compare applying dropout to all layers (black line) versus applying dropout only on the last two (wide) layers (blue line). We see that unless the dropout rate is adapted to the layer width and desired intrinsic dimension, the test set performance is negatively impacted by dropout. The exact rate reaching the best test set performance when employing dropout only for the wide layers is shown with a green arrow. The exact values for each graph are given in Table 4.1.

4.7 Probability (0: blue, 1: red) that dropout maintains the intrinsic dimension (red line, left: 32, right: 64) as a function of the dropout rate (x-axis) and the layer's width (y-axis), with the 95% and 99% lines in solid black and dashed black respectively. We see that when the layer's width is close to the desired intrinsic dimension, no dropout should be applied, and that for a dropout rate of 0.5 the layer must be at least two times wider than the desired intrinsic dimension.

4.8 Visualization of single basis vectors $[A_\omega]_{\cdot,k}$ before and after learning, obtained from a region $\omega$ containing the digits 7, 5, 9, and 0 respectively per column, for GAN and VAE models made of fully connected or convolutional layers. We observe how those basis vectors encode right rotation, cedilla extension, left rotation, and upward translation respectively; studying the columns of $A_\omega$ provides interpretability into the learned DGN affine parameters and the underlying data manifold.

4.9 Test set reconstruction error (y-axis) during training for each epoch (x-axis) for a baseline unconstrained Deep AutoEncoder (black line) and for the tangent space regularized DGN (decoder) from (4.6) with varying regularization coefficient $\lambda$ (colored lines), for three datasets (per column) and with $S = 128$, $T = 16$ (top) and $S = 32$, $T = 16$ (bottom). We observe that by constraining the tangent space basis $A_\omega$ to span the data tangent space for each region $\omega$ containing training samples, the manifold fitting is improved, leading to better test sample reconstruction.

4.10 Distribution of the per-region log-determinants (bottom row) for DGNs trained on a data distribution with varying per-mode variance (blue points, first row). The estimated data distribution is depicted through the red samples. We clearly observe the tight relationship between the multimodality and Shannon entropy of the data distribution to be approximated and the distribution of the per-region determinant of $A_\omega$. That is, as the DGN tries to approximate a data distribution with high multimodality and low Shannon entropy, the per-region slope matrices $A_\omega$ have increasing singular values, in turn synonymous with exploding per-layer weights and thus training instabilities (recall Thm. 4.1).

4.11 Distribution of $\log\!\big(\sqrt{\det(A_\omega^T A_\omega)}\big)$ for 2000 regions $\omega$ of a DGN with $L = 3$, $S = 6$, $D = 10$ and weights initialized with Xavier; then half of the weights' coefficients (picked randomly) are rescaled by $\sigma_1$ and the other half by $\sigma_2$. We observe that greater variance of the weights increases both the spread and the mean of the log-determinant distribution.

5.1 Recursive partition discovery for a DGN with $S = 2$ and $L = 2$, starting with an initial region obtained from a sampled latent vector $z$ (init). By walking on the faces of this region, neighboring regions sharing a common face are discovered (Step 1). Recursively repeating this process until no new region is discovered (Steps 2-4) provides the DGN latent space partition at left.

5.2 Triangulation $\mathcal{T}(\omega)$ as per (5.7) of a polytopal region $\omega$ (left plot), obtained from the Delaunay triangulation of the region vertices and leading to 3 simplices (three right plots).

5.3 Left: Noiseless generated samples $g(z)$ in red and noisy samples $g(z) + \epsilon$ in blue, with $\Sigma_x = 0.1 I$, $\Sigma_z = I$. Middle: marginal distribution $p(x)$ from (5.3). Right: the posterior distribution $p(z|x)$ from (5.4) (blue), its expectation (green) and the position of the region limits (black), with the sample point $x$ depicted in black in the left figure.

5.4 DGN training under EM (black) and VAE training with various learning rates (blue: 0.005, red: 0.001, green: 0.0001). In all cases, the VAE converges to the maximum of its ELBO. The gap between the VAE and EM curves is due to the inability of the VAE's AVI to correctly estimate the true posterior, pushing the VAE's ELBO far from the true log-likelihood (recall (5.1)) and thus preventing it from precisely approximating the true data distribution.

5.5 KL-divergence between a VAE variational distribution and the true DGN posterior when trained on a noisy 2D circle dataset for 3 different learning rates. During learning, the DGN adapts such that $g(z) + \epsilon$ models the data distribution based on the VAE's estimated ELBO. As learning progresses, the true DGN posterior becomes harder to approximate by the VAE's variational distribution in the AVI process. As such, even on this toy dataset, the commonly employed Gaussian variational distribution is not rich enough to capture the multimodality of $p(z|x)$ from (5.4).

5.6 EM training of a DGN with latent dimension 1. We show only the generated continuous piecewise affine manifold $g(z)$, without the additional white noise $\epsilon$. We see how EM training of the DGN is able to fit the dataset, while the VAE (with different learning rates (LR)) suffers from hyperparameter sensitivity and slow convergence. Training details and additional figures for this experiment are provided in Appendix B.9.

5.7 Reprise of Fig. 5.6 for MNIST data restricted to the digit 4, employing a 3-layer DGN with latent dimension 1. Training details and additional figures for this experiment are provided in Appendix B.9.

6.1 K-means experiments on a toy mixture of 64 Gaussians in 2D, where in all cases the number of final clusters is 64 but the number of starting clusters (x-axis) varies, and pruning is applied during training to remove redundant centroids, comparing random centroid initialization and kmeans++. With overparametrization, random initialization plus pruning reaches the same accuracy as kmeans++.

6.2 (a) Difference between node and weight pruning, where the former removes entire subdivision lines while the latter simply quantizes those partition lines to be colinear to the space axes. (b) Toy classification task pruning, where the blue lines represent subdivisions in the first layer and the red lines denote the last layer's decision boundary. We see that: 1) pruning indeed removes redundant subdivision lines so that the decision boundary remains an X-shape until 80% of nodes are pruned; and 2) ideally, one blue subdivision line would be sufficient to provide two turning points for the decision boundary, e.g., the visualization at 80% sparsity, but the classification accuracy degrades a lot if pruned further. That aligns with the initialization dilemma for small DNs, i.e., blue lines are not well initialized and all lines remain hard to train. (c) MNIST reproduction of (b), where to produce these visuals we choose two images from different classes to obtain a 2-dimensional slice of the 784-dimensional input space (grid depicted on the left). We thus obtain a low-dimensional depiction of the subdivision lines, shown in blue for the first layer, green for the second convolutional layer, and red for the decision boundary of 6 vs. 9 (based on the left grid). The observation consistently shows that only parts of the subdivision lines are useful for the decision boundary, and the goal of pruning is to remove the redundant subdivision lines.

6.3 Spline trajectories during training, visualizing the Early-Bird (EB) phenomenon, which can be leveraged to largely reduce the training costs of costly overparametrized DNs. The trajectories mainly adapt during the early phase of training.

6.4 We depict on the left a small ($L = 2$, $D_1 = 5$, $D_2 = 8$) DN input space partition, with layer 1 trajectories in black and layer 2 in blue. In the middle is the measure from Eq. (6.1) finding similar "partition trajectories" from layer 2 seen in the DN input space (comparing the green trajectory to the others, with coloring based on the induced similarity from dark to light). Based on this measure, pruning can be performed to remove the grouped partition trajectories and obtain the pruned partition on the right.

7.1 Depiction, for a 5-layer DN with 6 units per layer, of the impact of BN (with statistics computed from all samples) on the position and shape of the up-to-layer-$\ell$ input space partition $\Omega_{1|\ell}$; in blue are the newly introduced boundaries from the current layer, in grey the existing boundaries. The absence of BN (top row) leaves the partition random and unaware of the data samples, while BN (bottom row) positions and focuses the partition onto the data samples (while all other parameters are left identical); as per Thm. 7.1, BN minimizes the distances between the boundaries and the data samples.

7.2 Depiction of the layer (left) and DN (right) input space partition with $L = 2$, $D^{(1)} = 2$, $D^{(2)} = 2$. The partition boundaries of a layer in its input space correspond to the hyperplanes $\mathcal{H}^{(\ell,k)}$ (7.5); for deeper layers, viewing $\mathcal{H}^{(\ell,k)}$ in the DN input space leads to the paths $\mathcal{P}^{(\ell,k)}$ (7.13).

7.3 Depiction of $\mathcal{P}^{(\ell,k)}$, $\ell = 1, 4$, where for each $\ell$, $\mathcal{P}^{(\ell,k)}$ is colored based on $[\sigma^{(\ell)}]_k / \|[W^{(\ell)}]_{k,\cdot}\|_2$ (blue: smallest, green: highest). As per Thms. 7.1 and 7.2, the bluer colored paths are the ones closer to the dataset (black dots), allowing interpretability of the $\sigma^{(\ell)}$ parameter as the fitness between $\mathcal{P}^{(\ell,k)}$ and the mini-batch samples.

7.4 This figure reproduces the experiment from Fig. 7.1 with a more complex (2-D) input dataset (left) and a much wider DN with $D^{(\ell)} = 1024$ and $L = 11$. We depict, for some layers, the boundaries of the layer partitions seen in the DN input space ($\partial\Omega_0^{(\ell)}$, recall 7.13) for DNs with different initializations: random slopes and biases (random), random slopes and zero biases (zero), or the scaling of the slopes and the biases initialized from the BN statistics $\mu_{\mathrm{BN}}^{(\ell)}$ and $\sigma_{\mathrm{BN}}^{(\ell)}$ from 7.3 (BN). The overlap of multiple partition boundaries induces a darker color, demonstrating the presence of more partition boundaries at each spatial location. Clearly, BN concentrates the partition boundaries onto the data samples.

7.5 Average number of regions of the DN partition $\Omega$ in an $\epsilon$-ball around 100 CIFAR images (left) and 100 random images (right) for a CNN, demonstrating that BN adapts the partition to the data samples. The weight initialization (random, zero, BN) follows Fig. 7.4. Additional datasets and architectures are given in the Appendix, showing the same result.

7.6 Image classification with different architectures on SVHN and CIFAR10/100. In all cases no BN is used during training; the initialization of the weights is either random (black) or random with fixed BN parameters $\mu_{\mathrm{BN}}^{(\ell)}, \sigma_{\mathrm{BN}}^{(\ell)}, \forall \ell$ (blue). That is, the BN parameters are found as per the BN strategy in a pretraining phase, and then those parameters are frozen (all other parameters remain at their random initialization). Then training starts and the random parameters are tuned based on the loss at hand. We can see that BN initialization (again, no BN is used during training) is beneficial to reach better accuracy, effectively showing that BN initialization alone plays a crucial role for DNs. In most cases, the DN that does not leverage the BN initialization diverges altogether.

7.7 Decision boundary realisations obtained for different batches on a 2-dimensional binary classification task. Each mini-batch (of size $B$) produces a different DN decision boundary based on the realisations of the random variables $\mu_{\mathrm{BN}}^{(\ell)}, \sigma_{\mathrm{BN}}^{(\ell)}$ (recall 7.17, 7.18). The variance of those r.v. depends on $B$, as seen in the figure. We depict those realisations at initialization (left) and after learning (right) for $B = 16, 256$, the latter producing smaller variance in the decision boundaries.

8.1 For the MASO parameters $A^{(\ell)}, B^{(\ell)}$ for which HVQ yields the ReLU, absolute value, and an arbitrary convex activation function, we explore how changing $\beta$ in the $\beta$-VQ alters the induced activation function. Solid black: HVQ ($\beta = 1$); dashed black: SVQ ($\beta = \frac{1}{2}$); red: $\beta$-VQ ($\beta \in [0.1, 0.9]$). Interestingly, note how some of the functions are nonconvex.

B.1 Sample of noisy data from the wave dataset.

B.2 Depiction of the evolution of the NLL during training for the EM and VAE algorithms. We can see that, despite the high number of training steps, the VAEs are not yet able to correctly approximate the data distribution, as opposed to EM training which benefits from much faster convergence. We also see how the VAEs tend to have a large KL divergence between the true posterior and the variational estimate; due to this gap, we depict below samples from those models.

B.3 Samples from the various models trained on the wave dataset. We can see on top the result of EM training, where each column represents a different run; the remaining three rows correspond to VAE training. Again, EM demonstrates much faster convergence; for the VAE to reach the actual data distribution, many more updates are needed.

B.4 Evolution of the true data negative log-likelihood (in semilog-y scale) on MNIST (class 4) for EM and VAE training of a small DGN as described above. The experiments are repeated multiple times; we can see how the learning rate clearly impacts the learning significantly despite the use of Adam, and that even with the large learning rate, EM learning is able to reach a lower NLL. In fact, the quality of the generated samples of the EM models is much higher, as shown below.

B.5 Random samples from DGNs trained with EM or VAEs on an MNIST experiment (with digit 4). We see the ability of EM training to produce realistic and diversified samples despite using a latent space dimension of 1 and a small generative network.

C.1 Depiction of the dataset used for the K-means experiment with 64 centroids.

C.2 Left: Depiction of a simple (toy) univariate regression task with the target function being a sawtooth with two peaks. Right: The $\ell_2$ training error (y-axis) as a function of the width of the DN layer (2 layers in total). In theory, only 4 units are required to perfectly solve the task at hand with a ReLU layer; however, we see that optimization in narrow DNs is difficult and gradient-based learning fails to find the correct layer parameters. As the width is increased, the difficulty of the optimization problem reduces and SGD manages to find a good set of parameters solving the regression task.

C.3 Accuracy vs. efficiency trade-offs of lottery initialization and layerwise pretraining.

C.4 Illustrating the spline Early-Bird tickets in VGG-16 and PreResNet-101.

C.5 Ablation studies of the hyperparameter $\rho$ in our spline pruning method on two commonly used models, VGG-16 and PreResNet-101.

Tables

4.1 Test set reconstruction error for varying dropout rates as displayed in Fig. 4.6, for different datasets, and when applying dropout on all layers or only on wide enough layers. We see that it is crucial to adapt the dropout rate to the layer width, as otherwise the test error only increases when employing dropout.

4.2 Test set reconstruction error averaged over 3 runs when employing the tangent space regularization (4.6) on various datasets with a Deep Autoencoder, when varying the weight of the regularization term ($\lambda$), the latent space dimension ($S$), and the number of neighbors used to estimate the data tangent space ($T$). We see that the proposed regularization effectively improves generalization performance in all cases, even for complicated and high-dimensional datasets such as CIFAR10 where the data tangent space estimation becomes more challenging. This also demonstrates that DGNs trained only to reconstruct the data samples do not align correctly with the underlying data manifold tangent space.

6.1 Accuracies of layerwise (LW) pretraining, structured pruning with random and lottery ticket initialization.

6.2 Evaluating the proposed layerwise spline pruning over SOTA pruning methods on CIFAR-100.

6.3 Evaluating the proposed global spline pruning over SOTA pruning methods on ImageNet.

C.1 Evaluating our global spline pruning method over SOTA methods on CIFAR-10/100 datasets. Note that "Spline Improv." denotes the improvement of our spline pruning (w/ or w/o EB) as compared to the strongest baselines.

NOTATIONS

The entire thesis follows these notations. A scalar is always represented in lower case and in standard font weight, as $a$. A vector is always represented in lower case and in bold font weight, as $\boldsymbol{a}$. A matrix is always represented in upper case and in bold font weight, as $\boldsymbol{A}$. A function producing a scalar or a vector output is expressed in upper case and in standard font weight, as $F$. A superscript surrounded by parentheses on any parameter/function is an indexing and does not correspond to taking the power of the output; e.g., $F^{(4)}$ can represent the fourth mapping and is not to be understood as $F^4$. Lastly, accessing a specific dimension of a vector, matrix or tensor is achieved through the $[\cdot]$ operator, as in $[\boldsymbol{a}]_k$ for the $k^{th}$ dimension of a vector, $[\boldsymbol{A}]_{k,d}$ for the $d^{th}$ entry of the $k^{th}$ row of a matrix, and so on.

Chapter 1

Introduction

Deep learning has significantly advanced our ability to address a wide range of difficult machine learning and signal processing problems. Today's machine learning landscape is dominated by deep (neural) networks (DNs), which are compositions of a large number of simple parametrized linear and nonlinear operators. In this thesis, we build a bridge between DNs and spline functions and operators. We prove that a large class of DNs, including convolutional neural networks (CNNs) [LeCun, 1998], residual networks (ResNets) [He et al., 2015b], skip connection networks [Srivastava et al., 2015], fully connected networks [Pal and Mitra, 1992], recurrent neural networks (RNNs) [Graves, 2013], scattering networks [Bruna and Mallat, 2013], inception networks [Szegedy et al., 2017], and more, can be written as spline operators. In fact, we prove that any DN employing current standard-practice piecewise affine and convex nonlinearities (e.g., ReLU, absolute value, max-pooling, etc.) can be written as a composition of max-affine spline operators (MASOs), which are an extension of max-affine splines [Magnani and Boyd, 2009, Hannah and Dunson, 2013]. The max-affine spline connection provides a powerful portal through which to view and analyze the inner workings of a DN using tools from approximation theory, functional analysis and computational geometry. The goal of this thesis is to thoroughly adapt max-affine spline insights to deep networks, to derive direct theoretical results from this formulation, and to provide insights and practical guidance for deep learning practitioners and researchers.

1.1 Motivation

Deep learning is increasingly becoming the backbone of our society, powering novel industries and finding its way into applications such as self-driving cars, drug discovery, renewable energies, space exploration and law enforcement. An all-too-familiar story of late is that of plugging a DN into an application as a black box, learning its parameter values using copious training data, and then significantly improving performance over classical task-specific approaches.

Despite this empirical progress, the precise mechanisms by which deep learning works so well remain open to questioning, adding an air of mystery to the entire field.

This pitfall becomes increasingly problematic as DNs are deployed in our society and many systems now rely exclusively on such models. Beyond the interpretability of the prediction/decision making, which is lacking in DNs, one important issue lies in the safety of those models. It has been demonstrated, for example, how deployed models such as copyright infringement detectors, identity recognition models, and speech recognition models can be manipulated by any third-party agent through noise injection in the data [Saadatpanah et al., 2020, Goldblum et al., 2020, Cherepanova et al., 2021]. As a result, it is crucial to increase our theoretical and practical understanding of DNs, in particular in a way that allows practitioners to better design and control those powerful methods. In a recent turn of events, most, if not all, currently employed models have had their architecture altered, through trial and error, to finally become what they are today: affine spline functions. Through the rich theory of splines, we will demonstrate how to study DNs from this viewpoint.

This thesis is organized as follows. First, we review max-affine splines in all generality, as those convex, piecewise affine splines are the backbone of this thesis (Chap. 2). The core novel results consist in the reformulation of DNs as max-affine splines and in leveraging this form to derive a direct result on the characterization of the DN input space partition (Chap. 3). Following this, we propose different facets of results that are direct consequences of this formulation: we study Deep Generative Networks and the geometry of the manifold that they span (Chap. 4). Those results apply to many frameworks such as Generative Adversarial Networks, Variational Autoencoders, and Autoencoders, and provide practitioners with insights into architecture design and techniques such as dropout and dropconnect. Second, we directly exploit the spline formulation and the result on the DN partition to derive novel strategies to learn Deep Generative Networks via Expectation-Maximization (Chap. 5). This chapter closes the study of Deep Generative Networks. Third, we study Deep Network pruning, which consists of removing nodes/weights from an architecture with the hope of maintaining high performance while reducing the model complexity (Chap. 6). By leveraging the spline viewpoint, it is possible to obtain geometrical insights and to derive novel and motivated pruning solutions. Fourth, we dive into Batch-Normalization (Chap. 7). Batch-Normalization is one of the most popular techniques to speed up and stabilize DN training. Through the understanding of the DN partition, novel results and explanations of Batch-Normalization become possible, concluding that this technique concentrates the DN regions around the data samples and thus helps training by acting on the DN partition. Fifth and lastly, we conclude this thesis by demonstrating how the insights and results drawn throughout the above chapters can be extended to DNs with smooth nonlinearities, by allowing the region assignment of the max-affine splines to be probabilistic (Chap. 8). This process is very similar to the ability of Gaussian Mixture Models to produce a probability that an input belongs to a specific region, i.e., cluster, as opposed to K-means, which produces a yes/no region membership value. Most of the chapters rely on conference papers that are cited as part of the corresponding chapter's introduction. Proofs are provided in multiple appendices, divided per chapter. Proofs that are short in length are put directly in the main document.

1.2 Deep Networks

We now introduce deep (neural) networks (DNs): nonlinear functions formed by a composition of layers, each layer performing a simple (possibly constrained) affine transformation of its input followed by a nonlinearity. The success of DNs on challenging computer vision tasks goes back at least as far as LeCun et al. [1995b] for handwritten digit classification. A typical DN F that employs L layers is expressed as

$$F(x) = \left(F^{(L)} \circ \cdots \circ F^{(1)}\right)(x), \qquad (1.1)$$

where each function $F^{(\ell)}: \mathbb{R}^{D^{(\ell-1)}} \mapsto \mathbb{R}^{D^{(\ell)}}$ maps its input $z_x^{(\ell-1)}$, a feature map, to an output feature map $z_x^{(\ell)}$, with the initialization $z_x^{(0)} \triangleq x$. We thus have

$$z_x^{(\ell)} = \left(F^{(\ell)} \circ \cdots \circ F^{(1)}\right)(x).$$

Different DNs, such as CNNs [LeCun, 1998], Residual Networks [He et al., 2015b], and Densenets [Huang et al., 2017], simply correspond to DNs in which the organization and the types of layers are specified explicitly. Some layers operate on feature maps with specific shapes, such as 3-dimensional tensors corresponding to multi-channel images. In any case, it is possible to consider the flattened version of such tensors and adapt the layer operations accordingly. To streamline our development we will thus always consider feature maps to be vectors. We describe below the basic operators that form any current DN layer and review how DN training is done, i.e., how the per-layer weights are tuned in order to produce a desired DN. For a complete survey we refer the reader to Goodfellow et al. [2016].
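To make the composition (1.1) concrete, the following minimal NumPy sketch builds a toy DN out of dense layers followed by an elementwise ReLU; the widths, the random initialization, and the choice of nonlinearity are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_layer(W, b, sigma):
    """Return F(z) = sigma(W z + b), one layer of the composition in (1.1)."""
    return lambda z: sigma(W @ z + b)

relu = lambda u: np.maximum(0.0, u)

# Hypothetical widths D^(0)=4, D^(1)=8, D^(2)=8, D^(3)=2.
widths = [4, 8, 8, 2]
layers = []
for d_in, d_out in zip(widths[:-1], widths[1:]):
    W = rng.normal(scale=1.0 / np.sqrt(d_in), size=(d_out, d_in))
    b = np.zeros(d_out)
    layers.append(dense_layer(W, b, relu))

def network(x):
    """F(x) = (F^(L) o ... o F^(1))(x): feed the feature map through each layer."""
    z = x                  # z_x^(0) = x
    for F in layers:
        z = F(z)           # z_x^(l) = F^(l)(z_x^(l-1))
    return z

x = rng.normal(size=widths[0])
print(network(x))          # final feature map z_x^(L)
```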

1.2.1 Layers

A DN layer, as employed in (1.1) to form the final prediction, is itself internally composed of a few simple operators. Different types of layers can be obtained by combining those simple operators adequately; in turn, different types of layers and layer organizations produce different types of DNs. To remain as general as possible, we thus propose to first review the main operators used in today's layers.

Dense operator. A dense operator, oftentimes referred to as a fully-connected operator, performs an affine transformation of a given input $x$ as in

$$Wx + b,$$

where $x$ is the considered input, $W$ is a dense/full matrix, and $b$ is a bias vector. This operator, combined with the activation operator (described below), forms the layers employed in the first generation of DNs: multilayer perceptrons [Rosenblatt, 1961].

Current DNs often employ the dense operator within their last layers only and prefer more constrained operators such as the convolution operator for the first layers.

Convolution operator. A convolution operator transforms its input via

$$Cx + b,$$

where a special structure is imposed on the matrix $C$ so that it performs multi-channel convolutions on the input $x$. Similarly, the bias vector $b$ is often constrained to have the same entries across different dimensions. Special cases include the use of $1 \times 1$ convolutional filters [Kingma and Dhariwal, 2018], in which case $C$ is made of multiple blocks, each being a diagonal matrix. Convolution operators are at the origin of the performance gains observed in computer vision tasks starting with the LeNet architecture [LeCun et al., 1989]. Even the most recent state-of-the-art DNs employ at some point convolution operators combined with an activation or a pooling operator.
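As a sanity check of the claim that a convolution is an affine transformation $Cx + b$ with a structured matrix $C$, here is a minimal sketch (single channel, 1-D, "valid" boundary handling, all chosen purely for illustration) that builds $C$ explicitly and compares the result against NumPy's convolution routine.

```python
import numpy as np

def conv_matrix(h, D):
    """Structured (Toeplitz) matrix C such that C @ x equals the 'valid'
    correlation of a length-D input x with the length-R filter h."""
    R = len(h)
    K = D - R + 1                        # number of output dimensions
    C = np.zeros((K, D))
    for k in range(K):
        C[k, k:k + R] = h                # each row is the filter, shifted by one
    return C

rng = np.random.default_rng(1)
x = rng.normal(size=10)
h = np.array([1.0, -2.0, 0.5])
b = 0.1                                  # shared bias entry across output dimensions

C = conv_matrix(h, len(x))
out_matrix = C @ x + b
out_direct = np.convolve(x, h[::-1], mode="valid") + b   # same correlation via convolve
print(np.allclose(out_matrix, out_direct))               # True
```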

Pooling operator. A pooling operator is a sub-sampling operation applied on an input according to a sub-sampling policy $\rho$ and, for each output dimension, a collection of input dimensions that $\rho$ must consider in order to produce that output dimension. Formally, for each output dimension $k = 1, \dots, K$, we denote this collection of dimensions as $\mathcal{R}_k \in \{1, \dots, D\}^R$, where $D$ is the dimension of the input and $R > 1$ is the number of input dimensions to apply $\rho$ onto. In our case we assume the same $R$ for each $\mathcal{R}_k$, but generalizing this is straightforward. We thus obtain the following input-output mapping for the pooling operator:

$$\begin{bmatrix}
\rho\left([x]_{[\mathcal{R}_1]_1}, [x]_{[\mathcal{R}_1]_2}, \dots, [x]_{[\mathcal{R}_1]_R}\right)\\
\rho\left([x]_{[\mathcal{R}_2]_1}, [x]_{[\mathcal{R}_2]_2}, \dots, [x]_{[\mathcal{R}_2]_R}\right)\\
\vdots\\
\rho\left([x]_{[\mathcal{R}_K]_1}, [x]_{[\mathcal{R}_K]_2}, \dots, [x]_{[\mathcal{R}_K]_R}\right)
\end{bmatrix}.$$

We consider here that the pooling operator reduces the input dimensionality (R > 1).

Often the pooling operator $\rho$ is the max operator, as was originally proposed. However, the average also produces successful layers that have been used in many DNs. More complex functions include softmax pooling [Murray and Perronnin, 2014]. When the indices $\mathcal{R}_k$ include all the input dimensions, the pooling operator is referred to as global. As opposed to the dense and convolution operators, the pooling operator is most often nonlinear. Another popular nonlinear operator is the activation operator, to which we now turn.
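The pooling mapping above amounts to applying $\rho$ over each index collection $\mathcal{R}_k$; the sketch below does exactly that for an assumed choice of non-overlapping collections and for $\rho$ being either the max or the average.

```python
import numpy as np

def pool(x, regions, rho=np.max):
    """Apply the pooling policy rho over each collection of input dimensions R_k.
    `regions` is a list of K index tuples, one per output dimension."""
    return np.array([rho(x[list(R_k)]) for R_k in regions])

x = np.array([3.0, -1.0, 4.0, 1.0, -5.0, 9.0])

# Non-overlapping collections R_1=(0,1), R_2=(2,3), R_3=(4,5), so R = 2 and K = 3.
regions = [(0, 1), (2, 3), (4, 5)]

print(pool(x, regions, rho=np.max))    # max-pooling:     [3.  4.  9.]
print(pool(x, regions, rho=np.mean))   # average-pooling: [1.  2.5 2.]
```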

Activation operator. The activation operator applies a scalar nonlinearity $\sigma$ to each dimension of its input as in

$$\begin{bmatrix}
\sigma([x]_1)\\
\sigma([x]_2)\\
\vdots\\
\sigma([x]_D)
\end{bmatrix},$$

which we will abbreviate as simply $\sigma(x)$, where $\sigma$ should be understood as being applied elementwise. The first most popular activation functions were the sigmoid $\sigma(u) = \frac{1}{1+\exp(-u)}$ and the hyperbolic tangent $\sigma(u) = \frac{\exp(u)-\exp(-u)}{\exp(u)+\exp(-u)}$ [Rosenblatt, 1961, Hornik et al., 1989]. While not coined with that name, the ReLU activation $\sigma(u) = \max(0, u)$ emerged from hinging hyperplanes [Breiman, 1993], which can be seen as a layer with a dense operator and an activation operator (ReLU). The official introduction of the ReLU in DNs was done in Glorot et al. [2011]. Variants include the leaky-ReLU $\sigma(u) = \max(\eta u, u)$, $\eta > 0$ [Maas et al., 2013], the absolute value, and the exponential linear unit $\sigma(u) = u$ for $u \geq 0$ and $\sigma(u) = \exp(u) - 1$ for $u < 0$ [Shah et al., 2016].
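For reference, the piecewise affine and convex activations mentioned above (ReLU, leaky-ReLU, absolute value) are one-line elementwise maps, each expressible as a maximum of affine functions; a small sketch follows, where the leaky-ReLU slope $\eta = 0.1$ is an arbitrary illustrative value.

```python
import numpy as np

relu       = lambda u: np.maximum(0.0, u)                 # max(0, u)
leaky_relu = lambda u, eta=0.1: np.maximum(eta * u, u)    # max(eta*u, u), illustrative eta
abs_value  = lambda u: np.maximum(u, -u)                  # max(u, -u) = |u|

u = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(u))         # [0.   0.   0.   1.5]
print(leaky_relu(u))   # [-0.2  -0.05  0.   1.5]
print(abs_value(u))    # [2.   0.5  0.   1.5]
```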

With the few operators described above it is already possible to form most of the current layers employed in DNs. In order to formally define what a layer can include and ensure that the decomposition (1.1) is unique for any known DN, we propose the following definition.

Definition 1.1 (Layer) A layer $F^{(\ell)}$ is made of a single nonlinear operator and all the preceding linear operators (if any).

Some popular layers are the convolutional layer, which comprises a convolution operator and an activation operator, and the maxout layer, which is formed by a convolutional or dense operator and a max-pooling operator. Additionally, any layer can be turned into a residual layer [He et al., 2015b] by adding a linear connection between the layer input and its output. For example, a residual convolutional layer is

$$\sigma(Cx + b) + W^{\mathrm{res}}x + b^{\mathrm{res}},$$

where $W^{\mathrm{res}}$ is often taken to be the identity matrix if the layer output dimension is the same as the input, and $b^{\mathrm{res}}$ is often set to be zero.
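A minimal sketch of the residual layer displayed above, under the stated defaults $W^{\mathrm{res}} = I$ and $b^{\mathrm{res}} = 0$; a generic square matrix stands in for the convolution matrix $C$ purely for brevity.

```python
import numpy as np

def residual_layer(x, C, b, sigma=lambda u: np.maximum(0.0, u),
                   W_res=None, b_res=None):
    """sigma(C x + b) + W_res x + b_res, with the common defaults
    W_res = identity and b_res = 0 when the dimensions match."""
    if W_res is None:
        W_res = np.eye(len(x))
    if b_res is None:
        b_res = np.zeros(len(x))
    return sigma(C @ x + b) + W_res @ x + b_res

rng = np.random.default_rng(2)
D = 6
C = rng.normal(size=(D, D))     # stands in for a (square) convolution matrix
b = np.zeros(D)
x = rng.normal(size=D)
print(residual_layer(x, C, b))
```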

1.2.2 Training

In order to optimize the parameters that govern each layer of the DN, one needs a dataset, a loss function to be minimized that is preferably differentiable, such as the mean squared error [Wang and Bovik, 2009] or the cross-entropy [Kleinbaum et al., 2002], and a parameter update policy/rule such as some flavor of gradient descent [Bottou, 2010].

A dataset is a collection of observations, which can be a set of inputs (unsupervised), a set of input-output pairs (supervised), or a mix of both (semi-supervised). For concreteness, let's consider the supervised case here and denote the dataset as $\mathcal{D} = \{(x_n, y_n), n = 1, \dots, N\}$. Commonly, this dataset is partitioned into three: a training set $\mathcal{D}_{\mathrm{train}}$, a validation set $\mathcal{D}_{\mathrm{val}}$, and a testing set $\mathcal{D}_{\mathrm{test}}$, such that there is no overlap between them and their union gives back $\mathcal{D}$, the entire dataset. The DN parameters are updated based on $\mathcal{D}_{\mathrm{train}}$, the DN hyper-parameters are chosen based on $\mathcal{D}_{\mathrm{val}}$, and finally the out-of-sample performance (also referred to as generalization performance) is estimated based on $\mathcal{D}_{\mathrm{test}}$.

Based on the training set $\mathcal{D}_{\mathrm{train}}$, the chosen loss function, and the weight update policy, one tunes the DN parameters to minimize the loss on this set of observations. Commonly this is done with flavors of gradient descent such as Nesterov momentum [Nesterov], Adadelta [Zeiler, 2012], Adam [Kingma and Ba, 2014], or any variant of those. In fact, all of the operations introduced above for standard DNs are differentiable almost everywhere with respect to their parameters and inputs. As the training set size ($|\mathcal{D}_{\mathrm{train}}|$) is often large, each parameter update is computed after only feeding a mini-batch of $B$ data points sampled from $\mathcal{D}_{\mathrm{train}}$, with cardinality much smaller than the number of training samples ($B \ll |\mathcal{D}_{\mathrm{train}}|$). Mini-batch training, in addition to reducing the amount of computation required to perform a step of parameter update, also provides many benefits from a generalization perspective [Keskar et al., 2016, Masters and Luschi, 2018]. For each mini-batch, the parameter updates are computed for all the network parameters by backpropagation [Hecht-Nielsen, 1992], which follows from applying the chain rule of calculus. Once all the samples in the training set have been observed (after $\frac{|\mathcal{D}_{\mathrm{train}}|}{B}$ mini-batches are sampled without replacement), one epoch is completed. Whenever $B = 1$, the above is denoted as stochastic gradient descent. Usually a network needs hundreds of epochs to converge. Hyperparameters such as the learning rate and early-stopping are tuned based on the performance on $\mathcal{D}_{\mathrm{val}}$. Once training is completed and the best hyper-parameters have been selected, estimates of the DN performance on new, unseen data are obtained on $\mathcal{D}_{\mathrm{test}}$.
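The sketch below walks through the pipeline just described (dataset split, mini-batches of size $B$, one parameter update per mini-batch, epochs) on a deliberately tiny linear least-squares model whose gradient is written out by hand; in an actual DN the gradients would instead be obtained by backpropagation, and all sizes and hyper-parameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic supervised dataset D = {(x_n, y_n)}, then a train/val/test partition.
N, D_in = 600, 5
X = rng.normal(size=(N, D_in))
w_true = rng.normal(size=D_in)
y = X @ w_true + 0.1 * rng.normal(size=N)
X_train, y_train = X[:400], y[:400]
X_val,   y_val   = X[400:500], y[400:500]
X_test,  y_test  = X[500:],    y[500:]

mse = lambda Xs, ys, w: np.mean((Xs @ w - ys) ** 2)

B, lr, n_epochs = 32, 0.05, 30          # mini-batch size, learning rate, epochs
w = np.zeros(D_in)                      # parameters to be learned
for epoch in range(n_epochs):
    perm = rng.permutation(len(X_train))             # mini-batches without replacement
    for start in range(0, len(perm), B):
        idx = perm[start:start + B]
        Xb, yb = X_train[idx], y_train[idx]
        grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)  # analytic MSE gradient
        w -= lr * grad                                # one parameter update per mini-batch
    # validation loss would drive hyper-parameter choices / early stopping

print("val MSE :", mse(X_val, y_val, w))
print("test MSE:", mse(X_test, y_test, w))
```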

1.2.3 Approximation Results

The ability of certain DNs to approximate an arbitrary functional/operator mapping has been well established [Cybenko, 1989, Breiman, 1993]. For completeness, we recall those results, which have been pivotal in the theoretical analysis of DNs.

Theorem 1.1 (Cybenko [1989])

Let $\sigma$ be any bounded, measurable sigmoidal function; then, given any (target function) $f \in C^0([0, 1]^D)$ and any $\epsilon > 0$, there exists a shallow network

$$f_\Theta(x) = \sum_{k=1}^{K} [W^{(2)}]_{k,1}\, \sigma\!\left(\langle [W^{(1)}]_{k,\cdot}, x\rangle + [b^{(1)}]_k\right)$$

such that

$$|f(x) - f_\Theta(x)| < \epsilon, \quad \text{for all } x \in [0, 1]^D.$$

In the above result, a sigmoidal function is defined as a function $\sigma$ that must fulfill $\lim_{t\to-\infty}\sigma(t) = 0$ and $\lim_{t\to\infty}\sigma(t) = 1$, without any monotonicity constraint. This result has been generalized to the case of employing a continuous, bounded and nonconstant activation function $\sigma$ in Hornik [1991], and to Radial Basis Functions in Park and Sandberg [1991]. Those results consider the case of fixed depth and increasing width. The dual of this, considering fixed width and increasing depth, has also led to universal approximation results.

Theorem 1.2 (Lu et al. [2017])

For any Lebesgue-integrable (target function) $F: \mathbb{R}^D \mapsto \mathbb{R}$ and any $\epsilon > 0$, there exists a fully-connected ReLU network $F_\Theta$ with width $\leq D + 4$ such that

$$\int_{\mathbb{R}^D} |F(x) - F_\Theta(x)|\, dx < \epsilon.$$

The same result has been obtained recently for CNNs in Zhou [2020], for Residual Networks in Tabuada and Gharesifard [2020], and for recurrent networks in Doya [1993], Schäfer and Zimmermann [2006]. In the specific case of univariate DNs with continuous piecewise affine activation functions, Daubechies et al. [2019] demonstrated how the special structure that can be reached through depth allows DNs not only to approximate a target function arbitrarily closely as long as the number of layers is large enough, but also to achieve a greater rate of approximation than alternative methods for some particular classes of functions to be approximated.
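As a toy numerical illustration of width-based approximation (in the spirit of Theorem 1.1, but using ReLU units with randomly drawn input weights and a least-squares fit of the output weights, rather than the constructive proofs above), the following sketch shows the approximation error of a shallow network shrinking as the width $K$ grows; the target function and all sampling choices are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda u: np.maximum(0.0, u)

# Target function on [0, 1] and a dense evaluation grid.
f = lambda x: np.sin(2 * np.pi * x) + 0.5 * np.cos(5 * np.pi * x)
x = np.linspace(0.0, 1.0, 2000)

for K in [5, 20, 80, 320]:
    # Shallow network: f_Theta(x) = sum_k [W2]_k relu(W1_k * x + b1_k).
    W1 = rng.normal(size=K)
    b1 = rng.uniform(-1.0, 1.0, size=K)
    Phi = relu(np.outer(x, W1) + b1)                    # hidden features, shape (len(x), K)
    W2, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)     # fit the output weights only
    err = np.max(np.abs(Phi @ W2 - f(x)))
    print(f"width K={K:4d}  max |f - f_Theta| = {err:.4f}")
```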

All the variants of the universal approximation theorem guarantee that one can approximate any reasonable target function with the correct choice of architecture. However, those results do not prescribe how to obtain such a function in practice, i.e., how to learn the underlying DN parameters in a more principled way without resorting to gradient-based optimization. Learning the DN parameters in a way that maximizes generalization performance (or any desired metric) is one of the fundamental questions that remain open. We now thoroughly introduce the rich and powerful class of spline functions in order to build our novel results in the following chapters.

1.3 Related Works

Ongoing directions to build a rigorous mathematical framework allowing one to derive theoretical results and/or insights into DNs fall roughly into six camps; five of them have little to do with spline functions but are presented for completeness and to overview alternative directions. The last direction deals with splines and is the most relevant to our study.

1.3.1 Mathematical Formulations of Deep Networks

Statistical correlation based visualizations. The first successful attempt at providing human-interpretable understanding of DNs' inner workings arose from two techniques: activity maximization [Cadena et al., 2018] and saliency maps [Simonyan et al., 2014, Zeiler and Fergus, 2014, Li and Yu, 2015, Kim et al., 2019]. In the former case, one optimizes a (randomly) initialized input living in the data space such that this input produces some target latent representation in the DN. This can take different forms, such as considering a specific unit in the DN or adding additional constraints. In the latter case, one feeds an input to a DN and leverages the gradient of the DN (a specific unit at a specific layer) with respect to the DN input. This saliency map depicts how sensitive a specific unit of the DN is to each input dimension for a given input. Those solutions provide visuals in the data space that highly correlate with the firing of specific DN units. Such techniques can be coupled with segmentation networks and ground-truth image segmentation labels to obtain an actual 'label' or 'concept' that the DN's units correlate highly with. This has been explored in classifier DNs [Bau et al., 2017] and generative DNs [Bau et al., 2020]. In order to provide extensive guarantees for the interpretations drawn from those visuals, statistical tests and results have been developed [Lin and Lin, 2014, Adebayo et al., 2018]. This development has pioneered our current understanding of the underlying knowledge being learned by DNs.

Optimization and approximation theory. The theoretical understanding of DNs, their approximation power as well as their generalization capacity, is one of the most fundamental questions and has been studied for decades. For example, Cybenko [1989] and Breiman [1993] studied the approximation capacity of shallow networks with sigmoid and ReLU activation functions respectively. Through specific considerations of architectures, finer and finer results have been obtained [Arora et al., 2013, Cohen et al., 2016, Parhi and Nowak, 2021], with results reaching beyond pure approximation capacity and characterizing the loss surface geometry [Lu and Kawaguchi, 2017, Soudry and Hoffer, 2017, Nguyen and Hein, 2017] or the VC-dimension [Harvey et al., 2017] of DNs. Tremendous insights have been gained from those results. For example, it was shown in Daubechies et al. [2019] how residual connections on a carefully designed univariate DN allowed for faster error convergence rates.

Architecture-constrained models. Another line of research considers carefully designed and constrained DN architectures in order to enforce mathematical properties and to allow for theoretical analysis. The first successful attempt at such a model that managed to provide competitive performance is the Extreme Learning Machine (ELM) [Huang et al., 2011, Tang et al., 2015], consisting of DNs with random layer weights for all but the last layer. Due to the absence of training of the layer weights, it was possible to gain insights into the DN decision process, and ELMs opened the door to the integration of continuous constraints like positivity, monotonicity, or bounded curvature in the learned function [Neumann et al., 2013]. The Scattering Network [Mallat, 2012, Bruna and Mallat, 2013] is another carefully designed DN, consisting of a succession of wavelet transforms and (complex) moduli. As opposed to standard DNs, the features of this network are obtained by collecting the average of the per-layer feature maps. Variants include performing local averaging, employing 1-dimensional or 2-dimensional wavelet transforms depending on the nature of the data [Andén et al., 2019], and changing the employed wavelet filter-banks [Lostanlen and Andén, 2016]. Thanks to such a parametrization, it is possible to leverage tools from signal processing [Mallat, 1999] and group theory [Mallat, 2016] to study and interpret the scattering features and thus understand the benefits and mathematical properties of this architecture. The interpretability of this network, in conjunction with the absence of learning, enabled its application in fields such as quantum chemistry [Hirn et al., 2017] or aerial scene understanding [Nadella et al., 2016]. Lastly, some methods have been developed in which only a specific part of the DN is tweaked based on an explicit design that improves analysis or interpretability. The most well-known examples include the group equivariant convolutional networks [Cohen and Welling, 2016], where the per-layer parameters are constrained to have a group structure, and the capsule network [Sabour et al., 2017], which leverages hard-coded 'computer vision' rules in forming its prediction.

Probabilistic generative models. Probabilistic Graphical Models (PGMs) [Koller and Friedman, 2009] are one class of machine learning methods that has always enjoyed ease of interpretation and data modeling. The main reasons behind those benefits are (i) the ability to specify an explicit generative model that governs the data at hand, thus allowing easy integration of a priori knowledge and understanding of the underlying data modeling [Bhattacharya and Cheng, 2015], (ii) the explicit analytical forms to train the model parameters and infer the missing variables in the model [Jordan, 1998], and (iii) the versatility of such models, which can be used to detect outliers, to denoise, and to classify/cluster [Bilmes and Zweig, 2002]. The most successful PGMs include the Gaussian Mixture Model [Xu and Jordan, 1996], the Hidden Markov Model [Rabiner and Juang, 1986] and the Factor Analysis Model [Akaike, 1987]. Motivated by those successes, many recent studies [Yuksel et al., 2012, Patel et al., 2016, Kim and Bengio, 2016, Nie et al., 2018] have modeled the underlying DN mechanisms as a PGM in order to port all the above benefits to deep learning. Those approaches have successfully provided principled solutions to perform semi-supervised and unsupervised tasks, as well as pushing our understanding of DN inner workings from a generative perspective. One limitation that remains to be tackled comes from the now intractable learning solutions. In fact, due to the increase in model complexity that those methods require in order to mimic DNs as closely as possible, the closed-form solutions for training and inference no longer exist, limiting the explainability provided by those methods.

Infinite-width limit. Another recent development leverages the overparametrization regime of modern architectures. In this specific setting, overparametrization is to be understood as a growth (in the limit, to infinity) of the layers' width. This has resulted in DNs preserving generalization performance [Novak et al., 2018, Neyshabur et al., 2019, Belkin et al., 2019] while providing additional convergence guarantees for the optimization problem [Du et al., 2019a, Allen-Zhu et al., 2019a,b, Zou et al., 2018, Arora et al., 2019b]. Going into the infinite-width limit, it became possible to obtain analytical models of such impractical DNs. The most recent infinite-width study showed that the training dynamics of (infinite-width) DNs under gradient flow are captured by a constant kernel called the Neural Tangent Kernel (NTK) that evolves according to an ordinary differential equation (ODE) [Jacot et al., 2018, Lee et al., 2019a, Arora et al., 2019a]. Every DN architecture and parameter initialization produces a distinct analytical NTK. The original NTK was derived from the Multilayer Perceptron [Jacot et al., 2018] and was soon followed by kernels derived from CNNs [Arora et al., 2019a, Yang, 2019], Residual Networks [Huang et al., 2020], and Graph CNNs (GNTK) [Du et al., 2019b]. In [Yang, 2020], a general strategy to obtain the NTK of any architecture is provided. Due to the analytical form of the NTK, this development has led to an entirely new active research area.

Continuous piecewise affine operators. The last research direction, which also relates the most to this thesis, concerns the use of spline functions and operators to study DNs. While not in a deep learning setting, the first study of the approximation capacity of shallow ReLU networks was performed in Breiman [1993] and offered an alternative approximation result to the ones based on sigmoid activation functions from that time. More recently, and mainly due to the popularity of novel activation functions such as leaky-ReLU or absolute value, the use of spline function theory to study DNs has grown exponentially. The first brick was laid by Montufar et al. [2014], where the Continuous Piecewise Affine (CPA) structure of DNs employing such nonlinearities was highlighted. Along with this result, an upper bound on the number of regions of the DN input space partition was derived. Later works continued to refine the bounds on the number of regions [Serra et al., 2018, Hanin and Rolnick, 2019, Serra and Ramalingam, 2020]. Meanwhile, the rich theory of splines, which has been extensively refined in signal processing and function approximation, has made it possible to port many results, such as formulating current DNs from a functional optimization problem [Unser, 2018], studying the piecewise convexity of the DN in the context of optimization [Rister and Rubin, 2017], and obtaining sharp approximation results [Daubechies et al., 2019]. To date, theoretical studies relying on splines have focused on either considering specific topologies or providing theoretical guarantees and bounds on specific properties, such as a DN's approximation capacity for univariate input-output or the number of regions in the DN's input space partition.

1.3.2 Training of Deep Generative Networks.

Deep Generative Networks (DGNs), which map a low-dimensional latent variable z to a higher-dimensional generated sample x, are the state-of-the-art methods for a range of machine learning applications, including anomaly detection, data generation, likelihood estimation, and exploratory analysis across a wide variety of datasets [Blaauw and Bonada, 2016, Inoue et al., 2018, Liu et al., 2018, Lim et al., 2018]. While we propose a thorough geometrical study of DGNs in all generality in Chap. 4, we now go a step further and exploit the composition-of-MASOs formulation to provide a novel training solution. Training of DGNs roughly falls into two camps: (i) leveraging an adversarial network, as in a Generative Adversarial Network (GAN) [Goodfellow et al., 2014], to turn the method into an adversarial game; and (ii) modeling the latent and observed variables as random variables and performing some flavor of likelihood maximization training. A widely used solution to likelihood-based DGN training is via a Variational Autoencoder (VAE) [Kingma and Welling, 2013]. The popularity of the VAE is due to its intuitive and interpretable loss function, which is obtained from likelihood estimation, and its ability to exploit standard estimation techniques ported from the probabilistic graphical models literature. Yet, VAEs offer only an approximate solution for likelihood-based training of DGNs. In fact, all current VAEs employ three major approximation steps in the likelihood maximization process. First, the true (unknown) posterior is approximated by a variational distribution. This estimate is governed by some free parameters that must be optimized to fit the variational distribution to the true posterior. VAEs estimate such parameters by means of an alternative network, the encoder, with the datum as input and the predicted optimal parameters as output. This step is referred to as Amortized Variational Inference (AVI), as it replaces the explicit, per-datum optimization by a single deep network (DN) pass. Second, as in any latent variable model, the complete likelihood is estimated by a lower bound (ELBO) obtained from the expectation of the likelihood taken under the posterior or variational distribution. With a DGN, this expectation is unknown, and thus VAEs estimate the ELBO by Monte-Carlo (MC) sampling. Third, the maximization of the MC-estimated ELBO, which drives the parameters of the DGN to better model the data distribution and the encoder to produce better variational parameter estimates, is performed by some flavor of gradient descent (GD). These VAE approximation steps enable rapid training and test-time inference of DGNs. However, due to the lack of analytical forms for the posterior, the ELBO, and explicit (gradient-free) parameter updates, it is not possible to measure the above steps' quality or effectively improve them. Since the true posterior and expectation are unknown, current VAE research roughly falls into three camps: (i) developing new and more complex output and latent distributions [Nalisnick and Smyth, 2016, Li and She, 2017], such as truncated distributions; (ii) improving the various estimation steps by introducing complex MC sampling with importance re-weighted sampling [Burda et al., 2015]; (iii) providing different estimates of the posterior with moment matching techniques [Dieng and Paisley, 2019, Huang et al., 2019]. More recently, Park et al. [2019] exploited the special continuous piecewise affine structure of current ReLU DGNs to develop an approximation of the posterior distribution based on mode estimation and DGN linearization, leading to Laplacian VAEs.

1.3.3 Batch-Normalization Understandings.

Nowadays, the empirical benefits of BN are ubiquitous, with more than 12,000 citations to the original BN article and a unanimous community employing BN to accelerate training by helping the optimization procedure and to increase generalization performance [He et al., 2016b, Zagoruyko and Komodakis, 2016, Szegedy et al., 2016, Zhang et al., 2018c, Huang et al., 2018a, Liu et al., 2017b, Ye et al., 2018, Jin et al., 2019, Bender et al., 2018]. Despite its prevalence in today's DN architectures, the understanding of the unseen forces that BN applies on DNs remains elusive; and for many, understanding why BN so drastically improves DN performance remains one of the key open problems in the theory of deep learning [Richard et al., 2018]. One of the first practical arguments in favor of feature map normalization emerged in Cun et al. [1998] as "good practice" to stabilize training. By studying how the backpropagation algorithm updates the layer weights, it was observed that, unless the feature maps are normalized, those updates are constrained to live on a low-dimensional subspace, limiting the learning capacity of gradient-based algorithms. By explicitly reparametrizing the affine transformation weights and slightly altering the renormalization process of BN, weight normalization [Salimans and Kingma, 2016] showed how the $\sigma^{(\ell)}$ renormalization smooths the optimization landscape of DNs. Similarly, Bjorck et al. [2018], Santurkar et al. [2018], Kohler et al. [2019] further studied the impact of BN on the gradient distributions and optimization landscape by designing careful and large-scale experiments. By providing a smoother optimization landscape, BN "simplifies" the stochastic optimization procedure and thus accelerates training convergence and generalization. In parallel to this optimization analysis of BN in standard DN architectures, Yang et al. [2019b] developed a mean field theory for fully-connected feed-forward neural networks with random weights where BN is analytically studied. In doing so, they were able to characterize the gradient statistics in such DNs and to study the signal propagation stability depending on the weight initialization, concluding that BN stabilizes gradients and thus training.

1.3.4 Deep Network Pruning.

With a tremendously increasing need for practical DN deployments, one line of research aims to produce a simpler, energy-efficient DN by pruning a dense one, e.g., removing some layers/nodes/weights and any combination of these options from a DN architecture, leading to a much reduced computational cost. Recent progress [You et al., 2020, Molchanov et al., 2016] in this direction makes it possible to obtain much more energy-friendly models while nearly maintaining the models' task accuracy [Li et al., 2020]. Throughout this chapter, we will often abuse notation and refer to an unpruned DN as "dense" or "complete". While tremendous empirical progress has been made regarding DN pruning, there remains a lack of theoretical understanding of its impact on a DN's decision boundary as well as a lack of theoretical tools for deriving pruning techniques in a principled way. Such understanding is crucial for one to study the possible failure modes of pruning techniques, to better decide which to use for a given application, or to design pruning techniques possibly guided by some a priori knowledge about the given task and data. The common pruning scheme adopts a three-step routine: (i) training a large model with more parameters/units than the desired final DN, (ii) pruning this overly large trained DN, and (iii) fine-tuning the pruned model to adjust the remaining parameters and restore as best as possible the performance lost during the pruning step. The last two steps can be iterated to get a highly-sparse network [Han et al., 2015]. Within this routine, different pruning methods can be employed, each with a specific pruning criterion, granularity, and scheduling [Liu et al., 2019b, Blalock et al., 2020]. Those techniques roughly fall into two categories: unstructured pruning [Han et al., 2015, Frankle and Carbin, 2019, Evci et al., 2019] and structured pruning [He et al., 2018, Liu et al., 2017b, Chin et al., 2020a]. Regardless of the pruning method, the trade-off lies between the amount of pruning performed on a model and the final accuracy. For various energy-efficient applications, novel pruning techniques have been able to push this trade-off favorably. The most recent theoretical works on DN pruning rely on studying the existence of winning tickets. Frankle and Carbin [2019] first hypothesized the existence of sub-networks (pruned DNs), called winning tickets, that can produce performance comparable to their non-pruned counterparts. Later, You et al. [2020] showed that those winning tickets could be identified in the early training stage of the un-pruned model. Such sub-networks are denoted as early-bird (EB) tickets.

1.4 Contributions

There are many fundamental questions that need to be addressed in deep learning. However, we propose in this thesis to focus specifically on three of them that would not only help in bringing novel understanding of the underlying computational intricacies of Deep Networks, but would also produce better performing models:

Question 1: Can we train a deep network to learn a probability distribution in a high-dimensional space from training data?

Question 2: How can we lower the power consumption of implementing a deep network (both learning and inference) given an architecture and training dataset?

Question 3: How can we explain the ability of a technique such as Batch-Normalization to considerably boost Deep Network performance regardless of the architecture, task, and data at hand?

Answering the above three fundamental questions would push the barriers of current techniques in Deep Learning across applications ranging from manifold learning and density estimation (Q1) to providing interpretability and explainability for an everyday Deep Network technique, namely Batch-Normalization (Q3), while also allowing the principled design of novel methods guided by theoretical understanding, as we will do for pruning (Q2). Needless to say, in order to answer those three fundamental questions, we will need to provide a novel mathematical formulation of Deep Networks. The first part of this thesis thus proposes a novel formulation of DNs via a very special type of spline: max-affine splines, which we review in Chap. 2. This formulation, which consists of a reformulation of DNs based on those splines, opens the door to theoretical study of CPA DNs with high-dimensional input spaces and allows us to leverage results from combinatorial and computational geometry to further enrich our understanding (Chap. 3). Equipped with this formulation and novel understanding, we will be able to answer the three questions posed above, in addition to developing many visualization tools, with the following organization:

Answer 1: deriving an Expectation-Maximization training for Deep Generative Networks (Chap. 5): In this chapter, we advance both the theory and practice of DGNs and VAEs by computing the exact analytical posterior and marginal distributions of any DGN employing continuous piecewise affine (CPA) nonlinearities. The knowledge of these distributions enables us to perform exact inference without resorting to AVI or MC-sampling and to train the DGN in a gradient-free manner with guaranteed convergence.

Answer 2: designing a novel and theoretically grounded state-of-the-art Deep Network pruning strategy (Chap. 6): In this chapter we turn our focus towards a recent technique, Deep Network pruning. As we will see, pruning, which consists of removing some weights and/or units of a DN, can be studied thoroughly from a geometric point of view thanks to the knowledge of the DN input space partition and its ties with the DN input-output mapping. After providing many practical insights into pruning, we will propose, from those understandings, a novel strategy that is able to compete with alternative state-of-the-art methods.

Answer 3: interpreting and theoretically studying, from a spline point of view, one of the most important Deep Learning techniques, Batch-Normalization (Chap. 7): We will demonstrate in this chapter how BN, by proposing a specific layer input-output mapping parametrization, provides an unsupervised learning technique that interacts with the (un)supervised learning algorithm used to train a DN in order to focus the attention of the network onto the data points.

In addition to the above core contributions, we will also exploit the affine spline formulation of deep networks to study and interpret Deep Generative Networks in all generality in Chap. 4, and we will finally conclude by demonstrating in Chap. 8 how to extend all the above results to smooth Deep Networks, thus effectively porting all our results beyond the affine spline world.

Chapter 2

Max-Affine Splines for Convex Function Approximation

Function approximation is the general task of utilizing an approximant function $\hat{f}$ to 'mimic' as best as possible a target (possibly unknown) function $f$. This task can take many forms, ranging from fitting $\hat{f}$ based on samples generated from $f$, as often done in machine learning, to imposing physical constraints on $\hat{f}$ that are known to govern $f$, as often done in partial differential equation approximations, or possibly a mix of both approaches. Solving this task accurately has tremendous applications, as the approximant $\hat{f}$ can then be deployed, for example, to provide autonomous controllers as used in aircraft and uninhabited air vehicles [Farrell et al., 2005, Xu et al., 2014], to perform weather prediction [Richardson, 2007, Brown et al., 2012, Bauer et al., 2015], to accelerate drug discovery [King et al., 1992, Lima et al., 2016, Zhang et al., 2017a, Ong et al., 2020], or to better identify and prevent suicide attempts [Walsh et al., 2017, Torous et al., 2018]. While the topic of function approximation is vast (we refer the reader to Powell [1981], DeVore [1998] for an overview), for our study we focus on a specific class of approximants: spline functions. A clear understanding of those functions and their notations will be crucial for the remainder of the thesis, as we aim to employ the rich theory of splines to study Deep Networks.

2.1 Spline Functions

Spline functions [Schoenberg, 1973] are powerful practical function approximators that have been thoroughly studied theoretically in terms of their approximation capacity along with various other properties [De Boor and Rice, 1968, Unser et al., 1993, Schumaker, 2007].

Splines: constrained piecewise polynomials. Consider a partition of a domain $\mathcal{X}$ into a finite set of regions $\Omega = \{\omega_1, \dots, \omega_R\}$. In our study we will focus on partitions of a continuous domain $\mathcal{X} \subset \mathbb{R}^D$, $D \geq 1$.

Definition 2.1 (Partition) A partition $\Omega$ of a domain $\mathcal{X}$ is a finite collection of regions $\Omega = \{\omega_1, \dots, \omega_R\}$ such that their union recovers the domain, $\cup_{r=1}^{R}\omega_r = \mathcal{X}$, and the intersection of the interiors of any two different regions is empty, $\mathring{\omega}_i \cap \mathring{\omega}_j = \emptyset\ \forall i \neq j$, where $\mathring{\,\cdot\,}$ is the interior operator [Halmos, 2013].

Let us consider $R$ piecewise polynomial mappings of degree $k$ that we denote as $\phi_r^k$ for $r = 1, \dots, R$. For now we consider univariate polynomials, i.e., $D = 1$; hence, each one of those mappings transforms an input $x$ into an output via
\[
\phi_r^k(x; a_{r,:}) := \sum_{p=0}^{k} x^p\, a_{r,p}, \quad x \in \mathbb{R}, \tag{2.1}
\]
where $a_{r,p} \in \mathbb{R}$ is the $p$th-degree polynomial coefficient ($p \in \{0, \dots, k\}$) for the $r$th polynomial ($r \in \{1, \dots, R\}$). We denote by $a_{r,:}$ the vector of $k+1$ parameters $(a_{r,0}, \dots, a_{r,k})^T \in \mathbb{R}^{k+1}$.

Definition 2.2 (Piecewise polynomial function) The mapping defined as
\[
P(x; a_{:,:}) = \sum_{r=1}^{|\Omega|} \phi_r^k(x; a_{r,:})\, 1_{\{x \in \omega_r\}} \tag{2.2}
\]
is known as an order-$k$ piecewise polynomial function, where $\phi_r^k$ is from (2.1) and $\Omega$ is a partition of the considered domain.

An order-$k$ spline function on a domain $\mathcal{X}$ is obtained by constraining an order-$k$ piecewise polynomial function $P$ defined on a partition $\Omega$ of $\mathcal{X}$ to have continuous $0, \dots, k-1$-order derivatives, i.e., $P \in C^{k-1}(\mathcal{X})$. We recall that the $0$-order derivative is the function itself. In order to gain insights into the constraint imposed on piecewise polynomials to obtain a spline, we first need to formally define two regions $\omega_i, \omega_j$ as adjacent iff $\partial\omega_i \cap \partial\omega_j \neq \emptyset$, with $\partial\,\cdot$ the boundary operator. Now, since $P$ is a piecewise polynomial, it is clear that the restriction of $P$ onto any region $\omega \in \Omega$, which we denote by $P|_\omega$, fulfills $P|_\omega \in C^{\infty}(\omega)$. As a result, the constraint $P \in C^{k-1}(\mathcal{X})$ can be seen as enforcing the piecewise polynomial mappings $\phi_r^k, \phi_{r'}^k$ for any adjacent regions $\omega_r, \omega_{r'}$ to have the same $0, \dots, k-1$-order derivatives at the intersection of their regions' boundaries.

Definition 2.3 (Spline function) Given a partition $\Omega = \{\omega_1, \dots, \omega_R\}$ of some domain $\mathcal{X}$, a spline function of order $k$ is an order-$k$ piecewise polynomial $P$ (recall Def. 2.2) on $\Omega$ such that $P \in C^{k-1}(\mathcal{X})$.

As a result, in the special case $k = 2$ (quadratic polynomials), the mapping $P$ will be a spline function iff $P$ and $P'$ (zero-order and first-order derivatives) are continuous. In the case $k = 1$ (affine polynomials), which will be the main setting of this thesis, $P$ must be a piecewise polynomial of degree 1 in each region of the partition, and must be continuous on the entire domain. For a thorough study of piecewise polynomials and splines, we refer the reader to Schumaker [2007]. Generalizing the above spline construction to multivariate domains of dimension $D > 1$ follows naturally by considering multivariate polynomial functions for $\phi_r^k$; the notion of adjacent regions and the derivative constraints that must be fulfilled by a piecewise polynomial mapping to be a spline are identical.

Spline functions' bases. In many situations, one needs to 'learn' a spline given a specific order $k$ and (possibly) a known partition $\Omega$, based on some criterion to minimize. Commonly, one will not solve a constrained optimization problem of fitting a piecewise polynomial function while constraining the first $0, \dots, k-1$-order derivatives to be continuous. Instead, one employs an unconstrained optimization problem in which the coefficients to be optimized weight some basis functions that belong to the considered spline functional space. One of the most famous families of basis functions is the B-splines [Schoenberg, 1973, 1988], which consist of order-$k$ splines with compact support (on each region of $\Omega$). Since any spline function of a given degree can be expressed as a linear combination of B-splines of that degree, it is enough to learn the correct linear combination in order to learn the spline function. For alternative bases we refer the reader to Girosi et al. [1993], Unser and Blu [2005]. We now focus on affine ($k = 1$) splines for the remainder of this thesis.
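To make the basis-expansion viewpoint concrete, here is a minimal sketch (our own toy example, not taken from the thesis) that fits an affine spline on a fixed uniform knot sequence by unconstrained least squares over a hand-rolled hat-function (order-1 B-spline) basis; continuity comes for free since every basis element is itself a continuous piecewise affine function.

```python
import numpy as np

# Toy sketch: learning a k=1 spline on a *fixed* partition by fitting the
# coefficients of a hat-function basis with ordinary least squares.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))                             # sample locations
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)   # noisy target f(x)

knots = np.linspace(0, 1, 10)                                   # fixed partition boundaries
# Column j of the design matrix is the hat function centered at knots[j]:
# linear interpolation through the indicator vector of that knot.
B = np.column_stack([np.interp(x, knots, np.eye(len(knots))[j])
                     for j in range(len(knots))])

coeffs, *_ = np.linalg.lstsq(B, y, rcond=None)                  # unconstrained fit
y_hat = B @ coeffs                                              # spline values at x
print("fit MSE:", np.mean((y - y_hat) ** 2))
```

The partition (the knots) is chosen in advance here; the adaptive-partition splines discussed next remove exactly that limitation.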

Affine splines. Affine splines ($k = 1$) have been particularly popular since their basis functions are efficient to evaluate and the number of parameters required to describe a spline on a partition $\Omega$ of a domain $\mathcal{X}$ grows linearly with the dimension of the domain $\dim(\mathcal{X})$. Such affine splines have been used for the detection of patterns in images [Rives et al., 1985], contour tracing [Dobkin et al., 1990], extraction of straight lines in aerial images [Venkateswar and Chellappa, 1992], global optimization [Mangasarian et al., 2005], compression of chemical process data [Bakshi and Stephanopoulos, 1996], gas demand forecasting [Gascón and Sánchez-Úbeda, 2018] and circuit modeling [Vandenberghe et al., 1989]. Let us specialize the spline mapping (2.2) to the affine case. In that case, and with $\mathcal{X} \subset \mathbb{R}^D$, the mappings $\phi_r^1$ depend on a slope vector $a_r \in \mathbb{R}^D$ and an offset/bias scalar $b_r \in \mathbb{R}$, leading to the multivariate affine spline
\[
P(x; a_:, b_:) = \sum_{r=1}^{|\Omega|} \left( \langle a_r, x\rangle + b_r \right) 1_{\{x \in \omega_r\}},
\]
where we recall that $a_r, b_r$ are such that $P \in C^{0}(\mathcal{X})$. For an in-depth study of affine splines and their representation we refer the reader to Kang and Chua [1978], Kahlert and Chua [1990]. While affine splines might seem constrained due to the use of order-1 polynomials, we should emphasize that in the context of function approximation, the degree of the spline matters very little. However, the partition $\Omega$ is of crucial importance. A correctly tuned partition along with an order-1 spline will produce a better approximation than a higher-degree spline with an incorrect partition. For a thorough study of the role of the polynomial degree, the partition, and the target function in the final approximation error, we refer the reader to Birkhoff and De Boor [1964], Lyche and Schumaker [1975], Cohen et al. [2012]. Clearly, the complication in spline fitting arises when one aims to fit the spline function basis and the partition jointly, leading to an intractable problem. As we will see in the next section, it is possible to design, for specific applications, splines that automatically adapt the partition $\Omega$ while the basis functions are fit.

2.2 Max-Affine Splines

Whenever an affine spline is constrained to be globally convex, it can be rewritten as a Max-Affine Spline (MAS). The origin of MASs is not linked to a specific paper or study; the concept arose many times, for example in the development of hinging hyperplanes. Dedicated studies of MASs have been carried out in the context of convex function approximation in Magnani and Boyd [2009], Hannah and Dunson [2013]. A MAS is a continuous, convex, and piecewise affine function that maps its input $x$ to its output via
\[
P(x; a_:, b_:) = \max_{r=1,\dots,R} \langle a_r, x\rangle + b_r. \tag{2.3}
\]
An extremely useful feature of such a spline is that it is completely determined by its parameters $a_r$ and $b_r$, $r = 1, \dots, R$, and does not require an explicit partition $\Omega$. Changes in those parameters automatically induce changes in the partition $\Omega$, meaning that they are adaptive partitioning splines [Binev et al., 2014]. A thorough study and characterization of the partition $\Omega$ induced by those parameters will be carried out in Sec. 3.4. A max-affine spline is always piecewise affine, globally convex, and hence continuous regardless of the values of its parameters $a_r \in \mathbb{R}^D$, $b_r \in \mathbb{R}$, $r = 1, \dots, R$. Conversely, any piecewise affine, globally convex, and continuous function can be written as a MAS.
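As a quick illustration, the following sketch (random toy parameters of our own, not from the thesis) evaluates a MAS and reads off the region membership implied by the argmax in (2.3).

```python
import numpy as np

# Evaluating a max-affine spline P(x) = max_r <a_r, x> + b_r and its implicit
# partition: r(x) = argmax_r <a_r, x> + b_r.
rng = np.random.default_rng(1)
R, D = 5, 2                      # number of affine pieces, input dimension
A = rng.standard_normal((R, D))  # slopes a_r stacked row-wise
b = rng.standard_normal(R)       # offsets b_r

def mas(x):
    """Return the MAS value and the index of the active affine piece at x."""
    scores = A @ x + b           # all R affine projections of x
    r = int(np.argmax(scores))   # region membership (which piece is active)
    return scores[r], r

x = np.array([0.3, -1.2])
value, region = mas(x)
print(f"P(x) = {value:.3f}, active region r(x) = {region}")
# Changing A or b changes both the function values and the induced partition.
```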

Proposition 1 For any continuous function $h \in C^0(\mathbb{R}^D)$ that is convex and piecewise affine on a partition $\Omega$ of $\mathbb{R}^D$, there exist $R > 1$ and $a_r \in \mathbb{R}^D$, $b_r \in \mathbb{R}$, $r = 1, \dots, R$ such that $h(\cdot) = P(\cdot\,; a_:, b_:)$ everywhere.

This result follows from the fact that the pointwise maximum of a collection of convex functions is convex [Roberts, 1993]; for the reciprocal, see Sec. 3.2.3 in Boyd et al. [2004]. A great benefit of MASs for convex function approximation lies in (i) the ability to solve the fitting problem as an unconstrained optimization problem (in fact, as mentioned above, the parameters can be fit arbitrarily without breaking the convexity property), and (ii) the ability to adapt the domain partition $\Omega$ while the affine parameters are tuned to better fit the target function, removing the intractable joint optimization of the partition and the per-region mappings. We now study more specifically the fitting methods for MASs.

2.3 (Max-)Affine Spline Fitting

Several methods have been proposed for fitting general affine splines to (multidimensional) data. A neural network algorithm is used in Gothoskar et al. [2002]; a Gauss-Newton method is used in Julián et al. [1998], Horst and Beichel [1997]; a reference on methods for least squares with semismooth functions is Kanzow and Petra [2004]. For our study, we are interested in fitting a MAS in the form of (2.3). Due to the special form and the convexity property of this approximant, and when considering a mean-squared error, Magnani and Boyd [2009] proposed an iterative fitting algorithm that can be interpreted as a Gauss-Newton algorithm. We report this algorithm in Algo. 1.

Iterative procedures, similar in spirit to the one we present in Algo. 1 for MASs but for specialized applications, are described in Phillips and Rosenfeld [1988], Yin [1998], Ferrari-Trecate and Muselli [2002], Kim et al. [2004]. For additional references on affine spline fitting with $D = 1$, Dunham [1986] proposes to find the minimum number of segments achieving a given maximum error, Goodrich [1994], Bellman and Roth [1969], Hakimi and Schmeichel [1991], Wang et al. [1993] propose dynamic programming methods to solve the affine spline fitting problem, and Pittman and Murthy [2000] propose genetic algorithms. For $D = 2$, Aggarwal et al. [1989], Mitchell and Suri [1995] propose variants of the univariate fitting solutions.

Algorithm 1  Description of the MAS fitting procedure when considering a mean-squared error. The algorithm consists of successively fitting the per-region affine mappings to the samples that are within each region, and then updating the partition. This method can be seen as the solution of the Gauss-Newton algorithm with a MAS approximant. Convergence is not guaranteed; for examples of such failure cases, see Sec. 3.3 of Magnani and Boyd [2009].

procedure MEAN-SQUARE MAX-AFFINE SPLINE FITTING($\mathcal{D}$, $T_{\rm limit} \in \mathbb{N}^+$)
  $T \leftarrow 0$  (set counter)
  $\Omega^{(T)} = \{\omega_1^{(T)}, \dots, \omega_K^{(T)}\},\ K \leq R$  (initialize a partition of $\mathcal{D}$)
  while $T < T_{\rm limit}$ do
    for $r = 1, \dots, |\Omega^{(T)}|$ do
      if $|\omega_r^{(T)}| = 0$ then break
      $\begin{bmatrix} a_r^{(T+1)} \\ b_r^{(T+1)} \end{bmatrix} = \arg\min_{a,b} \sum_{(x,y)\in\omega_r^{(T)}} \|\langle a, x\rangle + b - y\|_2^2 = \Big( \sum_{(x,y)\in\omega_r^{(T)}} \begin{bmatrix} xx^T & x \\ x^T & 1 \end{bmatrix} \Big)^{-1} \Big( \sum_{(x,y)\in\omega_r^{(T)}} \begin{bmatrix} yx \\ y \end{bmatrix} \Big)$
    $\Omega^{(T+1)} = \{\omega_1^{(T+1)}, \dots, \omega_R^{(T+1)}\}$ with $\omega_r^{(T+1)} = \{(x,y) \in \mathcal{D} : \arg\max_{r'=1,\dots,R} \langle a_{r'}^{(T+1)}, x\rangle + b_{r'}^{(T+1)} = r\}$
    if $\Omega^{(T+1)} = \Omega^{(T)}$ then Exit else $T \leftarrow T + 1$
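For readers who prefer running code, here is a compact Python sketch of the same alternating procedure; the variable names, random initialization, and stopping details are our own choices and not taken from Magnani and Boyd [2009].

```python
import numpy as np

# Alternate between (i) per-region least-squares fitting of an affine piece and
# (ii) re-assigning each sample to the piece attaining the maximum.
def fit_mas(X, y, R, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    assign = rng.integers(0, R, size=N)   # random initial partition of the data
    A = np.zeros((R, D))
    b = np.zeros(R)
    for _ in range(n_iter):
        # (i) Per-region least-squares fit of the affine piece.
        for r in range(R):
            mask = assign == r
            if not mask.any():
                continue                   # empty region: keep previous parameters
            Xr = np.hstack([X[mask], np.ones((mask.sum(), 1))])
            sol, *_ = np.linalg.lstsq(Xr, y[mask], rcond=None)
            A[r], b[r] = sol[:-1], sol[-1]
        # (ii) Partition update: each sample goes to the argmax affine piece.
        new_assign = np.argmax(X @ A.T + b, axis=1)
        if np.array_equal(new_assign, assign):
            break                          # partition unchanged: fixed point reached
        assign = new_assign
    return A, b

# Fit a convex target: f(x) = ||x||^2 sampled on the square [-1, 1]^2.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(500, 2))
y = np.sum(X ** 2, axis=1)
A, b = fit_mas(X, y, R=8)
y_hat = np.max(X @ A.T + b, axis=1)
print("MAS fit MSE:", np.mean((y - y_hat) ** 2))
```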

Chapter 3

Deep Networks: Composition of Max-Affine Spline Operators

In this chapter, we exploit Max-Affine Splines (MASs) to formulate Deep Networks (DNs) as a composition of Max-Affine Spline Operators (MASOs), a multivariate-output version of MASs. We will first develop the MASO formulation, demonstrate how each DN layer can be expressed as a MASO, and then express any DN as a composition of MASOs. We conclude this chapter with a dedicated study of the DN input space partition.

3.1 Max-Affine Spline Operators

A natural extension of a Max-Affine Spline (MAS) function is a max-affine spline operator (MASO) $M(\,\cdot\,; A_:, b_:)$ that produces a multivariate output. It is obtained simply by concatenating $K$ independent max-affine spline functions from (2.3). A MASO mapping a $D$-dimensional input to a $K$-dimensional output has slope parameters $A_r \in \mathbb{R}^{K \times D}$ and offset parameters $b_r \in \mathbb{R}^K$ and is defined as
\[
M(x; A_:, b_:) = \max_{r=1,\dots,R} \left( A_r x + b_r \right) = \begin{bmatrix} \max_{r=1,\dots,R} \langle [A_r]_{1,:}, x\rangle + [b_r]_1 \\ \vdots \\ \max_{r=1,\dots,R} \langle [A_r]_{K,:}, x\rangle + [b_r]_K \end{bmatrix}, \tag{3.1}
\]
where the maximum is taken component-wise. Since a MASO is built from $K$ independent MASs and can be seen as producing its output by stacking the output of each MAS into a vector, it has a property analogous to Proposition 1.

Proposition 3.1
For any operator $H(x) = [h_1(x), \dots, h_K(x)]^T$ with $h_k \in C^0(\mathbb{R}^D)\ \forall k$ that are convex and piecewise affine on their respective partitions $\Omega_k$ of $\mathbb{R}^D$, there exist $A_:, b_:$ such that $H(\cdot) = M(\cdot\,; A_:, b_:)$ everywhere.
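The following minimal sketch (toy parameters of our own) evaluates a MASO exactly as in (3.1): $K$ independent max-affine splines sharing the same input, each selecting its own active affine piece.

```python
import numpy as np

# Evaluating a MASO as K stacked max-affine splines (component-wise maximum).
rng = np.random.default_rng(2)
R, K, D = 3, 4, 5                       # pieces, output dim, input dim
A = rng.standard_normal((R, K, D))      # A_r stacked along the first axis
b = rng.standard_normal((R, K))         # b_r stacked along the first axis

def maso(x):
    """Return M(x) and, per output unit k, the index r_k(x) of the active piece."""
    scores = np.einsum('rkd,d->rk', A, x) + b  # shape (R, K): all affine maps
    r_per_unit = np.argmax(scores, axis=0)     # independent argmax per unit
    return scores.max(axis=0), r_per_unit

x = rng.standard_normal(D)
out, active = maso(x)
print("M(x) =", np.round(out, 3), " active pieces per unit:", active)
```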

The development of MASOs is crucial for our purposes since DNs compose multiple multivariate mappings. The goal of the next section is to demonstrate that a MASO can be used to formulate most current DN layers. From that, it will become clear that an entire DN input-output mapping is nothing other than a composition of MASOs.

3.2 From Deep Network Layers to Max-Affine Spline Operators

We begin by showing that the DN layers defined in Section 1.2 are MASOs, and we demonstrate in each case the corresponding parameters $A_:, b_:$ (recall (3.1)). The next section will concern the reformulation of the entire DN as a composition of MASOs.

A dense layer, which consists of an unconstrained affine transformation of the input followed by a pointwise nonlinearity, can be expressed as a MASO as long as the nonlinearity is convex and piecewise affine. It turns out that most currently employed nonlinearities fall into that category (ReLU, leaky-ReLU, absolute value). As a result, and following the notations from Sec. 1.2, a dense layer can be expressed as a MASO with $R = 2$ and parameters
\[
A_1 = W, \qquad A_2 = \alpha W, \tag{3.2}
\]
\[
b_1 = b, \qquad b_2 = \alpha b, \tag{3.3}
\]
with $\alpha$ being $0$ for ReLU, $-1$ for absolute value, and $\alpha > 0$ for leaky-ReLU.
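A quick numerical sanity check of this equivalence (random weights of our own, not from the thesis): the $R = 2$ MASO of (3.2)-(3.3) reproduces a leaky-ReLU dense layer exactly.

```python
import numpy as np

# Check: max(Wx + b, alpha*(Wx + b)) == leaky_relu(Wx + b) elementwise.
rng = np.random.default_rng(3)
D_in, D_out, alpha = 6, 4, 0.1
W = rng.standard_normal((D_out, D_in))
b = rng.standard_normal(D_out)
x = rng.standard_normal(D_in)

pre = W @ x + b
leaky_relu = np.where(pre >= 0, pre, alpha * pre)                  # usual layer output
maso_out = np.maximum(W @ x + b, alpha * (W @ x) + alpha * b)      # R=2 MASO of (3.2)-(3.3)
assert np.allclose(leaky_relu, maso_out)
print("dense layer and its R=2 MASO agree:", np.round(maso_out, 3))
```

Setting alpha to 0 or -1 in the same script recovers the ReLU and absolute-value cases, respectively.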

The case of a convolutional layer is similar to that of a dense layer. The only change is to replace the unconstrained slope matrix $W$ and bias vector by their constrained counterparts. That is, the MASO of a convolutional layer has $R = 2$ and parameters
\[
A_1 = C, \qquad A_2 = \alpha C, \tag{3.4}
\]
\[
b_1 = b, \qquad b_2 = \alpha b, \tag{3.5}
\]
where the same values of $\alpha$ hold for each nonlinearity.

The case of a max-pooling layer (without any preceding affine mapping) can be expressed as a MASO as well. In that case, the number of mappings $R$ corresponds to the number of dimensions that the max-pooling is applied over. In a computer vision setting with the common $2 \times 2$ max-pooling, one would have $R = 4$. We thus have the following MASO parameters
\[
[A_r]_{k,d} = 1_{\{[\mathcal{R}_k]_r = d\}}\ \forall r, \qquad b_r = 0\ \forall r. \tag{3.6}
\]
That is, the matrices $A_r$ are filled with 0 and 1 values; each row $k$ contains a single 1, positioned at the $r$th index of the pooling region $\mathcal{R}_k$ that produces the output of the corresponding dimension $k$.
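To illustrate (3.6), the sketch below (our own toy setup) writes a single pooling region of size 4 as an $R = 4$ MASO with 0/1 selection matrices and recovers ordinary max-pooling.

```python
import numpy as np

# A pooling region over a flattened 4-dim input, written as the R=4 MASO of (3.6).
pool_regions = [[0, 1, 2, 3]]            # one pooling region (K=1 output unit)
R, K, D = 4, len(pool_regions), 4

A = np.zeros((R, K, D))
for k, region in enumerate(pool_regions):
    for r, d in enumerate(region):
        A[r, k, d] = 1.0                 # [A_r]_{k,d} = 1 iff region k's r-th index is d
b = np.zeros((R, K))

x = np.array([0.2, -1.0, 3.5, 0.7])
maso_out = (np.einsum('rkd,d->rk', A, x) + b).max(axis=0)
assert np.allclose(maso_out, x[pool_regions].max(axis=1))
print("max-pooling via MASO:", maso_out)
```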

The case of a maxout layer follows directly from the max-pooling case since it simply corresponds to a max-pooling layer in which an affine transform is added before the pooling operator. As such, the associated MASO will have $R$ set based on the number of dimensions that are being pooled, and the parameters are given by
\[
[A_r]_{k,:} = [W]_{[\mathcal{R}_k]_r,:}\ \forall r, \qquad [b_r]_k = [b]_{[\mathcal{R}_k]_r}\ \forall r. \tag{3.7}
\]

Another important case occurs when adding a residual connection to any given layer. This can also be modeled easily with a MASO as follows. First, formulate the given layer without the residual connection as a MASO with one of the above formulations. This provides a MASO parametrization $A_:, b_:$. Now, to add the residual connection to this layer, one simply adds the residual affine parameters to all the MASO parameters, i.e., for each $r = 1, \dots, R$,
\[
A_r \leftarrow A_r + W_{\rm res}\ \forall r, \qquad b_r \leftarrow b_r + b_{\rm res}\ \forall r. \tag{3.8}
\]
In the special case of a skip-connection, one would set $W_{\rm res}$ to be the identity matrix and $b_{\rm res}$ to be $0$. While the above does not explicitly cover all the possible layers that one can form by combining various operators, the same recipe can be applied. We formalize the generality of this formulation in the following result.

Proposition 3.2 (DN layer as MASO)
Any DN layer (recall Def. 1.1) that uses a continuous, convex, and piecewise affine (for each output dimension) nonlinear operator, and any (if any) preceding linear operator, can be expressed as a MASO.

It will be convenient to abstract away the region selection $r$ based on a given input, and we thus introduce the following notation
\[
M(x; A_:, b_:) = A_x x + b_x, \tag{3.9}
\]
where the input-induced affine parameters are given by
\[
A_x \triangleq \begin{bmatrix} [A_{r_1(x)}]_{1,:}^T \\ \vdots \\ [A_{r_K(x)}]_{K,:}^T \end{bmatrix}, \quad b_x \triangleq \begin{bmatrix} [b_{r_1(x)}]_1 \\ \vdots \\ [b_{r_K(x)}]_K \end{bmatrix}, \quad r_k(x) = \arg\max_{r=1,\dots,R} \left( \langle [A_r]_{k,:}, x\rangle + [b_r]_k \right), \tag{3.10}
\]
hence the parameters $A_x, b_x$ simply correspond to the slope and bias parameters responsible for producing the input-output mapping for the given input $x$. Similarly, we denote by $A_\omega, b_\omega$ the parameters $A_x, b_x$ obtained from any $x \in \omega$. It will become convenient in the coming sections to index those parameters by the layer index, as in $R^{(\ell)}$, $A^{(\ell)}$ and $b^{(\ell)}$. From this result, we are now able to express any DN that composes layers fulfilling Prop. 3.2. We propose to do this reformulation in the next section, where we will focus on two specific architectures to see how such a formulation can aid in comparing models from a data modeling viewpoint.
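The next sketch (toy parameters of our own) extracts $A_x, b_x$ of (3.10) for a given input and verifies the affine form (3.9) numerically.

```python
import numpy as np

# Recover the input-induced affine parameters of a MASO and check M(x) = A_x x + b_x.
rng = np.random.default_rng(4)
R, K, D = 3, 4, 5
A = rng.standard_normal((R, K, D))
b = rng.standard_normal((R, K))

def input_induced_affine(x):
    scores = np.einsum('rkd,d->rk', A, x) + b   # (R, K) affine scores
    r = np.argmax(scores, axis=0)               # r_k(x) for each unit k
    A_x = A[r, np.arange(K), :]                 # row k taken from A_{r_k(x)}
    b_x = b[r, np.arange(K)]
    return A_x, b_x

x = rng.standard_normal(D)
A_x, b_x = input_induced_affine(x)
maso_out = (np.einsum('rkd,d->rk', A, x) + b).max(axis=0)
assert np.allclose(maso_out, A_x @ x + b_x)     # (3.9) holds at x
print("A_x shape:", A_x.shape, " b_x:", np.round(b_x, 3))
```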

3.3 Composition of Max-Affine Spline Operators

We first formalize the ability to express DNs as a composition of MASOs. This result will be the key to the entire thesis, as it opens the door to further analysis of DNs from a spline perspective.

Theorem 3.1 (DNs as MASO composition)

A DN constructed from an arbitrary composition of layers that fulfill the conditions of Prop. 3.2 can be formulated as a composition of MASOs; the overall composition is itself a continuous affine spline operator.

DNs covered by Theorem 3.1 include CNNs, ResNets, inception networks, maxout networks, network-in-networks, scattering networks, and their variants using fully connected/convolution operators, (leaky) ReLU or absolute value activations, and max/mean pooling. Thanks to the ability to express any layer as a MASO, we can express the entire DN input-output mapping $F_\Theta$ as
\[
F_\Theta(x) = \left( \prod_{\ell=0}^{L-1} A_x^{(L-\ell)} \right) x + \sum_{\ell=1}^{L} \left( \prod_{j=0}^{L-\ell-1} A_x^{(L-j)} \right) b_x^{(\ell)}. \tag{3.11}
\]
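The following sketch (a random toy 3-layer leaky-ReLU network of our own) builds the per-layer input-induced affine parameters, composes them as in (3.11), and checks that the result matches the forward pass.

```python
import numpy as np

# Verify F(x) = A_x x + b_x with A_x and b_x accumulated layer by layer.
rng = np.random.default_rng(5)
dims = [4, 6, 6, 3]                       # D^(0), D^(1), D^(2), D^(3)
alpha = 0.2
Ws = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(3)]
bs = [rng.standard_normal(dims[l + 1]) for l in range(3)]

def forward(x):
    for W, b in zip(Ws, bs):
        pre = W @ x + b
        x = np.where(pre >= 0, pre, alpha * pre)   # leaky-ReLU layer
    return x

def end_to_end_affine(x):
    A_tot, b_tot, z = np.eye(dims[0]), np.zeros(dims[0]), x
    for W, b in zip(Ws, bs):
        pre = W @ z + b
        q = np.where(pre >= 0, 1.0, alpha)         # piecewise-constant derivative
        A_l = np.diag(q) @ W                       # layer slope A^(l)_x
        b_l = np.diag(q) @ b                       # layer bias  b^(l)_x
        A_tot = A_l @ A_tot                        # compose slopes
        b_tot = A_l @ b_tot + b_l                  # accumulate biases
        z = np.where(pre >= 0, pre, alpha * pre)
    return A_tot, b_tot

x = rng.standard_normal(dims[0])
A_x, b_x = end_to_end_affine(x)
assert np.allclose(forward(x), A_x @ x + b_x)
print("F(x) =", np.round(forward(x), 3))
```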

Note however that DNs of the form stated in Theorem 3.1 cannot in general be written as a single MASO, since the composition of two or more MASOs is not necessarily a convex operator (it is merely a continuous affine spline operator). Indeed, a composition of MASOs remains convex if and only if all of the intermediate operators are non-decreasing with respect to each of their output dimensions [Boyd et al., 2004]. Interestingly, ReLU, max-pooling, and average pooling are all non-decreasing, while leaky ReLU is strictly increasing. The culprits of the non-convexity of the composition of operators are negative entries in the fully connected and convolution slope matrices. A DN where these culprits are thwarted is an interesting special case, because it is convex with respect to its input [Amos et al., 2016] and multiconvex [Xu and Yin, 2013] with respect to its parameters (i.e., convex with respect to each operator's parameters while the other operators' parameters are held constant). The MASO form allows us to simply formalize those constraints based on the $A_:$ parameters.

Theorem 3.2 (Globally convex DNs)
A MASO DN whose layers $\ell = 2, \dots, L$ have nonnegative MASO slopes, i.e., $[A_r^{(\ell)}]_{i,j} \geq 0$, $\forall (r, i, j) \in \{1,\dots,R^{(\ell)}\} \times \{1,\dots,D^{(\ell)}\} \times \{1,\dots,D^{(\ell-1)}\}$, is globally convex with respect to each of its output dimensions.

Note that Theorem 3.2 remains true regardless of the MASO parameters of the first layer. Input convexity is a beneficial property that can be leveraged for specific applications, as optimization over the input becomes a convex optimization problem [Amos et al., 2016]. We now propose to dive in more detail into characterizing the DN input space partition, i.e., the collection $\Omega$ of regions of the DN input space in which the DN input-output mapping remains linear. This explicit analytical characterization is crucial, as understanding the partition of a spline operator opens the door to further theoretical study, such as of generalization performance.
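As a small empirical check of Theorem 3.2 on one of its simplest instances (a bias-free ReLU network of our own with nonnegative weights after the first layer), the sketch below tests the midpoint convexity inequality along random segments.

```python
import numpy as np

# Midpoint convexity check: f((u+v)/2) <= (f(u)+f(v))/2 for a convex f.
rng = np.random.default_rng(6)
W1 = rng.standard_normal((8, 3))           # first layer: unconstrained
W2 = np.abs(rng.standard_normal((4, 8)))   # deeper layers: nonnegative slopes
W3 = np.abs(rng.standard_normal((1, 4)))

def f(x):
    h = np.maximum(W1 @ x, 0)              # ReLU layers (no bias for brevity)
    h = np.maximum(W2 @ h, 0)
    return (W3 @ h)[0]

for _ in range(1000):
    u, v = rng.standard_normal(3), rng.standard_normal(3)
    assert f((u + v) / 2) <= (f(u) + f(v)) / 2 + 1e-9
print("midpoint convexity held on all sampled segments")
```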

DN input-output mapping remains linear. This explicit analytical characterization is crucial as understanding the partition of a spline operator opens the door to further theoretical study such as generalization performances. 38

3.4 Deep Networks Input Space Partition: Power Diagram Subdivision

One of the key elements for any spline function is its input space partition $\Omega$. From it, results on generalization and approximation can be obtained, as well as a better understanding of the approximant behavior via the study of the regions' shapes. Other works have focused on the properties of the partitioning, such as upper bounding the number of regions [Montufar et al., 2014, Raghu et al., 2017, Hanin and Rolnick, 2019] or providing an explicit characterization of the input space partitioning of a single-layer DN with ReLU activation [Zhang et al., 2018b] by means of tropical geometry. We propose in this section to characterize the DN input space partition with more generality by providing results that apply to any MASO-based DN, regardless of the underlying width/depth/layers. To do so, we adopt a computational and combinatorial geometry [Pach and Agarwal, 2011, Preparata and Shamos, 2012] perspective of MASO-based DNs to derive the analytical form of the input-space partition of a DN unit, a DN layer, and an entire end-to-end DN. We demonstrate that each DN layer performs a partitioning according to a Power Diagram [Aurenhammer and Imai, 1988] with a large number of regions, and that those Power Diagrams are subdivided in a special way to create the overall DN input-space partition.

3.4.1 Voronoi Diagrams and Power Diagrams

In order to precisely derive our result on the DN input space partition, we first need to remind the reader of some specific input space partitions, namely Voronoi diagrams and power diagrams.

Definition 3.1 (Voronoi Diagram) A Voronoi diagram (VD) [Voronoi, 1908] partitions a space $\mathcal{X}$ into $R$ regions $\Omega = \{\omega_1, \dots, \omega_R\}$ where each cell is obtained via $\omega_r = \{x \in \mathcal{X} : r(x) = r\}$, $r = 1, \dots, R$, with
\[
r(x) = \arg\min_{k=1,\dots,R} \| x - [\mu]_{k,:} \|^2. \tag{3.12}
\]
The parameter $[\mu]_{k,:}$ is called the centroid.

VDs are also known as Dirichlet tessellations, and the Voronoi regions are also known as Thiessen polygons. For a thorough study of VDs we refer the reader to Aurenhammer [1991]. A power diagram (PD), also known as a Laguerre-Voronoi diagram, is a generalization of the classical Voronoi diagram (VD).

Definition 3.2 (Power Diagram) A power diagram (PD) [Aurenhammer and Imai, 1988] partitions a space $\mathcal{X}$ into at most $R$ regions $\Omega = \{\omega_1, \dots, \omega_R\}$ where each cell is obtained via $\omega_r = \{x \in \mathcal{X} : r(x) = r\}$, $r = 1, \dots, R$, with
\[
r(x) = \arg\min_{k=1,\dots,R} \| x - [\mu]_{k,:} \|^2 - [\mathrm{rad}]_k. \tag{3.13}
\]
The parameter $[\mu]_{k,:}$ is called the centroid, while $[\mathrm{rad}]_k$ is called the radius. The distance minimized in (3.13) is called the Laguerre distance [Imai et al., 1985].

When the radii are equal for all $k$, a PD collapses to a VD. See Fig. 3.1 for two equivalent geometric interpretations of a PD. For additional insights, see Preparata and Shamos [2012]. We will have the occasion to use negative radii in our development below. Since $\arg\min_k \|x - [\mu]_{k,:}\|^2 - [\mathrm{rad}]_k = \arg\min_k \|x - [\mu]_{k,:}\|^2 - ([\mathrm{rad}]_k + \rho)$, we can always apply a constant shift $\rho$ to all of the radii to make them positive. In general, a PD is defined with nonnegative radii to provide additional geometric interpretations.
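A short sketch (our own toy data) of the cell assignment in (3.13); with equal radii the assignment reduces to an ordinary Voronoi diagram, as noted above.

```python
import numpy as np

# Assign points to Power Diagram cells via the Laguerre distance.
rng = np.random.default_rng(7)
R, D = 6, 2
mu = rng.standard_normal((R, D))           # centroids [mu]_{k,:}
rad = rng.uniform(0, 2, R)                 # radii [rad]_k

def pd_assign(X, mu, rad):
    """Return the PD cell index for each row of X."""
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # (N, R)
    return np.argmin(sq_dist - rad, axis=1)

X = rng.standard_normal((5, D))
print("PD cells:", pd_assign(X, mu, rad))
print("VD cells:", pd_assign(X, mu, np.zeros(R)))   # equal radii: Voronoi
```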

The Laguerre distance corresponds to the length of the line segment that starts at $x \in \mathcal{X}$ and ends at the tangent to the hypersphere with center $[\mu]_{k,:}$ and radius $[\mathrm{rad}]_k$ (see Fig. 3.1). The hyperplanar boundary between two adjacent power diagram (PD) regions can be characterized in terms of the chordale of the corresponding hyperspheres [Johnson, 1960]. Doing so for all adjacent boundaries fully characterizes the region boundaries in simple terms of hyperplane intersections [Aurenhammer, 1987]. Those two mathematical objects will be enough for us to build a complete characterization of the DN input space partition, to which we now turn, first for the single layer case and then for the multilayer case.

Figure 3.1: Two equivalent representations of a power diagram (PD). Top: The grey circles have centers $[\mu]_{k,:}$ and radii $[\mathrm{rad}]_k$; each point $x$ is assigned to a specific region/cell according to the Laguerre distance from the centers, which is defined as the length of the segment tangent to and starting on the circle and reaching $x$. Bottom: A PD in $\mathbb{R}^D$ (here $D = 2$) is constructed by lifting the centroids $[\mu]_{k,:}$ up into an additional dimension in $\mathbb{R}^{D+1}$ by the distance $[\mathrm{rad}]_k$ and then finding the Voronoi diagram (VD) of the augmented centroids $([\mu]_{k,:}, [\mathrm{rad}]_k)$ in $\mathbb{R}^{D+1}$. The intersection of this higher-dimensional VD with the originating space $\mathbb{R}^D$ yields the PD.

3.4.2 Single Layer: Power Diagram

A MASO layer combines $K$ max-affine spline (MAS) units to produce the layer output given its input. To streamline our argument, we omit the $\ell$ superscript and denote the layer input by $x$, with $\mathcal{X}$ the layer's domain. It shall be clear that each MAS indirectly encodes a partition of its input space, where each region corresponds to the collection of inputs that are mapped via the same affine mapping. In other words, the partition $\Omega_k$ of the $k$th MAS mapping in a MASO is obtained via
\[
\Omega_k = \{\omega_{k,1}, \dots, \omega_{k,R}\},
\]
where each region $\omega_{k,r}$ is the collection of inputs given by
\[
\omega_{k,r} = \Big\{ x \in \mathcal{X} : \arg\max_{r'=1,\dots,R} \langle [A_{r'}]_{k,:}, x\rangle + [b_{r'}]_k = r \Big\}.
\]

Following simple calculus, we can rewrite the region assignment as follows:
\begin{align*}
\omega_{k,r} &= \Big\{ x \in \mathcal{X} : \arg\max_{r'=1,\dots,R} \big( \langle [A_{r'}]_{k,:}, x\rangle + [b_{r'}]_k \big) = r \Big\} \\
&= \Big\{ x \in \mathcal{X} : \arg\min_{r'=1,\dots,R} \big( -2\langle [A_{r'}]_{k,:}, x\rangle - 2[b_{r'}]_k \big) = r \Big\} && \text{(sign change, scaling)} \\
&= \Big\{ x \in \mathcal{X} : \arg\min_{r'=1,\dots,R} \big( -2\langle [A_{r'}]_{k,:}, x\rangle - 2[b_{r'}]_k + \|x\|_2^2 \big) = r \Big\} && \text{(adding a constant)} \\
&= \Big\{ x \in \mathcal{X} : \arg\min_{r'=1,\dots,R} \big( -2\langle [A_{r'}]_{k,:}, x\rangle - 2[b_{r'}]_k + \|[A_{r'}]_{k,:}\|_2^2 - \|[A_{r'}]_{k,:}\|_2^2 + \|x\|_2^2 \big) = r \Big\} \\
&= \Big\{ x \in \mathcal{X} : \arg\min_{r'=1,\dots,R} \big( \|x - [A_{r'}]_{k,:}\|_2^2 - 2[b_{r'}]_k - \|[A_{r'}]_{k,:}\|_2^2 \big) = r \Big\},
\end{align*}
where, by identification, and denoting $2[b_{r'}]_k + \|[A_{r'}]_{k,:}\|_2^2$ as the radius term, we see that a MAS partitions its input space according to a Power Diagram.

Theorem 3.3 (MAS partition)
The $k$th MAS unit of a MASO partitions its input space according to a PD with $R$ centroids and radii given by $[\mu]_{r,:} = [A_r]_{k,:}$ and $[\mathrm{rad}]_r = 2[b_r]_k + \|[A_r]_{k,:}\|_2^2$, $\forall r \in \{1,\dots,R\}$ (recall (3.13)).
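The identification in Theorem 3.3 is easy to check numerically; the sketch below (a random toy MAS unit of our own) compares the argmax region of the MAS with the Power Diagram cell obtained from the stated centroids and radii.

```python
import numpy as np

# Check: argmax_r <a_r, x> + b_r == argmin_r ||x - a_r||^2 - (2 b_r + ||a_r||^2).
rng = np.random.default_rng(8)
R, D = 7, 3
A = rng.standard_normal((R, D))            # slopes a_r of one MAS unit
b = rng.standard_normal(R)                 # offsets b_r

mu = A                                     # PD centroids
rad = 2 * b + np.sum(A ** 2, axis=1)       # PD radii

X = rng.standard_normal((1000, D))
mas_region = np.argmax(X @ A.T + b, axis=1)
laguerre = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1) - rad
pd_region = np.argmin(laguerre, axis=1)
assert np.array_equal(mas_region, pd_region)
print("argmax(MAS) and argmin(Laguerre) agree on all", len(X), "points")
```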

Going from the partition of a single unit, $\Omega_k$, of the MASO layer to the entire layer input space partition $\Omega$ is done by studying the joint behavior of all the layer's constituent units. A MASO layer is a continuous, piecewise affine operator made by the concatenation of $K$ MAS units (recall (3.1)). This operator is linear in the regions of its domain where all the MAS units are jointly linear. From this, it is direct to see that $\Omega$ will involve all the possible intersections of the regions from $\Omega_1, \dots, \Omega_K$. We can formally obtain the exact form of the partition as follows. Denote a region $\omega_{\boldsymbol{r}}$ with $\boldsymbol{r} \in \{1,\dots,R\}^K$ as
\[
\omega_{\boldsymbol{r}} = \{ x \in \mathcal{X} : r_k(x) = [\boldsymbol{r}]_k,\ k = 1, \dots, K \},
\]
where $r_k(x)$ is taken from (3.10). So $\omega_{\boldsymbol{r}}$ is the (possibly empty) region of the layer domain that contains all the inputs with the specified argmax values for each of the units, based on the provided integer vector $\boldsymbol{r}$. To be consistent with our previous derivation, we will index $\omega$ with an integer given by $I(\boldsymbol{r}) = \sum_{k=1}^{K} R^{k-1}([\boldsymbol{r}]_k - 1)$. Clearly, the mapping $I$ is a bijection between $\{1,\dots,R\}^K$ and $\{0,\dots,R^K - 1\}$; $I$ can be seen as a change of basis of its integer input from base 10 to base $R$; conversely, $I^{-1}$ is the inverse mapping. Following a similar approach as for the MAS case, we obtain

\begin{align*}
\omega_{\boldsymbol{r}} &= \Big\{ x \in \mathcal{X} : r_k(x) = [\boldsymbol{r}]_k,\ k = 1,\dots,K \Big\} \\
&= \Big\{ x \in \mathcal{X} : \arg\max_{\boldsymbol{r}' \in \{1,\dots,R\}^K} \sum_{k=1}^{K} \big( \langle [A_{[\boldsymbol{r}']_k}]_{k,:}, x\rangle + [b_{[\boldsymbol{r}']_k}]_k \big) = \boldsymbol{r} \Big\} && \text{(indep. max.)} \\
&= \Big\{ x \in \mathcal{X} : \arg\min_{\boldsymbol{r}' \in \{1,\dots,R\}^K} -2\sum_{k=1}^{K} \langle [A_{[\boldsymbol{r}']_k}]_{k,:}, x\rangle - 2\sum_{k=1}^{K}[b_{[\boldsymbol{r}']_k}]_k = \boldsymbol{r} \Big\} \\
&= \Big\{ x \in \mathcal{X} : \arg\min_{\boldsymbol{r}' \in \{1,\dots,R\}^K} -2\sum_{k=1}^{K} \langle [A_{[\boldsymbol{r}']_k}]_{k,:}, x\rangle - 2\sum_{k=1}^{K}[b_{[\boldsymbol{r}']_k}]_k + \|x\|_2^2 = \boldsymbol{r} \Big\} \\
&= \Big\{ x \in \mathcal{X} : \arg\min_{\boldsymbol{r}' \in \{1,\dots,R\}^K} \Big\| x - \sum_{k=1}^{K} [A_{[\boldsymbol{r}']_k}]_{k,:} \Big\|_2^2 - 2\sum_{k=1}^{K}[b_{[\boldsymbol{r}']_k}]_k - \Big\| \sum_{k=1}^{K} [A_{[\boldsymbol{r}']_k}]_{k,:} \Big\|_2^2 = \boldsymbol{r} \Big\},
\end{align*}

where, by identification, we can see that again we fall back to a Power Diagram, thanks to the independent maximization process that is carried out for each unit of the MASO.

Theorem 3.4 (MASO partition)
A DN layer partitions its input space according to a PD containing up to $R^K$ regions with centroids $\mu_r = \sum_{k=1}^{K} [A_{[I^{-1}(r)]_k}]_{k,:}$ and radii $\mathrm{rad}_r = 2\sum_{k=1}^{K} [b_{[I^{-1}(r)]_k}]_k + \|\mu_r\|^2$. The input space partition of a DN layer is composed of convex polytopes.

As a result, each layer in a DN partitions its own input space according to a PD with the above parameters. The composition-of-layers case is described in the next section and heavily relies on the above result on the MASO partition.

3.4.3 Composition of Layers: Power Diagram Subdivision

We provide the formula for the input space partition of an $L$-layer DN by means of a recursion. Since we will now consider multiple layers, we bring back the superscript indexing of the per-layer quantities that we will study. The input space of layer $\ell$ is $\mathcal{X}^{(\ell-1)}$; the partition of this input space with respect to the layer PD is $\Omega^{(\ell)}$.

Initialization ($\ell = 0$): Define the region of interest in the input space $\mathcal{X}^{(0)} \subset \mathbb{R}^D$.

First step ($\ell = 1$): The first layer subdivides $\mathcal{X}^{(0)}$ into a PD via Theorem 3.4 to obtain the layer-1 partition $\Omega^{(1)}$.

Recursion step ($\ell = 2$): The second layer subdivides $\mathcal{X}^{(1)}$ into a PD via Theorem 3.4 to obtain the layer-2 partition $\Omega^{(2)}$. Jointly, the first layer units map $\mathcal{X}^{(0)}$ into $\mathcal{X}^{(1)}$ but remain a simple affine mapping in each region of the first layer's partition. Hence, each convex polytope $\omega^{(1)} \in \Omega^{(1)}$ that lives in the first layer's (and the DN's) input space is mapped to another convex polytope in $\mathcal{X}^{(1)}$, the second layer's input space, via
\[
\mathrm{aff}_{\omega^{(1)}} = \left\{ A^{(1)}_{\omega^{(1)}} x + b^{(1)}_{\omega^{(1)}},\ x \in \omega^{(1)} \subset \mathcal{X}^{(0)} \right\}, \quad \forall \omega^{(1)} \in \Omega^{(1)}. \tag{3.14}
\]

Figure 3.2: Visual depiction of the subdivision process that occurs when a deeper layer $\ell$ refines/subdivides an already built up-to-layer-$(\ell-1)$ partition $\Omega^{(1,\dots,\ell-1)}$. We depict here a toy model (2-layer DN) with 3 units at the first layer (leading to 4 regions) and 8 units at the second layer, with random weights and biases. The colors show the DN input space partitioning with respect to the first layer, $\Omega^{(1)}$; each layer-1 region $\omega^{(1)}$ leads to a different PD subdivision. Then, for each color (or region), the composition of layers 1 and 2 defines a specific PD that subdivides this aforementioned region (first row), where the region is colored and the PD is depicted for the whole input space. This subdivision is then applied onto the first-layer region only, as it only subdivides that region (second row, right). Finally, grouping together this process for each of the 4 regions, we obtain the layer-1/layer-2 input space partitioning $\Omega^{(1,2)}$ (second row, left).

As a result, it is clear that the partition $\Omega^{(2)}$ of $\mathcal{X}^{(1)}$ possibly subdivides $\mathrm{aff}_{\omega^{(1)}}$ into smaller regions. As the first layer is linear in this part of the space, we can effectively express the PD that subdivides each region $\mathrm{aff}_{\omega^{(1)}}$ back into the DN input space by replacing $x$ with $A^{(1)}_{\omega^{(1)}} x + b^{(1)}_{\omega^{(1)}}$ in Thm. 3.4. Repeating this subdivision process for all regions $\omega^{(1)}$ from $\Omega^{(1)}$ forms the subdivided input space partition of both layers, $\Omega^{(1,2)}$. See Fig. 3.2 for a numerical example with a 2-layer DN and a $D = 2$ dimensional input space.

Recursion step ($\ell$): Consider the situation at layer $\ell$, knowing $\Omega^{(1,\dots,\ell-1)}$ from the previous subdivision steps. Similarly to the $\ell = 2$ step, layer $\ell$ subdivides each cell in $\Omega^{(1,\dots,\ell-1)}$ to produce $\Omega^{(1,\dots,\ell)}$, leading to the up-to-layer-$\ell$ DN partition $\Omega^{(1,\dots,\ell)}$.

Theorem 3.5 (DN partition)
The DN input space partition is a Power Diagram subdivision; the number of subdivisions is at most the number of layers; at step $\ell$, the subdivision of a previously built region subdivides it into 1 up to $D^{(\ell)}$ regions; at each step, the subdivisions of different regions are not independent.

The subdivision recursion provides a direct result on the shape of the regions of the DN input space partition, which we formalize in its own statement below.

Corollary 3.1 (Region convexity)

For any number of MASO layers L ≥ 1, and any type of layer (as long as they are

MASOs), the regions of the DN input space partition are convex polytopes.

The above result follows naturally from our characterization. Recall that the DN partition successively subdivides the previously built partition, starting from the entire DN input space. The first layer produces $\Omega^{(1)}$, which is a PD and thus has convex regions, as is the case for any PD. Each region $\omega \in \Omega^{(1)}$ is then subdivided with another PD, hence the intersection of convex regions with other convex regions occurs. The result, $\Omega^{(1,2)}$, is thus made of convex regions. Repeating this process of intersecting convex regions with other convex regions ultimately leads to the DN input space partition $\Omega^{(1,\dots,L)}$, made of convex regions.
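The subdivision can also be inspected empirically. The sketch below (our own toy 2-layer ReLU network) encodes, for a grid of inputs, which affine piece is active at every unit of each layer; distinct codes correspond to distinct sampled regions of $\Omega^{(1)}$ and of the refined partition $\Omega^{(1,2)}$.

```python
import numpy as np

# Count sampled regions before and after the layer-2 subdivision.
rng = np.random.default_rng(9)
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((8, 3)), rng.standard_normal(8)

xs = np.linspace(-3, 3, 300)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)

pre1 = grid @ W1.T + b1
code1 = (pre1 >= 0)                          # layer-1 activation pattern
h1 = np.maximum(pre1, 0)                     # ReLU
pre2 = h1 @ W2.T + b2
code12 = np.hstack([code1, pre2 >= 0])       # joint layers-1-and-2 pattern

n1 = len(np.unique(code1, axis=0))
n12 = len(np.unique(code12, axis=0))
print(f"regions seen in Omega^(1): {n1}, after layer-2 subdivision: {n12}")
# n12 >= n1 always: the second layer only refines the layer-1 partition.
```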

3.5 Discussions

Our ability to characterize the DN input space partition as a Power Diagram subdivision concludes this chapter on employing Max-Affine Spline Operators to reformulate current DNs. We now have a better grasp of the underlying structure of the spline operator that a DN is. As was highlighted, this formulation offers a few key benefits. First, it is able to model DNs regardless of the actual input/latent/output space dimensions, and can be used for any DN as long as each layer nonlinearity is a Continuous Piecewise Affine operator. This generality, coupled with the practicality of Max-Affine Splines for obtaining theoretical results, should open the door to extending many powerful results obtained in univariate settings to more general cases. The subsequent chapters focus on bringing insights into various DN techniques and applications, such as Deep Generative Networks, Deep Network pruning, and Batch-Normalization in Deep Networks, from the MASO formulation.

Chapter 4

Insights Into Deep Generative Networks

In this chapter, we propose to leverage the results from Chap. 3 and apply them specifically to Deep Generative Networks (DGNs). Up until this point, we have mainly been focusing on a DN $F_\Theta$ devoid of any application setting. But DGNs, even though close to regression, aim to solve the problem of manifold learning. This particular scenario will allow us to draw many geometric insights into the ability of Continuous Piecewise Affine DGNs to fit manifolds and into their inner workings, e.g., their intrinsic dimension or their local basis vectors.

4.1 Introduction

4.1.1 Related Works

Deep Generative Networks (DGNs), which map a low-dimensional latent variable $z$ to a higher-dimensional generated sample $x$, have made enormous leaps in capabilities in recent years. DGNs alone only provide a nonlinear mapping from their latent space to an ambient space; learning the underlying DGN parameters can be done in a few different manners. First, one can employ Generative Adversarial Networks (GANs) [Goodfellow et al., 2014] or their variants [Dziugaite et al., 2015, Zhao et al., 2016, Durugkar et al., 2016, Arjovsky et al., 2017, Mao et al., 2017, Yang et al., 2019a]. In this setting the DGN is adapted in order to produce samples that cannot be distinguished from the training set's samples by a discriminative DN. Another option is to employ Variational Autoencoders [Kingma and Welling, 2013] or their variants [Fabius and van Amersfoort, 2014, van den Oord et al., 2017, Higgins et al., 2017, Tomczak and Welling, 2017b, Davidson et al., 2018]. In this setting, a (minimal) Probabilistic Graphical Model (PGM) is used in which the DGN represents the mapping between two neighboring vertices in this graph. This formulation allows training from a likelihood maximization perspective. In a similar vein, flow-based models such as NICE [Dinh et al., 2014], Normalizing Flow (NF) [Rezende and Mohamed, 2015] or their variants [Dinh et al., 2016, Grathwohl et al., 2018, Kingma and Dhariwal, 2018] propose to leverage the DGN as a succession of coordinate changes and to adapt them in order to force the data distribution to become a (simple) target distribution, often taken as an isotropic Gaussian. Training flow-based models also follows the maximum likelihood principle, but in a somewhat reversed formulation from VAEs.

Despite an exponential growth in the number of extensions and novel training methods for DGNs, all emerging techniques are motivated by studying the coupling between the dynamics of the DGN and the training framework [Mao et al., 2017, Chen et al., 2018], or by extensive empirical studies [Arjovsky and Bottou, Miyato et al., 2018, Xu and Durrett, 2018]. For example, GANs are mostly studied through the theoretical convergence properties of two-player games [Liu et al., 2017a, Zhang et al., 2017b, Biau et al., 2018], or regret analysis [Li et al., 2017b, Kodali et al., 2017]. VAEs are mostly studied from a perturbation theory perspective of their latent space [Roy et al., 2018, Andrés-Terré and Lió, 2019] or from a pure PGM perspective with emphasis on the inference and training schemes [Chen et al., 2018]. Finally, NFs mostly focus on improving the tractability of the model by means of parametrizations of the DGN layer mappings such as Householder transformations [Tomczak and Welling, 2016] or Sylvester matrices [Berg et al., 2018].

4.1.2 Contributions

In this chapter, we propose to study DGNs and their properties solely based on their Continuous Piecewise Affine structure that we built in Chap. 3. That is, we propose to make explicit the fundamental properties and limitations of DGNs regardless of the training setting employed. In doing so, we will be able to provide new perspectives on many observed phenomena, such as unstable training when dealing with multimodal data distributions (mode collapse) or the relationship between the DGN's latent space dimension and its ability to generalize. For this chapter, we will use the following notation. A deep generative network (DGN) is an operator $G_\Theta$ with parameters $\Theta$ mapping a latent input $z \in \mathbb{R}^S$ to an observation $x \in \mathbb{R}^D$ by composing $L$ intermediate layer mappings $G^{(\ell)}$, $\ell = 1, \dots, L$. We precisely define a layer $G^{(\ell)}$ as comprising a single nonlinear operator composed with any (if any) preceding linear operators that lie between it and the preceding nonlinear operator, as per Def. 1.1. We will omit $\Theta$ from the $G_\Theta$ operator for conciseness unless needed. Each layer $G^{(\ell)}$ transforms its input feature map $z^{(\ell-1)} \in \mathbb{R}^{D^{(\ell-1)}}$ into an output feature map $z^{(\ell)} \in \mathbb{R}^{D^{(\ell)}}$, with in particular $z^{(0)} := z$, $D^{(0)} = S$, and $z^{(L)} := x$, $D^{(L)} = D$. In this framework $z$ is interpreted as a latent representation, and $x$ is the generated/observed data, e.g., a time series or an image.

4.2 Deep Generative Network Latent and Intrinsic Dimension

In this section we study the properties of the mapping G_Θ : R^S → R^D of a DGN comprising L MASO layers.

Figure 4.1 : Visual depiction of Thm. 4.1 with a (random) generator G : R^2 → R^3. Left: generator input space partition Ω made of polytopal regions. Right: generator image Im(G), which is a continuous piecewise affine surface composed of the polytopes obtained by affinely transforming the polytopes of the input space partition (left); the colors are per-region and correspond between the left and right plots. This input-space-partition / generator-image / per-region-affine-mapping relation holds for any architecture employing piecewise affine activation functions. Understanding each of the three brings insights into the others, as we demonstrate in this chapter.

4.2.1 Input-Output Space Partition and Per-Region Mapping

As was hinted at in the previous chapter, the MASO formulation of a DGN allows us to express the (entire) DGN mapping G (a composition of L MASOs) as a per-region affine mapping

G(z) = ∑_{ω∈Ω} (A_ω z + b_ω) 1_{z∈ω},   z ∈ R^S,   (4.1)

with Ω a partition of R^S. Recall from Sec. 3.4 that this partition corresponds to a Power Diagram subdivision and can be obtained analytically, if needed. In order to study and characterize the DGN mapping (4.1), we make explicit the formation of the per-region slope and bias parameters. The affine parameters A_ω, b_ω decompose into

A_ω = ∏_{ℓ=0}^{L−1} diag(σ̇^{(L−ℓ)}(ω)) W^{(L−ℓ)} = diag(σ̇^{(L)}(ω)) W^{(L)} · · · diag(σ̇^{(1)}(ω)) W^{(1)},   (4.2)

where σ̇^{(ℓ)}(ω) is the pointwise derivative of the activation function of layer ℓ based on its input W^{(ℓ)} z^{(ℓ−1)} + b^{(ℓ)}, ∀z ∈ ω. At the time of this thesis, no DGN employs a pooling operator; we thus omit such operators in this chapter to streamline our notations and development. The diag operator simply puts the given vector into a diagonal square matrix. For convolutional layers (or others) one can simply replace the corresponding W^{(ℓ)} with the correct slope matrix parametrization. Notice that since the employed activation functions σ^{(ℓ)}, ∀ℓ ∈ {1,...,L}, are piecewise affine, their derivatives are piecewise constant, in particular with values [σ̇^{(ℓ)}(ω)]_k ∈ {α, 1} for k ∈ {1,...,D^{(ℓ)}}, with α = 0 for ReLU, α = −1 for absolute value, and in general α > 0 for Leaky-ReLU. We denote the collection of all the per-layer activation derivatives [σ̇^{(1)}(ω)^T, ..., σ̇^{(L)}(ω)^T]^T ∈ {α, 1}^{∑_{ℓ=1}^{L} D^{(ℓ)}} as the activation pattern of the generator. Based on the above, if one already knows the activation pattern associated with a region ω, then the matrix A_ω can be formed directly. In practice, one instead observes a sample z ∈ ω, from which the activation pattern is obtained directly. In this case, we slightly abuse notation and denote those known activation patterns as σ̇^{(ℓ)}(ω) ≜ σ̇^{(ℓ)}(z), z ∈ ω, with ω being the considered region. In a similar way, the bias vector is obtained as

b_ω = ∑_{ℓ=1}^{L} [ ∏_{i=0}^{L−ℓ−1} diag(σ̇^{(L−i)}(ω)) W^{(L−i)} ] diag(σ̇^{(ℓ)}(ω)) b^{(ℓ)}.   (4.3)
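To make the construction in (4.2)–(4.3) concrete, the following minimal sketch (not part of the thesis; it assumes a plain fully connected ReLU generator whose per-layer weights and biases are given as NumPy arrays, with a linear last layer, and all names are illustrative) computes A_ω and b_ω for the region containing a sample z by recording the activation pattern during a forward pass.

```python
import numpy as np

def region_affine_params(weights, biases, z):
    """Per-region affine parameters (A_omega, b_omega) of a ReLU generator at
    the region containing z, following (4.2)-(4.3); the last layer is linear."""
    A = np.eye(len(z))                    # running composition, initially identity on R^S
    b = np.zeros(len(z))                  # running bias term
    h = np.asarray(z, dtype=float)
    for l, (W, c) in enumerate(zip(weights, biases)):
        pre = W @ h + c                   # pre-activation W^(l) z^(l-1) + b^(l)
        q = np.ones_like(pre) if l == len(weights) - 1 else (pre > 0).astype(float)
        A = (q[:, None] * W) @ A          # compose diag(sigma_dot^(l)(omega)) W^(l), cf. (4.2)
        b = q * (W @ b + c)               # same composition applied to the bias, cf. (4.3)
        h = q * pre                       # post-activation feature map z^(l)
    return A, b                           # G(z') = A z' + b for every z' in omega
```

On any other sample z′ drawn from the same region, `A @ z_prime + b` reproduces the forward pass exactly, which can serve as a sanity check of the per-region affine form (4.1).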

As for the slope matrix A_ω, the bias vector b_ω can be obtained either from a sample z ∈ ω or based on the known region activation pattern. Equipped with the above notations, we can now state our first formal result characterizing the image of a DGN regardless of its parameters and training setting.

Theorem 4.1 (Per-region affine subspace)

The image of a generator G employing MASO layers is a continuous piecewise affine surface made of connected polytopes obtained by affine transformations of the polytopes of the input space partition Ω, as in

Im(G) ≜ {G(z) : z ∈ R^S} = ⋃_{ω∈Ω} Aff(ω; A_ω, b_ω),   (4.4)

with Aff(ω; A_ω, b_ω) = {A_ω z + b_ω : z ∈ ω}; we will denote for conciseness G(ω) ≜ Aff(ω; A_ω, b_ω). The volume of a region ω ∈ Ω, denoted by µ(ω), is related to the volume of G(ω) as per µ(G(ω)) = √(det(A_ω^T A_ω)) µ(ω), with A_ω being full-rank.

The above result is pivotal to bridge the understanding of the input space partition Ω, the per-region affine mappings A_ω, b_ω, and the generator's image. We visualize Thm. 4.1 in Fig. 4.1 to make it clear that characterizing A_ω alone already provides tremendous information about the generator. This result also provides a direct answer to the problem of generating disconnected manifolds (or sets) with current DGNs. In a specific GAN setting, this was empirically shown to be impossible [Khayatkhoei et al., 2018, Tanielian et al., 2020], which aligns with Thm. 4.1: given a connected set Z ⊂ R^S, the image G(Z) of any DGN made of a composition of MASOs is always connected, for any depth, width, or parameter settings of W^{(ℓ)}, b^{(ℓ)}, ∀ℓ. We now turn to the study of the DGN intrinsic dimension.

4.2.2 Generated Manifold Angularity

We now study the angularity of the generated surface, i.e., the image of G. Recall (from Thm. 4.1) that the per-region affine subspaces of adjacent regions are continuous and joined at the region boundaries with a certain angle that we now characterize.

Figure 4.2 : The columns represent different widths D^{(ℓ)} ∈ {6, 8, 16, 32} and the rows correspond to repetitions of the learning for different random initializations of the DGNs with consecutive seeds.

Definition 4.1 (Adjacent regions) Two regions ω, ω′ are adjacent whenever they share part of their boundary, as in ∂ω ∩ ∂ω′ ≠ ∅.

The angle between adjacent affine subspaces is characterized by means of the greatest principal angle [Afriat, 1957, Bjorck and Golub, 1973], which is denoted for our study as θ_{ω,ω′}. The following result, demonstrating how to compute such an angle, can be obtained by a direct application of the main result in Sec. 1 of Absil et al. [2006].

Theorem 4.2 (Angularity between adjacent subspaces)

The angle θ_{ω,ω′} between the region mappings of two adjacent (recall Def. 4.1) regions ω, ω′ is given by

sin(θ_{ω,ω′}) = ‖ A_ω (A_ω^T A_ω)^{−1} A_ω^T − A_{ω′} (A_{ω′}^T A_{ω′})^{−1} A_{ω′}^T ‖_2,

assuming that A_ω, ∀ω ∈ Ω, are full-rank.
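As an illustration (a minimal sketch under the full-rank assumption of the theorem; the function name is ours), the angle can be evaluated numerically from the two slope matrices:

```python
import numpy as np

def adjacent_region_angle(A_w, A_wp):
    """Greatest principal angle between span(A_omega) and span(A_omega'), cf. Thm. 4.2."""
    proj = lambda A: A @ np.linalg.inv(A.T @ A) @ A.T          # orthogonal projector onto col(A)
    sin_theta = np.linalg.norm(proj(A_w) - proj(A_wp), ord=2)  # spectral norm = sin(theta)
    return np.arcsin(np.clip(sin_theta, 0.0, 1.0))             # clip guards against round-off
```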

Notice that in the special case of S = 1, the angle is given by the cosine similarity between the vectors A_ω and A_{ω′} of adjacent regions, and when S = D − 1 the angle is given by the cosine similarity between the normal vectors of the (D − 1)-dimensional subspaces spanned by A_ω and A_{ω′}, respectively.

Figure 4.3 : Histograms of the DGN adjacent region angles for DGNs with two hidden layers, S = 16, D = 17 and D = 32 respectively, and varying width D^{(ℓ)} on the y-axis. Three trends to observe: increasing the width increases the bimodality of the distribution while favoring near-0 angles; increasing the output space dimension increases the number of near-orthogonal angles; the A_ω and A_{ω′} of adjacent regions ω and ω′ are highly similar, making most angles smaller than if they were independent (depicted in blue).

We illustrate those angles in a simple case, D = 2 and S = 1, in Fig. 4.2. It can be seen how the angles of the generated surface (in this case, the generated line) follow the curvature of the target manifold when the number of parameters remains small. However, as soon as the width of the DGN increases, overfitting occurs and extremely small angles are introduced in unnecessary parts of the manifold. We can also use the above result to study the distribution of angles of different DGNs with random weights and study the impact of depth and width, as well as the impact of S and D, the latent and output dimensions respectively. Figure 4.3 summarizes the distribution of angles for several different settings, from which we observe two key trends. First, the affine parameters A_ω and A_{ω′} of adjacent regions have very constrained angles that are much smaller than if they were random. This can be justified theoretically by observing (4.2) and recalling from Sec. 3.4 that adjacent regions share most of their activation pattern. Hence, the products of matrices that form A_ω and A_{ω′} are the same except for one (or a few) entries in one (or a few) of the diag(σ̇^{(ℓ)}) matrices. As can be seen in the figure, if the matrices A_ω and A_{ω′} were independent, the produced angle would be much larger. Second, the distribution of those angles depends on the ratio S/D rather than on those values taken independently. In particular, as this ratio gets smaller, the angle distribution becomes bimodal, with the 'medium' angles disappearing and only the extreme angles (small and large) being preserved. This makes the generated manifold 'flatter' overall except in some parts of the space where high angularity is present.

The above experiment demonstrates the impact of width and latent space dimension on the angularity of the DGN output manifold, and how to pick its architecture based on a priori knowledge of the target manifold. Under the often-made assumption that the weights of an overparametrized DGN do not move far from their initialization during training [Li et al., 2018], these results also hint at the distribution of angles after training. We can now turn to the impact of the architecture on the dimensionality of the generated manifold. Combining insights from this section and the next one allows for precise architecture design guidance given a priori knowledge of the manifold curvature and dimension.

4.2.3 Generated Manifold Intrinsic Dimension

We now turn to the intrinsic dimension of the per-region affine subspaces G(ω) that are the pieces forming the generated manifold. In fact, as per (4.4), the dimension of each subspace G(ω) depends not only on the latent dimension S but also indirectly on the per-layer parameters through the forming (by composition) of the slope matrix A_ω. In fact, it is clear from Thm. 4.1 that dim(G(ω)) = rank(A_ω). Now, since A_ω composes multiple matrices as per (4.2), we can leverage the fact that rank(UV) ≤ min(rank(U), rank(V)) (see for example (4.5.2) in Meyer [2000]) to obtain the following.

Proposition 4.1 (Generated manifold intrinsic dimension)

The intrinsic dimension of the affine subspace G(ω) (recall (4.4)) has the following upper bound

dim(G(ω)) ≤ min( S,  min_{ℓ=1,...,L} #{i : [σ̇^{(ℓ)}(ω)]_i ≠ 0, i = 1,...,D^{(ℓ)}},  min_{ℓ=1,...,L} rank(W^{(ℓ)}) ),   (4.5)

where the middle term counts, per layer, the number of units with nonzero activation function derivative, and # represents the cardinality operator.

We make two observations. First, we see that the choice of the nonlinearity (i.e., the choice of α) and/or the choice of the layer widths D^{(ℓ)}, ∀ℓ, are the key elements controlling the upper bound on dim(G(ω)). For example, in the case of ReLU (α = 0), dim(G(ω)) is directly impacted by the number of 0s in the activation patterns σ̇^{(ℓ)}(ω) of each layer, in addition to the rank of W^{(ℓ)}; this sensitivity does not occur when using other nonlinearities (α ≠ 0). Second, "bottleneck layers" (layers with width D^{(ℓ)} smaller than the other layers) directly impact the dimension of the subspace and thus should be carefully employed based on a priori knowledge of the target manifold intrinsic dimension. In particular, if α ≠ 0 and the weights are non-degenerate (such as at random initialization), then the matrices W^{(ℓ)} are almost surely full-rank; those two cases correspond to

dim(G(ω)) ≤ min(S, min_ℓ D^{(ℓ)}, min_ℓ rank(W^{(ℓ)}))   (α ≠ 0),
dim(G(ω)) ≤ min(S, min_ℓ D^{(ℓ)})   (α ≠ 0, W^{(ℓ)} full-rank ∀ℓ).
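As a quick numerical companion to Prop. 4.1 (an illustration only; the helper names, and the assumption that the activation patterns are available as 0/1 vectors, are ours), the exact per-region intrinsic dimension and the right-hand side of (4.5) can be evaluated as follows. A_ω can be obtained, for instance, with the region_affine_params sketch given after (4.3).

```python
import numpy as np

def region_intrinsic_dimension(A_w, tol=1e-8):
    """dim(G(omega)) = rank(A_omega), cf. Thm. 4.1 and Prop. 4.1."""
    return int(np.linalg.matrix_rank(A_w, tol=tol))

def dimension_upper_bound(weights, activation_patterns, S):
    """Right-hand side of (4.5): min of the latent dimension, the per-layer count
    of nonzero activation derivatives, and the per-layer weight ranks."""
    nonzero = min(int(np.count_nonzero(q)) for q in activation_patterns)
    ranks = min(int(np.linalg.matrix_rank(W)) for W in weights)
    return min(S, nonzero, ranks)
```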

We now exploit the above formula describing the intrinsic dimension of the generated polytopes to gain further insights into dropout and dropconnect, two noise techniques used to increase generalization by acting on the layer units or weights.

4.2.4 Effect of Dropout/Dropconnect

Noise techniques, such as dropout [Wager et al., 2013] and dropconnect [Wan et al., 2013, Isola et al., 2017], alter the per-region affine mapping in a very particular way. Those techniques perform a Hadamard product of samples from iid Bernoulli random variables against the feature maps (dropout) or against the layer weights (dropconnect); we denote a DGN equipped with such a technique and given a noise realization by G_ε, where ε includes the noise realizations of all layers; note that G_ε now has its own input space partition Ω_ε. For dropout, ε ≜ {ε^{(ℓ)} ∈ {0, 1}^{D^{(ℓ)}}, ℓ = 1,...,L}, leading to the mapping

G_ε(z) = ( ∏_{ℓ=0}^{L−1} diag(σ̇^{(L−ℓ)}(ω) ⊙ ε^{(L−ℓ)}) W^{(L−ℓ)} ) z
       + ∑_{ℓ=1}^{L} [ ∏_{i=0}^{L−ℓ−1} diag(σ̇^{(L−i)}(ω) ⊙ ε^{(L−i)}) W^{(L−i)} ] diag(σ̇^{(ℓ)}(ω)) b^{(ℓ)},

where ⊙ is the Hadamard product and z ∈ ω. As opposed to dropout, which applies the binary noise to the feature maps, dropconnect applies the binary noise to the slope matrices W^{(ℓ)}, with ε ≜ {R^{(ℓ)} ∈ {0, 1}^{D^{(ℓ)}×D^{(ℓ−1)}}, ℓ = 1,...,L}, leading to the mapping

G_ε(z) = ( ∏_{ℓ=0}^{L−1} diag(σ̇^{(L−ℓ)}(ω)) (W^{(L−ℓ)} ⊙ R^{(L−ℓ)}) ) z
       + ∑_{ℓ=1}^{L} [ ∏_{i=0}^{L−ℓ−1} diag(σ̇^{(L−i)}(ω)) (W^{(L−i)} ⊙ R^{(L−i)}) ] diag(σ̇^{(ℓ)}(ω)) b^{(ℓ)}.

Those techniques have been extensively employed in classification settings as they increase generalization performance. In that specific setting, it was shown that adding dropout/dropconnect to a DN classifier turns the network into an ensemble of classifiers [Warde-Farley et al., 2013, Baldi and Sadowski, 2013, Bachman et al., 2014, Hara et al., 2016]. We can formally extend those results to the case of DGNs as follows.

Figure 4.4 : DGN with dropout trained (GAN) on a circle dataset (blue dots); dropout turns a DGN into an ensemble of DGNs (each dropout realization is drawn in a different color).

Proposition 4.2 (Dropout/dropconnect and ensemble of DGNs)

Adding dropout/dropconnect to a DGN G produces a (finite) ensemble of generators {G_ε, ∀ε}, each with per-region intrinsic dimension

0 ≤ max_{ω∈Ω_ε} dim(G_ε(ω)) ≤ max_{ω∈Ω} dim(G(ω)),   ∀ε;

those bounds are tight.
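To illustrate Prop. 4.2 numerically (an illustrative sketch only; the ReLU generator, the Bernoulli mask sampling and all names are our assumptions, and the usual 1/(1 − rate) rescaling of inverted dropout is omitted since it does not affect the rank), one can sample a dropout realization ε and compare the rank of the noisy slope A^ε_ω with that of the noiseless one:

```python
import numpy as np

def dropout_region_slope(weights, biases, z, rate, seed=0):
    """One dropout realization applied to a ReLU generator: returns the noisy
    per-region slope A^eps_omega and its rank (intrinsic dim. of G_eps(omega))."""
    rng = np.random.default_rng(seed)
    A = np.eye(len(z))
    h = np.asarray(z, dtype=float)
    for l, (W, c) in enumerate(zip(weights, biases)):
        pre = W @ h + c
        last = l == len(weights) - 1
        q = np.ones_like(pre) if last else (pre > 0).astype(float)
        eps = np.ones_like(pre) if last else rng.binomial(1, 1 - rate, size=pre.shape).astype(float)
        A = ((q * eps)[:, None] * W) @ A     # diag(sigma_dot ⊙ eps) W, as in the dropout mapping
        h = q * eps * pre                    # dropped-out feature map
    return A, int(np.linalg.matrix_rank(A))
```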

We illustrate the mixture of models in Fig. 4.4. By leveraging the above and Thm. 4.1 we can highlight a potential limitation of those techniques for narrow models (D^{(ℓ)} ≈ S). In this case, it is highly likely that the noisy generators G_ε will, for some noise realizations, have a per-region intrinsic dimension much smaller than S, and thus will make training unstable. For example, in the extreme case of G_ε having latent dimension 1, the internal layer weights will be adapted to produce a continuous piecewise linear line through the data. On the other hand, when used with wide DGNs (D^{(ℓ)} ≫ S), the noise-induced generators will maintain an intrinsic dimension closer to that of the original generator and thus provide a beneficial ensemble of DGNs, which helps learning and thus generalization. Formally, the two above comments translate to the following statement:

lim_{D^{(ℓ)}→S} #{ G_ε  s.t.  max_{ω∈Ω_ε} dim(G_ε(ω)) ≥ S, ∀ε } = 1.

Figure 4.5 : Impact of dropout and dropconnect on the intrinsic dimension of the noise-induced generators for two "drop" probabilities, 0.1 and 0.3, and for a generator G with S = 6, D = 10, L = 3 with varying width D^{(1)} = D^{(2)} ranging from 6 to 48 (x-axis; y-axis: dim(G_ε(ω))). The boxplot represents the distribution of the per-region intrinsic dimensions over 2000 sampled regions and 2000 different noise realizations. Recall that the intrinsic dimension is upper bounded by S = 6 in this case. Two key observations: first, dropconnect tends to produce DGNs whose intrinsic dimension preserves the latent dimension (S = 6) even for narrow models (D^{(1)}, D^{(2)} ≈ S), as opposed to dropout, which tends to produce DGNs with much smaller intrinsic dimension than S. As a result, if the DGN is much wider than S, both techniques can be used, while for narrow models, either none or dropconnect should be preferred.

We empirically demonstrate the impact of dropout on the dimension of the DGN surface in Fig. 4.5; clearly, one must adapt the dropout rate to the layer widths and ensure that the probability of a noisy DGN G_ε having a degenerate intrinsic dimension (smaller than the desired dimension) remains low.

From the above analysis we see that one should carefully consider the use of dropout/dropconnect based on the type of generator architecture that is used and the desired generator intrinsic dimension. As dropout has a more dramatic effect in collapsing the intrinsic dimension of the DGN manifold, we further study it in a realistic setting. We experiment with a Deep Autoencoder on a simple reconstruction task on multiple datasets in Fig. 4.6 and demonstrate that, unless correctly set up, applying dropout blindly at each layer negatively impacts test set performance.

Figure 4.6 (columns: MNIST, Fashion-MNIST, Arabic characters, SVHN, CIFAR) : Deep Autoencoder experiment when equipping the DGN (decoder) with dropout, where we employ the following MLP: S = D^{(1)} = D^{(2)} = 32 and D^{(3)} = D^{(4)} = 1024, D^{(5)} = D; the test set reconstruction error is displayed for multiple datasets and training settings. The architecture purposefully maintains a narrow width for the first two layers to highlight that in those cases dropout is detrimental regardless of the dropout rate. We compare applying dropout to all layers (black line) versus applying dropout only on the last two (wide) layers (blue line). We see that unless the dropout rate is adapted to the layer width and desired intrinsic dimension, the test set performance is negatively impacted by dropout. The exact rate reaching the best test set performance when employing dropout only on wide layers is shown with a green arrow. The exact values for each graph are given in Table 4.1.

Table 4.1 : Test set reconstruction error for varying dropout rates as displayed in Fig. 4.6, for different datasets, and when applying dropout on all layers or only on wide enough layers. We see that it is crucial to adapt the dropout rate to the layer width, as otherwise the test error only increases when employing dropout.

Dropout proba.                  0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
CIFAR10        drop. on all     1.88  2.80  4.27  4.95  5.16  4.46  4.41  4.20  5.09  6.18
               drop. on wide    1.92  1.95  1.82  1.93  2.01  2.10  2.25  2.46  2.93  4.22
MNIST          drop. on all     1.05  1.87  3.41  3.65  3.83  4.11  5.38  6.04  6.03  6.70
               drop. on wide    1.05  0.91  0.94  1.01  1.05  1.15  1.26  1.58  2.13  3.60
SVHN           drop. on all     0.67  1.46  2.79  3.97  4.11  4.05  3.80  2.45  2.99  5.07
               drop. on wide    0.63  0.63  0.72  0.64  0.75  1.17  1.42  1.81  1.88  2.14
fashion MNIST  drop. on all     1.09  1.84  3.35  4.57  4.13  3.74  3.69  5.34  6.00  8.22
               drop. on wide    1.09  1.09  1.09  1.16  1.22  1.33  1.45  1.64  2.00  2.85
ARABIC         drop. on all     3.00  3.25  4.05  4.97  5.27  5.38  5.88  6.53  6.62  7.21
               drop. on wide    3.07  2.66  2.58  2.54  2.53  2.66  2.78  2.96  3.48  4.83

We also propose in Fig. 4.7 a simple scheme guiding the choice of the dropout rate based on the layer width and the desired intrinsic dimension.

Figure 4.7 : Probability (0: blue, 1: red) that dropout maintains the intrinsic dimension (red line; left: 32, right: 64) as a function of the dropout rate (x-axis; the probability of setting a unit to 0) and of the layer's width (y-axis), with the 95% and 99% levels shown as solid and dashed black lines respectively. We see that when the layer's width is close to the desired intrinsic dimension, no dropout should be applied, and that for a dropout rate of 0.5, the layer must be at least two times wider than the desired intrinsic dimension.

4.3 Per-Region Affine Mapping Interpretability and Manifold Tangent Space

We now turn to the study of the local coordinates of the affine mappings comprising a DGN’s generated manifold.

4.3.1 Per-Region Mapping as Local Coordinate System and Disentanglement

Recall from (4.4) that a DGN is a CPA operator. Inside a region ω ∈ Ω, points are mapped to the output affine subspace, which is itself governed by a coordinate system or basis (A_ω) that we assume to be full-rank for any ω ∈ Ω throughout this section. The affine mapping is performed locally for each region ω, in a manner similar to an "adaptive basis" [Donoho et al., 1994]. In this context, we aim to characterize the subspace basis in terms of disentanglement, i.e., the alignment of the basis vectors with respect to each other. While there is no unique definition of disentanglement, a general consensus is that a disentangled basis should provide a "compact" and interpretable latent representation z for the associated x = G(z). In particular, it should ensure that a small perturbation of the dth dimension (d = 1,...,S) of z implies a transformation independent from a small perturbation of d′ ≠ d [Schmidhuber, 1992, Bengio et al., 2013]. That is, ⟨G(z) − G(z + δ_d), G(z) − G(z + δ_{d′})⟩ ≈ 0, with δ_d a one-hot vector at position d and of length S [Kim and Mnih, 2018]. A disentangled representation is thus considered to be most informative, as each latent dimension implies a transformation that leaves the others unchanged [Bryant and Yarnold, 1995].

Figure 4.8 (columns: FC GAN, CONV GAN, FC VAE, CONV VAE; rows: learned, initial) : Visualization of a single basis vector [A_ω]_{·,k} before and after learning, obtained from a region ω containing the digits 7, 5, 9, and 0 respectively per column, for GAN and VAE models made of fully connected or convolutional layers. We observe how those basis vectors encode, respectively, a right rotation, a cedilla extension, a left rotation, and an upward translation; studying the columns of A_ω provides interpretability into the learned DGN affine parameters and the underlying data manifold.

We are now able to provide the first condition relating the per-region basis A_ω to the concept of disentanglement. A necessary condition for disentanglement is to have "near orthogonal" columns, i.e., ⟨[A_ω]_{·,i}, [A_ω]_{·,j}⟩ ≈ 0, ∀i ≠ j, ∀ω ∈ Ω. We provide in Fig. 4.8 visuals of the basis vectors of four different DGNs trained on the MNIST dataset with S = 10. From this, we see how, by leveraging the fact that A_ω is the (local) basis of the DGN, inspecting its columns provides direct visuals to understand the transformations encoded in each latent space dimension. This type of visualization is crucial not only for latent space dimension inspection but also to discover new ways to move around the data manifolds and thus perform controlled data generation, a key challenge in GANs and VAEs [Zhao et al., 2017, Huang et al., 2018b].
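As a simple numerical proxy for the necessary condition above (our own illustrative measure, not one defined in the thesis), one can inspect the pairwise cosine similarities between the columns of A_ω:

```python
import numpy as np

def max_column_coherence(A_w):
    """Largest |cosine similarity| between distinct columns of A_omega; values
    near 0 indicate the near-orthogonality required for disentanglement."""
    U = A_w / np.linalg.norm(A_w, axis=0, keepdims=True)   # unit-norm columns
    G = U.T @ U                                            # Gram matrix of cosines
    return float(np.max(np.abs(G - np.diag(np.diag(G)))))  # largest off-diagonal entry
```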

4.3.2 Tangent Space Regularization

In the previous section, we highlighted the link between the per-region slope matrix A_ω and the DGN tangent space for any DGN input z ∈ ω: the columns of A_ω span the DGN tangent space of that region. We now further leverage this finding to provide a novel and motivated regularization of DGNs.

Regularizations in DGNs have always been driven by the principle that the DGN must be a contractive mapping. That is, highly similar inputs should produce highly similar outputs. From this, a wave of techniques has been derived involving penalty terms constraining the input-output mapping to be similar when adding noise to the DGN inputs [Vincent et al., 2008, Rifai et al., 2011, Teng and Choromanska, 2019]. Such regularizations however do not leverage the geometry of the data; in fact, those regularizations exploit very little information about the training samples: only their positions in the ambient space. We propose instead to use a "richer" regularization based on the tangent space of the samples. This allows us to explicitly incorporate the geometry of the data manifold into the DGN through a novel regularization: for each region G(ω) in which there exists a training sample, and where estimation of the tangent space of the data manifold is possible, we constrain A_ω to be a basis of that tangent space. In practice, and as is commonly done, we employ a k-NN algorithm to estimate the tangent space around a sample x [Ma et al., 2010, Deng et al., 2020]. The number of neighbors, which we denote here as T, defines the dimensionality of the estimated tangent space. The basis of the estimated tangent space around x is denoted by T_x and is given by

T_x ≜ (x − x_1, ..., x − x_T),

where x_1, ..., x_T are the T nearest neighbors of x in the training set. The DGN tangent space will be aligned with the estimated data tangent space if and only if they both span the same subspace. One measure of alignment between subspaces is given by the following matrix norm: R(x, A_ω) = ‖ A_ω (A_ω^T A_ω)^{−1} A_ω^T − T_x (T_x^T T_x)^{−1} T_x^T ‖_2 [Bjorck and Golub, 1973, Miao and Ben-Israel, 1992]. The latter is 0 whenever the subspaces spanned by A_ω and T_x are included in each other. Our regularization thus takes the following simple form, given a training set X:

∑_{x∈X} ∑_{ω∈Ω} 1_{x∈G(ω)} R(x, A_ω),   (4.6)

and we simply add this regularization term, weighted by a constant λ, to the employed training loss of any DGN to tilt the DGN tangent space to align as well as possible with the estimated data tangent space. We provide in Fig. 4.9 and Table 4.2 the impact of our proposed regularization technique on various datasets and training settings. We observe that the speed of convergence during training is not impacted, however the generalization capacity of DGNs employing our regularization is greatly increased.
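A minimal sketch of the penalty R(x, A_ω) and its k-NN tangent estimate (an illustrative NumPy version with names of our choosing; the pseudo-inverse is used instead of a plain inverse as a small robustness hedge, and the term would be added, weighted by λ, to whatever training loss is employed):

```python
import numpy as np

def tangent_alignment_penalty(A_w, x, X_train, T=16):
    """R(x, A_omega) from (4.6): spectral norm of the difference between the
    projectors onto span(A_omega) and onto the k-NN estimated data tangent T_x."""
    d = np.linalg.norm(X_train - x, axis=1)
    neighbors = X_train[np.argsort(d)[1:T + 1]]             # skip x itself when it is in X_train
    Tx = (x - neighbors).T                                  # columns x - x_t, as in the definition
    proj = lambda B: B @ np.linalg.pinv(B.T @ B) @ B.T      # orthogonal projector
    return np.linalg.norm(proj(A_w) - proj(Tx), ord=2)      # spectral norm
```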

This result demonstrates that even for "simple" datasets such as MNIST, which are relatively simple image manifolds compared to high-resolution realistic images, DGNs fail to correctly align with the data manifold, and instead only pass through the training samples but with a different tangent space. We believe that there are many avenues to further study this regularization term and derive generalization guarantees from it; however, we leave that for future work and first consider a last section focusing on DGNs employing probability densities on their input space.

Figure 4.9 (columns: MNIST, SVHN, CIFAR10; y-axis: log(reconstruction error), x-axis: log(epochs)) : Test set reconstruction error during training for each epoch for a baseline unconstrained Deep AutoEncoder (black line) and for the tangent space regularized DGN (decoder) from (4.6) with varying regularization coefficient λ (colored lines), for three datasets (per column) and with S = 128, T = 16 (top) and S = 32, T = 16 (bottom). We observe that by constraining the tangent space basis A_ω to span the data tangent space for each region ω containing training samples, the manifold fitting is improved, leading to better test sample reconstruction.

4.4 Density on the Generated Manifold

The study of DGNs would not be complete without considering that the latent space is often equipped with a density distribution p_z from which z is sampled, in turn leading to the sampling of G(z); we now study this density and its properties.

Table 4.2 : Test set reconstruction error averaged over 3 runs when employing the tangent space regularization (4.6) on various datasets with a Deep Autoencoder, when varying the weight of the regularization term (λ), the latent space dimension (S), and the number of neighbors used to estimate the data tangent space (T). We see that the proposed regularization effectively improves generalization performance in all cases, even for complicated and high-dimensional datasets such as CIFAR10 where the data tangent space estimation becomes more challenging. This also demonstrates that DGNs trained only to reconstruct the data samples do not align correctly with the underlying data manifold tangent space.

                                  T = 8                             T = 16
λ =                    0.00  0.01  0.1   1     10       0.00  0.01  0.1   1     10
S = 32   mnist         1.57  1.57  1.51  1.31  1.10     1.57  1.57  1.51  1.32  1.10
         fashionmnist  1.53  1.53  1.50  1.30  1.20     1.53  1.54  1.50  1.31  1.20
         svhn          0.82  0.85  0.89  0.82  0.71     0.82  0.86  0.88  0.90  0.66
         cifar10       2.41  2.42  2.51  2.17  1.74     2.48  2.46  2.47  2.09  1.65
S = 128  mnist         1.34  1.34  1.30  1.16  1.08     1.34  1.34  1.30  1.17  1.07
         fashionmnist  1.38  1.43  1.34  1.28  1.16     1.38  1.44  1.35  1.29  1.15
         svhn          0.87  1.07  1.00  0.70  0.78     0.87  0.59  1.26  0.59  0.74
         cifar10       2.85  1.97  2.14  2.05  1.54     2.87  1.87  2.02  1.97  1.42

4.4.1 Analytical Output Density

Given a distribution p_z over the latent space, we can explicitly compute the output distribution after the application of G, which leads to an intuitive result exploiting the piecewise affine property of the generator. For the remainder of our study we assume A_ω to be full-rank.

Theorem 4.3 (Output density)

The generator probability density p_G(x), given p_z and a bijective generator G, is given by

p_G(x) = ∑_{ω∈Ω} ( p_z(G_ω^{−1}(x)) / √(det(A_ω^T A_ω)) ) 1_{x∈G(ω)}.

That is, the distribution obtained in the output space naturally corresponds to a piecewise affine transformation of the original latent space distribution, weighted by the change in volume of the per-region mappings.
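The per-region volume factor of Thm. 4.3 (the quantity whose distribution is shown in Figs. 4.10 and 4.11) is cheap to evaluate; a minimal sketch assuming a full-rank A_ω:

```python
import numpy as np

def log_volume_change(A_w):
    """log sqrt(det(A_omega^T A_omega)), the per-region change of volume in Thm. 4.3."""
    _, logabsdet = np.linalg.slogdet(A_w.T @ A_w)  # stable log-determinant
    return 0.5 * logabsdet
```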

Figure 4.10 (columns ordered from a highly multimodal, low-entropy data distribution to a unimodal, high-entropy one; bottom-row x-axis: log(√(det(A_ω^T A_ω))), y-axis: histogram over regions ω) : Distribution of the per-region log-determinants (bottom row) for DGNs trained on a data distribution with varying per-mode variance (blue points, first row). The estimated data distribution is depicted through the red samples. We clearly observe the tight relationship between the multimodality and Shannon entropy of the data distribution to be approximated and the distribution of the per-region determinants of A_ω. That is, as the DGN tries to approximate a data distribution with high multimodality and low Shannon entropy, the per-region slope matrices A_ω have increasingly large singular values, in turn synonymous with exploding per-layer weights and thus training instabilities (recall Thm. 4.1).

4.4.2 On the Difficulty of Generating Low Entropy/Multimodal Distributions

We conclude this study by hinting at the possible main cause of instabilities encountered when training DGNs on multimodal densities or other atypical cases.

We demonstrated in Thm. 4.3 that the product of the nonzero singular values of A_ω plays the central role in concentrating or dispersing the density on G(ω). Even when considering a simple mixture-of-Gaussians case, it is clear that the standard deviations of the modes and the inter-mode distances will put constraints on the singular values of the slope matrices A_ω, in turn stressing the parameters W^{(ℓ)} as they compose the slope matrices. We highlight this tight relationship between the per-layer matrices W^{(ℓ)} and the overall per-region matrix A_ω, and how their singular values are tied, in Fig. 4.11. This problem emerges from the continuity of DGNs, which have to somehow connect the different modes in the output space in a way that produces near-0 probability in between them. We highlight this in Fig. 4.10, where we trained a GAN DGN on two Gaussians for different multimodality settings.

Figure 4.11 (panels: σ_1 = 0, σ_1 = 1, σ_1 = 2, each with σ_2 ∈ {1, 2, 3}) : Distribution of log(√(det(A_ω^T A_ω))) for 2000 regions ω with a DGN with L = 3, S = 6, D = 10 and weights initialized with Xavier; then, half of the weights' coefficients (picked randomly) are rescaled by σ_1 and the other half by σ_2. We observe that a greater variance of the weights increases both the spread and the mean of the log-determinant distribution.

4.5 Discussions

In conclusion, we demonstrated how the spline formulation of DGNs offers a rich portal to obtain theoretical results and understanding that can be evaluated empirically in a tractable manner through standard tools from linear algebra. We particularly focused on the relationship between a DGN's intrinsic dimension and its performance, on the role of dropout and dropconnect in helping or hurting performance along with a design recipe for practitioners, and on how to derive new, theoretically grounded regularization techniques improving performance. In addition, we demonstrated that interpretability, a key challenge in deep learning, is also within the reach of the spline formulation of DGNs, where we particularly focused on the tangent space basis of DGNs and its role in disentanglement and controlled sample generation, and finally on the crucial problem of mode collapse and training instabilities with multimodal target densities, where we demonstrated that CPA DGNs are by construction inclined to produce instabilities in those settings.

Chapter 5

Expectation-Maximization for Deep Generative Networks

5.1 Introduction

Deep Generative Networks (DGNs), which map a low-dimensional latent variable z to a higher-dimensional generated sample x, are the state-of-the-art methods for a range of machine learning applications, including anomaly detection, data generation, likelihood estimation, and exploratory analysis across a wide variety of datasets [Blaauw and Bonada, 2016, Inoue et al., 2018, Liu et al., 2018, Lim et al., 2018]. While we proposed a thorough geometrical study of DGNs in all generality in Chap. 4, we now go a step further and exploit the composition-of-MASOs formulation to provide a novel training solution.

5.1.1 Related Works

Training of DGNs roughly falls into two camps: (i) by leveraging an adversarial network, as in a Generative Adversarial Network (GAN) [Goodfellow et al., 2014], to turn the method into an adversarial game; and (ii) by modeling the latent and observed variables as random variables and performing some flavor of likelihood maximization training. A widely used solution to likelihood-based DGN training is the Variational Autoencoder (VAE) [Kingma and Welling, 2013]. The popularity of the VAE is due to its intuitive and interpretable loss function, which is obtained from likelihood estimation, and its ability to exploit standard estimation techniques ported from the probabilistic graphical models literature.

Yet, VAEs offer only an approximate solution for likelihood-based training of DGNs. In fact, all current VAEs employ three major approximation steps in the likelihood maximization process. First, the true (unknown) posterior is approximated by a variational distribution. This estimate is governed by some free parameters that must be optimized to fit the variational distribution to the true posterior. VAEs estimate such parameters by means of an alternative network, the encoder, with the datum as input and the predicted optimal parameters as output. This step is referred to as Amortized Variational Inference (AVI), as it replaces the explicit, per-datum optimization by a single deep network (DN) pass. Second, as in any latent variable model, the complete likelihood is estimated by a lower bound (ELBO) obtained from the expectation of the likelihood taken under the posterior or variational distribution. With a DGN, this expectation is unknown, and thus VAEs estimate the ELBO by Monte-Carlo (MC) sampling. Third, the maximization of the MC-estimated ELBO, which drives the parameters of the DGN to better model the data distribution and the encoder to produce better variational parameter estimates, is performed by some flavor of gradient descent (GD).

These VAE approximation steps enable rapid training and test-time inference of DGNs. However, due to the lack of analytical forms for the posterior, the ELBO, and explicit (gradient-free) parameter updates, it is not possible to measure the quality of the above steps or to effectively improve them. Since the true posterior and expectation are unknown, current VAE research roughly falls into three camps: (i) developing new and more complex output and latent distributions [Nalisnick and Smyth, 2016, Li and She, 2017], such as the truncated distribution; (ii) improving the various estimation steps by introducing complex MC sampling with importance re-weighted sampling [Burda et al., 2015]; (iii) providing different estimates of the posterior with moment matching techniques [Dieng and Paisley, 2019, Huang et al., 2019]. More recently, Park et al. [2019] exploited the special continuous piecewise affine structure of current ReLU DGNs to develop an approximation of the posterior distribution based on mode estimation and DGN linearization, leading to Laplacian VAEs. Nevertheless, derivation of analytical DGN distributions was not considered.

Variational Expectation-Maximization. A Probabilistic Graphical Model (PGM) combines probability and graph theory into an organized data structure that expresses the relationships between a collection of random variables: the observed variables collected into x and the latent, or unobserved, variables collected into z [Jordan, 2003]. The parameters θ that govern the PGM probability distributions are learned from observations x_i ∼ x, i = 1,...,N, requiring estimation of the unobserved z_i, ∀i. This inference-optimization is commonly done with the Expectation-Maximization (EM) algorithm [Dempster et al., 1977].

The EM algorithm consists of (i) estimating each z_i from the expectation of the complete log-density taken with respect to the posterior distribution under the current parameters at time t; and (ii) maximizing the estimated complete log-likelihood to produce the updated parameters θ^{t+1}. The estimated complete log-likelihood obtained from the E-step is a tight lower bound to the true complete log-likelihood; this lower bound is maximized in the M-step. This process has many attractive theoretical properties, including guaranteed convergence to a local maximum of the likelihood [Koller and Friedman, 2009].

In the absence of a closed-form or tractable posterior, an alternative (non-tight) lower bound can be obtained by using a variational distribution instead. This distribution is governed by parameters γ that are optimized to make this distribution as close as possible to the true posterior. This process results in a variational E (VE) step [Attias, 2000], or variational inference (VI). The tightness of the lower bound is measured by the Kullback–Leibler (KL) divergence between the variational and true posterior distributions. Minimization of this divergence cannot be done directly (due to the absence of a tractable posterior) but rather indirectly by maximizing the so-called evidence lower bound (ELBO) via

log(p(x)) = E_{q(z|γ)}[log(p(x, z|θ))] + H(q(z|γ)) + KL(q(z|γ) ‖ p(z|x, θ)),   (5.1)

where the first two terms form the ELBO, q is the variational distribution, and H is the (differential) entropy. Maximizing the ELBO with respect to γ produces the γ* that adapts q(z|γ*) to fit as closely as possible to the true posterior. Finally, maximizing the ELBO with respect to the PGM parameters θ provides θ^{t+1}; this can be performed on the entire dataset or on mini-batches [Hoffman et al., 2013].
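For concreteness, a minimal Monte-Carlo estimate of the ELBO in (5.1) under the common Gaussian choices (an illustration only; the decoder function, the diagonal Gaussian variational posterior, and all parameter names are assumptions of this sketch, not the thesis setup):

```python
import numpy as np

def mc_elbo(x, mu, log_var, decode, sigma_x=0.1, n_samples=16, seed=0):
    """MC estimate of (5.1) with q(z|x)=N(mu, diag(exp(log_var))), prior N(0, I),
    and likelihood x|z ~ N(decode(z), sigma_x^2 I); ELBO = E_q[log p(x|z)] - KL."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    z = mu + np.exp(0.5 * log_var) * eps                        # reparametrized samples
    recon = np.array([decode(zi) for zi in z])                  # g(z) for each sample
    log_lik = (-0.5 * np.sum((recon - x) ** 2, axis=1) / sigma_x**2
               - 0.5 * x.size * np.log(2 * np.pi * sigma_x**2))
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)  # KL(q || N(0, I)), closed form
    return log_lik.mean() - kl
```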

Variational AutoEncoders. A Variational AutoEncoder (VAE) uses a minimal probabilistic graphical model (PGM) with just a few nodes but highly nonlinear inter-node relations [Lappalainen and Honkela, 2000, Valpola, 2000]. The use of DNs to model the nonlinear relations originated in Oh and Seung [1998], Ghahramani and Roweis [1999], MacKay and Gibbs [1999] and has been born again with VAEs [Kingma and Welling, 2013]. Many variants have been developed, but the core approach consists of modeling the latent distribution over z with a Gaussian or uniform distribution and then modeling the data distribution as x = g(z) + ε, with ε some noise distribution and g a DGN. Learning the DGN/PGM parameters requires inference of the latent variables z. This inference is performed in VAEs by an amortized VI where a second, encoder DN f produces γ_n^* = f(x_n) from (5.1). Hence, the encoder is fed with an observation x and outputs its estimate of the optimal variational parameters that minimize the KL divergence between the variational distribution and the true posterior. During learning, the encoder adapts to make better estimates f(x_n) of the optimum parameters γ_n. Then, the ELBO is estimated with some flavor of Monte-Carlo (MC) sampling (since its analytical form is not known), and the maximization over the θ parameters is solved iteratively using some flavor of gradient descent.

5.1.2 Contributions

In this chapter, we advance both the theory and practice of DGNs and VAEs by computing the exact analytical posterior and marginal distributions of any DGN employing continuous piecewise affine (CPA) nonlinearities. The knowledge of these distributions enables us to perform exact inference and obtain Expectation-Maximization training of DGNs without resorting to AVI or MC sampling, and to train the DGN in a gradient-free manner with guaranteed convergence.

5.2 Posterior and Marginal Distributions of Deep Generative Networks

We now derive analytical forms of the key DGN distributions by exploiting the CPA property. In Sec. 5.3 we will use this result to derive the EM learning algorithm for DGNs and study the VAE inference approximation versus the analytical one.

Our key insight is that a CPA DGN consists of an implicit latent space partition and an associated per-region affine mapping (recall (3.1)). In a DGN, propagating a latent datum z through the layers progressively builds A_ω, b_ω. We now demonstrate that, by making this region selection process explicit, the analytical DGN marginal and posterior distributions can be obtained.

5.2.1 Conditional, Marginal and Posterior Distributions of Deep Generative Networks

Throughout the sequel we will consider the commonly employed case of a centered Gaussian latent prior and centered Gaussian noise [Zhang et al., 2018a], as

p(x|z) = φ(x; g(z), Σ_x),   p(z) = φ(z; 0, Σ_z),   (5.2)

with φ the multivariate Gaussian density function with given mean and covariance matrix [DeGroot and Schervish, 2012]. When using CPA DGNs, the generator mapping is continuous and piecewise affine with an underlying latent space partition and per-region mapping as in (3.1). We can thus obtain the analytical form of the conditional distribution of x given the latent vector z as follows.

Lemma 5.1 (Conditional probability)

The DGN conditional distribution is given by p(x|z) = ∑_{ω∈Ω} 1_{z∈ω} φ(x; A_ω z + b_ω, Σ_x), with per-region parameters from (4.2) and (4.3).

This type of data modeling is closely related to MPPCA [Tipping and Bishop, 1999a], which combines multiple PPCAs [Tipping and Bishop, 1999b], and to MFA [Ghahramani et al., 1996, Hinton et al., 1997], which combines multiple factor analyzers [Harman, 1976]. The associated PGMs represent the data distribution with R components and leverage an explicit categorical distribution t ∼ Cat(π), leading to the conditional input distributions x|(z, t) = ∑_{r=1}^{R} 1_{r=t} (W_r z + b_r) + ε, with W_r, b_r denoting the per-component affine parameters, with Σ_x diagonal (MPPCA) or fully occupied (MFA), and z ∼ N(µ_z, Σ_z). Note, however, that neither MPPCA nor MFA impose continuity in the (t, z) ↦ x mapping, as opposed to a DGN. To formalize this, consider an (arbitrary) ordering of the DGN latent space regions as ω_1, ..., ω_R, with R = card(Ω). We also denote by Φ_{ω_r} the cumulative density function on ω_r (the integral of the density function over ω_r).

Proposition 5.1 (MPPCA)

A DGN with distributions given by (5.2) corresponds to a continuous MPPCA (or MFA) model with implicit categorical variable given by p(t = r) = Φ_{ω_r}(0, Σ_z), W_r = A_{ω_r}, b_r = b_{ω_r}, R = card(Ω), and Σ_x = σI (or full Σ_x).

Note that this result generalizes the results of Lucas et al. [2019], Park et al. [2019], which showed that shallow DGNs and deep linear DGNs fall back to a PPCA model. This can be easily seen from the formula in Lemma 5.1 by setting the DGN g to be linear, as in g(z) = Wz + b + ε; in that case, the partition is made of a single region (the entire DGN input space), and the (single) affine parameters are A_ω = W, b_ω = b. We now calculate the marginal p(x) and posterior p(z|x) distributions. The former will be of use to compute the likelihood, while the latter will enable us to derive the analytical E-step in the next section.

Theorem 5.1 (Marginal and posterior distribution)

The marginal and posterior distributions of a CPA DGN are given by

p(x) = ∑_{ω∈Ω} φ(x; b_ω, Σ_x + A_ω Σ_z A_ω^T) Φ_ω(µ_ω(x), Σ_ω),   (5.3)

p(z|x) = p(x)^{−1} ∑_{ω∈Ω} 1_{z∈ω} φ(x; b_ω, Σ_x + A_ω Σ_z A_ω^T) φ(z; µ_ω(x), Σ_ω),   (5.4)

with µ_ω(x) = Σ_ω ( A_ω^T Σ_x^{−1} (x − b_ω) ),  and  Σ_ω = ( Σ_z^{−1} + A_ω^T Σ_x^{−1} A_ω )^{−1}.   (5.5)
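As an illustration of (5.5) (a sketch only; the function and variable names are ours), the region-specific posterior mean and covariance can be formed directly from A_ω, b_ω and the two covariances:

```python
import numpy as np

def region_posterior_params(A_w, b_w, x, Sigma_x, Sigma_z):
    """mu_omega(x) and Sigma_omega from (5.5) for one region omega."""
    Sx_inv = np.linalg.inv(Sigma_x)
    Sigma_w = np.linalg.inv(np.linalg.inv(Sigma_z) + A_w.T @ Sx_inv @ A_w)
    mu_w = Sigma_w @ (A_w.T @ Sx_inv @ (x - b_w))   # bias removed, backprojected, whitened
    return mu_w, Sigma_w
```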

We demonstrate how to compute the integral of a multivariate Gaussian on a polytopal domain (Φ_ω(µ_ω(x), Σ_ω)) in the next section. Note that for both the marginal and the posterior distribution, when considering a specific region ω ∈ Ω, those distributions are parametrized by a region-specific mean µ_ω(x) and covariance Σ_ω that we can interpret. For that purpose, consider Σ_x = I, Σ_z = I, to obtain µ_ω(x) = (I + A_ω^T A_ω)^{−1} A_ω^T (x − b_ω). That is, the bias of the per-region affine mapping is removed from the input, which is then mapped back to the latent space via A_ω^T and whitened by the "regularized" inverse of the correlation matrix of A_ω. Note that A_ω^T backpropagates the signal from the output to the latent space in the same way that gradients are backpropagated during gradient learning in a DN. We further highlight that the specific form of the posterior is a mixture of truncated Gaussians [Horrace, 2005], a truncated Gaussian being a Gaussian distribution for which the domain R^S has been constrained to a (convex) sub-domain, ω in our case. In most practical cases, the variational distribution is taken to be unimodal (Gaussian). However, as per the analytical posterior that we obtain, a unimodal variational distribution cannot capture the multimodality of the true posterior, leading to a poor variational EM step. Based on our result, practitioners should thus favor as much as possible multimodal variational distributions, for instance by employing a mixture of Gaussians for q(z|γ) (recall (5.1)), as in Tomczak and Welling [2017a].

5.2.2 Obtaining the DGN Partition

In order to streamline our development, we leverage a simplified version of (4.2) where we denote diag(σ̇^{(ℓ)}(ω)) by Q^{(ℓ)}(ω); oftentimes we will refer to those diagonal matrices simply as Q^{(ℓ)}, but it should be clear that their actual configuration depends on the considered region of the DN input space partition. We thus obtain the up-to-layer-ℓ affine parameters

A_ω^{1→ℓ} ≜ W^{(ℓ)} Q_ω^{(ℓ−1)} W^{(ℓ−1)} · · · Q_ω^{(1)} W^{(1)},

b_ω^{1→ℓ} ≜ b^{(ℓ)} + ∑_{i=1}^{ℓ−1} W^{(ℓ)} Q_ω^{(ℓ−1)} W^{(ℓ−1)} · · · Q_ω^{(i)} b^{(i)},   (5.6)

producing the pre-activation feature maps h^{(ℓ)}(z) ∈ R^{D^{(ℓ)}} by h^{(ℓ)}(z) = A_ω^{1→ℓ} z + b_ω^{1→ℓ}, with A_ω^{1→ℓ} ∈ R^{D^{(ℓ)}×S} and b_ω^{1→ℓ} ∈ R^{D^{(ℓ)}}. Note that we have, in particular, A_ω^{1→L} = A_ω and b_ω^{1→L} = b_ω.

Figure 5.1 (panels: Init., Steps 1–4) : Recursive partition discovery for a DGN with S = 2 and L = 2, starting with an initial region obtained from a sampled latent vector z (Init.). By walking on the faces of this region, neighboring regions sharing a common face are discovered (Step 1). Recursively repeating this process until no new region is discovered (Steps 2–4) provides the DGN latent space partition (left panel).

Corollary 5.1 (Partition region H-representation)

The polyhedral region ω is given by

ω = ⋂_{ℓ=1}^{L−1} { z ∈ R^S : A_ω^{1→ℓ} z < −Q^{(ℓ)}(ω) b_ω^{1→ℓ} },

where the inequality is applied elementwise.

The above result tells us that the pre-activation signs indicate on which side of each hyperplane the region ω is located, which provides a direct way to compute the H-representation of ω. To obtain the entire partition Ω, we propose a recursive scheme that starts from an initial region (or sample z) in the DGN input space and walks on its faces to discover the neighboring regions. This process is repeated on the newly discovered regions until no new region is discovered. We illustrate this exploration procedure in Fig. 5.1.
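The following sketch (illustrative, for a fully connected ReLU generator; it encodes the same information as Cor. 5.1 through the pre-activation signs, and all names are ours) stacks the half-space constraints describing the region ω containing a given z:

```python
import numpy as np

def region_halfspaces(weights, biases, z):
    """Half-space description {z' : H z' < h} of the region omega containing z,
    built from the pre-activation signs of the hidden layers (cf. Cor. 5.1)."""
    z = np.asarray(z, dtype=float)
    A_cum, b_cum = np.eye(len(z)), np.zeros(len(z))        # A_omega^{1->l}, b_omega^{1->l}
    H_rows, h_vals = [], []
    for W, c in zip(weights[:-1], biases[:-1]):            # hidden layers only (last is linear)
        A_cum, b_cum = W @ A_cum, W @ b_cum + c            # pre-activation affine map
        s = np.where(A_cum @ z + b_cum > 0, 1.0, -1.0)     # activation sign pattern at z
        H_rows.append(-s[:, None] * A_cum)                 # -s_i (A z' + b)_i < 0
        h_vals.append(s * b_cum)
        q = (s > 0).astype(float)                          # ReLU derivative
        A_cum, b_cum = q[:, None] * A_cum, q * b_cum       # post-activation map for next layer
    return np.vstack(H_rows), np.concatenate(h_vals)
```

The partition Ω itself can then be recovered by the recursive neighbor-walking scheme described above.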

5.2.3 Gaussian Integration on the Deep Generative Network Latent Partition

We now turn to the computation of the DGN marginal (5.3) and posterior (5.4) distributions, for which we need to integrate over all of the regions ω ∈ Ω in the latent space partition.

The Gaussian integral on a region ω (and its moments) cannot in general be obtained by direct integration unless ω is a rectangular region [Tallis, 1961, BG and Wilhelm, 2009] or is polytopal with at most S faces [Tallis, 1965]. In general, the DGN regions ω ∈ Ω will have at least S + 1 faces, as they are closed polytopes in R^S. To leverage the known integral forms, we propose to first decompose a DGN region ω into simplices ((S + 1)-face polytopes in our case [Munkres, 2018]) and then further decompose each simplex into open polytopes with at most S faces, allowing Tallis [1965]'s result to be used to integrate a Gaussian on an arbitrary polytope ω. In our case, we perform the simplex decomposition with the Delaunay triangulation [Delaunay et al., 1934], denoted as T(ω) with

T(ω) ≜ {Δ_1, ..., Δ_{card(T(ω))}},  with  ⋃_{i=1}^{card(T(ω))} Δ_i = ω  and pairwise disjoint interiors Δ̊_i ∩ Δ̊_j = ∅, ∀i ≠ j,   (5.7)

where each Δ_i is a simplex defined by S + 1 half-spaces, Δ_i = ⋂_{s=1}^{S+1} H_{i,s}. This process is illustrated in Fig. 5.2. The decomposition of each simplex into open polytopes with fewer than S + 1 faces is performed by employing the standard inclusion-exclusion principle [Björklund et al., 2009], leading to the following result.

Lemma 5.2 (Domain of integration splitting)

The integral of any integrable function g on a polytopal region ω ∈ Ω can be decomposed into integrations over open polytopes of at most S faces via

∫_ω g(z) dz = ∑_{Δ∈T(ω)} ∑_{(s,V)∈H(Δ)} s ∫_V g(z) dz,

with H(Δ_i) ≜ { ( (−1)^{|J|+S}, ⋂_{j∈J} H_{i,j} ) : J ⊆ {1,...,S + 1}, |J| ≤ S }.

Figure 5.2 : Triangulation T(ω), as per (5.7), of a polytopal region ω (left plot) obtained from the Delaunay triangulation of the region vertices, leading to 3 simplices (three right plots).

From the above result, we can apply the known form of the Gaussian integral on a polytopal region with fewer than S faces and obtain the form of the integral and moments as provided in Appendix B.2, where detailed pseudo code is also provided.
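A minimal sketch of the simplex decomposition step (5.7), assuming the vertices of the region have already been enumerated (e.g., from its H-representation); it relies on SciPy's Delaunay triangulation, and the function name is ours:

```python
import numpy as np
from scipy.spatial import Delaunay

def triangulate_region(vertices):
    """Delaunay triangulation T(omega) of a polytopal region given its vertices
    (one vertex per row); returns each simplex as its (S+1) x S vertex array."""
    verts = np.asarray(vertices, dtype=float)
    tri = Delaunay(verts)
    return [verts[simplex] for simplex in tri.simplices]
```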

Computational Complexity: Exact evaluation of the analytical DGN distributions requires (i) computing the partition, (ii) triangulating each partition region, and (iii) integrating on a region using Lemma 5.2. The first two steps have complexity growing with the latent space dimension and the number of regions. Even though their asymptotic complexity is linear with respect to the number of regions, one must recall that this quantity grows exponentially with the width and depth of a DN [Montufar et al., 2014]. The third step of integration from Lemma 5.2 is computationally expensive, particularly with respect to the latent space dimension S. This is currently the main practical limitation of performing the analytical computation of the DGN posterior (and thus the E-step). A more elaborate discussion plus several solutions are provided in the next section; see also Appendix B.8 for the asymptotic computational complexity details.

Figure 5.3 : Left: noiseless generated samples g(z) in red and noisy samples g(z) + ε in blue, with Σ_x = 0.1 I, Σ_z = I. Middle: marginal distribution p(x) from (5.3). Right: the posterior distribution p(z|x) from (5.4) (blue), its expectation (green) and the position of the region limits (black), for the sample point x depicted in black in the left figure.

Visualization of the Marginal and Posterior Distributions. To illustrate our theoretical development so far, we now visualize the posterior and marginal distributions of a randomly initialized DGN in a low-dimensional space, D = 2, with latent dimension S = 1. (See Appendix B.9 for the architectural details of the DGN.) We depict the various distributions as well as the generated samples in Fig. 5.3. We also plot the posterior distribution based on one observation g(z_0), given a sample z_0 from the z distribution, and on one noisy observation g(z_0) + ε_0, given a noise realization ε_0.

5.3 Expectation-Maximization Learning of Deep Generative Networks

We now derive an analytical Expectation-Maximization (EM) training algorithm for CPA DGNs based on the results of the previous sections. We then compare DGN training via EM and AVI, and leverage the exact complete likelihood to perform model selection and study the VAE approximation error.

5.3.1 Expectation Step

The E-step infers the latent (unobserved) variables associated with the generation of each observation x by taking the expectation of the log of the complete likelihood with respect to the posterior distribution (5.4). We denote the per-region moments of the DGN posterior (from Appendix B.2) by E_{z|x}[1_{z∈ω}] ≜ e_ω^0(x), E_{z|x}[z 1_{z∈ω}] ≜ e_ω^1(x), and E_{z|x}[z z^T 1_{z∈ω}] ≜ E_ω^2(x); we also have e^1(x) ≜ E_{z|x}[z] = ∑_ω e_ω^1(x), and likewise for the second moment. We obtain the following E-step (the detailed derivations are in Appendix B.6.1):

E_{z|x}[log(p(x|z) p(z))] = − (1/2) log( (2π)^{S+D} |det(Σ_x)| |det(Σ_z)| ) − (1/2) trace(Σ_z^{−1} E^2(x))
  − (1/2) [ x^T Σ_x^{−1} x − 2 x^T Σ_x^{−1} ∑_ω ( A_ω e_ω^1(x) + b_ω e_ω^0(x) )
  + ∑_ω ( trace(A_ω^T Σ_x^{−1} A_ω E_ω^2(x)) + (e_ω^0 b_ω + 2 A_ω e_ω^1(x))^T Σ_x^{−1} b_ω ) ].

Note that the (per-region) moments involved in the E-step, such as e_ω^1(x), are taken with respect to the current parameters θ = {Σ_x, Σ_z, (W^{(ℓ)}, b^{(ℓ)})_{ℓ=1}^{L}}. That is, if gradient-based optimization is leveraged to maximize the ELBO, then no gradient should be propagated through them. We can see from the above formula that the contributions of each region's affine parameters are weighted based on the posterior for each datum x. That is, for each input x, the posterior combines all of the per-region affine parameters, as opposed to current forms of learning that leverage only the parameters involved in the specific region activated by the DGN input z.

5.3.2 Maximization Step

Given the E-step, the ELBO can be maximized via some flavor of gradient-based optimization. However, thanks to the analytical E-step and the Gaussian form of the involved distributions, there exists an analytical form of this maximization process (M-step), leading to the analytical M-step for DGNs. The formulas for all of the DGN parameters are provided in Appendix B.6. We provide here the analytical form for the bias b^{(ℓ)*}, for which we introduce r_ω^ℓ(x) as the expected reconstruction error of the DGN:

r_ω^ℓ(x) ≜ x − ( ∑_{i≠ℓ} A_ω^{i+1→L} Q_ω^{i} b^{i} ) e_ω^0(x) − A_ω e_ω^1(x)   (expected residual without b^{(ℓ)}),

b^{(ℓ)*} = ( ∑_x ∑_ω Q_ω^{(ℓ)} A_ω^{L→ℓ+1} Σ_x^{−1} A_ω^{ℓ+1→L} Q_ω^{(ℓ)} )^{−1} ( ∑_x ∑_{ω∈Ω} Q_ω^{(ℓ)} A_ω^{L→ℓ+1} Σ_x^{−1} r_ω^ℓ(x) ),

where the last factor is the expected residual back-propagated to layer ℓ.

Some interesting observations can be made based on the analytical form of these updates. First, the bias update is based on the residual of the reconstruction error with a DGN whose bias has been removed; this residual is then backpropagated to the ℓth layer. The backpropagation is performed via the (transposed) backpropagation matrix, as when performing gradient-based learning. Second, the updates of any parameter depend on each region parameter's contribution based on the posterior moments and integrals, similarly to any mixture model. Third, all of the updates are whitened based on the backpropagation (or forward propagation) correlation matrix A_ω^{ℓ→L}, ∀ω, ∀ℓ. We study the impact of using a probabilistic prior on the layer weights, such as Gaussian, Laplacian, or Uniform, which are related to the ℓ_2, ℓ_1 regularization and weight clipping techniques, in Appendix B.7.

Figure 5.4 : DGN training under EM (black) and VAE training with various learning rates (blue: 0.005, red: 0.001, green: 0.0001). In all cases, the VAE converges to the maximum of its ELBO. The gap between the VAE and EM curves is due to the inability of the VAE's AVI to correctly estimate the true posterior, pushing the VAE's ELBO far from the true log-likelihood (recall (5.1)) and thus preventing it from precisely approximating the true data distribution.

5.3.3 Empirical Validation and VAE Comparison

We now numerically validate the above EM steps on a simple problem involving data points on a circle of radius 1 in 2D, augmented with Gaussian noise of standard deviation 0.05. We depict the EM training of a 2-layer DGN of width 8 against VAE training. In all cases the DGNs have the same architecture with the same weight initialization; the dataset is also identical between models, with the same noise realizations. Thanks to the analytical form of the marginals, we can compute the true ELBO (without variational estimation of the true posterior) for the VAE during its training, to monitor its ability to fit the data distribution. We depict the evolution of the negative log-likelihood during training (EM steps for the EM training setting and VAE updates for VAEs) in Fig. 5.4.

We observe that EM training converges faster and to a lower negative log-likelihood.

In addition, we see how all of the trained VAEs seem to converge to the same bound, which likely corresponds to the maximum of its ELBO, where the gap is induced 85

training steps Figure 5.5 : KL-divergence between a VAE variational distribution and the true DGN posterior when trained on a noisy circle dataset in 2D for 3 different learning rates. During learning, the DGN adapts such that g(z) +  models the data distribution based on the VAE’s estimated ELBO. As learning progresses, the true DGN posterior becomes harder to approximate by the VAE’s variational distribution in the AVI pro- cess. As such, even in this toy dataset, the commonly employed Gaussian variational distribution is not rich enough to capture the multimodality of p(z|x) from (5.4).

Data EM VAE (lr LR) VAE (med LR) VAE (sm LR)

Figure 5.6 : EM training of a DGN with latent dimension 1. We show only the generated continuous piecewise affine manifold g(z) without the additional white noise . We see how EM training of the DGN is able to fit the dataset, while VAE (with different learning rates (LR)) suffers from hyperparameter sensitivity and slow convergence. Training details and additional figures for this experiment are provided in Appendix B.9.

by the use of a variational approximation of the true posterior. We confirm this by

looking at the KL divergence between the true posterior and the AVI estimates of

the VAE models during training in Fig. 5.5. We also experiment with another uni-

dimensional manifold which is a localized subpart of a cosine function in 2D and a more complicated manfiold that is MNIST constrained to the digit 4. We present the manifolds and the EM versus VAE learned manifolds in Fig. 5.6 and Fig. 5.7. We observe the ability of EM to fit the manifold while VAEs suffer from slow convergence 86

EM VAE (large lr) VAE (medium lr) VAE (small lr)

Figure 5.7 : Reprise of Fig. 5.6 for MNIST data restricted to the digit 4, employing a 3-layer DGN with latent dimension of 1. Details of training and additional figures for this experiments are provided in Appendix B.9.

and poor posterior approximation. Additional figures and experiments with various architectures are provided in Appendix B.9.

We thus observed that EM learning produces a much smaller negative log-likelihood and that providing better posterior estimates (improved AVI) is key to improve VAE performances. In particular, multimodal variational distributions should be consid- ered for VAEs regardless of the data at hand. In fact, recall from (5.4) that the

T posterior is a mixture of truncated Gaussians with covariances based on Aω Aω.

5.4 Discussions

We have derived the analytical form of the posterior, marginal, and conditional dis- tributions for DGNs constructed using continuous piecewise affine nonlinearities with

Gaussian output and latent distributions. This has enabled us to derive the EM- learning algorithm for DGNs that not only converges faster than state-of-the-art VAE training but also to a higher likelihood. Our proposed methodology also applies to more general distributions, requiring them only to be conjugate priors in order to obtain an analytical solution. Our analytical forms can be leveraged to improve the variational distribution of VAEs, understand the form of analytical weight updates, 87 study how a DGN infers the latent variable z from x, and leverage standard statistical tools to perform model selection, anomaly detection and beyond. 88

Chapter 6

Insights Into Deep Network Pruning

6.1 Introduction

Deep Networks (DNs) are powerful and versatile function approximators that have reached outstanding performances across various tasks, such as board-game playing

[Silver et al., 2017], genomics [Zou et al., 2019], and language processing [Esteva et al.,

2019]. For decades, the main driving factor of DN performances has been progresses in their architectures, e.g. with the finding of novel nonlinear operators [Glorot et al.,

2011, Maas et al., 2013], or by discovering novel arrangements of the succession of

linear and nonlinear operators [LeCun et al., 1995a, He et al., 2016a, Zhang et al.,

2018d]. With a tremendously increasing need for DNs’ practical deployments, one line

of research aims to produce a simpler, energy efficient DN by pruning a dense one,

e.g. removing some layers/nodes/weights and any combination of these options from

a DN architecture, leading to a much reduced computational cost. Recent progresses

[You et al., 2020, Molchanov et al., 2016] in this direction allow to obtain models

much more energy friendly while nearly maintaining the models’ task accuracy [Li

et al., 2020]. Throughout this chapter, we will often abuse notations and refer to an

unpruned DN as “dense” or “complete”. While tremendous empirical progress has

been made regarding DN pruning, there remains a lack of theoretical understanding

of its impact on a DN’s decision boundary as well as a lack of theoretical tools for

deriving pruning techniques in a principled way. Such understandings are crucial for 89 one to study the possible failure modes of pruning techniques, to better decide which to use based on a given application, or to design pruning techniques possibly guided by some a priori knowledge about the given task and data.

6.1.1 Related Works

The common pruning scheme adopts a three-step routine: (i) training a large model with more parameters/units than the desired final DN, (ii) pruning this overly large trained DN, and (iii) fine-tuning the pruned model to adjust the remaining parameters and restore as best as possible the performance lost during the pruning step. The last two steps can be iterated to get a highly-sparse network [Han et al., 2015].

Within this routine, different pruning methods can be employed, each with a specific pruning criteria, granularity, and scheduling [Liu et al., 2019b, Blalock et al., 2020].

Those techniques roughly fall into two categories: unstructured pruning [Han et al.,

2015, Frankle and Carbin, 2019, Evci et al., 2019] and structured pruning [He et al.,

2018, Liu et al., 2017b, Chin et al., 2020a]. Regardless of the pruning methods, the trade-offs lie between the amount of pruning performed on a model and the final accuracy. For various energy efficient applications, novel pruning techniques have been able to push this trade-off favorably. The most recent theoretical works on DN pruning relies on studying the existence of Winning Tickets. Frankle and Carbin

[2019] first hypothesized the existence of sub-networks (pruned DNs), called winning tickets, that can produce comparable performances to their non-pruned counterpart.

Later, You et al. [2020] showed that those winning tickets could be identified in the early training stage of the un-pruned model. Such sub-networks are denoted as early- bird (EB) tickets. Despite the above discoveries, the DN pruning literature lacks a theoretical analysis that would bring insights into (i) current pruning techniques and 90

(ii) observed phenomenons such as EBs tickets, while leading to principled pruning techniques. We propose to approach this task by leveraging the spline viewpoint of

DNs built in Chap. 3 to provide novel interpretations of existing pruning techniques, study the conditions to their success and when should they be avoided, and finally, how to derive novel pruning strategies from first principles.

6.1.2 Contributions

In this chapter we are turning our focus towards a recent technique that is Deep

Network Pruning. As we will see, pruning, which consists in removing some weights and/or units of a DN, can be studied thoroughly from a geometric point of view thanks to the knowledge of the DN input space partition and its ties with the DN input-output mapping. After providing many practical insights into pruning, we will propose a novel strategy from those understandings that is able to compete with alternative state-of-the-art methods.

6.2 Winning Tickets and DN Initialization

In this section, we develop a novel perspective to produce minimal energy efficient

DNs by searching for improved initialization schemes as opposed to employing the standard strategy of training an overparametrized DN, and repeatedly pruning it and fine-tuning it. As we will demonstrate, when initialization is done in a specific way, one can directly train the minimal DNs resulting in better performances and a reduced number of FLOPs. 91

Figure 6.1 : K-means experiments on a toy mixture of 64 Gaussian in 2d, where in all cases the number of final cluster is 64 but the number of starting clusters (x- axis) varies and pruning is applied during training to remove redundant centroids,

est Accuracy(%) comparing random centroid initialization T and kmeans++. With overparametriza- tion, random initialization and pruning

Number of Initial Clusters reaches the same accuracy as kmeans++.

6.2.1 The Initialization Dilemma and the Importance of Overparametriza-

tion

The term overparametrization has been used extensively in the recent DN literature.

Throughout this paper, we will refer to a model as being overparametrized when it is possible to prune it and still manage to solve the task at hand with roughly the same

final performance. This can be done by reducing the number of units (and layers) in a DN, or reducing the number of centroids in K-means [MacQueen et al., 1967]. In this subsection, we propose to consider not only DNs but also more standard machine learning algorithms such as K-means with the following goal: demonstrate that the chaining of (i) overparametrization, (ii) training, and (iii) pruning provides a power- ful strategy when good initialization for the algorithms is not known, and conversely, that searching for novel initialization schemes for DNs might provide alternative so- lutions to current pruning techniques.

The importance of initialization is crucial for most models, even standard methods such as K-means. A rich branch of research has been focusing on K-means initializa- tion alone for decades [Bradley and Fayyad, 1998, Kanungo et al., 2002, Hamerly and

Elkan, 2002, Arthur and Vassilvitskii, 2006, Celebi et al., 2013]. As a result, we will 92 employ K-means as a control method where we can experiment both with random initialization and “advanced” initialization. We first consider the case of K-means in a toy setting where we know a priori the number of clusters for artificial data that we generate from a Gaussian Mixture Model (GMM) [Reynolds, 2009] with spherical and identical covariances, and uniform cluster prior in order to fully fall into the K-means data modeling. We perform the usual pruning strategy (i.e., training, pruning, and

fine-tuning) over multiple runs and with varying numbers of initial clusters. Specif- ically, pruning is done by removing the centroids that are closest to each other in terms of their `2 distance, and until the final number of clusters is equal to the true one. We also repeat this experiment by employing kmeans++ initialization [Arthur and Vassilvitskii, 2006] for K-means instead of sampling random centroids. We report the clustering accuracy in Fig. 6.1, from which we distinctively observe the ability of the random initialization case to produce very accurate models whenever the num- ber of starting clusters is greater than the true one, while the advanced initialization strategy offers near-optimal performances without resorting to overparametrization.

In fact, it should be clear that due to random initialization of the centroids, the more initial clusters are used, the more likely it becomes that at least one centroid will be near each of the clusters of the data distribution.

In the case of DNs, most initialization techniques focus on maintaining feature maps statistics bounded through depth to avoid vanishing or exploding gradient

[Glorot and Bengio, 2010, Sutskever et al., 2013, Mishkin and Matas, 2015]. How- ever, incorporating data information into the DN weights initialization as is done in K-means with say kmeans++ remains to be developed for DNs. Hence, over- parametrization allows successful training, and a posteriori, one can remove the re- dundant parameters and obtain a final model with much better performances versus 93 the non-overparametrized and non-pruned counterpart. This is the key motivation of Early Bird tickets. Furthermore, the parallel between DNs and K-means is most relevant as it has been shown in Sec. 3.4 that the DN decision process relies on an input space partition based on centroids that is very similar to the one of K-means and which thus benefit in the same way to overparametrization. Beyond this geo- metric aspect, overparametrization has been proven to facilitate optimization [Arpit and Bengio, 2019] and to position the initial parameters close to good local minima

[Allen-Zhu et al., 2019a, Zou and Gu, 2019, Kawaguchi et al., 2019] reducing the number of updates needed during training. Remark 6.1

Winning tickets are the result of employing overly parametrized DNs which are sim- pler to optimize and produce better performances, as current optimization techniques can not escape from poor local minima and advanced DN initialization (near good local minima) is unknown.

We further support the above remark in the following section where we demonstrate how the absence of good initialization coupled with non-optimal optimization prob- lems impacts performances unless overparametrization is used, in which case winning tickets naturally emerge.

6.2.2 Better DN Initialization: An Alternative to Pruning

We saw in the previous section that the concept of winning tickets emerges from the need to overparametrize DNs which in turn emerges naturally from architecture search and cross-validation as overparametrizing greatly facilitates training and improves

final results in the absence of good initialization. We now show that if a better initialization of DNs existed, one would have the ability to train a minimal DN directly 94

Table 6.1 : Accuracies of layerwise (LW) pretraining, structured pruning with random and lottery ticket initialization. Pruning Setting Ratio Random Init. Lottery Init. LW Pretrain 30% 93.33±0.01 93.57±0.01 93.08±0.00 VGG-16 50% 93.07±0.03 93.55±0.03 93.08±0.01 CIFAR-10 70% 92.68±0.02 93.44±0.01 92.81±0.02 90% 90.48±0.06 90.41±0.23 90.88±0.02 10% 71.49±0.03 71.70±0.09 71.14±0.02 VGG-16 30% 71.34±0.10 71.24±1.18 71.35±0.01 CIFAR-100 50% 67.74±1.05 69.73±1.15 70.19±0.01 70% 60.44±4.98 66.61±0.95 67.40±0.83

and thus would not resort to the entire pruning pipeline.

We convey the above point with a carefully designed experiment. We consider

three cases. First, the case of employing a minimal DN with random weights ini-

tialized the usual from random Kaiming initialization [He et al., 2015a]. Second, we consider the same minimal DN architecture but with weights initialized based on un- supervised layerwise pretraining which we consider as a data-aware initialization (no label information is used) [Belilovsky et al., 2019]. In both cases, after initialization, training is done on the classification task in the same manner. Third, we consider an overparametrize DN trained with the lottery ticket (LT) method (training, pruning, and re-training). The final models of the three cases have the same architecture (but different weights based on their own training method). We report their classification results in Table 6.1, from which we can see that especially for very small final DNs

(high pruning ratios) LT models outperform a randomly initialized DN, but in turn a well initialized DN is able to outperform LT training. From this, we see that the abil- ity of pruning methods and, in particular, LT to produce better-performing minimal 95

DNs than directly training the same minimal DN lies in the lack of good initialization for DNs. In fact, for high pruning ratios, layerwise pretraining even offers a more en- ergy efficient method overall (including the pretraining phase). This should open the door and motivates further studies of such schemes as a possible alternative solution to produce energy efficient models.

As the amount of different architectures grows rapidly and the specificity of those architectures can vary drastically, simple layerwise pretraining falls short of provid- ing an advanced initialization solution. For example, it is not clear how layerwise pretraining can be used with a DenseNet [Huang et al., 2017] where some parame- ters connect layers that are far apart in the architecture. Hence, while we believe in searching for improved initialization strategies, we now focus on studying LT train- ing and DN pruning as they provide a universal solution. We thus propose to study pruning in-depth and how to develop new ones from a spline perspective, providing a universal solution to understand and design efficient DNs across tasks and datasets.

6.3 Pruning Continuous Piecewise Affine DNs

As demonstrated in Chap. 3, a DN equipped with standard nonlinearities, such as

(leaky-)ReLU/max-pooling, is a CPA operator with an underlying partition of its input space. Such a connection provides a new perspective to analyze how the DN decision boundary is formed during training and what are the impacts of network compression methods, e.g. network pruning from such a geometry perspective. We propose to study those questions in the following sections. 96

Node Pruning Weight Pruning (a) Spline Insights for Pruning

Data Grid Unpruned Prune 20% Prune 40%

Prune 60% Prune 80% Prune 90% FCNet Pruning Ratio (%) FCNet Splines Visualization (b) FCNets Spline Experiments

Unpruned Prune 20% Prune 40% Data Grid

Prune 60% Prune 80% Prune 90% ConvNet Pruning Ratio (%) ConvNet Splines Visualization (c) ConvNets Spline Experiments Figure 6.2 : (a) Difference between node and weight pruning, where the former removes entire subdivision lines while the latter simply quantize those partition lines to be colinear to the space axes. (b) Toy classification task pruning, where the blue lines represent subdivisions in the first layer and the red lines denote the last layer’s decision boundary. We see that: 1) pruning indeed removes redundant subdivision lines so that the decision boundary remains an X-shape until 80% nodes are pruned; and 2) ideally, one blue subdivision line would be sufficient to provide two turning points for the decision boundary, e.g., visualization at 80% sparsity, but the classification accuracy degrades a lot if further pruned. That aligns with the initialization dilemma for small DNs, i.e., blue lines are not well initialized and all lines remain hard for training. (c) MNIST reproduction of (b), where to produce these visuals, we choose two images from different classes to obtain a 2-dimensional slice of the 764-dimensional input space (grid depicted on the left). We thus obtain a low-dimensional depiction of the subdivision lines that we depict in blue for the first layer, green for the second convolutional layer, and red for the decision boundary of 6 vs. 9 (based on the left grid). The observation consistently shows that only parts of subdivision lines are useful for decision boundary; and the goal of pruning is to remove those (redundant) subdivision lines. 97

6.3.1 Interpreting Pruning from a Spline Perspective

We first propose to leverage the DN input space partition to study the difference be-

tween node (or unit) and weight pruning. In the former, units of different layers are

removed, while in the latter, entries of the W (`) matrices (or C(`) for convolutions) are removed. We demonstrate in Fig. 6.2 (a) that node pruning removes entire sub-

division lines while weight pruning (or quantization) can be thought as finer granular limitations on the slopes of subdivision lines, and will only remove the subdivision line when all entries of a specific row in W (`) are 0. From this, we can already identify the

reason why pruned networks are less expressive than the overparametrized variants

[Sharir and Shashua, 2018] as pruned DNs’ input space partition is limited compared

to their non-pruned counter parts.

Despite the constraints that pruning imposes on the DN input space partition,

classification performances do not necessarily reduce when pruning is employed. In

fact, the final decision boundary, while being tied with the DN input space partition,

does not always depend on all the existing subdivision lines. That is, pruning will

not degrade performances as long as the needed decision boundary geometry does not

rely on the partition regions that are being affected by pruning. We demonstrate

and provide detailed visualization of the above in Fig. 6.2 (b) with a simple toy

classification task which only requires a few subdivision lines to produce a decision

boundary perfectly solving the task. Hence, as long as pruning leaves at least those

few subdivision lines, the final performances will remain high. In fact, we observe that

for this toy case, and with a two-layer FCNet (20 nodes per layer), applying pruning

ratios ranging from 20% ∼ 95% (i.e., prune 4 ∼ 19 nodes) does not prevent solving the task as long as the remaining subdivision lines are positioned to allow the decision boundary geometry to remain intact. We also extend the above experiment to a high 98 Loss Loss

Iterations Epochs AlexNet on CIFAR10 (a) Spline Trajectory for FCNets (b) Spline Trajectory for ConvNets (c) Spline Early-Bird Tickets

Figure 6.3 : Spline trajectory during training and visualizing the Early-Bird (EB) Phenomenon, which can be leveraged to largely reduce the training costs due to the less training of costly overparametrized DNs. The trajectories mainly adapt during early phase of training.

dimensional case with MNIST classification and a DN with two convolutional layers,

20 filters, and kernel sizes of 21 and 5, respectively in Fig. 6.2 (c). By adopting the same channel pruning method as in Liu et al. [2017b], we observe that most of the pruned nodes remove subdivision lines that were not crucial for the decision boundary and thus only have a small impact on the final classification performance.

6.3.2 Spline Early-Bird Tickets Detection

Early-Bird (EB) tickets [You et al., 2020] provides a method to draw winning ticket sub-networks from a large model very early during training (10% ∼ 20% of the total number of training epochs). The EB drawn is based on an a priori designed pruning strategy and compares how different (in terms of which units/channels are removed) are the hypothetical pruned models through the training steps; this method out- performs SOTA methods [Frankle and Carbin, 2019, Liu et al., 2017b]. The main limitation of EB lies in the need to define a priori a pruning technique (itself depend- ing on various hyper-parameters). Based on the spline formulation, we formulate a novel EB method that does not rely on an external technique and only considers the 99

evolution of the DN input space partition during training.

Early-Bird in the Spline Trajectory. First, we demonstrate that there exists an EB phenomenon when viewing the DN input space partition, which should follow naturally as the DN weights and the DN input space partition are tied. We visualize

DN partition’s evolution at different training stages in Fig. 6.3 (a) and (b), under the same settings as Sec. 6.3.1. From this, we clearly see that the partition quickly adapts to the task and data at hand, and then is only slightly refined through the remaining training epochs. This fast convergence comes as early as the 2000-th iteration (w.r.t.

10000 iterations for FCNets) and the 30-th epoch (w.r.t. 160 epochs for ConvNets).

Additionally, we observe that the contribution of the first layers in the input space partition becomes stable more rapidly than for deeper layers. We can thus leverage this early convergence to detect EB tickets by using a novel metric based on those subdivision lines to draw better EB tickets than originally proposed in [You et al.,

2020].

Quantitative Distance between Input Space Partitions. To draw EB tickets based on the evolution of DN input space partitions, we first need to provide a metric that conveys such information. First, recall that each region from the DN input space has an associated binary code based on which side of the subdivision trajectories the regions lie (recall (4.2) and (4.3)). Given a large collection of data points, we can assign each datum the code of the region it lies in (found simply based on the sign of the per-layer feature maps). As training occurs and the partition adapts, the code associated with an input will vary. However, once training stabilizes and regions do not change anymore, this code will remain the same. In fact, one can easily show that in the infinite data sample regime covering the entire input space, DNs with the same codes also have the same input space partition, in turn the same decision boundary 100 geometry. The proposed metric is thus the hamming distance between the codes of each datum observed at two consecutive training steps.

We visualize the above hamming distance of the DN partition between consecutive epochs, when training AlexNet on CIFAR-10 (see more visualizations in Appendix

B.1, shown as the spline distance matrix (160 × 160) in Fig. 6.3 (c), where the (i, j)- th element represents the spline distance between networks from the i-th and j-th epochs. The distances are normalized between 0 and 1, where a lower value (w.r.t. warmer temperature) indicates a smaller spline distance (and thus DNs with similar partitions). We consistently observe that such distance becomes small (i.e., < 0.15) after the first few epochs (visualization for other networks can be found in Fig. 4 of the Appendix), indicating the EB phenomenon, but now captured in terms of the DN input space partition. To obtain an active EB drawing strategy from that, we measure and record the spline distance between three consecutive epochs, and stop the training when the two associated distances are smaller than a predefined threshold, denoted by the red block in Fig. 6.3 (c). We conclude by emphasizing that as opposed to the usual EB tickets drawn in You et al. [2020], our formulation provides a more interpretable scheme that is invariant to the pruning strategy as well as its hyper-parameters (e.g., the pruning ratio). Hence, our formulation allows for a much simpler solution that does not require to be adapted based on the pruning technique that users experiment with.

6.3.3 Spline Pruning Policy

We now propose to derive from first principles novel pruning strategies of DNs based on the spline viewpoint insights. Recall from Sec. 3.4 that the layer input space parti- tion is formed by a successive subdivision process involving each per-layer input space 101

2 0 DN Partition Ω N1 (k, k ) Pruned Ω

Figure 6.4 : We depict on the left a small (L = 2,D1 = 5,D2 = 8) DN input space partition, layer 1 trajectories in black and layer 2 in blue. In the middle is the measure from Eq. (6.1) finding similar “partition trajectories” from layer 2 seen in the DN input space (comparing the green trajectory to the others with coloring based on the induce similarity from dark to light). Based on this measure, pruning can be done to remove the “grouped partition trajectoris” and obtain the pruned partition on the right.

partition. As we also studied in the previous section, for classification performances, not all the input space partition regions and boundaries are relevant since not all affect the final decision boundary. Knowing a priori which regions of the input space partition are helping in solving the task is extremely challenging, since it requires knowledge of the desired decision boundary and of the input space partition, both being highly difficult to obtain for high dimensional spaces and large networks [Mont- ufar et al., 2014, Hanin and Rolnick, 2019]. What is simpler to obtain, however, is how redundant are some of the layer weights/units in terms of the forming of the DN partition relative to other units/weights. From that, it will become trivial to prune the redundant units/weights as their impact on the forming of the decision boundary is already carried by another unit/weight. 102

When considering the layer input space partition, we can identify “redundant” units based on how each unit impacts the partition with respect to other units. For example, if two units have biases and slope vectors proportional to each other, then one can effectively remove one of the two units without altering the layer input space partition. While this is a pathological case, we will demonstrate that the angles between per-unit slope matrices and inter-bias distances measure such a redundancy.

We first introduce our pairwise redundancy measure as follows:

(`) (`) ! (`) 0 |h[W ]k,., [W ]k0,.i| (`) (`) 0 Nρ (k, k ) = 1 − (`) (`) + ρ|[b ]k − [b ]k |, ρ > 0, (6.1) k[W ]k,.k2k[W ]k0,.k2 where ρ is an hyper-parameter measuring the sensitivity of the difference in angle versus the biases. Finding the two units with the most similar contribution to the DN

(`) 0 input space partitioning can be done via arg mink,k06=k Nρ (k, k ) where the obtained couple (k, k0) encodes the two units which are the most redundant. In turn, one of those two units can be pruned such that the impact of pruning onto the DN input space partition is minimized.

Proposition 6.1 (Partition boundary redundance)

Given a layer and its input space partition, removing sequentially one of the two units,

0 (`) 0 k and k , for which Nρ (k, k ) = 0, leaves the layer input space partition unchanged.

The above result is crucial as removing units that do not affect the layer input space is synonymous with removing units that do not affect the entire DN input space

(`) 0 partition. In practice, units with small enough but nonzero Nρ (k, k ) are also highly redundant and can be removed. We provide an example of this procedure in Fig. 6.4. 103

Table 6.2 : Evaluating the proposed layerwise spline pruning over SOTA pruning methods on CIFAR-100. PreResNet-101 VGG-16 Pruning Dataset ratio NS SplineEB SplineImprov. NS SplineEB SplineImprov. Unpruned93.66 93.66 93.66 - 92.71 92.71 92.71 - 30% 93.4893.56 93.07 +0.08 93.29 93.21 92.83 -0.08 CIFAR-10 50% 92.5292.55 92.37 +0.03 91.85 92.13 92.23 +0.38 70% 91.27 91.33 91.33 +0.06 88.52 89.68 88.65 +1.16 Unpruned73.10 73.10 73.10 - 71.43 71.43 71.43 - 10% 71.58 71.58 73.14 +1.56 71.6 71.78 72.28 +0.68 CIFAR-100 30% 70.70 70.13 72.11 +1.41 70.32 71.15 71.59 +1.27 50% 68.70 69.05 70.88 +2.18 66.1 69.92 69.96 +3.86 70% 66.51 67.06 68.41 +1.90 61.16 63.13 64.01 +2.85

6.4 Experiment Results

Here we evaluate our spline pruning method with the experiment settings added to the Appendix C.1.

6.4.1 Proposed Layerwise Spline Pruning over SOTA Pruning Methods

(`) 0 Recall that the spline pruning policy is done by solving arg mink,k06=k Nρ (k, k ). By regarding k as the index of channels for convolutional layers, we are able to conduct channel pruning in a layerwise manner. Table 6.2 shows the comparison between the spline pruning (w/ and w/o EB detection) and SOTA network slimming (NS) method

[Liu et al., 2017b] on CIFAR-10/100 datasets. We can see that the spline pruning consistently outperforms NS, achieving -0.08% ∼ 3.86% accuracy improvements. This set of results verifies our hypothesis that removing redundant splines incurs little changes in decision boundary and thus provides a good a priori initialization for retraining. 104

Table 6.3 : Evaluating the proposed global spline pruning over SOTA pruning meth- ods on ImageNet. Pruning Top-1 Top-5 FLOPs Energy Models Methods ratio Acc. (%) Acc. (%) (P) (MJ) Unpruned - 69.5 89.2 1259.1 98.1 0.1 69.6 89.2 2424.8 193.5 NS Liu et al. [2017b] 0.3 67.8 88.0 2168.8 180.9 SFP He et al. [2018] 0.3 67.1 87.7 1991.9 158.1

ResNet-18 0.1 69.4 89.0 1101.2 95.6 EB Spline 0.3 67.8 87.9 831.0 82.8 Unpruned - 75.9 92.9 2839.9 280.7 0.3 72.0 90.6 4358.5 456.1 ThiNet Luo et al. [2017] 0.5 71.01 90.02 3850.03 431.73 SFP He et al. [2018] 0.3 74.61 92.0 4330.8 470.7 LeGR Chin et al. [2020b] - 75.3 92.4 4174.74 412.6 GAL-0.5 Lin et al. [2019] - 72.0 91.8 4458.74 440.7 ResNet-50 GDP Lin et al. - 72.6 91.1 4487.1 443.5 C-SGD-50 Ding et al. [2019] - 74.5 92.1 4117.9 407.0 Meta Pru. Liu et al. [2019a] 0.5 73.4 - 3532.6 349.1 0.3 75.1 92.6 2434.0 264.2 EB Spline 0.5 73.3 91.5 1636.0 197.0

6.4.2 Proposed Global Spline Pruning over SOTA Pruning Methods

We next extend the analysis to global pruning, where the mismatch of the filter dimension in different layers impedes the cosine similarity calculation. To solve this issue, we adopt PCA for reducing the feature dimensions to the same before applying the spline pruning policy.

Spline Pruning over SOTA on CIFAR. Table 1 in the Appendix compares the retraining accuracy and total training FLOPS/energy of spline pruning with four

SOTA pruning methods Frankle and Carbin [2019], Lee et al. [2019b], Liu et al.

[2017b], Luo et al. [2017], whose detailed descriptions are in Sec. C.2 of the Appendix. 105

These results show that spline pruning consistently outperforms all competitors in

terms of the accuracy and computational cost trade-offs. Specifically, compared with

the strongest competitor among the four SOTA baselines, spline pruning achieves

0.8 × ∼ 3.5 × training FLOPs reductions while offering comparable or even better

(-0.67% ∼ 0.69%) accuracies.

Spline Pruning over SOTA on ImageNet. We further investigate whether the

spline pruning have consistent performance in a harder dataset, using ResNet-18/50

on ImageNet and benchmarking with eight SOTA pruning methods including ThiNet,

NS, SFP, LeGR, GAL-0.5, GDP, C-SGD-50, and Meta Pruning. Specifically, spline

pruning with EB detection (EB Spline) achieves a reduced training FLOPs of 43.8%

∼ 57.5% and a reduced training energy of 42.1% ∼ 54.3% for ResNet-50, while leading to a top-1 accuracy improvement of -0.12% ∼ 3.04% (a top-5 accuracy improvement of 0.18% ∼ 1.91%). Consistently, EB Spline achieves a reduced training FLOPs of 44.7% ∼ 61.7% and a reduced training energy of 39.5% ∼ 54.2% for ResNet-18, while leading to comparable top-1 accuracies (-0.24% ∼ 0.71%) and top-5 accuracies

(-0.16% ∼ 0.21%).

The above experiments show the consistent superiority of our global spline prun- ing. We also conduct ablation studies to measure the sensitivity of the hyperparam- eter ρ (see Equ. 6.1) in Sec. C.3 of the Appendix.

6.5 Discussions

We demonstrated the tight link between the presence of winning tickets and over- parametrization in DNs, the latter being necessary for DNs to reach high perfor- mances with current (random) initializations; and we demonstrated that this phe- nomenon is not unique to DNs but affect other machine learning methods such as 106

K-means. This opens new avenues to produce small, energy efficient and performing

DNs by developing “clever” initialization techniques. Furthermore, we leveraged the spline formulation of DNs to sharpen our understanding of different pruning poli- cies, study the conditions in which pruning does not deteriorate performances, and develop a novel and more principled pruning strategy extending EB tickets; and ex- tensive experiments demonstrated the superior performances (accuracy and energy efficiency) of the proposed method. The proposed spline viewpoint should open new avenues to theoretically study novel and existing pruning techniques as well as guide practitioners via the proposed visualization tools. 107

Chapter 7

Insights into Batch-Normalization

7.1 Introduction

Deep Learning has made major impacts in a wide range of areas. A deep (neural)

network (DN) is an operator fΘ, where Θ collects all learnable parameters, that maps

the input x ∈ RD to the output predictiony ˆ ∈ RS by composing L intermediate layer mappings that combine affine and simple nonlinear operators such as the fully

connected operator (simply the affine transformation defined by the weight matrix

W (`) and bias vector b(`)), convolution operator (with circulant W (`)), activation

operator (applying a scalar nonlinearity such as the ubiquitous ReLU), or pooling operator. Precise definitions of these operators can be found in Chap. 3. Each layer as defined per Def. 1.1 maps an input z(`−1) to an output z(`) via   z(`−1) 7→ σ W (`)z(`−1) + b(`) := z(`), (7.1)

with σ a element-wise activation function, and W (`), b(`) some learnable parameters.

For this chapter, we omit the pooling operator to keep our development light in no-

(`) (`) tations. The learnable parameters Θ of the DN e.g. Θ , {W , b , ∀`} are trained based on a training dataset consisting of input-output pairs X := {(xn, yn), n =

1,...,N} for supervised learning and input observations only X := {xn, n = 1,...,N} for unspervised learning, a loss function, and some gradient based optimization set- ting. This allows to learn an end-to-end nonlinear mapping. For a large dataset, it is common to extract a mini-batch B ⊂ X where usually |B|  N, evaluate the loss 108

on the samples in B and update the weights Θ based on the gradient of the mini-

batch loss. No matter what architecture or processing layers, a critical component to

performance in deep learning has been batch normalization (BN) [Ioffe and Szegedy,

2015]. The BN operator is typically inserted prior to the activation function, after

the affine transformation [Huang et al., 2017, Chen et al., 2013] and augments the

linear operator ∗ to form the BN-equipped layer mapping ! W (`)z(`−1) − µ(`) z(`−1) 7→ σ γ(`) + β(`) . (7.2) σ(`)

The vectors µ(`) and σ(`) are the element-wise average and standard deviation of

W (`)z(`−1) taken over the current mini-batch B (during training) and taken over

the entire training set (during testing). The parameters γ(`) and β(`) are learnable parameters guided by the given loss and updated via some flavors of gradient descent.

We recall that training a DN is most often done by first splitting a dataset into two parts: a training set and a testing set. The testing (test) set is used to measure the out-of-sample capacity (generalization) of a (trained) DN. A typical BN-equipped DN

(`) (`) (`) thus employs Θ , {W , γ , β , ∀`}.

7.1.1 Related Works

Nowadays, the empirical benefits of BN are ubiquitous with more than 12,000 ci- tations to the original BN article and a unanimous community employing BN to accelerate training by helping the optimization procedure and to increase generaliza- tion performances [He et al., 2016b, Zagoruyko and Komodakis, 2016, Szegedy et al.,

2016, Zhang et al., 2018c, Huang et al., 2018a, Liu et al., 2017b, Ye et al., 2018,

∗the bias can be omitted as it is included in the BN operator 109

Jin et al., 2019, Bender et al., 2018]. Despite its prevalence in today’s DN architec- tures’ performances, the understanding of the unseen forces that BN applies on DNs remains elusive; and for many, understanding why BN improves so drastically DNs performances remains one of the key open problems in the theory of deep learning

[Richard et al., 2018].

One of the first practical arguments in favor of feature map normalization emerged in Cun et al. [1998] as “good-practice” to stabilize training. By studying how the backpropagation algorithm updates the layer weights, it was observed that unless with normalized feature maps, those updates would be constrained to live on a low- dimensional subspace limiting the learning capacity of gradient based algorithms. By explicitly reparametrizing the affine transformation weights and slightly altering the renormalization process of BN, weight renormalization [Salimans and Kingma, 2016] showed how the σ(`) renormalization smooths the optimization landscape of DNs.

Similarly, Bjorck et al. [2018], Santurkar et al. [2018], Kohler et al. [2019] further studied the impact of BN in the gradient distributions and optimization landscape by designing careful and large scale experiments. By providing a smoother optimization landscape BN “simplifies” the stochastic optimization procedure and thus accelerates the training convergence and generalization. In parallel to this optimization analysis of BN in standard DN architectures, Yang et al. [2019b] developed a mean field theory for fully-connected feed-forward neural networks with random weights where BN is analytically studied. In doing so, they were able to characterize the gradient statistics in such DNs and to study the signal propagation stability depending on the weight initialization, concluding that BN stabilizes gradients and thus training. From the above results, it seems that BN has been demystified through an optimization lens.

However, many alternative techniques offering similar ‘loss surface smoothing’ and 110

‘stable gradient descent’ exist e.g. Adam [Kingma and Ba, 2014], mollifying networks

[Gulcehre et al., 2016], or resnets [Li et al., 2017a], none of which is able to reach as impressive performances as DNs equipped with BNs. This raises the following question. Are the benefits of BN completely explained solely from a loss surface and optimization perspective. We shall answer this question by providing two novel results that will extend our understanding of BN and demonstrate that BN pushes

DN performances by adapting the DN input space partition to concentrate on the data samples and by increasing the decision boundary margin.

7.1.2 Contributions

In this chapter we propose to leverage the spline formulation built in Chap. 3 to study specifically one of the most important technique in deep learning: batch- normalization. As we will see in the subsequent sections, batch-normalization is a sur- prisingly simple strategy that allowed deep learning to greatly jump in performances.

Despite being a simple algorithm, batch-normalization underlying mechanisms have never been fully grasped. We propose to contribute to the current understanding of this technique below.

Through the course of the next few sections, we will demonstrate how BN, by replacing 7.1 with 7.2, provides an unsupervised learning technique that interacts with the (un)supervised learning algorithm used to train a DN in order to focus the attention of the network onto the data points.

7.2 Batch Normalization: Unsupervised Layer-Wise Fitting

We first formalize the role of BN in a layer-by-layer study; that is, we now study the layer input space partition and how BN affects it based on its statistics µ(`), σ(`) and 111

the observed layer inputs z(`−1). This section corresponds to a single layer analysis,

the multilayer cases will be conducted in the following section.

7.2.1 Batch-Normalization Updates

BN alters the layer mapping by introducing two additional shifts and scaling (recall 7.1

and 7.2). Those additional operations leverage learnable parameters γ(`), β(`) as well as the mean and standard deviation parameters µ(`) and σ(`) respectively which are computed based on the current mini-batch B during training and the entire training set during test time as s (`) 1 X (`−1) (`) 1 X (`−1) (`) 2 µ W (`)z , σ W (`)z − µ  , (7.3) BN , B i BN , B i BN i∈B i∈B

(`) th th where zi is the ` layer output of the i datum, and B contains the (B) indices of the observations from the training set X that are contained in the current mini-batch.

(`) (`) (`) (`) During training, BN sets µ and σ to µBN and σBN respectively, and during testing, BN computes those statistics over the entire training set. In the latter case, we denote

(`) (`) (`) (`) (`) those test time statistics as µBN, σBN. The other layer parameters β , γ and W are learned based on the given loss and some flavor of gradient descent such as Adam of Nesterov momentum. In the remaining of this section we will consider the BN learnable parameters to be fixed at their initialization values (β(`) = 0 and γ(`) = 1); we study the role of those parameters in the next section where in particular we will show that their impact on performances is negligible for current dataset and tasks.

7.2.2 Layer Input Space Hyperplanes and Partition

We now focus on layers where the nonlinearity σ is continuous and piecewise affine. In particular, we will focus on the popular nonlinearities with two linear regions that are 112

ReLU (σ(u) = max(0, u)), leaky-ReLU (σ(u) = max(α, u), α > 0) or absolute value

(σ(u) = max(−u, u)). Those activation functions are continuous piecewise linear

mappings, combining this fact with the linearity of the affine transform preceding the

layer activation we will be able to characterize the layer input space partition as a

function of the activation function and the layer affine transform.

A nonlinearity “state” changes when its input, or pre-activation, sign changes.

This is enough to include of layers employing nonlinearities such as (leaky-)ReLU or

absolute value. The same can be extended to max-pooling layer but is omitted here to

streamline our development. We denote the nonlinearity pre-activations of layer ` by

h(`) such that z(`) = σ(h(`)). In layer `, the input to the kth coordinate nonlinearity

(recall (7.2)) is

(`) (`) (`−1) (`) (`) [h ]k = (h[W ]k,., z i − [µ ]k)/[σ ]k. (7.4)

The kth nonlinearity of layer ` will change state whenever its pre-activation is 0, which we can in turn express in the input space of layer `. In particular, the collection of layer inputs that are in term of the layer input z(`−1), corresponding to an hyperplane that we denote by H(`,k) and that is given by

(`,k) n (`−1) D(`−1) (`) o H : , z ∈ R :[h ]k = 0

n (`−1) D(`−1) D (`) (`−1)E (`) o = z ∈ R : [W ]k,., z = [µ ]k . (7.5)

Notice that the hyperplanes H(`,k), ∀k are D(`−1) −1 dimensional affine subspace living

(`−1) in RD , the input space of layer `. From the above, we obtain the boundary of the layer input space partition ∂Ω(`) by combining each of the hyperplanes H(`,k) as in

(`) D(`) (`,k) ∂Ω = ∪k=1 H . (7.6) 113

The forming of ∂Ω(`) obtained from a collection of hyperplanes is often denoted as

an hyperplane arrangement [Zaslavsky, 1975]. Furthermore, recall from Sec. 3.4 that

the associated layer input space partition Ω corresponds to a power diagram made of

convex cells. From 7.5 and 7.6 we see that the BN parameter µ(`) participates into

the layer input space partition, we now study how its boundaries are affected by BN.

7.2.3 Translating the Hyperplanes

Recall that BN utilizes a fixed update rule for its parameters µ(`) and σ(`) as given

in 7.3. The key result of this section shows that those fixed updates make the layer

partition concentrate around the data points. Thanks to the explicit parametrization

of the layer partition boundary from 7.5 and 7.6 we can now study how BN translates

the H(`,k) hyperplanes onto the layer inputs. To help out development, we denote by

B(`−1) the mini-batch of feature maps that are fed as input of layer `. That is, it is the

collection of z(`−1) for all the samples x ∈ B. We also denote by Z(`−1) the collection

of all the feature maps z(`−1) for all the training samples x ∈ X . As we will see, the

BN update rule (recall 7.3) will fit H(`,k) onto B(`−1) during training, and onto Z(`−1),

during testing, based on a orthogonal least-square distance minimization. Let first

recall that the point-to-hyperplane distance is given by (see for example 1 in Amaldi

and Coniglio [2013])

(`) (`−1) (`) h[W ]k,., z i − [µ ]k (`−1) d(z(`−1), H(`,k)) = , ∀z(`−1) ∈ D , (7.7) (`) (2) R k[W ]k,.k (`) (2) which is defined as long as k[W ]k,.k > 0. We also introduce the following average squared distance between the layer input space partition boundary ∂Ω(`) and the layer inputs (being either Z(`−1) or B(`−1)) as

1 X 2 L ([µ(`)] ; Z) = d z, H(`,k) [µ(`)]  , (7.8) k k Card(Z) k z∈Z 114

(`,k) (`) where for clarity we explicit the dependency of the hyperplane H with [µ ]k (recall 7.5); we will omit it hereafter for conciseness. We prove in the Appendix that

(`) (`) the BN statistics µBN and µBN (recall 7.3) are the unique solution of the strictly convex optimization problem minimizing the average squared distance between the

hyperplanes H(`,k) and the layer inputs. Theorem 7.1 (BN partition fitting)

At layer `, BN adapts the layer input space partition Ω(`) such that the boundaries

∂Ω(`) are shifted to minimize the average squared distance with the layer inputs:

D(`) (`) X (`−1) µBN = arg min Lk([µ]k; B ), (training) (`) µ∈RD k=1 D(`) (`) X (`−1) µBN = arg min L([µ]k; Z ), (testing), (`) µ∈RD k=1 and with

(`) 2 (`) 2 (`) (`−1) [σBN]k =k[W ]k,.k2Lk([µBN]k; B ), (training)

(`) 2 (`) 2 (`) (`−1) [σBN]k =k[W ]k,.k2Lk([µBN]k; Z ), (testing).

The above theorem provides the first practical understanding of how BN explicitly adapts the layer input space partition such that the boundaries are shifted to min- imize the total least square distance with the layer inputs. We illustrate this result in the first column of Fig. 7.1 where one can see how the hyperplanes are inde- pendently shifted onto the data. Another key impact of BN is that it adapts the partition into a very particular form. The average of the layer inputs, denoted as

(`−1) 1 P (`−1) 1 P z = B z∈B(`−1) z during training and z = N z∈Z(`−1) z, belongs to the intersection of all the hyperplanes H(`,k). That is, BN turns ∂Ω(`) into a central hy- perplane arrangement (non-empty intersection of all the hyperplanes [Stanley et al.,

2004]) as can be seen in the first column and second row of Fig. 7.1. 115

layer 0 layer 1 layer 2 layer 3 layer 4 with BN without BN

Figure 7.1 : Depiction for a 5-layer DN with 6 units per layer of the impact of BN (with statistics computed from all samples) onto the position and shape of the -up-to- layer-` input space partition Ω1|`; in blue are the newly introduced boundaries from the current layer, in grey are the existing boundaries. The absence of BN (top row) leaves the partition random and unalert of the data samples while BN (bottom row) positions and focuses the partition onto the data samples (while all other parameters of the BN are left identical); as per Thm. 7.1, BN minimizes the distances between the boundaries and the data samples.

Corollary 7.1 (Central Hyperplane Arrangement)

BN makes the layer input space partition boundaries (∂Ω(`)) a central hyperplane

arrangement, such that

D(`) \ z(`−1) ⊂ H(`,k), ∀` k=1 with z(`−1) the coordinate-wise average of the data (mini-batch during training, entire

training set during testing).

Thus, per layer, BN enforces that the coordinate-wise average of the layer inputs is

always located on the hyperplanes H(`,k), ∀k, and that the layer input space partition boundary ∂Ω(`) is always a central hyperplane arrangement, for any value of W (`).

While the above fully describes the role of BN into the layer input space partition, we now propose to extend this understanding to the entire DN being a composition 116

Layer 1 Partition seen in Layer 2 partition seen in Figure 7.2 : Depiction of the the DN/layer input space the DN input space layer (left) and DN (right) in- put space partition with L = 2,D(1) = 2,D(2) = 2. The par- tition boundaries of a layer in its input space corresponds to the hyperplanes H(`,k) (7.5), for deeper layers, vieweing H(`,k) in the DN input space leads to the paths P(`,k) (7.13).

of layers and BN operators.

7.3 Multiple Layer Analysis: Following the Data Manifold

In the multilayer case, the mapping consists of a composition of layers; the layer- wise fitting between ∂Ω(`) and the layer inputs z(`−1) (from Thm. 7.1) occurs at each layer in a greedy fashion. This layer-wise fitting also impacts the entire DN partition which we now propose to characterize. To do so, we first have to understand how the per-layer partitions Ω(`), ∀` combine to form the entire DN partition Ω. As we will see, this layer composition will result in the DN partitioning being a successively subdivided input partition as studied in Sec. 3.4; however, we propose here to derive those results in term of the partition boundaries, as those are the objects that allow us to characterize the role of BN.

7.3.1 Deep Network Partition and Boundaries

The DN composes L layers, each with a layer-wise partition Ω(`) from 7.15 which is affected by the presence of BN. The entire DN is itself a Continuous Piecewise Affine

(CPA) operator with a partition Ω of its input space. As was the case in the layer- 117

wise study, the DN remains affine for all inputs in a region ω ∈ Ω. This section aims

at deriving the DN partition in term of the layer-wise partitions Ω(`) which can be

efficiently leveraged to obtain the up-to-layer-` partition Ω1,...,`. To lighten notations, we will refer to Ω(1,...,`) directly as Ω(1→`) where |` can be though of as ‘up-to-layer-`’.

The DN partition Ω is simply obtained from the special case Ω|L while the partition Ω(1→`) can be thought of as the DN partition of a DN only comprising the layers

1, . . . , `. To find Ω, we have to recall that the DN remains linear in ω ∈ Ω as long

as all inputs x ∈ ω, when mapped through all the layers, produce pre-activations

h(1),..., h(L) with same sign (recall 7.5). The `th pre-activation sign will change

based on the `th layer partition, but now seen in the DN input space. Hence, by first

expressing the layer-wise partition Ω(`) in the DN input space we will easily obtain

Ω by taking their intersections. The layer-wise partition expressed in the DN input

space corresponds to finding the DN inputs (and not the layer inputs) such that layer

(`) ` remains linear. We denote this partition as Ω0 with the lowerscript 0 emphasizing that we are now working in the DN input space, and it is given by

D(`) (`) [ D (`) ∂Ω0 = {x ∈ R :[h ]k = 0}. (7.9) k=1

(`) (`) By comparing the above with 7.5,7.6 we see that ∂Ω0 is just ∂Ω “mapped back” to the DN input space. This mapped-back partition now depends on all the earlier layers. To make 7.9 explicit we need to introduce the max-affine spline operators expressing the up-to-layer-` mapping as an input/region dependent affine mapping

(`) (1→`) (1→`) as in z = Ax x + bx with

(1→`) (`) (`) (`−1) (`−1) (1) (1) Ax , Q W Q W ... Q W , (7.10) ` X µ(`) b(1→`) Q(`)W (`)Q(`−1)W (`−1) ... Q W diag( ), (7.11) x , i+1 i+1 σ(`) i=1 118

and where Q(`) is a diagonal matrix encoding the activation function “state” recalling

(4.2) and (4.3). The is each matrix Q(`) is diagonal and filled with {0, 1} for ReLU,

{−1, 1} for abs. value and then scaled by the BN parameter σ(`) as   (`) (`) (`−1) (`) α/[σ ]i, [W z − µ ]i ≤ 0 (`)  [Q ]i,i = . (7.12)  (`) (`) (`−1) (`) 1/[σ ]i, [W z − µ ]i > 0

We thus obtain explicitly the mapped-back layer-wise partition

D(`) (`) [ [ (`,k) ∂Ω0 = P |ω, k=1 ω∈Ω(1→`−1) | {z } ,P(`,k)

(`,k) where P |ω is either empty, or corresponds to a polytope face living in ω, a region of the partition Ω(1→`−1) of the DN input space, formally given by

(`,k) n (1→`−1) T (`) (`) P |ω = x ∈ ω : h(A )ω [W ]k,., xi = [µ ]k

(`) (1→`−1) o (1→`−1) − h[W ]k,., bω i , ∀ω ∈ Ω . (7.13)

We emphasize that while ∂Ω(`) from 7.6 provides the boundaries of the per-layer

(`) partition in layer input space, ∂Ω0 provides a mapped back view of it in the DN input space (compare 7.5 and 7.9). Due to the presence of nonlinear layers between the DN input space and layer ` input space, the hyperplanes H(`,k) from ∂Ω(`) now

(`,k) become a collection of polytope faces (P |ω) that depend on the earlier layers weights and partitions. Given our construction, the up-to-layer-` partition Ω(1→`) is then simply obtained by ( ) (1→`) \ (1) (`) Ω = ω, U ∈ Ω0 × · · · × Ω0 , (7.14) ω∈U L D(`) [ [ ∂Ω(1→`) = P(`,k). (7.15) `=1 k=1 119

We now study how BN impacts this DN input space partition Ω i.e. Ω|L by acting upon P(`,k). We demonstrated in Thm. 7.1 how BN shifts the hyperplanes H(`,k)

onto the layer inputs. As described above, those hyperplanes H(`,k) become P(`,k)

in the DN input space and contribute to the DN input space partition (recall 7.15).

It is thus clear that by acting on H(`,k), BN also indirectly acts upon P(`,k), and

ultimately upon Ω. The first result demonstrates how any reduction in the distance

between the hyperplane H(`,k) and the layer inputs z(`−1) implies a reduction of the

distance between P(`,k) and the DN inputs x which is a crucial result demonstrating

that (even though done per-layer) BN also adapts the entire DN partition boundaries

∂Ω to get close to the samples x. Theorem 7.2 (Multilayer BN distance)

(`,k) (`−1) Reducing the distance between an hyperplane H and a layer input zx also reduces the distance between the folded (mapped back) hyperplane P(`,k) and the

DN input x, in particular we have:

(`−1) (`,k) • if the layer ` input zx lies on an hyperplane H then the corresponding

DN input x lies on the folded hyperplane P(`,k) and vice-versa:

(`−1) (`,k) (`,k) d(zx , H ) = 0 ⇐⇒ d(x, P ) = 0

(1,...,`−1) (`,k) • if the DN input x lives on a region ω ∈ Ω where P |ω is non-empty, then the distances in the DN input space and in the layer input space are

proportional:

d(z(`−1), H(`,k)) ∝ d(x, P(`,k)).

The above result demonstrates how the layer-wise result from Thm. 7.1 (minimizing

distances between H(`,k) and z(`−1)) extends to a composition of layers, hence a DN. 120

P1,k P4,k

Figure 7.3 : Depicition of P(`,k), ` = 1, 4 where for each `, P(`,k) is colored based on (`) (`) (2) [σ ]k/k[W ]k,.k (blue: smallest, green: highest). As per Thm. 7.1, 7.2, the bluer colored paths are the ones closer to the dataset (black dots) allowing interpretability of the σ(`) parameter as the fitness between P(`,k) and the mini-batch samples.

In fact, we showed that this layer-wise fitting alse reduces the distance between P(`,k) and x. We illustrate this result in Fig. 7.1.

7.3.2 Interpreting Each Batch-Normalization Parameter

We now propose a more focused analysis on the role of σ(`). As we extensively studied

(`) (`,k) (`) (`) (`) above, µ shifts the hyperplanes H while σ does not impact ∂Ω nor ∂Ω0

(`0) 0 (`) (`) but only ∂Ω0 , ` > `. Yet, while µ acts as a shift, σ impacts the angles between adjacent polytope faces of P(`0,k), ∀`0 > `, ∀k. Furthermore, it will adapt the angle based on the underlying least squared error (recall Thm 7.1).

Theorem 7.3

The dihedral angle θ between adjacent faces of the folded hyperplanes are impacted by σ(`) and not by µ(`); in particular in a two layer DN, the angle between two adjacent 121

(1,q) faces of P2,k and separated by H is given by |[W (2)]T Q(1)(ω)W (1)[W (1)] | (2,k) (1,q) k,. q,. θ P |ω, H = arccos . (7.16) (2) T (1) (1) (2) (2) k[W ]k,.Q (ω)W k k[W 1]q,.k To better illustrate the above, let’s consider orthogonal weights W (1)(W (1))T = 1

which is often employed in DNs to improve performances [Bansal et al., 2018], and

absolute value activation [Mallat, 2012]. The above becomes

(2) (1) (2,k) (1,q) |[W ]k,q|/[σ ]q θ P |ω, H = arccos (1) . PD (2) (1) i=1 |[W ]k,i|/[σ ]i It is clear that as the first layer hyperplane H(1,q) gets more and more aligned with the

data (smallest total least square distance than H1,j, j 6= q) as the second layer folded

(1,q) (2,k) (1,q) hyperplane will get more and more aligned with H as θ P |ω, H → 0. This demonstrates how not only µ(`) but also σ(`) plays a crucial role into the shifting and alignment of the entire DN partition onto the data samples.

7.3.3 Experiments: Batch-Normalization Focuses the Partition onto the

Data

The proposed understanding of BN, highlighted in Fig. ?? and 7.3 clearly concentrates the partition boundaries onto the data. To further demonstrate how this partition concentration occurs we reproduce those experiments for deeper and wider DN on various datasets. In all the following experiments the DNs are with random weights, only BN is applied throughout, no training is done. In particular, we focus on three settings: (random) where no BN is employed, W (`), b(`) are random, (zero) where no BN is applied and b(`) = 0, this is a common DN initialization, and finally (BN)

(`) (`) (`) where W is random, and the BN statistics µBN and σBN are set as per 7.3 (BN). Notice that in all cases, no labels are needed as all three initialization schemes are at most depending on the samples xn. 122

Low-Dimensional Visualization

First, we propose to employ more complex DN architecture while leveraging a low- dimensional (2d) dataset. We artficialy construct a star shape dataset and visually depict the concentration of partition boundaries via a backpropagation scheme. That is, we overlay the boundaries, superimposing them, leading to a heatmap where high values indicate high number of boundaries in an -ball region. The uniform sampling in the -ball is done efficiently via a technique noted in Harman and Lacko [2010] and proved in Voelker et al. [2017] relying on Gaussian distributions. We depict those concentration maps for our three initialization settings (zero, random, BN) in Fig. 7.4. We see how increasing the width and depth allows BN to precisely concentrate the DN input space partition onto the samples. That is, by forcing the

DN partition boundary ∂Ω to be onto the data samples xn, the number of regions near the data samples also increases as (recall 7.13) “crossing” a boundary implies a change of region, we also confirm this by explicitly counting the number of regions around samples in the next section.

High-Dimensional Experiments

Second, we extend the above experiment to more realistic samples that the CIFAR and SVHN images. Each image is RGB with 32 × 32 pixels. As the DN input space is now high dimensional (3072), we propose a different strategy to convey the visual information from Fig. 7.4. For each dataset we pick 100 images as well as 100 random images generated with uniform distribution for each pixel. For each of those 200 samples, we count the number of regions around the sample in an -ball. Counting is done by extensively sampling around each sample and recording the number of unique activation patterns (recall (7.15)). We report those numbers in Fig. 7.5 and observed 123

that BN does significantly increase the number of regions around the dataset images

while it barely impacts the number of regions present around random samples. This

effectively confirm the same behavior as seen in Fig. 7.4.

From the empirical result of Fig. 7.1,7.4 and 7.5 we validate the results from

Thm. 7.1,7.2 demonstrating that BN actively alters the layer-wise and DN input

space partition in order to shift and orient the boundaries onto the data samples in

turn leading to greater number of regions around the data samples. We now study

the impact of such concentration for classification tasks.

7.4 Where is the Decision Boundary

We demonstrated how BN actively contributes to the DN partition in an unsupervised

manner. We now demonstrate that this a priori naive objective is well adapted for

supervised tasks by jump-starting learning thanks to the decision boundary being

already near the data points. We also conclude by studying the role of the BN

learnable parameters.

7.4.1 Batch-Normalization is a Smart Initialization

To understand the benefit of BN in a classification setting we first need to relate the

DN decision boundary to the DN input space. Without loss of generality, consider a

binary classification problem leading to a last layer with a single output unit (D(L) =

1). The sign of this last layer output predicts the input class. The decision boundary

thus corresponds to the zero-set of z(L), which in fact falls back to 7.13. That is, the decision boundary is simply the folded hyperplane of the last layer, P(L−1). As a result, the decision boundary leaves on the -up-to-layer-L − 1 DN partition Ω(1→L−1)

(recall 7.14) which is already adapted to the given dataset as per the entire previous 124

0 0 0 samples ∂Ω3 ∂Ω7 ∂Ω11

0.2 0.2 0.2

0.0 0.0 0.0

1.0 0.7 0.7 zero bias random BN

Figure 7.4 : This figure reproduces the experiment from Fig. 7.1 with a more complex (2-D) input dataset (left) and a much wider DN with D(`) = 1024 and L = 11. We depict for some layers the boundaries of the layer partitions seen in the DN input (`) space (∂Ω0 , recall 7.13) for DNs with different initializations: random for slopes and biases (random), or random for slopes and zero for biases (zero) or the scaling of the (`) (`) slopes and the biases are initialized from the BN statistics µBN and σBN from 7.3 (BN). The overlap of multiple partition boundaries induces a darker color demonstrating the presence of more partition boundaries for each spatial location. Clearly, BN concentrates the partition boundaries onto the data samples.

section. In particular, we have the following result. Remark 7.1

The increased boundary concentration around data samples obtained from BN allows for finer decision boundaries around the data samples.

In fact, the link between DN input space partition and decision boundary was 125

around images around noise samples 30000 random zero 20000 bn

10000 # regions

0 0.005 0.010 0.015 0.020 0.005 0.010 0.015 0.020

radius () radius () Figure 7.5 : Average number of regions from the DN partition Ω in an -ball around 100 CIFAR images (left) and 100 random images (right) for a CNN demonstrat- ing that BN adapts the partition on the data samples. The weights initialization (random, zero, BN) follows Fig. 7.4. Additional dataset and architectures are given in Appendix showing the same result.

first studied in [Balestriero et al., 2019] where it was found that the more partition boundary there is, the more folds can the decision boundary have; in our case, BN alone allows for this beneficial property without any supervised training. Lastly, we also have the following result demonstrate how without training (at initialization) there will be samples from the DN input that lies on each side of the decision boundary.

Proposition 7.1 (Decision boundary position)

At initialization, for any weight matrix W (`) and layer activation functions σ(`), ` =

1,...,L , zero last layer bias b(L) = 0 (usual initialization) and σ(L−1) =leaky-ReLU,

using BN ensures that there will be samples in the training mini-batch on each side

of the decision boundary.

The above result is equivalent of saying that the DN decision boundary passes

through the mini-batch samples in the DN input space for any random initialization

thanks to the presence of BN. Along the same lines, this BN induced initialization is 126 also beneficial to solve the dying neuron problem [Trottier et al., 2017] which prevents learning of some DN parameters and is usually solved by careful initialization [Glorot and Bengio, 2010, He et al., 2015a]. BN also allows to solve this problem as it has been empirically demonstrated in Liao and Carneiro [2016]. We now empirically validate how this BN initialization is position the DN partition and decision boundary closer to the optimal ones.

7.4.2 Experiments: Batch-Normalization Initialization Jump-Starts Train-

ing

Up to this point, the BN analysis was studied from an unsupervised angle as BN does not use any external information about the data samples to alter the DN input space partition. We now move to a supervised classification task to empirically validate

Remark. 7.1 stating how the region concentration resulting from using BN is beneficial to obtain higher resolution decision boundary and this should help classification.

To validate this, we use the same three initialization settings as in the last section: random, zero and BN. Then training is done to solve the task but no BN is applied during training (that is, once the BN statistics are initialized, they are not updated during the training as it would be the cases is using BN). To do so, we first get the values for µ(`), β(`) based on the training set, with a randomly initialization DN (no training is performed yet). Then, those values are used as the BN statistics, and kept frozen during training as opposed to being updated for each mini-batch. Hence, this

BN initialization is nothing more than a “smart” weight initialization of DNs where information about the samples are used. We compare this BN initialization with the usual random weight initialization and zero bias initialization (standard init. in DNs) and propose in Fig. 7.6 the classification performance on various datasets for varying 127 depth Resnet DNs. We clearly see how BN initialization is responsible for a significant performance gain and faster training. This demonstrates an entirely new aspect of

BN that has not yet been studied theoretically by prior work as they focused on the effect of BN used during training and its impact of the DN loss surface. Since here

BN is only used as an initializer, the measured gain of performances are only due to a better positioning of the DN initial partition Ω.

We demonstrated how BN not only adapts the partition onto the data samples but do so in a way that directly provide the decision boundary with an adapted resolution following the dataset at hand. As such, even when used solely as a “smart” initialization technique, BN is able to greatly improve DN performances compared to the random data agnostic initialization. This results supports the importance of DN initialization and how a good initialization alone plays a crucial role in performances as empirically studied with slope variance constraints [Mishkin and Matas, 2015, Xie et al., 2017], singular values constraints [Jia et al., 2017] or with orthogonal constraints

[Saxe et al., 2013, Bansal et al., 2018].

7.5 The Role of the Batch-Normalization Learnable Param-

eters

We now study the impact of the BN learnable parameters β(`) and γ(`). The previous results and characterization of the layer and DN input space partition (7.6 and 7.15) is done by setting β(`) = 0 and γ(`) = 1, recall 7.2 which corresponds to the standard initialization of BN. That is, all the above derived results on BN and its ability to

“fit” the partition boundaries to the data samples occurs exactly during the first mini-batch, and then β(`) and γ(`) are adapted. We demonstrate here that even when 128

Inception Resnet50 EfficientNet CIFAR100 CIFAR10 SVHN

Figure 7.6 : Image classification with different architectures on SVHN, CIFAR10/100. In all cases no BN is used during training; the initialization of the weights is either (`) (`) random (black) or random a with fixed BN parameters µBN, σBN, ∀` (blue). That is, the BN parameters are found as-per the BN strategy in a pretraining phase, and then those parameters are frozen (all other parameters remained at their random initialization). Then training start and the random parameters are tuned based on the loss at han. We can see that BN initialization (again, no BN is used during training) is beneficial to reach better accuracy effectively showing that BN initialization alone plays a crucial role for DNs. In most cases, the DN that does not leverage the BN initialization diverges altogether.

those parameters are learned, the insights developed above remain valid.

To do so, we first derive the following result which demonstrates that one can effectively disregard the γ parameter as its impact on the layer mapping will be can- celed by subsequent layers in the DN. That is, in a DN, the only learnable parameter 129

that should be considered is β(`). Proposition 7.2

In a DN with multiple layers, learning of the γ(`) parameters for the intermediate layers is equivalent to learning of the β(`) parameter and W (`+1) parameter.

To gain further insights, we can illustrate the above result in this simple two layer case ! W (1)x − µ(1) f(x) = W (2)σ γ(1) + β(1) + b(2) σ(1) ! W (1)x − µ(1) = W 0(2)σ + β0(1) + b(2). σ(1)

Since the γ(`) can effectively be disregarded as per above without any impact on the

DN training, we are now left with studying the impact of β(`) on the results derived

in the previous sections. This parameter acts on the layer (and DN) input space

partition in a similar manner to the µ(`) parameters, by shifting the layer hyperplanes.

Hence, learning of this parameter allows a DN to move away from the BN position of

the partition boundaries if needed. We provide in the appendix the analytical form

of those partition boundaries in the presence of β(`). However we demonstrate in

Table ?? in the appendix that learning or not β(`) has very limited impact on the final

performances of DNs across tasks and architectures. This effectively demonstrates

that all the derived results can be extended to trained DNs as well.

7.6 Batch-Normalization Noisyness

(`) (`) The BN statistics µBN, σBN, during training, depend only on the observed mini-batch B. That is, they are estimators of the entire data mean and standard deviation

computed only from the B samples of the observed mini-batch. That is, the BN 130

init. (B=16) init. (B=256) learned (B=16) learned (B=256)

Figure 7.7 : Decision Boundaries realisations obtained for different batches on a 2-dimensional binary classification task. Each mini-batch (of size B) produces a different DN decision boundary based on the realisations of the random variables (`) (`) µBN, σBN (recall 7.17,7.18). Variance of those r.v. depend on B as seen in the figure. We depict those realisations at initialization (left) and after learning (right) for B = 16, 256, the latter producing smaller variance in the decision boundaries.

mean and standard deviations are random variables with different realisations for

each mini-batch. A direct application of probability and statistics [Von Mises, 2014]

will give us the following general result. Proposition 7.3

Assuming that the layer inputs follow some distribution W (`)z(`−1) ∼ Z with V ar(Z) =

2 (`) σ2 (`) 2 µ4 σ4(B−3) σ , we have V ar(µBN) = B and V ar((σBN) ) = B − B(B−1) .

Per the above result, we see that during training, the BN centering and scaling introduces an additive and multiplicative noise with variance increasing as the mini- batch size (B) decreases. This “noise” thus becomes detrimental for small mini- batches and has been empirically observed in Ioffe [2017]. We formalize this important property of BN in the following result and illustrate it in Fig. 7.7 where we depict 131 the DN decision boundary realisations from observing different mini-batches. Proposition 7.4

BN statistics during training introduce a multiplicative noise for the slope and additive noise for the bias at each layer with variance given in Prop. 7.3.

This combination of two different noises is different than in for example standard dropout [Gal and Ghahramani, 2016] where only one noise type is used (and most commonly the multiplicative noise is binary). To gain further insights, let’s assume a Gaussian form for the data distribution to be i.i.d. as z(`−1) ∼ N (µ, Σ) from which we obtain the following distributions

(`) T (`) ! [W ] Σ[W ]k,. [µ(`) ] ∼N [W (`)]T µ, k,. , (7.17) BN k k,. B 2  (`) T (`)  [W ]k,.Σ[W ]k,. [(σ(`) )2] ∼ X 2 . (7.18) BN k B N−1

2 with XN−1 a Chi-squared distribution with N − 1 degrees of freedom. From those distributions we see how the layer weights also interact with Σ to produce the final covariance matrix. As we depicted in Fig. 7.7, the BN noise induces a “jittering” effect in the per mini-batch DN decision boundary. However, during the final stages of training, it is comon to observe a training error of 0 meaning that the weights are not updated anymore and more importantly, that for each mini-batch, the decision boundary realisation remains able to perfectly classify the mini-batch samples. Such noise techniques have been proven beneficial for performances [Srivastava, 2013, Pham et al., 2014, Molchanov et al., 2017, Wang et al., 2018]. To further demonstrate that this noise is beneficial, we propose to artificially amplify the variance of the

BN noise by adding an additional Chi-square multiplicative noise and a Gaussian additive noise. By increase the variance of those random variables we are able to 132 further increase classification performances (averaged over 5 runs) with a Resnet10 from 93.34% to 93.68% (cifar10), from 72.22% to 72.74% (cifar100) and from 96.16% to 96.41% (svhn).

7.7 Discussions

We demonstrated how the ability to compute the DN input space partition provided a novel perspective to study batch-normalization. In particular, we proved that batch- normalization concentrates the DN input space partition around the data samples which offers an advantageous configuration that speeds-up training and stability gra- dient updates by allowing to start from a better initialized DN. This should open novel research direction to improve batch-normalization, and understand the tight relationship between DN input space partition and performances. Finally, with the novel argument that batch-normalization can be seen as a data aware initialization, we believe that novel perspectives will emerge in trying to provide more principle initializers that take into account the geometry of the data in a principled way based on a priori requirements. 133

Chapter 8

Insights Into (Smooth) Deep Networks Nonlinearities

In this chapter we demonstrate how the previous results and insights from Chap. 3 on continuous piecewise affine splines and deep networks can be ported to network architectures employing smooth nonlinearities. The backbone of this result will rely on turning the input space partition of MASOs into soft partitions which are often denoted as fuzzy partitions [Bezdek and Harris, 1978] and fuzzy sets [Dubois, 1980].

That is, a point no longer live or not in a region, but has a probability of being in each region. This process finds strong similarities with Gaussian Mixture Model clustering

(soft partition) versus k-means clustering (hard partition).

8.1 Introduction

Nonlinearities i.e. pointwise activation functions and nonlinear pooling operators are crucial to a DN’s performance. Indeed, without nonlinearity, the entire network would collapse to a simple affine transformation. But to date there has been little progress understanding and unifying the menagerie of nonlinearities, with few reasons to choose one over another other than intuition or experimentation. The key result of

Chap. 3 is that any DN layer constructed from a combination of linear and piecewise affine and convex is a MASO, and hence the entire DN is merely a composition of MASOs. MASOs have the attractive property that their partition of the signal space (the collection of multi-dimensional “knots”) is completely determined by their 134

affine parameters (slopes and offsets) (recall Sec. 3.4). This provides an elegant link

to vector quantization (VQ) and clustering. This is good progress for DNs based on

ReLU, absolute value, and max-pooling, but what about DNs based on classical, high-

performing nonlinearities that are neither piecewise affine nor convex like the sigmoid

gated linear unit [Elfwing et al., 2018], hyperbolic tangent gated linear unit [Roy

et al., 2019], or recent adaptive nonlinearities (with learnable parameter controling

their smoothness) like the swish [Ramachandran et al., 2017] that has been shown to

outperform others on a range of tasks?

In this chapter, we address this gap in the DN theory by developing a new

framework that unifies a wide range of DN nonlinearities and inspires and supports

the development of new ones. The key idea is to leverage the yinyang relation-

ship between deterministic VQ/K-means and probabilistic Gaussian Mixture Models

(GMMs) [Biernacki et al., 2000] input space partitions. Under a GMM, piecewise

affine, convex nonlinearities like ReLU and absolute value can be interpreted as solu-

tions to certain natural hard inference problems, while sigmoid and hyperbolic tangent

can be interpreted as solutions to corresponding soft inference problems. This chap- ter is organized as follows. The GMM-based extension of MASOs is developed in

Sec. 8.2 and is further extended in Sec. 8.3 where we develop the hybrid β-VQ. This generalized formulation which effectively allows one to interpolate and extrapolate between the deterministic and probabilistic regime will further extend the reach of the MASO analysis to DNs employing adaptive nonlinearities like the Swish and the

β-softmax pooling. 135

8.2 Max-Affine Splines meet Gaussian Mixture Models

An important consequence of (3.1) is that a MASO is completely determined by its slope and offset parameters without needing to specify the partition of the input space (the “knots” when D = 1). Indeed, solving (3.1) automatically computes an optimized partition of the input space RD that is equivalent to a vector quantization (VQ) [Nasrabadi and King, 1988, Gersho and Gray, 2012]. We can make the VQ aspect i.e. the region assignment of layer ` explicit by rewriting (3.1) in terms of the

(`) D(`)×R(`) (`) Hard-VQ (HVQ) matrix T H ∈ {0, 1} that contains D vertically stacked one-hot row vectors as

h (`)i T = 1 . (8.1) H   k,r (`) (`−1) (`) arg max h[Ar ]k.:,z i+[br ]k=r r=1,...,R(`) 

Given the input induced HVQ matrix, the MASO input-output mapping is affine and fully determined as

R(`) (`) X D (`) (`−1)E  [z ]k = [T H ]k,r [Ar ]k,:, z + [br]k , (8.2) r=1 the entire development that follows consist in deriving an alternative (soft) VQ matrix

(`) that when used in-place of T H in (8.2) would correspond to employing an alternative (smooth) nonlinearity.

8.2.1 From MASO to GMM via K-Means

For now, we focus on a single unit k from layer ` of a MASO DN. The same develop- ment generalize naturally to an entire layer by applying this analysis component-wise, as (recall (3.1)) the maximum is taken independently between units of a same layer.

Recalling Thm. 3.3 we can provide the following direct result that demonstrates for 136

which parameters the input space partition of the unit corresponds to a Voronoi

Diagram (recall Def. 3.1) i.e. the partition is the one of a k-mean algorithm.

Proposition 8.1 (Voronoi Diagram for a unit)

1  (`) 2  (`) Given − 2 Ar k,: 2 = br k, the MASO VQ partition corresponds to a K-means ∗  (`) clustering i.e. a Voronoi Diagram with centroids Ar k,: leading to

(`) [T ] = 1 2. H k,r  (`) (`−1) arg min Ar −z r=1,...,R(`) k,: 2

We now leverage the well-known relationship between K-means and Gaussian Mixture

Models (GMMs) [Bishop, 2006] to GMM-ize the deterministic VQ process of max-

 (`) affine splines. As we will see, the constraint on the value of br k in Prop. 8.1 will be relaxed thanks to the GMM’s ability to work with a nonuniform prior over the

regions (in contrast to K-means).

To move from a deterministic MASO model to a probabilistic GMM-like model, we

(`) reformulate the one-hot position of the HVQ selection variable [T ]k,: that we denote

(`) (`) (`) as [t ]k as an unobserved categorical variable [t ]k ∼ Cat([π ]k,:) with parameter

(`) (`) [π ]k,: ∈ 4R(`) with 4R(`) being the simplex of dimension R . Armed with this, we define the following generative model for the layer input z(`−1) as a mixture of R(`)

(`) D(`−1) 2 Gaussians with mean [Ar ]k,: ∈ R and identical covariance scaled with σ

R(`) (`−1) X  (`) z = 1   Ar + , (8.3) t(`) =r k,: r=1 k with  ∼ N (0, Iσ2). For reasons that will become clear below in Section 8.2.3, we will refer to the GMM model (8.3) as the Soft MASO (SMASO) model.

∗It would be more accurate to call this R(`)-means clustering in this case. 137

8.2.2 hard-VQ Inference

Given the GMM (8.3) and an input z(`−1), we can compute a hard inference of the opti-

(`) mal VQ selection variable [t ]k via the maximum a posteriori (MAP) principle which

(`) 1 (`) 2 (`) exp([br ]k+ 2 k[Ar ]k,:k ) falls back to the MASO region selection when setting [π ]k,t = (`) PR (`) 1 (`) 2 r=1 exp([br ]k+ 2 k[Ar ]k,:k ) as

(`) (`−1) [tc ]k = arg max p r|z r=1,...,R(`)

= arg max log p(z(`−1)|r)p(r) (Bayes rule and log preserves argmax) r=1,...,R(`) 1 2 1 2 (`−1) (`) (`) (`) = arg max − z − [Ar ]k,: + [br ]k + [Ar ]k,: r=1,...,R(`) 2 2 2 2

(`) (`−1) (`) = arg maxh[Ar ]k,:, z i + [br ]k, (8.4) r=1,...,R(`) falling back to the usual HVQ from (8.1). We formalize this result in the following statement Theorem 8.1 (GMM MAP)

Given a GMM with parameters σ2 = 1 for all mixtures and with mixture prior

(`) 1 (`) 2 (`) exp([br ]k+ 2 k[br ]k,:k ) (`) probability [π ]k,r = (`) , r = 1,...,R , the MAP inference PR (`) 1 (`) 2 r=1 exp([br ]k+ 2 k[Ar ]k,:k ) (`) of the mixture categorical variable [t ]k is given by (8.4) and corresponds to a MAS VQ (8.1).

Note that in Thm. 8.1 the bias constraint of Prop. 8.1 is completely relaxed. HVQ in- ference of the selection matrix sheds light on some of the drawbacks that affect any DN employing piecewise affine, convex activation functions. First, during gradient-based learning, the gradient will propagate back only through the activated VQ regions that

(`) correspond to the 1-hot entries in T H . The parameters of other regions will not be updated; this is known as the “dying neurons phenomenon” [Trottier et al., 2017,

Agarap, 2018]. Second, the overall MASO mapping is continuous but its derivative is 138

not leading to unexpected gradient discontinuities during learning and thus training

instabilities. Third, the HVQ inference contains no information regarding the confi-

dence of the VQ region selection, which is related to the distance of the query point

to the region boundary. As we will now see, this extra information can be very useful

and gives rise to a range of classical and new activation functions.

8.2.3 Soft-VQ Inference

We can overcome many of the limitations of HVQ inference in DNs by replacing the

1-hot entries of the HVQ selection matrix with the probability that the layer input

belongs to a given VQ region. This soft region assignment is captured by what we

denote as the soft-VQ (SVQ) matrix, the counter part of (8.1), given by

D (`) (`−1)E (`)  exp [Ar ]k,:, z + [br ]k [T (`)] = p [t(`)] = r|z(`−1) = , (8.5) S k,r k (`) D E  PR (`) (`−1) (`) r=1 exp [Ar ]k,:, z + [br ]k which follows from the simple structure of the GMM. This corresponds to a soft

(`) inference of the categorical variable [t ]k. Given the SVQ selection matrix, the MASO output is still computed via (8.2). The SVQ matrix can be computed indirectly from an entropy-penalized MASO optimization which follows from the exact same procedure as when formulating the E step of the EM algorithm as a maximization problem [Neal and Hinton, 1998, Manning and Klein, 2003, Mount, 2011]. Proposition 8.2

The entries of the SVQ selection matrix (8.5) solve the following entropy-penalized maximization, where H(·) is the Shannon entropy†

(`) Rk (`) X D (`) E (`)  [T ] = arg max [r] [A ] , z(`−1) + [b ] + H(r). (8.6) S k,: r [r]r k,: [r]r k r∈4 R(`) r=1

†The observant reader will recognize this as the E-step of the GMM’s EM learning algorithm. 139

Prop. 8.2 unifies HVQ and SVQ in a single optimization problem. The transition from HVQ (8.1) to SVQ (8.5) is obtained simply by adding the entropy regularization

H(r) in the optimization problem.

8.2.4 Soft-VQ MASO Nonlinearities

Remarkably, switching from HVQ to SVQ MASO inference recovers several classical and powerful nonlinearities and provides an avenue to derive completely new ones.

(`) (`) Given a set of MASO parameters A: , b: for calculating the layer-` output of a DN via (3.1), we can derive two distinctly different DNs: one based on the HVQ inference of (8.1) that produces back the usual affine spline that has been thoroughly studied in Chap. 3, and one based on the SVQ inference of (8.5) and that is smooth. By inserting some usual MASO parameters we can draw the following links.

Proposition 8.3 (Equivalent between nonlinearities)

(`) (`) The MASO parameters A: , b: that induce the ReLU activation under HVQ in- duce the sigmoid gated linear unit [Elfwing et al., 2018] under SVQ. The MASO

(`) (`) parameters A: , b: that induce the absolute value activation under HVQ induce the hyperbolic tangent gated linear unit [Roy et al., 2019] under SVQ. The MASO

(`) (`) parameters A: , b: that induce the max-pooling nonlinearity under HVQ induce softmax-pooling [Boureau et al., 2010] under SVQ.

8.3 Hybrid Hard/Soft Inference via Entropy Regularization

Combining the hard-VQ optimization problem and the soft-VQ optimization problem yields a hybrid optimization for a new β-VQ that recovers hard, soft, and linear VQ inference as special cases defined by weighting the importance of the data driven categorical variable inference (left term of (8.6)) and the regularizer given by the 140

Shannon Entropy (right term of (8.6)). We denote this β-VQ and the induced matrix

as (`) Rk (`) X D (`) (`−1)E (`)  [T β ]k,: = arg max β [r]r [Ar ]k,:, z + [br ]k + (1 − β) H(r), (8.7) r∈4 R(`) r=1 with the new hyper-parameter β ∈ (0, 1). In a similar fashion as done in Thm. 8.1 we obtain the analytical solution of the β-VQ as follows. Theorem 8.2 (β-VQ analytical solution)

The unique global optimum of (8.7) is given by

 β D (`) (`−1)E (`)  exp 1−β [Ar ]k,:, z + [br ]k [T (`)] = . (8.8) β k,r (`)  D E  PR β (`) (`−1) (`) j=1 exp 1−β [Aj ]k,:, z + [bj ]k The β-VQ covers all of the theory developed above as special cases: β → 1 yields

1 HVQ, β = 2 yields SVQ, and β = 0 yields a linear MASO. See Figure 8.1 for exam- ples of how the β parameter interacts with three example activation functions. Note

also the attractive property that (8.8) is differentiable with respect to β. The β-VQ

supports the development of new, high-performance DN nonlinearities. For exam-

ple, the swish activation σswish(u) = σsig(ηu)u extends the sigmoid gated linear unit with the learnable parameter η [Ramachandran et al., 2017]. Similarly, the LiSHT

σlisht(u) = σhtan(ηu)u extends the hyperbolic tangent gated linear unit [Roy et al., 2019]. Numerous experimental studies have shown that DNs equipped with a learned

swish/lisht activation significantly outperform those with more classical activations

like ReLU. We now formalize the ability of the β-VQ to recover those learnable acti-

vation functions as the learnability of the smoothness that each activation exhibits. Proposition 8.4 (Swish/lisht as β-VQ)

(`) (`) The MASO A: , b: parameters that induce the ReLU or absolute value nonlinearity under HVQ induce the swish or lisht nonlinearity respectively under β-VQ, with

β η = 1−β . 141

Figure 8.1 : For the MASO parameters A(`),B(`) for which HVQ yields the ReLU, absolute value, and an arbitrary convex activation function, we explore how changing β in the β-VQ alters the induced activation function. Solid black: HVQ (β = 1), 1 Dashed black: SVQ (β = 2 ), Red: β-VQ (β ∈ [0.1, 0.9]). Interestingly, note how some of the functions are nonconvex.

8.4 Discussions

Our development of the SMASO model opens the door to several new research ques- tions. First, we have merely scratched the surface in the exploration of new nonlinear activation functions and pooling operators based on the SVQ and β-VQ. For exam- ple, the soft- or β-VQ versions of leaky-ReLU, absolute value, and other piecewise affine and convex nonlinearities could outperform the new swish nonlinearity. Sec- ond, replacing the entropy penalty in the (8.6) and (8.7) with a different penalty will create entirely new classes of nonlinearities that inherit the rich analytical properties of MASO DNs. 142

Appendix A

Insights into Generative Networks

A.1 Architecture Details

We describe the used models below. The Dense(T) represents a fully connected layer with T units (activation function not included). The Conv2D(I, J, K) represent I

filters of spatial shape (J, K) and the input dilation and padding follow the standard definition. For the VAE models the encoder is given below and for the GAN models the discriminator is given below as well. FC GAN model means that the FC generator is used in conjunction with the discriminator, the CONV GAN means that the CONV generator is used in conjunction with the discriminator and similarly for the VAE case. 143

FC generator CONV generator Encoder Discriminator

Dense(256) Dense(256) Dense(512) Dense(1024)

leaky ReLU leaky ReLU Dropout(0.3) Dropout(0.3)

Dense(512) Dense(8 * 6 * 6) leaky ReLU leaky ReLU

leaky ReLU leaky ReLU Dense(256) Dense(512)

Dense(1024) Reshape(8, 6, 6) leaky ReLU Dropout(0.3)

leaky ReLU Conv2D(8, 3, 3, inputdilation=2, pad=same) Dense(2*S) leaky ReLU

Dense(28*28) leaky ReLU Dense(256)

Conv2D(8, 4, 4, inputdilation=3, pad=valid) Dropout(0.3)

Reshape(28*28) leaky ReLU

Dense(2)

all the training procedures employ the Adam optimizer with a learning of 0.0001 which stays constant until training completion. In all cases training is done on 300 epochs, an epoch consisting of viewing the entire image training set once.

Tangent constraint For the experiment in Fig. 4.9 on the tangent constraint, we employed the follow deep autoencoder: 144

AE

Dense(1024)

ReLU

Dense(1024)

ReLU

Dense(S)

Dense(1024)

ReLU

Dense(1024)

ReLU Dense(D)

A.2 Proofs

A.2.1 Proof of Thm 4.1

Proof A.1 The result is a direct application of Corollary 3 in Balestriero et al. [2019]

adapted to GDNs (and not classification based DNs). The input regions are proven

to be convex polytopes. Then by linearity of the per region mapping, convexity of

the projected region is preserved and with form given by (4.4).

A.2.2 Proof of Proposition 4.1

Proof A.2 First recall the standard result that

rank(AB) ≤ min(rank(A), rank(B)),

for any matrix A ∈ RN×K and B ∈ RK×D (see for example Banerjee and Roy [2014] chapter 5). Now, noticing that min(min(a, b), min(c, d)) = min(a, b, c, d) leads to the desired result by unrolling the product of matrices that make up the Aω matrix to 145

obtain the desired result.

A.2.3 Proof of Proposition 4.2

Proof A.3 The first bound is obtained by taking the realization of the noise where

r = 0, in that case the input space partition is the entire space as any input is mapped

to the same VQ code. As such, the mapping associated to this trivial partition has 0

slope (matrix filled with zeros) and a possibly nonzeros bias; as such the mapping is

zero-dimensional (any point in the latent space is mapped to the same point in the

ambient space). This gives the lower bound stating that in the mixture of GDNs, one

will have dimension 0. For the other case, simply take the trivial case of r = 1 which

gives the result.

A.2.4 Proof of Theorem 4.2

T −1 T Proof A.4 First, notice that P (Aω) = Aω(Aω Aω) Aω defines a projection matrix. In fact, we have that

2 T −1 T T −1 T P (Aω) = Aω(Aω Aω) Aω Aω(Aω Aω) Aω

T −1 T = Aω(Aω Aω) Aω

= P (Aω)

T −1 and we have that (Aω Aω) is well defined as we assume injectivity (rank(Aω) = S)

T making the S × S matrix Aω Aω full rank. Now it is clear that this projection matrix maps an arbitrary point x ∈ RD to the affine subspace G(ω) up to the bias shift. As we are interested in the angle between two adjacent subspaces G(ω) and G(ω0) it is

also clear that the biases (which do not change the angle) can be omited. Hence the

task simplifies to finding the angle between P (Aω) and P (Aω0 ). This can be done by 146

means of the greatest principal angle (proof can be found in Stewart [1973]) with the

0  result being sin θ(G(ω), G(ω )) = kP (Aω) − P (Aω0 )k2 as desired.

Discussion on volume and affine mappings

Proof A.5 In the special case of an affine transform of the coordinate given by the

matrix A ∈ RD×D the well known result from demonstrates that the change of volume is given by | det(A)| (see Theorem 7.26 in Rudin [2006]). However in our case the

mapping is a rectangular matrix as we span an affine subspace in the ambiant space

RD making | det(A)| not defined. However by applying Sard’s theorem Spivak [2018] we obtain that the change of volume from the region ω to the affine subspace G(ω)

is given by pdet(AT A) which can also be written as follows with USV T the svd-

decomposition of the matrix A:

pdet(AT A) = pdet((USV T )T (USV T )) =pdet((VST U T )(USV T ))

=pdet(VST SV T )

=pdet(ST S) Y = σi(A)

i:σi6=0

A.2.5 Proof of Theorem 4.3

T −1 T Proof A.6 We will be doing the change of variables z = (Aω Aω) Aω (x − bω) where

+ T −1 T we will denote for clarity Aω , (Aω Aω) Aω . First, we know that PG(z)(x ∈ w) = −1 R Pz(z ∈ G (w)) = G−1(w) pz(z)dz which is well defined based on our full rank assumptions. We then proceed by

X Z PG(z)(x ∈ w) = pz(z)dz −1 ω∈Ω ω∩G (w) 147

Z q X −1 + T + = pz(G (x)) det((Aω ) Aω )dx −1 ω∈Ω ω∩G (w) Z X −1 Y + = pz(G (x))( σi(Aω ))dx −1 ω∩G (w) + ω∈Ω i:σi(Aω )>0 Z X −1 Y −1 = pz(G (x))( σi(Aω)) dx Etape 1 ω∩G−1(w) ω∈Ω i:σi(Aω)>0 Z X −1 1 = pz(G (x))q dx ω∩G−1(w) T ω∈Ω det(Aω Aω)

+ −1 Let’s now prove the Etape 1 step by proving that σi(A ) = (σi(A)) where we

T lighten notations as A := Aω and USV is the svd-decomposition of A:

A+ = (AT A)−1AT =((USV T )T (USV T ))−1(USV T )T

=(VST U T USV T )−1(USV T )T

=(VST SV T )−1VST U T

=V (ST S)−1ST U T

+ −1 =⇒ σi(A ) = (σi(A)) with the above it is direct to see that pdet((A+)T A+) = √ 1 as follows ω ω T det(Aω Aω) q + T + Y + Y −1 det((Aω ) Aω ) = σi(Aω ) = σi(Aω)

i:σi6=0 i:σi6=0 !−1 Y = σi(Aω)

i:σi6=0 1 = q T det(Aω Aω) which gives the desired result. 148

Appendix B

Expectation Maximization Training of Deep Generative Networks

B.1 Computing the Latent Space Partition

In this section we first introduce notations and demonstrate how to express a region ω

of the partition Ω as a polytope defined by a system of inequalities, and then leverage

this formulation to demonstrate how to obtain Ω by recursively exploring neighboring

regions starting from a random point/region.

Regions as Polytopes To represent the regions ω ∈ Ω as a polytope via a system

of inequalities we need to recall from (3.1) that the input-output mapping is defined

on each region by the affine parameters Aω,Bω themselves obtained by composition of MASOs. Each layer pre-activation (feature map prior application of the nonlinearity)

` D` ` 1→` 1→` is denoted by h (z) ∈ R , ` = 1,...,L−1 and given by h (x) = Aω z +bω , with up-to-layer ` affine parameters

1→` ` `−1 `−1 1 1 1→` D`×S Aω , W Qω W ... QωW , Aω ∈ R , (B.1) ` 1→`−1 ` X ` `−1 `−1 i i 1→` D` bω , v + W Qω W ... Qωv , bω ∈ R , (B.2) i=1 which depend on the region ω in the latent space ∗. Notice that we have in particular

L L Aω = Aω and bω = bω, the entire DGN affine parameters from (??) on region ω.

∗looser condition can be put as the up-to-layer ` mapping is a CPA on a coarser partition than

Ω but this is sufficient for our goal. 149

The regions depend on the signs of the pre-activations defined as q`(z) = sign(h`(z)) due to the used activation function behaving linearly as long as the feature maps preserve the same sign. This holds for (leaky-)ReLU or absolute value, for max- pooling we would need to look at the argmax position of each pooling window, as

all pooling is rare in DGN we focus here on DN without max-pooling; let q (z) , [(qL−1(z))T ,..., (q1(z))T ]T collect all the per layer sign operators without the last layer as it does not apply any activation. Lemma B.1

The qall operator is piecewise constant and there is a bijection between Ω and im(q).

The above demonstrates the equivalence of knowing ω in which an input z belongs to and knowing the sign pattern of the feature maps associated to z; we will thus use interchangeably qall(z), z ∈ ω and qall(ω). From this, we see that the pre-activation signs and the regions are tied together. We can now leverage this result and provide the explicit region ω as a polytope via its sytem of inequality, to do so we need to collect the per-layer slopes and biases into     1→L−1 1→L−1 Aω bω     QL−1 ` QL−1 ` all   all   all ( `=1 D )×S all `=1 D Aω =  ...  , bω =  ...  , Aω ∈ R , bω ∈ R . (B.3)      1→1   1→1  Aω bω

Corollary B.1

The H-representation of the polyhedral region ω is given by

L−1 S all all all \ S 1→` ` 1→` ω = {z ∈ R : Aω z < −q (ω) bω } = {z ∈ R : Aω z < −q (ω) bω }, `=1 (B.4) with the Hadamard product. 150

Proof B.1 From the above result, it is clear that the preactivation roots define the

boundaries of the regions. Obtaining the hyperplane representation of the region

thus simply consists of reexpressing this statement with the explicit pre-activation

hyperplanes for all the layers and units, the intersection between layers coming from

the subdivision. For additional details please see Balestriero et al. [2019].

From the above, it is clear that the sign locates in which side of each hyperplane

the region is located. We now have a direct way to obtain the polytope ω from its

sign pattern qall(ω) or equivalently from an input z ∈ ω; the only task left is to obtain

the entire partition Ω collecting all the DN regions, which we now propose to do via

a simple scheme.

Partition Cells Enumeration. The search for all cells in a partition is known

as the cell enumeration problem and has been extensively studied in the context of

speicific partitions such as hypreplane arrangements Avis and Fukuda [1996], Sleumer

[1999], Gerstner and Holtz [2006]. In our case however, the set of inequalitites of

different regions changes. In fact, for any neighbour region, not only the sign pattern

all all all q will change but also Aω and bω due to the composition of layers. In fact, changing one activation state say −1 to 1 for a specific unit at layer ` will alter the

affine parameters from (B.1) and (B.2) due to the layer composition. As such, we

propose to enumerate all the cells ω ∈ Ω with a deterministic algorithm that starts

from an intial region and recursively explores its neighbouring cells untill all have been

visited while recomputing the inequality system at each step. To do so, consider the

initial region ω0. First, one finds all the non-redundant inequalities of the inequality system (B.4), the remaining inequalities define the faces of the polytope ω. Second,

one obtains any of the neighbouring regions sharing a face with ω0 by switching the sign in the entry of q(ω0) corresponding to the considered face. Repeat this for all 151

non-redundant inequalities to obtain all the adjacent regions to ω0 sharing a face with it. Each altered code defines an adjacent region and its sytem of inequality can be obtain as per Lemma B.1. Doing so for all the faces of the initial region and then iterating this process on all the newly discovered regions will enumerate the entire partition Ω. We summarize this in Algo 1 in the appendix and illustrate this recursive procedure in Fig. 5.1.

We now have each cell as a polytope and enumerated the partition Ω, we can now turn into the computation of the marginal and posterior DGN distributions.

B.2 Analytical Moments for truncated Gaussian

To lighten the derivation, we introduce extend the [.] indexing operator such that for

th example for a matrix, [.]−k,. means that all the rows but the k are taken, and all

th th columns are taken. Also, [.](k,l),. means that only the k and l rows are taken and all the columns. Let also introduce the following quantities

 [F (a, Σ)]k =φ ([a]k; 0, [Σ]k,k)Φ[[a]−k,∞) µ(k), Σ(k)     [G(a, Σ)]k,l =φ [a](k,l); 0, [Σ](k,j),(k,j) Φ[[a]−(k,l),∞) µ (k, l) , Σ (k, l) ! l F (l, Σ) − Σ G(l, Σ)1 H(a, Σ) =G(a, Σ) + diag diag(Σ)

−1 −1 T with µ(u) = [Σ]−u,u[Σ]u,u[a]u, and Σ(u) = [Σ]−u,−u − [Σ]−u,u[Σ]u,u[Σ]−u,u. Thanks

0 to the above form, we can now obtain the integral eω(Σ) , Φω(0, Σ) and the first 1 R 2 two moments of a centered truncated gaussian eω(Σ) , ω zφ(z; 0, Σ) and Eω(Σ) , R T ω zz φ(z; 0, Σ) 152

Corollary B.2

The integral and first two moments of a centered truncated gaussian are given by

0 X X T  eω(Σ) = sΦ[l(C),∞) 0,RcΣRc dz, (B.5) ∆∈T (ω) (s,C)∈T (∆)

1 X X T T eω(Σ) =Σ sRC F (lω,c,RcΣRc ), (B.6) ∆∈T (ω) (s,C)∈T (∆)   2 X X T T 0 Eω(Σ) =Σ  sRC (H(lω,C ,RcΣRc ))RC  Σ + eω(Σ)Σ (B.7) ∆∈T (ω) (s,C)∈T (∆)

To simplify notations let consider the following notation of the posterior (5.4) where are incorporate the terms independent of z into

φ(x; B , Σ + A Σ AT ) α (x) = ω x ω z ω , (B.8) ω P T ω φ(x; Bω, Σx + AωΣzAω )Φω(µω(x), Σω) P leading to p(z|x) = ω∈Ω δω(z)αω(x)φ(z; µω(x), Σω). Theorem B.1

The first (per region) moments of the DGN posterior are given by

0 Ez|x[1z∈ω] = αω(x)eω(Σω),

[z1 ] = α (x)e1 (Σ ) + e0 (Σ )µ (x) Ez|x z∈ω ω ω−µω(x) ω ω−µω(x) ω ω [zzT 1 ] = α (x)E2 (Σ ) + e1 (Σ )µ (x)T Ez|x z∈ω ω ω−µω(x) ω ω−µω(x) ω ω + µ (x)e1 (x)T + µ (x)µ (x)T e0 (x) ω−µω(x) ω−µω(x) ω ω ω

0 1 T which we denote Ez|x[1z∈ω] , eω(x), Ez|x[z1z∈ω] , eω(x) and Ez|x[zz 1z∈ω] , 2 Eω(x).   CT Proof B.2 Constant:RC =    T −1 H Σω Z Z p(z|x)dz = αω(x) φ(z; µω(x), Σω)dz ω ω 153

Z = αω(x) φ(z; 0, Σω)dz ω−µω(x) = α (x)e0 ω ω−µω(x)

First moment:

− 1 (z−µ (x))T Σ−1(z−µ (x)) Z Z e 2 ω ω ω zp(z|x)dz =αω(x) z K/2 1/2 dz ω ω (2π) | det(Σω)| − 1 yT Σ−1y Z e 2 ω =α (x) (y + µ (x)) dz ω ω (2π)K/2| det(Σ )|1/2 ω−µω(x) ω   =α (x) e1 + e0 µ (x) ω ω−µω(x) ω−µω(x) ω

Second moment: Z Z T T zz p(z|x)dz =αω(x) zz φ(z; µω(x), Σω)dz ω Z T =αω(x) (y + µω(x))(y + µω(x)) φ(z; 0, Σω)dz ω−µω(x)

2 1 T 1 T T 0  = αω(x) E + µω(x)eω(Σω) + eω(Σω)µω(x) + µω(x)µω(x) eω−µω(x)(Σω)

B.3 Implementation Details

The Delaunay triangulation needs the V-representation of ω, the vertices which con- vex hull form the region Gr¨unbaum [2013]. Given that we have the H-representation,

finding the vertices is known as the vertex enumeration problem Dyer [1983]. To com- pute the triangulation we use the Python scipy Virtanen et al. [2020] implementation which interfaces the C/C++ Qhull implementation Barber et al. [1996]. To compute the H 7→ V representation and vice-versa we leverage pycddlib † which interfaces the C/C++ cddlib library ‡ employing the double description method Motzkin et al.

[1953].

†https://pypi.org/project/pycddlib/ ‡https://inf.ethz.ch/personal/fukudak/cdd_home/index.html 154

Algorithm 2 SearchRegion procedure Starting region ω and q(ω), initial set (Ω) if ω 6∈ Ω then

all all I = reduce(Aω ,Bω ) for r = 1,..., |Ω(T )| do

SearchRegion(flip(q(ω), i), Ω)

B.4 Algorithms

B.5 Proofs

In this section we provide all the proofs for the main paper theoretical claims. In particular we will go through the derivations of the per region posterior first moments and then the derivation of the expectation and maximization steps.

B.5.1 Proof of Lemma 5.1

Proof B.3 The proof consists of expressing the conditional distribution and using the properties of DGN with piecewise affine nonlinearities. We are able to split the distribution into a mixture model as follows:

1 − 1 (x−g(z))T Σ−1(x−g(z)) p(x|z) = e 2 x D/2p (2π) | det Σx|

1 − 1 (x−P 1 (A z+B ))T Σ−1(x−P 1 (A z+B )) = e 2 ω∈Ω z∈ω ω ω x ω∈Ω z∈ω ω ω D/2p (2π) | det Σx|

1 − 1 P 1 (x−(A z+B ))T Σ−1(x−(A z+B )) = e 2 ω∈Ω z∈ω ω ω x ω ω D/2p (2π) | det Σx|

1 1 T −1 X − (x−(Aωz+Bω)) Σx (x−(Aωz+Bω)) = 1z∈ω e 2 D/2p ω∈Ω (2π) | det Σx| X = 1z∈ωφ(x|Aωz + Bω, Σx) ω∈Ω 155

B.5.2 Proof of Proposition 5.1

Proof B.4 This result is direct by noticing that the probability to obtain a specific region slope and bias is the probability that the sampled latent vector lies in the corresponding region. This probability is obtained simply by integrating the latent gaussian distribution on the region. We obtain the result of the proposition.

B.5.3 Proof of Theorem 5.1

Proof B.5 For the first part, we simply leverage the known result from linear Gaussian models Roweis and Ghahramani [1999] stating that

p(x|z)p(z) p(z|x) = p(x) − 1 (x−g(z))T Σ−1(x−g(z)) − 1 (z−µ)T Σ−1(z−µ) 1 e 2 x e 2 z = D/2p S/2p p(x) (2π) | det(Σx)| (2π) | det(Σz)| − 1 (x−A z−B )T Σ−1(x−A z−B ) − 1 (z−µ)T Σ−1(z−µ) 1  X e 2 ω ω x ω ω  e 2 z = 1z∈ω p(x) D/2p S/2p ω∈Ω (2π) | det(Σx)| (2π) | det(Σz)| − 1 (x−A z−B )T Σ−1(x−A z−B )− 1 zT Σ−1z 1 X e 2 ω ω x ω ω 2 z = 1z∈ω p(x) (S+D)/2p ω∈Ω (2π) | det(Σx)|| det(Σz)| − 1 ((x−B )−A z)T Σ−1((x−B )−A z)− 1 zT Σ−1z 1 X e 2 ω ω x ω ω 2 z = 1z∈ω p(x) (S+D)/2p ω∈Ω (2π) | det(Σx)|| det(Σz)| − 1 ((AT Σ−1A +Σ−1)−1AT Σ−1(x−B )−z)T (AT Σ−1A +Σ−1)((AT Σ−1A +Σ−1)−1AT Σ−1(x−B )−z) 1 X e 2 ω x ω z ω x ω ω x ω z ω x ω z ω x ω = 1z∈ω p(x) (S+D)/2p ω∈Ω (2π) | det(Σx)|| det(Σz)|

− 1 ((x−B )T Σ−1(x−B ))+ 1 ((x−B )T Σ−1A (AT Σ−1A +Σ−1)−1AT Σ−1(x−B )) × e 2 ω x ω 2 ω x ω ω x ω z ω x ω

− 1 ((AT Σ−1A +Σ−1)−1AT Σ−1(x−B )−z)T (AT Σ−1A +Σ−1)((AT Σ−1A +Σ−1)−1AT Σ−1(x−B )−z) 1 X e 2 ω x ω z ω x ω ω x ω z ω x ω z ω x ω = 1z∈ω p(x) (S+D)/2p ω∈Ω (2π) | det(Σx)|| det(Σz)|

− 1 ((x−B )T (Σ−1−Σ−1A (AT Σ−1A +Σ−1)−1AT Σ−1)(x−B )) × e 2 ω x x ω ω x ω z ω x ω

− 1 ((AT Σ−1A +Σ−1)−1AT Σ−1(x−B )−z)T (AT Σ−1A +Σ−1)((AT Σ−1A +Σ−1)−1AT Σ−1(x−B )−z) 1 X e 2 ω x ω z ω x ω ω x ω z ω x ω z ω x ω = 1z∈ω p(x) (S+D)/2p ω∈Ω (2π) | det(Σx)|| det(Σz)| 156

− 1 ((x−B )T (Σ +A Σ AT )−1(x−B )) × e 2 ω x ω z ω ω

1 T −1 − (µ (x)−z) Σω (µ (x)−z) 1 e 2 ω ω 1 T T −1 X − ((x−Bω) (Σx+AωΣzAω ) (x−Bω)) = 1z∈ω e 2 p(x) (S+D)/2p ω∈Ω (2π) | det(Σx)|| det(Σz)|

T −1 T −1 −1 −1 with µω(x) = ΣωAω Σx (x − Bω) and Σω = (Aω Σx Aω + Σz ) as a result it corresponds to a mixture of truncated gaussian, each living on ω. Now we determine the renormalization constant:

Z p(x) = p(x|z)p(z)dz

1 T −1 − (µ (x)−z) Σω (µ (x)−z) Z e 2 ω ω 1 T T −1 X − ((x−Bω) (Σx+AωΣzAω ) (x−Bω)) = 1z∈ω e 2 dz (S+D)/2p ω∈Ω ω (2π) | det(Σx)|| det(Σz)| 1 T T −1 − ((x−Bω) (Σx+AωΣzAω ) (x−Bω)) Z X e 2 p = 1z∈ω det(Σω) φ(z; µ (x), Σω)dz D/2p ω ω∈Ω (2π) | det(Σx)|| det(Σz)| ω − 1 ((x−B )T (Σ +A Σ AT )−1(x−B )) X e 2 ω x ω z ω ω p = 1z∈ω det(Σω)Φω(µ (x), Σω) D/2p ω ω∈Ω (2π) | det(Σx)|| det(Σz)| p T X det(Σx + AωΣzAω ) det(Σω) T =1z∈ω p φ(x; Bω, Σx + AωΣzAω )Φω(µω(x), Σω), ω∈Ω | det(Σx)|| det(Σz)| now using the Matrix determinant lemma Harville [1998] we have that det(Σx +

T −1 T −1 AωΣzAω ) = det(Σz + Aω Σx Aω) det(Σx) det(Σz) leading to

X T p(x) = φ(x; Bω, Σx + AωΣzAω )Φω(µω(x), Σω), ω T X φ(x; Bω, Σx + AωΣzA )φ(z; µ (x), Σω) p(z|x) = δ (z) ω ω . ω P φ(x; B , Σ + A Σ AT )Φ (µ (x), Σ ) ω ω ω x ω z ω ω ω ω

B.5.4 Proof of Lemma 5.2

Proof B.6 The proof consists of rearranging the terms from the inclusion-exclusion formula as in

X |J|+1 (−1) (∩j∈J Aj) = ∪iAi J⊆{1,...,F },J6=∅ 157

F +1 X |J|+1 (−1) S + (−1) (∩j∈J Aj) = ∪iAi J⊆{1,...,F },J6=∅,|J|

F +1 X |J|+1 (−1) S = ∪iAi− (−1) (∩j∈J Aj) J⊆{1,...,F },J6=∅,|J|

F +1 F +1 X |J|+1 S = (−1) ∪i Ai−(−1) (−1) (∩j∈J Aj) J⊆{1,...,F },J6=∅,|J|

F +1 X |J|+1+F S = (−1) ∪i Ai+ (−1) (∩j∈J Aj) J⊆{1,...,F },J6=∅,|J|

can be decomposed into the signed sum of per cone integration. Finally, a simplex in

dimension S has S + 1 faces, making F = S + 1 and leading to the desired result.

B.5.5 Proof of Moments Lemma B.2

The first moments of Gaussian integration on an open rectangle defined by its lower limits a is given by

Z ∞ zφ(0, Σ)dz =ΣF (a), (B.9) a Z ∞  ! T a F (a) − Σ G(a) 1 zz φ(0, Σ)dz =Φ[a,∞)(0, Σ)Σ + Σ G(a) + Σ. a diag(Σ) (B.10)

where the division is performed elementwise.

Proof B.7 First moment:

− 1 zT Σ−1z Z Z e 2 zφ(x; 0, Σ)dz = z K/2 1/2 dz ω ω (2π) | det(Σ)| − 1 (R z)T (RT )−1Σ−1R−1R z X X Z e 2 C C ω C C = s z K/2 1/2 dz (2π) | det(Σω)| ∆∈S(ω) (s,C)∈T (∆) C 1 T T −1 Z − u (RC ΣωRC ) u X X −1 e 2 = s R u K/2 1/2 du (2π) | det(RC )|| det(Σω)| ∆∈S(ω) (s,C)∈T (∆) l(C) 158

Z X X −1 T = sRC uφ(u; 0,RC ΣωRC )du ∆∈S(ω) (s,C)∈T (∆) l(C)

X X −1 T = sRC (RC ΣωRC F (l(C)) ∆∈S(ω) (s,C)∈T (∆)

X X T =Σω sRC F (l(C)) ∆∈S(ω) (s,C)∈T (∆) Second moment

Z Z − 1 zT Σ−1z T T e 2 zz φ(x; 0, Σ)dz = zz K/2 1/2 dz ω ω (2π) | det(Σ)| 1 T T −1 −1 −1 Z − (RC y) (RC ) Σω RC RC y X X T e 2 = s zz K/2 1/2 dz (2π) | det(Σω)| ∆∈S(ω) (s,C)∈T (∆) C 1 T T −1 Z − u (RC ΣωRC ) u X X −1 T −1 T e 2 = s RC uu (RC ) K/2 1/2 du (2π) | det(RC )|| det(Σω)| ∆∈S(ω) (s,C)∈T (∆) l(C) Z X X −1 T T −1 T = sRC uu φ(u; 0,RC ΣωRC )du(RC ) ∆∈S(ω) (s,C)∈T (∆) l(C)

X X −1h T T = sRC Φ[l(C),∞)(0,RC ΣωRC )RC ΣωRC

∆∈S(ω−µω(x)) (s,C)∈T (∆) T  ! T l(C) F (l(C)) + RC ΣωRC G(l(C)) 1 T T i −1 T + RC ΣωRC T (RC ΣωRC ) (RC ) diag(RC ΣωRC )

X X h T = s Φ[l(C),∞)(0,RC ΣωRC )Σω

∆∈S(ω−µω(x)) (s,C)∈T (∆) T  ! T l(C) F (l(C)) + RC ΣωRC G(l(C)) 1 i + ΣωRC T RC Σω diag(RC ΣωRC )

= e0 Σ ω−µω(x) ω T  !  X X T l(C) F (l(C)) + RC ΣωRC G(l(C)) 1  + Σω sRC T RC Σω diag(RC ΣωRC ) ∆∈S(ω−µω(x)) (s,C)∈T (∆)

B.6 Proof of EM-step

We now derive the expectation maximization steps for a piecewise affine and contin- uous DGN. 159

B.6.1 E-step derivation

\[
E_{z|x}\big[(A_\omega z+B_\omega)1_\omega\big]=A_\omega m^1_\omega+B_\omega e^0_\omega,\tag{B.11}
\]
\[
E_{z|x}\big[z^TA_\omega^TA_\omega z\,1_\omega\big]=\operatorname{trace}\big(A_\omega^TA_\omega m^2\big).\tag{B.12}
\]
\[
E_{Z|X}\big[\log\big(p_{X|Z}(x|z)p_Z(z)\big)\big]
=E_{Z|X}\left[\log\left(
\frac{e^{-\frac{1}{2}(x-g(z))^T\Sigma_x^{-1}(x-g(z))}}{(2\pi)^{D/2}\sqrt{|\det(\Sigma_x)|}}\;
\frac{e^{-\frac{1}{2}z^T\Sigma_z^{-1}z}}{(2\pi)^{S/2}\sqrt{|\det(\Sigma_z)|}}
\right)\right]
\]
\[
=-\log\big((2\pi)^{(S+D)/2}\sqrt{|\det(\Sigma_z)|}\sqrt{|\det(\Sigma_x)|}\big)
-\frac{1}{2}E_{Z|X}\big[(x-g(z))^T\Sigma_x^{-1}(x-g(z))+z^T\Sigma_z^{-1}z\big]
\]
\[
=-\log\big((2\pi)^{(S+D)/2}\sqrt{|\det(\Sigma_z)|}\sqrt{|\det(\Sigma_x)|}\big)
-\frac{1}{2}\Big(x^T\Sigma_x^{-1}x+E_{Z|X}\big[-2x^T\Sigma_x^{-1}g(z)+g(z)^T\Sigma_x^{-1}g(z)+z^T\Sigma_z^{-1}z\big]\Big)
\]
\[
=-\log\big((2\pi)^{(S+D)/2}\sqrt{|\det(\Sigma_z)|}\sqrt{|\det(\Sigma_x)|}\big)
-\frac{1}{2}\Big(x^T\Sigma_x^{-1}x+\operatorname{trace}\big(E_{Z|X}[zz^T\Sigma_z^{-1}]\big)
+E_{Z|X}\big[-2x^T\Sigma_x^{-1}g(z)+g(z)^T\Sigma_x^{-1}g(z)\big]\Big)
\]
\[
=-\log\big((2\pi)^{(S+D)/2}\sqrt{|\det(\Sigma_z)|}\sqrt{|\det(\Sigma_x)|}\big)
-\frac{1}{2}\Bigg(x^T\Sigma_x^{-1}x-2x^T\Sigma_x^{-1}\sum_\omega\big(A_\omega e^1_\omega(x)+b_\omega e^0_\omega(x)\big)
+\sum_\omega\Big(e^0_\omega b_\omega^T\Sigma_x^{-1}b_\omega+\operatorname{trace}\big(A_\omega^T\Sigma_x^{-1}A_\omega E^2_\omega(x)\big)
+2\big(A_\omega m^1_\omega(x)\big)^T\Sigma_x^{-1}b_\omega\Big)+\operatorname{trace}\big(\Sigma_z^{-1}E^2(x)\big)\Bigg).
\]

B.6.2 Proof of M step

Let us first introduce some notation:
\[
A_\omega^{L\to i}\triangleq\big(A_\omega^{i\to L}\big)^T\quad\text{(back-propagation matrix to layer $i$)},
\]
\[
r^\ell_\omega(x)\triangleq x\,e^0_\omega(x)-\Big(A_\omega e^1_\omega(x)+m^0_\omega(x)\sum_{i\neq\ell}A_\omega^{i+1\to L}Q^i_\omega v^i\Big)\quad\text{(expected residual)},
\]
\[
\hat z^\ell_\omega(x)\triangleq Q^{\ell-1}_\omega\big(A_\omega^{1\to\ell-1}m^1_\omega(x)+b_\omega^{1\to\ell-1}e^0_\omega\big)\quad\text{(expected feature map of layer $\ell$)}.
\]
We can now provide the analytical forms of the M step for each of the learnable parameters:
\[
\Sigma_x^*=\frac{1}{N}\sum_x\Bigg(xx^T+\sum_\omega\Big(b_\omega b_\omega^T m^0_\omega(x)+2A_\omega e^1_\omega(x)b_\omega^T-2x\big(\hat z^L_\omega(x)\big)^T+A_\omega E^2_\omega(x)A_\omega^T\Big)\Bigg),\tag{B.13}
\]
\[
v^{\ell*}=\Bigg(\sum_x\sum_\omega Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega\Bigg)^{-1}
\Bigg(\sum_x\sum_{\omega\in\Omega}\underbrace{Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}r^\ell_\omega(x)}_{\text{residual back-propagated to layer }\ell}\Bigg),\tag{B.14}
\]
\[
\operatorname{vect}(W^{\ell*})=\Bigg(\sum_x\sum_\omega U^\ell_\omega\Bigg)^{-1}\sum_x\sum_\omega\operatorname{vect}\Bigg(\underbrace{Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}\Big(x-\sum_{i=\ell}^L A_\omega^{i+1\to L}Q^i_\omega v^i\Big)}_{\text{residual back-propagated to layer }\ell}\big(\hat z^\ell_\omega(x)\big)^T\Bigg);\tag{B.15}
\]

We provide detailed derivations below.

Update of the bias parameter

Recall that $b_\omega=v^L+\sum_{i=1}^{L-1}W^LQ^{L-1}_\omega W^{L-1}\cdots Q^i_\omega v^i$; we can thus rewrite the loss as
\[
\mathcal{L}(v^\ell)=-\frac{1}{2}\log\big((2\pi)^{S+D}|\det(\Sigma_x)||\det(\Sigma_z)|\big)
-\frac{1}{2}\Bigg(x^T\Sigma_x^{-1}x-2x^T\Sigma_x^{-1}\sum_\omega\big(A_\omega m^1_\omega(x)+b_\omega m^0_\omega(x)\big)
+\sum_\omega\Big(m^0_\omega b_\omega^T\Sigma_x^{-1}b_\omega+\operatorname{trace}\big(A_\omega^T\Sigma_x^{-1}A_\omega M^2_\omega(x)\big)+2\big(A_\omega m^1_\omega(x)\big)^T\Sigma_x^{-1}b_\omega\Big)\Bigg)
-\frac{1}{2}\operatorname{trace}\big(\Sigma_z^{-1}M^2(x)\big)
\]
\[
=-\frac{1}{2}\Bigg(-2x^T\Sigma_x^{-1}\sum_\omega b_\omega e^0_\omega(x)+\sum_\omega e^0_\omega b_\omega^T\Sigma_x^{-1}b_\omega
+\sum_\omega 2\big(A_\omega m^1_\omega(x)\big)^T\Sigma_x^{-1}b_\omega\Bigg)+cst
\]
\[
=-\frac{1}{2}\sum_\omega\Big(-2x^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell e^0_\omega(x)
+e^0_\omega\big(A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell\big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell\big)
+2e^0_\omega(x)\Big(\sum_{i\neq\ell}A_\omega^{i+1\to L}Q^i_\omega v^i\Big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell\big)
+2\big(m^1_\omega(x)\big)^TA_\omega^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell\Big)+cst
\]
\[
=-\frac{1}{2}\sum_\omega\Big(\underbrace{e^0_\omega\big(A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell\big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell\big)}_{(A)}
+\underbrace{2\Big(e^0_\omega(x)\Big(\sum_{i\neq\ell}A_\omega^{i+1\to L}Q^i_\omega v^i-x\Big)+A_\omega e^1_\omega(x)\Big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell\big)}_{(B)}\Big)+cst
\]
\[
\implies \partial\mathcal{L}(v^\ell)=-\frac{1}{2}\sum_\omega\Bigg[
e^0_\omega(x)\,2\,Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell
+2\big(A_\omega^{\ell+1\to L}Q^\ell_\omega\big)^T\Sigma_x^{-1}\Big(e^0_\omega(x)\Big(\sum_{i\neq\ell}A_\omega^{i+1\to L}Q^i_\omega v^i-x\Big)+A_\omega e^1_\omega(x)\Big)\Bigg]
\]
\[
\implies v^\ell=\Bigg(\sum_x\sum_\omega e^0_\omega(x)\,Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega\Bigg)^{-1}
\Bigg(\sum_x\sum_{\omega\in\Omega}Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}\Big(x\,e^0_\omega(x)-\Big(A_\omega m^1_\omega(x)+m^0_\omega(x)\sum_{i\neq\ell}A_\omega^{i+1\to L}Q^i_\omega v^i\Big)\Big)\Bigg),
\]
as
\[
(A)=e^0_\omega(x)\big(A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell\big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell\big)
\implies \partial(A)=e^0_\omega(x)\,2\,Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell,
\]
\[
(B)=2\Big(e^0_\omega(x)\Big(\sum_{i\neq\ell}A_\omega^{i+1\to L}Q^i_\omega v^i-x\Big)+A_\omega e^1_\omega(x)\Big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell\big)+cst
\implies \partial(B)=\big(A_\omega^{\ell+1\to L}Q^\ell_\omega\big)^T\Sigma_x^{-1}\Big(e^0_\omega(x)\Big(\sum_{i\neq\ell}A_\omega^{i+1\to L}Q^i_\omega v^i-x\Big)+A_\omega e^1_\omega(x)\Big).
\]

Update of the slope parameter

We can thus rewrite the loss as

\[
\mathcal{L}(W^\ell)=x^T\Sigma_x^{-1}\sum_\omega\big(A_\omega e^1_\omega(x)+b_\omega e^0_\omega(x)\big)
-\frac{1}{2}\sum_\omega e^0_\omega b_\omega^T\Sigma_x^{-1}b_\omega
-\frac{1}{2}\sum_\omega\operatorname{trace}\big(A_\omega^T\Sigma_x^{-1}A_\omega E^2_\omega(x)\big)
-\sum_\omega\big(A_\omega m^1_\omega(x)\big)^T\Sigma_x^{-1}b_\omega.
\]
Notice that we can rewrite $b_\omega=A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}+\sum_{i=\ell}^{L}A_\omega^{i+1\to L}Q^i_\omega v^i$ and $A_\omega=A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}$, and thus we obtain:
\[
\mathcal{L}(W^\ell)=\sum_\omega x^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega\big(A_\omega^{1\to\ell-1}e^1_\omega(x)+b_\omega^{1\to\ell-1}e^0_\omega(x)\big)
-\frac{1}{2}\sum_\omega e^0_\omega\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)
-\sum_\omega e^0_\omega\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)^T\Sigma_x^{-1}\Big(\sum_{i=\ell}^{L}A_\omega^{i+1\to L}Q^i_\omega v^i\Big)
-\frac{1}{2}\sum_\omega\operatorname{trace}\Big(\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}\big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}\big)E^2_\omega(x)\Big)
-\sum_\omega\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}m^1_\omega(x)\big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)
-\sum_\omega\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}m^1_\omega(x)\big)^T\Sigma_x^{-1}\Big(\sum_{i=\ell}^{L}A_\omega^{i+1\to L}Q^i_\omega v^i\Big)+cst
\]
\[
=\underbrace{\sum_\omega x^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega\big(A_\omega^{1\to\ell-1}e^1_\omega(x)+b_\omega^{1\to\ell-1}e^0_\omega(x)\big)}_{(A)}
\underbrace{-\frac{1}{2}\sum_\omega e^0_\omega\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)}_{(B)}
\underbrace{-\sum_\omega\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega\big(A_\omega^{1\to\ell-1}e^1_\omega(x)+b_\omega^{1\to\ell-1}e^0_\omega(x)\big)\big)^T\Sigma_x^{-1}\Big(\sum_{i=\ell}^{L}A_\omega^{i+1\to L}Q^i_\omega v^i\Big)}_{(C)}
\underbrace{-\frac{1}{2}\sum_\omega\operatorname{trace}\Big(\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}\big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}\big)E^2_\omega(x)\Big)}_{(D)}
\underbrace{-\sum_\omega\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}m^1_\omega(x)\big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)}_{(E)}+cst.
\]

\[
(A)=\sum_\omega x^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega\big(A_\omega^{1\to\ell-1}e^1_\omega(x)+b_\omega^{1\to\ell-1}e^0_\omega(x)\big)
\implies \partial(A)=\sum_\omega Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}x\,\Big(Q^{\ell-1}_\omega\big(A_\omega^{1\to\ell-1}e^1_\omega(x)+b_\omega^{1\to\ell-1}e^0_\omega(x)\big)\Big)^T,
\]
\[
(B)=-\frac{1}{2}\sum_\omega e^0_\omega(x)\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)
\]
\[
=-\frac{1}{2}\sum_\omega e^0_\omega(x)\big(Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)^T(W^\ell)^T\big(A_\omega^{\ell+1\to L}Q^\ell_\omega\big)^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}
\]
\[
=-\frac{1}{2}\sum_\omega e^0_\omega(x)\operatorname{trace}\Big((W^\ell)^T\big(A_\omega^{\ell+1\to L}Q^\ell_\omega\big)^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell\,Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big(Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)^T\Big)
\]
\[
\implies \partial(B)=-\sum_\omega e^0_\omega(x)\,Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big(b_\omega^{1\to\ell-1}\big)^TQ^{\ell-1}_\omega,
\]
\[
(C)=-\sum_\omega\Big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega\big(A_\omega^{1\to\ell-1}e^1_\omega(x)+b_\omega^{1\to\ell-1}e^0_\omega(x)\big)\Big)^T\Sigma_x^{-1}\Big(\sum_{i=\ell}^{L}A_\omega^{i+1\to L}Q^i_\omega v^i\Big)
\]
\[
=-\sum_\omega\Big(Q^{\ell-1}_\omega\big(A_\omega^{1\to\ell-1}e^1_\omega(x)+b_\omega^{1\to\ell-1}e^0_\omega(x)\big)\Big)^T(W^\ell)^T\big(A_\omega^{\ell+1\to L}Q^\ell_\omega\big)^T\Sigma_x^{-1}\Big(\sum_{i=\ell}^{L}A_\omega^{i+1\to L}Q^i_\omega v^i\Big)
\]
\[
\implies \partial(C)=-\sum_\omega Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}\Big(\sum_{i=\ell}^{L}A_\omega^{i+1\to L}Q^i_\omega v^i\Big)\Big(Q^{\ell-1}_\omega\big(A_\omega^{1\to\ell-1}e^1_\omega(x)+b_\omega^{1\to\ell-1}e^0_\omega(x)\big)\Big)^T,
\]
\[
(D)=-\frac{1}{2}\sum_\omega\operatorname{trace}\Big(\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}\big)^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}E^2_\omega(x)\Big)
\]
\[
=-\frac{1}{2}\sum_\omega\operatorname{trace}\Big((W^\ell)^T\big(A_\omega^{\ell+1\to L}Q^\ell_\omega\big)^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell\,Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}E^2_\omega(x)\big(Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}\big)^T\Big)
\]
\[
\implies \partial(D)=-\sum_\omega Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}E^2_\omega(x)A_\omega^{\ell-1\to 1}Q^{\ell-1}_\omega,
\]
\[
(E)=-\sum_\omega\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}m^1_\omega(x)\big)^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}
\]
\[
=-\sum_\omega\operatorname{trace}\Big((W^\ell)^T\big(A_\omega^{\ell+1\to L}Q^\ell_\omega\big)^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell\,Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big(Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}m^1_\omega(x)\big)^T\Big)
\]
\[
\implies \partial(E)=-\sum_\omega Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell\,
Q^{\ell-1}_\omega\Big(b_\omega^{1\to\ell-1}\big(m^1_\omega(x)\big)^TA_\omega^{\ell-1\to 1}+A_\omega^{1\to\ell-1}m^1_\omega(x)\big(b_\omega^{1\to\ell-1}\big)^T\Big)\big(Q^{\ell-1}_\omega\big)^T.
\]
We can group (B), (D), and (E) together, as well as (A) and (C). Now, to solve for the gradient equal to $0$, we need to consider the flattened version of $W^\ell$, which we denote by $w^\ell=\operatorname{vect}(W^\ell)$, leading to

\[
\partial\mathcal{L}=\sum_\omega Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}\Big(x-\sum_{i=\ell}^{L}A_\omega^{i+1\to L}Q^i_\omega v^i\Big)\Big(Q^{\ell-1}_\omega\big(A_\omega^{1\to\ell-1}e^1_\omega(x)+b_\omega^{1\to\ell-1}e^0_\omega(x)\big)\Big)^T
\]
\[
-\sum_\omega Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega
\Big(e^0_\omega(x)\,b_\omega^{1\to\ell-1}\big(b_\omega^{1\to\ell-1}\big)^T+A_\omega^{1\to\ell-1}E^2_\omega(x)A_\omega^{\ell-1\to 1}
+b_\omega^{1\to\ell-1}\big(m^1_\omega(x)\big)^TA_\omega^{\ell-1\to 1}+A_\omega^{1\to\ell-1}m^1_\omega(x)\big(b_\omega^{1\to\ell-1}\big)^T\Big)Q^{\ell-1}_\omega
\]
\[
=\sum_\omega P^\ell_\omega(x)-U^\ell_\omega W^\ell V^\ell_\omega(x)
\]
\[
\implies \Big(\sum_x\sum_\omega U^\ell_\omega\otimes\big(V^\ell_\omega(x)\big)^T\Big)\operatorname{vect}(W^\ell)=\operatorname{vect}\Big(\sum_x\sum_\omega P^\ell_\omega(x)\Big)
\]
\[
\implies \operatorname{vect}(W^\ell)^*=\Big(\sum_x\sum_\omega U^\ell_\omega\otimes\big(V^\ell_\omega(x)\big)^T\Big)^{-1}\operatorname{vect}\Big(\sum_x\sum_\omega P^\ell_\omega(x)\Big).
\]

B.7 Regularization

We propose in this section a brief discussion of the impact of using a probabilistic prior on the weights of the DGN. In particular, it is clear that imposing a Gaussian prior with zero mean and isotropic covariance on the weights amounts, in the log-likelihood, to an l2 regularization of the weights with a coefficient based on the covariance of the prior. If the prior is a Laplace distribution, the log-likelihood turns the prior into an l1 regularization of the weights, again with a regularization coefficient based on the prior covariance. Finally, in the case of a uniform prior with finite support, the log-likelihood is equivalent to weight clipping, a standard technique employed in DNs where the weights cannot take values outside of a predefined range.
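As an illustration of the correspondence above, the following sketch (assuming PyTorch; the helper names and the usage line are ours, not an implementation from the thesis) spells out the negative log-prior terms induced by the three priors:

import torch

def neg_log_prior_gaussian(params, sigma2):
    # N(0, sigma2) prior  ->  (1 / (2 * sigma2)) * ||W||_2^2  (up to a constant)
    return sum((w ** 2).sum() for w in params) / (2.0 * sigma2)

def neg_log_prior_laplace(params, b):
    # Laplace(0, b) prior ->  (1 / b) * ||W||_1  (up to a constant)
    return sum(w.abs().sum() for w in params) / b

@torch.no_grad()
def clip_to_uniform_support(params, bound):
    # Uniform prior on [-bound, bound]: the induced constraint is weight clipping.
    for w in params:
        w.clamp_(-bound, bound)

# usage sketch: total_loss = nll + neg_log_prior_gaussian(model.parameters(), sigma2=10.0)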

B.8 Computational Complexity

The computational complexity of the method increases drastically with the latent space dimension, the number of regions, and the number of faces per region. Those quantities are directly tied to the complexity (depth and width) of the DGN. This complexity bottleneck comes from the need to search for all regions and to decompose each region into simplices. As such, EM learning is not yet suitable for large-scale applications; however, based on the obtained analytical forms, it is possible to derive a more tractable approximation of the true form that would still provide approximation error bounds, as opposed to current methods.

B.9 Additional Experiments

In this section we propose to complement the toy circle experiment from the main paper, first with an additional 2D case and then with the MNIST dataset.

Wave

We propose here a simple example where the real data is as depicted in Fig. B.1. We train the EM- and VAE-based learning on this dataset with various learning rates and depict below the evolution of the NLL for all models; we also depict the samples after learning.

MNIST We now employ MNIST, which consists of images of digits, and select the digit-4 class. Note that due to the computational overhead we maintain a univariate latent space for the DGN and employ a three-layer DGN with 8 and 16 hidden units. We first provide the evolution of the NLL through learning for all the training methods and then sample images from the trained DGNs, demonstrating how, for small DGNs, EM learning is able to learn a better data distribution and thus generates realistic samples, as opposed to VAEs which need many more training steps.

Figure B.1 : Sample of noisy data for the wave dataset.
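For reference, a generator of roughly the size described above can be written as follows (a sketch only: the nonlinearity, output dimension, and layer details are assumptions for illustration, not the exact architecture used in the experiment):

import torch
import torch.nn as nn

class SmallDGN(nn.Module):
    """Univariate latent space, three layers with 8 and 16 hidden units."""
    def __init__(self, latent_dim=1, out_dim=28 * 28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 8), nn.LeakyReLU(0.1),   # piecewise-affine nonlinearity
            nn.Linear(8, 16), nn.LeakyReLU(0.1),
            nn.Linear(16, out_dim),                        # affine output layer
        )

    def forward(self, z):
        return self.net(z)

g = SmallDGN()
x = g(torch.randn(5, 1))   # 5 generated samples
print(x.shape)             # torch.Size([5, 784])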

Figure B.2 : Depiction of the evolution of the NLL during training for the EM and VAE algorithms. We can see that despite the high number of training steps, the VAEs are not yet able to correctly approximate the data distribution, as opposed to EM training which benefits from much faster convergence. We also see how the VAEs tend to have a large KL divergence between the true posterior and the variational estimate due to this gap. We depict below samples from those models.

Figure B.3 : Samples from the various models trained on the wave dataset. On top we show the result of EM training, where each column represents a different run; the remaining three rows correspond to the VAE training. Again, EM demonstrates much faster convergence; for the VAEs to reach the actual data distribution, many more updates are needed.

Figure B.4 : Evolution of the true data negative log-likelihood (on a semi-log-y plot) on MNIST (class 4) for EM and VAE training for a small DGN as described above. The experiments are repeated multiple times. We can see how the learning rate clearly impacts the learning despite the use of Adam, and that even with the large learning rate, EM learning is able to reach a lower NLL; in fact the quality of the generated samples of the EM models is much higher, as shown below.

Figure B.5 : Random samples from DGNs trained with EM or VAEs on a MNIST experiment (with digit 4); rows correspond to Expectation-Maximization training and to VAE training with large, medium, and small learning rates. We see the ability of EM training to produce realistic and diversified samples despite using a latent space dimension of 1 and a small generative network.

Appendix C

Deep Network Pruning

This supplement provides more experiments to evaluate our proposed methods and is organized as follows:

• Sec. C.1 of this supplement provides more experiments to support Sec. 3 of the main content.

  – Sec. C.1.1 extends the insights about overparametrization and winning tickets.

  – Sec. C.1.2 supplies comprehensive comparisons between layerwise pretrain and lottery initialization methods.

• Sec. C.2 of this supplement provides more visualizations to support Sec. 4 of the main content.

  – Sec. C.2.1 visualizes the Early-Bird phenomenon on more networks.

• Sec. C.3 of this supplement provides extensive experiments to enrich Sec. 5 of the main content.

  – Sec. C.3.1 describes the detailed experiment settings.

  – Sec. C.3.2 supplies global spline pruning results on CIFAR-10/100.

  – Sec. C.3.3 supplies ablation studies to investigate the sensitivity of the hyperparameter ρ.

C.1 Additional Results on Initialization and Pruning

C.1.1 Winning Tickets and Overparameterization

Here we extend the overparametrization-pruning vs. initialization insights from Sec. 3 of the main content to univariate DNs on a carefully designed dataset. Consider a simple unidimensional sawtooth as displayed in Fig. C.2 with $P$ peaks (here $P=2$). In the special case of a single hidden layer with a ReLU activation function, one must have at least $2P$ units to perfectly fit this function, with one possible weight configuration being $[W_1]_{1,k}=1$, $[b_1]_k=-k$ for $k=1,\dots,D_1$, and $[W_2]_{1,1}=1$, $[W_2]_{1,k}=2(-1)^{(k-1)\%2}$ for $k=2,\dots,D_1$. Note that these weights are not unique (others can identically fit the function) and are given as an example. At initialization, if the DN has only $2P$ units, the probability that a random weight initialization arranges the initial splines in a way that allows effective gradient-based training is low. Increasing the width of the initial network increases the probability that some of the units are advantageously initialized throughout the domain and aligned with the natural input space partitioning of the target function (different regions for the different increasing or decreasing sides of the sawtooth). This is what is empirically illustrated in Fig. C.2 (right), where one can see that even repeating multiple initializations of a DN without overparametrization does not allow the task to be solved, while overparametrizing, training, and then pruning so as to preserve only the correct number of units allows for a better approximation.
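The hand-crafted construction above is easy to verify numerically. The sketch below instantiates it for P = 2 (so D1 = 4 units); note that the alternating sign pattern of the second-layer weights is reconstructed from the garbled source, so it should be read as one illustrative choice among the many configurations that fit the sawtooth.

import numpy as np

# 2P-unit single-hidden-layer ReLU network fitting a sawtooth with P peaks:
# first layer: [W1]_{1,k} = 1, [b1]_k = -k; second layer: weights 1, -2, +2, -2, ...
P = 2
D1 = 2 * P
w2 = np.array([1.0] + [2.0 * (-1) ** (k - 1) for k in range(2, D1 + 1)])  # 1, -2, 2, -2

def f(x):
    h = np.maximum(x[:, None] - np.arange(1, D1 + 1)[None, :], 0.0)  # ReLU(x - k)
    return h @ w2

x = np.linspace(0.0, 2 * P + 1, 9)
print(np.round(f(x), 2))   # rises and falls between consecutive integer knots (peaks at x = 2 and x = 4)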

C.1.2 Additional Results for Layerwise Pretraining

We have shown in the main paper that unsupervised layerwise pretraining provides a good initialization for small networks. Here we supply more results and analysis to support that point and discuss other interesting trade-offs between overparametrization and layerwise pretraining.

Figure C.1 : Depiction of the dataset used for the K-means experiment with 64 centroids.

Fig. C.3 further compares the required FLOPs when networks are initialized using lottery initialization and layerwise pretraining, respectively. We observe that 1) when the pruning ratio is low (i.e., < 50%), networks with lottery initialization require a smaller number of computational FLOPs to provide a good initialization for the pruned network, while leading to a comparable or even higher retraining accuracy; and 2) when the pruning ratio is higher, layerwise pretraining requires a much smaller number of computational FLOPs as compared to training highly overparametrized dense networks. Such a phenomenon opens a door for investigating the following two questions, which we leave as future work.

Figure C.2 : Left: Depiction of a simple (toy) univariate regression task with the target function being a sawtooth with two peaks. Right: The ℓ2 training error (y-axis) as a function of the width of the DN layer (2 layers in total). In theory, only 4 units are required to perfectly solve the task at hand with a ReLU layer; however, we see that optimization in narrow DNs is difficult and gradient-based learning fails to find the correct layer parameters. As the width is increased, the difficulty of the optimization problem reduces and SGD manages to find a good set of parameters solving the regression task.

• Is there a clear boundary/condition to show whether we should start from overparametrization or consider pretraining as a good initialization for small DNs instead?

• How much overparametrization do we need to maintain better trade-offs between accuracy and efficiency, as compared to other initialization methods (e.g., layerwise pretraining)?

Figure C.3 : Accuracy vs. efficiency trade-offs of lottery initialization and layerwise pretraining: testing accuracy (%) versus training FLOPs (×10^16) for VGG-16 on CIFAR-10 and on CIFAR-100.

C.2 Additional Early-Bird Visualizations

C.2.1 Early-Bird Visualization for VGG-16 and PreResNet-101

Here we supply more visualizations for the spline Early-Bird detection, which measures the hamming distances of the DN partition between consecutive epochs. Fig. C.4 shows such visualizations for VGG-16 and PreResNet-101 networks evaluated on the CIFAR-100 dataset. We can see that spline EB tickets can be identified in the early training stages (i.e., the 13/14-th epoch w.r.t. 160 epochs in total) in both networks, further validating the general effectiveness of our spline EB tickets.
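For clarity, the quantity being tracked can be summarized as follows: the DN partition is encoded by the binary activation pattern (the VQ code) of every ReLU unit on a fixed batch, and the spline EB criterion monitors the normalized hamming distance between the codes of consecutive checkpoints. A minimal sketch is given below (assuming PyTorch and a purely sequential ReLU network; `model_prev`, `model_curr`, and `batch` are placeholder names, and this is not the paper's implementation):

import torch
import torch.nn as nn

def activation_pattern(model: nn.Sequential, x: torch.Tensor) -> torch.Tensor:
    """Concatenate the binary ReLU states (>0) of every layer for each input."""
    codes = []
    h = x
    for layer in model:
        h = layer(h)
        if isinstance(layer, nn.ReLU):
            codes.append((h > 0).flatten(1))
    return torch.cat(codes, dim=1)            # shape: (batch, total_units)

def hamming_distance(model_prev, model_curr, batch) -> float:
    with torch.no_grad():
        c1 = activation_pattern(model_prev, batch)
        c2 = activation_pattern(model_curr, batch)
    return (c1 != c2).float().mean().item()   # normalized hamming distance in [0, 1]

# Early-Bird criterion (sketch): draw a ticket once the distance between
# consecutive epochs stays below a small threshold, e.g. hamming_distance(...) < 0.1.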

C.3 Additional Experimental Details and Results

C.3.1 Experiments Settings

Models & Datasets: We consider four DNN models (PreResNet-101, VGG-16, and ResNet-18/50) on both the CIFAR-10/100 and ImageNet datasets, following the basic setting of [You et al., 2020].

Figure C.4 : Illustrating the spline Early-Bird tickets in VGG-16 and PreResNet-101 on CIFAR-100.

Evaluation Metrics: We evaluate in terms of the retraining accuracy, total training FLOPs, and real-device energy cost; the energy cost is measured by training the models on an edge GPU (NVIDIA JETSON TX2), which accounts for both the computational and data movement costs.

Training Settings: For the CIFAR-10/100 datasets, the training takes a total of 160 epochs, and the initial learning rate is set to 0.1 and is divided by 10 at the 80-th and 120-th epochs, respectively. For the ImageNet dataset, the training takes a total of 90 epochs while the learning rate drops at the 30-th and 60-th epochs, respectively. In all the experiments, the batch size is set to 256, and an SGD solver is adopted with a momentum of 0.9 and a weight decay of 0.0001, following the setting of [Liu et al., 2019b]. Additionally, ρ in Eq. 2 of the main paper is set to 0.05 for all cases.
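For reproducibility, the CIFAR schedule above corresponds to the following PyTorch-style training loop (a sketch of the stated hyper-parameters only; `model`, `train_loader`, and `criterion` are assumed to be defined elsewhere):

import torch

# SGD with momentum 0.9, weight decay 1e-4; lr 0.1 divided by 10 at epochs 80 and 120;
# 160 epochs total; batches of size 256 are assumed to come from train_loader.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)

for epoch in range(160):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()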

C.3.2 Additional Results of Our Global Spline Pruning

Spline Pruning over SOTA on CIFAR-10/100. Table C.1 compares the retraining accuracy, the total training FLOPs, and the total training energy of our spline pruning methods and four SOTA pruning methods, including two unstructured pruning baselines (i.e., the original lottery ticket (LT) training [Frankle and Carbin, 2019] and SNIP [Lee et al., 2019b]) and two structured pruning baselines (i.e., NS [Liu et al., 2017b] and ThiNet [Luo et al., 2017]). The results demonstrate that our spline pruning again consistently outperforms all the competitors in terms of the achieved accuracy and training efficiency trade-offs. Specifically, compared with the strongest competitor among the four SOTA baselines, spline pruning achieves 0.8× ∼ 3.5× training FLOPs reductions and 0.7× ∼ 1.5× energy cost reductions while offering comparable or even better (-0.67% ∼ 0.69%) accuracies. In particular, spline pruning consistently achieves 1.16× ∼ 3.16× training FLOPs reductions compared to all the structured pruning baselines, while leading to comparable or better accuracies (-0.17% ∼ 1.28%).

C.3.3 Ablation Studies of Our Spline Pruning Method

Recall that the only hyperparameter in our spline pruning method is ρ (see Eq. 2 of the main content), which balances the difference between the angles and the biases. Here we conduct ablation studies measuring the retraining accuracies under different values of ρ in order to investigate its sensitivity, as shown in Fig. C.5. Without loss of generality, we evaluate two commonly used models, VGG-16 and PreResNet-101, on the representative CIFAR-100 dataset. Results show that spline pruning consistently performs well for a wide range of ρ values ranging from 0.01 to 0.4, which also generalizes to different pruning ratios (denoted by p). This set of experiments demonstrates the robustness of our spline pruning method.

Figure C.5 : Ablation studies of the hyperparameter ρ in our spline pruning method on two commonly used models, PreResNet-101 and VGG-16: testing accuracy (%) versus ρ ∈ {0.01, 0.05, 0.1, 0.3, 0.4} under different pruning ratios (p = 30%, 50%, 70% for PreResNet-101 and p = 10%, 30%, 50% for VGG-16).

Table C.1 : Evaluating our global spline pruning method over SOTA methods on CIFAR-10/100 datasets. Note that "Spline Improv." denotes the improvement of our spline pruning (w/ or w/o EB) as compared to the strongest baselines. Each block reports the retraining accuracy (%) and the energy cost (KJ) / FLOPs (P) at the three pruning ratios p.

PreResNet-101, CIFAR-10 (p = 30% / 50% / 70%):
  LT (one-shot)    93.7  / 93.21 / 92.78    6322/14.9   6322/14.9   6322/14.9
  SNIP             93.76 / 93.31 / 92.76    3161/7.40   3161/7.40   3161/7.40
  NS               93.83 / 93.42 / 92.49    5270/13.9   4641/12.7   4211/11.0
  ThiNet           93.39 / 93.07 / 91.42    3579/13.2   2656/10.6   1901/8.65
  Spline           94.13 / 93.92 / 92.06    4897/13.6   4382/12.1   3995/10.1
  EB Spline        93.67 / 93.18 / 92.32    2322/6.00   1808/4.26   1421/2.74
  Spline Improv.   0.3   / 0.5   / -0.46    1.4x/1.2x   1.5x/2.5x   1.4x/3.2x

VGG16, CIFAR-10 (p = 30% / 50% / 70%):
  LT (one-shot)    93.18 / 93.25 / 93.28    746.2/30.3  746.2/30.3  746.2/30.3
  SNIP             93.2  / 92.71 / 92.3     373.1/15.1  373.1/15.1  373.1/15.1
  NS               93.05 / 92.96 / 92.7     617.1/27.4  590.7/25.7  553.8/23.8
  ThiNet           92.82 / 91.92 / 90.4     631.5/22.6  383.9/19.0  380.1/16.6
  Spline           93.62 / 93.46 / 92.85    643.5/26.4  603.4/25.0  538.1/19.6
  EB Spline        93.28 / 93.05 / 91.96    476.1/19.4  436.1/15.5  370.7/11.1
  Spline Improv.   0.42  / 0.21  / -0.43    0.8x/0.8x   0.9x/1.0x   1.0x/1.4x

PreResNet-101, CIFAR-100 (p = 30% / 50% / 70%):
  LT (one-shot)    71.9  / 71.6  / 69.95    6095/14.9   6095/14.9   6095/14.9
  SNIP             72.34 / 71.63 / 70.01    3047/7.40   3047/7.40   3047/7.40
  NS               72.8  / 71.52 / 68.46    4851/13.7   4310/12.5   3993/10.3
  ThiNet           73.1  / 70.92 / 67.29    3603/13.2   2642/10.6   1893/8.65
  Spline           73.79 / 72.04 / 68.24    4980/12.6   4413/10.9   4008/9.36
  EB Spline        72.67 / 71.99 / 69.74    2388/5.44   1821/3.84   1416/2.46
  Spline Improv.   0.69  / 0.44  / -0.27    1.3x/1.4x   1.5x/2.8x   1.3x/3.5x

VGG16, CIFAR-100 (p = 10% / 30% / 50%):
  LT (one-shot)    72.62 / 71.31 / 70.96    741.2/30.3  741.2/30.3  741.2/30.3
  SNIP             71.55 / 70.83 / 70.35    370.6/15.1  370.6/15.1  370.6/15.1
  NS               71.24 / 71.28 / 69.74    636.5/29.3  592.3/27.1  567.8/24.0
  ThiNet           70.83 / 69.57 / 67.22    632.2/27.4  568.5/22.6  381.4/19.0
  Spline           72.18 / 71.54 / 70.07    688.3/28.0  605.2/22.9  555.0/19.4
  EB Spline        72.07 / 71.46 / 70.29    512.2/19.9  429.1/15.3  378.9/11.8
  Spline Improv.   -0.44 / 0.23  / -0.67    0.7x/0.8x   0.9x/1.0x   1.0x/1.3x

Appendix D

Batch-Normalization

D.1 Proofs

D.1.1 Proof of Theorem 7.1

Proof D.1 In order to prove the theorem, we demonstrate below that the total least squares optimization problem has a unique global optimum, given by the average of the data, hence corresponding to the batch-normalization mean parameter. We then demonstrate that at this minimum the value of the total least squares loss is given by the variance parameter of batch-normalization.

The optimization problem is given by
\[
\mathcal{L}(\mu;Z)=\sum_{k=1}^{D^{(\ell)}}\sum_{z\in Z}d\big(z,\mathcal{H}^{(\ell,k)}\big)^2
=\sum_{k=1}^{D^{(\ell)}}\sum_{z\in Z}\frac{\big(\langle[W^{(\ell)}]_{k,.},z^{(\ell-1)}\rangle-[\mu]_k\big)^2}{\|[W^{(\ell)}]_{k,.}\|_2^2}.
\]
It is clear that the optimization problem
\[
\min_{\mu\in\mathbb{R}^{D^{(\ell)}}}\mathcal{L}(\mu;Z)
\]
can be decomposed into multiple independent optimization problems, one for each dimension of the vector $\mu$, since we are working with an unconstrained optimization problem with a separable sum. We thus focus on a single $[\mu]_k$ for now. The optimization problem becomes
\[
\min_{[\mu]_k\in\mathbb{R}}\sum_{z\in Z}\frac{\big(\langle[W^{(\ell)}]_{k,.},z^{(\ell-1)}\rangle-[\mu]_k\big)^2}{\|[W^{(\ell)}]_{k,.}\|_2^2};
\]
taking the first derivative leads to
\[
\partial_{[\mu]_k}\sum_{z\in Z}\frac{\big(\langle[W^{(\ell)}]_{k,.},z^{(\ell-1)}\rangle-[\mu]_k\big)^2}{\|[W^{(\ell)}]_{k,.}\|_2^2}
=-\sum_{z\in Z}2\,\frac{\langle[W^{(\ell)}]_{k,.},z^{(\ell-1)}\rangle-[\mu]_k}{\|[W^{(\ell)}]_{k,.}\|_2^2}
=-2\sum_{z\in Z}\frac{\langle[W^{(\ell)}]_{k,.},z^{(\ell-1)}\rangle}{\|[W^{(\ell)}]_{k,.}\|_2^2}+2\,\mathrm{Card}(Z)\,\frac{[\mu]_k}{\|[W^{(\ell)}]_{k,.}\|_2^2}.
\]
The above first derivative of the total least squares (quadratic) loss function is thus a linear function of $[\mu]_k$, equal to $0$ at the unique point given by
\[
-2\sum_{z\in Z}\frac{\langle[W^{(\ell)}]_{k,.},z^{(\ell-1)}\rangle}{\|[W^{(\ell)}]_{k,.}\|_2^2}+2\,\mathrm{Card}(Z)\,\frac{[\mu]_k}{\|[W^{(\ell)}]_{k,.}\|_2^2}=0
\iff[\mu]_k=\frac{\sum_{z\in Z}\langle[W^{(\ell)}]_{k,.},z^{(\ell-1)}\rangle}{\mathrm{Card}(Z)},
\]
confirming that the average of the pre-activation feature maps (per dimension) is indeed the optimum of the optimization problem. One can easily verify that it is indeed a minimum by taking the second derivative of the total least squares loss, which is positive and given by
\[
\frac{2\,\mathrm{Card}(Z)}{\|[W^{(\ell)}]_{k,.}\|_2^2}>0.
\]
The above can be done for each dimension $k$ in a similar manner. Now, by inserting this optimal value back into the total least squares loss, we obtain the desired result.
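As a quick numerical illustration of this argument (a sketch on synthetic data, not part of the proof), the per-unit total least squares loss is indeed minimized at the average pre-activation:

import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((128, 16))   # a batch of layer inputs z^(l-1)
w_k = rng.standard_normal(16)        # row k of W^(l)

pre_acts = Z @ w_k                   # <w_k, z> for every z in the batch

def loss(mu):
    # per-unit total least squares loss L([mu]_k)
    return np.sum((pre_acts - mu) ** 2) / np.dot(w_k, w_k)

grid = np.linspace(pre_acts.min(), pre_acts.max(), 10_001)
mu_star = grid[np.argmin([loss(m) for m in grid])]
print(mu_star, pre_acts.mean())      # the two agree (up to grid resolution)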

D.1.2 Proof of Corollary 7.1

Proof D.2 In order to prove the desired result, we first demonstrate that the layer input centroid $z^{(\ell-1)}$ indeed belongs to each unit hyperplane.

For a data point (in our case $z^{(\ell-1)}$) to belong to the $k$th (unit) hyperplane $\mathcal{H}^{(\ell,k)}$ of layer $\ell$, we must ensure that this point belongs to the set defining the hyperplane (recall (7.5))
\[
\mathcal{H}^{(\ell,k)}=\Big\{z^{(\ell-1)}\in\mathbb{R}^{D^{(\ell-1)}}:\big\langle[W^{(\ell)}]_{k,.},z^{(\ell-1)}\big\rangle=[\mu^{(\ell)}]_k\Big\};
\]
in our case we can simply use the data centroid and ensure that it fulfils the hyperplane equality
\[
\Big\langle[W^{(\ell)}]_{k,.},z^{(\ell-1)}\Big\rangle
=\Big\langle[W^{(\ell)}]_{k,.},\frac{\sum_{z\in Z}z}{\mathrm{Card}(Z)}\Big\rangle
=\frac{\sum_{z\in Z}\big\langle[W^{(\ell)}]_{k,.},z\big\rangle}{\mathrm{Card}(Z)}
=[\mu^{(\ell)}_{\rm BN}]_k,
\]
where the last equality gives in fact the batch-normalization mean parameter. So now, recalling the equation of $\mathcal{H}^{(\ell,k)}$, we see that the point $z^{(\ell-1)}$ has projection $[\mu^{(\ell)}_{\rm BN}]_k$ onto the hyperplane direction, which equals the bias of the hyperplane, effectively making $z^{(\ell-1)}$ part of the (batch-normalized) hyperplane $\mathcal{H}^{(\ell,k)}$. Doing the above for each $k\in\{1,\dots,D^{(\ell)}\}$, we see that the layer input centroid belongs to all the unit hyperplanes that are shifted by the correct batch-normalization parameter; hence we directly obtain the desired result
\[
z^{(\ell-1)}\in\bigcap_{k=1}^{D^{(\ell)}}\mathcal{H}^{(\ell,k)},
\]
concluding the proof.

D.1.3 Proof of Theorem 7.2

Proof D.3 Define by $x^*$ the closest point to $x$ in $\mathcal{P}^{(\ell,k)}$, defined by
\[
x^*=\arg\min_{u\in\mathcal{P}^{(\ell,k)}}\|x-u\|_2.
\]
The path from $x$ to $x^*$ is a straight line in the input space, which we define by
\[
l(\theta)=x^*\theta+(1-\theta)x,\quad\theta\in[0,1],
\]
s.t. $l(0)=x$, our original point, and $l(1)$ is the closest point on the kinked hyperplane. Now, in the input space of layer $\ell$, this parametric line becomes a continuous piecewise affine parametric line defined as
\[
z^{(\ell-1)}(\theta)=\big(f^{(\ell-1)}\circ\dots\circ f^{(1)}\big)(l(\theta)).
\]
By definition, if $\mathcal{P}^{(\ell,k)}$ is brought closer to $x$, it means that $\exists\theta<1$ s.t. $l(\theta)\in\mathcal{P}^{(\ell,k)}$. Similarly, this can be expressed in the layer input space as follows:
\[
\exists\theta'<1\text{ s.t. }z^{(\ell-1)}(\theta')\in\mathcal{H}^{(\ell,k)}\implies\exists\theta<1\text{ s.t. }l(\theta)\in\mathcal{P}^{(\ell,k)};
\]
this demonstrates that when moving the layer hyperplane s.t. it intersects the kinked path $z^{(\ell-1)}$ at a point $z^{(\ell-1)}(\theta')$ with $\theta'<1$, then the distance in the input space is also reduced. Now, the BN fitting is greedy and tries to minimize the length of the straight line between $z^{(\ell-1)}(0)$, a.k.a. $z^{(\ell-1)}(x)$, and the hyperplane $\mathcal{H}^{(\ell,k)}$. However, notice that if the length of this straight line decreases by bringing the hyperplane closer to $z^{(\ell-1)}(x)$, then this also decreases the $\theta'$ s.t. $z^{(\ell-1)}(\theta')\in\mathcal{H}^{(\ell,k)}$, in turn reducing the distance between $x$ and $\mathcal{P}^{(\ell,k)}$ in the DN input space, giving the desired (second) result. Conversely, if $z^{(\ell-1)}(0)\in\mathcal{H}^{(\ell,k)}$ then the point $x$ lies in the zero-set of the unit, in turn making it belong to the kinked hyperplane $\mathcal{P}^{(\ell,k)}$ which corresponds to this exact set (recall Eq. 7.13).

D.1.4 Proof of Proposition 7.1

Proof D.4 When using leaky-ReLU, the input to the last layer will have both positive and negative values in each dimension for at least one sample in the current minibatch; that is, each dimension will have at least one negative value with all the others positive, or vice-versa. As the last layer is initialized with zero bias, the decision boundary is defined in the last-layer input space as the hyperplanes (or zero-sets) of each output unit. Also, being on one side or the other of the decision boundary in the DN input space is equivalent to being on one side or the other of the linear decision boundary in the last-layer input space. Combining those two results, we obtain that at initialization there has to be at least one sample on one side of the decision boundary and the others on the other side.

Bibliography

P-A Absil, Alan Edelman, and Plamen Koev. On the largest principal angle between

random subspaces. Linear Algebra and its applications, 414(1):288–294, 2006.

Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt,

and Been Kim. Sanity checks for saliency maps. In S. Bengio, H. Wal-

lach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,

Advances in Neural Information Processing Systems, volume 31. Curran Asso-

ciates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/

294a8ed24b1ad22ec2e7efea049b8737-Paper.pdf.

Sidney N Afriat. Orthogonal and oblique projectors and the characteristics of pairs of

vector spaces. In Mathematical Proceedings of the Cambridge Philosophical Society,

volume 53, pages 800–816. Cambridge University Press, 1957.

A. F. Agarap. Deep learning using rectified linear units (ReLU). arXiv preprint

arXiv:1803.08375, 2018.

Alok Aggarwal, Heather Booth, Joseph O’Rourke, Subhash Suri, and Chee K Yap.

Finding minimal convex nested polygons. Information and Computation, 83(1):

98–110, 1989.

Hirotugu Akaike. Factor analysis and aic. In Selected papers of hirotugu akaike, pages

371–386. Springer, 1987. 187

Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in

overparameterized neural networks, going beyond two layers. In Advances in neural

information processing systems, pages 6158–6169, 2019a.

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training

recurrent neural networks. In Advances in Neural Information Processing Systems,

pages 6676–6688, 2019b.

Edoardo Amaldi and Stefano Coniglio. A distance-based point-reassignment heuris-

tic for the k-hyperplane clustering problem. European Journal of Operational

Research, 227(1):22–29, 2013. ISSN 0377-2217. doi: https://doi.org/10.1016/

j.ejor.2012.09.026. URL https://www.sciencedirect.com/science/article/

pii/S037722171200690X.

Brandon Amos, Lei Xu, and J Zico Kolter. Input convex neural networks. arXiv

preprint arXiv:1609.07152, 2016.

Joakim Andén, Vincent Lostanlen, and Stéphane Mallat. Joint time–frequency scat-

tering. IEEE Transactions on Signal Processing, 67(14):3704–3718, 2019.

Helena Andrés-Terré and Pietro Lió. Perturbation theory approach to study the

latent space degeneracy of variational autoencoders, 2019.

M Arjovsky and L Bottou. Towards principled methods for training generative ad-

versarial networks. arxiv, 2017. arXiv preprint arXiv:1701.04862.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv

preprint arXiv:1701.07875, 2017. 188

S. Arora, A. Bhaskara, R. Ge, and T. Ma. Provable bounds for learning some deep

representations. arXiv preprint arXiv:1310.6343, 2013.

Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong

Wang. On exact computation with an infinitely wide neural net. In Advances in

Neural Information Processing Systems, pages 8141–8150, 2019a.

Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained

analysis of optimization and generalization for overparameterized two-layer neural

networks. arXiv preprint arXiv:1901.08584, 2019b.

Devansh Arpit and Yoshua Bengio. The benefits of over-parameterization at initial-

ization in deep relu networks. arXiv preprint arXiv:1901.03611, 2019.

David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding.

Technical report, Stanford, 2006.

Hagai Attias. A variational baysian framework for graphical models. In Advances in

neural information processing systems, pages 209–215, 2000.

Franz Aurenhammer. Power diagrams: properties, algorithms and applications. SIAM

Journal on Computing, 16(1):78–96, 1987.

Franz Aurenhammer. Voronoi diagrams—a survey of a fundamental geometric data

structure. ACM Computing Surveys (CSUR), 23(3):345–405, 1991.

Franz Aurenhammer and Hiroshi Imai. Geometric relations among voronoi diagrams.

Geometriae Dedicata, 27(1):65–75, 1988.

David Avis and Komei Fukuda. Reverse search for enumeration. Discrete applied

mathematics, 65(1-3):21–46, 1996. 189

Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with pseudo-ensembles.

In Advances in neural information processing systems, pages 3365–3373, 2014.

Bhavik R Bakshi and George Stephanopoulos. Compression of chemical process data

by functional approximation and feature extraction. AIChE Journal, 42(2):477–

492, 1996.

Pierre Baldi and Peter J Sadowski. Understanding dropout. In Advances in neural

information processing systems, pages 2814–2822, 2013.

Randall Balestriero, Romain Cosentino, Behnaam Aazhang, and Richard Baraniuk.

The geometry of deep networks: Power diagram subdivision. In Advances in Neural

Information Processing Systems 32, pages 15806–15815. 2019.

Sudipto Banerjee and Anindya Roy. Linear algebra and matrix analysis for statistics.

Chapman and Hall/CRC, 2014.

Nitin Bansal, Xiaohan Chen, and Zhangyang Wang. Can we gain more from or-

thogonality regularizations in training deep networks? In Advances in Neural

Information Processing Systems, pages 4261–4271, 2018.

C Bradford Barber, David P Dobkin, and Hannu Huhdanpaa. The quickhull algorithm

for convex hulls. ACM Transactions on Mathematical Software (TOMS), 22(4):

469–483, 1996.

David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network

dissection: Quantifying interpretability of deep visual representations. In Proceed-

ings of the IEEE conference on computer vision and pattern recognition, pages

6541–6549, 2017. 190

David Bau, Jun-Yan Zhu, Hendrik Strobelt, Agata Lapedriza, Bolei Zhou, and Anto-

nio Torralba. Understanding the role of individual units in a deep neural network.

Proceedings of the National Academy of Sciences, 117(48):30071–30078, 2020.

Peter Bauer, Alan Thorpe, and Gilbert Brunet. The quiet revolution of numerical

weather prediction. Nature, 525(7567):47–55, 2015.

Eugene Belilovsky, Michael Eickenberg, and Edouard Oyallon. Greedy layerwise

learning can scale to imagenet. In International conference on machine learning,

pages 583–593. PMLR, 2019.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern

machine-learning practice and the classical bias-variance trade-off. Proceedings of

the National Academy of Sciences, 116(32):15849–15854, 2019.

Richard Bellman and Robert Roth. Curve fitting by segmented straight lines. Journal

of the American Statistical Association, 64(327):1079–1084, 1969.

Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc

Le. Understanding and simplifying one-shot architecture search. In International

Conference on Machine Learning, pages 549–558, 2018.

Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new

perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, 2013.

Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max

Welling. Sylvester normalizing flows for variational inference. arXiv preprint

arXiv:1803.05649, 2018. 191

James C Bezdek and J Douglas Harris. Fuzzy partitions and relations; an axiomatic

basis for clustering. Fuzzy sets and systems, 1(2):111–127, 1978.

Manjunath BG and Stefan Wilhelm. Moments calculation for the double truncated

multivariate normal density. Available at SSRN 1472153, 2009.

Debswapna Bhattacharya and Jianlin Cheng. De novo protein conformational sam-

pling using a probabilistic graphical model. Scientific reports, 5(1):1–13, 2015.

Gérard Biau, Benoît Cadre, Maxime Sangnier, and Ugo Tanielian. Some theoretical

properties of gans. arXiv preprint arXiv:1803.07819, 2018.

C. Biernacki, G. Celeux, and G. Govaert. Assessing a mixture model for clustering

with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell.,

22(7):719–725, 2000.

Jeff Bilmes and Geoffrey Zweig. The graphical models toolkit: An open source soft-

ware system for speech and time-series processing. In 2002 IEEE International

Conference on Acoustics, Speech, and Signal Processing, volume 4, pages IV–3916.

IEEE, 2002.

Peter Binev, Albert Cohen, Wolfgang Dahmen, Ronald DeVore, et al. Classification

algorithms using adaptive partitioning. Annals of Statistics, 42(6):2141–2163, 2014.

Garrett Birkhoff and Carl De Boor. Error bounds for spline interpolation. Journal

of mathematics and mechanics, 13(5):827–835, 1964.

C. M. Bishop. Pattern Recognition and Machine Learning, volume 4. Springer-Verlag

New York, 2006. 192

Ake Bjorck and Gene H Golub. Numerical methods for computing angles between

linear subspaces. Mathematics of computation, 27(123):579–594, 1973.

Nils Bjorck, Carla P Gomes, Bart Selman, and Kilian Q Weinberger. Understanding

batch normalization. In Advances in Neural Information Processing Systems, pages

7694–7705, 2018.

Andreas Björklund, Thore Husfeldt, and Mikko Koivisto. Set partitioning via

inclusion-exclusion. SIAM Journal on Computing, 39(2):546–563, 2009.

Merlijn Blaauw and Jordi Bonada. Modeling and transforming speech using vari-

ational autoencoders. Morgan N, editor. Interspeech 2016; 2016 Sep 8-12; San

Francisco, CA.[place unknown]: ISCA; 2016. p. 1770-4., 2016.

Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What

is the state of neural network pruning? In Third Conference on Machine Learning

and Systems, 2020.

Léon Bottou. Large-scale machine learning with stochastic gradient descent. In

Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.

Y. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in

visual recognition. In Proc. Int. Conf. Mach. Learn., pages 111–118, 2010.

Stephen Boyd, Stephen P Boyd, and Lieven Vandenberghe. Convex optimization.

Cambridge university press, 2004.

Paul S Bradley and Usama M Fayyad. Refining initial points for k-means clustering.

In ICML, volume 98, pages 91–99. Citeseer, 1998. 193

Leo Breiman. Hinging hyperplanes for regression, classification, and function approx-

imation. IEEE Transactions on Information Theory, 39(3):999–1013, 1993.

Andrew Brown, Sean Milton, Mike Cullen, Brian Golding, John Mitchell, and Ann

Shelly. Unified modeling and prediction of weather and climate: A 25-year journey.

Bulletin of the American Meteorological Society, 93(12):1865–1877, 2012.

Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE

transactions on pattern analysis and machine intelligence, 35(8):1872–1886, 2013.

Fred B Bryant and Paul R Yarnold. Principal-components analysis and exploratory

and confirmatory factor analysis. 1995.

Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoen-

coders. arXiv preprint arXiv:1509.00519, 2015.

Santiago A Cadena, Marissa A Weis, Leon A Gatys, Matthias Bethge, and Alexan-

der S Ecker. Diverse feature visualizations reveal invariances in early layers of deep

neural networks. In Proceedings of the European Conference on Computer Vision

(ECCV), pages 217–232, 2018.

M Emre Celebi, Hassan A Kingravi, and Patricio A Vela. A comparative study of

efficient initialization methods for the k-means clustering algorithm. Expert systems

with applications, 40(1):200–210, 2013.

Bo Chen, Gungor Polatkan, Guillermo Sapiro, David Blei, David Dunson, and

Lawrence Carin. Deep learning with hierarchical convolutional factor analysis.

IEEE transactions on pattern analysis and machine intelligence, 35(8):1887–1901,

2013. 194

Ricky TQ Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating

sources of disentanglement in variational autoencoders. Advances in Neural Infor-

mation Processing Systems, 31:2610–2620, 2018.

Valeriia Cherepanova, Micah Goldblum, Harrison Foley, Shiyuan Duan, John Dicker-

son, Gavin Taylor, and Tom Goldstein. Lowkey: Leveraging adversarial attacks to

protect social media users from facial recognition. arXiv preprint arXiv:2101.07922,

2021.

Ting-Wu Chin, Ruizhou Ding, Cha Zhang, and Diana Marculescu. Towards effi-

cient model compression via learned global ranking. In Proceedings of the IEEE

Conference on Computer Vision and Pattern Recognition, 2020a.

Ting-Wu Chin, Ruizhou Ding, Cha Zhang, and Diana Marculescu. Towards efficient

model compression via learned global ranking. In Proceedings of the IEEE/CVF

Conference on Computer Vision and Pattern Recognition, pages 1518–1528, 2020b.

Albert Cohen, Nira Dyn, Frédéric Hecht, and Jean-Marie Mirebeau. Adaptive mul-

tiresolution analysis based on anisotropic triangulations. Mathematics of Compu-

tation, 81(278):789–810, 2012.

Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learn-

ing: A tensor analysis. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir,

editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of

Machine Learning Research, pages 698–728, Columbia University, New York, New

York, USA, 23–26 Jun 2016. PMLR.

T. S. Cohen and M. Welling. Group equivariant convolutional networks. arXiv

preprint arXiv:1602.07576, 2016. 195

YL Cun, L Bottou, G Orr, and K Muller. Efficient backprop, neural networks: Tricks

of the trade. Lecture notes in computer sciences, 1524:5–50, 1998.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathe-

matics of Control, Signals, and Systems (MCSS), 2(4):303–314, 1989.

Ingrid Daubechies, Ronald DeVore, Simon Foucart, Boris Hanin, and Guergana

Petrova. Nonlinear approximation and (deep) relu networks. arXiv preprint

arXiv:1905.02199, 2019.

Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak.

Hyperspherical variational auto-encoders. arXiv preprint arXiv:1804.00891, 2018.

Carl De Boor and John R Rice. Least squares cubic spline approximation i-fixed

knots. 1968.

Morris H DeGroot and Mark J Schervish. Probability and statistics. Pearson Educa-

tion, 2012.

Boris Delaunay et al. Sur la sphere vide. Izv. Akad. Nauk SSSR, Otdelenie Matem-

aticheskii i Estestvennyka Nauk, 7(793-800):1–2, 1934.

Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from

incomplete data via the em algorithm. Journal of the Royal Statistical Society:

Series B (Methodological), 39(1):1–22, 1977.

Tingquan Deng, Dongsheng Ye, Rong Ma, Hamido Fujita, and Lvnan Xiong. Low-

rank local tangent space embedding for subspace clustering. Information Sciences,

508:1–21, 2020.

Ronald A DeVore. Nonlinear approximation. Acta numerica, 7:51–150, 1998. 196

Adji B Dieng and John Paisley. Reweighted expectation maximization. arXiv preprint

arXiv:1906.05850, 2019.

Xiaohan Ding, Guiguang Ding, Yuchen Guo, and Jungong Han. Centripetal sgd for

pruning very deep convolutional networks with complicated structure. In Proceed-

ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,

pages 4943–4953, 2019.

Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent

components estimation. arXiv preprint arXiv:1410.8516, 2014.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using

real nvp. arXiv preprint arXiv:1605.08803, 2016.

David P Dobkin, Allan R Wilks, Silvio VF Levy, and William P Thurston. Con-

tour tracing by piecewise linear approximations. ACM Transactions on Graphics

(TOG), 9(4):389–423, 1990.

David L Donoho, Iain M Johnstone, et al. Ideal denoising in an orthonormal basis

chosen from a library of bases. Comptes rendus de l’Acad´emiedes sciences. S´erie

I, Math´ematique, 319(12):1317–1322, 1994.

Kenji Doya. Universality of fully connected recurrent neural networks. Dept. of

Biology, UCSD, Tech. Rep, 1993.

Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent

finds global minima of deep neural networks. In International Conference on Ma-

chine Learning, pages 1675–1685, 2019a. 197

Simon S Du, Kangcheng Hou, Russ R Salakhutdinov, Barnabas Poczos, Ruosong

Wang, and Keyulu Xu. Graph neural tangent kernel: Fusing graph neural networks

with graph kernels. In Advances in Neural Information Processing Systems, pages

5724–5734, 2019b.

Didier J Dubois. Fuzzy sets and systems: theory and applications, volume 144. Aca-

demic press, 1980.

James George Dunham. Optimum uniform piecewise linear approximation of planar

curves. IEEE Transactions on Pattern Analysis and Machine Intelligence, (1):

67–75, 1986.

Ishan Durugkar, Ian Gemp, and Sridhar Mahadevan. Generative multi-adversarial

networks. arXiv preprint arXiv:1611.01673, 2016.

Martin E Dyer. The complexity of vertex enumeration methods. Mathematics of

Operations Research, 8(3):381–402, 1983.

Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training gener-

ative neural networks via maximum mean discrepancy optimization. arXiv preprint

arXiv:1505.03906, 2015.

S. Elfwing, E. Uchibe, and K. Doya. Sigmoid-weighted linear units for neural network

function approximation in reinforcement learning. Neural Netw., 2018.

Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov,

Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and

Jeff Dean. A guide to deep learning in healthcare. Nature medicine, 25(1):24–29,

2019. 198

Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rig-

ging the lottery: Making all tickets winners. arXiv preprint arXiv:1911.11134,

2019.

Otto Fabius and Joost R van Amersfoort. Variational recurrent auto-encoders. arXiv

preprint arXiv:1412.6581, 2014.

Jay Farrell, Manu Sharma, and Marios Polycarpou. Backstepping-based flight con-

trol with adaptive function approximation. Journal of Guidance, Control, and

Dynamics, 28(6):1089–1102, 2005.

Giancarlo Ferrari-Trecate and Marco Muselli. A new learning method for piecewise

linear regression. In International conference on artificial neural networks, pages

444–449. Springer, 2002.

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse,

trainable neural networks. In International Conference on Learning Representa-

tions, 2019.

Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Repre-

senting model uncertainty in deep learning. In international conference on machine

learning, pages 1050–1059, 2016.

Alberto Gascón and Eugenio F. Sánchez-Úbeda. Automatic specification of piecewise

linear additive models: application to forecasting natural gas demand. Statistics

and Computing, 28(1):201–217, 2018.

A. Gersho and R. M. Gray. Vector Quantization and Signal Compression, volume

159. Springer US, 2012. 199

Thomas Gerstner and Markus Holtz. Algorithms for the cell enumeration and orthant

decomposition of hyperplane arrangements. University of Bonn, 2006.

Zoubin Ghahramani and Sam T Roweis. Learning nonlinear dynamical systems using

an em algorithm. In Advances in neural information processing systems, pages 431–

437, 1999.

Zoubin Ghahramani, Geoffrey E Hinton, et al. The em algorithm for mixtures of

factor analyzers. Technical report, Technical Report CRG-TR-96-1, University of

Toronto, 1996.

Federico Girosi, Michael Jones, and Tomaso Poggio. Priors stabilizers and basis

functions: From regularization to radial, tensor and additive splines. 1993.

X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward

neural networks. In Proc. 13th Int. Conf. AI Statist., volume 9, pages 249–256,

2010.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural

networks. In Proceedings of the Fourteenth International Conference on Artificial

Intelligence and Statistics, pages 315–323, 2011.

Micah Goldblum, Dimitris Tsipras, Chulin Xie, Xinyun Chen, Avi Schwarzschild,

Dawn Song, Aleksander Madry, Bo Li, and Tom Goldstein. Data security for

machine learning: Data poisoning, backdoor attacks, and defenses. arXiv preprint

arXiv:2012.10544, 2020.

I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning, volume 1. MIT Press,

2016. http://www.deeplearningbook.org. 200

I. J Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,

A. Courville, and Y. Bengio. Generative adversarial nets. In Proceedings of the

27th International Conference on Neural Information Processing Systems, pages

2672–2680. MIT Press, 2014.

Michael T. Goodrich. Efficient piecewise-linear function approximation using the

uniform metric: (preliminary version). In Proceedings of the Tenth Annual Sym-

posium on Computational Geometry, SCG ’94, page 322–331, New York, NY,

USA, 1994. Association for Computing Machinery. ISBN 0897916484. doi:

10.1145/177424.178040. URL https://doi.org/10.1145/177424.178040.

Gaurav Gothoskar, Alex Doboli, and Simona Doboli. Piecewise-linear modeling of

analog circuits based on model extraction from trained neural networks. In Pro-

ceedings of the 2002 IEEE International Workshop on Behavioral Modeling and

Simulation, 2002. BMAS 2002., pages 41–46. IEEE, 2002.

Will Grathwohl, Ricky TQ Chen, Jesse Betterncourt, Ilya Sutskever, and David Du-

venaud. Ffjord: Free-form continuous dynamics for scalable reversible generative

models. arXiv preprint arXiv:1810.01367, 2018.

Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint

arXiv:1308.0850, 2013.

Branko Grünbaum. Convex polytopes, volume 221. Springer Science & Business

Media, 2013.

Caglar Gulcehre, Marcin Moczulski, Francesco Visin, and Yoshua Bengio. Mollifying

networks. arXiv preprint arXiv:1608.04980, 2016. 201

S Louis Hakimi and Edward F Schmeichel. Fitting polygonal functions to a set of

points in the plane. CVGIP: Graphical Models and Image Processing, 53(2):132–

136, 1991.

Paul R Halmos. Measure theory, volume 18. Springer, 2013.

Greg Hamerly and Charles Elkan. Alternatives to the k-means algorithm that find

better clusterings. In Proceedings of the eleventh international conference on In-

formation and knowledge management, pages 600–607, 2002.

Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep

neural networks with pruning, trained quantization and huffman coding. arXiv

preprint arXiv:1510.00149, 2015.

Boris Hanin and David Rolnick. Complexity of linear regions in deep networks. arXiv

preprint arXiv:1901.09021, 2019.

L. A. Hannah and D. B. Dunson. Multivariate convex regression with adaptive par-

titioning. J. Mach. Learn. Res., 14(1):3261–3294, 2013.

Kazuyuki Hara, Daisuke Saitoh, and Hayaru Shouno. Analysis of dropout learning

regarded as ensemble learning. In International Conference on Artificial Neural

Networks, pages 72–79. Springer, 2016.

Harry H Harman. Modern factor analysis. University of Chicago press, 1976.

Radoslav Harman and Vladimír Lacko. On decompositional algorithms for uniform

sampling from n-spheres and n-balls. Journal of Multivariate Analysis, 101(10):

2297–2304, 2010. 202

Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight vc-dimension

bounds for piecewise linear neural networks. In Conference on Learning Theory,

pages 1064–1068. PMLR, 2017.

David A Harville. Matrix algebra from a statistician’s perspective, 1998.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into recti-

fiers: Surpassing human-level performance on imagenet classification. In Proceed-

ings of the IEEE international conference on computer vision, pages 1026–1034,

2015a.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning

for image recognition. CoRR, abs/1512.03385, 2015b.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning

for image recognition. In Proceedings of the IEEE conference on computer vision

and pattern recognition, pages 770–778, 2016a.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in

deep residual networks. In European conference on computer vision, pages 630–

645. Springer, 2016b.

Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft fil-

ter pruning for accelerating deep convolutional neural networks. arXiv preprint

arXiv:1808.06866, 2018.

Robert Hecht-Nielsen. Theory of the backpropagation neural network. In Neural

networks for perception, pages 65–93. Elsevier, 1992. 203

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew

Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic

visual concepts with a constrained variational framework. ICLR, 2(5):6, 2017.

Geoffrey E Hinton, Peter Dayan, and Michael Revow. Modeling the manifolds of

images of handwritten digits. IEEE transactions on Neural Networks, 8(1):65–74,

1997.

Matthew Hirn, Stéphane Mallat, and Nicolas Poilvert. Wavelet scattering regression

of quantum chemical energies. Multiscale Modeling & Simulation, 15(2):827–863,

2017.

Francis Hirsch and Gilles Lacombe. Elements of functional analysis, volume 192.

Springer Science & Business Media, 2012.

Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic

variational inference. The Journal of Machine Learning Research, 14(1):1303–1347,

2013.

Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural

networks, 4(2):251–257, 1991.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward net-

works are universal approximators. Neural networks, 2(5):359–366, 1989.

William C Horrace. Some results on the multivariate truncated normal distribution.

Journal of multivariate analysis, 94(1):209–221, 2005.

John Albert Horst and Isabel Beichel. A simple algorithm for efficient piecewise linear 204

approximation of space curves. In Proceedings of international conference on image

processing, volume 2, pages 744–747. IEEE, 1997.

Chin-Wei Huang, Kris Sankaran, Eeshan Dhekane, Alexandre Lacoste, and Aaron

Courville. Hierarchical importance weighted autoencoders. arXiv preprint

arXiv:1905.04866, 2019.

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely

connected convolutional networks. In Proceedings of the IEEE conference on com-

puter vision and pattern recognition, pages 4700–4708, 2017.

Gao Huang, Shichen Liu, Laurens Van der Maaten, and Kilian Q Weinberger. Con-

densenet: An efficient densenet using learned group convolutions. In Proceedings of

the IEEE Conference on Computer Vision and Pattern Recognition, pages 2752–

2761, 2018a.

Guang-Bin Huang, Dian Hui Wang, and Yuan Lan. Extreme learning machines: a

survey. International journal of machine learning and cybernetics, 2(2):107–122,

2011.

Huaibo Huang, Ran He, Zhenan Sun, Tieniu Tan, et al. Introvae: Introspective

variational autoencoders for photographic image synthesis. In Advances in neural

information processing systems, pages 52–63, 2018b.

Kaixuan Huang, Yuqing Wang, Molei Tao, and Tuo Zhao. Why do deep residual

networks generalize better than deep feedforward networks?–a neural tangent kernel

perspective. arXiv preprint arXiv:2002.06262, 2020.

Hiroshi Imai, Masao Iri, and Kazuo Murota. Voronoi diagram in the laguerre geometry

and its applications. SIAM Journal on Computing, 14(1):93–105, 1985. 205

Tadanobu Inoue, Subhajit Choudhury, Giovanni De Magistris, and Sakyasingha Das-

gupta. Transfer learning from synthetic to real images using variational autoen-

coders for precise position detection. In 2018 25th IEEE International Conference

on Image Processing (ICIP), pages 2725–2729. IEEE, 2018.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by

reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in

batch-normalized models. In Advances in neural information processing systems,

pages 1945–1953, 2017.

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image trans-

lation with conditional adversarial networks. In Proceedings of the IEEE conference

on computer vision and pattern recognition, pages 1125–1134, 2017.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Con-

vergence and generalization in neural networks. In Advances in neural information

processing systems, pages 8571–8580, 2018.

Kui Jia, Dacheng Tao, Shenghua Gao, and Xiangmin Xu. Improving training of

deep neural networks via singular value bounding. In Proceedings of the IEEE

Conference on Computer Vision and Pattern Recognition, pages 4344–4352, 2017.

Haifeng Jin, Qingquan Song, and Xia Hu. Auto-keras: An efficient neural architecture

search system. In Proceedings of the 25th ACM SIGKDD International Conference

on Knowledge Discovery & Data Mining, pages 1946–1956. ACM, 2019.

Roger A Johnson. Advanced Euclidean Geometry: An Elementary Treatise on the 206

Geometry of the Triangle and the Circle: Under the Editorship of John Wesley

Young. Dover Publications, 1960.

M.I. Jordan. Learning in Graphical Models. Adaptive computation and machine

learning. London, 1998. ISBN 9780262600323. URL https://books.google.com/

books?id=zac7L4LbNtUC.

Michael I Jordan. An introduction to probabilistic graphical models, 2003.

Pedro Julián, Mario Jordán, and Alfredo Desages. Canonical piecewise-linear ap-

proximation of smooth functions. IEEE Transactions on Circuits and Systems I:

Fundamental Theory and Applications, 45(5):567–571, 1998.

Claus Kahlert and Leon O Chua. A generalized canonical piecewise-linear represen-

tation. IEEE Transactions on Circuits and Systems, 37(3):373–383, 1990.

S Kang and L Chua. A global representation of multidimensional piecewise-linear

functions with linear partitions. IEEE Transactions on Circuits and Systems, 25

(11):938–940, 1978.

Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Sil-

verman, and Angela Y Wu. An efficient k-means clustering algorithm: Analysis and

implementation. IEEE Transactions on Pattern Analysis & Machine Intelligence,

(7):881–892, 2002.

Christian Kanzow and Stefania Petra. On a semismooth least squares formulation of

complementarity problems with gap reduction. Optimization Methods and Software,

19(5):507–525, 2004.

Kenji Kawaguchi, Jiaoyang Huang, and Leslie Pack Kaelbling. Effect of depth and

width on local minima in deep learning. Neural computation, 31(7):1462–1498,

2019.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and

Ping Tak Peter Tang. On large-batch training for deep learning: Generalization

gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

Mahyar Khayatkhoei, Maneesh K Singh, and Ahmed Elgammal. Disconnected mani-

fold learning for generative adversarial networks. In Advances in Neural Information

Processing Systems, pages 7343–7353, 2018.

Beomsu Kim, Junghoon Seo, Seunghyeon Jeon, Jamyoung Koo, Jeongyeol Choe, and

Taegyun Jeon. Why are saliency maps noisy? cause of and solution to noisy

saliency maps. In 2019 IEEE/CVF International Conference on Computer Vision

Workshop (ICCVW), pages 4149–4157. IEEE, 2019.

Hyunjik Kim and Andriy Mnih. Disentangling by factorising. arXiv preprint

arXiv:1802.05983, 2018.

Jintae Kim, Jaeseo Lee, Lieven Vandenberghe, and Chih-Kong Ken Yang. Techniques

for improving the accuracy of geometric-programming based analog circuit design

optimization. In IEEE/ACM International Conference on Computer Aided Design,

2004. ICCAD-2004., pages 863–870. IEEE, 2004.

Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based

probability estimation. arXiv preprint arXiv:1606.03439, 2016.

Ross D King, Stephen Muggleton, Richard A Lewis, and MJ Sternberg. Drug de-

sign by machine learning: The use of inductive logic programming to model the

structure-activity relationships of trimethoprim analogues binding to dihydrofolate

reductase. Proceedings of the national academy of sciences, 89(23):11322–11326,

1992.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv

preprint arXiv:1412.6980, 2014.

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint

arXiv:1312.6114, 2013.

Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1

convolutions. In Advances in Neural Information Processing Systems, pages 10215–

10224, 2018.

David G Kleinbaum, K Dietz, M Gail, Mitchel Klein, and Mitchell Klein. Logistic

regression. Springer, 2002.

Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On convergence and

stability of gans. arXiv preprint arXiv:1705.07215, 2017.

Jonas Kohler, Hadi Daneshmand, Aurelien Lucchi, Thomas Hofmann, Ming Zhou,

and Klaus Neymeyr. Exponential convergence rates for batch normalization: The

power of length-direction decoupling in non-convex optimization. In The 22nd

International Conference on Artificial Intelligence and Statistics, pages 806–815,

2019.

Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and tech-

niques. MIT press, 2009.

Harri Lappalainen and Antti Honkela. Bayesian non-linear independent component

analysis by multi-layer perceptrons. In Advances in independent component analy-

sis, pages 93–121. Springer, 2000.

Yann LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard,

Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten

zip code recognition. Neural computation, 1(4):541–551, 1989.

Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and

time series. The handbook of brain theory and neural networks, 3361(10):1995,

1995a.

Yann LeCun, LD Jackel, Léon Bottou, Corinna Cortes, John S Denker, Harris

Drucker, UA Muller, Eduard Sackinger, Patrice Simard, et al.

Learning algorithms for classification: A comparison on handwritten digit recogni-

tion. Neural networks: the statistical mechanics perspective, 261:276, 1995b.

Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak,

Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth

evolve as linear models under gradient descent. In Advances in neural information

processing systems, pages 8572–8583, 2019a.

Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-shot

network pruning based on connection sensitivity. In In-

ternational Conference on Learning Representations, 2019b. URL https://

openreview.net/forum?id=B1VZqjAcYX.

Chaojian Li, Tianlong Chen, Haoran You, Zhangyang Wang, and Yingyan Lin. Halo:

Hardware-aware learning to optimize. In Proceedings of the European Conference

on Computer Vision (ECCV), September 2020.

Guanbin Li and Yizhou Yu. Visual saliency based on multiscale deep features. In

Proceedings of the IEEE conference on computer vision and pattern recognition,

pages 5455–5463, 2015.

Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing

the loss landscape of neural nets. arXiv preprint arXiv:1712.09913, 2017a.

Jerry Li, Aleksander Madry, John Peebles, and Ludwig Schmidt. Towards un-

derstanding the dynamics of generative adversarial networks. arXiv preprint

arXiv:1706.09884, 2017b.

Xiaopeng Li and James She. Collaborative variational autoencoder for recommender

systems. In Proceedings of the 23rd ACM SIGKDD international conference on

knowledge discovery and data mining, pages 305–314, 2017.

Xiaopeng Li, Zhourong Chen, Leonard KM Poon, and Nevin L Zhang. Learning latent

superstructures in variational autoencoders for deep multidimensional clustering.

arXiv preprint arXiv:1803.05206, 2018.

Zhibin Liao and Gustavo Carneiro. On the importance of normalisation layers in deep

learning with piecewise linear activation units. In 2016 IEEE Winter Conference

on Applications of Computer Vision (WACV), pages 1–8. IEEE, 2016.

Jaechang Lim, Seongok Ryu, Jin Woo Kim, and Woo Youn Kim. Molecular generative

model based on conditional variational autoencoder for de novo molecular design.

Journal of cheminformatics, 10(1):1–9, 2018.

Angelica Nakagawa Lima, Eric Allison Philot, Gustavo Henrique Goulart Trossini,

Luis Paulo Barbour Scott, Vinícius Gonçalves Maltarollo, and Kathia Maria Hon-

orio. Use of machine learning approaches for novel drug discovery. Expert opinion

on drug discovery, 11(3):225–239, 2016.

Ru-Je Lin and Wei-Song Lin. A computational visual saliency model based on statis-

tics and machine learning. Journal of vision, 14(9):1–1, 2014.

Shaohui Lin, Rongrong Ji, Yuchao Li, Yongjian Wu, Feiyue Huang, and Baochang

Zhang. Accelerating convolutional networks via global and dynamic filter pruning.

In Proceedings of the Twenty-Seventh International Joint Conference on Artificial

Intelligence, IJCAI-18, 2018.

Shaohui Lin, Rongrong Ji, Chenqian Yan, Baochang Zhang, Liujuan Cao, Qixiang

Ye, Feiyue Huang, and David Doermann. Towards optimal structured cnn pruning

via generative adversarial learning. In Proceedings of the IEEE/CVF Conference

on Computer Vision and Pattern Recognition, pages 2790–2799, 2019.

Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander Gaunt. Con-

strained graph variational autoencoders for molecule design. In S. Bengio,

H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett,

editors, Advances in Neural Information Processing Systems 31, pages 7795–

7804. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/

8005-constrained-graph-variational-autoencoders-for-molecule-design.

pdf.

Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and con-

vergence properties of generative adversarial learning. In Advances in Neural In-

formation Processing Systems, pages 5545–5553, 2017a.

Zechun Liu, Haoyuan Mu, Xiangyu Zhang, Zichao Guo, Xin Yang, Kwang-Ting

Cheng, and Jian Sun. Metapruning: Meta learning for automatic neural network

channel pruning. In Proceedings of the IEEE/CVF International Conference on

Computer Vision, pages 3296–3305, 2019a.

Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Chang-

shui Zhang. Learning efficient convolutional networks through network slimming.

In Proceedings of the IEEE International Conference on Computer Vision, pages

2736–2744, 2017b.

Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking

the value of network pruning. In International Conference on Learning Represen-

tations, 2019b. URL https://openreview.net/forum?id=rJlnB3C5Ym.

Vincent Lostanlen and Joakim Andén. Binaural scene classification with wavelet

scattering. Proc. DCASE, 2016.

Haihao Lu and Kenji Kawaguchi. Depth creates no bad local minima. arXiv preprint

arXiv:1702.08580, 2017.

Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The

expressive power of neural networks: A view from the width. arXiv preprint

arXiv:1709.02540, 2017.

James Lucas, George Tucker, Roger Grosse, and Mohammad Norouzi. Understanding

posterior collapse in generative latent variable models. 2019.

Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method

for deep neural network compression. In Proceedings of the IEEE international

conference on computer vision, pages 5058–5066, 2017.

Tom Lyche and Larry L Schumaker. Local spline approximation methods. Journal

of Approximation Theory, 15(4):294–325, 1975.

Li Ma, Melba M Crawford, and Jinwen Tian. Local manifold learning-based k-nearest-

neighbor for hyperspectral image classification. IEEE Transactions on Geoscience

and Remote Sensing, 48(11):4099–4109, 2010.

Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve

neural network acoustic models. In Proc. icml, volume 30, page 3, 2013.

David JC MacKay and Mark N Gibbs. Density networks. Statistics and neural

networks: advances at the interface. Oxford University Press, Oxford, pages 129–

144, 1999.

James MacQueen et al. Some methods for classification and analysis of multivari-

ate observations. In Proceedings of the fifth Berkeley symposium on mathematical

statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.

A. Magnani and S. P. Boyd. Convex piecewise-linear fitting. Optim. Eng., 10(1):

1–17, 2009.

S. Mallat. Group invariant scattering. Comm. Pure Appl. Math., 65(10):1331–1398,

July 2012.

Stéphane Mallat. A wavelet tour of signal processing. Academic press, 1999.

Stéphane Mallat. Understanding deep convolutional networks. Phil. Trans. R. Soc.

A, 374(2065):20150203, 2016.

Olvi L Mangasarian, J Ben Rosen, and ME Thompson. Global minimization via

piecewise-linear underestimation. Journal of Global Optimization, 32(1):1–9, 2005.

Christopher Manning and Dan Klein. Optimization, maxent models, and conditional

estimation without magic. In Proceedings of the 2003 Conference of the North

American Chapter of the Association for Computational Linguistics on Human

Language Technology: Tutorials-Volume 5, pages 8–8. Association for Computa-

tional Linguistics, 2003.

Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen

Paul Smolley. Least squares generative adversarial networks. In Proceedings of the

IEEE International Conference on Computer Vision, pages 2794–2802, 2017.

Dominic Masters and Carlo Luschi. Revisiting small batch training for deep neural

networks. arXiv preprint arXiv:1804.07612, 2018.

Carl D Meyer. Matrix analysis and applied linear algebra, volume 71. Siam, 2000.

Jianming Miao and Adi Ben-Israel. On principal angles between subspaces in R^n.

Linear Algebra Appl, 171(92):81–98, 1992.

Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint

arXiv:1511.06422, 2015.

Joseph SB Mitchell and Subhash Suri. Separation and approximation of polyhedral

objects. Computational Geometry, 5(2):95–114, 1995.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida.

Spectral normalization for generative adversarial networks. arXiv preprint

arXiv:1802.05957, 2018.

Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout spar-

sifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.

Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Prun-

ing convolutional neural networks for resource efficient inference. arXiv preprint

arXiv:1611.06440, 2016.

Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the

number of linear regions of deep neural networks. In Advances in neural information

processing systems, pages 2924–2932, 2014.

Theodore S Motzkin, Howard Raiffa, Gerald L Thompson, and Robert M Thrall. The

double description method. Contributions to the Theory of Games, 2(28):51–73,

1953.

John Mount. The equivalence of logistic regression and maximum entropy models.

URL: http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf, 2011.

James R Munkres. Elements of algebraic topology. CRC Press, 2018.

Naila Murray and Florent Perronnin. Generalized max pooling. In Proceedings of

the IEEE conference on computer vision and pattern recognition, pages 2473–2480,

2014.

Sandeep Nadella, Amarjot Singh, and SN Omkar. Aerial scene understanding us-

ing deep wavelet scattering network and conditional random field. In European

Conference on Computer Vision, pages 205–214. Springer, 2016.

Eric Nalisnick and Padhraic Smyth. Stick-breaking variational autoencoders. arXiv

preprint arXiv:1605.06197, 2016.

N. M. Nasrabadi and R. A. King. Image coding using vector quantization: A review.

IEEE Trans. Commun., 36(8):957–971, 1988.

Radford M Neal and Geoffrey E Hinton. A view of the em algorithm that justifies

incremental, sparse, and other variants. In Learning in graphical models, pages

355–368. Springer, 1998.

Yu Nesterov. A method of solving a convex programming problem with convergence

rate O(1/k^2). In Sov. Math. Dokl., volume 27, 1983.

Klaus Neumann, Matthias Rolf, and Jochen Jakob Steil. Reliable integration of

continuous constraints into extreme learning machines. International Journal of

Uncertainty, Fuzziness and Knowledge-Based Systems, 21(supp02):35–50, 2013.

Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan

Srebro. The role of over-parametrization in generalization of neural networks.

In International Conference on Learning Representations, 2019. URL https:

//openreview.net/forum?id=BygfghAcYX.

Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks.

In International conference on machine learning, pages 2603–2612. PMLR, 2017.

Siqi Nie, Meng Zheng, and Qiang Ji. The deep regression bayesian network and

its applications: Probabilistic deep learning for computer vision. IEEE Signal

Processing Magazine, 35(1):101–111, 2018.

Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha

Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical

study. In International Conference on Learning Representations, 2018. URL https:

//openreview.net/forum?id=HJC2SzZCW.

Jong-Hoon Oh and H Sebastian Seung. Learning generative models with the up

propagation algorithm. In Advances in Neural Information Processing Systems,

pages 605–611, 1998.

Edison Ong, Mei U Wong, Anthony Huffman, and Yongqun He. Covid-19 coron-

avirus vaccine design using reverse vaccinology and machine learning. Frontiers in

immunology, 11:1581, 2020.

János Pach and Pankaj K Agarwal. Combinatorial geometry, volume 37. John Wiley

& Sons, 2011.

Sankar K Pal and Sushmita Mitra. Multilayer perceptron, fuzzy sets, and classifica-

tion. IEEE Transactions on neural networks, 3(5):683–697, 1992.

Rahul Parhi and Robert D Nowak. Banach space representer theorems for neural

networks and ridge splines. Journal of Machine Learning Research, 22(43):1–40,

2021.

Jooyoung Park and Irwin W Sandberg. Universal approximation using radial-basis-

function networks. Neural computation, 3(2):246–257, 1991.

Yookoon Park, Chris Kim, and Gunhee Kim. Variational laplace autoencoders. In

International Conference on Machine Learning, pages 5032–5041, 2019.

A. Patel, T. Nguyen, and R. Baraniuk. A probabilistic framework for deep learning.

In Proc. Adv. Neural Inf. Process. Syst. (NIPS’16), Dec. 2016.

Vu Pham, Théodore Bluche, Christopher Kermorvant, and Jérôme Louradour.

Dropout improves recurrent neural networks for handwriting recognition. In 2014

14th international conference on frontiers in handwriting recognition, pages 285–

290. IEEE, 2014.

Tsai-Yun Phillips and Azriel Rosenfeld. An isodata algorithm for straight line fitting.

Pattern Recognition Letters, 7(5):291–297, 1988.

Jennifer Pittman and CA Murthy. Fitting optimal piecewise linear functions using ge-

netic algorithms. IEEE Transactions on pattern analysis and machine intelligence,

22(7):701–718, 2000.

Michael James David Powell. Approximation theory and methods. Cambridge univer-

sity press, 1981.

Franco P Preparata and Michael I Shamos. Computational geometry: an introduction.

Springer Science & Business Media, 2012.

Lawrence Rabiner and B Juang. An introduction to hidden markov models. ieee assp

magazine, 3(1):4–16, 1986.

Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl Dickstein.

On the expressive power of deep neural networks. In Proceedings of the 34th In-

ternational Conference on Machine Learning-Volume 70, pages 2847–2854. JMLR.

org, 2017.

P. Ramachandran, B. Zoph, and Q. Le. Searching for activation functions. ArXiv

e-prints, Oct. 2017.

Douglas A Reynolds. Gaussian mixture models. Encyclopedia of biometrics, 741:

659–663, 2009.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing

flows. arXiv preprint arXiv:1505.05770, 2015.

Richard Baraniuk, Anima Anandkumar, Stephane Mallat, Ankit Patel, and Nhat Ho.

Integration of deep learning theories, 2018.

Lewis Fry Richardson. Weather prediction by numerical process. Cambridge university

press, 2007.

S. Rifai, G. Mesnil, P. Vincent, X. Muller, Y. Bengio, Y. Dauphin, and X. Glorot.

Higher order contractive auto-encoder. In Joint European Conference on Machine

Learning and Knowledge Discovery in Databases, pages 645–660. Springer, 2011.

Blaine Rister and Daniel L Rubin. Piecewise convexity of artificial neural networks.

Neural Networks, 94:34–45, 2017.

Gérard Rives, Michel Dhome, Jean-Thierry Lapresté, and Marc Richetin. Detection

of patterns in images from piecewise linear contours. Pattern recognition letters, 3

(2):99–104, 1985.

Arthur Wayne Roberts. Convex functions. In Handbook of Convex Geometry, Part

B, pages 1081–1104. Elsevier, 1993.

Frank Rosenblatt. Principles of neurodynamics. perceptrons and the theory of brain

mechanisms. Technical report, Cornell Aeronautical Lab Inc Buffalo NY, 1961.

Sam Roweis and Zoubin Ghahramani. A unifying review of linear gaussian models.

Neural computation, 11(2):305–345, 1999.

Aurko Roy, Ashish Vaswani, Arvind Neelakantan, and Niki Parmar. Theory and

experiments on vector quantized autoencoders. arXiv preprint arXiv:1805.11063,

2018.

Swalpa Kumar Roy, Suvojit Manna, Shiv Ram Dubey, and Bidyut Baran Chaudhuri.

Lisht: Non-parametric linearly scaled hyperbolic tangent activation function for

neural networks. arXiv preprint arXiv:1901.05894, 2019.

Walter Rudin. Real and complex analysis. Tata McGraw-hill education, 2006.

Parsa Saadatpanah, Ali Shafahi, and Tom Goldstein. Adversarial attacks on copyright

detection systems. In Hal Daumé III and Aarti Singh, editors, Proceedings of the

37th International Conference on Machine Learning, volume 119 of Proceedings

of Machine Learning Research, pages 8307–8315. PMLR, 13–18 Jul 2020. URL

http://proceedings.mlr.press/v119/saadatpanah20a.html.

Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between

capsules. arXiv preprint arXiv:1710.09829, 2017.

Tim Salimans and Durk P Kingma. Weight normalization: A simple reparame-

terization to accelerate training of deep neural networks. In Advances in Neural

Information Processing Systems, pages 901–909, 2016.

Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How

does batch normalization help optimization? In Advances in Neural Information

Processing Systems, pages 2483–2493, 2018.

Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the

nonlinear dynamics of learning in deep linear neural networks. arXiv preprint

arXiv:1312.6120, 2013.

Anton Maximilian Schäfer and Hans Georg Zimmermann. Recurrent neural networks

are universal approximators. In International Conference on Artificial Neural Net-

works, pages 632–640. Springer, 2006.

Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural

Computation, 4(6):863–879, 1992.

Isaac J Schoenberg. Cardinal spline interpolation. SIAM, 1973.

Isaac Jacob Schoenberg. Spline functions and the problem of graduation. In IJ

Schoenberg Selected Papers, pages 201–204. Springer, 1988.

Larry Schumaker. Spline functions: basic theory. Cambridge University Press, 2007.

Thiago Serra and Srikumar Ramalingam. Empirical bounds on linear regions of deep

rectifier networks. In Proceedings of the AAAI Conference on Artificial Intelligence,

volume 34, pages 5628–5635, 2020.

Thiago Serra, Christian Tjandraatmadja, and Srikumar Ramalingam. Bounding and

counting linear regions of deep neural networks. In International Conference on

Machine Learning, pages 4558–4566. PMLR, 2018.

Anish Shah, Eashan Kadam, Hena Shah, Sameer Shinde, and Sandip Shingade. Deep

residual networks with exponential linear unit. In Proceedings of the Third Inter-

national Symposium on Computer Vision and the Internet, pages 59–65, 2016.

Or Sharir and Amnon Shashua. On the expressive power of overlapping architectures

of deep learning. In International Conference on Learning Representations, 2018.

URL https://openreview.net/forum?id=HkNGsseC-.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al.

Mastering the game of go without human knowledge. nature, 550(7676):354–359,

2017.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional

networks: Visualising image classification models and saliency maps. 2014.

Nora H Sleumer. Output-sensitive cell enumeration in hyperplane arrangements.

Nordic journal of computing, 6(2):137–147, 1999.

Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima

in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.

Michael Spivak. Calculus on manifolds: a modern approach to classical theorems of

advanced calculus. CRC press, 2018.

Nitish Srivastava. Improving neural networks with dropout. University of Toronto,

182(566):7, 2013.

Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep

networks. In Advances in neural information processing systems, pages 2377–2385,

2015.

Richard P Stanley et al. An introduction to hyperplane arrangements. Geometric

combinatorics, 13:389–496, 2004.

Gilbert W Stewart. Error and perturbation bounds for subspaces associated with

certain eigenvalue problems. SIAM review, 15(4):727–764, 1973.

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance

of initialization and momentum in deep learning. In International conference on

machine learning, pages 1139–1147, 2013.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna.

Rethinking the inception architecture for computer vision. In Proceedings of the

IEEE conference on computer vision and pattern recognition, pages 2818–2826,

2016.

Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi.

Inception-v4, inception-resnet and the impact of residual connections on learning.

In AAAI, volume 4, page 12, 2017.

Paulo Tabuada and Bahman Gharesifard. Universal approximation power of

deep residual neural networks via nonlinear control theory. arXiv preprint

arXiv:2007.06007, 2020.

Georges M Tallis. The moment generating function of the truncated multi-normal

distribution. Journal of the Royal Statistical Society: Series B (Methodological),

23(1):223–229, 1961.

GM Tallis. Plane truncation in normal populations. Journal of the Royal Statistical

Society: Series B (Methodological), 27(2):301–307, 1965.

Jiexiong Tang, Chenwei Deng, and Guang-Bin Huang. Extreme learning machine for

multilayer perceptron. IEEE transactions on neural networks and learning systems,

27(4):809–821, 2015.

Ugo Tanielian, Thibaut Issenhuth, Elvis Dohmatob, and Jeremie Mary. Learning

disconnected manifolds: a no gans land. arXiv preprint arXiv:2006.04596, 2020.

Y. Teng and A. Choromanska. Invertible autoencoder for domain adaptation. Com-

putation, 7(2):20, 2019.

Michael E Tipping and Christopher M Bishop. Mixtures of probabilistic principal

component analyzers. Neural computation, 11(2):443–482, 1999a.

Michael E Tipping and Christopher M Bishop. Probabilistic principal component

analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodol-

ogy), 61(3):611–622, 1999b.

Jakub M Tomczak and Max Welling. Improving variational auto-encoders using

householder flow. arXiv preprint arXiv:1611.09630, 2016.

Jakub M. Tomczak and Max Welling. VAE with a vampprior. CoRR, abs/1705.07120,

2017a. URL http://arxiv.org/abs/1705.07120.

Jakub M Tomczak and Max Welling. Vae with a vampprior. arXiv preprint

arXiv:1705.07120, 2017b.

John Torous, Mark E Larsen, Colin Depp, Theodore D Cosco, Ian Barnett,

Matthew K Nock, and Joe Firth. Smartphones, sensors, and machine learning

to advance real-time prediction and interventions for suicide prevention: a review

of current progress and next steps. Current psychiatry reports, 20(7):1–6, 2018.

L. Trottier, P. Giguère, and B. Chaib-draa. Parametric exponential linear unit for deep

convolutional neural networks. In 16th IEEE Int. Conf. Mach. Learn. Appl., pages

207–214. IEEE, 2017.

Michael Unser. A representer theorem for deep neural networks. arXiv preprint

arXiv:1802.09210, 2018.

Michael Unser and Thierry Blu. Cardinal exponential splines: Part i-theory and

filtering algorithms. IEEE Transactions on Signal Processing, 53(4):1425–1438,

2005.

Michael Unser, Akram Aldroubi, and Murray Eden. B-spline signal processing. i.

theory. IEEE transactions on signal processing, 41(2):821–833, 1993.

Harri Valpola. Unsupervised learning of nonlinear dynamic state-space models.

Helsinki University of Technology, 2000.

Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In

Advances in Neural Information Processing Systems, pages 6306–6315, 2017.

Lieven Vandenberghe, BL De Moor, and Joos Vandewalle. The generalized linear

complementarity problem applied to the complete analysis of resistive piecewise-

linear circuits. IEEE transactions on circuits and systems, 36(11):1382–1391, 1989.

V Venkateswar and Rama Chellappa. Extraction of straight lines in aerial images.

IEEE Transactions on Pattern Analysis & Machine Intelligence, 14(11):1111–1114,

1992.

P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol. Extracting and com-

posing robust features with denoising autoencoders. In Proceedings of the 25th

international conference on Machine learning, pages 1096–1103, 2008.

Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy,

David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan

Bright, et al. Scipy 1.0: fundamental algorithms for scientific computing in python.

Nature methods, pages 1–12, 2020.

Aaron R Voelker, Jan Gosmann, and Terrence C Stewart. Efficiently sampling vec-

tors and coordinates from the n-sphere and n-ball. Technical report, Centre for

Theoretical Neuroscience, Waterloo, ON. doi: 10.13140 . . . , 2017.

Richard Von Mises. Mathematical theory of probability and statistics. Academic Press,

2014.

Georges Voronoi. Nouvelles applications des paramètres continus à la théorie des

formes quadratiques. Premier mémoire. Sur quelques propriétés des formes quadra-

tiques positives parfaites. Journal für die reine und angewandte Mathematik, 1908

(133):97–102, 1908.

Stefan Wager, Sida Wang, and Percy S Liang. Dropout training as adaptive regu-

larization. In Advances in neural information processing systems, pages 351–359,

2013.

Colin G Walsh, Jessica D Ribeiro, and Joseph C Franklin. Predicting risk of suicide

attempts over time through machine learning. Clinical Psychological Science, 5(3):

457–469, 2017.

Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization

of neural networks using dropconnect. In International conference on machine

learning, pages 1058–1066, 2013.

DP Wang, NF Huang, HS Chao, and Richard CT Lee. Plane sweep algorithms for the

polygonal approximation problems with applications. In International Symposium

on Algorithms and Computation, pages 515–522. Springer, 1993.

Zhou Wang and Alan C Bovik. Mean squared error: Love it or leave it? a new look

at signal fidelity measures. IEEE signal processing magazine, 26(1):98–117, 2009.

Zichao Wang, Randall Balestriero, and Richard Baraniuk. A max-affine spline per-

spective of recurrent neural networks. In International Conference on Learning

Representations, 2018.

David Warde-Farley, Ian J Goodfellow, Aaron Courville, and Yoshua Bengio.

An empirical analysis of dropout in piecewise linear networks. arXiv preprint

arXiv:1312.6197, 2013.

Di Xie, Jiang Xiong, and Shiliang Pu. All you need is beyond a good init: Exploring

better solution for training extremely deep convolutional neural networks with or-

thonormality and modulation. In Proceedings of the IEEE Conference on Computer

Vision and Pattern Recognition, pages 6176–6185, 2017.

Jiacheng Xu and Greg Durrett. Spherical latent spaces for stable variational autoen-

coders. arXiv preprint arXiv:1808.10805, 2018.

Lei Xu and Michael I Jordan. On convergence properties of the em algorithm for

gaussian mixtures. Neural computation, 8(1):129–151, 1996.

Xin Xu, Lei Zuo, and Zhenhua Huang. Reinforcement learning algorithms with func-

tion approximation: Recent advances and applications. Information Sciences, 261:

1–31, 2014.

Yangyang Xu and Wotao Yin. A block coordinate descent method for regularized

multiconvex optimization with applications to nonnegative tensor factorization and

completion. SIAM Journal on imaging sciences, 6(3):1758–1789, 2013.

Dingdong Yang, Seunghoon Hong, Yunseok Jang, Tianchen Zhao, and Honglak Lee.

Diversity-sensitive conditional generative adversarial networks, 2019a.

Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian pro-

cess behavior, gradient independence, and neural tangent kernel derivation. arXiv

preprint arXiv:1902.04760, 2019.

Greg Yang. Tensor programs ii: Neural tangent kernel for any architecture. arXiv

preprint arXiv:2006.14548, 2020.

Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, and Samuel S

Schoenholz. A mean field theory of batch normalization. arXiv preprint

arXiv:1902.08129, 2019b.

Jianbo Ye, Xin Lu, Zhe Lin, and James Z Wang. Rethinking the smaller-norm-less-

informative assumption in channel pruning of convolution layers. arXiv preprint

arXiv:1802.00124, 2018.

Peng-Yeng Yin. Algorithms for straight line fitting using k-means. Pattern recognition

letters, 19(1):31–41, 1998.

Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen,

Yingyan Lin, Zhangyang Wang, and Richard G. Baraniuk. Drawing early-bird

tickets: Toward more efficient training of deep networks. In International Confer-

ence on Learning Representations, 2020. URL https://openreview.net/forum?

id=BJxsrgStvr.

Seniha Esen Yuksel, Joseph N Wilson, and Paul D Gader. Twenty years of mixture

of experts. IEEE transactions on neural networks and learning systems, 23(8):

1177–1193, 2012.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint

arXiv:1605.07146, 2016.

Thomas Zaslavsky. Facing up to arrangements: Face-count formulas for partitions of

space by hyperplanes: Face-count formulas for partitions of space by hyperplanes,

volume 154. American Mathematical Soc., 1975.

Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint

arXiv:1212.5701, 2012.

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional

networks. In European conference on computer vision, pages 818–833. Springer,

2014.

Cheng Zhang, Judith Bütepage, Hedvig Kjellström, and Stephan Mandt. Advances

in variational inference. IEEE transactions on pattern analysis and machine intel-

ligence, 41(8):2008–2026, 2018a.

Liwen Zhang, Gregory Naitzat, and Lek-Heng Lim. Tropical geometry of deep neural

networks. CoRR, abs/1805.07091, 2018b. URL http://arxiv.org/abs/1805.

07091.

Lu Zhang, Jianjun Tan, Dan Han, and Hao Zhu. From machine learning to deep learn-

ing: progress in machine intelligence for rational drug discovery. Drug discovery

today, 22(11):1680–1685, 2017a.

Pengchuan Zhang, Qiang Liu, Dengyong Zhou, Tao Xu, and Xiaodong He. On the

discrimination-generalization tradeoff in gans. arXiv preprint arXiv:1711.02771,

2017b.

Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely

efficient convolutional neural network for mobile devices. In Proceedings of the

IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856,

2018c.

Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense

network for image super-resolution. In Proceedings of the IEEE conference on

computer vision and pattern recognition, pages 2472–2481, 2018d.

Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial

network. arXiv preprint arXiv:1609.03126, 2016.

Shengjia Zhao, Jiaming Song, and Stefano Ermon. Towards deeper understanding of

variational autoencoding models. arXiv preprint arXiv:1702.08658, 2017.

Ding-Xuan Zhou. Universality of deep convolutional neural networks. Applied and

computational harmonic analysis, 48(2):787–794, 2020.

Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized

deep neural networks. In Advances in Neural Information Processing Systems,

pages 2055–2064, 2019.

Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent

optimizes over-parameterized deep relu networks. arXiv preprint arXiv:1811.08888,

2018.

James Zou, Mikael Huss, Abubakar Abid, Pejman Mohammadi, Ali Torkamani, and

Amalio Telenti. A primer on deep learning in genomics. Nature genetics, 51(1):

12–18, 2019.