RICE UNIVERSITY

Max-Affine Splines Insights Into Deep Learning

By

Randall Balestriero

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE

Doctor of Philosophy

APPROVED, THESIS COMMITTEE

Richard Baraniuk

Ankit Patel

Behnaam Aazhang

Stephane Mallat

Moshe Vardi

Albert Cohen

HOUSTON, TEXAS
April 2021

ABSTRACT

Max-Affine Splines Insights Into Deep Learning

by

Randall Balestriero

We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings. For instance, conditioned on the spline partition region containing the input signal, the output of a MASO DN can be written as a simple affine transformation of the input. Studying the geometry of those regions allows us to obtain novel insights into different regularization techniques, different layer configurations and different initialization schemes. Going further, this spline viewpoint allows us to obtain precise geometric insights in various domains, such as the characterization of the Deep Generative Network's generated manifold, the understanding of Deep Network pruning as a means of simplifying the DN input space partition, or the relationship between different nonlinearities, e.g., ReLU and the Sigmoid Gated Linear Unit, as simply corresponding to different MASO region membership inference algorithms. The spline partition of the input signal space that is implicitly induced by a MASO directly links DNs to the theory of vector quantization (VQ) and K-means clustering, which opens up new geometric avenues to study how DNs organize signals in a hierarchical fashion.

ACKNOWLEDGEMENTS

I would like to thank Prof. Herve Glotin for giving me the opportunity to enter the research world during my Bachelor's degree with topics of the greatest interest. Needless to say, without Herve's passion and love for his academic profession, I would not be doing research in this exciting field of machine and deep learning. Herve has done much more than just provide me with an opportunity. He has molded me into a curious dreamer, a quality that I hope to hold for as long as possible in order to one day follow in Herve's footsteps. I would also like to especially thank Prof. Sebastien Paris for considering me as an equal colleague during my Bachelor's research internships and thereafter. Sebastien's rigor, knowledge, and pragmatism have influenced me greatly in the most positive way. I also want to acknowledge the countless invaluable encounters I have had within the LSIS team, such as with Prof. Ricard Marxer, within the LJLL team, such as with Prof. Frederic Hecht and Prof. Albert Cohen, and within the DI ENS team, such as with Prof. Stephane Mallat and Prof. Vincent Lostanlen, all sharing two common traits: a limitless expertise of their field, and an unbounded desire to share their knowledge. I would also like to thank Prof. Richard Baraniuk for taking me into his group and for constantly inspiring me to produce work of the highest quality. Rich's influence has allowed me to considerably improve my ability not only to conduct research but also to communicate research. I would have been an incomplete PhD candidate without this primordial skill. I also want to thank Prof. Rudolf Riedi for taking me on a mathematical tour. Rolf's ability to seamlessly bridge the most abstract theoretical concepts and the most intuitive observations will never cease to amaze me and to fuel my desire to learn. I also thank Sina Alemohammad, CJ Barberan, Yehuda Dar, Ahmed Imtiaz, Hamid Javadi, Daniel Lejeune, Lorenzo Luzi, Tan Nguyen, Jasper Tan, and Zichao Wang, who are part of the Deep Learning group at Rice and with whom I have been collaborating, discussing and brainstorming. I also want to thank, beyond words, my family, from whom I have never stopped learning, and my partner Dr. Kerda Varaku for mollifying the world around me (while performing a multi-year-long reinforcement learning experiment on me, probably still in progress today). I also want to give a special word to Dr. Romain Cosentino, with whom I have been blindly pursuing ideas that led us to novel fields, and to Dr. Leonard Seydoux, with whom I have discovered geophysics in the most interesting and captivating way. This work was partially supported by NSF grants IIS-17-30574 and IIS-18-38177, AFOSR grant FA9550-18-1-0478, ARO grant W911NF-15-1-0316, ONR grants N00014-17-1-2551 and N00014-18-12571, DARPA grant G001534-7500, a DOD Vannevar Bush Faculty Fellowship (NSSEFF) grant N00014-18-1-2047, and a BP fellowship from the Ken Kennedy Institute.

Contents

Abstract
Acknowledgments
List of Illustrations
List of Tables
Notations

1 Introduction
  1.1 Motivation
  1.2 Deep Networks
    1.2.1 Layers
    1.2.2 Training
    1.2.3 Approximation Results
  1.3 Related Works
    1.3.1 Mathematical Formulations of Deep Networks
    1.3.2 Training of Deep Generative Networks
    1.3.3 Batch-Normalization Understandings
    1.3.4 Deep Network Pruning
  1.4 Contributions

2 Max-Affine Splines for Convex Function Approximation
  2.1 Spline Functions
  2.2 Max-Affine Splines
  2.3 (Max-)Affine Spline Fitting

3 Deep Networks: Composition of Max-Affine Spline Operators
  3.1 Max-Affine Spline Operators
  3.2 From Deep Network Layers to Max-Affine Spline Operators
  3.3 Composition of Max-Affine Spline Operators
  3.4 Deep Networks Input Space Partition: Power Diagram Subdivision
    3.4.1 Voronoi Diagrams and Power Diagrams
    3.4.2 Single Layer: Power Diagram
    3.4.3 Composition of Layers: Power Diagram Subdivision
  3.5 Discussions

4 Insights Into Deep Generative Networks
  4.1 Introduction
    4.1.1 Related Works
    4.1.2 Contributions
  4.2 Deep Generative Network Latent and Intrinsic Dimension
    4.2.1 Input-Output Space Partition and Per-Region Mapping
    4.2.2 Generated Manifold Angularity
    4.2.3 Generated Manifold Intrinsic Dimension
    4.2.4 Effect of Dropout/Dropconnect
  4.3 Per-Region Affine Mapping Interpretability and Manifold Tangent Space
    4.3.1 Per-Region Mapping as Local Coordinate System and Disentanglement
    4.3.2 Tangent Space Regularization
  4.4 Density on the Generated Manifold
    4.4.1 Analytical Output Density
    4.4.2 On the Difficulty of Generating Low Entropy/Multimodal Distributions
  4.5 Discussions

5 Expectation-Maximization for Deep Generative Networks
  5.1 Introduction
    5.1.1 Related Works
    5.1.2 Contributions
  5.2 Posterior and Marginal Distributions of Deep Generative Networks
    5.2.1 Conditional, Marginal and Posterior Distributions of Deep Generative Networks
    5.2.2 Obtaining the DGN Partition
    5.2.3 Gaussian Integration on the Deep Generative Network Latent Partition
  5.3 Expectation-Maximization Learning of Deep Generative Networks
    5.3.1 Expectation Step
    5.3.2 Maximization Step
    5.3.3 Empirical Validation and VAE Comparison
  5.4 Discussions

6 Insights Into Deep Network Pruning
  6.1 Introduction
    6.1.1 Related Works
    6.1.2 Contributions
  6.2 Winning Tickets and DN Initialization
    6.2.1 The Initialization Dilemma and the Importance of Overparametrization
    6.2.2 Better DN Initialization: An Alternative to Pruning
  6.3 Pruning Continuous Piecewise Affine DNs
    6.3.1 Interpreting Pruning from a Spline Perspective
    6.3.2 Spline Early-Bird Tickets Detection
    6.3.3 Spline Pruning Policy
  6.4 Experiment Results
    6.4.1 Proposed Layerwise Spline Pruning over SOTA Pruning Methods
    6.4.2 Proposed Global Spline Pruning over SOTA Pruning Methods
  6.5 Discussions

7 Insights into Batch-Normalization
  7.1 Introduction
    7.1.1 Related Works
    7.1.2 Contributions
  7.2 Batch Normalization: Unsupervised Layer-Wise Fitting
    7.2.1 Batch-Normalization Updates
    7.2.2 Layer Input Space Hyperplanes and Partition
    7.2.3 Translating the Hyperplanes
  7.3 Multiple Layer Analysis: Following the Data Manifold
    7.3.1 Deep Network Partition and Boundaries
    7.3.2 Interpreting Each Batch-Normalization Parameter
    7.3.3 Experiments: Batch-Normalization Focuses the Partition onto the Data
  7.4 Where is the Decision Boundary
    7.4.1 Batch-Normalization is a Smart Initialization
    7.4.2 Experiments: Batch-Normalization Initialization Jump-Starts Training
  7.5 The Role of the Batch-Normalization Learnable Parameters
  7.6 Batch-Normalization Noisiness
  7.7 Discussions

8 Insights Into (Smooth) Deep Networks Nonlinearities
  8.1 Introduction
  8.2 Max-Affine Splines meet Gaussian Mixture Models
    8.2.1 From MASO to GMM via K-Means
    8.2.2 Hard-VQ Inference
    8.2.3 Soft-VQ Inference
    8.2.4 Soft-VQ MASO Nonlinearities
  8.3 Hybrid Hard/Soft Inference via Entropy Regularization
  8.4 Discussions

A Insights into Generative Networks
  A.1 Architecture Details
  A.2 Proofs
    A.2.1 Proof of Thm 4.1
    A.2.2 Proof of Proposition 4.1
    A.2.3 Proof of Proposition 4.2
    A.2.4 Proof of Theorem 4.2
    A.2.5 Proof of Theorem 4.3

B Expectation Maximization Training of Deep Generative Networks
  B.1 Computing the Latent Space Partition
  B.2 Analytical Moments for Truncated Gaussian
  B.3 Implementation Details
  B.4 Algorithms
  B.5 Proofs
    B.5.1 Proof of Lemma 5.1
    B.5.2 Proof of Proposition 5.1
    B.5.3 Proof of Theorem 5.1
    B.5.4 Proof of Lemma 5.2
    B.5.5 Proof of Moments
  B.6 Proof of EM-step
    B.6.1 E-step Derivation
    B.6.2 Proof of M-step
  B.7 Regularization
  B.8 Computational Complexity
  B.9 Additional Experiments

C Deep Network Pruning
  C.1 Additional Results on Initialization and Pruning
    C.1.1 Winning Tickets and Overparameterization
    C.1.2 Additional Results for Layerwise Pretraining
  C.2 Additional Early-Bird Visualizations
    C.2.1 Early-Bird Visualization for VGG-16 and PreResNet-101
  C.3 Additional Experimental Details and Results
    C.3.1 Experiments Settings
    C.3.2 Additional Results of Our Global Spline Pruning
    C.3.3 Ablation Studies of Our Spline Pruning Method

D Batch-Normalization
  D.1 Proofs
    D.1.1 Proof of Theorem 7.1
    D.1.2 Proof of Corollary 7.1
    D.1.3 Proof of Theorem 7.2
    D.1.4 Proof of Proposition 7.1

Illustrations

3.1 Two equivalent representations of a power diagram (PD). Top: The grey circles have centers $[\mu]_{k,:}$ and radii $[\mathrm{rad}]_k$; each point $x$ is assigned to a specific region/cell according to the Laguerre distance from the centers, which is defined as the length of the segment tangent to and starting on the circle and reaching $x$. Bottom: A PD in $\mathbb{R}^D$ (here $D = 2$) is constructed by lifting the centroids $[\mu]_{k,:}$ up into an additional dimension in $\mathbb{R}^{D+1}$ by the distance $[\mathrm{rad}]_k$ and then finding the Voronoi diagram (VD) of the augmented centroids $([\mu]_{k,:}, [\mathrm{rad}]_k)$ in $\mathbb{R}^{D+1}$. The intersection of this higher-dimensional VD with the originating space $\mathbb{R}^D$ yields the PD.

3.2 Visual depiction of the subdivision process that occurs when a deeper layer $\ell$ refines/subdivides the already-built up-to-layer-$(\ell-1)$ partition $\Omega^{(1,\dots,\ell-1)}$. We depict here a toy model (2-layer DN) with 3 units at the first layer (leading to 4 regions) and 8 units at the second layer, with random weights and biases. The colors show the DN input space partitioning with respect to the first layer. Then, for each color (or region), the composition of layer 1 and layer 2 defines a specific PD that subdivides this region (first row), where the region is colored and the PD is depicted for the whole input space. This subdivision is then applied onto the corresponding first-layer region only, as it only subdivides its own region (second row, right). Finally, grouping this process over each of the 4 regions, we obtain the layer 1-layer 2 input space partitioning (second row, left).

4.1 Visual depiction of Thm. 4.1 with a (random) generator $G: \mathbb{R}^2 \mapsto \mathbb{R}^3$. Left: generator input space partition $\Omega$ made of polytopal regions. Right: generator image $\mathrm{Im}(G)$, which is a continuous piecewise affine surface composed of the polytopes obtained by affinely transforming the polytopes from the input space partition (left); the colors are per-region and correspond between the left and right plots. This input-space-partition / generator-image / per-region-affine-mapping relation holds for any architecture employing piecewise affine activation functions. Understanding each of the three brings insights into the others, as we demonstrate in this chapter.

4.2 The columns represent different widths $D_\ell \in \{6, 8, 16, 32\}$ and the rows correspond to repetitions of the learning for different random initializations of the DGNs with consecutive seeds.

4.3 Histograms of the DGN adjacent region angles for DGNs with two hidden layers, $S = 16$ and $D = 17$, $D = 32$ respectively, and varying width $D_\ell$ on the y-axis. Three trends to observe: increasing the width increases the bimodality of the distribution while favoring near-0 angles; increasing the output space dimension increases the number of angles near orthogonal; the $A_\omega$ and $A_{\omega'}$ of adjacent regions $\omega$ and $\omega'$ are highly similar, making most angles smaller than if they were independent (depicted in blue).

4.4 DGN with dropout trained (GAN) on a circle dataset (blue dots); dropout turns a DGN into an ensemble of DGNs (each dropout realization is drawn in a different color).

4.5 Impact of dropout and dropconnect on the intrinsic dimension of the noise-induced generators for two "drop" probabilities 0.1 and 0.3 and for a generator $G$ with $S = 6$, $D = 10$, $L = 3$, with varying width $D_1 = D_2$ ranging from 6 to 48 (x-axis). The boxplot represents the distribution of the per-region intrinsic dimensions over 2000 sampled regions and 2000 different noise realizations. Recall that the intrinsic dimension is upper bounded by $S = 6$ in this case. Two key observations: first, dropconnect tends to produce DGNs with intrinsic dimension preserving the latent dimension ($S = 6$) even for narrow models ($D_1, D_2 \approx S$), as opposed to dropout, which tends to produce DGNs with much smaller intrinsic dimension than $S$. As a result, if the DGN is much wider than $S$, both techniques can be used, while in narrow models either none or dropconnect should be preferred.

4.6 Deep Autoencoder experiment when equipping the DGN (decoder) with dropout, where we employ an MLP with $S = D_1 = D_2 = 32$ and $D_3 = D_4 = 1024$, $D_5 = D$; the test set reconstruction error is displayed for multiple datasets and training settings. The architecture purposefully maintains a narrow width for the first two layers to highlight that in those cases dropout is detrimental regardless of the dropout rate. We compare applying dropout to all layers (black line) versus applying dropout only on the last two (wide) layers (blue line). We see that unless the dropout rate is adapted to the layer width and desired intrinsic dimension, the test set performance is negatively impacted by dropout. The exact rate reaching the best test set performance when employing dropout only for the wide layers is shown with a green arrow. The exact values for each graph are given in Table 4.1.

4.7 Probability (0: blue, 1: red) that dropout maintains the intrinsic dimension (red line, left: 32, right: 64) as a function of the dropout rate (x-axis) and the layer's width (y-axis), with the 95% and 99% lines in solid black and dashed black respectively. We see that when the layer's width is close to the desired intrinsic dimension, no dropout should be applied, and that for a dropout rate of 0.5 the layer must be at least two times wider than the desired intrinsic dimension.

4.8 Visualization of single basis vectors $[A_\omega]_{\cdot,k}$ before and after learning, obtained from a region $\omega$ containing the digits 7, 5, 9, and 0 respectively per column, for GAN and VAE models made of fully connected or convolutional layers. We observe how those basis vectors encode right rotation, cedilla extension, left rotation, and upward translation respectively; studying the columns of $A_\omega$ provides interpretability into the learned DGN affine parameters and the underlying data manifold.

4.9 Test set reconstruction error (y-axis) during training for each epoch (x-axis) for a baseline unconstrained Deep AutoEncoder (black line) and for the tangent space regularized DGN (decoder) from (4.6) with varying regularization coefficient $\lambda$ (colored lines), for three datasets (per column) and with $S = 128$, $T = 16$ (top) and $S = 32$, $T = 16$ (bottom). We observe that by constraining the tangent space basis $A_\omega$ to span the data tangent space for each region $\omega$ containing training samples, the manifold fitting is improved, leading to better test sample reconstruction.

4.10 Distribution of the per-region log-determinants (bottom row) for DGNs trained on a data distribution with varying per-mode variance (blue points, first row). The estimated data distribution is depicted through the red samples. We clearly observe the tight relationship between the multimodality and Shannon entropy of the data distribution to be approximated and the distribution of the per-region determinant of $A_\omega$. That is, as the DGN tries to approximate a data distribution with high multimodality and low Shannon entropy, the per-region slope matrices $A_\omega$ have increasing singular values, in turn synonymous with exploding per-layer weights and thus training instabilities (recall Thm. 4.1).

4.11 Distribution of $\log\!\big(\sqrt{\det(A_\omega^T A_\omega)}\big)$ for 2000 regions $\omega$ of a DGN with $L = 3$, $S = 6$, $D = 10$ and weights initialized with Xavier; then half of the weights' coefficients (picked randomly) are rescaled by $\sigma_1$ and the other half by $\sigma_2$. We observe that greater variance of the weights increases both the spread and the mean of the log-determinant distribution.

5.1 Recursive partition discovery for a DGN with $S = 2$ and $L = 2$, starting with an initial region obtained from a sampled latent vector $z$ (init). By walking on the faces of this region, neighboring regions sharing a common face are discovered (Step 1). Recursively repeating this process until no new region is discovered (Steps 2-4) provides the DGN latent space partition at left.

5.2 Triangulation $\mathcal{T}(\omega)$ as per (5.7) of a polytopal region $\omega$ (left plot), obtained from the Delaunay triangulation of the region vertices and leading to 3 simplices (three right plots).

5.3 Left: Noiseless generated samples $g(z)$ in red and noisy samples $g(z) + \epsilon$ in blue, with $\Sigma_x = 0.1 I$, $\Sigma_z = I$. Middle: marginal distribution $p(x)$ from (5.3). Right: the posterior distribution $p(z|x)$ from (5.4) (blue), its expectation (green) and the position of the region limits (black), with the sample point $x$ depicted in black in the left figure.

5.4 DGN training under EM (black) and VAE training with various learning rates (blue: 0.005, red: 0.001, green: 0.0001). In all cases, the VAE converges to the maximum of its ELBO. The gap between the VAE and EM curves is due to the inability of the VAE's AVI to correctly estimate the true posterior, pushing the VAE's ELBO far from the true log-likelihood (recall (5.1)) and thus preventing it from precisely approximating the true data distribution.

5.5 KL-divergence between a VAE variational distribution and the true DGN posterior when trained on a noisy 2D circle dataset for 3 different learning rates. During learning, the DGN adapts such that $g(z) + \epsilon$ models the data distribution based on the VAE's estimated ELBO. As learning progresses, the true DGN posterior becomes harder to approximate by the VAE's variational distribution in the AVI process. As such, even on this toy dataset, the commonly employed Gaussian variational distribution is not rich enough to capture the multimodality of $p(z|x)$ from (5.4).

5.6 EM training of a DGN with latent dimension 1. We show only the generated continuous piecewise affine manifold $g(z)$, without the additional white noise $\epsilon$. We see how EM training of the DGN is able to fit the dataset, while the VAE (with different learning rates (LR)) suffers from hyperparameter sensitivity and slow convergence. Training details and additional figures for this experiment are provided in Appendix B.9.

5.7 Reprise of Fig. 5.6 for MNIST data restricted to the digit 4, employing a 3-layer DGN with latent dimension 1. Training details and additional figures for this experiment are provided in Appendix B.9.

6.1 K-means experiments on a toy mixture of 64 Gaussians in 2D, where in all cases the number of final clusters is 64 but the number of starting clusters (x-axis) varies, and pruning is applied during training to remove redundant centroids, comparing random centroid initialization and kmeans++. With overparametrization, random initialization plus pruning reaches the same accuracy as kmeans++.

6.2 (a) Difference between node and weight pruning, where the former removes entire subdivision lines while the latter simply quantizes those partition lines to be colinear to the space axes. (b) Toy classification task pruning, where the blue lines represent subdivisions in the first layer and the red lines denote the last layer's decision boundary. We see that: 1) pruning indeed removes redundant subdivision lines so that the decision boundary remains an X-shape until 80% of nodes are pruned; and 2) ideally, one blue subdivision line would be sufficient to provide two turning points for the decision boundary, e.g., the visualization at 80% sparsity, but the classification accuracy degrades a lot if pruned further. That aligns with the initialization dilemma for small DNs, i.e., blue lines are not well initialized and all lines remain hard to train. (c) MNIST reproduction of (b), where to produce these visuals we choose two images from different classes to obtain a 2-dimensional slice of the 784-dimensional input space (grid depicted on the left). We thus obtain a low-dimensional depiction of the subdivision lines, shown in blue for the first layer, green for the second convolutional layer, and red for the decision boundary of 6 vs. 9 (based on the left grid). The observation consistently shows that only parts of the subdivision lines are useful for the decision boundary, and the goal of pruning is to remove the redundant subdivision lines.

6.3 Spline trajectories during training, visualizing the Early-Bird (EB) phenomenon, which can be leveraged to largely reduce the training costs of costly overparametrized DNs. The trajectories mainly adapt during the early phase of training.

6.4 We depict on the left a small ($L = 2$, $D_1 = 5$, $D_2 = 8$) DN input space partition, with layer 1 trajectories in black and layer 2 in blue. In the middle is the measure from Eq. (6.1) finding similar "partition trajectories" from layer 2 seen in the DN input space (comparing the green trajectory to the others, with coloring based on the induced similarity from dark to light). Based on this measure, pruning can be performed to remove the grouped partition trajectories and obtain the pruned partition on the right.

7.1 Depiction, for a 5-layer DN with 6 units per layer, of the impact of BN (with statistics computed from all samples) on the position and shape of the up-to-layer-$\ell$ input space partition $\Omega_{1|\ell}$; in blue are the newly introduced boundaries from the current layer, in grey the existing boundaries. The absence of BN (top row) leaves the partition random and unaware of the data samples, while BN (bottom row) positions and focuses the partition onto the data samples (while all other parameters are left identical); as per Thm. 7.1, BN minimizes the distances between the boundaries and the data samples.

7.2 Depiction of the layer (left) and DN (right) input space partition with $L = 2$, $D^{(1)} = 2$, $D^{(2)} = 2$. The partition boundaries of a layer in its input space correspond to the hyperplanes $\mathcal{H}^{(\ell,k)}$ (7.5); for deeper layers, viewing $\mathcal{H}^{(\ell,k)}$ in the DN input space leads to the paths $\mathcal{P}^{(\ell,k)}$ (7.13).

7.3 Depiction of $\mathcal{P}^{(\ell,k)}$, $\ell = 1, 4$, where for each $\ell$, $\mathcal{P}^{(\ell,k)}$ is colored based on $[\sigma^{(\ell)}]_k / \|[W^{(\ell)}]_{k,\cdot}\|_2$ (blue: smallest, green: highest). As per Thms. 7.1 and 7.2, the bluer colored paths are the ones closer to the dataset (black dots), allowing interpretability of the $\sigma^{(\ell)}$ parameter as the fitness between $\mathcal{P}^{(\ell,k)}$ and the mini-batch samples.

7.4 This figure reproduces the experiment from Fig. 7.1 with a more complex (2-D) input dataset (left) and a much wider DN with $D^{(\ell)} = 1024$ and $L = 11$. We depict, for some layers, the boundaries of the layer partitions seen in the DN input space ($\partial\Omega_0^{(\ell)}$, recall 7.13) for DNs with different initializations: random slopes and biases (random), random slopes and zero biases (zero), or the scaling of the slopes and the biases initialized from the BN statistics $\mu_{\mathrm{BN}}^{(\ell)}$ and $\sigma_{\mathrm{BN}}^{(\ell)}$ from 7.3 (BN). The overlap of multiple partition boundaries induces a darker color, demonstrating the presence of more partition boundaries at each spatial location. Clearly, BN concentrates the partition boundaries onto the data samples.

7.5 Average number of regions of the DN partition $\Omega$ in an $\epsilon$-ball around 100 CIFAR images (left) and 100 random images (right) for a CNN, demonstrating that BN adapts the partition to the data samples. The weight initialization (random, zero, BN) follows Fig. 7.4. Additional datasets and architectures are given in the Appendix, showing the same result.

7.6 Image classification with different architectures on SVHN and CIFAR10/100. In all cases no BN is used during training; the initialization of the weights is either random (black) or random with fixed BN parameters $\mu_{\mathrm{BN}}^{(\ell)}, \sigma_{\mathrm{BN}}^{(\ell)}, \forall \ell$ (blue). That is, the BN parameters are found as per the BN strategy in a pretraining phase, and then those parameters are frozen (all other parameters remain at their random initialization). Then training starts and the random parameters are tuned based on the loss at hand. We can see that BN initialization (again, no BN is used during training) is beneficial to reach better accuracy, effectively showing that BN initialization alone plays a crucial role for DNs. In most cases, the DN that does not leverage the BN initialization diverges altogether.

7.7 Decision boundary realisations obtained for different batches on a 2-dimensional binary classification task. Each mini-batch (of size $B$) produces a different DN decision boundary based on the realisations of the random variables $\mu_{\mathrm{BN}}^{(\ell)}, \sigma_{\mathrm{BN}}^{(\ell)}$ (recall 7.17, 7.18). The variance of those r.v. depends on $B$, as seen in the figure. We depict those realisations at initialization (left) and after learning (right) for $B = 16, 256$, the latter producing smaller variance in the decision boundaries.

8.1 For the MASO parameters $A^{(\ell)}, B^{(\ell)}$ for which HVQ yields the ReLU, absolute value, and an arbitrary convex activation function, we explore how changing $\beta$ in the $\beta$-VQ alters the induced activation function. Solid black: HVQ ($\beta = 1$); dashed black: SVQ ($\beta = \frac{1}{2}$); red: $\beta$-VQ ($\beta \in [0.1, 0.9]$). Interestingly, note how some of the functions are nonconvex.

B.1 Sample of noisy data from the wave dataset.

B.2 Depiction of the evolution of the NLL during training for the EM and VAE algorithms. We can see that, despite the high number of training steps, the VAEs are not yet able to correctly approximate the data distribution, as opposed to EM training which benefits from much faster convergence. We also see how the VAEs tend to have a large KL divergence between the true posterior and the variational estimate; due to this gap, we depict below samples from those models.

B.3 Samples from the various models trained on the wave dataset. We can see on top the result of EM training, where each column represents a different run; the remaining three rows correspond to VAE training. Again, EM demonstrates much faster convergence; for the VAE to reach the actual data distribution, many more updates are needed.

B.4 Evolution of the true data negative log-likelihood (in semilog-y scale) on MNIST (class 4) for EM and VAE training of a small DGN as described above. The experiments are repeated multiple times; we can see how the learning rate clearly impacts the learning significantly despite the use of Adam, and that even with the large learning rate, EM learning is able to reach a lower NLL. In fact, the quality of the generated samples of the EM models is much higher, as shown below.

B.5 Random samples from DGNs trained with EM or VAEs on an MNIST experiment (with digit 4). We see the ability of EM training to produce realistic and diversified samples despite using a latent space dimension of 1 and a small generative network.

C.1 Depiction of the dataset used for the K-means experiment with 64 centroids.

C.2 Left: Depiction of a simple (toy) univariate regression task with the target function being a sawtooth with two peaks. Right: The $\ell_2$ training error (y-axis) as a function of the width of the DN layer (2 layers in total). In theory, only 4 units are required to perfectly solve the task at hand with a ReLU layer; however, we see that optimization in narrow DNs is difficult and gradient-based learning fails to find the correct layer parameters. As the width is increased, the difficulty of the optimization problem reduces and SGD manages to find a good set of parameters solving the regression task.

C.3 Accuracy vs. efficiency trade-offs of lottery initialization and layerwise pretraining.

C.4 Illustrating the spline Early-Bird tickets in VGG-16 and PreResNet-101.

C.5 Ablation studies of the hyperparameter $\rho$ in our spline pruning method on two commonly used models, VGG-16 and PreResNet-101.

Tables

4.1 Test set reconstruction error for varying dropout rates as displayed in Fig. 4.6, for different datasets, and when applying dropout on all layers or only on wide enough layers. We see that it is crucial to adapt the dropout rate to the layer width, as otherwise the test error only increases when employing dropout.

4.2 Test set reconstruction error averaged over 3 runs when employing the tangent space regularization (4.6) on various datasets with a Deep Autoencoder, when varying the weight of the regularization term ($\lambda$), the latent space dimension ($S$), and the number of neighbors used to estimate the data tangent space ($T$). We see that the proposed regularization effectively improves generalization performance in all cases, even for complicated and high-dimensional datasets such as CIFAR10 where the data tangent space estimation becomes more challenging. This also demonstrates that DGNs trained only to reconstruct the data samples do not align correctly with the underlying data manifold tangent space.

6.1 Accuracies of layerwise (LW) pretraining, structured pruning with random and lottery ticket initialization.

6.2 Evaluating the proposed layerwise spline pruning over SOTA pruning methods on CIFAR-100.

6.3 Evaluating the proposed global spline pruning over SOTA pruning methods on ImageNet.

C.1 Evaluating our global spline pruning method over SOTA methods on CIFAR-10/100 datasets. Note that "Spline Improv." denotes the improvement of our spline pruning (w/ or w/o EB) as compared to the strongest baselines.

NOTATIONS

The entire thesis follows these notations. A scalar is always represented in lower case and in standard font weight, as $a$. A vector is always represented in lower case and in bold font weight, as $\boldsymbol{a}$. A matrix is always represented in upper case and in bold font weight, as $\boldsymbol{A}$. A function producing a scalar or a vector output is expressed in upper case and in standard font weight, as $F$. A superscript surrounded by parentheses on any parameter/function is an indexing and does not correspond to taking the power of the output; e.g., $F^{(4)}$ can represent the fourth mapping and is not to be understood as $F^4$. Lastly, accessing a specific dimension of a vector, matrix or tensor is achieved through the $[\cdot]$ operator, as in $[\boldsymbol{a}]_k$ for the $k^{th}$ dimension of a vector, $[\boldsymbol{A}]_{k,d}$ for the $d^{th}$ entry of the $k^{th}$ row of a matrix, and so on.

Chapter 1

Introduction

Deep learning has significantly advanced our ability to address a wide range of difficult machine learning and signal processing problems. Today's machine learning landscape is dominated by deep (neural) networks (DNs), which are compositions of a large number of simple parametrized linear and nonlinear operators. In this thesis, we build a bridge between DNs and spline functions and operators. We prove that a large class of DNs, including convolutional neural networks (CNNs) [LeCun, 1998], residual networks (ResNets) [He et al., 2015b], skip connection networks [Srivastava et al., 2015], fully connected networks [Pal and Mitra, 1992], recurrent neural networks (RNNs) [Graves, 2013], scattering networks [Bruna and Mallat, 2013], inception networks [Szegedy et al., 2017], and more, can be written as spline operators. In fact, we prove that any DN employing current standard-practice piecewise affine and convex nonlinearities (e.g., ReLU, absolute value, max-pooling, etc.) can be written as a composition of max-affine spline operators (MASOs), which are an extension of max-affine splines [Magnani and Boyd, 2009, Hannah and Dunson, 2013]. The max-affine spline connection provides a powerful portal through which to view and analyze the inner workings of a DN using tools from approximation theory, functional analysis and computational geometry. The goal of this thesis is to thoroughly adapt max-affine spline insights to deep networks, to derive direct theoretical results from this formulation, and to provide insights and practical guidance for deep learning practitioners and researchers.

1.1 Motivation

Deep learning is increasingly becoming the backbone of our society, powering novel industries and finding its way into applications such as self-driving cars, drug discovery, renewable energies, space exploration and law enforcement. An all-too-familiar story of late is that of plugging a DN into an application as a black box, learning its parameter values using copious training data, and then significantly improving performance over classical task-specific approaches.

Despite this empirical progress, the precise mechanisms by which deep learning works so well remain open to questioning, adding an air of mystery to the entire field.

This pitfall becomes increasingly problematic as DNs are deployed in our society and many systems now rely exclusively on such models. Beyond the interpretability of the prediction/decision making, which is lacking in DNs, one important issue lies in the safety of those models. It has been demonstrated, for example, how deployed models such as copyright infringement detectors, identity recognition models, and speech recognition models can be manipulated by any third-party agent through noise injection in the data [Saadatpanah et al., 2020, Goldblum et al., 2020, Cherepanova et al., 2021]. As a result, it is crucial to increase our theoretical and practical understanding of DNs, in particular in a way that allows practitioners to better design and control those powerful methods. In a recent turn of events, most, if not all, currently employed models have had their architecture altered, through trial and error, to finally become what they are today: affine spline functions. Through the rich theory of splines, we will demonstrate how to study DNs from this viewpoint.

This thesis is organized as follows. First, we review max-affine splines in all generality, as those convex, piecewise affine splines are the backbone of this thesis (Chap. 2). The core novel results consist in the reformulation of DNs as max-affine splines and in leveraging this form to derive a direct result on the characterization of the DN input space partition (Chap. 3). Following this, we propose different facets of results that are direct consequences of this formulation: we study Deep Generative Networks and the geometry of the manifold that they span (Chap. 4). Those results apply to many frameworks such as Generative Adversarial Networks, Variational Autoencoders, and Autoencoders, and provide practitioners with insights into architecture design and techniques such as dropout and dropconnect. Second, we directly exploit the spline formulation and the result on the DN partition to derive novel strategies to learn Deep Generative Networks via Expectation-Maximization (Chap. 5). This chapter closes the study of Deep Generative Networks. Third, we study Deep Network pruning, which consists of removing nodes/weights from an architecture with the hope of maintaining high performance while reducing the model complexity (Chap. 6). By leveraging the spline viewpoint, it is possible to obtain geometrical insights and to derive novel and motivated pruning solutions. Fourth, we dive into Batch-Normalization (Chap. 7). Batch-Normalization is one of the most popular techniques to speed up and stabilize DN training. Through the understanding of the DN partition, novel results and explanations of Batch-Normalization become possible, concluding that this technique concentrates the DN regions around the data samples and thus helps training by acting on the DN partition. Fifth and lastly, we conclude this thesis by demonstrating how the insights and results drawn throughout the above chapters can be extended to DNs with smooth nonlinearities, by allowing the region assignment of the max-affine splines to be probabilistic (Chap. 8). This process is very similar to the ability of Gaussian Mixture Models to produce a probability that an input belongs to a specific region, i.e., cluster, as opposed to K-means, which produces a yes/no region membership value. Most of the chapters rely on conference papers that are cited as part of the corresponding chapter's introduction. Proofs are provided in multiple appendices, divided per chapter. Proofs that are short in length are put directly in the main document.

1.2 Deep Networks

We now introduce deep (neural) networks (DNs): nonlinear functions formed by a composition of layers, each layer performing a simple (possibly constrained) affine transformation of its input followed by a nonlinearity. The success of DNs on challenging computer vision tasks goes back at least as far as LeCun et al. [1995b] for handwritten digit classification. A typical DN F that employs L layers is expressed as

$$F(x) = \left(F^{(L)} \circ \cdots \circ F^{(1)}\right)(x), \qquad (1.1)$$

where each function $F^{(\ell)}: \mathbb{R}^{D^{(\ell-1)}} \mapsto \mathbb{R}^{D^{(\ell)}}$ maps its input $z_x^{(\ell-1)}$, a feature map, to an output feature map $z_x^{(\ell)}$, with the initialization $z_x^{(0)} \triangleq x$. We thus have

$$z_x^{(\ell)} = \left(F^{(\ell)} \circ \cdots \circ F^{(1)}\right)(x).$$

Different DNs, such as CNNs [LeCun, 1998], Residual Networks [He et al., 2015b], and Densenets [Huang et al., 2017], simply correspond to DNs in which the organization and the types of layers are specified explicitly. Some layers operate on feature maps with specific shapes, such as 3-dimensional tensors corresponding to multi-channel images. In any case, it is possible to consider the flattened version of such tensors and adapt the layer operations accordingly. To streamline our development we will thus always consider feature maps to be vectors. We describe below the basic operators that form any current DN layer and review how DN training is done, i.e., how the per-layer weights are tuned in order to produce a desired DN. For a complete survey we refer the reader to Goodfellow et al. [2016].
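To make the composition (1.1) concrete, the following minimal NumPy sketch builds a toy DN out of dense layers followed by an elementwise ReLU; the widths, the random initialization, and the choice of nonlinearity are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_layer(W, b, sigma):
    """Return F(z) = sigma(W z + b), one layer of the composition in (1.1)."""
    return lambda z: sigma(W @ z + b)

relu = lambda u: np.maximum(0.0, u)

# Hypothetical widths D^(0)=4, D^(1)=8, D^(2)=8, D^(3)=2.
widths = [4, 8, 8, 2]
layers = []
for d_in, d_out in zip(widths[:-1], widths[1:]):
    W = rng.normal(scale=1.0 / np.sqrt(d_in), size=(d_out, d_in))
    b = np.zeros(d_out)
    layers.append(dense_layer(W, b, relu))

def network(x):
    """F(x) = (F^(L) o ... o F^(1))(x): feed the feature map through each layer."""
    z = x                  # z_x^(0) = x
    for F in layers:
        z = F(z)           # z_x^(l) = F^(l)(z_x^(l-1))
    return z

x = rng.normal(size=widths[0])
print(network(x))          # final feature map z_x^(L)
```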

1.2.1 Layers

A DN layer, as employed in (1.1) to form the final prediction, is itself internally composed of a few simple operators. Different types of layers can be obtained by combining those simple operators adequately; in turn, different types of layers and layer organizations produce different types of DNs. To remain as general as possible, we thus propose to first review the main operators used in today's layers.

Dense operator. A dense operator, oftentimes referred to as a fully-connected operator, performs an affine transformation of a given input $x$ as in

$$Wx + b,$$

where $x$ is the considered input, $W$ is a dense/full matrix, and $b$ is a bias vector. This operator, combined with the activation operator (described below), forms the layers employed in the first generation of DNs: multilayer perceptrons [Rosenblatt, 1961].

Current DNs often employ the dense operator within their last layers only and prefer more constrained operators such as the convolution operator for the first layers.

Convolution operator. A convolution operator transforms its input via

$$Cx + b,$$

where a special structure is imposed on the matrix $C$ so that it performs multi-channel convolutions on the input $x$. Similarly, the bias vector $b$ is often constrained to have the same entries across different dimensions. Special cases include the use of $1 \times 1$ convolutional filters [Kingma and Dhariwal, 2018], in which case $C$ is made of multiple blocks, each being a diagonal matrix. Convolution operators are at the origin of the performance gains observed in computer vision tasks starting with the LeNet architecture [LeCun et al., 1989]. Even the most recent state-of-the-art DNs employ at some point convolution operators combined with an activation or a pooling operator.
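As a sanity check of the claim that a convolution is an affine transformation $Cx + b$ with a structured matrix $C$, here is a minimal sketch (single channel, 1-D, "valid" boundary handling, all chosen purely for illustration) that builds $C$ explicitly and compares the result against NumPy's convolution routine.

```python
import numpy as np

def conv_matrix(h, D):
    """Structured (Toeplitz) matrix C such that C @ x equals the 'valid'
    correlation of a length-D input x with the length-R filter h."""
    R = len(h)
    K = D - R + 1                        # number of output dimensions
    C = np.zeros((K, D))
    for k in range(K):
        C[k, k:k + R] = h                # each row is the filter, shifted by one
    return C

rng = np.random.default_rng(1)
x = rng.normal(size=10)
h = np.array([1.0, -2.0, 0.5])
b = 0.1                                  # shared bias entry across output dimensions

C = conv_matrix(h, len(x))
out_matrix = C @ x + b
out_direct = np.convolve(x, h[::-1], mode="valid") + b   # same correlation via convolve
print(np.allclose(out_matrix, out_direct))               # True
```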

Pooling operator. A pooling operator is a sub-sampling operation applied on an input according to a sub-sampling policy $\rho$ and, for each output dimension, a collection of input dimensions that $\rho$ must consider in order to produce that output dimension. Formally, for each output dimension $k = 1, \dots, K$, we denote this collection of dimensions as $\mathcal{R}_k \in \{1, \dots, D\}^R$, where $D$ is the dimension of the input and $R > 1$ is the number of input dimensions to apply $\rho$ onto. In our case we assume the same $R$ for each $\mathcal{R}_k$, but generalizing this is straightforward. We thus obtain the following input-output mapping for the pooling operator:

$$\begin{bmatrix}
\rho\left([x]_{[\mathcal{R}_1]_1}, [x]_{[\mathcal{R}_1]_2}, \dots, [x]_{[\mathcal{R}_1]_R}\right)\\
\rho\left([x]_{[\mathcal{R}_2]_1}, [x]_{[\mathcal{R}_2]_2}, \dots, [x]_{[\mathcal{R}_2]_R}\right)\\
\vdots\\
\rho\left([x]_{[\mathcal{R}_K]_1}, [x]_{[\mathcal{R}_K]_2}, \dots, [x]_{[\mathcal{R}_K]_R}\right)
\end{bmatrix}.$$

We consider here that the pooling operator reduces the input dimensionality (R > 1).

Often the pooling operator $\rho$ is the max operator, as was originally proposed. However, the average also produces successful layers that have been used in many DNs. More complex functions include softmax pooling [Murray and Perronnin, 2014]. When the indices $\mathcal{R}_k$ include all the input dimensions, the pooling operator is referred to as global. As opposed to the dense and convolution operators, the pooling operator is most often nonlinear. Another popular nonlinear operator is the activation operator, to which we now turn.
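The pooling mapping above amounts to applying $\rho$ over each index collection $\mathcal{R}_k$; the sketch below does exactly that for an assumed choice of non-overlapping collections and for $\rho$ being either the max or the average.

```python
import numpy as np

def pool(x, regions, rho=np.max):
    """Apply the pooling policy rho over each collection of input dimensions R_k.
    `regions` is a list of K index tuples, one per output dimension."""
    return np.array([rho(x[list(R_k)]) for R_k in regions])

x = np.array([3.0, -1.0, 4.0, 1.0, -5.0, 9.0])

# Non-overlapping collections R_1=(0,1), R_2=(2,3), R_3=(4,5), so R = 2 and K = 3.
regions = [(0, 1), (2, 3), (4, 5)]

print(pool(x, regions, rho=np.max))    # max-pooling:     [3.  4.  9.]
print(pool(x, regions, rho=np.mean))   # average-pooling: [1.  2.5 2.]
```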

Activation operator. The activation operator applies a scalar nonlinearity $\sigma$ to each dimension of its input as in

$$\begin{bmatrix}
\sigma([x]_1)\\
\sigma([x]_2)\\
\vdots\\
\sigma([x]_D)
\end{bmatrix},$$

which we will abbreviate as simply $\sigma(x)$, where $\sigma$ should be understood as being applied elementwise. The first most popular activation functions were the sigmoid $\sigma(u) = \frac{1}{1+\exp(-u)}$ and the hyperbolic tangent $\sigma(u) = \frac{\exp(u)-\exp(-u)}{\exp(u)+\exp(-u)}$ [Rosenblatt, 1961, Hornik et al., 1989]. While not coined with that name, the ReLU activation $\sigma(u) = \max(0, u)$ emerged from hinging hyperplanes [Breiman, 1993], which can be seen as a layer with a dense operator and an activation operator (ReLU). The official introduction of the ReLU in DNs was done in Glorot et al. [2011]. Variants include the leaky-ReLU $\sigma(u) = \max(\eta u, u)$, $\eta > 0$ [Maas et al., 2013], the absolute value, and the exponential linear unit $\sigma(u) = u$ for $u \geq 0$ and $\sigma(u) = \exp(u) - 1$ for $u < 0$ [Shah et al., 2016].
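For reference, the piecewise affine and convex activations mentioned above (ReLU, leaky-ReLU, absolute value) are one-line elementwise maps, each expressible as a maximum of affine functions; a small sketch follows, where the leaky-ReLU slope $\eta = 0.1$ is an arbitrary illustrative value.

```python
import numpy as np

relu       = lambda u: np.maximum(0.0, u)                 # max(0, u)
leaky_relu = lambda u, eta=0.1: np.maximum(eta * u, u)    # max(eta*u, u), illustrative eta
abs_value  = lambda u: np.maximum(u, -u)                  # max(u, -u) = |u|

u = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(u))         # [0.   0.   0.   1.5]
print(leaky_relu(u))   # [-0.2  -0.05  0.   1.5]
print(abs_value(u))    # [2.   0.5  0.   1.5]
```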

With the few operators described above it is already possible to form most of the current layers employed in DNs. In order to formally define what a layer can include and ensure that the decomposition (1.1) is unique for any known DN, we propose the following definition.

Definition 1.1 (Layer) A layer $F^{(\ell)}$ is made of a single nonlinear operator and all the preceding linear operators (if any).

Some popular layers are the convolutional layer, which comprises a convolution operator and an activation operator, and the maxout layer, which is formed by a convolutional or dense operator and a max-pooling operator. Additionally, any layer can be turned into a residual layer [He et al., 2015b] by adding a linear connection between the layer input and its output. For example, a residual convolutional layer is

$$\sigma(Cx + b) + W^{\mathrm{res}}x + b^{\mathrm{res}},$$

where $W^{\mathrm{res}}$ is often taken to be the identity matrix if the layer output dimension is the same as the input, and $b^{\mathrm{res}}$ is often set to be zero.
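A minimal sketch of the residual layer displayed above, under the stated defaults $W^{\mathrm{res}} = I$ and $b^{\mathrm{res}} = 0$; a generic square matrix stands in for the convolution matrix $C$ purely for brevity.

```python
import numpy as np

def residual_layer(x, C, b, sigma=lambda u: np.maximum(0.0, u),
                   W_res=None, b_res=None):
    """sigma(C x + b) + W_res x + b_res, with the common defaults
    W_res = identity and b_res = 0 when the dimensions match."""
    if W_res is None:
        W_res = np.eye(len(x))
    if b_res is None:
        b_res = np.zeros(len(x))
    return sigma(C @ x + b) + W_res @ x + b_res

rng = np.random.default_rng(2)
D = 6
C = rng.normal(size=(D, D))     # stands in for a (square) convolution matrix
b = np.zeros(D)
x = rng.normal(size=D)
print(residual_layer(x, C, b))
```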

1.2.2 Training

In order to optimize the parameters that govern each layer of the DN, one needs a dataset, a loss function to be minimized that is preferably differentiable, such as the mean squared error [Wang and Bovik, 2009] or the cross-entropy [Kleinbaum et al., 2002], and a parameter update policy/rule such as some flavor of gradient descent [Bottou, 2010].

A dataset is a collection of observations, which can be a set of inputs (unsupervised), a set of input-output pairs (supervised), or a mix of both (semi-supervised). For concreteness, let's consider the supervised case here and denote the dataset as $\mathcal{D} = \{(x_n, y_n), n = 1, \dots, N\}$. Commonly, this dataset is partitioned into three: a training set $\mathcal{D}_{\mathrm{train}}$, a validation set $\mathcal{D}_{\mathrm{val}}$, and a testing set $\mathcal{D}_{\mathrm{test}}$, such that there is no overlap between them and their union gives back $\mathcal{D}$, the entire dataset. The DN parameters are updated based on $\mathcal{D}_{\mathrm{train}}$, the DN hyper-parameters are chosen based on $\mathcal{D}_{\mathrm{val}}$, and finally the out-of-sample performance (also referred to as generalization performance) is estimated based on $\mathcal{D}_{\mathrm{test}}$.

Based on the training set $\mathcal{D}_{\mathrm{train}}$, the chosen loss function, and the weight update policy, one tunes the DN parameters to minimize the loss on this set of observations. Commonly this is done with flavors of gradient descent such as Nesterov momentum [Nesterov], Adadelta [Zeiler, 2012], Adam [Kingma and Ba, 2014], or any variant of those. In fact, all of the operations introduced above for standard DNs are differentiable almost everywhere with respect to their parameters and inputs. As the training set size ($|\mathcal{D}_{\mathrm{train}}|$) is often large, each parameter update is computed after only feeding a mini-batch of $B$ data points sampled from $\mathcal{D}_{\mathrm{train}}$, with cardinality much smaller than the number of training samples ($B \ll |\mathcal{D}_{\mathrm{train}}|$). Mini-batch training, in addition to reducing the amount of computation required to perform a step of parameter update, also provides many benefits from a generalization perspective [Keskar et al., 2016, Masters and Luschi, 2018]. For each mini-batch, the parameter updates are computed for all the network parameters by backpropagation [Hecht-Nielsen, 1992], which follows from applying the chain rule of calculus. Once all the samples in the training set have been observed (after $\frac{|\mathcal{D}_{\mathrm{train}}|}{B}$ mini-batches are sampled without replacement), one epoch is completed. Whenever $B = 1$, the above is denoted as stochastic gradient descent. Usually a network needs hundreds of epochs to converge. Hyperparameters such as the learning rate and early-stopping are tuned based on the performance on $\mathcal{D}_{\mathrm{val}}$. Once training is completed and the best hyper-parameters have been selected, estimates of the DN performance on new, unseen data are obtained on $\mathcal{D}_{\mathrm{test}}$.
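The sketch below walks through the pipeline just described (dataset split, mini-batches of size $B$, one parameter update per mini-batch, epochs) on a deliberately tiny linear least-squares model whose gradient is written out by hand; in an actual DN the gradients would instead be obtained by backpropagation, and all sizes and hyper-parameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic supervised dataset D = {(x_n, y_n)}, then a train/val/test partition.
N, D_in = 600, 5
X = rng.normal(size=(N, D_in))
w_true = rng.normal(size=D_in)
y = X @ w_true + 0.1 * rng.normal(size=N)
X_train, y_train = X[:400], y[:400]
X_val,   y_val   = X[400:500], y[400:500]
X_test,  y_test  = X[500:],    y[500:]

mse = lambda Xs, ys, w: np.mean((Xs @ w - ys) ** 2)

B, lr, n_epochs = 32, 0.05, 30          # mini-batch size, learning rate, epochs
w = np.zeros(D_in)                      # parameters to be learned
for epoch in range(n_epochs):
    perm = rng.permutation(len(X_train))             # mini-batches without replacement
    for start in range(0, len(perm), B):
        idx = perm[start:start + B]
        Xb, yb = X_train[idx], y_train[idx]
        grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)  # analytic MSE gradient
        w -= lr * grad                                # one parameter update per mini-batch
    # validation loss would drive hyper-parameter choices / early stopping

print("val MSE :", mse(X_val, y_val, w))
print("test MSE:", mse(X_test, y_test, w))
```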

1.2.3 Approximation Results

The ability of certain DNs to approximate an arbitrary functional/operator mapping has been well established [Cybenko, 1989, Breiman, 1993]. For completeness, we recall those results, which have been pivotal in the theoretical analysis of DNs.

Theorem 1.1 (Cybenko [1989])

Let $\sigma$ be any bounded, measurable sigmoidal function; then, given any (target function) $f \in C^0([0, 1]^D)$ and any $\epsilon > 0$, there exists a shallow network

$$f_\Theta(x) = \sum_{k=1}^{K} [W^{(2)}]_{k,1}\, \sigma\!\left(\langle [W^{(1)}]_{k,\cdot}, x\rangle + [b^{(1)}]_k\right)$$

such that

$$|f(x) - f_\Theta(x)| < \epsilon, \quad \text{for all } x \in [0, 1]^D.$$

In the above result, a sigmoidal function is defined as a function $\sigma$ that must fulfill $\lim_{t\to-\infty}\sigma(t) = 0$ and $\lim_{t\to\infty}\sigma(t) = 1$, without any monotonicity constraint. This result has been generalized to the case of employing a continuous, bounded and nonconstant activation function $\sigma$ in Hornik [1991], and to Radial Basis Functions in Park and Sandberg [1991]. Those results consider the case of fixed depth and increasing width. The dual of this, considering fixed width and increasing depth, has also led to universal approximation results.

Theorem 1.2 (Lu et al. [2017])

For any Lebesgue-integrable (target function) $F: \mathbb{R}^D \mapsto \mathbb{R}$ and any $\epsilon > 0$, there exists a fully-connected ReLU network $F_\Theta$ with width $\leq D + 4$ such that

$$\int_{\mathbb{R}^D} |F(x) - F_\Theta(x)|\, dx < \epsilon.$$

The same result has been obtained recently for CNNs in Zhou [2020], for Residual Networks in Tabuada and Gharesifard [2020], and for recurrent networks in Doya [1993], Schäfer and Zimmermann [2006]. In the specific case of univariate DNs with continuous piecewise affine activation functions, Daubechies et al. [2019] demonstrated how the special structure that can be reached through depth allows DNs not only to approximate a target function arbitrarily closely as long as the number of layers is large enough, but also to achieve a greater rate of approximation than alternative methods for some particular classes of functions to be approximated.
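As a toy numerical illustration of width-based approximation (in the spirit of Theorem 1.1, but using ReLU units with randomly drawn input weights and a least-squares fit of the output weights, rather than the constructive proofs above), the following sketch shows the approximation error of a shallow network shrinking as the width $K$ grows; the target function and all sampling choices are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda u: np.maximum(0.0, u)

# Target function on [0, 1] and a dense evaluation grid.
f = lambda x: np.sin(2 * np.pi * x) + 0.5 * np.cos(5 * np.pi * x)
x = np.linspace(0.0, 1.0, 2000)

for K in [5, 20, 80, 320]:
    # Shallow network: f_Theta(x) = sum_k [W2]_k relu(W1_k * x + b1_k).
    W1 = rng.normal(size=K)
    b1 = rng.uniform(-1.0, 1.0, size=K)
    Phi = relu(np.outer(x, W1) + b1)                    # hidden features, shape (len(x), K)
    W2, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)     # fit the output weights only
    err = np.max(np.abs(Phi @ W2 - f(x)))
    print(f"width K={K:4d}  max |f - f_Theta| = {err:.4f}")
```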

All the variants of the universal approximation theorem guarantee that one can approximate any reasonable target function with the correct choice of architecture. However, those results do not prescribe how to obtain such a function in practice, i.e., how to learn the underlying DN parameters in a more principled way without resorting to gradient-based optimization. Learning the DN parameters in a way that maximizes generalization performance (or any desired metric) is one of the fundamental questions that remain open. We now thoroughly introduce the rich and powerful class of spline functions in order to build our novel results in the following chapters.

1.3 Related Works

Ongoing directions to build a rigorous mathematical framework allowing one to derive theoretical results and/or insights into DNs fall roughly into six camps; five of them have little to do with spline functions but are presented for completeness and to overview alternative directions. The last direction deals with splines and is the most relevant to our study.

1.3.1 Mathematical Formulations of Deep Networks

Statistical correlation based visualizations. The first successful attempt at providing human-interpretable understanding of DNs' inner workings arose from two techniques: activity maximization [Cadena et al., 2018] and saliency maps [Simonyan et al., 2014, Zeiler and Fergus, 2014, Li and Yu, 2015, Kim et al., 2019]. In the former case, one optimizes a (randomly) initialized input living in the data space such that this input produces some target latent representation in the DN. This can take different forms, such as considering a specific unit in the DN or adding additional constraints. In the latter case, one feeds an input to a DN and leverages the gradient of the DN (a specific unit at a specific layer) with respect to the DN input. This saliency map depicts how sensitive a specific unit of the DN is to each input dimension for a given input. Those solutions provide visuals in the data space that highly correlate with the firing of specific DN units. Such techniques can be coupled with segmentation networks and ground-truth image segmentation labels to obtain an actual 'label' or 'concept' that the DN's units correlate highly with. This has been explored in classifier DNs [Bau et al., 2017] and generative DNs [Bau et al., 2020]. In order to provide extensive guarantees for the interpretations drawn from those visuals, statistical tests and results have been developed [Lin and Lin, 2014, Adebayo et al., 2018]. This development has pioneered our current understanding of the underlying knowledge being learned by DNs.

Optimization and approximation theory. The theoretical understanding of DNs, their approximation power as well as their generalization capacity, is one of the most fundamental questions and has been studied for decades. For example, Cybenko [1989] and Breiman [1993] studied the approximation capacity of shallow networks with sigmoid and ReLU activation functions respectively. Through specific considerations of architectures, finer and finer results have been obtained [Arora et al., 2013, Cohen et al., 2016, Parhi and Nowak, 2021], with results reaching beyond pure approximation capacity and characterizing the loss surface geometry [Lu and Kawaguchi, 2017, Soudry and Hoffer, 2017, Nguyen and Hein, 2017] or the VC-dimension [Harvey et al., 2017] of DNs. Tremendous insights have been gained from those results. For example, it was shown in Daubechies et al. [2019] how residual connections on a carefully designed univariate DN allowed for faster error convergence rates.

Architecture-constrained models. Another line of research considers carefully designed and constrained DN architectures in order to enforce mathematical properties and to allow for theoretical analysis. The first successful attempt at such a model that managed to provide competitive performance is the Extreme Learning Machine (ELM) [Huang et al., 2011, Tang et al., 2015], consisting of DNs with random layer weights for all but the last layer. Due to the absence of training of the layer weights, it was possible to gain insights into the DN decision process, and ELMs opened the door to the integration of continuous constraints like positivity, monotonicity, or bounded curvature in the learned function [Neumann et al., 2013]. The Scattering Network [Mallat, 2012, Bruna and Mallat, 2013] is another carefully designed DN, consisting of a succession of wavelet transforms and (complex) moduli. As opposed to standard DNs, the features of this network are obtained by collecting the average of the per-layer feature maps. Variants include performing local averaging, employing 1-dimensional or 2-dimensional wavelet transforms depending on the nature of the data [Andén et al., 2019], and changing the employed wavelet filter-banks [Lostanlen and Andén, 2016]. Thanks to such a parametrization, it is possible to leverage tools from signal processing [Mallat, 1999] and group theory [Mallat, 2016] to study and interpret the scattering features and thus understand the benefits and mathematical properties of this architecture. The interpretability of this network, in conjunction with the absence of learning, enabled its application in fields such as quantum chemistry [Hirn et al., 2017] or aerial scene understanding [Nadella et al., 2016]. Lastly, some methods have been developed in which only a specific part of the DN is tweaked based on an explicit design that improves analysis or interpretability. The most well-known examples include the group equivariant convolutional networks [Cohen and Welling, 2016], where the per-layer parameters are constrained to have a group structure, and the capsule network [Sabour et al., 2017], which leverages hard-coded 'computer vision' rules in forming its prediction.

Probabilistic generative models. Probabilistic Graphical Models (PGMs) [Koller and Friedman, 2009] are one class of machine learning methods that has always enjoyed ease of interpretation and data modeling. The main reasons behind those benefits are (i) the ability to specify an explicit generative model that governs the data at hand, thus allowing easy integration of a priori knowledge and understanding of the underlying data modeling [Bhattacharya and Cheng, 2015], (ii) the explicit analytical forms to train the model parameters and infer the missing variables in the model [Jordan, 1998], and (iii) the versatility of such models, which can be used to detect outliers, to denoise, and to classify/cluster [Bilmes and Zweig, 2002]. The most successful PGMs include the Gaussian Mixture Model [Xu and Jordan, 1996], the Hidden Markov Model [Rabiner and Juang, 1986] and the Factor Analysis Model [Akaike, 1987]. Motivated by those successes, many recent studies [Yuksel et al., 2012, Patel et al., 2016, Kim and Bengio, 2016, Nie et al., 2018] have modeled the underlying DN mechanisms as a PGM in order to port all the above benefits to deep learning. Those approaches have successfully provided principled solutions to perform semi-supervised and unsupervised tasks, as well as pushing our understanding of DN inner workings from a generative perspective. One limitation that remains to be tackled comes from the now intractable learning solutions. In fact, due to the increase in model complexity that those methods require in order to mimic DNs as closely as possible, the closed-form solutions for training and inference no longer exist, limiting the explainability provided by those methods.

Infinite-width limit. Another recent development leverages the overparametrization regime of modern architectures. In this specific setting, overparametrization is to be understood as a growth (in the limit, to infinity) of the layers' width. This has resulted in DNs preserving generalization performance [Novak et al., 2018, Neyshabur et al., 2019, Belkin et al., 2019] while providing additional convergence guarantees for the optimization problem [Du et al., 2019a, Allen-Zhu et al., 2019a,b, Zou et al., 2018, Arora et al., 2019b]. Going into the infinite-width limit, it became possible to obtain analytical models of such impractical DNs. The most recent infinite-width study showed that the training dynamics of (infinite-width) DNs under gradient flow are captured by a constant kernel called the Neural Tangent Kernel (NTK) that evolves according to an ordinary differential equation (ODE) [Jacot et al., 2018, Lee et al., 2019a, Arora et al., 2019a]. Every DN architecture and parameter initialization produces a distinct analytical NTK. The original NTK was derived from the Multilayer Perceptron [Jacot et al., 2018] and was soon followed by kernels derived from CNNs [Arora et al., 2019a, Yang, 2019], Residual Networks [Huang et al., 2020], and Graph CNNs (GNTK) [Du et al., 2019b]. In [Yang, 2020], a general strategy to obtain the NTK of any architecture is provided. Due to the analytical form of the NTK, this development has led to an entirely new active research area.

Continuous piecewise affine operators. The last research direction, which also relates the most to this thesis, concerns the use of spline functions and operators to study DNs. While not in a deep learning setting, the first study of the approximation capacity of shallow ReLU networks was performed in Breiman [1993] and offered an alternative approximation result to the ones based on sigmoid activation functions from that time. More recently, and mainly due to the popularity of novel activation functions such as leaky-ReLU or absolute value, the use of spline function theory to study DNs has grown exponentially. The first brick was laid by Montufar et al. [2014], where the Continuous Piecewise Affine (CPA) structure of DNs employing such nonlinearities was highlighted. Along with this result, an upper bound on the number of regions of the DN input space partition was derived. Later works continued to refine the bounds on the number of regions [Serra et al., 2018, Hanin and Rolnick, 2019, Serra and Ramalingam, 2020]. Meanwhile, the rich theory of splines, which has been extensively refined in signal processing and function approximation, has made it possible to port many results, such as formulating current DNs from a functional optimization problem [Unser, 2018], studying the piecewise convexity of the DN in the context of optimization [Rister and Rubin, 2017], and obtaining sharp approximation results [Daubechies et al., 2019]. To date, theoretical studies relying on splines have focused on either considering specific topologies or providing theoretical guarantees and bounds on specific properties, such as a DN's approximation capacity for univariate input-output or the number of regions in the DN's input space partition.

1.3.2 Training of Deep Generative Networks.

Deep Generative Networks (DGNs), which map a low-dimensional latent variable z to a higher-dimensional generated sample x, are the state-of-the-art methods for a range of machine learning applications, including anomaly detection, data generation, likelihood estimation, and exploratory analysis across a wide variety of datasets [Blaauw and Bonada, 2016, Inoue et al., 2018, Liu et al., 2018, Lim et al., 2018]. While we propose a thorough geometrical study of DGNs in all generality in Chap. 4, we now go a step further and exploit the composition-of-MASOs formulation to provide a novel training solution. Training of DGNs roughly falls into two camps: (i) leveraging an adversarial network, as in a Generative Adversarial Network (GAN) [Goodfellow et al., 2014], to turn the method into an adversarial game; and (ii) modeling the latent and observed variables as random variables and performing some flavor of likelihood maximization training. A widely used solution to likelihood-based DGN training is via a Variational Autoencoder (VAE) [Kingma and Welling, 2013]. The popularity of the VAE is due to its intuitive and interpretable loss function, which is obtained from likelihood estimation, and its ability to exploit standard estimation techniques ported from the probabilistic graphical models literature. Yet, VAEs offer only an approximate solution for likelihood-based training of DGNs. In fact, all current VAEs employ three major approximation steps in the likelihood maximization process. First, the true (unknown) posterior is approximated by a variational distribution. This estimate is governed by some free parameters that must be optimized to fit the variational distribution to the true posterior. VAEs estimate such parameters by means of an alternative network, the encoder, with the datum as input and the predicted optimal parameters as output. This step is referred to as Amortized Variational Inference (AVI), as it replaces the explicit, per-datum optimization by a single deep network (DN) pass. Second, as in any latent variable model, the complete likelihood is estimated by a lower bound (ELBO) obtained from the expectation of the likelihood taken under the posterior or variational distribution. With a DGN, this expectation is unknown, and thus VAEs estimate the ELBO by Monte-Carlo (MC) sampling. Third, the maximization of the MC-estimated ELBO, which drives the parameters of the DGN to better model the data distribution and the encoder to produce better variational parameter estimates, is performed by some flavor of gradient descent (GD). These VAE approximation steps enable rapid training and test-time inference of DGNs. However, due to the lack of analytical forms for the posterior, the ELBO, and explicit (gradient-free) parameter updates, it is not possible to measure the above steps' quality or effectively improve them. Since the true posterior and expectation are unknown, current VAE research roughly falls into three camps: (i) developing new and more complex output and latent distributions [Nalisnick and Smyth, 2016, Li and She, 2017], such as truncated distributions; (ii) improving the various estimation steps by introducing complex MC sampling with importance re-weighted sampling [Burda et al., 2015]; (iii) providing different estimates of the posterior with moment matching techniques [Dieng and Paisley, 2019, Huang et al., 2019]. More recently, Park et al. [2019] exploited the special continuous piecewise affine structure of current ReLU DGNs to develop an approximation of the posterior distribution based on mode estimation and DGN linearization, leading to Laplacian VAEs.

1.3.3 Batch-Normalization Understandings.

Nowadays, the empirical benefits of BN are ubiquitous, with more than 12,000 citations to the original BN article and a unanimous community employing BN to accelerate training by helping the optimization procedure and to increase generalization performance [He et al., 2016b, Zagoruyko and Komodakis, 2016, Szegedy et al., 2016, Zhang et al., 2018c, Huang et al., 2018a, Liu et al., 2017b, Ye et al., 2018, Jin et al., 2019, Bender et al., 2018]. Despite its prevalence in today's DN architectures, the understanding of the unseen forces that BN applies on DNs remains elusive; and for many, understanding why BN so drastically improves DN performance remains one of the key open problems in the theory of deep learning [Richard et al., 2018]. One of the first practical arguments in favor of feature map normalization emerged in Cun et al. [1998] as "good practice" to stabilize training. By studying how the backpropagation algorithm updates the layer weights, it was observed that, unless the feature maps are normalized, those updates are constrained to live on a low-dimensional subspace, limiting the learning capacity of gradient-based algorithms. By explicitly reparametrizing the affine transformation weights and slightly altering the renormalization process of BN, weight normalization [Salimans and Kingma, 2016] showed how the $\sigma^{(\ell)}$ renormalization smooths the optimization landscape of DNs. Similarly, Bjorck et al. [2018], Santurkar et al. [2018], Kohler et al. [2019] further studied the impact of BN on the gradient distributions and optimization landscape by designing careful and large-scale experiments. By providing a smoother optimization landscape, BN "simplifies" the stochastic optimization procedure and thus accelerates training convergence and generalization. In parallel to this optimization analysis of BN in standard DN architectures, Yang et al. [2019b] developed a mean field theory for fully-connected feed-forward neural networks with random weights where BN is analytically studied. In doing so, they were able to characterize the gradient statistics in such DNs and to study the signal propagation stability depending on the weight initialization, concluding that BN stabilizes gradients and thus training.

1.3.4 Deep Network Pruning.

With a tremendously increasing need for practical DN deployments, one line of research aims to produce a simpler, energy-efficient DN by pruning a dense one, e.g., removing some layers/nodes/weights and any combination of these options from a DN architecture, leading to a much reduced computational cost. Recent progress [You et al., 2020, Molchanov et al., 2016] in this direction makes it possible to obtain much more energy-friendly models while nearly maintaining the models' task accuracy [Li et al., 2020]. Throughout this chapter, we will often abuse notation and refer to an unpruned DN as "dense" or "complete". While tremendous empirical progress has been made regarding DN pruning, there remains a lack of theoretical understanding of its impact on a DN's decision boundary as well as a lack of theoretical tools for deriving pruning techniques in a principled way. Such understanding is crucial for one to study the possible failure modes of pruning techniques, to better decide which to use for a given application, or to design pruning techniques possibly guided by some a priori knowledge about the given task and data. The common pruning scheme adopts a three-step routine: (i) training a large model with more parameters/units than the desired final DN, (ii) pruning this overly large trained DN, and (iii) fine-tuning the pruned model to adjust the remaining parameters and restore as best as possible the performance lost during the pruning step. The last two steps can be iterated to get a highly-sparse network [Han et al., 2015]. Within this routine, different pruning methods can be employed, each with a specific pruning criterion, granularity, and scheduling [Liu et al., 2019b, Blalock et al., 2020]. Those techniques roughly fall into two categories: unstructured pruning [Han et al., 2015, Frankle and Carbin, 2019, Evci et al., 2019] and structured pruning [He et al., 2018, Liu et al., 2017b, Chin et al., 2020a]. Regardless of the pruning method, the trade-off lies between the amount of pruning performed on a model and the final accuracy. For various energy-efficient applications, novel pruning techniques have been able to push this trade-off favorably. The most recent theoretical works on DN pruning rely on studying the existence of winning tickets. Frankle and Carbin [2019] first hypothesized the existence of sub-networks (pruned DNs), called winning tickets, that can produce performance comparable to their non-pruned counterparts. Later, You et al. [2020] showed that those winning tickets could be identified in the early training stage of the un-pruned model. Such sub-networks are denoted as early-bird (EB) tickets.

1.4 Contributions

There are many fundamental questions that need to be addressed in deep learning. However, we propose in this thesis to focus specifically on three of them that would not only help in bringing novel understanding of the underlying computational intricacies of Deep Networks, but would also produce better performing models:

Question 1: Can we train a deep network to learn a probability distribution in a high-dimensional space from training data?

Question 2: How can we lower the power consumption of implementing a deep network (both learning and inference) given an architecture and training dataset?

Question 3: How can we explain the ability of a technique such as Batch-Normalization to considerably boost Deep Network performance regardless of the architecture, task, and data at hand?

Answering the above three fundamental questions would push the barriers of current techniques in Deep Learning across applications ranging from manifold learning and density estimation (Q1) to providing interpretability and explainability for an everyday Deep Network technique, namely Batch-Normalization (Q3), while also allowing the principled design of novel methods guided by theoretical understanding, as we will do for pruning (Q2). Needless to say, in order to answer those three fundamental questions, we will need to provide a novel mathematical formulation of Deep Networks. The first part of this thesis thus proposes a novel formulation of DNs via a very special type of spline: max-affine splines, which we review in Chap. 2. This formulation, which consists of a reformulation of DNs based on those splines, opens the door to theoretical study of CPA DNs with high-dimensional input spaces and allows us to leverage results from combinatorial and computational geometry to further enrich our understanding (Chap. 3). Equipped with this formulation and novel understanding, we will be able to answer the three questions posed above, in addition to developing many visualization tools, with the following organization:

Answer 1: deriving an Expectation-Maximization training for Deep Generative Networks (Chap. 5): In this chapter, we advance both the theory and practice of DGNs and VAEs by computing the exact analytical posterior and marginal distributions of any DGN employing continuous piecewise affine (CPA) nonlinearities. The knowledge of these distributions enables us to perform exact inference without resorting to AVI or MC-sampling and to train the DGN in a gradient-free manner with guaranteed convergence.

Answer 2: designing a novel and theoretically grounded state-of-the-art Deep Network pruning strategy (Chap. 6): In this chapter we turn our focus towards a recent technique, Deep Network pruning. As we will see, pruning, which consists of removing some weights and/or units of a DN, can be studied thoroughly from a geometric point of view thanks to the knowledge of the DN input space partition and its ties with the DN input-output mapping. After providing many practical insights into pruning, we will propose, from those understandings, a novel strategy that is able to compete with alternative state-of-the-art methods.

Answer 3: interpreting and theoretically studying, from a spline point of view, one of the most important Deep Learning techniques, Batch-Normalization (Chap. 7): We will demonstrate in this chapter how BN, by proposing a specific layer input-output mapping parametrization, provides an unsupervised learning technique that interacts with the (un)supervised learning algorithm used to train a DN in order to focus the attention of the network onto the data points.

In addition to the above core contributions, we will also exploit the affine spline formulation of deep networks to study and interpret Deep Generative Networks in all generality in Chap. 4, and we will finally conclude by demonstrating in Chap. 8 how to extend all the above results to smooth Deep Networks, thus effectively porting all our results beyond the affine spline world.

Chapter 2

Max-Affine Splines for Convex Function Approximation

Function approximation is the general task of utilizing an approximant function $\hat{f}$ to 'mimic' as best as possible a target (possibly unknown) function $f$. This task can take many forms, ranging from fitting $\hat{f}$ based on samples generated from $f$, as often done in machine learning, to imposing physical constraints on $\hat{f}$ that are known to govern $f$, as often done in partial differential equation approximations, or possibly a mix of both approaches. Solving this task accurately has tremendous applications, as the approximant $\hat{f}$ can then be deployed, for example, to provide autonomous controllers as used in aircraft and uninhabited air vehicles [Farrell et al., 2005, Xu et al., 2014], to perform weather prediction [Richardson, 2007, Brown et al., 2012, Bauer et al., 2015], to accelerate drug discovery [King et al., 1992, Lima et al., 2016, Zhang et al., 2017a, Ong et al., 2020], or to better identify and prevent suicide attempts [Walsh et al., 2017, Torous et al., 2018]. While the topic of function approximation is vast (we refer the reader to Powell [1981], DeVore [1998] for an overview), for our study we focus on a specific class of approximants: spline functions. A clear understanding of those functions and their notations will be crucial for the remainder of the thesis, as we aim to employ the rich theory of splines to study Deep Networks.

2.1 Spline Functions

Spline functions [Schoenberg, 1973] are powerful practical function approximators that have been thoroughly studied theoretically in terms of their approximation capacity along with various other properties [De Boor and Rice, 1968, Unser et al., 1993, Schumaker, 2007].

Splines: constrained piecewise polynomials. Consider a partition of a domain $\mathcal{X}$ into a finite set of regions $\Omega = \{\omega_1, \dots, \omega_R\}$. In our study we will focus on partitions of a continuous domain $\mathcal{X} \subset \mathbb{R}^D$, $D \geq 1$.

Definition 2.1 (Partition) A partition $\Omega$ of a domain $\mathcal{X}$ is a finite collection of regions $\Omega = \{\omega_1, \dots, \omega_R\}$ such that their union recovers the domain, $\cup_{r=1}^{R}\omega_r = \mathcal{X}$, and the intersection of the interiors of any two different regions is empty, $\mathring{\omega}_i \cap \mathring{\omega}_j = \emptyset\ \forall i \neq j$, where $\mathring{\,\cdot\,}$ is the interior operator [Halmos, 2013].

Let us consider $R$ piecewise polynomial mappings of degree $k$ that we denote as $\phi_r^k$ for $r = 1, \dots, R$. For now we consider univariate polynomials, i.e., $D = 1$; hence, each one of those mappings transforms an input $x$ into an output via
\[
\phi_r^k(x; a_{r,:}) := \sum_{p=0}^{k} x^p\, a_{r,p}, \quad x \in \mathbb{R}, \tag{2.1}
\]
where $a_{r,p} \in \mathbb{R}$ is the $p$th-degree polynomial coefficient ($p \in \{0, \dots, k\}$) for the $r$th polynomial ($r \in \{1, \dots, R\}$). We denote by $a_{r,:}$ the vector of $k+1$ parameters $(a_{r,0}, \dots, a_{r,k})^T \in \mathbb{R}^{k+1}$.

Definition 2.2 (Piecewise polynomial function) The mapping defined as
\[
P(x; a_{:,:}) = \sum_{r=1}^{|\Omega|} \phi_r^k(x; a_{r,:})\, 1_{\{x \in \omega_r\}} \tag{2.2}
\]
is known as an order-$k$ piecewise polynomial function, where $\phi_r^k$ is from (2.1) and $\Omega$ is a partition of the considered domain.

An order-$k$ spline function on a domain $\mathcal{X}$ is obtained by constraining an order-$k$ piecewise polynomial function $P$ defined on a partition $\Omega$ of $\mathcal{X}$ to have continuous $0, \dots, k-1$-order derivatives, i.e., $P \in C^{k-1}(\mathcal{X})$. We recall that the $0$-order derivative is the function itself. In order to gain insights into the constraint imposed on piecewise polynomials to obtain a spline, we first need to formally define two regions $\omega_i, \omega_j$ as adjacent iff $\partial\omega_i \cap \partial\omega_j \neq \emptyset$, with $\partial\,\cdot$ the boundary operator. Now, since $P$ is a piecewise polynomial, it is clear that the restriction of $P$ onto any region $\omega \in \Omega$, which we denote by $P|_\omega$, fulfills $P|_\omega \in C^{\infty}(\omega)$. As a result, the constraint $P \in C^{k-1}(\mathcal{X})$ can be seen as enforcing the piecewise polynomial mappings $\phi_r^k, \phi_{r'}^k$ for any adjacent regions $\omega_r, \omega_{r'}$ to have the same $0, \dots, k-1$-order derivatives at the intersection of their regions' boundaries.

Definition 2.3 (Spline function) Given a partition $\Omega = \{\omega_1, \dots, \omega_R\}$ of some domain $\mathcal{X}$, a spline function of order $k$ is an order-$k$ piecewise polynomial $P$ (recall Def. 2.2) on $\Omega$ such that $P \in C^{k-1}(\mathcal{X})$.

As a result, in the special case $k = 2$ (quadratic polynomials), the mapping $P$ will be a spline function iff $P$ and $P'$ (zero-order and first-order derivatives) are continuous. In the case $k = 1$ (affine polynomials), which will be the main setting of this thesis, $P$ must be a piecewise polynomial of degree 1 in each region of the partition, and must be continuous on the entire domain. For a thorough study of piecewise polynomials and splines, we refer the reader to Schumaker [2007]. Generalizing the above spline construction to multivariate domains of dimension $D > 1$ follows naturally by considering multivariate polynomial functions for $\phi_r^k$; the notion of adjacent regions and the derivative constraints that must be fulfilled by a piecewise polynomial mapping to be a spline are identical.

Spline functions' bases. In many situations, one needs to 'learn' a spline given a specific order $k$ and (possibly) a known partition $\Omega$, based on some criterion to minimize. Commonly, one will not solve a constrained optimization problem of fitting a piecewise polynomial function while constraining the first $0, \dots, k-1$-order derivatives to be continuous. Instead, one employs an unconstrained optimization problem in which the coefficients to be optimized weight some basis functions that belong to the considered spline functional space. One of the most famous families of basis functions is the B-splines [Schoenberg, 1973, 1988], which consist of order-$k$ splines with compact support (on each region of $\Omega$). Since any spline function of a given degree can be expressed as a linear combination of B-splines of that degree, it is enough to learn the correct linear combination in order to learn the spline function. For alternative bases we refer the reader to Girosi et al. [1993], Unser and Blu [2005]. We now focus on affine ($k = 1$) splines for the remainder of this thesis.
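To make the basis-expansion viewpoint concrete, here is a minimal sketch (our own toy example, not taken from the thesis) that fits an affine spline on a fixed uniform knot sequence by unconstrained least squares over a hand-rolled hat-function (order-1 B-spline) basis; continuity comes for free since every basis element is itself a continuous piecewise affine function.

```python
import numpy as np

# Toy sketch: learning a k=1 spline on a *fixed* partition by fitting the
# coefficients of a hat-function basis with ordinary least squares.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))                             # sample locations
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)   # noisy target f(x)

knots = np.linspace(0, 1, 10)                                   # fixed partition boundaries
# Column j of the design matrix is the hat function centered at knots[j]:
# linear interpolation through the indicator vector of that knot.
B = np.column_stack([np.interp(x, knots, np.eye(len(knots))[j])
                     for j in range(len(knots))])

coeffs, *_ = np.linalg.lstsq(B, y, rcond=None)                  # unconstrained fit
y_hat = B @ coeffs                                              # spline values at x
print("fit MSE:", np.mean((y - y_hat) ** 2))
```

The partition (the knots) is chosen in advance here; the adaptive-partition splines discussed next remove exactly that limitation.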

Affine splines. Affine splines ($k = 1$) have been particularly popular since their basis functions are efficient to evaluate and the number of parameters required to describe a spline on a partition $\Omega$ of a domain $\mathcal{X}$ grows linearly with the dimension of the domain $\dim(\mathcal{X})$. Such affine splines have been used for the detection of patterns in images [Rives et al., 1985], contour tracing [Dobkin et al., 1990], extraction of straight lines in aerial images [Venkateswar and Chellappa, 1992], global optimization [Mangasarian et al., 2005], compression of chemical process data [Bakshi and Stephanopoulos, 1996], gas demand forecasting [Gascón and Sánchez-Úbeda, 2018] and circuit modeling [Vandenberghe et al., 1989]. Let us specialize the spline mapping (2.2) to the affine case. In that case, and with $\mathcal{X} \subset \mathbb{R}^D$, the mappings $\phi_r^1$ depend on a slope vector $a_r \in \mathbb{R}^D$ and an offset/bias scalar $b_r \in \mathbb{R}$, leading to the multivariate affine spline
\[
P(x; a_:, b_:) = \sum_{r=1}^{|\Omega|} \left( \langle a_r, x\rangle + b_r \right) 1_{\{x \in \omega_r\}},
\]
where we recall that $a_r, b_r$ are such that $P \in C^{0}(\mathcal{X})$. For an in-depth study of affine splines and their representation we refer the reader to Kang and Chua [1978], Kahlert and Chua [1990]. While affine splines might seem constrained due to the use of order-1 polynomials, we should emphasize that in the context of function approximation, the degree of the spline matters very little. However, the partition $\Omega$ is of crucial importance. A correctly tuned partition along with an order-1 spline will produce a better approximation than a higher-degree spline with an incorrect partition. For a thorough study of the role of the polynomial degree, the partition, and the target function in the final approximation error, we refer the reader to Birkhoff and De Boor [1964], Lyche and Schumaker [1975], Cohen et al. [2012]. Clearly, the complication in spline fitting arises when one aims to fit the spline function basis and the partition jointly, leading to an intractable problem. As we will see in the next section, it is possible to design, for specific applications, splines that automatically adapt the partition $\Omega$ while the basis functions are fit.

2.2 Max-Affine Splines

Whenever an affine spline is constrained to be globally convex, it can be rewritten as a Max-Affine Spline (MAS). The origin of MASs is not linked to a specific paper or study; the concept arose many times, for example in the development of hinging hyperplanes. Dedicated studies of MASs have been carried out in the context of convex function approximation in Magnani and Boyd [2009], Hannah and Dunson [2013]. A MAS is a continuous, convex, and piecewise affine function that maps its input $x$ to its output via
\[
P(x; a_:, b_:) = \max_{r=1,\dots,R} \langle a_r, x\rangle + b_r. \tag{2.3}
\]
An extremely useful feature of such a spline is that it is completely determined by its parameters $a_r$ and $b_r$, $r = 1, \dots, R$, and does not require an explicit partition $\Omega$. Changes in those parameters automatically induce changes in the partition $\Omega$, meaning that they are adaptive partitioning splines [Binev et al., 2014]. A thorough study and characterization of the partition $\Omega$ induced by those parameters will be carried out in Sec. 3.4. A max-affine spline is always piecewise affine, globally convex, and hence continuous regardless of the values of its parameters $a_r \in \mathbb{R}^D$, $b_r \in \mathbb{R}$, $r = 1, \dots, R$. Conversely, any piecewise affine, globally convex, and continuous function can be written as a MAS.
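As a quick illustration, the following sketch (random toy parameters of our own, not from the thesis) evaluates a MAS and reads off the region membership implied by the argmax in (2.3).

```python
import numpy as np

# Evaluating a max-affine spline P(x) = max_r <a_r, x> + b_r and its implicit
# partition: r(x) = argmax_r <a_r, x> + b_r.
rng = np.random.default_rng(1)
R, D = 5, 2                      # number of affine pieces, input dimension
A = rng.standard_normal((R, D))  # slopes a_r stacked row-wise
b = rng.standard_normal(R)       # offsets b_r

def mas(x):
    """Return the MAS value and the index of the active affine piece at x."""
    scores = A @ x + b           # all R affine projections of x
    r = int(np.argmax(scores))   # region membership (which piece is active)
    return scores[r], r

x = np.array([0.3, -1.2])
value, region = mas(x)
print(f"P(x) = {value:.3f}, active region r(x) = {region}")
# Changing A or b changes both the function values and the induced partition.
```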

Proposition 1 For any continuous function $h \in C^0(\mathbb{R}^D)$ that is convex and piecewise affine on a partition $\Omega$ of $\mathbb{R}^D$, there exist $R > 1$ and $a_r \in \mathbb{R}^D$, $b_r \in \mathbb{R}$, $r = 1, \dots, R$ such that $h(\cdot) = P(\cdot\,; a_:, b_:)$ everywhere.

This result follows from the fact that the pointwise maximum of a collection of convex functions is convex [Roberts, 1993]; for the reciprocal, see Sec. 3.2.3 in Boyd et al. [2004]. A great benefit of MASs for convex function approximation lies in (i) the ability to solve the fitting problem as an unconstrained optimization problem (in fact, as mentioned above, the parameters can be fit arbitrarily without breaking the convexity property), and (ii) the ability to adapt the domain partition $\Omega$ while the affine parameters are tuned to better fit the target function, removing the intractable joint optimization of the partition and the per-region mappings. We now study more specifically the fitting methods for MASs.

2.3 (Max-)Affine Spline Fitting

Several methods have been proposed for fitting general affine splines to (multidimensional) data. A neural network algorithm is used in Gothoskar et al. [2002]; a Gauss-Newton method is used in Julián et al. [1998], Horst and Beichel [1997]; a reference on methods for least squares with semismooth functions is Kanzow and Petra [2004]. For our study, we are interested in fitting a MAS in the form of (2.3). Due to the special form and the convexity property of this approximant, and when considering a mean-squared error, Magnani and Boyd [2009] proposed an iterative fitting algorithm that can be interpreted as a Gauss-Newton algorithm. We report this algorithm in Algo. 1.

Iterative procedures, similar in spirit to the one we present in Algo. 1 for MASs but for specialized applications, are described in Phillips and Rosenfeld [1988], Yin [1998], Ferrari-Trecate and Muselli [2002], Kim et al. [2004]. For additional references on affine spline fitting with $D = 1$, Dunham [1986] proposes to find the minimum number of segments achieving a given maximum error, Goodrich [1994], Bellman and Roth [1969], Hakimi and Schmeichel [1991], Wang et al. [1993] propose dynamic programming methods to solve the affine spline fitting problem, and Pittman and Murthy [2000] propose genetic algorithms. For $D = 2$, Aggarwal et al. [1989], Mitchell and Suri [1995] propose variants of the univariate fitting solutions.

Algorithm 1  Description of the MAS fitting procedure when considering a mean-squared error. The algorithm consists of successively fitting the per-region affine mappings to the samples that are within each region, and then updating the partition. This method can be seen as the solution of the Gauss-Newton algorithm with a MAS approximant. Convergence is not guaranteed; for examples of such failure cases, see Sec. 3.3 of Magnani and Boyd [2009].

procedure MEAN-SQUARE MAX-AFFINE SPLINE FITTING($\mathcal{D}$, $T_{\rm limit} \in \mathbb{N}^+$)
  $T \leftarrow 0$  (set counter)
  $\Omega^{(T)} = \{\omega_1^{(T)}, \dots, \omega_K^{(T)}\},\ K \leq R$  (initialize a partition of $\mathcal{D}$)
  while $T < T_{\rm limit}$ do
    for $r = 1, \dots, |\Omega^{(T)}|$ do
      if $|\omega_r^{(T)}| = 0$ then break
      $\begin{bmatrix} a_r^{(T+1)} \\ b_r^{(T+1)} \end{bmatrix} = \arg\min_{a,b} \sum_{(x,y)\in\omega_r^{(T)}} \|\langle a, x\rangle + b - y\|_2^2 = \Big( \sum_{(x,y)\in\omega_r^{(T)}} \begin{bmatrix} xx^T & x \\ x^T & 1 \end{bmatrix} \Big)^{-1} \Big( \sum_{(x,y)\in\omega_r^{(T)}} \begin{bmatrix} yx \\ y \end{bmatrix} \Big)$
    $\Omega^{(T+1)} = \{\omega_1^{(T+1)}, \dots, \omega_R^{(T+1)}\}$ with $\omega_r^{(T+1)} = \{(x,y) \in \mathcal{D} : \arg\max_{r'=1,\dots,R} \langle a_{r'}^{(T+1)}, x\rangle + b_{r'}^{(T+1)} = r\}$
    if $\Omega^{(T+1)} = \Omega^{(T)}$ then Exit else $T \leftarrow T + 1$
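For readers who prefer running code, here is a compact Python sketch of the same alternating procedure; the variable names, random initialization, and stopping details are our own choices and not taken from Magnani and Boyd [2009].

```python
import numpy as np

# Alternate between (i) per-region least-squares fitting of an affine piece and
# (ii) re-assigning each sample to the piece attaining the maximum.
def fit_mas(X, y, R, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    assign = rng.integers(0, R, size=N)   # random initial partition of the data
    A = np.zeros((R, D))
    b = np.zeros(R)
    for _ in range(n_iter):
        # (i) Per-region least-squares fit of the affine piece.
        for r in range(R):
            mask = assign == r
            if not mask.any():
                continue                   # empty region: keep previous parameters
            Xr = np.hstack([X[mask], np.ones((mask.sum(), 1))])
            sol, *_ = np.linalg.lstsq(Xr, y[mask], rcond=None)
            A[r], b[r] = sol[:-1], sol[-1]
        # (ii) Partition update: each sample goes to the argmax affine piece.
        new_assign = np.argmax(X @ A.T + b, axis=1)
        if np.array_equal(new_assign, assign):
            break                          # partition unchanged: fixed point reached
        assign = new_assign
    return A, b

# Fit a convex target: f(x) = ||x||^2 sampled on the square [-1, 1]^2.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(500, 2))
y = np.sum(X ** 2, axis=1)
A, b = fit_mas(X, y, R=8)
y_hat = np.max(X @ A.T + b, axis=1)
print("MAS fit MSE:", np.mean((y - y_hat) ** 2))
```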

Chapter 3

Deep Networks: Composition of Max-Affine Spline Operators

In this chapter, we exploit Max-Affine Splines (MASs) to formulate Deep Networks (DNs) as a composition of Max-Affine Spline Operators (MASOs), a multivariate-output version of MASs. We will first develop the MASO formulation, demonstrate how each DN layer can be expressed as a MASO, and then express any DN as a composition of MASOs. We conclude this chapter with a dedicated study of the DN input space partition.

3.1 Max-Affine Spline Operators

A natural extension of a Max-Affine Spline (MAS) function is a max-affine spline operator (MASO) $M(\,\cdot\,; A_:, b_:)$ that produces a multivariate output. It is obtained simply by concatenating $K$ independent max-affine spline functions from (2.3). A MASO mapping a $D$-dimensional input to a $K$-dimensional output has slope parameters $A_r \in \mathbb{R}^{K \times D}$ and offset parameters $b_r \in \mathbb{R}^K$ and is defined as
\[
M(x; A_:, b_:) = \max_{r=1,\dots,R} \left( A_r x + b_r \right) = \begin{bmatrix} \max_{r=1,\dots,R} \langle [A_r]_{1,:}, x\rangle + [b_r]_1 \\ \vdots \\ \max_{r=1,\dots,R} \langle [A_r]_{K,:}, x\rangle + [b_r]_K \end{bmatrix}, \tag{3.1}
\]
where the maximum is taken component-wise. Since a MASO is built from $K$ independent MASs and can be seen as producing its output by stacking the output of each MAS into a vector, it has a property analogous to Proposition 1.

Proposition 3.1
For any operator $H(x) = [h_1(x), \dots, h_K(x)]^T$ with $h_k \in C^0(\mathbb{R}^D)\ \forall k$ that are convex and piecewise affine on their respective partitions $\Omega_k$ of $\mathbb{R}^D$, there exist $A_:, b_:$ such that $H(\cdot) = M(\cdot\,; A_:, b_:)$ everywhere.
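The following minimal sketch (toy parameters of our own) evaluates a MASO exactly as in (3.1): $K$ independent max-affine splines sharing the same input, each selecting its own active affine piece.

```python
import numpy as np

# Evaluating a MASO as K stacked max-affine splines (component-wise maximum).
rng = np.random.default_rng(2)
R, K, D = 3, 4, 5                       # pieces, output dim, input dim
A = rng.standard_normal((R, K, D))      # A_r stacked along the first axis
b = rng.standard_normal((R, K))         # b_r stacked along the first axis

def maso(x):
    """Return M(x) and, per output unit k, the index r_k(x) of the active piece."""
    scores = np.einsum('rkd,d->rk', A, x) + b  # shape (R, K): all affine maps
    r_per_unit = np.argmax(scores, axis=0)     # independent argmax per unit
    return scores.max(axis=0), r_per_unit

x = rng.standard_normal(D)
out, active = maso(x)
print("M(x) =", np.round(out, 3), " active pieces per unit:", active)
```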

The development of MASOs is crucial for our purposes since DNs compose multiple multivariate mappings. The goal of the next section is to demonstrate that a MASO can be used to formulate most current DN layers. From that, it will become clear that an entire DN input-output mapping is nothing other than a composition of MASOs.

3.2 From Deep Network Layers to Max-Affine Spline Operators

We begin by showing that the DN layers defined in Section 1.2 are MASOs, and we demonstrate in each case the corresponding parameters $A_:, b_:$ (recall (3.1)). The next section will concern the reformulation of the entire DN as a composition of MASOs.

A dense layer, which consists of an unconstrained affine transformation of the input followed by a pointwise nonlinearity, can be expressed as a MASO as long as the nonlinearity is convex and piecewise affine. It turns out that most currently employed nonlinearities fall into that category (ReLU, leaky-ReLU, absolute value). As a result, and following the notations from Sec. 1.2, a dense layer can be expressed as a MASO with $R = 2$ and parameters
\[
A_1 = W, \qquad A_2 = \alpha W, \tag{3.2}
\]
\[
b_1 = b, \qquad b_2 = \alpha b, \tag{3.3}
\]
with $\alpha$ being $0$ for ReLU, $-1$ for absolute value, and $\alpha > 0$ for leaky-ReLU.
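A quick numerical sanity check of this equivalence (random weights of our own, not from the thesis): the $R = 2$ MASO of (3.2)-(3.3) reproduces a leaky-ReLU dense layer exactly.

```python
import numpy as np

# Check: max(Wx + b, alpha*(Wx + b)) == leaky_relu(Wx + b) elementwise.
rng = np.random.default_rng(3)
D_in, D_out, alpha = 6, 4, 0.1
W = rng.standard_normal((D_out, D_in))
b = rng.standard_normal(D_out)
x = rng.standard_normal(D_in)

pre = W @ x + b
leaky_relu = np.where(pre >= 0, pre, alpha * pre)                  # usual layer output
maso_out = np.maximum(W @ x + b, alpha * (W @ x) + alpha * b)      # R=2 MASO of (3.2)-(3.3)
assert np.allclose(leaky_relu, maso_out)
print("dense layer and its R=2 MASO agree:", np.round(maso_out, 3))
```

Setting alpha to 0 or -1 in the same script recovers the ReLU and absolute-value cases, respectively.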

The case of a convolutional layer is similar to that of a dense layer. The only change is to replace the unconstrained slope matrix $W$ and bias vector by their constrained counterparts. That is, the MASO of a convolutional layer has $R = 2$ and parameters
\[
A_1 = C, \qquad A_2 = \alpha C, \tag{3.4}
\]
\[
b_1 = b, \qquad b_2 = \alpha b, \tag{3.5}
\]
where the same values of $\alpha$ hold for each nonlinearity.

The case of a max-pooling layer (without any preceding affine mapping) can be expressed as a MASO as well. In that case, the number of mappings $R$ corresponds to the number of dimensions that the max-pooling is applied over. In a computer vision setting with the common $2 \times 2$ max-pooling, one would have $R = 4$. We thus have the following MASO parameters
\[
[A_r]_{k,d} = 1_{\{[\mathcal{R}_k]_r = d\}}\ \forall r, \qquad b_r = 0\ \forall r. \tag{3.6}
\]
That is, the matrices $A_r$ are filled with 0 and 1 values; each row $k$ contains a single 1, positioned at the $r$th index of the pooling region $\mathcal{R}_k$ that produces the output of the corresponding dimension $k$.
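To illustrate (3.6), the sketch below (our own toy setup) writes a single pooling region of size 4 as an $R = 4$ MASO with 0/1 selection matrices and recovers ordinary max-pooling.

```python
import numpy as np

# A pooling region over a flattened 4-dim input, written as the R=4 MASO of (3.6).
pool_regions = [[0, 1, 2, 3]]            # one pooling region (K=1 output unit)
R, K, D = 4, len(pool_regions), 4

A = np.zeros((R, K, D))
for k, region in enumerate(pool_regions):
    for r, d in enumerate(region):
        A[r, k, d] = 1.0                 # [A_r]_{k,d} = 1 iff region k's r-th index is d
b = np.zeros((R, K))

x = np.array([0.2, -1.0, 3.5, 0.7])
maso_out = (np.einsum('rkd,d->rk', A, x) + b).max(axis=0)
assert np.allclose(maso_out, x[pool_regions].max(axis=1))
print("max-pooling via MASO:", maso_out)
```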

The case of a maxout layer follows directly from the max-pooling case since it simply corresponds to a max-pooling layer in which an affine transform is added before the pooling operator. As such, the associated MASO will have $R$ set based on the number of dimensions that are being pooled, and the parameters are given by
\[
[A_r]_{k,:} = [W]_{[\mathcal{R}_k]_r,:}\ \forall r, \qquad [b_r]_k = [b]_{[\mathcal{R}_k]_r}\ \forall r. \tag{3.7}
\]

Another important case occurs when adding a residual connection to any given layer. This can also be modeled easily with a MASO as follows. First, formulate the given layer without the residual connection as a MASO with one of the above formulations. This provides a MASO parametrization $A_:, b_:$. Now, to add the residual connection to this layer, one simply adds the residual affine parameters to all the MASO parameters, i.e., for each $r = 1, \dots, R$,
\[
A_r \leftarrow A_r + W_{\rm res}\ \forall r, \qquad b_r \leftarrow b_r + b_{\rm res}\ \forall r. \tag{3.8}
\]
In the special case of a skip-connection, one would set $W_{\rm res}$ to be the identity matrix and $b_{\rm res}$ to be $0$. While the above does not explicitly cover all the possible layers that one can form by combining various operators, the same recipe can be applied. We formalize the generality of this formulation in the following result.

Proposition 3.2 (DN layer as MASO)
Any DN layer (recall Def. 1.1) that uses a continuous, convex, and piecewise affine (for each output dimension) nonlinear operator, and any (if any) preceding linear operator, can be expressed as a MASO.

It will be convenient to abstract away the region selection $r$ based on a given input, and we thus introduce the following notation
\[
M(x; A_:, b_:) = A_x x + b_x, \tag{3.9}
\]
where the input-induced affine parameters are given by
\[
A_x \triangleq \begin{bmatrix} [A_{r_1(x)}]_{1,:}^T \\ \vdots \\ [A_{r_K(x)}]_{K,:}^T \end{bmatrix}, \quad b_x \triangleq \begin{bmatrix} [b_{r_1(x)}]_1 \\ \vdots \\ [b_{r_K(x)}]_K \end{bmatrix}, \quad r_k(x) = \arg\max_{r=1,\dots,R} \left( \langle [A_r]_{k,:}, x\rangle + [b_r]_k \right), \tag{3.10}
\]
hence the parameters $A_x, b_x$ simply correspond to the slope and bias parameters responsible for producing the input-output mapping for the given input $x$. Similarly, we denote by $A_\omega, b_\omega$ the parameters $A_x, b_x$ obtained from any $x \in \omega$. It will become convenient in the coming sections to index those parameters by the layer index, as in $R^{(\ell)}$, $A^{(\ell)}$ and $b^{(\ell)}$. From this result, we are now able to express any DN that composes layers fulfilling Prop. 3.2. We propose to do this reformulation in the next section, where we will focus on two specific architectures to see how such a formulation can aid in comparing models from a data modeling viewpoint.
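The next sketch (toy parameters of our own) extracts $A_x, b_x$ of (3.10) for a given input and verifies the affine form (3.9) numerically.

```python
import numpy as np

# Recover the input-induced affine parameters of a MASO and check M(x) = A_x x + b_x.
rng = np.random.default_rng(4)
R, K, D = 3, 4, 5
A = rng.standard_normal((R, K, D))
b = rng.standard_normal((R, K))

def input_induced_affine(x):
    scores = np.einsum('rkd,d->rk', A, x) + b   # (R, K) affine scores
    r = np.argmax(scores, axis=0)               # r_k(x) for each unit k
    A_x = A[r, np.arange(K), :]                 # row k taken from A_{r_k(x)}
    b_x = b[r, np.arange(K)]
    return A_x, b_x

x = rng.standard_normal(D)
A_x, b_x = input_induced_affine(x)
maso_out = (np.einsum('rkd,d->rk', A, x) + b).max(axis=0)
assert np.allclose(maso_out, A_x @ x + b_x)     # (3.9) holds at x
print("A_x shape:", A_x.shape, " b_x:", np.round(b_x, 3))
```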

3.3 Composition of Max-Affine Spline Operators

We first formalize the ability to express DNs as a composition of MASOs. This result will be the key to the entire thesis, as it opens the door to further analysis of DNs from a spline perspective.

Theorem 3.1 (DNs as MASO composition)

A DN constructed from an arbitrary composition of layers that fulfill the conditions of Prop. 3.2 can be formulated as a composition of MASOs; the overall composition is itself a continuous affine spline operator.

DNs covered by Theorem 3.1 include CNNs, ResNets, inception networks, maxout networks, network-in-networks, scattering networks, and their variants using fully connected/convolution operators, (leaky) ReLU or absolute value activations, and max/mean pooling. Thanks to the ability to express any layer as a MASO, we can express the entire DN input-output mapping $F_\Theta$ as
\[
F_\Theta(x) = \left( \prod_{\ell=0}^{L-1} A_x^{(L-\ell)} \right) x + \sum_{\ell=1}^{L} \left( \prod_{j=0}^{L-\ell-1} A_x^{(L-j)} \right) b_x^{(\ell)}. \tag{3.11}
\]
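The following sketch (a random toy 3-layer leaky-ReLU network of our own) builds the per-layer input-induced affine parameters, composes them as in (3.11), and checks that the result matches the forward pass.

```python
import numpy as np

# Verify F(x) = A_x x + b_x with A_x and b_x accumulated layer by layer.
rng = np.random.default_rng(5)
dims = [4, 6, 6, 3]                       # D^(0), D^(1), D^(2), D^(3)
alpha = 0.2
Ws = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(3)]
bs = [rng.standard_normal(dims[l + 1]) for l in range(3)]

def forward(x):
    for W, b in zip(Ws, bs):
        pre = W @ x + b
        x = np.where(pre >= 0, pre, alpha * pre)   # leaky-ReLU layer
    return x

def end_to_end_affine(x):
    A_tot, b_tot, z = np.eye(dims[0]), np.zeros(dims[0]), x
    for W, b in zip(Ws, bs):
        pre = W @ z + b
        q = np.where(pre >= 0, 1.0, alpha)         # piecewise-constant derivative
        A_l = np.diag(q) @ W                       # layer slope A^(l)_x
        b_l = np.diag(q) @ b                       # layer bias  b^(l)_x
        A_tot = A_l @ A_tot                        # compose slopes
        b_tot = A_l @ b_tot + b_l                  # accumulate biases
        z = np.where(pre >= 0, pre, alpha * pre)
    return A_tot, b_tot

x = rng.standard_normal(dims[0])
A_x, b_x = end_to_end_affine(x)
assert np.allclose(forward(x), A_x @ x + b_x)
print("F(x) =", np.round(forward(x), 3))
```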

Note however that DNs of the form stated in Theorem 3.1 cannot in general be written as a single MASO, since the composition of two or more MASOs is not necessarily a convex operator (it is merely a continuous affine spline operator). Indeed, a composition of MASOs remains convex if and only if all of the intermediate operators are non-decreasing with respect to each of their output dimensions [Boyd et al., 2004]. Interestingly, ReLU, max-pooling, and average pooling are all non-decreasing, while leaky ReLU is strictly increasing. The culprits of the non-convexity of the composition of operators are negative entries in the fully connected and convolution slope matrices. A DN where these culprits are thwarted is an interesting special case, because it is convex with respect to its input [Amos et al., 2016] and multiconvex [Xu and Yin, 2013] with respect to its parameters (i.e., convex with respect to each operator's parameters while the other operators' parameters are held constant). The MASO form allows us to simply formalize those constraints based on the $A_:$ parameters.

Theorem 3.2 (Globally convex DNs)
A MASO DN whose layers $\ell = 2, \dots, L$ have nonnegative MASO slopes, i.e., $[A_r^{(\ell)}]_{i,j} \geq 0$, $\forall (r, i, j) \in \{1,\dots,R^{(\ell)}\} \times \{1,\dots,D^{(\ell)}\} \times \{1,\dots,D^{(\ell-1)}\}$, is globally convex with respect to each of its output dimensions.

Note that Theorem 3.2 remains true regardless of the MASO parameters of the first layer. Input convexity is a beneficial property that can be leveraged for specific applications, as optimization over the input becomes a convex optimization problem [Amos et al., 2016]. We now propose to dive in more detail into characterizing the DN input space partition, i.e., the collection $\Omega$ of regions of the DN input space in which the DN input-output mapping remains linear. This explicit analytical characterization is crucial, as understanding the partition of a spline operator opens the door to further theoretical study, such as of generalization performance.
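As a small empirical check of Theorem 3.2 on one of its simplest instances (a bias-free ReLU network of our own with nonnegative weights after the first layer), the sketch below tests the midpoint convexity inequality along random segments.

```python
import numpy as np

# Midpoint convexity check: f((u+v)/2) <= (f(u)+f(v))/2 for a convex f.
rng = np.random.default_rng(6)
W1 = rng.standard_normal((8, 3))           # first layer: unconstrained
W2 = np.abs(rng.standard_normal((4, 8)))   # deeper layers: nonnegative slopes
W3 = np.abs(rng.standard_normal((1, 4)))

def f(x):
    h = np.maximum(W1 @ x, 0)              # ReLU layers (no bias for brevity)
    h = np.maximum(W2 @ h, 0)
    return (W3 @ h)[0]

for _ in range(1000):
    u, v = rng.standard_normal(3), rng.standard_normal(3)
    assert f((u + v) / 2) <= (f(u) + f(v)) / 2 + 1e-9
print("midpoint convexity held on all sampled segments")
```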

DN input-output mapping remains linear. This explicit analytical characterization is crucial as understanding the partition of a spline operator opens the door to further theoretical study such as generalization performances. 38

3.4 Deep Networks Input Space Partition: Power Diagram Subdivision

One of the key elements for any spline function is its input space partition $\Omega$. From it, results on generalization and approximation can be obtained, as well as a better understanding of the approximant behavior via the study of the regions' shapes. Other works have focused on the properties of the partitioning, such as upper bounding the number of regions [Montufar et al., 2014, Raghu et al., 2017, Hanin and Rolnick, 2019] or providing an explicit characterization of the input space partitioning of a single-layer DN with ReLU activation [Zhang et al., 2018b] by means of tropical geometry. We propose in this section to characterize the DN input space partition with more generality by providing results that apply to any MASO-based DN, regardless of the underlying width/depth/layers. To do so, we adopt a computational and combinatorial geometry [Pach and Agarwal, 2011, Preparata and Shamos, 2012] perspective of MASO-based DNs to derive the analytical form of the input-space partition of a DN unit, a DN layer, and an entire end-to-end DN. We demonstrate that each DN layer performs a partitioning according to a Power Diagram [Aurenhammer and Imai, 1988] with a large number of regions, and that those Power Diagrams are subdivided in a special way to create the overall DN input-space partition.

3.4.1 Voronoi Diagrams and Power Diagrams

In order to precisely derive our result on the DN input space partition, we first need to remind the reader of some specific input space partitions, namely Voronoi diagrams and power diagrams.

Definition 3.1 (Voronoi Diagram) A Voronoi diagram (VD) [Voronoi, 1908] partitions a space $\mathcal{X}$ into $R$ regions $\Omega = \{\omega_1, \dots, \omega_R\}$ where each cell is obtained via $\omega_r = \{x \in \mathcal{X} : r(x) = r\}$, $r = 1, \dots, R$, with
\[
r(x) = \arg\min_{k=1,\dots,R} \| x - [\mu]_{k,:} \|^2. \tag{3.12}
\]
The parameter $[\mu]_{k,:}$ is called the centroid.

VDs are also known as Dirichlet tessellations, and the Voronoi regions are also known as Thiessen polygons. For a thorough study of VDs we refer the reader to Aurenhammer [1991]. A power diagram (PD), also known as a Laguerre-Voronoi diagram, is a generalization of the classical Voronoi diagram (VD).

Definition 3.2 (Power Diagram) A power diagram (PD) [Aurenhammer and Imai, 1988] partitions a space $\mathcal{X}$ into at most $R$ regions $\Omega = \{\omega_1, \dots, \omega_R\}$ where each cell is obtained via $\omega_r = \{x \in \mathcal{X} : r(x) = r\}$, $r = 1, \dots, R$, with
\[
r(x) = \arg\min_{k=1,\dots,R} \| x - [\mu]_{k,:} \|^2 - [\mathrm{rad}]_k. \tag{3.13}
\]
The parameter $[\mu]_{k,:}$ is called the centroid, while $[\mathrm{rad}]_k$ is called the radius. The distance minimized in (3.13) is called the Laguerre distance [Imai et al., 1985].

When the radii are equal for all $k$, a PD collapses to a VD. See Fig. 3.1 for two equivalent geometric interpretations of a PD. For additional insights, see Preparata and Shamos [2012]. We will have the occasion to use negative radii in our development below. Since $\arg\min_k \|x - [\mu]_{k,:}\|^2 - [\mathrm{rad}]_k = \arg\min_k \|x - [\mu]_{k,:}\|^2 - ([\mathrm{rad}]_k + \rho)$, we can always apply a constant shift $\rho$ to all of the radii to make them positive. In general, a PD is defined with nonnegative radii to provide additional geometric interpretations.
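A short sketch (our own toy data) of the cell assignment in (3.13); with equal radii the assignment reduces to an ordinary Voronoi diagram, as noted above.

```python
import numpy as np

# Assign points to Power Diagram cells via the Laguerre distance.
rng = np.random.default_rng(7)
R, D = 6, 2
mu = rng.standard_normal((R, D))           # centroids [mu]_{k,:}
rad = rng.uniform(0, 2, R)                 # radii [rad]_k

def pd_assign(X, mu, rad):
    """Return the PD cell index for each row of X."""
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # (N, R)
    return np.argmin(sq_dist - rad, axis=1)

X = rng.standard_normal((5, D))
print("PD cells:", pd_assign(X, mu, rad))
print("VD cells:", pd_assign(X, mu, np.zeros(R)))   # equal radii: Voronoi
```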

The Laguerre distance corresponds to the length of the line segment that starts at $x \in \mathcal{X}$ and ends at the tangent to the hypersphere with center $[\mu]_{k,:}$ and radius $[\mathrm{rad}]_k$ (see Fig. 3.1). The hyperplanar boundary between two adjacent power diagram (PD) regions can be characterized in terms of the chordale of the corresponding hyperspheres [Johnson, 1960]. Doing so for all adjacent boundaries fully characterizes the region boundaries in simple terms of hyperplane intersections [Aurenhammer, 1987]. Those two mathematical objects will be enough for us to build a complete characterization of the DN input space partition, to which we now turn, first for the single layer case and then for the multilayer case.

Figure 3.1: Two equivalent representations of a power diagram (PD). Top: The grey circles have centers $[\mu]_{k,:}$ and radii $[\mathrm{rad}]_k$; each point $x$ is assigned to a specific region/cell according to the Laguerre distance from the centers, which is defined as the length of the segment tangent to and starting on the circle and reaching $x$. Bottom: A PD in $\mathbb{R}^D$ (here $D = 2$) is constructed by lifting the centroids $[\mu]_{k,:}$ up into an additional dimension in $\mathbb{R}^{D+1}$ by the distance $[\mathrm{rad}]_k$ and then finding the Voronoi diagram (VD) of the augmented centroids $([\mu]_{k,:}, [\mathrm{rad}]_k)$ in $\mathbb{R}^{D+1}$. The intersection of this higher-dimensional VD with the originating space $\mathbb{R}^D$ yields the PD.

3.4.2 Single Layer: Power Diagram

A MASO layer combines $K$ max-affine spline (MAS) units to produce the layer output given its input. To streamline our argument, we omit the $\ell$ superscript and denote the layer input by $x$, with $\mathcal{X}$ the layer's domain. It shall be clear that each MAS indirectly encodes a partition of its input space, where each region corresponds to the collection of inputs that are mapped via the same affine mapping. In other words, the partition $\Omega_k$ of the $k$th MAS mapping in a MASO is obtained via
\[
\Omega_k = \{\omega_{k,1}, \dots, \omega_{k,R}\},
\]
where each region $\omega_{k,r}$ is the collection of inputs given by
\[
\omega_{k,r} = \Big\{ x \in \mathcal{X} : \arg\max_{r'=1,\dots,R} \langle [A_{r'}]_{k,:}, x\rangle + [b_{r'}]_k = r \Big\}.
\]

Following simple calculus, we can rewrite the region assignment as follows:
\begin{align*}
\omega_{k,r} &= \Big\{ x \in \mathcal{X} : \arg\max_{r'=1,\dots,R} \big( \langle [A_{r'}]_{k,:}, x\rangle + [b_{r'}]_k \big) = r \Big\} \\
&= \Big\{ x \in \mathcal{X} : \arg\min_{r'=1,\dots,R} \big( -2\langle [A_{r'}]_{k,:}, x\rangle - 2[b_{r'}]_k \big) = r \Big\} && \text{(sign change, scaling)} \\
&= \Big\{ x \in \mathcal{X} : \arg\min_{r'=1,\dots,R} \big( -2\langle [A_{r'}]_{k,:}, x\rangle - 2[b_{r'}]_k + \|x\|_2^2 \big) = r \Big\} && \text{(adding a constant)} \\
&= \Big\{ x \in \mathcal{X} : \arg\min_{r'=1,\dots,R} \big( -2\langle [A_{r'}]_{k,:}, x\rangle - 2[b_{r'}]_k + \|[A_{r'}]_{k,:}\|_2^2 - \|[A_{r'}]_{k,:}\|_2^2 + \|x\|_2^2 \big) = r \Big\} \\
&= \Big\{ x \in \mathcal{X} : \arg\min_{r'=1,\dots,R} \big( \|x - [A_{r'}]_{k,:}\|_2^2 - 2[b_{r'}]_k - \|[A_{r'}]_{k,:}\|_2^2 \big) = r \Big\},
\end{align*}
where, by identification, and denoting $2[b_{r'}]_k + \|[A_{r'}]_{k,:}\|_2^2$ as the radius term, we see that a MAS partitions its input space according to a Power Diagram.

Theorem 3.3 (MAS partition)
The $k$th MAS unit of a MASO partitions its input space according to a PD with $R$ centroids and radii given by $[\mu]_{r,:} = [A_r]_{k,:}$ and $[\mathrm{rad}]_r = 2[b_r]_k + \|[A_r]_{k,:}\|_2^2$, $\forall r \in \{1,\dots,R\}$ (recall (3.13)).
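The identification in Theorem 3.3 is easy to check numerically; the sketch below (a random toy MAS unit of our own) compares the argmax region of the MAS with the Power Diagram cell obtained from the stated centroids and radii.

```python
import numpy as np

# Check: argmax_r <a_r, x> + b_r == argmin_r ||x - a_r||^2 - (2 b_r + ||a_r||^2).
rng = np.random.default_rng(8)
R, D = 7, 3
A = rng.standard_normal((R, D))            # slopes a_r of one MAS unit
b = rng.standard_normal(R)                 # offsets b_r

mu = A                                     # PD centroids
rad = 2 * b + np.sum(A ** 2, axis=1)       # PD radii

X = rng.standard_normal((1000, D))
mas_region = np.argmax(X @ A.T + b, axis=1)
laguerre = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1) - rad
pd_region = np.argmin(laguerre, axis=1)
assert np.array_equal(mas_region, pd_region)
print("argmax(MAS) and argmin(Laguerre) agree on all", len(X), "points")
```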

Going from the partition of a single unit, $\Omega_k$, of the MASO layer to the entire layer input space partition $\Omega$ is done by studying the joint behavior of all the layer's constituent units. A MASO layer is a continuous, piecewise affine operator made by the concatenation of $K$ MAS units (recall (3.1)). This operator is linear in the regions of its domain where all the MAS units are jointly linear. From this, it is direct to see that $\Omega$ will involve all the possible intersections of the regions from $\Omega_1, \dots, \Omega_K$. We can formally obtain the exact form of the partition as follows. Denote a region $\omega_{\boldsymbol{r}}$ with $\boldsymbol{r} \in \{1,\dots,R\}^K$ as
\[
\omega_{\boldsymbol{r}} = \{ x \in \mathcal{X} : r_k(x) = [\boldsymbol{r}]_k,\ k = 1, \dots, K \},
\]
where $r_k(x)$ is taken from (3.10). So $\omega_{\boldsymbol{r}}$ is the (possibly empty) region of the layer domain that contains all the inputs with the specified argmax values for each of the units, based on the provided integer vector $\boldsymbol{r}$. To be consistent with our previous derivation, we will index $\omega$ with an integer given by $I(\boldsymbol{r}) = \sum_{k=1}^{K} R^{k-1}([\boldsymbol{r}]_k - 1)$. Clearly, the mapping $I$ is a bijection between $\{1,\dots,R\}^K$ and $\{0,\dots,R^K - 1\}$; $I$ can be seen as a change of basis of its integer input from base 10 to base $R$; conversely, $I^{-1}$ is the inverse mapping. Following a similar approach as for the MAS case, we obtain

\begin{align*}
\omega_{\boldsymbol{r}} &= \Big\{ x \in \mathcal{X} : r_k(x) = [\boldsymbol{r}]_k,\ k = 1,\dots,K \Big\} \\
&= \Big\{ x \in \mathcal{X} : \arg\max_{\boldsymbol{r}' \in \{1,\dots,R\}^K} \sum_{k=1}^{K} \big( \langle [A_{[\boldsymbol{r}']_k}]_{k,:}, x\rangle + [b_{[\boldsymbol{r}']_k}]_k \big) = \boldsymbol{r} \Big\} && \text{(indep. max.)} \\
&= \Big\{ x \in \mathcal{X} : \arg\min_{\boldsymbol{r}' \in \{1,\dots,R\}^K} -2\sum_{k=1}^{K} \langle [A_{[\boldsymbol{r}']_k}]_{k,:}, x\rangle - 2\sum_{k=1}^{K}[b_{[\boldsymbol{r}']_k}]_k = \boldsymbol{r} \Big\} \\
&= \Big\{ x \in \mathcal{X} : \arg\min_{\boldsymbol{r}' \in \{1,\dots,R\}^K} -2\sum_{k=1}^{K} \langle [A_{[\boldsymbol{r}']_k}]_{k,:}, x\rangle - 2\sum_{k=1}^{K}[b_{[\boldsymbol{r}']_k}]_k + \|x\|_2^2 = \boldsymbol{r} \Big\} \\
&= \Big\{ x \in \mathcal{X} : \arg\min_{\boldsymbol{r}' \in \{1,\dots,R\}^K} \Big\| x - \sum_{k=1}^{K} [A_{[\boldsymbol{r}']_k}]_{k,:} \Big\|_2^2 - 2\sum_{k=1}^{K}[b_{[\boldsymbol{r}']_k}]_k - \Big\| \sum_{k=1}^{K} [A_{[\boldsymbol{r}']_k}]_{k,:} \Big\|_2^2 = \boldsymbol{r} \Big\},
\end{align*}

where, by identification, we can see that again we fall back to a Power Diagram, thanks to the independent maximization process that is carried out for each unit of the MASO.

Theorem 3.4 (MASO partition)
A DN layer partitions its input space according to a PD containing up to $R^K$ regions with centroids $\mu_r = \sum_{k=1}^{K} [A_{[I^{-1}(r)]_k}]_{k,:}$ and radii $\mathrm{rad}_r = 2\sum_{k=1}^{K} [b_{[I^{-1}(r)]_k}]_k + \|\mu_r\|^2$. The input space partition of a DN layer is composed of convex polytopes.

As a result, each layer in a DN partitions its own input space according to a PD with the above parameters. The composition-of-layers case is described in the next section and heavily relies on the above result on the MASO partition.

3.4.3 Composition of Layers: Power Diagram Subdivision

We provide the formula for the input space partition of an $L$-layer DN by means of a recursion. Since we will now consider multiple layers, we bring back the superscript indexing of the per-layer quantities that we will study. The input space of layer $\ell$ is $\mathcal{X}^{(\ell-1)}$; the partition of this input space with respect to the layer PD is $\Omega^{(\ell)}$.

Initialization ($\ell = 0$): Define the region of interest in the input space $\mathcal{X}^{(0)} \subset \mathbb{R}^D$.

First step ($\ell = 1$): The first layer subdivides $\mathcal{X}^{(0)}$ into a PD via Theorem 3.4 to obtain the layer-1 partition $\Omega^{(1)}$.

Recursion step ($\ell = 2$): The second layer subdivides $\mathcal{X}^{(1)}$ into a PD via Theorem 3.4 to obtain the layer-2 partition $\Omega^{(2)}$. Jointly, the first layer units map $\mathcal{X}^{(0)}$ into $\mathcal{X}^{(1)}$ but remain a simple affine mapping in each region of the first layer's partition. Hence, each convex polytope $\omega^{(1)} \in \Omega^{(1)}$ that lives in the first layer's (and the DN's) input space is mapped to another convex polytope in $\mathcal{X}^{(1)}$, the second layer's input space, via
\[
\mathrm{aff}_{\omega^{(1)}} = \left\{ A^{(1)}_{\omega^{(1)}} x + b^{(1)}_{\omega^{(1)}},\ x \in \omega^{(1)} \subset \mathcal{X}^{(0)} \right\}, \quad \forall \omega^{(1)} \in \Omega^{(1)}. \tag{3.14}
\]

Figure 3.2: Visual depiction of the subdivision process that occurs when a deeper layer $\ell$ refines/subdivides an already built up-to-layer-$(\ell-1)$ partition $\Omega^{(1,\dots,\ell-1)}$. We depict here a toy model (2-layer DN) with 3 units at the first layer (leading to 4 regions) and 8 units at the second layer, with random weights and biases. The colors show the DN input space partitioning with respect to the first layer, $\Omega^{(1)}$; each layer-1 region $\omega^{(1)}$ leads to a different PD subdivision. Then, for each color (or region), the composition of layers 1 and 2 defines a specific PD that subdivides this aforementioned region (first row), where the region is colored and the PD is depicted for the whole input space. This subdivision is then applied onto the first-layer region only, as it only subdivides that region (second row, right). Finally, grouping together this process for each of the 4 regions, we obtain the layer-1/layer-2 input space partitioning $\Omega^{(1,2)}$ (second row, left).

As a result, it is clear that the partition $\Omega^{(2)}$ of $\mathcal{X}^{(1)}$ possibly subdivides $\mathrm{aff}_{\omega^{(1)}}$ into smaller regions. As the first layer is linear in this part of the space, we can effectively express the PD that subdivides each region $\mathrm{aff}_{\omega^{(1)}}$ back into the DN input space by replacing $x$ with $A^{(1)}_{\omega^{(1)}} x + b^{(1)}_{\omega^{(1)}}$ in Thm. 3.4. Repeating this subdivision process for all regions $\omega^{(1)}$ from $\Omega^{(1)}$ forms the subdivided input space partition of both layers, $\Omega^{(1,2)}$. See Fig. 3.2 for a numerical example with a 2-layer DN and a $D = 2$ dimensional input space.

Recursion step ($\ell$): Consider the situation at layer $\ell$, knowing $\Omega^{(1,\dots,\ell-1)}$ from the previous subdivision steps. Similarly to the $\ell = 2$ step, layer $\ell$ subdivides each cell in $\Omega^{(1,\dots,\ell-1)}$ to produce $\Omega^{(1,\dots,\ell)}$, leading to the up-to-layer-$\ell$ DN partition $\Omega^{(1,\dots,\ell)}$.

Theorem 3.5 (DN partition)
The DN input space partition is a Power Diagram subdivision; the number of subdivisions is at most the number of layers; at step $\ell$, the subdivision of a previously built region subdivides it into 1 up to $D^{(\ell)}$ regions; at each step, the subdivisions of different regions are not independent.

The subdivision recursion provides a direct result on the shape of the regions of the DN input space partition, which we formalize in its own statement below.

Corollary 3.1 (Region convexity)

For any number of MASO layers L ≥ 1, and any type of layer (as long as they are

MASOs), the regions of the DN input space partition are convex polytopes.

The above result follows naturally from our characterization. Recall that the DN partition successively subdivides the previously built partition, starting from the entire DN input space. The first layer produces $\Omega^{(1)}$, which is a PD and thus has convex regions, as is the case for any PD. Each region $\omega \in \Omega^{(1)}$ is then subdivided with another PD, hence the intersection of convex regions with other convex regions occurs. The result, $\Omega^{(1,2)}$, is thus made of convex regions. Repeating this process of intersecting convex regions with other convex regions ultimately leads to the DN input space partition $\Omega^{(1,\dots,L)}$, made of convex regions.
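The subdivision can also be inspected empirically. The sketch below (our own toy 2-layer ReLU network) encodes, for a grid of inputs, which affine piece is active at every unit of each layer; distinct codes correspond to distinct sampled regions of $\Omega^{(1)}$ and of the refined partition $\Omega^{(1,2)}$.

```python
import numpy as np

# Count sampled regions before and after the layer-2 subdivision.
rng = np.random.default_rng(9)
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((8, 3)), rng.standard_normal(8)

xs = np.linspace(-3, 3, 300)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)

pre1 = grid @ W1.T + b1
code1 = (pre1 >= 0)                          # layer-1 activation pattern
h1 = np.maximum(pre1, 0)                     # ReLU
pre2 = h1 @ W2.T + b2
code12 = np.hstack([code1, pre2 >= 0])       # joint layers-1-and-2 pattern

n1 = len(np.unique(code1, axis=0))
n12 = len(np.unique(code12, axis=0))
print(f"regions seen in Omega^(1): {n1}, after layer-2 subdivision: {n12}")
# n12 >= n1 always: the second layer only refines the layer-1 partition.
```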

3.5 Discussions

Our ability to characterize the DN input space partition as a Power Diagram subdivision concludes this chapter on employing Max-Affine Spline Operators to reformulate current DNs. We now have a better grasp of the underlying structure of the spline operator that a DN is. As was highlighted, this formulation offers a few key benefits. First, it is able to model DNs regardless of the actual input/latent/output space dimensions, and can be used for any DN as long as each layer nonlinearity is a Continuous Piecewise Affine operator. This generality, coupled with the practicality of Max-Affine Splines for obtaining theoretical results, should open the door to extending many powerful results obtained in univariate settings to more general cases. The subsequent chapters focus on bringing insights into various DN techniques and applications, such as Deep Generative Networks, Deep Network pruning, and Batch-Normalization in Deep Networks, from the MASO formulation.

Chapter 4

Insights Into Deep Generative Networks

In this chapter, we propose to leverage the results from Chap. 3 and apply them specifically to Deep Generative Networks (DGNs). Up until this point, we have mainly been focusing on a DN $F_\Theta$ devoid of any application setting. But DGNs, even though close to regression, aim to solve the problem of manifold learning. This particular scenario will allow us to draw many geometric insights into the ability of Continuous Piecewise Affine DGNs to fit manifolds and into their inner workings, e.g., their intrinsic dimension or their local basis vectors.

4.1 Introduction

4.1.1 Related Works

Deep Generative Networks (DGNs), which map a low-dimensional latent variable $z$ to a higher-dimensional generated sample $x$, have made enormous leaps in capabilities in recent years. DGNs alone only provide a nonlinear mapping from their latent space to an ambient space; learning the underlying DGN parameters can be done in a few different manners. First, one can employ Generative Adversarial Networks (GANs) [Goodfellow et al., 2014] or their variants [Dziugaite et al., 2015, Zhao et al., 2016, Durugkar et al., 2016, Arjovsky et al., 2017, Mao et al., 2017, Yang et al., 2019a]. In this setting the DGN is adapted in order to produce samples that cannot be distinguished from the training set's samples by a discriminative DN. Another option is to employ Variational Autoencoders [Kingma and Welling, 2013] or their variants [Fabius and van Amersfoort, 2014, van den Oord et al., 2017, Higgins et al., 2017, Tomczak and Welling, 2017b, Davidson et al., 2018]. In this setting, a (minimal) Probabilistic Graphical Model (PGM) is used in which the DGN represents the mapping between two neighboring vertices in this graph. This formulation allows training from a likelihood maximization perspective. In a similar vein, flow-based models such as NICE [Dinh et al., 2014], Normalizing Flow (NF) [Rezende and Mohamed, 2015] or their variants [Dinh et al., 2016, Grathwohl et al., 2018, Kingma and Dhariwal, 2018] propose to leverage the DGN as a succession of coordinate changes and to adapt them in order to force the data distribution to become a (simple) target distribution, often taken as an isotropic Gaussian. Training flow-based models also follows the maximum likelihood principle, but in a somewhat reversed formulation from VAEs.

Despite an exponential growth in the number of extensions and novel training methods for DGNs, all emerging techniques are motivated by studying the coupling between the dynamics of the DGN and the training framework [Mao et al., 2017, Chen et al., 2018], or by extensive empirical studies [Arjovsky and Bottou, Miyato et al., 2018, Xu and Durrett, 2018]. For example, GANs are mostly studied through the theoretical convergence properties of two-player games [Liu et al., 2017a, Zhang et al., 2017b, Biau et al., 2018], or regret analysis [Li et al., 2017b, Kodali et al., 2017]. VAEs are mostly studied from a perturbation theory perspective of their latent space [Roy et al., 2018, Andrés-Terré and Lió, 2019] or from a pure PGM perspective with emphasis on the inference and training schemes [Chen et al., 2018]. Finally, NFs mostly focus on improving the tractability of the model by means of parametrizations of the DGN layer mappings such as Householder transformations [Tomczak and Welling, 2016] or Sylvester matrices [Berg et al., 2018].

4.1.2 Contributions

In this chapter, we propose to study DGNs and their properties solely based on their Continuous Piecewise Affine structure that we built in Chap. 3. That is, we propose to make explicit the fundamental properties and limitations of DGNs regardless of the training setting employed. In doing so, we will be able to provide new perspectives on many observed phenomena, such as unstable training when dealing with multimodal data distributions (mode collapse) or the relationship between the DGN's latent space dimension and its ability to generalize. For this chapter, we will use the following notation. A deep generative network (DGN) is an operator $G_\Theta$ with parameters $\Theta$ mapping a latent input $z \in \mathbb{R}^S$ to an observation $x \in \mathbb{R}^D$ by composing $L$ intermediate layer mappings $G^{(\ell)}$, $\ell = 1, \dots, L$. We precisely define a layer $G^{(\ell)}$ as comprising a single nonlinear operator composed with any (if any) preceding linear operators that lie between it and the preceding nonlinear operator, as per Def. 1.1. We will omit $\Theta$ from the $G_\Theta$ operator for conciseness unless needed. Each layer $G^{(\ell)}$ transforms its input feature map $z^{(\ell-1)} \in \mathbb{R}^{D^{(\ell-1)}}$ into an output feature map $z^{(\ell)} \in \mathbb{R}^{D^{(\ell)}}$, with in particular $z^{(0)} := z$, $D^{(0)} = S$, and $z^{(L)} := x$, $D^{(L)} = D$. In this framework $z$ is interpreted as a latent representation, and $x$ is the generated/observed data, e.g., a time series or an image.

4.2 Deep Generative Network Latent and Intrinsic Dimension

In this section we study the properties of the mapping G_Θ : R^S → R^D of a DGN comprising L MASO layers.

Figure 4.1 : Visual depiction of Thm. 4.1 with a (random) generator G : R^2 → R^3. Left: generator input space partition Ω made of polytopal regions. Right: generator image Im(G), which is a continuous piecewise affine surface composed of the polytopes obtained by affinely transforming the polytopes of the input space partition (left); the colors are per-region and correspond between the left and right plots. This input-space-partition / generator-image / per-region-affine-mapping relation holds for any architecture employing piecewise affine activation functions. Understanding each of the three brings insights into the others, as we demonstrate in this chapter.

4.2.1 Input-Output Space Partition and Per-Region Mapping

As was hinted at in the previous chapter, the MASO formulation of a DGN allows us to express the (entire) DGN mapping G (a composition of L MASOs) as a per-region affine mapping

G(z) = ∑_{ω∈Ω} (A_ω z + b_ω) 1_{z∈ω},   z ∈ R^S,   (4.1)

with Ω a partition of R^S. Recall from Sec. 3.4 that this partition corresponds to a Power Diagram subdivision and can be obtained analytically, if needed. In order to study and characterize the DGN mapping (4.1), we make explicit the formation of the per-region slope and bias parameters. The affine parameters A_ω, b_ω decompose into

A_ω = ∏_{ℓ=0}^{L−1} diag(σ̇^{(L−ℓ)}(ω)) W^{(L−ℓ)} = diag(σ̇^{(L)}(ω)) W^{(L)} · · · diag(σ̇^{(1)}(ω)) W^{(1)},   (4.2)

where σ̇^{(ℓ)}(ω) is the pointwise derivative of the activation function of layer ℓ based on its input W^{(ℓ)} z^{(ℓ−1)} + b^{(ℓ)}, ∀z ∈ ω. At the time of this thesis, no DGN employs a pooling operator; we thus omit such operators in this chapter to streamline our notations and development. The diag operator simply puts the given vector into a diagonal square matrix. For convolutional layers (or others) one can simply replace the corresponding W^{(ℓ)} with the correct slope matrix parametrization. Notice that since the employed activation functions σ^{(ℓ)}, ∀ℓ ∈ {1,...,L}, are piecewise affine, their derivatives are piecewise constant, in particular with values [σ̇^{(ℓ)}(ω)]_k ∈ {α, 1} for k ∈ {1,...,D^{(ℓ)}}, with α = 0 for ReLU, α = −1 for absolute value, and in general α > 0 for Leaky-ReLU. We denote the collection of all the per-layer activation derivatives [σ̇^{(1)}(ω)^T, ..., σ̇^{(L)}(ω)^T]^T ∈ {α, 1}^{∑_{ℓ=1}^{L} D^{(ℓ)}} as the activation pattern of the generator. Based on the above, if one already knows the activation pattern associated with a region ω, then the matrix A_ω can be formed directly. In practice, one instead observes a sample z ∈ ω, from which the activation pattern is obtained directly. In this case, we slightly abuse notation and denote those known activation patterns as σ̇^{(ℓ)}(ω) ≜ σ̇^{(ℓ)}(z), z ∈ ω, with ω being the considered region. In a similar way, the bias vector is obtained as

b_ω = ∑_{ℓ=1}^{L} [ ∏_{i=0}^{L−ℓ−1} diag(σ̇^{(L−i)}(ω)) W^{(L−i)} ] diag(σ̇^{(ℓ)}(ω)) b^{(ℓ)}.   (4.3)
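To make the construction in (4.2)–(4.3) concrete, the following minimal sketch (not part of the thesis; it assumes a plain fully connected ReLU generator whose per-layer weights and biases are given as NumPy arrays, with a linear last layer, and all names are illustrative) computes A_ω and b_ω for the region containing a sample z by recording the activation pattern during a forward pass.

```python
import numpy as np

def region_affine_params(weights, biases, z):
    """Per-region affine parameters (A_omega, b_omega) of a ReLU generator at
    the region containing z, following (4.2)-(4.3); the last layer is linear."""
    A = np.eye(len(z))                    # running composition, initially identity on R^S
    b = np.zeros(len(z))                  # running bias term
    h = np.asarray(z, dtype=float)
    for l, (W, c) in enumerate(zip(weights, biases)):
        pre = W @ h + c                   # pre-activation W^(l) z^(l-1) + b^(l)
        q = np.ones_like(pre) if l == len(weights) - 1 else (pre > 0).astype(float)
        A = (q[:, None] * W) @ A          # compose diag(sigma_dot^(l)(omega)) W^(l), cf. (4.2)
        b = q * (W @ b + c)               # same composition applied to the bias, cf. (4.3)
        h = q * pre                       # post-activation feature map z^(l)
    return A, b                           # G(z') = A z' + b for every z' in omega
```

On any other sample z′ drawn from the same region, `A @ z_prime + b` reproduces the forward pass exactly, which can serve as a sanity check of the per-region affine form (4.1).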

As for the slope matrix A_ω, the bias vector b_ω can be obtained either from a sample z ∈ ω or based on the known region activation pattern. Equipped with the above notations, we can now state our first formal result characterizing the image of a DGN regardless of its parameters and training setting.

Theorem 4.1 (Per-region affine subspace)

The image of a generator G employing MASO layers is a continuous piecewise affine surface made of connected polytopes obtained by affine transformations of the polytopes of the input space partition Ω, as in

Im(G) ≜ {G(z) : z ∈ R^S} = ⋃_{ω∈Ω} Aff(ω; A_ω, b_ω),   (4.4)

with Aff(ω; A_ω, b_ω) = {A_ω z + b_ω : z ∈ ω}; we will denote for conciseness G(ω) ≜ Aff(ω; A_ω, b_ω). The volume of a region ω ∈ Ω, denoted by µ(ω), is related to the volume of G(ω) as per µ(G(ω)) = √(det(A_ω^T A_ω)) µ(ω), with A_ω being full-rank.

The above result is pivotal to bridge the understanding of the input space partition Ω, the per-region affine mappings A_ω, b_ω, and the generator's image. We visualize Thm. 4.1 in Fig. 4.1 to make it clear that characterizing A_ω alone already provides tremendous information about the generator. This result also provides a direct answer to the problem of generating disconnected manifolds (or sets) with current DGNs. In a specific GAN setting, this was empirically shown to be impossible [Khayatkhoei et al., 2018, Tanielian et al., 2020], which aligns with Thm. 4.1: given a connected set Z ⊂ R^S, the image G(Z) of any DGN made of a composition of MASOs is always connected, for any depth, width, or parameter settings of W^{(ℓ)}, b^{(ℓ)}, ∀ℓ. We now turn to the study of the DGN intrinsic dimension.

4.2.2 Generated Manifold Angularity

We now study the angularity of the generated surface, i.e., the image of G. Recall (from Thm. 4.1) that the per-region affine subspaces of adjacent regions are continuous and joined at the region boundaries with a certain angle that we now characterize.

Figure 4.2 : The columns represent different widths D^{(ℓ)} ∈ {6, 8, 16, 32} and the rows correspond to repetitions of the learning for different random initializations of the DGNs with consecutive seeds.

Definition 4.1 (Adjacent regions) Two regions ω, ω′ are adjacent whenever they share part of their boundary, as in ∂ω ∩ ∂ω′ ≠ ∅.

The angle between adjacent affine subspaces is characterized by means of the greatest principal angle [Afriat, 1957, Bjorck and Golub, 1973], which is denoted for our study as θ_{ω,ω′}. The following result, demonstrating how to compute such an angle, can be obtained by a direct application of the main result in Sec. 1 of Absil et al. [2006].

Theorem 4.2 (Angularity between adjacent subspaces)

The angle θ_{ω,ω′} between the region mappings of two adjacent (recall Def. 4.1) regions ω, ω′ is given by

sin(θ_{ω,ω′}) = ‖ A_ω (A_ω^T A_ω)^{−1} A_ω^T − A_{ω′} (A_{ω′}^T A_{ω′})^{−1} A_{ω′}^T ‖_2,

assuming that A_ω, ∀ω ∈ Ω, are full-rank.
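As an illustration (a minimal sketch under the full-rank assumption of the theorem; the function name is ours), the angle can be evaluated numerically from the two slope matrices:

```python
import numpy as np

def adjacent_region_angle(A_w, A_wp):
    """Greatest principal angle between span(A_omega) and span(A_omega'), cf. Thm. 4.2."""
    proj = lambda A: A @ np.linalg.inv(A.T @ A) @ A.T          # orthogonal projector onto col(A)
    sin_theta = np.linalg.norm(proj(A_w) - proj(A_wp), ord=2)  # spectral norm = sin(theta)
    return np.arcsin(np.clip(sin_theta, 0.0, 1.0))             # clip guards against round-off
```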

Notice that in the special case of S = 1, the angle is given by the cosine similarity between the vectors A_ω and A_{ω′} of adjacent regions, and when S = D − 1 the angle is given by the cosine similarity between the normal vectors of the (D − 1)-dimensional subspaces spanned by A_ω and A_{ω′}, respectively.

Figure 4.3 : Histograms of the DGN adjacent region angles for DGNs with two hidden layers, S = 16, D = 17 and D = 32 respectively, and varying width D^{(ℓ)} on the y-axis. Three trends to observe: increasing the width increases the bimodality of the distribution while favoring near-0 angles; increasing the output space dimension increases the number of near-orthogonal angles; the A_ω and A_{ω′} of adjacent regions ω and ω′ are highly similar, making most angles smaller than if they were independent (depicted in blue).

We illustrate those angles in a simple case, D = 2 and S = 1, in Fig. 4.2. It can be seen how the angles of the generated surface (in this case, the generated line) follow the curvature of the target manifold when the number of parameters remains small. However, as soon as the width of the DGN increases, overfitting occurs and extremely small angles are introduced in unnecessary parts of the manifold. We can also use the above result to study the distribution of angles of different DGNs with random weights and study the impact of depth and width, as well as the impact of S and D, the latent and output dimensions respectively. Figure 4.3 summarizes the distribution of angles for several different settings, from which we observe two key trends. First, the affine parameters A_ω and A_{ω′} of adjacent regions have very constrained angles that are much smaller than if they were random. This can be justified theoretically by observing (4.2) and recalling from Sec. 3.4 that adjacent regions share most of their activation pattern. Hence, the products of matrices that form A_ω and A_{ω′} are the same except for one (or a few) entries in one (or a few) of the diag(σ̇^{(ℓ)}) matrices. As can be seen in the figure, if the matrices A_ω and A_{ω′} were independent, the produced angle would be much larger. Second, the distribution of those angles depends on the ratio S/D rather than on those values taken independently. In particular, as this ratio gets smaller, the angle distribution becomes bimodal, with the 'medium' angles disappearing and only the extreme angles (small and large) being preserved. This makes the generated manifold 'flatter' overall except in some parts of the space where high angularity is present.

The above experiment demonstrates the impact of width and latent space dimension on the angularity of the DGN output manifold, and how to pick its architecture based on a priori knowledge of the target manifold. Under the often-made assumption that the weights of an overparametrized DGN do not move far from their initialization during training [Li et al., 2018], these results also hint at the distribution of angles after training. We can now turn to the impact of the architecture on the dimensionality of the generated manifold. Combining insights from this section and the next one allows for precise architecture design guidance given a priori knowledge of the manifold curvature and dimension.

4.2.3 Generated Manifold Intrinsic Dimension

We now turn to the intrinsic dimension of the per-region affine subspaces G(ω) that are the pieces forming the generated manifold. In fact, as per (4.4), the dimension of each subspace G(ω) depends not only on the latent dimension S but also indirectly on the per-layer parameters through the forming (by composition) of the slope matrix A_ω. In fact, it is clear from Thm. 4.1 that dim(G(ω)) = rank(A_ω). Now, since A_ω composes multiple matrices as per (4.2), we can leverage the fact that rank(UV) ≤ min(rank(U), rank(V)) (see for example (4.5.2) in Meyer [2000]) to obtain the following.

Proposition 4.1 (Generated manifold intrinsic dimension)

The intrinsic dimension of the affine subspace G(ω) (recall (4.4)) has the following upper bound

dim(G(ω)) ≤ min( S,  min_{ℓ=1,...,L} #{i : [σ̇^{(ℓ)}(ω)]_i ≠ 0, i = 1,...,D^{(ℓ)}},  min_{ℓ=1,...,L} rank(W^{(ℓ)}) ),   (4.5)

where the middle term counts, per layer, the number of units with nonzero activation function derivative, and # represents the cardinality operator.

We make two observations. First, we see that the choice of the nonlinearity (i.e., the choice of α) and/or the choice of the layer widths D^{(ℓ)}, ∀ℓ, are the key elements controlling the upper bound on dim(G(ω)). For example, in the case of ReLU (α = 0), dim(G(ω)) is directly impacted by the number of 0s in the activation patterns σ̇^{(ℓ)}(ω) of each layer, in addition to the rank of W^{(ℓ)}; this sensitivity does not occur when using other nonlinearities (α ≠ 0). Second, "bottleneck layers" (layers with width D^{(ℓ)} smaller than the other layers) directly impact the dimension of the subspace and thus should be carefully employed based on a priori knowledge of the target manifold intrinsic dimension. In particular, if α ≠ 0 and the weights are non-degenerate (such as at random initialization), then the matrices W^{(ℓ)} are almost surely full-rank; those two cases correspond to

dim(G(ω)) ≤ min(S, min_ℓ D^{(ℓ)}, min_ℓ rank(W^{(ℓ)}))   (α ≠ 0),
dim(G(ω)) ≤ min(S, min_ℓ D^{(ℓ)})   (α ≠ 0, W^{(ℓ)} full-rank ∀ℓ).
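As a quick numerical companion to Prop. 4.1 (an illustration only; the helper names, and the assumption that the activation patterns are available as 0/1 vectors, are ours), the exact per-region intrinsic dimension and the right-hand side of (4.5) can be evaluated as follows. A_ω can be obtained, for instance, with the region_affine_params sketch given after (4.3).

```python
import numpy as np

def region_intrinsic_dimension(A_w, tol=1e-8):
    """dim(G(omega)) = rank(A_omega), cf. Thm. 4.1 and Prop. 4.1."""
    return int(np.linalg.matrix_rank(A_w, tol=tol))

def dimension_upper_bound(weights, activation_patterns, S):
    """Right-hand side of (4.5): min of the latent dimension, the per-layer count
    of nonzero activation derivatives, and the per-layer weight ranks."""
    nonzero = min(int(np.count_nonzero(q)) for q in activation_patterns)
    ranks = min(int(np.linalg.matrix_rank(W)) for W in weights)
    return min(S, nonzero, ranks)
```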

We now exploit the above formula describing the intrinsic dimension of the generated polytopes to gain further insights into dropout and dropconnect, two noise techniques used to increase generalization by acting on the layer units or weights.

4.2.4 Effect of Dropout/Dropconnect

Noise techniques, such as dropout [Wager et al., 2013] and dropconnect [Wan et al., 2013, Isola et al., 2017], alter the per-region affine mapping in a very particular way. Those techniques perform a Hadamard product of samples from iid Bernoulli random variables against the feature maps (dropout) or against the layer weights (dropconnect); we denote a DGN equipped with such a technique and given a noise realization by G_ε, where ε includes the noise realizations of all layers; note that G_ε now has its own input space partition Ω_ε. For dropout, ε ≜ {ε^{(ℓ)} ∈ {0, 1}^{D^{(ℓ)}}, ℓ = 1,...,L}, leading to the mapping

G_ε(z) = ( ∏_{ℓ=0}^{L−1} diag(σ̇^{(L−ℓ)}(ω) ⊙ ε^{(L−ℓ)}) W^{(L−ℓ)} ) z
       + ∑_{ℓ=1}^{L} [ ∏_{i=0}^{L−ℓ−1} diag(σ̇^{(L−i)}(ω) ⊙ ε^{(L−i)}) W^{(L−i)} ] diag(σ̇^{(ℓ)}(ω)) b^{(ℓ)},

where ⊙ is the Hadamard product and z ∈ ω. As opposed to dropout, which applies the binary noise to the feature maps, dropconnect applies the binary noise to the slope matrices W^{(ℓ)}, with ε ≜ {R^{(ℓ)} ∈ {0, 1}^{D^{(ℓ)}×D^{(ℓ−1)}}, ℓ = 1,...,L}, leading to the mapping

G_ε(z) = ( ∏_{ℓ=0}^{L−1} diag(σ̇^{(L−ℓ)}(ω)) (W^{(L−ℓ)} ⊙ R^{(L−ℓ)}) ) z
       + ∑_{ℓ=1}^{L} [ ∏_{i=0}^{L−ℓ−1} diag(σ̇^{(L−i)}(ω)) (W^{(L−i)} ⊙ R^{(L−i)}) ] diag(σ̇^{(ℓ)}(ω)) b^{(ℓ)}.

Those techniques have been extensively employed in classification settings as they increase generalization performance. In that specific setting, it was shown that adding dropout/dropconnect to a DN classifier turns the network into an ensemble of classifiers [Warde-Farley et al., 2013, Baldi and Sadowski, 2013, Bachman et al., 2014, Hara et al., 2016]. We can formally extend those results to the case of DGNs as follows.

Figure 4.4 : DGN with dropout trained (GAN) on a circle dataset (blue dots); dropout turns a DGN into an ensemble of DGNs (each dropout realization is drawn in a different color).

Proposition 4.2 (Dropout/dropconnect and ensemble of DGNs)

Adding dropout/dropconnect to a DGN G produces a (finite) ensemble of generators {G_ε, ∀ε}, each with per-region intrinsic dimension

0 ≤ max_{ω∈Ω_ε} dim(G_ε(ω)) ≤ max_{ω∈Ω} dim(G(ω)),   ∀ε;

those bounds are tight.
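To illustrate Prop. 4.2 numerically (an illustrative sketch only; the ReLU generator, the Bernoulli mask sampling and all names are our assumptions, and the usual 1/(1 − rate) rescaling of inverted dropout is omitted since it does not affect the rank), one can sample a dropout realization ε and compare the rank of the noisy slope A^ε_ω with that of the noiseless one:

```python
import numpy as np

def dropout_region_slope(weights, biases, z, rate, seed=0):
    """One dropout realization applied to a ReLU generator: returns the noisy
    per-region slope A^eps_omega and its rank (intrinsic dim. of G_eps(omega))."""
    rng = np.random.default_rng(seed)
    A = np.eye(len(z))
    h = np.asarray(z, dtype=float)
    for l, (W, c) in enumerate(zip(weights, biases)):
        pre = W @ h + c
        last = l == len(weights) - 1
        q = np.ones_like(pre) if last else (pre > 0).astype(float)
        eps = np.ones_like(pre) if last else rng.binomial(1, 1 - rate, size=pre.shape).astype(float)
        A = ((q * eps)[:, None] * W) @ A     # diag(sigma_dot ⊙ eps) W, as in the dropout mapping
        h = q * eps * pre                    # dropped-out feature map
    return A, int(np.linalg.matrix_rank(A))
```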

We illustrate the mixture of models in Fig. 4.4. By leveraging the above and Thm. 4.1 we can highlight a potential limitation of those techniques for narrow models (D^{(ℓ)} ≈ S). In this case, it is highly likely that the noisy generators G_ε will, for some noise realizations, have a per-region intrinsic dimension much smaller than S, and thus will make training unstable. For example, in the extreme case of G_ε having latent dimension 1, the internal layer weights will be adapted to produce a continuous piecewise linear line through the data. On the other hand, when used with wide DGNs (D^{(ℓ)} ≫ S), the noise-induced generators will maintain an intrinsic dimension closer to that of the original generator and thus provide a beneficial ensemble of DGNs, which helps learning and thus generalization. Formally, the two above comments translate to the following statement:

lim_{D^{(ℓ)}→S} #{ G_ε  s.t.  max_{ω∈Ω_ε} dim(G_ε(ω)) ≥ S, ∀ε } = 1.

Figure 4.5 : Impact of dropout and dropconnect on the intrinsic dimension of the noise-induced generators for two "drop" probabilities, 0.1 and 0.3, and for a generator G with S = 6, D = 10, L = 3 with varying width D^{(1)} = D^{(2)} ranging from 6 to 48 (x-axis; y-axis: dim(G_ε(ω))). The boxplot represents the distribution of the per-region intrinsic dimensions over 2000 sampled regions and 2000 different noise realizations. Recall that the intrinsic dimension is upper bounded by S = 6 in this case. Two key observations: first, dropconnect tends to produce DGNs whose intrinsic dimension preserves the latent dimension (S = 6) even for narrow models (D^{(1)}, D^{(2)} ≈ S), as opposed to dropout, which tends to produce DGNs with much smaller intrinsic dimension than S. As a result, if the DGN is much wider than S, both techniques can be used, while for narrow models, either none or dropconnect should be preferred.

We empirically demonstrate the impact of dropout on the dimension of the DGN surface in Fig. 4.5; clearly, one must adapt the dropout rate to the layer widths and ensure that the probability of a noisy DGN G_ε having a degenerate intrinsic dimension (smaller than the desired dimension) remains low.

From the above analysis we see that one should carefully consider the use of dropout/dropconnect based on the type of generator architecture that is used and the desired generator intrinsic dimension. As dropout has a more dramatic effect in collapsing the intrinsic dimension of the DGN manifold, we further study it in a realistic setting. We experiment with a Deep Autoencoder on a simple reconstruction task on multiple datasets in Fig. 4.6 and demonstrate that, unless correctly set up, applying dropout blindly at each layer negatively impacts test set performance.

Figure 4.6 (columns: MNIST, Fashion-MNIST, Arabic characters, SVHN, CIFAR) : Deep Autoencoder experiment when equipping the DGN (decoder) with dropout, where we employ the following MLP: S = D^{(1)} = D^{(2)} = 32 and D^{(3)} = D^{(4)} = 1024, D^{(5)} = D; the test set reconstruction error is displayed for multiple datasets and training settings. The architecture purposefully maintains a narrow width for the first two layers to highlight that in those cases dropout is detrimental regardless of the dropout rate. We compare applying dropout to all layers (black line) versus applying dropout only on the last two (wide) layers (blue line). We see that unless the dropout rate is adapted to the layer width and desired intrinsic dimension, the test set performance is negatively impacted by dropout. The exact rate reaching the best test set performance when employing dropout only on wide layers is shown with a green arrow. The exact values for each graph are given in Table 4.1.

Table 4.1 : Test set reconstruction error for varying dropout rates as displayed in Fig. 4.6, for different datasets, and when applying dropout on all layers or only on wide enough layers. We see that it is crucial to adapt the dropout rate to the layer width, as otherwise the test error only increases when employing dropout.

Dropout proba.                  0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
CIFAR10        drop. on all     1.88  2.80  4.27  4.95  5.16  4.46  4.41  4.20  5.09  6.18
               drop. on wide    1.92  1.95  1.82  1.93  2.01  2.10  2.25  2.46  2.93  4.22
MNIST          drop. on all     1.05  1.87  3.41  3.65  3.83  4.11  5.38  6.04  6.03  6.70
               drop. on wide    1.05  0.91  0.94  1.01  1.05  1.15  1.26  1.58  2.13  3.60
SVHN           drop. on all     0.67  1.46  2.79  3.97  4.11  4.05  3.80  2.45  2.99  5.07
               drop. on wide    0.63  0.63  0.72  0.64  0.75  1.17  1.42  1.81  1.88  2.14
fashion MNIST  drop. on all     1.09  1.84  3.35  4.57  4.13  3.74  3.69  5.34  6.00  8.22
               drop. on wide    1.09  1.09  1.09  1.16  1.22  1.33  1.45  1.64  2.00  2.85
ARABIC         drop. on all     3.00  3.25  4.05  4.97  5.27  5.38  5.88  6.53  6.62  7.21
               drop. on wide    3.07  2.66  2.58  2.54  2.53  2.66  2.78  2.96  3.48  4.83

We also propose in Fig. 4.7 a simple scheme guiding the choice of the dropout rate based on the layer width and the desired intrinsic dimension.

Figure 4.7 : Probability (0: blue, 1: red) that dropout maintains the intrinsic dimension (red line; left: 32, right: 64) as a function of the dropout rate (x-axis; the probability of setting a unit to 0) and of the layer's width (y-axis), with the 95% and 99% levels shown as solid and dashed black lines respectively. We see that when the layer's width is close to the desired intrinsic dimension, no dropout should be applied, and that for a dropout rate of 0.5, the layer must be at least two times wider than the desired intrinsic dimension.

4.3 Per-Region Affine Mapping Interpretability and Manifold Tangent Space

We now turn to the study of the local coordinates of the affine mappings comprising a DGN’s generated manifold.

4.3.1 Per-Region Mapping as Local Coordinate System and Disentanglement

Recall from (4.4) that a DGN is a CPA operator. Inside a region ω ∈ Ω, points are mapped to the output affine subspace, which is itself governed by a coordinate system or basis (A_ω) that we assume to be full-rank for any ω ∈ Ω throughout this section. The affine mapping is performed locally for each region ω, in a manner similar to an "adaptive basis" [Donoho et al., 1994]. In this context, we aim to characterize the subspace basis in terms of disentanglement, i.e., the alignment of the basis vectors with respect to each other. While there is no unique definition of disentanglement, a general consensus is that a disentangled basis should provide a "compact" and interpretable latent representation z for the associated x = G(z). In particular, it should ensure that a small perturbation of the dth dimension (d = 1,...,S) of z implies a transformation independent from a small perturbation of d′ ≠ d [Schmidhuber, 1992, Bengio et al., 2013]. That is, ⟨G(z) − G(z + δ_d), G(z) − G(z + δ_{d′})⟩ ≈ 0, with δ_d a one-hot vector at position d and of length S [Kim and Mnih, 2018]. A disentangled representation is thus considered to be most informative, as each latent dimension implies a transformation that leaves the others unchanged [Bryant and Yarnold, 1995].

Figure 4.8 (columns: FC GAN, CONV GAN, FC VAE, CONV VAE; rows: learned, initial) : Visualization of a single basis vector [A_ω]_{·,k} before and after learning, obtained from a region ω containing the digits 7, 5, 9, and 0 respectively per column, for GAN and VAE models made of fully connected or convolutional layers. We observe how those basis vectors encode, respectively, a right rotation, a cedilla extension, a left rotation, and an upward translation; studying the columns of A_ω provides interpretability into the learned DGN affine parameters and the underlying data manifold.

We are now able to provide the first condition relating the per-region basis A_ω to the concept of disentanglement. A necessary condition for disentanglement is to have "near orthogonal" columns, i.e., ⟨[A_ω]_{·,i}, [A_ω]_{·,j}⟩ ≈ 0, ∀i ≠ j, ∀ω ∈ Ω. We provide in Fig. 4.8 visuals of the basis vectors of four different DGNs trained on the MNIST dataset with S = 10. From this, we see how, by leveraging the fact that A_ω is the (local) basis of the DGN, inspecting its columns provides direct visuals to understand the transformations encoded in each latent space dimension. This type of visualization is crucial not only for latent space dimension inspection but also to discover new ways to move around the data manifolds and thus perform controlled data generation, a key challenge in GANs and VAEs [Zhao et al., 2017, Huang et al., 2018b].
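As a simple numerical proxy for the necessary condition above (our own illustrative measure, not one defined in the thesis), one can inspect the pairwise cosine similarities between the columns of A_ω:

```python
import numpy as np

def max_column_coherence(A_w):
    """Largest |cosine similarity| between distinct columns of A_omega; values
    near 0 indicate the near-orthogonality required for disentanglement."""
    U = A_w / np.linalg.norm(A_w, axis=0, keepdims=True)   # unit-norm columns
    G = U.T @ U                                            # Gram matrix of cosines
    return float(np.max(np.abs(G - np.diag(np.diag(G)))))  # largest off-diagonal entry
```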

4.3.2 Tangent Space Regularization

In the previous section, we highlighted the link between the per-region slope matrix A_ω and the DGN tangent space for any DGN input z ∈ ω: the columns of A_ω span the DGN tangent space of that region. We now further leverage this finding to provide a novel and motivated regularization of DGNs.

Regularizations in DGNs have always been driven by the principle that the DGN must be a contractive mapping. That is, highly similar inputs should produce highly similar outputs. From this, a wave of techniques has been derived involving penalty terms constraining the input-output mapping to be similar when adding noise to the DGN inputs [Vincent et al., 2008, Rifai et al., 2011, Teng and Choromanska, 2019]. Such regularizations however do not leverage the geometry of the data; in fact, those regularizations exploit very little information about the training samples: only their positions in the ambient space. We propose instead to use a "richer" regularization based on the tangent space of the samples. This allows us to explicitly incorporate the geometry of the data manifold into the DGN through a novel regularization: for each region G(ω) in which there exists a training sample, and where estimation of the tangent space of the data manifold is possible, we constrain A_ω to be a basis of that tangent space. In practice, and as is commonly done, we employ a k-NN algorithm to estimate the tangent space around a sample x [Ma et al., 2010, Deng et al., 2020]. The number of neighbors, which we denote here as T, defines the dimensionality of the estimated tangent space. The basis of the estimated tangent space around x is denoted by T_x and is given by

T_x ≜ (x − x_1, ..., x − x_T),

where x_1, ..., x_T are the T nearest neighbors of x in the training set. The DGN tangent space will be aligned with the estimated data tangent space if and only if they both span the same subspace. One measure of alignment between subspaces is given by the following matrix norm: R(x, A_ω) = ‖ A_ω (A_ω^T A_ω)^{−1} A_ω^T − T_x (T_x^T T_x)^{−1} T_x^T ‖_2 [Bjorck and Golub, 1973, Miao and Ben-Israel, 1992]. The latter is 0 whenever the subspaces spanned by A_ω and T_x are included in each other. Our regularization thus takes the following simple form, given a training set X:

∑_{x∈X} ∑_{ω∈Ω} 1_{x∈G(ω)} R(x, A_ω),   (4.6)

and we simply add this regularization term, weighted by a constant λ, to the employed training loss of any DGN to tilt the DGN tangent space to align as well as possible with the estimated data tangent space. We provide in Fig. 4.9 and Table 4.2 the impact of our proposed regularization technique on various datasets and training settings. We observe that the speed of convergence during training is not impacted, however the generalization capacity of DGNs employing our regularization is greatly increased.
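A minimal sketch of the penalty R(x, A_ω) and its k-NN tangent estimate (an illustrative NumPy version with names of our choosing; the pseudo-inverse is used instead of a plain inverse as a small robustness hedge, and the term would be added, weighted by λ, to whatever training loss is employed):

```python
import numpy as np

def tangent_alignment_penalty(A_w, x, X_train, T=16):
    """R(x, A_omega) from (4.6): spectral norm of the difference between the
    projectors onto span(A_omega) and onto the k-NN estimated data tangent T_x."""
    d = np.linalg.norm(X_train - x, axis=1)
    neighbors = X_train[np.argsort(d)[1:T + 1]]             # skip x itself when it is in X_train
    Tx = (x - neighbors).T                                  # columns x - x_t, as in the definition
    proj = lambda B: B @ np.linalg.pinv(B.T @ B) @ B.T      # orthogonal projector
    return np.linalg.norm(proj(A_w) - proj(Tx), ord=2)      # spectral norm
```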

This result demonstrates that even for "simple" datasets such as MNIST, which are relatively simple image manifolds compared to high-resolution realistic images, DGNs fail to correctly align with the data manifold, and instead only pass through the training samples but with a different tangent space. We believe that there are many avenues to further study this regularization term and derive generalization guarantees from it; however, we leave that for future work and first consider a last section focusing on DGNs employing probability densities on their input space.

Figure 4.9 (columns: MNIST, SVHN, CIFAR10; y-axis: log(reconstruction error), x-axis: log(epochs)) : Test set reconstruction error during training for each epoch for a baseline unconstrained Deep AutoEncoder (black line) and for the tangent space regularized DGN (decoder) from (4.6) with varying regularization coefficient λ (colored lines), for three datasets (per column) and with S = 128, T = 16 (top) and S = 32, T = 16 (bottom). We observe that by constraining the tangent space basis A_ω to span the data tangent space for each region ω containing training samples, the manifold fitting is improved, leading to better test sample reconstruction.

4.4 Density on the Generated Manifold

The study of DGNs would not be complete without considering that the latent space is often equipped with a density distribution p_z from which z is sampled, in turn leading to the sampling of G(z); we now study this density and its properties.

Table 4.2 : Test set reconstruction error averaged over 3 runs when employing the tangent space regularization (4.6) on various datasets with a Deep Autoencoder, when varying the weight of the regularization term (λ), the latent space dimension (S), and the number of neighbors used to estimate the data tangent space (T). We see that the proposed regularization effectively improves generalization performance in all cases, even for complicated and high-dimensional datasets such as CIFAR10 where the data tangent space estimation becomes more challenging. This also demonstrates that DGNs trained only to reconstruct the data samples do not align correctly with the underlying data manifold tangent space.

                                  T = 8                             T = 16
λ =                    0.00  0.01  0.1   1     10       0.00  0.01  0.1   1     10
S = 32   mnist         1.57  1.57  1.51  1.31  1.10     1.57  1.57  1.51  1.32  1.10
         fashionmnist  1.53  1.53  1.50  1.30  1.20     1.53  1.54  1.50  1.31  1.20
         svhn          0.82  0.85  0.89  0.82  0.71     0.82  0.86  0.88  0.90  0.66
         cifar10       2.41  2.42  2.51  2.17  1.74     2.48  2.46  2.47  2.09  1.65
S = 128  mnist         1.34  1.34  1.30  1.16  1.08     1.34  1.34  1.30  1.17  1.07
         fashionmnist  1.38  1.43  1.34  1.28  1.16     1.38  1.44  1.35  1.29  1.15
         svhn          0.87  1.07  1.00  0.70  0.78     0.87  0.59  1.26  0.59  0.74
         cifar10       2.85  1.97  2.14  2.05  1.54     2.87  1.87  2.02  1.97  1.42

4.4.1 Analytical Output Density

Given a distribution p_z over the latent space, we can explicitly compute the output distribution after the application of G, which leads to an intuitive result exploiting the piecewise affine property of the generator. For the remainder of our study we assume A_ω to be full-rank.

Theorem 4.3 (Output density)

The generator probability density p_G(x), given p_z and a bijective generator G, is given by

p_G(x) = ∑_{ω∈Ω} ( p_z(G_ω^{−1}(x)) / √(det(A_ω^T A_ω)) ) 1_{x∈G(ω)}.

That is, the distribution obtained in the output space naturally corresponds to a piecewise affine transformation of the original latent space distribution, weighted by the change in volume of the per-region mappings.
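The per-region volume factor of Thm. 4.3 (the quantity whose distribution is shown in Figs. 4.10 and 4.11) is cheap to evaluate; a minimal sketch assuming a full-rank A_ω:

```python
import numpy as np

def log_volume_change(A_w):
    """log sqrt(det(A_omega^T A_omega)), the per-region change of volume in Thm. 4.3."""
    _, logabsdet = np.linalg.slogdet(A_w.T @ A_w)  # stable log-determinant
    return 0.5 * logabsdet
```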

Figure 4.10 (columns ordered from a highly multimodal, low-entropy data distribution to a unimodal, high-entropy one; bottom-row x-axis: log(√(det(A_ω^T A_ω))), y-axis: histogram over regions ω) : Distribution of the per-region log-determinants (bottom row) for DGNs trained on a data distribution with varying per-mode variance (blue points, first row). The estimated data distribution is depicted through the red samples. We clearly observe the tight relationship between the multimodality and Shannon entropy of the data distribution to be approximated and the distribution of the per-region determinants of A_ω. That is, as the DGN tries to approximate a data distribution with high multimodality and low Shannon entropy, the per-region slope matrices A_ω have increasingly large singular values, in turn synonymous with exploding per-layer weights and thus training instabilities (recall Thm. 4.1).

4.4.2 On the Difficulty of Generating Low Entropy/Multimodal Distributions

We conclude this study by hinting at the possible main cause of instabilities encountered when training DGNs on multimodal densities or other atypical cases.

We demonstrated in Thm. 4.3 that the product of the nonzero singular values of A_ω plays the central role in concentrating or dispersing the density on G(ω). Even when considering a simple mixture-of-Gaussians case, it is clear that the standard deviations of the modes and the inter-mode distances will put constraints on the singular values of the slope matrices A_ω, in turn stressing the parameters W^{(ℓ)} as they compose the slope matrices. We highlight this tight relationship between the per-layer matrices W^{(ℓ)} and the overall per-region matrix A_ω, and how their singular values are tied, in Fig. 4.11. This problem emerges from the continuity of DGNs, which have to somehow connect the different modes in the output space in a way that produces near-0 probability in between them. We highlight this in Fig. 4.10, where we trained a GAN DGN on two Gaussians for different multimodality settings.

Figure 4.11 (panels: σ_1 = 0, σ_1 = 1, σ_1 = 2, each with σ_2 ∈ {1, 2, 3}) : Distribution of log(√(det(A_ω^T A_ω))) for 2000 regions ω with a DGN with L = 3, S = 6, D = 10 and weights initialized with Xavier; then, half of the weights' coefficients (picked randomly) are rescaled by σ_1 and the other half by σ_2. We observe that a greater variance of the weights increases both the spread and the mean of the log-determinant distribution.

4.5 Discussions

In conclusion, we demonstrated how the spline formulation of DGNs offers a rich portal to obtain theoretical results and understanding that can be evaluated empirically in a tractable manner through standard tools from linear algebra. We particularly focused on the relationship between a DGN's intrinsic dimension and its performance, on the role of dropout and dropconnect in helping or hurting performance along with a design recipe for practitioners, and on how to derive new, theoretically grounded regularization techniques improving performance. In addition, we demonstrated that interpretability, a key challenge in deep learning, is also within the reach of the spline formulation of DGNs, where we particularly focused on the tangent space basis of DGNs and its role in disentanglement and controlled sample generation, and finally on the crucial problem of mode collapse and training instabilities with multimodal target densities, where we demonstrated that CPA DGNs are by construction inclined to produce instabilities in those settings.

Chapter 5

Expectation-Maximization for Deep Generative Networks

5.1 Introduction

Deep Generative Networks (DGNs), which map a low-dimensional latent variable z to a higher-dimensional generated sample x, are the state-of-the-art methods for a range of machine learning applications, including anomaly detection, data generation, likelihood estimation, and exploratory analysis across a wide variety of datasets [Blaauw and Bonada, 2016, Inoue et al., 2018, Liu et al., 2018, Lim et al., 2018]. While we proposed a thorough geometrical study of DGNs in all generality in Chap. 4, we now go a step further and exploit the composition-of-MASOs formulation to provide a novel training solution.

5.1.1 Related Works

Training of DGNs roughly falls into two camps: (i) by leveraging an adversarial network, as in a Generative Adversarial Network (GAN) [Goodfellow et al., 2014], to turn the method into an adversarial game; and (ii) by modeling the latent and observed variables as random variables and performing some flavor of likelihood maximization training. A widely used solution to likelihood-based DGN training is the Variational Autoencoder (VAE) [Kingma and Welling, 2013]. The popularity of the VAE is due to its intuitive and interpretable loss function, which is obtained from likelihood estimation, and its ability to exploit standard estimation techniques ported from the probabilistic graphical models literature.

Yet, VAEs offer only an approximate solution for likelihood-based training of DGNs. In fact, all current VAEs employ three major approximation steps in the likelihood maximization process. First, the true (unknown) posterior is approximated by a variational distribution. This estimate is governed by some free parameters that must be optimized to fit the variational distribution to the true posterior. VAEs estimate such parameters by means of an alternative network, the encoder, with the datum as input and the predicted optimal parameters as output. This step is referred to as Amortized Variational Inference (AVI), as it replaces the explicit, per-datum optimization by a single deep network (DN) pass. Second, as in any latent variable model, the complete likelihood is estimated by a lower bound (ELBO) obtained from the expectation of the likelihood taken under the posterior or variational distribution. With a DGN, this expectation is unknown, and thus VAEs estimate the ELBO by Monte-Carlo (MC) sampling. Third, the maximization of the MC-estimated ELBO, which drives the parameters of the DGN to better model the data distribution and the encoder to produce better variational parameter estimates, is performed by some flavor of gradient descent (GD).

These VAE approximation steps enable rapid training and test-time inference of DGNs. However, due to the lack of analytical forms for the posterior, the ELBO, and explicit (gradient-free) parameter updates, it is not possible to measure the quality of the above steps or to effectively improve them. Since the true posterior and expectation are unknown, current VAE research roughly falls into three camps: (i) developing new and more complex output and latent distributions [Nalisnick and Smyth, 2016, Li and She, 2017], such as the truncated distribution; (ii) improving the various estimation steps by introducing complex MC sampling with importance re-weighted sampling [Burda et al., 2015]; (iii) providing different estimates of the posterior with moment matching techniques [Dieng and Paisley, 2019, Huang et al., 2019]. More recently, Park et al. [2019] exploited the special continuous piecewise affine structure of current ReLU DGNs to develop an approximation of the posterior distribution based on mode estimation and DGN linearization, leading to Laplacian VAEs. Nevertheless, derivation of analytical DGN distributions was not considered.

Variational Expectation-Maximization. A Probabilistic Graphical Model (PGM) combines probability and graph theory into an organized data structure that expresses the relationships between a collection of random variables: the observed variables collected into x and the latent, or unobserved, variables collected into z [Jordan, 2003]. The parameters θ that govern the PGM probability distributions are learned from observations x_i ∼ x, i = 1,...,N, requiring estimation of the unobserved z_i, ∀i. This inference-optimization is commonly done with the Expectation-Maximization (EM) algorithm [Dempster et al., 1977].

The EM algorithm consists of (i) estimating each z_i from the expectation of the complete log-density taken with respect to the posterior distribution under the current parameters at time t; and (ii) maximizing the estimated complete log-likelihood to produce the updated parameters θ^{t+1}. The estimated complete log-likelihood obtained from the E-step is a tight lower bound to the true complete log-likelihood; this lower bound is maximized in the M-step. This process has many attractive theoretical properties, including guaranteed convergence to a local maximum of the likelihood [Koller and Friedman, 2009].

In the absence of a closed-form or tractable posterior, an alternative (non-tight) lower bound can be obtained by using a variational distribution instead. This distribution is governed by parameters γ that are optimized to make this distribution as close as possible to the true posterior. This process results in a variational E (VE) step [Attias, 2000], or variational inference (VI). The tightness of the lower bound is measured by the Kullback–Leibler (KL) divergence between the variational and true posterior distributions. Minimization of this divergence cannot be done directly (due to the absence of a tractable posterior) but rather indirectly by maximizing the so-called evidence lower bound (ELBO) via

log(p(x)) = E_{q(z|γ)}[log(p(x, z|θ))] + H(q(z|γ)) + KL(q(z|γ) ‖ p(z|x, θ)),   (5.1)

where the first two terms form the ELBO, q is the variational distribution, and H is the (differential) entropy. Maximizing the ELBO with respect to γ produces the γ* that adapts q(z|γ*) to fit as closely as possible to the true posterior. Finally, maximizing the ELBO with respect to the PGM parameters θ provides θ^{t+1}; this can be performed on the entire dataset or on mini-batches [Hoffman et al., 2013].
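For concreteness, a minimal Monte-Carlo estimate of the ELBO in (5.1) under the common Gaussian choices (an illustration only; the decoder function, the diagonal Gaussian variational posterior, and all parameter names are assumptions of this sketch, not the thesis setup):

```python
import numpy as np

def mc_elbo(x, mu, log_var, decode, sigma_x=0.1, n_samples=16, seed=0):
    """MC estimate of (5.1) with q(z|x)=N(mu, diag(exp(log_var))), prior N(0, I),
    and likelihood x|z ~ N(decode(z), sigma_x^2 I); ELBO = E_q[log p(x|z)] - KL."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    z = mu + np.exp(0.5 * log_var) * eps                        # reparametrized samples
    recon = np.array([decode(zi) for zi in z])                  # g(z) for each sample
    log_lik = (-0.5 * np.sum((recon - x) ** 2, axis=1) / sigma_x**2
               - 0.5 * x.size * np.log(2 * np.pi * sigma_x**2))
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)  # KL(q || N(0, I)), closed form
    return log_lik.mean() - kl
```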

Variational AutoEncoders. A Variational AutoEncoder (VAE) uses a minimal probabilistic graphical model (PGM) with just a few nodes but highly nonlinear inter-node relations [Lappalainen and Honkela, 2000, Valpola, 2000]. The use of DNs to model the nonlinear relations originated in Oh and Seung [1998], Ghahramani and Roweis [1999], MacKay and Gibbs [1999] and has been born again with VAEs [Kingma and Welling, 2013]. Many variants have been developed, but the core approach consists of modeling the latent distribution over z with a Gaussian or uniform distribution and then modeling the data distribution as x = g(z) + ε, with ε some noise distribution and g a DGN. Learning the DGN/PGM parameters requires inference of the latent variables z. This inference is performed in VAEs by an amortized VI where a second, encoder DN f produces γ_n^* = f(x_n) from (5.1). Hence, the encoder is fed with an observation x and outputs its estimate of the optimal variational parameters that minimize the KL divergence between the variational distribution and the true posterior. During learning, the encoder adapts to make better estimates f(x_n) of the optimum parameters γ_n. Then, the ELBO is estimated with some flavor of Monte-Carlo (MC) sampling (since its analytical form is not known), and the maximization over the θ parameters is solved iteratively using some flavor of gradient descent.

5.1.2 Contributions

In this chapter, we advance both the theory and practice of DGNs and VAEs by computing the exact analytical posterior and marginal distributions of any DGN employing continuous piecewise affine (CPA) nonlinearities. The knowledge of these distributions enables us to perform exact inference and obtain Expectation-Maximization training of DGNs without resorting to AVI or MC sampling, and to train the DGN in a gradient-free manner with guaranteed convergence.

5.2 Posterior and Marginal Distributions of Deep Generative Networks

We now derive analytical forms of the key DGN distributions by exploiting the CPA property. In Sec. 5.3 we will use this result to derive the EM learning algorithm for DGNs and study the VAE inference approximation versus the analytical one.

Our key insight is that a CPA DGN consists of an implicit latent space partition and an associated per-region affine mapping (recall (3.1)). In a DGN, propagating a latent datum z through the layers progressively builds A_ω, b_ω. We now demonstrate that, by making this region selection process explicit, the analytical DGN marginal and posterior distributions can be obtained.

5.2.1 Conditional, Marginal and Posterior Distributions of Deep Generative Networks

Throughout the sequel we will consider the commonly employed case of a centered Gaussian latent prior and centered Gaussian noise [Zhang et al., 2018a], as

p(x|z) = φ(x; g(z), Σ_x),   p(z) = φ(z; 0, Σ_z),   (5.2)

with φ the multivariate Gaussian density function with given mean and covariance matrix [DeGroot and Schervish, 2012]. When using CPA DGNs, the generator mapping is continuous and piecewise affine with an underlying latent space partition and per-region mapping as in (3.1). We can thus obtain the analytical form of the conditional distribution of x given the latent vector z as follows.

Lemma 5.1 (Conditional probability)

The DGN conditional distribution is given by p(x|z) = ∑_{ω∈Ω} 1_{z∈ω} φ(x; A_ω z + b_ω, Σ_x), with per-region parameters from (4.2) and (4.3).

This type of data modeling is closely related to MPPCA [Tipping and Bishop, 1999a], which combines multiple PPCAs [Tipping and Bishop, 1999b], and to MFA [Ghahramani et al., 1996, Hinton et al., 1997], which combines multiple factor analyzers [Harman, 1976]. The associated PGMs represent the data distribution with R components and leverage an explicit categorical distribution t ∼ Cat(π), leading to the conditional input distributions x|(z, t) = ∑_{r=1}^{R} 1_{r=t} (W_r z + b_r) + ε, with W_r, b_r denoting the per-component affine parameters, with Σ_x diagonal (MPPCA) or fully occupied (MFA), and z ∼ N(µ_z, Σ_z). Note, however, that neither MPPCA nor MFA impose continuity in the (t, z) ↦ x mapping, as opposed to a DGN. To formalize this, consider an (arbitrary) ordering of the DGN latent space regions as ω_1, ..., ω_R, with R = card(Ω). We also denote by Φ_{ω_r} the cumulative density function on ω_r (the integral of the density function over ω_r).

Proposition 5.1 (MPPCA)

A DGN with distributions given by (5.2) corresponds to a continuous MPPCA (or MFA) model with implicit categorical variable given by p(t = r) = Φ_{ω_r}(0, Σ_z), W_r = A_{ω_r}, b_r = b_{ω_r}, R = card(Ω), and Σ_x = σI (or full Σ_x).

Note that this result generalizes the results of Lucas et al. [2019], Park et al. [2019], which showed that shallow DGNs and deep linear DGNs fall back to a PPCA model. This can be easily seen from the formula in Lemma 5.1 by setting the DGN g to be linear, as in g(z) = Wz + b + ε; in that case, the partition is made of a single region (the entire DGN input space), and the (single) affine parameters are A_ω = W, b_ω = b. We now calculate the marginal p(x) and posterior p(z|x) distributions. The former will be of use to compute the likelihood, while the latter will enable us to derive the analytical E-step in the next section.

Theorem 5.1 (Marginal and posterior distribution)

The marginal and posterior distributions of a CPA DGN are given by

p(x) = ∑_{ω∈Ω} φ(x; b_ω, Σ_x + A_ω Σ_z A_ω^T) Φ_ω(µ_ω(x), Σ_ω),   (5.3)

p(z|x) = p(x)^{−1} ∑_{ω∈Ω} 1_{z∈ω} φ(x; b_ω, Σ_x + A_ω Σ_z A_ω^T) φ(z; µ_ω(x), Σ_ω),   (5.4)

with µ_ω(x) = Σ_ω ( A_ω^T Σ_x^{−1} (x − b_ω) ),  and  Σ_ω = ( Σ_z^{−1} + A_ω^T Σ_x^{−1} A_ω )^{−1}.   (5.5)
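As an illustration of (5.5) (a sketch only; the function and variable names are ours), the region-specific posterior mean and covariance can be formed directly from A_ω, b_ω and the two covariances:

```python
import numpy as np

def region_posterior_params(A_w, b_w, x, Sigma_x, Sigma_z):
    """mu_omega(x) and Sigma_omega from (5.5) for one region omega."""
    Sx_inv = np.linalg.inv(Sigma_x)
    Sigma_w = np.linalg.inv(np.linalg.inv(Sigma_z) + A_w.T @ Sx_inv @ A_w)
    mu_w = Sigma_w @ (A_w.T @ Sx_inv @ (x - b_w))   # bias removed, backprojected, whitened
    return mu_w, Sigma_w
```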

We demonstrate how to compute the integral of a multivariate Gaussian on a polytopal domain (Φ_ω(µ_ω(x), Σ_ω)) in the next section. Note that for both the marginal and the posterior distribution, when considering a specific region ω ∈ Ω, those distributions are parametrized by a region-specific mean µ_ω(x) and covariance Σ_ω that we can interpret. For that purpose, consider Σ_x = I, Σ_z = I, to obtain µ_ω(x) = (I + A_ω^T A_ω)^{−1} A_ω^T (x − b_ω). That is, the bias of the per-region affine mapping is removed from the input, which is then mapped back to the latent space via A_ω^T and whitened by the "regularized" inverse of the correlation matrix of A_ω. Note that A_ω^T backpropagates the signal from the output to the latent space in the same way that gradients are backpropagated during gradient learning in a DN. We further highlight that the specific form of the posterior is a mixture of truncated Gaussians [Horrace, 2005], a truncated Gaussian being a Gaussian distribution for which the domain R^S has been constrained to a (convex) sub-domain, ω in our case. In most practical cases, the variational distribution is taken to be unimodal (Gaussian). However, as per the analytical posterior that we obtain, a unimodal variational distribution cannot capture the multimodality of the true posterior, leading to a poor variational EM step. Based on our result, practitioners should thus favor as much as possible multimodal variational distributions, for instance by employing a mixture of Gaussians for q(z|γ) (recall (5.1)), as in Tomczak and Welling [2017a].

5.2.2 Obtaining the DGN Partition

In order to streamline our development, we leverage a simplified version of (4.2) where we denote diag(σ̇^{(ℓ)}(ω)) by Q^{(ℓ)}(ω); oftentimes we will refer to those diagonal matrices simply as Q^{(ℓ)}, but it should be clear that their actual configuration depends on the considered region of the DN input space partition. We thus obtain the up-to-layer-ℓ affine parameters

A_ω^{1→ℓ} ≜ W^{(ℓ)} Q_ω^{(ℓ−1)} W^{(ℓ−1)} · · · Q_ω^{(1)} W^{(1)},

b_ω^{1→ℓ} ≜ b^{(ℓ)} + ∑_{i=1}^{ℓ−1} W^{(ℓ)} Q_ω^{(ℓ−1)} W^{(ℓ−1)} · · · Q_ω^{(i)} b^{(i)},   (5.6)

producing the pre-activation feature maps h^{(ℓ)}(z) ∈ R^{D^{(ℓ)}} by h^{(ℓ)}(z) = A_ω^{1→ℓ} z + b_ω^{1→ℓ}, with A_ω^{1→ℓ} ∈ R^{D^{(ℓ)}×S} and b_ω^{1→ℓ} ∈ R^{D^{(ℓ)}}. Note that we have, in particular, A_ω^{1→L} = A_ω and b_ω^{1→L} = b_ω.

Figure 5.1 (panels: Init., Steps 1–4) : Recursive partition discovery for a DGN with S = 2 and L = 2, starting with an initial region obtained from a sampled latent vector z (Init.). By walking on the faces of this region, neighboring regions sharing a common face are discovered (Step 1). Recursively repeating this process until no new region is discovered (Steps 2–4) provides the DGN latent space partition (left panel).

Corollary 5.1 (Partition region H-representation)

The polyhedral region ω is given by

ω = ⋂_{ℓ=1}^{L−1} { z ∈ R^S : A_ω^{1→ℓ} z < −Q^{(ℓ)}(ω) b_ω^{1→ℓ} },

where the inequality is applied elementwise.

The above result tells us that the pre-activation signs indicate on which side of each hyperplane the region ω is located, which provides a direct way to compute the H-representation of ω. To obtain the entire partition Ω, we propose a recursive scheme that starts from an initial region (or sample z) in the DGN input space and walks on its faces to discover the neighboring regions. This process is repeated on the newly discovered regions until no new region is discovered. We illustrate this exploration procedure in Fig. 5.1.
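The following sketch (illustrative, for a fully connected ReLU generator; it encodes the same information as Cor. 5.1 through the pre-activation signs, and all names are ours) stacks the half-space constraints describing the region ω containing a given z:

```python
import numpy as np

def region_halfspaces(weights, biases, z):
    """Half-space description {z' : H z' < h} of the region omega containing z,
    built from the pre-activation signs of the hidden layers (cf. Cor. 5.1)."""
    z = np.asarray(z, dtype=float)
    A_cum, b_cum = np.eye(len(z)), np.zeros(len(z))        # A_omega^{1->l}, b_omega^{1->l}
    H_rows, h_vals = [], []
    for W, c in zip(weights[:-1], biases[:-1]):            # hidden layers only (last is linear)
        A_cum, b_cum = W @ A_cum, W @ b_cum + c            # pre-activation affine map
        s = np.where(A_cum @ z + b_cum > 0, 1.0, -1.0)     # activation sign pattern at z
        H_rows.append(-s[:, None] * A_cum)                 # -s_i (A z' + b)_i < 0
        h_vals.append(s * b_cum)
        q = (s > 0).astype(float)                          # ReLU derivative
        A_cum, b_cum = q[:, None] * A_cum, q * b_cum       # post-activation map for next layer
    return np.vstack(H_rows), np.concatenate(h_vals)
```

The partition Ω itself can then be recovered by the recursive neighbor-walking scheme described above.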

5.2.3 Gaussian Integration on the Deep Generative Network Latent Partition

We now turn to the computation of the DGN marginal (5.3) and posterior (5.4) distributions, for which we need to integrate over all of the regions ω ∈ Ω in the latent space partition.

The Gaussian integral on a region ω (and its moments) cannot in general be obtained by direct integration unless ω is a rectangular region [Tallis, 1961, BG and Wilhelm, 2009] or is polytopal with at most S faces [Tallis, 1965]. In general, the DGN regions ω ∈ Ω will have at least S + 1 faces, as they are closed polytopes in R^S. To leverage the known integral forms, we propose to first decompose a DGN region ω into simplices ((S + 1)-face polytopes in our case [Munkres, 2018]) and then further decompose each simplex into open polytopes with at most S faces, allowing Tallis [1965]'s result to be used to integrate a Gaussian on an arbitrary polytope ω. In our case, we perform the simplex decomposition with the Delaunay triangulation [Delaunay et al., 1934], denoted as T(ω) with

T(ω) ≜ {Δ_1, ..., Δ_{card(T(ω))}},  with  ⋃_{i=1}^{card(T(ω))} Δ_i = ω  and pairwise disjoint interiors Δ̊_i ∩ Δ̊_j = ∅, ∀i ≠ j,   (5.7)

where each Δ_i is a simplex defined by S + 1 half-spaces, Δ_i = ⋂_{s=1}^{S+1} H_{i,s}. This process is illustrated in Fig. 5.2. The decomposition of each simplex into open polytopes with fewer than S + 1 faces is performed by employing the standard inclusion-exclusion principle [Björklund et al., 2009], leading to the following result.

Lemma 5.2 (Domain of integration splitting)

The integral of any integrable function g on a polytopal region ω ∈ Ω can be decomposed into integrations over open polytopes of at most S faces via

∫_ω g(z) dz = ∑_{Δ∈T(ω)} ∑_{(s,V)∈H(Δ)} s ∫_V g(z) dz,

with H(Δ_i) ≜ { ( (−1)^{|J|+S}, ⋂_{j∈J} H_{i,j} ) : J ⊆ {1,...,S + 1}, |J| ≤ S }.

Figure 5.2 : Triangulation T(ω), as per (5.7), of a polytopal region ω (left plot) obtained from the Delaunay triangulation of the region vertices, leading to 3 simplices (three right plots).

From the above result, we can apply the known form of the Gaussian integral on a polytopal region with fewer than S faces and obtain the form of the integral and moments as provided in Appendix B.2, where detailed pseudo code is also provided.
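A minimal sketch of the simplex decomposition step (5.7), assuming the vertices of the region have already been enumerated (e.g., from its H-representation); it relies on SciPy's Delaunay triangulation, and the function name is ours:

```python
import numpy as np
from scipy.spatial import Delaunay

def triangulate_region(vertices):
    """Delaunay triangulation T(omega) of a polytopal region given its vertices
    (one vertex per row); returns each simplex as its (S+1) x S vertex array."""
    verts = np.asarray(vertices, dtype=float)
    tri = Delaunay(verts)
    return [verts[simplex] for simplex in tri.simplices]
```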

Computational Complexity: Exact evaluation of the analytical DGN distributions requires (i) computing the partition, (ii) triangulating each partition region, and (iii) integrating on a region using Lemma 5.2. The first two steps have complexity growing with the latent space dimension and the number of regions. Even though their asymptotic complexity is linear with respect to the number of regions, one must recall that this quantity grows exponentially with the width and depth of a DN [Montufar et al., 2014]. The third step of integration from Lemma 5.2 is computationally expensive, particularly with respect to the latent space dimension S. This is currently the main practical limitation of performing the analytical computation of the DGN posterior (and thus the E-step). A more elaborate discussion plus several solutions are provided in the next section; see also Appendix B.8 for the asymptotic computational complexity details.

Figure 5.3 : Left: noiseless generated samples g(z) in red and noisy samples g(z) + ε in blue, with Σ_x = 0.1 I, Σ_z = I. Middle: marginal distribution p(x) from (5.3). Right: the posterior distribution p(z|x) from (5.4) (blue), its expectation (green) and the position of the region limits (black), for the sample point x depicted in black in the left figure.

Visualization of the Marginal and Posterior Distributions. To illustrate our theoretical development so far, we now visualize the posterior and marginal distributions of a randomly initialized DGN in a low-dimensional space, D = 2, with latent dimension S = 1. (See Appendix B.9 for the architectural details of the DGN.) We depict the various distributions as well as the generated samples in Fig. 5.3. We also plot the posterior distribution based on one observation g(z_0), given a sample z_0 from the z distribution, and on one noisy observation g(z_0) + ε_0, given a noise realization ε_0.

5.3 Expectation-Maximization Learning of Deep Generative Networks

We now derive an analytical Expectation-Maximization (EM) training algorithm for CPA DGNs based on the results of the previous sections. We then compare DGN training via EM and AVI, and leverage the exact complete likelihood to perform model selection and study the VAE approximation error.

5.3.1 Expectation Step

The E-step infers the latent (unobserved) variables associated with the generation of each observation x by taking the expectation of the log of the complete likelihood with respect to the posterior distribution (5.4). We denote the per-region moments of the DGN posterior (from Appendix B.2) by E_{z|x}[1_{z∈ω}] ≜ e_ω^0(x), E_{z|x}[z 1_{z∈ω}] ≜ e_ω^1(x), and E_{z|x}[z z^T 1_{z∈ω}] ≜ E_ω^2(x); we also have e^1(x) ≜ E_{z|x}[z] = ∑_ω e_ω^1(x), and likewise for the second moment. We obtain the following E-step (the detailed derivations are in Appendix B.6.1):

E_{z|x}[log(p(x|z) p(z))] = − (1/2) log( (2π)^{S+D} |det(Σ_x)| |det(Σ_z)| ) − (1/2) trace(Σ_z^{−1} E^2(x))
  − (1/2) [ x^T Σ_x^{−1} x − 2 x^T Σ_x^{−1} ∑_ω ( A_ω e_ω^1(x) + b_ω e_ω^0(x) )
  + ∑_ω ( trace(A_ω^T Σ_x^{−1} A_ω E_ω^2(x)) + (e_ω^0 b_ω + 2 A_ω e_ω^1(x))^T Σ_x^{−1} b_ω ) ].

Note that the (per-region) moments involved in the E-step, such as e_ω^1(x), are taken with respect to the current parameters θ = {Σ_x, Σ_z, (W^{(ℓ)}, b^{(ℓ)})_{ℓ=1}^{L}}. That is, if gradient-based optimization is leveraged to maximize the ELBO, then no gradient should be propagated through them. We can see from the above formula that the contributions of each region's affine parameters are weighted based on the posterior for each datum x. That is, for each input x, the posterior combines all of the per-region affine parameters, as opposed to current forms of learning that leverage only the parameters involved in the specific region activated by the DGN input z.

5.3.2 Maximization Step

Given the E-step, the ELBO can be maximized via some flavor of gradient-based optimization. However, thanks to the analytical E-step and the Gaussian form of the involved distributions, there exists an analytical form of this maximization process (M-step), leading to the analytical M-step for DGNs. The formulas for all of the DGN parameters are provided in Appendix B.6. We provide here the analytical form for the bias b^{(ℓ)*}, for which we introduce r_ω^ℓ(x) as the expected reconstruction error of the DGN:

r_ω^ℓ(x) ≜ x − ( ∑_{i≠ℓ} A_ω^{i+1→L} Q_ω^{i} b^{i} ) e_ω^0(x) − A_ω e_ω^1(x)   (expected residual without b^{(ℓ)}),

b^{(ℓ)*} = ( ∑_x ∑_ω Q_ω^{(ℓ)} A_ω^{L→ℓ+1} Σ_x^{−1} A_ω^{ℓ+1→L} Q_ω^{(ℓ)} )^{−1} ( ∑_x ∑_{ω∈Ω} Q_ω^{(ℓ)} A_ω^{L→ℓ+1} Σ_x^{−1} r_ω^ℓ(x) ),

where the last factor is the expected residual back-propagated to layer ℓ.

Some interesting observations can be made based on the analytical form of these updates. First, the bias update is based on the residual of the reconstruction error with a DGN whose bias has been removed; this residual is then backpropagated to the ℓth layer. The backpropagation is performed via the (transposed) backpropagation matrix, as when performing gradient-based learning. Second, the updates of any parameter depend on each region parameter's contribution based on the posterior moments and integrals, similarly to any mixture model. Third, all of the updates are whitened based on the backpropagation (or forward propagation) correlation matrix A_ω^{ℓ→L}, ∀ω, ∀ℓ. We study the impact of using a probabilistic prior on the layer weights, such as Gaussian, Laplacian, or Uniform, which are related to the ℓ_2, ℓ_1 regularization and weight clipping techniques, in Appendix B.7.

Figure 5.4 : DGN training under EM (black) and VAE training with various learning rates (blue: 0.005, red: 0.001, green: 0.0001). In all cases, the VAE converges to the maximum of its ELBO. The gap between the VAE and EM curves is due to the inability of the VAE's AVI to correctly estimate the true posterior, pushing the VAE's ELBO far from the true log-likelihood (recall (5.1)) and thus preventing it from precisely approximating the true data distribution.

5.3.3 Empirical Validation and VAE Comparison

We now numerically validate the above EM steps on a simple problem involving data points on a circle of radius 1 in 2D, augmented with Gaussian noise of standard deviation 0.05. We depict the EM training of a 2-layer DGN of width 8 against VAE training. In all cases the DGNs have the same architecture with the same weight initialization; the dataset is also identical between models, with the same noise realizations. Thanks to the analytical form of the marginals, we can compute the true ELBO (without variational estimation of the true posterior) for the VAE during its training, to monitor its ability to fit the data distribution. We depict the evolution of the negative log-likelihood during training (EM steps for the EM training setting and VAE updates for VAEs) in Fig. 5.4.

We observe that EM training converges faster and to a lower negative log-likelihood.

In addition, we see how all of the trained VAEs seem to converge to the same bound, which likely corresponds to the maximum of its ELBO, where the gap is induced 85

training steps Figure 5.5 : KL-divergence between a VAE variational distribution and the true DGN posterior when trained on a noisy circle dataset in 2D for 3 different learning rates. During learning, the DGN adapts such that g(z) +  models the data distribution based on the VAE’s estimated ELBO. As learning progresses, the true DGN posterior becomes harder to approximate by the VAE’s variational distribution in the AVI pro- cess. As such, even in this toy dataset, the commonly employed Gaussian variational distribution is not rich enough to capture the multimodality of p(z|x) from (5.4).

Data EM VAE (lr LR) VAE (med LR) VAE (sm LR)

Figure 5.6 : EM training of a DGN with latent dimension 1. We show only the generated continuous piecewise affine manifold g(z) without the additional white noise . We see how EM training of the DGN is able to fit the dataset, while VAE (with different learning rates (LR)) suffers from hyperparameter sensitivity and slow convergence. Training details and additional figures for this experiment are provided in Appendix B.9.

by the use of a variational approximation of the true posterior. We confirm this by

looking at the KL divergence between the true posterior and the AVI estimates of

the VAE models during training in Fig. 5.5. We also experiment with another uni-

dimensional manifold which is a localized subpart of a cosine function in 2D and a more complicated manfiold that is MNIST constrained to the digit 4. We present the manifolds and the EM versus VAE learned manifolds in Fig. 5.6 and Fig. 5.7. We observe the ability of EM to fit the manifold while VAEs suffer from slow convergence 86

EM VAE (large lr) VAE (medium lr) VAE (small lr)

Figure 5.7 : Reprise of Fig. 5.6 for MNIST data restricted to the digit 4, employing a 3-layer DGN with latent dimension of 1. Details of training and additional figures for this experiments are provided in Appendix B.9.

and poor posterior approximation. Additional figures and experiments with various architectures are provided in Appendix B.9.

We thus observed that EM learning produces a much smaller negative log-likelihood and that providing better posterior estimates (improved AVI) is key to improve VAE performances. In particular, multimodal variational distributions should be consid- ered for VAEs regardless of the data at hand. In fact, recall from (5.4) that the

T posterior is a mixture of truncated Gaussians with covariances based on Aω Aω.

5.4 Discussions

We have derived the analytical form of the posterior, marginal, and conditional dis- tributions for DGNs constructed using continuous piecewise affine nonlinearities with

Gaussian output and latent distributions. This has enabled us to derive the EM- learning algorithm for DGNs that not only converges faster than state-of-the-art VAE training but also to a higher likelihood. Our proposed methodology also applies to more general distributions, requiring them only to be conjugate priors in order to obtain an analytical solution. Our analytical forms can be leveraged to improve the variational distribution of VAEs, understand the form of analytical weight updates, 87 study how a DGN infers the latent variable z from x, and leverage standard statistical tools to perform model selection, anomaly detection and beyond. 88

Chapter 6

Insights Into Deep Network Pruning

6.1 Introduction

Deep Networks (DNs) are powerful and versatile function approximators that have reached outstanding performances across various tasks, such as board-game playing

[Silver et al., 2017], genomics [Zou et al., 2019], and language processing [Esteva et al.,

2019]. For decades, the main driving factor of DN performances has been progresses in their architectures, e.g. with the finding of novel nonlinear operators [Glorot et al.,

2011, Maas et al., 2013], or by discovering novel arrangements of the succession of

linear and nonlinear operators [LeCun et al., 1995a, He et al., 2016a, Zhang et al.,

2018d]. With a tremendously increasing need for DNs’ practical deployments, one line

of research aims to produce a simpler, energy efficient DN by pruning a dense one,

e.g. removing some layers/nodes/weights and any combination of these options from

a DN architecture, leading to a much reduced computational cost. Recent progresses

[You et al., 2020, Molchanov et al., 2016] in this direction allow to obtain models

much more energy friendly while nearly maintaining the models’ task accuracy [Li

et al., 2020]. Throughout this chapter, we will often abuse notations and refer to an

unpruned DN as “dense” or “complete”. While tremendous empirical progress has

been made regarding DN pruning, there remains a lack of theoretical understanding

of its impact on a DN’s decision boundary as well as a lack of theoretical tools for

deriving pruning techniques in a principled way. Such understandings are crucial for 89 one to study the possible failure modes of pruning techniques, to better decide which to use based on a given application, or to design pruning techniques possibly guided by some a priori knowledge about the given task and data.

6.1.1 Related Works

The common pruning scheme adopts a three-step routine: (i) training a large model with more parameters/units than the desired final DN, (ii) pruning this overly large trained DN, and (iii) fine-tuning the pruned model to adjust the remaining parameters and restore as best as possible the performance lost during the pruning step. The last two steps can be iterated to get a highly-sparse network [Han et al., 2015].

Within this routine, different pruning methods can be employed, each with a specific pruning criteria, granularity, and scheduling [Liu et al., 2019b, Blalock et al., 2020].

Those techniques roughly fall into two categories: unstructured pruning [Han et al.,

2015, Frankle and Carbin, 2019, Evci et al., 2019] and structured pruning [He et al.,

2018, Liu et al., 2017b, Chin et al., 2020a]. Regardless of the pruning methods, the trade-offs lie between the amount of pruning performed on a model and the final accuracy. For various energy efficient applications, novel pruning techniques have been able to push this trade-off favorably. The most recent theoretical works on DN pruning relies on studying the existence of Winning Tickets. Frankle and Carbin

[2019] first hypothesized the existence of sub-networks (pruned DNs), called winning tickets, that can produce comparable performances to their non-pruned counterpart.

Later, You et al. [2020] showed that those winning tickets could be identified in the early training stage of the un-pruned model. Such sub-networks are denoted as early- bird (EB) tickets. Despite the above discoveries, the DN pruning literature lacks a theoretical analysis that would bring insights into (i) current pruning techniques and 90

(ii) observed phenomenons such as EBs tickets, while leading to principled pruning techniques. We propose to approach this task by leveraging the spline viewpoint of

DNs built in Chap. 3 to provide novel interpretations of existing pruning techniques, study the conditions to their success and when should they be avoided, and finally, how to derive novel pruning strategies from first principles.

6.1.2 Contributions

In this chapter we are turning our focus towards a recent technique that is Deep

Network Pruning. As we will see, pruning, which consists in removing some weights and/or units of a DN, can be studied thoroughly from a geometric point of view thanks to the knowledge of the DN input space partition and its ties with the DN input-output mapping. After providing many practical insights into pruning, we will propose a novel strategy from those understandings that is able to compete with alternative state-of-the-art methods.

6.2 Winning Tickets and DN Initialization

In this section, we develop a novel perspective to produce minimal energy efficient

DNs by searching for improved initialization schemes as opposed to employing the standard strategy of training an overparametrized DN, and repeatedly pruning it and fine-tuning it. As we will demonstrate, when initialization is done in a specific way, one can directly train the minimal DNs resulting in better performances and a reduced number of FLOPs. 91

Figure 6.1 : K-means experiments on a toy mixture of 64 Gaussian in 2d, where in all cases the number of final cluster is 64 but the number of starting clusters (x- axis) varies and pruning is applied during training to remove redundant centroids,

est Accuracy(%) comparing random centroid initialization T and kmeans++. With overparametriza- tion, random initialization and pruning

Number of Initial Clusters reaches the same accuracy as kmeans++.

6.2.1 The Initialization Dilemma and the Importance of Overparametriza-

tion

The term overparametrization has been used extensively in the recent DN literature.

Throughout this paper, we will refer to a model as being overparametrized when it is possible to prune it and still manage to solve the task at hand with roughly the same

final performance. This can be done by reducing the number of units (and layers) in a DN, or reducing the number of centroids in K-means [MacQueen et al., 1967]. In this subsection, we propose to consider not only DNs but also more standard machine learning algorithms such as K-means with the following goal: demonstrate that the chaining of (i) overparametrization, (ii) training, and (iii) pruning provides a power- ful strategy when good initialization for the algorithms is not known, and conversely, that searching for novel initialization schemes for DNs might provide alternative so- lutions to current pruning techniques.

The importance of initialization is crucial for most models, even standard methods such as K-means. A rich branch of research has been focusing on K-means initializa- tion alone for decades [Bradley and Fayyad, 1998, Kanungo et al., 2002, Hamerly and

Elkan, 2002, Arthur and Vassilvitskii, 2006, Celebi et al., 2013]. As a result, we will 92 employ K-means as a control method where we can experiment both with random initialization and “advanced” initialization. We first consider the case of K-means in a toy setting where we know a priori the number of clusters for artificial data that we generate from a Gaussian Mixture Model (GMM) [Reynolds, 2009] with spherical and identical covariances, and uniform cluster prior in order to fully fall into the K-means data modeling. We perform the usual pruning strategy (i.e., training, pruning, and

fine-tuning) over multiple runs and with varying numbers of initial clusters. Specif- ically, pruning is done by removing the centroids that are closest to each other in terms of their `2 distance, and until the final number of clusters is equal to the true one. We also repeat this experiment by employing kmeans++ initialization [Arthur and Vassilvitskii, 2006] for K-means instead of sampling random centroids. We report the clustering accuracy in Fig. 6.1, from which we distinctively observe the ability of the random initialization case to produce very accurate models whenever the num- ber of starting clusters is greater than the true one, while the advanced initialization strategy offers near-optimal performances without resorting to overparametrization.

In fact, it should be clear that due to random initialization of the centroids, the more initial clusters are used, the more likely it becomes that at least one centroid will be near each of the clusters of the data distribution.

In the case of DNs, most initialization techniques focus on maintaining feature maps statistics bounded through depth to avoid vanishing or exploding gradient

[Glorot and Bengio, 2010, Sutskever et al., 2013, Mishkin and Matas, 2015]. How- ever, incorporating data information into the DN weights initialization as is done in K-means with say kmeans++ remains to be developed for DNs. Hence, over- parametrization allows successful training, and a posteriori, one can remove the re- dundant parameters and obtain a final model with much better performances versus 93 the non-overparametrized and non-pruned counterpart. This is the key motivation of Early Bird tickets. Furthermore, the parallel between DNs and K-means is most relevant as it has been shown in Sec. 3.4 that the DN decision process relies on an input space partition based on centroids that is very similar to the one of K-means and which thus benefit in the same way to overparametrization. Beyond this geo- metric aspect, overparametrization has been proven to facilitate optimization [Arpit and Bengio, 2019] and to position the initial parameters close to good local minima

[Allen-Zhu et al., 2019a, Zou and Gu, 2019, Kawaguchi et al., 2019] reducing the number of updates needed during training. Remark 6.1

Winning tickets are the result of employing overly parametrized DNs which are sim- pler to optimize and produce better performances, as current optimization techniques can not escape from poor local minima and advanced DN initialization (near good local minima) is unknown.

We further support the above remark in the following section where we demonstrate how the absence of good initialization coupled with non-optimal optimization prob- lems impacts performances unless overparametrization is used, in which case winning tickets naturally emerge.

6.2.2 Better DN Initialization: An Alternative to Pruning

We saw in the previous section that the concept of winning tickets emerges from the need to overparametrize DNs which in turn emerges naturally from architecture search and cross-validation as overparametrizing greatly facilitates training and improves

final results in the absence of good initialization. We now show that if a better initialization of DNs existed, one would have the ability to train a minimal DN directly 94

Table 6.1 : Accuracies of layerwise (LW) pretraining, structured pruning with random and lottery ticket initialization. Pruning Setting Ratio Random Init. Lottery Init. LW Pretrain 30% 93.33±0.01 93.57±0.01 93.08±0.00 VGG-16 50% 93.07±0.03 93.55±0.03 93.08±0.01 CIFAR-10 70% 92.68±0.02 93.44±0.01 92.81±0.02 90% 90.48±0.06 90.41±0.23 90.88±0.02 10% 71.49±0.03 71.70±0.09 71.14±0.02 VGG-16 30% 71.34±0.10 71.24±1.18 71.35±0.01 CIFAR-100 50% 67.74±1.05 69.73±1.15 70.19±0.01 70% 60.44±4.98 66.61±0.95 67.40±0.83

and thus would not resort to the entire pruning pipeline.

We convey the above point with a carefully designed experiment. We consider

three cases. First, the case of employing a minimal DN with random weights ini-

tialized the usual from random Kaiming initialization [He et al., 2015a]. Second, we consider the same minimal DN architecture but with weights initialized based on un- supervised layerwise pretraining which we consider as a data-aware initialization (no label information is used) [Belilovsky et al., 2019]. In both cases, after initialization, training is done on the classification task in the same manner. Third, we consider an overparametrize DN trained with the lottery ticket (LT) method (training, pruning, and re-training). The final models of the three cases have the same architecture (but different weights based on their own training method). We report their classification results in Table 6.1, from which we can see that especially for very small final DNs

(high pruning ratios) LT models outperform a randomly initialized DN, but in turn a well initialized DN is able to outperform LT training. From this, we see that the abil- ity of pruning methods and, in particular, LT to produce better-performing minimal 95

DNs than directly training the same minimal DN lies in the lack of good initialization for DNs. In fact, for high pruning ratios, layerwise pretraining even offers a more en- ergy efficient method overall (including the pretraining phase). This should open the door and motivates further studies of such schemes as a possible alternative solution to produce energy efficient models.

As the amount of different architectures grows rapidly and the specificity of those architectures can vary drastically, simple layerwise pretraining falls short of provid- ing an advanced initialization solution. For example, it is not clear how layerwise pretraining can be used with a DenseNet [Huang et al., 2017] where some parame- ters connect layers that are far apart in the architecture. Hence, while we believe in searching for improved initialization strategies, we now focus on studying LT train- ing and DN pruning as they provide a universal solution. We thus propose to study pruning in-depth and how to develop new ones from a spline perspective, providing a universal solution to understand and design efficient DNs across tasks and datasets.

6.3 Pruning Continuous Piecewise Affine DNs

As demonstrated in Chap. 3, a DN equipped with standard nonlinearities, such as

(leaky-)ReLU/max-pooling, is a CPA operator with an underlying partition of its input space. Such a connection provides a new perspective to analyze how the DN decision boundary is formed during training and what are the impacts of network compression methods, e.g. network pruning from such a geometry perspective. We propose to study those questions in the following sections. 96

Node Pruning Weight Pruning (a) Spline Insights for Pruning

Data Grid Unpruned Prune 20% Prune 40%

Prune 60% Prune 80% Prune 90% FCNet Pruning Ratio (%) FCNet Splines Visualization (b) FCNets Spline Experiments

Unpruned Prune 20% Prune 40% Data Grid

Prune 60% Prune 80% Prune 90% ConvNet Pruning Ratio (%) ConvNet Splines Visualization (c) ConvNets Spline Experiments Figure 6.2 : (a) Difference between node and weight pruning, where the former removes entire subdivision lines while the latter simply quantize those partition lines to be colinear to the space axes. (b) Toy classification task pruning, where the blue lines represent subdivisions in the first layer and the red lines denote the last layer’s decision boundary. We see that: 1) pruning indeed removes redundant subdivision lines so that the decision boundary remains an X-shape until 80% nodes are pruned; and 2) ideally, one blue subdivision line would be sufficient to provide two turning points for the decision boundary, e.g., visualization at 80% sparsity, but the classification accuracy degrades a lot if further pruned. That aligns with the initialization dilemma for small DNs, i.e., blue lines are not well initialized and all lines remain hard for training. (c) MNIST reproduction of (b), where to produce these visuals, we choose two images from different classes to obtain a 2-dimensional slice of the 764-dimensional input space (grid depicted on the left). We thus obtain a low-dimensional depiction of the subdivision lines that we depict in blue for the first layer, green for the second convolutional layer, and red for the decision boundary of 6 vs. 9 (based on the left grid). The observation consistently shows that only parts of subdivision lines are useful for decision boundary; and the goal of pruning is to remove those (redundant) subdivision lines. 97

6.3.1 Interpreting Pruning from a Spline Perspective

We first propose to leverage the DN input space partition to study the difference be-

tween node (or unit) and weight pruning. In the former, units of different layers are

removed, while in the latter, entries of the W (`) matrices (or C(`) for convolutions) are removed. We demonstrate in Fig. 6.2 (a) that node pruning removes entire sub-

division lines while weight pruning (or quantization) can be thought as finer granular limitations on the slopes of subdivision lines, and will only remove the subdivision line when all entries of a specific row in W (`) are 0. From this, we can already identify the

reason why pruned networks are less expressive than the overparametrized variants

[Sharir and Shashua, 2018] as pruned DNs’ input space partition is limited compared

to their non-pruned counter parts.

Despite the constraints that pruning imposes on the DN input space partition,

classification performances do not necessarily reduce when pruning is employed. In

fact, the final decision boundary, while being tied with the DN input space partition,

does not always depend on all the existing subdivision lines. That is, pruning will

not degrade performances as long as the needed decision boundary geometry does not

rely on the partition regions that are being affected by pruning. We demonstrate

and provide detailed visualization of the above in Fig. 6.2 (b) with a simple toy

classification task which only requires a few subdivision lines to produce a decision

boundary perfectly solving the task. Hence, as long as pruning leaves at least those

few subdivision lines, the final performances will remain high. In fact, we observe that

for this toy case, and with a two-layer FCNet (20 nodes per layer), applying pruning

ratios ranging from 20% ∼ 95% (i.e., prune 4 ∼ 19 nodes) does not prevent solving the task as long as the remaining subdivision lines are positioned to allow the decision boundary geometry to remain intact. We also extend the above experiment to a high 98 Loss Loss

Iterations Epochs AlexNet on CIFAR10 (a) Spline Trajectory for FCNets (b) Spline Trajectory for ConvNets (c) Spline Early-Bird Tickets

Figure 6.3 : Spline trajectory during training and visualizing the Early-Bird (EB) Phenomenon, which can be leveraged to largely reduce the training costs due to the less training of costly overparametrized DNs. The trajectories mainly adapt during early phase of training.

dimensional case with MNIST classification and a DN with two convolutional layers,

20 filters, and kernel sizes of 21 and 5, respectively in Fig. 6.2 (c). By adopting the same channel pruning method as in Liu et al. [2017b], we observe that most of the pruned nodes remove subdivision lines that were not crucial for the decision boundary and thus only have a small impact on the final classification performance.

6.3.2 Spline Early-Bird Tickets Detection

Early-Bird (EB) tickets [You et al., 2020] provides a method to draw winning ticket sub-networks from a large model very early during training (10% ∼ 20% of the total number of training epochs). The EB drawn is based on an a priori designed pruning strategy and compares how different (in terms of which units/channels are removed) are the hypothetical pruned models through the training steps; this method out- performs SOTA methods [Frankle and Carbin, 2019, Liu et al., 2017b]. The main limitation of EB lies in the need to define a priori a pruning technique (itself depend- ing on various hyper-parameters). Based on the spline formulation, we formulate a novel EB method that does not rely on an external technique and only considers the 99

evolution of the DN input space partition during training.

Early-Bird in the Spline Trajectory. First, we demonstrate that there exists an EB phenomenon when viewing the DN input space partition, which should follow naturally as the DN weights and the DN input space partition are tied. We visualize

DN partition’s evolution at different training stages in Fig. 6.3 (a) and (b), under the same settings as Sec. 6.3.1. From this, we clearly see that the partition quickly adapts to the task and data at hand, and then is only slightly refined through the remaining training epochs. This fast convergence comes as early as the 2000-th iteration (w.r.t.

10000 iterations for FCNets) and the 30-th epoch (w.r.t. 160 epochs for ConvNets).

Additionally, we observe that the contribution of the first layers in the input space partition becomes stable more rapidly than for deeper layers. We can thus leverage this early convergence to detect EB tickets by using a novel metric based on those subdivision lines to draw better EB tickets than originally proposed in [You et al.,

2020].

Quantitative Distance between Input Space Partitions. To draw EB tickets based on the evolution of DN input space partitions, we first need to provide a metric that conveys such information. First, recall that each region from the DN input space has an associated binary code based on which side of the subdivision trajectories the regions lie (recall (4.2) and (4.3)). Given a large collection of data points, we can assign each datum the code of the region it lies in (found simply based on the sign of the per-layer feature maps). As training occurs and the partition adapts, the code associated with an input will vary. However, once training stabilizes and regions do not change anymore, this code will remain the same. In fact, one can easily show that in the infinite data sample regime covering the entire input space, DNs with the same codes also have the same input space partition, in turn the same decision boundary 100 geometry. The proposed metric is thus the hamming distance between the codes of each datum observed at two consecutive training steps.

We visualize the above hamming distance of the DN partition between consecutive epochs, when training AlexNet on CIFAR-10 (see more visualizations in Appendix

B.1, shown as the spline distance matrix (160 × 160) in Fig. 6.3 (c), where the (i, j)- th element represents the spline distance between networks from the i-th and j-th epochs. The distances are normalized between 0 and 1, where a lower value (w.r.t. warmer temperature) indicates a smaller spline distance (and thus DNs with similar partitions). We consistently observe that such distance becomes small (i.e., < 0.15) after the first few epochs (visualization for other networks can be found in Fig. 4 of the Appendix), indicating the EB phenomenon, but now captured in terms of the DN input space partition. To obtain an active EB drawing strategy from that, we measure and record the spline distance between three consecutive epochs, and stop the training when the two associated distances are smaller than a predefined threshold, denoted by the red block in Fig. 6.3 (c). We conclude by emphasizing that as opposed to the usual EB tickets drawn in You et al. [2020], our formulation provides a more interpretable scheme that is invariant to the pruning strategy as well as its hyper-parameters (e.g., the pruning ratio). Hence, our formulation allows for a much simpler solution that does not require to be adapted based on the pruning technique that users experiment with.

6.3.3 Spline Pruning Policy

We now propose to derive from first principles novel pruning strategies of DNs based on the spline viewpoint insights. Recall from Sec. 3.4 that the layer input space parti- tion is formed by a successive subdivision process involving each per-layer input space 101

2 0 DN Partition Ω N1 (k, k ) Pruned Ω

Figure 6.4 : We depict on the left a small (L = 2,D1 = 5,D2 = 8) DN input space partition, layer 1 trajectories in black and layer 2 in blue. In the middle is the measure from Eq. (6.1) finding similar “partition trajectories” from layer 2 seen in the DN input space (comparing the green trajectory to the others with coloring based on the induce similarity from dark to light). Based on this measure, pruning can be done to remove the “grouped partition trajectoris” and obtain the pruned partition on the right.

partition. As we also studied in the previous section, for classification performances, not all the input space partition regions and boundaries are relevant since not all affect the final decision boundary. Knowing a priori which regions of the input space partition are helping in solving the task is extremely challenging, since it requires knowledge of the desired decision boundary and of the input space partition, both being highly difficult to obtain for high dimensional spaces and large networks [Mont- ufar et al., 2014, Hanin and Rolnick, 2019]. What is simpler to obtain, however, is how redundant are some of the layer weights/units in terms of the forming of the DN partition relative to other units/weights. From that, it will become trivial to prune the redundant units/weights as their impact on the forming of the decision boundary is already carried by another unit/weight. 102

When considering the layer input space partition, we can identify “redundant” units based on how each unit impacts the partition with respect to other units. For example, if two units have biases and slope vectors proportional to each other, then one can effectively remove one of the two units without altering the layer input space partition. While this is a pathological case, we will demonstrate that the angles between per-unit slope matrices and inter-bias distances measure such a redundancy.

We first introduce our pairwise redundancy measure as follows:

(`) (`) ! (`) 0 |h[W ]k,., [W ]k0,.i| (`) (`) 0 Nρ (k, k ) = 1 − (`) (`) + ρ|[b ]k − [b ]k |, ρ > 0, (6.1) k[W ]k,.k2k[W ]k0,.k2 where ρ is an hyper-parameter measuring the sensitivity of the difference in angle versus the biases. Finding the two units with the most similar contribution to the DN

(`) 0 input space partitioning can be done via arg mink,k06=k Nρ (k, k ) where the obtained couple (k, k0) encodes the two units which are the most redundant. In turn, one of those two units can be pruned such that the impact of pruning onto the DN input space partition is minimized.

Proposition 6.1 (Partition boundary redundance)

Given a layer and its input space partition, removing sequentially one of the two units,

0 (`) 0 k and k , for which Nρ (k, k ) = 0, leaves the layer input space partition unchanged.

The above result is crucial as removing units that do not affect the layer input space is synonymous with removing units that do not affect the entire DN input space

(`) 0 partition. In practice, units with small enough but nonzero Nρ (k, k ) are also highly redundant and can be removed. We provide an example of this procedure in Fig. 6.4. 103

Table 6.2 : Evaluating the proposed layerwise spline pruning over SOTA pruning methods on CIFAR-100. PreResNet-101 VGG-16 Pruning Dataset ratio NS SplineEB SplineImprov. NS SplineEB SplineImprov. Unpruned93.66 93.66 93.66 - 92.71 92.71 92.71 - 30% 93.4893.56 93.07 +0.08 93.29 93.21 92.83 -0.08 CIFAR-10 50% 92.5292.55 92.37 +0.03 91.85 92.13 92.23 +0.38 70% 91.27 91.33 91.33 +0.06 88.52 89.68 88.65 +1.16 Unpruned73.10 73.10 73.10 - 71.43 71.43 71.43 - 10% 71.58 71.58 73.14 +1.56 71.6 71.78 72.28 +0.68 CIFAR-100 30% 70.70 70.13 72.11 +1.41 70.32 71.15 71.59 +1.27 50% 68.70 69.05 70.88 +2.18 66.1 69.92 69.96 +3.86 70% 66.51 67.06 68.41 +1.90 61.16 63.13 64.01 +2.85

6.4 Experiment Results

Here we evaluate our spline pruning method with the experiment settings added to the Appendix C.1.

6.4.1 Proposed Layerwise Spline Pruning over SOTA Pruning Methods

(`) 0 Recall that the spline pruning policy is done by solving arg mink,k06=k Nρ (k, k ). By regarding k as the index of channels for convolutional layers, we are able to conduct channel pruning in a layerwise manner. Table 6.2 shows the comparison between the spline pruning (w/ and w/o EB detection) and SOTA network slimming (NS) method

[Liu et al., 2017b] on CIFAR-10/100 datasets. We can see that the spline pruning consistently outperforms NS, achieving -0.08% ∼ 3.86% accuracy improvements. This set of results verifies our hypothesis that removing redundant splines incurs little changes in decision boundary and thus provides a good a priori initialization for retraining. 104

Table 6.3 : Evaluating the proposed global spline pruning over SOTA pruning meth- ods on ImageNet. Pruning Top-1 Top-5 FLOPs Energy Models Methods ratio Acc. (%) Acc. (%) (P) (MJ) Unpruned - 69.5 89.2 1259.1 98.1 0.1 69.6 89.2 2424.8 193.5 NS Liu et al. [2017b] 0.3 67.8 88.0 2168.8 180.9 SFP He et al. [2018] 0.3 67.1 87.7 1991.9 158.1

ResNet-18 0.1 69.4 89.0 1101.2 95.6 EB Spline 0.3 67.8 87.9 831.0 82.8 Unpruned - 75.9 92.9 2839.9 280.7 0.3 72.0 90.6 4358.5 456.1 ThiNet Luo et al. [2017] 0.5 71.01 90.02 3850.03 431.73 SFP He et al. [2018] 0.3 74.61 92.0 4330.8 470.7 LeGR Chin et al. [2020b] - 75.3 92.4 4174.74 412.6 GAL-0.5 Lin et al. [2019] - 72.0 91.8 4458.74 440.7 ResNet-50 GDP Lin et al. - 72.6 91.1 4487.1 443.5 C-SGD-50 Ding et al. [2019] - 74.5 92.1 4117.9 407.0 Meta Pru. Liu et al. [2019a] 0.5 73.4 - 3532.6 349.1 0.3 75.1 92.6 2434.0 264.2 EB Spline 0.5 73.3 91.5 1636.0 197.0

6.4.2 Proposed Global Spline Pruning over SOTA Pruning Methods

We next extend the analysis to global pruning, where the mismatch of the filter dimension in different layers impedes the cosine similarity calculation. To solve this issue, we adopt PCA for reducing the feature dimensions to the same before applying the spline pruning policy.

Spline Pruning over SOTA on CIFAR. Table 1 in the Appendix compares the retraining accuracy and total training FLOPS/energy of spline pruning with four

SOTA pruning methods Frankle and Carbin [2019], Lee et al. [2019b], Liu et al.

[2017b], Luo et al. [2017], whose detailed descriptions are in Sec. C.2 of the Appendix. 105

These results show that spline pruning consistently outperforms all competitors in

terms of the accuracy and computational cost trade-offs. Specifically, compared with

the strongest competitor among the four SOTA baselines, spline pruning achieves

0.8 × ∼ 3.5 × training FLOPs reductions while offering comparable or even better

(-0.67% ∼ 0.69%) accuracies.

Spline Pruning over SOTA on ImageNet. We further investigate whether the

spline pruning have consistent performance in a harder dataset, using ResNet-18/50

on ImageNet and benchmarking with eight SOTA pruning methods including ThiNet,

NS, SFP, LeGR, GAL-0.5, GDP, C-SGD-50, and Meta Pruning. Specifically, spline

pruning with EB detection (EB Spline) achieves a reduced training FLOPs of 43.8%

∼ 57.5% and a reduced training energy of 42.1% ∼ 54.3% for ResNet-50, while leading to a top-1 accuracy improvement of -0.12% ∼ 3.04% (a top-5 accuracy improvement of 0.18% ∼ 1.91%). Consistently, EB Spline achieves a reduced training FLOPs of 44.7% ∼ 61.7% and a reduced training energy of 39.5% ∼ 54.2% for ResNet-18, while leading to comparable top-1 accuracies (-0.24% ∼ 0.71%) and top-5 accuracies

(-0.16% ∼ 0.21%).

The above experiments show the consistent superiority of our global spline prun- ing. We also conduct ablation studies to measure the sensitivity of the hyperparam- eter ρ (see Equ. 6.1) in Sec. C.3 of the Appendix.

6.5 Discussions

We demonstrated the tight link between the presence of winning tickets and over- parametrization in DNs, the latter being necessary for DNs to reach high perfor- mances with current (random) initializations; and we demonstrated that this phe- nomenon is not unique to DNs but affect other machine learning methods such as 106

K-means. This opens new avenues to produce small, energy efficient and performing

DNs by developing “clever” initialization techniques. Furthermore, we leveraged the spline formulation of DNs to sharpen our understanding of different pruning poli- cies, study the conditions in which pruning does not deteriorate performances, and develop a novel and more principled pruning strategy extending EB tickets; and ex- tensive experiments demonstrated the superior performances (accuracy and energy efficiency) of the proposed method. The proposed spline viewpoint should open new avenues to theoretically study novel and existing pruning techniques as well as guide practitioners via the proposed visualization tools. 107

Chapter 7

Insights into Batch-Normalization

7.1 Introduction

Deep Learning has made major impacts in a wide range of areas. A deep (neural)

network (DN) is an operator fΘ, where Θ collects all learnable parameters, that maps

the input x ∈ RD to the output predictiony ˆ ∈ RS by composing L intermediate layer mappings that combine affine and simple nonlinear operators such as the fully

connected operator (simply the affine transformation defined by the weight matrix

W (`) and bias vector b(`)), convolution operator (with circulant W (`)), activation

operator (applying a scalar nonlinearity such as the ubiquitous ReLU), or pooling operator. Precise definitions of these operators can be found in Chap. 3. Each layer as defined per Def. 1.1 maps an input z(`−1) to an output z(`) via   z(`−1) 7→ σ W (`)z(`−1) + b(`) := z(`), (7.1)

with σ a element-wise activation function, and W (`), b(`) some learnable parameters.

For this chapter, we omit the pooling operator to keep our development light in no-

(`) (`) tations. The learnable parameters Θ of the DN e.g. Θ , {W , b , ∀`} are trained based on a training dataset consisting of input-output pairs X := {(xn, yn), n =

1,...,N} for supervised learning and input observations only X := {xn, n = 1,...,N} for unspervised learning, a loss function, and some gradient based optimization set- ting. This allows to learn an end-to-end nonlinear mapping. For a large dataset, it is common to extract a mini-batch B ⊂ X where usually |B|  N, evaluate the loss 108

on the samples in B and update the weights Θ based on the gradient of the mini-

batch loss. No matter what architecture or processing layers, a critical component to

performance in deep learning has been batch normalization (BN) [Ioffe and Szegedy,

2015]. The BN operator is typically inserted prior to the activation function, after

the affine transformation [Huang et al., 2017, Chen et al., 2013] and augments the

linear operator ∗ to form the BN-equipped layer mapping ! W (`)z(`−1) − µ(`) z(`−1) 7→ σ γ(`) + β(`) . (7.2) σ(`)

The vectors µ(`) and σ(`) are the element-wise average and standard deviation of

W (`)z(`−1) taken over the current mini-batch B (during training) and taken over

the entire training set (during testing). The parameters γ(`) and β(`) are learnable parameters guided by the given loss and updated via some flavors of gradient descent.

We recall that training a DN is most often done by first splitting a dataset into two parts: a training set and a testing set. The testing (test) set is used to measure the out-of-sample capacity (generalization) of a (trained) DN. A typical BN-equipped DN

(`) (`) (`) thus employs Θ , {W , γ , β , ∀`}.

7.1.1 Related Works

Nowadays, the empirical benefits of BN are ubiquitous with more than 12,000 ci- tations to the original BN article and a unanimous community employing BN to accelerate training by helping the optimization procedure and to increase generaliza- tion performances [He et al., 2016b, Zagoruyko and Komodakis, 2016, Szegedy et al.,

2016, Zhang et al., 2018c, Huang et al., 2018a, Liu et al., 2017b, Ye et al., 2018,

∗the bias can be omitted as it is included in the BN operator 109

Jin et al., 2019, Bender et al., 2018]. Despite its prevalence in today’s DN architec- tures’ performances, the understanding of the unseen forces that BN applies on DNs remains elusive; and for many, understanding why BN improves so drastically DNs performances remains one of the key open problems in the theory of deep learning

[Richard et al., 2018].

One of the first practical arguments in favor of feature map normalization emerged in Cun et al. [1998] as “good-practice” to stabilize training. By studying how the backpropagation algorithm updates the layer weights, it was observed that unless with normalized feature maps, those updates would be constrained to live on a low- dimensional subspace limiting the learning capacity of gradient based algorithms. By explicitly reparametrizing the affine transformation weights and slightly altering the renormalization process of BN, weight renormalization [Salimans and Kingma, 2016] showed how the σ(`) renormalization smooths the optimization landscape of DNs.

Similarly, Bjorck et al. [2018], Santurkar et al. [2018], Kohler et al. [2019] further studied the impact of BN in the gradient distributions and optimization landscape by designing careful and large scale experiments. By providing a smoother optimization landscape BN “simplifies” the stochastic optimization procedure and thus accelerates the training convergence and generalization. In parallel to this optimization analysis of BN in standard DN architectures, Yang et al. [2019b] developed a mean field theory for fully-connected feed-forward neural networks with random weights where BN is analytically studied. In doing so, they were able to characterize the gradient statistics in such DNs and to study the signal propagation stability depending on the weight initialization, concluding that BN stabilizes gradients and thus training. From the above results, it seems that BN has been demystified through an optimization lens.

However, many alternative techniques offering similar ‘loss surface smoothing’ and 110

‘stable gradient descent’ exist e.g. Adam [Kingma and Ba, 2014], mollifying networks

[Gulcehre et al., 2016], or resnets [Li et al., 2017a], none of which is able to reach as impressive performances as DNs equipped with BNs. This raises the following question. Are the benefits of BN completely explained solely from a loss surface and optimization perspective. We shall answer this question by providing two novel results that will extend our understanding of BN and demonstrate that BN pushes

DN performances by adapting the DN input space partition to concentrate on the data samples and by increasing the decision boundary margin.

7.1.2 Contributions

In this chapter we propose to leverage the spline formulation built in Chap. 3 to study specifically one of the most important technique in deep learning: batch- normalization. As we will see in the subsequent sections, batch-normalization is a sur- prisingly simple strategy that allowed deep learning to greatly jump in performances.

Despite being a simple algorithm, batch-normalization underlying mechanisms have never been fully grasped. We propose to contribute to the current understanding of this technique below.

Through the course of the next few sections, we will demonstrate how BN, by replacing 7.1 with 7.2, provides an unsupervised learning technique that interacts with the (un)supervised learning algorithm used to train a DN in order to focus the attention of the network onto the data points.

7.2 Batch Normalization: Unsupervised Layer-Wise Fitting

We first formalize the role of BN in a layer-by-layer study; that is, we now study the layer input space partition and how BN affects it based on its statistics µ(`), σ(`) and 111

the observed layer inputs z(`−1). This section corresponds to a single layer analysis,

the multilayer cases will be conducted in the following section.

7.2.1 Batch-Normalization Updates

BN alters the layer mapping by introducing two additional shifts and scaling (recall 7.1

and 7.2). Those additional operations leverage learnable parameters γ(`), β(`) as well as the mean and standard deviation parameters µ(`) and σ(`) respectively which are computed based on the current mini-batch B during training and the entire training set during test time as s (`) 1 X (`−1) (`) 1 X (`−1) (`) 2 µ W (`)z , σ W (`)z − µ  , (7.3) BN , B i BN , B i BN i∈B i∈B

(`) th th where zi is the ` layer output of the i datum, and B contains the (B) indices of the observations from the training set X that are contained in the current mini-batch.

(`) (`) (`) (`) During training, BN sets µ and σ to µBN and σBN respectively, and during testing, BN computes those statistics over the entire training set. In the latter case, we denote

(`) (`) (`) (`) (`) those test time statistics as µBN, σBN. The other layer parameters β , γ and W are learned based on the given loss and some flavor of gradient descent such as Adam of Nesterov momentum. In the remaining of this section we will consider the BN learnable parameters to be fixed at their initialization values (β(`) = 0 and γ(`) = 1); we study the role of those parameters in the next section where in particular we will show that their impact on performances is negligible for current dataset and tasks.

7.2.2 Layer Input Space Hyperplanes and Partition

We now focus on layers where the nonlinearity σ is continuous and piecewise affine. In particular, we will focus on the popular nonlinearities with two linear regions that are 112

ReLU (σ(u) = max(0, u)), leaky-ReLU (σ(u) = max(α, u), α > 0) or absolute value

(σ(u) = max(−u, u)). Those activation functions are continuous piecewise linear

mappings, combining this fact with the linearity of the affine transform preceding the

layer activation we will be able to characterize the layer input space partition as a

function of the activation function and the layer affine transform.

A nonlinearity “state” changes when its input, or pre-activation, sign changes.

This is enough to include of layers employing nonlinearities such as (leaky-)ReLU or

absolute value. The same can be extended to max-pooling layer but is omitted here to

streamline our development. We denote the nonlinearity pre-activations of layer ` by

h(`) such that z(`) = σ(h(`)). In layer `, the input to the kth coordinate nonlinearity

(recall (7.2)) is

(`) (`) (`−1) (`) (`) [h ]k = (h[W ]k,., z i − [µ ]k)/[σ ]k. (7.4)

The kth nonlinearity of layer ` will change state whenever its pre-activation is 0, which we can in turn express in the input space of layer `. In particular, the collection of layer inputs that are in term of the layer input z(`−1), corresponding to an hyperplane that we denote by H(`,k) and that is given by

(`,k) n (`−1) D(`−1) (`) o H : , z ∈ R :[h ]k = 0

n (`−1) D(`−1) D (`) (`−1)E (`) o = z ∈ R : [W ]k,., z = [µ ]k . (7.5)

Notice that the hyperplanes H(`,k), ∀k are D(`−1) −1 dimensional affine subspace living

(`−1) in RD , the input space of layer `. From the above, we obtain the boundary of the layer input space partition ∂Ω(`) by combining each of the hyperplanes H(`,k) as in

(`) D(`) (`,k) ∂Ω = ∪k=1 H . (7.6) 113

The forming of ∂Ω(`) obtained from a collection of hyperplanes is often denoted as

an hyperplane arrangement [Zaslavsky, 1975]. Furthermore, recall from Sec. 3.4 that

the associated layer input space partition Ω corresponds to a power diagram made of

convex cells. From 7.5 and 7.6 we see that the BN parameter µ(`) participates into

the layer input space partition, we now study how its boundaries are affected by BN.

7.2.3 Translating the Hyperplanes

Recall that BN utilizes a fixed update rule for its parameters µ(`) and σ(`) as given

in 7.3. The key result of this section shows that those fixed updates make the layer

partition concentrate around the data points. Thanks to the explicit parametrization

of the layer partition boundary from 7.5 and 7.6 we can now study how BN translates

the H(`,k) hyperplanes onto the layer inputs. To help out development, we denote by

B(`−1) the mini-batch of feature maps that are fed as input of layer `. That is, it is the

collection of z(`−1) for all the samples x ∈ B. We also denote by Z(`−1) the collection

of all the feature maps z(`−1) for all the training samples x ∈ X . As we will see, the

BN update rule (recall 7.3) will fit H(`,k) onto B(`−1) during training, and onto Z(`−1),

during testing, based on a orthogonal least-square distance minimization. Let first

recall that the point-to-hyperplane distance is given by (see for example 1 in Amaldi

and Coniglio [2013])

(`) (`−1) (`) h[W ]k,., z i − [µ ]k (`−1) d(z(`−1), H(`,k)) = , ∀z(`−1) ∈ D , (7.7) (`) (2) R k[W ]k,.k (`) (2) which is defined as long as k[W ]k,.k > 0. We also introduce the following average squared distance between the layer input space partition boundary ∂Ω(`) and the layer inputs (being either Z(`−1) or B(`−1)) as

1 X 2 L ([µ(`)] ; Z) = d z, H(`,k) [µ(`)]  , (7.8) k k Card(Z) k z∈Z 114

(`,k) (`) where for clarity we explicit the dependency of the hyperplane H with [µ ]k (recall 7.5); we will omit it hereafter for conciseness. We prove in the Appendix that

(`) (`) the BN statistics µBN and µBN (recall 7.3) are the unique solution of the strictly convex optimization problem minimizing the average squared distance between the

hyperplanes H(`,k) and the layer inputs. Theorem 7.1 (BN partition fitting)

At layer `, BN adapts the layer input space partition Ω(`) such that the boundaries

∂Ω(`) are shifted to minimize the average squared distance with the layer inputs:

D(`) (`) X (`−1) µBN = arg min Lk([µ]k; B ), (training) (`) µ∈RD k=1 D(`) (`) X (`−1) µBN = arg min L([µ]k; Z ), (testing), (`) µ∈RD k=1 and with

(`) 2 (`) 2 (`) (`−1) [σBN]k =k[W ]k,.k2Lk([µBN]k; B ), (training)

(`) 2 (`) 2 (`) (`−1) [σBN]k =k[W ]k,.k2Lk([µBN]k; Z ), (testing).

The above theorem provides the first practical understanding of how BN explicitly adapts the layer input space partition such that the boundaries are shifted to min- imize the total least square distance with the layer inputs. We illustrate this result in the first column of Fig. 7.1 where one can see how the hyperplanes are inde- pendently shifted onto the data. Another key impact of BN is that it adapts the partition into a very particular form. The average of the layer inputs, denoted as

(`−1) 1 P (`−1) 1 P z = B z∈B(`−1) z during training and z = N z∈Z(`−1) z, belongs to the intersection of all the hyperplanes H(`,k). That is, BN turns ∂Ω(`) into a central hy- perplane arrangement (non-empty intersection of all the hyperplanes [Stanley et al.,

2004]) as can be seen in the first column and second row of Fig. 7.1. 115

layer 0 layer 1 layer 2 layer 3 layer 4 with BN without BN

Figure 7.1 : Depiction for a 5-layer DN with 6 units per layer of the impact of BN (with statistics computed from all samples) onto the position and shape of the -up-to- layer-` input space partition Ω1|`; in blue are the newly introduced boundaries from the current layer, in grey are the existing boundaries. The absence of BN (top row) leaves the partition random and unalert of the data samples while BN (bottom row) positions and focuses the partition onto the data samples (while all other parameters of the BN are left identical); as per Thm. 7.1, BN minimizes the distances between the boundaries and the data samples.

Corollary 7.1 (Central Hyperplane Arrangement)

BN makes the layer input space partition boundaries (∂Ω(`)) a central hyperplane

arrangement, such that

D(`) \ z(`−1) ⊂ H(`,k), ∀` k=1 with z(`−1) the coordinate-wise average of the data (mini-batch during training, entire

training set during testing).

Thus, per layer, BN enforces that the coordinate-wise average of the layer inputs is

always located on the hyperplanes H(`,k), ∀k, and that the layer input space partition boundary ∂Ω(`) is always a central hyperplane arrangement, for any value of W (`).

While the above fully describes the role of BN into the layer input space partition, we now propose to extend this understanding to the entire DN being a composition 116

Layer 1 Partition seen in Layer 2 partition seen in Figure 7.2 : Depiction of the the DN/layer input space the DN input space layer (left) and DN (right) in- put space partition with L = 2,D(1) = 2,D(2) = 2. The par- tition boundaries of a layer in its input space corresponds to the hyperplanes H(`,k) (7.5), for deeper layers, vieweing H(`,k) in the DN input space leads to the paths P(`,k) (7.13).

of layers and BN operators.

7.3 Multiple Layer Analysis: Following the Data Manifold

In the multilayer case, the mapping consists of a composition of layers; the layer- wise fitting between ∂Ω(`) and the layer inputs z(`−1) (from Thm. 7.1) occurs at each layer in a greedy fashion. This layer-wise fitting also impacts the entire DN partition which we now propose to characterize. To do so, we first have to understand how the per-layer partitions Ω(`), ∀` combine to form the entire DN partition Ω. As we will see, this layer composition will result in the DN partitioning being a successively subdivided input partition as studied in Sec. 3.4; however, we propose here to derive those results in term of the partition boundaries, as those are the objects that allow us to characterize the role of BN.

7.3.1 Deep Network Partition and Boundaries

The DN composes L layers, each with a layer-wise partition Ω(`) from 7.15 which is affected by the presence of BN. The entire DN is itself a Continuous Piecewise Affine

(CPA) operator with a partition Ω of its input space. As was the case in the layer- 117

wise study, the DN remains affine for all inputs in a region ω ∈ Ω. This section aims

at deriving the DN partition in term of the layer-wise partitions Ω(`) which can be

efficiently leveraged to obtain the up-to-layer-` partition Ω1,...,`. To lighten notations, we will refer to Ω(1,...,`) directly as Ω(1→`) where |` can be though of as ‘up-to-layer-`’.

The DN partition Ω is simply obtained from the special case Ω|L while the partition Ω(1→`) can be thought of as the DN partition of a DN only comprising the layers

1, . . . , `. To find Ω, we have to recall that the DN remains linear in ω ∈ Ω as long

as all inputs x ∈ ω, when mapped through all the layers, produce pre-activations

h(1),..., h(L) with same sign (recall 7.5). The `th pre-activation sign will change

based on the `th layer partition, but now seen in the DN input space. Hence, by first

expressing the layer-wise partition Ω(`) in the DN input space we will easily obtain

Ω by taking their intersections. The layer-wise partition expressed in the DN input

space corresponds to finding the DN inputs (and not the layer inputs) such that layer

(`) ` remains linear. We denote this partition as Ω0 with the lowerscript 0 emphasizing that we are now working in the DN input space, and it is given by

D(`) (`) [ D (`) ∂Ω0 = {x ∈ R :[h ]k = 0}. (7.9) k=1

(`) (`) By comparing the above with 7.5,7.6 we see that ∂Ω0 is just ∂Ω “mapped back” to the DN input space. This mapped-back partition now depends on all the earlier layers. To make 7.9 explicit we need to introduce the max-affine spline operators expressing the up-to-layer-` mapping as an input/region dependent affine mapping

(`) (1→`) (1→`) as in z = Ax x + bx with

(1→`) (`) (`) (`−1) (`−1) (1) (1) Ax , Q W Q W ... Q W , (7.10) ` X µ(`) b(1→`) Q(`)W (`)Q(`−1)W (`−1) ... Q W diag( ), (7.11) x , i+1 i+1 σ(`) i=1 118

and where Q(`) is a diagonal matrix encoding the activation function “state” recalling

(4.2) and (4.3). The is each matrix Q(`) is diagonal and filled with {0, 1} for ReLU,

{−1, 1} for abs. value and then scaled by the BN parameter σ(`) as   (`) (`) (`−1) (`) α/[σ ]i, [W z − µ ]i ≤ 0 (`)  [Q ]i,i = . (7.12)  (`) (`) (`−1) (`) 1/[σ ]i, [W z − µ ]i > 0

We thus obtain explicitly the mapped-back layer-wise partition

D(`) (`) [ [ (`,k) ∂Ω0 = P |ω, k=1 ω∈Ω(1→`−1) | {z } ,P(`,k)

(`,k) where P |ω is either empty, or corresponds to a polytope face living in ω, a region of the partition Ω(1→`−1) of the DN input space, formally given by

(`,k) n (1→`−1) T (`) (`) P |ω = x ∈ ω : h(A )ω [W ]k,., xi = [µ ]k

(`) (1→`−1) o (1→`−1) − h[W ]k,., bω i , ∀ω ∈ Ω . (7.13)

We emphasize that while ∂Ω(`) from 7.6 provides the boundaries of the per-layer

(`) partition in layer input space, ∂Ω0 provides a mapped back view of it in the DN input space (compare 7.5 and 7.9). Due to the presence of nonlinear layers between the DN input space and layer ` input space, the hyperplanes H(`,k) from ∂Ω(`) now

(`,k) become a collection of polytope faces (P |ω) that depend on the earlier layers weights and partitions. Given our construction, the up-to-layer-` partition Ω(1→`) is then simply obtained by ( ) (1→`) \ (1) (`) Ω = ω, U ∈ Ω0 × · · · × Ω0 , (7.14) ω∈U L D(`) [ [ ∂Ω(1→`) = P(`,k). (7.15) `=1 k=1 119

We now study how BN impacts this DN input space partition Ω i.e. Ω|L by acting upon P(`,k). We demonstrated in Thm. 7.1 how BN shifts the hyperplanes H(`,k)

onto the layer inputs. As described above, those hyperplanes H(`,k) become P(`,k)

in the DN input space and contribute to the DN input space partition (recall 7.15).

It is thus clear that by acting on H(`,k), BN also indirectly acts upon P(`,k), and

ultimately upon Ω. The first result demonstrates how any reduction in the distance

between the hyperplane H(`,k) and the layer inputs z(`−1) implies a reduction of the

distance between P(`,k) and the DN inputs x which is a crucial result demonstrating

that (even though done per-layer) BN also adapts the entire DN partition boundaries

∂Ω to get close to the samples x. Theorem 7.2 (Multilayer BN distance)

(`,k) (`−1) Reducing the distance between an hyperplane H and a layer input zx also reduces the distance between the folded (mapped back) hyperplane P(`,k) and the

DN input x, in particular we have:

(`−1) (`,k) • if the layer ` input zx lies on an hyperplane H then the corresponding

DN input x lies on the folded hyperplane P(`,k) and vice-versa:

(`−1) (`,k) (`,k) d(zx , H ) = 0 ⇐⇒ d(x, P ) = 0

(1,...,`−1) (`,k) • if the DN input x lives on a region ω ∈ Ω where P |ω is non-empty, then the distances in the DN input space and in the layer input space are

proportional:

d(z(`−1), H(`,k)) ∝ d(x, P(`,k)).

The above result demonstrates how the layer-wise result from Thm. 7.1 (minimizing

distances between H(`,k) and z(`−1)) extends to a composition of layers, hence a DN. 120

P1,k P4,k

Figure 7.3 : Depicition of P(`,k), ` = 1, 4 where for each `, P(`,k) is colored based on (`) (`) (2) [σ ]k/k[W ]k,.k (blue: smallest, green: highest). As per Thm. 7.1, 7.2, the bluer colored paths are the ones closer to the dataset (black dots) allowing interpretability of the σ(`) parameter as the fitness between P(`,k) and the mini-batch samples.

In fact, we showed that this layer-wise fitting alse reduces the distance between P(`,k) and x. We illustrate this result in Fig. 7.1.

7.3.2 Interpreting Each Batch-Normalization Parameter

We now propose a more focused analysis on the role of σ(`). As we extensively studied

(`) (`,k) (`) (`) (`) above, µ shifts the hyperplanes H while σ does not impact ∂Ω nor ∂Ω0

(`0) 0 (`) (`) but only ∂Ω0 , ` > `. Yet, while µ acts as a shift, σ impacts the angles between adjacent polytope faces of P(`0,k), ∀`0 > `, ∀k. Furthermore, it will adapt the angle based on the underlying least squared error (recall Thm 7.1).

Theorem 7.3

The dihedral angle θ between adjacent faces of the folded hyperplanes are impacted by σ(`) and not by µ(`); in particular in a two layer DN, the angle between two adjacent 121

(1,q) faces of P2,k and separated by H is given by |[W (2)]T Q(1)(ω)W (1)[W (1)] | (2,k) (1,q) k,. q,. θ P |ω, H = arccos . (7.16) (2) T (1) (1) (2) (2) k[W ]k,.Q (ω)W k k[W 1]q,.k To better illustrate the above, let’s consider orthogonal weights W (1)(W (1))T = 1

which is often employed in DNs to improve performances [Bansal et al., 2018], and

absolute value activation [Mallat, 2012]. The above becomes

(2) (1) (2,k) (1,q) |[W ]k,q|/[σ ]q θ P |ω, H = arccos (1) . PD (2) (1) i=1 |[W ]k,i|/[σ ]i It is clear that as the first layer hyperplane H(1,q) gets more and more aligned with the

data (smallest total least square distance than H1,j, j 6= q) as the second layer folded

(1,q) (2,k) (1,q) hyperplane will get more and more aligned with H as θ P |ω, H → 0. This demonstrates how not only µ(`) but also σ(`) plays a crucial role into the shifting and alignment of the entire DN partition onto the data samples.

7.3.3 Experiments: Batch-Normalization Focuses the Partition onto the

Data

The proposed understanding of BN, highlighted in Fig. ?? and 7.3 clearly concentrates the partition boundaries onto the data. To further demonstrate how this partition concentration occurs we reproduce those experiments for deeper and wider DN on various datasets. In all the following experiments the DNs are with random weights, only BN is applied throughout, no training is done. In particular, we focus on three settings: (random) where no BN is employed, W (`), b(`) are random, (zero) where no BN is applied and b(`) = 0, this is a common DN initialization, and finally (BN)

(`) (`) (`) where W is random, and the BN statistics µBN and σBN are set as per 7.3 (BN). Notice that in all cases, no labels are needed as all three initialization schemes are at most depending on the samples xn. 122

Low-Dimensional Visualization

First, we propose to employ more complex DN architecture while leveraging a low- dimensional (2d) dataset. We artficialy construct a star shape dataset and visually depict the concentration of partition boundaries via a backpropagation scheme. That is, we overlay the boundaries, superimposing them, leading to a heatmap where high values indicate high number of boundaries in an -ball region. The uniform sampling in the -ball is done efficiently via a technique noted in Harman and Lacko [2010] and proved in Voelker et al. [2017] relying on Gaussian distributions. We depict those concentration maps for our three initialization settings (zero, random, BN) in Fig. 7.4. We see how increasing the width and depth allows BN to precisely concentrate the DN input space partition onto the samples. That is, by forcing the

DN partition boundary ∂Ω to be onto the data samples xn, the number of regions near the data samples also increases as (recall 7.13) “crossing” a boundary implies a change of region, we also confirm this by explicitly counting the number of regions around samples in the next section.

High-Dimensional Experiments

Second, we extend the above experiment to more realistic samples that the CIFAR and SVHN images. Each image is RGB with 32 × 32 pixels. As the DN input space is now high dimensional (3072), we propose a different strategy to convey the visual information from Fig. 7.4. For each dataset we pick 100 images as well as 100 random images generated with uniform distribution for each pixel. For each of those 200 samples, we count the number of regions around the sample in an -ball. Counting is done by extensively sampling around each sample and recording the number of unique activation patterns (recall (7.15)). We report those numbers in Fig. 7.5 and observed 123

that BN does significantly increase the number of regions around the dataset images

while it barely impacts the number of regions present around random samples. This

effectively confirm the same behavior as seen in Fig. 7.4.

From the empirical result of Fig. 7.1,7.4 and 7.5 we validate the results from

Thm. 7.1,7.2 demonstrating that BN actively alters the layer-wise and DN input

space partition in order to shift and orient the boundaries onto the data samples in

turn leading to greater number of regions around the data samples. We now study

the impact of such concentration for classification tasks.

7.4 Where is the Decision Boundary

We demonstrated how BN actively contributes to the DN partition in an unsupervised

manner. We now demonstrate that this a priori naive objective is well adapted for

supervised tasks by jump-starting learning thanks to the decision boundary being

already near the data points. We also conclude by studying the role of the BN

learnable parameters.

7.4.1 Batch-Normalization is a Smart Initialization

To understand the benefit of BN in a classification setting we first need to relate the

DN decision boundary to the DN input space. Without loss of generality, consider a

binary classification problem leading to a last layer with a single output unit (D(L) =

1). The sign of this last layer output predicts the input class. The decision boundary

thus corresponds to the zero-set of z(L), which in fact falls back to 7.13. That is, the decision boundary is simply the folded hyperplane of the last layer, P(L−1). As a result, the decision boundary leaves on the -up-to-layer-L − 1 DN partition Ω(1→L−1)

(recall 7.14) which is already adapted to the given dataset as per the entire previous 124

0 0 0 samples ∂Ω3 ∂Ω7 ∂Ω11

0.2 0.2 0.2

0.0 0.0 0.0

1.0 0.7 0.7 zero bias random BN

Figure 7.4 : This figure reproduces the experiment from Fig. 7.1 with a more complex (2-D) input dataset (left) and a much wider DN with D(`) = 1024 and L = 11. We depict for some layers the boundaries of the layer partitions seen in the DN input (`) space (∂Ω0 , recall 7.13) for DNs with different initializations: random for slopes and biases (random), or random for slopes and zero for biases (zero) or the scaling of the (`) (`) slopes and the biases are initialized from the BN statistics µBN and σBN from 7.3 (BN). The overlap of multiple partition boundaries induces a darker color demonstrating the presence of more partition boundaries for each spatial location. Clearly, BN concentrates the partition boundaries onto the data samples.

section. In particular, we have the following result. Remark 7.1

The increased boundary concentration around data samples obtained from BN allows for finer decision boundaries around the data samples.

In fact, the link between DN input space partition and decision boundary was 125

around images around noise samples 30000 random zero 20000 bn

10000 # regions

0 0.005 0.010 0.015 0.020 0.005 0.010 0.015 0.020

radius () radius () Figure 7.5 : Average number of regions from the DN partition Ω in an -ball around 100 CIFAR images (left) and 100 random images (right) for a CNN demonstrat- ing that BN adapts the partition on the data samples. The weights initialization (random, zero, BN) follows Fig. 7.4. Additional dataset and architectures are given in Appendix showing the same result.

first studied in [Balestriero et al., 2019] where it was found that the more partition boundary there is, the more folds can the decision boundary have; in our case, BN alone allows for this beneficial property without any supervised training. Lastly, we also have the following result demonstrate how without training (at initialization) there will be samples from the DN input that lies on each side of the decision boundary.

Proposition 7.1 (Decision boundary position)

At initialization, for any weight matrix W (`) and layer activation functions σ(`), ` =

1,...,L , zero last layer bias b(L) = 0 (usual initialization) and σ(L−1) =leaky-ReLU,

using BN ensures that there will be samples in the training mini-batch on each side

of the decision boundary.

The above result is equivalent of saying that the DN decision boundary passes

through the mini-batch samples in the DN input space for any random initialization

thanks to the presence of BN. Along the same lines, this BN induced initialization is 126 also beneficial to solve the dying neuron problem [Trottier et al., 2017] which prevents learning of some DN parameters and is usually solved by careful initialization [Glorot and Bengio, 2010, He et al., 2015a]. BN also allows to solve this problem as it has been empirically demonstrated in Liao and Carneiro [2016]. We now empirically validate how this BN initialization is position the DN partition and decision boundary closer to the optimal ones.

7.4.2 Experiments: Batch-Normalization Initialization Jump-Starts Train-

ing

Up to this point, the BN analysis was studied from an unsupervised angle as BN does not use any external information about the data samples to alter the DN input space partition. We now move to a supervised classification task to empirically validate

Remark. 7.1 stating how the region concentration resulting from using BN is beneficial to obtain higher resolution decision boundary and this should help classification.

To validate this, we use the same three initialization settings as in the last section: random, zero and BN. Then training is done to solve the task but no BN is applied during training (that is, once the BN statistics are initialized, they are not updated during the training as it would be the cases is using BN). To do so, we first get the values for µ(`), β(`) based on the training set, with a randomly initialization DN (no training is performed yet). Then, those values are used as the BN statistics, and kept frozen during training as opposed to being updated for each mini-batch. Hence, this

BN initialization is nothing more than a “smart” weight initialization of DNs where information about the samples are used. We compare this BN initialization with the usual random weight initialization and zero bias initialization (standard init. in DNs) and propose in Fig. 7.6 the classification performance on various datasets for varying 127 depth Resnet DNs. We clearly see how BN initialization is responsible for a significant performance gain and faster training. This demonstrates an entirely new aspect of

BN that has not yet been studied theoretically by prior work as they focused on the effect of BN used during training and its impact of the DN loss surface. Since here

BN is only used as an initializer, the measured gain of performances are only due to a better positioning of the DN initial partition Ω.

We demonstrated how BN not only adapts the partition onto the data samples but do so in a way that directly provide the decision boundary with an adapted resolution following the dataset at hand. As such, even when used solely as a “smart” initialization technique, BN is able to greatly improve DN performances compared to the random data agnostic initialization. This results supports the importance of DN initialization and how a good initialization alone plays a crucial role in performances as empirically studied with slope variance constraints [Mishkin and Matas, 2015, Xie et al., 2017], singular values constraints [Jia et al., 2017] or with orthogonal constraints

[Saxe et al., 2013, Bansal et al., 2018].

7.5 The Role of the Batch-Normalization Learnable Param-

eters

We now study the impact of the BN learnable parameters β(`) and γ(`). The previous results and characterization of the layer and DN input space partition (7.6 and 7.15) is done by setting β(`) = 0 and γ(`) = 1, recall 7.2 which corresponds to the standard initialization of BN. That is, all the above derived results on BN and its ability to

“fit” the partition boundaries to the data samples occurs exactly during the first mini-batch, and then β(`) and γ(`) are adapted. We demonstrate here that even when 128

Inception Resnet50 EfficientNet CIFAR100 CIFAR10 SVHN

Figure 7.6 : Image classification with different architectures on SVHN, CIFAR10/100. In all cases no BN is used during training; the initialization of the weights is either (`) (`) random (black) or random a with fixed BN parameters µBN, σBN, ∀` (blue). That is, the BN parameters are found as-per the BN strategy in a pretraining phase, and then those parameters are frozen (all other parameters remained at their random initialization). Then training start and the random parameters are tuned based on the loss at han. We can see that BN initialization (again, no BN is used during training) is beneficial to reach better accuracy effectively showing that BN initialization alone plays a crucial role for DNs. In most cases, the DN that does not leverage the BN initialization diverges altogether.

those parameters are learned, the insights developed above remain valid.

To do so, we first derive the following result which demonstrates that one can effectively disregard the γ parameter as its impact on the layer mapping will be can- celed by subsequent layers in the DN. That is, in a DN, the only learnable parameter 129

that should be considered is β(`). Proposition 7.2

In a DN with multiple layers, learning of the γ(`) parameters for the intermediate layers is equivalent to learning of the β(`) parameter and W (`+1) parameter.

To gain further insights, we can illustrate the above result in this simple two layer case ! W (1)x − µ(1) f(x) = W (2)σ γ(1) + β(1) + b(2) σ(1) ! W (1)x − µ(1) = W 0(2)σ + β0(1) + b(2). σ(1)

Since the γ(`) can effectively be disregarded as per above without any impact on the

DN training, we are now left with studying the impact of β(`) on the results derived

in the previous sections. This parameter acts on the layer (and DN) input space

partition in a similar manner to the µ(`) parameters, by shifting the layer hyperplanes.

Hence, learning of this parameter allows a DN to move away from the BN position of

the partition boundaries if needed. We provide in the appendix the analytical form

of those partition boundaries in the presence of β(`). However we demonstrate in

Table ?? in the appendix that learning or not β(`) has very limited impact on the final

performances of DNs across tasks and architectures. This effectively demonstrates

that all the derived results can be extended to trained DNs as well.

7.6 Batch-Normalization Noisyness

(`) (`) The BN statistics µBN, σBN, during training, depend only on the observed mini-batch B. That is, they are estimators of the entire data mean and standard deviation

computed only from the B samples of the observed mini-batch. That is, the BN 130

init. (B=16) init. (B=256) learned (B=16) learned (B=256)

Figure 7.7 : Decision Boundaries realisations obtained for different batches on a 2-dimensional binary classification task. Each mini-batch (of size B) produces a different DN decision boundary based on the realisations of the random variables (`) (`) µBN, σBN (recall 7.17,7.18). Variance of those r.v. depend on B as seen in the figure. We depict those realisations at initialization (left) and after learning (right) for B = 16, 256, the latter producing smaller variance in the decision boundaries.

mean and standard deviations are random variables with different realisations for

each mini-batch. A direct application of probability and statistics [Von Mises, 2014]

will give us the following general result. Proposition 7.3

Assuming that the layer inputs follow some distribution W (`)z(`−1) ∼ Z with V ar(Z) =

2 (`) σ2 (`) 2 µ4 σ4(B−3) σ , we have V ar(µBN) = B and V ar((σBN) ) = B − B(B−1) .

Per the above result, we see that during training, the BN centering and scaling introduces an additive and multiplicative noise with variance increasing as the mini- batch size (B) decreases. This “noise” thus becomes detrimental for small mini- batches and has been empirically observed in Ioffe [2017]. We formalize this important property of BN in the following result and illustrate it in Fig. 7.7 where we depict 131 the DN decision boundary realisations from observing different mini-batches. Proposition 7.4

BN statistics during training introduce a multiplicative noise for the slope and additive noise for the bias at each layer with variance given in Prop. 7.3.

This combination of two different noises is different than in for example standard dropout [Gal and Ghahramani, 2016] where only one noise type is used (and most commonly the multiplicative noise is binary). To gain further insights, let’s assume a Gaussian form for the data distribution to be i.i.d. as z(`−1) ∼ N (µ, Σ) from which we obtain the following distributions

(`) T (`) ! [W ] Σ[W ]k,. [µ(`) ] ∼N [W (`)]T µ, k,. , (7.17) BN k k,. B 2  (`) T (`)  [W ]k,.Σ[W ]k,. [(σ(`) )2] ∼ X 2 . (7.18) BN k B N−1

2 with XN−1 a Chi-squared distribution with N − 1 degrees of freedom. From those distributions we see how the layer weights also interact with Σ to produce the final covariance matrix. As we depicted in Fig. 7.7, the BN noise induces a “jittering” effect in the per mini-batch DN decision boundary. However, during the final stages of training, it is comon to observe a training error of 0 meaning that the weights are not updated anymore and more importantly, that for each mini-batch, the decision boundary realisation remains able to perfectly classify the mini-batch samples. Such noise techniques have been proven beneficial for performances [Srivastava, 2013, Pham et al., 2014, Molchanov et al., 2017, Wang et al., 2018]. To further demonstrate that this noise is beneficial, we propose to artificially amplify the variance of the

BN noise by adding an additional Chi-square multiplicative noise and a Gaussian additive noise. By increase the variance of those random variables we are able to 132 further increase classification performances (averaged over 5 runs) with a Resnet10 from 93.34% to 93.68% (cifar10), from 72.22% to 72.74% (cifar100) and from 96.16% to 96.41% (svhn).

7.7 Discussions

We demonstrated how the ability to compute the DN input space partition provided a novel perspective to study batch-normalization. In particular, we proved that batch- normalization concentrates the DN input space partition around the data samples which offers an advantageous configuration that speeds-up training and stability gra- dient updates by allowing to start from a better initialized DN. This should open novel research direction to improve batch-normalization, and understand the tight relationship between DN input space partition and performances. Finally, with the novel argument that batch-normalization can be seen as a data aware initialization, we believe that novel perspectives will emerge in trying to provide more principle initializers that take into account the geometry of the data in a principled way based on a priori requirements. 133

Chapter 8

Insights Into (Smooth) Deep Networks Nonlinearities

In this chapter we demonstrate how the previous results and insights from Chap. 3 on continuous piecewise affine splines and deep networks can be ported to network architectures employing smooth nonlinearities. The backbone of this result will rely on turning the input space partition of MASOs into soft partitions which are often denoted as fuzzy partitions [Bezdek and Harris, 1978] and fuzzy sets [Dubois, 1980].

That is, a point no longer live or not in a region, but has a probability of being in each region. This process finds strong similarities with Gaussian Mixture Model clustering

(soft partition) versus k-means clustering (hard partition).

8.1 Introduction

Nonlinearities i.e. pointwise activation functions and nonlinear pooling operators are crucial to a DN’s performance. Indeed, without nonlinearity, the entire network would collapse to a simple affine transformation. But to date there has been little progress understanding and unifying the menagerie of nonlinearities, with few reasons to choose one over another other than intuition or experimentation. The key result of

Chap. 3 is that any DN layer constructed from a combination of linear and piecewise affine and convex is a MASO, and hence the entire DN is merely a composition of MASOs. MASOs have the attractive property that their partition of the signal space (the collection of multi-dimensional “knots”) is completely determined by their 134

affine parameters (slopes and offsets) (recall Sec. 3.4). This provides an elegant link

to vector quantization (VQ) and clustering. This is good progress for DNs based on

ReLU, absolute value, and max-pooling, but what about DNs based on classical, high-

performing nonlinearities that are neither piecewise affine nor convex like the sigmoid

gated linear unit [Elfwing et al., 2018], hyperbolic tangent gated linear unit [Roy

et al., 2019], or recent adaptive nonlinearities (with learnable parameter controling

their smoothness) like the swish [Ramachandran et al., 2017] that has been shown to

outperform others on a range of tasks?

In this chapter, we address this gap in the DN theory by developing a new

framework that unifies a wide range of DN nonlinearities and inspires and supports

the development of new ones. The key idea is to leverage the yinyang relation-

ship between deterministic VQ/K-means and probabilistic Gaussian Mixture Models

(GMMs) [Biernacki et al., 2000] input space partitions. Under a GMM, piecewise

affine, convex nonlinearities like ReLU and absolute value can be interpreted as solu-

tions to certain natural hard inference problems, while sigmoid and hyperbolic tangent

can be interpreted as solutions to corresponding soft inference problems. This chap- ter is organized as follows. The GMM-based extension of MASOs is developed in

Sec. 8.2 and is further extended in Sec. 8.3 where we develop the hybrid β-VQ. This generalized formulation which effectively allows one to interpolate and extrapolate between the deterministic and probabilistic regime will further extend the reach of the MASO analysis to DNs employing adaptive nonlinearities like the Swish and the

β-softmax pooling. 135

8.2 Max-Affine Splines meet Gaussian Mixture Models

An important consequence of (3.1) is that a MASO is completely determined by its slope and offset parameters without needing to specify the partition of the input space (the “knots” when D = 1). Indeed, solving (3.1) automatically computes an optimized partition of the input space RD that is equivalent to a vector quantization (VQ) [Nasrabadi and King, 1988, Gersho and Gray, 2012]. We can make the VQ aspect i.e. the region assignment of layer ` explicit by rewriting (3.1) in terms of the

(`) D(`)×R(`) (`) Hard-VQ (HVQ) matrix T H ∈ {0, 1} that contains D vertically stacked one-hot row vectors as

h (`)i T = 1 . (8.1) H   k,r (`) (`−1) (`) arg max h[Ar ]k.:,z i+[br ]k=r r=1,...,R(`) 

Given the input induced HVQ matrix, the MASO input-output mapping is affine and fully determined as

R(`) (`) X D (`) (`−1)E  [z ]k = [T H ]k,r [Ar ]k,:, z + [br]k , (8.2) r=1 the entire development that follows consist in deriving an alternative (soft) VQ matrix

(`) that when used in-place of T H in (8.2) would correspond to employing an alternative (smooth) nonlinearity.

8.2.1 From MASO to GMM via K-Means

For now, we focus on a single unit k from layer ` of a MASO DN. The same develop- ment generalize naturally to an entire layer by applying this analysis component-wise, as (recall (3.1)) the maximum is taken independently between units of a same layer.

Recalling Thm. 3.3 we can provide the following direct result that demonstrates for 136

which parameters the input space partition of the unit corresponds to a Voronoi

Diagram (recall Def. 3.1) i.e. the partition is the one of a k-mean algorithm.

Proposition 8.1 (Voronoi Diagram for a unit)

1  (`) 2  (`) Given − 2 Ar k,: 2 = br k, the MASO VQ partition corresponds to a K-means ∗  (`) clustering i.e. a Voronoi Diagram with centroids Ar k,: leading to

(`) [T ] = 1 2. H k,r  (`) (`−1) arg min Ar −z r=1,...,R(`) k,: 2

We now leverage the well-known relationship between K-means and Gaussian Mixture

Models (GMMs) [Bishop, 2006] to GMM-ize the deterministic VQ process of max-

 (`) affine splines. As we will see, the constraint on the value of br k in Prop. 8.1 will be relaxed thanks to the GMM’s ability to work with a nonuniform prior over the

regions (in contrast to K-means).

To move from a deterministic MASO model to a probabilistic GMM-like model, we

(`) reformulate the one-hot position of the HVQ selection variable [T ]k,: that we denote

(`) (`) (`) as [t ]k as an unobserved categorical variable [t ]k ∼ Cat([π ]k,:) with parameter

(`) (`) [π ]k,: ∈ 4R(`) with 4R(`) being the simplex of dimension R . Armed with this, we define the following generative model for the layer input z(`−1) as a mixture of R(`)

(`) D(`−1) 2 Gaussians with mean [Ar ]k,: ∈ R and identical covariance scaled with σ

R(`) (`−1) X  (`) z = 1   Ar + , (8.3) t(`) =r k,: r=1 k with  ∼ N (0, Iσ2). For reasons that will become clear below in Section 8.2.3, we will refer to the GMM model (8.3) as the Soft MASO (SMASO) model.

∗It would be more accurate to call this R(`)-means clustering in this case. 137

8.2.2 hard-VQ Inference

Given the GMM (8.3) and an input z(`−1), we can compute a hard inference of the opti-

(`) mal VQ selection variable [t ]k via the maximum a posteriori (MAP) principle which

(`) 1 (`) 2 (`) exp([br ]k+ 2 k[Ar ]k,:k ) falls back to the MASO region selection when setting [π ]k,t = (`) PR (`) 1 (`) 2 r=1 exp([br ]k+ 2 k[Ar ]k,:k ) as

(`) (`−1) [tc ]k = arg max p r|z r=1,...,R(`)

= arg max log p(z(`−1)|r)p(r) (Bayes rule and log preserves argmax) r=1,...,R(`) 1 2 1 2 (`−1) (`) (`) (`) = arg max − z − [Ar ]k,: + [br ]k + [Ar ]k,: r=1,...,R(`) 2 2 2 2

(`) (`−1) (`) = arg maxh[Ar ]k,:, z i + [br ]k, (8.4) r=1,...,R(`) falling back to the usual HVQ from (8.1). We formalize this result in the following statement Theorem 8.1 (GMM MAP)

Given a GMM with parameters σ2 = 1 for all mixtures and with mixture prior

(`) 1 (`) 2 (`) exp([br ]k+ 2 k[br ]k,:k ) (`) probability [π ]k,r = (`) , r = 1,...,R , the MAP inference PR (`) 1 (`) 2 r=1 exp([br ]k+ 2 k[Ar ]k,:k ) (`) of the mixture categorical variable [t ]k is given by (8.4) and corresponds to a MAS VQ (8.1).

Note that in Thm. 8.1 the bias constraint of Prop. 8.1 is completely relaxed. HVQ in- ference of the selection matrix sheds light on some of the drawbacks that affect any DN employing piecewise affine, convex activation functions. First, during gradient-based learning, the gradient will propagate back only through the activated VQ regions that

(`) correspond to the 1-hot entries in T H . The parameters of other regions will not be updated; this is known as the “dying neurons phenomenon” [Trottier et al., 2017,

Agarap, 2018]. Second, the overall MASO mapping is continuous but its derivative is 138

not leading to unexpected gradient discontinuities during learning and thus training

instabilities. Third, the HVQ inference contains no information regarding the confi-

dence of the VQ region selection, which is related to the distance of the query point

to the region boundary. As we will now see, this extra information can be very useful

and gives rise to a range of classical and new activation functions.

8.2.3 Soft-VQ Inference

We can overcome many of the limitations of HVQ inference in DNs by replacing the

1-hot entries of the HVQ selection matrix with the probability that the layer input

belongs to a given VQ region. This soft region assignment is captured by what we

denote as the soft-VQ (SVQ) matrix, the counter part of (8.1), given by

D (`) (`−1)E (`)  exp [Ar ]k,:, z + [br ]k [T (`)] = p [t(`)] = r|z(`−1) = , (8.5) S k,r k (`) D E  PR (`) (`−1) (`) r=1 exp [Ar ]k,:, z + [br ]k which follows from the simple structure of the GMM. This corresponds to a soft

(`) inference of the categorical variable [t ]k. Given the SVQ selection matrix, the MASO output is still computed via (8.2). The SVQ matrix can be computed indirectly from an entropy-penalized MASO optimization which follows from the exact same procedure as when formulating the E step of the EM algorithm as a maximization problem [Neal and Hinton, 1998, Manning and Klein, 2003, Mount, 2011]. Proposition 8.2

The entries of the SVQ selection matrix (8.5) solve the following entropy-penalized maximization, where H(·) is the Shannon entropy†

(`) Rk (`) X D (`) E (`)  [T ] = arg max [r] [A ] , z(`−1) + [b ] + H(r). (8.6) S k,: r [r]r k,: [r]r k r∈4 R(`) r=1

†The observant reader will recognize this as the E-step of the GMM’s EM learning algorithm. 139

Prop. 8.2 unifies HVQ and SVQ in a single optimization problem. The transition from HVQ (8.1) to SVQ (8.5) is obtained simply by adding the entropy regularization

H(r) in the optimization problem.

8.2.4 Soft-VQ MASO Nonlinearities

Remarkably, switching from HVQ to SVQ MASO inference recovers several classical and powerful nonlinearities and provides an avenue to derive completely new ones.

(`) (`) Given a set of MASO parameters A: , b: for calculating the layer-` output of a DN via (3.1), we can derive two distinctly different DNs: one based on the HVQ inference of (8.1) that produces back the usual affine spline that has been thoroughly studied in Chap. 3, and one based on the SVQ inference of (8.5) and that is smooth. By inserting some usual MASO parameters we can draw the following links.

Proposition 8.3 (Equivalent between nonlinearities)

(`) (`) The MASO parameters A: , b: that induce the ReLU activation under HVQ in- duce the sigmoid gated linear unit [Elfwing et al., 2018] under SVQ. The MASO

(`) (`) parameters A: , b: that induce the absolute value activation under HVQ induce the hyperbolic tangent gated linear unit [Roy et al., 2019] under SVQ. The MASO

(`) (`) parameters A: , b: that induce the max-pooling nonlinearity under HVQ induce softmax-pooling [Boureau et al., 2010] under SVQ.

8.3 Hybrid Hard/Soft Inference via Entropy Regularization

Combining the hard-VQ optimization problem and the soft-VQ optimization problem yields a hybrid optimization for a new β-VQ that recovers hard, soft, and linear VQ inference as special cases defined by weighting the importance of the data driven categorical variable inference (left term of (8.6)) and the regularizer given by the 140

Shannon Entropy (right term of (8.6)). We denote this β-VQ and the induced matrix

as (`) Rk (`) X D (`) (`−1)E (`)  [T β ]k,: = arg max β [r]r [Ar ]k,:, z + [br ]k + (1 − β) H(r), (8.7) r∈4 R(`) r=1 with the new hyper-parameter β ∈ (0, 1). In a similar fashion as done in Thm. 8.1 we obtain the analytical solution of the β-VQ as follows. Theorem 8.2 (β-VQ analytical solution)

The unique global optimum of (8.7) is given by

 β D (`) (`−1)E (`)  exp 1−β [Ar ]k,:, z + [br ]k [T (`)] = . (8.8) β k,r (`)  D E  PR β (`) (`−1) (`) j=1 exp 1−β [Aj ]k,:, z + [bj ]k The β-VQ covers all of the theory developed above as special cases: β → 1 yields

1 HVQ, β = 2 yields SVQ, and β = 0 yields a linear MASO. See Figure 8.1 for exam- ples of how the β parameter interacts with three example activation functions. Note

also the attractive property that (8.8) is differentiable with respect to β. The β-VQ

supports the development of new, high-performance DN nonlinearities. For exam-

ple, the swish activation σswish(u) = σsig(ηu)u extends the sigmoid gated linear unit with the learnable parameter η [Ramachandran et al., 2017]. Similarly, the LiSHT

σlisht(u) = σhtan(ηu)u extends the hyperbolic tangent gated linear unit [Roy et al., 2019]. Numerous experimental studies have shown that DNs equipped with a learned

swish/lisht activation significantly outperform those with more classical activations

like ReLU. We now formalize the ability of the β-VQ to recover those learnable acti-

vation functions as the learnability of the smoothness that each activation exhibits. Proposition 8.4 (Swish/lisht as β-VQ)

(`) (`) The MASO A: , b: parameters that induce the ReLU or absolute value nonlinearity under HVQ induce the swish or lisht nonlinearity respectively under β-VQ, with

β η = 1−β . 141

Figure 8.1 : For the MASO parameters A(`),B(`) for which HVQ yields the ReLU, absolute value, and an arbitrary convex activation function, we explore how changing β in the β-VQ alters the induced activation function. Solid black: HVQ (β = 1), 1 Dashed black: SVQ (β = 2 ), Red: β-VQ (β ∈ [0.1, 0.9]). Interestingly, note how some of the functions are nonconvex.

8.4 Discussions

Our development of the SMASO model opens the door to several new research ques- tions. First, we have merely scratched the surface in the exploration of new nonlinear activation functions and pooling operators based on the SVQ and β-VQ. For exam- ple, the soft- or β-VQ versions of leaky-ReLU, absolute value, and other piecewise affine and convex nonlinearities could outperform the new swish nonlinearity. Sec- ond, replacing the entropy penalty in the (8.6) and (8.7) with a different penalty will create entirely new classes of nonlinearities that inherit the rich analytical properties of MASO DNs. 142

Appendix A

Insights into Generative Networks

A.1 Architecture Details

We describe the used models below. The Dense(T) represents a fully connected layer with T units (activation function not included). The Conv2D(I, J, K) represent I

filters of spatial shape (J, K) and the input dilation and padding follow the standard definition. For the VAE models the encoder is given below and for the GAN models the discriminator is given below as well. FC GAN model means that the FC generator is used in conjunction with the discriminator, the CONV GAN means that the CONV generator is used in conjunction with the discriminator and similarly for the VAE case. 143

FC generator CONV generator Encoder Discriminator

Dense(256) Dense(256) Dense(512) Dense(1024)

leaky ReLU leaky ReLU Dropout(0.3) Dropout(0.3)

Dense(512) Dense(8 * 6 * 6) leaky ReLU leaky ReLU

leaky ReLU leaky ReLU Dense(256) Dense(512)

Dense(1024) Reshape(8, 6, 6) leaky ReLU Dropout(0.3)

leaky ReLU Conv2D(8, 3, 3, inputdilation=2, pad=same) Dense(2*S) leaky ReLU

Dense(28*28) leaky ReLU Dense(256)

Conv2D(8, 4, 4, inputdilation=3, pad=valid) Dropout(0.3)

Reshape(28*28) leaky ReLU

Dense(2)

all the training procedures employ the Adam optimizer with a learning of 0.0001 which stays constant until training completion. In all cases training is done on 300 epochs, an epoch consisting of viewing the entire image training set once.

Tangent constraint For the experiment in Fig. 4.9 on the tangent constraint, we employed the follow deep autoencoder: 144

AE

Dense(1024)

ReLU

Dense(1024)

ReLU

Dense(S)

Dense(1024)

ReLU

Dense(1024)

ReLU Dense(D)

A.2 Proofs

A.2.1 Proof of Thm 4.1

Proof A.1 The result is a direct application of Corollary 3 in Balestriero et al. [2019]

adapted to GDNs (and not classification based DNs). The input regions are proven

to be convex polytopes. Then by linearity of the per region mapping, convexity of

the projected region is preserved and with form given by (4.4).

A.2.2 Proof of Proposition 4.1

Proof A.2 First recall the standard result that

rank(AB) ≤ min(rank(A), rank(B)),

for any matrix A ∈ RN×K and B ∈ RK×D (see for example Banerjee and Roy [2014] chapter 5). Now, noticing that min(min(a, b), min(c, d)) = min(a, b, c, d) leads to the desired result by unrolling the product of matrices that make up the Aω matrix to 145

obtain the desired result.

A.2.3 Proof of Proposition 4.2

Proof A.3 The first bound is obtained by taking the realization of the noise where

r = 0, in that case the input space partition is the entire space as any input is mapped

to the same VQ code. As such, the mapping associated to this trivial partition has 0

slope (matrix filled with zeros) and a possibly nonzeros bias; as such the mapping is

zero-dimensional (any point in the latent space is mapped to the same point in the

ambient space). This gives the lower bound stating that in the mixture of GDNs, one

will have dimension 0. For the other case, simply take the trivial case of r = 1 which

gives the result.

A.2.4 Proof of Theorem 4.2

T −1 T Proof A.4 First, notice that P (Aω) = Aω(Aω Aω) Aω defines a projection matrix. In fact, we have that

2 T −1 T T −1 T P (Aω) = Aω(Aω Aω) Aω Aω(Aω Aω) Aω

T −1 T = Aω(Aω Aω) Aω

= P (Aω)

T −1 and we have that (Aω Aω) is well defined as we assume injectivity (rank(Aω) = S)

T making the S × S matrix Aω Aω full rank. Now it is clear that this projection matrix maps an arbitrary point x ∈ RD to the affine subspace G(ω) up to the bias shift. As we are interested in the angle between two adjacent subspaces G(ω) and G(ω0) it is

also clear that the biases (which do not change the angle) can be omited. Hence the

task simplifies to finding the angle between P (Aω) and P (Aω0 ). This can be done by 146

means of the greatest principal angle (proof can be found in Stewart [1973]) with the

0  result being sin θ(G(ω), G(ω )) = kP (Aω) − P (Aω0 )k2 as desired.

Discussion on volume and affine mappings

Proof A.5 In the special case of an affine transform of the coordinate given by the

matrix A ∈ RD×D the well known result from demonstrates that the change of volume is given by | det(A)| (see Theorem 7.26 in Rudin [2006]). However in our case the

mapping is a rectangular matrix as we span an affine subspace in the ambiant space

RD making | det(A)| not defined. However by applying Sard’s theorem Spivak [2018] we obtain that the change of volume from the region ω to the affine subspace G(ω)

is given by pdet(AT A) which can also be written as follows with USV T the svd-

decomposition of the matrix A:

pdet(AT A) = pdet((USV T )T (USV T )) =pdet((VST U T )(USV T ))

=pdet(VST SV T )

=pdet(ST S) Y = σi(A)

i:σi6=0

A.2.5 Proof of Theorem 4.3

T −1 T Proof A.6 We will be doing the change of variables z = (Aω Aω) Aω (x − bω) where

+ T −1 T we will denote for clarity Aω , (Aω Aω) Aω . First, we know that PG(z)(x ∈ w) = −1 R Pz(z ∈ G (w)) = G−1(w) pz(z)dz which is well defined based on our full rank assumptions. We then proceed by

X Z PG(z)(x ∈ w) = pz(z)dz −1 ω∈Ω ω∩G (w) 147

Z q X −1 + T + = pz(G (x)) det((Aω ) Aω )dx −1 ω∈Ω ω∩G (w) Z X −1 Y + = pz(G (x))( σi(Aω ))dx −1 ω∩G (w) + ω∈Ω i:σi(Aω )>0 Z X −1 Y −1 = pz(G (x))( σi(Aω)) dx Etape 1 ω∩G−1(w) ω∈Ω i:σi(Aω)>0 Z X −1 1 = pz(G (x))q dx ω∩G−1(w) T ω∈Ω det(Aω Aω)

+ −1 Let’s now prove the Etape 1 step by proving that σi(A ) = (σi(A)) where we

T lighten notations as A := Aω and USV is the svd-decomposition of A:

A+ = (AT A)−1AT =((USV T )T (USV T ))−1(USV T )T

=(VST U T USV T )−1(USV T )T

=(VST SV T )−1VST U T

=V (ST S)−1ST U T

+ −1 =⇒ σi(A ) = (σi(A)) with the above it is direct to see that pdet((A+)T A+) = √ 1 as follows ω ω T det(Aω Aω) q + T + Y + Y −1 det((Aω ) Aω ) = σi(Aω ) = σi(Aω)

i:σi6=0 i:σi6=0 !−1 Y = σi(Aω)

i:σi6=0 1 = q T det(Aω Aω) which gives the desired result. 148

Appendix B

Expectation Maximization Training of Deep Generative Networks

B.1 Computing the Latent Space Partition

In this section we first introduce notations and demonstrate how to express a region ω

of the partition Ω as a polytope defined by a system of inequalities, and then leverage

this formulation to demonstrate how to obtain Ω by recursively exploring neighboring

regions starting from a random point/region.

Regions as Polytopes To represent the regions ω ∈ Ω as a polytope via a system

of inequalities we need to recall from (3.1) that the input-output mapping is defined

on each region by the affine parameters Aω,Bω themselves obtained by composition of MASOs. Each layer pre-activation (feature map prior application of the nonlinearity)

` D` ` 1→` 1→` is denoted by h (z) ∈ R , ` = 1,...,L−1 and given by h (x) = Aω z +bω , with up-to-layer ` affine parameters

1→` ` `−1 `−1 1 1 1→` D`×S Aω , W Qω W ... QωW , Aω ∈ R , (B.1) ` 1→`−1 ` X ` `−1 `−1 i i 1→` D` bω , v + W Qω W ... Qωv , bω ∈ R , (B.2) i=1 which depend on the region ω in the latent space ∗. Notice that we have in particular

L L Aω = Aω and bω = bω, the entire DGN affine parameters from (??) on region ω.

∗looser condition can be put as the up-to-layer ` mapping is a CPA on a coarser partition than

Ω but this is sufficient for our goal. 149

The regions depend on the signs of the pre-activations defined as q`(z) = sign(h`(z)) due to the used activation function behaving linearly as long as the feature maps preserve the same sign. This holds for (leaky-)ReLU or absolute value, for max- pooling we would need to look at the argmax position of each pooling window, as

all pooling is rare in DGN we focus here on DN without max-pooling; let q (z) , [(qL−1(z))T ,..., (q1(z))T ]T collect all the per layer sign operators without the last layer as it does not apply any activation. Lemma B.1

The qall operator is piecewise constant and there is a bijection between Ω and im(q).

The above demonstrates the equivalence of knowing ω in which an input z belongs to and knowing the sign pattern of the feature maps associated to z; we will thus use interchangeably qall(z), z ∈ ω and qall(ω). From this, we see that the pre-activation signs and the regions are tied together. We can now leverage this result and provide the explicit region ω as a polytope via its sytem of inequality, to do so we need to collect the per-layer slopes and biases into     1→L−1 1→L−1 Aω bω     QL−1 ` QL−1 ` all   all   all ( `=1 D )×S all `=1 D Aω =  ...  , bω =  ...  , Aω ∈ R , bω ∈ R . (B.3)      1→1   1→1  Aω bω

Corollary B.1

The H-representation of the polyhedral region ω is given by

L−1 S all all all \ S 1→` ` 1→` ω = {z ∈ R : Aω z < −q (ω) bω } = {z ∈ R : Aω z < −q (ω) bω }, `=1 (B.4) with the Hadamard product. 150

Proof B.1 From the above result, it is clear that the preactivation roots define the

boundaries of the regions. Obtaining the hyperplane representation of the region

thus simply consists of reexpressing this statement with the explicit pre-activation

hyperplanes for all the layers and units, the intersection between layers coming from

the subdivision. For additional details please see Balestriero et al. [2019].

From the above, it is clear that the sign locates in which side of each hyperplane

the region is located. We now have a direct way to obtain the polytope ω from its

sign pattern qall(ω) or equivalently from an input z ∈ ω; the only task left is to obtain

the entire partition Ω collecting all the DN regions, which we now propose to do via

a simple scheme.

Partition Cells Enumeration. The search for all cells in a partition is known

as the cell enumeration problem and has been extensively studied in the context of

speicific partitions such as hypreplane arrangements Avis and Fukuda [1996], Sleumer

[1999], Gerstner and Holtz [2006]. In our case however, the set of inequalitites of

different regions changes. In fact, for any neighbour region, not only the sign pattern

all all all q will change but also Aω and bω due to the composition of layers. In fact, changing one activation state say −1 to 1 for a specific unit at layer ` will alter the

affine parameters from (B.1) and (B.2) due to the layer composition. As such, we

propose to enumerate all the cells ω ∈ Ω with a deterministic algorithm that starts

from an intial region and recursively explores its neighbouring cells untill all have been

visited while recomputing the inequality system at each step. To do so, consider the

initial region ω0. First, one finds all the non-redundant inequalities of the inequality system (B.4), the remaining inequalities define the faces of the polytope ω. Second,

one obtains any of the neighbouring regions sharing a face with ω0 by switching the sign in the entry of q(ω0) corresponding to the considered face. Repeat this for all 151

non-redundant inequalities to obtain all the adjacent regions to ω0 sharing a face with it. Each altered code defines an adjacent region and its sytem of inequality can be obtain as per Lemma B.1. Doing so for all the faces of the initial region and then iterating this process on all the newly discovered regions will enumerate the entire partition Ω. We summarize this in Algo 1 in the appendix and illustrate this recursive procedure in Fig. 5.1.

We now have each cell as a polytope and enumerated the partition Ω, we can now turn into the computation of the marginal and posterior DGN distributions.

B.2 Analytical Moments for truncated Gaussian

To lighten the derivation, we introduce extend the [.] indexing operator such that for

th example for a matrix, [.]−k,. means that all the rows but the k are taken, and all

th th columns are taken. Also, [.](k,l),. means that only the k and l rows are taken and all the columns. Let also introduce the following quantities

 [F (a, Σ)]k =φ ([a]k; 0, [Σ]k,k)Φ[[a]−k,∞) µ(k), Σ(k)     [G(a, Σ)]k,l =φ [a](k,l); 0, [Σ](k,j),(k,j) Φ[[a]−(k,l),∞) µ (k, l) , Σ (k, l) ! l F (l, Σ) − Σ G(l, Σ)1 H(a, Σ) =G(a, Σ) + diag diag(Σ)

−1 −1 T with µ(u) = [Σ]−u,u[Σ]u,u[a]u, and Σ(u) = [Σ]−u,−u − [Σ]−u,u[Σ]u,u[Σ]−u,u. Thanks

0 to the above form, we can now obtain the integral eω(Σ) , Φω(0, Σ) and the first 1 R 2 two moments of a centered truncated gaussian eω(Σ) , ω zφ(z; 0, Σ) and Eω(Σ) , R T ω zz φ(z; 0, Σ) 152

Corollary B.2

The integral and first two moments of a centered truncated gaussian are given by

0 X X T  eω(Σ) = sΦ[l(C),∞) 0,RcΣRc dz, (B.5) ∆∈T (ω) (s,C)∈T (∆)

1 X X T T eω(Σ) =Σ sRC F (lω,c,RcΣRc ), (B.6) ∆∈T (ω) (s,C)∈T (∆)   2 X X T T 0 Eω(Σ) =Σ  sRC (H(lω,C ,RcΣRc ))RC  Σ + eω(Σ)Σ (B.7) ∆∈T (ω) (s,C)∈T (∆)

To simplify notations let consider the following notation of the posterior (5.4) where are incorporate the terms independent of z into

φ(x; B , Σ + A Σ AT ) α (x) = ω x ω z ω , (B.8) ω P T ω φ(x; Bω, Σx + AωΣzAω )Φω(µω(x), Σω) P leading to p(z|x) = ω∈Ω δω(z)αω(x)φ(z; µω(x), Σω). Theorem B.1

The first (per region) moments of the DGN posterior are given by

0 Ez|x[1z∈ω] = αω(x)eω(Σω),

[z1 ] = α (x)e1 (Σ ) + e0 (Σ )µ (x) Ez|x z∈ω ω ω−µω(x) ω ω−µω(x) ω ω [zzT 1 ] = α (x)E2 (Σ ) + e1 (Σ )µ (x)T Ez|x z∈ω ω ω−µω(x) ω ω−µω(x) ω ω + µ (x)e1 (x)T + µ (x)µ (x)T e0 (x) ω−µω(x) ω−µω(x) ω ω ω

0 1 T which we denote Ez|x[1z∈ω] , eω(x), Ez|x[z1z∈ω] , eω(x) and Ez|x[zz 1z∈ω] , 2 Eω(x).   CT Proof B.2 Constant:RC =    T −1 H Σω Z Z p(z|x)dz = αω(x) φ(z; µω(x), Σω)dz ω ω 153

Z = αω(x) φ(z; 0, Σω)dz ω−µω(x) = α (x)e0 ω ω−µω(x)

First moment:

− 1 (z−µ (x))T Σ−1(z−µ (x)) Z Z e 2 ω ω ω zp(z|x)dz =αω(x) z K/2 1/2 dz ω ω (2π) | det(Σω)| − 1 yT Σ−1y Z e 2 ω =α (x) (y + µ (x)) dz ω ω (2π)K/2| det(Σ )|1/2 ω−µω(x) ω   =α (x) e1 + e0 µ (x) ω ω−µω(x) ω−µω(x) ω

Second moment: Z Z T T zz p(z|x)dz =αω(x) zz φ(z; µω(x), Σω)dz ω Z T =αω(x) (y + µω(x))(y + µω(x)) φ(z; 0, Σω)dz ω−µω(x)

2 1 T 1 T T 0  = αω(x) E + µω(x)eω(Σω) + eω(Σω)µω(x) + µω(x)µω(x) eω−µω(x)(Σω)

B.3 Implementation Details

The Delaunay triangulation needs the V-representation of ω, the vertices which con- vex hull form the region Gr¨unbaum [2013]. Given that we have the H-representation,

finding the vertices is known as the vertex enumeration problem Dyer [1983]. To com- pute the triangulation we use the Python scipy Virtanen et al. [2020] implementation which interfaces the C/C++ Qhull implementation Barber et al. [1996]. To compute the H 7→ V representation and vice-versa we leverage pycddlib † which interfaces the C/C++ cddlib library ‡ employing the double description method Motzkin et al.

[1953].

†https://pypi.org/project/pycddlib/ ‡https://inf.ethz.ch/personal/fukudak/cdd_home/index.html 154

Algorithm 2 SearchRegion procedure Starting region ω and q(ω), initial set (Ω) if ω 6∈ Ω then

all all I = reduce(Aω ,Bω ) for r = 1,..., |Ω(T )| do

SearchRegion(flip(q(ω), i), Ω)

B.4 Algorithms

B.5 Proofs

In this section we provide all the proofs for the main paper theoretical claims. In particular we will go through the derivations of the per region posterior first moments and then the derivation of the expectation and maximization steps.

B.5.1 Proof of Lemma 5.1

Proof B.3 The proof consists of expressing the conditional distribution and using the properties of DGN with piecewise affine nonlinearities. We are able to split the distribution into a mixture model as follows:

1 − 1 (x−g(z))T Σ−1(x−g(z)) p(x|z) = e 2 x D/2p (2π) | det Σx|

1 − 1 (x−P 1 (A z+B ))T Σ−1(x−P 1 (A z+B )) = e 2 ω∈Ω z∈ω ω ω x ω∈Ω z∈ω ω ω D/2p (2π) | det Σx|

1 − 1 P 1 (x−(A z+B ))T Σ−1(x−(A z+B )) = e 2 ω∈Ω z∈ω ω ω x ω ω D/2p (2π) | det Σx|

1 1 T −1 X − (x−(Aωz+Bω)) Σx (x−(Aωz+Bω)) = 1z∈ω e 2 D/2p ω∈Ω (2π) | det Σx| X = 1z∈ωφ(x|Aωz + Bω, Σx) ω∈Ω 155

B.5.2 Proof of Proposition 5.1

Proof B.4 This result is direct by noticing that the probability to obtain a specific region slope and bias is the probability that the sampled latent vector lies in the corresponding region. This probability is obtained simply by integrating the latent gaussian distribution on the region. We obtain the result of the proposition.

B.5.3 Proof of Theorem 5.1

Proof B.5 For the first part, we simply leverage the known result from linear Gaussian models Roweis and Ghahramani [1999] stating that

p(x|z)p(z) p(z|x) = p(x) − 1 (x−g(z))T Σ−1(x−g(z)) − 1 (z−µ)T Σ−1(z−µ) 1 e 2 x e 2 z = D/2p S/2p p(x) (2π) | det(Σx)| (2π) | det(Σz)| − 1 (x−A z−B )T Σ−1(x−A z−B ) − 1 (z−µ)T Σ−1(z−µ) 1  X e 2 ω ω x ω ω  e 2 z = 1z∈ω p(x) D/2p S/2p ω∈Ω (2π) | det(Σx)| (2π) | det(Σz)| − 1 (x−A z−B )T Σ−1(x−A z−B )− 1 zT Σ−1z 1 X e 2 ω ω x ω ω 2 z = 1z∈ω p(x) (S+D)/2p ω∈Ω (2π) | det(Σx)|| det(Σz)| − 1 ((x−B )−A z)T Σ−1((x−B )−A z)− 1 zT Σ−1z 1 X e 2 ω ω x ω ω 2 z = 1z∈ω p(x) (S+D)/2p ω∈Ω (2π) | det(Σx)|| det(Σz)| − 1 ((AT Σ−1A +Σ−1)−1AT Σ−1(x−B )−z)T (AT Σ−1A +Σ−1)((AT Σ−1A +Σ−1)−1AT Σ−1(x−B )−z) 1 X e 2 ω x ω z ω x ω ω x ω z ω x ω z ω x ω = 1z∈ω p(x) (S+D)/2p ω∈Ω (2π) | det(Σx)|| det(Σz)|

− 1 ((x−B )T Σ−1(x−B ))+ 1 ((x−B )T Σ−1A (AT Σ−1A +Σ−1)−1AT Σ−1(x−B )) × e 2 ω x ω 2 ω x ω ω x ω z ω x ω

− 1 ((AT Σ−1A +Σ−1)−1AT Σ−1(x−B )−z)T (AT Σ−1A +Σ−1)((AT Σ−1A +Σ−1)−1AT Σ−1(x−B )−z) 1 X e 2 ω x ω z ω x ω ω x ω z ω x ω z ω x ω = 1z∈ω p(x) (S+D)/2p ω∈Ω (2π) | det(Σx)|| det(Σz)|

− 1 ((x−B )T (Σ−1−Σ−1A (AT Σ−1A +Σ−1)−1AT Σ−1)(x−B )) × e 2 ω x x ω ω x ω z ω x ω

− 1 ((AT Σ−1A +Σ−1)−1AT Σ−1(x−B )−z)T (AT Σ−1A +Σ−1)((AT Σ−1A +Σ−1)−1AT Σ−1(x−B )−z) 1 X e 2 ω x ω z ω x ω ω x ω z ω x ω z ω x ω = 1z∈ω p(x) (S+D)/2p ω∈Ω (2π) | det(Σx)|| det(Σz)| 156

− 1 ((x−B )T (Σ +A Σ AT )−1(x−B )) × e 2 ω x ω z ω ω

1 T −1 − (µ (x)−z) Σω (µ (x)−z) 1 e 2 ω ω 1 T T −1 X − ((x−Bω) (Σx+AωΣzAω ) (x−Bω)) = 1z∈ω e 2 p(x) (S+D)/2p ω∈Ω (2π) | det(Σx)|| det(Σz)|

T −1 T −1 −1 −1 with µω(x) = ΣωAω Σx (x − Bω) and Σω = (Aω Σx Aω + Σz ) as a result it corresponds to a mixture of truncated gaussian, each living on ω. Now we determine the renormalization constant:

Z p(x) = p(x|z)p(z)dz

1 T −1 − (µ (x)−z) Σω (µ (x)−z) Z e 2 ω ω 1 T T −1 X − ((x−Bω) (Σx+AωΣzAω ) (x−Bω)) = 1z∈ω e 2 dz (S+D)/2p ω∈Ω ω (2π) | det(Σx)|| det(Σz)| 1 T T −1 − ((x−Bω) (Σx+AωΣzAω ) (x−Bω)) Z X e 2 p = 1z∈ω det(Σω) φ(z; µ (x), Σω)dz D/2p ω ω∈Ω (2π) | det(Σx)|| det(Σz)| ω − 1 ((x−B )T (Σ +A Σ AT )−1(x−B )) X e 2 ω x ω z ω ω p = 1z∈ω det(Σω)Φω(µ (x), Σω) D/2p ω ω∈Ω (2π) | det(Σx)|| det(Σz)| p T X det(Σx + AωΣzAω ) det(Σω) T =1z∈ω p φ(x; Bω, Σx + AωΣzAω )Φω(µω(x), Σω), ω∈Ω | det(Σx)|| det(Σz)| now using the Matrix determinant lemma Harville [1998] we have that det(Σx +

T −1 T −1 AωΣzAω ) = det(Σz + Aω Σx Aω) det(Σx) det(Σz) leading to

X T p(x) = φ(x; Bω, Σx + AωΣzAω )Φω(µω(x), Σω), ω T X φ(x; Bω, Σx + AωΣzA )φ(z; µ (x), Σω) p(z|x) = δ (z) ω ω . ω P φ(x; B , Σ + A Σ AT )Φ (µ (x), Σ ) ω ω ω x ω z ω ω ω ω

B.5.4 Proof of Lemma 5.2

Proof B.6 The proof consists of rearranging the terms from the inclusion-exclusion formula as in

X |J|+1 (−1) (∩j∈J Aj) = ∪iAi J⊆{1,...,F },J6=∅ 157

F +1 X |J|+1 (−1) S + (−1) (∩j∈J Aj) = ∪iAi J⊆{1,...,F },J6=∅,|J|

F +1 X |J|+1 (−1) S = ∪iAi− (−1) (∩j∈J Aj) J⊆{1,...,F },J6=∅,|J|

F +1 F +1 X |J|+1 S = (−1) ∪i Ai−(−1) (−1) (∩j∈J Aj) J⊆{1,...,F },J6=∅,|J|

F +1 X |J|+1+F S = (−1) ∪i Ai+ (−1) (∩j∈J Aj) J⊆{1,...,F },J6=∅,|J|

can be decomposed into the signed sum of per cone integration. Finally, a simplex in

dimension S has S + 1 faces, making F = S + 1 and leading to the desired result.

B.5.5 Proof of Moments Lemma B.2

The first moments of Gaussian integration on an open rectangle defined by its lower limits a is given by

Z ∞ zφ(0, Σ)dz =ΣF (a), (B.9) a Z ∞  ! T a F (a) − Σ G(a) 1 zz φ(0, Σ)dz =Φ[a,∞)(0, Σ)Σ + Σ G(a) + Σ. a diag(Σ) (B.10)

where the division is performed elementwise.

Proof B.7 First moment:

− 1 zT Σ−1z Z Z e 2 zφ(x; 0, Σ)dz = z K/2 1/2 dz ω ω (2π) | det(Σ)| − 1 (R z)T (RT )−1Σ−1R−1R z X X Z e 2 C C ω C C = s z K/2 1/2 dz (2π) | det(Σω)| ∆∈S(ω) (s,C)∈T (∆) C 1 T T −1 Z − u (RC ΣωRC ) u X X −1 e 2 = s R u K/2 1/2 du (2π) | det(RC )|| det(Σω)| ∆∈S(ω) (s,C)∈T (∆) l(C) 158

Z X X −1 T = sRC uφ(u; 0,RC ΣωRC )du ∆∈S(ω) (s,C)∈T (∆) l(C)

X X −1 T = sRC (RC ΣωRC F (l(C)) ∆∈S(ω) (s,C)∈T (∆)

X X T =Σω sRC F (l(C)) ∆∈S(ω) (s,C)∈T (∆) Second moment

Z Z − 1 zT Σ−1z T T e 2 zz φ(x; 0, Σ)dz = zz K/2 1/2 dz ω ω (2π) | det(Σ)| 1 T T −1 −1 −1 Z − (RC y) (RC ) Σω RC RC y X X T e 2 = s zz K/2 1/2 dz (2π) | det(Σω)| ∆∈S(ω) (s,C)∈T (∆) C 1 T T −1 Z − u (RC ΣωRC ) u X X −1 T −1 T e 2 = s RC uu (RC ) K/2 1/2 du (2π) | det(RC )|| det(Σω)| ∆∈S(ω) (s,C)∈T (∆) l(C) Z X X −1 T T −1 T = sRC uu φ(u; 0,RC ΣωRC )du(RC ) ∆∈S(ω) (s,C)∈T (∆) l(C)

X X −1h T T = sRC Φ[l(C),∞)(0,RC ΣωRC )RC ΣωRC

∆∈S(ω−µω(x)) (s,C)∈T (∆) T  ! T l(C) F (l(C)) + RC ΣωRC G(l(C)) 1 T T i −1 T + RC ΣωRC T (RC ΣωRC ) (RC ) diag(RC ΣωRC )

X X h T = s Φ[l(C),∞)(0,RC ΣωRC )Σω

∆∈S(ω−µω(x)) (s,C)∈T (∆) T  ! T l(C) F (l(C)) + RC ΣωRC G(l(C)) 1 i + ΣωRC T RC Σω diag(RC ΣωRC )

= e0 Σ ω−µω(x) ω T  !  X X T l(C) F (l(C)) + RC ΣωRC G(l(C)) 1  + Σω sRC T RC Σω diag(RC ΣωRC ) ∆∈S(ω−µω(x)) (s,C)∈T (∆)

B.6 Proof of EM-step

We now derive the expectation maximization steps for a piecewise affine and contin- uous DGN. 159

B.6.1 E-step derivation

\[
E_{z|x}\big[(A_\omega z+B_\omega)1_\omega\big]=A_\omega m^1_\omega+B_\omega e^0_\omega,\tag{B.11}
\]
\[
E_{z|x}\big[z^TA_\omega^TA_\omega z\,1_\omega\big]=\operatorname{trace}\big(A_\omega^TA_\omega m^2\big).\tag{B.12}
\]
\[
E_{Z|X}\big[\log\big(p_{X|Z}(x|z)p_Z(z)\big)\big]
=E_{Z|X}\left[\log\left(
\frac{e^{-\frac{1}{2}(x-g(z))^T\Sigma_x^{-1}(x-g(z))}}{(2\pi)^{D/2}\sqrt{|\det(\Sigma_x)|}}\;
\frac{e^{-\frac{1}{2}z^T\Sigma_z^{-1}z}}{(2\pi)^{S/2}\sqrt{|\det(\Sigma_z)|}}
\right)\right]
\]
\[
=-\log\big((2\pi)^{(S+D)/2}\sqrt{|\det(\Sigma_z)|}\sqrt{|\det(\Sigma_x)|}\big)
-\frac{1}{2}E_{Z|X}\big[(x-g(z))^T\Sigma_x^{-1}(x-g(z))+z^T\Sigma_z^{-1}z\big]
\]
\[
=-\log\big((2\pi)^{(S+D)/2}\sqrt{|\det(\Sigma_z)|}\sqrt{|\det(\Sigma_x)|}\big)
-\frac{1}{2}\Big(x^T\Sigma_x^{-1}x+E_{Z|X}\big[-2x^T\Sigma_x^{-1}g(z)+g(z)^T\Sigma_x^{-1}g(z)+z^T\Sigma_z^{-1}z\big]\Big)
\]
\[
=-\log\big((2\pi)^{(S+D)/2}\sqrt{|\det(\Sigma_z)|}\sqrt{|\det(\Sigma_x)|}\big)
-\frac{1}{2}\Big(x^T\Sigma_x^{-1}x+\operatorname{trace}\big(E_{Z|X}[zz^T\Sigma_z^{-1}]\big)
+E_{Z|X}\big[-2x^T\Sigma_x^{-1}g(z)+g(z)^T\Sigma_x^{-1}g(z)\big]\Big)
\]
\[
=-\log\big((2\pi)^{(S+D)/2}\sqrt{|\det(\Sigma_z)|}\sqrt{|\det(\Sigma_x)|}\big)
-\frac{1}{2}\Bigg(x^T\Sigma_x^{-1}x-2x^T\Sigma_x^{-1}\sum_\omega\big(A_\omega e^1_\omega(x)+b_\omega e^0_\omega(x)\big)
+\sum_\omega\Big(e^0_\omega b_\omega^T\Sigma_x^{-1}b_\omega+\operatorname{trace}\big(A_\omega^T\Sigma_x^{-1}A_\omega E^2_\omega(x)\big)
+2\big(A_\omega m^1_\omega(x)\big)^T\Sigma_x^{-1}b_\omega\Big)+\operatorname{trace}\big(\Sigma_z^{-1}E^2(x)\big)\Bigg).
\]

B.6.2 Proof of M step

Let us first introduce some notation:
\[
A_\omega^{L\to i}\triangleq\big(A_\omega^{i\to L}\big)^T\quad\text{(back-propagation matrix to layer $i$)},
\]
\[
r^\ell_\omega(x)\triangleq x\,e^0_\omega(x)-\Big(A_\omega e^1_\omega(x)+m^0_\omega(x)\sum_{i\neq\ell}A_\omega^{i+1\to L}Q^i_\omega v^i\Big)\quad\text{(expected residual)},
\]
\[
\hat z^\ell_\omega(x)\triangleq Q^{\ell-1}_\omega\big(A_\omega^{1\to\ell-1}m^1_\omega(x)+b_\omega^{1\to\ell-1}e^0_\omega\big)\quad\text{(expected feature map of layer $\ell$)}.
\]
We can now provide the analytical forms of the M step for each of the learnable parameters:
\[
\Sigma_x^*=\frac{1}{N}\sum_x\Bigg(xx^T+\sum_\omega\Big(b_\omega b_\omega^T m^0_\omega(x)+2A_\omega e^1_\omega(x)b_\omega^T-2x\big(\hat z^L_\omega(x)\big)^T+A_\omega E^2_\omega(x)A_\omega^T\Big)\Bigg),\tag{B.13}
\]
\[
v^{\ell*}=\Bigg(\sum_x\sum_\omega Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega\Bigg)^{-1}
\Bigg(\sum_x\sum_{\omega\in\Omega}\underbrace{Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}r^\ell_\omega(x)}_{\text{residual back-propagated to layer }\ell}\Bigg),\tag{B.14}
\]
\[
\operatorname{vect}(W^{\ell*})=\Bigg(\sum_x\sum_\omega U^\ell_\omega\Bigg)^{-1}\sum_x\sum_\omega\operatorname{vect}\Bigg(\underbrace{Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}\Big(x-\sum_{i=\ell}^L A_\omega^{i+1\to L}Q^i_\omega v^i\Big)}_{\text{residual back-propagated to layer }\ell}\big(\hat z^\ell_\omega(x)\big)^T\Bigg);\tag{B.15}
\]

We provide detailed derivations below.

Update of the bias parameter

Recall that $b_\omega=v^L+\sum_{i=1}^{L-1}W^LQ^{L-1}_\omega W^{L-1}\cdots Q^i_\omega v^i$; we can thus rewrite the loss as
\[
\mathcal{L}(v^\ell)=-\frac{1}{2}\log\big((2\pi)^{S+D}|\det(\Sigma_x)||\det(\Sigma_z)|\big)
-\frac{1}{2}\Bigg(x^T\Sigma_x^{-1}x-2x^T\Sigma_x^{-1}\sum_\omega\big(A_\omega m^1_\omega(x)+b_\omega m^0_\omega(x)\big)
+\sum_\omega\Big(m^0_\omega b_\omega^T\Sigma_x^{-1}b_\omega+\operatorname{trace}\big(A_\omega^T\Sigma_x^{-1}A_\omega M^2_\omega(x)\big)+2\big(A_\omega m^1_\omega(x)\big)^T\Sigma_x^{-1}b_\omega\Big)\Bigg)
-\frac{1}{2}\operatorname{trace}\big(\Sigma_z^{-1}M^2(x)\big)
\]
\[
=-\frac{1}{2}\Bigg(-2x^T\Sigma_x^{-1}\sum_\omega b_\omega e^0_\omega(x)+\sum_\omega e^0_\omega b_\omega^T\Sigma_x^{-1}b_\omega
+\sum_\omega 2\big(A_\omega m^1_\omega(x)\big)^T\Sigma_x^{-1}b_\omega\Bigg)+cst
\]
\[
=-\frac{1}{2}\sum_\omega\Big(-2x^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell e^0_\omega(x)
+e^0_\omega\big(A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell\big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell\big)
+2e^0_\omega(x)\Big(\sum_{i\neq\ell}A_\omega^{i+1\to L}Q^i_\omega v^i\Big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell\big)
+2\big(m^1_\omega(x)\big)^TA_\omega^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell\Big)+cst
\]
\[
=-\frac{1}{2}\sum_\omega\Big(\underbrace{e^0_\omega\big(A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell\big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell\big)}_{(A)}
+\underbrace{2\Big(e^0_\omega(x)\Big(\sum_{i\neq\ell}A_\omega^{i+1\to L}Q^i_\omega v^i-x\Big)+A_\omega e^1_\omega(x)\Big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell\big)}_{(B)}\Big)+cst
\]
\[
\implies \partial\mathcal{L}(v^\ell)=-\frac{1}{2}\sum_\omega\Bigg[
e^0_\omega(x)\,2\,Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell
+2\big(A_\omega^{\ell+1\to L}Q^\ell_\omega\big)^T\Sigma_x^{-1}\Big(e^0_\omega(x)\Big(\sum_{i\neq\ell}A_\omega^{i+1\to L}Q^i_\omega v^i-x\Big)+A_\omega e^1_\omega(x)\Big)\Bigg]
\]
\[
\implies v^\ell=\Bigg(\sum_x\sum_\omega e^0_\omega(x)\,Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega\Bigg)^{-1}
\Bigg(\sum_x\sum_{\omega\in\Omega}Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}\Big(x\,e^0_\omega(x)-\Big(A_\omega m^1_\omega(x)+m^0_\omega(x)\sum_{i\neq\ell}A_\omega^{i+1\to L}Q^i_\omega v^i\Big)\Big)\Bigg),
\]
as
\[
(A)=e^0_\omega(x)\big(A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell\big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell\big)
\implies \partial(A)=e^0_\omega(x)\,2\,Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell,
\]
\[
(B)=2\Big(e^0_\omega(x)\Big(\sum_{i\neq\ell}A_\omega^{i+1\to L}Q^i_\omega v^i-x\Big)+A_\omega e^1_\omega(x)\Big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega v^\ell\big)+cst
\implies \partial(B)=\big(A_\omega^{\ell+1\to L}Q^\ell_\omega\big)^T\Sigma_x^{-1}\Big(e^0_\omega(x)\Big(\sum_{i\neq\ell}A_\omega^{i+1\to L}Q^i_\omega v^i-x\Big)+A_\omega e^1_\omega(x)\Big).
\]

Update of the slope parameter

We can thus rewrite the loss as

\[
\mathcal{L}(W^\ell)=x^T\Sigma_x^{-1}\sum_\omega\big(A_\omega e^1_\omega(x)+b_\omega e^0_\omega(x)\big)
-\frac{1}{2}\sum_\omega e^0_\omega b_\omega^T\Sigma_x^{-1}b_\omega
-\frac{1}{2}\sum_\omega\operatorname{trace}\big(A_\omega^T\Sigma_x^{-1}A_\omega E^2_\omega(x)\big)
-\sum_\omega\big(A_\omega m^1_\omega(x)\big)^T\Sigma_x^{-1}b_\omega.
\]
Notice that we can rewrite $b_\omega=A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}+\sum_{i=\ell}^{L}A_\omega^{i+1\to L}Q^i_\omega v^i$ and $A_\omega=A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}$, and thus we obtain:
\[
\mathcal{L}(W^\ell)=\sum_\omega x^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega\big(A_\omega^{1\to\ell-1}e^1_\omega(x)+b_\omega^{1\to\ell-1}e^0_\omega(x)\big)
-\frac{1}{2}\sum_\omega e^0_\omega\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)
-\sum_\omega e^0_\omega\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)^T\Sigma_x^{-1}\Big(\sum_{i=\ell}^{L}A_\omega^{i+1\to L}Q^i_\omega v^i\Big)
-\frac{1}{2}\sum_\omega\operatorname{trace}\Big(\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}\big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}\big)E^2_\omega(x)\Big)
-\sum_\omega\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}m^1_\omega(x)\big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)
-\sum_\omega\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}m^1_\omega(x)\big)^T\Sigma_x^{-1}\Big(\sum_{i=\ell}^{L}A_\omega^{i+1\to L}Q^i_\omega v^i\Big)+cst
\]
\[
=\underbrace{\sum_\omega x^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega\big(A_\omega^{1\to\ell-1}e^1_\omega(x)+b_\omega^{1\to\ell-1}e^0_\omega(x)\big)}_{(A)}
\underbrace{-\frac{1}{2}\sum_\omega e^0_\omega\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)}_{(B)}
\underbrace{-\sum_\omega\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega\big(A_\omega^{1\to\ell-1}e^1_\omega(x)+b_\omega^{1\to\ell-1}e^0_\omega(x)\big)\big)^T\Sigma_x^{-1}\Big(\sum_{i=\ell}^{L}A_\omega^{i+1\to L}Q^i_\omega v^i\Big)}_{(C)}
\underbrace{-\frac{1}{2}\sum_\omega\operatorname{trace}\Big(\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}\big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}\big)E^2_\omega(x)\Big)}_{(D)}
\underbrace{-\sum_\omega\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}m^1_\omega(x)\big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)}_{(E)}+cst.
\]

\[
(A)=\sum_\omega x^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega\big(A_\omega^{1\to\ell-1}e^1_\omega(x)+b_\omega^{1\to\ell-1}e^0_\omega(x)\big)
\implies \partial(A)=\sum_\omega Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}x\,\Big(Q^{\ell-1}_\omega\big(A_\omega^{1\to\ell-1}e^1_\omega(x)+b_\omega^{1\to\ell-1}e^0_\omega(x)\big)\Big)^T,
\]
\[
(B)=-\frac{1}{2}\sum_\omega e^0_\omega(x)\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)^T\Sigma_x^{-1}\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)
\]
\[
=-\frac{1}{2}\sum_\omega e^0_\omega(x)\big(Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)^T(W^\ell)^T\big(A_\omega^{\ell+1\to L}Q^\ell_\omega\big)^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}
\]
\[
=-\frac{1}{2}\sum_\omega e^0_\omega(x)\operatorname{trace}\Big((W^\ell)^T\big(A_\omega^{\ell+1\to L}Q^\ell_\omega\big)^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell\,Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big(Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big)^T\Big)
\]
\[
\implies \partial(B)=-\sum_\omega e^0_\omega(x)\,Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big(b_\omega^{1\to\ell-1}\big)^TQ^{\ell-1}_\omega,
\]
\[
(C)=-\sum_\omega\Big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega\big(A_\omega^{1\to\ell-1}e^1_\omega(x)+b_\omega^{1\to\ell-1}e^0_\omega(x)\big)\Big)^T\Sigma_x^{-1}\Big(\sum_{i=\ell}^{L}A_\omega^{i+1\to L}Q^i_\omega v^i\Big)
\]
\[
=-\sum_\omega\Big(Q^{\ell-1}_\omega\big(A_\omega^{1\to\ell-1}e^1_\omega(x)+b_\omega^{1\to\ell-1}e^0_\omega(x)\big)\Big)^T(W^\ell)^T\big(A_\omega^{\ell+1\to L}Q^\ell_\omega\big)^T\Sigma_x^{-1}\Big(\sum_{i=\ell}^{L}A_\omega^{i+1\to L}Q^i_\omega v^i\Big)
\]
\[
\implies \partial(C)=-\sum_\omega Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}\Big(\sum_{i=\ell}^{L}A_\omega^{i+1\to L}Q^i_\omega v^i\Big)\Big(Q^{\ell-1}_\omega\big(A_\omega^{1\to\ell-1}e^1_\omega(x)+b_\omega^{1\to\ell-1}e^0_\omega(x)\big)\Big)^T,
\]
\[
(D)=-\frac{1}{2}\sum_\omega\operatorname{trace}\Big(\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}\big)^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}E^2_\omega(x)\Big)
\]
\[
=-\frac{1}{2}\sum_\omega\operatorname{trace}\Big((W^\ell)^T\big(A_\omega^{\ell+1\to L}Q^\ell_\omega\big)^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell\,Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}E^2_\omega(x)\big(Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}\big)^T\Big)
\]
\[
\implies \partial(D)=-\sum_\omega Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}E^2_\omega(x)A_\omega^{\ell-1\to 1}Q^{\ell-1}_\omega,
\]
\[
(E)=-\sum_\omega\big(A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}m^1_\omega(x)\big)^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}
\]
\[
=-\sum_\omega\operatorname{trace}\Big((W^\ell)^T\big(A_\omega^{\ell+1\to L}Q^\ell_\omega\big)^T\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell\,Q^{\ell-1}_\omega b_\omega^{1\to\ell-1}\big(Q^{\ell-1}_\omega A_\omega^{1\to\ell-1}m^1_\omega(x)\big)^T\Big)
\]
\[
\implies \partial(E)=-\sum_\omega Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell\,
Q^{\ell-1}_\omega\Big(b_\omega^{1\to\ell-1}\big(m^1_\omega(x)\big)^TA_\omega^{\ell-1\to 1}+A_\omega^{1\to\ell-1}m^1_\omega(x)\big(b_\omega^{1\to\ell-1}\big)^T\Big)\big(Q^{\ell-1}_\omega\big)^T.
\]
We can group (B), (D), and (E) together, as well as (A) and (C). Now, to solve for the gradient equal to $0$, we need to consider the flattened version of $W^\ell$, which we denote by $w^\ell=\operatorname{vect}(W^\ell)$, leading to

\[
\partial\mathcal{L}=\sum_\omega Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}\Big(x-\sum_{i=\ell}^{L}A_\omega^{i+1\to L}Q^i_\omega v^i\Big)\Big(Q^{\ell-1}_\omega\big(A_\omega^{1\to\ell-1}e^1_\omega(x)+b_\omega^{1\to\ell-1}e^0_\omega(x)\big)\Big)^T
\]
\[
-\sum_\omega Q^\ell_\omega A_\omega^{L\to\ell+1}\Sigma_x^{-1}A_\omega^{\ell+1\to L}Q^\ell_\omega W^\ell Q^{\ell-1}_\omega
\Big(e^0_\omega(x)\,b_\omega^{1\to\ell-1}\big(b_\omega^{1\to\ell-1}\big)^T+A_\omega^{1\to\ell-1}E^2_\omega(x)A_\omega^{\ell-1\to 1}
+b_\omega^{1\to\ell-1}\big(m^1_\omega(x)\big)^TA_\omega^{\ell-1\to 1}+A_\omega^{1\to\ell-1}m^1_\omega(x)\big(b_\omega^{1\to\ell-1}\big)^T\Big)Q^{\ell-1}_\omega
\]
\[
=\sum_\omega P^\ell_\omega(x)-U^\ell_\omega W^\ell V^\ell_\omega(x)
\]
\[
\implies \Big(\sum_x\sum_\omega U^\ell_\omega\otimes\big(V^\ell_\omega(x)\big)^T\Big)\operatorname{vect}(W^\ell)=\operatorname{vect}\Big(\sum_x\sum_\omega P^\ell_\omega(x)\Big)
\]
\[
\implies \operatorname{vect}(W^\ell)^*=\Big(\sum_x\sum_\omega U^\ell_\omega\otimes\big(V^\ell_\omega(x)\big)^T\Big)^{-1}\operatorname{vect}\Big(\sum_x\sum_\omega P^\ell_\omega(x)\Big).
\]

B.7 Regularization

We propose in this section a brief discussion of the impact of using a probabilistic prior on the weights of the DGN. In particular, it is clear that imposing a Gaussian prior with zero mean and isotropic covariance on the weights amounts, in the log-likelihood, to an l2 regularization of the weights with a coefficient based on the covariance of the prior. If the prior is a Laplace distribution, the log-likelihood turns the prior into an l1 regularization of the weights, again with a regularization coefficient based on the prior covariance. Finally, in the case of a uniform prior with finite support, the log-likelihood is equivalent to weight clipping, a standard technique employed in DNs where the weights cannot take values outside of a predefined range.
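As an illustration of the correspondence above, the following sketch (assuming PyTorch; the helper names and the usage line are ours, not an implementation from the thesis) spells out the negative log-prior terms induced by the three priors:

import torch

def neg_log_prior_gaussian(params, sigma2):
    # N(0, sigma2) prior  ->  (1 / (2 * sigma2)) * ||W||_2^2  (up to a constant)
    return sum((w ** 2).sum() for w in params) / (2.0 * sigma2)

def neg_log_prior_laplace(params, b):
    # Laplace(0, b) prior ->  (1 / b) * ||W||_1  (up to a constant)
    return sum(w.abs().sum() for w in params) / b

@torch.no_grad()
def clip_to_uniform_support(params, bound):
    # Uniform prior on [-bound, bound]: the induced constraint is weight clipping.
    for w in params:
        w.clamp_(-bound, bound)

# usage sketch: total_loss = nll + neg_log_prior_gaussian(model.parameters(), sigma2=10.0)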

B.8 Computational Complexity

The computational complexity of the method increases drastically with the latent space dimension, the number of regions, and the number of faces per region. Those quantities are directly tied to the complexity (depth and width) of the DGN. This complexity bottleneck comes from the need to search for all regions and to decompose each region into simplices. As such, EM learning is not yet suitable for large-scale applications; however, based on the obtained analytical forms, it is possible to derive a more tractable approximation of the true form that would still provide approximation error bounds, as opposed to current methods.

B.9 Additional Experiments

In this section we propose to complement the toy circle experiment from the main paper, first with an additional 2D case and then with the MNIST dataset.

Wave

We propose here a simple example where the real data is as depicted in Fig. B.1. We train the EM- and VAE-based learning on this dataset with various learning rates and depict below the evolution of the NLL for all models; we also depict the samples after learning.

MNIST We now employ MNIST, which consists of images of digits, and select the digit-4 class. Note that due to the computational overhead we maintain a univariate latent space for the DGN and employ a three-layer DGN with 8 and 16 hidden units. We first provide the evolution of the NLL through learning for all the training methods and then sample images from the trained DGNs, demonstrating how, for small DGNs, EM learning is able to learn a better data distribution and thus generates realistic samples, as opposed to VAEs which need many more training steps.

Figure B.1 : Sample of noisy data for the wave dataset.
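For reference, a generator of roughly the size described above can be written as follows (a sketch only: the nonlinearity, output dimension, and layer details are assumptions for illustration, not the exact architecture used in the experiment):

import torch
import torch.nn as nn

class SmallDGN(nn.Module):
    """Univariate latent space, three layers with 8 and 16 hidden units."""
    def __init__(self, latent_dim=1, out_dim=28 * 28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 8), nn.LeakyReLU(0.1),   # piecewise-affine nonlinearity
            nn.Linear(8, 16), nn.LeakyReLU(0.1),
            nn.Linear(16, out_dim),                        # affine output layer
        )

    def forward(self, z):
        return self.net(z)

g = SmallDGN()
x = g(torch.randn(5, 1))   # 5 generated samples
print(x.shape)             # torch.Size([5, 784])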

Figure B.2 : Depiction of the evolution of the NLL during training for the EM and VAE algorithms. We can see that despite the high number of training steps, the VAEs are not yet able to correctly approximate the data distribution, as opposed to EM training which benefits from much faster convergence. We also see how the VAEs tend to have a large KL divergence between the true posterior and the variational estimate due to this gap. We depict below samples from those models.

Figure B.3 : Samples from the various models trained on the wave dataset. On top we show the result of EM training, where each column represents a different run; the remaining three rows correspond to the VAE training. Again, EM demonstrates much faster convergence; for the VAEs to reach the actual data distribution, many more updates are needed.

Figure B.4 : Evolution of the true data negative log-likelihood (on a semi-log-y plot) on MNIST (class 4) for EM and VAE training for a small DGN as described above. The experiments are repeated multiple times. We can see how the learning rate clearly impacts the learning despite the use of Adam, and that even with the large learning rate, EM learning is able to reach a lower NLL; in fact the quality of the generated samples of the EM models is much higher, as shown below.

Figure B.5 : Random samples from DGNs trained with EM or VAEs on a MNIST experiment (with digit 4); rows correspond to Expectation-Maximization training and to VAE training with large, medium, and small learning rates. We see the ability of EM training to produce realistic and diversified samples despite using a latent space dimension of 1 and a small generative network.

Appendix C

Deep Network Pruning

This supplement provides more experiments to evaluate our proposed methods and is organized as follows:

• Sec. C.1 of this supplement provides more experiments to support Sec. 3 of the main content.

  – Sec. C.1.1 extends the insights about overparametrization and winning tickets.

  – Sec. C.1.2 supplies comprehensive comparisons between layerwise pretrain and lottery initialization methods.

• Sec. C.2 of this supplement provides more visualizations to support Sec. 4 of the main content.

  – Sec. C.2.1 visualizes the Early-Bird phenomenon on more networks.

• Sec. C.3 of this supplement provides extensive experiments to enrich Sec. 5 of the main content.

  – Sec. C.3.1 describes the detailed experiment settings.

  – Sec. C.3.2 supplies global spline pruning results on CIFAR-10/100.

  – Sec. C.3.3 supplies ablation studies to investigate the sensitivity of the hyperparameter ρ.

C.1 Additional Results on Initialization and Pruning

C.1.1 Winning Tickets and Overparameterization

Here we extend the overparametrization-pruning vs. initialization insights from Sec. 3 of the main content to univariate DNs on a carefully designed dataset. Consider a simple unidimensional sawtooth as displayed in Fig. C.2 with $P$ peaks (here $P=2$). In the special case of a single hidden layer with a ReLU activation function, one must have at least $2P$ units to perfectly fit this function, with one possible weight configuration being $[W_1]_{1,k}=1$, $[b_1]_k=-k$ for $k=1,\dots,D_1$, and $[W_2]_{1,1}=1$, $[W_2]_{1,k}=2(-1)^{(k-1)\%2}$ for $k=2,\dots,D_1$. Note that these weights are not unique (others can identically fit the function) and are given as an example. At initialization, if the DN has only $2P$ units, the probability that a random weight initialization arranges the initial splines in a way that allows effective gradient-based training is low. Increasing the width of the initial network increases the probability that some of the units are advantageously initialized throughout the domain and aligned with the natural input space partitioning of the target function (different regions for the different increasing or decreasing sides of the sawtooth). This is what is empirically illustrated in Fig. C.2 (right), where one can see that even repeating multiple initializations of a DN without overparametrization does not allow the task to be solved, while overparametrizing, training, and then pruning so as to preserve only the correct number of units allows for a better approximation.
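The hand-crafted construction above is easy to verify numerically. The sketch below instantiates it for P = 2 (so D1 = 4 units); note that the alternating sign pattern of the second-layer weights is reconstructed from the garbled source, so it should be read as one illustrative choice among the many configurations that fit the sawtooth.

import numpy as np

# 2P-unit single-hidden-layer ReLU network fitting a sawtooth with P peaks:
# first layer: [W1]_{1,k} = 1, [b1]_k = -k; second layer: weights 1, -2, +2, -2, ...
P = 2
D1 = 2 * P
w2 = np.array([1.0] + [2.0 * (-1) ** (k - 1) for k in range(2, D1 + 1)])  # 1, -2, 2, -2

def f(x):
    h = np.maximum(x[:, None] - np.arange(1, D1 + 1)[None, :], 0.0)  # ReLU(x - k)
    return h @ w2

x = np.linspace(0.0, 2 * P + 1, 9)
print(np.round(f(x), 2))   # rises and falls between consecutive integer knots (peaks at x = 2 and x = 4)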

C.1.2 Additional Results for Layerwise Pretraining

We have shown in the main paper that unsupervised layerwise pretraining provides a good initialization for small networks. Here we supply more results and analysis to support that point and discuss other interesting trade-offs between overparametrization and layerwise pretraining.

Figure C.1 : Depiction of the dataset used for the K-means experiment with 64 centroids.

Fig. C.3 further compares the required FLOPs when networks are initialized using lottery initialization and layerwise pretraining, respectively. We observe that 1) when the pruning ratio is low (i.e., < 50%), networks with lottery initialization require a smaller number of computational FLOPs to provide a good initialization for the pruned network, while leading to a comparable or even higher retraining accuracy; and 2) when the pruning ratio is higher, layerwise pretraining requires a much smaller number of computational FLOPs as compared to training highly overparametrized dense networks. Such a phenomenon opens a door for investigating the following two questions, which we leave as future work.

Figure C.2 : Left: Depiction of a simple (toy) univariate regression task with the target function being a sawtooth with two peaks. Right: The ℓ2 training error (y-axis) as a function of the width of the DN layer (2 layers in total). In theory, only 4 units are required to perfectly solve the task at hand with a ReLU layer; however, we see that optimization in narrow DNs is difficult and gradient-based learning fails to find the correct layer parameters. As the width is increased, the difficulty of the optimization problem reduces and SGD manages to find a good set of parameters solving the regression task.

• Is there a clear boundary/condition to show whether we should start from overparametrization or consider pretraining as a good initialization for small DNs instead?

• How much overparametrization do we need to maintain better trade-offs between accuracy and efficiency, as compared to other initialization methods (e.g., layerwise pretraining)?

Figure C.3 : Accuracy vs. efficiency trade-offs of lottery initialization and layerwise pretraining: testing accuracy (%) versus training FLOPs (×10^16) for VGG-16 on CIFAR-10 and on CIFAR-100.

C.2 Additional Early-Bird Visualizations

C.2.1 Early-Bird Visualization for VGG-16 and PreResNet-101

Here we supply more visualizations for the spline Early-Bird detection, which measures the hamming distances of the DN partition between consecutive epochs. Fig. C.4 shows such visualizations for VGG-16 and PreResNet-101 networks evaluated on the CIFAR-100 dataset. We can see that spline EB tickets can be identified in the early training stages (i.e., the 13/14-th epoch w.r.t. 160 epochs in total) in both networks, further validating the general effectiveness of our spline EB tickets.
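For clarity, the quantity being tracked can be summarized as follows: the DN partition is encoded by the binary activation pattern (the VQ code) of every ReLU unit on a fixed batch, and the spline EB criterion monitors the normalized hamming distance between the codes of consecutive checkpoints. A minimal sketch is given below (assuming PyTorch and a purely sequential ReLU network; `model_prev`, `model_curr`, and `batch` are placeholder names, and this is not the paper's implementation):

import torch
import torch.nn as nn

def activation_pattern(model: nn.Sequential, x: torch.Tensor) -> torch.Tensor:
    """Concatenate the binary ReLU states (>0) of every layer for each input."""
    codes = []
    h = x
    for layer in model:
        h = layer(h)
        if isinstance(layer, nn.ReLU):
            codes.append((h > 0).flatten(1))
    return torch.cat(codes, dim=1)            # shape: (batch, total_units)

def hamming_distance(model_prev, model_curr, batch) -> float:
    with torch.no_grad():
        c1 = activation_pattern(model_prev, batch)
        c2 = activation_pattern(model_curr, batch)
    return (c1 != c2).float().mean().item()   # normalized hamming distance in [0, 1]

# Early-Bird criterion (sketch): draw a ticket once the distance between
# consecutive epochs stays below a small threshold, e.g. hamming_distance(...) < 0.1.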

C.3 Additional Experimental Details and Results

C.3.1 Experiments Settings

Models & Datasets: We consider four DNN models (PreResNet-101, VGG-16, and ResNet-18/50) on both the CIFAR-10/100 and ImageNet datasets, following the basic setting of [You et al., 2020].

Figure C.4 : Illustrating the spline Early-Bird tickets in VGG-16 and PreResNet-101 on CIFAR-100.

Evaluation Metrics: We evaluate in terms of the retraining accuracy, total training FLOPs, and real-device energy cost; the energy cost is measured by training the models on an edge GPU (NVIDIA JETSON TX2), which accounts for both the computational and data movement costs.

Training Settings: For the CIFAR-10/100 datasets, the training takes a total of 160 epochs, and the initial learning rate is set to 0.1 and is divided by 10 at the 80-th and 120-th epochs, respectively. For the ImageNet dataset, the training takes a total of 90 epochs while the learning rate drops at the 30-th and 60-th epochs, respectively. In all the experiments, the batch size is set to 256, and an SGD solver is adopted with a momentum of 0.9 and a weight decay of 0.0001, following the setting of [Liu et al., 2019b]. Additionally, ρ in Eq. 2 of the main paper is set to 0.05 for all cases.
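For reproducibility, the CIFAR schedule above corresponds to the following PyTorch-style training loop (a sketch of the stated hyper-parameters only; `model`, `train_loader`, and `criterion` are assumed to be defined elsewhere):

import torch

# SGD with momentum 0.9, weight decay 1e-4; lr 0.1 divided by 10 at epochs 80 and 120;
# 160 epochs total; batches of size 256 are assumed to come from train_loader.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)

for epoch in range(160):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()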

C.3.2 Additional Results of Our Global Spline Pruning

Spline Pruning over SOTA on CIFAR-10/100. Table C.1 compares the retraining accuracy, the total training FLOPs, and the total training energy of our spline pruning methods and four SOTA pruning methods, including two unstructured pruning baselines (i.e., the original lottery ticket (LT) training [Frankle and Carbin, 2019] and SNIP [Lee et al., 2019b]) and two structured pruning baselines (i.e., NS [Liu et al., 2017b] and ThiNet [Luo et al., 2017]). The results demonstrate that our spline pruning again consistently outperforms all the competitors in terms of the achieved accuracy and training efficiency trade-offs. Specifically, compared with the strongest competitor among the four SOTA baselines, spline pruning achieves 0.8× ∼ 3.5× training FLOPs reductions and 0.7× ∼ 1.5× energy cost reductions while offering comparable or even better (-0.67% ∼ 0.69%) accuracies. In particular, spline pruning consistently achieves 1.16× ∼ 3.16× training FLOPs reductions compared to all the structured pruning baselines, while leading to comparable or better accuracies (-0.17% ∼ 1.28%).

C.3.3 Ablation Studies of Our Spline Pruning Method

Recall that the only hyperparameter in our spline pruning method is ρ (see Eq. 2 of the main content), which balances the difference between the angles and the biases. Here we conduct ablation studies measuring the retraining accuracies under different values of ρ in order to investigate its sensitivity, as shown in Fig. C.5. Without loss of generality, we evaluate two commonly used models, VGG-16 and PreResNet-101, on the representative CIFAR-100 dataset. Results show that spline pruning consistently performs well for a wide range of ρ values ranging from 0.01 to 0.4, which also generalizes to different pruning ratios (denoted by p). This set of experiments demonstrates the robustness of our spline pruning method.

Figure C.5 : Ablation studies of the hyperparameter ρ in our spline pruning method on two commonly used models, PreResNet-101 and VGG-16: testing accuracy (%) versus ρ ∈ {0.01, 0.05, 0.1, 0.3, 0.4} under different pruning ratios (p = 30%, 50%, 70% for PreResNet-101 and p = 10%, 30%, 50% for VGG-16).

Table C.1 : Evaluating our global spline pruning method over SOTA methods on CIFAR-10/100 datasets. Note that "Spline Improv." denotes the improvement of our spline pruning (w/ or w/o EB) as compared to the strongest baselines. Each block reports the retraining accuracy (%) and the energy cost (KJ) / FLOPs (P) at the three pruning ratios p.

PreResNet-101, CIFAR-10 (p = 30% / 50% / 70%):
  LT (one-shot)    93.7  / 93.21 / 92.78    6322/14.9   6322/14.9   6322/14.9
  SNIP             93.76 / 93.31 / 92.76    3161/7.40   3161/7.40   3161/7.40
  NS               93.83 / 93.42 / 92.49    5270/13.9   4641/12.7   4211/11.0
  ThiNet           93.39 / 93.07 / 91.42    3579/13.2   2656/10.6   1901/8.65
  Spline           94.13 / 93.92 / 92.06    4897/13.6   4382/12.1   3995/10.1
  EB Spline        93.67 / 93.18 / 92.32    2322/6.00   1808/4.26   1421/2.74
  Spline Improv.   0.3   / 0.5   / -0.46    1.4x/1.2x   1.5x/2.5x   1.4x/3.2x

VGG16, CIFAR-10 (p = 30% / 50% / 70%):
  LT (one-shot)    93.18 / 93.25 / 93.28    746.2/30.3  746.2/30.3  746.2/30.3
  SNIP             93.2  / 92.71 / 92.3     373.1/15.1  373.1/15.1  373.1/15.1
  NS               93.05 / 92.96 / 92.7     617.1/27.4  590.7/25.7  553.8/23.8
  ThiNet           92.82 / 91.92 / 90.4     631.5/22.6  383.9/19.0  380.1/16.6
  Spline           93.62 / 93.46 / 92.85    643.5/26.4  603.4/25.0  538.1/19.6
  EB Spline        93.28 / 93.05 / 91.96    476.1/19.4  436.1/15.5  370.7/11.1
  Spline Improv.   0.42  / 0.21  / -0.43    0.8x/0.8x   0.9x/1.0x   1.0x/1.4x

PreResNet-101, CIFAR-100 (p = 30% / 50% / 70%):
  LT (one-shot)    71.9  / 71.6  / 69.95    6095/14.9   6095/14.9   6095/14.9
  SNIP             72.34 / 71.63 / 70.01    3047/7.40   3047/7.40   3047/7.40
  NS               72.8  / 71.52 / 68.46    4851/13.7   4310/12.5   3993/10.3
  ThiNet           73.1  / 70.92 / 67.29    3603/13.2   2642/10.6   1893/8.65
  Spline           73.79 / 72.04 / 68.24    4980/12.6   4413/10.9   4008/9.36
  EB Spline        72.67 / 71.99 / 69.74    2388/5.44   1821/3.84   1416/2.46
  Spline Improv.   0.69  / 0.44  / -0.27    1.3x/1.4x   1.5x/2.8x   1.3x/3.5x

VGG16, CIFAR-100 (p = 10% / 30% / 50%):
  LT (one-shot)    72.62 / 71.31 / 70.96    741.2/30.3  741.2/30.3  741.2/30.3
  SNIP             71.55 / 70.83 / 70.35    370.6/15.1  370.6/15.1  370.6/15.1
  NS               71.24 / 71.28 / 69.74    636.5/29.3  592.3/27.1  567.8/24.0
  ThiNet           70.83 / 69.57 / 67.22    632.2/27.4  568.5/22.6  381.4/19.0
  Spline           72.18 / 71.54 / 70.07    688.3/28.0  605.2/22.9  555.0/19.4
  EB Spline        72.07 / 71.46 / 70.29    512.2/19.9  429.1/15.3  378.9/11.8
  Spline Improv.   -0.44 / 0.23  / -0.67    0.7x/0.8x   0.9x/1.0x   1.0x/1.3x

Appendix D

Batch-Normalization

D.1 Proofs

D.1.1 Proof of Theorem 7.1

Proof D.1 In order to prove the theorem, we demonstrate below that the total least squares optimization problem has a unique global optimum, given by the average of the data, hence corresponding to the batch-normalization mean parameter. We then demonstrate that at this minimum the value of the total least squares loss is given by the variance parameter of batch-normalization.

The optimization problem is given by
\[
\mathcal{L}(\mu;Z)=\sum_{k=1}^{D^{(\ell)}}\sum_{z\in Z}d\big(z,\mathcal{H}^{(\ell,k)}\big)^2
=\sum_{k=1}^{D^{(\ell)}}\sum_{z\in Z}\frac{\big(\langle[W^{(\ell)}]_{k,.},z^{(\ell-1)}\rangle-[\mu]_k\big)^2}{\|[W^{(\ell)}]_{k,.}\|_2^2}.
\]
It is clear that the optimization problem
\[
\min_{\mu\in\mathbb{R}^{D^{(\ell)}}}\mathcal{L}(\mu;Z)
\]
can be decomposed into multiple independent optimization problems, one for each dimension of the vector $\mu$, since we are working with an unconstrained optimization problem with a separable sum. We thus focus on a single $[\mu]_k$ for now. The optimization problem becomes
\[
\min_{[\mu]_k\in\mathbb{R}}\sum_{z\in Z}\frac{\big(\langle[W^{(\ell)}]_{k,.},z^{(\ell-1)}\rangle-[\mu]_k\big)^2}{\|[W^{(\ell)}]_{k,.}\|_2^2};
\]
taking the first derivative leads to
\[
\partial_{[\mu]_k}\sum_{z\in Z}\frac{\big(\langle[W^{(\ell)}]_{k,.},z^{(\ell-1)}\rangle-[\mu]_k\big)^2}{\|[W^{(\ell)}]_{k,.}\|_2^2}
=-\sum_{z\in Z}2\,\frac{\langle[W^{(\ell)}]_{k,.},z^{(\ell-1)}\rangle-[\mu]_k}{\|[W^{(\ell)}]_{k,.}\|_2^2}
=-2\sum_{z\in Z}\frac{\langle[W^{(\ell)}]_{k,.},z^{(\ell-1)}\rangle}{\|[W^{(\ell)}]_{k,.}\|_2^2}+2\,\mathrm{Card}(Z)\,\frac{[\mu]_k}{\|[W^{(\ell)}]_{k,.}\|_2^2}.
\]
The above first derivative of the total least squares (quadratic) loss function is thus a linear function of $[\mu]_k$, equal to $0$ at the unique point given by
\[
-2\sum_{z\in Z}\frac{\langle[W^{(\ell)}]_{k,.},z^{(\ell-1)}\rangle}{\|[W^{(\ell)}]_{k,.}\|_2^2}+2\,\mathrm{Card}(Z)\,\frac{[\mu]_k}{\|[W^{(\ell)}]_{k,.}\|_2^2}=0
\iff[\mu]_k=\frac{\sum_{z\in Z}\langle[W^{(\ell)}]_{k,.},z^{(\ell-1)}\rangle}{\mathrm{Card}(Z)},
\]
confirming that the average of the pre-activation feature maps (per dimension) is indeed the optimum of the optimization problem. One can easily verify that it is indeed a minimum by taking the second derivative of the total least squares loss, which is positive and given by
\[
\frac{2\,\mathrm{Card}(Z)}{\|[W^{(\ell)}]_{k,.}\|_2^2}>0.
\]
The above can be done for each dimension $k$ in a similar manner. Now, by inserting this optimal value back into the total least squares loss, we obtain the desired result.
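As a quick numerical illustration of this argument (a sketch on synthetic data, not part of the proof), the per-unit total least squares loss is indeed minimized at the average pre-activation:

import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((128, 16))   # a batch of layer inputs z^(l-1)
w_k = rng.standard_normal(16)        # row k of W^(l)

pre_acts = Z @ w_k                   # <w_k, z> for every z in the batch

def loss(mu):
    # per-unit total least squares loss L([mu]_k)
    return np.sum((pre_acts - mu) ** 2) / np.dot(w_k, w_k)

grid = np.linspace(pre_acts.min(), pre_acts.max(), 10_001)
mu_star = grid[np.argmin([loss(m) for m in grid])]
print(mu_star, pre_acts.mean())      # the two agree (up to grid resolution)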

D.1.2 Proof of Corollary 7.1

Proof D.2 In order to prove the desired result, we first demonstrate that the layer input centroid $z^{(\ell-1)}$ indeed belongs to each unit hyperplane.

For a data point (in our case $z^{(\ell-1)}$) to belong to the $k$th (unit) hyperplane $\mathcal{H}^{(\ell,k)}$ of layer $\ell$, we must ensure that this point belongs to the set defining the hyperplane (recall (7.5))
\[
\mathcal{H}^{(\ell,k)}=\Big\{z^{(\ell-1)}\in\mathbb{R}^{D^{(\ell-1)}}:\big\langle[W^{(\ell)}]_{k,.},z^{(\ell-1)}\big\rangle=[\mu^{(\ell)}]_k\Big\};
\]
in our case we can simply use the data centroid and ensure that it fulfils the hyperplane equality
\[
\Big\langle[W^{(\ell)}]_{k,.},z^{(\ell-1)}\Big\rangle
=\Big\langle[W^{(\ell)}]_{k,.},\frac{\sum_{z\in Z}z}{\mathrm{Card}(Z)}\Big\rangle
=\frac{\sum_{z\in Z}\big\langle[W^{(\ell)}]_{k,.},z\big\rangle}{\mathrm{Card}(Z)}
=[\mu^{(\ell)}_{\rm BN}]_k,
\]
where the last equality gives in fact the batch-normalization mean parameter. So now, recalling the equation of $\mathcal{H}^{(\ell,k)}$, we see that the point $z^{(\ell-1)}$ has projection $[\mu^{(\ell)}_{\rm BN}]_k$ onto the hyperplane direction, which equals the bias of the hyperplane, effectively making $z^{(\ell-1)}$ part of the (batch-normalized) hyperplane $\mathcal{H}^{(\ell,k)}$. Doing the above for each $k\in\{1,\dots,D^{(\ell)}\}$, we see that the layer input centroid belongs to all the unit hyperplanes that are shifted by the correct batch-normalization parameter; hence we directly obtain the desired result
\[
z^{(\ell-1)}\in\bigcap_{k=1}^{D^{(\ell)}}\mathcal{H}^{(\ell,k)},
\]
concluding the proof.

D.1.3 Proof of Theorem 7.2

Proof D.3 Define by $x^*$ the closest point to $x$ in $\mathcal{P}^{(\ell,k)}$, defined by
\[
x^*=\arg\min_{u\in\mathcal{P}^{(\ell,k)}}\|x-u\|_2.
\]
The path from $x$ to $x^*$ is a straight line in the input space, which we define by
\[
l(\theta)=x^*\theta+(1-\theta)x,\quad\theta\in[0,1],
\]
s.t. $l(0)=x$, our original point, and $l(1)$ is the closest point on the kinked hyperplane. Now, in the input space of layer $\ell$, this parametric line becomes a continuous piecewise affine parametric line defined as
\[
z^{(\ell-1)}(\theta)=\big(f^{(\ell-1)}\circ\dots\circ f^{(1)}\big)(l(\theta)).
\]
By definition, if $\mathcal{P}^{(\ell,k)}$ is brought closer to $x$, it means that $\exists\theta<1$ s.t. $l(\theta)\in\mathcal{P}^{(\ell,k)}$. Similarly, this can be expressed in the layer input space as follows:
\[
\exists\theta'<1\text{ s.t. }z^{(\ell-1)}(\theta')\in\mathcal{H}^{(\ell,k)}\implies\exists\theta<1\text{ s.t. }l(\theta)\in\mathcal{P}^{(\ell,k)};
\]
this demonstrates that when moving the layer hyperplane s.t. it intersects the kinked path $z^{(\ell-1)}$ at a point $z^{(\ell-1)}(\theta')$ with $\theta'<1$, then the distance in the input space is also reduced. Now, the BN fitting is greedy and tries to minimize the length of the straight line between $z^{(\ell-1)}(0)$, a.k.a. $z^{(\ell-1)}(x)$, and the hyperplane $\mathcal{H}^{(\ell,k)}$. However, notice that if the length of this straight line decreases by bringing the hyperplane closer to $z^{(\ell-1)}(x)$, then this also decreases the $\theta'$ s.t. $z^{(\ell-1)}(\theta')\in\mathcal{H}^{(\ell,k)}$, in turn reducing the distance between $x$ and $\mathcal{P}^{(\ell,k)}$ in the DN input space, giving the desired (second) result. Conversely, if $z^{(\ell-1)}(0)\in\mathcal{H}^{(\ell,k)}$ then the point $x$ lies in the zero-set of the unit, in turn making it belong to the kinked hyperplane $\mathcal{P}^{(\ell,k)}$ which corresponds to this exact set (recall Eq. 7.13).

D.1.4 Proof of Proposition 7.1

Proof D.4 When using leaky-ReLU, the input to the last layer will have both positive and negative values in each dimension for at least one sample in the current minibatch; that is, each dimension will have at least one negative value with all the others positive, or vice-versa. As the last layer is initialized with zero bias, the decision boundary is defined in the last-layer input space as the hyperplanes (or zero-sets) of each output unit. Also, being on one side or the other of the decision boundary in the DN input space is equivalent to being on one side or the other of the linear decision boundary in the last-layer input space. Combining those two results, we obtain that at initialization there has to be at least one sample on one side of the decision boundary and the others on the other side.

Bibliography

P-A Absil, Alan Edelman, and Plamen Koev. On the largest principal angle between

random subspaces. Linear Algebra and its applications, 414(1):288–294, 2006.

Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt,

and Been Kim. Sanity checks for saliency maps. In S. Bengio, H. Wal-

lach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,

Advances in Neural Information Processing Systems, volume 31. Curran Asso-

ciates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/

294a8ed24b1ad22ec2e7efea049b8737-Paper.pdf.

Sidney N Afriat. Orthogonal and oblique projectors and the characteristics of pairs of

vector spaces. In Mathematical Proceedings of the Cambridge Philosophical Society,

volume 53, pages 800–816. Cambridge University Press, 1957.

A. F. Agarap. Deep learning using rectified linear units (ReLU). arXiv preprint

arXiv:1803.08375, 2018.

Alok Aggarwal, Heather Booth, Joseph O’Rourke, Subhash Suri, and Chee K Yap.

Finding minimal convex nested polygons. Information and Computation, 83(1):

98–110, 1989.

Hirotugu Akaike. Factor analysis and aic. In Selected papers of hirotugu akaike, pages

371–386. Springer, 1987. 187

Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in

overparameterized neural networks, going beyond two layers. In Advances in neural

information processing systems, pages 6158–6169, 2019a.

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training

recurrent neural networks. In Advances in Neural Information Processing Systems,

pages 6676–6688, 2019b.

Edoardo Amaldi and Stefano Coniglio. A distance-based point-reassignment heuris-

tic for the k-hyperplane clustering problem. European Journal of Operational

Research, 227(1):22–29, 2013. ISSN 0377-2217. doi: https://doi.org/10.1016/

j.ejor.2012.09.026. URL https://www.sciencedirect.com/science/article/

pii/S037722171200690X.

Brandon Amos, Lei Xu, and J Zico Kolter. Input convex neural networks. arXiv

preprint arXiv:1609.07152, 2016.

Joakim Andén, Vincent Lostanlen, and Stéphane Mallat. Joint time–frequency scat-

tering. IEEE Transactions on Signal Processing, 67(14):3704–3718, 2019.

Helena Andrés-Terré and Pietro Lió. Perturbation theory approach to study the

latent space degeneracy of variational autoencoders, 2019.

M Arjovsky and L Bottou. Towards principled methods for training generative ad-

versarial networks. arxiv, 2017. arXiv preprint arXiv:1701.04862.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv

preprint arXiv:1701.07875, 2017. 188

S. Arora, A. Bhaskara, R. Ge, and T. Ma. Provable bounds for learning some deep

representations. arXiv preprint arXiv:1310.6343, 2013.

Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong

Wang. On exact computation with an infinitely wide neural net. In Advances in

Neural Information Processing Systems, pages 8141–8150, 2019a.

Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained

analysis of optimization and generalization for overparameterized two-layer neural

networks. arXiv preprint arXiv:1901.08584, 2019b.

Devansh Arpit and Yoshua Bengio. The benefits of over-parameterization at initial-

ization in deep relu networks. arXiv preprint arXiv:1901.03611, 2019.

David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding.

Technical report, Stanford, 2006.

Hagai Attias. A variational baysian framework for graphical models. In Advances in

neural information processing systems, pages 209–215, 2000.

Franz Aurenhammer. Power diagrams: properties, algorithms and applications. SIAM

Journal on Computing, 16(1):78–96, 1987.

Franz Aurenhammer. Voronoi diagrams—a survey of a fundamental geometric data

structure. ACM Computing Surveys (CSUR), 23(3):345–405, 1991.

Franz Aurenhammer and Hiroshi Imai. Geometric relations among voronoi diagrams.

Geometriae Dedicata, 27(1):65–75, 1988.

David Avis and Komei Fukuda. Reverse search for enumeration. Discrete applied

mathematics, 65(1-3):21–46, 1996. 189

Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with pseudo-ensembles.

In Advances in neural information processing systems, pages 3365–3373, 2014.

Bhavik R Bakshi and George Stephanopoulos. Compression of chemical process data

by functional approximation and feature extraction. AIChE Journal, 42(2):477–

492, 1996.

Pierre Baldi and Peter J Sadowski. Understanding dropout. In Advances in neural

information processing systems, pages 2814–2822, 2013.

Randall Balestriero, Romain Cosentino, Behnaam Aazhang, and Richard Baraniuk.

The geometry of deep networks: Power diagram subdivision. In Advances in Neural

Information Processing Systems 32, pages 15806–15815. 2019.

Sudipto Banerjee and Anindya Roy. Linear algebra and matrix analysis for statistics.

Chapman and Hall/CRC, 2014.

Nitin Bansal, Xiaohan Chen, and Zhangyang Wang. Can we gain more from or-

thogonality regularizations in training deep networks? In Advances in Neural

Information Processing Systems, pages 4261–4271, 2018.

C Bradford Barber, David P Dobkin, and Hannu Huhdanpaa. The quickhull algorithm

for convex hulls. ACM Transactions on Mathematical Software (TOMS), 22(4):

469–483, 1996.

David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network

dissection: Quantifying interpretability of deep visual representations. In Proceed-

ings of the IEEE conference on computer vision and pattern recognition, pages

6541–6549, 2017. 190

David Bau, Jun-Yan Zhu, Hendrik Strobelt, Agata Lapedriza, Bolei Zhou, and Anto-

nio Torralba. Understanding the role of individual units in a deep neural network.

Proceedings of the National Academy of Sciences, 117(48):30071–30078, 2020.

Peter Bauer, Alan Thorpe, and Gilbert Brunet. The quiet revolution of numerical

weather prediction. Nature, 525(7567):47–55, 2015.

Eugene Belilovsky, Michael Eickenberg, and Edouard Oyallon. Greedy layerwise

learning can scale to imagenet. In International conference on machine learning,

pages 583–593. PMLR, 2019.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern

machine-learning practice and the classical bias-variance trade-off. Proceedings of

the National Academy of Sciences, 116(32):15849–15854, 2019.

Richard Bellman and Robert Roth. Curve fitting by segmented straight lines. Journal

of the American Statistical Association, 64(327):1079–1084, 1969.

Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc

Le. Understanding and simplifying one-shot architecture search. In International

Conference on Machine Learning, pages 549–558, 2018.

Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new

perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, 2013.

Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max

Welling. Sylvester normalizing flows for variational inference. arXiv preprint

arXiv:1803.05649, 2018. 191

James C Bezdek and J Douglas Harris. Fuzzy partitions and relations; an axiomatic

basis for clustering. Fuzzy sets and systems, 1(2):111–127, 1978.

Manjunath BG and Stefan Wilhelm. Moments calculation for the double truncated

multivariate normal density. Available at SSRN 1472153, 2009.

Debswapna Bhattacharya and Jianlin Cheng. De novo protein conformational sam-

pling using a probabilistic graphical model. Scientific reports, 5(1):1–13, 2015.

Gérard Biau, Benoît Cadre, Maxime Sangnier, and Ugo Tanielian. Some theoretical

properties of gans. arXiv preprint arXiv:1803.07819, 2018.

C. Biernacki, G. Celeux, and G. Govaert. Assessing a mixture model for clustering

with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell.,

22(7):719–725, 2000.

Jeff Bilmes and Geoffrey Zweig. The graphical models toolkit: An open source soft-

ware system for speech and time-series processing. In 2002 IEEE International

Conference on Acoustics, Speech, and Signal Processing, volume 4, pages IV–3916.

IEEE, 2002.

Peter Binev, Albert Cohen, Wolfgang Dahmen, Ronald DeVore, et al. Classification

algorithms using adaptive partitioning. Annals of Statistics, 42(6):2141–2163, 2014.

Garrett Birkhoff and Carl De Boor. Error bounds for spline interpolation. Journal

of mathematics and mechanics, 13(5):827–835, 1964.

C. M. Bishop. Pattern Recognition and Machine Learning, volume 4. Springer-Verlag

New York, 2006. 192

Ake Bjorck and Gene H Golub. Numerical methods for computing angles between

linear subspaces. Mathematics of computation, 27(123):579–594, 1973.

Nils Bjorck, Carla P Gomes, Bart Selman, and Kilian Q Weinberger. Understanding

batch normalization. In Advances in Neural Information Processing Systems, pages

7694–7705, 2018.

Andreas Björklund, Thore Husfeldt, and Mikko Koivisto. Set partitioning via

inclusion-exclusion. SIAM Journal on Computing, 39(2):546–563, 2009.

Merlijn Blaauw and Jordi Bonada. Modeling and transforming speech using vari-

ational autoencoders. Morgan N, editor. Interspeech 2016; 2016 Sep 8-12; San

Francisco, CA.[place unknown]: ISCA; 2016. p. 1770-4., 2016.

Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What

is the state of neural network pruning? In Third Conference on Machine Learning

and Systems, 2020.

Léon Bottou. Large-scale machine learning with stochastic gradient descent. In

Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.

Y. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in

visual recognition. In Proc. Int. Conf. Mach. Learn., pages 111–118, 2010.

Stephen Boyd, Stephen P Boyd, and Lieven Vandenberghe. Convex optimization.

Cambridge university press, 2004.

Paul S Bradley and Usama M Fayyad. Refining initial points for k-means clustering.

In ICML, volume 98, pages 91–99. Citeseer, 1998. 193

Leo Breiman. Hinging hyperplanes for regression, classification, and function approx-

imation. IEEE Transactions on Information Theory, 39(3):999–1013, 1993.

Andrew Brown, Sean Milton, Mike Cullen, Brian Golding, John Mitchell, and Ann

Shelly. Unified modeling and prediction of weather and climate: A 25-year journey.

Bulletin of the American Meteorological Society, 93(12):1865–1877, 2012.

Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE

transactions on pattern analysis and machine intelligence, 35(8):1872–1886, 2013.

Fred B Bryant and Paul R Yarnold. Principal-components analysis and exploratory

and confirmatory factor analysis. 1995.

Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoen-

coders. arXiv preprint arXiv:1509.00519, 2015.

Santiago A Cadena, Marissa A Weis, Leon A Gatys, Matthias Bethge, and Alexan-

der S Ecker. Diverse feature visualizations reveal invariances in early layers of deep

neural networks. In Proceedings of the European Conference on Computer Vision

(ECCV), pages 217–232, 2018.

M Emre Celebi, Hassan A Kingravi, and Patricio A Vela. A comparative study of

efficient initialization methods for the k-means clustering algorithm. Expert systems

with applications, 40(1):200–210, 2013.

Bo Chen, Gungor Polatkan, Guillermo Sapiro, David Blei, David Dunson, and

Lawrence Carin. Deep learning with hierarchical convolutional factor analysis.

IEEE transactions on pattern analysis and machine intelligence, 35(8):1887–1901,

2013. 194

Ricky TQ Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating

sources of disentanglement in variational autoencoders. Advances in Neural Infor-

mation Processing Systems, 31:2610–2620, 2018.

Valeriia Cherepanova, Micah Goldblum, Harrison Foley, Shiyuan Duan, John Dicker-

son, Gavin Taylor, and Tom Goldstein. Lowkey: Leveraging adversarial attacks to

protect social media users from facial recognition. arXiv preprint arXiv:2101.07922,

2021.

Ting-Wu Chin, Ruizhou Ding, Cha Zhang, and Diana Marculescu. Towards effi-

cient model compression via learned global ranking. In Proceedings of the IEEE

Conference on Computer Vision and Pattern Recognition, 2020a.

Ting-Wu Chin, Ruizhou Ding, Cha Zhang, and Diana Marculescu. Towards efficient

model compression via learned global ranking. In Proceedings of the IEEE/CVF

Conference on Computer Vision and Pattern Recognition, pages 1518–1528, 2020b.

Albert Cohen, Nira Dyn, Frédéric Hecht, and Jean-Marie Mirebeau. Adaptive mul-

tiresolution analysis based on anisotropic triangulations. Mathematics of Compu-

tation, 81(278):789–810, 2012.

Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learn-

ing: A tensor analysis. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir,

editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of

Machine Learning Research, pages 698–728, Columbia University, New York, New

York, USA, 23–26 Jun 2016. PMLR.

T. S. Cohen and M. Welling. Group equivariant convolutional networks. arXiv

preprint arXiv:1602.07576, 2016. 195

YL Cun, L Bottou, G Orr, and K Muller. Efficient backprop, neural networks: Tricks

of the trade. Lecture notes in computer sciences, 1524:5–50, 1998.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathe-

matics of Control, Signals, and Systems (MCSS), 2(4):303–314, 1989.

Ingrid Daubechies, Ronald DeVore, Simon Foucart, Boris Hanin, and Guergana

Petrova. Nonlinear approximation and (deep) relu networks. arXiv preprint

arXiv:1905.02199, 2019.

Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak.

Hyperspherical variational auto-encoders. arXiv preprint arXiv:1804.00891, 2018.

Carl De Boor and John R Rice. Least squares cubic spline approximation i-fixed

knots. 1968.

Morris H DeGroot and Mark J Schervish. Probability and statistics. Pearson Educa-

tion, 2012.

Boris Delaunay et al. Sur la sphere vide. Izv. Akad. Nauk SSSR, Otdelenie Matem-

aticheskii i Estestvennyka Nauk, 7(793-800):1–2, 1934.

Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from

incomplete data via the em algorithm. Journal of the Royal Statistical Society:

Series B (Methodological), 39(1):1–22, 1977.

Tingquan Deng, Dongsheng Ye, Rong Ma, Hamido Fujita, and Lvnan Xiong. Low-

rank local tangent space embedding for subspace clustering. Information Sciences,

508:1–21, 2020.

Ronald A DeVore. Nonlinear approximation. Acta numerica, 7:51–150, 1998. 196

Adji B Dieng and John Paisley. Reweighted expectation maximization. arXiv preprint

arXiv:1906.05850, 2019.

Xiaohan Ding, Guiguang Ding, Yuchen Guo, and Jungong Han. Centripetal sgd for

pruning very deep convolutional networks with complicated structure. In Proceed-

ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,

pages 4943–4953, 2019.

Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent

components estimation. arXiv preprint arXiv:1410.8516, 2014.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using

real nvp. arXiv preprint arXiv:1605.08803, 2016.

David P Dobkin, Allan R Wilks, Silvio VF Levy, and William P Thurston. Con-

tour tracing by piecewise linear approximations. ACM Transactions on Graphics

(TOG), 9(4):389–423, 1990.

David L Donoho, Iain M Johnstone, et al. Ideal denoising in an orthonormal basis

chosen from a library of bases. Comptes rendus de l’Acad´emiedes sciences. S´erie

I, Math´ematique, 319(12):1317–1322, 1994.

Kenji Doya. Universality of fully connected recurrent neural networks. Dept. of

Biology, UCSD, Tech. Rep, 1993.

Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent

finds global minima of deep neural networks. In International Conference on Ma-

chine Learning, pages 1675–1685, 2019a. 197

Simon S Du, Kangcheng Hou, Russ R Salakhutdinov, Barnabas Poczos, Ruosong

Wang, and Keyulu Xu. Graph neural tangent kernel: Fusing graph neural networks

with graph kernels. In Advances in Neural Information Processing Systems, pages

5724–5734, 2019b.

Didier J Dubois. Fuzzy sets and systems: theory and applications, volume 144. Aca-

demic press, 1980.

James George Dunham. Optimum uniform piecewise linear approximation of planar

curves. IEEE Transactions on Pattern Analysis and Machine Intelligence, (1):

67–75, 1986.

Ishan Durugkar, Ian Gemp, and Sridhar Mahadevan. Generative multi-adversarial

networks. arXiv preprint arXiv:1611.01673, 2016.

Martin E Dyer. The complexity of vertex enumeration methods. Mathematics of

Operations Research, 8(3):381–402, 1983.

Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training gener-

ative neural networks via maximum mean discrepancy optimization. arXiv preprint

arXiv:1505.03906, 2015.

S. Elfwing, E. Uchibe, and K. Doya. Sigmoid-weighted linear units for neural network

function approximation in reinforcement learning. Neural Netw., 2018.

Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov,

Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and

Jeff Dean. A guide to deep learning in healthcare. Nature medicine, 25(1):24–29,

2019. 198

Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rig-

ging the lottery: Making all tickets winners. arXiv preprint arXiv:1911.11134,

2019.

Otto Fabius and Joost R van Amersfoort. Variational recurrent auto-encoders. arXiv

preprint arXiv:1412.6581, 2014.

Jay Farrell, Manu Sharma, and Marios Polycarpou. Backstepping-based flight con-

trol with adaptive function approximation. Journal of Guidance, Control, and

Dynamics, 28(6):1089–1102, 2005.

Giancarlo Ferrari-Trecate and Marco Muselli. A new learning method for piecewise

linear regression. In International conference on artificial neural networks, pages

444–449. Springer, 2002.

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse,

trainable neural networks. In International Conference on Learning Representa-

tions, 2019.

Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Repre-

senting model uncertainty in deep learning. In international conference on machine

learning, pages 1050–1059, 2016.

Alberto Gascón and Eugenio F. Sánchez-Úbeda. Automatic specification of piecewise

linear additive models: application to forecasting natural gas demand. Statistics

and Computing, 28(1):201–217, 2018.

A. Gersho and R. M. Gray. Vector Quantization and Signal Compression, volume

159. Springer US, 2012. 199

Thomas Gerstner and Markus Holtz. Algorithms for the cell enumeration and orthant

decomposition of hyperplane arrangements. University of Bonn, 2006.

Zoubin Ghahramani and Sam T Roweis. Learning nonlinear dynamical systems using

an em algorithm. In Advances in neural information processing systems, pages 431–

437, 1999.

Zoubin Ghahramani, Geoffrey E Hinton, et al. The em algorithm for mixtures of

factor analyzers. Technical report, Technical Report CRG-TR-96-1, University of

Toronto, 1996.

Federico Girosi, Michael Jones, and Tomaso Poggio. Priors stabilizers and basis

functions: From regularization to radial, tensor and additive splines. 1993.

X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward

neural networks. In Proc. 13th Int. Conf. AI Statist., volume 9, pages 249–256,

2010.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural

networks. In Proceedings of the Fourteenth International Conference on Artificial

Intelligence and Statistics, pages 315–323, 2011.

Micah Goldblum, Dimitris Tsipras, Chulin Xie, Xinyun Chen, Avi Schwarzschild,

Dawn Song, Aleksander Madry, Bo Li, and Tom Goldstein. Data security for

machine learning: Data poisoning, backdoor attacks, and defenses. arXiv preprint

arXiv:2012.10544, 2020.

I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning, volume 1. MIT Press,

2016. http://www.deeplearningbook.org. 200

I. J Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,

A. Courville, and Y. Bengio. Generative adversarial nets. In Proceedings of the

27th International Conference on Neural Information Processing Systems, pages

2672–2680. MIT Press, 2014.

Michael T. Goodrich. Efficient piecewise-linear function approximation using the

uniform metric: (preliminary version). In Proceedings of the Tenth Annual Sym-

posium on Computational Geometry, SCG ’94, page 322–331, New York, NY,

USA, 1994. Association for Computing Machinery. ISBN 0897916484. doi:

10.1145/177424.178040. URL https://doi.org/10.1145/177424.178040.

Gaurav Gothoskar, Alex Doboli, and Simona Doboli. Piecewise-linear modeling of

analog circuits based on model extraction from trained neural networks. In Pro-

ceedings of the 2002 IEEE International Workshop on Behavioral Modeling and

Simulation, 2002. BMAS 2002., pages 41–46. IEEE, 2002.

Will Grathwohl, Ricky TQ Chen, Jesse Betterncourt, Ilya Sutskever, and David Du-

venaud. Ffjord: Free-form continuous dynamics for scalable reversible generative

models. arXiv preprint arXiv:1810.01367, 2018.

Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint

arXiv:1308.0850, 2013.

Branko Grünbaum. Convex polytopes, volume 221. Springer Science & Business

Media, 2013.

Caglar Gulcehre, Marcin Moczulski, Francesco Visin, and Yoshua Bengio. Mollifying

networks. arXiv preprint arXiv:1608.04980, 2016. 201

S Louis Hakimi and Edward F Schmeichel. Fitting polygonal functions to a set of

points in the plane. CVGIP: Graphical Models and Image Processing, 53(2):132–

136, 1991.

Paul R Halmos. Measure theory, volume 18. Springer, 2013.

Greg Hamerly and Charles Elkan. Alternatives to the k-means algorithm that find

better clusterings. In Proceedings of the eleventh international conference on In-

formation and knowledge management, pages 600–607, 2002.

Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep

neural networks with pruning, trained quantization and huffman coding. arXiv

preprint arXiv:1510.00149, 2015.

Boris Hanin and David Rolnick. Complexity of linear regions in deep networks. arXiv

preprint arXiv:1901.09021, 2019.

L. A. Hannah and D. B. Dunson. Multivariate convex regression with adaptive par-

titioning. J. Mach. Learn. Res., 14(1):3261–3294, 2013.

Kazuyuki Hara, Daisuke Saitoh, and Hayaru Shouno. Analysis of dropout learning

regarded as ensemble learning. In International Conference on Artificial Neural

Networks, pages 72–79. Springer, 2016.

Harry H Harman. Modern factor analysis. University of Chicago press, 1976.

Radoslav Harman and Vladimír Lacko. On decompositional algorithms for uniform

sampling from n-spheres and n-balls. Journal of Multivariate Analysis, 101(10):

2297–2304, 2010. 202

Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight vc-dimension

bounds for piecewise linear neural networks. In Conference on Learning Theory,

pages 1064–1068. PMLR, 2017.

David A Harville. Matrix algebra from a statistician’s perspective, 1998.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into recti-

fiers: Surpassing human-level performance on imagenet classification. In Proceed-

ings of the IEEE international conference on computer vision, pages 1026–1034,

2015a.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning

for image recognition. CoRR, abs/1512.03385, 2015b.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning

for image recognition. In Proceedings of the IEEE conference on computer vision

and pattern recognition, pages 770–778, 2016a.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in

deep residual networks. In European conference on computer vision, pages 630–

645. Springer, 2016b.

Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft fil-

ter pruning for accelerating deep convolutional neural networks. arXiv preprint

arXiv:1808.06866, 2018.

Robert Hecht-Nielsen. Theory of the backpropagation neural network. In Neural

networks for perception, pages 65–93. Elsevier, 1992. 203

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew

Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic

visual concepts with a constrained variational framework. ICLR, 2(5):6, 2017.

Geoffrey E Hinton, Peter Dayan, and Michael Revow. Modeling the manifolds of

images of handwritten digits. IEEE transactions on Neural Networks, 8(1):65–74,

1997.

Matthew Hirn, Stéphane Mallat, and Nicolas Poilvert. Wavelet scattering regression

of quantum chemical energies. Multiscale Modeling & Simulation, 15(2):827–863,

2017.

Francis Hirsch and Gilles Lacombe. Elements of functional analysis, volume 192.

Springer Science & Business Media, 2012.

Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic

variational inference. The Journal of Machine Learning Research, 14(1):1303–1347,

2013.

Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural

networks, 4(2):251–257, 1991.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward net-

works are universal approximators. Neural networks, 2(5):359–366, 1989.

William C Horrace. Some results on the multivariate truncated normal distribution.

Journal of multivariate analysis, 94(1):209–221, 2005.

John Albert Horst and Isabel Beichel. A simple algorithm for efficient piecewise linear 204

approximation of space curves. In Proceedings of international conference on image

processing, volume 2, pages 744–747. IEEE, 1997.

Chin-Wei Huang, Kris Sankaran, Eeshan Dhekane, Alexandre Lacoste, and Aaron

Courville. Hierarchical importance weighted autoencoders. arXiv preprint

arXiv:1905.04866, 2019.

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely

connected convolutional networks. In Proceedings of the IEEE conference on com-

puter vision and pattern recognition, pages 4700–4708, 2017.

Gao Huang, Shichen Liu, Laurens Van der Maaten, and Kilian Q Weinberger. Con-

densenet: An efficient densenet using learned group convolutions. In Proceedings of

the IEEE Conference on Computer Vision and Pattern Recognition, pages 2752–

2761, 2018a.

Guang-Bin Huang, Dian Hui Wang, and Yuan Lan. Extreme learning machines: a

survey. International journal of machine learning and cybernetics, 2(2):107–122,

2011.

Huaibo Huang, Ran He, Zhenan Sun, Tieniu Tan, et al. Introvae: Introspective

variational autoencoders for photographic image synthesis. In Advances in neural

information processing systems, pages 52–63, 2018b.

Kaixuan Huang, Yuqing Wang, Molei Tao, and Tuo Zhao. Why do deep residual

networks generalize better than deep feedforward networks?–a neural tangent kernel

perspective. arXiv preprint arXiv:2002.06262, 2020.

Hiroshi Imai, Masao Iri, and Kazuo Murota. Voronoi diagram in the laguerre geometry

and its applications. SIAM Journal on Computing, 14(1):93–105, 1985. 205

Tadanobu Inoue, Subhajit Choudhury, Giovanni De Magistris, and Sakyasingha Das-

gupta. Transfer learning from synthetic to real images using variational autoen-

coders for precise position detection. In 2018 25th IEEE International Conference

on Image Processing (ICIP), pages 2725–2729. IEEE, 2018.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by

reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in

batch-normalized models. In Advances in neural information processing systems,

pages 1945–1953, 2017.

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image trans-

lation with conditional adversarial networks. In Proceedings of the IEEE conference

on computer vision and pattern recognition, pages 1125–1134, 2017.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Con-

vergence and generalization in neural networks. In Advances in neural information

processing systems, pages 8571–8580, 2018.

Kui Jia, Dacheng Tao, Shenghua Gao, and Xiangmin Xu. Improving training of

deep neural networks via singular value bounding. In Proceedings of the IEEE

Conference on Computer Vision and Pattern Recognition, pages 4344–4352, 2017.

Haifeng Jin, Qingquan Song, and Xia Hu. Auto-keras: An efficient neural architecture

search system. In Proceedings of the 25th ACM SIGKDD International Conference

on Knowledge Discovery & Data Mining, pages 1946–1956. ACM, 2019.

Roger A Johnson. Advanced Euclidean Geometry: An Elementary Treatise on the 206

Geometry of the Triangle and the Circle: Under the Editorship of John Wesley

Young. Dover Publications, 1960.

M.I. Jordan. Learning in Graphical Models. Adaptive computation and machine

learning. London, 1998. ISBN 9780262600323. URL https://books.google.com/

books?id=zac7L4LbNtUC.

Michael I Jordan. An introduction to probabilistic graphical models, 2003.

Pedro Julián, Mario Jordán, and Alfredo Desages. Canonical piecewise-linear ap-

proximation of smooth functions. IEEE Transactions on Circuits and Systems I:

Fundamental Theory and Applications, 45(5):567–571, 1998.

Claus Kahlert and Leon O Chua. A generalized canonical piecewise-linear represen-

tation. IEEE Transactions on Circuits and Systems, 37(3):373–383, 1990.

S Kang and L Chua. A global representation of multidimensional piecewise-linear

functions with linear partitions. IEEE Transactions on Circuits and Systems, 25

(11):938–940, 1978.

Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Sil-

verman, and Angela Y Wu. An efficient k-means clustering algorithm: Analysis and

implementation. IEEE Transactions on Pattern Analysis & Machine Intelligence,

(7):881–892, 2002.

Christian Kanzow and Stefania Petra. On a semismooth least squares formulation of

complementarity problems with gap reduction. Optimization Methods and Software,

19(5):507–525, 2004.

Kenji Kawaguchi, Jiaoyang Huang, and Leslie Pack Kaelbling. Effect of depth and

width on local minima in deep learning. Neural computation, 31(7):1462–1498,

2019.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and

Ping Tak Peter Tang. On large-batch training for deep learning: Generalization

gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

Mahyar Khayatkhoei, Maneesh K Singh, and Ahmed Elgammal. Disconnected mani-

fold learning for generative adversarial networks. In Advances in Neural Information

Processing Systems, pages 7343–7353, 2018.

Beomsu Kim, Junghoon Seo, Seunghyeon Jeon, Jamyoung Koo, Jeongyeol Choe, and

Taegyun Jeon. Why are saliency maps noisy? cause of and solution to noisy

saliency maps. In 2019 IEEE/CVF International Conference on Computer Vision

Workshop (ICCVW), pages 4149–4157. IEEE, 2019.

Hyunjik Kim and Andriy Mnih. Disentangling by factorising. arXiv preprint

arXiv:1802.05983, 2018.

Jintae Kim, Jaeseo Lee, Lieven Vandenberghe, and Chih-Kong Ken Yang. Techniques

for improving the accuracy of geometric-programming based analog circuit design

optimization. In IEEE/ACM International Conference on Computer Aided Design,

2004. ICCAD-2004., pages 863–870. IEEE, 2004.

Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based

probability estimation. arXiv preprint arXiv:1606.03439, 2016.

Ross D King, Stephen Muggleton, Richard A Lewis, and MJ Sternberg. Drug de-

sign by machine learning: The use of inductive logic programming to model the

structure-activity relationships of trimethoprim analogues binding to dihydrofolate

reductase. Proceedings of the national academy of sciences, 89(23):11322–11326,

1992.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv

preprint arXiv:1412.6980, 2014.

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint

arXiv:1312.6114, 2013.

Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1

convolutions. In Advances in Neural Information Processing Systems, pages 10215–

10224, 2018.

David G Kleinbaum, K Dietz, M Gail, Mitchel Klein, and Mitchell Klein. Logistic

regression. Springer, 2002.

Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On convergence and

stability of gans. arXiv preprint arXiv:1705.07215, 2017.

Jonas Kohler, Hadi Daneshmand, Aurelien Lucchi, Thomas Hofmann, Ming Zhou,

and Klaus Neymeyr. Exponential convergence rates for batch normalization: The

power of length-direction decoupling in non-convex optimization. In The 22nd

International Conference on Artificial Intelligence and Statistics, pages 806–815,

2019.

Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and tech-

niques. MIT press, 2009.

Harri Lappalainen and Antti Honkela. Bayesian non-linear independent component

analysis by multi-layer perceptrons. In Advances in independent component analy-

sis, pages 93–121. Springer, 2000.

Yann LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard,

Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten

zip code recognition. Neural computation, 1(4):541–551, 1989.

Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and

time series. The handbook of brain theory and neural networks, 3361(10):1995,

1995a.

Yann LeCun, LD Jackel, Léon Bottou, Corinna Cortes, John S Denker, Harris

Drucker, UA Muller, Eduard Sackinger, Patrice Simard, et al.

Learning algorithms for classification: A comparison on handwritten digit recogni-

tion. Neural networks: the statistical mechanics perspective, 261:276, 1995b.

Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak,

Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth

evolve as linear models under gradient descent. In Advances in neural information

processing systems, pages 8572–8583, 2019a.

Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-shot

network pruning based on connection sensitivity. In In-

ternational Conference on Learning Representations, 2019b. URL https://

openreview.net/forum?id=B1VZqjAcYX.

Chaojian Li, Tianlong Chen, Haoran You, Zhangyang Wang, and Yingyan Lin. Halo:

Hardware-aware learning to optimize. In Proceedings of the European Conference

on Computer Vision (ECCV), September 2020.

Guanbin Li and Yizhou Yu. Visual saliency based on multiscale deep features. In

Proceedings of the IEEE conference on computer vision and pattern recognition,

pages 5455–5463, 2015.

Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing

the loss landscape of neural nets. arXiv preprint arXiv:1712.09913, 2017a.

Jerry Li, Aleksander Madry, John Peebles, and Ludwig Schmidt. Towards un-

derstanding the dynamics of generative adversarial networks. arXiv preprint

arXiv:1706.09884, 2017b.

Xiaopeng Li and James She. Collaborative variational autoencoder for recommender

systems. In Proceedings of the 23rd ACM SIGKDD international conference on

knowledge discovery and data mining, pages 305–314, 2017.

Xiaopeng Li, Zhourong Chen, Leonard KM Poon, and Nevin L Zhang. Learning latent

superstructures in variational autoencoders for deep multidimensional clustering.

arXiv preprint arXiv:1803.05206, 2018.

Zhibin Liao and Gustavo Carneiro. On the importance of normalisation layers in deep

learning with piecewise linear activation units. In 2016 IEEE Winter Conference

on Applications of Computer Vision (WACV), pages 1–8. IEEE, 2016.

Jaechang Lim, Seongok Ryu, Jin Woo Kim, and Woo Youn Kim. Molecular generative

model based on conditional variational autoencoder for de novo molecular design.

Journal of cheminformatics, 10(1):1–9, 2018.

Angelica Nakagawa Lima, Eric Allison Philot, Gustavo Henrique Goulart Trossini,

Luis Paulo Barbour Scott, Vinícius Gonçalves Maltarollo, and Kathia Maria Hon-

orio. Use of machine learning approaches for novel drug discovery. Expert opinion

on drug discovery, 11(3):225–239, 2016.

Ru-Je Lin and Wei-Song Lin. A computational visual saliency model based on statis-

tics and machine learning. Journal of vision, 14(9):1–1, 2014.

Shaohui Lin, Rongrong Ji, Yuchao Li, Yongjian Wu, Feiyue Huang, and Baochang

Zhang. Accelerating convolutional networks via global and dynamic filter pruning.

In Proceedings of the Twenty-Seventh International Joint Conference on Artificial

Intelligence, IJCAI-18, 2018.

Shaohui Lin, Rongrong Ji, Chenqian Yan, Baochang Zhang, Liujuan Cao, Qixiang

Ye, Feiyue Huang, and David Doermann. Towards optimal structured cnn pruning

via generative adversarial learning. In Proceedings of the IEEE/CVF Conference

on Computer Vision and Pattern Recognition, pages 2790–2799, 2019.

Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander Gaunt. Con-

strained graph variational autoencoders for molecule design. In S. Bengio,

H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett,

editors, Advances in Neural Information Processing Systems 31, pages 7795–

7804. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/

8005-constrained-graph-variational-autoencoders-for-molecule-design.

pdf.

Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and con-

vergence properties of generative adversarial learning. In Advances in Neural In-

formation Processing Systems, pages 5545–5553, 2017a.

Zechun Liu, Haoyuan Mu, Xiangyu Zhang, Zichao Guo, Xin Yang, Kwang-Ting

Cheng, and Jian Sun. Metapruning: Meta learning for automatic neural network

channel pruning. In Proceedings of the IEEE/CVF International Conference on

Computer Vision, pages 3296–3305, 2019a.

Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Chang-

shui Zhang. Learning efficient convolutional networks through network slimming.

In Proceedings of the IEEE International Conference on Computer Vision, pages

2736–2744, 2017b.

Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking

the value of network pruning. In International Conference on Learning Represen-

tations, 2019b. URL https://openreview.net/forum?id=rJlnB3C5Ym.

Vincent Lostanlen and Joakim Andén. Binaural scene classification with wavelet

scattering. Proc. DCASE, 2016.

Haihao Lu and Kenji Kawaguchi. Depth creates no bad local minima. arXiv preprint

arXiv:1702.08580, 2017.

Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The

expressive power of neural networks: A view from the width. arXiv preprint

arXiv:1709.02540, 2017.

James Lucas, George Tucker, Roger Grosse, and Mohammad Norouzi. Understanding

posterior collapse in generative latent variable models. 2019.

Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method

for deep neural network compression. In Proceedings of the IEEE international

conference on computer vision, pages 5058–5066, 2017.

Tom Lyche and Larry L Schumaker. Local spline approximation methods. Journal

of Approximation Theory, 15(4):294–325, 1975.

Li Ma, Melba M Crawford, and Jinwen Tian. Local manifold learning-based k-nearest-

neighbor for hyperspectral image classification. IEEE Transactions on Geoscience

and Remote Sensing, 48(11):4099–4109, 2010.

Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve

neural network acoustic models. In Proc. icml, volume 30, page 3, 2013.

David JC MacKay and Mark N Gibbs. Density networks. Statistics and neural

networks: advances at the interface. Oxford University Press, Oxford, pages 129–

144, 1999.

James MacQueen et al. Some methods for classification and analysis of multivari-

ate observations. In Proceedings of the fifth Berkeley symposium on mathematical

statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.

A. Magnani and S. P. Boyd. Convex piecewise-linear fitting. Optim. Eng., 10(1):

1–17, 2009.

S. Mallat. Group invariant scattering. Comm. Pure Appl. Math., 65(10):1331–1398,

July 2012.

Stéphane Mallat. A wavelet tour of signal processing. Academic press, 1999.

Stéphane Mallat. Understanding deep convolutional networks. Phil. Trans. R. Soc.

A, 374(2065):20150203, 2016.

Olvi L Mangasarian, J Ben Rosen, and ME Thompson. Global minimization via

piecewise-linear underestimation. Journal of Global Optimization, 32(1):1–9, 2005.

Christopher Manning and Dan Klein. Optimization, maxent models, and conditional

estimation without magic. In Proceedings of the 2003 Conference of the North

American Chapter of the Association for Computational Linguistics on Human

Language Technology: Tutorials-Volume 5, pages 8–8. Association for Computa-

tional Linguistics, 2003.

Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen

Paul Smolley. Least squares generative adversarial networks. In Proceedings of the

IEEE International Conference on Computer Vision, pages 2794–2802, 2017.

Dominic Masters and Carlo Luschi. Revisiting small batch training for deep neural

networks. arXiv preprint arXiv:1804.07612, 2018.

Carl D Meyer. Matrix analysis and applied linear algebra, volume 71. Siam, 2000.

Jianming Miao and Adi Ben-Israel. On principal angles between subspaces in R^n.

Linear Algebra Appl, 171(92):81–98, 1992.

Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint

arXiv:1511.06422, 2015.

Joseph SB Mitchell and Subhash Suri. Separation and approximation of polyhedral

objects. Computational Geometry, 5(2):95–114, 1995.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida.

Spectral normalization for generative adversarial networks. arXiv preprint

arXiv:1802.05957, 2018.

Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout spar-

sifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.

Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Prun-

ing convolutional neural networks for resource efficient inference. arXiv preprint

arXiv:1611.06440, 2016.

Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the

number of linear regions of deep neural networks. In Advances in neural information

processing systems, pages 2924–2932, 2014.

Theodore S Motzkin, Howard Raiffa, Gerald L Thompson, and Robert M Thrall. The

double description method. Contributions to the Theory of Games, 2(28):51–73,

1953.

John Mount. The equivalence of logistic regression and maximum entropy models.

URL: http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf, 2011.

James R Munkres. Elements of algebraic topology. CRC Press, 2018.

Naila Murray and Florent Perronnin. Generalized max pooling. In Proceedings of

the IEEE conference on computer vision and pattern recognition, pages 2473–2480,

2014.

Sandeep Nadella, Amarjot Singh, and SN Omkar. Aerial scene understanding us-

ing deep wavelet scattering network and conditional random field. In European

Conference on Computer Vision, pages 205–214. Springer, 2016.

Eric Nalisnick and Padhraic Smyth. Stick-breaking variational autoencoders. arXiv

preprint arXiv:1605.06197, 2016.

N. M. Nasrabadi and R. A. King. Image coding using vector quantization: A review.

IEEE Trans. Commun., 36(8):957–971, 1988.

Radford M Neal and Geoffrey E Hinton. A view of the em algorithm that justifies

incremental, sparse, and other variants. In Learning in graphical models, pages

355–368. Springer, 1998.

Yu Nesterov. A method of solving a convex programming problem with convergence

rate O(1/k^2). In Sov. Math. Dokl., volume 27, 1983.

Klaus Neumann, Matthias Rolf, and Jochen Jakob Steil. Reliable integration of

continuous constraints into extreme learning machines. International Journal of

Uncertainty, Fuzziness and Knowledge-Based Systems, 21(supp02):35–50, 2013.

Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan

Srebro. The role of over-parametrization in generalization of neural networks.

In International Conference on Learning Representations, 2019. URL https:

//openreview.net/forum?id=BygfghAcYX.

Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks.

In International conference on machine learning, pages 2603–2612. PMLR, 2017.

Siqi Nie, Meng Zheng, and Qiang Ji. The deep regression bayesian network and

its applications: Probabilistic deep learning for computer vision. IEEE Signal

Processing Magazine, 35(1):101–111, 2018.

Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha

Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical

study. In International Conference on Learning Representations, 2018. URL https:

//openreview.net/forum?id=HJC2SzZCW.

Jong-Hoon Oh and H Sebastian Seung. Learning generative models with the up

propagation algorithm. In Advances in Neural Information Processing Systems,

pages 605–611, 1998.

Edison Ong, Mei U Wong, Anthony Huffman, and Yongqun He. Covid-19 coron-

avirus vaccine design using reverse vaccinology and machine learning. Frontiers in

immunology, 11:1581, 2020.

János Pach and Pankaj K Agarwal. Combinatorial geometry, volume 37. John Wiley

& Sons, 2011.

Sankar K Pal and Sushmita Mitra. Multilayer perceptron, fuzzy sets, and classifica-

tion. IEEE Transactions on neural networks, 3(5):683–697, 1992.

Rahul Parhi and Robert D Nowak. Banach space representer theorems for neural

networks and ridge splines. Journal of Machine Learning Research, 22(43):1–40,

2021.

Jooyoung Park and Irwin W Sandberg. Universal approximation using radial-basis-

function networks. Neural computation, 3(2):246–257, 1991.

Yookoon Park, Chris Kim, and Gunhee Kim. Variational laplace autoencoders. In

International Conference on Machine Learning, pages 5032–5041, 2019.

A. Patel, T. Nguyen, and R. Baraniuk. A probabilistic framework for deep learning.

In Proc. Adv. Neural Inf. Process. Syst. (NIPS’16), Dec. 2016.

Vu Pham, Théodore Bluche, Christopher Kermorvant, and Jérôme Louradour.

Dropout improves recurrent neural networks for handwriting recognition. In 2014

14th international conference on frontiers in handwriting recognition, pages 285–

290. IEEE, 2014.

Tsai-Yun Phillips and Azriel Rosenfeld. An isodata algorithm for straight line fitting.

Pattern Recognition Letters, 7(5):291–297, 1988.

Jennifer Pittman and CA Murthy. Fitting optimal piecewise linear functions using ge-

netic algorithms. IEEE Transactions on pattern analysis and machine intelligence,

22(7):701–718, 2000.

Michael James David Powell. Approximation theory and methods. Cambridge univer-

sity press, 1981.

Franco P Preparata and Michael I Shamos. Computational geometry: an introduction.

Springer Science & Business Media, 2012.

Lawrence Rabiner and B Juang. An introduction to hidden markov models. ieee assp

magazine, 3(1):4–16, 1986.

Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl Dickstein.

On the expressive power of deep neural networks. In Proceedings of the 34th In-

ternational Conference on Machine Learning-Volume 70, pages 2847–2854. JMLR.

org, 2017.

P. Ramachandran, B. Zoph, and Q. Le. Searching for activation functions. ArXiv

e-prints, Oct. 2017.

Douglas A Reynolds. Gaussian mixture models. Encyclopedia of biometrics, 741:

659–663, 2009.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing

flows. arXiv preprint arXiv:1505.05770, 2015.

Richard Baraniuk, Anima Anandkumar, Stephane Mallat, Ankit Patel, and Nhat Ho.

Integration of deep learning theories, 2018.

Lewis Fry Richardson. Weather prediction by numerical process. Cambridge university

press, 2007.

S. Rifai, G. Mesnil, P. Vincent, X. Muller, Y. Bengio, Y. Dauphin, and X. Glorot.

Higher order contractive auto-encoder. In Joint European Conference on Machine

Learning and Knowledge Discovery in Databases, pages 645–660. Springer, 2011.

Blaine Rister and Daniel L Rubin. Piecewise convexity of artificial neural networks.

Neural Networks, 94:34–45, 2017.

Gérard Rives, Michel Dhome, Jean-Thierry Lapresté, and Marc Richetin. Detection

of patterns in images from piecewise linear contours. Pattern recognition letters, 3

(2):99–104, 1985.

Arthur Wayne Roberts. Convex functions. In Handbook of Convex Geometry, Part

B, pages 1081–1104. Elsevier, 1993.

Frank Rosenblatt. Principles of neurodynamics. perceptrons and the theory of brain

mechanisms. Technical report, Cornell Aeronautical Lab Inc Buffalo NY, 1961.

Sam Roweis and Zoubin Ghahramani. A unifying review of linear gaussian models.

Neural computation, 11(2):305–345, 1999.

Aurko Roy, Ashish Vaswani, Arvind Neelakantan, and Niki Parmar. Theory and

experiments on vector quantized autoencoders. arXiv preprint arXiv:1805.11063,

2018.

Swalpa Kumar Roy, Suvojit Manna, Shiv Ram Dubey, and Bidyut Baran Chaudhuri.

Lisht: Non-parametric linearly scaled hyperbolic tangent activation function for

neural networks. arXiv preprint arXiv:1901.05894, 2019.

Walter Rudin. Real and complex analysis. Tata McGraw-hill education, 2006.

Parsa Saadatpanah, Ali Shafahi, and Tom Goldstein. Adversarial attacks on copyright

detection systems. In Hal Daumé III and Aarti Singh, editors, Proceedings of the

37th International Conference on Machine Learning, volume 119 of Proceedings

of Machine Learning Research, pages 8307–8315. PMLR, 13–18 Jul 2020. URL

http://proceedings.mlr.press/v119/saadatpanah20a.html.

Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between

capsules. arXiv preprint arXiv:1710.09829, 2017.

Tim Salimans and Durk P Kingma. Weight normalization: A simple reparame-

terization to accelerate training of deep neural networks. In Advances in Neural

Information Processing Systems, pages 901–909, 2016.

Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How

does batch normalization help optimization? In Advances in Neural Information

Processing Systems, pages 2483–2493, 2018.

Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the

nonlinear dynamics of learning in deep linear neural networks. arXiv preprint

arXiv:1312.6120, 2013.

Anton Maximilian Schäfer and Hans Georg Zimmermann. Recurrent neural networks

are universal approximators. In International Conference on Artificial Neural Net-

works, pages 632–640. Springer, 2006.

Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural

Computation, 4(6):863–879, 1992.

Isaac J Schoenberg. Cardinal spline interpolation. SIAM, 1973.

Isaac Jacob Schoenberg. Spline functions and the problem of graduation. In IJ

Schoenberg Selected Papers, pages 201–204. Springer, 1988.

Larry Schumaker. Spline functions: basic theory. Cambridge University Press, 2007.

Thiago Serra and Srikumar Ramalingam. Empirical bounds on linear regions of deep

rectifier networks. In Proceedings of the AAAI Conference on Artificial Intelligence,

volume 34, pages 5628–5635, 2020.

Thiago Serra, Christian Tjandraatmadja, and Srikumar Ramalingam. Bounding and

counting linear regions of deep neural networks. In International Conference on

Machine Learning, pages 4558–4566. PMLR, 2018.

Anish Shah, Eashan Kadam, Hena Shah, Sameer Shinde, and Sandip Shingade. Deep

residual networks with exponential linear unit. In Proceedings of the Third Inter-

national Symposium on Computer Vision and the Internet, pages 59–65, 2016.

Or Sharir and Amnon Shashua. On the expressive power of overlapping architectures

of deep learning. In International Conference on Learning Representations, 2018.

URL https://openreview.net/forum?id=HkNGsseC-.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al.

Mastering the game of go without human knowledge. nature, 550(7676):354–359,

2017.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional

networks: Visualising image classification models and saliency maps. 2014.

Nora H Sleumer. Output-sensitive cell enumeration in hyperplane arrangements.

Nordic journal of computing, 6(2):137–147, 1999.

Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima

in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.

Michael Spivak. Calculus on manifolds: a modern approach to classical theorems of

advanced calculus. CRC press, 2018.

Nitish Srivastava. Improving neural networks with dropout. University of Toronto,

182(566):7, 2013.

Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep

networks. In Advances in neural information processing systems, pages 2377–2385,

2015.

Richard P Stanley et al. An introduction to hyperplane arrangements. Geometric

combinatorics, 13:389–496, 2004.

Gilbert W Stewart. Error and perturbation bounds for subspaces associated with

certain eigenvalue problems. SIAM review, 15(4):727–764, 1973.

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance

of initialization and momentum in deep learning. In International conference on

machine learning, pages 1139–1147, 2013.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna.

Rethinking the inception architecture for computer vision. In Proceedings of the

IEEE conference on computer vision and pattern recognition, pages 2818–2826,

2016.

Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi.

Inception-v4, inception-resnet and the impact of residual connections on learning.

In AAAI, volume 4, page 12, 2017.

Paulo Tabuada and Bahman Gharesifard. Universal approximation power of

deep residual neural networks via nonlinear control theory. arXiv preprint

arXiv:2007.06007, 2020.

Georges M Tallis. The moment generating function of the truncated multi-normal

distribution. Journal of the Royal Statistical Society: Series B (Methodological),

23(1):223–229, 1961.

GM Tallis. Plane truncation in normal populations. Journal of the Royal Statistical

Society: Series B (Methodological), 27(2):301–307, 1965.

Jiexiong Tang, Chenwei Deng, and Guang-Bin Huang. Extreme learning machine for

multilayer perceptron. IEEE transactions on neural networks and learning systems,

27(4):809–821, 2015.

Ugo Tanielian, Thibaut Issenhuth, Elvis Dohmatob, and Jeremie Mary. Learning

disconnected manifolds: a no gans land. arXiv preprint arXiv:2006.04596, 2020.

Y. Teng and A. Choromanska. Invertible autoencoder for domain adaptation. Com-

putation, 7(2):20, 2019.

Michael E Tipping and Christopher M Bishop. Mixtures of probabilistic principal

component analyzers. Neural computation, 11(2):443–482, 1999a.

Michael E Tipping and Christopher M Bishop. Probabilistic principal component

analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodol-

ogy), 61(3):611–622, 1999b.

Jakub M Tomczak and Max Welling. Improving variational auto-encoders using

householder flow. arXiv preprint arXiv:1611.09630, 2016.

Jakub M. Tomczak and Max Welling. VAE with a vampprior. CoRR, abs/1705.07120,

2017a. URL http://arxiv.org/abs/1705.07120.

Jakub M Tomczak and Max Welling. Vae with a vampprior. arXiv preprint

arXiv:1705.07120, 2017b.

John Torous, Mark E Larsen, Colin Depp, Theodore D Cosco, Ian Barnett,

Matthew K Nock, and Joe Firth. Smartphones, sensors, and machine learning

to advance real-time prediction and interventions for suicide prevention: a review

of current progress and next steps. Current psychiatry reports, 20(7):1–6, 2018.

L. Trottier, P. Giguère, and B. Chaib-draa. Parametric exponential linear unit for deep

convolutional neural networks. In 16th IEEE Int. Conf. Mach. Learn. Appl., pages

207–214. IEEE, 2017.

Michael Unser. A representer theorem for deep neural networks. arXiv preprint

arXiv:1802.09210, 2018.

Michael Unser and Thierry Blu. Cardinal exponential splines: Part i-theory and

filtering algorithms. IEEE Transactions on Signal Processing, 53(4):1425–1438,

2005.

Michael Unser, Akram Aldroubi, and Murray Eden. B-spline signal processing. i.

theory. IEEE transactions on signal processing, 41(2):821–833, 1993.

Harri Valpola. Unsupervised learning of nonlinear dynamic state-space models.

Helsinki University of Technology, 2000.

Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In

Advances in Neural Information Processing Systems, pages 6306–6315, 2017.

Lieven Vandenberghe, BL De Moor, and Joos Vandewalle. The generalized linear

complementarity problem applied to the complete analysis of resistive piecewise-

linear circuits. IEEE transactions on circuits and systems, 36(11):1382–1391, 1989.

V Venkateswar and Rama Chellappa. Extraction of straight lines in aerial images.

IEEE Transactions on Pattern Analysis & Machine Intelligence, 14(11):1111–1114,

1992.

P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol. Extracting and com-

posing robust features with denoising autoencoders. In Proceedings of the 25th

international conference on Machine learning, pages 1096–1103, 2008.

Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy,

David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan

Bright, et al. Scipy 1.0: fundamental algorithms for scientific computing in python.

Nature methods, pages 1–12, 2020.

Aaron R Voelker, Jan Gosmann, and Terrence C Stewart. Efficiently sampling vec-

tors and coordinates from the n-sphere and n-ball. Technical report, Centre for

Theoretical Neuroscience, Waterloo, ON. doi: 10.13140 . . . , 2017.

Richard Von Mises. Mathematical theory of probability and statistics. Academic Press,

2014.

Georges Voronoi. Nouvelles applications des paramètres continus à la théorie des

formes quadratiques. Premier mémoire. Sur quelques propriétés des formes quadra-

tiques positives parfaites. Journal für die reine und angewandte Mathematik, 1908

(133):97–102, 1908.

Stefan Wager, Sida Wang, and Percy S Liang. Dropout training as adaptive regu-

larization. In Advances in neural information processing systems, pages 351–359,

2013.

Colin G Walsh, Jessica D Ribeiro, and Joseph C Franklin. Predicting risk of suicide

attempts over time through machine learning. Clinical Psychological Science, 5(3):

457–469, 2017.

Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization

of neural networks using dropconnect. In International conference on machine

learning, pages 1058–1066, 2013.

DP Wang, NF Huang, HS Chao, and Richard CT Lee. Plane sweep algorithms for the

polygonal approximation problems with applications. In International Symposium

on Algorithms and Computation, pages 515–522. Springer, 1993.

Zhou Wang and Alan C Bovik. Mean squared error: Love it or leave it? a new look

at signal fidelity measures. IEEE signal processing magazine, 26(1):98–117, 2009.

Zichao Wang, Randall Balestriero, and Richard Baraniuk. A max-affine spline per-

spective of recurrent neural networks. In International Conference on Learning

Representations, 2018.

David Warde-Farley, Ian J Goodfellow, Aaron Courville, and Yoshua Bengio.

An empirical analysis of dropout in piecewise linear networks. arXiv preprint

arXiv:1312.6197, 2013.

Di Xie, Jiang Xiong, and Shiliang Pu. All you need is beyond a good init: Exploring

better solution for training extremely deep convolutional neural networks with or-

thonormality and modulation. In Proceedings of the IEEE Conference on Computer

Vision and Pattern Recognition, pages 6176–6185, 2017.

Jiacheng Xu and Greg Durrett. Spherical latent spaces for stable variational autoen-

coders. arXiv preprint arXiv:1808.10805, 2018.

Lei Xu and Michael I Jordan. On convergence properties of the em algorithm for

gaussian mixtures. Neural computation, 8(1):129–151, 1996.

Xin Xu, Lei Zuo, and Zhenhua Huang. Reinforcement learning algorithms with func-

tion approximation: Recent advances and applications. Information Sciences, 261:

1–31, 2014.

Yangyang Xu and Wotao Yin. A block coordinate descent method for regularized

multiconvex optimization with applications to nonnegative tensor factorization and

completion. SIAM Journal on imaging sciences, 6(3):1758–1789, 2013.

Dingdong Yang, Seunghoon Hong, Yunseok Jang, Tianchen Zhao, and Honglak Lee.

Diversity-sensitive conditional generative adversarial networks, 2019a.

Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian pro-

cess behavior, gradient independence, and neural tangent kernel derivation. arXiv

preprint arXiv:1902.04760, 2019.

Greg Yang. Tensor programs ii: Neural tangent kernel for any architecture. arXiv

preprint arXiv:2006.14548, 2020.

Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, and Samuel S

Schoenholz. A mean field theory of batch normalization. arXiv preprint

arXiv:1902.08129, 2019b.

Jianbo Ye, Xin Lu, Zhe Lin, and James Z Wang. Rethinking the smaller-norm-less-

informative assumption in channel pruning of convolution layers. arXiv preprint

arXiv:1802.00124, 2018.

Peng-Yeng Yin. Algorithms for straight line fitting using k-means. Pattern recognition

letters, 19(1):31–41, 1998.

Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen,

Yingyan Lin, Zhangyang Wang, and Richard G. Baraniuk. Drawing early-bird

tickets: Toward more efficient training of deep networks. In International Confer-

ence on Learning Representations, 2020. URL https://openreview.net/forum?

id=BJxsrgStvr.

Seniha Esen Yuksel, Joseph N Wilson, and Paul D Gader. Twenty years of mixture

of experts. IEEE transactions on neural networks and learning systems, 23(8):

1177–1193, 2012.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint

arXiv:1605.07146, 2016.

Thomas Zaslavsky. Facing up to arrangements: Face-count formulas for partitions of

space by hyperplanes: Face-count formulas for partitions of space by hyperplanes,

volume 154. American Mathematical Soc., 1975.

Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint

arXiv:1212.5701, 2012.

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional

networks. In European conference on computer vision, pages 818–833. Springer,

2014.

Cheng Zhang, Judith Bütepage, Hedvig Kjellström, and Stephan Mandt. Advances

in variational inference. IEEE transactions on pattern analysis and machine intel-

ligence, 41(8):2008–2026, 2018a.

Liwen Zhang, Gregory Naitzat, and Lek-Heng Lim. Tropical geometry of deep neural

networks. CoRR, abs/1805.07091, 2018b. URL http://arxiv.org/abs/1805.

07091.

Lu Zhang, Jianjun Tan, Dan Han, and Hao Zhu. From machine learning to deep learn-

ing: progress in machine intelligence for rational drug discovery. Drug discovery

today, 22(11):1680–1685, 2017a.

Pengchuan Zhang, Qiang Liu, Dengyong Zhou, Tao Xu, and Xiaodong He. On the

discrimination-generalization tradeoff in gans. arXiv preprint arXiv:1711.02771,

2017b.

Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely

efficient convolutional neural network for mobile devices. In Proceedings of the

IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856,

2018c.

Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense

network for image super-resolution. In Proceedings of the IEEE conference on

computer vision and pattern recognition, pages 2472–2481, 2018d.

Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial

network. arXiv preprint arXiv:1609.03126, 2016.

Shengjia Zhao, Jiaming Song, and Stefano Ermon. Towards deeper understanding of

variational autoencoding models. arXiv preprint arXiv:1702.08658, 2017.

Ding-Xuan Zhou. Universality of deep convolutional neural networks. Applied and

computational harmonic analysis, 48(2):787–794, 2020.

Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized

deep neural networks. In Advances in Neural Information Processing Systems,

pages 2055–2064, 2019.

Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent

optimizes over-parameterized deep relu networks. arXiv preprint arXiv:1811.08888,

2018.

James Zou, Mikael Huss, Abubakar Abid, Pejman Mohammadi, Ali Torkamani, and

Amalio Telenti. A primer on deep learning in genomics. Nature genetics, 51(1):

12–18, 2019.