RICE UNIVERSITY
By
Randall Balestriero
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE
Doctor of Philosophy
APPROVED, THESIS COMMITTEE
Richard Baraniuk
Ankit Patel
Behnaam Aazhang
Stephane Mallat
Moshe Vardi
Albert Cohen
HOUSTON, TEXAS
April 2021

ABSTRACT
Max-Affine Splines Insights Into Deep Learning
by
Randall Balestriero
We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings. For instance, conditioned on the spline partition region containing the input signal, the output of a MASO DN can be written as a simple affine transformation of the input. Studying the geometry of those regions yields novel insights into different regularization techniques, layer configurations, and initialization schemes. Going further, this spline viewpoint provides precise geometric insights in various domains, such as the characterization of the manifold generated by Deep Generative Networks, the understanding of Deep Network pruning as a means to simplify the DN input space partition, and the relationship between different nonlinearities, e.g., the ReLU and the Sigmoid Gated Linear Unit, as simply corresponding to different MASO region membership inference algorithms. The spline partition of the input signal space that is implicitly induced by a MASO directly links DNs to the theory of vector quantization (VQ) and K-means clustering, which opens up new geometric avenues to study how DNs organize signals in a hierarchical fashion.

ACKNOWLEDGEMENTS
I would like to thank Prof. Herve Glotin for giving me the opportunity to enter the research world during my Bachelor's degree with topics of the greatest interest. Needless to say, without Herve's passion and love for his academic profession, I would not be doing research in this exciting field of machine and deep learning. Herve has done much more than just provide me with an opportunity. He has molded me into a curious dreamer, a quality that I hope to hold for as long as possible in order to one day follow in Herve's footsteps. I would also like to especially thank Prof. Sebastien Paris for considering me as an equal colleague during my Bachelor's research internships and thereafter. Sebastien's rigor, knowledge, and pragmatism have influenced me greatly in the most positive way. I also want to thank the countless invaluable encounters I have had within the LSIS team, such as with Prof. Ricard Marxer, the LJLL team, such as with Prof. Frederic Hecht and Prof. Albert Cohen, and the DI ENS team, such as with Prof. Stephane Mallat and Prof. Vincent Lostanlen, all sharing two common traits: a limitless expertise of their field and an unbounded desire to share their knowledge. I would also like to thank Prof. Richard Baraniuk for taking me into his group and for constantly inspiring me to produce work of the highest quality. Rich's influence has allowed me to considerably improve my ability not only to conduct research, but also to communicate research. I would have been an incomplete PhD candidate without this primordial skill. I also want to thank Prof. Rudolf Riedi for taking me on a mathematical tour. Rolf's ability to seamlessly bridge the most abstract theoretical concepts and the most intuitive observations will never cease to amaze me and to fuel my desire to learn. I also thank Sina Alemohammad, CJ Barberan, Yehuda Dar, Ahmed Imtiaz, Hamid Javadi, Daniel Lejeune, Lorenzo Luzi, Tan Nguyen, Jasper Tan, and Zichao Wang, who are part of the Deep Learning group at Rice and with whom I have been collaborating, discussing and brainstorming. I also want to thank beyond words my family, from whom I have never stopped learning, and my partner Dr. Kerda Varaku for mollifying the world around me (while performing a multi-year long reinforcement learning experiment on me, probably still in progress today). I also want to give a special word for Dr. Romain Cosentino, with whom I have been blindly pursuing ideas that led us to novel fields, and for Dr. Leonard Seydoux, with whom I have discovered geophysics in the most interesting and captivating way. This work was partially supported by NSF grants IIS-17-30574 and IIS-18-38177, AFOSR grant FA9550-18-1-0478, ARO grant W911NF-15-1-0316, ONR grants N00014-17-1-2551 and N00014-18-12571, DARPA grant G001534-7500, a DOD Vannevar Bush Faculty Fellowship (NSSEFF) grant N00014-18-1-2047, and a BP fellowship from the Ken Kennedy Institute.

Contents
Abstract ii Acknowledgments iii List of Illustrations xi List of Tables xxiv Notations xxvi
1 Introduction 1 1.1 Motivation ...... 2 1.2 Deep Networks ...... 4 1.2.1 Layers ...... 5 1.2.2 Training ...... 8 1.2.3 Approximation Results ...... 9 1.3 Related Works ...... 11 1.3.1 Mathematical Formulations of Deep Networks ...... 11 1.3.2 Training of Deep Generative Networks...... 16 1.3.3 Batch-Normalization Understandings...... 18 1.3.4 Deep Network Pruning...... 19 1.4 Contributions ...... 21
2 Max-Affine Splines for Convex Function Approximation 24 2.1 Spline Functions ...... 25 2.2 Max-Affine Splines ...... 28 2.3 (Max-)Affine Spline Fitting ...... 30
3 Deep Networks: Composition of Max-Affine Spline Op- erators 32 3.1 Max-Affine Spline Operators ...... 32 3.2 From Deep Network Layers to Max-Affine Spline Operators ...... 33 3.3 Composition of Max-Affine Spline Operators ...... 36 3.4 Deep Networks Input Space Partition: Power Diagram Subdivision . 38 3.4.1 Voronoi Diagrams and Power Diagrams ...... 38 3.4.2 Single Layer: Power Diagram ...... 40 3.4.3 Composition of Layers: Power Diagram Subdivision ...... 43 3.5 Discussions ...... 45
4 Insights Into Deep Generative Networks 47 4.1 Introduction ...... 47 4.1.1 Related Works ...... 47 4.1.2 Contributions ...... 49 4.2 Deep Generative Network Latent and Intrinsic Dimension ...... 49 4.2.1 Input-Output Space Partition and Per-Region Mapping . . . . 50 4.2.2 Generated Manifold Angularity ...... 52 4.2.3 Generated Manifold Intrinsic Dimension ...... 55 4.2.4 Effect of Dropout/Dropconnect ...... 57 4.3 Per-Region Affine Mapping Interpretability and Manifold Tangent Space 61 4.3.1 Per-Region Mapping as Local Coordinate System and Disentanglement ...... 61 4.3.2 Tangent Space Regularization ...... 63 4.4 Density on the Generated Manifold ...... 65 4.4.1 Analytical Output Density ...... 66 4.4.2 On the Difficulty of Generating Low entropy/Multimodal Distributions ...... 67
4.5 Discussions ...... 68
5 Expectation-Maximization for Deep Generative Networks 70 5.1 Introduction ...... 70 5.1.1 Related Works ...... 70 5.1.2 Contributions ...... 74 5.2 Posterior and Marginal Distributions of Deep Generative Networks . 74 5.2.1 Conditional, Marginal and Posterior Distributions of Deep Generative Networks ...... 75 5.2.2 Obtaining the DGN Partition ...... 77 5.2.3 Gaussian Integration on the Deep Generative Network Latent Partition ...... 79 5.3 Expectation-Maximization Learning of Deep Generative Networks . . 81 5.3.1 Expectation Step ...... 82 5.3.2 Maximization Step ...... 83 5.3.3 Empirical Validation and VAE Comparison ...... 84 5.4 Discussions ...... 86
6 Insights Into Deep Network Pruning 88 6.1 Introduction ...... 88 6.1.1 Related Works ...... 89 6.1.2 Contributions ...... 90 6.2 Winning Tickets and DN Initialization ...... 90 6.2.1 The Initialization Dilemma and the Importance of Overparametrization ...... 91 6.2.2 Better DN Initialization: An Alternative to Pruning ...... 93 6.3 Pruning Continuous Piecewise Affine DNs ...... 95 6.3.1 Interpreting Pruning from a Spline Perspective ...... 97 6.3.2 Spline Early-Bird Tickets Detection ...... 98
6.3.3 Spline Pruning Policy ...... 100 6.4 Experiment Results ...... 103 6.4.1 Proposed Layerwise Spline Pruning over SOTA Pruning Methods ...... 103 6.4.2 Proposed Global Spline Pruning over SOTA Pruning Methods 104 6.5 Discussions ...... 105
7 Insights into Batch-Normalization 107 7.1 Introduction ...... 107 7.1.1 Related Works ...... 108 7.1.2 Contributions ...... 110 7.2 Batch Normalization: Unsupervised Layer-Wise Fitting ...... 110 7.2.1 Batch-Normalization Updates ...... 111 7.2.2 Layer Input Space Hyperplanes and Partition ...... 111 7.2.3 Translating the Hyperplanes ...... 113 7.3 Multiple Layer Analysis: Following the Data Manifold ...... 116 7.3.1 Deep Network Partition and Boundaries ...... 116 7.3.2 Interpreting Each Batch-Normalization Parameter ...... 120 7.3.3 Experiments: Batch-Normalization Focuses the Partition onto the Data ...... 121 7.4 Where is the Decision Boundary ...... 123 7.4.1 Batch-Normalization is a Smart Initialization ...... 123 7.4.2 Experiments: Batch-Normalization Initialization Jump-Starts Training ...... 126 7.5 The Role of the Batch-Normalization Learnable Parameters . . . . . 127 7.6 Batch-Normalization Noisyness ...... 129 7.7 Discussions ...... 132
8 Insights Into (Smooth) Deep Networks Nonlinearities 133
8.1 Introduction ...... 133 8.2 Max-Affine Splines meet Gaussian Mixture Models ...... 135 8.2.1 From MASO to GMM via K-Means ...... 135 8.2.2 hard-VQ Inference ...... 137 8.2.3 Soft-VQ Inference ...... 138 8.2.4 Soft-VQ MASO Nonlinearities ...... 139 8.3 Hybrid Hard/Soft Inference via Entropy Regularization ...... 139 8.4 Discussions ...... 141
A Insights into Generative Networks 142 A.1 Architecture Details ...... 142 A.2 Proofs ...... 144 A.2.1 Proof of Thm 4.1 ...... 144 A.2.2 Proof of Proposition 4.1 ...... 144 A.2.3 Proof of Proposition 4.2 ...... 145 A.2.4 Proof of Theorem 4.2 ...... 145 A.2.5 Proof of Theorem 4.3 ...... 146
B Expectation Maximization Training of Deep Generative Networks 148 B.1 Computing the Latent Space Partition ...... 148 B.2 Analytical Moments for truncated Gaussian ...... 151 B.3 Implementation Details ...... 153 B.4 Algorithms ...... 154 B.5 Proofs ...... 154 B.5.1 Proof of Lemma 5.1 ...... 154 B.5.2 Proof of Proposition 5.1 ...... 155 B.5.3 Proof of Theorem 5.1 ...... 155
B.5.4 Proof of Lemma 5.2 ...... 156 B.5.5 Proof of Moments ...... 157 B.6 Proof of EM-step ...... 158 B.6.1 E-step derivation ...... 159 B.6.2 Proof of M step ...... 159 B.7 Regularization ...... 165 B.8 Computational Complexity ...... 166 B.9 Additional Experiments ...... 166
C Deep Network Pruning 172 C.1 Additional Results on Initialization and Pruning ...... 173 C.1.1 Winning Tickets and Overparameterization ...... 173 C.1.2 Additional Results for Layerwise Pretraining ...... 173 C.2 Additional Early-Bird Visualizations ...... 176 C.2.1 Early-Bird Visualization for VGG-16 and PreResNet-101 . . . 176 C.3 Additional Experimental Details and Results ...... 176 C.3.1 Experiments Settings ...... 176 C.3.2 Additional Results of Our Global Spline Pruning ...... 177 C.3.3 Ablation Studies of Our Spline Pruning Method ...... 178
D Batch-Normalization 181 D.1 Proofs ...... 181 D.1.1 Proof of Theorem 7.1 ...... 181 D.1.2 Proof of Corollary 7.1 ...... 182 D.1.3 Proof of Theorem 7.2 ...... 183 D.1.4 Proof of Proposition 7.1 ...... 184

Illustrations
3.1 Two equivalent representations of a power diagram (PD). Top: The grey circles have centers [µ]_{k,:} and radii [rad]_k; each point x is assigned to a specific region/cell according to the Laguerre distance from the centers, which is defined as the length of the segment tangent to and starting on the circle and reaching x. Bottom: A PD in R^D (here D = 2) is constructed by lifting the centroids [µ]_{k,:} up into an additional dimension in R^{D+1} by the distance [rad]_k and then finding the Voronoi diagram (VD) of the augmented centroids ([µ]_{k,:}, [rad]_k) in R^{D+1}. The intersection of this higher-dimensional VD with the originating space R^D yields the PD...... 40
3.2 Visual depiction of the subdivision process that occurs when a deeper layer ℓ refines/subdivides the already built up-to-layer ℓ − 1 partition Ω^(1,...,ℓ−1). We depict here a toy model (2-layer DN) with 3 units at the first layer (leading to 4 regions) and 8 units at the second layer, with random weights and biases. The colors show the DN input space partitioning with respect to the first layer. Then, for each color (or region), the layer 1-layer 2 composition defines a specific PD that will subdivide this aforementioned region (first row), where the region is colored and the PD is depicted for the whole input space. This subdivision is then applied onto the first layer region only, as it only subdivides its own region (second row, right). Finally, grouping together this process for each of the 4 regions, we obtain the layer 1-layer 2 space partitioning (second row, left). . 44
4.1 Visual depiction of Thm. 4.1 with a (random) generator G : R2 7→ R3. Left: generator input space partition Ω made of polytopal regions. Right: generator image Im(G) which is a continuous piecewise affine surface composed of the polytopes obtained by affinely transforming the polytopes from the input space partition (left) the colors are per-region and correspond between left and right plots. This input-space-partition / generator-image / per-region-affine-mapping relation holds for any architecture employing piecewise affine activation functions. Understanding each of the three brings insights into the others, as we demonstrate in this paper...... 50
4.2 The columns represent different widths D_ℓ ∈ {6, 8, 16, 32} and the rows correspond to repetitions of the learning for different random initializations of the DGNs for consecutive seeds...... 53
4.3 Histograms of the DGN adjacent region angles for DGNs with two hidden layers, S = 16, D = 17 and D = 32 respectively, and varying width D_ℓ on the y-axis. Three trends to observe: increasing the width increases the bimodality of the distribution while favoring near-0 angles; increasing the output space dimension increases the number of angles near orthogonal; the A_ω and A_ω′ of adjacent regions ω and ω′ are highly similar, making most angles smaller than if they were independent (depicted in blue)...... 54
4.4 DGN with dropout trained (GAN) on a circle dataset (blue dots); dropout turns a DGN into an ensemble of DGNs (each dropout realization is drawn in a different color)...... 58
4.5 Impact of dropout and dropconnect on the intrinsic dimension of the noise-induced generators for two "drop" probabilities 0.1 and 0.3 and for a generator G with S = 6, D = 10, L = 3 with varying width D_1 = D_2 ranging from 6 to 48 (x-axis). The boxplot represents the distribution of the per-region intrinsic dimensions over 2000 sampled regions and 2000 different noise realizations. Recall that the intrinsic dimension is upper bounded by S = 6 in this case. Two key observations: first, dropconnect tends to produce DGNs with an intrinsic dimension preserving the latent dimension (S = 6) even for narrow models (D_1, D_2 ≈ S), as opposed to dropout, which tends to produce DGNs with a much smaller intrinsic dimension than S. As a result, if the DGN is much wider than S, both techniques can be used, while in narrow models, either none or dropconnect should be preferred. 59
4.6 Deep Autoencoder experiment when equipping the DGN (decoder) with dropout, where we employ the following MLP with S = D_1 = D_2 = 32 and D_3 = D_4 = 1024, D_5 = D; the test set reconstruction error is displayed for multiple datasets and training settings. The architecture purposefully maintains a narrow width for the first two layers to highlight that in those cases, dropout is detrimental regardless of the dropout rate. We compare applying dropout to all layers (black line) versus applying dropout only on the last two (wide) layers (blue line). We see that unless the dropout rate is adapted to the layer width and desired intrinsic dimension, the test set performance is negatively impacted by dropout. The exact rate reaching the best test set performance for the case of employing dropout only for wide layers is shown with a green arrow. The exact values for each graph are given in Table 4.1...... 60
4.7 Probability (0: blue, 1: red) that dropout maintains the intrinsic dimension (red line, left: 32, right: 64) as a function of the dropout rate (x-axis) and the layer's width (y-axis), with the 95% and 99% lines in continuous black and dashed black respectively. We see that when the layer's width is close to the desired intrinsic dimension, no dropout should be applied, and that for a dropout rate of 0.5, the layer must be at least two times wider than the desired intrinsic dimension...... 61
4.8 Visualization of a single basis vector [A_ω]_{·,k} before and after learning, obtained from a region ω containing the digits 7, 5, 9, and 0 respectively per column, for GAN and VAE models made of fully connected or convolutional layers. We observe how those basis vectors encode right rotation, cedilla extension, left rotation, and upward translation respectively; studying the columns of A_ω provides interpretability into the learned DGN affine parameters and underlying data manifold...... 62
4.9 Test set reconstruction error (y-axis) during training for each epoch (x-axis) for a baseline unconstrained Deep AutoEncoder (black line) and for the tangent space regularized DGN (decoder) from (4.6) with varying regularization coefficient λ (colored lines) for three datasets (per column) and with S = 128, T = 16 (top) and S = 32, T = 16 (bottom). We observe that by constraining the tangent space basis A_ω to span the data tangent space for each region ω containing training samples, the manifold fitting is improved, leading to better test sample reconstruction...... 65
4.10 Distribution of the per-region log-determinants (bottom row) for DGNs trained on a data distribution with varying per-mode variance (blue points, first row). The estimated data distribution is depicted through the red samples. We clearly observe the tight relationship between the multimodality and Shannon Entropy of the data distribution to be approximated and the distribution of the per-region determinant of A_ω. That is, as the DGN tries to approximate a data distribution with high multimodality and low Shannon Entropy, the per-region slope matrices A_ω have increasing singular values, in turn synonymous with exploding per-layer weights and thus training instabilities (recall Thm. 4.1)...... 67
4.11 Distribution of log(√(det(A_ω^T A_ω))) for 2000 regions ω of a DGN with L = 3, S = 6, D = 10 and weights initialized with Xavier; then, half of the weights' coefficients (picked randomly) are rescaled by σ_1 and the other half by σ_2. We observe that a greater variance of the weights increases the spread of the log-determinants and increases the mean of the distribution...... 68
5.1 Recursive partition discovery for a DGN with S = 2 and L = 2, starting with an initial region obtained from a sampled latent vector z (init). By walking on the faces of this region, neighboring regions sharing a common face are discovered (Step 1). Recursively repeating this process until no new region is discovered (Steps 2–4) provides the DGN latent space partition at left...... 78
5.2 Triangulation T(ω) as per (5.7) of a polytopal region ω (left plot) obtained from the Delaunay Triangulation of the region vertices, leading to 3 simplices (three right plots)...... 80
5.3 Left: Noiseless generated samples g(z) in red and noisy samples g(z) + ε in blue, with Σ_x = 0.1I, Σ_z = I. Middle: marginal distribution p(x) from (5.3). Right: the posterior distribution p(z|x) from (5.4) (blue), its expectation (green) and the position of the region limits (black), with the sample point x depicted in black in the left figure...... 81
5.4 DGN training under EM (black) and VAE training with various learning rates (blue: 0.005, red: 0.001, green: 0.0001). In all cases, VAE converges to the maximum of its ELBO. The gap between the VAE and EM curves is due to the inability of the VAE's AVI to correctly estimate the true posterior, pushing the VAE's ELBO far from the true log-likelihood (recall (5.1)) and thus preventing it from precisely approximating the true data distribution...... 84
5.5 KL-divergence between a VAE variational distribution and the true DGN posterior when trained on a noisy circle dataset in 2D for 3 different learning rates. During learning, the DGN adapts such that g(z) + ε models the data distribution based on the VAE's estimated ELBO. As learning progresses, the true DGN posterior becomes harder to approximate by the VAE's variational distribution in the AVI process. As such, even in this toy dataset, the commonly employed Gaussian variational distribution is not rich enough to capture the multimodality of p(z|x) from (5.4)...... 85
5.6 EM training of a DGN with latent dimension 1. We show only the generated continuous piecewise affine manifold g(z) without the additional white noise ε. We see how EM training of the DGN is able to fit the dataset, while VAE (with different learning rates (LR)) suffers from hyperparameter sensitivity and slow convergence. Training details and additional figures for this experiment are provided in Appendix B.9...... 85
5.7 Reprise of Fig. 5.6 for MNIST data restricted to the digit 4, employing a 3-layer DGN with latent dimension of 1. Details of training and additional figures for this experiment are provided in Appendix B.9...... 86
6.1 K-means experiments on a toy mixture of 64 Gaussians in 2D, where in all cases the number of final clusters is 64 but the number of starting clusters (x-axis) varies, and pruning is applied during training to remove redundant centroids, comparing random centroid initialization and kmeans++. With overparametrization, random initialization and pruning reach the same accuracy as kmeans++. . 91
6.2 (a) Difference between node and weight pruning, where the former removes entire subdivision lines while the latter simply quantizes those partition lines to be collinear to the space axes. (b) Toy classification task pruning, where the blue lines represent subdivisions in the first layer and the red lines denote the last layer's decision boundary. We see that: 1) pruning indeed removes redundant subdivision lines, so that the decision boundary remains an X-shape until 80% of the nodes are pruned; and 2) ideally, one blue subdivision line would be sufficient to provide two turning points for the decision boundary, e.g., the visualization at 80% sparsity, but the classification accuracy degrades a lot if pruned further. That aligns with the initialization dilemma for small DNs, i.e., the blue lines are not well initialized and all lines remain hard to train. (c) MNIST reproduction of (b), where, to produce these visuals, we choose two images from different classes to obtain a 2-dimensional slice of the 784-dimensional input space (grid depicted on the left). We thus obtain a low-dimensional depiction of the subdivision lines, which we depict in blue for the first layer, green for the second convolutional layer, and red for the decision boundary of 6 vs. 9 (based on the left grid). The observation consistently shows that only parts of the subdivision lines are useful for the decision boundary; the goal of pruning is to remove the (redundant) subdivision lines...... 96
6.3 Spline trajectory during training, visualizing the Early-Bird (EB) phenomenon, which can be leveraged to largely reduce the training costs incurred by costly overparametrized DNs. The trajectories mainly adapt during the early phase of training...... 98
6.4 We depict on the left a small (L = 2, D_1 = 5, D_2 = 8) DN input space partition, with layer 1 trajectories in black and layer 2 in blue. In the middle, the measure from Eq. (6.1) finds similar "partition trajectories" from layer 2 seen in the DN input space (comparing the green trajectory to the others, with coloring based on the induced similarity from dark to light). Based on this measure, pruning can be done to remove the "grouped partition trajectories" and obtain the pruned partition on the right...... 101
7.1 Depiction, for a 5-layer DN with 6 units per layer, of the impact of BN (with statistics computed from all samples) onto the position and shape of the up-to-layer-ℓ input space partition Ω_{1|ℓ}; in blue are the newly introduced boundaries from the current layer, in grey are the existing boundaries. The absence of BN (top row) leaves the partition random and unaware of the data samples, while BN (bottom row) positions and focuses the partition onto the data samples (while all other parameters of the BN are left identical); as per Thm. 7.1, BN minimizes the distances between the boundaries and the data samples. 115
7.2 Depiction of the layer (left) and DN (right) input space partition with L = 2, D^(1) = 2, D^(2) = 2. The partition boundaries of a layer in its input space correspond to the hyperplanes H^(ℓ,k) (7.5); for deeper layers, viewing H^(ℓ,k) in the DN input space leads to the paths P^(ℓ,k) (7.13)...... 116
7.3 Depiction of P^(ℓ,k), ℓ = 1, 4, where for each ℓ, P^(ℓ,k) is colored based on [σ^(ℓ)]_k / ‖[W^(ℓ)]_{k,·}‖ (blue: smallest, green: highest). As per Thms. 7.1 and 7.2, the bluer colored paths are the ones closer to the dataset (black dots), allowing interpretability of the σ^(ℓ) parameter as the fitness between P^(ℓ,k) and the mini-batch samples...... 120
7.4 This figure reproduces the experiment from Fig. 7.1 with a more complex (2-D) input dataset (left) and a much wider DN with D^(ℓ) = 1024 and L = 11. We depict for some layers the boundaries of the layer partitions seen in the DN input space (∂Ω_0^(ℓ), recall (7.13)) for DNs with different initializations: random for slopes and biases (random), random for slopes and zero for biases (zero), or the scaling of the slopes and the biases initialized from the BN statistics µ_BN^(ℓ) and σ_BN^(ℓ) from (7.3) (BN). The overlap of multiple partition boundaries induces a darker color, demonstrating the presence of more partition boundaries at each spatial location. Clearly, BN concentrates the partition boundaries onto the data samples...... 124
7.5 Average number of regions from the DN partition Ω in an ε-ball around 100 CIFAR images (left) and 100 random images (right) for a CNN, demonstrating that BN adapts the partition to the data samples. The weights initialization (random, zero, BN) follows Fig. 7.4. Additional datasets and architectures are given in the Appendix, showing the same result...... 125
7.6 Image classification with different architectures on SVHN, CIFAR10/100. In all cases no BN is used during training; the initialization of the weights is either random (black) or random with fixed BN parameters µ_BN^(ℓ), σ_BN^(ℓ), ∀ℓ (blue). That is, the BN parameters are found as per the BN strategy in a pretraining phase, and then those parameters are frozen (all other parameters remain at their random initialization). Then training starts and the random parameters are tuned based on the loss at hand. We can see that the BN initialization (again, no BN is used during training) is beneficial to reach better accuracy, effectively showing that BN initialization alone plays a crucial role for DNs. In most cases, the DN that does not leverage the BN initialization diverges altogether...... 128
7.7 Decision boundary realisations obtained for different batches on a 2-dimensional binary classification task. Each mini-batch (of size B) produces a different DN decision boundary based on the realisations of the random variables µ_BN^(ℓ), σ_BN^(ℓ) (recall (7.17), (7.18)). The variance of those r.v. depends on B, as seen in the figure. We depict those realisations at initialization (left) and after learning (right) for B = 16, 256, the latter producing smaller variance in the decision boundaries...... 130
8.1 For the MASO parameters A^(ℓ), B^(ℓ) for which HVQ yields the ReLU, absolute value, and an arbitrary convex activation function, we explore how changing β in the β-VQ alters the induced activation function. Solid black: HVQ (β = 1), dashed black: SVQ (β = 1/2), red: β-VQ (β ∈ [0.1, 0.9]). Interestingly, note how some of the functions are nonconvex...... 141
B.1 Sample of noisy data from the wave dataset ...... 167
B.2 Depiction of the evolution of the NLL during training for the EM and VAE algorithms. We can see that despite the high number of training steps, VAEs are not yet able to correctly approximate the data distribution, as opposed to EM training, which benefits from much faster convergence. We also see how the VAEs tend to have a large KL divergence between the true posterior and the variational estimate; due to this gap, we depict below samples from those models. 168
B.3 Samples from the various models trained on the wave dataset. We can see on top the result of EM training, where each column represents a different run; the remaining three rows correspond to the VAE training. Again, EM demonstrates much faster convergence; for VAE to reach the actual data distribution, many more updates are needed. 169
B.4 Evolution of the true data negative log-likelihood (in semilog-y plot) on MNIST (class 4) for EM and VAE training for a small DGN as described above. The experiments are repeated multiple times; we can see how the learning rate clearly impacts the learning significantly despite the use of Adam, and that even with the large learning rate, EM learning is able to reach a lower NLL; in fact, the quality of the generated samples of the EM models is much higher, as shown below...... 170
B.5 Random samples from DGNs trained with EM or VAEs on an MNIST experiment (with digit 4). We see the ability of EM training to produce realistic and diversified samples despite using a latent space dimension of 1 and a small generative network...... 171
C.1 Depiction of the dataset used for the K-means experiment with 64 centroids...... 174
C.2 Left: Depiction of a simple (toy) univariate regression task with the target function being a sawtooth with two peaks. Right: The ℓ_2 training error (y-axis) as a function of the width of the DN layer (2 layers in total). In theory, only 4 units are required to perfectly solve the task at hand with a ReLU layer; however, we see that optimization in narrow DNs is difficult and gradient-based learning fails to find the correct layer parameters. As the width is increased, the difficulty of the optimization problem reduces and SGD manages to find a good set of parameters solving the regression task...... 175
C.3 Accuracy vs. efficiency trade-offs of lottery initialization and layerwise pretraining...... 176
C.4 Illustrating the spline Early-Bird tickets in VGG-16 and PreResNet-101. 177
C.5 Ablation studies of the hyperparameter ρ in our spline pruning method on two commonly used models, VGG-16 and PreResNet-101. 179

Tables
4.1 Test set reconstruction error for varying dropout rates as displayed in Fig. 4.6, for different datasets, and when applying dropout on all layers or only on wide enough layers. We see that it is crucial to adapt the dropout rate to the layer width, as otherwise the test error only increases when employing dropout...... 60
4.2 Test set reconstruction error averaged over 3 runs when employing the tangent space regularization (4.6) on various datasets with a Deep Autoencoder, when varying the weight of the regularization term (λ), the latent space dimension (S), and the number of neighbors used to estimate the data tangent space (T). We see that the proposed regularization effectively improves generalization performance in all cases, even for complicated and high-dimensional datasets such as CIFAR10, where the data tangent space estimation becomes more challenging. This also demonstrates that DGNs trained only to reconstruct the data samples do not align correctly with the underlying data manifold tangent space...... 66
6.1 Accuracies of layerwise (LW) pretraining, structured pruning with random and lottery ticket initialization...... 94 6.2 Evaluating the proposed layerwise spline pruning over SOTA pruning methods on CIFAR-100...... 103
6.3 Evaluating the proposed global spline pruning over SOTA pruning methods on ImageNet...... 104
C.1 Evaluating our global spline pruning method over SOTA methods on CIFAR-10/100 datasets. Note that "Spline Improv." denotes the improvement of our spline pruning (w/ or w/o EB) as compared to the strongest baselines...... 180

NOTATIONS

The entire thesis follows the notations below. A scalar is always represented in lower case and in standard font weight as a. A vector is always represented in lower case and in bold font weight as a. A matrix is always represented in upper case and in bold font weight as in A. A function producing a scalar or a vector output is expressed in upper case and in standard font weight as F. A superscript surrounded by parentheses on any parameter/function is an index, and does not correspond to taking a power of the output, e.g., F^(4) can represent the fourth mapping and is not to be understood as F^4. Lastly, accessing a specific dimension of a vector, matrix or tensor is achieved through the [·] operator, as in [a]_k for the k-th dimension of a vector, [A]_{k,d} for the d-th entry of the k-th row of a matrix, and so on.
Chapter 1
Introduction
Deep learning has significantly advanced our ability to address a wide range of difficult machine learning and signal processing problems. Today's machine learning landscape is dominated by deep (neural) networks (DNs), which are compositions of a large number of simple parametrized linear and nonlinear operators. In this thesis, we build a bridge between DNs and spline functions and operators. We prove that a large class of DNs including convolutional neural networks (CNNs) [LeCun, 1998], residual networks (ResNets) [He et al., 2015b], skip connection networks [Srivastava et al., 2015], fully connected networks [Pal and Mitra, 1992], recurrent neural networks (RNNs) [Graves, 2013], scattering networks [Bruna and Mallat, 2013], inception networks [Szegedy et al., 2017], and more can be written as spline operators. In fact, we prove that any DN employing current standard-practice piecewise affine and convex nonlinearities (e.g., ReLU, absolute value, max-pooling, etc.) can be written as a composition of max-affine spline operators (MASOs), which are an extension of max-affine splines [Magnani and Boyd, 2009, Hannah and Dunson, 2013]. The max-affine spline connection provides a powerful portal through which to view and analyze the inner workings of a DN using tools from approximation theory, functional analysis and computational geometry. The goal of this thesis is to thoroughly adapt the max-affine spline insights for deep networks, to derive direct theoretical results from this formulation, and to provide insights and practical guidance for deep learning practitioners and researchers.
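To make the notion of a max-affine spline concrete before its formal treatment in Chapter 2, the following minimal sketch (in Python/NumPy, with illustrative sizes and random parameters that are our own assumptions, not taken from the thesis) evaluates f(x) = max_k(⟨a_k, x⟩ + b_k) on a batch of inputs and returns, for each input, the index of the affine piece attaining the maximum, i.e., the region of the induced input space partition containing that input.

import numpy as np

def max_affine_spline(x, A, b):
    # Evaluate f(x) = max_k (<A[k], x> + b[k]) for a batch of inputs.
    # x : (N, D) inputs, A : (K, D) per-piece slopes, b : (K,) per-piece offsets.
    # Returns the (N,) spline values and the (N,) region (arg-max piece) indices.
    affine = x @ A.T + b              # (N, K): every affine piece evaluated at every input
    return affine.max(axis=1), affine.argmax(axis=1)

# Illustrative usage: K = 4 affine pieces in D = 2 dimensions.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(4, 2)), rng.normal(size=4)
values, regions = max_affine_spline(rng.normal(size=(8, 2)), A, b)
print(values)    # spline values
print(regions)   # per-input region membership within the induced partition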
1.1 Motivation
Deep learning is increasingly becoming the backbone of our society, powering novel industries and finding its way into applications such as self-driving cars, drug discovery, renewable energies, space exploration and law enforcement. An all-too-familiar story of late is that of plugging a DN into an application as a black box, learning its parameter values using copious training data, and then significantly improving performance over classical task-specific approaches.
Despite this empirical progress, the precise mechanisms by which deep learning works so well remain open to questioning, adding an air of mystery to the entire field.
This pitfall becomes increasingly problematic as DNs are deployed in our society and many systems now rely exclusively on such models. Beyond interpretability of the prediction/decision making, which is lacking in DNs, one important issue lies in the safety of those models. It has been demonstrated, for example, how deployed models such as copyright infringement detectors, identity recognition models, and speech recognition models can be manipulated by any third-party agent through noise injection in the data [Saadatpanah et al., 2020, Goldblum et al., 2020, Cherepanova et al., 2021]. As a result, it is crucial to increase our theoretical and practical understanding of DNs, in particular in a way that allows practitioners to better design and control those powerful methods. In a recent turn of events, most, if not all, currently employed models have had their architecture altered, through trial and error, to finally become what they are today: affine spline functions. Through the rich theory of splines, we will demonstrate how to study DNs from this viewpoint.
This thesis is organized as follows. First, we review max-affine splines in all generality, as those convex, piecewise affine splines will be the backbone of this thesis (Chap. 2). The core novel results consist in the reformulation of DNs as max-affine splines and in leveraging this form to derive a direct characterization of the DN input space partition (Chap. 3). Following this, we present different facets of results that are direct consequences of this formulation: we study Deep Generative Networks and the geometry of the manifold that they span (Chap. 4). Those results apply to many frameworks, such as Generative Adversarial Networks and Variational Autoencoders, as well as Autoencoders, and provide practitioners with insights into architecture designs and techniques such as dropout and dropconnect. Second, we exploit the spline formulation and the result on the DN partition to derive novel strategies to learn Deep Generative Networks via Expectation-Maximization (Chap. 5). This chapter closes the study of Deep Generative Networks. Third, we study Deep Network Pruning, which consists in removing nodes/weights from an architecture with the hope of maintaining high performance while reducing the model complexity (Chap. 6). By leveraging the spline viewpoint, it is possible to obtain geometrical insights and to derive novel and well-motivated pruning solutions. Fourth, we dive into Batch-Normalization (Chap. 7). Batch-Normalization is one of the most popular techniques for speeding up and stabilizing DN training. Through the understanding of the DN partition, novel results and explanations of Batch-Normalization become possible, concluding that this technique concentrates the DN regions around the data samples and thus helps training by acting on the DN partition. Fifth and lastly, we conclude this thesis by demonstrating how the insights and results drawn throughout the above chapters can be extended to DNs with smooth nonlinearities by allowing the region assignment of the max-affine splines to be probabilistic (Chap. 8). This process is very similar to the ability of Gaussian Mixture Models to produce a probability that an input belongs to a specific region, i.e., cluster, as opposed to K-means, which produces a yes/no region membership value. Most of the chapters rely on conference papers that are cited as part of the corresponding chapter's introduction. Proofs are provided in multiple appendices, divided per chapter. Proofs that are short in length are put directly in the main document.
1.2 Deep Networks
We now introduce deep (neural) networks (DNs): nonlinear functions formed by a composition of layers, each layer performing a simple (possibly constrained) affine transformation of its input followed by a nonlinearity. The success of DNs on challenging computer vision tasks goes back at least as far as LeCun et al. [1995b] for handwritten digit classification. A typical DN F that employs L layers is expressed as
F(x) = (F^(L) ∘ · · · ∘ F^(1))(x),    (1.1)

where each function F^(ℓ) : R^{D^(ℓ−1)} → R^{D^(ℓ)} maps its input z_x^(ℓ−1), a feature map, to an output feature map z_x^(ℓ), with initialization z_x^(0) ≜ x. We thus have

z_x^(ℓ) = (F^(ℓ) ∘ · · · ∘ F^(1))(x).
Different DNs such as CNNs [LeCun, 1998], Residual Networks [He et al., 2015b],
Densenets [Huang et al., 2017] simply correspond to DNs in which the organization and the types of layers are specified explicitly. Some layers operate on feature maps with specific shapes, such as 3-dimensional tensors corresponding to multi-channel images. In any case, it is possible to consider the flattened version of such tensors and adapt the layer operations accordingly. To streamline our development we will thus always consider feature maps to be vectors. We describe below the basic operators that form any current DN layer and review how DN training is done, i.e., how the per-layer weights are tuned in order to produce a desired DN. For a complete survey we refer the reader to Goodfellow et al. [2016].
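As a minimal sketch of the composition in (1.1), the snippet below chains a few per-layer functions; the use of dense ReLU layers and the widths are illustrative assumptions of our own, not a prescription from the thesis.

import numpy as np

def dense_relu_layer(W, b):
    # Return F^(l): a dense affine map followed by an elementwise ReLU.
    return lambda z: np.maximum(W @ z + b, 0.0)

# Illustrative widths D^(0) = 4, D^(1) = 8, D^(2) = 3 with random parameters.
rng = np.random.default_rng(0)
layers = [
    dense_relu_layer(rng.normal(size=(8, 4)), rng.normal(size=8)),
    dense_relu_layer(rng.normal(size=(3, 8)), rng.normal(size=3)),
]

def network(x):
    # F(x) = (F^(L) o ... o F^(1))(x): feed the feature map through each layer in turn.
    z = x                      # z_x^(0) = x
    for layer in layers:       # z_x^(l) = F^(l)(z_x^(l-1))
        z = layer(z)
    return z

print(network(rng.normal(size=4)))   # output feature map z_x^(L)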
1.2.1 Layers
A DN layer, as employed in (1.1) to form the final prediction, is itself internally composed of a few simple operators. Different types of layers can be obtained by combining those simple operators adequately; in turn, different types of layers and layer organizations will produce different types of DNs. To remain as general as possible, we thus propose to first review the main operators used in today's layers.
Dense operator. A dense operator, oftentimes referred to as a fully-connected operator, performs an affine transformation of a given input x as in

Wx + b,

where x is the considered input, W is a dense/full matrix, and b is a bias vector. This operator, combined with the activation operator (described below), forms the layers employed in the first generation of DNs: multilayer perceptrons [Rosenblatt, 1961].
Current DNs often employ the dense operator within their last layers only and prefer more constrained operators such as the convolution operator for the first layers.
Convolution operator. A convolution operator transforms its input via
Cx + b,

where a special structure is defined on the matrix C so that it performs multi-channel convolutions on the input x. Similarly, the bias vector b is often constrained to
have the same entries for different dimensions. Special cases include the use of 1 × 1
convolutional filters [Kingma and Dhariwal, 2018], in which case C is made of multiple
blocks, each being a diagonal matrix. Convolution operators are at the origin of
the performance gains observed in computer vision tasks starting with the LeNet
architecture [LeCun et al., 1989]. Even most recent state-of-the-art DNs employ at
some point convolution operators combined with an activation or a pooling operator.
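To illustrate the special structure imposed on C, the sketch below (our own toy construction, assuming a single-channel 1-D signal, a length-3 filter, and "same" zero padding; the helper name conv_matrix is hypothetical) builds the structured matrix whose product with x reproduces the sliding-window convolution.

import numpy as np

def conv_matrix(h, D):
    # Toeplitz-like matrix C such that C @ x equals the "same"-padded
    # 1-D sliding-window convolution (correlation) of x with the filter h.
    K = len(h)
    C = np.zeros((D, D))
    for i in range(D):                 # output position
        for k in range(K):             # filter tap
            j = i + k - (K - 1) // 2   # input position under zero padding
            if 0 <= j < D:
                C[i, j] = h[k]
    return C

rng = np.random.default_rng(0)
h, x = rng.normal(size=3), rng.normal(size=10)
C = conv_matrix(h, len(x))
direct = np.array([sum(h[k] * (x[i + k - 1] if 0 <= i + k - 1 < len(x) else 0.0)
                       for k in range(3)) for i in range(len(x))])
assert np.allclose(C @ x, direct)      # the structured matrix reproduces the sliding window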
Pooling operator. A pooling operator is a sub-sampling operation applied on an input according to a sub-sampling policy ρ and, for each output dimension, a collection of input dimensions that ρ must consider in order to produce that output dimension. Formally, for each output dimension k = 1, . . . , K, we denote this collection of dimensions as R_k ⊂ {1, . . . , D}, where D is the dimension of the input and R > 1 is the number of input dimensions in R_k to which ρ is applied. In our case we assume the same R for each R_k, but generalizing this is straightforward. We thus obtain the following input-output mapping for the pooling operator:

(ρ([x]_{[R_1]_1}, . . . , [x]_{[R_1]_R}), ρ([x]_{[R_2]_1}, . . . , [x]_{[R_2]_R}), . . . , ρ([x]_{[R_K]_1}, . . . , [x]_{[R_K]_R}))^T.

We consider here that the pooling operator reduces the input dimensionality (R > 1). Often the pooling operator ρ is the max operator, as was originally proposed; however, the average also produces successful layers that have been used in many DNs. More complex functions include softmax pooling [Murray and Perronnin, 2014]. When the indices R_k include all the input dimensions, the pooling operator is referred to as global. As opposed to the dense and convolution operators, the pooling operator is most often nonlinear. Another popular nonlinear operator is the activation operator, to which we now turn.
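The following sketch (our own illustration, using non-overlapping windows of size R = 2 as the index sets R_k) implements this pooling operator for a generic policy ρ.

import numpy as np

def pooling(x, index_sets, rho=np.max):
    # Apply the pooling policy rho over each collection of input dimensions R_k;
    # returns the K-dimensional pooled output.
    return np.array([rho(x[list(Rk)]) for Rk in index_sets])

# Non-overlapping windows of size R = 2 over a D = 8 dimensional input.
x = np.array([3., 1., 4., 1., 5., 9., 2., 6.])
index_sets = [(0, 1), (2, 3), (4, 5), (6, 7)]
print(pooling(x, index_sets, rho=np.max))   # max-pooling     -> [3. 4. 9. 6.]
print(pooling(x, index_sets, rho=np.mean))  # average-pooling -> [2.  2.5 7.  4.]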
Activation operator. The activation operator applies a scalar nonlinearity σ to each dimension of its input as in

(σ([x]_1), σ([x]_2), . . . , σ([x]_D))^T,

which we will abbreviate as simply σ(x), where σ should be understood as being applied elementwise. The first most popular activation functions were the sigmoid σ(u) = 1/(1 + exp(−u)) and the hyperbolic tangent σ(u) = (exp(u) − exp(−u))/(exp(u) + exp(−u)) [Rosenblatt, 1961, Hornik et al., 1989]. While not coined with that name, the ReLU activation σ(u) = max(0, u) emerged from hinging hyperplanes [Breiman, 1993], which can be seen as a layer with a dense operator and an activation operator (ReLU). The official introduction of the ReLU in DNs was done in Glorot et al. [2011]. Variants include the leaky-ReLU σ(u) = max(ηu, u) with small η > 0 [Maas et al., 2013], the absolute value, and the exponential linear unit, σ(u) = u for u ≥ 0 and exp(u) − 1 otherwise [Shah et al., 2016].
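A minimal sketch of these activations applied elementwise (our own illustration; the leaky slope η = 0.01 is an assumed value). Note that the piecewise affine ones (ReLU, leaky-ReLU, absolute value) can each be written as a maximum of two affine functions, the property exploited throughout this thesis.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def relu(u):
    return np.maximum(0.0, u)            # max(0*u + 0, 1*u + 0)

def leaky_relu(u, eta=0.01):
    return np.maximum(eta * u, u)        # max(eta*u, u)

def absolute(u):
    return np.maximum(-u, u)             # max(-u, u)

x = np.array([-2.0, -0.5, 0.0, 1.5])
for sigma in (sigmoid, relu, leaky_relu, absolute):
    print(sigma.__name__, sigma(x))      # sigma acts elementwise on the feature map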
With the few operators described above it is already possible to form most of the current layers employed in DNs. In order to formally define what a layer can include and ensure that the decomposition (1.1) is unique for any known DN, we propose the following definition.
Definition 1.1 (Layer) A layer F^(ℓ) is made of a single nonlinear operator and all the preceding linear operators (if any).
Some popular layers are the convolutional layer that comprises a convolution op-
erator and an activation operator, and the maxout layer, which is formed by a convolutional
or dense operator and a max-pooling operator. Additionally, any layer can be turned
into a residual layer [He et al., 2015b] by adding a linear connection between the layer
input and its output. For example a residual convolutional layer is
σ(Cx + b) + W_res x + b_res,

where W_res is often taken to be the identity matrix if the layer output dimension is the same as the input, and b_res is often set to be zero.
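A minimal sketch of such a residual convolutional layer (our own illustration, with σ taken as the ReLU, a random matrix standing in for the structured convolution matrix C, and the common defaults W_res = identity and b_res = 0 mentioned above):

import numpy as np

def residual_conv_layer(C, b, W_res=None, b_res=None):
    # sigma(Cx + b) + W_res x + b_res with sigma = ReLU.
    # Defaults: W_res = identity, b_res = 0 (the common residual choice).
    def layer(x):
        W = np.eye(len(x)) if W_res is None else W_res
        r = np.zeros(len(x)) if b_res is None else b_res
        return np.maximum(C @ x + b, 0.0) + W @ x + r
    return layer

rng = np.random.default_rng(0)
D = 6
C = rng.normal(size=(D, D))          # stands in for a structured convolution matrix
layer = residual_conv_layer(C, rng.normal(size=D))
print(layer(rng.normal(size=D)))     # residual layer output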
1.2.2 Training
In order to optimize the parameters that govern each layer of the DN one needs a
dataset, a loss function to be minimized that is preferably differentiable such as the
mean squared error [Wang and Bovik, 2009] or the cross-entropy [Kleinbaum et al.,
2002], and a parameter update policy/rule such as some flavor of gradient descent
[Bottou, 2010].
A dataset is a collection of observations, which can be a set of inputs (unsupervised),
a set of input-output pairs (supervised), or a mix of both (semi-supervised). For
concreteness let’s consider the supervised case here and let’s denote the dataset as
D = {(xn, yn), n = 1,...,N}. Commonly, this dataset is partitioned into three: a
training set Dtrain, a validation set Dval, and a testing set Dtest such that there is no overlap between them and their union gives back D, the entire dataset. The DN
parameters are updated based on Dtrain, the DN hyper-parameters are chosen based on Dval, and finally the out-of-sample performance (also referred to as generalization performance) is estimated based on Dtest.
Based on the training set Dtrain, the chosen loss function, and the weight update policy, one tunes the DN parameters to minimize the loss on this set of observations.
Commonly this is done with flavors of gradient descent such as Nesterov momentum
[Nesterov], Adadelta [Zeiler, 2012], Adam [Kingma and Ba, 2014], or any variant of
those. In fact, all of the operations introduced above for standard DNs are differen-
tiable almost everywhere with respect to their parameters and inputs. As the training
set size (|Dtrain|) is often large, each parameter update is computed after only feeding a
mini-batch of B samples drawn from Dtrain, with cardinality much smaller than the number of training samples (B ≪ |Dtrain|). Mini-batch training, in addition to reducing the amount of computation required to perform a step of parameter update, also
provides many benefits from a generalization perspective [Keskar et al., 2016, Masters
and Luschi, 2018]. For each mini-batch, the parameter updates are computed for all
the network parameters by backpropagation [Hecht-Nielsen, 1992], which follows from
applying the chain rule of calculus. Once all the samples in the training set have been
observed (after |Dtrain|/B mini-batches are sampled without replacement), one epoch is completed. Whenever B = 1, the above is denoted as stochastic gradient descent.
Usually a network needs hundreds of epochs to converge. Hyperparameters such as the learning rate and early-stopping are tuned based on the performance on Dval. Once training is completed and the best hyper-parameters have been selected, an estimate of the DN performance on new, unseen data is obtained on Dtest.
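The sketch below (our own schematic of the procedure just described, with an assumed toy model and loss; in practice the gradient would be obtained by backpropagation through the full DN) summarizes mini-batch gradient descent over epochs.

import numpy as np

def train(params, grad_fn, data, batch_size=32, epochs=10, lr=1e-2, seed=0):
    # Generic mini-batch gradient descent.
    # grad_fn(params, x_batch, y_batch) returns the loss gradient w.r.t. params
    # (for a DN this is computed by backpropagation).
    rng = np.random.default_rng(seed)
    x, y = data
    for _ in range(epochs):
        order = rng.permutation(len(x))              # one epoch: every sample visited once
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]    # mini-batch sampled without replacement
            params = params - lr * grad_fn(params, x[idx], y[idx])
    return params

# Toy usage: fit a 1-D linear model y = w*x with the mean squared error.
rng = np.random.default_rng(1)
x = rng.normal(size=256)
y = 3.0 * x
grad = lambda w, xb, yb: np.mean(2.0 * (w * xb - yb) * xb)   # d(MSE)/dw on the mini-batch
print(train(np.float64(0.0), grad, (x, y), lr=0.1))           # converges near 3.0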
1.2.3 Approximation Results
The ability of certain DNs to approximate an arbitrary functional/operator mapping has been well established [Cybenko, 1989, Breiman, 1993]. For completeness, we recall those results that have been pivotal in the theoretical analysis of DNs.
Theorem 1.1 (Cybenko [1989])
Let σ be any bounded, measurable sigmoidal function. Then, given any (target function) F ∈ C^0([0, 1]^D) and any ε > 0, there exists a shallow network

F_Θ(x) = ∑_{k=1}^{K} [W^(2)]_{k,1} σ(⟨[W^(1)]_{k,·}, x⟩ + [b^(1)]_k)

such that

|F(x) − F_Θ(x)| < ε, for all x ∈ [0, 1]^D.
In the above result, a sigmoidal function is defined as a function σ that must fulfill lim_{t→−∞} σ(t) = 0 and lim_{t→∞} σ(t) = 1, without any monotonicity constraint. This result has been generalized to the case of employing a continuous, bounded and nonconstant activation function σ in Hornik [1991], and to Radial Basis Functions in Park and Sandberg [1991]. Those results consider the case of fixed depth and increasing width. The dual of this, considering fixed width and increasing depth, has also led to universal approximation results.
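As a small numerical illustration of the shallow-network form in Theorem 1.1 (our own toy experiment: the width K, the random hidden parameters, and the least-squares fit of the output weights are assumptions made for illustration, not the constructive argument of the theorem), a random-feature sigmoid network closely matches a smooth target on [0, 1].

import numpy as np

sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

def shallow_net(x, W1, b1, w2):
    # F_Theta(x) = sum_k w2[k] * sigmoid(W1[k] * x + b1[k]) for scalar inputs x.
    return sigmoid(np.outer(x, W1) + b1) @ w2

rng = np.random.default_rng(0)
K = 50                                        # hidden width
W1, b1 = rng.normal(size=K) * 10, rng.normal(size=K) * 10
x = np.linspace(0, 1, 200)
target = np.sin(2 * np.pi * x)                # smooth target function on [0, 1]

# Fit only the output weights by least squares (random-feature style fit).
H = sigmoid(np.outer(x, W1) + b1)
w2, *_ = np.linalg.lstsq(H, target, rcond=None)
print(np.max(np.abs(shallow_net(x, W1, b1, w2) - target)))   # small sup-error on the grid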
Theorem 1.2 (Lu et al. [2017])
For any Lebesgue-integrable (target function) F : R^D → R and any ε > 0, there exists a fully-connected ReLU network F_Θ with width ≤ D + 4 such that

∫_{R^D} |F(x) − F_Θ(x)| dx < ε.
The same result has been obtained recently for CNNs in Zhou [2020], for Residual Networks in Tabuada and Gharesifard [2020], and for recurrent networks in Doya [1993] and Schäfer and Zimmermann [2006]. In the specific case of univariate DNs with continuous piecewise affine activation functions, Daubechies et al. [2019] demonstrated how the special structure that can be reached through depth allowed DNs to not only approximate a target function arbitrarily closely as long as the number
All the variants of the universal approximation theorem guarantee that one can approximate any reasonable target function with the correct choice of architecture.
However, those results do not prescribe how to obtain such function in practice i.e. how to learn the DN underlying parameters in a more principled way without resorting to gradient-based optimization. Learning the DN parameters in a way to maximize generalization performances (or any desired metric) is one of the fundamental question that remains open. We now propose to introduce thoroughly the rich and powerful class of function that are splines in order to build our novel results in the following chapters.
1.3 Related Works
Ongoing directions to build a rigorous mathematical framework allowing one to derive theoretical results and/or insights into DNs fall roughly into six camps; five of them have little to do with spline functions but are presented for completeness and to give an overview of alternative directions. The last direction deals with splines and is the most relevant to our study.
1.3.1 Mathematical Formulations of Deep Networks
Statistical correlation based visualizations. The first successful attempt at providing human interpretable understanding of DNs' inner workings arose from two techniques: activity maximization [Cadena et al., 2018] and saliency maps [Simonyan et al., 2014, Zeiler and Fergus, 2014, Li and Yu, 2015, Kim et al., 2019]. In the former case, one optimizes a (randomly) initialized input living in the data space such that this input produces some target latent representation in the DN. This can take different forms, such as considering a specific unit in the DN or adding additional constraints. In the latter case, one feeds an input to a DN and leverages the gradient of the DN (a specific unit at a specific layer) with respect to the DN input. The resulting saliency map depicts how sensitive a specific unit of the DN is to each input dimension when considering a specific input. Those solutions provide visuals in the data space that highly correlate with the firing of specific DN units. Such techniques can be coupled with segmentation networks and ground truth image segmentation labels to obtain an actual 'label' or 'concept' that the DN's units correlate highly with. This has been explored in classifier DNs [Bau et al., 2017] and generative DNs [Bau et al.,
2020]. In order to provide extensive guarantees into the interpretations done from those visuals, statistical tests and results have been developed [Lin and Lin, 2014,
Adebayo et al., 2018]. This development has pioneered our current understanding of the underlying knowledge being learned by DNs.
Optimization and approximation theory. The theoretical understanding of
DNs, their approximation power, as well as their generalization capacity, is one of the most fundamental questions and has been studied for decades. For example, Cybenko [1989] and Breiman [1993] studied the approximation capacity of shallow networks with sigmoid and ReLU activation functions, respectively. Through specific considerations of architectures, finer and finer results have been obtained [Arora et al., 2013, Cohen et al., 2016, Parhi and Nowak, 2021], with results reaching beyond pure approximation capacity and characterizing the loss surface geometry [Lu and Kawaguchi, 2017, Soudry and Hoffer, 2017, Nguyen and Hein, 2017] or the VC-dimension [Harvey et al., 2017] of DNs. Tremendous insights have been gained from those results. For example, it was shown in Daubechies et al. [2019] how residual connections on a carefully designed univariate DN allow for a faster error convergence rate.
Architecture-constrained models. Another line of research considers carefully designed and constrained DN architectures in order to enforce mathematical properties and to allow for theoretical analysis. The first successful attempt at such a model that managed to provide competitive performances is the Extreme Learning Machine
(ELM) [Huang et al., 2011, Tang et al., 2015] consisting of DNs with random layer weights for all but the last layer. Due to the absence of training of the layer weights, it was possible to gain insights into the DN decision process and ELMs opened the door to the integration of continuous constraints like positivity, monotonicity, or bounded curvature in the learned function [Neumann et al., 2013]. The Scattering Networks
[Mallat, 2012, Bruna and Mallat, 2013] is another carefully designed DN, consisting of a succession of wavelet transforms and (complex) modulus. As opposed to standard DNs, the features of this network are obtained by collecting the average of the per-layer feature maps. Variants include performing local averaging, employing 1-dimensional or 2-dimensional wavelet transforms depending on the nature of the data [Andén et al., 2019], and changing the employed wavelet filter-banks [Lostanlen and Andén, 2016]. Thanks to such parametrization, it is possible to leverage tools from signal processing [Mallat, 1999] and group theory [Mallat, 2016] to study and interpret the scattering features and thus understand the benefits and mathematical properties of this architecture. The interpretability of this network, in conjunction with the absence of learning, enabled its application in fields such as quantum chemistry [Hirn et al., 2017] or aerial scene understanding [Nadella et al., 2016]. Lastly, some methods have been developed in which only a specific part of the DN is tweaked
2016] where the per-layer parameters are constrained to have a group structure and the capsule network [Sabour et al., 2017] that leverages hard coded ‘computer vision’ rules in the forming of its prediction.
Probabilistic generative models. Probabilistic Graphical Models (PGMs) [Koller and Friedman, 2009] are one class of methods in machine learning that has always enjoyed ease of interpretation and data modeling. The main reasons behind those benefits are (i) the ability to specify an explicit generative model that governs the data at hand, thus allowing easy integration of a priori knowledge and understanding of the underlying data modeling [Bhattacharya and Cheng, 2015], (ii) the explicit analytical forms to train the model parameters and infer the missing variables in the model [Jordan, 1998], and (iii) the versatility of such models, which can be used to detect outliers, to denoise, and to classify/cluster [Bilmes and Zweig, 2002]. The most successful PGMs include the Gaussian Mixture Model [Xu and Jordan, 1996], the Hidden Markov Model [Rabiner and Juang, 1986] or the Factor Analysis Model [Akaike,
1987]. Motivated by those successes, many recent studies [Yuksel et al., 2012, Patel et al., 2016, Kim and Bengio, 2016, Nie et al., 2018] have modeled the underlying
DN mechanisms as a PGM in order to port all the above benefits to deep learning.
Those approaches have successfully provided principled solutions to perform semi-supervised and unsupervised tasks as well as pushing our understanding of DN inner workings from a generative perspective. One limitation that remains to be tackled comes from the now intractable learning solutions. In fact, due to the increase in the model complexity that those methods need to employ in order to mimic DNs as
best as possible, the closed-form solutions for training and inference no longer exist,
limiting the explainability provided by those methods.
Infinite-width limit. Another recent development leverages the overparametriza-
tion regime of modern architectures. In this specific setting, overparametrization is to
be understood as a growth (in the limit, to infinity) of the layers' width. This has re-
sulted in DNs preserving generalization performance [Novak et al., 2018, Neyshabur
et al., 2019, Belkin et al., 2019] while providing additional convergence guarantees for
the optimization problem [Du et al., 2019a, Allen-Zhu et al., 2019a,b, Zou et al., 2018,
Arora et al., 2019b]. Going into the infinite width limit, it became possible to obtain
analytical models of such impractical DNs. The most recent infinite width study
showed that the training dynamics of (infinite-width) DNs under gradient flow are
captured by a constant kernel called the Neural Tangent Kernel (NTK) that evolves
according to an ordinary differential equation (ODE) [Jacot et al., 2018, Lee et al.,
2019a, Arora et al., 2019a]. Every DN architecture and parameter initialization pro-
duces a distinct analytical NTK. The original NTK was derived from the Multilayer
Perceptron [Jacot et al., 2018] and was soon followed by kernels derived from CNNs
[Arora et al., 2019a, Yang, 2019], Residual Networks [Huang et al., 2020], and Graph
CNNs (GNTK) [Du et al., 2019b]. In [Yang, 2020], a general strategy to obtain the
NTK of any architecture is provided. Due to the analytical form of the NTK this
development has led to an entirely new active research area.
Continuous piecewise affine operators. The last research direction, and the one that relates the most to this thesis, concerns the use of spline functions and operators to study DNs. While not in a deep learning setting, the first study of the approximation capacity of shallow ReLU networks was performed in Breiman [1993] and offered an alternative to the approximation results of that time based on sigmoid activation functions. More recently, and mainly due to the popularity of novel activation functions such as leaky-ReLU or absolute value, the use of spline function theory to study DNs has grown exponentially. The first brick was laid by Montufar et al.
[2014], where the Continuous Piecewise Affine (CPA) structure of DNs employing such nonlinearities was highlighted. Along with this result, an upper bound on the number of regions of the DN input space partition was derived. Later works continued to refine the bounds on the number of regions [Serra et al., 2018, Hanin and
Rolnick, 2019, Serra and Ramalingam, 2020]. Meanwhile, the rich theory of splines, which has been extensively refined in signal processing and function approximation, has allowed many results to be ported, such as formulating current DNs from a functional optimization problem [Unser, 2018], studying the piecewise convexity of the DN in the context of optimization [Rister and Rubin, 2017], and obtaining sharp approximation results [Daubechies et al., 2019]. To date, theoretical studies relying on splines have focused on either considering specific topologies or providing theoretical guarantees and bounds on specific properties, such as a DN's approximation capacity for univariate inputs/outputs or the number of regions in the DN's input space partition.
1.3.2 Training of Deep Generative Networks.
Deep Generative Networks (DGNs), which map a low-dimensional latent variable z to a higher-dimensional generated sample x, are the state-of-the-art methods for a range of machine learning applications, including anomaly detection, data generation, likelihood estimation, and exploratory analysis across a wide variety of datasets [Blaauw and Bonada, 2016, Inoue et al., 2018, Liu et al., 2018, Lim et al., 2018]. While we propose a thorough geometrical study of DGNs in all generality in Chap. 4, we now go a step further and exploit the composition-of-MASOs formulation to provide a novel training solution. Training of DGNs roughly falls into two camps: (i) leveraging an adversarial network, as in a Generative Adversarial Network (GAN) [Goodfellow et al., 2014], to turn the method into an adversarial game; and (ii) modeling the latent and observed variables as random variables and performing some flavor of likelihood maximization training. A widely used solution to likelihood-based DGN training is via a Variational Autoencoder (VAE) [Kingma and Welling, 2013]. The popularity of the VAE is due to its intuitive and interpretable loss function, which is obtained from likelihood estimation, and its ability to exploit standard estimation techniques ported from the probabilistic graphical models literature. Yet, VAEs offer only an approximate solution for likelihood-based training of DGNs. In fact, all current VAEs employ three major approximation steps in the likelihood maximization process. First, the true (unknown) posterior is approximated by a variational distribution. This estimate is governed by some free parameters that must be optimized to fit the variational distribution to the true posterior. VAEs estimate such parameters by means of an auxiliary network, the encoder, with the datum as input and the predicted optimal parameters as output. This step is referred to as Amortized
Variational Inference (AVI), as it replaces the explicit, per-datum optimization with a single deep network (DN) pass. Second, as in any latent variable model, the complete likelihood is estimated by a lower bound (the ELBO) obtained from the expectation of the likelihood taken under the posterior or variational distribution. With a DGN, this expectation is unknown, and thus VAEs estimate the ELBO by Monte-Carlo
(MC) sampling. Third, the maximization of the MC-estimated ELBO, which drives the parameters of the DGN to better model the data distribution and the encoder to produce better variational parameter estimates, is performed by some flavor of gradient descent (GD). These VAE approximation steps enable rapid training and test-time inference of DGNs. However, due to the lack of analytical forms for the posterior, the ELBO, and explicit (gradient-free) parameter updates, it is not possible to measure the above steps' quality or effectively improve them. Since the true posterior and expectation are unknown, current VAE research roughly falls into three camps:
(i) developing new and more complex output and latent distributions [Nalisnick and
Smyth, 2016, Li and She, 2017], such as the truncated distribution; (ii) improving the various estimation steps by introducing complex MC sampling with importance re-weighted sampling [Burda et al., 2015]; (iii) providing different estimates of the posterior with moment matching techniques [Dieng and Paisley, 2019, Huang et al.,
2019]. More recently, Park et al. [2019] exploited the special continuous piecewise affine structure of current ReLU DGNs to develop an approximation of the posterior distribution based on mode estimation and DGN linearization leading to Laplacian
VAEs.
1.3.3 Batch-Normalization Understandings.
Nowadays, the empirical benefits of BN are ubiquitous, with more than 12,000 citations to the original BN article and a unanimous community employing BN to accelerate training by helping the optimization procedure and to increase generalization performance [He et al., 2016b, Zagoruyko and Komodakis, 2016, Szegedy et al., 2016, Zhang et al., 2018c, Huang et al., 2018a, Liu et al., 2017b, Ye et al.,
2018, Jin et al., 2019, Bender et al., 2018]. Despite its prevalence in today's DN architectures, the understanding of the unseen forces that BN applies on DNs remains elusive; and for many, understanding why BN so drastically improves DN performance remains one of the key open problems in the theory of deep learning [Richard et al., 2018]. One of the first practical arguments in favor of feature map normalization emerged in Cun et al. [1998] as a "good practice" to stabilize training. By studying how the backpropagation algorithm updates the layer weights, it was observed that, unless the feature maps are normalized, those updates would be constrained to live on a low-dimensional subspace, limiting the learning capacity of gradient-based algorithms. By explicitly reparametrizing the affine transformation weights and slightly altering the renormalization process of BN, weight renormalization [Salimans and Kingma, 2016] showed how the σ(`) renormalization smooths the optimization landscape of DNs. Similarly, Bjorck et al. [2018], Santurkar et al.
[2018], Kohler et al. [2019] further studied the impact of BN on the gradient distributions and the optimization landscape by designing careful and large-scale experiments.
By providing a smoother optimization landscape, BN "simplifies" the stochastic optimization procedure and thus accelerates training convergence and improves generalization.
In parallel to this optimization analysis of BN in standard DN architectures, Yang et al. [2019b] developed a mean-field theory for fully-connected feed-forward neural networks with random weights in which BN is analytically studied. In doing so, they were able to characterize the gradient statistics in such DNs and to study the signal propagation stability depending on the weight initialization, concluding that BN stabilizes gradients and thus training.
1.3.4 Deep Network Pruning.
With a tremendously increasing need for practical DN deployments, one line of research aims to produce a simpler, energy-efficient DN by pruning a dense one, e.g., removing some layers/nodes/weights (or any combination of these options) from a
DN architecture, leading to a much reduced computational cost. Recent progress
[You et al., 2020, Molchanov et al., 2016] in this direction allows one to obtain much more energy-friendly models while nearly maintaining the models' task accuracy [Li et al., 2020]. Throughout this chapter, we will often abuse notation and refer to an unpruned DN as "dense" or "complete". While tremendous empirical progress has been made regarding DN pruning, there remains a lack of theoretical understanding of its impact on a DN's decision boundary as well as a lack of theoretical tools for deriving pruning techniques in a principled way. Such understanding is crucial for one to study the possible failure modes of pruning techniques, to better decide which to use based on a given application, or to design pruning techniques possibly guided by some a priori knowledge about the given task and data. The common pruning scheme adopts a three-step routine: (i) training a large model with more parameters/units than the desired final DN, (ii) pruning this overly large trained
DN, and (iii) fine-tuning the pruned model to adjust the remaining parameters and restore as best as possible the performance lost during the pruning step. The last two steps can be iterated to get a highly sparse network [Han et al., 2015]. Within this routine, different pruning methods can be employed, each with a specific pruning criterion, granularity, and scheduling [Liu et al., 2019b, Blalock et al., 2020]. Those techniques roughly fall into two categories: unstructured pruning [Han et al., 2015,
Frankle and Carbin, 2019, Evci et al., 2019] and structured pruning [He et al., 2018,
Liu et al., 2017b, Chin et al., 2020a]. Regardless of the pruning method, the trade-off lies between the amount of pruning performed on a model and the final accuracy.
For various energy-efficient applications, novel pruning techniques have been able to push this trade-off favorably. The most recent theoretical works on DN pruning rely on studying the existence of Winning Tickets. Frankle and Carbin [2019] first hypothesized the existence of sub-networks (pruned DNs), called winning tickets,
that can produce performance comparable to their unpruned counterparts. Later,
You et al. [2020] showed that those winning tickets could be identified in the early
training stage of the unpruned model. Such sub-networks are denoted as early-bird
(EB) tickets.
1.4 Contributions
There are many fundamental questions that need to be addressed in deep learning.
In this thesis, we propose to focus specifically on three of them that would
not only help in bringing novel understanding of the underlying computational
intricacies of Deep Networks, but would also produce better-performing models:
Question 1: Can we train a deep network to learn a probability distribution in a high-dimensional space from training data?
Question 2: How can we lower the power consumption of implementing a deep network (both learning and inference) given an architecture and training dataset?
Question 3: How can we explain the ability of a technique such as Batch-Normalization to considerably boost Deep Network performance regardless of the architecture, task, and data at hand?
Answering the above three fundamental questions would push the barriers of current techniques in
Deep Learning across applications ranging from manifold learning and density estimation (Q1) to providing interpretability and explainability for an everyday Deep
Network technique, namely Batch-Normalization (Q3), while also allowing the principled design of novel methods guided by theoretical understanding, as we will do for pruning
(Q2). Needless to say, in order to answer those three fundamental questions, we will need to provide a novel mathematical formulation of Deep Networks. The first part of this thesis thus proposes a novel formulation of DNs via a very special type
of splines: max-affine splines, which we review in Chap. 2. This formulation, which
consists of a reformulation of DNs based on those splines, opens the door to theoret-
ical study of CPA DNs with high-dimensional input spaces and allows us to leverage
results from combinatorial and computational geometry to further enrich our understand-
ings (Chap. 3). Equipped with this formulation and novel understandings, we will be
able to answer the three questions posed above, in addition to developing many
visualization tools with the following organization:
Answer 1: deriving an Expectation-Maximization training for Deep Generative Net-
works (Chap. 5): In this chapter, we advance both the theory and practice of DGNs
and VAEs by computing the exact analytical posterior and marginal distributions of
any DGN employing continuous piecewise affine (CPA) nonlinearities. The knowl-
edge of these distributions enables us to perform exact inference without resorting to
AVI or MC-sampling and to train the DGN in a gradient-free manner with guaranteed
convergence.
Answer 2: designing a novel and theoretically grounded state-of-the-art Deep Network pruning strategy (Chap. 6): In this chapter we turn our focus to a recent technique: Deep Network pruning. As we will see, pruning, which consists of removing some weights and/or units of a DN, can be studied thoroughly from a geometric point of view thanks to the knowledge of the DN input space partition and its ties with the DN input-output mapping. After providing many practical insights into pruning, we will propose from those understandings a novel strategy that is able to compete with alternative state-of-the-art methods.
Answer 3: interpreting and theoretically studying, from a spline point of view, one
of the most important Deep Learning techniques: Batch-Normalization (Chap. 7): We
will demonstrate in this chapter how BN, by proposing a specific layer input-output mapping parametrization, provides an unsupervised learning technique that interacts with the (un)supervised learning algorithm used to train a DN in order to focus the attention of the network onto the data points.
In addition to the above core contributions, we will also consider exploiting the affine spline formulation of deep networks to study and interpret Deep Generative Networks in all generality in Chap. 4, and finally we will conclude by demonstrating in Chap. 8 how to extend all the above results to smooth Deep Networks, thus effectively porting all our results beyond the affine spline world.
Chapter 2
Max-Affine Splines for Convex Function Approximation
Function approximation is the general task of utilizing an approximant function fˆ to
‘mimic’ as best as possible a target (possibly unknown) function f. This task can take many forms, ranging from fitting fˆ based on samples generated from f, as often done in machine learning, to imposing physical constraints on fˆ that are known to govern f, as often done in partial differential equation approximations, or possibly a mix of both approaches. Solving this task accurately has tremendous applications as the approximant fˆ can then be deployed, for example, to provide autonomous controllers as used in aircraft and uninhabited air vehicles [Farrell et al., 2005, Xu et al., 2014], to perform weather prediction [Richardson, 2007, Brown et al., 2012, Bauer et al.,
2015], to accelerate drug discovery [King et al., 1992, Lima et al., 2016, Zhang et al.,
2017a, Ong et al., 2020], or to better identify and prevent suicide attempts [Walsh et al., 2017, Torous et al., 2018]. While the topic of function approximation is vast
(we refer the reader to Powell [1981], DeVore [1998] for an overview), for our study, we focus on a specific class of approximants: spline functions. A clear understanding of those functions and their notations will be crucial for the remainder of the thesis as we aim to employ the rich theory of splines to study Deep Networks.
2.1 Spline Functions
Spline functions [Schoenberg, 1973] are powerful practical function approximators that have been thoroughly studied theoretically in terms of their approximation capacity along with various properties [De Boor and Rice, 1968, Unser et al., 1993,
Schumaker, 2007].
Splines: constrained piecewise polynomials. Consider a partition of a domain
X into a finite set of regions Ω = {ω1, . . . , ωR}. In our study we will focus on partitions of a continuous domain X ⊂ RD,D ≥ 1.
Definition 2.1 (Partition) A partition Ω of a domain X is a finite collection of regions
$\Omega = \{\omega_1, \dots, \omega_R\}$ such that their union recovers the domain, $\cup_{r=1}^{R}\omega_r = \mathcal{X}$, and the
intersection of the interiors of any two different regions is empty, $\mathring{\omega}_i \cap \mathring{\omega}_j = \emptyset\ \forall i \neq j$, where $\mathring{\cdot}$ is the interior operator [Halmos, 2013].
Let's consider $R$ piecewise polynomial mappings of degree $k$ that we denote as $\phi_r^k$ for $r = 1, \dots, R$. For now we consider univariate polynomials, i.e., $D = 1$; hence, each one of those mappings transforms an input $x$ to an output via
$$\phi_r^k(x; a_{r,:}) := \sum_{p=0}^{k} x^p a_{r,p}, \quad x \in \mathbb{R} \qquad (2.1)$$
where $a_{r,p} \in \mathbb{R}$ is the $p^{\text{th}}$-degree polynomial coefficient ($p \in \{0, \dots, k\}$) for the $r^{\text{th}}$ polynomial ($r \in \{1, \dots, R\}$). We denote by $a_{r,:}$ the vector of $k+1$ parameters $(a_{r,0}, \dots, a_{r,k})^T \in \mathbb{R}^{k+1}$.
Definition 2.2 (Piecewise polynomial function) The mapping defined as
$$P(x; a_{:,:}) = \sum_{r=1}^{|\Omega|} \phi_r^k(x; a_{r,:})\, \mathbb{1}_{\{x \in \omega_r\}}, \qquad (2.2)$$
is known as an order-$k$ piecewise polynomial function, where $\phi_r^k$ is from (2.1) and $\Omega$ is a partition of the considered domain.
An order-k spline function on a domain X is obtained by constraining an order-k
piecewise polynomial function P defined on a partition Ω of X to have continuous
derivatives of orders $0, \dots, k-1$, i.e., $P \in C^{k-1}(\mathcal{X})$. We recall that the $0$-order
derivative is the function itself. In order to gain insights into the constraint imposed
on piecewise polynomials to obtain a spline, we first need to formally define two
regions $\omega_i, \omega_j$ as adjacent iff $\partial\omega_i \cap \partial\omega_j \neq \emptyset$, with $\partial$ the boundary operator. Now, since $P$ is a piecewise polynomial, it is clear that the restriction of $P$ onto any region
$\omega \in \Omega$, that we denote by $P|_\omega$, fulfills $P|_\omega \in C^{\infty}(\omega)$. As a result, the constraint
$P \in C^{k-1}(\mathcal{X})$ can be seen as enforcing the piecewise polynomial mappings $\phi_r^k, \phi_{r'}^k$ for
any adjacent regions $\omega_r, \omega_{r'}$ to have the same derivatives of orders $0, \dots, k-1$ at the intersection of their regions' boundaries.
Definition 2.3 (Spline function) Given a partition $\Omega = \{\omega_1, \dots, \omega_R\}$ of some domain $\mathcal{X}$, a spline function of order $k$ is an order-$k$ piecewise polynomial $P$ (recall Def. 2.2)
on $\Omega$ such that $P \in C^{k-1}(\mathcal{X})$.
As a result, in the special case k = 2 (quadratic polynomials), the mapping P will be
a spline function iff $P$ and $P'$ (zero-order and first-order derivatives) are continuous.
In the case k = 1 (affine polynomials), which will be the main setting of this thesis, P
must be a piecewise polynomial of degree 1 in each region of the partition, and must
be continuous on the entire domain. For a thorough study of piecewise polynomials
and splines, we refer the reader to Schumaker [2007]. Generalizing the above spline
construction for multivariate domains of dimension D > 1 follows naturally by con-
sidering multivariate polynomial functions for $\phi_r^k$; the notion of adjacent regions and
the derivative constraints that must be fulfilled by a piecewise polynomial mapping
to be a spline are identical.
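To make Definitions 2.2 and 2.3 concrete in the simplest univariate, order-1 case, the following sketch (a hypothetical NumPy illustration, not part of the original text) evaluates a two-region piecewise affine mapping and checks the single C^0 continuity condition at the region boundary that turns it into an affine spline; the chosen coefficients happen to reproduce the ReLU function.

import numpy as np

# Two regions of the domain X = R: omega_1 = (-inf, 0], omega_2 = (0, inf).
# Per-region order-1 (affine) polynomials phi_r(x) = a_{r,0} + a_{r,1} * x, as in (2.1).
a = np.array([[0.0, 0.0],   # phi_1(x) = 0
              [0.0, 1.0]])  # phi_2(x) = x

def piecewise_poly(x, a):
    # Order-1 piecewise polynomial P(x) from (2.2) on the two-region partition.
    region = (x > 0).astype(int)          # 0 -> omega_1, 1 -> omega_2
    return a[region, 0] + a[region, 1] * x

x = np.linspace(-2.0, 2.0, 5)
print(piecewise_poly(x, a))               # [0. 0. 0. 1. 2.] -- this is ReLU

# C^0 condition at the boundary x = 0: both affine pieces must agree there.
assert np.isclose(a[0, 0] + a[0, 1] * 0.0, a[1, 0] + a[1, 1] * 0.0)
# For k = 1 there is no higher-order constraint, so P is an order-1 (affine) spline.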
Spline functions’ bases. In many situations, one needs to ‘learn’ a spline given a
specific order k and (possibly) a known partition Ω based on some criteria to minimize.
Commonly, one will not solve a constrained optimization problem of fitting a piecewise
polynomial function while constraining the derivatives of orders $0, \dots, k-1$ to be
continuous. Instead, one employs an unconstrained optimization problem in which
the coefficients to be optimized are weighting some basis functions that belong to the
considered spline functional space. One of the most famous types of basis functions
is the B-spline [Schoenberg, 1973, 1988], consisting of order-k splines with
compact support (on each region of Ω). Since any spline function of a given degree can
be expressed as a linear combination of B-splines of that degree, it is enough to learn
the correct linear combination in order to learn the spline function. For alternative
bases we refer the reader to Girosi et al. [1993], Unser and Blu [2005]. We now focus
on affine (k = 1) splines for the remainder of this thesis.
Affine splines. Affine splines (k = 1) have been particularly popular since their basis functions are efficient to evaluate and the number of parameters required to describe a spline on a partition Ω of a domain X grows linearly with the dimension
of the domain dim(X ). Such affine splines have been used for detection of patterns in images [Rives et al., 1985], contour tracing [Dobkin et al., 1990], extraction of straight lines in aerial images [Venkateswar and Chellappa, 1992], global optimiza- tion [Mangasarian et al., 2005], compression of chemical process data [Bakshi and
Stephanopoulos, 1996], gas demand forecasting [Gascón and Sánchez-Úbeda, 2018] and circuit modeling [Vandenberghe et al., 1989]. Let's specialize the spline mapping
(2.2) to the affine case. In that case, and with $\mathcal{X} \subset \mathbb{R}^D$, the mappings $\phi_r^1$ depend on
a slope vector $a_r \in \mathbb{R}^D$ and an offset/bias scalar $b_r \in \mathbb{R}$, leading to the multivariate affine spline
$$P(x; a_:, b_:) = \sum_{r=1}^{|\Omega|} \left(\langle a_r, x\rangle + b_r\right) \mathbb{1}_{\{x \in \omega_r\}},$$
where we recall that $a_r, b_r$ are such that $P \in C^{0}(\mathcal{X})$. For an in-depth study of affine splines and their representation we refer the reader to Kang and Chua [1978], Kahlert and Chua [1990]. While affine splines might seem constrained due to the use of order-
1 polynomials, we should emphasize that in the context of function approximation, the degree of the spline matters very little. However, the partition Ω is of crucial importance. A correctly tuned partition along with an order-1 spline will produce a better approximation than a higher-degree spline with an incorrect partition. For a thorough study of the relation between the polynomial degree, the partition, and the target function in the final approximation error, we refer the reader to Birkhoff and De Boor [1964], Lyche and Schumaker [1975], Cohen et al. [2012]. Clearly, the complication in spline fitting arises when one aims to fit the spline function basis and the partition jointly, leading to an intractable problem. As we will see in the next section, it is possible to design, for specific applications, splines that will automatically adapt the partition Ω while the basis functions are fit.
2.2 Max-Affine Splines
Whenever an affine spline is constrained to be globally convex, it can be rewritten as a Max-Affine Spline (MAS). The origin of MASs is not linked to a specific paper or study, but arose many times, for example in the development of hinging hyperplanes. Dedicated study of MASs has been done in the context of convex function approximation in Magnani and Boyd [2009], Hannah and Dunson [2013]. A MAS is a continuous, convex, and piecewise affine function that maps its input x to its output via
$$P(x; a_:, b_:) = \max_{r=1,\dots,R} \langle a_r, x\rangle + b_r. \qquad (2.3)$$
An extremely useful feature of such a spline is that it is completely determined by its parameters $a_r$ and $b_r$, $r = 1, \dots, R$, and does not require an explicit partition Ω. Changes in those parameters automatically induce changes in the partition Ω, meaning that they are adaptive partitioning splines [Binev et al., 2014]. A thorough study and characterization of the partition Ω induced by those parameters will be carried out in Sec. 3.4. A max-affine spline is always piecewise affine, globally convex and hence
continuous regardless of the values of its parameters $a_r \in \mathbb{R}^D$, $b_r \in \mathbb{R}$, $r = 1, \dots, R$. Conversely, any piecewise affine, globally convex, and continuous function can be written as a MAS.
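As a small numerical illustration of (2.3) (a hypothetical sketch with randomly drawn parameters, not code from the thesis), the following snippet evaluates a MAS, recovers the region selected by the max, and checks convexity along a random segment; note that no partition is ever stored, it is implied by the parameters.

import numpy as np

rng = np.random.default_rng(0)
R, D = 5, 2
a = rng.normal(size=(R, D))    # slopes a_r
b = rng.normal(size=R)         # offsets b_r

def mas(x):
    # Max-affine spline P(x; a_:, b_:) = max_r <a_r, x> + b_r, cf. (2.3).
    return np.max(a @ x + b)

def region(x):
    # Index of the affine piece attaining the max: the implicit partition of (2.3).
    return int(np.argmax(a @ x + b))

x1, x2 = rng.normal(size=D), rng.normal(size=D)
for t in np.linspace(0.0, 1.0, 11):
    xt = (1 - t) * x1 + t * x2
    # Convexity holds for any parameter values: P(xt) <= (1-t) P(x1) + t P(x2).
    assert mas(xt) <= (1 - t) * mas(x1) + t * mas(x2) + 1e-12
print(region(x1), region(x2))  # regions change with a, b; no partition is stored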
Proposition 1 For any continuous function $h \in C^0(\mathbb{R}^D)$ that is convex and piecewise
affine on a partition $\Omega$ of $\mathbb{R}^D$, there exist $R > 1$ and $a_r \in \mathbb{R}^D$, $b_r \in \mathbb{R}$, $r = 1, \dots, R$ such that $h(\cdot) = P(\cdot\,; a_:, b_:)$ everywhere.
This result follows from the fact that the pointwise maximum of a collection of convex functions is convex [Roberts, 1993]; for the converse, see Sec. 3.2.3 in Boyd et al.
[2004]. A great benefit of MASs for convex function approximation lies in (i) the ability to solve the fitting problem as an unconstrained optimization problem: in fact, as mentioned above, the parameters can be set arbitrarily without breaking the convexity property; and (ii) the ability to adapt the domain partition Ω while the affine parameters are tuned to better fit the target function, removing the intractable joint optimization of the partition and the per-region mappings. We now study in more detail the fitting methods for MASs.
2.3 (Max-)Affine Spline Fitting
Several methods have been proposed for fitting general affine splines to (multidimensional) data. A neural network algorithm is used in Gothoskar et al. [2002]; a
Gauss-Newton method is used in Julián et al. [1998], Horst and Beichel [1997]; a reference on methods for least-squares with semismooth functions is Kanzow and
Petra [2004]. For our study, we are interested in fitting a MAS in the form of (2.3).
Due to the special form and the convexity property of this approximant, and when considering a mean-squared error, Magnani and Boyd [2009] proposed an iterative
fitting algorithm that can be interpreted as a Gauss-Newton algorithm. We report this algorithm in Algo. 1.
Iterative procedures, similar in spirit to the one we presented in Algo. 1 for MASs but for specialized applications, are described in Phillips and Rosenfeld [1988], Yin
[1998], Ferrari-Trecate and Muselli [2002], Kim et al. [2004]. For additional references on affine spline fitting with D = 1, Dunham [1986] proposes to find the minimum number of segments to achieve a given maximum error, Goodrich [1994], Bellman and Roth [1969], Hakimi and Schmeichel [1991], Wang et al. [1993] propose dynamic programming methods to solve the affine spline fitting problem, and Pittman and
Murthy [2000] propose genetic algorithms. For D = 2, Aggarwal et al. [1989], Mitchell and Suri [1995] propose variants of the univariate fitting solutions.
Algorithm 1 Description of the MAS fitting when considering a mean-squared error.
The algorithm consists of successively fitting the per-region mappings to the samples that lie within each region, and then updating the partition. This method can be seen as a Gauss-Newton algorithm with a MAS approximant.
Convergence is not guaranteed; for examples of such failure cases, see Sec. 3.3 of
Magnani and Boyd [2009].

procedure MeanSquaredMaxAffineSplineFitting(D, T_limit ∈ N+)
    T ← 0                                                  ▷ set counter
    Ω^(0) = {ω_1^(0), ..., ω_K^(0)}, K ≤ R                 ▷ initialize a partition of D
    while T < T_limit do
        for r = 1, ..., |Ω^(T)| do
            if |ω_r^(T)| == 0 then break
            (a_r^(T+1), b_r^(T+1)) = arg min_{a,b} Σ_{(x,y) ∈ ω_r^(T)} ‖⟨a, x⟩ + b − y‖²   ▷ per-region least squares (closed form via the normal equations)
        Ω^(T+1) = {ω_1^(T+1), ..., ω_R^(T+1)} with ω_r^(T+1) = {(x, y) ∈ D : arg max_{r'=1,...,R} ⟨a_{r'}^(T+1), x⟩ + b_{r'}^(T+1) = r}   ▷ partition update
        if Ω^(T+1) == Ω^(T) then
            Exit
        else
            T ← T + 1
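The following minimal NumPy sketch illustrates the alternating procedure of Algo. 1 (per-region least squares followed by an arg-max partition update). It is a simplified, hypothetical implementation: the initialization, the handling of empty regions, and the stopping rule are chosen for brevity and differ in detail from Magnani and Boyd [2009].

import numpy as np

def fit_mas(X, y, R=5, T_limit=50, seed=0):
    # Alternating least-squares fitting of a max-affine spline (cf. Algo. 1).
    # X: (N, D) inputs, y: (N,) targets. Returns slopes (R, D) and offsets (R,).
    rng = np.random.default_rng(seed)
    N, D = X.shape
    labels = rng.integers(0, R, size=N)           # initial partition of the data
    a = np.zeros((R, D)); b = np.zeros(R)
    Xa = np.hstack([X, np.ones((N, 1))])          # augmented inputs for least squares
    for _ in range(T_limit):
        for r in range(R):
            idx = labels == r
            if idx.sum() == 0:                    # skip empty regions
                continue
            # Per-region least squares: min_{a,b} sum ||<a, x> + b - y||^2
            theta, *_ = np.linalg.lstsq(Xa[idx], y[idx], rcond=None)
            a[r], b[r] = theta[:D], theta[D]
        new_labels = np.argmax(X @ a.T + b, axis=1)    # partition update via arg max
        if np.array_equal(new_labels, labels):         # partition unchanged -> exit
            break
        labels = new_labels
    return a, b

# Usage: fit the convex target f(x) = x^2 on [-1, 1] with a 1-D max-affine spline.
X = np.linspace(-1, 1, 200).reshape(-1, 1)
y = X[:, 0] ** 2
a, b = fit_mas(X, y, R=6)
yhat = np.max(X @ a.T + b, axis=1)
print("max abs error:", np.abs(yhat - y).max())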
Chapter 3
Deep Networks: Composition of Max-Affine Spline Operators
In this chapter, we exploit Max-Affine Splines (MASs) to formulate Deep
Networks (DNs) as a composition of Max-Affine Spline Operators (MASOs), a mul-
tivariate output version of MASs. We will first develop the MASO formulation,
demonstrate how each DN layer can be expressed as a MASO and then express any
DN as a composition of MASOs. We conclude this chapter with a dedicated study of the
DN input space partition.
3.1 Max-Affine Spline Operators
A natural extension of a Max-Affine Spline (MAS) function is a max-affine spline operator (MASO) $M(\,\cdot\,; A_:, b_:)$ that produces a multivariate output. It is obtained simply by concatenating $K$ independent max-affine spline functions from (2.3). A
MASO mapping a $D$-dimensional input to a $K$-dimensional output has slope parameters $A_r \in \mathbb{R}^{K \times D}$ and offset parameters $b_r \in \mathbb{R}^{K}$ and is defined as
$$M(x; A_:, b_:) = \max_{r=1,\dots,R} \left(A_r x + b_r\right) = \begin{bmatrix} \max_{r=1,\dots,R} \langle [A_r]_{1,:}, x\rangle + [b_r]_1 \\ \vdots \\ \max_{r=1,\dots,R} \langle [A_r]_{K,:}, x\rangle + [b_r]_K \end{bmatrix}, \qquad (3.1)$$
where the maximum is taken componentwise. Since a MASO is built from K inde-
pendent MASs and can be seen as producing its output by stacking the output of
each MAS into a vector, it has a property analogous to Proposition 1.
Proposition 3.1
For any operator $H(x) = [h_1(x), \dots, h_K(x)]^T$ with $h_k \in C^0(\mathbb{R}^D)\ \forall k$ that are convex and
piecewise affine on their respective partitions $\Omega_k$ of $\mathbb{R}^D$, there exist $A_:, b_:$ such that
$H(\cdot) = M(\cdot\,; A_:, b_:)$ everywhere.
MASOs are crucial for our development since DNs compose multiple multivariate mappings. The goal of the next section is to demonstrate that a MASO can be used to formulate most of the current DN layers. From that, it will become clear that an entire DN input-output mapping is nothing other than a composition of MASOs.
3.2 From Deep Network Layers to Max-Affine Spline Oper-
ators
We begin by showing that the DN layers defined in Section 1.2 are MASOs, and we demonstrate in each case what the corresponding parameters $A_:, b_:$ are (recall (3.1)). The next section will concern the reformulation of the entire DN as a MASO composition.
A dense layer, which consists of an unconstrained affine transformation of the input followed by a pointwise nonlinearity, can be expressed as a MASO as long as the nonlinearity is convex and piecewise affine. It turns out that most currently employed nonlinearities fall into that category (ReLU, leaky-ReLU, absolute value). As a result, and following the notations from Sec. 1.2, we have that a dense layer can be expressed as a MASO with R = 2 and parameters
$$A_1 = W, \quad A_2 = \alpha W, \qquad (3.2)$$
$$b_1 = b, \quad b_2 = \alpha b, \qquad (3.3)$$
with α being 0 for ReLU, −1 for absolute value and α > 0 for leaky-ReLU.
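A quick numerical sanity check of (3.2)-(3.3) (hypothetical code with random weights, not from the thesis): the componentwise maximum of the two affine mappings reproduces the dense layer output for ReLU (α = 0), absolute value (α = −1), and leaky-ReLU (α = 0.1).

import numpy as np

rng = np.random.default_rng(1)
D, K = 4, 3
W = rng.normal(size=(K, D)); b = rng.normal(size=K); x = rng.normal(size=D)
pre = W @ x + b                                    # pre-activation of the dense layer

for alpha, activated in [(0.0, np.maximum(pre, 0.0)),                # ReLU
                         (-1.0, np.abs(pre)),                        # absolute value
                         (0.1, np.where(pre > 0, pre, 0.1 * pre))]:  # leaky-ReLU
    A1, A2 = W, alpha * W                          # MASO slopes, eq. (3.2)
    b1, b2 = b, alpha * b                          # MASO offsets, eq. (3.3)
    maso = np.maximum(A1 @ x + b1, A2 @ x + b2)    # componentwise max over R = 2
    assert np.allclose(maso, activated)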
The case of a convolutional layer is similar to that of a dense layer. The only change is to replace the unconstrained slope matrix W and bias vector by their constrained counterparts. That is, the MASO of a convolutional layer has R = 2 and parameters
$$A_1 = C, \quad A_2 = \alpha C, \qquad (3.4)$$
$$b_1 = b, \quad b_2 = \alpha b, \qquad (3.5)$$
where the same values of α hold for each nonlinearity.
The case of a max-pooling layer (without any preceding affine mapping) can be expressed as a MASO as well. In that case, the number of mappings R corresponds to the number of dimensions that the max-pooling is applied over. In a computer vision setting with the common 2 × 2 max-pooling, one would have R = 4. We thus have the following MASO parameters
$$[A_r]_{k,d} = \mathbb{1}_{\{[\mathcal{R}_k]_r = d\}},\ \forall r, \qquad b_r = 0,\ \forall r. \qquad (3.6)$$
That is, the matrices $A_r$ are filled with 0 and 1 values; each row $k$ contains a single 1
positioned at the $r^{\text{th}}$ index of the pooling region $\mathcal{R}_k$ that produces the output of the corresponding dimension $k$.
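To illustrate (3.6) on a tiny hypothetical example (a 1-dimensional input of 4 entries pooled over two regions of size 2, so R = 2 here), each A_r is a 0/1 selection matrix picking the r-th entry of every pooling region, and the componentwise maximum over r reproduces max-pooling.

import numpy as np

x = np.array([3.0, -1.0, 0.5, 2.0])
pools = [[0, 1], [2, 3]]                 # pooling regions R_1, R_2 (each of size R = 2)
R, K, D = 2, len(pools), len(x)

A = np.zeros((R, K, D))                  # selection matrices from (3.6)
for k, pool in enumerate(pools):
    for r, d in enumerate(pool):
        A[r, k, d] = 1.0                 # [A_r]_{k,d} = 1 iff [R_k]_r = d

maso = np.max(np.stack([A[r] @ x for r in range(R)]), axis=0)
assert np.allclose(maso, [max(x[p[0]], x[p[1]]) for p in pools])   # equals max-pooling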
The case of a maxout layer follows directly from the max-pooling case since it simply corresponds to a max-pooling layer in which an affine transform is added before the pooling operator. As such, the associated MASO will have R set based on the number of dimensions that are being pooled and the parameters are given by
$$[A_r]_{k,:} = [W]_{[\mathcal{R}_k]_r,:},\ \forall r, \qquad [b_r]_k = [b]_{[\mathcal{R}_k]_r},\ \forall r. \qquad (3.7)$$
Another important case occurs when adding a residual connection to any given layer. This can also be modeled easily with a MASO as follows. First, formulate the given layer without the residual connection as a MASO with one of the above formulations.
This provides a MASO parametrization A:, b:. Now, to add the residual connection to this layer, one simply adds the residual affine parameters to all the MASO parameters
i.e. for each r = 1,...,R as follows
$$A_r \leftarrow A_r + W_{\mathrm{res}},\ \forall r, \qquad b_r \leftarrow b_r + b_{\mathrm{res}},\ \forall r. \qquad (3.8)$$
In the special case of a skip-connection, one would set $W_{\mathrm{res}}$ to be the identity
matrix and $b_{\mathrm{res}}$ to be 0. While the above does not explicitly cover all the possible layers that one can form by combining various operators, the same recipe can be
applied. We formalize the generality of this formulation in the following result. Proposition 3.2 (DN layer as MASO)
Any DN layer (recall Def. 1.1) that uses a continuous, convex, and piecewise affine
(for each output dimension) nonlinear operator, and any (if any) preceding linear
operator can be expressed as a MASO.
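The residual recipe (3.8) can also be verified numerically; in this hypothetical sketch, adding a skip-connection to a ReLU dense layer amounts to adding the identity matrix to both MASO slope matrices (with α = 0, so A_2 = 0 before the update).

import numpy as np

rng = np.random.default_rng(2)
D = 5
W = rng.normal(size=(D, D)); b = rng.normal(size=D); x = rng.normal(size=D)

layer_plus_skip = np.maximum(W @ x + b, 0.0) + x   # ReLU dense layer with a skip-connection

# ReLU layer as a MASO: A_1 = W, A_2 = 0, b_1 = b, b_2 = 0 (alpha = 0 in (3.2)-(3.3)).
# Skip-connection via (3.8): add W_res = I to every slope and b_res = 0 to every offset.
I = np.eye(D)
A1, A2 = W + I, I
b1, b2 = b, np.zeros(D)
maso = np.maximum(A1 @ x + b1, A2 @ x + b2)
assert np.allclose(maso, layer_plus_skip)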
It will be convenient to abstract away the region selection r based on a given input
and thus introduce the following notation
$$M(x; A_:, b_:) = A_x x + b_x, \qquad (3.9)$$
where the input-induced affine parameters are given by
$$A_x \triangleq \begin{bmatrix} [A_{r_1(x)}]_{1,:}^T \\ \vdots \\ [A_{r_K(x)}]_{K,:}^T \end{bmatrix}, \quad b_x \triangleq \begin{bmatrix} [b_{r_1(x)}]_1 \\ \vdots \\ [b_{r_K(x)}]_K \end{bmatrix}, \quad r_k(x) = \arg\max_{r=1,\dots,R} \left(\langle [A_r]_{k,:}, x\rangle + [b_r]_k\right), \qquad (3.10)$$
hence the parameters $A_x, b_x$ simply correspond to the slope and bias parameters responsible for producing the input-output mapping for the given input $x$. Similarly, we denote by $A_\omega, b_\omega$ the parameters $A_x, b_x$ obtained from any $x \in \omega$. It will become convenient in the coming section to index those parameters by the layer index as in $R^{(\ell)}$, $A^{(\ell)}$ and $b^{(\ell)}$. From this result, we are now able to express any DN that composes layers fulfilling Prop. 3.2. We propose to do the reformulation in the next section where we will focus on two specific architectures to see how such a formulation can aid in comparing models from a data modeling view.
3.3 Composition of Max-Affine Spline Operators
We first formalize the ability to express DNs as a composition of MASOs. This result will be the key to the entire thesis as it opens the door to further analysis of DNs from a spline perspective.
Theorem 3.1 (DNs as MASO composition)
A DN constructed from an arbitrary composition of layers that fulfill the conditions of Prop. 3.2 can be formulated as a composition of MASOs; the overall composition is itself a continuous affine spline operator.
DNs covered by Theorem 3.1 include CNNs, ResNets, inception networks, maxout networks, network-in-networks, scattering networks, and their variants using fully connected/convolution operators, (leaky) ReLU or absolute value activations, and max/mean pooling. Thanks to the ability to express any layer as a MASO, we can express the entire DN input-output mapping $F_\Theta$ as
$$F_\Theta(x) = \left(\prod_{\ell=0}^{L-1} A_x^{(L-\ell)}\right) x + \sum_{\ell=1}^{L} \left(\prod_{j=0}^{L-\ell-1} A_x^{(L-j)}\right) b_x^{(\ell)}. \qquad (3.11)$$
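To make (3.11) concrete, the following hypothetical sketch builds a small random ReLU network, forms the input-induced per-layer slopes and offsets from the observed activation patterns, and checks that the composed affine mapping reproduces the forward pass; the layer and dimension choices are arbitrary.

import numpy as np

rng = np.random.default_rng(3)
dims = [4, 6, 5, 3]                                   # D^(0), D^(1), D^(2), D^(3)
Ws = [rng.normal(size=(dims[l + 1], dims[l])) for l in range(3)]
bs = [rng.normal(size=dims[l + 1]) for l in range(3)]
x = rng.normal(size=dims[0])

# Forward pass (ReLU on layers 1 and 2, affine output layer).
z, A_x, b_x = x, [], []
for l, (W, b) in enumerate(zip(Ws, bs)):
    pre = W @ z + b
    q = np.ones_like(pre) if l == 2 else (pre > 0).astype(float)  # activation pattern
    A_x.append(np.diag(q) @ W)                         # input-induced slope of layer l+1
    b_x.append(q * b)                                  # input-induced offset of layer l+1
    z = q * pre

# Equation (3.11): slope = product of the A_x, bias = sum of the propagated b_x.
A = np.eye(dims[0])
for Al in A_x:
    A = Al @ A
bias = np.zeros(dims[-1])
for l in range(3):
    v = b_x[l]
    for Al in A_x[l + 1:]:
        v = Al @ v
    bias = bias + v
assert np.allclose(A @ x + bias, z)                    # F(x) = A_x x + b_x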
Note however that DNs of the form stated in Theorem 3.1 can not in general be
written as a single MASO since the composition of two or more MASOs is not nec-
essarily a convex operator (it is merely a continuous affine spline operator). Indeed,
a composition of MASOs remains convex if and only if all of the intermediate opera-
tors are non-decreasing with respect to each of their output dimensions [Boyd et al.,
2004]. Interestingly, ReLU, max-pooling, and average pooling are all non-decreasing,
while leaky ReLU is strictly increasing. The culprits of the non-convexity of the
composition of operators are negative entries in the fully connected and convolution
slope matrices. A DN where these culprits are thwarted is an interesting special case,
because it is convex with respect to its input [Amos et al., 2016] and multiconvex
[Xu and Yin, 2013] with respect to its parameters (i.e., convex with respect to each
operator’s parameters while the other operators’ parameters are held constant). The
MASO form allows us to formalize those constraints simply, based on the $A_:$ parameters. Theorem 3.2 (Globally convex DNs)
A MASO DN whose layer $\ell = 2, \dots, L$ MASO slopes are nonnegative, i.e., $\big[A_r^{(\ell)}\big]_{i,j} \geq 0$, $\forall (r, i, j) \in \{1, \dots, R^{(\ell)}\} \times \{1, \dots, D^{(\ell)}\} \times \{1, \dots, D^{(\ell-1)}\}$, is globally convex with respect to each of its output dimensions.
Note that Theorem 3.2 remains true regardless of the MASO parameters of the first layer. Input convexity is a beneficial property that can be leveraged for specific applications, as optimization over the input becomes a convex optimization problem [Amos et al., 2016]. We now propose to dive into more detail on characterizing the DN input space partition, i.e., the collection of regions in the DN input space Ω in which the
DN input-output mapping remains linear. This explicit analytical characterization is crucial as understanding the partition of a spline operator opens the door to further theoretical study such as generalization performance.
3.4 Deep Networks Input Space Partition: Power Diagram
Subdivision
One of the key elements of any spline function is its input space partition Ω. From it,
results on generalization and approximation can be obtained as well as a better un-
derstanding of the approximant behavior via the study of the regions' shapes. Other
works have focused on the properties of the partitioning, such as upper bounding the
number of regions [Montufar et al., 2014, Raghu et al., 2017, Hanin and Rolnick, 2019]
or providing an explicit characterization of the input space partitioning of a single
layer DN with ReLU activation [Zhang et al., 2018b] by means of tropical geometry.
We propose in this section to characterize the DN input space partition with more
generality by providing results that apply to any MASO-based DN, regardless of the
underlying width/depth/layers. To do so, we adopt a computational and combina-
torial geometry [Pach and Agarwal, 2011, Preparata and Shamos, 2012] perspective
of MASO-based DNs to derive the analytical form of the input-space partition of a
DN unit, a DN layer, and an entire end-to-end DN. We demonstrate that each DN
layer performs a partitioning according to a Power Diagram [Aurenhammer and Imai,
1988] with a large number of regions and that those Power Diagrams are subdivided in a special way to create the overall DN input-space partition.
3.4.1 Voronoi Diagrams and Power Diagrams
In order to precisely derive our result on the DN input space partition, we first need to remind the reader of some specific input space partitions, namely Voronoi diagrams and Power Diagrams.
Definition 3.1 (Voronoi Diagram) A Voronoi diagram (VD) [Voronoi, 1908] partitions
a space X into R regions Ω = {ω1, . . . , ωR} where each cell is obtained via ωr = {x ∈ X : r(x) = r}, r = 1,...,R, with
$$r(x) = \arg\min_{k=1,\dots,R} \|x - [\mu]_{k,:}\|^2. \qquad (3.12)$$
The parameter [µ]k,: is called the centroid.
VDs are also denoted as Dirichlet tessellations and the Voronoi regions are also known as Thiessen polygons. For a thorough study of VDs we refer the reader to Aurenhammer [1991]. A power diagram (PD), also known as a Laguerre-Voronoi diagram, is a generalization of the classical Voronoi diagram (VD).
Definition 3.2 (Power Diagram) A power diagram (PD) [Aurenhammer and Imai,
1988] partitions a space X into at most R regions Ω = {ω1, . . . , ωR} where each cell is obtained via ωr = {x ∈ X : r(x) = r}, r = 1,...,R, with
$$r(x) = \arg\min_{k=1,\dots,R} \|x - [\mu]_{k,:}\|^2 - [\mathrm{rad}]_k. \qquad (3.13)$$
The parameter [µ]k,: is called the centroid, while [rad]k is called the radius. The distance minimized in (3.13) is called the Laguerre distance [Imai et al., 1985].
When the radii are equal for all k, a PD collapses to a VD. See Fig. 3.1 for two equivalent geometric interpretations of a PD. For additional insights, see Preparata and Shamos [2012]. We will have the occasion to use negative radii in our development
below. Since $\arg\min_k \|x - [\mu]_{k,:}\|^2 - [\mathrm{rad}]_k = \arg\min_k \|x - [\mu]_{k,:}\|^2 - ([\mathrm{rad}]_k + \rho)$, we can always apply a constant shift $\rho$ to all of the radii to make them positive.
In general, a PD is defined with nonnegative radii to provide additional geometric interpretations.
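A small hypothetical sketch of the Laguerre-distance assignment (3.13): with equal radii the PD reduces to the VD of (3.12), and shifting all radii by a common constant ρ leaves the assignment unchanged, which is why negative radii are harmless.

import numpy as np

rng = np.random.default_rng(4)
R, D = 6, 2
mu = rng.normal(size=(R, D))            # centroids [mu]_{k,:}
rad = rng.normal(size=R)                # radii [rad]_k (possibly negative)

def pd_assign(x, mu, rad):
    # Region index r(x) minimizing the Laguerre distance of (3.13).
    return int(np.argmin(np.sum((x - mu) ** 2, axis=1) - rad))

x = rng.normal(size=D)
# Equal radii: the PD collapses to the Voronoi diagram of (3.12).
assert pd_assign(x, mu, np.zeros(R)) == int(np.argmin(np.sum((x - mu) ** 2, axis=1)))
# A constant shift rho of all the radii does not change the assignment.
assert pd_assign(x, mu, rad) == pd_assign(x, mu, rad + 10.0)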
The Laguerre distance corresponds to the length of the line segment that starts at x ∈ X and ends at the tangent to the hypersphere with center $[\mu]_{k,:}$ and radius
Figure 3.1 : Two equivalent representations of a power diagram (PD). Top: The grey circles have centers $[\mu]_{k,:}$ and radii $[\mathrm{rad}]_k$; each point $x$ is assigned to a specific region/cell according to the Laguerre distance from the centers, which is defined as the length of the segment tangent to and starting on the circle and reaching $x$. Bottom: A PD in $\mathbb{R}^D$ (here $D = 2$) is constructed by lifting the centroids $[\mu]_{k,:}$ up into an additional dimension in $\mathbb{R}^{D+1}$ by the distance $[\mathrm{rad}]_k$ and then finding the Voronoi diagram (VD) of the augmented centroids $([\mu]_{k,:}, [\mathrm{rad}]_k)$ in $\mathbb{R}^{D+1}$. The intersection of this higher-dimensional VD with the originating space $\mathbb{R}^D$ yields the PD.
$[\mathrm{rad}]_k$ (see Fig. 3.1). The hyperplanar boundary between two adjacent power diagram (PD) regions can be characterized in terms of the chordale of the corresponding hyperspheres [Johnson, 1960]. Doing so for all adjacent boundaries fully characterizes the region boundaries in simple terms of hyperplane intersections [Aurenhammer,
Those two mathematical objects will be enough for us to build a complete characterization of the DN input space partition, which we now turn to, first for the single-layer case, and then for the multilayer case.
3.4.2 Single Layer: Power Diagram
A MASO layer combines K max-affine spline (MAS) units to produce the layer output given its input. To streamline our argument, we omit the ` superscript and denote the layer input by x, with X the layer's domain. It shall be clear that each MAS indirectly encodes a partition of its input space, where each region corresponds to the collection of inputs that are mapped via the same affine mapping. In other words, the
partition $\Omega_k$ of the $k^{\text{th}}$ MAS mapping in a MASO is obtained via
Ωk = {ωk,1, . . . , ωk,R},
where each region ωk,r is the collection of inputs given by
$$\omega_{k,r} = \left\{x \in \mathcal{X} : \arg\max_{r'=1,\dots,R} \langle [A_{r'}]_{k,:}, x\rangle + [b_{r'}]_k = r\right\}.$$
Following simple calculus, we can rewrite the region assignment as follows:
$$\begin{aligned}
\omega_{k,r} &= \Big\{x \in \mathcal{X} : \arg\max_{r'=1,\dots,R} \big(\langle [A_{r'}]_{k,:}, x\rangle + [b_{r'}]_k\big) = r\Big\}\\
&= \Big\{x \in \mathcal{X} : \arg\min_{r'=1,\dots,R} \big(-2\langle [A_{r'}]_{k,:}, x\rangle - 2[b_{r'}]_k\big) = r\Big\} &&\text{(sign change, scaling)}\\
&= \Big\{x \in \mathcal{X} : \arg\min_{r'=1,\dots,R} \big(-2\langle [A_{r'}]_{k,:}, x\rangle - 2[b_{r'}]_k + \|x\|_2^2\big) = r\Big\} &&\text{(adding a constant)}\\
&= \Big\{x \in \mathcal{X} : \arg\min_{r'=1,\dots,R} \big(-2\langle [A_{r'}]_{k,:}, x\rangle - 2[b_{r'}]_k + \|[A_{r'}]_{k,:}\|_2^2 - \|[A_{r'}]_{k,:}\|_2^2 + \|x\|_2^2\big) = r\Big\}\\
&= \Big\{x \in \mathcal{X} : \arg\min_{r'=1,\dots,R} \big(\|x - [A_{r'}]_{k,:}\|_2^2 - 2[b_{r'}]_k - \|[A_{r'}]_{k,:}\|_2^2\big) = r\Big\},
\end{aligned}$$
where by identification, and denoting $2[b_{r'}]_k + \|[A_{r'}]_{k,:}\|_2^2$ as the radius term, we see that a MAS partitions its input space according to a Power Diagram. Theorem 3.3 (MAS partition)
The $k^{\text{th}}$ MAS unit of a MASO partitions its input space according to a PD with
$R$ centroids and radii given by $[\mu]_{r,:} = [A_r]_{k,:}$ and $[\mathrm{rad}]_r = 2[b_r]_k + \|[A_r]_{k,:}\|_2^2$, $\forall r \in \{1, \dots, R\}$ (recall (3.13)).
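Theorem 3.3 can be checked numerically with random parameters (a hypothetical sketch): the arg-max region selection of a single MAS unit coincides with the arg-min of the Laguerre distance built from the centroids [A_r]_{k,:} and radii 2[b_r]_k + ||[A_r]_{k,:}||^2.

import numpy as np

rng = np.random.default_rng(5)
R, D = 7, 3
A = rng.normal(size=(R, D))              # per-mapping slopes [A_r]_{k,:} of one MAS unit
b = rng.normal(size=R)                   # per-mapping offsets [b_r]_k

mu = A                                   # centroids [mu]_{r,:} = [A_r]_{k,:}
rad = 2.0 * b + np.sum(A ** 2, axis=1)   # radii 2[b_r]_k + ||[A_r]_{k,:}||^2

for _ in range(1000):
    x = rng.normal(size=D)
    r_mas = np.argmax(A @ x + b)                             # MAS region selection
    r_pd = np.argmin(np.sum((x - mu) ** 2, axis=1) - rad)    # PD region selection (3.13)
    assert r_mas == r_pd                                     # Theorem 3.3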
Going from the partition of a single unit Ωk of the MASO layer to the entire layer input space partition Ω is done by studying the joint behavior of all the layer’s
constituent units. A MASO layer is a continuous, piecewise affine operator made by
the concatenation of K MAS units (recall (3.1)). This operator is linear in the region
of its domain where all the MAS units are jointly linear. From this, it is direct to see
that Ω will involve all the possible intersections of the regions from Ω1,..., ΩK . We
can formally obtain the exact form of the partition as follows. Denote a region $\omega_{\boldsymbol{r}}$ with $\boldsymbol{r} \in \{1, \dots, R\}^K$ as
$$\omega_{\boldsymbol{r}} = \left\{x \in \mathcal{X} : r_k(x) = [\boldsymbol{r}]_k,\ k = 1, \dots, K\right\},$$
where $r_k(x)$ is taken from (3.10). So $\omega_{\boldsymbol{r}}$ is the (possibly empty) region of the layer domain that contains all the inputs with the specified arg max values for each of
the units based on the provided integer vector r. To be consistent with our previous
derivation, we will index $\omega$ with an integer given by $I(\boldsymbol{r}) = \sum_{k=1}^{K} R^{k-1}([\boldsymbol{r}]_k - 1)$. Clearly, the $I$ mapping is a bijection between $\{1, \dots, R\}^K$ and $\{0, \dots, R^K - 1\}$; $I$ can be seen as a change of basis of its integer input from base 10 to base $R$; conversely, $I^{-1}$ is the inverse mapping. Following a similar approach as for the MAS case, we obtain
$$\begin{aligned}
\omega_{r} &= \left\{x \in \mathcal{X} : r_k(x) = [I^{-1}(r)]_k,\ k = 1, \dots, K\right\}\\
&= \left\{x \in \mathcal{X} : \arg\max_{\boldsymbol{r}' \in \{1,\dots,R\}^K} \sum_{k=1}^{K} \langle [A_{[\boldsymbol{r}']_k}]_{k,:}, x\rangle + [b_{[\boldsymbol{r}']_k}]_k = I^{-1}(r)\right\} &&\text{(indep. max.)}\\
&= \left\{x \in \mathcal{X} : \arg\min_{\boldsymbol{r}' \in \{1,\dots,R\}^K} -2\sum_{k=1}^{K} \langle [A_{[\boldsymbol{r}']_k}]_{k,:}, x\rangle - 2\sum_{k=1}^{K}[b_{[\boldsymbol{r}']_k}]_k = I^{-1}(r)\right\}\\
&= \left\{x \in \mathcal{X} : \arg\min_{\boldsymbol{r}' \in \{1,\dots,R\}^K} -2\sum_{k=1}^{K} \langle [A_{[\boldsymbol{r}']_k}]_{k,:}, x\rangle - 2\sum_{k=1}^{K}[b_{[\boldsymbol{r}']_k}]_k + \|x\|_2^2 = I^{-1}(r)\right\}\\
&= \left\{x \in \mathcal{X} : \arg\min_{\boldsymbol{r}' \in \{1,\dots,R\}^K} \Big\|x - \sum_{k=1}^{K}[A_{[\boldsymbol{r}']_k}]_{k,:}\Big\|_2^2 - 2\sum_{k=1}^{K}[b_{[\boldsymbol{r}']_k}]_k - \Big\|\sum_{k=1}^{K}[A_{[\boldsymbol{r}']_k}]_{k,:}\Big\|_2^2 = I^{-1}(r)\right\}\\
&= \left\{x \in \mathcal{X} : \arg\min_{r'=1,\dots,R^K} \Big\|x - \sum_{k=1}^{K}[A_{[I^{-1}(r')]_k}]_{k,:}\Big\|_2^2 - 2\sum_{k=1}^{K}[b_{[I^{-1}(r')]_k}]_k - \Big\|\sum_{k=1}^{K}[A_{[I^{-1}(r')]_k}]_{k,:}\Big\|_2^2 = r\right\},
\end{aligned}$$
by identification, we can see that again, we fall back to a Power Diagram, thanks to
the independent maximization process that is done for each unit of the MASO. Theorem 3.4 (MASO partition)
A DN layer partitions its input space according to a PD containing up to $R^K$ regions
with centroids $\mu_r = \sum_{k=1}^{K} [A_{[I^{-1}(r)]_k}]_{k,:}$ and radii $\mathrm{rad}_r = 2\sum_{k=1}^{K} [b_{[I^{-1}(r)]_k}]_k + \|\mu_r\|^2$. The input space partition of a DN layer is composed of convex polytopes.
As a result, each layer in a DN partitions its own input space according to a PD with the above parameters. The case of a composition of layers is described in the next section and heavily relies on the above result on the MASO partition.
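As a hypothetical Monte-Carlo illustration of Theorem 3.4, the sketch below samples the input space of a small ReLU layer (R = 2 per unit, K units), enumerates the non-empty regions through their joint activation patterns, and compares their count to the R^K upper bound; sampling only lower-bounds the true number of regions.

import numpy as np

rng = np.random.default_rng(6)
D, K, R = 2, 6, 2                        # input dim, number of units, R = 2 (ReLU)
W = rng.normal(size=(K, D)); b = rng.normal(size=K)

# Sample the layer input space and record each point's joint activation pattern,
# i.e. the per-unit arg max between the identity (r = 1) and zero (r = 2) mappings.
X = rng.uniform(-5.0, 5.0, size=(200_000, D))
patterns = (X @ W.T + b > 0)

n_regions = len(np.unique(patterns, axis=0))
print(n_regions, "non-empty regions found by sampling; upper bound R**K =", R ** K)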
3.4.3 Composition of Layers: Power Diagram Subdivision
We provide the formula for the input space partition of an L-layer DN by means of a recursion. Since we will now consider multiple layers, we have to bring back the superscript indexing of the per-layer quantities that we will study. The input space of layer ` is X (`−1), the partition of this input space with respect to the layer PD is
Ω(`).
Initialization (` = 0): Define the region of interest in the input space X (0) ⊂ RD. First step (` = 1): The first layer subdivides X (0) into a PD via Theorem 3.4 to obtain the layer-1 partition Ω(1).
Recursion step (` = 2): The second layer subdivides X (1) into a PD via Theorem 3.4 to obtain the layer-2 partition Ω(2). Jointly, the first layer units map X (0) into X (1) but remain a simple affine mapping in each region of the first layer’s partition. Hence, each convex polytope ω ∈ Ω(1) that lives in the first layer and DN’s input space is mapped to another convex polytope in X (1), the second layer’s input space via
$$\mathrm{aff}_{\omega^{(1)}} = \left\{A^{(1)}_{\omega^{(1)}} x + b^{(1)}_{\omega^{(1)}} : x \in \omega^{(1)}\right\} \subset \mathcal{X}^{(1)}, \quad \forall \omega^{(1)} \in \Omega^{(1)}. \qquad (3.14)$$
Figure 3.2 : Visual depiction of the subdivision process that occurs when a deeper layer ` refines/subdivides an already built up-to-layer `−1 partition Ω(1,...,`−1). We depict here a toy model (2-layer DN) with 3 units at the first layer (leading to 4 regions) and 8 units at the second layer, with random weights and biases. The colors show the DN input space partitioning with respect to the first layer. Then, for each color (or region), the layer 1–layer 2 composition defines a specific PD that will subdivide this aforementioned region (first row), where the region is colored and the PD is depicted for the whole input space. This subdivision is then applied onto the first layer region only, as it only subdivides its region (second row, right). Finally, grouping together this process for each of the 4 regions, we obtain the layer 1–layer 2 input space partitioning (second row, left).
As a result, it is clear that the partition $\Omega^{(2)}$ of $\mathcal{X}^{(1)}$ possibly subdivides $\mathrm{aff}_{\omega^{(1)}}$ into smaller regions. As the first layer is linear in this part of the space, we can effectively express the PD that subdivides each region $\mathrm{aff}_{\omega^{(1)}}$ back into the DN input space by replacing $x$ with $A^{(1)}_{\omega^{(1)}} x + b^{(1)}_{\omega^{(1)}}$ in Thm. 3.4. Repeating this subdivision process for all regions $\omega^{(1)}$ from $\Omega^{(1)}$ forms the subdivided input space partition of both layers
$\Omega^{(1,2)}$. See Fig. 3.2 for a numerical example with a 2-layer DN and a $D = 2$-dimensional input space.
Recursion step (`): Consider the situation at layer ` knowing Ω(1,...,`−1) from the previous subdivision steps. Similarly to the ` = 2 step, layer ` subdivides each cell in
Ω(1,...,`−1) to produce Ω(1,...,`), leading to the up-to-layer-` DN partition Ω(1,...,`).
Theorem 3.5 (DN partition)
The DN input space partition is a Power Diagram subdivision; the number of subdivisions is at most the number of layers; at step `, the subdivision of a previously built region splits it into 1 to D(`) regions; at each step, the subdivisions of different regions are not independent.
The subdivision recursion provides a direct result on the shape of the DN input space partition regions that we formalize in its own statement below.
Corollary 3.1 (Region convexity)
For any number of MASO layers L ≥ 1, and any type of layer (as long as they are
MASOs), the regions of the DN input space partition are convex polytopes.
The above result comes naturally from our characterization. Recall that the DN partition successively subdivides the previously built partition starting from the entire
DN input space. The first layer produces Ω(1), which, being a PD, has convex regions as
is the case for any PD. Each region ω ∈ Ω(1) is then subdivided with another PD,
hence the intersection of convex regions with other convex regions occurs. The result,
Ω(1,2), is thus made of convex regions. Repeating this process of intersecting convex
regions with other convex regions ultimately leads to the DN input space partition
Ω(1,...,L) made of convex regions.
3.5 Discussions
Our ability to characterize the DN input space partition as a Power Diagram subdivi-
sion concludes this chapter on employing Max-Affine Spline Operators to reformulate
current DNs. We now have a better grasp of the underlying structure of the spline
operator that is a DN. As was highlighted, this formulation offers a few key benefits. First, it is able to model DNs regardless of the actual input/latent/output space dimensions, and can be used for any DN as long as each layer nonlinearity is a Continuous Piecewise Affine operator. This generality, coupled with the practicality of Max-Affine Splines for obtaining theoretical results, should open the door to extending many powerful results obtained in univariate settings to more general cases.
The subsequent chapters focus on bringing insights into various DN techniques and applications such as Deep Generative Networks, Deep Network pruning or Batch-
Normalization in Deep Networks from the MASO formulation.
Chapter 4
Insights Into Deep Generative Networks
In this chapter, we propose to leverage the results from Chap. 3 and apply them specifically to Deep Generative Networks (DGNs). Up until this point, we have been mainly focusing on a DN FΘ devoid of any application setting. But DGNs, even though close to regression, aim to solve the problem of manifold learning. This particular scenario will allow us to draw many geometric insights into the ability of
Continuous Piecewise Affine DGNs to fit manifolds and into their inner workings, e.g., their intrinsic dimension or their local basis vectors.
4.1 Introduction
4.1.1 Related Works
Deep Generative Networks (DGNs), which map a low-dimensional latent variable z to a higher-dimensional generated sample x, have made enormous leaps in capabilities in recent years. DGNs alone only provide a nonlinear mapping from their latent space to an ambient space; learning the underlying DGN parameters can be done in a few different manners. First, one can employ Generative Adversarial Networks (GANs)
[Goodfellow et al., 2014] or their variants [Dziugaite et al., 2015, Zhao et al., 2016,
Durugkar et al., 2016, Arjovsky et al., 2017, Mao et al., 2017, Yang et al., 2019a]. In this setting the DGN is adapted in order to produce samples that cannot be distinguished from the training set's samples by a discriminative DN. Another option
[Fabius and van Amersfoort, 2014, van den Oord et al., 2017, Higgins et al., 2017,
Tomczak and Welling, 2017b, Davidson et al., 2018]. In this setting, a (minimal)
Probabilistic Graphical Model (PGM) is used in which the DGN represents the mapping between two neighboring vertices in this graph. This formulation allows training from a likelihood maximization perspective. In a similar vein, flow-based models such as NICE [Dinh et al., 2014], Normalizing Flows (NF) [Rezende and Mohamed, 2015] or their variants [Dinh et al., 2016, Grathwohl et al., 2018, Kingma and Dhariwal,
2018] propose to leverage the DGN as a succession of coordinate changes and to adapt them in order to force the data distribution to become a (simple) target distribution, often taken as an isotropic Gaussian. Training flow-based models also follows the maximum likelihood principle but in a somewhat reversed formulation from VAEs.
Despite an exponential growth in the number of extensions and novel training methods for DGNs, all emerging techniques are motivated by studying the coupling between the dynamics of the DGN and the training framework [Mao et al., 2017, Chen et al.,
2018], or by extensive empirical studies [Arjovsky and Bottou, Miyato et al., 2018, Xu and Durrett, 2018]. For example, GANs are mostly studied through the theoretical convergence properties of two-player games [Liu et al., 2017a, Zhang et al., 2017b,
Biau et al., 2018], or regret analysis [Li et al., 2017b, Kodali et al., 2017]. VAEs are mostly studied from a perturbation theory perspective of their latent space [Roy et al.,
2018, Andrés-Terré and Lió, 2019] or from a pure PGM perspective with emphasis on the inference and training schemes [Chen et al., 2018]. Finally, NFs mostly focus on improving the tractability of the model by means of parametrizations of the DGN layer mappings such as Householder transformations [Tomczak and Welling, 2016] or Sylvester matrices [Berg et al., 2018].
4.1.2 Contributions
In this chapter, we propose to study DGNs and their properties solely based on their
Continuous Piecewise Affine structure that we built in Chap. 3. That is, we propose to
make explicit the fundamental properties and limitations of DGNs regardless of the training
setting employed. In doing so, we will be able to provide new perspectives into many
observed phenomena such as unstable training when dealing with multimodal data
distributions (mode collapse) or the relationship between the DGNs latent space
dimension and its ability to generalize. For this chapter, we will use the following
notations. A deep generative network (DGN) is an operator $G_\Theta$ with parameters $\Theta$ mapping a latent input $z \in \mathbb{R}^S$ to an observation $x \in \mathbb{R}^D$ by composing $L$ intermediate layer mappings $G^{(\ell)}$, $\ell = 1, \dots, L$. We precisely define a layer $G^{(\ell)}$ as comprising a single nonlinear operator composed with any (if any) preceding linear operators that lie between it and the preceding nonlinear operator, as per Def. 1.1. We will omit $\Theta$ from the $G_\Theta$ operator for conciseness unless needed. Each layer $G^{(\ell)}$ transforms its input feature map $z^{(\ell-1)} \in \mathbb{R}^{D^{(\ell-1)}}$ into an output feature map $z^{(\ell)} \in \mathbb{R}^{D^{(\ell)}}$, with in particular $z^{(0)} := z$, $D^{(0)} = S$, and $z^{(L)} := x$, $D^{(L)} = D$. In this framework, $z$ is interpreted as a latent representation, and $x$ is the generated/observed data, e.g., a time series or an image.
4.2 Deep Generative Network Latent and Intrinsic Dimension
In this section we study the properties of the mapping $G_\Theta : \mathbb{R}^S \to \mathbb{R}^D$ of a DGN comprising $L$ MASO layers.
Figure 4.1: Visual depiction of Thm. 4.1 with a (random) generator $G : \mathbb{R}^2 \mapsto \mathbb{R}^3$. Left: generator input space partition $\Omega$ made of polytopal regions. Right: generator image $\mathrm{Im}(G)$, which is a continuous piecewise affine surface composed of the polytopes obtained by affinely transforming the polytopes of the input space partition (left); the colors are per-region and correspond between the left and right plots. This input-space-partition / generator-image / per-region-affine-mapping relation holds for any architecture employing piecewise affine activation functions. Understanding each of the three brings insights into the others, as we demonstrate in this chapter.
4.2.1 Input-Output Space Partition and Per-Region Mapping
As hinted at in the previous chapter, the MASO formulation of a DGN allows us to express the (entire) DGN mapping $G$ (a composition of $L$ MASOs) as the per-region affine mapping
$$G(z) = \sum_{\omega \in \Omega} (A_\omega z + b_\omega)\, \mathbb{1}_{z \in \omega}, \quad z \in \mathbb{R}^S, \qquad (4.1)$$
with $\Omega$ a partition of $\mathbb{R}^S$. Recall from Sec. 3.4 that this partition corresponds to a Power Diagram subdivision and can be obtained analytically, if needed. In order to study and characterize the DGN mapping (4.1), we make explicit the formation of the per-region slope and bias parameters. The affine parameters $A_\omega, b_\omega$ decompose into
$$A_\omega = \prod_{\ell=0}^{L-1} \mathrm{diag}\!\left(\dot{\sigma}^{(L-\ell)}(\omega)\right) W^{(L-\ell)} = \mathrm{diag}\!\left(\dot{\sigma}^{(L)}(\omega)\right) W^{(L)} \cdots \mathrm{diag}\!\left(\dot{\sigma}^{(1)}(\omega)\right) W^{(1)}, \qquad (4.2)$$
where $\dot{\sigma}^{(\ell)}(\omega)$ is the pointwise derivative of the activation function of layer $\ell$ based on its input $W^{(\ell)} z^{(\ell-1)} + b^{(\ell)}$, $\forall z \in \omega$. At the time of this thesis, no DGN employs a pooling operator; we thus omit such operators in this chapter to streamline our notations and development. The $\mathrm{diag}$ operator simply puts the given vector into a diagonal square matrix. For convolutional layers (or others), one can simply replace the corresponding $W^{(\ell)}$ with the correct slope matrix parametrization. Notice that since the employed activation functions $\sigma^{(\ell)}$, $\forall \ell \in \{1, \dots, L\}$, are piecewise affine, their derivative is piecewise constant, in particular with values $[\dot{\sigma}^{(\ell)}(\omega)]_k \in \{\alpha, 1\}$, $k \in \{1, \dots, D^{(\ell)}\}$, with $\alpha = 0$ for ReLU, $\alpha = -1$ for absolute value, and in general $\alpha > 0$ for Leaky-ReLU. We denote the collection of all the per-layer activation
derivatives $[\dot{\sigma}^{(1)}(\omega)^T, \dots, \dot{\sigma}^{(L)}(\omega)^T]^T \in \{\alpha, 1\}^{\sum_{\ell=1}^{L} D^{(\ell)}}$ as the activation pattern of the
generator. Based on the above, if one already knows the associated activation pattern
of a region $\omega$, then the matrix $A_\omega$ can be formed directly. In practice, one instead observes a sample $z \in \omega$, from which the activation pattern is obtained directly. In this case, we will slightly abuse notation and denote those known activation patterns as $\dot{\sigma}^{(\ell)}(\omega) \triangleq \dot{\sigma}^{(\ell)}(z)$, $z \in \omega$, with $\omega$ being the considered region. In a similar way, the bias vector is obtained as
$$b_\omega = \sum_{\ell=1}^{L} \left[ \left( \prod_{i=0}^{L-\ell-1} \mathrm{diag}\!\left(\dot{\sigma}^{(L-i)}(\omega)\right) W^{(L-i)} \right) \mathrm{diag}\!\left(\dot{\sigma}^{(\ell)}(\omega)\right) b^{(\ell)} \right]. \qquad (4.3)$$
As for the slope matrix $A_\omega$, the bias vector $b_\omega$ can be obtained either from a sample $z \in \omega$ or based on the known region activation pattern. Equipped with the above
notations, we can now state our first formal result characterizing the image of a DGN
regardless of its parameters and training setting.
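To make Eqs. (4.1)-(4.3) concrete, the following minimal NumPy sketch (not part of the thesis; the fully-connected ReLU generator, layer sizes, and seed are illustrative assumptions) forms $A_\omega$ and $b_\omega$ from the activation pattern observed at a sample $z$ and checks that $G(z) = A_\omega z + b_\omega$ on the region containing $z$.

```python
import numpy as np

rng = np.random.default_rng(0)
S, D = 2, 3                      # latent and ambient dimensions
widths = [S, 16, 16, D]          # D^(0)=S, ..., D^(L)=D
Ws = [rng.standard_normal((widths[l + 1], widths[l])) for l in range(len(widths) - 1)]
bs = [rng.standard_normal(widths[l + 1]) for l in range(len(widths) - 1)]

def forward_with_pattern(z):
    # Forward pass returning G(z) and the activation pattern (the sigma-dots) per layer.
    h, pattern = z, []
    for l, (W, b) in enumerate(zip(Ws, bs)):
        pre = W @ h + b
        if l < len(Ws) - 1:                  # ReLU layers
            q = (pre > 0).astype(pre.dtype)  # entries of sigma-dot^(l), here in {0, 1}
            pattern.append(q)
            h = q * pre
        else:                                # affine output layer (identity "activation")
            pattern.append(np.ones_like(pre))
            h = pre
    return h, pattern

def region_affine_params(pattern):
    # Accumulate A_omega (Eq. 4.2) and b_omega (Eq. 4.3) layer by layer.
    A, b = np.eye(S), np.zeros(S)
    for (W, bias), q in zip(zip(Ws, bs), pattern):
        A = np.diag(q) @ W @ A
        b = np.diag(q) @ (W @ b + bias)
    return A, b

z = rng.standard_normal(S)
x, pattern = forward_with_pattern(z)
A_omega, b_omega = region_affine_params(pattern)
assert np.allclose(x, A_omega @ z + b_omega)   # G(z) = A_omega z + b_omega on omega
```

The same accumulation applies to convolutional layers by replacing the dense $W^{(\ell)}$ with the corresponding slope matrix parametrization.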
Theorem 4.1 (Per-region affine subspace)
The image of a generator $G$ employing MASO layers is a continuous piecewise affine surface made of connected polytopes obtained by affine transformations of the polytopes of the input space partition $\Omega$, as in
$$\mathrm{Im}(G) \triangleq \{G(z) : z \in \mathbb{R}^S\} = \bigcup_{\omega \in \Omega} \mathrm{Aff}(\omega; A_\omega, b_\omega), \qquad (4.4)$$
with $\mathrm{Aff}(\omega; A_\omega, b_\omega) = \{A_\omega z + b_\omega : z \in \omega\}$; we will denote for conciseness $G(\omega) \triangleq \mathrm{Aff}(\omega; A_\omega, b_\omega)$, and the volume of a region $\omega \in \Omega$, denoted by $\mu(\omega)$, is related to the volume of $G(\omega)$ as per $\mu(G(\omega)) = \sqrt{\det(A_\omega^T A_\omega)}\, \mu(\omega)$ with $A_\omega$ being full-rank.
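As a quick numerical illustration of the volume relation in Thm. 4.1 (the slope matrix and region volume below are made-up values, not taken from the thesis):

```python
import numpy as np

A_omega = np.array([[1.0, 0.0],
                    [0.0, 2.0],
                    [1.0, 1.0]])                       # full-rank D x S slope matrix (D=3, S=2)
mu_omega = 0.5                                         # volume of omega in the latent space
scale = np.sqrt(np.linalg.det(A_omega.T @ A_omega))    # sqrt(det(A^T A)), the per-region volume change
print(scale * mu_omega)                                # volume of the mapped polytope G(omega)
```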
The above result is pivotal in bridging the understanding of the input space partition $\Omega$, the per-region affine mappings $A_\omega, b_\omega$, and the generator's image. We visualize Thm. 4.1 in Fig. 4.1 to make it clear that characterizing $A_\omega$ alone already provides tremendous information about the generator. This result also provides a direct answer to the problem of generating disconnected manifolds (or sets) with current DGNs. In a specific GAN setting, this was empirically shown to be impossible [Khayatkhoei et al., 2018, Tanielian et al., 2020], which aligns with Thm. 4.1: given a connected set $Z \subset \mathbb{R}^S$, the image $G(Z)$ of any DGN made of a composition of MASOs is always connected, for any depth, width, or parameter settings of $W^{(\ell)}, b^{(\ell)}, \forall \ell$. We now turn to the study of the DGN intrinsic dimension.
4.2.2 Generated Manifold Angularity
Figure 4.2: The columns represent different widths $D^{(\ell)} \in \{6, 8, 16, 32\}$ and the rows correspond to repetitions of the learning for different random initializations of the DGNs with consecutive seeds.

We now study the angularity of the generated surface, i.e., the image of $G$. Recall (from Thm. 4.1) that the per-region affine subspaces of adjacent regions are continuous, and joined at the region boundaries with a certain angle that we now characterize.
Definition 4.1 (Adjacent regions) Two regions $\omega, \omega'$ are adjacent whenever they share part of their boundary, as in $\partial\omega \cap \partial\omega' \neq \emptyset$.
The angle between adjacent affine subspaces is characterized by means of the greatest principal angle [Afriat, 1957, Bjorck and Golub, 1973], which is denoted for our study as $\theta_{\omega,\omega'}$. The following result, demonstrating how to compute such an angle, can be obtained by a direct application of the main result in Sec. 1 of Absil et al. [2006].
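For intuition, greatest principal angles can also be computed numerically from orthonormal bases of the two per-region column spaces; the sketch below is a generic SVD-based computation (in the spirit of Bjorck and Golub [1973]), not the closed form stated in Thm. 4.2, and the slope matrices are random placeholders.

```python
import numpy as np

def greatest_principal_angle(A1, A2):
    # Orthonormal bases of the column spaces spanned by the two slope matrices.
    Q1, _ = np.linalg.qr(A1)
    Q2, _ = np.linalg.qr(A2)
    # Cosines of the principal angles are the singular values of Q1^T Q2;
    # the greatest angle corresponds to the smallest singular value.
    s = np.clip(np.linalg.svd(Q1.T @ Q2, compute_uv=False), -1.0, 1.0)
    return float(np.arccos(s.min()))

rng = np.random.default_rng(1)
A_omega1 = rng.standard_normal((3, 2))   # slope matrix of region omega   (D x S)
A_omega2 = rng.standard_normal((3, 2))   # slope matrix of an adjacent region omega'
print(np.degrees(greatest_principal_angle(A_omega1, A_omega2)))
```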
Theorem 4.2 (Angularity between adjacent subspaces)
The angle $\theta_{\omega,\omega'}$ between the region mappings of two adjacent regions $\omega, \omega'$ (recall Def. 4.1) is given by