
Reframing Neural Networks: Deep Structure in Overcomplete Representations

Calvin Murdock and Simon Lucey

Abstract—In comparison to classical shallow representation learning techniques, deep neural networks have achieved superior performance in nearly every application benchmark. But despite their clear empirical advantages, it is still not well understood what makes them so effective. To approach this question, we introduce deep frame approximation, a unifying framework for representation learning with structured overcomplete frames. While exact inference requires iterative optimization, it may be approximated by the operations of a feed-forward deep neural network. We then indirectly analyze how model capacity relates to the frame structure induced by architectural hyperparameters such as depth, width, and skip connections. We quantify these structural differences with the deep frame potential, a data-independent measure of coherence linked to representation uniqueness and stability. As a criterion for model selection, we show correlation with generalization error on a variety of common deep network architectures such as ResNets and DenseNets. We also demonstrate how recurrent networks implementing iterative optimization algorithms achieve performance comparable to their feed-forward approximations. This connection to the established theory of overcomplete representations suggests promising new directions for principled deep network architecture design with less reliance on ad-hoc engineering.

1 INTRODUCTION

Representation learning has become a key component of computer vision and machine learning. In place of manual feature engineering, deep neural networks have enabled more effective representations to be learned from data for state-of-the-art performance in nearly every application benchmark. While this modern influx of deep learning originally began with the task of large-scale image recognition [1], new datasets, loss functions, and network configurations have expanded its scope to include a much wider range of applications. Despite this, the underlying architectures used to learn effective image representations have generally remained consistent across all of them. This can be seen through the quick adoption of the newest state-of-the-art deep networks from AlexNet [1] to VGGNet [2], ResNets [3], DenseNets [4], and so on. But this begs the question: why do some deep network architectures work better than others? Despite years of groundbreaking empirical results, an answer to this question still remains elusive.

(a) Deep Frame Approximation (b) Neural Network (c) Iterative Optimization Algorithm
Fig. 1. (a) Deep frame approximation is a unifying framework for multilayer representation learning where inference is posed as the constrained optimization of a multi-layer reconstruction objective. (b) The problem structure allows for effective feed-forward approximation with the activations of a standard deep neural network. (c) More accurate approximations can be found using an iterative optimization algorithm with updates implemented as recurrent feedback connections.

Fundamentally, the difficulty in comparing network architectures arises from the lack of a theoretical foundation for characterizing their generalization capacities. Shallow machine learning techniques like support vector machines [5] were aided by theoretical tools like the VC-dimension [6] for determining when their predictions could be trusted to avoid overfitting. The complexity of deep neural networks, on the other hand, has made similar analyses challenging. Theoretical explorations of deep generalization are often disconnected from practical applications and rarely provide actionable insights into how architectural hyper-parameters contribute to performance. Without a clear theoretical understanding, progress is largely driven by ad-hoc engineering and trial-and-error experimentation.

Building upon recent connections between deep learning and sparse approximation [7], [8], [9], we introduce deep frame approximation: a unifying framework for representation learning with structured overcomplete frames. These problems aim to optimally reconstruct input data through layers of constrained linear combinations of components from architecture-dependent overcomplete frames. As shown in Fig. 1, exact inference in our model amounts to finding representations that minimize reconstruction error subject to constraints, a process that requires iterative optimization. However, the problem structure allows for efficient approximate inference using standard feed-forward neural networks. This connection between the complicated nonlinear operations of deep neural networks and convex optimization provides new insights about the analysis and design of different real-world network architectures.

Specifically, we indirectly analyze practical deep network architectures like residual networks (ResNets) [3] and densely connected convolutional networks (DenseNets) [4], which have achieved state-of-the-art performance in many computer vision applications. Often very deep with skip connections across layers, these complicated network architectures typically lack convincing explanations for their specific design choices. Without clear theoretical justifications, they are instead driven by performance improvements on standard benchmark datasets like ImageNet [10]. As an alternative, we provide a novel perspective for evaluating and comparing these network architectures via the global structure of their corresponding deep frame approximation problems, as shown in Fig. 2.

(a) Chain Network (b) ResNet (c) DenseNet (d) Induced Deep Frame Approximation Structures
Fig. 2. Why are some deep neural network architectures better than others? In comparison to (a) standard chain connections, skip connections like those in (b) ResNets and (c) DenseNets have demonstrated significant improvements in parameter efficiency and generalization performance. We provide one possible explanation for this phenomenon by approximating network activations as (d) solutions to deep frame approximation problems with different induced frame structures.

Our approach is motivated by sparse approximation theory [11], a field that encompasses fundamental properties of shallow representations using overcomplete frames [12]. In contrast to feed-forward representations constructed through linear transformations and nonlinear activation functions, techniques like sparse coding seek parsimonious representations that efficiently reconstruct input data. Capacity is controlled by the number of additive components used in sparse data reconstructions. While adding more parameters allows for more accurate representations that better represent training data, it may also increase input sensitivity. As the number of components increases, the distance between them decreases, so they are more likely to be confused with one another. This may cause representations of similar data points to become very far apart, leading to poor generalization performance. This fundamental tradeoff between the capacity and robustness of shallow representations can be formalized using similarity measures like mutual coherence, the maximum magnitude of the normalized inner products between all pairs of components, which are theoretically tied to representation sensitivity and generalization [13].

Deep representations, on the other hand, have not shown the same correlation between model size and sensitivity [14]. While adding more layers to a deep neural network increases its capacity, it also simultaneously introduces implicit regularization to reduce overfitting. Attempts to quantify this effect are theoretically challenging and often lack intuition about how to design architectures that more effectively balance memorization and generalization. From the perspective of deep frame approximation, however, we can interpret this observed phenomenon simply by applying theoretical results from shallow representation learning. Additional layers induce overcomplete frames with structures that increase both capacity and effective input dimensionality, allowing more components to be spaced further apart for more robust representations. Furthermore, architectures with denser skip connections induce structures with more nonzero elements, which provides additional freedom to further reduce mutual coherence with fewer parameters as shown in Fig. 3. From this perspective, we interpret deep learning through the lens of shallow learning to gain new insights towards understanding its unparalleled performance.

(a) Chain Network (b) ResNet (c) DenseNet (d) Minimum Deep Frame Potential (e) Validation Error
Fig. 3. Parameter count is not a good indicator of generalization performance for deep networks. Instead, we compare different network architectures via the minimum deep frame potential, a lower bound on the mutual coherence of their corresponding structured frames. In comparison to (a) chain networks, the skip connections in (b) ResNets and (c) DenseNets induce Gram matrix structures with more nonzero elements allowing for (d) lower deep frame potentials across network sizes. (e) This correlates with improved parameter efficiency giving lower validation error with fewer parameters.

1.1 Contributions

In order to unify the intuitive and theoretical insights of shallow representation learning with the practical advances made possible through deep learning, we introduce the deep frame potential as a cue for model selection that summarizes the interactions between parameters in deep neural networks. As a lower bound on mutual coherence, it is tied to the generalization properties of the related deep frame approximation inference problems. However, its minimizers depend only on the frame structures induced by the corresponding network architectures. This enables a priori model comparison across disparate families of deep networks by jointly quantifying the contributions of depth, width, and layer connectivity. Instead of requiring expensive validation on a specific dataset to approximate generalization performance, architectures can then be chosen based on how efficiently they can reduce mutual coherence with respect to the number of model parameters.

In the case of fully-connected chain networks, we derive an analytic expression for the minimum achievable deep frame potential. We also provide an efficient method for frame potential minimization applicable to a general class of convolutional networks with skip connections, of which ResNets and DenseNets are shown to be special cases.

Experimentally, we demonstrate correlation with validation error across a variety of network architectures.

This paper expands upon our original work in [9], which introduced the deep frame potential as a criterion for data-independent model selection for feed-forward deep network architectures. Here, we provide a more thorough background of shallow representation learning with overcomplete tight frames to better motivate the proposed framework of deep frame approximation. While [9] only considered feed-forward approximations, we propose an iterative algorithm for exact inference extending our work in [8]. This allows us to construct recurrent versions of networks like ResNets and DenseNets that achieve similar generalization performance. Experimentally, we also compare minimum deep frame potential with generalization error on additional networks and datasets, further demonstrating its ability to predict the generalization capacity of different architectures without requiring any training data.

1.2 Related Work

In this section, we present a brief overview of shallow representation learning techniques, deep neural network architectures, some theoretical and experimental explorations of their generalization properties, and the relationships between deep learning and sparse approximation theory.

1.2.1 Component Analysis and Sparse Coding
Prior to the advent of modern deep learning, shallow representation learning techniques such as principal component analysis [15] and sparse coding [16] were widely used due to their superior performance in applications like facial recognition [17] and image classification [18]. Towards the common goal of extracting meaningful information from high-dimensional images, these data-driven approaches are often guided by intuitive goals of incorporating prior knowledge into learned representations. For example, statistical independence allows for the separation of signals into distinct generative sources [19], non-negativity leads to parts-based decompositions of objects [20], and sparsity gives rise to locality and frequency selectivity [21].

Even non-convex formulations of matrix factorization are often associated with guarantees of convergence [16], generalization [22], uniqueness [23], and even global optimality [24]. Linear autoencoders with tied weights are guaranteed to find globally optimal solutions spanning the principal eigenspace [25]. Similarly, other component analysis techniques can be equivalently expressed in a least squares regression framework for efficient learning with differentiable loss functions [26]. This unified view of undercomplete shallow representation learning has provided insights for characterizing certain neural network architectures and designing more efficient optimization algorithms for nonconvex subspace learning problems.

1.2.2 Deep Neural Network Architectures
Deep neural networks have since emerged as the preferred technique for representation learning in nearly every application. Their ability to jointly learn multiple layers of abstraction has been shown to allow for encoding increasingly complex features such as textures and object parts [27].

Due to the vast space of possible deep network architectures and the computational difficulty in training them, deep model selection has largely been guided by ad-hoc engineering and human ingenuity. Despite slowing progress in the years following early breakthroughs [28], recent interest in deep learning architectures began anew because of empirical successes largely attributed to computational advances like efficient training using GPUs and rectified linear unit (ReLU) activation functions [1]. Since then, numerous architectural modifications have been proposed. For example, much deeper networks with residual connections were shown to achieve consistently better performance with fewer parameters [3]. Building upon this, densely connected convolutional networks with skip connections between more layers yielded even better performance [4].

1.2.3 Deep Model Selection
Because theoretical explanations for effective deep network architectures are lacking, consistent experimentation on standardized benchmark datasets has been the primary driver of empirical success. However, due to slowing progress and the need for increased accessibility of deep learning techniques to a wider range of practitioners, more principled approaches to architecture search have gained traction. Motivated by observations of extreme redundancy in the parameters of trained networks [29], techniques have been proposed to systematically reduce the number of parameters without adversely affecting performance. Examples include sparsity-inducing regularizers during training [30] or post-processing to prune the parameters of trained networks [31]. Constructive approaches to model selection like neural architecture search [32] instead attempt to compose architectures from basic building blocks through tools like reinforcement learning. Efficient model scaling has also been proposed to enable more effective grid search for selecting architectures subject to resource constraints [33]. Automated techniques can match or even surpass manually engineered alternatives, but they require a validation dataset and rarely provide insights transferable to other settings.

1.2.4 Deep Learning Generalization Theory
To better understand the implicit benefits of different network architectures, there have also been theoretical explorations of deep network generalization. These works are often motivated by the surprising observation that good performance can still be achieved using highly over-parametrized models with degrees of freedom that surpass the number of training data. This contradicts many commonly accepted ideas about generalization, spurring new experimental explorations that have demonstrated properties unique to deep learning. Examples include the ability of deep networks to express random data labels [14] with a tendency towards learning simple patterns first [34].

Some promising directions towards explaining the generalization properties of deep networks are founded upon the natural stability of learning with stochastic gradient descent [35]. While this has led to explicit generalization bounds that do not rely on parameter count, they are typically only applicable for unrealistic special cases such as very wide networks [36]. Similar explanations have also arisen from the information bottleneck theory, which provides tools for analyzing architectures with respect to the optimal tradeoff between data fitting and compression due to random diffusion [37]. However, they are not causally connected to generalization performance, and similar behaviors can arise even without the implicit randomness of mini-batch stochastic gradients [38].

While exact theoretical explanations are lacking, empirical measurements of network sensitivity such as the Jacobian norm have been shown to correlate with generalization performance across different datasets [39]. Similarly, Parseval regularization [40] encourages robustness by constraining the Lipschitz constants of individual layers. This also promotes parameter diversity within layers, but unlike the deep frame potential, it does not take into account global interactions between layers.

1.2.5 Sparse Approximation
Because of the difficulty in analyzing deep networks directly, other approaches have instead drawn connections to the rich field of sparse approximation theory. The relationship between feed-forward neural networks and principal component analysis has long been known for the case of linear activations [25]. More recently, nonlinear deep networks with ReLU activations have been linked to multilayer sparse coding to prove theoretical properties of deep representations [7]. This connection has been used to motivate new recurrent architecture designs that resist adversarial noise attacks [41], improve classification performance [42], or enforce prior knowledge through output constraints [8]. We further build upon these relationships to instead provide novel explanations for networks that have already proven to be effective while providing a novel approach to selecting efficient deep network architectures.

2 SHALLOW REPRESENTATION LEARNING

Shallow representations are typically defined in terms of what they are not: deep. In contrast to deep neural networks composed of many layers, shallow learning techniques construct latent representations of data using only a single hidden layer. In neural networks, this layer typically consists of a learned linear transformation followed by an activation function. Specifically, given a data vector x ∈ R^d, a matrix of learned parameters B ∈ R^{d×k}, and a fixed nonlinear activation function φ : R^k → R^k, a k-dimensional shallow representation can be found as f(x) = φ(B^T x). However, shallow representation learning also encompasses a wide range of other methods such as principal component analysis and sparse coding that are often better suited to certain tasks. Unlike neural networks, these techniques are often equipped with simple intuition and theoretical guarantees. In this section, we survey some of these methods to elucidate the fundamental limitations of shallow representations and motivate our unifying framework for deep learning.

2.1 Undercomplete Bases

Due to the inefficiency of learning in high dimensions, alternative data representations are often employed as a means for dimensionality reduction. While high-dimensional data such as images consist of a large number of individual features (e.g. pixels), they are often highly correlated with one another. More efficient encoding schemes can disentangle this structure while preserving the information contained in data. For example, image compression algorithms often take advantage of the band-limited nature of images by representing them not as individual pixels, but as compact linear combinations of different low-frequency components from a truncated basis. In shallow representation learning, these components are instead tuned to optimize performance on a specific dataset.

2.1.1 Representation Inference by Best Approximation
Dimensionality reduction techniques approximate data points x ∈ R^d as linear combinations of k ≤ d fixed component vectors b_j for j = 1, ..., k, which form the columns of the undercomplete parameter matrix B ∈ R^{d×k}:

x \approx \sum_{j=1}^{k} w_j b_j = B w    (1)

The vectors of reconstruction coefficients w ∈ R^k serve as lower-dimensional representations found by minimizing approximation error. In the typical case of squared Euclidean distance, representations f(x) are inferred by solving the following optimization problem:

f(x) = \arg\min_{w} \tfrac{1}{2} \| x - B w \|_2^2    (2)

Because the number of component vectors is lower than their dimensionality, this problem is strongly convex and admits a unique global solution given in closed form by simple matrix operations as B^+ x, where B^+ = (B^T B)^{-1} B^T is the pseudoinverse of the rectangular matrix B. In the special case of orthogonal components where B^T B = I, the parameters B form a basis that spans a k-dimensional subspace of R^d. Furthermore, when k = d, B is complete and acts as a rotation that simply aligns data to different coordinate vectors. From Parseval's identity, this preserves the magnitudes of representations so that \| B^T x \|_2 = \| x \|_2. While the representation f(x) = B^T x now takes the form of a neural network layer with a linear activation function, it has a clear geometric interpretation as the optimal projection onto a subspace.
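To make this geometry concrete, the following minimal NumPy sketch contrasts the closed-form least-squares representation of Eq. 2 with the feed-forward form that arises when the components are orthonormal. The matrix sizes, random data, and variable names are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 8                      # assumed input dimension and number of components (k <= d)
B = rng.standard_normal((d, k))   # undercomplete parameter matrix
x = rng.standard_normal(d)        # a synthetic data point

# Eq. 2: the unique least-squares representation, B^+ x
w = np.linalg.pinv(B) @ x
print(np.linalg.norm(x - B @ w))  # residual of the projection onto span(B)

# With orthonormal columns (B^T B = I), the optimal projection reduces to a linear layer B^T x
Q, _ = np.linalg.qr(B)            # orthonormalize the columns for illustration
w_ortho = Q.T @ x
assert np.allclose(Q @ w_ortho, Q @ (np.linalg.pinv(Q) @ x))
```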

2.2 Overcomplete Frames

For many supervised applications, dimensionality reduction is not an effective goal for representation learning. On the contrary, higher-dimensional representations may be necessary to accentuate discriminative details by encouraging linear separability. In place of undercomplete bases, data may be represented using overcomplete frames with more components than dimensions. While subspace representations necessitate the loss of information, frames can span the entire space allowing for the exact reconstruction of any data point.

A frame is defined as a sequence {b_j}_{j ∈ I} of components from a Hilbert space that satisfy the following property for any x and some frame bounds 0 < A ≤ B < ∞ [12]:

A \| x \|^2 \leq \sum_{j \in I} | \langle x, b_j \rangle |^2 \leq B \| x \|^2    (3)

In other words, the sum of the squared inner products between x and every b_j does not deviate far from the squared norm of x itself. While frequently employed in the analysis of infinite-dimensional function spaces, we consider finite frames in R^d with the index set I = {1, ..., k}. In this case, we can concatenate the frame components b_j as the columns of B ∈ R^{d×k} where k > d.

2.2.1 Tight Frames
If B is a complete orthogonal basis with k = d, then it is also a frame with frame bounds A = B = 1 [43] so that:

A \| x \|_2^2 = \sum_{j=1}^{k} | \langle x, b_j \rangle |^2 = \| B^T x \|_2^2    (4)

More generally, a rich family of overcomplete frames with k > d also satisfy this property for other A = B and are denoted as tight frames. These frames are of particular interest because they behave similarly to complete orthogonal bases in that the frame operator matrix F = B B^T is the rescaled identity matrix A^{-1} I. Thus, a higher-dimensional representation f(x) = B^T x completely preserves information and allows for efficient exact reconstructions x = A B f(x). The redundancy provided by these overcomplete frame representations also enables theoretical robustness guarantees that are beneficial for certain applications such as data transmission [44].

2.2.2 The Frame Potential
Tight frames are useful in practice and can be constructed efficiently by considering the Gram matrix G = B^T B with rank r ≤ d, which contains the inner products between all combinations of frame elements. Note that for normalized frames with magnitudes constrained to have unit norm, the diagonal contains all ones and so its trace, the sum of its r nonzero eigenvalues λ_i, is fixed to be k. For tight frames, the nonzero eigenvalues are uniform with the value A^{-1}, the same as for the frame operator matrix F = B B^T. Thus, in order to have a fixed trace with uniform eigenvalues, the Gram matrix of normalized tight frames must have the minimum possible Frobenius norm. This quantity, which can be equivalently expressed as the sum of squared eigenvalues, is referred to as the frame potential [45] and is given as:

FP(B) = \sum_{i=1}^{r} \lambda_i^2 = \sum_{j=1}^{k} \sum_{j'=1}^{k} | \langle b_j, b_{j'} \rangle |^2 = \| G \|_F^2    (5)

Minimizers of the frame potential completely characterize the set of all normalized tight frames. Furthermore, despite its nonconvexity, the frame potential can be effectively optimized using gradient descent to find globally optimal local minima. For our purposes, this allows it to be naturally integrated into backpropagation training pipelines as a criterion for model selection or as a regularizer.

(a) Feed-Forward Inference (b) Optimization Inference (c) Feed-Forward Filter Responses (d) Optimal Reconstruction Coefficients

Fig. 5. An example of the “explaining away” conditional dependence provided by optimization-based inference. Sparse representations constructed by feed-forward nonnegative soft thresholding (a) have many more non-zero elements due to redundancy and spurious activations (c). On the other hand, sparse representations found by `1-penalized, nonnegative least-squares optimization (b) yield a more parsimonious set of components (d) that optimally reconstruct approximations of the data. This figure was adapted from [8].

2.3 Nonlinear Constrained Approximation

While tight frames admit feed-forward representations that can be efficiently decoded to reconstruct input data, inference for general overcomplete frames again requires solving the approximation error minimization problem in Eq. 2. However, because the number of components is greater than the dimensionality, there may be an infinite subspace of coefficients w that all exactly reconstruct any data point as x = Bw. In order to guarantee uniqueness and facilitate the effective representation of data in applications like classification, additional constraints or penalties must be included in the optimization problem. This yields nonlinear representations f(x) that cannot be computed solely using linear transformations.

(a) Simplex Projection (b) Rectified Linear Unit
Fig. 4. A comparison between (a) projection onto a simplex, which corresponds to nonnegative and ℓ1 norm constraints, and (b) the rectified linear unit (ReLU) nonlinear activation function, which is equivalent to a nonnegative soft-thresholding proximal operator.

2.3.1 Constraints, Penalties, and Proximal Operators
Constraints and penalties restrict the space of possible solutions and introduce nonlinearity that increases representation capacity. In addition to practical considerations, they can be used to encode prior knowledge for encouraging representation interpretability. For example, consider the simplex constraint set S that consists of coefficient vectors that are nonnegative with low ℓ1 norms:

S = \{ w : w \geq 0, \| w \|_1 \leq \delta \}    (6)

Nonnegativity ensures that learned components can have only additive effects on the data reconstruction. This is commonly used as a matrix factorization constraint to encourage learning more interpretable parts of objects without supervision [20]. The ℓ1 norm constraint, a convex surrogate to the ℓ0 "norm" that counts the number of nonzero coefficients, encourages sparsity. This gives data approximations that consist of relatively few components, resulting in the emergence of patterns resembling simple-cell receptive fields when trained on natural images [21]. The orthogonal projection onto this constraint set, as visualized in Fig. 4a, can be found by solving the following optimization problem:

P_S(x) = \arg\min_{w} \tfrac{1}{2} \| x - w \|_2^2 \quad \text{s.t.} \quad w \in S    (7)

We can also equivalently describe this constraint set as the penalty function Φ : R^d → R defined as:

\Phi(w) = I_{\geq 0}(w) + \lambda \| w \|_1    (8)

Here, the convex characteristic function I_{≥0} is defined to be zero when each component of its argument is nonnegative and infinity otherwise. The ℓ1 constraint radius δ is also replaced with a corresponding penalty weight λ.

Analogous to the projection operator of a constraint set, the proximal operator of a penalty function is given by the following optimization problem:

\phi(x) = \arg\min_{w} \tfrac{1}{2} \| x - w \|_2^2 + \Phi(w)    (9)

Within the field of convex optimization, these operators are used in proximal algorithms for solving nonsmooth optimization problems [46]. Essentially, these techniques work by breaking a problem down into a sequence of smaller problems that can often be solved in closed form. For example, the proximal operator corresponding to the penalty function in Eq. 8 is the projection onto a simplex and is given by the nonnegative soft-thresholding operator:

\phi(x) = P_S(x) = (x - \lambda \mathbf{1})_+    (10)

Visualized in Fig. 4b, this nonlinear function uniformly shrinks the input and clips negative values to zero, resulting in a sparse output. Note that this is equivalent to the rectified linear unit (ReLU), a nonlinear activation function commonly used in deep learning, with a negative bias of λ. Many other nonlinearities can also be interpreted as proximal operators, including parametric rectified linear units, inverse square root units, arctangent, hyperbolic tangent, sigmoid, and softmax activation functions [47]. This connection forms the basis of the close relationship between approximation-based shallow representation learning and feed-forward neural networks.

2.3.2 Iterative Optimization
The optimization problem from Eq. 2 can be adapted to include the sparsity-inducing constraints from Eq. 6 as:

f(x) = \arg\min_{w \geq 0} \tfrac{1}{2} \| x - B w \|_2^2 + \lambda \| w \|_1    (11)

Unlike feed-forward alternatives that construct representations in closed form via independent feature detectors, penalized optimization problems like this require iterative solutions. One common example is proximal gradient descent, which separates the smooth and nonsmooth components of an objective function for faster convergence. Specifically, the algorithm alternates between gradient descent on the smooth reconstruction error term and application of the nonnegative soft-thresholding proximal operator from Eq. 10. Given an initialized representation w^{[0]} and an appropriate step size γ, repeated application of the update equation in Eq. 12 below is guaranteed to converge to a globally optimal solution [48].

w^{[t]} = \phi \left( w^{[t-1]} + \gamma B^T \left( x - B w^{[t-1]} \right) \right)    (12)

When w^{[0]} = 0 and γ = 1, truncating this algorithm to a single iteration results in the standard shallow feed-forward neural network representation f(x) = φ(B^T x). Here, the proximal operator φ takes on the role of a nonlinear activation function. When the parameters B are orthogonal, this feed-forward computation, which is commonly referred to as thresholding pursuit [7], gives the exact solution of Eq. 11. More generally, parameters that are close to orthogonal lead to faster convergence and more accurate feed-forward approximations.

In this case, additional iterations introduce conditional dependence between features to represent data using redundant frame components. This phenomenon is commonly referred to as "explaining away" within the context of graphical models [49]. An example of this effect is shown in Fig. 5, which compares sparse representations constructed using a feed-forward neural network layer with those given by iterative optimization. When components in an overcomplete set of features have high correlation with an image, constrained optimization introduces competition between them, resulting in more parsimonious representations.

If certain conditions are met, however, these two seemingly different approaches can yield very similar results. At the same time, they can also enable theoretical guarantees such as uniqueness and robustness, which are beneficial for effective representation learning.
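The contrast between single-pass thresholding and iterative inference can be sketched numerically. The NumPy example below uses assumed problem sizes, an assumed penalty weight, and synthetic data; it implements the update of Eq. 12 with a Lipschitz-based step size and compares the result against the one-step feed-forward approximation, which typically retains more active components, illustrating the explaining-away effect of Fig. 5.

```python
import numpy as np

def relu(v, bias):
    """Nonnegative soft thresholding (Eq. 10): a ReLU with a negative bias."""
    return np.maximum(v - bias, 0.0)

def feed_forward(x, B, lam):
    """Single-step thresholding pursuit: the standard shallow network layer."""
    return relu(B.T @ x, lam)

def prox_grad(x, B, lam, iters=200):
    """Proximal gradient descent for Eq. 11 (nonnegative, l1-penalized reconstruction)."""
    step = 1.0 / np.linalg.norm(B, 2) ** 2        # step size from the Lipschitz constant
    w = np.zeros(B.shape[1])
    for _ in range(iters):
        w = relu(w + step * (B.T @ (x - B @ w)), step * lam)   # update of Eq. 12
    return w

rng = np.random.default_rng(0)
d, k, lam = 32, 64, 0.1                           # assumed sizes and penalty weight
B = rng.standard_normal((d, k))
B /= np.linalg.norm(B, axis=0)
x = B @ relu(rng.standard_normal(k), 1.0)         # data generated from a sparse nonnegative code

w_ff, w_opt = feed_forward(x, B, lam), prox_grad(x, B, lam)
print(np.count_nonzero(w_ff), np.count_nonzero(w_opt))       # optimization "explains away"
print(np.linalg.norm(x - B @ w_ff), np.linalg.norm(x - B @ w_opt))
```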

2.4 Sparse Approximation Theory

Sparse approximation theory concerns the properties of data representations given as sparse linear combinations of components from overcomplete frames. While the number of components k is typically far greater than the dimensionality d, the ℓ0 "norm" ‖w‖_0 of the representation is restricted to be small. Because this quantity is nonconvex and difficult to optimize directly, sparsity is often achieved in practice using greedy selection algorithms like orthogonal matching pursuit [50] or convex sparsity-inducing constraints like the ℓ1 norm [13] or nonnegativity [51].

Through applications like compressed sensing [52], sparsity has been found to exhibit theoretical properties that enable data representation with efficiency far greater than what was previously thought possible. Central to these results is the requirement that the frame be "well-behaved," essentially ensuring that none of its columns are too similar. For undercomplete frames with k ≤ d, this is satisfied by enforcing orthogonality, but overcomplete frames require other conditions. Specifically, we focus our attention on the mutual coherence µ, the maximum magnitude normalized inner product of all pairs of frame components. This is the maximum magnitude off-diagonal element in the Gram matrix G = B̃^T B̃, where the columns of B̃ are normalized:

\mu(B) = \max_{i \neq j} \frac{ | b_i^T b_j | }{ \| b_i \| \, \| b_j \| } = \max_{i,j} | (G - I)_{ij} |    (13)

(a) High Coherence (b) Minimum Coherence (c) Zero Coherence
Fig. 6. Mutual coherence, visualized in red, of different frames in R^2. (a) When mutual coherence is high, some vectors may be very similar, leading to representation instability and sensitivity to noise. (b) Mutual coherence is minimized at the Welch bound for equiangular tight frames where all vectors are equally distributed. (c) Mutual coherence is zero in the case of orthogonality, which cannot occur for overcomplete frames.

Fig. 6 visualizes the mutual coherence of different overcomplete frames compared alongside an orthogonal basis.

Because they both measure pairwise interactions between the components of an overcomplete frame, mutual coherence and frame potential are closely related. The frame potential of a normalized frame can be used to compute the root mean square off-diagonal element of the Gram matrix, which provides a lower bound on the mutual coherence, the maximum magnitude off-diagonal element:

\mu(B) \geq \sqrt{ \frac{ FP(\tilde{B}) - \mathrm{Tr}(G) }{ N(G) } }    (14)

Here, N(G) is the number of nonzero elements in the Gram matrix and Tr(G) = k is constant since the columns of B̃ are normalized. Equality between mutual coherence and this averaged frame potential is met in the case of normalized equiangular tight frames, where all off-diagonal elements in the Gram matrix are equivalent.

2.4.1 Uniqueness and Robustness Guarantees
A model's capacity for low mutual coherence increases along with its capacity for both memorizing more training data through unique representations and generalizing to more validation data through robustness to input perturbations. If representations from a mutually incoherent frame are sufficiently sparse, then they are guaranteed to be optimally sparse and unique [13]. Specifically, if the number of nonzero coefficients ‖w‖_0 < (1/2)(1 + µ^{-1}), then w is the unique, sparsest representation for x. Furthermore, if ‖w‖_0 < (√2 − 0.5) µ^{-1}, then it can be found efficiently by convex optimization with ℓ1 regularization. Thus, minimizing the mutual coherence of a frame increases its capacity for uniquely representing data points.

Sparse representations are also robust to input perturbations [53]. Specifically, given a noisy data point x = x_0 + z where x_0 can be represented exactly as x_0 = B w_0 with ‖w_0‖_0 ≤ (1/4)(1 + µ^{-1}) and the noise z has bounded magnitude ‖z‖_2 ≤ ε, then w_0 can be approximated by solving the ℓ1-penalized LASSO problem:

\arg\min_{w} \tfrac{1}{2} \| x - B w \|_2^2 + \lambda \| w \|_1    (15)

Its solution is stable and the approximation error is bounded from above in Eq. 16, where δ(x, λ) is a constant.

\| w - w_0 \|_2^2 \leq \frac{ \left( \varepsilon + \delta(x, \lambda) \right)^2 }{ 1 - \mu \left( 4 \| w \|_0 - 1 \right) }    (16)

Thus, minimizing the mutual coherence of a frame decreases the sensitivity of its sparse representations for improved robustness. This is similar to evaluating input sensitivity using the Jacobian norm [39]. However, instead of estimating the average perturbation error over validation data, it bounds the worst-case error over all possible data.

While most theoretical results rely on explicit sparse regularization, coherence has also been found to play a key role in other overcomplete representations. For example, nonnegativity can be sufficient to guarantee unique solutions of incoherent underdetermined systems [51], formulations of overcomplete independent component analysis often minimize coherence to promote parameter diversity [54], and similar regularizers have been employed to encourage orthogonality in deep neural network layers for reduced Lipschitz constants and improved generalization [40]. This suggests that coherence control may be important even in cases when the sparsity levels required by known theoretical guarantees are not met in practice.
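The coherence quantities of Eqs. 13 and 14 are straightforward to evaluate in practice. The sketch below, using a random Gaussian frame with assumed dimensions, computes the mutual coherence and the averaged frame-potential lower bound for comparison.

```python
import numpy as np

def mutual_coherence(B):
    """Eq. 13: largest magnitude off-diagonal entry of the normalized Gram matrix."""
    Bn = B / np.linalg.norm(B, axis=0)
    G = Bn.T @ Bn
    return np.max(np.abs(G - np.eye(G.shape[1])))

def averaged_lower_bound(B):
    """Eq. 14 for a dense frame: the RMS off-diagonal Gram entry lower-bounds the coherence."""
    Bn = B / np.linalg.norm(B, axis=0)
    G = Bn.T @ Bn
    k = G.shape[1]
    fp = np.sum(G ** 2)                        # frame potential of the normalized frame
    return np.sqrt((fp - k) / (k * (k - 1)))   # N(G) = k(k - 1) off-diagonal entries

rng = np.random.default_rng(0)
d, k = 16, 64                                  # assumed dimensions for illustration
B = rng.standard_normal((d, k))
print(averaged_lower_bound(B), mutual_coherence(B))   # the bound never exceeds the coherence
```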

2.4.2 The Welch Bound
The uniqueness and robustness of overcomplete representations can both be improved with lower mutual coherence. Thus, one way of approximating a model's capacity for representations with effective memorization and generalization properties is through a lower bound on its minimum achievable mutual coherence. While this minimum value may not be necessary for effective representation learning with a particular dataset, it does provide a worst-case, data-agnostic means for comparison.

Recall that mutual coherence and frame potential both attain minimum values of zero in the case of orthogonal frames. For normalized overcomplete frames with more components than dimensions, the minimum frame potential is a positive constant that provides a lower bound on the minimum mutual coherence, where equality is met in the case of equiangular frames. This value, typically denoted as the Welch bound, was originally introduced in the context of bounding the effectiveness of error-correcting codes [55].

To derive this tight lower bound on the mutual coherence of a normalized frame B̃ ∈ R^{d×k}, we first apply the Cauchy-Schwarz inequality to find a lower bound on the frame potential, the sum of the squared eigenvalues λ_i of the positive semidefinite Gram matrix G = B̃^T B̃:

FP(\tilde{B}) = \sum_{i=1}^{d} \lambda_i^2 \geq \frac{1}{d} \left( \sum_{i=1}^{d} \lambda_i \right)^2 = \frac{ \mathrm{Tr}(G)^2 }{ d } = \frac{ k^2 }{ d }    (17)

For normalized frames, Tr(G) = k. Furthermore, the number of nonzero off-diagonal elements is N(G) = k(k − 1). From Eq. 14, the Welch bound is then given as:

\mu(B) \geq \sqrt{ \frac{ k d^{-1} - 1 }{ k - 1 } }    (18)

Note that for a fixed input dimensionality d, as the number of parameters increases with additional components k, the minimum achievable mutual coherence also increases.

2.4.3 Structured Convolutional Frames
For complicated, high-dimensional images, sparse representation with overcomplete frames is generally not possible. Instead, data can be broken into smaller overlapping patches that are represented using shared parameters with localized spatial support. These local representations may then be concatenated together to form global image representations. This is accomplished in convolutional sparse coding by replacing the frame matrix multiplication in Eq. 11 with a linear operator that convolves a set of filters over coefficient feature maps to approximately reconstruct images [56]. This operator can equivalently be represented as multiplication by a matrix with repeating implicit structures of nonzero parameters as visualized in Fig. 7.

(a) Convolutional Frame (b) Permuted Convolutional Frame (c) Convolutional Gram Matrix (d) Permuted Convolutional Gram Matrix
Fig. 7. A visualization of a one-dimensional convolutional frame with two input channels, five output channels, and a filter size of three. (a) The filters are repeated over eight spatial dimensions resulting in (b) a block-Toeplitz structure that is revealed through row and column permutations. (c) The corresponding Gram matrix can be efficiently computed by (d) repeating local filter interactions.

Due to the repeated local structure of convolutional frames, more efficient inference algorithms are possible [57] and theoretical guarantees similar to those in Sec. 2.4.1 can be derived requiring only local sparsity independent of input dimensionality [58]. The frame potential, mutual coherence, and Welch bound may also be computed efficiently by considering only local interactions as shown in Fig. 7.

Consider the 2-dimensional convolution of a p × p input with d channels using k filters of size f × f with q < p. Padding the input with a border of zeros is used to ensure size consistency while a convolution stride of s ≥ 1 gives filter overlaps of o = ⌈f/s⌉ for an output of size q × q where q = ⌈p/s⌉. The normalized convolutional frame B̃ ∈ R^{dp^2 × kq^2} has a Gram matrix with trace Tr(G) = kq^2. Due to its sparse repeating structure, the total number of nonzero off-diagonal elements is:

N(G) = k \left( k \left( o (2q - o + 1) - q \right)^2 - q^2 \right)    (19)

Using the lower bound from Eq. 17, the frame potential can be bounded as FP(B) ≥ q^4 k^2 p^{-2} d^{-1}. By plugging these values into Eq. 14, a lower bound on the mutual coherence can be found as:

\mu(B) \geq \sqrt{ \frac{ k d^{-1} s^{-2} - 1 }{ k \left( \left( 2 - (o - 1) s p^{-1} \right) o - 1 \right)^2 - 1 } }    (20)

While this bound depends on the size of the input p and the convolution stride s, the limit with increasing spatial dimensionality p → ∞ for s = 1 is the standard multichannel aperiodic bound from [55]:

\mu(B) \geq \sqrt{ \frac{ k d^{-1} - 1 }{ k (2f - 1)^2 - 1 } }    (21)

While the minimum mutual coherence of a convolutional frame is greater than that of a dense frame with the same size due to parameter sharing and its fixed structure of zeros, far fewer parameters are required. This allows overcomplete representations of high-dimensional images to be more effectively learned from data. However, these representations are still fundamentally limited since mutual coherence must increase as the number of filters increases.

3 DEEP REPRESENTATION LEARNING

Deep representations are constructed simply as the composition of multiple feed-forward shallow representations. For a simple chain-structured deep network with l layers, an image x ∈ R^d is passed through a composition of alternating linear transformations with parameters B_j ∈ R^{k_{j-1} × k_j} with k_0 = d and fixed nonlinear activation functions φ_j for layers j = 1, ..., l as follows:

f(x) = \phi_l \left( B_l^T \cdots \phi_2 \left( B_2^T \phi_1 \left( B_1^T x \right) \right) \cdots \right)    (22)

By allocating parameters across multiple nonlinear layers, representational capacity can be increased without sacrificing generalization performance. These parameters may then be learned from data with arbitrary loss functions using stochastic gradient descent and backpropagation.

More complicated architectures such as ResNets and DenseNets can also be constructed by adding or concatenating previous outputs together, which introduces skip connections between layers. Despite their straightforward construction, analyzing the implicit regularization provided by different deep network architectures to prevent overfitting has proven to be challenging.

3.1 Multilayer Convolutional Sparse Coding

Building upon the connection between feed-forward shallow representations and thresholding pursuit for approximate sparse coding, deep neural network analysis may be simplified by accumulating the behavior of individual layers, as discussed in further detail by Papyan et al. [7]. Inference is then considered as the solution of a sequence of shallow sparse coding problems instead of the feed-forward inference function f(x) from Eq. 22. Specifically, the analogous multilayer sparse coding problem is:

\text{Find } \{ w_j \}_{j=1}^{l} \ \text{ s.t. } \ \| w_{j-1} - B_j w_j \|_2^2 \leq E_{j-1}, \ \ \Psi(w_j) \leq \delta_j, \ \ \forall j = 1, \ldots, l    (23)

Here, w_0 = x and w_j for j = 1, ..., l are inferred latent variables corresponding to layers parameterized by B_j. The inference problem amounts to jointly finding the outputs of each layer w_j such that the error in approximating the previous layer is at most E_{j-1} and they are sufficiently sparse as measured by the function Ψ with a threshold δ_j. For convolutional sparse coding, the sparsity function may be locally relaxed to allow for invariance to input size [58].

Assuming that the parameters have been learned such that solutions exist for a particular dataset, the final layer output representations f(x) = w_l satisfy certain uniqueness and robustness guarantees given sufficiently low mutual coherence at each layer [7].

3.1.1 Approximate Inference
The problem in Eq. 23 may be approximated by the composition of layered thresholding pursuit algorithms with feed-forward deep neural network operations:

w_j := \phi_{\lambda_j} \left( B_j^T w_{j-1} \right), \quad \forall j = 1, \ldots, l    (24)

Here, nonlinear activation functions φ are interpreted as proximal operators corresponding to sparsity-inducing penalty functions Φ. Because the sparsity constraint functions Ψ from Eq. 23 used in theoretical analyses are difficult to optimize directly, convex surrogates like those from Sec. 2.3.1 may be used instead as long as the penalty weights λ_j are chosen to satisfy the required sparsity level. This allows standard convolutional networks with sparsity-inducing activation functions like ReLU to be analyzed under the framework of multilayer sparse coding.

Other optimization algorithms may also be used to better approximate the solution to Eq. 23. In layered basis pursuit, inference is posed as a sequence of shallow sparse coding problems. Specifically, a deep representation is given by composing the solutions for each layer j = 1, ..., l:

w_j := \arg\min_{w_j} \tfrac{1}{2} \| w_{j-1} - B_j w_j \|_2^2 + \lambda_j \Phi(w_j)    (25)

As discussed in Sec. 2.3.2, these shallow problems may be solved by iterative optimization algorithms like proximal gradient descent. In practice, multiple iterations are implemented using ordinary neural network operations. In comparison to standard feed-forward networks, these recurrent architectures provide improved theoretical robustness guarantees and reduced sensitivity to adversarial input perturbations [41].

Interpreting deep networks as compositions of approximate sparse coding algorithms provides some insights into their robustness and generalization properties, but the theory does not generally apply to models typically used in practice. The error accumulation used in the analysis applies only to chain-structured compositions of layers, and it does not effectively explain the empirical successes of increasingly deeper networks. Furthermore, discriminative training often leads to networks with good generalization performance despite relatively high mutual coherence and low sparsity.

3.2 Deep Frame Approximation

Deep frame approximation is an extension of multilayer sparse coding where inference is posed not as a sequence of layer-wise problems, but as a single global problem with structure induced by the corresponding network architecture. This formulation was first introduced in Deep Component Analysis with the motivation of iteratively enforcing prior knowledge with optimization constraints [8]. From the perspective of deep neural networks, it can be derived by relaxing the implicit assumption that layer inputs are constrained to be the outputs of previous layers. Instead of being computed in sequence, the latent variables w_j ∈ R^{k_j} representing the outputs of all layers j = 1, ..., l are inferred jointly as shown in Eq. 26 below, where w_0 = x.

\arg\min_{\{ w_j \}} \sum_{j=1}^{l} \tfrac{1}{2} \| w_{j-1} - B_j w_j \|_2^2 + \Phi_j(w_j)    (26)

While we only consider the non-negative sparsity-inducing penalties Φ_j from Sec. 2.3.1 corresponding to ReLU activation functions, other convex regularizers may also be used.

The compositional constraints between adjacent layers from Eq. 25 have been relaxed and the reconstruction error penalties are combined into a single convex, nonnegative sparse coding problem. By combining the terms in the summation of Eq. 26 together, this problem can be equivalently represented as the structured problem in Eq. 27 below. The latent variables w_j for each layer are stacked in the vector w, the regularizer Φ(w) = Σ_j Φ_j(w_j), and the input x is augmented by padding it with zeros. The layer parameters B_j ∈ R^{k_{j-1} × k_j} are blocks in the induced global frame B, which has Σ_j k_{j-1} rows and Σ_j k_j columns.

\arg\min_{w} \ \tfrac{1}{2} \left\| \begin{bmatrix} x \\ 0 \\ \vdots \\ 0 \end{bmatrix} - \underbrace{\begin{bmatrix} B_1 & 0 & & \\ -I & B_2 & \ddots & \\ & \ddots & \ddots & 0 \\ & & -I & B_l \end{bmatrix}}_{B} \underbrace{\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_l \end{bmatrix}}_{w} \right\|_2^2 + \Phi(w)    (27)

The frame B has a structure of nonzero elements that summarizes the corresponding feed-forward deep network architecture wherein off-diagonal identity matrices represent connections between adjacent layers. Because the data is augmented with zeros, increasing its dimensionality, the feed-forward shallow thresholding pursuit approximation is ineffective, always giving w_j = 0 for j > 1. However, due to its lower-block-triangular structure, Eq. 27 can be effectively approximated by considering each layer in sequence using the same feed-forward operations from Eq. 24.
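The induced structure of Eq. 27 is easy to assemble explicitly for small fully-connected chain networks. The sketch below (two random layers with assumed sizes, written in NumPy) builds the global frame, runs the layer-wise thresholding-pursuit approximation of Eq. 24, and measures the reconstruction penalty of Eq. 27 incurred by the feed-forward approximation.

```python
import numpy as np

def relu(v, lam):
    return np.maximum(v - lam, 0.0)

def build_chain_frame(Bs):
    """Assemble the induced global frame of Eq. 27 for a chain network: layer parameters
    B_j on the block diagonal, -I coupling adjacent layers (assumes B_j.shape[1] == B_{j+1}.shape[0])."""
    rows = sum(B.shape[0] for B in Bs)
    cols = sum(B.shape[1] for B in Bs)
    F = np.zeros((rows, cols))
    r = c = 0
    for j, B in enumerate(Bs):
        F[r:r + B.shape[0], c:c + B.shape[1]] = B
        if j + 1 < len(Bs):
            F[r + B.shape[0]:r + B.shape[0] + B.shape[1], c:c + B.shape[1]] = -np.eye(B.shape[1])
        r += B.shape[0]
        c += B.shape[1]
    return F

def feed_forward(x, Bs, lam=0.1):
    """Layer-wise thresholding pursuit (Eq. 24): the standard feed-forward network."""
    ws, w = [], x
    for B in Bs:
        w = relu(B.T @ w, lam)
        ws.append(w)
    return ws

rng = np.random.default_rng(0)
Bs = [rng.standard_normal((16, 32)), rng.standard_normal((32, 64))]   # assumed layer sizes
x = rng.standard_normal(16)

F = build_chain_frame(Bs)
print(F.shape)                                       # (16 + 32) x (32 + 64)
w = np.concatenate(feed_forward(x, Bs))
x_aug = np.concatenate([x, np.zeros(F.shape[0] - x.size)])
print(np.linalg.norm(x_aug - F @ w))                 # residual of the zero-padded reconstruction
```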

frame components. Thus, depth allows model complexity interpret the activations in Eq. 31 as approximate solutions to scale jointly alongside effective input dimensionality so to the deep frame approximation problem:

that the induced frame structures have the capacity for l low mutual coherence and representations with improved 1 2 X arg min 2 x − B1w1 2 + Φj(wj) (32) memorization and generalization. {wj } j=1 bl/2c X 2 2 3.2.1 Architecture-Induced Frame Structure 1 T T + 2 w2j−B2jw2j-1 2 + w2j+1−w2j-1−B2j+1w2j 2 This model formulation may be extended to accommodate j=1 more complicated network architectures with skip connec- This results in the induced frame structure of Eq. 28 with tions. This allows for the direct comparison of problem Bjj = I for j > 1, Bjk = 0 for j > k + 1, Bjk = 0 for j > k structures induced by different families of architectures such with odd k, and Bjk = I for j > k with even k. as ResNets and DenseNets via their capacity for achieving incoherent global frames. 3.2.3 Densely Connected Convolutional Networks Mutual coherence is computed from normalized frame Building upon the empirical successes of residual networks, components, so it can be lowered simply by increasing the densely connected convolution networks [4] incorporate number of nonzero elements, which reduces their pair-wise skip connections between earlier layers as well. This is inner products. This may be interpreted as providing more shown in Eq. 33 where the transformation B of concate- degrees of freedom to spread out by removing implicit j nated variables [w ] for k = 1, . . . , j − 1 is equivalently lower-dimensional subspace constraints. In Eq. 28 below, k k written as the summation of smaller transformations B . we replace the identity connections of Eq. 27 with blocks jk j−1 of nonzero parameters.     : T j−1 X T   wj = φj Bj [wk]k=1 = φj Bjkwk (33) B 0 ··· 0 k=1  11   . .  These activations again provide approximate solutions to −BT B .. .   21 22  the problem in Eq. 30 with the induced frame structure of B =  .  (28)  . .. ..  B = I j > 1 B  . . . 0  Eq. 28 where jj for and the lower blocks jk   j > k T T for are all filled with learned parameters. −Bl1 · · · −Bl(l−1) Bll Skip connections enable effective learning in much This frame structure corresponds to the global deep frame deeper networks than chain-structured alternatives. While approximation inference problem: originally motivated from the perspective of improving optimization by introducing shortcuts [3], adding more l connections between layers was also shown to improve gen- 1 2 X arg min 2 x − B11w1 2 + Φj(wj) eralization performance [4]. As compared in Fig. 3, denser {wj } j=1 j−1 (29) skip connections induce frame structures with denser Gram l 2 X 1 X T matrices allowing for lower mutual coherence. This suggests + 2 Bjjwj − Bjkwk 2 j=2 k=1 that architectures’ capacities for low validation error can be quantified and compared based on their capacities for Because of the block-lower-triangular structure, following inducing frames with low minimum mutual coherence. T the first layer activation w1 := φ1(B11x), inference can again be approximated by the composition of feed-forward 3.2.4 Global Iterative Inference thresholding pursuit activations: As with multilayer sparse coding, inference approximation j−1 accuracy for deep frame approximation can be improved  T X T  wj := φj Bj Bjkwk ∀j = 2, . . . , l (30) with recurrent connections implementing an optimization k=1 algorithm. However, instead of composing the solutions In comparison to Eq. 
26, additional parameters introduce for each layer in succession, the latent variables of all layers must be jointly inferred by solving a single global skip connections between layers so that the activations wj of layer j now depend on those of all previous layers k < j. optimization problem. In deep component analysis [8], the alternating direction method of multipliers was used, but 3.2.2 Residual Networks this required unrealistic assumptions for efficient iterations. While standard proximal gradient descent could be applied These connections are similar to the identity mappings in to the concatenated variables with the global frame, this residual networks [3], which introduce dependence between would be inefficient since it does not take advantage of the the activations of alternating pairs of layers. Specifically, lower-block-triangular structure. : T after an initial layer w1 = φ1(B1 x), the residual layers Instead, we use prox-linear block coordinate de- are defined as follows for even j = 2, 4, . . . , l − 1: scent [59], which cyclically updates subsets of variables T T corresponding to each layer. Viewed as a recurrent neu- wj := φj(Bj wj−1), wj+1 := φj+1(wj−1 + Bj+1wj) (31) ral network, these operations implement feedback connec- In comparison to chain networks, no additional parameters tions to ensure consistency between layers. This algorithm are required; the only difference is the addition of wj−1 was successfully applied to a similar problem for chain- in the argument of φj+1. As a special case of Eq. 30, we structured networks in Chodosh et al. [60]. 11 2 Pl 2 A single iteration of this algorithm incrementally up- frame B are Nj = i=j Cij. The full normalized frame dates the latent variables for each layer as: can then be found as B˜ = BN−1 where the matrix N is [t] [t−1] [t] block diagonal with Nj as its blocks. The blocks of the Gram w = φ w − γ g , ∀j = 1, . . . , l (34) j j j j j matrix G = B˜ TB˜ are then given as: These updates take the same form as proximal gradient l [t] X −1 T −1 descent with step sizes γj, where the gj is defined for Gjj0 = Nj BijBij0 Nj0 (36) networks with skip connections using the general global i=j0 structured frame from Eq. 28 as: 0 For chain networks, Gjj0 6= 0 only when j = j + 1, which [t] ∂ [t] [t] [t−1] [t−1] represents the connections between adjacent layers. In this gj = `(w1 ,..., wj−1, wj, wj+1 ,..., wl ) (35) ∂wj case, the blocks can be simplified as: 1 1 j−1 l 2 − 2 T 2 − 2 T X T [t] X  T Gjj = (Cj + I) (Bj Bj + I)(Cj + I) (37) = Bj Bjwj − Bjkwk + Bj0j Bj0jwj 1 1 2 − 2 2 − 2 k=1 j0=j+1 Gj(j+1) = −(Cj + I) Bj+1(Cj+1 + I) (38) 0 T j−1 j −1 Gll = Bl Bl (39) X T [t] X T [t−1] [t−1] + B 0 0 w 0 + B 0 0 w 0 − B 0 w 0 j k k j k k j j Because the diagonal is removed in the deep frame k0=1 k0=j+1 potential computation, the contribution of Gjj is simply a This is the partial gradient of the smooth reconstruction rescaled version of the local frame potential of layer j. The error term ` of the loss function from Eq. 27. Variables contribution of Gj(j+1), on the other hand, can essentially from previous layers are updated immediately in the current be interpreted as rescaled `2 weight decay where rows are iteration facilitating faster convergence. The step sizes γj weighted more heavily if the corresponding columns of may be selected automatically by bounding the Lipschitz the previous layer’s parameters have higher magnitudes. constants of the corresponding gradients [60]. 
This optimization algorithm is unrolled to a fixed number of iterations for improved inference approximation accuracy. The resulting recurrent networks achieve similar performance to feed-forward approximations when trained on discriminative tasks like image classification. This suggests similar representational capacities and motivates the indirect analysis of deep neural networks within the framework of deep frame approximation. In addition, while the computational overhead of recurrent inference may be impractical in these cases, it can more effectively enforce other constraints [8] and improve adversarial robustness [61].

3.3 The Deep Frame Potential

We propose to use lower bounds on the mutual coherence of induced structured frames for the data-independent comparison of feed-forward deep network architecture capacities. Note that while one-sided coherence is better suited to nonnegativity constraints, it has the same lower bound [51]. Directly optimizing mutual coherence from Eq. 13 is difficult due to its piecewise structure. Instead, we consider a tight lower bound by replacing the maximum off-diagonal element of the deep frame Gram matrix with the mean. This gives the lower bound from Eq. 14, a normalized version of the frame potential FP(B) = \|G\|_F^2 from Eq. 5 that is strongly convex and can be optimized more effectively [45]. Due to the block-sparse structure of the induced frames from Eq. 28, we evaluate the frame potential in terms of local blocks G_{jj'} \in \mathbb{R}^{k_j \times k_{j'}} that are nonzero only if layer j is connected to layer j'.

To compute the Gram matrix, we first need to normalize the global induced frame B from Eq. 28. By stacking the column magnitudes of layer j as the elements in the diagonal matrix C_j = \mathrm{diag}(c_j) \in \mathbb{R}^{k_j \times k_j}, the normalized parameters can be represented as \tilde{B}_{ij} = B_{ij} C_j^{-1}. Similarly, the squared norms of the full set of columns in the global frame B are N_j^2 = \sum_{i=j}^{l} C_{ij}^2. The full normalized frame can then be found as \tilde{B} = B N^{-1}, where the matrix N is block diagonal with N_j as its blocks. The blocks of the Gram matrix G = \tilde{B}^T \tilde{B} are then given as:

G_{jj'} = \sum_{i=j'}^{l} N_j^{-1} B_{ij}^T B_{ij'} N_{j'}^{-1}    (36)

For chain networks, G_{jj'} \neq 0 only when j' = j + 1, which represents the connections between adjacent layers. In this case, the blocks can be simplified as:

G_{jj} = (C_j^2 + I)^{-\frac{1}{2}} (B_j^T B_j + I) (C_j^2 + I)^{-\frac{1}{2}}    (37)
G_{j(j+1)} = -(C_j^2 + I)^{-\frac{1}{2}} B_{j+1} (C_{j+1}^2 + I)^{-\frac{1}{2}}    (38)
G_{ll} = \tilde{B}_l^T \tilde{B}_l    (39)

Because the diagonal is removed in the deep frame potential computation, the contribution of G_{jj} is simply a rescaled version of the local frame potential of layer j. The contribution of G_{j(j+1)}, on the other hand, can essentially be interpreted as rescaled \ell_2 weight decay where rows are weighted more heavily if the corresponding columns of the previous layer's parameters have higher magnitudes. Furthermore, since the global frame potential is averaged over the total number of nonzero elements in G, if a layer has more parameters, then it will be given more weight in this computation. For general networks with skip connections, the summation from Eq. 36 has additional terms that introduce more complicated interactions. In these cases, it cannot be evaluated from local properties of layers.

Essentially, the deep frame potential summarizes the structural properties of the global frame B induced by a deep network architecture by balancing interactions within each individual layer, through local coherence properties, and between connecting layers.
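A small NumPy sketch of this computation is given below. Since Eq. 14 is not restated here, it assumes the deep frame potential is the mean squared off-diagonal entry of the column-normalized Gram matrix taken over its structurally nonzero positions, as described above; the helpers assemble the dense global frame of Eq. 28 from the same block dictionary used in the earlier sketches, and the function names are illustrative.

import numpy as np

def global_frame(B):
    # Dense assembly of the induced frame of Eq. 28: +B_jj on the diagonal
    # and -B_jk^T below it (identity, learned, or zero blocks depending on
    # the architecture).
    l = max(j for j, _ in B)
    heights = [B[(j, j)].shape[0] for j in range(1, l + 1)]
    widths = [B[(j, j)].shape[1] for j in range(1, l + 1)]
    r = np.concatenate(([0], np.cumsum(heights)))
    c = np.concatenate(([0], np.cumsum(widths)))
    A = np.zeros((r[-1], c[-1]))
    for (j, k), blk in B.items():
        A[r[j - 1]:r[j], c[k - 1]:c[k]] = blk if j == k else -blk.T
    return A

def deep_frame_potential(B):
    # Assumed form of Eq. 14: average squared off-diagonal entry of the
    # normalized Gram matrix over its structurally nonzero entries.
    A = global_frame(B)
    A = A / np.linalg.norm(A, axis=0, keepdims=True)    # unit-norm columns
    G = A.T @ A
    pattern = (np.abs(A) > 0).astype(float)
    support = (pattern.T @ pattern) > 0                 # structurally nonzero entries of G
    off = ~np.eye(G.shape[0], dtype=bool)
    return np.sum((G * off) ** 2) / np.count_nonzero(support & off)

# Example: the same three layers with and without a skip connection.
rng = np.random.default_rng(0)
d, k1, k2, k3 = 16, 32, 24, 20
chain = {(1, 1): rng.standard_normal((d, k1)),
         (2, 2): np.eye(k2), (2, 1): rng.standard_normal((k1, k2)),
         (3, 3): np.eye(k3), (3, 2): rng.standard_normal((k2, k3))}
skip = {**chain, (3, 1): rng.standard_normal((k1, k3))}
print(deep_frame_potential(chain), deep_frame_potential(skip))

Here the random parameters are only for illustration; the architecture comparisons in the paper use the minimum of this quantity over parameter values.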
3.3.1 Theoretical Lower Bound for Chain Networks

While the deep frame potential is a function of parameter values, its minimum value is determined only by the architecture-induced frame structure. Furthermore, we know that it must be bounded by a nonzero constant for overcomplete frames. In this section, we derive this lower bound for the special case of fully-connected chain networks and provide intuition for why skip connections increase the capacity for low mutual coherence.

First, observe that a lower bound for the Frobenius norm of G_{j(j+1)} from Eq. 38 cannot be readily attained because the rows and columns are rescaled independently. This means that a lower bound for the norm of G must be found by jointly considering the entire matrix structure, not simply through the summation of its components. To accomplish this, we instead consider the matrix F = \tilde{B}\tilde{B}^T, which is full rank and has the same norm as G:

\|G\|_F^2 = \|F\|_F^2 = \sum_{j=1}^{l} \|F_{jj}\|_F^2 + 2\sum_{j=1}^{l-1} \|F_{j(j+1)}\|_F^2    (40)

We can then express the individual blocks of F as:

F_{11} = B_1 (C_1^2 + I_{k_1})^{-1} B_1^T    (41)
F_{jj} = B_j (C_j^2 + I)^{-1} B_j^T + (C_{j-1}^2 + I)^{-1}    (42)
F_{j(j+1)} = -B_j (C_j^2 + I)^{-1}    (43)

In contrast to G_{j(j+1)} in Eq. 38, only the columns of F_{j(j+1)} in Eq. 43 are rescaled. Since \tilde{B}_j has normalized columns, its norm can be exactly expressed as:

\|F_{j(j+1)}\|_F^2 = \sum_{n=1}^{k_j} \left(\frac{c_{jn}^2}{c_{jn}^2 + 1}\right)^2    (44)

For the other blocks, we find lower bounds for their norms through the same technique used in deriving the Welch bound in Sec. 2.4.2. Specifically, we can lower bound the norms of the individual blocks as:

\|F_{11}\|_F^2 \geq \frac{1}{k_0}\left(\sum_{n=1}^{k_1} \frac{c_{1n}^2}{c_{1n}^2 + 1}\right)^2    (45)

\|F_{jj}\|_F^2 \geq \frac{1}{k_{j-1}}\left(\sum_{n=1}^{k_j} \frac{c_{jn}^2}{c_{jn}^2 + 1} + \sum_{p=1}^{k_{j-1}} \frac{1}{c_{(j-1)p}^2 + 1}\right)^2    (46)

In this case of dense shallow frames, the Welch bound depends only on the data dimensionality and the number of frame components. However, due to the structure of the architecture-induced frames, the lower bound of the deep frame potential depends on the data dimensionality, the number of layers, the number of units in each layer, the connectivity between layers, and the relative magnitudes between layers. Skip connections increase the number of nonzero elements in the Gram matrix over which to average and also enable off-diagonal blocks to have lower norms.
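Collecting Eqs. 40 and 44 through 46, the chain-network bound can be evaluated directly from the layer widths and column magnitudes without forming any matrices. The sketch below (ours) assumes k_0 denotes the input dimensionality and returns a lower bound on the unnormalized potential \|G\|_F^2; the normalization of Eq. 14 is omitted, and the function name is illustrative.

import numpy as np

def chain_potential_lower_bound(c, k0):
    # c[j - 1] holds the column magnitudes c_j of layer j; k0 is the input
    # dimensionality. Combines the exact off-diagonal norms of Eq. 44 with
    # the diagonal-block bounds of Eqs. 45 and 46, summed as in Eq. 40.
    l = len(c)
    r = [cj ** 2 / (cj ** 2 + 1.0) for cj in c]         # ratios c_jn^2 / (c_jn^2 + 1)
    bound = np.sum(r[0]) ** 2 / k0                      # Eq. 45
    for j in range(2, l + 1):                           # Eq. 46 for j = 2, ..., l
        k_prev = c[j - 2].size
        bound += (np.sum(r[j - 1]) + np.sum(1.0 / (c[j - 2] ** 2 + 1.0))) ** 2 / k_prev
    bound += 2.0 * sum(np.sum(rj ** 2) for rj in r[:-1])  # Eq. 44, doubled as in Eq. 40
    return bound

# Example: a three-layer chain with unit column magnitudes.
widths = [64, 32, 16]
print(chain_potential_lower_bound([np.ones(k) for k in widths], k0=128))

As discussed above, the resulting value depends on the layer widths, the number of layers, the input dimensionality, and the relative column magnitudes, in contrast to the Welch bound for dense shallow frames.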
3.3.2 Model Selection by Empirical Minimization

For more general architectures that lack a simple theoretical lower bound, we instead bound the mutual coherence of the architecture-induced frame through empirical minimization of the deep frame potential in the lower bound from Eq. 14. The lack of suboptimal local minima allows for effective optimization using gradient descent [45]. Because it is independent of data and parameter instantiations, this provides a way to compare different architectures without training. Instead, model selection is performed by choosing the candidate architecture that achieves the lowest deep frame potential subject to desired modeling constraints such as limiting the total number of parameters.

4 EXPERIMENTAL RESULTS

In this section, we demonstrate correlation between the minimum deep frame potential and validation error on the CIFAR-10 [62] and MNIST [28] datasets across a wide variety of fully-connected, convolutional, chain, residual, and densely connected network architectures. We show that networks with skip connections can achieve lower deep frame potentials with fewer learnable parameters, which is predictive of the parameter efficiency of trained networks.

In Fig. 8, we visualize a scatter plot of trained fully-connected networks with three to five layers and 16 to 4096 units in each layer. The corresponding architectures are shown as a list of units per layer for a few representative examples. The minimum deep frame potential of each architecture is compared against its validation error after training, and the total parameter count is indicated by color. In Fig. 8a, some networks with many parameters have unusually high error due to the difficulty in training very large fully-connected networks. In Fig. 8b, the addition of a deep frame potential regularization term overcomes some of these optimization difficulties for improved parameter efficiency. This results in high correlation between minimum deep frame potential and validation error. Furthermore, it emphasizes the diminishing returns of increasing the size of fully-connected chain networks; after a certain point, adding more parameters does little to reduce both validation error and minimum frame potential.

Fig. 8. A comparison of fully-connected deep network architectures with varying depths and widths. Warmer colors indicate models with more total parameters. (a) Some very large networks cannot be trained effectively, resulting in unusually high validation errors. (b) This can be remedied with deep frame potential regularization, resulting in high correlation between minimum frame potential and validation error.

To evaluate the effects of residual connections, we adapt the simplified ResNet architecture from [63], which consists of a single convolutional layer followed by three groups of residual blocks with activations given by Eq. 31. Before the second and third groups, the number of filters is increased by a factor of two and the spatial resolution is decreased by half with average pooling. To compare networks with different sizes, we modify their depths by changing the number of residual blocks in each group between 2 and 10 and their widths by changing the base number of filters between 4 and 32. For our experiments with densely connected skip connections [4], we adapt the simplified DenseNet architecture from [64]. Like with residual networks, it consists of a convolutional layer followed by three groups of the activations from Eq. 33 with decreasing spatial resolutions and increasing numbers of filters. Within each group, a dense block concatenates smaller convolutions that take all previous outputs as inputs with filter numbers equal to a fixed growth rate. Network depth and width are modified by respectively increasing the number of layers per group and the base growth rate between 2 and 12. Batch normalization [65] was also used in all convolutional networks.

We begin by experimentally validating our motivating assumption that feed-forward deep networks may be indirectly analyzed as approximate techniques for deep frame approximation. As shown in Fig. 9, their generalization capacities are comparable to recurrent networks that explicitly optimize the inference objective function from Eq. 29. While multiple iterations make backpropagation through inference more challenging and inefficient, especially for deeper networks, the acceleration techniques from Sec. 3.2.4 speed up convergence so that fewer iterations are required.

Fig. 9. Deep frame approximation inference compared between feed-forward deep networks and recurrent optimization for (a) chain networks and (b) ResNets with a base width of 16 filters and 2 residual blocks per group, and (c) DenseNets with a growth rate of 4 and 4 layers per group. Unrolled to an increasing number of iterations, recurrent optimization networks quickly converge to validation accuracies comparable to their feed-forward approximations.

Fig. 11. The effect of increasing depth in chain and residual networks. Validation error is compared against network depth for two different network widths. (a) Unlike chain networks, even very deep residual networks can be trained effectively for decreased validation error. (b) Despite having the same number of total parameters, residual connections also induce frame structures with lower minimum deep frame potentials.

Fig. 10. The correlation between the minimum deep frame potential and validation error of various network architectures on the (a) CIFAR-10 and (b) MNIST datasets. Each point represents a different architecture, and those in (a) are the same from Fig. 3. Despite the simplicity of MNIST, minimum deep frame potential is still a good predictor of validation error but on a much smaller scale.

Fig. 12. A comparison of (a) validation error and (b) minimum frame potential between residual networks and chain networks. Colors indicate different depths and data points are connected in order of increasing widths of 4, 8, 16, or 32 filters. Skip connections result in reduced error correlating with frame potential, with dense networks showing superior efficiency with increasing depth.

In Fig. 10 we show that validation error correlates with minimum deep frame potential even across different families of architectures. Because MNIST is much simpler than CIFAR-10, validation error saturates much quicker with respect to model size. Despite these dataset differences, minimum deep frame potential is still a good predictor of performance, and DenseNets with many skip connections once again achieve superior performance.

In Fig. 11, we compare validation errors and minimum deep frame potentials of residual networks and chain networks with residual connections removed. In Fig. 11a, chain network validation error increases for deeper networks while that of residual networks is lower and consistently decreases. This emphasizes the difficulty in training very deep chain networks. In Fig. 11b, residual connections are shown to enable lower minimum deep frame potentials following a similar trend with respect to model size.

In Fig. 12, we compare chain networks and residual networks with exactly the same number of parameters, where color indicates the number of residual blocks per group and connected data points have the same depths but different widths. The addition of skip connections reduces both validation error and minimum frame potential, as visualized by consistent placement below the diagonal line indicating lower values for residual networks than comparable chain networks. This effect becomes even more pronounced with increasing depths and widths.

In Fig. 13, we compare the parameter efficiency of chain networks, residual networks, and densely connected networks of different depths and widths. We visualize both validation error and minimum frame potential as functions of the number of parameters, demonstrating the improved scalability of networks with skip connections. While chain networks demonstrate increasingly poor parameter efficiency with respect to increasing depth in Fig. 13a, the skip connections of ResNets and DenseNets allow for further reducing error with larger network sizes in Figs. 13c,e. Considering all network families together as in Fig. 3d, we see that denser connections also allow for lower validation error with comparable numbers of parameters. This trend is mirrored in the minimum frame potentials of Figs. 13b,d,f shown together in Fig. 3e. Despite fine variations in behavior across different families of architectures, minimum deep frame potential is correlated with validation error across network sizes and effectively predicts the increased generalization capacity provided by skip connections.

Fig. 13. A demonstration of the improved scalability of networks with skip connections, where line colors indicate different depths and data points are connected showing increasing widths. (a) Chain networks with greater depths have increasingly worse parameter efficiency in comparison to (c) the corresponding networks with residual connections and (e) densely connected networks of similar size, whose performance scales efficiently with parameter count. This could potentially be attributed to correlated efficiency in reducing frame potential with fewer parameters, which saturates much faster with (b) chain networks than (d) residual networks or (f) densely connected networks.

5 CONCLUSION

In this paper, we introduced deep frame approximation, a general framework for deep learning that poses inference as global multilayer constrained approximation problems with structured overcomplete frames. The solutions to these problems may be approximated using convex optimization techniques implemented by feed-forward or recurrent neural networks. From this perspective, the compositional nature of deep learning arises as a consequence of structure-dependent efficient approximations to augmented shallow learning problems.

Based upon theoretical connections to sparse approximation and deep neural networks, we demonstrated how architectural hyper-parameters such as depth, width, and skip connections induce different structural properties of the frames in corresponding sparse coding problems. We compared these frame structures through lower bounds on their mutual coherence, which is theoretically tied to their capacity for uniquely and robustly representing data via sparse approximation. A theoretical lower bound was derived for chain networks, and the deep frame potential was proposed as an empirical optimization objective for constructing bounds for more complicated architectures. Experimentally, we observed correlations between minimum deep frame potential and validation error across different datasets and families of modern architectures with skip connections, including residual networks and densely connected convolutional networks. This suggests a promising direction for future research towards the theoretical analysis and practical construction of deep network architectures derived from connections between deep learning and overcomplete representations.

REFERENCES

[1] A. Krizhevsky et al., "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NeurIPS), 2012.
[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations (ICLR), 2015.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[4] G. Huang, Z. Liu, K. Weinberger, and L. van der Maaten, "Densely connected convolutional networks," in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[5] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, 1995.
[6] V. N. Vapnik and A. Y. Chervonenkis, "On the uniform convergence of relative frequencies of events to their probabilities," Theory of Probability and Its Applications, vol. XVI, no. 2, 1971.
[7] V. Papyan, Y. Romano, and M. Elad, "Convolutional neural networks analyzed via convolutional sparse coding," Journal of Machine Learning Research (JMLR), vol. 18, no. 83, 2017.
[8] C. Murdock, M. Chang, and S. Lucey, "Deep component analysis via alternating direction neural networks," in European Conference on Computer Vision (ECCV), 2018.
[9] C. Murdock and S. Lucey, "Dataless model selection with the deep frame potential," in Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[10] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "Imagenet large scale visual recognition challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, 2015.
[11] M. Elad, Sparse and redundant representations: from theory to applications in signal and image processing. Springer Science & Business Media, 2010.
[12] J. Kovačević, A. Chebira et al., "An introduction to frames," Foundations and Trends in Signal Processing, vol. 2, no. 1, 2008.
[13] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization," Proceedings of the National Academy of Sciences, vol. 100, no. 5, 2003.
[14] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, "Understanding deep learning requires rethinking generalization," in International Conference on Learning Representations (ICLR), 2017.
[15] S. Wold, K. Esbensen, and P. Geladi, "Principal component analysis," Chemometrics and Intelligent Laboratory Systems, vol. 2, no. 1-3, pp. 37-52, 1987.
[16] C. Bao, H. Ji, Y. Quan, and Z. Shen, "Dictionary learning for sparse coding: Algorithms and convergence analysis," Pattern Analysis and Machine Intelligence (PAMI), vol. 38, no. 7, pp. 1356-1369, 2016.
[17] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.
[18] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[19] C. Jutten and J. Herault, "Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture," Signal Processing, vol. 24, no. 1, pp. 1-10, 1991.
[20] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788-791, 1999.
[21] B. A. Olshausen et al., "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, vol. 381, no. 6583, pp. 607-609, 1996.
[22] T. Liu, D. Tao, and D. Xu, "Dimensionality-dependent generalization bounds for k-dimensional coding schemes," Neural Computation, 2016.
[23] N. Gillis, "Sparse and unique nonnegative matrix factorization through data preprocessing," Journal of Machine Learning Research (JMLR), vol. 13, pp. 3349-3386, 2012.

[24] B. Haeffele, E. Young, and R. Vidal, "Structured low-rank matrix factorization: Optimality, algorithm, and applications to image processing," in International Conference on Machine Learning (ICML), 2014.
[25] P. Baldi and K. Hornik, "Neural networks and principal component analysis: Learning from examples without local minima," Neural Networks, vol. 2, no. 1, 1989.
[26] F. De la Torre, "A least-squares framework for component analysis," Pattern Analysis and Machine Intelligence (PAMI), vol. 34, no. 6, 2012.
[27] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," in International Conference on Machine Learning (ICML), 2009.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, 1998.
[29] M. Denil et al., "Predicting parameters in deep learning," in Advances in Neural Information Processing Systems (NeurIPS), 2013.
[30] J. Alvarez and M. Salzmann, "Learning the number of neurons in deep networks," in Advances in Neural Information Processing Systems (NeurIPS), 2016.
[31] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[32] T. Elsken, J. H. Metzen, and F. Hutter, "Neural architecture search: A survey," Journal of Machine Learning Research (JMLR), vol. 20, no. 55, 2019.
[33] M. Tan and Q. Le, "Efficientnet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning (ICML), 2019.
[34] D. Arpit et al., "A closer look at memorization in deep networks," in International Conference on Machine Learning (ICML), 2017.
[35] M. Hardt, B. Recht, and Y. Singer, "Train faster, generalize better: Stability of stochastic gradient descent," in International Conference on Machine Learning (ICML), 2016.
[36] Y. Cao and Q. Gu, "Generalization bounds of stochastic gradient descent for wide and deep neural networks," in Advances in Neural Information Processing Systems (NeurIPS), 2019.
[37] R. Shwartz-Ziv and N. Tishby, "Opening the black box of deep neural networks via information," arXiv preprint arXiv:1703.00810, 2017.
[38] A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox, "On the information bottleneck theory of deep learning," in International Conference on Learning Representations (ICLR), 2018.
[39] R. Novak et al., "Sensitivity and generalization in neural networks: an empirical study," in International Conference on Learning Representations (ICLR), 2018.
[40] C. Moustapha, B. Piotr, G. Edouard, D. Yann, and U. Nicolas, "Parseval networks: Improving robustness to adversarial examples," in International Conference on Machine Learning (ICML), 2017.
[41] Y. Romano, A. Aberdam, J. Sulam, and M. Elad, "Adversarial noise attacks of deep learning architectures: Stability analysis via sparse modeled signals," Journal of Mathematical Imaging and Vision, 2018.
[42] J. Sulam, A. Aberdam, A. Beck, and M. Elad, "On multi-layer basis pursuit, efficient algorithms and convolutional neural networks," Pattern Analysis and Machine Intelligence (PAMI), 2019.
[43] S. F. Waldron, An introduction to finite tight frames. Springer, 2018.
[44] P. G. Casazza and J. Kovačević, "Equal-norm tight frames with erasures," Advances in Computational Mathematics, vol. 18, no. 2-4, 2003.
[45] J. Benedetto and M. Fickus, "Finite normalized tight frames," Advances in Computational Mathematics, vol. 18, no. 2-4, 2003.
[46] N. Parikh, S. Boyd et al., "Proximal algorithms," Foundations and Trends in Optimization, vol. 1, no. 3, 2014.
[47] P. L. Combettes and J.-C. Pesquet, "Deep neural network structures solving variational inequalities," Set-Valued and Variational Analysis, 2020.
[48] I. Daubechies, M. Defrise, and C. De Mol, "An iterative thresholding algorithm for linear inverse problems with a sparsity constraint," Communications on Pure and Applied Mathematics, vol. 57, no. 11, 2004.
[49] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," Pattern Analysis and Machine Intelligence (PAMI), vol. 35, no. 8, pp. 1798-1828, 2013.
[50] J. A. Tropp and A. C. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," IEEE Transactions on Information Theory, vol. 53, no. 12, 2007.
[51] A. M. Bruckstein, M. Elad, and M. Zibulevsky, "On the uniqueness of nonnegative sparse solutions to underdetermined systems of equations," IEEE Transactions on Information Theory, vol. 54, no. 11, 2008.
[52] D. L. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, no. 4, 2006.
[53] D. L. Donoho, M. Elad, and V. N. Temlyakov, "Stable recovery of sparse overcomplete representations in the presence of noise," IEEE Transactions on Information Theory, vol. 52, no. 1, 2005.
[54] J. A. Livezey, A. F. Bujan, and F. T. Sommer, "Learning overcomplete, low coherence dictionaries with linear inference," Journal of Machine Learning Research (JMLR), vol. 20, no. 174, 2019.
[55] L. Welch, "Lower bounds on the maximum cross correlation of signals," IEEE Transactions on Information Theory, vol. 20, no. 3, 1974.
[56] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, "Deconvolutional networks," in Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[57] H. Bristow, A. Eriksson, and S. Lucey, "Fast convolutional sparse coding," in Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[58] V. Papyan, J. Sulam, and M. Elad, "Working locally thinking globally: Theoretical guarantees for convolutional sparse coding," IEEE Transactions on Signal Processing, vol. 65, no. 21, 2017.
[59] Y. Xu and W. Yin, "A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion," SIAM Journal on Imaging Sciences, vol. 6, no. 3, pp. 1758-1789, 2013.
[60] N. Chodosh, C. Wang, and S. Lucey, "Deep convolutional compressed sensing for lidar depth completion," in Asian Conference on Computer Vision (ACCV), 2018.
[61] G. Cazenavette, C. Murdock, and S. Lucey, "Architectural adversarial robustness: The case for deep pursuit," in Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[62] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," University of Toronto, Tech. Rep., 2009.
[63] Y. Wu et al., "Tensorpack," https://github.com/tensorpack/, 2016.
[64] Y. Li, "Tensorflow densenet," https://github.com/YixuanLi/densenet-tensorflow, 2018.
[65] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning (ICML), 2015.

Calvin Murdock received the PhD degree in machine learning from Carnegie Mellon University in 2020. His research concerns fundamental questions of representation learning towards the goal of effective computational perception, leveraging insights from geometry, optimization, and sparse approximation theory. He is a member of the IEEE.

Simon Lucey received the PhD degree from the Queensland University of Technology, Brisbane, Australia, in 2003. He is an Associate Research Professor at the Robotics Institute at Carnegie Mellon University. He also holds an adjunct professorial position at the Queensland University of Technology. His research interests include computer vision and machine learning, and their application to human behavior. He is a member of the IEEE.