Dataless Model Selection with the Deep Frame Potential

Calvin Murdock (Carnegie Mellon University) and Simon Lucey (Carnegie Mellon University, Argo AI)
{cmurdock,slucey}@cs.cmu.edu

Abstract

Choosing a deep neural network architecture is a fundamental problem in applications that require balancing performance and parameter efficiency. Standard approaches rely on ad-hoc engineering or computationally expensive validation on a specific dataset. We instead attempt to quantify networks by their intrinsic capacity for unique and robust representations, enabling efficient architecture comparisons without requiring any data. Building upon theoretical connections between deep learning and sparse approximation, we propose the deep frame potential: a measure of coherence that is approximately related to representation stability but has minimizers that depend only on network structure. This provides a framework for jointly quantifying the contributions of architectural hyper-parameters such as depth, width, and skip connections. We validate its use as a criterion for model selection and demonstrate correlation with generalization error on a variety of common residual and densely connected network architectures.

Figure 1: Why are some deep neural network architectures better than others? In comparison to (a) standard chain connections, skip connections like those in (b) ResNets [13] and (c) DenseNets [15] have demonstrated significant improvements in training effectiveness, parameter efficiency, and generalization performance. We provide one possible explanation for this phenomenon by approximating network activations as (d) solutions to sparse approximation problems with different induced dictionary structures.

1. Introduction

Deep neural networks have dominated nearly every benchmark within the field of computer vision. While this modern influx of deep learning originally began with the task of large-scale image recognition [18], new datasets, loss functions, and network configurations have quickly expanded its scope to include a much wider range of applications. Despite this, the underlying architectures used to learn effective image representations are generally consistent across all of them. This can be seen through the community's quick adoption of the newest state-of-the-art deep networks, from AlexNet [18] to VGGNet [28], ResNets [13], DenseNets [15], and so on. But this begs the question: why do some deep network architectures work better than others? Despite years of groundbreaking empirical results, an answer to this question still remains elusive.

Fundamentally, the difficulty in comparing network architectures arises from the lack of a theoretical foundation for characterizing their generalization capacities. Shallow techniques like support vector machines [6] were aided by theoretical tools like the VC-dimension [31] for determining when their predictions could be trusted to avoid overfitting. Deep neural networks, on the other hand, have eschewed similar analyses due to their complexity. Theoretical explorations of deep network generalization [24] are often disconnected from practical applications and rarely provide actionable insight into how architectural hyper-parameters contribute to performance.

Building upon recent connections between deep learning and sparse approximation [26, 23], we instead interpret feed-forward deep networks as algorithms for approximate inference in related sparse coding problems. These problems aim to optimally reconstruct zero-padded input images as sparse, nonnegative linear combinations of atoms from architecture-dependent dictionaries, as shown in Fig. 1. We propose to indirectly analyze practical deep network architectures with complicated skip connections, like residual networks (ResNets) [13] and densely connected convolutional networks (DenseNets) [15], simply through the dictionary structures that they induce.

To accomplish this, we introduce the deep frame potential for summarizing the interactions between parameters in feed-forward deep networks. As a lower bound on mutual coherence (the maximum magnitude of the normalized inner products between all pairs of dictionary atoms [9]), it is theoretically tied to generalization properties of the related sparse coding problems. However, its minimizers depend only on the dictionary structures induced by the corresponding network architectures. This enables dataless model comparison by jointly quantifying the contributions of depth, width, and connectivity.

Our approach is motivated by sparse approximation theory [11], a field that encompasses properties like the uniqueness and robustness of shallow, overcomplete representations.
In sparse coding, capacity is controlled by the number of dictionary atoms used in sparse data reconstructions. While more parameters allow for more accurate representations, they may also increase input sensitivity for worse generalization performance. Conceptually, this is comparable to overfitting in nearest-neighbor classification, where representations are sparse, one-hot indicator vectors corresponding to nearest training examples. As the number of training data increases, the distance between them decreases, so they are more likely to be confused with one another. Similarly, nearby dictionary atoms may introduce instability that causes representations of similar data points to become very far apart, leading to poor generalization performance. Thus, there is a fundamental tradeoff between the capacity and robustness of shallow representations due to the proximity of dictionary atoms as measured by mutual coherence.

However, deep representations have not shown the same correlation between model size and sensitivity [34]. While adding more layers to a deep neural network increases its capacity, it also simultaneously introduces implicit regularization to reduce overfitting. This can be explained through the proposed connection to sparse coding, where additional layers increase both capacity and effective input dimensionality. In a higher-dimensional space, dictionary atoms can be spaced further apart for more robust representations. Furthermore, architectures with denser skip connections induce dictionary structures with more nonzero elements, which provides additional freedom to further reduce mutual coherence with fewer parameters, as shown in Fig. 2.

Figure 2: Parameter count is not a good indicator of generalization performance for deep networks. Instead, we compare different network architectures via the minimum deep frame potential, the average nonzero magnitude of inner products between atoms of architecture-induced dictionaries. In comparison to (a) chain networks, skip connections in (b) residual networks and (c) densely connected networks produce Gram matrix structures with more nonzero elements, allowing for (d) lower deep frame potentials across network sizes. This correlates with improved parameter efficiency, giving (e) lower validation error with fewer parameters.

We propose to use the minimum deep frame potential as a cue for model selection. Instead of requiring expensive validation on a specific dataset to approximate generalization performance, architectures are chosen based on how efficiently they can reduce the minimum achievable mutual coherence with respect to the number of model parameters. In this paper, we provide an efficient frame potential minimization method for a general class of convolutional networks with skip connections, of which ResNets and DenseNets are shown to be special cases. Furthermore, we derive an analytic expression for the minimum value in the case of fully-connected chain networks. Experimentally, we demonstrate correlation with validation error across a variety of network architectures.
2. Background and Related Work

Due to the vast space of possible deep network architectures and the computational difficulty in training them, deep model selection has largely been guided by ad-hoc engineering and human ingenuity. While progress slowed in the years following early breakthroughs [20], recent interest in deep learning architectures began anew due to empirical successes largely attributed to computational advances like efficient training using GPUs and rectified linear unit (ReLU) activation functions [18]. Since then, numerous architectural changes have been proposed. For example, much deeper networks with residual connections were shown to achieve consistently better performance with fewer parameters [13]. Building upon this, densely connected convolutional networks with skip connections between more layers yielded even better performance [15]. While theoretical explanations for these improvements were lacking, consistent experimentation on standardized benchmark datasets continued to drive empirical success.

However, due to slowing progress and the need for increased accessibility of deep learning techniques to a wider range of practitioners, more principled approaches to architecture search have recently gained traction. Motivated by observations of extreme redundancy in the parameters of trained networks [7], techniques have been proposed to systematically reduce the number of parameters without adversely affecting performance. Examples include sparsity-inducing regularizers during training [1] or post-processing to prune the parameters of trained networks [14]. Constructive approaches to model selection like neural architecture search [12] instead attempt to compose architectures from basic building blocks through tools like reinforcement learning. Efficient model scaling has also been proposed to enable more effective grid search for selecting architectures subject to resource constraints [30]. While automated techniques can match or even surpass manually engineered alternatives, they require a validation dataset and rarely provide insights transferable to other settings.

To better understand the implicit benefits of different network architectures, there have been adjacent theoretical explorations of deep network generalization. These works are often motivated by the surprising observation that good performance can still be achieved using highly over-parametrized models with degrees of freedom that surpass the number of training data.
This contradicts many commonly accepted ideas about generalization, spurring new experimental explorations that have demonstrated properties unique to deep learning. Examples include the ability of deep networks to express random data labels [34] with a tendency towards learning simple patterns first [2]. While exact theoretical explanations are lacking, empirical measurements of network sensitivity such as the Jacobian norm have been shown to correlate with generalization [25]. Similarly, Parseval regularization [22] encourages robustness by constraining the Lipschitz constants of individual layers.

Due to the difficulty in analyzing deep networks directly, other approaches have instead drawn connections to the rich field of sparse approximation theory. The relationship between feed-forward neural networks and principal component analysis has long been known for the case of linear activations [3]. More recently, nonlinear deep networks with ReLU activations have been linked to multilayer sparse coding to prove theoretical properties of deep representations [26]. This connection has been used to motivate new recurrent architecture designs that resist adversarial noise attacks [27], improve classification performance [29], or enforce prior knowledge through output constraints [23].

3. Deep Learning as Sparse Approximation

To derive our criterion for model selection, we build upon recent connections between deep neural networks and sparse approximation. Specifically, consider a feed-forward network f(x) = φ_l(B_l^T ⋯ φ_1(B_1^T x)) constructed as the composition of linear transformations with parameters B_j and nonlinear activation functions φ_j. Equivalently, f(x) = w_l, where w_j = φ_j(B_j^T w_{j-1}) are the layer activations for layers j = 1, ..., l and w_0 = x. In many modern state-of-the-art networks, the ReLU activation function has been adopted due to its effectiveness and computational efficiency. It can also be interpreted as the nonnegative soft-thresholding proximal operator associated with the function Φ in Eq. 1, the sum of a nonnegativity constraint I(w ≥ 0) and a sparsity-inducing ℓ1 penalty with a weight determined by the scalar bias parameter λ:

\Phi(w) = I(w \geq 0) + \lambda\|w\|_1, \qquad \phi(x) = \mathrm{ReLU}(x - \lambda 1) = \arg\min_{w} \tfrac{1}{2}\|w - x\|_2^2 + \Phi(w) \quad (1)

Thus, the forward pass of a deep network is equivalent to a layered thresholding pursuit algorithm for approximating the solution of a multi-layer sparse coding model [26]. Results from shallow sparse approximation theory can then be adapted to bound the accuracy of this approximation, which improves as the mutual coherence defined below in Eq. 2 decreases, and to indirectly analyze other theoretical properties of deep networks like uniqueness and robustness.
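As a quick numerical check of the proximal interpretation in Eq. 1, the sketch below (our illustration, not part of the paper) compares ReLU(x - λ1) against a brute-force grid minimization of the nonnegative soft-thresholding objective; the two coincide up to grid resolution.

```python
# Sketch (illustration only): verify that ReLU(x - lambda) solves the
# nonnegative soft-thresholding problem of Eq. 1,
#     argmin_w 0.5 * ||w - x||^2 + lambda * ||w||_1   subject to  w >= 0,
# by comparing against a dense grid search over candidate solutions.
import numpy as np

rng = np.random.default_rng(0)
lam = 0.3
x = rng.normal(size=5)

relu_solution = np.maximum(x - lam, 0.0)        # ReLU(x - lambda * 1)

# The objective is separable, so each coordinate can be solved on a 1-D grid
# restricted to w >= 0 (the nonnegativity constraint is built into the grid).
grid = np.linspace(0.0, 5.0, 500001)
objective = 0.5 * (grid[None, :] - x[:, None]) ** 2 + lam * grid[None, :]
grid_solution = grid[np.argmin(objective, axis=1)]

print(relu_solution)
print(grid_solution)
print(np.allclose(relu_solution, grid_solution, atol=1e-4))   # True
```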
3.1. Sparse Approximation Theory

Sparse approximation theory considers representations of data vectors x ∈ R^d as sparse linear combinations x ≈ Σ_j w_j b_j = Bw of atoms from an over-complete dictionary B ∈ R^{d×k}. The number of atoms k is greater than the dimensionality d, and the number of nonzero coefficients ‖w‖_0 in the representation w ∈ R^k is small.

Through applications like compressed sensing [8], sparsity has been found to exhibit theoretical properties that enable data representation with efficiency far greater than what was previously thought possible. Central to these results is the requirement that the dictionary be "well-behaved," essentially ensuring that its columns are not too similar. For undercomplete matrices with k ≤ d, this is satisfied by enforcing orthogonality, but overcomplete dictionaries require other conditions. Specifically, we focus our attention on the mutual coherence µ, the maximum magnitude normalized inner product of all pairs of dictionary atoms. Equivalently, it is the maximum magnitude off-diagonal element in the Gram matrix G = B̃^T B̃, where the columns of B̃ are normalized to have unit norm:

\mu = \max_{i \neq j} \frac{|b_i^\top b_j|}{\|b_i\| \|b_j\|} = \max_{i,j} |(G - I)_{ij}| \quad (2)

We are primarily motivated by the observation that a model's capacity for low mutual coherence increases along with its capacity for both memorizing more training data through unique representations and generalizing to more validation data through robustness to input perturbations.

With an overcomplete dictionary, there is a space of coefficients w that can all exactly reconstruct any data point as x = Bw, which would not support discriminative representation learning. However, if representations from a mutually incoherent dictionary are sufficiently sparse, then they are guaranteed to be optimally sparse and unique [9]. Specifically, if the number of nonzeros ‖w‖_0 < ½(1 + µ^{-1}), then w is the unique, sparsest representation for x. Furthermore, if ‖w‖_0 < (√2 − 0.5)µ^{-1}, then it can be found efficiently by convex optimization with ℓ1 regularization. Thus, minimizing the mutual coherence of a dictionary increases its capacity for uniquely representing data points.

Sparse representations are also robust to input perturbations [10]. Specifically, given a noisy data point x = x_0 + z, where x_0 can be represented exactly as x_0 = Bw_0 with ‖w_0‖_0 ≤ ¼(1 + µ^{-1}) and the noise z has bounded magnitude ‖z‖_2 ≤ ε, then w_0 can be approximated by solving the ℓ1-penalized problem:

\arg\min_{w} \|x - Bw\|_2^2 + \lambda\|w\|_1 \quad (3)

Its solution is stable, and the approximation error is bounded from above in Eq. 4, where δ(x, λ) is a constant:

\|w - w_0\|_2^2 \leq \frac{(\epsilon + \delta(x, \lambda))^2}{1 - \mu(4\|w\|_0 - 1)} \quad (4)

Thus, minimizing the mutual coherence of a dictionary decreases the sensitivity of its sparse representations for improved robustness. This is similar to evaluating input sensitivity using the Jacobian norm [25]. However, instead of estimating the average perturbation error over validation data, it bounds the worst-case error over all possible data.
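For an explicit dictionary, the mutual coherence of Eq. 2 and the associated uniqueness threshold from [9] are only a few lines of linear algebra. The following sketch (our illustration with an arbitrary random dictionary) computes both.

```python
# Sketch: mutual coherence (Eq. 2) of a random overcomplete dictionary and
# the sparsity level below which representations are guaranteed unique,
# ||w||_0 < 0.5 * (1 + 1/mu), following [9].
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 256                                          # dimensionality, atoms
B = rng.normal(size=(d, k))

B_tilde = B / np.linalg.norm(B, axis=0, keepdims=True)  # unit-norm columns
G = B_tilde.T @ B_tilde                                 # Gram matrix
mu = np.max(np.abs(G - np.eye(k)))                      # max off-diagonal magnitude

print(f"mutual coherence mu = {mu:.3f}")
print(f"uniqueness guaranteed below {0.5 * (1 + 1 / mu):.2f} nonzeros")
```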
3.2. Deep Component Analysis

While deep representations can be analyzed by accumulating the effects of approximating individual layers in a chain network as shallow sparse coding problems [26], this strategy cannot be easily adapted to account for more complicated interactions between layers. Instead, we adapt the framework of Deep Component Analysis [23], which jointly represents all layers in a neural network as the single sparse coding problem in Eq. 5. The ReLU activations w_j ∈ R^{k_j} of a feed-forward chain network approximate the solutions to a joint optimization problem where w_0 = x ∈ R^{k_0} and the regularization functions Φ_j are nonnegative sparsity-inducing penalties as defined in Eq. 1:

w_j := \phi_j(B_j^\top w_{j-1}) \;\; \forall j = 1, \dots, l \;\approx\; \arg\min_{\{w_j\}} \sum_{j=1}^{l} \|B_j w_j - w_{j-1}\|_2^2 + \Phi_j(w_j) \quad (5)

The compositional constraints between adjacent layers are relaxed and replaced by reconstruction error penalty terms, resulting in a convex, nonnegative sparse coding problem.

By combining the terms in the summation of Eq. 5 together into a single system, this problem can be equivalently represented as shown in Eq. 6. The latent variables w_j for each layer are stacked in the vector w, the regularizer is Φ(w) = Σ_j Φ_j(w_j), and the input x is augmented with zeros.

\arg\min_{w} \left\| \underbrace{\begin{bmatrix} B_1 & 0 & & \\ -I & B_2 & \ddots & \\ & \ddots & \ddots & 0 \\ & & -I & B_l \end{bmatrix}}_{B} \underbrace{\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_l \end{bmatrix}}_{w} - \begin{bmatrix} x \\ 0 \\ \vdots \\ 0 \end{bmatrix} \right\|_2^2 + \Phi(w) \quad (6)

The layer parameters B_j ∈ R^{k_{j-1}×k_j} are blocks in the induced dictionary B, which has Σ_j k_{j-1} rows and Σ_j k_j columns. It has a structure of nonzero elements that summarizes the corresponding feed-forward deep network architecture, wherein the off-diagonal identity matrices are connections between adjacent layers.

Model capacity can be increased both by adding parameters to a layer or by adding layers, which implicitly pads the input data x with more zeros. This can actually reduce the dictionary's mutual coherence because it increases the system's dimensionality. Thus, depth allows model complexity to scale jointly alongside effective input dimensionality so that the induced dictionary structures still have the capacity for low mutual coherence and improved capabilities for memorization and generalization.
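To make the block structure of Eq. 6 concrete, the sketch below (our illustration, with arbitrary example layer sizes and a helper name of our choosing) assembles the induced dictionary for a small fully-connected chain network: each diagonal block holds the layer parameters B_j and each sub-diagonal block is a negative identity coupling adjacent layers.

```python
# Sketch: assemble the chain-network induced dictionary of Eq. 6.
# Block row j encodes the residual B_j w_j - w_{j-1}, so the diagonal block
# holds B_j and the sub-diagonal block holds -I. Layer sizes are arbitrary.
import numpy as np

def chain_dictionary(params):
    """params[j-1] is B_j with shape (k_{j-1}, k_j)."""
    rows = sum(B.shape[0] for B in params)   # sum_j k_{j-1}
    cols = sum(B.shape[1] for B in params)   # sum_j k_j
    D = np.zeros((rows, cols))
    r = c = 0
    for j, B in enumerate(params):
        k_prev, k_cur = B.shape
        D[r:r + k_prev, c:c + k_cur] = B                     # diagonal block B_j
        if j > 0:
            D[r:r + k_prev, c - k_prev:c] = -np.eye(k_prev)  # coupling to w_{j-1}
        r, c = r + k_prev, c + k_cur
    return D

rng = np.random.default_rng(0)
sizes = [16, 32, 32, 10]   # k_0, ..., k_l (example only)
params = [rng.normal(size=(sizes[j], sizes[j + 1])) for j in range(len(sizes) - 1)]
B = chain_dictionary(params)
print(B.shape)             # (16 + 32 + 32, 32 + 32 + 10) = (80, 74)
```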

3.3. Architecture-Induced Dictionary Structure

In this section, we extend this model formulation to incorporate more complicated network architectures. Because mutual coherence is dependent on normalized dictionary atoms, it can be reduced by increasing the number of nonzero elements, which reduces the magnitudes of the dictionary elements and their inner products. In Eq. 7, we replace the identity connections of Eq. 6 with blocks of nonzero parameters to allow for lower mutual coherence.

B = \begin{bmatrix} B_{11} & & & 0 \\ B_{21} & B_{22} & & \\ \vdots & \ddots & \ddots & \\ B_{l1} & \cdots & B_{l(l-1)} & B_{ll} \end{bmatrix} \quad (7)

This structure is induced by the feed-forward activations in Eq. 8, which again approximate the solutions to a nonnegative sparse coding problem.

w_j := \phi_j\Big({-B_{jj}^\top}\sum_{k=1}^{j-1} B_{jk}^\top w_k\Big) \;\; \forall j = 1, \dots, l \;\approx\; \arg\min_{\{w_j\}} \sum_{j=1}^{l} \Big\| B_{jj} w_j + \sum_{k=1}^{j-1} B_{jk} w_k \Big\|_2^2 + \Phi_j(w_j) \quad (8)

In comparison to Eq. 5, additional parameters introduce skip connections between layers so that the activations w_j of layer j now depend on those of all previous layers k < j. These connections are similar to the identity mappings in residual networks [13], which introduce dependence between the activations of pairs of layers for even j ∈ [1, l−1]:

w_j := \phi_j(B_j^\top w_{j-1}), \qquad w_{j+1} := \phi_{j+1}(w_{j-1} + B_{j+1}^\top w_j) \quad (9)

In comparison to chain networks, no additional parameters are required; the only difference is the addition of w_{j−1} in the argument of φ_{j+1}. As a special case of Eq. 8, we interpret the activations in Eq. 9 as approximate solutions to the optimization problem:

\arg\min_{\{w_j\}} \|x - B_1 w_1\|_2^2 + \sum_{j=1}^{l} \Phi_j(w_j) + \sum_{\text{even } j} \|w_j - B_j^\top w_{j-1}\|_2^2 + \|w_{j+1} - w_{j-1} - B_{j+1}^\top w_j\|_2^2 \quad (10)

This results in the induced dictionary structure of Eq. 7 with B_{jj} = I for j > 1, B_{jk} = 0 for j > k + 1, B_{jk} = 0 for j > k with odd k, and B_{jk} = I for j > k with even k.

Figure 3: A visualization of a one-dimensional convolutional dictionary with two input channels, five output channels, and a filter size of three. (a) The filters are repeated over eight spatial dimensions, resulting in a (b) block-Toeplitz structure that is revealed through row and column permutations. (c) The corresponding Gram matrix can be efficiently computed by (d) repeating local filter interactions.

Building upon the empirical successes of residual networks, densely connected convolutional networks [15] incorporate skip connections between earlier layers as well. This is shown in Eq. 11, where the transformation B_j of concatenated variables [w_k]_{k=1}^{j-1} is equivalently written as the summation of smaller transformations B_{jk}.

w_j := \phi_j\big(B_j^\top [w_k]_{k=1}^{j-1}\big) = \phi_j\Big(\sum_{k=1}^{j-1} B_{jk}^\top w_k\Big) \quad (11)

These activations again provide approximate solutions to the problem in Eq. 8 with the induced dictionary structure of Eq. 7, where B_{jj} = I for j > 1 and the lower blocks B_{jk} for j > k are all filled with learned parameters.

Skip connections enable effective learning in much deeper networks than chain-structured alternatives. While originally motivated from the perspective of making optimization easier [13], adding more connections between layers was also shown to improve generalization performance [15]. As compared in Fig. 2, denser skip connections induce dictionary structures with denser Gram matrices, allowing for lower mutual coherence. This suggests that architectures' capacities for low validation error can be quantified and compared based on their capacities for inducing dictionaries with low minimum mutual coherence.
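To illustrate how denser connectivity changes the induced Gram matrix (the effect summarized in Fig. 2), the sketch below (our simplified illustration, with equal layer widths and randomly filled learned blocks, none of it taken from the paper) builds the Eq. 7 dictionary under either chain or fully dense connectivity and counts the nonzero off-diagonal entries of the resulting Gram matrix.

```python
# Sketch: build the Eq. 7 induced dictionary for a small network with either
# chain connectivity (only adjacent layers coupled) or dense skip connections
# (every earlier layer coupled to every later one), and compare the number of
# nonzero off-diagonal Gram matrix entries. Equal layer widths are assumed so
# that identity diagonal blocks fit; all sizes and values are illustrative.
import numpy as np

def induced_dictionary(sizes, dense, rng):
    l = len(sizes) - 1
    rows, cols = sizes[:-1], sizes[1:]              # block row/column sizes
    D = np.zeros((sum(rows), sum(cols)))
    r_off = np.cumsum([0] + list(rows))
    c_off = np.cumsum([0] + list(cols))
    for j in range(l):
        # Diagonal block: learned parameters for the first layer, identity after.
        diag = rng.normal(size=(rows[j], cols[j])) if j == 0 else np.eye(rows[j], cols[j])
        D[r_off[j]:r_off[j + 1], c_off[j]:c_off[j + 1]] = diag
        lower = range(j) if dense else ([j - 1] if j > 0 else [])
        for k in lower:                             # learned lower blocks B_jk
            D[r_off[j]:r_off[j + 1], c_off[k]:c_off[k + 1]] = rng.normal(
                size=(rows[j], cols[k]))
    return D

rng = np.random.default_rng(0)
sizes = [16, 16, 16, 16, 16]                        # equal widths (assumption)
for dense in (False, True):
    B = induced_dictionary(sizes, dense, rng)
    B_tilde = B / np.linalg.norm(B, axis=0, keepdims=True)
    G = B_tilde.T @ B_tilde
    off_diag = np.abs(G - np.diag(np.diag(G))) > 1e-12
    print("dense" if dense else "chain", "nonzero off-diagonal entries:",
          np.count_nonzero(off_diag))
```

The nonzero count reported here is the same quantity N(G) that normalizes the deep frame potential in Eq. 12 below.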
4. The Deep Frame Potential

We propose to use lower bounds on the mutual coherence of induced structured dictionaries for the data-independent comparison of architecture capacities. Note that while one-sided coherence is better suited to nonnegativity constraints, it has the same lower bound [5]. Directly optimizing mutual coherence from Eq. 2 is difficult due to its piecewise structure. Instead, we consider a tight lower bound obtained by replacing the maximum off-diagonal element of the Gram matrix G with the mean. This gives the averaged frame potential F(B), a strongly-convex function that can be optimized more effectively [4]:

F^2(B) = N^{-1}(G)\big(\|G\|_F^2 - \operatorname{Tr} G\big) \leq \mu^2(B) \quad (12)

Here, N(G) is the number of nonzero off-diagonal elements in the Gram matrix and Tr G equals the total number of dictionary atoms. Equality is met in the case of equiangular tight frames, when the normalized inner products between all dictionary atoms are equivalent [17]. Due to the block-sparse structure of the induced dictionaries from Eq. 7, we evaluate the frame potential in terms of local blocks G_{jj'} ∈ R^{k_j×k_{j'}} that are nonzero only if layer j is connected to layer j'. In the case of convolutional layers with localized spatial support, there is also a repeated implicit structure of nonzero elements, as visualized in Fig. 3.

To compute the Gram matrix, we first need to normalize the global induced dictionary B from Eq. 7. By stacking the column magnitudes of layer j as the elements in the diagonal matrix C_j = diag(c_j) ∈ R^{k_j×k_j}, the normalized parameters can be represented as B̃_{ij} = B_{ij} C_j^{-1}. Similarly, the squared norms of the full set of columns in the global dictionary B are N_j^2 = Σ_{i=j}^{l} C_{ij}^2. The full normalized dictionary can then be found as B̃ = B N^{-1}, where the matrix N is block diagonal with N_j as its blocks. The blocks of the Gram matrix G = B̃^T B̃ are then given as:

G_{jj'} = \sum_{i=j'}^{l} N_j^{-1} B_{ij}^\top B_{ij'} N_{j'}^{-1} \quad (13)

For chain networks, G_{jj'} ≠ 0 only when j = j' + 1, which represents the connections between adjacent layers. In this case, the blocks can be simplified as:

G_{jj} = (C_j^2 + I)^{-\frac{1}{2}} (B_j^\top B_j + I)(C_j^2 + I)^{-\frac{1}{2}} \quad (14)
G_{j(j+1)} = -(C_j^2 + I)^{-\frac{1}{2}} B_{j+1} (C_{j+1}^2 + I)^{-\frac{1}{2}} \quad (15)
G_{ll} = B_l^\top B_l \quad (16)

Because the diagonal is removed in the deep frame potential computation, the contribution of G_{jj} is simply a rescaled version of the local frame potential of layer j. The contribution of G_{j(j+1)}, on the other hand, can essentially be interpreted as rescaled ℓ2 weight decay, where rows are weighted more heavily if the corresponding columns of the previous layer's parameters have higher magnitudes. Furthermore, since the global frame potential is averaged over the total number of nonzero elements in G, if a layer has more parameters, then it will be given more weight in this computation. For more general networks with skip connections, however, the summation from Eq. 13 has additional terms that introduce more complicated interactions. In these cases, it cannot be evaluated from local properties of layers.

Essentially, the deep frame potential summarizes the structural properties of the global dictionary B induced by a deep network architecture by balancing interactions within each individual layer, through local coherence properties, and between connecting layers.
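Given any explicit induced dictionary, the deep frame potential of Eq. 12 can be evaluated directly. The sketch below is our illustration (the helper name is ours, and the structurally nonzero entries are identified numerically rather than from the block pattern); it also reports the mutual coherence that the frame potential lower-bounds.

```python
# Sketch: evaluate the deep frame potential of Eq. 12 for an explicit dictionary
# B by normalizing columns, forming the Gram matrix, and averaging the squared
# off-diagonal entries over the N(G) nonzero off-diagonal positions.
import numpy as np

def deep_frame_potential(B, tol=1e-12):
    B_tilde = B / np.linalg.norm(B, axis=0, keepdims=True)
    G = B_tilde.T @ B_tilde
    off_diag = G - np.diag(np.diag(G))
    nonzero = np.abs(off_diag) > tol        # structurally nonzero entries, N(G)
    F_squared = np.sum(off_diag[nonzero] ** 2) / np.count_nonzero(nonzero)
    mu = np.max(np.abs(off_diag))           # mutual coherence (Eq. 2)
    return F_squared, mu

rng = np.random.default_rng(0)
B = rng.normal(size=(64, 256))              # stand-in for an induced dictionary
F_squared, mu = deep_frame_potential(B)
print(f"F^2(B) = {F_squared:.4f} <= mu^2(B) = {mu ** 2:.4f}")
```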
4.1. Theoretical Lower Bound for Chain Networks

While the deep frame potential is a function of parameter values, its minimum value is determined only by the architecture-induced dictionary structure. Furthermore, we know that it must be bounded by a nonzero constant for overcomplete dictionaries. In this section, we derive this lower bound for the special case of fully-connected chain networks and provide intuition for why skip connections increase the capacity for low mutual coherence.

First, observe that a lower bound for the Frobenius norm of G_{j(j+1)} from Eq. 15 cannot be readily attained because the rows and columns are rescaled independently. This means that a lower bound for the norm of G must be found by jointly considering the entire matrix structure, not simply through the summation of its components. To accomplish this, we instead consider the matrix H = B̃B̃^T, which is full rank and has the same norm as G:

\|G\|_F^2 = \|H\|_F^2 = \sum_{j=1}^{l} \|H_{jj}\|_F^2 + 2\sum_{j=1}^{l-1} \|H_{j(j+1)}\|_F^2 \quad (17)

We can then express the individual blocks of H as:

H_{11} = B_1 (C_1^2 + I_{k_1})^{-1} B_1^\top \quad (18)
H_{jj} = B_j (C_j^2 + I)^{-1} B_j^\top + (C_{j-1}^2 + I)^{-1} \quad (19)
H_{j(j+1)} = -B_j (C_j^2 + I)^{-1} \quad (20)

In contrast to G_{j(j+1)} in Eq. 15, only the columns of H_{j(j+1)} in Eq. 20 are rescaled. Since B̃_j has normalized columns, its norm can be exactly expressed as:

\|H_{j(j+1)}\|_F^2 = \sum_{n=1}^{k_j} \left(\frac{c_{jn}}{c_{jn}^2 + 1}\right)^2 \quad (21)

For the other blocks, we find lower bounds for their norms through the same technique used in deriving the Welch bound, which expresses the minimum mutual coherence for unstructured overcomplete dictionaries [32]. Specifically, we apply the Cauchy-Schwarz inequality giving ‖A‖_F^2 ≥ r^{-1}(Tr A)^2 for positive-semidefinite matrices A with rank r. Since the rank of H_{jj} is at most k_{j-1}, we can lower bound the norms of the individual blocks as:

\|H_{11}\|_F^2 \geq \frac{1}{k_0}\left(\sum_{n=1}^{k_1} \frac{c_{1n}^2}{c_{1n}^2 + 1}\right)^2 \quad (22)
\|H_{jj}\|_F^2 \geq \frac{1}{k_{j-1}}\left(\sum_{n=1}^{k_j} \frac{c_{jn}^2}{c_{jn}^2 + 1} + \sum_{p=1}^{k_{j-1}} \frac{1}{c_{(j-1)p}^2 + 1}\right)^2

In the case of dense shallow dictionaries, the Welch bound depends only on the data dimensionality and the number of dictionary atoms. However, due to the structure of the architecture-induced dictionaries, the lower bound of the deep frame potential depends on the data dimensionality, the number of layers, the number of units in each layer, the connectivity between layers, and the relative magnitudes between layers. Skip connections increase the number of nonzero elements in the Gram matrix over which to average and also enable off-diagonal blocks to have lower norms.

4.2. Model Selection

For more general architectures that lack a simple theoretical lower bound, we instead propose bounding the mutual coherence of the architecture-induced dictionary through empirical minimization of the deep frame potential F^2(B) from Eq. 12. Frame potential minimization has been used effectively to construct finite normalized tight frames due to the lack of suboptimal local minima, which allows for effective optimization using gradient descent [4].

We propose using the minimum deep frame potential of an architecture, which is independent of data and individual parameter instantiations, as a means to compare different architectures. In practice, model selection is performed by choosing the candidate architecture with the lowest minimum frame potential subject to desired modeling constraints such as limiting the total number of parameters.
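Since the minimizers depend only on the induced structure, the minimum can be estimated by gradient descent on Eq. 12 with the structurally zero blocks held fixed. The PyTorch sketch below is our simplified illustration of this procedure, not the authors' implementation: it handles only the fully-connected chain structure of Eq. 6, and the layer sizes, optimizer, and step count are hypothetical choices. In practice, each candidate architecture would be minimized this way and ranked by the resulting value at its parameter budget.

```python
# Sketch: estimate the minimum deep frame potential of a chain-structured
# induced dictionary (Eq. 6) by gradient descent on Eq. 12. Only the learned
# blocks B_j are free parameters; the -I couplings and structural zeros are
# fixed, so the minimized value depends only on the architecture. The layer
# sizes below are hypothetical examples.
import torch

def frame_potential(blocks, sizes):
    """Assemble the Eq. 6 dictionary from per-layer blocks and return F^2(B)."""
    B = torch.zeros(sum(sizes[:-1]), sum(sizes[1:]))
    r = c = 0
    for j, Bj in enumerate(blocks):
        k_prev, k_cur = Bj.shape
        B[r:r + k_prev, c:c + k_cur] = Bj
        if j > 0:
            B[r:r + k_prev, c - k_prev:c] = -torch.eye(k_prev)
        r, c = r + k_prev, c + k_cur
    B_tilde = B / B.norm(dim=0, keepdim=True)
    G = B_tilde.T @ B_tilde
    off_diag = G - torch.diag(torch.diag(G))
    num_nonzero = (off_diag.detach().abs() > 1e-12).sum()   # N(G)
    return (off_diag ** 2).sum() / num_nonzero

sizes = [32, 64, 64, 10]      # k_0, ..., k_l (hypothetical)
blocks = [torch.randn(sizes[j], sizes[j + 1], requires_grad=True)
          for j in range(len(sizes) - 1)]
optimizer = torch.optim.Adam(blocks, lr=0.01)
for step in range(2000):
    optimizer.zero_grad()
    loss = frame_potential(blocks, sizes)
    loss.backward()
    optimizer.step()
print(f"estimated minimum deep frame potential: {loss.item():.5f}")
```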


Figure 4: A comparison of fully-connected deep network architectures with varying depths and widths. Warmer colors indicate models with more total parameters. (a) Some very large networks cannot be trained effectively, resulting in unusually high validation errors. (b) This can be remedied through deep frame potential regularization, resulting in high correlation between minimum frame potential and validation error.

Figure 5: The effect of increasing depth in chain and residual networks. Validation error is compared against layer count for two different network widths. (a) In comparison to chain networks, even very deep residual networks can be trained effectively, resulting in decreasing validation error. (b) Despite having the same number of total parameters, residual connections also induce dictionary structures with lower minimum deep frame potentials.

5. Experimental Results

In this section, we demonstrate correlation between the minimum deep frame potential and validation error on the CIFAR-10 dataset [19] across a wide variety of fully-connected, convolutional, chain, residual, and densely connected network architectures. Furthermore, we show that networks with skip connections can have lower deep frame potentials with fewer learnable parameters, which is predictive of the parameter efficiency of trained networks.

In Fig. 4, we visualize a scatter plot of trained fully-connected networks with between three and five layers and between 16 and 4096 units in each layer. The corresponding architectures are shown as a list of units per layer for a few representative examples. The minimum frame potential of each architecture is compared against its validation error after training, and the total parameter count is indicated by color. In Fig. 4a, some networks with many parameters have unusually high error due to the difficulty in training very large fully-connected networks. In Fig. 4b, the addition of a deep frame potential regularization term overcomes some of these optimization difficulties for improved parameter efficiency. This results in high correlation between minimum frame potential and validation error. Furthermore, it emphasizes the diminishing returns of increasing the size of fully-connected chain networks; after a certain point, adding more parameters does little to reduce both validation error and minimum frame potential.

To evaluate the effects of residual connections [13], we adapt the simplified CIFAR-10 ResNet architecture from [33], which consists of a single convolutional layer followed by three groups of residual blocks with activations given in Eq. 9. Before the second and third groups, the number of filters is increased by a factor of two and the spatial resolution is decreased by half through average pooling. To compare networks with different sizes, we modify their depths by changing the number of residual blocks in each group from between 2 and 10, and their widths by changing the base number of filters from between 4 and 32. For our experiments with densely connected skip connections [15], we adapt the simplified CIFAR-10 DenseNet architecture from [21]. Like with residual networks, it consists of a convolutional layer followed by three groups of the activations from Eq. 11 with decreasing spatial resolutions and increasing numbers of filters. Within each group, a dense block is the concatenation of smaller convolutions that take all previous outputs as inputs, with filter numbers equal to a fixed growth rate. Network depth and width are modified by respectively increasing the number of layers per group and the base growth rate from between 2 and 12. Batch normalization [16] was also used in all convolutional networks.
In Fig. 5, we compare the validation errors and minimum frame potentials of residual networks and comparable chain networks with residual connections removed. In Fig. 5a, the validation error of chain networks increases for deeper networks while that of residual networks is lower and consistently decreases. This emphasizes the difficulty in training very deep chain networks. In Fig. 5b, we show that residual connections enable lower minimum frame potentials following a similar trend with respect to increasing model size, again demonstrating correlation between validation error and minimum frame potential.

In Fig. 6, we compare chain networks and residual networks with exactly the same number of parameters, where color indicates the number of residual blocks per group and connected data points have the same depths but different widths. The addition of skip connections reduces both validation error and minimum frame potential, as visualized by consistent placement below the diagonal line indicating lower values for residual networks than comparable chain networks. This effect becomes even more pronounced with increasing depths and widths.

Figure 6: A comparison of (a) validation error and (b) minimum frame potential between residual networks and chain networks. Colors indicate different depths, and data points are connected in order of increasing widths of 4, 8, 16, or 32 filters. Skip connections result in reduced error correlating with frame potential, with dense networks showing superior efficiency with increasing depth.

In Fig. 7, we compare the parameter efficiency of chain networks, residual networks, and densely connected networks of different depths and widths. We visualize both validation error and minimum frame potential as functions of the number of parameters, demonstrating the improved scalability of networks with skip connections. While chain networks demonstrate increasingly poor parameter efficiency with respect to increasing depth in Fig. 7a, the skip connections of ResNets and DenseNets allow for further reducing error with larger network sizes in Figs. 7c,e. Considering all network families together as in Fig. 2d, we see that denser connections also allow for lower validation error with comparable numbers of parameters. This trend is mirrored in the minimum frame potentials of Figs. 7b,d,f, which are shown together in Fig. 2e. Despite some fine variations in behavior across different families of architectures, minimum deep frame potential is correlated with validation error across network sizes and effectively predicts the increased generalization capacity provided by skip connections.

Figure 7: A demonstration of the improved scalability of networks with skip connections, where line colors indicate different depths and data points are connected showing increasing widths. (a) Chain networks with greater depths have increasingly worse parameter efficiency in comparison to (c) the corresponding networks with residual connections and (e) densely connected networks of similar size, whose performance scales efficiently with parameter count. This could potentially be attributed to correlated efficiency in reducing frame potential with fewer parameters, which saturates much faster with (b) chain networks than with (d) residual networks or (f) densely connected networks.

6. Conclusion

In this paper, we proposed a technique for comparing deep network architectures by approximately quantifying their implicit capacity for effective data representations. Based upon theoretical connections between sparse approximation and deep neural networks, we demonstrated how architectural hyper-parameters such as depth, width, and skip connections induce different structural properties of the dictionaries in corresponding sparse coding problems.
We compared these dictionary structures through lower bounds on their mutual coherence, which is theoretically tied to their capacity for uniquely and robustly representing data via sparse approximation. A theoretical lower bound was derived for chain networks, and the deep frame potential was proposed as an empirical optimization objective for constructing bounds for more complicated architectures.

Experimentally, we observed a correlation between minimum deep frame potential and validation error across different families of modern architectures with skip connections, including residual networks and densely connected convolutional networks. This suggests a promising direction for future research towards the theoretical analysis and practical construction of deep network architectures derived from connections between deep learning and sparse coding.

Acknowledgments: This work was supported by the CMU Argo AI Center for Autonomous Vehicle Research and by the National Science Foundation under Grant No. 1925281.

References

[1] Jose Alvarez and Mathieu Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
[2] Devansh Arpit et al. A closer look at memorization in deep networks. In International Conference on Machine Learning (ICML), 2017.
[3] Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1), 1989.
[4] John Benedetto and Matthew Fickus. Finite normalized tight frames. Advances in Computational Mathematics, 18(2-4), 2003.
[5] Alfred M. Bruckstein et al. On the uniqueness of nonnegative sparse solutions to underdetermined systems of equations. IEEE Transactions on Information Theory, 54(11), 2008.
[6] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3), 1995.
[7] Misha Denil et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems (NeurIPS), 2013.
[8] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4), 2006.
[9] David L. Donoho and Michael Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization. Proceedings of the National Academy of Sciences, 100(5), 2003.
[10] David L. Donoho, Michael Elad, and Vladimir N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1), 2005.
[11] Michael Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer Science & Business Media, 2010.
[12] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. Journal of Machine Learning Research (JMLR), 20(55), 2019.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[14] Yihui He et al. Channel pruning for accelerating very deep neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[15] Gao Huang, Zhuang Liu, Kilian Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[16] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.
[17] Jelena Kovačević, Amina Chebira, et al. An introduction to frames. Foundations and Trends in Signal Processing, 2(1), 2008.
[18] Alex Krizhevsky et al. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2012.
[19] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[20] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.
[21] Yixuan Li. Tensorflow DenseNet. https://github.com/YixuanLi/densenet-tensorflow, 2018.
[22] Cisse Moustapha, Bojanowski Piotr, Grave Edouard, Dauphin Yann, and Usunier Nicolas. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning (ICML), 2017.
[23] Calvin Murdock, MingFang Chang, and Simon Lucey. Deep component analysis via alternating direction neural networks. In European Conference on Computer Vision (ECCV), 2018.
[24] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[25] Roman Novak et al. Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations (ICLR), 2018.
[26] Vardan Papyan, Yaniv Romano, and Michael Elad. Convolutional neural networks analyzed via convolutional sparse coding. Journal of Machine Learning Research (JMLR), 18(83), 2017.
[27] Yaniv Romano, Aviad Aberdam, Jeremias Sulam, and Michael Elad. Adversarial noise attacks of deep learning architectures: Stability analysis via sparse modeled signals. Journal of Mathematical Imaging and Vision, 2018.
[28] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
[29] Jeremias Sulam, Aviad Aberdam, Amir Beck, and Michael Elad. On multi-layer basis pursuit, efficient algorithms and convolutional neural networks. Pattern Analysis and Machine Intelligence (PAMI), 2019.
[30] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (ICML), 2019.
[31] Vladimir N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, XVI(2), 1971.
[32] Lloyd Welch. Lower bounds on the maximum cross correlation of signals. IEEE Transactions on Information Theory, 20(3), 1974.
[33] Yuxin Wu et al. Tensorpack. https://github.com/tensorpack/, 2016.
[34] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017.