Dataless Model Selection with the Deep Frame Potential

Calvin Murdock (Carnegie Mellon University) and Simon Lucey (Carnegie Mellon University, Argo AI)
{cmurdock,slucey}@cs.cmu.edu

Abstract

Choosing a deep neural network architecture is a fundamental problem in applications that require balancing performance and parameter efficiency. Standard approaches rely on ad-hoc engineering or computationally expensive validation on a specific dataset. We instead attempt to quantify networks by their intrinsic capacity for unique and robust representations, enabling efficient architecture comparisons without requiring any data. Building upon theoretical connections between deep learning and sparse approximation, we propose the deep frame potential: a measure of coherence that is approximately related to representation stability but has minimizers that depend only on network structure. This provides a framework for jointly quantifying the contributions of architectural hyper-parameters such as depth, width, and skip connections. We validate its use as a criterion for model selection and demonstrate correlation with generalization error on a variety of common residual and densely connected network architectures.

Figure 1: Why are some deep neural network architectures better than others? In comparison to (a) standard chain connections, skip connections like those in (b) ResNets [13] and (c) DenseNets [15] have demonstrated significant improvements in training effectiveness, parameter efficiency, and generalization performance. We provide one possible explanation for this phenomenon by approximating network activations as (d) solutions to sparse approximation problems with different induced dictionary structures.

1. Introduction

Deep neural networks have dominated nearly every benchmark within the field of computer vision. While this modern influx of deep learning originally began with the task of large-scale image recognition [18], new datasets, loss functions, and network configurations have quickly expanded its scope to include a much wider range of applications. Despite this, the underlying architectures used to learn effective image representations are generally consistent across all of them. This can be seen through the community's quick adoption of the newest state-of-the-art deep networks, from AlexNet [18] to VGGNet [28], ResNets [13], DenseNets [15], and so on. But this begs the question: why do some deep network architectures work better than others? Despite years of groundbreaking empirical results, an answer to this question still remains elusive.

Fundamentally, the difficulty in comparing network architectures arises from the lack of a theoretical foundation for characterizing their generalization capacities. Shallow techniques like support vector machines [6] were aided by theoretical tools like the VC-dimension [31] for determining when their predictions could be trusted to avoid overfitting. Deep neural networks, on the other hand, have eschewed similar analyses due to their complexity. Theoretical explorations of deep network generalization [24] are often disconnected from practical applications and rarely provide actionable insight into how architectural hyper-parameters contribute to performance.

Building upon recent connections between deep learning and sparse approximation [26, 23], we instead interpret feed-forward deep networks as algorithms for approximate inference in related sparse coding problems. These problems aim to optimally reconstruct zero-padded input images as sparse, nonnegative linear combinations of atoms from architecture-dependent dictionaries, as shown in Fig. 1. We propose to indirectly analyze practical deep network architectures with complicated skip connections, like residual networks (ResNets) [13] and densely connected convolutional networks (DenseNets) [15], simply through the dictionary structures that they induce.

To accomplish this, we introduce the deep frame potential for summarizing the interactions between parameters in feed-forward deep networks. As a lower bound on mutual coherence (the maximum magnitude of the normalized inner products between all pairs of dictionary atoms [9]), it is theoretically tied to generalization properties of the related sparse coding problems. However, its minimizers depend only on the dictionary structures induced by the corresponding network architectures. This enables dataless model comparison by jointly quantifying the contributions of depth, width, and connectivity.

Our approach is motivated by sparse approximation theory [11], a field that encompasses properties like the uniqueness and robustness of shallow, overcomplete representations.
In sparse coding, capacity is controlled by the number of dictionary atoms used in sparse data reconstructions. While more parameters allow for more accurate representations, they may also increase input sensitivity for worse generalization performance. Conceptually, this is comparable to overfitting in nearest-neighbor classification, where representations are sparse, one-hot indicator vectors corresponding to nearest training examples. As the number of training data increases, the distance between them decreases, so they are more likely to be confused with one another. Similarly, nearby dictionary atoms may introduce instability that causes representations of similar data points to become very far apart, leading to poor generalization performance. Thus, there is a fundamental tradeoff between the capacity and robustness of shallow representations due to the proximity of dictionary atoms as measured by mutual coherence.

However, deep representations have not shown the same correlation between model size and sensitivity [34]. While adding more layers to a deep neural network increases its capacity, it also simultaneously introduces implicit regularization to reduce overfitting. This can be explained through the proposed connection to sparse coding, where additional layers increase both capacity and effective input dimensionality. In a higher-dimensional space, dictionary atoms can be spaced further apart for more robust representations. Furthermore, architectures with denser skip connections induce dictionary structures with more nonzero elements, which provides additional freedom to further reduce mutual coherence with fewer parameters, as shown in Fig. 2.

Figure 2: Parameter count is not a good indicator of generalization performance for deep networks. Instead, we compare different network architectures via the minimum deep frame potential, the average nonzero magnitude of inner products between atoms of architecture-induced dictionaries. In comparison to (a) chain networks, skip connections in (b) residual networks and (c) densely connected networks produce Gram matrix structures with more nonzero elements, allowing for (d) lower deep frame potentials across network sizes. This correlates with improved parameter efficiency, giving (e) lower validation error with fewer parameters.

We propose to use the minimum deep frame potential as a cue for model selection. Instead of requiring expensive validation on a specific dataset to approximate generalization performance, architectures are chosen based on how efficiently they can reduce the minimum achievable mutual coherence with respect to the number of model parameters. In this paper, we provide an efficient frame potential minimization method for a general class of convolutional networks with skip connections, of which ResNets and DenseNets are shown to be special cases. Furthermore, we derive an analytic expression for the minimum value in the case of fully-connected chain networks. Experimentally, we demonstrate correlation with validation error across a variety of network architectures.
2. Background and Related Work

Due to the vast space of possible deep network architectures and the computational difficulty in training them, deep model selection has largely been guided by ad-hoc engineering and human ingenuity. While progress slowed in the years following early breakthroughs [20], recent interest in deep learning architectures began anew due to empirical successes largely attributed to computational advances like efficient training using GPUs and rectified linear unit (ReLU) activation functions [18]. Since then, numerous architectural changes have been proposed. For example, much deeper networks with residual connections were shown to achieve consistently better performance with fewer parameters [13]. Building upon this, densely connected convolutional networks with skip connections between more layers yielded even better performance [15]. While theoretical explanations for these improvements were lacking, consistent experimentation on standardized benchmark datasets continued to drive empirical success.

However, due to slowing progress and the need for increased accessibility of deep learning techniques to a wider range of practitioners, more principled approaches to architecture search have recently gained traction. Motivated by observations of extreme redundancy in the parameters of trained networks [7], techniques have been proposed to systematically reduce the number of parameters without adversely affecting performance. Examples include sparsity-inducing regularizers during training [1] or post-processing to prune the parameters of trained networks [14]. Constructive approaches to model selection like neural architecture search [12] instead attempt to compose architectures from basic building blocks through tools like reinforcement learning. Efficient model scaling has also been proposed to enable more effective grid search for selecting architectures subject to resource constraints [30]. While automated techniques can match or even surpass manually engineered alternatives, they require a validation dataset and rarely provide insights transferable to other settings.

To better understand the implicit benefits of different network architectures, there have been adjacent theoretical explorations of deep network generalization. These works are often motivated by the surprising observation that good performance can still be achieved using highly over-parametrized models with degrees of freedom that surpass the number of training data.
This contradicts many commonly accepted ideas about generalization, spurring new experimental explorations that have demonstrated properties unique to deep learning. Examples include the ability of deep networks to express random data labels [34] with a tendency towards learning simple patterns first [2]. While exact theoretical explanations are lacking, empirical measurements of network sensitivity such as the Jacobian norm have been shown to correlate with generalization [25]. Similarly, Parseval regularization [22] encourages robustness by constraining the Lipschitz constants of individual layers.

Due to the difficulty in analyzing deep networks directly, other approaches have instead drawn connections to the rich field of sparse approximation theory. The relationship between feed-forward neural networks and principal component analysis has long been known for the case of linear activations [3]. More recently, nonlinear deep networks with ReLU activations have been linked to multilayer sparse coding to prove theoretical properties of deep representations [26]. This connection has been used to motivate new recurrent architecture designs that resist adversarial noise attacks [27], improve classification performance [29], or enforce prior knowledge through output constraints [23].

3. Deep Learning as Sparse Approximation

To derive our criterion for model selection, we build upon recent connections between deep neural networks and sparse approximation. Specifically, consider a feed-forward network f(x) = φ_l(B_l^T ⋯ φ_1(B_1^T x)) constructed as the composition of linear transformations with parameters B_j and nonlinear activation functions φ_j. Equivalently, f(x) = w_l, where w_j = φ_j(B_j^T w_{j-1}) are the layer activations for layers j = 1, ..., l and w_0 = x. In many modern state-of-the-art networks, the ReLU activation function has been adopted due to its effectiveness and computational efficiency. It can also be interpreted as the nonnegative soft-thresholding proximal operator associated with the function Φ in Eq. 1, the sum of a nonnegativity constraint I(w ≥ 0) and a sparsity-inducing ℓ1 penalty with a weight determined by the scalar bias parameter λ:

\Phi(w) = I(w \geq 0) + \lambda\|w\|_1, \qquad \phi(x) = \mathrm{ReLU}(x - \lambda 1) = \arg\min_{w} \tfrac{1}{2}\|w - x\|_2^2 + \Phi(w) \quad (1)

Thus, the forward pass of a deep network is equivalent to a layered thresholding pursuit algorithm for approximating the solution of a multi-layer sparse coding model [26]. Results from shallow sparse approximation theory can then be adapted to bound the accuracy of this approximation, which improves as the mutual coherence defined below in Eq. 2 decreases, and to indirectly analyze other theoretical properties of deep networks like uniqueness and robustness.
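As a quick numerical check of the proximal interpretation in Eq. 1, the sketch below (our illustration, not part of the paper) compares ReLU(x - λ1) against a brute-force grid minimization of the nonnegative soft-thresholding objective; the two coincide up to grid resolution.

```python
# Sketch (illustration only): verify that ReLU(x - lambda) solves the
# nonnegative soft-thresholding problem of Eq. 1,
#     argmin_w 0.5 * ||w - x||^2 + lambda * ||w||_1   subject to  w >= 0,
# by comparing against a dense grid search over candidate solutions.
import numpy as np

rng = np.random.default_rng(0)
lam = 0.3
x = rng.normal(size=5)

relu_solution = np.maximum(x - lam, 0.0)        # ReLU(x - lambda * 1)

# The objective is separable, so each coordinate can be solved on a 1-D grid
# restricted to w >= 0 (the nonnegativity constraint is built into the grid).
grid = np.linspace(0.0, 5.0, 500001)
objective = 0.5 * (grid[None, :] - x[:, None]) ** 2 + lam * grid[None, :]
grid_solution = grid[np.argmin(objective, axis=1)]

print(relu_solution)
print(grid_solution)
print(np.allclose(relu_solution, grid_solution, atol=1e-4))   # True
```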
3.1. Sparse Approximation Theory

Sparse approximation theory considers representations of data vectors x ∈ R^d as sparse linear combinations x ≈ Σ_j w_j b_j = Bw of atoms from an over-complete dictionary B ∈ R^{d×k}. The number of atoms k is greater than the dimensionality d, and the number of nonzero coefficients ‖w‖_0 in the representation w ∈ R^k is small.

Through applications like compressed sensing [8], sparsity has been found to exhibit theoretical properties that enable data representation with efficiency far greater than what was previously thought possible. Central to these results is the requirement that the dictionary be "well-behaved," essentially ensuring that its columns are not too similar. For undercomplete matrices with k ≤ d, this is satisfied by enforcing orthogonality, but overcomplete dictionaries require other conditions. Specifically, we focus our attention on the mutual coherence µ, the maximum magnitude normalized inner product of all pairs of dictionary atoms. Equivalently, it is the maximum magnitude off-diagonal element in the Gram matrix G = B̃^T B̃, where the columns of B̃ are normalized to have unit norm:

\mu = \max_{i \neq j} \frac{|b_i^\top b_j|}{\|b_i\| \|b_j\|} = \max_{i,j} |(G - I)_{ij}| \quad (2)

We are primarily motivated by the observation that a model's capacity for low mutual coherence increases along with its capacity for both memorizing more training data through unique representations and generalizing to more validation data through robustness to input perturbations.

With an overcomplete dictionary, there is a space of coefficients w that can all exactly reconstruct any data point as x = Bw, which would not support discriminative representation learning. However, if representations from a mutually incoherent dictionary are sufficiently sparse, then they are guaranteed to be optimally sparse and unique [9]. Specifically, if the number of nonzeros ‖w‖_0 < ½(1 + µ^{-1}), then w is the unique, sparsest representation for x. Furthermore, if ‖w‖_0 < (√2 − 0.5)µ^{-1}, then it can be found efficiently by convex optimization with ℓ1 regularization. Thus, minimizing the mutual coherence of a dictionary increases its capacity for uniquely representing data points.

Sparse representations are also robust to input perturbations [10]. Specifically, given a noisy data point x = x_0 + z, where x_0 can be represented exactly as x_0 = Bw_0 with ‖w_0‖_0 ≤ ¼(1 + µ^{-1}) and the noise z has bounded magnitude ‖z‖_2 ≤ ε, then w_0 can be approximated by solving the ℓ1-penalized problem:

\arg\min_{w} \|x - Bw\|_2^2 + \lambda\|w\|_1 \quad (3)

Its solution is stable, and the approximation error is bounded from above in Eq. 4, where δ(x, λ) is a constant:

\|w - w_0\|_2^2 \leq \frac{(\epsilon + \delta(x, \lambda))^2}{1 - \mu(4\|w\|_0 - 1)} \quad (4)

Thus, minimizing the mutual coherence of a dictionary decreases the sensitivity of its sparse representations for improved robustness. This is similar to evaluating input sensitivity using the Jacobian norm [25]. However, instead of estimating the average perturbation error over validation data, it bounds the worst-case error over all possible data.
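For an explicit dictionary, the mutual coherence of Eq. 2 and the associated uniqueness threshold from [9] are only a few lines of linear algebra. The following sketch (our illustration with an arbitrary random dictionary) computes both.

```python
# Sketch: mutual coherence (Eq. 2) of a random overcomplete dictionary and
# the sparsity level below which representations are guaranteed unique,
# ||w||_0 < 0.5 * (1 + 1/mu), following [9].
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 256                                          # dimensionality, atoms
B = rng.normal(size=(d, k))

B_tilde = B / np.linalg.norm(B, axis=0, keepdims=True)  # unit-norm columns
G = B_tilde.T @ B_tilde                                 # Gram matrix
mu = np.max(np.abs(G - np.eye(k)))                      # max off-diagonal magnitude

print(f"mutual coherence mu = {mu:.3f}")
print(f"uniqueness guaranteed below {0.5 * (1 + 1 / mu):.2f} nonzeros")
```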
3.2. Deep Component Analysis

While deep representations can be analyzed by accumulating the effects of approximating individual layers in a chain network as shallow sparse coding problems [26], this strategy cannot be easily adapted to account for more complicated interactions between layers. Instead, we adapt the framework of Deep Component Analysis [23], which jointly represents all layers in a neural network as the single sparse coding problem in Eq. 5. The ReLU activations w_j ∈ R^{k_j} of a feed-forward chain network approximate the solutions to a joint optimization problem where w_0 = x ∈ R^{k_0} and the regularization functions Φ_j are nonnegative sparsity-inducing penalties as defined in Eq. 1:

w_j := \phi_j(B_j^\top w_{j-1}) \;\; \forall j = 1, \dots, l \;\approx\; \arg\min_{\{w_j\}} \sum_{j=1}^{l} \|B_j w_j - w_{j-1}\|_2^2 + \Phi_j(w_j) \quad (5)

The compositional constraints between adjacent layers are relaxed and replaced by reconstruction error penalty terms, resulting in a convex, nonnegative sparse coding problem.

By combining the terms in the summation of Eq. 5 together into a single system, this problem can be equivalently represented as shown in Eq. 6. The latent variables w_j for each layer are stacked in the vector w, the regularizer is Φ(w) = Σ_j Φ_j(w_j), and the input x is augmented with zeros.

\arg\min_{w} \left\| \underbrace{\begin{bmatrix} B_1 & 0 & & \\ -I & B_2 & \ddots & \\ & \ddots & \ddots & 0 \\ & & -I & B_l \end{bmatrix}}_{B} \underbrace{\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_l \end{bmatrix}}_{w} - \begin{bmatrix} x \\ 0 \\ \vdots \\ 0 \end{bmatrix} \right\|_2^2 + \Phi(w) \quad (6)

The layer parameters B_j ∈ R^{k_{j-1}×k_j} are blocks in the induced dictionary B, which has Σ_j k_{j-1} rows and Σ_j k_j columns. It has a structure of nonzero elements that summarizes the corresponding feed-forward deep network architecture, wherein the off-diagonal identity matrices are connections between adjacent layers.

Model capacity can be increased both by adding parameters to a layer or by adding layers, which implicitly pads the input data x with more zeros. This can actually reduce the dictionary's mutual coherence because it increases the system's dimensionality. Thus, depth allows model complexity to scale jointly alongside effective input dimensionality so that the induced dictionary structures still have the capacity for low mutual coherence and improved capabilities for memorization and generalization.
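To make the block structure of Eq. 6 concrete, the sketch below (our illustration, with arbitrary example layer sizes and a helper name of our choosing) assembles the induced dictionary for a small fully-connected chain network: each diagonal block holds the layer parameters B_j and each sub-diagonal block is a negative identity coupling adjacent layers.

```python
# Sketch: assemble the chain-network induced dictionary of Eq. 6.
# Block row j encodes the residual B_j w_j - w_{j-1}, so the diagonal block
# holds B_j and the sub-diagonal block holds -I. Layer sizes are arbitrary.
import numpy as np

def chain_dictionary(params):
    """params[j-1] is B_j with shape (k_{j-1}, k_j)."""
    rows = sum(B.shape[0] for B in params)   # sum_j k_{j-1}
    cols = sum(B.shape[1] for B in params)   # sum_j k_j
    D = np.zeros((rows, cols))
    r = c = 0
    for j, B in enumerate(params):
        k_prev, k_cur = B.shape
        D[r:r + k_prev, c:c + k_cur] = B                     # diagonal block B_j
        if j > 0:
            D[r:r + k_prev, c - k_prev:c] = -np.eye(k_prev)  # coupling to w_{j-1}
        r, c = r + k_prev, c + k_cur
    return D

rng = np.random.default_rng(0)
sizes = [16, 32, 32, 10]   # k_0, ..., k_l (example only)
params = [rng.normal(size=(sizes[j], sizes[j + 1])) for j in range(len(sizes) - 1)]
B = chain_dictionary(params)
print(B.shape)             # (16 + 32 + 32, 32 + 32 + 10) = (80, 74)
```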

3.3. Architecture-Induced Dictionary Structure

In this section, we extend this model formulation to incorporate more complicated network architectures. Because mutual coherence is dependent on normalized dictionary atoms, it can be reduced by increasing the number of nonzero elements, which reduces the magnitudes of the dictionary elements and their inner products. In Eq. 7, we replace the identity connections of Eq. 6 with blocks of nonzero parameters to allow for lower mutual coherence.

B = \begin{bmatrix} B_{11} & & & 0 \\ B_{21} & B_{22} & & \\ \vdots & \ddots & \ddots & \\ B_{l1} & \cdots & B_{l(l-1)} & B_{ll} \end{bmatrix} \quad (7)

This structure is induced by the feed-forward activations in Eq. 8, which again approximate the solutions to a nonnegative sparse coding problem.

w_j := \phi_j\Big({-B_{jj}^\top}\sum_{k=1}^{j-1} B_{jk}^\top w_k\Big) \;\; \forall j = 1, \dots, l \;\approx\; \arg\min_{\{w_j\}} \sum_{j=1}^{l} \Big\| B_{jj} w_j + \sum_{k=1}^{j-1} B_{jk} w_k \Big\|_2^2 + \Phi_j(w_j) \quad (8)

In comparison to Eq. 5, additional parameters introduce skip connections between layers so that the activations w_j of layer j now depend on those of all previous layers k < j. These connections are similar to the identity mappings in residual networks [13], which introduce dependence between the activations of pairs of layers for even j ∈ [1, l−1]:

w_j := \phi_j(B_j^\top w_{j-1}), \qquad w_{j+1} := \phi_{j+1}(w_{j-1} + B_{j+1}^\top w_j) \quad (9)

In comparison to chain networks, no additional parameters are required; the only difference is the addition of w_{j−1} in the argument of φ_{j+1}. As a special case of Eq. 8, we interpret the activations in Eq. 9 as approximate solutions to the optimization problem:

\arg\min_{\{w_j\}} \|x - B_1 w_1\|_2^2 + \sum_{j=1}^{l} \Phi_j(w_j) + \sum_{\text{even } j} \|w_j - B_j^\top w_{j-1}\|_2^2 + \|w_{j+1} - w_{j-1} - B_{j+1}^\top w_j\|_2^2 \quad (10)

This results in the induced dictionary structure of Eq. 7 with B_{jj} = I for j > 1, B_{jk} = 0 for j > k + 1, B_{jk} = 0 for j > k with odd k, and B_{jk} = I for j > k with even k.

Figure 3: A visualization of a one-dimensional convolutional dictionary with two input channels, five output channels, and a filter size of three. (a) The filters are repeated over eight spatial dimensions, resulting in a (b) block-Toeplitz structure that is revealed through row and column permutations. (c) The corresponding Gram matrix can be efficiently computed by (d) repeating local filter interactions.

Building upon the empirical successes of residual networks, densely connected convolutional networks [15] incorporate skip connections between earlier layers as well. This is shown in Eq. 11, where the transformation B_j of concatenated variables [w_k]_{k=1}^{j-1} is equivalently written as the summation of smaller transformations B_{jk}.

w_j := \phi_j\big(B_j^\top [w_k]_{k=1}^{j-1}\big) = \phi_j\Big(\sum_{k=1}^{j-1} B_{jk}^\top w_k\Big) \quad (11)

These activations again provide approximate solutions to the problem in Eq. 8 with the induced dictionary structure of Eq. 7, where B_{jj} = I for j > 1 and the lower blocks B_{jk} for j > k are all filled with learned parameters.

Skip connections enable effective learning in much deeper networks than chain-structured alternatives. While originally motivated from the perspective of making optimization easier [13], adding more connections between layers was also shown to improve generalization performance [15]. As compared in Fig. 2, denser skip connections induce dictionary structures with denser Gram matrices, allowing for lower mutual coherence. This suggests that architectures' capacities for low validation error can be quantified and compared based on their capacities for inducing dictionaries with low minimum mutual coherence.
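To illustrate how denser connectivity changes the induced Gram matrix (the effect summarized in Fig. 2), the sketch below (our simplified illustration, with equal layer widths and randomly filled learned blocks, none of it taken from the paper) builds the Eq. 7 dictionary under either chain or fully dense connectivity and counts the nonzero off-diagonal entries of the resulting Gram matrix.

```python
# Sketch: build the Eq. 7 induced dictionary for a small network with either
# chain connectivity (only adjacent layers coupled) or dense skip connections
# (every earlier layer coupled to every later one), and compare the number of
# nonzero off-diagonal Gram matrix entries. Equal layer widths are assumed so
# that identity diagonal blocks fit; all sizes and values are illustrative.
import numpy as np

def induced_dictionary(sizes, dense, rng):
    l = len(sizes) - 1
    rows, cols = sizes[:-1], sizes[1:]              # block row/column sizes
    D = np.zeros((sum(rows), sum(cols)))
    r_off = np.cumsum([0] + list(rows))
    c_off = np.cumsum([0] + list(cols))
    for j in range(l):
        # Diagonal block: learned parameters for the first layer, identity after.
        diag = rng.normal(size=(rows[j], cols[j])) if j == 0 else np.eye(rows[j], cols[j])
        D[r_off[j]:r_off[j + 1], c_off[j]:c_off[j + 1]] = diag
        lower = range(j) if dense else ([j - 1] if j > 0 else [])
        for k in lower:                             # learned lower blocks B_jk
            D[r_off[j]:r_off[j + 1], c_off[k]:c_off[k + 1]] = rng.normal(
                size=(rows[j], cols[k]))
    return D

rng = np.random.default_rng(0)
sizes = [16, 16, 16, 16, 16]                        # equal widths (assumption)
for dense in (False, True):
    B = induced_dictionary(sizes, dense, rng)
    B_tilde = B / np.linalg.norm(B, axis=0, keepdims=True)
    G = B_tilde.T @ B_tilde
    off_diag = np.abs(G - np.diag(np.diag(G))) > 1e-12
    print("dense" if dense else "chain", "nonzero off-diagonal entries:",
          np.count_nonzero(off_diag))
```

The nonzero count reported here is the same quantity N(G) that normalizes the deep frame potential in Eq. 12 below.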
4. The Deep Frame Potential

We propose to use lower bounds on the mutual coherence of induced structured dictionaries for the data-independent comparison of architecture capacities. Note that while one-sided coherence is better suited to nonnegativity constraints, it has the same lower bound [5]. Directly optimizing mutual coherence from Eq. 2 is difficult due to its piecewise structure. Instead, we consider a tight lower bound obtained by replacing the maximum off-diagonal element of the Gram matrix G with the mean. This gives the averaged frame potential F(B), a strongly-convex function that can be optimized more effectively [4]:

F^2(B) = N^{-1}(G)\big(\|G\|_F^2 - \operatorname{Tr} G\big) \leq \mu^2(B) \quad (12)

Here, N(G) is the number of nonzero off-diagonal elements in the Gram matrix and Tr G equals the total number of dictionary atoms. Equality is met in the case of equiangular tight frames, when the normalized inner products between all dictionary atoms are equivalent [17]. Due to the block-sparse structure of the induced dictionaries from Eq. 7, we evaluate the frame potential in terms of local blocks G_{jj'} ∈ R^{k_j×k_{j'}} that are nonzero only if layer j is connected to layer j'. In the case of convolutional layers with localized spatial support, there is also a repeated implicit structure of nonzero elements, as visualized in Fig. 3.

To compute the Gram matrix, we first need to normalize the global induced dictionary B from Eq. 7. By stacking the column magnitudes of layer j as the elements in the diagonal matrix C_j = diag(c_j) ∈ R^{k_j×k_j}, the normalized parameters can be represented as B̃_{ij} = B_{ij} C_j^{-1}. Similarly, the squared norms of the full set of columns in the global dictionary B are N_j^2 = Σ_{i=j}^{l} C_{ij}^2. The full normalized dictionary can then be found as B̃ = B N^{-1}, where the matrix N is block diagonal with N_j as its blocks. The blocks of the Gram matrix G = B̃^T B̃ are then given as:

G_{jj'} = \sum_{i=j'}^{l} N_j^{-1} B_{ij}^\top B_{ij'} N_{j'}^{-1} \quad (13)

For chain networks, G_{jj'} ≠ 0 only when j = j' + 1, which represents the connections between adjacent layers. In this case, the blocks can be simplified as:

G_{jj} = (C_j^2 + I)^{-\frac{1}{2}} (B_j^\top B_j + I)(C_j^2 + I)^{-\frac{1}{2}} \quad (14)
G_{j(j+1)} = -(C_j^2 + I)^{-\frac{1}{2}} B_{j+1} (C_{j+1}^2 + I)^{-\frac{1}{2}} \quad (15)
G_{ll} = B_l^\top B_l \quad (16)

Because the diagonal is removed in the deep frame potential computation, the contribution of G_{jj} is simply a rescaled version of the local frame potential of layer j. The contribution of G_{j(j+1)}, on the other hand, can essentially be interpreted as rescaled ℓ2 weight decay, where rows are weighted more heavily if the corresponding columns of the previous layer's parameters have higher magnitudes. Furthermore, since the global frame potential is averaged over the total number of nonzero elements in G, if a layer has more parameters, then it will be given more weight in this computation. For more general networks with skip connections, however, the summation from Eq. 13 has additional terms that introduce more complicated interactions. In these cases, it cannot be evaluated from local properties of layers.

Essentially, the deep frame potential summarizes the structural properties of the global dictionary B induced by a deep network architecture by balancing interactions within each individual layer, through local coherence properties, and between connecting layers.
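Given any explicit induced dictionary, the deep frame potential of Eq. 12 can be evaluated directly. The sketch below is our illustration (the helper name is ours, and the structurally nonzero entries are identified numerically rather than from the block pattern); it also reports the mutual coherence that the frame potential lower-bounds.

```python
# Sketch: evaluate the deep frame potential of Eq. 12 for an explicit dictionary
# B by normalizing columns, forming the Gram matrix, and averaging the squared
# off-diagonal entries over the N(G) nonzero off-diagonal positions.
import numpy as np

def deep_frame_potential(B, tol=1e-12):
    B_tilde = B / np.linalg.norm(B, axis=0, keepdims=True)
    G = B_tilde.T @ B_tilde
    off_diag = G - np.diag(np.diag(G))
    nonzero = np.abs(off_diag) > tol        # structurally nonzero entries, N(G)
    F_squared = np.sum(off_diag[nonzero] ** 2) / np.count_nonzero(nonzero)
    mu = np.max(np.abs(off_diag))           # mutual coherence (Eq. 2)
    return F_squared, mu

rng = np.random.default_rng(0)
B = rng.normal(size=(64, 256))              # stand-in for an induced dictionary
F_squared, mu = deep_frame_potential(B)
print(f"F^2(B) = {F_squared:.4f} <= mu^2(B) = {mu ** 2:.4f}")
```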
4.1. Theoretical Lower Bound for Chain Networks

While the deep frame potential is a function of parameter values, its minimum value is determined only by the architecture-induced dictionary structure. Furthermore, we know that it must be bounded by a nonzero constant for overcomplete dictionaries. In this section, we derive this lower bound for the special case of fully-connected chain networks and provide intuition for why skip connections increase the capacity for low mutual coherence.

First, observe that a lower bound for the Frobenius norm of G_{j(j+1)} from Eq. 15 cannot be readily attained because the rows and columns are rescaled independently. This means that a lower bound for the norm of G must be found by jointly considering the entire matrix structure, not simply through the summation of its components. To accomplish this, we instead consider the matrix H = B̃B̃^T, which is full rank and has the same norm as G:

\|G\|_F^2 = \|H\|_F^2 = \sum_{j=1}^{l} \|H_{jj}\|_F^2 + 2\sum_{j=1}^{l-1} \|H_{j(j+1)}\|_F^2 \quad (17)

We can then express the individual blocks of H as:

H_{11} = B_1 (C_1^2 + I_{k_1})^{-1} B_1^\top \quad (18)
H_{jj} = B_j (C_j^2 + I)^{-1} B_j^\top + (C_{j-1}^2 + I)^{-1} \quad (19)
H_{j(j+1)} = -B_j (C_j^2 + I)^{-1} \quad (20)

In contrast to G_{j(j+1)} in Eq. 15, only the columns of H_{j(j+1)} in Eq. 20 are rescaled. Since B̃_j has normalized columns, its norm can be exactly expressed as:

\|H_{j(j+1)}\|_F^2 = \sum_{n=1}^{k_j} \left(\frac{c_{jn}}{c_{jn}^2 + 1}\right)^2 \quad (21)

For the other blocks, we find lower bounds for their norms through the same technique used in deriving the Welch bound, which expresses the minimum mutual coherence for unstructured overcomplete dictionaries [32]. Specifically, we apply the Cauchy-Schwarz inequality giving ‖A‖_F^2 ≥ r^{-1}(Tr A)^2 for positive-semidefinite matrices A with rank r. Since the rank of H_{jj} is at most k_{j-1}, we can lower bound the norms of the individual blocks as:

\|H_{11}\|_F^2 \geq \frac{1}{k_0}\left(\sum_{n=1}^{k_1} \frac{c_{1n}^2}{c_{1n}^2 + 1}\right)^2 \quad (22)
\|H_{jj}\|_F^2 \geq \frac{1}{k_{j-1}}\left(\sum_{n=1}^{k_j} \frac{c_{jn}^2}{c_{jn}^2 + 1} + \sum_{p=1}^{k_{j-1}} \frac{1}{c_{(j-1)p}^2 + 1}\right)^2

In the case of dense shallow dictionaries, the Welch bound depends only on the data dimensionality and the number of dictionary atoms. However, due to the structure of the architecture-induced dictionaries, the lower bound of the deep frame potential depends on the data dimensionality, the number of layers, the number of units in each layer, the connectivity between layers, and the relative magnitudes between layers. Skip connections increase the number of nonzero elements in the Gram matrix over which to average and also enable off-diagonal blocks to have lower norms.

4.2. Model Selection

For more general architectures that lack a simple theoretical lower bound, we instead propose bounding the mutual coherence of the architecture-induced dictionary through empirical minimization of the deep frame potential F^2(B) from Eq. 12. Frame potential minimization has been used effectively to construct finite normalized tight frames due to the lack of suboptimal local minima, which allows for effective optimization using gradient descent [4].

We propose using the minimum deep frame potential of an architecture, which is independent of data and individual parameter instantiations, as a means to compare different architectures. In practice, model selection is performed by choosing the candidate architecture with the lowest minimum frame potential subject to desired modeling constraints such as limiting the total number of parameters.
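Since the minimizers depend only on the induced structure, the minimum can be estimated by gradient descent on Eq. 12 with the structurally zero blocks held fixed. The PyTorch sketch below is our simplified illustration of this procedure, not the authors' implementation: it handles only the fully-connected chain structure of Eq. 6, and the layer sizes, optimizer, and step count are hypothetical choices. In practice, each candidate architecture would be minimized this way and ranked by the resulting value at its parameter budget.

```python
# Sketch: estimate the minimum deep frame potential of a chain-structured
# induced dictionary (Eq. 6) by gradient descent on Eq. 12. Only the learned
# blocks B_j are free parameters; the -I couplings and structural zeros are
# fixed, so the minimized value depends only on the architecture. The layer
# sizes below are hypothetical examples.
import torch

def frame_potential(blocks, sizes):
    """Assemble the Eq. 6 dictionary from per-layer blocks and return F^2(B)."""
    B = torch.zeros(sum(sizes[:-1]), sum(sizes[1:]))
    r = c = 0
    for j, Bj in enumerate(blocks):
        k_prev, k_cur = Bj.shape
        B[r:r + k_prev, c:c + k_cur] = Bj
        if j > 0:
            B[r:r + k_prev, c - k_prev:c] = -torch.eye(k_prev)
        r, c = r + k_prev, c + k_cur
    B_tilde = B / B.norm(dim=0, keepdim=True)
    G = B_tilde.T @ B_tilde
    off_diag = G - torch.diag(torch.diag(G))
    num_nonzero = (off_diag.detach().abs() > 1e-12).sum()   # N(G)
    return (off_diag ** 2).sum() / num_nonzero

sizes = [32, 64, 64, 10]      # k_0, ..., k_l (hypothetical)
blocks = [torch.randn(sizes[j], sizes[j + 1], requires_grad=True)
          for j in range(len(sizes) - 1)]
optimizer = torch.optim.Adam(blocks, lr=0.01)
for step in range(2000):
    optimizer.zero_grad()
    loss = frame_potential(blocks, sizes)
    loss.backward()
    optimizer.step()
print(f"estimated minimum deep frame potential: {loss.item():.5f}")
```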


Figure 4: A comparison of fully-connected deep network architectures with varying depths and widths. Warmer colors indicate models with more total parameters. (a) Some very large networks cannot be trained effectively, resulting in unusually high validation errors. (b) This can be remedied through deep frame potential regularization, resulting in high correlation between minimum frame potential and validation error.

Figure 5: The effect of increasing depth in chain and residual networks. Validation error is compared against layer count for two different network widths. (a) In comparison to chain networks, even very deep residual networks can be trained effectively, resulting in decreasing validation error. (b) Despite having the same number of total parameters, residual connections also induce dictionary structures with lower minimum deep frame potentials.

5. Experimental Results

In this section, we demonstrate correlation between the minimum deep frame potential and validation error on the CIFAR-10 dataset [19] across a wide variety of fully-connected, convolutional, chain, residual, and densely connected network architectures. Furthermore, we show that networks with skip connections can have lower deep frame potentials with fewer learnable parameters, which is predictive of the parameter efficiency of trained networks.

In Fig. 4, we visualize a scatter plot of trained fully-connected networks with between three and five layers and between 16 and 4096 units in each layer. The corresponding architectures are shown as a list of units per layer for a few representative examples. The minimum frame potential of each architecture is compared against its validation error after training, and the total parameter count is indicated by color. In Fig. 4a, some networks with many parameters have unusually high error due to the difficulty in training very large fully-connected networks. In Fig. 4b, the addition of a deep frame potential regularization term overcomes some of these optimization difficulties for improved parameter efficiency. This results in high correlation between minimum frame potential and validation error. Furthermore, it emphasizes the diminishing returns of increasing the size of fully-connected chain networks; after a certain point, adding more parameters does little to reduce both validation error and minimum frame potential.

To evaluate the effects of residual connections [13], we adapt the simplified CIFAR-10 ResNet architecture from [33], which consists of a single convolutional layer followed by three groups of residual blocks with activations given in Eq. 9. Before the second and third groups, the number of filters is increased by a factor of two and the spatial resolution is decreased by half through average pooling. To compare networks with different sizes, we modify their depths by changing the number of residual blocks in each group from between 2 and 10, and their widths by changing the base number of filters from between 4 and 32. For our experiments with densely connected skip connections [15], we adapt the simplified CIFAR-10 DenseNet architecture from [21]. Like with residual networks, it consists of a convolutional layer followed by three groups of the activations from Eq. 11 with decreasing spatial resolutions and increasing numbers of filters. Within each group, a dense block is the concatenation of smaller convolutions that take all previous outputs as inputs, with filter numbers equal to a fixed growth rate. Network depth and width are modified by respectively increasing the number of layers per group and the base growth rate from between 2 and 12. Batch normalization [16] was also used in all convolutional networks.
In Fig. 5, we compare the validation errors and minimum frame potentials of residual networks and comparable chain networks with residual connections removed. In Fig. 5a, the validation error of chain networks increases for deeper networks while that of residual networks is lower and consistently decreases. This emphasizes the difficulty in training very deep chain networks. In Fig. 5b, we show that residual connections enable lower minimum frame potentials following a similar trend with respect to increasing model size, again demonstrating correlation between validation error and minimum frame potential.

In Fig. 6, we compare chain networks and residual networks with exactly the same number of parameters, where color indicates the number of residual blocks per group and connected data points have the same depths but different widths. The addition of skip connections reduces both validation error and minimum frame potential, as visualized by consistent placement below the diagonal line indicating lower values for residual networks than comparable chain networks. This effect becomes even more pronounced with increasing depths and widths.

Figure 6: A comparison of (a) validation error and (b) minimum frame potential between residual networks and chain networks. Colors indicate different depths, and data points are connected in order of increasing widths of 4, 8, 16, or 32 filters. Skip connections result in reduced error correlating with frame potential, with dense networks showing superior efficiency with increasing depth.

In Fig. 7, we compare the parameter efficiency of chain networks, residual networks, and densely connected networks of different depths and widths. We visualize both validation error and minimum frame potential as functions of the number of parameters, demonstrating the improved scalability of networks with skip connections. While chain networks demonstrate increasingly poor parameter efficiency with respect to increasing depth in Fig. 7a, the skip connections of ResNets and DenseNets allow for further reducing error with larger network sizes in Figs. 7c,e. Considering all network families together as in Fig. 2d, we see that denser connections also allow for lower validation error with comparable numbers of parameters. This trend is mirrored in the minimum frame potentials of Figs. 7b,d,f, which are shown together in Fig. 2e. Despite some fine variations in behavior across different families of architectures, minimum deep frame potential is correlated with validation error across network sizes and effectively predicts the increased generalization capacity provided by skip connections.

Figure 7: A demonstration of the improved scalability of networks with skip connections, where line colors indicate different depths and data points are connected showing increasing widths. (a) Chain networks with greater depths have increasingly worse parameter efficiency in comparison to (c) the corresponding networks with residual connections and (e) densely connected networks of similar size, whose performance scales efficiently with parameter count. This could potentially be attributed to correlated efficiency in reducing frame potential with fewer parameters, which saturates much faster with (b) chain networks than with (d) residual networks or (f) densely connected networks.

6. Conclusion

In this paper, we proposed a technique for comparing deep network architectures by approximately quantifying their implicit capacity for effective data representations. Based upon theoretical connections between sparse approximation and deep neural networks, we demonstrated how architectural hyper-parameters such as depth, width, and skip connections induce different structural properties of the dictionaries in corresponding sparse coding problems.
We compared these dictionary structures through lower bounds on their mutual coherence, which is theoretically tied to their capacity for uniquely and robustly representing data via sparse approximation. A theoretical lower bound was derived for chain networks, and the deep frame potential was proposed as an empirical optimization objective for constructing bounds for more complicated architectures.

Experimentally, we observed a correlation between minimum deep frame potential and validation error across different families of modern architectures with skip connections, including residual networks and densely connected convolutional networks. This suggests a promising direction for future research towards the theoretical analysis and practical construction of deep network architectures derived from connections between deep learning and sparse coding.

Acknowledgments: This work was supported by the CMU Argo AI Center for Autonomous Vehicle Research and by the National Science Foundation under Grant No. 1925281.

References

[1] Jose Alvarez and Mathieu Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
[2] Devansh Arpit et al. A closer look at memorization in deep networks. In International Conference on Machine Learning (ICML), 2017.
[3] Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1), 1989.
[4] John Benedetto and Matthew Fickus. Finite normalized tight frames. Advances in Computational Mathematics, 18(2-4), 2003.
[5] Alfred M. Bruckstein et al. On the uniqueness of nonnegative sparse solutions to underdetermined systems of equations. IEEE Transactions on Information Theory, 54(11), 2008.
[6] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3), 1995.
[7] Misha Denil et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems (NeurIPS), 2013.
[8] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4), 2006.
[9] David L. Donoho and Michael Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization. Proceedings of the National Academy of Sciences, 100(5), 2003.
[10] David L. Donoho, Michael Elad, and Vladimir N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1), 2005.
[11] Michael Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer Science & Business Media, 2010.
[12] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. Journal of Machine Learning Research (JMLR), 20(55), 2019.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[14] Yihui He et al. Channel pruning for accelerating very deep neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[15] Gao Huang, Zhuang Liu, Kilian Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[16] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.
[17] Jelena Kovačević, Amina Chebira, et al. An introduction to frames. Foundations and Trends in Signal Processing, 2(1), 2008.
[18] Alex Krizhevsky et al. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2012.
[19] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[20] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.
[21] Yixuan Li. Tensorflow DenseNet. https://github.com/YixuanLi/densenet-tensorflow, 2018.
[22] Cisse Moustapha, Bojanowski Piotr, Grave Edouard, Dauphin Yann, and Usunier Nicolas. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning (ICML), 2017.
[23] Calvin Murdock, MingFang Chang, and Simon Lucey. Deep component analysis via alternating direction neural networks. In European Conference on Computer Vision (ECCV), 2018.
[24] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[25] Roman Novak et al. Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations (ICLR), 2018.
[26] Vardan Papyan, Yaniv Romano, and Michael Elad. Convolutional neural networks analyzed via convolutional sparse coding. Journal of Machine Learning Research (JMLR), 18(83), 2017.
[27] Yaniv Romano, Aviad Aberdam, Jeremias Sulam, and Michael Elad. Adversarial noise attacks of deep learning architectures: Stability analysis via sparse modeled signals. Journal of Mathematical Imaging and Vision, 2018.
[28] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
[29] Jeremias Sulam, Aviad Aberdam, Amir Beck, and Michael Elad. On multi-layer basis pursuit, efficient algorithms and convolutional neural networks. Pattern Analysis and Machine Intelligence (PAMI), 2019.
[30] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (ICML), 2019.
[31] Vladimir N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, XVI(2), 1971.
[32] Lloyd Welch. Lower bounds on the maximum cross correlation of signals. IEEE Transactions on Information Theory, 20(3), 1974.
[33] Yuxin Wu et al. Tensorpack. https://github.com/tensorpack/, 2016.
[34] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017.