
Reframing Neural Networks: Deep Structure in Overcomplete Representations

Calvin Murdock and Simon Lucey

Abstract—In comparison to classical shallow representation learning techniques, deep neural networks have achieved superior performance in nearly every application benchmark. But despite their clear empirical advantages, it is still not well understood what makes them so effective. To approach this question, we introduce deep frame approximation, a unifying framework for representation learning with structured overcomplete frames. While exact inference requires iterative optimization, it may be approximated by the operations of a feed-forward deep neural network. We then indirectly analyze how model capacity relates to the frame structure induced by architectural hyperparameters such as depth, width, and skip connections. We quantify these structural differences with the deep frame potential, a data-independent measure of coherence linked to representation uniqueness and stability. As a criterion for model selection, we show correlation with generalization error on a variety of common deep network architectures such as ResNets and DenseNets. We also demonstrate how recurrent networks implementing iterative optimization algorithms achieve performance comparable to their feed-forward approximations. This connection to the established theory of overcomplete representations suggests promising new directions for principled deep network architecture design with less reliance on ad-hoc engineering.

1 INTRODUCTION

Representation learning has become a key component of computer vision and machine learning. In place of manual feature engineering, deep neural networks have enabled more effective representations to be learned from data for state-of-the-art performance in nearly every application benchmark. While this modern influx of deep learning originally began with the task of large-scale image recognition [1], new datasets, loss functions, and network configurations have expanded its scope to include a much wider range of applications. Despite this, the underlying architectures used to learn effective image representations have generally remained consistent across all of them. This can be seen through the quick adoption of the newest state-of-the-art deep networks from AlexNet [1] to VGGNet [2], ResNets [3], DenseNets [4], and so on. But this begs the question: why do some deep network architectures work better than others? Despite years of groundbreaking empirical results, an answer to this question still remains elusive.

(a) Deep Frame Approximation (b) Neural Network (c) Iterative Optimization Algorithm
Fig. 1. (a) Deep frame approximation is a unifying framework for multilayer representation learning where inference is posed as the constrained optimization of a multi-layer reconstruction objective. (b) The problem structure allows for effective feed-forward approximation with the activations of a standard deep neural network. (c) More accurate approximations can be found using an iterative optimization algorithm with updates implemented as recurrent feedback connections.

Fundamentally, the difficulty in comparing network architectures arises from the lack of a theoretical foundation for characterizing their generalization capacities. Shallow machine learning techniques like support vector machines [5] were aided by theoretical tools like the VC-dimension [6] for determining when their predictions could be trusted to avoid overfitting. The complexity of deep neural networks, on the other hand, has made similar analyses challenging. Theoretical explorations of deep generalization are often disconnected from practical applications and rarely provide actionable insights into how architectural hyper-parameters contribute to performance. Without a clear theoretical understanding, progress is largely driven by ad-hoc engineering and trial-and-error experimentation.

Building upon recent connections between deep learning and sparse approximation [7], [8], [9], we introduce deep frame approximation: a unifying framework for representation learning with structured overcomplete frames. These problems aim to optimally reconstruct input data through layers of constrained linear combinations of components from architecture-dependent overcomplete frames. As shown in Fig. 1, exact inference in our model amounts to finding representations that minimize reconstruction error subject to constraints, a process that requires iterative optimization. However, the problem structure allows for efficient approximate inference using standard feed-forward neural networks. This connection between the complicated nonlinear operations of deep neural networks and convex optimization provides new insights about the analysis and design of different real-world network architectures.

Specifically, we indirectly analyze practical deep network architectures like residual networks (ResNets) [3] and densely connected convolutional networks (DenseNets) [4], which have achieved state-of-the-art performance in many computer vision applications. Often very deep with skip connections across layers, these complicated network architectures typically lack convincing explanations for their specific design choices. Without clear theoretical justifications, they are instead driven by performance improvements on standard benchmark datasets like ImageNet [10]. As an alternative, we provide a novel perspective for evaluating and comparing these network architectures via the global structure of their corresponding deep frame approximation problems, as shown in Fig. 2.

(a) Chain Network (b) ResNet (c) DenseNet (d) Induced Deep Frame Approximation Structures
Fig. 2. Why are some deep neural network architectures better than others? In comparison to (a) standard chain connections, skip connections like those in (b) ResNets and (c) DenseNets have demonstrated significant improvements in parameter efficiency and generalization performance. We provide one possible explanation for this phenomenon by approximating network activations as (d) solutions to deep frame approximation problems with different induced frame structures.

Our approach is motivated by sparse approximation theory [11], a field that encompasses fundamental properties of shallow representations using overcomplete frames [12]. In contrast to feed-forward representations constructed through linear transformations and nonlinear activation functions, techniques like sparse coding seek parsimonious representations that efficiently reconstruct input data. Capacity is controlled by the number of additive components used in sparse data reconstructions. While adding more parameters allows for more accurate representations that better represent training data, it may also increase input sensitivity. As the number of components increases, the distance between them decreases, so they are more likely to be confused with one another. This may cause representations of similar data points to become very far apart, leading to poor generalization performance. This fundamental tradeoff between the capacity and robustness of shallow representations can be formalized using similarity measures like mutual coherence, the maximum magnitude of the normalized inner products between all pairs of components, which are theoretically tied to representation sensitivity and generalization [13].

Deep representations, on the other hand, have not shown the same correlation between model size and sensitivity [14]. While adding more layers to a deep neural network increases its capacity, it also simultaneously introduces implicit regularization to reduce overfitting. Attempts to quantify this effect are theoretically challenging and often lack intuition about how to design architectures that more effectively balance memorization and generalization. From the perspective of deep frame approximation, however, we can interpret this observed phenomenon simply by applying theoretical results from shallow representation learning. Additional layers induce overcomplete frames with structures that increase both capacity and effective input dimensionality, allowing more components to be spaced further apart for more robust representations. Furthermore, architectures with denser skip connections induce structures with more nonzero elements, which provides additional freedom to further reduce mutual coherence with fewer parameters as shown in Fig. 3. From this perspective, we interpret deep learning through the lens of shallow learning to gain new insights towards understanding its unparalleled performance.

(a) Chain Network (b) ResNet (c) DenseNet (d) Minimum Deep Frame Potential (e) Validation Error
Fig. 3. Parameter count is not a good indicator of generalization performance for deep networks. Instead, we compare different network architectures via the minimum deep frame potential, a lower bound on the mutual coherence of their corresponding structured frames. In comparison to (a) chain networks, the skip connections in (b) ResNets and (c) DenseNets induce Gram matrix structures with more nonzero elements allowing for (d) lower deep frame potentials across network sizes. (e) This correlates with improved parameter efficiency giving lower validation error with fewer parameters.

1.1 Contributions

In order to unify the intuitive and theoretical insights of shallow representation learning with the practical advances made possible through deep learning, we introduce the deep frame potential as a cue for model selection that summarizes the interactions between parameters in deep neural networks. As a lower bound on mutual coherence, it is tied to the generalization properties of the related deep frame approximation inference problems. However, its minimizers depend only on the frame structures induced by the corresponding network architectures. This enables a priori model comparison across disparate families of deep networks by jointly quantifying the contributions of depth, width, and layer connectivity. Instead of requiring expensive validation on a specific dataset to approximate generalization performance, architectures can then be chosen based on how efficiently they can reduce mutual coherence with respect to the number of model parameters.

In the case of fully-connected chain networks, we derive an analytic expression for the minimum achievable deep frame potential. We also provide an efficient method for frame potential minimization applicable to a general class of convolutional networks with skip connections, of which ResNets and DenseNets are shown to be special cases.

Experimentally, we demonstrate correlation with validation error across a variety of network architectures.

This paper expands upon our original work in [9], which introduced the deep frame potential as a criterion for data-independent model selection for feed-forward deep network architectures. Here, we provide a more thorough background of shallow representation learning with overcomplete tight frames to better motivate the proposed framework of deep frame approximation. While [9] only considered feed-forward approximations, we propose an iterative algorithm for exact inference extending our work in [8]. This allows us to construct recurrent versions of networks like ResNets and DenseNets that achieve similar generalization performance. Experimentally, we also compare minimum deep frame potential with generalization error on additional networks and datasets, further demonstrating its ability to predict the generalization capacity of different architectures without requiring any training data.

1.2 Related Work

In this section, we present a brief overview of shallow representation learning techniques, deep neural network architectures, some theoretical and experimental explorations of their generalization properties, and the relationships between deep learning and sparse approximation theory.

1.2.1 Component Analysis and Sparse Coding
Prior to the advent of modern deep learning, shallow representation learning techniques such as principal component analysis [15] and sparse coding [16] were widely used due to their superior performance in applications like facial recognition [17] and image classification [18]. Towards the common goal of extracting meaningful information from high-dimensional images, these data-driven approaches are often guided by intuitive goals of incorporating prior knowledge into learned representations. For example, statistical independence allows for the separation of signals into distinct generative sources [19], non-negativity leads to parts-based decompositions of objects [20], and sparsity gives rise to locality and frequency selectivity [21].

Even non-convex formulations of matrix factorization are often associated with guarantees of convergence [16], generalization [22], uniqueness [23], and even global optimality [24]. Linear autoencoders with tied weights are guaranteed to find globally optimal solutions spanning the principal eigenspace [25]. Similarly, other component analysis techniques can be equivalently expressed in a least squares regression framework for efficient learning with differentiable loss functions [26]. This unified view of undercomplete shallow representation learning has provided insights for characterizing certain neural network architectures and designing more efficient optimization algorithms for nonconvex subspace learning problems.

1.2.2 Deep Neural Network Architectures
Deep neural networks have since emerged as the preferred technique for representation learning in nearly every application. Their ability to jointly learn multiple layers of abstraction has been shown to allow for encoding increasingly complex features such as textures and object parts [27].

Due to the vast space of possible deep network architectures and the computational difficulty in training them, deep model selection has largely been guided by ad-hoc engineering and human ingenuity. Despite slowing progress in the years following early breakthroughs [28], recent interest in deep learning architectures began anew because of empirical successes largely attributed to computational advances like efficient training using GPUs and rectified linear unit (ReLU) activation functions [1]. Since then, numerous architectural modifications have been proposed. For example, much deeper networks with residual connections were shown to achieve consistently better performance with fewer parameters [3]. Building upon this, densely connected convolutional networks with skip connections between more layers yielded even better performance [4].

1.2.3 Deep Model Selection
Because theoretical explanations for effective deep network architectures are lacking, consistent experimentation on standardized benchmark datasets has been the primary driver of empirical success. However, due to slowing progress and the need for increased accessibility of deep learning techniques to a wider range of practitioners, more principled approaches to architecture search have gained traction. Motivated by observations of extreme redundancy in the parameters of trained networks [29], techniques have been proposed to systematically reduce the number of parameters without adversely affecting performance. Examples include sparsity-inducing regularizers during training [30] or post-processing to prune the parameters of trained networks [31]. Constructive approaches to model selection like neural architecture search [32] instead attempt to compose architectures from basic building blocks through tools like reinforcement learning. Efficient model scaling has also been proposed to enable more effective grid search for selecting architectures subject to resource constraints [33]. Automated techniques can match or even surpass manually engineered alternatives, but they require a validation dataset and rarely provide insights transferable to other settings.

1.2.4 Deep Learning Generalization Theory
To better understand the implicit benefits of different network architectures, there have also been theoretical explorations of deep network generalization. These works are often motivated by the surprising observation that good performance can still be achieved using highly over-parametrized models with degrees of freedom that surpass the number of training data. This contradicts many commonly accepted ideas about generalization, spurring new experimental explorations that have demonstrated properties unique to deep learning. Examples include the ability of deep networks to express random data labels [14] with a tendency towards learning simple patterns first [34].

Some promising directions towards explaining the generalization properties of deep networks are founded upon the natural stability of learning with stochastic gradient descent [35]. While this has led to explicit generalization bounds that do not rely on parameter count, they are typically only applicable for unrealistic special cases such as very wide networks [36]. Similar explanations have also arisen from the information bottleneck theory, which provides tools for analyzing architectures with respect to the optimal tradeoff between data fitting and compression due to random diffusion [37]. However, they are not causally connected to generalization performance, and similar behaviors can arise even without the implicit randomness of mini-batch stochastic gradients [38].

While exact theoretical explanations are lacking, empirical measurements of network sensitivity such as the Jacobian norm have been shown to correlate with generalization performance across different datasets [39]. Similarly, Parseval regularization [40] encourages robustness by constraining the Lipschitz constants of individual layers. This also promotes parameter diversity within layers, but unlike the deep frame potential, it does not take into account global interactions between layers.

1.2.5 Sparse Approximation
Because of the difficulty in analyzing deep networks directly, other approaches have instead drawn connections to the rich field of sparse approximation theory. The relationship between feed-forward neural networks and principal component analysis has long been known for the case of linear activations [25]. More recently, nonlinear deep networks with ReLU activations have been linked to multilayer sparse coding to prove theoretical properties of deep representations [7]. This connection has been used to motivate new recurrent architecture designs that resist adversarial noise attacks [41], improve classification performance [42], or enforce prior knowledge through output constraints [8]. We further build upon these relationships to instead provide novel explanations for networks that have already proven to be effective while providing a novel approach to selecting efficient deep network architectures.

2 SHALLOW REPRESENTATION LEARNING

Shallow representations are typically defined in terms of what they are not: deep. In contrast to deep neural networks composed of many layers, shallow learning techniques construct latent representations of data using only a single hidden layer. In neural networks, this layer typically consists of a learned linear transformation followed by an activation function. Specifically, given a data vector x ∈ R^d, a matrix of learned parameters B ∈ R^{d×k}, and a fixed nonlinear activation function φ : R^k → R^k, a k-dimensional shallow representation can be found as f(x) = φ(B^T x). However, shallow representation learning also encompasses a wide range of other methods such as principal component analysis and sparse coding that are often better suited to certain tasks. Unlike neural networks, these techniques are often equipped with simple intuition and theoretical guarantees. In this section, we survey some of these methods to elucidate the fundamental limitations of shallow representations and motivate our unifying framework for deep learning.

2.1 Undercomplete Bases

Due to the inefficiency of learning in high dimensions, alternative data representations are often employed as a means for dimensionality reduction. While high-dimensional data such as images consist of a large number of individual features (e.g. pixels), they are often highly correlated with one another. More efficient encoding schemes can disentangle this structure while preserving the information contained in data. For example, image compression algorithms often take advantage of the band-limited nature of images by representing them not as individual pixels, but as compact linear combinations of different low-frequency components from a truncated basis. In shallow representation learning, these components are instead tuned to optimize performance on a specific dataset.

2.1.1 Representation Inference by Best Approximation
Dimensionality reduction techniques approximate data points x ∈ R^d as linear combinations of k ≤ d fixed component vectors b_j for j = 1, ..., k, which form the columns of the undercomplete parameter matrix B ∈ R^{d×k}:

x \approx \sum_{j=1}^{k} w_j b_j = B w    (1)

The vectors of reconstruction coefficients w ∈ R^k serve as lower-dimensional representations found by minimizing approximation error. In the typical case of squared Euclidean distance, representations f(x) are inferred by solving the following optimization problem:

f(x) = \arg\min_{w} \tfrac{1}{2} \| x - B w \|_2^2    (2)

Because the number of component vectors is lower than their dimensionality, this problem is strongly convex and admits a unique global solution given in closed form by simple matrix operations as B^+ x, where B^+ = (B^T B)^{-1} B^T is the pseudoinverse of the rectangular matrix B. In the special case of orthogonal components where B^T B = I, the parameters B form a basis that spans a k-dimensional subspace of R^d. Furthermore, when k = d, B is complete and acts as a rotation that simply aligns data to different coordinate vectors. From Parseval's identity, this preserves the magnitudes of representations so that \| B^T x \|_2 = \| x \|_2. While the representation f(x) = B^T x now takes the form of a neural network layer with a linear activation function, it has a clear geometric interpretation as the optimal projection onto a subspace.
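To make this geometry concrete, the following minimal NumPy sketch contrasts the closed-form least-squares representation of Eq. 2 with the feed-forward form that arises when the components are orthonormal. The matrix sizes, random data, and variable names are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 8                      # assumed input dimension and number of components (k <= d)
B = rng.standard_normal((d, k))   # undercomplete parameter matrix
x = rng.standard_normal(d)        # a synthetic data point

# Eq. 2: the unique least-squares representation, B^+ x
w = np.linalg.pinv(B) @ x
print(np.linalg.norm(x - B @ w))  # residual of the projection onto span(B)

# With orthonormal columns (B^T B = I), the optimal projection reduces to a linear layer B^T x
Q, _ = np.linalg.qr(B)            # orthonormalize the columns for illustration
w_ortho = Q.T @ x
assert np.allclose(Q @ w_ortho, Q @ (np.linalg.pinv(Q) @ x))
```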

2.2 Overcomplete Frames

For many supervised applications, dimensionality reduction is not an effective goal for representation learning. On the contrary, higher-dimensional representations may be necessary to accentuate discriminative details by encouraging linear separability. In place of undercomplete bases, data may be represented using overcomplete frames with more components than dimensions. While subspace representations necessitate the loss of information, frames can span the entire space allowing for the exact reconstruction of any data point.

A frame is defined as a sequence {b_j}_{j ∈ I} of components from a Hilbert space that satisfy the following property for any x and some frame bounds 0 < A ≤ B < ∞ [12]:

A \| x \|^2 \leq \sum_{j \in I} | \langle x, b_j \rangle |^2 \leq B \| x \|^2    (3)

In other words, the sum of the squared inner products between x and every b_j does not deviate far from the squared norm of x itself. While frequently employed in the analysis of infinite-dimensional function spaces, we consider finite frames in R^d with the index set I = {1, ..., k}. In this case, we can concatenate the frame components b_j as the columns of B ∈ R^{d×k} where k > d.

2.2.1 Tight Frames
If B is a complete orthogonal basis with k = d, then it is also a frame with frame bounds A = B = 1 [43] so that:

A \| x \|_2^2 = \sum_{j=1}^{k} | \langle x, b_j \rangle |^2 = \| B^T x \|_2^2    (4)

More generally, a rich family of overcomplete frames with k > d also satisfy this property for other A = B and are denoted as tight frames. These frames are of particular interest because they behave similarly to complete orthogonal bases in that the frame operator matrix F = B B^T is the rescaled identity matrix A^{-1} I. Thus, a higher-dimensional representation f(x) = B^T x completely preserves information and allows for efficient exact reconstructions x = A B f(x). The redundancy provided by these overcomplete frame representations also enables theoretical robustness guarantees that are beneficial for certain applications such as data transmission [44].

2.2.2 The Frame Potential
Tight frames are useful in practice and can be constructed efficiently by considering the Gram matrix G = B^T B with rank r ≤ d, which contains the inner products between all combinations of frame elements. Note that for normalized frames with magnitudes constrained to have unit norm, the diagonal contains all ones and so its trace, the sum of its r nonzero eigenvalues λ_i, is fixed to be k. For tight frames, the nonzero eigenvalues are uniform with the value A^{-1}, the same as for the frame operator matrix F = B B^T. Thus, in order to have a fixed trace with uniform eigenvalues, the Gram matrix of normalized tight frames must have the minimum possible Frobenius norm. This quantity, which can be equivalently expressed as the sum of squared eigenvalues, is referred to as the frame potential [45] and is given as:

FP(B) = \sum_{i=1}^{r} \lambda_i^2 = \sum_{j=1}^{k} \sum_{j'=1}^{k} | \langle b_j, b_{j'} \rangle |^2 = \| G \|_F^2    (5)

Minimizers of the frame potential completely characterize the set of all normalized tight frames. Furthermore, despite its nonconvexity, the frame potential can be effectively optimized using gradient descent to find globally optimal local minima. For our purposes, this allows it to be naturally integrated into backpropagation training pipelines as a criterion for model selection or as a regularizer.

(a) Feed-Forward Inference (b) Optimization Inference (c) Feed-Forward Filter Responses (d) Optimal Reconstruction Coefficients

Fig. 5. An example of the “explaining away” conditional dependence provided by optimization-based inference. Sparse representations constructed by feed-forward nonnegative soft thresholding (a) have many more non-zero elements due to redundancy and spurious activations (c). On the other hand, sparse representations found by `1-penalized, nonnegative least-squares optimization (b) yield a more parsimonious set of components (d) that optimally reconstruct approximations of the data. This figure was adapted from [8].

2.3 Nonlinear Constrained Approximation

While tight frames admit feed-forward representations that can be efficiently decoded to reconstruct input data, inference for general overcomplete frames again requires solving the approximation error minimization problem in Eq. 2. However, because the number of components is greater than the dimensionality, there may be an infinite subspace of coefficients w that all exactly reconstruct any data point as x = Bw. In order to guarantee uniqueness and facilitate the effective representation of data in applications like classification, additional constraints or penalties must be included in the optimization problem. This yields nonlinear representations f(x) that cannot be computed solely using linear transformations.

(a) Simplex Projection (b) Rectified Linear Unit
Fig. 4. A comparison between (a) projection onto a simplex, which corresponds to nonnegative and ℓ1 norm constraints, and (b) the rectified linear unit (ReLU) nonlinear activation function, which is equivalent to a nonnegative soft-thresholding proximal operator.

2.3.1 Constraints, Penalties, and Proximal Operators
Constraints and penalties restrict the space of possible solutions and introduce nonlinearity that increases representation capacity. In addition to practical considerations, they can be used to encode prior knowledge for encouraging representation interpretability. For example, consider the simplex constraint set S that consists of coefficient vectors that are nonnegative with low ℓ1 norms:

S = \{ w : w \geq 0, \| w \|_1 \leq \delta \}    (6)

Nonnegativity ensures that learned components can have only additive effects on the data reconstruction. This is commonly used as a matrix factorization constraint to encourage learning more interpretable parts of objects without supervision [20]. The ℓ1 norm constraint, a convex surrogate to the ℓ0 "norm" that counts the number of nonzero coefficients, encourages sparsity. This gives data approximations that consist of relatively few components, resulting in the emergence of patterns resembling simple-cell receptive fields when trained on natural images [21]. The orthogonal projection onto this constraint set, as visualized in Fig. 4a, can be found by solving the following optimization problem:

P_S(x) = \arg\min_{w} \tfrac{1}{2} \| x - w \|_2^2 \quad \text{s.t.} \quad w \in S    (7)

We can also equivalently describe this constraint set as the penalty function Φ : R^d → R defined as:

\Phi(w) = I_{\geq 0}(w) + \lambda \| w \|_1    (8)

Here, the convex characteristic function I_{≥0} is defined to be zero when each component of its argument is nonnegative and infinity otherwise. The ℓ1 constraint radius δ is also replaced with a corresponding penalty weight λ.

Analogous to the projection operator of a constraint set, the proximal operator of a penalty function is given by the following optimization problem:

\phi(x) = \arg\min_{w} \tfrac{1}{2} \| x - w \|_2^2 + \Phi(w)    (9)

Within the field of convex optimization, these operators are used in proximal algorithms for solving nonsmooth optimization problems [46]. Essentially, these techniques work by breaking a problem down into a sequence of smaller problems that can often be solved in closed form. For example, the proximal operator corresponding to the penalty function in Eq. 8 is the projection onto a simplex and is given by the nonnegative soft-thresholding operator:

\phi(x) = P_S(x) = (x - \lambda \mathbf{1})_+    (10)

Visualized in Fig. 4b, this nonlinear function uniformly shrinks the input and clips negative values to zero, resulting in a sparse output. Note that this is equivalent to the rectified linear unit (ReLU), a nonlinear activation function commonly used in deep learning, with a negative bias of λ. Many other nonlinearities can also be interpreted as proximal operators, including parametric rectified linear units, inverse square root units, arctangent, hyperbolic tangent, sigmoid, and softmax activation functions [47]. This connection forms the basis of the close relationship between approximation-based shallow representation learning and feed-forward neural networks.

2.3.2 Iterative Optimization
The optimization problem from Eq. 2 can be adapted to include the sparsity-inducing constraints from Eq. 6 as:

f(x) = \arg\min_{w \geq 0} \tfrac{1}{2} \| x - B w \|_2^2 + \lambda \| w \|_1    (11)

Unlike feed-forward alternatives that construct representations in closed form via independent feature detectors, penalized optimization problems like this require iterative solutions. One common example is proximal gradient descent, which separates the smooth and nonsmooth components of an objective function for faster convergence. Specifically, the algorithm alternates between gradient descent on the smooth reconstruction error term and application of the nonnegative soft-thresholding proximal operator from Eq. 10. Given an initialized representation w^{[0]} and an appropriate step size γ, repeated application of the update equation in Eq. 12 below is guaranteed to converge to a globally optimal solution [48].

w^{[t]} = \phi \left( w^{[t-1]} + \gamma B^T \left( x - B w^{[t-1]} \right) \right)    (12)

When w^{[0]} = 0 and γ = 1, truncating this algorithm to a single iteration results in the standard shallow feed-forward neural network representation f(x) = φ(B^T x). Here, the proximal operator φ takes on the role of a nonlinear activation function. When the parameters B are orthogonal, this feed-forward computation, which is commonly referred to as thresholding pursuit [7], gives the exact solution of Eq. 11. More generally, parameters that are close to orthogonal lead to faster convergence and more accurate feed-forward approximations.

In this case, additional iterations introduce conditional dependence between features to represent data using redundant frame components. This phenomenon is commonly referred to as "explaining away" within the context of graphical models [49]. An example of this effect is shown in Fig. 5, which compares sparse representations constructed using a feed-forward neural network layer with those given by iterative optimization. When components in an overcomplete set of features have high correlation with an image, constrained optimization introduces competition between them, resulting in more parsimonious representations.

If certain conditions are met, however, these two seemingly different approaches can yield very similar results. At the same time, they can also enable theoretical guarantees such as uniqueness and robustness, which are beneficial for effective representation learning.
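The contrast between single-pass thresholding and iterative inference can be sketched numerically. The NumPy example below uses assumed problem sizes, an assumed penalty weight, and synthetic data; it implements the update of Eq. 12 with a Lipschitz-based step size and compares the result against the one-step feed-forward approximation, which typically retains more active components, illustrating the explaining-away effect of Fig. 5.

```python
import numpy as np

def relu(v, bias):
    """Nonnegative soft thresholding (Eq. 10): a ReLU with a negative bias."""
    return np.maximum(v - bias, 0.0)

def feed_forward(x, B, lam):
    """Single-step thresholding pursuit: the standard shallow network layer."""
    return relu(B.T @ x, lam)

def prox_grad(x, B, lam, iters=200):
    """Proximal gradient descent for Eq. 11 (nonnegative, l1-penalized reconstruction)."""
    step = 1.0 / np.linalg.norm(B, 2) ** 2        # step size from the Lipschitz constant
    w = np.zeros(B.shape[1])
    for _ in range(iters):
        w = relu(w + step * (B.T @ (x - B @ w)), step * lam)   # update of Eq. 12
    return w

rng = np.random.default_rng(0)
d, k, lam = 32, 64, 0.1                           # assumed sizes and penalty weight
B = rng.standard_normal((d, k))
B /= np.linalg.norm(B, axis=0)
x = B @ relu(rng.standard_normal(k), 1.0)         # data generated from a sparse nonnegative code

w_ff, w_opt = feed_forward(x, B, lam), prox_grad(x, B, lam)
print(np.count_nonzero(w_ff), np.count_nonzero(w_opt))       # optimization "explains away"
print(np.linalg.norm(x - B @ w_ff), np.linalg.norm(x - B @ w_opt))
```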

2.4 Sparse Approximation Theory

Sparse approximation theory concerns the properties of data representations given as sparse linear combinations of components from overcomplete frames. While the number of components k is typically far greater than the dimensionality d, the ℓ0 "norm" ‖w‖_0 of the representation is restricted to be small. Because this quantity is nonconvex and difficult to optimize directly, sparsity is often achieved in practice using greedy selection algorithms like orthogonal matching pursuit [50] or convex sparsity-inducing constraints like the ℓ1 norm [13] or nonnegativity [51].

Through applications like compressed sensing [52], sparsity has been found to exhibit theoretical properties that enable data representation with efficiency far greater than what was previously thought possible. Central to these results is the requirement that the frame be "well-behaved," essentially ensuring that none of its columns are too similar. For undercomplete frames with k ≤ d, this is satisfied by enforcing orthogonality, but overcomplete frames require other conditions. Specifically, we focus our attention on the mutual coherence µ, the maximum magnitude normalized inner product of all pairs of frame components. This is the maximum magnitude off-diagonal element in the Gram matrix G = B̃^T B̃, where the columns of B̃ are normalized:

\mu(B) = \max_{i \neq j} \frac{ | b_i^T b_j | }{ \| b_i \| \, \| b_j \| } = \max_{i,j} | (G - I)_{ij} |    (13)

(a) High Coherence (b) Minimum Coherence (c) Zero Coherence
Fig. 6. Mutual coherence, visualized in red, of different frames in R^2. (a) When mutual coherence is high, some vectors may be very similar, leading to representation instability and sensitivity to noise. (b) Mutual coherence is minimized at the Welch bound for equiangular tight frames where all vectors are equally distributed. (c) Mutual coherence is zero in the case of orthogonality, which cannot occur for overcomplete frames.

Fig. 6 visualizes the mutual coherence of different overcomplete frames compared alongside an orthogonal basis.

Because they both measure pairwise interactions between the components of an overcomplete frame, mutual coherence and frame potential are closely related. The frame potential of a normalized frame can be used to compute the root mean square off-diagonal element of the Gram matrix, which provides a lower bound on the mutual coherence, the maximum magnitude off-diagonal element:

\mu(B) \geq \sqrt{ \frac{ FP(\tilde{B}) - \mathrm{Tr}(G) }{ N(G) } }    (14)

Here, N(G) is the number of nonzero elements in the Gram matrix and Tr(G) = k is constant since the columns of B̃ are normalized. Equality between mutual coherence and this averaged frame potential is met in the case of normalized equiangular tight frames, where all off-diagonal elements in the Gram matrix are equivalent.

2.4.1 Uniqueness and Robustness Guarantees
A model's capacity for low mutual coherence increases along with its capacity for both memorizing more training data through unique representations and generalizing to more validation data through robustness to input perturbations. If representations from a mutually incoherent frame are sufficiently sparse, then they are guaranteed to be optimally sparse and unique [13]. Specifically, if the number of nonzero coefficients ‖w‖_0 < (1/2)(1 + µ^{-1}), then w is the unique, sparsest representation for x. Furthermore, if ‖w‖_0 < (√2 − 0.5) µ^{-1}, then it can be found efficiently by convex optimization with ℓ1 regularization. Thus, minimizing the mutual coherence of a frame increases its capacity for uniquely representing data points.

Sparse representations are also robust to input perturbations [53]. Specifically, given a noisy data point x = x_0 + z where x_0 can be represented exactly as x_0 = B w_0 with ‖w_0‖_0 ≤ (1/4)(1 + µ^{-1}) and the noise z has bounded magnitude ‖z‖_2 ≤ ε, then w_0 can be approximated by solving the ℓ1-penalized LASSO problem:

\arg\min_{w} \tfrac{1}{2} \| x - B w \|_2^2 + \lambda \| w \|_1    (15)

Its solution is stable and the approximation error is bounded from above in Eq. 16, where δ(x, λ) is a constant.

\| w - w_0 \|_2^2 \leq \frac{ \left( \varepsilon + \delta(x, \lambda) \right)^2 }{ 1 - \mu \left( 4 \| w \|_0 - 1 \right) }    (16)

Thus, minimizing the mutual coherence of a frame decreases the sensitivity of its sparse representations for improved robustness. This is similar to evaluating input sensitivity using the Jacobian norm [39]. However, instead of estimating the average perturbation error over validation data, it bounds the worst-case error over all possible data.

While most theoretical results rely on explicit sparse regularization, coherence has also been found to play a key role in other overcomplete representations. For example, nonnegativity can be sufficient to guarantee unique solutions of incoherent underdetermined systems [51], formulations of overcomplete independent component analysis often minimize coherence to promote parameter diversity [54], and similar regularizers have been employed to encourage orthogonality in deep neural network layers for reduced Lipschitz constants and improved generalization [40]. This suggests that coherence control may be important even in cases when the sparsity levels required by known theoretical guarantees are not met in practice.
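The coherence quantities of Eqs. 13 and 14 are straightforward to evaluate in practice. The sketch below, using a random Gaussian frame with assumed dimensions, computes the mutual coherence and the averaged frame-potential lower bound for comparison.

```python
import numpy as np

def mutual_coherence(B):
    """Eq. 13: largest magnitude off-diagonal entry of the normalized Gram matrix."""
    Bn = B / np.linalg.norm(B, axis=0)
    G = Bn.T @ Bn
    return np.max(np.abs(G - np.eye(G.shape[1])))

def averaged_lower_bound(B):
    """Eq. 14 for a dense frame: the RMS off-diagonal Gram entry lower-bounds the coherence."""
    Bn = B / np.linalg.norm(B, axis=0)
    G = Bn.T @ Bn
    k = G.shape[1]
    fp = np.sum(G ** 2)                        # frame potential of the normalized frame
    return np.sqrt((fp - k) / (k * (k - 1)))   # N(G) = k(k - 1) off-diagonal entries

rng = np.random.default_rng(0)
d, k = 16, 64                                  # assumed dimensions for illustration
B = rng.standard_normal((d, k))
print(averaged_lower_bound(B), mutual_coherence(B))   # the bound never exceeds the coherence
```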

2.4.2 The Welch Bound
The uniqueness and robustness of overcomplete representations can both be improved with lower mutual coherence. Thus, one way of approximating a model's capacity for representations with effective memorization and generalization properties is through a lower bound on its minimum achievable mutual coherence. While this minimum value may not be necessary for effective representation learning with a particular dataset, it does provide a worst-case, data-agnostic means for comparison.

Recall that mutual coherence and frame potential both attain minimum values of zero in the case of orthogonal frames. For normalized overcomplete frames with more components than dimensions, the minimum frame potential is a positive constant that provides a lower bound on the minimum mutual coherence, where equality is met in the case of equiangular frames. This value, typically denoted as the Welch bound, was originally introduced in the context of bounding the effectiveness of error-correcting codes [55].

To derive this tight lower bound on the mutual coherence of a normalized frame B̃ ∈ R^{d×k}, we first apply the Cauchy-Schwarz inequality to find a lower bound on the frame potential, the sum of the squared eigenvalues λ_i of the positive semidefinite Gram matrix G = B̃^T B̃:

FP(\tilde{B}) = \sum_{i=1}^{d} \lambda_i^2 \geq \frac{1}{d} \left( \sum_{i=1}^{d} \lambda_i \right)^2 = \frac{ \mathrm{Tr}(G)^2 }{ d } = \frac{ k^2 }{ d }    (17)

For normalized frames, Tr(G) = k. Furthermore, the number of nonzero off-diagonal elements is N(G) = k(k − 1). From Eq. 14, the Welch bound is then given as:

\mu(B) \geq \sqrt{ \frac{ k d^{-1} - 1 }{ k - 1 } }    (18)

Note that for a fixed input dimensionality d, as the number of parameters increases with additional components k, the minimum achievable mutual coherence also increases.

2.4.3 Structured Convolutional Frames
For complicated, high-dimensional images, sparse representation with overcomplete frames is generally not possible. Instead, data can be broken into smaller overlapping patches that are represented using shared parameters with localized spatial support. These local representations may then be concatenated together to form global image representations. This is accomplished in convolutional sparse coding by replacing the frame matrix multiplication in Eq. 11 with a linear operator that convolves a set of filters over coefficient feature maps to approximately reconstruct images [56]. This operator can equivalently be represented as multiplication by a matrix with repeating implicit structures of nonzero parameters as visualized in Fig. 7.

(a) Convolutional Frame (b) Permuted Convolutional Frame (c) Convolutional Gram Matrix (d) Permuted Convolutional Gram Matrix
Fig. 7. A visualization of a one-dimensional convolutional frame with two input channels, five output channels, and a filter size of three. (a) The filters are repeated over eight spatial dimensions resulting in (b) a block-Toeplitz structure that is revealed through row and column permutations. (c) The corresponding Gram matrix can be efficiently computed by (d) repeating local filter interactions.

Due to the repeated local structure of convolutional frames, more efficient inference algorithms are possible [57] and theoretical guarantees similar to those in Sec. 2.4.1 can be derived requiring only local sparsity independent of input dimensionality [58]. The frame potential, mutual coherence, and Welch bound may also be computed efficiently by considering only local interactions as shown in Fig. 7.

Consider the 2-dimensional convolution of a p × p input with d channels using k filters of size f × f with q < p. Padding the input with a border of zeros is used to ensure size consistency while a convolution stride of s ≥ 1 gives filter overlaps of o = ⌈f/s⌉ for an output of size q × q where q = ⌈p/s⌉. The normalized convolutional frame B̃ ∈ R^{dp^2 × kq^2} has a Gram matrix with trace Tr(G) = kq^2. Due to its sparse repeating structure, the total number of nonzero off-diagonal elements is:

N(G) = k \left( k \left( o (2q - o + 1) - q \right)^2 - q^2 \right)    (19)

Using the lower bound from Eq. 17, the frame potential can be bounded as FP(B) ≥ q^4 k^2 p^{-2} d^{-1}. By plugging these values into Eq. 14, a lower bound on the mutual coherence can be found as:

\mu(B) \geq \sqrt{ \frac{ k d^{-1} s^{-2} - 1 }{ k \left( \left( 2 - (o - 1) s p^{-1} \right) o - 1 \right)^2 - 1 } }    (20)

While this bound depends on the size of the input p and the convolution stride s, the limit with increasing spatial dimensionality p → ∞ for s = 1 is the standard multichannel aperiodic bound from [55]:

\mu(B) \geq \sqrt{ \frac{ k d^{-1} - 1 }{ k (2f - 1)^2 - 1 } }    (21)

While the minimum mutual coherence of a convolutional frame is greater than that of a dense frame with the same size due to parameter sharing and its fixed structure of zeros, far fewer parameters are required. This allows overcomplete representations of high-dimensional images to be more effectively learned from data. However, these representations are still fundamentally limited since mutual coherence must increase as the number of filters increases.

3 DEEP REPRESENTATION LEARNING

Deep representations are constructed simply as the composition of multiple feed-forward shallow representations. For a simple chain-structured deep network with l layers, an image x ∈ R^d is passed through a composition of alternating linear transformations with parameters B_j ∈ R^{k_{j-1} × k_j} with k_0 = d and fixed nonlinear activation functions φ_j for layers j = 1, ..., l as follows:

f(x) = \phi_l \left( B_l^T \cdots \phi_2 \left( B_2^T \phi_1 \left( B_1^T x \right) \right) \cdots \right)    (22)

By allocating parameters across multiple nonlinear layers, representational capacity can be increased without sacrificing generalization performance. These parameters may then be learned from data with arbitrary loss functions using stochastic gradient descent and backpropagation.

More complicated architectures such as ResNets and DenseNets can also be constructed by adding or concatenating previous outputs together, which introduces skip connections between layers. Despite their straightforward construction, analyzing the implicit regularization provided by different deep network architectures to prevent overfitting has proven to be challenging.

3.1 Multilayer Convolutional Sparse Coding

Building upon the connection between feed-forward shallow representations and thresholding pursuit for approximate sparse coding, deep neural network analysis may be simplified by accumulating the behavior of individual layers, as discussed in further detail by Papyan et al. [7]. Inference is then considered as the solution of a sequence of shallow sparse coding problems instead of the feed-forward inference function f(x) from Eq. 22. Specifically, the analogous multilayer sparse coding problem is:

\text{Find } \{ w_j \}_{j=1}^{l} \ \text{ s.t. } \ \| w_{j-1} - B_j w_j \|_2^2 \leq E_{j-1}, \ \ \Psi(w_j) \leq \delta_j, \ \ \forall j = 1, \ldots, l    (23)

Here, w_0 = x and w_j for j = 1, ..., l are inferred latent variables corresponding to layers parameterized by B_j. The inference problem amounts to jointly finding the outputs of each layer w_j such that the error in approximating the previous layer is at most E_{j-1} and they are sufficiently sparse as measured by the function Ψ with a threshold δ_j. For convolutional sparse coding, the sparsity function may be locally relaxed to allow for invariance to input size [58].

Assuming that the parameters have been learned such that solutions exist for a particular dataset, the final layer output representations f(x) = w_l satisfy certain uniqueness and robustness guarantees given sufficiently low mutual coherence at each layer [7].

3.1.1 Approximate Inference
The problem in Eq. 23 may be approximated by the composition of layered thresholding pursuit algorithms with feed-forward deep neural network operations:

w_j := \phi_{\lambda_j} \left( B_j^T w_{j-1} \right), \quad \forall j = 1, \ldots, l    (24)

Here, nonlinear activation functions φ are interpreted as proximal operators corresponding to sparsity-inducing penalty functions Φ. Because the sparsity constraint functions Ψ from Eq. 23 used in theoretical analyses are difficult to optimize directly, convex surrogates like those from Sec. 2.3.1 may be used instead as long as the penalty weights λ_j are chosen to satisfy the required sparsity level. This allows standard convolutional networks with sparsity-inducing activation functions like ReLU to be analyzed under the framework of multilayer sparse coding.

Other optimization algorithms may also be used to better approximate the solution to Eq. 23. In layered basis pursuit, inference is posed as a sequence of shallow sparse coding problems. Specifically, a deep representation is given by composing the solutions for each layer j = 1, ..., l:

w_j := \arg\min_{w_j} \tfrac{1}{2} \| w_{j-1} - B_j w_j \|_2^2 + \lambda_j \Phi(w_j)    (25)

As discussed in Sec. 2.3.2, these shallow problems may be solved by iterative optimization algorithms like proximal gradient descent. In practice, multiple iterations are implemented using ordinary neural network operations. In comparison to standard feed-forward networks, these recurrent architectures provide improved theoretical robustness guarantees and reduced sensitivity to adversarial input perturbations [41].

Interpreting deep networks as compositions of approximate sparse coding algorithms provides some insights into their robustness and generalization properties, but the theory does not generally apply to models typically used in practice. The error accumulation used in the analysis applies only to chain-structured compositions of layers, and it does not effectively explain the empirical successes of increasingly deeper networks. Furthermore, discriminative training often leads to networks with good generalization performance despite relatively high mutual coherence and low sparsity.

3.2 Deep Frame Approximation

Deep frame approximation is an extension of multilayer sparse coding where inference is posed not as a sequence of layer-wise problems, but as a single global problem with structure induced by the corresponding network architecture. This formulation was first introduced in Deep Component Analysis with the motivation of iteratively enforcing prior knowledge with optimization constraints [8]. From the perspective of deep neural networks, it can be derived by relaxing the implicit assumption that layer inputs are constrained to be the outputs of previous layers. Instead of being computed in sequence, the latent variables w_j ∈ R^{k_j} representing the outputs of all layers j = 1, ..., l are inferred jointly as shown in Eq. 26 below, where w_0 = x.

\arg\min_{\{ w_j \}} \sum_{j=1}^{l} \tfrac{1}{2} \| w_{j-1} - B_j w_j \|_2^2 + \Phi_j(w_j)    (26)

While we only consider the non-negative sparsity-inducing penalties Φ_j from Sec. 2.3.1 corresponding to ReLU activation functions, other convex regularizers may also be used.

The compositional constraints between adjacent layers from Eq. 25 have been relaxed and the reconstruction error penalties are combined into a single convex, nonnegative sparse coding problem. By combining the terms in the summation of Eq. 26 together, this problem can be equivalently represented as the structured problem in Eq. 27 below. The latent variables w_j for each layer are stacked in the vector w, the regularizer Φ(w) = Σ_j Φ_j(w_j), and the input x is augmented by padding it with zeros. The layer parameters B_j ∈ R^{k_{j-1} × k_j} are blocks in the induced global frame B, which has Σ_j k_{j-1} rows and Σ_j k_j columns.

\arg\min_{w} \ \tfrac{1}{2} \left\| \begin{bmatrix} x \\ 0 \\ \vdots \\ 0 \end{bmatrix} - \underbrace{\begin{bmatrix} B_1 & 0 & & \\ -I & B_2 & \ddots & \\ & \ddots & \ddots & 0 \\ & & -I & B_l \end{bmatrix}}_{B} \underbrace{\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_l \end{bmatrix}}_{w} \right\|_2^2 + \Phi(w)    (27)

The frame B has a structure of nonzero elements that summarizes the corresponding feed-forward deep network architecture wherein off-diagonal identity matrices represent connections between adjacent layers. Because the data is augmented with zeros, increasing its dimensionality, the feed-forward shallow thresholding pursuit approximation is ineffective, always giving w_j = 0 for j > 1. However, due to its lower-block-triangular structure, Eq. 27 can be effectively approximated by considering each layer in sequence using the same feed-forward operations from Eq. 24.
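The induced structure of Eq. 27 is easy to assemble explicitly for small fully-connected chain networks. The sketch below (two random layers with assumed sizes, written in NumPy) builds the global frame, runs the layer-wise thresholding-pursuit approximation of Eq. 24, and measures the reconstruction penalty of Eq. 27 incurred by the feed-forward approximation.

```python
import numpy as np

def relu(v, lam):
    return np.maximum(v - lam, 0.0)

def build_chain_frame(Bs):
    """Assemble the induced global frame of Eq. 27 for a chain network: layer parameters
    B_j on the block diagonal, -I coupling adjacent layers (assumes B_j.shape[1] == B_{j+1}.shape[0])."""
    rows = sum(B.shape[0] for B in Bs)
    cols = sum(B.shape[1] for B in Bs)
    F = np.zeros((rows, cols))
    r = c = 0
    for j, B in enumerate(Bs):
        F[r:r + B.shape[0], c:c + B.shape[1]] = B
        if j + 1 < len(Bs):
            F[r + B.shape[0]:r + B.shape[0] + B.shape[1], c:c + B.shape[1]] = -np.eye(B.shape[1])
        r += B.shape[0]
        c += B.shape[1]
    return F

def feed_forward(x, Bs, lam=0.1):
    """Layer-wise thresholding pursuit (Eq. 24): the standard feed-forward network."""
    ws, w = [], x
    for B in Bs:
        w = relu(B.T @ w, lam)
        ws.append(w)
    return ws

rng = np.random.default_rng(0)
Bs = [rng.standard_normal((16, 32)), rng.standard_normal((32, 64))]   # assumed layer sizes
x = rng.standard_normal(16)

F = build_chain_frame(Bs)
print(F.shape)                                       # (16 + 32) x (32 + 64)
w = np.concatenate(feed_forward(x, Bs))
x_aug = np.concatenate([x, np.zeros(F.shape[0] - x.size)])
print(np.linalg.norm(x_aug - F @ w))                 # residual of the zero-padded reconstruction
```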

frame components. Thus, depth allows model complexity interpret the activations in Eq. 31 as approximate solutions to scale jointly alongside effective input dimensionality so to the deep frame approximation problem:

that the induced frame structures have the capacity for l low mutual coherence and representations with improved 1 2 X arg min 2 x − B1w1 2 + Φj(wj) (32) memorization and generalization. {wj } j=1 bl/2c X 2 2 3.2.1 Architecture-Induced Frame Structure 1 T T + 2 w2j−B2jw2j-1 2 + w2j+1−w2j-1−B2j+1w2j 2 This model formulation may be extended to accommodate j=1 more complicated network architectures with skip connec- This results in the induced frame structure of Eq. 28 with tions. This allows for the direct comparison of problem Bjj = I for j > 1, Bjk = 0 for j > k + 1, Bjk = 0 for j > k structures induced by different families of architectures such with odd k, and Bjk = I for j > k with even k. as ResNets and DenseNets via their capacity for achieving incoherent global frames. 3.2.3 Densely Connected Convolutional Networks Mutual coherence is computed from normalized frame Building upon the empirical successes of residual networks, components, so it can be lowered simply by increasing the densely connected convolution networks [4] incorporate number of nonzero elements, which reduces their pair-wise skip connections between earlier layers as well. This is inner products. This may be interpreted as providing more shown in Eq. 33 where the transformation B of concate- degrees of freedom to spread out by removing implicit j nated variables [w ] for k = 1, . . . , j − 1 is equivalently lower-dimensional subspace constraints. In Eq. 28 below, k k written as the summation of smaller transformations B . we replace the identity connections of Eq. 27 with blocks jk j−1 of nonzero parameters.     : T j−1 X T   wj = φj Bj [wk]k=1 = φj Bjkwk (33) B 0 ··· 0 k=1  11   . .  These activations again provide approximate solutions to −BT B .. .   21 22  the problem in Eq. 30 with the induced frame structure of B =  .  (28)  . .. ..  B = I j > 1 B  . . . 0  Eq. 28 where jj for and the lower blocks jk   j > k T T for are all filled with learned parameters. −Bl1 · · · −Bl(l−1) Bll Skip connections enable effective learning in much This frame structure corresponds to the global deep frame deeper networks than chain-structured alternatives. While approximation inference problem: originally motivated from the perspective of improving optimization by introducing shortcuts [3], adding more l connections between layers was also shown to improve gen- 1 2 X arg min 2 x − B11w1 2 + Φj(wj) eralization performance [4]. As compared in Fig. 3, denser {wj } j=1 j−1 (29) skip connections induce frame structures with denser Gram l 2 X 1 X T matrices allowing for lower mutual coherence. This suggests + 2 Bjjwj − Bjkwk 2 j=2 k=1 that architectures’ capacities for low validation error can be quantified and compared based on their capacities for Because of the block-lower-triangular structure, following inducing frames with low minimum mutual coherence. T the first layer activation w1 := φ1(B11x), inference can again be approximated by the composition of feed-forward 3.2.4 Global Iterative Inference thresholding pursuit activations: As with multilayer sparse coding, inference approximation j−1 accuracy for deep frame approximation can be improved  T X T  wj := φj Bj Bjkwk ∀j = 2, . . . , l (30) with recurrent connections implementing an optimization k=1 algorithm. However, instead of composing the solutions In comparison to Eq. 
26, additional parameters introduce for each layer in succession, the latent variables of all layers must be jointly inferred by solving a single global skip connections between layers so that the activations wj of layer j now depend on those of all previous layers k < j. optimization problem. In deep component analysis [8], the alternating direction method of multipliers was used, but 3.2.2 Residual Networks this required unrealistic assumptions for efficient iterations. While standard proximal gradient descent could be applied These connections are similar to the identity mappings in to the concatenated variables with the global frame, this residual networks [3], which introduce dependence between would be inefficient since it does not take advantage of the the activations of alternating pairs of layers. Specifically, lower-block-triangular structure. : T after an initial layer w1 = φ1(B1 x), the residual layers Instead, we use prox-linear block coordinate de- are defined as follows for even j = 2, 4, . . . , l − 1: scent [59], which cyclically updates subsets of variables T T corresponding to each layer. Viewed as a recurrent neu- wj := φj(Bj wj−1), wj+1 := φj+1(wj−1 + Bj+1wj) (31) ral network, these operations implement feedback connec- In comparison to chain networks, no additional parameters tions to ensure consistency between layers. This algorithm are required; the only difference is the addition of wj−1 was successfully applied to a similar problem for chain- in the argument of φj+1. As a special case of Eq. 30, we structured networks in Chodosh et al. [60]. 11 2 Pl 2 A single iteration of this algorithm incrementally up- frame B are Nj = i=j Cij. The full normalized frame dates the latent variables for each layer as: can then be found as B˜ = BN−1 where the matrix N is [t] [t−1] [t] block diagonal with Nj as its blocks. The blocks of the Gram w = φ w − γ g , ∀j = 1, . . . , l (34) j j j j j matrix G = B˜ TB˜ are then given as: These updates take the same form as proximal gradient l [t] X −1 T −1 descent with step sizes γj, where the gj is defined for Gjj0 = Nj BijBij0 Nj0 (36) networks with skip connections using the general global i=j0 structured frame from Eq. 28 as: 0 For chain networks, Gjj0 6= 0 only when j = j + 1, which [t] ∂ [t] [t] [t−1] [t−1] represents the connections between adjacent layers. In this gj = `(w1 ,..., wj−1, wj, wj+1 ,..., wl ) (35) ∂wj case, the blocks can be simplified as: 1 1 j−1 l 2 − 2 T 2 − 2 T X T [t] X  T Gjj = (Cj + I) (Bj Bj + I)(Cj + I) (37) = Bj Bjwj − Bjkwk + Bj0j Bj0jwj 1 1 2 − 2 2 − 2 k=1 j0=j+1 Gj(j+1) = −(Cj + I) Bj+1(Cj+1 + I) (38) 0 T j−1 j −1 Gll = Bl Bl (39) X T [t] X T [t−1] [t−1] + B 0 0 w 0 + B 0 0 w 0 − B 0 w 0 j k k j k k j j Because the diagonal is removed in the deep frame k0=1 k0=j+1 potential computation, the contribution of Gjj is simply a This is the partial gradient of the smooth reconstruction rescaled version of the local frame potential of layer j. The error term ` of the loss function from Eq. 27. Variables contribution of Gj(j+1), on the other hand, can essentially from previous layers are updated immediately in the current be interpreted as rescaled `2 weight decay where rows are iteration facilitating faster convergence. The step sizes γj weighted more heavily if the corresponding columns of may be selected automatically by bounding the Lipschitz the previous layer’s parameters have higher magnitudes. constants of the corresponding gradients [60]. 
This optimization algorithm is unrolled to a fixed number of iterations for improved inference approximation accuracy. The resulting recurrent networks achieve similar performance to feed-forward approximations when trained on discriminative tasks like image classification. This suggests similar representational capacities and motivates the indirect analysis of deep neural networks within the framework of deep frame approximation. In addition, while the computational overhead of recurrent inference may be impractical in these cases, it can more effectively enforce other constraints [8] and improve adversarial robustness [61].

3.3 The Deep Frame Potential

We propose to use lower bounds on the mutual coherence of induced structured frames for the data-independent comparison of feed-forward deep network architecture capacities. Note that while one-sided coherence is better suited to nonnegativity constraints, it has the same lower bound [51]. Directly optimizing mutual coherence from Eq. 13 is difficult due to its piecewise structure. Instead, we consider a tight lower bound by replacing the maximum off-diagonal element of the deep frame Gram matrix with the mean. This gives the lower bound from Eq. 14, a normalized version of the frame potential FP(B) = \|G\|_F^2 from Eq. 5 that is strongly convex and can be optimized more effectively [45]. Due to the block-sparse structure of the induced frames from Eq. 28, we evaluate the frame potential in terms of local blocks G_{jj'} \in \mathbb{R}^{k_j \times k_{j'}} that are nonzero only if layer j is connected to layer j'.

To compute the Gram matrix, we first need to normalize the global induced frame B from Eq. 28. By stacking the column magnitudes of layer j as the elements in the diagonal matrix C_j = \mathrm{diag}(c_j) \in \mathbb{R}^{k_j \times k_j}, the normalized parameters can be represented as \tilde{B}_{ij} = B_{ij} C_j^{-1}. Similarly, the squared norms of the full set of columns in the global frame B are N_j^2 = \sum_{i=j}^{l} C_{ij}^2. The full normalized frame can then be found as \tilde{B} = B N^{-1}, where the matrix N is block diagonal with N_j as its blocks. The blocks of the Gram matrix G = \tilde{B}^T \tilde{B} are then given as:

G_{jj'} = \sum_{i=j'}^{l} N_j^{-1} B_{ij}^T B_{ij'} N_{j'}^{-1}    (36)

For chain networks, G_{jj'} \neq 0 only when j' = j + 1, which represents the connections between adjacent layers. In this case, the blocks can be simplified as:

G_{jj} = (C_j^2 + I)^{-\frac{1}{2}} (B_j^T B_j + I) (C_j^2 + I)^{-\frac{1}{2}}    (37)
G_{j(j+1)} = -(C_j^2 + I)^{-\frac{1}{2}} B_{j+1} (C_{j+1}^2 + I)^{-\frac{1}{2}}    (38)
G_{ll} = \tilde{B}_l^T \tilde{B}_l    (39)

Because the diagonal is removed in the deep frame potential computation, the contribution of G_{jj} is simply a rescaled version of the local frame potential of layer j. The contribution of G_{j(j+1)}, on the other hand, can essentially be interpreted as rescaled \ell_2 weight decay where rows are weighted more heavily if the corresponding columns of the previous layer's parameters have higher magnitudes. Furthermore, since the global frame potential is averaged over the total number of nonzero elements in G, if a layer has more parameters, then it will be given more weight in this computation. For general networks with skip connections, the summation from Eq. 36 has additional terms that introduce more complicated interactions. In these cases, it cannot be evaluated from local properties of layers.

Essentially, the deep frame potential summarizes the structural properties of the global frame B induced by a deep network architecture by balancing interactions within each individual layer, through local coherence properties, and between connecting layers.
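A small NumPy sketch of this computation is given below. Since Eq. 14 is not restated here, it assumes the deep frame potential is the mean squared off-diagonal entry of the column-normalized Gram matrix taken over its structurally nonzero positions, as described above; the helpers assemble the dense global frame of Eq. 28 from the same block dictionary used in the earlier sketches, and the function names are illustrative.

import numpy as np

def global_frame(B):
    # Dense assembly of the induced frame of Eq. 28: +B_jj on the diagonal
    # and -B_jk^T below it (identity, learned, or zero blocks depending on
    # the architecture).
    l = max(j for j, _ in B)
    heights = [B[(j, j)].shape[0] for j in range(1, l + 1)]
    widths = [B[(j, j)].shape[1] for j in range(1, l + 1)]
    r = np.concatenate(([0], np.cumsum(heights)))
    c = np.concatenate(([0], np.cumsum(widths)))
    A = np.zeros((r[-1], c[-1]))
    for (j, k), blk in B.items():
        A[r[j - 1]:r[j], c[k - 1]:c[k]] = blk if j == k else -blk.T
    return A

def deep_frame_potential(B):
    # Assumed form of Eq. 14: average squared off-diagonal entry of the
    # normalized Gram matrix over its structurally nonzero entries.
    A = global_frame(B)
    A = A / np.linalg.norm(A, axis=0, keepdims=True)    # unit-norm columns
    G = A.T @ A
    pattern = (np.abs(A) > 0).astype(float)
    support = (pattern.T @ pattern) > 0                 # structurally nonzero entries of G
    off = ~np.eye(G.shape[0], dtype=bool)
    return np.sum((G * off) ** 2) / np.count_nonzero(support & off)

# Example: the same three layers with and without a skip connection.
rng = np.random.default_rng(0)
d, k1, k2, k3 = 16, 32, 24, 20
chain = {(1, 1): rng.standard_normal((d, k1)),
         (2, 2): np.eye(k2), (2, 1): rng.standard_normal((k1, k2)),
         (3, 3): np.eye(k3), (3, 2): rng.standard_normal((k2, k3))}
skip = {**chain, (3, 1): rng.standard_normal((k1, k3))}
print(deep_frame_potential(chain), deep_frame_potential(skip))

Here the random parameters are only for illustration; the architecture comparisons in the paper use the minimum of this quantity over parameter values.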
3.3.1 Theoretical Lower Bound for Chain Networks

While the deep frame potential is a function of parameter values, its minimum value is determined only by the architecture-induced frame structure. Furthermore, we know that it must be bounded by a nonzero constant for overcomplete frames. In this section, we derive this lower bound for the special case of fully-connected chain networks and provide intuition for why skip connections increase the capacity for low mutual coherence.

First, observe that a lower bound for the Frobenius norm of G_{j(j+1)} from Eq. 38 cannot be readily attained because the rows and columns are rescaled independently. This means that a lower bound for the norm of G must be found by jointly considering the entire matrix structure, not simply through the summation of its components. To accomplish this, we instead consider the matrix F = \tilde{B}\tilde{B}^T, which is full rank and has the same norm as G:

\|G\|_F^2 = \|F\|_F^2 = \sum_{j=1}^{l} \|F_{jj}\|_F^2 + 2\sum_{j=1}^{l-1} \|F_{j(j+1)}\|_F^2    (40)

We can then express the individual blocks of F as:

F_{11} = B_1 (C_1^2 + I_{k_1})^{-1} B_1^T    (41)
F_{jj} = B_j (C_j^2 + I)^{-1} B_j^T + (C_{j-1}^2 + I)^{-1}    (42)
F_{j(j+1)} = -B_j (C_j^2 + I)^{-1}    (43)

In contrast to G_{j(j+1)} in Eq. 38, only the columns of F_{j(j+1)} in Eq. 43 are rescaled. Since \tilde{B}_j has normalized columns, its norm can be exactly expressed as:

\|F_{j(j+1)}\|_F^2 = \sum_{n=1}^{k_j} \left(\frac{c_{jn}^2}{c_{jn}^2 + 1}\right)^2    (44)

For the other blocks, we find lower bounds for their norms through the same technique used in deriving the Welch bound in Sec. 2.4.2. Specifically, we can lower bound the norms of the individual blocks as:

\|F_{11}\|_F^2 \geq \frac{1}{k_0}\left(\sum_{n=1}^{k_1} \frac{c_{1n}^2}{c_{1n}^2 + 1}\right)^2    (45)

\|F_{jj}\|_F^2 \geq \frac{1}{k_{j-1}}\left(\sum_{n=1}^{k_j} \frac{c_{jn}^2}{c_{jn}^2 + 1} + \sum_{p=1}^{k_{j-1}} \frac{1}{c_{(j-1)p}^2 + 1}\right)^2    (46)

In this case of dense shallow frames, the Welch bound depends only on the data dimensionality and the number of frame components. However, due to the structure of the architecture-induced frames, the lower bound of the deep frame potential depends on the data dimensionality, the number of layers, the number of units in each layer, the connectivity between layers, and the relative magnitudes between layers. Skip connections increase the number of nonzero elements in the Gram matrix over which to average and also enable off-diagonal blocks to have lower norms.
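Collecting Eqs. 40 and 44 through 46, the chain-network bound can be evaluated directly from the layer widths and column magnitudes without forming any matrices. The sketch below (ours) assumes k_0 denotes the input dimensionality and returns a lower bound on the unnormalized potential \|G\|_F^2; the normalization of Eq. 14 is omitted, and the function name is illustrative.

import numpy as np

def chain_potential_lower_bound(c, k0):
    # c[j - 1] holds the column magnitudes c_j of layer j; k0 is the input
    # dimensionality. Combines the exact off-diagonal norms of Eq. 44 with
    # the diagonal-block bounds of Eqs. 45 and 46, summed as in Eq. 40.
    l = len(c)
    r = [cj ** 2 / (cj ** 2 + 1.0) for cj in c]         # ratios c_jn^2 / (c_jn^2 + 1)
    bound = np.sum(r[0]) ** 2 / k0                      # Eq. 45
    for j in range(2, l + 1):                           # Eq. 46 for j = 2, ..., l
        k_prev = c[j - 2].size
        bound += (np.sum(r[j - 1]) + np.sum(1.0 / (c[j - 2] ** 2 + 1.0))) ** 2 / k_prev
    bound += 2.0 * sum(np.sum(rj ** 2) for rj in r[:-1])  # Eq. 44, doubled as in Eq. 40
    return bound

# Example: a three-layer chain with unit column magnitudes.
widths = [64, 32, 16]
print(chain_potential_lower_bound([np.ones(k) for k in widths], k0=128))

As discussed above, the resulting value depends on the layer widths, the number of layers, the input dimensionality, and the relative column magnitudes, in contrast to the Welch bound for dense shallow frames.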
3.3.2 Model Selection by Empirical Minimization

For more general architectures that lack a simple theoretical lower bound, we instead bound the mutual coherence of the architecture-induced frame through empirical minimization of the deep frame potential in the lower bound from Eq. 14. The lack of suboptimal local minima allows for effective optimization using gradient descent [45]. Because it is independent of data and parameter instantiations, this provides a way to compare different architectures without training. Instead, model selection is performed by choosing the candidate architecture that achieves the lowest deep frame potential subject to desired modeling constraints such as limiting the total number of parameters.

4 EXPERIMENTAL RESULTS

In this section, we demonstrate correlation between the minimum deep frame potential and validation error on the CIFAR-10 [62] and MNIST [28] datasets across a wide variety of fully-connected, convolutional, chain, residual, and densely connected network architectures. We show that networks with skip connections can achieve lower deep frame potentials with fewer learnable parameters, which is predictive of the parameter efficiency of trained networks.

In Fig. 8, we visualize a scatter plot of trained fully-connected networks with three to five layers and 16 to 4096 units in each layer. The corresponding architectures are shown as a list of units per layer for a few representative examples. The minimum deep frame potential of each architecture is compared against its validation error after training, and the total parameter count is indicated by color. In Fig. 8a, some networks with many parameters have unusually high error due to the difficulty in training very large fully-connected networks. In Fig. 8b, the addition of a deep frame potential regularization term overcomes some of these optimization difficulties for improved parameter efficiency. This results in high correlation between minimum deep frame potential and validation error. Furthermore, it emphasizes the diminishing returns of increasing the size of fully-connected chain networks; after a certain point, adding more parameters does little to reduce both validation error and minimum frame potential.

Fig. 8. A comparison of fully-connected deep network architectures with varying depths and widths. Warmer colors indicate models with more total parameters. (a) Some very large networks cannot be trained effectively, resulting in unusually high validation errors. (b) This can be remedied with deep frame potential regularization, resulting in high correlation between minimum frame potential and validation error.

To evaluate the effects of residual connections, we adapt the simplified ResNet architecture from [63], which consists of a single convolutional layer followed by three groups of residual blocks with activations given by Eq. 31. Before the second and third groups, the number of filters is increased by a factor of two and the spatial resolution is decreased by half with average pooling. To compare networks with different sizes, we modify their depths by changing the number of residual blocks in each group between 2 and 10 and their widths by changing the base number of filters between 4 and 32. For our experiments with densely connected skip connections [4], we adapt the simplified DenseNet architecture from [64]. Like with residual networks, it consists of a convolutional layer followed by three groups of the activations from Eq. 33 with decreasing spatial resolutions and increasing numbers of filters. Within each group, a dense block concatenates smaller convolutions that take all previous outputs as inputs with filter numbers equal to a fixed growth rate. Network depth and width are modified by respectively increasing the number of layers per group and the base growth rate between 2 and 12. Batch normalization [65] was also used in all convolutional networks.

We begin by experimentally validating our motivating assumption that feed-forward deep networks may be indirectly analyzed as approximate techniques for deep frame approximation. As shown in Fig. 9, their generalization capacities are comparable to recurrent networks that explicitly optimize the inference objective function from Eq. 29. While multiple iterations make backpropagation through inference more challenging and inefficient, especially for deeper networks, the acceleration techniques from Sec. 3.2.4 speed up convergence so that fewer iterations are required.

Fig. 9. Deep frame approximation inference compared between feed-forward deep networks and recurrent optimization for (a) chain networks and (b) ResNets with a base width of 16 filters and 2 residual blocks per group, and (c) DenseNets with a growth rate of 4 and 4 layers per group. Unrolled to an increasing number of iterations, recurrent optimization networks quickly converge to validation accuracies comparable to their feed-forward approximations.

Fig. 11. The effect of increasing depth in chain and residual networks. Validation error is compared against network depth for two different network widths. (a) Unlike chain networks, even very deep residual networks can be trained effectively for decreased validation error. (b) Despite having the same number of total parameters, residual connections also induce frame structures with lower minimum deep frame potentials.

Fig. 10. The correlation between the minimum deep frame potential and validation error of various network architectures on the (a) CIFAR-10 and (b) MNIST datasets. Each point represents a different architecture, and those in (a) are the same from Fig. 3. Despite the simplicity of MNIST, minimum deep frame potential is still a good predictor of validation error but on a much smaller scale.

Fig. 12. A comparison of (a) validation error and (b) minimum frame potential between residual networks and chain networks. Colors indicate different depths and data points are connected in order of increasing widths of 4, 8, 16, or 32 filters. Skip connections result in reduced error correlating with frame potential, with dense networks showing superior efficiency with increasing depth.

In Fig. 10 we show that validation error correlates with minimum deep frame potential even across different families of architectures. Because MNIST is much simpler than CIFAR-10, validation error saturates much quicker with respect to model size. Despite these dataset differences, minimum deep frame potential is still a good predictor of performance, and DenseNets with many skip connections once again achieve superior performance.

In Fig. 11, we compare validation errors and minimum deep frame potentials of residual networks and chain networks with residual connections removed. In Fig. 11a, chain network validation error increases for deeper networks while that of residual networks is lower and consistently decreases. This emphasizes the difficulty in training very deep chain networks. In Fig. 11b, residual connections are shown to enable lower minimum deep frame potentials following a similar trend with respect to model size.

In Fig. 12, we compare chain networks and residual networks with exactly the same number of parameters, where color indicates the number of residual blocks per group and connected data points have the same depths but different widths. The addition of skip connections reduces both validation error and minimum frame potential, as visualized by consistent placement below the diagonal line indicating lower values for residual networks than comparable chain networks. This effect becomes even more pronounced with increasing depths and widths.

In Fig. 13, we compare the parameter efficiency of chain networks, residual networks, and densely connected networks of different depths and widths. We visualize both validation error and minimum frame potential as functions of the number of parameters, demonstrating the improved scalability of networks with skip connections. While chain networks demonstrate increasingly poor parameter efficiency with respect to increasing depth in Fig. 13a, the skip connections of ResNets and DenseNets allow for further reducing error with larger network sizes in Figs. 13c,e. Considering all network families together as in Fig. 3d, we see that denser connections also allow for lower validation error with comparable numbers of parameters. This trend is mirrored in the minimum frame potentials of Figs. 13b,d,f shown together in Fig. 3e. Despite fine variations in behavior across different families of architectures, minimum deep frame potential is correlated with validation error across network sizes and effectively predicts the increased generalization capacity provided by skip connections.

Fig. 13. A demonstration of the improved scalability of networks with skip connections, where line colors indicate different depths and data points are connected showing increasing widths. (a) Chain networks with greater depths have increasingly worse parameter efficiency in comparison to (c) the corresponding networks with residual connections and (e) densely connected networks of similar size, whose performance scales efficiently with parameter count. This could potentially be attributed to correlated efficiency in reducing frame potential with fewer parameters, which saturates much faster with (b) chain networks than (d) residual networks or (f) densely connected networks.

5 CONCLUSION

In this paper, we introduced deep frame approximation, a general framework for deep learning that poses inference as global multilayer constrained approximation problems with structured overcomplete frames. The solutions to these problems may be approximated using convex optimization techniques implemented by feed-forward or recurrent neural networks. From this perspective, the compositional nature of deep learning arises as a consequence of structure-dependent efficient approximations to augmented shallow learning problems.

Based upon theoretical connections to sparse approximation and deep neural networks, we demonstrated how architectural hyper-parameters such as depth, width, and skip connections induce different structural properties of the frames in corresponding sparse coding problems. We compared these frame structures through lower bounds on their mutual coherence, which is theoretically tied to their capacity for uniquely and robustly representing data via sparse approximation. A theoretical lower bound was derived for chain networks, and the deep frame potential was proposed as an empirical optimization objective for constructing bounds for more complicated architectures. Experimentally, we observed correlations between minimum deep frame potential and validation error across different datasets and families of modern architectures with skip connections, including residual networks and densely connected convolutional networks. This suggests a promising direction for future research towards the theoretical analysis and practical construction of deep network architectures derived from connections between deep learning and overcomplete representations.

REFERENCES

[1] A. Krizhevsky et al., "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NeurIPS), 2012.
[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations (ICLR), 2015.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[4] G. Huang, Z. Liu, K. Weinberger, and L. van der Maaten, "Densely connected convolutional networks," in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[5] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, 1995.
[6] V. N. Vapnik and A. Y. Chervonenkis, "On the uniform convergence of relative frequencies of events to their probabilities," Theory of Probability and Its Applications, vol. XVI, no. 2, 1971.
[7] V. Papyan, Y. Romano, and M. Elad, "Convolutional neural networks analyzed via convolutional sparse coding," Journal of Machine Learning Research (JMLR), vol. 18, no. 83, 2017.
[8] C. Murdock, M. Chang, and S. Lucey, "Deep component analysis via alternating direction neural networks," in European Conference on Computer Vision (ECCV), 2018.
[9] C. Murdock and S. Lucey, "Dataless model selection with the deep frame potential," in Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[10] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "Imagenet large scale visual recognition challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, 2015.
[11] M. Elad, Sparse and redundant representations: from theory to applications in signal and image processing. Springer Science & Business Media, 2010.
[12] J. Kovačević, A. Chebira et al., "An introduction to frames," Foundations and Trends in Signal Processing, vol. 2, no. 1, 2008.
[13] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization," Proceedings of the National Academy of Sciences, vol. 100, no. 5, 2003.
[14] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, "Understanding deep learning requires rethinking generalization," in International Conference on Learning Representations (ICLR), 2017.
[15] S. Wold, K. Esbensen, and P. Geladi, "Principal component analysis," Chemometrics and Intelligent Laboratory Systems, vol. 2, no. 1-3, pp. 37-52, 1987.
[16] C. Bao, H. Ji, Y. Quan, and Z. Shen, "Dictionary learning for sparse coding: Algorithms and convergence analysis," Pattern Analysis and Machine Intelligence (PAMI), vol. 38, no. 7, pp. 1356-1369, 2016.
[17] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.
[18] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[19] C. Jutten and J. Herault, "Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture," Signal Processing, vol. 24, no. 1, pp. 1-10, 1991.
[20] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788-791, 1999.
[21] B. A. Olshausen et al., "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, vol. 381, no. 6583, pp. 607-609, 1996.
[22] T. Liu, D. Tao, and D. Xu, "Dimensionality-dependent generalization bounds for k-dimensional coding schemes," Neural Computation, 2016.
[23] N. Gillis, "Sparse and unique nonnegative matrix factorization through data preprocessing," Journal of Machine Learning Research (JMLR), vol. 13, pp. 3349-3386, 2012.

[24] B. Haeffele, E. Young, and R. Vidal, "Structured low-rank matrix factorization: Optimality, algorithm, and applications to image processing," in International Conference on Machine Learning (ICML), 2014.
[25] P. Baldi and K. Hornik, "Neural networks and principal component analysis: Learning from examples without local minima," Neural Networks, vol. 2, no. 1, 1989.
[26] F. De la Torre, "A least-squares framework for component analysis," Pattern Analysis and Machine Intelligence (PAMI), vol. 34, no. 6, 2012.
[27] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," in International Conference on Machine Learning (ICML), 2009.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, 1998.
[29] M. Denil et al., "Predicting parameters in deep learning," in Advances in Neural Information Processing Systems (NeurIPS), 2013.
[30] J. Alvarez and M. Salzmann, "Learning the number of neurons in deep networks," in Advances in Neural Information Processing Systems (NeurIPS), 2016.
[31] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[32] T. Elsken, J. H. Metzen, and F. Hutter, "Neural architecture search: A survey," Journal of Machine Learning Research (JMLR), vol. 20, no. 55, 2019.
[33] M. Tan and Q. Le, "Efficientnet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning (ICML), 2019.
[34] D. Arpit et al., "A closer look at memorization in deep networks," in International Conference on Machine Learning (ICML), 2017.
[35] M. Hardt, B. Recht, and Y. Singer, "Train faster, generalize better: Stability of stochastic gradient descent," in International Conference on Machine Learning (ICML), 2016.
[36] Y. Cao and Q. Gu, "Generalization bounds of stochastic gradient descent for wide and deep neural networks," in Advances in Neural Information Processing Systems (NeurIPS), 2019.
[37] R. Shwartz-Ziv and N. Tishby, "Opening the black box of deep neural networks via information," arXiv preprint arXiv:1703.00810, 2017.
[38] A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox, "On the information bottleneck theory of deep learning," in International Conference on Learning Representations (ICLR), 2018.
[39] R. Novak et al., "Sensitivity and generalization in neural networks: an empirical study," in International Conference on Learning Representations (ICLR), 2018.
[40] C. Moustapha, B. Piotr, G. Edouard, D. Yann, and U. Nicolas, "Parseval networks: Improving robustness to adversarial examples," in International Conference on Machine Learning (ICML), 2017.
[41] Y. Romano, A. Aberdam, J. Sulam, and M. Elad, "Adversarial noise attacks of deep learning architectures: Stability analysis via sparse modeled signals," Journal of Mathematical Imaging and Vision, 2018.
[42] J. Sulam, A. Aberdam, A. Beck, and M. Elad, "On multi-layer basis pursuit, efficient algorithms and convolutional neural networks," Pattern Analysis and Machine Intelligence (PAMI), 2019.
[43] S. F. Waldron, An introduction to finite tight frames. Springer, 2018.
[44] P. G. Casazza and J. Kovačević, "Equal-norm tight frames with erasures," Advances in Computational Mathematics, vol. 18, no. 2-4, 2003.
[45] J. Benedetto and M. Fickus, "Finite normalized tight frames," Advances in Computational Mathematics, vol. 18, no. 2-4, 2003.
[46] N. Parikh, S. Boyd et al., "Proximal algorithms," Foundations and Trends in Optimization, vol. 1, no. 3, 2014.
[47] P. L. Combettes and J.-C. Pesquet, "Deep neural network structures solving variational inequalities," Set-Valued and Variational Analysis, 2020.
[48] I. Daubechies, M. Defrise, and C. De Mol, "An iterative thresholding algorithm for linear inverse problems with a sparsity constraint," Communications on Pure and Applied Mathematics, vol. 57, no. 11, 2004.
[49] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," Pattern Analysis and Machine Intelligence (PAMI), vol. 35, no. 8, pp. 1798-1828, 2013.
[50] J. A. Tropp and A. C. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," IEEE Transactions on Information Theory, vol. 53, no. 12, 2007.
[51] A. M. Bruckstein, M. Elad, and M. Zibulevsky, "On the uniqueness of nonnegative sparse solutions to underdetermined systems of equations," IEEE Transactions on Information Theory, vol. 54, no. 11, 2008.
[52] D. L. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, no. 4, 2006.
[53] D. L. Donoho, M. Elad, and V. N. Temlyakov, "Stable recovery of sparse overcomplete representations in the presence of noise," IEEE Transactions on Information Theory, vol. 52, no. 1, 2005.
[54] J. A. Livezey, A. F. Bujan, and F. T. Sommer, "Learning overcomplete, low coherence dictionaries with linear inference," Journal of Machine Learning Research (JMLR), vol. 20, no. 174, 2019.
[55] L. Welch, "Lower bounds on the maximum cross correlation of signals," IEEE Transactions on Information Theory, vol. 20, no. 3, 1974.
[56] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, "Deconvolutional networks," in Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[57] H. Bristow, A. Eriksson, and S. Lucey, "Fast convolutional sparse coding," in Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[58] V. Papyan, J. Sulam, and M. Elad, "Working locally thinking globally: Theoretical guarantees for convolutional sparse coding," IEEE Transactions on Signal Processing, vol. 65, no. 21, 2017.
[59] Y. Xu and W. Yin, "A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion," SIAM Journal on Imaging Sciences, vol. 6, no. 3, pp. 1758-1789, 2013.
[60] N. Chodosh, C. Wang, and S. Lucey, "Deep convolutional compressed sensing for lidar depth completion," in Asian Conference on Computer Vision (ACCV), 2018.
[61] G. Cazenavette, C. Murdock, and S. Lucey, "Architectural adversarial robustness: The case for deep pursuit," in Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[62] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," University of Toronto, Tech. Rep., 2009.
[63] Y. Wu et al., "Tensorpack," https://github.com/tensorpack/, 2016.
[64] Y. Li, "Tensorflow densenet," https://github.com/YixuanLi/densenet-tensorflow, 2018.
[65] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning (ICML), 2015.

Calvin Murdock received the PhD degree in machine learning from Carnegie Mellon University in 2020. His research concerns fundamental questions of representation learning towards the goal of effective computational perception, leveraging insights from geometry, optimization, and sparse approximation theory. He is a member of the IEEE.

Simon Lucey received the PhD degree from the Queensland University of Technology, Brisbane, Australia, in 2003. He is an Associate Research Professor at the Robotics Institute at Carnegie Mellon University. He also holds an adjunct professorial position at the Queensland University of Technology. His research interests include computer vision and machine learning, and their application to human behavior. He is a member of the IEEE.