On the Learnability of Fully-Connected Neural Networks
Yuchen Zhang (Stanford University), Jason D. Lee (University of Southern California), Martin J. Wainwright and Michael I. Jordan (University of California, Berkeley)
[email protected] [email protected] [email protected] [email protected]

Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017, Fort Lauderdale, Florida, USA. JMLR: W&CP volume 54. Copyright 2017 by the author(s).

Abstract

Despite the empirical success of deep neural networks, there is limited theoretical understanding of the learnability of these models with respect to polynomial-time algorithms. In this paper, we characterize the learnability of fully-connected neural networks via both positive and negative results. We focus on ℓ1-regularized networks, where the ℓ1-norm of the incoming weights of every neuron is assumed to be bounded by a constant B > 0. Our first result shows that such networks are properly learnable in poly(n, d, exp(1/ε²)) time, where n and d are the sample size and the input dimension, and ε > 0 is the gap to optimality. The bound is achieved by repeatedly sampling over a low-dimensional manifold so as to ensure approximate optimality, which avoids the exp(d) cost of exhaustively searching over the parameter space. We also establish a hardness result showing that the exponential dependence on 1/ε is unavoidable unless RP = NP. Our second result shows that the exponential dependence on 1/ε can be avoided by exploiting the underlying structure of the data distribution. In particular, if the positive and negative examples can be separated with margin γ > 0 by an unknown neural network, then the network can be learned in poly(n, d, 1/ε) time. The bound is achieved by an ensemble method which uses the first algorithm as a weak learner. We further show that the separability assumption can be weakened to tolerate noisy labels. Finally, we show that the exponential dependence on 1/γ is unimprovable under a certain cryptographic assumption.

1 Introduction

Deep neural networks have been successfully applied to various problems in machine learning, including image classification [12], speech recognition [8], natural language processing [2] and reinforcement learning [19]. Despite this empirical success, the theoretical understanding of learning neural networks remains relatively limited. It is known that training a two-layer neural network on the worst-case data distribution is NP-hard [5]. Real data, however, is rarely generated from the worst-case distribution. It is thus natural to wonder whether there are conditions under which an accurate neural network can be learned in polynomial time.

This paper provides some theoretical analysis of the learnability of neural networks. We focus on the problem of binary classification, and study fully-connected neural networks with a constant number m of layers, such that the ℓ1-norm of the incoming weights of every neuron is bounded by a constant B > 0. This ℓ1-regularization scheme has been studied by many authors [see, e.g., 3, 11, 4, 21]. Under the same setting, Zhang et al. [28] proposed a kernel-based method for improperly learning a classifier that is competitive against the best possible neural network. Our goal, in contrast, is to explicitly learn the neural network and its parameters, in the proper learning regime.

The main challenge of learning neural networks comes from the nonconvexity of the loss function. An exhaustive search over the parameter space can be conducted to obtain a global optimum, but its time complexity will be exponential in the number of parameters. Existing learnability results either assume a constant network scale [18], or assume that every hidden node connects to a constant number of input coordinates [16]. To the best of our knowledge, no agnostic learning algorithm has been shown to learn fully-connected neural networks with time complexity polynomial in the number of network parameters.

Our first result is to exhibit an algorithm whose running time is polynomial in the number of parameters while achieving a constant optimality gap. Specifically, it is guaranteed to achieve an empirical loss that is at most ε > 0 greater than that of the best neural network, with time complexity poly(n, d, C_{m,B,1/ε}). Here the integers n and d are the sample size and the input dimension, and the constant C_{m,B,1/ε} only depends on the triplet (m, B, 1/ε), with this dependence possibly being exponential. Thus, for a constant optimality gap ε > 0, number of layers m and ℓ1-bound B, the method runs in time polynomial in the pair (n, d). We refer to this method as the agnostic learning algorithm, since it makes no assumption on the data distribution. It is remarkably simple, using only multiple rounds of random sampling followed by optimization rounds. The insight is that although the network contains Ω(d) parameters, the empirical loss can be approximately minimized by considering only parameters lying on a k-dimensional manifold, so it suffices to optimize the loss over that manifold. The dimension k is independent of the scale of the network, and depends only on the target optimality gap ε.
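To make this sampling-plus-optimization recipe concrete, the sketch below shows one way such a learner could look for a two-layer network. It is a minimal illustration of the general idea rather than the algorithm analyzed in this paper: the function names (sample_l1_ball, fit_output_layer, agnostic_learn), the tanh activation, the logistic fit of the output layer, and the selection by empirical 0-1 loss are all assumptions made for the example.

    # Illustrative "random sampling + optimization" learner for a two-layer network
    # with l1-bounded incoming weights. This is a sketch of the general recipe, not
    # the algorithm analyzed in the paper; all names and design choices are ours.
    import numpy as np

    def sample_l1_ball(d, B, rng):
        """Draw a random weight vector w with ||w||_1 <= B."""
        w = rng.exponential(size=d) * rng.choice([-1.0, 1.0], size=d)
        return B * rng.uniform() * w / np.abs(w).sum()

    def fit_output_layer(H, y, steps=200, lr=0.1):
        """With the hidden layer fixed, fit output weights by gradient descent on the
        (convex) logistic loss; y is assumed to take values in {-1, +1}."""
        v = np.zeros(H.shape[1])
        for _ in range(steps):
            p = 1.0 / (1.0 + np.exp(-H @ v))
            v -= lr * H.T @ (p - (y + 1) / 2) / len(y)
        return v

    def agnostic_learn(X, y, num_hidden=8, B=1.0, rounds=50, seed=0):
        """Repeat: sample hidden weights from the l1-ball, optimize the output layer,
        and keep the candidate with the smallest empirical classification error."""
        rng = np.random.default_rng(seed)
        best = None
        for _ in range(rounds):
            W = np.stack([sample_l1_ball(X.shape[1], B, rng) for _ in range(num_hidden)])
            H = np.tanh(X @ W.T)                   # hidden-layer activations
            v = fit_output_layer(H, y)
            err = np.mean(np.sign(H @ v) != y)     # empirical 0-1 loss
            if best is None or err < best[0]:
                best = (err, W, v)
        return best  # (error, hidden weights, output weights)

The structural point is that once the hidden weights are fixed by sampling, fitting the output layer is a convex problem, so the only search is over randomly drawn low-dimensional candidates rather than over the full parameter space.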
Due to the exponential dependence on 1/ε, this first algorithm is too expensive to achieve a diminishing excess risk on large datasets. In our next result, we show how this exponential dependence can be removed by exploiting the underlying structure of the data distribution. In particular, by assuming that the positive and the negative examples of the dataset are separable by some unknown neural network with a constant margin γ > 0, we propose an algorithm that learns the network in polynomial time and correctly classifies all training points with margin Ω(γ). As a consequence, it achieves a generalization error bounded by ε with sample complexity n = poly(d, 1/ε) and time complexity poly(n, d, 1/ε). Both complexities have a polynomial dependence on 1/ε. We name it the BoostNet algorithm because it uses the AdaBoost approach [7] to construct an m-layer neural network by incrementally ensembling shallower (m − 1)-layer neural networks. Each shallow network is trained by the agnostic learning algorithm presented earlier, focusing on instances that are not correctly addressed by the existing shallow networks. Although each shallow network only guarantees a constant optimality gap, such constant gaps can be boosted to a diminishing error rate via a suitable ensembling procedure. We note that for real-world data the labels are often noisy, so that the separability assumption is unlikely to hold. A more realistic assumption would be that the underlying "true labels" are separable, but the observed labels are corrupted by random noise and are not separable. For such noisy data, we prove that the same poly(n, d, 1/ε)-time learnability can be achieved by a variant of the BoostNet algorithm.
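Since BoostNet is described above as an AdaBoost-style ensembling of shallow networks, a generic boosting loop conveys the flavor of the construction. The sketch below is standard AdaBoost with a pluggable weak learner, not the paper's BoostNet procedure; the interface weak_learner(X, y, weights), which could for instance wrap a weight-aware adaptation of the agnostic_learn sketch above (e.g. via resampling), is our own assumption.

    # Generic AdaBoost-style loop in the spirit of the BoostNet idea described above:
    # weak hypotheses trained on reweighted data are combined into a weighted vote.
    # This is a textbook boosting sketch, not the paper's BoostNet procedure.
    import numpy as np

    def boost(X, y, weak_learner, rounds=20):
        """y takes values in {-1, +1}. weak_learner(X, y, weights) must return a
        function h with h(X) in {-1, +1} whose weighted error beats chance."""
        n = len(y)
        D = np.full(n, 1.0 / n)          # example weights, emphasizing current mistakes
        hypotheses, alphas = [], []
        for _ in range(rounds):
            h = weak_learner(X, y, D)
            pred = h(X)
            err = np.clip(np.sum(D * (pred != y)), 1e-12, 1 - 1e-12)
            if err >= 0.5:               # no edge over random guessing; stop
                break
            alpha = 0.5 * np.log((1 - err) / err)
            D *= np.exp(-alpha * y * pred)
            D /= D.sum()                 # renormalize the distribution over examples
            hypotheses.append(h)
            alphas.append(alpha)

        def ensemble(X_new):
            votes = sum(a * h(X_new) for a, h in zip(alphas, hypotheses))
            return np.sign(votes)

        return ensemble

The loop only requires each weak hypothesis to have weighted error bounded away from 1/2, which is the role played in the argument above by the constant optimality gap that the agnostic learner guarantees under the margin assumption.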
To provide some historical context, in earlier work a number of practitioners [13, 24] have reported good empirical results using a neural network as a weak learner for AdaBoost. [...] for a neural network learner to achieve. Our weak learner avoids the hardness by assuming separability, and secures the polynomial complexity by agnostic learning. It draws a theoretical connection between neural network learning and boosting.

With the aim of understanding the fundamental limits of our learnability problem, we show that the time-complexity guarantees for both algorithms are unimprovable under their respective assumptions. Under the assumption that RP ≠ NP, we prove that the agnostic learning algorithm's exponential complexity in 1/ε cannot be avoided. More precisely, we demonstrate that there is no algorithm achieving an arbitrary excess risk ε > 0 in poly(n, d, 1/ε) time. On the other hand, we demonstrate that the BoostNet algorithm's exponential complexity in 1/γ is unimprovable as well; in particular, we show that a poly(d, 1/ε, 1/γ) complexity is impossible for any algorithm under a certain cryptographic assumption.

Finally, we report two empirical results on the BoostNet algorithm. The first experiment is a classical problem in computational learning theory, called learning parity functions with noise (a small data-generation sketch of this benchmark appears at the end of this section). We show that BoostNet learns a two-layer neural network that encodes the correct function, while the performance of backpropagation is as poor as random guessing. The second experiment is digit recognition on the MNIST dataset, where we show that BoostNet consistently outperforms backpropagation for training neural networks with the same number of hidden nodes.

Other related work. Several recent papers address the challenge of establishing polynomial-time learnability for neural networks [1, 25, 9, 28, 17]. Sedghi and Anandkumar [25] and Janzamin et al. [9] study the supervised learning of neural networks under the assumption that the score function of the data distribution is known. They show that, by certain computations on the score function, the first network layer can be learned by a polynomial-time algorithm. In contrast, our algorithms do not require knowledge of the data distribution. Another approach to the problem is via improper learning, in which case the goal is to find a predictor that need not be a neural network but performs as well as the best possible neural network in terms of generalization error. Livni et al. [17] propose a polynomial-time algorithm to learn networks whose activation function is quadratic. Zhang et al. [28] propose an algorithm for the improper learning of sigmoidal neural networks. That algorithm is based on the kernel method, so its output does not correspond to the parameters of a neural network. In contrast, our method learns the model parameters explicitly.
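As referenced in the experiments paragraph above, the following is a small data-generation sketch of the parity-with-noise benchmark. It reflects the standard formulation of the problem rather than the paper's experimental setup; the specific parameters (k relevant bits, flip probability eta) and the function name noisy_parity_data are illustrative choices of ours.

    # Illustrative generator for the "learning parity with noise" benchmark: the label
    # is the parity (XOR) of an unknown subset S of the input bits, and each observed
    # label is flipped independently with probability eta. A sketch of the standard
    # setup, not the paper's experimental code.
    import numpy as np

    def noisy_parity_data(n, d, k=3, eta=0.1, seed=0):
        rng = np.random.default_rng(seed)
        S = rng.choice(d, size=k, replace=False)     # hidden relevant coordinates
        X = rng.integers(0, 2, size=(n, d))          # uniform random bit vectors
        clean = X[:, S].sum(axis=1) % 2              # parity over the hidden subset S
        flips = rng.random(n) < eta                  # independent label noise
        y = np.where(flips, 1 - clean, clean)
        return X, 2 * y - 1                          # labels in {-1, +1}

Because the parity of a hidden subset of two or more uniform bits is uncorrelated with every individual input coordinate, purely gradient-based training has little low-order signal to follow, which is consistent with the near-random behavior of backpropagation reported above.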