So Far, We Have Introduced…
• Mallat's invariant scattering transform networks
  – The deeper the network, the more invariant (to translation, local deformation, scaling and rotation)
• Poggio et al.: local (sparse), hierarchical, compositional functions
  – Avoid the curse of dimensionality
• Papyan et al.: sparse cascaded convolutional dictionary learning
  – Uniqueness and stability guarantees of sparse recovery

Let's continue on …

• Harmonic Analysis: What are the optimal (transferrable) representations of functions as input signals (sounds, images, …)?
• Approximation Theory: When and why are deep networks better than shallow networks?
• Optimization: What is the landscape of the empirical risk, and how can it be minimized efficiently?
• Statistics: How can deep learning generalize well without overfitting the noise?

Generalization of Supervised Learning

• Collect data: {(x_i, y_i) : i = 1, 2, …, n}, where all (x_i, y_i) ~ P and P is unknown.
• Learn a model f: X → Y, f ∈ ℋ, with parameters θ. A common approach is ERM (Empirical Risk Minimization):

    min_θ  \hat{R}_n(θ) := (1/n) Σ_{i=1}^n ℓ(θ; x_i, y_i),

  where θ denotes the model parameters and ℓ(θ; x, y) is the loss function on a data point.
• Predict new data: x → f(x).
• Population risk: R(θ) := E[ℓ(θ; x, y)].
• Generalization error: R(θ) − \hat{R}_n(θ).

Bias-Variance Tradeoff?

Deep models where p > 20n are common. Why do big models generalize well? What happens when I turn off the regularizers? (CIFAR10: n = 50,000, d = 3,072, k = 10.)

Model                               Parameters p   p/n   Train loss   Test error
CudaConvNet                         145,578        2.9   0            23%
CudaConvNet (with regularization)   145,578        2.9   0.34         18%
MicroInception                      1,649,402      33    0            14%
ResNet                              2,401,440      48    0            13%

[Ben Recht, FoCM 2017]

How to control the generalization error?

Motivation

• However, when Φ(X; Θ) is a large, deep network, the current best mechanism to control the generalization gap has two key ingredients:
  – Stochastic optimization: "During training, it adds the sampling noise that corresponds to empirical-population mismatch" [Léon Bottou].
  – Make the model as large as possible: see e.g. "Understanding Deep Learning Requires Rethinking Generalization" [Ch. Zhang et al., ICLR'17].

Traditional Learning Theory

Common form of generalization bound (in expectation or with high probability):

    R(θ) ≤ \hat{R}_n(θ) + √( capacity measure / n ).

Capacity measurement    Complexity
VC-dimension            VC ≤ O(|E| log |E|), where |E| is the number of network edges
ε-covering number       log N(ℱ, ε, n) grows with the depth L (number of layers)
Rademacher average      ℛ_n(ℱ) grows exponentially with the depth L

By these measures, a big model should fail!

Training Algorithms for Deep Learning: Commonly Used Algorithms

• Non-adaptive (easy to implement): SGD, SGD with momentum, ……
• Adaptive (learn the curvature adaptively, automatically tune parameters): Adam, AdaGrad, AdaDelta, ……

Stochastic Gradient Descent

Objective loss function:

    min_θ  \hat{R}_n(θ) := (1/n) Σ_{i=1}^n ℓ(θ; x_i, y_i),

where (x_i, y_i) is the data and θ is the parameter vector.

• Gradient descent: θ_{t+1} = θ_t − (η_t/n) Σ_{i=1}^n ∇ℓ(θ_t; x_i, y_i), with O(n) time complexity per iteration.
• SGD: θ_{t+1} = θ_t − η_t ∇ℓ(θ_t; x_{i_t}, y_{i_t}), where i_t is drawn uniformly from {1, …, n}, with O(1) time complexity per iteration.
• Extensions: mini-batch SGD.
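To make the O(n)-per-step versus O(1)-per-step contrast concrete, here is a minimal NumPy sketch of the two updates on a synthetic least-squares problem; the data, the squared loss, and the step sizes are illustrative assumptions, not choices made in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative assumption): n samples, d features.
n, d = 1000, 20
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = X @ theta_star + 0.1 * rng.normal(size=n)

def emp_risk(theta):
    """Empirical risk: (1/n) * sum_i (1/2) * (x_i . theta - y_i)^2."""
    return 0.5 * np.mean((X @ theta - y) ** 2)

def full_gradient(theta):
    """Gradient of the empirical risk; touches all n samples, hence O(n) per step."""
    return X.T @ (X @ theta - y) / n

def stochastic_gradient(theta, i):
    """Gradient of the loss on the single sample i, hence O(1) per step."""
    return (X[i] @ theta - y[i]) * X[i]

# Gradient descent: theta_{t+1} = theta_t - (eta_t / n) * sum_i grad l(theta_t; x_i, y_i).
theta_gd = np.zeros(d)
for t in range(200):
    theta_gd -= 0.1 * full_gradient(theta_gd)

# SGD: theta_{t+1} = theta_t - eta_t * grad l(theta_t; x_{i_t}, y_{i_t}), i_t uniform in {1, ..., n}.
theta_sgd = np.zeros(d)
for t in range(200 * n):                 # same total number of per-sample gradient evaluations as GD
    i_t = rng.integers(n)                # i_t drawn uniformly at random
    eta_t = 0.01 / (1 + t / n)           # a simple decaying step size (one of many valid choices)
    theta_sgd -= eta_t * stochastic_gradient(theta_sgd, i_t)

print("empirical risk, GD :", emp_risk(theta_gd))
print("empirical risk, SGD:", emp_risk(theta_sgd))
```

Mini-batch SGD simply replaces the single index i_t with a small random batch and averages the per-sample gradients, keeping the per-step cost low while reducing the variance of the update.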
Spectrally-normalized margin bounds for neural networks
Peter L. Bartlett, Dylan J. Foster, Matus Telgarsky (arXiv:1706.08498v2 [cs.LG], 5 Dec 2017)

Abstract. This paper presents a margin-based multiclass generalization bound for neural networks that scales with their margin-normalized spectral complexity: their Lipschitz constant, meaning the product of the spectral norms of the weight matrices, times a certain correction factor. This bound is empirically investigated for a standard AlexNet network trained with SGD on the mnist and cifar10 datasets, with both original and random labels; the bound, the Lipschitz constants, and the excess risks are all in direct correlation, suggesting both that SGD selects predictors whose complexity scales with the difficulty of the learning task, and secondly that the presented bound is sensitive to this complexity.

1 Overview

Neural networks owe their astonishing success not only to their ability to fit any data set: they also generalize well, meaning they provide a close fit on unseen data. A classical statistical adage is that models capable of fitting too much will generalize poorly; what's going on here?

Let's navigate the many possible explanations provided by statistical theory. A first observation is that any analysis based solely on the number of possible labellings on a finite training set, as is the case with VC dimension, is doomed: if the function class can fit all possible labels (as is the case with neural networks in standard configurations (Zhang et al., 2017)), then this analysis cannot distinguish it from the collection of all possible functions!

SGD with early stopping regularizes.

Figure 1 (Bartlett et al. 2017: generalization error of AlexNet on cifar10): An analysis of AlexNet (Krizhevsky et al., 2012) trained with SGD on cifar10, both with original and with random labels. Triangle-marked curves track excess risk across training epochs (on a log scale), with an 'x' marking the earliest epoch with zero training error. Circle-marked curves track Lipschitz constants, normalized so that the two curves for random labels meet. The Lipschitz constants tightly correlate with excess risk, and moreover normalizing them by margins (resulting in the square-marked curve) neutralizes growth across epochs.

Figure 2 (panels: (a) Margins, (b) Normalized margins): Margin distributions at the end of training AlexNet on cifar10, with and without random labels. With proper normalization, random labels demonstrably correspond to a harder problem.

For weight matrices A = (A_1, …, A_L), let F_A denote the function computed by the corresponding network:

    F_A(x) := σ_L(A_L σ_{L−1}(A_{L−1} ⋯ σ_1(A_1 x) ⋯)).    (1.1)

The network output F_A(x) ∈ R^{d_L} (with d_0 = d and d_L = k) is converted to a class label in {1, …, k} by taking the arg max over components, with an arbitrary rule for breaking ties. Whenever input data x_1, …, x_n ∈ R^d are given, collect them as rows of a matrix X ∈ R^{n×d}. Occasionally, notation will be overloaded to discuss F_A(X^T), a matrix whose i-th column is F_A(x_i). Let W denote the maximum of {d, d_1, …, d_L}. The l_2 norm ‖·‖_2 is always computed entry-wise; thus, for a matrix, it corresponds to the Frobenius norm.

Next, define a collection of reference matrices (M_1, …, M_L) with the same dimensions as A_1, …, A_L; for instance, to obtain a good bound for ResNet (He et al., 2016), it is sensible to set M_i := I, the identity map, and the bound below will worsen as the network moves farther from the identity map; for AlexNet (Krizhevsky et al., 2012), the simple choice M_i = 0 suffices.
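As a concrete reading of (1.1) and the arg max labelling rule, the following sketch evaluates F_A(x) with ReLU nonlinearities (which are 1-Lipschitz and satisfy σ_i(0) = 0, matching the assumptions used below); the layer widths and random Gaussian weights are placeholder assumptions, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # Each sigma_i is taken to be ReLU here: 1-Lipschitz with sigma_i(0) = 0.
    return np.maximum(z, 0.0)

# Placeholder architecture: d_0 = d = 3072 inputs, d_L = k = 10 classes, L = 3 layers.
dims = [3072, 256, 256, 10]
A = [rng.normal(scale=1.0 / np.sqrt(d_in), size=(d_out, d_in))
     for d_in, d_out in zip(dims[:-1], dims[1:])]      # weight matrices A_1, ..., A_L

def F(A, x):
    """F_A(x) = sigma_L(A_L sigma_{L-1}(... sigma_1(A_1 x) ...)), as in (1.1)."""
    h = x
    for A_i in A:
        h = relu(A_i @ h)
    return h                                           # an element of R^{d_L}

x = rng.normal(size=dims[0])
label = int(np.argmax(F(A, x)))                        # class label via arg max over components
print("predicted class:", label)
```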
Finally, let ‖·‖_σ denote the spectral norm, and let ‖·‖_{p,q} denote the (p, q) matrix norm, defined by ‖A‖_{p,q} := ‖(‖A_{:,1}‖_p, …, ‖A_{:,m}‖_p)‖_q for A ∈ R^{d×m}. The spectral complexity R_{F_A} = R_A of a network F_A with weights A is defined as

    R_A := ( ∏_{i=1}^L ρ_i ‖A_i‖_σ ) ( Σ_{i=1}^L ‖A_i^T − M_i^T‖_{2,1}^{2/3} / ‖A_i‖_σ^{2/3} )^{3/2}.    (1.2)

The following theorem provides a generalization bound for neural networks whose nonlinearities are fixed but whose weight matrices A have bounded spectral complexity R_A.

Margin- and network-Lipschitz-based generalization error bound (Bartlett et al. 2017):

Theorem 1.1. Let nonlinearities (σ_1, …, σ_L) and reference matrices (M_1, …, M_L) be given as above (i.e., σ_i is ρ_i-Lipschitz and σ_i(0) = 0). Then for (x, y), (x_1, y_1), …, (x_n, y_n) drawn iid from any probability distribution over R^d × {1, …, k}, with probability at least 1 − δ over ((x_i, y_i))_{i=1}^n, every margin γ > 0 and network F_A: R^d → R^k with weight matrices A = (A_1, …, A_L) satisfy

    Pr[ arg max_j F_A(x)_j ≠ y ] ≤ \hat{R}_γ(F_A) + \tilde{O}( (‖X‖_2 R_A / (γ n)) ln(W) + √( ln(1/δ) / n ) ),

where \hat{R}_γ(f) ≤ n^{−1} Σ_i 1[ f(x_i)_{y_i} ≤ γ + max_{j≠y_i} f(x_i)_j ] and ‖X‖_2 = √( Σ_i ‖x_i‖_2^2 ).

The full proof and a generalization beyond spectral norms is relegated to the appendix, but a sketch is provided in Section 3, along with a lower bound. Section 3 also gives a discussion of related work: briefly, it's essential to note that margin and Lipschitz-sensitive bounds have a long history in the neural networks literature (Bartlett, 1996; Anthony and Bartlett, 1999; Neyshabur et al., 2015).
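Since R_A in (1.2) depends only on the weight matrices, it can be computed directly from a trained network. The sketch below does so with reference matrices M_i = 0 (the simple choice suggested above for AlexNet) and ρ_i = 1 (ReLU layers); the layer widths and random weights are placeholder assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_complexity(A, M=None, rho=None):
    """Spectral complexity R_A from (1.2):
    R_A = (prod_i rho_i ||A_i||_sigma)
          * (sum_i ||A_i^T - M_i^T||_{2,1}^{2/3} / ||A_i||_sigma^{2/3})^{3/2}.
    """
    if M is None:
        M = [np.zeros_like(A_i) for A_i in A]            # reference matrices M_i = 0
    if rho is None:
        rho = [1.0] * len(A)                             # ReLU layers are 1-Lipschitz
    spec = [np.linalg.norm(A_i, ord=2) for A_i in A]     # spectral norms ||A_i||_sigma
    # ||A_i^T - M_i^T||_{2,1}: l2 norm of each column of the transpose
    # (i.e., each row of A_i - M_i), then summed.
    two_one = [np.linalg.norm(A_i - M_i, axis=1).sum() for A_i, M_i in zip(A, M)]
    lipschitz = np.prod([r * s for r, s in zip(rho, spec)])
    correction = sum(t ** (2 / 3) / s ** (2 / 3) for t, s in zip(two_one, spec)) ** 1.5
    return lipschitz * correction

# Placeholder network with layer widths 3072 -> 256 -> 256 -> 10 and random weights.
dims = [3072, 256, 256, 10]
A = [rng.normal(scale=1.0 / np.sqrt(d_in), size=(d_out, d_in))
     for d_in, d_out in zip(dims[:-1], dims[1:])]
print("R_A =", spectral_complexity(A))
```

Theorem 1.1 then bounds the test misclassification rate by the empirical margin loss plus a term that scales, up to logarithmic factors, with ‖X‖_2 R_A / (γ n), so networks whose spectrally-normalized complexity is small relative to their margins generalize well.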