Mathematics of Data: From Theory to Computation

Prof. Volkan Cevher volkan.cevher@epfl.ch

Lecture 7
Laboratory for Information and Inference Systems (LIONS)
École Polytechnique Fédérale de Lausanne (EPFL)

EE-556 (Fall 2020)

License Information for Mathematics of Data Slides

◦ This work is released under a Creative Commons License with the following terms:
- Attribution: The licensor permits others to copy, distribute, display, and perform the work. In return, licensees must give the original authors credit.
- Non-Commercial: The licensor permits others to copy, distribute, display, and perform the work. In return, licensees may not use the work for commercial purposes, unless they get the licensor's permission.
- Share Alike: The licensor permits others to distribute derivative works only under a license identical to the one that governs the licensor's work.
◦ Full Text of the License

Outline

◦ This class
- Introduction to Deep Learning
- The Deep Learning Paradigm
- Challenges in Deep Learning Theory and Applications
- Introduction to generalization error bounds
- Uniform Convergence and Rademacher Complexity
- Generalization in Deep Learning (Part 1)
◦ Next class
- Generalization in Deep Learning (Part 2)

Remark about notation

◦ The Deep Learning literature might use a different notation:

Our lectures → DL literature
- data/sample: a → x
- label: b → y
- bias: µ → b
- weight: x, X → w, W

Power of linear classifiers–I

Problem (Recall: Logistic regression)
Given a sample vector a_i ∈ R^d and a binary class label b_i ∈ {−1, +1} (i = 1, ..., n), we define the conditional probability of b_i given a_i as

P(b_i | a_i, x) ∝ 1 / (1 + e^{−b_i ⟨x, a_i⟩}),

where x ∈ R^d is some weight vector.
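As a quick numerical sanity check of this definition, here is a minimal NumPy sketch; the vectors a and x below are arbitrary illustrative values, not from the lecture.

```python
import numpy as np

def logistic_prob(a, x, b):
    """P(b | a, x) = 1 / (1 + exp(-b * <x, a>)) for b in {-1, +1}."""
    return 1.0 / (1.0 + np.exp(-b * np.dot(x, a)))

# Toy example with a hypothetical sample a and weight vector x.
a = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, -0.1, 0.8])
print(logistic_prob(a, x, b=+1), logistic_prob(a, x, b=-1))  # the two values sum to 1
```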


Figure: Linearly separable versus nonlinearly separable dataset

Power of linear classifiers–II

◦ Lifting dimensions to the rescue
- Convex optimization objective
- Might introduce the curse of dimensionality
- Possible to avoid via kernel methods, such as SVMs


Figure: Non-linearly separable data (left). Linearly separable in R^3 via a_z = sqrt(a_x^2 + a_y^2) (right).
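A minimal sketch of this lifting, assuming two noisy concentric rings as the toy dataset; the radii, noise level, and threshold below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(radius, n):
    """Sample n points around a circle of the given radius (plus small noise)."""
    theta = rng.uniform(0, 2 * np.pi, n)
    r = radius + 0.05 * rng.standard_normal(n)
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)

inner, outer = ring(1.0, 200), ring(2.0, 200)        # classes b = -1 and b = +1
A = np.vstack([inner, outer])
b = np.concatenate([-np.ones(200), np.ones(200)])

# Lift to 3D: a_z = sqrt(a_x^2 + a_y^2); the plane a_z = 1.5 now separates the classes.
a_z = np.linalg.norm(A, axis=1)
pred = np.where(a_z > 1.5, 1.0, -1.0)
print("accuracy after lifting:", np.mean(pred == b))  # ~1.0
```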

An important alternative for non-linearly separable data

1-hidden-layer neural network with m neurons (fully-connected architecture):

◦ Parameters: X_1 ∈ R^{m×d}, X_2 ∈ R^{c×m} (weights), µ_1 ∈ R^m, µ_2 ∈ R^c (biases)
◦ Activation function: σ : R → R

h_x(a) := X_2 σ(X_1 a + µ_1) + µ_2,   x := [X_1, X_2, µ_1, µ_2],

where a is the input, X_1 and µ_1 are the hidden-layer weight and bias, σ is the activation, and X_2 and µ_2 are the output-layer weight and bias. The hidden layer σ(X_1 a + µ_1) gives the learned features.

Recursively repeat activation + affine transformation to obtain "deeper" networks.
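A minimal NumPy sketch of this forward pass, with hypothetical sizes and random parameters purely for illustration; tanh stands in for a generic activation σ.

```python
import numpy as np

def one_hidden_layer(a, X1, mu1, X2, mu2, sigma=np.tanh):
    """h_x(a) = X2 @ sigma(X1 @ a + mu1) + mu2 for a single input a in R^d."""
    hidden = sigma(X1 @ a + mu1)      # learned features in R^m
    return X2 @ hidden + mu2          # output in R^c

# Hypothetical sizes: d = 4 inputs, m = 8 hidden neurons, c = 2 outputs.
rng = np.random.default_rng(0)
d, m, c = 4, 8, 2
X1, mu1 = rng.standard_normal((m, d)), rng.standard_normal(m)
X2, mu2 = rng.standard_normal((c, m)), rng.standard_normal(c)
print(one_hidden_layer(rng.standard_normal(d), X1, mu1, X2, mu2))
```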

Why neural networks?: An approximation theoretic motivation

Theorem (Universal approximation [3])
Let σ(·) be a nonconstant, bounded, and increasing continuous function. Let I_d = [0, 1]^d. The space of continuous functions on I_d is denoted by C(I_d).

Given ε > 0 and g ∈ C(I_d), there exists a 1-hidden-layer network h with m neurons such that h is an ε-approximation of g, i.e.,

sup_{a ∈ I_d} |g(a) − h(a)| ≤ ε.

Caveat: The number of neurons m needed to approximate some function g can be arbitrarily large!

Figure: networks of increasing width
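A rough numerical illustration of the width effect (not the construction used in the theorem's proof): fit the output layer of a width-m network with random sigmoid hidden units to g(a) = sin(2πa) by least squares and report the sup error on a grid. All constants below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 500)
g = np.sin(2 * np.pi * grid)                    # target function g in C([0, 1])

def sup_error(m):
    """Random sigmoid hidden layer of width m; output weights fit by least squares."""
    w, mu = 10.0 * rng.standard_normal(m), 10.0 * rng.standard_normal(m)
    hidden = 1.0 / (1.0 + np.exp(-(np.outer(grid, w) + mu)))   # sigma(w_j * a + mu_j)
    coef, *_ = np.linalg.lstsq(hidden, g, rcond=None)
    return np.max(np.abs(hidden @ coef - g))

for m in [2, 8, 32, 128]:
    print(f"m = {m:>3d}: sup error ~ {sup_error(m):.3f}")      # typically decreases with width
```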


Why were NNs not popular before 2010?

- too big to optimize!
- did not have enough data
- could not find the optimum via algorithms

Mathematics of Data | Prof. Volkan Cevher, volkan.cevher@epfl.ch Slide 9/ 43 Why were NNs not popular before 2010?

I too big to optimize! I did not have enough data I could not find the optimum via algorithms

Mathematics of Data | Prof. Volkan Cevher, volkan.cevher@epfl.ch Slide 9/ 43

Multi-class classification

Figure: CIFAR10 dataset: 60000 32x32 color images (3 channels) from 10 classes
Figure: ImageNet dataset: 14 million color images (varying resolution, 3 channels) from 21K classes

Goal
Image-label pairs (a, b) ∈ R^d × {1, ..., c} follow an unknown distribution P. Find h : R^d → {1, ..., c} with minimum misclassification probability:

min_{h ∈ H} P(h(a) ≠ b)

2010-today: Deep Learning becomes popular again

[Chart: ImageNet error rates (%) for XRCE (2011), AlexNet (2012), ZF (2013), VGG (2014), GoogLeNet (2014), ResNet (2015), and GoogLeNet-V4 (2016), compared with human performance; the legend distinguishes the linear model from the deep networks.]

Figure: Error rate on the ImageNet challenge, for different network architectures.

Convolutional architectures in Computer Vision tasks

Figure: “Locality” Structure of a 2D deep convolutional neural network.

Inductive Bias: Why does it work so well in Computer Vision tasks?

h° : true unknown function
H : space of all functions
H_fc^p : fully-connected networks with p parameters
H_conv^p : convolutional networks with p parameters

2010-today: Size of neural networks grows exponentially!

[Chart: number of parameters (millions, log scale) of language models from academia and industry: ELMo (AI2, 2018), GPT (OpenAI, 2018), BERT (Google, 2018), Transformer ELMo (AI2, 2019), GPT-2 (OpenAI, 2019), Grover-Mega (U. of Washington, 2019), Megatron-LM (Nvidia, 2019), Turing-NLG (Microsoft, 2020), GPT-3 (OpenAI, 2020).]

Figure: Number of parameters in Language models based on Deep Learning.

The Landscape of ERM with multilayer networks

Recall: Empirical risk minimization (ERM)
Let h_x : R^d → R be a network and let {(a_i, b_i)}_{i=1}^n be a sample with b_i ∈ {−1, 1} and a_i ∈ R^d. The empirical risk minimization (ERM) is defined as

min_x { R_n(x) := (1/n) Σ_{i=1}^n L(h_x(a_i), b_i) }   (1)

where L(h_x(a_i), b_i) is the loss on the sample (a_i, b_i) and x are the parameters of the network.

Some frequently used loss functions

- L(h_x(a), b) = log(1 + exp(−b · h_x(a)))   (logistic loss)
- L(h_x(a), b) = (b − h_x(a))^2   (squared error)
- L(h_x(a), b) = max(0, 1 − b · h_x(a))   (hinge loss)
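A minimal sketch of these losses and of the empirical risk (1), assuming a toy linear predictor h_x(a) = ⟨x, a⟩ and random data purely for illustration.

```python
import numpy as np

def logistic_loss(pred, b):
    """log(1 + exp(-b * h_x(a)))"""
    return np.log1p(np.exp(-b * pred))

def squared_error(pred, b):
    """(b - h_x(a))^2"""
    return (b - pred) ** 2

def hinge_loss(pred, b):
    """max(0, 1 - b * h_x(a))"""
    return np.maximum(0.0, 1.0 - b * pred)

def empirical_risk(predict, x, A, b, loss):
    """R_n(x) = (1/n) sum_i loss(h_x(a_i), b_i), as in (1)."""
    preds = np.array([predict(x, a) for a in A])
    return np.mean(loss(preds, b))

# Toy linear predictor h_x(a) = <x, a> on random data (illustrative only).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((5, 3)), np.array([1.0, -1.0, 1.0, 1.0, -1.0])
x = rng.standard_normal(3)
predict = lambda x, a: x @ a
print(empirical_risk(predict, x, A, b, logistic_loss),
      empirical_risk(predict, x, A, b, hinge_loss))
```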

The Landscape of ERM with multilayer networks

Figure: convex (left) vs non-convex (right) optimization landscape

Conventional wisdom in ML until 2010: Simple models + simple errors

The Deep Learning Paradigm

(a) Massive datasets (b) Inductive bias from large and complex architectures (c) ERM using stochastic non-convex first-order optimization algorithms (SGD)

Figure: Most common components in a Deep Learning Pipeline


Challenges in DL/ML applications: Robustness (I)

(a) Turtle classified as rifle. Athalye et al. 2018. (b) Stop sign classified as 45 mph sign. Eykholt et al. 2018

Figure: Natural or human-crafted modifications that trick neural networks used in computer vision tasks

Challenges in DL/ML applications: Robustness (II)

(a) Linear classifier on data distributed on a sphere (b) Concentration of measure phenomenon in high dimensions

Figure: Understanding the robustness of a classifier in high-dimensional spaces. Shafahi et al. 2019.

Challenges in DL/ML applications: Robustness (References)

1. Madry, Aleksander and Makelov, Aleksandar and Schmidt, Ludwig and Tsipras, Dimitris and Vladu, Adrian. Towards Deep Learning Models Resistant to Adversarial Attacks. ICLR 2018.
2. Raghunathan, A., Steinhardt, J., and Liang, P. S. Semidefinite relaxations for certifying robustness to adversarial examples. NeurIPS 2018.
3. Wong, E. and Kolter, Z. Provable defenses against adversarial examples via the convex outer adversarial polytope. ICML 2018.
4. Huang, X., Kwiatkowska, M., Wang, S., and Wu, M. Safety verification of deep neural networks. Computer Aided Verification 2017.
5. Athalye, A., et al. Synthesizing robust adversarial examples. International Conference on Machine Learning. PMLR, 2018.
6. Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A., Xiao, C., and Song, D. Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1625-1634). 2018.
7. Shafahi, A., Ronny Huang, W., Studer, C., Feizi, S., and Goldstein, T. Are adversarial examples inevitable? International Conference on Learning Representations. 2019.

Challenges in DL/ML applications: Surveillance/Privacy/Manipulation

Figure: Political and societal concerns about some DL/ML applications

Challenges in DL/ML applications: Surveillance/Privacy/Manipulation (References)

1. Dwork, C., and Roth, A. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 9, 2013.
2. Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (pp. 308-318). 2016.
3. Sreenu, G., Saleem Durai, M.A. Intelligent video surveillance: a review through deep learning techniques for crowd analysis. J Big Data 6, 48. 2019.
4. O'Neil, C. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Broadway Books, 2016.
5. Wade, M. Psychographics: the behavioural analysis that helped Cambridge Analytica know voters' minds. https://theconversation.com/psychographics-the-behavioural-analysis-that-helped-cambridge-analytica-know-voters-minds-93675 2018.

Challenges in DL/ML applications: Fairness

(a) Racist classifier (b) Effect of unbalanced data
Figure: Unfair classifiers due to biased or unbalanced datasets/algorithms

Challenges in DL/ML applications: Fairness (References)

1. Barocas, S., Hardt, M., and Narayanan, A. Fairness in Machine Learning: Limitations and Opportunities. https://fairmlbook.org/pdf/fairmlbook.pdf 2020.
2. Hardt, M. How Big Data Is Unfair. https://medium.com/@mrtz/how-big-data-is-unfair-9aa544d739de 2014.
3. Munoz, C., Smith, M., and Patil, D. Big Data: A Report on Algorithmic Systems, Opportunity, and Civil Rights. Executive Office of the President. The White House, 2016.
4. Campolo, A., Sanfilippo, M., Whittaker, M., Crawford, K. AI Now 2017 Report. AI Now Institute at New York University, 2017.
5. Friedman, B. and Nissenbaum, H. Bias in Computer Systems. ACM Transactions on Information Systems (TOIS) 14, no. 3, 1996: 330–47.
6. Pedreshi, D., Ruggieri, S., and Turini, F. Discrimination-Aware Data Mining. Proc. 14th SIGKDD. ACM 2008.
7. Noble, S.U. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press. 2018.

Challenges in DL/ML applications: Interpretability

Figure: Performance vs Interpretability trade-offs in DL/ML

Challenges in DL/ML applications: Interpretability (References)

1. Baehrens, David and Schroeter, Timon and Harmeling, Stefan and Kawanabe, Motoaki and Hansen, Katja and Mueller, Klaus-Robert. How to Explain Individual Classification Decisions. JMLR 2010.
2. Simonyan, Karen and Vedaldi, Andrea and Zisserman, Andrew. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv e-prints, arXiv:1312.6034. 2013.
3. Ribeiro, Marco and Singh, Sameer and Guestrin, Carlos. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. KDD 2016.
4. Sundararajan, Mukund and Taly, Ankur and Yan, Qiqi. Axiomatic Attribution for Deep Networks. ICML 2017.
5. Shrikumar, Avanti and Greenside, Peyton and Kundaje, Anshul. Learning Important Features Through Propagating Activation Differences. ICML 2017.

Challenges in DL/ML applications: Energy efficiency and cost

Dennard scaling & Moore's law vs growth of data

Figure: Efficiency and Scalability concerns in DL/ML (image sources: DART Consulting, Andy Burg, Tim Dettmers)

Challenges in DL/ML applications: Energy efficiency and cost (References)

1. García-Marín, E., Rodrigues, C. F., Riley, G., and Grahn, H. Estimation of energy consumption in machine learning. Journal of Parallel and Distributed Computing, 134, 75-88. 2019.
2. Strubell, E., Ganesh, A., and McCallum, A. Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243. 2019.
3. Goel, A., Tung, C., Lu, Y. H., and Thiruvathukal, G. K. A Survey of Methods for Low-Power Deep Learning and Computer Vision. arXiv preprint arXiv:2003.11066. 2020.
4. Conti, F., Rusci, M., and Benini, L. The Memory Challenge in Ultra-Low Power Deep Learning. In NANO-CHIPS 2030 (pp. 323-349). Springer, Cham. 2020.

What theoretical challenges in Deep Learning will we study?

Models
Let X ⊆ X° be parameter domains, where X is known. Define

1. x° ∈ arg min_{x∈X°} R(x): true minimum risk model
2. x♮ ∈ arg min_{x∈X} R(x): assumed minimum risk model
3. x⋆ ∈ arg min_{x∈X} R_n(x): ERM solution
4. x^t: numerical approximation of x⋆ at time t

Practical performance in Deep Learning

R(x^t) − R(x°) ≤ [R_n(x^t) − R_n(x⋆)] + 2 sup_{x∈X} |R(x) − R_n(x)| + [R(x♮) − R(x°)]
     ε̄(t, n)        optimization error     worst-case generalization error     model error

where ε̄(t, n) denotes the total error of the Learning Machine.

In Deep Learning applications

1. Optimization error is almost zero, in spite of non-convexity. ⇒ lecture 9
2. We expect large generalization error. It does not happen in practice. ⇒ lectures 7 (this one) and 8
3. Large architectures + inductive bias might lead to small model error.

Generalization error bounds

The value of |R(x) − Rn(x)| is called the generalization error of the parameter x.

Goal: obtain generalization bounds for multi-layer, fully-connected neural networks

We want to find high-probability upper bounds for the worst case generalization error over a class X :

sup_{x∈X} |R(x) − R_n(x)|

Main tool: concentration inequalities!
- Measure of how far an empirical average is from the true mean

Theorem (Hoeffding's Inequality [6])
Let Y_1, ..., Y_n be i.i.d. random variables with Y_i taking values in the interval [a_i, b_i] ⊆ R for all i = 1, ..., n. Let S_n := (1/n) Σ_{i=1}^n Y_i. It holds that

P(|S_n − E[S_n]| > t) ≤ 2 exp( −2n²t² / Σ_{i=1}^n (b_i − a_i)² )

Warmup: Generalization bound for a singleton

Lemma
For i = 1, ..., n let (a_i, b_i) ∈ R^p × {−1, 1} be independent random variables and h_x : R^p → R be a function parametrized by x ∈ X. Let X = {x_0} and L(h_x(a), b) = 1{sign(h_x(a)) ≠ b} be the 0-1 loss. With probability at least 1 − δ, we have that

sup_{x∈X} |R(x) − R_n(x)| = |R(x_0) − R_n(x_0)| ≤ sqrt( ln(2/δ) / (2n) ).

Proof.
Note that E[(1/n) Σ_{i=1}^n L(h_{x_0}(a_i), b_i)] = R(x_0), the expected risk of the parameter x_0. Moreover L(h_{x_0}(a_i), b_i) ∈ [0, 1]. We can use Hoeffding's inequality and obtain

P(|R_n(x_0) − R(x_0)| > t) = P( |(1/n) Σ_{i=1}^n L(h_{x_0}(a_i), b_i) − R(x_0)| > t ) ≤ 2 exp(−2nt²).

Setting δ := 2 exp(−2nt²), we have that t = sqrt( ln(2/δ) / (2n) ), thus obtaining the result. □
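A Monte Carlo sanity check of this lemma, under the assumption that the fixed classifier x_0 has true risk R(x_0) = 0.3, so that each 0-1 loss is a Bernoulli(0.3) variable; the fraction of samples on which the bound holds should be at least 1 − δ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta, trials = 500, 0.05, 5_000
bound = np.sqrt(np.log(2 / delta) / (2 * n))

# Hypothetical fixed classifier x_0 with true misclassification probability R(x_0) = 0.3:
# each 0-1 loss L(h_{x_0}(a_i), b_i) is then a Bernoulli(0.3) random variable.
R_true = 0.3
losses = rng.random((trials, n)) < R_true       # i.i.d. 0-1 losses over `trials` independent samples
R_n = losses.mean(axis=1)                       # empirical risk of x_0 on each sample

coverage = np.mean(np.abs(R_n - R_true) <= bound)
print(f"bound = {bound:.4f}, fraction with |R_n - R| <= bound: {coverage:.4f} (should be >= {1 - delta})")
```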

Generalization bound for finite sets

Lemma
For i = 1, ..., n let (a_i, b_i) ∈ R^p × {−1, 1} be independent random variables and h_x : R^p → R be a function parametrized by x ∈ X. Let X be a finite set and L(h_x(a), b) = 1{sign(h_x(a)) ≠ b} be the 0-1 loss. With probability at least 1 − δ, we have that

sup_{x∈X} |R(x) − R_n(x)| ≤ sqrt( (ln|X| + ln(2/δ)) / (2n) ).

Proof.
Let X = {x_1, ..., x_|X|}. We can use a union bound and the analysis of the singleton case to obtain:

P(∃j : |R_n(x_j) − R(x_j)| > t) ≤ Σ_{j=1}^{|X|} P(|R_n(x_j) − R(x_j)| > t) ≤ 2|X| exp(−2nt²).

Setting δ := 2|X| exp(−2nt²), we have that t = sqrt( (ln|X| + ln(2/δ)) / (2n) ), thus obtaining the result. □
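A small sketch that simply evaluates this bound, showing its logarithmic growth in |X|; the values of n, δ and |X| below are arbitrary illustrative choices.

```python
import numpy as np

def finite_class_bound(size, n, delta):
    """sqrt((ln|X| + ln(2/delta)) / (2n)); size = 1 recovers the singleton bound."""
    return np.sqrt((np.log(size) + np.log(2.0 / delta)) / (2.0 * n))

n, delta = 10_000, 0.05
for size in [1, 10, 1_000, 10**6]:
    print(f"|X| = {size:>7d}: bound = {finite_class_bound(size, n, delta):.4f}")
```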

Generalization bounds for infinite classes - The Rademacher complexity

However, in most applications in ML/DL we optimize over an infinite parameter space X !

◦ A useful notion of complexity to derive generalization bounds for infinite classes of functions:

Definition (Rademacher Complexity [2])
Let A = {a_1, ..., a_n} ⊆ R^p and let {σ_i : i = 1, ..., n} be independent Rademacher random variables, i.e., taking values uniformly in {−1, +1} (coin flip). Let H be a class of functions of the form h : R^p → R. The Rademacher complexity of H with respect to A is defined as follows:

R_A(H) := E sup_{h∈H} (1/n) Σ_{i=1}^n σ_i h(a_i).

◦ R_A(H) measures how well we can fit random signs (±1) with the output of an element of H on the set A.
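A minimal Monte Carlo sketch of this interpretation, comparing a tiny class {h, −h} with h(a_i) ≡ 1 against a hypothetical "memorizer" class that can reproduce any sign pattern on A (so its supremum equals 1 for every draw of the signs):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 2_000
sigma = rng.choice([-1.0, 1.0], size=(trials, n))      # Rademacher signs

# Tiny class {h, -h} with h(a_i) = 1 for every i:
# sup over the class of (1/n) sum_i sigma_i h(a_i) equals |mean(sigma)|.
rad_small = np.abs(sigma.mean(axis=1)).mean()

# Hypothetical "memorizer" class that can match any sign pattern on A:
# the sup equals 1 for every draw of the signs, so its Rademacher complexity is 1.
rad_memorizer = 1.0

print("R_A(small class)     ~", rad_small)       # roughly sqrt(2 / (pi * n)) ~ 0.08
print("R_A(memorizer class) =", rad_memorizer)
```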

Visualizing Rademacher complexity

Figure: Rademacher complexity measures correlation with random signs

Visualizing Rademacher complexity

(a) High Rademacher Complexity (b) Large Generalization error (memorization)

(c) Low Rademacher Complexity (d) Low Generalization error

Figure: Rademacher complexity and Generalization error

A fundamental theorem about the Rademacher Complexity

Theorem (See Theorems 3.3 and 5.8 in [6])
Suppose that the loss function has the form L(h_x(a), b) = φ(b · h_x(a)) for a 1-Lipschitz function φ : R → R. Let H_X := {h_x : x ∈ X} be a class of parametric functions h_x : R^p → R. For any δ > 0, with probability at least 1 − δ over the draw of an i.i.d. sample {(a_i, b_i)}_{i=1}^n, letting A = (a_1, ..., a_n), the following holds:

sup_{x∈X} |R_n(x) − R(x)| ≤ 2 E_A[R_A(H_X)] + sqrt( ln(2/δ) / (2n) ),

sup_{x∈X} |R_n(x) − R(x)| ≤ 2 R_A(H_X) + 3 sqrt( ln(4/δ) / (2n) ).

The assumption is satisfied for common losses

- L(h_x(a), b) = log(1 + exp(−b · h_x(a))) ⇒ φ(z) := log(1 + exp(−z))   (logistic loss)
- L(h_x(a), b) = max(0, 1 − b · h_x(a)) ⇒ φ(z) := max(0, 1 − z)   (hinge loss)

Computing the Rademacher complexity for Linear functions

Theorem
Let X := {x ∈ R^p : ||x||_2 ≤ λ} and let H_X be the class of functions of the form h_x : R^p → R, h_x(a) = ⟨x, a⟩, for some x ∈ X. Let A = {a_1, ..., a_n} ⊆ R^p such that max_{i=1,...,n} ||a_i|| ≤ M. It holds that R_A(H_X) ≤ λM/√n.

Proof.

R_A(H_X) = E sup_{||x||_2 ≤ λ} (1/n) Σ_{i=1}^n σ_i ⟨x, a_i⟩
         = E sup_{||x||_2 ≤ λ} ⟨x, (1/n) Σ_{i=1}^n σ_i a_i⟩
         ≤ (λ/n) E || Σ_{i=1}^n σ_i a_i ||_2                 (Cauchy-Schwarz)
         ≤ (λ/n) ( E || Σ_{i=1}^n σ_i a_i ||_2² )^{1/2}      (Jensen)
         = (λ/n) ( Σ_{i=1}^n ||a_i||_2² )^{1/2}
         ≤ λM/√n. □
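A Monte Carlo check of this theorem: for the ball {||x||_2 ≤ λ}, the supremum has the closed form λ ||(1/n) Σ_i σ_i a_i||_2, so the empirical Rademacher complexity can be estimated directly and compared with λM/√n. The data and constants below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam, trials = 200, 10, 2.0, 5_000

A = rng.standard_normal((n, p))
A /= np.maximum(np.linalg.norm(A, axis=1, keepdims=True), 1.0)   # enforce max_i ||a_i||_2 <= 1
M = np.linalg.norm(A, axis=1).max()

# Closed-form sup over the ball: sup_{||x|| <= lam} <x, v> = lam * ||v||_2 with v = (1/n) sum_i sigma_i a_i.
sigma = rng.choice([-1.0, 1.0], size=(trials, n))
sup_values = lam * np.linalg.norm(sigma @ A / n, axis=1)

print("Monte Carlo estimate of R_A(H_X):", sup_values.mean())
print("bound lam * M / sqrt(n):         ", lam * M / np.sqrt(n))
```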

Rademacher complexity estimates of fully connected Neural Networks

Notation
For a matrix X ∈ R^{n×m}, ||X|| denotes its spectral norm. Let X_{:,k} be the k-th column of X. We define

||X||_{2,1} = || ( ||X_{:,1}||_2, ..., ||X_{:,m}||_2 ) ||_1.   (2)

Theorem (Spectral bound [1])
For positive integers p_0, p_1, ..., p_d = 1, and positive reals λ_1, ..., λ_d, ν_1, ..., ν_d, define the set

X := { (X_1, ..., X_d) : X_i ∈ R^{p_i × p_{i−1}}, ||X_i|| ≤ λ_i, ||X_i^T||_{2,1} ≤ ν_i }.

Let H_X be the class of neural networks h_x : R^{p_0} → R, h_x = X_d ◦ σ ◦ ... ◦ σ ◦ X_1, where x = (X_1, ..., X_d) ∈ X. Suppose that σ is 1-Lipschitz. Let A = {a_1, ..., a_n} ⊆ R^{p_0}, M := max_{i=1,...,n} ||a_i|| and W := max{p_i : i = 0, ..., d}.

The Rademacher complexity of H_X with respect to A is bounded as

R_A(H_X) = O( (log(W) M / √n) · Π_{i=1}^d λ_i · ( Σ_{j=1}^d ν_j^{2/3} / λ_j^{2/3} )^{3/2} ).   (3)

How well do complexity measures correlate with generalization?

name | definition | correlation¹
Frobenius distance to initialization [7] | Σ_{i=1}^d ||X_i − X_i^0||_F² | −0.263
Spectral complexity² [1] | Π_{i=1}^d ||X_i|| · ( Σ_{i=1}^d ||X_i||_{2,1}^{2/3} / ||X_i||^{2/3} )^{3/2} | −0.537
Parameter Frobenius norm | Σ_{i=1}^d ||X_i||_F² | 0.073
Fisher-Rao [5] | ((d+1)/n) Σ_{i=1}^n ⟨x, ∇_x ℓ(h_x(a_i), b_i)⟩ | 0.078
Path-norm [8] | Σ_{(i_0,...,i_d)} Π_{j=1}^d X_{i_j, i_{j−1}}² | 0.373

Table: Complexity measures compared in the empirical study [4], and their correlation with generalization

Complexity measures are still far from explaining generalization in Deep Learning!

¹ Kendall's rank correlation coefficient. ² The definition in [4] differs slightly.
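A sketch computing three of these measures for a hypothetical fully connected network with random weights; the spectral-complexity helper evaluates only the product/sum expression from the table (without the margin or other refinements used in [4]), and the path-norm uses the standard trick of pushing the all-ones vector through the network with squared weights. All helper names and sizes below are illustrative.

```python
import numpy as np

def spectral_complexity(weights):
    """prod_i ||X_i|| * ( sum_i (||X_i||_{2,1} / ||X_i||)^{2/3} )^{3/2}"""
    spec = [np.linalg.norm(X, ord=2) for X in weights]               # spectral norms ||X_i||
    two_one = [np.linalg.norm(X, axis=0).sum() for X in weights]     # ||X_i||_{2,1}: sum of column 2-norms
    corr = sum((t / s) ** (2.0 / 3.0) for t, s in zip(two_one, spec)) ** 1.5
    return np.prod(spec) * corr

def frobenius_distance_to_init(weights, weights_init):
    """sum_i ||X_i - X_i^0||_F^2"""
    return sum(np.linalg.norm(X - X0) ** 2 for X, X0 in zip(weights, weights_init))

def path_norm(weights):
    """Sum over all paths of products of squared weights: forward pass of the
    all-ones vector through the network with squared weights (no biases, no activation)."""
    v = np.ones(weights[0].shape[1])
    for X in weights:
        v = (X ** 2) @ v
    return v.sum()

# Hypothetical 3-layer network (widths 20 -> 16 -> 8 -> 1), with "init" and "trained" weights.
rng = np.random.default_rng(0)
dims = [20, 16, 8, 1]
init = [rng.standard_normal((dims[i + 1], dims[i])) / np.sqrt(dims[i]) for i in range(3)]
trained = [X + 0.1 * rng.standard_normal(X.shape) for X in init]

print("spectral complexity:        ", spectral_complexity(trained))
print("Frobenius distance to init: ", frobenius_distance_to_init(trained, init))
print("path norm:                  ", path_norm(trained))
```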

Wrap up!

◦ Deep learning recitation on Friday!

References I

[1] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6240–6249. Curran Associates, Inc., 2017.
[2] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
[3] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
[4] Yiding Jiang*, Behnam Neyshabur*, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In International Conference on Learning Representations, 2020.
[5] Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. Fisher-Rao metric, geometry, and complexity of neural networks. Volume 89 of Proceedings of Machine Learning Research, pages 888–896. PMLR, 16–18 Apr 2019.

References II

[6] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2nd edition, 2018.
[7] Vaishnavh Nagarajan and J. Zico Kolter. Generalization in Deep Networks: The Role of Distance from Initialization. arXiv e-prints, page arXiv:1901.01672, January 2019.
[8] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401, 2015.
