The Sample Complexity in Data(Sample)-Driven Optimization

Yinyu Ye, joint work with many others

Department of Management Science and Engineering, and ICME, Stanford University

Outline

1. Sample Complexity in RL and Stochastic Games
   • Reinforcement learning and MDPs
   • The sample-complexity problem in RL
   • The near sample-optimal algorithm
   • Variance reduction
   • Monotonicity analysis
2. Sample Average Approximation with a Sparsity-Inducing Penalty for High-Dimensional Stochastic Programming/Learning
   • Regularized sample average approximation (RSAA)
   • Theoretical generalizations
   • Theoretical applications:
     • High-dimensional statistical learning
     • Deep neural network learning

I. What Is Reinforcement Learning?

• Use samples/experiences to control an unknown system

[Diagram: the agent sends actions to an unknown system and receives state observations and rewards]

Examples: a maze with random punishment/bonus; playing games; learning to drive
Samples/experiences: previous observations

[Plot: median human-normalized performance over 57 Atari games]

Reinforcement learning achieves phenomenal empirical successes

[Hessel et al. ’17]

What if samples/trials are costly and limited?

Examples: precision medicine, biological experiments, crowdsourcing

Sample size/complexity is essential!

“What If” Model: Markov Decision Process

• States: S; Actions: A
• Reward: r(s, a)
• State transition: P(s′|s, a)
• Policy: π (possibly random)
• Effective horizon: H = (1 − γ)^{−1} for discount factor γ
• Optimal policy & value: when the model is known, solve via linear programming
• ε-optimal policy π: its value is within ε of the optimal value

Examples
• Three-state MDP [diagram: states 1, 2, 3; rewards +1 and +0.999; actions L and R]
  − Optimal policy: π*(1) = L
  − A 0.1-optimal policy: π(1) = R
• Pacman
  − How to get a 0.1-optimal policy?
  − Learn the policy from sample data
  − Short episodes

The Sample-Complexity Problem

How many samples are sufficient/necessary to learn a 0.1-optimal policy w.p. ≥ 0.9?

Data/sample: transition examples from every (s,a)

• Ω((1 − γ)^{−3}/ε²) samples per (s,a) are necessary [Azar et al. (2013)]!
• No previous algorithm achieves this sample complexity
• Sufficient side: what is the best way of learning a policy?

Empirical Risk Minimization and Value Iteration

❑ Collect m samples for each (s, a)
❑ Construct an empirical model P̂
❑ Solve the empirical MDP via linear programming, or simply value iteration:

  v(s) ← max_a [ r(s,a) + γ Σ_{s′} P̂(s′|s,a) v(s′) ] for all s

Sample-First and Solve-Second
• Simple and intuitive
• Space: model-based, stores all data
• Samples: [Azar et al. 2013]
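The sample-first, solve-second scheme can be sketched as follows (a minimal illustration, not the algorithm analyzed in the slides; `sample_next_state` stands in for the generative model of the unknown MDP, and all parameter values are illustrative):

```python
import numpy as np

def empirical_value_iteration(sample_next_state, r, n_states, n_actions,
                              gamma=0.9, m=100, iters=200, seed=0):
    """Sample-first, solve-second: build an empirical MDP from m
    transition samples per (s, a), then run value iteration on it."""
    rng = np.random.default_rng(seed)
    # Empirical transition model P_hat[s, a, s'] from m samples per (s, a)
    P_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(m):
                P_hat[s, a, sample_next_state(s, a, rng)] += 1.0 / m
    # Value iteration: v(s) <- max_a [ r(s,a) + gamma * sum_s' P_hat(s'|s,a) v(s') ]
    v = np.zeros(n_states)
    for _ in range(iters):
        v = np.max(r + gamma * P_hat @ v, axis=1)
    policy = np.argmax(r + gamma * P_hat @ v, axis=1)
    return v, policy
```

On a small deterministic MDP the empirical model is exact, and the routine recovers the optimal policy.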

Iterative Q-learning: Solving while Sampling

• Q(s,a): quality of (s,a), initialized to 0
• Each update replaces the expected future value with an empirical estimator (approximate DP)
• Converging to Q*
• After enough iterations: …
• #Samples for a 0.1-optimal policy w.h.p.: …

Why Iterative Q-Learning Fails

[Diagram: Q1 → Q2 → … → Q5, each step “a little closer” to Q*, using an empirical estimator of the expected future value]

• Runs for R = (1 − γ)^{−1} iterations
• Each iteration uses the same number of new samples
• Too many samples for too little improvement

Summary of Previous Methods

How many samples per (s,a) are sufficient/necessary to obtain a 0.1-optimal policy w.p. > 0.9?

Previous Algorithms        | #Samples/(s,a) | Reference
Phased Q-learning          | 10^14          | Kearns & Singh, 1999
ERM                        | 10^10          | Azar et al., 2013
Randomized LP              | 10^8           | Wang, 2017
Randomized Value Iteration | 10^8           | Sidford et al., ICML 2018

Previous best algorithm: … — sufficient?
Information-theoretic lower bound: … — necessary?

The Near Sample-Optimal Algorithm

Theorem (informal):

For an unknown discounted MDP with model size …, we give an algorithm (vQVI) that:
• recovers w.h.p. a 0.1-optimal policy with … samples per (s,a)
• Time: proportional to #samples
• Space: model-free

The first sample-optimal* algorithm for learning a policy of a DMDP.

[Sidford, Wang, Wu, Yang, and Ye, NeurIPS (2018)]

Analysis Overview

[Diagram:
  Q-learning + variance reduction (utilizing the previous estimate; Bernstein) → variance-reduced Q-learning;
  monotonicity analysis → guarantees a good policy;
  Bernstein inequality + law of total variance → precise control of error]

Variance Reduction

• Original Q-learning: each iteration computes a fresh estimate that gets “a little closer” to Q*; the estimate is computed many times, each time from a small number of samples
• Use milestone points to reduce the samples used: the expensive estimate is computed once and reused for many iterations

Algorithm: vQVI (sketch)

❑ Check-point: a large number of samples; log(ε^{−1}) epochs
❑ Inner iterations: a small number of samples; H = (1 − γ)^{−1} iterations per epoch

❑ Each epoch makes a “big improvement”
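The check-point idea can be sketched as follows (a simplified illustration of the variance-reduction principle only, not the full vQVI algorithm — the monotonicity techniques and Bernstein-based minibatch sizes are omitted; `sample_next_state` and all batch sizes are illustrative):

```python
import numpy as np

def vr_q_iteration(sample_next_state, r, n_states, n_actions, gamma=0.9,
                   m_big=200, m_small=5, epochs=5, inner=20, seed=0):
    """Variance-reduced value/Q iteration (sketch): at each check-point,
    estimate E[v(s')] once with a large sample; inner iterations reuse it
    and only re-estimate the low-variance correction E[v(s') - v_anchor(s')]
    from small samples."""
    rng = np.random.default_rng(seed)
    v = np.zeros(n_states)
    for _ in range(epochs):
        v_anchor = v.copy()
        # Check-point: one expensive estimate of E[v_anchor(s')] per (s, a)
        g_anchor = np.empty((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                sps = [sample_next_state(s, a, rng) for _ in range(m_big)]
                g_anchor[s, a] = np.mean([v_anchor[sp] for sp in sps])
        # Inner loop: cheap corrections, reusing g_anchor for many iterations
        for _ in range(inner):
            Q = np.empty((n_states, n_actions))
            for s in range(n_states):
                for a in range(n_actions):
                    sps = [sample_next_state(s, a, rng) for _ in range(m_small)]
                    corr = np.mean([v[sp] - v_anchor[sp] for sp in sps])
                    Q[s, a] = r[s, a] + gamma * (g_anchor[s, a] + corr)
            v = Q.max(axis=1)
    return Q.max(axis=1), Q.argmax(axis=1)
```

The correction term v(s′) − v_anchor(s′) shrinks as the epoch progresses, which is why small minibatches suffice between check-points.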

Compare to Q-Learning

❑ Q-learning variants: Q1 → Q2 → … → Q5, each step uses a fresh minibatch; #samples ≈ (minibatch size) × R
❑ Variance-reduced Q-learning: #samples ≈ a single minibatch

Precise Control of Error Accumulation

• Each modified Q-learning step has estimation error ε_i
• The errors accumulate over iterations
• Bernstein inequality + law of total variance:

  Σ_j ε_{i+j} ∼ (total variance of the MDP)/m ≤ (1 − γ)^{−3}/m

The total variance is the intrinsic complexity of the MDP.

Extension to Two-Person Stochastic Games

• Fundamentally different computation classes, even in the discounted case.

[Diagram: in RL, a single MAX layer over actions a1, a2, a3; in a game, MIN and MAX layers alternate over the leaf values 3, 12, 2, 4, 6, 14, 2]

• But they share the same sample-complexity lower bound in the discounted case!

[Sidford, Wang, Yang, Ye (2019)]: … samples per (s,a) to learn a 0.1-optimal strategy w.h.p.

II. Stochastic Programming

• Consider a Stochastic Programming (SP) problem

  min { F(x) := E_ℙ f(x, W) : x ∈ ℝ^p, 0 ≤ x ≤ R }

• f is deterministic. For every feasible x, the function f(x, ·) is measurable and E_ℙ f(x, W) is finite. 𝒲 is the support of W.
• Commonly solved via Sample Average Approximation (SAA), a.k.a. the Monte Carlo (sampling) method.

Sample Average Approximation

• To solve x_true ∈ argmin { F(x) : 0 ≤ x ≤ R }
• Solve instead the approximation problem
    min { F̂_n(x) := n^{−1} Σ_{i=1}^n f(x, W_i) : x ∈ ℝ^p, 0 ≤ x ≤ R }
• W₁, W₂, …, W_n is a sequence of samples of W
• Simple to implement
• Often tractably computable
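A minimal SAA sketch on a hypothetical newsvendor instance (the instance, the grid minimization, and all names are illustrative, not from the slides): order x at unit cost 1 and sell min(x, demand) at price 3, so the minimizer of E[f(x, W)] is the 2/3 quantile of demand.

```python
import numpy as np

def saa_solve(f, sample_W, n=1000, grid=None, seed=0):
    """Sample Average Approximation: replace E[f(x, W)] by the empirical
    mean over n i.i.d. samples, then minimize (here, over a grid)."""
    rng = np.random.default_rng(seed)
    W = sample_W(rng, n)                    # samples W_1, ..., W_n
    F_hat = lambda x: np.mean(f(x, W))      # F_hat_n(x) = (1/n) sum_i f(x, W_i)
    return min(grid, key=F_hat)

# Illustrative newsvendor cost: f(x, w) = x - 3 * min(x, w)
f = lambda x, w: x - 3.0 * np.minimum(x, w)
sample_W = lambda rng, n: rng.uniform(0.0, 1.0, size=n)   # demand ~ U(0, 1)
x_saa = saa_solve(f, sample_W, n=5000, grid=np.linspace(0.0, 1.0, 101))
```

With U(0, 1) demand, the true optimum is the 2/3 quantile, so x_saa should land near 2/3 for a moderate sample size.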

SAA is a popular way of solving SP

Equivalence between SAA and M-Estimation

Stochastic Programming                    | Statistical Learning
SAA                                       | model fitting for M-estimation
RSAA                                      | regularized M-estimation
suboptimality gap ε = F(·) − F(x_true)    | excess risk
minimizer x_true                          | true parameters x_true
objective function                        | statistical loss

Efficacy of SAA

• Assumptions:

• (a) The sequence f(x, W₁), f(x, W₂), …, f(x, W_n) is i.i.d.
• (b) Subgaussian (can be generalized to subexponential)
• (c) Lipschitz-like condition

• Number n of samples required to achieve ε accuracy with probability 1 − α in solving a p-dimensional problem:
  ℙ[ F(x_SAA) − F(x_true) ≤ ε ] ≥ 1 − α
• It suffices that n satisfies
  n ≳ (p/ε²)·ln(1/ε) + (1/ε²)·ln(1/α)

Sample size n grows polynomially as the number of dimensions p increases

Shapiro et al., 2009; Shapiro, 2003; Shapiro and Xu, 2008

High-Dimensional SP

• What if p is large, but acquiring new samples is prohibitively costly?
• What if most dimensions are (approximately) ineffective, e.g., x_true = [80; 131; 100; 0; 0; …; 0]?
• Consider a problem with p = 100,000
  • SAA requires n ≈ 100,000 – 10,000,000
  • If only it were known which dimensions (e.g., three of them) are nonzero, the original problem could be reduced to p = 3, requiring only n ≈ 10 – 1,000

Conventional SAA is not effective in solving high-dimensional SP

Convex and Nonconvex Regularization Schemes

RSAA formulation:
  min { F_{n,λ}(x) := F̂_n(x) + Σ_{j=1}^p P_λ(|x_j|) : 0 ≤ x ≤ R }

• ℓ_q-norm (0 < q ≤ 1, or q = 2): P_λ(t) := λ·t^q — biased or non-sparse
• ℓ_0-norm: P_λ(t) := λ·𝟙(t ≠ 0) — not continuous
• Folded concave penalty (FCP), in the form of the minimax concave penalty:
  P_λ(t) := ∫_0^t [(aλ − s)_+ / a] ds

[Plot of P_λ(t): ℓ_q-norm with q = 0.5; ℓ_0-norm with λ = 1; FCP with λ = 2, a = 0.5]

FCP entails unbiasedness, sparsity, and continuity

Tibshirani, 1996, J R Stat Soc B; Frank and Friedman, 1993, Technometrics; Fan and Li, 2001, JASA; Zhang, 2010, Ann. Stat.; Raskutti et al., 2011, IEEE Trans. Inf. Theory; Bian, Chen, Y, 2011, Math. Program.
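The FCP integral has a closed form; a small sketch to make the unbiasedness claim concrete (the parameter values match the slide's plot, and the helper names are mine):

```python
import numpy as np

def fcp_penalty(t, lam, a):
    """Folded concave penalty P_lam(t) = integral_0^{|t|} (a*lam - s)_+ / a ds.
    Closed form: lam*|t| - t^2/(2a) for |t| <= a*lam, and a*lam^2/2 beyond,
    so the marginal penalty vanishes for large |t| (near-unbiasedness)."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= a * lam, lam * t - t ** 2 / (2.0 * a), a * lam ** 2 / 2.0)

def l1_penalty(t, lam):
    """The l1 (LASSO) penalty keeps growing linearly -- the source of its bias."""
    return lam * np.abs(t)
```

With the plot's λ = 2, a = 0.5, fcp_penalty caps at aλ²/2 = 1 for |t| ≥ aλ = 1, while l1_penalty keeps growing with |t|.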

Existing Results on Convex Regularization

• Lasso (Tibshirani, 1996), compressive sensing (Donoho, 2006), and the Dantzig selector (Candes & Tao, 2007):
  • Pros:
    • Computationally tractable
    • Generalization error and oracle inequalities (e.g., ℓ1-, ℓ2-loss): Bunea et al. (2007), Zou (2006), Bickel et al. (2009), Van de Geer (2008), Negahban et al. (2010), etc.
  • Cons:
    • Requirement of RIP, restricted eigenvalue (RE), or RSC
    • Absence of the (strong) oracle property: Fan and Li (2001)
    • Biased: Fan and Li (2001)
  • Remark:
    • Excess risk bounds are available beyond RIP, RE, or RSC (e.g., Zhang et al., 2017)

Existing Results on (Folded) Concave Penalty

• SCAD: Fan and Li (2001); MCP: Zhang (2010)
  • Pros:
    • (Nearly) unbiased: Fan and Li (2001), Zhang (2010)
    • Generalization error (ℓ1-, ℓ2-loss) and oracle inequalities: Loh and Wainwright (2015), Wang et al. (2014), Zhang & Zhang (2012), etc.
    • Oracle and strong oracle property: Fan & Li (2001), Fan et al. (2014)
  • Cons:
    • Strongly NP-hard; requirement of RSC, except for the special case of linear regression
  • Remarks:
    • Chen, Fu and Y (2010): threshold lower-bound property at every local solution
    • Zhang and Zhang (2012): global optimality results in good statistical quality
    • Fan and Lv (2011): a sufficient condition results in desirable statistical properties
    • Though finding a global solution is strongly NP-hard (Chen et al., 2017), theories on local schemes are available: Loh & Wainwright (2015), Wang, Liu, & Zhang (2014), Wang, Kim & Li (2013), Fan et al. (2014), Liu et al. (2017)

The S3ONC Solutions Admit an FPTAS

Assumption (d): f(·, z) is continuously differentiable for a.e. z ∈ 𝒲, and

  | ∂F̂_n/∂x_j evaluated at x̃ + δe_j − ∂F̂_n/∂x_j evaluated at x̃ | ≤ U_L · |δ|

• First-order necessary condition (FONC): the solution x* ∈ [0, R]^p satisfies
  ⟨ ∇F̂_n(x*) + ( P′_λ(x*_j) : 1 ≤ j ≤ p ), x − x* ⟩ ≥ 0 for all x ∈ [0, R]^p
• Significant-subspace second-order necessary condition (S3ONC): FONC holds and, for all j = 1, …, p with x*_j ∈ (0, aλ), it holds that
  U_L + P″_λ(x*_j) ≥ 0 or ∇²F̂_{n,jj}(x*) + P″_λ(x*_j) ≥ 0
• S3ONC is weaker than the canonical second-order KKT condition
• Computable within pseudo-polynomial time

Y (1998), Liu et al. (2017), Haeser et al. (2018), Liu, Y (2019) https://arxiv.org/abs/1903.00616

Result 1: Efficacy of RSAA at a Global Minimum

THEOREM 1
Assumptions: (a)–(d).
Consider a global minimal solution x* to RSAA. If the sample size n satisfies
  n ≳ (s^{2.5}/ε³)·ln^{1.5}(p/ε) + (1/ε²)·ln(1/α),
then ℙ[ F(x*) − F(x_true) ≤ ε ] ≥ 1 − α, for any α > 0 and ε ∈ (0, 0.5).

• Better dependence on p than SAA's (p/ε²)·ln(1/ε) + (1/ε²)·ln(1/α)
• However, generating a global solution is generally computationally prohibitive

Result 2: Efficacy of RSAA at a Local Solution

THEOREM 2
Assumptions:
• (a)–(d)
• f(·, z) is convex for almost every z ∈ 𝒲
Consider an S3ONC solution x* to RSAA. If F_{n,λ}(x*) ≤ F_{n,λ}(0) a.s., and if the sample size n satisfies
  n ≳ (s^{2.5}/ε⁴)·ln²(p/ε) + (1/ε²)·ln(1/α),
then ℙ[ F(x*) − F(x_true) ≤ ε ] ≥ 1 − α, for any α > 0 and ε ∈ (0, 0.5).

• Better dependence on p than SAA's (p/ε²)·ln(1/ε) + (1/ε²)·ln(1/α)
• Much easier to solve than a global minimizer: pseudo-polynomial-time algorithms exist

Result 3: Efficacy of RSAA Under Additional Assumptions

THEOREM 3
Assumptions:
• (a)–(d)
• f(·, z) is convex for almost every z ∈ 𝒲
• F is strongly convex and differentiable (∗)
Consider an S3ONC solution x*. If F_{n,λ}(x*) ≤ F_{n,λ}(0) a.s., and if the sample size n satisfies
  n ≳ (s^{1.5}/ε³)·ln^{1.5}(p/ε) + (1/ε²)·ln(1/α),
then ℙ[ F(x*) − F(x_true) ≤ ε ] ≥ 1 − α, for any α > 0 and ε ∈ (0, 0.5).

• For some problems, while f(·, W) is only convex, F can be strongly convex, e.g., high-dimensional linear regression with a random design matrix
• Under the additional assumption (∗), Theorem 3 shows an improved sample complexity over Theorem 2

Result 4: Efficacy of RSAA Under Additional Assumptions

THEOREM 4
Assumptions:
• (a)–(d)
• f(·, z) is convex for almost every z ∈ 𝒲
• F is strongly convex and differentiable (∗)
• min { |x_true,j| : x_true,j ≠ 0, j = 1, …, p } ≥ Threshold (∗∗)
Consider an S3ONC solution x*. If F_{n,λ}(x*) ≤ F_{n,λ}(0) a.s., and if the sample size n satisfies
  n ≳ (s/ε²)·ln(p/ε) + (1/ε²)·ln(1/α),
then ℙ[ F(x*) − F(x_true) ≤ ε ] ≥ 1 − α, for any α > 0 and ε ∈ (0, 0.5).

Under the additional assumptions (∗) and (∗∗), no compromise on the rate in ε compared to SAA's (p/ε²)·ln(1/ε) + (1/ε²)·ln(1/α)

Numerical Experiments

[Plots]
• Fix sample size n = 100; increase dimensionality
• Fix dimension p = 100; increase sample size

Liu, Wang, Yao, Li, Ye, 2018, Math. Program.

Generalizations and Applications

• Theoretical Generalizations

• From subgaussian distributions to subexponentiality
• From smooth to nonsmooth cost functions

• Theoretical Applications

• High-dimensional statistical learning (HDSL)
• Deep neural network learning

Theoretical Generalizations

• Generalization from subgaussian distributions to subexponentiality
  Assumption (e): f(x, W) is subexponential for all x ∈ [0, R]^p
• Generalization from a smooth cost function f to a non-smooth one
  Assumption (f):
    f(x, W) = f₁(x, W) + max_{u ∈ ℋ} { u^⊤ A(W) x − φ̂(u, W) },
  where A(W) ∈ ℝ^{m×p} is a linear operator, φ̂ is a measurable, deterministic function, f₁(·, W) is continuously differentiable with Lipschitz continuous gradient, and ℋ ⊂ ℝ^m is a convex and compact set.

Result 5: Generalization to Subexponentiality

x^{ℓ1} := argmin_{0 ≤ x ≤ R} F̂_n(x) + λ Σ_{j=1}^p |x_j|

THEOREM 5
Assumptions: (a), (c), (d); (e) subexponentiality; f₁(·, z) is convex for almost every z ∈ 𝒲.
Consider an S3ONC solution x*. If F_{n,λ}(x*) ≤ F_{n,λ}(x^{ℓ1}) a.s., and if the sample size n satisfies
  n ≳ (s³/ε³)·( ln^{1.5}(p/ε) + ln(p/α) + ln³(1/α) ),
then ℙ[ F(x*) − F(x_true) ≤ ε ] ≥ 1 − α, for any α > 0 and ε ∈ (0, 0.5).

• Better dependence on ε than Theorem 2's (s^{2.5}/ε⁴)·ln²(p/ε) + (1/ε²)·ln(1/α)
• The desired solution can be generated by solving for an S3ONC solution initialized at x^{ℓ1}

Result 6: Generalization to Non-smoothness

x^{ℓ1} := argmin_{0 ≤ x ≤ R} F̂_n(x) + λ Σ_{j=1}^p |x_j|

THEOREM 6
Assumptions: (a), (c); (e) subexponentiality; (f); f(·, z) is convex for almost every z ∈ 𝒲.
Consider an S3ONC solution x*. If F_{n,λ}(x*) ≤ F_{n,λ}(x^{ℓ1}) a.s., and if the sample size n satisfies
  n ≳ (s⁴/ε⁴)·( ln²(p/ε) + ln(p/α) + ln²(1/α) ),
then ℙ[ F(x*) − F(x^{min}) ≤ ε ] ≥ 1 − α, for any α > 0 and ε ∈ (0, 0.5).

• Assumption (d) is replaced by (f)
• The desired solution can be generated by solving for an S3ONC solution initialized at x^{ℓ1}

Theoretical Application 1: High-Dimensional M-Estimation

• Given knowledge of a statistical loss function f, and hypothetically
    x_true ∈ argmin { F(x) := E f(x, W) }
  x_true: true parameters governing the data-generating model
• Estimate x_true ∈ ℝ^p given only f and samples W₁, W₂, …, W_n
• Assume that f(x, W_i), i = 1, …, n, are i.i.d. subexponential
• Traditional approach: minimize the empirical risk to obtain an unbiased estimator
    x̂ = argmin F̂_n(x) := n^{−1} Σ_{i=1}^n f(x, W_i)
• Under the high-dimensional setting, p ≫ n
• Consider folded concave penalized (FCP) learning:
    min F_{n,λ}(x) := n^{−1} Σ_{i=1}^n f(x, W_i) + Σ_{j=1}^p P_λ(|x_j|)

Result 7: Learning Under Approximate Sparsity

• Approximate sparsity: existence of x̃ such that
    F(x̃) − F(x_true) ≤ ε̂ and s := ‖x̃‖₀ ≪ p

THEOREM 7

For any Γ ≥ 0, consider an S3ONC solution x* with F_{n,λ}(x*) ≤ min_x F_{n,λ}(x) + Γ, a.s. Assume (a) subexponentiality; (b) Lipschitz-like continuity of F and ∇F; (c) approximate sparsity. Then

  excess risk of x* ≤ Õ(1) · ( s ln p / n^{1/3} + (Γ + ε̂)/n^{1/3} ) + Γ + ε̂

with probability 1 − O(1)·exp(−O(1)·n^{1/3} ln p)

• First explication: more general assumptions than sparsity
• Γ can be interpreted as the suboptimality gap of x*
• No convexity or RSC assumed

Removing Γ via Wise Initialization for Convex Loss

Step 1: Solve the LASSO program
  x^{ℓ1} := argmin_x F̂_n(x) + λ Σ_{j=1}^p |x_j|
Step 2: Solve for an S3ONC stationary point initialized at x^{ℓ1}
  • A modified first-order method, ADMM

Algorithm-independent, fully polynomial-time approximation scheme: complexity ≤ Poly(p, desired accuracy)
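The two-step scheme can be sketched on its simplest instance, sparse linear regression (a hypothetical setup; the proximal-gradient solvers and the MCP proximal step below are illustrative choices, not the authors' exact method):

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding: proximal operator of t*|.| (the LASSO prox)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def firm(z, t, a):
    """Firm thresholding: proximal operator of the MCP-type penalty
    t*|x| - x^2/(2a) (identity beyond |z| = a*t); requires a > 1."""
    return np.where(np.abs(z) <= a * t, soft(z, t) * a / (a - 1.0), z)

def prox_grad(X, y, lam, prox, x0, steps=500):
    """Proximal gradient on 0.5*||Xx - y||^2/n + penalty."""
    n = len(y)
    L = max(np.linalg.eigvalsh(X.T @ X / n).max(), 1e-12)  # grad. Lipschitz const.
    x = x0.copy()
    for _ in range(steps):
        g = X.T @ (X @ x - y) / n
        x = prox(x - g / L, lam, L)
    return x

def fcp_via_lasso_init(X, y, lam, a=3.0):
    """Step 1: solve the LASSO program. Step 2: run a local method on the
    folded concave penalized problem, warm-started at the LASSO solution."""
    lasso_prox = lambda z, lam, L: soft(z, lam / L)
    mcp_prox = lambda z, lam, L: firm(z, lam / L, a * L)   # prox of (1/L)*MCP
    x_l1 = prox_grad(X, y, lam, lasso_prox, np.zeros(X.shape[1]))
    x_fcp = prox_grad(X, y, lam, mcp_prox, x_l1)
    return x_l1, x_fcp
```

On a toy sparse instance with p > n, the FCP refinement typically keeps the LASSO support while removing the λ-sized shrinkage bias from the large coefficients.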

Result 8: FCP+LASSO

THEOREM 8

Consider an S3ONC solution x* with F_{n,λ}(x*) ≤ F_{n,λ}(x^{ℓ1}), a.s. Assume (a) subexponentiality; (b) Lipschitz-like continuity of F and ∇F; (c) sparsity (i.e., ε̂ = 0 in the approximate-sparsity assumption). Then

  excess risk of x* ≤ Õ(1) · ( s‖x_true‖_∞ ln p / n^{1/3} + s ln p / n^{2/3} )

with probability 1 − O(1)·exp(−O(1)·n^{1/3} ln p)

A pseudo-polynomial-time computable theory without RSC

Comparing This Research with Existing Results

• New insights on local solutions for the folded concave penalty
• Pros:
  • No requirement of RSC, RIP, RE, or alike
  • Applicable to non-smooth learning
  • Applicable to high-dimensional stochastic programming
  • Generalization error bounds on excess risk
• Cons:
  • Slower rate than alternative results that require RSC

Under RSC (e.g., Loh & Wainwright 2015): O(1/n)
Without RSC, linear regression only (e.g., Zhang et al., 2017): O(1/n)
This research, M-estimation without RSC: O(1/n^{1/3})

Theoretical Application 2: Deep Learning

Data generation: for some unknown g: ℝ^d → ℝ,
  b_i = g(a_i) + w_i, i = 1, …, n
Traditional training model:
  min_x (1/2n) Σ_{i=1}^n ( b_i − F_NN(a_i, x) )²

• Expressive power: Yarotsky (2017), Mhaskar (1996), DeVore et al. (1989), Mhaskar and Poggio (2016)
• Two-layer networks, polynomial increase in dimensions, or sometimes exponential growth in the number of layers: Arora et al. (2018), Barron and Klusowski (2018), Bartlett et al. (2017), Golowich et al. (2017), Neyshabur et al. (2015, 2017, 2018)

Scarcity of theoretical guarantees on generalization performance under over-parameterization

Approximate Sparsity and Regularization in Neural Networks

• Proposed training model: find an S3ONC solution to the formulation below
    min_x (1/2n) Σ_{i=1}^n ( b_i − F_NN(a_i, x) )² + Σ_{j=1}^p P_λ(|x_j|)
• The expressive power of NNs implies approximate sparsity
  • E.g., a ReLU network with c·n^{1/3}·(ln n^{3d/r} + 1)-many weights approximates any r-th order weakly differentiable function with accuracy O(1)·n^{−r/(3d)} (Yarotsky, 2017)
  • That is, s = c·n^{1/3}·(ln n^{3d/r} + 1) and ε̂ = O(1)·n^{−r/(3d)} in the approximate-sparsity assumption
• Empirical evidence for sparse neural networks
  • E.g., Dropout: Srivastava et al. (2014); DropConnect: Wan et al. (2013); deep compression (remove links with small weights and retrain): Han et al. (2016)

Neural networks may not obey RSC, but they innately entail approximate sparsity

Generalization Performance of Deep Learning in Special Case 1

For g: ℝ^d → ℝ with an r-th order weak derivative;
assume the inputs to the neurons have a continuous density in a neighborhood of 0.

  Excess risk ≤ Õ(1) · ( 1/n^{1/3} + 1/n^{1/6 + r/(6d)} + 1/n^{r/(3d)} ) · ln p + Γ/n^{1/3} + Γ

Poly-logarithmic in dimensionality

Generalization Performance of Deep Learning in Special Case 2

For g: ℝ^d → ℝ a polynomial function:

  Excess risk ≤ Õ( ln p / n^{1/3} + Γ/n^{1/3} ) + Γ

Poly-logarithmic in dimensionality

Numerical Experiment 1: MNIST

Handwriting recognition problem:

• 28 x 28-pixel images of handwritten digits

• 60,000 training samples

• 10,000 test samples

Numerical Experiment 1: MNIST

• 800 × 800 + softmax

• Fully connected

• No preprocessing/distortion Numerical Experiment 1: MNIST

• 500 × 150 × 10 × … × 10 × 150 + softmax
• Fully connected
• No preprocessing/distortion

Numerical Experiment 2: Matrix Completion

p = 100 × 100 = 10,000; n = 2,000

Ratings of customers for products/restaurants/movies often constitute an incomplete matrix

Numerical Experiment 2: Matrix Completion

Method | Sub. (Lasso) | Nuclear norm | S3ONC
Error  | 85.7699      | 0.1072       | 0.0005

p = 100 × 100 = 10,000; n = 2,000

Summary

• Sample-complexity analysis is an essential problem in optimization driven by data and in high-dimensional/deep learning
• Theories can be developed to match a sample-size upper bound to the information-theoretic lower bound, as in RL via smart sampling
• Based on the structure of problems/solutions, adding certain regularizations can greatly reduce the sample complexity; e.g., RSAA achieves poly-logarithmic sample complexity at any S3ONC solution under standard statistical and sparsity assumptions
• Although the RSAA problem becomes nonconvex when the regularization is concave, an S3ONC solution can be efficiently computed, in both theory and practice
• Future research: further theoretical developments and more comprehensive numerical results