
The Sample Complexity in Data (Sample)-Driven Optimization

Yinyu Ye, joint work with many others
Department of Management Science and Engineering and ICME, Stanford University

Outline

1. Sample Complexity in Reinforcement Learning and Stochastic Games
• Reinforcement learning and MDPs
• The sample-complexity problem in RL
• The near sample-optimal algorithm
• Variance reduction
• Monotonicity analysis

2. Sample Average Approximation with Sparsity-Inducing Penalty for High-Dimensional Stochastic Programming/Learning
• Regularized sample average approximation (RSAA)
• Theoretical generalizations
• Theoretical applications:
  • High-dimensional statistical learning
  • Deep neural network learning

I. What Is Reinforcement Learning?

• Use samples/experiences to control an unknown system.
[Figure: an agent interacts with an unknown system, choosing actions and observing states and rewards.]
• Examples: a maze with random punishment/bonus; playing games; learning to drive.
• Samples/experiences: previous observations.
• Reinforcement learning achieves phenomenal empirical successes.
[Figure: median performance over 57 Atari games compared with human level, from Hessel et al. '17.]

What if samples/trials are costly and limited?

• Precision medicine, biological experiments, crowd sourcing.
• Sample size/complexity is essential!

"What If" Model: Markov Decision Process

• States 푠 ∈ 풮; actions 푎 ∈ 풜.
• Reward: 푟(푠, 푎).
• State transition: 푃(푠′|푠, 푎).
• Policy 휋 (possibly randomized); discount factor 훾; effective horizon 퐻 = (1 − 훾)^{−1}.
• Optimal policy 휋* and value 푣*: when the model is known, solve via linear programming.
• 휖-optimal policy 휋: a policy whose value is within 휖 of the optimal value.

Examples

• Three-state MDP.
[Figure: a three-state MDP in which action 퐿 at state 1 earns reward +1 and action 푅 earns reward +0.999.]
− Optimal policy: 휋*(1) = 퐿.
− A 0.1-optimal policy: 휋(1) = 푅.
• Pacman.
− How to get a 0.1-optimal policy?
− Learn the policy from sample data.
− Short episodes.

The Sample-Complexity Problem

For an unknown MDP, how many samples are sufficient/necessary to learn a 0.1-optimal policy w.p. ≥ 0.9?

• Data/sample: transition examples from every (푠, 푎).
• Roughly (1 − 훾)^{−3}/휖² samples per (푠, 푎) are necessary [Azar et al. (2013)]!
• No previous algorithm achieves this sample complexity.
• Sufficient side: what is the best way of learning a policy?

Empirical Risk Minimization and Value Iteration

❑ Collect 푚 samples for each (푠, 푎).
❑ Construct an empirical model.
❑ Solve the empirical MDP via linear programming, or simply value iteration:
  푣(푠) ← max_푎 [푟(푠, 푎) + 훾 ∑_{푠′} 푃(푠′|푠, 푎) 푣(푠′)] for all 푠.

Sample-first and solve-second:
• Simple and intuitive.
• Space: model-based, stores all data.
• Samples per (푠, 푎): analyzed in [Azar et al., 2013].

Iterative Q-Learning: Solving While Sampling

• 푄(푠, 푎): quality of (푠, 푎), initialized to 0.
• Each approximate-DP update uses an empirical estimator of the expected future value, converging to 푄*.
• After enough iterations, the greedy policy is near optimal.
• #Samples for a 0.1-optimal policy w.h.p.: still far above the lower bound (see the summary table below).

Why Iterative Q-Learning Fails

[Figure: iterates 푄1, 푄2, 푄3, 푄4, 푄5, each only "a little closer" to 푄*, each built from a fresh empirical estimator of the expected future value.]
• Runs for 푅 = (1 − 훾)^{−1} iterations.
• Each iteration uses the same number of new samples: too many samples for too little improvement.

Summary of Previous Methods

How many samples per (푠, 푎) are sufficient/necessary to obtain a 0.1-optimal policy w.p. > 0.9?

Previous algorithm        #Samples/(s,a)   Reference
Phased Q-learning         10^14            Kearns & Singh, 1999
ERM                       10^10            Azar et al., 2013
Randomized LP             10^8             Wang, 2017
Randomized value iter.    10^8             Sidford et al., 2018

• Previous best algorithm: 10^8 samples per (푠, 푎). Is this many necessary? The information-theoretic lower bound is smaller. Is the lower bound itself sufficient?

The Near Sample-Optimal Algorithm

Theorem (informal): For an unknown discounted MDP with finitely many states and actions, we give an algorithm (vQVI) that
• recovers w.h.p. a 0.1-optimal policy with a number of samples per (푠, 푎) that matches the information-theoretic lower bound up to logarithmic factors;
• runs in time proportional to the number of samples;
• uses model-free space.

The first sample-optimal* algorithm for learning a policy of a DMDP.
[Sidford, Wang, Wu, Yang, and Ye, NeurIPS (2018)]
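To make the "sample-first and solve-second" baseline above concrete, here is a minimal Python sketch of empirical-model value iteration. It is only an illustration under simplifying assumptions, not the vQVI algorithm or the authors' code; `sample_transition(s, a)` is a hypothetical generative-model oracle returning a sampled next state and reward.

```python
import numpy as np

def empirical_value_iteration(sample_transition, n_states, n_actions,
                              m=1000, gamma=0.9, n_iters=200):
    """Sample-first, solve-second: build an empirical MDP from m samples per
    (s, a), then run value iteration on the empirical model."""
    # Empirical transition probabilities P_hat[s, a, s'] and mean rewards r[s, a].
    P_hat = np.zeros((n_states, n_actions, n_states))
    r = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(m):
                s_next, reward = sample_transition(s, a)  # hypothetical oracle
                P_hat[s, a, s_next] += 1.0 / m
                r[s, a] += reward / m

    # Value iteration on the empirical model:
    #   v(s) <- max_a [ r(s, a) + gamma * sum_{s'} P_hat(s'|s, a) v(s') ].
    v = np.zeros(n_states)
    q = r.copy()
    for _ in range(n_iters):
        q = r + gamma * (P_hat @ v)   # shape (n_states, n_actions)
        v = q.max(axis=1)
    return v, q.argmax(axis=1)        # empirical value and greedy policy
```

All 푚 samples per (푠, 푎) are drawn up front and the full empirical model is stored, which is the model-based space cost noted above; the Q-learning and variance-reduction analysis that follows targets the same accuracy with model-free space and fewer samples.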
Analysis Overview

[Diagram: Q-learning, combined with variance reduction (utilizing a previous estimate), a monotonicity analysis (guaranteeing a good policy), and the Bernstein inequality plus the law of total variance (giving precise control of the error), yields variance-reduced Q-learning.]

Variance Reduction

• Original Q-learning: each iteration gets only "a little closer" to 푄*.
• Use a milestone point to reduce the samples used: the milestone estimate is computed once and reused for many iterations, while the per-iteration corrections are computed many times but each uses only a small number of samples.

Algorithm: vQVI (Sketch)

• Outer loop (log 휖^{−1} iterations): at each check-point, draw a large number of samples.
• Inner loop (퐻 = (1 − 훾)^{−1} iterations): each step uses a small number of samples.
• Each outer iteration yields a "big improvement."

Compare to Q-Learning

❑ Q-learning variants: iterates 푄1, …, 푄5 each use a fresh minibatch; #samples = (minibatch size) × 푅.
❑ Variance-reduced Q-learning: iterates 푄1, …, 푄5 reuse the check-point estimate; #samples ≈ a single minibatch.

Precise Control of Error Accumulation

• Each modified Q-learning step has an estimation error 휖_푖.
• By the Bernstein inequality and the law of total variance,
  ∑_푖 휖_푖 ∼ (total variance of the MDP)/푚 ≤ (1 − 훾)^{−3}/푚,
  the intrinsic complexity of the MDP.

Extension to Two-Person Stochastic Game

• RL (single-agent MDPs) and two-person stochastic games are in fundamentally different computational classes, even in the discounted case.
[Figure: a MIN/MAX game tree contrasting single-agent RL with a two-player zero-sum game.]
• But they share the same sample-complexity lower bound in the discounted case!
• [Sidford, Wang, Yang, Ye (2019)]: the same order of samples per (푠, 푎) suffices to learn a 0.1-optimal strategy w.h.p.

II. Stochastic Programming

• Consider a stochastic programming (SP) problem:
  min_{풙 ∈ ℜ^푝: 0 ≤ 풙 ≤ 푅} 퐅(풙) ≔ 피_ℙ 푓(풙, 푊).
• 푓 is deterministic; for every feasible 풙, the function 푓(풙, ⋅) is measurable and 피_ℙ 푓(풙, 푊) is finite; 풲 is the support of 푊.
• Commonly solved by Sample Average Approximation (SAA), a.k.a. the Monte Carlo (sampling) method.

Sample Average Approximation

• To solve 풙_true ∈ argmin{퐅(풙): 0 ≤ 풙 ≤ 푅}, solve instead the approximation problem
  min_{풙 ∈ ℜ^푝: 0 ≤ 풙 ≤ 푅} 퐹푛(풙) ≔ 푛^{−1} ∑_{푖=1}^{푛} 푓(풙, 푊푖),
  where 푊1, 푊2, …, 푊푛 is a sequence of samples of 푊.
• Simple to implement; often tractably computable.
• SAA is a popular way of solving SP.

Equivalence between SAA and M-Estimation

Stochastic programming ↔ statistical learning:
• SAA = model-fitting formulation for M-estimation.
• RSAA = regularized M-estimation.
• Suboptimality gap 휖 = excess risk 퐅(⋅) − 퐅(풙_true).
• Minimizer 풙_true = true parameters 풙_true.
• Objective function = statistical loss.

Efficacy of SAA

• Assumptions:
  (a) The sequence {푓(풙, 푊1), 푓(풙, 푊2), …, 푓(풙, 푊푛)} is i.i.d.
  (b) Subgaussian tails (can be generalized to sub-exponential).
  (c) A Lipschitz-like condition.
• Number 푛 of samples required to achieve 휖 accuracy with probability 1 − 훼 in solving a 푝-dimensional problem, i.e., ℙ[퐅(풙_SAA) − 퐅(풙_true) ≤ 휖] ≥ 1 − 훼: it suffices that
  푛 ≳ (푝/휖²) ln(1/휖) + (1/휖²) ln(1/훼).
• The sample size 푛 grows polynomially as the number of dimensions 푝 increases.
[Shapiro et al., 2009; Shapiro, 2003; Shapiro and Xu, 2008]
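Before turning to the high-dimensional regime, the following is a minimal Python sketch of SAA under the assumptions above. The loss `f(x, w)` and the sampling oracle `sample_W()` are hypothetical stand-ins for a concrete problem, and the box-constrained empirical problem is handed to an off-the-shelf solver.

```python
import numpy as np
from scipy.optimize import minimize

def solve_saa(f, sample_W, p, R, n, x0=None):
    """Minimize the sample average F_n(x) = (1/n) * sum_i f(x, W_i)
    over the box 0 <= x <= R."""
    samples = [sample_W() for _ in range(n)]           # W_1, ..., W_n (i.i.d.)

    def F_n(x):
        return np.mean([f(x, w) for w in samples])     # empirical objective

    x0 = np.zeros(p) if x0 is None else x0
    res = minimize(F_n, x0, method="L-BFGS-B", bounds=[(0.0, R)] * p)
    return res.x                                       # the SAA solution x_SAA

# Example with a toy newsvendor-style loss and exponential demand W:
# f = lambda x, w: float(np.sum(np.maximum(x - w, 0.0) + 2.0 * np.maximum(w - x, 0.0)))
# x_saa = solve_saa(f, lambda: np.random.exponential(5.0, size=3), p=3, R=20.0, n=500)
```

The bound above then prescribes how large 푛 must be, in terms of 푝, 휖, and 훼, for 풙_SAA to be 휖-optimal with probability at least 1 − 훼.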
High-Dimensional SP

• What if 푝 is large, but acquiring new samples is prohibitively costly?
• What if most dimensions are (approximately) ineffective, e.g., 풙_true = [80; 131; 100; 0; 0; …; 0]?
• Consider a problem with 푝 = 100,000: SAA requires 푛 ≈ 100,000 to 10,000,000.
• If only it were known which dimensions (e.g., three of them) are nonzero, the original problem could be reduced to 푝 = 3, requiring only 푛 ≈ 10 to 1,000.
• Conventional SAA is not effective for solving high-dimensional SP.

Convex and Nonconvex Regularization Schemes

• RSAA formulation:
  min_{풙: 0 ≤ 풙 ≤ 푅} 퐹푛,휆(풙) ≔ 퐹푛(풙) + ∑_{푗=1}^{푝} 푃휆(|푥푗|).
• ℓ푞-norm penalty (0 < 푞 ≤ 1 or 푞 = 2): 푃휆(푡) ≔ 휆 ⋅ |푡|^푞; biased or non-sparse.
• ℓ0-norm penalty: 푃휆(푡) ≔ 휆 ⋅ 핝(푡 ≠ 0); not continuous.
• Folded concave penalty (FCP) in the form of the minimax concave penalty:
  푃휆(푡) ≔ ∫_0^{|푡|} (푎휆 − 푠)₊/푎 푑푠.
• FCP entails unbiasedness, sparsity, and continuity.
[Figure: 푃휆(푡) versus 푡 for the ℓ푞-norm with 푞 = 0.5, the ℓ0-norm with 휆 = 1, and the FCP with 휆 = 2, 푎 = 0.5.]
[Tibshirani, 1996, J. R. Stat. Soc. B; Frank and Friedman, 1993, Technometrics; Fan and Li, 2001, JASA; Zhang, 2010, Ann. Stat.; Raskutti et al., 2011, IEEE Trans. Inf. Theory; Bian, Chen, Y, 2011, Math. Program.]

Existing Results on Convex Regularization

• Lasso (Tibshirani, 1996), compressive sensing (Donoho, 2006), and the Dantzig selector (Candes & Tao, 2007):
• Pros:
  • Computationally tractable.
  • Generalization error and oracle inequalities (e.g., ℓ1-, ℓ2-loss): Bunea et al. (2007), Zou (2006), Bickel et al. (2009), Van de Geer (2008), Negahban et al. (2010), etc.
• Cons:
  • Requirement of RIP, restricted eigenvalue (RE), or RSC.
  • Absence of the (strong) oracle property: Fan and Li (2001).
  • Biased: Fan and Li (2001).
• Remark:
  • Excess risk bounds are available for linear regression beyond RIP, RE, or RSC (e.g., Zhang et al., 2017).

Existing Results on (Folded) Concave Penalties

• SCAD: Fan and Li (2001); MCP: Zhang (2010).
• Pros:
  • (Nearly) unbiased: Fan and Li (2001), Zhang (2010).
  • Generalization error (ℓ1-, ℓ2-loss) and oracle inequalities: Loh and Wainwright (2015), Wang et al. (2014), Zhang & Zhang (2012), etc.
  • Oracle and strong oracle properties: Fan & Li (2001), Fan et al. (2014).
• Cons:
  • Strongly NP-hard; requires RSC, except for the special case of linear regression.
• Remarks:
  • Chen, Fu and Y (2010): threshold lower-bound property at every local solution.
  • Zhang and Zhang (2012): global optimality results in good statistical quality.
  • Fan and Lv (2011): a sufficient condition results in desirable statistical properties.
  • Though it is strongly NP-hard to find a global solution (Chen et al., 2017), theories for local schemes are available: Loh & Wainwright (2015), Wang, Liu, & Zhang (2014), Wang, Kim & Li (2013), Fan et al. (2014), Liu et al. (2017).

The S3ONC Solutions Admit an FPTAS

• Assumption (d): 푓(⋅, 푧) is continuously differentiable and, for a.e. 푧 ∈ 풲,
  | ∂푓(풙, 푧)/∂푥푗 |_{풙 = 풙 + 훿풆푗} − ∂푓(풙, 푧)/∂푥푗 |_{풙 = 풙} | ≤ 푈_퐿 ⋅ |훿|.
• First-order necessary condition (FONC): the solution 풙* ∈ [0, 푅]^푝 satisfies
  ⟨훻퐹푛(풙*) + (푃휆′(푥푗*): 1 ≤ 푗 ≤ 푝), 풙 − 풙*⟩ ≥ 0 for all 풙 ∈ [0, 푅]^푝.
• Significant-subspace second-order necessary condition (S3ONC): the FONC holds and, for every 푗 = 1, …, 푝 with 푥푗* ∈ (0, 푎휆),
  푈_퐿 + 푃휆″(푥푗*) ≥ 0 or [훻²퐹푛(풙*)]_{푗푗} + 푃휆″(푥푗*) ≥ 0.
• S3ONC is weaker than the canonical second-order KKT condition.
• S3ONC solutions are computable within pseudo-polynomial time.
[Y (1998), Liu et al. (2017), Haeser et al. (2018), Liu and Y (2019), https://arxiv.org/abs/1903.00616]
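As an illustration of the kind of local scheme these conditions license, here is a minimal Python sketch of a proximal-gradient method for the RSAA objective with the MCP form of the FCP. It is only a sketch under simplifying assumptions (a gradient oracle `grad_F_n` for 퐹푛, a step size 휂 < 푎, and a simple projection onto the box), not the FPTAS referenced above.

```python
import numpy as np

def mcp_prox(z, lam, a, eta):
    """Proximal map of the MCP penalty P_lam(t) = int_0^|t| max(a*lam - s, 0)/a ds,
    with step size eta (assumes eta < a)."""
    return np.where(
        np.abs(z) <= a * lam,
        np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0) / (1.0 - eta / a),
        z,                                  # beyond a*lam the penalty is flat
    )

def rsaa_mcp(grad_F_n, p, R, lam=0.1, a=2.0, eta=0.01, n_iters=2000):
    """Proximal-gradient local search for min F_n(x) + sum_j P_lam(|x_j|), 0 <= x <= R."""
    x = np.zeros(p)
    for _ in range(n_iters):
        z = x - eta * grad_F_n(x)           # gradient step on the smooth part F_n
        x = mcp_prox(z, lam, a, eta)        # folded concave shrinkage step
        x = np.clip(x, 0.0, R)              # crude projection back onto the box
    return x                                # candidate local (S3ONC-type) solution
```

In practice one would also verify the second-order S3ONC check on the coordinates with 푥푗 ∈ (0, 푎휆) before accepting the output.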