
The Sample Complexity in Data (Sample)-Driven Optimization

Yinyu Ye, joint work with many others
Department of Management Science and Engineering and ICME, Stanford University

Outline

1. Sample Complexity in Reinforcement Learning and Stochastic Games
• Reinforcement learning and MDPs
• The sample-complexity problem in RL
• The near sample-optimal algorithm
• Variance reduction
• Monotonicity analysis

2. Sample Average Approximation with Sparsity-Inducing Penalty for High-Dimensional Stochastic Programming/Learning
• Regularized sample average approximation (RSAA)
• Theoretical generalizations
• Theoretical applications:
  • High-dimensional statistical learning
  • Deep neural network learning

I. What Is Reinforcement Learning?

• Use samples/experiences to control an unknown system.
[Figure: an agent interacts with an unknown system, choosing actions and observing states and rewards.]
• Examples: a maze with random punishment/bonus; playing games; learning to drive.
• Samples/experiences: previous observations.
• Reinforcement learning achieves phenomenal empirical successes.
[Figure: median performance over 57 Atari games compared with human level, from Hessel et al. '17.]

What if samples/trials are costly and limited?

• Precision medicine, biological experiments, crowd sourcing.
• Sample size/complexity is essential!

"What If" Model: Markov Decision Process

• States 푠 ∈ 풮; actions 푎 ∈ 풜.
• Reward: 푟(푠, 푎).
• State transition: 푃(푠′|푠, 푎).
• Policy 휋 (possibly randomized); discount factor 훾; effective horizon 퐻 = (1 − 훾)^{−1}.
• Optimal policy 휋* and value 푣*: when the model is known, solve via linear programming.
• 휖-optimal policy 휋: a policy whose value is within 휖 of the optimal value.

Examples

• Three-state MDP.
[Figure: a three-state MDP in which action 퐿 at state 1 earns reward +1 and action 푅 earns reward +0.999.]
− Optimal policy: 휋*(1) = 퐿.
− A 0.1-optimal policy: 휋(1) = 푅.
• Pacman.
− How to get a 0.1-optimal policy?
− Learn the policy from sample data.
− Short episodes.

The Sample-Complexity Problem

For an unknown MDP, how many samples are sufficient/necessary to learn a 0.1-optimal policy w.p. ≥ 0.9?

• Data/sample: transition examples from every (푠, 푎).
• Roughly (1 − 훾)^{−3}/휖² samples per (푠, 푎) are necessary [Azar et al. (2013)]!
• No previous algorithm achieves this sample complexity.
• Sufficient side: what is the best way of learning a policy?

Empirical Risk Minimization and Value Iteration

❑ Collect 푚 samples for each (푠, 푎).
❑ Construct an empirical model.
❑ Solve the empirical MDP via linear programming, or simply value iteration:
  푣(푠) ← max_푎 [푟(푠, 푎) + 훾 ∑_{푠′} 푃(푠′|푠, 푎) 푣(푠′)] for all 푠.

Sample-first and solve-second:
• Simple and intuitive.
• Space: model-based, stores all data.
• Samples per (푠, 푎): analyzed in [Azar et al., 2013].

Iterative Q-Learning: Solving While Sampling

• 푄(푠, 푎): quality of (푠, 푎), initialized to 0.
• Each approximate-DP update uses an empirical estimator of the expected future value, converging to 푄*.
• After enough iterations, the greedy policy is near optimal.
• #Samples for a 0.1-optimal policy w.h.p.: still far above the lower bound (see the summary table below).

Why Iterative Q-Learning Fails

[Figure: iterates 푄1, 푄2, 푄3, 푄4, 푄5, each only "a little closer" to 푄*, each built from a fresh empirical estimator of the expected future value.]
• Runs for 푅 = (1 − 훾)^{−1} iterations.
• Each iteration uses the same number of new samples: too many samples for too little improvement.

Summary of Previous Methods

How many samples per (푠, 푎) are sufficient/necessary to obtain a 0.1-optimal policy w.p. > 0.9?

Previous algorithm        #Samples/(s,a)   Reference
Phased Q-learning         10^14            Kearns & Singh, 1999
ERM                       10^10            Azar et al., 2013
Randomized LP             10^8             Wang, 2017
Randomized value iter.    10^8             Sidford et al., 2018

• Previous best algorithm: 10^8 samples per (푠, 푎). Is this many necessary? The information-theoretic lower bound is smaller. Is the lower bound itself sufficient?

The Near Sample-Optimal Algorithm

Theorem (informal): For an unknown discounted MDP with finitely many states and actions, we give an algorithm (vQVI) that
• recovers w.h.p. a 0.1-optimal policy with a number of samples per (푠, 푎) that matches the information-theoretic lower bound up to logarithmic factors;
• runs in time proportional to the number of samples;
• uses model-free space.

The first sample-optimal* algorithm for learning a policy of a DMDP.
[Sidford, Wang, Wu, Yang, and Ye, NeurIPS (2018)]
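To make the "sample-first and solve-second" baseline above concrete, here is a minimal Python sketch of empirical-model value iteration. It is only an illustration under simplifying assumptions, not the vQVI algorithm or the authors' code; `sample_transition(s, a)` is a hypothetical generative-model oracle returning a sampled next state and reward.

```python
import numpy as np

def empirical_value_iteration(sample_transition, n_states, n_actions,
                              m=1000, gamma=0.9, n_iters=200):
    """Sample-first, solve-second: build an empirical MDP from m samples per
    (s, a), then run value iteration on the empirical model."""
    # Empirical transition probabilities P_hat[s, a, s'] and mean rewards r[s, a].
    P_hat = np.zeros((n_states, n_actions, n_states))
    r = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(m):
                s_next, reward = sample_transition(s, a)  # hypothetical oracle
                P_hat[s, a, s_next] += 1.0 / m
                r[s, a] += reward / m

    # Value iteration on the empirical model:
    #   v(s) <- max_a [ r(s, a) + gamma * sum_{s'} P_hat(s'|s, a) v(s') ].
    v = np.zeros(n_states)
    q = r.copy()
    for _ in range(n_iters):
        q = r + gamma * (P_hat @ v)   # shape (n_states, n_actions)
        v = q.max(axis=1)
    return v, q.argmax(axis=1)        # empirical value and greedy policy
```

All 푚 samples per (푠, 푎) are drawn up front and the full empirical model is stored, which is the model-based space cost noted above; the Q-learning and variance-reduction analysis that follows targets the same accuracy with model-free space and fewer samples.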
Analysis Overview

[Diagram: Q-learning, combined with variance reduction (utilizing a previous estimate), a monotonicity analysis (guaranteeing a good policy), and the Bernstein inequality plus the law of total variance (giving precise control of the error), yields variance-reduced Q-learning.]

Variance Reduction

• Original Q-learning: each iteration gets only "a little closer" to 푄*.
• Use a milestone point to reduce the samples used: the milestone estimate is computed once and reused for many iterations, while the per-iteration corrections are computed many times but each uses only a small number of samples.

Algorithm: vQVI (Sketch)

• Outer loop (log 휖^{−1} iterations): at each check-point, draw a large number of samples.
• Inner loop (퐻 = (1 − 훾)^{−1} iterations): each step uses a small number of samples.
• Each outer iteration yields a "big improvement."

Compare to Q-Learning

❑ Q-learning variants: iterates 푄1, …, 푄5 each use a fresh minibatch; #samples = (minibatch size) × 푅.
❑ Variance-reduced Q-learning: iterates 푄1, …, 푄5 reuse the check-point estimate; #samples ≈ a single minibatch.

Precise Control of Error Accumulation

• Each modified Q-learning step has an estimation error 휖_푖.
• By the Bernstein inequality and the law of total variance,
  ∑_푖 휖_푖 ∼ (total variance of the MDP)/푚 ≤ (1 − 훾)^{−3}/푚,
  the intrinsic complexity of the MDP.

Extension to Two-Person Stochastic Game

• RL (single-agent MDPs) and two-person stochastic games are in fundamentally different computational classes, even in the discounted case.
[Figure: a MIN/MAX game tree contrasting single-agent RL with a two-player zero-sum game.]
• But they share the same sample-complexity lower bound in the discounted case!
• [Sidford, Wang, Yang, Ye (2019)]: the same order of samples per (푠, 푎) suffices to learn a 0.1-optimal strategy w.h.p.

II. Stochastic Programming

• Consider a stochastic programming (SP) problem:
  min_{풙 ∈ ℜ^푝: 0 ≤ 풙 ≤ 푅} 퐅(풙) ≔ 피_ℙ 푓(풙, 푊).
• 푓 is deterministic; for every feasible 풙, the function 푓(풙, ⋅) is measurable and 피_ℙ 푓(풙, 푊) is finite; 풲 is the support of 푊.
• Commonly solved by Sample Average Approximation (SAA), a.k.a. the Monte Carlo (sampling) method.

Sample Average Approximation

• To solve 풙_true ∈ argmin{퐅(풙): 0 ≤ 풙 ≤ 푅}, solve instead the approximation problem
  min_{풙 ∈ ℜ^푝: 0 ≤ 풙 ≤ 푅} 퐹푛(풙) ≔ 푛^{−1} ∑_{푖=1}^{푛} 푓(풙, 푊푖),
  where 푊1, 푊2, …, 푊푛 is a sequence of samples of 푊.
• Simple to implement; often tractably computable.
• SAA is a popular way of solving SP.

Equivalence between SAA and M-Estimation

Stochastic programming ↔ statistical learning:
• SAA = model-fitting formulation for M-estimation.
• RSAA = regularized M-estimation.
• Suboptimality gap 휖 = excess risk 퐅(⋅) − 퐅(풙_true).
• Minimizer 풙_true = true parameters 풙_true.
• Objective function = statistical loss.

Efficacy of SAA

• Assumptions:
  (a) The sequence {푓(풙, 푊1), 푓(풙, 푊2), …, 푓(풙, 푊푛)} is i.i.d.
  (b) Subgaussian tails (can be generalized to sub-exponential).
  (c) A Lipschitz-like condition.
• Number 푛 of samples required to achieve 휖 accuracy with probability 1 − 훼 in solving a 푝-dimensional problem, i.e., ℙ[퐅(풙_SAA) − 퐅(풙_true) ≤ 휖] ≥ 1 − 훼: it suffices that
  푛 ≳ (푝/휖²) ln(1/휖) + (1/휖²) ln(1/훼).
• The sample size 푛 grows polynomially as the number of dimensions 푝 increases.
[Shapiro et al., 2009; Shapiro, 2003; Shapiro and Xu, 2008]
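Before turning to the high-dimensional regime, the following is a minimal Python sketch of SAA under the assumptions above. The loss `f(x, w)` and the sampling oracle `sample_W()` are hypothetical stand-ins for a concrete problem, and the box-constrained empirical problem is handed to an off-the-shelf solver.

```python
import numpy as np
from scipy.optimize import minimize

def solve_saa(f, sample_W, p, R, n, x0=None):
    """Minimize the sample average F_n(x) = (1/n) * sum_i f(x, W_i)
    over the box 0 <= x <= R."""
    samples = [sample_W() for _ in range(n)]           # W_1, ..., W_n (i.i.d.)

    def F_n(x):
        return np.mean([f(x, w) for w in samples])     # empirical objective

    x0 = np.zeros(p) if x0 is None else x0
    res = minimize(F_n, x0, method="L-BFGS-B", bounds=[(0.0, R)] * p)
    return res.x                                       # the SAA solution x_SAA

# Example with a toy newsvendor-style loss and exponential demand W:
# f = lambda x, w: float(np.sum(np.maximum(x - w, 0.0) + 2.0 * np.maximum(w - x, 0.0)))
# x_saa = solve_saa(f, lambda: np.random.exponential(5.0, size=3), p=3, R=20.0, n=500)
```

The bound above then prescribes how large 푛 must be, in terms of 푝, 휖, and 훼, for 풙_SAA to be 휖-optimal with probability at least 1 − 훼.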
High-Dimensional SP

• What if 푝 is large, but acquiring new samples is prohibitively costly?
• What if most dimensions are (approximately) ineffective, e.g., 풙_true = [80; 131; 100; 0; 0; …; 0]?
• Consider a problem with 푝 = 100,000: SAA requires 푛 ≈ 100,000 to 10,000,000.
• If only it were known which dimensions (e.g., three of them) are nonzero, the original problem could be reduced to 푝 = 3, requiring only 푛 ≈ 10 to 1,000.
• Conventional SAA is not effective for solving high-dimensional SP.

Convex and Nonconvex Regularization Schemes

• RSAA formulation:
  min_{풙: 0 ≤ 풙 ≤ 푅} 퐹푛,휆(풙) ≔ 퐹푛(풙) + ∑_{푗=1}^{푝} 푃휆(|푥푗|).
• ℓ푞-norm penalty (0 < 푞 ≤ 1 or 푞 = 2): 푃휆(푡) ≔ 휆 ⋅ |푡|^푞; biased or non-sparse.
• ℓ0-norm penalty: 푃휆(푡) ≔ 휆 ⋅ 핝(푡 ≠ 0); not continuous.
• Folded concave penalty (FCP) in the form of the minimax concave penalty:
  푃휆(푡) ≔ ∫_0^{|푡|} (푎휆 − 푠)₊/푎 푑푠.
• FCP entails unbiasedness, sparsity, and continuity.
[Figure: 푃휆(푡) versus 푡 for the ℓ푞-norm with 푞 = 0.5, the ℓ0-norm with 휆 = 1, and the FCP with 휆 = 2, 푎 = 0.5.]
[Tibshirani, 1996, J. R. Stat. Soc. B; Frank and Friedman, 1993, Technometrics; Fan and Li, 2001, JASA; Zhang, 2010, Ann. Stat.; Raskutti et al., 2011, IEEE Trans. Inf. Theory; Bian, Chen, Y, 2011, Math. Program.]

Existing Results on Convex Regularization

• Lasso (Tibshirani, 1996), compressive sensing (Donoho, 2006), and the Dantzig selector (Candes & Tao, 2007):
• Pros:
  • Computationally tractable.
  • Generalization error and oracle inequalities (e.g., ℓ1-, ℓ2-loss): Bunea et al. (2007), Zou (2006), Bickel et al. (2009), Van de Geer (2008), Negahban et al. (2010), etc.
• Cons:
  • Requirement of RIP, restricted eigenvalue (RE), or RSC.
  • Absence of the (strong) oracle property: Fan and Li (2001).
  • Biased: Fan and Li (2001).
• Remark:
  • Excess risk bounds are available for linear regression beyond RIP, RE, or RSC (e.g., Zhang et al., 2017).

Existing Results on (Folded) Concave Penalties

• SCAD: Fan and Li (2001); MCP: Zhang (2010).
• Pros:
  • (Nearly) unbiased: Fan and Li (2001), Zhang (2010).
  • Generalization error (ℓ1-, ℓ2-loss) and oracle inequalities: Loh and Wainwright (2015), Wang et al. (2014), Zhang & Zhang (2012), etc.
  • Oracle and strong oracle properties: Fan & Li (2001), Fan et al. (2014).
• Cons:
  • Strongly NP-hard; requires RSC, except for the special case of linear regression.
• Remarks:
  • Chen, Fu and Y (2010): threshold lower-bound property at every local solution.
  • Zhang and Zhang (2012): global optimality results in good statistical quality.
  • Fan and Lv (2011): a sufficient condition results in desirable statistical properties.
  • Though it is strongly NP-hard to find a global solution (Chen et al., 2017), theories for local schemes are available: Loh & Wainwright (2015), Wang, Liu, & Zhang (2014), Wang, Kim & Li (2013), Fan et al. (2014), Liu et al. (2017).

The S3ONC Solutions Admit an FPTAS

• Assumption (d): 푓(⋅, 푧) is continuously differentiable and, for a.e. 푧 ∈ 풲,
  | ∂푓(풙, 푧)/∂푥푗 |_{풙 = 풙 + 훿풆푗} − ∂푓(풙, 푧)/∂푥푗 |_{풙 = 풙} | ≤ 푈_퐿 ⋅ |훿|.
• First-order necessary condition (FONC): the solution 풙* ∈ [0, 푅]^푝 satisfies
  ⟨훻퐹푛(풙*) + (푃휆′(푥푗*): 1 ≤ 푗 ≤ 푝), 풙 − 풙*⟩ ≥ 0 for all 풙 ∈ [0, 푅]^푝.
• Significant-subspace second-order necessary condition (S3ONC): the FONC holds and, for every 푗 = 1, …, 푝 with 푥푗* ∈ (0, 푎휆),
  푈_퐿 + 푃휆″(푥푗*) ≥ 0 or [훻²퐹푛(풙*)]_{푗푗} + 푃휆″(푥푗*) ≥ 0.
• S3ONC is weaker than the canonical second-order KKT condition.
• S3ONC solutions are computable within pseudo-polynomial time.
[Y (1998), Liu et al. (2017), Haeser et al. (2018), Liu and Y (2019), https://arxiv.org/abs/1903.00616]
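As an illustration of the kind of local scheme these conditions license, here is a minimal Python sketch of a proximal-gradient method for the RSAA objective with the MCP form of the FCP. It is only a sketch under simplifying assumptions (a gradient oracle `grad_F_n` for 퐹푛, a step size 휂 < 푎, and a simple projection onto the box), not the FPTAS referenced above.

```python
import numpy as np

def mcp_prox(z, lam, a, eta):
    """Proximal map of the MCP penalty P_lam(t) = int_0^|t| max(a*lam - s, 0)/a ds,
    with step size eta (assumes eta < a)."""
    return np.where(
        np.abs(z) <= a * lam,
        np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0) / (1.0 - eta / a),
        z,                                  # beyond a*lam the penalty is flat
    )

def rsaa_mcp(grad_F_n, p, R, lam=0.1, a=2.0, eta=0.01, n_iters=2000):
    """Proximal-gradient local search for min F_n(x) + sum_j P_lam(|x_j|), 0 <= x <= R."""
    x = np.zeros(p)
    for _ in range(n_iters):
        z = x - eta * grad_F_n(x)           # gradient step on the smooth part F_n
        x = mcp_prox(z, lam, a, eta)        # folded concave shrinkage step
        x = np.clip(x, 0.0, R)              # crude projection back onto the box
    return x                                # candidate local (S3ONC-type) solution
```

In practice one would also verify the second-order S3ONC check on the coordinates with 푥푗 ∈ (0, 푎휆) before accepting the output.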