
Statistical Learning and Sequential Prediction

Alexander Rakhlin and Karthik Sridharan

DRAFT

September 7, 2014

Contents

I  Introduction

1  About
2  An Appetizer: A Bit of Bit Prediction
3  What are the Learning Problems?
4  Example:

II

5  Minimax Formulation of Learning Problems
   5.1  Minimax Basics
   5.2  Defining Minimax Values for Learning Problems
   5.3  No Free Lunch Theorems
        5.3.1  Statistical Learning and Nonparametric Regression
        5.3.2  Sequential Prediction with Individual Sequences

6  Learnability, Oracle Inequalities, Model Selection, and the Bias-Variance Trade-off
   6.1  Statistical Learning
   6.2  Sequential Prediction
   6.3  Remarks

7  Stochastic processes, Empirical processes, Martingales, Tree Processes
   7.1  Motivation
        7.1.1  Statistical Learning
        7.1.2  Sequential Prediction
   7.2  Defining Stochastic Processes
   7.3  Application to Learning
   7.4  Symmetrization
   7.5  Rademacher Averages
   7.6  Skolemization
   7.7  ... Back to Learning

8  Example: Learning Thresholds
   8.1  Statistical Learning
   8.2  Separable (Realizable) Case
   8.3  Noise Conditions
   8.4  Prediction of Individual Sequences
   8.5  Discussion

9  Maximal Inequalities
   9.1  Finite Class Lemmas

10  Example: Linear Classes

11  Statistical Learning: Classification
    11.1  From Finite to Infinite Classes: First Attempt
    11.2  From Finite to Infinite Classes: Second Attempt
    11.3  The Growth Function and the VC Dimension

12  Statistical Learning: Real-Valued Functions
    12.1  Covering Numbers
    12.2  Chaining Technique and the Dudley Entropy Integral
    12.3  Example: Nondecreasing Functions
    12.4  Improved Bounds for Classification
    12.5  Combinatorial Parameters
    12.6  Contraction
    12.7  Discussion
    12.8  Supplementary Material: Back to the Rademacher
    12.9  Supplementary Material: Lower Bound on the Minimax Value

13  Sequential Prediction: Classification
    13.1  From Finite to Infinite Classes: First Attempt
    13.2  From Finite to Infinite Classes: Second Attempt
    13.3  The Zero Cover and Littlestone's Dimension
    13.4  Removing the Indicator Loss, or Fun Rotations with Trees
    13.5  The End of the Story

14  Sequential Prediction: Real-Valued Functions
    14.1  Covering Numbers
    14.2  Chaining with Trees
    14.3  Combinatorial Parameters
    14.4  Contraction
    14.5  Lower Bounds

15  Examples: Complexity of Linear and Kernel Classes, Neural Networks
    15.1  Prediction with Linear Classes
    15.2  Kernel Methods
    15.3  Neural Networks
    15.4  Discussion

16  Large Margin Theory for Classification

17  Regression with Square Loss: From Regret to Nonparametric Estimation

III  Algorithms

18  Algorithms for Sequential Prediction: Finite Classes
    18.1  The Halving Algorithm
    18.2  The Exponential Weights Algorithm

19  Algorithms for Sequential Prediction: Binary Classification with Infinite Classes
    19.1  Halving Algorithm with Margin
    19.2  The Perceptron Algorithm
    19.3  The Winnow Algorithm

20  Algorithms for Online Convex Optimization
    20.1  Online Linear Optimization
    20.2  Gradient Descent
    20.3  Follow the Regularized Leader and Mirror Descent
    20.4  From Linear to Convex Functions

21  Example: Binary Sequence Prediction and the Mind Reading Machine
    21.1  Prediction with Expert Advice
    21.2  Blackwell's method
    21.3  Follow the Regularized Leader
    21.4  Discussion
    21.5  Can we derive an algorithm for bit prediction?
    21.6  The Mind Reading Machine

22  Algorithmic Framework for Sequential Prediction
    22.1  Relaxations
         22.1.1  Follow the Regularized Leader / Dual Averaging
         22.1.2  Exponential Weights
    22.2  Supervised Learning

23  Algorithms Based on Random Playout, and Follow the Perturbed Leader
    23.1  The of Randomization
    23.2  Linear Loss
         23.2.1  Example: Follow the Perturbed Leader on the Simplex
         23.2.2  Example: Follow the Perturbed Leader on Euclidean Balls
         23.2.3  Proof of Lemma 23.2
    23.3  Supervised Learning

24  Algorithms for Fixed Design
    24.1  ... And the Tree Disappears
    24.2  Static Experts
    24.3  Social Learning / Network Prediction
    24.4  Matrix Completion / Netflix Problem

25  Adaptive Algorithms
    25.1  Adaptive Relaxations
    25.2  Example: Bit Prediction from Lecture 1
    25.3  Adaptive Gradient Descent

IV  Extensions

26  The Minimax Theorem
    26.1  When the Minimax Theorem Does Not Hold
    26.2  The Minimax Theorem and Regret Minimization
    26.3  Proof of a Minimax Theorem Using Exponential Weights
    26.4  More Examples
    26.5  Sufficient Conditions for Weak Compactness

27  Two Proofs of Blackwell's Approachability Theorem
    27.1  Blackwell's vector-valued generalization and the original proof
    27.2  A non-constructive proof
    27.3  Discussion
    27.4  Algorithm Based on Relaxations: Potential-Based Approachability

28  From Sequential to Statistical Learning: Relationship Between Values and Online-to-Batch
    28.1  Relating the Values
    28.2  Online to Batch Conversion

29  Sequential Prediction: Better Bounds for Predictable Sequences
    29.1  Full Information Methods
    29.2  Learning The Predictable Processes
    29.3  Follow the Perturbed Leader Method
    29.4  A General Framework of Stochastic, Smoothed, and Constrained Adversaries

30  Sequential Prediction: Competing With Strategies
    30.1  Bounding the Value with History Trees
    30.2  Static Experts
    30.3  Covering Numbers and Combinatorial Parameters
    30.4  Monotonic Experts
    30.5  Compression and Sufficient Statistics

31  Localized Analysis and Fast Rates. Local Rademacher Complexities

A  Appendix

Part I

Introduction

1 About

This course will focus on theoretical aspects of Statistical Learning and Sequential Prediction. Until recently, these two subjects have been treated separately within the learning community. The course will follow a unified approach to analyzing learning in both scenarios. To make this happen, we shall bring together ideas from probability and statistics, game theory, algorithms, and optimization. It is this blend of ideas that makes the subject interesting for us, and we hope to convey the excitement. We shall try to make the course as self-contained as possible, and pointers to additional readings will be provided whenever necessary. Our target audience is graduate students with a solid background in probability and linear algebra.

"Learning" can be very loosely defined as the "ability to improve performance after observing data". Over the past two decades, there has been an explosion of both applied and theoretical work on machine learning. Applications of learning methods are ubiquitous: they include systems for face detection and face recognition, prediction of stock markets and weather patterns, speech recognition, learning user's search preferences, placement of relevant ads, and much more. The success of these applications has been paralleled by a well-developed theory. We shall call this latter branch of machine learning "learning theory".

Why should one care about machine learning? Many tasks that we would like computers to perform cannot be hard-coded. The programs have to adapt. The goal then is to encode, for a particular application, as much of the domain-specific knowledge as needed, and to leave enough flexibility for the system to improve upon observing data. It is well recognized that there is no single learning algorithm that will work universally (we will make this statement mathematically precise).

It is not our goal to make computers learn everything at once: each application requires prior knowledge from the expert. The goal of learning theory then is to develop general guidelines and algorithms, and to prove guarantees about learning performance under various natural assumptions.

A number of interesting learning models have been studied in the literature, and a glance at the proceedings of a learning conference can easily overwhelm a newcomer. Hence, we start this course by describing a few learning problems. We feel that the differences and similarities between various learning scenarios become more apparent once viewed as minimax problems. The minimax framework also makes it clear where the "prior knowledge" of the practitioner should be encoded. We will emphasize the minimax approach throughout the course.

What separates Learning from Statistics? Both look at data and have similar goals. Indeed, nowadays it is difficult to draw a line. Let us briefly sketch a few historical differences. According to [55], in the 1960's it became apparent that classical statistical methods are poorly suited for certain prediction tasks, especially those characterized by high dimensionality. Parametric statistics, as developed by Fisher, worked well if the statistician could model the underlying process generating the data. However, for many interesting problems (e.g. face detection, character recognition) the associated high-dimensional modeling problem was found to be computationally intractable, and the analysis given by classical statistics inadequate. In order to avoid making assumptions on the data-generating mechanism, a new distribution-free approach was suggested. The goal within the machine learning community has therefore shifted from being model-centric to being algorithm-centric. An interested reader is referred to the (somewhat extreme) point of view of Breiman [13] for more discussion on the two cultures, but let us say that in the past 10 years both communities have benefited from a sharing of ideas. In the next lecture, we shall make the distinctions concrete by formulating the goals of nonparametric estimation and statistical learning as minimax problems. Further in the course, we will show that these goals are not as different as it might first appear.

Over the past 30 years, the development of Statistical Learning Theory has been intertwined with the study of uniform Laws of Large Numbers. The theory provided an understanding of the inherent complexities of distribution-free learning, as well as finite-sample and data-dependent guarantees.

Besides the well-established theory, the algorithms developed by the learning community (e.g. Support Vector Machines and AdaBoost) are often considered to be state-of-the-art methods for prediction problems. These methods adhere to the philosophy that, for instance, for classification problems one should not model the distributions but rather model the decision boundary. Arguably, this accounts for the success of many learning methods, with the downside that interpretability of the results is often more difficult. The term "learning" itself is a legacy of the field's strong connection to computer-driven problems, and points to the fact that the goal is not necessarily that of "estimating the true parameter", but rather that of improving performance with more data.

In the past decade, research in learning theory has been shifting to sequential problems, with a focus on relaxing any distributional assumptions on the observed sequences. A rigorous analysis of sequential problems is a large part of this course. Interestingly, most research on sequential prediction (or, online learning) has been algorithmic: given a problem, one would present a method and prove a guarantee for its performance. In this course, we present a thorough study of the inherent complexities of sequential prediction. The goal is to develop it in complete parallel with the classical results of Statistical Learning Theory. As an added (and unexpected!) bonus, the online learning problem will give us an algorithmic toolkit for attacking problems in Statistical Learning.

We start the course by presenting a fun bit prediction problem. We then proceed to list, in a rather informal way, a few different learning settings, some of which are not "learning" per se but are quite closely related. We will not cover all of these in the course, but it is good to see the breadth of problems anyway. In the following lecture, we will go through some of these problems once again and will look at them through the lens of minimax. As we go through the various settings, we will point out three key aspects: (a) how data are generated; (b) how the performance is measured; and (c) where we place prior knowledge.

Before proceeding, let us mention that we will often make over-simplified statements for the sake of clarity and conciseness. In particular, our definitions of research areas (such as Statistical Learning) are bound to be narrower than these areas really are. Finally, these lecture notes reflect a personal outlook and may have only a thin intersection with reality.¹

¹ For a memorable collection of juicy quotes, one is advised to take a course with J. Michael Steele.

Notation: A set $\{z_1,\ldots,z_t\}$ is variably denoted by either $z_{1:t}$ or $z^t$. A $t$-fold product of $\mathcal{Z}$ is denoted by $\mathcal{Z}^t$. The set $\{1,\ldots,n\}$ of natural numbers is denoted by $[n]$, and the set of all distributions on some set $\mathcal{A}$ by $\Delta(\mathcal{A})$. Deviating from the standard convention, we sometimes denote random variables by lower-case letters, but we do so only if no confusion can arise. This is done for the purpose of making long expressions appear more tidy. Expectation with respect to a random variable $Z$ with distribution $p$ is denoted by $\mathbb{E}_Z$ or $\mathbb{E}_{Z\sim p}$. We caution that in the literature $\mathbb{E}_Z$ is sometimes used to denote a conditional expectation; our notation is more convenient for the problems we have in mind. The inner product between two vectors is written variably as $a\cdot b$, or $\langle a,b\rangle$, or as $a^{\top}b$. The set of all functions from $\mathcal{X}$ to $\mathcal{Y}$ is denoted by $\mathcal{Y}^{\mathcal{X}}$. The unit $L_p$ ball in $\mathbb{R}^d$ will be denoted by $B_p^d$ and the unit $\ell_p$ ball by $\mathsf{B}_p^d$.

2 An Appetizer: A Bit of Bit Prediction

We start our journey by describing the simplest possible scenario – that of "learning" with binary-valued data. We put "learning" in quotes simply because various research communities use this term for different objectives. We will now describe several such objectives with the aim of drawing parallels between them later in the course. Granted, the first three questions we ask are trivial, but the last one is not – so read to the end!

What can be simpler than the Bernoulli distribution? Suppose we observe a sample $y_1,\ldots,y_n$ drawn i.i.d. from such a distribution with an unknown bias $p \in (0,1)$. The goal of estimation is to provide a guess of the population parameter $p$ based on these data. Any kindergartner (raised in a Frequentist family) will happily tell us that a reasonable estimate of $p$ is the empirical proportion of ones
$$\bar{y}_n \triangleq \frac{1}{n}\sum_{i=1}^n y_i,$$
while a child from a Bayesian upbringing will likely integrate over a prior and add a couple of extra 0's and 1's to regularize the solution for small $n$. What can we say about the quality of $\bar{y}_n$ as an estimate of $p$? From the Central Limit Theorem (CLT), we know that $|p - \bar{y}_n| = O_P(n^{-1/2})$,¹ and in particular
$$\mathbb{E}\,|p - \bar{y}_n| = O(n^{-1/2}).$$
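This rate is easy to see in simulation. The following sketch is only an illustration (it is not part of the development): the bias $p$, the grid of sample sizes, and the seed are arbitrary choices.

```python
import numpy as np

# Monte Carlo check that E|p - ybar_n| shrinks roughly like n^{-1/2}.
# The bias p, the sample sizes, and the seed are arbitrary illustrative choices.
rng = np.random.default_rng(0)
p = 0.3

for n in [100, 400, 1600, 6400]:
    # 2000 independent samples of size n, each yielding one empirical proportion
    samples = rng.binomial(1, p, size=(2000, n))
    errors = np.abs(samples.mean(axis=1) - p)
    print(f"n={n:5d}  E|p - ybar_n| ~ {errors.mean():.4f}  sqrt(n) * error ~ {np.sqrt(n) * errors.mean():.3f}")
```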

For the prediction scenario, suppose again that we are given $y_1,\ldots,y_n$ drawn independently from the Bernoulli distribution with an unknown bias $p$, yet the aim is to make a good binary forecast $\hat{y} \in \{0,1\}$ rather than to estimate $p$.

¹ For a sequence of random variables $y_1,\ldots,y_n,\ldots$ and positive numbers $a_1,\ldots,a_n,\ldots$, the notation $y_n = O_P(a_n)$ means that for any $\delta > 0$, there exists an $R > 0$ such that $P(|y_n| > R a_n) < \delta$ for all $n$.

The objective is the performance of the forecast on an independent draw $y$ from the same distribution, as measured by the indicator of a mistake $\mathbb{I}\{\hat{y} \neq y\}$. Since $y$ is a random variable itself, the decision $\hat{y}$ incurs the expected cost $\mathbb{E}\,\mathbb{I}\{\hat{y} \neq y\}$. We may compare this cost to the cost of the best decision
$$\mathbb{E}\,\mathbb{I}\{\hat{y} \neq y\} - \min_{y' \in \{0,1\}} \mathbb{E}\,\mathbb{I}\{y' \neq y\}$$
and observe that the minimum is attained at $y^* = \mathbb{I}\{p \geq 1/2\}$ and equal to
$$\min_{y' \in \{0,1\}} \mathbb{E}\,\mathbb{I}\{y' \neq y\} = \min\{p, 1-p\}.$$
Also note that the minimum can only be calculated with the knowledge of $p$. However, since $\bar{y}_n$ is a good estimate of $p$, we can approximate the minimizer quite well. It is rather clear that we should predict with the majority vote $\hat{y} = \mathbb{I}\{\bar{y}_n \geq 1/2\}$. Why?

Our third problem is that of sequential prediction with i.i.d. data. Suppose we observe the i.i.d. Bernoulli draws $y_1,\ldots,y_n,\ldots$ in a stream. At each time instant $t$, having observed $y_1,\ldots,y_{t-1}$, we are tasked with making the $t$-th prediction. It shouldn't come as a surprise that, by going with the majority vote
$$\hat{y}_t = \mathbb{I}\{\bar{y}_{t-1} \geq 1/2\}$$
once again, the average prediction cost
$$\frac{1}{n}\sum_{t=1}^n \mathbb{E}\,\mathbb{I}\{\hat{y}_t \neq y_t\} - \min\{p, 1-p\}$$
can be shown to be $O(n^{-1/2})$. Another powerful statement can be deduced from the strong Law of Large Numbers (LLN):
$$\limsup_{n \to \infty} \left( \frac{1}{n}\sum_{t=1}^n \mathbb{I}\{\hat{y}_t \neq y_t\} - \min\{\bar{y}_n, 1-\bar{y}_n\} \right) \leq 0 \quad \text{almost surely}. \qquad (2.1)$$
That is to say, for almost all sequences (under the probabilistic model), the average number of mistakes is (asymptotically) no more than the smaller of the proportion of zeros and the proportion of ones in the sequence.

We now leave the comfortable world of i.i.d. data, where all the aforementioned results immediately followed from the CLT or the LLN. The fourth setting is that of prediction of individual sequences. Let us start with what should be a surprising statement:

13 There exists a method for predicting the sequence that guarantees (2.1) without any assumptions on the way the sequence is generated.

Ponder for a minute on the meaning of this statement. It says that whenever the proportion of 1's (or 0's) in the sequence is, say, 70%, we should be able to correctly predict at least roughly 70% of the bits. It is not obvious that such a strategy even exists without any assumptions on the generative process of the data!

It should be observed that the method of predicting $\mathbb{I}\{\bar{y}_{t-1} \geq 1/2\}$ at step $t$ no longer works. In particular, it fails on the alternating sequence $0,1,0,1,\ldots$ since $\mathbb{I}\{\bar{y}_{t-1} \geq 1/2\} = 1 - y_t$. Such an unfortunate sequence can be found for any deterministic algorithm: simply let $y_t$ be the opposite of what the algorithm outputs given $y_1,\ldots,y_{t-1}$. The only remaining possibility is to search for a randomized algorithm. The "almost sure" part of (2.1) will thus be with respect to the algorithm's randomization, while the sequences are now deterministic. The roles have magically switched!

Let $q_t \in [0,1]$ denote the bias of the distribution from which we draw the randomized prediction $\hat{y}_t$. Let us present two methods that achieve the goal (2.1). The first method is defined with respect to a horizon $n$, which is subsequently doubled upon reaching it (the details will be provided later), and the distribution is defined as
$$q_t = \frac{\exp\left\{-n^{-1/2}\sum_{s=1}^{t-1}(1-y_s)\right\}}{\exp\left\{-n^{-1/2}\sum_{s=1}^{t-1} y_s\right\} + \exp\left\{-n^{-1/2}\sum_{s=1}^{t-1}(1-y_s)\right\}}$$
We do not expect that this randomized strategy means anything to the reader at this point. And if it does – the next one should not. Here is a method due to D. Blackwell. Let $L_{t-1}$ be the point in $[0,1]^2$ with coordinates $(\bar{y}_{t-1}, \bar{c}_{t-1})$, where $\bar{c}_{t-1} = 1 - \frac{1}{t-1}\sum_{s=1}^{t-1}\mathbb{I}\{\hat{y}_s \neq y_s\}$ is the proportion of correct predictions of the algorithm thus far. If $L_{t-1}$ is in the left or the right of the four triangles composing $[0,1]^2$ (see figure below), choose $q_t$ to be 0 or 1, respectively; otherwise draw a line from the center through $L_{t-1}$ and let $q_t$ be the value at which this line intersects the x-axis.

[Figure: the unit square $[0,1]^2$ with corners $(0,0)$ and $(1,1)$, the point $L_{t-1}$, and the value $q_t$ marked on the x-axis.]

Why does this method work? Does it come from some principled way of solving such problems? We defer the explanation of the method to Chapter 21, but, meanwhile, we hope that these brief algorithmic sketches have piqued the reader's interest.
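To make the first randomized strategy concrete, here is a minimal simulation sketch. It is only an illustration of the formula above: the horizon $n$ is fixed in advance (no doubling trick), and the particular test sequence, block structure, and seed are arbitrary choices.

```python
import numpy as np

# Minimal sketch of the randomized bit predictor defined above: at time t,
# predict 1 with probability q_t proportional to exp(-n^{-1/2} * (# of mistakes
# that "always predict 1" would have made so far)).  The horizon n is fixed in
# advance (no doubling trick), and the test sequence is an arbitrary choice.
rng = np.random.default_rng(0)
n = 10_000
eta = 1 / np.sqrt(n)

# an individual (non-i.i.d.) sequence: a mostly-ones block followed by a mostly-zeros block
y = np.concatenate([rng.binomial(1, 0.9, n // 2), rng.binomial(1, 0.2, n // 2)])

loss_one, loss_zero, mistakes = 0.0, 0.0, 0
for t in range(n):
    # loss_one = sum_{s<t} (1 - y_s), loss_zero = sum_{s<t} y_s
    w1, w0 = np.exp(-eta * loss_one), np.exp(-eta * loss_zero)
    q_t = w1 / (w0 + w1)                 # probability of predicting 1
    pred = rng.random() < q_t
    mistakes += (pred != y[t])
    loss_one += 1 - y[t]
    loss_zero += y[t]

print("average mistakes:      ", mistakes / n)
print("min{ybar, 1 - ybar}:   ", min(y.mean(), 1 - y.mean()))
```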

14 (1, 1)

Lt 1 (0, 0) qt can one hope to find a prediction method that achieves

For instance, for which functions $\phi_n : \{0,1\}^n \to \mathbb{R}$ can one hope to find a prediction method that achieves
$$\forall\, y_1,\ldots,y_n, \qquad \mathbb{E}\left[\frac{1}{n}\sum_{t=1}^n \mathbb{I}\{\hat{y}_t \neq y_t\}\right] \leq \phi_n(y_1,\ldots,y_n) \qquad (2.2)$$
In other words, what types of functions of the sequence upper bound the average number of mistakes an algorithm makes on that sequence? Of course, the smaller we can make this function, the better. But clearly $\phi_n \equiv 0$ is an impossibly difficult task (forcing us to make zero mistakes on any sequence) and $\phi_n \equiv 1$ is a trivial requirement achieved by any method. Hence, $\phi_n$ should be somewhere in between. If we guess the bit by flipping a coin, the expected number of mistakes is $1/2$, and thus $\phi_n \equiv 1/2$ is feasible too. What about the more interesting (non-constant) functions?

Let us focus only on those functions which are "stable" with respect to a coordinate flip:
$$|\phi_n(a) - \phi_n(a')| \leq 1/n \quad \text{for all } a, a' \in \{0,1\}^n \text{ with } \|a - a'\|_1 = 1. \qquad (2.3)$$
For such functions, the answer (due to T. Cover) might come as a surprise:

Proposition 2.1. For a stable (in the sense of (2.3)) function $\phi_n$, there exists an algorithm achieving (2.2) for all sequences if and only if

$$\mathbb{E}\,\phi_n(y_1,\ldots,y_n) \geq 1/2 \qquad (2.4)$$
where the expectation is under the uniform distribution on $\{0,1\}^n$.

Let us show the easy direction. Fix an algorithm that enjoys (2.2). Suppose now that $y_1,\ldots,y_n$ are taken to be unbiased coin flips. Since the decision $\hat{y}_t$ is made before $y_t$ is revealed, the expected loss is clearly $\mathbb{E}\,\mathbb{I}\{\hat{y}_t \neq y_t\} = 1/2$ for any algorithm. To guarantee (2.2), it had better be the case that (2.4) holds, as claimed. The other direction is rather unexpected: for any $\phi_n$ with the aforementioned property, there exists an algorithm that enjoys (2.2). We will show this in Section 25.2.

The above characterization is quite remarkable, as we only need to lower bound the expected value under the uniform distribution to ensure the existence of an algorithm. Roughly speaking, the characterization says that a function $\phi_n$ can be smaller than $1/2$ for some sequences, but it then must be compensated by allowing for more errors on other sequences. This opens up the possibility of targeting those sequences we expect to observe in the particular application. If we can engineer a function $\phi_n$ that is small on those instances, the prediction algorithm will do well in practice. Of course, if we do have the knowledge that the sequence is i.i.d., we may simply choose $\min\{\bar{y}_n, 1-\bar{y}_n\}$ as the benchmark. The ability to get good performance for non-i.i.d. sequences appears to be a powerful statement, and it foreshadows the development in these notes. We refer the reader to the exercises below to gain more intuition about the possible choices of $\phi_n$.

So far, we have considered prediction of binary sequences in a "vacuum": the problem has little to do with any phenomenon that we might want to study in the real world. While it served as a playground for the introduction of several concepts, such a prediction problem is not completely satisfying. One typically has some side information and prior knowledge about the situation, and these considerations will indeed give rise to the complexity and applicability of the methods discussed in the course.

At this point, we hope that the reader has more questions than answers. Where do these prediction algorithms come from? How does one develop them in a more complicated situation? Why doesn't the simple algorithm from the i.i.d. world work? Is there a real difference in terms of learning rates between individual sequence prediction and prediction with i.i.d. data? How far can the individual sequence setting be pushed in terms of applicability? These and many more questions will be addressed in the course.

We strongly encourage you to attempt these problems. If you solved all three, you are in good shape for the course!

Exercise (⋆): From the above characterization of $\phi_n$, conclude the existence of an algorithm that guarantees
$$\mathbb{E}\left[\frac{1}{n}\sum_{t=1}^n \mathbb{I}\{\hat{y}_t \neq y_t\}\right] \leq \min\{\bar{y}_n, 1-\bar{y}_n\} + C n^{-1/2}$$
for any sequence. Can we take any $C > 0$? Find a good (or nearly best) constant $C$. (Hint: consult known bounds on lengths of random walks.) Observe that $\mathbb{E}\min\{\bar{y}_n, 1-\bar{y}_n\}$ by itself is less than $1/2$ and thus the necessary additional term is a compensation for "fluctuations".

Exercise (⋆⋆): Suppose that for each $i = 1,\ldots,k$, $\phi_n^i : \{0,1\}^n \to \mathbb{R}$ satisfies (2.4) as well as the stability condition (2.3). What penalty should we add to
$$\min_{i \in \{1,\ldots,k\}} \phi_n^i$$
to make sure the new "best of $k$" complexity satisfies (2.4)? Verify (2.3) for the new function and conclude that there must exist an algorithm that behaves not much worse than the given $k$ prediction algorithms. (Hint: Use the stability property (2.3) together with McDiarmid's inequality (see Appendix, Lemma A.1) to conclude subgaussian tails for $|\mathbb{E}\phi_n^i - \phi_n^i|$. Use the union bound and integrate the tails to arrive at the answer.)

Exercise (⋆⋆⋆): Suppose you have a hunch that the sequence $y_1,\ldots,y_n$ you will encounter can be partitioned into $k$ parts with an imbalance of 0's and 1's within each part, but the endpoints of the segments are not known a priori. How can we leverage this information to get a better prediction method (if your hunch is correct)? Design a function $\phi_n$ that captures this prior knowledge for the best possible $k$-partition. Using the characterization above, prove that there exists an algorithm with overall prediction accuracy bounded by this function. (Hint: first, assume the partition is known and find the appropriate function that compensates for fluctuations within each interval.)

3 What are the Learning Problems?

Statistical Learning

Let us start with so-called supervised learning. Within this setting, data are represented by pairs of input and output (also called predictor and response) variables, belonging to some sets $\mathcal{X}$ and $\mathcal{Y}$, respectively. A training set, or a batch of data, will be denoted by

$$\{(X_t, Y_t)\}_{t=1}^n = (X^n, Y^n) \in \mathcal{X}^n \times \mathcal{Y}^n.$$
It is after observing this set that we hope to learn something about the relationship between elements of $\mathcal{X}$ and $\mathcal{Y}$.

For instance, $X_t$ can be a high-dimensional vector of gene expression for the $t$-th patient and $Y_t$ can stand for the presence or absence of diabetes. Such classification problems focus on binary "labels" $\mathcal{Y} = \{0,1\}$. Regression, on the other hand, focuses on real-valued outputs, while structured prediction is concerned with more complex spaces of outcomes. The main (though not exclusive) goal of Statistical Learning is prediction; that is, "learning" is equated with the ability to better predict the $y$-variable from the $x$-variable after observing the training data.

More specifically, Statistical Learning will refer to the following set of assumptions. We posit that the relationship between $x$ and $y$ is encoded by a fixed unknown distribution $P_{X\times Y}$. The observed data $\{(X_t, Y_t)\}_{t=1}^n$ are assumed to be drawn i.i.d. from this distribution. Furthermore, when the time comes to evaluate our learning "progress", it is assumed that the world "will not be different". That is, our predictions are compared against the same fixed distribution from which the initial data were sampled (or observed).

Both the i.i.d. and the "stationary world" assumptions are rather strong, and we will spend the second part of this course on relaxing them.

So, how is our learning performance evaluated? Suppose, after observing the data, we can summarize the "learned" relationship between $X$ and $Y$ via a hypothesis $\hat{y} : \mathcal{X} \to \mathcal{Y}$. We may think of $\hat{y}$ as a "decision" from the set $\mathcal{D}$ of all functions mapping inputs to outputs. We then consider the average error in predicting $Y$ from $X$ based on this decision:

$$\mathbb{E}\left[\ell(\hat{y},(X,Y)) \,\middle|\, X^n, Y^n\right] \qquad (3.1)$$
where the expectation is over the random draw of $(X,Y)$ according to $P_{X\times Y}$, independently of $(X^n, Y^n)$, and $\ell : \mathcal{D} \times (\mathcal{X}\times\mathcal{Y}) \to \mathbb{R}$ is some loss function which measures the cost of the mistake. As an example, if we are learning to classify spam emails, the measure of performance is how well we predict the spam label of an email drawn at random from the population of all emails.

If $\mathcal{Y} = \{0,1\}$, we typically consider the indicator loss
$$\ell(\hat{y},(X,Y)) = \mathbb{I}\{\hat{y}(X) \neq Y\}$$
or its surrogates (defined in later lectures), and for regression problems it is common to study the square loss

$$\ell(\hat{y},(X,Y)) = (\hat{y}(X) - Y)^2$$
and the absolute loss $\ell(\hat{y},(X,Y)) = |\hat{y}(X) - Y|$.

Since the training data $\{(X_t, Y_t)\}_{t=1}^n$ are assumed to be a random draw, these examples might be misleading by chance. In this case, we should not penalize the learner. Indeed, a more reasonable goal is to ask that the average

$$\mathbb{E}\,\ell(\hat{y},(X,Y)) = \mathbb{E}\left\{\mathbb{E}\left\{\ell(\hat{y},(X,Y)) \,\middle|\, X^n, Y^n\right\}\right\} \qquad (3.2)$$
of (3.1) under the draw of training data be small. Alternatively, the goal might be to show that (3.1) is small with high probability. It is important to keep in mind that $\hat{y}$ is random, as it depends on the data. We will not write this dependence explicitly, but the hat should remind us of the fact.

If $P_{X\times Y}$ were known, finding $\hat{y}$ that minimizes $\mathbb{E}\,\ell(\hat{y},(X,Y))$ would be a matter of numerical optimization. But then there is no learning to be done, as the sought-after relationship between $X$ and $Y$ is known. It is crucial that the distribution is not available to us and the only information we receive about it is through the training sample. What do we do then? Well, as stated, the task of producing $\hat{y}$ with a small error (3.2) is insurmountable in general. Indeed, we mentioned that there is no universal learning method, and we just asked for exactly that! Assumptions or prior knowledge are needed.

Where do we encode this prior knowledge about the task? This is where Statistical Learning (historically) splits from classical Statistics. The latter would typically posit a particular form of the distribution $P_{X\times Y}$, called a statistical model. The goal is then to estimate the parameters of the model. Of course, one can make predictions based on these estimates, and we discuss this (plug-in) approach later. However, it is not necessary to perform the estimation step if the end goal is prediction. Indeed, it can be shown quite easily (see [21]) that the goal of estimation is at least as hard as the goal of prediction for the problem of classification. For regression, the relationship between estimation and prediction with squared loss amounts to the study of well-specified and misspecified models and will be discussed in the course.

In contrast to the modeling approach, the classical work of Vapnik and Chervonenkis is of a distribution-free nature. That is, the prior knowledge does not enter in the form of an assumption on $P_{X\times Y}$, and our learning method should be "successful" for all distributions. Instead, the practitioner's "inductive bias" is realized by selecting a class $\mathcal{F}$ of hypotheses $\mathcal{X} \to \mathcal{Y}$ that is believed to explain well the relationship between the variables. There is not much room to circumvent the problem of no universal learning algorithm! Requiring the method to be successful for all distributions means that the very notion of "successful" has to include the prior knowledge $\mathcal{F}$. A natural requirement is to minimize the loss relative to $\mathcal{F}$:

$$\mathbb{E}\,\ell(\hat{y},(X,Y)) - \inf_{f \in \mathcal{F}} \mathbb{E}\,\ell(f,(X,Y)) \qquad (3.3)$$
Such a performance metric is called regret, and the loss function downshifted by the best in the class $\mathcal{F}$ is called excess loss. Hence, "learning" is defined as the ability to provide a summary $\hat{y} : \mathcal{X} \to \mathcal{Y}$ of the relationship between $X$ and $Y$ such that the expected loss is competitive with the loss of the best "explanation" within the class $\mathcal{F}$. Generally, we do not require $\hat{y}$ itself to lie in $\mathcal{F}$, yet the learning algorithm that produces $\hat{y}$ will certainly depend on the benchmark set. We shall sometimes call the class $\mathcal{F}$ the "comparator class" and the term we subtract off in (3.3) the "comparator term". Such a competitive framework is popular in a variety of fields and presents an alternative to the modeling assumption. It can be argued that the distribution-free formulation in (3.3) should be used with a "growing" set $\mathcal{F}$, an idea formalized by Structural Risk Minimization and the method of sieves. A related idea is to estimate from data the parameters of $\mathcal{F}$ itself. Such methods will be discussed in the lecture on model selection towards the end of the course. In the next lecture, we will discuss at length the merits and drawbacks of the distribution-free formulation. For further reading on Statistical Learning Theory, we refer the reader to [12, 3, 38].

Let us also mention that, independently of the developments by Vapnik and Chervonenkis, a learning framework was introduced in 1984 by Valiant within the computer science community. Importantly, the emphasis was made on polynomially computable learning methods, in the spirit of Theoretical Computer Science. This field of Computational Learning Theory started as a distribution-dependent (that is, non-distribution-free) study of classification. Fix a collection $\mathcal{F}$ of mappings $\mathcal{X} \to \{0,1\}$. Suppose we can make the very strong prior knowledge assumption that one of the functions in $\mathcal{F}$ in fact exactly realizes the dependence between $X$ and the label $Y$. Since $(X,Y)$ is a draw from $P_{X\times Y}$, the assumption
$$Y = f(X) \quad \text{for some } f \in \mathcal{F} \qquad (3.4)$$
translates into the assumption that the conditional distribution $P_{Y|X=a} = \delta_{f(a)}$. It is then possible to characterize classes $\mathcal{F}$ for which the probability of error

$$P(\hat{y}(X) \neq f(X)) = \mathbb{E}\,|\hat{y}(X) - f(X)| \qquad (3.5)$$
can be made small for some learning mechanism $\hat{y}$. This setting is termed realizable and forms the core of the original PAC (Probably Approximately Correct) framework of Valiant [53]. Of course, this is not a distribution-free setting. It was the papers of Haussler [25] and then Kearns et al. [32] that extended the PAC framework to the distribution-free case (termed agnostic). In an intermediate setting between realizable and agnostic, one assumes label noise; that is, given $X$, the binary label $Y$ is "often" equal to $f(X)$, except for some cross-over probability:

$$\exists f \in \mathcal{F} \text{ such that for any } x \in \mathcal{X}, \quad P_{Y|X}(Y \neq f(x) \mid X = x) = \eta < 1/2 \qquad (3.6)$$
This setting is very close to a modeling assumption made in Statistics, as discussed below. Under the condition (3.6), the probability of error (3.5) is a reasonable quantity to consider.
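Before moving on, here is a minimal numerical sketch of the competitive formulation (3.3): pick, from a finite comparator class $\mathcal{F}$, the hypothesis with the smallest empirical error (empirical risk minimization, which reappears as the SAA approach at the end of this chapter), and compare its risk to the best risk within $\mathcal{F}$. The class of thresholds, the data distribution, the label noise level, and the sample sizes below are arbitrary illustrative choices, not anything prescribed by the text.

```python
import numpy as np

# Minimal sketch of the excess-loss viewpoint (3.3): empirical risk minimization
# over a finite class F of threshold classifiers, compared to the best risk in F.
rng = np.random.default_rng(0)

def sample(m):
    X = rng.uniform(0, 1, m)
    Y = (X > 0.6).astype(int)
    Y = np.where(rng.random(m) < 0.1, 1 - Y, Y)     # 10% label noise
    return X, Y

thresholds = np.linspace(0, 1, 101)                 # F = {x -> I{x > theta}}
Xtr, Ytr = sample(200)
empirical_risk = [np.mean((Xtr > th).astype(int) != Ytr) for th in thresholds]
yhat = thresholds[int(np.argmin(empirical_risk))]   # empirical risk minimizer

Xte, Yte = sample(100_000)                          # large test set as a proxy for the expectation

def test_risk(th):
    return np.mean((Xte > th).astype(int) != Yte)

print("risk of ERM:        ", test_risk(yhat))
print("best risk within F: ", min(test_risk(th) for th in thresholds))
```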

General Setting of Learning

Let us briefly mention a more general framework. Let us remove the assumption that data are of the form $\mathcal{X} \times \mathcal{Y}$ and instead write them as elements of some abstract set $\mathcal{Z}$. This setting includes so-called unsupervised learning tasks such as clustering and density estimation. Given the data $\{Z_t\}_{t=1}^n$, the learning method is asked to summarize what it has learned by some element $\hat{y} \in \mathcal{D}$, yet we do not require $\mathcal{D}$ to be a class of functions $\mathcal{X} \to \mathcal{Y}$, and instead treat it as an abstract set. The quality of the prediction is assessed through a loss function $\ell : \mathcal{D} \times \mathcal{Z} \to \mathbb{R}$, and we assume a fixed unknown distribution $P_Z$ on $\mathcal{Z}$. As before, the data are assumed to be an i.i.d. sample from $P_Z$, and the performance measure

$$\mathbb{E}\,\ell(\hat{y}, Z)$$
is evaluated under the random draw $Z \sim P_Z$.

The setting we described is almost too general. Indeed, the problem of finding $\hat{y}$ such that $\mathbb{E}\,\ell(\hat{y}, Z)$ is small is considered in such areas as Statistical Decision Theory, Game Theory, and Stochastic Optimization. One can phrase many different frameworks under any one of these umbrellas, and our aim is simply to make the connections clear.

Nonparametric Regression

Nonparametric regression with fixed design assumes that the data $\{(X_t, Y_t)\}_{t=1}^n \in (\mathcal{X}\times\mathcal{Y})^n$ consist of predictor-response pairs, where the $X_t$'s are fixed a priori (e.g. evenly spaced on the interval) and $Y_t = f(X_t) + \epsilon_t$, for independent zero-mean noise $\epsilon_t$ and some function $f : \mathcal{X} \to \mathcal{Y}$. For random design, we assume (as in Statistical Learning Theory) that the $X_t$'s are random and i.i.d. from some marginal distribution $P_X$, either known or unknown to us. Then, given $X_t$, we assume $Y_t = f(X_t) + \epsilon_t$, and this defines a joint distribution $P^f_{X\times Y} = P_X \times P^f_{Y|X}$ parametrized by $f$.

Given the data, let $\hat{y}$ summarize what has been learned. In statistical language, we construct an estimate $\hat{y}$ of the unknown $f$ parametrizing the distribution $P^f_{X\times Y}$. The typical performance measure is the loss $\mathbb{E}(\hat{y}(X) - f(X))^2$ or some other $p$-norm. Instead of integrating with respect to the marginal $P_X$, the fixed-design setting only measures $(\hat{y}(X) - f(X))^2$ on the design points.

More generally, the notion of loss can be defined through a function $\ell : \mathcal{D} \times \mathcal{F} \to \mathbb{R}$, such as some notion of a distance between functions or parameters [58]. The goal then is to ensure that

$$\mathbb{E}\,\ell(\hat{y}, f) = \mathbb{E}\left\{\mathbb{E}\left\{\ell(\hat{y}, f) \,\middle|\, X^n, Y^n\right\}\right\} \qquad (3.7)$$
is small. Clearly, the study of (3.7) needs to depend on properties of $\mathcal{F}$, which embodies the prior assumptions about the problem. This prior knowledge is encoded as a distributional assumption on $P^f_{X\times Y}$, in a similar way to the "label noise setting" of PAC learning discussed above. In the next lecture, we will make the distinction between this setting and Statistical Learning even more precise.

Of course, this simple sketch cannot capture the rich and interesting field of nonparametric statistics. We refer to [52] and [57] for thorough and clear expositions. These books also study density estimation, which we briefly discuss next.
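Before turning to density estimation, here is a minimal simulation sketch of the fixed-design setting above. The regression function, the noise level, and the simple local-averaging estimator are arbitrary illustrative choices, not anything prescribed by the text.

```python
import numpy as np

# Minimal sketch of fixed-design regression: Y_t = f(X_t) + eps_t on an evenly
# spaced grid, with the error measured only at the design points.
rng = np.random.default_rng(0)
n = 500
X = np.linspace(0, 1, n)                       # fixed design
f = lambda x: np.sin(2 * np.pi * x)            # unknown regression function (illustrative choice)
Y = f(X) + 0.3 * rng.standard_normal(n)

window = 15                                    # local-averaging bandwidth, in grid points
yhat = np.array([Y[max(0, t - window): t + window + 1].mean() for t in range(n)])

fixed_design_error = np.mean((yhat - f(X)) ** 2)
print("average squared error at the design points:", fixed_design_error)
```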

Density Estimation

Suppose that we observe i.i.d. data $\{Z_t\}_{t=1}^n \in \mathcal{Z}^n$ from a distribution $P^f_Z$ with a density $f$. The goal is to construct an estimate $\hat{y} : \mathcal{Z} \to \mathbb{R}$ of this unknown density. The error of the estimate is measured by, for instance, the integrated squared error (under the Lebesgue measure)
$$\mathbb{E}\left\{\int \left(\hat{y}(z) - f(z)\right)^2 dz\right\}$$
or via the Kullback-Leibler divergence
$$\mathbb{E}\,\ell(\hat{y}, f) = \int f(z)\log\frac{f(z)}{\hat{y}(z)}\, dz = \mathbb{E}\left\{\log\frac{f(Z)}{\hat{y}(Z)}\right\} = \mathbb{E}\left\{(-\log \hat{y}(Z)) - (-\log f(Z))\right\}.$$
The KL divergence as a measure of quality of the estimator corresponds to the (negative) log loss $\ell(f, z) = -\log f(z)$, which is central to problems in information theory. Once again, the construction of optimal estimators depends on the particular characteristics that are assumed about $f$, and these are often embodied via the distribution-dependent assumption $f \in \mathcal{F}$. The characteristics that lead to interesting statements are often related to the smoothness of the densities in $\mathcal{F}$.

Summary: Let us discuss the main characteristics of the settings considered so far. First, they all assume a probabilistic nature of the data.

Moreover, the probability distribution is typically assumed to be fixed in time. An i.i.d. sample is assumed to be available all at once as a batch, and the "performance" is assessed through $\hat{y}$ with respect to the unknown distribution.

We now move beyond these "static" scenarios. In sequential problems we typically "learn" continuously as we observe more and more data, and there is a greater array of problems to consider. One important issue to look out for is whether the performance is measured throughout the sequence or only at the end. Problems also differ according to the impact the learner has through his actions. Yet another aspect is the amount of information or feedback the learner receives. Indeed, sequential problems offer quite a number of new challenges and potential research directions.

Universal Data Compression and Conditional Probability Assignment

In this scenario, suppose that a stream $z_1, z_2, \ldots$ is observed. For simplicity, suppose that all $z_t$ take on values in a discrete set $\mathcal{Z}$. There are two questions we can ask: (1) how can we sequentially predict each $z_t$ having observed the prefix $\{z_s\}_{s=1}^{t-1}$? and (2) how can we compress this sequence? It turns out that these two questions are very closely related.

As in the setting of density estimation, suppose that the sequence is generated according to some distribution $f$ on finite (or infinite) sequences in $\mathcal{Z}^n$ (or $\mathcal{Z}^\infty$),¹ with $f$ in some given set of distributions $\mathcal{F}$. Once again, $\mathcal{F}$ embodies our assumption about the data-generating process, and the problem presented here is distribution-dependent. Later, we will introduce the analogue of universal prediction for individual sequences, where the prior knowledge $\mathcal{F}$ will be moved to the "comparator" in a way similar to distribution-free Statistical Learning.

So, how should we measure the quality of the "learner" in this sequential problem? Since the $Z_t$'s are themselves random draws, we might observe unusual sequences, and should not be penalized for not being able to predict them. Hence, the measure of performance should be an expectation over the possible sequences.

¹ Note the slight abuse of notation: previously, we used $P^f$ to denote the distribution on $(X,Y)$ with the mean function $f$, then we used $f$ to denote the unknown density of $Z$, and now we use $f$ to stand for the distribution itself. Hopefully, this should not cause much confusion.

On each round, we can summarize what we have learned so far from observing $Z^{t-1}$ via a conditional distribution $\hat{y}_t \in \mathcal{D} = \Delta(\mathcal{Z})$ — hence the name "conditional probability assignment". A natural measure of error at time $t$ is then the expected log-ratio
$$\mathbb{E}\left\{\log\frac{f(Z \mid Z^{t-1})}{\hat{y}_t(Z)} \,\middle|\, Z^{t-1}\right\}$$
or some other notion of distance between the actual and the predicted distribution. When the mistakes are averaged over a finite horizon $n$, the measure of performance becomes
$$\frac{1}{n}\sum_{t=1}^n \mathbb{E}\left\{\log\frac{f(Z \mid Z^{t-1})}{\hat{y}_t(Z)} \,\middle|\, Z^{t-1}\right\} = \mathbb{E}\left\{\frac{1}{n}\log\frac{f(Z^n)}{\hat{y}(Z^n)}\right\}, \qquad (3.8)$$
where $\hat{y}$ can be thought of either as a joint distribution over sequences, or as a collection of $n$ conditional distributions for all possible prefixes. If data are i.i.d., the measure of performance becomes an averaged version of the one introduced for density estimation with KL divergence.

There are two main approaches to ensuring that the expected loss in (3.8) is small: (1) the plug-in approach involves estimating at each step the unknown distribution $f \in \mathcal{F}$ and using it to predict the next element of the sequence; (2) the mixture approach involves a Bayesian-type averaging of all distributions in $\mathcal{F}$. It turns out that the second approach is superior, a fact that is known in statistics as suboptimality of selectors, or suboptimality of plug-in rules. We refer the interested reader to the wonderful survey of Merhav and Feder [40]. In fact, the reader will find that many of the questions asked in that paper are echoed (and sometimes answered) throughout the present manuscript.
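As a small illustration of the mixture approach in the simplest binary case, here is a sketch of the Laplace "add-one" rule, which is the conditional probability assignment obtained by mixing Bernoulli sources under a uniform prior on the bias, together with a Monte Carlo estimate of (3.8). The source bias, horizon, and number of repetitions are arbitrary choices.

```python
import numpy as np

# Laplace "add-one" rule as a conditional probability assignment for a binary
# i.i.d. source, with a Monte Carlo estimate of the averaged log-loss (3.8).
rng = np.random.default_rng(0)
p, n, reps = 0.7, 1000, 200

total_log_ratio = 0.0
for _ in range(reps):
    z = rng.binomial(1, p, n)
    ones = 0
    for t in range(n):
        q_one = (ones + 1) / (t + 2)           # predicted probability of z_t = 1
        q_t = q_one if z[t] == 1 else 1 - q_one
        f_t = p if z[t] == 1 else 1 - p        # true conditional probability of z_t
        total_log_ratio += np.log(f_t / q_t)
        ones += z[t]

print("estimate of (1/n) E log f(Z^n)/yhat(Z^n):", total_log_ratio / (reps * n))
```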

Prediction of Individual Sequences

Let us discuss a general setting of sequential prediction, and then particularize it to an array of problems considered within the fields of Statistics, Game Theory, Information Theory, and Computer Science.

At an abstract level, suppose we observe a sequence $z_1, z_2, \ldots \in \mathcal{Z}$ of data and need to make decisions $\hat{y}_1, \hat{y}_2, \ldots \in \mathcal{D}$ on the fly. Suppose we would like to lift the assumption that the sequence being observed is generated from some distribution from a given family, an assumption we made in the setting of universal lossless data compression.

When we go to such a distribution-free setting for individual sequences, it actually no longer makes much sense to define the per-round loss as $\mathbb{E}\,\ell(\hat{y}_t, Z)$ with the expectation over $Z$ according to some conditional distribution. In fact, let us assume that there is no distribution governing the evolution of $z_1, z_2, \ldots$. The sequence is then called individual.² This surprising twist of events might be difficult for statisticians to digest, and we refer to the papers of Phil Dawid on prequential statistics for more motivation. The (weak) prequential principle states that "any criterion for assessing the agreement between Forecaster and Nature should depend only on the actual observed sequences and not further on the strategies which might have produced these" [20]. If there is only one sequence, then how do we evaluate the learner's performance?

A sensible way is to score the learner's instantaneous loss on round $t$ by $\ell(\hat{y}_t, z_t)$, where the decision $\hat{y}_t$ must be chosen on the basis of $z^{t-1}$, but not $z_t$. Averaging over $n$, the overall measure of performance on the given sequence is

$$\frac{1}{n}\sum_{t=1}^n \ell(\hat{y}_t, z_t). \qquad (3.9)$$
Just as in our earlier exposition on statistical learning, it is easy to show that making the above expression small is an impossible goal even in simple situations. Some prior knowledge is necessary, and two approaches can be taken to make the problem reasonable. As we describe them below, notice that they echo the assumptions of "correctness of the model" and the alternative competitive analysis of statistical learning.

Assumption of Realizability. Let us consider the supervised setting: $\mathcal{Z} = \mathcal{X}\times\mathcal{Y}$. We make the assumption that the sequence $z_1,\ldots,z_n$ is only "in part" individual. That is, the sequence $x_1,\ldots,x_n$ is indeed arbitrary, yet $y_t$ is given by $y_t = f(x_t)$ for some $f \in \mathcal{F}$. An additional "label noise" assumption has also been considered in the literature. Under the "realizability" assumption, the goal of minimizing (3.9) is feasible, as will be discussed later in the course. However, a much richer setting is the following.

² Strictly speaking, the term individual sequence, coming from information theory, refers to binary or finite-valued sequences. We use it more liberally for any sequences taking values in some abstract set $\mathcal{Z}$.

Competitive Analysis. This is the most studied setting, and the philosophy is in a sense similar to that of going from distribution-dependent to distribution-free learning. Instead of assuming that some $f \in \mathcal{F}$ governs the evolution of the sequence, we push the assumption into the comparator term (the term we subtract off):

$$\text{Regret} = \frac{1}{n}\sum_{t=1}^n \ell(\hat{y}_t, z_t) - \inf_{f \in \mathcal{F}} \frac{1}{n}\sum_{t=1}^n \ell(f, z_t) \qquad (3.10)$$
That is, the average loss of the learner is judged against a yardstick, which is the best fixed decision from the set $\mathcal{F}$ that we could have "played" for the $n$ rounds. The reader has probably noticed that we started to use game-theoretic lingo. Indeed, the scenario has a very close relation to game theory. More generally, we can define regret with respect to a set $\Pi$ of strategies, where each strategy $\pi \in \Pi$ is a sequence of mappings $\pi_t$ from the past to the set of decisions $\mathcal{F}$:

$$\text{Regret} = \frac{1}{n}\sum_{t=1}^n \ell(\hat{y}_t, z_t) - \inf_{\pi \in \Pi} \frac{1}{n}\sum_{t=1}^n \ell(\pi_t(z^{t-1}), z_t) \qquad (3.11)$$
From this point of view, (3.10) is regret with respect to a set of constant strategies.

The reader should take a minute to ponder upon the meaning of regret. If we care about our average loss (3.9) being small and we are able to show that regret as defined in (3.11) is small, we would be happy if we knew that the comparator loss is in fact small. But when is it small? When the sequences are well predicted by some strategy from the set $\Pi$. Crucially, we are not assuming that the data are generated according to a process related in any way to $\Pi$! All we are saying is that, if we can guarantee smallness of regret for all sequences and if the comparator term captures the nature of the sequences we observe, our average loss is small. Nevertheless, it is important that the bound on regret that is proved for all individual sequences will hold... well... for all sequences. Just the interpretation might be unsatisfactory.

Let's see why the above discussion is similar to passing from distribution-dependent statistical learning to distribution-independent learning (with the redefined goal as in (3.3)). If we start a paper with the line "Assume that the data are $Y_t = f(X_t) + \epsilon_t$ for some $f \in \mathcal{F}$", then whatever we prove is invalidated if the model is not correct (in the statistical language: misspecified). It is then unclear how badly the performance degrades as the assumption starts to become violated. On the other hand, putting the prior knowledge into the comparator term and asking for the method to hold for all distributions does not suffer from the problem of starting with a wrong assumption. There is sometimes a price to pay, however, for moving the prior knowledge into the comparator term. The upper bounds one gets can be more conservative in general. How much more conservative will be a subject of interest in this course. Furthermore, the duality between assuming a form of a data-generating process versus competing with the related class of predictors is very intriguing and has not been studied much. We will phrase this duality precisely and show a number of positive results.

Summarizing, it should now be clear what the advantages and disadvantages of the regret formulation are: we have a setup of sequential decision-making with basically no assumptions that would invalidate our result, yet smallness of regret is "useful" whenever the comparator term is small. On the downside, protection against all models leads to more conservative results than one would obtain by making a distributional assumption.

For a large part of the course we will study the regret notion (3.10) defined with respect to a single fixed decision. Hence, the upper bounds we prove for regret will hold for all sequences, but will be more useful for those sequences on which a single decision is good on all the rounds. What are such sequences? There is definitely a flavor of stationarity in this assumption, and surely the answer depends on the particular form of the loss $\ell$. We will also consider regret of the form

$$\text{Regret} = \frac{1}{n}\sum_{t=1}^n \ell(\hat{y}_t, z_t) - \inf_{(g_1,\ldots,g_n)} \frac{1}{n}\sum_{t=1}^n \ell(g_t, z_t) \qquad (3.12)$$
where the infimum is taken over "slowly changing" sequences. This can be used to model an assumption of a non-stationary but slowly changing environment, without ever assuming that the sequences come from some distribution. It should be noted that it is impossible in general to have an average loss comparable to that of the best unrestricted sequence $(g_1,\ldots,g_n)$ of optimal decisions that changes on every round.³

³ Some algorithms studied in computer science do enjoy competitive-ratio type bounds that involve comparison with the best offline method — more on this later.

The comprehensive book of Cesa-Bianchi and Lugosi [16] brings together many of the settings of prediction of individual sequences, and should be consulted as supplementary reading. It is, in fact, after picking up this book in 2007 that our own interest in the field was sparked. The following question was especially bothersome: why do upper bounds on regret defined in (3.10) look similar to those for statistical learning with i.i.d. data? After all, there are no probabilistic assumptions placed on the individual sequences! This course will, in particular, address this question. We will show the connections between the two scenarios, as well as the important differences.

Let us mention that we will study an even more difficult situation than described so far: the sequence will be allowed to be picked not before the learning process, but during the process. In other words, the sequence will be allowed to change depending on the intermediate "moves" (or decisions) of the learner. This is a good model for learning in an environment on which the learner's actions have an effect. We need to carefully define the rules for this, so as to circumvent the so-called Cover's impossibility result. Since the sequence will be allowed to evolve in the worst-case manner, we will model the process as a game against an adaptive adversary. A good way to summarize the interaction between the learner and the adversary (environment, Nature) is by specifying the protocol:

Sequential prediction with adaptive environment

At each time step $t = 1$ to $n$,
• Learner chooses $\hat{y}_t \in \mathcal{D}$ and Nature simultaneously chooses $z_t \in \mathcal{Z}$
• Player suffers loss $\ell(\hat{y}_t, z_t)$ and both players observe $(\hat{y}_t, z_t)$

In the case of a non-adaptive adversary, the sequence $(z_1,\ldots,z_n)$ is fixed before the game and is revealed one by one to the player.

We remark that the name individual sequence typically refers to non-adaptive adversaries, yet we will use the name for both adaptive and non-adaptive scenarios. There is yet another word we will abuse throughout the course: "prediction". Typically, prediction refers to the supervised setting where we are trying to predict some target response $Y$. We will use prediction in a rather loose sense, and synonymously with "decision making": abstractly, we "predict" or "play" or "make a decision" or "forecast" $\hat{y}_t$ on round $t$ even if the decision space $\mathcal{D}$ has nothing to do with the outcome space $\mathcal{Z}$. Finally, we remark that "online learning" is yet another name often used for sequential decision-making with the notion of regret.

Online Convex Optimization

Let us focus on regret against the best fixed comparator, as defined in (3.10), and make only one additional assumption: $\ell(\hat{y}, z)$ is convex in the first argument.

Since no particular dependence on the second argument is assumed, we might as well slightly abuse the notation and equivalently rewrite $\ell(\hat{y}, z)$ as $\ell_z(\hat{y})$ or even $z(\hat{y})$, where $z : \mathcal{D} \to \mathbb{R}$ is a convex function and $\mathcal{D} = \mathcal{F}$ is a convex set. The latter notation makes it a bit more apparent that we are stepping into the world of convex optimization. Regret can now be written as
$$\text{Regret} = \frac{1}{n}\sum_{t=1}^n z_t(\hat{y}_t) - \inf_{f \in \mathcal{F}} \frac{1}{n}\sum_{t=1}^n z_t(f) \qquad (3.13)$$
where the sequence $z_1,\ldots,z_n$ is individual (also called worst-case), and each $z_t(f)$ is a convex function.

At first sight, the setting seems restricted, yet it turns out that the majority of known regret-minimization scenarios can be phrased this way. We briefly mention that methods from the theory of optimization, such as gradient descent and mirror descent, can be employed to achieve small regret, and these methods are often very computationally efficient. These algorithms are becoming the methods of choice for large-scale problems, and are used by companies such as Google. When data are abundant, it is beneficial to process them in a stream rather than as a batch, which means we are in the land of sequential prediction. Being able to relax distributional assumptions is also of great importance for present-day learning problems. It is not surprising that online convex optimization has been a "hot" topic over the past 10 years.

The second part of this course will focus on methods for studying regret in convex and non-convex situations. Some of our proofs will be algorithmic in nature, some nonconstructive. The latter approach is particularly interesting because the majority of the results in the literature so far have been of the first type.
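To make the gradient-descent remark concrete, here is a minimal sketch of online projected gradient descent for (3.13) with linear costs over the unit Euclidean ball; the method itself is developed in Chapter 20, and the dimension, horizon, step size, and cost sequence below are arbitrary illustrative choices.

```python
import numpy as np

# Minimal sketch of online (projected) gradient descent for (3.13): play yhat_t,
# observe the convex cost z_t(f) = <c_t, f>, and take a gradient step.  F is the
# unit Euclidean ball; the cost sequence here is generated obliviously.
rng = np.random.default_rng(0)
d, n = 5, 2000
eta = 1 / np.sqrt(n)

def project(f):                         # projection onto the unit Euclidean ball F
    norm = np.linalg.norm(f)
    return f if norm <= 1 else f / norm

costs = rng.uniform(-1, 1, size=(n, d))
yhat = np.zeros(d)
cumulative_loss = 0.0
for c in costs:
    cumulative_loss += c @ yhat
    yhat = project(yhat - eta * c)      # gradient of <c, .> is c

best_fixed = -np.linalg.norm(costs.sum(axis=0))   # inf over the unit ball of <sum_t c_t, f>
print("average regret:", (cumulative_loss - best_fixed) / n)
```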

Multiarmed Bandits and Partial Information Games

The exploration-exploitation dilemma is a phenomenon usually associated with situations where an action that brings the most "information" does not necessarily yield the best performance. This dilemma does not arise when the aim is regret minimization and the outcome $z_t$ is observed after predicting $\hat{y}_t$. Matters become more complicated when only partial information about $z_t$ is observed, and a proper exploration-exploitation tradeoff is often key. The setting is often called "bandit" or "partial feedback".

In the original formulation of Robbins [47], the learner is faced with $k$ decisions (arms of a multi-armed bandit), each producing a stochastic reward if chosen. A choice of arm $i$ at time step $t$ results in a random reward $r_t$ drawn independently from the distribution $p_i$ with support on $[0,1]$ and mean $\mu_i$. The goal is to minimize the regret of not knowing the best arm in advance: $\max_{i \in \{1,\ldots,k\}} \mu_i\, n - \sum_{t=1}^n r_t$. An optimal algorithm for this problem has been exhibited by Lai and Robbins [34]. Switching to losses instead of rewards, the goal can be equivalently written as minimization of the expected regret

$$\mathbb{E}\left\{\frac{1}{n}\sum_{t=1}^n \ell(\hat{y}_t, z_t)\right\} - \inf_{f \in \mathcal{F}} \mathbb{E}\left\{\frac{1}{n}\sum_{t=1}^n \ell(f, z_t)\right\}$$
where $\ell(\hat{y}, z) = \langle \hat{y}, z\rangle$, $\mathcal{D} = \mathcal{F} = \{e_1,\ldots,e_k\}$ is the set of standard basis vectors, and the expectation is over i.i.d. draws of $z_t \in [0,1]^k$ according to the product of one-dimensional reward distributions $p_1 \times \ldots \times p_k$. Crucially, the decision-maker only observes the loss $\ell(\hat{y}, z)$ (that is, one coordinate of $z$) upon choosing an arm.

The setting is the most basic regret-minimization scenario for i.i.d. data where the exploration-exploitation dilemma arises. The dilemma would not be present had we defined the goal as that of providing the best arm at the end, or had we observed the full reward vector $z_t$ at each step.

Interestingly, one can consider the individual sequence version of the multi-armed bandit problem. It is not difficult to formulate the analogous goal: minimize the regret
$$\mathbb{E}\left\{\frac{1}{n}\sum_{t=1}^n \ell(\hat{y}_t, z_t) - \inf_{f \in \mathcal{F}} \frac{1}{n}\sum_{t=1}^n \ell(f, z_t)\right\}$$
where $\ell(\hat{y}, z) = \langle \hat{y}, z\rangle$, the learner only observes the value of the loss, and the sequence $z_1,\ldots,z_n$ is arbitrary. Surprisingly, it is possible to develop a strategy for minimizing this regret. Generalizations of this scenario have been studied recently, and we may formulate the following protocol:

Bandit online linear optimization

At each time step t 1 to n, = • Learner chooses y D and Nature simultaneously chooses z Z bt ∈ t ∈ • Player suffers loss `(y ,z ) ­y ,z ® and the learner only observes bt t = bt t this value.

31 One can also move from online linear to online convex optimization for even more generality, yet the question of optimal rates here is still open. In the stochastic setting with a fixed i.i.d. distribution for rewards, however, we may formulate the problem as minimizing regret

( n ) 1 X E z(ybt ) inf z(f ) (3.14) n t 1 − f F = ∈ where z is an unknown convex function and the learner receives a random draw from a fixed distribution with mean z(ybt ) upon playing ybt . An optimal (in terms of the dependence on n) algorithm for this problem has been recently developed. Note that (3.14) basically averages the values of the unknown convex function z at the trajectory yb1,..., ybn and compares it to the minimum value of the func- tion over the set F. The reader will recognize this as a problem of optimization, but with the twist of measuring average error instead of the final error. This twist leads to the exploration-exploitation tradeoff which is absent if we can spend n it- erations gathering information about the unknown function and then output the final answer based on all the information.

Convex Optimization with Stochastic and Determinis- tic Oracles

Convex optimization is concerned with finding the minimum of an unknown con- vex function z(yb) over a set F, which we assume to be a convex set. The optimiza- tion process consists of repeatedly querying y ,..., y D F and receiving some b1 bn ∈ = information (from the Oracle) about the function at each query point. For deter- ministic optimization, this information is noiseless, while for stochastic optimiza- tion it is noisy. For the zero-th order optimization, the noiseless feedback to the learner (optimizer) consists of the value z(ybt ), while for stochastic optimization the feedback is typically a random draw from a distribution with mean z(ybt ). Sim- ilarly, first order noiseless information consists of a (sub)gradient of z at ybt , while the stochastic feedback provides this value only on average. The deterministic op- timization goal is to minimize z(ybn) inff F z(f ). In the stochastic case the goal − ∈ is, for instance, in expectation: ( ) E z(ybn) inf z(f ) . − f F ∈ 32 Once again, the difference between stochastic bandit formulation in (3.14) and stochastic optimization is in the way the goal is defined. A particularly interesting stochastic optimization problem is to minimize the convex function

z˜(y) E`(y, Z) (3.15) b = b Even if distribution is known, the integration becomes computationally difficult whenever Z Rd with even modest-size d [41]. The idea then is to generate an ⊂ i.i.d. sequence Z1,..., Zn and use it in place of the difficult-to-compute integral. Since given the random draws we still need to perform optimization, we suppose there is an oracle that returns random (sub)gradients G(yb, Z) given the query (yb, Z) such that EG(y, Z) ∂z˜(y) E∂ `(y, Z). b ∈ b = yb b Given access to such information about the function, two approaches can be con- sidered: stochastic approximation (SA) and sample average approximation (SAA). The SAA approach involves directly minimizing

n 1 X `(yb, Zt ) n t 1 = as a surrogate for the original problem in (3.15). The SA approach consists of tak- ing gradient descent steps of the type ybt 1 ybt ηG(ybt , Zt ) with averaging of the + = − final trajectory. We refer to [41] for more details. Note that we have come a full circle! Indeed, (3.15) is the problem of Statis- tical Learning that we started with, except we did not assume convexity of ` or F. But the connections go even further: we will see in a few lectures that the SAA approach is a natural method called “Empirical Risk Minimization”, while the SA method will make its appearance when we talk about sequential prediction prob- lems.

33 4 Example: Linear Regression

Linear regression is, arguably, the most basic problem that can be considered within the scope of statistical learning, classical regression, and sequential prediction. Yet, even in this setting, obtaining sharp results without stringent assumptions proves to be a challenge. Below, we will sketch similar guarantees for methods in these three different scenarios, and then make some unexpected connections. Us- ing the terminology introduced earlier, we consider the supervised setting, that is Z X Y, where Y R. Let F Bd be the unit ` ball in Rd . Each f Rd can be = × = = 2 2 ∈ thought of as a vector, or as a function f (x) f x, and we will move between the = · two representations without warning. Consider the square loss

`(f ,(x, y)) (f (x) y)2 (f x y)2 = − = · − as a measure of prediction quality. In order to show how various settings differ in their treatment of linear regression, we will make many simplifying assumptions along the way in order to present an uncluttered view. In particular, we assume that n is larger than d.

Classical

We start with the so-called fixed design setting, where x1,...,xn are assumed to be fixed and only Yt ’s are randomly distributed. Assume a linear model: there exists a d g ∗ R such that ∈

Y g ∗ x ² , (4.1) t = · t + t

34 2 for some independent zero-mean noise ²t with, say, bounded variance σ . The model is frequently written in the matrix form as

Y Xg ∗ ² = + n d where X R × the matrix with rows x , Y and ² the vectors with coordinates Y ∈ t t and ²t , respectively. Let 1 n ˆ X T Σ , xt xt n t 1 = be the covariance matrix of the design. Denoting by D the set of all functions X Y, the goal is to come up with an estimate y D such that → b ∈ ½ n ¾ 1 X 2 E (yb(xt ) g ∗(xt )) (4.2) n t 1 − = is small. With a slight abuse of notation, whenever y Rd , we will write the above b ∈ measure as 2 E y g ∗ . kb− kΣˆ In particular, let

n 1 X 2 1 ° °2 yb argmin (Yt g xt ) argmin °Y Xg° (4.3) = g Rd n t 1 − · = g Rd n − ∈ = ∈ be the ordinary least squares estimator, or the empirical risk minimizer, over Rd . Then µ 1 n ¶ 1 ˆ 1 X ˆ 1 T yb Σ− Yt xt Σ− X Y = n t 1 = n = assuming the inverse exists (and use pseudo-inverse otherwise). Multiplying (4.1) on both sides by xt and averaging over t, we find that

µ n ¶ 1 1 X g ∗ Σˆ − (Yt ²t )xt , = n t 1 − = and hence

° µ n ¶°2 · µ ¶ ¸ ° °2 ° 1 1 X ° 1 1 1 T E°y g ∗° E°Σˆ − ² x ° E ² XΣˆ − X ² b− Σˆ = ° t t ° = n t 1 Σˆ n n =

35 1 ˆ 1 T T Observe that the projection (or the hat) matrix n XΣ− X can be written as UU where U has orthonormal columns (orthonormal basis of the column space of X). Then the measure of performance (4.2) for ordinary least squares is

2 1 ° T °2 σ d E°U ²° n ≤ n The random design analysis is similar, and the O(d/n) rate can be obtained under additional assumptions on the distribution of X (see [28]).

CONCLUSION: In the setting of linear regression, we assume a linear relation- ship between predictor and response variables, and obtain an O(d/n) rate for both fixed and random design.

Statistical Learning

n In this setting we assume that {(Xt , Yt )}t 1 are i.i.d. from some unknown distribu- = tion PX Y . We place no prior assumption on the relationship between X and Y. In × particular, η(a) E[Y X a] is not necessarily a linear function in Rd . Recall that , | = the goal is to come up with yb such that ¡ ¢2 2 E yb X Y inf E(f X Y) (4.4) · − − f F · − ∈ is small. Before addressing this problem, we would like to motivate the goal of compar- ing prediction performance to that in F. Why not simply consider unconstrained minimization as in the previous section? Let

2 g ∗ argmin E(g X Y) (4.5) = g Rd · − ∈ d be the minimizer of the expected error over all of R , and yb be the ordinary least d squares, as in (4.3). For any g,g 0 R , using the fact that η(X) E[Y X], ∈ = | 2 2 2 2 E(g X Y) E(g 0 X Y) E(g X η(X)) E(g 0 X η(X)) . (4.6) · − − · − = · − − · − Further,

2 2 2 2 E(g X Y) E(g ∗ X Y) E(g X g ∗ X g ∗ X Y) E(g ∗ X Y) · − − · − = · − · + · − − · − 2 £ T ¤ E((g g ∗) X) 2E (g ∗ X Y)X (g g ∗) . (4.7) = − · + · − −

36 £ T ¤ The cross-term in (4.7) is zero since because E (XX )g ∗ YX 0 is the optimality 2 1/2− = condition for g ∗. Denoting the norm f (Ef (X) ) for any f : X R, equa- k kX , → tions (4.6) and (4.7) give, for any g Rd , the Pythagoras relationship ∈ 2 2 2 g η g ∗ η g g ∗ . k − kX − k − kX = k − kX Attempting to control excess loss over all of Rd is equivalent to finding an estimator 2 2 y that ensures convergence of y g ∗ (which is simply y g ∗ ) to zero, a b kb − kX kb − kΣ task that seems very similar to the one in the previous section. However, the key difference is that y is no longer a random variable necessarily centered at g ∗ X. · This makes the goal of estimating g ∗ difficult, and is only possible under some conditions. Hence, we would not be able to get a distribution-free result. Let us sketch the argument in [28]: we may write

1/2 1/2 1 1/2 £ 1/2 ¤ 1/2 1/2 £ 1/2 ¤ Σ (y g ∗) Σ Σˆ − Σ Eˆ Σ− X(η(X) g ∗ X) Σ Σˆ − Eˆ Σˆ − X(Y η(X)) b− = − · + −

A few can be made. First, by optimality of g ∗, E(X(η(X) g ∗ X)) 0. Of 1/2 1 1/2 − · = course, we also have E(Y η(X)) 0. Further, Σ Σˆ − Σ can be shown to be tightly − = concentrated around identity. We refer the reader to [28] for a set of conditions under which the above decomposition gives rise to O(d/n) rates. Instead of making additional distributional assumptions to make the unre- stricted minimization problem feasible, we can instead turn to the goal of proving upper bounds on (4.4). Interestingly, this objective requires somewhat different techniques. Let us denote by fˆ the empirical minimizer constrained to lie within the set F and f ∗ be the minimizer of the expected loss within F. As depicted in Figure ??, we may view fˆ and f ∗ as projections (although with respect to differ- ent norms) of the unconstrained optima yb and g ∗, respectively, onto F. The pro- jections may be closer than the unconstrained minimizers, although by itself this argument is not enough since the inequalities are in the opposite direction:

2 2 2 ¡ ¢2 2 f f ∗ f η f ∗ η E f X Y E(f ∗ X Y) k − kX ≤ k − kX − k − kX = · − − · − for any f F (Exercise). With tools developed towards the end of the course, we ∈ will be able to show that excess loss (4.4) in the distribution-free setting of Statisti- cal Learning is indeed upper bounded by O(d/n). We conclude this section by observing that the problem (4.4) falls under the purview of Stochastic Optimization. By writing

2 T 2 E(g X Y) g Σg 2gE(XY) Y · − = − + 37 we notice that the condition number of Σ plays a key role. In particular, if Σ I, = we expect a O(1/n) convergence of the ordinary least squares, without the depen- dence on d.

CONCLUSION: In the setting of Statistical Learning, without assuming any partic- ular form of the relationship between X and Y, we obtain an O (d/n) rate for the prediction error of the least squares estimator restricted to F relative to the best linear predictor in F.

Prediction of Individual Sequences

In the individual sequence scenario, we are tasked with designing an estimator

(forecaster) ybt on round t in an attempt to predict well on the next (xt , yt ). Recall that the goal is to achieve small regret, defined as n n 1 X 2 1 X 2 (ybt (xt ) yt ) inf (f xt yt ) . (4.8) n t 1 − − f F n t 1 · − = ∈ = t So, how do we choose ybt 1 based on the data {(xs, ys)}s 1 observed so far? It might + = be tempting to simply define it as a linear function y (x) y x via the least bt = bt · squares solution (4.3) on the prefix of data. However, to avoid suffering huge regret in the beginning of the sequence, we need to add a small regularization term: t 1 X− 2 2 ybt argmin (f xs ys) f (4.9) = f Rd s 1 · − + k k ∈ = The influence of the extra term diminishes as t increases, thus approaching the least squares solution. The method is readily recognized as ridge regression or reg- ularized least squares. Unfortunately, the method suffers from a major drawback: the norm of the output y (and, hence, the prediction y x can grow with t. If bt bt · t an a priori bound on yt ’s is known, one may clip the predictions to that interval; otherwise, there does not seem to be a clean way to analyze the method. Observe that our prediction y only enters through the product y x . This sug- bt bt · t gests that one may define a different ybt for each possible xt thus making the func- tion ybt non-linear. It turns out, the following simple modification of (4.9) works beautifully: t 1 x x X− 2 2 2 ybt (x) wt x, where wt argmin (f xs ys) f (f x) (4.10) = · = f Rd s 1 · − + k k + · ∈ =

38 The method is called the Vovk-Azoury-Warmuth forecaster. Denoting w w xt , t = t observe that the closed form is simply

µ t ¶ 1 Ãt 1 ! X T − X− wt I xs xs ys xs . (4.11) = + s 1 s 1 = = ¡ Pt T¢ Denoting Σt I s 1 xs xs , the update from wt to wt 1 can be written as = + = + 1 ¡ ¢ wt 1 Σ−t 1 Σt wt yt xt + = + + or, in an incremental form, as

1 T 1 T wt 1 wt Σ−t (xt xt wt 1 xt yt ) or wt 1 wt Σ−t 1(xt 1xt 1 ft xt yt ) + = − + − + = − + + + − (4.12)

Using the definitions of wt , wt 1, the identity + (w x y )2 (f x y )2 f w 2 f w 2 w w 2 t t t t t t Σt t 1 Σt 1 t t 1 Σt 1 · − − · − = k − k − k − + k + + k − + k + 2 2 2 2 (wt xt ) (wt xt 1) (f xt 1) (f xt ) + · − · + + · + − · can be shown to hold for any f . Further, with the help of the closed form for up- dates in (4.12),

2 T 1 2 T 1 2 2 2 w w x Σ− x (y ) x Σ− x (w x ) (w x ) (w x ) t t 1 Σt 1 t t t t t 1 t t 1 t 1 t 1 t t 1 t 1 t 1 k − + k + = − + + + · + + · + − + · + When we sum over n time steps, most terms telescope leaving us with

n n n X 2 X 2 2 2 X 2 T 1 (w x y ) (f x y ) f w f w (y )x Σ− x R t t t t t 1 Σ1 n 1 Σn 1 t t t t t 1 · − − t 1 · − = k − k − k − + k + + t 1 + = = = (4.13) where the remainder R is equal to

n X 2 T 1 2 2 2 2 R (wt 1 xt 1) xt 1Σ−t xt 1 (w1 x1) (wn 1 xn 1) (f xn 1) (f x1) = − t 1 + · + + + + · − + · + + · + − · =

Since xn 1 is a phantom quantity used only for analysis, we may set it to 0. To- + ˆ 2 2 gether with f1 0, we conclude that R 0. Let B maxt [n] yt . We now use the = ≤ = ∈ identity

T T 1 det(Σ) x (Σ xx )− x 1 (4.14) + = − det(Σ xx T) +

39 to obtain the following bound on the third term the upper bound (4.13): n n µ ¶ d X 2 T 1 2 X det(Σt 1) 2 det(Σn) 2 X (yt ) xt Σ−t xt B 1 − B ln B ln(1 λj ) (4.15) t 1 ≤ t 1 − det(Σt ) ≤ det(Σ0) = j 1 + = = = Pn T where λi are the eigenvalues of t 1 xt xt . The sum of eigenvalues cannot be more 2 2 = 2 than nD , where D maxt [n] xt , and thus the above upper bound is maxi- = ∈ k k mized when all these eigenvalues are equal to nD2/d. Hence, the third term in (4.13) is upper bounded by dB 2 ln(1 nD2/d). Recalling that y (x ) w xt x it + bt t = t · t holds that n ½ n ¾ 2 2 1 X 2 1 X 2 1 2 dB ln(1 nD /d) (ybt (xt ) yt ) inf (f xt yt ) f + n t 1 − − f Rd n t 1 · − + n k k ≤ n = ∈ = (4.16) In fact, we derived more than just a regret bound for (4.8) with respect to a set F: the bound holds for all of Rd , appropriately penalizing those f with large norm. Of course, we may pass to the usual regret bound by taking F to be a bounded set. Typically, a bounded set will be easier to work with, and penalized versions, such as the one above, will be obtained by “stitching together” results for increasing set sizes.

CONCLUSION: In the setting of individual sequence prediction, with no assump- ³ d logn ´ tions on the mechanism generating the data, we obtained an O n bound on regret, only a factor of logn worse than those under probabilistic assumptions! Moreover, the proof is direct (the majority of steps are equalities), the constants are explicit, and the argument can be easily modified to hold in infinite-dimensional Reproducing Kernel Hilbert spaces with an appropriate decay of the kernel eigen- values. The proof, however, seems magical and it is unclear why it works. The ideas of using deviations of empirical and expected quantities, as successfully em- ployed in the previous sections, give us a hint of how to approach more complex problems with i.i.d. data. But does the individual sequence proof generalize to other scenarios? A major part of this course is on understanding how such algo- rithms and upper bounds arise.

From Individual Sequences to I.I.D

n The bound (4.16) on regret holds for any sequence {(xt , yt )}t 1, which can even be = adapted on the fly by a malicious adversary. The only requirement is that (xt , yt )

40 is not known to the learner when choosing ybt . Now, suppose that the data are actually i.i.d., drawn from some unknown distribution PX Y . Let × n 1 X yb ybt (4.17) = n t 1 = be the average of the trajectory of the learner. We now claim that yb enjoys a guar- antee of the type studied in Statistical Learning:

Lemma 4.1. Suppose X F Bd and Y [ 1,1]. The estimator defined in (4.17) = = 2 = − satisfies

2 2 1 d ln(1 n/d) E(yb(X) Y) inf E(f X Y) + + (4.18) − − f F · − ≤ n ∈ for any distribution PX Y . × Proof. To prove this simple fact, take the expectation in (4.16) with respect to the n (i.i.d.) data {(Xt , Yt )}t 1: = ½ n ¾ ( n ) 1 X 2 1 X 2 1 d ln(1 n/d) E (ybt (Xt ) Yt ) E inf (f Xt Yt ) + + (4.19) n t 1 − ≤ f F n t 1 · − + n = ∈ = Observe that by Jensen’s inequality,

n n ¯ oo ½ ½ 1 n ¯ ¾¾ ¡ ¢2 ¯ n n X ¡ ¢2 ¯ n n E E yb(X) Y ¯ X , Y E E ybt (X) Y ¯ X , Y − ≤ n t 1 − ¯ = ½ 1 n n ¯ o¾ X ¡ ¢2 ¯ t 1 t 1 E E ybt (X) Y ¯ X − , Y − = n t 1 − = ½ n ¾ 1 X 2 E (ybt (Xt ) Yt ) (4.20) = n t 1 − = and

( n ) ½ n ¾ 1 X 2 1 X 2 2 E inf (f Xt Yt ) inf E E(f Xt Yt ) inf E(f X Y) (4.21) f F n t 1 · − ≤ f F n t 1 · − = f F · − ∈ = ∈ = ∈ Putting all the pieces together, we conclude the proof.

41 DISCUSSION It is indeed remarkable that the bound for all sequences can be easily converted into a bound for the distribution-free Statistical Learning sce- nario. Such a bound also holds for the well-specified model assumption discussed in the beginning of this lecture. The bounds for the estimator yb have an extra logn factor and require machinery that is markedly different from the one used for analyzing the least squares solution yb. The former estimator can be viewed as arising from stochastic approximation (SA), while the latter – from sample aver- age approximation (SAA). What is quite curious, the Central Limit Theorem, which played a key role in the first two i.i.d. results, does not seem to appear in the proof of Lemma 4.1! We will show in this course that the CLT is implicitly contained in a regret guarantee. But how is this possible? After all, regret is defined on determin- istic sequences! We hope that there is now enough motivation to proceed.

42 Part II

Theory

43 5 Minimax Formulation of Learning Problems

Statistical Decision Theory, introduced by A. Wald [56], unified statistical problems – such as estimation and hypothesis testing – under the same umbrella. In the abstract formulation of statistical decision theory, we may think of Nature and the Statistician as playing a zero-sum game. Suppose we have a statistical model, that is a set of possible distributions P {P f : f F} on W. We assume that Nature 0 = ∈ chooses f F and the statistician observes data from the distribution P f . These ∈ data are represented abstractly as a random variable W with distribution P f (in case of i.i.d. data, think of W as taking values in the product space). Based on W, the Statistician is asked to make a decision – e.g. reject the null hypothesis, compute a particular estimate, or, in the language of the previous lecture, summarize the relationship between X and Y. Let D bet the set of possible decisions, and let yb be either a non-randomized decision function y : W D, or a randomized decision b → y : W ∆(D). Fix a loss function `¯ : D F R. The expected cost (or, risk) is b → × → ¯ E`(yb(W), f ) (5.1) where the expectation is over the data from the distribution given by Nature’schoice of parameter f F. This framework is abstract enough to capture a large array of ∈ “statistical games”. As an example, consider the realizable or label noise setting introduced in the previous lecture in (3.6). We may set

¡ n n ¢ ¡ n n¢ `¯ y(X , Y ), f P y(X) f (X) X , Y , b , b 6= |

44 5.1 Minimax Basics with W being the training set (Xn, Yn). Then (5.1) becomes the expected loss in (3.5). The games modeled by statistical decision theory can, in fact, be sequential in nature, whereby the statistician makes a sequence of intermediate decisions as well as a decision to terminate the process. We refer to Blackwell and Girshick [10] for a wonderful exposition that brings together the theory of games and sta- tistical decision theory. For another detailed exposition see the book of Berger [7]. In this course, we will be interested in extending (5.1) to the setting of individual sequences. The introductory lecture was a wind-whirl trip of a dozen topics in learning, estimation, and optimization. We will now formalize some of these problems through the lens of minimax in a way similar to statistical decision theory. How- ever, we will also discuss distribution-free scenarios alongside those based on sta- tistical models. Of special interest is the minimax formulation of sequential pre- diction with individual sequences, as the blend of ideas from Game Theory and Statistical Learning Theory will yield some new tools.

5.1 Minimax Basics

Let us start by introducing the idea of games and minimax values. We only focus on zero-sum games, where the loss of Player I is the reward of Player II and vice- versa. Let A be the set of moves of Player I and B the set of available moves of Player II. Let `(a,b) be the loss of the first player when she chooses an action a A ∈ while the opponent chooses b B. Here, ` : A B R is a known loss function. ∈ × → As an example, take the penny-matching game, with A B {0,1} and `(a,b) = = = I{a b}. That is, Player I suffers a loss of 1 if her chosen bit is not equal to the one 6= chosen by the opponent. This game has a clear symmetry. If the players make their moves simultaneously, nobody has an advantage, and 1/2 should be the reason- able “expected” loss of either of the players. But how does 1/2 surface? To which question is this an answer? Let us consider the optimal moves that the players should choose. Since Player I is trying to minimize the loss (and the opponent – maximize), we can write down an expression minmax`(a,b). a A b B ∈ ∈ Unfortunately, something is wrong with this expression, and it has to do with the

45 5.1 Minimax Basics fact that the inner maximization can be done as a function of a. In the penny- matching game, b can always be chosen opposite of a to ensure `(a,b) 1. If we = instead write maxb mina, Player I now has the advantage and guarantees zero loss. This does not match our intuition that the game is symmetric. But the problem is really with the fact that “minmax” breaks simultaneity of the two moves and instead puts them in order of their occurrence. (This order will play an important role for sequential problems we will be discussing.) So, is there a way to write down simultaneous moves as a minmax? The answer is yes, but under some conditions. Let Q be the set of distributions on A and P the set of distributions on B, and let us write the expression

V + , minmaxEa q `(a,b). (5.2) q Q b B ∼ ∈ ∈ That is, the first player chooses his mixed strategy q and tells the second player “hey, you can choose b based on this q”. It seems that we have gained nothing: the second player still has the advantage of responding to the first player move, and simultaneity is not preserved. It turns out that, under some conditions (which definitely hold for finite sets A and B), the above minimax expression is equal to the maximin variant

V − , maxminEb p `(a,b). (5.3) p P a A ∼ ∈ ∈

Moreover, both of the expressions are equal to V Ea q ,b p `(a,b) for some = ∼ ∗ ∼ ∗ (q∗,p∗). It becomes irrelevant who makes the first move, as the minimax value is equal to the maximin value! In such a case, we say that the game has a value

V V + V −. We will be interested in conditions under which this happens, prov- = = ing a generalization of the von Neumann Minimax Theorem. The “minimax swap”, as we call it, will hold for infinite sets of moves A and B. To avoid the issue of at- tainability of min and max, we shall always talk about inf and sup. Let us mention a few facts which are easy to verify. First, the inner maximiza- tion (minimization) is often achieved at a pure strategy if it is followed by an ex- pectation:

minmaxEa q,b p `(a,b) minmaxEa q `(a,b) q Q p P ∼ ∼ = q Q b B ∼ ∈ ∈ ∈ ∈ and

maxminEa q,b p `(a,b) maxminEb p `(a,b) p P q Q ∼ ∼ = p P a A ∼ ∈ ∈ ∈ ∈ 46 5.1 Minimax Basics because the minimum (maximum) of a linear functional is achieved at a vertex of the simplex (or set of distributions). The second observation is:

Lemma 5.1. If `(a,b) is convex in a and A is a convex set, then the outer minimiza- tion

minmaxEa q `(a,b) minmax`(a,b) q Q b B ∼ = a A b B ∈ ∈ ∈ ∈ is achieved at a pure strategy.

Proof. We have

minmax`(a0,b) minmax`(Ea q a,b) minmaxEa q `(a,b) minmax`(a,b). a A b B = q Q b B ∼ ≤ q Q b B ∼ ≤ a A b B 0∈ ∈ ∈ ∈ ∈ ∈ ∈ ∈

As a rather trivial observation, note that upper bounds on the minimax value in (5.2) can be obtained by taking a particular choice q for Player I, while lower bounds arise from any choice b B. Indeed, this is the approach that we take. ∈ Throughout the course, we will take the side of Player I, associating her with the Statistician (player, learner), whereas Player II will be Nature (or, adversary, oppo- nent). Now, let us consider a two-stage game where both players make the first moves a A and b B, learn each other’s moves, and then proceed to the second 1 ∈ 1 ∈ round, choosing a A and b B. Suppose the payoff is a function `(a ,b ,a ,b ). 2 ∈ 2 ∈ 1 1 2 2 We can then write down the minimax expression

minmaxEa1 q1 minmaxEa2 q2 `(a1,b1,a2,b2) (5.4) q1 Q b1 B ∼ q2 Q b2 B ∼ ∈ ∈ ∈ ∈ We will study such long expressions in great detail throughout the course, so it is good to ponder upon its form for a minute. Notice that the first-stage moves are written first, followed by the second-stage. This is indeed correct: by writing the terms in such an order we allow, for instance, the choice of q2 to depend on both a1 and b1. This corresponds to the protocol that was just described: both players observe each other’s moves. In fact, any minimum or maximum in the sequence is calculated with respect to all the variables that have been “chosen” so far. Informally, by looking at the above expression, we can say that when Player II makes the second move, she “knows” the moves a1 and b1 at the first time step, as well as the mixed strategy q1 of Player I at the second step. Based on these,

47 5.2 Defining Minimax Values for Learning Problems

the maximization over b2 can be performed. As the expressions in lectures involve many more than two stages, it is good to go through the sequence and make sure each player “knows” not more and not less than he is supposed to. Another way to think of the long minimax expression is via operators. To illus- trate, `(a1,b1,a2,b2) is a function of 4 variables. The inner expectation over a2 is an operator (defined by q2), mapping the function ` to a function of 3 variables. The maximization over b2 maps it to a function of 2 variables, and so forth. When each stage involves the same sequence of operators (e.g. minmaxE), we collapse the long sequence of operators and write

wminmaxEai qi | {`(a1,b1,a2,b2)} . qi Q bi B ∼ i 1,2 ∈ ∈ = There is yet another way to write the multiple-stage minimax value, such as

(5.4). Consider the second move of the two players. For different a1,b1, the best moves a2,b2 might be different, and so we will write the second-stage choice as a function of a1,b1. To avoid the issue of simultaneity, assume that both players choose mixed strategy q π (a ,b ) and p τ (a ,b ), respectively. The first- 2 = 2 1 1 2 = 2 1 1 stage decisions are trivial constant mappings q π (),p τ (), but the later- 1 = 1 1 = 1 stage decisions depend on the past. We can write π (π ,π ) and τ (τ ,τ ) as = 1 2 = 1 2 two-stage strategies of the players. We can then rewrite (5.4) as

minmaxEa1 π1 Eb1 τ1 Ea2 π2(a1,b1)Eb2 τ2(a1,b1)`(a1,b1,a2,b2), π τ ∼ ∼ ∼ ∼ or, more succinctly as

minmaxE`(a1,b1,a2,b2) . (5.5) π τ We say that the above expression is written in terms of strategies, while (5.4) is in the extensive form. As we already observed, the maximization over τ will be achieved at a non-randomized strategy, resulting in the simplified stochastic pro- cess that evolves according to the randomized choice of Player I and the deter- ministic choice of Player II. Finally, we remark that π can be thought of as a joint distribution over sequences (a1,a2).

5.2 Defining Minimax Values for Learning Problems

In statistical decision theory formulation with risk defined in (5.1), the Statisti- cian’s move consists of a decision rule yb, while Nature’s moves are the possible

48 5.2 Defining Minimax Values for Learning Problems distributions P f parametrized by f F. The minimax value can be written as ∈ ¯ infsupE`(yb(W), f ) (5.6) y f F b ∈ where the infimum is over all decision rules, and expectation is over W P f . If ` is ∼ convex in yb, the minimization can be taken over deterministic decision rules. This stems from a useful fact that a randomized rule can be represented as a random draw from a distribution over deterministic rules, together with Lemma 5.1.

Statistical Learning Theory

n n We assume that data are an i.i.d. sample of n pairs {(Xt , Yt )}t 1 (X Y) .A learning = ∈ × algorithm (or, a prediction rule) is a mapping y :(X Y)n D, where D YX is the b × → = space of all measurable functions X Y. We either write y(x; Xn, Yn) to make the → b dependence on data explicit, or simply yb(x) if the dependence is understood. Let P denote the set of all distributions on X Y. Consider the case of regression with × squared loss. For the distribution-free setting of Statistical Learning Theory, define the minimax value is ( ) iid,sq 2 2 V (F,n) , infsup E(yb(X) Y) inf E(f (X) Y) (5.7) y P P − − f F − b ∈ ∈ © 2 2ª inf sup E(yb(X) Y) E(f (X) Y) , = y P P,f F − − − b ∈ ∈ where the expected value in the first term is over n 1 i.i.d. random variables + (X1, Y1),...,(Xn, Yn),(X, Y). Note that the supremum ranges over all distributions on X Y. The minimax objective Viid,sq (F,n) is defined above in terms of predictive × risk relative to the risk of a reference class F. Alternatively, it can be re-written as follows: ( ) iid,sq 2 2 V (F,n) infsup E yb fP inf f fP (5.8) = y P P k − k − f F k − k b ∈ ∈ © 2 2ª inf sup E yb fP f fP = y P P,f F k − k − k − k b ∈ ∈ where fP (a) E[Y X a] is the mean function associated with P, and the norm = | = 2 R 2 2 L (P ). Recalling that g g (x)PX (dx) Eg (X), we can easily k · k = k · k 2 X k kL2(PX ) = =

49 5.2 Defining Minimax Values for Learning Problems verify the equivalence of (5.7) and (5.8). For absolute loss, let us define the ana- logue of (5.7) as ( ) iid,ab ¯ ¯ ¯ ¯ V (F,n) , infsup E¯yb(X) Y¯ inf E¯f (X) Y¯ (5.9) y P P − − f F − b ∈ ∈ and for general losses as ( ) iid V (F,n) , infsup E`(yb,(X, Y)) inf E`(f ,(X, Y)) (5.10) y P P − f F b ∈ ∈ Let us now consider the distribution-dependent PAC framework for classifica- tion and write down its minimax value:

pac ¯ ¯ V (F,n) infsupP(y(X) f (X)) infsupE¯y(X) f (X)¯ (5.11) , b 6= = b − yb P f yb P f

f f where P ranges over distributions given by P P with P δ for f f X × Y X Y X a = f (a) ∈ F, a class of {0,1}-valued functions. In the label noise| scenario,| = the distribution f P puts some mass on 1 f (a). Y X a − | As= we go forward, it is important to think of V(F,n) as a measure of complexity of F. If F {f } contains only one function, the values defined above are zero since = we can simply set y f . On the opposite end of the spectrum, the complexity b = V(YX,n) of the set of all possible functions is, in general, impossible to make small. One goal of this course is to understand how this value fares for function classes between these two extremes, and to understand what other (easier-to-grasp) mea- sures of complexity of F are related to V(F,n).

Nonparametric Regression

We would like to compare the minimax problems studied in nonparametric re- gression and in statistical learning theory. Let F be a class of functions X Y. → Let P denote a marginal distribution on X. For f F, let P denote the distri- X ∈ f bution on X Y obtained as a product of some P and a conditional distribution × X of Y given X a being a normal N(f (a),σ2) with mean f (a) and variance σ2. This = corresponds to the model Y f (X) ², where ² N(0,σ2). = + ∼ Let P0 PF {P : f F} P be a subset of all probability distributions whose = = f ∈ ⊂ regression function f (a) E[Y X a] is a member of F and the marginal distribu- = | =

50 5.2 Defining Minimax Values for Learning Problems

tion PX is either arbitrary or fixed (if it is part of the design). A goal in nonpara- metric regression is to study the following minimax value

2 N(F,n) , inf sup E yb f (5.12) y P PF k − k b f ∈ for the norm and some σ. Once again, one can view N(F,n) as a k · k = k · kL2(PX ) measure of complexity of the set of distributions parametrized by F, with the two extremes (a single distribution and all possible distributions) having, respectively, complexities of 0 and a positive constant. Let us show that N(F,n) Viid,sq (F,n), that is, the goal of statistical learning is ≤ more difficult than that of nonparametric regression (at least in the particular set- ting we mentioned). To see this, replace the supremum in (5.8) by the supremum over the smaller class PF P: ⊂ ( ) Viid,sq (F,n) inf sup E y f 2 inf f f 2 ≥ kb− Pg k − k − Pg k y Pg PF f F b ∈ ∈ inf sup E y g 2 = kb− k y Pg PF b ∈ N(F,n) = where it is understood in the first equality that data is distributed according to Pg for some g F. The second term is then clearly zero. ∈ A question of interest is to study the gap between N(F,n) and Viid,sq (F,n). The advantage of studying the latter quantity is that tools from empirical process the- ory (which we will cover in the next few lectures) can be used to get upper bounds. One can interpolate between the values defined in statistical learning and in nonparametric regression by placing prior knowledge both in the comparator and in the set of possible distributions for nature: ( ) iid,sq 2 2 V (F,P0,n) , inf sup E yb fP inf f fP (5.13) y P P0 k − k − f F k − k b ∈ ∈ where P P and does not have to necessarily match F. 0 ⊆

Universal Data Compression and Conditional Probability Assign- ment

In the non-sequential minimax settings considered so far, the move of the Statisti- n n cian is a strategy yb is a mapping from data (e.g. (X , Y )) to the space of functions

51 5.2 Defining Minimax Values for Learning Problems

X Y. Another way to write this fact is y( Xn, Yn): X Y. For sequential prob- → b ·| → lems, we talk of strategies as a sequence of mappings ybt which depend on the prefix of data. A good example of this is universal data compression where each t 1 y : Z − ∆(Z) is a conditional distribution. Specifying the sequence of y ’s is bt → bt the same as specifying a joint probability distribution yb on sequences (z1,...,zn). Hence, we can write the minimax value (without the normalization by n) via the n-fold KL divergence as

½ f (Zn)¾ F E R+( ,n) infsup log n infsupD(f , yb) = y f F y(Z ) = y f F b ∈ b b ∈ where the infimum is over joint distributions and the expectation is over Zn = (Z ,..., Z ) distributed according to f F. This minimax value is called the mini- 1 n ∈ max redundancy in universal coding. The maximin redundancy defined as

R−(F,n) sup infEf p D(f , yb) = p ∆(F) y ∼ ∈ b is related to an information-theoretic notion of a capacity of a channel.

Sequential Prediction of Individual Sequences

In this setting, the move of the learner at time t is based on the knowledge of its own previous moves and the outcomes z1,...,zt 1. Let us use πt to denote the de- t 1 − terministic mapping π : Z − D, where D denotes the space of moves of the t → player. Note that we omitted the dependence on strategy’s own past moves: they can be calculated from the moves of the opponent whenever the strategy is deter- ministic. n t 1 We write π {πt }t 1 to denote a n-stage strategy. While ybt πt (z − ) is the = = = move of the player at time t, we should be thinking of π Π as the single move in ∈ the n-stage game. This is the analogue of providing an estimator yb for i.i.d. learn- ing. Let Π be the space of all such deterministic strategies and ∆(Π) a distribution on this space (suppose for now that such an object exists). Then, minimax regret against a single comparator and an oblivious adversary can be defined as

( n n ) obli v 1 X t 1 1 X V (F,n) inf sup Eπ p `(πt (z − ),zt ) inf `(f ,zt ) n ∼ = p ∆(Π) (z1,...,zn ) Z n t 1 − f F n t 1 ∈ ∈ = ∈ = (5.14)

52 5.2 Defining Minimax Values for Learning Problems

If D F, the set of our moves coincides with the set of moves against which = the loss is to be compared. In this case, the algorithm is said to be proper, and it is called improper if D F. Improper learning is often useful for computational pur- 6= poses, though it is not known (at least to us) whether improper learning is strictly more powerful than proper learning in terms of regret bounds. In our notation, F in V(F,n) denotes the comparator term. For improper learning, we will point out the set of moves of the player. What do we do if sequences are generated by an adaptive (non-oblivious) ad- n t 1 versary? We have to define strategies τ {τt }t 1 with τt : D − Z for the adver- = = → sary and interleave it with the strategy of the learner. The minimax value can then be written as ( n n ) seq 1 X 1 X V (F,n) inf supEπ p `(πt ,τt ) inf `(f ,τt ) = p ∆(Π) τ ∼ n t 1 − f F n t 1 ∈ = ∈ = where it is understood that πt and τt are functions defined recursively. Unfor- tunately, the above expression hides too much interdependence which we would like to bring out. We argue that the above value is equal to ( n n ) seq 1 X 1 X V (F,n) inf sup E ... inf sup E `(ybt ,zt ) inf `(f ,zt ) . = q1 ∆(F) z1 Z y q1 qn ∆(F) zn Z y qn n t 1 − f F n t 1 ∈ ∈ b1∼ ∈ ∈ bn ∼ = ∈ = (5.15) The fact that these expressions are equal requires a proof, already motivated for n 2 in the earlier section on minimax basics. As discussed previously, our home- = grown notation for the long sequence in (5.15) is n ( n n ) seq 1 X 1 X V (F,n) x inf sup E } `(y ,zt ) inf `(f ,zt ) (5.16) = n bt − n qt ∆(F) zt Z ybt qt t 1 t 1 f F t 1 ∈ ∈ ∼ = = ∈ = In particular, for proper supervised learning with absolute loss, the value can be written as n ( n n ) seq,ab 1 X 1 X V (F,n) x inf sup E } y (xt ) yt inf f (xt ) yt . = n |bt − | − n | − | qt ∆(F) (xt ,yt ) ybt qt t 1 t 1 f F t 1 ∈ ∼ = = ∈ = (5.17) For improper learning, n ( n n ) seq,ab 1 X 1 X V (F,n) xsup inf sup E } ybt yt inf f (xt ) yt . = x y n | − | − n | − | t qt ∆(Y) t ybt qt t 1 t 1 f F t 1 ∈ ∼ = = ∈ = (5.18)

53 5.2 Defining Minimax Values for Learning Problems

The set F in the last expression refers to the comparator term, and the distribu- tions qt are over the outcome space Y. To see why this setting is equivalent to (5.17) with an improper choice of y YX, observe that we can decide on y for bt ∈ bt each possible xt before actually observing it. This amounts to choosing a function y YX instead of F as in (5.17). bt ∈ With an argument identical to that of Lemma 4.1, we see that Viid (F,n) Vseq (F,n) ≤ for loss ` that is convex in the first argument. Finally, let us mention what would happen to (5.16) if instead of regret we con- sidered the sum of our losses without the comparator term: n ½ n ¾ n 1 X 1 X x inf sup E } `(y ,zt ) inf sup E `(y ,zt ) V + n bt = n bt = qt ∆(F) zt Z ybt qt t 1 t 1 t 1 qt ∆(F) zt Z ybt qt ∈ ∈ ∼ = = = ∈ ∈ ∼ (5.19) where V + is the minimax value (5.2) for the two player game with the payoff `(f ,z). The first equality comes from the fact that the rounds are completely decoupled. There is no learning to do, as both players will play the optimal move. Not so in the setting of regret minimization: the comparator term forces us to “learn” the strategy of the opponent. It is for this reason that we refer to regret minimization as “learning in games” and study it within the realm of Learning Theory.

Online Convex Optimization

Using the observation that the infimum is achieved at pure strategies when the cost of the learner’s decision is a convex function, we can rewrite all the minimax values of the previous general setting. In particular, the oblivious case can be writ- ten as ( n n ) obli v 1 X t 1 1 X V inf sup zt (πt (z − )) inf zt (f ) n = π (z1,...,zn ) Z n t 1 − f F n t 1 ∈ = ∈ = where we remind the convention that `(f ,z) is written as z(f ). The extended form with non-oblivious adversary becomes ( n n ) oco 1 X 1 X V inf sup inf sup ... inf sup zt (ybt ) inf zt (f ) (5.20) = y F z1 Z y F z2 Z y F zn Z n t 1 − f F n t 1 b1∈ ∈ b2∈ ∈ bn ∈ ∈ = ∈ = n ( n n ) 1 X 1 X x inf sup} zt (ybt ) inf zt (f ) . (5.21) = F Z n − F n ybt zt t 1 t 1 f t 1 ∈ ∈ = = ∈ =

54 5.3 No Free Lunch Theorems

5.3 No Free Lunch Theorems

The so-called “No Free Lunch Theorems” say that one cannot achieve uniform (in a certain sense) rates under no assumptions on the problem. We will first demon- strate such a statement in the context of statistical learning theory and nonpara- metric regression.

5.3.1 Statistical Learning and Nonparametric Regression

As discussed before, the difference between the statistical learning formulation in (5.8) and the nonparametric regression definition in (5.12) is in the way the prior assumption F appears in the objective. Absence of an assumption corresponds to taking F YX, the space of all (measurable) functions from X to Y. In this case, = iid,sq ³ X ´ 2 V Y ,n infsupE yb fP (5.22) = y P P k − k b ∈ where the supremum is over all distributions, while

N(YX,n) infsupE y f 2 (5.23) = kb− k yb P f

X where P f ranges over all distributions with an arbitrary mean function f Y , but ¡ ∈X ¢ Y being distributed as a gaussian with mean f (X). Both values Viid,sq Y ,n and N(YX,n) can be lower bounded via a simple argument. Let us focus on the former value, and the latter will follow. We employ the following well-known lower bound, based on binary-valued functions.

Theorem 5.2. If X 2n, it holds that | | ≥ ³ ´ 1 Viid,sq YX,n (5.24) ≥ 8

Proof. To lower-bound Viid,sq ¡YX,n¢, let us pass to a smaller set of distributions f PF P, where F is specified below. The distributions P PF will be defined by a ⊂ ∈ uniform marginal distribution PX supported on a subset X0 X of size X0 2n, f f ⊂ | | = and the conditional PY X parametrized by f F via PY X a δf (a). Let F consists 2n | ∈ | = = of 2 functions taking on all possible configurations of values {0,1} on X0 and de-

55 5.3 No Free Lunch Theorems

fined as f (x) 0 on X \ X0. Then = iid,sq ³ X ´ 2 V Y ,n inf sup E(yb(X) f (X)) ≥ y P PF − b f ∈ © © 2 ¯ ªª infE E (y(X) f (X)) ¯ f ≥ q b − yb n n¡ ¡ n n ¢ ¢2 ¯ noo infE n E y X; X , f (X ) f (X) ¯ X, X = X ,X f b − ¯ yb where q is a uniform distribution on F and f is distributed according to q. The equality follows by exchanging the order of integration, which is possible because n n the distribution of Xt ’s is independent of f . The notation (X , f (X )) stands for n {(Xt , f (Xt ))}t 1. Consider the inner conditional expectation and let us further con- = dition on the event {X Xn} which holds with probability at least 1/2. For any f F ∉ n n ∈ there is f 0 F which agrees with f on all of X0 except X. However, yb(X;(X , f (X ))) n ∈ n n = y(X;(X , f 0(X ))), and thus the expected loss is at least 1/2, given that X X . The b ∉ statement follows.

Note that the simple construction of Theorem 5.2 corresponds to the setting of nonparametric regression with a gaussian noise of zero variance. This, of course, makes the regression only easier, and thus the lower bound of Theorem 5.2 is also applicable.

5.3.2 Sequential Prediction with Individual Sequences

We now turn to sequential prediction of individual sequences. The No-Free-Lunch Theorem is a simple variation on the expression (5.19) without the comparator. To start, consider the value defined in (5.18) for the supervised setting with absolute loss. Suppose we remove the prior knowledge of F. That is, as before, F YX. = Suppose for simplicity that Y [ 1,1]. Assuming x ’s are distinct, which we can = − t do because we are exhibiting a strategy for the adversary in order to obtain a lower bound, we can always find a function that perfectly fits the data. The value then

56 5.3 No Free Lunch Theorems becomes “comparator-less”, just as in (5.19):

n ½ n ¾ seq,ab X 1 X V (Y ,n) xsup inf sup E } ybt yt = x y n | − | t qt ∆(Y) t ybt qt t 1 t 1 ∈ ∼ = = n ½ n ¾ 1 X x inf sup E } ybt yt = y n | − | qt ∆(Y) t ybt qt t 1 t 1 ∈ ∼ = = n ½ n ¾ 1 X x inf E E } ybt yt ≥ y n | − | qt ∆(Y) t ybt qt t 1 t 1 ∈ ∼ = = where y { 1} are i.i.d. Rademacher random variables (fair coin flips). Writing t ∈ ± y y 1 y y , exchanging the order of expectations, and unwinding the min- |bt − t | = − bt · t imax value

n ½ n ¾ n 1 ( µ n ¶) 1 X − 1 X x inf E E } 1 ybt yt x inf E E } inf E E 1 ybt yt y − n = y y − n qt ∆(Y) t ybt qt t 1 t 1 qt ∆(Y) t ybt qt t 1 qn ∆(Y) ybn qn n t 1 ∈ ∼ = = ∈ ∼ = ∈ ∼ = n 1 ( n 1 ) − 1 X− x inf E E } 1 ybt yt = y − n qt ∆(Y) t ybt qt t 1 t 1 ∈ ∼ = = we conclude that Vseq,ab ¡YX,n¢ 1. ≥ The i.i.d. coin flip strategy yt for the adversary is an example of an equalizer strategy: the move of the learner becomes irrelevant. It turns out many lower bounds in learning theory arise in this way, and we will revisit this idea several times in the course.

57 6 Learnability, Oracle Inequalities, Model Selection, and the Bias-Variance Trade-off

6.1 Statistical Learning

The No-Free-Lunch Theorem 5.2 for Statistical Learning says that, for any n, the value Viid,sq ¡YX,n¢ cannot be made small. If we think of n as a fixed quantity, this lower bound should be taken seriously: no matter what the learning algorithm yb is, there will be a distribution such that the expected loss E(y(X) Y)2 of the estimator b − is much greater than the best achievable over all measurable functions (which, for the case of square loss, is achieved at the regression (or, Bayes) function f (a) P = E[Y X a]). | = The interpretation of Theorem 5.2 becomes quite murky when n is considered to be variable and increasing to infinity. In this case, the construction of the lower bound is somewhat unsatisfying because the “bad distribution” depends on n. The lower bound says that no matter what our learning rules ©y :(X Y)n YXª bn × → are, for any n there is a distribution on which ybn performs unsatisfactory. One can argue, however, that such a lower bound is overly pessimistic. It might still be pos- sible that for any particular distribution, with enough data the performance of our procedure will be good. Yes, for the given n our estimator might be bad on a partic- ular distribution, but if we eventually get better, then why worry? In other words, the No Free Lunch Theorem does not exclude the possibility that for any particular 2 2 distribution, we can drive the difference E(ybn(X) Y) inff YX E(f (X) Y) to zero. − − ∈ − The subtlety between the two notions of learning is in the order of quantifiers.

58 6.1 Statistical Learning

Specifically, the distinction boils down to existence of rates that hold uniformly for all distributions. To illustrate this, let us first define a shorthand

L(f ) , E`(f ,(X, Y)) (6.1)

We have the following two possible definitions of “learnability”:

Uniform Consistency

There exists a sequence {ybt }∞t 1 of estimators, such that for any ² 0, there = > exists n such that for any distribution P P and n n , ² ∈ ≥ ²

EL(ybn) inf L(f ) ² − f YX ≤ ∈ The smallest value n n(²) is called sample complexity. ² =

Universal Consistency

There exists a sequence {ybt }∞t 1 of estimators, such that for any distribu- = tion P P and any ² 0, there exists n such that for n n , ∈ > ² ≥ ²

EL(ybn) inf L(f ) ² − f YX ≤ ∈ For a given P, the smallest value n n(²,P) is called sample complexity. ² = The first notion will be called uniform consistency, while the second will be called universal consistency.1 The first can be written as ( ) limsupinfsup EL(ybn) inf L(f ) 0 n − X = ybn P P f Y →∞ ∈ ∈ while the second as ( ) inf suplimsup EL(ybn) inf EL(f ) 0 n − X = {ybt }∞t 1 P P f Y = ∈ →∞ ∈ The fact that uniform consistency is not possible without restricting the set P of distributions (as shown in our lower bound of Theorem 5.2) is well known: there is

1The almost-sure (rather than in-probability) version of this statement is called universal strong consistency [21].

59 6.1 Statistical Learning no non-trivial uniform rate at which estimators would converge to the regression function for all distributions simultaneously. Lower bounds in the study of univer- sal consistency are generally harder to obtain, and they are called individual lower rates. We refer to [24] for a detailed exposition in the regression setting. Let us illustrate the difference with an example.

Example 1. Let X Y [0,1] and P0 P the set all distributions on X Y given = = ⊂ f × by a uniform marginal distribution PX and PY X a δf (a) for some continuous | = = n mean function f : [0,1] [0,1]. Given n samples {(Xt , f (Xt ))}t 1, an estimator yb → = is a function [0,1] [0,1]. No matter what this function is, there is another dis- → tribution with a very different mean function f 0 and the same marginal PX that passes through the data, yet differs from yb. Hence, there is no hope for uniform consistency. However, suppose an arbitrary P P is fixed. As we obtain more and ∈ more samples, we can approximate the mean function arbitrarily well via kernel or other well-known methods, implying universal consistency.

Uniform and universal consistency can be defined with respect to a smaller class F YX, in which case the sample complexities are defined, respectively, as ⊆ n(²,F) and n(²,P,F). We now show that by modifying the comparison yardstick in the definition of Viid,sq ¡YX,n¢, we can pass from uniform to universal consistency. This observa- tion will lead us directly to the ideas of model selection and oracle inequalities. Since ( ) V(YX,n) infsup EL(y ) inf L(f ) (6.2) = bn − yb P P f YX ∈ ∈ cannot be made small (at least for the square loss) without restricting the set of models, let us redefine the comparator term by making it larger. A particularly interesting modification (which seems rather arbitrary right now, but will be ex- plained in a few moments) is to consider ( " #) W (F,n) infsup EL(ybn) C inf inf L(f ) pen(k,n) (6.3) = y P P − k f F + bn ∈ ∈ k X where F k 1Fk is either all of Y , or a very large set of functions. The sub- = ∪ ≥ division of the large class F into “manageable” pieces {Fk } is called a sieve, and

60 6.1 Statistical Learning inequalities of the type " # EL(ybn) C inf inf L(f ) pen(k,n) (6.4) ≤ k f F + ∈ k are called oracle inequalities. We assume that pen(k,n) 0 as n increases. If C 1, → = the oracle inequality is called exact. It is typically assumed that F F ... and 1 ⊆ 2 ⊆ pen(k,n) is increasing with k.

We may think of Fk as nested models. The smaller the class, the easier it is to learn, yet the worse is the comparison to the Bayes error. The first term in ( ) ( ) EL(y) inf L(f ) inf L(f ) inf L(f ) (6.5) b − + − f Fk f Fk f YX ∈ ∈ ∈ is known as the estimation error, and the second – as the approximation error, associated with the choice of the model Fk . This is precisely a bias-variance trade- off. The choice of a good k is known as the model selection problem. One may 1 2 k think of choosing among the set ybn, ybn,... of estimators, where ybn is the prediction rule associated with Fk . If pen 0 and C 1, we get back (6.2), but for a nonzero penalty the goal (6.3) = = appears more viable, as we subtract a larger value. But what is the meaning of this expression? The idea is that we cannot compete with all functions in the large class F with the same uniform rate: some functions are more complex and require ∪k k more data. This “complexity” is captured in the penalty function pen(k,n). We now show that a control on W (F,n) guarantees universal consistency. Such a result is reassuring because we can focus on obtaining oracle inequalities.

Lemma 6.1. If it holds that

limsupW (F,n) 0 n = →∞ for W (F,n) defined as an exact oracle (C 1), then there exists a universally consis- = tent estimator {ybt }∞t 1. = Proof. Indeed, ( " #) 0 limsupW (F,n) inf suplimsup EL(ybn) inf inf L(f ) pen(k,n) = n ≥ n − + {ybt }∞t 1 P P k f Fk →∞ = ∈ →∞ ∈

61 6.1 Statistical Learning

The inequality holds by exchanging the order of supP and the limsup. We can now reason conditionally on P. Assume that the minimizer f ∗ f ∗(P) argminf F E`(f ,(X, Y)) = = ∈ belongs to some F , where the expected loss is with respect to (X, Y) P. Since k∗ ∼ limn pen(k,n) 0, for any ² 0 there exists n² such that pen(k∗,n) ² for →∞ = > ≤ n n . For any such n n , ≥ ² ≥ ² " # " # EL(ybn) inf inf L(f ) pen(k,n) EL(ybn) inf L(f ) pen(k∗,n) EL(ybn) L(f ∗) ², − k f F + ≥ − f F + ≥ − − ∈ k ∈ k∗ and the statement follows.

Oracle inequalities of the form (6.4) are interesting in their own right, without the implication of universal consistency. An oracle inequality ensures that we are able to do almost as well as if an Oracle told us the model class Fk to which f ∗ belongs. Of course, it remains to be seen whether W (F,n) defined in (6.3) can be made small, and what kinds of penalties one should use. These penalties can be distribution- independent, i.e. of the form pen(k,n), or distribution-dependent, i.e. of the form pen(k,n,P). What is interesting, a distribution-independent penalty pen(k,n) should roughly be on the order of the value V(Fk ,n). While not immediately apparent, it is indeed a good rule of thumb: the penalty should roughly correspond to a mea- sure of “complexity” of Fk . Of course, the penalty can be larger than that since it only makes it easier to show decay of W (F,n). But a penalty too large leads to less meaningful statements. A penalty too small will make it impossible to ensure smallness of W (F,n). Let us sketch an argument showing that the penalty should be at least on the order of the value V(Fk ,n). Suppose that limsupn W (F,n) 0, where C 1 →∞ = = and pen(k,n) V(F ,n) ψ(k,n) for some nonnegative ψ. In other words, we are = k − assuming that the penalty is smaller by some amount ψ(k,n) than the difficulty of learning Fk in a distribution-free manner. We would like to argue that ψ(k,n) cannot be an “interesting” function of k that captures any dependence beyond what is already captured by V(Fk ,n).

Lemma 6.2. If it holds that

limsupW (F,n) 0 n = →∞

62 6.1 Statistical Learning for C 1, and pen(k,n) V(F ,n) ψ(k,n) for some nonnegative ψ is a distribution- = = k − independent penalty function, then

limsupsupψ(k,n) 0 n k ≤ →∞

Proof. Denoting f ∗ , argminf Fk L(f ), we have k ∈ ½ ¾ £ ¤ W (F,n) infsup EL(y ) inf L(f ∗) V(F ,n) ψ(k,n) = bn − k + k − ybn P P k ∈    h k i  infsupsup EL(ybn) L(fk∗) inf sup EL(ybn) L(fk∗) ψ(k,n) = − − k − + yn k P P  y P 0 P  b ∈ bn ∈ · ¸ © ª n k o infsupsup sup EL(ybn) L(fk∗) sup EL(ybn) L(fk∗) ψ(k,n) = k − − − + yn y k P P P 0 P b bn ∈ ∈

k where ybn in the definition of V(Fk ,n) is given the knowledge of k. Clearly, this knowledge should not hurt, so the difference of the two suprema is non-negative. Since the limsup of W (F,n) is zero, the statement follows. Thus, ψ(k,n) has to have a uniform rate of decay (for each k) in terms of n. Thus, in the asymptotic sense, ψ(k,n) cannot capture any non-trivial dependence on k.

In this somewhat hand-waving argument, we conclude that distribution-independent penalties used in W (F,n) should be upper bounds on V(Fk ,n). Can we show that W (F,n) is in fact controlled if the penalty is defined as pen(k,n) V(F ,n)? The = k answer is yes. Suppose that F F .... The particular method we will use is 1 ⊆ 2 ⊆ penalized empirical risk minimization:

½ n ¾ 1 X yb argmin `(f ,(Xt , Yt )) pen(g f ) (6.6) = f n t 1 + = where pen(f ) ©V(F ,n): k inf{k : f F }ª. g , k = ∈ k This method guarantees W (F,n) 0 under some reasonable assumptions, but we → will postpone the discussion until later in the course. While pen(k,n) is a distribution-independent penalty, one can obtain better bounds with distribution-dependent penalties pen(k,n,P). These, in turn, lead to data-dependent penalties pend that can be used in (6.6). There is a large body of literature on such penalties, and a formal understanding of the types of penalties

63 6.2 Sequential Prediction one can use is an interesting subject. We refer to [4,37] and [33] for more insights. What is important to note for us is that the tools required for understanding data- dependent penalties in fact arise from the study of each Fk , a “manageable” part of the larger set. These basic tools will be introduced in the next few lectures, and we will return to oracle inequalities towards the end of the course. We will also prove oracle inequalities for the sequential prediction setting with individual sequences.

P Exercise (???): Suppose F i NFi is a countable union, and the learning = ∪ ∈ problem with each Fi is Uniformly Consistent. Prove that the learning problem with respect to F is Universally Consistent.

6.2 Sequential Prediction

The definitions of Universal and Uniform Consistency in the setting of Statistical Learning have their analogues in the study of Sequential Prediction. Let regret with respect to f F be defined as ∈ n n 1 X 1 X Regn(f ) `(ybt ,zt ) `(f ,zt ) = n t 1 − n t 1 = = Uniform Sequential Consistency There exists a strategy for the learner such that for any ² 0, there ex- > ists n such that for any n n , irrespective of Nature’s strategy, the regret ² ≥ ² Reg (f ) with respect to any f F is at most ² in expectation, that is n ∈

EsupRegn(f ) ² f F ≤ ∈

Universal Sequential Consistency There exists a strategy for the learner such that for any f F and any ² 0, ∈ > there exists n such that for n n , irrespective of Nature’s strategy, the ² ≥ ² regret Regn(f ) is at most ² in expectation, that is

EReg (f ) ² n ≤

64 6.3 Remarks

We remark that the notion of Hannan consistency (as defined, for instance, in [16]), is equivalent to Uniform Sequential Consistency that holds almost surely in- stead of in expectation. For the sake of conciseness, we do not expand on the “almost sure” or “in probability” definitions. Many of the results of the previous section on model selection and oracle in- equalities can be extended in the straightforward way to the case of sequential prediction, and we will touch upon these at the end of the course, time permit- ting. An exercise on page 163 asks to prove an analogue of the last exercise in the previous section, now in the setting of sequential prediction.

6.3 Remarks

The minimax formulation is a succinct way to represent the problem. Once writ- ten down, it is clear what are the sets of moves of the Statistician and Nature, who makes the first move, and whether the distribution is allowed to be changed for each n. Furthermore, it is possible to compare different minimax values with dif- ferent sets of moves of Nature and Statistician. Let us mention that the minimax viewpoint is not always the most interesting object of study. In particular, much interest in statistical learning theory is in ob- taining data-dependent bounds, a key step towards oracle inequalities and model selection discussed above. These would not be possible if we pass to the worst distribution. However, one can argue that the data-dependent bound can be in- corporated into the minimax framework. To illustrate this point, let us consider the value defined in (5.9) for i.i.d. supervised learning. Instead of ( ) infsup E`(yb,(X, Y)) inf E`(f ,(X, Y)) (6.7) y P P − f F b ∈ ∈ it is quite desirable to get data-dependent bounds of the form

© ¯ n nª n n E `(yb,(X, Y)) ¯ X , Y inf E`(f ,(X, Y)) Ψ(F, X , Y ) − f F ≤ ∈ for some function Ψ. While it is often possible to include such a data-dependent bound in a minimax formulation through, for instance, a uniform bound on a ratio of the deviation to Ψ, we will avoid such complications. It is important to note that data-dependent upper bounds can often be isolated as intermediate steps in

65 6.3 Remarks proving minimax bounds. This will be the case when we study statistical learning in depth.

66 7 Stochastic processes, Empirical processes, Martingales, Tree Processes

Learning theory is intimately related to the study of stochastic processes. Statis- tical learning is concerned with empirical and Rademacher processes, while the study of sequential prediction, as we will see soon, involves a martingale-type pro- cesses. Think of this part of the course as a bag of tools we need to study the values of various learning problems introduced earlier.

7.1 Motivation

Let us first motivate the need to study stochastic processes in the settings of sta- tistical learning and sequential prediction.

7.1.1 Statistical Learning

For the setting of statistical learning, the motivation is quite simple. Consider the iid value V (F,n) defined in (5.10). Let us take a particular estimator yb, namely the empirical risk minimizer (ERM)

n 1 X yb argminLb(f ), where Lb(f ) , `(f ,(Xt ,Yt )). = f F n t 1 ∈ = Denoting

f ∗ argminL(f ) = f F ∈

67 7.1 Motivation we have for any (X n,Y n), © ª © ª © ª L(yb) inf L(f ) L(yb) Lb(yb) Lb(yb) Lb(f ∗) Lb(f ∗) L(f ∗) . (7.1) − f F = − + − + − ∈ The second term is negative by the definition of yb, and the third term is zero in expectation over (Xn, Yn). Hence © ª © ª EL(yb) inf L(f ) E L(yb) Lb(yb) Esup L(f ) Lb(f ) (7.2) − f F ≤ − ≤ f F − ∈ ∈ As we will see later in the lecture, the last quantity is the expected supremum of an empirical process.

7.1.2 Sequential Prediction

Instead of a full-blown n-round prediction problem, let us consider the two-round game discussed in the previous section. Recall that the players make moves (a1,b1) at the first step, and then (a2,b2) at the next step. Consider the minimax value de- fined in (5.4), and suppose it is equal to the maximin value

sup inf Eb1 p1 sup inf Eb2 p2 `(a1,b1,a2,b2) ∼ ∼ p1 P a1 A p2 P a2 A ∈ ∈ ∈ ∈ Since the object of sequential prediction studied so far is regret, we set

`(a1,b1,a2,b2) `(a1,b1) `(a2,b2) inf {`(a,b1) `(a,b2)}. = + − a A + ∈ We can now write the value of the two-stage game as ½ ¾ sup inf Eb1 p1 sup inf Eb2 p2 `(a1,b1) `(a2,b2) inf {`(a,b1) `(a,b2)} ∼ ∼ p1 P a1 A p2 P a2 A + − a A + ∈ ∈ ∈ ∈ ∈ " ½ ¾#

sup inf Eb1 `(a1,b1) sup inf Eb2 `(a2,b2) Eb2 inf {`(a,b1) `(a,b2)} = p1 P a1 A + p2 P a2 A − a A + ∈ " ∈ ∈ ∈ ( ∈ )#

sup inf Eb1 `(a1,b1) Eb1 sup inf Eb2 `(a2,b2) Eb2 inf {`(a,b1) `(a,b2)} = p1 P a1 A + p2 P a2 A − a A + ∈ ∈ " ∈ ∈ ∈ #

sup Eb1 sup Eb2 inf Eb1 `(a1,b1) inf Eb2 `(a2,b2) inf {`(a,b1) `(a,b2)} = p1 P p2 P a1 A + a2 A − a A + ∈ ∈ ∈ ∈ ∈ £ ¤ sup Eb1 sup Eb2 sup Eb1 `(a,b1) Eb2 `(a,b2) {`(a,b1) `(a,b2)} ≤ p1 P p2 P a A + − + ∈ ∈ ∈

68 7.2 Defining Stochastic Processes

where in the last step we replaced the infima over a1 and a2 with the particular choice a from the third term. The operator sup E sup E can be simply p1 P b1 p2 P b2 ∈ ∈ written as supp E(b1,b2) p where p is a joint distribution. Hence, the value of the ∼ game is upper bounded by

© ª supE(b1,b2) sup Eb1 p1 `(a,b1) `(a,b1) Eb2 p2( b1)`(a,b2) `(a,b2) , (7.3) p a A ∼ − + ∼ ·| − ∈ and p ( b ) is the conditional distribution on b given b . 2 ·| 1 2 1 We have arrived at an expression which can be recognized as the expected supremum of a stochastic process. Those familiar with the subject observe that the two differences in the last expression form a (rather short) martingale differ- ence sequence. Unlike the statistical learning scenario, the process appears to be non-i.i.d. Next, we precisely define various stochastic processes, and then proceed to study their properties. Equipped with some basic inequalities, we will turn to the study of the suprema of the relevant processes. Before proceeding, let us agree on some notation. Whenever talking about ab- stract processes, we will refer to F as the function class. For learning applications, however, the function class is really `(F) {`(f , ): f F}. We can always switch , · ∈ between the two by thinking of ` f `(f , ) as our functions on Z X Y. For the ◦ = · = × abstract study of stochastic processes, this distinction will be immaterial.

7.2 Defining Stochastic Processes

Definition 7.1. Let (Ω,A,P) be a probability space. A real-valued random variable U is a measurable mapping Ω R. A stochastic process is a collection {U : s S} 7→ s ∈ of random variables on Ω indexed by a set S.

More generally, we can define B-valued random variables for B Rd or some = abstract Banach space B. In such a case, a random variable is a measurable map from (Ω,A,P) into B equipped with Borel σ-algebra generated by the open sets of B [35]. A stochastic process is defined through its state space (that is, R or B), the in- dex set S, and the joint distributions of the random variables. If S is infinite, care should be taken to ensure measurability of events. In this course, we will omit these complications, and assume that necessary conditions hold to ensure mea- surability.

69 7.2 Defining Stochastic Processes

The first important stochastic process we study is the one that arises from av- eraging over i.i.d. data:

Definition 7.2. An empirical process is a stochastic process {Gf } indexed by a func- tion class f F and defined as ∈ 1 n X ¡ ¢ ˆ Gf , Ef (Z) f (Zt ) E(f ) E(f ) n t 1 − = − = where Z1,..., Zn, Z are i.i.d. (Oftentimes in the literature, the normalization factor is 1 instead of 1 ). pn n Definition 7.3. A random variable ² taking on values { 1} with equal probability ± is called a Rademacher random variable.1

Definition 7.4. Let ² ,...,² { 1} be independent Rademacher random variables. 1 n ∈ ± A Rademacher process is a stochastic process {S } indexed by a set F Rn of vec- a ⊂ tors a F and defined as n ∈ 1 X Sa , ²t at n t 1 = Given zn {z ,...,z } Zn and a class F of functions Z R, we define the = 1 n ∈ → Rademacher process on F as

n 1 X Sf , ²t f (zt ) n t 1 = for f F. Since zn is fixed, we may think of a (f (z ),..., f (z )) as a vector that ∈ = 1 n corresponds to f , matching the earlier definition. From this point of view, the behavior of the functions outside zn is irrelevant, and we may view the set

F F {(f (z ),..., f (z )) : f F} (7.4) = |(z1,...,zn ) = 1 n ∈ as the finite-dimensional projection of the function class F onto zn. The processes defined so far are averages of functions of independent random variables. We now bring in the notion of temporal dependence, which will play an important role for the analysis of sequential prediction problems.

1Did you know that Hans Adolph Rademacher (1892-1969) was a professor here at UPenn?

70 7.2 Defining Stochastic Processes

Definition 7.5. Let S {0,1,2,...,}. A stochastic process {U } is a discrete-time = s martingale if

E{Us 1 U1,...,Us} Us + | = and E U for all s S. More generally, a stochastic process {U } is a martingale | s| < ∞ ∈ s with respect to another stochastic process {Vs} if

E{Us 1 V1,...,Vs} Us + | = and E U . A stochastic process {U } is a martingale difference sequence (MDS) | s| < ∞ s if

E{Us 1 V1,...,Vs} 0 + | = for some stochastic process {Vs}. Any martingale {Vs} defines a martingale differ- ence sequence Us Vs Vs 1. = − − We now define a “dependent” version of the i.i.d. empirical process.

Definition 7.6. An empirical process with dependent data is a stochastic process {M } indexed by a function class f F and defined as f ∈ n 1 X ³ © ¯ ª ´ Mf , E f (Zt ) ¯ Z1,..., Zt 1 f (Zt ) n t 1 − − = where (Z1,..., Zn) is a discrete-time stochastic process with a joint distribution P. n © ¯ ª o Clearly, the sequence E f (Zt ) ¯ Z1,..., Zt 1 f (Zt ) is a martingale-difference − − sequence for any f . Furthermore, the notion of an empirical process with depen- dent data boils down to the classical notion if Z1,..., Zn are i.i.d. When specifying martingales, we can talk more generally about filtrations, de- fined as an increasing sequence of σ-algebras

A A ... A. 0 ⊂ 1 ⊂ ⊂

A martingale is then defined as a sequence of As-measurable random variables Us © ª such that E Us 1 As Us. + | = Of particular interest is the dyadic filtration {A } on Ω { 1,1}N given by A t = − t = σ(²1,...,²t ), where ²t ’s are independent Rademacher random variables. Fix Z- t 1 valued functions zt : Ω − Z for all t 1. Then the random variables zt (²1,...,²t 1) → ≥ −

71 7.2 Defining Stochastic Processes

are At 1-measurable with respect to the dyadic filtration, and the discrete-time − stochastic process n o ²t zt (²1,...,²t 1) − is a martingale difference sequence. Indeed,

E{²t zt (²1,...,²t 1) ²1,...,²t 1} 0 . − | − = A sequence z (z ,...,z ) is called a Z-valued tree. = 1 n Example 2. To give a bit of intuition about the tree and the associated martingale difference sequence, consider a scenario where we start with a unit amount of money and repeatedly play a fair game. At each stage, we flip a coin and either gain or lose half of our current amount. So, at the first step, we either lose 0.5 or gain 0.5. If we gain 0.5 (for the total of 1.5) the next differential will be 0.75. If, ± however, we lost 0.5 at the first step, the next coin flip will result in a gain or loss of 0.25. It is easy to see that this defines a complete binary tree z. Given any prefix, such as (1, 1,1), the gain (or loss) z4(1, 1,1) at round 4 is determined. The sum Pn − − t 1 ²t zt (²1,...,²t 1) determines the total payoff. = −

We may view the martingale {Us} with

s X Us ²t zt (²1,...,²t 1) = t 1 − = as a random walk with symmetric increments z which depend on the path that ± s got us to this point. Such martingales are known as the Walsh-Paley martingales. Interestingly enough, these martingales generated by the Rademacher random variables are, in some sense, “representative” of all the possible martingales with values in Z. We will make this statement precise and use it to our advantage, as these tree-based martingales are much easier to deal with than general martin- gales. A word about the notation. For brevity, we shall often write z (²), where ² t = (²1,...,²n), but it is understood that zt only depends on the prefix (²1,...,²t 1). − Now, given a tree z and a function f : Z R, we define the composition f z as → t 1 ◦ a real-valued tree (f z ,..., f z ). Each f z is a function { 1} − R and ◦ 1 ◦ n ◦ t ± → n o ²t f (zt (²1,...,²t 1)) − is also a martingale-difference sequence for any given f .

72 7.3 Application to Learning

Definition 7.7. Let ² ,...,² { 1} be independent Rademacher random variables. 1 n ∈ ± Given a tree z, a stochastic process {Tf } defined as

n 1 X Tf , ²t f (zt (²1,...,²t 1)) n t 1 − = will be called tree process indexed by F.

Example 2, continued Let f be a function that gives the level of excitement of a person observing his fortune Us going up and down, as in the previous exam- ple. If the increment zt at the next round is large, the person becomes very happy (large f (z )) upon winning the round, and very unhappy f (z ) upon losing. A + t − t person who does not care about the game might have a constant level f (z ) t = 0 throughout the game. On the other extreme, suppose someone becomes ag- itated when the increments z become close to zero, thus having a large f (z ) t ± t ups and downs. Suppose F contains the profiles of a group of people observing the same outcomes. An interesting object of study is the largest cumulative level Pn supf F t 1 ²t f (z(²1,...,²t 1)) after n rounds. ∈ = −

We may view the tree process Tf as a generalization of the Rademacher pro- cess S . Indeed, suppose z (z ,...,z ) is a sequence of constant mappings such f = 1 n that zt (²1,...,²t 1) zt for any (²1,...,²t 1). In this case, Tf and Sf coincide. Gen- − = − erally, however, the tree process can behave differently (in a certain sense) from the Rademacher process. Understanding the gap in behavior of the two processes will have an implication on the understanding of learnability in the i.i.d. and ad- versarial models.

7.3 Application to Learning

Turning to the setting of statistical learning theory, we see from (7.2) that the excess loss of the empirical minimizer is upper bounded by © ª EL(yb) inf L(f ) Esup L(f ) Lb(f ) E sup Gg , (7.5) − f F ≤ f F − = g `(F) ∈ ∈ ∈ the expected supremum of the empirical process indexed by the loss class. Thus, to obtain upper bounds on the excess loss EL(yb) inff F L(f ) of empirical risk min- − ∈ imizer it is sufficient to obtain upper bounds on the expected supremum of the

73 7.4 Symmetrization empirical process. Furthermore, a distribution-independent upper bound on the expected supremum leads to an upper bound on the minimax value.

Theorem 7.8. Let yb be the ERM algorithm. Then

EL(yb) inf L(f ) E sup Gg , (7.6) − f F ≤ g `(F) ∈ ∈ and hence ( ) ( ) iid V (F,n) sup EL(yb) inf L(f ) sup E sup Gg (7.7) ≤ P − f F ≤ P g `(F) ∈ ∈ Now, consider the setting of sequential prediction with individual sequences. The proof given in Section 7.1.2 for n 2 can be readily generalized for any n, and = we then arrive at the following upper bound on the value Vseq (F,n) defined in (5.15):

Theorem 7.9. The value of the sequential prediction problem is upper bounded as

( n ) ( ) seq 1 X ³ © ¯ t 1ª ´ V (F,n) sup Esup E `(f , Zt ) ¯ Z − `(f , Zt ) sup E sup Mg ≤ P f F n t 1 − = P g `(F) ∈ = ∈ (7.8) where the supremum is over all joint distributions P on sequences (Z1,..., Zn).

The next key step is to pass from the stochastic processes Gg and Mg to the simpler processes Sg and Tg which are generated by the Rademacher random variables. The latter two processes turn out to be much more convenient. To make this happen, we appeal to the so-called symmetrization technique.

7.4 Symmetrization

We now employ a symmetrization device to pass from the supremum of the em- pirical process to the supremum of the Rademacher process.

Theorem 7.10. For a class F of functions bounded by C, the expected suprema of empirical and Rademacher processes satisfy

Esup Gf 2Esup Sf f F | | ≤ f F | | ∈ ∈ 74 7.4 Symmetrization and the same statement holds without absolute values. Furthermore, 1 C Esup Gf Esup Sf f F | | ≥ 2 f F | | − 2pn ∈ ∈ For symmetric classes, the statement without absolute values and without the neg- ative term.

Proof. We prove the statement without the absolute values. Observe that

½ n ¾ 1 X EsupGf Esup Ef (Z) f (Zt ) f F = f F − n t 1 ∈ ∈ = ½ · n ¸ n ¾ 1 X 1 X Esup E f (Z0t ) f (Zt ) = f F n t 1 − n t 1 ∈ = = where the “ghost sample” Z10 ,..., Zn0 is i.i.d. and all Z0t ’s have the same distribution as Zt ’s. By exchanging supremum and the expectation over the ghost sample, we arrive at an upper bound

½ n ¾ 1 X EsupGf Esup (f (Z0t ) f (Zt )) f F ≤ f F n t 1 − ∈ ∈ = where the expectation is now over the double sample. And now for the tricky part. Let us define a function h as ½ n ¾ 1 X h(Z1,..., Zn, Z10 ,..., Zn0 ) sup (f (Z0t ) f (Zt )) . = f F n t 1 − ∈ = Then for any fixed sequence (² ,...,² ) { 1}n, the expression 1 n ∈ ± ½ n ¾ 1 X sup ²t (f (Z0t ) f (Zt )) f F n t 1 − ∈ = is simply a permutation of the coordinates of h(Z1,..., Zn, Z10 ,..., Zn0 ). To illustrate, consider

g(1,1,...,1) h(Z , Z ,..., Z , Z0 , Z0 ,..., Z0 ) = 1 2 n 1 2 n while

g( 1,1,...,1) h(Z0 , Z ,..., Z , Z , Z0 ,..., Z0 ). − = 1 2 n 1 2 n The sequence of signs permutes the respective pairs of coordinates of the function h. But since all the random variables are independent and identically distributed,

75 7.4 Symmetrization any such permutation hardly changes the expectation with respect to the data. Hence, ½ n ¾ ½ n ¾ 1 X 1 X Esup (f (Z0t ) f (Zt )) Esup ²t (f (Z0t ) f (Zt )) f F n t 1 − = f F n t 1 − ∈ = ∈ = with the expectation over the double sample and over the Rademacher random variables ²1,...,²n. We now split the expression into two suprema: ½ n ¾ ½ n ¾ ½ n ¾ 1 X 1 X 1 X Esup ²t (f (Z0t ) f (Zt )) Esup ²t f (Z0t ) Esup ²t f (Zt ) f F n t 1 − ≤ f F n t 1 + f F n t 1 − ∈ = ∈ = ∈ = ½ n ¾ 1 X 2Esup ²t f (Z0t ) = f F n t 1 ∈ = because the distributions of ² and ² are the same. We have arrived at − t t ( ( ¯ )) 1 n ¯ X ¯ EsupGf E E sup ²t f (Zt ) ¯ Z1,..., Zn f F ≤ f F n t 1 ¯ ∈ ( (∈ = ¯ )) ¯ ¯ 2E E supSf ¯ Z1,..., Zn = f F ¯ ∈ 2EsupSf = f F ∈ The other direction also holds: ( ¯ n ¯) ¯ 1 X ³ ´¯ Esup Sf E sup¯ ²t (f (Zt ) Ef (Z)) ²t Ef (Z) ¯ f F | | = f F ¯n t 1 − + ¯ ∈ ( ∈ = ) ¯ ¯ ¯ 1 n ¯ ¯µ 1 n ¶ ¯ ¯ X ¯ ¯ X ¯ E sup¯ ²t (f (Zt ) Ef (Z))¯ E¯ ²t supEf (Z)¯, ≤ f F ¯n t 1 − ¯ + ¯ n t 1 f F ¯ ∈ = = ∈ q 1 and the last term is upper bounded by supf F Ef n . We now proceed to intro- | ∈ | duce a ghost sample and then eliminate the random signs in the same way as they were initially introduced.

( ¯ n ¯) ( ¯ n ¯) ¯ 1 X ¯ ¯ 1 X ¯ E sup¯ ²t (f (Zt ) Ef (Z))¯ E sup¯ ²t (f (Zt ) f (Z0t ))¯ f F ¯n t 1 − ¯ ≤ f F ¯n t 1 − ¯ ∈ = ∈ = ( ¯ n ¯) ¯ 1 X ¯ E sup¯ (f (Zt ) f (Z0t ))¯ = f F ¯n t 1 − ¯ ∈ = ( ¯ n ¯) ¯ 1 X ¯ 2E sup¯ (Ef (Z) f (Zt ))¯ ≤ f F ¯n t 1 − ¯ ∈ =

76 7.4 Symmetrization

If the class F is symmetric, then performing the same steps without the absolute n¡ 1 Pn ¢ o values we see that the term E n t 1 ²t supf F Ef (Z) 0, and symmetry is in- = ∈ = voked in the last step.

We now apply the symmetrization technique to the empirical process with de- pendent data. Note that any such process is an average of martingale difference sequences (MDS). The symmetrization argument for MDS is more delicate than the one for the i.i.d. case, as swapping Zt and Z0t changes every history for martingale differences with index greater than t. The very same problem has been previously experi- enced by the characters of “The End of Eternity” (by ): by traveling into the past and changing it, the future is changed as well. One way to go around this conundrum is to pass to the worst-case future, a pessimistic yet instructive approach. For this purpose, we will perform symmetrization from inside out and pass to the worst-case martingale difference sequence.

Theorem 7.11. The following relation holds between the empirical process with de- pendent data and the tree process:

EsupMf 2supEsupTf (7.9) f F ≤ z f F ∈ ∈ where the supremum is taken over all Z-valued binary trees of depth n. Further- more, Ã ! 1 C supEsupTf supEsupMf 2supEsupTf (7.10) 2 z f F − pn ≤ MDS f F ≤ z f F ∈ ∈ ∈ where C supf F f . = ∈ | |∞ Proof. By definition,

n 1 X ³ © ¯ ª ´ EsupMf Esup E f (Zt ) ¯ Z1,..., Zt 1 f (Zt ) (7.11) f F = f F n t 1 − − ∈ ∈ = Omitting the normalization factor n, let us do the argument for the last time step

77 7.4 Symmetrization n:

(n 1 ) X− ³ © ¯ ª ´ ³ © ¯ ª ´ Esup E f (Zt ) ¯ Z1,..., Zt 1 f (Zt ) E f (Zn) ¯ Z1,..., Zn 1 f (Zn) f F t 1 − − + − − ∈ = (n 1 ) X− ³ © ¯ ª ´ ³ ´ Esup E f (Zt ) ¯ Z1,..., Zt 1 f (Zt ) f (Zn0 ) f (Zn) ≤ f F t 1 − − + − ∈ = (n 1 ) X− ³ © ¯ ª ´ ³ ´ Esup E f (Zt ) ¯ Z1,..., Zt 1 f (Zt ) ²n f (Zn0 ) f (Zn) = f F t 1 − − + − ∈ = where it is important to keep in mind that Zn0 and Zn are (conditionally) indepen- dent and distributed identically given Z1,..., Zn 1. We now make a very bold step. − We upper bound the last quantity by the supremum over Zn0 and Zn:

(n 1 ) X− ³ © ¯ ª ´ ³ ´ E sup E²n sup E f (Zt ) ¯ Z1,..., Zt 1 f (Zt ) ²n f (zn0 ) f (zn) − − + − zn ,z Z f F t 1 n0 ∈ ∈ = Certainly, this is allowed since the expectation can only be smaller. What have we achieved? When we do the same trick for n 1, there will be no random variable − with a distribution that depends on Zn 1. That is, by passing to the worst-case zt ’s − we can always perform symmetrization for the previous time step. For n 2, we − obtain an upper bound of

(n 2 X− ³ © ¯ ª ´ E sup E²n 1 sup E²n sup E f (Zt ) ¯ Z1,..., Zt 1 f (Zt ) − − − zn 1,z0 zn ,zn0 f F t 1 − n 1 ∈ = − ³ ´ ³ ´o ²n 1 f (zn0 1) f (zn 1) ²n f (zn0 ) f (zn) + − − − − + − Proceeding in this manner, we get an upper bound of

½ n ¾ ½ n ¾ X ³ ´ X supE²1 supE²2 ... sup E²n sup ²t f (zt0 ) f (zt ) 2supE²1 supE²2 ...supE²n sup ²t f (zt ) z ,z z ,z z ,z f F t 1 − ≤ z1 z2 zn f F t 1 1 10 2 20 n n0 ∈ = ∈ = (7.12)

“Abandon all hope, ye who enter here”, you might say, as we seem to have lost all the randomness of Zt ’s and replaced them with some worst-case determinis- tic choices. Let us examine the nth operator E E , which we replaced with Zn ,Zn0 ²n sup E . Had we instead used E sup , the resulting value would indeed zn ,zn0 ²n ²n zn ,zn0 be too large. However, if the zt ’s are chosen before the sign ²n is drawn, we still have the hope that the ²n’s are carrying enough “randomness”. Indeed, the dyadic

78 7.5 Rademacher Averages

(Walsh-Paley) martingales are “representative” of all the martingales, as this proof shows. Now, we claim that the right-hand side of (7.12) is nothing but a tree process for the worst-case tree. Indeed, the first supremum is achieved at some z∗ Z. 1 ∈ The second supremum is achieved at z∗( 1) if ² 1 and at some potentially 2 + 1 = + different value z∗( 1) if ² 1. Proceeding in this manner, it is not difficult to see 2 − 1 = − that ½ n ¾ ½ n ¾ X X supE²1 supE²2 ...supE²n sup ²t f (zt ) supE²1,...,²n sup ²t f (zt (²1,...,²t 1)) z1 z2 zn f F t 1 = z f F t 1 − ∈ = ∈ = (7.13)

n supEsupTf = z f F ∈

Conclusion of Theorem 7.11: the expected supremum of the worst-case empirical process with dependent data is within a factor of 2 from the ex- pected supremum of the worst-case tree process. That is, when it comes to studying the supremum, the dyadic martingales are representative of all the martingales.

7.5 Rademacher Averages

The expected suprema of the Rademacher and tree processes are so important in our developments that we will give them special names.

Definition 7.12. The expected supremum of a Rademacher process

( n ) iid 1 X R (F) , EsupSf E sup ²t f (Zt ) (7.14) f F = f F n t 1 ∈ ∈ = is variably called Rademacher averages or Rademacher complexity of a class F. De- fine conditional Rademacher averages as ( ¯ ) 1 n ¯ iid X ¯ Rb (F; Z1,..., Zn) , E sup ²t f (Zt ) ¯ Z1,..., Zn , (7.15) f F n t 1 ¯ ∈ = where the expectation is only with respect to the i.i.d. Rademacher random vari- ables ²1,...,²n.

79 7.5 Rademacher Averages

If conditioning on data is understood, we shall omit the word “conditional”.

Definition 7.13. The expected supremum of a worst-case tree process

( n ) seq 1 X R (F) , supEsupTf supE sup ²t f (zt (²1:t 1)) (7.16) z f F = z f F n t 1 − ∈ ∈ = is called sequential Rademacher averages or sequential Rademacher complexity of a class F. Define conditional sequential Rademacher averages on a given tree z as

( n ) seq 1 X Rb (F;z) , E sup ²t f (zt (²1:t 1)) . (7.17) f F n t 1 − ∈ =

P Exercise (?): Show that

iid iid seq R (F) sup Rb (F;z1,...,zn) R (F) (7.18) ≤ z1,...,zn ≤

P Exercise (?): Let us make the size n apparent by writing the subscript on seq seq Rn (F). Show that nRn (F) is nondecreasing in n.

seq seq P Exercise (?): Show that R (F) R (F). 2n ≤ n The following properties of Rademacher averages greatly expand the scope of problems that can be studied with the tools introduced in this course. These and further results can be found in Bartlett and Mendelson [5]:

Lemma 7.14. For any z1,...,zn, conditional Rademacher averages satisfy

1. If F G, then Rb iid (F;z ,...,z ) Rb iid (G;z ,...,z ) ⊆ 1 n ≤ 1 n 2. Rb iid (F;z ,...,z ) Rb iid (conv(F);z ,...,z ) 1 n = 1 n 3. For any c R, Rb iid (cF;z ,...,z ) c Rb iid (F;z ,...,z ) ∈ 1 n = | | 1 n For any Z-valued tree z of depth n, the above three properties also hold for condi- tional sequential Rademacher complexity.

P Exercise (?): Prove Lemma 7.14.

80 7.6 Skolemization

7.6 Skolemization

This is probably a good point to make precise the exchange of expectations and suprema that has taken place several times by now: a) when the interleaved ex- pectations and suprema were collapsed to an expectation over a joint distribution in (7.3), and b) when the sequence of suprema and expectations over random signs became a tree in (7.13). Both of these exchanges follow the same simple (yet very useful) logic. Consider a quantity Ea supb B ψ(a,b) where a is a random variable ∈ with values in A. We now claim that

Ea supψ(a,b) supEaψ(a,γ(a)) b B = γ ∈ where the supremum ranges over all functions γ : A B. A simple proof of this → statement is left as an exercise. Whenever faced with a long sequence of expectations and suprema, the trick described above (which we call skolemization) is very handy. In particular, the trick gives rise to the joint distribution in Theorem 7.9 and in Eq. (7.3). It also gives rise to the concept of a tree in Eq. (7.13), which is nothing but a sequence of skolemized mappings.

7.7 ... Back to Learning

Putting together Theorem 7.8 and Theorem 7.10, as well as Theorem 7.9 and The- orem 7.11, we get the following two corollaries for learning.

Corollary 7.15. Let yb be the ERM algorithm. Then

EL(yb) inf L(f ) E sup Gg 2E sup Sg , (7.19) − f F ≤ g `(F) ≤ g `(F) ∈ ∈ ∈ and hence

iid V (F,n) 2supE sup Sg (7.20) ≤ P g `(F) ∈ Corollary 7.16. The value of the sequential prediction problem is upper bounded as

seq V (F,n) 2supE sup Tg (7.21) ≤ z g `(F) ∈

81 8 Example: Learning Thresholds

To illustrate the behavior of stochastic processes in conjunction with learning, consider the simplest classification problem possible – thresholds on a unit in- terval. To this end, let X [0,1], Y {0,1}, and = = F ©f (x) I{x θ} : θ [0,1]ª. = θ = ≤ ∈ The loss function is the indicator of a mistake: `(f ,(x, y)) I©f (x) yª f (x) y . = 6= = | − | We now consider the problem of learning thresholds in various scenarios.

8.1 Statistical Learning

X Suppose PX Y is an arbitrary distribution on X Y and D Y . Given the data × × = X Y n {( t , t )}t 1, we can construct an empirical minimizer yb fθˆ via = = n ˆ 1 X ¯ ¯ θ arg min ¯fθ(Xt ) Yt ¯, ∈ θ [0,1] n t 1 − ∈ = a threshold location that incurs the smallest number of mistakes.

1

0 ✓ˆ

With the tools from the previous section we can almost immediately answer the question: does the excess risk EL(yb) inff F L(f ) decay as n increases? Along − ∈ 82 8.1 Statistical Learning the way, we will illustrate a couple of techniques that will be useful in the next lecture. First, observe that by Corollary 7.15,

· n ¸ 1 X ¯ ¯ EL(yb) inf L(f ) E sup Gg 2E sup Sg 2E sup ²t ¯fθ(Xt ) Yt ¯ (8.1) − f F ≤ g `(F) ≤ g `(F) = θ [0,1] n t 1 − ∈ ∈ ∈ ∈ = It is easy to verify that a b a(1 2b) b for a,b {0,1}. We can then simplify | − | = − + ∈ the supremum of the Rademacher process as

· n ¸ · n ¸ 1 X ¯ ¯ 1 X E sup ²t ¯fθ(Xt ) Yt ¯ E sup ²t (1 2Yt )fθ(Xt ) ²t Yt (8.2) θ [0,1] n t 1 − = θ [0,1] n t 1 − + ∈ = ∈ = · n ¸ 1 X E sup ²t fθ(Xt ) (8.3) = θ [0,1] n t 1 ∈ =

The last equality follows by noticing that ²t Yt is zero-mean. Furthermore, condi- tionally on Y ,..., Y , the distribution of (1 2Y )² is again Rademacher and can, 1 n − t t therefore, be replaced by ²t . The last quantity is the supremum of the Rademacher process, but without the Y component and without the loss function:

· n ¸ · n ¸ 1 X 1 X E sup ²t fθ(Xt ) Yt E sup ²t fθ(Xt ) (8.4) θ [0,1] n t 1 | − | = θ [0,1] n t 1 ∈ = ∈ = This step is a precursor of a general technique called a contraction principle. Since the class F of step functions f (x) I{x θ} is somewhat easier to deal with than θ = ≤ the loss class `(F), it is convenient to “erase” the loss function. Now, let us pass back to the supremum of the empirical process via Theo- rem 7.10:

· n ¸ ¯ n ¯ 1 X ¯ 1 X ³ ´¯ 1 E sup ²t fθ(Xt ) 2E sup ¯ Efθ(X) fθ(Xt ) ¯ θ [0,1] n t 1 ≤ θ [0,1] ¯n t 1 − ¯ + pn ∈ = ∈ = Combining all the steps together,

¯ n ¯ ¯ 1 X ¯ 2 EL(yb) inf L(f ) 4E sup ¯EI{X θ} I{Xt θ}¯ (8.5) − f F ≤ θ [0,1] ¯ ≤ − n t 1 ≤ ¯ + pn ∈ ∈ = This quantity might be familiar. Let F (θ) P(X θ) be the cumulative distribution = ≤ function for a random variable X with a marginal distribution PX . Let Fˆn(θ) be the

83 8.2 Separable (Realizable) Case

empirical distribution function. By the Law of Large Numbers, Fˆn(θ) converges to F (θ) for any given θ. A stronger statement

sup Fˆn(θ) F (θ) 0 almost surely θ | − | → was shown by Glivenko and Cantelli in 1933, and it gives us the desired result: the upper bound in (8.5) converges to zero. Kolmogorov showed that, in fact, 1/2 sup Fˆ (θ) F (θ) converges to zero at the rate n− , the same rate as that for θ | n − | a single θ. In some sense, we get supθ for free! This remarkable fact serves as the motivation for the next few lectures.

We conclude that the excess loss EL(yb) inff F L(f ) of the empirically-best 1/2 − ∈ iid threshold decays at the rate of n− for all distributions PX Y . Thus, V (F,n) 1/2 × = O(n− ).

P Exercise (??): Prove that this rate is not improvable if one is allowed to choose any distribution PX Y . ×

8.2 Separable (Realizable) Case

Let us describe a non-distribution-free case. One (very strong) assumption we could make is to suppose that Y f (X ) for some threshold θ∗ [0,1]. Recall that t = θ∗ t ∈ this assumption lands us right into the PAC framework discussed in the introduc- tory lecture. Such a problem is called “separable” (or, realizable), as the examples labeled with 0 and 1 are on the opposite sides of the threshold. It is not hard to see that the expected loss © ª EL(y) EI f ˆ(X) f (X) b = θ 6= θ∗ 1 of the empirically best threshold θˆ should behave as n− rather than the previously 1/2 shown rate of n− . To gain intuition, suppose PX is uniform on [0,1]. Then © ª I f ˆ(X) f (X) E θˆ θ . θ 6= θ = | − | Since data are separable, θˆ can be chosen as the middle of the interval between the right-most value with label 1 and the left-most value with label 0. The distance θˆ θ is then at most half the distance between two such extreme values. With n | − | 1 points, the distance should be roughly of the order n− , up to logarithmic factors which can be removed with a bit more work.

84 8.3 Noise Conditions

1

0 ✓ˆ ✓⇤

Figure 8.1: Separable case: the ERM solution θˆ quickly converges to θ∗.

1 P Exercise (?): Prove the O(n− logn) rate.

8.3 Noise Conditions

1 1/2 Less restrictive assumptions that lead to rates between n− and n− are assump- tions on the noise around the decision boundary. Let η(x) f (x) E[Y X x] be = P = | = the regression function. It is well known (see e.g. [21]) that the best classifier with © ª respect to the indicator loss is f ∗ I η(x) 1/2 . Of course, the classifier cannot = ≥ be computed as P is unknown. Suppose, however, that f ∗ f F. That is, the = θ∗ ∈ label 1 is more probable than 0 for x θ∗ and less probable for any x θ∗. Var- ≤ > ious assumptions about the behavior of fP around θ∗ translate into intermediate rates, as discussed above. Such conditions are the so-called Tsybakov’s noise con- dition and Massart’s noise condition. Of course, these extend beyond the setting of learning thresholds.

1

⌘(x) 0

© ª Figure 8.2: In this example, the Bayes classifier f ∗ I η(x) 1/2 is a threshold = ≥ function. The behavior of η(x) around 1/2 determines the difficulty of the predic- tion problem.

Importantly, we are not making assumptions about the behavior of the distri- bution other than through the behavior of fP at the decision boundary. For the

85 8.4 Prediction of Individual Sequences

purposes of prediction, it is not important to know the global properties of the underlying distribution, in contrast to the typical assumptions made in statistics.

8.4 Prediction of Individual Sequences

We now consider the same problem in the sequential prediction framework, with n n no assumption on the sequence {(xt , yt )}t 1 (X Y) . According to the improper = ∈ × learning protocol, at round t we observe x , predict y {0,1} and observe the label t bt ∈ y {0,1} chosen (simultaneously with y ) by Nature. The question is whether t ∈ bt regret 1 X © ª 1 X © ª I ybt yt inf I fθ(xt ) yt n t 1 6= − θ [0,1] n t 1 6= = ∈ = can be made small. By Corollary 7.16, the value of the sequential prediction prob- lem is upper bounded as

seq V (F,n) 2supE²1,...,²n sup T`(fθ) (8.6) ≤ z θ [0,1] ∈ where the supremum is over all (X Y)-valued trees z of depth n. Let us write a × (X Y)-valued tree equivalently as (x,y) where x and y are X and Y-valued. That is, × the [0,1]-valued tree in our example has values in the space of covariates, and the y tree gives labels {0,1} on the corresponding nodes. We now construct a particular pair (x,y) which will make the supremum in (8.6) large. Take y as the tree with all zeros. Define x 1 , x ( 1) 1 , x (1) 3 , 1 = 2 2 − = 4 2 = 4 x ( 1, 1) 1 , x ( 1,1) 3 , x (1, 1) 5 , x (1,1) 7 and so forth. The first three 3 − − = 8 3 − = 8 3 − = 8 3 = 8 levels of the tree are illustrated in figure (8.3). In principle, we can construct an infinite dyadic tree in this manner. Such a tree will play an important role later. Say, n 3 and the tree x is exactly as just described. Let y 0 and take a path, = =

86 8.4 Prediction of Individual Sequences

x1

x ( ) x (+) 2 2

x ( , ) x (+, +) 3 3

✓⇤

1 1 3 0 1 4 2 4

Figure 8.3: Construction of a bad binary tree. say ( 1, 1, 1). For this path, − + − n 1 X sup T`(fθ) sup ²t fθ(xt (²1:t 1)) θ [0,1] = θ [0,1] n t 1 − ∈ ∈ = 1³ ´ sup fθ(x1) fθ(x2( 1)) fθ(x3( 1, 1)) = θ [0,1] 3 − + − − − + ∈ 1³ ´ sup I{x1 θ} I{x2( 1) θ} I{x3( 1, 1) θ} = θ [0,1] 3 − ≤ + − ≤ − − + ≤ ∈

Observe that there is a θ∗ (shown in Figure 8.3) such that the first and third indi- cators are zero while the second is one. In other words, there is a threshold that annihilates all the indicators with a negative sign in front of them, and keeps the positive ones. This θ∗ gives an average of 1/3 for the supremum, and it is clearly the maximum for the path ( 1,1, 1). Now take the “opposite” path (1, 1,1). The − − − threshold 1 θ∗ yields 2/3 for the value of the supremum on this path. It is easy to − see that all the paths can be paired in this manner to squeeze the value of 1/2 on average. We thus have

n 1 X 1 E²1,...,²n sup ²t fθ(xt (²1:t 1)) θ [0,1] n t 1 − = 2 ∈ = and conclude that with the constructed tree x and y 0, the expected supremum = of the tree process does not go to zero.

87 8.4 Prediction of Individual Sequences

Theorem 8.1. For X [0,1], Y {0,1}, F the class of thresholds on [0,1] and the = = indicator loss, ( ) 1 supE sup T`(fθ) (x,y) θ [0,1] ≥ 2 ∈ with x,y ranging, respectively, over all X and Y-valued trees.

Why do we care that an upper bound in (8.6) on Vseq (F,n) does not go to zero with n? Maybe the tree process is an overkill, and it is still possible to play the sequential prediction game, ensuring smallness of Vseq (F,n)? We now show that this is not the case, and the prediction problem is itself hopeless. What is inter- esting, Nature can use precisely the same tree x constructed above to ensure the learner incurs n mistakes. We outline the strategy for Nature. At iteration t 1, x is presented as side = 1 information and the learner makes the prediction y {0,1}. In the meantime, b1 ∈ the Nature flips a fair coin y0 { 1} and presents the outcome y (y0 1)/2 1 ∈ ± 1 = 1 + ∈ {0,1}. With probability 1/2, the prediction y y . In the second round, x (y0 ) is b1 6= 1 2 1 presented by Nature to the learner who chooses yb2, and the outcome is compared to a fair coin y {0,1}. 2 ∈ In general, the choice xt (y10 ,..., yt0 1) specifies the strategy of nature. The fair − coin flips y10 ,..., yn0 make the move of the player irrelevant, as

½ n ¾ 1 X © ª 1 E I ybt yt . n t 1 6= = 2 = The comparator term, however, is zero, thanks to the structure given by x. Indeed, any time that yt0 1, the next move is the left child of xt (y10 ,..., yt0 1), while for = − − y0 1 it is the right child. Note that any time an element x is followed by the left t = t child, no other xt 0 to the right of xt will ever be presented, simply given the struc- ture of the tree. By doing so, we are guaranteed that there will be a “consistent” threshold on xt . It is easy to convince yourself that this strategy ensures existence of a threshold θ∗ such that

n 1 X © © ª ª I I xt (y10 ,..., yt0 1) θ∗ yt 0 . n t 1 − ≤ 6= = = Theorem 8.2. The class of thresholds on the unit interval is not learnable in the setting of binary prediction of individual sequences: Vseq (F,n) 1 . ≥ 2

88 8.5 Discussion

8.5 Discussion

We have considered the problem of learning thresholds in the distribution-free setting of statistical learning, in the setting of PAC learning (the separable case), in the setting of learning with some assumptions on the Bayes function E[Y X x], | = and in the setting of prediction of individual sequences. In the first scenario, the study of learnability was possible thanks to the Rademacher and Empirical pro- cesses. In the last scenario, we showed that the expected supremum of the asso- ciated tree process is not decaying to zero, and for a good reason: the prediction problem is not feasible. We will show that in the supervised learning setting con- vergence of the expected supremum of the tree process takes place if and only if the associated prediction problem is feasible. This is quite nice, as we do not even need to “play” the prediction game to know whether there is a strategy for predict- ing well: we can simply study the supremum of the tree process! In the next lecture, we develop tools to study suprema of stochastic processes. Such statements are called maximal inequalities.

89 9 Maximal Inequalities

9.1 Finite Class Lemmas

At this point, we are hopefully convinced that suprema of various stochastic pro- cesses is an object of interest for learning theory. Let us start this endeavor by considering stochastic processes indexed by a finite set.

Lemma 9.1. Suppose {Us}s S is a finite collection of random variables, and assume ∈ 2 2 that there exists a c 0 such that for all s S, Eexp(λU ) ec λ /2 for all λ 0. Then > ∈ s ≤ > q EmaxUs c 2log S . s S ≤ | | ∈ Proof. By Jensen’s inequality, µ ¶ µ ¶ exp λEmaxUs Eexp maxλUs Emaxexp(λUs) s S ≤ s S = s S ∈ ∈ ∈ X X c2λ2/2 E exp(λUs) Eexp(λUs) S e . ≤ s S = s S ≤ | | ∈ ∈ Taking logarithms of both sides and dividing by λ, log S c2λ EmaxUs | | . s S ≤ λ + 2 ∈ p The proof is concluded by choosing λ 2c 2 log S . = − | | The moment generating condition in Lemma 9.1 should be recognized as a condition on the tail decay of the variables Us. Variables satisfying such a condi- tion are called subgaussian. Examples include gaussian random variables as well as bounded random variables.

90 9.1 Finite Class Lemmas

Lemma 9.2 (Hoeffding). For a zero-mean random variableU bounded almost surely as a U b, ≤ ≤ ½λ2(b a)2 ¾ Eexp(λU) exp − (9.1) ≤ 8 The bound of Lemma 9.1 holds for any finite collection of subgaussian ran- dom variables. However, the stochastic processes of interest to us are not arbitrary – they have a particular structure. Specifically, all four processes we’ve discussed (empirical process, Rademacher process, empirical process with dependent data, and the tree process) are all defined as averages of random quantities. We thus ex- pect to obtain more specific statements for the processes of interest. In particular, the typical deviations of averages given by the Central Limit Theorem are 1/pn, so we expect that c in Lemma 9.1 will be a function of n as well as of magnitudes of the random variables being averaged. Let us make this more precise. Let us consider the empirical process with dependent data.

Lemma 9.3. Let F be a finite class of [ 1,1]-valued functions on Z. Then − s 2log F EmaxMf 2 | | f F ≤ n ∈ Proof. We need to check the subgaussianity condition and find the appropriate 1 ¡ © ¯ ª ¢ (smallest) constant c. Denote dt E f (Zt ) ¯ Z1,..., Zt 1 f (Zt ) and observe = n − − that by (9.1) ½ 2 ¾ n t 1o 2λ E exp{λdt } Z − exp | ≤ n2 because each d [ 2,2]. We then have t ∈ − ½ n ¾ "n 1 # © ª X Y− n n 1o Eexp λMf Eexp λ dt E exp{λdt }E exp{λdn} Z − = t 1 = t 1 | = = which is upper bounded by

"n 1 # ½ 2 ¾ Y− 2λ E exp{λdt } exp 2 t 1 × n = Repeating the process, we arrive at ½ 2 ¾ © ª 2λ Eexp λMf exp ≤ n Appealing to Lemma 9.1 with c 2 yields the statement. = pn

91 9.1 Finite Class Lemmas

Of course, the bound of Lemma 9.3 holds for the i.i.d. empirical process Gf as well. The proofs for the Rademacher and the tree processes are basically identical. Nevertheless, there is an important point to make regarding the boundedness as- sumption. Since the empirical processes involve random data Zt , we were forced to make a global assumption on the boundedness of F over Z. Alternatively, we could follow the proof of Lemma 9.3 and replace the global bound on the magni- tude of dt by a conditional variance bound per step. Such results are known and have their merits. For the reasons that will become apparent in the next few lec- tures, we prefer to deal with the Rademacher and the tree processes, as the magni- tudes of dt are fixed. Indeed, we assume that z1,...,zn in the Rademacher process (or the tree z in the tree process) are fixed. We can then give upper bounds in terms of these fixed quantities. Let us consider the unnormalized sum to simplify the notation.

Lemma 9.4. Let V Rn be a set of N vectors. Then ⊂ n s n X X 2 Emax ²t vt 2logN max vi v V t 1 ≤ v V t 1 ∈ = ∈ = Hence, for a finite class F of functions Z R and a fixed set zn {z ,...,z } Zn, it → = 1 n ∈ holds that s 2log F EmaxSf r | | f F ≤ n ∈ 2 1 Pn 2 where r maxf F n t 1 f (zt ). = ∈ = Proof. For a fixed v V , define d ² v . We then have Eexp{λd } exp{(2v )2λ2/8}. ∈ t = t t t ≤ t Following the proof of Lemma 9.3, ½ n ¾ ½µ n ¶ ¾ X X 2 2 Eexp λ ²t vt exp vt λ /2 t 1 ≤ t 1 = = and the statement follows.

We finish this section with a more powerful lemma that holds for a tree process over a finite index set. Of course, Lemma 9.4 follows from Lemma 9.5 if the trees in the set V are taken to be constant functions vt (²1:t 1) vt for all ²1:t 1. − = − Lemma 9.5. Let V be a set of N real-valued trees of depth n. Then

n s n X X 2 Emax ²t vt (²1:t 1) 2logN maxmax vt (²1:t 1) ² v V t 1 − ≤ v V 1:n t 1 − ∈ = ∈ = 92 9.1 Finite Class Lemmas

Hence, for a finite class F of functions Z R and a fixed Z-valued tree z of depth n, → it holds that s 2log F EmaxTf r | | f F ≤ n ∈ where n 2 1 X 2 r max max f (zt (²1,...,²t 1)). ² ,...,² = f F 1 n n t 1 − ∈ = The proof of this lemma is a bit more involved than the previous proofs, as we ask for the upper bound to scale with the largest `2-norm along any path in the trees. To preserve the path structure, we need to peel off the terms in the mo- ment generating function one by one, starting from the last term. We remark that Lemma 9.5 is crucial for the further developments.

Proof. Fix λ 0 and a R-valued tree v V . For t {0,...,n 1} define a function > ∈ ∈ − At :{ 1}t R by ± → ½ 2 n ¾ t λ X 2 A (²1,...,²t ) max exp vs(²1:s 1) ² ,...,² = t 1 n 2 s t 1 − + = + and An(² ,...,² ) 1. We have that for any t {1,...,n} 1 n = ∈ ½ µ t ¶ ¯ ¾ X t ¯ E²t exp λ ²svs(²1:s 1) A (²1,...,²t ) ¯ ²1,...,²t 1 s 1 − × ¯ − à = ! t 1 µ1 1 ¶ X− λvt (²1:t 1) t λvt (²1:t 1) t exp λ ²svs(²1:s 1) e − A (²1,...,²t 1, 1) e− − A (²1,...,²t 1, 1) = s 1 − × 2 − + + 2 − − à = ! t 1 µ1 1 ¶ X− t λvt (²1:t 1) λvt (²1:t 1) exp λ ²svs(²1:s 1) max A (²1,...,²t ) e − e− − ≤ s 1 − × ²t { 1} 2 + 2 = ∈ ± à t 1 ! X− t 1 exp λ ²svs(²1:s 1) A − (²1,...,²t 1) ≤ s 1 − × − = a a a2/2 where in the last step we used the inequality (e e− )/2 e . Hence, + ≤ ½ µ n ¶¾ ½ 2 n ¾ X 0 λ X 2 E² ,...,² exp λ ²svs(²1:s 1) A max exp vs(²1:s 1) . 1 n ² ,...,² s 1 − ≤ = 1 n 2 s 1 − = = We conclude that µ n ¶ ½ 2 n ¾ X λ X 2 exp λEmax ²t vt (²1:t 1) N max max exp vs(²1:s 1) ² ,...,² v V t 1 − ≤ v V 1 n 2 s 1 − ∈ = ∈ = and the rest of the proof follows the proof of Lemma 9.1.

93 10 Example: Linear Classes

We say that F is a linear function class if each f (x) ­f ,x® is linear in x. For finite- = dimensional problems we may think of f as a vector. For p 0, define ≥ n o Bd a Rd : a 1 , p , ∈ k kp ≤ the unit ball in Rd with respect to the p-norm. It is well-known that for the conju- 1 1 gate pair 1 p,q with p− q− 1, ≤ ≤ ∞ + =

a p sup a,b . k k = b Bd 〈 〉 ∈ q

That is, Lp and Lq norms are dual to each other. Hölder’s inequality then says that

a,b a b . 〈 〉 ≤ k kp · k kq Example 3 (L /L case). Let F X Bd . For any X-valued tree x of depth n, 2 2 = = 2 ½ n ¾ ¿ n À ° n ° seq 1 X ­ ® 1 X ° 1 X ° Rb (F;x) Esup ² f ,x (²) Esup f , ² x (²) E° ² x (²)° t t t t ° t t ° = f F n t 1 = f F n t 1 = n t 1 2 ∈ = ∈ = = (10.1)

One can view the sequential Rademacher complexity as the expected length of a random walk of the martingale with increments {²t xt (²1:t 1)}, normalized by n. − Recall that x is X-valued, so the increments are in the Euclidean unit ball. So, how far can such a random walk be expected to walk away for any tree x? An easy calculation shows that

° n ° Ã ° n °2!1/2 Ã !1/2 µ n ¶1/2 °X ° °X ° X X E° ² x (²)° E° ² x (²)° E ² x (²),² x (²) E x (²) 2 pn ° t t ° ° t t ° 〈 t t s s 〉 k t k2 t 1 2 ≤ t 1 2 = t,s = t 1 ≤ = = = 94 In view of (7.18) we conclude that 1 Riid (F) Rseq (F) (10.2) ≤ ≤ pn

Khinchine-Kahane inequality shows that Riid (F) 1 for any symmetric dis- ≥ p2n iid tribution P on the surface of X, so supP R (F) is within a constant factor from Rseq (F).

d d Example 4 (L1/L case). Suppose now that X B , while F B1 . Observe that ∞ = ∞ = F conv({ e ,..., e }). = ± 1 ± d

That is, the `1 ball is a convex hull of 2d vertices. Thus, s iid seq seq 2log(2d) R (F) R (F) R ({ e1,..., ed }) (10.3) ≤ = ± ± ≤ n by Lemma 9.4.

d Example 5 (∆d /L case). Suppose now that X B and ∞ = ∞ ( d ) d X F ∆d f R : fi 1, fi 0 i = = ∈ i 1 = ≥ ∀ = is the d-simplex. Similarly to the previous example, F conv({e ,...,e }), and = 1 d s iid seq seq 2log(d) R (F) R (F) R ({e1,...,ed }) (10.4) ≤ = ≤ n by Lemma 9.4.

Example 6 (General case). X be a unit ball in a separable Banach space (B, ). k · k Consider the dual space—the space of continuous linear functionals on B. Let

be the dual norm, defined for an element f B∗ of the dual space by k · k∗ ∈ f sup­f ,x® . (10.5) k k∗ = x X ∈

Let Ψ∗ be a σ-strongly convex function with respect to on F. That is, k · k∗ ­ ® σ 2 f ,g F, Ψ∗(f ) Ψ∗(g) f g, Ψ∗(g) f g (10.6) ∀ ∈ ≥ + − ∇ + 2 k − k∗

95 Defining the convex conjugate Ψ of Ψ∗ as ­ ® Ψ(x) sup f ,x Ψ∗(f ), = f F − ∈ it is possible to verify the opposite property (smoothness) for the conjugate func- tion: 1 x, y X, Ψ(x) Ψ(y) ­ Ψ(y),x y® x y 2. (10.7) ∀ ∈ ≤ + ∇ − + 2σk − k

2 Let M supf F Ψ∗(f ). Using the definition of conjugacy, for any λ 0, = ∈ > ¿ n À Ã µ n ¶! seq 1 λ X 1 λ X Rb (F;x) Esup f , ²t xt (²) supΨ∗(f ) EΨ ²t xt (²) (10.8) = λ f F n t 1 ≤ λ f F + n t 1 ∈ = ∈ = The first term is upper bounded by M 2, while for the second term we use (10.7):

µ ¿ À ° °2¶ λ 1 °λ ° EΨ(Zn) E Ψ(Zn 1) Ψ(Zn 1), ²nxn(²) ° xn(²)° (10.9) ≤ − + ∇ − n + 2σ °n °

λ Pk with Zk n t 1 ²t xt (²). The first-order term disappears under the expectation, = = and the second-order term is bounded by λ2/(2σn2) since the x tree is X-valued. Peeling off all the terms in the sum in the similar manner, we arrive at r M 2 λ 2 Rb seq (F;x) M (10.10) ≤ λ + 2σn = σn for λ Mp2σn. Of course, the radius of the ball X is 1 and does not appear in the = bound; otherwise, the bound would scale linearly with it.

96 11 Statistical Learning: Classification

We are now armed with a powerful tool: an upper bound on the expected supre- mum of Gf ,Sf ,Mf , and Tf indexed by a finite set F. How about an infinite class F? In this lecture we address this question for classification problems within the scope of Statistical Learning.

11.1 From Finite to Infinite Classes: First Attempt

The first idea that comes to mind is to approximate the class by a finite “represen- tative” set. The set will be called a “cover”. Intuitively, the larger the set, the better is the approximation, but the worse is the log-size-of-finite-set bound.

As the first attempt, let us implement the idea of a cover to upper bound Gf for the simple case of thresholds in one dimension, studied in Section 8.1. Effectively, we are aiming at an upper bound on

· n ¸ 1 X E sup EI{X θ} I{Xt θ} (11.1) θ [0,1] ≤ − n t 1 ≤ ∈ = (see Eq. (8.5)) which was graciously provided to us by Kolmogorov. Recall that there is an unknown underlying probability distribution PX Y on which we place × no restriction. Let Θ {θ ,...,θ } be equally-spaced on the interval [0,1] and let N = 1 N c(θ) Θ be the element of the representative set closest to θ. With the notation ∈ N n 1 X Gθ EI{X θ} I{Xt θ} = ≤ − n t 1 ≤ =

97 11.2 From Finite to Infinite Classes: Second Attempt we can then write

£ ¤ £ ¤ £ ¤ E sup [Gθ] E sup Gc(θ) Gθ Gc(θ) E sup Gc(θ) E sup Gθ Gc(θ) θ [0,1] = θ [0,1] + − ≤ θ [0,1] + θ [0,1] − ∈ ∈ ∈ ∈ (11.2)

The first term is the expected supremum over the finite set ΘN : £ ¤ £ ¤ E sup Gc(θ) E max Gθi θ [0,1] = θi ΘN ∈ ∈ but how do we control the second term? Since θ and c(θ) are close, we are tempted to say that it is small. However, we made no assumption on PX , so it is very well possible that all of its mass is concentrated on some interval [θi ,θi 1], in which + case the above expression is equal to the one in (11.1). Hence, we have achieved nothing by passing to the finite subset! Clearly, the discretization needs to depend on the unknown distribution PX . To rescue the situation, we may try to place the elements θ1,...,θN such that EI{θi X θi 1} is the same for all the intervals [θi ,θi 1]. This path might get us ≤ ≤ + + somewhere close to the desired result for thresholds in one dimension, but it is rather unsatisfying: we need to reason about the unknown distribution PX . For more general situations, such an analysis will be loose and impractical. A better way to do it is by working with the Rademacher process directly. Since the supre- mum of this process is within a factor 2 from the supremum of the empirical pro- cess (Theorem 7.10), we are not losing much.

11.2 From Finite to Infinite Classes: Second Attempt

For the example of thresholds, let us stay with the Rademacher averages of F rather than pass to the empirical process (11.1). We now reason conditionally on X1,..., Xn and provide an upper bound on conditional Rademacher averages ( ¯ ) · 1 n ¸ ¯ X ¯ E sup ²t I{Xt θ} ¯ X1,..., Xn (11.3) θ [0,1] n t 1 ≤ ¯ ∈ =

An upper bound on the latter quantity that is independent of X1,..., Xn would be quite interesting. It would effectively say that the unknown distribution PX was edged out of the picture and replaced with random signs.

98 11.3 The Growth Function and the VC Dimension

Conditional Rademacher averages in (11.3) can be equivalently written as · n ¸ 1 X E² sup ²t at (11.4) a F n n t 1 ∈ |X = n o n n where F n (f (X ),..., f (X )) {0,1} , the projection of F onto X . How big is |X = θ 1 θ n ⊆ this projection? Since (X1,..., Xn) is held fixed, by varying θ we can realize vectors of the form (1,...,1,0,...,0), and there are n 1 of them. Clearly, the Euclidean + length of the vectors a F n is at most pn, so by Lemma 9.4, ∈ |X s 2log(n 1) EmaxSf + f F ≤ n ∈ 11.3 The Growth Function and the VC Dimension

The size of the projection F n played an important role in obtaining an upper |X bound on Rademacher averages for the case of classification with thresholds. The same ideas carry over to a general classification setting.

Definition 11.1. For a binary valued function class F {0,1}X, the growth function ⊂ is defined as n ¡ ¢ o ΠF(n) max card F x ,...,x : x1,...,xn X (11.5) = | 1 n ∈ We have the following proposition that follows immediately from our previous arguments.

Proposition 11.2. For classification with some domain X, label set Y {0,1}, class = of function F YX and loss function `(f ,(x, y)) I©f (x) yª, it holds that for any ⊆ = 6= distribution PX Y , × s 2logΠF(n) E sup Gg 2EsupSf 2 g `(F) ≤ f ≤ n ∈ The growth function measures expressiveness of F. In particular, if F can pro- n duce all possible signs (that is, ΠF(n) 2 ), the bound becomes useless. =

Definition 11.3. We say that F shatters some set x1,...,xn (a term due to J. Michael n Steele) if F n {0,1} . That is, |x = (b ,...,b ) {0,1}n, f F s.t. f (x ) b t {1,...,n} ∀ 1 n ∈ ∃ ∈ t = t ∀ ∈

99 11.3 The Growth Function and the VC Dimension

It is possible to show that if a set x1,...,xn is shattered for some n, then not only is the bound of Proposition 11.2 vacuous, but the learning problem itself (for the given n) is impossible.

P Exercise (???): Prove the above statement.

We see that the growth function is quite important as a measure of complex- ity of F when it comes to distribution-free learning. But what is the behavior of the growth function? The situation turns out to be quite interesting. Define the following combinatorial parameter of F:

Definition 11.4. The Vapnik-Chervonenkis (VC) dimension of the class F {0,1}X ⊆ is defined as n t o vc(F) max t : ΠF(t) 2 , = or vc(F) if max does not exist. Further, for x ,...,x X, define = ∞ 1 n ∈ ¡ ¢ n ³ ´ t o vc(F,x1,...,xn) vc F x ,...,x max t : i1,...,it {1,...,n} s.t. card F x ,...,x 2 , | 1 n , ∃ ∈ | i1 it = Vapnik-Chervonenkis dimension is the largest t such that F can produce all possible sequences of bits {0,1} on some set of examples. Clearly, the bound of Proposition 11.2 is useless for n vc(F). What about n vc(F)? The following ≤ > beautiful and surprising result was proved around the same time by Vapnik and Chervonenkis, Sauer, and Shelah.1

Lemma 11.5 (Vapnik-Chervonenkis, Sauer, Shelah). It holds that

d à ! X n ΠF(n) ≤ i 0 i = whenever vc(F) d . = < ∞ There are several ways to prove this lemma, and we give one that will be useful later in the course. Importantly, the sum that upper bounds the growth function has a O(nd ) behavior. In fact,

d à ! X n ³en ´d , (11.6) i 0 i ≤ d = 1 Thanks to Léon Bottou http://leon.bottou.org/news/vapnik-chervonenkis_sauer for figuring out the sequence of events (which, by the way, involves Paul Erdös) that led to these pub- lications.

100 11.3 The Growth Function and the VC Dimension and we leave the proof as an easy exercise. It is quite remarkable that the growth function is 2t for t d and polynomial afterwards, a non-obvious fact that makes ≤ the Vapnik-Chervonenkis results appealing. As an immediate corollary of Proposition 11.2 we have

Corollary 11.6. Under the setting of Proposition 11.2, if $\mathrm{vc}(F)=d$,
\[
\mathbb{E}\sup_{g\in\ell(F)} G_g \le 2\,\mathbb{E}\sup_{f\in F} S_f \le 2\sqrt{\frac{2d\log(en/d)}{n}}
\]

Example 7. The class of thresholds
\[
F = \{f_\theta(x) = I\{x\le\theta\} : \theta\in[0,1]\}
\]
on $X=[0,1]$ has $\mathrm{vc}(F)=1$. Indeed, one cannot find $x_1,x_2\in X$, $x_1<x_2$, such that some threshold gives a label 0 to $x_1$ and 1 to $x_2$. Thus, no set of two points is shattered by $F$.
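To make the growth function and the Sauer-Shelah bound concrete, here is a minimal numerical sketch (not from the text; the sample of points and the helper names are arbitrary choices). It enumerates the projection of the threshold class on a sample and compares its size to $\sum_{i\le d}\binom{n}{i}$ and to $(en/d)^d$.

```python
from math import comb, e

def projection_thresholds(xs):
    """Distinct label vectors (I{x <= theta}) realized by thresholds on the sample xs."""
    xs = sorted(xs)
    thetas = [xs[0] - 1.0] + xs  # a threshold below all points, then one at each point
    return {tuple(int(x <= t) for x in xs) for t in thetas}

def sauer_shelah(n, d):
    """Upper bound sum_{i<=d} C(n, i) from Lemma 11.5."""
    return sum(comb(n, i) for i in range(d + 1))

n, d = 10, 1  # thresholds have vc(F) = 1
xs = [i / n for i in range(n)]
print(len(projection_thresholds(xs)))  # n + 1 = 11 realized label patterns
print(sauer_shelah(n, d))              # 11: the Sauer-Shelah bound is tight here
print((e * n / d) ** d)                # (en/d)^d ~ 27.2, the cruder bound (11.6)
```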

Example 8. The class of step-up and step-down thresholds

\[
F = \{f_\theta(x) = I\{x\le\theta\} : \theta\in[0,1]\} \cup \{g_\theta(x) = I\{x\ge\theta\} : \theta\in[0,1]\}
\]
on $X=[0,1]$ has $\mathrm{vc}(F)=2$.

The following analogue holds in $d$ dimensions:

Example 9. Define the class of linear thresholds in $\mathbb{R}^d$ by
\[
F = \left\{f_\theta(x) = I\{\langle\theta,x\rangle\ge 0\} : \theta\in\mathbb{R}^d\right\}
\]
Then $\mathrm{vc}(F)=d+1$, which justifies our use of the same letter $d$ for both the dimensionality of the space and the combinatorial dimension of $F$.

Example 10. Let the set $H$ of functions $X\to\mathbb{R}$ be a subset of a $d$-dimensional vector space. Then the class
\[
F = \left\{I\{h(x)\ge 0\} : h\in H\right\}
\]
has VC dimension at most $d$.

As the next example shows, it is not correct to think of the VC dimension as the number of parameters.


Example 11. The VC dimension of the class
\[
F = \left\{f_\alpha(x) = I\{\sin(\alpha x)\ge 0\} : \alpha\in\mathbb{R}\right\}
\]
is infinite, despite the fact that $F$ is parametrized by a single number.

We now turn to the proof of the Vapnik-Chervonenkis-Sauer-Shelah lemma:

Proof of Lemma 11.5. Define a function

\[
g(d,n) = \sum_{i=0}^{d}\binom{n}{i}
\]
for $d,n\ge 0$. We would like to prove $\Pi_F(n)\le g(d,n)$ for $d=\mathrm{vc}(F)$. By way of induction, assume that the statement holds for $(d-1,n-1)$ and $(d,n-1)$. Fix any $x_1,\ldots,x_n\in X$, and suppose $\mathrm{vc}(F,x_1,\ldots,x_n)=d$. Define
\[
\bar F = \left\{r\in F|_{x_2,\ldots,x_n} : (0,r),(1,r)\in F|_{x_1,\ldots,x_n}\right\},
\]
the set of projections of $F$ on $(x_2,\ldots,x_n)$ such that both labels are realized on $x_1$. Let $F' = F|_{x_2,\ldots,x_n}$, the set of all projections of $F$ on $(x_2,\ldots,x_n)$. Observe that $\bar F\subset F'$. We claim that
\[
\mathrm{card}\left(F|_{x_1,\ldots,x_n}\right) = \mathrm{card}(\bar F) + \mathrm{card}(F'). \tag{11.7}
\]
To see this, suppose $r\in\bar F$. Then both $(0,r)$ and $(1,r)$ are counted on the left-hand side of (11.7), and this is matched by counting $r$ twice (in $\bar F$ and $F'$). If $r\in F'\setminus\bar F$, then only $(0,r)$ or $(1,r)$ appears on the left-hand side, and the same is true for the right-hand side. We now claim that $\mathrm{vc}(\bar F)$ is at most $d-1$, for otherwise $\mathrm{vc}\left(F|_{x_1,\ldots,x_n}\right)=d+1$ (indeed, both 0 and 1 can be appended to the full projection of size $2^d$ to create a projection of size $2^{d+1}$). Then
\[
\mathrm{card}\left(F|_{x_1,\ldots,x_n}\right) = \mathrm{card}(\bar F) + \mathrm{card}(F') \le g(d-1,n-1) + g(d,n-1) \tag{11.8}
\]
by the induction hypothesis. Since this holds for any $x_1,\ldots,x_n$, we also have

\[
\Pi_F(n) \le g(d-1,n-1) + g(d,n-1).
\]
The induction step follows from the identity
\[
g(d,n-1) + g(d-1,n-1) = g(d,n).
\]
The base of the induction is easy to verify.


The proof of Lemma 11.5 might appear magical and unintuitive. Let us point (in a very informal way) to a few important pieces that make it work. In fact, the technique is rather general and will be used once again in a few lectures when we prove an analogue of the combinatorial lemma for sequential prediction. The abstract idea is that the behavior of $F$ on a sample can be separated into two pieces ($\bar F$ and $F'$). Let us now forget about the way that these pieces are defined, and rather look into how their respective sizes can be used to unwind a recursion. Imagine that a split into $\bar F$ and $F'$ can be done so that the smaller piece $\bar F$ is always of size 1. The recursion would then unfold as $g(d,n)\le g(d,n-1)+1\le g(d,n-2)+2\le\ldots\le n$. Of course, such a recursion is rather trivial. Now, imagine another type of recursion: both pieces are of the same size. Then we would have $g(d,n)\le 2g(d,n-1)\le 4g(d,n-2)\le\ldots\le 2^n$. Once again, such a recursion is rather trivial, and the bound is not useful. In some sense, these two cases are extremes of how a recursion might play out. But note that we did not use $d$ in the two recursions. Imagine now that each time we perform the split into $\bar F$ and $F'$, a "counter" $d$ is decreased for the smaller piece. That is, $g(d,n)\le g(d-1,n-1)+g(d,n-1)$. This situation is in-between the bounds of $n$ and $2^n$ and, in fact, it is of the order $n^d$, as the proof shows. The above observation is a rather general recipe: for a problem at hand, find a parameter (a "counter") so that the smaller set being split off has a smaller value of this parameter. In this case, the recursion gives a polynomial rather than exponential size.

12 Statistical Learning: Real-Valued Functions

In the real-valued supervised learning problem with i.i.d. data, we consider a bounded set $Y=[-1,1]$. Fix a set $F$ of functions $X\to Y$, for some input space $X$. In the distribution-free scenario, we have i.i.d. data $\{(X_t,Y_t)\}_{t=1}^n$ from an unknown $P_{X\times Y}$, on which we place no assumption.

Recall that the cardinality of the set
\[
F|_{x_1,\ldots,x_n} = \{(f(x_1),\ldots,f(x_n)) : f\in F\}
\]
played an important role in the analysis of classification problems. For real-valued functions, however, the cardinality of this set is of little use since it is, in general, uncountable. However, two functions $f$ and $g$ with almost identical values
\[
f(x_t)\approx g(x_t), \quad t\in\{1,\ldots,n\}
\]
on the sample are essentially the same for our purposes, and we need to find a way to measure the complexity of $F|_{x_1,\ldots,x_n}$ without paying attention to small discrepancies between functions. This idea is captured by the notion of covering numbers.

12.1 Covering Numbers

Recall our shorthand $x^n=\{x_1,\ldots,x_n\}$.

Definition 12.1. A set $V\subset\mathbb{R}^n$ is an $\alpha$-cover on $x_1,\ldots,x_n$ with respect to the $\ell_2$ norm if
\[
\forall f\in F,\ \exists v\in V \ \text{s.t.}\ \left(\frac{1}{n}\sum_{t=1}^n (f(x_t)-v_t)^2\right)^{1/2}\le\alpha. \tag{12.1}
\]


An $\alpha$-covering number is defined as
\[
\mathcal{N}_2(F,\alpha,x^n) = \min\left\{\mathrm{card}(V) : V \text{ is an } \alpha\text{-cover}\right\}.
\]


Figure 12.1: Two sets of levels provide an α-cover for the four functions. Only the values of functions on x1,...,xn are relevant.

As far as the cover is concerned, the values of functions outside $x_1,\ldots,x_n$ are immaterial. Hence, we may equivalently talk of a cover for the set $F|_{x^n}\subseteq[-1,1]^n$. We may then write $\mathcal{N}_2(F|_{x^n},\alpha)$ for the $\alpha$-covering number.

It is important to note that here and below we refer to $\ell_p$ norms, but the definitions in fact use an extra normalization factor $1/n$. We may equivalently write the $\alpha$-closeness requirement (12.1) as $\|f-v\|_2\le\sqrt{n}\,\alpha$ where $f=(f(x_1),\ldots,f(x_n))\in F|_{x_1,\ldots,x_n}$.

If the set $V$ is realized as $V=\{(g(x_1),\ldots,g(x_n)) : g\in G\}$ for some collection $G\subseteq F$, we say that the cover is proper. A proper cover picks a subset $G$ of $F$ so that any $f\in F$ is close to some $g\in G$ on $x_1,\ldots,x_n$ in the $\ell_2$ sense. The distinction between proper and improper covers is minor, and one can show that a proper covering number at the $\alpha$ level is sandwiched between improper covering numbers at the $\alpha$ and $\alpha/2$ levels.

Of course, we can define $\ell_p$-covers as

Definition 12.2. A set $V\subset\mathbb{R}^n$ is an $\alpha$-cover on $x_1,\ldots,x_n$ with respect to the $\ell_p$ norm, for $p\in[1,\infty)$, if
\[
\forall f\in F,\ \exists v\in V \ \text{s.t.}\ \left(\frac{1}{n}\sum_{t=1}^n|f(x_t)-v_t|^p\right)^{1/p}\le\alpha,
\]
and with respect to the $\ell_\infty$ norm if
\[
\forall f\in F,\ \exists v\in V \ \text{s.t.}\ |f(x_t)-v_t|\le\alpha \ \forall t\in\{1,\ldots,n\}.
\]

The $\alpha$-covering numbers $\mathcal{N}_p(F,\alpha,x^n)$ and $\mathcal{N}_\infty(F,\alpha,x^n)$ are defined as before.

The above notions of a cover can be seen as instances of a more general idea of an $\alpha$-cover for a metric space $(T,d)$. A subset $T'\subset T$ is an $\alpha$-cover of $T$ if for any $t\in T$ there exists an $s\in T'$ such that $d(t,s)\le\alpha$. The above definitions correspond to endowing $F$ with an empirical metric $L_p(\mu_n)$, where $\mu_n$ is the empirical distribution on the data $x_1,\ldots,x_n$, thus explaining the extra $1/n$ normalization factor.

Figure 12.2: An $\alpha$-cover for a metric space $(T,d)$.
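To make the definition concrete, here is a minimal sketch (not from the text) of a greedy construction of a proper $\ell_2$ $\alpha$-cover for a class projected onto a sample; the random "non-decreasing functions" used for the demonstration are an arbitrary assumption, and a greedy cover is only an upper bound on the minimal covering number $\mathcal{N}_2$.

```python
import numpy as np

def greedy_l2_cover(values, alpha):
    """Greedy proper alpha-cover of a set of vectors in the normalized l2 metric.

    values: array of shape (m, n); row i = (f_i(x_1), ..., f_i(x_n)) for one f_i in F.
    Returns indices of the chosen cover elements; its size is >= N_2(F, alpha, x^n).
    """
    def dist(a, b):
        return np.sqrt(np.mean((a - b) ** 2))  # (1/n sum_t (a_t - b_t)^2)^{1/2}
    cover = []
    for i in range(values.shape[0]):
        if all(dist(values[i], values[j]) > alpha for j in cover):
            cover.append(i)  # not yet alpha-covered, so add it to the cover
    return cover

# Example: 200 random non-decreasing functions evaluated on n = 50 points.
rng = np.random.default_rng(0)
n = 50
vals = np.cumsum(rng.uniform(0, 2 / n, size=(200, n)), axis=1) - 1.0  # values in [-1, 1]
print(len(greedy_l2_cover(vals, alpha=0.1)))  # size of the greedy 0.1-cover
```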

Exercise (⋆): Prove that for any $1\le p\le q\le\infty$, $\mathcal{N}_p(F,\alpha,x^n)\le\mathcal{N}_q(F,\alpha,x^n)$.

Covering of $F$ allows us to squint our eyes and view the set $F|_{x^n}$ as a finite set at granularity $\alpha$. We can then apply the maximal inequalities for stochastic processes indexed by finite sets. Of course, by doing so we lose $\alpha$ of precision. Clearly, a smaller $\alpha$ means a larger $\alpha$-cover, so there is tension between fine granularity and small cover size. This is quantified in the following proposition.

Proposition 12.3. For any $x^n=\{x_1,\ldots,x_n\}$, the conditional Rademacher averages of a function class $F\subseteq[-1,1]^X$ satisfy
\[
\hat{\mathcal{R}}^{iid}(F) \le \inf_{\alpha\ge 0}\left\{\alpha + \sqrt{\frac{2\log\mathcal{N}_1(F,\alpha,x^n)}{n}}\right\}
\]

Proof. Fix $x_1,\ldots,x_n$ and $\alpha>0$. Let $V$ be a minimal $\alpha$-cover of $F$ on $x^n$ with respect to the $\ell_1$-norm. For $f\in F$, denote by $v[f]\in V$ any element that "$\alpha$-covers" $f$ in the sense given by the definition. Denote by $v[f]_t$ the $t$-th coordinate of the vector.


Write the conditional Rademacher averages as

\begin{align*}
\hat{\mathcal{R}}^{iid}(F) &= \mathbb{E}\sup_{f\in F}\frac{1}{n}\sum_{t=1}^n\epsilon_t f(x_t) \\
&= \mathbb{E}\sup_{f\in F}\frac{1}{n}\sum_{t=1}^n\left(\epsilon_t(f(x_t)-v[f]_t)+\epsilon_t v[f]_t\right) \\
&\le \mathbb{E}\sup_{f\in F}\frac{1}{n}\sum_{t=1}^n\epsilon_t(f(x_t)-v[f]_t) + \mathbb{E}\sup_{f\in F}\frac{1}{n}\sum_{t=1}^n\epsilon_t v[f]_t \\
&\le \sup_{f\in F}\frac{1}{n}\sum_{t=1}^n|f(x_t)-v[f]_t| + \mathbb{E}\max_{v\in V}\frac{1}{n}\sum_{t=1}^n\epsilon_t v_t
\end{align*}
By the definition of an $\alpha$-cover, the first term is upper bounded by $\alpha$. The second term is precisely controlled by the maximal inequality of Lemma 9.4, which gives an upper bound of
\[
\sqrt{\frac{2\log|V|}{n}},
\]
due to the fact that $\|v\|_2\le\sqrt{n}$ for any $v\in V$. Note that if any coordinate of $v$ is outside the interval $[-1,1]$, it can be truncated. Since $\alpha>0$ is arbitrary, the statement follows.

Example 12. For certain classes $F$, one can show that $F|_{x_1,\ldots,x_n}$ is a subset of $[-1,1]^n$ of dimension lower than $n$. This happens, for instance, if
\[
F = \left\{f(x)=\langle f,x\rangle : f\in B_p^d\right\} \quad\text{and}\quad X = B_q^d,
\]
where $B_p^d$ and $B_q^d$ are unit balls in $\mathbb{R}^d$, with $1/p+1/q=1$. For instance, for $p=\infty$ the function class is identified with $[-1,1]^d$ and it is easy to check that
\[
\mathcal{N}_\infty(F,\alpha,x^n)\le\left(\frac{2}{\alpha}\right)^d
\]
by a discretization argument. The bound of Proposition 12.3 then yields
\[
\hat{\mathcal{R}}^{iid}(F) \le \inf_{\alpha\ge 0}\left\{\alpha + \sqrt{\frac{2d\log(2/\alpha)}{n}}\right\} \tag{12.2}
\]

Up to constant factors, the best choice for $\alpha$ is $n^{-1/2}$, which yields an upper bound of the order $O\!\left(\sqrt{\frac{d\log n}{n}}\right)$.
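As a quick sanity check (not from the text; the values of $n$, $d$, and the search grid are arbitrary assumptions), one can minimize the right-hand side of (12.2) numerically over $\alpha$ and compare it to the $\alpha=n^{-1/2}$ choice; the two agree up to constants.

```python
import numpy as np

def single_scale_bound(alpha, n, d):
    """Right-hand side of (12.2): alpha + sqrt(2 d log(2/alpha) / n)."""
    return alpha + np.sqrt(2 * d * np.log(2 / alpha) / n)

n, d = 10_000, 5
alphas = np.logspace(-4, 0, 1000)
best = alphas[np.argmin(single_scale_bound(alphas, n, d))]
print(best, single_scale_bound(best, n, d))
print(single_scale_bound(n ** -0.5, n, d))  # alpha = n^{-1/2}: same order, ~sqrt(d log n / n)
```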


Since $B_p^d\subseteq B_\infty^d$ for any $p>0$, the above bound also holds for other $F$, though it might be loose. We also remark that an $\ell_\infty$ cover is quite a stringent requirement, as Proposition 12.3 only requires an $\ell_1$ cover. Finally, we note that the discretization argument yields more than an $\ell_\infty$ cover on $x^n$: it yields a pointwise cover for all $x\in X$. In general such pointwise bounds are not possible as soon as we consider infinite-dimensional $F$.

12.2 Chaining Technique and the Dudley Entropy Integral

An application of Proposition 12.3 in Example 12 yields an $O\!\left(\sqrt{\frac{d\log n}{n}}\right)$ upper bound, containing an extraneous $\log n$ factor. How do we know that it is extraneous? Well, because we can get a bound without this factor. While we might not care about a logarithmic factor, the situation is actually more serious: we can provide an example where Proposition 12.3 gives the wrong rate. Somehow, squinting our eyes and looking at $F$ at a single level of granularity does not give the right picture. What if we looked at $F$ at different coarseness levels and integrated them to build a full-resolution panorama? The next theorem does exactly that.

Theorem 12.4. For any $x^n=\{x_1,\ldots,x_n\}$, the conditional Rademacher averages of a function class $F\subseteq[-1,1]^X$ satisfy
\[
\hat{\mathcal{R}}^{iid}(F) \le \hat{\mathcal{D}}^{iid}(F)
\]
where
\[
\hat{\mathcal{D}}^{iid}(F) \triangleq \inf_{\alpha\ge 0}\left\{4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^1\sqrt{\log\mathcal{N}_2(F,\delta,x^n)}\,d\delta\right\} \tag{12.3}
\]

Proof. Fix $x_1,\ldots,x_n$ and let $F=F|_{x^n}$. Let $\alpha_j=2^{-j}$ for $j=1,2,\ldots$. Let $V_j$ be a minimal $\alpha_j$-cover of size $\mathcal{N}_2(F,\alpha_j,x^n)$. For any $f\in F$, let $v[f]^j\in V_j$ be an element $\alpha_j$-close to $f$, as promised by the definition of an $\alpha_j$-cover. Each $f$ is therefore associated with a chain of approximations $v[f]^1,v[f]^2,\ldots$ at increasingly fine levels of granularity. Let $\epsilon=(\epsilon_1,\ldots,\epsilon_n)$ denote the vector of Rademacher random variables. Then we may succinctly write the conditional Rademacher averages as
\[
\hat{\mathcal{R}}^{iid}(F) = \mathbb{E}\sup_{f\in F}\frac{1}{n}\sum_{t=1}^n\epsilon_t f(x_t) = \frac{1}{n}\,\mathbb{E}\sup_{f\in F}\langle\epsilon,f\rangle
\]

where $\langle a,b\rangle$ is the inner product. We now rewrite the (unnormalized) averages as in Proposition 12.3, but with a chain of $N$ telescoping differences ($N$ to be determined later):
\begin{align*}
\mathbb{E}\sup_{f\in F}\langle\epsilon,f\rangle &= \mathbb{E}\sup_{f\in F}\left\{\langle\epsilon,f-v[f]^N\rangle + \sum_{j=1}^N\langle\epsilon,v[f]^j-v[f]^{j-1}\rangle\right\} \\
&\le \mathbb{E}\sup_{f\in F}\langle\epsilon,f-v[f]^N\rangle + \sum_{j=1}^N\mathbb{E}\sup_{f\in F}\langle\epsilon,v[f]^j-v[f]^{j-1}\rangle
\end{align*}
with $v[f]^0=0$. The first term is upper bounded via the Cauchy-Schwarz inequality as
\[
\mathbb{E}\sup_{f\in F}\langle\epsilon,f-v[f]^N\rangle \le \mathbb{E}\sup_{f\in F}\|\epsilon\|\cdot\|f-v[f]^N\| = \sqrt{n}\sup_{f\in F}\left(\sum_{t=1}^n(f_t-v[f]^N_t)^2\right)^{1/2} \le n\alpha_N \tag{12.4}
\]
by definition. Turning to the second term, observe that, for a given $j$, the difference $v[f]^j-v[f]^{j-1}$ takes on at most $\mathrm{card}(V_j)\times\mathrm{card}(V_{j-1})$ possible values. This allows us to invoke the maximal inequality of Lemma 9.4:
\[
\mathbb{E}\sup_{f\in F}\langle\epsilon,v[f]^j-v[f]^{j-1}\rangle \le r\sqrt{2\log\left(\mathrm{card}(V_j)\times\mathrm{card}(V_{j-1})\right)}
\]
where $r=\sup_{f\in F}\|v[f]^j-v[f]^{j-1}\|$. For any $f\in F$,
\[
\|v[f]^j-v[f]^{j-1}\| \le \|v[f]^j-f\| + \|v[f]^{j-1}-f\| \le \sqrt{n}\alpha_j + \sqrt{n}\alpha_{j-1} = 3\sqrt{n}\alpha_j.
\]
Since the size of the minimal cover does not decrease for finer resolutions, it holds that
\[
\mathrm{card}(V_j)\times\mathrm{card}(V_{j-1}) \le \mathrm{card}(V_j)^2.
\]
We conclude that the conditional Rademacher averages are upper bounded as
\[
\frac{1}{n}\,\mathbb{E}\sup_{f\in F}\langle\epsilon,f\rangle \le \alpha_N + \frac{6}{\sqrt{n}}\sum_{j=1}^N\alpha_j\sqrt{\log\mathrm{card}(V_j)}
\]
Using the simple identity $2(\alpha_j-\alpha_{j+1})=\alpha_j$, as well as the fact that each $V_j$ is a minimal cover, we may rewrite the upper bound as
\[
\alpha_N + \frac{12}{\sqrt{n}}\sum_{j=1}^N(\alpha_j-\alpha_{j+1})\sqrt{\log\mathcal{N}_2(F,\alpha_j,x^n)} \le \alpha_N + \frac{12}{\sqrt{n}}\int_{\alpha_{N+1}}^1\sqrt{\log\mathcal{N}_2(F,\delta,x^n)}\,d\delta.
\]


Now, fix any $\alpha>0$ and let $N=\sup\{j:\alpha_j\ge 2\alpha\}$. Then $\alpha_{N+1}<2\alpha$ and $4\alpha>\alpha_N\ge 2\alpha$. Hence, $\alpha_{N+1}=\alpha_N/2\ge\alpha$ and
\[
\frac{1}{n}\,\mathbb{E}\sup_{f\in F}\langle\epsilon,f\rangle \le \inf_{\alpha>0}\left\{4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^1\sqrt{\log\mathcal{N}_2(F,\delta,x^n)}\,d\delta\right\}
\]
The result also holds for $\alpha=0$, and this completes the proof.

12.3 Example: Nondecreasing Functions

Before discussing the virtues of Theorem 12.4, let us mention that $\mathcal{N}(F,\alpha,x^n)$ is always finite for a class $F$ of bounded functions. Indeed, one can create a cover (let's say in $\ell_\infty$) by discretizing the interval $[-1,1]$ to the level $\alpha$ for the values of functions on each of $x_1,\ldots,x_n$. Then, let $V$ be the set of vectors obtained by taking any of the $2/\alpha$ possible values on $x_1$, any of the $2/\alpha$ possible values on $x_2$, and so on. Clearly, the size of such a cover is $\left(\frac{2}{\alpha}\right)^n$. This is illustrated in Figure 12.3. We conclude that for any class $F$ of bounded functions, the covering numbers on


Figure 12.3: Without any assumptions on the function, the size of an $\alpha$-cover is $\Omega((1/\alpha)^n)$.

$x_1,\ldots,x_n$ are finite, yet they yield vacuous upper bounds in both Proposition 12.3 and in Theorem 12.4. Without any assumptions on $F$, learning is not possible, in which case we should not expect to obtain any meaningful upper bounds.

Consider a class of functions $F$ on $X=\mathbb{R}$. Let us place only one restriction on the functions: they are non-decreasing. What happens to the covering numbers? Do we still require $\Omega((1/\alpha)^n)$ vectors to provide an $\alpha$-cover for $F$ on $x_1,\ldots,x_n$? Is the associated prediction problem (say, with absolute loss) learnable? To answer these questions, let us construct an $\ell_\infty$ cover $V$ of $F$ on $x^n$. Without loss of generality, suppose $x_1,\ldots,x_n$ are ordered on the real line. Then the elements of $V$ should



Figure 12.4: The class of non-decreasing functions on $\mathbb{R}$ has a manageable covering number.

be $n$-dimensional vectors with non-decreasing coordinates. How many such vectors $v\in V$ are required to provide an $\alpha$-approximation to any non-decreasing function? For any $\bar y$ in the set of $(2/\alpha)$ discretized $y$-values, we need to specify the coordinate $i\in\{1,\ldots,n\}$ such that the $i$-th coordinate of $v$ is below $\bar y$ but the $(i+1)$-th is greater than $\bar y$. This gives all the possibilities. A simple counting argument gives $|V|\le n^{2/\alpha}$, and, therefore,
\[
\mathcal{N}_\infty(F,\alpha,x^n)\le n^{2/\alpha} \tag{12.5}
\]
for any $x_1,\ldots,x_n$. Observe that the seemingly innocuous assumption of monotonicity drastically decreased the covering number from $(2/\alpha)^n$ to $n^{2/\alpha}$.

Let us now apply the upper bound of Proposition 12.3 to the class of non-decreasing functions on $\mathbb{R}$:
\[
\hat{\mathcal{R}}^{iid}(F) \le \inf_{\alpha\ge 0}\left\{\alpha + \sqrt{\frac{2(2/\alpha)\log n}{n}}\right\}
\]
Optimizing the bound leads to $\alpha=\left(\frac{4\log n}{n}\right)^{1/3}$, and the final bound of
\[
\hat{\mathcal{R}}^{iid}(F) \le 2\left(\frac{4\log n}{n}\right)^{1/3}.
\]

Let us now compare this to the bound given by Theorem 12.4 and see if our efforts paid off. We have

\[
\hat{\mathcal{R}}^{iid}(F) \le \hat{\mathcal{D}}^{iid}(F) \le \inf_{\alpha\ge 0}\left\{4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^1\sqrt{(2/\delta)\log n}\,d\delta\right\}
\]


Simplifying the integral

\[
\int_\alpha^1\sqrt{2/\delta}\,d\delta = \left[2\sqrt{2}\sqrt{\delta}\right]_\alpha^1 \le 2\sqrt{2}
\]
and taking $\alpha=0$, we obtain
\[
\hat{\mathcal{R}}^{iid}(F) \le 24\sqrt{\frac{2\log n}{n}}, \tag{12.6}
\]
a qualitatively different rate from that given by Proposition 12.3.

As a final remark, we mention that the domain $X=\mathbb{R}$ of the functions in the above example is unbounded and there is no hope to have a finite pointwise cover of $F$ over $X$. Early results in statistical learning theory relied on the availability of such covering numbers, which greatly limited the applicability of the results. The symmetrization technique allows us to treat data as fixed and to only consider covering numbers of $F$ on the data $x^n$.
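The gap between the two rates is easy to see numerically. The following sketch (not from the text; the choices of $n$ are arbitrary) evaluates the single-scale bound from Proposition 12.3 with the covering estimate (12.5), and the chaining bound (12.6).

```python
import numpy as np

def single_scale(n):
    """inf_alpha { alpha + sqrt(2 (2/alpha) log n / n) }, i.e. Proposition 12.3 with (12.5)."""
    alphas = np.logspace(-5, 0, 4000)
    return np.min(alphas + np.sqrt(2 * (2 / alphas) * np.log(n) / n))

def dudley(n):
    """The chaining bound (12.6): 24 sqrt(2 log n / n)."""
    return 24 * np.sqrt(2 * np.log(n) / n)

for n in [10**4, 10**6, 10**8, 10**10]:
    print(f"n={n:>12d}  single-scale={single_scale(n):.4f}  chaining={dudley(n):.4f}")
# The single-scale bound decays like (log n / n)^{1/3}, the chaining bound like
# (log n / n)^{1/2}; because of the explicit constant 24, the chaining bound only
# overtakes the single-scale one for rather large n, but its rate is better.
```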

Exercise (⋆⋆): Show that the bound of Theorem 12.4 removes the extraneous logarithmic factor in the upper bound (12.2) of Example 12.

12.4 Improved Bounds for Classification

The classification scenario (that is, $F\subseteq\{0,1\}^X$) can also benefit from the upper bound of Theorem 12.4. First, notice that a projection $F|_{x_1,\ldots,x_n}$ is nothing but a cover of $F$ at scale $\alpha=0$. Indeed, for any function $f\in F$, trivially, there exists an element $v\in V=F|_{x_1,\ldots,x_n}$ such that $f(x_t)=v_t$ for all $t$. Henceforth, we shall refer to such a cover as a zero-cover or exact cover, and to the covering number as $\mathcal{N}_0(F,x_1,\ldots,x_n)$. For such a cover at scale $\alpha=0$, we need not specify the $\ell_p$ norm. Alternatively, we can denote the zero cover by $\mathcal{N}_\infty(F,x^n,\alpha)$ for any $0\le\alpha<1/2$.

The Vapnik-Chervonenkis-Sauer-Shelah lemma (Lemma 11.5) states that for any $x_1,\ldots,x_n$,
\[
\mathcal{N}_0(F,x_1,\ldots,x_n) \le \left(\frac{en}{d}\right)^d \tag{12.7}
\]
for $d=\mathrm{vc}(F)$.

Recall that Proposition 11.2 yielded an upper bound with an extra $\log n$ factor. In fact, it is not possible to avoid this factor if we are dealing with the exact cover.


However, what if we obtain $\ell_2$ covering numbers for $F$ and apply Theorem 12.4? Recall from the last section (see the last exercise) that this indeed removes the extraneous factor.

Refining the proof of Dudley [22], Mendelson [38] gives a nice probabilistic argument that for any $x_1,\ldots,x_n$, the $\ell_p$-covering numbers of the class $F$ of $\{0,1\}$-valued functions satisfy

\[
\mathcal{N}_p(F,\alpha,x^n) \le \left(\frac{2}{\alpha}\right)^{d}\left(2pe^{2}\log\frac{2e}{\alpha}\right)^{pd}. \tag{12.8}
\]

Taking $p=2$, for classification problems with $\mathrm{vc}(F)=d$, Theorem 12.4 yields an $O\!\left(\sqrt{\frac{d}{n}}\right)$ upper bound without the superfluous logarithmic factors.
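To see why the $\log n$ factor disappears, note that any bound of the form $\log\mathcal{N}_2(F,\delta,x^n)\le K\,d\log(2e/\delta)$ makes the entropy integral in Theorem 12.4 a constant multiple of $\sqrt{d}$, independent of $n$. A small numerical sketch (not from the text; the constant $K$ and the integration grid are illustrative assumptions):

```python
import numpy as np

def entropy_integral(d, K=2.0, eps=1e-6):
    """Riemann-sum evaluation of int_0^1 sqrt(K d log(2e/delta)) d delta."""
    delta = np.linspace(eps, 1.0, 200_000)
    vals = np.sqrt(K * d * np.log(2 * np.e / delta))
    return np.sum(vals) * (delta[1] - delta[0])

for d in [1, 5, 20, 100]:
    print(d, entropy_integral(d), entropy_integral(d) / np.sqrt(d))
# The ratio to sqrt(d) is the same constant for every d, so Theorem 12.4 gives
# an upper bound of order (1/sqrt(n)) * sqrt(d) = sqrt(d/n), with no log n factor.
```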

12.5 Combinatorial Parameters

Back in the setting of real-valued prediction, a natural question to ask is whether there is a combinatorial parameter of $F\subseteq[-1,1]^X$ that controls the behavior of covering numbers. Historically, the first such parameter was the pseudo-dimension, due to D. Pollard:

Definition 12.5. The pseudo-dimension of $F\subseteq\mathbb{R}^X$ is defined as the Vapnik-Chervonenkis dimension of the class
\[
G = \left\{g(x,y)=I\{f(x)-y\ge 0\} : f\in F\right\}.
\]
Equivalently, the pseudo-dimension is the largest $d$ such that there exist $(x_1,\ldots,x_d)\in X^d$ and $(y_1,\ldots,y_d)\in\mathbb{R}^d$ with the following property:
\[
\forall(b_1,\ldots,b_d)\in\{0,1\}^d,\ \exists f\in F \ \text{s.t.}\ f(x_t)\ge y_t \Leftrightarrow b_t=1
\]
The pseudo-dimension of $F$ is a measure of its expressiveness. To show that $F$ has pseudo-dimension at least $d$, one needs to provide $d$ input points $x_1,\ldots,x_d$ and $d$ real-valued "levels" $y_1,\ldots,y_d$ such that for any bit sequence $b_1,\ldots,b_d$ there is a function that passes above and below these levels at $x_1,\ldots,x_d$ as dictated by the bit sequence.

It turns out that finiteness of the pseudo-dimension is sufficient to guarantee an upper bound on the covering numbers, but not necessary. For instance, consider the class of non-decreasing functions on the real line, as discussed earlier. Let us


prove that the pseudo-dimension is infinite. Indeed, for $d$ as large as we want, let $x_t=t$ and $y_t=t/d$, for each $t\in\{1,\ldots,d\}$. Then, for any bit sequence $(b_1,\ldots,b_d)\in\{0,1\}^d$, there is a non-decreasing function that passes above the level $y_t$ if $b_t=1$ and under it if $b_t=0$. This is clearly due to the fact that there is enough wiggle room between $y_t=t/d$ and $y_{t+1}=(t+1)/d$ for a non-decreasing function to pass either above or below the threshold.

While the pseudo-dimension is infinite for the class of non-decreasing functions, we saw earlier that the covering numbers grow at most as $n^{2/\alpha}$, yielding a rate of $\sqrt{(\log n)/n}$. Thus, the pseudo-dimension is not a satisfactory parameter for capturing the problem complexity. Intuitively, at least in the above example, the problem stems from the fact that there is no notion of scale in the definition of pseudo-dimension. For instance, if the requirement were to pass $\alpha$-above and $\alpha$-below the levels $y_t$, the above construction of a shattered set with $y_t=t/d$ would break as soon as $\alpha>1/d$. This intuition indeed leads to the "right" combinatorial notion, introduced by Kearns and Schapire [31]:

Definition 12.6. We say that $F\subseteq\mathbb{R}^X$ $\alpha$-shatters a set $(x_1,\ldots,x_n)$ if there exists $(y_1,\ldots,y_n)\in\mathbb{R}^n$ (called a witness to shattering) with the following property:
\[
\forall(b_1,\ldots,b_n)\in\{0,1\}^n,\ \exists f\in F \ \text{s.t.}\ f(x_t)>y_t+\tfrac{\alpha}{2}\ \text{if}\ b_t=1 \ \text{ and }\ f(x_t)<y_t-\tfrac{\alpha}{2}\ \text{if}\ b_t=0
\]
The fat-shattering dimension of $F$ at scale $\alpha$, denoted by $\mathrm{fat}(F,\alpha)$, is the size of the largest $\alpha$-shattered set.



Figure 12.5: Left: Pseudo-dimension can be infinite for functions that barely wiggle. Right: In contrast, the fat-shattering dimension requires the functions to pass above and below levels $y_t$ by a margin $\alpha/2$. Here, the $y_t$'s are constant.
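As an illustration (not from the text), here is a brute-force check of $\alpha$-shattering in the sense of Definition 12.6 for a finite class evaluated on a finite set of points; for a real class one would also have to search over candidate witnesses, and the toy class and witness below are arbitrary assumptions.

```python
from itertools import product
import numpy as np

def realizes(row, bits, alpha, witness):
    """Does the function with values `row` realize the bit pattern with margin alpha/2?"""
    return all(
        (row[t] > witness[t] + alpha / 2) if b == 1 else (row[t] < witness[t] - alpha / 2)
        for t, b in enumerate(bits)
    )

def alpha_shatters(values, alpha, witness):
    """values: array (m, n), row i = (f_i(x_1), ..., f_i(x_n)); witness: levels (y_1, ..., y_n)."""
    n = values.shape[1]
    return all(
        any(realizes(row, bits, alpha, witness) for row in values)
        for bits in product([0, 1], repeat=n)
    )

# Toy example: two points, four functions; y = (0, 0) witnesses 0.5-shattering.
vals = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
print(alpha_shatters(vals, alpha=0.5, witness=[0.0, 0.0]))  # True
print(alpha_shatters(vals, alpha=3.0, witness=[0.0, 0.0]))  # False: margin 1.5 is too large
```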

Example 13. For the class of non-decreasing functions on the real line, the fat-shattering dimension satisfies $\mathrm{fat}(F,\alpha)\le 4/\alpha$. Indeed, suppose $x_1<x_2<\ldots<x_d$ are


$\alpha$-shattered by $F$. Suppose $y_1\le\ldots\le y_d$ is a witness to shattering, and take the alternating sequence $(0,1,0,1,\ldots)$ of bits. For simplicity, suppose $d$ is even. Then there must be a non-decreasing function that passes below every odd-indexed and above every even-indexed $y_t$ by a margin $\alpha/2$. The jump of the function is then at least $\alpha$ between every odd and even index, and there can be at most $2/\alpha$ such jumps for a non-decreasing $[-1,1]$-valued function. Thus,
\[
\mathrm{fat}(F,\alpha)\le 4/\alpha \tag{12.9}
\]
The first part of the following theorem is due to Mendelson and Vershynin [39], and the second to [48] (the proofs are difficult and go well beyond this course):

Theorem 12.7. For $F\subseteq[-1,1]^X$ and any $0<\alpha<1$,
\[
\mathcal{N}_2(F,\alpha,x^n) \le \left(\frac{2}{\alpha}\right)^{K\cdot\mathrm{fat}(F,c\alpha)} \tag{12.10}
\]
where $K,c$ are positive absolute constants. Furthermore, for any $\delta\in(0,1)$,
\[
\log\mathcal{N}_\infty(F,\alpha,x^n) \le C d\log\left(\frac{n}{d\alpha}\right)\log^\delta(n/d), \tag{12.11}
\]
where $d=\mathrm{fat}(F,c\delta\alpha)$ for some positive constant $c$.

The first result can be seen as a generalization of (12.8) to real-valued functions, and the second as a generalization of the Vapnik-Chervonenkis-Sauer-Shelah lemma. To the best of our knowledge, it is an open question whether the extra term involving $\log^\delta(n/d)$ can be removed, and it is believed that a more natural bound $\log\mathcal{N}_\infty(F,\alpha,x^n)\le C d\log\left(\frac{n}{d\alpha}\right)$ with $d=\mathrm{fat}(F,c\alpha)$ should be possible. Prove this fact and earn an A+ in the course.

A few more remarks about Theorem 12.7. First, upper bounds on $\mathcal{N}_p$ for $1\le p<\infty$ are of the same form as (12.10), with constants $c,K$ depending only on $p$ (see [38, 39]). Second, note that the $\ell_\infty$ covering numbers depend on $n$, a feature already observed in the upper bound of the Vapnik-Chervonenkis-Sauer-Shelah lemma (see Eq. (12.7)). In fact, this dependence is inevitable for $\mathcal{N}_\infty$. On the other hand, $\ell_p$ covering numbers ($1\le p<\infty$) are independent of $n$. Thankfully, the bound of Theorem 12.4 is in terms of an $\ell_2$ covering number, so we need not concern ourselves with covering in the $\ell_\infty$ norm. Finally, we remark that for sequential problems, the analogue of (12.11) is relatively easy to prove without the extra logarithmic factor, which is somewhat surprising. We will later point out the place where the difficulty arises in the i.i.d. but not in the sequential proof.

Example 14. We turn again to the example of non-decreasing functions on the real line. Recall that we calculated $\ell_\infty$ covering numbers in (12.5) and used this upper bound in conjunction with Theorem 12.4 to get an $O\!\left(\sqrt{\frac{\log n}{n}}\right)$ upper bound in Eq. (12.6). We have also calculated an upper bound $\mathrm{fat}(F,\alpha)\le 4/\alpha$ on the fat-shattering dimension in (12.9). Using this bound together with Theorem 12.7, we get a direct estimate
\[
\mathcal{N}_2(F,\alpha,x^n) \le \left(\frac{2}{\alpha}\right)^{c/\alpha}, \quad\text{for some constant } c>0,
\]
on the $\ell_2$ covering numbers without a detour through the inferior $\ell_\infty$ covering numbers. Theorem 12.4 then yields, for any $x_1,\ldots,x_n$,
\[
\hat{\mathcal{R}}^{iid}(F) \le \inf_{\alpha\ge 0}\left\{4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^1\sqrt{\frac{c}{\delta}\log\left(\frac{2}{\delta}\right)}\,d\delta\right\}
\]
Taking $\alpha=1/\sqrt{n}$, we observe that the integrand above is bounded by a power of $\delta$ larger than $-1$ (say, $\delta^{-3/4}$), so the integral over $[\alpha,1]$ is bounded by a constant, yielding an upper bound
\[
\hat{\mathcal{R}}^{iid}(F) \le \frac{C}{\sqrt{n}}
\]
for some absolute constant $C$. We conclude that a more careful analysis with the fat-shattering dimension and the Dudley entropy integral removes the superfluous logarithmic factor. In other situations, this approach can also lead to "correct" rates.

Of course, putting together Theorem 12.4 and the bound (12.10) of Theorem 12.7 leads to a general upper bound

Corollary 12.8. For $F\subseteq[-1,1]^X$ and any $x_1,\ldots,x_n\in X$,
\[
\hat{\mathcal{D}}^{iid}(F) \le \inf_{\alpha\ge 0}\left\{4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^1\sqrt{K\,\mathrm{fat}(F,c\delta)\log\left(\frac{2}{\delta}\right)}\,d\delta\right\}
\]
for some absolute constants $c,K$.


12.6 Contraction

So far, we have developed many tools for upper bounding the Rademacher averages of a function class. In addition to the properties outlined in Lemma 7.14, the following lemma about Rademacher averages is useful:

Lemma 12.9. If $\phi:\mathbb{R}\to\mathbb{R}$ is $L$-Lipschitz (that is, $|\phi(a)-\phi(b)|\le L|a-b|$ for all $a,b\in\mathbb{R}$), then
\[
\hat{\mathcal{R}}^{iid}(\phi\circ F) \le L\,\hat{\mathcal{R}}^{iid}(F)
\]

Proof. For brevity, let us prove the result for $L=1$. We have
\begin{align*}
\hat{\mathcal{R}}^{iid}(\phi\circ F) &= \mathbb{E}_{\epsilon_1,\ldots,\epsilon_n}\sup_{f\in F}\frac{1}{n}\sum_{t=1}^n\epsilon_t\phi(f(x_t)) \\
&= \frac{1}{2}\,\mathbb{E}_{\epsilon_{1:n-1}}\left\{\sup_{f\in F}\frac{1}{n}\left[\sum_{t=1}^{n-1}\epsilon_t\phi(f(x_t))+\phi(f(x_n))\right] + \sup_{g\in F}\frac{1}{n}\left[\sum_{t=1}^{n-1}\epsilon_t\phi(g(x_t))-\phi(g(x_n))\right]\right\} \\
&= \frac{1}{2}\,\mathbb{E}_{\epsilon_{1:n-1}}\sup_{f,g\in F}\frac{1}{n}\left[\sum_{t=1}^{n-1}\left(\epsilon_t\phi(f(x_t))+\epsilon_t\phi(g(x_t))\right)+\phi(f(x_n))-\phi(g(x_n))\right] \\
&\le \frac{1}{2}\,\mathbb{E}_{\epsilon_{1:n-1}}\sup_{f,g\in F}\frac{1}{n}\left[\sum_{t=1}^{n-1}\left(\epsilon_t\phi(f(x_t))+\epsilon_t\phi(g(x_t))\right)+\left|f(x_n)-g(x_n)\right|\right]
\end{align*}

In the first step, we expanded the expectation over $\epsilon_n$ as a sum of two terms; in the second, we combined the two suprema into one; and in the third we used the Lipschitz property. Now, fix $\epsilon_{1:n-1}$ and suppose the supremum is achieved at some $(f^*,g^*)$. If $f^*(x_n)\ge g^*(x_n)$, we can remove the absolute values in the last term. Otherwise, the pair $(g^*,f^*)$ gives the same value as the last expression, and we can remove the absolute values once again. Thus, the last expression is upper bounded by
\begin{align*}
&\frac{1}{2}\,\mathbb{E}_{\epsilon_{1:n-1}}\sup_{f,g\in F}\frac{1}{n}\left[\sum_{t=1}^{n-1}\left(\epsilon_t\phi(f(x_t))+\epsilon_t\phi(g(x_t))\right)+f(x_n)-g(x_n)\right] \\
&\le \frac{1}{2}\left\{\mathbb{E}_{\epsilon_{1:n-1}}\sup_{f\in F}\frac{1}{n}\left[\sum_{t=1}^{n-1}\epsilon_t\phi(f(x_t))+f(x_n)\right] + \mathbb{E}_{\epsilon_{1:n-1}}\sup_{f\in F}\frac{1}{n}\left[\sum_{t=1}^{n-1}\epsilon_t\phi(f(x_t))-f(x_n)\right]\right\} \\
&= \mathbb{E}_{\epsilon_1,\ldots,\epsilon_n}\sup_{f\in F}\frac{1}{n}\left[\sum_{t=1}^{n-1}\epsilon_t\phi(f(x_t))+\epsilon_n f(x_n)\right]
\end{align*}
Continuing in this fashion, we "remove" $\phi$ from every term.


Lemma 12.9 will be called the "Contraction Principle". This result is indeed the missing link that allows us to relate the supremum of the Rademacher process over $\ell(F)$ to the supremum of the Rademacher process over $F$ itself whenever $\ell$ is a Lipschitz function. In the case of the indicator loss (which is not Lipschitz), we already observed in (8.4) that
\[
\mathbb{E}\sup_{g\in\ell(F)} S_g = \mathbb{E}\sup_{f\in F} S_f.
\]
Now, in view of Lemma 12.9, we also conclude that
\[
\mathbb{E}\sup_{g\in\ell(F)} S_g \le L\cdot\mathbb{E}\sup_{f\in F} S_f
\]
whenever the loss function is $L$-Lipschitz.

12.7 Discussion

We now have a full arsenal of tools for proving upper bounds on the supremum of the Rademacher process and, hence, on the excess risk
\[
\mathbb{E}\,\mathbf{L}(\hat y) - \inf_{f\in F}\mathbf{L}(f)
\]
of an empirical risk minimizer $\hat y$. Any distribution-independent upper bound also yields an upper bound on the minimax value $\mathcal{V}^{iid}(F)$, but the tools we developed in fact also give us data-dependent upper bounds. In a few lectures, we will show certain concentration arguments that allow us to estimate the supremum of the Rademacher process, opening a door to fully data-driven (rather than worst-case) results.

Since we have covered so many techniques, let us give a short recipe for how one can proceed with a new problem. To obtain an upper bound on the performance of empirical risk minimization, first pass to the supremum of the empirical process and then to the supremum of the Rademacher process, as in Theorem 7.10. Then use the properties of the Rademacher averages to simplify the class over which the Rademacher averages are taken, using Lemma 7.14 and Lemma 12.9. This can involve stripping away any Lipschitz compositions (e.g. the loss function), and passing from convex hulls of functions to the set of vertices. Then attempt to build a covering, preferably in the $\ell_2$ sense. If a fat-shattering or VC dimension can be calculated, use it to immediately deduce the size of the cover, e.g. with the help of Theorem 12.7. Use it in conjunction with the Dudley entropy integral bound of Theorem 12.4 to obtain the final bound. This development for an $L$-Lipschitz loss can be summarized by the following series of upper bounds:
\begin{align*}
\mathbb{E}\,\mathbf{L}(\hat y) - \inf_{f\in F}\mathbf{L}(f) &\le \mathbb{E}\sup_{f\in F}\underbrace{\left\{\mathbf{L}(f)-\hat{\mathbf{L}}(f)\right\}}_{G_{\ell(f)}} \le 2\,\mathbb{E}\underbrace{\left[\mathbb{E}_\epsilon\sup_{f\in F} S_{\ell(f)}\right]}_{\hat{\mathcal{R}}^{iid}(\ell(F))} \le 2L\,\mathbb{E}\underbrace{\left[\mathbb{E}_\epsilon\sup_{f\in F} S_f\right]}_{\hat{\mathcal{R}}^{iid}(F)} \\
&\le 2L\,\mathbb{E}\inf_{\alpha\ge 0}\left\{4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^1\sqrt{\log\mathcal{N}_2(F,\delta,x^n)}\,d\delta\right\} \\
&\le 2L\,\inf_{\alpha\ge 0}\left\{4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^1\sqrt{K\,\mathrm{fat}(F,c\delta)\log\left(\frac{2}{\delta}\right)}\,d\delta\right\}
\end{align*}

As the next section shows, the sequence of upper bounds is (in a certain sense) tight.

12.8 Supplementary Material: Back to the Rademacher

In Theorem 7.10, we showed that the supremum of an empirical process is within a factor of 2 of the supremum of the Rademacher process, both as an upper and a lower bound. Hence, not much is lost by working with the latter. We then upper bounded the supremum of the Rademacher process by the Dudley entropy integral $\hat{\mathcal{D}}^{iid}(F)$. We now show that this upper bound is also quite tight, as it is within a logarithmic factor of the worst-case statistical Rademacher averages.

Definition 12.10. Worst-case statistical Rademacher averages are defined as
\[
\mathcal{R}^{iid}(F,n) \triangleq \sup_{x_1,\ldots,x_n}\hat{\mathcal{R}}^{iid}(F,x_1,\ldots,x_n).
\]

Lemma 12.11. For any $n$ and $\alpha>2\mathcal{R}^{iid}(F,n)$,
\[
\mathrm{fat}(F,\alpha) \le \frac{8n\,\mathcal{R}^{iid}(F,n)^2}{\alpha^2} \tag{12.12}
\]

Proof. Let $d=\mathrm{fat}(F,\alpha)$ and let $x_1^*,\ldots,x_d^*$ be a set of points $\alpha$-shattered by $F$ with witness $y_1^*,\ldots,y_d^*$. First, suppose by way of contradiction that $d\ge n$. Then the worst-case statistical Rademacher averages based on the first $(x_1^*,\ldots,x_n^*)$ out of the shattered set give
\[
\mathcal{R}^{iid}(F,n) \ge \hat{\mathcal{R}}^{iid}(F,x_1^*,\ldots,x_n^*) = \mathbb{E}_{\epsilon_{1:n}}\sup_{f\in F}\frac{1}{n}\sum_{t=1}^n\epsilon_t f(x_t^*) = \mathbb{E}\sup_{f\in F}\frac{1}{n}\sum_{t=1}^n\epsilon_t(f(x_t^*)-y_t^*) \ge \alpha/2,
\]
contradicting our assumption that $\alpha>2\mathcal{R}^{iid}(F,n)$. Hence, necessarily $d<n$.

For simplicity of exposition, assume that $n=kd$ for some integer $k>0$ (otherwise, take $n'$ to be the nearest multiple of $d$, with $n\le n'\le 2n$). We can then take the set $(x_1,\ldots,x_n)$ to be $(x_1^*,\ldots,x_1^*,\ldots,x_d^*,\ldots,x_d^*)$, where each element $x_i^*$ is repeated $k$ times, and perform the same concatenation for the witnesses $y_1,\ldots,y_n$. We then lower bound
\begin{align*}
\mathcal{R}^{iid}(F,n) &\ge \hat{\mathcal{R}}^{iid}(F,x_1,\ldots,x_n) \\
&= \mathbb{E}_{\epsilon_{1:n}}\sup_{f\in F}\frac{1}{n}\sum_{t=1}^n\epsilon_t(f(x_t)-y_t) \\
&= \mathbb{E}_{\epsilon_{1:n}}\sup_{f\in F}\frac{1}{n}\sum_{i=1}^d(f(x_i^*)-y_i^*)\left(\sum_{j=1}^k\epsilon_{(i-1)k+j}\right)
\end{align*}
For a given sequence $(\epsilon_1,\ldots,\epsilon_n)$, let $\sigma_i$ be the majority vote of the signs $\epsilon_{(i-1)k+1},\ldots,\epsilon_{ik}$ on the $i$-th block. By the definition of $\alpha$-shattering, there exists a function $f\in F$ that passes above/below $y_i^*$ on $x_i^*$ by a margin $\alpha/2$ according to the sign $\sigma_i$, for each $i$. This yields
\[
\mathcal{R}^{iid}(F,n) \ge \mathbb{E}_{\epsilon_{1:n}}\sup_{f\in F}\frac{1}{n}\sum_{i=1}^d(\alpha/2)\left|\sum_{j=1}^k\epsilon_{(i-1)k+j}\right| \ge (\alpha/2)\sqrt{\frac{1}{2k}} = \sqrt{\frac{\alpha^2\,\mathrm{fat}(F,\alpha)}{8n}}
\]

With the upper bound on $\mathrm{fat}(F,\alpha)$ given by Lemma 12.11, we can further upper bound the integral in Corollary 12.8 to obtain

Corollary 12.12. For $F\subseteq[-1,1]^X$ and any $x_1,\ldots,x_n$,
\[
\hat{\mathcal{R}}^{iid}(F) \le \hat{\mathcal{D}}^{iid}(F) \le \mathcal{R}^{iid}(F,n)\cdot O(\log^{3/2}n).
\]

Proof. Take $\alpha=2\mathcal{R}^{iid}(F,n)/c$, for the absolute constant $c$ given in (12.10). Then
\begin{align*}
\hat{\mathcal{R}}^{iid}(F) &\le 4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^1\sqrt{K\,\mathrm{fat}(F,c\delta)\log\left(\frac{2}{\delta}\right)}\,d\delta \\
&\le 4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^1\sqrt{K\,\frac{8n\mathcal{R}^{iid}(F,n)^2}{c^2\delta^2}\log\left(\frac{2}{\delta}\right)}\,d\delta
\end{align*}
Simplifying the above expression, we obtain an upper bound
\[
\mathcal{R}^{iid}(F,n)\left(8/c + 12\int_\alpha^1\sqrt{\frac{8K}{c^2\delta^2}\log\left(\frac{2}{\delta}\right)}\,d\delta\right) = \mathcal{R}^{iid}(F,n)\cdot O(\log^{3/2}n)
\]


12.9 Supplementary Material: Lower Bound on the Minimax Value

We now show a lower bound on the value of statistical learning with absolute loss (Eq. (5.9)) in terms of the worst-case statistical Rademacher averages $\mathcal{R}^{iid}$ defined in the previous section, with the caveat that we only consider proper learning methods (see page 53).

Lemma 12.13 ([50]). Consider the i.i.d. setting of supervised learning with absolute loss, and let $F\subseteq[-1,1]^X$. Let us only consider estimators $\hat y$ that take values in $F$ (that is, proper learning). Then
\[
\mathcal{V}^{iid,ab}(F,n) \ge \mathcal{R}^{iid}(F,2n) - \frac{1}{2}\mathcal{R}^{iid}(F,n) \tag{12.13}
\]

Proof. Recall that $|y-f(x)| = 1-yf(x)$ for $y\in\{\pm 1\}$ and $f(x)\in[-1,1]$. Hence, we can write
\begin{align*}
\mathbb{E}|\hat y(X)-Y| - \inf_{f\in F}\mathbb{E}|f(X)-Y| &= \mathbb{E}\left(1-Y\hat y(X)\right) - \inf_{f\in F}\mathbb{E}\left(1-Yf(X)\right) \\
&= \sup_{f\in F}\mathbb{E}\left(Yf(X)\right) - \mathbb{E}\left(Y\hat y(X)\right)
\end{align*}
where the expectation in the second term is over the draw of the data (of size $n$) as well as another i.i.d. copy $(X,Y)$. Take any $x_1,\ldots,x_{2n}\in X$ and let $P_X$ be the uniform distribution on these $2n$ points. For any $\epsilon=(\epsilon_1,\ldots,\epsilon_{2n})\in\{\pm 1\}^{2n}$, let the distribution $P_\epsilon$ be such that the marginal distribution of the variable $X$ is $P_X$ while the conditional distribution of $Y\,|\,X=x_i$ is deterministically $\epsilon_i$, for all $i\in[2n]$. This defines a family of distributions $P_\epsilon$ indexed by $\epsilon\in\{\pm 1\}^{2n}$. Recall that, by definition, an estimator $\hat y$ is a mapping from $n$ samples drawn i.i.d. from this discrete distribution into $F$. Then
\begin{align*}
\mathcal{V}^{iid,ab}(F,n) &= \inf_{\hat y}\sup_{P}\left\{\mathbb{E}|\hat y(X)-Y| - \inf_{f\in F}\mathbb{E}|f(X)-Y|\right\} \\
&\ge \inf_{\hat y}\sup_{x_1,\ldots,x_{2n}\in X}\mathbb{E}_\epsilon\left\{\sup_{f\in F}\mathbb{E}\left(Yf(X)\right) - \mathbb{E}\left(Y\hat y(X)\right)\right\} \\
&\ge \sup_{x_1,\ldots,x_{2n}\in X}\mathbb{E}_\epsilon\left\{\sup_{f\in F}\mathbb{E}\left(Yf(X)\right)\right\} - \sup_{\hat y}\sup_{x_1,\ldots,x_{2n}\in X}\mathbb{E}_\epsilon\mathbb{E}\left(Y\hat y(X)\right)
\end{align*}


The first term above is precisely

\[
\sup_{x_1,\ldots,x_{2n}\in X}\mathbb{E}_\epsilon\sup_{f\in F}\mathbb{E}\left(Yf(X)\right) = \sup_{x_1,\ldots,x_{2n}\in X}\mathbb{E}_\epsilon\sup_{f\in F}\frac{1}{2n}\sum_{t=1}^{2n}\epsilon_t f(x_t) = \mathcal{R}^{iid}(F,2n).
\]
For the second term, note that a uniform draw of $n$ data points from the $2n$ with replacement is equivalent to drawing indices $i_1,\ldots,i_n$ uniformly at random from $\{1,\ldots,2n\}$; let $J$ stand for the set of unique indices, $|J|\le n$. We can then write the second term as
\begin{align*}
\sup_{\hat y}\sup_{x_1,\ldots,x_{2n}\in X}\mathbb{E}_\epsilon\mathbb{E}\left(Y\hat y(X)\right) &= \sup_{\hat y}\sup_{x_1,\ldots,x_{2n}\in X}\mathbb{E}_{i_1,\ldots,i_n}\mathbb{E}_\epsilon\frac{1}{2n}\sum_{t=1}^{2n}\epsilon_t\hat y(x_t) \\
&= \sup_{\hat y}\sup_{x_1,\ldots,x_{2n}\in X}\mathbb{E}_{i_1,\ldots,i_n}\mathbb{E}_\epsilon\frac{1}{2n}\sum_{t\in J}\epsilon_t\hat y(x_t)
\end{align*}
where the last equality is due to the fact that
\[
\mathbb{E}_\epsilon\frac{1}{2n}\sum_{t\notin J}\epsilon_t\hat y(x_t) = 0.
\]
However, we can upper bound
\begin{align*}
\sup_{\hat y}\sup_{x_1,\ldots,x_{2n}\in X}\mathbb{E}_{i_1,\ldots,i_n}\mathbb{E}_\epsilon\frac{1}{2n}\sum_{t\in J}\epsilon_t\hat y(x_t) &\le \sup_{\hat y}\mathbb{E}_{i_1,\ldots,i_n}\sup_{x_1,\ldots,x_{|J|}\in X}\mathbb{E}_\epsilon\frac{1}{2n}\sum_{t\in J}\epsilon_t\hat y(x_t) \\
&\le \sup_{\hat y}\sup_{x_1,\ldots,x_n\in X}\mathbb{E}_\epsilon\frac{1}{2n}\sum_{t=1}^n\epsilon_t\hat y(x_t) \\
&\le \sup_{x_1,\ldots,x_n\in X}\mathbb{E}_\epsilon\sup_{f\in F}\frac{1}{2n}\sum_{t=1}^n\epsilon_t f(x_t)
\end{align*}
which is $\frac{1}{2}\mathcal{R}^{iid}(F,n)$. This concludes the proof.

13 Sequential Prediction: Classification

We now turn to the setting of sequential prediction, with the goal of mimicking many of the results of the previous few lectures. Within sequential prediction, "n-tuples of data" are replaced with "trees", the empirical process with the martingale process, and the Rademacher process with the tree process. We already observed certain similarities in the way that the suprema over finite classes are bounded both in the i.i.d. and sequential cases. We also saw that the simple example of thresholds on an interval is "easy" for i.i.d. learning, but "hard" for sequential prediction, constituting an apparent divide between the two settings. How much of the i.i.d.-style analysis can be pushed through for sequential prediction, and what are the key differences? This is the subject of the next few lectures.¹

In particular, we first focus on the supervised scenario. Let us consider the improper learning version: at each round $t$, the learner observes $x_t\in X$, makes a prediction $\hat y_t$, and observes the outcome $y_t$, which is chosen by Nature simultaneously with the learner's choice. The sequence of $x_t$'s and $y_t$'s is chosen by Nature, possibly adaptively based on the moves $\hat y_1,\ldots,\hat y_{t-1}$ of the player. The classification problem involves prediction of a binary label $y_t\in\{0,1\}$, and the goal of the learner is to have small regret
\[
\frac{1}{n}\sum_{t=1}^n I\{\hat y_t\ne y_t\} - \inf_{f\in F}\frac{1}{n}\sum_{t=1}^n I\{f(x_t)\ne y_t\}
\]
against some class $F$ of binary-valued functions. The indicator loss can be equivalently written as the absolute value of the difference. According to Theorems 7.9

1Most of the results in the next few lectures can be found in [44,46].

and 7.11, the value of the sequential prediction problem is upper bounded by
\[
\mathcal{V}^{seq}(F,n) \le 2\sup_{x,y}\mathbb{E}\sup_{g\in\ell(F)} T_g = 2\sup_{x,y}\mathbb{E}\sup_{f\in F}\frac{1}{n}\sum_{t=1}^n\epsilon_t I\{f(x_t(\epsilon_{1:t-1}))\ne y_t(\epsilon_{1:t-1})\} \tag{13.1}
\]

Just as we did in Eq. (8.4), we pass to the tree process on F itself, rather than the loss class. This step is a version of the contraction principle as stated in Lemma 12.9 for the i.i.d. case.

Lemma 13.1. For $Y=\{0,1\}$, $F\subseteq Y^X$, and $\ell(f,(x,y))=I\{f(x)\ne y\}$,
\[
\sup_{x,y}\mathbb{E}\sup_{g\in\ell(F)} T_g = \sup_{x}\mathbb{E}\sup_{f\in F} T_f \tag{13.2}
\]
With the definition of sequential Rademacher complexity, the statement can be written more succinctly as
\[
\mathcal{R}^{seq}(\ell(F)) = \mathcal{R}^{seq}(F).
\]
A proof of this lemma requires a few intermediate steps, and postponing it until the end of the lecture seems like a good idea.

13.1 From Finite to Infinite Classes: First Attempt

Lemma 9.5 gives us a handle on the supremum of the tree process for a finite class $F$. Mimicking the development for the i.i.d. case, we would like to now encompass infinite classes as well. The threshold example is not interesting since the tree process does not converge to zero, according to Theorem 8.1. Consider a different example instead. The example is rather trivial, but will serve us for the purposes of demonstration. Let $X=[0,1]$ and
\[
F = \{f_a : a\in[0,1],\ f_a(x)=0\ \forall x\ne a,\ f_a(a)=1\}, \tag{13.3}
\]
the class of functions that are zero everywhere except on one designated point. This is an infinite class, and the question is whether the expected supremum of the tree process decays to zero with increasing $n$ for any $X$-valued tree $x$. Now, recall our development for the i.i.d. case with the class of thresholds. We argued that if we condition on the data, the effective number of possible values that the functions take is finite even if the class is infinite. This led us to the idea of the growth function. Following the analogy, let us define $F|_{x}$ as the set of all $\{0,1\}$-valued trees
\[
F|_{x} = \{f\circ x : f\in F\}
\]
where $f\circ x$ is defined as the tree $(f\circ x_1, f\circ x_2,\ldots,f\circ x_n)$. Since $x_t$ is a function from $\{\pm 1\}^{t-1}$ to $X$, the $t$-th level $f\circ x_t$ of the $f\circ x$ tree is a function from $\{\pm 1\}^{t-1}$ to $\{0,1\}$. Thinking of $F|_{x}$ as the projection (or an "imprint") of $F$ on $x$, we can write
\[
\mathbb{E}\sup_{f\in F}\frac{1}{n}\sum_{t=1}^n\epsilon_t f(x_t(\epsilon_{1:t-1})) = \mathbb{E}\max_{v\in F|_{x}}\frac{1}{n}\sum_{t=1}^n\epsilon_t v_t(\epsilon_{1:t-1}) \tag{13.4}
\]
because the set $F|_{x}$ is clearly finite.

Given $x$, how can we describe $F|_{x}$? For any $f_a\in F$, $f_a\circ x$ is a tree that is zero everywhere except for those (if any) nodes in the tree which are equal to $a$. Observe that for two functions $f_a,f_b\in F$, $f_a\circ x = f_b\circ x$ if and only if $a,b\in[0,1]$ both do not appear in the tree $x$:
\[
f_a\circ x = f_b\circ x \ \Leftrightarrow\ a,b\notin\mathrm{Img}(x) \text{ or } a=b.
\]
Suppose $x$ is such that $x_n:\{\pm 1\}^{n-1}\mapsto[0,1]$ takes on $2^{n-1}$ distinct values. Then
\[
\mathrm{card}(F|_{x}) \ge 2^{n-1}.
\]
Since this cardinality is exponential in $n$, Lemma 9.5 gives a vacuous bound. It appears that there is no hope for passing from a finite to an infinite class by studying the size of the projection of $F$ on $x$, as it will be exponential in $n$ for any nontrivial class and a "diverse enough" $x$. But maybe this is for a good reason, and the problem is not learnable, just like in the case of thresholds? This turns out not to be the case:

Exercise (⋆⋆): Provide a strategy for the learner that will suffer $O(\sqrt{(\log n)/n})$ regret for the binary prediction problem with the class defined in (13.3). Can you get an $O(1/\sqrt{n})$ algorithm? (This should be doable.)

Of course, one may hypothesize that the problem might be learnable but the bound of Theorem 7.9 is loose. This is again not the case, and one can prove that the expected supremum of the tree process indeed converges to zero for any $x$; but $F|_{x}$ is the wrong quantity to consider.

13.2 From Finite to Infinite Classes: Second Attempt

As we have already noticed, the cardinality of the set $F|_{x}$ is too large. However, this does not reflect the fact that every $f_a\circ x$, in the case of the function class defined in (13.3), is quite "simple": the values are zero, except when $x_t=a$.

First, for illustration purposes, consider the case when the tree $x$ contains unique elements along any path (left-most tree in Figure 13.1). That is, for any $(\epsilon_1,\ldots,\epsilon_n)$, if $x_t(\epsilon_{1:t-1})=x_s(\epsilon_{1:s-1})$ then $s=t$. In this case, $f_a\circ x$ is a tree with at most a single value 1 along any path. Consider the set $V=\{v^{(0)},v^{(1)},\ldots,v^{(n)}\}$ of $n+1$ binary-


Figure 13.1: Left: an $X$-valued tree $x$ with unique elements within any path. Right: first three of the $n+1$ covering trees. Circled nodes are defined to be 1 while the rest are 0.

valued trees defined as follows. The tree $v^{(0)}$ is identically 0-valued. For $j\ge 1$, the tree $v^{(j)}$ has zeros everywhere except on the $j$-th level. That is, for any $j\in\{1,\ldots,n\}$, $v^{(j)}_t(\epsilon_{1:t-1})=1$ whenever $t=j$ and zero otherwise. These trees are depicted in Figure 13.1. Given the uniqueness of the values of $x$ along any path, we have the following important property:
\[
\forall f\in F,\ \forall(\epsilon_1,\ldots,\epsilon_n)\in\{\pm 1\}^n,\ \exists v\in V \ \text{s.t.}\ f(x_t(\epsilon_{1:t-1}))=v_t(\epsilon_{1:t-1})\ \forall t\in\{1,\ldots,n\} \tag{13.5}
\]
That is, for any $f\in F$ and any path, there exists a "covering tree" $v$ in $V$ such that $f$ on $x$ agrees with $v$ on the given path. For instance, consider a function $f_d\in F$ which takes the value zero everywhere except on $x=d$. Consider the left-most path $\epsilon=(-1,-1,-1,\ldots)$. Then, for the $x$ given in Figure 13.1, the function takes on the value 1 at the third node, since $x_3(-1,-1)=d$, and zero everywhere else. But then the covering tree $v^{(3)}$ in Figure 13.1 provides exactly these values on the path $(-1,-1,-1,\ldots)$. It is not hard to convince oneself that the property (13.5) holds


true. What is crucial is that for such a tree $x$,
\[
\mathcal{R}^{seq}(F,x) = \mathbb{E}\sup_{f\in F}\frac{1}{n}\sum_{t=1}^n\epsilon_t f(x_t(\epsilon_{1:t-1})) = \mathbb{E}\max_{v\in V}\frac{1}{n}\sum_{t=1}^n\epsilon_t v_t(\epsilon_{1:t-1}) \le \sqrt{\frac{2\log(n+1)}{n}} \tag{13.6}
\]
by Lemma 9.5, since $\mathrm{card}(V)=n+1$. While this achieves our goal, the argument crucially relied on the fact that $x$ contains unique elements along any path. What if $x$ has several identical elements $a$ along some path? In this case, the function $f_a$ will take on the value 1 on multiple nodes of this path, and the set $V$ defined above no longer does the job. A straightforward attempt to create a set $V$ with all possible subsets of rows (taking on the value 1) will fail, as there is an exponential number of such possibilities.

It is indeed remarkable that there exists a set of $\{0,1\}$-valued trees $V$ of cardinality at most $n+1$ that satisfies property (13.5) without the assumption of uniqueness of elements along the paths. Here is how such a set can be constructed inductively. Suppose we have at our disposal two sets $V^\ell$ and $V^r$ of covering trees of depth $n-1$


Figure 13.2: $V$ is constructed inductively by pairing up trees in $V^\ell$ and $V^r$, plus an additional tree $v^0$ (bottom right) which takes on value 1 only when $x$ (top right) at the corresponding node takes on the value $x_1$.

for the left and right subtrees of $x$ at the root. Suppose that, inductively, on the two subtrees the sets $V^\ell$ and $V^r$ satisfy property (13.5). For $v^\ell\in V^\ell$ and $v^r\in V^r$, define a joined tree $v$ as having 0 at the root and $v^\ell$ and $v^r$ as the two subtrees at the root (see Figure 13.2). In this fashion, take pairs from both sets such that each element of $V^\ell$ and $V^r$ occurs in at least one pair. This construction gives a set $V$ of size $\max\{\mathrm{card}(V^\ell),\mathrm{card}(V^r)\}$. We add to $V$ one more tree $v^0$ defined as zero


everywhere except on those nodes where $x$ has the same value as the root $x_1$ (in Figure 13.2, $x_1=a$ and $v^0$ is constructed accordingly). That is,
\[
v^0_t(\epsilon_{1:t-1}) = 1 \ \text{if}\ x_t(\epsilon_{1:t-1})=x_1 \quad\text{and}\quad v^0_t(\epsilon_{1:t-1})=0 \ \text{otherwise.}
\]
We claim that the set $V$ satisfies (13.5) and its size is larger than that of $V^\ell$ and $V^r$ by at most 1. Indeed, for $f_{x_1}\in F$ the tree $v^0$ matches the values along any path. For other $f\in F$, the value at the root is zero, and (13.5) holds by induction on both subtrees. The size of $V$ increases by at most one when passing from depth $n-1$ to $n$. For the base of the induction, we require 2 trees for $n=1$: one with the value 0 and one with the value 1 at the single vertex. We conclude that the size of the set satisfying (13.5) is at most $n+1$. As a consequence, the bound of (13.6) holds for any $x$. We conclude that for the example under consideration, with the function class $F$ defined in (13.3),
\[
\mathcal{V}^{seq}(F,n) \le 2\sup_{x}\mathbb{E}\sup_{f\in F} T_f \le 2\sqrt{\frac{2\log(n+1)}{n}} \tag{13.7}
\]

13.3 The Zero Cover and the Littlestone's Dimension

To summarize the development so far, we have seen that the size of the projection $F|_{x}$ is not the right quantity to consider. However, if we can find a set $V$ of binary-valued trees such that property (13.5) holds, the supremum over the class $F$ is equal to the maximum over the trees in $V$. Let us make this statement more formal.

Definition 13.2. A set V of {0,1}-valued trees of depth n is called a zero-cover of F on a given X-valued tree x if

\[
\forall f\in F,\ \forall(\epsilon_1,\ldots,\epsilon_n)\in\{\pm 1\}^n,\ \exists v\in V \ \text{s.t.}\ f(x_t(\epsilon_{1:t-1}))=v_t(\epsilon_{1:t-1})\ \forall t\in\{1,\ldots,n\}
\]
A zero-covering number is defined as
\[
\mathcal{N}_0(F,x) = \min\{\mathrm{card}(V) : V \text{ is a zero-cover}\}.
\]

Recall that in Section 12.4 we defined an i.i.d. zero cover on tuples of points and observed that it is the same as the projection $F|_{x^n}$. For trees, however, the zero cover is different from the projection $F|_{x}$. This is a key departure from the non-tree definitions. Observe that the difference is really in the order of quantifiers: the covering tree $v$ can be chosen according to the given path $(\epsilon_1,\ldots,\epsilon_n)$. It is not required that there exists a tree $v$ equal to $f\circ x$ on every path, but rather that we can find some $v$ for any path. It is indeed because of this correct order of quantifiers that we are able to control the supremum of the tree process even though $F|_{x}$ is an exponentially large set. We have the following analogue of Proposition 11.2:

Proposition 13.3. Let $x$ be an $X$-valued tree of depth $n$, and suppose $F\subseteq\{0,1\}^X$. Then
\[
\mathcal{R}^{seq}(F,x) \triangleq \mathbb{E}\sup_{f\in F}\frac{1}{n}\sum_{t=1}^n\epsilon_t f(x_t(\epsilon_{1:t-1})) \le \sqrt{\frac{2\log\mathcal{N}_0(F,x)}{n}}
\]
In a somewhat surprising turn, we can define a combinatorial dimension analogous to the Vapnik-Chervonenkis dimension, which controls the size of the zero-cover in exactly the same way that the VC dimension controls the size of the zero cover (Lemma 11.5).

Definition 13.4. We say that $F\subseteq\{0,1\}^X$ shatters a tree $x$ of depth $n$ if
\[
\forall(\epsilon_1,\ldots,\epsilon_n)\in\{\pm 1\}^n,\ \exists f\in F \ \text{s.t.}\ f(x_t(\epsilon_{1:t-1}))=\tilde\epsilon_t\ \forall t\in\{1,\ldots,n\}
\]
where $\tilde\epsilon_t=(\epsilon_t+1)/2\in\{0,1\}$.

Definition 13.5. The Littlestone dimension $\mathrm{ldim}(F)$ of the class $F\subseteq\{0,1\}^X$ is defined as the depth of the largest complete binary $X$-valued tree $x$ that is shattered by $F$, and infinity if the maximum does not exist. Further, for a given $x$, define $\mathrm{ldim}(F,x)$ as the depth of the largest complete binary shattered tree with values in $\mathrm{Img}(x)$.

It is easy to see that the Littlestone dimension becomes the VC dimension if the trees $x$ are only allowed to have constant mappings $x_t(\epsilon_{1:t-1})=x_t$ for all $\epsilon_{1:t-1}$. Hence, we have the following rather easy lemma:

Lemma 13.6. For any $F\subseteq\{0,1\}^X$,
\[
\mathrm{vc}(F) \le \mathrm{ldim}(F)
\]

Example 15. The Littlestone dimension of the class of thresholds $F=\{f_\theta(x)=I\{x\le\theta\}:\theta\in[0,1]\}$ is infinite. Indeed, a $[0,1]$-valued tree $x$ of arbitrary depth can be constructed as in Figure 8.3, and it is shattered by $F$. In contrast, $\mathrm{vc}(F)=1$.

Example 16. The Littlestone’s dimension of the class defined in (13.3) is 1.
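For intuition, the Littlestone dimension of a finite class on a finite domain can be computed by brute force (this sketch is not from the text; the recursion it uses is the observation that a shattered tree of depth $d+1$ rooted at a point $x$ requires both label restrictions of the class at $x$ to shatter trees of depth $d$). The toy classes below are arbitrary finite versions of the examples above.

```python
from functools import lru_cache

def ldim(F, domain):
    """Brute-force Littlestone dimension of a nonempty finite class F.

    F: iterable of tuples, each tuple = values of one function on `domain`.
    """
    F = frozenset(F)

    @lru_cache(maxsize=None)
    def rec(sub):
        best = 0
        for i, _ in enumerate(domain):
            F0 = frozenset(f for f in sub if f[i] == 0)
            F1 = frozenset(f for f in sub if f[i] == 1)
            if F0 and F1:  # both labels realizable at this point
                best = max(best, 1 + min(rec(F0), rec(F1)))
        return best

    return rec(F)

domain = list(range(8))
spikes = [tuple(int(x == a) for x in domain) for a in domain]      # class (13.3), finite domain
thresholds = [tuple(int(x <= t) for x in domain) for t in domain]  # thresholds on the same domain
print(ldim(spikes, domain))      # 1, as in Example 16
print(ldim(thresholds, domain))  # 3 = log2(8): grows with the domain, hence infinite on [0,1]
```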

We are now ready to state an analogue of the Vapnik-Chervonenkis-Sauer-Shelah Lemma:

Theorem 13.7. For any $x$,
\[
\mathcal{N}_0(F,x) \le \sum_{i=0}^{d}\binom{n}{i}
\]
whenever $\mathrm{ldim}(F)=d<\infty$.

With the help of Eq. (11.6), we have the following analogue of Corollary 11.6:

Corollary 13.8. Under the setting of Proposition 13.3, if $\mathrm{ldim}(F)=d$,
\[
\mathcal{R}^{seq}(F) \le 2\sqrt{\frac{2d\log(en/d)}{n}}
\]

Proof of Theorem 13.7. As in the proof of Lemma 11.5, define the function $g(d,n)=\sum_{i=0}^d\binom{n}{i}$. The theorem claims that the size of a minimal zero-cover is at most $g(d,n)$. The proof proceeds by induction on $n+d$.

Base: For $d=1$ and $n=1$, there is only one node in the tree, i.e. the tree is defined by the constant $x_1\in X$. Functions in $F$ can take at most two values on $x_1$, so $\mathcal{N}_0(F,x)\le 2$. Using the convention $\binom{n}{0}=1$, we indeed verify that $g(1,1)=2$. The same calculation gives the base case for $n=1$ and any $d\in\mathbb{N}$. Furthermore, for any $n\in\mathbb{N}$, if $d=0$, then there is no point which is 1-shattered by $F$. This means that all functions in $F$ are identical, proving that there is a zero-cover of size $1=g(0,n)$.

Induction step: Suppose by way of induction that the statement holds for $(d,n-1)$ and $(d-1,n-1)$. Consider any tree $x$ of depth $n$ with $\mathrm{ldim}(F,x)=d$. Define the partition $F=F_0\cup F_1$ with $F_i=\{f\in F : f(x_1)=i\}$ for $i\in\{0,1\}$. We first argue that it cannot be the case that $\mathrm{ldim}(F_0,x)=\mathrm{ldim}(F_1,x)=d$. For otherwise, there exist two trees $z$ and $w$ of depth $d$, shattered by $F_0$ and $F_1$, respectively, and with $\mathrm{Img}(z),\mathrm{Img}(w)\subseteq\mathrm{Img}(x)$. Since functions within $F_0$ take on the same value on $x_1$, we conclude that $x_1\notin\mathrm{Img}(z)$ (this follows immediately from the definition of shattering). Similarly, $x_1\notin\mathrm{Img}(w)$. We now join the two shattered trees $z$ and $w$ with $x_1$ at the root and observe that $F_0\cup F_1$ shatters this resulting tree of depth $d+1$, which is a contradiction. We conclude that $\mathrm{ldim}(F_i,x)=d$ for at most one $i\in\{0,1\}$.



Figure 13.3: The joined trees provide a zero-cover for $F_1$. For any path, the induction hypothesis provides a zero-cover on the subtree, as illustrated.

Without loss of generality, assume $\mathrm{ldim}(F_0,x)\le d$ and $\mathrm{ldim}(F_1,x)\le d-1$. By induction, there are zero-covers $V^\ell$ and $V^r$ of $F_1$ on the subtrees $x^\ell$ and $x^r$, respectively, both of size at most $g(d-1,n-1)$. Out of these covers we can create a zero-cover $V$ for $F_1$ on $x$ by pairing the covering trees in $V^\ell$ and $V^r$. Formally, consider a set of pairs $(v^\ell,v^r)$ of trees, with $v^\ell\in V^\ell$, $v^r\in V^r$, and such that each tree in $V^\ell$ and $V^r$ appears in at least one of the pairs. Clearly, this can be done using at most $g(d-1,n-1)$ pairs, and such a pairing is not unique. We join the subtrees in every pair $(v^\ell,v^r)$ with a constant 1 as the root, thus creating a set $V$ of trees with $\mathrm{card}(V)\le g(d-1,n-1)$. We claim that $V$ is a zero-cover for $F_1$ on $x$ (see Figure 13.3). Note that all the functions in $F_1$ take on the same value 1 on $x_1$, and by construction $v_1=1$ for any $v\in V$. Now, consider any $f\in F_1$ and $\epsilon\in\{\pm 1\}^n$. Without loss of generality, assume $\epsilon_1=-1$. By assumption, there is a $v^\ell\in V^\ell$ such that $v^\ell_t(\epsilon_{2:t})=f(x_{t+1}(\epsilon_{1:t}))$ for any $t\in\{1,\ldots,n-1\}$. By construction, $v^\ell$ appears as a left subtree of at least one tree in $V$, which, therefore, matches the values of $f$ for $\epsilon_{1:n}$. The same argument holds for $\epsilon_1=+1$ by finding an appropriate subtree in $V^r$. We conclude that $V$ is a zero-cover of $F_1$ on $x$. A similar argument yields a zero-cover for $F_0$ on $x$ of size at most $g(d,n-1)$ by induction. Thus, the size of the resulting zero-cover of $F$ on $x$ is at most
\[
g(d,n-1)+g(d-1,n-1) = g(d,n),
\]
completing the induction step and yielding the statement of the theorem.


13.4 Removing the Indicator Loss, or Fun Rotations with Trees

Finiteness of the Littlestone dimension is necessary and sufficient for the expected supremum of the tree process for the worst-case tree to converge to zero. However, for the study of sequential prediction in the supervised setting, it remains to connect the tree process indexed by $F$ to that indexed by $\ell(F)$ for the indicator loss. This is the content of Lemma 13.1, and we now provide the proof. We start by proving two supporting lemmas, which can be found in [46].

To motivate the first lemma, take a fixed vector $s\in\{\pm 1\}^n$, and let $\epsilon_1,\ldots,\epsilon_n$ be i.i.d. Rademacher random variables. Then, surely, the vector $(\epsilon_1 s_1,\ldots,\epsilon_n s_n)$ is a vector of i.i.d. Rademacher random variables. Surprisingly, Lemma 13.9 below says that such a statement can be pushed much further: $s_t$ does not need to be fixed but can be chosen according to $(\epsilon_1,\ldots,\epsilon_{t-1})$. This exactly means that for any $\{\pm 1\}$-valued tree $s$, the sequence $(\epsilon_1 s_1,\epsilon_2 s_2(\epsilon_1),\ldots,\epsilon_n s_n(\epsilon_{1:n-1}))$ is still i.i.d. Rademacher. The result can also be interpreted as a bijection between the set of sign sequences $(\epsilon_1,\ldots,\epsilon_n)$ and the set of sign sequences $(\epsilon_1 s_1,\ldots,\epsilon_n s_n(\epsilon_{1:n-1}))$.

For vectors $a,b\in\{\pm 1\}^n$, let the operation $a\star b$ denote element-wise multiplication. Further, for a $\{\pm 1\}$-valued tree $s$, let $s_{1:n}(\epsilon)=(s_1(\epsilon),\ldots,s_n(\epsilon))$ denote the vector of elements on the path $\epsilon=(\epsilon_1,\ldots,\epsilon_n)$. Let us recall the convention that we write $x_t(\epsilon)$ even though $x_t$ only depends on $\epsilon_{1:t-1}$.

Lemma 13.9. Let $s$ be any $\{\pm 1\}$-valued tree and let $\epsilon=(\epsilon_1,\ldots,\epsilon_n)$ be a vector of i.i.d. Rademacher variables. Then the elements of the vector $\epsilon\star s_{1:n}(\epsilon)$ are also i.i.d. Rademacher.

Proof. We just need to show that the set of sign patterns $P(s)=\{\epsilon\star s_{1:n}(\epsilon) : \epsilon\in\{\pm 1\}^n\}$ is the same as the set of all sign patterns, i.e. $P(s)=\{\pm 1\}^n$. This is easy to do by induction on $n$. The lemma is obvious for $n=1$. Assume it for $n=k$ and let us prove it for $n=k+1$. Fix a tree $s$ of depth $k+1$. Let $s^L$ and $s^R$ be the left and right subtrees (of depth $k$), i.e.
\[
s^L_t(\epsilon) = s_{t+1}(-1,\epsilon), \qquad s^R_t(\epsilon) = s_{t+1}(+1,\epsilon).
\]
Now, by definition of $P(s)$, we have
\[
P(s) = \{(-s_1,b) : b\in P(s^L)\}\cup\{(+s_1,b) : b\in P(s^R)\}
\]


where $s_1$ is simply the root of $s$. Invoking the induction hypothesis, this is equal to
\[
P(s) = \{(-s_1,b) : b\in\{\pm 1\}^k\}\cup\{(+s_1,b) : b\in\{\pm 1\}^k\},
\]
and thus $P(s)=\{\pm 1\}^{k+1}$ no matter what $s_1\in\{+1,-1\}$ is.
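The bijection asserted by Lemma 13.9 is easy to verify by enumeration for small $n$ (this check is not from the text; the random signed tree below is an arbitrary example, stored level by level as maps from sign prefixes to values).

```python
from itertools import product
import random

random.seed(0)
n = 6

# A {+-1}-valued tree of depth n: s[t] maps the prefix (eps_1, ..., eps_t) of length t
# to the value s_{t+1} at the corresponding node.
s = [{prefix: random.choice([-1, 1]) for prefix in product([-1, 1], repeat=t)}
     for t in range(n)]

def rotate(eps):
    """The path (eps_1 s_1, eps_2 s_2(eps_1), ..., eps_n s_n(eps_{1:n-1}))."""
    return tuple(eps[t] * s[t][eps[:t]] for t in range(n))

patterns = {rotate(eps) for eps in product([-1, 1], repeat=n)}
print(len(patterns) == 2 ** n)  # True: every sign pattern appears exactly once
```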

Lemma 13.9 can now be used to prove the following result.

Lemma 13.10. For any $X$-valued tree $x$ and $\{\pm 1\}$-valued tree $s$, there exists another instance tree $x'$ such that
\[
\mathbb{E}\sup_{f\in F}\sum_{t=1}^n\epsilon_t s_t(\epsilon)f(x_t(\epsilon)) = \mathbb{E}\sup_{f\in F}\sum_{t=1}^n\epsilon_t f(x'_t(\epsilon)).
\]
Specifically, $x'$ is defined by $x_t(\epsilon)=x'_t(\epsilon\star s_{1:n}(\epsilon))$ for $t=1,\ldots,n$.

Proof. First, $x'$ is well-defined because $\{\epsilon\star s_{1:n}(\epsilon) : \epsilon\in\{\pm 1\}^n\}=\{\pm 1\}^n$ by Lemma 13.9. Next, for any $f$,
\[
\sum_{t=1}^n\epsilon_t s_t(\epsilon)f(x_t(\epsilon)) = \sum_{t=1}^n\epsilon_t s_t(\epsilon)f(x'_t(\epsilon\star s_{1:n}(\epsilon))).
\]
Taking the supremum over $f$ followed by the expectation over $\epsilon$ on both sides gives us
\[
\mathbb{E}\sup_{f\in F}\sum_{t=1}^n\epsilon_t s_t(\epsilon)f(x_t(\epsilon)) = \mathbb{E}\sup_{f\in F}\sum_{t=1}^n[\epsilon\star s_{1:n}(\epsilon)]_t\, f(x'_t(\epsilon\star s_{1:n}(\epsilon)))
\]
The proof is now complete by appealing to Lemma 13.9, which asserts that the distribution of $\epsilon\star s_{1:n}(\epsilon)$ is also i.i.d. Rademacher no matter what the signed tree $s$ is.

One can interpret the statement of Lemma 13.10 as an equality due to the “ro- tation” of the tree x provided by the tree s of signs. For instance, suppose for the illustration purposes that s is a tree of all 1’s. Hence, −

$$ \epsilon_t s_t(\epsilon) f(x_t(\epsilon)) = -\epsilon_t f(x_t(\epsilon)) = -\epsilon_t f(x'_t(-\epsilon)), $$
where $\mathbf{x}'$ is the mirror reflection of $\mathbf{x}$ defined as $x'_t(\epsilon_{1:t-1}) = x_t(-\epsilon_{1:t-1})$. It is then not surprising that Lemma 13.10 holds for this particular choice of $\mathbf{s}$. As another example, take the tree $\mathbf{s}$ with $-1$ at the root and $+1$ everywhere else. Then the resulting tree $\mathbf{x}'$ has the left and right subtrees of $\mathbf{x}$ reversed. What is interesting is that the result of Lemma 13.10 holds for any $\{\pm 1\}$-valued tree $\mathbf{s}$, and this is exactly what we need to "erase" the indicator loss.
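To make the "rotation" of Lemmas 13.9 and 13.10 concrete, here is a small numerical sanity check in Python (not part of the original notes; the dictionary-based tree representation and the variable names are ours): it builds a random $\{\pm1\}$-valued tree of depth $n$, applies the map $\epsilon \mapsto \epsilon \star \mathbf{s}_{1:n}(\epsilon)$ to every sign sequence, and verifies that this map is a bijection of $\{\pm1\}^n$ onto itself, as Lemma 13.9 asserts.

```python
import itertools
import random

n = 4  # depth of the tree

# A {+1,-1}-valued tree of depth n: node s_t depends on the prefix (eps_1,...,eps_{t-1}).
# We store it as a dict mapping each prefix (a tuple) to a random sign.
tree = {prefix: random.choice([-1, 1])
        for t in range(n)
        for prefix in itertools.product([-1, 1], repeat=t)}

def rotate(eps):
    """Map eps = (eps_1,...,eps_n) to (eps_1*s_1, eps_2*s_2(eps_1), ..., eps_n*s_n(eps_{1:n-1}))."""
    return tuple(eps[t] * tree[eps[:t]] for t in range(n))

patterns = {rotate(eps) for eps in itertools.product([-1, 1], repeat=n)}
# The rotation produces every sign pattern exactly once, i.e. it is a bijection of {+1,-1}^n.
assert len(patterns) == 2 ** n
print("all", 2 ** n, "sign patterns recovered")
```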


Proof of Lemma 13.1. Fix any $\mathcal{X}$-valued tree $\mathbf{x}$ and any $\{0,1\}$-valued tree $\mathbf{y}$, both of depth $n$. Using $I\{a\neq b\} = a(1-2b)+b$ for $a,b\in\{0,1\}$,
$$ \mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^n \epsilon_t\, I\{f(x_t(\epsilon))\neq y_t(\epsilon)\} = \mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^n \epsilon_t\,(1-2y_t(\epsilon))\,f(x_t(\epsilon)) + \mathbb{E}\frac{1}{n}\sum_{t=1}^n \epsilon_t\, y_t(\epsilon) \qquad (13.8) $$
and the second term disappears under the expectation, thanks to the fact that $y_t(\epsilon)$ depends only on $\epsilon_{1:t-1}$ and is therefore independent of $\epsilon_t$. Using Lemma 13.10 with the $\{\pm 1\}$-valued tree $\mathbf{s} = 1-2\mathbf{y}$, the expression in (13.8) is equal to

$$ \mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^n \epsilon_t f(x'_t(\epsilon)) $$
for another tree $\mathbf{x}'$ specified in Lemma 13.10. We conclude that

$$ \mathbb{E}\sup_{f\in\mathcal{F}} T_{\ell(f)} \;\leq\; \sup_{\mathbf{x}'}\,\mathbb{E}\sup_{f\in\mathcal{F}} T_f. $$
Since this holds for all $\mathbf{x},\mathbf{y}$, the statement follows.

13.5 The End of the Story

We finish this section with a lower bound on the value $\mathcal{V}^{seq}(\mathcal{F},n)$ for prediction of binary sequences with indicator loss. Together with Lemma 13.1, it implies that the value is characterized by the expected supremum of the tree process for the worst-case tree, up to a constant factor of 2.

Theorem 13.11. Let $\mathcal{Y} = \{0,1\}$, $\mathcal{F}\subseteq\mathcal{Y}^{\mathcal{X}}$, and $\ell(f,(x,y)) = I\{f(x)\neq y\}$. Then the value of sequential prediction defined in Eq. (5.18) satisfies

$$ \mathcal{R}^{seq}(\mathcal{F}) \;\leq\; \mathcal{V}^{seq}(\mathcal{F},n) \;\leq\; 2\,\mathcal{R}^{seq}(\mathcal{F}). $$
In particular, $\mathcal{V}^{seq}(\mathcal{F},n)\to 0$ if and only if $\mathrm{ldim}(\mathcal{F}) < \infty$.

Proof. The upper bound follows from Corollary 7.16 and Lemma 13.1. We now prove the lower bound. Fix any $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $n$. Consider the following strategy for Nature, already employed in Section 8.4. At round one, $x_1$ is presented to the learner, who predicts $\widehat{y}_1$, while Nature flips a coin $y'_1\in\{-1,1\}$ and presents

$y_1 = (y'_1+1)/2 \in \{0,1\}$ to the learner. At round $t$, Nature presents $x_t(y'_1,\ldots,y'_{t-1})$, the learner chooses $\widehat{y}_t$, and Nature flips a coin $y'_t$ and its $\{0,1\}$ version is presented to the learner. It is easy to see that the expected loss of the player is

$$ \mathbb{E}\left\{\frac{1}{n}\sum_{t=1}^n I\{\widehat{y}_t\neq y_t\}\right\} = \frac{1}{2}. $$
Thus, expected regret is

$$ \mathbb{E}\left\{\frac{1}{n}\sum_{t=1}^n I\{\widehat{y}_t\neq y_t\} - \inf_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^n I\{f(x_t(y'_{1:t-1}))\neq y_t\}\right\} = \frac{1}{2}\,\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^n \bigl(1 - 2\, I\{f(x_t(y'_{1:t-1}))\neq y_t\}\bigr) $$
$$ = \frac{1}{2}\,\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^n y'_t\bigl(2f(x_t(y'_{1:t-1})) - 1\bigr). $$
The proof of the lower bound is completed by observing that the $-1$ term disappears under the expectation and that $\mathbf{x}$ was chosen arbitrarily. The fact that the lower bound does not converge to zero if the Littlestone dimension is infinite follows analogously to Theorem 8.2 by the definition of shattering.

14 Sequential Prediction: Real-Valued Functions

14.1 Covering Numbers

The development for sequential prediction with classes of real-valued functions parallels that for statistical learning. Recall that the notion of a projection $\mathcal{F}|_{x_1,\ldots,x_n}$ coincided with the notion of a zero cover for tuples $x_1,\ldots,x_n$ of data. In the previous lecture we also argued that for trees, the size of the projection $\mathcal{F}|_{\mathbf{x}}$ is not the same as the appropriately defined zero cover, and that the latter is a better notion for controlling the supremum of the tree process. It is then natural to extend the definition of the zero cover (Definition 13.2) to the case of real-valued functions on trees. This can be done as follows.

Definition 14.1. A set $V$ of $\mathbb{R}$-valued trees is an $\alpha$-cover (with respect to $\ell_p$) of $\mathcal{F}\subseteq[-1,1]^{\mathcal{X}}$ on an $\mathcal{X}$-valued tree $\mathbf{x}$ if
$$ \forall f\in\mathcal{F},\ \forall\epsilon\in\{\pm1\}^n,\ \exists\mathbf{v}\in V \text{ s.t. } \left(\frac{1}{n}\sum_{t=1}^n |f(x_t(\epsilon)) - v_t(\epsilon)|^p\right)^{1/p} \leq \alpha. \qquad (14.1) $$
An $\alpha$-covering number is defined as
$$ \mathcal{N}_p(\mathcal{F},\mathbf{x},\alpha) = \min\bigl\{\mathrm{card}(V) : V \text{ is an }\alpha\text{-cover}\bigr\}. $$
A set $V$ of $\mathbb{R}$-valued trees is an $\alpha$-cover (with respect to $\ell_\infty$) of $\mathcal{F}\subseteq[-1,1]^{\mathcal{X}}$ on an $\mathcal{X}$-valued tree $\mathbf{x}$ if

$$ \forall f\in\mathcal{F},\ \forall\epsilon\in\{\pm1\}^n,\ \exists\mathbf{v}\in V \text{ s.t. } |f(x_t(\epsilon)) - v_t(\epsilon)| \leq \alpha \quad \forall t\in\{1,\ldots,n\}, \qquad (14.2) $$
and $\mathcal{N}_\infty(\mathcal{F},\alpha,\mathbf{x})$ is the minimal size of such an $\alpha$-cover.


Once again, the order of quantifiers is crucial. For any function, the tree $\mathbf{v}$ providing the cover can depend on the particular path $\epsilon$ on which we seek to approximate the values of the function. As in the i.i.d. case, we can now squint our eyes and view the class $\mathcal{F}$ at the fixed granularity $\alpha$, similarly to Proposition 12.3:

Proposition 14.2. For any $\mathbf{x}$, sequential Rademacher averages of a function class $\mathcal{F}\subseteq[-1,1]^{\mathcal{X}}$ satisfy
$$ \mathcal{R}^{seq}(\mathcal{F};\mathbf{x}) \leq \inf_{\alpha\geq 0}\left\{\alpha + \sqrt{\frac{2\log\mathcal{N}_1(\mathcal{F},\alpha,\mathbf{x})}{n}}\right\}. $$

Proof. Let $V$ be a minimal $\alpha$-cover of $\mathcal{F}$ on $\mathbf{x}$ with respect to the $\ell_1$-norm. For $f\in\mathcal{F}$ and $\epsilon\in\{\pm1\}^n$, denote by $\mathbf{v}[f,\epsilon]\in V$ any element that "$\alpha$-covers" $f$ in the sense given by the definition. Observe that, in contrast to the proof of Proposition 12.3, the element $\mathbf{v}$ depends not only on $f$ but also on the path $\epsilon$, and this is crucial. Since $\mathbf{v}$ is a tree, $\mathbf{v}[f,\epsilon]_t$ is a function $\{\pm1\}^{t-1}\to\mathbb{R}$. So, the value $\mathbf{v}[f,\epsilon]_t(\epsilon)$, while intimidating to parse, is the value at the $t$-th level on the path $\epsilon$ of that tree in $V$ which "provides the cover" to the function $f$ on the path $\epsilon$. We have
$$ \mathcal{R}^{seq}(\mathcal{F};\mathbf{x}) = \mathbb{E}\left\{\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^n \epsilon_t f(x_t(\epsilon))\right\} = \mathbb{E}\left\{\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^n \epsilon_t\bigl(f(x_t(\epsilon)) - \mathbf{v}[f,\epsilon]_t(\epsilon)\bigr) + \epsilon_t\,\mathbf{v}[f,\epsilon]_t(\epsilon)\right\} $$
$$ \leq \mathbb{E}\left\{\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^n \epsilon_t\bigl(f(x_t(\epsilon)) - \mathbf{v}[f,\epsilon]_t(\epsilon)\bigr)\right\} + \mathbb{E}\left\{\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^n \epsilon_t\,\mathbf{v}[f,\epsilon]_t(\epsilon)\right\} $$
$$ \leq \mathbb{E}\left\{\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^n \bigl|f(x_t(\epsilon)) - \mathbf{v}[f,\epsilon]_t(\epsilon)\bigr|\right\} + \mathbb{E}\left\{\max_{\mathbf{v}\in V}\frac{1}{n}\sum_{t=1}^n \epsilon_t v_t(\epsilon)\right\}. $$
By the definition of an $\alpha$-cover, the first term is upper bounded by $\alpha$. The second term is precisely controlled by the maximal inequality of Lemma 9.5, which gives an upper bound of
$$ \sqrt{\frac{2\log|V|}{n}}, $$
due to the fact that $\sum_{t=1}^n v_t(\epsilon)^2 \leq n$ for any $\mathbf{v}\in V$ and $\epsilon\in\{\pm1\}^n$. Note that if any value of the tree $\mathbf{v}$ is outside the interval $[-1,1]$, it can be truncated. Since $\alpha>0$ is arbitrary, the statement follows.


14.2 Chaining with Trees

As in the i.i.d. case, Proposition 14.2 does not always give the correct rates because it only considers covering numbers at one resolution. Thankfully, an analogue of Theorem 12.4 holds. The proof is messier than in the i.i.d. case, as we need to deal with trees and paths rather than a single $n$-tuple of data. We provide the proof for completeness, but suggest skipping it unless one cannot overcome the curiosity.

Theorem 14.3. For any $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $n$, the sequential Rademacher averages of a function class $\mathcal{F}\subseteq[-1,1]^{\mathcal{X}}$ over $\mathbf{x}$ satisfy
$$ \mathcal{R}^{seq}(\mathcal{F};\mathbf{x}) \leq \mathcal{D}^{seq}(\mathcal{F};\mathbf{x}) $$
where
$$ \mathcal{D}^{seq}(\mathcal{F};\mathbf{x}) \triangleq \inf_{\alpha}\left\{4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^1 \sqrt{\log\mathcal{N}_2(\mathcal{F},\mathbf{x},\delta)}\,d\delta\right\}. $$

Proof. Define $\alpha_0 = 1$ and $\alpha_j = 2^{-j}$. For a fixed tree $\mathbf{x}$ of depth $n$, let $V_j$ be an $\ell_2$-cover at scale $\alpha_j$. For any path $\epsilon\in\{\pm1\}^n$ and any $f\in\mathcal{F}$, let $\mathbf{v}[f,\epsilon]^j\in V_j$ be the element of the cover such that

$$ \left(\frac{1}{n}\sum_{t=1}^n \bigl|f(x_t(\epsilon)) - \mathbf{v}[f,\epsilon]^j_t(\epsilon)\bigr|^2\right)^{1/2} \leq \alpha_j. $$
As before, $\mathbf{v}[f,\epsilon]^j_t$ denotes the $t$-th mapping of the tree $\mathbf{v}[f,\epsilon]^j$. For brevity, let us define the $n$-dimensional vectors

$$ \mathbf{f}(\epsilon) = \bigl(f(x_1(\epsilon)),\ldots,f(x_n(\epsilon))\bigr) \quad\text{and}\quad \mathbf{v}[f,\epsilon]^j = \bigl(\mathbf{v}[f,\epsilon]^j_1(\epsilon),\ldots,\mathbf{v}[f,\epsilon]^j_n(\epsilon)\bigr). $$
With this notation,

$$ \mathbf{f}(\epsilon) = \mathbf{f}(\epsilon) - \mathbf{v}[f,\epsilon]^N + \sum_{j=1}^N\bigl(\mathbf{v}[f,\epsilon]^j - \mathbf{v}[f,\epsilon]^{j-1}\bigr) $$
where $\mathbf{v}[f,\epsilon]^0 = 0$. Further,
$$ \mathbb{E}_\epsilon\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(x_t(\epsilon)) = \mathbb{E}\sup_{f\in\mathcal{F}}\langle\epsilon,\mathbf{f}(\epsilon)\rangle \qquad (14.3) $$

which can be decomposed as in the proof of Theorem 12.4:

$$ \mathbb{E}\sup_{f\in\mathcal{F}}\left\{\langle\epsilon,\mathbf{f}(\epsilon) - \mathbf{v}[f,\epsilon]^N\rangle + \sum_{j=1}^N\langle\epsilon,\mathbf{v}[f,\epsilon]^j - \mathbf{v}[f,\epsilon]^{j-1}\rangle\right\} \leq \mathbb{E}\sup_{f\in\mathcal{F}}\langle\epsilon,\mathbf{f}(\epsilon) - \mathbf{v}[f,\epsilon]^N\rangle + \sum_{j=1}^N\mathbb{E}\sup_{f\in\mathcal{F}}\langle\epsilon,\mathbf{v}[f,\epsilon]^j - \mathbf{v}[f,\epsilon]^{j-1}\rangle \qquad (14.4) $$
Unlike Theorem 12.4, the vectors $\mathbf{v}[f,\epsilon]^j$ and $\mathbf{f}(\epsilon)$ depend on $\epsilon$. The first term above can be bounded via the Cauchy-Schwarz inequality exactly as in Eq. (12.4):

$$ \mathbb{E}\sup_{f\in\mathcal{F}}\langle\epsilon,\mathbf{f}(\epsilon) - \mathbf{v}[f,\epsilon]^N\rangle \leq n\alpha_N. $$
The second term in (14.4) is bounded by considering successive refinements of the cover. The argument, however, is more delicate than in Theorem 12.4, as the trees $\mathbf{v}[f,\epsilon]^j$, $\mathbf{v}[f,\epsilon]^{j-1}$ depend on the particular path. Consider all possible pairs of $\mathbf{v}^s\in V_j$ and $\mathbf{v}^r\in V_{j-1}$, for $1\leq s\leq\mathrm{card}(V_j)$, $1\leq r\leq\mathrm{card}(V_{j-1})$, where we assumed an arbitrary enumeration of elements. For each pair $(\mathbf{v}^s,\mathbf{v}^r)$, define a real-valued tree $\mathbf{w}^{(s,r)}$ by
$$ w^{(s,r)}_t(\epsilon) = \begin{cases} v^s_t(\epsilon) - v^r_t(\epsilon) & \text{if there exists } f\in\mathcal{F} \text{ s.t. } \mathbf{v}^s = \mathbf{v}[f,\epsilon]^j,\ \mathbf{v}^r = \mathbf{v}[f,\epsilon]^{j-1} \\ 0 & \text{otherwise} \end{cases} $$
for all $t\in[n]$ and $\epsilon\in\{\pm1\}^n$. It is crucial that $\mathbf{w}^{(s,r)}$ can be non-zero only on those paths $\epsilon$ for which $\mathbf{v}^s$ and $\mathbf{v}^r$ are indeed the members of the covers (at successive resolutions) close to $\mathbf{f}(\mathbf{x}(\epsilon))$ (in the $\ell_2$ sense) for some $f\in\mathcal{F}$. It is easy to see that $\mathbf{w}^{(s,r)}$ is well-defined. Let the set of trees $W_j$ be defined as

$$ W_j = \bigl\{\mathbf{w}^{(s,r)} : 1\leq s\leq\mathrm{card}(V_j),\ 1\leq r\leq\mathrm{card}(V_{j-1})\bigr\}. $$
Now, the second term in (14.4) is upper bounded as

$$ \sum_{j=1}^N\mathbb{E}\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t\bigl(\mathbf{v}[f,\epsilon]^j_t(\epsilon) - \mathbf{v}[f,\epsilon]^{j-1}_t(\epsilon)\bigr) \leq \sum_{j=1}^N\mathbb{E}\max_{\mathbf{w}\in W_j}\sum_{t=1}^n \epsilon_t w_t(\epsilon). $$
The last inequality holds because for any $j\in[N]$, $\epsilon\in\{\pm1\}^n$ and $f\in\mathcal{F}$ there is some $\mathbf{w}^{(s,r)}\in W_j$ with $\mathbf{v}[f,\epsilon]^j = \mathbf{v}^s$, $\mathbf{v}[f,\epsilon]^{j-1} = \mathbf{v}^r$ and
$$ v^s_t(\epsilon) - v^r_t(\epsilon) = w^{(s,r)}_t(\epsilon) \quad \forall t\leq n. $$

Clearly, $\mathrm{card}(W_j) \leq \mathrm{card}(V_j)\cdot\mathrm{card}(V_{j-1})$. To invoke Lemma 9.5, it remains to bound the magnitude of all $\mathbf{w}^{(s,r)}\in W_j$ along all paths. For this purpose, fix $\mathbf{w}^{(s,r)}$ and a path $\epsilon$. If there exists $f\in\mathcal{F}$ for which $\mathbf{v}^s = \mathbf{v}[f,\epsilon]^j$ and $\mathbf{v}^r = \mathbf{v}[f,\epsilon]^{j-1}$, then $w^{(s,r)}_t(\epsilon) = \mathbf{v}[f,\epsilon]^j_t - \mathbf{v}[f,\epsilon]^{j-1}_t$ for any $t\in[n]$. By the triangle inequality,
$$ \left(\sum_{t=1}^n w^{(s,r)}_t(\epsilon)^2\right)^{1/2} \leq \left(\sum_{t=1}^n\bigl(\mathbf{v}[f,\epsilon]^j_t(\epsilon) - f(x_t(\epsilon))\bigr)^2\right)^{1/2} + \left(\sum_{t=1}^n\bigl(\mathbf{v}[f,\epsilon]^{j-1}_t(\epsilon) - f(x_t(\epsilon))\bigr)^2\right)^{1/2} \leq \sqrt{n}(\alpha_j + \alpha_{j-1}), $$
and the latter quantity is equal to $3\sqrt{n}\alpha_j$. If there exists no such $f\in\mathcal{F}$ for the given $\epsilon$ and $(s,r)$, then $w^{(s,r)}_t(\epsilon)$ is zero for all $t\geq t_o$, for some $1\leq t_o< n$, and thus
$$ \sum_{t=1}^n w^{(s,r)}_t(\epsilon)^2 \leq \sum_{t=1}^n w^{(s,r)}_t(\epsilon')^2 $$
for any other path $\epsilon'$ which agrees with $\epsilon$ up to $t_o$. Hence, the bound of $3\sqrt{n}\alpha_j$ holds for all $\epsilon\in\{\pm1\}^n$ and all $\mathbf{w}^{(s,r)}\in W_j$. Now, back to (14.4), we put everything together and apply Lemma 9.5:

$$ \frac{1}{n}\,\mathbb{E}\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(x_t(\epsilon)) \leq \alpha_N + \frac{1}{\sqrt{n}}\sum_{j=1}^N 3\alpha_j\sqrt{2\log\bigl(\mathrm{card}(V_j)\cdot\mathrm{card}(V_{j-1})\bigr)} $$
and passing to the integral expression is exactly as in the proof of Theorem 12.4.

14.3 Combinatorial Parameters

We continue to mirror the development for statistical learning, and define scale-sensitive dimensions relevant to sequential prediction. The following definition is the analogue of Definition 12.6.

Definition 14.4. An $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $n$ is $\alpha$-shattered by a function class $\mathcal{F}\subseteq\mathbb{R}^{\mathcal{X}}$ if there exists an $\mathbb{R}$-valued tree $\mathbf{s}$ of depth $n$ such that
$$ \forall\epsilon\in\{\pm1\}^n,\ \exists f\in\mathcal{F} \text{ s.t. } \epsilon_t\bigl(f(x_t(\epsilon)) - s_t(\epsilon)\bigr) > \alpha/2 \quad \forall t\in\{1,\ldots,n\}. $$
The tree $\mathbf{s}$ is called the witness to shattering. The (sequential) fat-shattering dimension $\mathrm{fat}(\mathcal{F},\alpha)$ at scale $\alpha$ is the largest $n$ such that $\mathcal{F}$ $\alpha$-shatters an $\mathcal{X}$-valued tree of depth $n$. We write $\mathrm{fat}(\mathcal{F},\alpha,\mathbf{x})$ for the fat-shattering dimension over $\mathrm{Img}(\mathbf{x})$-valued trees.


We will also denote the fat-shattering dimension by $\mathrm{fat}_\alpha(\mathcal{F})$. The definition can be seen as a natural scale-sensitive extension of Definition 13.4 to real-valued functions. We now show that the covering numbers are bounded in terms of the fat-shattering dimension. The following result is an analogue of the $\ell_\infty$ bound of Theorem 12.7. Somewhat surprisingly, the extra logarithmic term of that bound, which is so difficult to remove, is not present here from the outset.

Theorem 14.5. Suppose $\mathcal{F}$ is a class of $[-1,1]$-valued functions on $\mathcal{X}$. Then for any $\alpha>0$, any $n>0$, and any $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $n$,
$$ \mathcal{N}_\infty(\alpha,\mathcal{F},\mathbf{x}) \leq \left(\frac{2en}{\alpha}\right)^{\mathrm{fat}(\mathcal{F},\alpha)}. \qquad (14.5) $$
The proof of this theorem relies on the following extension of Theorem 13.7 to multi-class prediction. Once again, this bound comes about quite naturally when dealing with trees, while the analogous bound for the i.i.d. case is not known. One can compare the bound below to the multi-class bound obtained in [2, Theorem 3.3] and references therein.

Theorem 14.6. Let $\mathcal{F}\subseteq\{0,\ldots,k\}^{\mathcal{X}}$ be a class of functions with $\mathrm{fat}_2(\mathcal{F}) = d$. Then
$$ \mathcal{N}_\infty(1/2,\mathcal{F},n) \leq \sum_{i=0}^d \binom{n}{i} k^i. $$

Proof. For any $d\geq 0$ and $n\geq 0$, define the function $g_k(d,n) = \sum_{i=0}^d\binom{n}{i}k^i$, and observe that the function used for binary classification ($k=1$) in Theorem 13.7 is exactly $g_1(d,n)$. The function $g_k$ satisfies the recurrence

$$ g_k(d,n) = g_k(d,n-1) + k\,g_k(d-1,n-1) $$
for all $d,n\geq 1$. To visualize this recursion, consider a $k\times n$ matrix and ask for ways to choose at most $d$ columns followed by a choice among the $k$ rows for each chosen column. The task can be decomposed into (a) making the $d$ column choices out of the first $n-1$ columns, followed by picking rows (there are $g_k(d,n-1)$ ways to do it), or (b) choosing $d-1$ columns (followed by row choices) out of the first $n-1$ columns and choosing a row for the $n$th column (there are $k\,g_k(d-1,n-1)$ ways to do it). This gives the recursive formula. In what follows, we shall refer to an $\ell_\infty$ cover at scale $1/2$ simply as a $1/2$-cover. The theorem claims that the size of a minimal $1/2$-cover is at most $g_k(d,n)$. The proof proceeds by induction on $n+d$.

Base: For $d=1$ and $n=1$, there is only one node in the tree, i.e. the tree is defined by a constant $x_1\in\mathcal{X}$. Functions in $\mathcal{F}$ can take up to $k+1$ values on $x_1$, i.e. $\mathcal{N}(0,\mathcal{F},1)\leq k+1$ (and, thus, the same holds for the $1/2$-cover). Using the convention $\binom{n}{0}=1$, we indeed verify that $g_k(1,1) = 1+k = k+1$. The same calculation gives the base case for $n=1$ and any $d\in\mathbb{N}$. Furthermore, for any $n\in\mathbb{N}$, if $d=0$, then there is no point which is $2$-shattered by $\mathcal{F}$. This means that functions in $\mathcal{F}$ differ by at most $1$ on any point of $\mathcal{X}$. Thus, there is a $1/2$-cover of size $1 = g_k(0,n)$, verifying this base case.

Induction step: Suppose by way of induction that the statement holds for $(d,n-1)$ and $(d-1,n-1)$. Consider any tree $\mathbf{x}$ of depth $n$ with $\mathrm{fat}_2(\mathcal{F},\mathbf{x}) = d$. Define the partition $\mathcal{F} = \mathcal{F}_0\cup\ldots\cup\mathcal{F}_k$ with $\mathcal{F}_i = \{f\in\mathcal{F} : f(x_1) = i\}$ for $i\in\{0,\ldots,k\}$, where $x_1$ is the root of $\mathbf{x}$. Let $m = |\{i : \mathrm{fat}_2(\mathcal{F}_i,\mathbf{x}) = d\}|$.

Suppose first, for the sake of contradiction, that $\mathrm{fat}_2(\mathcal{F}_i,\mathbf{x}) = \mathrm{fat}_2(\mathcal{F}_j,\mathbf{x}) = d$ for $|i-j|\geq 2$. Then there exist two trees $\mathbf{z}$ and $\mathbf{w}$ of depth $d$ which are $2$-shattered by $\mathcal{F}_i$ and $\mathcal{F}_j$, respectively, and with $\mathrm{Img}(\mathbf{z}),\mathrm{Img}(\mathbf{w})\subseteq\mathrm{Img}(\mathbf{x})$. Since functions within each subset $\mathcal{F}_i$ take on the same value on $x_1$, we conclude that $x_1\notin\mathrm{Img}(\mathbf{z})$ and $x_1\notin\mathrm{Img}(\mathbf{w})$. This follows immediately from the definition of shattering. We now join the two shattered trees $\mathbf{z}$ and $\mathbf{w}$ with $x_1$ at the root and observe that $\mathcal{F}_i\cup\mathcal{F}_j$ $2$-shatters this resulting tree of depth $d+1$, which is a contradiction. Indeed, the witness $\mathbb{R}$-valued tree $\mathbf{s}$ is constructed by joining the two witnesses for the $2$-shattered trees $\mathbf{z}$ and $\mathbf{w}$ and by defining the root as $s_1 = (i+j)/2$. It is easy to see that $\mathbf{s}$ is a witness to the shattering. We conclude that the number of subsets of $\mathcal{F}$ with fat-shattering dimension equal to $d$ cannot be more than two (for otherwise at least two indices would be separated by 2 or more). We have three cases: $m=0$, $m=1$, or $m=2$, and in the last case it must be that the indices of the two subsets differ by 1.

First, consider any $\mathcal{F}_i$ with $\mathrm{fat}_2(\mathcal{F}_i,\mathbf{x})\leq d-1$, $i\in\{0,\ldots,k\}$. By induction, there are $1/2$-covers $V^\ell$ and $V^r$ of $\mathcal{F}_i$ on the subtrees $\mathbf{x}^\ell$ and $\mathbf{x}^r$, respectively, both of size at most $g_k(d-1,n-1)$. Just as in the proof of Theorem 13.7, out of these $1/2$-covers we can create a $1/2$-cover $V$ for $\mathcal{F}_i$ on $\mathbf{x}$ by pairing the $1/2$-covers in $V^\ell$ and $V^r$ with $i$ at the root. The resulting cover of $\mathcal{F}_i$ is of size at most $g_k(d-1,n-1)$. Such a construction holds for any $i\in\{0,\ldots,k\}$ with $\mathrm{fat}_2(\mathcal{F}_i,\mathbf{x})\leq d-1$. Therefore, the total size of a $1/2$-cover for the union $\cup_{i:\mathrm{fat}_2(\mathcal{F}_i,\mathbf{x})\leq d-1}\mathcal{F}_i$ is at most $(k+1-m)\,g_k(d-1,n-1)$. If $m=0$, the induction step is proven because $g_k(d-1,n-1)\leq g_k(d,n-1)$ and


so the total size of the constructed cover is at most

$$ (k+1)\,g_k(d-1,n-1) \leq g_k(d,n-1) + k\,g_k(d-1,n-1) = g_k(d,n). $$
Now, consider the case $m=1$ and let $\mathrm{fat}_2(\mathcal{F}_i,\mathbf{x}) = d$. An argument exactly as above yields a $1/2$-cover for $\mathcal{F}_i$, and this cover is of size at most $g_k(d,n-1)$ by induction. The total $1/2$-cover is therefore of size at most

$$ g_k(d,n-1) + k\,g_k(d-1,n-1) = g_k(d,n). $$

Lastly, for $m=2$, suppose $\mathrm{fat}_2(\mathcal{F}_i,\mathbf{x}) = \mathrm{fat}_2(\mathcal{F}_j,\mathbf{x}) = d$ for $|i-j|=1$. Let $\mathcal{F}' = \mathcal{F}_i\cup\mathcal{F}_j$. Note that $\mathrm{fat}_2(\mathcal{F}',\mathbf{x}) = d$. Just as before, the $1/2$-cover for $\mathbf{x}$ can be constructed by considering the $1/2$-covers for the two subtrees. However, when joining any $(\mathbf{v}^\ell,\mathbf{v}^r)$, we take $(i+j)/2$ as the root. It is straightforward to check that the resulting cover is indeed a $1/2$-cover, thanks to the relation $|i-j|=1$. The size of the constructed cover is, by induction, at most $g_k(d,n-1)$, and the induction step follows. This concludes the induction proof, yielding the main statement of the theorem.

To see that the growth of $g_k(d,n)$ is polynomial for $n\geq d$, observe that
$$ \sum_{i=0}^d\binom{n}{i}k^i \leq \left(\frac{kn}{d}\right)^d\sum_{i=0}^d\binom{n}{i}\left(\frac{d}{n}\right)^i \leq \left(\frac{kn}{d}\right)^d\left(1+\frac{d}{n}\right)^n \leq \left(\frac{ekn}{d}\right)^d, $$
and it always holds that this sum is at most $(ekn)^d$. With the help of Theorem 14.6 we can prove Theorem 14.5. The idea is that when we discretize the values of the functions in $\mathcal{F}$ to the level $\alpha$, we may view the resulting function as taking on values in $\{0,\ldots,k\}$. Theorem 14.6 then tells us that the size of the cover is upper bounded by $(ekn)^d$, and this bound translates immediately into the bound for the real-valued class $\mathcal{F}$.
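As a quick numerical sanity check of the recurrence for $g_k(d,n)$ (not part of the original notes; the function names are ours), the following Python sketch compares the recursive definition against the closed form $\sum_{i=0}^d\binom{n}{i}k^i$ for a range of parameters.

```python
from math import comb
from functools import lru_cache

def g_closed(k, d, n):
    """Closed form g_k(d, n) = sum_{i=0}^{d} C(n, i) * k^i."""
    return sum(comb(n, i) * k ** i for i in range(d + 1))

@lru_cache(maxsize=None)
def g_rec(k, d, n):
    """The same quantity via the recurrence g_k(d,n) = g_k(d,n-1) + k*g_k(d-1,n-1)."""
    if d == 0 or n == 0:
        return 1  # base cases: a single pattern
    return g_rec(k, d, n - 1) + k * g_rec(k, d - 1, n - 1)

# the recurrence and the closed form agree
for k in range(1, 4):
    for d in range(0, 5):
        for n in range(0, 8):
            assert g_rec(k, d, n) == g_closed(k, d, n)
```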

Proof of Theorem 14.5. For any $\alpha>0$ define an $\alpha$-discretization of the $[-1,1]$ interval as $B_\alpha = \{-1+\alpha/2,\, -1+3\alpha/2,\ldots,\, -1+(2k+1)\alpha/2,\ldots\}$ for $0\leq k$ and $(2k+1)\alpha\leq 4$. Also, for any $a\in[-1,1]$ define $\lfloor a\rfloor_\alpha = \mathrm{argmin}_{r\in B_\alpha}|r-a|$, with ties broken by choosing the smaller discretization point. For a function $f:\mathcal{X}\to[-1,1]$, let the function $\lfloor f\rfloor_\alpha$ be defined pointwise as $\lfloor f(x)\rfloor_\alpha$, and let $\lfloor\mathcal{F}\rfloor_\alpha = \{\lfloor f\rfloor_\alpha : f\in\mathcal{F}\}$.

[Figure 14.1: A $[-1,1]$-valued function $f$ is discretized to the level $\alpha$. The resulting discretized function $\lfloor f\rfloor_\alpha$ takes on at most $\lfloor 2/\alpha\rfloor+1$ values. Any cover of the discretized class $\lfloor\mathcal{F}\rfloor_\alpha$ at scale $\alpha/2$ is also a cover of $\mathcal{F}$ at scale $\alpha$.]

First, we prove that $\mathcal{N}_\infty(\alpha,\mathcal{F},\mathbf{x})\leq\mathcal{N}_\infty(\alpha/2,\lfloor\mathcal{F}\rfloor_\alpha,\mathbf{x})$. Indeed, suppose the set of trees $V$ is a minimal $\alpha/2$-cover of $\lfloor\mathcal{F}\rfloor_\alpha$ on $\mathbf{x}$. That is,
$$ \forall f_\alpha\in\lfloor\mathcal{F}\rfloor_\alpha,\ \forall\epsilon\in\{\pm1\}^n,\ \exists\mathbf{v}\in V \text{ s.t. } |v_t(\epsilon) - f_\alpha(x_t(\epsilon))|\leq\alpha/2. $$
Pick any $f\in\mathcal{F}$ and let $f_\alpha = \lfloor f\rfloor_\alpha$. Then $\|f - f_\alpha\|_\infty\leq\alpha/2$. Then for all $\epsilon\in\{\pm1\}^n$ and any $t\in[n]$,
$$ |f(x_t(\epsilon)) - v_t(\epsilon)| \leq |f(x_t(\epsilon)) - f_\alpha(x_t(\epsilon))| + |f_\alpha(x_t(\epsilon)) - v_t(\epsilon)| \leq \alpha, $$
and so $V$ also provides an $\ell_\infty$ cover at scale $\alpha$.

We conclude that $\mathcal{N}_\infty(\alpha,\mathcal{F},\mathbf{x})\leq\mathcal{N}_\infty(\alpha/2,\lfloor\mathcal{F}\rfloor_\alpha,\mathbf{x}) = \mathcal{N}_\infty(1/2,G,\mathbf{x})$ where $G = \frac{1}{\alpha}\lfloor\mathcal{F}\rfloor_\alpha$. The functions of $G$ take on a discrete set of at most $\lfloor 2/\alpha\rfloor+1$ values. Obviously, by adding a constant to all the functions in $G$, we can make the set of values be $\{0,\ldots,\lfloor 2/\alpha\rfloor\}$. We now apply Theorem 14.6 with the upper bound $\sum_{i=0}^d\binom{n}{i}k^i\leq(ekn)^d$, which holds for any $n>0$. This yields $\mathcal{N}_\infty(1/2,G,\mathbf{x})\leq(2en/\alpha)^{\mathrm{fat}_2(G)}$.

It remains to prove $\mathrm{fat}_2(G)\leq\mathrm{fat}_\alpha(\mathcal{F})$, or, equivalently (by scaling), $\mathrm{fat}_{2\alpha}(\lfloor\mathcal{F}\rfloor_\alpha)\leq\mathrm{fat}_\alpha(\mathcal{F})$. To this end, suppose there exists an $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $d = \mathrm{fat}_{2\alpha}(\lfloor\mathcal{F}\rfloor_\alpha)$ such that there is a witness tree $\mathbf{s}$ with

$$ \forall\epsilon\in\{\pm1\}^d,\ \exists f_\alpha\in\lfloor\mathcal{F}\rfloor_\alpha \text{ s.t. } \forall t\in[d],\ \epsilon_t\bigl(f_\alpha(x_t(\epsilon)) - s_t(\epsilon)\bigr)\geq\alpha. $$

Using the fact that for any $f\in\mathcal{F}$ and $f_\alpha = \lfloor f\rfloor_\alpha$ we have $\|f - f_\alpha\|_\infty\leq\alpha/2$, it follows that
$$ \forall\epsilon\in\{\pm1\}^d,\ \exists f\in\mathcal{F} \text{ s.t. } \forall t\in[d],\ \epsilon_t\bigl(f(x_t(\epsilon)) - s_t(\epsilon)\bigr)\geq\alpha/2. $$
That is, $\mathbf{s}$ is a witness to $\alpha$-shattering by $\mathcal{F}$. Thus, for any $\mathbf{x}$,

$$ \mathcal{N}_\infty(\alpha,\mathcal{F},\mathbf{x}) \leq \mathcal{N}_\infty(\alpha/2,\lfloor\mathcal{F}\rfloor_\alpha,\mathbf{x}) \leq \left(\frac{2en}{\alpha}\right)^{\mathrm{fat}_{2\alpha}(\lfloor\mathcal{F}\rfloor_\alpha)} \leq \left(\frac{2en}{\alpha}\right)^{\mathrm{fat}_\alpha(\mathcal{F})}. $$

Plugging the estimates on the $\ell_\infty$ covering numbers into the integral of Theorem 14.3, we obtain the following analogue of Corollary 12.8.

Corollary 14.7. For $\mathcal{F}\subseteq[-1,1]^{\mathcal{X}}$ and any $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $n$,
$$ \mathcal{D}^{seq}(\mathcal{F};\mathbf{x}) \leq \inf_{\alpha\geq 0}\left\{4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^1\sqrt{\mathrm{fat}(\mathcal{F},\delta)\log\left(\frac{2en}{\delta}\right)}\,d\delta\right\} $$
where $\mathrm{fat}(\mathcal{F},\delta)$ is the sequential fat-shattering dimension.

Observe the unfortunate $\log n$ factor, which is unavoidable for $\ell_\infty$ covering numbers. Ideally, we would like to use $\ell_2$ covering numbers in conjunction with Theorem 14.3, but for that we need to prove an analogue of (12.10), which appears to be a rather difficult problem.

Prove the analogue of (12.10) for the $\ell_2$ tree covering numbers, and get an A+ in the course.

14.4 Contraction

Lemma 14.8. Fix any $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $n$. If $\phi:\mathbb{R}\to\mathbb{R}$ is $L$-Lipschitz (that is, $|\phi(a)-\phi(b)|\leq L|a-b|$ for all $a,b\in\mathbb{R}$), then
$$ \mathcal{R}^{seq}(\phi\circ\mathcal{F}) \leq L\,\mathcal{R}^{seq}(\mathcal{F})\times O(\log^{3/2}n). $$
Note that this upper bound falls short of the analogous result in Lemma 12.9 for two reasons: it holds only for $\sup_{\mathbf{x}}\mathcal{R}^{seq}(\mathcal{F};\mathbf{x})$ rather than $\mathcal{R}^{seq}(\mathcal{F};\mathbf{x})$, and there is an extra poly-logarithmic factor. The first shortcoming is not a big issue, but we believe that the extra factor should not be there.

Prove (4) without the extra poly-logarithmic factor, and get an A+ in the course.

As the next proposition shows, we can avoid the poly-logarithmic factor for the problem of supervised learning with a convex loss function (in place of φ) by passing directly from the value of the prediction problem to the sequential


Rademacher complexity of the base class $\mathcal{F}$. This means that we can indeed "erase" the loss function for the majority of the problems of interest (recall that this works for the indicator loss too, as shown in Lemma 13.1). However, it would be nice to prove a general result without the extra factor.

Proposition 14.9. Let $\mathcal{Y}\subset\mathbb{R}$, $\mathcal{F}\subseteq\mathcal{Y}^{\mathcal{X}}$, and assume $\ell(y_1,y_2)$ is convex and $L$-Lipschitz in $y_1$ for any $y_2$. Consider the (improper) supervised sequential prediction problem. Then
$$ \mathcal{V}^{seq}(\mathcal{F},n) \leq 2L\,\mathcal{R}^{seq}(\mathcal{F}). $$

Exercise (⋆⋆): Let $\mathcal{X}\subseteq\mathbb{R}$, $L>0$, and define a class $\mathcal{F}_L([a,b]) = \{f_\theta:\mathbb{R}\to[0,1]\mid\theta\in[a,b]\}$ where
$$ f_\theta(x) = \begin{cases} 1 & \text{if } x\leq\theta \\ 1 - L(x-\theta) & \text{if } \theta < x\leq\theta+1/L \\ 0 & \text{otherwise} \end{cases} $$
is a "ramp" function with slope $L$. Prove that $\mathcal{F}_L([-\infty,\infty])$ has infinite (sequential) fat-shattering dimension for any scale $\alpha<1/2$. Prove, however, that the sequential fat-shattering dimension of $\mathcal{F}_L([a,b])$, for any $a,b\in\mathbb{R}$, is finite. Conclude that the value of a supervised learning problem with ramp functions and Lipschitz loss converges to zero in both the sequential and the statistical learning frameworks.

14.5 Lower Bounds

Matching (up to a constant) lower bounds on the value of the game can be shown in several cases of interest: supervised learning and linear loss. These two cover a wide range of problems, including online convex optimization without additional curvature assumptions on the functions. For the supervised learning case, the lower bound of sequential Rademacher complexity is already shown in Theorem 13.11 for the binary prediction case, but exactly the same argument works for the real-valued prediction with absolute loss. We now prove a matching lower bound for the linear loss case. Suppose F is a convex subset of some separable Banach space, and Z is a convex subset of the dual space. Let z be any Z-valued tree of depth n. We now exhibit a lower bound on the value of the game, achieved by the adversary playing according to this tree. The proof is in the same spirit as the proof of Theorem 13.11 as well as the lower bounds in Section 8.4.


Lemma 14.10. Suppose $\mathcal{F}$ is a convex subset of a separable Banach space, and $\mathcal{Z}$ is a convex subset of the dual space, symmetric about the origin. For the linear game defined with $\ell(f,z) = \langle f,z\rangle$, we have
$$ \mathcal{R}^{seq}(\mathcal{F}) \leq \mathcal{V}^{seq}(\mathcal{F},n) \leq 2\,\mathcal{R}^{seq}(\mathcal{F}). $$

Proof. Given $\mathbf{z}$, the strategy for the adversary will be the following: at round 1, draw $\epsilon_1$, observe the learner's move $f_1$, and present $\epsilon_1 z_1$. On round $t$, the adversarial move is defined by $\epsilon_t z_t(\epsilon_{1:t-1})$. This gives the following lower bound on the value of the game:

$$ \mathcal{V}^{seq}(\mathcal{F},n) = \inf_{q_1}\sup_{z_1}\cdots\inf_{q_n}\sup_{z_n}\ \mathbb{E}_{\widehat{y}_t\sim q_t}\left\{\frac{1}{n}\sum_{t=1}^n\langle\widehat{y}_t,z_t\rangle - \inf_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^n\langle f,z_t\rangle\right\} $$
$$ \geq \inf_{q_1}\mathbb{E}_{\widehat{y}_1\sim q_1}\mathbb{E}_{\epsilon_1}\cdots\inf_{q_n}\mathbb{E}_{\widehat{y}_n\sim q_n}\mathbb{E}_{\epsilon_n}\left\{\frac{1}{n}\sum_{t=1}^n\langle\widehat{y}_t,\epsilon_t z_t(\epsilon)\rangle - \inf_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^n\langle f,\epsilon_t z_t(\epsilon)\rangle\right\} $$
$$ \geq \mathbb{E}_\epsilon\left\{-\inf_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^n\langle f,\epsilon_t z_t(\epsilon)\rangle\right\} = \mathcal{R}^{seq}(\mathcal{F};\mathbf{z}), $$
where the player's loss disappears because the conditional distribution of $\epsilon_t z_t(\epsilon)$ is zero mean. Since the lower bound holds for an arbitrary tree $\mathbf{z}$, the statement follows.

15 Examples: Complexity of Linear and Kernel Classes, Neural Networks

In the past five chapters, we have developed powerful theoretical tools, and it is now time to reap the benefits. Before proceeding, let us first recall the high-level picture. In Chapter 7, we had motivated the study of the suprema of several stochastic processes. We argued that a handle on the empirical or Rademacher process gives a finite sample guarantee on the performance of Empirical Risk Minimization, while the martingale and tree processes provide an upper bound on the minimax value of the sequential prediction problem. After this realization, we turned to the study of the suprema of stochastic processes. In Chapter 9 we showed some rather straightforward upper bounds on the suprema over finite index sets. The two chapters following that were devoted to extending the finite-class results to Rademacher processes indexed by infinite classes of functions in binary and in real-valued cases. In the following two chapters, we studied the suprema of the tree process indexed by infinite classes of, respectively, binary or real-valued functions. We defined various notions of complexity of the function class that capture the behavior of the suprema of these stochastic processes. We now proceed to calculate the upper bounds on the expected suprema of the Rademacher and tree processes. While we do not prove lower bounds, it can be shown that, for the examples presented here, the behavior of the suprema of the Rademacher and tree processes is almost identical. Hence, the guarantees for the statistical learning framework with i.i.d. data are the same as those for regret minimization in the setting of individual sequences.


15.1 Prediction with Linear Classes

In the supervised setting, we have pairs $(x,y)$ of predictor-response variables. Consider the square, absolute, and logistic loss functions $\ell(f(x),y)$ paired with linear functions $f(x) = \langle f,x\rangle$:
$$ \bigl(\langle f,x\rangle - y\bigr)^2, \qquad \bigl|\langle f,x\rangle - y\bigr|, \qquad \log\bigl(1+\exp\{-y\langle f,x\rangle\}\bigr). $$
For $\mathcal{X}$ and $\mathcal{F}$ being unit balls in dual norms, it is natural to assume that $\mathcal{Y} = [-1,1]$, since that is the range of the functions $f(x)$. Then the square loss is evaluated on the interval $[-2,2]$, and on this interval it is a $2$-Lipschitz function of $f(x)$ for any $y$. Clearly, the absolute loss is a $1$-Lipschitz function, and so is the logistic loss. For any such $L$-Lipschitz loss, Proposition 14.9 tells us that the value of the sequential prediction problem satisfies

$$ \mathcal{V}(\mathcal{F}) \leq 2L\,\mathcal{R}^{seq}(\mathcal{F}). $$
The same result holds for the setting of statistical learning in view of Lemma 12.9:

$$ \mathcal{V}^{iid}(\mathcal{F}) \leq 2L\,\mathcal{R}^{iid}(\mathcal{F}). $$
We can now use upper bounds on $\mathcal{R}^{seq}(\mathcal{F})$ and $\mathcal{R}^{iid}(\mathcal{F})$ for the examples of linear function classes considered in Chapter 10. Observe that for the square loss, the rate $4/\sqrt{n}$ guaranteed through the argument above seems to mismatch the rate $O(d/n)$ obtained in Section 4. In fact, each of these bounds works in its own regime: for $d>\sqrt{n}$ the former kicks in, while for small $d$ the $O(d/n)$ rate might be better. A more careful analysis, which goes beyond the scope of this course, allows one to interpolate between these two rates.

15.2 Kernel Methods

Linear predictors discussed above are attractive from both the algorithmic and the analysis perspectives. At first glance, however, linear predictors might appear to be limited in terms of their representativeness. Kernel methods aim to alleviate this very problem. The basic idea is to first map data points into a feature space (usually of higher, or even infinite, dimension) and to use linear predictors in that feature space. To illustrate the point, let us consider the two-dimensional example shown in Figure 15.1. The two classes are arranged in circles,


and are not linearly separable. In fact, any linear predictor will misclassify a constant proportion of the data points. However, if we used a circle as the decision boundary, the data would be perfectly separable. Such a decision rule is of the form $f(x) = \mathrm{sign}(x[1]^2 + x[2]^2 - r^2)$, and can be written as a linear form once we define the features appropriately. In particular, consider the feature space given by the map $x\mapsto\phi(x)$ where $\phi(x) = (x[1]^2, x[2]^2, x[1]\cdot x[2], x[1], x[2], 1)$. The classifier based on the circle $x[1]^2 + x[2]^2 - r^2$ can now be written as a half-space defined by the vector $f = (1,1,0,0,0,-r^2)\in\mathbb{R}^6$. This half-space in the feature space is specified by $\langle f,\phi(x)\rangle = x[1]^2 + x[2]^2 - r^2$, and it perfectly classifies the data. We see that by employing a mapping to a higher dimensional feature space one can use linear classifiers in the feature space and increase the expressive power of linear predictors.

[Figure 15.1: The problem becomes linearly separable once an appropriate feature mapping $\phi$ is chosen; in the mapped space the axes are $x[1]^2$ and $x[2]^2$.]

Specifically, for kernel methods we consider some input space $\mathcal{X}$ and an associated feature space mapping $\phi$ that maps inputs in $\mathcal{X}$ to a possibly infinite dimensional Hilbert space. We further assume that for every $x\in\mathcal{X}$, $\|\phi(x)\|\leq R$ for some finite $R$ (the norm is the norm of the Hilbert space). The function class $\mathcal{F}$ we consider consists of all elements of the Hilbert space such that $\|f\|\leq B$, and the predictors we consider are of the form $\langle f,\phi(x)\rangle$, as motivated by the above example. As mentioned earlier, the feature space is often taken to be infinite dimensional. In this case, one may wonder how to even use such linear functions on computers with finite memory. After all, we cannot store or process infinite sequences. Fortunately, there is no need to store the functions! The trick that makes this possible is commonly called the kernel trick. In a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$, there exists a function called the kernel function $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ such that $k(x_1,x_2) = \langle\phi(x_1),\phi(x_2)\rangle$. Given such a kernel function, the Representer


Theorem acts as a cornerstone for kernel methods. Roughly speaking, the Representer Theorem implies that any algorithm that only operates on the functions $f$ through inner products can do all the computations with the kernel function. This means the optimization objective for the algorithm can depend on terms like $\langle f,\phi(x_t)\rangle$ or the norm $\langle f,f\rangle$, and the solution $\widehat{y}_t\in\mathcal{H}$ in the RKHS at time $t$ chosen by the learning algorithm can be represented as a linear combination of feature vectors of the data points. That is, $\widehat{y}_t$ can be written as $\widehat{y}_t = \sum_{i=1}^{t-1}\alpha_i\phi(x_i)$. Viewed as a linear function in the feature space,

$$ \widehat{y}_t(x) = \langle\widehat{y}_t,\phi(x)\rangle = \sum_{i=1}^{t-1}\alpha_i\langle\phi(x_i),\phi(x)\rangle = \sum_{i=1}^{t-1}\alpha_i\,k(x_i,x). $$
Thanks to the Representer Theorem, the main take-home message is that one never needs to explicitly deal with the feature space but only with kernel functions evaluated on the data seen so far. As for this chapter, the main thing to note is that one can treat the Rademacher complexity of the kernel class as the Rademacher complexity of the linear class in the feature space with the Hilbert norm of $f$ bounded by

$B$ and the Hilbert norm of the data bounded by $R$. This is basically the same as the $\ell_2/\ell_2$ example in Chapter 10, and so one gets the bound
$$ \mathcal{R}^{iid}(\mathcal{F}) \leq \mathcal{R}^{seq}(\mathcal{F}) \leq \sqrt{\frac{R^2B^2}{n}}. $$

The above upper bound shows that one may go well beyond the linear classes in terms of expressiveness. The kernel methods are arguably the most useful practical methods.
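To illustrate the kernel trick computationally, here is a minimal Python sketch (not from the original notes): an online predictor that stores coefficients $\alpha_i$ and past points $x_i$ and predicts via $\sum_i\alpha_i k(x_i,x)$, never forming the feature map $\phi$ explicitly. The Gaussian kernel and the Perceptron-style update are our illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=1.0):
    """Gaussian (RBF) kernel k(x1, x2) = exp(-gamma * ||x1 - x2||^2)."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

class KernelizedPredictor:
    """Online predictor f_t(x) = sum_i alpha_i * k(x_i, x), stored as (alpha_i, x_i) pairs."""

    def __init__(self, kernel=rbf_kernel):
        self.kernel = kernel
        self.alphas = []   # coefficients alpha_i
        self.points = []   # stored data points x_i

    def predict(self, x):
        # representer form: only kernel evaluations, no explicit feature map
        return sum(a * self.kernel(xi, x) for a, xi in zip(self.alphas, self.points))

    def update(self, x, y):
        # Perceptron-style additive update: store the point with coefficient y on a mistake
        if np.sign(self.predict(x)) != y:
            self.alphas.append(y)
            self.points.append(x)

# toy usage: the circular data of Figure 15.1 becomes learnable with an RBF kernel
rng = np.random.default_rng(0)
learner = KernelizedPredictor()
for _ in range(200):
    x = rng.uniform(-1, 1, size=2)
    y = 1 if np.linalg.norm(x) < 0.5 else -1   # label by a circle of radius 0.5
    learner.update(x, y)
```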

15.3 Neural Networks

Neural networks can be viewed as a class of functions which are built in a hierarchical manner. Let $\mathcal{F}\subset[-1,1]^{\mathcal{X}}$ be a base function class (that is, a set of possible functions at the input layer of the neural network). In every node of every subsequent layer, we take a linear combination of outputs of the nodes from the previous layer, and this linear combination is subsequently passed through a Lipschitz transformation function $\sigma:\mathbb{R}\to\mathbb{R}$, producing the output of the node. Typically the transformation function is a sigmoid and performs a form of soft thresholding. The motivation for such a construction is to mimic neural activity. Each neuron

is excited by the input connections from other neurons, and its own output is a (modulated) function of the aggregate excitation. We now formally describe a multilayer neural network with some arbitrary base class $\mathcal{F}$. Consider any base function class $\mathcal{F}\subset[-1,1]^{\mathcal{X}}$, and assume (in order to simplify the proof) that $0\in\mathcal{F}$. We now recursively define $\mathcal{F}_i$ as: $\mathcal{F}_1 = \mathcal{F}$, and for each $2\leq i\leq k$,
$$ \mathcal{F}_i = \left\{x\mapsto\sum_j w_j\,\sigma\bigl(f_j(x)\bigr) : f_j\in\mathcal{F}_{i-1},\ \|w\|_1\leq R\right\} $$
where $\sigma$ is an $L$-Lipschitz transformation function such that $\sigma(0)=0$. The following lemma upper bounds the Rademacher complexity of the neural network class in terms of the Rademacher complexity of the base class.

Lemma 15.1. For any base function class $\mathcal{F}\subset[-1,1]^{\mathcal{X}}$ that contains the zero function,
$$ \mathcal{R}^{iid}(\mathcal{F}_k) \leq \mathcal{R}^{seq}(\mathcal{F}_k) \leq (RL)^{k-1}\,O\!\left(\log^{\frac{3}{2}(k-1)}(n)\right)\mathcal{R}^{seq}(\mathcal{F}). $$

Proof. We shall prove that for any $i\in\{2,\ldots,k\}$,
$$ \mathcal{R}^{seq}_n(\mathcal{F}_i) \leq LR\cdot O(\log^{3/2}(n))\,\mathcal{R}^{seq}_n(\mathcal{F}_{i-1}). $$
To see this, note that
$$ \mathcal{R}^{seq}_n(\mathcal{F}_i) = \sup_{\mathbf{x}}\mathbb{E}_\epsilon\left[\sup_{\substack{w:\|w\|_1\leq R\\ f_j\in\mathcal{F}_{i-1}}}\sum_{t=1}^n\epsilon_t\left(\sum_j w_j\,\sigma\bigl(f_j(x_t(\epsilon))\bigr)\right)\right]. $$
Interchanging the sum over $t$ and the sum over $j$, and then using Hölder's inequality, we get an upper bound of
$$ \sup_{\mathbf{x}}\mathbb{E}_\epsilon\left[\sup_{\substack{w:\|w\|_1\leq R\\ f_j\in\mathcal{F}_{i-1}}}\|w\|_1\max_j\left|\sum_{t=1}^n\epsilon_t\,\sigma\bigl(f_j(x_t(\epsilon))\bigr)\right|\right] \leq \sup_{\mathbf{x}}\mathbb{E}_\epsilon\left[R\sup_{f\in\mathcal{F}_{i-1}}\left|\sum_{t=1}^n\epsilon_t\,\sigma\bigl(f(x_t(\epsilon))\bigr)\right|\right]. \qquad (15.1) $$

We now claim that we can remove the absolute value at the expense of an extra factor of 2. Indeed,

$$ \sup_{f\in\mathcal{F}_{i-1}}\left|\sum_{t=1}^n\epsilon_t\,\sigma\bigl(f(x_t(\epsilon))\bigr)\right| \leq \max\left\{\sup_{f\in\mathcal{F}_{i-1}}\sum_{t=1}^n\epsilon_t\,\sigma\bigl(f(x_t(\epsilon))\bigr),\ \sup_{f\in\mathcal{F}_{i-1}}\sum_{t=1}^n-\epsilon_t\,\sigma\bigl(f(x_t(\epsilon))\bigr)\right\} $$


Since $\sigma(0)=0$ and $0\in\mathcal{F}$, we also conclude that $0\in\mathcal{F}_i$ for any layer $i$. We can thus replace the maximum by the sum of the two terms. The supremum over $\mathbf{x}$ in (15.1) can then be split into two equal terms:

$$ \sup_{\mathbf{x}}\mathbb{E}_\epsilon\left[R\sup_{f\in\mathcal{F}_{i-1}}\sum_{t=1}^n\epsilon_t\,\sigma\bigl(f(x_t(\epsilon))\bigr)\right] + \sup_{\mathbf{x}}\mathbb{E}_\epsilon\left[R\sup_{f\in\mathcal{F}_{i-1}}\sum_{t=1}^n-\epsilon_t\,\sigma\bigl(f(x_t(\epsilon))\bigr)\right], $$
thus proving the assertion. In view of Lemma 14.8, we pass to the upper bound of

$$ 2RL\cdot O(\log^{3/2}n)\,\sup_{\mathbf{x}}\mathbb{E}_\epsilon\left[\sup_{f\in\mathcal{F}_{i-1}}\sum_{t=1}^n\epsilon_t\,f(x_t(\epsilon))\right] = 2RL\cdot O(\log^{3/2}n)\,\mathcal{R}^{seq}_n(\mathcal{F}_{i-1}). $$
The statement of the lemma follows by applying the above argument repeatedly for all layers.

A key property to note about the above bound is that while the number of layers of the neural network enters the bound, the number of nodes in each layer does not, and so in each layer we could have any number of nodes without compromising the bound.

15.4 Discussion

Both excess risk (in statistical learning) and regret (in sequential prediction) are based on the comparison of learner’s performance to the performance within a class F. That is, the goal is to minimize the difference

(our loss) – (loss of the best element in F)

Of course, competing with very simple $\mathcal{F}$ is not interesting, as the performance of the comparator is unlikely to be good. Conversely, being able to have vanishing excess risk or vanishing regret with respect to a large class $\mathcal{F}$ implies good performance in a wide variety of situations, as we are more likely to capture the underlying phenomenon that generates the data. The course so far has focused on understanding a key question: what are the relevant complexities of $\mathcal{F}$ that make it possible to have vanishing excess loss or vanishing regret? The key point of this discussion is that these complexities are not exactly what we might think of when we look at a particular $\mathcal{F}$ and its expressiveness. For instance, take a finite $\mathcal{F}$ and take its convex hull. Lemma 7.14 implies that the Rademacher complexity of the convex hull is equal to that of the


finite collection. However, the convex hull of F is more expressive. Consider the example of neural networks which correspond to complicated nested composi- tions and convex combinations. Despite the apparent complexity, as we showed in Lemma 15.1, the inherent complexity is simply that of the base class. In the case of kernel methods, a bound on Rademacher complexity is possible in infinite dimensional spaces, which shows once again that the notion of “inherent com- plexity” relevant to the problem of excess loss or regret minimization is not exactly what we might think of. In some sense, this is great news, as we can take expressive (“large”) classes F while still guaranteeing that the inherent complexity is manage- able.

16 Large Margin Theory for Classification

17 Regression with Square Loss: From Regret to Nonparametric Estimation

Part III

Algorithms

18 Algorithms for Sequential Prediction: Finite Classes

Within the setting of Statistical Learning Theory, the upper bounds on the supremum of the Rademacher process give us a mandate to use the Empirical Risk Minimization algorithm, thanks to the inequality (7.2). But what can be said about the setting of Sequential Prediction? The upper bounds on the minimax value guarantee existence of a strategy that will successfully minimize regret, but how do we go about finding this strategy? Recall that to upper bound the minimax value, we jumped over quite a few hurdles: we appealed to the minimax theorem and then performed sequential symmetrization, replacing the mixed strategies of the opponent by a binary tree and coin flips. In this process, we seem to have lost the hope of finding a method for minimizing regret. Of course, we can say that the algorithm for sequential prediction is obtained by solving the long minimax problem (5.15) at each step. But this is not very satisfying, as we would like something more succinct and useful in practice. George Pólya's famous quote comes to mind: "If you can't solve a problem, then there is an easier problem you can solve: find it."

Let us consider the simplest case we can think of: the supervised setting of binary prediction with a finite benchmark set $\mathcal{F}$. More precisely, let $\mathcal{X}$ be some set, $\mathcal{Y} = \mathcal{D} = \{\pm1\}$, $\mathcal{F}\subset\mathcal{Y}^{\mathcal{X}}$, and the loss $\ell(y_1,y_2) = I\{y_1\neq y_2\}$. Lemma 9.5 (coupled with Lemma 13.1 to erase the indicator loss) yields a bound
$$ \mathcal{V}^{seq}(\mathcal{F}) \leq 2\sqrt{\frac{2\log|\mathcal{F}|}{n}}, \qquad (18.1) $$
and the immediate question is: what is the algorithm that achieves it?


Before answering this question, consider an even simpler problem in order to gain some intuition.

18.1 The Halving Algorithm

Imagine that among the functions in $\mathcal{F}$ there is one that exactly gives the true labels $y_t$, an assumption we termed realizable on page 26. That is, there is $f\in\mathcal{F}$ such that, no matter what $x_t$ is presented by Nature, the label $y_t$ is required to satisfy $y_t = f(x_t)$. If we know there is a perfect predictor in $\mathcal{F}$, we can prune away elements of $\mathcal{F}$ as soon as they disagree with some observed $y_t$. It is then clear that we have a strategy that makes at most $\mathrm{card}(\mathcal{F})-1$ mistakes.

However, we can do better. Suppose that at each round we update the set $\mathcal{F}_t$ of possible perfect predictors at time $t$ given the new information. At the first round,

$\mathcal{F}_1 = \mathcal{F}$. Upon observing $x_t$ we predict $\widehat{y}_t$ according to the majority of the functions in $\mathcal{F}_t$. If $\widehat{y}_t = y_t$, we set $\mathcal{F}_{t+1} = \mathcal{F}_t$. Otherwise, we set $\mathcal{F}_{t+1} = \{f\in\mathcal{F}_t : f(x_t) = y_t\}$. Clearly, any time we make a mistake, $\mathrm{card}(\mathcal{F}_{t+1})\leq\mathrm{card}(\mathcal{F}_t)/2$. Suppose we made $m$ mistakes after $n$ rounds. Then $1\leq\mathrm{card}(\mathcal{F}_n)\leq\mathrm{card}(\mathcal{F})/2^m$ and thus
$$ m \leq \log_2(\mathrm{card}(\mathcal{F})), $$
making regret (to the zero-loss comparator) no larger than $n^{-1}\log_2(\mathrm{card}(\mathcal{F}))$. It is not hard to construct an example with $\log_2(\mathrm{card}(\mathcal{F}))$ as a lower bound on the number of mistakes suffered by the learner.

We mention the Halving Algorithm because it has various incarnations in a range of disciplines. For instance, the idea of choosing the point that brings the most information (in our case, the majority vote) can be seen in Binary Search and the Method of Centers of Gravity in Optimization. If each new point decreases the "size" of the version space by a multiplicative factor, the rate of localizing the solution is exponential.
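The following Python sketch of the Halving Algorithm is not part of the original notes; the toy class of threshold functions and the particular sequence of points are our illustrative assumptions. It predicts by majority vote, prunes only on a mistake, and checks the $\log_2(\mathrm{card}(\mathcal{F}))$ mistake bound in the realizable case.

```python
def halving_predict(versions, x):
    """Predict by majority vote of the current version space (a list of {-1,+1}-valued functions)."""
    votes = sum(f(x) for f in versions)
    return 1 if votes >= 0 else -1

def halving_update(versions, x, y, made_mistake):
    """On a mistake, keep only the predictors consistent with the observed label."""
    return [f for f in versions if f(x) == y] if made_mistake else versions

# toy run: F = all threshold functions on {0,...,7}; the realizable target is the threshold at 5
F = [lambda x, t=t: 1 if x >= t else -1 for t in range(8)]
target = lambda x: 1 if x >= 5 else -1

versions, mistakes = list(F), 0
for x in [3, 6, 0, 5, 4, 7, 2]:
    y_hat, y = halving_predict(versions, x), target(x)
    made_mistake = (y_hat != y)
    mistakes += made_mistake
    versions = halving_update(versions, x, y, made_mistake)

# in the realizable case the number of mistakes is at most log2(card(F)) = 3
assert mistakes <= 3
```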

18.2 The Exponential Weights Algorithm

What if there is no element of F that predicts the sequence perfectly? We can no longer throw out parts of F. The idea is to keep an eye on all the functions in F, yet decrease our confidence in those that make many mistakes. The algorithm we

now present is a "soft version" of the Halving Algorithm, and it is arguably the most famous method in the framework of sequential prediction. It works not only for binary-valued prediction, but also for the real-valued case and for non-supervised problems.

In later lectures, we will appeal to Lemma 5.1 and argue that our sequential strategy can be deterministic. In the present case of a finite $\mathcal{F}$, however, the assumptions of the lemma are not satisfied, as $\mathcal{F}$ is not a convex set. In fact, a mixed strategy is necessary, and we will call it $q_t\in\Delta(\mathcal{F})$, following Eq. (5.15). Here, $\Delta(\mathcal{F})$ is the set of distributions on $N = \mathrm{card}(\mathcal{F})$ elements. Let us enumerate the functions in $\mathcal{F}$ as $f^1,\ldots,f^N$.

Exponential Weights Algorithm (EWA), Supervised Learning
Initialize: $q_1 = (1/N,\ldots,1/N)$, $\eta = \sqrt{\frac{8\ln N}{n}}$.
At time $t$: observe $x_t$, sample $i_t\sim q_t$, and predict $\widehat{y}_t = f^{i_t}(x_t)$.
Observe $y_t$ and update

$$ q_{t+1}(i) \propto q_t(i)\times\exp\bigl\{-\eta\,\ell(f^i(x_t),y_t)\bigr\} \quad\text{for all } i\in\{1,\ldots,N\}. $$
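A minimal Python sketch of the Exponential Weights update follows (not from the original notes; the constant "experts", the absolute loss, and the toy label sequence are our assumptions). For numerical convenience it tracks the learner's expected loss under $q_t$ rather than sampling $i_t$.

```python
import numpy as np

def exponential_weights(F, xs, ys, loss, eta):
    """Run EWA over a finite class F on the sequence (xs, ys); return the learner's total expected loss.

    F: list of functions; loss: maps (prediction, y) to [0, 1]; eta: learning rate.
    """
    N = len(F)
    log_w = np.zeros(N)                 # log-weights, proportional to log q_t
    total = 0.0
    for x, y in zip(xs, ys):
        q = np.exp(log_w - log_w.max())
        q /= q.sum()                    # normalize to obtain the distribution q_t
        losses = np.array([loss(f(x), y) for f in F])
        total += q @ losses             # expected instantaneous loss of the learner
        log_w -= eta * losses           # multiplicative update q_{t+1}(i) proportional to q_t(i) e^{-eta * loss_i}
    return total

# toy usage: two constant "experts" under absolute loss, mostly-ones labels
F = [lambda x: 0.0, lambda x: 1.0]
ys = [1, 1, 0, 1, 1, 1, 0, 1]
n = len(ys)
eta = np.sqrt(8 * np.log(len(F)) / n)
total = exponential_weights(F, [None] * n, ys, lambda p, y: abs(p - y), eta)
```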

Lemma 18.1. Suppose $\mathrm{card}(\mathcal{F}) = N$. For the supervised learning problem with any loss function $\ell$ with range in $[0,1]$, the Exponential Weights Algorithm guarantees
$$ \mathbb{E}\left\{\frac{1}{n}\sum_{t=1}^n\ell(\widehat{y}_t,y_t) - \inf_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^n\ell(f(x_t),y_t)\right\} \leq \sqrt{\frac{\ln N}{2n}} \qquad (18.2) $$
no matter how the sequence $(x_1,y_1),\ldots,(x_n,y_n)$ is chosen.

Proof of Lemma 18.1 (see e.g. [16]). Denote the cumulative loss of expert $i$ by $L^i_t = \sum_{s=1}^t\ell(f^i(x_s),y_s)$. Let $W_t = \sum_{i=1}^N\exp\{-\eta L^i_t\}$ with $W_0 = N$. We then have
$$ \ln\frac{W_n}{W_0} = \ln\left(\sum_{i=1}^N\exp(-\eta L^i_n)\right) - \ln N \geq \ln\left(\max_{i=1,\ldots,N}\exp(-\eta L^i_n)\right) - \ln N = -\eta\min_{i=1,\ldots,N}L^i_n - \ln N. $$


On the other hand,

$$ \ln\frac{W_t}{W_{t-1}} = \ln\frac{\sum_{i=1}^N\exp\{-\eta L^i_t\}}{\sum_{i=1}^N\exp\{-\eta L^i_{t-1}\}} = \ln\frac{\sum_{i=1}^N\exp\bigl(-\eta\,\ell(f^i(x_t),y_t)\bigr)\cdot\exp(-\eta L^i_{t-1})}{\sum_{i=1}^N\exp(-\eta L^i_{t-1})} = \ln\mathbb{E}_{i_t\sim q_t}\exp\bigl(-\eta\,\ell(f^{i_t}(x_t),y_t)\bigr). $$
We now employ a useful fact that for a random variable $X\in[a,b]$,
$$ \ln\mathbb{E}e^{sX} \leq s\,\mathbb{E}X + \frac{s^2(b-a)^2}{8} $$
for any $s\in\mathbb{R}$ (see [16] for more details). This inequality implies
$$ \ln\mathbb{E}_{i_t}\exp\bigl(-\eta\,\ell(f^{i_t}(x_t),y_t)\bigr) \leq -\eta\,\mathbb{E}_{i_t}\ell(f^{i_t}(x_t),y_t) + \frac{\eta^2}{8} $$
for the loss $0\leq\ell(\cdot,\cdot)\leq 1$. Summing the last inequality over $t=1,\ldots,n$, and observing that the logarithms telescope,

$$ \ln\frac{W_n}{W_0} \leq -\eta\sum_{t=1}^n\mathbb{E}_{i_t}\ell(f^{i_t}(x_t),y_t) + \frac{\eta^2 n}{8}. $$

Combining the upper and lower bounds for $\ln(W_n/W_0)$,
$$ \frac{1}{n}\sum_{t=1}^n\mathbb{E}_{i_t}\ell(f^{i_t}(x_t),y_t) - \min_{i\in\{1,\ldots,N\}}\frac{1}{n}\sum_{t=1}^n\ell(f^i(x_t),y_t) \leq \frac{\ln N}{n\eta} + \frac{\eta}{8}. $$
Balancing the two terms with $\eta = \sqrt{\frac{8\ln N}{n}}$ and taking the expectation gives the bound.

Notice a strange feature of the Exponential Weights Algorithm: $q_t$ is decided upon before seeing $x_t$. In other words, the structure of $\mathcal{F}$ (agreement and disagreement among the functions) is not exploited. Consequently, the Exponential Weights Algorithm has the same regret guarantee in the more abstract setting of unsupervised learning, where at each time step the learner picks $\widehat{y}_t\in\mathcal{F}$, Nature simultaneously chooses $z_t\in\mathcal{Z}$, the learner suffers loss $\ell(\widehat{y}_t,z_t)$, and $z_t$ is observed. Suppose the loss $\ell$ again takes values in $[0,1]$. The algorithm then becomes


Exponential Weights Algorithm (EWA), Unsupervised Learning
Initialize: $q_1 = (1/N,\ldots,1/N)$, $\eta = \sqrt{\frac{8\ln N}{n}}$.
At time $t$: sample $i_t\sim q_t$, and predict $\widehat{y}_t = f^{i_t}\in\mathcal{F}$.
Observe $z_t$ and update

$$ q_{t+1}(i) \propto q_t(i)\times\exp\bigl\{-\eta\,\ell(f^i,z_t)\bigr\} \quad\text{for all } i\in\{1,\ldots,N\}. $$

It is easy to see that the same upper bound
$$ \mathbb{E}\left\{\frac{1}{n}\sum_{t=1}^n\ell(\widehat{y}_t,z_t) - \inf_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^n\ell(f,z_t)\right\} \leq \sqrt{\frac{\ln N}{2n}} \qquad (18.3) $$
holds in this case too.

Another variant of the problem bears the name "Prediction with Expert Advice", as we may view the side information $x_t$ as a vector of advice of $N$ experts. Then, the choice of elements of $\mathcal{F}$ corresponds to the choice of which expert to follow, and we may take $\mathcal{F} = \{\mathbf{e}_1,\ldots,\mathbf{e}_N\}$, the vertices of the probability simplex $\Delta_N$. As $\Delta(\mathcal{F}) = \Delta_N$, the mixed strategy $q_t$ is simply a distribution over the coordinates $\{1,\ldots,N\}$. Let us now mention a variant where the expert advice is actually being combined to form a prediction. More precisely, at time $t$, each expert $i\in\{1,\ldots,N\}$ offers a prediction $x^i_t\in[0,1]$, the learner forms a decision $\widehat{y}_t\in[0,1]$, observes the outcome $y_t\in[0,1]$ and incurs loss $\ell(\widehat{y}_t,y_t)$ for some cost function $\ell$, convex in the first argument. At the end of $n$ days, we would like to have suffered loss not much worse than that of the best expert, without knowing who is the best until the very end. In this case, the algorithm is

Exponential Weights Algorithm (EWA), Prediction with Expert Advice
Initialize: $q_1 = (1/N,\ldots,1/N)$, $\eta = \sqrt{\frac{8\ln N}{n}}$.
At time $t$: observe $x_t\in[0,1]^N$ and predict $\widehat{y}_t = q_t^{\mathsf{T}}x_t$.
Observe $y_t$ and update

$$ q_{t+1}(i) \propto q_t(i)\times\exp\bigl\{-\eta\,\ell(x^i_t,y_t)\bigr\} \quad\text{for all } i\in\{1,\ldots,N\}. $$


We then have a regret bound
$$ \mathbb{E}\left\{\frac{1}{n}\sum_{t=1}^n\ell(\widehat{y}_t,y_t) - \inf_{i\in\{1,\ldots,N\}}\frac{1}{n}\sum_{t=1}^n\ell(x^i_t,y_t)\right\} \leq \sqrt{\frac{\ln N}{2n}}. \qquad (18.4) $$
The fact that EWA treats the experts as unrelated entities that generate predictions has both positive and negative implications. On the positive side, this fact increases the applicability of the algorithm. On the negative side, the method does not take into account the correlations between experts. If experts give very similar predictions, we would expect smaller regret, yet this fact is not reflected in the upper bound. Exploiting the structure of $\mathcal{F}$ is a very interesting topic, and it will be studied later in the course. We can already guess that covering of $\mathcal{F}$ needs to come into the picture, as it indicates the effective number of distinct functions. A variant of Exponential Weights is also employed for statistical aggregation of estimators [51, 29, 19], treating them as fixed entities. Since the structure of $\mathcal{F}$ is not exploited by the Exponential Weights method, the resulting estimator is able to mimic the best one no matter how the estimators are originally constructed. More generally, it is instructive to think of the Exponential Weights algorithm as a "union bound" algorithm, a point that will be explained in a few lectures.

Exercise (⋆⋆): Prove an analogue of Lemma 18.1 for a countable $\mathcal{F}$. More precisely, show that for any distribution $\pi$ on $\mathcal{F}$,

$$ \mathbb{E}\left\{\frac{1}{n}\sum_{t=1}^n\ell(\widehat{y}_t,y_t)\right\} \leq \inf_{f\in\mathcal{F}}\left\{\frac{1}{n}\sum_{t=1}^n\ell(f(x_t),y_t) + \frac{1+\ln(1/\pi(f))}{\sqrt{8n}}\right\}. \qquad (18.5) $$

Exercise (⋆⋆⋆): For the previous exercise, prove an upper bound of the form (18.5) but with $\log(1/\pi(f))$ under the square root.

Exercise (⋆⋆): Suppose $\mathcal{F} = \cup_{i\in\mathbb{N}}\mathcal{F}_i$ is a countable union (of potentially uncountable sets), and the sequential prediction problem with each $\mathcal{F}_i$ is Uniformly Sequentially Consistent (recall the definitions in Chapter 6). Prove that the sequential prediction problem with respect to $\mathcal{F}$ is Universally Sequentially Consistent. Hint: use the previous exercise.

19 Algorithms for Sequential Prediction: Binary Classification with Infinite Classes

The binary prediction problem within the supervised learning setting is as follows: on round $t$, some side information $x_t$ is revealed, the learner picks $\widehat{y}_t\in\{\pm1\}$, and the outcome $y_t$ is revealed. Let us consider the case of prediction with hyperplanes
$$ \bigl\{g(x) = \mathrm{sign}(\langle f,x\rangle) : f\in\mathcal{F}\subset\mathbb{R}^d\bigr\} $$
in $\mathbb{R}^d$. However, we immediately recognize that this class has infinite Littlestone dimension, since the example of thresholds on an interval (given in Section 8.4) can be easily embedded into this problem if $d\geq 2$.

Interestingly, the problem of infinite Littlestone dimension can be circumvented if we assume a margin condition. Analogously to the previous lecture, where we assumed that there exists an $f\in\mathcal{F}$ that perfectly classifies the data, we now make the assumption that some $f\in\mathcal{F}$ perfectly classifies the data with an extra margin of $\gamma>0$. Let us state this more formally.

Assumption 19.1. Given sets $\mathcal{F}$ and $\mathcal{X}$, the $\gamma$-margin condition (for some $\gamma>0$) is given by
$$ \exists f\in\mathcal{F} \text{ s.t. } \forall t\in[n],\quad y_t\langle f,x_t\rangle > \gamma. $$

19.1 Halving Algorithm with Margin

Let $\mathcal{F} = B^d_2$. Because of the margin condition, we can discretize the uncountable set $\mathcal{F}$ of all experts (i.e. the sphere) and pick a finite number of them for our problem.

It is easy to show that if we discretize to a fine enough level, there will be an expert that perfectly classifies the data, and then we can use the Halving Algorithm over this discretized set. The following lemma does exactly this. Before we begin, consider the discretization of the interval $[-1,1]$ at scale $\gamma/2d$ given by $B = \{-1,\, -1+\gamma/2d,\, -1+\gamma/d,\ldots,1\}$.

Lemma 19.2. Let $\mathcal{X}\subset[-1,1]^d$. Under the $\gamma$-margin condition, the total number of mistakes committed by the Halving Algorithm with the set of half-spaces specified by the discretization $\mathcal{F}_\gamma = B^d$ is at most
$$ d\log\left(\frac{4d}{\gamma}+1\right). $$

Proof. By the margin assumption there exists an $f^*\in\mathcal{F}$ such that $\forall t\in[n]$, $y_t\langle f^*,x_t\rangle > \gamma$. Since $\mathcal{F}_\gamma$ is a $\gamma/2d$ discretization of each coordinate, there exists $f^*_\gamma\in\mathcal{F}_\gamma$ such that

$$ \forall i\in[d],\quad |f^*[i] - f^*_\gamma[i]| \leq \gamma/2d. $$
Hence, for any $t\in[n]$,
$$ \bigl|y_t\langle f^*,x_t\rangle - y_t\langle f^*_\gamma,x_t\rangle\bigr| = \bigl|\langle f^* - f^*_\gamma,x_t\rangle\bigr| \leq \sum_{i=1}^d |f^*[i] - f^*_\gamma[i]| \leq \gamma/2. $$
Combining with the margin assumption, this implies that for all $t\in[n]$, $y_t\langle f^*_\gamma,x_t\rangle > \gamma/2$. In short, the set of half-spaces specified by $\mathcal{F}_\gamma$ has an element $f^*_\gamma$ that perfectly separates the data. The cardinality of the set is $|\mathcal{F}_\gamma| = \left(\frac{4d}{\gamma}+1\right)^d$. Using the Halving Algorithm, we get that the number of mistakes is bounded by

$$ \log_2|\mathcal{F}_\gamma| = d\log\left(\frac{4d}{\gamma}+1\right). $$

The nice aspect of this reduction to a finite case is its simplicity. It is transparent how the margin assumption allows us to discretize and only consider a finite number of experts. Another good side is the $\log\gamma^{-1}$ dependence on the margin $\gamma$. The bad aspects: the majority algorithm over $O(\gamma^{-d})$ experts is computationally infeasible, and the dependence of the regret bound on $d$ is linear.


19.2 The Perceptron Algorithm

Luckily, the bad computational performance of the discretization algorithm can be avoided. The Perceptron algorithm is a simple and efficient method for the problem of classification with a margin assumption, and it has some remarkable history. Invented by Rosenblatt in 1957, this algorithm can be called the father of neural networks and the starting point of machine learning. In their 1969 book, Minsky and Papert showed that perceptrons are limited in what they can learn: in particular, they cannot learn the XOR function. This realization is attributed to a decline in the field for at least a decade. Meanwhile, in 1964, Aizerman, Braverman and Rozonoer introduced a kernelized version of the perceptron and showed its relationship to stochastic approximation methods that minimize a risk functional (with respect to an unknown distribution). Their results gained popularity only in the 80’s.

Perceptron Algorithm
Initialize: $f_1 = 0$.
At each time step $t = 1,\ldots,n$:
Receive $x_t\in\mathbb{R}^d$.
Predict $\widehat{y}_t = \mathrm{sign}(\langle f_t,x_t\rangle)$.
Observe $y_t\in\{-1,1\}$.
Update $f_{t+1} = f_t + y_t x_t$ if $\mathrm{sign}(\langle f_t,x_t\rangle)\neq y_t$, and set $f_{t+1} = f_t$ otherwise.

Note that no update is performed when the correct label is predicted.
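The following Python sketch runs the Perceptron and checks the $(R/\gamma)^2$ mistake bound of Lemma 19.3 on synthetic margin-separable data (the sketch and the toy data-generation are ours, not from the original notes).

```python
import numpy as np

def perceptron(xs, ys):
    """Run the Perceptron on a sequence of (x_t, y_t) pairs; return the number of mistakes.

    xs: array of shape (n, d); ys: labels in {-1, +1}.
    """
    f = np.zeros(xs.shape[1])
    mistakes = 0
    for x, y in zip(xs, ys):
        y_hat = 1 if f @ x > 0 else -1   # sign convention: predict -1 on ties
        if y_hat != y:
            f += y * x                   # additive update, only on a mistake
            mistakes += 1
    return mistakes

# toy usage on gamma-margin separable data: mistakes should not exceed (R/gamma)^2
rng = np.random.default_rng(0)
f_star, gamma = np.array([1.0, 0.0]), 0.2
xs = rng.uniform(-1, 1, size=(500, 2))
xs = xs[np.abs(xs @ f_star) > gamma]     # keep only points with margin > gamma w.r.t. f_star
ys = np.sign(xs @ f_star)
R = np.max(np.linalg.norm(xs, axis=1))
assert perceptron(xs, ys) <= (R / gamma) ** 2
```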

Lemma 19.3. Under the $\gamma$-margin condition, the number of mistakes committed by the Perceptron Algorithm is at most $\left(\frac{R}{\gamma}\right)^2$, where $R = \max_t\|x_t\|$.

Proof. Let $f^*$ be the normal to the hyperplane that separates all data with margin $\gamma$. Let us look at how the "correlation" between $f^*$ and $f_t$ evolves. If a mistake is made on round $t$,

$$ \langle f^*,f_{t+1}\rangle = \langle f^*,f_t + y_t x_t\rangle = \langle f^*,f_t\rangle + y_t\langle f^*,x_t\rangle \geq \langle f^*,f_t\rangle + \gamma. $$

Thus, every time there is a mistake, the inner product of our hypothesis $f_t$ with the unknown $f^*$ increases by at least $\gamma$. If $m$ mistakes have been made over $n$ rounds, we have $\langle f^*,f_{n+1}\rangle\geq\gamma m$ since we started with the zero vector. Now, the main question

166 19.3 The Winnow Algorithm is whether the increase in the inner product is due to the smaller angle between the vectors (i.e. we are indeed getting close to the unknown f ∗ in terms of the direction), or is it because of the increase in the length of f ? k t k While the “correlation” with f ∗ increases with every mistake, we can show that the size of the hypothesis f cannot increase too fast. If there is a mistake on k t k round t,

$$ \|f_{t+1}\|^2 = \|f_t + y_t x_t\|^2 = \|f_t\|^2 + 2y_t\langle f_t,x_t\rangle + \|x_t\|^2 \leq \|f_t\|^2 + \|x_t\|^2. $$
Then after $n$ rounds and $m$ mistakes,

$$ \|f_{n+1}\|^2 \leq mR^2. $$
Combining the two arguments,

$$ \gamma m \leq \langle f^*,f_{n+1}\rangle \leq \|f^*\|\cdot\|f_{n+1}\| \leq \sqrt{mR^2}, $$
assuming $f^*$ is a unit vector, and so

$$ m \leq \left(\frac{R}{\gamma}\right)^2. $$

19.3 The Winnow Algorithm

20 Algorithms for Online Convex Optimization

20.1 Online Linear Optimization

Suppose $\mathcal{F} = \mathcal{X} = B_2\subset\mathbb{R}^d$, $\ell(f,x) = \langle f,x\rangle$, and the decision set $\mathcal{D} = \mathcal{F}$. At every $t$, the learner predicts $\widehat{y}_t\in\mathcal{D}$, observes $x_t$, and suffers a cost $\langle\widehat{y}_t,x_t\rangle$. The goal is to minimize regret defined as

$$ \frac{1}{n}\sum_{t=1}^n\langle\widehat{y}_t,x_t\rangle - \inf_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^n\langle f,x_t\rangle. \qquad (20.1) $$
Eq. (10.2) tells us that the value of the game is upper bounded by $1/\sqrt{n}$ (the constant 2 is not necessary). So, what is the strategy that gives this regret guarantee? The first idea is to reduce the problem to the case of a finite number of experts. What allows us to do so is the structure of $\mathcal{F}$: two nearby vectors $f,g\in\mathcal{F}$ have similar loss. Let us sketch a possible reduction. We can find a cover of the unit ball

$\mathcal{F}$ with centers in $\mathcal{F}'\subset\mathcal{F}$, at granularity $1/\sqrt{n}$ in the $\ell_2$ distance. It is easy to check that regret against $\mathcal{F}'$ is within $1/\sqrt{n}$ of regret with respect to the comparator class $\mathcal{F}$. The regret bound for the Exponential Weights algorithm on the finite set $\mathcal{F}'$ is $O\bigl(\sqrt{\ln|\mathcal{F}'|/n}\bigr)$, as given in the last section. But the size of the cover $\mathcal{F}'$ is at least $\Omega(n^{d/2})$, and so the resulting regret bound is $O\bigl(\sqrt{d\ln(n)/n}\bigr)$ at best. Unfortunately, this is off our target $1/\sqrt{n}$ in Eq. (10.2). Even more importantly, the Exponential Weights algorithm on $\mathcal{F}'$ is not an efficient algorithm. Evidently, this is because Exponential Weights treats each expert as a separate entity, despite the fact that the nearby elements of $\mathcal{F}'$ incur similar loss. While the discretization

168 20.2 Gradient Descent takes into account the linear structure of the problem, the Exponential Weights al- gorithm does not. The proposed reduction is not satisfying, but the idea is worth mentioning. For many problems in sequential prediction, an Exponential Weights algorithm over the discretization often yields a near-optimal—yet (possibly) an inefficient—method. How can we take algorithmic advantage of the linear structure of the problem?

The first thing to note is that the loss function is convex (linear) in $\widehat{y}_t$, and the set $\mathcal{F}$ is convex. Hence, by Lemma 5.1, there is a deterministic method for choosing the $\widehat{y}_t$'s. (In contrast to the Exponential Weights algorithm which works in the space of distributions, the present problem should be solvable with a deterministic method.) The regret objective in (20.1) has the flavor of an optimization problem, so it is reasonable to check whether the simple deterministic method

$$ \widehat{y}_{t+1} = \mathop{\mathrm{argmin}}_{f\in\mathcal{F}}\sum_{s=1}^t\langle f,x_s\rangle $$
works. This method, known as Follow the Leader (FTL), simply chooses the best decision given the observed data. Follow the Leader is basically the Empirical Risk Minimization method applied to the observed history. Unlike the i.i.d. case of statistical learning where ERM is a sound procedure, the sequential prediction problem with linear functions is not solved by this simple method, as the next (classical) example shows.

Example 17. Suppose $\mathcal{F} = \mathcal{X} = [-1,1]$ and $\ell(f,x) = f\cdot x$. Suppose Nature presents $x_1 = 0.5$ and $x_{2k} = -1$, $x_{2k+1} = 1$ for $k = 1,\ldots,\lfloor n/2\rfloor$. The choice of $f_1$ makes little difference, so suppose $f_1 = 0$. Clearly, the FTL algorithm generates a sequence $f_{2k} = -1$ and $f_{2k+1} = 1$, inadvertently matching the $x_t$ sequence, and thus the average loss of FTL is 1. On the other hand, taking $f = 0$ in the comparator term makes it 0, and thus regret is at least a constant.
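The following Python sketch reproduces Example 17 numerically (not from the original notes; the helper names are ours): FTL on the alternating sequence keeps per-round regret close to 1 no matter how large $n$ is.

```python
def ftl_regret(xs):
    """Follow the Leader on the 1-d linear game: play argmin_{f in [-1,1]} f * (sum of past x's)."""
    cum, ftl_loss = 0.0, 0.0
    for x in xs:
        f = -1.0 if cum > 0 else (1.0 if cum < 0 else 0.0)  # minimizer of f * cum over [-1, 1]
        ftl_loss += f * x
        cum += x
    best_loss = min(c * sum(xs) for c in (-1.0, 0.0, 1.0))   # a best fixed f is an endpoint or 0
    return (ftl_loss - best_loss) / len(xs)

# the alternating sequence from Example 17: per-round regret of FTL stays bounded away from zero
n = 1000
xs = [0.5] + [(-1.0) ** k for k in range(1, n)]
print(ftl_regret(xs))   # close to 1, not vanishing with n
```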

20.2 Gradient Descent

Let us now check whether a simple gradient update works. To this end, define

ybt 1 ΠF(ybt ηxt ) + = −

169 20.3 Follow the Regularized Leader and Mirror Descent

where ΠF(g) inff F f g the Euclidean projection onto the set F. Fix any f ∗ = ∈ k − k ∈ F and write

2 2 2 2 2 2 ­ ® ybt 1 f ∗ ΠF(ybt ηxt ) f ∗ ybt ηxt f ∗ ybt f ∗ η xt 2η ybt f ∗,xt k + − k = k − − k ≤ k − − k = k − k + k k − − The inequality in the above chain follows from the simple fact that an element outside the set cannot be closer than its projection to an element in the set. Rear- ranging, ­ ® 2 2 2 2 2η ybt f ∗,xt ybt f ∗ ybt 1 f ∗ η xt − ≤ k − k − k + − k + k k Summing over t 1,...,n and dividing by 2η, = n η n 1 nη X ­ ® 1 ¡ 2 2¢ X 2 p ybt f ∗,xt (2η)− yb1 f ∗ ybn 1 f ∗ xt n, t 1 − ≤ k − k − k + − k + 2 t 1 k k ≤ 2η + 2 = = = for η 1/pn. We used y 0 and the fact that x 1 for any t. Since this holds = b1 = k t k ≤ for any f ∗ F, we have proved the following: ∈ Lemma 20.1. Let F X Bd , and `(f ,x) ­f ,x®. Then the deterministic strategy = = 2 = of gradient descent

ybt 1 ΠF(ybt ηxt ) + = − with projection yields the regret bound of 1/pn, where y 0 and η 1/pn. b1 = = Notably, the gradient descent method is efficient, and its regret guarantee matches the non-constructive upper bound on the supremum of the Rademacher process, given in Eq. (10.2). It should also be noted that d does not appear in the upper bound, and thus the same guarantee holds for a unit ball B2 in an infinite dimen- sional Hilbert space.

20.3 Follow the Regularized Leader and Mirror Descent

Follow the Regularized Leader (FTRL) Input: Regularization function R, learning rate η 0. > For t 0, ≥ t X ­ ® 1 ybt 1 argmin f ,xs η− R(f ) + = f F s 1 + ∈ =

170 20.3 Follow the Regularized Leader and Mirror Descent

Recall from (10.6) that a function R is σ-strongly convex over F with respect to if k · k σ R(a) R(b) R(b),a b a b 2 ≥ + 〈∇ − 〉 + 2 k − k for all a,b F. If R is non-differentiable, R(b) can be replaced with any subgra- ∈ ∇ dient ∂R(b). ∇ ∈ Definition 20.2. The Bregman divergence with respect to a convex function R is defined as DR (a,b) R(a) R(b) R(b),a b = − − 〈∇ − 〉 Clearly, if R is σ-strongly convex with respect to , then k · k σ 2 DR (a,b) a b (20.2) ≥ 2 k − k The definition of Bregman divergence immediately leads to the following useful equality:

R(a) R(b),c b DR (c,b) DR (b,a) DR (c,a) (20.3) 〈∇ − ∇ − 〉 = + − The convex conjugate of the function R is defined as

R∗(u) sup u,a R(a), (20.4) = a 〈 〉 − and this transformation is known as the Legendre-Fenchel transformation. Recall the definition of the dual norm in (10.5).

Mirror Descent Input: R σ-strongly convex w.r.t. , learning rate η 0 k · k >

­ ® 1 ¡ ¢ ybt 1 argmin f ,xt η− DR f , ybt (20.5) + = f F + ∈ or, equivalently,

¡ ¢ y˜t 1 R∗( R(ybt ) ηxt ) and ybt 1 argmin DR f , y˜t 1 (20.6) + = ∇ ∇ − + = f F + ∈

171 20.3 Follow the Regularized Leader and Mirror Descent

Lemma 20.3. Let F be a convex set in a separable Banach space B and X be a convex set in the dual space B∗. Let R : B R be a σ-strongly convex function on F with 7→ respect to some norm . For any strategy of Nature, k · k n n r 1 X ­ ® 1 X ­ ® 2 ybt ,xt inf f ,xt RmaxXmax (20.7) n t 1 − f F n t 1 ≤ σn = ∈ = 2 where Rmax supf ,g F R(f ) R(g), and Xmax supx X x for the dual norm . = ∈ − = ∈ k k∗ k·k∗ Proof. Fix any f ∗ F and η 0 to be chosen later. Then ∈ > n n n X ­ ® X ­ ® X ­ ® η ybt f ∗,xt η ybt y˜t 1,xt η y˜t 1 f ∗,xt (20.8) t 1 − = t 1 − + + t 1 + − = = = Using the inequality ­f ,x® f x , which follows directly from the definition ≤ k k · k k∗ of the dual norm, µ η ¶ ­ ® ° ° ¡p ° °¢ η ybt y˜t 1,xt η°ybt y˜t 1° xt σ°ybt y˜t 1° xt − + ≤ − + k k∗ = − + pσ k k∗ The inequality ab a2/2 b2/2 now yields an upper bound ≤ + 2 2 σ ° °2 η 2 ¡ ¢ η 2 °ybt y˜t 1° xt DR y˜t 1, ybt Xmax 2 − + + 2σ k k∗ ≤ + + 2σ The definition of the Mirror Descent update and the equality (20.3) imply ­ ® ­ ® η y˜ f ∗,x y˜ f ∗, R(y ) R(y˜ ) t 1 − t = t 1 − ∇ bt − ∇ t 1 + +¡ ¢ ¡ ¢ + ¡ ¢ DR f ∗, y DR f ∗, y˜ DR y˜ , y = bt − t 1 − t 1 bt ¡ ¢ ¡ + ¢ ¡ + ¢ DR f ∗, ybt DR f ∗, ybt 1 DR y˜t 1, ybt ≤ − + − + where the last inequality is due to the fact that for the projection ybt 1 is closer to + any point f ∗ in the set in terms of the Bregman divergence than the unprojected point y˜t 1. Adding the last inequality for t 1,...,n, we obtain + = n n X ­ ® ¡ ¢ ¡ ¢ X ¡ ¢ η y˜t 1 f ∗,xt DR f ∗, yb1 DR f ∗, ybn 1 DR y˜t 1, ybt (20.9) t 1 + − ≤ − + − t 1 + = = Combining all the terms, n 2 X ­ ® ¡ ¢ ¡ ¢ nη 2 η ybt f ∗,xt DR f ∗, yb1 DR f ∗, ybn 1 Xmax t 1 − ≤ − + + 2σ = nη2 R2 X 2 ≤ max + 2σ max

172 20.4 From Linear to Convex Functions

Example 18. Consider the regularization function

1 R(f ) f 2 = 2k k In this case, the mirror space coincides with the original space, as the gradient mapping R(f ) f , and the Mirror Descent algorithm becomes simply the Gra- ∇ = dient Descent method.

Example 19. If we take N X R(f ) (fi log fi fi ) = i 1 − = defined over RN , the mirror space is defined by the gradient mapping R(f ) + ∇ = log f where the logarithm is taken coordinate-wise. The inverse mapping is R∗ 1 N ∇ = ( R)− is precisely the mapping a exp(a) for a R . The unprojected point ∇ 7→ ∈ y˜t 1 is then defined by + © ª © ª y˜t 1(i) exp log(ybt (i)) ηxt (i) ybt (i)exp ηxt (i) + = − = − and the projection onto the simplex with respect to the KL divergence is equivalent to normalization. Hence, we recover the Exponential Weights algorithm.

20.4 From Linear to Convex Functions

173 21 Example: Binary Sequence Prediction and the Mind Reading Machine

We now continue the example discussed in Chapter2. This chapter is based on the paper of Blackwell [9], as well as the detailed development of Lerche and Sarkar [36].

Recall the basic problem of predicting a {0,1} sequence z1,z2,.... At each stage t, the learner chooses y D {0,1} and Nature reveals z Z {0,1}. Suppose bt ∈ = t ∈ = the cost `(y ,z ) is the indicator of a mistake I©y z ª. For any t, let the average bt t bt 6= t number of correct predictions after t rounds be denoted by

t 1 X © ª c¯t I ybs zs . = t s 1 = = 1 Pt Further, let z¯t t s 1 zs be the frequency of 1’s in the sequence so far. = = As discussed in Chapter2, if the sequence z1,z2,... chosen by Nature is actually generated i.i.d. from some Bernoulli distribution with bias p [0,1], the simple ∈ Follow the Leader strategy (see Section 20.3)

t X © ª ybt 1 argmin I f zs + = f {0,1} s 1 6= ∈ = works well. The i.i.d. result, however, is quite limited. For instance, for the al- ternating sequence 0,1,0,1,... the FTL method yields close to zero proportion of correct predictions, while there is a clear pattern that can be learned. The failure of FTL on the alternating sequence 0,1,0,1,... is akin to Example 17 which shows that FTL does not work for linear optimization. Of course, we should

174 21.1 Prediction with Expert Advice

not even be looking at deterministic strategies for bit prediction, as the set F is not convex. A randomized prediction strategy is necessary. Luckily, we already have all the tools we need to prove vanishing regret for indi- vidual sequences. While we only prove the in-expectation convergence to simplify the presentation, the almost-sure analogues are not much more difficult.

21.1 Prediction with Expert Advice

Fix two (somewhat simple-minded) experts: one always advises 1 and the other

0. Let qt be the mixed strategy over F which we may equivalently represent as the probability q [0,1] of predicting 1 (or, choosing the first expert). For the t ∈ Exponential Weights Algorithm with N 2 experts, Lemma 18.1 then guarantees = s ½ n ¾ n 1 X © ª 1 X © ª ln2 E I ybt zt inf I f zt (21.1) n t 1 6= − f F n t 1 6= ≤ 2n = ∈ = or, equivalently, s n 1 X © ª ln2 max{z¯t ,1 z¯t } E{c¯n} sup I f zt E{c¯n} (21.2) − − = f F n t 1 = − ≤ 2n ∈ = which is the desired result in expectation. For the almost-sure result, we may use the high-probability bound of Proposition 22.1 and the so-called doubling trick [16]. And that’s all!

21.2 Blackwell’s method

We now present a different approach to achieving no regret for individual sequences of bits. This method is somewhat longer to develop than the Exponential Weights hammer that we used, yet it provides some very interesting insights. Moreover, the proof will be a stepping stone to Chapter27, the Blackwell’s Approachability Theorem, a powerful generalization of von Neumann’s minimax theorem. The goal (2.1) of the learner can be rephrased very elegantly in the language of geometry. Define L (z¯ ,c¯ ) [0,1]2 and t = t t ∈ S ©(z,c) [0,1]2 : u max{c,1 c}ª, = ∈ > − 175 21.2 Blackwell’s method

c¯ (1, 1) S D1 D2

Lt

(0, 0) x¯ D3

Figure 21.1: Partition of the unit square into four regions. The desired set is S.

as depicted in Figure 21.1. As we make predictions ybt and Nature reveals the bits zt , the point Lt moves within the unit square. We would like to have a strategy that will steer the point towards S no matter what the sequence is. The goal (2.1) can be written in terms of the distance to the set S:

lim d(Lt ,S) 0 almost surely . t = →∞ The opponent producing the bit sequence, on the other hand, is trying to make sure Lt stays away from S. Our “steering” strategy can be broken down to three cases according to the position of Lt within the square. For each of the cases, we show that the distance 1 to the set decreases. First, it is easy to check that z¯t 1 z¯t (zt 1 z¯t ) and hence + − = t 1 + − we may write + 1 Lt 1 Lt (zt 1 z¯t ,ct 1 c¯t ). + = + t 1 + − + − + 1 Let δt 1 , t 1 (zt 1 z¯t ,ct 1 c¯t ). The goal is now to show that, in expectation (or + + + − + − almost surely) the update δt 1 moves the point closer to S. + Case 1: If Lt S, there is nothing to do. We may predict ybt 1 0 if z¯t 1/2, and ∈ + = < predict ybt 1 1 otherwise. + = Case 2: If Lt D1, we predict ybt 1 0 deterministically. In this case, if zt 1 0 1 ∈ + = 1+ = then δt 1 t 1 ( z¯t ,1 c¯t ). If, on the other hand, zt 1 1, we have δt 1 t 1 (1 + = + − − + = + = + − z¯t , c¯t ). These two vectors δt 1 are shown in Figure 21.2. It is easy to check that − + the distance to the set drops by a multiplicative factor of 1/(t 1) irrespective of + zt 1. To see this, drop a perpendicular from Lt and from Lt 1 onto the boundary + + of S and use properties of similar triangles.

176 21.2 Blackwell’s method

c¯ (1, 1) S

Lt

(0, 0) x¯

Figure 21.2: When Lt is in the region D1, the prediction is ybt 1 0. + =

The case L D is identical to this case, except here we deterministically pre- t ∈ 2 dict ybt 1 1. + = Case 3: The most interesting situation is L D . Observe that our strategy t ∈ 3 so far has been deterministic, so it better be the case that we use a randomized strategy in D3. This is indeed true, and the mixed strategy qt 1 is defined in a very + peculiar fashion. Draw a line from Lt to the vertex v : (1/2,1/2) of S, and let qt 1 = + be its intersection with the horizontal axis, as shown in Figure 21.3. In the next

c¯ (1, 1) S v

Lt

(0, 0) qt+1 x¯

Figure 21.3: Construction of the mixed strategy q when L D . t t ∈ 3 lecture we will see why this construction is natural. At this point, let us see why it brings us closer to the set. Clearly, the distance from Lt to the set S is the same as the distance to (1/2,1/2), while the distance from Lt 1 to the set is at least the + distance to this vertex. Hence, for `t 1 (zt 1,ct 1), we can write + = + + 2 2 2 2 2 (t 1) d(Lt 1,S) (t 1) Lt 1 v t(Lt v) (`t 1 v) (21.3) + + ≤ + k + − k = k − + + − k 2 2 2 t d(Lt ,S) `t 1 v 2t Lt v,`t 1 v (21.4) = + k + − k + 〈 − + − 〉

177 21.3 Follow the Regularized Leader

And now, for the magic. We claim that Lt v,`t 1 v 0 in expectation (condi- 〈 − + − 〉 = tioned on the value zt 1 and the past). Indeed, +  (0,1 qt 1) if zt 1 0 E − + + = y t 1 qt 1 `t 1 (21.5) b ∼ + + + = (1,qt 1) if zt 1 1 + + = which correspond to the two intersections of the line A (see Figure 21.3) with the sides of the unit square. In both cases, E L v,` v 0 since the line A is ybt 1 t t 1 + 〈 − + − 〉 = perpendicular to Lt v. −2 Finally, `t 1 v 1/2, and summing Eq. (21.3) for t 1,...,n 1 and can- k + − k ≤ = − celing the terms we get

n 2 2 2 X n d(Ln,S) d(L1,S) n/2 t Mt (21.6) ≤ + + t 1 · = with Mt a bounded martingale difference sequence. An application of Hoeffding- Azuma gives a high-probability bound and an application of Borel-Cantelli yields the almost sure statement.

21.3 Follow the Regularized Leader

Since the Exponential Weights Algorithm will never produce deterministic strate- gies, it is clear that the Blackwell’s method is genuinely different. The reason for non-deterministic strategies of the Exponential Weights method is simple: the up- date rule never pushes the mixed strategy to the corner of the probability simplex, always keeping a nonzero weight even on hopelessly bad experts. What if instead of the Follow the Regularized Leader with entropic regulariza- tion, which yields the Exponential Weights Algorithm as shown in Example 19, we use Euclidean regularization, which yields a gradient-descent type update as shown in Example 18? To phrase the Follow the Regularized Leader (FTRL) method, we need to de- fine the loss as a linear or convex function. Of course, the indicator loss I©y z ª bt 6= t is neither of these, but the trick is to consider the linearized problem where the choice of the learner is q [0,1], interpreted as the probability of predicting y t ∈ bt = 1. Since I©y z ª y z 2y z for y ,z {0,1}, the expected loss can be writ- bt 6= t = bt + t − bt t bt t ∈ ten as © ª Ey qt I y t zt qt (1 2zt ) zt : `(qt ,zt ) bt ∼ b 6= = · − + = 178 21.3 Follow the Regularized Leader

for q [0,1]. We can now define the FTRL method t ∈ ½ t ¾ X 1 1 2 qt 1 argmin q (1 2zs) η− q 1/2 (21.7) + = q [0,1] s 1 · − + 2k − k ∈ = with the Euclidean regularizer centered at q 1/2. The unconstrained problem 1 P=t over the real line has solution at q˜t 1 2 η s 1(1 2zs) which is subsequently + = − = − clipped (or projected to) the interval [0,1]. If this clipping happens, the strategy becomes deterministic, either qt 1 or qt 0. Clipping is needed precisely when Pt 1 = = s 1(1 2zs) 2η , or, equivalently, | = − | < 1 z¯t 1/2 . | − | ≥ 4tη As shown in Figure 21.4, the FTRL strategy gives deterministic prediction when the 1 empirical frequency z¯t is outside the band of width 2tη , centered at 1/2. The typi- cal guarantee of O(1/pn) for the regret of FTRL sets η c/pn for some constant c = (or, in a time-changing manner, η c/pt) and thus the width of the region where t = the prediction is randomized is of the order 1/pn. c¯ 1/2 (1, 1)

(0, 0) 1/(2t⌘) x¯

Figure 21.4: FTRL with a fixed learning rate η gives a mixed strategy only in the small band around 1/2.

Since FTRL enjoys small regret, we have derived yet another strategy that at- tains the goal in (2.1). Observe, however, that the FTRL strategy with the Euclidean regularizer is different from both the Exponential Weights Algorithm, and from Blackwell’s method. If we are to replicate the behavior of the latter, the informa- tion about the frequency of our correct predictions c¯t needs to be taken into ac- count. This information (or, statistic about the past) cannot be deduced from z¯t alone, and so FTRL or Mirror Descent methods seem to be genuinely distinct from

Blackwell’s algorithm based on (z¯t ,c¯t ).

179 21.4 Discussion

It turns out that the extra information about the frequency c¯t of correct predic- tions can be used to set the learning rate η in Follow the Regularized Leader. With the time-changing learning rate 1 ηt = 4t c¯ 1/2 | t − | the FTRL method ½ t ¾ X 1 1 2 qt 1 argmin q (1 2zs) η−t q 1/2 (21.8) + = q [0,1] s 1 · − + 2k − k ∈ = becomes exactly the Blackwell’s algorithm. Let’s see why this is so. First, let us check the case when the optimum at (21.8) is achieved at the pure strategy q 1 t = or q 0. As argued earlier, this occurs when t = 1 z¯t 1/2 c¯t 1/2 . | − | ≥ 4tηt = | − |

This exactly corresponds to the regions D1 and D2 in Figure 21.1, thus matching the behavior of Blackwell’s method. It remains to check the behavior in D3. Setting the derivative in (21.8) to zero,

1 1 z¯t 1/2 qt (2tηt )(z¯t 1/2) − . = 2 + − = 2 + 2 c¯ 1/2 | t − | We need to check that this is the same value as that obtained geometrically in Fig- ure 21.3. It is indeed the case, as can be seen from similar triangles: the ratio of q 1/2 to 1/2 is equal to the ratio of z¯ 1/2 to c¯ 1/2 . t − t − | t − |

21.4 Discussion

We have described three different methods for the problem of {0,1}-sequence pre- diction. The Exponential Weights Algorithm puts exponentially more weight on the bit that occurs more often, but never lets go of the other bit, and the strategy is always randomized. The Follow the Regularized Leader method with a Euclidean regularizer produces a randomized strategy only in the narrow band around 1/2. A variant of FTRL which adapts the learning rate with respect to the proportion of correct predictions yields a randomized strategy in a triangular region D3 (Fig- ure 21.1), and deterministic prediction otherwise. Further, this behavior matches the Blackwell’s method, based on a geometric construction.

180 21.5 Can we derive an algorithm for bit prediction?

In the worst case, the performance (or, regret) of the three methods described above is the same, up to a multiplicative constant. However, the methods that use the extra information about the proportion c¯t of correct predictions may have better convergence properties for “benign” sequences. Such adaptive procedures have been analyzed in the literature. The beauty of Blackwell’s proof is its geometric simplicity and the rather sur- prising construction of the mixed strategy qt . As we will see in the next lecture, the construction appears out of a generalization of the minimax theorem. Of course, our application of the Exponential Weights Algorithm or Follow the Regularized Leader is a shorter proof, but remember that we spent some time building these hammers. Is there a similar hammer based on the ideas of ap- proaching a desired set? The answer is yes, and it is called the Blackwell’s Ap- proachability Theorem. Using this theorem, one can actually prove a wide range of results in repeated games that go beyond the “regret” formulation. What is interesting, Blackwell’s Approachability itself can be proved using online convex optimization algorithms whose performance is defined in terms of regret. This result points to an equivalence of these big “hammers”. Finally, the sequential symmetrization tools we had developed earlier in the course can be used to prove Blackwell approachability in even more generality, without exhibiting an algo- rithm.

21.5 Can we derive an algorithm for bit prediction?

Let us consider the minimax regret formulation:

½ n n ¾ 1 X © ª 1 X © ª V min max E ... min max E I y t zt min I f zt q [0,1] z1 q [0,1] zn b f {0,1} = 1 y q1 n y qn n t 1 6= − n t 1 6= ∈ b1∼ ∈ bn ∼ = ∈ = Observe that nV can be written recursively as

© © ª ª V(z1,...,zt 1) min max E I y t zt V(z1,...,zt ) − q [0,1] zt b = t y qt 6= + ∈ bt ∼ with n X © ª V(z1,...,zn) min I f zt . = − f {0,1} t 1 6= ∈ =

181 21.5 Can we derive an algorithm for bit prediction?

We leave it as an exercise to check that

nV V( ) = ; with these definitions. The recursion can in fact be solved exactly:

© © ª ª V(z1,...,zt 1) min max E I y t zt V(z1,...,zt ) − q [0,1] z {0,1} b = t t y qt 6= + ∈ ∈ bt ∼ © ª min max qt (1 2zt ) zt V(z1,...,zt ) = qt [0,1] zt {0,1} · − + + ∈ ∈ © ª min max qt 1 V(z1,...,zt 1,1) , qt 0 V(z1,...,zt 1,0) = qt [0,1] − + + − + + − ∈

The solution is obtained by equating the two terms

1 V(z1,...,zt 1,1) V(z1,...,zt 1,0) q∗ − − − (21.9) t = 2 + 2 which gives

1 V(z1,...,zt 1) EV(z1,...,zt 1,bt ) , (21.10) − = 2 + − bt is fair coin. Then, working backwards, the minimax value of the problem is n V EV(b1,...,bn) = 2 + ½ n ¾ n X © ª E min I f bt = 2 + − f {0,1} t 1 6= ∈ = ½ n ¾ X © ª © ª E max EI f b I f bt = f {0,1} t 1 6= − 6= ∈ = ½ n ¾ X E max ²t f = f {0,1} t 1 ∈ = where ² are 1 Rademacher random variables. From Hoeffding’s inequality and a t ± union bound, s ln2 V , ≤ n which is the bound we obtained with exponential weights algorithm.

182 21.5 Can we derive an algorithm for bit prediction?

We now turn to the question of obtaining computationally feasible algorithms. A calculation similar to the above gives

½ n t ¾ X X © ª V(z1,...,zt ) E max ²s f I f zs = f {0,1} s t 1 − s 1 6= ∈ = + = that can be plugged into the algorithm © © ª ª qt∗ argmin max E I y t zt V(z1,...,zt ) z {0,1} b = qt [0,1] t y qt 6= + ∈ ∈ bt ∼

However, the expression for V(z1,...,zt ) is still computationally expensive, as it involves averaging over random signs. The key idea is to introduce relaxations

V(z ,...,z ) Rel (z ,...,z ) 1 t ≤ n 1 t Not any upper bound will work, but one can see that it is enough to ensure admis- sibility: © © ª ª Reln (z1,...,zt 1) min max E I y t zt Reln (z1,...,zt ) ( ) − q [0,1] zt b ≥ t y qt 6= + ∗ ∈ bt ∼ and n X © ª Reln (z1,...,zn) min I f zt ≥ − f {0,1} t 1 6= ∈ = Any algorithm that solves ( ) guarantees performance bound of Rel ( ). The ∗ n ; main question now is: How do we come up with relaxations that lead to com- putationally feasible algorithms? Note that for maximum, a tight upper bound is achieved by the so-called soft- max. For any η 0, > ½ n t ¾ X X © ª V(z1,...,zt ) E max ²s f I f zs = f {0,1} s t 1 − s 1 6= ∈ = + = ½ n t ¾ 1 X X X © ª lnE exp η ²s f η I f zs (21.11) ≤ η f {0,1} s t 1 − s 1 6= ∈ = + = Let us now fix f {0,1} and write ∈ ½ n t ¾ ½ t ¾ ½ n ¾ X X © ª X © ª X Eexp η ²s f η I f zs exp η I f zs Eexp η ²s f s t 1 − s 1 6= = − s 1 6= × s t 1 = + = = = + (21.12)

183 21.6 The Mind Reading Machine

By Lemma 9.2, ½ n ¾ 2 X η (n t) Eexp η ²s f − s t 1 ≤ 2 = + (the same analysis also holds for f { 1,1} and so the constant can be improved). ∈ − Plugging into (21.11),

( ½ t ¾ ) 1 X X © ª η V(z1,...,zt ) inf ln exp η I f zs (n t) (21.13) ≤ η 0 η f {0,1} − s 1 6= + 2 − > ∈ =

P Exercise (?): Prove that the right-hand side of (21.13) is an admissible re- laxation and it leads to a parameter-free version of the Exponential Weights algo- rithm.

21.6 The Mind Reading Machine

The fact that the entropy of the source is related being able to predict the se- quences has long been recognized. It is, therefore, no surprise that Shannon was interested in the problem of prediction of binary sequences. Quoting from (Feder, Merhav, and Gutman, 2005)1:

In the early 50’s, at Bell Laboratories, David Hagelbarger built a sim- ple “mind reading” machine, whose purpose was to play the “penny matching” game. In this game, a player chooses head or tail, while a “mind reading" machine tries to predict and match his choice. Sur- prisingly, as Robert Lucky tells in his book “Silicon Dreams”, Hagel- barger’ssimple, 8-state machine, was able to match the “pennies” of its human opponent 5,218 times over the course of 9,795 plays. Random guessing would lead to such a high success rate with a probability less than one out of 10 billion! Shannon, who was interested in prediction, information, and thinking machines, closely followed Hagelbarger’s machine, and eventually built his own stripped-down version of the machine, having the same states, but one that used a simpler strategy at each state. As the legend goes, in a duel between the two machines, Shannon’s machine won by a slight margin! No one knows if this was

1 http://backup.itsoc.org/review/meir/node1.html

184 21.6 The Mind Reading Machine

due to a superior algorithm or just a chance happening associated with the specific sequence at that game. In any event, the success of both these machines against “untrained” human opponents was ex- plained by the fact that the human opponents cannot draw completely random bits. Certainly, as Shannon himself noted, his machine was beatable by the best possible player by a ratio of three-to-one. This raises the question, can one design a better “mind reading” machine? What is the best one can do in predicting the next outcome of the op- ponent? How is it related to the “randomness” of the opponent se- quence?

The techniques we developed in this course can be directly applied for building such a machine. Of course, it needs to take advantage of the fact that the sequence input by an (untrained) person is likely to be non-random. From our earlier proof, we already see that we should be able to predict the sequence better than chance if there are more 0’1 or 1’s.

185 22 Algorithmic Framework for Sequential Prediction

So far in the course, we have seen very general non-constructive upper bounds on the value of the prediction problem, as well as specialized algorithms that seem to magically match some of these bounds in terms of their regret. The algorithms have been developed over the course of several decades, yet for each new set- ting the researcher is tasked with a rather difficult problem of coming up with a new procedure almost from scratch. The online convex optimization framework successfully guides in the development of algorithms for many convex problems, but what is really behind all the methods? It turns out that, based on the non- constructive upper bounds, we can develop an algorithmic framework that cap- tures a very wide range of known methods. The framework also gives a general prescription for the development of new algorithms. We refer to [43] for more de- tails. We first study the general prediction problem with an abstract set D of learner’s moves and Z being the set of moves of Nature. We will devote a later section to the more specific problem of supervised learning. Recall that the online protocol dictates that on every round t 1,...,n the = learner and Nature simultaneously choose y D, z Z, and observe each other’s bt ∈ t ∈ actions. Unless specified otherwise, we will assume D F. For the sake of brevity, = let us consider the unnormalized minimax value of the prediction problem:

" n n # X X Vn(F) inf sup E ... inf sup E `(y t ,zt ) inf `(f ,zt ) y q y q b = q1 ∆(D) z1 Z b1 1 qn ∆(D) zn Z bn n t 1 − f F t 1 ∈ ∈ ∼ ∈ ∈ ∼ = ∈ = (22.1)

186 The minimax formulation immediately gives rise to the optimal algorithm that solves the minimax expression at every round t. That is, after witnessing z1,...,zt 1, − the algorithm returns

( " n n #) X X argmin sup E ...infsup E `(ybi ,zi ) inf `(f ,zi ) (22.2) q ∆(D) zt ybt q qn zn ybn i t − f F i 1 ∈ ∼ = ∈ = ( " " n n ##) X X argmin sup E `(ybt ,zt ) inf sup E ...infsup E `(ybi ,zi ) inf `(f ,zi ) = q ∆(D) zt ybt q + qt 1 zt 1 ybt 1 qn zn ybn i t 1 − f F i 1 ∈ ∼ + + + = + ∈ = Henceforth, if the quantification in inf and sup is omitted, it will be understood Z D Z D E that zt , ybt , pt , qt range over , , ∆( ), ∆( ), respectively. Moreover, zt is with respect to p while E is with respect to q . The first sum in (22.2) starts at i t ybt t Pt 1 = t since the partial loss i −1 `(ybi ,zi ) has been fixed. We now notice a recursive form for defining the value= of the game. For any t [n 1] and any given prefix ∈ − z ,...,z Z define the conditional value 1 t ∈ ½ ¾ £ ¤ Vn (z1,...,zt ) , inf sup E `(yb,z) Vn(z1,...,zt ,z) q ∆(D) z Z yb q + ∈ ∈ ∼ where n X Vn (z1,...,zn) , inf `(f ,zt ) and Vn(F) Vn( ). − f F t 1 = ; ∈ = The minimax optimal algorithm specifying the mixed strategy of the player can be written succinctly n o q argmin sup E [ (y,z)] V (z ,...,z ,z) . (22.3) t yb q ` b n 1 t 1 = q ∆(D) z Z ∼ + − ∈ ∈ This dynamic programming formulation has appeared in the literature, but now we have tools to study the conditional value of the game. We will show that various upper bounds on Vn(z1,...,zt 1,z) yield an array of algorithms, some with better − computational properties than others. In this way, the non-constructive approach we developed earlier in the course to upper bound the value of the game directly translates into algorithms. The minimax algorithm in (22.3) can be interpreted as choosing the best deci- sion that takes into account the present loss and the worst-case future. We then re- alize that the conditional value of the game serves as a “regularizer”, and thus well- known online learning algorithms such as Exponential Weights, Mirror Descent and Follow-the-Regularized-Leader arise as relaxations rather than a “method that just works”.

187 22.1 Relaxations

22.1 Relaxations

A relaxation Rel () is a sequence of functions Rel (z ,...,z ) for each t [n]. A n n 1 t ∈ relaxation will be called admissible if for any z ,...,z Z, 1 n ∈ ½ ¾ £ ¤ Reln (z1,...,zt 1) inf sup E `(y t ,zt ) Reln (z1,...,zt 1,zt ) (22.4) − y q b − ≥ qt ∆(D) zt Z bt t + ∈ ∈ ∼ for all t [n 1], and ∈ − n X Reln (z1,...,zn) inf `(f ,zt ). ≥ − f F t 1 ∈ = A simple inductive argument shows that V (z ,...,z ) Rel (z ,...,z ) for any t n 1 t ≤ n 1 t and any z1,...,zn. Thus, Vn is the smallest admissible relaxation. This can indeed be another definition for the value of the game. A strategy q that minimizes the expression in (22.4) defines a minimax optimal algorithm for the relaxation Rel. However, minimization need not be exact: any q that satisfies the admissibility condition (22.4) is a valid method, and we will say that such an algorithm is admissible with respect to the relaxation Rel.

Meta-Algorithm Parameters: Admissible relaxation Rel At each t 1 to n, compute = © ª qt argmin sup Ey q [`(y t ,zt )] Reln (z1,...,zt 1,zt ) (22.5) bt ∼ b − = q ∆(D) zt Z + ∈ ∈ and play y q . Receive z from Nature. bt ∼ t t

Let Regn stand for the unnormalized regret n n X X Regn , `(ybt ,zt ) inf `(f ,zt ) t 1 − f F t 1 = ∈ = and let Rel (F) Rel ( ) n , n ; Proposition 22.1. Let Rel () be an admissible relaxation. For any admissible algo- rithm with respect to Rel (), including the Meta-Algorithm, irrespective of the strat- egy of the adversary, n n X X E `(y ,z ) inf `(f ,z ) Rel (F) , (22.6) ybt qt bt t t n t 1 ∼ − f F t 1 ≤ = ∈ =

188 22.1 Relaxations and therefore,

E[Reg ] Rel (F) . n ≤ n We also have that

V (F) Rel (F) . n ≤ n If a `(f ,z) b for all f F,z Z, the Hoeffding-Azuma inequality yields, with ≤ ≤ ∈ ∈ probability at least 1 δ, − q Reg Rel (F) (b a) n/2 log(2/δ). n ≤ n + − · Further, if for all t [n], the admissible strategies q are deterministic, ∈ t Reg Rel (F) . n ≤ n Proof of Proposition 22.1. By definition, n n n X X X E `(y ,z ) inf `(f ,z ) E `(y ,z ) Rel (z ,...,z ) . ybt qt bt t t ybt qt bt t n 1 n t 1 ∼ − f F t 1 ≤ t 1 ∼ + = ∈ = = Peeling off the n-th expected loss, we have

n n 1 X X− E `(y ,z ) Rel (z ,...,z ) E `(y ,z ) ©E `(y ,z ) Rel (z ,...,z )ª ybt qt bt t n 1 n ybt qt bt t ybt qt bt t n 1 n t 1 ∼ + ≤ t 1 ∼ + ∼ + = = n 1 X− E `(y ,z ) Rel (z ,...,z ) ybt qt bt t n 1 n 1 ≤ t 1 ∼ + − = where we used the fact that qn is an admissible algorithm for this relaxation, and thus the last inequality holds for any choice zn of the opponent. Repeating the process, we obtain n n X X E `(y ,z ) inf `(f ,z ) Rel (F) . ybt qt bt t t n t 1 ∼ − f F t 1 ≤ = ∈ = We remark that the left-hand side of this inequality is random, while the right- hand side is not. Since the inequality holds for any realization of the process, it also holds in expectation. The inequality

V (F) Rel (F) n ≤ n holds by unwinding the value recursively and using admissibility of the relaxation. The high-probability bound is an immediate consequences of (22.6) and the Hoeffding- Azuma inequality for bounded martingales. The last statement is immediate.

189 22.1 Relaxations

For many problems a tight relaxation (sometimes within a factor of 2) is achieved through symmetrization. Define the conditional Sequential Rademacher complex- ity

· n t ¸ X X Rn(z1,...,zt ) supE²t 1:n sup 2 ²s`(f ,zs t (²t 1:s 1)) `(f ,zs) . (22.7) = z + f F s t 1 − + − − s 1 ∈ = + = Here the supremum is over all Z-valued binary trees of depth n t. One may view − this complexity as a partially symmetrized version of the sequential Rademacher complexity Rseq (F). We shall refer to the term involving the tree z as the “future” and the term being subtracted off – as the “past”. This indeed corresponds to the fact that the quantity is conditioned on the already observed z1,...,zt , while for the future we have the worst possible binary tree.1

Proposition 22.2. Conditional Sequential Rademacher complexity is an admissible relaxation.

Pt Proof. Denote Lt (f ) s 1 `(f ,zs). The first step of the proof is an application of = = the minimax theorem (we assume the necessary conditions hold):

( · n ¸) £ ¤ X inf sup E `(y t ,zt ) supE²t 1:n sup 2 ²s`(f ,zs t (²t 1:s 1)) Lt (f ) y q b + − + − qt ∆(D) zt Z bt t + z f F s t 1 − ∈ ∈ ∼ ∈ = + ( · n ¸) X E £ ¤ E E sup inf `(ybt ,zt ) sup ²t 1:n sup 2 ²s`(f ,zs t (²t 1:s 1)) Lt (f ) = zt pt + zt pt + − + − − pt ∆(Z) y F ∼ ∼ z f F s t 1 ∈ bt ∈ ∈ = + Pn For the sake of brevity, let us use the notation A(f ) 2 s t 1 ²s`(f ,zs t (²t 1:s 1)). = = + − + − Then, for any p ∆(Z), the infimum over y of the above expression is equal to t ∈ bt " # E E E £ ¤ sup ²t 1:n sup A(f ) Lt 1(f ) inf `(ybt ,zt ) `(f ,zt ) zt pt z + − − + zt pt − ∼ f F ybt F ∼ ∈ · ∈ ¸ £ ¤ E supE²t 1:n sup A(f ) Lt 1(f ) E `(f ,zt ) `(f ,zt ) zt pt + − zt pt ≤ ∼ z f F − + ∼ − ∈ £ ¤ E supE²t 1:n sup A(f ) Lt 1(f ) `(f ,zt0 ) `(f ,zt ) + ≤ zt ,zt0 pt z f F − − + − ∼ ∈ 1 It is somewhat cumbersome to write out the indices on zs t (²t 1:s 1) in (22.7), so we will instead − + − use z (²) for s 1,...,n t, whenever this does not cause confusion. s = −

190 22.1 Relaxations

We now argue that the independent zt and zt0 have the same distribution pt , and thus we can introduce a random sign ²t . The above expression then equals to £ ¤ E E supE²t 1:n sup A(f ) Lt 1(f ) ²t (`(f ,zt0 ) `(f ,zt )) ²t + zt ,zt0 pt z f F − − + − ∼ ∈ £ ¤ sup E supE²t 1:n sup A(f ) Lt 1(f ) ²t (`(f ,zt0 ) `(f ,zt )) ≤ ²t + − − + − zt ,z Z z f F t0 ∈ ∈ where we upper bounded the expectation by the supremum. Splitting the result- ing expression into two parts, we arrive at the upper bound of

·1 ¸ 2sup E supE²t 1:n sup (A(f ) Lt 1(f )) ²t `(f ,zt ) Rn(z1,...,zt 1). ²t + − − zt Z z f F 2 − + = ∈ ∈

The last equality is easy to verify, as we are effectively adding a root zt to the two subtrees, for ² 1 and ² 1, respectively. t = + t = −

The conditional sequential Rademacher complexity can be thought of as a tight relaxation and a good starting point for developing a computationally-attractive method. The recipe is then as follows:

We now show that several well-known methods arise by following the recipe. Next few chapters are devoted to delineating certain techniques for following the recipe and to deriving new prediction methods.

22.1.1 Follow the Regularized Leader / Dual Averaging

In the setting of online linear optimization, the loss is `(f ,z) ­f ,z®. Let F Z d d = = = B2 a unit ball in R . The conditional sequential Rademacher complexity can be

191 22.1 Relaxations written as ¿ n t À X X Rn(z1,...,zt ) supE²t 1:n sup f , 2 ²szs t (²t 1:s 1) zs (22.8) = z + f F s t 1 − + − − s 1 ∈ = + = which can be written simply as

° n t ° ° X X ° supE°2 ²szs t (²t 1:s 1) zs° z ° s t 1 − + − − s 1 ° = + = (see Eq. (10.1)), where the norm is Euclidean. Of course, this relaxation is not com- putationally attractive because of the supremum over z, so we need to pass to an upper bound that has nicer properties. We will remove the dependence on z via probabilistic inequalities. Observe that for any z, the expected norm in the above expression is upper bounded via Jensen’s inequality by

1/2 1/2 à ° n t °2! ð t °2 n ! ° X X ° °X ° X 2 E°2 ²szs t (²t 1:s 1) zs° ° zs° 4 E zs t (²t 1:s 1) ° s t 1 − + − − s 1 ° = °s 1 ° + s t 1 k − + − k = + = = = + 1/2 ð t °2 ! °X ° ° zs° 4(n t) Reln (z1,...,zt ) ≤ °s 1 ° + − = = where the first equality follows by expanding the square and letting the expecta- d tion act on the cross-terms, and the inequality — because the tree z is B2 -valued. We now take the last expression as the relaxation. To show admissibility, we need to prove that

 1/2 1/2 ð t °2 ! ðt 1 °2 ! ­ ® °X °  °X− ° inf sup y ,z ° z ° 4(n t) ° z ° 4(n t 1) (22.9) bt t ° s° ° s° y F zt Z  + s 1 + −  ≤ °s 1 ° + − + bt ∈ ∈ = = Pt 1 Let v s−1 zs. For the choice = = v y , (22.10) bt 1/2 = −2¡ v 2 4(n t 1)¢ k k + − + the left-hand side of (22.9) is ( ) v,zt ¡ 2 ¢1/2 sup 〈 〉 v zt 4(n t) (22.11) ¡ 2 ¢1/2 k k zt Z −2 v 4(n t 1) + + + − ∈ ( k k + − + ) v,zt ¡ 2 ¢1/2 sup 〈 〉 v v,zt 4(n t 1) (22.12) ¡ 2 ¢1/2 k k 〈 〉 ≤ zt Z −2 v 4(n t 1) + + + − + ∈ k k + − +

192 22.1 Relaxations

Observe that zt only enters through the inner product with v. We can therefore v write zt r v and the optimization problem as = k k ( ) r v 1/2 sup k k ¡ v 2 r v 4(n t 1)¢ ¡ ¢1/2 r [ 1,1] −2 v 2 4(n t 1) + k k + k k + − + ∈ − k k + − + Setting the derivative to zero, we conclude that the supremum is attained at r 0. = With this value, the required inequality (22.9) is proved. The algorithm in (22.10) is, in fact, an optimal method for the above relaxation. Rather than being pulled out of a hat, it can be derived as an optimal solution with a few more lines of algebra. Observe that the algorithm is closely related to Follow the Regularized Leader and Dual Averaging, modulo the normalization step that always keeps the solution in side the set. The technique readily extends beyond unit Euclidean balls, in which case the absence of a projection is a big computa- tional advantage over the usual first-order methods.

22.1.2 Exponential Weights

As another example of a relaxation that is easily derived as an upper bound on the sequential Rademacher complexity, consider the case when F is a finite set and `(f ,z) 1. This example generalizes the case of bit prediction in Section 21.5, | | ≤ and the reader is referred to that section for the proof with a not-as-heavy nota- tion. Pt Pn t Let Lt (f ) s 1 `(f ,zt ) and A(f ) 2 i −1 ²i `(f ,zi (²)). For any λ 0 and any = = = > tree z, the sequential Rademacher complexity=

( n t t ) µ ¶ X− X 1 ¡ ¢ E² max 2 ²i `(f ,zi (²)) `(f ,zt ) log E² maxexp A(f ) λLt (f ) f F − ≤ λ f F − ∈ i 1 s 1 ∈ = = Ã ! 1 X ¡ ¢ log E² exp A(f ) λLt (f ) ≤ λ f F − ∈ Ã (n t )! 1 X ¡ ¢ Y− ¡ ¢ log exp λLt (f ) E² exp 2λ²i `(f ,zi (²)) = λ f F − i 1 ∈ = We now upper bound the expectation over the “future” tree by the worst-case path,

193 22.1 Relaxations resulting in the upper bound

à à n t !! 1 X ¡ ¢ 2 X− 2 log exp λLt (f ) exp 2λ max `(f ,zi (²)) λ − × ²1,...²n t { 1} f F − ∈ ± i 1 ∈à ! = 1 X ¡ ¢ log exp λLt (f ) 2λ(n t) ≤ λ f F − + − ∈ by Lemma 9.2. Notice that the last step removes the z tree. Since the above calcu- lation holds for any λ 0, we define the relaxation > ( à à t !! ) 1 X X Reln (z1,...,zt ) inf log exp λ `(f ,zi ) 2λ(n t) (22.13) = λ 0 λ f F − i 1 + − > ∈ = that gives to a computationally tractable algorithm. Let us first prove that the re- laxation is admissible with the Exponential Weights algorithm as an admissible algorithm. Let λ∗ be the optimal value in the definition of Reln (z1,...,zt 1). Then, − by suboptimality of λ∗ for Reln (z1,...,zt ) ½ ¾ £ ¤ inf sup E `(f ,zt ) Reln (z1,...,zt ) f qt + qt ∆(D) zt Z ∼ ∈ ∈ ( à ! ) £ ¤ 1 X ¡ ¢ inf sup E `(f ,zt ) log exp λ∗Lt (f ) 2λ∗(n t) f q ≤ qt ∆(D) zt Z t + λ∗ f F − + − ∈ ∈ ∼ ∈ Let us upper bound the infimum by a particular choice of q which is the exponen- tial weights distribution

qt (f ) exp( λ∗Lt 1(f ))/Zt 1 = − − − P ¡ ¢ where Zt 1 f F exp λ∗Lt 1(f ) . By [16, Lemma A.1], − = ∈ − − Ã ! 1 X ¡ ¢ 1 ¡ ¡ ¢¢ 1 log exp λ∗Lt (f ) log Ef qt exp λ∗`(f ,zt ) log Zt 1 λ∗ f F − = λ∗ ∼ − + λ∗ − ∈ λ∗ 1 Ef qt `(f ,zt ) log Zt 1 ≤ − ∼ + 2 + λ∗ − Hence, ½ ¾ Ã ! £ ¤ 1 X ¡ ¢ inf sup E `(f ,zt ) Reln (z1,...,zt ) log exp λ∗Lt 1(f ) 2λ∗(n t 1) f q − qt ∆(D) zt Z t + ≤ λ∗ f F − + − + ∈ ∈ ∼ ∈ Reln (z1,...,zt 1) = − 194 22.2 Supervised Learning

by the optimality of λ∗. The bound can be improved by a factor of 2 for some loss functions, since it will disappear from the definition of sequential Rademacher complexity. The Chernoff-Cramèr inequality tells us that (22.13) is the tightest possible re- laxation. The proof reveals that the only inequality is the softmax which is also present in the proof of the maximal inequality for a finite collection of random variables. In this way, exponential weights is an algorithmic realization of a maxi- mal inequality for a finite collection of random variables. The connection between probabilistic (or concentration) inequalities and algorithms runs much deeper. We point out that the exponential-weights algorithm arising from the relax- ation (22.13) is a parameter-free algorithm. The learning rate λ∗ can be optimized (via one-dimensional line search) at each iteration with almost no cost. This can lead to improved performance as compared to the classical methods that set a particular schedule for the learning rate.

Our aim at this point was to show that the associated relaxations arise naturally (typically with a few steps of algebra) from the sequential Rademacher complexity. It should now be clear that upper bounds, such as the Dudley Entropy integral, can be turned into a relaxation, provided that admissibility is proved. Our ideas have semblance of those in Statistics, where an information-theoretic complexity can be used for defining penalization methods.

22.2 Supervised Learning

In the supervised setting of sequential prediction, the learner observes x X, t ∈ makes a prediction y D (or q ∆(D)) and observes y Y. We write the mini- bt ∈ t ∈ t ∈ max value as

n " n n # X X xsup inf sup E } `(ybt , yt ) inf `(f (xt ), yt ) (22.14) y qt − xt X qt ∆(D) yt Y bt ∼ t 1 t 1 f F t 1 ∈ ∈ ∈ = = ∈ = Let us briefly detail the relaxation framework for this supervised case. A relaxation will be called admissible if for any (x , y ),...,(x , y ) (X Y), 1 1 n n ∈ × ½ ¾ ¡ ¢ £ ¤ ¡ ¢ Reln (x1, y1),...,(xt 1, yt 1) sup inf sup E `(y t , yt ) Reln (x1, y1),...,(xt , yt ) − − y q b ≥ xt X qt ∆(D) yt Y bt t + ∈ ∈ ∈ ∼ (22.15)

195 22.2 Supervised Learning for all t [n], and ∈ n ¡ ¢ X Reln (x1, y1),...,(xn, yn) inf `(f (xt ), yt ). ≥ − f F t 1 ∈ =

Since xt is revealed at the beginning of round t, our strategy can be computed based on xt : ½ ¾ £ ¤ ¡ ¢ qt argmin sup E `(ybt , yt ) Reln (x1, y1),...,(xt , yt ) (22.16) = q ∆(D) yt ybt q + ∈ ∼ ¡ ¢ Whenever Y is bounded and both `(ybt , yt ) and Reln (x1, y1),...,(xt , yt ) are convex in yt , the supremum in Eq. (22.16) becomes a maximum between the two extreme values for yt , significantly simplifying both calculation of the strategy qt and veri- fying admissibility. Another example where this simplification happens is in clas- sification (binary or multi-class).

Binary Classification In the case of binary-valued labels D Y {0,1}, the ob- = =© ª jective takes on a simpler form. Suppose that the loss is `(yb, y) I yb y . Then © ª = 6= Ey qt I y t , yt yt qt and the optimization problem becomes bt ∼ b = | − | © ¡ ¢ ¡ ¢ª qt argmin max 1 q Reln (x1, y1),...,(xt ,1) ,q Reln (x1, y1),...,(xt ,0) = q ∆(D) − + + ∈ (22.17)

This development is very similar to (21.9). The minimum is ¡ ¢ ¡ ¢ 1 Reln (x1, y1),...,(xt ,1) Reln (x1, y1),...,(xt ,0) qt − (22.18) = 2 + 2 if q [0,1], and otherwise needs to be clipped to this interval. For simplicity, t ∈ assume no clipping is required. Plugging in the value of qt into the right-hand side of (22.15), we obtain ½ ¾ £ ¤ ¡ ¢ sup inf sup E `(ybt , yt ) Reln (x1, y1),...,(xt , yt ) y qt + xt X qt ∆(D) yt Y bt ∼ ∈ ∈½ ∈ ¾ 1 1 ¡ ¡ ¢ ¡ ¢¢ sup Reln (x1, y1),...,(xt ,1) Reln (x1, y1),...,(xt ,0) ≤ xt X 2 + 2 + ∈ ½ ¾ 1 ¡ ¢ sup Ebt Reln (x1, y1),...,(xt ,bt ) (22.19) = xt X 2 + ∈

196 22.2 Supervised Learning for a fair coin b {0,1}. And so checking admissibility of the relaxation and strat- t ∈ egy (22.16) reduces to verifying that

1 ¡ ¢ ¡ ¢ sup Eb Reln (x1, y1),...,(xt ,bt ) Reln (x1, y1),...,(xt 1, yt 1) (22.20) t − − 2 + xt X ≤ ∈ For the case of Y { 1, 1}, the Bernoulli variable b is replaced by the Rademacher = − + t random variable ²t throughout.

197 23 Algorithms Based on Random Playout, and Follow the Perturbed Leader

23.1 The Magic of Randomization

Recall the generic form (22.5) of a Meta Algorithm for a given relaxation. Suppose the relaxation is of the form Rel (z ,...,z ) EΦ(W,z ), for some function Φ and n 1 t = 1:t the expectation is over some random variable W. The optimal randomized strategy for this relaxation is © ª q argmin sup E (y,z ) E (W,z ) . (23.1) ∗ yb q ` b t W p Φ 1:t = q zt ∼ + ∼

However, computing the expectation might be computationally costly. With this in mind, consider a deceptively simple-minded strategy q˜: first draw W p and ∼ then compute © ª q(W) argmin sup E (y,z ) (W,z ) . (23.2) , yb q ` b t Φ 1:t q zt ∼ + or its clipped version if q(W) falls outside [0,1]. We then verify that the value of the objective in (23.1) evaluated at q˜ is © ª © ª sup E (y,z ) E (W,z ) sup E E (y,z ) E (W,z ) yb q˜` b t W p Φ 1:t W p yb q(W)` b t W p Φ 1:t zt ∼ + ∼ = zt ∼ ∼ + ∼ © ª E sup E (y,z ) (W,z ) W p yb q(W)` b t Φ 1:t ≤ ∼ zt ∼ + © ª E infsup E (y,z ) (W,z ) . W p yb q ` b t Φ 1:t = ∼ q zt ∼ + (23.3)

198 23.2 Linear Loss

Recall that to show admissibility of the relaxation and the randomized strategy q˜, we need to ensure that the above expression is upper bounded by the relaxation with one less outcome. Thankfully, in view of the last expression, we may do so conditionally on W, as if there were no random variables in the definition of the re- laxation at all! This general approach of obtaining randomized methods for relax- ations defined via an expectation appears to be quite powerful. The randomized strategy can be seen as “mimicking” the randomization in the relaxation. Many of the methods developed later in the course are based on this simple technique, which we will term “random playout”. In game theory, random playout is a strat- egy employed for estimating the value of a board position. Luckily, the strategy has a solid basis in regret minimization. In particular, we will show that a well-known Follow the Perturbed Leader algorithm is an example of such a randomized strat- egy. Our starting point is, once again, the conditional sequential Rademacher com- plexity · n t ¸ X X Rn(z1,...,zt ) supE²t 1:n sup 2 ²s`(f ,zs t (²t 1:s 1)) `(f ,zs) (23.4) = z + f F s t 1 − + − − s 1 ∈ = + = However, we cannot employ the above arguments right away, as the relaxation involves a supremum over a tree. In what follows, we provide several interesting situations where the supremum can be replaced by an expectation.

23.2 Linear Loss

We now show a somewhat surprising result: it is often the case in finite-dimensional problems with linear loss that the supremum over the trees can be replaced by an i.i.d. distribution that is “almost as bad”. We first phrase this fact as an assumption, and then verify it for several cases. Following the abstraction of Section 20.3, suppose Z is a unit ball in a separable Banach space (B, ) and F is a unit ball in the dual norm . k · k k · k∗ Assumption 23.1. There exists a distribution D ∆(Z) and constant C 2 such ∈ ≥ that for any w B ∈

supE w 2²1z Ez D E w C²1z (23.5) z Z k + k ≤ ∼ k + k ∈ where ²1 is a Rademacher random variable.

199 23.2 Linear Loss

The assumption says that there exists a universal (that is, independent of w) distribution D for the problem, such that the expected length of a one-step ran- dom walk from any point w under this distribution is almost as large as the ex- pected length that would be achieved with the worst two-point symmetric distri- bution that can depend on w.

P Exercise (?): Prove that under Assumption 23.1, the sequential Rademacher complexity of F is within a constant factor from the classical i.i.d. Rademacher complexity with respect to D n.

The assumption holds in finite dimensional spaces. Not surprisingly, the algo- rithm based on this assumption has a regret guarantee upper bounded by the i.i.d. Rademacher complexity with respect to D n. Let us now state a general lemma; its proof is deferred to Section 23.2.3.

Lemma 23.2. Under the Assumption 23.1, the relaxation

" n t # X ­ ® X ­ ® Reln (z1,...,zt ) E E² sup C ²i f ,zi f ,zi (23.6) = zt 1,...,zn D f F i t 1 − i 1 + ∼ ∈ = + = is admissible and a randomized strategy that ensures admissibility is given by: at time t, draw zt 1,...,zn D and Rademacher random variables ² (²t 1,...,²n), + ∼ = + and then define ( ° °) ° n t 1 ° ­ ® ° X X− ° ybt argmin sup yb,z °C ²i zi zi z° (23.7) = y D z Z + ° i t 1 − i 1 − ° b∈ ∈ = + = The expected regret for the method is bounded by the classical Rademacher com- plexity: " n # X ­ ® ERegn C Ez1:n D E² sup ²t f ,zt , ≤ ∼ f F t 1 ∈ = It will be shown in the examples below that the update (23.7) is of the nicer form

* t 1 n + X− X ybt argmin f , zi C ²i zi (23.8) = f F i 1 − i t 1 ∈ = = + This is the famous Follow the Perturbed Leader algorithm [30,16]:

200 23.2 Linear Loss

Follow the Perturbed Leader

t 1 X− ­ ® 1 ­ ® ybt argmin f ,zi η− f ,r = f F i 1 + ∈ = where r is a random vector and η is some learning rate parameter. It is now clear that the random vector r arises as a sum of random draws from the distribu- tion that is “almost as bad” as the worst-case tree in the conditional sequential Rademacher complexity.

We now look at specific examples of `2/`2 and `1/` cases and provide closed ∞ form solution of the randomized algorithms.

23.2.1 Example: Follow the Perturbed Leader on the Simplex

N N Here, we consider the setting similar to that in [30]. Let F B1 and Z B . In = = ∞ [30], the set F is the probability simplex and Z [0,1]N but these are subsumed by = the `1/` case. We claim that: ∞ Lemma 23.3. Assumption 23.1 is satisfied with a distribution D that is uniform on the vertices of the cube { 1}N and C 6. ± = Proof of Lemma 23.3. Let w RN be arbitrary. Throughout this proof, let ² { 1} ∈ ∈ ± be a single Rademacher random variable. The norm in (23.5) is the ` norm, and ∞ so we need to show

max E² max wi 2²zi E Emax wi 6²zi (23.9) z { 1}N i [N] | + | ≤ z D ² i [N] | + | ∈ ± ∈ ∼ ∈

Let i ∗ argmax w and j ∗ argmax w be the coordinates with largest and = | i | = | i | i i i ∗ second-largest magnitude. If w 6= w 4, the statement follows since, for any | i ∗ |−| j ∗ | ≥ z { 1}N and ² { 1}, ∈ ± ∈ ±

max wi 2²zi max wi 2 wi 2 wi 2²zi , i i | + | ≤ i i | | + ≤ | ∗ | − ≤ | ∗ + ∗ | 6= ∗ 6= ∗ and thus

max E² max wi 2²zi max E² wi ∗ 2²zi ∗ wi ∗ Ez,² wi ∗ 6²zi ∗ Ez,² max wi 6²zi . z { 1}N i [N] | + | = z { 1}N | + | = | | ≤ | + | ≤ i | + | ∈ ± ∈ ∈ ±

201 23.2 Linear Loss

It remains to consider the case when w w 4. We have that | i ∗ | − | j ∗ | < 1 1 1 Ez,² max wi 6²zi Ez,² max wi 6²zi ( wi 6) ( wi 6) ( w j 6) wi 2 i [N] | + | ≥ i {i ,j } | + | ≥ 2 | ∗ | + + 4 | ∗ | − + 4 | ∗ | + ≥ | ∗ | + ∈ ∈ ∗ ∗ max E² max wi 2²zi , ≥ z { 1}N i [N] | + | ∈ ± ∈ where 1/2 is the probability that ²z sign(w ), the second event of probability i ∗ = i ∗ 1/4 is the event that ²z sign(w ) and ²z sign(w ), while the third event of i ∗ 6= i ∗ j ∗ 6= j ∗ probability 1/4 is that ²z sign(w ) and ²z sign(w ). i ∗ 6= i ∗ j ∗ = j ∗ In fact, one can pick any symmetric distribution D on the real line and use D N for the perturbation. Assumption 23.1 is then satisfied, as we show in the following lemma. Lemma 23.4. If D˜ is any symmetric distribution over the real line, then Assump- tion 23.1 is satisfied by using the product distribution D D˜ N . The constant C = required is any C 6/Ez D˜ z . ≥ ∼ | | The above lemma is especially attractive when used with standard normal dis- tribution because in that case as sum of normal random variables is again normal.

Hence, instead of drawing zt 1,...,zn N(0,1) on round t, one can simply draw + ∼ just one vector Z N(0,pn t) and use it for perturbation. In this case constant t ∼ − C is bounded by 8. While we have provided simple distributions to use for perturbation, the form of update in (23.7) is not in a convenient form. The following lemma shows a simple Follow the Perturbed Leader type algorithm with the associated regret bound. N N ˜ Lemma 23.5. Suppose D F B1 , Z B , and let D be any symmetric distri- = = = ∞ bution. Consider the randomized algorithm that at each round t freshly draws ˜ N Rademacher random variables ²t 1,...,²n and freshly draws zt 1,...,zn D (each + + ∼ co-ordinate drawn independently from D˜ ) and picks * t 1 n + X− X ybt argmin yb, zi C ²i zi (23.10) = y D i 1 − i t 1 b∈ = = + where C 6/Ez D˜ [ z ]. The randomized algorithm enjoys a bound on the expected = ∼ | | regret given by à ¯ ¯ ! ° n ° n ¯ n ¯ °X ° X ¯ X ¯ ERegn C E E² ° ²t zt ° 4 Py ˜ C yi 4 ˜ N t 1:n D ¯ ¯ ≤ z1:n D °t 1 ° + t 1 + ∼ ¯i t 1 ¯ ≤ ∼ = ∞ = = +

202 23.2 Linear Loss

Notice that for D˜ being the { 1} coin flips or standard normal distribution, the ± probability à ¯ n ¯ ! ¯ X ¯ P ˜ C ¯ y ¯ 4 yt 1,...,yn D ¯ i ¯ + ∼ ¯i t 1 ¯ ≤ = + Pn ¡ ¯Pn ¯ ¢ is exponentially small in n t and so P ˜ C y 4 is bounded t 1 yt 1,...,yn D ¯ i t 1 i ¯ − = + ∼ ≤ by a constant. For these cases, we have = +

à ° n ° ! °X ° ³q ´ ERegn O E E² ° ²t zt ° O n logN ˜ N ≤ z1:n D °t 1 ° = ∼ = ∞ This yields the logarithmic dependence on the dimension, matching that of the Exponential Weights algorithm.

23.2.2 Example: Follow the Perturbed Leader on Euclidean Balls

We now consider the case when F and Z are both the unit `2 ball. We can use as perturbation the uniform distribution on the surface of unit sphere, as the follow- ing lemma shows.

Lemma 23.6. Let F Z BN . Then Assumption 23.1 is satisfied with a uniform = = 2 distribution D on the surface of the unit sphere with constant C 4p2. = Again as in the previous example the form of update in Equation (23.7) is not in a convenient form and this is addressed in the following lemma.

Lemma 23.7. Let D F Z BN and D be the uniform distribution on the sur- = = = 2 face of the unit sphere. Consider the randomized algorithm that at each round (say round t) freshly draws zt 1,...,zn D and picks + ∼ Pt 1 Pn − z C z − i 1 i + i t 1 i ybt q = = + = ° Pt 1 Pn °2 ° i −1 zi C i t 1 ²i zi °2 1 − = + = + + where C 4p2. The randomized algorithm enjoys a bound on the expected regret = given by ° n ° °X ° p ERegn C Ez1,...,zn D ° zt ° 4 2n ∼ ° ° ≤ t 1 2 ≤ = Importantly, the bound does not depend on the dimensionality of the space.

203 23.2 Linear Loss

23.2.3 Proof of Lemma 23.2

To show admissibility using the particular randomized strategy qt described in Lemma 23.5, we need to show that

sup©E £­y,z ®¤ Rel z ,...,z ª Rel z ,...,z yb qt b t n ( 1 t ) n ( 1 t 1) zt ∼ + ≤ −

The strategy qt proposed by the lemma is such that we first draw zt 1,...,zn D + ∼ and ²t 1,...²n Rademacher random variables, and then calculate y t y t (zt 1:n,²t 1:n) + b = b + + as in (23.7). In view of (23.3),

( ° n t °) © £­ ®¤ ª ­ ® ° X X ° sup Ey q y,zt Reln (z1,...,zt ) E inf sup g,zt °C ²i zi zi ° b t b ² ° ° zt ∼ + ≤ t 1:n g F zt + − zt +1:n ° i t 1 i 1 ° + ∈ = + = (23.11)

Pn Pt 1 Let w C i t 1 ²i zi i −1 zi . We now appeal to the minimax theorem: = = + − = ©­ ® ª ©­ ® ª inf sup g,zt w zt inf sup Ezt pt g,zt w zt k k ∼ k k g F zt + − = g F pt ∆(Z) + − ∈ ∈ ∈ © ­ ® ª sup inf Ezt p g,zt Ezt p w zt = p ∆(Z) g F ∼ + ∼ k − k ∈ ∈ The minimax theorem holds because loss is convex in g and F is a compact convex set and the term in the expectation is linear in pt , as it is an expectation. The last expression is equal to " # ­ ® £­ ®¤ ­ ® sup Ezt p sup f ,w inf Ezt p g,zt f ,zt p ∆(Z) ∼ f F + g F ∼ − ∈ ∈ ∈ £­ ® £­ ®¤ ­ ®¤ sup Ezt p sup f ,w Ezt p f ,zt f ,zt ≤ p ∆(Z) ∼ f F + ∼ − ∈ ∈ £­ ® ­ ®¤ supE²t sup f ,w 2² f ,zt ≤ zt Z f F + ∈ ∈ supE w 2²z ²t k t k ≤ zt Z + ∈ using the familiar symmetrization technique. In view of Assumption 23.1, ° ° ° n t 1 ° ° X X− ° supE²t w 2²zt E E w C²t zt E E C ²i zi zi k k z ²t k k z ²t ° ° zt Z + ≤ t D + = t D ° i t − i 1 ° ∈ ∼ ∼ = = Plugging this into Eq. (23.11) completes the proof of admissibility.

204 23.3 Supervised Learning

23.3 Supervised Learning

Recall the relaxation framework in Section 22.2. Suppose that the relaxation in

Eq. (22.16) is of the form EΦ(W,x1:t , y1:t ) for some random variable W with distribu- tion p. For simplicity, consider the classification scenario with Y {0,1}. Consider = the following randomized strategy q˜ : draw W p, and set t ∼

1 Φ(W,(x1, y1),...,(xt ,1)) Φ(W,(x1, y1),...,(xt ,0)) q(W) − , (23.12) = 2 + 2 or its clipped version if q(W) falls outside [0,1]. The randomized strategy q˜t com- bines the idea of randomization described in the beginning of this chapter to- gether with the closed-form solution of (22.16). Admissibility of the above strategy (under the appropriate conditions on Φ) will be shown along the following general lines. An adaptation of the argument leading to (23.3) for supervised learning gives that for any xt , the value of the ob- jective for the randomized strategy q˜ is

$$ \sup_{y_t}\left\{ \mathbb{E}_{\hat{y}\sim\tilde{q}}\,\ell(\hat{y}, y_t) + \mathbb{E}_{W\sim p}\,\Phi(W, x_{1:t}, y_{1:t})\right\} \;\le\; \mathbb{E}_{W\sim p}\,\inf_{q}\sup_{y_t}\left\{ \mathbb{E}_{\hat{y}\sim q}\,\ell(\hat{y}, y_t) + \Phi(W, x_{1:t}, y_{1:t})\right\}. \qquad (23.13) $$

In the case of binary classification with indicator loss, we turn to the analysis in Eq. (22.19). We then see that the value of the objective for $\tilde{q}$ is at most
$$ \sup_{x_t}\mathbb{E}_{W\sim p}\,\inf_{q}\sup_{y_t}\left\{ \mathbb{E}_{\hat{y}\sim q}\,\mathbf{1}\{\hat{y}\ne y_t\} + \Phi(W, x_{1:t}, y_{1:t})\right\} \;\le\; \sup_{x_t}\mathbb{E}_{W\sim p}\left\{ \frac{1}{2} + \mathbb{E}_{b_t}\,\Phi(W, x_{1:t}, y_{1:t-1}, b_t)\right\} \qquad (23.14) $$
and thus verifying admissibility reduces to checking
$$ \frac{1}{2} + \sup_{x_t}\mathbb{E}_{b_t}\,\mathbb{E}_{W\sim p}\,\Phi(W, x_{1:t}, y_{1:t-1}, b_t) \;\le\; \mathbb{E}\,\Phi(W', x_{1:t-1}, y_{1:t-1}) \qquad (23.15) $$
for some $W'$. This step can be verified under a condition akin to Assumption 23.1. Alternatively, this inequality holds in the so-called transductive (or fixed design) setting. This is the subject of the next chapter.

24 Algorithms for Fixed Design

24.1 ... And the Tree Disappears

In this chapter, we will study the supervised learning problem under the assumption that $\mathcal{X}$ is a finite set and the side information $x_t$ at time $t$ is chosen from this set without replacement. Once this side information is revealed, the supervised protocol is as before: the learner makes a prediction $\hat{y}_t\in\mathcal{D}\subset\mathbb{R}$ (or chooses a randomized strategy $q_t$), and the outcome $y_t\in\mathbb{R}$ is subsequently revealed. We will refer to this setting as "fixed design". As we will see shortly, in the protocol described above, the tree $\mathbf{x}$ (an obstacle to obtaining computationally efficient algorithms) disappears. In the beginning of the course, we wrote down minimax regret for arbitrary sequences $(x_1,y_1),\ldots,(x_n,y_n)$. How do we ensure that $x$'s are not repeated in this sequence? For this, we need to introduce a notion of restrictions on sequences.

Abstractly, at each time step $t$, we may define a set of allowed choices for $x_t$ by $C_t\subseteq\mathcal{X}$. This restriction can be viewed as a set-valued function $(x_1,\ldots,x_{t-1})\mapsto C_t(x_1,\ldots,x_{t-1})$. The notation makes explicit the fact that the set may depend on the history (in fact it can also depend on the $y$ sequence, but then the analysis becomes more involved [46]). The case of observing non-repeating $x$'s corresponds to choosing
$$ C_t(x_1,\ldots,x_{t-1}) = \mathcal{X}\setminus\{x_1,\ldots,x_{t-1}\}. \qquad (24.1) $$


We may now write down the minimax value as

$$ \sup_{x_1\in C_1}\inf_{q_1}\sup_{y_1}\mathbb{E}_{\hat{y}_1\sim q_1}\ \cdots\ \sup_{x_n\in C_n(x_{1:n-1})}\inf_{q_n}\sup_{y_n}\mathbb{E}_{\hat{y}_n\sim q_n}\left\{ \frac{1}{n}\sum_{t=1}^{n}\ell(\hat{y}_t, y_t) - \inf_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^{n}\ell(f(x_t), y_t)\right\} \qquad (24.2) $$
Then, a generalization of Proposition 14.9 holds:

Proposition 24.1. For a supervised problem with constraints, and with a loss $\ell(\cdot,y)$ that is convex and $L$-Lipschitz for any $y\in\mathcal{Y}$, the minimax regret in (24.2) is upper bounded by
$$ 2L\,\sup_{\mathbf{x}\in\mathcal{C}}\,\mathbb{E}_{\epsilon}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^{n}\epsilon_t f(\mathbf{x}_t(\epsilon)) \qquad (24.3) $$
where $\mathcal{C}$ is the set of $\mathcal{X}$-valued trees $\mathbf{x}$ that respect the constraints on every path:
$$ \forall\, 1\le t\le n, \quad \mathbf{x}_t(\epsilon)\in C_t\big(\mathbf{x}_1,\mathbf{x}_2(\epsilon),\ldots,\mathbf{x}_{t-1}(\epsilon)\big) $$
for any $\epsilon\in\{\pm 1\}^n$. We now show that in the case of constraints (24.1), the supremum can be equivalently taken over constant-level trees:

Lemma 24.2. For $n = |\mathcal{X}|$ and constraints (24.1), the supremum over $\mathbf{x}\in\mathcal{C}$ in Proposition 24.1 is achieved at a constant-level tree. Hence, the upper bound of Proposition 24.1 is equal to $2L\,\hat{R}^{iid}(\mathcal{F}; x_1,\ldots,x_n)$.

In view of Lemma 24.2, the regret behavior in the fixed design setting of online supervised learning is governed by i.i.d., rather than sequential, complexities. This important observation translates into efficient algorithms, as the tree can be removed from the conditional sequential Rademacher relaxation:

Lemma 24.3. Consider the fixed design setting of online supervised learning with a convex $L$-Lipschitz loss $\ell(\cdot,y)$ for all $y\in\mathcal{Y}\subset\mathbb{R}$. Then
$$ \mathrm{R}_n\big((x_1,y_1),\ldots,(x_t,y_t)\big) = \mathbb{E}_{\epsilon_{t+1:n}}\sup_{f\in\mathcal{F}}\left[ 2L\sum_{x_s\in\mathcal{X}\setminus\{x_1,\ldots,x_t\}}\epsilon_s f(x_s) - \sum_{s=1}^{t}\ell(f(x_s), y_s)\right] \qquad (24.4) $$
is an admissible relaxation.


In addition to the Lipschitz-loss setting, we can consider the indicator loss $\ell(\hat{y},y) = \mathbf{1}\{\hat{y}\ne y\}$. If $\mathcal{D} = \mathcal{Y} = \{\pm 1\}$, we may substitute $\mathbf{1}\{\hat{y}\ne y\} = \frac{1}{2}(1-\hat{y}y)$, which is $1/2$-Lipschitz. The relaxation becomes
$$ \mathrm{R}_n\big((x_1,y_1),\ldots,(x_t,y_t)\big) = \mathbb{E}_{\epsilon}\sup_{f\in\mathcal{F}}\left[ \sum_{x_s\in\mathcal{X}\setminus\{x_1,\ldots,x_t\}}\epsilon_s f(x_s) - \frac{1}{2}\sum_{s=1}^{t}\big(1 - y_s f(x_s)\big)\right] \qquad (24.5) $$
It is easy to verify that this relaxation is indeed admissible. In view of Eq. (23.15) adapted to the case of $\mathcal{Y} = \{+1,-1\}$, it is enough to check that
$$ \frac{1}{2} + \sup_{x_t\in C(x_{1:t-1})}\mathbb{E}_{\epsilon_t}\,\mathbb{E}_{\epsilon}\sup_{f\in\mathcal{F}}\left[ \sum_{x_s\in\mathcal{X}\setminus\{x_1,\ldots,x_t\}}\epsilon_s f(x_s) - \frac{1}{2}\sum_{s=1}^{t-1}\big(1 - y_s f(x_s)\big) - \frac{1}{2}\big(1 - \epsilon_t f(x_t)\big)\right] $$
$$ \le\ \mathbb{E}_{\epsilon}\sup_{f\in\mathcal{F}}\left[ \sum_{x_s\in\mathcal{X}\setminus\{x_1,\ldots,x_{t-1}\}}\epsilon_s f(x_s) - \frac{1}{2}\sum_{s=1}^{t-1}\big(1 - y_s f(x_s)\big)\right] \qquad (24.6) $$
Exercise (⋆): Prove inequality (24.6).

Several remarks are in order. First, observe that the relaxation is of the form studied in Section 23.1, and thus randomized methods are directly applicable. Second, if $\mathcal{F}$ is a class of linear functions on $\mathcal{X}$, the mixed strategy for predicting $\hat{y}_t$ can be obtained by standard linear optimization methods.
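To illustrate how the relaxation (24.5) can be used directly, here is a minimal sketch (an assumption-laden illustration, not the text's algorithm) that estimates the relaxation by Monte Carlo over the Rademacher signs for a small finite class $\mathcal{F}$ of $\pm 1$-valued functions on a finite $\mathcal{X}$, and forms a randomized prediction by balancing the two label continuations, in the spirit of (23.12) adapted to $\pm 1$ labels.

```python
import numpy as np

def relaxation(F, X, seen, mc=200, rng=None):
    """Monte-Carlo estimate of the relaxation (24.5).

    F: list of dicts mapping each x in X to a value in {-1, +1};
    seen: list of (x, y) pairs observed so far, with y in {-1, +1}."""
    rng = rng or np.random.default_rng(0)
    used = {x for x, _ in seen}
    unseen = [x for x in X if x not in used]
    total = 0.0
    for _ in range(mc):
        eps = rng.choice([-1, 1], size=len(unseen))
        best = max(
            sum(e * f[x] for e, x in zip(eps, unseen))
            - 0.5 * sum(1 - y * f[x] for x, y in seen)
            for f in F
        )
        total += best
    return total / mc

def predict(F, X, seen, x_t, rng=None):
    """Probability of predicting +1 at x_t, obtained by balancing the two
    label continuations of the relaxation (then clipping to [0, 1])."""
    r_plus = relaxation(F, X, seen + [(x_t, +1)], rng=rng)
    r_minus = relaxation(F, X, seen + [(x_t, -1)], rng=rng)
    return min(1.0, max(0.0, 0.5 + 0.5 * (r_plus - r_minus)))

if __name__ == "__main__":
    X = list(range(6))
    F = [{x: (1 if x >= a else -1) for x in X} for a in range(7)]  # small "threshold" class
    seen = [(0, -1), (4, +1)]
    print("q(predict +1 at x=2):", predict(F, X, seen, 2))
```

Because $\mathcal{X}$ is finite and no tree appears, each evaluation is a plain i.i.d.-style Rademacher average over the not-yet-seen points, which is what makes the fixed-design setting computationally tractable.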

24.2 Static Experts

Cesa-Bianchi and Lugosi [15] observed that minimax regret for the problem of sequential prediction with static experts is given by the classical Rademacher averages of the class of experts. This, in turn, motivated much of the work presented in this course. Let $\mathcal{F}\subset[0,1]^n$. Each $f\in\mathcal{F}$ is viewed as an expert that conveys a prediction $f_t\in[0,1]$. This prediction is, therefore, a function of time only. Taking $\mathcal{Y}=\{0,1\}$, we may view each prediction $f_t$ as a mixed strategy for predicting the next outcome. The expected indicator loss $\mathbb{E}_{y\sim f_t}\mathbf{1}\{y\ne y_t\}$ can be written simply as $\ell(f_t,y_t) = |f_t - y_t|$. Taking the decision set $\mathcal{D}=[0,1]$, we may write regret as
$$ \frac{1}{n}\sum_{t=1}^{n}|\hat{y}_t - y_t| - \inf_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^{n}|f_t - y_t|. $$


The easiest way to map the problem into our general framework is to view the time index as the side information $x_t = t$ from $\mathcal{X}=\{1,\ldots,n\}$, with $f(x_t) = f_t$. Now, the constraints on the presentation of $x_t$ are given trivially by
$$ C_t(x_1,\ldots,x_{t-1}) = \{t\}, $$
which is even more restrictive than (24.1). By Lemma 24.2, minimax regret is upper bounded by the classical Rademacher averages of $\mathcal{F}$. Furthermore, one may obtain efficient algorithms using the relaxation in Lemma 24.3.

24.3 Social Learning / Network Prediction

Consider a weighted graph $G = (V,E,W)$, where $V$ is the set of vertices, $E$ the set of edges, and $W: E\to[-1,1]$ the weights on the edges. We may think of the graph as a social network with positive and negative relationships between agents. We are interested in the problem of online node classification on this graph.

Figure 24.1: Weighted graph $G = (V,E,W)$. Edge color and thickness correspond to the sign and magnitude of the weight.

More precisely, at each time step, the identity of a vertex $v_t\in V$ is revealed. We view this identity as the side information provided to the learner. Following the assumption of the previous section, the nodes are not repeated: once a prediction has been made and a label has been observed, we do not come back to the same node.

24.4 Matrix Completion / Netflix Problem

25 Adaptive Algorithms

25.1 Adaptive Relaxations

So far in the course, the upper bounds on regret have been uniform over all sequences. There is a sense, however, in which certain sequences are "easier" and others "harder". We might therefore try to prove upper bounds of the form
$$ \frac{1}{n}\sum_{t=1}^{n}\ell(\hat{y}_t, z_t) - \inf_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^{n}\ell(f, z_t) \;\le\; \psi(z_1,\ldots,z_n) \qquad (25.1) $$
for some function $\psi$ that need not be constant. Within various contexts, such upper bounds have appeared in the literature with $\psi$ being some notion of a "variance" of the sequence, a "length" of the sequence, and so on. The reader should correctly observe the resemblance of (25.1) to the notion studied in the very first lecture (see Eq. (2.2)). We will come back to this example in just a bit, but let us first state the straightforward extension of the relaxation framework to regret against a benchmark $\phi(z_1,\ldots,z_n)$:
$$ \frac{1}{n}\sum_{t=1}^{n}\ell(\hat{y}_t, z_t) \;\le\; \phi(z_1,\ldots,z_n) \qquad (25.2) $$
where (25.1) corresponds to the particular choice of benchmark $\phi(z_1,\ldots,z_n) = \inf_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^{n}\ell(f,z_t) + \psi(z_1,\ldots,z_n)$. We say $\mathrm{Rel}_n(\cdot)$ is an admissible relaxation if for any $z_1,\ldots,z_n$, the initial condition
$$ \mathrm{Rel}_n(z_1,\ldots,z_n) \;\ge\; -n\,\phi(z_1,\ldots,z_n) \qquad (25.3) $$

holds, along with the recursive condition

$$ \mathrm{Rel}_n(z_1,\ldots,z_{t-1}) \;\ge\; \inf_{q_t}\sup_{z_t}\mathbb{E}_{\hat{y}_t\sim q_t}\left\{ \ell(\hat{y}_t, z_t) + \mathrm{Rel}_n(z_1,\ldots,z_t)\right\} \qquad (25.4) $$
(The case of supervised learning is treated as in (22.15).) It is easy to see that a strategy guaranteeing (25.4) ensures
$$ \forall z_1,\ldots,z_n, \qquad \frac{1}{n}\sum_{t=1}^{n}\mathbb{E}_{\hat{y}_t\sim q_t}\,\ell(\hat{y}_t, z_t) - \phi(z_1,\ldots,z_n) \;\le\; \frac{1}{n}\,\mathrm{Rel}_n(\emptyset) \qquad (25.5) $$

25.2 Example: Bit Prediction from Lecture 1

We are now ready to give a simple proof of Proposition 2.1 for the case of bit prediction without side information. The claim is that for a stable $\phi$ (in the sense of Eq. (2.3)), there exists a prediction algorithm guaranteeing
$$ \frac{1}{n}\sum_{t=1}^{n}\mathbb{E}_{\hat{y}_t\sim q_t}\,\ell(\hat{y}_t, y_t) \;\le\; \phi(y_1,\ldots,y_n) \qquad (25.6) $$
for any sequence $y_1,\ldots,y_n\in\{0,1\}$ if and only if $\mathbb{E}\phi \ge \frac{1}{2}$ under the uniform distribution. The necessity of the latter condition was already shown in Chapter 2. To see the sufficiency, define a relaxation
$$ \mathrm{Rel}_n(y_1,\ldots,y_t) = -n\,\mathbb{E}\big[\phi(y_1,\ldots,y_n)\,\big|\, y_1,\ldots,y_t\big] + n\,\mathbb{E}\phi - \frac{t}{2} \qquad (25.7) $$
We then check that the initial condition
$$ \mathrm{Rel}_n(y_1,\ldots,y_n) = -n\,\phi(y_1,\ldots,y_n) + n\,\mathbb{E}\phi - \frac{n}{2} \;\ge\; -n\,\phi(y_1,\ldots,y_n) \qquad (25.8) $$
is satisfied, and the final regret bound is
$$ \mathrm{Rel}_n(\emptyset) = -n\,\mathbb{E}\phi + n\,\mathbb{E}\phi - \frac{0}{2} = 0 \qquad (25.9) $$
as desired. It remains to prove admissibility. To this end, observe that the minimum of
$$ \sup_{y_t}\mathbb{E}_{\hat{y}_t\sim q_t}\left\{ \ell(\hat{y}_t, y_t) + \mathrm{Rel}_n(y_1,\ldots,y_t)\right\} $$
has the solution given by (22.18) (and no clipping is necessary because of the stability assumption on $\phi$). Therefore, it is enough to check (22.20), which, for the case of no side information, becomes
$$ \frac{1}{2} + \mathbb{E}_{y_t}\,\mathrm{Rel}_n\big((x_1,y_1),\ldots,(x_t,y_t)\big) \;\le\; \mathrm{Rel}_n\big((x_1,y_1),\ldots,(x_{t-1},y_{t-1})\big). \qquad (25.10) $$
This condition is true by the definition (25.7), concluding the proof.


25.3 Adaptive Gradient Descent

Part IV

Extensions

26 The Minimax Theorem

At the beginning of the course, we phrased a variety of learning problems in the language of minimaxity. Such a formulation turned out to be especially fruitful for the problem of Sequential Prediction, where we were able to upper bound the minimax value by a supremum of a certain stochastic process. The very first step of the proof involved exchanging the min and the max. We never fully justified this step, and simply assumed that the necessary conditions are satisfied for us to proceed. In the simplest (bilinear) form, the minimax theorem is due to von Neumann:

Theorem 26.1. Let $M\in\mathbb{R}^{p\times m}$ for some $p,m\ge 1$, and let $\Delta_p$ and $\Delta_m$ denote the sets of distributions over the rows and columns, respectively. Then
$$ \min_{a\in\Delta_p}\max_{b\in\Delta_m} a^{\mathsf{T}}Mb = \max_{b\in\Delta_m}\min_{a\in\Delta_p} a^{\mathsf{T}}Mb \qquad (26.1) $$
Since von Neumann, there have been numerous extensions of this theorem, generalizing the statement to uncountable sets of rows/columns, and relaxing the assumptions on the payoff function (which is bilinear in the above statement). To make our discussion more precise, consider the following definition:

Definition 26.2. Consider sets $A$ and $B$ and a function $\ell: A\times B\to\mathbb{R}$. We say that the minimax theorem holds for the triple $(A,B,\ell)$ if
$$ \inf_{a\in A}\sup_{b\in B}\ell(a,b) = \sup_{b\in B}\inf_{a\in A}\ell(a,b) \qquad (26.2) $$
With this definition, von Neumann's result says that the minimax theorem holds for the triple $(\Delta_p,\Delta_m,\ell)$ with $\ell(a,b) = a^{\mathsf{T}}Mb$ for $a\in\Delta_p$, $b\in\Delta_m$.

In this lecture, we prove a rather general version of the minimax theorem using the Exponential Weights algorithm (the proof is borrowed from [16]), and discuss the important relationship between the minimax theorem and regret minimization. As we show, this relationship is quite subtle.

26.1 When the Minimax Theorem Does Not Hold

We now give two examples for which the minimax theorem does not hold. The first is the famous "Pick the Bigger Integer" game, attributed to Wald [56].

Example 20. The two players in the zero-sum game pick positive integers, and the one with the larger number wins. It is not hard to see that the player making the first move has a disadvantage. As soon as her mixed strategy is revealed, the opponent can simply choose a number in the 99th percentile of this distribution in order to win most of the time. The same holds when the order is switched, now to the advantage of the other player, and so the minimax theorem does not hold.

More precisely, let $A = B = \Delta(\mathbb{N})$ be the two sets, and let $\ell(a,b) = \mathbb{E}_{w\sim a,\,v\sim b}\,\mathbf{1}\{w\le v\}$. The minimax theorem does not hold for the triple $(\Delta(\mathbb{N}),\Delta(\mathbb{N}),\ell)$, as the gap between $\inf_a\sup_b\ell(a,b)$ and $\sup_b\inf_a\ell(a,b)$ can be made as close to 1 as we want.

The above example might lead us to believe that the reason the minimax theorem does not hold is that we consider distributions over an unbounded set $\mathbb{N}$. However, the above game can be embedded into the $[0,1]$ interval as follows:

Example 21. The two players in the zero-sum game pick real numbers in the interval $[0,1]$. Any player that picks 1 loses (it is a tie if both pick 1); otherwise, the person with the larger real number wins. The setup forces the players into the same situation as in the "Pick the Bigger Integer" example, as the larger number wins while the limit point 1 should be avoided. Let $A = B = \Delta([0,1])$ and $\ell(a,b) = \mathbb{E}_{w\sim a,\,v\sim b}\,\tilde{\ell}(w,v)$ with
$$ \tilde{\ell}(w,v) = \begin{cases} 1 & \text{if } w = 1,\ v\ne 1,\\ 0 & \text{if } w\ne 1,\ v = 1,\\ 0.5 & \text{if } w = v = 1,\\ \mathbf{1}\{w\le v\} & \text{otherwise.}\end{cases} $$
We can use the same reasoning as in Example 20 to show that the minimax theorem does not hold for the triple $(\Delta([0,1]),\Delta([0,1]),\ell)$.


26.2 The Minimax Theorem and Regret Minimization

From the previous section, we see that there exist some rather simple cases when the minimax theorem fails to hold. What can we say about regret minimization in these situations? Of course, if the minimax theorem does not hold, our proof technique of upper bounding the value by exchanging the infima and suprema fails. However, could it be that a regret minimizing strategy exists, but we must develop it algorithmically without ever considering the value of the prediction problem? Surprisingly, this is not the case. In other words, any time the minimax theorem fails, we have no hope of minimizing regret.

Proposition 26.3. Let $\mathcal{F}$ and $\mathcal{Z}$ be the sets of moves of the learner and Nature, respectively. Let $\ell: \mathcal{F}\times\mathcal{Z}\to\mathbb{R}$ be a bounded loss function. If the minimax theorem does not hold for the triple $\big(\Delta(\mathcal{F}),\Delta(\mathcal{Z}),\mathbb{E}_{f,z}\,\ell(f,z)\big)$, the regret is lower bounded by a constant for any $n$, and thus the problem is not learnable. Equivalently, if there is a regret-minimizing strategy, i.e. the value of the game satisfies
$$ \lim_{n\to\infty}\mathcal{V}^{seq}(\mathcal{F},n) = 0, $$
then it must be that the minimax theorem holds for the triple $\big(\Delta(\mathcal{F}),\Delta(\mathcal{Z}),\mathbb{E}_{f,z}\,\ell(f,z)\big)$.

Proof of Proposition 26.3. If the one-round game does not have a value, there exists a constant $c > 0$ such that
$$ \inf_{q\in\Delta(\mathcal{F})}\sup_{z\in\mathcal{Z}}\mathbb{E}_{f\sim q}\,\ell(f,z) \;\ge\; \sup_{p\in\Delta(\mathcal{Z})}\inf_{f\in\mathcal{F}}\mathbb{E}_{z\sim p}\,\ell(f,z) + c \qquad (26.3) $$
Let $\pi = \{\pi_t\}$ denote a randomized strategy of the learner with $\pi_t: \mathcal{Z}^{t-1}\to\Delta(\mathcal{F})$. Let $\mathbb{E}_\pi$ denote expectation under the randomization of the strategy $\pi$, and let $\tau = \{\tau_t\}$ be a deterministic strategy of Nature. By definition, the value is
$$ \mathcal{V}^{seq}(\mathcal{F},n) = \inf_{\pi}\sup_{\tau}\mathbb{E}_\pi\left\{ \frac{1}{n}\sum_{t=1}^{n}\ell(f_t,z_t) - \inf_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^{n}\ell(f,z_t)\right\} = \inf_{\pi}\sup_{\tau}\mathbb{E}_\pi\left\{ \frac{1}{n}\sum_{t=1}^{n}\ell(f_t,z_t) - \inf_{f\in\mathcal{F}}\mathbb{E}_{z\sim\hat{P}}\,\ell(f,z)\right\} $$
$$ \ge \inf_{\pi}\sup_{\tau}\mathbb{E}_\pi\left\{ \frac{1}{n}\sum_{t=1}^{n}\ell(f_t,z_t) - \sup_{p}\inf_{f\in\mathcal{F}}\mathbb{E}_{z\sim p}\,\ell(f,z)\right\} = \inf_{\pi}\sup_{\tau}\mathbb{E}_\pi\left\{ \frac{1}{n}\sum_{t=1}^{n}\ell(f_t,z_t)\right\} - \sup_{p}\inf_{f\in\mathcal{F}}\mathbb{E}_{z\sim p}\,\ell(f,z) $$
(here $\hat{P}$ denotes the empirical distribution of $z_1,\ldots,z_n$).


Now,
$$ \inf_{\pi}\sup_{\tau}\mathbb{E}_\pi\left\{ \frac{1}{n}\sum_{t=1}^{n}\ell(f_t,z_t)\right\} = \frac{1}{n}\sum_{t=1}^{n}\ \inf_{q_t\in\Delta(\mathcal{F})}\sup_{z_t\in\mathcal{Z}}\mathbb{E}_{f_t\sim q_t}\,\ell(f_t,z_t) \;\ge\; \sup_{p\in\Delta(\mathcal{Z})}\inf_{f\in\mathcal{F}}\mathbb{E}_{z\sim p}\,\ell(f,z) + c, $$
and the statement follows. The converse statement is immediate.

Proposition 26.3 effectively says that the set of problems that are learnable in the sequential prediction framework is a subset of the problems for which the one-shot minimax theorem holds:

Figure 26.2: The set of problems for which a regret minimizing strategy exists is contained in the set of problems for which the minimax theorem holds.

The inclusion suggests the following intriguing "amplification" strategy for proving regret bounds: first, find a weak learning method with a possibly suboptimal bound that nevertheless ensures vanishing regret; this implies the minimax theorem, which in turn implies that the machinery developed so far in the course can be used to find optimal or near-optimal algorithms (more on this in the next few lectures!). Of course, Proposition 26.3 only tells us about the "bilinear" form of the minimax theorem, since it only considers mixed strategies over $\mathcal{F}$ and $\mathcal{Z}$. The proof of the Proposition, however, can be extended as follows:

Proposition 26.4. Let $A$ and $B$ be two sets, with $B$ convex. Let $\ell: A\times B\to\mathbb{R}$ be a bounded function such that $\ell(a,\cdot)$ is concave for any $a\in A$. If the minimax theorem does not hold for the triple $(A,B,\ell)$, then the regret
$$ \frac{1}{n}\sum_{t=1}^{n}\ell(a_t,b_t) - \inf_{a\in A}\frac{1}{n}\sum_{t=1}^{n}\ell(a,b_t) \qquad (26.4) $$
is lower bounded by a constant for any $n$ and for any regret minimization strategy that deterministically chooses the $a_t$'s. Equivalently, if there is a regret-minimizing strategy that deterministically chooses the $a_t$'s, then it must be that the minimax theorem holds for the triple $(A,B,\ell)$.


Exercise (⋆): Prove Proposition 26.4.

Proposition 26.3 is a consequence of Proposition 26.4, obtained by taking $A = \Delta(\mathcal{F})$, $B = \Delta(\mathcal{Z})$, and the loss function as the expectation under the two mixed strategies. In an obvious manner, we are treating the deterministic choice of mixed strategies as equivalent to randomized choices over $\mathcal{F}$ and $\mathcal{Z}$. Online convex optimization is a class of problems where a deterministic strategy can minimize regret over $A = \mathcal{F}$ because the loss function $\ell(f,z)$ is convex in $f\in\mathcal{F}$. Interestingly, a regret minimization strategy for online convex optimization can be used to prove more general versions of the minimax theorem than the bilinear form. We can do so by an auxiliary regret minimization game which, with some extra work, implies the minimax theorem. This is shown in the next section.

26.3 Proof of a Minimax Theorem Using Exponential Weights

The following theorem and its proof are taken almost verbatim from [16].

Theorem 26.5 (Theorem 7.1 in [16]). Suppose $\ell: A\times B\to\mathbb{R}$ is a bounded function, where $A$ and $B$ are convex and $A$ is compact. Suppose that $\ell(\cdot,b)$ is convex and continuous for each fixed $b\in B$, and $\ell(a,\cdot)$ is concave for each $a\in A$. Then the minimax theorem holds for $(A,B,\ell)$.

Proof. The direction
$$ \inf_{a\in A}\sup_{b\in B}\ell(a,b) \;\ge\; \sup_{b\in B}\inf_{a\in A}\ell(a,b) \qquad (26.5) $$
is immediate, and it holds without any assumptions on the function and its domain. Without loss of generality, let $\ell(a,b)\in[0,1]$ for all $a\in A$, $b\in B$. Fix an $\epsilon > 0$ and let $a^{(1)},\ldots,a^{(N)}$ be the centers of an $\epsilon$-cover of $A$, which is finite by the compactness of $A$. We now treat the elements $a^{(i)}$ as experts and update an exponential weights distribution according to the following construction. The sequences $a_1,\ldots,a_n$ and $b_0,\ldots,b_n$ are defined recursively, with $b_0$ chosen arbitrarily. Let
$$ a_t = \frac{\sum_{i=1}^{N} a^{(i)}\exp\left\{-\eta\sum_{s=1}^{t-1}\ell(a^{(i)},b_s)\right\}}{\sum_{i=1}^{N}\exp\left\{-\eta\sum_{s=1}^{t-1}\ell(a^{(i)},b_s)\right\}} $$
with $\eta = \sqrt{\frac{8\ln N}{n}}$, and $b_t$ chosen so that $\ell(a_t,b_t) \ge \sup_{b\in B}\ell(a_t,b) - 1/n$. We now appeal to the version of Exponential Weights for Prediction with Expert Advice and its regret bound given in (18.4):
$$ \frac{1}{n}\sum_{t=1}^{n}\ell(a_t,b_t) \;\le\; \min_{i\in\{1,\ldots,N\}}\frac{1}{n}\sum_{t=1}^{n}\ell(a^{(i)},b_t) + \sqrt{\frac{\ln N}{2n}} \qquad (26.6) $$
Then,

$$ \inf_{a\in A}\sup_{b\in B}\ell(a,b) \;\le\; \sup_{b\in B}\ell\!\left(\frac{1}{n}\sum_{t=1}^{n}a_t,\, b\right) \;\le\; \sup_{b\in B}\frac{1}{n}\sum_{t=1}^{n}\ell(a_t,b) \;\le\; \frac{1}{n}\sum_{t=1}^{n}\sup_{b\in B}\ell(a_t,b) \qquad (26.7) $$
by convexity of $\ell$ in the first argument. Thanks to the definition of $b_t$ and the regret bound,
$$ \frac{1}{n}\sum_{t=1}^{n}\sup_{b\in B}\ell(a_t,b) \;\le\; \frac{1}{n}\sum_{t=1}^{n}\ell(a_t,b_t) + \frac{1}{n} \;\le\; \min_{i\in\{1,\ldots,N\}}\frac{1}{n}\sum_{t=1}^{n}\ell(a^{(i)},b_t) + \sqrt{\frac{\ln N}{2n}} + \frac{1}{n} \qquad (26.8) $$

Concavity of $\ell$ in the second argument implies
$$ \min_{i\in\{1,\ldots,N\}}\frac{1}{n}\sum_{t=1}^{n}\ell(a^{(i)},b_t) \;\le\; \min_{i\in\{1,\ldots,N\}}\ell\!\left(a^{(i)},\frac{1}{n}\sum_{t=1}^{n}b_t\right) \;\le\; \sup_{b\in B}\min_{i\in\{1,\ldots,N\}}\ell\big(a^{(i)},b\big) \qquad (26.9) $$
We conclude that
$$ \inf_{a\in A}\sup_{b\in B}\ell(a,b) \;\le\; \sup_{b\in B}\min_{i\in\{1,\ldots,N\}}\ell(a^{(i)},b) + \sqrt{\frac{\ln N}{2n}} + \frac{1}{n}, $$
which implies
$$ \inf_{a\in A}\sup_{b\in B}\ell(a,b) \;\le\; \sup_{b\in B}\min_{i\in\{1,\ldots,N\}}\ell(a^{(i)},b) $$
if we let $n$ go to infinity. It remains to let $\epsilon\to 0$ and use continuity of $\ell$ to conclude
$$ \inf_{a\in A}\sup_{b\in B}\ell(a,b) \;\le\; \sup_{b\in B}\inf_{a\in A}\ell(a,b). \qquad (26.10) $$
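The construction in the proof is easy to simulate. The sketch below (an illustration under the assumption of a finite bilinear game, where the $\epsilon$-cover is simply the finite set of rows) runs exponential weights over the rows against approximate best-response columns and reports the duality gap certified by the averaged strategies.

```python
import numpy as np

def minimax_via_exp_weights(M, n=5000):
    """Run the construction from the proof of Theorem 26.5 on a finite
    bilinear game with payoff matrix M (rows are our pure actions).

    Returns the worst-case value guaranteed by the averaged row strategy
    and the value guaranteed by the averaged column play."""
    N, m = M.shape
    eta = np.sqrt(8 * np.log(N) / n)
    cum = np.zeros(N)                    # cumulative losses of the "experts" (rows)
    a_bar = np.zeros(N)                  # running average of the mixed rows a_t
    b_bar = np.zeros(m)                  # running average of the columns b_t
    for _ in range(n):
        w = np.exp(-eta * (cum - cum.min()))
        a_t = w / w.sum()                # exponential-weights mixed row
        b_t = int(np.argmax(a_t @ M))    # (near-)best response of the column player
        cum += M[:, b_t]                 # each row's loss against b_t
        a_bar += a_t / n
        b_bar[b_t] += 1.0 / n
    upper = float(np.max(a_bar @ M))     # a_bar certifies: inf sup <= upper
    lower = float(np.min(M @ b_bar))     # b_bar certifies: sup inf >= lower
    return upper, lower

if __name__ == "__main__":
    M = np.array([[0.0, 1.0, 0.3],
                  [1.0, 0.0, 0.7],
                  [0.4, 0.6, 0.5]])
    upper, lower = minimax_via_exp_weights(M)
    print("duality gap:", upper - lower)   # shrinks at the rate of the regret bound
```

The printed gap is exactly the quantity that the chain (26.7)-(26.9) bounds by the regret of exponential weights, so it decays as $O(\sqrt{\log N / n})$.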

Exercise (⋆⋆): Suppose $A$ and $B$ are unit balls in an infinite-dimensional Hilbert space (hence, $A$ and $B$ are not compact). Let $\ell: A\times B\to\mathbb{R}$ be such that $\ell(\cdot,b)$ is convex and 1-Lipschitz (in the Hilbert space norm) for each fixed $b\in B$, and $\ell(a,\cdot)$ is concave for each $a\in A$. Prove that the minimax theorem holds for $(A,B,\ell)$.

Exercise (⋆⋆): As in the previous exercise, suppose $A$ and $B$ are unit balls in an infinite-dimensional Hilbert space. Let $\ell: A\times B\to\mathbb{R}$ be such that $\ell(\cdot,b)$ is convex and $|\ell(\cdot,b)|\le 1$ for each fixed $b\in B$, and $\ell(a,\cdot)$ is concave for each $a\in A$. Prove that the minimax theorem holds for $(A,B,\ell)$.

26.4 More Examples

Example 22. We now present a problem for which the minimax theorem holds, yet the associated regret minimization problem is impossible. This supports the strict subset in Figure 26.2. Consider the example of sequential prediction with thresholds on the unit interval. It was shown in Theorem 8.2 that the problem is not learnable. The question now is whether the minimax theorem holds. More precisely, we let $\mathcal{F} = \mathcal{X} = [0,1]$, $\mathcal{Y} = \{0,1\}$, and
$$ \ell(f,(x,y)) = \mathbf{1}\big\{\mathbf{1}\{x\le f\}\ne y\big\} = \mathbf{1}\big\{(x\le f \wedge y=0)\vee(x>f \wedge y=1)\big\}. $$
Here, we are using $f$ to denote the threshold location instead of $\theta$. Consider the sets of mixed strategies $\Delta(\mathcal{F})$ and $\Delta(\mathcal{X}\times\mathcal{Y})$. Let us inspect
$$ \inf_{q\in\Delta(\mathcal{F})}\ \sup_{(x,y)\in\mathcal{X}\times\mathcal{Y}}\ \mathbb{E}_{f\sim q}\,\ell(f,(x,y)) $$
Consider a $q$ that puts half of the mass at $f=1$ and half at $f=0$. For any $(x,y)$, the expected loss is $1/2$. To prove the minimax theorem, it remains to show that
$$ \sup_{p\in\Delta(\mathcal{X}\times\mathcal{Y})}\ \inf_{f\in\mathcal{F}}\ \mathbb{E}_{(x,y)\sim p}\,\ell(f,(x,y)) \;\ge\; \frac{1}{2} $$
This can be achieved, for instance, by taking $p_x$ to be $\delta_{1/2}$ and $p_{y|x}$ a fair coin flip. In fact, using the sufficient conditions in terms of quasi-convexity presented later in this section, it is possible to directly conclude that the minimax theorem holds. We conclude that the problem of prediction with thresholds is not learnable, yet the minimax theorem holds.


Example 23. Now consider the example given in Section 13.1:

$$ \mathcal{F} = \big\{ f_a : a\in[0,1],\ f_a(x) = 0\ \forall x\ne a,\ f_a(a) = 1\big\} \qquad (26.11) $$
over the domain $\mathcal{X} = [0,1]$. Let $\mathcal{Y} = \{0,1\}$ and define the loss function $\ell$ as the indicator of a mistake. It was shown that the problem is learnable, and (as Example 16 shows) the Littlestone dimension is 1. Given Proposition 26.3, it must be the case that the minimax theorem holds.

Exercise (⋆): Give a direct proof that the minimax theorem holds for the triple $\big(\Delta([0,1]),\Delta([0,1]\times\{0,1\}),\ell\big)$.

Discussion: The two examples presented above, together with Example 21 presented earlier, paint a fairly interesting picture. In Example 21, the minimax theorem does not hold, and it is not possible to minimize regret. In Example 22, the minimax theorem holds, yet the prediction problem is impossible. Finally, in Example 23, the minimax theorem holds and regret can be (trivially) minimized. What is rather curious is that the sets of moves and the loss functions look quite similar across these examples, as a mixture of some point discontinuity and a threshold. While finiteness of the Littlestone dimension characterizes learnability within the supervised learning class of problems for which the minimax theorem holds, it is not clear under which conditions the minimax theorem itself holds true. The next section presents some sufficient conditions.

26.5 Sufficient Conditions for Weak Compactness

Suppose that $\mathcal{F}$ is a subset of a complete separable metric space and $\mathcal{B}_{\mathcal{F}}$ is the $\sigma$-field of Borel subsets of $\mathcal{F}$. Let $\Delta(\mathcal{F})$ denote the set of all probability measures on $\mathcal{F}$. Similarly, let $\Delta(\mathcal{X})$ be the set of all probability measures on $\mathcal{X}$. Under consideration is the question of conditions on $\mathcal{F}$ and $\mathcal{X}$ that guarantee that the minimax theorem holds for the triple $(\Delta(\mathcal{F}),\Delta(\mathcal{X}),\mathbb{E}_{f,x}\,\ell)$ for a bounded measurable function $\ell: \mathcal{F}\times\mathcal{X}\to\mathbb{R}$. In addressing this question, we appeal to the following result (in a slightly modified form from the original):

Theorem 26.6 (Theorem 3 in [49]). Let A be a nonempty convex weakly compact subset of a Banach space E. Let B be a nonempty convex subset of the dual space,


$E'$, of $E$. Then
$$ \inf_{a\in A}\sup_{b\in B}\langle a,b\rangle = \sup_{b\in B}\inf_{a\in A}\langle a,b\rangle $$
Under a condition we outline below, we can use the above theorem for our purposes. To this end, write
$$ \langle g_p, q\rangle = \mathbb{E}_{f\sim q,\,x\sim p}\,\ell(f,x) \qquad\text{and}\qquad G = \{g_p : p\in\Delta(\mathcal{X})\} $$
The desired minimax identity can now be written as
$$ \inf_{q\in\Delta(\mathcal{F})}\sup_{g\in G}\langle g,q\rangle = \sup_{g\in G}\inf_{q\in\Delta(\mathcal{F})}\langle g,q\rangle. $$

We can view $g_p$ as a linear functional on $\Delta(\mathcal{F})$, the set of all Borel probability measures on $\mathcal{F}$.

Condition 26.7 (Continuity). Assume that $\Delta(\mathcal{F})$ is a subset of a Banach space $E$ (that is, all the linear functionals defined by $q\in\Delta(\mathcal{F})$ are continuous).

Under the above assumption, it can be verified that the dual $E'$ contains the set of all bounded Borel measurable functions on $\mathcal{F}$. In this case, $G\subseteq E'$. To appeal to the above theorem it is therefore enough to delineate conditions under which $\Delta(\mathcal{F})$ is weakly compact (clearly, $\Delta(\mathcal{F})$ is convex).

Condition 26.8 (Weak Compactness). $\Delta(\mathcal{F})$ is weakly compact.

By a fundamental result of Prohorov, weak compactness of a set of probability measures on a complete separable metric space is equivalent to uniform tightness (see e.g. [11, Theorem 8.6.2], [54]). Depending on the application, various assumptions on $\mathcal{F}$ can be imposed to guarantee uniform tightness of $\Delta(\mathcal{F})$. First, if $\mathcal{F}$ itself is compact, then $\Delta(\mathcal{F})$ is tight, and hence (under the continuity condition) the minimax theorem holds. In infinite dimensions, however, compactness of $\mathcal{F}$ is a very restrictive property. Thankfully, tightness can be established under more general assumptions on $\mathcal{F}$, as we show next. By Example 8.6.5 (ii) in [11], a family $\Delta(\mathcal{F})$ of Borel probability measures on a separable reflexive Banach space $E$ is uniformly tight (under the weak topology) precisely when there exists a function $V: E\to[0,\infty)$, continuous in the norm topology, such that
$$ \lim_{\|f\|\to\infty} V(f) = \infty \qquad\text{and}\qquad \sup_{q\in\Delta(\mathcal{F})}\mathbb{E}_{f\sim q}V(f) < \infty. $$
For instance, if $\mathcal{F}$ is a subset of a ball in $E$, it is enough to take $V(f) = \|f\|$. We conclude that the minimax theorem holds whenever $\mathcal{F}$ is a subset of a ball in a separable reflexive Banach space and the continuity condition holds.

27 Two Proofs of Blackwell's Approachability Theorem

The purpose of this lecture is two-fold. The first goal is to introduce a generalization of the minimax theorem to vector-valued payoffs, due to David Blackwell [8]. The theorem has been an invaluable tool for the analysis of repeated games. Its proof is simple and geometric, while the result is quite non-trivial. The second goal of the lecture is to show that the symmetrization ideas we employed for analyzing minimax regret can be extended to notions beyond regret. We exhibit another (and very different) proof, which is non-constructive but also more general than the constructive approach. Let $\ell: \mathcal{F}\times\mathcal{X}\to\mathbb{R}$ be the loss (payoff) function, which need not be convex. For the purposes of this lecture let us introduce the notation
$$ \ell(q,p) \triangleq \mathbb{E}_{f\sim q,\,x\sim p}\,\ell(f,x) $$
for $q\in\Delta(\mathcal{F})$ and $p\in\Delta(\mathcal{X})$. Since $\ell(q,p)$ is a bilinear function of the mixed strategies, the minimax theorem
$$ \inf_{q\in\Delta(\mathcal{F})}\sup_{p\in\Delta(\mathcal{X})}\ell(q,p) = \sup_{p\in\Delta(\mathcal{X})}\inf_{q\in\Delta(\mathcal{F})}\ell(q,p) \qquad (27.1) $$
holds under quite general assumptions on the (not necessarily finite) $\mathcal{F}$ and $\mathcal{X}$, as we have seen in Chapter 26. We assume that the necessary conditions for the minimax theorem hold. Suppose we take the point of view of the row player (that is, we are choosing from $\mathcal{F}$). Then the right-hand side of Eq. (27.1) can be interpreted as "what is the smallest loss we can incur if we know which $p$ the column player chooses". Such

a response can be a pure action. For the left-hand side, however, we are required to furnish an "oblivious" randomized strategy that will "work" no matter what the opponent chooses. Let the value of the expression in (27.1) be denoted by $V$. The minimax statement, from our point of view, can be written as
$$ \exists q\in\Delta(\mathcal{F})\ \text{s.t.}\ \forall p\in\Delta(\mathcal{X}),\ \ell(q,p)\le V \quad\Longleftrightarrow\quad \forall p\in\Delta(\mathcal{X})\ \exists q\in\Delta(\mathcal{F})\ \text{s.t.}\ \ell(q,p)\le V \qquad (27.2) $$

It is remarkable that being able to respond to the opponent is equivalent to putting down a randomized strategy ahead of time and not worrying about the opponent's choice.

27.1 Blackwell’s vector-valued generalization and the original proof

It is often the case that our utility cannot be measured by a single number, as our goals are incomparable. We would like to make decisions that give us desired payoffs along multiple dimensions. David Blackwell came up with a wonderful generalization of the real-valued minimax theorem. Consider a vector-valued payoff $\ell: \mathcal{F}\times\mathcal{X}\to B$, where $B$ is some Banach space (in Blackwell's original formulation and proof, $B = \mathbb{R}^d$; his results have subsequently been extended to Hilbert spaces). It is natural to define the goal as a set $S\subset B$ of payoffs with which we would be content. Von Neumann's Minimax Theorem can then be seen as a statement about a set $S = (-\infty,c]$: we can ensure that our loss is in $S$ for any $c\ge V$ but not for any $c < V$.

Now that the generalization to vector-valued payoffs is established, is it still true that an equivalent of (27.2) holds? That is,
$$ \exists q\in\Delta(\mathcal{F})\ \text{s.t.}\ \forall p\in\Delta(\mathcal{X}),\ \ell(q,p)\in S \quad\overset{?}{\Longleftrightarrow}\quad \forall p\in\Delta(\mathcal{X})\ \exists q\in\Delta(\mathcal{F})\ \text{s.t.}\ \ell(q,p)\in S $$
This turns out not to be the case in general, even for convex sets $S$. A simple example (borrowed from [1]) is a game where $\mathcal{F} = \mathcal{X} = \{0,1\}$, the payoff is $\ell(f,x) = (f,x)$, and $S = \{(z,z): z\in[0,1]\}$. Clearly, for any $p\in\Delta(\mathcal{X})$, we can choose $q = p$ so that $\ell(q,p)\in S$; however, no silver-bullet strategy $q\in\Delta(\mathcal{F})$ exists that will work for all $p$.


We see that being able to respond to the mixed strategy of the opponent no longer guarantees that we can put down an oblivious mixed strategy that will work no matter what she does. In this respect, the one-dimensional von Neumann minimax theorem is a very special case. It is, however, the case that being able to respond is equivalent to being able to put down an oblivious strategy for any half-space $H = \{u: \langle u,a\rangle\le c\}$ containing $S$:

Theorem 27.1 (Blackwell, 1956 [8]).
$$ \forall p\ \exists q\ \text{s.t.}\ \ell(q,p)\in S \quad\Longleftrightarrow\quad \forall H\supset S,\ \exists q\ \text{s.t.}\ \forall p,\ \ell(q,p)\in H $$
Proof. Suppose there exists a $p$ such that $\ell(q,p)\notin S$ for all $q$. Then there exists a hyperplane separating $S$ from the set $\{\ell(q,p): q\in\Delta(\mathcal{F})\}$, and for the half-space defined by this hyperplane and containing $S$ there exists no $q$ ensuring that $\ell(q,p)\in H$ for all $p$.

The other direction is proved by taking any half-space $H = \{u: \langle u,a\rangle\le c\}$ and defining a scalar auxiliary game $\ell'(q,p) = \langle a, \ell(q,p)\rangle$. Being able to respond to any $p$ with a $q$ that satisfies $\ell(q,p)\in S$ means that we are able to respond in the scalar game so that the loss is at most $c$. But for scalar games, von Neumann's minimax theorem says that being able to respond is equivalent to the existence of an oblivious strategy. This concludes the proof.

At this point, the utility of the above theorem is not clear. The multi-dimensional nature of the payoff unfortunately means that there is no silver-bullet mixed strategy that will get us what we want. The key insight of Blackwell is that we can get what we want in the long run by having an adaptive strategy which tunes to the behavior of the opponent. That is, we can learn the opponent's behavior and play in a way that makes the average payoff approach the set $S$.

Definition 27.2. A set S is approachable by the row player if the row player has a strategy such that no matter how the column player plays,

$$ \lim_{n\to\infty} d\!\left(\frac{1}{n}\sum_{t=1}^{n}\ell(f_t,x_t),\,S\right) = 0 \quad\text{a.s.} $$

In the above definition, $n$ is the number of stages in the repeated game, $f_t$ is a random draw from our strategy $q_t$ at time $t$, and $x_t$ is the opponent's move; $d$ is a distance (e.g. the norm in the Banach space). Approachability is trivial for the scalar game: we can approach the set $S = (-\infty,c]$ if and only if $c\ge V$. The same is true for half-space approachability: a half-space $H = \{u: \langle u,a\rangle\le c\}$ is approachable if and only if there exists $q$ such that for all $p$, $\langle a,\ell(q,p)\rangle\le c$.

It is remarkable that Blackwell completely characterized the situations in which $S$ is approachable. In fact, the situation is quite simple: a set is approachable if and only if we can respond to the mixed strategy of the opponent, that is, if and only if $\forall p\ \exists q$ s.t. $\ell(q,p)\in S$. By the previous theorem, this is also equivalent to approachability of every half-space containing $S$:

1 Pt Proof. Let Lt t s 1 `(fs,xs). We would like to make this average approach S. = = The idea is that we can make progress towards the set by attempting to approach a hyperplane perpendicular to the current projection direction from Lt to the set. This will work because points sufficiently close to Lt 1 along the segment connect- − ing Lt 1 and the new payoff (on average on the other side of the hyperplane) are − closer to the set S than Lt 1. Here is the explicit adaptive strategy we are going to − use. At stage t, play that qt which satisfies ­ ® sup at 1,`(qt ,p) ct 1 p − ≤ − where Lt 1 πS(Lt 1) at 1 − − − , ct 1 at 1,πS(Lt 1) . − = Lt 1 πS(Lt 1) − = 〈 − − 〉 k − − − k Here πS is the Euclidean projection onto S. Note that qt exists by our assumption that we can put down an oblivious mixed strategy for any hyperplane. The rest is a bit of algebra:

d(L ,S)2 L π (L ) 2 t = k t − S t k 2 Lt πS(Lt 1) ≤ k − − k ° °2 ° t 1 1 ° ° − (Lt 1 πS(Lt 1)) (`(ft ,xt ) πS(Lt 1))° = ° t − − − + t − − ° µ ¶2 t 1 2 1 ° °2 − d(Lt 1,S) °`(ft ,xt ) πS(Lt 1)° = t − + t 2 − − t 1 ­ ® 2 − Lt 1 πS(Lt 1),`(ft ,xt ) πS(Lt 1) + t 2 − − − − −


We can rewrite this as

$$ t^2\, d(L_t,S)^2 - (t-1)^2\, d(L_{t-1},S)^2 \;\le\; \mathrm{const} + 2(t-1)\big\langle L_{t-1} - \pi_S(L_{t-1}),\ \ell(f_t,x_t) - \pi_S(L_{t-1})\big\rangle, $$
where we assumed that the payoffs are bounded in $\|\cdot\|$.

Summing the inequalities over all stages, and using the fact that
$$ \big\langle L_{t-1} - \pi_S(L_{t-1}),\ \ell(q_t,p_t) - \pi_S(L_{t-1})\big\rangle \;\le\; 0, $$
we are left with an average of martingale differences
$$ \big\langle L_{t-1} - \pi_S(L_{t-1}),\ \ell(f_t,x_t) - \ell(q_t,p_t)\big\rangle, $$
which tends to zero almost surely.
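Here is a minimal simulation (not from the text) of the projection-based strategy above for the diagonal-set example discussed earlier in this chapter, with payoff $\ell(f,x) = (f,x)$ and $S = \{(z,z): z\in[0,1]\}$. The responding $q_t$ is found by a grid search, purely for illustration.

```python
import numpy as np

def project_to_diagonal(L):
    """Euclidean projection of a point in R^2 onto S = {(z, z) : z in [0, 1]}."""
    z = min(1.0, max(0.0, 0.5 * (L[0] + L[1])))
    return np.array([z, z])

def blackwell_play(n, opponent, seed=0):
    """Blackwell's adaptive strategy for the vector payoff l(f, x) = (f, x),
    f, x in {0, 1}, approaching the diagonal set S.
    `opponent(t, history)` returns x_t in {0, 1}. Returns the distances d(L_t, S)."""
    rng = np.random.default_rng(seed)
    qs = np.linspace(0.0, 1.0, 101)       # grid of mixed strategies, for illustration
    L = np.zeros(2)
    history, dists = [], []
    for t in range(1, n + 1):
        pi = project_to_diagonal(L)
        if np.linalg.norm(L - pi) < 1e-12:
            q = 0.5                        # already (almost) in S: any play is fine
        else:
            a = (L - pi) / np.linalg.norm(L - pi)
            # sup over the opponent of <a, l(q, p)> equals a[0]*q + max(a[1], 0);
            # pick the grid point minimizing it (the theorem guarantees <= <a, pi>)
            q = qs[np.argmin(a[0] * qs + max(a[1], 0.0))]
        f_t = int(rng.random() < q)
        x_t = opponent(t, history)
        payoff = np.array([float(f_t), float(x_t)])
        L = ((t - 1) * L + payoff) / t     # running average of payoffs
        history.append((f_t, x_t))
        dists.append(float(np.linalg.norm(L - project_to_diagonal(L))))
    return dists

if __name__ == "__main__":
    dists = blackwell_play(5000, lambda t, h: int(t % 3 == 0))
    print("d(L_n, S) =", dists[-1])        # tends to zero
```

Recall that for this game no single oblivious mixed strategy works for all opponents, yet the adaptive strategy drives the average payoff to the diagonal, which is exactly the content of the theorem.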

27.2 A non-constructive proof

We now give a very different proof of Blackwell’s Approachability (Theorem 27.3). While it is non-constructive (that is, we do not exhibit a strategy for playing the game), the resulting statement is quite a bit more general and sharp (see [45] for more details). The idea is to bite the bullet and write down the minimax value of the repeated game:

$$ V_n = \inf_{q_1}\sup_{p_1}\ \mathbb{E}_{\substack{f_1\sim q_1\\ x_1\sim p_1}}\ \cdots\ \inf_{q_n}\sup_{p_n}\ \mathbb{E}_{\substack{f_n\sim q_n\\ x_n\sim p_n}}\ d\!\left(\frac{1}{n}\sum_{t=1}^{n}\ell(f_t,x_t),\,S\right) $$
An upper bound on this value means that there exists a sequential strategy (a way of picking the $q_t$'s based on the past history) such that the expected distance to the set is at most this bound. On the other hand, a lower bound gives a guarantee to the opponent. In view of Theorem 27.1, to prove Theorem 27.3 (in expectation) it is enough to show that the value $V_n$ decays to zero under the assumption that for any strategy $p$ of the opponent, there exists a $q^*(p)$ such that $\ell(q^*(p),p)\in S$.

Note that in each pair in $V_n$, the infimum and supremum can be exchanged because what is inside the pair is a bilinear function (an expectation with respect to $q_t$ and $p_t$). Importantly, $p_t$ and $q_t$ only appear in the respective $t$-th expectation, but nowhere else in the expression (otherwise, it might not be a bilinear form). We arrive at

$$ V_n = \sup_{p_1}\inf_{q_1}\ \mathbb{E}_{\substack{f_1\sim q_1\\ x_1\sim p_1}}\ \cdots\ \sup_{p_n}\inf_{q_n}\ \mathbb{E}_{\substack{f_n\sim q_n\\ x_n\sim p_n}}\ d\!\left(\frac{1}{n}\sum_{t=1}^{n}\ell(f_t,x_t),\,S\right) $$

½ µ n ¶ µ n ¶ 1 X 1 X Vn supinf E ...supinf E d `(ft ,xt ),S d `(qt ,pt ),S x p x p = p1 q1 1 1 pn qn n n n t 1 − n t 1 f ∼q fn∼qn 1∼ 1 ∼ = = µ n ¶¾ 1 X d `(qt ,pt ),S + n t 1 = It is easy to check that

$$ V_n = \sup_{p_1}\inf_{q_1}\mathbb{E}\ \cdots\ \sup_{p_n}\inf_{q_n}\mathbb{E}\ \left\{ d\!\left(\frac{1}{n}\sum_{t=1}^{n}\ell(f_t,x_t),\,S\right) - d\!\left(\frac{1}{n}\sum_{t=1}^{n}\ell(q_t,p_t),\,S\right) + d\!\left(\frac{1}{n}\sum_{t=1}^{n}\ell(q_t,p_t),\,S\right)\right\} $$
It is easy to check that
$$ d\!\left(\frac{1}{n}\sum_{t=1}^{n}\ell(f_t,x_t),\,S\right) - d\!\left(\frac{1}{n}\sum_{t=1}^{n}\ell(q_t,p_t),\,S\right) \;\le\; \left\|\frac{1}{n}\sum_{t=1}^{n}\big(\ell(f_t,x_t) - \ell(q_t,p_t)\big)\right\| $$

E(C1(a) C2(a)) E(C1(a)) E(C2(a)), sup(C1(a) C2(a)) supC1(a) supC2(a) + = + a + ≤ a + a while for the infimum

inf(C1(a) C2(a)) supC1(a) infC2(a). a + ≤ a + a

The sequence minimax expression then splits (from inside out) as

° n ° ° 1 X ° Vn supsup E ...supsup E ° (`(ft ,xt ) `(qt ,pt ))° x p x p ≤ p1 q1 1 1 pn qn n n °n t 1 − ° f ∼q fn∼qn 1∼ 1 ∼ = µ n ¶ 1 X supinf E ...supinf E d `(qt ,pt ),S x p x p + p1 q1 1 1 pn qn n n n t 1 f ∼q fn∼qn 1∼ 1 ∼ = For the second term, we use the assumption that we can respond to any strategy p by choosing a q∗(p ) such that `(q∗(p ),p ) S. This means that the second t t t t t t ∈ term µ n ¶ ½ µ n ¶¾ 1 X 1 X supinf E ...supinf E d `(qt ,pt ),S sup...sup d `(q∗(pt ),pt ),S 0 x p x p t p1 q1 1 1 pn qn n n n t 1 ≤ p1 pn n t 1 = f ∼q fn∼qn 1∼ 1 ∼ = =

229 27.3 Discussion where we used convexity of S in the first step. The first term is upper bounded by the variation of the worst-case martingale difference sequence with values in Im(`) B. We conclude that ⊂ ° n ° ° 1 X ° Vn 2supE° dt ° ≤ M °n t 1 ° = where the supremum is over all martingale difference sequences with d Im(`). t ∈

27.3 Discussion

The second proof is non-constructive, yet it yields the statement in Banach spaces where martingale convergence holds. In fact, it is easy to show that this bound is tight: there is a lower bound differing only in a constant. To the best of our knowledge, all previous proofs relied on the inner product structure. As we mentioned, Blackwell's approachability theorem has been used as a tool for repeated games. Typically, one takes the problem at hand (e.g. minimization of internal regret), finds a potentially high-dimensional space and a set S, so that, if we can approach the set, we can solve the original problem. It is shown in [45] that one can analyze problems without appealing to Blackwell's approachability and without artificially embedding the problem at hand into a different space. The fact that Blackwell's approachability itself follows from the tools we have developed is a testament to the power of symmetrization. We refer to [45] for more details.

27.4 Algorithm Based on Relaxations: Potential-Based Approachability

28 From Sequential to Statistical Learning: Relationship Between Values and Online-to-Batch

28.1 Relating the Values

Now that we have a good understanding of complexities for both statistical learning and sequential prediction, it is time to draw some explicit connections between these two settings. First of all, the connection can be made at the level of the minimax values. We can think of a minimax value as a measure of complexity of a set $\mathcal{F}$, and it is natural to ask whether the minimax complexity of $\mathcal{F}$ is greater for sequential or for statistical learning. We have already seen that the sequential Rademacher complexity is an upper bound on the i.i.d. Rademacher complexity, so the following results should come as no surprise. For concreteness, consider the setting of supervised learning with absolute loss. Recall the definition of the value $\mathcal{V}^{seq,ab}(\mathcal{F},n)$ for the sequential problem (see Eq. (5.17)) and $\mathcal{V}^{iid,ab}(\mathcal{F},n)$ for the statistical learning problem (defined in Eq. (5.9)). By Corollary 7.15 and the fact that the absolute loss is 1-Lipschitz,

$$ \mathcal{V}^{iid,ab}(\mathcal{F},n) \;\le\; 2\sup_{P} R^{iid}(\mathcal{F}) \qquad (28.1) $$
and by (7.18),
$$ R^{iid}(\mathcal{F}) \;\le\; \sup_{x_1,\ldots,x_n}\hat{R}^{iid}(\mathcal{F}, x_1,\ldots,x_n) \;\le\; R^{seq}(\mathcal{F},n), \qquad (28.2) $$
where the middle term is $R^{iid}(\mathcal{F},n)$ by definition. On the other hand, by an argument very similar to that of Theorem 13.11,

$$ R^{seq}(\mathcal{F},n) \;\le\; \mathcal{V}^{seq,ab}(\mathcal{F},n). $$
We conclude that
$$ \mathcal{V}^{iid,ab}(\mathcal{F},n) \;\le\; 2\,\mathcal{V}^{seq,ab}(\mathcal{F},n). \qquad (28.3) $$
As we suspected, the sequential prediction problem is harder (up to a constant factor) than the i.i.d. learning problem. Interestingly, by Lemma 12.13, for proper learning,
$$ \mathcal{V}^{iid,ab}(\mathcal{F},n) \;\ge\; R^{iid}(\mathcal{F},2n) - \frac{1}{2}R^{iid}(\mathcal{F},n) \qquad (28.4) $$

This has the following implication: if $R^{iid}(\mathcal{F},2n) - \frac{1}{2}R^{iid}(\mathcal{F},n)$ is of the same order as $R^{seq}(\mathcal{F},n)$, then the difficulties of i.i.d. learning and sequential prediction (with adversarially chosen data) are the same. In view of (28.2), this happens when $R^{seq}(\mathcal{F},n)$ is not much larger than $R^{iid}(\mathcal{F},2n)$. For linear problems this happens when the expected length of a random walk with independent increments is of the same order as that of the worst possible random walk with martingale differences as increments. Such martingale and i.i.d. convergence is governed, respectively, by the M-type and type of the space. While there exist spaces where the two are different, such spaces are quite exotic and difficult to construct. Thus, for all practical purposes, for linear function classes the classical and sequential Rademacher complexities coincide up to a constant. Good news for linear function classes! Further, we already saw that kernel methods can be viewed as working with a linear class of functions in a certain Reproducing Kernel Hilbert Space. Being one of the methods of choice in practice, kernel methods are certainly powerful. Finally, recall that we formalized the above statement (that sequential and i.i.d. Rademacher averages are close) for finite-dimensional linear classes in Assumption 23.1, and indeed the resulting randomized method yielded an upper bound in terms of an i.i.d. (rather than sequential) Rademacher complexity. More generally, for non-linear classes, we get a precise handle on the relative difficulty of statistical vs adversarial learning by studying the gap between the sequential Rademacher averages and the classical Rademacher averages.


28.2 Online to Batch Conversion

Given the close relationship between the values of the i.i.d. and sequential learning problems, we can ask whether algorithms developed in the sequential setting can be used for statistical learning. Since a regret bound holds for all sequences, it also holds for an i.i.d. sequence. What remains to be done is to construct a single estimator out of the sequence produced by the sequential method, and to convert the regret guarantee into a guarantee about excess loss. Such a process has been dubbed an "Online-to-Batch Conversion" [14]. We already performed such a conversion in Section 4, where a regret bound for regression with individual sequences was converted to a near-optimal bound on excess loss. The argument holds in more generality; let us state it for convex loss functions.

Lemma 28.1. Let $\ell: \mathcal{F}\times\mathcal{Z}\to\mathbb{R}$ be convex in the first argument and suppose $\mathcal{F}$ is a convex set.¹ Suppose there exists a strategy for choosing $f_1,\ldots,f_n$ in the sequential prediction problem against an oblivious adversary (who picks a sequence $z_1,\ldots,z_n$) that guarantees a regret bound
$$ \frac{1}{n}\sum_{t=1}^{n}\ell(f_t,z_t) - \inf_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^{n}\ell(f,z_t) \;\le\; \mathbf{R}_n $$
where $\mathbf{R}_n$ may be a function of $z_1,\ldots,z_n$. Then the estimator $\bar{f} = \frac{1}{n}\sum_{t=1}^{n}f_t$ enjoys the excess loss bound $\mathbb{E}\mathbf{R}_n$:
$$ \mathbb{E}\,L(\bar{f}) - \inf_{f\in\mathcal{F}} L(f) \;\le\; \mathbb{E}\mathbf{R}_n $$
In particular, this implies that the value of the statistical learning problem is upper bounded by the value of the corresponding sequential prediction problem (see Eq. (5.14)) against an oblivious (and, hence, non-oblivious) adversary:
$$ \mathcal{V}^{iid}(\mathcal{F},n) \;\le\; \mathcal{V}^{obliv}(\mathcal{F},n) $$
Proof. Just as in the proof of Lemma 4.1, suppose the sequence of Nature's moves is i.i.d. and take an expectation on both sides of the regret expression:
$$ \mathbb{E}\left\{\frac{1}{n}\sum_{t=1}^{n}\ell(f_t, Z_t)\right\} \;\le\; \mathbb{E}\left\{\inf_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^{n}\ell(f, Z_t)\right\} + \mathbb{E}\mathbf{R}_n \qquad (28.5) $$

¹This assumption can be relaxed by considering mixed strategies.


Observe that by Jensen’s inequality,

$$ \mathbb{E}\,L(\bar{f}) = \mathbb{E}\left\{\mathbb{E}\left\{\ell(\bar{f}, Z)\,\middle|\, Z_{1:n}\right\}\right\} \le \mathbb{E}\left\{\frac{1}{n}\sum_{t=1}^{n}\mathbb{E}\left\{\ell(f_t, Z)\,\middle|\, Z_{1:n}\right\}\right\} = \mathbb{E}\left\{\frac{1}{n}\sum_{t=1}^{n}\mathbb{E}\left\{\ell(f_t, Z_t)\,\middle|\, Z_{1:t-1}\right\}\right\} = \mathbb{E}\left\{\frac{1}{n}\sum_{t=1}^{n}\ell(f_t, Z_t)\right\} $$
and
$$ \mathbb{E}\left\{\inf_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^{n}\ell(f, Z_t)\right\} \le \inf_{f\in\mathcal{F}}\frac{1}{n}\sum_{t=1}^{n}\mathbb{E}\,\ell(f, Z_t) = \inf_{f\in\mathcal{F}}\mathbb{E}\,\ell(f, Z) = \inf_{f\in\mathcal{F}} L(f) $$
Rearranging the terms, the proof is completed.

Since in practice we often aim to minimize a convex loss, the above lemma allows us to harness the power of the sequential algorithms developed earlier in the course: the average of the outputs of any such method can be used as a predictor under the i.i.d. assumption. Much of the attractiveness of using sequential regret minimization methods for i.i.d. learning is in the computational properties of such algorithms. Processing one example at a time in a stream-like fashion, the methods are often preferable for large scale problems. We mention that the idea of computing the average $\bar{f}$ (called the average of the trajectory) is well known in stochastic optimization.
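A small sketch of the conversion (illustrative assumptions: online projected gradient descent on the squared loss $\ell(f,z)=\|f-z\|^2$ over the unit ball, with Gaussian i.i.d. data): run the sequential method, average the trajectory, and measure the excess risk of $\bar{f}$.

```python
import numpy as np

def online_gradient_descent(zs, eta):
    """Online (projected) gradient descent over the unit l2 ball for the
    convex loss l(f, z) = ||f - z||^2; returns the iterates f_1,...,f_n."""
    d = zs.shape[1]
    f = np.zeros(d)
    iterates = []
    for z in zs:
        iterates.append(f.copy())
        f = f - eta * 2.0 * (f - z)        # gradient step
        nrm = np.linalg.norm(f)
        if nrm > 1.0:                      # project back onto the unit ball
            f = f / nrm
    return np.array(iterates)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 4000, 5
    mean = np.array([0.3, -0.1, 0.2, 0.0, 0.1])
    zs = mean + 0.5 * rng.standard_normal((n, d))     # i.i.d. data
    iterates = online_gradient_descent(zs, eta=1.0 / np.sqrt(n))
    f_bar = iterates.mean(axis=0)                      # online-to-batch: average the trajectory
    # L(f) = E||f - Z||^2 equals ||f - mean||^2 plus a constant, and the population
    # minimizer over the ball is the mean itself, so the excess risk is:
    print("excess risk of averaged iterate:", np.linalg.norm(f_bar - mean) ** 2)
```

The regret bound of the online method, averaged over the i.i.d. draw, is exactly what controls the printed excess risk, as in the lemma above.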

29 Sequential Prediction: Better Bounds for Predictable Sequences

Within the framework of sequential prediction, we developed methods that guarantee a certain level of performance irrespective of the sequence being presented. While such "protection" against the worst case is often attractive, the bounds are naturally pessimistic. It is, therefore, desirable to develop algorithms that yield tighter bounds for "more regular" sequences, while still providing protection against worst-case sequences. There are a number of ways to model "more regular" sequences. Let us start with the following definition. Fix a sequence of functions $M_t: \mathcal{X}^{t-1}\to\mathcal{X}$, for each $t\in\{1,\ldots,n\} \triangleq [n]$. These functions define a predictable process
$$ M_1,\ M_2(x_1),\ \ldots,\ M_n(x_1,\ldots,x_{n-1}). $$

If, in fact, $x_t = M_t(x_1,\ldots,x_{t-1})$ for all $t$, one may view the sequence $\{x_t\}$ as a (noiseless) predictable process, or as an oblivious strategy of Nature. If we knew that the sequence given by Nature follows exactly this evolution, we should suffer no regret.

Suppose that we have a hunch that the actual sequence will be "roughly" given by this predictable process: $x_t \approx M_t(x_1,\ldots,x_{t-1})$. In other words, we suspect that
$$ \text{sequence} \;=\; \text{predictable process} \;+\; \text{adversarial noise} $$

Can we use this fact to incur smaller regret if our suspicion is correct? Ideally, we would like to “pay” only for the unpredictable part of the sequence.

Once we know how to use the predictable process $M_t$ to incur small regret, we would like to choose the best one from a family of predictable processes. We will address this issue of learning $M_t$ later in the lecture. However, we would like to make it clear that the two problems are separate: (a) using a particular $M_t$ as prior knowledge about the sequence in order to incur small regret, and (b) learning the best $M_t$ from a family of models. Let us focus on the setting of online linear optimization, with $\ell(f,x) = \langle f,x\rangle$. For notational convenience, assume for a moment that $\mathcal{F}$ and $\mathcal{X}$ are dual unit balls.

In some sense, we would like to use $\{M_t\}$ as a "center" around which the actual sequence will be adversarially chosen. The key is the following observation, made in [46]. Let us go back to the symmetrization step in Equation (7.12), which we now state for the linear loss case:
$$ \sup_{x_1,x_1'}\mathbb{E}_{\epsilon_1}\cdots\sup_{x_n,x_n'}\mathbb{E}_{\epsilon_n}\left\|\sum_{t=1}^{n}\epsilon_t\big(x_t' - x_t\big)\right\|_* \;\le\; 2\sup_{x_1}\mathbb{E}_{\epsilon_1}\cdots\sup_{x_n}\mathbb{E}_{\epsilon_n}\left\|\sum_{t=1}^{n}\epsilon_t x_t\right\|_* $$

If we instead only consider sequences such that at any time $t\in[n]$, $x_t$ and $x_t'$ have to be $\sigma_t$-close to the predictable process $M_t(x_1,\ldots,x_{t-1})$, we can add and subtract the "center" $M_t$ on the left-hand side of the above equation and obtain tighter bounds for free, irrespective of the form of $M_t(x_1,\ldots,x_{t-1})$. To make this observation more precise, let
$$ C_t = C_t(x_1,\ldots,x_{t-1}) = \big\{ x : \|x - M_t(x_1,\ldots,x_{t-1})\|_* \le \sigma_t\big\} $$
be the set of allowed deviations from the predictable "trend". We then have a bound
$$ \sup_{x_1,x_1'\in C_1}\mathbb{E}_{\epsilon_1}\cdots\sup_{x_n,x_n'\in C_n}\mathbb{E}_{\epsilon_n}\left\|\sum_{t=1}^{n}\epsilon_t\big(x_t' - M_t(x_1,\ldots,x_{t-1}) + M_t(x_1,\ldots,x_{t-1}) - x_t\big)\right\|_* \;\le\; c\sqrt{\sum_{t=1}^{n}\sigma_t^2} $$
on the value of the game against such "constrained" sequences, where the constant $c$ depends on the smoothness of the norm. The development so far is a good example of how a purely theoretical observation can point to the existence of better prediction methods. What is even more surprising, for most of the methods presented below, the individual $\sigma_t$'s need not be known ahead of time, except for their total sum $\sum_{t=1}^{n}\sigma_t^2$. The latter sum need not be known in advance either, thanks to the standard doubling trick, and one can obtain upper bounds of
$$ \sum_{t=1}^{n}\langle f_t, x_t\rangle - \inf_{f\in\mathcal{F}}\sum_{t=1}^{n}\langle f, x_t\rangle \;\le\; c\sqrt{\sum_{t=1}^{n}\|x_t - M_t(x_1,\ldots,x_{t-1})\|_*^2} $$

on regret, for some problem-dependent constant $c$.

Let us now discuss several types of statistics $M_t$ that could be of interest.

Example 24. Regret bounds in terms of

$$ M_t(x_1,\ldots,x_{t-1}) = x_{t-1} $$
are known as path length bounds [46, 18]. Such bounds can be tighter than the pessimistic $O(\sqrt{n})$ bounds when the previous move of Nature is a good proxy for the next move. Regret bounds in terms of
$$ M_t(x_1,\ldots,x_{t-1}) = \frac{1}{t-1}\sum_{s=1}^{t-1}x_s $$
are known as variance bounds [17, 26, 27, 46]. One may also consider fading memory statistics
$$ M_t(x_1,\ldots,x_{t-1}) = \sum_{s=1}^{t-1}\alpha_s x_s, \qquad \sum_{s=1}^{t-1}\alpha_s = 1,\quad \alpha_s\ge 0, $$
or even plug in an auto-regressive model. If "phases" are expected in the data (e.g., stocks tend to go up in January), one may consider
$$ M_t(x_1,\ldots,x_{t-1}) = x_{t-k} $$
for some phase length $k$. Alternatively, one may consider averaging over the past occurrences $n_j(t)\subset\{1,\ldots,t\}$ of the current phase $j$:
$$ M_t(x_1,\ldots,x_{t-1}) = \sum_{s\in n_j(t)}\alpha_s x_s\,. $$

29.1 Full Information Methods

We now exhibit a Mirror Descent-type method which can be seen as a generalization of the recent algorithm of [18]. Let $\mathcal{R}$ be a 1-strongly convex function with respect to a norm $\|\cdot\|$, and let $D_{\mathcal{R}}(\cdot,\cdot)$ denote the Bregman divergence with respect to $\mathcal{R}$. Let $\nabla\mathcal{R}^*$ be the inverse of the gradient mapping $\nabla\mathcal{R}$. Let $\|\cdot\|_*$ be the norm dual to $\|\cdot\|$. We do not require $\mathcal{F}$ and $\mathcal{X}$ to be unit dual balls.


Optimistic Mirror Descent Algorithm (equivalent form)
Input: $\mathcal{R}$ 1-strongly convex w.r.t. $\|\cdot\|$, learning rate $\eta > 0$
Initialize $f_1 = g_1 = \arg\min_g \mathcal{R}(g)$
At $t = 1,\ldots,n$, predict $f_t$ and update
$$ g_{t+1} = \arg\min_{g\in\mathcal{F}}\ \eta\langle g, x_t\rangle + D_{\mathcal{R}}(g, g_t) $$
$$ f_{t+1} = \arg\min_{f\in\mathcal{F}}\ \eta\langle f, M_{t+1}\rangle + D_{\mathcal{R}}(f, g_{t+1}) $$

Lemma 29.1. Let $\mathcal{F}$ be a convex set in a Banach space $B$ and $\mathcal{X}$ be a convex set in the dual space $B^*$. Let $\mathcal{R}: B\to\mathbb{R}$ be a 1-strongly convex function on $\mathcal{F}$ with respect to some norm $\|\cdot\|$. For any strategy of Nature, the Optimistic Mirror Descent Algorithm (with or without projection for $g_{t+1}$) yields, for any $f^*\in\mathcal{F}$,
$$ \sum_{t=1}^{n}\langle f_t, x_t\rangle - \sum_{t=1}^{n}\langle f^*, x_t\rangle \;\le\; \eta^{-1}R_{\max}^2 + \frac{\eta}{2}\sum_{t=1}^{n}\|x_t - M_t\|_*^2 $$
where $R_{\max}^2 = \max_{f\in\mathcal{F}}\mathcal{R}(f) - \min_{f\in\mathcal{F}}\mathcal{R}(f)$.

Proof of Lemma 29.1. For any $f^*\in\mathcal{F}$,
$$ \langle f_t - f^*, x_t\rangle = \langle f_t - g_{t+1}, x_t - M_t\rangle + \langle f_t - g_{t+1}, M_t\rangle + \langle g_{t+1} - f^*, x_t\rangle \qquad (29.1) $$
First observe that
$$ \langle f_t - g_{t+1}, x_t - M_t\rangle \;\le\; \|f_t - g_{t+1}\|\,\|x_t - M_t\|_* \;\le\; \frac{\eta}{2}\|x_t - M_t\|_*^2 + \frac{1}{2\eta}\|f_t - g_{t+1}\|^2. \qquad (29.2) $$

On the other hand, any update of the form $a^* = \arg\min_{a\in A}\ \langle a,x\rangle + D_{\mathcal{R}}(a,c)$ satisfies (see e.g. [6, 42])
$$ \langle a^* - d, x\rangle \;\le\; \langle d - a^*, \nabla\mathcal{R}(a^*) - \nabla\mathcal{R}(c)\rangle = D_{\mathcal{R}}(d,c) - D_{\mathcal{R}}(d,a^*) - D_{\mathcal{R}}(a^*,c). \qquad (29.3) $$
This yields
$$ \langle f_t - g_{t+1}, M_t\rangle \;\le\; \frac{1}{\eta}\big( D_{\mathcal{R}}(g_{t+1},g_t) - D_{\mathcal{R}}(g_{t+1},f_t) - D_{\mathcal{R}}(f_t,g_t)\big). \qquad (29.4) $$


Next, note that by the form of the update for $g_{t+1}$,
$$ \langle g_{t+1} - f^*, x_t\rangle = \frac{1}{\eta}\langle g_{t+1} - f^*, \nabla\mathcal{R}(g_t) - \nabla\mathcal{R}(g_{t+1})\rangle = \frac{1}{\eta}\big( D_{\mathcal{R}}(f^*,g_t) - D_{\mathcal{R}}(f^*,g_{t+1}) - D_{\mathcal{R}}(g_{t+1},g_t)\big), \qquad (29.5) $$
and the same inequality holds by (29.3) if $g_{t+1}$ is defined with a projection (as in the boxed algorithm above). Using Equations (29.2), (29.5) and (29.4) in Equation (29.1), we conclude that
$$ \langle f_t - f^*, x_t\rangle \;\le\; \frac{\eta}{2}\|x_t - M_t\|_*^2 + \frac{1}{2\eta}\|f_t - g_{t+1}\|^2 + \frac{1}{\eta}\big( D_{\mathcal{R}}(g_{t+1},g_t) - D_{\mathcal{R}}(g_{t+1},f_t) - D_{\mathcal{R}}(f_t,g_t)\big) + \frac{1}{\eta}\big( D_{\mathcal{R}}(f^*,g_t) - D_{\mathcal{R}}(f^*,g_{t+1}) - D_{\mathcal{R}}(g_{t+1},g_t)\big) $$
$$ \le\; \frac{\eta}{2}\|x_t - M_t\|_*^2 + \frac{1}{2\eta}\|f_t - g_{t+1}\|^2 + \frac{1}{\eta}\big( D_{\mathcal{R}}(f^*,g_t) - D_{\mathcal{R}}(f^*,g_{t+1}) - D_{\mathcal{R}}(g_{t+1},f_t)\big) $$

By strong convexity of $\mathcal{R}$, $D_{\mathcal{R}}(g_{t+1},f_t) \ge \frac{1}{2}\|g_{t+1} - f_t\|^2$, and thus
$$ \langle f_t - f^*, x_t\rangle \;\le\; \frac{\eta}{2}\|x_t - M_t\|_*^2 + \frac{1}{\eta}\big( D_{\mathcal{R}}(f^*,g_t) - D_{\mathcal{R}}(f^*,g_{t+1})\big) $$

Summing over $t = 1,\ldots,n$ yields, for any $f^*\in\mathcal{F}$,
$$ \sum_{t=1}^{n}\langle f_t - f^*, x_t\rangle \;\le\; \frac{\eta}{2}\sum_{t=1}^{n}\|x_t - M_t\|_*^2 + \frac{R_{\max}^2}{\eta} $$
where $R_{\max}^2 = \max_{f\in\mathcal{F}}\mathcal{R}(f) - \min_{f\in\mathcal{F}}\mathcal{R}(f)$. Choosing $\eta = \frac{\sqrt{2}\,R_{\max}}{\sqrt{\sum_{t=1}^{n}\|x_t - M_t\|_*^2}}$ balances the two terms.

We remark that the sum $\sum_{t=1}^{n}\|x_t - M_t\|_*^2$ need not be known in advance in order to set $\eta$, as the usual doubling trick can be employed. Moreover, the results carry over to online convex optimization, where the $x_t$'s are now gradients at the points chosen by the learner. Last but not least, notice that if the sequence does not follow the trend $M_t$ as we hoped it would, we still obtain the same bounds as for the Mirror Descent algorithm. In some sense, we got something for nothing!
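For concreteness, here is a minimal sketch (assuming the Euclidean regularizer $\mathcal{R}(f)=\frac{1}{2}\|f\|^2$, so the mirror steps become projected gradient steps, and the path-length predictor $M_t = x_{t-1}$). The slowly-varying adversary in the demo is an illustrative assumption.

```python
import numpy as np

def proj_ball(v):
    nrm = np.linalg.norm(v)
    return v if nrm <= 1.0 else v / nrm

def optimistic_md(xs, eta, use_predictor=True):
    """Optimistic Mirror Descent with R(f) = 0.5*||f||^2 over the unit ball and
    the path-length predictor M_t = x_{t-1} (or M_t = 0 if use_predictor=False).
    xs: array of payoff vectors x_1,...,x_n; returns the regret."""
    d = xs.shape[1]
    g = np.zeros(d)
    M = np.zeros(d)                        # M_1 = 0: nothing observed yet
    loss = 0.0
    for x in xs:
        f = proj_ball(g - eta * M)         # f_t = argmin eta<f, M_t> + 0.5||f - g_t||^2
        loss += float(np.dot(f, x))
        g = proj_ball(g - eta * x)         # g_{t+1} = argmin eta<g, x_t> + 0.5||g - g_t||^2
        M = x if use_predictor else np.zeros(d)
    best = -np.linalg.norm(xs.sum(axis=0))  # comparator: inf over the ball of <f, sum x_t>
    return loss - best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 2000, 10
    drift = rng.standard_normal(d); drift /= np.linalg.norm(drift)
    # A slowly-varying ("predictable") sequence: small perturbations of a fixed direction.
    xs = np.array([proj_ball(drift + 0.05 * rng.standard_normal(d)) for _ in range(n)])
    print("regret with M_t = x_{t-1}:", optimistic_md(xs, 0.1, use_predictor=True))
    print("regret with M_t = 0      :", optimistic_md(xs, 0.1, use_predictor=False))
```

On such sequences the path-length version should show noticeably smaller regret, while on a worst-case sequence both behave like plain Mirror Descent, in line with the remark above.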


29.2 Learning The Predictable Processes

So far we have seen that the learner can pick an arbitrary predictable process $(M_t)_{t\ge 1}$ and suffer regret of at most $O\big(\sqrt{\sum_{t=1}^{n}\|x_t - M_t\|_*^2}\big)$. If the predictable process we chose was good, then our regret will be low. This raises the question of how the learner can choose a good predictable process $(M_t)_{t\ge 1}$. Is it possible to learn it online as we go, and if so, what does it mean to learn?

To formalize the concept of learning the predictable process, let us consider the case where we have a set $\Pi$ indexing a family of predictable processes (strategies) we are interested in. That is, each $\pi\in\Pi$ corresponds to a predictable process given by $(M_t^\pi)_{t\ge 1}$. If we had an oracle which at the start of the game told us which $\pi^*\in\Pi$ predicts the sequence optimally (in hindsight), then we could use the predictable process given by $(M_t^{\pi^*})_{t\ge 1}$ and enjoy a regret bound of $\inf_{\pi\in\Pi}\sqrt{\sum_{t=1}^{n}\|x_t - M_t^\pi\|_*^2}$. However, we cannot expect to know which $\pi\in\Pi$ is the optimal one at the start of the game. In this scenario, one would like to learn a predictable process that, in turn, can be used with the algorithms proposed thus far to get a regret bound comparable to the bound one could have obtained knowing the optimal $\pi^*\in\Pi$.

To motivate this setting better, let us consider an example. Say there are $n$ stock options we can choose to invest in. On each day $t$, associated with each stock option is a loss/payoff that occurs upon investing in a single share of that stock. Our goal in the long run is to have low regret with respect to the single best stock in hindsight. Up to this point, the problem just corresponds to the simple experts setting, where each of the $n$ stocks is one expert and on each day we split our investment according to a probability distribution over the $n$ options. However, we now additionally allow the learner/investor access to prediction models from the set $\Pi$. These could be human strategists making forecasts, or the outcomes of some hedge-fund model. At each time step, the learner can query the prediction made by each $\pi\in\Pi$ as to what the losses of the $n$ stocks will be on that day. We would like to have regret comparable to the regret we could achieve knowing the best model $\pi^*\in\Pi$ that in hindsight predicted the losses of each stock optimally. We shall now see how to achieve this.


Optimistic Mirror Descent Algorithm with Learning the Predictable Process
Input: $\mathcal{R}$ 1-strongly convex w.r.t. $\|\cdot\|$, learning rate $\eta > 0$
Initialize $f_1 = g_1 = \arg\min_g \mathcal{R}(g)$, and initialize $q_1\in\Delta(\Pi)$ as $q_1(\pi) = \frac{1}{|\Pi|}$ for all $\pi\in\Pi$.
Set $M_1 = \sum_{\pi\in\Pi} q_1(\pi) M_1^\pi$
At $t = 1,\ldots,n$, predict $f_t$ and update
$$ \forall\pi\in\Pi,\quad q_{t+1}(\pi) \propto q_t(\pi)\,e^{-\|M_t^\pi - x_t\|_*^2} \qquad\text{and}\qquad M_{t+1} = \sum_{\pi\in\Pi} q_{t+1}(\pi) M_{t+1}^\pi $$
$$ g_{t+1} = \nabla\mathcal{R}^*\big(\nabla\mathcal{R}(g_t) - \eta x_t\big) $$
$$ f_{t+1} = \arg\min_{f\in\mathcal{F}}\ \eta\langle f, M_{t+1}\rangle + D_{\mathcal{R}}(f, g_{t+1}) $$

Lemma 29.2. Let $\mathcal{F}$ be a convex subset of the unit ball in a Banach space $B$ and $\mathcal{X}$ be a convex subset of the dual unit ball. Let $\mathcal{R}: B\to\mathbb{R}$ be a 1-strongly convex function on $\mathcal{F}$ with respect to some norm $\|\cdot\|$. For any strategy of Nature, the Optimistic Mirror Descent Algorithm yields, for any $f^*\in\mathcal{F}$,
$$ \sum_{t=1}^{n}\langle f_t, x_t\rangle - \sum_{t=1}^{n}\langle f^*, x_t\rangle \;\le\; \eta^{-1}R_{\max}^2 + 3.2\,\eta\left(\inf_{\pi\in\Pi}\sum_{t=1}^{n}\|x_t - M_t^\pi\|_*^2 + \log|\Pi|\right) $$
where $R_{\max}^2 = \max_{f\in\mathcal{F}}\mathcal{R}(f) - \min_{f\in\mathcal{F}}\mathcal{R}(f)$.

Once again, let us discuss what makes this setting different from the usual setting of experts: the forecast given by the prediction models is in the form of a vector, one entry for each stock. If we treat each prediction model as an expert with loss $\|x_t - M_t^\pi\|_*^2$, the experts algorithm would guarantee that we achieve the best cumulative loss of this kind. However, this is not the object of interest to us, as we are after the best allocation of our money among the stocks, as measured by $\inf_{f\in\mathcal{F}}\sum_{t=1}^{n}\langle f, x_t\rangle$.

Proof. First note that by Lemma 29.1 we have that, for the $M_t$ chosen in the algorithm,

$$\sum_{t=1}^n \left\langle f_t, x_t\right\rangle - \sum_{t=1}^n \left\langle f^*, x_t\right\rangle \ \le\ \eta^{-1} R^2_{\max} + \frac{\eta}{2}\sum_{t=1}^n \|x_t - M_t\|_*^2$$
$$\le\ \eta^{-1} R^2_{\max} + \frac{\eta}{2}\sum_{t=1}^n \sum_{\pi\in\Pi} q_t(\pi)\, \|x_t - M^\pi_t\|_*^2 \qquad\text{(Jensen's inequality)}$$
$$\le\ \eta^{-1} R^2_{\max} + \frac{\eta}{2}\left(\frac{4e}{e-1}\right)\left(\inf_{\pi\in\Pi}\sum_{t=1}^n \|x_t - M^\pi_t\|_*^2 + \log|\Pi|\right)$$
where the last step is due to Corollary 2.3 of [16]. Indeed, the updates for the $q_t$'s are exactly the experts algorithm with point-wise loss at each round $t$ for expert $\pi\in\Pi$ given by $\|M^\pi_t - x_t\|_*^2$. Also, as each $M^\pi_t\in X$, the unit ball of the dual norm, we can conclude that $\|M^\pi_t - x_t\|_*^2\le 4$, which is why we have a scaling by a factor of $4$. Simplifying (note that $\frac{\eta}{2}\cdot\frac{4e}{e-1} = \frac{2e}{e-1}\,\eta \le 3.2\,\eta$) leads to the bound in the lemma.

Notice that the general idea used above is to obtain the $M_t$'s for the algorithm by running another (secondary) regret-minimizing strategy, where the loss per round is simply $\|M_t - x_t\|_*^2$ and regret is considered with respect to the best $\pi\in\Pi$. That is, the regret of the secondary regret-minimizing game is given by

$$\sum_{t=1}^n \|x_t - M_t\|_*^2 - \inf_{\pi\in\Pi}\sum_{t=1}^n \|x_t - M^\pi_t\|_*^2$$
In general, the experts algorithm for minimizing the secondary regret can be replaced by any other online learning algorithm.
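To illustrate this last remark, the secondary learner need not be an experts algorithm at all. The following sketch (our own instantiation, not part of the text) runs online gradient descent directly on the convex losses $M\mapsto\|M - x_t\|_2^2$ over the Euclidean unit ball, producing hints $M_t$ that can be fed to Optimistic Mirror Descent; its guarantee is then with respect to the best fixed $M$ in the ball rather than the best $\pi\in\Pi$.

    import numpy as np

    class SecondaryOGD:
        """Online gradient descent producing hints M_t over the (Euclidean) dual ball."""
        def __init__(self, d, eta=0.05):
            self.M = np.zeros(d)
            self.eta = eta

        def predict(self):
            return self.M                          # hint M_t for the current round

        def update(self, x):
            grad = 2.0 * (self.M - x)              # gradient of ||M - x||^2 at the current M
            M = self.M - self.eta * grad
            nrm = np.linalg.norm(M)
            self.M = M if nrm <= 1.0 else M / nrm  # project back onto the unit ball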

29.3 Follow the Perturbed Leader Method

29.4 A General Framework of Stochastic, Smoothed, and Constrained Adversaries

30 Sequential Prediction: Competing With Strategies

One of the most frequent criticisms of the regret definition is the fact that the comparator term is a static decision in hindsight, not able to change its predictions as the sequence evolves. What good is showing small regret if we only perform as well as the best single action for the whole sequence?

We now show that one can obtain regret bounds with respect to a possibly infinite set of strategies which look at the actual sequence and produce a prediction. In fact, this is not an increase in generality of our results so far, but rather the opposite. Observe that in the setting of supervised learning, the side information $x_t$ is presented at the beginning of the round. This information is treated as a worst-case move in our framework, but can instead be restricted to contain the history so far. In this case, an element $f\in F$ can be viewed as a strategy which produces a prediction $f(x_t)$ based on the history. At the end of the day, we will be competing with the best strategy $f\in F$ for the given sequence $(x_1,y_1),\ldots,(x_n,y_n)$.

So, if the setting of supervised learning subsumes competing with strategies, can we detail the bounds for the latter case? The key is that the covering numbers, the combinatorial parameters, and the other relevant complexities are now defined not for the worst-case tree $x$ but for the worst tree with some "history" structure. This makes the problem easier. In fact, we will easily derive bounds for regret defined in terms of a set of bounded-lookback strategies, or strategies limited in some other way (of course, we cannot compete with absolutely all possible strategies). Further, we will show that we can achieve the correct rate for regret against the set of all monotonic strategies, a problem that was considered in [15].


We do not know whether such a result is possible without the tools developed in this course.

Let us fix some notation. At each round, we choose an action $f_t\in F$, Nature responds with $z_t$, and the cost associated with the two decisions is $\ell(f_t, z_t)$. Let

$$\Pi = \left\{\pi = (\pi_t)_{t=1}^n \ :\ \pi_t : Z^{t-1}\to F\right\}$$
be a set of strategies $\pi$, each consisting of a sequence of mappings $\pi_t$ from history to the next action. We may then formulate the value of the game with regret defined in terms of these strategies:

$$V(\Pi) = \inf_{q_1}\sup_{z_1}\mathop{\mathbb E}_{f_1\sim q_1}\cdots \inf_{q_n}\sup_{z_n}\mathop{\mathbb E}_{f_n\sim q_n}\left[\frac{1}{n}\sum_{t=1}^n \ell(f_t,z_t) - \inf_{\pi\in\Pi}\frac{1}{n}\sum_{t=1}^n \ell(\pi_t(z_{1:t-1}),z_t)\right]$$

30.1 Bounding the Value with History Trees

To proceed, we need a definition of a history tree.

Definition 30.1. Let $H = \bigcup_{t\ge 0} Z^t$ be the space of histories. We say a tree $h$ is a history tree if it is $H$-valued and has the following consistency properties for any $t$:

• $h_i : \{\pm1\}^{i-1}\to Z^{i-1}$

• for any $\epsilon$, if $h_t(\epsilon_1,\ldots,\epsilon_{t-2},+1) = (z_1,\ldots,z_{t-1})$ and $h_t(\epsilon_1,\ldots,\epsilon_{t-2},-1) = (z'_1,\ldots,z'_{t-1})$, then $z_s = z'_s$ for all $1\le s\le t-2$.

As an example, a valid history tree would have $h_1 = \emptyset$, $h_2(1) = z_1$, $h_3(1,-1) = (z_1,z_2)$, $h_3(1,1) = (z_1,z_3)$, $h_2(-1) = z_4$, $h_3(-1,-1) = (z_4,z_5)$, $h_3(-1,1) = (z_4,z_6)$.
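For concreteness, a history tree of depth $n$ can be stored as a map from sign prefixes $(\epsilon_1,\ldots,\epsilon_{t-1})$ to histories $h_t(\epsilon)$; the Python sketch below (an illustration only, not notation used elsewhere in the notes) checks the sibling-consistency property of Definition 30.1 on the example just given.

    def is_consistent(h):
        # h maps sign tuples (eps_1, ..., eps_{t-1}) to histories, tuples in Z^{t-1};
        # Definition 30.1 requires the two children of any node to agree on all
        # but the last coordinate of their histories.
        for prefix in h:
            left, right = h.get(prefix + (-1,)), h.get(prefix + (1,))
            if left is not None and right is not None and left[:-1] != right[:-1]:
                return False
        return True

    # the example history tree from the text, with z1,...,z6 as placeholder outcomes
    h = {(): (),
         (1,): ('z1',), (1, -1): ('z1', 'z2'), (1, 1): ('z1', 'z3'),
         (-1,): ('z4',), (-1, -1): ('z4', 'z5'), (-1, 1): ('z4', 'z6')}
    assert is_consistent(h)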

Theorem 30.2. The value of the prediction problem with a set $\Pi$ of strategies is upper bounded as
$$V(\Pi) \le 2\sup_{w,z}\mathbb E_\epsilon\left[\sup_{\pi\in\Pi}\frac{1}{n}\sum_{t=1}^n \epsilon_t\, \ell\big(\pi_t(w_1(\epsilon),\ldots,w_{t-1}(\epsilon)),\, z_t(\epsilon)\big)\right]$$
where the supremum is over two $Z$-valued trees $z$ and $w$ of depth $n$. Equivalently, we may rewrite the upper bound as

$$V(\Pi) \le 2\sup_{g}\mathbb E_\epsilon\left[\sup_{\pi\in\Pi}\frac{1}{n}\sum_{t=1}^n \epsilon_t\, \ell'(\pi, g_t(\epsilon))\right] \triangleq 2\mathcal R^{\mathrm{seq}}(\Pi)
$$


for the new loss function $\ell'(\pi,(h_t,z_t)(\epsilon)) = \ell(\pi_t(h_t(\epsilon)), z_t(\epsilon))$, where $g$ ranges over all history-outcome trees $(h,z)$, with $z$ being an arbitrary $Z$-valued tree and $h$ being an arbitrary history tree (not necessarily consistent with $z$).

Proof. The proof of Section 7.1.2 goes through and we obtain

$$n\cdot V(\Pi) = \left\langle\!\!\left\langle \sup_{p_t}\ \mathop{\mathbb E}_{z_t\sim p_t}\right\rangle\!\!\right\rangle_{t=1}^n \left[\sup_{\pi\in\Pi}\sum_{t=1}^n\ \inf_{f_t\in F}\mathop{\mathbb E}_{z'_t}\ell(f_t,z'_t) - \ell(\pi_t(z_{1:t-1}),z_t)\right]$$
$$\le \left\langle\!\!\left\langle \sup_{p_t}\ \mathop{\mathbb E}_{z_t\sim p_t}\right\rangle\!\!\right\rangle_{t=1}^n \left[\sup_{\pi\in\Pi}\sum_{t=1}^n\ \mathop{\mathbb E}_{z'_t}\ell(\pi_t(z_{1:t-1}),z'_t) - \ell(\pi_t(z_{1:t-1}),z_t)\right]$$
$$\le \left\langle\!\!\left\langle \sup_{p_t}\ \mathop{\mathbb E}_{z_t,z'_t\sim p_t}\right\rangle\!\!\right\rangle_{t=1}^n \left[\sup_{\pi\in\Pi}\sum_{t=1}^n\ \ell(\pi_t(z_{1:t-1}),z'_t) - \ell(\pi_t(z_{1:t-1}),z_t)\right]$$
Define the "selector function" $\chi : Z\times Z\times\{\pm1\}\to Z$ by
$$\chi(z,z',\epsilon) = \begin{cases} z' & \text{if } \epsilon = -1\\ z & \text{if } \epsilon = 1\end{cases}$$
When $z$ and $z'$ are understood from the context, we will use the shorthand $\chi_t(\epsilon) := \chi(z_t,z'_t,\epsilon)$. In other words, $\chi_t$ selects between $z_t$ and $z'_t$ depending on the sign of $\epsilon$. With the notation of the selector function, we may write

$$\left\langle\!\!\left\langle \sup_{p_t}\ \mathop{\mathbb E}_{z_t,z'_t\sim p_t}\right\rangle\!\!\right\rangle_{t=1}^n \left[\sup_{\pi\in\Pi}\sum_{t=1}^n\ \ell(\pi_t(z_{1:t-1}),z'_t) - \ell(\pi_t(z_{1:t-1}),z_t)\right]$$
$$= \left\langle\!\!\left\langle \sup_{p_t}\ \mathop{\mathbb E}_{z_t,z'_t}\mathop{\mathbb E}_{\epsilon_t}\right\rangle\!\!\right\rangle_{t=1}^n \Big[\sup_{\pi\in\Pi}\sum_{t=1}^n \epsilon_t\Big(\ell\big(\pi_t(\chi_1(z_1,z'_1,\epsilon_1),\ldots,\chi_{t-1}(z_{t-1},z'_{t-1},\epsilon_{t-1})),\, z'_t\big)$$
$$\qquad\qquad\qquad\qquad - \ell\big(\pi_t(\chi_1(z_1,z'_1,\epsilon_1),\ldots,\chi_{t-1}(z_{t-1},z'_{t-1},\epsilon_{t-1})),\, z_t\big)\Big)\Big]$$

To be convinced of this equality, we should think of $\epsilon_t = -1$ as "$z_t$ and $z'_t$ getting renamed". The process of renaming these random variables has two effects: the sign on the $t$-th term gets flipped, and all the conditioning that was previously done on $z_t$ now has $z'_t$ instead. This is exactly accomplished by multiplying the $t$-th term by $\epsilon_t$ and introducing the selector throughout the history. We now split the above terms into two, yielding an upper bound of

$$\left\langle\!\!\left\langle \sup_{p_t}\ \mathop{\mathbb E}_{z_t,z'_t}\mathop{\mathbb E}_{\epsilon_t}\right\rangle\!\!\right\rangle_{t=1}^n \left[\sup_{\pi\in\Pi}\sum_{t=1}^n \epsilon_t\, \ell\big(\pi_t(\chi_1(z_1,z'_1,\epsilon_1),\ldots,\chi_{t-1}(z_{t-1},z'_{t-1},\epsilon_{t-1})),\, z'_t\big)\right]$$
$$+ \left\langle\!\!\left\langle \sup_{p_t}\ \mathop{\mathbb E}_{z_t,z'_t}\mathop{\mathbb E}_{\epsilon_t}\right\rangle\!\!\right\rangle_{t=1}^n \left[\sup_{\pi\in\Pi}\sum_{t=1}^n -\epsilon_t\, \ell\big(\pi_t(\chi_1(z_1,z'_1,\epsilon_1),\ldots,\chi_{t-1}(z_{t-1},z'_{t-1},\epsilon_{t-1})),\, z_t\big)\right]$$


Now the claim is that the second term is equal to the first. Let us replace $\epsilon_t$ with $-\epsilon_t$ for all $t$:
$$\left\langle\!\!\left\langle \sup_{p_t}\ \mathop{\mathbb E}_{z_t,z'_t}\mathop{\mathbb E}_{\epsilon_t}\right\rangle\!\!\right\rangle_{t=1}^n \left[\sup_{\pi\in\Pi}\sum_{t=1}^n -\epsilon_t\, \ell\big(\pi_t(\chi_1(z_1,z'_1,\epsilon_1),\ldots,\chi_{t-1}(z_{t-1},z'_{t-1},\epsilon_{t-1})),\, z_t\big)\right]$$
$$= \left\langle\!\!\left\langle \sup_{p_t}\ \mathop{\mathbb E}_{z_t,z'_t}\mathop{\mathbb E}_{\epsilon_t}\right\rangle\!\!\right\rangle_{t=1}^n \left[\sup_{\pi\in\Pi}\sum_{t=1}^n \epsilon_t\, \ell\big(\pi_t(\chi_1(z'_1,z_1,\epsilon_1),\ldots,\chi_{t-1}(z'_{t-1},z_{t-1},\epsilon_{t-1})),\, z_t\big)\right]$$

Renaming $z'_t$ and $z_t$ concludes the claim. Therefore,

$$n\cdot V(\Pi) \le 2\left\langle\!\!\left\langle \sup_{p_t}\ \mathop{\mathbb E}_{z_t,z'_t}\mathop{\mathbb E}_{\epsilon_t}\right\rangle\!\!\right\rangle_{t=1}^n \left[\sup_{\pi\in\Pi}\sum_{t=1}^n \epsilon_t\, \ell\big(\pi_t(\chi_1(z_1,z'_1,\epsilon_1),\ldots,\chi_{t-1}(z_{t-1},z'_{t-1},\epsilon_{t-1})),\, z'_t\big)\right]$$
$$\le 2\left\langle\!\!\left\langle \sup_{z_t,z'_t}\ \mathop{\mathbb E}_{\epsilon_t}\right\rangle\!\!\right\rangle_{t=1}^n \left[\sup_{\pi\in\Pi}\sum_{t=1}^n \epsilon_t\, \ell\big(\pi_t(\chi_1(z_1,z'_1,\epsilon_1),\ldots,\chi_{t-1}(z_{t-1},z'_{t-1},\epsilon_{t-1})),\, z'_t\big)\right]$$
$$= 2\sup_{z,z'}\mathbb E_\epsilon\left[\sup_{\pi\in\Pi}\sum_{t=1}^n \epsilon_t\, \ell\big(\pi_t(\chi_1(z_1(\epsilon),z'_1(\epsilon),\epsilon_1),\ldots,\chi_{t-1}(z_{t-1}(\epsilon),z'_{t-1}(\epsilon),\epsilon_{t-1})),\, z'_t(\epsilon)\big)\right]$$

We now replace the dependence of $\pi_t$ on two trees and a selector between them by a dependence on only one tree, resulting in an upper bound of

$$2\sup_{w,z'}\mathbb E_\epsilon\left[\sup_{\pi\in\Pi}\sum_{t=1}^n \epsilon_t\, \ell\big(\pi_t(w_1(\epsilon),\ldots,w_{t-1}(\epsilon)),\, z'_t(\epsilon)\big)\right]$$
While this might be a loose upper bound, we do not expect this step to be too bad: for sequences $\epsilon$ with a majority of $+1$'s, the arguments to $\pi_t$ anyway come from the tree $z$, unrelated to $z'$. We may now re-write the upper bound as

$$2\sup_{h,z}\mathbb E_\epsilon\left[\sup_{\pi\in\Pi}\sum_{t=1}^n \epsilon_t\, \ell\big(\pi_t(h_t(\epsilon)),\, z_t(\epsilon)\big)\right]$$
where $h$ ranges over history trees and $z$ ranges over $Z$-valued trees. For the new loss function $\ell'(\pi,(h_t,z_t)(\epsilon)) = \ell(\pi_t(h_t(\epsilon)), z_t(\epsilon))$ we can rewrite the (normalized by $n$) bound as

$$2\sup_{g}\mathbb E_\epsilon\left[\sup_{\pi\in\Pi}\frac{1}{n}\sum_{t=1}^n \epsilon_t\, \ell'(\pi, g_t(\epsilon))\right] \triangleq 2\mathcal R^{\mathrm{seq}}(\Pi).$$


We remark that in the case where the $\pi\in\Pi$ are constant history-independent strategies $\pi^f_1 = \ldots = \pi^f_n = f\in F$, we recover the sequential Rademacher complexity. In fact, for all the notions we present below, taking $\Pi$ to be the set of all constant strategies should recover the notions introduced earlier.

Example 25. Theorem 30.2 has the following immediate implication. Let $\Pi^k$ be a set of strategies which only look back at $k$ past outcomes to determine the next move. The bound then becomes
$$V(\Pi^k) \le 2\sup_{w,z}\mathbb E_\epsilon\left[\sup_{\pi\in\Pi^k}\frac{1}{n}\sum_{t=1}^n \epsilon_t\, \ell\big(\pi_t(w_{t-k}(\epsilon),\ldots,w_{t-1}(\epsilon)),\, z_t(\epsilon)\big)\right]$$
Now, suppose that $Z = F$ is a finite set, of cardinality $s$. Then there are effectively $s^{s^k}$ strategies $\pi$. The bound on the sequential Rademacher complexity will then scale as $\sqrt{\frac{2 s^k\log s}{n}}$. However, we could have used the Exponential Weights algorithm here, treating each $\pi$ as an expert, arriving at the same bound on regret. The next example presents a situation where one cannot simply use the exponential weights algorithm.
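The following toy sketch makes the counting concrete: it enumerates all $s^{s^k}$ lookback-$k$ strategies for small $s$ and $k$ and runs exponential weights over them with the indicator loss. The padding of the initial history and the learning rate $\sqrt{8\log N / n}$ are illustrative choices, not prescriptions from the text.

    import itertools, math, random

    def exp_weights_over_lookback(zs, s=2, k=2, seed=0):
        # Exponential weights over all lookback-k strategies on outcomes zs in {0,...,s-1}.
        # Each strategy is a table mapping the last k outcomes to a prediction.
        random.seed(seed)
        contexts = list(itertools.product(range(s), repeat=k))
        strategies = [dict(zip(contexts, vals))                      # all s^(s^k) tables
                      for vals in itertools.product(range(s), repeat=len(contexts))]
        eta = math.sqrt(8 * math.log(len(strategies)) / len(zs))
        w = [1.0] * len(strategies)
        mistakes, history = 0, [0] * k                               # pad the initial history
        for z in zs:
            ctx = tuple(history[-k:])
            pred = random.choices([st[ctx] for st in strategies], weights=w)[0]
            mistakes += int(pred != z)
            w = [wi * math.exp(-eta * int(st[ctx] != z)) for wi, st in zip(w, strategies)]
            history.append(z)
        return mistakes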

Example 26. Let $\Pi^k$ be the set of strategies which only depend on the past $k$ moves, and let us parametrize the strategies by a set $\Theta\subset\mathbb R^k$ as:
$$\pi^\theta_t(z_1,\ldots,z_{t-1}) = \pi^\theta_t(z_{t-k},\ldots,z_{t-1}) = \sum_{i=1}^k \theta_i\, z_{t-i}, \qquad \theta = (\theta_1,\ldots,\theta_k)\in\Theta$$
Note that in this example the parameters $\theta$ of a given strategy $\pi^\theta$ are fixed throughout time. We may view this as an auto-regressive strategy. For illustration purposes, suppose $Z = F\subset\mathbb R^d$ are $\ell_2$ unit balls, the loss is $\ell(f,z) = \left\langle f, z\right\rangle$, and $\Theta\subset\mathbb R^k$ is also a unit $\ell_2$ ball. Then
$$\mathcal R(\Pi) = \sup_{w,z}\mathbb E_\epsilon\left[\sup_{\theta\in\Theta}\sum_{t=1}^n \epsilon_t\left\langle \pi^\theta(w_{t-k}(\epsilon),\ldots,w_{t-1}(\epsilon)),\, z_t(\epsilon)\right\rangle\right]$$
$$= \sup_{w,z}\mathbb E_\epsilon\left[\sup_{\theta\in\Theta}\sum_{t=1}^n \epsilon_t\, z_t(\epsilon)^{\mathsf T}\,[w_{t-k}(\epsilon),\ldots,w_{t-1}(\epsilon)]\cdot\theta\right]$$
$$= \sup_{w,z}\mathbb E_\epsilon\left\|\sum_{t=1}^n \epsilon_t\, z_t(\epsilon)^{\mathsf T}\,[w_{t-k}(\epsilon),\ldots,w_{t-1}(\epsilon)]\right\| \ \le\ \sqrt{kn}$$
This bound against all strategies parametrized by $\Theta$ is achieved by the gradient-descent method. However, one can see that it is possible to consider slowly-changing

strategies, and one may arrive at an upper bound on the value in a non-constructive way.
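A sketch of the constructive side of this example (our own filling-in, with some caveats glossed over): online gradient descent on $\theta$ over the unit $\ell_2$ ball, playing the autoregressive point $f_t = \sum_i \theta_{t,i} z_{t-i}$ and using the fact that the loss is linear in $\theta$. Note that $f_t$ may have norm up to $\sqrt k$, so a complete treatment would rescale the play to keep it in $F$; the step size is the generic $1/\sqrt n$ choice rather than the tuning behind the $\sqrt{kn}$ bound.

    import numpy as np

    def ogd_autoregressive(zs, k, eta=None):
        # zs: array (n, d) of outcomes with ||z_t|| <= 1.
        # Plays f_t = sum_i theta_i z_{t-i}; the loss <f_t, z_t> is linear in theta.
        n, d = zs.shape
        eta = eta if eta is not None else 1.0 / np.sqrt(n)
        theta = np.zeros(k)
        total = 0.0
        for t in range(n):
            past = np.stack([zs[t - i] if t - i >= 0 else np.zeros(d)
                             for i in range(1, k + 1)])   # rows z_{t-1}, ..., z_{t-k}
            f = theta @ past                               # the autoregressive play
            total += f @ zs[t]
            grad = past @ zs[t]                            # gradient in theta of <theta @ past, z_t>
            theta = theta - eta * grad
            nrm = np.linalg.norm(theta)
            if nrm > 1.0:
                theta = theta / nrm                        # project onto the unit ball Theta
        return total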

Example 27. As another example, consider the binary prediction problem, $Z = \{0,1\}$, and the indicator loss. Suppose for now that strategies potentially depend on the full history. Then,

$$\sup_{w,z}\mathbb E_\epsilon\left[\sup_{\pi\in\Pi}\sum_{t=1}^n \epsilon_t\, \mathbf 1\big\{\pi_t(w_1(\epsilon),\ldots,w_{t-1}(\epsilon))\neq z_t(\epsilon)\big\}\right]$$
$$= \sup_{w,z}\mathbb E_\epsilon\left[\sup_{\pi\in\Pi}\sum_{t=1}^n \epsilon_t\,\big(\pi_t(w_1(\epsilon),\ldots,w_{t-1}(\epsilon))(1 - 2 z_t(\epsilon)) + z_t(\epsilon)\big)\right]$$
$$= \sup_{w}\mathbb E_\epsilon\left[\sup_{\pi\in\Pi}\sum_{t=1}^n \epsilon_t\, \pi_t(w_1(\epsilon),\ldots,w_{t-1}(\epsilon))\right]$$
where we used the identity $\mathbf 1\{a\neq z\} = a(1-2z) + z$ for $a,z\in\{0,1\}$, the fact that the term $\epsilon_t z_t(\epsilon)$ vanishes in expectation, and the fact that the sign $(1-2z_t(\epsilon))\in\{\pm1\}$ can be absorbed into $\epsilon_t$.

30.2 Static Experts

It is now easy to recover the results of [15] for static experts. These are experts that do not depend on the history! That is, each strategy $\pi$ is a predetermined sequence of outcomes, and we may therefore associate each $\pi$ with a vector in $Z^n$. Since there is no dependence on the history, the sequential Rademacher complexity of $\Pi$ simplifies:

$$V(\Pi) \le 2\mathcal R^{\mathrm{seq}}(\Pi) = 2\sup_{g}\mathbb E_\epsilon\left[\sup_{\pi\in\Pi}\frac{1}{n}\sum_{t=1}^n \epsilon_t\, \ell'(\pi, g_t(\epsilon))\right] = 2\sup_{z}\mathbb E_\epsilon\left[\sup_{\pi\in\Pi}\frac{1}{n}\sum_{t=1}^n \epsilon_t\, \ell(\pi_t, z_t(\epsilon))\right]$$
Suppose $Z = \{0,1\}$, the loss is the indicator of a mistake, and $\pi\in Z^n$ for each $\pi\in\Pi$. Then we may erase the tree $z$ as in the above example:
$$V(\Pi) \le 2\mathbb E_\epsilon\left[\sup_{\pi\in\Pi}\frac{1}{n}\sum_{t=1}^n \epsilon_t\, \pi_t\right] = \mathcal R^{\mathrm{iid}}(\Pi)$$
which is simply the classical i.i.d. Rademacher average. One may, of course, further upper bound this complexity by covering numbers, as done in the earlier part of the course. This example has been worked out in [15], and it was shown that the minimax value has a closed form. The authors were intrigued by the fact that empirical process theory came into play in the problem of sequential prediction. In fact, the development of the general theory we presented was also largely influenced by this surprising fact. It is now clear how the classical empirical process

theory arises due to the static nature of experts, yet in the non-static case a more general theory is required. After introducing the combinatorial parameters and covering numbers which are specific to history trees, we will turn to the case of monotonic experts and resolve an open issue left by [15].
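As a quick numerical sketch of this reduction, the classical Rademacher average above can be estimated by Monte Carlo for a small finite set of static experts, each given as a vector in $\{0,1\}^n$. The expert set and the sample size below are arbitrary illustrative choices.

    import numpy as np

    def iid_rademacher(experts, num_samples=2000, seed=0):
        # Monte Carlo estimate of E_eps sup_pi (1/n) sum_t eps_t pi_t
        # for a finite set of static experts (rows of a 0/1 matrix)
        rng = np.random.default_rng(seed)
        experts = np.asarray(experts, dtype=float)            # shape (|Pi|, n)
        n = experts.shape[1]
        eps = rng.choice([-1.0, 1.0], size=(num_samples, n))
        return np.mean(np.max(eps @ experts.T, axis=1)) / n

    # three static experts of length n = 8
    print(iid_rademacher([[0, 1, 0, 1, 0, 1, 0, 1], [1, 1, 1, 1, 0, 0, 0, 0], [1] * 8]))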

30.3 Covering Numbers and Combinatorial Parameters

We may now define covering numbers, combinatorial parameters, and other measures of complexity for the set $\Pi$ of strategies over the history-outcome trees $g$. If we assume that the loss function is Lipschitz, then we may only need to define covering numbers on history trees $h$. The development is a straightforward modification of the notions we developed earlier, where we replace "any $x$" with a "history tree" $h$.

Definition 30.3. A set $V$ of $\mathbb R$-valued trees is an $\alpha$-cover (with respect to $\ell_p$) of a set of strategies $\Pi$ on a history tree $h$ if

$$\forall\pi\in\Pi,\ \forall\epsilon\in\{\pm1\}^n,\ \exists v\in V \ \text{ s.t. }\ \left(\frac{1}{n}\sum_{t=1}^n \big|\pi_t(h_t(\epsilon)) - v_t(\epsilon)\big|^p\right)^{1/p} \le \alpha. \qquad (30.1)$$
An $\alpha$-covering number is defined as
$$\mathcal N_p(\Pi, h, \alpha) = \min\Big\{\mathrm{card}(V)\ :\ V \text{ is an } \alpha\text{-cover}\Big\}.$$
If $\Pi$ is the set of all constant strategies $\pi^f$ with $\pi^f_1 = \ldots = \pi^f_n = f\in F$, we recover the notion of a covering number defined in Section 14.1.

For any history tree $h$, the sequential Rademacher averages of a class of strategies $\Pi$, where each $\pi\in\Pi$ is a sequence $(\pi_t)_{t=1}^n$ with $\pi_t : Z^{t-1}\to\mathbb R$, satisfy
$$\mathcal R^{\mathrm{seq}}(\Pi, h) \le \inf_{\alpha\ge 0}\left\{\alpha + \sqrt{\frac{2\log\mathcal N_1(\Pi, h, \alpha)}{n}}\right\}$$
and the Dudley entropy integral type bound also holds:
$$\mathcal R^{\mathrm{seq}}(\Pi, h) \le \inf_{\alpha\ge 0}\left\{4\alpha + \frac{12}{\sqrt n}\int_\alpha^1 \sqrt{\log\mathcal N_2(\Pi, h, \delta)}\ d\delta\right\} \qquad (30.2)$$
For completeness, let us define the history-based fat-shattering dimension.


Definition 30.4. A history tree $h$ of depth $n$ is $\alpha$-shattered by a set of strategies $\Pi$ if there exists an $\mathbb R$-valued tree $s$ of depth $n$ such that
$$\forall\epsilon\in\{\pm1\}^n,\ \exists\pi\in\Pi \ \text{ s.t. }\ \epsilon_t\big(\pi_t(h_t(\epsilon)) - s_t(\epsilon)\big) > \alpha/2 \quad \forall t\in\{1,\ldots,n\}$$
The (sequential) fat-shattering dimension $\mathrm{fat}(\Pi,\alpha)$ for history trees at scale $\alpha$ is the largest $n$ such that $\Pi$ $\alpha$-shatters some history tree of depth $n$.

In general, the combinatorial results of Section 14.3 do not immediately extend to history trees, due to the fact that, technically, two subtrees of a history tree are not (according to our definition) valid history trees. There might be modified definitions such that the combinatorial results extend more naturally. However, we now present a case where the upper bound on the sequential covering does hold. Together with the Dudley entropy bound for the sequential cover, this resolves a problem of [15].

30.4 Monotonic Experts

Consider a class $\Pi$ of strategies (non-static experts), each predicting a value $\pi_t(z_1,\ldots,z_{t-1})\in[0,1] = F$, which we can treat as the probability of predicting a $1$. The outcome sequence is $\{0,1\}$-valued, and the loss is $\ell(f,z) = |f - z|$. This is the probabilistic version of the indicator loss considered in the previous example, but one can pass easily from one setting to the other.

The requirement we place on the experts in the collection $\Pi$ is that their predictions either never decrease or never increase. That is, for every $\pi\in\Pi$, either

$$\forall t,\ \forall(z_1,\ldots,z_n):\quad \pi_t(z_1,\ldots,z_{t-1}) \le \pi_{t+1}(z_1,\ldots,z_t)$$
or

$$\forall t,\ \forall(z_1,\ldots,z_n):\quad \pi_t(z_1,\ldots,z_{t-1}) \ge \pi_{t+1}(z_1,\ldots,z_t)$$
Following [15], we will call these experts monotonic. Observe that they are non-static, as their actual predictions may depend on the history. Also, observe that this class of experts is huge! We will calculate an upper bound on the sequential covering number and will see that it does not behave as a "parametric" class.

Recall that a related class of nondecreasing (or, nonincreasing) functions on the real line played an important role in Section 12.3 for the problem of Statistical Learning. The i.i.d. $\ell_\infty$-covering numbers at scale $\alpha$ were shown to be bounded

by $n^{2/\alpha}$. It was then shown that a bound of Proposition 12.3 at a single scale gives a suboptimal rate of $\tilde O(n^{-1/3})$, while the Dudley entropy integral bound of Theorem 12.4 yields the correct (up to logarithmic factors) rate of $\tilde O(n^{-1/2})$. Further, it was also shown that the i.i.d. fat-shattering dimension of this class is $2/\alpha$, which gave us a direct estimate on the $\ell_\infty$ covering numbers via the combinatorial result. This was all done in the setting of Statistical Learning, and one may wonder if the same gap exists in the sequential case.

In fact, we have all the tools necessary to prove the $\tilde O(n^{-1/2})$ bound for the sequential case with monotonic (in time) experts, resolving the question of [15]. Consider the bound given by Theorem 30.2:
$$V(\Pi) \le 2\sup_{w,z}\mathbb E_\epsilon\left[\sup_{\pi\in\Pi}\frac{1}{n}\sum_{t=1}^n \epsilon_t\, \big|\pi_t(w_1(\epsilon),\ldots,w_{t-1}(\epsilon)) - z_t(\epsilon)\big|\right]$$
$$= 2\sup_{w}\mathbb E_\epsilon\left[\sup_{\pi\in\Pi}\frac{1}{n}\sum_{t=1}^n \epsilon_t\, \pi_t(w_1(\epsilon),\ldots,w_{t-1}(\epsilon))\right] = 2\sup_{h}\mathbb E_\epsilon\left[\sup_{\pi\in\Pi}\frac{1}{n}\sum_{t=1}^n \epsilon_t\, \pi_t(h_t(\epsilon))\right]$$
where we employed the same contraction technique as in Example 27 and then passed to a history tree. We claim that the sequential fat-shattering dimension for history trees is at most $2/\alpha$ for exactly the same reason as in the i.i.d. case: by taking the alternating sequence of signs (tracing the left-right-left-right path), the shattered tree needs to furnish a function that passes above and below the witness levels. Exactly as argued in Example 13, monotonicity prevents us from having a shattered tree of depth more than $2/\alpha$.

To conclude the argument, it remains to show that the covering numbers on any history tree are at most $O((2en/\alpha)^{\mathrm{fat}_\alpha})$, similarly to the combinatorial result of Theorem 14.5, and then plug this bound into (30.2), exactly as in Section 12.3. The combinatorial proof does hold in this particular case, yet there are several subtleties: at certain steps in the proof of Theorem 14.6, we need to guarantee the existence of a shattered tree, but, unfortunately, the tree might not respect our restriction of being a history tree. Still, a simple modification works: we simply need to extend the notion of fat-shattering from history trees to a larger set. Instead of getting involved in these modifications, we now give a separate proof of the fact that the covering numbers are polynomial for the class of nonincreasing experts (the nondecreasing case is analogous). The proof is also a good illustration of the combinatorial argument that underlies Theorem 14.6.
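To see where the rate comes from, one can plug the covering bound of Lemma 30.5 below into the entropy integral (30.2); the following back-of-the-envelope computation is our own filling-in of this step, with constants not optimized. Since $\log\mathcal N_2(\Pi,h,\delta)\le\log\mathcal N_\infty(\Pi,h,\delta) \lesssim \frac{1}{\delta}\log\frac{2en}{\delta}$, taking $\alpha = 1/n$ in (30.2) gives
$$\mathcal R^{\mathrm{seq}}(\Pi,h) \ \lesssim\ \frac{4}{n} + \frac{12}{\sqrt n}\int_{1/n}^1 \sqrt{\frac{1}{\delta}\log\frac{2en}{\delta}}\ d\delta \ \le\ \frac{4}{n} + \frac{12\sqrt{\log(2en^2)}}{\sqrt n}\int_0^1 \delta^{-1/2}\, d\delta \ =\ O\!\left(\sqrt{\frac{\log n}{n}}\right),$$
which is the claimed $\tilde O(n^{-1/2})$ rate, whereas a single-scale bound only yields $\tilde O(n^{-1/3})$.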


Lemma 30.5. For the class Π of nonincreasing experts satisfying

$$\forall t,\ \forall(z_1,\ldots,z_n):\quad \pi_t(z_1,\ldots,z_{t-1}) \ge \pi_{t+1}(z_1,\ldots,z_t),$$
the size of the sequential $\ell_\infty$ cover is $O((2en/\alpha)^{1/\alpha})$.

Proof. Define $g_k(d,n) = \sum_{i=0}^d \binom{n}{i} k^i$, as in the proof of Theorem 14.6. The key recursion for $g_k$ is:

$$g_k(d,n) = g_k(d,n-1) + k\, g_k(d-1,n-1)\,.$$
Throughout the proof we will refer to a subtree of a history tree simply as a history tree. Let $k = \frac{1}{\alpha}$, and assume for simplicity that $k$ is an integer. Fix a history tree $h$. Let $t = |h_1|$ stand for the length of the history at the root $h_1$ (that is, $h_1\in Z^t$). Since $h$ can be a subtree of the history tree, $t$ may not be $1$. Let $\Pi'$ be a set of nonincreasing experts, a subset of the original class $\Pi$. Define

$$d(\Pi', h) = \max\big\{\lceil \pi_t(h_1)/\alpha\rceil\ :\ \pi\in\Pi'\big\}$$
It is not difficult to see that $d$ in fact corresponds to an upper bound on the fat-shattering dimension of the class $\Pi'$ on $h$. We now proceed by induction. Fix $d$ and $n$. We suppose by way of induction that for any $\Pi'$, the size of the $\ell_\infty$ cover of $\Pi'$ on any history tree $h$ of size $n-1$ is upper bounded by $g_k(d(\Pi',h), n-1)$ under the assumption $d(\Pi',h)\le d-1$. The basis of the induction is trivial, so let us proceed to the induction step. Suppose we have a history tree $h$ of size $n$ and a subset $\Pi'$ of nonincreasing experts such that $d(\Pi',h) = d$. Define $1/\alpha$ subsets $\Pi_i = \{\pi\in\Pi' : \lceil\pi_t(h_1)/\alpha\rceil = i\}$ according to their (rounded up) value at the root $h_1$, where $t = |h_1|$. Consider any $j\neq d(\Pi',h)$. By definition, $d(\Pi_j, h) < d(\Pi',h)$. Let $h^\ell$ and $h^r$ be the left and the right subtrees of $h$ at the root. By the nonincreasing property of the experts, $d(\Pi_j, h^\ell)\le d(\Pi_j, h) < d(\Pi',h)$ and similarly $d(\Pi_j, h^r) < d(\Pi',h)$. By the induction assumption, there is an $\ell_\infty$-cover of $\Pi_j$ on $h^r$ of size at most $g_k(d(\Pi',h) - 1, n-1)$, and the same holds on the left subtree. The covering trees for the left subtree and for the right subtree can be joined together, with the root being the value $j\alpha$. It is easy to see that this preserves the property of an $\alpha$-cover in the $\ell_\infty$ sense. In this process, the size of the cover stays the same, so the $\ell_\infty$-cover for $\Pi_j$ is of size at most $g_k(d(\Pi',h) - 1, n-1)$. Note that there are a total of at most $k = 1/\alpha$ such sets $\Pi_j$, resulting in a total cover size of $k\cdot g_k(d(\Pi',h) - 1, n-1)$. It remains to provide a cover for the case $j = d(\Pi',h)$. We perform the same

joining operation, yet we can only guarantee the size $g_k(d(\Pi',h), n-1)$ rather than $g_k(d(\Pi',h) - 1, n-1)$. But this is exactly what is needed to prove the inductive step, thanks to the recursive form of $g_k$.
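As a small sanity check of the combinatorial backbone of this argument (purely illustrative), the snippet below computes $g_k(d,n)$ and verifies the recursion used in the inductive step.

    from math import comb

    def g(k, d, n):
        # g_k(d, n) = sum_{i=0}^{d} C(n, i) * k^i, as in the proof of Theorem 14.6
        return sum(comb(n, i) * k ** i for i in range(d + 1))

    # verify g_k(d, n) = g_k(d, n-1) + k * g_k(d-1, n-1) on a small grid
    for k in range(1, 5):
        for d in range(1, 6):
            for n in range(1, 8):
                assert g(k, d, n) == g(k, d, n - 1) + k * g(k, d - 1, n - 1)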

Remark 30.6. Clearly, it is impossible to compete with the set of all possible strategies. It is quite interesting that the seemingly innocuous assumption of monotonicity of the non-static experts yields the rate of $\tilde O(n^{-1/2})$. Such a rate is expected from a small class. Yet, we see that the covering numbers are of the type $O((n/\alpha)^{1/\alpha})$, which is larger than the VC type $O((n/\alpha)^{\dim})$ size of the $\ell_\infty$ cover. It is then imperative to use the Dudley type bound in order to obtain the correct rate. Thanks to the extension of the empirical process theory to the sequential prediction problem, this is now possible.

30.5 Compression and Sufficient Statistics

We now assume that the strategies in $\Pi$ have a particular form: they all work with a "sufficient statistic", or, more loosely, a compression of the past data. Suppose the "sufficient statistics" can take values in some set $\Gamma$. Fix a set $\bar\Pi$ of mappings $\bar\pi : \Gamma\to F$. We assume that all the strategies in $\Pi$ are of the form

$$\pi_t(z_1,\ldots,z_{t-1}) = \bar\pi(\gamma(z_1,\ldots,z_{t-1})), \qquad \text{for some } \bar\pi\in\bar\Pi$$
where $\gamma : Z^*\to\Gamma$ is a fixed function (we can relax this assumption). Such a bottleneck $\Gamma$ can arise due to finite memory or finite precision, but can also arise if the strategies in $\Pi$ are actually solutions to a statistical problem. The latter case is of particular interest to us. If we assume a certain stochastic source for the data, we may estimate the parameters of the model, and there is often a natural set of sufficient statistics associated with it. If we collect all such solutions to stochastic models in a set $\Pi$, we may compete with all these strategies as long as the set where the sufficient statistics take values is not too large.

In the setting described above, the sequential Rademacher complexity for strategies $\Pi$ can be upper bounded by the complexity of $\bar\Pi$ on $\Gamma$-valued trees:

$$\mathcal R^{\mathrm{seq}}(\Pi) \le \sup_{g}\mathbb E_\epsilon\left[\sup_{\bar\pi\in\bar\Pi}\frac{1}{n}\sum_{t=1}^n \epsilon_t\, \ell(\bar\pi, g_t(\epsilon))\right]$$
We refer to [23, 45] for more details on these types of bounds. [TO BE EXPANDED]
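As a toy illustration of such a bottleneck (our own example, not one from the text), take $\gamma$ to be the empirical frequency of 1's rounded to a fixed grid, and strategies that depend on the past only through this statistic:

    def gamma(history, grid=10):
        # sufficient statistic: frequency of 1's so far, rounded to a grid of size grid + 1
        if not history:
            return 0.0
        return round(grid * sum(history) / len(history)) / grid

    def make_strategy(threshold):
        # a strategy pi_t(z_1, ..., z_{t-1}) = pi_bar(gamma(...)):
        # predict 1 iff the rounded frequency of past 1's exceeds the threshold
        return lambda history: 1 if gamma(history) > threshold else 0

    strategies = [make_strategy(th / 10) for th in range(10)]   # a small set Pi
    print(strategies[3]([1, 0, 1, 1]))                          # prints 1: gamma = 0.8 > 0.3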

31 Localized Analysis and Fast Rates. Local Rademacher Complexities

A Appendix

Lemma A.1 (McDiarmid's Inequality). Let $\phi : \{0,1\}^n\to\mathbb R$ satisfy

$$\forall j,\ \forall z_1,\ldots,z_n, z'_j,\qquad \big|\phi(z_1,\ldots,z_j,\ldots,z_n) - \phi(z_1,\ldots,z'_j,\ldots,z_n)\big| \le c_j$$
for some $c_1,\ldots,c_n\ge 0$. Suppose $Z_1,\ldots,Z_n$ are i.i.d. random variables. Then for any $\gamma > 0$,
$$\mathbb P\big(|\phi(Z_1,\ldots,Z_n) - \mathbb E\phi| \ge \gamma\big) \le 2\exp\left\{-\frac{2\gamma^2}{\sum_{j=1}^n c_j^2}\right\}$$
and the factor $2$ in front of the exponential can be removed for the one-sided statements.

Lemma A.2 (Integrating out the Tails). Suppose for a nonnegative random variable $X$ we can prove
$$\mathbb P\left(X \ge A + B\sqrt{\frac{\gamma}{n}}\right) \le e^{-\gamma}$$
Then $\mathbb E[X]\le A + \frac{C}{\sqrt n}$ for some $C$ that depends only on $B$.

Proof.

$$\mathbb E[X] = \int_0^A \mathbb P(X > \delta)\, d\delta + \int_A^\infty \mathbb P(X > \delta)\, d\delta \ \le\ A + \int_0^\infty \mathbb P(X > A + \delta)\, d\delta$$

Setting $\delta = B\sqrt{\frac{\gamma}{n}}$, we get $\gamma = \frac{n\delta^2}{B^2}$ and
$$\mathbb E[X] \le A + \int_0^\infty \exp\left\{-\frac{n\delta^2}{B^2}\right\} d\delta = A + \frac{C}{\sqrt n}\,,$$
where the Gaussian integral evaluates to $C = \frac{B\sqrt\pi}{2}$.

Bibliography

[1] J. Abernethy, P.L. Bartlett, and E. Hazan. Blackwell approachability and low- regret learning are equivalent. CoRR, abs/1011.1936, 2010.

[2] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44:615–631, 1997.

[3] M. Anthony and P.L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

[4] P.L. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. Machine Learning, 48(1):85–113, 2002.

[5] P.L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463–482, 2003.

[6] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

[7] J.O. Berger. Statistical decision theory and Bayesian analysis. Springer, 1985.

[8] D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8, 1956.


[9] D. Blackwell. Minimax vs. Bayes prediction. Probability in the Engineering and Informational Sciences, 9:53–58, 1995.

[10] D. Blackwell and Meyer A. Girshick. Theory of games and statistical decisions. Wiley publications in statistics. Dover Publications, 1979.

[11] V.I. Bogachev. Measure Theory, volume 2. Springer, 2007.

[12] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to Statistical Learning Theory. In O. Bousquet, U. von Luxburg, and G. Rätsch, editors, Advanced Lectures in Machine Learning, pages 169–207. Springer, 2004.

[13] L. Breiman. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3):199–231, 2001.

[14] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. Information Theory, IEEE Transactions on, 50(9):2050–2057, 2004.

[15] N. Cesa-Bianchi and G. Lugosi. On prediction of individual sequences. The Annals of Statistics, 27(6):1865–1895, 1999.

[16] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[17] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2):321–352, 2007.

[18] C.K. Chiang, T. Yang, C.J. Lee, M. Mahdavi, C.J. Lu, R. Jin, and S. Zhu. Online optimization with gradual variations. COLT, 2012.

[19] A.S. Dalalyan and A.B. Tsybakov. Aggregation by exponential weighting and sharp oracle inequalities. In Proceedings of the 20th annual conference on Learning theory, pages 97–111. Springer-Verlag, 2007.

[20] P.A. Dawid and V.G. Vovk. Prequential probability: Principles and properties. Bernoulli, 5(1):125–162, 1999.

[21] L. Devroye, L. Györfi, and G. Lugosi. A probabilistic theory of pattern recognition, volume 31. Springer Verlag, 1996.


[22] R.M. Dudley. Central limit theorems for empirical measures. The Annals of Probability, 6(6):899–929, 1978.

[23] D. Foster, A. Rakhlin, K. Sridharan, and A. Tewari. Complexity-based approach to calibration with checking rules. In COLT, 2011.

[24] L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A distribution-free theory of nonparametric regression. Springer Verlag, 2002.

[25] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.

[26] E. Hazan and S. Kale. Better algorithms for benign bandits. In Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 38–47. Society for Industrial and Applied Mathematics, 2009.

[27] E. Hazan and S. Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine Learning, 80(2):165–188, 2010.

[28] D. Hsu, S.M. Kakade, and T. Zhang. An analysis of random design linear regression. Arxiv preprint arXiv:1106.2363, 2011.

[29] A. Juditsky, P. Rigollet, and A.B. Tsybakov. Learning by mirror averaging. The Annals of Statistics, 36(5):2183–2206, 2008.

[30] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. J. Comput. Syst. Sci., 71(3):291–307, 2005.

[31] M.J. Kearns and R.E. Schapire. Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences, 48(3):464–497, 1994.

[32] M.J. Kearns, R.E. Schapire, and L.M. Sellie. Toward efficient agnostic learning. Machine Learning, 17(2):115–141, 1994.

[33] V. Koltchinskii. Oracle inequalities in empirical risk minimization and sparse recovery problems. Saint-Flour Lecture Notes, 2008.

[34] T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.


[35] M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer-Verlag, New York, 1991.

[36] H.R. Lerche and J. Sarkar. The Blackwell prediction algorithm for infinite 0-1 sequences, and a generalization. In Statistical Decision Theory and Related Topics V, Ed.: S.S. Gupta, J.O. Berger, Springer Verlag, pages 503–511, 1994.

[37] P. Massart. Concentration inequalities and model selection. 2007.

[38] S. Mendelson. A few notes on statistical learning theory. In S. Mendelson and A. J. Smola, editors, Advanced Lectures in Machine Learning, LNCS 2600, Machine Learning Summer School 2002, Canberra, Australia, February 11-22, pages 1–40. Springer, 2003.

[39] S. Mendelson and R. Vershynin. Entropy and the combinatorial dimension. Inventiones mathematicae, 152(1):37–55, 2003.

[40] N. Merhav and M. Feder. Universal prediction. IEEE Transactions on Information Theory, 44:2124–2147, 1998.

[41] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[42] A. Rakhlin. Lecture notes on online learning, 2008. Available at http://www-stat.wharton.upenn.edu/~rakhlin/papers/online_learning.pdf.

[43] A. Rakhlin, O. Shamir, and K. Sridharan. Relax and localize: From value to algorithms, 2012. Available at http://arxiv.org/abs/1204.0870.

[44] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In NIPS, 2010. Available at http://arxiv.org/abs/1006.1138.

[45] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Beyond regret. In COLT, 2011. Available at http://arxiv.org/abs/1011.3168.

[46] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Stochastic, constrained, and smoothed adversaries. In NIPS, 2011. Available at http://arxiv.org/abs/1104.5070.


[47] H. Robbins. Some aspects of the sequential design of . Bull. Amer. Math. Soc., 58(5):527–535, 1952.

[48] M. Rudelson and R. Vershynin. Combinatorics of random processes and sections of convex bodies. The Annals of Mathematics, 164(2):603–648, 2006.

[49] S. Simons. You cannot generalize the minimax theorem too much. Milan Journal of Mathematics, 59(1):59–64, 1989.

[50] K. Sridharan. Learning From An Optimization Viewpoint. PhD thesis, 2011.

[51] A.B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.

[52] A.B. Tsybakov. Introduction to nonparametric estimation. Springer Verlag, 2009.

[53] L. G. Valiant. A theory of the learnable. Proc. of the 1984 STOC, pages 436–445, 1984.

[54] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes with Applications to Statistics. Springer-Verlag, New York, 1996.

[55] V. N. Vapnik. Statistical learning theory. J. Wiley, 1998.

[56] A. Wald. Statistical decision functions. John Wiley and Sons, 1950.

[57] L. Wasserman. All of nonparametric statistics. Springer-Verlag New York Inc, 2006.

[58] Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, 27(5):1564–1599, 1999.
