9.520: Statistical Learning Theory and Applications          February 8th, 2010
The Learning Problem and Regularization
Lecturer: Tomaso Poggio        Scribe: Sara Abbasabadi

1 Introduction

Today we will introduce a few basic mathematical concepts underlying the learning problem and the generalization properties of learning algorithms. We will define empirical risk, expected risk, and empirical risk minimization. We will then introduce regularization algorithms, which form the basis for the next few classes.

2 The Learning Problem

Consider the supervised problem where, for a set of input data X, the corresponding output labels Y are also known. Data samples are drawn independently from an underlying distribution µ(z) on Z = X × Y and form the training set S:

(x_1, y_1), ..., (x_n, y_n), that is, z_1, ..., z_n. We assume the samples are independent and identically distributed (i.i.d.) random variables, which in particular means that their order does not matter (the exchangeability property). Our goal is to find the conditional probability of y given an input x.

µ(z) = p(x, y) = p(y|x) · p(x)

Here p(x, y) is fixed but unknown. Finding this probabilistic relation between X and Y solves the learning problem.

[Figure: the input space X and the output space Y, related through the marginal P(x) and the conditional P(y|x).]

For example, suppose we have data (x_i, F_i) measured from a spring. In the noiseless case, Hooke's law F = ax gives p(F|x) = δ(F − ax). With additive Gaussian noise, Hooke's law F = ax gives p(F|x) ∝ e^{−(F−ax)²/2σ²}.
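A minimal simulation of this noisy spring model may make the setup concrete (an illustration, not part of the original notes; the stiffness a and noise level σ below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

a, sigma, n = 1.5, 0.1, 20   # hypothetical stiffness, noise level, sample count

x = rng.uniform(0.0, 1.0, size=n)            # displacements, drawn from p(x)
F = a * x + rng.normal(0.0, sigma, size=n)   # forces, drawn from p(F|x) = N(ax, sigma^2)

# Each pair z_i = (x_i, F_i) is an i.i.d. sample from mu(z) = p(F|x) p(x).
```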

2.1 Hypothesis Space

The hypothesis space H must be defined for every learning algorithm: it is the space of functions the algorithm is allowed to explore. This space can be finite or infinite. The learning algorithm searches this space of possible functions and, depending on the training data S, finds the function that maps the input to the output. This function is then used for future predictions at a new value of X:

y_pred = f_S(x_new)

If y is a real-valued random variable, we have regression. If y takes values from an unordered finite set, we have pattern classification. In two-class pattern classification problems, we assign one class the value y = 1 and the other class the value y = −1. Some examples of possible hypothesis spaces include:

• Example 1: H consists of continuous functions on [0, 1]. Continuity: for each ε > 0 there exists δ > 0 such that for all x′ with |x − x′| < δ we have |f(x) − f(x′)| < ε.
• Example 2: H consists of polynomials of degree n.
• Example 3: H consists of all linear functions.

• Example 4: H consists of equicontinuous functions on [0, 1]. H is uniformly bounded if there exists M such that |f(x)| ≤ M for all f ∈ H and all x. H is equicontinuous if for each ε > 0 there exists δ > 0 such that |f(x′) − f(x″)| < ε for all x′, x″ with |x′ − x″| < δ and for all f ∈ H. The Arzelà-Ascoli theorem states that a necessary and sufficient condition for a class of continuous functions H to be compact is that it be uniformly bounded and equicontinuous.

2.2 Loss Functions

The loss function V is defined in order to measure the goodness of our prediction. V(f, z) = V(f(x), y) denotes the price we pay when we see the sample x and guess that the associated y value is f(x) when it is actually y. For regression, the most common choice is the square loss, or L2 loss:

V(f(x), y) = (f(x) − y)^2

We could also use the absolute value, or L1 loss:

V(f(x), y) = |f(x) − y|

Vapnik's more general ε-insensitive loss function is:

V(f(x), y) = (|f(x) − y| − ε)_+

Here y is the correct label and f(x) is the predicted label of sample x. For binary classification, the most intuitive loss is the 0-1 loss:

V(f(x), y) = Θ(−y f(x))

where Θ is the step function and y is binary, e.g. y = +1 or y = −1. This means that when y and f(x) have the same sign, the prediction is correct and the loss is zero. For tractability and other reasons, we often use the hinge loss (implicitly introduced by Vapnik) in binary classification:

V(f(x), y) = (1 − y · f(x))_+
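As a sketch (not part of the original notes), the losses above translate directly into NumPy; the parameter epsilon stands for the ε of the ε-insensitive loss:

```python
import numpy as np

def square_loss(fx, y):
    """L2 loss: (f(x) - y)^2."""
    return (fx - y) ** 2

def absolute_loss(fx, y):
    """L1 loss: |f(x) - y|."""
    return np.abs(fx - y)

def eps_insensitive_loss(fx, y, epsilon=0.1):
    """Vapnik's eps-insensitive loss: (|f(x) - y| - eps)_+."""
    return np.maximum(np.abs(fx - y) - epsilon, 0.0)

def zero_one_loss(fx, y):
    """0-1 loss: 1 when y and f(x) disagree in sign (ties count as errors)."""
    return np.where(y * fx <= 0, 1.0, 0.0)

def hinge_loss(fx, y):
    """Hinge loss: (1 - y f(x))_+."""
    return np.maximum(1.0 - y * fx, 0.0)
```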

2.3 Expected Error, Empirical Error

Given a function f, a loss function V, and a probability distribution µ over Z, the expected or true error of f is:

$$I[f] = \mathbb{E}_z V(f, z) = \int_Z V(f, z)\, d\mu(z)$$

which is the expected loss on a new example drawn at random from µ. The expected loss indicates the performance of the algorithm on future samples. The goal is to choose f so as to make the expected error I[f] as small as possible, but in general we do not know µ. Instead we compute an empirical approximation of the expected error, called the empirical error, which is the average error on the training set. Given a function f, a loss function V, and a training set S consisting of n data points, the empirical error of f is:

$$I_S[f] = \frac{1}{n} \sum_{i=1}^{n} V(f, z_i)$$
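Concretely (an illustrative sketch, not from the notes), the empirical error is just the mean loss over the training set; here with the square loss and a hypothetical linear predictor:

```python
import numpy as np

def empirical_error(f, S, loss):
    """I_S[f]: the average loss of f over the training set S = [(x_1, y_1), ...]."""
    return np.mean([loss(f(x), y) for x, y in S])

# Hypothetical data and predictor.
S = [(0.0, 0.1), (0.5, 0.8), (1.0, 1.9)]
f = lambda x: 2.0 * x
print(empirical_error(f, S, lambda fx, y: (fx - y) ** 2))
```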

3 Generalization and Stability

Let's first define convergence in probability. For a sequence of random variables {X_n}, convergence is defined as follows:

$$\lim_{n\to\infty} X_n = X \text{ in probability}$$

if

$$\forall \varepsilon > 0 \quad \lim_{n\to\infty} \mathbb{P}\{|X_n - X| \ge \varepsilon\} = 0,$$

or if for each n there exist ε_n and δ_n such that

$$\mathbb{P}\{|X_n - X| \ge \varepsilon_n\} \le \delta_n,$$

with ε_n and δ_n going to zero as n → ∞.

3.1 Generalization

A natural requirement for f_S is distribution-independent generalization. Generalization means that the difference between the empirical error and the expected error goes to zero as the number of training samples increases:

$$\forall \mu, \quad \lim_{n\to\infty} |I_S[f_S] - I[f_S]| = 0 \text{ in probability}$$

This is equivalent to saying that for each n there exist ε_n and δ_n such that, for all µ,

$$\mathbb{P}\{|I_{S_n}[f_{S_n}] - I[f_{S_n}]| \ge \varepsilon_n\} \le \delta_n,$$

with ε_n and δ_n going to zero as n → ∞. In other words, the training error of the solution must converge to the expected error and thus be a proxy for it; otherwise the solution would not be predictive. This resembles the law of large numbers, but differs from it: here f_S depends on the training set S, so the classical law of large numbers does not apply directly. A desirable additional requirement is universal consistency, defined as:

$$\forall \varepsilon > 0 \quad \lim_{n\to\infty} \sup_{\mu} \mathbb{P}_S\left\{ I[f_S] > \inf_{f\in H} I[f] + \varepsilon \right\} = 0.$$

The consistency criterion says that, as the size of the training set goes to infinity and uniformly over all probability distributions, the expected error of the solution comes close to the minimum expected error over H. We are also interested in error bounds that hold for any finite number of samples. For that we define the convergence rate. Suppose we can prove that with probability at least 1 − e^{−τ²} we have

$$|I_S[f_S] - I[f_S]| \le \frac{C}{\sqrt{n}}\, \tau$$

for some (problem-dependent) constant C. This result gives a convergence rate. If we fix ε and τ and solve the equation ε = Cτ/√n for n, we obtain the sample complexity:

$$n(\varepsilon, \tau) = \frac{C^2 \tau^2}{\varepsilon^2}$$

This setup allows us to find the number of samples needed to obtain an error of ε with confidence 1 − e^{−τ²}.
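As a worked instance (the numbers are illustrative, not from the notes): take C = 1, confidence parameter τ = 3 (confidence 1 − e^{−9} ≈ 0.9999), and target error ε = 0.1. Then

$$n(\varepsilon, \tau) = \frac{C^2 \tau^2}{\varepsilon^2} = \frac{1 \cdot 3^2}{0.1^2} = 900,$$

so on the order of 900 samples suffice for this accuracy and confidence.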

In addition to the key property of generalization, a good learning algorithm should also be stable: we do not want the function found by the learning algorithm to depend critically on small changes in the training points. The stability problem is described in the next section.

3.2 Well-posed and Ill-posed Problems

A problem is well-posed if its solution:

• exists
• is unique
• depends continuously on the data (i.e. it is stable)

A problem is ill-posed if it is not well-posed. In the context of this class, well-posedness is mainly used to mean stability of the solution. Hadamard introduced the definition of ill-posedness. Ill-posed problems are typically inverse problems. As an example, assume g is a function in Y and u is a function in X, with Y and X Hilbert spaces. Given a linear, continuous operator L, consider the equation:

g = Lu

The direct problem is to compute g given u; the inverse problem is to compute u given the data g. The problem of learning is an inverse problem: in the learning case, L is somewhat similar to a sampling operator, and the inverse problem becomes that of finding a function that takes the values:

f(x_i) = y_i,  i = 1, ..., n

The inverse problem of finding u is well-posed when the solution exists, is unique, and is stable, that is, depends continuously on the initial data g. Ill-posed problems fail to satisfy one or more of these criteria. The learning problem, like many inverse problems, is often not stable and thus ill-posed. We would like our class of learning algorithms to be stable.

4 Empirical Risk Minimization

One classical and simple class of learning algorithms is empirical risk minimization (ERM). Given a training set S and a function space H, empirical risk minimization (Vapnik introduced the term) is the class of algorithms that look at S and select the f_S in the hypothesis space that minimizes the empirical error:

$$f_S = \arg\min_{f \in H} I_S[f]$$
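A minimal ERM sketch (illustrative, not from the notes): with a finite hypothesis space and the square loss, ERM is a direct search for the empirical-error minimizer; here H is a crude grid of linear functions f(x) = ax:

```python
import numpy as np

def erm(S, hypotheses, loss):
    """Select the hypothesis in H minimizing the empirical error I_S[f] on S."""
    def emp_err(f):
        return np.mean([loss(f(x), y) for x, y in S])
    return min(hypotheses, key=emp_err)

# A crude finite H: linear functions f(x) = a*x over a grid of slopes a.
H = [lambda x, a=a: a * x for a in np.linspace(-3.0, 3.0, 601)]
S = [(0.0, 0.05), (0.5, 0.9), (1.0, 2.1)]
f_S = erm(S, H, lambda fx, y: (fx - y) ** 2)
```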

For example, linear regression is ERM when V(z) = (f(x) − y)^2 and H is the space of linear functions f = ax. For ERM to represent a good class of learning algorithms, the solution should generalize and be stable. Here is an example showing a number of samples together with the true solution for those samples:

[Figure: training samples and the true underlying function.]

Suppose that the ERM algorithm selects this function (with zero empirical error):

[Figure: an interpolating function that achieves zero empirical error but differs from the true function.]

This is not the true function that we want to recover. Thus, we need generalization conditions ensuring that the ERM solution converges to the true solution as the number of examples increases.

Now let's consider a stability example using 10 training points and the smoothest interpolating polynomial fit, of degree 9, that gives zero empirical error:

[Figure: the 10 training points and the degree-9 interpolating polynomial; both axes span [0, 1].]

Perturbing the points slightly gives the following solution:

[Figure: the degree-9 polynomial fit after slightly perturbing the training points; both axes span [0, 1].]

The result is a very different solution, which indicates a lack of stability in this example.

Using a lower-degree polynomial, e.g. degree 2, gives the following fits before and after perturbation:

[Figure: the degree-2 polynomial fits before and after the same perturbation; both axes span [0, 1].]

Although the fit to the data is not as good as before (empirical error > 0), the solution does not change much when the data are perturbed, which indicates stability.
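This experiment is easy to reproduce; below is an illustrative sketch (the target function, noise level, and seed are arbitrary choices, not from the notes) using NumPy's polynomial fitting:

```python
import numpy as np

rng = np.random.default_rng(1)

x = np.linspace(0.0, 1.0, 10)                      # 10 training points
y = 0.3 * np.sin(2 * np.pi * x) + 0.5              # a hypothetical smooth target
y_pert = y + rng.normal(0.0, 0.02, size=y.shape)   # slightly perturbed labels

grid = np.linspace(0.0, 1.0, 200)
for degree in (9, 2):
    f = np.polyval(np.polyfit(x, y, degree), grid)
    f_pert = np.polyval(np.polyfit(x, y_pert, degree), grid)
    # Degree 9 interpolates (zero empirical error) but the two fits differ
    # sharply; degree 2 has empirical error > 0 but barely changes.
    print(degree, np.max(np.abs(f - f_pert)))
```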

4.1 Conditions for Well-posedness (Stability) and Predictivity (Generalization)

Since Tikhonov, it has been well known that a generally ill-posed problem such as ERM can be guaranteed to be well-posed, and therefore stable, by an appropriate choice of H. For example, compactness of H guarantees stability. We would like a hypothesis space that yields generalization too. Loosely speaking, this would be an H for which the solution of ERM, say f_S, is such that |I_S[f_S] − I[f_S]| converges to zero in probability as n increases. Note that this requirement is NOT the law of large numbers, since f_S depends on the training set S and is not fixed. Is there any possible relationship between stability and generalization? According to Vapnik and Červonenkis (1971), Alon et al. (1997), and Dudley, Giné, and Zinn (1991), a necessary and sufficient condition for generalization (and consistency) of ERM is that H is a class of functions called uniform Glivenko-Cantelli (uGC). H is a uGC class if

$$\forall \varepsilon > 0 \quad \lim_{n\to\infty} \sup_{\mu} \mathbb{P}_S \left\{ \sup_{f\in H} |I[f] - I_S[f]| > \varepsilon \right\} = 0.$$

For example, the class of continuous functions is not uGC and thus does not satisfy the generalization condition. This theorem (Vapnik et al.) indicates that a proper choice of the hypothesis space H ensures generalization of ERM (and consistency, since for ERM generalization is necessary and sufficient for consistency, and vice versa). A separate theorem (Niyogi, Poggio et al.) guarantees stability of ERM. Thus, with the appropriate definition of stability, stability and generalization are equivalent for ERM and correspond to the same constraints on H.

5 Regularization

Regularization (originally introduced by Tikhonov independently of the learning problem) ensures well-posedness and thus generalization of ERM by constraining the hypothesis space H, for instance by not allowing polynomials of too high a degree relative to the number of data points. The direct way is called Ivanov regularization; the indirect way is Tikhonov regularization (which is not strictly ERM). ERM finds the function in (H, ‖·‖) which minimizes

$$\frac{1}{n}\sum_{i=1}^{n} V(f(x_i), y_i)$$

which in general, for an arbitrary hypothesis space H, is ill-posed. Ivanov regularization finds the function that minimizes

$$\frac{1}{n}\sum_{i=1}^{n} V(f(x_i), y_i)$$

over functions whose norm is constrained to satisfy

$$\|f\|^2 \le A.$$

Tikhonov regularization minimizes over the hypothesis space H, for a fixed positive parameter γ, the regularized functional

$$\frac{1}{n}\sum_{i=1}^{n} V(f(x_i), y_i) + \gamma \|f\|_K^2, \qquad (1)$$

where ‖f‖_K is the norm in H, the Reproducing Kernel Hilbert Space (RKHS) defined by the kernel K. Tikhonov regularization ensures well-posedness, i.e. existence, uniqueness and especially stability (in a very strong form). It also provides generalization.
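As a concrete sketch (an illustration under assumptions, not the notes' own algorithm): with the square loss and a linear function class, the Tikhonov functional (1) becomes ridge regression, whose minimizer has a closed form:

```python
import numpy as np

def ridge_regression(X, y, gamma):
    """Minimize (1/n) * sum_i (w . x_i - y_i)^2 + gamma * ||w||^2.
    Setting the gradient to zero gives w = (X^T X / n + gamma I)^{-1} X^T y / n."""
    n, d = X.shape
    A = X.T @ X / n + gamma * np.eye(d)
    b = X.T @ y / n
    return np.linalg.solve(A, b)

# Hypothetical data; gamma trades the empirical error against the norm penalty.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, size=50)
w = ridge_regression(X, y, gamma=0.1)
```

Larger γ enforces a smaller-norm (stabler) solution at the cost of a worse fit to the data, mirroring the degree-2 versus degree-9 polynomial example above.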

6 Target Space, Sample and Approximation Error

In addition to the hypothesis space H, the space our algorithms are allowed to search, we need to define the target space T. This space is in general larger than the hypothesis space and contains the true function f_0 that minimizes the risk. Often, T is chosen to be all functions in L_2, or all differentiable functions. Notice that the true function, if it exists, is determined by µ(z), which contains all the relevant information.

6.1 Sample Error

Let f_H be the function in H with the smallest true risk. We have defined the generalization error to be I_S[f_S] − I[f_S]. We define the sample error to be I[f_S] − I[f_H], the difference in true risk between the function in H we actually find and the best function in H. This is what we pay because our finite sample does not give us enough information to choose the best function in H. We would like this to be small. Consistency, defined earlier, is equivalent to the sample error going to zero as n → ∞. A main goal in classical learning theory (Vapnik, Smale, ...) is bounding the generalization error. Another goal, for learning theory and statistics, is bounding the sample error, that is, determining conditions under which we can state that I[f_S] − I[f_H] will be small (with high probability). As a simple rule, we expect that if H is well behaved, then as n gets large the sample error becomes small.

6.2 Approximation Error

Let f_0 be the function in T with the smallest true risk. We define the approximation error to be I[f_H] − I[f_0], the difference in true risk between the best function in H and the best function in T. This is what we pay when H is smaller than T. We would like this error to be small too. In much of the following we can assume that I[f_0] = 0. We will focus less on the approximation error in 9.520, but we will explore it. As a simple rule, we expect that as H grows bigger, the approximation error gets smaller. If T ⊆ H, a situation called the realizable setting, the approximation error is zero.

6.3 Error

We define the error to be I[f_S] − I[f_0], the difference in true risk between the function we actually find and the best function in T. We would really like this to be small. As we mentioned, often we can assume I[f_0] = 0, so that the error is simply I[f_S]. The error is the sum of the sample error and the approximation error:

$$I[f_S] - I[f_0] = (I[f_S] - I[f_H]) + (I[f_H] - I[f_0])$$

If we can make both the approximation error and the sample error small, the error will be small. There is a tradeoff between the two: typically, enlarging H to decrease the approximation error increases the sample error, and vice versa. The goal is a hypothesis space that is large enough to give a small approximation error and small enough to give a small sample error.
