9.520: Statistical Learning Theory and Applications          February 8th, 2010
The Learning Problem and Regularization
Lecturer: Tomaso Poggio        Scribe: Sara Abbasabadi

1 Introduction

Today we will introduce a few basic mathematical concepts underlying the learning problem and the generalization properties of learning algorithms. We will define empirical risk, expected risk, and empirical risk minimization. We will then introduce regularization algorithms, which form the basis for the next few classes.

2 The Learning Problem

Consider the supervised problem where, for a set of input data X, the corresponding output labels Y are also known. Data samples are drawn independently from an underlying distribution µ(z) on Z = X × Y and form the training set S:

(x_1, y_1), ..., (x_n, y_n), that is, z_1, ..., z_n. We assume the samples are independent and identically distributed (i.i.d.) random variables, which in particular means that their order does not matter (the exchangeability property). Our goal is to find the conditional probability of y given an input x.

µ(z) = p(x, y) = p(y|x) · p(x)

Here p(x, y) is fixed but unknown. Finding this probabilistic relation between X and Y solves the learning problem.

[Figure: the input space X and the output space Y, related through the marginal P(x) and the conditional P(y|x).]

For example, suppose we have data (x_i, F_i) measured from a spring. In the noiseless case, Hooke's law F = ax gives p(F|x) = δ(F − ax). With additive Gaussian noise, Hooke's law F = ax gives p(F|x) ∝ e^{−(F−ax)²/2σ²}.
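A minimal simulation of this noisy spring model may make the setup concrete (an illustration, not part of the original notes; the stiffness a and noise level σ below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

a, sigma, n = 1.5, 0.1, 20   # hypothetical stiffness, noise level, sample count

x = rng.uniform(0.0, 1.0, size=n)            # displacements, drawn from p(x)
F = a * x + rng.normal(0.0, sigma, size=n)   # forces, drawn from p(F|x) = N(ax, sigma^2)

# Each pair z_i = (x_i, F_i) is an i.i.d. sample from mu(z) = p(F|x) p(x).
```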

2.1 Hypothesis Space

The hypothesis space H must be defined for every learning algorithm: it is the space of functions the algorithm is allowed to explore. This space can be finite or infinite. The learning algorithm searches this space of possible functions and, depending on the training data S, finds the function that maps the input to the output. This function is then used for future predictions at a new value of X:

y_pred = f_S(x_new)

If y is a real-valued random variable, we have regression. If y takes values from an unordered finite set, we have pattern classification. In two-class pattern classification problems, we assign one class the value y = 1 and the other class the value y = −1. Some examples of possible hypothesis spaces include:

• Example 1: H consists of continuous functions on [0, 1]. Continuity: for each ε > 0 there exists δ > 0 such that for all x′ with |x − x′| < δ we have |f(x) − f(x′)| < ε.
• Example 2: H consists of polynomials of degree n.
• Example 3: H consists of all linear functions.

• Example 4: H consists of equicontinuous functions on [0, 1]. H is uniformly bounded if there exists M such that |f(x)| ≤ M for all f ∈ H and all x. H is equicontinuous if for each ε > 0 there exists δ > 0 such that |f(x′) − f(x″)| < ε for all x′, x″ with |x′ − x″| < δ and for all f ∈ H. The Arzelà-Ascoli theorem states that a necessary and sufficient condition for a class of continuous functions H to be compact is that it be uniformly bounded and equicontinuous.

2.2 Loss Functions

The loss function V is defined in order to measure the goodness of our prediction. V(f, z) = V(f(x), y) denotes the price we pay when we see the sample x and guess that the associated y value is f(x) when it is actually y. For regression, the most common choice is the square loss, or L2 loss:

V(f(x), y) = (f(x) − y)^2

We could also use the absolute value, or L1 loss:

V(f(x), y) = |f(x) − y|

Vapnik's more general ε-insensitive loss function is:

V(f(x), y) = (|f(x) − y| − ε)_+

Here y is the correct label and f(x) is the predicted label of sample x. For binary classification, the most intuitive loss is the 0-1 loss:

V(f(x), y) = Θ(−y f(x))

where Θ is the step function and y is binary, e.g. y = +1 or y = −1. This means that when y and f(x) have the same sign, the prediction is correct and the loss is zero. For tractability and other reasons, we often use the hinge loss (implicitly introduced by Vapnik) in binary classification:

V(f(x), y) = (1 − y · f(x))_+
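As a sketch (not part of the original notes), the losses above translate directly into NumPy; the parameter epsilon stands for the ε of the ε-insensitive loss:

```python
import numpy as np

def square_loss(fx, y):
    """L2 loss: (f(x) - y)^2."""
    return (fx - y) ** 2

def absolute_loss(fx, y):
    """L1 loss: |f(x) - y|."""
    return np.abs(fx - y)

def eps_insensitive_loss(fx, y, epsilon=0.1):
    """Vapnik's eps-insensitive loss: (|f(x) - y| - eps)_+."""
    return np.maximum(np.abs(fx - y) - epsilon, 0.0)

def zero_one_loss(fx, y):
    """0-1 loss: 1 when y and f(x) disagree in sign (ties count as errors)."""
    return np.where(y * fx <= 0, 1.0, 0.0)

def hinge_loss(fx, y):
    """Hinge loss: (1 - y f(x))_+."""
    return np.maximum(1.0 - y * fx, 0.0)
```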

2.3 Expected Error, Empirical Error

Given a function f, a loss function V, and a probability distribution µ over Z, the expected or true error of f is:

$$I[f] = \mathbb{E}_z V(f, z) = \int_Z V(f, z)\, d\mu(z)$$

which is the expected loss on a new example drawn at random from µ. The expected loss indicates the performance of the algorithm on future samples. The goal is to choose f so as to make the expected error I[f] as small as possible, but in general we do not know µ. Instead we compute an empirical approximation of the expected error, called the empirical error, which is the average error on the training set. Given a function f, a loss function V, and a training set S consisting of n data points, the empirical error of f is:

$$I_S[f] = \frac{1}{n} \sum_{i=1}^{n} V(f, z_i)$$
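Concretely (an illustrative sketch, not from the notes), the empirical error is just the mean loss over the training set; here with the square loss and a hypothetical linear predictor:

```python
import numpy as np

def empirical_error(f, S, loss):
    """I_S[f]: the average loss of f over the training set S = [(x_1, y_1), ...]."""
    return np.mean([loss(f(x), y) for x, y in S])

# Hypothetical data and predictor.
S = [(0.0, 0.1), (0.5, 0.8), (1.0, 1.9)]
f = lambda x: 2.0 * x
print(empirical_error(f, S, lambda fx, y: (fx - y) ** 2))
```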

3 Generalization and Stability

Let's first define convergence in probability. For a sequence of random variables {X_n}, convergence is defined as follows:

$$\lim_{n\to\infty} X_n = X \text{ in probability}$$

if

$$\forall \varepsilon > 0 \quad \lim_{n\to\infty} \mathbb{P}\{|X_n - X| \ge \varepsilon\} = 0,$$

or if for each n there exist ε_n and δ_n such that

$$\mathbb{P}\{|X_n - X| \ge \varepsilon_n\} \le \delta_n,$$

with ε_n and δ_n going to zero as n → ∞.

3.1 Generalization

A natural requirement for f_S is distribution-independent generalization. Generalization means that the difference between the empirical error and the expected error goes to zero as the number of training samples increases:

$$\forall \mu, \quad \lim_{n\to\infty} |I_S[f_S] - I[f_S]| = 0 \text{ in probability}$$

This is equivalent to saying that for each n there exist ε_n and δ_n such that, for all µ,

$$\mathbb{P}\{|I_{S_n}[f_{S_n}] - I[f_{S_n}]| \ge \varepsilon_n\} \le \delta_n,$$

with ε_n and δ_n going to zero as n → ∞. In other words, the training error of the solution must converge to the expected error and thus be a proxy for it; otherwise the solution would not be predictive. This resembles the law of large numbers, but differs from it: here f_S depends on the training set S, so the classical law of large numbers does not apply directly. A desirable additional requirement is universal consistency, defined as:

$$\forall \varepsilon > 0 \quad \lim_{n\to\infty} \sup_{\mu} \mathbb{P}_S\left\{ I[f_S] > \inf_{f\in H} I[f] + \varepsilon \right\} = 0.$$

The consistency criterion says that, as the size of the training set goes to infinity and uniformly over all probability distributions, the expected error of the solution comes close to the minimum expected error over H. We are also interested in error bounds that hold for any finite number of samples. For that we define the convergence rate. Suppose we can prove that with probability at least 1 − e^{−τ²} we have

$$|I_S[f_S] - I[f_S]| \le \frac{C}{\sqrt{n}}\, \tau$$

for some (problem-dependent) constant C. This result gives a convergence rate. If we fix ε and τ and solve the equation ε = Cτ/√n for n, we obtain the sample complexity:

$$n(\varepsilon, \tau) = \frac{C^2 \tau^2}{\varepsilon^2}$$

This setup allows us to find the number of samples needed to obtain an error of ε with confidence 1 − e^{−τ²}.
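As a worked instance (the numbers are illustrative, not from the notes): take C = 1, confidence parameter τ = 3 (confidence 1 − e^{−9} ≈ 0.9999), and target error ε = 0.1. Then

$$n(\varepsilon, \tau) = \frac{C^2 \tau^2}{\varepsilon^2} = \frac{1 \cdot 3^2}{0.1^2} = 900,$$

so on the order of 900 samples suffice for this accuracy and confidence.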

In addition to the key property of generalization, a good learning algorithm should also be stable: we do not want the function found by the learning algorithm to depend critically on small changes in the training points. The stability problem is described in the next section.

3.2 Well-posed and Ill-posed Problems

A problem is well-posed if its solution:

• exists
• is unique
• depends continuously on the data (i.e. it is stable)

A problem is ill-posed if it is not well-posed. In the context of this class, well-posedness is mainly used to mean stability of the solution. Hadamard introduced the definition of ill-posedness. Ill-posed problems are typically inverse problems. As an example, assume g is a function in Y and u is a function in X, with Y and X Hilbert spaces. Given a linear, continuous operator L, consider the equation:

g = Lu

The direct problem is to compute g given u; the inverse problem is to compute u given the data g. The problem of learning is an inverse problem: in the learning case, L is somewhat similar to a sampling operator, and the inverse problem becomes that of finding a function that takes the values:

f(x_i) = y_i,  i = 1, ..., n

The inverse problem of finding u is well-posed when the solution exists, is unique, and is stable, that is, depends continuously on the initial data g. Ill-posed problems fail to satisfy one or more of these criteria. The learning problem, like many inverse problems, is often not stable and thus ill-posed. We would like our class of learning algorithms to be stable.

4 Empirical Risk Minimization

One classical and simple class of learning algorithms is empirical risk minimization (ERM). Given a training set S and a function space H, empirical risk minimization (Vapnik introduced the term) is the class of algorithms that look at S and select the f_S in the hypothesis space that minimizes the empirical error:

$$f_S = \arg\min_{f \in H} I_S[f]$$
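A minimal ERM sketch (illustrative, not from the notes): with a finite hypothesis space and the square loss, ERM is a direct search for the empirical-error minimizer; here H is a crude grid of linear functions f(x) = ax:

```python
import numpy as np

def erm(S, hypotheses, loss):
    """Select the hypothesis in H minimizing the empirical error I_S[f] on S."""
    def emp_err(f):
        return np.mean([loss(f(x), y) for x, y in S])
    return min(hypotheses, key=emp_err)

# A crude finite H: linear functions f(x) = a*x over a grid of slopes a.
H = [lambda x, a=a: a * x for a in np.linspace(-3.0, 3.0, 601)]
S = [(0.0, 0.05), (0.5, 0.9), (1.0, 2.1)]
f_S = erm(S, H, lambda fx, y: (fx - y) ** 2)
```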

For example, linear regression is ERM when V(z) = (f(x) − y)^2 and H is the space of linear functions f = ax. For ERM to represent a good class of learning algorithms, the solution should generalize and be stable. Here is an example showing a number of samples together with the true solution for those samples:

[Figure: training samples and the true underlying function.]

Suppose that the ERM algorithm selects this function (with zero empirical error):

[Figure: an interpolating function that achieves zero empirical error but differs from the true function.]

This is not the true function that we want to recover. Thus, we need generalization conditions ensuring that the ERM solution converges to the true solution as the number of examples increases.

Now let's consider a stability example using 10 training points and the smoothest interpolating polynomial fit, of degree 9, that gives zero empirical error:

[Figure: the 10 training points and the degree-9 interpolating polynomial; both axes span [0, 1].]

Perturbing the points slightly gives the following solution:

[Figure: the degree-9 polynomial fit after slightly perturbing the training points; both axes span [0, 1].]

The result is a very different solution, which indicates a lack of stability in this example.

Using a lower-degree polynomial, e.g. degree 2, gives the following fits before and after perturbation:

[Figure: the degree-2 polynomial fits before and after the same perturbation; both axes span [0, 1].]

Although the fit to the data is not as good as before (empirical error > 0), the solution does not change much when the data are perturbed, which indicates stability.
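This experiment is easy to reproduce; below is an illustrative sketch (the target function, noise level, and seed are arbitrary choices, not from the notes) using NumPy's polynomial fitting:

```python
import numpy as np

rng = np.random.default_rng(1)

x = np.linspace(0.0, 1.0, 10)                      # 10 training points
y = 0.3 * np.sin(2 * np.pi * x) + 0.5              # a hypothetical smooth target
y_pert = y + rng.normal(0.0, 0.02, size=y.shape)   # slightly perturbed labels

grid = np.linspace(0.0, 1.0, 200)
for degree in (9, 2):
    f = np.polyval(np.polyfit(x, y, degree), grid)
    f_pert = np.polyval(np.polyfit(x, y_pert, degree), grid)
    # Degree 9 interpolates (zero empirical error) but the two fits differ
    # sharply; degree 2 has empirical error > 0 but barely changes.
    print(degree, np.max(np.abs(f - f_pert)))
```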

4.1 Conditions for Well-posedness (Stability) and Predictivity (Generalization)

Since Tikhonov, it has been well known that a generally ill-posed problem such as ERM can be guaranteed to be well-posed, and therefore stable, by an appropriate choice of H. For example, compactness of H guarantees stability. We would like a hypothesis space that yields generalization too. Loosely speaking, this would be an H for which the solution of ERM, say f_S, is such that |I_S[f_S] − I[f_S]| converges to zero in probability as n increases. Note that this requirement is NOT the law of large numbers, since f_S depends on the training set S and is not fixed. Is there any possible relationship between stability and generalization? According to Vapnik and Červonenkis (1971), Alon et al. (1997), and Dudley, Giné, and Zinn (1991), a necessary and sufficient condition for generalization (and consistency) of ERM is that H is a class of functions called uniform Glivenko-Cantelli (uGC). H is a uGC class if

$$\forall \varepsilon > 0 \quad \lim_{n\to\infty} \sup_{\mu} \mathbb{P}_S \left\{ \sup_{f\in H} |I[f] - I_S[f]| > \varepsilon \right\} = 0.$$

For example, the class of continuous functions is not uGC and thus does not satisfy the generalization condition. This theorem (Vapnik et al.) indicates that a proper choice of the hypothesis space H ensures generalization of ERM (and consistency, since for ERM generalization is necessary and sufficient for consistency, and vice versa). A separate theorem (Niyogi, Poggio et al.) guarantees stability of ERM. Thus, with the appropriate definition of stability, stability and generalization are equivalent for ERM and correspond to the same constraints on H.

5 Regularization

Regularization (originally introduced by Tikhonov independently of the learning problem) ensures well-posedness and thus generalization of ERM by constraining the hypothesis space H, for instance by not allowing polynomials of too high a degree relative to the number of data points. The direct way is called Ivanov regularization; the indirect way is Tikhonov regularization (which is not strictly ERM). ERM finds the function in (H, ‖·‖) which minimizes

$$\frac{1}{n}\sum_{i=1}^{n} V(f(x_i), y_i)$$

which in general, for an arbitrary hypothesis space H, is ill-posed. Ivanov regularization finds the function that minimizes

$$\frac{1}{n}\sum_{i=1}^{n} V(f(x_i), y_i)$$

over functions whose norm is constrained to satisfy

$$\|f\|^2 \le A.$$

Tikhonov regularization minimizes over the hypothesis space H, for a fixed positive parameter γ, the regularized functional

$$\frac{1}{n}\sum_{i=1}^{n} V(f(x_i), y_i) + \gamma \|f\|_K^2, \qquad (1)$$

where ‖f‖_K is the norm in H, the Reproducing Kernel Hilbert Space (RKHS) defined by the kernel K. Tikhonov regularization ensures well-posedness, i.e. existence, uniqueness and especially stability (in a very strong form). It also provides generalization.
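As a concrete sketch (an illustration under assumptions, not the notes' own algorithm): with the square loss and a linear function class, the Tikhonov functional (1) becomes ridge regression, whose minimizer has a closed form:

```python
import numpy as np

def ridge_regression(X, y, gamma):
    """Minimize (1/n) * sum_i (w . x_i - y_i)^2 + gamma * ||w||^2.
    Setting the gradient to zero gives w = (X^T X / n + gamma I)^{-1} X^T y / n."""
    n, d = X.shape
    A = X.T @ X / n + gamma * np.eye(d)
    b = X.T @ y / n
    return np.linalg.solve(A, b)

# Hypothetical data; gamma trades the empirical error against the norm penalty.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, size=50)
w = ridge_regression(X, y, gamma=0.1)
```

Larger γ enforces a smaller-norm (stabler) solution at the cost of a worse fit to the data, mirroring the degree-2 versus degree-9 polynomial example above.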

6 Target Space, Sample and Approximation Error

In addition to the hypothesis space H, the space our algorithms are allowed to search, we need to define the target space T. This space is in general larger than the hypothesis space and contains the true function f_0 that minimizes the risk. Often, T is chosen to be all functions in L_2, or all differentiable functions. Notice that the true function, if it exists, is determined by µ(z), which contains all the relevant information.

6.1 Sample Error

Let f_H be the function in H with the smallest true risk. We have defined the generalization error to be I_S[f_S] − I[f_S]. We define the sample error to be I[f_S] − I[f_H], the difference in true risk between the function in H we actually find and the best function in H. This is what we pay because our finite sample does not give us enough information to choose the best function in H. We would like this to be small. Consistency, defined earlier, is equivalent to the sample error going to zero as n → ∞. A main goal in classical learning theory (Vapnik, Smale, ...) is bounding the generalization error. Another goal, for learning theory and statistics, is bounding the sample error, that is, determining conditions under which we can state that I[f_S] − I[f_H] will be small (with high probability). As a simple rule, we expect that if H is well behaved, then as n gets large the sample error becomes small.

6.2 Approximation Error

Let f_0 be the function in T with the smallest true risk. We define the approximation error to be I[f_H] − I[f_0], the difference in true risk between the best function in H and the best function in T. This is what we pay when H is smaller than T. We would like this error to be small too. In much of the following we can assume that I[f_0] = 0. We will focus less on the approximation error in 9.520, but we will explore it. As a simple rule, we expect that as H grows bigger, the approximation error gets smaller. If T ⊆ H, a situation called the realizable setting, the approximation error is zero.

6.3 Error

We define the error to be I[f_S] − I[f_0], the difference in true risk between the function we actually find and the best function in T. We would really like this to be small. As we mentioned, often we can assume I[f_0] = 0, so that the error is simply I[f_S]. The error is the sum of the sample error and the approximation error:

$$I[f_S] - I[f_0] = (I[f_S] - I[f_H]) + (I[f_H] - I[f_0])$$

If we can make both the approximation error and the sample error small, the error will be small. There is a tradeoff between the two: typically, enlarging H to decrease the approximation error increases the sample error, and vice versa. The goal is a hypothesis space that is large enough to give a small approximation error and small enough to give a small sample error.
