<<

STAT 598Y: Statistical Learning Theory                                    Instructor: Jian Zhang

Lecture 1: Introduction to Statistical Learning

In this lecture we formulate the basic (supervised) learning problem and introduce several key concepts, including loss functions, risk, and error decomposition.

1 Basic Concepts

We use X and Y to denote the input and the output space, where typically we have X = R^p. A joint distribution on X × Y is denoted by P_{X,Y}. Let (X, Y) be a pair of random variables distributed according to P_{X,Y}. We also use P_X and P_{Y|X} to denote the marginal distribution of X and the conditional distribution of Y given X.

Let D_n = {(x_1,y_1),...,(x_n,y_n)} be an i.i.d. random sample from P_{X,Y}. The goal of supervised learning is to find a mapping h : X → Y based on D_n so that h(X) is a good approximation of Y. When Y = R the learning problem is often called regression, and when Y = {0,1} or {−1,1} it is often called (binary) classification.

The dataset D_n is often called the training set (or training sample), and it is important since the distribution P_{X,Y} is usually unknown. A learning algorithm is a procedure A which takes the training set D_n and produces a predictor ĥ = A(D_n) as the output. Typically the learning algorithm searches over a space of functions H, which we call the hypothesis space.

2 Loss and Risk

A loss function is a mapping ℓ : Y × Y → R_+ (sometimes R × R → R_+). For example, in binary classification the 0/1 loss ℓ(y, p) = I(y ≠ p) is often used, and in regression the squared error loss ℓ(y, p) = (y − p)² is often used. Other loss functions include the absolute loss, the ϵ-insensitive loss, the hinge loss, the logistic loss and the exponential loss, among others; they will be discussed later in more detail. The performance of a predictor h : X → Y is measured by the expected loss, a.k.a. the risk (or generalization error): R(h) := E_{X,Y}[ℓ(Y, h(X))], where the expectation is taken with respect to the distribution P_{X,Y}. Since in practice we estimate ĥ based on the training set D_n, the risk R(ĥ) is itself a random variable. Thus we may also use the quantity R(A) = E_{D_n}[R(ĥ)] to characterize the generalization performance of the learning algorithm A; this is also called the expected risk of the learning algorithm. The risk is an important measure of the goodness of the predictor h(·), since it tells how h performs on average in terms of the loss function ℓ(·,·). The minimum risk is defined as

R* = inf_h R(h),

where the infimum is often taken with respect to all measurable functions. The performance of a given predictor can be evaluated by how close R(h) is to R*. Minimizing the risk is non-trivial because the underlying distribution P_{X,Y} is in general unknown, and in practice the training data D_n only give us incomplete knowledge of P_{X,Y}.

2.1 Binary Classification

For classification problems, a predictor h is also called a classifier, and the loss function for binary classification is often taken to be the 0/1 loss. In this case, we have

R(h) = E_{X,Y}[ℓ(Y, h(X))] = E_{X,Y}[I(Y ≠ h(X))] = P(h(X) ≠ Y).

And the infimum risk R∗ is also known as the Bayes risk. The following result shows that the Bayes classifier, which is defined as

h*(x) = 1 if P(Y=1 | X=x) ≥ 1/2, and h*(x) = 0 otherwise,

can achieve the Bayes risk.

Theorem 1-1. For any classifier h we have R(h) ≥ R(h*), i.e. R(h*) = R*.

Proof. We will show that P(h(X) ≠ Y) ≥ P(h*(X) ≠ Y). For any X = x, we have

P(h(X) ≠ Y | X = x) = 1 − P(h(X) = Y | X = x)
= 1 − ( P(h(X)=1, Y=1 | X=x) + P(h(X)=0, Y=0 | X=x) )
= 1 − ( E[I(h(X)=1) I(Y=1) | X=x] + E[I(h(X)=0) I(Y=0) | X=x] )
= 1 − ( I(h(x)=1) E[I(Y=1) | X=x] + I(h(x)=0) E[I(Y=0) | X=x] )
= 1 − I(h(x)=1) P(Y=1 | X=x) − I(h(x)=0) P(Y=0 | X=x).

Letting η(x) := P (Y =1|X = x), we have for any X = x,

P(h(X) ≠ Y | X = x) − P(h*(X) ≠ Y | X = x)
= I(h*(x)=1) η(x) + I(h*(x)=0)(1 − η(x)) − I(h(x)=1) η(x) − I(h(x)=0)(1 − η(x))
= η(x)[I(h*(x)=1) − I(h(x)=1)] + (1 − η(x))[I(h*(x)=0) − I(h(x)=0)]
= (2η(x) − 1)[I(h*(x)=1) − I(h(x)=1)]
≥ 0,
where the third equality uses I(h(x)=0) = 1 − I(h(x)=1), and the last inequality holds by the definition of h*(x). The result follows by integrating both sides with respect to x. ∎
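To make Theorem 1-1 concrete, here is a small numerical sketch (my own, not from the notes): a toy model with Gaussian class-conditional densities, where η(x) has a closed form. All distributional choices below are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy model (illustrative assumption): P(Y=1) = 1/2,
    # X | Y=1 ~ N(+1, 1) and X | Y=0 ~ N(-1, 1).
    def eta(x):
        # eta(x) = P(Y=1 | X=x), by Bayes' rule with equal priors.
        p1 = np.exp(-0.5 * (x - 1.0) ** 2)
        p0 = np.exp(-0.5 * (x + 1.0) ** 2)
        return p1 / (p1 + p0)

    def bayes_rule(x):
        return (eta(x) >= 0.5).astype(int)    # h*(x) = I(eta(x) >= 1/2)

    n = 200_000
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)

    # Monte Carlo estimates of R(h) = P(h(X) != Y): no rule beats h*.
    print("risk of h*        :", np.mean(bayes_rule(x) != y))           # ~0.159
    print("risk of I(x >= 1) :", np.mean((x >= 1.0).astype(int) != y))  # larger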

2.2 Regression

In regression we typically have X = R^p and Y = R, and the risk is often measured by the squared error loss ℓ(p, y) = (p − y)². The following result shows that for regression with squared error loss, the optimal predictor is the conditional mean function E[Y | X = x].
Theorem 1-2. Suppose the loss function ℓ(·,·) is the squared error loss. Let h*(x) = E[Y | X = x]; then we have R(h*) = R*.
The proof is left as an exercise. Thus regression with squared error loss can be thought of as trying to estimate the conditional mean function. What about regression with its risk defined by the absolute error loss function?

3 Approximation Error vs. Estimation Error

Suppose that the learning algorithm chooses the predictor from the hypothesis space H, and define

h* = arg inf_{h∈H} R(h),

i.e. h* is the best predictor within H.[1] Then the excess risk of the output ĥ_n of the learning algorithm is defined and can be decomposed as follows[2]:

R(ĥ_n) − R* = [R(h*) − R*] + [R(ĥ_n) − R(h*)],

where the first bracket is the approximation error and the second is the estimation error.

Such a decomposition reflects a trade-off similar to the bias-variance trade-off (perhaps slightly more general). The approximation error is deterministic and is caused by the restriction to H. The estimation error is caused by the use of a finite sample, which cannot completely represent the underlying distribution. The approximation error behaves like a squared bias term, and the estimation error behaves like the variance term in standard statistical estimation problems. Similar to the bias-variance trade-off, there is a trade-off between the approximation error and the estimation error: if H is large then we have a small approximation error but a relatively large estimation error, and vice versa.

[1] Sometimes h* is defined as the best predictor chosen by the estimation procedure when an infinite amount of data is available. In that case, the approximation error can be caused both by the function class H and by some intrinsic bias of the estimation procedure.
[2] Sometimes the decomposition is done for E_{D_n}[R(ĥ_n)] − R*.

STAT 598Y: Statistical Learning Theory                                    Instructor: Jian Zhang

Lecture 2: Risk Minimization

In this lecture we introduce the concepts of empirical risk minimization, overfitting, model complexity and regularization.

1 Empirical Risk Minimization

Given a loss function ℓ(·,·), the risk R(h) is not computable since P_{X,Y} is unknown. Thus we may not be able to directly minimize R(h) to obtain a predictor. Fortunately we are provided with the training data D_n = {(x_1,y_1),...,(x_n,y_n)}, which represents the underlying distribution P_{X,Y}.

Instead of minimizing R(h) = E_{X,Y}[ℓ(Y, h(X))], one may replace P_{X,Y} by its empirical distribution and thus obtain the following minimization problem:

ĥ_n = arg min_{h∈H} (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i)),

which we call empirical risk minimization (ERM). Furthermore, we define the empirical risk R̂_n(h) as

R̂_n(h) := (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i)).

Because under some conditions R̂_n(h) →_p R(h) by the law of large numbers, the use of ERM is at least partially justified. ERM covers many popular methods and is widely used in practice. For example, if we take H = {h(x) : h(x) = θ^T x, θ ∈ R^p} and ℓ(y, p) = (y − p)², then ERM becomes the well-known least squares estimation. The celebrated maximum likelihood estimation (MLE) is also a special case of ERM where the loss function is taken to be the negative log-likelihood. Example: in binary classification with data (x_1,y_1),...,(x_n,y_n), where y_i ∈ {−1,1} and H = {h(x) : h(x) = θ^T x, θ ∈ R^p}, the logistic regression estimator is computed by minimizing the logistic loss:
θ̂ = arg min_θ (1/n) Σ_{i=1}^n log(1 + exp(−y_i θ^T x_i)),
which is equivalent to MLE.
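As a concrete instance of ERM, the following sketch (mine, not from the notes) minimizes the logistic loss above by plain gradient descent; the synthetic data, step size and iteration count are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 500, 3
    X = rng.normal(size=(n, p))
    theta_true = np.array([1.5, -2.0, 0.5])                 # assumed ground truth
    y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-X @ theta_true)), 1, -1)

    def empirical_risk(theta):
        # R_hat_n(theta) = (1/n) sum_i log(1 + exp(-y_i theta^T x_i))
        return np.mean(np.log1p(np.exp(-y * (X @ theta))))

    def gradient(theta):
        # Gradient of the empirical logistic risk.
        s = 1.0 / (1.0 + np.exp(y * (X @ theta)))           # sigma(-y_i theta^T x_i)
        return -(X * (y * s)[:, None]).mean(axis=0)

    theta = np.zeros(p)
    for _ in range(2000):                                   # fixed step, illustrative
        theta -= 0.5 * gradient(theta)

    print("empirical risk:", empirical_risk(theta), " theta_hat:", theta.round(2))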

2 Overfitting

ERM works by minimizing the empirical risk R̂_n(h), while the goal of learning is to obtain a predictor with a small risk R(h). Although under certain conditions the former converges to the latter as n → ∞, in practice we always have a finite sample, and as a result there might be a large discrepancy between the two targets, especially when H is large and n is small. Overfitting refers to the situation where we have a small empirical risk but still a relatively large true risk. Consider the following example. Let ℓ(y, p) = (y − p)² and obtain the predictor ĥ by ERM:

ĥ = arg min_{h∈H} (1/n) Σ_{i=1}^n (y_i − h(x_i))².

[Figure 1 here: four panels showing polynomial fits of degree 1, 2, 3 and 5 to the data; see the caption below.]

Figure 1: Overfitting of polynomial regression. The true signal (blue line) is h*(x) = sin(x), and the function is fitted using 10 training examples (red dots). P1 and P2 show a lack of fit (underfitting) and P5 is overfitting.

[Figure 2 here: schematic curves of the true risk and the empirical risk as functions of model complexity, with the best model in theory at the minimum of the true risk.]

Figure 2: True/empirical risk vs. model complexity

Figure 1 shows the case where H is taken to be P1, P2, ..., where Pk is the set of all polynomial functions of degree up to k.

We can see that when H = P3 the fitted predictor will have a small risk (close to the true signal sin(x)). Taking Pk with larger k values as the hypothesis space can clearly improve its fitting with respect to the 10 observations (red dots), but this does not necessarily reduce the true risk as it overfits the training data. Learning is more about generalization than memorization.
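A minimal sketch reproducing the flavor of Figure 1 (the 10 sample points and the noise level are my own assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    x_train = np.sort(rng.uniform(0, 2 * np.pi, 10))
    y_train = np.sin(x_train) + 0.1 * rng.normal(size=10)  # 10 noisy samples of sin(x)

    x_test = np.linspace(0, 2 * np.pi, 1000)               # dense grid: proxy for true risk
    for k in (1, 2, 3, 5, 9):
        coef = np.polyfit(x_train, y_train, deg=k)         # ERM over P_k (least squares)
        emp = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
        true = np.mean((np.polyval(coef, x_test) - np.sin(x_test)) ** 2)
        print(f"degree {k}: empirical risk {emp:.4f}, true risk {true:.4f}")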

3 Controlling Model Complexity

Overfitting is mainly caused by the hypothesis space H being too large for the sample size n. Clearly the complexity of the hypothesis space H (i.e., the "size" of H) that we can afford depends on the amount of training data we have. For a given training dataset, the relationship between the true risk, the empirical risk and model complexity is illustrated in Figure 2. One way to avoid overfitting is to choose H so that it is appropriate for the sample size. There are many ways to control the model complexity, and they are in fact quite similar in spirit. Here we list two commonly used approaches:

1. Take H_1, H_2, ..., H_n, ... to be a sequence of hypothesis spaces of increasing size. For example, one typically has H_k ⊂ H_{k+1} and ∪_k H_k = H. Given the training data D_n, one finds ĥ_n by minimizing

ĥ_n = arg min_{h∈H_n} R̂_n(h).

This covers the method of Sieves and structural risk minimization (SRM).

2. Define a penalty function Ω : H → R_+ and find ĥ_n by the following optimization procedure:

ĥ_n = arg min_{h∈H} (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i)) + λ_n Ω(h),

where λ_n > 0 balances the trade-off between goodness of fit and model complexity. This is also known as penalized empirical risk minimization.

In practice we often need to select H_n or λ_n based on the training data to achieve a good balance between goodness of fit and model complexity.
Consider the following regression problem: let H = {h(x) : h(x) = θ^T x, θ ∈ R^p} and suppose we are trying to find an estimator θ̂ which minimizes the risk E_{X,Y}(Y − θ^T X)². For the first approach, we could define a sequence of increasing constants 0 ≤ η_1 ≤ η_2 ≤ ... ≤ η_k ≤ ... and define H_k = {h(x) : h(x) = θ^T x, θ^T θ ≤ η_k}. For the second approach we define Ω(h) = θ^T θ. It is well known from optimization that these two approaches are mathematically equivalent (i.e., for any η_k there exists a λ such that the two optimization problems have the same solution).
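A small sketch (mine) of the penalized approach for this regression example, i.e. ridge regression via the standard closed form; the synthetic data and the grid of λ values are assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 100, 5
    X = rng.normal(size=(n, p))
    y = X @ rng.normal(size=p) + 0.5 * rng.normal(size=n)

    def ridge(lmbda):
        # argmin_theta (1/n) ||y - X theta||^2 + lmbda * theta^T theta
        return np.linalg.solve(X.T @ X / n + lmbda * np.eye(p), X.T @ y / n)

    for lmbda in (0.0, 0.1, 1.0, 10.0):
        theta = ridge(lmbda)
        print(f"lambda={lmbda:5.1f}  theta^T theta={theta @ theta:6.3f}  "
              f"train MSE={np.mean((y - X @ theta) ** 2):.3f}")
    # Larger lambda shrinks theta^T theta (a smaller constraint level eta_k),
    # trading goodness of fit for lower model complexity.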

STAT 598Y: Statistical Learning Theory                                    Instructor: Jian Zhang

Lecture 3: Generalization Error Bounds and Concentration Inequalities

1 Generalization Error Bounds and PAC Learning

Although the concept of consistency of a learning algorithm is very important, it only measures how the expectation of the random variable R(ĥ_n) converges to the optimal Bayes risk asymptotically. It does not say how fast this convergence is, nor does it tell us how this random variable is distributed. In particular, we are interested in probability bounds on the generalization error, of the form: "with probability at least 1 − δ, the risk R(ĥ_n) is bounded by some quantity". Recall that the excess risk can be decomposed into approximation error and estimation error, i.e.

R(ĥ_n) − R* = [ inf_{h∈H} R(h) − R* ] + [ R(ĥ_n) − inf_{h∈H} R(h) ],
where the first bracket is the approximation error and the second is the estimation error. The approximation error is deterministic and mainly caused by two possible reasons: (1) the restriction to the function class H; (2) if inf_{h∈H} R(h) in the above equation were replaced by the minimum risk achievable by the learning algorithm with an infinite amount of data, it can also be caused by the systematic bias of the learning algorithm. Such an error is often not controllable, since we do not know the underlying distribution P_{X,Y}. On the other hand, the estimation error depends on the sample size, the function class H and the learning algorithm, which we have control over. We would like to obtain probability bounds for the estimation error.

Specifically, if we use ERM to obtain our predictor ĥ_n = arg min_{h∈H} R̂_n(h), and assume that inf_{h∈H} R(h) = R(h*) for some h* ∈ H, then we have
R(ĥ_n) − inf_{h∈H} R(h) = R(ĥ_n) − R(h*)
≤ R(ĥ_n) − R(h*) + R̂_n(h*) − R̂_n(ĥ_n)    (since R̂_n(ĥ_n) ≤ R̂_n(h*))
= ( R(ĥ_n) − R̂_n(ĥ_n) ) − ( R(h*) − R̂_n(h*) )
≤ 2 sup_{h∈H} |R(h) − R̂_n(h)|.
Thus if we can obtain a uniform bound on sup_{h∈H} |R(h) − R̂_n(h)|, then the estimation error can be bounded. This again justifies the use of the ERM method.

Intuitively, for any h ∈ H, R̂_n(h) is a random variable (in fact, with the 0/1 loss, n R̂_n(h) follows a Binomial distribution) with mean R(h). Alternatively, we can think of it as the average of a sequence of i.i.d. random variables. Thus we should be able to bound the difference between such an average and its mean. The uniform bound, however, will depend crucially on how large/complex the hypothesis space H is. The probably approximately correct (PAC) learning model typically states as follows: we say that ĥ_n is ϵ-accurate with probability 1 − δ if

P( R(ĥ_n) − inf_{h∈H} R(h) > ϵ ) < δ.

In other words, we have R(ĥ_n) − inf_{h∈H} R(h) ≤ ϵ with probability at least 1 − δ.

2 Concentration Inequalities

Concentration inequalities will be used to measure how fast the empirical risk converges to the true risk. We start with some loose but simple ones and then get to more useful results.
Theorem 3-1 (Markov Inequality). For any nonnegative random variable X and ϵ > 0,

P(X ≥ ϵ) ≤ E[X]/ϵ.

Proof. We have
E[X] ≥ E[I(X ≥ ϵ) X] ≥ ϵ E[I(X ≥ ϵ)] = ϵ P(X ≥ ϵ),
and thus P(X ≥ ϵ) ≤ E[X]/ϵ. ∎

Theorem 3-2 (Chernoff Inequality) For any random variable X and ϵ > 0,

P(X ≥ ϵ) ≤ E[exp(tX)] / exp(tϵ) for any t > 0, and thus
P(X ≥ ϵ) ≤ inf_{t>0} E[exp(tX)] / exp(tϵ).

Proof. For any t>0, since exp(tx) is a nonnegative monotone increasing function in x, we have

P(X ≥ ϵ) = P( exp(tX) ≥ exp(tϵ) ) ≤ E[exp(tX)] / exp(tϵ). ∎

Theorem 3-3. (Chebyshev Inequality) For any random variable X and ϵ > 0,

P( |X − E[X]| > ϵ ) ≤ V[X]/ϵ².

Proof. Apply the Markov inequality to the nonnegative random variable Y = (X − E[X])² with threshold ϵ². ∎

Both the Markov and Chebyshev bounds are polynomial in 1/ϵ, and often we need bounds which converge to zero exponentially fast. In fact, the Chebyshev inequality can be quite poor. Consider the following example. Let X_1,...,X_n ∈ {0,1} be i.i.d. binary random variables with p = P(X_i = 1). Then we have σ² := V[X_i] = p(1 − p). Define S_n = Σ_{i=1}^n X_i, so that E[S_n] = np and V[S_n] = np(1 − p) = nσ². From the Chebyshev inequality applied to S_n/n (whose variance is σ²/n), for any ϵ̃ > 0,
P( |S_n/n − E[S_n]/n| ≥ ϵ̃ ) ≤ σ²/(n ϵ̃²).
Thus the tail probability goes to zero at a rate of n^{−1}. But from the central limit theorem (CLT) we have

√n ( S_n/n − E[S_n]/n ) →_d N(0, σ²).

In other words, we have
P( √(n/σ²) (S_n/n − p) ≥ y ) → 1 − Φ(y) = ∫_y^∞ (1/√(2π)) exp(−x²/2) dx ≤ exp(−y²/2) / (y√(2π)).
So
P( S_n/n − E[S_n]/n ≥ ϵ̃ ) = P( √(n/σ²)(S_n/n − p) ≥ √(n/σ²) ϵ̃ ) ≈ exp( −n ϵ̃² / (2p(1 − p)) ),
which decreases exponentially fast as a function of n. So the Chebyshev inequality does poorly in this case and we need something better.
Hoeffding's inequality studies the concentration of a sum of independent random variables and gives an exponential tail bound. Given independent random variables X_1,...,X_n, let S_n = Σ_{i=1}^n X_i. By the Chernoff bound we have
P( S_n − E[S_n] ≥ ϵ ) = P( exp(t(S_n − E[S_n])) ≥ exp(tϵ) )
≤ exp(−tϵ) E[ exp(t(S_n − E[S_n])) ]
= exp(−tϵ) E[ exp( t Σ_{i=1}^n (X_i − E[X_i]) ) ]
= exp(−tϵ) Π_{i=1}^n E[ exp(t(X_i − E[X_i])) ].
The following lemma gives a property of a bounded random variable with mean zero.
Lemma 3-4. If the random variable X has mean zero, i.e. E[X] = 0, and is bounded in [a, b], then for any s > 0,
E[exp(sX)] ≤ exp( s²(b − a)²/8 ).

Proof. By convexity of the exponential function (Jensen's inequality applied at the endpoints) and the fact that a ≤ X ≤ b,
exp(sX) ≤ (X − a)/(b − a) · exp(sb) + (b − X)/(b − a) · exp(sa).
Taking expectations on both sides and using E[X] = 0, we have
E[exp(sX)] ≤ ( b exp(sa) − a exp(sb) ) / (b − a)
= ( 1 − λ + λ exp(s(b − a)) ) exp(−λ s(b − a)),
where λ = −a/(b − a). Now let u = s(b − a) and define
φ(u) := −λu + log(1 − λ + λ exp(u)),
so that the above inequality becomes E[exp(sX)] ≤ exp(φ(u)). We need an upper bound on φ(u). Using Taylor's expansion we have
φ(u) = φ(0) + u φ′(0) + (u²/2) φ″(ξ)
for some ξ ∈ [0, u]. It is easy to verify that φ(0) = 0 and φ′(0) = 0. Moreover,
φ″(u) = λ exp(u) / (1 − λ + λ exp(u)) − (λ exp(u))² / (1 − λ + λ exp(u))²
= [ λ exp(u) / (1 − λ + λ exp(u)) ] · [ 1 − λ exp(u) / (1 − λ + λ exp(u)) ]
≤ 1/4,
since x(1 − x) ≤ 1/4 for any x.

So we have φ(u) ≤ u²/8, and therefore
E[exp(sX)] ≤ exp( s²(b − a)²/8 ). ∎

Now we are ready to present Hoeffding's inequality.

Theorem 3-5 (Hoeffding Inequality). Let X_1,...,X_n be independent bounded random variables such that X_i ∈ [a_i, b_i] with probability 1. Let S_n = Σ_{i=1}^n X_i. Then for any ϵ > 0, we have:

1. P( S_n − E[S_n] ≥ ϵ ) ≤ exp( −2ϵ² / Σ_{i=1}^n (b_i − a_i)² );

2. P( S_n − E[S_n] ≤ −ϵ ) ≤ exp( −2ϵ² / Σ_{i=1}^n (b_i − a_i)² );

3. P( |S_n − E[S_n]| ≥ ϵ ) ≤ 2 exp( −2ϵ² / Σ_{i=1}^n (b_i − a_i)² ).

Proof. By the above derivation and Lemma 3-4 we have

P( S_n − E[S_n] ≥ ϵ ) ≤ exp(−tϵ) Π_{i=1}^n E[ exp(t(X_i − E[X_i])) ]
≤ exp(−tϵ) exp( t² Σ_{i=1}^n (b_i − a_i)² / 8 ).
Now choosing t = 4ϵ / Σ_{i=1}^n (b_i − a_i)², we have

P( S_n − E[S_n] ≥ ϵ ) ≤ exp( −2ϵ² / Σ_{i=1}^n (b_i − a_i)² ).
The other two claims are proved similarly. ∎
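Before specializing the bound, a quick numerical sanity check (my own sketch, not from the notes): compare the empirical tail probability of a Bernoulli average with the Hoeffding and Chebyshev bounds.

    import numpy as np

    rng = np.random.default_rng(3)
    n, p, eps, trials = 100, 0.5, 0.1, 200_000

    # Empirical tail P(S_n/n - p >= eps) for averages of Bernoulli(p) variables.
    means = rng.binomial(n, p, size=trials) / n
    empirical = np.mean(means - p >= eps)

    hoeffding = np.exp(-2 * n * eps ** 2)        # Theorem 3-5 with b_i - a_i = 1
    chebyshev = p * (1 - p) / (n * eps ** 2)     # (two-sided) Chebyshev bound

    print(f"empirical {empirical:.4f}  Hoeffding {hoeffding:.4f}  Chebyshev {chebyshev:.4f}")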

If we apply Hoeffding's inequality to the average of Bernoulli random variables X_1,...,X_n, we have
P( S_n/n − p ≥ ϵ ) ≤ exp(−2nϵ²)
since b_i − a_i = 1. This is exactly what the CLT indicates when p = 1/2. The following is a straightforward application of Hoeffding's inequality:

Corollary 3-6. Assume that H = {h_1,...,h_m}. Then for all ϵ > 0,

P( sup_{h∈H} |R̂_n(h) − R(h)| ≥ ϵ ) ≤ 2m exp(−2nϵ²)
for any distribution P_{X,Y}, where R(h) = E_{X,Y}[I(Y ≠ h(X))] and R̂_n(h) = (1/n) Σ_{i=1}^n I(y_i ≠ h(x_i)).
Finally we introduce the McDiarmid inequality, which generalizes Hoeffding's inequality to functions of i.i.d. random variables. Some restrictions are needed in order to get exponential bounds.

Theorem 3-6 (McDiarmid Inequality / Bounded Differences). Suppose the random variables X_1,...,X_n ∈ X are independent, and f is a mapping from X^n to R. If for any i, any x_1,...,x_n and any x_i′ ∈ X, f satisfies
| f(x_1,...,x_n) − f(x_1,...,x_{i−1}, x_i′, x_{i+1},...,x_n) | ≤ c_i,

then for all ϵ > 0,
P( f(X_1,...,X_n) − E[f(X_1,...,X_n)] ≥ ϵ ) ≤ exp( −2ϵ² / Σ_{i=1}^n c_i² ),
P( f(X_1,...,X_n) − E[f(X_1,...,X_n)] ≤ −ϵ ) ≤ exp( −2ϵ² / Σ_{i=1}^n c_i² ).
Proof. The proof utilizes a martingale sequence. Define

V_i = E[f | X_1,...,X_i] − E[f | X_1,...,X_{i−1}],
and note that E[V_i] = 0 and
Σ_{i=1}^n V_i = E[f | X_1,...,X_n] − E[f] = f(X_1,...,X_n) − E[f].
Define upper and lower bounds as

L_i = inf_x E[f | X_1,...,X_{i−1}, x] − E[f | X_1,...,X_{i−1}],
U_i = sup_x E[f | X_1,...,X_{i−1}, x] − E[f | X_1,...,X_{i−1}],
and note that L_i ≤ V_i ≤ U_i. Furthermore, we have
U_i − L_i = sup_{x_i, x_i′} ( E[f | X_1,...,X_{i−1}, x_i] − E[f | X_1,...,X_{i−1}, x_i′] ) ≤ c_i.

We also have E[V_i | X_1,...,X_{i−1}] = 0. Similar to the proof of Hoeffding's inequality, we have
P( f − E[f] ≥ ϵ ) ≤ inf_{t>0} exp(−tϵ) E[ Π_{i=1}^n exp(tV_i) ].
Moreover,
E[ Π_{i=1}^n exp(tV_i) ] = E[ E[ exp(tV_n) Π_{i=1}^{n−1} exp(tV_i) | X_1,...,X_{n−1} ] ]
= E[ Π_{i=1}^{n−1} exp(tV_i) · E[ exp(tV_n) | X_1,...,X_{n−1} ] ]
≤ E[ Π_{i=1}^{n−1} exp(tV_i) ] exp( t² c_n²/8 )
≤ ... ≤ exp( t² Σ_{i=1}^n c_i² / 8 ).
Setting t = 4ϵ / Σ_{i=1}^n c_i², we obtain the claimed results. ∎
Example. Consider
f(X_1,...,X_n) = sup_{g∈G} | E[g(X)] − (1/n) Σ_{i=1}^n g(X_i) |.
If all g : X → [a, b], then we have c_i = (b − a)/n. Thus we have
P( sup_{g∈G} |E[g(X)] − (1/n) Σ_{i=1}^n g(X_i)| − E[ sup_{g∈G} |E[g(X)] − (1/n) Σ_{i=1}^n g(X_i)| ] ≥ ϵ ) ≤ exp( −2nϵ² / (b − a)² ).
As a final note, the bounds we obtained are worst-case, since we did not utilize any variance information. Sharper bounds are available when the variance is known, such as Bernstein's inequality.

STAT 598Y: Statistical Learning Theory                                    Instructor: Jian Zhang

Lecture 4: PAC Learning: Simple Examples

For a finite training sample D_n, the predictor ĥ_n can be thought of as the output of the learning algorithm A given the training data and the hypothesis space, i.e. ĥ_n = A(D_n, H). Its risk R(ĥ_n) = E_{X,Y}[I(Y ≠ ĥ_n(X))] is a random variable which depends on D_n, A and H. Consistency of the learning algorithm concerns the mean of this random variable, i.e. E_{D_n}[R(ĥ_n)]. In PAC learning we are interested in its tail distribution, i.e. finding a bound which holds with large probability:

P( sup_{h∈H} [R(h) − R̂_n(h)] ≥ ϵ ) ≤ δ.
The basic idea is to set the probability of being misled to δ and then solve for ϵ.
Example 1 (single classifier). Consider the special case H = {h}, i.e. we only have a single function. Furthermore, we assume that it achieves 0 training error over D_n, i.e. R̂_n(h) = 0. Then what is the probability that its generalization error satisfies R(h) ≥ ϵ? We have
P( R̂_n(h) = 0, R(h) ≥ ϵ ) = (1 − R(h))^n ≤ (1 − ϵ)^n ≤ exp(−nϵ).
Setting the RHS to δ and solving for ϵ gives ϵ = (1/n) log(1/δ). Thus with probability at least 1 − δ, if R̂_n(h) = 0 then
R(h) < (1/n) log(1/δ).
Note that we can also use Hoeffding's inequality to obtain P( |R̂_n(h) − R(h)| ≥ ϵ ) ≤ 2 exp(−2nϵ²), which leads to
P( |R̂_n(h) − R(h)| ≥ √( (1/(2n)) log(2/δ) ) ) ≤ δ.

This is more general but not as tight as the previous bound, since it does not utilize the fact that R̂_n(h) = 0. ∎
Although the result in Example 1 is very simple, it has limited practical meaning. The main reason is that it only applies to a single fixed function h. Essentially, it says that for each fixed function h, there is a set S of samples (whose measure is P(S) ≥ 1 − δ) for which |R̂_n(h) − R(h)| is bounded. However, such sets can be different for different functions. To handle this issue we need uniform deviations, since:
R(ĥ_n) − R̂_n(ĥ_n) ≤ sup_{h∈H} ( R(h) − R̂_n(h) ).
The idea is to utilize the union bound, as shown in the following example.

Example 2 (finite number of classifiers). Consider the case H = {h_1,...,h_m}. Define

B_k := { (x_1,y_1),...,(x_n,y_n) : R(h_k) − R̂_n(h_k) ≥ ϵ },  k = 1,...,m.
Each B_k is the set of all bad samples for h_k, i.e. the samples for which the bound fails for h_k. In other words, it contains all misleading samples. To measure the probability of the samples which are bad for any h_k (k = 1,...,m), we can apply the Bonferroni inequality to simply obtain:

P( B_1 ∪ ... ∪ B_m ) ≤ Σ_{k=1}^m P(B_k).

Thus we have

P( ∃h ∈ H : R(h) − R̂_n(h) ≥ ϵ ) = P( ∪_{k=1}^m { R(h_k) − R̂_n(h_k) ≥ ϵ } )
≤ Σ_{k=1}^m P( R(h_k) − R̂_n(h_k) ≥ ϵ )    (union bound)
≤ m exp(−2nϵ²).

Hence for H = {h_1,...,h_m}, with probability at least 1 − δ,
∀h ∈ H,  R(h) − R̂_n(h) ≤ √( (log m + log(1/δ)) / (2n) ).

Since this is a uniform upper bound, it can be applied to ĥ_n ∈ H. ∎
Note that we could also bound the expectation E[ sup_{h∈H} |R(h) − R̂_n(h)| ] by using the fact that for any nonnegative random variable Z, E[Z] = ∫_0^∞ P(Z > t) dt. From the above PAC learning examples we can see that:

• It requires assumptions on data generation, i.e. the samples are i.i.d.
• The error bounds are valid with respect to repeated samples of training data.
• For a fixed function we roughly have R(h) − R̂_n(h) ≈ 1/√n.
• If |H| = m then sup_{h∈H} ( R(h) − R̂_n(h) ) ≈ √(log m / n). The term log m can be thought of as the complexity of the hypothesis space H.
There are several things which can be improved:

• Hoeffding's inequality does not utilize variance information, so the results could be improved by utilizing such information.
• The union bound can be quite loose. For instance, it is as bad as if all the functions in H were independent.
• The supremum over H might be too conservative.
The bound in Example 2 becomes meaningless when m is infinite. The following example generalizes it to the case of countably many classifiers.

Example 3 (countable number of classifiers). Consider the case H = {h_1, h_2, ..., h_m, ...}. Since we need to bound the probability of the set of misleading samples (which could mislead any h ∈ H) by δ, we budget the probability of being misled by h_k to w_k δ, such that Σ_{k=1}^∞ w_k ≤ 1. So in order to find ϵ > 0 which satisfies

P( ∃h ∈ H : R(h) − R̂_n(h) ≥ ϵ ) ≤ δ,

P( R(h_k) − R̂_n(h_k) ≥ ϵ ) ≤ w_k δ

since

P( ∃h ∈ H : R(h) − R̂_n(h) ≥ ϵ ) = P( ∪_{k=1}^∞ { R(h_k) − R̂_n(h_k) ≥ ϵ } )
≤ Σ_{k=1}^∞ P( R(h_k) − R̂_n(h_k) ≥ ϵ )    (union bound)
≤ Σ_k w_k δ ≤ δ.
Again the first inequality comes from the Bonferroni inequality. By a similar argument, we solve for ϵ by setting exp(−2nϵ²) = w_k δ, which leads to ϵ = √( (1/(2n)) log(1/(w_k δ)) ). Thus with probability at least 1 − δ,
∀h_k ∈ H,  R(h_k) ≤ R̂_n(h_k) + √( (log(1/w_k) + log(1/δ)) / (2n) ). ∎

Note that the w_k's have to be specified before seeing the training data, otherwise the result does not hold. One way to interpret the w_k's is as "prior" knowledge about the functions (as in Bayesian learning).
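The bounds in Examples 1-3 are easy to evaluate numerically; a small sketch (the function names and the choice w_k = 2^(−k) are my own assumptions):

    import math

    def eps_single(n, delta):
        # Example 1: a single classifier with zero training error.
        return math.log(1 / delta) / n

    def eps_finite(n, m, delta):
        # Example 2: a finite class of m classifiers.
        return math.sqrt((math.log(m) + math.log(1 / delta)) / (2 * n))

    def eps_countable(n, k, delta):
        # Example 3 with the assumed weights w_k = 2**(-k), so sum_k w_k <= 1.
        return math.sqrt((k * math.log(2) + math.log(1 / delta)) / (2 * n))

    n, delta = 1000, 0.05
    print(eps_single(n, delta))         # ~0.0030
    print(eps_finite(n, 100, delta))    # ~0.0617
    print(eps_countable(n, 10, delta))  # ~0.0705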

STAT 598Y: Statistical Learning Theory                                    Instructor: Jian Zhang

Lecture 5: Growth Function and VC Dimension

We have considered the case when H is finite or countably infinite. In practice, however, the function class H could be uncountable. Under this situation, the previous method does not work. The key idea is to group functions based on the sample.

Given a sample D_n = {(x_1,y_1),...,(x_n,y_n)}, define S = {x_1,...,x_n} and consider the set

H_S = H_{x_1,...,x_n} = { (h(x_1),...,h(x_n)) : h ∈ H }.

The size of this set is the total number of distinct ways in which S = {x_1,...,x_n} can be classified. For binary classification the cardinality of this set is always finite, no matter how large H is.
Definition (Growth Function). The growth function is the maximum number of ways into which n points can be classified by the function class:
G_H(n) = sup_{x_1,...,x_n} |H_S|.
The growth function can be thought of as a measure of the "size" of the class of functions H. Several facts about the growth function:

• When H is finite, we always have G_H(n) ≤ |H| = m.
• Since h(x) ∈ {0,1}, we have G_H(n) ≤ 2^n. If G_H(n) = 2^n, then there is a set of n points such that the class of functions H can generate any possible classification result on these points.

Definition (Shattering). We say that H shatters S if |H_S| = 2^{|S|}.

Definition (VC Dimension). The VC dimension of a class H is the largest n = dVC(H) such that

G_H(n) = 2^n.
In other words, the VC dimension of a function class H is the cardinality of the largest set that it can shatter.
Example. Consider the threshold functions H = {h(x) = I(x ≤ θ), θ ∈ R}. This class shatters any single point, but it cannot shatter any set of two points: for x_1 < x_2, the labeling (h(x_1), h(x_2)) = (0, 1) is impossible. So d_VC(H) = 1. ∎
Example. Consider all linear classifiers in a 2-d space, i.e. X = R². Linear classifiers can shatter a set of 3 points (not all on a line), but no set of four points can be shattered. So the VC dimension in this case is 3. ∎
Example. Consider all linear classifiers in a p-dimensional space, i.e. X = R^p. Given x_1,...,x_n ∈ R^p, we define the augmented data vectors

z_i = [1, x_i^T]^T ∈ R^{p+1},  i = 1,...,n.
Then the set of all linear classifiers can be written as

H = { h : h(z) = sign(θ^T z), θ ∈ R^{p+1} }.
Define
Z = [z_1, z_2, ..., z_n] ∈ R^{(p+1)×n},
and we argue that x_1,...,x_n is shattered by H if and only if the n columns of Z are linearly independent.

• If the columns z_1,...,z_n are linearly independent, then n ≤ p + 1 and for any classification assignment y ∈ {±1}^n the linear system Z^T θ = y has a solution. Thus there is a linear classifier in H (taking θ to be a solution of the linear system) which produces the arbitrary class assignment y.

• Suppose the columns z_1,...,z_n are not linearly independent. For H to shatter the set, there must exist, for every labeling in {±1}^n, a θ ∈ R^{p+1} such that (sign(z_1^T θ), ..., sign(z_n^T θ)) equals that labeling. In other words, this means that the vector Z^T θ must be able to lie in any of the 2^n orthants of R^n. However, this contradicts the fact that z_1,...,z_n are linearly dependent: if Σ_i a_i z_i = 0 with some a_i ≠ 0, then Σ_i a_i (z_i^T θ) = 0 for every θ, which rules out some sign patterns.

Since for n > p + 1 it is not possible for the columns of Z to be linearly independent, while for n ≤ p + 1 we can always choose x_1,...,x_n to make them so, we have d_VC(H) = p + 1. ∎
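The if-and-only-if argument can be checked numerically: for points in general position, every labeling of p + 1 augmented points is realized by solving Z^T θ = y. A small sketch (my own):

    import numpy as np
    from itertools import product

    rng = np.random.default_rng(4)
    p = 2
    n = p + 1                                    # try to shatter p + 1 points
    X = rng.normal(size=(n, p))                  # in general position w.p. 1
    Z = np.hstack([np.ones((n, 1)), X])          # rows are the augmented z_i^T

    shattered = True
    for y in product([-1.0, 1.0], repeat=n):     # all 2^n labelings
        theta = np.linalg.solve(Z, np.array(y))  # solve Z theta = y exactly
        if not np.array_equal(np.sign(Z @ theta), np.array(y)):
            shattered = False
    print(f"{n} points in R^{p} shattered by linear classifiers:", shattered)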

A somewhat surprising result shows that the growth function G_H(n) either grows exponentially in n or only polynomially in n, depending on whether n is at most the VC dimension d_VC(H) or not.
Theorem 5-1 (Sauer). Let H be a class of functions with binary outputs and VC dimension d = d_VC(H). Then for all n ∈ N,
G_H(n) ≤ Σ_{i=0}^d (n choose i).
Furthermore, for all n ≥ d, we have
G_H(n) ≤ (en/d)^d.

Proof. For any S = {x_1,...,x_n}, consider a table containing the values of the functions in H_S (i.e. we only consider the distinct behaviors of functions projected onto the sample S), one row per unique tuple. For example, if S = {x_1, x_2, ..., x_5} we might have the following table T:

h(x1) h(x2) h(x3) h(x4) h(x5)
  -     +     -     +     +
  +     -     -     +     +
  +     +     +     -     +
  -     +     +     -     -
  -     -     -     +     -

Table 1: An example of H projected onto S = {x1,...,x5}

Each row is one possible tuple for some h ∈ H evaluated on the sample S. Obviously the number of rows in T equals the cardinality |H_S|, so we can bound the growth function of H by the maximum number of rows in table T. Next we transform the table T by processing each column sequentially. To process the first column, for each row we replace a "+" by a "−" unless doing so produces a duplicate row in the table. Table 2 shows the table after processing the first column (left) and the final table after processing all 5 columns (right).

h*(x1) h(x2) h(x3) h(x4) h(x5)      h*(x1) h*(x2) h*(x3) h*(x4) h*(x5)
  -      +     -     +     +          -      +      -      -      -
  -      -     -     +     +          -      -      -      +      +
  -      +     +     -     +          -      -      -      -      +
  -      +     +     -     -          -      -      -      -      -
  -      -     -     +     -          -      -      -      +      -

Table 2: transformed tables (left: after processing the first column; right: after processing all 5 columns)

Now we have the following observations:

1. The size of the table is not changed by such transformations, and the rows in the final table T* are still unique. Thus we can use an upper bound on the number of rows in T* to bound the growth function G_H(n).

2. The final table T* has the property that replacing any "+" by a "−" results in a duplicate row. So the set of "+" entries in each row must index a subset of S that is shattered by the table T* (in fact, by the set of functions H* corresponding to the table T*).

3. If a subset A ⊂ S can be shattered by a later table T_{k+1}, then it must also be shattered by the previous table T_k. To see this, note that if A does not contain the processed column x_k, the result holds trivially, since all columns in A remain the same in T_k and T_{k+1}. If A contains the processed column x_k, then for each of the 2^{|A|−1} +/− combinations of the elements of A\{x_k}, there must be two rows in T_{k+1} which have "+" and "−" respectively in the column x_k. Both rows must also exist in the previous table T_k: the "+" row is obviously there, and the "−" row must be there as well, since otherwise the "+" would have been flipped by the processing procedure and would not appear in the later table T_{k+1}.

Since d_VC(T*) ≤ d_VC(T) = d_VC(H) = d by observation 3, each row in T* has at most d "+" entries. Thus an upper bound on the total number of rows in T* is Σ_{i=0}^d (n choose i), which is also an upper bound on the growth function G_H(n) by observation 1.
The second statement comes from the fact that for n ≥ d,

Σ_{i=0}^d (n choose i) ≤ (n/d)^d Σ_{i=0}^d (n choose i) (d/n)^i
≤ (n/d)^d Σ_{i=0}^n (n choose i) (d/n)^i
= (n/d)^d (1 + d/n)^n
≤ (en/d)^d. ∎
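A quick sketch (my own) comparing 2^n, the exact Sauer bound, and the (en/d)^d relaxation:

    import math

    def sauer_bound(n, d):
        # Sauer's lemma: G_H(n) <= sum_{i=0}^{d} C(n, i).
        return sum(math.comb(n, i) for i in range(d + 1))

    d = 3
    for n in (3, 10, 100):
        print(f"n={n:3d}: 2^n={2 ** n}, Sauer={sauer_bound(n, d)}, "
              f"(en/d)^d={(math.e * n / d) ** d:.1f}")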

STAT 598Y: Statistical Learning Theory                                    Instructor: Jian Zhang

Lecture 6: Generalization Bound for Infinite Function Class

We now introduce a generalization error bound which utilizes the growth function of H (or the VC dimension of H) instead of the naive cardinality |H|.
Theorem 6-1 (Vapnik-Chervonenkis). For any δ > 0, with probability at least 1 − δ,

∀h ∈ H,  R(h) ≤ R̂_n(h) + 2 √( (2 log G_H(2n) + 2 log(2/δ)) / n ).

The proof of Theorem 6-1 utilizes a technique called symmetrization. For notational simplicity, we write

Pf = E[f(X, Y)],   P_n f = (1/n) Σ_{i=1}^n f(x_i, y_i).
Here f(X, Y) can be thought of as ℓ(Y, h(X)). Also define Z_i = (X_i, Y_i). The key idea is to upper bound the true risk by an estimate from an independent sample, often called the "ghost" sample. We use Z_1′,...,Z_n′ to denote the ghost sample, and
P_n′ f = (1/n) Σ_{i=1}^n f(x_i′, y_i′).
Then we can project the functions in H onto this double sample and apply the union bound with the help of the growth function G_H(·) of H.
Lemma (Symmetrization). For any t > 0 such that t ≥ √(2/n), we have
P( sup_{f∈F} |Pf − P_n f| ≥ t ) ≤ 2 P( sup_{f∈F} |P_n′ f − P_n f| ≥ t/2 ).
Proof. Let f_n be the function achieving the supremum. By the triangle inequality we have
I( |Pf_n − P_n f_n| > t ) I( |Pf_n − P_n′ f_n| < t/2 ) ≤ I( |P_n′ f_n − P_n f_n| > t/2 ).
Taking expectation with respect to the ghost sample, we have

I( |Pf_n − P_n f_n| > t ) P_{D_n′}( |Pf_n − P_n′ f_n| < t/2 ) ≤ P_{D_n′}( |P_n′ f_n − P_n f_n| > t/2 ).
By Chebyshev's inequality, we have

P_{D_n′}( |Pf_n − P_n′ f_n| ≥ t/2 ) ≤ 4 V[f_n] / (n t²) ≤ 1/(n t²),
where V[f_n] ≤ 1/4 because f_n takes values in [0, 1]. Hence

I( |Pf_n − P_n f_n| > t ) ( 1 − 1/(n t²) ) ≤ P_{D_n′}( |P_n′ f_n − P_n f_n| > t/2 ).
Taking expectation with respect to the original sample and using t ≥ √(2/n) (so that 1 − 1/(nt²) ≥ 1/2), we obtain the result. Note the same result holds if we remove the absolute values. ∎
The symmetrization lemma replaces the true risk by an average over the ghost sample. As a result, the RHS only depends on the projection of the function class F onto the double sample F_{D_n, D_n′}.

Proof of Theorem 6-1:

Let F = {f : f(x, y)=ℓ(y,h(x)),h∈ H}. First note that GH(n)=GF (n).

P( sup_{h∈H} ( R(h) − R̂_n(h) ) ≥ ϵ ) = P( sup_{f∈F} (Pf − P_n f) ≥ ϵ )
≤ 2 P( sup_{f∈F} (P_n′ f − P_n f) ≥ ϵ/2 )

= 2 P( sup_{f ∈ F_{D_n, D_n′}} (P_n′ f − P_n f) ≥ ϵ/2 )
≤ 2 G_F(2n) P( P_n′ f − P_n f ≥ ϵ/2 )
≤ 2 G_F(2n) exp(−nϵ²/8).

The last inequality is by Hoeffding's inequality, since P( P_n′ f − P_n f ≥ t ) ≤ exp(−nt²/2). Setting δ = 2 G_F(2n) exp(−nϵ²/8), we have the claimed result. ∎

Note that in the case of a finite function class, we have |H| = m and G_H(2n) ≤ m, so up to constants the bound is at least as good as the earlier one. Furthermore, if the VC dimension satisfies d_VC(H) ≤ n, we can apply Sauer's lemma to obtain the following result:
Corollary 6-2. For any δ > 0, with probability at least 1 − δ,

∀h ∈ H,  R(h) ≤ R̂_n(h) + 2 √( ( 2 d_VC(H) log(2en/d_VC(H)) + 2 log(2/δ) ) / n ).

Note that in order for the result in Theorem 6-1 to be meaningful, we need d_VC(H) to be finite. A class of functions whose VC dimension is finite is called a VC class. We can also use this result to obtain a bound on the expected risk E[R(ĥ_n)], where ĥ_n is the empirical risk minimizer. Since

R(ĥ_n) − inf_{h∈H} R(h) ≤ 2 sup_{h∈H} |R(h) − R̂_n(h)|,
we have

P( R(ĥ_n) − inf_{h∈H} R(h) ≥ ϵ ) ≤ P( sup_{h∈H} |R(h) − R̂_n(h)| ≥ ϵ/2 )
≤ 4 G_H(2n) exp(−nϵ²/32).
Define the nonnegative random variable Z = R(ĥ_n) − inf_{h∈H} R(h), so that P(Z ≥ ϵ) ≤ 4 G_H(2n) exp(−nϵ²/32). Thus
E[Z²] = ∫_0^∞ P(Z² ≥ t) dt
= ∫_0^u P(Z² ≥ t) dt + ∫_u^∞ P(Z² ≥ t) dt
≤ u + ∫_u^∞ 4 G_H(2n) exp(−nt/32) dt
= u + (128 G_H(2n)/n) exp(−nu/32).
Minimizing the RHS with respect to u gives u = (32/n) log(4 G_H(2n)). Plugging in, we have E[Z²] ≤ (32/n)(log(4 G_H(2n)) + 1). By the Cauchy-Schwarz inequality we have

E[R(ĥ_n)] − inf_{h∈H} R(h) = E[Z] ≤ √(E[Z²]) ≤ O( √( log G_H(2n) / n ) ).
So if the growth function increases only polynomially in n, then clearly E[R(ĥ_n)] − inf_{h∈H} R(h) → 0, i.e. the expected risk converges to the minimum risk within the function class H.
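To see the rate, a tiny sketch (my own) evaluating the deviation term of Corollary 6-2; the VC dimension and δ below are example values:

    import math

    def vc_deviation(n, d, delta):
        # Deviation term of Corollary 6-2:
        # 2 * sqrt((2 d log(2en/d) + 2 log(2/delta)) / n)
        return 2 * math.sqrt((2 * d * math.log(2 * math.e * n / d)
                              + 2 * math.log(2 / delta)) / n)

    d, delta = 4, 0.05          # e.g. linear classifiers in R^3 have d_VC = 4
    for n in (100, 1000, 10_000, 100_000):
        print(f"n={n:6d}: R(h) <= R_hat_n(h) + {vc_deviation(n, d, delta):.3f}")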

STAT 598Y: Statistical Learning Theory                                    Instructor: Jian Zhang

Lecture 7: Glivenko-Cantelli Theorem

Recall that if we use empirical risk minimization to obtain our predictor

ĥ_n = arg min_{h∈H} R̂_n(h),
then in order to bound the quantity R(ĥ_n) − inf_{h∈H} R(h), it suffices to bound the quantity

sup_{h∈H} |R(h) − R̂_n(h)|.
Thus uniform bounds play an important role in statistical learning theory. A Glivenko-Cantelli class is one for which the above uniform deviation vanishes as n → ∞.
Definition. F is a Glivenko-Cantelli class with respect to a probability measure P if for all ϵ > 0,

P( lim_{n→∞} sup_{f∈F} |Pf − P_n f| = 0 ) = 1,
i.e. sup_{f∈F} |Pf − P_n f| converges to zero almost surely (with probability 1). F is said to be a uniformly GC class if the convergence is uniform over all probability measures P. Note that Vapnik and Chervonenkis have shown that a (binary) function class is a uniformly GC class if and only if it is a VC class.

Given a set of i.i.d. real-valued random variables Z_1,...,Z_n and any z ∈ R, the quantity I(Z_i ≤ z) is a Bernoulli random variable with mean P(Z ≤ z) = F(z), where F(·) is the CDF. Furthermore, by the strong law of large numbers, we know that
(1/n) Σ_{i=1}^n I(Z_i ≤ z) → F(z)
almost surely. The following theorem is one of the most fundamental results in mathematical statistics, and it strengthens the pointwise strong law of large numbers: the empirical distribution function converges to the true distribution function uniformly, almost surely.

Theorem (Glivenko-Cantelli). Let Z_1,...,Z_n be i.i.d. real-valued random variables with distribution function F(z) = P(Z_i ≤ z). Denote the standard empirical distribution function by
F_n(z) = (1/n) Σ_{i=1}^n I(Z_i ≤ z).
Then
P( sup_{z∈R} |F(z) − F_n(z)| > ϵ ) ≤ 8(n+1) exp(−nϵ²/32),
and in particular, by the Borel-Cantelli lemma, we have

lim_{n→∞} sup_{z∈R} |F(z) − F_n(z)| = 0  almost surely.

Proof. We use the notation ν(A) := P(Z ∈ A) and ν_n(A) := (1/n) Σ_{i=1}^n I(Z_i ∈ A) for any measurable set A ⊂ R. If we let A denote the class of sets of the form (−∞, z] for z ∈ R, then we have
sup_{z∈R} |F(z) − F_n(z)| = sup_{A∈A} |ν(A) − ν_n(A)|.

We assume nϵ² > 2, since otherwise the bound holds trivially (the RHS exceeds 1). The proof consists of several key steps.
(1) Symmetrization by a ghost sample. Introduce a ghost sample Z_1′,...,Z_n′, i.i.d. with the original sample, and denote by ν_n′ the empirical measure of the ghost sample. Then for nϵ² > 2 we have (by the symmetrization lemma)

P( sup_{A∈A} |ν_n(A) − ν(A)| > ϵ ) ≤ 2 P( sup_{A∈A} |ν_n(A) − ν_n′(A)| > ϵ/2 ).
(2) Symmetrization by Rademacher variables. Let σ_1,...,σ_n be i.i.d. random variables, independent of Z_1,...,Z_n, Z_1′,...,Z_n′, with P(σ_i = 1) = P(σ_i = −1) = 1/2. Such random variables are called Rademacher random variables. Observe that the distribution of
sup_{A∈A} | Σ_{i=1}^n ( I(Z_i ∈ A) − I(Z_i′ ∈ A) ) |
is the same as that of
sup_{A∈A} | Σ_{i=1}^n σ_i ( I(Z_i ∈ A) − I(Z_i′ ∈ A) ) |
by the definitions of Z_1,...,Z_n; Z_1′,...,Z_n′ and σ_1,...,σ_n. Thus we have
P( sup_{A∈A} |ν_n(A) − ν(A)| > ϵ )
≤ 2 P( sup_{A∈A} |ν_n(A) − ν_n′(A)| > ϵ/2 )
= 2 P( sup_{A∈A} (1/n) | Σ_{i=1}^n σ_i ( I(Z_i ∈ A) − I(Z_i′ ∈ A) ) | > ϵ/2 )
≤ 2 P( sup_{A∈A} (1/n) | Σ_{i=1}^n σ_i I(Z_i ∈ A) | > ϵ/4 ) + 2 P( sup_{A∈A} (1/n) | Σ_{i=1}^n σ_i I(Z_i′ ∈ A) | > ϵ/4 )
= 4 P( sup_{A∈A} (1/n) | Σ_{i=1}^n σ_i I(Z_i ∈ A) | > ϵ/4 ).
(3) Conditioning. To bound the probability
P( sup_{A∈A} (1/n) | Σ_{i=1}^n σ_i I(Z_i ∈ A) | > ϵ/4 ) = P( sup_{z∈R} (1/n) | Σ_{i=1}^n σ_i I(Z_i ≤ z) | > ϵ/4 ),
we condition on Z_1,...,Z_n. Fix z_1,...,z_n ∈ R and note that the vector [I(z_1 ≤ z),...,I(z_n ≤ z)] can take at most n + 1 distinct values as z ranges over R. Thus, conditioned on Z_1,...,Z_n, the supremum is a maximum over at most n + 1 random variables. Applying the union bound we obtain
P( sup_{A∈A} (1/n) | Σ_{i=1}^n σ_i I(Z_i ∈ A) | > ϵ/4 | Z_1,...,Z_n ) ≤ (n+1) sup_{A∈A} P( (1/n) | Σ_{i=1}^n σ_i I(Z_i ∈ A) | > ϵ/4 | Z_1,...,Z_n ),
where the sup is outside of the probability. The next step is to find an exponential bound for the RHS.
(4) Hoeffding's inequality. With z_1,...,z_n fixed, Σ_{i=1}^n σ_i I(z_i ∈ A) is a sum of n independent zero-mean random variables taking values in [−1, 1]. Thus, by Hoeffding's inequality,
(n+1) sup_{A∈A} P( (1/n) | Σ_{i=1}^n σ_i I(Z_i ∈ A) | > ϵ/4 | Z_1,...,Z_n ) ≤ 2(n+1) exp(−nϵ²/32).

Taking expectation on both sides, we obtain the claimed result:

P( sup_{A∈A} |ν_n(A) − ν(A)| > ϵ ) ≤ 8(n+1) exp(−nϵ²/32). ∎
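A quick empirical illustration (my own sketch) of the uniform convergence of the empirical CDF for a standard normal sample:

    import math
    import numpy as np

    def Phi(v):
        # Standard normal CDF via the error function.
        return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

    rng = np.random.default_rng(5)
    for n in (100, 1000, 10_000):
        z = np.sort(rng.normal(size=n))
        F = np.array([Phi(v) for v in z])      # true CDF at the order statistics
        # sup_z |F - F_n| is attained at a jump of F_n, from either side.
        dev = max(np.max(np.abs(F - np.arange(1, n + 1) / n)),
                  np.max(np.abs(F - np.arange(0, n) / n)))
        print(f"n={n:6d}: sup_z |F(z) - F_n(z)| = {dev:.4f}")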

STAT 598Y: Statistical Learning Theory                                    Instructor: Jian Zhang

Lecture 8: Rademacher Complexity

Recall that in the proof of the Glivenko-Cantelli theorem we used Rademacher random variables σ_1,...,σ_n, which are i.i.d. uniform {±1}-valued random variables.
Definition. Let µ be a probability measure on X and assume that X_1,...,X_n are independent random variables distributed according to µ. Let F be a class of functions mapping X to R. Define the random variable
R̂_n(F) := E[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i f(X_i) | X_1,...,X_n ],
where σ_1,...,σ_n are independent uniform {±1}-valued random variables. R̂_n(F) is called the empirical Rademacher average of F. Note that it depends on the sample and can actually be computed. Essentially it measures the correlation, in the supremum sense, between random noise (a random labeling) and the functions in the class F. The Rademacher average of F is

R_n(F) = E[ R̂_n(F) ].

For a function class F and a sample S = {x_1,...,x_n}, we would like to bound the random quantity
φ(S) := sup_{f∈F} (Pf − P_n f) = sup_{f∈F} ( E[f(X)] − (1/n) Σ_{i=1}^n f(x_i) ).
First, we bound the difference between this random variable and its mean using the McDiarmid inequality. Consider another sample S′ which differs from S in exactly one example. Then we have

|φ(S) − φ(S′)| = | sup_{f∈F} (Pf − P_n f) − sup_{f∈F} (Pf − P_n′ f) | ≤ c/n,
where P_n′ is the empirical mean over S′ and we assume every f ∈ F takes values in [a, b] with c = b − a. Then by the McDiarmid inequality we have
P( sup_{f∈F} (Pf − P_n f) − E[ sup_{f∈F} (Pf − P_n f) ] ≥ ϵ ) ≤ exp( −2nϵ²/c² ).
Thus, setting δ = exp(−2nϵ²/c²), we have with probability at least 1 − δ:
φ(S) ≤ E[φ(S)] + √( c² log(1/δ) / (2n) ).

Next, we relate E[φ(S)] to the Rademacher average. From now on, we define S′ = {X_1′,...,X_n′} to

be a ghost sample of S (not the same S′ as before). Note that

E_S[ sup_{f∈F} (Pf − P_n f) ] = E_S[ sup_{f∈F} ( E[f(X)] − (1/n) Σ_{i=1}^n f(X_i) ) ]
= E_S[ sup_{f∈F} E[ (1/n) Σ_{i=1}^n f(X_i′) − (1/n) Σ_{i=1}^n f(X_i) | X_1,...,X_n ] ]
≤ E_{S,S′}[ sup_{f∈F} (1/n) Σ_{i=1}^n ( f(X_i′) − f(X_i) ) ]    (Jensen)
= E_{S,S′,σ}[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i ( f(X_i′) − f(X_i) ) ]
≤ E_{S,S′,σ}[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i f(X_i′) + sup_{f∈F} (1/n) Σ_{i=1}^n (−σ_i) f(X_i) ]
= 2 R_n(F).

So we have shown that E_S[φ(S)] ≤ 2 R_n(F). Combining this with the first step (with c = 1), we have shown the first part of the following theorem:
Theorem 8-1. Let F be a set of binary-valued ({0,1}) functions. For all δ > 0, with probability at least 1 − δ,
∀f ∈ F,  Pf ≤ P_n f + 2 R_n(F) + √( log(1/δ) / (2n) ),
and also with probability at least 1 − δ,
∀f ∈ F,  Pf ≤ P_n f + 2 R̂_n(F) + C √( log(2/δ) / n ),
where C = √2 + 1/√2.
Proof.

The first part has been proven above. For the second part, we apply McDiarmid's inequality again, this time to the empirical Rademacher average

R̂_n(F) = E[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i f(X_i) | X_1,...,X_n ].
Note that R̂_n(F) is a function of X_1,...,X_n and satisfies the bounded-differences condition of McDiarmid's inequality with constant at most 1/n. So we have

P( 2 R_n(F) − 2 R̂_n(F) > ϵ ) ≤ exp(−nϵ²/2).
So with probability at least 1 − δ,
2 R_n(F) ≤ 2 R̂_n(F) + √( 2 log(1/δ) / n ).
Allowing each step to fail with probability δ/2, we have with probability at least 1 − δ,
∀f ∈ F,  Pf ≤ P_n f + 2 R̂_n(F) + C √( log(2/δ) / n ),
where C = √2 + 1/√2.

∎

Assume that H is the hypothesis space and F = ℓ ∘ H = {f : f(x,y) = ℓ(y, h(x)), h ∈ H} is the induced class. Then the Rademacher averages of F and H are closely related. In fact, if we assume Y = {±1} (with the 0/1 loss), then we have

R_n(F) = E[ sup_{h∈H} (1/n) Σ_{i=1}^n σ_i I(Y_i ≠ h(X_i)) ]
= E[ sup_{h∈H} (1/n) Σ_{i=1}^n σ_i (1 − Y_i h(X_i))/2 ]
= (1/2) E[ sup_{h∈H} (1/n) Σ_{i=1}^n (−σ_i Y_i) h(X_i) ]    (the σ_i-term free of h has zero mean)
= (1/2) R_n(H),
since −σ_i Y_i has the same distribution as σ_i.

The Rademacher average R_n(H) can actually be computed. Notice that
(1/2) R_n(H) = (1/2) E[ sup_{h∈H} (1/n) Σ_{i=1}^n σ_i h(X_i) ]
= (1/2) E[ sup_{h∈H} ( 1 − (2/n) Σ_{i=1}^n I(σ_i ≠ h(X_i)) ) ]
= 1/2 − E[ inf_{h∈H} R̂_n(h, σ) ],
where R̂_n(h, σ) is the empirical risk of classifier h with respect to the random labels σ = [σ_1,...,σ_n]. When H is so large that it can fit every random labeling perfectly, we have R_n(F) = 1/2 and the bound becomes meaningless.
The Rademacher average is related to the growth function and the VC dimension; one can bound the Rademacher average by either of them. We can also estimate Rademacher averages of function classes which are built from simpler classes. The following is a list of properties of Rademacher averages.

1. If F ⊂ G then R_n(F) ≤ R_n(G); this follows directly from the definition.
2. R_n(c·F) = |c| R_n(F), where c·F = {x ↦ c f(x) : f ∈ F}. Indeed,
R_n(c·F) = E[ sup_{f∈F} (1/n) Σ_{i=1}^n c σ_i f(X_i) ],

where the σ_i's are i.i.d. Rademacher random variables. Since |c| σ_i has the same distribution as c σ_i, we have
R_n(c·F) = |c| E[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i f(X_i) ] = |c| R_n(F).

3. R_n(F + g) = R_n(F), where F + g = {f + g : f ∈ F} and g is a fixed function. To show

this, we have

R_n(F + g) = E[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i ( f(X_i) + g(X_i) ) ]
= E[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i f(X_i) ] + E[ (1/n) Σ_{i=1}^n σ_i g(X_i) ]
= R_n(F),
since the second term is zero.
4. Let the convex hull of a set of functions F be defined as
conv(F) = { Σ_{i=1}^k α_i f_i : k ≥ 1, α_i ≥ 0, Σ_{i=1}^k α_i = 1, f_1,...,f_k ∈ F }.

Then we have R_n(F) = R_n(conv(F)), since
R_n(conv(F)) = E[ sup_{f_j∈F, ∥α∥_1=1, α≥0} (1/n) Σ_{i=1}^n σ_i Σ_j α_j f_j(X_i) ]
= E[ sup_{f_j∈F} sup_{∥α∥_1=1, α≥0} Σ_j α_j ( (1/n) Σ_{i=1}^n σ_i f_j(X_i) ) ]
= E[ sup_{f_j∈F} max_j (1/n) Σ_{i=1}^n σ_i f_j(X_i) ]
= R_n(F),
where the middle step uses the fact that a linear function of α over the simplex is maximized at a vertex.

5. Ledoux-Talagrand contraction inequality: if each φ_i is Lipschitz with constant L, i.e. |φ_i(a) − φ_i(b)| ≤ L|a − b|, then
E[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i φ_i(f(X_i)) ] ≤ L E[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i f(X_i) ] = L R_n(F).
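Since the empirical Rademacher average is computable, it can be estimated by Monte Carlo over σ. A small sketch (mine); the finite class of threshold functions is an illustrative assumption:

    import numpy as np

    rng = np.random.default_rng(6)
    n = 200
    x = rng.uniform(0, 1, size=n)                            # a fixed sample

    # Assumed finite class: thresholds h_t(x) = I(x <= t), t on a grid.
    F = np.array([(x <= t).astype(float) for t in np.linspace(0, 1, 21)])

    def empirical_rademacher(F, n_mc=5000):
        # hat{R}_n(F) = E_sigma[ sup_f (1/n) sum_i sigma_i f(x_i) ]
        total = 0.0
        for _ in range(n_mc):
            sigma = rng.choice([-1.0, 1.0], size=F.shape[1])
            total += np.max(F @ sigma) / F.shape[1]
        return total / n_mc

    print("empirical Rademacher average:", empirical_rademacher(F))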

STAT 598Y: Statistical Learning Theory                                    Instructor: Jian Zhang

Lecture 9: Covering Number and Real-Valued Function Class

Recall that if S = {x_1,...,x_n} ⊂ X and H is a set of binary-valued functions, then H_S denotes the restriction of H to S, and the growth function (shatter coefficient) is defined by

G_H(n) = sup_S |H_S|.

Since H maps X into {0,1}, H_S is finite for any finite S (|H_S| ≤ 2^n).

Such a definition works well for binary-valued functions, but if H is a set of real-valued functions then H_S will generally be a set of infinite cardinality, even for finite n. Essentially we do not need to divide H into distinct elements based on H_S; we only need to partition the function class H into small groups which are "local" in nature.
Definition. Let (W, d) be a metric space and F ⊂ W. For every ϵ > 0, denote by N(ϵ, F, d) the minimal number of open balls (with respect to the metric d) needed to cover F. That is, N(ϵ, F, d) is the minimal cardinality of a set {f_1,...,f_m} ⊂ W with the property that for every f ∈ F there is some f_i such that d(f, f_i) < ϵ. The set {f_1,...,f_m} is called an ϵ-cover of F. The logarithm of the covering number is called the entropy of the set.

We will be interested in metrics induced by samples. For every sample {x_1,...,x_n} let µ_n be the empirical measure of the sample. For 1 ≤ p ≤ ∞ and a function f, put

∥f∥_{L_p(µ_n)} = ( (1/n) Σ_{i=1}^n |f(x_i)|^p )^{1/p},

and in particular ∥f∥_{L_∞(µ_n)} = max_{1≤i≤n} |f(x_i)|. Let N(ϵ, F, L_p(µ_n)) be the covering number of F at scale ϵ with respect to the L_p(µ_n) norm.

Theorem 9-1. For any class F of real-valued functions, any sample S = {x1,...,xn} and ϵ > 0,

N(ϵ, F, L_1(µ_n)) ≤ N(ϵ, F, L_2(µ_n)) ≤ N(ϵ, F, L_∞(µ_n)).

Definition. For ϵ > 0, define the uniform covering number

N_p(ϵ, F, n) = sup_{µ_n} N(ϵ, F, L_p(µ_n)).

It is easy to see that the uniform covering number is a generalization of the growth function. Suppose that F contains functions which map X into {0,1}. Then for any S = {x_1,...,x_n}, ϵ < 1 and p = ∞, we have N(ϵ, F, L_∞(µ_n)) = |F_S|, so N_∞(ϵ, F, n) = G_F(n). Based on the covering number we are able to obtain uniform convergence results for real-valued function classes.
Theorem 9-2. Let F be a class of functions which map X into [−1, 1] and let µ be a probability measure on X. Assume X_1,...,X_n are independent random variables distributed according to µ. For every ϵ > 0 and n ≥ 8/ϵ²,
P( sup_{f∈F} | (1/n) Σ_{i=1}^n f(X_i) − E[f(X)] | > ϵ ) ≤ 8 E[ N(ϵ/8, F, L_1(µ_n)) ] exp(−nϵ²/128),
where µ_n is the empirical measure of X_1,...,X_n.

Proof. Consider the event A = { sup_{f∈F} | Σ_{i=1}^n σ_i f(X_i) | > nϵ/4 }. We have

P(A) = E_µ[ E_σ[ I(A) | X_1,...,X_n ] ]
= E_µ[ E_σ[ I( sup_{f∈F} | Σ_{i=1}^n σ_i f(X_i) | > nϵ/4 ) | X_1,...,X_n ] ].
For any realization of X_1,...,X_n, its empirical measure is µ_n. Let G be an ϵ/8-cover of F with respect to the L_1(µ_n) norm; we can assume that every function g ∈ G is bounded by 1. First, observe that

P( sup_{f∈F} | Σ_{i=1}^n σ_i f(X_i) | > nϵ/4 ) ≤ P( sup_{g∈G} | Σ_{i=1}^n σ_i g(X_i) | > nϵ/8 ).
This is because if some f* ∈ F makes the LHS event true, then we can find g* ∈ G with ∥f* − g*∥_{L_1(µ_n)} < ϵ/8, so that
(1/n) | Σ_{i=1}^n σ_i f*(X_i) − Σ_{i=1}^n σ_i g*(X_i) | ≤ (1/n) Σ_{i=1}^n |f*(X_i) − g*(X_i)| < ϵ/8,
and hence sup_{g∈G} | Σ_{i=1}^n σ_i g(X_i) | > nϵ/4 − nϵ/8 = nϵ/8. Applying the union bound, Hoeffding's inequality, and the fact that Σ_{i=1}^n g(X_i)² ≤ n for all g ∈ G, we have
P( sup_{g∈G} | Σ_{i=1}^n σ_i g(X_i) | > nϵ/8 ) ≤ |G| · sup_{g∈G} P( | Σ_{i=1}^n σ_i g(X_i) | > nϵ/8 )
≤ 2 N(ϵ/8, F, L_1(µ_n)) exp(−nϵ²/128).
The claim follows by combining this result with symmetrization (using a ghost sample and Rademacher random variables). ∎

Lemma. For A ⊂ R^n with r = max_{a∈A} ∥a∥, and σ_1,...,σ_n Rademacher random variables, we have

E[ sup_{a∈A} Σ_{i=1}^n σ_i a_i ] ≤ r √(2 log |A|).
Proof. For any s > 0 we have

exp( s E[ sup_{a∈A} Σ_i σ_i a_i ] ) ≤ E[ exp( s sup_{a∈A} Σ_i σ_i a_i ) ]    (Jensen)
= E[ sup_{a∈A} exp( s Σ_i σ_i a_i ) ]
≤ Σ_{a∈A} E[ exp( s Σ_i σ_i a_i ) ]
≤ Σ_{a∈A} exp( s² Σ_i a_i² / 2 )
≤ |A| exp( s² r² / 2 ).

So we have
E[ sup_{a∈A} Σ_{i=1}^n σ_i a_i ] ≤ inf_{s>0} ( log|A|/s + s r²/2 ) = r √(2 log |A|). ∎

By the lemma, it is clear that
R̂_n(F) ≤ √( 2 log |F| / n )
if F is finite with output values in [−1, 1].
Theorem 9-3. For F ⊂ [−1, 1]^X, we have

R̂_n(F) ≤ inf_{ϵ>0} ( √( 2 log N(ϵ, F, L_2(µ_n)) / n ) + ϵ ).
Proof.

For an ϵ > 0, let G be an ϵ-cover of F. Then we have

R̂_n(F) = E_σ[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i f(x_i) ]
= E_σ[ sup_{g∈G} sup_{f∈F∩B_ϵ(g)} ( (1/n) Σ_{i=1}^n σ_i g(x_i) + (1/n) Σ_{i=1}^n σ_i ( f(x_i) − g(x_i) ) ) ]
≤ E_σ[ sup_{g∈G} (1/n) Σ_{i=1}^n σ_i g(x_i) ] + ϵ
≤ √( 2 log N(ϵ, F, L_2(µ_n)) / n ) + ϵ,

where the first equality utilizes the fact that F = ∪_{g∈G} (F ∩ B_ϵ(g)), and the first inequality follows from ∥f − g∥_{L_2(µ_n)} ≤ ϵ together with the Cauchy-Schwarz inequality:
(1/n) Σ_i σ_i ( f(x_i) − g(x_i) ) ≤ ( (1/n) Σ_i σ_i² )^{1/2} ( (1/n) Σ_i ( f(x_i) − g(x_i) )² )^{1/2} ≤ ϵ. ∎

Definition. Let (W, d) be a metric space and F ⊂ W. For ϵ > 0, a subset A ⊂ F is said to be an ϵ-packing of F if for all distinct f_1, f_2 ∈ A we have d(f_1, f_2) > ϵ. The ϵ-packing number P(ϵ, F, d) is defined as the maximum cardinality of an ϵ-packing of F.
Both the covering number and the packing number measure the size of a set, and they are closely related. The following simple result shows that as long as one of them can be computed, we easily obtain a bound on the other.
Theorem 9-4. Given a metric space (W, d), for all ϵ > 0 and every F ⊂ W, the covering and packing numbers satisfy
P(2ϵ, F, d) ≤ N(ϵ, F, d) ≤ P(ϵ, F, d).
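A small sketch (mine) of a greedy ϵ-packing, which is also an ϵ-cover (consistent with N(ϵ, F, d) ≤ P(ϵ, F, d) in Theorem 9-4), for a function class evaluated on a sample under the L_2(µ_n) metric; the threshold class and the sample are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(7)
    n = 100
    x = np.sort(rng.uniform(0, 1, size=n))
    # Assumed class, restricted to the sample: thresholds f_t(x) = I(x <= t).
    F = [(x <= t).astype(float) for t in np.linspace(0, 1, 200)]

    def dist(a, b):
        # L2(mu_n) distance between two functions restricted to the sample.
        return np.sqrt(np.mean((a - b) ** 2))

    def greedy_packing(F, eps):
        # Grow a maximal eps-packing greedily; every f in F is then within
        # eps of some center, so the packing doubles as an eps-cover.
        centers = []
        for f in F:
            if all(dist(f, c) > eps for c in centers):
                centers.append(f)
        return centers

    for eps in (0.5, 0.25, 0.1):
        print(f"eps={eps:4.2f}: packing (= cover) size {len(greedy_packing(F, eps))}")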

STAT 598Y: Statistical Learning Theory                                    Instructor: Jian Zhang

Lecture 10: Perceptron and Linear SVM

We start with the perceptron algorithm, which is probably one of the simplest linear classifiers. We then introduce the margin maximization idea and derive the linear SVM classifier.
Assume that D_n = {(x_1,y_1),...,(x_n,y_n)}, where X = R^p and Y = {±1}. For simplicity we only consider linear classifiers without intercept here, i.e. H = {x ↦ w^T x : w ∈ R^p}. Furthermore, we assume that the data are linearly separable, i.e. there exists some w* which correctly classifies all examples: sign(w*^T x_i) = y_i for i = 1,...,n. The perceptron algorithm works as follows.

1. Start with w_0 = 0 and t = 0;

2. While wt has training error > 0:

(a) Pick one observation (xi,yi) which is misclassified by wt;

(b) Update w_{t+1} = w_t + y_i x_i;
(c) t = t + 1;

3. Let T = t, w = w_T and return ĥ_n(x) = sign(w^T x).
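Before analyzing convergence, here is a direct sketch (mine) of the algorithm above on toy separable data; the data generation and the margin filter are assumptions made so the loop terminates quickly.

    import numpy as np

    rng = np.random.default_rng(8)
    w_star = np.array([1.0, -1.0])
    X = rng.normal(size=(200, 2))
    keep = np.abs(X @ w_star) > 0.5          # keep a decent margin (small r^2/delta^2)
    X = X[keep]
    y = np.sign(X @ w_star)                  # linearly separable through the origin

    w = np.zeros(2)                          # step 1: w_0 = 0, t = 0
    t = 0
    while True:
        mistakes = np.flatnonzero(np.sign(X @ w) != y)   # step 2: training errors?
        if mistakes.size == 0:
            break
        i = mistakes[0]                      # step 2(a): a misclassified example
        w = w + y[i] * X[i]                  # step 2(b): perceptron update
        t += 1                               # step 2(c)

    print(f"terminated after T = {t} updates; final w = {w.round(2)}")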

Theorem 10-1 (Novikov). Define r = max_i ∥x_i∥ and δ = min_i ( w*^T x_i y_i ) / ∥w*∥, where w* is some classifier which can linearly separate D_n. Then the perceptron algorithm terminates after T ≤ r²/δ² steps.
Proof.

First, note that δ is the "margin": the minimum distance of an example to the decision hyperplane. So the larger the margin, the fewer steps we need to converge. The basic idea of the proof is to show that the w_t get closer and closer to (the direction of) w*. Since ∥w_t − w*∥² = ∥w_t∥² + ∥w*∥² − 2 w_t^T w*, essentially we need to upper bound ∥w_t∥ and lower bound w_t^T w*, and then combine the results.

First we have w_0^T w* = 0 and
w_{t+1}^T w* = w_t^T w* + y_i x_i^T w* ≥ w_t^T w* + δ ∥w*∥.
By induction, w_t^T w* ≥ t δ ∥w*∥. Second, since w_0 = 0 and
∥w_{t+1}∥² = ∥w_t + y_i x_i∥² = ∥w_t∥² + ∥x_i∥² + 2 y_i x_i^T w_t ≤ ∥w_t∥² + r²
(the cross term is nonpositive because (x_i, y_i) is misclassified by w_t), we have ∥w_t∥² ≤ t r². Thus
t δ ∥w*∥ ≤ w_t^T w* ≤ ∥w_t∥ ∥w*∥ ≤ √t r ∥w*∥.
It follows that t ≤ r²/δ² for every t, and thus T ≤ r²/δ². ∎

Maximum Margin Classifier: Support Vector Machines (SVM)

We consider the set of linear classifiers H = {h(x) = w^T x + b : w ∈ R^p, b ∈ R}. Suppose the training examples are linearly separable, i.e. there exists some linear classifier with 0 training error. Consider the following optimization problem:

min_{w,b} ∥w∥²
s.t. y_i(w^T x_i + b) ≥ 1, i = 1,...,n.
This is a constrained optimization problem where the objective is quadratic and the constraints are linear, so it is a convex optimization problem (a quadratic program, to be more specific). Given a hyperplane (classifier), define the margin as the minimum distance from the plane to any example. Now we show that the above optimization essentially finds a classifier which maximizes the margin. First, assume that there are two examples x₊ and x₋, both on the margin boundary (see Figure 1). Then the margin equals half the length of the projection of (x₊ − x₋) onto the direction perpendicular to the hyperplane. So we have

margin = (1/2) (x₊ − x₋)^T w/∥w∥.
Using the fact that x₊ and x₋ lie on the margin boundary, we have w^T x₊ + b = 1 and w^T x₋ + b = −1. Thus w^T(x₊ − x₋) = 2, and we conclude that the margin is 1/∥w∥. Thus minimizing ∥w∥² subject to the linear constraints is equivalent to maximizing the margin 1/∥w∥ subject to the same constraints.
Since in practice the examples may not be linearly separable, we introduce slack variables. For each example, define ξ_i ≥ 0 to be the slack variable which measures how much this example violates the margin condition. Instead of minimizing only ∥w∥², we also add a term Σ_{i=1}^n ξ_i which penalizes violations of the margin condition. The relaxed optimization problem can be written as:

min_{w,b,ξ} Σ_{i=1}^n ξ_i + λ∥w∥²
s.t. y_i(w^T x_i + b) ≥ 1 − ξ_i, i = 1,...,n,
     ξ_i ≥ 0, i = 1,...,n,
where λ > 0 is a tuning parameter which controls the balance between the training error and the margin. Note that, equivalently, we can write the optimization problem as

min_{w,b} Σ_{i=1}^n ( 1 − y_i(w^T x_i + b) )₊ + λ∥w∥²,
where (t)₊ = max(t, 0).
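A minimal sketch (mine) of this hinge-loss formulation via subgradient descent with a Pegasos-style decaying step; the data, λ and step schedule are assumptions, and the objective is normalized by n to match the primal form used in the next lecture.

    import numpy as np

    rng = np.random.default_rng(9)
    n, p, lam = 200, 2, 0.1
    X = rng.normal(size=(n, p))
    y = np.sign(X @ np.array([2.0, -1.0]) + 0.3 * rng.normal(size=n))

    def objective(w, b):
        # (1/n) * sum_i hinge_i + lam * ||w||^2
        return np.mean(np.maximum(0.0, 1.0 - y * (X @ w + b))) + lam * (w @ w)

    w, b = np.zeros(p), 0.0
    for t in range(1, 5001):
        viol = y * (X @ w + b) < 1.0                        # margin violators
        gw = -(y[viol, None] * X[viol]).sum(axis=0) / n + 2 * lam * w
        gb = -y[viol].sum() / n
        step = 1.0 / (2.0 * lam * t)                        # decaying step size
        w, b = w - step * gw, b - step * gb

    print("objective:", objective(w, b))
    print("training error:", np.mean(np.sign(X @ w + b) != y))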

STAT 598Y: Statistical Learning Theory                                    Instructor: Jian Zhang

Lecture 11: Convex Optimization

In this lecture we first give some background on convex optimization, including the KKT conditions and duality. We then derive the SVM dual optimization problem. Consider a constrained minimization problem of the form

minimize f(x)

subject to: g_i(x) = 0, i = 1,...,m (m ≤ n)    (1)

h_j(x) ≤ 0, j = 1,...,p.
The g_i's are equality constraints and the h_j's are inequality constraints, and usually they are assumed to be in the class C². A point that satisfies all constraints is said to be a feasible point. An inequality constraint is said to be active at a feasible point x if h_j(x) = 0 and inactive if h_j(x) < 0. Equality constraints are always active at any feasible point. To simplify notation we write h = [h_1,...,h_p] and g = [g_1,...,g_m], so the constraints become g(x) = 0 and h(x) ≤ 0.

Karush-Kuhn-Tucker (KKT) Conditions

KKT conditions (a.k.a. Kuhn-Tucker conditions) are necessary conditions for the local minimum solutions of problem (1). Let x∗ be a local minimum point for Problem (1) and suppose x∗ is a regular point for the constraints. Then there is a vector µ ∈ Rm and a vector λ ∈ Rp with λ ≥ 0 such that

$$\nabla f(x^*) + \lambda^T \nabla h(x^*) + \mu^T \nabla g(x^*) = 0 \qquad (2)$$
$$g(x^*) = 0 \qquad (3)$$
$$\lambda_j h_j(x^*) = 0, \quad j = 1,\dots,p \qquad (4)$$

Convince yourself why the above conditions hold geometrically. It is convenient to introduce the Lagrangian associated with the problem as

$$L(x, \lambda, \mu) = f(x) + \lambda^T h(x) + \mu^T g(x),$$

where $\mu \in \mathbb{R}^m$, $\lambda \in \mathbb{R}^p$ and $\lambda \ge 0$ are Lagrange multipliers. Note that equations (2), (3) and (4) together give a total of $n + m + p$ equations in the $n + m + p$ variables $x^*$, $\lambda$ and $\mu$. From now on we assume that we only have inequality constraints, for simplicity. The case with equality constraints can be handled in a similar way, except that $\mu$ does not have the nonnegativity constraint that $\lambda$ has. So in our case we have the following optimization problem:

min f(x) s.t. h(x) ≤ 0.

Weak Duality and Strong Duality

Consider the Lagrangian L(x, λ) for the above optimization problem. Then we have the following two types of dualities:

• weak duality: we have obviously for any $\lambda \ge 0$ that

$$\inf_x L(x, \lambda) \le \inf_x \sup_{\lambda \ge 0} L(x, \lambda)$$

and thus

$$\sup_{\lambda \ge 0} \inf_x L(x, \lambda) \le \inf_x \sup_{\lambda \ge 0} L(x, \lambda).$$

Such a relation always holds.

• strong duality: suppose in addition there exist $x^*$ and $\lambda^* \ge 0$ such that

$$L(x^*, \lambda) \le L(x^*, \lambda^*) \le L(x, \lambda^*)$$

for all feasible $x$ and $\lambda \ge 0$. Then we have

$$\inf_x \sup_{\lambda \ge 0} L(x, \lambda) \le \sup_{\lambda \ge 0} L(x^*, \lambda) = L(x^*, \lambda^*) = \inf_x L(x, \lambda^*) \le \sup_{\lambda \ge 0} \inf_x L(x, \lambda).$$

Thus we have

$$\inf_x \sup_{\lambda \ge 0} L(x, \lambda) = \sup_{\lambda \ge 0} \inf_x L(x, \lambda).$$

The point $(x^*, \lambda^*)$ is called a saddle point. One example is the function $L(x, \lambda) = x^2 - \lambda^2$, with saddle point $(0, 0)$, as shown in Figure 1.

Weak duality always holds, and strong duality holds if f and hj’s are convex and there exists at least one feasible point which is an interior point. The Lagrange dual function D(λ) is defined as

$$D(\lambda) := \inf_x L(x, \lambda) = \inf_x \Big\{ f(x) + \sum_{j=1}^p \lambda_j h_j(x) \Big\},$$

and we define the dual optimization problem as: $\max D(\lambda)$ s.t. $\lambda \ge 0$.

Note that (1) $D(\lambda)$ is a concave function; (2) for any feasible $\lambda$ and $x$ we have $D(\lambda) \le f(x)$. In fact, if we define $p^*$ to be the minimum of the primal optimization problem (the primal solution) and $d^* = \sup_{\lambda \ge 0} D(\lambda)$ to be the maximum of the dual problem (the dual solution), then weak duality says $d^* \le p^*$. The quantity $p^* - d^*$ is known as the duality gap, which can be a useful criterion for convergence.

Now we illustrate this duality relationship with a simple example where we only have one inequality constraint: $\min f(x)$ s.t. $h(x) \le 0$. Define $\omega(z) = \inf\{f(x) : h(x) \le z\}$ for $z \in \mathbb{R}$. Then it is easy to observe that $\omega(z)$ is monotone (non-increasing) in $z$. The duality can be illustrated by the fact that the primal solution $p^*$ is the intercept of $\omega(z)$ with the vertical axis $z = 0$, and it is an upper bound of the maximum intercept with the vertical axis over all hyperplanes that lie below $\omega(\cdot)$. Such hyperplanes have the form $l_\lambda(z) = -\lambda z + \inf_x\{f(x) + \lambda h(x)\}$ with $\lambda \ge 0$. An example is shown in Figure 1 (right panel).

SVM Dual Problem

The SVM primal problem can be written as

$$\min_{w,b,\xi} \ \frac{1}{n}\sum_{i=1}^n \xi_i + \lambda \|w\|^2 \qquad \text{s.t. } y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \forall i.$$


Figure 1: Left: Saddle point $(0,0)$ of $L(x, \lambda) = x^2 - \lambda^2$. Right: Geometric interpretation of duality.

Now the Lagrangian can be written as

$$L(w, b, \xi, \alpha, \beta) = \frac{1}{n}\sum_{i=1}^n \xi_i + \lambda w^T w + \sum_{i=1}^n \alpha_i (1 - \xi_i - y_i w^T x_i - y_i b) - \sum_{i=1}^n \beta_i \xi_i,$$

where the Lagrange multipliers satisfy $\alpha \ge 0$ and $\beta \ge 0$. We want to eliminate the primal variables $w, b, \xi$ by minimizing the Lagrangian over them, i.e. set the following derivatives to zero:

$$\frac{\partial L}{\partial w} = 0 \implies w = \frac{1}{2\lambda}\sum_{i=1}^n \alpha_i y_i x_i$$
$$\frac{\partial L}{\partial b} = 0 \implies \sum_{i=1}^n \alpha_i y_i = 0$$
$$\frac{\partial L}{\partial \xi_i} = 0 \implies \alpha_i + \beta_i = \frac{1}{n}.$$

Plugging these in, we obtain the dual:

$$D(\alpha, \beta) = \sum_{i=1}^n \alpha_i - \frac{1}{4\lambda}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j.$$

Since $\alpha_i \ge 0$, $\beta_i \ge 0$ and $\alpha_i + \beta_i = 1/n$, we have $0 \le \alpha_i \le 1/n$. So the dual optimization problem becomes

$$\max_\alpha \ \sum_{i=1}^n \alpha_i - \frac{1}{4\lambda}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j \qquad \text{s.t. } 0 \le \alpha_i \le 1/n, \quad \sum_{i=1}^n \alpha_i y_i = 0,$$

which is a quadratic programming problem (a minimal solver sketch is given after the observations below). Note that due to the constraints, the dual solution is in general sparse, i.e. many of the $\alpha_i$'s are equal to 0. We have the following observations:

1. If $\alpha_i > 0$: we have $y_i(w^T x_i + b) = 1 - \xi_i \le 1$. So the example is either at the margin or on the wrong side of it. Examples with $\alpha_i > 0$ are called support vectors.

2. If $\alpha_i = 0$: we have $\beta_i = 1/n$ and thus $\xi_i = 0$. So $y_i(w^T x_i + b) \ge 1$. Such examples are on the correct side of the margin.

3. If $y_i(w^T x_i + b) < 1$: we have $\xi_i > 0$ and thus $\beta_i = 0$ and $\alpha_i = 1/n$. So if an example causes a margin error, then its dual variable $\alpha_i$ sits at the upper boundary $1/n$.

4. It is possible that for examples which are on the correct side of the margin, their αi’s are nonzero.

5. In the objective, the $x_i$'s always appear in the form of inner products $x_i^T x_j$. So if we first map $x_i$ into a feature vector $\phi(x_i)$, then we can replace $x_i^T x_j$ by $\langle \phi(x_i), \phi(x_j) \rangle$. This leads to the introduction of reproducing kernel Hilbert spaces in SVM.
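As promised above, here is a minimal solver sketch for the dual. To keep the projection a simple coordinate-wise clip we assume a classifier without intercept ($b = 0$), so the equality constraint $\sum_i \alpha_i y_i = 0$ disappears; the step size and all names are illustrative:

```python
import numpy as np

def svm_dual_pg(K, y, lam=0.1, lr=0.1, iters=500):
    """Projected gradient ascent on the SVM dual
        max_alpha  sum_i alpha_i - (1/(4 lam)) sum_{ij} alpha_i alpha_j y_i y_j K_ij
        s.t.       0 <= alpha_i <= 1/n.
    Sketch for the no-intercept case (b = 0), so the projection onto the
    feasible set is just a clip to the box [0, 1/n]."""
    n = len(y)
    Q = (y[:, None] * y[None, :]) * K          # Q_ij = y_i y_j K_ij
    alpha = np.zeros(n)
    for _ in range(iters):
        grad = 1.0 - Q @ alpha / (2 * lam)     # gradient of the dual objective
        alpha = np.clip(alpha + lr * grad, 0.0, 1.0 / n)
    return alpha

# Decision function: f(x) = (1/(2*lam)) * sum_i alpha_i y_i k(x_i, x),
# so K can be a linear Gram matrix X @ X.T or any kernel Gram matrix.
```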

Lecture 12: Reproducing Kernel Hilbert Spaces and Kernel Methods

We first define Hilbert spaces and then introduce the concept of a Reproducing Kernel Hilbert Space (RKHS), which plays an important role in kernel methods.

Definition. A Hilbert space is an inner product space which is also complete and separable¹ with respect to the norm/distance function induced by the inner product. For any $f, g, h \in \mathcal{H}$ and $\alpha \in \mathbb{R}$, $\langle \cdot, \cdot \rangle$ is an inner product if and only if it satisfies the following conditions:

1. $\langle f, g \rangle = \langle g, f \rangle$;
2. $\langle f + g, h \rangle = \langle f, h \rangle + \langle g, h \rangle$ and $\langle \alpha f, g \rangle = \alpha \langle f, g \rangle$;
3. $\langle f, f \rangle \ge 0$ and $\langle f, f \rangle = 0$ if and only if $f = 0$.

The norm/distance induced by the inner product is defined as $\|f\| = \sqrt{\langle f, f \rangle}$ and $\|f - g\| = \sqrt{\langle f - g, f - g \rangle}$. $\langle \cdot, \cdot \rangle$ is called a semi-inner product if the third condition only says $\langle f, f \rangle \ge 0$; in this case, the induced norm is actually a semi-norm. Examples of Hilbert spaces include:

1. $\mathbb{R}^n$ with $\langle a, b \rangle = a^T b$;
2. the space $\ell_2$ of square summable sequences with inner product $\langle x, y \rangle = \sum_{i=1}^\infty x_i y_i$;
3. the space $L_2$ of square integrable functions with inner product $\langle f, g \rangle = \int f(x) g(x)\,dx$.

A closed linear subspace $\mathcal{G}$ of a Hilbert space $\mathcal{H}$ is also a Hilbert space. The distance between an element $f \in \mathcal{H}$ and $\mathcal{G}$ is defined as $\inf_{g \in \mathcal{G}} \|f - g\|$. Since $\mathcal{G}$ is closed, the infimum can be attained and there is $f_{\mathcal{G}} \in \mathcal{G}$ such that $\|f - f_{\mathcal{G}}\| = \inf_{g \in \mathcal{G}} \|f - g\|$. Such an $f_{\mathcal{G}}$ is called the projection of $f$ onto $\mathcal{G}$. It can be shown that $f_{\mathcal{G}}$ is unique, and that $\langle f - f_{\mathcal{G}}, g \rangle = 0$ for all $g \in \mathcal{G}$. The linear subspace $\mathcal{G}^c = \{f : \langle f, g \rangle = 0, \ \forall g \in \mathcal{G}\}$ is called the orthogonal complement of $\mathcal{G}$. It can be shown that $\mathcal{G}^c$ is also closed and $f = f_{\mathcal{G}} + f_{\mathcal{G}^c}$ for any $f \in \mathcal{H}$, where $f_{\mathcal{G}}$ and $f_{\mathcal{G}^c}$ are the projections of $f$ onto $\mathcal{G}$ and $\mathcal{G}^c$. The decomposition $f = f_{\mathcal{G}} + f_{\mathcal{G}^c}$ is called a tensor sum decomposition and is denoted by $\mathcal{H} = \mathcal{G} \oplus \mathcal{G}^c$, $\mathcal{G}^c = \mathcal{H} \ominus \mathcal{G}$, or $\mathcal{G} = \mathcal{H} \ominus \mathcal{G}^c$.

A simple example of such a decomposition is $\mathcal{H} = \mathbb{R}^2$ with $\mathcal{G} = \{(x, 0) : x \in \mathbb{R}\}$ and $\mathcal{G}^c = \{(0, y) : y \in \mathbb{R}\}$. Any element $(x, y) \in \mathcal{H}$ can be decomposed as $(x, y) = (x, 0) + (0, y)$, and this decomposition is unique.

Theorem 12-1 (Riesz). For every continuous linear functional $L$ on a Hilbert space $\mathcal{H}$, there exists a unique $g_L \in \mathcal{H}$ such that $L(f) = \langle g_L, f \rangle$ for all $f \in \mathcal{H}$.

Proof.

Define $\mathcal{N}_L = \{f : L(f) = 0\}$ to be the null space of $L$. Since $L$ is continuous, $\mathcal{N}_L$ is a closed linear subspace. Assume $\mathcal{N}_L \subset \mathcal{H}$ strictly; then there exists a nonzero element $g_0 \in \mathcal{H} \ominus \mathcal{N}_L$. We have

$$(L(f))\, g_0 - (L(g_0))\, f \in \mathcal{N}_L,$$

and thus

$$\langle (L(f))\, g_0 - (L(g_0))\, f, \ g_0 \rangle = 0.$$

Thus we get

$$L(f) = \Big\langle \frac{L(g_0)}{\langle g_0, g_0 \rangle}\, g_0, \ f \Big\rangle.$$

Hence we take $g_L = (L(g_0))\, g_0 / \langle g_0, g_0 \rangle$. If $\mathcal{N}_L = \mathcal{H}$ we simply take $g_L = 0$. If there are two representers $g_L$ and $\tilde{g}_L$ for $L$, then $\langle g_L - \tilde{g}_L, f \rangle = 0$ for any $f \in \mathcal{H}$, and thus $\|g_L - \tilde{g}_L\| = 0$ and $g_L = \tilde{g}_L$. □

¹A vector space $\mathcal{H}$ is complete if every Cauchy sequence in $\mathcal{H}$ converges to an element of $\mathcal{H}$. A sequence satisfying $\lim_{m,n \to \infty} \|f_n - f_m\| = 0$ is called a Cauchy sequence.

1 Reproducing Kernel Hilbert Space

Definition. A function $k : \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$ is a kernel if (1) it is symmetric; (2) it is positive semi-definite, i.e. for any $x_1, \dots, x_n$ the Gram matrix $K$ with $K_{ij} = k(x_i, x_j)$ is positive semi-definite. Properties: (1) $k(x, x) \ge 0$; (2) $k(x, z) \le \sqrt{k(x, x)\, k(z, z)}$.

There are a couple of equivalent ways to define an RKHS.

Definition. $k(\cdot, \cdot)$ is a reproducing kernel of a Hilbert space $\mathcal{H}$ if for all $f \in \mathcal{H}$ we have $f(x) = \langle k(x, \cdot), f(\cdot) \rangle$.

Definition. An RKHS is a Hilbert space $\mathcal{H}$ with a reproducing kernel whose span is dense in $\mathcal{H}$.

An equivalent definition of an RKHS would be "a Hilbert space of functions with all evaluation functionals bounded and linear", or "a Hilbert space of functions with all evaluation functionals continuous".

Theorem 12-2 (Mercer). Let $(\mathcal{X}, \mu)$ be a finite measure space and $k \in L_\infty(\mathcal{X} \times \mathcal{X}, \mu \times \mu)$ be a kernel such that $T_k : L_2(\mathcal{X}, \mu) \mapsto L_2(\mathcal{X}, \mu)$ is positive definite, i.e. $\int\!\!\int k(x, z) f(x) f(z) \, d\mu(x)\, d\mu(z) \ge 0$ for all $f \in L_2(\mathcal{X}, \mu)$. Let $\phi_i \in L_2(\mathcal{X}, \mu)$ be the normalized eigenfunctions of $T_k$ associated with the eigenvalues $\lambda_i \ge 0$. Then

(1) the eigenvalues $\{\lambda_i\}_{i=1}^\infty$ are absolutely summable;
(2) $k(x, z) = \sum_{i=1}^\infty \lambda_i \phi_i(x) \phi_i(z)$, where the series converges absolutely and uniformly.

We can construct an RKHS as the completion of the span of the eigenfunctions defined by the kernel:

$$\mathcal{H} = \Big\{ f : f(x) = \sum_i \alpha_i \phi_i(x) \ \text{ s.t. } \ \|f\|_{\mathcal{H}} < \infty \Big\}.$$

Given $f = \sum_i \alpha_i \phi_i$ and $g = \sum_i \beta_i \phi_i$, the inner product and the norm induced by the inner product are defined as

$$\langle f, g \rangle_{\mathcal{H}} = \Big\langle \sum_i \alpha_i \phi_i(x), \sum_i \beta_i \phi_i(x) \Big\rangle_{\mathcal{H}} = \sum_i \frac{\alpha_i \beta_i}{\lambda_i}$$

and

$$\|f\|_{\mathcal{H}}^2 = \Big\langle \sum_i \alpha_i \phi_i(x), \sum_i \alpha_i \phi_i(x) \Big\rangle_{\mathcal{H}} = \sum_i \frac{\alpha_i^2}{\lambda_i}.$$

It is easy to see that the reproducing property holds:

$$\langle f(\cdot), k(\cdot, x) \rangle_{\mathcal{H}} = \Big\langle \sum_i \alpha_i \phi_i(\cdot), \sum_i \lambda_i \phi_i(x) \phi_i(\cdot) \Big\rangle_{\mathcal{H}} = \sum_i \frac{\alpha_i \lambda_i \phi_i(x)}{\lambda_i} = f(x).$$

The RKHS concept can be utilized in SVM and other kernel machines, which is known as the kernel trick. Given the eigenvalues $\lambda_i$ and eigenfunctions $\phi_i$ of a reproducing kernel $k(\cdot, \cdot)$, we can map $x \in \mathbb{R}^p$ into a higher dimensional feature space:

$$x \mapsto \Phi(x) = \big( \sqrt{\lambda_1}\,\phi_1(x), \dots, \sqrt{\lambda_i}\,\phi_i(x), \dots \big).$$

The dimensionality of the feature vector $\Phi(x)$ equals the number of nonzero eigenvalues of $k(\cdot, \cdot)$, which could be infinite. By Mercer's theorem, the standard $\ell_2$ inner product between any two feature vectors $\Phi(x)$ and $\Phi(z)$ can be computed by the reproducing kernel, since

$$k(x, z) = \langle \Phi(x), \Phi(z) \rangle_{\ell_2}.$$
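A quick numerical illustration of this identity: for the degree-2 polynomial kernel $k(x, z) = (x^T z)^2$ on $\mathbb{R}^2$, the feature map can be written out explicitly as a finite vector, and the $\ell_2$ inner product of the features matches the kernel value. The points below are made up for the check:

```python
import numpy as np

def phi(x):
    """Explicit feature map for k(x, z) = (x^T z)^2 on R^2:
    (x^T z)^2 = x1^2 z1^2 + x2^2 z2^2 + 2 x1 x2 z1 z2."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print((x @ z) ** 2)          # kernel value: (3 - 2)^2 = 1
print(phi(x) @ phi(z))       # same number via the ell_2 inner product
```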

2 Representer Theorem

Theorem 12-3 (Representer). Given a reproducing kernel $k$, let $\mathcal{H}$ be the corresponding RKHS. Then for a function $L : \mathbb{R}^n \mapsto \mathbb{R}$ and a non-decreasing function $\Omega : \mathbb{R} \mapsto \mathbb{R}$, the solution of the optimization problem

$$\min_{f \in \mathcal{H}} J(f) = \min_{f \in \mathcal{H}} \Big\{ L(f(x_1), \dots, f(x_n)) + \Omega(\|f\|_{\mathcal{H}}^2) \Big\}$$

can be expressed as

$$f^* = \sum_{i=1}^n \alpha_i k(x_i, \cdot).$$

Furthermore, if $\Omega(\cdot)$ is strictly increasing, then all solutions have this form.

Proof. Define the subspace $\mathcal{G}$ to be the span of $\{k(x_i, \cdot), \ 1 \le i \le n\}$. Decompose $f$ as $f = f_{\mathcal{G}} + f_{\mathcal{G}^c}$. We have

$$\|f\|_{\mathcal{H}}^2 = \|f_{\mathcal{G}}\|_{\mathcal{H}}^2 + \|f_{\mathcal{G}^c}\|_{\mathcal{H}}^2$$

by the orthogonality of $\mathcal{G}$ and $\mathcal{G}^c$. Since $\Omega$ is non-decreasing, we have

$$\Omega(\|f\|_{\mathcal{H}}^2) \ge \Omega(\|f_{\mathcal{G}}\|_{\mathcal{H}}^2).$$

On the other hand, since the kernel $k$ has the reproducing property, we have

$$f(x_i) = \langle f, k(x_i, \cdot) \rangle = \langle f_{\mathcal{G}}, k(x_i, \cdot) \rangle + \langle f_{\mathcal{G}^c}, k(x_i, \cdot) \rangle = \langle f_{\mathcal{G}}, k(x_i, \cdot) \rangle = f_{\mathcal{G}}(x_i).$$

So this implies that $L(f(x_1), \dots, f(x_n)) = L(f_{\mathcal{G}}(x_1), \dots, f_{\mathcal{G}}(x_n))$, i.e. the first component of the optimization objective depends only on the projection of $f$ onto $\mathcal{G}$, the span of the $k(x_i, \cdot)$'s. Since $\Omega(\|f\|_{\mathcal{H}}^2) \ge \Omega(\|f_{\mathcal{G}}\|_{\mathcal{H}}^2)$, the minimizer can be expressed as $f^*(\cdot) = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$. If $\Omega(\cdot)$ is strictly increasing, then $f_{\mathcal{G}^c}$ must be zero and all minimizers must take the above form. □
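A classical instance of the representer theorem is kernel ridge regression: taking $L$ to be the averaged squared loss and $\Omega(t) = \lambda t$ (an assumed objective $\frac{1}{n}\sum_i (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2$, chosen here for illustration), plugging $f = \sum_i \alpha_i k(x_i, \cdot)$ into the objective yields the linear system $(K + n\lambda I)\alpha = y$. A minimal sketch; the RBF kernel choice and all parameter values are illustrative:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gram matrix K_ij = exp(-gamma * ||a_i - b_j||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_ridge_fit(X, y, lam=0.1, gamma=1.0):
    """Solve min_f (1/n) sum_i (y_i - f(x_i))^2 + lam * ||f||_H^2.
    By the representer theorem f*(.) = sum_i alpha_i k(x_i, .),
    and plugging in gives alpha = (K + n*lam*I)^{-1} y."""
    n = len(y)
    K = rbf_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return alpha

def kernel_ridge_predict(alpha, X_train, X_new, gamma=1.0):
    """f(x) = sum_i alpha_i k(x_i, x) evaluated at the new points."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```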

Examples of Kernels

Some simple examples of kernels:

• Linear kernel: $k(x, z) = x^T z$, or more generally $k(x, z) = x^T B z$ for $B \succeq 0$.
• Polynomial kernel: $k(x, z) = (x^T z + c)^d$, where $c \ge 0$ and $d \in \mathbb{N}_+$.
• RBF kernel: $k(x, z) = \exp(-\gamma \|x - z\|^2)$.

We can also construct kernels from simple ones. For instance, the following are kernels, as each $k(\cdot, \cdot)$ can be shown to satisfy the conditions of a kernel (a numerical sanity check follows the list):

• $k(x, z) = \sum_i \alpha_i k_i(x, z)$, where $\alpha_i \ge 0$ and the $k_i(\cdot, \cdot)$ are kernels;
• $k(x, z) = k_1(x, z)\, k_2(x, z)$;
• $k(x, z) = \exp(k_1(x, z))$;
• $k(x, z) = P(k_1(x, z))$, where $P(t)$ is a polynomial in $t$ with nonnegative coefficients.
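These construction rules are easy to sanity-check numerically: each combination should again yield a positive semi-definite Gram matrix. A small sketch on random data (all kernel parameters and the data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

def min_eig(K):
    """Smallest eigenvalue of a symmetric Gram matrix;
    nonnegative (up to rounding error) iff the matrix is PSD."""
    return np.linalg.eigvalsh(K).min()

lin = X @ X.T                                    # linear kernel
poly = (X @ X.T + 1.0) ** 3                      # polynomial kernel, c=1, d=3
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
rbf = np.exp(-0.5 * sq)                          # RBF kernel, gamma=0.5

for name, K in [("linear", lin), ("poly", poly), ("rbf", rbf),
                ("sum", 2 * lin + poly),         # nonnegative-combination rule
                ("product", lin * rbf),          # product rule
                ("exp", np.exp(lin))]:           # exponential rule
    print(name, min_eig(K) >= -1e-8)             # expect True for all
```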

3 Rademacher Average

We next present a result which gives an upper bound on the Rademacher average of a function class which is a ball in an RKHS. Consider the following learning problem:

$$\min_{f \in \mathcal{H}} \ \frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}},$$

where $\mathcal{H}$ is an RKHS with kernel $k$. The optimization problem is equivalent to

$$\min_{f \in \mathcal{H},\ \|f\|_{\mathcal{H}} \le t} \ \frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i))$$

for some properly chosen $t > 0$. So we would like to investigate the Rademacher average of the function class $\mathcal{F}_t = \{f : \|f\|_{\mathcal{H}} \le t\}$.

Theorem. Let $\mathcal{H}$ be an RKHS with kernel $k$, and let $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = k(x_i, x_j)$. Define $\mathcal{F}_t = \{f \in \mathcal{H} : \|f\|_{\mathcal{H}} \le t\}$. Then

$$\hat{\mathcal{R}}_n(\mathcal{F}_t) := \mathbb{E}\Big[ \sup_{f \in \mathcal{F}_t} \frac{1}{n} \sum_{i=1}^n \epsilon_i f(x_i) \ \Big|\ X_1, \dots, X_n \Big] \le \frac{t}{n} \sqrt{\mathrm{trace}(K)}$$

and

$$\mathcal{R}_n(\mathcal{F}_t) \le \frac{t}{\sqrt{n}} \sqrt{\sum_{i=1}^\infty \lambda_i},$$

where the $\lambda_i$'s are the eigenvalues of the operator $T_k : f \mapsto \int k(\cdot, x) f(x) \, dP(x)$.

Proof. By the reproducing property we have

$$\sup_{f \in \mathcal{F}_t} \frac{1}{n} \sum_{i=1}^n \epsilon_i f(x_i) = \sup_{f \in \mathcal{F}_t} \frac{1}{n} \sum_{i=1}^n \epsilon_i \langle k(x_i, \cdot), f \rangle = \sup_{\|f\|_{\mathcal{H}} \le t} \Big\langle \frac{1}{n} \sum_{i=1}^n \epsilon_i k(x_i, \cdot), \ f \Big\rangle = t \, \Big\| \frac{1}{n} \sum_{i=1}^n \epsilon_i k(x_i, \cdot) \Big\|_{\mathcal{H}} = t \sqrt{\frac{1}{n^2} \sum_{i,j=1}^n \epsilon_i \epsilon_j k(x_i, x_j)}.$$

Therefore we have

$$\hat{\mathcal{R}}_n(\mathcal{F}_t) = \mathbb{E}\Big[ \frac{t}{n} \sqrt{\sum_{i,j=1}^n \epsilon_i \epsilon_j k(x_i, x_j)} \ \Big|\ X_1, \dots, X_n \Big] \le \frac{t}{n} \sqrt{\mathbb{E}\Big[ \sum_{i,j=1}^n \epsilon_i \epsilon_j k(x_i, x_j) \ \Big|\ X_1, \dots, X_n \Big]} = \frac{t}{n} \sqrt{\sum_{i=1}^n k(x_i, x_i)} = \frac{t}{n} \sqrt{\mathrm{trace}(K)},$$

where we used Jensen's inequality together with the properties $\mathbb{E}[\epsilon_i] = 0$ and $\mathbb{V}[\epsilon_i] = 1$. Since $k(x, x) = \sum_{i=1}^\infty \lambda_i \phi_i(x) \phi_i(x)$, where the $\phi_i$'s form an orthonormal basis, we have

$$\mathcal{R}_n(\mathcal{F}_t) = \mathbb{E}[\hat{\mathcal{R}}_n(\mathcal{F}_t)] \le \frac{t}{\sqrt{n}} \, \mathbb{E}\sqrt{\frac{1}{n} \sum_{i=1}^n k(x_i, x_i)} \le \frac{t}{\sqrt{n}} \sqrt{\mathbb{E}[k(X, X)]} \le \frac{t}{\sqrt{n}} \sqrt{\sum_{i=1}^\infty \lambda_i}. \qquad \square$$
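The proof shows that the supremum over the ball equals $\frac{t}{n}\sqrt{\epsilon^T K \epsilon}$ exactly, so the empirical Rademacher average can be estimated by Monte Carlo and compared with the trace bound. A small sketch (the kernel, $t$, and all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, gamma = 50, 1.0, 0.5
X = rng.normal(size=(n, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)                          # RBF Gram matrix

# By the proof, sup_{||f||_H <= t} (1/n) sum_i eps_i f(x_i)
# equals (t/n) * sqrt(eps^T K eps), so we can Monte-Carlo it exactly.
eps = rng.choice([-1.0, 1.0], size=(2000, n))    # Rademacher draws
sups = (t / n) * np.sqrt(np.einsum('si,ij,sj->s', eps, K, eps))
print("empirical Rademacher average:", sups.mean())
print("trace bound (t/n)*sqrt(tr K):", (t / n) * np.sqrt(np.trace(K)))
```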

Lecture 13: Ensemble Learning and Boosting

In ensemble learning, we obtain a final classifier by combining a set of classifiers. Often, even if the candidate classifiers are weak (meaning they do only slightly better than random guessing), the final classifier can still have very good performance. In most cases, the final classifier is a weighted combination of weak learners, where the weights can be determined in different ways. We will focus on boosting algorithms (AdaBoost, in particular), which have been shown to perform well in practice.

1 Adaptive Boosting (AdaBoost)

Let $H$ be a function class and $(x_1, y_1), \dots, (x_n, y_n)$ be a set of training examples. $H$ is often called the set of weak learners, such as linear threshold functions, decision trees, decision stumps (a simple decision tree with only one node, such as $h(x) = I(x_j \ge 3)$), a set of simple decision functions, etc. The AdaBoost algorithm works as follows.

1. Start with a uniform weight distribution

$$D_1(i) = \frac{1}{n}, \quad i = 1, \dots, n,$$

and let the initial classifier be $F_0(x) = 0$.

2. For $t = 1, \dots, T$:

(a) Choose ft ∈ H to (approximately) minimize the weighted classification error:

$$\epsilon_t = \sum_{i=1}^n D_t(i)\, I(y_i \ne f_t(x_i)).$$

(b) Let $\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$ and update $F_t = F_{t-1} + \alpha_t f_t$.

(c) Update

$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} \exp(\alpha_t) & y_i \ne f_t(x_i) \\ \exp(-\alpha_t) & \text{otherwise,} \end{cases}$$

where we will see later that $Z_t = 2\sqrt{\epsilon_t(1 - \epsilon_t)}$.

3. Output $F_T$. (A minimal implementation sketch follows.)
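As noted after step 3, here is a minimal implementation sketch of the algorithm above, using exhaustive search over decision stumps as the weak learners; the stump parameterization and all names are illustrative:

```python
import numpy as np

def stump_predict(X, j, thresh, sign):
    """Decision stump: sign * (2*I(x_j >= thresh) - 1), values in {-1, +1}."""
    return sign * (2.0 * (X[:, j] >= thresh) - 1.0)

def adaboost(X, y, T=20):
    """AdaBoost with decision stumps, following steps 1-3 above."""
    n, p = X.shape
    D = np.full(n, 1.0 / n)                    # step 1: uniform weights
    learners = []
    for _ in range(T):                         # step 2
        best = None
        for j in range(p):                     # (a) pick the stump minimizing
            for thresh in np.unique(X[:, j]):  #     the weighted 0/1 error
                for sign in (+1.0, -1.0):
                    pred = stump_predict(X, j, thresh, sign)
                    err = D[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thresh, sign, pred)
        eps, j, thresh, sign, pred = best
        eps = np.clip(eps, 1e-10, 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * np.log((1 - eps) / eps)  # (b)
        D = D * np.exp(-alpha * y * pred)      # (c): up-weight mistakes
        D /= D.sum()                           # normalize by Z_t
        learners.append((alpha, j, thresh, sign))
    return learners

def adaboost_predict(learners, X):
    F = sum(a * stump_predict(X, j, t, s) for a, j, t, s in learners)
    return np.sign(F)                          # step 3: output sign(F_T)
```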

The next theorem shows that AdaBoost essentially minimizes the exponential loss of the training examples by a greedy search.

Theorem. The empirical training error of the final classifier $F_T(\cdot)$ is upper bounded by $\prod_{t=1}^T 2\sqrt{\epsilon_t(1 - \epsilon_t)}$. Furthermore, if $\epsilon_t \le 1/2 - \gamma$ for all $t = 1, \dots, T$, then the training error is further upper bounded by $(1 - 4\gamma^2)^{T/2}$.

Proof.

The first part can be shown as follows. Since $y_i F_T(x_i) \le 0$ implies $\exp(-y_i F_T(x_i)) \ge 1$, the training error is upper bounded as follows:

$$\frac{1}{n} \sum_{i=1}^n I(y_i \ne F_T(x_i)) = \frac{1}{n} \sum_{i=1}^n I(y_i F_T(x_i) \le 0) \le \frac{1}{n} \sum_{i=1}^n \exp(-y_i F_T(x_i)) = \frac{1}{n} \sum_{i=1}^n \exp\Big(-y_i \sum_{t=1}^T \alpha_t f_t(x_i)\Big) = \frac{1}{n} \sum_{i=1}^n \prod_{t=1}^T \exp(-y_i \alpha_t f_t(x_i)).$$

Since $y_i, f_t(x_i) \in \{\pm 1\}$, their product is also $\pm 1$. By the definition of $D_{t+1}(i)$ we have $\exp(-y_i \alpha_t f_t(x_i)) \cdot D_t(i) = D_{t+1}(i)\, Z_t$. Thus we have

$$\frac{1}{n} \sum_{i=1}^n \prod_{t=1}^T \exp(-y_i \alpha_t f_t(x_i)) = \frac{1}{n} \sum_{i=1}^n \prod_{t=1}^T \frac{D_{t+1}(i)}{D_t(i)} Z_t = \frac{1}{n} \sum_{i=1}^n \frac{D_{T+1}(i)}{D_1(i)} \prod_{t=1}^T Z_t = \prod_{t=1}^T Z_t, \qquad (1)$$

where the last step comes from the facts that $D_1(i) = 1/n$ and $\sum_{i=1}^n D_{T+1}(i) = 1$ because it is normalized. Now we choose $\alpha_t$ to minimize $Z_t$. By definition we have

$$Z_t = \sum_{i: y_i = f_t(x_i)} D_t(i) \exp(-\alpha_t) + \sum_{i: y_i \ne f_t(x_i)} D_t(i) \exp(\alpha_t) = (1 - \epsilon_t) \exp(-\alpha_t) + \epsilon_t \exp(\alpha_t),$$

and we obtain $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$ by setting $\partial Z_t / \partial \alpha_t = 0$. Plugging in, we have

$$Z_t = (1 - \epsilon_t) \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} + \epsilon_t \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}} = 2\sqrt{\epsilon_t(1 - \epsilon_t)}.$$

The second part then follows since if $\epsilon_t \le \frac{1}{2} - \gamma$ for all $t$, then

$$\prod_{t=1}^T 2\sqrt{\epsilon_t(1 - \epsilon_t)} \le \prod_{t=1}^T 2\sqrt{1/4 - \gamma^2} = (1 - 4\gamma^2)^{T/2}.$$

Furthermore, if we require $(1 - 4\gamma^2)^{T/2} \le \delta$, then for $T \ge \frac{\log(1/\delta)}{2\gamma^2}$ the training error will be upper bounded by $\delta$. □

2 Alternative Interpretation of AdaBoost

Consider the upper bound of training error

$$\frac{1}{n} \sum_{i=1}^n \exp(-y_i F_T(x_i)) = \frac{1}{n} \sum_{i=1}^n \prod_{t=1}^T \exp(-y_i \alpha_t f_t(x_i)) = \frac{1}{n} \sum_{i=1}^n \prod_{t=1}^T \frac{D_{t+1}(i)}{D_t(i)} Z_t = \frac{1}{n} \sum_{i=1}^n \frac{D_{T+1}(i)}{D_1(i)} \prod_{t=1}^T Z_t = \prod_{t=1}^T Z_t.$$

Assume that we have already fixed $\alpha_1, \dots, \alpha_{t-1}$ and $f_1, \dots, f_{t-1}$ in the first $t - 1$ steps. In the $t$-th step, we have

$$\frac{1}{n} \sum_{i=1}^n \exp(-y_i F_t(x_i)) = \frac{1}{n} \sum_{i=1}^n \exp\big(-y_i (F_{t-1}(x_i) + \alpha_t f_t(x_i))\big)$$
$$= \frac{1}{n} \sum_{i=1}^n \big[ (\exp(\alpha_t) - \exp(-\alpha_t))\, I(y_i \ne f_t(x_i)) + \exp(-\alpha_t) \big] \exp(-y_i F_{t-1}(x_i))$$
$$= \frac{\exp(-\alpha_t)}{n} \sum_{i=1}^n \exp(-y_i F_{t-1}(x_i)) + (\exp(\alpha_t) - \exp(-\alpha_t)) \cdot \frac{1}{n} \sum_{i=1}^n I(y_i \ne f_t(x_i)) \exp(-y_i F_{t-1}(x_i)).$$

Since by equation (1) we have

$$\frac{1}{n} \exp(-y_i F_{t-1}(x_i)) = D_t(i) \prod_{s=1}^{t-1} Z_s,$$

plugging in we have

$$\frac{1}{n} \sum_{i=1}^n \exp(-y_i F_t(x_i)) = \exp(-\alpha_t) \prod_{s=1}^{t-1} Z_s + (\exp(\alpha_t) - \exp(-\alpha_t)) \prod_{s=1}^{t-1} Z_s \cdot \sum_{i=1}^n I(y_i \ne f_t(x_i))\, D_t(i).$$

As a result, for any $\alpha_t > 0$ this quantity is minimized with respect to $f_t$ if $f_t$ minimizes $\sum_{i=1}^n D_t(i)\, I(y_i \ne f_t(x_i))$, the weighted training error. Given $f_t$, the $\alpha_t$ which minimizes the empirical exponential loss $\frac{1}{n} \sum_{i=1}^n \exp(-y_i F_t(x_i))$ is the same $\alpha_t$ which minimizes $Z_t$, i.e. $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$.

We can see that AdaBoost is essentially a greedy algorithm which tries to minimize the empirical exponential loss by coordinate descent: it alternately minimizes with respect to $f_t$ and $\alpha_t$ in each iteration. Furthermore, we can generalize AdaBoost to other loss functions from this viewpoint. Given any convex loss function $\phi(\cdot)$ other than the exponential loss (such as the logistic loss or the squared loss), we can use a gradient descent method to minimize

$$\hat{R}_\phi(F_{t-1} + \alpha_t f_t) = \frac{1}{n} \sum_{i=1}^n \phi\big(y_i (F_{t-1}(x_i) + \alpha_t f_t(x_i))\big)$$

with respect to $f_t$ and $\alpha_t$. Gradient descent will choose a direction $d = [f_t(x_1), \dots, f_t(x_n)]$ which is the negative gradient of $\hat{R}_\phi(F_{t-1} + z)$ at $z = 0$. Since the gradient of $\hat{R}_\phi(F_{t-1} + z)$ at $z = 0$ is

$$\Big[ \frac{1}{n} \phi'(y_1 F_{t-1}(x_1))\, y_1, \ \dots, \ \frac{1}{n} \phi'(y_n F_{t-1}(x_n))\, y_n \Big],$$

essentially we want to find $d$ which minimizes

$$d^T \big[ \phi'(y_1 F_{t-1}(x_1))\, y_1, \dots, \phi'(y_n F_{t-1}(x_n))\, y_n \big] = \sum_{i=1}^n \phi'(y_i F_{t-1}(x_i))\, y_i f_t(x_i).$$

Define $a_i = \phi'(y_i F_{t-1}(x_i))$, which is a constant. In terms of minimization, we have the following equivalences:

$$\min_{f_t} \sum_{i=1}^n a_i y_i f_t(x_i) \iff \min_{f_t} \sum_{i=1}^n (-a_i)(-y_i f_t(x_i)) \iff \min_{f_t} \sum_{i=1}^n \frac{(-a_i)(-y_i f_t(x_i)) + (-a_i)}{2} \iff \min_{f_t} \sum_{i=1}^n I(y_i \ne f_t(x_i))\, (-a_i).$$

So we can obtain $f_t$ by minimizing

$$\sum_{i=1}^n I(y_i \ne f_t(x_i))\, D_t(i)$$

with respect to $f_t(\cdot)$, where $D_t(i) = \frac{-\phi'(y_i F_{t-1}(x_i))}{Z_t}$. This generalizes AdaBoost from the exponential loss $\phi(z) = \exp(-z)$ to other loss functions.
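The reweighting rule at the end is a one-liner for any differentiable loss. A small sketch computing $D_t(i) \propto -\phi'(y_i F_{t-1}(x_i))$ for the exponential and logistic losses (the function name and signature are illustrative):

```python
import numpy as np

def boosting_weights(F_prev, y, loss="exp"):
    """Weights D_t(i) proportional to -phi'(y_i * F_{t-1}(x_i)), as in the
    gradient view above; the weak learner f_t is then chosen to minimize
    sum_i D_t(i) I(y_i != f_t(x_i))."""
    m = y * F_prev                               # margins y_i F_{t-1}(x_i)
    if loss == "exp":                            # phi(z) = exp(-z): AdaBoost
        w = np.exp(-m)
    elif loss == "logistic":                     # phi(z) = log(1 + exp(-z))
        w = 1.0 / (1.0 + np.exp(m))              # -phi'(z) = 1/(1 + e^z)
    else:
        raise ValueError("unknown loss")
    return w / w.sum()                           # normalize (the Z_t factor)
```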

Lecture 14: Maximum Likelihood Estimation

In this lecture we consider one of the most popular approaches in statistics: maximum likelihood estimation (MLE). In order to apply MLE, we need to make stronger assumptions about the distribution of $(X, Y)$. Often such assumptions are reasonable in practical applications.

1 Maximum Likelihood Estimation

Consider the model

$$y_i = h(x_i) + \epsilon_i, \quad i = 1, \dots, n,$$

where we assume the $\epsilon_i$'s are i.i.d. zero-mean noise variables. Note that both classification and regression can be represented in this way. Furthermore, we assume that the errors have a distribution $F(\mu)$ with mean $\mu = 0$. Then we have $y_i \sim F(h(x_i))$.

Assuming that the probability density function of $y_i$ is $p_h(y_i)$, we can write down the joint likelihood as

$$\prod_{i=1}^n p_h(y_i).$$

The MLE estimator seeks the model which maximizes the likelihood, or equivalently, minimizes the negative log-likelihood. This is reasonable since the MLE estimator is the most probable explanation for the observed data. Formally, let $\Theta$ be a parameter space, and assume that we have the model

$$y_i \sim p_{\theta^*}(y), \quad i = 1, \dots, n$$

for i.i.d. observations $y_1, \dots, y_n$, where $\theta^* \in \Theta$ is the true parameter. Here we do not include the covariates $x_i$ for simplicity; it is straightforward to include them in the model. The MLE of $\theta^*$ is

$$\hat{\theta}_n = \arg\max_{\theta \in \Theta} \prod_{i=1}^n p_\theta(y_i) = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n -\log p_\theta(y_i).$$

By the strong law of large numbers we have

$$-\frac{1}{n} \sum_{i=1}^n \log p_\theta(y_i) \ \to_{a.s.} \ \mathbb{E}[-\log p_\theta(Y)].$$

So we can think of maximum likelihood as trying to minimize $\mathbb{E}[-\log p_\theta(Y)]$. On the other hand, consider the quantity

$$\mathbb{E}[\log p_{\theta^*}(Y) - \log p_\theta(Y)] = \mathbb{E}\Big[ \log \frac{p_{\theta^*}(Y)}{p_\theta(Y)} \Big] = \int \log \frac{p_{\theta^*}(y)}{p_\theta(y)} \, p_{\theta^*}(y) \, dy = \mathrm{KL}(p_\theta, p_{\theta^*}) \ge 0,$$

where $\mathrm{KL}(q, p)$ is the KL-divergence between two distributions $q$ and $p$ (in these notes the expectation is taken under the second argument $p$). Although not a distance measure (it is not symmetric), the KL-divergence measures the discrepancy between the two distributions. Also note that the last inequality becomes an equality if and only if $p_\theta = p_{\theta^*}$. This is because

$$\mathrm{KL}(q, p) = \mathbb{E}_p\Big[ \log \frac{p}{q} \Big] = -\mathbb{E}_p\Big[ \log \frac{q}{p} \Big] \ge -\log \mathbb{E}_p\Big[ \frac{q}{p} \Big] = 0.$$

By Jensen's inequality, the equality happens if and only if $p(x) = q(x)$ for all $x$. So we can see that if we minimize $\mathbb{E}[-\log p_\theta(Y)]$, the minimum it can achieve is $\mathbb{E}[-\log p_{\theta^*}(Y)]$, and it achieves this minimum when $\theta = \theta^*$, the true parameter value we want to find.

It is easy to see that MLE can be thought of as a special case of empirical risk minimization, where the loss function is simply the negative log-likelihood: $\ell(\theta, y_i) = -\log p_\theta(y_i)$. Another observation is that minimizing the negative log-likelihood results in the least squares estimator if the errors follow a normal distribution. The empirical risk is

$$\hat{R}_n(\theta) = -\frac{1}{n} \sum_{i=1}^n \log p_\theta(y_i)$$

and the risk is $R(\theta) = \mathbb{E}[\ell(\theta, Y)] = \mathbb{E}[-\log p_\theta(Y)]$. The excess risk of $\theta$ is

$$R(\theta) - R(\theta^*) = \mathbb{E}[-\log p_\theta(Y) + \log p_{\theta^*}(Y)] = \mathrm{KL}(p_\theta, p_{\theta^*}),$$

the KL-divergence between $p_\theta$ and $p_{\theta^*}$.
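As a concrete instance of the Gaussian-error remark above: with $y_i = \theta^* + \epsilon_i$ and standard normal errors, the negative log-likelihood is the squared error up to constants, so the MLE is the sample mean. A small numerical sketch (all values are illustrative):

```python
import numpy as np

# With y_i = theta* + eps_i, eps_i ~ N(0, 1), minimizing the negative
# log-likelihood  (1/n) sum_i [ (y_i - theta)^2 / 2 + log sqrt(2 pi) ]
# is exactly least squares, and the minimizer is the sample mean.
rng = np.random.default_rng(0)
theta_star = 1.5
y = theta_star + rng.normal(size=1000)

thetas = np.linspace(0.0, 3.0, 601)
nll = np.array([0.5 * np.mean((y - th) ** 2) for th in thetas])  # up to constants
theta_hat = thetas[np.argmin(nll)]
print(theta_hat, y.mean())     # both are close to theta* = 1.5
```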

2 Hellinger Distance and Consistency of MLE

We define the Hellinger distance between two distributions p and q as

$$h(p, q) = \sqrt{ \frac{1}{2} \int \big( \sqrt{p(x)} - \sqrt{q(x)} \big)^2 \, dx }.$$

It is easy to see that it is always nonnegative, symmetric, and satisfies the triangle inequality. Furthermore, $h(p, q) = 0$ if and only if $p = q$. The Hellinger distance plays an important role in studying MLE because it can be upper bounded by the KL-divergence, as shown in the following lemma.

Lemma. We have $h^2(q, p) \le \frac{1}{2} \mathrm{KL}(q, p)$.

Proof. Using the fact that $\frac{1}{2} \log v \le v^{1/2} - 1$ for all $v > 0$, we have

$$\frac{1}{2} \log \frac{q(x)}{p(x)} \le \sqrt{\frac{q(x)}{p(x)}} - 1,$$

and thus

$$\frac{1}{2} \mathrm{KL}(q, p) \ge 1 - \mathbb{E}_p\Big[ \sqrt{\frac{q(X)}{p(X)}} \Big].$$

The result follows since

$$1 - \mathbb{E}_p\Big[ \sqrt{\frac{q(X)}{p(X)}} \Big] = 1 - \int \sqrt{q(x)}\, \sqrt{p(x)} \, dx = \frac{1}{2} \int p(x) \, dx + \frac{1}{2} \int q(x) \, dx - \int \sqrt{q(x)}\, \sqrt{p(x)} \, dx = \frac{1}{2} \int \big( \sqrt{p(x)} - \sqrt{q(x)} \big)^2 \, dx = h^2(p, q) = h^2(q, p). \qquad \square$$
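The lemma can be checked numerically for simple densities. The sketch below integrates on a grid and uses the notes' convention $\mathrm{KL}(q, p) = \mathbb{E}_p[\log(p/q)]$; the Gaussian parameters are illustrative:

```python
import numpy as np

# Numerical check of h^2(q, p) <= KL(q, p) / 2 for two Gaussian densities.
x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]

def gauss(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

p, q = gauss(x, 0.0, 1.0), gauss(x, 1.0, 1.5)
h2 = 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx   # squared Hellinger
kl = np.sum(p * np.log(p / q)) * dx                      # KL(q, p), notes' convention
print(h2, kl / 2, h2 <= kl / 2)                          # the inequality holds
```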

This lemma says that convergence in KL-divergence leads to convergence in Hellinger distance. So if we can establish convergence in KL-divergence, then the consistency of the MLE can be proven.

The convergence of the KL-divergence can be seen as follows. Since $\hat{\theta}_n$ maximizes the likelihood over $\theta \in \Theta$, we have

$$\sum_{i=1}^n \log \frac{p_{\theta^*}(y_i)}{p_{\hat{\theta}_n}(y_i)} = \sum_{i=1}^n \log p_{\theta^*}(y_i) - \sum_{i=1}^n \log p_{\hat{\theta}_n}(y_i) \le 0.$$

Thus

$$\frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta^*}(y_i)}{p_{\hat{\theta}_n}(y_i)} - \mathrm{KL}(p_{\hat{\theta}_n}, p_{\theta^*}) + \mathrm{KL}(p_{\hat{\theta}_n}, p_{\theta^*}) \le 0.$$

So we have

$$\mathrm{KL}(p_{\hat{\theta}_n}, p_{\theta^*}) \le \Big| \frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta^*}(y_i)}{p_{\hat{\theta}_n}(y_i)} - \mathrm{KL}(p_{\hat{\theta}_n}, p_{\theta^*}) \Big|.$$

If the law of large numbers holds uniformly, we have

$$\sup_{\theta \in \Theta} \Big| \frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta^*}(y_i)}{p_\theta(y_i)} - \mathrm{KL}(p_\theta, p_{\theta^*}) \Big| \to 0,$$

which can be applied to $\hat{\theta}_n$ as well. As a result, the convergence of the KL-divergence can be established.
