Advanced Topics in Learning and Vision

Ming-Hsuan Yang [email protected]

Lecture 6 (draft)

Announcements

• More course material available on the course web page

• Project web pages: Everyone needs to set one up (format details will be available soon).

• Reading (due Nov 1):
  - RVM application to 3D human pose estimation [1].
  - Viola and Jones: AdaBoost-based real-time face detector [25].
  - Viola et al.: AdaBoost-based real-time pedestrian detector [26].

A. Agarwal and B. Triggs. 3D human pose from silhouettes by relevance vector regression. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 882–888, 2004.

P. Viola and M. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.

P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. In Proceedings of the Ninth IEEE International Conference on Computer Vision, volume 2, pages 734–741, 2003.

Overview

• Linear classifier: Fisher linear discriminant (FLD), linear support vector machine (SVM).

• Nonlinear classifier: nonlinear support vector machine.

• SVM regression, relevance vector machine (RVM).

• Kernel methods: kernel principal component analysis, kernel discriminant analysis.

• Ensemble of classifiers: Adaboost, bagging, ensemble of homogeneous/heterogeneous classifiers.

Fisher Linear Discriminant

• Based on Gaussian class-conditional densities: between-class and within-class scatter matrices, defined for two or more classes.

• Trick often used in ridge regression: regularize the within-class scatter, S_w′ = S_w + λI, so FLD can be applied without first using PCA (a minimal two-class sketch follows the references below).

• Fisherfaces vs. Eigenfaces (FLD vs. PCA).

• Further study: Aleix Martinez’s paper on analyzing FLD for object recognition [10][11].

A. M. Martinez and A. Kak. PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):228–233, 2001.

A. M. Martinez and M. Zhu. Where are linear feature extraction methods applicable? IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1934–1944, 2005.
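As a hedged illustration (my own sketch, not from the notes), the two-class FLD direction can be computed directly from the class means and the regularized within-class scatter S_w + λI mentioned above:

```python
# A minimal two-class FLD sketch (illustration only, not the lecture's code),
# using the ridge-style regularization S_w' = S_w + lambda*I mentioned above.
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))   # class 1 samples (toy data)
X2 = rng.normal([3.0, 3.0], 1.0, size=(50, 2))   # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)        # class means
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter

lam = 1e-3                                       # regularizer: S_w' = S_w + lambda*I
w = np.linalg.solve(Sw + lam * np.eye(2), m1 - m2)        # FLD projection direction
print("projection direction:", w / np.linalg.norm(w))
```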

Intuitive Justification for SVM

Two things come into the picture:

• what kind of function do we use for classification?

• what are the parameters for this function?

Which classifier do you expect to perform well?

Support Vector Machine

• Objective: To find an optimal hyperplane that classifies as many data points correctly as possible while separating the points of the two classes as far as possible.

• Approach: Formulate a constrained optimization problem, then solve it using constrained quadratic programming (constrained QP).

• Theorem: Structural Risk Minimization.

• Issues: VC dimension, linear separability, feature space, multiple classes.

Main Idea

• Given a set of data points belonging to either of two classes, an SVM finds the hyperplane:
  - leaving the largest possible fraction of points of the same class on the same side, and
  - maximizing the distance of either class from the hyperplane.

• Find the optimal separating hyperplane that minimizes the risk of misclassifying the training samples and unseen test samples.

• Question: Why do we need this? Does it work well on unseen test samples (i.e., does it generalize)?

• Answer: Structural Risk Minimization.

• Issues:
  - How do we determine the capacity of a classification model/function?
  - How do we determine the parameters for that model/function?

Vapnik-Chervonenkis (VC) Dimension

• A property of a set of functions (hypothesis space H, learning machine, learner) {f(α)} (we use α as a generic set of parameters: a choice of α specifies a particular function).

• Shattered: if a given set of N points can be labeled in all 2^N possible ways, and for each labeling a member of the set {f(α)} can be found which correctly assigns those labels, we say that the set of points is shattered by that set of functions.

• The functions f(α) are usually called hypotheses, and the set {f(α): α ∈ Λ} is called the hypothesis space, denoted by H.

• The VC dimension for the set of functions {f(α)} , i.e., the hypothesis space H, is defined as the maximum number of training points that can be shattered by H.

• Frequently used in the analysis of mistake-bound algorithms, neural network capacity, computational learning theory, etc.

• Example:

• While it is possible to find a set of 3 points that can be shattered by the set of oriented lines, it is not possible to shatter a set of 4 points (with any labeling). Thus the VC dimension of the set of oriented lines in R2 is 3.

• The VC dimension of the set of oriented hyperplanes in R^n is n + 1. (A brute-force check of the R² case follows.)
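An illustrative brute-force check of the R² example (my own, not from the notes): linear separability of each labeling is posed as an LP feasibility problem, so any three points in general position are shattered, while the four corners of a square admit a labeling (the XOR pattern) that no oriented line can realize.

```python
# Brute-force check of the VC-dimension example for oriented lines in R^2.
# Linear separability of a labeling is posed as an LP feasibility problem:
# find (w, b) with y_i (w . x_i + b) >= 1 for all i.
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    # Constraints: -y_i (w . x_i + b) <= -1, variables (w1, w2, b) unbounded.
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    b_ub = -np.ones(len(X))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered(X):
    return all(separable(X, np.array(labels, dtype=float))
               for labels in itertools.product([-1.0, 1.0], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # general position
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # square / XOR
print("3 points shattered:", shattered(three))   # expected: True
print("4 points shattered:", shattered(four))    # expected: False
```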

Linear Classifiers

• Instance space: X = Rn.

• Set of class labels: Y = {+1, −1}.

• Training data set: {(x1, y1),..., (xN , yN )}.

• Hypothesis space:

H_lin(n) = {h : R^n → Y | h(x) = sign(w · x + b), w ∈ R^n, b ∈ R}

sign(w · x + b) = +1 if w · x + b > 0, −1 otherwise   (1)

• VC(H_lin(2)) = 3, VC(H_lin(n)) = n + 1.

• Idea:

- First choose a hypothesis h ∈ H_lin(n).
- Then choose its parameters to minimize the error on unseen examples.

Expected Risk

Suppose we are given N observations. Each observation consists of a pair: a vector x_i ∈ R^n, i = 1, ..., N, and the associated class label y_i. Assume there exists some unknown probability distribution P(x, y) from which these data points are drawn, i.e., the data points are assumed to be drawn independently and identically distributed.

Consider a machine whose task is to learn the mapping x_i → y_i ∈ {−1, 1}. The machine is defined by a set of possible mappings x → f(x, α) ∈ {−1, 1}, where the functions f(x, α) themselves are labeled by the adjustable parameters α.

Expected Risk: The expectation of the test error for a trained machine is

R(α) = ∫ (1/2) |y − f(x, α)| dP(x, y)   (2)

Upper Bound for Expected Risk

The empirical risk Remp is

R_emp(α) = (1/2N) Σ_{i=1}^{N} |y_i − f(x_i, α)|   (3)

Under the PAC (Probably Approximately Correct) model, Vapnik shows the following bound on the expected risk, which holds with probability 1 − η (0 ≤ η ≤ 1):

R(α) ≤ R_emp(α) + √[(h(log(2N/h) + 1) − log(η/4)) / N]   (4)

where h is the VC dimension of {f(x, α)} (i.e., of H). The second term on the right-hand side is called the VC confidence.
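As a small numerical aside (a hypothetical helper, not from the slides), the VC confidence term in (4) is easy to evaluate; it grows as h increases for a fixed sample size N:

```python
# Hypothetical helpers: evaluate the VC confidence term and the bound (4).
import math

def vc_confidence(h, N, eta=0.05):
    """Second term of (4): sqrt((h*(log(2N/h) + 1) - log(eta/4)) / N)."""
    return math.sqrt((h * (math.log(2 * N / h) + 1) - math.log(eta / 4)) / N)

def risk_bound(emp_risk, h, N, eta=0.05):
    """Right-hand side of (4): empirical risk plus VC confidence."""
    return emp_risk + vc_confidence(h, N, eta)

# The bound loosens as the VC dimension grows for a fixed sample size.
for h in (10, 100, 1000):
    print(h, round(risk_bound(emp_risk=0.05, h=h, N=10000), 3))
```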

Implications of the Bound for Expected Risk

R(α) ≤ R_emp(α) + √[(h(log(2N/h) + 1) − log(η/4)) / N]

• To achieve a small expected risk, i.e., good generalization performance ⇒ both the empirical risk and the ratio between the VC dimension and the number of data points have to be small.

• Since the empirical risk is usually a decreasing function of h while the VC confidence is an increasing function of h, for a given number of data points there is an optimal value of the VC dimension.


Implications of the Bound for R(α) (cont.)

• The choice of an appropriate value for h (which in most techniques is controlled by the number of free parameters of the model) is crucial in order to get good performance, especially when the number of data points is small.

• When using a multilayer network or a radial basis function network, this is equivalent to the problem of finding an appropriate number of hidden units.

• This is known to be difficult, and it is usually solved by cross-validation techniques.

Structural Risk Minimization (SRM)

• It is not enough to minimize just the empirical risk, as is often done when training neural networks.

• Need to overcome the problem of choosing an appropriate VC dimension.

• Structural Risk Minimization Principle: To make the expected risk small, both terms on the right-hand side of (4) should be small.

• Minimize the empirical risk and VC confidence simultaneously:

min_{H_n} [ R_emp(α) + √((h(log(2N/h) + 1) − log(η/4)) / N) ]   (5)

• Introduce “structure” by dividing the entire class of functions into nested subsets.

• For each subset, we must be able to compute h or a bound of h.

• Then, SRM consists of finding that subset of functions which minimizes the bound on the actual risk.

• The problem of selecting the right subset for a given amount of observations is referred to as capacity control.

• It is a trade-off between reducing the training error and limiting model complexity.

• Occam's Razor: models should be no more complex than is sufficient to explain the data.

• “Things should be made as simple as possible, but not any simpler.” (Albert Einstein)

Structural Risk Minimization (cont.)

• To implement SRM, one needs the nested structure of hypothesis space:

H1 ⊂ H2 ⊂ ... ⊂ Hn ⊂ ...

with the property that h(n) ≤ h(n + 1) where h(n) is the VC dimension of Hn.

• A learning machine with larger complexity (higher VC dimension) ⇒ small empirical risk.

• A simpler learning machine (lower VC dimension) ⇒ low VC confidence.

• SRM picks a trade-off in between VC dimension and empirical risk such that the risk bound is minimized.

• Problems:

- It is usually difficult to compute the VC dimension of H_n; there are only a small number of models for which we know how to compute it.
- Even when the VC dimension of H_n is known, it is not easy to solve the minimization problem. In most cases, one has to minimize the empirical risk for every H_n and then choose the H_n that minimizes (5).

Support Vector Machine (SV Algorithm)

One implementation based on Structural Risk Minimization theory.

• Each particular choice of structure gives rise to a learning algorithm, consisting of performing SRM in the given structure of sets of functions.

• The SVM algorithm is based on a structure on the set of separating hyperplanes.

• The structure is defined over nested sets of hyperplane functions,

H1 ⊂ H2 ⊂ ... ⊂ Hn ⊂ ...

Margin

The margin γi of a point xi with respect to a linear classifier h(x) = sign(w · x + b) is defined as the distance of xi from the hyperplane w · x + b = 0.

γ_i = (w · x_i + b) / ||w||   (6)

The margin of a set of points {x1,..., xN } is defined as the margin of the point closest to the hyperplane:

γ = min_i γ_i = min_i (w · x_i + b) / ||w||   (7)
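A tiny numerical illustration of (6)-(7) (my own, not the lecture's code):

```python
# Margins of a point set with respect to a given hyperplane (w, b), per (6)-(7).
import numpy as np

def margins(X, w, b):
    return (X @ w + b) / np.linalg.norm(w)     # gamma_i = (w . x_i + b) / ||w||

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 3.0]])
w, b = np.array([1.0, 1.0]), -2.0
gamma_i = margins(X, w, b)
print("per-point margins:", gamma_i)
print("margin of the set:", gamma_i.min())     # gamma = min_i gamma_i  (7)
```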

The SV Algorithm

Consider a hyperplane parameterized by w and b, (w, b) ∈ S × R,

{x ∈ S : (w · x) + b = 0}   (8)

If we additionally require

min_{i=1,...,N} |(w · x_i) + b| = 1   (9)

i.e., that the scaling of w and b be such that the point closest to the hyperplane has a distance of 1/||w||. Thus, the margin between the two classes, measured perpendicular to the hyperplane, is at least 2/||w||.

Proposition 1 (Vapnik, 1995). Let R be the radius of the smallest ball B_R(a) = {x ∈ F : ||x − a|| < R} (a ∈ F) containing the points x_1, x_2, ..., x_N, and let

f_{w,b} = sign((w · z) + b)   (10)

be the canonical hyperplane decision functions defined on these points. Then the set {f_{w,b} : ||w|| ≤ A} has a VC dimension h satisfying

h < R²A² + 1   (11)

In other words,

Maximizing the margin 1/||w|| ⇒ minimizing ||w|| ⇒ smallest acceptable VC dimension ⇒ constructing an optimal hyperplane is an implementation of SRM!

Question: How do we get the optimal hyperplane (w, b)?

Linear Support Vector Machine

Given a set of points x_i ∈ R^n, i = 1, 2, ..., N, where each point x_i belongs to one of two classes with label y_i ∈ {−1, 1}.

Definition 1. The set S is linearly separable if there exist w ∈ R^n and b ∈ R such that

y_i(w · x_i + b) ≥ 1,  i = 1, ..., N   (12)

The pair (w, b) defines a hyperplane of equation w · x + b = 0, called the separating hyperplane. The signed distance d_i of a point x_i from the separating hyperplane (w, b) is given by

d_i = (w · x_i + b) / ||w||   (13)

With (12) and (13), for all xi ∈ S, we have

y_i d_i ≥ 1 / ||w||   (14)

Linear SVM (cont.)

∀ x_i ∈ S:  y_i d_i ≥ 1/||w||

Therefore, 1/||w|| is a lower bound on the distance between the points x_i and the separating hyperplane (w, b).

Definition 2. The canonical representation of the separating hyperplane is obtained by rescaling the pair (w, b) into the pair (w′, b′) such that the distance of the closest point, say x_j, equals 1/||w′||.

Definition 3. Given a linearly separable set S, the optimal separating hyperplane (OSH) is the separating hyperplane for which the distance of the closest point of S is maximum (i.e., which maximizes 1/||w′||).

Constrained Quadratic Programming

Problem 1.
Minimize    (1/2) w · w
subject to  y_i(w · x_i + b) ≥ 1,  i = 1, 2, ..., N

Let α = (α_1, α_2, ..., α_N) be the N nonnegative Lagrange multipliers associated with the constraints of Problem 1. The solution to Problem 1 is then equivalent to determining the saddle point of the function

L_P = (1/2) w · w − Σ_{i=1}^{N} α_i (y_i(w · x_i + b) − 1)

with L_P = L(w, b, α).

Solving Constrained QP

At the saddle point, L_P attains a minimum with respect to w and b, requiring

∂L/∂b = Σ_{i=1}^{N} y_i α_i = 0   (15)

∂L/∂w = w − Σ_{i=1}^{N} α_i y_i x_i = 0  ⇒  w = Σ_{i=1}^{N} α_i y_i x_i   (16)

with ∂L/∂w = (∂L/∂w_1, ∂L/∂w_2, ..., ∂L/∂w_n).

Since these are equality constraints in the dual formulation, we can substitute them into L_P to give

L_D = Σ_{i=1}^{N} α_i − (1/2) Σ_{i,j=1}^{N} α_i α_j y_i y_j x_i · x_j = Σ_{i=1}^{N} α_i − (1/2) Σ_{i,j=1}^{N} α_i D_{ij} α_j   (17)

Solving Constrained QP Using the Dual

Problem 2.
Maximize    −(1/2) αᵀDα + Σ_{i=1}^{N} α_i
subject to  Σ_{i=1}^{N} y_i α_i = 0
            α ≥ 0

where D is an N × N matrix such that

D_{ij} = y_i y_j x_i · x_j   (18)

For the solution (w, b) at the saddle point, it follows from Problem 2 that

w = Σ_{i=1}^{N} α_i y_i x_i   (19)

Solving Constrained QP Using the Dual

b can be determined from α, the solution of the dual problem, and from the Kuhn-Tucker conditions

α_i (y_i(w · x_i + b) − 1) = 0,  i = 1, ..., N   (20)

Recall (16) and constraints in Problem 1

∂L/∂w = w − Σ_{i=1}^{N} α_i y_i x_i = 0  ⇒  w = Σ_{i=1}^{N} α_i y_i x_i

y_i(w · x_i + b) ≥ 1,  i = 1, 2, ..., N.

Note that the only αi that can be nonzero in (20) are those for which the constraints (12) are satisfied with the equality sign.

Support Vectors

y_i(w · x_i + b) ≥ 1,  i = 1, 2, ..., N        w = Σ_{i=1}^{N} α_i y_i x_i

Most of the constraints in (12) are satisfied with strict inequality, i.e., most of the α_i solved from the dual are zero.
⇒ The vector w is a linear combination of a relatively small percentage of the points x_i.
⇒ These points are termed support vectors because they are the points closest to the OSH and the only points of S needed to determine the OSH (hence the name of the method).

The problem of classifying a new data point x is now solved simply by looking at sign(w · x + b).
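As a hedged illustration (my own sketch, not code from the lecture), the dual of Problem 2 can be solved numerically with a generic constrained optimizer on a toy data set; w is then recovered from (16)/(19) and b from the Kuhn-Tucker condition (20). In practice one would use a dedicated QP or SMO solver rather than SLSQP.

```python
# A minimal sketch: solving the dual QP of Problem 2 for a small linearly
# separable toy set with SciPy's SLSQP solver (illustration only).
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [-2.0, -2.0], [-2.5, -1.0], [-3.0, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
N = len(y)

D = (y[:, None] * y[None, :]) * (X @ X.T)        # D_ij = y_i y_j x_i . x_j  (18)

def neg_dual(alpha):                             # minimize the negative of L_D  (17)
    return 0.5 * alpha @ D @ alpha - alpha.sum()

cons = {"type": "eq", "fun": lambda a: a @ y}    # sum_i y_i alpha_i = 0  (15)
res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=[(0, None)] * N, constraints=[cons])
alpha = res.x

w = (alpha * y) @ X                              # w = sum_i alpha_i y_i x_i  (16)
sv = alpha > 1e-6                                # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)                   # from the Kuhn-Tucker condition (20)
print("support vectors:", np.where(sv)[0], "w =", w, "b =", b)
```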

Soft Margin Classifier

In the case that the set S is not linearly separable, or one simply ignores whether or not S is linearly separable, the previous analysis can be generalized by introducing N nonnegative variables ξ = (ξ_1, ξ_2, ..., ξ_N) such that

y_i(w · x_i + b) ≥ 1 − ξ_i,  i = 1, ..., N   (21)

Purpose: to allow for a small number of misclassified points for better generalization or computational efficiency.

Linear separating hyperplane for non-separable case [4].

Generalized OSH

The generalized OSH is then regarded as the solution to

Problem 3.
Minimize    (1/2) w · w + C Σ_{i=1}^{N} ξ_i
subject to  y_i(w · x_i + b) ≥ 1 − ξ_i,  i = 1, ..., N
            ξ ≥ 0

Role of C:

• as a regularization parameter (cf. regularization in radial basis function fitting).

• large C ⇒ minimize the number of misclassified points.

• small C ⇒ maximize the minimum distance 1/||w|| (see the illustration below).
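To see the role of C in practice, the hedged snippet below (using scikit-learn, not part of the original notes) fits soft-margin linear SVMs with several values of C on overlapping clusters; smaller C tends to give a wider margin and more support vectors:

```python
# Effect of the regularization parameter C on a soft-margin linear SVM
# (illustration with scikit-learn; toy data, not from the lecture).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),      # class -1
               rng.normal(1.5, 1.0, (100, 2))])     # class +1 (overlapping)
y = np.array([-1] * 100 + [1] * 100)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: support vectors={len(clf.support_)}, "
          f"margin 1/||w||={1.0 / np.linalg.norm(w):.3f}")
```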

Dual Problem

Problem 4.
Maximize    −(1/2) αᵀDα + Σ_{i=1}^{N} α_i
subject to  Σ_{i=1}^{N} y_i α_i = 0
            0 ≤ α_i ≤ C,  i = 1, 2, ..., N

As before,

w = Σ_{i=1}^{N} α_i y_i x_i

and b can be determined from α, the solution of the dual Problem 4, and from the new Kuhn-Tucker conditions

α_i (y_i(w · x_i + b) − 1 + ξ_i) = 0   (22)

(C − α_i) ξ_i = 0   (23)

where the ξ_i are the values of the slack variables at the saddle point. As in the separable case, the points x_i for which α_i > 0 are termed support vectors.

(C − α_i) ξ_i = 0

Two cases:

• α_i < C ⇒ ξ_i = 0
  ⇒ such support vectors lie at a distance 1/||w|| from the OSH
  ⇒ they are called margin vectors

• αi = C

1. ξ_i > 1: misclassified points
2. 0 < ξ_i ≤ 1: points correctly classified but closer than 1/||w|| to the OSH
3. ξ_i = 0: margin vectors (a rare case)

Neglecting the last rare case, we refer to all the support vectors for which αi = C as errors. All the points that are not support vectors are correctly classified and lie outside the margin strip.

Nonlinear Support Vector Machine

• Note that the only way the data points appear in the training problem is in the form of dot products xi · xj.

• In higher dimensional space (feature space), it is very likely that a linear separator (hyperplane) can be constructed.

• E.g., we map the data points from the input space R^n to some feature space of higher dimension R^m (m > n) using a function Φ:

Φ : R^n → R^m

Example: Φ : R² → R³,  x = (x1, x2) → x′ = (x1², x2², x1x2)

• Then the training algorithm would only depend on the dot products in H,

i.e., on functions of the form Φ(x_i) · Φ(x_j). In other words,

f(x) = w · x + b = Σ_{i=1}^{N} y_i α_i (x_i · x) + b

f(x) = w · Φ(x) + b = Σ_{i=1}^{N} y_i α_i (Φ(x_i) · Φ(x)) + b

• But the transformation operator, Φ, is computationally expensive.

• If there were a “kernel function” K such that K(xi, xj) = Φ(xi) · Φ(xj), we would only need to use K in the training algorithm.

• One example is the RBF kernel: K(x_i, x_j) = exp(−||x_i − x_j||² / 2σ²). (A numerical check of the kernel trick follows this list.)

• All the previous derivations in linear SVM hold (substituting dot product with kernel function), since we are still doing a linear separation, but in a different space.
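A quick numerical sanity check (mine, using a √2-scaled variant of the example map above): for Φ(x) = (x1², x2², √2 x1x2), the explicit feature-space dot product Φ(x) · Φ(y) equals the kernel K(x, y) = (x · y)², so Φ never has to be computed explicitly.

```python
# Kernel trick sanity check: the explicit map Phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
# (a scaled variant of the example map above) satisfies Phi(x).Phi(y) = (x.y)^2.
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k(x, y):
    return (x @ y) ** 2                       # homogeneous polynomial kernel

rng = np.random.default_rng(1)
x, y = rng.normal(size=2), rng.normal(size=2)
print(phi(x) @ phi(y), k(x, y))               # identical up to rounding
```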

• Map the training data nonlinearly into a higher-dimensional feature space via Φ, and construct a separating hyperplane with maximum margin there.

• This yields a nonlinear decision boundary in input space. By the use of kernel function, it is possible to compute the separating hyperplane without explicitly carrying out the map into the feature space.

Mapping between input space and feature space [9].

• Question: Is a feature space always more expressive than input space? How do we determine kernel functions?

Mercer's Condition for Kernel Functions

• The idea of constructing support vector networks comes from considering general forms of the dot product in a Hilbert space:

Φ(u) · Φ(v) ≡ K(u, v) (24)

• Question: For which kernels K does there exist a pair {H, Φ} such that K(x_i, x_j) = Φ(x_i) · Φ(x_j)?

• Answer: Mercer’s condition. It tells us whether or not a prospective kernel is actually a dot product in some space.

• According to the Hilbert-Schmidt Theory, any symmetric function K(u, v), with K(u, v) ∈ L2, can be expanded in the form

K(u, v) = Σ_i λ_i Φ_i(u) · Φ_i(v)   (25)

where λ_i ∈ R and Φ_i are the eigenvalues and eigenfunctions

∫ K(u, v) Φ_i(u) du = λ_i Φ_i(v)

of the integral operator defined by the kernel K(u, v).

• A sufficient condition to ensure that (24) defines a dot product in a feature space is that all the eigenvalues in the expansion (25) are positive. To guarantee that these coefficients are positive, it is necessary and sufficient (Mercer’s theorem) that the condition

∬ K(u, v) g(u) g(v) du dv > 0

is satisfied for all g such that

∫ g²(u) du < ∞

Some Kernel Functions in SVM

Simple dot product:

K(x, y) = x · y

Vovk’s polynomial:

K(x, y) = (x · y + 1)^p

Radial basis function (RBF):

K(x, y) = exp(−||x − y||² / 2σ²)

Two layer neural network:

K(x, y) = tanh(κx · y − δ)
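For reference, the kernels listed above are one-liners in code (a sketch; the parameter names are my own):

```python
# The kernels listed above, written as plain functions (parameter names are mine).
import numpy as np

def k_linear(x, y):          return x @ y                          # simple dot product
def k_poly(x, y, p=2):       return (x @ y + 1.0) ** p             # Vovk's polynomial
def k_rbf(x, y, sigma=1.0):  return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
def k_sigmoid(x, y, kappa=1.0, delta=0.0):
    return np.tanh(kappa * (x @ y) - delta)                        # two-layer NN kernel
```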

Some Properties of SVM

Decision surfaces obtained (a) by a polynomial classifier and (b) by an RBF classifier, with the support vectors indicated in dark fill. Note their reduced number and their position close to the boundary. In (b), the support vectors are the RBF centers [13].

Architecture for General Support Vector Machine

• Linear SVM: Kernel function is just a dot product in input space

• Nonlinear SVM: Need to choose an appropriate kernel function

Multiple Classes

For the K-class pattern recognition problem, several approaches have been proposed:

• One-against-the-rest: construct a hyperplane between each class k and the K − 1 other classes ⇒ K SVMs.

• One-against-one: construct a hyperplane for every pair of classes ⇒ K(K − 1)/2 SVMs.

• K-class SVM: a single multi-class formulation (Weston and Watkins).

• John Platt’s DAG (directed acyclic graph) method.

SVM: Training and Testing

• Training: Solve a large constrained quadratic optimization problem.
  - Speed-ups: chunking, Sequential Minimal Optimization (SMO).

• Testing: The number of support vectors may be large.
  - Speed-ups: reduced-set methods, etc.

SVM Regression

• Introduce ε-insensitive loss

|y − f(x)|_ε = max{0, |y − f(x)| − ε}   (26)

• Estimate f(x) = w · x + b by minimizing

(1/2) ||w||² + (C/N) Σ_{i=1}^{N} |y_i − f(x_i)|_ε   (27)
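A hedged sketch of the quantities in (26)-(27) for a linear model f(x) = w · x + b (my own helper functions, not the lecture's code):

```python
# Epsilon-insensitive loss (26) and the SVM regression objective (27)
# for a linear model f(x) = w . x + b (illustration only).
import numpy as np

def eps_insensitive(y, f_x, eps=0.1):
    return np.maximum(0.0, np.abs(y - f_x) - eps)          # |y - f(x)|_eps

def svr_objective(w, b, X, y, C=1.0, eps=0.1):
    f_x = X @ w + b
    return 0.5 * w @ w + (C / len(y)) * eps_insensitive(y, f_x, eps).sum()

# Example: evaluate the objective for a candidate (w, b) on toy 1-D data.
X = np.linspace(-3, 3, 20).reshape(-1, 1)
y = 0.8 * X.ravel() + 0.1
print(svr_objective(np.array([0.8]), 0.1, X, y))
```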

Relevance Vector Machine [Tipping, NIPS 2000]

• Main idea [22]:
  - Extend the SVM to have a probabilistic interpretation.
  - Further sparsify the solution: fewer support vectors, i.e., relevance vectors (prototypes).
  - Bayesian learning: put priors over the parameters (i.e., introduce hyperparameters).

• Approach: Place a Gaussian noise model on the target values y_i and zero-mean Gaussian priors on the weights w_i.

p(y | w, σ²) = (2πσ²)^{−N/2} exp{−(1/2σ²) ||y − Φw||²}   (28)

p(w | β) = Π_{i=0}^{N} N(w_i | 0, β_i^{−1})   (29)

where Φ is the N × (N + 1) design matrix with Φ_{nm} = K(x_n, x_{m−1}) and Φ_{n1} = 1, and β is a vector of N + 1 hyperparameters.

• Hyperparameters: σ and β.

• With that, we have the posterior of the weights:

p(w | y, β, σ²) = (2π)^{−(N+1)/2} |Σ|^{−1/2} exp{−(1/2) (w − µ)ᵀ Σ^{−1} (w − µ)}   (30)

Σ = (Φᵀ B Φ + A)^{−1}   (31)
µ = Σ Φᵀ B y

where A = diag(β_0, β_1, ..., β_N) and B = σ^{−2} I_N.

• Integrating out the weights, we have the marginal likelihood for the hyperparameters.

p(y | β, σ²) = (2π)^{−N/2} |B^{−1} + Φ A^{−1} Φᵀ|^{−1/2} exp{−(1/2) yᵀ (B^{−1} + Φ A^{−1} Φᵀ)^{−1} y}   (32)

• Optimize the hyperparameters β and σ² with an EM-type algorithm (a toy evaluation of the posterior follows).
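To make (30)-(31) concrete, the toy sketch below evaluates the weight posterior for fixed hyperparameters, assuming B = σ^{−2}I as noted above; it is an illustration only and omits the EM updates of β and σ²:

```python
# Weight posterior of the RVM, equations (30)-(31), for fixed hyperparameters.
# Assumes B = sigma^{-2} I; the design matrix uses an RBF kernel plus a bias column.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-10, 10, 50)
y = np.sinc(x / np.pi) + 0.05 * rng.normal(size=x.size)    # noisy sin(x)/x targets

def rbf(a, b, s=2.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * s ** 2))

Phi = np.hstack([np.ones((x.size, 1)), rbf(x, x)])         # N x (N+1) design matrix
sigma2 = 0.05 ** 2
beta = np.full(Phi.shape[1], 1.0)                          # hyperparameters (held fixed here)

A = np.diag(beta)                                          # A = diag(beta_0, ..., beta_N)
B = np.eye(x.size) / sigma2                                # B = sigma^{-2} I
Sigma = np.linalg.inv(Phi.T @ B @ Phi + A)                 # (31)
mu = Sigma @ Phi.T @ B @ y                                 # posterior mean of the weights
print("posterior mean of first 3 weights:", mu[:3])
```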

• For the function sinc(x) = |x|^{−1} sin|x| with 100 training samples, SVM regression uses 39 support vectors while the RVM uses only 9 relevance vectors.

• Williams et al. apply RVM to learn a regression function of in-plane image displacement for visual tracking [27][28].

• Agarwal and Triggs apply RVM to learn a regression function of human joint angles from silhouette images [1].

Applications

• Pattern Recognition: hand digit recognition, 3D object recognition, face detection, face recognition, pedestrian detection, gender classification, visual tracking, expression recognition, speaker identification, text classification.

• Regression: time series prediction, relevance vector machine

• Signal Processing: seismic signal classification, density estimation, DNA sequence classification.

Review: Optical Flow

Let the image intensity be given by I(x, y, t) and use a first-order Taylor expansion:

I(x + dx, y + dy, t + dt) = I(x, y, t) + (∂I/∂x) dx + (∂I/∂y) dy + (∂I/∂t) dt + h.o.t.   (33)

Based on brightness constancy assumption,

I(x + dx, y + dy, t + dt) = I(x, y, t) (34)

Thus

(∂I/∂x) dx + (∂I/∂y) dy + (∂I/∂t) dt = 0   (35)

Let

dx/dt = u,  dy/dt = v   (36)

We have

−∂I/∂t = (∂I/∂x) u + (∂I/∂y) v   (37)

Lucas-Kanade

E(u, v) = Σ_{x,y ∈ ROI} (I(x + u, y + v, t + dt) − I(x, y, t))²   (38)

Minimize E(u, v)

[ Σ I_x²     Σ I_x I_y ] [ u ]   [ −Σ I_t I_x ]
[ Σ I_x I_y  Σ I_y²    ] [ v ] = [ −Σ I_t I_y ]   (39)
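A minimal single-ROI Lucas-Kanade step (a sketch under the usual small-motion assumption, not production tracking code), assembling and solving the 2 × 2 system (39) with finite-difference gradients:

```python
# One Lucas-Kanade step: build and solve the 2x2 system (39) for a single ROI.
import numpy as np

def lucas_kanade_step(I0, I1):
    """Estimate a single (u, v) displacement between two same-size patches."""
    Ix = np.gradient(I0, axis=1)          # spatial derivatives (finite differences)
    Iy = np.gradient(I0, axis=0)
    It = I1 - I0                          # temporal derivative
    A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = -np.array([np.sum(It * Ix), np.sum(It * Iy)])
    return np.linalg.solve(A, b)          # (u, v)

# Toy check: shift a smooth patch by one pixel in x and recover roughly (1, 0).
xx, yy = np.meshgrid(np.arange(32, dtype=float), np.arange(32, dtype=float))
I0 = np.sin(0.3 * xx) + np.cos(0.2 * yy)
I1 = np.sin(0.3 * (xx - 1.0)) + np.cos(0.2 * yy)   # content moved +1 px in x
print(lucas_kanade_step(I0, I1))
```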

SVM Tracking [Avidan, CVPR 2001]

• Incorporate optical flow with SVM [2][3].

• Estimate the displacement direction using optical flow

• Use the SVM score to determine the most likely target location

• Given I, use first order Taylor expansion

I* = I + u I_x + v I_y   (40)

where I_x, I_y are the x and y derivatives of I, and u, v are the motion parameters.

• With the SVM score:

f(x) = Σ_{i=1}^{N} y_i α_i K(x_i, x) + b
f(I*) = Σ_{i=1}^{N} y_i α_i K(x_i, I + u I_x + v I_y) + b   (41)

• Use a dot-product kernel for K and maximize the above function:

E(u, v) = Σ_{i=1}^{N} y_i α_i ((I + u I_x + v I_y)ᵀ x_i)²   (42)

• Take partial derivatives w.r.t. u and v:

∂E/∂u = Σ_{i=1}^{N} y_i α_i (I_xᵀ x_i) ((I + u I_x + v I_y)ᵀ x_i) = 0   (43)
∂E/∂v = Σ_{i=1}^{N} y_i α_i (I_yᵀ x_i) ((I + u I_x + v I_y)ᵀ x_i) = 0   (44)

• Thus we have

[ A_11  A_12 ] [ u ]   [ b_1 ]
[ A_21  A_22 ] [ v ] = [ b_2 ]   (45)

where

A_11 = Σ_{i=1}^{N} α_i y_i (x_iᵀ I_x)²
A_12 = A_21 = Σ_{i=1}^{N} α_i y_i (x_iᵀ I_x)(x_iᵀ I_y)
A_22 = Σ_{i=1}^{N} α_i y_i (x_iᵀ I_y)²   (46)
b_1 = −Σ_{i=1}^{N} α_i y_i (x_iᵀ I_x)(x_iᵀ I)
b_2 = −Σ_{i=1}^{N} α_i y_i (x_iᵀ I_y)(x_iᵀ I)

• Similar in form to the optical flow equations (39).

• Also use an image pyramid to search over scale (a sketch of one update step follows).
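A hedged sketch of one support-vector-tracking update from (45)-(46); the inputs (support-vector patches, coefficients α_i y_i, and the flattened patch and its derivatives) are placeholders of my own, not Avidan's data or code:

```python
# One SVM-tracking update: assemble and solve (45) using the coefficients in (46).
# Inputs are placeholders: X_sv are flattened support-vector patches, coef = alpha_i*y_i,
# and I, Ix, Iy are the current patch and its x/y derivatives, flattened to vectors.
import numpy as np

def svm_tracking_step(X_sv, coef, I, Ix, Iy):
    px = X_sv @ Ix                      # x_i^T I_x for every support vector
    py = X_sv @ Iy                      # x_i^T I_y
    p0 = X_sv @ I                       # x_i^T I
    A = np.array([[np.sum(coef * px * px), np.sum(coef * px * py)],
                  [np.sum(coef * px * py), np.sum(coef * py * py)]])
    b = -np.array([np.sum(coef * px * p0), np.sum(coef * py * p0)])
    return np.linalg.solve(A, b)        # estimated motion (u, v)

# Toy shapes only: 5 "support vectors" of a 20x20 patch with random placeholder data,
# so the printed (u, v) is not meaningful; it only exercises the linear algebra.
rng = np.random.default_rng(0)
X_sv = rng.normal(size=(5, 400))
coef = rng.normal(size=5)               # alpha_i * y_i
I, Ix, Iy = rng.normal(size=400), rng.normal(size=400), rng.normal(size=400)
print(svm_tracking_step(X_sv, coef, I, Ix, Iy))
```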

Life Beyond SVM

• Mistake-Bound On-Line Learning: Winnow, SNoW

• Ensemble of Homogeneous Classifiers: Boosting, Bagging

• Ensemble of Heterogeneous Classifiers: Kittler’s method

• Random Subspace Method: Monte Carlo approach

• Kernel methods: Kernel PCA, Kernel Fisher Linear Discriminant

• Generative Models, Graphical Models, Nonlinear PCA, Probabilistic PCA, Mixture of Probabilistic PCA, etc.

• Maximum entropy approach

SVM vs. SNoW on Face Detection

A benchmark of SVM and SNoW based on 5,732 training samples and 500 testing samples on a Sun Ultra SPARC 10. Each sample is a 20 × 20 image.

                     Nonlinear SVM   Linear SVM   SNoW
Training accuracy    100%            100%         100%
Testing accuracy     100%            96%          97%
Memory requirement   83 MB           24 MB        7 MB
Wall-clock time      5.8 hr          3.7 hr       0.6 hr

Concluding Remarks

Pros:

• Optimal hyperplane.

• Generalizes well even with kernels whose feature space has infinite VC dimension.

• Can deal with high dimensional data.

Cons:

• Numerical stability problems in solving constrained QP.

• Usually require positive/negative examples.

• Need to select a good kernel function.

• Require lots of memory and CPU time.

References and Resources

Introductory articles: [9][21][16][14][4][6]
Books: [8][23][24][18][7]
Ph.D. theses: [5][17][20]
Vision papers: [16][14][12][15]
Comparison with RBF: [19]
Kernel Machines web site: http://www.kernel-machines.org
Boosting web site: http://www.boosting.org/

References

[1] A. Agarwal and B. Triggs. 3D human pose from silhouettes by relevance vector regression. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 882–888, 2004.
[2] S. Avidan. Support vector tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 184–191, 2001.
[3] S. Avidan. Support vector tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(8):1064–1072, 2004.
[4] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 1998.

[5] C. Cortes. Prediction of Generalization Ability in Learning Machines. PhD thesis, Department of Computer Science, University of Rochester, 1995.
[6] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20, 1995.
[7] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[8] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 1998.
[9] M. A. Hearst, B. Scholkopf, S. Dumais, E. Osuna, and J. Platt. Trends and controversies - support vector machines. IEEE Intelligent Systems, 13(4):18–28, 1998.
[10] A. M. Martinez and A. Kak. PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):228–233, 2001.
[11] A. M. Martinez and M. Zhu. Where are linear feature extraction methods applicable? IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1934–1944, 2005.
[12] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio. Pedestrian detection using wavelet templates. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 193–199, 1997.
[13] E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and applications. Technical Report AI Memo 1602, MIT AI Lab, 1997.
[14] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: An application to face detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 130–136, 1997.
[15] C. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In Proceedings of the Fifth International Conference on Computer Vision, pages 555–562, 1998.
[16] M. Pontil and A. Verri. Support vector machines for 3D object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(6):637–646, 1998.
[17] B. Scholkopf. Support Vector Learning. PhD thesis, Informatik der Technischen Universitat Berlin, 1997.
[18] B. Scholkopf, C. Burges, and A. Smola, editors. Advances in Kernel Methods. MIT Press, 1998.
[19] B. Scholkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, and T. Poggio. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45(11):2758–2765, 1997.
[20] A. Smola. Learning with Kernels. PhD thesis, GMD, 1998.
[21] A. J. Smola and B. Scholkopf. A tutorial on support vector regression. Technical Report TR-1998-030, NeuroCOLT, GMD First, 1998.
[22] M. Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen, and K.-R. Muller, editors, Advances in Neural Information Processing Systems, pages 46–53. MIT Press, 2000.
[23] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[24] V. Vapnik. Statistical Learning Theory (Adaptive and Learning Systems for Signal Processing, Communications, and Control). John Wiley & Sons, 1998.
[25] P. Viola and M. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.

[26] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. In Proceedings of the Ninth IEEE International Conference on Computer Vision, volume 2, pages 734–741, 2003.
[27] O. Williams, A. Blake, and R. Cipolla. A sparse probabilistic learning algorithm for real-time tracking. In Proceedings of the Ninth IEEE International Conference on Computer Vision, volume 1, pages 353–360, 2003.
[28] O. Williams, A. Blake, and R. Cipolla. Sparse Bayesian learning for efficient visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1292–1304, 2005.
