Advanced Topics in Learning and Vision
Ming-Hsuan Yang, [email protected]
Lecture 6 (draft)

Announcements

• More course material is available on the course web page.
• Project web pages: everyone needs to set one up (format details will be available soon).
• Reading (due Nov 1):
  - RVM application to 3D human pose estimation [1].
  - Viola and Jones: Adaboost-based real-time face detector [25].
  - Viola et al.: Adaboost-based real-time pedestrian detector [26].

A. Agarwal and B. Triggs. 3D human pose from silhouettes by relevance vector regression. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 882–888, 2004.
P. Viola and M. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. In Proceedings of the Ninth IEEE International Conference on Computer Vision, volume 2, pages 734–741, 2003.

Overview

• Linear classifiers: Fisher linear discriminant (FLD), linear support vector machine (SVM).
• Nonlinear classifiers: nonlinear support vector machine.
• SVM regression, relevance vector machine (RVM).
• Kernel methods: kernel principal component analysis, kernel discriminant analysis.
• Ensembles of classifiers: Adaboost, bagging, ensembles of homogeneous/heterogeneous classifiers.

Fisher Linear Discriminant

• Based on Gaussian class distributions: between-class and within-class scatter matrices.
• Supervised learning for multiple classes.
• A trick often used in ridge regression: regularize the within-class scatter, S′_w = S_w + λI, so that it is nonsingular without first applying PCA.
• Fisherfaces vs. Eigenfaces (FLD vs. PCA).
• Further study: Aleix Martinez's papers analyzing FLD for object recognition [10][11].

A. M. Martinez and A. Kak. PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):228–233, 2001.
A. M. Martinez and M. Zhu. Where are linear feature extraction methods applicable? IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1934–1944, 2005.

Intuitive Justification for SVM

Two things come into the picture:
• What kind of function do we use for classification?
• What are the parameters of this function?
Which classifier do you expect to perform well?

Support Vector Machine

• Objective: find an optimal hyperplane that classifies the data points as correctly as possible while separating the points of the two classes as far as possible.
• Approach: formulate a constrained optimization problem, then solve it with constrained quadratic programming (QP).
• Theoretical basis: structural risk minimization.
• Issues: VC dimension, linear separability, feature space, multiple classes.

Main Idea

• Given a set of data points belonging to either of two classes, an SVM finds the hyperplane
  - leaving the largest possible fraction of points of the same class on the same side,
  - while maximizing the distance of either class from the hyperplane.
• Find the optimal separating hyperplane that minimizes the risk of misclassifying both the training samples and unseen test samples.
• Question: why do we need this? Does it generalize well to test samples?
• Answer: structural risk minimization.
• Issues:
  - How do we determine the capacity of a classification model/function?
  - How do we determine the parameters of that model/function?
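The maximum-margin picture above can be made concrete with a small numerical example. The following sketch (not part of the original notes) fits a linear SVM on a toy two-class data set using scikit-learn, then reads off the hyperplane normal w, the bias b, the margin width 2/||w||, and the support vectors; the library, the synthetic data, and the large value of C used to approximate the hard-margin case are illustrative assumptions.

# Minimal sketch: linear SVM on a toy 2D problem (assumes NumPy and scikit-learn).
import numpy as np
from sklearn.svm import SVC

# Two linearly separable point clouds with labels y in {+1, -1}.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.hstack([np.ones(20), -np.ones(20)])

# A very large C approximates the hard-margin formulation.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # bias term
print("w =", w, " b =", b)
print("margin width =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)

In the separable, hard-margin case the support vectors are exactly the training points that lie on the margin; all other points could be removed without changing the solution.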
Vapnik-Chervonenkis (VC) Dimension

• A property of a set of functions {f(α)} (a hypothesis space H, learning machine, learner), where α is a generic set of parameters: a choice of α specifies a particular function.
• Shattered: if a given set of N points can be labeled in all 2^N possible ways, and for each labeling a member of {f(α)} can be found that assigns those labels correctly, we say that the set of points is shattered by that set of functions.
• The functions f(α) are usually called hypotheses, and the set {f(α) : α ∈ Λ} is called the hypothesis space, denoted by H.
• The VC dimension of the set of functions {f(α)}, i.e., of the hypothesis space H, is defined as the maximum number of training points that can be shattered by H.
• Frequently used for sample complexity, mistake-bound algorithms, neural network capacity, computational learning theory, etc.

• Example: while it is possible to find a set of 3 points in R^2 that can be shattered by the set of oriented lines, it is not possible to shatter any set of 4 points (under every labeling). Thus the VC dimension of the set of oriented lines in R^2 is 3.
• The VC dimension of the set of oriented hyperplanes in R^n is n + 1.

Linear Classifiers

• Instance space: X = R^n.
• Set of class labels: Y = {+1, −1}.
• Training data set: {(x_1, y_1), ..., (x_N, y_N)}.
• Hypothesis space:
  H_lin(n) = {h : R^n → Y | h(x) = sign(w · x + b), w ∈ R^n, b ∈ R}
  where sign(w · x + b) = +1 if w · x + b > 0, and −1 otherwise.   (1)
• VC(H_lin(2)) = 3, and in general VC(H_lin(n)) = n + 1.
• Idea:
  - First find a hypothesis h ∈ H_lin(n),
  - then its parameters, so as to minimize the error on unseen examples.

Expected Risk

Suppose we are given N observations. Each observation consists of a pair: a vector x_i ∈ R^n, i = 1, ..., N, and the associated class label y_i. Assume there exists some unknown probability distribution P(x, y) from which these data points are drawn, i.e., the data points are assumed to be independently drawn and identically distributed. Consider a machine whose task is to learn the mapping x_i → y_i ∈ {−1, 1}; the machine is defined by a set of possible mappings x → f(x, α) ∈ {−1, 1}, where the functions f(x, α) themselves are labeled by the adjustable parameters α.

Expected risk: the expectation of the test error for a trained machine is

  R(α) = ∫ (1/2) |y − f(x, α)| dP(x, y)   (2)

Upper Bound for the Expected Risk

The empirical risk R_emp is

  R_emp(α) = (1/2N) Σ_{i=1}^{N} |y_i − f(x_i, α)|   (3)

Under the PAC (Probably Approximately Correct) model, Vapnik shows that the following bound on the expected risk holds with probability 1 − η (0 ≤ η ≤ 1):

  R(α) ≤ R_emp(α) + √( [h(log(2N/h) + 1) − log(η/4)] / N )   (4)

where h is the VC dimension of f(x, α) (i.e., of H). The second term on the right-hand side is called the VC confidence.

Implications of the Bound for the Expected Risk

  R(α) ≤ R_emp(α) + √( [h(log(2N/h) + 1) − log(η/4)] / N )

• To achieve a small expected risk, that is, good generalization performance, both the empirical risk and the ratio between the VC dimension and the number of data points have to be small.
• Since the empirical risk is usually a decreasing function of h while the VC confidence grows with h, for a given number of data points there is an optimal value of the VC dimension.
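To see how the VC confidence in (4) behaves, the short sketch below (not from the original notes) tabulates the second term of the bound for a fixed number of samples N and confidence parameter η as the VC dimension h grows; the particular values of N and η are assumptions chosen only for illustration.

# Sketch: the VC-confidence term of Eq. (4) as a function of h.
import math

def vc_confidence(h, N, eta):
    # sqrt( [h (log(2N/h) + 1) - log(eta/4)] / N )
    return math.sqrt((h * (math.log(2 * N / h) + 1) - math.log(eta / 4)) / N)

N, eta = 1000, 0.05
for h in (1, 5, 10, 50, 100, 500):
    # The empirical risk is problem-dependent; only the capacity term is shown here.
    print(f"h = {h:4d}   VC confidence = {vc_confidence(h, N, eta):.3f}")

The capacity term grows with h and shrinks as N grows, which is the trade-off behind the statement that an optimal VC dimension exists for a given number of data points.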
Implications of the Bound for R(α) (cont.)

• The choice of an appropriate value of h (which in most techniques is controlled by the number of free parameters of the model) is crucial for good performance, especially when the number of data points is small.
• When using a multilayer perceptron or a radial basis function network, this is equivalent to the problem of finding an appropriate number of hidden units.
• This is known to be difficult, and it is usually addressed with cross-validation techniques.

Structural Risk Minimization (SRM)

• It is not enough to minimize just the empirical risk, as is often done with neural networks.
• We need to overcome the problem of choosing an appropriate VC dimension.
• Structural risk minimization principle: to make the expected risk small, both terms on the right-hand side of (4) should be small.
• Minimize the empirical risk and the VC confidence simultaneously:

  min_{H_n} ( R_emp(α) + √( [h(log(2N/h) + 1) − log(η/4)] / N ) )   (5)

• Introduce "structure" by dividing the entire class of functions into nested subsets.
• For each subset, we must be able to compute h, or a bound on h.
• Then SRM consists of finding the subset of functions that minimizes the bound on the actual risk.
• The problem of selecting the right subset for a given number of observations is referred to as capacity control.
• It is a trade-off between reducing the training error and limiting model complexity.
• Occam's razor: models should be no more complex than is sufficient to explain the data.
• "Things should be made as simple as possible, but not any simpler." (attributed to Albert Einstein)

Structural Risk Minimization (cont.)

• To implement SRM, one needs a nested structure of hypothesis spaces
  H_1 ⊂ H_2 ⊂ ... ⊂ H_n ⊂ ...
  with the property that h(n) ≤ h(n + 1), where h(n) is the VC dimension of H_n.
• A learning machine with larger complexity (higher VC dimension) ⇒ small empirical risk.
• A simpler learning machine (lower VC dimension) ⇒ low VC confidence.
• SRM picks a trade-off between VC dimension and empirical risk such that the risk bound is minimized.
• Problems:
  - It is usually difficult to compute the VC dimension of H_n, and there are only a small number of models for which we know how to compute it.
  - Even when the VC dimension of H_n is known, it is not easy to solve the minimization problem. In most cases, one has to minimize the empirical risk for every set H_n, and then choose the H_n that minimizes (5).

Support Vector Machine (SV Algorithm)

One implementation based on structural risk minimization theory:
• Each particular choice of structure gives rise to a learning algorithm, consisting of performing SRM in the given structure of sets of functions.
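As a concrete (and deliberately simplified) illustration of the SRM recipe, the sketch below, which is not from the original notes, scores a nested family of linear classifiers built on polynomial features of increasing degree. It uses h = d + 1 for a linear classifier on d features as the capacity of each subset, minimizes the empirical risk within each subset with logistic regression, and then picks the subset whose bound (5) is smallest; the data, the choice of logistic regression as the empirical-risk minimizer, and η = 0.05 are all assumptions made for illustration.

# SRM-style model selection over nested subsets H_1 ⊂ H_2 ⊂ ... (illustrative sketch).
import math
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

def vc_confidence(h, N, eta=0.05):
    return math.sqrt((h * (math.log(2 * N / h) + 1) - math.log(eta / 4)) / N)

# Toy two-class problem in R^2 with a nonlinear decision boundary.
rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)
N = len(y)

best = None
for degree in range(1, 6):                     # nested subsets of increasing capacity
    phi = PolynomialFeatures(degree).fit_transform(X)
    h = phi.shape[1] + 1                       # capacity proxy: VC dim of linear classifiers on d features is d + 1
    clf = LogisticRegression(max_iter=1000).fit(phi, y)
    r_emp = float(np.mean(clf.predict(phi) != y))   # empirical risk (training error)
    bound = r_emp + vc_confidence(h, N)
    print(f"degree {degree}: h = {h:3d}  R_emp = {r_emp:.3f}  bound = {bound:.3f}")
    if best is None or bound < best[1]:
        best = (degree, bound)

print("SRM selects degree", best[0])

An SVM realizes the same idea differently: its nested structure is defined through the margin rather than an explicit feature count, which is why the SV algorithm is described above as one implementation of SRM.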