Nonlinear Learning with Kernels

Piyush Rai

Machine Learning (CS771A)

Aug 26, 2016

Recap

Support Vector Machines

We looked at hard-margin and soft-margin SVMs.

The dual objectives for the hard-margin and soft-margin SVM:

Hard-Margin SVM: $\max_{\alpha \ge 0}\; L_D(\alpha) = \alpha^\top \mathbf{1} - \frac{1}{2}\alpha^\top G \alpha \quad \text{s.t.} \quad \sum_{n=1}^N \alpha_n y_n = 0$

Soft-Margin SVM: $\max_{0 \le \alpha \le C}\; L_D(\alpha) = \alpha^\top \mathbf{1} - \frac{1}{2}\alpha^\top G \alpha \quad \text{s.t.} \quad \sum_{n=1}^N \alpha_n y_n = 0$

where $G$ is an $N \times N$ matrix with $G_{mn} = y_m y_n x_m^\top x_n$, and $\mathbf{1}$ is a vector of 1s
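To make this concrete, here is a small NumPy sketch (my illustration, not from the slides; the toy data X and labels y are made up) of forming the matrix $G$ that appears in both duals:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # 8 toy examples, D = 3 features
y = np.where(X[:, 0] > 0, 1, -1)     # toy labels in {-1, +1}

# G[m, n] = y_m * y_n * (x_m . x_n)
G = np.outer(y, y) * (X @ X.T)
print(G.shape)                       # (8, 8); G is symmetric
```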

SVM Dual Formulation: A Geometric View

Convex Hull Interpretation†: Solving the SVM dual is equivalent to finding the shortest line segment connecting the convex hulls of the two classes (the SVM's hyperplane is the perpendicular bisector of this segment)

†See: “Duality and Geometry in SVM Classifiers” by Bennett and Bredensteiner

Loss Function Minimization View of SVM

Recall that we want, for each training example: $y_n(w^\top x_n + b) \ge 1 - \xi_n$

We can think of our loss as the sum of the slacks $\xi_n \ge 0$:

$$\ell(w, b) = \sum_{n=1}^N \ell_n(w, b) = \sum_{n=1}^N \xi_n = \sum_{n=1}^N \max\{0,\, 1 - y_n(w^\top x_n + b)\}$$

This is called the "Hinge Loss". We can also learn SVMs by minimizing this loss via stochastic sub-gradient descent (and can also add a regularizer on $w$, e.g., $\ell_2$), as in the sketch below.

Recall that the Perceptron also minimizes a similar loss function:

$$\ell(w, b) = \sum_{n=1}^N \ell_n(w, b) = \sum_{n=1}^N \max\{0,\, -y_n(w^\top x_n + b)\}$$

Perceptron, SVM, and Logistic Regression all minimize convex approximations of the 0-1 loss (optimizing the 0-1 loss directly is NP-hard; moreover, it is non-convex/non-smooth)
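A minimal sketch of this loss-minimization view (mine, not the slides' implementation): stochastic sub-gradient descent on the $\ell_2$-regularized hinge loss in NumPy. The hyperparameter values are arbitrary placeholders.

```python
import numpy as np

def svm_hinge_sgd(X, y, lam=0.01, lr=0.01, epochs=50):
    """Minimize (lam/2)*||w||^2 + (1/N) * sum_n max(0, 1 - y_n*(w^T x_n + b)) via SGD."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(epochs):
        for n in np.random.permutation(N):
            margin = y[n] * (X[n] @ w + b)
            # Sub-gradient of the hinge term is nonzero only when the margin is violated
            if margin < 1:
                w -= lr * (lam * w - y[n] * X[n])
                b -= lr * (-y[n])
            else:
                w -= lr * (lam * w)
    return w, b
```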

Learning with Kernels

Linear Models

Linear models (e.g., linear regression, linear SVM) are nice and interpretable but have limitations: they can't learn "difficult" nonlinear patterns.

Reason: Linear models rely on “linear” notions of similarity/distance

$$\text{Sim}(x_n, x_m) = x_n^\top x_m \qquad\qquad \text{Dist}(x_n, x_m) = (x_n - x_m)^\top (x_n - x_m)$$

.. which wouldn’t work well if the patterns we want to learn are nonlinear

Kernels to the Rescue

Kernels, using a feature mapping φ, map data to a new space where the original learning problem becomes “easy” (e.g., a linear model can be applied)

Aha, nice! But we might have two potential issues here: constructing these mappings can be expensive, especially when the new space is very high dimensional, and storing and using these mappings in later computations can also be expensive (e.g., we may need to compute inner products in very high dimensional spaces). Kernels side-step these issues by defining an "implicit" feature map.

Kernels as (Implicit) Feature Maps

Consider two data points $x = \{x_1, x_2\}$ and $z = \{z_1, z_2\}$ (each in 2 dims). Suppose we have a function $k$ which takes $x$ and $z$ as inputs and computes

$$k(x, z) = (x^\top z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)^\top (z_1^2, \sqrt{2}\, z_1 z_2, z_2^2) = \phi(x)^\top \phi(z) \quad \text{(an inner product)}$$

The above $k$ implicitly defines a mapping $\phi$ to a higher dimensional space, $\phi(x) = \{x_1^2, \sqrt{2}\, x_1 x_2, x_2^2\}$, and computes the inner-product based similarity $\phi(x)^\top \phi(z)$ in that space. We didn't need to pre-define/compute the mapping $\phi$ to compute $k(x, z)$. The function $k$ is known as the kernel function. Also, evaluating $k$ is almost as fast as computing inner products.
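A quick numerical sanity check of this identity (an illustration I added, not part of the slides): the implicit kernel evaluation and the explicit 3-dimensional feature map give the same number.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the quadratic kernel (x^T z)^2 in 2 dims."""
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = (x @ z) ** 2          # kernel evaluated in the original 2-d space
k_explicit = phi(x) @ phi(z)       # inner product in the 3-d feature space
print(k_implicit, k_explicit)      # both print 1.0
```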

Kernel Functions

Any kernel function $k$ implicitly defines an associated feature mapping $\phi$. $\phi$ takes an input $x \in \mathcal{X}$ (the input space) and maps it to $\mathcal{F}$ (a new "feature space").

The kernel function $k$ can be seen as taking two points as inputs and computing their inner-product based similarity in the $\mathcal{F}$ space:

$$\phi: \mathcal{X} \to \mathcal{F} \qquad\qquad k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \quad k(x, z) = \phi(x)^\top \phi(z)$$

$\mathcal{F}$ needs to be a vector space with a dot product defined on it, also called a Hilbert Space.

Is any function $k$ with $k(x, z) = \phi(x)^\top \phi(z)$ for some $\phi$ a kernel function? No. The function $k$ must satisfy Mercer's Condition.

Kernel Functions

For $k$ to be a kernel function, $k$ must define a dot product for some Hilbert Space $\mathcal{F}$. The above holds if $k$ is a symmetric and positive semi-definite (p.s.d.) function (though there are exceptions; there are also "indefinite" kernels). The function $k$ is p.s.d. if the following holds

$$\iint f(x)\, k(x, z)\, f(z)\, dx\, dz \ge 0 \qquad (\forall f \in L_2)$$

... for all functions $f$ that are "square integrable", i.e., $\int f(x)^2\, dx < \infty$. This is Mercer's Condition.

Let $k_1$, $k_2$ be two kernel functions; then the following are kernel functions as well:

$k(x, z) = k_1(x, z) + k_2(x, z)$: direct sum

$k(x, z) = \alpha k_1(x, z)$, with $\alpha > 0$: scalar product

$k(x, z) = k_1(x, z)\, k_2(x, z)$: direct product

Kernels can also be constructed by composing these rules, as illustrated in the check below.
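A small numerical illustration of these closure rules (my addition, not from the slides): on a random sample, the Gram matrices built from the sum, positive scaling, and elementwise product of two kernels remain symmetric and p.s.d.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                       # 20 random points in 3 dims

def gram(kernel, X):
    """N x N matrix of pairwise kernel evaluations."""
    return np.array([[kernel(x, z) for z in X] for x in X])

k1 = lambda x, z: x @ z                            # linear kernel
k2 = lambda x, z: (1.0 + x @ z) ** 2               # quadratic kernel

combos = [lambda x, z: k1(x, z) + k2(x, z),        # direct sum
          lambda x, z: 0.5 * k1(x, z),             # scalar product (alpha > 0)
          lambda x, z: k1(x, z) * k2(x, z)]        # direct product
for k in combos:
    K = gram(k, X)
    # Symmetric, and smallest eigenvalue >= 0 up to numerical round-off
    print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-8)
```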

Machine Learning (CS771A) Nonlinear Learning with Kernels 11 Quadratic Kernel: k(x, z) = (x >z)2 or (1 + x >z)2 Kernel (of degree d): k(x, z) = (x >z)d or (1 + x >z)d Radial Function (RBF) of “Gaussian” Kernel: k(x, z) = exp[−γ||x − z||2] γ is a hyperparameter (also called the kernel bandwidth) The RBF kernel corresponds to an infinite dimensional feature space F (i.e., you can’t actually write down or store the map φ(x) explicitly) Also called “stationary kernel”: only depends on the distance between x and z (translating both by the same amount won’t change the value of k(x, z)) Kernel hyperparameters (e.g., d, γ) need to be chosen via cross-validation

Some Examples of Kernel Functions

Linear (trivial) Kernel: k(x, z) = x >z (mapping function φ is identity)

Machine Learning (CS771A) Nonlinear Learning with Kernels 12 Polynomial Kernel (of degree d): k(x, z) = (x >z)d or (1 + x >z)d Radial Basis Function (RBF) of “Gaussian” Kernel: k(x, z) = exp[−γ||x − z||2] γ is a hyperparameter (also called the kernel bandwidth) The RBF kernel corresponds to an infinite dimensional feature space F (i.e., you can’t actually write down or store the map φ(x) explicitly) Also called “stationary kernel”: only depends on the distance between x and z (translating both by the same amount won’t change the value of k(x, z)) Kernel hyperparameters (e.g., d, γ) need to be chosen via cross-validation

Some Examples of Kernel Functions

Linear (trivial) Kernel: k(x, z) = x >z (mapping function φ is identity) Quadratic Kernel: k(x, z) = (x >z)2 or (1 + x >z)2

Machine Learning (CS771A) Nonlinear Learning with Kernels 12 Radial Basis Function (RBF) of “Gaussian” Kernel: k(x, z) = exp[−γ||x − z||2] γ is a hyperparameter (also called the kernel bandwidth) The RBF kernel corresponds to an infinite dimensional feature space F (i.e., you can’t actually write down or store the map φ(x) explicitly) Also called “stationary kernel”: only depends on the distance between x and z (translating both by the same amount won’t change the value of k(x, z)) Kernel hyperparameters (e.g., d, γ) need to be chosen via cross-validation

Some Examples of Kernel Functions

Linear (trivial) Kernel: k(x, z) = x >z (mapping function φ is identity) Quadratic Kernel: k(x, z) = (x >z)2 or (1 + x >z)2 Polynomial Kernel (of degree d): k(x, z) = (x >z)d or (1 + x >z)d

Machine Learning (CS771A) Nonlinear Learning with Kernels 12 The RBF kernel corresponds to an infinite dimensional feature space F (i.e., you can’t actually write down or store the map φ(x) explicitly) Also called “stationary kernel”: only depends on the distance between x and z (translating both by the same amount won’t change the value of k(x, z)) Kernel hyperparameters (e.g., d, γ) need to be chosen via cross-validation

Some Examples of Kernel Functions

Linear (trivial) Kernel: k(x, z) = x >z (mapping function φ is identity) Quadratic Kernel: k(x, z) = (x >z)2 or (1 + x >z)2 Polynomial Kernel (of degree d): k(x, z) = (x >z)d or (1 + x >z)d Radial Basis Function (RBF) of “Gaussian” Kernel: k(x, z) = exp[−γ||x − z||2] γ is a hyperparameter (also called the kernel bandwidth)

Machine Learning (CS771A) Nonlinear Learning with Kernels 12 Also called “stationary kernel”: only depends on the distance between x and z (translating both by the same amount won’t change the value of k(x, z)) Kernel hyperparameters (e.g., d, γ) need to be chosen via cross-validation

Some Examples of Kernel Functions

Linear (trivial) Kernel: k(x, z) = x >z (mapping function φ is identity) Quadratic Kernel: k(x, z) = (x >z)2 or (1 + x >z)2 Polynomial Kernel (of degree d): k(x, z) = (x >z)d or (1 + x >z)d Radial Basis Function (RBF) of “Gaussian” Kernel: k(x, z) = exp[−γ||x − z||2] γ is a hyperparameter (also called the kernel bandwidth) The RBF kernel corresponds to an infinite dimensional feature space F (i.e., you can’t actually write down or store the map φ(x) explicitly)

Machine Learning (CS771A) Nonlinear Learning with Kernels 12 Kernel hyperparameters (e.g., d, γ) need to be chosen via cross-validation

Some Examples of Kernel Functions

Linear (trivial) Kernel: k(x, z) = x >z (mapping function φ is identity) Quadratic Kernel: k(x, z) = (x >z)2 or (1 + x >z)2 Polynomial Kernel (of degree d): k(x, z) = (x >z)d or (1 + x >z)d Radial Basis Function (RBF) of “Gaussian” Kernel: k(x, z) = exp[−γ||x − z||2] γ is a hyperparameter (also called the kernel bandwidth) The RBF kernel corresponds to an infinite dimensional feature space F (i.e., you can’t actually write down or store the map φ(x) explicitly) Also called “stationary kernel”: only depends on the distance between x and z (translating both by the same amount won’t change the value of k(x, z))

Some Examples of Kernel Functions

Linear (trivial) Kernel: $k(x, z) = x^\top z$ (the mapping function $\phi$ is the identity)

Quadratic Kernel: $k(x, z) = (x^\top z)^2$ or $(1 + x^\top z)^2$

Polynomial Kernel (of degree $d$): $k(x, z) = (x^\top z)^d$ or $(1 + x^\top z)^d$

Radial Basis Function (RBF) or "Gaussian" Kernel: $k(x, z) = \exp[-\gamma \|x - z\|^2]$. Here $\gamma$ is a hyperparameter (also called the kernel bandwidth). The RBF kernel corresponds to an infinite dimensional feature space $\mathcal{F}$ (i.e., you can't actually write down or store the map $\phi(x)$ explicitly). It is also called a "stationary kernel": it only depends on the distance between $x$ and $z$ (translating both by the same amount won't change the value of $k(x, z)$).

Kernel hyperparameters (e.g., $d$, $\gamma$) need to be chosen via cross-validation.
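For reference, these kernels are one-liners in NumPy (a sketch I added; the default hyperparameter values are arbitrary):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, d=3, c=1.0):
    # c = 0 gives (x^T z)^d, c = 1 gives (1 + x^T z)^d
    return (c + x @ z) ** d

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z, d=2), rbf_kernel(x, z, gamma=0.1))
```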

RBF Kernel = Infinite Dimensional Mapping

We saw that the RBF/Gaussian kernel is defined as

$$k(x, z) = \exp[-\gamma \|x - z\|^2]$$

Using this kernel corresponds to mapping the data to an infinite dimensional space. This is explained below. For simplicity, let's take $x$ and $z$ to be scalars (i.e., each example in the original space is defined by a single feature) and also assume $\gamma = 1$:

$$k(x, z) = \exp[-(x - z)^2] = \exp(-x^2)\,\exp(-z^2)\,\exp(2xz) = \exp(-x^2)\,\exp(-z^2) \sum_{k=0}^{\infty} \frac{2^k x^k z^k}{k!} = \phi(x)^\top \phi(z)$$

(the constants $2^k$ and $k!$ are subsumed in $\phi$)

This shows that $\phi(x)$ and $\phi(z)$ are infinite dimensional vectors.
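A numerical check of this expansion (my addition, with scalar inputs and $\gamma = 1$): truncating the infinite feature map after a few terms already reproduces the kernel value.

```python
import numpy as np
from math import factorial

def rbf_scalar(x, z):
    return np.exp(-(x - z) ** 2)                 # gamma = 1, scalar inputs

def phi_truncated(x, K=25):
    """First K coordinates of the infinite-dimensional feature map."""
    return np.array([np.exp(-x ** 2) * np.sqrt(2.0 ** k / factorial(k)) * x ** k
                     for k in range(K)])

x, z = 0.3, -0.7
print(rbf_scalar(x, z))                          # exact kernel value
print(phi_truncated(x) @ phi_truncated(z))       # approaches it as K grows
```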

The Kernel Matrix

The kernel function k defines the Kernel Matrix K over the data

Given $N$ examples $\{x_1, \ldots, x_N\}$, the $(i, j)$-th entry of $K$ is defined as:

$$K_{ij} = k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$$

$K_{ij}$: similarity between the $i$-th and $j$-th example in the feature space $\mathcal{F}$. $K$: an $N \times N$ matrix of pairwise similarities between examples in the $\mathcal{F}$ space.

$K$ is a symmetric and positive definite matrix. For a P.D. matrix: $z^\top K z > 0,\ \forall z \in \mathbb{R}^N$ (also, all eigenvalues are positive). The Kernel Matrix $K$ is also known as the Gram Matrix.
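A vectorized sketch (my addition) of building the Gram matrix for the RBF kernel and checking these properties on toy data:

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """N x N Gram matrix with K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq_dists)

X = np.random.default_rng(1).normal(size=(5, 2))
K = rbf_gram(X)
print(np.allclose(K, K.T))                   # symmetric
print(np.linalg.eigvalsh(K).min() > 0)       # all eigenvalues positive
```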

Using Kernels

Kernels can turn a linear model into a nonlinear one. Recall: the kernel $k(x, z)$ represents a dot product in some high dimensional feature space $\mathcal{F}$.

Any learning model in which, during training and test, the inputs only appear as dot products ($x_i^\top x_j$) can be kernelized (i.e., made nonlinear) by replacing the $x_i^\top x_j$ terms by $\phi(x_i)^\top \phi(x_j) = k(x_i, x_j)$.

Most learning algorithms can be easily kernelized: distance based methods, the Perceptron, SVMs, linear regression, etc. Many unsupervised learning algorithms can be kernelized too (e.g., K-means clustering, Principal Component Analysis, etc.).

Let's look at two examples: Kernelized SVM and Kernelized Ridge Regression.

Example 1: Kernel (Nonlinear) SVM

Kernelized SVM Training

Recall the soft-margin SVM dual problem:

$$\text{Soft-Margin SVM:} \quad \max_{0 \le \alpha \le C}\; L_D(\alpha) = \alpha^\top \mathbf{1} - \frac{1}{2}\alpha^\top G \alpha \quad \text{s.t.} \quad \sum_{n=1}^N \alpha_n y_n = 0$$

... where we had defined $G_{mn} = y_m y_n x_m^\top x_n$. We can simply replace $G_{mn} = y_m y_n x_m^\top x_n$ by $y_m y_n K_{mn}$, where $K_{mn} = k(x_m, x_n) = \phi(x_m)^\top \phi(x_n)$ for a suitable kernel function $k$. The problem can then be solved just like the linear SVM case. The new SVM learns a linear separator in the kernel-induced feature space $\mathcal{F}$, which corresponds to a non-linear separator in the original space $\mathcal{X}$.
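To emphasize how little changes, here is a sketch (mine, with made-up toy data) of the only modification: the matrix handed to the dual solver is built from $K$ instead of from raw inner products. The dual QP solver itself is assumed to exist elsewhere and is not shown.

```python
import numpy as np

def rbf_gram(X, gamma=0.5):
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                        # toy inputs
y = np.array([1, 1, 1, -1, -1, -1])                # toy labels

G_linear = np.outer(y, y) * (X @ X.T)              # linear SVM dual matrix
G_kernel = np.outer(y, y) * rbf_gram(X)            # kernelized dual matrix
# Either matrix is passed to the same dual QP solver; only G changes.
```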

Kernelized SVM Prediction

Prediction for a new test example $x$ (assume $b = 0$):

$$y = \text{sign}(w^\top x) = \text{sign}\Big(\sum_{n=1}^N \alpha_n y_n\, x_n^\top x\Big) = \text{sign}\Big(\sum_{n=1}^N \alpha_n y_n\, k(x_n, x)\Big)$$

Note that the SVM weight vector for the kernelized case can be written as

$$w = \sum_{n=1}^N \alpha_n y_n\, \phi(x_n) = \sum_{n=1}^N \alpha_n y_n\, k(x_n, \cdot)$$

Note: for the above representation, $w$ can be stored explicitly as a vector only if the feature map $\phi(\cdot)$ of the kernel $k$ can be written explicitly. In general, kernelized SVMs have to store the training data (at least the support vectors, for which the $\alpha_n$'s are nonzero) even at test time. Thus the prediction-time cost of a kernel SVM scales linearly in $N$.

For the unkernelized version, $w = \sum_{n=1}^N \alpha_n y_n x_n$ can be computed and stored as a $D \times 1$ vector. Thus the training data need not be stored, and the prediction cost is constant w.r.t. $N$ ($w^\top x$ can be computed in $O(D)$ time).
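The prediction rule above translates directly into a few lines (a sketch I added; `X_sv`, `y_sv`, `alpha_sv` are hypothetical outputs of the dual solver):

```python
import numpy as np

def kernel_svm_predict(x_new, X_sv, y_sv, alpha_sv, kernel):
    """sign( sum_n alpha_n * y_n * k(x_n, x_new) ), summed over the support vectors."""
    scores = np.array([kernel(x_n, x_new) for x_n in X_sv])
    return np.sign(np.sum(alpha_sv * y_sv * scores))

# Example call with a quadratic kernel; X_sv, y_sv, alpha_sv are assumed to come
# from solving the dual (only examples with alpha_n > 0 need to be kept).
quad = lambda a, b: (1.0 + a @ b) ** 2
# y_hat = kernel_svm_predict(x_new, X_sv, y_sv, alpha_sv, quad)
```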

Example 2: Kernel (Nonlinear) Ridge Regression

Machine Learning (CS771A) Nonlinear Learning with Kernels 20 N > X = X α = αnx n n=1 > −1 −1 where α = (XX + λIN ) y = (K + λIN ) y is an N × 1 vector of dual variables, and > Knm = x n x m PN Note: w = n=1 αnx n is known as “dual” form of ridge regression solution. However, so far it is still a linear model. But now it is easily kernelizable.

Ridge Regression: Revisited

Recall the ridge regression problem N X > 2 > w = arg min (yn − w x n) + λw w w n=1 The solution to this problem was N N X > X > −1 > w = ( x nx n + λID )( ynx n) =( X X + λID ) X y n=1 n=1 Inputs don’t appear as inner-products here . They actually are. :-) Matrix inversion lemma: (FH−1G − E)−1FH−1 = E−1F(GE−1F − H)−1 The lemma allows us to rewrite w as

> > −1 w = X (XX + λIN ) y

Machine Learning (CS771A) Nonlinear Learning with Kernels 20 > −1 −1 where α = (XX + λIN ) y = (K + λIN ) y is an N × 1 vector of dual variables, and > Knm = x n x m PN Note: w = n=1 αnx n is known as “dual” form of ridge regression solution. However, so far it is still a linear model. But now it is easily kernelizable.

Ridge Regression: Revisited

Recall the ridge regression problem N X > 2 > w = arg min (yn − w x n) + λw w w n=1 The solution to this problem was N N X > X > −1 > w = ( x nx n + λID )( ynx n) =( X X + λID ) X y n=1 n=1 Inputs don’t appear as inner-products here . They actually are. :-) Matrix inversion lemma: (FH−1G − E)−1FH−1 = E−1F(GE−1F − H)−1 The lemma allows us to rewrite w as N > > −1 > X w = X (XX + λIN ) y = X α = αnx n n=1

Machine Learning (CS771A) Nonlinear Learning with Kernels 20 PN Note: w = n=1 αnx n is known as “dual” form of ridge regression solution. However, so far it is still a linear model. But now it is easily kernelizable.

Ridge Regression: Revisited

Recall the ridge regression problem N X > 2 > w = arg min (yn − w x n) + λw w w n=1 The solution to this problem was N N X > X > −1 > w = ( x nx n + λID )( ynx n) =( X X + λID ) X y n=1 n=1 Inputs don’t appear as inner-products here . They actually are. :-) Matrix inversion lemma: (FH−1G − E)−1FH−1 = E−1F(GE−1F − H)−1 The lemma allows us to rewrite w as N > > −1 > X w = X (XX + λIN ) y = X α = αnx n n=1 > −1 −1 where α = (XX + λIN ) y = (K + λIN ) y is an N × 1 vector of dual variables, and > Knm = x n x m

Ridge Regression: Revisited

Recall the ridge regression problem
$$w = \arg\min_w \sum_{n=1}^N (y_n - w^\top x_n)^2 + \lambda w^\top w$$

The solution to this problem was
$$w = \Big(\sum_{n=1}^N x_n x_n^\top + \lambda I_D\Big)^{-1}\Big(\sum_{n=1}^N y_n x_n\Big) = (X^\top X + \lambda I_D)^{-1} X^\top y$$

Inputs don't appear as inner products here... or do they? They actually do. :-)

Matrix inversion lemma: $(FH^{-1}G - E)^{-1}FH^{-1} = E^{-1}F(GE^{-1}F - H)^{-1}$

The lemma allows us to rewrite $w$ as
$$w = X^\top (XX^\top + \lambda I_N)^{-1} y = X^\top \alpha = \sum_{n=1}^N \alpha_n x_n$$
where $\alpha = (XX^\top + \lambda I_N)^{-1} y = (K + \lambda I_N)^{-1} y$ is an $N \times 1$ vector of dual variables, and $K_{nm} = x_n^\top x_m$.

Note: $w = \sum_{n=1}^N \alpha_n x_n$ is known as the "dual" form of the ridge regression solution. So far it is still a linear model, but it is now easily kernelizable.
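
A quick NumPy sanity check of this identity on toy data (the shapes and regularization value below are illustrative): with the plain linear kernel $K = XX^\top$, the primal and dual forms should produce the same weight vector.

import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 50, 5, 0.1
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)

# Primal form: w = (X^T X + lambda I_D)^{-1} X^T y
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Dual form: alpha = (X X^T + lambda I_N)^{-1} y = (K + lambda I_N)^{-1} y, then w = X^T alpha
K = X @ X.T
alpha = np.linalg.solve(K + lam * np.eye(N), y)
w_dual = X.T @ alpha

print(np.allclose(w_primal, w_dual))   # True, up to numerical precision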

Kernel (Nonlinear) Ridge Regression

With the dual form $w = \sum_{n=1}^N \alpha_n x_n$, we can kernelize ridge regression.

Choosing some kernel $k$ with an associated feature map $\phi$, we can write
$$w = \sum_{n=1}^N \alpha_n \phi(x_n) = \sum_{n=1}^N \alpha_n k(x_n, \cdot)$$
where $\alpha = (K + \lambda I_N)^{-1} y$ and $K_{nm} = \phi(x_n)^\top \phi(x_m) = k(x_n, x_m)$.

Prediction for a new test input $x$ will be
$$y = w^\top \phi(x) = \sum_{n=1}^N \alpha_n \phi(x_n)^\top \phi(x) = \sum_{n=1}^N \alpha_n k(x_n, x)$$

Thus, using the kernel, we effectively learn a nonlinear regression model.

Note: Just as in kernel SVM, the prediction cost scales with $N$.
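
Putting the two steps together, a minimal kernel ridge regression sketch in NumPy with an RBF kernel (the toy 1-D data, bandwidth gamma, and regularization lam are illustrative assumptions):

import numpy as np

def rbf_kernel_matrix(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def fit_kernel_ridge(X, y, lam=0.1, gamma=1.0):
    # alpha = (K + lambda I_N)^{-1} y
    K = rbf_kernel_matrix(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_kernel_ridge(X_train, alpha, X_test, gamma=1.0):
    # y(x) = sum_n alpha_n k(x_n, x), evaluated for every test point
    return rbf_kernel_matrix(X_test, X_train, gamma) @ alpha

# Toy usage: fit a nonlinear (sine) function from noisy samples
X = np.linspace(-3, 3, 100)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(100)
alpha = fit_kernel_ridge(X, y)
y_hat = predict_kernel_ridge(X, alpha, X)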

The Representer Theorem

Theorem. Let $\mathcal{X}$ be some vector space and let $\phi(\cdot)$ be a map from $\mathcal{X}$ to $\mathcal{F}$, a Reproducing Kernel Hilbert Space† (RKHS) associated with the kernel $k(\cdot,\cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Let $f$ be some function defined in this RKHS, let $\ell(\cdot,\cdot)$ be a loss function, and let $R(\cdot)$ be some non-decreasing function. Then the minimizer $\hat{f}$ of the regularized loss
$$\sum_{n=1}^N \ell(y_n, f(x_n)) + R(\|f\|^2)$$
can be written as
$$\hat{f} = \sum_{n=1}^N \alpha_n \phi(x_n) = \sum_{n=1}^N \alpha_n k(x_n, \cdot)$$
where $\alpha_n \in \mathbb{R}$, $1 \le n \le N$.

Thus the solution (the minimizer of our regularized loss function) lies in the span of the inputs (mapped to $\mathcal{F}$). We already saw this in the case of SVM and kernel ridge regression; the property is, however, more generally applicable.

† For a short intro, see: "From Zero to Reproducing Kernel Hilbert Spaces in Twelve Pages or Less" by Hal Daumé III
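
As a quick worked instantiation: taking the squared loss $\ell(y_n, f(x_n)) = (y_n - f(x_n))^2$ and the regularizer $R(\|f\|^2) = \lambda \|f\|^2$ recovers kernel ridge regression. The theorem says $\hat{f}(x) = \sum_{n=1}^N \alpha_n k(x_n, x)$; substituting this form into the objective gives
$$\|y - K\alpha\|^2 + \lambda\, \alpha^\top K \alpha$$
whose minimizer is $\alpha = (K + \lambda I_N)^{-1} y$, exactly the kernel ridge regression solution seen earlier.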

Learning with Kernels: Some Aspects

Choice of the right kernel is important:
- Some kernels (e.g., RBF) work well for many problems, but the kernel's hyperparameters may need to be tuned via cross-validation (see the sketch after this slide)

There is a huge literature on learning the right kernel from data:
- Learning a combination of multiple kernels (Multiple Kernel Learning)
- Bayesian kernel methods (e.g., Gaussian Processes) can learn the kernel hyperparameters from data (and can thus be seen as learning the kernel)

Various other alternatives exist for learning the "right" data representation:
- Adaptive basis functions (learn the basis functions and the weights from data):
  $$f(x) = \sum_{m=1}^M w_m \phi_m(x)$$
- Various methods can be seen as learning adaptive basis functions, e.g., (deep) neural networks, mixtures of experts, decision trees, etc.

Kernel methods use a "fixed" set of basis functions or "landmarks": the basis functions are the training data points themselves (also see the next slide).
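
A hedged sketch of the cross-validation tuning mentioned above, assuming scikit-learn is available (the toy dataset and grid values are purely illustrative):

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_moons

# Toy nonlinearly separable data
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Grid-search the soft-margin parameter C and the RBF bandwidth gamma with 5-fold CV
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)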

Kernels: Viewed as Defining Fixed Basis Functions

Consider each row (or column) of the $N \times N$ kernel matrix (it's symmetric).

For each input $x_n$, we can define the following $N$-dimensional vector:
$$K(n,:) = [\,k(x_n, x_1)\;\; k(x_n, x_2)\;\; \ldots\;\; k(x_n, x_N)\,]$$

Can think of this as a new feature vector (with $N$ features) for the input $x_n$: each feature represents the similarity of $x_n$ to one of the inputs. Thus these new features are basically defined in terms of the similarities of each input to a fixed set of basis points or "landmarks" $x_1, x_2, \ldots, x_N$.

In general, the set of basis points or landmarks can be any set of points (not necessarily the data points) and can even be learned (which is what adaptive basis function methods basically do).
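
A small NumPy sketch of this view (the RBF kernel and the choice of landmarks are illustrative): each input is re-represented by its similarities to a set of landmark points, and any linear model can then be trained on these new features.

import numpy as np

def rbf(a, b, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2)
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_features(X, landmarks, gamma=1.0):
    # Row n is [k(x_n, z_1), ..., k(x_n, z_M)] for landmarks z_1, ..., z_M
    return np.array([[rbf(x, z, gamma) for z in landmarks] for x in X])

X = np.random.randn(20, 2)
# Using the training points themselves as landmarks reproduces the N x N kernel matrix
Phi = kernel_features(X, landmarks=X)    # shape (20, 20); Phi[n] is K(n, :)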

Learning with Kernels: Some Aspects (Contd.)

Storage/computational efficiency can be a bottleneck when using kernels:
- Need to store the kernel matrix even at test time. This can be an issue (remember that $K$ is an $N \times N$ matrix, which can be very massive)
- Test-time prediction can be slow, because making a prediction requires computing $\sum_{n=1}^N \alpha_n k(x_n, x)$. It can be made faster if very few $\alpha_n$'s are nonzero

There is a huge literature on speeding up kernel methods:
- Approximating the kernel matrix using a set of kernel-derived new features
- Identifying a small set of landmark points in the training data (see the sketch after this slide)
- ... and a lot more
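
One simple instance of the landmark idea above, sketched in NumPy under illustrative assumptions (RBF kernel, M = 20 randomly chosen landmarks, ridge on the kernel-derived features): both storage and prediction cost now scale with M rather than N.

import numpy as np

def rbf_matrix(A, B, gamma=1.0):
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(1000)

M, lam = 20, 0.1
landmarks = X[rng.choice(len(X), size=M, replace=False)]    # M << N landmark points

Phi = rbf_matrix(X, landmarks)                               # N x M features instead of N x N
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

# Predicting for a new input now needs only M kernel evaluations, not N
x_new = rng.standard_normal((1, 5))
y_new = rbf_matrix(x_new, landmarks) @ w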

Kernels: Concluding Notes

Kernels give a modular way to learn nonlinear patterns using linear models:
- All you need to do is replace the inner products with the kernel
- All the computations remain as efficient as in the original space

A very general notion of similarity: can define similarities between objects even though they can't be represented as vectors.

Many kernels are tailor-made for specific types of data:
- Strings (string kernels): DNA matching, text classification, etc.
- Trees (tree kernels): comparing parse trees of phrases/sentences
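
As a toy illustration of a kernel on non-vector data, a minimal p-spectrum string kernel (counting shared length-p substrings; the choice p = 3 is arbitrary) might look like this:

from collections import Counter

def spectrum_kernel(s, t, p=3):
    # k(s, t) = sum over all length-p substrings u of count_s(u) * count_t(u)
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(cs[u] * ct[u] for u in cs if u in ct)

print(spectrum_kernel("GATTACA", "ATTACCA"))   # 3: the shared 3-mers are ATT, TTA, TAC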
