Nonlinear Learning with Kernels

Piyush Rai

Machine Learning (CS771A)

Aug 26, 2016

Recap

Support Vector Machines

We looked at hard-margin and soft-margin SVMs.

The dual objectives for the hard-margin and soft-margin SVM:

Hard-Margin SVM: $\max_{\alpha \ge 0}\; L_D(\alpha) = \alpha^\top \mathbf{1} - \frac{1}{2}\alpha^\top G \alpha \quad \text{s.t.} \quad \sum_{n=1}^N \alpha_n y_n = 0$

Soft-Margin SVM: $\max_{0 \le \alpha \le C}\; L_D(\alpha) = \alpha^\top \mathbf{1} - \frac{1}{2}\alpha^\top G \alpha \quad \text{s.t.} \quad \sum_{n=1}^N \alpha_n y_n = 0$

where $G$ is an $N \times N$ matrix with $G_{mn} = y_m y_n x_m^\top x_n$, and $\mathbf{1}$ is a vector of 1s
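To make this concrete, here is a small NumPy sketch (my illustration, not from the slides; the toy data X and labels y are made up) of forming the matrix $G$ that appears in both duals:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # 8 toy examples, D = 3 features
y = np.where(X[:, 0] > 0, 1, -1)     # toy labels in {-1, +1}

# G[m, n] = y_m * y_n * (x_m . x_n)
G = np.outer(y, y) * (X @ X.T)
print(G.shape)                       # (8, 8); G is symmetric
```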

SVM Dual Formulation: A Geometric View

Convex Hull Interpretation†: Solving the SVM dual is equivalent to finding the shortest line segment connecting the convex hulls of the two classes (the SVM's hyperplane is the perpendicular bisector of this segment)

†See: “Duality and Geometry in SVM Classifiers” by Bennett and Bredensteiner

Loss Function Minimization View of SVM

Recall that we want, for each training example: $y_n(w^\top x_n + b) \ge 1 - \xi_n$

We can think of our loss as the sum of the slacks $\xi_n \ge 0$:

$$\ell(w, b) = \sum_{n=1}^N \ell_n(w, b) = \sum_{n=1}^N \xi_n = \sum_{n=1}^N \max\{0,\, 1 - y_n(w^\top x_n + b)\}$$

This is called the "Hinge Loss". We can also learn SVMs by minimizing this loss via stochastic sub-gradient descent (and can also add a regularizer on $w$, e.g., $\ell_2$), as in the sketch below.

Recall that the Perceptron also minimizes a similar loss function:

$$\ell(w, b) = \sum_{n=1}^N \ell_n(w, b) = \sum_{n=1}^N \max\{0,\, -y_n(w^\top x_n + b)\}$$

Perceptron, SVM, and Logistic Regression all minimize convex approximations of the 0-1 loss (optimizing the 0-1 loss directly is NP-hard; moreover, it is non-convex/non-smooth)
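A minimal sketch of this loss-minimization view (mine, not the slides' implementation): stochastic sub-gradient descent on the $\ell_2$-regularized hinge loss in NumPy. The hyperparameter values are arbitrary placeholders.

```python
import numpy as np

def svm_hinge_sgd(X, y, lam=0.01, lr=0.01, epochs=50):
    """Minimize (lam/2)*||w||^2 + (1/N) * sum_n max(0, 1 - y_n*(w^T x_n + b)) via SGD."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(epochs):
        for n in np.random.permutation(N):
            margin = y[n] * (X[n] @ w + b)
            # Sub-gradient of the hinge term is nonzero only when the margin is violated
            if margin < 1:
                w -= lr * (lam * w - y[n] * X[n])
                b -= lr * (-y[n])
            else:
                w -= lr * (lam * w)
    return w, b
```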

Learning with Kernels

Linear Models

Linear models (e.g., linear regression, linear SVM) are nice and interpretable but have limitations: they can't learn "difficult" nonlinear patterns.

Reason: Linear models rely on “linear” notions of similarity/distance

$$\text{Sim}(x_n, x_m) = x_n^\top x_m \qquad\qquad \text{Dist}(x_n, x_m) = (x_n - x_m)^\top (x_n - x_m)$$

.. which wouldn’t work well if the patterns we want to learn are nonlinear

Kernels to the Rescue

Kernels, using a feature mapping φ, map data to a new space where the original learning problem becomes “easy” (e.g., a linear model can be applied)

Aha, nice! But we might have two potential issues here: constructing these mappings can be expensive, especially when the new space is very high dimensional, and storing and using these mappings in later computations can also be expensive (e.g., we may need to compute inner products in very high dimensional spaces). Kernels side-step these issues by defining an "implicit" feature map.

Kernels as (Implicit) Feature Maps

Consider two data points $x = \{x_1, x_2\}$ and $z = \{z_1, z_2\}$ (each in 2 dims). Suppose we have a function $k$ which takes $x$ and $z$ as inputs and computes

$$k(x, z) = (x^\top z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)^\top (z_1^2, \sqrt{2}\, z_1 z_2, z_2^2) = \phi(x)^\top \phi(z) \quad \text{(an inner product)}$$

The above $k$ implicitly defines a mapping $\phi$ to a higher dimensional space, $\phi(x) = \{x_1^2, \sqrt{2}\, x_1 x_2, x_2^2\}$, and computes the inner-product based similarity $\phi(x)^\top \phi(z)$ in that space. We didn't need to pre-define/compute the mapping $\phi$ to compute $k(x, z)$. The function $k$ is known as the kernel function. Also, evaluating $k$ is almost as fast as computing inner products.
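A quick numerical sanity check of this identity (an illustration I added, not part of the slides): the implicit kernel evaluation and the explicit 3-dimensional feature map give the same number.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the quadratic kernel (x^T z)^2 in 2 dims."""
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = (x @ z) ** 2          # kernel evaluated in the original 2-d space
k_explicit = phi(x) @ phi(z)       # inner product in the 3-d feature space
print(k_implicit, k_explicit)      # both print 1.0
```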

Kernel Functions

Any kernel function $k$ implicitly defines an associated feature mapping $\phi$. $\phi$ takes an input $x \in \mathcal{X}$ (the input space) and maps it to $\mathcal{F}$ (a new "feature space").

The kernel function $k$ can be seen as taking two points as inputs and computing their inner-product based similarity in the $\mathcal{F}$ space:

$$\phi: \mathcal{X} \to \mathcal{F} \qquad\qquad k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \quad k(x, z) = \phi(x)^\top \phi(z)$$

$\mathcal{F}$ needs to be a vector space with a dot product defined on it, also called a Hilbert Space.

Is any function $k$ with $k(x, z) = \phi(x)^\top \phi(z)$ for some $\phi$ a kernel function? No. The function $k$ must satisfy Mercer's Condition.

Kernel Functions

For $k$ to be a kernel function, $k$ must define a dot product for some Hilbert Space $\mathcal{F}$. The above holds if $k$ is a symmetric and positive semi-definite (p.s.d.) function (though there are exceptions; there are also "indefinite" kernels). The function $k$ is p.s.d. if the following holds

$$\iint f(x)\, k(x, z)\, f(z)\, dx\, dz \ge 0 \qquad (\forall f \in L_2)$$

... for all functions $f$ that are "square integrable", i.e., $\int f(x)^2\, dx < \infty$. This is Mercer's Condition.

Let $k_1$, $k_2$ be two kernel functions; then the following are kernel functions as well:

$k(x, z) = k_1(x, z) + k_2(x, z)$: direct sum

$k(x, z) = \alpha k_1(x, z)$, with $\alpha > 0$: scalar product

$k(x, z) = k_1(x, z)\, k_2(x, z)$: direct product

Kernels can also be constructed by composing these rules, as illustrated in the check below.
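A small numerical illustration of these closure rules (my addition, not from the slides): on a random sample, the Gram matrices built from the sum, positive scaling, and elementwise product of two kernels remain symmetric and p.s.d.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                       # 20 random points in 3 dims

def gram(kernel, X):
    """N x N matrix of pairwise kernel evaluations."""
    return np.array([[kernel(x, z) for z in X] for x in X])

k1 = lambda x, z: x @ z                            # linear kernel
k2 = lambda x, z: (1.0 + x @ z) ** 2               # quadratic kernel

combos = [lambda x, z: k1(x, z) + k2(x, z),        # direct sum
          lambda x, z: 0.5 * k1(x, z),             # scalar product (alpha > 0)
          lambda x, z: k1(x, z) * k2(x, z)]        # direct product
for k in combos:
    K = gram(k, X)
    # Symmetric, and smallest eigenvalue >= 0 up to numerical round-off
    print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-8)
```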

Machine Learning (CS771A) Nonlinear Learning with Kernels 11 Quadratic Kernel: k(x, z) = (x >z)2 or (1 + x >z)2 Kernel (of degree d): k(x, z) = (x >z)d or (1 + x >z)d Radial Function (RBF) of “Gaussian” Kernel: k(x, z) = exp[−γ||x − z||2] γ is a hyperparameter (also called the kernel bandwidth) The RBF kernel corresponds to an infinite dimensional feature space F (i.e., you can’t actually write down or store the map φ(x) explicitly) Also called “stationary kernel”: only depends on the distance between x and z (translating both by the same amount won’t change the value of k(x, z)) Kernel hyperparameters (e.g., d, γ) need to be chosen via cross-validation

Some Examples of Kernel Functions

Linear (trivial) Kernel: k(x, z) = x >z (mapping function φ is identity)

Machine Learning (CS771A) Nonlinear Learning with Kernels 12 Polynomial Kernel (of degree d): k(x, z) = (x >z)d or (1 + x >z)d Radial Basis Function (RBF) of “Gaussian” Kernel: k(x, z) = exp[−γ||x − z||2] γ is a hyperparameter (also called the kernel bandwidth) The RBF kernel corresponds to an infinite dimensional feature space F (i.e., you can’t actually write down or store the map φ(x) explicitly) Also called “stationary kernel”: only depends on the distance between x and z (translating both by the same amount won’t change the value of k(x, z)) Kernel hyperparameters (e.g., d, γ) need to be chosen via cross-validation

Some Examples of Kernel Functions

Linear (trivial) Kernel: k(x, z) = x >z (mapping function φ is identity) Quadratic Kernel: k(x, z) = (x >z)2 or (1 + x >z)2

Machine Learning (CS771A) Nonlinear Learning with Kernels 12 Radial Basis Function (RBF) of “Gaussian” Kernel: k(x, z) = exp[−γ||x − z||2] γ is a hyperparameter (also called the kernel bandwidth) The RBF kernel corresponds to an infinite dimensional feature space F (i.e., you can’t actually write down or store the map φ(x) explicitly) Also called “stationary kernel”: only depends on the distance between x and z (translating both by the same amount won’t change the value of k(x, z)) Kernel hyperparameters (e.g., d, γ) need to be chosen via cross-validation

Some Examples of Kernel Functions

Linear (trivial) Kernel: k(x, z) = x >z (mapping function φ is identity) Quadratic Kernel: k(x, z) = (x >z)2 or (1 + x >z)2 Polynomial Kernel (of degree d): k(x, z) = (x >z)d or (1 + x >z)d

Machine Learning (CS771A) Nonlinear Learning with Kernels 12 The RBF kernel corresponds to an infinite dimensional feature space F (i.e., you can’t actually write down or store the map φ(x) explicitly) Also called “stationary kernel”: only depends on the distance between x and z (translating both by the same amount won’t change the value of k(x, z)) Kernel hyperparameters (e.g., d, γ) need to be chosen via cross-validation

Some Examples of Kernel Functions

Linear (trivial) Kernel: k(x, z) = x >z (mapping function φ is identity) Quadratic Kernel: k(x, z) = (x >z)2 or (1 + x >z)2 Polynomial Kernel (of degree d): k(x, z) = (x >z)d or (1 + x >z)d Radial Basis Function (RBF) of “Gaussian” Kernel: k(x, z) = exp[−γ||x − z||2] γ is a hyperparameter (also called the kernel bandwidth)

Machine Learning (CS771A) Nonlinear Learning with Kernels 12 Also called “stationary kernel”: only depends on the distance between x and z (translating both by the same amount won’t change the value of k(x, z)) Kernel hyperparameters (e.g., d, γ) need to be chosen via cross-validation

Some Examples of Kernel Functions

Linear (trivial) Kernel: k(x, z) = x >z (mapping function φ is identity) Quadratic Kernel: k(x, z) = (x >z)2 or (1 + x >z)2 Polynomial Kernel (of degree d): k(x, z) = (x >z)d or (1 + x >z)d Radial Basis Function (RBF) of “Gaussian” Kernel: k(x, z) = exp[−γ||x − z||2] γ is a hyperparameter (also called the kernel bandwidth) The RBF kernel corresponds to an infinite dimensional feature space F (i.e., you can’t actually write down or store the map φ(x) explicitly)

Machine Learning (CS771A) Nonlinear Learning with Kernels 12 Kernel hyperparameters (e.g., d, γ) need to be chosen via cross-validation

Some Examples of Kernel Functions

Linear (trivial) Kernel: k(x, z) = x >z (mapping function φ is identity) Quadratic Kernel: k(x, z) = (x >z)2 or (1 + x >z)2 Polynomial Kernel (of degree d): k(x, z) = (x >z)d or (1 + x >z)d Radial Basis Function (RBF) of “Gaussian” Kernel: k(x, z) = exp[−γ||x − z||2] γ is a hyperparameter (also called the kernel bandwidth) The RBF kernel corresponds to an infinite dimensional feature space F (i.e., you can’t actually write down or store the map φ(x) explicitly) Also called “stationary kernel”: only depends on the distance between x and z (translating both by the same amount won’t change the value of k(x, z))

Some Examples of Kernel Functions

Linear (trivial) Kernel: $k(x, z) = x^\top z$ (the mapping function $\phi$ is the identity)

Quadratic Kernel: $k(x, z) = (x^\top z)^2$ or $(1 + x^\top z)^2$

Polynomial Kernel (of degree $d$): $k(x, z) = (x^\top z)^d$ or $(1 + x^\top z)^d$

Radial Basis Function (RBF) or "Gaussian" Kernel: $k(x, z) = \exp[-\gamma \|x - z\|^2]$. Here $\gamma$ is a hyperparameter (also called the kernel bandwidth). The RBF kernel corresponds to an infinite dimensional feature space $\mathcal{F}$ (i.e., you can't actually write down or store the map $\phi(x)$ explicitly). It is also called a "stationary kernel": it only depends on the distance between $x$ and $z$ (translating both by the same amount won't change the value of $k(x, z)$).

Kernel hyperparameters (e.g., $d$, $\gamma$) need to be chosen via cross-validation.
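For reference, these kernels are one-liners in NumPy (a sketch I added; the default hyperparameter values are arbitrary):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, d=3, c=1.0):
    # c = 0 gives (x^T z)^d, c = 1 gives (1 + x^T z)^d
    return (c + x @ z) ** d

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z, d=2), rbf_kernel(x, z, gamma=0.1))
```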

RBF Kernel = Infinite Dimensional Mapping

We saw that the RBF/Gaussian kernel is defined as

$$k(x, z) = \exp[-\gamma \|x - z\|^2]$$

Using this kernel corresponds to mapping the data to an infinite dimensional space. This is explained below. For simplicity, let's take $x$ and $z$ to be scalars (i.e., each example in the original space is defined by a single feature) and also assume $\gamma = 1$:

$$k(x, z) = \exp[-(x - z)^2] = \exp(-x^2)\,\exp(-z^2)\,\exp(2xz) = \exp(-x^2)\,\exp(-z^2) \sum_{k=0}^{\infty} \frac{2^k x^k z^k}{k!} = \phi(x)^\top \phi(z)$$

(the constants $2^k$ and $k!$ are subsumed in $\phi$)

This shows that $\phi(x)$ and $\phi(z)$ are infinite dimensional vectors.
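A numerical check of this expansion (my addition, with scalar inputs and $\gamma = 1$): truncating the infinite feature map after a few terms already reproduces the kernel value.

```python
import numpy as np
from math import factorial

def rbf_scalar(x, z):
    return np.exp(-(x - z) ** 2)                 # gamma = 1, scalar inputs

def phi_truncated(x, K=25):
    """First K coordinates of the infinite-dimensional feature map."""
    return np.array([np.exp(-x ** 2) * np.sqrt(2.0 ** k / factorial(k)) * x ** k
                     for k in range(K)])

x, z = 0.3, -0.7
print(rbf_scalar(x, z))                          # exact kernel value
print(phi_truncated(x) @ phi_truncated(z))       # approaches it as K grows
```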

The Kernel Matrix

The kernel function k defines the Kernel Matrix K over the data

Given $N$ examples $\{x_1, \ldots, x_N\}$, the $(i, j)$-th entry of $K$ is defined as:

$$K_{ij} = k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$$

$K_{ij}$: similarity between the $i$-th and $j$-th example in the feature space $\mathcal{F}$. $K$: an $N \times N$ matrix of pairwise similarities between examples in the $\mathcal{F}$ space.

$K$ is a symmetric and positive definite matrix. For a P.D. matrix: $z^\top K z > 0,\ \forall z \in \mathbb{R}^N$ (also, all eigenvalues are positive). The Kernel Matrix $K$ is also known as the Gram Matrix.
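A vectorized sketch (my addition) of building the Gram matrix for the RBF kernel and checking these properties on toy data:

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """N x N Gram matrix with K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq_dists)

X = np.random.default_rng(1).normal(size=(5, 2))
K = rbf_gram(X)
print(np.allclose(K, K.T))                   # symmetric
print(np.linalg.eigvalsh(K).min() > 0)       # all eigenvalues positive
```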

Using Kernels

Kernels can turn a linear model into a nonlinear one. Recall: the kernel $k(x, z)$ represents a dot product in some high dimensional feature space $\mathcal{F}$.

Any learning model in which, during training and test, the inputs only appear as dot products ($x_i^\top x_j$) can be kernelized (i.e., made nonlinear) by replacing the $x_i^\top x_j$ terms by $\phi(x_i)^\top \phi(x_j) = k(x_i, x_j)$.

Most learning algorithms can be easily kernelized: distance based methods, the Perceptron, SVMs, linear regression, etc. Many unsupervised learning algorithms can be kernelized too (e.g., K-means clustering, Principal Component Analysis, etc.).

Let's look at two examples: Kernelized SVM and Kernelized Ridge Regression.

Example 1: Kernel (Nonlinear) SVM

Kernelized SVM Training

Recall the soft-margin SVM dual problem:

$$\text{Soft-Margin SVM:} \quad \max_{0 \le \alpha \le C}\; L_D(\alpha) = \alpha^\top \mathbf{1} - \frac{1}{2}\alpha^\top G \alpha \quad \text{s.t.} \quad \sum_{n=1}^N \alpha_n y_n = 0$$

... where we had defined $G_{mn} = y_m y_n x_m^\top x_n$. We can simply replace $G_{mn} = y_m y_n x_m^\top x_n$ by $y_m y_n K_{mn}$, where $K_{mn} = k(x_m, x_n) = \phi(x_m)^\top \phi(x_n)$ for a suitable kernel function $k$. The problem can then be solved just like the linear SVM case. The new SVM learns a linear separator in the kernel-induced feature space $\mathcal{F}$, which corresponds to a non-linear separator in the original space $\mathcal{X}$.
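To emphasize how little changes, here is a sketch (mine, with made-up toy data) of the only modification: the matrix handed to the dual solver is built from $K$ instead of from raw inner products. The dual QP solver itself is assumed to exist elsewhere and is not shown.

```python
import numpy as np

def rbf_gram(X, gamma=0.5):
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                        # toy inputs
y = np.array([1, 1, 1, -1, -1, -1])                # toy labels

G_linear = np.outer(y, y) * (X @ X.T)              # linear SVM dual matrix
G_kernel = np.outer(y, y) * rbf_gram(X)            # kernelized dual matrix
# Either matrix is passed to the same dual QP solver; only G changes.
```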

Kernelized SVM Prediction

Prediction for a new test example $x$ (assume $b = 0$):

$$y = \text{sign}(w^\top x) = \text{sign}\Big(\sum_{n=1}^N \alpha_n y_n\, x_n^\top x\Big) = \text{sign}\Big(\sum_{n=1}^N \alpha_n y_n\, k(x_n, x)\Big)$$

Note that the SVM weight vector for the kernelized case can be written as

$$w = \sum_{n=1}^N \alpha_n y_n\, \phi(x_n) = \sum_{n=1}^N \alpha_n y_n\, k(x_n, \cdot)$$

Note: for the above representation, $w$ can be stored explicitly as a vector only if the feature map $\phi(\cdot)$ of the kernel $k$ can be written explicitly. In general, kernelized SVMs have to store the training data (at least the support vectors, for which the $\alpha_n$'s are nonzero) even at test time. Thus the prediction-time cost of a kernel SVM scales linearly in $N$.

For the unkernelized version, $w = \sum_{n=1}^N \alpha_n y_n x_n$ can be computed and stored as a $D \times 1$ vector. Thus the training data need not be stored, and the prediction cost is constant w.r.t. $N$ ($w^\top x$ can be computed in $O(D)$ time).
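The prediction rule above translates directly into a few lines (a sketch I added; `X_sv`, `y_sv`, `alpha_sv` are hypothetical outputs of the dual solver):

```python
import numpy as np

def kernel_svm_predict(x_new, X_sv, y_sv, alpha_sv, kernel):
    """sign( sum_n alpha_n * y_n * k(x_n, x_new) ), summed over the support vectors."""
    scores = np.array([kernel(x_n, x_new) for x_n in X_sv])
    return np.sign(np.sum(alpha_sv * y_sv * scores))

# Example call with a quadratic kernel; X_sv, y_sv, alpha_sv are assumed to come
# from solving the dual (only examples with alpha_n > 0 need to be kept).
quad = lambda a, b: (1.0 + a @ b) ** 2
# y_hat = kernel_svm_predict(x_new, X_sv, y_sv, alpha_sv, quad)
```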

Example 2: Kernel (Nonlinear) Ridge Regression

Machine Learning (CS771A) Nonlinear Learning with Kernels 20 N > X = X α = αnx n n=1 > −1 −1 where α = (XX + λIN ) y = (K + λIN ) y is an N × 1 vector of dual variables, and > Knm = x n x m PN Note: w = n=1 αnx n is known as “dual” form of ridge regression solution. However, so far it is still a linear model. But now it is easily kernelizable.

Ridge Regression: Revisited

Recall the ridge regression problem N X > 2 > w = arg min (yn − w x n) + λw w w n=1 The solution to this problem was N N X > X > −1 > w = ( x nx n + λID )( ynx n) =( X X + λID ) X y n=1 n=1 Inputs don’t appear as inner-products here . They actually are. :-) Matrix inversion lemma: (FH−1G − E)−1FH−1 = E−1F(GE−1F − H)−1 The lemma allows us to rewrite w as

> > −1 w = X (XX + λIN ) y

Machine Learning (CS771A) Nonlinear Learning with Kernels 20 > −1 −1 where α = (XX + λIN ) y = (K + λIN ) y is an N × 1 vector of dual variables, and > Knm = x n x m PN Note: w = n=1 αnx n is known as “dual” form of ridge regression solution. However, so far it is still a linear model. But now it is easily kernelizable.

Ridge Regression: Revisited

Recall the ridge regression problem N X > 2 > w = arg min (yn − w x n) + λw w w n=1 The solution to this problem was N N X > X > −1 > w = ( x nx n + λID )( ynx n) =( X X + λID ) X y n=1 n=1 Inputs don’t appear as inner-products here . They actually are. :-) Matrix inversion lemma: (FH−1G − E)−1FH−1 = E−1F(GE−1F − H)−1 The lemma allows us to rewrite w as N > > −1 > X w = X (XX + λIN ) y = X α = αnx n n=1

Machine Learning (CS771A) Nonlinear Learning with Kernels 20 PN Note: w = n=1 αnx n is known as “dual” form of ridge regression solution. However, so far it is still a linear model. But now it is easily kernelizable.

Ridge Regression: Revisited

Recall the ridge regression problem N X > 2 > w = arg min (yn − w x n) + λw w w n=1 The solution to this problem was N N X > X > −1 > w = ( x nx n + λID )( ynx n) =( X X + λID ) X y n=1 n=1 Inputs don’t appear as inner-products here . They actually are. :-) Matrix inversion lemma: (FH−1G − E)−1FH−1 = E−1F(GE−1F − H)−1 The lemma allows us to rewrite w as N > > −1 > X w = X (XX + λIN ) y = X α = αnx n n=1 > −1 −1 where α = (XX + λIN ) y = (K + λIN ) y is an N × 1 vector of dual variables, and > Knm = x n x m

Ridge Regression: Revisited

Recall the ridge regression problem
$$w = \arg\min_w \sum_{n=1}^N (y_n - w^\top x_n)^2 + \lambda w^\top w$$

The solution to this problem was
$$w = \Big(\sum_{n=1}^N x_n x_n^\top + \lambda I_D\Big)^{-1}\Big(\sum_{n=1}^N y_n x_n\Big) = (X^\top X + \lambda I_D)^{-1} X^\top y$$

Inputs don't appear as inner products here... or do they? They actually do. :-)

Matrix inversion lemma: $(FH^{-1}G - E)^{-1}FH^{-1} = E^{-1}F(GE^{-1}F - H)^{-1}$

The lemma allows us to rewrite $w$ as
$$w = X^\top (XX^\top + \lambda I_N)^{-1} y = X^\top \alpha = \sum_{n=1}^N \alpha_n x_n$$
where $\alpha = (XX^\top + \lambda I_N)^{-1} y = (K + \lambda I_N)^{-1} y$ is an $N \times 1$ vector of dual variables, and $K_{nm} = x_n^\top x_m$.

Note: $w = \sum_{n=1}^N \alpha_n x_n$ is known as the "dual" form of the ridge regression solution. So far it is still a linear model, but it is now easily kernelizable.
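
A quick NumPy sanity check of this identity on toy data (the shapes and regularization value below are illustrative): with the plain linear kernel $K = XX^\top$, the primal and dual forms should produce the same weight vector.

import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 50, 5, 0.1
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)

# Primal form: w = (X^T X + lambda I_D)^{-1} X^T y
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Dual form: alpha = (X X^T + lambda I_N)^{-1} y = (K + lambda I_N)^{-1} y, then w = X^T alpha
K = X @ X.T
alpha = np.linalg.solve(K + lam * np.eye(N), y)
w_dual = X.T @ alpha

print(np.allclose(w_primal, w_dual))   # True, up to numerical precision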

Kernel (Nonlinear) Ridge Regression

With the dual form $w = \sum_{n=1}^N \alpha_n x_n$, we can kernelize ridge regression.

Choosing some kernel $k$ with an associated feature map $\phi$, we can write
$$w = \sum_{n=1}^N \alpha_n \phi(x_n) = \sum_{n=1}^N \alpha_n k(x_n, \cdot)$$
where $\alpha = (K + \lambda I_N)^{-1} y$ and $K_{nm} = \phi(x_n)^\top \phi(x_m) = k(x_n, x_m)$.

Prediction for a new test input $x$ will be
$$y = w^\top \phi(x) = \sum_{n=1}^N \alpha_n \phi(x_n)^\top \phi(x) = \sum_{n=1}^N \alpha_n k(x_n, x)$$

Thus, using the kernel, we effectively learn a nonlinear regression model.

Note: Just as in kernel SVM, the prediction cost scales with $N$.
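
Putting the two steps together, a minimal kernel ridge regression sketch in NumPy with an RBF kernel (the toy 1-D data, bandwidth gamma, and regularization lam are illustrative assumptions):

import numpy as np

def rbf_kernel_matrix(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def fit_kernel_ridge(X, y, lam=0.1, gamma=1.0):
    # alpha = (K + lambda I_N)^{-1} y
    K = rbf_kernel_matrix(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_kernel_ridge(X_train, alpha, X_test, gamma=1.0):
    # y(x) = sum_n alpha_n k(x_n, x), evaluated for every test point
    return rbf_kernel_matrix(X_test, X_train, gamma) @ alpha

# Toy usage: fit a nonlinear (sine) function from noisy samples
X = np.linspace(-3, 3, 100)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(100)
alpha = fit_kernel_ridge(X, y)
y_hat = predict_kernel_ridge(X, alpha, X)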

The Representer Theorem

Theorem. Let $\mathcal{X}$ be some vector space and let $\phi(\cdot)$ be a map from $\mathcal{X}$ to $\mathcal{F}$, a Reproducing Kernel Hilbert Space† (RKHS) associated with the kernel $k(\cdot,\cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Let $f$ be some function defined in this RKHS, let $\ell(\cdot,\cdot)$ be a loss function, and let $R(\cdot)$ be some non-decreasing function. Then the minimizer $\hat{f}$ of the regularized loss
$$\sum_{n=1}^N \ell(y_n, f(x_n)) + R(\|f\|^2)$$
can be written as
$$\hat{f} = \sum_{n=1}^N \alpha_n \phi(x_n) = \sum_{n=1}^N \alpha_n k(x_n, \cdot)$$
where $\alpha_n \in \mathbb{R}$, $1 \le n \le N$.

Thus the solution (the minimizer of our regularized loss function) lies in the span of the inputs (mapped to $\mathcal{F}$). We already saw this in the case of SVM and kernel ridge regression; the property is, however, more generally applicable.

† For a short intro, see: "From Zero to Reproducing Kernel Hilbert Spaces in Twelve Pages or Less" by Hal Daumé III
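
As a quick worked instantiation: taking the squared loss $\ell(y_n, f(x_n)) = (y_n - f(x_n))^2$ and the regularizer $R(\|f\|^2) = \lambda \|f\|^2$ recovers kernel ridge regression. The theorem says $\hat{f}(x) = \sum_{n=1}^N \alpha_n k(x_n, x)$; substituting this form into the objective gives
$$\|y - K\alpha\|^2 + \lambda\, \alpha^\top K \alpha$$
whose minimizer is $\alpha = (K + \lambda I_N)^{-1} y$, exactly the kernel ridge regression solution seen earlier.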

Learning with Kernels: Some Aspects

Choice of the right kernel is important:
- Some kernels (e.g., RBF) work well for many problems, but the kernel's hyperparameters may need to be tuned via cross-validation (see the sketch after this slide)

There is a huge literature on learning the right kernel from data:
- Learning a combination of multiple kernels (Multiple Kernel Learning)
- Bayesian kernel methods (e.g., Gaussian Processes) can learn the kernel hyperparameters from data (and can thus be seen as learning the kernel)

Various other alternatives exist for learning the "right" data representation:
- Adaptive basis functions (learn the basis functions and the weights from data):
  $$f(x) = \sum_{m=1}^M w_m \phi_m(x)$$
- Various methods can be seen as learning adaptive basis functions, e.g., (deep) neural networks, mixtures of experts, decision trees, etc.

Kernel methods use a "fixed" set of basis functions or "landmarks": the basis functions are the training data points themselves (also see the next slide).
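
A hedged sketch of the cross-validation tuning mentioned above, assuming scikit-learn is available (the toy dataset and grid values are purely illustrative):

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_moons

# Toy nonlinearly separable data
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Grid-search the soft-margin parameter C and the RBF bandwidth gamma with 5-fold CV
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)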

Kernels: Viewed as Defining Fixed Basis Functions

Consider each row (or column) of the $N \times N$ kernel matrix (it's symmetric).

For each input $x_n$, we can define the following $N$-dimensional vector:
$$K(n,:) = [\,k(x_n, x_1)\;\; k(x_n, x_2)\;\; \ldots\;\; k(x_n, x_N)\,]$$

Can think of this as a new feature vector (with $N$ features) for the input $x_n$: each feature represents the similarity of $x_n$ to one of the inputs. Thus these new features are basically defined in terms of the similarities of each input to a fixed set of basis points or "landmarks" $x_1, x_2, \ldots, x_N$.

In general, the set of basis points or landmarks can be any set of points (not necessarily the data points) and can even be learned (which is what adaptive basis function methods basically do).
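
A small NumPy sketch of this view (the RBF kernel and the choice of landmarks are illustrative): each input is re-represented by its similarities to a set of landmark points, and any linear model can then be trained on these new features.

import numpy as np

def rbf(a, b, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2)
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_features(X, landmarks, gamma=1.0):
    # Row n is [k(x_n, z_1), ..., k(x_n, z_M)] for landmarks z_1, ..., z_M
    return np.array([[rbf(x, z, gamma) for z in landmarks] for x in X])

X = np.random.randn(20, 2)
# Using the training points themselves as landmarks reproduces the N x N kernel matrix
Phi = kernel_features(X, landmarks=X)    # shape (20, 20); Phi[n] is K(n, :)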

Learning with Kernels: Some Aspects (Contd.)

Storage/computational efficiency can be a bottleneck when using kernels:
- Need to store the kernel matrix even at test time. This can be an issue (remember that $K$ is an $N \times N$ matrix, which can be very massive)
- Test-time prediction can be slow, because making a prediction requires computing $\sum_{n=1}^N \alpha_n k(x_n, x)$. It can be made faster if very few $\alpha_n$'s are nonzero

There is a huge literature on speeding up kernel methods:
- Approximating the kernel matrix using a set of kernel-derived new features
- Identifying a small set of landmark points in the training data (see the sketch after this slide)
- ... and a lot more
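
One simple instance of the landmark idea above, sketched in NumPy under illustrative assumptions (RBF kernel, M = 20 randomly chosen landmarks, ridge on the kernel-derived features): both storage and prediction cost now scale with M rather than N.

import numpy as np

def rbf_matrix(A, B, gamma=1.0):
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(1000)

M, lam = 20, 0.1
landmarks = X[rng.choice(len(X), size=M, replace=False)]    # M << N landmark points

Phi = rbf_matrix(X, landmarks)                               # N x M features instead of N x N
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

# Predicting for a new input now needs only M kernel evaluations, not N
x_new = rng.standard_normal((1, 5))
y_new = rbf_matrix(x_new, landmarks) @ w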

Kernels: Concluding Notes

Kernels give a modular way to learn nonlinear patterns using linear models:
- All you need to do is replace the inner products with the kernel
- All the computations remain as efficient as in the original space

A very general notion of similarity: can define similarities between objects even though they can't be represented as vectors.

Many kernels are tailor-made for specific types of data:
- Strings (string kernels): DNA matching, text classification, etc.
- Trees (tree kernels): comparing parse trees of phrases/sentences
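
As a toy illustration of a kernel on non-vector data, a minimal p-spectrum string kernel (counting shared length-p substrings; the choice p = 3 is arbitrary) might look like this:

from collections import Counter

def spectrum_kernel(s, t, p=3):
    # k(s, t) = sum over all length-p substrings u of count_s(u) * count_t(u)
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(cs[u] * ct[u] for u in cs if u in ct)

print(spectrum_kernel("GATTACA", "ATTACCA"))   # 3: the shared 3-mers are ATT, TTA, TAC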
