Kernel Methods

Simple Idea of Data Fitting
Given $(x_i, y_i)$, $i = 1, \dots, n$, where each $x_i$ is of dimension $d$:
- Find the best linear function $w$ (hyperplane) that fits the data
- Two scenarios: $y$ real (regression); $y \in \{-1, 1\}$ (classification)
- Two cases: $n > d$ (least-squares regression, the most likely scenario); $n < d$ (underdetermined)
Primal and Dual
There are two ways to formulate the problem: primal and dual. Both provide deep insight into the problem. The primal formulation is more traditional; the dual leads to newer techniques in SVMs and kernel methods.
Regression

$$w = \arg\min_w \sum_i \Big(y_i - w_0 - \sum_j x_{ij} w_j\Big)^2 = \arg\min_w (y - Xw)^T (y - Xw)$$

$$\frac{d}{dw}(y - Xw)^T (y - Xw) = 0 \;\Rightarrow\; X^T (y - Xw) = 0 \;\Rightarrow\; X^T X w = X^T y \;\Rightarrow\; w = (X^T X)^{-1} X^T y$$

Notation: $w = [w_0, w_1, \dots, w_d]^T$, $x = [1, x_1, \dots, x_d]^T$, $y = [y_1, y_2, \dots, y_n]^T$, and $X$ is the $n \times d$ matrix whose rows are $x_1^T, x_2^T, \dots, x_n^T$. The prediction for a new sample $x$ is

$$\hat{y} = \langle x, (X^T X)^{-1} X^T y \rangle$$
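As a concrete illustration, here is a minimal NumPy sketch of the closed-form solution above; the data, sizes, and coefficients are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
# Leading column of ones models the bias term w0
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
w_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Normal equations: w = (X^T X)^{-1} X^T y
# (np.linalg.lstsq is numerically safer; solve() mirrors the formula)
w = np.linalg.solve(X.T @ X, X.T @ y)

# Prediction for a new sample: y_hat = <x, w>
x_new = np.array([1.0, 0.2, -0.3, 0.1])
print(x_new @ w)
```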
Graphical Interpretation

$$\hat{y} = Xw = Hy, \qquad H = X(X^T X)^{-1} X^T$$
[Figure: $X$ drawn as an $n \times d$ matrix whose columns are features such as FICA and income]

- $X$ is an $n$ (sample size) by $d$ (dimension of data) matrix
- $w$ combines the columns of $X$ to best approximate $y$: combining features (FICA, income, etc.) into decisions (loan)
- $H$ projects $y$ onto the space spanned by the columns of $X$: simplifying the decisions to fit the features
Problem #1
- $n = d$: exact solution
- $n > d$: least squares (the most likely scenario)
- $n < d$: there are not enough constraints to determine the coefficients $w$ uniquely

[Figure: $X$ drawn as an $n \times d$ matrix with $d > n$, alongside the weight vector $w$]
Problem #2
- If different attributes are highly correlated (e.g., income and FICA), the columns of $X$ become dependent
- The coefficients are then poorly determined, with high variance; e.g., a large positive coefficient on one column can be canceled by a similarly large negative coefficient on its correlated cousin
- A size constraint on the coefficients is helpful
- Caveat: the constraint is problem dependent
Ridge Regression
Similar to regularization:

$$w^{ridge} = \arg\min_w \sum_i \Big(y_i - w_0 - \sum_j x_{ij} w_j\Big)^2 + \lambda \sum_j w_j^2 = \arg\min_w (y - Xw)^T (y - Xw) + \lambda w^T w$$

$$\frac{d}{dw}\big[(y - Xw)^T (y - Xw) + \lambda w^T w\big] = 0 \;\Rightarrow\; -X^T (y - Xw) + \lambda w = 0 \;\Rightarrow\; X^T y = (X^T X + \lambda I)\, w$$

$$\Rightarrow\; w^{ridge} = (X^T X + \lambda I)^{-1} X^T y, \qquad \hat{y} = \langle x, (X^T X + \lambda I)^{-1} X^T y \rangle$$
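The closed form again translates directly to code; a minimal sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 1.0  # regularization strength lambda

# Ridge solution: w = (X^T X + lambda I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Prediction for a new sample x: y_hat = <x, w_ridge>
x_new = rng.normal(size=d)
print(x_new @ w_ridge)
```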
Ugly Math

Substituting the SVD $X = U \Sigma V^T$ into $w^{ridge} = (X^T X + \lambda I)^{-1} X^T y$:

$$\hat{y} = X w^{ridge} = X (X^T X + \lambda I)^{-1} X^T y = U \Sigma V^T (V \Sigma^T U^T U \Sigma V^T + \lambda I)^{-1} V \Sigma^T U^T y$$

$$= U \Sigma \big(V (\Sigma^T \Sigma + \lambda I) V^T\big)^{-1} V \Sigma^T U^T y = U \Sigma (\Sigma^T \Sigma + \lambda I)^{-1} \Sigma^T U^T y = \sum_i u_i \frac{\sigma_i^2}{\sigma_i^2 + \lambda} u_i^T y$$
How to Decipher This

$$\hat{y} = \sum_i u_i \frac{\sigma_i^2}{\sigma_i^2 + \lambda} u_i^T y$$
- Red ($u_i$): the best estimate $\hat{y}$ is composed of the columns of $U$ ("basis" features; recall $U$ and $X$ have the same column space)
- Green ($\sigma_i^2 / (\sigma_i^2 + \lambda)$): how these basis columns are weighed
- Blue ($u_i^T y$): the projection of the target $y$ onto these columns
- Together: representing $y$ in a body-fitted coordinate system ($u_i$)
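A quick numerical check of this shrinkage identity, with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 4, 1.0
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Direct ridge prediction on the training set
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
y_hat_direct = X @ w

# Shrinkage form: sum_i u_i * sigma_i^2 / (sigma_i^2 + lam) * (u_i^T y)
U, s, Vt = np.linalg.svd(X, full_matrices=False)  # U: n x d, s: d singular values
shrink = s**2 / (s**2 + lam)
y_hat_svd = U @ (shrink * (U.T @ y))

print(np.allclose(y_hat_direct, y_hat_svd))  # True
```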
Sidebar

Recall that the trace (sum of the diagonal entries) of a matrix is the same as the sum of its eigenvalues. Proof: every matrix has a Jordan form (an upper triangular matrix) in which the eigenvalues appear on the diagonal, so there trace = sum of eigenvalues. The Jordan form results from a similarity transform ($PAP^{-1}$), which changes neither the eigenvalues nor the trace:

$$Ax = \lambda x \;\Rightarrow\; PAx = \lambda Px \;\Rightarrow\; PAP^{-1}(Px) = \lambda (Px) \;\Rightarrow\; A_J y = \lambda y$$

Physical Interpretation
- The singular values of $X$ represent the spread of the data along the different body-fitted dimensions (orthonormal columns)
- To estimate $y$, directions with large spread ($\sigma_i^2 \gg \lambda$) are kept nearly intact, while directions with small spread are shrunk by the factor $\sigma_i^2 / (\sigma_i^2 + \lambda)$
More Details
The trace of $X (X^T X + \lambda I)^{-1} X^T$ is called the effective degrees of freedom; it controls how many eigen modes are actually used, or active:

$$df(\lambda) = \sum_i \frac{\sigma_i^2}{\sigma_i^2 + \lambda}, \qquad df(0) = d, \qquad df(\lambda) \to 0 \text{ as } \lambda \to \infty$$
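A small numerical check that the trace definition and the spectral form of $df(\lambda)$ agree, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 4, 1.0
X = rng.normal(size=(n, d))

# Trace definition: df = trace(X (X^T X + lam I)^{-1} X^T)
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
df_trace = np.trace(H)

# Spectral form: df = sum_i sigma_i^2 / (sigma_i^2 + lam)
s = np.linalg.svd(X, compute_uv=False)
df_svd = np.sum(s**2 / (s**2 + lam))

print(np.isclose(df_trace, df_svd))  # True
```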
Different methods are possible:
- Shrinking smoother: contributions are scaled
- Projection smoother: contributions are either used (1) or not used (0)
Dual Formulation
The weight vector can be expressed as a sum of the $n$ training feature vectors:
For plain least squares:

$$w = (X^T X)^{-1} X^T y = X^T \big[ X (X^T X)^{-2} X^T y \big] = X^T \alpha = \sum_i \alpha_i x_i$$

For ridge regression, starting from $X^T y = (X^T X + \lambda I)\, w$:

$$\lambda w = X^T y - X^T X w = X^T (y - Xw) \;\Rightarrow\; w = \frac{1}{\lambda} X^T (y - Xw) = X^T \alpha = \sum_i \alpha_i x_i, \qquad \alpha = \frac{1}{\lambda} (y - Xw)$$
Dual Formulation (cont.)
$$\alpha = \frac{1}{\lambda}(y - Xw), \qquad w = X^T \alpha$$

$$\Rightarrow\; \lambda \alpha = y - Xw = y - X X^T \alpha \;\Rightarrow\; (X X^T + \lambda I)\, \alpha = y \;\Rightarrow\; \alpha = (X X^T + \lambda I)^{-1} y = (G + \lambda I)^{-1} y$$

where $G = X X^T$ is the $n \times n$ Gram matrix.
$$g(x) = \langle w, x \rangle = \Big\langle \sum_i \alpha_i x_i,\, x \Big\rangle = \sum_i \alpha_i \langle x_i, x \rangle$$

$$g(x) = \big\langle X^T (X X^T + \lambda I)^{-1} y,\; x \big\rangle = y^T (X X^T + \lambda I)^{-1} \begin{bmatrix} \langle x_1, x \rangle \\ \langle x_2, x \rangle \\ \vdots \\ \langle x_n, x \rangle \end{bmatrix}$$

In More Details
Gram matrix
$$g(x) = \begin{bmatrix} y_1 & y_2 & \cdots & y_n \end{bmatrix} \begin{bmatrix} x_1^T x_1 + \lambda & x_1^T x_2 & \cdots & x_1^T x_n \\ x_2^T x_1 & x_2^T x_2 + \lambda & \cdots & x_2^T x_n \\ \vdots & \vdots & \ddots & \vdots \\ x_n^T x_1 & x_n^T x_2 & \cdots & x_n^T x_n + \lambda \end{bmatrix}^{-1} \begin{bmatrix} x_1^T x \\ x_2^T x \\ \vdots \\ x_n^T x \end{bmatrix}$$
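A minimal sketch verifying on synthetic data that the dual prediction matches the primal one:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 5, 0.5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
x_new = rng.normal(size=d)

# Primal: w = (X^T X + lam I)^{-1} X^T y, then g(x) = <w, x>
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
g_primal = w @ x_new

# Dual: alpha = (G + lam I)^{-1} y with Gram matrix G = X X^T,
# then g(x) = sum_i alpha_i <x_i, x>
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
g_dual = alpha @ (X @ x_new)

print(np.isclose(g_primal, g_dual))  # True
```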
Observations
Primal:
- $X^T X$ is $d \times d$
- Training: slow for high feature dimension
- Use: fast, $O(d)$: $g(x) = \langle x, (X^T X + \lambda I)^{-1} X^T y \rangle$

Dual:
- $X X^T$ is $n \times n$; only inner products are involved
- Training: fast for high feature dimension
- Use: slow, $O(nd)$: $n$ inner products to evaluate, each requiring $d$ multiplications: $g(x) = y^T (X X^T + \lambda I)^{-1} \big[\langle x_i, x \rangle\big]_{i=1}^{n}$

Graphical Interpretation

[Figure: the $n \times d$ matrix $X$ times the $d \times n$ matrix $X^T$ gives the $n \times n$ Gram matrix with entries $\langle x_i, x_j \rangle + \lambda \delta_{ij}$]
$$g(x) = \langle w, x \rangle = \sum_i \alpha_i \langle x_i, x \rangle = y^T (X X^T + \lambda I)^{-1} \begin{bmatrix} \langle x_1, x \rangle \\ \langle x_2, x \rangle \\ \vdots \\ \langle x_n, x \rangle \end{bmatrix}$$

One Extreme – Perfectly Uncorrelated
[Figure: when $\langle x_i, x_j \rangle = \delta_{ij}$, the $n \times n$ Gram matrix is diagonal with entries $\langle x_i, x_i \rangle$]

With perfectly uncorrelated (orthonormal) samples, $\langle x_i, x_j \rangle = \delta_{ij}$:

$$g(x) = y^T (X X^T + \lambda I)^{-1} \begin{bmatrix} \langle x_1, x \rangle \\ \vdots \\ \langle x_n, x \rangle \end{bmatrix} = \sum_i \frac{y_i}{\langle x_i, x_i \rangle + \lambda} \langle x_i, x \rangle$$

This is an orthogonal projection – no generalization.
General Case

With $X_{n \times d} = U \Sigma V^T$ ($U$: $n \times d$, $\Sigma$: $d \times d$, $V$: $d \times d$):

$$\hat{y}^T = y^T (X X^T + \lambda I)^{-1} X X^T = y^T (U \Sigma^2 U^T + \lambda I)^{-1} U \Sigma V^T X^T$$
$$= y^T \big(U (\Sigma^2 + \lambda I) U^T\big)^{-1} U \Sigma V^T X^T = y^T U (\Sigma^2 + \lambda I)^{-1} U^T U \Sigma V^T X^T$$
$$= y^T U (\Sigma^2 + \lambda I)^{-1} \Sigma V^T X^T = (U^T y)^T (\Sigma^2 + \lambda I)^{-1} \Sigma V^T V \Sigma U^T$$
$$= (U^T y)^T (\Sigma^2 + \lambda I)^{-1} \Sigma^2 U^T$$
How to interpret this? Does this still make sense?

Physical Meaning of SVD
- Assume that $n > d$, so $X$ is of rank $d$ at most
- $U$ holds the body (data)-fitted axes; $U^T$ is a projection from the $n$-dimensional space to the $d$-dimensional space
- $\Sigma$ gives the importance of those dimensions
- $V$ is the representation of $X$ in the $d$-dimensional space
$$X_{n \times d} = U_{n \times d}\, \Sigma_{d \times d}\, V^T_{d \times d}$$
Interpretation

$$\hat{y} = \sum_i u_i \frac{\sigma_i^2}{\sigma_i^2 + \lambda} u_i^T y, \qquad \hat{y}^T = (U^T y)^T (\Sigma^2 + \lambda I)^{-1} \Sigma^2 U^T$$

- In the new, uncorrelated space there are only $d$ training vectors and $d$ decisions
- Red: the $d \times 1$ uncorrelated decision vector $U^T y$
- Green: weighting of the significance of the components in the uncorrelated decision vector
- Blue: the transformed (uncorrelated) training samples
- Still the same interpretation: similarity measurement in a new space by the Gram matrix, i.e., the inner product of training samples and the new sample
First Important Concept
The computation involves only inner products:
- for the training samples, in computing the Gram matrix
- for the new sample, in computing the regression or classification result

Similarity is measured in terms of angle, instead of distance.
Second Important Concept
- Using angle or distance for similarity measurement doesn't make problems easier or harder: if you cannot separate the data, it doesn't matter what similarity measure you use
- "Massage" the data: transform it into a higher (even infinite) dimensional space, where it becomes "more likely" to be linearly separable (caveat: the choice of kernel function is important)
- We cannot perform the inner product efficiently in such a space; the kernel trick means we do not have to
Why? There are a lot more "basis" features now.
[Figure: the original $n \times d$ data matrix $X$, with raw feature columns (FICA, income, ...), becomes an $n \times D$ matrix with $D \gg d$ mapped feature columns $\varphi_i(\text{FICA}, \text{income}, \dots)$]
Example – XOR
[Figure: with the raw features $x_1, x_2$ ($d = 2$), the data matrix is widened ($D \gg d$) with the product feature $x_1 x_2$]
Example (Doesn't quite work yet)
$$\varphi: x = (x_1, x_2) \to \varphi(x) = (x_1^2, x_2^2, x_1 + x_2) \in F = R^3$$

- We need to keep the nice property that only inner products are required in the computation (dual formulation)
- But what happens if the feature dimension is very high (or even infinitely high)?
- The inner product in the high (even infinitely high) dimensional feature space can be calculated without the explicit mapping, through a kernel function
In More Details
$$g(x) = \langle w, x \rangle = \sum_i \alpha_i \langle x_i, x \rangle = y^T (X X^T + \lambda I)^{-1} \begin{bmatrix} \langle x_1, x \rangle \\ \vdots \\ \langle x_n, x \rangle \end{bmatrix}$$
$$g(x) = \begin{bmatrix} y_1 & \cdots & y_n \end{bmatrix} \begin{bmatrix} \varphi(x_1)^T \varphi(x_1) + \lambda & \varphi(x_1)^T \varphi(x_2) & \cdots & \varphi(x_1)^T \varphi(x_n) \\ \varphi(x_2)^T \varphi(x_1) & \varphi(x_2)^T \varphi(x_2) + \lambda & \cdots & \varphi(x_2)^T \varphi(x_n) \\ \vdots & \vdots & \ddots & \vdots \\ \varphi(x_n)^T \varphi(x_1) & \varphi(x_n)^T \varphi(x_2) & \cdots & \varphi(x_n)^T \varphi(x_n) + \lambda \end{bmatrix}^{-1} \begin{bmatrix} \varphi(x_1)^T \varphi(x) \\ \varphi(x_2)^T \varphi(x) \\ \vdots \\ \varphi(x_n)^T \varphi(x) \end{bmatrix}$$
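Putting the pieces together, a minimal kernel ridge regression sketch in which every inner product $\varphi(x_i)^T \varphi(x_j)$ is replaced by a kernel evaluation; the polynomial kernel and the toy target are arbitrary choices for the example:

```python
import numpy as np

def k(x, z, c=1.0, deg=2):
    """Polynomial kernel: k(x, z) = (<x, z> + c)^deg."""
    return (x @ z + c) ** deg

rng = np.random.default_rng(0)
n, d, lam = 30, 2, 0.1
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + X[:, 1] ** 2          # some nonlinear target

# Gram matrix K[i, j] = k(x_i, x_j); phi is never formed explicitly
K = np.array([[k(xi, xj) for xj in X] for xi in X])
alpha = np.linalg.solve(K + lam * np.eye(n), y)

def g(x_new):
    """Prediction g(x) = sum_i alpha_i k(x_i, x)."""
    return alpha @ np.array([k(xi, x_new) for xi in X])

print(g(np.array([0.3, -0.5])))
```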
Example
$$\varphi: x = (x_1, x_2) \to \varphi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \in F = R^3$$

$$k(x, y) = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + 2 x_1 y_1 x_2 y_2 + x_2^2 y_2^2 = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \cdot (y_1^2, \sqrt{2}\, y_1 y_2, y_2^2) = \varphi(x) \cdot \varphi(y)$$
More Examples
$$\varphi: x = (x_1, x_2) \to \varphi(x) = (x_1^2, x_1 x_2, x_1 x_2, x_2^2) \in F = R^4$$

$$k(x, y) = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + 2 x_1 y_1 x_2 y_2 + x_2^2 y_2^2 = (x_1^2, x_1 x_2, x_1 x_2, x_2^2) \cdot (y_1^2, y_1 y_2, y_1 y_2, y_2^2) = \varphi(x) \cdot \varphi(y)$$
Even More Examples

$$\varphi: x = (x_1, x_2) \to \varphi(x) = \tfrac{1}{\sqrt{2}}\,\big(x_1^2 - x_2^2,\; 2 x_1 x_2,\; x_1^2 + x_2^2\big) \in F = R^3$$

$$k(x, y) = (x_1 y_1 + x_2 y_2)^2 = \tfrac{1}{\sqrt{2}}\big(x_1^2 - x_2^2,\, 2 x_1 x_2,\, x_1^2 + x_2^2\big) \cdot \tfrac{1}{\sqrt{2}}\big(y_1^2 - y_2^2,\, 2 y_1 y_2,\, y_1^2 + y_2^2\big) = \varphi(x) \cdot \varphi(y)$$
The interpretation of the mapping $\varphi$ is not unique, even for a single $\kappa$ function, as the check below illustrates.
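A quick numerical check that the three feature maps above all reproduce the same kernel:

```python
import numpy as np

def k(x, y):
    return (x[0]*y[0] + x[1]*y[1]) ** 2

def phi_a(x):  # R^3 map with the sqrt(2) cross term
    return np.array([x[0]**2, np.sqrt(2)*x[0]*x[1], x[1]**2])

def phi_b(x):  # R^4 map with the cross term duplicated
    return np.array([x[0]**2, x[0]*x[1], x[0]*x[1], x[1]**2])

def phi_c(x):  # rotated R^3 map
    return np.array([x[0]**2 - x[1]**2, 2*x[0]*x[1], x[0]**2 + x[1]**2]) / np.sqrt(2)

x, y = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
for phi in (phi_a, phi_b, phi_c):
    print(np.isclose(phi(x) @ phi(y), k(x, y)))  # True, True, True
```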
Observations
- The interpretation of the mapping $\varphi$ is not unique, even for a single $\kappa$ function
- The $\kappa$ function is special; certainly not all functions have such a property (i.e., correspond to an inner product in some feature space)
- Such functions are called kernel functions: a kernel is a function $\kappa$ such that for all $x, z \in X$, $\kappa(x, z) = \langle \varphi(x), \varphi(z) \rangle$, where $\varphi$ is a mapping from $X$ to an (inner product) feature space $F$
Important Theorem
A function $\kappa: X \times X \to R$ can be decomposed as $\kappa(x, z) = \langle \varphi(x), \varphi(z) \rangle$ (with $\varphi$ mapping into a Hilbert space) if and only if it satisfies the finitely positive semi-definite property.
Finitely positive semi-definite: $\kappa: X \times X \to R$ is symmetric, and for any finite subset of the space $X$, the matrix formed by applying $\kappa$ is positive semi-definite (i.e., the Gram matrix is symmetric PSD for any choice of training samples).
Only-if Condition
Given a function $\kappa: X \times X \to R$ with $\kappa(x, z) = \langle \varphi(x), \varphi(z) \rangle$, the Gram matrix formed from $\kappa$ satisfies the finitely positive semi-definite property:
$$v^T G v = v^T \begin{bmatrix} \kappa(x_1, x_1) & \kappa(x_1, x_2) & \cdots & \kappa(x_1, x_n) \\ \kappa(x_2, x_1) & \kappa(x_2, x_2) & \cdots & \kappa(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ \kappa(x_n, x_1) & \kappa(x_n, x_2) & \cdots & \kappa(x_n, x_n) \end{bmatrix} v = v^T \begin{bmatrix} \varphi(x_1)^T \\ \varphi(x_2)^T \\ \vdots \\ \varphi(x_n)^T \end{bmatrix} \begin{bmatrix} \varphi(x_1) & \varphi(x_2) & \cdots & \varphi(x_n) \end{bmatrix} v$$

$$= \Big\langle \sum_i v_i \varphi(x_i),\; \sum_i v_i \varphi(x_i) \Big\rangle = \Big\| \sum_i v_i \varphi(x_i) \Big\|^2 \geq 0$$

If-Condition Proof Strategy
This direction is more complicated:
- First: establish a Hilbert (function) space
- Second: establish the reproducing property in this function space
- Third: establish the (Fourier) basis of such a function space
- Fourth: establish $\kappa$ as an expansion on such a Fourier basis
What?

$$\varphi: x = (x_1, x_2) \to \varphi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \in F = R^3$$

$$k(x, y) = (x_1 y_1 + x_2 y_2)^2 = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \cdot (y_1^2, \sqrt{2}\, y_1 y_2, y_2^2) = \varphi(x) \cdot \varphi(y)$$
- In this case, $F$ is a 3-dimensional space
- Each "dimension" is a function of $x$
- There are three (not unique) "eigen" functions that form the basis
A Function Space Example

All (well-behaved, square-integrable) functions defined over a domain $R$ form a vector space (a function space $F$):
- $f \in F$ implies $cf \in F$
- $f \in F$ and $g \in F$ imply $(af + bg) \in F$

Such a space is a Hilbert space if it is complete with an inner product (real-valued, symmetric, bilinear):

$$\langle f, g \rangle = \langle g, f \rangle = \int f(x) g(x)\,dx, \qquad \langle f, f \rangle = \int f^2(x)\,dx > 0 \text{ (for } f \neq 0\text{)}$$

You can define an orthogonal basis (e.g., the Fourier basis) on it.

Hilbert Space

The proof is harder and not very intuitive. Suffice it to say that someone has figured out that the desired feature space is a function space of the form

$$F = \Big\{ \sum_{i=1}^{l} \alpha_i k(x_i, \cdot) \;:\; l \in N,\; x_i \in X,\; \alpha_i \in R,\; i = 1, \dots, l \Big\}$$

with an inner product defined, for

$$f = \sum_{i=1}^{l} \alpha_i k(x_i, \cdot), \qquad g = \sum_{i=1}^{m} \beta_i k(z_i, \cdot),$$

as

$$\langle f, g \rangle = \sum_{j=1}^{m} \sum_{i=1}^{l} \alpha_i \beta_j k(x_i, z_j) = \sum_{i=1}^{l} \alpha_i g(x_i) = \sum_{j=1}^{m} \beta_j f(z_j)$$

Why
Because then we have the PSD property regardless of the choice of $x_i$. With $f$ and the inner product as above:

$$\langle f, f \rangle = \sum_{j=1}^{l} \sum_{i=1}^{l} \alpha_i \alpha_j k(x_i, x_j) = \alpha^T K \alpha \geq 0$$

We still have to prove completeness (not here; see page 62 of Shawe-Taylor and Cristianini).
Reproducing Property
This gives a special Hilbert space called a Reproducing Kernel Hilbert Space (RKHS).
Recall:

$$f = \sum_{i=1}^{l} \alpha_i k(x_i, \cdot), \qquad g = \sum_{i=1}^{m} \beta_i k(z_i, \cdot)$$

$$\langle f, g \rangle = \sum_{j=1}^{m} \sum_{i=1}^{l} \alpha_i \beta_j k(x_i, z_j) = \sum_{i=1}^{l} \alpha_i g(x_i) = \sum_{j=1}^{m} \beta_j f(z_j)$$

If we take $g = k(x, \cdot)$:

$$\langle f, g \rangle = \langle f, k(x, \cdot) \rangle = \sum_{i=1}^{l} \alpha_i k(x_i, x) = f(x)$$
Mercer Kernel Theorem
Denote an orthonormal basis of the RKHS with kernel $\kappa$ as $\varphi_i(\cdot)$. The function $\kappa(x, \cdot)$ belongs to this space, so we can expand it on the basis:

$$k(x, \cdot) = \sum_{i=1}^{\infty} \langle k(x, \cdot), \varphi_i(\cdot) \rangle \, \varphi_i(\cdot)$$

Evaluating at $z$:

$$k(x, z) = \sum_{i=1}^{\infty} \langle k(x, \cdot), \varphi_i(\cdot) \rangle \, \varphi_i(z)$$

By the reproducing property, $\langle k(x, \cdot), \varphi_i(\cdot) \rangle = \varphi_i(x)$, hence

$$k(x, z) = \sum_{i=1}^{\infty} \varphi_i(x)\, \varphi_i(z)$$

Practically
- The explicit computation of the feature mapping is not necessary
- Instead, we can compose different $\kappa$'s and manipulate Gram matrices using all kinds of mathematical tricks (kernel design), as long as the finitely positive semi-definite property is preserved
- A research-intensive topic, not covered in detail here
Composition of Kernels
- The space of kernel functions is closed under certain operations, i.e., composing valid kernel functions with such operations results in valid kernels
- This can be proven by showing that the resulting function preserves the finitely positive semi-definite property
- E.g., sums and products of kernels, and multiplication by a positive constant:
$$\alpha^T K_1 \alpha \geq 0, \qquad \alpha^T K_2 \alpha \geq 0$$
$$\alpha^T (c K_1) \alpha = c\, \alpha^T K_1 \alpha \geq 0 \quad \text{if } c > 0$$
$$\alpha^T (K_1 + K_2) \alpha = \alpha^T K_1 \alpha + \alpha^T K_2 \alpha \geq 0$$
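These closure properties are easy to check numerically: the eigenvalues of a PSD matrix are non-negative. A sketch on two small Gram matrices (the kernels and data are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

K1 = (X @ X.T + 1.0) ** 2  # polynomial kernel Gram matrix
K2 = np.exp(-0.5 * np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))  # Gaussian

def is_psd(K, tol=1e-10):
    return np.all(np.linalg.eigvalsh(K) >= -tol)

print(is_psd(K1), is_psd(K2))  # True True
print(is_psd(3.0 * K1))        # scaling by c > 0: True
print(is_psd(K1 + K2))         # sum: True
print(is_psd(K1 * K2))         # elementwise (Schur) product: True
```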
Other Rules
If $k_1$ and $k_2$ are kernels, so are:

$$k(x, z) = k_1(x, z) + k_2(x, z)$$
$$k(x, z) = a\, k_1(x, z), \quad a > 0$$
$$k(x, z) = k_1(x, z)\, k_2(x, z)$$

and consequently:

$$k(x, z) = p(k_1(x, z)) \quad \text{(a polynomial with non-negative coefficients)}$$
$$k(x, z) = \exp(k_1(x, z))$$
$$k(x, z) = \exp\!\big(-\|x - z\|^2 / (2\sigma^2)\big)$$
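The last rule can be seen as a composition of the others: $\exp(-\|x - z\|^2/(2\sigma^2)) = \exp(\langle x, z \rangle / \sigma^2) \cdot f(x) f(z)$ with $f(u) = \exp(-\|u\|^2/(2\sigma^2))$, where $\exp(\langle x, z \rangle / \sigma^2)$ is $\exp$ of a valid kernel and multiplying by $f(x) f(z)$ preserves validity. A numerical check of the factorization:

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)
sigma = 1.5

def f(u):
    return np.exp(-np.sum(u**2) / (2 * sigma**2))

gauss = np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))
# exp(<x,z>/sigma^2) times f(x) f(z) reproduces the Gaussian kernel
factored = np.exp(x @ z / sigma**2) * f(x) * f(z)

print(np.isclose(gauss, factored))  # True
```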
Kernel Shaping
- Adding a constant to all entries: equivalent to adding an extra constant feature
- Adding a constant to the diagonal: ridge regression; drops the smaller features
Example Kernels
- Pattern classification is a hard problem
- Massaging classifiers is difficult, and massaging data (using different kernels) only allocates the complexity differently (you cannot turn an NP problem into a P problem by a magic trick)
- Whether kernel methods work will depend on your kernels
- Some examples are discussed below (there are many more ...)
Polynomial Kernel

$$k(x, z) = p(k_1(x, z)) = (\langle x, z \rangle + c)^d$$

If the feature space is two-dimensional and $d = 2$:

$$(x_1 z_1 + x_2 z_2 + c)^2 = (x_1 z_1 + x_2 z_2)^2 + 2c (x_1 z_1 + x_2 z_2) + c^2$$
$$= x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 + 2c\, x_1 z_1 + 2c\, x_2 z_2 + c^2$$
$$= (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2, \sqrt{2c}\, x_1, \sqrt{2c}\, x_2, c) \cdot (z_1^2, \sqrt{2}\, z_1 z_2, z_2^2, \sqrt{2c}\, z_1, \sqrt{2c}\, z_2, c)$$
$$= \varphi(x) \cdot \varphi(z) = \sum_i \varphi_i(x)\, \varphi_i(z)$$

The feature space is made of all monomials of the form

$$x_1^{i_1} x_2^{i_2} \cdots x_n^{i_n}, \qquad \sum_{j=1}^{n} i_j = s, \; s \leq d$$

The dimension is $\binom{n + d}{d}$.
Instead of calculating so many terms, we can do a simple polynomial evaluation – the beauty of kernel methods (at the price of less control over the weighting of individual monomials), as the sketch below illustrates.
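A minimal illustration of the cost gap: the explicit feature dimension $\binom{n+d}{d}$ grows combinatorially, while the kernel itself is one inner product and one power:

```python
import numpy as np
from math import comb

def poly_kernel(x, z, c=1.0, deg=2):
    # One inner product and one power, independent of the feature-space size
    return (x @ z + c) ** deg

for n, d in [(10, 2), (100, 3), (1000, 5)]:
    print(f"n={n}, d={d}: explicit feature dimension = {comb(n + d, d)}")
# grows from 66 features (n=10, d=2) to ~8.5e12 (n=1000, d=5)

x, z = np.ones(1000), np.ones(1000)
print(poly_kernel(x, z, deg=5))  # still a single O(n) evaluation
```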
All-Subsets Kernel

If there are $n$ features $1, \dots, n$, there is one feature $\varphi_A$ for every subset $A \subseteq \{1, \dots, n\}$:

$$\varphi_A(x) = \prod_{j \in A} x_j$$

The feature space is made of all monomials of the form

$$x_1^{i_1} x_2^{i_2} \cdots x_n^{i_n}, \qquad i_j \in \{0, 1\}, \; 1 \leq j \leq n$$

The dimension is $2^n$.
All-Subsets Kernel (cont.)
Instead of calculating so many terms, we can do a simple polynomial evaluation:

$$\kappa(x, z) = \langle \varphi(x), \varphi(z) \rangle = \sum_{A \subseteq \{1, \dots, n\}} \varphi_A(x)\, \varphi_A(z) = \prod_{i=1}^{n} (1 + x_i z_i)$$

More generally,

$$\kappa(x, z) = \prod_{i=1}^{n} (1 + a_i x_i z_i)$$

with different weights $a_i$ for different features.
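A brute-force check of the product identity against the explicit sum over all $2^n$ subsets (small $n$, made-up vectors):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n = 6
x, z = rng.normal(size=n), rng.normal(size=n)

# Explicit sum over all 2^n subsets A, with phi_A(x) = prod_{j in A} x_j
# (np.prod of an empty array is 1, covering the empty subset)
brute = sum(np.prod(x[list(A)]) * np.prod(z[list(A)])
            for r in range(n + 1)
            for A in combinations(range(n), r))

# Product form: prod_i (1 + x_i z_i)
fast = np.prod(1 + x * z)

print(np.isclose(brute, fast))  # True
```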
ANOVA Kernel
The all-subsets kernel restricted to subsets of a fixed cardinality $d$: if there are $n$ features $1, \dots, n$, then for every $A \subseteq \{1, \dots, n\}$ with $|A| = d$,

$$\varphi_A(x) = \prod_{j \in A} x_j$$

The feature space is made of all monomials $x_1^{i_1} x_2^{i_2} \cdots x_n^{i_n}$ with $i_j \in \{0, 1\}$ and $\sum_j i_j = d$. The dimensionality is $\binom{n}{d}$.

Evaluation is through recursion (dynamic programming).
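A sketch of that recursion, using the standard DP $\kappa_s(m) = \kappa_s(m-1) + x_m z_m\, \kappa_{s-1}(m-1)$ (as in Shawe-Taylor and Cristianini), checked against brute force:

```python
import numpy as np
from itertools import combinations

def anova_kernel(x, z, d):
    """ANOVA kernel of degree d via dynamic programming, O(n*d)."""
    n = len(x)
    # K[s, m] = ANOVA kernel of degree s on the first m features
    K = np.zeros((d + 1, n + 1))
    K[0, :] = 1.0  # degree 0: the empty product
    for s in range(1, d + 1):
        for m in range(1, n + 1):
            K[s, m] = K[s, m - 1] + x[m - 1] * z[m - 1] * K[s - 1, m - 1]
    return K[d, n]

# Brute-force check: sum over all subsets of size d
rng = np.random.default_rng(0)
x, z = rng.normal(size=7), rng.normal(size=7)
brute = sum(np.prod(x[list(A)] * z[list(A)])
            for A in combinations(range(7), 3))
print(np.isclose(anova_kernel(x, z, 3), brute))  # True
```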
Gaussian Kernel
Identical to the radial basis function (RBF):

$$\kappa(x, z) = \langle \varphi(x), \varphi(z) \rangle = \exp\!\Big(-\frac{\|x - z\|^2}{2\sigma^2}\Big)$$

Recall that

$$e^x = \sum_{i=0}^{\infty} \frac{x^i}{i!}$$

so the feature dimension is infinitely high in this case.
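A minimal Gaussian kernel function, which can be dropped into the earlier kernel ridge sketch in place of the polynomial kernel:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2)); implicit infinite-dim features."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))

x = np.array([1.0, 2.0])
print(gaussian_kernel(x, x))                      # 1.0 at zero distance
print(gaussian_kernel(x, np.array([3.0, -1.0])))  # decays with distance
```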
Representing Texts
Bag-of-words model:
- Presence + frequency; ordering, grammatical relations, and phrases are ignored
- Terms: words
- Dictionary: all possible words
- Corpus: all documents

A document is mapped to

$$d \to \varphi(d) = \big(tf(t_1, d),\, tf(t_2, d),\, \dots,\, tf(t_k, d)\big) \in R^k$$

Similarity is measured by the inner product of $\varphi(d)$.
Mapping Between Terms and Docs

- Document-term matrices ($D$): $X$ in our previous notation
- Term-document matrices ($D^T$): $X^T$ in our previous notation
- Document-document matrices ($D D^T$): the Gram matrix, dual formulation
- Term-term matrices ($D^T D$): the primal formulation

$$D(X) = \begin{bmatrix} tf(t_1, d_1) & tf(t_2, d_1) & \cdots & tf(t_k, d_1) \\ tf(t_1, d_2) & tf(t_2, d_2) & \cdots & tf(t_k, d_2) \\ \vdots & \vdots & \ddots & \vdots \\ tf(t_1, d_n) & tf(t_2, d_n) & \cdots & tf(t_k, d_n) \end{bmatrix}$$
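A tiny bag-of-words sketch (toy documents, plain term-frequency counts) building $D$ and the document-document Gram matrix $D D^T$:

```python
import numpy as np
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

vocab = sorted({w for doc in docs for w in doc.split()})

# D[i, j] = tf(t_j, d_i): frequency of term j in document i
D = np.array([[Counter(doc.split())[t] for t in vocab] for doc in docs],
             dtype=float)

G = D @ D.T  # document-document Gram matrix: inner products of phi(d)
print(G)
```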
Strings and Sequences

- DNA, protein, virus signatures, etc.
- Different lengths, partial matching, multiple matched sub-regions
- A good example of kernels on a non-numerical data set
- Dynamic programming (DP) is the standard (expensive) matching technique used to define similarity
Spectrum Kernels
- p-spectrum: the histogram of (contiguous) substrings of length p
- The kernel is the inner product of the p-spectra of two strings, as in the sketch below
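A minimal p-spectrum kernel on plain Python strings:

```python
from collections import Counter

def p_spectrum(s, p):
    """Histogram of all contiguous substrings of length p."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def spectrum_kernel(s, t, p):
    """Inner product of the two p-spectra."""
    hs, ht = p_spectrum(s, p), p_spectrum(t, p)
    return sum(hs[sub] * ht[sub] for sub in hs)

print(spectrum_kernel("GATTACA", "TACATTA", 2))
```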
All-Subsequences Kernel