
Kernel Methods

1 Simple Idea of Data Fitting

- Given (x_i, y_i), i = 1, ..., n

- x_i is of dimension d
- Find the best linear w (hyperplane) that fits the data
- Two scenarios
  - y real: regression
  - y in {-1, 1}: classification
- Two cases
  - n > d: regression, least squares
  - n < d: best fit (regression), best decision (classification)

2 Primal and Dual

- There are two ways to formulate the problem: primal and dual
- Both provide deep insight into the problem
- The primal formulation is more traditional
- The dual formulation leads to newer techniques in SVMs and kernel methods

3 Regression

w = \arg\min_w \sum_i \Big( y_i - w_0 - \sum_j x_{ij} w_j \Big)^2

In matrix form, with w = [w_0, w_1, \dots, w_d]^T, x = [1, x_1, \dots, x_d]^T, y = [y_1, y_2, \dots, y_n]^T, and X_{n \times d} the matrix whose rows are x_1^T, x_2^T, \dots, x_n^T:

w = \arg\min_w (y - Xw)^T (y - Xw)

\frac{d}{dw} (y - Xw)^T (y - Xw) = 0
\Rightarrow X^T (y - Xw) = 0
\Rightarrow X^T X w = X^T y
\Rightarrow w = (X^T X)^{-1} X^T y

\hat{y} = \langle x, (X^T X)^{-1} X^T y \rangle
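A minimal numpy sketch of this closed-form fit (the toy data, intercept handling, and variable names are illustrative, not from the slides):

```python
import numpy as np

def least_squares_fit(X, y):
    """Solve the normal equations: w = (X^T X)^{-1} X^T y."""
    # np.linalg.solve is preferred over forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data: n = 5 samples, d = 2 features, plus an intercept column of ones.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])
y = rng.normal(size=5)

w = least_squares_fit(X, y)
y_hat = X @ w                       # fitted values, i.e. the projection Hy
x_new = np.array([1.0, 0.5, -0.2])  # a new sample (with the leading 1)
print(x_new @ w)                    # prediction <x, w>
```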

4 Graphical Interpretation

\hat{y} = Xw = Hy = X (X^T X)^{-1} X^T y

[Figure: X drawn as an n-by-d matrix whose columns are features such as FICA and Income]

- X is an n (sample size) by d (data dimension) matrix
- w combines the columns of X to best approximate y
  - Combine features (FICA, income, etc.) into decisions (loan)
- H projects y onto the space spanned by the columns of X
  - Simplify the decisions to fit the features

5 Problem #1

- n = d: exact solution
- n > d: least squares (the most likely scenario)
- n < d: there are not enough constraints to determine the coefficients w uniquely

[Figure: X drawn as a wide n-by-d matrix (d > n) multiplying the coefficient vector w]

6 Problem #2

- If different attributes are highly correlated (income and FICA)
  - The columns become dependent
  - Coefficients are then poorly determined, with high variance
  - E.g., a large positive coefficient on one can be canceled by a similarly large negative coefficient on its correlated cousin
- A size constraint is helpful
  - Caveat: the constraint is problem dependent

7 Ridge Regression

- Similar to regularization

w^{ridge} = \arg\min_w \sum_i \Big( y_i - w_0 - \sum_j x_{ij} w_j \Big)^2 + \lambda \sum_j w_j^2

w^{ridge} = \arg\min_w (y - Xw)^T (y - Xw) + \lambda w^T w

\frac{d}{dw} \big[ (y - Xw)^T (y - Xw) + \lambda w^T w \big] = 0
\Rightarrow -X^T (y - Xw) + \lambda w = 0
\Rightarrow X^T y = X^T X w + \lambda w
\Rightarrow X^T y = (X^T X + \lambda I) w
\Rightarrow w^{ridge} = (X^T X + \lambda I)^{-1} X^T y
\Rightarrow \hat{y} = \langle x, (X^T X + \lambda I)^{-1} X^T y \rangle
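A correspondingly small numpy sketch of the ridge solution, on toy data with two highly correlated columns to mimic the FICA/income problem (λ = 1 is an arbitrary choice):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Solve w_ridge = (X^T X + lambda I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=20)   # two highly correlated columns
y = X @ np.array([1.0, 1.0, 0.0, 0.5, -0.5]) + 0.1 * rng.normal(size=20)

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # correlated pair poorly determined
w_ridge = ridge_fit(X, y, lam=1.0)            # size constraint shrinks/stabilizes
print(w_ols)
print(w_ridge)
```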

8 Ugly Math

w^{ridge} = (X^T X + \lambda I)^{-1} X^T y, \qquad X = U \Sigma V^T

\hat{y} = X w^{ridge} = X (X^T X + \lambda I)^{-1} X^T y
        = U \Sigma V^T (V \Sigma^T U^T U \Sigma V^T + \lambda I)^{-1} V \Sigma^T U^T y
        = U \Sigma V^T \big( V (\Sigma^T \Sigma + \lambda I) V^T \big)^{-1} V \Sigma^T U^T y
        = U \Sigma (\Sigma^T \Sigma + \lambda I)^{-1} \Sigma^T U^T y
        = \sum_i u_i \frac{\sigma_i^2}{\sigma_i^2 + \lambda}\, u_i^T y

9 How to Decipher This

\hat{y} = \sum_i u_i \frac{\sigma_i^2}{\sigma_i^2 + \lambda}\, u_i^T y

- Red: the best estimate \hat{y} is composed of the columns of U ("basis" features; recall U and X have the same column space)
- Green: how these basis columns are weighted
- Blue: projection of the target y onto these columns
- Together: representing y in a body-fitted coordinate system (the u_i)
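A short numpy check, on arbitrary toy data, that this spectral form matches the closed-form ridge fit:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.normal(size=20)
lam = 1.0

# Closed-form ridge fit: y_hat = X (X^T X + lam I)^{-1} X^T y.
y_hat_direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Spectral form: y_hat = sum_i u_i * sigma_i^2 / (sigma_i^2 + lam) * (u_i^T y).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
shrink = s**2 / (s**2 + lam)          # green term: per-direction shrinkage
y_hat_svd = U @ (shrink * (U.T @ y))  # red: columns of U, blue: U^T y

print(np.allclose(y_hat_direct, y_hat_svd))  # True
```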

10 Sidebar

- Recall that the trace (sum of the diagonal entries) of a matrix is the same as the sum of its eigenvalues
  - Proof: every matrix has a Jordan form (an upper triangular matrix) where the eigenvalues appear on the diagonal, so trace = sum of eigenvalues
  - The Jordan form results from a similarity transform (P A P^{-1}), which does not change the eigenvalues (and also preserves the trace, since trace(P A P^{-1}) = trace(P^{-1} P A) = trace(A)):

A x = \lambda x \Rightarrow P A x = \lambda P x \Rightarrow P A P^{-1} P x = \lambda P x \Rightarrow A_J y = \lambda y, \quad A_J = P A P^{-1},\ y = P x

11 Physical Interpretation

- The singular values of X represent the spread of the data along different body-fitted dimensions (orthonormal columns)
- To estimate \hat{y} (= Xw), regularization minimizes the contribution from the less spread-out dimensions
- Less spread-out dimensions usually have much larger variance (high-dimension eigenmodes), so it is harder to estimate gradients reliably there
- The trace of X(X^T X + \lambda I)^{-1} X^T is called the effective degrees of freedom

12 More Details

- The trace of X(X^T X + \lambda I)^{-1} X^T is called the effective degrees of freedom
- It controls how many eigenmodes are actually used, or active:

df(\lambda) = \mathrm{trace}\big[ X(X^T X + \lambda I)^{-1} X^T \big] = \sum_i \frac{\sigma_i^2}{\sigma_i^2 + \lambda}, \qquad df(\lambda) = d \text{ when } \lambda = 0, \qquad df(\lambda) \to 0 \text{ as } \lambda \to \infty

- Different smoothing methods are possible
  - Shrinking smoother: contributions are scaled
  - Projection smoother: contributions are either used (1) or not used (0)
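A small numpy sketch of the effective degrees of freedom as a function of λ (the identity df(λ) = Σ σ_i²/(σ_i² + λ) follows from the SVD above; the matrix and λ values are illustrative):

```python
import numpy as np

def effective_dof(X, lam):
    """trace[X (X^T X + lam I)^{-1} X^T] = sum_i sigma_i^2 / (sigma_i^2 + lam)."""
    s = np.linalg.svd(X, compute_uv=False)
    return np.sum(s**2 / (s**2 + lam))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))

for lam in [0.0, 0.1, 1.0, 10.0, 1e6]:
    print(lam, effective_dof(X, lam))   # equals d = 8 at lam = 0, decays toward 0
```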

13 Dual Formulation

- The weight vector can be expressed as a linear combination of the n training feature vectors

Without regularization:

w = (X^T X)^{-1} X^T y = X^T X (X^T X)^{-2} X^T y = X^T_{d \times n} \alpha_{n \times 1} = \sum_i \alpha_i x_i

With the ridge term:

X^T y = X^T X w + \lambda w
\Rightarrow \lambda w = X^T y - X^T X w
\Rightarrow w = \frac{1}{\lambda} X^T (y - Xw) = X^T_{d \times n} \alpha_{n \times 1} = \sum_i \alpha_i x_i, \qquad \alpha = \frac{1}{\lambda}(y - Xw)

14 Dual Formulation (cont.)

From X^T y = X^T X w + \lambda w: \quad w = \frac{1}{\lambda} X^T (y - Xw) = X^T_{d \times n} \alpha_{n \times 1} = \sum_i \alpha_i x_i

\alpha_{n \times 1} = \frac{1}{\lambda} (y_{n \times 1} - X_{n \times d}\, w_{d \times 1})
\Rightarrow \lambda \alpha = y - Xw = y - X X^T \alpha
\Rightarrow (X X^T + \lambda I)\, \alpha = y
\Rightarrow \alpha = (X_{n \times d} X^T_{d \times n} + \lambda I)^{-1} y = (G + \lambda I)^{-1} y

g(x) = \langle w, x \rangle = \Big\langle \sum_i \alpha_i x_i,\ x \Big\rangle = \sum_i \alpha_i \langle x_i, x \rangle

g(x) = \langle X^T (X X^T + \lambda I)^{-1} y,\ x \rangle = y^T (X X^T + \lambda I)^{-1} \begin{bmatrix} \langle x_1, x \rangle \\ \langle x_2, x \rangle \\ \vdots \\ \langle x_n, x \rangle \end{bmatrix}

15 In More Details

Gram matrix

g(x) = [y_1\ y_2\ \cdots\ y_n]_{1 \times n}
\left( \begin{bmatrix} -\ x_1^T\ - \\ \vdots \\ -\ x_n^T\ - \end{bmatrix}_{n \times d}
\begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix}_{d \times n} + \lambda I \right)^{-1}
\begin{bmatrix} -\ x_1^T\ - \\ \vdots \\ -\ x_n^T\ - \end{bmatrix}_{n \times d} x

= [y_1\ y_2\ \cdots\ y_n]
\begin{bmatrix}
x_1^T x_1 + \lambda & x_1^T x_2 & \cdots & x_1^T x_n \\
x_2^T x_1 & x_2^T x_2 + \lambda & \cdots & x_2^T x_n \\
\vdots & \vdots & \ddots & \vdots \\
x_n^T x_1 & x_n^T x_2 & \cdots & x_n^T x_n + \lambda
\end{bmatrix}^{-1}
\begin{bmatrix} x_1^T x \\ x_2^T x \\ \vdots \\ x_n^T x \end{bmatrix}_{n \times 1}
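A minimal numpy sketch of this dual (Gram-matrix) form of ridge regression, checking on toy data that it gives the same prediction as the primal form (λ = 1 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))     # n = 20 samples, d = 5 features
y = rng.normal(size=20)
x_new = rng.normal(size=5)
lam = 1.0

# Primal: w = (X^T X + lam I)^{-1} X^T y, prediction <w, x>.
w = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
pred_primal = w @ x_new

# Dual: alpha = (X X^T + lam I)^{-1} y, prediction sum_i alpha_i <x_i, x>.
G = X @ X.T                      # n-by-n Gram matrix of inner products
alpha = np.linalg.solve(G + lam * np.eye(20), y)
pred_dual = alpha @ (X @ x_new)

print(np.allclose(pred_primal, pred_dual))  # True
```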

16 Observations

- Primal
  - X^T X is d by d
  - Training: slow for high feature dimension
  - Use: fast, O(d)
  - g(x) = \langle x, (X^T X + \lambda I)^{-1} X^T y \rangle
- Dual
  - Only inner products are involved
  - X X^T is n by n
  - Training: fast for high feature dimension
  - Use: slow, O(nd); n inner products to evaluate, each requiring d multiplications
  - g(x) = y^T (X X^T + \lambda I)^{-1} \big[ \langle x_1, x \rangle,\ \langle x_2, x \rangle,\ \dots,\ \langle x_n, x \rangle \big]^T

17 Graphical Interpretation

[Figure: X_{n \times d} times X^T_{d \times n} gives an n-by-n matrix with entries \langle x_i, x_j \rangle + \lambda \delta_{ij}]

g(x) = \langle w, x \rangle = \sum_i \alpha_i \langle x_i, x \rangle = y^T (X X^T + \lambda I)^{-1} \begin{bmatrix} \langle x_1, x \rangle \\ \langle x_2, x \rangle \\ \vdots \\ \langle x_n, x \rangle \end{bmatrix}

18 One Extreme: Perfectly Uncorrelated

[Figure: when the training samples are orthogonal, \langle x_i, x_j \rangle = 0 for i \ne j, the n-by-n matrix X X^T is diagonal with entries \langle x_i, x_i \rangle]

g(x) = y^T (X X^T + \lambda I)^{-1} \begin{bmatrix} \langle x_1, x \rangle \\ \vdots \\ \langle x_n, x \rangle \end{bmatrix} = \sum_i y_i\, \frac{\langle x_i, x \rangle}{\langle x_i, x_i \rangle + \lambda}

- Orthogonal projection: no generalization

19 General Case

\hat{y}^T_{1 \times n} = y^T_{1 \times n} (X X^T + \lambda I)^{-1}_{n \times n} X_{n \times d} X^T_{d \times n}, \qquad X_{n \times d} = U_{n \times d}\, \Sigma_{d \times d}\, V^T_{d \times d}

= y^T (U \Sigma^2 U^T + \lambda I)^{-1} U \Sigma V^T X^T
= y^T \big( U (\Sigma^2 + \lambda I) U^T \big)^{-1} U \Sigma V^T X^T
= y^T U (\Sigma^2 + \lambda I)^{-1} U^T U \Sigma V^T X^T
= y^T U (\Sigma^2 + \lambda I)^{-1} \Sigma V^T X^T
= (U^T_{d \times n} y_{n \times 1})^T_{1 \times d} (\Sigma^2 + \lambda I)^{-1} \Sigma V^T V \Sigma^T U^T
= (U^T_{d \times n} y_{n \times 1})^T_{1 \times d} (\Sigma^2 + \lambda I)^{-1} \Sigma^2 U^T

- How to interpret this? Does this still make sense?

20 Physical Meaning of SVD

- Assume that n > d, so X is of rank at most d
- U gives the body (data)-fitted axes
- U^T is a projection from the n-dimensional space to the d-dimensional space
- \Sigma gives the importance of the dimensions
- V is the representation of X in the d-dimensional space

X = U_{n \times d}\, \Sigma_{d \times d}\, V^T_{d \times d}

21 Interpretation

\hat{y} = \sum_i u_i \frac{\sigma_i^2}{\sigma_i^2 + \lambda}\, u_i^T y, \qquad \hat{y}^T_{1 \times n} = (U^T_{d \times n} y_{n \times 1})^T_{1 \times d} (\Sigma^2 + \lambda I)^{-1} \Sigma^2 U^T

- In the new, uncorrelated space, there are only d training vectors and d decisions
- Red: the d-by-1 uncorrelated decision vector
- Green: weighting of the significance of the components in the uncorrelated decision vector
- Blue: the transformed (uncorrelated) training samples
- Still the same interpretation: similarity measurement in a new space, via
  - the Gram matrix
  - the inner product of the training samples and the new sample

22 First Important Concept

- The computation involves only inner products
  - For training samples, in computing the Gram matrix
  - For the new sample, in computing regression or classification results
- Similarity is measured in terms of angle, instead of distance

23 Second Important Concept

- Using angle or distance for similarity measurement doesn't by itself make problems easier or harder
  - If you cannot separate the data, it doesn't matter what similarity measure you use
- "Massage" the data instead
  - Transform the data (into a higher, even infinite, dimensional space)
  - The data become "more likely" to be linearly separable (caveat: the choice of the kernel function is important)
  - Cannot perform the inner product efficiently?
  - Kernel trick: you do not have to

24 Why?

- There are a lot more "basis" features now

[Figure: the original n-by-d data matrix X with columns such as FICA and Income, versus a much wider n-by-d' matrix (d' >> d) whose columns are basis functions φ_i(FICA, Income, ...)]

25 Example - Xor

[Figure: for XOR, the original n-by-2 matrix X with columns X1 and X2, versus a wider matrix that also includes the product feature X1*X2]

26 Example (Doesn’t quite work yet)

\phi: x = (x_1, x_2) \rightarrow \phi(x) = (x_1, x_2, x_1^2 + x_2^2) \in F = R^3

- We need to keep the nice property of requiring only inner products in the computation (dual formulation)
- But what happens if the feature dimension is very high (or even infinitely high)?
- The inner product in the high (even infinitely high) dimensional feature space can be calculated without the explicit mapping, through a kernel function

27 In More Details

g(x) = \langle w, x \rangle = \sum_i \alpha_i \langle x_i, x \rangle = y^T (X X^T + \lambda I)^{-1} \begin{bmatrix} \langle x_1, x \rangle \\ \langle x_2, x \rangle \\ \vdots \\ \langle x_n, x \rangle \end{bmatrix}

g(x) = [y_1\ y_2\ \cdots\ y_n]
\left( \begin{bmatrix} -\ \phi(x_1)^T\ - \\ \vdots \\ -\ \phi(x_n)^T\ - \end{bmatrix}
\begin{bmatrix} | & & | \\ \phi(x_1) & \cdots & \phi(x_n) \\ | & & | \end{bmatrix} + \lambda I \right)^{-1}
\begin{bmatrix} -\ \phi(x_1)^T\ - \\ \vdots \\ -\ \phi(x_n)^T\ - \end{bmatrix} \phi(x)

= [y_1\ y_2\ \cdots\ y_n]
\begin{bmatrix}
\phi(x_1)^T \phi(x_1) + \lambda & \phi(x_1)^T \phi(x_2) & \cdots & \phi(x_1)^T \phi(x_n) \\
\phi(x_2)^T \phi(x_1) & \phi(x_2)^T \phi(x_2) + \lambda & \cdots & \phi(x_2)^T \phi(x_n) \\
\vdots & \vdots & \ddots & \vdots \\
\phi(x_n)^T \phi(x_1) & \phi(x_n)^T \phi(x_2) & \cdots & \phi(x_n)^T \phi(x_n) + \lambda
\end{bmatrix}^{-1}
\begin{bmatrix} \phi(x_1)^T \phi(x) \\ \phi(x_2)^T \phi(x) \\ \vdots \\ \phi(x_n)^T \phi(x) \end{bmatrix}

28 Example

\phi: x = (x_1, x_2) \rightarrow \phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \in F = R^3

k(x, y) = (x_1 y_1 + x_2 y_2)^2
        = x_1^2 y_1^2 + 2 x_1 y_1 x_2 y_2 + x_2^2 y_2^2
        = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \cdot (y_1^2, \sqrt{2}\, y_1 y_2, y_2^2)
        = \phi(x) \cdot \phi(y)
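A quick numpy check of this identity; the feature map and kernel follow the slide, while the test points are arbitrary:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2D."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, y):
    """Kernel: squared inner product in the original 2D space."""
    return (x @ y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(k(x, y), phi(x) @ phi(y))   # the two values are identical
```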

29 More Example

\phi: x = (x_1, x_2) \rightarrow \phi(x) = (x_1^2, x_1 x_2, x_1 x_2, x_2^2) \in F = R^4

k(x, y) = (x_1 y_1 + x_2 y_2)^2
        = x_1^2 y_1^2 + 2 x_1 y_1 x_2 y_2 + x_2^2 y_2^2
        = (x_1^2, x_1 x_2, x_1 x_2, x_2^2) \cdot (y_1^2, y_1 y_2, y_1 y_2, y_2^2)
        = \phi(x) \cdot \phi(y)

30 Even More Examples

\phi: x = (x_1, x_2) \rightarrow \phi(x) = \Big( \tfrac{1}{\sqrt{2}}(x_1^2 - x_2^2),\ \sqrt{2}\, x_1 x_2,\ \tfrac{1}{\sqrt{2}}(x_1^2 + x_2^2) \Big) \in F = R^3

k(x, y) = (x_1 y_1 + x_2 y_2)^2
        = x_1^2 y_1^2 + 2 x_1 y_1 x_2 y_2 + x_2^2 y_2^2
        = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \cdot (y_1^2, \sqrt{2}\, y_1 y_2, y_2^2)
        = \Big( \tfrac{1}{\sqrt{2}}(x_1^2 - x_2^2),\ \sqrt{2}\, x_1 x_2,\ \tfrac{1}{\sqrt{2}}(x_1^2 + x_2^2) \Big) \cdot \Big( \tfrac{1}{\sqrt{2}}(y_1^2 - y_2^2),\ \sqrt{2}\, y_1 y_2,\ \tfrac{1}{\sqrt{2}}(y_1^2 + y_2^2) \Big)
        = \phi(x) \cdot \phi(y)

- The interpretation of the mapping φ is not unique, even for a single kernel function κ

31 Observations

- The interpretation of the mapping φ is not unique, even for a single kernel function κ
- The κ function is special: certainly not all functions have such a property (i.e., corresponding to an inner product in some feature space)
- Such functions are called kernel functions
- A kernel is a function κ such that for all x, z in X, \kappa(x, z) = \langle \phi(x), \phi(z) \rangle, where φ is a mapping from X to an (inner product) feature space F

32 Important Theorem

- A function \kappa: X \times X \rightarrow R can be decomposed as \kappa(x, z) = \langle \phi(x), \phi(z) \rangle, where φ maps into a Hilbert space, if and only if it satisfies the finitely positive semi-definite property

- Finitely positive semi-definite: \kappa: X \times X \rightarrow R is symmetric and, for any finite subset of the space X, the matrix formed by applying κ is positive semi-definite (i.e., the Gram matrix is symmetric PSD for any choice of training samples)

33 Only if Condition

- Given a bilinear function \kappa: X \times X \rightarrow R with \kappa(x, z) = \langle \phi(x), \phi(z) \rangle, the Gram matrix formed from κ satisfies the finitely positive semi-definite property:

v^T G v = v^T \begin{bmatrix}
\kappa(x_1, x_1) & \kappa(x_1, x_2) & \cdots & \kappa(x_1, x_n) \\
\kappa(x_2, x_1) & \kappa(x_2, x_2) & \cdots & \kappa(x_2, x_n) \\
\vdots & \vdots & \ddots & \vdots \\
\kappa(x_n, x_1) & \kappa(x_n, x_2) & \cdots & \kappa(x_n, x_n)
\end{bmatrix} v
= v^T \begin{bmatrix} \phi(x_1)^T \\ \phi(x_2)^T \\ \vdots \\ \phi(x_n)^T \end{bmatrix}
\big[ \phi(x_1)\ \phi(x_2)\ \cdots\ \phi(x_n) \big] v
= \Big\langle \sum_i v_i \phi(x_i),\ \sum_i v_i \phi(x_i) \Big\rangle
= \Big\| \sum_i v_i \phi(x_i) \Big\|^2 \ge 0

34 If Condition Proof Strategy

- This direction is more complicated
- First: establish a Hilbert (function) space
- Second: establish the reproducing property in this function space
- Third: establish the (Fourier) basis of such a function space
- Fourth: establish κ as an expansion on such a Fourier basis

35 What?

\phi: x = (x_1, x_2) \rightarrow \phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \in F = R^3

k(x, y) = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + 2 x_1 y_1 x_2 y_2 + x_2^2 y_2^2 = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \cdot (y_1^2, \sqrt{2}\, y_1 y_2, y_2^2) = \phi(x) \cdot \phi(y)

- In this case, F is a 3-dimensional space
- Each "dimension" is a function of x
- There are three (not unique) "eigen" functions that form the basis

36 A Function Space Example

- All (well-behaved, square-integrable) functions defined over a domain R form a vector space (a function space, F)
  - If f ∈ F, then cf ∈ F
  - If f ∈ F and g ∈ F, then (af + bg) ∈ F
- Such a space is a Hilbert space if it is complete with an inner product (real-valued, symmetric, bilinear):

\langle f, g \rangle = \langle g, f \rangle = \int f(x) g(x)\, dx, \qquad \langle f, f \rangle = \int f(x) f(x)\, dx = \int f^2(x)\, dx > 0

- You can define an orthogonal basis (e.g., a Fourier basis) on it

37 Hilbert Space

- The proof is harder and not very intuitive
- Suffice it to say that someone has figured out that the desired feature space is the function space

F = \left\{ \sum_{i=1}^{l} \alpha_i k(x_i, \cdot) : l \in N,\ x_i \in X,\ \alpha_i \in R,\ i = 1, \dots, l \right\}

- with an inner product defined as follows: for

f = \sum_{i=1}^{l} \alpha_i k(x_i, \cdot), \qquad g = \sum_{i=1}^{m} \beta_i k(z_i, \cdot)

\langle f, g \rangle = \sum_{j=1}^{m} \sum_{i=1}^{l} \alpha_i \beta_j k(x_i, z_j) = \sum_{i=1}^{l} \alpha_i g(x_i) = \sum_{j=1}^{m} \beta_j f(z_j)

38 Why

- Because we then have the positive semi-definite property regardless of the choice of x_i:

F = \left\{ \sum_{i=1}^{l} \alpha_i k(x_i, \cdot) : l \in N,\ x_i \in X,\ \alpha_i \in R,\ i = 1, \dots, l \right\}

f = \sum_{i=1}^{l} \alpha_i k(x_i, \cdot), \qquad g = \sum_{i=1}^{m} \beta_i k(z_i, \cdot)

\langle f, g \rangle = \sum_{j=1}^{m} \sum_{i=1}^{l} \alpha_i \beta_j k(x_i, z_j) = \sum_{i=1}^{l} \alpha_i g(x_i) = \sum_{j=1}^{m} \beta_j f(z_j)

\langle f, f \rangle = \sum_{j=1}^{l} \sum_{i=1}^{l} \alpha_i \alpha_j k(x_i, x_j) = \alpha^T K \alpha \ge 0

- We still have to prove completeness (not done here; see page 62 of Shawe-Taylor and Cristianini)

39 Reproducing Property

- This special Hilbert space is called a Reproducing Kernel Hilbert Space (RKHS)

l m

Recall : f = ∑ α i k (x i ,.) , g = ∑ β i k (z i ,.) i =1 i=1 m l l m

< f , g >= ∑ ∑ α i β j k (x i , z j ) = ∑ α i g (x i ) = ∑ β j f (z j ) j =1i = 1 i=1 j =1

If we take g = k (x,.) l

< f , g >=< f , k (x,.) >= ∑ α i k (x i , x) = f (x) i =1

40 Mercer Kernel Theorem

- Denote an orthonormal basis of the RKHS with kernel κ as \phi_i(\cdot)
- \kappa(x, \cdot) belongs to this space
- Expand \kappa(x, \cdot) onto the orthonormal basis \phi_i(\cdot):

k(x, \cdot) = \sum_{i=1}^{\infty} \langle k(x, \cdot), \phi_i(\cdot) \rangle\, \phi_i(\cdot)

\Rightarrow k(x, z) = \sum_{i=1}^{\infty} \langle k(x, \cdot), \phi_i(\cdot) \rangle\, \phi_i(z)

\Rightarrow k(x, z) = \sum_{i=1}^{\infty} \phi_i(x)\, \phi_i(z) \quad \text{(by the reproducing property, } \langle k(x, \cdot), \phi_i(\cdot) \rangle = \phi_i(x))

41 Practically

- The explicit computation of the feature mapping is not necessary
- Instead, we can compose different κ and manipulate Gram matrices using all kinds of mathematical tricks (kernel design), as long as the finitely positive semi-definite property is preserved
- A research-intensive topic, not covered in detail here

42 Composition of Kernels

- The space of kernel functions is closed under certain operations
  - I.e., composing valid kernel functions with such operations results in valid kernels
  - This can be proven by showing that the resulting function preserves the finitely positive semi-definite property
  - E.g., sums and products of kernels, and multiplication by a positive constant

\alpha^T K_1 \alpha \ge 0, \qquad \alpha^T K_2 \alpha \ge 0

\alpha^T (c K_1) \alpha = c\, \alpha^T K_1 \alpha \ge 0, \quad \text{if } c > 0

\alpha^T (K_1 + K_2) \alpha = \alpha^T K_1 \alpha + \alpha^T K_2 \alpha \ge 0

43 Other Rules

k(x, z) = k_1(x, z) + k_2(x, z)
k(x, z) = a\, k_1(x, z), \quad a > 0
k(x, z) = k_1(x, z)\, k_2(x, z)

\Rightarrow k(x, z) = p(k_1(x, z)) \quad \text{(p a polynomial with non-negative coefficients)}
\Rightarrow k(x, z) = \exp(k_1(x, z))
\Rightarrow k(x, z) = \exp(-\|x - z\|^2 / (2\sigma^2))

44 Kernel Shaping

- Adding a constant to all entries
  - Equivalent to adding an extra constant feature
- Adding a constant to the diagonal
  - Ridge regression; drops smaller features

45 Example Kernels

- Pattern classification is a hard problem
  - Massaging classifiers is difficult, and massaging data (using different kernels) only allocates the complexity differently (you cannot turn an NP problem into a P problem by a magic trick)
- Whether kernel methods work will depend on your kernels
- Some examples are discussed below (there are many more ...)

46 Polynomial Kernel

k(x, z) = p(k_1(x, z)) = (\langle x, z \rangle + c)^d

If the features are two-dimensional and d = 2:

k(x, z) = (x_1 z_1 + x_2 z_2 + c)^2
        = (x_1 z_1 + x_2 z_2)^2 + 2c (x_1 z_1 + x_2 z_2) + c^2
        = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 + 2c x_1 z_1 + 2c x_2 z_2 + c^2
        = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2, \sqrt{2c}\, x_1, \sqrt{2c}\, x_2, c) \cdot (z_1^2, \sqrt{2}\, z_1 z_2, z_2^2, \sqrt{2c}\, z_1, \sqrt{2c}\, z_2, c)

= \phi(x) \cdot \phi(z) = \sum_i \phi_i(x)\, \phi_i(z)

- The feature space is made of all monomials of the form x_1^{i_1} x_2^{i_2} \cdots x_n^{i_n}, with \sum_{j=1}^{n} i_j = s, s \le d
- The dimension is \binom{n+d}{d}
- Instead of calculating so many terms, we can do a simple polynomial evaluation: the beauty of kernel methods (at the cost of less control over the weighting of individual monomials)
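As a hedged illustration, here is a small numpy sketch that uses the degree-2 polynomial kernel inside the dual ridge formula from earlier to fit the XOR labels of the previous example (c, d, and λ are arbitrary choices):

```python
import numpy as np

# XOR labels are not linearly separable in the original 2D space,
# but a degree-2 polynomial kernel handles them in the dual formulation.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])

def poly_kernel(A, B, c=1.0, d=2):
    """Polynomial kernel (<x, z> + c)^d between rows of A and rows of B."""
    return (A @ B.T + c) ** d

lam = 0.1
alpha = np.linalg.solve(poly_kernel(X, X) + lam * np.eye(4), y)

# Predictions g(x) = sum_i alpha_i k(x_i, x) recover the XOR label signs.
print(np.sign(poly_kernel(X, X) @ alpha))   # [-1.  1.  1. -1.]
```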

47 All-Subsets Kernel

If there are n features, 1, ..., n:

\phi_A, \quad A \subseteq \{1, 2, \dots, n\}

\phi_A(x) = \prod_{j \in A} x_j = x_1^{i_1} x_2^{i_2} \cdots x_n^{i_n}, \qquad \sum_{j=1}^{n} i_j \le n, \quad i_j \in \{0, 1\}, \quad 1 \le j \le n

- The feature space is made of all monomials of the form \prod_{j \in A} x_j = x_1^{i_1} x_2^{i_2} \cdots x_n^{i_n}, with \sum_{j=1}^{n} i_j \le n, i_j \in \{0, 1\}, 1 \le j \le n
- The dimension is 2^n

48 All-Subsets Kernel (cont.)

- Instead of calculating so many terms, we can do a simple polynomial evaluation

If there are n features, 1, ..., n, and \phi_A, A \subseteq \{1, 2, \dots, n\}:

\kappa(x, z) = \langle \phi(x), \phi(z) \rangle = \sum_{A \subseteq \{1, 2, \dots, n\}} \phi_A(x)\, \phi_A(z) = \prod_{i=1}^{n} (1 + x_i z_i)

More generally,

\kappa(x, z) = \langle \phi(x), \phi(z) \rangle = \sum_{A \subseteq \{1, 2, \dots, n\}} \phi_A(x)\, \phi_A(z) = \prod_{i=1}^{n} (1 + a_i x_i z_i)

with different weights a_i for different features.
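A small sketch (numpy plus itertools) verifying the product formula against the explicit sum over all subsets, for an n small enough that the 2^n features can still be enumerated:

```python
import numpy as np
from itertools import combinations

def all_subsets_explicit(x, z):
    """Sum of phi_A(x) * phi_A(z) over every subset A of the n features."""
    n = len(x)
    total = 0.0
    for r in range(n + 1):
        for A in combinations(range(n), r):
            total += np.prod([x[i] for i in A]) * np.prod([z[i] for i in A])
    return total

def all_subsets_kernel(x, z):
    """Closed form: product over features of (1 + x_i z_i)."""
    return np.prod(1.0 + x * z)

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)
print(np.isclose(all_subsets_explicit(x, z), all_subsets_kernel(x, z)))  # True
```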

49 ANOVA Kernel

- The all-subsets kernel restricted to subsets of a fixed cardinality d
- The dimensionality is \binom{n}{d}

If there are n features, 1, ..., n, and \phi_A, A \subseteq \{1, 2, \dots, n\}, |A| = d:

\phi_A(x) = \prod_{j \in A} x_j = x_1^{i_1} x_2^{i_2} \cdots x_n^{i_n}, \qquad \sum_{j=1}^{n} i_j = d, \quad i_j \in \{0, 1\}, \quad 1 \le j \le n

- Evaluation through recursion (dynamic programming)
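One way to implement the DP evaluation is sketched below; the particular recursion used, K[s] += x_m z_m K[s-1] over the features, is an assumption about the intended dynamic program (the slide does not spell it out), and it is checked against a brute-force sum:

```python
import numpy as np
from itertools import combinations

def anova_kernel(x, z, d):
    """ANOVA kernel: sum over all size-d subsets A of prod_{i in A} x_i z_i,
    computed by dynamic programming over the features in O(n * d)."""
    K = np.zeros(d + 1)
    K[0] = 1.0
    for m in range(len(x)):
        # Update from high s to low s so each feature is used at most once.
        for s in range(min(d, m + 1), 0, -1):
            K[s] += x[m] * z[m] * K[s - 1]
    return K[d]

def anova_explicit(x, z, d):
    """Brute-force check: explicit sum over all size-d subsets."""
    return sum(np.prod([x[i] * z[i] for i in A])
               for A in combinations(range(len(x)), d))

rng = np.random.default_rng(0)
x, z = rng.normal(size=6), rng.normal(size=6)
print(np.isclose(anova_kernel(x, z, 3), anova_explicit(x, z, 3)))  # True
```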

50 Gaussian Kernel

- Identical to the radial basis function (RBF) kernel

\kappa(x, z) = \langle \phi(x), \phi(z) \rangle = \exp\!\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)

- Recall that e^x = \sum_{i=0}^{\infty} \frac{x^i}{i!}
- The feature dimension is infinitely high in this case
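A short numpy sketch that builds a Gaussian-kernel Gram matrix and plugs it into the dual ridge formula from earlier (σ, λ, and the sine toy data are illustrative choices):

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

lam = 0.1
alpha = np.linalg.solve(gaussian_gram(X, X) + lam * np.eye(30), y)

X_new = np.array([[0.0], [1.5]])
print(gaussian_gram(X_new, X) @ alpha)   # predictions roughly follow sin(x)
```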

51 Representing Texts

- Bag-of-words model
  - Presence and frequency of words; ordering, grammatical relations, and phrases are ignored
- Terms: words
- Dictionary: all possible words
- Corpus: all documents
- Document: d \rightarrow \phi(d) = (tf(t_1, d), tf(t_2, d), \dots, tf(t_k, d)) \in R^k
- Similarity is measured by the inner product of the φ(d)
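A tiny Python sketch of the bag-of-words mapping and the resulting inner-product similarity (the example documents and dictionary are made up):

```python
from collections import Counter

def tf_vector(doc, dictionary):
    """phi(d): term-frequency vector of a document over a fixed dictionary."""
    counts = Counter(doc.lower().split())
    return [counts[t] for t in dictionary]

docs = ["the cat sat on the mat", "the dog sat on the log", "kernel methods"]
dictionary = sorted({w for d in docs for w in d.lower().split()})

phis = [tf_vector(d, dictionary) for d in docs]
# Document-document similarity = inner product of term-frequency vectors.
K = [[sum(a * b for a, b in zip(u, v)) for v in phis] for u in phis]
print(K)   # docs 0 and 1 are far more similar to each other than to doc 2
```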

52 Mapping between terms and docs

- Document-term matrix (D)
  - X in our previous notation
- Term-document matrix (D^T)
  - X^T in our previous notation
- Document-document matrix (D D^T)
  - Gram matrix, dual formulation
- Term-term matrix (D^T D)
  - Primal formulation

d \rightarrow \phi(d) = (tf(t_1, d), tf(t_2, d), \dots, tf(t_k, d)) \in R^k

D(X) = \begin{bmatrix}
tf(t_1, d_1) & tf(t_2, d_1) & \cdots & tf(t_k, d_1) \\
tf(t_1, d_2) & tf(t_2, d_2) & \cdots & tf(t_k, d_2) \\
\vdots & \vdots & \ddots & \vdots \\
tf(t_1, d_n) & tf(t_2, d_n) & \cdots & tf(t_k, d_n)
\end{bmatrix}

53 Strings and Sequences

- DNA, protein, virus signatures, etc.
  - Different lengths
  - Partial matching
  - Multiple matched sub-regions
- A good example of kernels on non-numerical data
- Dynamic programming (DP) is the standard (expensive) matching technique used to define similarity

54 Spectrum Kernels

- p-spectrum: the histogram of (contiguous) substrings of length p
- Kernel: the inner product of the p-spectra of two strings
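A small Python sketch of the p-spectrum kernel on two made-up strings: count every contiguous substring of length p and take the inner product of the two histograms:

```python
from collections import Counter

def p_spectrum(s, p):
    """Histogram of all contiguous substrings of length p."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def spectrum_kernel(s, t, p):
    """Inner product of the p-spectra of strings s and t."""
    hs, ht = p_spectrum(s, p), p_spectrum(t, p)
    return sum(hs[sub] * ht[sub] for sub in hs)

print(spectrum_kernel("GATTACA", "ATTAC", 3))   # 3: shared 3-mers ATT, TTA, TAC
```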

55 All Subsequences Kernel
