Kernel Methods
Simple Idea of Data Fitting

- Given (x_i, y_i), i = 1, ..., n, where each x_i has dimension d
- Find the best linear function w (hyperplane) that fits the data
- Two scenarios: y real (regression); y in {-1, 1} (classification)
- Two cases: n > d, regression, least squares; n < d, ridge regression
- For a new sample x, <x, w> gives the best fit (regression) or the best decision (classification)

Primal and Dual

- There are two ways to formulate the problem: primal and dual
- Both provide deep insight into the problem
- The primal formulation is the more traditional one
- The dual formulation leads to newer techniques in SVMs and kernel methods

Regression

w = \arg\min_w \sum_i \bigl(y_i - w_0 - \sum_j x_{ij} w_j\bigr)^2
  = \arg\min_w (y - Xw)^T (y - Xw)

Setting the derivative to zero:

\frac{d}{dw} (y - Xw)^T (y - Xw) = 0
\Rightarrow X^T (y - Xw) = 0
\Rightarrow X^T X w = X^T y
\Rightarrow w = (X^T X)^{-1} X^T y

where

w = [w_0, w_1, \dots, w_d]^T, \quad
x = [1, x_1, \dots, x_d]^T, \quad
y = [y_1, y_2, \dots, y_n]^T, \quad
X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix}_{n \times d}

The prediction for a new sample x is

\hat{y} = \langle x, (X^T X)^{-1} X^T y \rangle

Graphical Interpretation

\hat{y} = Xw = Hy = X (X^T X)^{-1} X^T y

- X is an n (sample size) by d (data dimension) matrix whose columns are features such as FICA score and income
- w combines the columns of X to best approximate y
- Combine features (FICA, income, etc.) into decisions (e.g., loan approval)
- H projects y onto the space spanned by the columns of X
- Simplify the decisions to fit the features

Problem #1

- n = d: exact solution
- n > d: least squares (the most likely scenario)
- n < d: there are not enough constraints to determine the coefficients w uniquely

Problem #2

- If different attributes are highly correlated (e.g., income and FICA), the columns of X become dependent
- The coefficients are then poorly determined and have high variance
- E.g., a large positive coefficient on one attribute can be canceled by a similarly large negative coefficient on its correlated cousin
- A size constraint on the coefficients is helpful
- Caveat: the constraint is problem dependent

Ridge Regression

Similar to regularization:

w^{ridge} = \arg\min_w \sum_i \bigl(y_i - w_0 - \sum_j x_{ij} w_j\bigr)^2 + \lambda \sum_j w_j^2
          = \arg\min_w (y - Xw)^T (y - Xw) + \lambda w^T w

Setting the derivative to zero:

\frac{d}{dw} \bigl[(y - Xw)^T (y - Xw) + \lambda w^T w\bigr] = 0
\Rightarrow -X^T (y - Xw) + \lambda w = 0
\Rightarrow X^T y = X^T X w + \lambda w = (X^T X + \lambda I) w
\Rightarrow w^{ridge} = (X^T X + \lambda I)^{-1} X^T y
\Rightarrow \hat{y} = \langle x, (X^T X + \lambda I)^{-1} X^T y \rangle

Ugly Math

With the SVD X = U \Sigma V^T:

\hat{y} = X w^{ridge} = X (X^T X + \lambda I)^{-1} X^T y
        = U \Sigma V^T (V \Sigma^T U^T U \Sigma V^T + \lambda I)^{-1} V \Sigma^T U^T y
        = U \Sigma V^T \bigl(V (\Sigma^T \Sigma + \lambda I) V^T\bigr)^{-1} V \Sigma^T U^T y
        = U \Sigma (\Sigma^T \Sigma + \lambda I)^{-1} \Sigma^T U^T y
        = \sum_i u_i \frac{\sigma_i^2}{\sigma_i^2 + \lambda} u_i^T y

How to Decipher This

\hat{y} = \sum_i u_i \frac{\sigma_i^2}{\sigma_i^2 + \lambda} u_i^T y

- u_i: the best estimate \hat{y} is composed of the columns of U ("basis" features; recall that U and X have the same column space)
- \sigma_i^2 / (\sigma_i^2 + \lambda): how these basis columns are weighted
- u_i^T y: the projection of the target y onto these columns
- Together: y is represented in a body-fitted coordinate system (the u_i)

Sidebar

- Recall that the trace (sum of the diagonal entries) of a matrix equals the sum of its eigenvalues
- Proof sketch: every matrix has a Jordan form (an upper triangular matrix) with the eigenvalues on the diagonal, so trace = sum of eigenvalues; the Jordan form results from a similarity transform (P A P^{-1}), which does not change the eigenvalues:

A x = \lambda x \Rightarrow P A x = \lambda P x \Rightarrow (P A P^{-1})(P x) = \lambda (P x) \Rightarrow A_J y = \lambda y

Physical Interpretation

- The singular values of X represent the spread of the data along different body-fitted dimensions (orthonormal columns)
- To estimate \hat{y} (= <x, w^{ridge}>), regularization minimizes the contribution from the less spread-out dimensions
- Less spread-out dimensions usually have much larger variance (high-order eigenmodes); gradients along them are harder to estimate reliably
- The trace of X (X^T X + \lambda I)^{-1} X^T is called the effective degrees of freedom
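As a quick numerical check on the primal derivation above, here is a minimal numpy sketch; the data, the value of lambda, and all variable names are invented for illustration. It solves the ridge normal equations, rebuilds the fitted values through the SVD shrinkage formula, and evaluates the effective degrees of freedom:

```python
import numpy as np

# Illustrative data only: n samples, d features, ridge parameter lam.
rng = np.random.default_rng(0)
n, d, lam = 50, 5, 2.0
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Primal solution: w_ridge = (X^T X + lam I)^-1 X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
y_hat = X @ w_ridge

# SVD view: y_hat = sum_i u_i * sigma_i^2 / (sigma_i^2 + lam) * (u_i^T y)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
shrink = s**2 / (s**2 + lam)            # per-direction shrinkage factors
y_hat_svd = U @ (shrink * (U.T @ y))

# Effective degrees of freedom: trace of X (X^T X + lam I)^-1 X^T
df = np.sum(shrink)

print(np.allclose(y_hat, y_hat_svd))    # True: the two expressions agree
print(df)                               # strictly between 0 and d for lam > 0
```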
More Details

- The trace of X (X^T X + \lambda I)^{-1} X^T is called the effective degrees of freedom
- It controls how many eigenmodes are actually used (active):

df(\lambda) = d \text{ for } \lambda = 0, \qquad df(\lambda) \to 0 \text{ as } \lambda \to \infty

- Different methods are possible:
  - Shrinking smoother: contributions are scaled
  - Projection smoother: contributions are either used (1) or not used (0)

Dual Formulation

The weight vector can be expressed as a sum of the n training feature vectors.

Least squares:

w = (X^T X)^{-1} X^T y = X^T X (X^T X)^{-2} X^T y = X^T_{d \times n} \alpha_{n \times 1} = \sum_i \alpha_i x_i,
\quad \text{with } \alpha = X (X^T X)^{-2} X^T y

Ridge regression:

X^T y = X^T X w + \lambda w
\Rightarrow \lambda w = X^T y - X^T X w = X^T (y - Xw)
\Rightarrow w = \tfrac{1}{\lambda} X^T (y - Xw) = X^T_{d \times n} \alpha_{n \times 1} = \sum_i \alpha_i x_i

Dual Formulation (cont.)

With \alpha = \tfrac{1}{\lambda} (y - Xw):

\lambda \alpha = y - Xw = y - X X^T \alpha
\Rightarrow (X X^T + \lambda I) \alpha = y
\Rightarrow \alpha = (X_{n \times d} X^T_{d \times n} + \lambda I)^{-1} y = (G + \lambda I)^{-1} y

and w = X^T \alpha = \sum_i \alpha_i x_i. The prediction for a new sample x is

g(x) = \langle w, x \rangle = \sum_i \alpha_i \langle x_i, x \rangle
     = \langle X^T (X X^T + \lambda I)^{-1} y, \; x \rangle
     = y^T (X X^T + \lambda I)^{-1}
       \begin{bmatrix} \langle x_1, x \rangle \\ \langle x_2, x \rangle \\ \vdots \\ \langle x_n, x \rangle \end{bmatrix}

In More Details

Written out with the Gram matrix:

g(x) = [y_1 \; y_2 \; \cdots \; y_n]_{1 \times n}
  \left(
    \begin{bmatrix} - x_1^T - \\ \vdots \\ - x_n^T - \end{bmatrix}_{n \times d}
    \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix}_{d \times n}
    + \lambda I
  \right)^{-1}
  \begin{bmatrix} - x_1^T - \\ \vdots \\ - x_n^T - \end{bmatrix}_{n \times d} x_{d \times 1}

= [y_1 \; y_2 \; \cdots \; y_n]
  \begin{bmatrix}
    x_1^T x_1 + \lambda & x_1^T x_2 & \cdots & x_1^T x_n \\
    x_2^T x_1 & x_2^T x_2 + \lambda & \cdots & x_2^T x_n \\
    \vdots & & \ddots & \vdots \\
    x_n^T x_1 & x_n^T x_2 & \cdots & x_n^T x_n + \lambda
  \end{bmatrix}^{-1}
  \begin{bmatrix} x_1^T x \\ x_2^T x \\ \vdots \\ x_n^T x \end{bmatrix}_{n \times 1}

Observations

Primal:
- X^T X is d by d
- Training: slow for a high feature dimension
- Use: fast, O(d)
- g(x) = \langle x, (X^T X + \lambda I)^{-1} X^T y \rangle

Dual:
- Only inner products are involved
- X X^T is n by n
- Training: fast for a high feature dimension
- Use: slow, O(nd): n inner products to evaluate, each requiring d multiplications
- g(x) = y^T (X X^T + \lambda I)^{-1} [\langle x_1, x \rangle, \dots, \langle x_n, x \rangle]^T

Graphical Interpretation

- X_{n \times d} X^T_{d \times n} + \lambda I is an n \times n matrix with entries \langle x_i, x_j \rangle + \lambda \delta_{ij}

g(x) = \langle w, x \rangle = \sum_i \alpha_i \langle x_i, x \rangle
     = y^T (X X^T + \lambda I)^{-1} [\langle x_1, x \rangle, \dots, \langle x_n, x \rangle]^T

One Extreme: Perfectly Uncorrelated

If the training samples are orthogonal, \langle x_i, x_j \rangle = 0 for i \ne j, then X X^T + \lambda I is diagonal with entries \langle x_i, x_i \rangle + \lambda, and

g(x) = y^T (X X^T + \lambda I)^{-1} [\langle x_1, x \rangle, \dots, \langle x_n, x \rangle]^T
     = \sum_i y_i \frac{\langle x_i, x \rangle}{\langle x_i, x_i \rangle + \lambda}

- This is an orthogonal projection: no generalization

General Case

For the fitted values on the training set, using the SVD X_{n \times d} = U_{n \times d} \Sigma_{d \times d} V^T_{d \times d}:

\hat{y}^T_{1 \times n} = y^T_{1 \times n} (X X^T + \lambda I)^{-1}_{n \times n} X X^T
  = y^T (U \Sigma^2 U^T + \lambda I)^{-1} U \Sigma V^T X^T
  = y^T U (\Sigma^2 + \lambda I)^{-1} U^T U \Sigma V^T X^T
  = y^T U (\Sigma^2 + \lambda I)^{-1} \Sigma V^T X^T
  = (U^T_{d \times n} y)^T_{1 \times d} (\Sigma^2 + \lambda I)^{-1} \Sigma V^T V \Sigma U^T
  = (U^T_{d \times n} y_{n \times 1})^T_{1 \times d} (\Sigma^2 + \lambda I)^{-1} \Sigma^2 U^T

How to interpret this? Does this still make sense?
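The primal/dual equivalence described above is easy to verify numerically. The following sketch, again with made-up data and an arbitrary lambda, solves the d-by-d primal system and the n-by-n dual system and confirms that both give the same prediction and that w = X^T alpha:

```python
import numpy as np

# Illustrative data only; all names and values are invented for the sketch.
rng = np.random.default_rng(1)
n, d, lam = 30, 8, 0.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
x_new = rng.standard_normal(d)

# Primal: g(x) = <x, (X^T X + lam I)^-1 X^T y>
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
g_primal = x_new @ w

# Dual: alpha = (G + lam I)^-1 y with Gram matrix G = X X^T,
# then g(x) = sum_i alpha_i <x_i, x>
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(n), y)
g_dual = alpha @ (X @ x_new)            # X @ x_new holds the inner products <x_i, x_new>

print(np.allclose(g_primal, g_dual))    # True: both formulations give the same prediction
print(np.allclose(w, X.T @ alpha))      # True: w is a combination of the training vectors
```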
Physical Meaning of SVD

- Assume n > d, so X has rank at most d

X_{n \times d} = U_{n \times d} \Sigma_{d \times d} V^T_{d \times d}

- U contains the body (data)-fitted axes
- U^T is a projection from the n-dimensional space to the d-dimensional space
- \Sigma gives the importance of each dimension
- V is the representation of X in the d-dimensional space

Interpretation

\hat{y} = \sum_i u_i \frac{\sigma_i^2}{\sigma_i^2 + \lambda} u_i^T y,
\qquad
\hat{y}^T_{1 \times n} = (U^T_{d \times n} y_{n \times 1})^T_{1 \times d} (\Sigma^2 + \lambda I)^{-1} \Sigma^2 U^T

- In the new, uncorrelated space there are only d training vectors and d decisions
- U^T y: the d-by-1 uncorrelated decision vector
- (\Sigma^2 + \lambda I)^{-1} \Sigma^2: weighting of the significance of the components of that decision vector
- U^T: the transformed (uncorrelated) training samples
- Still the same interpretation: similarity measured in a new space via the Gram matrix, i.e., inner products of the training samples with the new sample

First Important Concept

- The computation involves only inner products: of the training samples, to compute the Gram matrix; and of the new sample, to compute the regression or classification result
- Similarity is measured in terms of angle, instead of distance

Second Important Concept

- Using angle or distance as the similarity measure does not by itself make a problem easier or harder
- If you cannot separate the data, it does not matter which similarity measure you use
- Instead, "massage" the data: transform it into a higher (even infinite) dimensional space
- The data then become "more likely" to be linearly separable (caveat: the choice of the kernel function is important)
- The inner product can no longer be computed efficiently? With the kernel trick, you do not have to

Why?

- There are a lot more "basis" features now
- Before: X is n by d, with columns such as FICA and income
- After: X is n by D with D >> d, with columns \phi_i(FICA, income, ...)

Example: XOR

- d = 2: the original columns are X_1 and X_2
- The transformed X (with many more columns) includes the product feature X_1 X_2, which makes the XOR pattern linearly separable

Example (Doesn't Quite Work Yet)

\phi : x = (x_1, x_2) \rightarrow \phi(x) = (x_1^2, x_2^2, x_1 + x_2) \in F = R^3

- We need to keep the nice property of requiring only inner products in the computation (the dual formulation)
- But what happens if the feature dimension is very high, or even infinitely high?
- The inner product in the high (or infinitely high) dimensional feature space can be calculated without an explicit mapping, through a kernel function

In More Details

g(x) = \langle w, x \rangle = \sum_i \alpha_i \langle x_i, x \rangle
     = y^T (X X^T + \lambda I)^{-1} [\langle x_1, x \rangle, \dots, \langle x_n, x \rangle]^T

With the mapping \phi, every inner product is replaced by \phi(x_i)^T \phi(x_j):

g(x) = [y_1 \; y_2 \; \cdots \; y_n]_{1 \times n}
  \left(
    \begin{bmatrix} - \phi(x_1)^T - \\ \vdots \\ - \phi(x_n)^T - \end{bmatrix}
    \begin{bmatrix} | & & | \\ \phi(x_1) & \cdots & \phi(x_n) \\ | & & | \end{bmatrix}
    + \lambda I
  \right)^{-1}
  \begin{bmatrix} \phi(x_1)^T \phi(x) \\ \vdots \\ \phi(x_n)^T \phi(x) \end{bmatrix}

where the kernelized Gram matrix has entries \phi(x_i)^T \phi(x_j) + \lambda \delta_{ij}, e.g., its first row is \phi(x_1)^T \phi(x_1) + \lambda, \; \phi(x_1)^T \phi(x_2), \; \dots, \; \phi(x_1)^T \phi(x_n).
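To make the kernel trick concrete, here is a small kernel ridge regression sketch. The RBF kernel, the XOR-style labels, and all names and values below are illustrative choices for this sketch (any valid kernel could be substituted); the point is that only kernel evaluations k(x_i, x_j) = <phi(x_i), phi(x_j)> ever appear, never phi itself:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2), computed for all pairs of rows of A and B
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

# Illustrative data: 2-D points with XOR-like labels, as in the XOR example above
rng = np.random.default_rng(2)
n, d, lam = 40, 2, 0.1
X = rng.uniform(-1, 1, (n, d))
y = np.sign(X[:, 0] * X[:, 1])

# Kernelized Gram matrix and dual coefficients: alpha = (K + lam I)^-1 y
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(n), y)

def g(x_new):
    # g(x) = sum_i alpha_i k(x_i, x) -- only kernel evaluations, no explicit phi
    return rbf_kernel(X, x_new[None, :])[:, 0] @ alpha

# Two probe points in opposite XOR quadrants; their predictions should typically differ in sign
print(g(np.array([0.5, 0.5])), g(np.array([-0.5, 0.5])))
```

For a kernel like the RBF, the corresponding feature map phi is infinite dimensional, so the explicit-mapping route is not even an option; the dual formulation sidesteps it entirely.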