
Kernel Methods for Statistical Learning
Kenji Fukumizu
The Institute of Statistical Mathematics / Graduate University for Advanced Studies
September 6-7, 2012, Machine Learning Summer School 2012, Kyoto
The latest version of the slides is downloadable at http://www.ism.ac.jp/~fukumizu/MLSS2012/

Lecture Plan
I. Introduction to kernel methods
II. Various kernel methods: kernel PCA, kernel CCA, kernel ridge regression, etc.
III. Support vector machine: a brief introduction to SVM
IV. Theoretical backgrounds of kernel methods: mathematical aspects of positive definite kernels
V. Nonparametric inference with positive definite kernels: recent advances of kernel methods

General references (more detailed lists are given at the end of each section)
– Schölkopf, B. and A. Smola. Learning with Kernels. MIT Press, 2002.
– Lecture slides (more detailed than this course). The linked page is in Japanese, but the slides are written in English. Slides: 1, 2, 3, 4, 5, 6, 7, 8.
– For Japanese readers only (sorry!):
  • Fukumizu, Introduction to Kernel Methods: Data Analysis with Positive Definite Kernels (in Japanese). Asakura Shoten, 2010.
  • Akaho, Kernel Multivariate Analysis: New Developments in Nonlinear Data Analysis (in Japanese). Iwanami Shoten, 2008.

I. Introduction to Kernel Methods
Kenji Fukumizu
The Institute of Statistical Mathematics / Graduate University for Advanced Studies
September 6-7, Machine Learning Summer School 2012, Kyoto (I-1)

Outline (I-2)
1. Linear and nonlinear data analysis
2. Principles of kernel methods

Linear and nonlinear data analysis (I-3)

What is data analysis? (I-4)
– "Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making." – Wikipedia

Linear data analysis (I-5)
– Data are given as a 'table' of numbers, with the matrix expression

    X = \begin{pmatrix} X_1^{(1)} & \cdots & X_m^{(1)} \\ X_1^{(2)} & \cdots & X_m^{(2)} \\ \vdots & & \vdots \\ X_1^{(N)} & \cdots & X_m^{(N)} \end{pmatrix}    (N data points, each m-dimensional).

– Linear algebra is used for the methods of analysis:
  • Correlation,
  • Linear regression analysis,
  • Principal component analysis,
  • Canonical correlation analysis, etc.

Example 1: Principal component analysis (PCA) (I-6)
PCA: project data onto the low-dimensional subspace of largest variance.

    1st direction = \arg\max_{\|a\|=1} \mathrm{Var}[a^T X],

where

    \mathrm{Var}[a^T X] = \frac{1}{N} \sum_{i=1}^N \Bigl( a^T \Bigl( X^{(i)} - \frac{1}{N}\sum_{j=1}^N X^{(j)} \Bigr) \Bigr)^2 = a^T V_{XX} a,

    V_{XX} = \frac{1}{N} \sum_{i=1}^N \Bigl( X^{(i)} - \frac{1}{N}\sum_{j=1}^N X^{(j)} \Bigr) \Bigl( X^{(i)} - \frac{1}{N}\sum_{j=1}^N X^{(j)} \Bigr)^T

is the (empirical) covariance matrix of X.

(I-7)
– 1st principal direction = \arg\max_{\|a\|=1} a^T V_{XX} a = u_1, the unit eigenvector for the largest eigenvalue of V_{XX}.
– p-th principal direction = the unit eigenvector for the p-th largest eigenvalue of V_{XX}.
– PCA is therefore an eigenproblem of the covariance matrix.

Example 2: Linear classification (I-8)
– Binary classification: input data X^{(1)}, ..., X^{(N)} ∈ R^m (the N × m matrix above) with class labels Y^{(1)}, ..., Y^{(N)} ∈ {±1}.
– Find a linear classifier h(x) = sgn(a^T x + b) so that h(X^{(i)}) = Y^{(i)} for all (or most) i.
– Examples: Fisher's linear discriminant analysis, linear SVM, etc.

Are linear methods enough? (I-9)
[Figure: data that is linearly inseparable in (x_1, x_2) becomes linearly separable in (z_1, z_2, z_3) after the transform]

    (z_1, z_2, z_3) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2).

Watch the following movie: http://jp.youtube.com/watch?v=3liCbRZPrZA

Another example: correlation (I-10)

    \rho_{XY} = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}} = \frac{E\bigl[(X - E[X])(Y - E[Y])\bigr]}{\sqrt{E\bigl[(X - E[X])^2\bigr]\, E\bigl[(Y - E[Y])^2\bigr]}}

[Figure: scatter plot of (X, Y) with ρ = 0.94]

(I-11)
[Figure: scatter plots of (X, Y) and of (X^2, Y) after the transform (X, Y) → (X^2, Y); ρ(X, Y) = 0.17, while ρ(X^2, Y) = 0.96]
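The two toy examples above can be checked directly. Below is a minimal NumPy sketch (not part of the original slides; the variable names, the circle radius 0.5, and the noise level are illustrative choices). It applies the explicit quadratic map (x_1^2, x_2^2, \sqrt{2} x_1 x_2) to two classes separated by a circle, where a simple linear rule in the feature space classifies perfectly, and it reproduces the effect of the transform (X, Y) → (X^2, Y) on the correlation coefficient.

    # Sketch for the examples of slides I-9 to I-11; not from the lecture material.
    import numpy as np

    rng = np.random.default_rng(0)

    # --- (z1, z2, z3) = (x1^2, x2^2, sqrt(2) x1 x2) makes circular classes linearly separable ---
    n = 200
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    y = np.where(np.sum(X**2, axis=1) < 0.5, -1, 1)   # label by inside/outside a circle

    def quad_features(X):
        """Explicit feature map (x1^2, x2^2, sqrt(2) x1 x2), applied row-wise."""
        return np.column_stack([X[:, 0]**2, X[:, 1]**2, np.sqrt(2) * X[:, 0] * X[:, 1]])

    Z = quad_features(X)
    # In feature space z1 + z2 = x1^2 + x2^2, so the linear rule sgn(z1 + z2 - 0.5) separates the classes.
    pred = np.where(Z[:, 0] + Z[:, 1] < 0.5, -1, 1)
    print("accuracy of a linear rule in feature space:", np.mean(pred == y))   # 1.0

    # --- Correlation before and after the transform (X, Y) -> (X^2, Y) ---
    x = rng.uniform(-1.0, 1.0, size=500)
    yy = x**2 + 0.05 * rng.standard_normal(500)         # nonlinear dependence on x
    print("corr(X  , Y):", np.corrcoef(x, yy)[0, 1])    # close to 0
    print("corr(X^2, Y):", np.corrcoef(x**2, yy)[0, 1]) # close to 1

The exact correlation values differ from the 0.17 and 0.96 shown on the slide, since the data here are synthetic.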
Nonlinear transform helps! (I-12)
– "Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making." – Wikipedia
– Kernel method = a systematic way of transforming data into a high-dimensional feature space to extract nonlinearity or higher-order moments of data.

Principles of kernel methods (I-13)

Kernel method: Big picture (I-14)
– Idea of the kernel method: map the data x_i, x_j in the space of original data Ω to feature vectors Φ(x_i), Φ(x_j) in a feature space H_k by a feature map Φ, and do linear analysis in the feature space (e.g. SVM).
– What kind of space is appropriate as a feature space?
  • It should incorporate various nonlinear information of the original data.
  • The inner product should be computable; it is essential for many linear methods.

Computational issue (I-15)
– For example, how about using a power series expansion?

    (X, Y, Z) → (X, Y, Z, X^2, Y^2, Z^2, XY, YZ, ZX, ...)

– But many recent data are high-dimensional (e.g. microarray, images, etc.), so the above expansion is intractable. E.g., up to 2nd-order moments in 10,000 dimensions, the dimension of the feature space is C(10000, 1) + C(10000, 2) = 50,005,000 (!)
– We need a cleverer way: the kernel method.

Feature space by positive definite kernel (I-16)
– Feature map from the original space to the feature space:

    Φ : Ω → H,    X_1, ..., X_n ↦ Φ(X_1), ..., Φ(X_n).

– With a special choice of the feature space, we have a function k(x, y) (a positive definite kernel) such that

    ⟨Φ(X_i), Φ(X_j)⟩ = k(X_i, X_j)    (kernel trick).

– Many linear methods use only the inner products of the data and do not need the explicit form of the feature vectors Φ(X_i) (e.g., PCA).

Positive definite kernel (I-17)
Definition. Let Ω be a set. A function k : Ω × Ω → R is a positive definite kernel if
1) (symmetry) k(x, y) = k(y, x), and
2) (positivity) for arbitrary x_1, ..., x_n ∈ Ω, the Gram matrix

    \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix}

is positive semidefinite, i.e., \sum_{i,j=1}^n c_i c_j k(x_i, x_j) \ge 0 for any c_1, ..., c_n ∈ R.

Examples: positive definite kernels on R^m (I-18) (proofs are given in Section IV)
• Euclidean inner product: k(x, y) = x^T y.
• Gaussian RBF kernel: k_G(x, y) = \exp\bigl(-\|x - y\|^2 / (2\sigma^2)\bigr),  σ > 0.
• Laplacian kernel: k_L(x, y) = \exp\bigl(-\alpha \sum_{i=1}^m |x_i - y_i|\bigr),  α > 0.
• Polynomial kernel: k_P(x, y) = (c + x^T y)^d,  c ≥ 0, d ∈ N.

(I-19)
Proposition 1.1
Let H be a vector space with inner product ⟨·, ·⟩ and Φ : Ω → H a map (feature map). If k : Ω × Ω → R is defined by

    k(x, y) = ⟨Φ(x), Φ(y)⟩    (kernel trick),

then k(x, y) is necessarily positive definite.
– Positive definiteness is thus necessary for the kernel trick.
– Proof:

    \sum_{i,j=1}^n c_i c_j k(X_i, X_j) = \sum_{i,j=1}^n c_i c_j ⟨Φ(X_i), Φ(X_j)⟩ = \Bigl\langle \sum_{i=1}^n c_i Φ(X_i), \sum_{j=1}^n c_j Φ(X_j) \Bigr\rangle = \Bigl\| \sum_{i=1}^n c_i Φ(X_i) \Bigr\|^2 \ge 0.

(I-20)
– Positive definiteness is also sufficient:
Theorem 1.2 (Moore-Aronszajn)
For a positive definite kernel k on Ω, there is a Hilbert space H_k (the reproducing kernel Hilbert space, RKHS) that consists of functions on Ω such that
1) k(·, x) ∈ H_k for any x ∈ Ω;
2) span{ k(·, x) : x ∈ Ω } is dense in H_k;
3) (reproducing property) ⟨f, k(·, x)⟩ = f(x) for any f ∈ H_k and x ∈ Ω.
* Hilbert space: a vector space with an inner product that is complete with respect to the induced norm.

Feature map by positive definite kernel (I-21)
– Prepare a positive definite kernel k.
– Feature space = RKHS H_k.
– Feature map:

    Φ : Ω → H_k,  x ↦ k(·, x),    so that  X_1, ..., X_n ↦ k(·, X_1), ..., k(·, X_n).

– Kernel trick: by the reproducing property,

    ⟨Φ(x), Φ(y)⟩ = ⟨k(·, x), k(·, y)⟩ = k(x, y).

– All we need is a positive definite kernel: we do not need an explicit form of the feature vector or the feature space. Computations in kernel methods use only the kernel values k(x, y).
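The definitions above can be illustrated numerically. The following sketch (not from the slides; the sample size and kernel parameters are arbitrary choices) builds Gram matrices for the three nontrivial kernels listed on slide I-18 and checks symmetry and positive semidefiniteness, and it verifies the kernel trick for the polynomial kernel (x^T y)^2 on R^2, whose explicit feature map is exactly the quadratic map of slide I-9.

    # Numerical check of positive definiteness and the kernel trick; illustrative only.
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((50, 2))                  # 50 points in R^2

    def gram(kernel, X):
        """Gram matrix K_ij = k(X_i, X_j)."""
        return np.array([[kernel(x, y) for y in X] for x in X])

    gauss   = lambda x, y, sigma=1.0: np.exp(-np.sum((x - y)**2) / (2 * sigma**2))
    laplace = lambda x, y, alpha=1.0: np.exp(-alpha * np.sum(np.abs(x - y)))
    poly2   = lambda x, y: (x @ y)**2                 # polynomial kernel with c = 0, d = 2

    for name, k in [("Gaussian", gauss), ("Laplacian", laplace), ("polynomial d=2", poly2)]:
        K = gram(k, X)
        print(name, "symmetric:", np.allclose(K, K.T),
              " min eigenvalue:", np.linalg.eigvalsh(K).min())   # nonnegative up to rounding error

    # Kernel trick: <Phi(x), Phi(y)> = (x^T y)^2 for Phi(x) = (x1^2, x2^2, sqrt(2) x1 x2).
    phi = lambda x: np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])
    x, y = X[0], X[1]
    print("kernel trick holds:", np.isclose(phi(x) @ phi(y), poly2(x, y)))   # True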
II. Various Kernel Methods
Kenji Fukumizu
The Institute of Statistical Mathematics / Graduate University for Advanced Studies
September 6-7, Machine Learning Summer School 2012, Kyoto (II-1)

Outline (II-2)
1. Kernel PCA
2. Kernel CCA
3. Kernel ridge regression
4. Some topics on kernel methods

Kernel Principal Component Analysis (II-3)

Principal Component Analysis (review) (II-4)
– Linear method for dimension reduction of data.
– Project the data onto the directions of large variance:

    1st principal axis = \arg\max_{\|a\|=1} \mathrm{Var}[a^T X],

    \mathrm{Var}[a^T X] = \frac{1}{n} \sum_{i=1}^n \Bigl( a^T \Bigl( X_i - \frac{1}{n}\sum_{j=1}^n X_j \Bigr) \Bigr)^2 = a^T V_{XX} a,

where

    V_{XX} = \frac{1}{n} \sum_{i=1}^n \Bigl( X_i - \frac{1}{n}\sum_{j=1}^n X_j \Bigr) \Bigl( X_i - \frac{1}{n}\sum_{j=1}^n X_j \Bigr)^T.

From PCA to Kernel PCA (II-5)
– Kernel PCA: nonlinear dimension reduction of data (Schölkopf et al. 1998).
– Do PCA in the feature space:

    PCA:         \max_{\|a\|=1} \mathrm{Var}[a^T X] = \frac{1}{n} \sum_{i=1}^n \Bigl( a^T \Bigl( X_i - \frac{1}{n}\sum_{s=1}^n X_s \Bigr) \Bigr)^2,

    Kernel PCA:  \max_{\|f\|_H=1} \mathrm{Var}[\langle f, \Phi(X) \rangle] = \frac{1}{n} \sum_{i=1}^n \Bigl\langle f, \Phi(X_i) - \frac{1}{n}\sum_{s=1}^n \Phi(X_s) \Bigr\rangle^2.

(II-6)
– It is sufficient to assume

    f = \sum_{i=1}^n c_i \Bigl( \Phi(X_i) - \frac{1}{n}\sum_{s=1}^n \Phi(X_s) \Bigr).

  Directions orthogonal to the data can be neglected: if f = \sum_{i=1}^n c_i ( \Phi(X_i) - \frac{1}{n}\sum_{s=1}^n \Phi(X_s) ) + f_⊥, where f_⊥ is orthogonal to span\{ \Phi(X_i) - \frac{1}{n}\sum_{s=1}^n \Phi(X_s) \}_{i=1}^n, then the objective function of kernel PCA does not depend on f_⊥.
– Then

    \mathrm{Var}[\langle f, \Phi(X) \rangle] = \frac{1}{n}\, c^T \tilde{K}_X^2 c,    \|f\|_H^2 = c^T \tilde{K}_X c,    [Exercise]

  where \tilde{K}_{X,ij} := \langle \tilde{\Phi}(X_i), \tilde{\Phi}(X_j) \rangle is the centered Gram matrix, with the centered feature vector \tilde{\Phi}(X_i) := \Phi(X_i) - \frac{1}{n}\sum_{s=1}^n \Phi(X_s).

Objective function of kernel PCA (II-7)
– Dropping the constant factor 1/n, which does not affect the maximizer, the problem becomes

    \max\; c^T \tilde{K}_X^2 c    subject to    c^T \tilde{K}_X c = 1.

– The centered Gram matrix is expressed with the Gram matrix K_X = ( k(X_i, X_j) )_{ij} as

    \tilde{K}_X = \Bigl( I_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^T \Bigr) K_X \Bigl( I_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^T \Bigr),

  where \mathbf{1}_n = (1, ..., 1)^T ∈ R^n and I_n is the unit matrix, i.e.,

    (\tilde{K}_X)_{ij} = k(X_i, X_j) - \frac{1}{n}\sum_{s=1}^n k(X_i, X_s) - \frac{1}{n}\sum_{t=1}^n k(X_t, X_j) + \frac{1}{n^2}\sum_{t,s=1}^n k(X_t, X_s).    [Exercise]

– Kernel PCA can be solved by eigen-decomposition of \tilde{K}_X.
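To make the recipe of slides II-6 and II-7 concrete, here is a minimal kernel-PCA sketch (not from the slides; the Gaussian kernel, its bandwidth, and the scaling of the coefficient vectors are illustrative choices, and normalization conventions vary across references). It centers the Gram matrix with I_n - (1/n) 1_n 1_n^T, solves the eigenproblem, and returns the projections of the data onto the leading kernel principal components.

    # Kernel PCA via eigen-decomposition of the centered Gram matrix; illustrative sketch.
    import numpy as np

    def kernel_pca(X, kernel, n_components=2):
        n = X.shape[0]
        K = np.array([[kernel(x, y) for y in X] for x in X])     # Gram matrix K_X
        Q = np.eye(n) - np.ones((n, n)) / n                      # I_n - (1/n) 1_n 1_n^T
        K_tilde = Q @ K @ Q                                      # centered Gram matrix
        eigvals, eigvecs = np.linalg.eigh(K_tilde)               # eigenvalues in ascending order
        eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]       # largest first
        # Scale each eigenvector v so that c = v / sqrt(lambda) satisfies c^T K_tilde c = 1.
        C = eigvecs[:, :n_components] / np.sqrt(np.maximum(eigvals[:n_components], 1e-12))
        # Projection of the i-th (centered) data point on the p-th component: (K_tilde C)_{ip}.
        return K_tilde @ C

    # Usage on the circular toy data of slide I-9: nonlinear features without an explicit feature map.
    rng = np.random.default_rng(2)
    X = rng.uniform(-1.0, 1.0, size=(200, 2))
    gauss = lambda x, y, sigma=0.5: np.exp(-np.sum((x - y)**2) / (2 * sigma**2))
    Z = kernel_pca(X, gauss, n_components=2)
    print(Z.shape)   # (200, 2)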