Kernel Methods for Statistical Learning
Kenji Fukumizu
The Institute of Statistical Mathematics / Graduate University for Advanced Studies
September 6-7, 2012 Machine Learning Summer School 2012, Kyoto
The latest version of the slides can be downloaded at http://www.ism.ac.jp/~fukumizu/MLSS2012/
Lecture Plan
I. Introduction to kernel methods
II. Various kernel methods: kernel PCA, kernel CCA, kernel ridge regression, etc.
III. Support vector machine: a brief introduction to SVM
IV. Theoretical backgrounds of kernel methods: mathematical aspects of positive definite kernels
V. Nonparametric inference with positive definite kernels: recent advances in kernel methods
General references (more detailed lists are given at the end of each section)
– Schölkopf, B. and A. Smola. Learning with Kernels. MIT Press, 2002.
– Lecture slides (more detailed than this course). The page contains Japanese text, but the slides are written in English. Slides: 1, 2, 3, 4, 5, 6, 7, 8
– In Japanese only (sorry!):
• Fukumizu. カーネル法入門 ―正定値カーネルによるデータ解析 (Introduction to Kernel Methods: Data Analysis with Positive Definite Kernels). Asakura Shoten, 2010.
• Akaho. カーネル多変量解析 ―非線形データ解析の新しい展開 (Kernel Multivariate Analysis: New Developments in Nonlinear Data Analysis). Iwanami Shoten, 2008.

I. Introduction to Kernel Methods
Outline
1. Linear and nonlinear data analysis
2. Principles of kernel methods
Linear and nonlinear data analysis
What is data analysis?
– "Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making." – Wikipedia
Linear data analysis
– 'Table' of numbers ⇒ matrix expression:

$$X = \begin{pmatrix} X_1^{(1)} & \cdots & X_m^{(1)} \\ X_1^{(2)} & \cdots & X_m^{(2)} \\ \vdots & & \vdots \\ X_1^{(N)} & \cdots & X_m^{(N)} \end{pmatrix}$$

(N data points, each m-dimensional)
– Linear algebra is used for the methods of analysis:
• Correlation
• Linear regression analysis
• Principal component analysis
• Canonical correlation analysis, etc.
Example 1: Principal component analysis (PCA)
PCA: project data onto the low-dimensional subspace of largest variance.
1st direction $= \arg\max_{\|a\|=1} \mathrm{Var}[a^T X]$, where

$$\mathrm{Var}[a^T X] = \frac{1}{N}\sum_{i=1}^N \Bigl( a^T \Bigl( X^{(i)} - \frac{1}{N}\sum_{j=1}^N X^{(j)} \Bigr) \Bigr)^2 = a^T V_{XX} a,$$

$$V_{XX} = \frac{1}{N}\sum_{i=1}^N \Bigl( X^{(i)} - \frac{1}{N}\sum_{j=1}^N X^{(j)} \Bigr) \Bigl( X^{(i)} - \frac{1}{N}\sum_{j=1}^N X^{(j)} \Bigr)^T$$

is the (empirical) covariance matrix of X.
– 1st principal direction: $u_1 = \arg\max_{\|a\|=1} a^T V_{XX} a$ = the unit eigenvector w.r.t. the largest eigenvalue of $V_{XX}$.
– p-th principal direction = the unit eigenvector w.r.t. the p-th largest eigenvalue of $V_{XX}$.

PCA ⇒ eigenproblem of the covariance matrix.
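As an illustration (not from the original slides), here is a minimal NumPy sketch of PCA computed exactly this way, as the eigenproblem of the empirical covariance matrix; the function name pca_directions and the toy data are my own.

```python
import numpy as np

def pca_directions(X, n_components):
    """X: (N, m) data matrix. Returns the top principal directions as columns."""
    Xc = X - X.mean(axis=0)                  # center the data
    V = (Xc.T @ Xc) / X.shape[0]             # empirical covariance matrix V_XX
    eigvals, eigvecs = np.linalg.eigh(V)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # sort descending
    return eigvecs[:, order[:n_components]]  # p-th column = p-th principal direction

# Usage: project 5-dimensional toy data onto the first two principal directions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
U = pca_directions(X, 2)
Z = (X - X.mean(axis=0)) @ U
```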
Example 2: Linear classification
– Binary classification. Input data $X^{(i)} \in \mathbb{R}^m$ with class labels $Y^{(i)} \in \{\pm 1\}$ $(i = 1, \ldots, N)$.
Find a linear classifier $h(x) = \mathrm{sgn}(a^T x + b)$ so that $h(X^{(i)}) = Y^{(i)}$ for all (or most) $i$.
– Examples: Fisher's linear discriminant analysis, linear SVM, etc.
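A hedged sketch (mine, not the slides') of such a classifier, with $(a, b)$ chosen by Fisher's linear discriminant, one of the methods named above:

```python
import numpy as np

def fisher_lda(X, Y):
    """X: (N, m) inputs; Y: (N,) labels in {+1, -1}. Returns (a, b)."""
    mu_pos = X[Y == 1].mean(axis=0)
    mu_neg = X[Y == -1].mean(axis=0)
    # pooled within-class covariance (unweighted sum, a common simplification)
    Sw = np.cov(X[Y == 1].T) + np.cov(X[Y == -1].T)
    a = np.linalg.solve(Sw, mu_pos - mu_neg)  # discriminant direction
    b = -a @ (mu_pos + mu_neg) / 2            # threshold at the class midpoint
    return a, b

def h(x, a, b):
    return np.sign(a @ x + b)  # the linear classifier h(x) = sgn(a^T x + b)
```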
Are linear methods enough?
[Figure: linearly inseparable data in $(x_1, x_2)$ become linearly separable in $(z_1, z_2, z_3)$ after the transform]

$$(z_1, z_2, z_3) = (x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2)$$
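The effect is easy to verify numerically. The sketch below (my construction, assuming NumPy) labels points by a circular boundary, which no line in $\mathbb{R}^2$ separates, and checks that the quadratic map above separates them with a hyperplane:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 0.5, 1, -1)  # circular boundary in R^2

# feature map (x1, x2) -> (x1^2, x2^2, sqrt(2) x1 x2) into R^3
Z = np.column_stack([X[:, 0]**2, X[:, 1]**2, np.sqrt(2) * X[:, 0] * X[:, 1]])

# In feature space the boundary is linear: z1 + z2 = x1^2 + x2^2 = 0.5,
# so the hyperplane with normal a = (1, 1, 0) and offset b = -0.5 separates.
pred = np.where(Z[:, 0] + Z[:, 1] < 0.5, 1, -1)
assert (pred == y).all()
```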
Watch the following movie!
http://jp.youtube.com/watch?v=3liCbRZPrZA

Another example: correlation

$$\rho_{XY} = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}} = \frac{E\bigl[(X - E[X])(Y - E[Y])\bigr]}{\sqrt{E\bigl[(X - E[X])^2\bigr]\, E\bigl[(Y - E[Y])^2\bigr]}}$$
[Figure: scatter plot of $(X, Y)$ with $\rho = 0.94$]

[Figure: two panels; left, $(X, Y)$ with $\rho(X, Y) = 0.17$; right, after the transform $(X, Y) \mapsto (X^2, Y)$, $\rho(X^2, Y) = 0.96$]
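A small script (my reconstruction of the experiment, assuming NumPy; the exact data behind the figures are not given) reproduces the qualitative effect:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.5, 1.5, size=1000)
Y = X**2 + 0.1 * rng.normal(size=1000)   # Y depends on X, but nonlinearly

def rho(U, V):
    return np.cov(U, V)[0, 1] / np.sqrt(np.var(U, ddof=1) * np.var(V, ddof=1))

print(rho(X, Y))     # near 0: linear correlation misses the dependence
print(rho(X**2, Y))  # near 1: revealed by the transform (X, Y) -> (X^2, Y)
```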
Nonlinear transform helps!
Recall: "Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making." – Wikipedia. The transforming step is where nonlinearity can enter.
Kernel method = a systematic way of transforming data into a high-dimensional feature space to extract nonlinearity or higher-order moments of the data.
Principles of kernel methods
Kernel method: Big picture
– Idea of kernel method
[Figure: the feature map $\Phi : \Omega \to H_k$, $x_i \mapsto \Phi(x_i)$, $x_j \mapsto \Phi(x_j)$, sends data from the space of original data to the feature space]

Do linear analysis in the feature space! (e.g. SVM)
– What kind of space is appropriate as a feature space?
• It should incorporate various nonlinear information of the original data.
• The inner product should be computable; this is essential for many linear methods.

Computational issue
– For example, how about using a power series expansion?
$$(X, Y, Z) \mapsto (X, Y, Z, X^2, Y^2, Z^2, XY, YZ, ZX, \ldots)$$
– But many recent data are high-dimensional, e.g. microarrays, images, etc.
The above expansion is intractable!
e.g. up to 2nd-order moments with 10,000 dimensions:
dim. of feature space $= \binom{10000}{1} + \binom{10000}{2} = 50{,}005{,}000$ (!)
– We need a cleverer way ⇒ the kernel method.
Feature space by positive definite kernel
– Feature map: from the original space to the feature space,
$$\Phi : \Omega \to H, \qquad X_1, \ldots, X_n \mapsto \Phi(X_1), \ldots, \Phi(X_n)$$
– With a special choice of the feature space, we have a function (a positive definite kernel) $k(x, y)$ such that
$$\langle \Phi(X_i), \Phi(X_j) \rangle = k(X_i, X_j) \qquad \text{(kernel trick)}$$
– Many linear methods use only the inner products of the data and do not need the explicit form of the vector $\Phi(x)$ (e.g., PCA).
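The point is easiest to see with the degree-2 polynomial kernel $k(x, y) = (x^T y)^2$, whose feature map consists of all products $x_i x_j$. The sketch below (mine, assuming NumPy) checks that the kernel gives the same inner product as the explicit $m^2$-dimensional expansion, at O(m) instead of O(m²) cost:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000
x, y = rng.normal(size=m), rng.normal(size=m)

k = (x @ y) ** 2                           # kernel trick: O(m), no expansion

Phi = lambda v: np.outer(v, v).ravel()     # explicit features (v_i v_j), m^2-dim
k_explicit = Phi(x) @ Phi(y)               # O(m^2): intractable for large m

assert np.isclose(k, k_explicit)
```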
Positive definite kernel
Definition. Let Ω be a set. $k : \Omega \times \Omega \to \mathbb{R}$ is a positive definite kernel if
1) (symmetry) $k(x, y) = k(y, x)$;
2) (positivity) for arbitrary $x_1, \ldots, x_n \in \Omega$, the Gram matrix
$$\begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix}$$
is positive semidefinite, i.e., $\sum_{i,j=1}^n c_i c_j k(x_i, x_j) \ge 0$ for any $c_i \in \mathbb{R}$.
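Condition 2) can be probed numerically for a candidate kernel by checking the eigenvalues of a Gram matrix. A sketch (mine, assuming NumPy); note that one finite sample gives only a necessary condition, not a proof:

```python
import numpy as np

def is_psd_gram(k, xs, tol=1e-10):
    """Check that the Gram matrix (k(x_i, x_j))_ij has no negative eigenvalues."""
    G = np.array([[k(xi, xj) for xj in xs] for xi in xs])
    return np.linalg.eigvalsh(G).min() >= -tol  # allow tiny numerical negatives

xs = list(np.random.default_rng(0).normal(size=(20, 3)))
print(is_psd_gram(lambda x, y: float(x @ y), xs))  # True: x^T y is pos. def.
```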
Examples: positive definite kernels on $\mathbb{R}^m$ (proofs are given in Section IV)
• Euclidean inner product: $k(x, y) = x^T y$
• Gaussian RBF kernel: $k_G(x, y) = \exp\bigl(-\|x - y\|^2 / (2\sigma^2)\bigr)$ $(\sigma > 0)$
• Laplacian kernel: $k_L(x, y) = \exp\bigl(-\alpha \sum_{i=1}^m |x_i - y_i|\bigr)$ $(\alpha > 0)$
• Polynomial kernel: $k_P(x, y) = (c + x^T y)^d$ $(c \ge 0,\; d \in \mathbb{N})$
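For reference, minimal NumPy implementations of these four kernels (the function names and default parameters are mine):

```python
import numpy as np

def k_linear(x, y):                 # Euclidean inner product
    return x @ y

def k_gauss(x, y, sigma=1.0):       # Gaussian RBF kernel, sigma > 0
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def k_laplace(x, y, alpha=1.0):     # Laplacian kernel, alpha > 0
    return np.exp(-alpha * np.sum(np.abs(x - y)))

def k_poly(x, y, c=1.0, d=2):       # polynomial kernel, c >= 0, d in N
    return (c + x @ y) ** d
```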
Proposition 1.1. Let $H$ be a vector space with inner product $\langle \cdot, \cdot \rangle$ and $\Phi : \Omega \to H$ be a map (feature map). If $k : \Omega \times \Omega \to \mathbb{R}$ is defined by
$$k(x, y) = \langle \Phi(x), \Phi(y) \rangle \qquad \text{(kernel trick)},$$
then $k(x, y)$ is necessarily positive definite.
– Positive definiteness is necessary.
– Proof)
$$\sum_{i,j=1}^n c_i c_j k(X_i, X_j) = \sum_{i,j=1}^n c_i c_j \langle \Phi(X_i), \Phi(X_j) \rangle = \Bigl\langle \sum_{i=1}^n c_i \Phi(X_i), \sum_{j=1}^n c_j \Phi(X_j) \Bigr\rangle = \Bigl\| \sum_{i=1}^n c_i \Phi(X_i) \Bigr\|^2 \ge 0.$$
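The computation can be mirrored numerically; a toy check (my example) with the feature map $\Phi(x) = (x, x^2)$ on $\mathbb{R}$, so that $k(x, y) = xy + x^2 y^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=10)      # sample points
c = rng.normal(size=10)      # arbitrary coefficients

k = lambda x, y: x * y + x**2 * y**2          # k(x,y) = <Phi(x), Phi(y)>
quad = sum(c[i] * c[j] * k(X[i], X[j]) for i in range(10) for j in range(10))

Phi = lambda x: np.array([x, x**2])
norm_sq = np.sum(sum(c[i] * Phi(X[i]) for i in range(10)) ** 2)

assert quad >= 0 and np.isclose(quad, norm_sq)  # matches the proof's identity
```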
– Positive definiteness is also sufficient:
Theorem 1.2 (Moore–Aronszajn). For a positive definite kernel $k$ on Ω, there is a Hilbert space $H_k$ (reproducing kernel Hilbert space, RKHS) that consists of functions on Ω such that
1) $k(\cdot, x) \in H_k$ for any $x \in \Omega$;
2) $\mathrm{span}\{ k(\cdot, x) \mid x \in \Omega \}$ is dense in $H_k$;
3) (reproducing property) $\langle f, k(\cdot, x) \rangle_{H_k} = f(x)$ for any $f \in H_k$ and $x \in \Omega$.
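By 2), finite combinations $f = \sum_i c_i k(\cdot, x_i)$ already approximate every element of $H_k$, and by 3) their values are computed through the kernel alone; a small sketch (mine, with an assumed Gaussian kernel):

```python
import numpy as np

def make_f(k, xs, cs):
    """f = sum_i cs[i] * k(., xs[i]), a typical element of the RKHS."""
    return lambda x: sum(ci * k(x, xi) for ci, xi in zip(cs, xs))

k_gauss = lambda x, y, s=1.0: np.exp(-(x - y) ** 2 / (2 * s ** 2))
f = make_f(k_gauss, xs=[0.0, 1.0, 2.0], cs=[1.0, -0.5, 0.3])
print(f(0.5))  # f(x) = <f, k(., x)>_{H_k} by the reproducing property 3)
```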