<<

Kernel Methods for Statistical Learning

Kenji Fukumizu

The Institute of Statistical Mathematics / Graduate University for Advanced Studies

September 6-7, 2012 Summer School 2012, Kyoto

The latest version of slides is downloadable at http://www.ism.ac.jp/~fukumizu/MLSS2012/ 1

Lecture Plan

I. Introduction to methods

II. Various kernel methods kernel PCA, kernel CCA, kernel ridge regression, etc

III. Support vector machine A brief introduction to SVM

IV. Theoretical backgrounds of kernel methods Mathematical aspects of positive definite kernels

V. Nonparametric inference with positive definite kernels Recent advances of kernel methods

2  General references (More detailed lists are given at the end of each section) – Schölkopf, B. and A. Smola. Learning with Kernels. MIT Press. 2002.

– Lecture slides (more detailed than this course) This page contains Japanese information, but the slides are written in English. Slides: 1, 2, 3, 4, 5, 6, 7, 8

– For Japanese only (Sorry!): • 福水 「カーネル法入門 -正定値カーネルによるデータ解析」 朝倉書店(2010)

• 赤穂 「カーネル多変量解析 ―非線形データ解析の新しい展開」 岩波書店(2008) 3 I. Introduction to Kernel Methods

Kenji Fukumizu

The Institute of Statistical Mathematics / Graduate University for Advanced Studies

September 6-7 Machine Learning Summer School 2012, Kyoto

I-11 Outline

1. Linear and nonlinear data analysis

2. Principles of kernel methods

I-22 Linear and nonlinear data analysis

I-33 What is data analysis?

– Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. –Wikipedia

I-4 Linear data analysis

– ‘Table’ of numbers  Matrix expression

 X (1) X (1)   1  m   (2) (2)  X1  X m m N X    dimensional, data      (N ) (N )   X1  X m 

– Linear algebra is used for methods of analysis • Correlation, • analysis, • Principal component analysis, • analysis, etc.

I-5  Example 1: Principal component analysis (PCA)

PCA: project data onto the low-dimensional subspace of largest variance.

T 1st direction = argmax||a||1Var[a X ] 2 N 1   1 N  Var[aT X ]  aT X (i)  X ( j)   j1  N i1   N  T  a VXX a.

where

N T 1  1 N  1 N   X (i)  X ( j) X (i)  X ( j) VXX   j1  j1  N i1  N  N 

(Empirical) of

I-6 – 1st principal direction T  argmax||a||1a VXX a

 u1 unit eigenvector w.r.t. the largest eigenvalue of

– p-th principal direction =

unit eigenvector w.r.t. the p-th largest eigenvalue of

PCA Eigenproblem of covariance matrix

I-7  Example 2: Linear classification – Input data Class label  X (1) X (1)   Y (1)   1  m     (2) (2)  (2) X1  X m  Y  N X    Y    {1}         ( N ) ( N )  Y (N )   X1  X m    Find a h(x)  sgn(aT x  b) so that (i) (i) h(X )  Y for all (or most) i.

– Example: Fisher’s linear discriminant analyzer, Linear SVM, etc.

I-8 Are linear methods enough?

linearly inseparable linearly separable 6

15 4 10

5 2 0 z3 -5 0 x2 transform -10 -2 -15 0

5 20 -4 10 15 z1 10 -6 15 -6 -4 -2 0 2 4 6 5 20 0 z2 x1 2 2 (z1, z2 , z3 )  (x1 , x2 , 2x1x2 )

Watch the following movie!

http://jp.youtube.com/watch?v=3liCbRZPrZA I-9  Another example: correlation Cov[X ,Y ] EX  E[X ] Y  E[Y ]  XY   Var[X ]Var[Y ] EX  E[X ] 2 EY  E[Y ] 2 

3

2  = 0.94

1

Y 0

-1

-2

-3 -3 -2 -1 0 1 2 3 X

I-10 2.5 2.5

2 2

1.5 1.5  X Y  X2 Y Y 1 ( , ) 1 ( , ) = 0.17 = 0.96 0.5 0.5

0 0

-0.5 -0.5 -1.5 -1 -0.5 0 0.5 1 1.5 0 0.5 1 1.5 2 2.5 X

(X, Y) transform (X2, Y)

I-11 Nonlinear transform helps!

Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. – Wikipedia.

Kernel method = a systematic way of transforming data into a high- dimensional feature space to extract nonlinearity or higher-order moments of data.

I-12 Principles of kernel methods

I-13 Kernel method: Big picture

– Idea of kernel method

 , xi Hk   xi feature map x j  xj Space of original data Feature space Do linear analysis in the feature space! e.g. SVM

– What kind of space is appropriate as a feature space? • Should incorporate various nonlinear information of the original data. • The inner product should be computable. It is essential for many linear methods. I-14  Computational issue – For example, how about using power series expansion? (X, Y, Z)  (X, Y, Z, X2, Y2, Z2, XY, YZ, ZX, …)

– But, many recent data are high-dimensional. e.g. microarray, images, etc...

The above expansion is intractable!

e.g. Up to 2nd moments, 10,000 dimension:

Dim of feature space: 10000C1 + 10000C2 = 50,005,000 (!)

– Need a cleverer way  Kernel method.

I-15 Feature space by positive definite kernel

– Feature map: from original space to feature space  :   H

X1,, X n  (X1),,(X n )

–With special choice of feature space, we have a function (positive definite kernel) k(x, y) such that

(X i ),(X j )  k(X i , X j ) kernel trick

– Many linear methods use only the inner products of data, and do not need the explicit form of the vector Φ. (e.g., PCA)

I-16 Positive definite kernel

Definition. Ω: set. : Ω Ω → is a positive definite kernel if 1) (symmetry) k(x, y)  k(y, x)

2) (positivity) for arbitrary x1,…,xn ∈Ω

 k(x , x ) k(x , x )   1 1  1 n       is positive semidefinite,   k(xn , x1)  k(xn , xn ) ()

n c c k(x , x )  0 c  R i.e., i, j1 i j i j for any i

I-17 Examples: positive definite kernels on Rm (proof is give in Section IV) Gaussian • Euclidean inner product k(x, y)  xT y

• Gaussian RBF kernel 2 2 (  0) kG (x, y)  exp x  y  

• Laplacian kernel Laplacian

m k (x, y)  exp | x  y | L i1 i i (  0)

• Polynomial kernel T d kP (x, y)  (c  x y) (c  0,d  N)

I-18 Proposition 1.1 Let be a vector space with inner product ⋅,⋅ and Φ: Ω → be a map (feature map). If : Ω Ω → is defined by

(x),( y)  k(x, y), (kernel trick)

then k(x,y) is necessarily positive definite.

– Positive definiteness is necessary.

– Proof) n n c c k(X , X )  c c (X ),(X ) i, j1 i j i j i, j1 i j i j

n n n 2  c (X ), c (X )  c (X )  0 i1 i i  j1 j j i1 i i

I-19 – Positive definite kernel is sufficient.

Theorem 1.2 (Moore-Aronszajn) For a positive definite kernel on Ω, there is a Hilbert space (reproducing kernel Hilbert space, RKHS) that consists of functions on Ω such that

1) ⋅, ∈ for any 2) span ⋅, ∈ Ω is dense in 3) (reproducing property)

, ⋅, for any ∈, ∈Ω

*Hilbert space: vector space with inner product such that the topology is complete.

I-20 Feature map by positive definite kernel

 , – Prepare a kernel. xi Hk  x xi feature map j – Feature space = RKHS.  xj – Feature map: feature space Φ: Ω → , ↦ ⋅,

,…, ↦ ⋅, ,…, ⋅,

– Kernel trick: by reproducing property Φ ,Φ ⋅, , ⋅, ,

– All we need is a positive definite kernel: We do not need an explicit form of feature vector or feature space.

Computation in kernel methods use only kernel values , .

I-21 II. Various Kernel Methods

Kenji Fukumizu

The Institute of Statistical Mathematics / Graduate University for Advanced Studies

September 6-7 Machine Learning Summer School 2012, Kyoto

1 Outline

1. Kernel PCA

2. Kernel CCA

3. Kernel ridge regression

4. Some topics on kernel methods

II-2 Kernel Principal Component Analysis

II-3 Principal Component Analysis  PCA (review) – Linear method for dimension reduction of data – Project the data in the directions of large variance.

T 1st principal axis = argmax||a||=1Var[a X ] 2 n 1   1 n  [aT X ] = aT X − X Var ∑  i ∑ j=1 j  n i=1   n 

T = a VXX a.

where

n T 1  1 n  1 n  = X − X X − X VXX ∑ i ∑ j=1 j  i ∑ j=1 j  n i=1  n  n 

II-4 From PCA to Kernel PCA

– Kernel PCA: nonlinear dimension reduction of data (Schölkopf et al. 1998).

– Do PCA in feature space

2 n 1   1 n  max : [aT X ] = aT X − X ||a||=1 Var ∑  i ∑s=1 s  n i=1   n 

2 n 1  1 n  max : Var[ f ,Φ(X ) ] = f ,Φ(X ) − Φ(X ) || f ||H =1 ∑ i ∑s=1 s  n i=1  n 

II-5

It is sufficient to assume 푓 1 푓⊥ = 푛 푛

푖 푖 푠 푓 � 푐 Φ 푋 − � Φ 푋 Φ 푖=1 푛 푠=1 푓 Orthogonal directions to the data can be neglected, since for = ( ) + , where is orthogonal to 1 the span푛 { 푛 } , the objective function of kernel 푓 ∑푖=1 푐푖 Φ 푋푖 − 푛 ∑푠=1 Φ 푋푠 푓⊥ 푓⊥ PCA does not depend1 푛 on . 푛 Φ 푋푖 − 푛 ∑푠=1 Φ 푋푠 푖=1 ⊥ T푓 ~ 2 Then, Var[ f ,Φ(X ) ]= c K X c [Exercise] 2 T ~ || f ||H = c K X c ~ ~ ~ where K X ,ij := Φ(X i ),Φ(X j ) (centered Gram matrix) ~ n with Φ(X ) := Φ(X ) − 1 Φ(X ) (centered feature vector) i i n ∑s=1 s II-6  Objective function of kernel PCA

~ ~ max T 2 T = 1 c K X c subject to c K X c

The centered Gram matrix is expressed with Gram matrix = , as 퐾�푋 푋 푖 푗 1 퐾 푘 푋 푋 푖� 1 1 = = 푇 1 푛 푇 푇 ퟏ푛 ⋮ ∈ 퐑 �푋 푛 푛 푛 푛 푛 푛 Unit matrix 퐾~ 퐼 − ퟏ ퟏ 1 퐾 n 퐼 − ퟏ ퟏ = K = k(X푛, X ) − k(X푛, X ) ( X )ij i j ∑s=1 i s n 퐼푛 1 n 1 N − ( , ) + ( , ) ∑ =1 k X t X j 2 ∑ , =1 k X t X s n t n t s [Exercise] II-7 – Kernel PCA can be solved by eigen-decomposition.

– Kernel PCA algorithm ~ • Compute centered Gram matrix K ~ X • Eigendecompsoition of K X ~ N = λ T K X ∑i=1 iuiui

λ1 ≥ λ2 ≥ ≥ λ ≥ 0 eigenvalues  N u1,u2 ,,uN unit eigenvectors

• p-th principal component of T X i = λp up X i

II-8 Summary: derivation of kernel method in general – Consider feature vectors given by kernels. – A linear method is applied to the feature vectors. (Kernelization) – Typically, only the inner products

Φ(X ),Φ(X ) = k(X , X ) i j i j

f ,Φ(X i ) are used to express the objective function of the method.

n – The solution is given in the form f = ∑ciΦ(X i ), i=1

(representer theorem, see later), and thus everything is written by Gram matrices.

These are common in derivation of any kernel methods. II-9 Example of Kernel PCA

 Wine data (from UCI repository) 13 dim. chemical measurements of for three types of wine. 178 data. Class labels are NOT used in PCA, but shown in the figures.

First two principal components:

4 Linear PCA 0.6 Kernel PCA (Gaussian kernel)

3 (σ = 3) 0.4

2 0.2

1 0 0

-0.2 -1

-0.4 -2

-3 -0.6

-4 -0.8 -5 0 5 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 II-10 Kernel PCA (Gaussian) 0.5 0.6 σ = 4 0.4 σ = 2 0.4 0.3

0.2 0.2

0.1 0 0

-0.2 -0.1

-0.2 -0.4

-0.3 -0.6 -0.4

-0.5 -0.8 -0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6

0.6 σ = 5 0.4

0.2 2 2 0 kG (x, y) = exp(− x − y σ )

-0.2

-0.4

-0.6

-0.8 II-11 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 Noise Reduction with Kernel PCA

– PCA can be used for noise reduction (principal directions represent signal, other directions noise).

– Apply kernel PCA for noise reduction:

• Compute d-dim subspace Vd spanned by d-principal directions. • For a new data x,

Gx: Projection of Φ(x) onto Vd = noise reduced feacure vector. • Compute a preimage xˆ in

data sapce for the noise 10

8 Φ(x) reduced feature vector Gx. 6 2 4 Gx xˆ =arg min Φ− ( xG′ ) x 2 x′ 0 -2

-4

Note: G is not necessariy -6 x -20 -8 given as an image of Φ. 0 -10 -25 -20 20 -15 -10 -5 0 5 10 15 20 II-12  USPS data

Original data (NOT used for PCA)

Noisy images

Noise reduced images(linear PCA)

Noise reduced images (kernel PCA, Gaussian kernel)

Generated by Matlab stprtool (by V. Franc)

II-13 Properties of Kernel PCA

– Nonlinear features can be extracted.

– Can be used for preprocessing of other analysis like classification. (dimension reduction / feature extraction)

– The results depend on the choice of kernel and kernel parameters. Interpreting the results may not be straightforward.

– How to choose a kernel and kernel parameter? • Cross-validation is not straightforward unlike SVM. • If it is a preprocessing, the performance of the final analysis should be maximized.

II-14 Kernel Canonical Correlation Analysis

II-15 Canonical Correlation Analysis

– Canonical correlation analysis (CCA, Hotelling 1936) Linear dependence of two multivariate random vectors. • Data , , … , ( , ) • : m-dimensional, : -dimensional 푋1 푌1 푋푁 푌푁

푋푖 푌푖 ℓ Find the directions and so that the correlation of and is maximized. 푇 푇 푎 푏 푎 푋 푏 푌 Cov[aT X ,bTY ] ρ = maxCorr[aT X ,bTY ] = max , a,b a,b [ T ] [ T ] Var a X Var b Y

a b X aTX bTY Y

II-16  Solution of CCA max subject to = = 1. , 푇 푇 푇 푋푋 푋푋 𝑌 – Rewritten푎 푏 푎 푉� as 푏a generalized eigenproblem푎 푉� 푎: 푏 푉� 푏

= 푂 푉�푋푋 푎 푉�푋푋 푂 푎 휌 [Exercise: 푉derive�𝑌 푂this. 푏(Hint. Use푂 Lagrange푉�𝑌 푏multiplier method.)]

– Solution: / / = , = 1 2 1 2 푎where푉푋푋 푢1 ( 푏 , resp.)푉𝑌 푣 is1 the left (right, resp.) first eigenvector for the SVD of / / . 1 1 푢 푣 −1 2 −1 2 푉�푋푋 푉�푋푋푉�𝑌 II-17 Kernel CCA

– Kernel CCA (Akaho 2000, Melzer et al. 2002, Bach et al 2002) • Dependence (not only correlation) of two random variables. • Data: , , … , ( , ) arbitrary variables

푋1 푌1 푋푁 푌푁 • Consider CCA for the feature vectors with and : , … , , … , , , , … , , … , . 푘푋 푘� 푋1 푋푁 ↦ Φ푋 푋1 Φ푋 푋푁 ∈ 퐻푋 1 푁 � 1 � 푁 � 푌 Cov푌 [ ↦ Φ, 푌] Φ 푌 ∈ 퐻 , , max = max , Var Var[ ] , 푁 푓 푋 푔 푌 ∑푖 ,푓 Φ�푋 푋푖 〈Φ� � 푌, 푖 푔〉 푓∈퐻푋 푔∈퐻푌 푓∈퐻푋 푔∈퐻푌 푓 푋 푔 푌 푁 2 푁 2 ∑푖 푓 Φ�푋 푋푖 ∑푖 푔 Φ� � 푌푖

Φ f g Φy x ( ) ( ) Φ ( ) X Φx(X) f X g Y y Y Y

II-18 – We can assume = ( ) and = ( ). 푁 푁 푓 ∑푖=1 훼푖Φ�푋 푋푖 푔 ∑푖=(same1 훽푖Φ� �as 푌kernel푖 PCA)

max , 푇 and : centered 푋 � 푁 푁 훼 퐾� 퐾� 훽 Gram matrices. 훼∈R 훽∈푅 푋 � 푇 2 푇 2 퐾� 퐾� 훼 퐾�푋 훼 훽 퐾�� 훽 – Regularization: to avoid trivial solution, , , max , 푁 푖=1 푋 푖 � 푖 , ∑ + 푓 Φ� 푋 〈Φ� 푌 , 푔〉 + 푓∈퐻푋 푔∈퐻푌 푁 2 2 푁 2 2 ∑푖=1 푓 Φ�푋 푋푖 휀푁 푓 퐻푋 ∑푖=1 푔 Φ� � 푌푖 휀푁 푔 퐻푌 (We can prove this works. Fukumizu et al JMLR 2007)

– Solution: generalized eigenproblem + = 2 + 푂 퐾�푋퐾�� 훼 퐾�푋 휀푁퐾�푋 푂 훼 휌 2 II-19 �� �푋 훽 �� 푁 �� 훽 퐾 퐾 푂 푂 퐾 휀 퐾 Application of KCCA

– Application to image retrieval (Hardoon, et al. 2004).

• Xi: image, Yi: corresponding texts (extracted from the same webpages).

• Idea: use d eigenvectors f1,…,fd and g1,…,gd as the feature spaces which contain the dependence between X and Y. • Given a new word , compute its feature vector, and find the image whose feature has the highest inner product. 푛푛푛 푌

 f1,Φ (X )   g1,Φ (Y )  Φ f  x i   y  g Φ x • y Xi Φx(Xi)       Φy(Y) Y    ,Φ ( )   ,Φ ( )   fd x X i   gd y Y  “at phoenix sky harbor on july 6, 1997. 757- 2s7, n907wa, ...”

II-20 – Example: • Gaussian kernel for images. • Bag-of-words kernel (frequency of words) for texts.

Text -- “height: 6-11, weight: 235 lbs, position: forward, born: september 18, 1968, split, croatia college: none” Extracted images

II-21 Kernel Ridge Regression

II-22 Ridge regression

, , … , , : data ( , ) 푚 Ridge1 1 regression푛 푛: linear regression푖 with푖 penalty. 푋 푌 푋 푌 푋 ∈ 퐑 푌 ∈ 퐑 2 퐿 min 푛 | | + 2 푇 2 2 푎 � 푌푖 − 푎 푋푖 휆 푎 푖=1 (The constant term is omitted for simplicity)

– Solution (quadratic function): = + where −1 푇 푎� 푉푋푋 휆퐼푛 푋 푌 × = , = 푇 , = 1 1 푇 푋 푛 푚 푌 푛 푋푋 푉 푋 푋 푋 ⋮푇 ∈ 퐑 푌 ⋮ ∈ 퐑 푛 푛 – Ridge regression is preferred, 푋when is (close to)푌 singular.

II-23 푉푋푋 Kernel ridge regression

– , , … , , : arbitrary data, .

– Kernelization1 1 푛of ridge푛 regression: positive definite kernel for 푋 푌 푋 푌 푋 푌 ∈ 퐑 푘 푋 min 푛 | , | + Ridge regression on 2 2 푖 푖 퐻 퐻 푓∈퐻 � 푌 − 푓 Φ 푋 휆 푓 퐻 equivalently, 푖 =1

min 푛 | ( )| + Nonlinear ridge regr. 2 2 푖 푖 퐻 푓∈퐻 � 푌 − 푓 푋 휆 푓 푖=1

II-24 n – Solution is given by the form f = ∑c jΦ(X j ), j=1 Let = + = + ( 푛 , : orthogonal complement) 푖=1 푖 푖 ⊥ Φ ⊥ Objective푓 ∑ function푐 Φ 푋 = 푓 푛 푓 푓+ , + + 푓Φ ∈ 푆푆𝑆 Φ 푋푖 푖=1 푓⊥ = 푛 , +2 ( + 2 ) ∑푖=1 푌푖 − 푓Φ 푓⊥ Φ 푋푖 휆 푓Φ 푓⊥ The 1st term does not depend푛 on , and 2nd2 term is minimized2 2 in the 푖=1 푖 Φ 푖 Φ ⊥ case = 0. ∑ 푌 − 푓 Φ 푋 휆 푓 푓 ⊥ 푓 푓⊥ – Objective function : + 2 푇 푋 푋 – Solution: = 푌 −+퐾 푐 휆 푐 퐾 푐

−1 , 푋 푛 Function: 푐 ̂ 퐾 =휆퐼 푌+ = 1 푇 −1 푘 푥, 푋 푓̂ 푥 푌 퐾푋 휆퐼푛 퐤 푥 ⋮ 퐤 푥 푘 푥 II푋-25푛 Regularization

– The minimization min 2 may be attained with zero푖 errors.푖 푓 푌 − 푓 푋 But the function may not be unique.

– Regularization min | ( )| + 푛 2 2 푖 푖 퐻 푓∈퐻 ∑푖=1 푌 − 푓 푋 휆 푓 • Regularization with smoothness penalty is preferred for uniqueness, stability, and smoothness. • Link with some RKHS norm and

smoothness is discussed in Sec. IV. II-26

Comparison

 Kernel ridge regression vs local linear regression = 1/ 1.5 + | | + , ~ 0, , ~ 0, 0.1 2 2 푑 푌 = 100, 500 runs푋 푍 0.012푋 푁 퐼 푍 푁 Kernel method Local linear 푛 0.01 Kernel ridge regression with Gaussian kernel 0.008 Local linear regression with Epanechnikov kernel* 0.006 (‘locfit’ in R is used) 0.004 Mean square errors Bandwidth parameters 0.002 are chosen by CV. 0 1 10 100 Dimension of X

*Epanechnikov kernel = 1 {| | } II-27 3 2 4 − 푢 ퟏ 푢 ≤1  Local linear regression (e.g., Fan and Gijbels 1996) – : smoothing kernel / Parzen window ( 0, = 1, not necessarily positive definite) 퐾 퐾 푥 ≥ ∫ 퐾 푥 푑� – Local linear regression = is estimated by ( ) = 푥 퐸 푌 푋 푥0 −푑 min 푛 ( ) ( )ℎ ℎ , 퐾 푥 ℎ 퐾 푇 2 푖 푖 0 ℎ 푖 0 푎 푏 � 푌 − 푎 − 푏 푋 −푥 퐾 푋 − 푥 • For each , 푖this minimization can be solved by linear algebra. • Statistical property of this estimator is well studied. 푥0 • For one dimensional , it works nicely with some theoretical optimality. • But, weak for high-dimensional푋 data.

II-28 Some topics on kernel methods

• Representer theorem • Structured data • Kernel choice • Low rank approximation

II-29 Representer theorem

, , … , , : data : positive definite kernel for , : corresponding RKHS. 푋1 푌1 푋푛 푌푛 : monotonically increasing function on . 푘 푋 퐻

Ψ 퐑+ Theorem 2.1 (representer theorem, Kimeldorf & Wahba 1970) The solution to the minimization problem:

min ( , , , … , ( , , )) + ( )

1 1 1 푛 푛 푛 퐻 is attained푓∈퐻 퐹 by푋 푌 푓 푋 푋 푌 푓 푋 Ψ 푓

= 푛 with some , … , . 푛 푖 푖 1 푛 푓 � 푐 Φ 푋 푐 푐 ∈ 퐑 푖=1 The proof is essentially the same as the one for the kernel ridge regression. [Exercise: complete the proof] II-30 Structured Data – Structured data: non-vectorial data with some structure.

• Sequence data (variable length): DNA sequence, Protein (sequence of amino acids) Text (sequence of words)

• Graph data (Koji’s lecture) Chemical compound etc. S

• Tree data NP VP Parse tree NP • Histograms / probability Det N V Det N measures The cat chased the mouse.

– Kernel methods can be applied to any type of data, once a kernel is defined.

– Many kernels uses counts of substructures (Haussler 1999). II-31 Spectrum kernel

– p-spectrum kernel (Leslie et al 2002): positive definite kernel for string.

, = Occurrences of common subsequences of length p.

– Example:푘푝 푠 푡 s = “statistics” t = “pastapistan”

3-spectrum

s: sta, tat, ati, tis, ist, sti, tic, ics

t : pas, ast, sta, tap, api, pis, ist, sta, tan sta tat ati tis ist sti tic ics pas ast tap api pis tan Φ(s) 1 1 1 1 1 1 1 1 0 0 0 0 0 0 Φ(t) 2 0 0 0 1 0 0 0 1 1 1 1 1 1

K3(s, t) = 1・2 + 1・1 = 3

– Linear time ( ( (| | + | |) ) algorithm with suffix tree is known (Vishwanathan & Smola 2003). II-32 푂 � 푆 푡 – Application: kernel PCA of ‘words’ with 3-spectrum kernel

8 6 informatics 4

2 biology mathematics 0 psychology cybernetics physics statistics methodology -2 metaphysics biostatistics pioneering -4 engineering

-6 -4 -2 0 2 4 6 8

II-33 Choice of kernel  Choice of kernel – Choice of kernel (polyn or Gauss) – Choice of parameters (bandwidth parameter in Gaussian kernel)

 General principles – Reflect the structure of data (e.g., kernels for structured data)

– For (e.g., SVM)  Cross-validation

– For (e.g. kernel PCA) • No general methods exist. • Guideline: make or use a relevant supervised problem, and use CV.

– Learning a kernel: (MKL)  Ask Francis! , = ( , ) optimize II-34 푀 푘 푥 푦 ∑푖=1 푐푖푘푖 푥 푦 푐푖 Low rank approximation – Kernel method: computation is reduced to Gram matrices: × where is the sample size. • Good for high-dimensional data 푛 푛 푛 • Large causes computational problems. e.g. Inversion, eigendecomposition costs ( ) in time. 푛 3 – Low-rank approximation 푂 푛

, where : × matrix ( ) 푇 푅 푛 푟 푟 ≪ 푛 퐾 ≈ 푅푅

푛 푛 푟 퐾 푅 푅 ≈ 푛 푛 푟 – The decay of eigenvalues of a Gram matrix is often quite fast (See Widom 1963, 1964; Bach & Jordan 2002). II-35

– Two major methods • Incomplete Cholesky factorization (Fine & Sheinberg 2001) ( ) in time and ( ) in space • Nyström2 approximation (Williams and Seeger 2001) 푂 푛푟 푂 푛� Random sampling + eigendecomposition

– Example: kernel ridge regression

+ time : ( )

푇 −1 3 Low rank푌 approximation:퐾푋 휆퐼푛 퐤 푥 . With푂 푛 Woodbury formula*

푇 + 퐾푋 ≈ 푅+푅 푇 −1 푇 푇 −1 푌 퐾푋 휆퐼푛 퐤 푥 ≈= 푌 푅푅 휆퐼푛 퐤 푥 + 1 푇 time :푇 ( 푇 + ) − 1 푇 휆 푌 퐤 푥 − 푌 푅 푅 푅 휆퐼푟 푅 퐤 푥 2 3 푂 푟 푛 푟 * Woodbury (Sherman–Morrison–Woodbury) formula:

+ = + . II-36 −1 −1 −1 −1 −1 −1 퐴 푈� 퐴 − 퐴 푈 퐼 푉퐴 푈 푉퐴 Other kernel methods

– Kernel Fisher discriminant analysis (kernel FDA) (Mika et al. 1999)

– Kernel (Roth 2001, Zhu&Hastie 2005)

– Kernel partial least square(kernel PLS)(Rosipal&Trejo 2001)

– Kernel K-means clustering(Dhillon et al 2004) etc, etc, ...

– Variants of SVM  Section III.

II-37 Summary: properties of kernel methods

– Various classical linear methods can be kernelized  Linear algorithms on RKHS.

– The solution typically has the form n f = ∑ciΦ(X i ). (representer theorem) i=1

– The problem is reduced to manipulation of Gram matrices of size n (sample size). • Advantage for high dimensional data. • For a large number of data, low-rank approximation is used effectively.

– Structured data: • kernel can be defined on any set. • kernel methods can be applied to any type of data.

II-38 References Akaho. (2000) Kernel Canonical Correlation Analysis. Proc. 3rd Workshop on Induction-based Information Sciences (IBIS2000). (in Japanese) Bach, F.R. and M.I. Jordan. (2002) Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48. Dhillon, I. S., Y. Guan, and B. Kulis. (2004) Kernel k-means, and normalized cuts. Proc. 10th ACM SIGKDD Intern. Conf. Knowledge Discovery and (KDD), 551–556. Fan, J. and I. Gijbels. Local Polynomial Modelling and Its Applications. Chapman Hall/CRC, 1996. Fukumizu, K., F. R. Bach and A. Gretton (2007). Statistical Consistency of Kernel Canonical Correlation Analysis. Journal of Machine Learning Research 8, 361-383. Fine, S. and K. Scheinberg. (2001) Efficient SVM Training Using Low-Rank Kernel Representations. Journal of Machine Learning Research, 2:243-264. Gökhan, B., T. Hofmann, B. Schölkopf, A.J. Smola, B. Taskar, S.V.N. Vishwanathan. (2007) Predicting Structured Data. MIT Press. Hardoon, D.R., S. Szedmak, and J. Shawe-Taylor. (2004) Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16:2639– 2664. II-39 Haussler, D. Convolution kernels on discrete structures. Tech Report UCSC-CRL-99-10, Department of Computer Science, University of California at Santa Cruz, 1999. Leslie, C., E. Eskin, and W.S. Noble. (2002) The spectrum kernel: A for SVM protein classification, in Proc. Pacific Symposium on Biocomputing, 564–575. Melzer, T., M. Reiter, and H. Bischof. (2001) Nonlinear feature extraction using generalized canonical correlation analysis. Proc. Intern. Conf. .Artificial Neural Networks (ICANN 2001), 353–360. Mika, S., G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. (1999) Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, edits, Neural Networks for Signal Processing, volume IX, 41–48. IEEE. Rosipal, R. and L.J. Trejo. (2001) Kernel partial least squares regression in reproducing kernel Hilbert space. Journal of Machine Learning Research, 2: 97–123. Roth, V. (2001) Probabilistic discriminative kernel classifiers for multi-class problems. In : Proc. 23rd DAGM Symposium, 246–253. Springer. Schölkopf, B., A. Smola, and K-R. Müller. (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319. Schölkopf, B. and A. Smola. Learning with Kernels. MIT Press. 2002. Vishwanathan, S. V. N. and A.J. Smola. (2003) Fast kernels for string and tree matching. Advances in Neural Information Processing Systems 15, 569–576. MIT Press.

II-40

Williams, C. K. I. and M. Seeger. (2001) Using the Nyström method to speed up kernel machines. Advances in Neural Information Processing Systems, 13:682–688. Widom, H. (1963) Asymptotic behavior of the eigenvalues of certain integral equations. Transactions of the American Mathematical Society, 109:278{295, 1963. Widom, H. (1964) Asymptotic behavior of the eigenvalues of certain integral equations II. Archive for Rational Mechanics and Analysis, 17:215{229, 1964.

II-41 Appendix

II-42 Exercise for kernel PCA

2 T ~ – || f ||H = c K X c

= 푛 , 푛 = 푛 , = . 2 푇 푓 퐻 � 푐푖Φ� 푋푖 � 푐푗Φ� 푋푗 � 푐푖푐푗 Φ� 푋푖 Φ� 푋푗 푐 퐾�푋푐 푖=1 푗=1 푖 T ~ 2 – Var [ f ,Φ(X ) ]= c K X c

1 Var , = 푛 푛 , 2

푓 Φ 푋 � � 푐푖Φ� 푋푖 Φ� 푋푗 푛1푗=1 푖=1 = 푛 푛 , 푛 ,

� � 푐푖Φ� 푋푖 Φ� 푋푗 � 푐ℎΦ� 푋ℎ Φ� 푋푗 푛1 푗=1 푖=1 ℎ1=1 = 푛 푛 푛 = 푇 2 � � 푐푖퐾�푖� � 푐ℎ퐾�ℎ푗 푐 퐾�푋 푐 II-43 푛 푗=1 푖 ℎℎ 푛 ∎ III. Support Vector Machines A Brief Introduction

Kenji Fukumizu

The Institute of Statistical Mathematics / Graduate University for Advanced Studies

September 6-7 Machine Learning Summer School 2012, Kyoto

1 Large margin classifier

– , , … , ( , ): training data : input ( -dimensional) 푋1 푌1 푋푛 푌푛 {±1}: binary teaching data, • 푋푖 푚

• 푌푖 ∈ – Linear classifier ( ) = +

푇 푤 = sgn 푓 푥 푤 푥 푏 y = fw(x) ( ) < 0 ℎ 푥 푓푤 푥 fw x

We wish to make ( ) with the training data so that a new data can be correctly푓푤 푥 classified. ≧ 푥 fw(x) 0 III-2  Large margin criterion Assumption: the data is linearly separable. Among infinite number of separating hyperplanes, choose the one to give the largest margin.

– Margin = distance of two classes measured along the direction of .

– Support vector machine: 8 푤

Hyperplane to give 6 the largest margin. 4 푤 support – The classifier is the middle of 2 vector

the margin. 0

-2 – “Supporting points” on the two boundaries are -4 margin called support vectors. -6

-8 -8 -6 -4 -2 0 2 4 6 8 III-3 – ( , ) and ( , ) ( > 0) give the same classifier.

푤 푏 푐� 푐� 푐 – Fix the scale: min( T + ) = 1  w X i b if Yi = 1,  max(wT X + b) = −1 if Y = 1  i i Then −

Margin = 2

푤 [Exercise] Prove this.

III-4  Support vector machine (linear, hard margin) (Boser, Guyon, Vapnik 1992) Objective function: + 1 if = 1, max subject to , 푇 + 1 if = 1. 1 푤 푋푖 푏 ≥ 푌푖 푤 푏 푤 � 푇 푤 푋푖 푏 ≤ − 푌푖 −

SVM (hard margin) min , subject to ( + ) 1 ( ) 2 푇 푤 푏 푤 푌푖 푤 푋푖 푏 ≥ ∀푖 – Quadratic program (QP): • Minimization of a quadratic function with linear constraints. • Convex, no local optima (Vandenberghe’ lecture) • Many solvers are available (Chih-Jen Lin’s lecture)

III-5  Soft-margin SVM – “Linear separability” is too strong. Relax it. Hard margin Soft margin

( + ) 1 ( + ) 1 , 0 푇 푇 푌푖 푤 푋푖 푏 ≥ 푌푖 푤 푋푖 푏 ≥ − 휉푖 휉푖 ≥ SVM (soft margin) ( + ) 1 , min + subject to , , 푇 0 ( ) 2 푛 푖 푖 푖 푖 푌 푤 푋 푏 ≥ − 휉 푤 푏 휉 푤 퐶 ∑푖=1 휉 휉푖≥ ∀푖 – This is also QP. – The coefficient C must be given. Cross-validation is often used.

III-6 + = 1 + = 푇 푻 8푤 푥 푏 풊 풊 w 풘 푿 풃 ퟏ − 흃 6

4 + = 1 2 푇 푤 푥 푏 − 0

-2

-4

-6 + = + -8 -8 -6 -4 -2 0 2 4 6 8 푻 풘 푿풋 풃 −ퟏ 흃풋

III-7 SVM and regularization

– Soft-margin SVM is equivalent to the regularization problem:

min 푛 1 + + , 푇 2 푖 푖 푤 푏 � − 푌 푤 푋 푏 + regularization휆 푤 term 푖=1 loss

= max{0, } • loss function: Hinge loss + , = 1 푧 푧

ℓ 푓 푥 푦 − 푦� 푥 + • c.f. Ridge regression (squared error) min + + , 푧 푛 푇 2 2 푖=1 푖 푖 푤 푏 ∑ 푌 − 푤 푋 푏 휆 푤

[Exercise] Confirm the above equivalence. III-8 Kernelization: nonlinear SVM

– , , … , ( , ): training data : input on arbitrary space 푋1 푌1 푋푛 푌푛 {±1}: binary teaching data, • 푋푖 Ω

• 푌푖 ∈ – Kernelization: Positive definite kernel on (RKHS ), Feature vectors , … , 푘 Ω 퐻

Φ 푋1 Φ 푋푛 – Linear classifier on  nonlinear classifier on

= sgn퐻 , + Ω = sgn + , . ℎ 푥 푓 Φ 푥 푏 푓 푥 푏 푓 ∈ 퐻

III-9  Nonlinear SVM ( , + ) 1, min + 푛 subject to , 0 ( ) 2 푖 푖 푖 푌 〈푓 Φ 푋 〉 푏 ≥ 푓 푏 푓 퐶 � 휉 푖 or equivalently, 푖=1 휉 ≥ ∀푖

min 푛 1 ( + ) + , 2 푖 푖 + 퐻 푓 푏 � − 푌 푓 푋 푏 휆 푓 푖=1 By representer theorem, = . 푛 푗=1 푗 푗 Nonlinear SVM (soft margin)푓 ∑ : Gram푤 Φ 푋 matrix

( + ) 1, min + 푛 subject퐾 to , , 0 ( ) 푇 푌푖 퐾� 푖 푏 ≥ 푖 푤 푏 휉 푤 퐾� 퐶 � 휉 휉푖≥ ∀푖 • This is again QP.푖=1

III-10  Dual problem ( + ) 1, min + subject to , 0 ( ) 푇 푛 푖 푖 푖 푌 퐾� 푏 ≥ 푤 푏 푤 퐾� 퐶 ∑푖=1 휉 – By Lagrangian multiplier method, the dual problem휉푖≥ is ∀푖 SVM (dual problem) 0 , max , subject to ( ) = 0 푛 푛 푖 푖=1 푖 푖 푗=1 푖 푗 푖 푗 푖� ≤ 훼 ≤ 퐶 훼 ∑ 훼 − ∑ 훼 훼 푌 푌 퐾 푛 ∀푖 ∑푖=1 푌푖훼푖 • The dual problem is often preferred. • The classifier is expressed by

+ = 푛 , +

푓 ∗ 푥 푏∗ � 훼푖∗푌푖 퐾 푥 푋푖 푏∗ – Sparse expression푖=1 : Only the data with 0 < appear in the summation.  Support vectors. 푖∗ 훼 ≤ 퐶 III-11  KKT condition Theorem The solution of the primal and dual problem of SVM is given by the following equations:

(1) 1 + 0 ( ) [primal constraint] (2) 0 ( ∗ ) ∗ ∗ [primal constraint] − 푌푖 푓 푋푖 푏 − 휉푖 ≤ ∀ 푖 (3) 0∗ , ( ) [dual constraint] 휉푖 ≥ ∀ 푖 (4) (1 ∗ ( + ) ) = 0 ( ) [complementary] ≤ 훼푖 ≤ 퐶 ∀ 푖 (6) ∗ =∗ 0 ( )∗, ∗ [complementary] 훼푖 − 푌푖 푓 푋푖 푏 − 휉푖 ∀ 푖 (7) ∗ ∗ = 0, 휉푖 퐶 − 훼푖 ∀ 푖 (8) 푛 =∗ 0, 푛 ∗ ∑푗 =1 퐾푖�푤푗 − ∑푗 훼푗 푌푗퐾푖� 푛 ∗ ∑ 푗=1훼푗 푌푗

III-12  Sparse expression

( ) = + = , + : 휑∗ 푥 푓∗ 푥 푏∗ � 훼푖∗푌푖 퐾 푥 푋푖 푏∗ 푖 – Two types of support vectors.푋 support vectors

8 w 6

4 support vectors 0 < α < C 2 i ( ) 1 (Yi Xi = ) 0 휑 support vectors -2 αi = C -4 ( ) ≤ 1 (Yi Xi ) -6

-8 휑 -8 -6 -4 -2 0 2 4 6 8 III-13 Summary of SVM

– One of the kernel methods: • kernelization of linear large margin classifier. • Computation depends on Gram matrices of size .

– Quadratic program: 푛 • No local optimum. • Many solvers are available. • Further efficient optimization methods are available (e.g. SMO, Plat 1999)

– Sparse representation • The solution is written by a small number of support vectors.

– Regularization • The objective function can be regarded as regularization with hinge loss function. III-14

– NOT discussed on SVM in this lecture are • Many successful applications (see references) • Theoretical analysis on generalization. • Multi-class extension – Combination of binary classifiers (1-vs-1, 1-vs-rest) – Generalization of large margin criterion Crammer & Singer (2001), Mangasarian & Musicant (2001), Lee, Lin, & Wahba (2004), etc • Other extensions – Support vector regression (Vapnik 1995) – -SVM (Schölkopf et al 2000) – Structured-output (Collins & Duffty 2001, Taskar et al 2004, Altun 휈et al 2003, etc) – One-class SVM (Schökopf et al 2001) • Optimization – Solving primal problem • Implementation (Chih-Jen Lin’s lecture) III-15 References

Altun, Y., I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In Proc. 20th Intern. Conf. Machine Learning, 2003. Boser, B.E., I.M. Guyon, and V.N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Fifth Annual ACM Workshop on Computational Learning Theory, pages 144–152, Pittsburgh, PA, 1992. ACM Press. Crammer, K. and Y. Singer. On the algorithmic implementation of multiclass kernel- based vector machines. Journal of Machine Learning Research, 2:265–292, 2001. Collins, M. and N. Duffy. Convolution kernels for natural language. In Advances in Neural Information Processing Systems 14, pages 625–632. MIT Press, 2001. Mangasarian, O. L. and David R. Musicant. Lagrangian support vector machines. Journal of Machine Learning Research, 1:161–177, 2001 Lee, Y., Y. Lin, and G. Wahba. Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99: 67–81, 2004. Schölkopf, B., A. Smola, R.C. Williamson, and P.L. Bartlett. (2000) New support vector algorithms. Neural Computation, 12(5):1207–1245.

III-16 Schölkopf, B., J.C. Platt, J. Shawe-Taylor, R.C. Williamson, and A.J.Smola. (2001) Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471. Vapnik, V.N. The Nature of Statistical Learning Theory. Springer 1995. Platt, J. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185–208. MIT Press, 1999.

General references on SVM: – Schölkopf, B., A.J. Smola. Learning with Kernels. MIT Press 2001 – Cristianini, N., J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge Univ. Press. 2000 Books on Application domains: – Lee, S.-W., A. Verri (Eds.) Pattern Recognition with Support Vector Machines: First International Workshop, SVM 2002, Niagara Falls, Canada, August 10, 2002. Proceedings. Lecture Notes in Computer Science 2388, Springer, 2002. – Joachims, T. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Springer, 2002. – Schölkopf, B., K. Tsuda, J.-P. Vert (Eds.). Kernel Methods in Computational . MIT Press, 2004. Biology III-17

IV. Theoretical Backgrounds of Kernel Methods

Kenji Fukumizu

The Institute of Statistical Mathematics / Graduate University for Advanced Studies

September 6-7 Machine Learning Summer School 2012, Kyoto

1 -valued Positive definite kernel

Definition. 퐂 Ω: set. k : Ω × Ω → C is a positive definite kernel if for arbitrary 1, … , and , … , ,

푥 푥푛 ∈ Ω 푐1 푐푛 ∈ 퐂

푛 , 0. , 푖 푗 푖 푗 � 푐 푐� 푘 푥 푥 ≥ 푖 푗=1 Remark: From the above condition, the Gram matrix , is

푖 푗 necessarily Hermitian, i.e. , = ( , ). [Exercise]푘 푥 푥 푖� 푘 푦 푥 푘 푥 푦

IV-2 Operations that preserve positive definiteness

Proposition 4.1 If : × ( = 1,2, … ,) are positive definite kernels, then so are the following: 푖 1.푘 (positive푋 푋 → combination퐂 푖 ) + ( , 0). 2. (product) , , 푎푘1 푏푘2 푎 푏 ≥ 3. (limit) lim ( , ), assuming the limit exists. 푘1푘2 푘1 푥 푦 푘2 푥 푦 푖 푖→∞ 푘 푥 푦 Proof. 1 and 3 are trivial from the definition. For 2, it suffices to prove that Hadamard product (element-wise) of two positive semidefinite matrices is positive semidefinite.

Remark: The set of positive definite kernels on is a closed cone, where the topology is defined by the point-wise convergence. 푋 IV-3 Proposition 4.2 Let and be positive semidefinite Hermitian matrices. Then, Hadamard product = (element-wise product) is positive퐴 semidefinite퐵 . 퐾 퐴 ∗ 퐵 Proof) 0 0 0 Eigendecomposition of : = = 1 휆 ⋯ 푇 푗 i.e., 푇 푖 0 휆02 ⋯ 퐴 퐴 푈Λ푈� 푈푝 푈푝 = n λ i j ⋮ ⋱ ⋮ Aij ∑ =1 pU pU p (λ ≧ 0 by the positive semidefiniteness푛 ). p p ⋯ 휆 Then,

n n n c c K = c c λ U i U j B ∑i, j=1 i j ij ∑p=1 ∑i, j=1 i j p p p ij

= λ n i j + + λ n i j ≥ 0. 1(∑ , =1 ciU1 c jU1 Bij )  n (∑ , =1 ciUn c jUn Bij ) i j i j IV-4 ∎ Normalization

Proposition 4.3 Let be a positive definite kernel on , and : be an arbitrary function. Then, 푘 Ω 푓 Ω → 퐂 , , ( )

is positive definite.푘� 푥 푦 In≔ particular푓 푥 푘 푥,푦 푓 푦

( )

is a positive definite푓 kernel.푥 푓 푦

– Proof [Exercise] – Example. Normalization: for any positive definite kernel ( , ), , , = 푘 푥 푦 ( , ) ( , ) 푘 푥 푦 is positive definite푘� 푥 and푦 ( ) = 1 for any . 푘 푥 푥 푘 푦 푦 IV-5 Φ 푥 푥 ∈ Ω Proof of positive definiteness

– Euclidean inner product : easy (Prop. 1.1).

푇 – Polynomial kernel +푥 푦 ( 0): + = 푇 + 푑 + + , 0. 푥 푦 푐 푐 ≥ 푇 푑 푇 푑 푇 푑−1 Product and non-negative1 combination of p.d푑 . kernels.푖 푥 푦 푐 푥 푦 푎 푥 푦 ⋯ 푎 푎 ≥ – Gaussian RBF kernel exp : 2 푥−푦 2 exp = −/ 휎 / / 2 2 2 푇 2 2 2 푥 − 푦 − 푥 휎 푥 푦 휎 − 푦 휎 Note − 2 푒 푒 푒 휎 1 1 / = 1 + + + 푇 2 1! 2! 푥 푦 휎 푇 푇 2 is푒 positive definite (2Prop.푥 푦 4.1). Proposition4 푥 푦 4.3⋯ then completes the proof. 휎 휎

– Laplacian kernel is discussed later. IV-6 Shift-invariant kernel

– A positive definite kernel , on is called shift-invariant if the kernel is of the form , = 푚 . 푘 푥 푦 퐑 푘 푥 푦 휓 푥 − 푦 – Examples: Gaussian, Laplacian kernel

– Fourier kernel (C-valued positive definite kernel): for each ,

, = exp 1 = exp 1 exp 1 휔 푇 푇 푇 푘 퐹 푥 푦 − 휔 푥 − 푦 − 휔 푥 (Prop.− 4.3)휔 푦

– If , = is positive definite, the function is called positive definite. 푘 푥 푦 휓 푥 − 푦 휓

IV-7 Bochner’s theorem Theorem 4.4 (Bochner) Let be a continuous function on . Then, is ( -valued) positive definite if and only if there is푚 a finite non-negative Borel measure휓 on such that 퐑 휓 퐂 푚 Λ 퐑= exp 1 ( ) 푇 Bochner’s theorem휓 푧 characterizes� − 휔 all푧 the푑Λ continuous휔 shift-invariant positive definite kernels. exp 1 is the generator of the cone (see Prop. 4.1). 푇 푚 − 휔 푧 휔 ∈ 퐑 – is the inverse Fourier (or Fourier-Stieltjes) transform of . – Roughly speaking, the shift invariant functions are the class that Λhave non-negative Fourier transform. 휓

– Sufficiency is easy: , = | | . 푇 Necessity is difficult. −1휔 푧푖 2 ∑푖 푗 푖 푗 푖 푗 ∑푖 푖 푐 푐� 휓 푧 − 푧 ∫ 푐 푒 푑Λ IV휔-8 RKHS in frequency domain

Suppose (shift-invariant) kernel has a form

, = exp 1 푘 . > 0. 푇 휌 휔 Then, it is푘 known푥 푦 that� RKHS− 휔 is 푥given− 푦 by휌 휔 푑�

퐻푘 = , 2 < , 2 ̂ 푘 푓 휔 퐻 푓 ∈ 퐿 퐑 푑� � � 푑� ∞ 휌 휔 , = 푓̂ 휔 푔� 휔 where is 푓the푔 Fourier� transform푑� of : 휌 휔 (푓̂ ) = ( )exp 1 푓 . 1 푇 푚 푓̂ 휔 2휋 ∫ 푓 푥 − − 휔 푥 푑� – The RKHS norm is related to the smoothness of . IV-9 푓 퐻 푓  Gaussian kernel 1 , = exp , = exp 2 2 2 2 푥 − 푦 휎 휔 퐺 2 퐺 푚 푘 푥 푦= − , 휌 exp휔 <− 2휎 2 2휋2 2 2 2 휎 휔 푘퐺 ̂ 퐻 ,푓 ∈ =퐿 퐑 푑� � � 푓 휔 exp 푑� ∞ 2 2 2 푚 휎 휔 푓 푔 2휋 � 푓̂ 휔 푔� 휔 푑�  Laplacian kernel (on ) 1 , = exp | | , = 퐑 + 푘퐿 푥 푦 −훽 푥 − 푦 휌퐿 휔 2 2 = , + 2휋 휔 < 훽 2 2 2 2 푘퐿 퐻 푓 ∈, 퐿 =퐑 푑� � � 푓̂ 휔 휔+ 훽 푑 � ∞ 2 2 푓 푔 2휋 �푓̂ 휔 푔� 휔 휔 훽 푑� – Decay of for high-frequency is different for Gaussian and Laplacian. IV-10 푓 ∈ 퐻 RKHS by polynomial kernel

– Polynomial kernel on : , = + , 0, 퐑 푇 푑 푘푝 푥 푦 푥 푦 푐 푐 ≥ 푑 ∈ 퐍 , = + + + + . 1 2 푑 푑 푑 푑−1 푑−1 푑 2 푑−2 푑−1 푑 푘 푝 푥 푧0 푧0 푥 푐푧0 푥 푐 푧0 푥 ⋯ 푐 Span of these functions are polynomials of degree .

푑 Proposition 4.5 If 0, the RKHS is the space of polynomials of degree at most . 푐 ≠ 푑 [Proof: exercise. Hint. Find to satisfy , = as a solution to a linear equation.]푑 푖 푖=0 푖 푖 푑 푖 푏 ∑ 푏 푘 푥 푧 푖 ∑푖=0 푎 푥 IV-11 Sum and product

, , , : two RKHS’s and positive definite kernels on .

퐻1 Sum푘1 퐻 2 푘2 Ω RKHS for + : + = : , , = + } 푘1 푘2 퐻1 퐻=2min{푓 Ω → 푅+ ∃푓1 ∈ 퐻1 ∃=푓2 ∈+퐻2 푓, 푓1 ,푓2 }. 2 2 2 푓 푓1 퐻1 푓2 퐻2 ∣ 푓 푓1 푓2 푓1 ∈ 퐻1 푓2 ∈ 퐻2  Product RKHS for :

=tensor1 2 product as a vector space. 푘 푘 퐻1 =⊗ 퐻2 , } is dense in .

푛 푓 ∑푖=1 푓푖푔푖, 푓푖 ∈ 퐻1 푔푖 ∈ 퐻2= 퐻1 ⊗, 퐻2 , . 푛 1 1 푚 2 2 푛 푚 1 2 1 2 푖=1 푖 푖 푗=1 푗 푗 푖=1 푗=1 푖 푗 퐻1 푖 푗 퐻2 ∑ 푓 푔 ∑ 푓 푔 ∑ ∑ 푓 푓 푔 푔 IV-12 Summary of Section IV

– Positive definiteness of kernels are preserved by • Non-negative combinations, • Product • Point-wise limit • Normalization

– Bochner’s theorem: characterization of the continuous shift- invariance kernels on . 푚 퐑 – Explicit form of RKHS • RKHS with shift-invariance kernels has explicit expression in frequency domain. • Polynomial kerns gives RKHS of polynomials. • Sum and product can be given. IV-13 References

Aronszajn., N. Theory of reproducing kernels. Trans. American Mathematical Society, 68(3):337–404, 1950. Saitoh., S. Integral transforms, reproducing kernels, and their applications. Addison Wesley Longman, 1997.

IV-14 Solution to Exercises

 C-valued positive definiteness Using the definition for one point, we have , is real and non- negative for all . For any and , applying the definition with coefficient ( , 1) where , we have 푘 푥 푥 ,푥 + , 푥 + 푦 , + , 0. Since the right2푐 hand side푐 ∈is 퐂real, its complex conjugate also satisfies 푐 푘 푥, 푥 + 푐� 푥, 푦 + 푐̅푘 푦, 푥 + 푘 푦, 푦 ≥ 0. The difference2 of the left hand side of the above two inequalities is real, so that푐 푘 푥 푥 푐̅푘 푥 푦 푐푘 푦 푥 푘 푦 푦 ≥ , , ( , , ) is a real number. On the other hand, since must be pure imaginary 푐for̅ 푘 any푦 푥 complex− 푘 푥 푦 number− 푐 푘 푦, 푥 − 푘 푥 푦 , , = 0 훼 − 훼� 훼 holds for any . This implies , = ( , ). 푐̅ 푘 푦 푥 − 푘 푥 푦

푐 ∈ 퐂 푘 푦 푥 푘 푥 푦 IV-15

V. Nonparametric Inference with Positive Definite Kernels

Kenji Fukumizu

The Institute of Statistical Mathematics / Graduate University for Advanced Studies

September 6-7 Machine Learning Summer School 2012, Kyoto

1 Outline

1. Mean and Variance on RKHS

2. Statistical tests with kernels

3. Conditional probabilities and beyond.

V-2 Introduction

 “Kernel methods” for statistical inference – We have seen that kernelization of linear methods provides nonlinear methods, which capture ‘nonlinearity’ or ‘high-order moments’ of original data. e.g. nonlinear SVM, kernel PCA, kernel CCA, etc.

– This section discusses more basic statistics on RKHS: mean, covariance, and conditional covariance.  (X) = k( , X) X

 (original space) feature map H (RKHS)

mean, covariance, conditional covariance V-3 Mean and covariance on RKHS

V-4 Mean on RKHS

X: random variable taking value on a measurable space Ω k: measurable positive definite kernel on Ω. H: RKHS. Always assumes “bounded” kernel for simplicity: sup , ∞. ( X)  k( , X ): feature vector = random variable on RKHS.

– Definition. The kernel mean ∈of X on H is defined by

Φ ⋅, .

– Reproducing expectation: mX , f  Ef (X ) ∀ ∈

* Notation: depends on , also. But, we do not show it for simplicity. V-55 Covariance operator

, : random variable taking values on Ω,Ω, resp.

,, ,: RKHS given by kernels on Ω and Ω, resp.

Definition. Cross-covariance operator: YX : H X  HY ∗ ∗ Σ ≡Φ ⊗Φ Φ ⊗Φ

∗ denotes the linear functional ,⋅ : ↦ 〈, 〉

– Simply, covariance of feature vectors. T T c.f. Euclidean case VYX = E[YX ] – E[Y]E[X] : covariance matrix

 X(X) Y(Y) X Y  YX  X X Y Y HX HY V-6 – As an operator: Σ ⋅, ⋅, .

– Reproducing covariance:

g, YXf E[g(Y ) f (X )] E[g(Y )]E[ f (X )] ( Cov[ f (X ), g(Y )])

for all ∈,∈.

– Standard identification: ∗ ≅ : ,⋅, ↔ .

The operator is regarded as an element in ⊗, i.e., ∗ ∗ Σ ≡Φ ⊗Φ Φ ⊗Φ can be regarded as

Σ ≅, ⊗ ∈ ⊗

V-7 Characteristic kernel (Fukumizu, Bach, Jordan 2004, 2009; Sriperumbudur et al 2010) – Kernel mean can capture higher-order moments of the variable. Example X: R-valued random variablek: pos.def. kernel on R. Suppose k admits a Taylor series expansion on R.

2 k(u, x)  c0  c1(xu)  c2 (xu)  (ci > 0) e.g.) k(x,u)  exp(xu)

The kernel mean mX works as a moment generating function:

2 2 mX (u)  E[k(u, X )]  c0  c1E[X ]u  c2 E[X ]u 

1 d  m (u)  E[X  ] c du X  u0

V-8 P : family of all the probabilities on a measurable space (, B). H: RKHS on with a bounded measurable kernel k. mP: kernel mean on H for a variable with probability P P Definition. A bounded measureable positive definite k is called characteristic (w.r.t. P) if the mapping

P  H, P  mP is one-to-one.

– The kernel mean with a characteristic kernel uniquely determines a probability.

mP  mQ  P  Q i.e. ~ ∼ ∀ ∈ ⇔ .

V-9 – Analogy to “characteristic function” With Fourier kernel F(x, y)  exp 1 xT y

Ch.f.X (u)  E[F(X ,u)]. • The characteristic function uniquely determines a Borel probability on Rm.

• The kernel mean m X ( u )  E [ k ( u , X )] with a characteristic kernel uniquely determines a probability on (, B). Note:  may not be Euclidean.

– The characteristic RKHS must be large enough! Examples for Rm (proved later): Gaussian, Laplacian kernel. Polynomial kernels are not characteristic.

For conditions of characteristic kernels, see Sriperumbudur et al. JMLR 2010.

V-10 Statistical inference with kernel means

– Statistical inference: inference on some properties of probabilities. – With characteristic kernels, they can be cast into the inference on kernel means.

Two sample problem: ? ?

Independence test: ⊗? ⊗?

V-11 Empirical Estimation

 Empirical kernel mean – An advantage of RKHS approach is easy empirical estimation.

–X1,..., X n : i.i.d.  X1 ,,X n  : sample on RKHS

Empirical kernel mean: n n (n) 1 1 mˆ X  (X i )  k(, X i ) n i1 n i1

Theorem 5.1 (strong -consistency) Assume E[k(X , X )]  .

(n) mˆ X  mX  Op 1 n  (n  ).

V-12  Empirical covariance operator

, ,…,,: i.i.d. sample on Ω Ω.

An estimator of YX is defined by 1 n ˆ (n) ˆ ˆ * YX  kY (,Yi )  mY  k X (, X i )  mX n i1

Theorem 5.2

ˆ (n) YX  YX  Op 1 n  (n  ) HS

– Hilbert-Schmidt norm: same as Frobenius norm of a matrix, but often used for infinite dimensional spaces.

| | ∑∑ , , : orthonormal bases (sum of squares in matrix expression) V-13 Statistical test with kernels

V-14 Two‐sample problem

– Two sample homogeneity test Two i.i.d. samples are given; X ,..., X ~ P Y ,...,Y ~ Q. 1 n and 1 

Q: Are they sampled from the same distribution?

Null hypothesis H0: .

Alternative H1: .

– Practically important: we often wish to distinguish two things. • Are the experimental results of treatment and control significantly different? • Were the plays “Henry VI” and “Henry II” written by the same author?

– If the means of and are different, we may use it for test. If they are identical, we need to look at higher order information. V-15 – If mean and variance are the same, it is a very difficult problem. – Example: do they have the same distribution? ℓ 100 N(0,I) N(0,I) N(0,I) 4 4 4

3 3 3

2 2 2

1 1 1

0 0 0

-1 -1 -1

-2 -2 -2

-3 -3 -3

-4 -4 -4 -4 -3 -2 -1 0 1 2 3 4 -4 -3 -2 -1 0 1 2 3 4 -4 -3 -2 -1 0 1 2 3 4

4 4 4

3 3 3

2 2 2

1 1 1

0 0 0

-1 -1 -1

-2 -2 -2

-3 -3 -3

-4 -4 -4 -4 -3 -2 -1 0 1 2 3 4 -4 -3 -2 -1 0 1 2 3 4 -4 -3 -2 -1 0 1 2 3 4 N(0,I) Unif 0.5N(0, I)  0.5Unif V-16 Kernel method for two-sample problem (Gretton et al. NIPS 2007, 2010, JMLR2012).  Kernel approach

– Comparison of and  comparison of and .

 Maximum Mean Discrepancy – In population 2 ~ ~ MMD2  m  m  E[k(X , X )]  E[k(Y ,Y )]  2E[k(X ,Y )]. X Y H (,: independent copy of , )

m  m  sup f ,m  m  sup f (x)dP(x)  f (x)dQ(x) X Y H X Y  f :||f ||H 1 f :||f ||H 1 hence, MMD.

– With characteristic kernel, MMD = 0 if and only if .

V-17 – Test statistics: 2 Empirical estimator MMD emp

2 2 MMDemp  mˆ X  mˆ Y H 1 n 2 n  1   2 k(X i , X j )  k(X i ,Ya )  2 k(Ya ,Yb ) n i, j1 n ia11  a,b1

– Asymptotic distributions under H0 and H1 are known (see Appendix) • Null distribution: ℓ  infinite mixture of •Alternative: ℓ  normal distribution

– Approximation of null distribution • Approximation of the mixture coefficients. • Fitting it with Pearson curve by moment matching. • Bootstrap (Arcones & Gine 1992)

V-18 Experiments

Comparison of two databases.

Data set Attr. MMD-B WW KS Data size / Dim Neural I Same 96.5 97.0 95.0 Neural I: 4000 / 63 Different Neural II: 1000 / 100 0.0 0.0 10.0 Health: 25 / 12600 Neural II Same 94.6 95.0 94.5 Subtype: 25 / 2118 Different 3.3 0.8 31.8 Health Same 95.5 94.7 96.1 Different 1.0 2.8 44.0 Gaussian kernels Subtype Same 99.1 94.6 97.3 are used. Different 0.0 0.0 28.4 WW: Wald-Walfovitz test Percentage of accepting . KS: Kolmogorov-Smirnov test Significance level . -- Classical methods (see Appendix) (Gretton et al. JMLR 2012) V-19 Experiments for mixtures

0.02 0.03 NX = NY = 100 NX = NY = 200 0.025 0.015 0.02

0.015 0.01

0.01 0.005 0.005

0 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 c 0.014 N = N = 500 Average values of MMD 0.012 X Y over 100 samples 0.01 0.008 0,1 vs 0.006 Unif 1 0,1 0.004 0,1 vs 0,1 0.002 0 0 0.2 0.4 0.6 0.8 1 20 Independence test

 Independence –and are independent if and only if

• Probability density function: , . • Cumulative distribution function: , . • Characteristic function:

 Independence test

Given i.i.d. sample , ,…, , , are and independent?

– Null hypothesis H0: ⊗ (independent)

–Alternative H1: ⊗ (not independent)

V-21 Independent Dependent 3 3

2 2

1 1

0 0

-1 -1

-2 -2

-3 -3 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3

V-22 Independence with kernel

Theorem 5.3. (Fukumizu, Bach, Jordan, JMLR 2004)

If the product kernel is characteristic, then

X Y  YX  O

Recall Σ ≅ ⊗ ∈ ⊗. Comparison between (kernel mean of ) and ⊗ (kernel mean of ).

– Dependence measure: Hilbert-Schmidt independence criterion , ≔ Σ

, ⊗ [Exercise] ⊗

, , , ,

2 , , , : , . independent copy of V-23 Independence test with kernels (Gretton, Fukumizu, Teo, Song, Schölkopf, Smola. NIPS 2008)

– Test statistic: 1 HSIC, ≔ Σ Tr

,: centered Gram matrices Or equivalently, 1 n 2 n HSICemp (X ,Y )  2 kX (X i , X j )kY (Yi ,Y j )  3 kX (X i , X j )kY (Yi ,Yk ) n i, j1 n i, j,k1 1 n n  k (X , X ) k (Y ,Y ) 4  X i j  Y k  n i, j1 k,1

– This is a special case of MMD comparing and ⊗ with product kernel . – Asymptotic distributions are given similarly to MMD (Gretton et al 2009).

V-24 Comparison with existing method

 Distance covariance (Székely et al., 2007; Székely & Rizzo, 2009) – Distance covariance / distance correlation has gained popularity in statistical community as a dependence measure beyond Pearson correlation.

Definition. , : and ℓ-dimensional random vectors . Distance covariance: , ≔ 2

where , and , are i.i.d. ~ . Distance correlation: , , ≔ , ,

V-25  Relation between distance covariance and HSIC Proposition 5.4 (Sejdinovic, Gretton, Sriperumbudur, F. ICML2012) A kernel on Euclidean spaces , . is positive definite, and HSIC , , .

– Distance covariance is a specific instance of HSIC. – Positive definite kernel approach is more general in choosing kernels, and thus may perform better.

–Extension , ∑

is positive definite for 02. We can define , ≔ HSIC,

V-26 Experiments

Original distance covariance

2 HSIC with Ratio of accepting independence of accepting Ratio Gaussian kernel

Dependence 

V-27  Choice of kernel for statistical tests – Heuristics for Gaussian kernel: median , 1, … ,

– Using performance of statistical test: Type II error of the test statistics (Gretton et al NIPS 2012).

Challenging open questions.

V-28 Conditional probabilities and beyond

V-29 Conditional probability

– Conditional probabilities appear in many machine learning problems

• Regression / classification: direct inference of | or |.  already seen in Section II.

• Bayesian inference ′

• Conditional independence / dependence

– Graphical modeling X1 X2 X3 Conditional independent relations among variables. – Causality X4 X5 – Dimension reduction / feature extraction

V-30 Conditional kernel mean

 Conditional kernel mean Definition.

Φ ⋅,

– Simply, kernel mean of | given . – It determines the conditional probability with a characteristic kernel. – Again, inference problems on conditional probabilities can be solved as inference on conditional kernel means.

– But, how can we estimate it?

V-31 Covariance operator revisited

 (Uncentered) cross-covariance operator

, : random variables on Ω,Ω, ,: positive definite kernels on Ω,Ω, , , ∞.

Definition. Uncentered cross-covariance operator

∗ ≡Φ ⊗Φ ∶ →

∗ (Or ∈ ⊗ ≅ ⊗)

– Reproducing property:

, ∀ ∈ , ∈

, ∀, ∈

V-32 Conditional probability by regression

– Recall that for zero-mean Gaussian random variable , ,

. • The matrix is given by the solution to the least mean square ,

– For the feature vector Φ Φ . • The operator is given by the solution to the least mean square Φ Φ ,

is ill-posed in infinite dimensional cases, but regularized estimator can be justified. V-33 Estimator for conditional kernel mean

– Empirical estimation: given , ,…, , , Φ In Gram matrix expression, ∗

, ∗, ⋮ , ∗ ⋮ . , ∗, Proposition 5.5 (Consistency) , If is characteristic, ∈ ⊗ , and →0, → ∞ as →∞, then for every ⋅, → Φ in in probability. V-34 Conditional covariance  Review: Gaussian variables Conditional covariance matrix: | ≔

Fact: | Cov,| for any  Conditional cross-covariance operator

Definition: , , : random variables on Ω,Ω,Ω. ,,: positive definite kernel on Ω,Ω,Ω. Conditional cross-covariance operator | ≡

– Reproducing averaged conditional covariance

Proposition 5.6 If is characteristic, then for ∀ ∈ , ∈,

, | Cov , |

V-35 – An interpretation: Compare the conditional kernel means for , and |.

Φ ⊗Φ Φ |⊗Φ Dependence on is not easy to handle  Average it out.

Φ ⊗Φ Φ| ⊗ Φ ] ⋅ ⋅

– Empirical estimator: | ≡

V-36 Conditional independence

– Recall: for Gaussian random variable | ⇔ | .

– By average over , | does not imply conditional independence, which requires , |for each .

– Trick: consider | ≔ ,

where ,and the product kernel is used for .

Theorem 5.7 (Fukumizu et al. JMLR 2004)

Assume ,, are characteristic, then | ⇔ | .

– |, | can be similarly used. V-37 Applications of conditional mean and covariance

 Conditional independence test (Fukumizu et al. NIPS2007) – Squared HS-norm can be used for conditional | independence test.

– Unlike the independence test, the asymptotic null distribution is not available. Permutation test is needed.

– Background: The conditional independent test for continuous non- Gaussian variables is not easy, and a challenging problem.

V-38  Causal inference – Directional acyclic graph (DAG) is used for representing the causal structure among variables. – The structure can be learned by conditional independence tests. The above test can be applied (Sun, Janzing, Schölkopf, F. ICML2007).

X Y X Y and Z X Y | Z

 Feature extraction / dimension reduction for supervised learning  see next.

V-39 Dimension reduction and conditional independence  Dimension reduction for supervised learning

Input: X = (X1, ... , Xm), Output: Y (either continuous or discrete) Goal: find an effective dimension reduction space (EDR space) spanned by an m x d matrix B s.t. T pYXp(| ) (| YBX ) T T T YX| YB| T X where B X = (b1 X, ..., bd X) linear feature vector No further assumptions (parametric model) on cond. p.d.f. p.

 Conditional independence Y , B spans EDR space

X Y | BTX X U V V-40 Gradient‐based method (Samarov 1993; Hristache et al 2001) Average Derivative Estimation (ADE) – Assumptions: • Y is one dimensional • EDR space p(Y | X )  ~p(Y | BT X )

  ~ T  ~ E[Y | X  x]  yp(y | B x)dy  B y p(y | z) T dy x x   z zB x

– Gradient of the regression function lies in the EDR space at each x

 Bˆ = average or PCA of the gradients at many x.

V-41 Kernel Helps!

– Weakness of ADE: • Difficulty of estimating gradients in high dimensional space. ADE uses local . » Sensitive to bandwidth • May find only a subspace of the effective subspace. 2 e.g. Y ~ f (X1)  Z, Z ~ N(0, (X 2 ) ).

–Kernel method • Can handle conditional probability in regression form Φ

• Characterizes conditional independence X Y | BTX

V-42 Derivative with kernel

– Reproducing the derivative (e.g. Steinwart & Christmann, Chap. 4):

~ k X (, x) Assume k X ( x , x ) is differentiable and  H then x X

⋅, , for any ∈

– Combining with the estimation of conditional kernel mean, Eˆ[ (Y ) | X  x] Eˆ[ (Y ) | X  x] Mˆ (x)  Y , Y ij xi x j

ˆ ˆ 1 k X (  , x) ˆ ˆ 1 k X (  , x)  CYX (C XX  n I) ,CYX (C XX   n I) xi x j

The top eigenvectors of estimates the EDR space

V-43 Gradient‐based kernel dimension reduction (gKDR) (Fukumizu & Leng, NIPS 2012) –Method  •Compute n ~ 1 T 1 1 M n  k X (X i ) (GX  n n I ) GY (GX  n n I ) k X (X i ). n i1 T  k X (X1 , x) k X (X n , x)  k X (X i )   ,...,  G  k (X , X ) x x X X i j   x X i

~ ˆ • Compute top d eigenvectors of M n .  Estimator B.

– gKDR estimate the subspace to realize the conditional independence

– Choice of kernel: Cross-validation with some regressor/classifier, e.g. kNN method.

V-44 Experiment: ISOLET

– Speech signals of 26 alphabets – 617 dim. (continuous) – 6238 training data / 1559 test data Data from UCI repository. –Results DimDim 10 15 20 25 30 35 40 45 50 gKDRgKDR 14.43 7.50 5.00 4.75 - - - - - gKDR-vgKDR-v 16.87 7.57 4.75 4.30 3.85 3.85 3.59 3.53 3.08 CCACCA 13.09 8.66 6.54 6.09 - - - - - Classification errors for test data by SVM (%)

c.f. C4.5 + ECOC: 6.61% Neural Networks (best): 3.27%

V-45 Experiment: Amazon Commerce Reviews

– Author identification for Amazon commerce reviews. – Dim = 10000 (linguistic style: e.g. usage of digit, punctuation, words and sentences' length, and frequency of words, etc) –= #authors x 30 (Data from UCI repository.)

V-46 – Example of 2-dim plots for 3 authors

– gKDR (dim = #authors) vs correlation-based variable selection (dim=500/2000) #Authors gKDR gKDR Corr (500) Corr + 5NN + SVM + SVM (2000) + SVM 10 9.3 12.0 15.7 8.3 20 16.2 16.2 30.2 18.0 30 20.1 18.0 29.2 24.0 40 22.8 21.8 35.4 25.0 50 22.7 19.5 41.1 29.0 10-fold cross-validation errors (%) V-47 Bayesian inference with kernels (Fukumizu, Song, Gretton NIPS 2011)  Bayes’ rule , .  Kernel Bayes’ rule ℓ –  : kernel mean of prior, ∑ ⋅,

–| : kernel representation of relation between and , , ,…, , ~, i.i.d.

– Goal: compute kernel mean of posterior |. | ∑ ⋅,

| | Λ Λ Λ Λ Diag V-48 Bayesian inference using kernel Bayes' rule – NO PARAMETRIC MODELS, BUT SAMPLES!

– When is it useful? • Explicit form of cond. p.d.f. | or prior is unavailable, but sampling is easy. – Approximate Bayesian Computation (ABC), Process prior. • Cond. p.d.f. | is unknown, but sample from , is given in training phase. – Example: nonparametric HMM (shown later). • If both of | and are known, there are many good sampling methods, such as MCMC, SMC, etc. But, they may take long time. KBR uses matrix operations.

V-49 Application: nonparametric HMM

Model: , ∏ ∏ |

–Assume: X0 X1 X2 X3 XT … | and/or | is not known. But, data , is available in training phase. Y0 Y1 Y2 Y3 YT Examples: • Measurement of hidden states is expensive, • Hidden states are measured with time delay.

– Testing phase (filtering, for example): given observations ỹ_1, …, ỹ_t, estimate the hidden state x_t.

– Sequential filtering/prediction uses Bayes' rule ⟹ KBR can be applied, as shown below.
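For concreteness, the recursion that is carried out on kernel means is the standard filtering update, written here for reference in the notation of the model above:

$$p(x_t \mid \tilde{y}_1, \ldots, \tilde{y}_t) \;\propto\; p(\tilde{y}_t \mid x_t) \int q(x_t \mid x_{t-1})\, p(x_{t-1} \mid \tilde{y}_1, \ldots, \tilde{y}_{t-1})\, dx_{t-1}.$$

KBR performs the prediction step (the integral) and the Bayes update (multiplication by p(ỹ_t | x_t) and normalization) as matrix operations on kernel means estimated from the training sample, so neither conditional density needs to be known in closed form.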

V-50  Camera angles

– Hidden X_t: angles of a video camera located at a corner of a room.
– Observed Y_t: movie frame of the room + additive Gaussian noise.
– Data: 3600 downsampled frames of 20 × 20 RGB pixels (1200-dim.).
– The first 1800 frames are used for training and the second half for testing.

  Noise          KBR (Trace)    Kalman filter (Q)
  σ² = 10⁻⁴      0.15 ± 0.01    0.56 ± 0.02
  σ² = 10⁻³      0.21 ± 0.01    0.54 ± 0.02

  Average MSE for the camera angles (10 runs)

To represent the SO(3) model, Tr[AB⁻¹] is used for KBR (the "Trace" in the table), and the quaternion representation is used for the Kalman filter.

V-51 Summary of Part V

 Kernel mean embedding of probabilities
– The kernel mean gives a representation of a probability distribution.
– Inference on probabilities can be cast as inference on kernel means, e.g. two-sample tests and independence tests.

 Conditional probabilities – Conditional probabilities can be handled with kernel means and covariances • Conditional independence test • Graphical modeling • Causal inference • Dimension reduction • Bayesian inference

V-52 References

Fukumizu, K., Bach, F.R., and Jordan, M.I. (2004) Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research 5:73-99.
Fukumizu, K., F.R. Bach and M. Jordan. (2009) Kernel dimension reduction in regression. Annals of Statistics 37(4):1871-1905.
Fukumizu, K., L. Song, A. Gretton (2011) Kernel Bayes' Rule. Advances in Neural Information Processing Systems 24 (NIPS 2011), 1737-1745.
Fukumizu, K. and C. Leng. (2011) Gradient-based kernel dimension reduction for supervised learning. arXiv:1109.0455v1.
Gretton, A., K.M. Borgwardt, M. Rasch, B. Schölkopf, A.J. Smola (2007) A Kernel Method for the Two-Sample-Problem. Advances in Neural Information Processing Systems 19, 513-520.
Gretton, A., Z. Harchaoui, K. Fukumizu, B. Sriperumbudur (2010) A Fast, Consistent Kernel Two-Sample Test. Advances in Neural Information Processing Systems 22, 673-681.
Gretton, A., K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, A. Smola. (2008) A Kernel Statistical Test of Independence. Advances in Neural Information Processing Systems 20, 585-592.

V-53

Gretton, A., D. Sejdinovic, B. Sriperumbudur, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu. Optimal kernel choice for large-scale two-sample tests. NIPS 2012, to appear.
Hristache, M., A. Juditsky, J. Polzehl, and V. Spokoiny. (2001) Structure Adaptive Approach for Dimension Reduction. Annals of Statistics 29(6):1537-1566.
Samarov, A. (1993) Exploring regression structure using nonparametric functional estimation. Journal of the American Statistical Association 88:836-847.
Sejdinovic, D., A. Gretton, B. Sriperumbudur, K. Fukumizu. (2012) Hypothesis testing using pairwise distances and associated kernels. Proc. 29th International Conference on Machine Learning (ICML 2012).
Sriperumbudur, B.K., A. Gretton, K. Fukumizu, B. Schölkopf, G.R.G. Lanckriet. (2010) Hilbert Space Embeddings and Metrics on Probability Measures. Journal of Machine Learning Research 11:1517-1561.
Sun, X., D. Janzing, B. Schölkopf and K. Fukumizu. (2007) A kernel-based causal learning algorithm. Proc. 24th Annual International Conference on Machine Learning (ICML 2007), 855-862.

V-54

Székely, G.J., M.L. Rizzo, and N.K. Bakirov. (2007) Measuring and testing dependence by correlation of distances. Annals of Statistics 35(6):2769-2794.
Székely, G.J. and M.L. Rizzo. (2009) Brownian distance covariance. Annals of Applied Statistics 3(4):1236-1265.

V-55 Appendices

V-56 Statistical Test: quick introduction

 How should we set the threshold? Example: based on MMD, we wish to decide whether the two variables have the same distribution.

Simple-minded idea: set a small value like t = 0.001.
  MMD(X, Y) < t  ⟹  perhaps the same distribution
  MMD(X, Y) ≥ t  ⟹  different distributions

But, the threshold should depend on the properties of X and Y.

 Statistical hypothesis test
– A statistical way of deciding whether a hypothesis is true or not.
– The decision is based on a finite sample ⟹ we cannot be 100% certain.

V-57  Procedure of hypothesis test

• Null hypothesis H0 = hypothesis assumed to be true “X and Y have the same distribution”

• Prepare a test statistic TN

e.g. T_N = MMD²_emp

• Null distribution: Distribution of TN under H0

• Set the significance level α. Typically α = 0.05 or 0.01.

• Compute the critical region: choose the threshold t so that α = Pr(T_N > t under H0).

• Reject the null hypothesis if T_N > t: the probability that MMD²_emp > t under H0 is then very small.

• Otherwise, accept H0 (this is only a weak conclusion).

V-58 One-sided test

[Figure: p.d.f. of the null distribution of T_N. The area of the critical region beyond the threshold t is the significance level α (5%, 1%, etc.); the area beyond the observed T_N is the p-value, so T_N > t ⟺ p-value < α.]

- If H0 is the truth, the value of TN should follow the null distribution.

- If H1 is the truth, the value of T_N should be very large.
- Set the threshold with risk α.
- The threshold depends on the distribution of the data.
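To make the "threshold with risk α" step concrete, here is a small illustrative sketch (the names are hypothetical): given samples that approximate the null distribution of T_N, the threshold is the (1 − α) quantile, and the p-value is the fraction of null samples exceeding the observed statistic.

```python
import numpy as np

def threshold_and_pvalue(null_samples, t_obs, alpha=0.05):
    """Critical value at level alpha and one-sided p-value for an observed statistic."""
    null_samples = np.asarray(null_samples)
    t_crit = np.quantile(null_samples, 1 - alpha)   # boundary of the critical region
    p_value = np.mean(null_samples >= t_obs)        # one-sided p-value
    # Reject H0 if t_obs > t_crit (equivalently, p_value < alpha).
    return t_crit, p_value
```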

V-59  Type I and Type II error
– Type I error = false positive (here rejecting H0 counts as "positive")
– Type II error = false negative

                        TRUTH
                        H0                               Alternative
  TEST     Accept H0    true negative                    Type II error (false negative)
  RESULT   Reject H0    Type I error (false positive)    true positive

• The significance level controls the Type I error.
• Under a fixed Type I error, a good test statistic should give a small Type II error.

V-60 MMD: Asymptotic distribution

 Under H0
X_1, …, X_n ~ P and Y_1, …, Y_ℓ ~ Q, i.i.d. Let N = n + ℓ, and assume n/N → ρ_x and ℓ/N → 1 − ρ_x ∈ (0, 1) as N → ∞. Under the null hypothesis P = Q,
$$N\,\mathrm{MMD}^2_{\mathrm{emp}} \;\Rightarrow\; \sum_{i=1}^{\infty} \lambda_i Z_i^2 \qquad (N \to \infty),$$
where Z_1, Z_2, … are i.i.d. with law N(0, 1/(ρ_x(1−ρ_x))) (for the unbiased U-statistic version of MMD²_emp each term is centered at its mean), and λ_i are the eigenvalues of the integral operator on L²(P)

with the centered kernel
$$\tilde{k}(x, y) = k(x, y) - E[k(x, \tilde{X})] - E[k(\tilde{X}, y)] + E[k(\tilde{X}, \tilde{X}')],$$
where X̃, X̃′ are independent copies of X ~ P.

V-61  Under H1
Under the alternative P ≠ Q,

$$\sqrt{N}\,\big(\mathrm{MMD}^2_{\mathrm{emp}} - \mathrm{MMD}^2(P, Q)\big) \;\Rightarrow\; N(0, \sigma^2) \qquad (N \to \infty),$$
where the asymptotic variance σ² is determined by P, Q, and the kernel through (four times) the variance of the U-statistic kernel h((x, y), (x′, y′)) = k(x, x′) + k(y, y′) − k(x, y′) − k(x′, y).

– The asymptotic distributions are derived by the general theory of U-statistics (e.g. see van der Vaart 1998, Chapter 12). – Estimation of the null distribution:

• Estimation of the eigenvalues λ_i.
• Approximation by a Pearson curve with moment matching.
• Bootstrap MMD (Arcones & Giné 1992); a permutation-based sketch follows below.
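The resampling route can be sketched as follows (an illustration with a Gaussian kernel, not the slides' code): permuting the pooled sample gives approximate draws from the null distribution of the plug-in MMD²_emp, from which the threshold and p-value are computed as in the test procedure above.

```python
import numpy as np

def gram(A, B, sigma=1.0):
    """Gaussian Gram matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    """Plug-in (biased) estimate of MMD^2 = ||m_P - m_Q||^2."""
    n, m = len(X), len(Y)
    return (gram(X, X, sigma).sum() / n**2
            + gram(Y, Y, sigma).sum() / m**2
            - 2 * gram(X, Y, sigma).sum() / (n * m))

def mmd_permutation_test(X, Y, sigma=1.0, n_perm=500, alpha=0.05, seed=0):
    """Two-sample test: permute the pooled sample to approximate the null distribution."""
    rng = np.random.default_rng(seed)
    t_obs = mmd2(X, Y, sigma)
    Z, n = np.vstack([X, Y]), len(X)
    null = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(len(Z))
        null[b] = mmd2(Z[idx[:n]], Z[idx[n:]], sigma)
    p_value = np.mean(null >= t_obs)
    return t_obs, p_value, p_value < alpha   # reject H0 when p_value < alpha

# Example: samples with slightly different means should tend to be rejected.
rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(100, 2))
Y = rng.normal(0.5, 1.0, size=(120, 2))
print(mmd_permutation_test(X, Y, sigma=1.0))
```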

V-62 Conventional methods for the two-sample problem

 Kolmogorov-Smirnov (K-S) test for two samples (one-dimensional variables)
– Empirical distribution function:
$$F_N(t) = \frac{1}{N} \sum_{i=1}^{N} I(X_i \leq t)$$
– KS test statistic:
$$D_N = \sup_{t \in \mathbb{R}} \big| F_N^{(1)}(t) - F_N^{(2)}(t) \big|$$
  [Figure: the two empirical distribution functions F_N^{(1)}(t), F_N^{(2)}(t) and the maximal gap D_N.]
– The asymptotic null distribution is known (not shown here).
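A small NumPy sketch of the two-sample KS statistic D_N above (illustrative; in practice scipy.stats.ks_2samp also provides the p-value):

```python
import numpy as np

def ks_two_sample_statistic(x, y):
    """D_N = sup_t |F1(t) - F2(t)| for two one-dimensional samples."""
    x, y = np.sort(x), np.sort(y)
    grid = np.concatenate([x, y])                          # the sup is attained at sample points
    F1 = np.searchsorted(x, grid, side="right") / len(x)   # empirical CDF of the first sample
    F2 = np.searchsorted(y, grid, side="right") / len(y)   # empirical CDF of the second sample
    return np.max(np.abs(F1 - F2))

x = np.random.default_rng(0).normal(size=200)
y = np.random.default_rng(1).normal(loc=0.3, size=150)
print(ks_two_sample_statistic(x, y))
```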

V-63  Wald-Wolfowitz run test (one-dimensional samples)
– Combine the two samples and plot the points in ascending order.
– Label the points according to the original two groups.
– Count the number of "runs", i.e. consecutive sequences of the same label.
– Test statistic: with R = number of runs,
$$T_N = \frac{R - E[R]}{\sqrt{\mathrm{Var}[R]}} \;\Rightarrow\; N(0, 1).$$
  [Figure: example with R = 10 runs.]
– In the one-dimensional case, it is less powerful than the KS test.
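A short sketch of the run count and the normalized statistic T_N (illustrative; the mean and variance formulas for R under H0 are the standard runs-test moments, stated here as an assumption since the slide does not give them):

```python
import numpy as np

def runs_test_statistic(x, y):
    """Wald-Wolfowitz runs test statistic for two one-dimensional samples."""
    labels = np.concatenate([np.zeros(len(x)), np.ones(len(y))])
    s = labels[np.argsort(np.concatenate([x, y]))]    # labels in ascending order of the pooled sample
    R = 1 + int(np.sum(s[1:] != s[:-1]))              # number of runs
    n, m = len(x), len(y)
    mean_R = 1 + 2 * n * m / (n + m)                  # standard H0 moments of R
    var_R = 2 * n * m * (2 * n * m - n - m) / ((n + m) ** 2 * (n + m - 1))
    # Few runs (small R) indicate different distributions, so small T_N is evidence against H0.
    return (R - mean_R) / np.sqrt(var_R)              # approximately N(0, 1) under H0

x = np.random.default_rng(0).normal(size=100)
y = np.random.default_rng(1).normal(loc=0.5, size=100)
print(runs_test_statistic(x, y))
```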

 Multidimensional extension of the KS and WW tests
– A minimum spanning tree is used (Friedman & Rafsky 1979).

V-64