<<

Kernel Methods for Statistical Learning

Kenji Fukumizu

The Institute of Statistical Mathematics / Graduate University for Advanced Studies

September 6-7, 2012 Summer School 2012, Kyoto

The latest version of slides is downloadable at http://www.ism.ac.jp/~fukumizu/MLSS2012/ 1

Lecture Plan

I. Introduction to methods

II. Various kernel methods kernel PCA, kernel CCA, kernel ridge regression, etc

III. Support vector machine A brief introduction to SVM

IV. Theoretical backgrounds of kernel methods Mathematical aspects of positive definite kernels

V. Nonparametric inference with positive definite kernels Recent advances of kernel methods

2  General references (More detailed lists are given at the end of each section) – Schölkopf, B. and A. Smola. Learning with Kernels. MIT Press. 2002.

– Lecture slides (more detailed than this course) This page contains Japanese information, but the slides are written in English. Slides: 1, 2, 3, 4, 5, 6, 7, 8

– For Japanese only (Sorry!): • 福水 「カーネル法入門 -正定値カーネルによるデータ解析」 朝倉書店(2010)

• 赤穂 「カーネル多変量解析 ―非線形データ解析の新しい展開」 岩波書店(2008) 3 I. Introduction to Kernel Methods

Kenji Fukumizu

The Institute of Statistical Mathematics / Graduate University for Advanced Studies

September 6-7 Machine Learning Summer School 2012, Kyoto

I-11 Outline

1. Linear and nonlinear data analysis

2. Principles of kernel methods

I-22 Linear and nonlinear data analysis

I-33 What is data analysis?

– Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. –Wikipedia

I-4 Linear data analysis

– ‘Table’ of numbers  Matrix expression

 X (1) X (1)   1  m   (2) (2)  X1  X m m N X    dimensional, data      (N ) (N )   X1  X m 

– Linear algebra is used for methods of analysis • Correlation, • analysis, • Principal component analysis, • analysis, etc.

I-5  Example 1: Principal component analysis (PCA)

PCA: project data onto the low-dimensional subspace of largest variance.

T 1st direction = argmax||a||1Var[a X ] 2 N 1   1 N  Var[aT X ]  aT X (i)  X ( j)   j1  N i1   N  T  a VXX a.

where

N T 1  1 N  1 N   X (i)  X ( j) X (i)  X ( j) VXX   j1  j1  N i1  N  N 

(Empirical) of

I-6 – 1st principal direction T  argmax||a||1a VXX a

 u1 unit eigenvector w.r.t. the largest eigenvalue of

– p-th principal direction =

unit eigenvector w.r.t. the p-th largest eigenvalue of

PCA Eigenproblem of covariance matrix

I-7  Example 2: Linear classification – Input data Class label  X (1) X (1)   Y (1)   1  m     (2) (2)  (2) X1  X m  Y  N X    Y    {1}         ( N ) ( N )  Y (N )   X1  X m    Find a h(x)  sgn(aT x  b) so that (i) (i) h(X )  Y for all (or most) i.

– Example: Fisher’s linear discriminant analyzer, Linear SVM, etc.

I-8 Are linear methods enough?

linearly inseparable linearly separable 6

15 4 10

5 2 0 z3 -5 0 x2 transform -10 -2 -15 0

5 20 -4 10 15 z1 10 -6 15 -6 -4 -2 0 2 4 6 5 20 0 z2 x1 2 2 (z1, z2 , z3 )  (x1 , x2 , 2x1x2 )

Watch the following movie!

http://jp.youtube.com/watch?v=3liCbRZPrZA I-9  Another example: correlation Cov[X ,Y ] EX  E[X ] Y  E[Y ]  XY   Var[X ]Var[Y ] EX  E[X ] 2 EY  E[Y ] 2 

3

2  = 0.94

1

Y 0

-1

-2

-3 -3 -2 -1 0 1 2 3 X

I-10 2.5 2.5

2 2

1.5 1.5  X Y  X2 Y Y 1 ( , ) 1 ( , ) = 0.17 = 0.96 0.5 0.5

0 0

-0.5 -0.5 -1.5 -1 -0.5 0 0.5 1 1.5 0 0.5 1 1.5 2 2.5 X

(X, Y) transform (X2, Y)

I-11 Nonlinear transform helps!

Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. – Wikipedia.

Kernel method = a systematic way of transforming data into a high- dimensional feature space to extract nonlinearity or higher-order moments of data.

I-12 Principles of kernel methods

I-13 Kernel method: Big picture

– Idea of kernel method

 , xi Hk   xi feature map x j  xj Space of original data Feature space Do linear analysis in the feature space! e.g. SVM

– What kind of space is appropriate as a feature space? • Should incorporate various nonlinear information of the original data. • The inner product should be computable. It is essential for many linear methods. I-14  Computational issue – For example, how about using power series expansion? (X, Y, Z)  (X, Y, Z, X2, Y2, Z2, XY, YZ, ZX, …)

– But, many recent data are high-dimensional. e.g. microarray, images, etc...

The above expansion is intractable!

e.g. Up to 2nd moments, 10,000 dimension:

Dim of feature space: 10000C1 + 10000C2 = 50,005,000 (!)

– Need a cleverer way  Kernel method.

I-15 Feature space by positive definite kernel

– Feature map: from original space to feature space  :   H

X1,, X n  (X1),,(X n )

–With special choice of feature space, we have a function (positive definite kernel) k(x, y) such that

(X i ),(X j )  k(X i , X j ) kernel trick

– Many linear methods use only the inner products of data, and do not need the explicit form of the vector Φ. (e.g., PCA)

I-16 Positive definite kernel

Definition. Ω: set. : Ω Ω → is a positive definite kernel if 1) (symmetry) k(x, y)  k(y, x)

2) (positivity) for arbitrary x1,…,xn ∈Ω

 k(x , x ) k(x , x )   1 1  1 n       is positive semidefinite,   k(xn , x1)  k(xn , xn ) ()

n c c k(x , x )  0 c  R i.e., i, j1 i j i j for any i

I-17 Examples: positive definite kernels on Rm (proof is give in Section IV) Gaussian • Euclidean inner product k(x, y)  xT y

• Gaussian RBF kernel 2 2 (  0) kG (x, y)  exp x  y  

• Laplacian kernel Laplacian

m k (x, y)  exp | x  y | L i1 i i (  0)

• Polynomial kernel T d kP (x, y)  (c  x y) (c  0,d  N)

I-18 Proposition 1.1 Let be a vector space with inner product ⋅,⋅ and Φ: Ω → be a map (feature map). If : Ω Ω → is defined by

(x),( y)  k(x, y), (kernel trick)

then k(x,y) is necessarily positive definite.

– Positive definiteness is necessary.

– Proof) n n c c k(X , X )  c c (X ),(X ) i, j1 i j i j i, j1 i j i j

n n n 2  c (X ), c (X )  c (X )  0 i1 i i  j1 j j i1 i i

I-19 – Positive definite kernel is sufficient.

Theorem 1.2 (Moore-Aronszajn) For a positive definite kernel on Ω, there is a Hilbert space (reproducing kernel Hilbert space, RKHS) that consists of functions on Ω such that

1) ⋅, ∈ for any 2) span ⋅, ∈ Ω is dense in 3) (reproducing property)

, ⋅, for any ∈, ∈Ω

*Hilbert space: vector space with inner product such that the topology is complete.

I-20 Feature map by positive definite kernel

 , – Prepare a kernel. xi Hk  x xi feature map j – Feature space = RKHS.  xj – Feature map: feature space Φ: Ω → , ↦ ⋅,

,…, ↦ ⋅, ,…, ⋅,

– Kernel trick: by reproducing property Φ ,Φ ⋅, , ⋅, ,

– All we need is a positive definite kernel: We do not need an explicit form of feature vector or feature space.

Computation in kernel methods use only kernel values , .

I-21 II. Various Kernel Methods

Kenji Fukumizu

The Institute of Statistical Mathematics / Graduate University for Advanced Studies

September 6-7 Machine Learning Summer School 2012, Kyoto

1 Outline

1. Kernel PCA

2. Kernel CCA

3. Kernel ridge regression

4. Some topics on kernel methods

II-2 Kernel Principal Component Analysis

II-3 Principal Component Analysis  PCA (review) – Linear method for dimension reduction of data – Project the data in the directions of large variance.

T 1st principal axis = argmax||a||=1Var[a X ] 2 n 1   1 n  [aT X ] = aT X − X Var ∑  i ∑ j=1 j  n i=1   n 

T = a VXX a.

where

n T 1  1 n  1 n  = X − X X − X VXX ∑ i ∑ j=1 j  i ∑ j=1 j  n i=1  n  n 

II-4 From PCA to Kernel PCA

– Kernel PCA: nonlinear dimension reduction of data (Schölkopf et al. 1998).

– Do PCA in feature space

2 n 1   1 n  max : [aT X ] = aT X − X ||a||=1 Var ∑  i ∑s=1 s  n i=1   n 

2 n 1  1 n  max : Var[ f ,Φ(X ) ] = f ,Φ(X ) − Φ(X ) || f ||H =1 ∑ i ∑s=1 s  n i=1  n 

II-5

It is sufficient to assume 푓 1 푓⊥ = 푛 푛

푖 푖 푠 푓 � 푐 Φ 푋 − � Φ 푋 Φ 푖=1 푛 푠=1 푓 Orthogonal directions to the data can be neglected, since for = ( ) + , where is orthogonal to 1 the span푛 { 푛 } , the objective function of kernel 푓 ∑푖=1 푐푖 Φ 푋푖 − 푛 ∑푠=1 Φ 푋푠 푓⊥ 푓⊥ PCA does not depend1 푛 on . 푛 Φ 푋푖 − 푛 ∑푠=1 Φ 푋푠 푖=1 ⊥ T푓 ~ 2 Then, Var[ f ,Φ(X ) ]= c K X c [Exercise] 2 T ~ || f ||H = c K X c ~ ~ ~ where K X ,ij := Φ(X i ),Φ(X j ) (centered Gram matrix) ~ n with Φ(X ) := Φ(X ) − 1 Φ(X ) (centered feature vector) i i n ∑s=1 s II-6  Objective function of kernel PCA

~ ~ max T 2 T = 1 c K X c subject to c K X c

The centered Gram matrix is expressed with Gram matrix = , as 퐾�푋 푋 푖 푗 1 퐾 푘 푋 푋 푖� 1 1 = = 푇 1 푛 푇 푇 ퟏ푛 ⋮ ∈ 퐑 �푋 푛 푛 푛 푛 푛 푛 Unit matrix 퐾~ 퐼 − ퟏ ퟏ 1 퐾 n 퐼 − ퟏ ퟏ = K = k(X푛, X ) − k(X푛, X ) ( X )ij i j ∑s=1 i s n 퐼푛 1 n 1 N − ( , ) + ( , ) ∑ =1 k X t X j 2 ∑ , =1 k X t X s n t n t s [Exercise] II-7 – Kernel PCA can be solved by eigen-decomposition.

– Kernel PCA algorithm ~ • Compute centered Gram matrix K ~ X • Eigendecompsoition of K X ~ N = λ T K X ∑i=1 iuiui

λ1 ≥ λ2 ≥ ≥ λ ≥ 0 eigenvalues  N u1,u2 ,,uN unit eigenvectors

• p-th principal component of T X i = λp up X i

II-8 Summary: derivation of kernel method in general – Consider feature vectors given by kernels. – A linear method is applied to the feature vectors. (Kernelization) – Typically, only the inner products

Φ(X ),Φ(X ) = k(X , X ) i j i j

f ,Φ(X i ) are used to express the objective function of the method.

n – The solution is given in the form f = ∑ciΦ(X i ), i=1

(representer theorem, see later), and thus everything is written by Gram matrices.

These are common in derivation of any kernel methods. II-9 Example of Kernel PCA

 Wine data (from UCI repository) 13 dim. chemical measurements of for three types of wine. 178 data. Class labels are NOT used in PCA, but shown in the figures.

First two principal components:

4 Linear PCA 0.6 Kernel PCA (Gaussian kernel)

3 (σ = 3) 0.4

2 0.2

1 0 0

-0.2 -1

-0.4 -2

-3 -0.6

-4 -0.8 -5 0 5 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 II-10 Kernel PCA (Gaussian) 0.5 0.6 σ = 4 0.4 σ = 2 0.4 0.3

0.2 0.2

0.1 0 0

-0.2 -0.1

-0.2 -0.4

-0.3 -0.6 -0.4

-0.5 -0.8 -0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6

0.6 σ = 5 0.4

0.2 2 2 0 kG (x, y) = exp(− x − y σ )

-0.2

-0.4

-0.6

-0.8 II-11 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 Noise Reduction with Kernel PCA

– PCA can be used for noise reduction (principal directions represent signal, other directions noise).

– Apply kernel PCA for noise reduction:

• Compute d-dim subspace Vd spanned by d-principal directions. • For a new data x,

Gx: Projection of Φ(x) onto Vd = noise reduced feacure vector. • Compute a preimage xˆ in

data sapce for the noise 10

8 Φ(x) reduced feature vector Gx. 6 2 4 Gx xˆ =arg min Φ− ( xG′ ) x 2 x′ 0 -2

-4

Note: G is not necessariy -6 x -20 -8 given as an image of Φ. 0 -10 -25 -20 20 -15 -10 -5 0 5 10 15 20 II-12  USPS data

Original data (NOT used for PCA)

Noisy images

Noise reduced images(linear PCA)

Noise reduced images (kernel PCA, Gaussian kernel)

Generated by Matlab stprtool (by V. Franc)

II-13 Properties of Kernel PCA

– Nonlinear features can be extracted.

– Can be used for preprocessing of other analysis like classification. (dimension reduction / feature extraction)

– The results depend on the choice of kernel and kernel parameters. Interpreting the results may not be straightforward.

– How to choose a kernel and kernel parameter? • Cross-validation is not straightforward unlike SVM. • If it is a preprocessing, the performance of the final analysis should be maximized.

II-14 Kernel Canonical Correlation Analysis

II-15 Canonical Correlation Analysis

– Canonical correlation analysis (CCA, Hotelling 1936) Linear dependence of two multivariate random vectors. • Data , , … , ( , ) • : m-dimensional, : -dimensional 푋1 푌1 푋푁 푌푁

푋푖 푌푖 ℓ Find the directions and so that the correlation of and is maximized. 푇 푇 푎 푏 푎 푋 푏 푌 Cov[aT X ,bTY ] ρ = maxCorr[aT X ,bTY ] = max , a,b a,b [ T ] [ T ] Var a X Var b Y

a b X aTX bTY Y

II-16  Solution of CCA max subject to = = 1. , 푇 푇 푇 푋푋 푋푋 𝑌 – Rewritten푎 푏 푎 푉� as 푏a generalized eigenproblem푎 푉� 푎: 푏 푉� 푏

= 푂 푉�푋푋 푎 푉�푋푋 푂 푎 휌 [Exercise: 푉derive�𝑌 푂this. 푏(Hint. Use푂 Lagrange푉�𝑌 푏multiplier method.)]

– Solution: / / = , = 1 2 1 2 푎where푉푋푋 푢1 ( 푏 , resp.)푉𝑌 푣 is1 the left (right, resp.) first eigenvector for the SVD of / / . 1 1 푢 푣 −1 2 −1 2 푉�푋푋 푉�푋푋푉�𝑌 II-17 Kernel CCA

– Kernel CCA (Akaho 2000, Melzer et al. 2002, Bach et al 2002) • Dependence (not only correlation) of two random variables. • Data: , , … , ( , ) arbitrary variables

푋1 푌1 푋푁 푌푁 • Consider CCA for the feature vectors with and : , … , , … , , , , … , , … , . 푘푋 푘� 푋1 푋푁 ↦ Φ푋 푋1 Φ푋 푋푁 ∈ 퐻푋 1 푁 � 1 � 푁 � 푌 Cov푌 [ ↦ Φ, 푌] Φ 푌 ∈ 퐻 , , max = max , Var Var[ ] , 푁 푓 푋 푔 푌 ∑푖 ,푓 Φ�푋 푋푖 〈Φ� � 푌, 푖 푔〉 푓∈퐻푋 푔∈퐻푌 푓∈퐻푋 푔∈퐻푌 푓 푋 푔 푌 푁 2 푁 2 ∑푖 푓 Φ�푋 푋푖 ∑푖 푔 Φ� � 푌푖

Φ f g Φy x ( ) ( ) Φ ( ) X Φx(X) f X g Y y Y Y

II-18 – We can assume = ( ) and = ( ). 푁 푁 푓 ∑푖=1 훼푖Φ�푋 푋푖 푔 ∑푖=(same1 훽푖Φ� �as 푌kernel푖 PCA)

max , 푇 and : centered 푋 � 푁 푁 훼 퐾� 퐾� 훽 Gram matrices. 훼∈R 훽∈푅 푋 � 푇 2 푇 2 퐾� 퐾� 훼 퐾�푋 훼 훽 퐾�� 훽 – Regularization: to avoid trivial solution, , , max , 푁 푖=1 푋 푖 � 푖 , ∑ + 푓 Φ� 푋 〈Φ� 푌 , 푔〉 + 푓∈퐻푋 푔∈퐻푌 푁 2 2 푁 2 2 ∑푖=1 푓 Φ�푋 푋푖 휀푁 푓 퐻푋 ∑푖=1 푔 Φ� � 푌푖 휀푁 푔 퐻푌 (We can prove this works. Fukumizu et al JMLR 2007)

– Solution: generalized eigenproblem + = 2 + 푂 퐾�푋퐾�� 훼 퐾�푋 휀푁퐾�푋 푂 훼 휌 2 II-19 �� �푋 훽 �� 푁 �� 훽 퐾 퐾 푂 푂 퐾 휀 퐾 Application of KCCA

– Application to image retrieval (Hardoon, et al. 2004).

• Xi: image, Yi: corresponding texts (extracted from the same webpages).

• Idea: use d eigenvectors f1,…,fd and g1,…,gd as the feature spaces which contain the dependence between X and Y. • Given a new word , compute its feature vector, and find the image whose feature has the highest inner product. 푛푛푛 푌

 f1,Φ (X )   g1,Φ (Y )  Φ f  x i   y  g Φ x • y Xi Φx(Xi)       Φy(Y) Y    ,Φ ( )   ,Φ ( )   fd x X i   gd y Y  “at phoenix sky harbor on july 6, 1997. 757- 2s7, n907wa, ...”

II-20 – Example: • Gaussian kernel for images. • Bag-of-words kernel (frequency of words) for texts.

Text -- “height: 6-11, weight: 235 lbs, position: forward, born: september 18, 1968, split, croatia college: none” Extracted images

II-21 Kernel Ridge Regression

II-22 Ridge regression

, , … , , : data ( , ) 푚 Ridge1 1 regression푛 푛: linear regression푖 with푖 penalty. 푋 푌 푋 푌 푋 ∈ 퐑 푌 ∈ 퐑 2 퐿 min 푛 | | + 2 푇 2 2 푎 � 푌푖 − 푎 푋푖 휆 푎 푖=1 (The constant term is omitted for simplicity)

– Solution (quadratic function): = + where −1 푇 푎� 푉푋푋 휆퐼푛 푋 푌 × = , = 푇 , = 1 1 푇 푋 푛 푚 푌 푛 푋푋 푉 푋 푋 푋 ⋮푇 ∈ 퐑 푌 ⋮ ∈ 퐑 푛 푛 – Ridge regression is preferred, 푋when is (close to)푌 singular.

II-23 푉푋푋 Kernel ridge regression

– , , … , , : arbitrary data, .

– Kernelization1 1 푛of ridge푛 regression: positive definite kernel for 푋 푌 푋 푌 푋 푌 ∈ 퐑 푘 푋 min 푛 | , | + Ridge regression on 2 2 푖 푖 퐻 퐻 푓∈퐻 � 푌 − 푓 Φ 푋 휆 푓 퐻 equivalently, 푖 =1

min 푛 | ( )| + Nonlinear ridge regr. 2 2 푖 푖 퐻 푓∈퐻 � 푌 − 푓 푋 휆 푓 푖=1

II-24 n – Solution is given by the form f = ∑c jΦ(X j ), j=1 Let = + = + ( 푛 , : orthogonal complement) 푖=1 푖 푖 ⊥ Φ ⊥ Objective푓 ∑ function푐 Φ 푋 = 푓 푛 푓 푓+ , + + 푓Φ ∈ 푆푆𝑆 Φ 푋푖 푖=1 푓⊥ = 푛 , +2 ( + 2 ) ∑푖=1 푌푖 − 푓Φ 푓⊥ Φ 푋푖 휆 푓Φ 푓⊥ The 1st term does not depend푛 on , and 2nd2 term is minimized2 2 in the 푖=1 푖 Φ 푖 Φ ⊥ case = 0. ∑ 푌 − 푓 Φ 푋 휆 푓 푓 ⊥ 푓 푓⊥ – Objective function : + 2 푇 푋 푋 – Solution: = 푌 −+퐾 푐 휆 푐 퐾 푐

−1 , 푋 푛 Function: 푐 ̂ 퐾 =휆퐼 푌+ = 1 푇 −1 푘 푥, 푋 푓̂ 푥 푌 퐾푋 휆퐼푛 퐤 푥 ⋮ 퐤 푥 푘 푥 II푋-25푛 Regularization

– The minimization min 2 may be attained with zero푖 errors.푖 푓 푌 − 푓 푋 But the function may not be unique.

– Regularization min | ( )| + 푛 2 2 푖 푖 퐻 푓∈퐻 ∑푖=1 푌 − 푓 푋 휆 푓 • Regularization with smoothness penalty is preferred for uniqueness, stability, and smoothness. • Link with some RKHS norm and

smoothness is discussed in Sec. IV. II-26

Comparison

 Kernel ridge regression vs local linear regression = 1/ 1.5 + | | + , ~ 0, , ~ 0, 0.1 2 2 푑 푌 = 100, 500 runs푋 푍 0.012푋 푁 퐼 푍 푁 Kernel method Local linear 푛 0.01 Kernel ridge regression with Gaussian kernel 0.008 Local linear regression with Epanechnikov kernel* 0.006 (‘locfit’ in R is used) 0.004 Mean square errors Bandwidth parameters 0.002 are chosen by CV. 0 1 10 100 Dimension of X

*Epanechnikov kernel = 1 {| | } II-27 3 2 4 − 푢 ퟏ 푢 ≤1  Local linear regression (e.g., Fan and Gijbels 1996) – : smoothing kernel / Parzen window ( 0, = 1, not necessarily positive definite) 퐾 퐾 푥 ≥ ∫ 퐾 푥 푑� – Local linear regression = is estimated by ( ) = 푥 퐸 푌 푋 푥0 −푑 min 푛 ( ) ( )ℎ ℎ , 퐾 푥 ℎ 퐾 푇 2 푖 푖 0 ℎ 푖 0 푎 푏 � 푌 − 푎 − 푏 푋 −푥 퐾 푋 − 푥 • For each , 푖this minimization can be solved by linear algebra. • Statistical property of this estimator is well studied. 푥0 • For one dimensional , it works nicely with some theoretical optimality. • But, weak for high-dimensional푋 data.

II-28 Some topics on kernel methods

• Representer theorem • Structured data • Kernel choice • Low rank approximation

II-29 Representer theorem

, , … , , : data : positive definite kernel for , : corresponding RKHS. 푋1 푌1 푋푛 푌푛 : monotonically increasing function on . 푘 푋 퐻

Ψ 퐑+ Theorem 2.1 (representer theorem, Kimeldorf & Wahba 1970) The solution to the minimization problem:

min ( , , , … , ( , , )) + ( )

1 1 1 푛 푛 푛 퐻 is attained푓∈퐻 퐹 by푋 푌 푓 푋 푋 푌 푓 푋 Ψ 푓

= 푛 with some , … , . 푛 푖 푖 1 푛 푓 � 푐 Φ 푋 푐 푐 ∈ 퐑 푖=1 The proof is essentially the same as the one for the kernel ridge regression. [Exercise: complete the proof] II-30 Structured Data – Structured data: non-vectorial data with some structure.

• Sequence data (variable length): DNA sequence, Protein (sequence of amino acids) Text (sequence of words)

• Graph data (Koji’s lecture) Chemical compound etc. S

• Tree data NP VP Parse tree NP • Histograms / probability Det N V Det N measures The cat chased the mouse.

– Kernel methods can be applied to any type of data, once a kernel is defined.

– Many kernels uses counts of substructures (Haussler 1999). II-31 Spectrum kernel

– p-spectrum kernel (Leslie et al 2002): positive definite kernel for string.

, = Occurrences of common subsequences of length p.

– Example:푘푝 푠 푡 s = “statistics” t = “pastapistan”

3-spectrum

s: sta, tat, ati, tis, ist, sti, tic, ics

t : pas, ast, sta, tap, api, pis, ist, sta, tan sta tat ati tis ist sti tic ics pas ast tap api pis tan Φ(s) 1 1 1 1 1 1 1 1 0 0 0 0 0 0 Φ(t) 2 0 0 0 1 0 0 0 1 1 1 1 1 1

K3(s, t) = 1・2 + 1・1 = 3

– Linear time ( ( (| | + | |) ) algorithm with suffix tree is known (Vishwanathan & Smola 2003). II-32 푂 � 푆 푡 – Application: kernel PCA of ‘words’ with 3-spectrum kernel

8 6 informatics 4

2 biology mathematics 0 psychology cybernetics physics statistics methodology -2 metaphysics biostatistics pioneering -4 engineering

-6 -4 -2 0 2 4 6 8

II-33 Choice of kernel  Choice of kernel – Choice of kernel (polyn or Gauss) – Choice of parameters (bandwidth parameter in Gaussian kernel)

 General principles – Reflect the structure of data (e.g., kernels for structured data)

– For (e.g., SVM)  Cross-validation

– For (e.g. kernel PCA) • No general methods exist. • Guideline: make or use a relevant supervised problem, and use CV.

– Learning a kernel: (MKL)  Ask Francis! , = ( , ) optimize II-34 푀 푘 푥 푦 ∑푖=1 푐푖푘푖 푥 푦 푐푖 Low rank approximation – Kernel method: computation is reduced to Gram matrices: × where is the sample size. • Good for high-dimensional data 푛 푛 푛 • Large causes computational problems. e.g. Inversion, eigendecomposition costs ( ) in time. 푛 3 – Low-rank approximation 푂 푛

, where : × matrix ( ) 푇 푅 푛 푟 푟 ≪ 푛 퐾 ≈ 푅푅

푛 푛 푟 퐾 푅 푅 ≈ 푛 푛 푟 – The decay of eigenvalues of a Gram matrix is often quite fast (See Widom 1963, 1964; Bach & Jordan 2002). II-35

– Two major methods • Incomplete Cholesky factorization (Fine & Sheinberg 2001) ( ) in time and ( ) in space • Nyström2 approximation (Williams and Seeger 2001) 푂 푛푟 푂 푛� Random sampling + eigendecomposition

– Example: kernel ridge regression

+ time : ( )

푇 −1 3 Low rank푌 approximation:퐾푋 휆퐼푛 퐤 푥 . With푂 푛 Woodbury formula*

푇 + 퐾푋 ≈ 푅+푅 푇 −1 푇 푇 −1 푌 퐾푋 휆퐼푛 퐤 푥 ≈= 푌 푅푅 휆퐼푛 퐤 푥 + 1 푇 time :푇 ( 푇 + ) − 1 푇 휆 푌 퐤 푥 − 푌 푅 푅 푅 휆퐼푟 푅 퐤 푥 2 3 푂 푟 푛 푟 * Woodbury (Sherman–Morrison–Woodbury) formula:

+ = + . II-36 −1 −1 −1 −1 −1 −1 퐴 푈� 퐴 − 퐴 푈 퐼 푉퐴 푈 푉퐴 Other kernel methods

– Kernel Fisher discriminant analysis (kernel FDA) (Mika et al. 1999)

– Kernel (Roth 2001, Zhu&Hastie 2005)

– Kernel partial least square(kernel PLS)(Rosipal&Trejo 2001)

– Kernel K-means clustering(Dhillon et al 2004) etc, etc, ...

– Variants of SVM  Section III.

II-37 Summary: properties of kernel methods

– Various classical linear methods can be kernelized  Linear algorithms on RKHS.

– The solution typically has the form n f = ∑ciΦ(X i ). (representer theorem) i=1

– The problem is reduced to manipulation of Gram matrices of size n (sample size). • Advantage for high dimensional data. • For a large number of data, low-rank approximation is used effectively.

– Structured data: • kernel can be defined on any set. • kernel methods can be applied to any type of data.

II-38 References Akaho. (2000) Kernel Canonical Correlation Analysis. Proc. 3rd Workshop on Induction-based Information Sciences (IBIS2000). (in Japanese) Bach, F.R. and M.I. Jordan. (2002) Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48. Dhillon, I. S., Y. Guan, and B. Kulis. (2004) Kernel k-means, and normalized cuts. Proc. 10th ACM SIGKDD Intern. Conf. Knowledge Discovery and (KDD), 551–556. Fan, J. and I. Gijbels. Local Polynomial Modelling and Its Applications. Chapman Hall/CRC, 1996. Fukumizu, K., F. R. Bach and A. Gretton (2007). Statistical Consistency of Kernel Canonical Correlation Analysis. Journal of Machine Learning Research 8, 361-383. Fine, S. and K. Scheinberg. (2001) Efficient SVM Training Using Low-Rank Kernel Representations. Journal of Machine Learning Research, 2:243-264. Gökhan, B., T. Hofmann, B. Schölkopf, A.J. Smola, B. Taskar, S.V.N. Vishwanathan. (2007) Predicting Structured Data. MIT Press. Hardoon, D.R., S. Szedmak, and J. Shawe-Taylor. (2004) Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16:2639– 2664. II-39 Haussler, D. Convolution kernels on discrete structures. Tech Report UCSC-CRL-99-10, Department of Computer Science, University of California at Santa Cruz, 1999. Leslie, C., E. Eskin, and W.S. Noble. (2002) The spectrum kernel: A for SVM protein classification, in Proc. Pacific Symposium on Biocomputing, 564–575. Melzer, T., M. Reiter, and H. Bischof. (2001) Nonlinear feature extraction using generalized canonical correlation analysis. Proc. Intern. Conf. .Artificial Neural Networks (ICANN 2001), 353–360. Mika, S., G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. (1999) Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, edits, Neural Networks for Signal Processing, volume IX, 41–48. IEEE. Rosipal, R. and L.J. Trejo. (2001) Kernel partial least squares regression in reproducing kernel Hilbert space. Journal of Machine Learning Research, 2: 97–123. Roth, V. (2001) Probabilistic discriminative kernel classifiers for multi-class problems. In : Proc. 23rd DAGM Symposium, 246–253. Springer. Schölkopf, B., A. Smola, and K-R. Müller. (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319. Schölkopf, B. and A. Smola. Learning with Kernels. MIT Press. 2002. Vishwanathan, S. V. N. and A.J. Smola. (2003) Fast kernels for string and tree matching. Advances in Neural Information Processing Systems 15, 569–576. MIT Press.

II-40

Williams, C. K. I. and M. Seeger. (2001) Using the Nyström method to speed up kernel machines. Advances in Neural Information Processing Systems, 13:682–688. Widom, H. (1963) Asymptotic behavior of the eigenvalues of certain integral equations. Transactions of the American Mathematical Society, 109:278{295, 1963. Widom, H. (1964) Asymptotic behavior of the eigenvalues of certain integral equations II. Archive for Rational Mechanics and Analysis, 17:215{229, 1964.

II-41 Appendix

II-42 Exercise for kernel PCA

2 T ~ – || f ||H = c K X c

= 푛 , 푛 = 푛 , = . 2 푇 푓 퐻 � 푐푖Φ� 푋푖 � 푐푗Φ� 푋푗 � 푐푖푐푗 Φ� 푋푖 Φ� 푋푗 푐 퐾�푋푐 푖=1 푗=1 푖 T ~ 2 – Var [ f ,Φ(X ) ]= c K X c

1 Var , = 푛 푛 , 2

푓 Φ 푋 � � 푐푖Φ� 푋푖 Φ� 푋푗 푛1푗=1 푖=1 = 푛 푛 , 푛 ,

� � 푐푖Φ� 푋푖 Φ� 푋푗 � 푐ℎΦ� 푋ℎ Φ� 푋푗 푛1 푗=1 푖=1 ℎ1=1 = 푛 푛 푛 = 푇 2 � � 푐푖퐾�푖� � 푐ℎ퐾�ℎ푗 푐 퐾�푋 푐 II-43 푛 푗=1 푖 ℎℎ 푛 ∎ III. Support Vector Machines A Brief Introduction

Kenji Fukumizu

The Institute of Statistical Mathematics / Graduate University for Advanced Studies

September 6-7 Machine Learning Summer School 2012, Kyoto

1 Large margin classifier

– , , … , ( , ): training data : input ( -dimensional) 푋1 푌1 푋푛 푌푛 {±1}: binary teaching data, • 푋푖 푚

• 푌푖 ∈ – Linear classifier ( ) = +

푇 푤 = sgn 푓 푥 푤 푥 푏 y = fw(x) ( ) < 0 ℎ 푥 푓푤 푥 fw x

We wish to make ( ) with the training data so that a new data can be correctly푓푤 푥 classified. ≧ 푥 fw(x) 0 III-2  Large margin criterion Assumption: the data is linearly separable. Among infinite number of separating hyperplanes, choose the one to give the largest margin.

– Margin = distance of two classes measured along the direction of .

– Support vector machine: 8 푤

Hyperplane to give 6 the largest margin. 4 푤 support – The classifier is the middle of 2 vector

the margin. 0

-2 – “Supporting points” on the two boundaries are -4 margin called support vectors. -6

-8 -8 -6 -4 -2 0 2 4 6 8 III-3 – ( , ) and ( , ) ( > 0) give the same classifier.

푤 푏 푐� 푐� 푐 – Fix the scale: min( T + ) = 1  w X i b if Yi = 1,  max(wT X + b) = −1 if Y = 1  i i Then −

Margin = 2

푤 [Exercise] Prove this.

III-4  Support vector machine (linear, hard margin) (Boser, Guyon, Vapnik 1992) Objective function: + 1 if = 1, max subject to , 푇 + 1 if = 1. 1 푤 푋푖 푏 ≥ 푌푖 푤 푏 푤 � 푇 푤 푋푖 푏 ≤ − 푌푖 −

SVM (hard margin) min , subject to ( + ) 1 ( ) 2 푇 푤 푏 푤 푌푖 푤 푋푖 푏 ≥ ∀푖 – Quadratic program (QP): • Minimization of a quadratic function with linear constraints. • Convex, no local optima (Vandenberghe’ lecture) • Many solvers are available (Chih-Jen Lin’s lecture)

III-5  Soft-margin SVM – “Linear separability” is too strong. Relax it. Hard margin Soft margin

( + ) 1 ( + ) 1 , 0 푇 푇 푌푖 푤 푋푖 푏 ≥ 푌푖 푤 푋푖 푏 ≥ − 휉푖 휉푖 ≥ SVM (soft margin) ( + ) 1 , min + subject to , , 푇 0 ( ) 2 푛 푖 푖 푖 푖 푌 푤 푋 푏 ≥ − 휉 푤 푏 휉 푤 퐶 ∑푖=1 휉 휉푖≥ ∀푖 – This is also QP. – The coefficient C must be given. Cross-validation is often used.

III-6 + = 1 + = 푇 푻 8푤 푥 푏 풊 풊 w 풘 푿 풃 ퟏ − 흃 6

4 + = 1 2 푇 푤 푥 푏 − 0

-2

-4

-6 + = + -8 -8 -6 -4 -2 0 2 4 6 8 푻 풘 푿풋 풃 −ퟏ 흃풋

III-7 SVM and regularization

– Soft-margin SVM is equivalent to the regularization problem:

min 푛 1 + + , 푇 2 푖 푖 푤 푏 � − 푌 푤 푋 푏 + regularization휆 푤 term 푖=1 loss

= max{0, } • loss function: Hinge loss + , = 1 푧 푧

ℓ 푓 푥 푦 − 푦� 푥 + • c.f. Ridge regression (squared error) min + + , 푧 푛 푇 2 2 푖=1 푖 푖 푤 푏 ∑ 푌 − 푤 푋 푏 휆 푤

[Exercise] Confirm the above equivalence. III-8 Kernelization: nonlinear SVM

– , , … , ( , ): training data : input on arbitrary space 푋1 푌1 푋푛 푌푛 {±1}: binary teaching data, • 푋푖 Ω

• 푌푖 ∈ – Kernelization: Positive definite kernel on (RKHS ), Feature vectors , … , 푘 Ω 퐻

Φ 푋1 Φ 푋푛 – Linear classifier on  nonlinear classifier on

= sgn퐻 , + Ω = sgn + , . ℎ 푥 푓 Φ 푥 푏 푓 푥 푏 푓 ∈ 퐻

III-9  Nonlinear SVM ( , + ) 1, min + 푛 subject to , 0 ( ) 2 푖 푖 푖 푌 〈푓 Φ 푋 〉 푏 ≥ 푓 푏 푓 퐶 � 휉 푖 or equivalently, 푖=1 휉 ≥ ∀푖

min 푛 1 ( + ) + , 2 푖 푖 + 퐻 푓 푏 � − 푌 푓 푋 푏 휆 푓 푖=1 By representer theorem, = . 푛 푗=1 푗 푗 Nonlinear SVM (soft margin)푓 ∑ : Gram푤 Φ 푋 matrix

( + ) 1, min + 푛 subject퐾 to , , 0 ( ) 푇 푌푖 퐾� 푖 푏 ≥ 푖 푤 푏 휉 푤 퐾� 퐶 � 휉 휉푖≥ ∀푖 • This is again QP.푖=1

III-10  Dual problem ( + ) 1, min + subject to , 0 ( ) 푇 푛 푖 푖 푖 푌 퐾� 푏 ≥ 푤 푏 푤 퐾� 퐶 ∑푖=1 휉 – By Lagrangian multiplier method, the dual problem휉푖≥ is ∀푖 SVM (dual problem) 0 , max , subject to ( ) = 0 푛 푛 푖 푖=1 푖 푖 푗=1 푖 푗 푖 푗 푖� ≤ 훼 ≤ 퐶 훼 ∑ 훼 − ∑ 훼 훼 푌 푌 퐾 푛 ∀푖 ∑푖=1 푌푖훼푖 • The dual problem is often preferred. • The classifier is expressed by

+ = 푛 , +

푓 ∗ 푥 푏∗ � 훼푖∗푌푖 퐾 푥 푋푖 푏∗ – Sparse expression푖=1 : Only the data with 0 < appear in the summation.  Support vectors. 푖∗ 훼 ≤ 퐶 III-11  KKT condition Theorem The solution of the primal and dual problem of SVM is given by the following equations:

(1) 1 + 0 ( ) [primal constraint] (2) 0 ( ∗ ) ∗ ∗ [primal constraint] − 푌푖 푓 푋푖 푏 − 휉푖 ≤ ∀ 푖 (3) 0∗ , ( ) [dual constraint] 휉푖 ≥ ∀ 푖 (4) (1 ∗ ( + ) ) = 0 ( ) [complementary] ≤ 훼푖 ≤ 퐶 ∀ 푖 (6) ∗ =∗ 0 ( )∗, ∗ [complementary] 훼푖 − 푌푖 푓 푋푖 푏 − 휉푖 ∀ 푖 (7) ∗ ∗ = 0, 휉푖 퐶 − 훼푖 ∀ 푖 (8) 푛 =∗ 0, 푛 ∗ ∑푗 =1 퐾푖�푤푗 − ∑푗 훼푗 푌푗퐾푖� 푛 ∗ ∑ 푗=1훼푗 푌푗

III-12  Sparse expression

( ) = + = , + : 휑∗ 푥 푓∗ 푥 푏∗ � 훼푖∗푌푖 퐾 푥 푋푖 푏∗ 푖 – Two types of support vectors.푋 support vectors

8 w 6

4 support vectors 0 < α < C 2 i ( ) 1 (Yi Xi = ) 0 휑 support vectors -2 αi = C -4 ( ) ≤ 1 (Yi Xi ) -6

-8 휑 -8 -6 -4 -2 0 2 4 6 8 III-13 Summary of SVM

– One of the kernel methods: • kernelization of linear large margin classifier. • Computation depends on Gram matrices of size .

– Quadratic program: 푛 • No local optimum. • Many solvers are available. • Further efficient optimization methods are available (e.g. SMO, Plat 1999)

– Sparse representation • The solution is written by a small number of support vectors.

– Regularization • The objective function can be regarded as regularization with hinge loss function. III-14

– NOT discussed on SVM in this lecture are • Many successful applications (see references) • Theoretical analysis on generalization. • Multi-class extension – Combination of binary classifiers (1-vs-1, 1-vs-rest) – Generalization of large margin criterion Crammer & Singer (2001), Mangasarian & Musicant (2001), Lee, Lin, & Wahba (2004), etc • Other extensions – Support vector regression (Vapnik 1995) – -SVM (Schölkopf et al 2000) – Structured-output (Collins & Duffty 2001, Taskar et al 2004, Altun 휈et al 2003, etc) – One-class SVM (Schökopf et al 2001) • Optimization – Solving primal problem • Implementation (Chih-Jen Lin’s lecture) III-15 References

Altun, Y., I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In Proc. 20th Intern. Conf. Machine Learning, 2003. Boser, B.E., I.M. Guyon, and V.N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Fifth Annual ACM Workshop on Computational Learning Theory, pages 144–152, Pittsburgh, PA, 1992. ACM Press. Crammer, K. and Y. Singer. On the algorithmic implementation of multiclass kernel- based vector machines. Journal of Machine Learning Research, 2:265–292, 2001. Collins, M. and N. Duffy. Convolution kernels for natural language. In Advances in Neural Information Processing Systems 14, pages 625–632. MIT Press, 2001. Mangasarian, O. L. and David R. Musicant. Lagrangian support vector machines. Journal of Machine Learning Research, 1:161–177, 2001 Lee, Y., Y. Lin, and G. Wahba. Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99: 67–81, 2004. Schölkopf, B., A. Smola, R.C. Williamson, and P.L. Bartlett. (2000) New support vector algorithms. Neural Computation, 12(5):1207–1245.

III-16 Schölkopf, B., J.C. Platt, J. Shawe-Taylor, R.C. Williamson, and A.J.Smola. (2001) Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471. Vapnik, V.N. The Nature of Statistical Learning Theory. Springer 1995. Platt, J. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185–208. MIT Press, 1999.

General references on SVM: – Schölkopf, B., A.J. Smola. Learning with Kernels. MIT Press 2001 – Cristianini, N., J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge Univ. Press. 2000 Books on Application domains: – Lee, S.-W., A. Verri (Eds.) Pattern Recognition with Support Vector Machines: First International Workshop, SVM 2002, Niagara Falls, Canada, August 10, 2002. Proceedings. Lecture Notes in Computer Science 2388, Springer, 2002. – Joachims, T. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Springer, 2002. – Schölkopf, B., K. Tsuda, J.-P. Vert (Eds.). Kernel Methods in Computational . MIT Press, 2004. Biology III-17

IV. Theoretical Backgrounds of Kernel Methods

Kenji Fukumizu

The Institute of Statistical Mathematics / Graduate University for Advanced Studies

September 6-7 Machine Learning Summer School 2012, Kyoto

1 -valued Positive definite kernel

Definition. 퐂 Ω: set. k : Ω × Ω → C is a positive definite kernel if for arbitrary 1, … , and , … , ,

푥 푥푛 ∈ Ω 푐1 푐푛 ∈ 퐂

푛 , 0. , 푖 푗 푖 푗 � 푐 푐� 푘 푥 푥 ≥ 푖 푗=1 Remark: From the above condition, the Gram matrix , is

푖 푗 necessarily Hermitian, i.e. , = ( , ). [Exercise]푘 푥 푥 푖� 푘 푦 푥 푘 푥 푦

IV-2 Operations that preserve positive definiteness

Proposition 4.1 If : × ( = 1,2, … ,) are positive definite kernels, then so are the following: 푖 1.푘 (positive푋 푋 → combination퐂 푖 ) + ( , 0). 2. (product) , , 푎푘1 푏푘2 푎 푏 ≥ 3. (limit) lim ( , ), assuming the limit exists. 푘1푘2 푘1 푥 푦 푘2 푥 푦 푖 푖→∞ 푘 푥 푦 Proof. 1 and 3 are trivial from the definition. For 2, it suffices to prove that Hadamard product (element-wise) of two positive semidefinite matrices is positive semidefinite.

Remark: The set of positive definite kernels on is a closed cone, where the topology is defined by the point-wise convergence. 푋 IV-3 Proposition 4.2 Let and be positive semidefinite Hermitian matrices. Then, Hadamard product = (element-wise product) is positive퐴 semidefinite퐵 . 퐾 퐴 ∗ 퐵 Proof) 0 0 0 Eigendecomposition of : = = 1 휆 ⋯ 푇 푗 i.e., 푇 푖 0 휆02 ⋯ 퐴 퐴 푈Λ푈� 푈푝 푈푝 = n λ i j ⋮ ⋱ ⋮ Aij ∑ =1 pU pU p (λ ≧ 0 by the positive semidefiniteness푛 ). p p ⋯ 휆 Then,

n n n c c K = c c λ U i U j B ∑i, j=1 i j ij ∑p=1 ∑i, j=1 i j p p p ij

= λ n i j + + λ n i j ≥ 0. 1(∑ , =1 ciU1 c jU1 Bij )  n (∑ , =1 ciUn c jUn Bij ) i j i j IV-4 ∎ Normalization

Proposition 4.3 Let be a positive definite kernel on , and : be an arbitrary function. Then, 푘 Ω 푓 Ω → 퐂 , , ( )

is positive definite.푘� 푥 푦 In≔ particular푓 푥 푘 푥,푦 푓 푦

( )

is a positive definite푓 kernel.푥 푓 푦

– Proof [Exercise] – Example. Normalization: for any positive definite kernel ( , ), , , = 푘 푥 푦 ( , ) ( , ) 푘 푥 푦 is positive definite푘� 푥 and푦 ( ) = 1 for any . 푘 푥 푥 푘 푦 푦 IV-5 Φ 푥 푥 ∈ Ω Proof of positive definiteness

– Euclidean inner product : easy (Prop. 1.1).

푇 – Polynomial kernel +푥 푦 ( 0): + = 푇 + 푑 + + , 0. 푥 푦 푐 푐 ≥ 푇 푑 푇 푑 푇 푑−1 Product and non-negative1 combination of p.d푑 . kernels.푖 푥 푦 푐 푥 푦 푎 푥 푦 ⋯ 푎 푎 ≥ – Gaussian RBF kernel exp : 2 푥−푦 2 exp = −/ 휎 / / 2 2 2 푇 2 2 2 푥 − 푦 − 푥 휎 푥 푦 휎 − 푦 휎 Note − 2 푒 푒 푒 휎 1 1 / = 1 + + + 푇 2 1! 2! 푥 푦 휎 푇 푇 2 is푒 positive definite (2Prop.푥 푦 4.1). Proposition4 푥 푦 4.3⋯ then completes the proof. 휎 휎

– Laplacian kernel is discussed later. IV-6 Shift-invariant kernel

– A positive definite kernel , on is called shift-invariant if the kernel is of the form , = 푚 . 푘 푥 푦 퐑 푘 푥 푦 휓 푥 − 푦 – Examples: Gaussian, Laplacian kernel

– Fourier kernel (C-valued positive definite kernel): for each ,

, = exp 1 = exp 1 exp 1 휔 푇 푇 푇 푘 퐹 푥 푦 − 휔 푥 − 푦 − 휔 푥 (Prop.− 4.3)휔 푦

– If , = is positive definite, the function is called positive definite. 푘 푥 푦 휓 푥 − 푦 휓

IV-7 Bochner’s theorem Theorem 4.4 (Bochner) Let be a continuous function on . Then, is ( -valued) positive definite if and only if there is푚 a finite non-negative Borel measure휓 on such that 퐑 휓 퐂 푚 Λ 퐑= exp 1 ( ) 푇 Bochner’s theorem휓 푧 characterizes� − 휔 all푧 the푑Λ continuous휔 shift-invariant positive definite kernels. exp 1 is the generator of the cone (see Prop. 4.1). 푇 푚 − 휔 푧 휔 ∈ 퐑 – is the inverse Fourier (or Fourier-Stieltjes) transform of . – Roughly speaking, the shift invariant functions are the class that Λhave non-negative Fourier transform. 휓

– Sufficiency is easy: , = | | . 푇 Necessity is difficult. −1휔 푧푖 2 ∑푖 푗 푖 푗 푖 푗 ∑푖 푖 푐 푐� 휓 푧 − 푧 ∫ 푐 푒 푑Λ IV휔-8 RKHS in frequency domain

Suppose (shift-invariant) kernel has a form

, = exp 1 푘 . > 0. 푇 휌 휔 Then, it is푘 known푥 푦 that� RKHS− 휔 is 푥given− 푦 by휌 휔 푑�

퐻푘 = , 2 < , 2 ̂ 푘 푓 휔 퐻 푓 ∈ 퐿 퐑 푑� � � 푑� ∞ 휌 휔 , = 푓̂ 휔 푔� 휔 where is 푓the푔 Fourier� transform푑� of : 휌 휔 (푓̂ ) = ( )exp 1 푓 . 1 푇 푚 푓̂ 휔 2휋 ∫ 푓 푥 − − 휔 푥 푑� – The RKHS norm is related to the smoothness of . IV-9 푓 퐻 푓  Gaussian kernel 1 , = exp , = exp 2 2 2 2 푥 − 푦 휎 휔 퐺 2 퐺 푚 푘 푥 푦= − , 휌 exp휔 <− 2휎 2 2휋2 2 2 2 휎 휔 푘퐺 ̂ 퐻 ,푓 ∈ =퐿 퐑 푑� � � 푓 휔 exp 푑� ∞ 2 2 2 푚 휎 휔 푓 푔 2휋 � 푓̂ 휔 푔� 휔 푑�  Laplacian kernel (on ) 1 , = exp | | , = 퐑 + 푘퐿 푥 푦 −훽 푥 − 푦 휌퐿 휔 2 2 = , + 2휋 휔 < 훽 2 2 2 2 푘퐿 퐻 푓 ∈, 퐿 =퐑 푑� � � 푓̂ 휔 휔+ 훽 푑 � ∞ 2 2 푓 푔 2휋 �푓̂ 휔 푔� 휔 휔 훽 푑� – Decay of for high-frequency is different for Gaussian and Laplacian. IV-10 푓 ∈ 퐻 RKHS by polynomial kernel

– Polynomial kernel on : , = + , 0, 퐑 푇 푑 푘푝 푥 푦 푥 푦 푐 푐 ≥ 푑 ∈ 퐍 , = + + + + . 1 2 푑 푑 푑 푑−1 푑−1 푑 2 푑−2 푑−1 푑 푘 푝 푥 푧0 푧0 푥 푐푧0 푥 푐 푧0 푥 ⋯ 푐 Span of these functions are polynomials of degree .

푑 Proposition 4.5 If 0, the RKHS is the space of polynomials of degree at most . 푐 ≠ 푑 [Proof: exercise. Hint. Find to satisfy , = as a solution to a linear equation.]푑 푖 푖=0 푖 푖 푑 푖 푏 ∑ 푏 푘 푥 푧 푖 ∑푖=0 푎 푥 IV-11 Sum and product

, , , : two RKHS’s and positive definite kernels on .

퐻1 Sum푘1 퐻 2 푘2 Ω RKHS for + : + = : , , = + } 푘1 푘2 퐻1 퐻=2min{푓 Ω → 푅+ ∃푓1 ∈ 퐻1 ∃=푓2 ∈+퐻2 푓, 푓1 ,푓2 }. 2 2 2 푓 푓1 퐻1 푓2 퐻2 ∣ 푓 푓1 푓2 푓1 ∈ 퐻1 푓2 ∈ 퐻2  Product RKHS for :

=tensor1 2 product as a vector space. 푘 푘 퐻1 =⊗ 퐻2 , } is dense in .

푛 푓 ∑푖=1 푓푖푔푖, 푓푖 ∈ 퐻1 푔푖 ∈ 퐻2= 퐻1 ⊗, 퐻2 , . 푛 1 1 푚 2 2 푛 푚 1 2 1 2 푖=1 푖 푖 푗=1 푗 푗 푖=1 푗=1 푖 푗 퐻1 푖 푗 퐻2 ∑ 푓 푔 ∑ 푓 푔 ∑ ∑ 푓 푓 푔 푔 IV-12 Summary of Section IV

– Positive definiteness of kernels are preserved by • Non-negative combinations, • Product • Point-wise limit • Normalization

– Bochner’s theorem: characterization of the continuous shift- invariance kernels on . 푚 퐑 – Explicit form of RKHS • RKHS with shift-invariance kernels has explicit expression in frequency domain. • Polynomial kerns gives RKHS of polynomials. • Sum and product can be given. IV-13 References

Aronszajn., N. Theory of reproducing kernels. Trans. American Mathematical Society, 68(3):337–404, 1950. Saitoh., S. Integral transforms, reproducing kernels, and their applications. Addison Wesley Longman, 1997.

IV-14 Solution to Exercises

 C-valued positive definiteness Using the definition for one point, we have , is real and non- negative for all . For any and , applying the definition with coefficient ( , 1) where , we have 푘 푥 푥 ,푥 + , 푥 + 푦 , + , 0. Since the right2푐 hand side푐 ∈is 퐂real, its complex conjugate also satisfies 푐 푘 푥, 푥 + 푐� 푥, 푦 + 푐̅푘 푦, 푥 + 푘 푦, 푦 ≥ 0. The difference2 of the left hand side of the above two inequalities is real, so that푐 푘 푥 푥 푐̅푘 푥 푦 푐푘 푦 푥 푘 푦 푦 ≥ , , ( , , ) is a real number. On the other hand, since must be pure imaginary 푐for̅ 푘 any푦 푥 complex− 푘 푥 푦 number− 푐 푘 푦, 푥 − 푘 푥 푦 , , = 0 훼 − 훼� 훼 holds for any . This implies , = ( , ). 푐̅ 푘 푦 푥 − 푘 푥 푦

푐 ∈ 퐂 푘 푦 푥 푘 푥 푦 IV-15

V. Nonparametric Inference with Positive Definite Kernels

Kenji Fukumizu

The Institute of Statistical Mathematics / Graduate University for Advanced Studies

September 6-7 Machine Learning Summer School 2012, Kyoto

1 Outline

1. Mean and Variance on RKHS

2. Statistical tests with kernels

3. Conditional probabilities and beyond.

V-2 Introduction

 “Kernel methods” for statistical inference – We have seen that kernelization of linear methods provides nonlinear methods, which capture ‘nonlinearity’ or ‘high-order moments’ of original data. e.g. nonlinear SVM, kernel PCA, kernel CCA, etc.

– This section discusses more basic statistics on RKHS: mean, covariance, and conditional covariance.  (X) = k( , X) X

 (original space) feature map H (RKHS)

mean, covariance, conditional covariance V-3 Mean and covariance on RKHS

V-4 Mean on RKHS

X: random variable taking value on a measurable space Ω k: measurable positive definite kernel on Ω. H: RKHS. Always assumes “bounded” kernel for simplicity: sup , ∞. ( X)  k( , X ): feature vector = random variable on RKHS.

– Definition. The kernel mean ∈of X on H is defined by

Φ ⋅, .

– Reproducing expectation: mX , f  Ef (X ) ∀ ∈

* Notation: depends on , also. But, we do not show it for simplicity. V-55 Covariance operator

, : random variable taking values on Ω,Ω, resp.

,, ,: RKHS given by kernels on Ω and Ω, resp.

Definition. Cross-covariance operator: YX : H X  HY ∗ ∗ Σ ≡Φ ⊗Φ Φ ⊗Φ

∗ denotes the linear functional ,⋅ : ↦ 〈, 〉

– Simply, covariance of feature vectors. T T c.f. Euclidean case VYX = E[YX ] – E[Y]E[X] : covariance matrix

 X(X) Y(Y) X Y  YX  X X Y Y HX HY V-6 – As an operator: Σ ⋅, ⋅, .

– Reproducing covariance:

g, YXf E[g(Y ) f (X )] E[g(Y )]E[ f (X )] ( Cov[ f (X ), g(Y )])

for all ∈,∈.

– Standard identification: ∗ ≅ : ,⋅, ↔ .

The operator is regarded as an element in ⊗, i.e., ∗ ∗ Σ ≡Φ ⊗Φ Φ ⊗Φ can be regarded as

Σ ≅, ⊗ ∈ ⊗

V-7 Characteristic kernel (Fukumizu, Bach, Jordan 2004, 2009; Sriperumbudur et al 2010) – Kernel mean can capture higher-order moments of the variable. Example X: R-valued random variablek: pos.def. kernel on R. Suppose k admits a Taylor series expansion on R.

2 k(u, x)  c0  c1(xu)  c2 (xu)  (ci > 0) e.g.) k(x,u)  exp(xu)

The kernel mean mX works as a moment generating function:

2 2 mX (u)  E[k(u, X )]  c0  c1E[X ]u  c2 E[X ]u 

1 d  m (u)  E[X  ] c du X  u0

V-8 P : family of all the probabilities on a measurable space (, B). H: RKHS on with a bounded measurable kernel k. mP: kernel mean on H for a variable with probability P P Definition. A bounded measureable positive definite k is called characteristic (w.r.t. P) if the mapping

P  H, P  mP is one-to-one.

– The kernel mean with a characteristic kernel uniquely determines a probability.

mP  mQ  P  Q i.e. ~ ∼ ∀ ∈ ⇔ .

V-9 – Analogy to “characteristic function” With Fourier kernel F(x, y)  exp 1 xT y

Ch.f.X (u)  E[F(X ,u)]. • The characteristic function uniquely determines a Borel probability on Rm.

• The kernel mean m X ( u )  E [ k ( u , X )] with a characteristic kernel uniquely determines a probability on (, B). Note:  may not be Euclidean.

– The characteristic RKHS must be large enough! Examples for Rm (proved later): Gaussian, Laplacian kernel. Polynomial kernels are not characteristic.

For conditions of characteristic kernels, see Sriperumbudur et al. JMLR 2010.

V-10 Statistical inference with kernel means

– Statistical inference: inference on some properties of probabilities. – With characteristic kernels, they can be cast into the inference on kernel means.

Two sample problem: ? ?

Independence test: ⊗? ⊗?

V-11 Empirical Estimation

 Empirical kernel mean – An advantage of RKHS approach is easy empirical estimation.

–X1,..., X n : i.i.d.  X1 ,,X n  : sample on RKHS

Empirical kernel mean: n n (n) 1 1 mˆ X  (X i )  k(, X i ) n i1 n i1

Theorem 5.1 (strong -consistency) Assume E[k(X , X )]  .

(n) mˆ X  mX  Op 1 n  (n  ).

V-12  Empirical covariance operator

, ,…,,: i.i.d. sample on Ω Ω.

An estimator of YX is defined by 1 n ˆ (n) ˆ ˆ * YX  kY (,Yi )  mY  k X (, X i )  mX n i1

Theorem 5.2

ˆ (n) YX  YX  Op 1 n  (n  ) HS

– Hilbert-Schmidt norm: same as Frobenius norm of a matrix, but often used for infinite dimensional spaces.

| | ∑∑ , , : orthonormal bases (sum of squares in matrix expression) V-13 Statistical test with kernels

V-14 Two‐sample problem

– Two sample homogeneity test Two i.i.d. samples are given; X ,..., X ~ P Y ,...,Y ~ Q. 1 n and 1 

Q: Are they sampled from the same distribution?

Null hypothesis H0: .

Alternative H1: .

– Practically important: we often wish to distinguish two things. • Are the experimental results of treatment and control significantly different? • Were the plays “Henry VI” and “Henry II” written by the same author?

– If the means of and are different, we may use it for test. If they are identical, we need to look at higher order information. V-15 – If mean and variance are the same, it is a very difficult problem. – Example: do they have the same distribution? ℓ 100 N(0,I) N(0,I) N(0,I) 4 4 4

3 3 3

2 2 2

1 1 1

0 0 0

-1 -1 -1

-2 -2 -2

-3 -3 -3

-4 -4 -4 -4 -3 -2 -1 0 1 2 3 4 -4 -3 -2 -1 0 1 2 3 4 -4 -3 -2 -1 0 1 2 3 4

4 4 4

3 3 3

2 2 2

1 1 1

0 0 0

-1 -1 -1

-2 -2 -2

-3 -3 -3

-4 -4 -4 -4 -3 -2 -1 0 1 2 3 4 -4 -3 -2 -1 0 1 2 3 4 -4 -3 -2 -1 0 1 2 3 4 N(0,I) Unif 0.5N(0, I)  0.5Unif V-16 Kernel method for two-sample problem (Gretton et al. NIPS 2007, 2010, JMLR2012).  Kernel approach

– Comparison of and  comparison of and .

 Maximum Mean Discrepancy – In population 2 ~ ~ MMD2  m  m  E[k(X , X )]  E[k(Y ,Y )]  2E[k(X ,Y )]. X Y H (,: independent copy of , )

m  m  sup f ,m  m  sup f (x)dP(x)  f (x)dQ(x) X Y H X Y  f :||f ||H 1 f :||f ||H 1 hence, MMD.

– With characteristic kernel, MMD = 0 if and only if .

V-17 – Test statistics: 2 Empirical estimator MMD emp

2 2 MMDemp  mˆ X  mˆ Y H 1 n 2 n  1   2 k(X i , X j )  k(X i ,Ya )  2 k(Ya ,Yb ) n i, j1 n ia11  a,b1

– Asymptotic distributions under H0 and H1 are known (see Appendix) • Null distribution: ℓ  infinite mixture of •Alternative: ℓ  normal distribution

– Approximation of null distribution • Approximation of the mixture coefficients. • Fitting it with Pearson curve by moment matching. • Bootstrap (Arcones & Gine 1992)

V-18 Experiments

Comparison of two databases.

Data set Attr. MMD-B WW KS Data size / Dim Neural I Same 96.5 97.0 95.0 Neural I: 4000 / 63 Different Neural II: 1000 / 100 0.0 0.0 10.0 Health: 25 / 12600 Neural II Same 94.6 95.0 94.5 Subtype: 25 / 2118 Different 3.3 0.8 31.8 Health Same 95.5 94.7 96.1 Different 1.0 2.8 44.0 Gaussian kernels Subtype Same 99.1 94.6 97.3 are used. Different 0.0 0.0 28.4 WW: Wald-Walfovitz test Percentage of accepting . KS: Kolmogorov-Smirnov test Significance level . -- Classical methods (see Appendix) (Gretton et al. JMLR 2012) V-19 Experiments for mixtures

0.02 0.03 NX = NY = 100 NX = NY = 200 0.025 0.015 0.02

0.015 0.01

0.01 0.005 0.005

0 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 c 0.014 N = N = 500 Average values of MMD 0.012 X Y over 100 samples 0.01 0.008 0,1 vs 0.006 Unif 1 0,1 0.004 0,1 vs 0,1 0.002 0 0 0.2 0.4 0.6 0.8 1 20 Independence test

 Independence –and are independent if and only if

• Probability density function: , . • Cumulative distribution function: , . • Characteristic function:

 Independence test

Given i.i.d. sample , ,…, , , are and independent?

– Null hypothesis H0: ⊗ (independent)

–Alternative H1: ⊗ (not independent)

V-21 Independent Dependent 3 3

2 2

1 1

0 0

-1 -1

-2 -2

-3 -3 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3

V-22 Independence with kernel

Theorem 5.3. (Fukumizu, Bach, Jordan, JMLR 2004)

If the product kernel is characteristic, then

X Y  YX  O

Recall Σ ≅ ⊗ ∈ ⊗. Comparison between (kernel mean of ) and ⊗ (kernel mean of ).

– Dependence measure: Hilbert-Schmidt independence criterion , ≔ Σ

, ⊗ [Exercise] ⊗

, , , ,

2 , , , : , . independent copy of V-23 Independence test with kernels (Gretton, Fukumizu, Teo, Song, Schölkopf, Smola. NIPS 2008)

– Test statistic: 1 HSIC, ≔ Σ Tr

,: centered Gram matrices Or equivalently, 1 n 2 n HSICemp (X ,Y )  2 kX (X i , X j )kY (Yi ,Y j )  3 kX (X i , X j )kY (Yi ,Yk ) n i, j1 n i, j,k1 1 n n  k (X , X ) k (Y ,Y ) 4  X i j  Y k  n i, j1 k,1

– This is a special case of MMD comparing and ⊗ with product kernel . – Asymptotic distributions are given similarly to MMD (Gretton et al 2009).

V-24 Comparison with existing method

 Distance covariance (Székely et al., 2007; Székely & Rizzo, 2009) – Distance covariance / distance correlation has gained popularity in statistical community as a dependence measure beyond Pearson correlation.

Definition. , : and ℓ-dimensional random vectors . Distance covariance: , ≔ 2

where , and , are i.i.d. ~ . Distance correlation: , , ≔ , ,

V-25  Relation between distance covariance and HSIC Proposition 5.4 (Sejdinovic, Gretton, Sriperumbudur, F. ICML2012) A kernel on Euclidean spaces , . is positive definite, and HSIC , , .

– Distance covariance is a specific instance of HSIC. – Positive definite kernel approach is more general in choosing kernels, and thus may perform better.

–Extension , ∑

is positive definite for 02. We can define , ≔ HSIC,

V-26 Experiments

Original distance covariance

2 HSIC with Ratio of accepting independence of accepting Ratio Gaussian kernel

Dependence 

V-27  Choice of kernel for statistical tests – Heuristics for Gaussian kernel: median , 1, … ,

– Using performance of statistical test: Type II error of the test statistics (Gretton et al NIPS 2012).

Challenging open questions.

V-28 Conditional probabilities and beyond

V-29 Conditional probability

– Conditional probabilities appear in many machine learning problems

• Regression / classification: direct inference of | or |.  already seen in Section II.

• Bayesian inference ′

• Conditional independence / dependence

– Graphical modeling X1 X2 X3 Conditional independent relations among variables. – Causality X4 X5 – Dimension reduction / feature extraction

V-30 Conditional kernel mean

 Conditional kernel mean Definition.

Φ ⋅,

– Simply, kernel mean of | given . – It determines the conditional probability with a characteristic kernel. – Again, inference problems on conditional probabilities can be solved as inference on conditional kernel means.

– But, how can we estimate it?

V-31 Covariance operator revisited

 (Uncentered) cross-covariance operator

, : random variables on Ω,Ω, ,: positive definite kernels on Ω,Ω, , , ∞.

Definition. Uncentered cross-covariance operator

∗ ≡Φ ⊗Φ ∶ →

∗ (Or ∈ ⊗ ≅ ⊗)

– Reproducing property:

, ∀ ∈ , ∈

, ∀, ∈

V-32 Conditional probability by regression

– Recall that for zero-mean Gaussian random variable , ,

. • The matrix is given by the solution to the least mean square ,

– For the feature vector Φ Φ . • The operator is given by the solution to the least mean square Φ Φ ,

is ill-posed in infinite dimensional cases, but regularized estimator can be justified. V-33 Estimator for conditional kernel mean

– Empirical estimation: given , ,…, , , Φ In Gram matrix expression, ∗

, ∗, ⋮ , ∗ ⋮ . , ∗, Proposition 5.5 (Consistency) , If is characteristic, ∈ ⊗ , and →0, → ∞ as →∞, then for every ⋅, → Φ in in probability. V-34 Conditional covariance  Review: Gaussian variables Conditional covariance matrix: | ≔

Fact: | Cov,| for any  Conditional cross-covariance operator

Definition: , , : random variables on Ω,Ω,Ω. ,,: positive definite kernel on Ω,Ω,Ω. Conditional cross-covariance operator | ≡

– Reproducing averaged conditional covariance

Proposition 5.6 If is characteristic, then for ∀ ∈ , ∈,

, | Cov , |

V-35 – An interpretation: Compare the conditional kernel means for , and |.

Φ ⊗Φ Φ |⊗Φ Dependence on is not easy to handle  Average it out.

Φ ⊗Φ Φ| ⊗ Φ ] ⋅ ⋅

– Empirical estimator: | ≡

V-36 Conditional independence

– Recall: for Gaussian random variable | ⇔ | .

– By average over , | does not imply conditional independence, which requires , |for each .

– Trick: consider | ≔ ,

where ,and the product kernel is used for .

Theorem 5.7 (Fukumizu et al. JMLR 2004)

Assume ,, are characteristic, then | ⇔ | .

– |, | can be similarly used. V-37 Applications of conditional mean and covariance

 Conditional independence test (Fukumizu et al. NIPS2007) – Squared HS-norm can be used for conditional | independence test.

– Unlike the independence test, the asymptotic null distribution is not available. Permutation test is needed.

– Background: The conditional independent test for continuous non- Gaussian variables is not easy, and a challenging problem.

V-38  Causal inference – Directional acyclic graph (DAG) is used for representing the causal structure among variables. – The structure can be learned by conditional independence tests. The above test can be applied (Sun, Janzing, Schölkopf, F. ICML2007).

X Y X Y and Z X Y | Z

 Feature extraction / dimension reduction for supervised learning  see next.

V-39 Dimension reduction and conditional independence  Dimension reduction for supervised learning

Input: X = (X1, ... , Xm), Output: Y (either continuous or discrete) Goal: find an effective dimension reduction space (EDR space) spanned by an m x d matrix B s.t. T pYXp(| ) (| YBX ) T T T YX| YB| T X where B X = (b1 X, ..., bd X) linear feature vector No further assumptions (parametric model) on cond. p.d.f. p.

 Conditional independence Y , B spans EDR space

X Y | BTX X U V V-40 Gradient‐based method (Samarov 1993; Hristache et al 2001) Average Derivative Estimation (ADE) – Assumptions: • Y is one dimensional • EDR space p(Y | X )  ~p(Y | BT X )

  ~ T  ~ E[Y | X  x]  yp(y | B x)dy  B y p(y | z) T dy x x   z zB x

– Gradient of the regression function lies in the EDR space at each x

 Bˆ = average or PCA of the gradients at many x.

V-41 Kernel Helps!

– Weakness of ADE: • Difficulty of estimating gradients in high dimensional space. ADE uses local . » Sensitive to bandwidth • May find only a subspace of the effective subspace. 2 e.g. Y ~ f (X1)  Z, Z ~ N(0, (X 2 ) ).

–Kernel method • Can handle conditional probability in regression form Φ

• Characterizes conditional independence X Y | BTX

V-42 Derivative with kernel

– Reproducing the derivative (e.g. Steinwart & Christmann, Chap. 4):

~ k X (, x) Assume k X ( x , x ) is differentiable and  H then x X

⋅, , for any ∈

– Combining with the estimation of conditional kernel mean, Eˆ[ (Y ) | X  x] Eˆ[ (Y ) | X  x] Mˆ (x)  Y , Y ij xi x j

ˆ ˆ 1 k X (  , x) ˆ ˆ 1 k X (  , x)  CYX (C XX  n I) ,CYX (C XX   n I) xi x j

The top eigenvectors of estimates the EDR space

V-43 Gradient‐based kernel dimension reduction (gKDR) (Fukumizu & Leng, NIPS 2012) –Method  •Compute n ~ 1 T 1 1 M n  k X (X i ) (GX  n n I ) GY (GX  n n I ) k X (X i ). n i1 T  k X (X1 , x) k X (X n , x)  k X (X i )   ,...,  G  k (X , X ) x x X X i j   x X i

~ ˆ • Compute top d eigenvectors of M n .  Estimator B.

– gKDR estimate the subspace to realize the conditional independence

– Choice of kernel: Cross-validation with some regressor/classifier, e.g. kNN method.

V-44 Experiment: ISOLET

– Speech signals of 26 alphabets – 617 dim. (continuous) – 6238 training data / 1559 test data Data from UCI repository. –Results DimDim 10 15 20 25 30 35 40 45 50 gKDRgKDR 14.43 7.50 5.00 4.75 - - - - - gKDR-vgKDR-v 16.87 7.57 4.75 4.30 3.85 3.85 3.59 3.53 3.08 CCACCA 13.09 8.66 6.54 6.09 - - - - - Classification errors for test data by SVM (%)

c.f. C4.5 + ECOC: 6.61% Neural Networks (best): 3.27%

V-45 Experiment: Amazon Commerce Reviews

– Author identification for Amazon commerce reviews. – Dim = 10000 (linguistic style: e.g. usage of digit, punctuation, words and sentences' length, and frequency of words, etc) –= #authors x 30 (Data from UCI repository.)

V-46 – Example of 2-dim plots for 3 authors

– gKDR (dim = #authors) vs correlation-based variable selection (dim=500/2000) #Authors gKDR gKDR Corr (500) Corr + 5NN + SVM + SVM (2000) + SVM 10 9.3 12.0 15.7 8.3 20 16.2 16.2 30.2 18.0 30 20.1 18.0 29.2 24.0 40 22.8 21.8 35.4 25.0 50 22.7 19.5 41.1 29.0 10-fold cross-validation errors (%) V-47 Bayesian inference with kernels (Fukumizu, Song, Gretton NIPS 2011)  Bayes’ rule , .  Kernel Bayes’ rule ℓ –  : kernel mean of prior, ∑ ⋅,

–| : kernel representation of relation between and , , ,…, , ~, i.i.d.

– Goal: compute kernel mean of posterior |. | ∑ ⋅,

| | Λ Λ Λ Λ Diag V-48 Bayesian inference using kernel Bayes' rule – NO PARAMETRIC MODELS, BUT SAMPLES!

– When is it useful? • Explicit form of cond. p.d.f. | or prior is unavailable, but sampling is easy. – Approximate Bayesian Computation (ABC), Process prior. • Cond. p.d.f. | is unknown, but sample from , is given in training phase. – Example: nonparametric HMM (shown later). • If both of | and are known, there are many good sampling methods, such as MCMC, SMC, etc. But, they may take long time. KBR uses matrix operations.

V-49 Application: nonparametric HMM

Model: , ∏ ∏ |

–Assume: X0 X1 X2 X3 XT … | and/or | is not known. But, data , is available in training phase. Y0 Y1 Y2 Y3 YT Examples: • Measurement of hidden states is expensive, • Hidden states are measured with time delay.

– Testing phase (filtering, for example): given observations ỹ_1, …, ỹ_t, estimate the hidden state x_t.

– Sequential filtering/prediction uses Bayes' rule ⟹ KBR can be applied, as shown below.
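For concreteness, the recursion that is carried out on kernel means is the standard filtering update, written here for reference in the notation of the model above:

$$p(x_t \mid \tilde{y}_1, \ldots, \tilde{y}_t) \;\propto\; p(\tilde{y}_t \mid x_t) \int q(x_t \mid x_{t-1})\, p(x_{t-1} \mid \tilde{y}_1, \ldots, \tilde{y}_{t-1})\, dx_{t-1}.$$

KBR performs the prediction step (the integral) and the Bayes update (multiplication by p(ỹ_t | x_t) and normalization) as matrix operations on kernel means estimated from the training sample, so neither conditional density needs to be known in closed form.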

V-50  Camera angles

– Hidden X_t: angles of a video camera located at a corner of a room.
– Observed Y_t: movie frame of the room + additive Gaussian noise.
– Data: 3600 downsampled frames of 20 × 20 RGB pixels (1200-dim.).
– The first 1800 frames are used for training and the second half for testing.

  Noise          KBR (Trace)    Kalman filter (Q)
  σ² = 10⁻⁴      0.15 ± 0.01    0.56 ± 0.02
  σ² = 10⁻³      0.21 ± 0.01    0.54 ± 0.02

  Average MSE for the camera angles (10 runs)

To represent the SO(3) model, Tr[AB⁻¹] is used for KBR (the "Trace" in the table), and the quaternion representation is used for the Kalman filter.

V-51 Summary of Part V

 Kernel mean embedding of probabilities
– The kernel mean gives a representation of a probability distribution.
– Inference on probabilities can be cast as inference on kernel means, e.g. two-sample tests and independence tests.

 Conditional probabilities – Conditional probabilities can be handled with kernel means and covariances • Conditional independence test • Graphical modeling • Causal inference • Dimension reduction • Bayesian inference

V-52 References

Fukumizu, K., Bach, F.R., and Jordan, M.I. (2004) Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research 5:73-99.
Fukumizu, K., F.R. Bach and M. Jordan. (2009) Kernel dimension reduction in regression. Annals of Statistics 37(4):1871-1905.
Fukumizu, K., L. Song, A. Gretton (2011) Kernel Bayes' Rule. Advances in Neural Information Processing Systems 24 (NIPS 2011), 1737-1745.
Fukumizu, K. and C. Leng. (2011) Gradient-based kernel dimension reduction for supervised learning. arXiv:1109.0455v1.
Gretton, A., K.M. Borgwardt, M. Rasch, B. Schölkopf, A.J. Smola (2007) A Kernel Method for the Two-Sample-Problem. Advances in Neural Information Processing Systems 19, 513-520.
Gretton, A., Z. Harchaoui, K. Fukumizu, B. Sriperumbudur (2010) A Fast, Consistent Kernel Two-Sample Test. Advances in Neural Information Processing Systems 22, 673-681.
Gretton, A., K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, A. Smola. (2008) A Kernel Statistical Test of Independence. Advances in Neural Information Processing Systems 20, 585-592.

V-53

Gretton, A., D. Sejdinovic, B. Sriperumbudur, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu. Optimal kernel choice for large-scale two-sample tests. NIPS 2012, to appear.
Hristache, M., A. Juditsky, J. Polzehl, and V. Spokoiny. (2001) Structure Adaptive Approach for Dimension Reduction. Annals of Statistics 29(6):1537-1566.
Samarov, A. (1993) Exploring regression structure using nonparametric functional estimation. Journal of the American Statistical Association 88:836-847.
Sejdinovic, D., A. Gretton, B. Sriperumbudur, K. Fukumizu. (2012) Hypothesis testing using pairwise distances and associated kernels. Proc. 29th International Conference on Machine Learning (ICML 2012).
Sriperumbudur, B.K., A. Gretton, K. Fukumizu, B. Schölkopf, G.R.G. Lanckriet. (2010) Hilbert Space Embeddings and Metrics on Probability Measures. Journal of Machine Learning Research 11:1517-1561.
Sun, X., D. Janzing, B. Schölkopf and K. Fukumizu. (2007) A kernel-based causal learning algorithm. Proc. 24th Annual International Conference on Machine Learning (ICML 2007), 855-862.

V-54

Székely, G.J., M.L. Rizzo, and N.K. Bakirov. (2007) Measuring and testing dependence by correlation of distances. Annals of Statistics 35(6):2769-2794.
Székely, G.J. and M.L. Rizzo. (2009) Brownian distance covariance. Annals of Applied Statistics 3(4):1236-1265.

V-55 Appendices

V-56 Statistical Test: quick introduction

 How should we set the threshold? Example: based on MMD, we wish to decide whether the two variables have the same distribution.

Simple-minded idea: set a small value like t = 0.001.
  MMD(X, Y) < t  ⟹  perhaps the same distribution
  MMD(X, Y) ≥ t  ⟹  different distributions

But, the threshold should depend on the properties of X and Y.

 Statistical hypothesis test
– A statistical way of deciding whether a hypothesis is true or not.
– The decision is based on a finite sample ⟹ we cannot be 100% certain.

V-57  Procedure of hypothesis test

• Null hypothesis H0 = hypothesis assumed to be true “X and Y have the same distribution”

• Prepare a test statistic TN

e.g. T_N = MMD²_emp

• Null distribution: Distribution of TN under H0

• Set the significance level α. Typically α = 0.05 or 0.01.

• Compute the critical region: choose the threshold t so that α = Pr(T_N > t under H0).

• Reject the null hypothesis if T_N > t: the probability that MMD²_emp > t under H0 is then very small.

• Otherwise, accept H0 (this is only a weak conclusion).

V-58 One-sided test

[Figure: p.d.f. of the null distribution of T_N. The area of the critical region beyond the threshold t is the significance level α (5%, 1%, etc.); the area beyond the observed T_N is the p-value, so T_N > t ⟺ p-value < α.]

- If H0 is the truth, the value of TN should follow the null distribution.

- If H1 is the truth, the value of T_N should be very large.
- Set the threshold with risk α.
- The threshold depends on the distribution of the data.
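To make the "threshold with risk α" step concrete, here is a small illustrative sketch (the names are hypothetical): given samples that approximate the null distribution of T_N, the threshold is the (1 − α) quantile, and the p-value is the fraction of null samples exceeding the observed statistic.

```python
import numpy as np

def threshold_and_pvalue(null_samples, t_obs, alpha=0.05):
    """Critical value at level alpha and one-sided p-value for an observed statistic."""
    null_samples = np.asarray(null_samples)
    t_crit = np.quantile(null_samples, 1 - alpha)   # boundary of the critical region
    p_value = np.mean(null_samples >= t_obs)        # one-sided p-value
    # Reject H0 if t_obs > t_crit (equivalently, p_value < alpha).
    return t_crit, p_value
```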

V-59  Type I and Type II error
– Type I error = false positive (here rejecting H0 counts as "positive")
– Type II error = false negative

                        TRUTH
                        H0                               Alternative
  TEST     Accept H0    true negative                    Type II error (false negative)
  RESULT   Reject H0    Type I error (false positive)    true positive

• The significance level controls the Type I error.
• Under a fixed Type I error, a good test statistic should give a small Type II error.

V-60 MMD: Asymptotic distribution

 Under H0
X_1, …, X_n ~ P and Y_1, …, Y_ℓ ~ Q, i.i.d. Let N = n + ℓ, and assume n/N → ρ_x and ℓ/N → 1 − ρ_x ∈ (0, 1) as N → ∞. Under the null hypothesis P = Q,
$$N\,\mathrm{MMD}^2_{\mathrm{emp}} \;\Rightarrow\; \sum_{i=1}^{\infty} \lambda_i Z_i^2 \qquad (N \to \infty),$$
where Z_1, Z_2, … are i.i.d. with law N(0, 1/(ρ_x(1−ρ_x))) (for the unbiased U-statistic version of MMD²_emp each term is centered at its mean), and λ_i are the eigenvalues of the integral operator on L²(P)

with the centered kernel
$$\tilde{k}(x, y) = k(x, y) - E[k(x, \tilde{X})] - E[k(\tilde{X}, y)] + E[k(\tilde{X}, \tilde{X}')],$$
where X̃, X̃′ are independent copies of X ~ P.

V-61  Under H1
Under the alternative P ≠ Q,

$$\sqrt{N}\,\big(\mathrm{MMD}^2_{\mathrm{emp}} - \mathrm{MMD}^2(P, Q)\big) \;\Rightarrow\; N(0, \sigma^2) \qquad (N \to \infty),$$
where the asymptotic variance σ² is determined by P, Q, and the kernel through (four times) the variance of the U-statistic kernel h((x, y), (x′, y′)) = k(x, x′) + k(y, y′) − k(x, y′) − k(x′, y).

– The asymptotic distributions are derived by the general theory of U-statistics (e.g. see van der Vaart 1998, Chapter 12). – Estimation of the null distribution:

• Estimation of the eigenvalues λ_i.
• Approximation by a Pearson curve with moment matching.
• Bootstrap MMD (Arcones & Giné 1992); a permutation-based sketch follows below.
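The resampling route can be sketched as follows (an illustration with a Gaussian kernel, not the slides' code): permuting the pooled sample gives approximate draws from the null distribution of the plug-in MMD²_emp, from which the threshold and p-value are computed as in the test procedure above.

```python
import numpy as np

def gram(A, B, sigma=1.0):
    """Gaussian Gram matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    """Plug-in (biased) estimate of MMD^2 = ||m_P - m_Q||^2."""
    n, m = len(X), len(Y)
    return (gram(X, X, sigma).sum() / n**2
            + gram(Y, Y, sigma).sum() / m**2
            - 2 * gram(X, Y, sigma).sum() / (n * m))

def mmd_permutation_test(X, Y, sigma=1.0, n_perm=500, alpha=0.05, seed=0):
    """Two-sample test: permute the pooled sample to approximate the null distribution."""
    rng = np.random.default_rng(seed)
    t_obs = mmd2(X, Y, sigma)
    Z, n = np.vstack([X, Y]), len(X)
    null = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(len(Z))
        null[b] = mmd2(Z[idx[:n]], Z[idx[n:]], sigma)
    p_value = np.mean(null >= t_obs)
    return t_obs, p_value, p_value < alpha   # reject H0 when p_value < alpha

# Example: samples with slightly different means should tend to be rejected.
rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(100, 2))
Y = rng.normal(0.5, 1.0, size=(120, 2))
print(mmd_permutation_test(X, Y, sigma=1.0))
```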

V-62 Conventional methods for the two-sample problem

 Kolmogorov-Smirnov (K-S) test for two samples (one-dimensional variables)
– Empirical distribution function:
$$F_N(t) = \frac{1}{N} \sum_{i=1}^{N} I(X_i \leq t)$$
– KS test statistic:
$$D_N = \sup_{t \in \mathbb{R}} \big| F_N^{(1)}(t) - F_N^{(2)}(t) \big|$$
  [Figure: the two empirical distribution functions F_N^{(1)}(t), F_N^{(2)}(t) and the maximal gap D_N.]
– The asymptotic null distribution is known (not shown here).
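A small NumPy sketch of the two-sample KS statistic D_N above (illustrative; in practice scipy.stats.ks_2samp also provides the p-value):

```python
import numpy as np

def ks_two_sample_statistic(x, y):
    """D_N = sup_t |F1(t) - F2(t)| for two one-dimensional samples."""
    x, y = np.sort(x), np.sort(y)
    grid = np.concatenate([x, y])                          # the sup is attained at sample points
    F1 = np.searchsorted(x, grid, side="right") / len(x)   # empirical CDF of the first sample
    F2 = np.searchsorted(y, grid, side="right") / len(y)   # empirical CDF of the second sample
    return np.max(np.abs(F1 - F2))

x = np.random.default_rng(0).normal(size=200)
y = np.random.default_rng(1).normal(loc=0.3, size=150)
print(ks_two_sample_statistic(x, y))
```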

V-63  Wald-Wolfowitz run test (one-dimensional samples)
– Combine the two samples and plot the points in ascending order.
– Label the points according to the original two groups.
– Count the number of "runs", i.e. consecutive sequences of the same label.
– Test statistic: with R = number of runs,
$$T_N = \frac{R - E[R]}{\sqrt{\mathrm{Var}[R]}} \;\Rightarrow\; N(0, 1).$$
  [Figure: example with R = 10 runs.]
– In the one-dimensional case, it is less powerful than the KS test.
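A short sketch of the run count and the normalized statistic T_N (illustrative; the mean and variance formulas for R under H0 are the standard runs-test moments, stated here as an assumption since the slide does not give them):

```python
import numpy as np

def runs_test_statistic(x, y):
    """Wald-Wolfowitz runs test statistic for two one-dimensional samples."""
    labels = np.concatenate([np.zeros(len(x)), np.ones(len(y))])
    s = labels[np.argsort(np.concatenate([x, y]))]    # labels in ascending order of the pooled sample
    R = 1 + int(np.sum(s[1:] != s[:-1]))              # number of runs
    n, m = len(x), len(y)
    mean_R = 1 + 2 * n * m / (n + m)                  # standard H0 moments of R
    var_R = 2 * n * m * (2 * n * m - n - m) / ((n + m) ** 2 * (n + m - 1))
    # Few runs (small R) indicate different distributions, so small T_N is evidence against H0.
    return (R - mean_R) / np.sqrt(var_R)              # approximately N(0, 1) under H0

x = np.random.default_rng(0).normal(size=100)
y = np.random.default_rng(1).normal(loc=0.5, size=100)
print(runs_test_statistic(x, y))
```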

 Multidimensional extension of the KS and WW tests
– A minimum spanning tree is used (Friedman & Rafsky 1979).

V-64