Introduction to Machine Learning
Lecture 9: Kernel Methods

Mehryar Mohri
Courant Institute and Google Research
[email protected]

Motivation
• Non-linear decision boundaries.
• Efficient computation of inner products in high dimension.
• Flexible selection of more complex features.

This Lecture
• Definitions
• SVMs with kernels
• Closure properties
• Sequence kernels

Non-Linear Separation
• Linear separation is impossible in most problems.
• Non-linear mapping from the input space to a high-dimensional feature space: Φ : X → F.
• Generalization ability: independent of dim(F), depends only on the margin ρ and the sample size m.

Kernel Methods
Idea: define K : X × X → R, called a kernel, such that
Φ(x) · Φ(y) = K(x, y).
K is often interpreted as a similarity measure.
Benefits:
• Efficiency: K is often more efficient to compute than Φ and the dot product.
• Flexibility: K can be chosen arbitrarily so long as the existence of Φ is guaranteed (symmetry and positive definiteness condition).

PDS Condition
Definition: a kernel K : X × X → R is positive definite symmetric (PDS) if for any {x_1, ..., x_m} ⊆ X, the matrix K = [K(x_i, x_j)]_{ij} ∈ R^{m×m} is symmetric positive semi-definite (SPSD).
K is SPSD if it is symmetric and one of the two equivalent conditions holds:
• its eigenvalues are non-negative;
• for any c ∈ R^{m×1},
  $c^\top K c = \sum_{i,j=1}^{m} c_i c_j K(x_i, x_j) \geq 0.$
Terminology: PDS for kernels, SPSD for kernel matrices (see (Berg et al., 1984)).

Example: Polynomial Kernels
Definition:
$\forall x, y \in \mathbb{R}^N, \quad K(x, y) = (x \cdot y + c)^d, \quad c > 0.$
Example: for N = 2 and d = 2,
$K(x, y) = (x_1 y_1 + x_2 y_2 + c)^2 = \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \\ \sqrt{2c}\, x_1 \\ \sqrt{2c}\, x_2 \\ c \end{bmatrix} \cdot \begin{bmatrix} y_1^2 \\ y_2^2 \\ \sqrt{2}\, y_1 y_2 \\ \sqrt{2c}\, y_1 \\ \sqrt{2c}\, y_2 \\ c \end{bmatrix}.$

XOR Problem
Use the second-degree polynomial kernel with c = 1. The four XOR points map to
(1, 1)   ↦ (1, 1, +√2, +√2, +√2, 1)
(−1, −1) ↦ (1, 1, +√2, −√2, −√2, 1)
(−1, 1)  ↦ (1, 1, −√2, −√2, +√2, 1)
(1, −1)  ↦ (1, 1, −√2, +√2, −√2, 1)
[Figure: the four points plotted in the input coordinates (x_1, x_2) and in the feature coordinates (√2 x_1, √2 x_1 x_2).]
They are linearly non-separable in the input space, but linearly separable in the feature space by the hyperplane x_1 x_2 = 0.

Other Standard PDS Kernels
• Gaussian kernels:
  $K(x, y) = \exp\Big(-\frac{\|x - y\|^2}{2\sigma^2}\Big), \quad \sigma \neq 0.$
• Sigmoid kernels:
  $K(x, y) = \tanh\big(a (x \cdot y) + b\big), \quad a, b \geq 0.$

Reproducing Kernel Hilbert Space (Aronszajn, 1950)
Theorem: let K : X × X → R be a PDS kernel. Then there exists a Hilbert space H and a mapping Φ from X to H such that
∀x, y ∈ X, K(x, y) = Φ(x) · Φ(y).
Furthermore, the following reproducing property holds:
∀f ∈ H, ∀x ∈ X, f(x) = ⟨f, Φ(x)⟩ = ⟨f, K(x, ·)⟩.

Notes:
• H is called the reproducing kernel Hilbert space (RKHS) associated to K.
• A Hilbert space H such that there exists Φ : X → H with K(x, y) = Φ(x) · Φ(y) for all x, y ∈ X is also called a feature space associated to K; Φ is called a feature mapping.
• Feature spaces associated to K are in general not unique.

Consequence: SVMs with PDS Kernels (Boser, Guyon, and Vapnik, 1992)
Constrained optimization (each inner product Φ(x_i) · Φ(x_j) is replaced by K(x_i, x_j)):
$\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
$\text{subject to: } 0 \leq \alpha_i \leq C \,\wedge\, \sum_{i=1}^{m} \alpha_i y_i = 0, \quad i \in [1, m].$
Solution:
$h(x) = \mathrm{sgn}\Big(\sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b\Big),$
with $b = y_i - \sum_{j=1}^{m} \alpha_j y_j K(x_j, x_i)$ for any x_i with 0 < α_i < C.
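As a quick sanity check of the kernel trick used above, the sketch below (plain NumPy; the helper names poly_kernel and phi are mine, not from the lecture) verifies numerically that the degree-2 polynomial kernel with c = 1 coincides with the inner product of the explicit feature map from the polynomial-kernel slide, and that the √2 x_1 x_2 coordinate separates the XOR points.

```python
import numpy as np

# XOR points and labels (positive class: x1 * x2 > 0).
X = np.array([[1, 1], [-1, -1], [-1, 1], [1, -1]], dtype=float)
y = np.array([+1, +1, -1, -1])

c, d = 1.0, 2

def poly_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x.z + c)^d."""
    return (np.dot(x, z) + c) ** d

def phi(x):
    """Explicit feature map for N = 2, d = 2, matching the slide."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2,
                     c])

# The kernel equals the inner product of the explicit features ...
for x in X:
    for z in X:
        assert np.isclose(poly_kernel(x, z), phi(x) @ phi(z))

# ... and in feature space the third coordinate (sqrt(2) x1 x2)
# separates the two XOR classes via the hyperplane x1 x2 = 0.
print([phi(x)[2] for x in X])   # approx [1.41, 1.41, -1.41, -1.41]
print(y)                        # [ 1  1 -1 -1]
```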
SVMs with PDS Kernels
Constrained optimization, in matrix notation (∘ denotes the Hadamard product):
$\max_{\alpha} \; 2\, \mathbf{1}^\top \alpha - (\alpha \circ y)^\top K (\alpha \circ y)$
$\text{subject to: } 0 \leq \alpha \leq C \,\wedge\, \alpha^\top y = 0.$
Solution:
$h = \mathrm{sgn}\Big(\sum_{i=1}^{m} \alpha_i y_i K(x_i, \cdot) + b\Big),$
with $b = y_i - (\alpha \circ y)^\top K e_i$ for any x_i with 0 < α_i < C.

Generalization: Representer Theorem (Kimeldorf and Wahba, 1971)
Theorem: let K : X × X → R be a PDS kernel and H its corresponding RKHS. Then, for any non-decreasing function G : R → R and any L : R^m → R ∪ {+∞}, the optimization problem
$\operatorname*{argmin}_{h \in H} F(h) = \operatorname*{argmin}_{h \in H} G(\|h\|_H^2) + L\big(h(x_1), \ldots, h(x_m)\big)$
admits a solution of the form $h^* = \sum_{i=1}^{m} \alpha_i K(x_i, \cdot)$.
If G is further assumed to be increasing, then any solution has this form.

Proof:
• Let H_1 = span({K(x_i, ·) : i ∈ [1, m]}). Any h ∈ H admits the decomposition h = h_1 + h^⊥ according to H = H_1 ⊕ H_1^⊥.
• Since G is non-decreasing,
  $G(\|h_1\|^2) \leq G(\|h_1\|^2 + \|h^\perp\|^2) = G(\|h\|^2).$
• By the reproducing property, for all i ∈ [1, m],
  $h(x_i) = \langle h, K(x_i, \cdot) \rangle = \langle h_1, K(x_i, \cdot) \rangle = h_1(x_i).$
• Thus, $L\big(h(x_1), \ldots, h(x_m)\big) = L\big(h_1(x_1), \ldots, h_1(x_m)\big)$ and F(h_1) ≤ F(h).
• If G is increasing, then F(h_1) < F(h) whenever h^⊥ ≠ 0, so any solution of the optimization problem must be in H_1.

Kernel-Based Algorithms
PDS kernels are used to extend a variety of algorithms in classification and other areas:
• regression,
• ranking,
• dimensionality reduction,
• clustering.
But how do we define PDS kernels?

Closure Properties of PDS Kernels
Theorem: positive definite symmetric (PDS) kernels are closed under:
• sum,
• product,
• tensor product,
• pointwise limit,
• composition with a power series.

Closure Properties: Proof
• Closure under sum:
  $c^\top K c \geq 0 \,\wedge\, c^\top K' c \geq 0 \;\Rightarrow\; c^\top (K + K') c \geq 0.$
• Closure under product: writing K = M M^⊤,
  $\sum_{i,j=1}^{m} c_i c_j (K_{ij} K'_{ij}) = \sum_{i,j=1}^{m} c_i c_j \Big(\sum_{k=1}^{m} M_{ik} M_{jk}\Big) K'_{ij} = \sum_{k=1}^{m} \Big(\sum_{i,j=1}^{m} c_i M_{ik}\, c_j M_{jk}\, K'_{ij}\Big) = \sum_{k=1}^{m} z_k^\top K' z_k \geq 0,$
  with $z_k = \begin{bmatrix} c_1 M_{1k} \\ \vdots \\ c_m M_{mk} \end{bmatrix}$.
• Closure under tensor product:
  • definition: for all x_1, x_2, y_1, y_2 ∈ X,
    $(K_1 \otimes K_2)(x_1, y_1, x_2, y_2) = K_1(x_1, x_2)\, K_2(y_1, y_2).$
  • thus, K_1 ⊗ K_2 is PDS as the product of the two PDS kernels (x_1, y_1, x_2, y_2) ↦ K_1(x_1, x_2) and (x_1, y_1, x_2, y_2) ↦ K_2(y_1, y_2).
• Closure under pointwise limit: if for all x, y ∈ X,
  $\lim_{n \to \infty} K_n(x, y) = K(x, y),$
  then $(\forall n,\ c^\top K_n c \geq 0) \Rightarrow \lim_{n \to \infty} c^\top K_n c = c^\top K c \geq 0.$
• Closure under composition with a power series:
  • assumptions: K is a PDS kernel with |K(x, y)| < ρ for all x, y ∈ X, and $f(x) = \sum_{n=0}^{\infty} a_n x^n$, a_n ≥ 0, is a power series with radius of convergence ρ.
  • f ∘ K is a PDS kernel since K^n is PDS by closure under product, $\sum_{n=0}^{N} a_n K^n$ is PDS by closure under sum, and f ∘ K is their pointwise limit, PDS by closure under pointwise limit.
  • Example: for any PDS kernel K, exp(K) is PDS.
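These closure properties are easy to check numerically on kernel matrices. The following sketch (NumPy; the random Gram matrices and helper names are my own choices, not data from the lecture) prints the smallest eigenvalue of a sum, a Hadamard product, and an entrywise exponential of SPSD matrices, which stays non-negative up to round-off. It is an illustration, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_gram(m=6, dim=3):
    """A random SPSD kernel matrix K = M M^T (Gram matrix of random vectors)."""
    M = rng.normal(size=(m, dim))
    return M @ M.T

def min_eig(K):
    """Smallest eigenvalue; non-negative (up to round-off) iff K is SPSD."""
    return np.linalg.eigvalsh(K).min()

K1, K2 = random_gram(), random_gram()

print(min_eig(K1 + K2))      # sum:              >= 0
print(min_eig(K1 * K2))      # Hadamard product: >= 0 (closure under product)
print(min_eig(np.exp(K1)))   # entrywise exp:    >= 0 (power series, a_n = 1/n!)
```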
Sequence Kernels
• Definition: kernels defined over pairs of strings.
• Motivation: computational biology, text and speech classification.
• Idea: two sequences are related when they share some common substrings or subsequences.
• Example: the sum of the products of the counts of common substrings (a brute-force sketch of this kernel is given at the end of these notes).

Weighted Transducers
[Figure: a weighted transducer with states 0, 1, 2 and final state 3 (final weight 0.1), with transitions labeled input:output/weight, e.g. b:a/0.6, b:a/0.2, a:b/0.1, a:a/0.4, b:a/0.3, a:b/0.5.]
T(x, y) = sum of the weights of all accepting paths with input x and output y.
For the transducer above:
T(abb, baa) = 0.1 × 0.2 × 0.3 × 0.1 + 0.5 × 0.3 × 0.6 × 0.1.

Rational Kernels over Strings (Cortes et al., 2004)
Definition: a kernel K : Σ* × Σ* → R is rational if K = T for some weighted transducer T.
Definition: let T_1 : Σ* × Δ* → R and T_2 : Δ* × Ω* → R be two weighted transducers. Then the composition of T_1 and T_2 is defined for all x ∈ Σ*, y ∈ Ω* by
$(T_1 \circ T_2)(x, y) = \sum_{z \in \Delta^*} T_1(x, z)\, T_2(z, y).$
Definition: the inverse of a transducer T : Σ* × Δ* → R is the transducer T^{-1} : Δ* × Σ* → R obtained from T by swapping input and output labels.

Composition
Theorem: the composition of two weighted transducers is also a weighted transducer.
Proof: constructive, based on the composition algorithm.
• States are identified with pairs of states of T_1 and T_2.
• ε-free case: transitions are defined by
$E = \big\{ \big((q_1, q_1'),\, a,\, c,\, w_1 w_2,\, (q_2, q_2')\big) \;:\; (q_1, a, b, w_1, q_2) \in E_1,\ (q_1', b, c, w_2, q_2') \in E_2 \big\}.$
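As a rough illustration of the ε-free composition rule just stated, here is a minimal sketch that builds the composed transition set from two transition lists. The tuple encoding (source, input, output, weight, destination) and the name compose are my own assumptions; initial states and final weights are omitted for brevity.

```python
def compose(E1, E2):
    """Epsilon-free composition: pair up transitions of T1 and T2 whose
    output and input labels match; weights multiply, states become pairs."""
    return [((q1, q1p), a, c, w1 * w2, (q2, q2p))
            for (q1, a, b1, w1, q2) in E1
            for (q1p, b2, c, w2, q2p) in E2
            if b1 == b2]

# Tiny usage example with made-up transitions (src, in, out, weight, dst):
E1 = [(0, "a", "b", 0.5, 1)]
E2 = [(0, "b", "c", 0.3, 1)]
print(compose(E1, E2))   # [((0, 0), 'a', 'c', 0.15, (1, 1))]
```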

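Finally, as a concrete instance of the sequence kernels above (the "sum of the products of the counts of common substrings" example), here is a brute-force sketch; the helper names are mine, and the weighted-transducer machinery of rational kernels computes such kernels far more efficiently than this direct enumeration.

```python
from collections import Counter

def substring_counts(s):
    """Counts of all non-empty contiguous substrings of s."""
    return Counter(s[i:j]
                   for i in range(len(s))
                   for j in range(i + 1, len(s) + 1))

def common_substring_kernel(x, y):
    """K(x, y) = sum over common substrings u of count_x(u) * count_y(u).
    This is Phi(x) . Phi(y) for the substring-count feature map, hence PDS."""
    cx, cy = substring_counts(x), substring_counts(y)
    return sum(cx[u] * cy[u] for u in cx.keys() & cy.keys())

print(common_substring_kernel("abb", "abab"))
```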