Introduction to Machine Learning 4. Perceptron and Kernels
Alex Smola Carnegie Mellon University
http://alex.smola.org/teaching/cmu2013-10-701 (10-701)

Outline
• Perceptron
  • Hebbian learning & biology
  • Algorithm
  • Convergence analysis
• Features and preprocessing
  • Nonlinear separation
  • Perceptron in feature space
• Kernels
  • Kernel trick
  • Properties
  • Examples

Perceptron
Frank Rosenblatt: early theories of the brain

Biology and Learning
• Basic idea
  • Good behavior should be rewarded, bad behavior punished (or not rewarded). This improves system fitness.
  • Killing a sabertooth tiger should be rewarded ...
  • Correlated events should be combined.
  • Pavlov's salivating dog.
• Training mechanisms
  • Behavioral modification of individuals (learning): successful behavior is rewarded (e.g. food).
  • Hard-coded behavior in the genes (instinct): the wrongly coded animal does not reproduce.

Neurons
• Soma (CPU): cell body, combines signals
• Dendrite (input bus): combines the inputs from several other nerve cells
• Synapse (interface): interface and parameter store between neurons
• Axon (cable): may be up to 1m long and transports the activation signal to neurons at different locations

Neurons
[Figure: a neuron with inputs x1, x2, x3, ..., xn, synaptic weights w1, ..., wn, and a single output]

  f(x) = ⟨w, x⟩ = Σ_i w_i x_i

Perceptron
• Weighted linear combination of inputs x1, ..., xn with synaptic weights w1, ..., wn
• Nonlinear decision function
• Linear offset (bias)

  f(x) = σ(⟨w, x⟩ + b)

• Linear separating hyperplanes (spam/ham, novel/typical, click/no click)
• Learning: estimating the parameters w and b

Perceptron
[Figure: separating ham from spam]

The Perceptron

initialize w = 0 and b = 0
repeat
  if y_i [⟨w, x_i⟩ + b] ≤ 0 then
    w ← w + y_i x_i and b ← b + y_i
  end if
until all classified correctly

• Nothing happens if classified correctly
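The update rule above can be sketched in Python (a minimal sketch; the toy data and epoch limit are illustrative choices, not from the slides):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Rosenblatt perceptron: update w and b only on mistakes."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi
                b += yi
                errors += 1
        if errors == 0:                  # all classified correctly
            break
    return w, b

# Linearly separable toy data
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = perceptron(X, y)
```

Note that, as in the pseudocode, a point exactly on the decision boundary (y_i f(x_i) = 0) counts as a mistake.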
• Weight vector is a linear combination: w = Σ_{i∈I} y_i x_i
• Classifier is a linear combination of inner products: f(x) = Σ_{i∈I} y_i ⟨x_i, x⟩ + b

Convergence Theorem
• If there exists some (w*, b*) with unit length and

  y_i [⟨x_i, w*⟩ + b*] ≥ ρ for all i

  then the perceptron converges to a linear separator after a number of steps bounded by

  ((b*)² + 1)(r² + 1) / ρ²  where ‖x_i‖ ≤ r

• Dimensionality independent
• Order independent (i.e. also worst case)
• Scales with 'difficulty' of problem

Proof, Part I
Starting point: we start from w_1 = 0 and b_1 = 0.

Step 1: bound on the increase of alignment. Denote by w_j the value of w at step j (analogously b_j).

Alignment: ⟨(w_j, b_j), (w*, b*)⟩

For an error on observation (x_i, y_i) we get

  ⟨(w_{j+1}, b_{j+1}), (w*, b*)⟩
    = ⟨(w_j, b_j) + y_i (x_i, 1), (w*, b*)⟩
    = ⟨(w_j, b_j), (w*, b*)⟩ + y_i ⟨(x_i, 1), (w*, b*)⟩
    ≥ ⟨(w_j, b_j), (w*, b*)⟩ + ρ
    ≥ jρ

Alignment increases with the number of errors.
Alexander J. Smola: An Introduction to Machine Learning with Kernels

Proof, Part II
Step 2: Cauchy-Schwarz for the dot product

  ⟨(w_{j+1}, b_{j+1}), (w*, b*)⟩ ≤ ‖(w_{j+1}, b_{j+1})‖ ‖(w*, b*)‖
    = √(1 + (b*)²) ‖(w_{j+1}, b_{j+1})‖

Step 3: upper bound on ‖(w_j, b_j)‖. If we make a mistake we have

  ‖(w_{j+1}, b_{j+1})‖² = ‖(w_j, b_j) + y_i (x_i, 1)‖²
    = ‖(w_j, b_j)‖² + 2 y_i ⟨(x_i, 1), (w_j, b_j)⟩ + ‖(x_i, 1)‖²
    ≤ ‖(w_j, b_j)‖² + ‖(x_i, 1)‖²
    ≤ j(R² + 1)

Step 4: combination of the first three steps

  jρ ≤ √(1 + (b*)²) ‖(w_{j+1}, b_{j+1})‖ ≤ √(j(R² + 1)((b*)² + 1))

Solving for j proves the theorem.

Consequences
• Only need to store errors. This gives a compression bound for the perceptron.
• Stochastic gradient descent on hinge loss

  l(x_i, y_i, w, b) = max(0, 1 − y_i [⟨w, x_i⟩ + b])

• Fails with noisy data
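A sketch of one stochastic (sub)gradient step on this hinge loss (the step size η and the example point are illustrative assumptions, not from the slides):

```python
import numpy as np

def hinge_loss(x, y, w, b):
    # l(x, y, w, b) = max(0, 1 - y [<w, x> + b])
    return max(0.0, 1.0 - y * (w @ x + b))

def sgd_step(x, y, w, b, eta=1.0):
    """One subgradient step; only margin violators change (w, b)."""
    if hinge_loss(x, y, w, b) > 0:
        w = w + eta * y * x   # negative subgradient of the loss w.r.t. w
        b = b + eta * y
    return w, b

w, b = np.zeros(2), 0.0
w, b = sgd_step(np.array([1.0, 2.0]), 1.0, w, b)  # this point violated the margin
```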
do NOT train your avatar with perceptrons (Black & White)

Hardness: margin vs. size
[Figure: small margin (hard) vs. large margin (easy)]
Concepts & version space
• Realizable concepts
  • Some function exists that can separate the data and is included in the concept space
  • For the perceptron: data is linearly separable
• Unrealizable concept
  • Data not separable
  • We don't have a suitable function class (often hard to distinguish)

Minimum error separation
• XOR - not linearly separable
• Nonlinear separation is trivial
• Caveat (Minsky & Papert): finding the minimum error linear separator is NP hard (this killed Neural Networks in the 70s).

Nonlinearity & Preprocessing

Nonlinear Features
• Regression: we got nonlinear functions by preprocessing
• Perceptron
  • Map data into feature space x → φ(x)
  • Solve the problem in this space
  • Query-replace ⟨x, x'⟩ by ⟨φ(x), φ(x')⟩ in the code
• Feature Perceptron
  • Solution lies in the span of the φ(x_i)

Quadratic Features
• Separating surfaces are circles, hyperbolae, parabolae

Constructing Features

Idea: construct features manually, e.g. for OCR we could use a (very naive) OCR system.
Example: spam filtering from raw email headers

Delivered-To: [email protected]
Received: by 10.216.47.73 with SMTP id s51cs361171web; Tue, 3 Jan 2012 14:17:53 -0800 (PST)
Received: by 10.213.17.145 with SMTP id s17mr2519891eba.147.1325629071725; Tue, 03 Jan 2012 14:17:51 -0800 (PST)
Return-Path:
--f46d043c7af4b07e8d04b5a7113a
• ... secret sauce ...
Content-Type: text/plain; charset=ISO-8859-1

More feature engineering
• Two interlocking spirals: transform the data into a radial and an angular part, (x1, x2) = (r sin φ, r cos φ)
• Handwritten Japanese character recognition
  • Break down the images into strokes and recognize them
  • Lookup based on stroke order
• Medical diagnosis
  • Physician's comments
  • Blood status / ECG / height / weight / temperature ...
  • Medical knowledge
• Preprocessing
  • Zero mean, unit variance to fix scale issues (e.g. weight vs. income)
  • Probability integral transform (inverse CDF) as an alternative

Perceptron on Features
argument: X := {x_1, ..., x_m} ⊂ X (data), Y := {y_1, ..., y_m} ⊂ {±1} (labels)
function (w, b) = Perceptron(X, Y, η)
  initialize w, b = 0
  repeat
    Pick (x_i, y_i) from data
    if y_i (⟨w, φ(x_i)⟩ + b) ≤ 0 then
      w ← w + y_i φ(x_i)
      b ← b + y_i
  until y_i (⟨w, φ(x_i)⟩ + b) > 0 for all i
end

Important detail
• Nothing happens if classified correctly
• Weight vector is a linear combination w = Σ_{j∈I} y_j φ(x_j)
• Classifier is a linear combination of inner products

  f(x) = Σ_{j∈I} y_j ⟨φ(x_j), φ(x)⟩ + b

Problems
• Problems
  • Need a domain expert (e.g. Chinese OCR)
  • Often expensive to compute
  • Difficult to transfer engineering knowledge
• Shotgun solution
  • Compute many features
  • Hope that this contains good ones
  • Do this efficiently

Kernels
Grace Wahba

Solving XOR
• (x1, x2) → (x1, x2, x1 x2)
• XOR not linearly separable
• Mapping into 3 dimensions makes it easily solvable

Quadratic Features
Quadratic features in ℝ²:

  φ(x) := (x1², √2 x1 x2, x2²)

Dot product:

  ⟨φ(x), φ(x')⟩ = ⟨(x1², √2 x1 x2, x2²), (x1'², √2 x1' x2', x2'²)⟩ = ⟨x, x'⟩²

Insight: the trick works for any polynomial of order d via ⟨x, x'⟩^d.
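This identity is easy to check numerically; a small sketch (the test vectors are arbitrary):

```python
import numpy as np

def phi(x):
    # Quadratic feature map on R^2: (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

x  = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

explicit = phi(x) @ phi(xp)   # dot product in feature space
implicit = (x @ xp)**2        # kernel trick: <x, x'>^2
assert np.isclose(explicit, implicit)
```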
Computational Efficiency
Problem: extracting features can sometimes be very costly. Example: second order features in 1000 dimensions already yield 500500 ≈ 5 · 10⁵ numbers. For higher order polynomial features it is much worse.

Solution: don't compute the features, try to compute dot products implicitly. For some features this works . . .

Definition: a kernel function k : X × X → ℝ is a symmetric function in its arguments for which the following property holds:

  k(x, x') = ⟨φ(x), φ(x')⟩ for some feature map φ.

If k(x, x') is much cheaper to compute than φ(x) . . .
Kernel Perceptron
The Kernel Perceptron

argument: X := {x_1, ..., x_m} ⊂ X (data), Y := {y_1, ..., y_m} ⊂ {±1} (labels)
function f = Perceptron(X, Y, η)
  initialize f = 0
  repeat
    Pick (x_i, y_i) from data
    if y_i f(x_i) ≤ 0 then
      f(·) ← f(·) + y_i k(x_i, ·) + y_i
  until y_i f(x_i) > 0 for all i
end

Important detail
• Nothing happens if classified correctly
• w = Σ_j y_j φ(x_j) and hence f(x) = Σ_j y_j k(x_j, x) + b
• Classifier is a linear combination of inner products

  f(x) = Σ_{i∈I} y_i ⟨φ(x_i), φ(x)⟩ + b = Σ_{i∈I} y_i k(x_i, x) + b

Polynomial Kernels in ℝⁿ

Idea: we want to extend k(x, x') = ⟨x, x'⟩² to

  k(x, x') = (⟨x, x'⟩ + c)^d  where c > 0 and d ∈ ℕ.

Prove that such a kernel corresponds to a dot product.

Proof strategy: simple and straightforward. Compute the explicit sum given by the kernel, i.e.

  k(x, x') = (⟨x, x'⟩ + c)^d = Σ_{i=0}^{d} (d choose i) ⟨x, x'⟩^i c^{d−i}

Individual terms ⟨x, x'⟩^i are dot products for some φ_i(x).
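The kernel perceptron can be sketched in Python; here it is run with the polynomial kernel (⟨x, x'⟩ + 1)² on the XOR data from earlier (the epoch limit and the ±1 data encoding are illustrative assumptions):

```python
import numpy as np

def kernel_perceptron(X, y, k, max_epochs=100):
    """f(x) = sum_i alpha_i y_i k(x_i, x) + b; update only on mistakes."""
    m = len(X)
    alpha, b = np.zeros(m), 0.0

    def f(x):
        return sum(alpha[i] * y[i] * k(X[i], x) for i in range(m)) + b

    for _ in range(max_epochs):
        errors = 0
        for i in range(m):
            if y[i] * f(X[i]) <= 0:      # mistake: add y_i k(x_i, .) + y_i
                alpha[i] += 1
                b += y[i]
                errors += 1
        if errors == 0:
            break
    return alpha, b, f

# XOR: not linearly separable, but separable with a quadratic kernel
k = lambda u, v: (u @ v + 1.0)**2
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0, -1.0])
alpha, b, f = kernel_perceptron(X, y, k)
```

Only the coefficients α_i of the mistakes and the bias are stored; the weight vector in feature space is never formed explicitly.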
Kernel Conditions

Are all k(x, x') good kernels?

Computability: we have to be able to compute k(x, x') efficiently (much cheaper than the dot products themselves).

"Nice and useful" functions: the features themselves have to be useful for the learning problem at hand. Quite often this means smooth functions.

Symmetry: obviously k(x, x') = k(x', x) due to the symmetry of the dot product ⟨φ(x), φ(x')⟩ = ⟨φ(x'), φ(x)⟩.

Dot product in feature space: is there always a φ such that k really is a dot product?
Mercer's Theorem

For any symmetric function k : X × X → ℝ which is square integrable in X × X and which satisfies

  ∫_{X×X} k(x, x') f(x) f(x') dx dx' ≥ 0 for all f ∈ L₂(X)

there exist φ_i : X → ℝ and numbers λ_i ≥ 0 where

  k(x, x') = Σ_i λ_i φ_i(x) φ_i(x') for all x, x' ∈ X.

Interpretation: the double integral is the continuous version of a vector-matrix-vector multiplication. For positive semidefinite matrices we have Σ_i Σ_j k(x_i, x_j) α_i α_j ≥ 0.

Properties of the Kernel

Distance in feature space: distance between points in feature space via

  d(x, x')² := ‖φ(x) − φ(x')‖²
    = ⟨φ(x), φ(x)⟩ − 2⟨φ(x), φ(x')⟩ + ⟨φ(x'), φ(x')⟩
    = k(x, x) + k(x', x') − 2k(x, x')

Kernel matrix: to compare observations we compute dot products, so we study the matrix K given by

  K_ij = ⟨φ(x_i), φ(x_j)⟩ = k(x_i, x_j)

where the x_i are the training patterns.

Similarity measure: the entries K_ij tell us the overlap between φ(x_i) and φ(x_j), so k(x_i, x_j) is a similarity measure.
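The distance identity can be exercised directly; a sketch with a Gaussian RBF kernel (the kernel choice and points are arbitrary):

```python
import numpy as np

k = lambda u, v: np.exp(-np.sum((u - v)**2))   # Gaussian RBF

def feature_dist2(u, v):
    # d(u, v)^2 = k(u, u) + k(v, v) - 2 k(u, v)
    return k(u, u) + k(v, v) - 2.0 * k(u, v)

u, v = np.array([0.0, 0.0]), np.array([1.0, 1.0])
d2 = feature_dist2(u, v)   # lies in [0, 2] for an RBF kernel, since k(u, u) = 1
```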
Properties of the Kernel Matrix
K is positive semidefinite

Claim: α⊤Kα ≥ 0 for all α ∈ ℝ^m and all kernel matrices K ∈ ℝ^{m×m}. Proof:

  Σ_{i,j} α_i α_j K_ij = Σ_{i,j} α_i α_j ⟨φ(x_i), φ(x_j)⟩
    = ⟨Σ_{i=1}^m α_i φ(x_i), Σ_{j=1}^m α_j φ(x_j)⟩ = ‖Σ_{i=1}^m α_i φ(x_i)‖² ≥ 0

Kernel expansion: if w is given by a linear combination of the φ(x_i) we get

  ⟨w, φ(x)⟩ = ⟨Σ_{i=1}^m α_i φ(x_i), φ(x)⟩ = Σ_{i=1}^m α_i k(x_i, x).
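A numeric check of the claim, sketched with a Gaussian RBF kernel on random points (both choices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                      # 20 random points in R^3
k = lambda u, v: np.exp(-np.sum((u - v)**2))      # Gaussian RBF kernel
K = np.array([[k(u, v) for v in X] for u in X])   # kernel (Gram) matrix

# alpha^T K alpha >= 0 for arbitrary alpha (up to floating point noise)
for _ in range(100):
    alpha = rng.normal(size=20)
    assert alpha @ K @ alpha >= -1e-9
```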
A Counterexample
A candidate for a kernel:

  k(x, x') = 1 if ‖x − x'‖ ≤ 1, 0 otherwise

This is symmetric and gives us some information about the proximity of points, yet it is not a proper kernel . . .

Kernel matrix: we use three points, x_1 = 1, x_2 = 2, x_3 = 3, and compute the resulting "kernel matrix" K. This yields

  K = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]

with eigenvalues (√2 + 1), 1, and (1 − √2) as eigensystem. Hence k is not a kernel.
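The eigenvalue computation is easy to reproduce (a short sketch):

```python
import numpy as np

# "Kernel matrix" of x1 = 1, x2 = 2, x3 = 3 under
# k(x, x') = 1 if |x - x'| <= 1, else 0
K = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])

eig = np.linalg.eigvalsh(K)   # ascending: 1 - sqrt(2), 1, 1 + sqrt(2)
assert eig[0] < 0             # a negative eigenvalue: K is not PSD, so k is no kernel
```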
Some Good Kernels
Examples of kernels k(x, x'):

  Linear              ⟨x, x'⟩
  Laplacian RBF       exp(−λ‖x − x'‖)
  Gaussian RBF        exp(−λ‖x − x'‖²)
  Polynomial          (⟨x, x'⟩ + c)^d,  c ≥ 0, d ∈ ℕ
  B-Spline            B_{2n+1}(x − x')
  Cond. Expectation   E_c[p(x|c) p(x'|c)]

Simple trick for checking Mercer's condition: compute the Fourier transform of the kernel and check that it is nonnegative.
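The kernels in the table can be written down directly (a sketch; the scale λ, offset c, and degree d are free parameters chosen here for illustration):

```python
import numpy as np

def linear(u, v):
    return u @ v

def laplacian_rbf(u, v, lam=1.0):
    return np.exp(-lam * np.linalg.norm(u - v))

def gaussian_rbf(u, v, lam=1.0):
    return np.exp(-lam * np.sum((u - v)**2))

def polynomial(u, v, c=1.0, d=3):
    return (u @ v + c)**d

u, v = np.array([1.0, 0.0]), np.array([0.0, 1.0])
vals = [linear(u, v), laplacian_rbf(u, v), gaussian_rbf(u, v), polynomial(u, v)]
```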
Linear Kernel

Laplacian Kernel

Gaussian Kernel

Polynomial Kernel (order 3)

B3-Spline Kernel

Summary
• Perceptron
  • Hebbian learning & biology
  • Algorithm
  • Convergence analysis
• Features and preprocessing
  • Nonlinear separation
  • Perceptron in feature space
• Kernels
  • Kernel trick
  • Properties
  • Examples