Learning Kernel Classifiers
Learning Kernel Classifiers: Theory and Algorithms
Ralf Herbrich

The MIT Press
Cambridge, Massachusetts
London, England

Adaptive Computation and Machine Learning
Thomas G. Dietterich, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors

Bioinformatics: The Machine Learning Approach, Pierre Baldi and Søren Brunak
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
Graphical Models for Machine Learning and Digital Communication, Brendan J. Frey
Learning in Graphical Models, Michael I. Jordan
Causation, Prediction, and Search, second edition, Peter Spirtes, Clark Glymour, and Richard Scheines
Principles of Data Mining, David Hand, Heikki Mannila, and Padhraic Smyth
Bioinformatics: The Machine Learning Approach, second edition, Pierre Baldi and Søren Brunak
Learning Kernel Classifiers: Theory and Algorithms, Ralf Herbrich
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Bernhard Schölkopf and Alexander J. Smola

© 2002 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

This book was set in Times Roman by the author using the LaTeX document preparation system and was printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data
Herbrich, Ralf.
Learning kernel classifiers : theory and algorithms / Ralf Herbrich.
p. cm. — (Adaptive computation and machine learning)
Includes bibliographical references and index.
ISBN 0-262-08306-X (hc. : alk. paper)
1. Machine learning. 2. Algorithms. I. Title. II. Series.
Q325.5 .H48 2001
006.3/1—dc21
2001044445

There are many branches of learning theory that have not yet been analyzed and that are important both for understanding the phenomenon of learning and for practical applications. They are waiting for their researchers.
—Vladimir Vapnik

Geometry is illuminating; probability theory is powerful.
—Pál Ruján

Contents

Series Foreword
Preface

1 Introduction
1.1 The Learning Problem and (Statistical) Inference
1.1.1 Supervised Learning
1.1.2 Unsupervised Learning
1.1.3 Reinforcement Learning
1.2 Learning Kernel Classifiers
1.3 The Purposes of Learning Theory

I LEARNING ALGORITHMS

2 Kernel Classifiers from a Machine Learning Perspective
2.1 The Basic Setting
2.2 Learning by Risk Minimization
2.2.1 The (Primal) Perceptron Algorithm
2.2.2 Regularized Risk Functionals
2.3 Kernels and Linear Classifiers
2.3.1 The Kernel Technique
2.3.2 Kernel Families
2.3.3 The Representer Theorem
2.4 Support Vector Classification Learning
2.4.1 Maximizing the Margin
2.4.2 Soft Margins—Learning with Training Error
2.4.3 Geometrical Viewpoints on Margin Maximization
2.4.4 The ν-Trick and Other Variants
2.5 Adaptive Margin Machines
2.5.1 Assessment of Learning Algorithms
2.5.2 Leave-One-Out Machines
2.5.3 Pitfalls of Minimizing a Leave-One-Out Bound
2.5.4 Adaptive Margin Machines
2.6 Bibliographical Remarks

3 Kernel Classifiers from a Bayesian Perspective
3.1 The Bayesian Framework
3.1.1 The Power of Conditioning on Data
3.2 Gaussian Processes
3.2.1 Bayesian Linear Regression
3.2.2 From Regression to Classification
3.3 The Relevance Vector Machine
3.4 Bayes Point Machines
3.4.1 Estimating the Bayes Point
3.5 Fisher Discriminants
3.6 Bibliographical Remarks

II LEARNING THEORY

4 Mathematical Models of Learning
4.1 Generative vs. Discriminative Models
4.2 PAC and VC Frameworks
4.2.1 Classical PAC and VC Analysis
4.2.2 Growth Function and VC Dimension
4.2.3 Structural Risk Minimization
4.3 The Luckiness Framework
4.4 PAC and VC Frameworks for Real-Valued Classifiers
4.4.1 VC Dimensions for Real-Valued Function Classes
4.4.2 The PAC Margin Bound
4.4.3 Robust Margin Bounds
4.5 Bibliographical Remarks

5 Bounds for Specific Algorithms
5.1 The PAC-Bayesian Framework
5.1.1 PAC-Bayesian Bounds for Bayesian Algorithms
5.1.2 A PAC-Bayesian Margin Bound
5.2 Compression Bounds
5.2.1 Compression Schemes and Generalization Error
5.2.2 On-line Learning and Compression Schemes
5.3 Algorithmic Stability Bounds
5.3.1 Algorithmic Stability for Regression
5.3.2 Algorithmic Stability for Classification
5.4 Bibliographical Remarks

III APPENDICES

A Theoretical Background and Basic Inequalities
A.1 Notation
A.2 Probability Theory
A.2.1 Some Results for Random Variables
A.2.2 Families of Probability Measures
A.3 Functional Analysis and Linear Algebra
A.3.1 Covering, Packing and Entropy Numbers
A.3.2 Matrix Algebra
A.4 Ill-Posed Problems
A.5 Basic Inequalities
A.5.1 General (In)equalities
A.5.2 Large Deviation Bounds

B Proofs and Derivations—Part I
B.1 Functions of Kernels
B.2 Efficient Computation of String Kernels
B.2.1 Efficient Computation of the Substring Kernel
B.2.2 Efficient Computation of the Subsequence Kernel
B.3 Representer Theorem
B.4 Convergence of the Perceptron
B.5 Convex Optimization Problems of Support Vector Machines
B.5.1 Hard Margin SVM
B.5.2 Linear Soft Margin Loss SVM
B.5.3 Quadratic Soft Margin Loss SVM
B.5.4 ν-Linear Margin Loss SVM
B.6 Leave-One-Out Bound for Kernel Classifiers
B.7 Laplace Approximation for Gaussian Processes
B.7.1 Maximization of $f_{T_{m+1}|X=x,Z^m=z}$
B.7.2 Computation of …
B.7.3 Stabilized Gaussian Process Classification
B.8 Relevance Vector Machines
B.8.1 Derivative of the Evidence w.r.t. $\theta$
B.8.2 Derivative of the Evidence w.r.t. $\sigma_t^2$
B.8.3 Update Algorithms for Maximizing the Evidence
B.8.4 Computing the Log-Evidence
B.8.5 Maximization of $f_{W|Z^m=z}$
B.9 A Derivation of the Operation $\oplus_\mu$
B.10 Fisher Linear Discriminant

C Proofs and Derivations—Part II
C.1 VC and PAC Generalization Error Bounds
C.1.1 Basic Lemmas
C.1.2 Proof of Theorem 4.7
C.2 Bound on the Growth Function
C.3 Luckiness Bound
C.4 Empirical VC Dimension Luckiness
C.5 Bound on the Fat Shattering Dimension
C.6 Margin Distribution Bound
C.7 The Quantifier Reversal Lemma
C.8 A PAC-Bayesian Margin Bound
C.8.1 Balls in Version Space
C.8.2 Volume Ratio Theorem
C.8.3 A Volume Ratio Bound
C.8.4 Bollmann's Lemma
C.9 Algorithmic Stability Bounds
C.9.1 Uniform Stability of Functions Minimizing a Regularized Risk
C.9.2 Algorithmic Stability Bounds

D Pseudocodes
D.1 Perceptron Algorithm
D.2 Support Vector and Adaptive Margin Machines
D.2.1 Standard Support Vector Machines
D.2.2 ν-Support Vector Machines
D.2.3 Adaptive Margin Machines
D.3 Gaussian Processes
D.4 Relevance Vector Machines
D.5 Fisher Discriminants
D.6 Bayes Point Machines

List of Symbols
References
Index

Series Foreword

One of the most exciting recent developments in machine learning is the discovery and elaboration of kernel methods for classification and regression. These algorithms combine three important ideas into a very successful whole. From mathematical programming, they exploit quadratic programming algorithms for convex optimization; from mathematical analysis, they borrow the idea of kernel representations; and from machine learning theory, they adopt the objective of finding the maximum-margin classifier. After the initial development of support vector machines, there has been an explosion of kernel-based methods. Ralf Herbrich's Learning Kernel Classifiers is an authoritative treatment of support vector machines and related kernel classification and regression methods. The book examines these methods both from an algorithmic perspective and from the point of view of learning theory. The book's extensive appendices provide pseudo-code for all of the algorithms and proofs for all of the theoretical results. The outcome is a volume that will be a valuable classroom textbook as well as a reference for researchers in this exciting area.

The goal of building systems that can adapt to their environment and learn from their experience has attracted researchers from many fields, including computer science, engineering, mathematics, physics, neuroscience, and cognitive science. Out of this research has come a wide variety of learning techniques that have the potential to transform many scientific and industrial fields. Recently, several research communities have begun to converge on a common set of issues surrounding supervised, unsupervised, and reinforcement learning problems. The MIT Press series on Adaptive Computation and Machine Learning seeks to unify the many diverse strands of machine learning research and to foster high-quality research and innovative applications.
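To make the foreword's "kernel representation" idea concrete, the following is a minimal sketch of a kernelized perceptron with a Gaussian (RBF) kernel, the simplest relative of the algorithms the book develops (the perceptron of Section 2.2.1 combined with the kernel technique of Section 2.3.1). All function names, the hyperparameter gamma, and the toy data are illustrative choices for this sketch, not taken from the book; the book's own formulations appear in Appendix D.

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=1.0):
    """Gaussian (RBF) kernel k(x, x') = exp(-gamma * ||x - x'||^2)."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def kernel_perceptron(X, y, kernel=rbf_kernel, epochs=10):
    """Train a perceptron in dual form.

    X : (m, n) array of training inputs
    y : (m,) array of labels in {-1, +1}
    Returns dual coefficients alpha; the learned classifier is
    f(x) = sum_i alpha_i * y_i * k(x_i, x).
    """
    m = len(X)
    alpha = np.zeros(m)
    # Precompute the Gram matrix K[i, j] = k(x_i, x_j).
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    for _ in range(epochs):
        for i in range(m):
            # Evaluate the current dual expansion at the i-th training point.
            f = np.sum(alpha * y * K[:, i])
            if y[i] * f <= 0:      # mistake: strengthen this point's coefficient
                alpha[i] += 1.0
    return alpha

# Toy usage: XOR-style labels that no linear classifier can separate,
# but which are separable in the RBF feature space.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])
alpha = kernel_perceptron(X, y)
predict = lambda x: np.sign(sum(a * yi * rbf_kernel(xi, x)
                                for a, yi, xi in zip(alpha, y, X)))
print([predict(x) for x in X])   # recovers [-1, 1, 1, -1]
```

Note that the inputs enter only through the kernel evaluations k(x_i, x_j), never through explicit feature vectors; this is the substitution that turns a linear algorithm into a kernel classifier.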