Foundations of Machine Learning Adaptive Computation and Machine Learning
Total Page:16
File Type:pdf, Size:1020Kb
Foundations of Machine Learning Adaptive Computation and Machine Learning Thomas Dietterich, Editor Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors A complete list of books published in The Adaptive Computations and Machine Learning series appears at the back of this book. Foundations of Machine Learning Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar The MIT Press Cambridge, Massachusetts London, England c 2012 Massachusetts Institute of Technology All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher. MIT Press books may be purchased at special quantity discounts for business or sales promotional use. For information, please email special [email protected] or write to Special Sales Department, The MIT Press, 55 Hayward Street, Cam- bridge, MA 02142. This book was set in LATEX by the authors. Printed and bound in the United States of America. Library of Congress Cataloging-in-Publication Data Mohri, Mehryar. Foundations of machine learning / Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. p. cm. - (Adaptive computation and machine learning series) Includes bibliographical references and index. ISBN 978-0-262-01825-8 (hardcover : alk. paper) 1. Machine learning. 2. Computer algorithms. I. Rostamizadeh, Afshin. II. Talwalkar, Ameet. III. Title. Q325.5.M64 2012 006.3’1-dc23 2012007249 10987654321 Contents Preface xi 1 Introduction 1 1.1Applicationsandproblems........................ 1 1.2Definitionsandterminology....................... 3 1.3Cross-validation.............................. 5 1.4Learningscenarios............................ 7 1.5Outline.................................. 8 2 The PAC Learning Framework 11 2.1ThePAClearningmodel......................... 11 2.2Guaranteesforfinitehypothesissets—consistentcase........ 17 2.3Guaranteesforfinitehypothesissets—inconsistentcase....... 21 2.4Generalities................................ 24 2.4.1 Deterministicversusstochasticscenarios............ 24 2.4.2 Bayeserrorandnoise...................... 25 2.4.3 Estimationandapproximationerrors.............. 26 2.4.4 Modelselection.......................... 27 2.5Chapternotes............................... 28 2.6Exercises................................. 29 3 Rademacher Complexity and VC-Dimension 33 3.1Rademachercomplexity......................... 34 3.2Growthfunction............................. 38 3.3VC-dimension............................... 41 3.4Lowerbounds............................... 48 3.5Chapternotes............................... 54 3.6Exercises................................. 55 4 Support Vector Machines 63 4.1Linearclassification............................ 63 4.2SVMs—separablecase......................... 64 vi 4.2.1 Primaloptimizationproblem.................. 64 4.2.2 Supportvectors.......................... 66 4.2.3 Dualoptimizationproblem................... 67 4.2.4 Leave-one-outanalysis...................... 69 4.3SVMs—non-separablecase....................... 71 4.3.1 Primaloptimizationproblem.................. 72 4.3.2 Supportvectors.......................... 73 4.3.3 Dualoptimizationproblem................... 74 4.4Margintheory............................... 75 4.5Chapternotes............................... 83 4.6Exercises................................. 84 5 Kernel Methods 89 5.1Introduction................................ 89 5.2Positivedefinitesymmetrickernels................... 92 5.2.1 Definitions............................ 92 5.2.2 ReproducingkernelHilbertspace................ 94 5.2.3 Properties............................. 96 5.3Kernel-basedalgorithms......................... 100 5.3.1 SVMswithPDSkernels..................... 100 5.3.2 Representertheorem....................... 101 5.3.3 Learningguarantees....................... 102 5.4Negativedefinitesymmetrickernels...................103 5.5Sequencekernels............................. 106 5.5.1 Weightedtransducers...................... 106 5.5.2 Rationalkernels......................... 111 5.6Chapternotes...............................115 5.7Exercises................................. 116 6 Boosting 121 6.1Introduction................................121 6.2AdaBoost................................. 122 6.2.1 Boundontheempiricalerror.................. 124 6.2.2 Relationshipwithcoordinatedescent..............126 6.2.3 Relationshipwithlogisticregression.............. 129 6.2.4 Standarduseinpractice..................... 129 6.3Theoreticalresults............................ 130 6.3.1 VC-dimension-basedanalysis.................. 131 6.3.2 Margin-basedanalysis...................... 131 6.3.3 Marginmaximization...................... 136 6.3.4 Game-theoreticinterpretation..................137 vii 6.4Discussion................................. 140 6.5Chapternotes...............................141 6.6Exercises................................. 142 7 On-Line Learning 147 7.1Introduction................................147 7.2Predictionwithexpertadvice...................... 148 7.2.1 MistakeboundsandHalvingalgorithm............ 148 7.2.2 Weightedmajorityalgorithm.................. 150 7.2.3 Randomizedweightedmajorityalgorithm........... 152 7.2.4 Exponentialweightedaveragealgorithm............ 156 7.3Linearclassification............................159 7.3.1 Perceptronalgorithm.......................160 7.3.2 Winnowalgorithm........................ 168 7.4On-linetobatchconversion....................... 171 7.5Game-theoreticconnection........................174 7.6Chapternotes...............................175 7.7Exercises................................. 176 8 Multi-Class Classification 183 8.1Multi-classclassificationproblem.................... 183 8.2Generalizationbounds.......................... 185 8.3Uncombinedmulti-classalgorithms................... 191 8.3.1 Multi-classSVMs.........................191 8.3.2 Multi-classboostingalgorithms................. 192 8.3.3 Decisiontrees........................... 194 8.4 Aggregated multi-class algorithms ................... 198 8.4.1 One-versus-all...........................198 8.4.2 One-versus-one.......................... 199 8.4.3 Error-correctioncodes...................... 201 8.5Structuredpredictionalgorithms.................... 203 8.6Chapternotes...............................206 8.7Exercises................................. 207 9 Ranking 209 9.1Theproblemofranking......................... 209 9.2Generalizationbound.......................... 211 9.3RankingwithSVMs........................... 213 9.4RankBoost................................ 214 9.4.1 Boundontheempiricalerror.................. 216 9.4.2 Relationshipwithcoordinatedescent..............218 viii 9.4.3 Marginboundforensemblemethodsinranking ....... 220 9.5Bipartiteranking............................. 221 9.5.1 Boostinginbipartiteranking.................. 222 9.5.2 AreaundertheROCcurve................... 224 9.6Preference-basedsetting......................... 226 9.6.1 Second-stagerankingproblem..................227 9.6.2 Deterministicalgorithm..................... 229 9.6.3 Randomizedalgorithm......................230 9.6.4 Extensiontootherlossfunctions................ 231 9.7Discussion................................. 232 9.8Chapternotes...............................233 9.9Exercises................................. 234 10 Regression 237 10.1Theproblemofregression........................ 237 10.2Generalizationbounds.......................... 238 10.2.1Finitehypothesissets...................... 238 10.2.2Rademachercomplexitybounds.................239 10.2.3Pseudo-dimensionbounds.................... 241 10.3Regressionalgorithms.......................... 245 10.3.1Linearregression......................... 245 10.3.2Kernelridgeregression...................... 247 10.3.3Supportvectorregression.................... 252 10.3.4Lasso............................... 257 10.3.5Groupnormregressionalgorithms............... 260 10.3.6On-lineregressionalgorithms.................. 261 10.4Chapternotes...............................262 10.5Exercises................................. 263 11 Algorithmic Stability 267 11.1Definitions.................................267 11.2Stability-basedgeneralizationguarantee................ 268 11.3Stabilityofkernel-basedregularizationalgorithms.......... 270 11.3.1 Application to regression algorithms: SVR and KRR . 274 11.3.2Applicationtoclassificationalgorithms:SVMs........ 276 11.3.3Discussion............................. 276 11.4Chapternotes...............................277 11.5Exercises................................. 277 12 Dimensionality Reduction 281 12.1PrincipalComponentAnalysis..................... 282 ix 12.2KernelPrincipalComponentAnalysis(KPCA)............ 283 12.3KPCAandmanifoldlearning......................285 12.3.1Isomap.............................. 285 12.3.2Laplacianeigenmaps....................... 286 12.3.3Locallylinearembedding(LLE)................ 287 12.4Johnson-Lindenstrausslemma...................... 288 12.5Chapternotes...............................290 12.6Exercises................................. 290 13 Learning Automata and Languages 293 13.1Introduction................................293 13.2Finiteautomata............................. 294 13.3Efficientexactlearning.......................... 295 13.3.1Passivelearning......................... 296 13.3.2Learningwithqueries...................... 297 13.3.3Learningautomatawithqueries................ 298 13.4Identificationinthelimit.......................