Efficient Learning from Faulty Data

Citation: Decatur, Scott Evan. 1995. Efficient Learning from Faulty Data. Harvard Computer Science Group Technical Report TR-30-95.

Citable link: http://nrs.harvard.edu/urn-3:HUL.InstRepos:26506447

Terms of Use: This article was downloaded from Harvard University's DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

Scott Evan Decatur
Center for Research in Computing Technology
Harvard University
Cambridge, Massachusetts

A thesis presented by Scott Evan Decatur to The Division of Applied Sciences in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the subject of Computer Science

Harvard University
Cambridge, Massachusetts
July

© by Scott Evan Decatur. All rights reserved.

Abstract

Learning systems are often provided with imperfect or noisy data. Therefore, researchers have formalized various models of learning with noisy data and have attempted to delineate the boundaries of learnability in these models. In this thesis, we describe a general framework for the construction of efficient learning algorithms in noise-tolerant variants of Valiant's PAC learning model. By applying this framework, we also obtain many new results for specific learning problems in various settings with faulty data.

The central tool used in this thesis is the specification of learning algorithms in Kearns' Statistical Query (SQ) learning model, in which statistics, as opposed to labelled examples, are requested by the learner. These SQ learning algorithms are then converted into PAC algorithms which tolerate various types of faulty data. We develop this framework in three major parts:

1. We design automatic compilations of SQ algorithms into PAC algorithms which tolerate various types of data errors. These results include improvements to Kearns' classification noise compilation, and the first such compilations for malicious errors, attribute noise, and new classes of "hybrid" noise composed of multiple noise types.
2. We prove nearly tight bounds on the required complexity of SQ algorithms. The upper bounds are based on a constructive technique which allows one to achieve this complexity even when it is not initially achieved by a given SQ algorithm.
3. We define and employ an improved model of SQ learning which yields noise-tolerant PAC algorithms that are more efficient than those derived from standard SQ algorithms.

Together, these results provide a unified and intuitive framework for noise-tolerant learning that allows the algorithm designer to achieve efficient, and often optimal, fault-tolerant learning.

To the memory of my father, Martin Decatur

Table of Contents

Introduction
Outline of the Thesis
Models and Background
PAC Learning from Examples
Statistical Query Learning
Weak and Strong Learning
Enhanced Statistical Query Learning and Noise-Free Simulation
New Statistical Query Models
Statistical Queries with Relative Error Estimates
Probabilistic and Real-Valued Statistical Queries
Estimating Query Probabilities using Examples
Ensuring Individual Convergence for Specified Queries
Ensuring Uniform Convergence for Classes of Queries
Statistical Query Algorithms for Specific Learning Problems
Bounds on Statistical Query Learning
Upper Bounds for Statistical Query Learning
Boosting by Majority in the PAC Model
Boosting Statistical Queries by Majority
General Upper Bounds on Learning in the SQ Model
A Specific Lower Bound for Learning in the SQ Model
Learning with Classification Noise
Introduction
Improved Classification Noise Learning from Additive SQ
A New Derivation for P
Sensitivity Analysis
Guessing the Noise Rate
Testing Hypotheses
Combined Improvements for Additive SQ Simulation
Classification Noise Learning from Relative SQ
General Upper Bounds for Classification Noise Learning
Classification Noise Sample Complexity Lower Bounds
The Lower Bound
The Combined Lower Bound
Optimality of the General Lower Bound
Learning without a Known Bound on the Classification Noise Rate
Learning with Malicious Errors
Introduction
Statistical Queries for Malicious Error Tolerance
Additive SQ Simulation
Relative SQ Simulation
Efficiency Achieved by Statistical Query Simulation
General Bounds
Applications for Specific Learning Problems
Learning with Attribute Noise and Missing Attributes
Introduction
Learning with Attribute Noise
The View of a Statistical Query Algorithm
Statistical Queries for Attribute Noise Tolerance
Learning with Imperfect Knowledge of the Noise Rate
Restricted View Statistical Query Algorithms
Learning with Missing Attributes
Sample Complexity Lower Bounds
Learning with Faulty Distributions of Examples
Introduction
Learning with Distribution Noise
Distribution Error
Dynamic Distribution Error
Variable Noise Rates
Distribution Shift
Distribution Restricted Learning Algorithms
Learning in Hybrid Noise Models
Introduction
Classification Noise and Malicious Errors
Hypothesis Testing
Learning by Standard Techniques
Learning by Statistical Queries
Limits of CAM Learnability
Classification Noise and Attribute Noise
Other Hybrid Models
Relative Difficulty of Learning with Different Faults
Open Problems
A The Complexity of Query Spaces from Hypothesis Boosting
A The Finite Query Space Complexity of Boosting
A The Size of the Query Space of Scheme and Scheme Boosting
A The Size of the Query Space of Hybrid Boosting
A The General Query Space Complexity of Boosting
A Preliminaries
A The VC-Dimension of the Query Space of Scheme and Scheme Boosting
A The VC-Dimension of the Query Space of Hybrid Boosting
B On Boosting Distribution-Restricted Weak Learning Algorithms
Bibliography

Acknowledgements

First, I would like to thank my advisor, Les Valiant. Les introduced me to the field of computational machine learning, piqued my interest in the subject, and gave me valuable feedback and guidance throughout my years at Harvard. It has been an honor to have had Les as my advisor.

I also wish to thank Jay Aslam, with whom I collaborated on much of the work in this thesis. Additionally, I wish to thank my other coauthor, Rosario Gennaro. Working with each of them has been productive as well as enjoyable.

An important part of my education has been the machine learning reading group at MIT organized by Ron Rivest. I thank Ron and all of the people who participated in the group for their contributions to my understanding of the field. I also want to thank both Ron and Michael Rabin for serving on my thesis committee and for their comments and suggestions on an earlier draft of this thesis.

My graduate career has also been enriched by the coauthors, colleagues, and friends that I have spent time with, including, but not limited to: Avrim Blum, Nader Bshouty, Stan Chen, Zhixiang Chen, Jon Christensen, Yoav Freund, Tom Hancock, Steve Homer, Christos Kaklamanis, Mike Kearns, Roni Khardon, Dan Roth, Rob Schapire, and Mark Smith.

I wish to thank my family, Marty, Phyllis, Debi, and Jon, for all of the love and support they have given me. I especially want to thank my parents for setting such great examples and for their commitment to always giving me the best possible opportunities.

Finally, I thank my wife, Amy, whom I met on the first day of graduate school. Since that time, we have shared in each other's lives and graduate careers, and I thank her for all of her love, friendship, and patience.

Financial support for this thesis has come from NSF grants CCR and CCR, as well as a National Defense Science and Engineering graduate fellowship from the Department of Defense.

Bibliographical Notes

Most of the results of this thesis have been published previously. The results contained in Chapters and, as well as some of Chapter, appear in a Harvard University Technical Report HUTR [Aslam and Decatur] and an extended abstract in COLT [Aslam and Decatur]. The remainder of Chapter and all of Chapter and Appendix B appear as an extended abstract in COLT [Decatur]. An extended abstract of the results in Chapter and Appendix A appears in FOCS [Aslam and Decatur]. Most of the results of Chapter appear as

