Isabelle Guyon, Steve Gunn, Masoud Nikravesh, Lotfi A. Zadeh (Eds.)

Feature Extraction

Studies in Fuzziness and Soft Computing, Volume 207

Editor-in-chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: [email protected]

Further volumes of this series can be found on our homepage: springer.com

Vol. 191. Martin V. Butz
Rule-Based Evolutionary Online Learning Systems, 2006
ISBN 3-540-25379-3

Vol. 192. Jose A. Lozano, Pedro Larrañaga, Iñaki Inza, Endika Bengoetxea (Eds.)
Towards a New Evolutionary Computation, 2006
ISBN 3-540-29006-0

Vol. 193. Ingo Glöckner
Fuzzy Quantifiers: A Computational Theory, 2006
ISBN 3-540-29634-4

Vol. 194. Dawn E. Holmes, Lakhmi C. Jain (Eds.)
Innovations in Machine Learning, 2006
ISBN 3-540-30609-9

Vol. 195. Zongmin Ma
Fuzzy Database Modeling of Imprecise and Uncertain Engineering Information, 2006
ISBN 3-540-30675-7

Vol. 196. James J. Buckley
Fuzzy Probability and Statistics, 2006
ISBN 3-540-30841-5

Vol. 197. Enrique Herrera-Viedma, Gabriella Pasi, Fabio Crestani (Eds.)
Soft Computing in Web Information Retrieval, 2006
ISBN 3-540-31588-8

Vol. 198. Hung T. Nguyen, Berlin Wu
Fundamentals of Statistics with Fuzzy Data, 2006
ISBN 3-540-31695-7

Vol. 199. Zhong Li
Fuzzy Chaotic Systems, 2006
ISBN 3-540-33220-0

Vol. 200. Kai Michels, Frank Klawonn, Rudolf Kruse, Andreas Nürnberger
Fuzzy Control, 2006
ISBN 3-540-31765-1

Vol. 201. Cengiz Kahraman (Ed.)
Fuzzy Applications in Industrial Engineering, 2006
ISBN 3-540-33516-1

Vol. 202. Patrick Doherty, Witold Łukaszewicz, Andrzej Skowron, Andrzej Szałas
Knowledge Representation Techniques: A Rough Set Approach, 2006
ISBN 3-540-33518-8

Vol. 203. Gloria Bordogna, Giuseppe Psaila (Eds.)
Flexible Databases Supporting Imprecision and Uncertainty, 2006
ISBN 3-540-33288-X

Vol. 204. Zongmin Ma (Ed.)
Soft Computing in Ontologies and Semantic Web, 2006
ISBN 3-540-33472-6

Vol. 205. Mika Sato-Ilic, Lakhmi C. Jain
Innovations in Fuzzy Clustering, 2006
ISBN 3-540-34356-3

Vol. 206. Ashok Sengupta (Ed.)
Chaos, Nonlinearity, Complexity, 2006
ISBN 3-540-31756-2

Vol. 207. Isabelle Guyon, Steve Gunn, Masoud Nikravesh, Lotfi A. Zadeh (Eds.)
Feature Extraction, 2006
ISBN 3-540-35487-5

Isabelle Guyon, Steve Gunn, Masoud Nikravesh, Lotfi A. Zadeh (Eds.)

Feature Extraction

Foundations and Applications

Isabelle Guyon
Clopinet
955 Creston Road
94708 Berkeley, USA
E-mail: [email protected]

Steve Gunn
School of Electronics and Computer Science
University of Southampton
Highfield
SO17 1BJ Southampton
United Kingdom
E-mail: [email protected]

Masoud Nikravesh
Department of Electrical Engineering & Computer Science – EECS
University of California
94720 Berkeley, USA
E-mail: [email protected]

Lotfi A. Zadeh
Division of Computer Science
Electronics Research Lab.
University of California
Soda Hall 387
94720-1776 Berkeley, CA, USA
E-mail: [email protected]

Library of Congress Control Number: 2006928001

ISSN print edition: 1434-9922
ISSN electronic edition: 1860-0808
ISBN-10 3-540-35487-5 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-35487-1 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media
springer.com
© Springer-Verlag Berlin Heidelberg 2006
Printed in The Netherlands

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: by the authors and techbooks using a Springer LaTeX macro package
Cover design: Erich Kirchner, Heidelberg
Printed on acid-free paper
SPIN: 10966471 89/techbooks 543210

To our friends and foes

Foreword

Everyone loves a good competition. As I write this, two billion fans are eagerly anticipating the 2006 World Cup. Meanwhile, a fan base that is somewhat smaller (but presumably includes you, dear reader) is equally eager to read all about the results of the NIPS 2003 Challenge, contained herein. Fans of Radford Neal and Jianguo Zhang (or of Bayesian neural networks and Dirichlet diffusion trees) are gloating "I told you so" and looking for proof that their win was not a fluke. But the matter is by no means settled, and fans of SVMs are shouting "wait 'til next year!" You know this book is a bit more edgy than your standard academic treatise as soon as you see the dedication: "To our friends and foes."

Competition breeds improvement. Fifty years ago, the champion in 100m butterfly swimming was 22 percent slower than today's champion; the women's marathon champion from just 30 years ago was 26 percent slower. Who knows how much better our machine learning algorithms would be today if Turing in 1950 had proposed an effective competition rather than his elusive Test?

But what makes an effective competition? The field of Speech Recognition has had NIST-run competitions since 1988; error rates have been reduced by a factor of three or more, but the field has not yet had the impact expected of it. Information Retrieval has had its TREC competition since 1992; progress has been steady and refugees from the competition have played important roles in the hundred-billion-dollar search industry. Robotics has had the DARPA Grand Challenge for only two years, but in that time we have seen the results go from complete failure to resounding success (although it may have helped that the second year's course was somewhat easier than the first's).

I think there are four criteria that define effective technical competitions:

1. The task must be approachable. Non-experts should be able to enter, to see some results, and learn from their better-performing peers.

2. The scoring must be incremental. A pass-fail competition where everyone always fails (such as the Turing Test) makes for a boring game and discourages further competition. On this score the Loebner Prize, despite its faults, is a better competition than the original Turing Test. In one sense, everyone failed the DARPA Grand Challenge in the first year (because no entrant finished the race), but in another sense there were incremental scores: the distance each robot travelled, and the average speed achieved.

3. The results should be open. Participants and spectators alike should be able to learn the best practices of all participants. This means that each participant should describe their approaches in a written document, and that the data, auxiliary programs, and results should be publicly available.

4. The task should be relevant to real-world tasks. One of the problems with early competitions in speech recognition was that the emphasis on reducing word error rates did not necessarily lead to a strong speech dialog system—you could get almost all the words right and still have a bad dialog, and conversely you could miss many of the words and still recover. More recent competitions have done a better job of concentrating on tasks that really matter.

The Feature Selection Challenge meets the first three criteria easily. Seventy-five teams entered, so they must have found it approachable. The scoring did a good job of separating the top performers while keeping everyone on the scale. And the results are all available online, in this book, and in the accompanying CD. All the data and Matlab code are provided, so the Challenge is easily reproducible. The level of explication provided by the entrants in the chapters of this book is higher than in other similar competitions.

The fourth criterion, real-world relevance, is perhaps the hardest to achieve. Only time will tell whether the Feature Selection Challenge meets this one. In the meantime, this book sets a high standard as the public record of an interesting and effective competition.

Palo Alto, California
January 2006
Peter Norvig

Preface

Feature extraction addresses the problem of finding the most compact and informative set of features, to improve the efficiency of data storage and processing. Defining feature vectors remains the most common and convenient means of data representation for classification and regression problems. Data can then be stored in simple tables (lines representing "entries", "data points", "samples", or "patterns", and columns representing "features"). Each feature results from a quantitative or qualitative measurement; it is an "attribute" or a "variable". Modern feature extraction methodology is driven by the size of the data tables, which is ever increasing as data storage becomes more and more efficient.

After many years of parallel efforts, researchers in Soft-Computing, Statistics, Machine Learning, and Knowledge Discovery who are interested in predictive modeling are uniting their efforts to advance the problem of feature extraction. The recent advances made in both sensor technologies and machine learning techniques make it possible to design recognition systems capable of performing tasks that could not be performed in the past. Feature extraction lies at the center of these advances, with applications in the pharmaco-medical industry, the oil industry, industrial inspection and diagnosis systems, speech recognition, biotechnology, the Internet, targeted marketing, and many other emerging applications.

The present book is organized around the results of a benchmark that took place in 2003. Dozens of research groups competed on five large feature selection problems from various application domains: medical diagnosis, text processing, drug discovery, and handwriting recognition. The results of this effort pave the way to a new generation of methods capable of analyzing data tables with millions of lines and/or columns.

Part II of the book summarizes the results of the competition and gathers the papers describing the methods used by the top ranking participants. Following the competition, a NIPS workshop took place in December 2003 to discuss the outcomes of the competition and new avenues in feature extraction. The contributions providing new perspectives are found in Part III of the book. Part I provides all the necessary foundations to understand the recent advances made in Parts II and III. The book is complemented by appendices and by a web site. The appendices include fact sheets summarizing the methods used in the competition, tables of results of the competition, and a summary of basic concepts of statistics.

This book is directed to students, researchers, and engineers. It presents recent advances in the field and complements an earlier book (Liu and Motoda, 1998), which provides a thorough bibliography and presents methods of historical interest, but explores only small datasets and ends before the new era of kernel methods. Readers interested in the historical aspects of the problem are also directed to (Devijver and Kittler, 1982). A completely novice reader will find all the necessary elements to understand the material of the book presented in the tutorial chapters of Part I. The book can be used as teaching material for a graduate class in statistics and machine learning, with Part I supporting the lectures, Parts II and III providing readings, and the CD providing data for computer projects.

Zürich, Switzerland   Isabelle Guyon
Southampton, UK   Steve Gunn
Berkeley, California   Masoud Nikravesh and Lotfi A. Zadeh
November 2005

References

P.A. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice-Hall, 1982.
H. Liu and H. Motoda. Feature Extraction, Construction and Selection: A Data Mining Perspective. Kluwer Academic, 1998.

Contents

An Introduction to Feature Extraction
Isabelle Guyon, André Elisseeff ...... 1
1 Feature Extraction Basics ...... 1
2 What is New in Feature Extraction? ...... 7
3 Getting Started ...... 9
4 Advanced Topics and Open Problems ...... 16
5 Conclusion ...... 22
References ...... 23
A Forward Selection with Gram-Schmidt Orthogonalization ...... 24
B Justification of the Computational Complexity Estimates ...... 25

Part I Feature Extraction Fundamentals

1 Learning Machines
Norbert Jankowski, Krzysztof Grąbczewski ...... 29
1.1 Introduction ...... 29
1.2 The Learning Problem ...... 29
1.3 Learning Algorithms ...... 35
1.4 Some Remarks on Learning Algorithms ...... 57
References ...... 58

2 Assessment Methods
Gérard Dreyfus, Isabelle Guyon ...... 65
2.1 Introduction ...... 65
2.2 A Statistical View of Feature Selection: Hypothesis Tests and Random Probes ...... 66
2.3 A Machine Learning View of Feature Selection ...... 78
2.4 Conclusion ...... 86
References ...... 86

3 Filter Methods
Włodzisław Duch ...... 89
3.1 Introduction to Filter Methods for Feature Selection ...... 89
3.2 General Issues Related to Filters ...... 91
3.3 Correlation-Based Filters ...... 96
3.4 Relevance Indices Based on Distances Between Distributions ...... 99
3.5 Relevance Measures Based on Information Theory ...... 101
3.6 Decision Trees for Filtering ...... 104
3.7 Reliability and Bias of Relevance Indices ...... 106
3.8 Filters for Feature Selection ...... 108
3.9 Summary and Comparison ...... 110
3.10 Discussion and Conclusions ...... 113
References ...... 114

4 Search Strategies
Juha Reunanen ...... 119
4.1 Introduction ...... 119
4.2 Optimal Results ...... 119
4.3 Sequential Selection ...... 121
4.4 Extensions to Sequential Selection ...... 123
4.5 Stochastic Search ...... 129
4.6 On the Different Levels of Testing ...... 133
4.7 The Best Strategy? ...... 134
References ...... 135

5 Embedded Methods
Thomas Navin Lal, Olivier Chapelle, Jason Weston, André Elisseeff ...... 137
5.1 Introduction ...... 137
5.2 Forward-Backward Methods ...... 139
5.3 Optimization of Scaling Factors ...... 150
5.4 Sparsity Term ...... 156
5.5 Discussions and Conclusions ...... 161
References ...... 162

6 Information-Theoretic Methods
Kari Torkkola ...... 167
6.1 Introduction ...... 167
6.2 What is Relevance? ...... 167
6.3 Information Theory ...... 169
6.4 Information-Theoretic Criteria for Variable Selection ...... 172
6.5 MI for Feature Construction ...... 178
6.6 Information Theory in Learning Distance Metrics ...... 179
6.7 Information Bottleneck and Variants ...... 180
6.8 Discussion ...... 181
References ...... 182

7 Ensemble Learning
Eugene Tuv ...... 187
7.1 Introduction ...... 187
7.2 Overview of Ensemble Methods ...... 188
7.3 Variable Selection and Ranking with Tree Ensembles ...... 191
7.4 Bayesian Voting ...... 200
7.5 Discussions ...... 201
References ...... 203

8 Fuzzy Neural Networks
Madan M. Gupta, Noriyasu Homma, Zeng-Guang Hou ...... 205
8.1 Introduction ...... 205
8.2 Fuzzy Sets and Systems: An Overview ...... 207
8.3 Building Fuzzy Neurons Using Fuzzy Arithmetic and Fuzzy Logic Operations ...... 215
8.4 Hybrid Fuzzy Neural Networks (HFNNs) ...... 222
8.5 Concluding Remarks ...... 230
References ...... 231

Part II Feature Selection Challenge

9 Design and Analysis of the NIPS2003 Challenge
Isabelle Guyon, Steve Gunn, Asa Ben-Hur, Gideon Dror ...... 237
9.1 Introduction ...... 237
9.2 Benchmark Design ...... 239
9.3 Challenge Results ...... 245
9.4 Post-Challenge Verifications ...... 253
9.5 Conclusions and Future Work ...... 259
References ...... 260
A Details About the Fifty Feature Subset Study ...... 261

10 High Dimensional Classification with Bayesian Neural Networks and Dirichlet Diffusion Trees
Radford M. Neal, Jianguo Zhang ...... 265
10.1 Bayesian Models vs. Learning Machines ...... 266
10.2 Selecting Features with Univariate Tests ...... 269
10.3 Reducing Dimensionality Using PCA ...... 271
10.4 Bayesian Logistic Regression ...... 271
10.5 Bayesian Neural Network Models ...... 274
10.6 Dirichlet Diffusion Tree Models ...... 276
10.7 Methods and Results for the Challenge Data Sets ...... 280
10.8 Conclusions ...... 294
References ...... 295

11 Ensembles of Regularized Least Squares Classifiers for High-Dimensional Problems
Kari Torkkola and Eugene Tuv ...... 297
11.1 Introduction ...... 297
11.2 Regularized Least-Squares Classification (RLSC) ...... 298
11.3 Model Averaging and Regularization ...... 300
11.4 Variable Filtering with Tree-Based Ensembles ...... 301
11.5 Experiments with Challenge Data Sets ...... 302
11.6 Future Directions ...... 310
11.7 Conclusion ...... 312
References ...... 313

12 Combining SVMs with Various Feature Selection Strategies
Yi-Wei Chen, Chih-Jen Lin ...... 315
12.1 Introduction ...... 315
12.2 Support Vector Classification ...... 316
12.3 Feature Selection Strategies ...... 317
12.4 Experimental Results ...... 320
12.5 Competition Results ...... 321
12.6 Discussion and Conclusions ...... 322
References ...... 323

13 Feature Selection with Transductive Support Vector Machines
Zhili Wu, Chunhung Li ...... 325
13.1 Introduction ...... 325
13.2 SVMs and Transductive SVMs ...... 326
13.3 Feature Selection Methods Related with SVMs ...... 329
13.4 Transductive SVM-Related Feature Selection ...... 331
13.5 Experimentation ...... 332
13.6 Conclusion and Discussion ...... 339
13.7 Acknowledgement ...... 340
References ...... 340

14 Variable Selection using Correlation and Single Variable Classifier Methods: Applications
Amir Reza Saffari Azar Alamdari ...... 343
14.1 Introduction ...... 343
14.2 Introduction to Correlation and Single Variable Classifier Methods ...... 344
14.3 Ensemble Averaging ...... 347
14.4 Applications to NIPS 2003 Feature Selection Challenge ...... 350
14.5 Conclusion ...... 355
References ...... 357

15 Tree-Based Ensembles with Dynamic Soft Feature Selection
Alexander Borisov, Victor Eruhimov, Eugene Tuv ...... 359
15.1 Background ...... 359
15.2 Dynamic Feature Selection ...... 361
15.3 Experimental Results ...... 367
15.4 Summary ...... 373
References ...... 374

16 Sparse, Flexible and Efficient Modeling using L1 Regularization
Saharon Rosset, Ji Zhu ...... 375
16.1 Introduction ...... 375
16.2 The L1-Norm Penalty ...... 380
16.3 Piecewise Linear Solution Paths ...... 384
16.4 A Robust, Efficient and Adaptable Method for Classification ...... 388
16.5 Results on the NIPS-03 Challenge Datasets ...... 390
16.6 Conclusion ...... 392
References ...... 393

17 Margin Based Feature Selection and Infogain with Standard Classifiers
Ran Gilad-Bachrach, Amir Navot ...... 395
17.1 Methods ...... 395
17.2 Results ...... 398
17.3 Discussion ...... 399
References ...... 400

18 Bayesian Support Vector Machines for Feature Ranking and Selection
Wei Chu, S. Sathiya Keerthi, Chong Jin Ong, Zoubin Ghahramani ...... 403
18.1 Introduction ...... 403
18.2 Bayesian Framework ...... 404
18.3 Post-processing for Feature Selection ...... 410
18.4 Numerical Experiments ...... 414
18.5 Conclusion ...... 416
References ...... 416

19 Nonlinear Feature Selection with the Potential Support Vector Machine
Sepp Hochreiter, Klaus Obermayer ...... 419
19.1 Introduction ...... 419
19.2 The Potential Support Vector Machine ...... 420
19.3 P-SVM Discussion and Redundancy Control ...... 424
19.4 Nonlinear P-SVM Feature Selection ...... 427
19.5 Experiments ...... 429

19.6 Conclusion ...... 436
References ...... 436

20 Combining a Filter Method with SVMs
Thomas Navin Lal, Olivier Chapelle, Bernhard Schölkopf ...... 439
20.1 The Parameters σ and C of the SVM ...... 439
20.2 Feature Ranking ...... 440
20.3 Number of Features ...... 441
20.4 Summary ...... 443
References ...... 445

21 Feature Selection via Sensitivity Analysis with Direct Kernel PLS
Mark J. Embrechts, Robert A. Bress, Robert H. Kewley ...... 447
21.1 Introduction ...... 447
21.2 Partial Least Squares Regression (PLS) ...... 448
21.3 Regression Models Based on Direct Kernels ...... 450
21.4 Dealing with the Bias: Centering the Kernel ...... 452
21.5 Metrics for Assessing the Model Quality ...... 453
21.6 Data Conditioning and Preprocessing ...... 454
21.7 Sensitivity Analysis ...... 455
21.8 Heuristic Feature Selection Policies for the NIPS Feature Selection Challenge ...... 456
21.9 Benchmarks ...... 459
21.10 Conclusions ...... 460
References ...... 461

22 Information Gain, Correlation and Support Vector Machines
Danny Roobaert, Grigoris Karakoulas, Nitesh V. Chawla ...... 463
22.1 Introduction ...... 463
22.2 Description of Approach ...... 464
22.3 Final Results ...... 467
22.4 Alternative Approaches Pursued ...... 468
22.5 Discussion and Conclusion ...... 469
References ...... 470

23 Mining for Complex Models Comprising Feature Selection and Classification
Krzysztof Grąbczewski, Norbert Jankowski ...... 471
23.1 Introduction ...... 471
23.2 Fundamental Algorithms ...... 472
23.3 Fully Operational Complex Models ...... 481
23.4 Challenge Data Exploration ...... 483
23.5 Conclusions ...... 486
References ...... 487

24 Combining Information-Based Supervised and Unsupervised Feature Selection
Sang-Kyun Lee, Seung-Joon Yi, Byoung-Tak Zhang ...... 489
24.1 Introduction ...... 489
24.2 Methods ...... 490
24.3 Experiments ...... 494
24.4 Conclusions ...... 496
References ...... 498

25 An Enhanced Selective Naïve Bayes Method with Optimal Discretization
Marc Boullé ...... 499
25.1 Introduction ...... 499
25.2 The Enhanced Selective Naïve Bayes Method ...... 500
25.3 The MODL Discretization Method ...... 502
25.4 Results on the NIPS Challenge ...... 504
25.5 Conclusion ...... 506
References ...... 506

26 An Input Variable Importance Definition based on Empirical Data Probability Distribution
V. Lemaire, F. Clérot ...... 509
26.1 Introduction ...... 509
26.2 Analysis of an Input Variable Influence ...... 510
26.3 Application to Feature Subset Selection ...... 512
26.4 Results on the NIPS Feature Selection Challenge ...... 513
26.5 Conclusions ...... 516
References ...... 516

Part III New Perspectives in Feature Extraction

27 Spectral Dimensionality Reduction
Yoshua Bengio, Olivier Delalleau, Nicolas Le Roux, Jean-François Paiement, Pascal Vincent, Marie Ouimet ...... 519
27.1 Introduction ...... 519
27.2 Data-Dependent Kernels for Spectral Embedding Algorithms ...... 524
27.3 Kernel Eigenfunctions for Induction ...... 532
27.4 Learning Criterion for the Leading Eigenfunctions ...... 539
27.5 Experiments ...... 541
27.6 Conclusion ...... 544
References ...... 547

28 Constructing Orthogonal Latent Features for Arbitrary Loss
Michinari Momma, Kristin P. Bennett ...... 551
28.1 Introduction ...... 551
28.2 General Framework in BLF ...... 554
28.3 BLF with Linear Functions ...... 557
28.4 Convergence Properties of BLF ...... 561
28.5 PLS and BLF ...... 563
28.6 BLF for Arbitrary Loss ...... 564
28.7 Kernel BLF ...... 571
28.8 Computational Results ...... 572
28.9 Conclusion ...... 581
References ...... 582

29 Large Margin Principles for Feature Selection
Ran Gilad-Bachrach, Amir Navot, Naftali Tishby ...... 585
29.1 Introduction ...... 585
29.2 Margins ...... 586
29.3 Algorithms ...... 589
29.4 Theoretical Analysis ...... 592
29.5 Empirical Assessment ...... 593
29.6 Discussion and Further Research Directions ...... 602
References ...... 604
A Complementary Proofs ...... 604

30 Feature Extraction for Classification of Proteomic Mass Spectra: A Comparative Study
Ilya Levner, Vadim Bulitko, Guohui Lin ...... 607
30.1 Introduction ...... 607
30.2 Existing Feature Extraction and Classification Methods ...... 611
30.3 Experimental Results ...... 615
30.4 Conclusion ...... 622
30.5 Acknowledgements ...... 623
References ...... 623

31 Sequence Motifs: Highly Predictive Features of Protein Function
Asa Ben-Hur, Douglas Brutlag ...... 625
31.1 Introduction ...... 625
31.2 Enzyme Classification ...... 627
31.3 Methods ...... 628
31.4 Results ...... 636
31.5 Discussion ...... 642
31.6 Conclusion ...... 643
References ...... 643

Appendix A Elementary Statistics

Elementary Statistics
Gérard Dreyfus ...... 649
1 Basic Principles ...... 649
2 Estimating and Learning ...... 651
3 Some Additional Useful Probability Distributions ...... 654
4 Confidence Intervals ...... 655
5 Hypothesis Testing ...... 657
6 Probably Approximately Correct (PAC) Learning and Guaranteed Estimators ...... 660
References ...... 662

Appendix B Feature Selection Challenge Datasets

Experimental Design
Isabelle Guyon ...... 665

Arcene ...... 669

Gisette ...... 677

Dexter ...... 683

Dorothea ...... 687

Madelon ...... 691

Matlab Code of the Lambda Method ...... 697

Matlab Code Used to Generate Madelon ...... 699

Appendix C Feature Selection Challenge Fact Sheets

10 High Dimensional Classification with Bayesian Neural Networks and Dirichlet Diffusion Trees
Radford M. Neal, Jianguo Zhang ...... 707
11 Ensembles of Regularized Least Squares Classifiers for High-Dimensional Problems
Kari Torkkola and Eugene Tuv ...... 709

12 Combining SVMs with Various Feature Selection Strategies
Yi-Wei Chen, Chih-Jen Lin ...... 711
13 Feature Selection with Transductive Support Vector Machines
Zhili Wu, Chunhung Li ...... 713
14 Variable Selection using Correlation and SVC Methods: Applications
Amir Reza Saffari Azar Alamdari ...... 715
15 Tree-Based Ensembles with Dynamic Soft Feature Selection
Alexander Borisov, Victor Eruhimov, Eugene Tuv ...... 717
16 Sparse, Flexible and Efficient Modeling using L1 Regularization
Saharon Rosset, Ji Zhu ...... 719
17 Margin Based Feature Selection and Infogain with Standard Classifiers
Ran Gilad-Bachrach, Amir Navot ...... 721
18 Bayesian Support Vector Machines for Feature Ranking and Selection
Wei Chu, S. Sathiya Keerthi, Chong Jin Ong, Zoubin Ghahramani ...... 723
19 Nonlinear Feature Selection with the Potential Support Vector Machine
Sepp Hochreiter, Klaus Obermayer ...... 725
20 Combining a Filter Method with SVMs
Thomas Navin Lal, Olivier Chapelle, Bernhard Schölkopf ...... 729
21 Feature Selection via Sensitivity Analysis with Direct Kernel PLS
Mark J. Embrechts, Robert A. Bress, Robert H. Kewley ...... 731
22 Information Gain, Correlation and Support Vector Machines
Danny Roobaert, Grigoris Karakoulas, Nitesh V. Chawla ...... 733
23 Mining for Complex Models Comprising Feature Selection and Classification
Krzysztof Grąbczewski, Norbert Jankowski ...... 735

24 Combining Information-Based Supervised and Unsupervised Feature Selection
Sang-Kyun Lee, Seung-Joon Yi, Byoung-Tak Zhang ...... 737
25 An Enhanced Selective Naïve Bayes Method with Optimal Discretization
Marc Boullé ...... 741
26 An Input Variable Importance Definition based on Empirical Data Probability Distribution
V. Lemaire, F. Clérot ...... 743

Appendix D Feature Selection Challenge Results Tables

Result Tables of the NIPS2003 Challenge
Isabelle Guyon, Steve Gunn ...... 747

Arcene ...... 749

Dexter ...... 753

Dorothea ...... 757

Gisette ...... 761

Madelon ...... 765

Overall Results ...... 769

Index ...... 773

List of Abbreviations and Symbols

D  a dataset
X  a sample of input patterns
F  feature space
Y  a sample of output labels
ln  logarithm to base e
log2  logarithm to base 2
x^T x′  inner product between vectors x and x′
‖·‖  Euclidean norm
n  number of input variables
N  number of features
m  number of training examples
x_k  input vector, k = 1...m
φ_k  feature vector, k = 1...m
x_{k,i}  input vector elements, i = 1...n
φ_{k,i}  feature vector elements, i = 1...n
y_i  target values, or (in pattern recognition) classes
w  input weight vector or feature weight vector
w_i  weight vector elements, i = 1...n or i = 1...N
b  constant offset (or threshold)
h  VC dimension
F  a concept space
f(.)  a concept or target function
G  a predictor space
g(.)  a predictor function (real valued, or with values in {−1, 1} for classification)
s(.)  a non linear squashing function (e.g. sigmoid)
ρ_f(x, y)  margin function, equal to y f(x)
l(x; y; f(x))  loss function
R(g)  risk of g, i.e. expected fraction of errors
R_emp(g)  empirical risk of g, i.e. fraction of training errors
R(f)  risk of f

R_emp(f)  empirical risk of f
k(x, x′)  Mercer kernel function (real valued)
A  a matrix (use capital letters for matrices)
K  matrix of kernel function values
α_k  Lagrange multiplier or pattern weights, k = 1...m
α  vector of all Lagrange multipliers
ξ_i  slack variables
ξ  vector of all slack variables
C  regularization constant for SV Machines
1  vector of ones [1 1 ... 1]^T
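For reference, the risk and margin entries above can be written out explicitly in the notation of this list. The following is a minimal restatement; the use of the 0/1 loss, which makes the risk an expected fraction of errors as described above, is an assumption about the intended convention rather than a definition taken from the book:

\[
R(g) = \mathbb{E}_{(\mathbf{x},y)}\big[\, l(\mathbf{x}; y; g(\mathbf{x})) \,\big],
\qquad
R_{\mathrm{emp}}(g) = \frac{1}{m}\sum_{k=1}^{m} l(\mathbf{x}_k; y_k; g(\mathbf{x}_k)),
\qquad
\rho_f(\mathbf{x},y) = y\, f(\mathbf{x}).
\]

With the 0/1 loss, R(g) is the expected fraction of errors over the data distribution and R_emp(g) the fraction of errors on the m training examples, matching the descriptions given in the list.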