Feature Selection in MLPs and SVMs based on Maximum Output Information

Vikas Sindhwani, Subrata Rakshit, Dipti Deodhare, Deniz Erdogmus, Jose Principe, and Partha Niyogi

Abstract— We present feature selection algorithms for multi-layer Perceptrons (MLPs) and multi-class Support Vector Machines (SVMs), using the mutual information between class labels and classifier outputs as an objective function. This objective function involves inexpensive computation of information measures only on discrete variables; provides immunity to prior class probabilities; and brackets the probability of error of the classifier. The Maximum Output Information (MOI) algorithms employ this function for feature subset selection by greedy elimination and directed search. The output of the MOI algorithms is a feature subset of user-defined size and an associated trained classifier (MLP/SVM). These algorithms compare favorably with a number of other methods in terms of performance on various artificial and real-world data sets.

Index Terms— Feature Selection, Support Vector Machines, Multi-Layer Perceptrons, Mutual Information

I. INTRODUCTION

A learning algorithm attempts to induce a decision rule from which to categorize examples of different concepts by generalizing from a set of training examples. A critical ingredient of a successful attempt is to provide the learning algorithm with an optimal description of the concepts. Since one does not know a-priori what attributes constitute this optimal description, a number of irrelevant and redundant features are recorded. Many learning algorithms suffer from the curse of dimensionality, i.e., the time and data requirements for successful induction may grow very fast as the number of features increases [5], [12]. Unnecessary features, in such a case, serve only to increase the learning period. They add undesirable complexity to the underlying probability distribution of the concept label, which the learning algorithm tries to capture.

John, Kohavi & Pfleger [12] discuss notions of relevance and irrelevance that partition the set of features into useful degrees of dispensability. According to their definitions, irrelevant features do not participate in defining the unknown concepts; weakly relevant features possess redundant information and can be eliminated if other features subsuming this information are included; and strongly relevant features are indispensable. Given the task of selecting $K$ out of $N$ features, as $K$ is decreased one expects an ideal selection algorithm to first discard irrelevant features, then redundant features, and finally start eliminating the strongly relevant features according to the strength of their relevance. While this is desired, it usually cannot be directly implemented, as these properties of features are hard to determine a-priori. Thus the model selection problem (how many features) is usually driven by external constraints like building compact classifiers, data availability constraints, or the need for visualization in lower dimensions. In this paper we address the problem of which features to select, given a number of features $K$.

In this paper, we are concerned with developing information-theoretic methods to address the optimal feature subset selection problem. Guyon & Elisseeff [9] review several approaches advocated in the literature. In the filter approach, feature selection is independent of the learning algorithm. Many filters detect irrelevant features by estimating the importance of each feature independent of other features [13], [15]. Other filters perform a more complex search over multiple features in order to additionally identify and eliminate redundancy [1]. In the wrapper approach, the objective function for selection is a measure of classifier performance. Wrappers typically involve expensive search routines and are considered superior because they incorporate the bias of the classifier [12].

Several information-theoretic solutions to this problem have been proposed and may also be categorized as described above. Filters like Information Gain, routinely used on very high dimensional problems like text classification [28], use the mutual information $I(X_i; Y)$ between a single feature $X_i$ and the class variable $Y$ to estimate the relevance of feature $X_i$. Yang & Moody [27] select the two features that maximize the joint mutual information $I(X_i, X_j; Y)$ over all possible subsets of two features and class labels. For optimization over more than two variables, search heuristics are used. Battiti [1] proposes an algorithm called Mutual Information Feature Selection (MIFS) that greedily constructs a set $S$ of features with high mutual information with the class labels while trying to minimize the mutual information among chosen features. Thus, the feature included in the set maximizes, over all remaining features $X_i$,

$$I(X_i; Y) \;-\; \beta \sum_{X_j \in S} I(X_i; X_j)$$

for some parameter $\beta$. The Maximum Mutual Information Projection (MMIP) feature extractor, developed by Bollacker [2], aims to find a linear transform by maximizing, at each step, the mutual information between the class variable and a single direction in the feature subspace orthogonal to previously found directions. The Separated Mutual Information Feature Extractor (SMIFE) is a heuristic where a matrix of joint mutual information between class variables and each pair of features is constructed. Following an analogy with Principal Component Analysis (PCA), the eigenvectors of this matrix are found and the principal components are then used for feature transformation.

We observe two shortcomings of these methods. Firstly, any mutual information computation involving continuous features demands large amounts of data and high computational complexity. Not only are features typically continuous, they will be highly numerous in problems of interest in feature selection. Secondly, all these methods are in the vein of the filter approach. Their objective functions disregard the classifier with which the selected features are to be used. As pointed out in [1], "... there is no guarantee that the optimal subset of features will be processed in the optimal way by the learning algorithm and by the operating classifier."

In this paper, we address both these shortcomings simultaneously. We formulate an information-theoretic objective function that involves computation of mutual information only between discrete random variables. This objective function is the mutual information between the class labels and the discrete labels output by the classifier. Since, in a typical classification task, the number of classes is much smaller than the number of features, this suggests substantial gains in efficiency. We discuss theoretical justifications for using such an objective function in terms of upper and lower bounds on the error probability of the classifier, as well as justifications in terms of its merits as a performance evaluation criterion.

This objective function is used to design wrappers to select $K$ features out of $N$ for learning multi-layer Perceptrons (MLPs) [18] and multi-class Support Vector Machines (SVMs) [26]. Since the objective function is the mutual information between class labels and the output of the classifier, the class of algorithms we present are called Maximum Output Information (MOI) algorithms. Indirect feature crediting is achieved through an output-side entropy evaluation. The MOI wrapper algorithm, implemented for both SVMs (called MOI-SVM) and MLPs (called MOI-MLP), then conducts a directed search by iteratively refining the feature subset. It aims to discover a subset with which the classifier delivers maximum information via its output. The maximization of output information may be seen as an extension of Linsker's Infomax principle [14]: each layer of a multi-layered perceptual network should strive to transmit maximum information about its input to the next layer. The principle is now being applied to the classifier as a whole, since we are interested in evaluating a trained classifier. A key difference is that the information measured here is information in the classifier output specific to a desired task. It may be noted that Linsker's approach in [14] was meant for unsupervised settings, where there could be no task-specific measures.

Additionally, we utilize the applicability of SVMs to very high dimensional classification problems to design two variants of MOI based on greedy elimination. The sparsity of the SVM solution is exploited in all these schemes.

The results presented in this paper illustrate the performance of these algorithms on artificial and real-world data sets. Experimental comparisons are made with several other methods for feature selection.

II. AN INFORMATION THEORETIC OBJECTIVE FUNCTION

We consider the standard setting of the problem of pattern classification: a pattern $x$ drawn from a set $\mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_N$, constructed from the $N$ features $\{X_i : i = 1 \dots N\}$, is associated with a category whose label $y$ belongs to the set $\mathcal{Y} = \{1, 2, \dots, M\}$. When given a training sample consisting of a finite number of pairs of patterns and corresponding class labels (drawn according to the underlying unknown joint probability distribution $P_{XY}$), the supervised machine learning framework aims to discover a function $f : \mathcal{X} \rightarrow \mathcal{Y}$, from a hypothesis class $\mathcal{H}$ of functions, that exhibits good generalization on unseen patterns. Let $Y$ and $Y_f = f(x)$, $x \in \mathcal{X}$, be the discrete random variables over $\mathcal{Y}$ describing the unknown true label and the label predicted by the classifier, respectively. (Note that discrete labels are often obtained from real-valued outputs. We include this operation as part of the classifier.)

Let $G \subseteq \{X_1, X_2, \dots, X_N\}$ be a subset of $K$ features. Let $\mathcal{H}_G$ denote the restriction of $\mathcal{H}$ on $G$, i.e., the class of functions in $\mathcal{H}$ that map $G$ to $\mathcal{Y}$. The optimization problem we would ideally like to solve, for selecting $K$ features out of $N$, is the following:

$$G^*, f^* \;=\; \arg\max_{G:\,|G| = K} \; \max_{f \in \mathcal{H}_G} \; I(Y; Y_f) \qquad (1)$$

where $I(Y; Y_f)$ is the mutual information between $Y$ and $Y_f$. Since this is the average rate of information delivered by the classifier via its output, we refer to this quantity as classifier output information, and sometimes also denote it by $I_o$ in subsequent discussion.

The inner maximization constitutes the problem of training a classifier for a given set of input features. This is usually done so as to minimize a training objective function related to the error rate of the classifier, while the criterion above calls for an information maximization. This section deals with the relation between these two measures (probability of error and output mutual information) and the rationale for substituting one for the other. The outer maximization deals with the feature selection problem, once again with an information-theoretic approach; it will be addressed in the next section. Note that an optimization over $K$, the model selection problem, has been omitted, as it is often dependent on external factors like data availability and the resources available for implementing the classifier. We now discuss the estimation of $I(Y; Y_f)$ on a labeled data set, and then argue in favor of such an information-theoretic evaluation.

A. Estimation of Output Information

Given a classifier $f$ and a labeled data set, we may estimate the information delivered by the classifier about the unknown class label as follows. Let $M$ be the number of classes, and let $C$ be the $M \times M$ confusion matrix, where $c_{ij}$ is the number of times, over the labeled data set, that an input pattern belonging to class $i$ is classified by $f$ as belonging to class $j$. Clearly, the diagonal elements $c_{ii}$ represent all the correct classifications, while the off-diagonal terms represent the errors. Note that in what follows a summation over $i$ is a sum over values of $Y$, i.e., over the rows of a given column of the confusion matrix. Similarly, a summation over $j$ is a sum over values of $Y_f$, i.e., across the columns of a given row of the confusion matrix. Both variables are summed from $1 \dots M$; $S = \sum_{ij} c_{ij}$ is the total number of input samples. It is possible to estimate all relevant probabilities using this matrix $C$ in order to estimate $I(Y; Y_f)$ as follows:

$$\hat{P}(Y = i) = \frac{\sum_j c_{ij}}{S}, \qquad \hat{P}(Y_f = j) = \frac{\sum_i c_{ij}}{S}, \qquad \hat{P}(Y = i \mid Y_f = j) = \frac{c_{ij}}{\sum_i c_{ij}}$$

where $\hat{P}(Y = i)$ is the empirical prior probability of class $i$; $\hat{P}(Y_f = j)$ is the frequency with which the classifier outputs class $j$; and $\hat{P}(Y = i \mid Y_f = j)$ is the empirical probability of the true label being class $i$ when the classifier outputs class $j$. The relevant empirical entropies are now given by:

$$H(Y) = -\sum_i \hat{P}(Y = i) \log \hat{P}(Y = i)$$

$$H(Y \mid Y_f = j) = -\sum_i \hat{P}(Y = i \mid Y_f = j) \log \hat{P}(Y = i \mid Y_f = j)$$

$$H(Y \mid Y_f) = \sum_j \hat{P}(Y_f = j) \, H(Y \mid Y_f = j)$$

and the estimated value of the mutual information between class labels and classifier output is given in terms of the above entropies simply by $I(Y; Y_f) = H(Y) - H(Y \mid Y_f)$. Note that this mutual information computation involves only discrete variables that typically assume a small number of values.

B. Merits as an Evaluation Criterion

We now consider, briefly, the way output information $I(Y; Y_f)$ (or more conveniently $I_o = H(Y) - H(Y \mid Y_f)$) differs from two widely used criteria, the root mean square error (RMSE) and classification accuracy (CA). The RMS error is very sensitive to the margin by which a misclassification occurs and is often dominated by outliers. It also penalizes correct classifications if the output values do not exactly match the labels. It is this graded nature of the RMSE that makes it a desirable objective function for training but a bad choice for classifier performance evaluation. On the other hand, both $I_o$ and CA are sensitive to the number of misclassifications irrespective of the margins. This makes them insensitive to outliers and to the residual error problem, at the cost of making both non-differentiable functions of the classifier (e.g., MLP or SVM) parameters.

There are two other significant differences between $I_o$ on one hand and both RMSE and CA on the other hand. $I_o$ takes into account the input sampling bias, i.e., the fact that all classes are not equally likely a-priori. In real applications, it is quite likely that there will be fewer data points for certain classes. For such problems, classifiers can often reduce their initial RMSE, or improve their initial CA, by learning to ignore the smaller classes. $I_o$, as an evaluation criterion, is immune to biased input samples (though it does not in any way solve the problem of training a classifier with such biased data).

The second major difference is the sensitivity of $I_o$ to the pattern of errors. Neither RMSE nor CA take into account the distribution of errors across the various classes. This is a significant issue for problems with a large number of classes. Consider the confusion matrices for three 3-class problems given below.

          (a)              (b)              (c)
      1    2    3      1    2    3      1    2    3
 1   15    0    5     16    2    2      1    0    4
 2    0   15    5      2   16    2      0    1    4
 3    0    0   20      1    1   18      1    1   48

The classification accuracy for all three cases is $50/60 = 83.3\%$. $H(Y)$ for case (c) is 0.82 bits, indicating little prior uncertainty, as compared to $H(Y) = 1.58$ bits for cases (a) and (b), where the three classes have equal prior probability involving 20 examples each. The observer can demonstrate good accuracy even without the classifier in (c) by labelling all instances as class 3. As the confusion matrix shows, even without constructing good decision boundaries for class 1 and class 2, the classifier achieves a high classification accuracy. However, $I_o(c) = 0.06$ bits, indicating that the underlying classification task has not been solved by the classifier, whereas $I_o(a) = 0.96$ bits and $I_o(b) = 0.78$ bits. The classifier performs better on (a) because, with the information that the classifier has output class 1 or 2, the observer can be confident about the true class of the input. In (b), when the classifier outputs class 1 or class 2, it maintains slightly greater uncertainty than (a) by sometimes also claiming patterns of other classes. Notice that even though the classifier output 3 is more reliable in (b) than in (a), the classifier overall performs better in (a) on account of the greater reliability of its class 1 and class 2 outputs. Information measures tend to put a high premium on certainty. The third example, (c), shows how $I_o$ takes into account, through the $H(Y)$ term, the input distribution in determining the performance of a classifier.

Consider a hypothetical binary classifier that classifies all instances of class 1 as class 2 and vice versa. Such a classifier has nil accuracy but delivers the very useful information that its output implies the other class with no uncertainty. This classifier can be well utilized by a sentient observer who does not take the output at face value, but rather uses the information it delivers. The formulated objective function $I_o$ is capable of taking this into account and is, to the best of our knowledge, unique in this respect.

Finally, we compare $I_o$ to another information-theoretic objective function, the output cross entropy [20], [3]. This measure is actually closer in spirit to the RMSE than to $I_o$. It measures the extent to which classifier outputs have converged to the desired outputs, i.e., the residual approximation error. It does not depend on classification accuracy or the actual pattern of misclassifications. Due to its dependence on the approximation error, the output cross entropy is a differentiable function of the classifier parameters. In a similar vein, one may look for a differentiable approximation of $I_o$. In this paper, a different approach is used. The non-differentiable $I_o$ is used for evaluation, a differentiable approximation to classifier error rate (like MSE for MLPs) is used for training, and a link is established between the two to show that minimization of error rate would lead to an approximate maximization of $I_o$.
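The estimate above is a handful of array operations. The following sketch is our own illustration (names are not from the paper); it computes $I_o$ from a confusion matrix and reproduces the values quoted for matrices (a), (b) and (c):

```python
import numpy as np

def output_information(C):
    """Estimate I(Y; Y_f) in bits from an M x M confusion matrix C,
    where C[i, j] counts patterns of true class i labeled as class j."""
    C = np.asarray(C, dtype=float)
    S = C.sum()
    p_y = C.sum(axis=1) / S                      # empirical priors P(Y = i)
    p_yf = C.sum(axis=0) / S                     # output frequencies P(Y_f = j)
    H_y = -np.sum(p_y[p_y > 0] * np.log2(p_y[p_y > 0]))
    H_y_given_yf = 0.0
    for j in range(C.shape[1]):
        col = C[:, j]
        if col.sum() > 0:
            p = col / col.sum()                  # P(Y = i | Y_f = j)
            p = p[p > 0]
            H_y_given_yf += p_yf[j] * -np.sum(p * np.log2(p))
    return H_y - H_y_given_yf                    # I = H(Y) - H(Y | Y_f)

# The three 3-class matrices from the text; all have 50/60 = 83.3% accuracy.
A = [[15, 0, 5], [0, 15, 5], [0, 0, 20]]
B = [[16, 2, 2], [2, 16, 2], [1, 1, 18]]
Cc = [[1, 0, 4], [0, 1, 4], [1, 1, 48]]
for name, mat in [("a", A), ("b", B), ("c", Cc)]:
    print(name, round(output_information(mat), 2))  # a 0.96, b 0.78, c 0.06
```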

C. Relationship to Error Probability

Information-theoretic inequalities have been derived that establish a strong connection between the objective function described above and the performance of the classifier in terms of its error rate. Erdogmus and Principe [6] provide a family of upper and lower bounds on the misclassification probability $p_e(f)$ of a classifier by applying Jensen's inequalities in conjunction with Renyi's definitions of entropy and mutual information [16]. The tightest upper and lower bounds in this family involve only Shannon's definitions [21], and are as follows:

$$p_e(f) \;\geq\; \frac{H(Y) - I(Y; Y_f) - h\big(p_e(f)\big)}{\log(M - 1)} \qquad (2)$$

$$p_e(f) \;\leq\; \frac{H(Y) - I(Y; Y_f) - h\big(p_e(f)\big)}{\min_j H(Y \mid Y \neq j,\, Y_f = j)} \qquad (3)$$

where $h(x) = -x \log x - (1 - x)\log(1 - x)$ is the binary Shannon entropy, and $H(Y \mid Y \neq j, Y_f = j)$ is the (Shannon) entropy of the distribution over erroneous classes given that the classifier incorrectly outputs class $j$. Note that the lower bound is the familiar Fano bound [7], whose extension to an upper bound is made possible via Renyi's definitions. The bounds can be suitably modified to exclude the terms involving $h(p_e(f))$ and to be applicable to 2-class problems [6]. Thus, the information transferred by the classifier, $I(Y; Y_f)$, brackets its error rate from above and below.

Conversely, the error rate also brackets the quantity $H(Y) - I(Y; Y_f)$. In particular, (2) may be rearranged as a lower bound on $I_o$, other terms remaining constant. The denominator term prevents (3) from being used as an effective upper bound on $I_o$. One may consider the denominator to be highlighting the fact that it is possible to deliver useful information about the class labels without being accurate. Thus a low error rate guarantees a high information rate, but a high error rate does not rule out a high information rate.

The extended Fano inequalities (2), (3) theoretically confirm the intuition that a classifier optimal in the sense of minimum error maximizes the mutual information $I(Y; Y_f)$. Since typical training objective functions (like MSE in MLPs and margins in SVMs) attempt to achieve minimal error rates, it is justifiable (in that sense) to also use the minimization of training objective functions to maximize $I_o$.
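To make the bracketing concrete, (2) can be rearranged into a floor on $I_o$ once the error rate is known. The check below is our own numerical illustration on matrix (a) from Sec. II-B, not a computation from the paper:

```python
import numpy as np

def binary_entropy(p):
    """h(p) = -p*log2(p) - (1-p)*log2(1-p), with h(0) = h(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def info_lower_bound(H_y, p_e, M):
    """Fano bound (2) rearranged: I(Y; Y_f) >= H(Y) - h(p_e) - p_e*log2(M-1)."""
    return H_y - binary_entropy(p_e) - p_e * np.log2(M - 1)

# Matrix (a) of Sec. II-B: equal priors, M = 3, error rate p_e = 10/60.
print(info_lower_bound(np.log2(3), 1 / 6, 3))  # ~0.77 bits <= measured 0.96 bits
```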

III. FEATURE SELECTION BY MOI

We now describe the components of the MOI algorithms that heuristically solve the optimization problem (1). First, note that the inner maximization in (1) requires training classifiers with $I(Y; Y_f)$ as the objective function. Since this function is not differentiable with respect to the classifier parameters, this problem cannot be solved without resorting to non-differentiable optimization techniques like genetic algorithms. Differentiable substitutes such as cross-entropy [3], [20] may be used instead. However, in this paper, our approach is to approximate the optimization process rather than approximate the objective function. This is done by replacing (1) by the following problem:

$$G^* \;=\; \arg\max_{G:\,|G| = K} \; I(Y; Y_{f_G}) \qquad (4)$$

where $f_G$ is a classifier that has been trained using any convenient training objective function with the feature set $G$. The trained classifier is evaluated according to its output information. In this paper, we are concerned with multi-layer Perceptrons (MLPs) and multi-class Support Vector Machines (SVMs). The training objective function for MLPs is RMSE, and its minimization is performed using the popular error back-propagation algorithm [18]. For SVMs, a quadratic optimization problem is solved in order to maximize the margin of separation between examples of two classes, either in the original input space or in an implicitly mapped higher dimensional space by the use of kernel functions [19], [26]. A common strategy to construct multi-class SVMs is the one-against-rest approach, where a binary SVM is trained to separate each class from the other classes. A test example is labeled according to the maximum output among the binary SVMs [19].

Since the complexity of MLPs (e.g., as measured by VC-dimension [26]) is proportional to the number of weights in the network [5], a large number of irrelevant features construct complex networks that may overfit the data and also make the training algorithm more prone to local minima. SVMs, on the other hand, are claimed to be able to overcome the curse of dimensionality by maximizing the margin globally. Additionally, the SVM solution is sparse, involving a small subset of the training data [19], [26].

Such differences motivate different strategies for feature selection in MLPs and SVMs. Whereas training MLPs in lower dimensional feature spaces is intuitively preferable, the capability to generalize well in very high dimensional spaces suggests that irrelevant features are implicitly identified in SVMs. This provides a motivation to use SVMs for feature selection, as well as to apply such selection procedures to improve their own performance. Also, non-linear feature selection via different kernels may be attempted. Feature selection may utilize only the relevant examples, as discovered by SVM training, to construct a subset of relevant features. The MOI algorithms described below attempt to utilize these facts about the nature of the classifier.

A. Information Backpropagation

The problem of feature selection as formulated in (4) aims to select a feature subset of required size $K$ from a full set of $N$ features. Starting from a $K$-sized feature subset $G$, $f_G$ in (4) is obtained by training the classifier using the features in $G$. We use error backpropagation in MLPs and margin maximization in SVMs, briefly mentioned above. These trained classifiers are evaluated by their output information. In order to perform the maximization of output information in (4) over all possible subsets of size $K$, we need to formulate a directed search algorithm that iteratively refines $G$. For this purpose, we need a reasonable heuristic to assign, to each feature in $G$, suitable credits for the information delivered by the classifier trained on $G$.

This is done by the information backpropagation heuristic described in this section.

We visualize a trained classifier as transmitting information across multiple layers of components, starting from the input layer containing features to the output layer containing class indicators, one of which fires when the classifier is shown a pattern. The class indicators may be output neurons in an MLP or individual binary SVMs in a multi-class SVM. We measure the information transmitted by the trained classifier by evaluating it on a test set and computing $I_o$ from the confusion matrix as described in Sec. II-A. We then back-propagate this information measure across the layers of components of the classifier, using a heuristic to measure how each component contributes to the information flow. At the end of this credit distribution, we obtain an estimate of the contribution made by each feature.

Credit Assignment for Class Indicators: It is desirable that the information credited to each class indicator be such that: (a) it reflects the difference between the a-priori uncertainty (prior to observing the output of the class indicator) and the a-posteriori uncertainty (after observing the output of the class indicator); (b) it reflects the frequency of firing of the indicator; and (c) the sum of credits across the indicators adds up to $I_o$. The first condition ensures that class indicators that fire only for patterns of a particular class are given more credit; class indicators firing for patterns of multiple classes cause observer uncertainty. The second condition ensures that rarely firing class indicators get less credit. Such indicators might be either specializing in rare classes or failing to fire even when the input pattern is from their class. The third condition is a normalization constraint.

The usefulness of the knowledge of the firing of a particular indicator can be measured relative to a hypothetical worst-case indicator whose firing only maintains maximum uncertainty about the actual class. The uncertainty associated with such an indicator is $\log M$. The uncertainty about the class of a pattern given that class indicator $j$ has fired is $H(Y \mid Y_f = j)$. Thus the usefulness of indicator $j$, measured relative to the worst-case indicator, can be written as:

$$I_j \;=\; I_o \, \frac{\hat{P}(Y_f = j)\,\big(\log M - H(Y \mid Y_f = j)\big)}{\sum_i \hat{P}(Y_f = i)\,\big(\log M - H(Y \mid Y_f = i)\big)} \qquad (5)$$

Each of the quantities involved can be easily estimated from the confusion matrix. Note that we have dealt with the issue of the a-priori uncertainty of a class indicator, which is not well-defined, by taking it to mean uncertainty prior to training. Intuitively, this implies the worst-case class indicator as used in the crediting above. This also ensures non-negative credits, and zero credits for worst-case performance. Thus, the individual binary SVMs in a multiclass SVM, or the output neurons in an MLP, are credited for their contribution to information flow according to (5).
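Concretely, everything in (5) reads off the confusion matrix. The sketch below is our illustrative rendering, reusing `output_information` from the earlier snippet (function names are ours, not the paper's):

```python
import numpy as np

def indicator_credits(C):
    """Eq. (5): split I_o across the M class indicators. Indicator j earns
    credit in proportion to its firing rate P(Y_f = j) and to how far it cuts
    the worst-case uncertainty log2(M) down to H(Y | Y_f = j)."""
    C = np.asarray(C, dtype=float)
    S, M = C.sum(), C.shape[0]
    p_yf = C.sum(axis=0) / S
    H_cond = np.zeros(M)                        # H(Y | Y_f = j)
    for j in range(M):
        p = C[:, j] / max(C[:, j].sum(), 1.0)
        p = p[p > 0]
        H_cond[j] = -np.sum(p * np.log2(p))
    gain = p_yf * (np.log2(M) - H_cond)         # non-negative usefulness terms
    if gain.sum() == 0:                         # chance-level classifier
        return np.zeros(M)
    return output_information(C) * gain / gain.sum()
```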

Credit Assignment for Features: Backpropagating information further, we need to distribute credits across the next layer of components of the classifier. For SVMs we regard this layer to be the features themselves; for MLPs, this layer is the outer hidden layer. We are again guided by two considerations: (a) the credit assigned to a component must be proportional to the degree of influence it has on the components it connects to; and (b) credits must be normalized to add up to $I_o$.

This is implemented differently for SVMs and MLPs. Since the generalization performance of an SVM is deeply related to its margin [26], we compute the degree of influence of a feature on an SVM by the sensitivity of the margin of the SVM to that feature. The squared reciprocal of the margin of an SVM is given by:

$$\frac{1}{\gamma^2} \;=\; \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

where $\alpha_i$ is the Lagrange multiplier and $y_i$ the label corresponding to the $i$-th support vector $x_i$, and $K$ is the kernel function used [26], [19]. We compute the derivative-based sensitivity to feature $k$ in the following manner:

$$s_k \;=\; \sum_j \Big| \sum_i \alpha_i \alpha_j y_i y_j \frac{\partial K(x_i, x_j)}{\partial x_j^k} \Big| \qquad (6)$$

where $x_j^k$ is the $k$-th feature in $x_j$. We may now perform the normalization step. Feature $k$ can be credited for the overall information flow according to:

$$c_k \;=\; \sum_m \frac{s_{mk}}{\sum_l s_{ml}} \, I_m \qquad (7)$$

where $m$ indexes the SVMs and $k$ indexes the features; $s_{mk}$ is computed from (6) for feature $k$ and SVM $m$; and the information credit $I_m$ of the $m$-th SVM is calculated from (5). Recall that $I_o$ in (5) is the output information of the multiclass SVM.
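For a Gaussian kernel the derivative in (6) is available in closed form, and only the support vectors enter the computation. The sketch below is our illustration of (6) and (7) under that assumption (array shapes and names are ours):

```python
import numpy as np

def rbf_margin_sensitivity(sv, alpha, y, sigma=1.0):
    """Eq. (6) for a Gaussian kernel: per-feature sensitivity of the squared
    inverse margin. sv: (n, d) support vectors; alpha, y: (n,) Lagrange
    multipliers and labels."""
    diff = sv[:, None, :] - sv[None, :, :]               # x_i^k - x_j^k
    K = np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma ** 2))
    grad = K[:, :, None] * diff / sigma ** 2             # dK(x_i,x_j)/dx_j^k
    ay = alpha * y
    inner = np.einsum('i,j,ijk->jk', ay, ay, grad)       # inner sum over i
    return np.abs(inner).sum(axis=0)                     # s_k = sum_j |...|

def feature_credits(sens_per_svm, info_per_svm):
    """Eq. (7): distribute each binary SVM's information credit I_m over the
    features in proportion to its normalized sensitivities s_mk."""
    S = np.asarray(sens_per_svm, dtype=float)            # shape (n_svms, d)
    I = np.asarray(info_per_svm, dtype=float)            # shape (n_svms,)
    S = S / np.maximum(S.sum(axis=1, keepdims=True), 1e-12)
    return (S * I[:, None]).sum(axis=0)
```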

For multi-layer perceptrons, information backpropagation is performed across the multiple hidden layers. Consider a neuron $n$ in the layer indexed by $j$, and let the layer being fed by this layer be indexed by $k$. Denoting the output of a neuron $m$ in layer $k$ as $o_m$, and the weight of the interconnection between neurons $n$ and $m$ as $w_{nm}$, we define the credit for neuron $n$ as:

$$c_n \;=\; \sum_m \frac{s_{nm}}{\sum_{n'} s_{n'm}} \, c_m, \qquad s_{nm} = \mathrm{cov}\big(o_m,\; w_{nm}\, o_n\big) \qquad (8)$$

where we use the covariance $s_{nm}$ as the sensitivity measure. After using (5) to determine the credit for each output layer neuron, one can use (8) to recursively compute the credits for the neurons in the non-output layers. In the end, the layer containing the features is credited.

Note that we have assumed that components in a layer are delivering mutually independent information, so that feature crediting involves simple additive arithmetic. To capture dependence between connected components, it would be appropriate to compute their mutual information, given that we have committed to use information measures. Instead, we have used simple derivative- and covariance-based sensitivities, so as to be consistent with the objective of avoiding discretization of continuous variables, and also to utilize the sparsity of the SVM solution (which makes Eqn. (6) involve very few training patterns, i.e., only the support vectors). The error incurred in ignoring the correlation between intra-layer components is corrected to first order by the requirement of normalization in each layer.
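All the covariances in (8) come from a single pass over the test set. Below is a minimal sketch of one layer-to-layer crediting step; the exact covariance pairing is our reading of (8), flagged in the comments:

```python
import numpy as np

def backprop_credits(o_lower, o_upper, W, credits_upper):
    """Eq. (8): push credits one layer back. o_lower: (T, n) test-set outputs
    of the feeding layer; o_upper: (T, m) outputs of the fed layer; W[n, m]:
    weight from lower neuron n to upper neuron m; credits_upper: (m,)."""
    T = o_lower.shape[0]
    s = np.empty((o_lower.shape[1], W.shape[1]))
    for m in range(W.shape[1]):
        contrib = o_lower * W[:, m]                     # w_nm * o_n per pattern
        om = o_upper[:, m] - o_upper[:, m].mean()
        # assumed pairing: s_nm = |cov(o_m, w_nm * o_n)|
        s[:, m] = np.abs((contrib - contrib.mean(axis=0)).T @ om) / T
    s = s / np.maximum(s.sum(axis=0, keepdims=True), 1e-12)  # normalize per m
    return s @ credits_upper                            # credits for lower layer
```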

B. Algorithms

The information backpropagation heuristic described above guides the directed search required to perform the maximization in (4), by projecting the output information $I(Y; Y_{f_G})$ onto the individual features in $G$. We now describe the algorithms that perform this directed search, in increasing order of their complexity. Let $X = \{X_1, X_2, \dots, X_N\}$ denote the full set of $N$ features; let $D_{train}$ denote the training data set and $D_{test}$ a test data set; and let $D_{train}(G)$ and $D_{test}(G)$ denote the training and test data sets restricted to the features in the subset $G$.

Algorithm 1 MOI Pseudo-Wrapper for SVMs
Require: $0 < K < N$; $D_{train}$; $D_{test}$
1: Train the multi-class SVM on $D_{train}$. {Train using all N features. This identifies the support vectors and corresponding Lagrange multipliers for each binary SVM.}
2: Estimate its $I_o$ on $D_{test}$.
3: Assign credits to each binary SVM by (5).
4: Assign credits to each feature by (7).
5: Return the best K features according to these credits.

The MOI Pseudo-Wrapper (MOI-P) algorithm (Algorithm 1) relies on the ability of SVMs to generalize well in high dimensional spaces. The multi-class SVM is trained on the full set of features. The amount of information it delivers about the class labels using the full feature set is estimated over a test set. Feature credits are obtained by information backpropagation. These credits are used as relevance estimates, similar to a filter. Thus, no search routine is employed, and the complexity of this algorithm is roughly the complexity of training a single SVM on the full feature set. In our experiments, we have used this scheme as a filter for SVMs themselves.

Algorithm 2 MOI Backward Elimination Wrapper for SVMs
Require: $0 < K < N$; $D_{train}$; $D_{test}$
1: INITIALIZE: $G = X$ {G is initialized as the set of all N features}
2: while $|G| > K$ do
3:   Train the multi-class SVM on $D_{train}(G)$.
4:   Estimate its $I_o$ on $D_{test}(G)$.
5:   Assign credits to each binary SVM by (5).
6:   Assign credits to each feature by (7).
7:   Eliminate the worst feature according to these credits, i.e., $G = G - \{X_w\}$, where $X_w$ receives the least credit in step 6.
8: end while
9: Return the multi-class SVM trained on the current feature subset.

The MOI Backward Elimination (MOI-BE) algorithm (Algorithm 2) implements the pseudo-wrapper approach recursively. At each step, the worst feature is eliminated and an SVM with one less feature is trained. Since the algorithm uses better estimates at each step, we expect it to out-perform the filter approach. On the other hand, since features are greedily eliminated with no back-tracking, an incorrect elimination can be immensely harmful down the recursion. Neither of these two wrappers is feasible for MLPs, since they rely on the performance on the full feature set. This algorithm involves training $N - K + 1$ multi-class SVMs with decreasing dimensionality. It may be desirable to eliminate more than one feature at each step of the recursion when dealing with very large data sets. A compact sketch of this loop appears below.
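The following rendering of Algorithm 2 ties together the crediting routines sketched earlier. It is our illustration: `train_multiclass_svm` and `predict` are hypothetical stand-ins for an SVM package (the paper does not prescribe an API), and the `.binary_svms` attribute with `sv`, `alpha`, `y` fields is likewise assumed:

```python
import numpy as np

def confusion_from_labels(y_true, y_pred, M):
    """Tally an M x M confusion matrix from label arrays."""
    C = np.zeros((M, M))
    for t, p in zip(y_true, y_pred):
        C[int(t), int(p)] += 1
    return C

def moi_backward_elimination(X_tr, y_tr, X_te, y_te, K, M, sigma=1.0):
    """MOI-BE (Algorithm 2): repeatedly drop the least-credited feature
    until K remain. Returns the indices of the surviving features."""
    G = list(range(X_tr.shape[1]))                        # start with all N
    while len(G) > K:
        model = train_multiclass_svm(X_tr[:, G], y_tr, sigma)  # assumed API
        C = confusion_from_labels(y_te, predict(model, X_te[:, G]), M)
        I_m = indicator_credits(C)                        # Eq. (5) per binary SVM
        S = np.array([rbf_margin_sensitivity(b.sv, b.alpha, b.y, sigma)
                      for b in model.binary_svms])        # Eq. (6), assumed fields
        G.pop(int(np.argmin(feature_credits(S, I_m))))    # Eq. (7) + elimination
    return G
```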

sively. At each step, the worst feature is eliminated and an credit associated with a feature depends on the choice of other

SVM with one less feature is trained. Since, the algorithm uses 7 features used when it was last selected. The choice of

better estimates at each step, we expect it to out-perform the a new subset X , based on these current feature credits, is thus filter approach. On the other hand, since features are greedily only a good guess. The performance of the classifier trained on

eliminated with no back-tracking, an incorrect elimination can X is not guaranteed to be better than the previous. Therefore |

be immensely harmful down the recursion. Neither of these the X is required to give MOI the ability to backtrack if stuck

7 X in a X inferior to some previous subset . The RESET counter highly correlated sixth feature and the irrelevant fifth feature

ensures that the algorithm terminates with that best set if it is as lower than the first four (relevant) features.

stuck in a limit cycle. The MOI algorithm needs to train atmost For '=“ (reject one), the point of interest is to see which

à áÞ 7

£¡M -feature classifiers in order to converge. In input is rejected. The relevant features are always selected

âÞ 7

the worst case there will be ¡Û iterations to test for both MOI-MLP and MOI-SVM. Of the six possible

äÞ 7

every feature once, another ¡ã iterations after the initializations for MOI-MLP, feature 5 is rejected five times

|

åÞ

@Ž£¡Ç 7

first reset in which a new X is found and (leading to 95.3% performance on the validation set) and |

iterations to exit after being reset twice to this new X . In feature 6 once (100%). For MOI-SVM, the initialization by

most cases, it requires far fewer iterations – usually less than MOI-P causes it to reject feature 6 with 100% performance

WÞ 7

@Ž£¡ . Each iteration requires the classifier to be trained on the validation set.

à

on the current feature set. As stated before, the selection of For ' , we find MOI-MLP and MOI-SVM to always the training objective function and training algorithm is made converge to the set of relevant features (irrespective of the extraneous to MOI by the approximation (4) and outside the initialization for MOI-MLP). The credits assigned to features

scope of this paper. 5 and 6 by information backpropagation are more distinctively

\

¢¤£ ¨A r <†r

The computation of R , being based on only lower than those assigned to the relevant features.

4 : 4 :

numbers, is inexpensive. The Information backpropagation For '=@ , the optimal selection should be 1,2 , 3,4 or : step needs to run once per iteration, on the trained classifier 4 X,6 . It can be shown that in each case a 75% performance only. The computation of all the covariances for MLPs can be should be achievable. However, due to bias in the output, it is

achieved in a single pass through the test data set; whereas actually possible to achieve higher than 75% accuracy based

: 4 : computation of the derivative based sensitivity for SVMs on 4 1,2 or 3,4 . The results for MOI-MLP with all possible

requires a pass over the small set of support vectors. Thus random initializations of 2-feature sets is as follows. The sets

: 4 : 4 :

the computation cost per iteration is dominated by the cost of 4 1,2 or 3,4 are picked 9 times, X,6 is picked in 4 cases

:

training a -feature classifier. In [10], the cost of training and 4 1,4 is picked in two cases. The accuracies achieved are ;æ a -feature MLP was (empirically) found to vary as . 81.3%, 78.1% and 68.8% on the validation set. On the other

This cubic dependence on input dimensionality makes MOI- hand, MOI-SVM as initialized by MOI-P, again converges to :

MLP computationally attractive vis-a-vis wrappers that require a best set 4 3,4 .

'è7

training of networks with ¡ features before pruning down For , both MOI-MLP and MOI-SVM converge on

'ß7 to ×Z¡ [11]. The cost of training an SVM is linear in the sixth input achieving a 75% accuracy. For the dimensionality [19]. Thus, the complexity of MOI-SVM and case, this is in fact the (unique) optimal choice. In iso- MOI-BE are comparable. lation, none of the four relevant features can predict the output better than chance. The information credits assigned to the six features, for both MOI-MLP and MOI-SVM, are

IV. EXPERIMENTS ”9”m”ž© ”B-07C—9” ”Ž-07C”o“B©A”Ž-07C”?“Ž© ”ž-/7”?“B©A”B-07C”9“Ž© ”B- . All four relevant fea- The MOI algorithms were tested on several standard data tures are equivalent, the irrelevant feature is useless and the sets. Detailed results on 2 artificial datasets are presented correlated feature is the best.

to illustrate the behavior, strengths, and weaknesses. Specific The Corral problem illustrates the ability of MOI to pick

aspects are analyzed and comparisons are made against several good subsets for various «}¡ . The process of elimination other methods on real world data sets. begins with the irrelevant and redundant features, but it can continue beyond that if required. The second key point illus- trated is that the context dependency of the credits associated A. Artificial Data Sets

with the features is desirable. The difference in credits for

à ©C7

1) Corral: The Corral problem is an artificially constructed '閎© reflects the utility of each feature in the context

à ©x7

problem created to test decision trees [12] and other feature of obtaining an optimal classifier with –ž© features. The

selectors. [4] discusses the relative merits of several feature correlated feature (feature 6) is useless but benign for 'ê– ,

à

selection algorithms on this insightful problem. There are six its inclusion is harmful for ' and it is the best input for

features (all binary) and one binary class variable. The class is '[7 . Another interesting observation is that MOI-SVM is defined in terms of the first 4 features as a boolean function. always well initialized by MOI-P on this problem.

The fifth feature is irrelevant. The sixth feature is correlated We now compare the behavior of MOI-P, MOI-BE and

à

to the target, matching it in 75% of the data points. MOI-SVM. We find that for 'ܖž© all these algorithms

For '[– (all features), information back-propagation is return 100% performance on training and validation sets,

performed once. The point of interest is to see the credit as- identifying the best feature subsets. For 'ë@ , MOI-P : signed to the six features. The average information credited to returns the sub-optimal set 4 1,4 and the performance on

the six inputs by MOI-MLP over 4 random initializations are the validation set is –9—ž-E£9“?š , whereas MOI-BE and SVM-

à

”Ž-E@B75£B© ”ž- @9” ©A”Ž-E@9@?@B© ”ž-/75£9£o©A”Ž- ”m—9™B©A”Ž- ”m—9–

. The credits assigned MOI are equal and optimal. For '˜7 , MOI-BE, having –

by MOI-SVM (Gaussian Kernel, width parameter çZ'¢7 ) discarded feature early in the recursion, returns suboptimal

à à

© ”ž-/7e£m“Ž© ”Ž-075£m–Ž©A”B-/7 m”Ž©A”B-07 –ž© ”B-0797x— are ”Ž-07C— The four relevant performance relative to MOI-SVM. Thus, our intuition is features are clearly credited more than the last two. Thus given confirmed: In general, the ability to back-track makes MOI- all the features, MOI-MLP and MOI-SVM are able to rank the SVM superior to MOI-BE, which in turn improves upon MOI- 8

TABLE I P. DATA SETS USED FOR MLPS. (xN denotes N random splits; K-X-Y denotes 2) Parity: The Parity problem studied here is a 15 feature an architecture with K features, X hidden layer neurons and Y output version, with 5 irrelevant features, 5 relevant features and 5 neurons)

redundant features. Each redundant feature is a duplicate of íì a relevant feature. This problem has been examined to show Data Set ì N Train,Test,Validation Architecture situations where MOI algorithms fail. We report only for MOI- Breast-Cancer 2 9 174,176,349 (x30) K-12-2 MLP, but the conclusions generalize to other algorithms. When DNA 3 180 2000,1186,0 K-5-3 Landsat 6 36 4435,1000,1000 K-60-6 tested with '>7”Ž©U£q©A“ , it is found that there is a very large Sonar 2 60 104,104,0 K-3-2 variance in the final performance. For certain initial sets, MOI- Vehicle 4 18 423,423,0 K-30-4 MLP happens to pick up a set with 5 independent relevant Vote 2 16 197,21,217 (x30) K-3-2 features and the classifier performs close to 100%. For other initial sets, it fails to find such a set and the performance TABLE II of the final classifier is close to 50%. The failure of MOI DATA SETS USED FOR SVMS. (xN denotes N random splits)

can be understood as follows. For the parity problem, the fall íì

in classifier performance for incorrect features is total. At a Data Set ì N Train,Test,Validation SVM Kernel

´{ïð ïUï¹

SAT 6 36 4435,1000,1000 RBF( î )

¢¤£ ¨A w'ê”

R ïUï¹ performance close to random guesswork, . Thus ´{ïð

Vehicle 4 18 282,282,282 RBF( î )

´zïð ï¹

as soon as such a set is picked during the iterative process, Yeast 5 79 121,87,0 (x8) RBF( î ) ” all the current features receive ”ž- credits. In the absence Reuters-1 3 2225 199,113,0 Linear of graded credits, the directed search cannot function. The Reuters-2 3 2344 193,162,0 Linear Reuters-3 5 8167 3257,2912,0 Linear algorithm iterates blindly and sometimes chances upon a good set. (Thus the ‘success’ of MOI for the parity problem is due to the density of good subsets among the total set of size subsets rather than a successful directed search.) This [11]. NNFS uses a training objective function consisting of two situation arises because the lack of a single feature can totally terms - the output cross-entropy and a penalty on the number degrade classifier performance for Parity. For real problems, of weights in the network. The basis of feature selection in the absence of a single feature seldom reduces the performance MLPs using NNFS is therefore, network pruning. ANNIGMA of a trained classifier to chance level. implements a directed search strategy based on crediting features according to the weights associated with them. NNFS B. Real World Data Sets and ANNIGMA are designed to also find an optimal K; we make a comparison with the best two values of K for MOI- We have performed experiments on real world data sets MLP. The relative improvement in MLP performance with drawn from the UCI machine learning repository and subsets feature selection is best using MOI-MLP, on the breast cancer of the Reuters-21578 text collection [17]. 12 data sets - 6 data set. each for MLPs and SVMs were selected and their particulars 2) DNA: The DNA data set was used to compare the per- (Number of Classes, Features, Training/Test/Validation splits, formance of MOI-MLP with popular filters like Information and the classifier parameters used) are listed in Table I and \

Gain (IG) [15] and ñ -statistic (Chi) [25]. Another well Table II. The choice was made so as to facilitate compar- known filter Relieff assigns a relevance weight to each feature isons with results reported elsewhere. Separate sets were used based on its capability to distinguish between nearest examples for training the classifier and for computing ¢£ ¨ "Ro . The of the same class and opposite classes [13]. Table V reports performance of the final classifier was tested on an unseen the performance of an MLP trained with the filtered features validation data set. Separate validation sets were not used if using these methods versus MOI-MLP. We find that MOI-MLP three splittings caused under-representation of any class. In outperforms the filters for K=80,30,11 and 3. some cases, results were averaged over multiple random splits so as to match experimental protocols used elsewhere. Due to the difference between software packages used for training and TABLE III unreported parameters, the performance of the classifiers on COMPARATIVE PERFORMANCE OF MOI-MLP ON BREAST-CANCER. (K:N the full set of features for a data set often did not match across denotes ratio of MLP accuracy with the selected K and all N features.) published results and our experiments. In such cases, in order Methods ANNIGMA NNFS MOI MOI to focus on feature selection, we report relative performance K 2.8 2.7 3 4 in terms of the ratio of accuracies obtained with the selected K:N 0.9958 1.0022 1.0082 1.0118 features and the full feature set. TABLE IV C. Feature Selection in MLPs COMPARATIVE PERFORMANCE OF MOI-MLP ON VOTE. (K:N denotes 1) Breast-Cancer and Vote: Table III and IV compare the ratio of accuracies with K and N features.) performance of MOI-MLP with the NNFS [20] and the Methods ANNIGMA NNFS MOI MOI ANNIGMA [11] wrappers. These wrappers are reported to K 2 2 2 1 be the best performing methods on these data sets, among K:N 1.0 1.03 0.999 1.01 a variety of feature selection methods for MLPs explored in 9

TABLE V TABLE VII COMPARATIVE PERFORMANCE OF MOI-MLP ON DNA COMPARATIVE PERFORMANCE OF MOI-MLP ON SONAR

K MOI IG Chi ReliefF K Random PCA MIFS MOI Iter. 180 94.2 94.2 94.2 94.2 60 82.9 (2.9) 82.9 (2.9) 82.9 (2.9) 82.9 (2.9) 1.0 80 95.3 94.9 91.9 94.6 18 78.5 (5.1) 72.7 (3.7) 79.2 (1.3) 84.5 (1.1) 60.5 30 95.5 95.1 95.1 94.4 12 76.9 (4.1) 63.7 (3.2) 78.9 (3.0) 83.7 (0.7) 59.7 11 94.0 89.5 89.5 91.7 6 68.1 (4.1) 63.2 (4.0) 75.1 (4.1) 80.6 (1.5) 63.0 6 88.5 87.0 89.0 89.0 3 80.7 75.4 75.4 75.4 TABLE VIII COMPARATIVE PERFORMANCE ON YEAST. (First row reports 70 performance on N features. Other rows report the ratio of accuracies with K 65 and with N features.)

60 K AROM RFE MOI-P MOI-BE MOI-SVM 55 79 95.7(1.1) 95.7(1.1) 94.52(1.31) 94.52(1.31) 94.52(1.31) 10 1.0199 0.9801 0.9746 0.9822 1.0068 50 20 0.9906 0.9801 0.9890 0.9999 1.0215 40 1.0094 1.0094 0.9930 1.0053 1.0188

Classification Accuracy 45

40 MOI−MLP PCA LDA 35 MMIP SMIFE2 MIFS MOI−P Battiti [1] are also listed for MIFS, PCA and random selection. 30 2 4 6 8 10 12 14 16 18 The last column confirms that the average number of iterations

Number of Features

required for convergence scales linearly with ¡ò . Fig. 1. Comparative performance of MOI-MLP on Vehicle

D. Feature Selection in SVMs 3) Vehicle: Fig 1 plots the performance of MOI-MLP for the entire range of feature selections and compares it with 1) Yeast: Table VIII lists the performance of MOI-P, MOI- the performance of an MLP trained on features provided BE and MOI-SVM and compares them with results using by MIFS (described in Secion I) and a variety of feature the Approximation of zero norm minimization(AROM) [24] transformation methods - PCA, LDA [5], MMIP and SMIFE (AROM) and the recursive feature elimination (RFE) meth- (described in Secion I) . MOI-MLP out-performs all other ods reported in [24]. AROM sets up linear programs to methods especially for smaller values of K. For comparison, approximately minimize the number of non-zero components we have also plotted the performance of MOI-P for an SVM. It of the SVM weight vector under suitable constraints. RFE is can be seen that the SVM performance is relatively unaffected very similar to MOI-BE and performs margin based backward by the presence of unnecessary features (K=8 to K=16). A elimination. MOI-SVM is the only algorithm that demonstrates feature selection performed by MOI-P indeed improves its relative improvement for all selections attempted and outper- performance (e.g, for K=6). forms MOI-P and MOI-BE as expected. The comparison is not rigorous since we train the multi-class SVM on lesser data 4) Landsat: Table VI compares the performance of MOI- using gaussian kernels, and report the results averaged over 8 MLP with a number of other methods. Most methods are able random data splits. In [24], 8-fold cross validation results are to improve upon the performance of the MLP with respect to reported on experiments using a linear kernel. its performance on the full feature set for moderate values of K. For K=5, we find only SMIFE and MOI-MLP to maintain 2) Vehicle: Fig 2 compares the performance of the SVM improvements in performance. wrappers with classical feature transformation methods like PCA and LDA and several popular filters. The Shared Variance

5) Sonar: The results of MOI-MLP for this data set are \ presented in Table VII. For comparison, results reported in (SV) and Gain Ratio (GR) are normalized versions of the ñ - statistic and Information Gain (IG) respectively [25]. We find the multiclass SVM with LDA features to excel for K=3, but TABLE VI LDA cannot produce more than 3 features for this 4-class COMPARATIVE PERFORMANCE OF MOI-MLP ON LANDSAT. (First row problem. For moderate feature selection (K=9 to K=18) MOI- reports performance on N features. Other rows report the ratio of P, MOI-BE and MOI-SVM return identical performance. In accuracies with K and with N features.) this domain, MOI-P is the best strategy since it is inexpensive and performs just as well. For smaller values of K, MOI-SVM K MOI SMIFE2 MMIP MIFS IG PCA LDA is able to select better features and the performance of MOI- 36 71.5% 77.6% 79.0% 79.3% 85.1% 78.9% 78.7% BE is bracketed between MOI-P and MOI-SVM. Observe that 30 1.05 1.03 1.01 1.00 1.02 1.00 1.00 for K=2 and K=3, MOI-BE is actually worse than MOI-P. This 18 1.04 1.04 1.01 1.00 1.02 1.00 1.00 reiterates the harmful effect of relying on the performance 10 1.02 1.04 1.00 0.98 1.01 1.00 0.98 5 1.03 1.03 1.01 0.87 0.97 1.00 0.89 of SVMs successively trained with features that are greedily eliminated. 10

Performance on Vehicle Dataset 0.9 information (MI) between the continuous inputs and desired outputs. MOI-P involves (i) an approximate optimization using MOI−BE MOI−SVM 0.8 SVM training and (ii) MI computation only at the output and MOI−P PCA LDA indirect labeling of inputs. The results in Table IX validates the SV GR appropriateness of both these approximations. As for resource 0.7 consumption, the major component of MOI-P runtime is the training time of the binary SVMs with all the features. For the Classification Accuracy 0.6 largest dataset, Reuters-3, the training time had an average of 11 min per SVM. The feature crediting module is very quick

0.5 since it processes only the support vectors, which in Reuters-3, average 475 per class. It is reasonable to hope to improve upon 0 2 4 6 8 10 12 14 16 18 Number of Features these results using MOI-BE, eliminating multiple features per step or by using MOI-SVM for more exhaustive selection. Fig. 2. Comparative performance of SVM on VEHICLE TABLE IX Performance of Filters and Wrapper on LANDSAT 1

MOI−P PERFORMANCE ON THE REUTERS DATASET (Classification Accuracy (CA) 1ô(

MOI−SVM  ³ IG ± GR and Relative Output Information (R.O.I) ó /H(Y)) SV 0.95 CHI Dataset N IG MOI-P Train, Test K CA, ROI CA, ROI 0.9 Reuters-1 2225 96.46%, 87.42% 96.46%, 87.42% 199,113 431 99.12%, 96.18% 98.23%, 93.45% Reuters-2 2344 94.44%, 77.95% 94.44%, 77.95% 0.85

Classification Accuracy 193,162 458 95.67%, 83.72% 93.83%, 76.52% Reuters-3 8167 93.00%, 78.80% 93.00%, 78.80%

0.8 3257,2912 1167 92.00%, 76.77% 93.10%, 79.00%

0.75 0 5 10 15 20 25 30 35 Number of Features V. CONCLUSIONS Fig. 3. Comparative performance of SVM on LANDSAT We have proposed output information ¢¤£ ^ "R? as a new information theoretic objective function for evaluating clas- sifiers; and demonstrated its utility for the task of feature 3) Landsat: The Landsat data set is considered difficult selection in MLPs and SVMs. This objective function is for greedy feature selection algorithms [23]. The objective computationally inexpensive and scalable, immune to bias in here is to compare MOI-P with statistical and information input distribution and theoretically well founded. The MOI theoretic filters. We find, as shown in Fig 3, MOI-P to compare algorithms attempt to optimize it by greedy feature elimination favorably for low to moderate feature selection (K=36 to K=7), and directed search in the feature subset space. These algo- but is outperformed for smaller values of K. In this domain, rithms incorporate useful properties of the classifiers and com- MOI-SVM is able to improve upon all the filters. pare favorably with a number of statistical and information- 4) Text Classification: We constructed three subsets of the theoretic methods on several artificial and real world problems. Reuters collection involving a very large number of features. The first subset Reuters-1 comprises of articles on the topics REFERENCES coffee, iron-steel and livestock. These topics are not likely to [1] Battiti, R., “Using Mutual Information for Selecting Features in Super- have many meaningful overlapping words. The second subset vised Neural Net Training,” IEEE Trans Neural Networks, vol 5, no 4, Reuters-2 contains articles on reserves, gold and gross national pp 537-550. July 1994. product, likely to have similar words used in different contexts [2] Bollacker, K.D., Ghosh, J., “Linear Feature Extractors Based on Mutual Information,” Proc ICNN 1996, vol. 3, pp 1528-1533. Washington D.C., across these topics. Reuters-3 was constructed to examine June 1996. the performance of MOI-P on even larger dimensionality. It [3] Bridle, J.S., “Training stochastic model recognition algorithms as net- contains articles on the five most frequent Reuters categories : works can lead to maximum mutual information estimation of param- eters,” Advances in Neural Information Processing Systems, vol. 2, pp. earn, acq, money-fx, grain and crude. Each article was binary- 211-217. D.S. Touretzky, ed. San Mateo, C.A.: Morgan Kauffman, 1990. encoded where each feature denoted whether a particular word [4] Dash,M. and Liu,H. “Feature selection for classification. Intelligent Data occurred in the article or not. As a preprocessing step, all Analysis”, vol. 1, pp 131–156, 1997. [5] Devroye L., Gyorfi L., and Lugosi G., A Probabilistic Theory of Pattern articles with more than 30% content numeric were excluded Recognition. Springer-Verlag New York, Inc., 1996. from the dataset and words occurring less than 3 times in each [6] Erdogmus, D., & Principe, J., (2003). Lower and Upper Bounds For Mis- dataset were eliminated to remove extremely rare words. We classification Probability Based on Renyi’s Information. Journal of VLSI Signal Processing Systems (Special Issue on Wireless Communications compare MOI-P against Information gain for drastic feature and Blind Signal Processing) selection selecting top 20% features [22]. [7] Fano, R.M., (1961). Transmission of Information: A Statistical Theory of As Table IX shows, with both IG and MOI-P, the classifier Communications. 
V. CONCLUSIONS

We have proposed the output information $I(C;Y)$, the mutual information between the class labels $C$ and the classifier outputs $Y$, as a new information-theoretic objective function for evaluating classifiers, and have demonstrated its utility for the task of feature selection in MLPs and SVMs. This objective function is computationally inexpensive and scalable, immune to bias in the input distribution, and theoretically well founded. The MOI algorithms attempt to optimize it by greedy feature elimination and directed search in the feature subset space. These algorithms incorporate useful properties of the classifiers and compare favorably with a number of statistical and information-theoretic methods on several artificial and real-world problems.
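To make the objective concrete, here is a minimal sketch (again Python/NumPy, not from the original paper) of estimating $I(C;Y)$ from the discrete outputs of a trained classifier, together with a deliberately naive greedy backward elimination loop in its spirit. The helper train_and_predict is a hypothetical stand-in for retraining the MLP or SVM on a candidate feature subset and returning its predictions; the actual MOI algorithms exploit classifier-specific properties not shown here.

    import numpy as np

    def output_information(y_true, y_pred):
        """Mutual information I(C;Y) between class labels and classifier outputs."""
        classes = np.unique(np.concatenate([y_true, y_pred]))
        idx = {c: i for i, c in enumerate(classes)}
        joint = np.zeros((len(classes), len(classes)))
        for c, yhat in zip(y_true, y_pred):
            joint[idx[c], idx[yhat]] += 1
        joint /= joint.sum()                      # empirical P(C, Y)
        pc = joint.sum(axis=1, keepdims=True)     # marginal P(C)
        py = joint.sum(axis=0, keepdims=True)     # marginal P(Y)
        nz = joint > 0
        return float(np.sum(joint[nz] * np.log2(joint[nz] / (pc @ py)[nz])))

    def greedy_eliminate(X, y, K, train_and_predict):
        """Drop features one at a time, keeping output information maximal."""
        features = list(range(X.shape[1]))
        while len(features) > K:
            scores = []
            for f in features:
                rest = [g for g in features if g != f]
                yhat = train_and_predict(X[:, rest], y)   # hypothetical retraining
                scores.append(output_information(y, yhat))
            features.pop(int(np.argmax(scores)))          # cheapest feature to lose
        return features

Since $C$ and $Y$ are both discrete, the small contingency table above is all that is needed, which is what makes the objective inexpensive to evaluate.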

REFERENCES

[1] Battiti, R., "Using Mutual Information for Selecting Features in Supervised Neural Net Training," IEEE Transactions on Neural Networks, vol. 5, no. 4, pp. 537-550, July 1994.
[2] Bollacker, K.D., and Ghosh, J., "Linear Feature Extractors Based on Mutual Information," Proc. ICNN 1996, vol. 3, pp. 1528-1533, Washington D.C., June 1996.
[3] Bridle, J.S., "Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters," in Advances in Neural Information Processing Systems, vol. 2, D.S. Touretzky, Ed., pp. 211-217, San Mateo, CA: Morgan Kaufmann, 1990.
[4] Dash, M., and Liu, H., "Feature selection for classification," Intelligent Data Analysis, vol. 1, pp. 131-156, 1997.
[5] Devroye, L., Gyorfi, L., and Lugosi, G., A Probabilistic Theory of Pattern Recognition, New York: Springer-Verlag, 1996.
[6] Erdogmus, D., and Principe, J., "Lower and Upper Bounds for Misclassification Probability Based on Renyi's Information," Journal of VLSI Signal Processing Systems (Special Issue on Wireless Communications and Blind Signal Processing), 2003.
[7] Fano, R.M., Transmission of Information: A Statistical Theory of Communications, New York: MIT Press and John Wiley & Sons, 1961.
[8] Guyon, I., Weston, J., Barnhill, S., and Vapnik, V., "Gene selection for cancer classification using support vector machines," Machine Learning, 2000.
[9] Guyon, I., and Elisseeff, A., "An Introduction to Variable and Feature Selection," Journal of Machine Learning Research (Special Issue on Variable and Feature Selection), 2003.
[10] Hinton, G.E., "Connectionist Learning Procedures," Artificial Intelligence, vol. 40, no. 1, pp. 143-150, 1989.
[11] Hsu, C.N., Huang, H.J., and Schuschel, D., "The ANNIGMA-wrapper approach to fast feature selection for neural nets," IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 32, no. 2, pp. 207-212, April 2002.
[12] John, G.H., Kohavi, R., and Pfleger, K., "Irrelevant features and the subset selection problem," in Proceedings of the 11th International Conference on Machine Learning, pp. 121-129, San Mateo, CA: Morgan Kaufmann, 1994.
[13] Kira, K., and Rendell, L., "The feature selection problem: Traditional methods and a new algorithm," in Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92), pp. 129-134, Menlo Park, CA: AAAI Press, 1992.
[14] Linsker, R., "Self-organization in a perceptual network," Computer Magazine, vol. 21, pp. 105-117, 1988.
[15] Quinlan, R., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[16] Renyi, A., Probability Theory, New York: American Elsevier Publishing Company, 1970.
[17] Reuters-21578 Text Collection: http://www.daviddlewis.com/resources/testcollections/reuters21578
[18] Rumelhart, D.E., and McClelland, J.L., Eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, Cambridge, MA: MIT Press, 1986.
[19] Schölkopf, B., and Smola, A.J., Learning with Kernels, Cambridge, MA: MIT Press, 2002.
[20] Setiono, R., and Liu, H., "Neural network feature selector," IEEE Transactions on Neural Networks, vol. 8, no. 3, pp. 645-662, 1997.
[21] Shannon, C.E., and Weaver, W., The Mathematical Theory of Communications, Urbana, IL: The University of Illinois Press, 1949.
[22] Sindhwani, V., Bhattacharyya, P., and Rakshit, S., "Information Theoretic Feature Crediting in Multiclass Support Vector Machines," in Proceedings of the First SIAM International Conference on Data Mining, 2001.
[23] Torkkola, K., and Campbell, W.M., "Mutual information in learning feature transformations," in Proceedings of the Seventeenth International Conference on Machine Learning, pp. 1015-1022, San Francisco: Morgan Kaufmann, 2000.
[24] Weston, J., Elisseeff, A., Tipping, M., and Schölkopf, B., "Use of the zero norm with linear models and kernel methods," Journal of Machine Learning Research (Special Issue on Variable and Feature Selection), 2002.
[25] White, A., and Liu, W.Z., "Bias in Information-Based Measures in Decision Tree Induction," Machine Learning, vol. 15, pp. 321-329, 1994.
[26] Vapnik, V., Statistical Learning Theory, New York: Wiley, 1998.
[27] Yang, H., and Moody, J., "Data Visualization and Feature Selection: New Algorithms for Nongaussian Data," in Advances in Neural Information Processing Systems 12, pp. 687-693, MIT Press, 2000.
[28] Yang, Y., and Pedersen, J.O., "A comparative study on feature selection in text categorization," in Proceedings of ICML'97, pp. 412-420, 1997.

Vikas Sindhwani received the Bachelor's degree in Engineering Physics from the Indian Institute of Technology, Bombay, India, in 2001. He is a doctoral student in the Department of Computer Science at the University of Chicago. He has been a visitor at the Centre for Artificial Intelligence and Robotics, Bangalore, India, and the Department of Empirical Inference and Machine Learning, Max Planck Institute for Biological Cybernetics, Tuebingen, Germany. His current research interests are in the areas of Machine Learning, Information Theory and Signal Processing.

Subrata Rakshit (M'96) was born in Jamshedpur, India. He received the B.Tech. in Engineering Physics from IIT Bombay in 1988, and the M.S. and Ph.D. in Electrical Engineering from Caltech, USA, in 1989 and 1994, respectively. After a brief post-doctoral fellowship at the Washington University School of Medicine, St. Louis, MO, he has been working at the Centre for Artificial Intelligence and Robotics (CAIR), Bangalore, India, since December 1994, where he currently heads a research group. His research interests include ANNs, Image Processing and Multi-Sensor Data Fusion.

Dipti Deodhare received the M.Sc. in Computer Science from Pune University, and the M.S. (Engg.) and Ph.D. in Computer Science and Automation from the Indian Institute of Science, Bangalore. She has been working at the Centre for Artificial Intelligence and Robotics (CAIR), Bangalore, since 1991, where she currently heads the AI and NN Group. Her research interests include Neural Networks and Decision Support Systems.

Deniz Erdogmus received the B.S. degree in electrical and electronics engineering and mathematics in 1997, and the M.S. degree in electrical and electronics engineering, with emphasis on systems and control, in 1999, both from the Middle East Technical University, Ankara, Turkey. He received the Ph.D. degree in electrical and computer engineering from the University of Florida, Gainesville, in 2002. He was a Research Engineer at the Defense Industries Research and Development Institute (SAGE), Ankara, from 1997 to 1999. Since 1999, he has been with the Computational NeuroEngineering Laboratory, University of Florida, working under the supervision of Dr. J. C. Principe. His current research interests include information theory and its applications to adaptive systems, and adaptive systems for signal processing, communications, and control. Dr. Erdogmus is a member of the IEEE, Tau Beta Pi, and Eta Kappa Nu.

Jose C. Principe is a Distinguished Professor of Electrical and Computer Engineering and Biomedical Engineering at the University of Florida, where he teaches advanced signal processing, machine learning and artificial neural networks (ANNs) modeling. He is BellSouth Professor and the Founder and Director of the University of Florida Computational NeuroEngineering Laboratory (CNEL). His primary area of interest is the processing of time-varying signals with adaptive neural models. The CNEL Lab has been studying signal and pattern recognition principles based on information-theoretic criteria. Dr. Principe is an IEEE Fellow. He is a member of the ADCOM of the IEEE Signal Processing Society and a Member of the Board of Governors of the International Neural Network Society.

Partha Niyogi received the Bachelor's degree in Electrical Engineering from the Indian Institute of Technology, New Delhi, India, and the M.S. and Ph.D. degrees in machine learning theory from the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology. He is an associate professor in the Computer Science and Statistics departments at the University of Chicago. His research interests are generally in the field of artificial intelligence, and specifically in pattern recognition and machine learning problems that arise in the computational study of human speech and language.