Multi-Layer Perceptrons with Embedded Feature Selection with Application in Cancer Classification∗


Liefeng Bo, Ling Wang and Licheng Jiao
Institute of Intelligent Information Processing, Xidian University, Shaanxi 710071, China

∗ This research was supported by the National Natural Science Foundation of China under Grant No. 60372050.

Abstract — This paper proposes a novel neural network model, named multi-layer perceptrons with embedded feature selection (MLPs-EFS), where feature selection is incorporated into the training procedure. Compared with the classical MLPs, MLPs-EFS add a preprocessing step where each feature of the samples is multiplied by the corresponding scaling factor. By applying a truncated Laplace prior to the scaling factors, feature selection is integrated as a part of MLPs-EFS. Moreover, a variant of MLPs-EFS, named EFS+MLPs, is also given, which performs feature selection more flexibly. Application in cancer classification validates the effectiveness of the proposed algorithms.

Key words — Multi-layer perceptrons (MLPs), neural networks, Laplace prior, feature selection, cancer classification.

I. Introduction

With the wide application of information technology, one has a growing chance to confront high-dimensional data sets with hundreds or thousands of features. Typical examples include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. These data sets often contain many irrelevant and redundant features, so an effective feature selection scheme becomes a key factor in determining the performance of a classifier. The purpose of feature selection is to find the smallest subset of features that results in satisfactory generalization performance. The potential benefits of feature selection are at least three-fold [1-2]:

① Improving the generalization performance of the learning algorithm by eliminating the irrelevant and redundant features;
② Enhancing the comprehensibility of the learning algorithm by identifying the most relevant feature subset;
③ Reducing the cost of future data collection and boosting the test speed of the learning algorithm by utilizing only a small fraction of all features.

In MLPs, the input of a hidden unit is a linear combination of all the components of a sample. This mechanism is highly effective for feature extraction, but not for feature selection, since the weight of each component is often non-zero. Therefore, various new methods have been proposed to select a good feature subset for MLPs. In 1997, Setiono and Liu [3] developed a backward elimination method, named the neural network feature selector (NNFS), for feature selection. NNFS encourages small weights to converge to zero by adding a penalty term to the error function. In 2002, Hsu et al. [4] proposed a wrapper method, named ANNIGMA-wrapper, for fast feature selection. In 2004, Sindhwani et al. [5] presented a maximum output information (MOI-MLP) algorithm for feature selection. Given the vast literature on feature selection for MLPs, we mention only a small fraction of it; the reader can refer to [6] for more information.

In this paper, we present a neural network model, named MLPs-EFS, that incorporates feature selection into the training procedure. Compared with the classical MLPs, MLPs-EFS add a preprocessing step where each feature of the samples is multiplied by the corresponding scaling factor. In order to achieve feature selection, a truncated Laplace prior is applied to the scaling factors. Some researchers [7] have shown that the Laplace prior promotes sparsity and leads to a natural feature selection; the sparsity-promoting nature of the Laplace prior is discussed in the literature [8]. Moreover, we also derive a variant of MLPs-EFS, named EFS+MLPs, that can perform feature selection more flexibly. Application in cancer classification validates the effectiveness of the proposed algorithms.

The rest of this paper is organized as follows. Section II presents the MLPs-EFS and EFS+MLPs algorithms. Section III reports the experimental results of the proposed algorithms and compares them with the results obtained by several existing algorithms. Section IV discusses the contributions of this paper.
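Before turning to the details in Section II, the following minimal sketch (not from the paper; all variable names and sizes are illustrative) shows the preprocessing idea in isolation: each input feature is multiplied by the square of a learnable scaling factor before it enters an otherwise standard MLP.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40                             # feature dimensionality (illustrative)
x = rng.uniform(-1.0, 1.0, n)      # one input sample
alpha = rng.normal(0.0, 0.1, n)    # scaling factors, learned jointly with the network weights

x_scaled = (alpha ** 2) * x        # the preprocessing step: alpha^2 (elementwise) * x

# A feature whose scaling factor is driven towards zero by the truncated Laplace
# prior contributes almost nothing to any hidden unit, which is how feature
# selection is embedded in the training procedure.
```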
II. Multi-layer Perceptrons with Embedded Feature Selection

Consider the three-layer perceptrons with embedded feature selection shown in Fig. 1. The error measure we optimize is the sum of the squared differences between the desired output and the actual output,

E(w, v, \alpha) = \frac{1}{lc} \sum_{m=1}^{c} \left( y^{(m)} - \phi_o\left( H v^{(m)} \right) \right)^{T} \left( y^{(m)} - \phi_o\left( H v^{(m)} \right) \right),    (1)

where we have the following:
• l is the number of training samples.
• c is the number of output units, which is equal to the number of categories.
• y^(m) is the output vector of the m-th class using the "one-vs-all" encoding scheme, which satisfies y_i^(m) = 1 if the i-th sample belongs to the m-th class and y_i^(m) = 0 otherwise.
• v^(m) is the weight vector from the hidden units to the m-th output unit.
• H_ij is the output of the i-th sample at the j-th hidden unit,

H_{ij} = \phi_h\left( \left( w^{(j)} \right)^{T} \left( \alpha^{2} \otimes x^{(i)} \right) + b^{(j)} \right).    (2)

• ⊗ denotes elementwise multiplication.
• x^(i) is the i-th input vector.
• α is the vector of scaling factors of the input. Note that α² denotes α ⊗ α.
• w^(j) is the weight vector from the input layer to the j-th hidden unit.
• b^(j) is the bias term of the j-th hidden unit.
• φ_o and φ_h(·) are the activation functions, which are often the logistic function, the hyperbolic tangent function, or a linear function.

Fig. 1. Three-layer perceptrons with embedded feature selection.

Compared with the classical MLPs, MLPs-EFS add a preprocessing step where each feature of the samples is multiplied by the corresponding scaling factor α_i². After MLPs-EFS are trained, the importance of the features can be identified from the magnitude of the scaling factors: the irrelevant and redundant features are associated with small scaling factors and the relevant features with large scaling factors. Since we hope that the irrelevant and redundant features affect the learning procedure as little as possible, it is necessary to make the small scaling factors close to zero. To achieve this goal, we adopt a truncated Laplace prior, as shown in Fig. 2,

P_1(\alpha) \propto \begin{cases} \exp\left( -\frac{\lambda_1}{n} \left\| \alpha^{2} \right\|_{1} \right) & \text{if } \alpha^{2} \ge 0, \\ 0 & \text{if } \alpha^{2} < 0, \end{cases}    (3)

where n is the feature dimensionality of the samples, ||α²||_1 = Σ_{i=1}^{n} α_i² denotes the 1-norm, and λ_1 controls the intensity of the prior.

Fig. 2. Prior over the scaling factor α_i².

If we do not restrict the weights from the input layer to the hidden layer, they may become very large. To avoid this, we adopt a Gaussian prior over those weights,

P_2(w) \propto \exp\left( -\frac{\lambda_2}{n \times h} \sum_{j=1}^{h} \left\| w^{(j)} \right\|_{2}^{2} \right),    (4)

where h is the number of hidden units, ||w^(j)||_2² = Σ_{i=1}^{n} (w_i^(j))², ||·||_2 denotes the 2-norm, and λ_2 controls the intensity of the prior.

Considering the error measure along with the priors, we obtain a penalized maximum likelihood estimate of the scaling factors and the weights by minimizing the following expression:

f(w, v, \alpha) = E(w, v, \alpha) - \log P_1(\alpha) - \log P_2(w) = E(w, v, \alpha) + \frac{\lambda_1}{n} \sum_{i=1}^{n} \alpha_i^{2} + \frac{\lambda_2}{n \times h} \sum_{j=1}^{h} \sum_{i=1}^{n} \left( w_i^{(j)} \right)^{2}.    (5)
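As a concrete illustration of Eqs. (1)-(5), the following sketch computes the penalized objective for a batch of samples. It assumes logistic hidden units and linear output units; it is not the authors' implementation, and all function and variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(W, b, V, alpha, X, Y, lam1, lam2):
    """W: (h, n) input-to-hidden weights, b: (h,) biases, V: (h, c) hidden-to-output
    weights, alpha: (n,) scaling factors, X: (l, n) samples, Y: (l, c) one-vs-all targets."""
    l, n = X.shape
    h, c = W.shape[0], V.shape[1]

    # Eq. (2): H_ij = phi_h( w^(j)^T (alpha^2 ⊗ x^(i)) + b^(j) ), with logistic phi_h
    H = sigmoid((X * alpha ** 2) @ W.T + b)
    # Eq. (1) with a linear output activation phi_o
    R = Y - H @ V
    E = np.sum(R ** 2) / (l * c)

    penalty_alpha = (lam1 / n) * np.sum(alpha ** 2)    # from the truncated Laplace prior, Eq. (3)
    penalty_w = (lam2 / (n * h)) * np.sum(W ** 2)      # from the Gaussian prior, Eq. (4)
    return E + penalty_alpha + penalty_w               # Eq. (5)

# Tiny usage example with random parameters and data.
rng = np.random.default_rng(0)
l, n, h, c = 20, 5, 4, 3
W, b, V = rng.normal(size=(h, n)), np.zeros(h), rng.normal(size=(h, c))
alpha, X = rng.normal(size=n), rng.normal(size=(l, n))
Y = np.eye(c)[rng.integers(0, c, l)]
print(objective(W, b, V, alpha, X, Y, lam1=1.0, lam2=1.0))
```

In MLPs-EFS this value would be minimized jointly over W, b, V, and alpha, for example with the SCG optimizer discussed next.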
Like the original MLPs, the optimization problem that MLPs-EFS confront is unconstrained. Hence, all the optimization algorithms applied to MLPs are also suitable for MLPs-EFS. In order to speed up the training procedure, instead of the standard backpropagation (BP) algorithm, a variant of the conjugate gradient method, the scaled conjugate gradient (SCG) [9], is used to find a local minimum of the objective function f(w, v, α). SCG is fully automated, involving no user-dependent parameters and avoiding a time-consuming line search. An experimental study performed by Moller in 1993 showed that SCG yields a speedup of at least an order of magnitude relative to BP. Demuth's test report [10] also showed that SCG performs well over a wide variety of problems, particularly for networks with a large number of weights.

MLPs-EFS add the scaling factors; hence, the number of free parameters is slightly larger than that of MLPs. However, the training time of MLPs-EFS is comparable with that of MLPs because the number of free parameters added by MLPs-EFS is far smaller than the total number of free parameters of MLPs. For example, for a network with ten hidden units and three output units, the number of free parameters of MLPs-EFS is only 1.08 times that of MLPs if the feature dimensionality of the samples is 10. A similar conclusion also holds for the test time.

After MLPs-EFS are trained, there exist some scaling factors that are very close to, but not exactly, zero. To remove the features with small scaling factors from the network, we need a variant of MLPs-EFS, named EFS+MLPs, which can be described as follows:
1. Train MLPs-EFS with the full set of features;
2. Rank the importance of the features according to the magnitude of the scaling factors;
3. Reserve the d most important features;
4. Train MLPs with only the reserved d features.

III. Comparisons with Existing Algorithms

In order to know how well MLPs-EFS and EFS+MLPs …

The points with y = 1 lie in the first or third quadrant and those with y = −1 in the second or fourth quadrant. The optimal decision function of this problem is non-linear, and the highest recognition rate of linear classifiers is only 66.67%.

XOR1: XOR1 is a variant of the XOR problem (7), derived by adding 18 redundant features, which are copies of the first two features, and 20 irrelevant features, which are drawn from the uniform distribution between -1 and 1.
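To make the XOR1 construction and the EFS+MLPs selection steps concrete, the following sketch (not from the paper) builds an XOR1-style data set and performs the ranking and reservation of steps 2-3. The learned scaling factors are replaced by a random stand-in, and all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
l = 600  # number of samples (illustrative)

# XOR-style labels: y = 1 in the first/third quadrant, y = -1 otherwise.
X_info = rng.uniform(-1.0, 1.0, (l, 2))
y = np.where(X_info[:, 0] * X_info[:, 1] > 0, 1, -1)

X_redundant = np.tile(X_info, (1, 9))             # 18 redundant copies of the first two features
X_irrelevant = rng.uniform(-1.0, 1.0, (l, 20))    # 20 irrelevant uniform features
X = np.hstack([X_info, X_redundant, X_irrelevant])  # 40 features in total

# EFS+MLPs, steps 2-3: rank features by the magnitude of the learned scaling
# factors and reserve the d most important ones. Here alpha is only a random
# stand-in for the scaling factors obtained from training MLPs-EFS (step 1).
alpha = rng.normal(0.0, 0.1, X.shape[1])
d = 2
keep = np.argsort(alpha ** 2)[::-1][:d]
X_reduced = X[:, keep]   # step 4: retrain a plain MLP on these reserved features
```

In the actual algorithm, alpha would come from a trained MLPs-EFS model rather than a random draw, and a plain MLP would then be retrained on X_reduced.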