Feature Selection and Dimension Reduction
DATA SCIENCE REPORT SERIES
Feature Selection and Data Reduction (DRAFT)

Patrick Boily1,2,3,4, Olivier Leduc1, Andrew Macfie3, Aditya Maheshwari3, Maia Pelletier1

Abstract
Data mining is the collection of processes by which we can extract useful insights from data. Inherent in this definition is the idea of data reduction: useful insights (whether in the form of summaries, sentiment analyses, etc.) ought to be "smaller" and "more organized" than the original raw data. The challenges presented by high data dimensionality (the so-called curse of dimensionality) must be addressed in order to achieve insightful and interpretable analytical results. In this report, we introduce the basic principles of dimensionality reduction and a number of feature selection methods (filter, wrapper, regularization), discuss some current advanced topics (SVD, spectral feature selection, UMAP), and provide examples (with code).

Keywords
feature selection, dimension reduction, curse of dimensionality, principal component analysis, manifold hypothesis, manifold learning, regularization, subset selection, spectral feature selection, uniform manifold approximation and projection

Funding Acknowledgement
Parts of this report were funded by a University of Ottawa grant to develop teaching material in French (2019-2020). These were subsequently translated into English before being incorporated into this document.

1 Department of Mathematics and Statistics, University of Ottawa, Ottawa
2 Sprott School of Business, Carleton University, Ottawa
3 Idlewyld Analytics and Consulting