Gradient Boosted Feature Selection

Total Pages: 16

File Type: PDF, Size: 1020 KB

Gradient Boosted Feature Selection

Zhixiang (Eddie) Xu* (Washington University in St. Louis, One Brookings Dr., St. Louis, USA; [email protected])
Gao Huang (Tsinghua University, 30 Shuangqing Rd., Beijing, China; [email protected])
Kilian Q. Weinberger (Washington University in St. Louis, One Brookings Dr., St. Louis, USA; [email protected])
Alice X. Zheng* (GraphLab, 936 N. 34th St. Ste 208, Seattle, USA; [email protected])
*Work done while at Microsoft Research

KDD'14, August 24-27, 2014, New York, NY, USA. ACM 978-1-4503-2956-9/14/08. http://dx.doi.org/10.1145/2623330.2623635

ABSTRACT
A feature selection algorithm should ideally satisfy four conditions: reliably extract relevant features; be able to identify non-linear feature interactions; scale linearly with the number of features and dimensions; allow the incorporation of known sparsity structure. In this work we propose a novel feature selection algorithm, Gradient Boosted Feature Selection (GBFS), which satisfies all four of these requirements. The algorithm is flexible, scalable, and surprisingly straight-forward to implement as it is based on a modification of Gradient Boosted Trees. We evaluate GBFS on several real world data sets and show that it matches or outperforms other state of the art feature selection algorithms. Yet it scales to larger data set sizes and naturally allows for domain-specific side information.

Categories and Subject Descriptors: H.3 [Information Storage and Retrieval]: Miscellaneous; I.5.2 [Pattern Recognition]: Design Methodology - Feature evaluation and selection

General Terms: Learning

Keywords: Feature selection; Large-scale; Gradient boosting

1. INTRODUCTION
Feature selection (FS) [8] is an important problem in machine learning. In many applications, e.g., bio-informatics [21] or neuroscience [12], researchers hope to gain insight by analyzing how a classifier can predict a label and what features it uses. Moreover, effective feature selection leads to parsimonious classifiers that require less memory [25] and are faster to train and test [5]. It can also reduce feature extraction costs [29, 30] and lead to better generalization [9].

Linear feature selection algorithms such as LARS [7] are highly effective at discovering linear dependencies between features and labels. However, they fail when features interact in nonlinear ways. Nonlinear feature selection algorithms, such as Random Forest [9] or recently introduced kernel methods [32, 23], can cope with nonlinear interactions, but their computational and memory complexity typically grow super-linearly with the training set size. As data sets grow in size, this is increasingly problematic. Balancing the twin goals of scalability and nonlinear feature selection is still an open problem.

In this paper, we focus on the scenario where data sets contain a large number of samples. Specifically, we aim to perform efficient feature selection when the number of data points is much larger than the number of features (n >> d). We start with the (NP-hard) feature selection problem that also motivated LARS [7] and LASSO [26]. But instead of using a linear classifier and approximating the feature selection cost with an l1-norm, we follow [31] and use gradient boosted regression trees [7], for which greedy approximations exist [2].

The resulting algorithm is surprisingly simple yet very effective. We refer to it as Gradient Boosted Feature Selection (GBFS). Following the gradient boosting framework, trees are built with the greedy CART algorithm [2]. Features are selected sparsely following an important change in the impurity function: splitting on new features is penalized by a cost λ > 0, whereas re-use of previously selected features incurs no additional penalty.
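To make the impurity change concrete, here is a minimal sketch (our own simplification, not the paper's implementation) of a CART-style split search with squared-error impurity, where splitting on a not-yet-selected feature pays an extra cost λ while re-used features split for free; all function names are ours:

```python
import numpy as np

def squared_error_gain(y, x, threshold):
    # Impurity reduction from splitting responses y on x <= threshold.
    left, right = y[x <= threshold], y[x > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    sse = lambda v: ((v - v.mean()) ** 2).sum()
    return sse(y) - sse(left) - sse(right)

def best_split(X, residuals, selected, lam=0.1):
    """Pick the split maximizing gain minus the GBFS-style feature cost:
    lam for a feature not yet in `selected`, zero for a re-used one."""
    best = (None, None, -np.inf)
    for j in range(X.shape[1]):
        penalty = 0.0 if j in selected else lam
        for t in np.unique(X[:, j])[:-1]:
            g = squared_error_gain(residuals, X[:, j], t) - penalty
            if g > best[2]:
                best = (j, t, g)
    return best  # (feature index, threshold, penalized gain)
```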
GBFS has several compelling properties. 1. As it learns an ensemble of regression trees, it can naturally discover nonlinear interactions between features. 2. In contrast to, e.g., FS with Random Forests, it unifies feature selection and classification into a single optimization. 3. In contrast to existing nonlinear FS algorithms, its time and memory complexity scales as O(dn), where d denotes the feature dimensionality and n the number of data points, and it is very fast in practice. 4. GBFS can naturally incorporate pre-specified feature cost structures or side-information, e.g., select bags of features or focus on regions of interest, similar to generalized lasso in linear FS [19].

We evaluate this algorithm on several real-world data sets of varying difficulty and size, and we demonstrate that GBFS tends to match or outperform the accuracy and feature selection trade-off of Random Forest Feature Selection, the current state of the art in nonlinear feature selection.

We showcase the ability of GBFS to naturally incorporate side-information about inter-feature dependencies on a real-world biological classification task [1]. Here, features are grouped into nine pre-specified bags with biological meaning. GBFS can easily adapt to this setting and select entire feature bags. The resulting classifier matches the best accuracy of competing methods (trained on many features) with only a single bag of features.

2. RELATED WORK
One of the most widely used feature selection algorithms is Lasso [26]. It minimizes the squared loss with l1 regularization on the coefficient vector, which encourages sparse solutions. Although scalable to very large data sets, Lasso models only linear correlations between features and labels and cannot discover non-linear feature dependencies.

[17] propose the Minimum Redundancy Maximum Relevance (mRMR) algorithm, which selects a subset of the most responsive features that have high mutual information with labels. Their objective function also penalizes selecting redundant features. Though elegant, computing mutual information when the number of instances is large is intractable, and thus the algorithm does not scale. HSIC Lasso [32], on the other hand, introduces non-linearity by combining multiple kernel functions, each of which uses a single feature. The resulting convex optimization problem aligns this kernel with a "perfect" label kernel. The algorithm requires constructing kernel matrices for all features, so its time and memory complexity scale quadratically with the input data set size. Moreover, both algorithms separate feature selection and classification, and require additional time and computation for training classifiers on the selected features.

Several other works avoid expensive kernel computation while maintaining non-linearity. Grafting [18] combines l1 and l0 regularization with a non-linear classifier based on a non-convex variant of the multi-layer perceptron. Feature Selection for Ranking using Boosted Trees [15] selects the top features with the highest relative importance scores. [27] and [9] use Random Forests. Finally, while not a feature selection method, [31] employ Gradient Boosted Trees.

Vectors are denoted in lowercase bold and matrices in capital bold (F) font. Specific entries in vectors or matrices are scalars and follow the corresponding convention. The data set consists of input vectors {x_1, ..., x_n} ∈ R^d with corresponding labels {y_1, ..., y_n} ∈ Y drawn from an unknown distribution. The labels can be binary, categorical (multi-class) or real-valued (regression). For the sake of clarity, we focus on binary classification, Y ∈ {-1, +1}, although the algorithm can be extended to multi-class and regression settings as well.

3.1 Feature selection with the l1 norm
Lasso [26] combines linear classification and l1 regularization:

min_w Σ_(x_i, y_i) ℓ(x_i, y_i; w) + λ|w|_1.  (1)

In its original formulation, ℓ(·) is defined to be the squared loss, ℓ(x_i, y_i; w) = (w^T x_i - y_i)^2. However, for the sake of feature selection, other loss functions are possible. In the binary classification setting, where y_i ∈ {-1, +1}, we use the better-suited log-loss, ℓ(x_i, y_i; w) = log(1 + exp(-y_i w^T x_i)) [11].
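As a quick illustration of objective (1) with the log-loss, a small NumPy sketch (the naming is ours, not from the paper):

```python
import numpy as np

def lasso_logloss_objective(w, X, y, lam):
    """Objective (1) with log-loss: sum_i log(1 + exp(-y_i w^T x_i)) + lam * |w|_1.
    X: (n, d) inputs; y: (n,) labels in {-1, +1}; lam: regularization strength."""
    margins = y * (X @ w)
    # log(1 + exp(-m)) computed stably as logaddexp(0, -m)
    loss = np.logaddexp(0.0, -margins).sum()
    return loss + lam * np.abs(w).sum()
```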
3.2 The capped l1 norm
l1 regularization serves two purposes: it regularizes the classifier against overfitting, and it induces sparsity for feature selection. Unfortunately, these two effects of the l1-norm are inherently tied, and there is no way to regulate the impact of either one.

[33] introduce the capped l1 norm, defined by the element-wise operation

q(w_i) = min(|w_i|, ε).  (2)

Its advantage over the standard l1 norm is that once a feature is extracted, its use is not penalized further; i.e., it penalizes using many features but does not reward small weights. This is a much better approximation of the l0 norm, which only penalizes feature use without interfering with the magnitude of the weights. When ε is small enough, i.e., ε ≤ min_i |w_i|, we can compute the exact number of extracted features as |q(w)|_1/ε. In other words, penalizing q(w) is a close proxy for penalizing the number of extracted features. However, the capped l1 norm is not convex and therefore not easy to optimize.

The capped l1 norm can be combined with a regular l1 (or l2) norm, where one can control the trade-off between feature extraction and regularization by adjusting the corresponding regularization parameters, µ, λ ≥ 0:

min_w Σ_(x_i, y_i) ℓ(x_i, y_i; w) + λ|w|_1 + µ|q(w)|_1.  (3)

Here q(w) denotes the vector [q(w_1), ..., q(w_d)].
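A short NumPy sketch of the capped l1 norm and the feature-counting property described above (assuming ε is no larger than the smallest nonzero |w_i|):

```python
import numpy as np

def capped_l1(w, eps):
    """Element-wise capped l1 norm q(w_i) = min(|w_i|, eps) from Eq. (2)."""
    return np.minimum(np.abs(w), eps)

w = np.array([0.0, 0.7, -0.03, 1.2])
eps = 0.01                      # eps <= smallest nonzero |w_i|
q = capped_l1(w, eps)
print(q.sum() / eps)            # ~= 3, the number of nonzero (selected) features
```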
Recommended publications
  • Oriented Principal Component Analysis for Feature Extraction
    Oriented Principal Component Analysis for Feature Extraction

    Abstract- Principal Components Analysis (PCA) is a popular unsupervised learning technique that is often used in pattern recognition for feature extraction. Some variants have been proposed to supervise PCA with class information, usually based on heuristics. A more principled approach is presented in this paper. Suppose that we have a pattern recogniser composed of a linear network for feature extraction and a classifier. Instead of designing each module separately, we perform global training in which both learning systems cooperate to find those components (Oriented PCs) useful for classification. Experimental results, using artificial and real data, show the potential of the proposed learning system for classification. Since the linear neural network finds the optimal directions to better discriminate classes, the globally trained pattern recogniser needs fewer Oriented PCs (OPCs) than PCs, so the classifier's complexity (e.g. capacity) is lower. In this way, the global system benefits from better generalisation performance.

    Index Terms- Oriented Principal Components Analysis, Principal Components Neural Networks, Co-operative Learning, Learning to Learn Algorithms, Feature Extraction, Online gradient descent, Pattern Recognition, Compression.
    Abbreviations- PCA: Principal Components Analysis; OPC: Oriented Principal Components; PCNN: Principal Components Neural Networks.

    1. Introduction
    In classification, observations belong to different classes. Based on some prior knowledge about the problem and on the training set, a pattern recogniser is constructed for assigning future observations to one of the existing classes. The typical method of building this system consists in dividing it into two main blocks that are usually trained separately: the feature extractor and the classifier.
  • Soft Gradient Boosting Machine (arXiv:2006.04059v1 [cs.LG], 7 Jun 2020)
    Soft Gradient Boosting Machine
    Ji Feng (1,2), Yi-Xuan Xu (1,3), Yuan Jiang (3), Zhi-Hua Zhou (3)
    [email protected], {xuyx, jiangy, [email protected]
    1 Sinovation Ventures AI Institute; 2 Baiont Technology; 3 National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
    arXiv:2006.04059v1 [cs.LG] 7 Jun 2020

    Abstract: Gradient Boosting Machine has proven to be one successful function approximator and has been widely used in a variety of areas. However, since the training procedure of each base learner has to take the sequential order, it is infeasible to parallelize the training process among base learners for speed-up. In addition, under online or incremental learning settings, GBMs achieve sub-optimal performance because the previously trained base learners cannot adapt to the environment once trained. In this work, we propose the soft Gradient Boosting Machine (sGBM): by wiring multiple differentiable base learners together and injecting both local and global objectives inspired by gradient boosting, all base learners can be jointly optimized with linear speed-up. When using differentiable soft decision trees as base learners, such a device can be regarded as an alternative version of (hard) gradient boosting decision trees with extra benefits. Experimental results show that sGBM enjoys much higher time efficiency with better accuracy, given the same base learner, in both on-line and off-line settings.

    1. Introduction
    Gradient Boosting Machine (GBM) [Fri01] has proven to be one successful function approximator and has been widely used in a variety of areas [BL07, CC11]. The basic idea is to train a series of base learners that minimize some predefined differentiable loss function in a sequential fashion.
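    The joint-update idea can be sketched in a few lines with linear base learners standing in for the paper's differentiable soft trees; in this toy NumPy version (our own simplification), every learner has a local squared-error objective against the residual of the ensemble prefix before it, and all learners are updated at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

M, lr, epochs = 4, 0.05, 300                  # base learners, step size, steps
W = [rng.normal(scale=0.1, size=5) for _ in range(M)]

for _ in range(epochs):
    preds = [X @ w for w in W]                # all learners evaluated together
    prefix = np.cumsum(preds, axis=0)         # running ensemble outputs
    for j in range(M):
        # local objective: learner j fits the residual left by learners 0..j-1
        residual = y - (prefix[j - 1] if j > 0 else 0.0)
        grad = X.T @ (preds[j] - residual) / len(y)
        W[j] -= lr * grad                     # every learner updates each epoch

print("train MSE:", np.mean((prefix[-1] - y) ** 2))
```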
  • Feature Selection/Extraction
    Feature Selection/Extraction: Dimensionality Reduction

    • Solutions to a number of problems in Pattern Recognition can be achieved by choosing a better feature space.
    • Problems and solutions:
      – Curse of dimensionality: the number of examples needed to train a classifier grows exponentially with the number of dimensions; overfitting and generalization performance suffer.
      – What features best characterize a class? What words best characterize a document class? Do subregions characterize protein function?
      – What features are critical for performance?
      – Inefficiency: fewer features mean reduced complexity and run-time.
      – Can't visualize: reducing dimensionality allows 'intuiting' the nature of the problem solution.
    • Curse of dimensionality: with the same number of examples, more of the available space is filled when the dimensionality is low.
    • Selection vs. extraction: two general approaches for dimensionality reduction.
      – Feature extraction: transforming the existing features into a lower dimensional space (PCA, LDA (Fisher's), nonlinear PCA (kernel and other varieties), the first layer of many networks).
      – Feature selection: selecting a subset of the existing features without a transformation.
    • Feature Subset Selection (FSS): although FS is a special case of feature extraction, in practice it is quite different. FSS searches for a subset that minimizes some cost function (e.g. test error), and it has a unique set of methodologies.
    • Definition: given a feature set x = {x_i | i = 1...N}, find a subset x_M = {x_i1, x_i2, ..., x_iM}, with M < N, that optimizes an objective function J(Y), e.g. P(correct classification); a greedy sketch is given below.
    • Why feature selection rather than the more general feature extraction methods? Feature selection is necessary in a number of situations: features may be expensive to obtain; you may want to extract meaningful rules from your classifier; when you transform or project, measurement units (length, weight, etc.) are lost; features may not be numeric.
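    As a small illustration of the FSS definition above, a greedy forward-selection sketch (a standard heuristic, not one prescribed by these slides), using cross-validated accuracy as the objective J:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, M):
    """Greedy forward feature-subset selection: grow the subset one feature
    at a time, each time adding the feature that maximizes cross-validated
    accuracy (our stand-in for the objective J)."""
    selected = []
    for _ in range(M):
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                     X[:, selected + [j]], y, cv=5).mean()
                  for j in candidates}
        selected.append(max(scores, key=scores.get))
    return selected
```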
  • Feature Extraction (PCA & LDA)
    Feature Extraction (PCA & LDA)
    CE-725: Statistical Pattern Recognition, Sharif University of Technology, Spring 2013, Soleymani

    Outline: What is feature extraction? Feature extraction algorithms; linear methods. Unsupervised: Principal Component Analysis (PCA), also known as the Karhunen-Loeve (KL) transform. Supervised: Linear Discriminant Analysis (LDA), also known as Fisher's Discriminant Analysis (FDA).

    Dimensionality reduction: feature selection vs. feature extraction. Feature selection selects a subset of a given feature set; feature extraction (e.g., PCA, LDA) applies a linear or non-linear transform to the original feature space. Example: for X ∈ R^(N×4), the linear map X' = XW^T with W = [0 1 0 0; 0 0 1 0] gives X' ∈ R^(N×2), in which the second and third features are selected.

    Feature extraction maps the original data onto a lower-dimensional space. The criterion for feature extraction can differ based on the problem setting: in an unsupervised task, minimize the information loss (reconstruction error); in a supervised task, maximize the class discrimination in the projected space. As discussed in the previous lecture, feature selection can be considered a special form of feature extraction in which only a subset of the original features is used. Unsupervised feature extraction learns a mapping R^d → R^(d') (with d' < d) from the data alone; supervised feature extraction learns such a mapping using the labels as well.
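    Both views, selection as a sparse linear map and extraction as a learned projection, fit in a few lines of NumPy; the matrix W below reproduces the slide's select-features-2-and-3 example, and the SVD lines are a standard PCA sketch:

```python
import numpy as np

N = 100
X = np.random.randn(N, 4)

# Feature selection as a linear map: W picks the second and third columns.
W = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0]])
X_sel = X @ W.T                       # shape (N, 2)

# Unsupervised feature extraction (PCA): project onto the top-2 principal axes.
Xc = X - X.mean(axis=0)               # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_pca = Xc @ Vt[:2].T                 # shape (N, 2)
```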
  • Feature Extraction for Image Selection Using Machine Learning
    Master of Science Thesis in Electrical Engineering, Department of Electrical Engineering, Linköping University, 2017
    Feature extraction for image selection using machine learning
    Matilda Lorentzon
    LiTH-ISY-EX--17/5097--SE
    Supervisor: Marcus Wallenberg, ISY, Linköping University; Tina Erlandsson, Saab Aeronautics. Examiner: Lasse Alfredsson, ISY, Linköping University. Computer Vision Laboratory, Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden. Copyright © 2017 Matilda Lorentzon

    Abstract: During flights with manned or unmanned aircraft, continuous recording can result in a very high number of images to analyze and evaluate. To simplify image analysis and to minimize data link usage, appropriate images should be suggested for transfer and further analysis. This thesis investigates features used for selection of images worthy of further analysis using machine learning. The selection is done based on the criteria of having good quality, salient content and being unique compared to the other selected images. The investigation is approached by implementing two binary classifications, one regarding content and one regarding quality. The classifications are made using support vector machines. For each of the classifications three feature extraction methods are performed and the results are compared against each other. The feature extraction methods used are histograms of oriented gradients, features from the discrete cosine transform domain and features extracted from a pre-trained convolutional neural network. The images classified as both good and salient are then clustered based on similarity measures retrieved using color coherence vectors. One image from each cluster is retrieved and those are the resulting images from the image selection.
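    For flavor, a hedged sketch of one of the three pipelines the thesis compares: histogram-of-oriented-gradients features feeding a support vector machine (scikit-image and scikit-learn calls; the parameter values are ours, not the thesis's):

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def hog_features(images):
    """HOG descriptor per grayscale image (parameters are illustrative)."""
    return np.array([hog(img, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for img in images])

clf = SVC(kernel="rbf")
# With hypothetical arrays train_images / train_labels:
# clf.fit(hog_features(train_images), train_labels)
# predictions = clf.predict(hog_features(new_images))
```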
  • Time Series Feature Extraction for Industrial Big Data (IIoT) Applications
    Time Series Feature Extraction for Industrial Big Data (IIoT) Applications [1]
    Term paper for the subject "Current Topics of Data Engineering" in the course of studies MSc. Data Engineering at Jacobs University Bremen
    Markus Sun, born on 07.11.1994 in Achim. Student/registration number: 30003082
    1st Reviewer: Prof. Dr. Adalbert F.X. Wilhelm; 2nd Reviewer: Prof. Dr. Stefan Kettemann

    Table of Contents: Abstract; 1 Introduction; 2 Main Topic - Terminology; 2.1 Sensor Data; 2.2 Feature Selection and Feature Extraction; 2.2.1 Variance Threshold; 2.2.2 Correlation Threshold; 2.2.3 Principal Component Analysis (PCA); 2.2.4 Linear Discriminant Analysis
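    The two simplest filters named in the table of contents, variance and correlation thresholds, can be sketched with pandas as follows (function names and thresholds are ours, not the paper's):

```python
import numpy as np
import pandas as pd

def variance_threshold(df: pd.DataFrame, t: float) -> pd.DataFrame:
    """Drop features whose variance falls below t."""
    return df.loc[:, df.var() > t]

def correlation_threshold(df: pd.DataFrame, t: float) -> pd.DataFrame:
    """Drop one feature out of every pair whose absolute correlation exceeds t."""
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [c for c in upper.columns if (upper[c] > t).any()]
    return df.drop(columns=drop)
```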
  • Machine Learning Feature Extraction Based on Binary Pixel Quantification Using Low-Resolution Images for Application of Unmanned Ground Vehicles in Apple Orchards
    agronomy Article
    Machine Learning Feature Extraction Based on Binary Pixel Quantification Using Low-Resolution Images for Application of Unmanned Ground Vehicles in Apple Orchards
    Hong-Kun Lyu 1,*, Sanghun Yun 1 and Byeongdae Choi 1,2
    1 Division of Electronics and Information System, ICT Research Institute, Daegu Gyeongbuk Institute of Science and Technology (DGIST), Daegu 42988, Korea; [email protected] (S.Y.); [email protected] (B.C.)
    2 Department of Interdisciplinary Engineering, Daegu Gyeongbuk Institute of Science and Technology (DGIST), Daegu 42988, Korea
    * Correspondence: [email protected]; Tel.: +82-53-785-3550
    Received: 20 November 2020; Accepted: 4 December 2020; Published: 8 December 2020

    Abstract: Deep learning and machine learning (ML) technologies have been implemented in various applications, and various agricultural technologies are being developed based on image-based object recognition technology. We propose an orchard-environment free-space recognition technology suitable for developing small-scale agricultural unmanned ground vehicle (UGV) autonomous mobile equipment using a low-cost, lightweight processor. We designed an algorithm to minimize the amount of input data to be processed by the ML algorithm through low-resolution grayscale images and image binarization. In addition, we propose an ML feature extraction method based on binary pixel quantification that can be applied to an ML classifier to detect free space for autonomous movement of UGVs from binary images. Here, the ML features are extracted by detecting the local-lowest points in segments of a binarized image and by defining 33 variables, including the local-lowest points, to detect the bottom of a tree trunk. We trained six ML models in order to select a suitable model for trunk-bottom detection, and we analyzed and compared the performance of the trained models.
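    A rough stand-in (our own simplification, not the paper's 33-variable feature set) for detecting local-lowest points in vertical segments of a binarized image:

```python
import numpy as np

def local_lowest_points(binary_img, n_segments=16):
    """For each vertical segment of a 0/1 image, return the row index of the
    lowest foreground pixel; a crude proxy for the paper's local-lowest points."""
    h, w = binary_img.shape
    points = []
    for cols in np.array_split(np.arange(w), n_segments):
        rows, _ = np.nonzero(binary_img[:, cols])
        points.append(rows.max() if rows.size else -1)  # -1 marks empty segments
    return np.array(points)
```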
  • Towards Reproducible Meta-Feature Extraction
    Journal of Machine Learning Research 21 (2020) 1-5. Submitted 5/19; Revised 12/19; Published 1/20
    MFE: Towards reproducible meta-feature extraction
    Edesio Alcobaça, Felipe Siqueira, Adriano Rivolli, Luís P. F. Garcia, Jefferson T. Oliva, André C. P. L. F. de Carvalho
    Institute of Mathematical and Computer Sciences, University of São Paulo, Av. Trabalhador São-carlense, 400, São Carlos, São Paulo 13560-970, Brazil
    Editor: Alexandre Gramfort

    Abstract: Automated recommendation of machine learning algorithms is receiving a great deal of attention, not only because it can recommend the most suitable algorithms for a new task, but also because it can support efficient hyper-parameter tuning, leading to better machine learning solutions. The automated recommendation can be implemented using meta-learning, i.e., learning from previous learning experiences, to create a meta-model able to associate a data set with the predictive performance of machine learning algorithms. Although a large number of publications report the use of meta-learning, reproduction and comparison of meta-learning experiments is a difficult task. The literature lacks extensive and comprehensive public tools that enable reproducible investigation of the different meta-learning approaches. An alternative to deal with this difficulty is to develop a meta-feature extractor package with the main characterization measures, following uniform guidelines that facilitate the use and inclusion of new meta-features. In this paper, we propose two Meta-Feature Extractor (MFE) packages, written in Python and R respectively, to fill this gap.
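    The Python package described here is pymfe; a short usage sketch (API as documented for the package; the choice of meta-feature groups is ours):

```python
# pip install pymfe
from sklearn.datasets import load_iris
from pymfe.mfe import MFE

X, y = load_iris(return_X_y=True)
mfe = MFE(groups=["general", "statistical", "info-theory"])
mfe.fit(X, y)
names, values = mfe.extract()       # parallel lists of meta-feature names/values
print(dict(zip(names, values)))
```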
  • XGBoost: A Scalable Tree Boosting System
    XGBoost: A Scalable Tree Boosting System
    Tianqi Chen and Carlos Guestrin, University of Washington ([email protected], [email protected])

    Abstract: Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and a weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

    Besides being used as a stand-alone predictor, it is also incorporated into real-world production pipelines for ad click-through rate prediction [15]. Finally, it is the de-facto choice of ensemble method and is used in challenges such as the Netflix prize [3]. The system is available as an open source package. The impact of the system has been widely recognized in a number of machine learning and data mining challenges. Take the challenges hosted by the machine learning competition site Kaggle for example: among the 29 challenge-winning solutions published at Kaggle's blog during 2015, 17 solutions used XGBoost. Among these solutions, eight solely used XGBoost to train the model, while most others combined XGBoost with neural nets in ensembles.

    Keywords: Large-scale Machine Learning
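    A minimal usage sketch of the open-source package via its scikit-learn-style interface (the hyperparameter values are ours, chosen only for illustration):

```python
# pip install xgboost
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import xgboost as xgb

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```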
  • Generalized Feature Extraction for Structural Pattern Recognition in Time-Series Data Robert T
    Generalized Feature Extraction for Structural Pattern Recognition in Time-Series Data
    Robert T. Olszewski, February 2001, CMU-CS-01-108
    School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
    Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Thesis Committee: Roy Maxion, co-chair; Dan Siewiorek, co-chair; Christos Faloutsos; David Banks, DOT. Copyright © 2001 Robert T. Olszewski

    This research was sponsored by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL) under grant #F30602-96-1-0349, the National Science Foundation (NSF) under grant #IRI-9224544, and the Air Force Laboratory Graduate Fellowship Program sponsored by the Air Force Office of Scientific Research (AFOSR) and conducted by the Southeastern Center for Electrical Engineering Education (SCEEE). The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of DARPA, AFRL, NSF, AFOSR, SCEEE, or the U.S. government.

    Keywords: Structural pattern recognition, classification, feature extraction, time series, Fourier transformation, wavelets, semiconductor fabrication, electrocardiography.

    Abstract: Pattern recognition encompasses two fundamental tasks: description and classification. Given an object to analyze, a pattern recognition system first generates a description of it (i.e., the pattern) and then classifies the object based on that description (i.e., the recognition). Two general approaches for implementing pattern recognition systems, statistical and structural, employ different techniques for description and classification. Statistical approaches to pattern recognition use decision-theoretic concepts to discriminate among objects belonging to different groups based upon their quantitative features.
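    As a hedged illustration of the description step with Fourier features (a toy feature set of our own, not the thesis's generalized features):

```python
import numpy as np

def fourier_features(series, k=8):
    """Describe a time series by the magnitudes of its k lowest non-constant
    Fourier coefficients; a common quantitative description of a signal."""
    spectrum = np.fft.rfft(series - np.mean(series))
    return np.abs(spectrum[1:k + 1])

t = np.linspace(0, 1, 256, endpoint=False)
x = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.randn(256)
print(fourier_features(x))   # the 5 Hz component dominates
```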
  • Feature Extraction Using Dimensionality Reduction Techniques: Capturing the Human Perspective
    Wright State University, CORE Scholar, Browse all Theses and Dissertations, 2015
    Feature Extraction using Dimensionality Reduction Techniques: Capturing the Human Perspective
    Ashley B. Coleman, Wright State University
    Follow this and additional works at: https://corescholar.libraries.wright.edu/etd_all. Part of the Computer Engineering Commons and the Computer Sciences Commons.
    Repository Citation: Coleman, Ashley B., "Feature Extraction using Dimensionality Reduction Techniques: Capturing the Human Perspective" (2015). Browse all Theses and Dissertations. 1630. https://corescholar.libraries.wright.edu/etd_all/1630
    This Thesis is brought to you for free and open access by the Theses and Dissertations at CORE Scholar. It has been accepted for inclusion in Browse all Theses and Dissertations by an authorized administrator of CORE Scholar. For more information, please contact [email protected].

    FEATURE EXTRACTION USING DIMENSIONALITY REDUCTION TECHNIQUES: CAPTURING THE HUMAN PERSPECTIVE
    A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science by Ashley Coleman, B.S., Wright State University, 2010. 2015, Wright State University.

    WRIGHT STATE UNIVERSITY GRADUATE SCHOOL, December 11, 2015. I HEREBY RECOMMEND THAT THE THESIS PREPARED UNDER MY SUPERVISION BY Ashley Coleman ENTITLED Feature Extraction using Dimensionality Reduction Techniques: Capturing the Human Perspective BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science. Pascal Hitzler, Ph.D., Thesis Advisor; John Gallagher, Ph.D., Thesis Advisor; Mateen Rizki, Ph.D., Chair, Department of Computer Science; Committee on Final Examination; Robert E. W. Fyffe, Ph.D., Vice President for Research and Dean of the Graduate School.

    Abstract: Coleman, Ashley.
  • Unsupervised Feature Extraction for Reinforcement Learning
    Faculty of Science and Bio-Engineering Sciences, Department of Computer Science
    Unsupervised Feature Extraction for Reinforcement Learning
    Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in de Ingenieurswetenschappen: Computerwetenschappen
    Yoni Pervolarakis. Promotors: Prof. Dr. Peter Vrancx, Prof. Dr. Ann Nowé. June 2016

    Abstract: When using high-dimensional features, chances are that most of the features are not important to a specific problem. To eliminate those features, and to potentially find better ones, different possibilities exist. For example, feature extraction transforms the original input features into a new, smaller-dimensional feature set, while a feature selection method keeps only those features that are more important than the others. This can be done in a supervised or unsupervised manner. In this thesis, we investigate whether we can use autoencoders as an unsupervised feature extraction method on data that is not necessarily interpretable. These new features are then tested in a Reinforcement Learning environment. The data is represented as RAM states and is a black box, since we cannot understand it. The autoencoders receive a high-dimensional feature set and transform it into a lower dimension; these new features are given to an agent who makes use of them and tries to learn from them.
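    A compact Keras sketch of the first half of this pipeline, assuming hypothetical 128-dimensional RAM states compressed to 16 features (the layer sizes and training settings are ours, not the thesis's):

```python
import numpy as np
from tensorflow import keras

# Hypothetical data: 1000 RAM states of 128 bytes, scaled to [0, 1].
ram_states = np.random.rand(1000, 128).astype("float32")

autoencoder = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(128,)),
    keras.layers.Dense(16, activation="relu", name="code"),   # bottleneck
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(128, activation="sigmoid"),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(ram_states, ram_states, epochs=10, batch_size=32, verbose=0)

# The encoder half yields the learned low-dimensional features for the agent.
encoder = keras.Model(autoencoder.input, autoencoder.get_layer("code").output)
features = encoder.predict(ram_states)    # shape (1000, 16)
```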