Automated Feature Engineering for Deep Neural Networks with Genetic Programming
Automated Feature Engineering for Deep Neural Networks with Genetic Programming

by Jeff Heaton

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

College of Engineering and Computing
Nova Southeastern University
2017

ProQuest Number: 10259604. All rights reserved.

INFORMATION TO ALL USERS: The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.

ProQuest 10259604. Published by ProQuest LLC (2017). Copyright of the Dissertation is held by the Author. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code. Microform Edition © ProQuest LLC.

ProQuest LLC
789 East Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106-1346

We hereby certify that this dissertation, submitted by Jeff Heaton, conforms to acceptable standards and is fully adequate in scope and quality to fulfill the dissertation requirements for the degree of Doctor of Philosophy.

James D. Cannady, Ph.D., Chairperson of Dissertation Committee
Sumitra Mukherjee, Ph.D., Dissertation Committee Member
Paul Cerkez, Ph.D., Dissertation Committee Member

Approved: Yong X. Tao, Ph.D., P.E., FASME, Dean, College of Engineering and Computing

College of Engineering and Computing
Nova Southeastern University
2017

An Abstract of a Dissertation Submitted to Nova Southeastern University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Automated Feature Engineering for Deep Neural Networks with Genetic Programming

by Jeff Heaton
2017

Feature engineering is a process that augments the feature vector of a machine learning model with calculated values designed to enhance the accuracy of the model's predictions. Research has shown that the accuracy of models such as deep neural networks, support vector machines, and tree/forest-based algorithms sometimes benefits from feature engineering. These engineered features are usually expressions that combine one or more of the original features, and the most effective structure for an engineered feature depends on the type of machine learning model in use. Previous research demonstrated that different model families benefit from different types of engineered features: random forests, gradient-boosting machines, and other tree-based models might not see the same accuracy gain that an engineered feature allows neural networks, generalized linear models, and other dot-product-based models to achieve on the same data set.

This dissertation presents a genetic programming-based algorithm that automatically engineers features that increase the accuracy of deep neural networks for some data sets. For a genetic programming algorithm to be effective, it must prioritize the search space and efficiently evaluate what it finds. The dissertation algorithm faced a potential search space composed of all possible mathematical combinations of the original feature vector. Five experiments were designed to guide the search process to efficiently evolve good engineered features.
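To make the idea of an engineered feature concrete, the following minimal sketch (illustrative only, not the dissertation's algorithm; the data values and the ratio feature are hypothetical) shows an expression over original features being appended to a feature matrix:

```python
import numpy as np

# Hypothetical data set: two original features, height (m) and weight (kg).
X = np.array([
    [1.70, 65.0],
    [1.80, 90.0],
    [1.60, 55.0],
])

# Engineered feature: an expression combining the original features,
# here a BMI-style ratio, appended to the feature vector as a new column.
ratio = X[:, 1] / X[:, 0] ** 2
X_augmented = np.column_stack([X, ratio])

print(X_augmented.shape)  # original two columns plus one engineered column
```

A ratio such as this is a classic example of a feature that a dot-product-based model (like a neural network) cannot easily synthesize on its own, while a tree-based model may gain less from it.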
The result of this dissertation is an automated feature engineering (AFE) algorithm that is computationally efficient, even though a neural network is used to evaluate each candidate feature. This efficiency gave the algorithm a greater opportunity to specifically target deep neural networks in its search for engineered features that improve accuracy. Finally, a sixth experiment empirically demonstrated the degree to which this algorithm improved the accuracy of neural networks on data sets augmented by the algorithm's engineered features.

Acknowledgements

There are several people who supported and encouraged me through my dissertation journey at Nova Southeastern University. First, I would like to extend my thanks to Dr. James Cannady for serving as my dissertation committee chair. He provided guidance and advice at every stage of the dissertation process. He also gave me several additional research opportunities during my time at NSU. I would also like to thank my committee members, Dr. Sumitra Mukherjee and Dr. Paul Cerkez. Their advice and guidance through this process were invaluable.

My sincere gratitude and thanks go to Tracy, my wife and graduate school partner. Her help editing this dissertation was invaluable. We took a multi-year graduate journey as she completed her Master's degree in Spanish at Saint Louis University and I worked on my Ph.D. It was great that she could accompany me to Florida for each of the cluster meetings and that I could join her in Madrid for summer classes. The GECCO conference that I attended in Madrid was instrumental in the development of this research. Her love and encouragement made this project possible.

Thank you to Dave Snell for encouraging me to pursue a Ph.D. and for guidance at key points in my career. I would also like to thank my parents. From the time they bought me my first Commodore 64 and through all my educational journeys, they have always supported me.
Table of Contents

Approval
Acknowledgements
List of Tables
List of Figures

1. Introduction
   Problem Statement
   Dissertation Goal
   Relevance and Significance
   Barriers and Issues
   Definitions of Terms
   List of Acronyms
   Summary

2. Literature Review
   Feature Engineering
   Neural Networks
   Deep Learning
   Evolutionary Programming
   Summary

3. Methodology
   Introduction
   Algorithm Contract and Specification
   Algorithm Design Scope
   Experimental Design
   Measures
   Experiment 1: Limiting the Search Space
   Experiment 2: Establishing Baseline
   Experiment 3: Genetic Ensembles
   Experiment 4: Population Analysis
   Experiment 5: Objective Function Design
   Experiment 6: Automated Feature Engineering
   Real World Data Sets
   Synthetic Data Sets
   Resources
   Summary

4. Results
   Introduction
   Experiment 1 Results
   Experiment 2 Results
   Experiment 3 Results
   Experiment 4 Results
   Experiment 5 Results
   AFE Algorithm Design
   Experiment 6 Results
   Summary

5. Conclusions, Implications, Recommendations, and Summary
   Conclusions
   Implications
   Recommendations
   Summary

References

Appendixes
   A. Algorithms
   B. Hyperparameters
   C. Third Party Libraries
   D. Data Set Descriptions
   E. Synthetic Data Set Generation
   F. Engineered Features for the Data Sets
   G. Detailed Experiment Results
   H. Dissertation Source Code Availability

List of Tables

1. Common neural network transfer functions
2. Experiment 1 results format, neural network and genetic programming
3. Experiment 2 results format, neural network baseline
4. Experiment 3 results format, neural network genetic program ensemble
5. Experiment 4 results format, patterns in genetic programs
6. Experiment 5 results format, evaluating feature ranking
7. Experiment 6 results format, engineered feature effectiveness
8. Experiment 1 neural network results for select expressions
9. Experiment 1 genetic program results for select expressions
10. Experiment 2 baselines for data sets (RMSE)
11. Experiment 3, ensemble, GP and neural network results
12. Experiment 4, patterns from genetic programs
13. Experiment 6, feature engineered vs. non-augmented accuracy
14. Experiment 6, t-test with data sets that showed improvement
15. Data sets used by the dissertation experiments

List of Figures

1. Regression and classification network (original features)
2. Neural network engineered features
3. Elman neural network
4. Jordan neural network
5. Long short-term memory (LSTM)
6. Dropout layer in a neural network
7. Expression tree for genetic programming
8. Point crossover
9. Subtree mutation
10. Feature engineering to linearly separate two classes
11. Algorithm high-level design and contract
12. Ensemble of genetic programs for a neural network
13. Generate candidate solutions
14. Branches with common structures
15. High-level overview of AFE algorithm
16. Dissertation algorithm evaluation
17. Overview of AFE algorithm
18. Input neuron weight connections
19. Data flow/mapping for the AFE algorithm
20. One step of the AFE algorithm
21. Collapsing trees
22. Experiment 6, feature engineering benefits to data set

Chapter 1

Introduction

The research conducted in this dissertation created an algorithm to automatically engineer features that might increase the accuracy of deep neural networks for certain data sets. The research built upon, but did not duplicate, prior published research by the author of this dissertation. In 2008, the author created the Encog Machine Learning Framework, which includes advanced neural network and genetic programming algorithms (Heaton, 2015). The Encog genetic programming framework introduced an innovative algorithm that allows dynamically generated constant nodes for tree-based genetic programming.
Thus, constants in Encog genetic programs can assume any value, rather than being chosen from a fixed constant pool. Research was performed
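The distinction between a fixed constant pool and dynamically generated constants can be sketched as follows. This is an illustrative sketch only, not Encog's actual implementation (Encog is a Java/C# framework); the function names and value range are hypothetical:

```python
import random

# Hypothetical sketch of how constant nodes in tree-based genetic
# programming obtain their values. Names and ranges are illustrative.

FIXED_POOL = [0.0, 1.0, 2.0, 10.0]  # traditional fixed constant pool

def constant_from_pool(rng):
    # Traditional GP: a constant node may only take a value
    # drawn from the predefined pool.
    return rng.choice(FIXED_POOL)

def dynamic_constant(rng, low=-10.0, high=10.0):
    # Dynamically generated constants: a constant node may assume
    # any value in a continuous range, not just pool members.
    return rng.uniform(low, high)

rng = random.Random(42)
pool_values = {constant_from_pool(rng) for _ in range(1000)}
dynamic_values = {dynamic_constant(rng) for _ in range(1000)}

# The pool yields at most len(FIXED_POOL) distinct values, while the
# dynamic generator yields a fresh value on essentially every draw.
print(len(pool_values), len(dynamic_values))
```

The practical consequence is that evolution can fine-tune numeric constants continuously (for example through mutation of an existing constant) rather than being restricted to whatever values the pool happened to contain.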