
University of Tennessee, Knoxville TRACE: Tennessee Research and Creative Exchange

Doctoral Dissertations Graduate School

12-2009

Data Mining with Multivariate Kernel Regression Using Information Complexity and the Genetic Algorithm

Dennis Jack Beal University of Tennessee - Knoxville


Recommended Citation

Beal, Dennis Jack, "Data Mining with Multivariate Kernel Regression Using Information Complexity and the Genetic Algorithm." PhD diss., University of Tennessee, 2009. https://trace.tennessee.edu/utk_graddiss/561

To the Graduate Council:

I am submitting herewith a dissertation written by Dennis Jack Beal entitled “Data Mining with Multivariate Kernel Regression Using Information Complexity and the Genetic Algorithm.” I have examined the final electronic copy of this dissertation for form and content and recommend that it be accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, with a major in Management Science.

Hamparsum Bozdogan, Major Professor

We have read this dissertation and recommend its acceptance:

Ken Gilbert

Bogdan Bichescu

Mohammed Mohsin

Accepted for the Council:

Carolyn R. Hodges

Vice Provost and Dean of the Graduate School

(Original signatures are on file with official student records.)

Data Mining with Multivariate Kernel Regression Using Information Complexity and the Genetic Algorithm

A Dissertation

Presented for the

Doctor of Philosophy

Degree

The University of Tennessee, Knoxville

Dennis Jack Beal

December 2009

Copyright © 2009 by Dennis Jack Beal.

All rights reserved.

Dedication

This dissertation is dedicated to my wife, Shannon Beal, and my children, Heather Beal and Jonathan Beal. Their patience and understanding during my pursuit of this degree were a great help.

I also want to dedicate this dissertation to my mother, Paula S. Beal, who taught me the importance of learning, hard work, setting goals, perseverance, and the value of an education. Without her inspiration and sacrifices I would never have seen the value of education, nor would I have achieved this lifelong dream.

Acknowledgments

I would like to thank all those who supported and helped me complete this dissertation.

First, I want to thank my major professor, Dr. Hamparsum Bozdogan, for his time, expertise, and guidance while directing my dissertation. His passion for research has inspired me to continue researching in academic areas of interest to me.

I thank my doctoral committee members Dr. Ken Gilbert, Dr. Bogdan Bichescu, and

Dr. Mohammed Mohsin for their time, support, and useful comments.

I also appreciate Dr. John Andrew Howe for his help with LaTeX and the Information

Complexity M3 Toolbox for MATLAB that he and Dr. Bozdogan developed.

Last, I appreciate my current employer, Science Applications International Corporation

(SAIC), for investing in me as an employee by paying nearly all of my tuition and educational expenses throughout my doctoral program. Their investment in me made it

financially possible for me to pursue the degree.

Abstract

Kernel density estimation is a data smoothing technique that depends heavily on the bandwidth selection. The current literature has focused on optimal selectors for the univariate case that are primarily data driven. Plug-in and cross validation selectors have recently been extended to the general multivariate case.

This dissertation will introduce and develop novel techniques for data mining with multivariate kernel density regression using information complexity and the genetic algorithm as a heuristic optimizer to choose the optimal bandwidth and the best predictors in kernel regression models. Simulated and real data will be used to cross validate the optimal bandwidth selectors using information complexity. The genetic algorithm is used in conjunction with information complexity to determine kernel density estimates for variable selection from high-dimensional multivariate data sets.

Kernel regression is also hybridized with the implicit enumeration algorithm to determine the set of independent variables for the global optimal solution using information criteria as the objective function. The results from the genetic algorithm are compared to the optimal solution from the implicit enumeration algorithm and the known global optimal solution from an explicit enumeration of all possible subset models.

Contents

1 Introduction 1

1.1 Brief Description of the Problem and Proposed Approach ...... 1

1.2 Motivations ...... 2

1.3 Contributions of this Dissertation ...... 5

1.4 Organization of this Dissertation ...... 5

2 Information Criteria and Complexity 6

2.1 Information Criteria ...... 6

2.2 Informational Complexity ...... 8

2.3 Comparison of Model Selection Techniques ...... 12

2.4 Conclusions ...... 14

3 Kernel Density Estimation 15

3.1 Introduction ...... 15

3.2 Review of Literature on Kernel Density Estimation ...... 17

3.3 Bandwidth Selectors ...... 19

3.3.1 Univariate Fixed Bandwidth Selectors ...... 20

3.3.2 Multivariate Fixed Bandwidth Selectors ...... 25

3.3.3 Variable Bandwidth Selectors ...... 30

3.3.4 Information Complexity Bandwidth Selectors ...... 31

3.4 Conclusions ...... 35

4 Univariate Kernel Regression 36

4.1 Kernel Regression ...... 36

4.2 Univariate Kernel Regression Results ...... 39

4.2.1 Univariate Kernel Regression with Normal Data ...... 39

4.2.2 Univariate Box-Cox Transformation ...... 50

4.2.3 Univariate Kernel Regression on Hald Cement Data ...... 53

4.2.4 Kernel Regression for Univariate Power Exponential Distribution . . 61

4.2.5 Univariate Kernel Regression for Friedman’s Nonlinear Regression . 73

4.2.6 Univariate Kernel Regression for Uncorrelated Data ...... 87

4.2.7 Univariate Kernel Regression Using the Epanechnikov Kernel on Nor-

mal Data ...... 98

4.2.8 Univariate Kernel Regression Using the Epanechnikov Kernel on PE

Data ...... 101

4.3 Conclusions ...... 103

5 Multivariate Kernel Regression 104

5.1 Introduction ...... 104

5.2 Kernel Regression with Normal Data ...... 106

5.3 Kernel Regression for Multivariate Power Exponential Distribution . . . . . 113

5.4 Conclusions ...... 119

6 Kernel Regression with the Genetic Algorithm 120

6.1 Introduction ...... 120

6.2 Genetic Algorithm ...... 120

6.3 Kernel Regression with the Genetic Algorithm on Real Body Fat Data . . . 127

6.3.1 Results from the Y = Xβ Model ...... 127

6.3.2 Results from the Y = Xkβ Model ...... 134

6.3.3 Results from the Y = Xmkβ Model ...... 154

6.3.4 Results from the Yk = Xβ Model ...... 164

6.3.5 Results from the Yk = Xkβ Model ...... 170

6.3.6 Results from the Yk = Xmkβ Model ...... 179

6.3.7 Conclusions from Body Fat Data ...... 184

6.4 Conclusions ...... 187

7 Kernel Regression with Implicit Enumeration Algorithm 188

7.1 Introduction ...... 188

7.2 Implicit Enumeration Algorithm ...... 188

7.3 Implicit Enumeration with Kernel Regression ...... 192

7.3.1 Results from the Y = Xβ Model ...... 192

7.3.2 Results from the Y = Xkβ Model ...... 194

7.3.3 Results from the Y = Xmkβ Model ...... 194

7.3.4 Results from the Yk = Xβ Model ...... 197

7.3.5 Results from the Yk = Xkβ Model ...... 197

7.3.6 Results from the Yk = Xmkβ Model ...... 200

7.4 Conclusions ...... 203

8 Conclusions 206

8.1 Summary of Dissertation ...... 206

8.2 Future Work ...... 208

Bibliography 209

Appendix 227

A.1. Simulated Normal Data from Section 4.2.1 ...... 227

A.2. Hald’s Cement Data from Section 4.2.3 ...... 230

A.3. Power Exponential Data from Section 4.2.4 ...... 231

A.4. Friedman’s Data from Section 4.2.5 ...... 233

A.5. Uncorrelated Uniform Data from Section 4.2.6 ...... 235

A.6. Simulated Multivariate Normal Data from Section 5.2 ...... 236

A.7. Power Exponential Data from Section 5.3 ...... 238

A.8. Body Fat Data from Section 6.3 ...... 240

Vita 243

List of Tables

4.1 Notation used for applying univariate kernel regression smoothing methods. 40

4.2 Results for applying univariate kernel regression smoothing methods with

normally distributed data...... 49

4.3 Competing models for univariate Box-Cox transformations...... 51

4.4 Results from using Box-Cox transformation for normally distributed data. . 53

4.5 Hald’s cement production data...... 54

4.6 Results for applying univariate kernel regression smoothing methods on

Hald’s data...... 60

4.7 Results for applying univariate kernel regression smoothing methods with

power exponential data...... 72

4.8 Results for applying univariate kernel regression smoothing methods using

Friedman’s nonlinear data...... 86

4.9 Results for applying univariate kernel regression smoothing methods on un-

correlated variables using posterior expected utility (PEU)...... 98

4.10 Results for applying univariate kernel regression with the Epanechnikov ker-

nel on normal data...... 101

4.11 Results for applying univariate kernel regression with the Epanechnikov ker-

nel on power exponential data...... 102

5.1 Notation used for applying multivariate kernel regression smoothing methods.107

5.2 Results for applying multivariate kernel regression smoothing methods with

normally distributed data...... 112

5.3 Results for applying multivariate kernel regression smoothing methods with

power exponential data...... 118

6.1 Genetic algorithm operational parameters...... 125

6.2 AIC results for body fat data from the Y = Xβ model...... 131

6.3 ICOMP(IFIM) results for body fat data from the Y = Xβ model...... 134

6.4 AIC results for body fat data from the Y = Xkβ model...... 152

6.5 ICOMP(IFIM) results for body fat data from the Y = Xkβ model...... 153

6.6 AIC results for body fat data from the Y = Xmkβ model...... 159

6.7 ICOMP(IFIM) results for body fat data from the Y = Xmkβ model. . . . . 163

6.8 AIC results for body fat data from the Yk = Xβ model...... 167

6.9 ICOMP(IFIM) results for body fat data from the Yk = Xβ model...... 171

6.10 AIC results for body fat data from the Yk = Xkβ model...... 174

6.11 ICOMP(IFIM) results for body fat data from the Yk = Xkβ model...... 178

6.12 AIC results for body fat data from the Yk = Xmkβ model...... 182

6.13 ICOMP(IFIM) results for body fat data from the Yk = Xmkβ model. . . . . 186

6.14 Best AIC and ICOMP(IFIM) models for the body fat data...... 186

7.1 Top 10 best AIC models for the body fat data using IE for the Y = Xβ

model...... 193

7.2 Top 10 best ICOMP(IFIM) models for the body fat data using IE for the

Y = Xβ model...... 193

7.3 Top 10 best AIC models for the body fat data using IE for the Y = Xkβ model...... 195

7.4 Top 10 best ICOMP(IFIM) models for the body fat data using IE for the

Y = Xkβ model...... 195

7.5 Top 10 best AIC models for the body fat data using IE for the Y = Xmkβ model...... 196

7.6 Top 10 best ICOMP(IFIM) models for the body fat data using IE for the

Y = Xmkβ model...... 196

7.7 Top 10 best AIC models for the body fat data using IE for the Yk = Xβ model...... 198

7.8 Top 10 best ICOMP(IFIM) models for the body fat data using IE for the

Yk = Xβ model...... 198

7.9 Top 10 best AIC models for the body fat data using IE for the Yk = Xkβ model...... 199

7.10 Top 10 best ICOMP(IFIM) models for the body fat data using IE for the

Yk = Xkβ model...... 201

7.11 Top 10 best AIC models for the body fat data using IE for the Yk = Xmkβ model...... 202

7.12 Top 10 best ICOMP(IFIM) models for the body fat data using IE for the

Yk = Xmkβ model...... 202

7.13 Comparison of the best models from IE and GA using AIC...... 204

7.14 Comparison of the best models from IE and GA using ICOMP(IFIM). . . . 204

List of Figures

3.1 Univariate plot of the Gaussian kernel function...... 16

4.1 Bivariate surface plot of normally distributed variables X1 and X2...... 40

4.2 Kernel density plots for various h values for normally distributed variable X1. 41

4.3 Kernel density plots for various h values for normally distributed variable X2. 42

4.4 Kernel density plots for various h values for normally distributed variable X3. 43

4.5 Kernel density plots for various h values for normally distributed variable X4. 44

4.6 Kernel density plots for various h values for normally distributed variable X5. 45

4.7 Kernel density plots for various h values for normally distributed variable X6. 46

4.8 Kernel density plots for various h values for normally distributed variable X7. 47 4.9 Kernel density plots for various h values for normally distributed variable Y . 48

4.10 Bivariate surface plot of Hald’s variables X1 and X2...... 54

4.11 Kernel density plots for various h values for Hald’s variable X1...... 55

4.12 Kernel density plots for various h values for Hald’s variable X2...... 56

4.13 Kernel density plots for various h values for Hald’s variable X3...... 57

4.14 Kernel density plots for various h values for Hald’s variable X4...... 58 4.15 Kernel density plots for various h values for Hald’s variable Y ...... 59

4.16 Bivariate surface plot of power exponential variables X1 and X2...... 62

4.17 Kernel density plots for various h values for PE distributed variable X1... 64

4.18 Kernel density plots for various h values for PE distributed variable X2... 65

4.19 Kernel density plots for various h values for PE distributed variable X3... 66

4.20 Kernel density plots for various h values for PE distributed variable X4... 67

4.21 Kernel density plots for various h values for PE distributed variable X5... 68

4.22 Kernel density plots for various h values for PE distributed variable X6... 69

4.23 Kernel density plots for various h values for PE distributed variable X7... 70 4.24 Kernel density plots for various h values for PE distributed variable Y .... 71

4.25 Bivariate surface plot of Friedman’s variables X1 and X2 with the dependent variable Y ...... 73

4.26 Kernel density plots for various h values for Friedman’s variable X1..... 75

4.27 Kernel density plots for various h values for Friedman’s variable X2..... 76

4.28 Kernel density plots for various h values for Friedman’s variable X3..... 77

4.29 Kernel density plots for various h values for Friedman’s variable X4..... 78

4.30 Kernel density plots for various h values for Friedman’s variable X5..... 79

4.31 Kernel density plots for various h values for Friedman’s variable X6..... 80

4.32 Kernel density plots for various h values for Friedman’s variable X7..... 81

4.33 Kernel density plots for various h values for Friedman’s variable X8..... 82

4.34 Kernel density plots for various h values for Friedman’s variable X9..... 83

4.35 Kernel density plots for various h values for Friedman’s variable X10..... 84 4.36 Kernel density plots for various h values for Friedman’s variable Y ...... 85

4.37 Histograms and scatter plots of uncorrelated variables X1, X2, X3, and Y .. 88

4.38 Kernel density plots for various h values for uncorrelated variable X1.... 90

4.39 Kernel density plots for various h values for uncorrelated variable X2.... 91

4.40 Kernel density plots for various h values for uncorrelated variable X3.... 92

4.41 Kernel density plots for various h values for uncorrelated variable X4.... 93

4.42 Kernel density plots for various h values for uncorrelated variable X5.... 94

4.43 Kernel density plots for various h values for uncorrelated variable X6.... 95

4.44 Kernel density plots for various h values for uncorrelated variable X7.... 96 4.45 Kernel density plots for various h values for uncorrelated variable Y ..... 97

4.46 Univariate plot of the Epanechnikov kernel function...... 99

5.1 Bivariate surface plot of normally distributed variables X1 and X2...... 108

5.2 Kernel density plots for various h values for normally distributed variable Y2.110

5.3 Kernel density plots for various h values for normally distributed variable Y3.111

5.4 Bivariate surface plot of power exponential variables X1 and X2...... 115

5.5 Kernel density plots for various h values for PE distributed variable Y2... 116

5.6 Kernel density plots for various h values for PE distributed variable Y3... 117

6.1 Three dimensional surface plot of AIC scores for body fat from the Y = Xβ

model...... 129

6.2 Scatter plot of AIC scores for body fat for each generation from the Y = Xβ

model...... 130

6.3 Three dimensional surface plot of ICOMP(IFIM) scores for body fat from

the Y = Xβ model...... 132

6.4 Scatter plot of ICOMP(IFIM) scores for body fat for each generation from

the Y = Xβ model...... 133

6.5 Kernel density plots for various hj values for body fat variable X1...... 135

6.6 Kernel density plots for various hj values for body fat variable X2...... 136

6.7 Kernel density plots for various hj values for body fat variable X3...... 137

6.8 Kernel density plots for various hj values for body fat variable X4...... 138

6.9 Kernel density plots for various hj values for body fat variable X5...... 139

6.10 Kernel density plots for various hj values for body fat variable X6...... 140

6.11 Kernel density plots for various hj values for body fat variable X7...... 141

6.12 Kernel density plots for various hj values for body fat variable X8...... 142

6.13 Kernel density plots for various hj values for body fat variable X9...... 143

6.14 Kernel density plots for various hj values for body fat variable X10...... 144

6.15 Kernel density plots for various hj values for body fat variable X11...... 145

6.16 Kernel density plots for various hj values for body fat variable X12...... 146

6.17 Kernel density plots for various hj values for body fat variable X13...... 147

6.18 Kernel density plots for various hj values for body fat variable Y ...... 148

6.19 Three dimensional surface plot of AIC scores for body fat from the Y = Xkβ model...... 150

6.20 Scatter plot of AIC scores for body fat for each generation from the Y = Xkβ model...... 151

6.21 Three dimensional surface plot of ICOMP(IFIM) scores for body fat from

the Y = Xkβ model...... 155 6.22 Scatter plot of ICOMP(IFIM) scores for body fat for each generation from

the Y = Xkβ model...... 156 6.23 Three dimensional surface plot of AIC scores for body fat from the Y =

Xmkβ model...... 157 6.24 Scatter plot of AIC scores for body fat for each generation from the Y =

Xmkβ model...... 158 6.25 Three dimensional surface plot of ICOMP(IFIM) scores for body fat from

the Y = Xmkβ model...... 161 6.26 Scatter plot of ICOMP(IFIM) scores for body fat for each generation from

the Y = Xmkβ model...... 162

6.27 Three dimensional surface plot of AIC scores for body fat from the Yk = Xβ model...... 165

6.28 Scatter plot of AIC scores for body fat for each generation from the Yk = Xβ model...... 166

6.29 Three dimensional surface plot of ICOMP(IFIM) scores for body fat from

the Yk = Xβ model...... 168 6.30 Scatter plot of ICOMP(IFIM) scores for body fat for each generation from

the Yk = Xβ model...... 169

6.31 Three dimensional surface plot of AIC scores for body fat from the Yk = Xkβ model...... 172

6.32 Scatter plot of AIC scores for body fat for each generation from the Yk = Xkβ model...... 173

6.33 Three dimensional surface plot of ICOMP(IFIM) scores for body fat from

the Yk = Xkβ model...... 176 6.34 Scatter plot of ICOMP(IFIM) scores for body fat for each generation from

the Yk = Xkβ model...... 177

6.35 Three dimensional surface plot of AIC scores for body fat from the Yk =

Xmkβ model...... 180

6.36 Scatter plot of AIC scores for body fat for each generation from the Yk =

Xmkβ model...... 181 6.37 Three dimensional surface plot of ICOMP(IFIM) scores for body fat from

the Yk = Xmkβ model...... 183 6.38 Scatter plot of ICOMP(IFIM) scores for body fat for each generation from

the Yk = Xmkβ model...... 185

A.1 Histograms and scatter plots of normally distributed variables X1, X2, X3,

and X4...... 228

A.2 Histograms and scatter plots of normally distributed variables X5, X6, X7,

and Y1...... 229

A.3 Histograms and scatter plots of Hald’s variables X1, X2, X3, X4 and Y ... 230

A.4 Histograms and scatter plots of PE variables X1, X2, X3, and X4...... 231

A.5 Histograms and scatter plots of PE variables X5, X6, X7, and Y1...... 232

A.6 Histograms and scatter plots of Friedman’s variables X1, X2, X3, X4, X5, and Y ...... 233

A.7 Histograms and scatter plots of Friedman’s variables X6, X7, X8, X9, and

X10...... 234

A.8 Histograms and scatter plots of uncorrelated uniform variables X4, X5, X6,

and X7...... 235

A.9 Histograms and scatter plots of normally distributed variables X1, X2, Y1,

Y2, and Y3...... 236

A.10 Histograms and scatter plots of normally distributed variables X3, X4, X5,

X6, and X7...... 237

A.11 Histograms and scatter plots of PE variables X1, X2, Y1, Y2, and Y3..... 238

A.12 Histograms and scatter plots of PE variables X3, X4, X5, X6, and X7.... 239

A.13 Histograms and scatter plots of body fat variables X1, X2, X3, X4, and X5. 240

A.14 Histograms and scatter plots of body fat variables X6, X7, X8, X9, and X10. 241

A.15 Histograms and scatter plots of body fat variables X11, X12, X13, and Y .. 242

Chapter 1

Introduction

1.1 Brief Description of the Problem and Proposed Approach

This research focuses on the problem of selecting the optimal bandwidth for kernel density estimation. Kernel density estimates are very sensitive to bandwidth selection: a large bandwidth will oversmooth the distribution and could mask the true underlying distribution, while an unnecessarily small bandwidth will undersmooth the distribution and highlight the noise in the data more than the underlying distribution.

The current literature has many data driven methods for estimating the optimal bandwidth for kernel density estimation. These data driven methods include plug-in and cross validation techniques. However, the approach of this dissertation is to use Bozdogan's information complexity (ICOMP) [Bozdogan, 1988, Bozdogan, 1990a, Bozdogan, 1990b, Bozdogan and Haughton, 1998, Bozdogan, 2000, Bozdogan, 2003] and Akaike's information criteria (AIC) with a genetic algorithm (GA) to derive the optimal bandwidth selection for high dimensional multivariate data as a data mining tool.

1.2 Motivations

A univariate multiple linear regression model is of the form

y = Xβ + ε, (1.1)

where y is an (n × 1) observed response vector on each of n individuals, X is an (n × q) matrix of full rank on q predictor variables, β is an unknown (q × 1) vector of regression coefficients, and ε is an (n × 1) vector of unobserved random errors. When the number of q predictor variables is large, an efficient model selection algorithm must be used to identify a subset of those predictor variables that form the best models for predicting the response variable y. The best model should estimate the regression coefficients with small standard errors with as few variables as necessary (parsimony).

A multivariate linear regression (MLR) model is of the form

Y = XB + E, (1.2)

where Y is an (n × p) observed matrix of p response variables on each of n individuals,

X is an (n × q) matrix of full rank on q predictor variables, B is an unknown (q × p) matrix of regression coefficients, and E is an (n × p) matrix of unobserved random errors.

When the number of q predictor variables is large, an efficient model selection algorithm must be used to identify a subset of those predictor variables that form the best models for predicting the p response variables.

When q is small (e.g., q < 10), an obvious strategy to find the "best" model is using explicit enumeration, where all possible subset regression models are enumerated and each model is evaluated according to a given criterion. The number of possible subset regressions is 2^q − 1 since each of the q predictor variables is either included or excluded from the model. When q = 10, the number of subset models to enumerate explicitly is only 2^10 − 1 = 1023. However, the number of subset models grows exponentially larger as q increases. When q = 20, 1,048,575 subset models must be evaluated. When q = 30, 1,073,741,823 subset models must be evaluated using explicit enumeration. Since many modern multivariate data sets can have many more independent variables than 30, explicitly enumerating and evaluating the criterion for all possible models cannot be completed in a reasonable time even on the fastest computers.
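To make the combinatorial burden concrete, the short Python sketch below (an illustration only; it is not the MATLAB toolbox used in this dissertation) enumerates every nonempty subset of a small set of predictors and scores each candidate model by AIC from an ordinary least squares fit. The data, the helper name aic_ols, and all settings shown are hypothetical.

import itertools
import numpy as np

def aic_ols(X, y):
    """AIC for a Gaussian linear model fit by ordinary least squares."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    sigma2 = rss / n                      # MLE of the residual variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + 2 * (k + 1)      # k coefficients plus sigma^2

rng = np.random.default_rng(0)
n, q = 100, 6                             # small q, so 2^q - 1 = 63 subsets
X = rng.standard_normal((n, q))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.standard_normal(n)

best = None
for size in range(1, q + 1):
    for subset in itertools.combinations(range(q), size):
        design = np.column_stack([np.ones(n), X[:, subset]])
        score = aic_ols(design, y)
        if best is None or score < best[0]:
            best = (score, subset)

print("best subset:", best[1], "AIC:", round(best[0], 2))

With q = 6 only 63 subsets are scored; raising q to 20 or 30 makes the same loop infeasible, which motivates the heuristic search discussed below.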

Classical model selection methods include hypothesis testing based on forward selection, backward elimination or stepwise regression techniques. Forward selection begins with no variables in the model and allows the variable with the largest F-statistic to enter the model if its corresponding p-value is below a pre-determined significance level to enter. Once selected, the variable stays in all subsequent models. Conversely, backward elimination begins with all variables in the model and drops the variable with the smallest F-statistic if its corresponding p-value exceeds the pre-determined significance level to stay. Stepwise regression is a hybrid of forward selection and backward elimination. The technique begins by adding the most significant variable to the model (determined by the p-value of its F-statistic), and then tests which other variable can enter the model one at a time given the previous variables in the model. However, in stepwise regression at each step all variables are retested to determine if they remain significant in the presence of the other variables.

Therefore, a variable may be removed from subsequent models if its p-value exceeds the significance level to stay.

There are several criticisms of using forward selection, backward elimination and stepwise regression. Linhart and Zucchini [Linhart and Zucchini, 1986] state the pre-determined significance level to enter and significance level to stay lack an adequate foundation since the F-tests used to determine significance are based on the normal distribution with normally distributed errors. These assumptions usually do not hold in multivariate data sets in the real world. In addition, determining the significance levels for adding or deleting a variable from the model is subjective.

Another model selection method is the branch and bound algorithm, which guarantees finding the "best" model for a given criterion. In general, the branch and bound algorithm searches a binary tree where each variable is either included or excluded from

the model. If the criterion cannot be improved by adding a specific variable to the model, then the algorithm pivots to another branch and does not consider models that include that variable. Therefore, only a small percentage of models actually have to be evaluated before the "best" model is determined.

Bao [Bao, et. al, 2005] proposed an implicit enumeration algorithm for model selection using information criteria. The implicit enumeration algorithm is a branch and bound algorithm that assumes a strict monotonic condition on the criterion adopted.

However, this research proposes to use the genetic algorithm (GA) with the kernel regression approach for model selection. A genetic algorithm is a heuristic search technique that finds optimal or near-optimal solutions for general optimization problems. Genetic algorithms are particularly useful when the objective function to be minimized does not satisfy differentiability assumptions for standard calculus-based optimization techniques.

Genetic algorithms were introduced by John Holland [Holland, 1992] and are based on the principles of evolutionary biology using natural selection. GA is independent from the complexity of the problem structure and is not likely to be restricted to a local optimal solution [Goldberg, 1989]. The GA searches much more of the feature space through randomization during its crossover and mutation steps so that it is more likely to find the global optimum. Furthermore, GA does not assume the monotonic selection criterion condition as the implicit enumeration algorithm requires. GA is particularly useful for high dimensional data mining applications.
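A minimal genetic-algorithm sketch for subset selection is given below, assuming a caller-supplied scoring function (for example, an AIC or ICOMP evaluation of the model encoded by a binary inclusion mask). The population size, crossover rate, and mutation rate shown are illustrative defaults, not the operational parameters reported later in Table 6.1.

import numpy as np

def ga_subset_selection(score, q, pop_size=30, generations=50,
                        crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Minimize score(mask) over binary inclusion masks of length q."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, q))
    pop[pop.sum(axis=1) == 0, 0] = 1           # avoid empty models
    for _ in range(generations):
        fitness = np.array([score(ind) for ind in pop])
        pop = pop[np.argsort(fitness)]         # sort so the best chromosome is first
        new_pop = [pop[0].copy()]              # elitism: keep the current best
        while len(new_pop) < pop_size:
            # tournament selection of two parents (lower index = better rank)
            p1 = pop[min(rng.integers(0, pop_size, 2))]
            p2 = pop[min(rng.integers(0, pop_size, 2))]
            child = p1.copy()
            if rng.random() < crossover_rate:  # single-point crossover
                cut = rng.integers(1, q)
                child[cut:] = p2[cut:]
            flip = rng.random(q) < mutation_rate
            child[flip] = 1 - child[flip]      # bit-flip mutation
            if child.sum() == 0:
                child[rng.integers(q)] = 1
            new_pop.append(child)
        pop = np.array(new_pop)
    fitness = np.array([score(ind) for ind in pop])
    return pop[np.argmin(fitness)], fitness.min()

Each chromosome is a 0/1 vector of length q; tournament selection, single-point crossover, and bit-flip mutation are the standard operators described above, and the best chromosome is carried over unchanged to the next generation.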

For each model identified with the GA, the optimal bandwidth will be calculated using

ICOMP as proposed by Bozdogan [Bozdogan, 2004] and AIC. ICOMP uses an information theoretic measure of overall model complexity based on Kullback-Leibler distance [Kullback, 1987] and the generalized covariance complexity of Van Emden [Van Emden, 1971].

The inverse Fisher information matrix (IFIM) of a model is used with ICOMP to exploit the asymptotic optimality property of the maximum likelihood estimators (MLEs) and is the most general form of ICOMP, denoted ICOMP(IFIM). ICOMP(IFIM) is then used to calculate the optimal bandwidth for multivariate kernel density estimation.

1.3 Contributions of this Dissertation

This dissertation contributes the following to the literature:

• Information complexity is used as the criterion to determine the optimal bandwidth

for multivariate kernel density estimation. The bandwidth is calculated for each

model identified by the genetic algorithm. The best model has the minimum ICOMP

using kernel regression smoothing within the genetic algorithm.

• The management science approach taken is the minimization of the non-linear infor-

mation complexity function to determine the global optimal model for high dimen-

sional multivariate data sets.

• The approach of using an implicit enumeration with the GA in the kernel regression

problem integrates management science optimization with statistical modeling.

1.4 Organization of this Dissertation

This dissertation contains eight chapters. Chapter 1 provides a brief description of the problem and proposed approach of kernel density estimation using information complexity.

Chapter 2 defines information criteria and complexity as it will be used throughout this dissertation and compares information criteria with heuristic methods and model diagnostics for model selection. Chapter 3 introduces kernel density estimation and provides a relevant literature review. Univariate, multivariate and variable bandwidth selectors currently in the literature are also reviewed. Chapter 4 discusses univariate kernel regression using simulated and real data sets for subset selection under different models. Chapter 5 extends kernel regression to the multivariate case and presents results of multivariate kernel regression for simulated data. Chapter 6 presents results of integrating multivariate kernel density estimation with GA using a real data set. Chapter 7 presents results of integrating the implicit enumeration with kernel regression from the same real data set. Chapter 8 summarizes conclusions from this dissertation and outlines areas for future work.

Chapter 2

Information Criteria and Complexity

2.1 Information Criteria

The concept of information criteria originated with Akaike [Akaike, 1973, Akaike, 1974,

Akaike, 1981] as an extension of the maximum likelihood principle. An information criterion is a measure of the goodness of fit of a model to the data. Data with greater uncertainty will exhibit less information. Akaike's information criterion (AIC) is defined by

AIC(k) = −2 log L(θ̂k) + 2k, (2.1)

where L(θ̂k) is the maximized likelihood of the parameter vector θ̂k and k is the number of free parameters within the model. The first term −2 log L(θ̂k) in Eqn. 2.1 is a measure of lack of fit when the MLEs of the parameters of the model are used, and 2k is a penalty term for the number of free parameters in the model, or lack of parsimony. Models with smaller AIC are better since the information in these models is maximized.

Bayesian extensions of information criteria have also been widely used. Schwarz [Schwarz, 1978] proposed a Bayesian model selection criterion assuming the data are generated from an exponential family of distributions. Schwarz's Criterion (SC) is defined as

SC(k) = −2 log L(θ̂k) + k log n, (2.2)

where n is the sample size. SC has the same lack of fit term as AIC, but modifies the

lack of parsimony or penalty term. The AIC lack of parsimony term is simply 2k in the

model, whereas the SC lack of parsimony term in Eqn. 2.2 is k log n. In general, the

minimization of SC leads to lower dimensional models than those obtained by minimizing

AIC. SC adjusts the log likelihood by a penalty for model complexity. For sample sizes n

≥ 8, the values of SC are always larger than AIC. SC favors models with fewer parameters

than does AIC.
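As a concrete illustration of Eqns. 2.1 and 2.2 for a Gaussian linear regression model, the sketch below evaluates −2 log L(θ̂) at the least squares fit and adds the two penalty terms. The function names are hypothetical, and the parameter count includes the residual variance as a free parameter.

import numpy as np

def neg2_loglik_ols(X, y):
    """-2 log-likelihood of a Gaussian linear model at the MLE."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.sum((y - X @ beta) ** 2) / n
    return n * (np.log(2 * np.pi * sigma2) + 1)

def aic(X, y):
    k = X.shape[1] + 1                             # coefficients plus sigma^2
    return neg2_loglik_ols(X, y) + 2 * k           # Eqn. 2.1

def sc(X, y):
    n, k = len(y), X.shape[1] + 1
    return neg2_loglik_ols(X, y) + k * np.log(n)   # Eqn. 2.2

Since k log n exceeds 2k whenever n ≥ 8, the SC value is larger than the AIC value for the same fit, which is why SC favors smaller models, as noted above.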

Another Bayesian extension of information criteria is Akaike’s [Akaike, 1978] Bayesian

Information Criterion (ABIC). For multiple linear regression, ABIC is defined as

ABIC(Regression) = n log(2πσ̂²) + n + k [log(y′Hy/k) − log(y′My/n)] + O(1), (2.3)

where σ̂² is the MLE of the residual variance based on a k-parameter model, H = X(X′X)⁻¹X′ is the hat or projection matrix, M = I − H, and O(1) denotes "of order 1," which can be dropped. Note that if both y′Hy/k and y′My/n converge in probability to a constant, then ABIC will behave like SC in Eqn. 2.2 above.

None of these information criteria functions adjusts for biases or guards the researcher against model misspecification in complex structures between variables in high dimensional data. A new information complexity function is needed to account for the complex structures and correlations among the variables.

2.2 Informational Complexity

Bozdogan [Bozdogan, 1988,Bozdogan, 1990a] proposed an information theoretic measure of overall model complexity based on the generalized covariance complexity index. Bozdogan’s

Information Complexity (ICOMP) was motivated by AIC, but extends it to adjust for biases and complex structures in the data. The loss function for ICOMP takes the form of a penalized likelihood function by

Loss = lack of fit + lack of parsimony + profusion of complexity. (2.4)

Note that the loss function in Eqn. 2.4 has an additional term for profusion of complexity that is not accounted for in the previous information criteria in Eqns. 2.1, 2.2, and 2.3. ICOMP defines complexity as the degree of interdependence among the components of the model. The complexity of a random vector is a measure of the interaction or the dependency between its components.

Consider a continuous p-dimensional random vector x = [x1, ..., xp]′ that has joint density function f(x) = f(x1, ..., xp) and marginal density functions fj(xj), j = 1, 2, ..., p. Suppose that x has a p-variate Normal distribution with mean µ and covariance Σ, such that

f(x) = (2π)^(−p/2) |Σ|^(−1/2) exp{−(1/2)(x − µ)′Σ⁻¹(x − µ)}, (2.5)

where µ = (µ1, ..., µp)′, |µj| < ∞, j = 1, 2, ..., p and Σ is positive definite. By Kullback and Leibler [Kullback, 1987], the Kullback-Leibler information I(x) is

I(x) = Σ_{j=1}^{p} [(1/2) log(2π) + (1/2) log(σjj) + 1/2] − (p/2) log(2π) − (1/2) log |Σ| − p/2, (2.6)

where σjj = σj² is the jth diagonal element of Σ and p is the dimension of Σ. I(x) in Eqn. 2.6 reduces to the information complexity of a covariance matrix Σ, C0(Σ), defined by Van Emden [Van Emden, 1971] as

C0(Σ) = (1/2) Σ_{j=1}^{p} log(σjj) − (1/2) log |Σ|. (2.7)

However, the function C0(Σ) is dependent on the coordinates of the original random variables x1, ..., xp associated with the variances σj², j = 1, 2, ..., p. Van Emden [Van Emden, 1971] showed that C0(Σ) is not an effective measure of the amount of complexity in the covariance matrix Σ since C0(Σ) depends on the marginal and common distributions of the random variables x1, ..., xp. In addition, the first term of Eqn. 2.7 would change under orthonormal transformations. A general definition of the complexity of Σ should be independent of the coordinates of the original x vector of random variables that are associated with the variances σj² for j = 1, 2, ..., p.

Therefore, the maximal covariance complexity is defined in order to improve on C0(Σ) in Eqn. 2.7. Bozdogan [Bozdogan, 1990b] states the maximal information theoretic measure of complexity of a covariance matrix Σ of a p-dimensional multivariate Normal distribution is

C1(Σ) ≡ max_T C0(Σ) = (p/2) log(tr(Σ)/p) − (1/2) log |Σ|, (2.8)

where the maximum is taken over the orthonormal transformation T of the overall coordinate systems x1, ..., xp.

The derivation and proof of Eqn. 2.8 is shown in Bozdogan [Bozdogan, 1990a]. Bozdogan and Haughton [Bozdogan and Haughton, 1998] and Bozdogan and Shigemasu [Bozdogan and Shigemasu, 1998] discuss the properties of C1(Σ). Clearly, C1(Σ) in Eqn. 2.8 is an upper bound to C0(Σ) in Eqn. 2.7 since it is the maximum over the orthonormal transformations. C1(Σ) measures both the inequality among the variances and the contribution of the covariances in the covariance matrix Σ. Also, C1(Σ) is independent of the coordinate system associated with the variances σj² for j = 1, 2, ..., p. The complexity is also interpreted as the log ratio between the geometric mean of the average total variation and the generalized variance since tr(Σ) is the total variation and |Σ| is the generalized variance. A large value of C1(Σ) indicates a high interaction between the variables, while a small value of C1(Σ) indicates less interaction between the variables. C1(Σ) can be used in model selection to determine the strength of model structures and correlations in multidimensional data. Note that C1(Σ) −→ 0 as Σ −→ I, where I is the (p × p) identity matrix.
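The two complexity measures in Eqns. 2.7 and 2.8 translate directly into code. The sketch below is a plain numerical illustration (not the dissertation's MATLAB implementation) and uses the log-determinant for numerical stability.

import numpy as np

def c0(sigma):
    """C0(Sigma) of Eqn. 2.7: half the sum of log variances minus half log|Sigma|."""
    sigma = np.asarray(sigma, dtype=float)
    return 0.5 * np.sum(np.log(np.diag(sigma))) - 0.5 * np.linalg.slogdet(sigma)[1]

def c1(sigma):
    """C1(Sigma) of Eqn. 2.8: maximal complexity, invariant to orthonormal rotations."""
    sigma = np.asarray(sigma, dtype=float)
    p = sigma.shape[0]
    return 0.5 * p * np.log(np.trace(sigma) / p) - 0.5 * np.linalg.slogdet(sigma)[1]

# C1 approaches 0 as Sigma approaches the identity matrix
print(c1(np.eye(3)))                              # 0.0
print(c1(np.array([[1.0, 0.8], [0.8, 1.0]])))     # positive: strong interaction

The second call returns a positive value because the strong off-diagonal covariance inflates the complexity, matching the interpretation given above.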

Liu [Liu, 2006] states the complexity of the model covariance structure provides a

numerical measure to assess both parameter redundancy and stability together in one

measure. When the parameters are stable the covariance matrix should be approximately

a diagonal matrix. C1(Σ) penalizes the scaling of the ellipsoidal dispersion, while the

circular distribution has been accounted for. Therefore, ICOMP uses C1(Σ) without any

transformations of Σ without discarding the use of C0(Σ). The most general ICOMP approach to measure the fitness between multivariate linear

or nonlinear structural models and observed data is ICOMP(IFIM), where IFIM is the

inverse Fisher information matrix. The maximum likelihood estimators (MLEs) of model

parameters are asymptotically normally distributed with IFIM [Bozdogan, 2000]. The

general form of ICOMP(IFIM) is

ICOMP(IFIM) = −2 log L(θ̂) + 2C1(F̂⁻¹(θ̂)), (2.9)

where C1 denotes the maximal informational complexity of the estimated IFIM F̂⁻¹ and θ̂ denotes the estimated model parameter MLE vector. The term 2C1(F̂⁻¹(θ̂)) in Eqn. 2.9 is the penalty term which measures the complexity of the accuracy of the parameter estimates of the model as measured by IFIM. The C1 function gives a scalar measure of the Cramér-Rao lower bound matrix which accounts for the accuracy of the estimated parameters and

implicitly adjusts for the number of free parameters included in the model. ICOMP(IFIM)

also takes into account parameter orthogonality, redundancy, and stability [Bozdogan,

2000].

An alternative formulation of ICOMP in Eqn. 2.9 provides an achievable accuracy

of the parameter estimates by considering the entire parameter space of the model. The alternative formulation for ICOMP(IFIM) is

ICOMP(IFIM) = −2 log L(θ̂) + s log(tr(F̂⁻¹)/s) − log |F̂⁻¹|, (2.10)

where s = dim(F̂⁻¹). The minimum of ICOMP over the class of models Mk, k = 1, 2, ..., K is chosen as the best model. The proof is given in [Bozdogan, 2000].

ICOMP(IFIM) quantifies complexity not as the number of parameters in the model, but the degree of interdependence or correlational structure among the parameter estimates.

Therefore, ICOMP(IFIM) provides a more sensible penalty term than AIC, SC or BIC. Furthermore, the lack of parsimony is automatically adjusted by C1(F̂⁻¹(θ̂)) across the competing models as the parameter spaces of the models are constrained in the model selection process [Bozdogan, 2000].

Bozdogan [Bozdogan, 2000] derives IFIM for the multiple regression model

y = Xβ + ε (2.11)

where X is the (n × q) model matrix, β is the (q × 1) parameter vector, y is the (n × 1) response variable vector, and ε is the (n × 1) error vector. IFIM for the multiple regression model is

F̂⁻¹ = Cov(β̂, σ̂²) =
[ σ̂²(X′X)⁻¹    0
  0′           2σ̂⁴/n ]. (2.12)

ICOMP(IFIM) for the multiple regression model is derived in [Bozdogan, 2000] as

ICOMP(IFIM) = n log(2π) + (n − q − 2) log(σ̂²) + n + (q + 1) log[(σ̂² tr((X′X)⁻¹) + 2σ̂⁴/n)/(q + 1)] − log |(X′X)⁻¹| + log(n/2), (2.13)

where q = k + 1 and k = number of parameters estimated in the model.
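The following sketch assembles the block-diagonal estimated IFIM of Eqn. 2.12 for a single regression model and applies the C1 penalty of Eqn. 2.10, which is numerically the same quantity as the closed form in Eqn. 2.13. It is only an illustration under the Gaussian assumptions above; the function name and use of numpy are assumptions of this sketch, not the M3 Toolbox code.

import numpy as np

def icomp_ifim_regression(X, y):
    """ICOMP(IFIM) for y = X beta + eps, following Eqns. 2.10 and 2.12."""
    n, q = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.sum((y - X @ beta) ** 2) / n          # MLE of the error variance
    xtx_inv = np.linalg.inv(X.T @ X)

    # estimated inverse Fisher information, block-diagonal in (beta, sigma^2)
    ifim = np.zeros((q + 1, q + 1))
    ifim[:q, :q] = sigma2 * xtx_inv
    ifim[q, q] = 2 * sigma2 ** 2 / n

    s = q + 1
    c1 = 0.5 * s * np.log(np.trace(ifim) / s) - 0.5 * np.linalg.slogdet(ifim)[1]
    neg2_loglik = n * (np.log(2 * np.pi * sigma2) + 1)
    return neg2_loglik + 2 * c1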

11 Bozdogan [Bozdogan, 2000] derives IFIM for the multivariate regression model

Y = XB + E, (2.14) where X is the (n × q) model matrix with rank(X) = q +k +1, B is the (q × p) parameter matrix, Y is the (n × p) observed matrix of p response variables on n observations, and

E is the (n × p) matrix of unobserved random errors. ICOMP(IFIM) for the multivariate regression model is

ICOMP(IFIM) = np log(2π) + n log |Σ̂| + np + p(p + q) log[(tr(Σ̂) tr((X′X)⁻¹) + (1/2) tr(Σ̂²) + (1/2) tr(Σ̂)² + Σ_j σ̂jj²)/(p(p + q))] − (p + q + 1) log |Σ̂| − p log |(X′X)⁻¹| − p log(2). (2.15)

2.3 Comparison of Model Selection Techniques

Other techniques have been used for model selection besides information criteria. The most common model diagnostics are the root mean square error (RMSE), the adjusted R2, and

Mallow’s Cp statistic. Models with lower RMSE and Mallow’s Cp are considered better, while models with higher adjusted R2 are preferred. These methods are limited since they do not account for the correlational structure in the data, but do penalize for the inclusion of more variables in the model.

Common heuristic techniques include backward elimination, forward selection, and stepwise regression. In backward elimination parameters are estimated for the full saturated model. The regression parameters for individual variables are tested to determine whether they are significantly different from zero. The variable with the highest p-value exceeding some pre-selected significance level is removed from the model and the parameters for the remaining variables are estimated again. This process continues until no variables can be removed from the model. Once a variable leaves the model it cannot be included again.

12 In forward selection the model begins with an intercept term. The variable that is the most significant with the smallest p-value is kept in the model if it is below a pre- selected significance level to enter the model. All other variables are tested individually with this variable. The variable with the next lowest p-value that is below the pre-selected significance level (such as 0.15) is also kept in the model. This process continues until no more variables can be kept in the model. Once a variable enters the model it can not be excluded.

Stepwise regression is a combination of backward elimination and forward selection. Stepwise starts with the intercept term and adds the most significant variable. Additional significant variables are added one at a time, but variables already in the model can be deleted if they are no longer significant in the presence of other variables. Stepwise regression modeling is complete when no variables can enter the model and no variables can be deleted at the pre-selected significance level.
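A schematic implementation of the forward selection procedure described above is sketched below, using the partial F-test p-value and a pre-selected significance level to enter (0.15 here, purely as an example). The helper names and the use of scipy are assumptions of this sketch.

import numpy as np
from scipy import stats

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

def forward_selection(X, y, alpha_enter=0.15):
    """Greedy forward selection on the columns of X using partial F-tests."""
    n, q = X.shape
    selected, remaining = [], list(range(q))
    while remaining:
        rss_cur = rss(np.column_stack([np.ones(n)] + [X[:, j] for j in selected]), y)
        best = None
        for j in remaining:
            cols = selected + [j]
            rss_new = rss(np.column_stack([np.ones(n)] + [X[:, c] for c in cols]), y)
            df2 = n - len(cols) - 1
            f_stat = (rss_cur - rss_new) / (rss_new / df2)
            p_value = stats.f.sf(f_stat, 1, df2)
            if best is None or p_value < best[0]:
                best = (p_value, j)
        if best[0] < alpha_enter:       # enter the most significant variable
            selected.append(best[1])
            remaining.remove(best[1])
        else:
            break                       # no remaining variable meets the level to enter
    return selected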

Sawa [Sawa, 1978] developed a model selection criterion that was derived from a

Bayesian modification of the AIC criterion. Sawa defined Bayesian Information Criteria

(BIC) as

BIC = n log(σ̂²/n) + 2(k + 2) n σ̂p²/σ̂² − 2n² σ̂p⁴/σ̂⁴, (2.16)

where σ̂p² is the MLE of the residual variance based on the full p-parameter model. Beal [Beal, 2007] compared the information criteria AIC, BIC, and SC, model diagnostics, and heuristic techniques on a simulated multivariate data set to determine which techniques more consistently chose the true model. Beal showed that information criteria consistently outperformed both model diagnostics and heuristic techniques for selecting the true model. Model diagnostics RMSE, Mallow's Cp, and adjusted R2 performed poorly by selecting the true model only 28% of the time on average for n = 1000 and 26% of the time on average for n = 100. Adjusted R2 and RMSE consistently performed the worst compared to Mallow's Cp.

13 Beal also showed the heuristic methods backward elimination, forward selection, and stepwise regression performed better than the model diagnostics by selecting the true model

43% of the time on average for n = 1000 and 41% of the time on average for n = 100.

However, information criteria performed better than all the other methods by selecting the

true model 63% of the time on average for n = 1000 and 59% of the time on average for n

= 100.

2.4 Conclusions

Three measures of information criteria were defined for AIC, ABIC, and SC. Bozdogan’s

ICOMP [Bozdogan, 1988, Bozdogan, 1990a] was defined and contrasted with AIC, ABIC, and SC. ICOMP incorporates a profusion of complexity term and accounts for the correlational structure in the data, as opposed to AIC, BIC, and SC. Recent research by Beal [Beal, 2007] compares heuristic (forward selection, backward elimination, and stepwise regression) and model diagnostic methods (adjusted R2, Mallow's Cp, and RMSE) to

information criteria and shows information criteria superior to both heuristic and model

diagnostics for model selection.

14 Chapter 3

Kernel Density Estimation

3.1 Introduction

Kernel density estimation is a data smoothing technique that uses independent and identically distributed sample data to estimate a probability density function f̂(x). Kernel density estimation is used as a data mining technique in exploratory data analysis and visualization. Kernel density estimators are categorized into two broad classes: parametric and nonparametric estimators. Parametric kernel density estimators assume a distribution functional form that simplifies the problem to estimating the parameters of the assumed distribution. The most common parametric estimators are maximum likelihood estimators (MLEs). The most common distribution used for MLEs is the Gaussian (normal) distribution. Parametric density estimators are usually not as computationally intensive as nonparametric estimators.

Density estimation is used to approximate a “true” probability density function f(x)

from sample data of independent and identically distributed observations. The univariate

kernel estimate in the general form is

f̂(x) = (1/(nh)) Σ_{i=1}^{n} K((x − xi)/h), (3.1)

where x1, ..., xn are independent, identically distributed observations from density f(x)

and K(·) is the kernel function that is positive, integrates to 1, and satisfies some regularity conditions [Parzen, 1962]. The parameter h > 0 is the window width that controls the tradeoff between smoothness and fidelity of the estimate [Bozdogan, 2000]. The bandwidth h > 0 is the width of the bins the sample data are categorized into for density estimation.

Choosing an h that is too small will result in a density that is too abrupt and fluctuating.

However, choosing an h that is too large masks important features of the data by oversmoothing the density. Several techniques in the literature are presented in Section 3.3 for calculating the optimum bandwidth h without using information criteria or complexity.

A commonly used univariate parametric kernel is the Gaussian or normal kernel given by

K(x) = (1/√(2π)) exp(−x²/2) (3.2)

for x ∈ R. Figure 3.1 is a univariate plot of the shape of the Gaussian kernel shown in Eqn. 3.2.

Figure 3.1: Univariate plot of the Gaussian kernel function.

16 Substituting the Gaussian kernel from Eqn. 3.2 into Eqn. 3.1 yields the univariate

Gaussian kernel density estimator

f̂(x) = (1/(nh)) Σ_{i=1}^{n} (1/√(2π)) exp(−(x − xi)²/(2h²)). (3.3)
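Eqn. 3.3 translates directly into a few lines of code. The Python sketch below is only an illustration (the dissertation's computations use MATLAB), with the sample, grid, and bandwidth chosen arbitrarily.

import numpy as np

def gaussian_kde(x_grid, sample, h):
    """Univariate Gaussian kernel density estimate of Eqn. 3.3."""
    x_grid = np.asarray(x_grid, dtype=float).reshape(-1, 1)
    sample = np.asarray(sample, dtype=float).reshape(1, -1)
    u = (x_grid - sample) / h                        # (x - x_i)/h for every pair
    kernels = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (sample.size * h)

# example: estimate the density of 200 standard normal draws on a grid
rng = np.random.default_rng(1)
data = rng.standard_normal(200)
grid = np.linspace(-4, 4, 81)
f_hat = gaussian_kde(grid, data, h=0.4)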

A nonparametric kernel density estimator does not require a functional form for the density estimate. Nonparametric density estimators allow more flexibility than their para- metric counterparts, but are typically more computationally intensive. However, with mod- ern, high speed computers, this usually is no longer an issue. Examples of nonparametric density estimators are histograms, frequency polygons, penalized likelihood estimators and spline estimators [Duong, 2004,Scott and Sain, 2004]. Both parametric and nonparametric kernel density estimators require the bandwidth parameter h to be estimated.

3.2 Review of Literature on Kernel Density Estimation

Kernel density estimation has been widely discussed in the literature. A major area within the literature is using cross validation for kernel density estimation. Sheather and

Jones [Sheather and Jones, 1991] recommend least squares or unbiased cross-validation algorithms for density estimation. However, cross-validation on time series data is not recommended since serial correlation is present. Bowman [Bowman, 1984] extends cross- validation methods for smoothing density estimates. Hall and Marron [Hall and Marron,

1987] use least squares cross validation to minimize integrated square error in nonparamet- ric density estimation.

Another broad class within the literature is nonparametric kernel density estimation

[Tapia and Thompson, 1978, Wagner, 1975]. Bhattacharyya [Bhattacharyya, et. al, 1989] discuss unweighted nonparametric kernel density estimators, while Chu [Chu, 1994] uses

17 nonparametric regression function through kernel density estimation. Devroye [Devroye and Gyorfi, 1985] discusses the L1 view of nonparametric density estimation. Epanech- nikov [Epanechnikov, 1969] uses nonparametric estimation on multivariate densities, as do

Loftsgaarden and Quesenberry [Loftsgaarden and Quesenberry, 1965]. Hall and Wand [Hall and Wand, 1988a] use density differences on nonparametric discrimination. Hjort and

Glad [Hjort and Glad, 1995] proposed a technique that uses a parametric start for non- parametric density estimation. Minnotte and Scott [Minnotte and Scott, 1993] use a mode tree to visualize nonparametric density features. Scott [Scott, 1985] uses averaged shifted histograms for nonparametric density estimation in several dimensions. Scott [Scott, et. al, 1980] uses a discrete maximum penalized likelihood criteria for nonparametric proba- bility density estimation. Terrell and Scott [Terrell and Scott, 1985] discuss oversmoothed density estimates.

Scott and Sain [Scott and Sain, 2004] discuss averaged shifted histograms with kernels from several distributions: triangular, beta and normal. Silverman [Silverman, 1981] dis- cusses beta densities and polynomial kernels as they are asymptotically related to normal density kernels. Silverman [Silverman, 1982] explores kernel density estimation using the fast Fourier transformation.

In addition to the statistical literature, kernel density estimation is also researched in management science literature. For example, Sahin and Diwekar [Sahin and Diwekar,

2004] propose a new algorithm for stochastic programming using reweighting through kernel density estimation. The algorithm is a better optimization of nonlinear uncertain systems.

Scott [Scott, 1976] also uses optimization techniques to estimate nonparametric densities.

However, a thorough review of the literature shows no published research using information criteria or information complexity in conjunction with kernel density estimation.

18 3.3 Bandwidth Selectors

The literature contains many different approaches for determining the optimum bandwidth h. The most basic nonparametric density estimator is the histogram. Sturges [Sturges,

1926] proposed a rule for the bin width of a histogram from a random sample, but Scott and

Sain [Scott and Sain, 2004] showed this rule of thumb was far from optimal in a stochastic setting. Duin [Duin, 1976] proposed a bandwidth based on the leave-one-out maximum likelihood for histograms. However, the procedure is not consistent for densities with heavy tails. Brewer [Brewer, 1998] used a modeling approach for bandwidth selection. Cao [Cao,

2001] explored the relative efficiencies of bandwidth. Chiu [Chiu, 1991] researched the effect of discretization error on bandwidth selection and an automatic bandwidth selector

[Chiu, 1992]. Chiu [Chiu, 1996] also provided a comparative review of bandwidth selectors.

Delaigle and Gijbels [Delaigle and Gijbels, 2004] use the bootstrap for bandwidth selection from a contaminated sample.

Eggermont and Lariccia [Eggermont and Lariccia, 1996] proposed a simpler bandwidth selector that is effective in the univariate case. Faraway and Jhun [Faraway and Jhun,

1990] used the bootstrap to select the bandwidth. Gajek and Lenic [Gajek and Lenic,

1993] derived an approximate necessary condition for the optimal bandwidth selector. Hall and Marron [Hall and Marron, 1991] derived lower bounds for bandwidth selection in density estimation. Hall [Hall, et al, 1991] used a data based approach to derive an op- timal bandwidth selection. Hallin and Tran [Hallin and Tran, 1996] derived an optimal bandwidth for linear processes assuming asymptotic normality. Hazelton [Hazelton, 1996] explores bandwidth selection for local density estimation and proposed an optimal local bandwidth selector [Hazelton, 1999]. M. C. Jones has contributed extensively to the lit- erature on bandwidth selection for kernel density estimation. For example, Jones [Jones,

1991c] explored an automatic bandwidth selection in extensions to basic kernel density estimation.

Another type of bandwidth selector is the plug-in bandwidth selector. Cwik and Ko- ronacki [Cwik and Koronacki, 1997a] introduce a combined adaptive-mixtures plug-in es-

19 timator for multivariate probability densities. Duong and Hazelton [Duong and Hazel- ton, 2003] discuss plug-in bandwidth selectors for the bivariate kernel density estima- tion. Loader [Loader, 1999] compares classical and plug-in bandwidth selectors, as does

Park [Park, 1989]. Song [Song, Seog and Cho, 1991] discuss asymptotically optimal plug-in bandwidth selectors. Wand and Jones [Wand and Jones, 1994] extend plug-in bandwidth selectors to the multivariate case.

Authors exploring the statistical properties of using the square root law for bandwidth selection include Abramson [Abramson, 1982], Jones [Jones, Marron and Park, 1991], Wu and Tsai [Wu and Tsai, 2004], and Yang [Yang, 2000]. Scott and Sain [Scott and Sain, 2004] recommend least squares or unbiased cross-validation algorithms for optimal bandwidth estimation.

3.3.1 Univariate Fixed Bandwidth Selectors

Rosenblatt [Rosenblatt, 1956] and Parzen [Parzen, 1962] first introduced univariate kernel density estimators in the literature. Wand and Jones [Wand and Jones, 1995] contains a comprehensive history of univariate bandwidth selectors with an extensive bibliography.

These authors provide references to all the major types of bandwidth selectors, including plug-in and cross validation selectors.

Mean Integrated Square Error Selector

Many univariate bandwidth selectors assume a normal or Gaussian distribution. One common estimator of the optimal bandwidth h, obtained by minimizing the Mean Integrated Square Error (MISE) under a normal reference density, is given by

h_{opt} = \frac{1.06}{n^{1/5}} \min\left( \hat{\sigma}, \frac{\hat{R}}{1.34} \right),   (3.4)

where \hat{\sigma} is the sample standard deviation and \hat{R} is the interquartile range. This univariate selector works well when the data are normally distributed, but the smoothing parameter h is wrong if the population is not normally distributed or is a mixture of normals. If the true density is a mixture of normal distributions, the estimated density obtained using h_{opt} hides a lot of structure by smoothing away the modes.
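For illustration, a minimal Python sketch of Eqn. 3.4 follows; the function name and the simulated sample are illustrative assumptions and not part of the original analysis.

import numpy as np

def normal_reference_bandwidth(x):
    """Rule-of-thumb bandwidth of Eqn. 3.4: h = 1.06 * min(sd, IQR/1.34) / n^(1/5)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    sigma_hat = x.std(ddof=1)                           # sample standard deviation
    iqr_hat = np.subtract(*np.percentile(x, [75, 25]))  # interquartile range R-hat
    return 1.06 * min(sigma_hat, iqr_hat / 1.34) / n ** 0.2

# Example: bandwidth for 50 simulated normal observations
rng = np.random.default_rng(0)
print(normal_reference_bandwidth(rng.normal(10.0, 1.0, size=50)))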

Histogram Selector

The first publication to advocate a systematic rule for the construction of a histogram was due to Sturges [Sturges, 1926]. For a histogram of binomial data with k bins and p = 0.5, the optimal bandwidth h is given by Scott and Sain [Scott and Sain, 2004] as

h = \frac{x_{\max} - x_{\min}}{1 + \log_2(n)},   (3.5)

where x_{\max} is the largest order statistic, x_{\min} is the smallest order statistic, and n is the sample size. However, Scott and Sain [Scott and Sain, 2004] showed Eqn. 3.5 is far from optimal in a stochastic setting. Sturges' bin width in Eqn. 3.5 decreases too slowly as the sample size increases, so Sturges' rule generally oversmooths the data.

Another histogram bandwidth estimator is proposed by Scott and Sain [Scott and Sain, 2004]. For normal distributions, Scott's rule is given as

h = \hat{\sigma} \left( \frac{24\sqrt{\pi}}{n} \right)^{1/3},   (3.6)

where \hat{\sigma} is the estimated standard deviation.
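As a concrete illustration of Eqns. 3.5 and 3.6, a minimal Python sketch is given below; the function names and the simulated data are illustrative assumptions only.

import numpy as np

def sturges_bin_width(x):
    """Sturges' bin width, Eqn. 3.5: (x_max - x_min) / (1 + log2(n))."""
    x = np.asarray(x, dtype=float)
    return (x.max() - x.min()) / (1.0 + np.log2(x.size))

def scott_bin_width(x):
    """Scott's normal-reference bin width, Eqn. 3.6: sigma-hat * (24*sqrt(pi)/n)^(1/3)."""
    x = np.asarray(x, dtype=float)
    return x.std(ddof=1) * (24.0 * np.sqrt(np.pi) / x.size) ** (1.0 / 3.0)

rng = np.random.default_rng(1)
sample = rng.normal(size=200)
print(sturges_bin_width(sample), scott_bin_width(sample))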

Frequency Polygon Selector

Scott [Scott, 1985] showed that the smoothness of the frequency polygon reduces not only the bias but also the variance compared to the histogram. For normally distributed data, Scott proposed the optimal bandwidth h given by

h = \frac{2.15\,\hat{\sigma}}{n^{1/5}},   (3.7)

where \hat{\sigma} is the estimated standard deviation. However, for non-normally distributed data, Eqn. 3.7 is no longer optimal.

Cross Validation Selector

Duong [Duong, 2004] describes cross validation methods that make use of leave-one-out estimators. These estimators have the form

\hat{f}_j(x_j) = \frac{1}{(n-1)h} \sum_{\substack{i=1 \\ i \neq j}}^{n} K\left( \frac{x_j - x_i}{h} \right).   (3.8)

This estimator leaves out the jth data point. The kernel density estimate is computed on the rest of the data and then reevaluated at the omitted data point. If the estimate is appropriate, then \hat{f}_j(x_j) should be non-zero since the data set already has a data point at x_j.

Least Squares Cross Validation Selector

Least squares cross validation (LSCV) was developed independently by Rudemo [Rudemo, 1982] and Bowman [Bowman, 1984]. The goal is to find the bandwidth that minimizes

LSCV(h) = \int_{-\infty}^{\infty} \hat{f}(x;h)^2\,dx - \frac{2}{n(n-1)} \sum_{j=1}^{n} \sum_{\substack{i=1 \\ i \neq j}}^{n} K_h(x_j - x_i),   (3.9)

where K_h(\cdot) is the scaled kernel function for a given bandwidth h. LSCV(h) is an unbiased estimator of the MISE up to a constant that does not depend on h, so the LSCV selector is sometimes called the unbiased cross validation selector. This selector does not rely on asymptotic expansions, unlike the biased and smoothed cross validation methods described in the next two subsections.
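The following is a minimal Python sketch of Eqn. 3.9 for the special case of a Gaussian kernel, for which the integral of the squared estimate has a closed form; the grid search, function name, and simulated data are illustrative assumptions.

import numpy as np
from scipy.stats import norm

def lscv(h, x):
    """LSCV score of Eqn. 3.9 for a Gaussian kernel.

    For Gaussian kernels, the integral of f-hat squared equals
    (1/n^2) * sum_ij phi(x_i - x_j; sd = sqrt(2)*h).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    diff = x[:, None] - x[None, :]
    int_term = norm.pdf(diff, scale=np.sqrt(2.0) * h).sum() / n ** 2
    off_diag = norm.pdf(diff, scale=h)
    np.fill_diagonal(off_diag, 0.0)            # exclude the i = j terms
    cv_term = 2.0 * off_diag.sum() / (n * (n - 1))
    return int_term - cv_term

# Grid search for the LSCV-optimal bandwidth
rng = np.random.default_rng(2)
data = rng.normal(size=100)
grid = np.linspace(0.05, 1.5, 60)
print(grid[np.argmin([lscv(h, data) for h in grid])])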

Biased Cross Validation Selector

Scott and Terrell [Scott and Terrell, 1987] introduced a biased cross validation bandwidth selector that minimizes an estimate of the asymptotic mean integrated square error (AMISE). The biased cross validation selector as a function of the bandwidth h is

BCV(h) = \frac{R(K)}{nh} + \frac{h^4 \mu_2(K)^2}{4 n(n-1)} \sum_{j=1}^{n} \sum_{\substack{i=1 \\ i \neq j}}^{n} \left( K''_h * K''_h \right)(x_j - x_i),   (3.10)

where R(K) = \int K(x)^2\,dx.

Smoothed Cross Validation Selector

Hall [Hall, Marron and Park, 1992] proposed smoothed cross validation (SCV) as a hybrid of estimating MISE and AMISE. This selector uses the asymptotic integrated variance R(K)/(nh) and an estimate of the exact non-asymptotic integrated squared bias to derive the SCV function given by

SCV(h) = \frac{R(K)}{nh} + \frac{1}{n^2} \sum_{j=1}^{n} \sum_{i=1}^{n} \left( K_h * K_h * L_g * L_g - 2 K_h * L_g * L_g + L_g * L_g \right)(x_j - x_i),   (3.11)

where L is a pilot kernel and g is the pilot bandwidth. Hall also showed that SCV(h) is asymptotically equivalent to the smoothed bootstrap proposed by Taylor [Taylor, 1989] and Faraway and Jhun [Faraway and Jhun, 1990].

Plug-In Selector

Plug-in bandwidth selectors are based on the AMISE

AMISE\,\hat{f}(h) = \frac{R(K)}{nh} + \frac{h^4 \mu_2(K)^2 \psi_4}{4}   (3.12)

as a starting point. The AMISE requires that h \to 0 and nh \to \infty as n \to \infty and that f'' is piecewise continuous and square integrable. The function \psi_4 is defined by

\psi_4 = \int_{-\infty}^{\infty} f^{(4)}(x) f(x)\,dx.   (3.13)

The estimate \hat{\psi}_4 is then substituted into the AMISE estimate given by

PI(h) = \frac{R(K)}{nh} + \frac{h^4 \mu_2(K)^2 \hat{\psi}_4}{4}.   (3.14)

The plug-in univariate selector \hat{h}_{PI} that minimizes PI(h) has the closed form solution

\hat{h}_{PI} = \left[ \frac{R(K)}{\mu_2(K)^2\, \hat{\psi}_4\, n} \right]^{1/5}.   (3.15)

Sheather and Jones [Sheather and Jones, 1991] proposed the sample mean of the fourth derivative of a pilot kernel density estimate of f as an estimator for \psi_4, where

\hat{f}_P(x; g) = \frac{1}{n} \sum_{j=1}^{n} L_g(x - X_j),   (3.16)

and L is the pilot kernel and g is the pilot bandwidth. Therefore, an estimate of \psi_4 is given by

\hat{\psi}_4(g) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}_P^{(4)}(X_i; g) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} L_g^{(4)}(X_i - X_j),   (3.17)

where the most appropriate pilot bandwidth g is calculated from an algorithm by Sheather and Jones [Sheather and Jones, 1991].

An important conclusion from a review of these univariate bandwidth selectors is that there is no uniformly best bandwidth selector for all target densities. The shape and structure of the target density heavily influence which selectors perform well [Duong, 2004].

Note that none of these univariate bandwidth selectors has utilized information complexity to optimize the bandwidth h.

3.3.2 Multivariate Fixed Bandwidth Selectors

Extensions of univariate bandwidth selectors to the multivariate case are also in the literature, though they are typically more computationally intensive than univariate selectors. Multivariate bandwidth selectors require a bandwidth matrix rather than a scalar bandwidth as in the univariate case. Bowman and Azzalini [Bowman and Azzalini, 1997], Scott [Scott, 1992], Silverman [Silverman, 1986], Simonoff [Simonoff, 1996] and Wand and Jones [Wand and Jones, 1995] provide an extensive overview of the research already cited in the literature for multivariate density estimation. Duong and Hazelton [Duong and Hazelton, 2005] explore convergence rates for unconstrained bandwidth matrices and cross validation bandwidth matrices for multivariate kernel density estimation. Sain [Sain, Baggerly and Scott, 1994] and Scott [Scott, Tapia and Thompson, 1997] also extend bandwidth selection to multivariate kernel density estimation.

The general form of the multivariate kernel density estimator from Wand [Wand and Jones, 1993] of f(x) for a vector x = (x_1, ..., x_p)' is given by

\hat{f}_H(x) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - x_i),   (3.18)

where K(\cdot) is a multivariate kernel density function and H is a symmetric positive definite (p \times p) bandwidth matrix. In the sample point variant of this estimator, each data point x_i is used in the computations through its own bandwidth H_i; the sample point estimator has superior performance relative to kernel density estimators where the variable bandwidth is associated with the center of the kernel x.
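A minimal Python sketch of Eqn. 3.18 is given below, using the scaling K_H(u) = |H|^{-1/2} K(H^{-1/2} u) with a standard Gaussian kernel, consistent with the convention later used in Eqn. 3.45; the function name, the example bandwidth matrix, and the simulated data are illustrative assumptions.

import numpy as np
from scipy.stats import multivariate_normal

def mv_kde(x, data, H):
    """Multivariate kernel density estimate of Eqn. 3.18 at the point x.

    With a standard Gaussian kernel, |H|^(-1/2) K(H^(-1/2) u) is simply the
    Gaussian density with covariance matrix H evaluated at u.
    """
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    kernel = multivariate_normal(mean=np.zeros(x.size), cov=H)
    return kernel.pdf(x - data).mean()

# Example with a 2 x 2 bandwidth matrix
rng = np.random.default_rng(3)
sample = rng.normal(size=(50, 2))
H = np.array([[0.3, 0.05],
              [0.05, 0.2]])
print(mv_kde([0.0, 0.0], sample, H))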

Asymptotic Mean Squared Error Selector

In the multivariate setting with a scalar bandwidth h, the asymptotic mean squared error (AMSE) of \hat{f}(x) is given by

AMSE\,\hat{f}(x) = \frac{R(K) f(x)}{n h^d} + \frac{h^4 \mu_2(K)^2\, \mathrm{tr}(D^2 f(x))}{4}.   (3.19)

The minimizer of AMSE\,\hat{f}(x) can be shown to be of order n^{-1/(d+4)}. Closed forms for the AMSE optimal bandwidths are not available for d > 2. Epanechnikov [Epanechnikov, 1969] optimizes both the bandwidths and the kernel function in the context of AMISE. If h_i = h for i = 1, 2, ..., d, a closed form solution for the optimal bandwidth h_{AMISE} is given by

h_{AMISE} = \left[ \frac{d\, R(K)}{n\, \mu_2(K)^2 \int_{-\infty}^{\infty} \mathrm{tr}^2(D^2 f(x))\,dx} \right]^{1/(d+4)}.   (3.20)

The optimal kernel is known as the Epanechnikov kernel. However, the Epanechnikov kernel establishes only a theoretical basis for bandwidth selectors; it does not provide data based algorithms for estimating the unknown quantities in these expressions.

Least Squares Cross Validation Selector

Stone [Stone, 1984] proposed the multivariate least squares cross validation (LSCV) criterion that is a generalization of the univariate case described previously. The LSCV proposed by Stone is given by

LSCV(H) = \int \hat{f}(x; H)^2\,dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{-i}(X_i; H),   (3.21)

where \hat{f}_{-i} denotes the leave-one-out kernel density estimate that omits observation X_i.

Biased Cross Validation Selector

Sain [Sain, Baggerly and Scott, 1994] generalized the univariate biased cross validation (BCV) selector described previously using diagonal bandwidth matrices. The BCV selector they develop is

BCV(H) = \frac{R(K)}{n \sqrt{|H|}} + \frac{(\mathrm{vech}^T H)\, \hat{\Psi}_4\, (\mathrm{vech}\, H)}{4},   (3.22)

where

\hat{\Psi}_4(H) = \frac{1}{n(n-1)} \sum_{j=1}^{n} \sum_{\substack{i=1 \\ i \neq j}}^{n} K_H^{(4)}(X_i - X_j).   (3.23)

Based on their asymptotic results and simulation study, Sain [Sain, Baggerly and Scott, 1994] recommends this BCV selector.

Smoothed Cross Validation Selector

Sain [Sain, Baggerly and Scott, 1994] also developed a general multivariate smoothed cross validation (SCV) criterion that is an extension of Eqn. 3.11 given by

SCV(H) = \frac{R(K)}{n \sqrt{|H|}} + \frac{1}{n^2} \sum_{j=1}^{n} \sum_{i=1}^{n} \left( K_H * K_H * L_G * L_G - 2 K_H * L_G * L_G + L_G * L_G \right)(X_j - X_i),   (3.24)

where L is a pilot kernel and G is the pilot bandwidth matrix. Sain [Sain, Baggerly and Scott, 1994] show that for G = H, the SCV selector is suboptimal since the possibility of optimally selecting the pilot G is not accounted for.

Plug-In Selector

Wand and Jones [Wand and Jones, 1994] proposed a multivariate plug-in bandwidth selector that is similar to the BCV selector except for the way \hat{\Psi}_4 is estimated. The plug-in (PI) multivariate bandwidth selector is an extension of the univariate selector approach by Sheather and Jones [Sheather and Jones, 1991] and is given by

PI(H) = \frac{R(K)}{n \sqrt{|H|}} + \frac{(\mathrm{vech}^T H)\, \hat{\Psi}_4\, (\mathrm{vech}\, H)}{4},   (3.25)

where

\hat{\Psi}_4(G) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K_G^{(4)}(X_i - X_j).   (3.26)

Here G is independent of H and G \neq H. Wand and Jones [Wand and Jones, 1994] propose an algorithm to find an appropriate pilot bandwidth. They show through a simulation study that the properties of univariate plug-in selectors carry over to the multivariate case.

Cross validation and plug-in selectors are currently the most commonly used selectors [Duong, 2004].

Maximal Smoothing Selector

Terrell [Terrell, 1990] proposed a maximal smoothing selector that induces the smoothest density estimate that is consistent with the data scale. Terrell uses the kernel K scaled such that

I_d = \int x x^T K(x)\,dx,   (3.27)

so that the AMISE takes the form

AMISE\,\hat{f}(x; h) = \frac{R(K)}{n h^d} + \frac{h^4}{4} \int \mathrm{tr}\{H' D^2 f(x)\}\,dx,   (3.28)

with minimizing bandwidth

h = \left[ \frac{d\, R(K)}{n \int \mathrm{tr}^2\{D^2 f_0(x)\}\,dx} \right]^{1/(d+4)}.   (3.29)

Using a minimax approach, the maximally smoothed (MS) selector \hat{H}_{MS}, obtained with the density f_0 with variance I_d that maximizes the integral in the denominator, and minimizing over H, is given by

\hat{H}_{MS} = \left[ \frac{(d+8)^{(d+6)/2} \pi^{d/2} R(K)}{16 (d+2)\, n\, \Gamma(d/2 + 4)} \right]^{2/(d+4)} S.   (3.30)

The maximal smoothing selector is the only multivariate bandwidth selector that has a closed form. No large scale simulation studies of multivariate bandwidth selectors have been found in the literature.
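A minimal Python sketch of the closed form in Eqn. 3.30 follows, assuming a standard Gaussian kernel so that R(K) = (4\pi)^{-d/2} and taking S to be the sample covariance matrix; the function name and simulated data are illustrative assumptions.

import numpy as np
from scipy.special import gamma

def maximal_smoothing_H(data):
    """Terrell's maximal smoothing bandwidth matrix, Eqn. 3.30, Gaussian kernel."""
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    S = np.cov(data, rowvar=False)           # sample covariance matrix
    RK = (4.0 * np.pi) ** (-d / 2.0)         # R(K) for the standard Gaussian kernel
    factor = ((d + 8.0) ** ((d + 6.0) / 2.0) * np.pi ** (d / 2.0) * RK
              / (16.0 * (d + 2.0) * n * gamma(d / 2.0 + 4.0))) ** (2.0 / (d + 4.0))
    return factor * S

rng = np.random.default_rng(4)
print(maximal_smoothing_H(rng.normal(size=(50, 3))))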

AMISE Multivariate Normal

Wand and Jones [Wand and Jones, 1995] derive the asymptotic mean integrated squared error (AMISE) of the multivariate kernel density estimator, given by

AMISE(H) = \frac{\mu_2(K)^2 \int_{-\infty}^{\infty} \left[ \mathrm{tr}\{H' H_g(x) H\} \right]^2 dx}{4} + \frac{\|K\|_2^2}{n |H|},   (3.31)

where H is the bandwidth matrix, \mu_2(K) is the second moment of K, H_g is the Hessian matrix of second partial derivatives, and \|K\|_2^2 is the p-dimensional squared L_2 norm of K.

If we let K(\cdot) be the multivariate standard normal distribution such that K \sim N_p(0, I), Eqn. 3.31 becomes

AMISE_N(H) = \frac{2\,\mathrm{tr}\{(H'\Sigma^{-1}H)^2\} + \{\mathrm{tr}(H'\Sigma^{-1}H)\}^2}{4\left[ 2^{p+2} \pi^{p/2} \sqrt{|\Sigma|} \right]} + \frac{1}{2^p \pi^{p/2} n |H|},   (3.32)

where p is the number of variables. If we assume H and \Sigma to be diagonal matrices such that H = \mathrm{diag}(h_1, ..., h_p) and \Sigma = \mathrm{diag}(\sigma_1^2, ..., \sigma_p^2), the optimal bandwidth for each variable j is shown in Wand and Jones [Wand and Jones, 1995] to be

h_j^* = \left[ \frac{4}{n(p+2)} \right]^{1/(p+4)} \hat{\sigma}_j   (3.33)

for j = 1, ..., p, where \hat{\sigma}_j can be replaced by its sample estimate s_j. Eqn. 3.33 is known as the "normal reference rule" since a multivariate normal distribution is assumed in its derivation. Eqn. 3.33 requires the estimation of \hat{\sigma}_j and does not take into account the correlational structure in the data. Note in the univariate case when p = 1, Eqn. 3.33 reduces to Silverman's rule of thumb [Silverman, 1986] given by

h_j = \left( \frac{4}{3n} \right)^{1/5} \hat{\sigma}_j.   (3.34)

Eqn. 3.34 also requires estimation of each \hat{\sigma}_j and does not take into account the correlational structure in the data.
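For illustration, a minimal Python sketch of the normal reference rule (Eqn. 3.33) and its univariate special case (Eqn. 3.34) follows; the function names and simulated data are illustrative assumptions.

import numpy as np

def normal_reference_rule(data):
    """Per-variable bandwidths of Eqn. 3.33: h_j = (4 / (n(p+2)))^(1/(p+4)) * s_j."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    s = data.std(axis=0, ddof=1)                 # sample estimates s_j of sigma_j
    return (4.0 / (n * (p + 2.0))) ** (1.0 / (p + 4.0)) * s

def silverman_rule(x):
    """Univariate special case, Eqn. 3.34: h = (4 / (3n))^(1/5) * sigma-hat."""
    x = np.asarray(x, dtype=float)
    return (4.0 / (3.0 * x.size)) ** 0.2 * x.std(ddof=1)

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 7))
print(normal_reference_rule(X))
print(silverman_rule(X[:, 0]))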

3.3.3 Variable Bandwidth Selectors

In addition to univariate and multivariate fixed bandwidth selectors, variable bandwidth selectors are also used. Variable bandwidth selectors use bandwidth functions instead of a constant bandwidth. The two main classes of variable bandwidth selectors assume either that the bandwidth is different at each estimation point, (x, h(x)), or that the bandwidth is different at each data point, X_i: h_i = \omega(X_i), i = 1, 2, ..., n. The functions h(\cdot) and \omega(\cdot) are non-random functions. Sain and Scott [Sain and Scott, 1996] refer to the two types of selectors as balloon and sample point selectors. Thus, the kernel density estimators arising from these selectors are called balloon and sample point kernel density estimators [Duong, 2004].

Schucany [Schucany, 1989] introduced balloon selectors, while sample point selectors were introduced by Wagner [Wagner, 1975], Breiman [Breiman, Meisel and Purcell, 1977] and Victor [Victor, 1976]. The balloon (B) estimator \hat{f}_B(x; h(x)) takes on the functional form

\hat{f}_B(x; h(x)) = \frac{1}{n} \sum_{i=1}^{n} K_{h(x)}(x - X_i).   (3.35)

For balloon estimators, the bandwidth is a function of the estimation point. For a given point x_0, all the kernels have the same bandwidth h(x_0). However, balloon estimators are not true density functions since they do not typically integrate to 1 due to their focus on local estimation [Terrell and Scott, 1992].

The sample point (SP) estimator takes on the functional form

\hat{f}_{SP}(x; \omega) = \frac{1}{n} \sum_{i=1}^{n} K_{h_i}(x - X_i),   (3.36)

where h_i = \omega(X_i), i = 1, 2, ..., n. Each kernel has a different bandwidth for the sample point estimator, whereas the fixed kernel density estimator has a fixed bandwidth. The bandwidths change at each of the data points rather than at each estimation point as for the balloon estimator. Abramson [Abramson, 1982] suggested using a pilot estimate \hat{f}_P with

\hat{\omega}(X_i) = \frac{h}{\sqrt{\hat{f}_P(X_i)}}.   (3.37)
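The following is a minimal Python sketch of the sample point estimator with Abramson's bandwidths (Eqns. 3.36 and 3.37), assuming a Gaussian kernel and taking the pilot estimate to be a fixed-bandwidth estimate with Silverman's rule; these choices, the function name, and the simulated data are illustrative assumptions.

import numpy as np
from scipy.stats import norm

def abramson_sample_point_kde(x_grid, data, h):
    """Sample point estimator with h_i = h / sqrt(f_P(X_i)) from Eqn. 3.37."""
    data = np.asarray(data, dtype=float)
    n = data.size
    # pilot estimate at the data points (fixed Silverman bandwidth, Eqn. 3.34)
    h_pilot = (4.0 / (3.0 * n)) ** 0.2 * data.std(ddof=1)
    f_pilot = norm.pdf(data[:, None] - data[None, :], scale=h_pilot).mean(axis=1)
    h_i = h / np.sqrt(f_pilot)                       # Eqn. 3.37
    # each kernel carries its own bandwidth h_i (Eqn. 3.36)
    return norm.pdf(np.subtract.outer(x_grid, data), scale=h_i).mean(axis=1)

rng = np.random.default_rng(6)
sample = rng.standard_t(df=3, size=200)
grid = np.linspace(-6, 6, 5)
print(abramson_sample_point_kde(grid, sample, h=0.5))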

Terrell and Scott [Terrell and Scott, 1992] developed multivariate generalized kernel density estimators which combine the fixed, balloon and sample point kernel estimators, in addition to non-parametric density estimators, by generalizing the sample point estimator of Breiman [Breiman, Meisel and Purcell, 1977]. The general multivariate sample point estimator is

\hat{f}_{SP}(x; \Omega) = \frac{1}{n} \sum_{i=1}^{n} K_{\Omega(X_i)}(x - X_i),   (3.38)

where \Omega(X_i) = h^2 f(X_i)^{-1} I as proposed by Abramson [Abramson, 1982]. Other choices for \Omega include the kth nearest neighbor function of X_i multiplied by the identity matrix I [Breiman, Meisel and Purcell, 1977], a piecewise constant function [Sain, 2002], and \Omega(X_i) = H_j if X_i \in bin j [Sain and Scott, 1996]. In addition, Hall [Hall, 1992] discussed other global properties of variable bandwidth density estimators.

3.3.4 Information Complexity Bandwidth Selectors

A new method of estimating the bandwidth h_j^* from Eqn. 3.33 uses Bozdogan's ICOMP [Bozdogan, 1988, Bozdogan, 1990a, Bozdogan, 1990b, Bozdogan, 2000] from Eqn. 2.8 in place of \hat{\sigma}_j. The entropic complexity of the estimated covariance matrix \hat{\Sigma} is

\hat{\sigma}_j = C_1(\hat{\Sigma}) \equiv \frac{p}{2} \log\left[ \frac{\mathrm{tr}(\hat{\Sigma})}{p} \right] - \frac{1}{2} \log|\hat{\Sigma}|,   (3.39)

where p is the rank of \hat{\Sigma}.

If \Sigma is diagonal, the Frobenius norm characterization of the complexity is defined as

s_j = C_F(\hat{\Sigma}) = \frac{\mathrm{tr}(\hat{\Sigma}' \hat{\Sigma})}{p} - \frac{\mathrm{tr}(\hat{\Sigma})^2}{p^2},   (3.40)

where p is the rank of \hat{\Sigma}.

Another alternative is to choose the bandwidth matrix proportional to \Sigma^{1/2}. The estimated bandwidth matrix \hat{H} is defined as

\hat{H} = \frac{\hat{\Sigma}^{1/2}}{n^{1/(p+4)}},   (3.41)

where p is the rank of \hat{\Sigma}.

A likelihood cross validation technique uses the Kullback-Leibler (K-L) [Kullback, 1987] information as a measure of distance between two densities. The goal is to minimize the K-L distance of the approximate density \hat{f}_H(x) from the true density f(x). The K-L information is defined as

KL(f, \hat{f}_H) = \int \log\left( \frac{f(x)}{\hat{f}_H(x)} \right) f(x)\,dx = \int \log[f(x)] f(x)\,dx - \int \log[\hat{f}_H(x)] f(x)\,dx.   (3.42)

The goal is to derive a bandwidth matrix H that minimizes KL(f, \hat{f}_H) or, equivalently, maximizes E[\log \hat{f}_H(x)] as shown by

E[\log \hat{f}_H(x)] = \int \log[\hat{f}_H(x)] f(x)\,dx.   (3.43)

Eqn. 3.43 is estimated by using the log pseudo-likelihood function

\log L(x_1, ..., x_n | H) = \sum_{i=1}^{n} \log[\hat{f}_{H,(-i)}(x_i)],   (3.44)

where \hat{f}_{H,(-i)} is the leave-one-out estimator given by

\hat{f}_{H,(-i)}(x_i) = \frac{1}{n-1} \sum_{\substack{j=1 \\ j \neq i}}^{n} |H|^{-1/2} K\{H^{-1/2}(x_i - x_j)\}.   (3.45)

The likelihood cross validation criterion will select H by maximizing \frac{1}{n} \log L(\cdot|H). Alternatively, we can minimize ICOMP(IFIM) in

ICOMP(IFIM) = -2 \log L(x_1, ..., x_n | H) + 2 C_{1F}(\hat{\Sigma}),   (3.46)

where C_{1F}(\hat{\Sigma}) is defined by

C_{1F}(\hat{\Sigma}) = \frac{s\, C_F(\hat{\Sigma})}{4 \left( \frac{\mathrm{tr}(\hat{\Sigma})}{s} \right)^2} = \frac{1}{4 \bar{\lambda}_a^2} \sum_{j=1}^{s} (\lambda_j - \bar{\lambda}_a)^2,   (3.47)

where the \lambda_j are the eigenvalues of \hat{\Sigma} and \bar{\lambda}_a is their arithmetic mean. The function C_{1F}(\cdot) is a second order equivalent measure of complexity to the original C_1(\cdot) measure and is scale-invariant with C_{1F}(\cdot) \geq 0. Note that C_{1F}(\cdot) measures the relative variation in the eigenvalues and C_{1F}(\cdot) = 0 only when \lambda_j = \bar{\lambda}_a for all j = 1, ..., s.

The Cholesky decomposition can be applied to the bandwidth matrix H since H is symmetric positive definite. The Cholesky decomposition calculates a lower triangular matrix L such that H = LL'. The kernel estimator of f(x) is

\hat{f}_B(x) = \frac{|B|}{n} \sum_{i=1}^{n} K(B(x - x_i)),   (3.48)

where B = L^{-1}. The leave-one-out estimator of f(x) is

\hat{f}_{B,(-i)}(x_i) = \frac{|B|}{n-1} \sum_{\substack{j=1 \\ j \neq i}}^{n} K(B(x_i - x_j)).   (3.49)
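The sketch below illustrates how a candidate bandwidth matrix H could be scored by Eqn. 3.46, using the leave-one-out estimator of Eqn. 3.45 with a Gaussian kernel and the C_1F complexity of Eqn. 3.47; for simplicity \hat{\Sigma} is taken to be the sample covariance matrix of the data, which is an illustrative assumption rather than the dissertation's full construction, so the comparison across H values in this sketch is driven by the pseudo-likelihood term.

import numpy as np
from scipy.stats import multivariate_normal

def c1f(sigma_hat):
    """C_1F complexity of Eqn. 3.47 from the eigenvalues of sigma_hat."""
    lam = np.linalg.eigvalsh(sigma_hat)
    lam_bar = lam.mean()
    return np.sum((lam - lam_bar) ** 2) / (4.0 * lam_bar ** 2)

def icomp_ifim_bandwidth_score(H, data):
    """ICOMP(IFIM) score of Eqn. 3.46 for a candidate bandwidth matrix H."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    kernel = multivariate_normal(mean=np.zeros(p), cov=H)
    log_lik = 0.0
    for i in range(n):
        others = np.delete(data, i, axis=0)
        f_loo = kernel.pdf(data[i] - others).mean()      # Eqn. 3.45
        log_lik += np.log(f_loo)                         # Eqn. 3.44
    return -2.0 * log_lik + 2.0 * c1f(np.cov(data, rowvar=False))

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 2))
for scale in (0.1, 0.3, 0.9):
    print(scale, icomp_ifim_bandwidth_score(scale * np.eye(2), X))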

For a data set denoted by (x_1, ..., x_n), let S denote the sample covariance matrix with diagonal components D = \mathrm{diag}(s_1^2, ..., s_p^2). The sphering transformation is defined as

x_i^* = S^{-1/2} x_i   (3.50)

for i = 1, 2, ..., n. The scaling transformation is defined by

x_i^* = D^{-1/2} x_i   (3.51)

for i = 1, 2, ..., n. The optimal bandwidth matrix for the original data, \hat{H}, can be calculated through the reverse transformations given by

\hat{H} = S^{1/2} \hat{H}^* (S^{1/2})'   (3.52)

and

\hat{H} = D^{1/2} \hat{H}^* (D^{1/2})',   (3.53)

where \hat{H}^* denotes the optimal bandwidth matrix computed on the transformed data.

Substituting Eqn. 3.39 into Eqn. 3.33 gives

h^* = \left[ \frac{4}{n(p+2)} \right]^{1/(p+4)} \left[ \frac{p}{2} \log\frac{\mathrm{tr}(\hat{\Sigma})}{p} - \frac{1}{2} \log|\hat{\Sigma}| \right],   (3.54)

which will be used to estimate the bandwidth h for kernel regression smoothing using all X variables together. The subscript j is no longer needed to denote each of the p variables for h since \hat{\sigma}_j is calculated using the correlational structure from all the X matrix variables and is constant for all p variables.
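For illustration, a minimal Python sketch of Eqn. 3.54 is given below; the function name and simulated data are illustrative assumptions.

import numpy as np

def icomp_bandwidth(data):
    """Bandwidth of Eqn. 3.54: the normal reference rule of Eqn. 3.33 with the
    entropic complexity C_1(Sigma-hat) of Eqn. 3.39 used in place of sigma-hat_j."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    sigma_hat = np.cov(data, rowvar=False)
    c1 = (0.5 * p * np.log(np.trace(sigma_hat) / p)
          - 0.5 * np.linalg.slogdet(sigma_hat)[1])       # Eqn. 3.39
    return (4.0 / (n * (p + 2.0))) ** (1.0 / (p + 4.0)) * c1

rng = np.random.default_rng(8)
X = rng.multivariate_normal(np.zeros(3), np.diag([1.0, 4.0, 9.0]), size=50)
print(icomp_bandwidth(X))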

When kernel regression smoothing is applied to each X or Y variable individually, Bozdogan [Bozdogan, 2006] applied

h_j = 1.4^{j}\, \frac{1.06}{n^{1/5}} \min\left( \hat{\sigma}, \frac{\hat{R}}{1.34} \right)   (3.55)

to each variable for calculating the optimal h by scoring ICOMP for each j = -3, -2, ..., 4, where \hat{\sigma} is the sample standard deviation and \hat{R} is the interquartile range, as in Eqn. 3.4. The value of h_j associated with the minimized ICOMP score using the misspecification covariance matrix is the optimized h for that variable.

3.4 Conclusions

A review of the current literature showed several methods for both univariate and multivariate kernel density estimation. However, none of the methods in the literature use kernel density estimation in conjunction with ICOMP and the genetic algorithm. This new contribution to the literature is discussed in chapter 6. Kernel density estimation also has not been applied with the implicit enumeration algorithm, which is discussed in chapter 7.

Chapter 4

Univariate Kernel Regression

4.1 Kernel Regression

Regression models in which the response depends on some covariates linearly, but on other covariates nonparametrically are called partially linear models (PLMs). PLMs are special cases of additive models that generalize standard linear regression techniques. Linear parameters are estimated by least squares estimators, while the nonparametric part can be estimated by kernel regression. When the model is heteroscedastic, the variance functions are approximated by weighted least squares estimators.

Hardle [Hardle, Mori and Vieu, 2007] discusses how kernel regression is used in PLMs.

PLMs are defined by

Y = X^T \beta + g(T) + \varepsilon,   (4.1)

where X and T are d-dimensional and scalar regressors, respectively, \beta is a vector of unknown parameters, g(\cdot) is an unknown smooth function, and \varepsilon is an error term with mean zero conditional on X and T.

PLMs are a special form of the additive regression model and are more flexible than standard linear models since they combine both parametric and nonparametric components. Hardle [Hardle, 1990] gives an extensive discussion of various nonparametric statistical methods based on the kernel estimator. Wand and Jones [Wand and Jones, 1995] also discuss related work in kernel regression.

Let K(\cdot) be a kernel function and h_n be a bandwidth parameter. The weight function is defined by

\omega_{ni}(t) = \frac{K\left( \frac{t - T_i}{h_n} \right)}{\sum_{j=1}^{n} K\left( \frac{t - T_j}{h_n} \right)}.   (4.2)

Let g_n(t, \beta) = \sum_{i=1}^{n} \omega_{ni}(t)(Y_i - X_i^T \beta) for a given \beta. By substituting g_n(t, \beta) into Eqn. 4.1 and using the least squares criterion, the least squares estimator of \beta is

\hat{\beta}_{KR} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \tilde{Y},   (4.3)

where \tilde{X} = (\tilde{X}_1, ..., \tilde{X}_n)^T with \tilde{X}_j = X_j - \sum_{i=1}^{n} \omega_{ni}(T_j) X_i and \tilde{Y} = (\tilde{Y}_1, ..., \tilde{Y}_n)^T with \tilde{Y}_j = Y_j - \sum_{i=1}^{n} \omega_{ni}(T_j) Y_i. The nonparametric part g(t) is estimated by

\hat{g}_n(t) = \sum_{i=1}^{n} \omega_{ni}(t)(Y_i - X_i^T \hat{\beta}_{KR}).   (4.4)

When \varepsilon_1, ..., \varepsilon_n are identically distributed, their common variance \sigma^2 may be estimated by \sigma_n^2 given by

\sigma_n^2 = (\tilde{Y} - \tilde{X}\hat{\beta}_{KR})^T (\tilde{Y} - \tilde{X}\hat{\beta}_{KR}).   (4.5)
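A minimal Python sketch of the estimators in Eqns. 4.2 through 4.5 is given below, assuming a Gaussian kernel; the function name and the simulated partially linear data are illustrative assumptions.

import numpy as np
from scipy.stats import norm

def plm_kernel_regression(Y, X, T, h):
    """Partially linear model estimators of Eqns. 4.2-4.5 with a Gaussian kernel."""
    Y, X, T = np.asarray(Y, float), np.asarray(X, float), np.asarray(T, float)
    # weight matrix: W[a, i] = omega_ni(T_a) from Eqn. 4.2
    K = norm.pdf((T[:, None] - T[None, :]) / h)
    W = K / K.sum(axis=1, keepdims=True)
    X_tilde = X - W @ X                        # partial residuals of the regressors
    Y_tilde = Y - W @ Y                        # partial residuals of the response
    beta_kr, *_ = np.linalg.lstsq(X_tilde, Y_tilde, rcond=None)      # Eqn. 4.3
    g_hat = W @ (Y - X @ beta_kr)              # Eqn. 4.4 evaluated at the T_j
    sigma2 = (Y_tilde - X_tilde @ beta_kr) @ (Y_tilde - X_tilde @ beta_kr)  # Eqn. 4.5
    return beta_kr, g_hat, sigma2

# illustrative data: linear part plus a smooth function of a scalar index T
rng = np.random.default_rng(9)
n = 100
T = rng.uniform(size=n)
X = rng.normal(size=(n, 2))
Y = X @ np.array([1.5, -2.0]) + np.sin(2 * np.pi * T) + 0.1 * rng.normal(size=n)
beta_hat, g_hat, s2 = plm_kernel_regression(Y, X, T, h=0.1)
print(beta_hat)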

Hardle, Liang and Gao [Hardle, Liang and Gao, 2000] and Speckman [Speckman, 1988] provide a detailed discussion on asymptotic theories of these estimators. Hardle [Hardle, Mori and Vieu, 2007] states that if \sup_{0 \leq t \leq 1} E(\|X\|^3 \mid t) < \infty, \Sigma = \mathrm{Cov}\{X - E(X|T)\} is a positive definite matrix, g(t) and E(x_{ij}|t) are Lipschitz continuous, and the bandwidth h \approx \lambda n^{-1/5} for some 0 < \lambda < \infty, then

\sqrt{n}(\hat{\beta}_{KR} - \beta) \to N(0, \sigma^2 \Sigma^{-1}).   (4.6)

Kernel regression has been applied using a variety of bandwidth estimators, as discussed in chapter 3. However, this dissertation proposes the extension of kernel regression to the multivariate case using ICOMP for variable model selection.

Eqn. 4.2 shows the weight is calculated from only one regressor X_i, independent of the other X_j; only the data within each X_i are used in the weighting for smoothing the data. Eqn. 4.7 is an extension of Eqn. 4.2 that uses the data from all X_1, ..., X_d variables simultaneously to smooth the data by

\omega_{nik}(t) = \frac{K\left( \frac{t - T_{ik}}{h_n} \right)}{\sum_{k=1}^{d} \sum_{j=1}^{n} K\left( \frac{t - T_{jk}}{h_n} \right)}.   (4.7)

Eqn. 4.7 can be applied to the independent variables X_i, the dependent variables Y_i, or both in the multivariate case. Let g_n(t, \beta) = \sum_{i=1}^{n} \sum_{k=1}^{d} \omega_{nik}(t)(Y_i - X_i^T \beta) for a given \beta. Substituting g_n(t, \beta) into Eqn. 4.1 and using the least squares criterion, the least squares estimator of \beta is given by Eqn. 4.3, where \tilde{X} = (\tilde{X}_1, ..., \tilde{X}_n)^T with \tilde{X}_j = X_j - \sum_{i=1}^{n} \sum_{k=1}^{d} \omega_{nik}(T_j) X_i and \tilde{Y} = (\tilde{Y}_1, ..., \tilde{Y}_n)^T with \tilde{Y}_j = Y_j - \sum_{i=1}^{n} \sum_{k=1}^{d} \omega_{nik}(T_j) Y_i. The nonparametric part g(t) is estimated by

\hat{g}_n(t) = \sum_{i=1}^{n} \sum_{k=1}^{d} \omega_{nik}(t)(Y_i - X_i^T \hat{\beta}_{KR}).   (4.8)

When \varepsilon_1, ..., \varepsilon_n are identically distributed, their common variance \sigma^2 may be estimated by \sigma_n^2 in Eqn. 4.5.

4.2 Univariate Kernel Regression Results

For univariate Y data sets, all possible combinations of applying kernel regression were calculated using the notation summarized in Table 4.1.

4.2.1 Univariate Kernel Regression with Normal Data

A multivariate data set with seven normally distributed X variables and one normally distributed Y variable was simulated with n = 50 observations. The models from Table 4.1 were run assuming a univariate Y.

Figure 4.1 is a surface plot of the first two variables X1 and X2. These are approximately bivariate normal.

Histograms and scatter plots of each of the simulated variables are shown in Appendix Figures A.1 and A.2, where each variable is approximately normally distributed. Eqn. 3.55 was used to calculate the optimal bandwidth h values when kernel regression is applied individually to each variable. Therefore, there is one h bandwidth for each variable when kernel regression is applied individually to each variable.

Figure 4.2 shows smoothed kernel density plots for the various values of h_j calculated from Eqn. 3.55 for the variable X1. As the bandwidth h increases, the kernel density plots become smoother. The smallest h = 0.15337 is too small since the kernel density plot is very jagged. Yet the largest h = 2.2635 is too large since the kernel density plot has oversmoothed the data and removed features of the data that may be important. For X1 the bandwidth h that minimizes the ICOMP score is h = 0.5892.

Similarly, Figures 4.3 through 4.9 each show smoothed kernel density plots for various values of h_j calculated from Eqn. 3.55 for the variables X2, X3, ..., X7 and Y, respectively, for the Y_k and X_k models. The values of h that minimize ICOMP for the variables X2, X3, ..., X7 and Y are 0.6766, 0.5691, 2.9936, 5.2505, 3.2336, 1.0146, and 0.4766, respectively.

Table 4.1: Notation used for applying univariate kernel regression smoothing methods.

Model          Description
Y = X β        Baseline model with no kernel regression smoothing on either X or Y variables
Y = X_k β      Kernel regression on X variables one at a time, but no kernel regression on Y
Y = X_mk β     Kernel regression on all X variables together, but no kernel regression on Y
Y_k = X β      Kernel regression on Y variable, but no kernel regression on X
Y_k = X_k β    Kernel regression on Y and X variables one at a time
Y_k = X_mk β   Kernel regression on Y variable and kernel regression on X variables together

Figure 4.1: Bivariate surface plot of normally distributed variables X1 and X2.

Figure 4.2: Kernel density plots for various h values for normally distributed variable X1.

Figure 4.3: Kernel density plots for various h values for normally distributed variable X2.

Figure 4.4: Kernel density plots for various h values for normally distributed variable X3.

Figure 4.5: Kernel density plots for various h values for normally distributed variable X4.

Figure 4.6: Kernel density plots for various h values for normally distributed variable X5.

Figure 4.7: Kernel density plots for various h values for normally distributed variable X6.

Figure 4.8: Kernel density plots for various h values for normally distributed variable X7.

Figure 4.9: Kernel density plots for various h values for normally distributed variable Y.

Eqn. 3.54 was used to calculate the optimal bandwidth h value when kernel regression is applied to all X variables together (X_mk models). Therefore, there is only one h value for all X variables when kernel regression is applied to all X variables together.

Table 4.2 summarizes the RMSE, AIC, ICOMP(IFIM), and h bandwidths calculated for the true underlying model using Eqns. 3.54 and 3.55 from the six models. Performing kernel regression smoothing only on the X variables for models Y = Xkβ and Y = Xmkβ increased the ICOMP(IFIM) to 81.1 and 194, respectively, from the baseline model of 79.8.

Similarly, the RMSE and AIC also increased from the baseline model where no kernel regression smoothing was applied to the data. Since RMSE, AIC, and ICOMP(IFIM) increased for these models, applying the kernel regression only to the X variables resulted in poorer models than the baseline.

However, applying the kernel regression smoothing to the Y variable decreased the RMSE, AIC, and ICOMP(IFIM). The RMSE decreased from the baseline model of 0.474 to 0.109. Similarly, the AIC decreased from 73.1 to -74.0 for the Y_k = Xβ model, while ICOMP(IFIM) decreased from 79.8 to -77.7. ICOMP(IFIM) is minimized for the Y_k = X_kβ model with an ICOMP(IFIM) score of -89.1. RMSE, AIC, and ICOMP(IFIM) increased considerably for the Y_k = X_mkβ model compared to the Y_k = Xβ model. Therefore, applying kernel regression to the Y variable provides the largest decrease in RMSE, AIC, and ICOMP(IFIM) compared to the baseline model.

Table 4.2: Results for applying univariate kernel regression smoothing methods with normally distributed data.

Model            RMSE    AIC     ICOMP(IFIM)   h for Y   h for X
Y = X β          0.474   73.1    79.8
Y = X_k β        0.537   85.6    81.1                    *
Y = X_mk β       1.478   187     194                     0.0062
Y_k = X β        0.109   -74.0   -77.7         0.4766
Y_k = X_k β      0.109   -73.4   -89.1         0.4766    *
Y_k = X_mk β     0.144   -46.1   -58.3         0.4766    0.0062

*Optimal bandwidth h values for X1, X2, ..., X7 individually are respectively: [0.5892, 0.6766, 0.5691, 2.9936, 5.2505, 3.2336, 1.0146] using Eqn. 3.55.

4.2.2 Univariate Box-Cox Transformation

In multiple linear regression models, if the assumptions of normality and homoscedasticity of the error variance are violated, one approach established in the literature to stabilize the error variance, remove skewness, and make the distribution more nearly normal is to use a Box-Cox [Box and Tidwell, 1962] transformation. The Box-Cox transformation for \lambda \in R is given by

y = \begin{cases} (x^{\lambda} - 1)/\lambda & \text{if } \lambda \neq 0 \\ \log_e(x) & \text{if } \lambda = 0 \end{cases}.   (4.9)

Bozdogan [Bozdogan, 1990a] derived the information criteria for five competing models shown in Table 4.3.
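Before detailing these criteria, a minimal Python sketch of the transformation in Eqn. 4.9 and of selecting \lambda by a profile-likelihood grid search is given below; the function names, the grid, and the simulated data are illustrative assumptions, and positive data with normal errors after transformation are assumed.

import numpy as np

def box_cox(x, lam):
    """Box-Cox transformation of Eqn. 4.9 (requires positive x)."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def box_cox_loglik(x, lam):
    """Profile log likelihood of lambda under normal errors for the transformed data."""
    z = box_cox(x, lam)
    n = x.size
    return -0.5 * n * np.log(z.var()) + (lam - 1.0) * np.log(x).sum()

# grid search for the maximum likelihood lambda
rng = np.random.default_rng(10)
y = rng.lognormal(mean=1.0, sigma=0.5, size=100)
grid = np.linspace(-2, 2, 81)
print(grid[np.argmax([box_cox_loglik(y, lam) for lam in grid])])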

The AIC for M1 is given by

AIC = n \log(2\pi) + n \log(\hat{\sigma}^2) + n + 2(k + 1).   (4.10)

The ICOMP(IFIM) for M1 is given by

ICOMP(IFIM) = n \log(2\pi) + n \log(\hat{\sigma}^2) + n + (k + 2) \log\left[ \frac{\mathrm{tr}(\hat{\sigma}^2 (X'X)^{-1}) + 2\hat{\sigma}^4/n}{k + 2} \right] - \log|\hat{\sigma}^2 (X'X)^{-1}| - \log\left( \frac{2\hat{\sigma}^4}{n} \right).   (4.11)
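The following minimal Python sketch evaluates Eqns. 4.10 and 4.11 for an ordinary least squares fit; the function name and the simulated design are illustrative assumptions.

import numpy as np

def aic_icomp_m1(y, X):
    """AIC (Eqn. 4.10) and ICOMP(IFIM) (Eqn. 4.11) for the baseline model M1.

    X is the n x (k+1) design matrix including the intercept column, so k is
    the number of slope parameters.
    """
    y, X = np.asarray(y, float), np.asarray(X, float)
    n, k_plus_1 = X.shape
    k = k_plus_1 - 1
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n                        # maximum likelihood sigma-hat^2
    aic = n * np.log(2 * np.pi) + n * np.log(sigma2) + n + 2 * (k + 1)
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)        # sigma-hat^2 (X'X)^-1
    trace_term = np.trace(cov_beta) + 2 * sigma2 ** 2 / n
    icomp = (n * np.log(2 * np.pi) + n * np.log(sigma2) + n
             + (k + 2) * np.log(trace_term / (k + 2))
             - np.linalg.slogdet(cov_beta)[1]
             - np.log(2 * sigma2 ** 2 / n))
    return aic, icomp

rng = np.random.default_rng(11)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=50)
print(aic_icomp_m1(y, X))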

Table 4.3: Competing models for univariate Box-Cox transformations.

Model   Notation                       Model Description
M1      y = X β + ε                    No transformation baseline model
M2      y^(λ) = X β + ε                Transformation of dependent variable y by λ only
M3      y^(λ) = X^(λ) β + ε            Transformation of all variables by the same λ
M4      y = X^(µ) β + ε                Transformation of independent X variables only
M5      y^(λ) = X^(µ) β + ε            Independent transformation of X and y variables

The AIC for M2 is given by

AIC = n \log(2\pi) + n \log(\hat{\sigma}_{\hat{\lambda}}^2) + n - 2(\hat{\lambda} - 1) \sum_{i=1}^{n} \log(y_i) + 2(k + 2),   (4.12)

where \hat{\lambda} is the optimal \lambda that maximizes the log likelihood function.

The ICOMP(IFIM) for M2 is given by

ICOMP(IFIM) = n \log(2\pi) + n \log(\hat{\sigma}_{\hat{\lambda}}^2) + n - 2(\hat{\lambda} - 1) \sum_{i=1}^{n} \log(y_i) + (k + 2) \log\left[ \frac{\mathrm{tr}(\hat{\sigma}_{\hat{\lambda}}^2 (X'X)^{-1}) + 2\hat{\sigma}_{\hat{\lambda}}^4/n}{k + 2} \right] - \log|\hat{\sigma}_{\hat{\lambda}}^2 (X'X)^{-1}| - \log\left( \frac{2\hat{\sigma}_{\hat{\lambda}}^4}{n} \right).   (4.13)

The AIC for M3 is given by

AIC = n \log(2\pi) + n \log(\hat{\sigma}_{(\hat{\lambda},\hat{\lambda})}^2) + n - 2(\hat{\lambda} - 1) \sum_{i=1}^{n} \log(y_i) + 2(k + 2),   (4.14)

where \hat{\lambda} is the optimal \lambda that maximizes the log likelihood function.

The ICOMP(IFIM) for M3 is given by

ICOMP(IFIM) = n \log(2\pi) + n \log(\hat{\sigma}_{(\hat{\lambda},\hat{\lambda})}^2) + n - 2(\hat{\lambda} - 1) \sum_{i=1}^{n} \log(y_i) - \log\left( \frac{2\hat{\sigma}_{(\hat{\lambda},\hat{\lambda})}^4}{n} \right) + (k + 2) \log\left[ \frac{\mathrm{tr}(\hat{\sigma}_{(\hat{\lambda},\hat{\lambda})}^2 (X^{(\hat{\lambda})\prime} X^{(\hat{\lambda})})^{-1}) + 2\hat{\sigma}_{(\hat{\lambda},\hat{\lambda})}^4/n}{k + 2} \right] - \log|\hat{\sigma}_{(\hat{\lambda},\hat{\lambda})}^2 (X^{(\hat{\lambda})\prime} X^{(\hat{\lambda})})^{-1}|.   (4.15)

The AIC for M4 is given by

AIC = n \log(2\pi) + n \log(\hat{\sigma}_{\hat{\mu}}^2) + n + 2(k + 2),   (4.16)

where \hat{\mu} is the optimal \mu that maximizes the log likelihood function.

The ICOMP(IFIM) for M4 is given by

ICOMP(IFIM) = n \log(2\pi) + n \log(\hat{\sigma}_{\hat{\mu}}^2) + n + (k + 2) \log\left[ \frac{\mathrm{tr}(\hat{\sigma}_{\hat{\mu}}^2 (X^{(\hat{\mu})\prime} X^{(\hat{\mu})})^{-1}) + 2\hat{\sigma}_{\hat{\mu}}^4/n}{k + 2} \right] - \log|\hat{\sigma}_{\hat{\mu}}^2 (X^{(\hat{\mu})\prime} X^{(\hat{\mu})})^{-1}| - \log\left( \frac{2\hat{\sigma}_{\hat{\mu}}^4}{n} \right).   (4.17)

The AIC for M5 is given by

AIC = n \log(2\pi) + n \log(\hat{\sigma}_{(\hat{\lambda},\hat{\mu})}^2) + n - 2(\hat{\lambda} - 1) \sum_{i=1}^{n} \log(y_i) + 2(k + 3),   (4.18)

where \hat{\lambda} and \hat{\mu} are the optimal \lambda and \mu that maximize the log likelihood function.

The ICOMP(IFIM) for M5 is given by

ICOMP(IFIM) = n \log(2\pi) + n \log(\hat{\sigma}_{(\hat{\lambda},\hat{\mu})}^2) + n - 2(\hat{\lambda} - 1) \sum_{i=1}^{n} \log(y_i) - \log\left( \frac{2\hat{\sigma}_{(\hat{\lambda},\hat{\mu})}^4}{n} \right) + (k + 2) \log\left[ \frac{\mathrm{tr}(\hat{\sigma}_{(\hat{\lambda},\hat{\mu})}^2 (X^{(\hat{\mu})\prime} X^{(\hat{\mu})})^{-1}) + 2\hat{\sigma}_{(\hat{\lambda},\hat{\mu})}^4/n}{k + 2} \right] - \log|\hat{\sigma}_{(\hat{\lambda},\hat{\mu})}^2 (X^{(\hat{\mu})\prime} X^{(\hat{\mu})})^{-1}|.   (4.19)

Results using kernel regression smoothing for the models in Table 4.2 will be compared to results from using the Box-Cox transformation models described in Table 4.3 and the results in Table 4.4.

Table 4.4 shows the results from using a Box-Cox transformation on the univariate data with normally distributed residuals. The RMSE, AIC, and ICOMP(IFIM) decrease very little compared to the baseline M1 model.

Table 4.4: Results from using Box-Cox transformation for normally distributed data.

Model   Optimal µ   Optimal λ   RMSE     AIC     ICOMP(IFIM)
M1                              0.4735   73.13   79.77
M2                  0.75        0.2647   74.81   72.97
M3                  0.9         0.3756   75.11   73.12
M4      1.6                     0.4703   74.46   74.00
M5      1.5         1           0.4704   76.49   72.89

The Box-Cox transformation, even with the optimal maximum likelihood λ and µ parameters, did not provide significantly better models than kernel regression smoothing.

In fact, kernel regression applied to the univariate Y variable singly by far outperformed the optimal Box-Cox transformation as shown in Tables 4.2 and 4.4.

4.2.3 Univariate Kernel Regression on Hald Cement Data

Hald [Hald, 2001] published well-known cement production data that have been widely used in many published data analyses. The data set is small, with only four independent X variables, one dependent Y variable, and n = 13 observations. The raw data are shown in Table 4.5.

A multiple linear regression model using only the variables X1 and X2 minimizes the ICOMP(IFIM) among all possible subset models. The residuals from the best model y = β_0 + β_1 x_1 + β_2 x_2 + ε are approximately normally distributed.

Figure 4.10 is a surface plot of the first two variables X1 and X2. These are clearly not bivariate normal.

Histograms and scatter plots of each of the variables are shown in Appendix Figure A.3, where none of the variables is approximately normally distributed. Eqn. 3.55 was used to calculate the optimal bandwidth h values when kernel regression is applied individually to each variable. Therefore, there is one h bandwidth for each variable when kernel regression is applied individually to each variable.

Table 4.5: Hald's cement production data.

Y       X1   X2   X3   X4
78.5     7   26    6   60
74.5     1   29   15   52
104.3   11   56    8   20
87.6    11   31    8   47
95.9     7   52    6   33
109.2   11   55    9   22
102.7    3   71   17    6
72.5     1   31   22   44
93.1     2   54   18   22
115.9   21   47    4   26
83.8     1   40   23   34
113.3   11   66    9   12
109.4   10   68    8   12

Figure 4.10: Bivariate surface plot of Hald's variables X1 and X2.

Figure 4.11: Kernel density plots for various h values for Hald's variable X1.

Figure 4.11 shows smoothed kernel density plots for the various values of h_j calculated from Eqn. 3.55 for the variable X1. As the bandwidth h increases, the kernel density plots become smoother. The smallest h = 1.3605 is too small since the kernel density plot is jagged. Yet the largest h = 14.3411 is too large since the kernel density plot has oversmoothed the data and removed features of the data that may be important. For X1 the bandwidth h that minimizes the ICOMP score is h = 5.2264.

Similarly, Figures 4.12 through 4.15 each show smoothed kernel density plots for various values of h_j calculated from Eqn. 3.55 for the variables X2, X3, X4, and Y, respectively, for the Y_k and X_k models. The values of h that minimize ICOMP for the variables X2, X3, X4 and Y are 7.0538, 2.0739, 14.8715, and 9.5323, respectively.

Eqn. 3.54 was used to calculate the optimal bandwidth h value when kernel regression is applied to all X variables together (X_mk models). Therefore, there is only one h value for all X variables when kernel regression is applied to all X variables together.

Figure 4.12: Kernel density plots for various h values for Hald's variable X2.

Figure 4.13: Kernel density plots for various h values for Hald's variable X3.

Figure 4.14: Kernel density plots for various h values for Hald's variable X4.

Figure 4.15: Kernel density plots for various h values for Hald's variable Y.

Table 4.6: Results for applying univariate kernel regression smoothing methods on Hald's data.

Model            RMSE    AIC     ICOMP(IFIM)   h for Y   h for X
Y = X β          2.12    60.4    75.4
Y = X_k β        6.46    89.4    109.4                   *
Y = X_mk β       12.23   106.0   134.7                   0.2695
Y_k = X β        1.15    44.5    56.6          9.5323
Y_k = X_k β      1.55    52.3    57.3          9.5323    *
Y_k = X_mk β     2.18    61.1    81.0          9.5323    0.2695

*Optimal bandwidth h values for X1, X2, X3, X4 individually are respectively: [5.2264, 7.0538, 2.0739, 14.8715] using Eqn. 3.55.

Table 4.6 summarizes the RMSE, AIC, ICOMP(IFIM), and h bandwidths calculated for the best model using Eqns. 3.54 and 3.55 from the six kernel models. Performing kernel regression smoothing only on the X variables for models Y = X_kβ and Y = X_mkβ increased the ICOMP(IFIM) to 109.4 and 134.7, respectively, from the baseline model of 75.4. Similarly, the RMSE and AIC also increased from the baseline model where no kernel regression smoothing was applied to the data. Since RMSE, AIC, and ICOMP(IFIM) increased for these models, applying the kernel regression only to the X variables resulted in poorer models than the baseline.

However, applying the kernel regression smoothing to the Y variable decreased the RMSE, AIC, and ICOMP(IFIM) for the Y_k = Xβ and Y_k = X_kβ kernel models. The RMSE decreased from the baseline model of 2.12 to 1.15 for the Y_k = Xβ model. Similarly, the AIC decreased from 60.4 to 44.5 for the Y_k = Xβ model, while ICOMP(IFIM) decreased from 75.4 to 56.6. RMSE, AIC, and ICOMP(IFIM) are minimized for the Y_k = Xβ model compared to the other models. RMSE, AIC, and ICOMP(IFIM) increased for the Y_k = X_mkβ model compared to the Y = Xβ and Y_k = Xβ models. Therefore, applying kernel regression to the Y variable alone provides the largest decrease in RMSE, AIC, and ICOMP(IFIM) compared to the baseline model. This agrees with the results from section 4.2.1, although Hald's data were shown to be non-normally distributed.

60 4.2.4 Kernel Regression for Univariate Power Exponential Distribution

Liu and Bozdogan [Liu and Bozdogan, 2008] discuss regression models with power exponential (PE) random errors. A univariate random variable x is PE distributed if the density function of x is defined by

f(x; \mu, \sigma, \beta) = \frac{1}{\sigma \Gamma\left(1 + \frac{1}{2\beta}\right) 2^{1 + \frac{1}{2\beta}}} \exp\left\{ -\frac{1}{2} \left| \frac{x - \mu}{\sigma} \right|^{2\beta} \right\},   (4.20)

where the parameters -\infty < \mu < \infty and \sigma > 0 are location and scale parameters, respectively, and \beta > 0 is the shape parameter. Note that when \beta = 1 the density becomes the normal distribution, so the PE distribution can be seen as a generalized normal distribution. The standard PE distribution is the PE distribution with \mu = 0 and \sigma = 1.

In order to generate univariate PE pseudo-random numbers, Liu and Bozdogan [Liu and Bozdogan, 2008] show that if Y = \frac{1}{2}|X|^{2\beta}, where X has a standard PE distribution with shape parameter \beta, then Y \sim \mathrm{Gamma}\left(\frac{1}{2\beta}\right). So Y has a distribution with density given by

f(y) = \frac{y^{(1/2\beta) - 1} e^{-y}}{\Gamma\left(\frac{1}{2\beta}\right)}.   (4.21)

Seven independent X variables and one Y dependent variable were generated using Eqn. 4.21 with n = 50 observations. The residuals from the multiple linear regression models are gamma distributed with \beta = 0.3.
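A minimal Python sketch of this generation scheme follows, drawing Y from the gamma density of Eqn. 4.21 and inverting Y = |X|^{2β}/2 with a random sign, which relies on the symmetry of the standard PE density; the function name and seed are illustrative assumptions, and location and scale can be introduced afterwards by the usual shift and rescale.

import numpy as np

def standard_pe_sample(n, beta, rng):
    """Standard power exponential pseudo-random numbers via the gamma relation."""
    y = rng.gamma(shape=1.0 / (2.0 * beta), scale=1.0, size=n)   # Eqn. 4.21
    signs = rng.choice([-1.0, 1.0], size=n)                      # symmetric about zero
    return signs * (2.0 * y) ** (1.0 / (2.0 * beta))             # invert Y = |X|^(2*beta)/2

rng = np.random.default_rng(12)
x = standard_pe_sample(50, beta=0.3, rng=rng)
print(x.mean(), x.std(ddof=1))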

Figure 4.16 is a surface plot of the first two variables X1 and X2 where β = 0.3. These are gamma distributed with thin tails.

Histograms and scatter plots of each of the PE simulated variables are shown in Appendix Figures A.4 and A.5. Eqn. 3.55 was used to calculate the optimal bandwidth h values when kernel regression is applied individually to each variable. However, due to the extremely thin tails and sharp peaks of the PE data, the j parameters in Eqn. 3.55 were increased from 4 to 13 in order to minimize ICOMP and determine the optimal bandwidth h values for each variable individually. Therefore, there is one h bandwidth for each variable when kernel regression is applied individually to each variable.

Figure 4.16: Bivariate surface plot of power exponential variables X1 and X2.

Figure 4.17 shows smoothed kernel density plots for the various values of h_j calculated from Eqn. 3.55 for the variable X1. As the bandwidth h increases, the kernel density plots become smoother. The smallest h = 0.47318 shown in Figure 4.17 is too small since the kernel density plot is very jagged. Yet the largest h = 6.9832 shown in Figure 4.17 is too large since the kernel density plot has oversmoothed the data and removed features of the data that may be important. For X1 the bandwidth h that minimizes the ICOMP score is h = 4.988.

Similarly, Figures 4.18 through 4.24 each show smoothed kernel density plots for various values of h_j calculated from Eqn. 3.55 for the variables X2, X3, ..., X7 and Y, respectively, for the Y_k and X_k models. The values of h that minimize ICOMP for the variables X2, X3, ..., X7 and Y are 2.5413, 18.5111, 0.7633, 2.7108, 14.5435, 82.7758, and 20.4193, respectively. Eqn. 3.54 was used to calculate the optimal bandwidth h value when kernel regression is applied to all X variables together (X_mk models). Therefore, there is only one h value for all X variables when kernel regression is applied to all X variables together.

Table 4.7 summarizes the RMSE, AIC, ICOMP(IFIM), and h bandwidths calculated for the true underlying model using Eqns. 3.54 and 3.55 from the six kernel models. Performing kernel regression smoothing only on the X variables for models Y = X_kβ and Y = X_mkβ increased the ICOMP(IFIM) to 549.9 and 556.4, respectively, from the baseline model of 58.9, approximately one order of magnitude. Similarly, the AIC also increased approximately one order of magnitude from the baseline model where no kernel regression smoothing was applied to the data. However, the RMSE increased almost two orders of magnitude, from 0.3652 to 32.9 for the Y = X_kβ model and 34.4 for the Y = X_mkβ model. Since RMSE, AIC, and ICOMP(IFIM) increased for these models, applying the kernel regression only to the X variables resulted in much poorer models than the baseline.

Figure 4.17: Kernel density plots for various h values for PE distributed variable X1.

Figure 4.18: Kernel density plots for various h values for PE distributed variable X2.

Figure 4.19: Kernel density plots for various h values for PE distributed variable X3.

Figure 4.20: Kernel density plots for various h values for PE distributed variable X4.

Figure 4.21: Kernel density plots for various h values for PE distributed variable X5.

Figure 4.22: Kernel density plots for various h values for PE distributed variable X6.

Figure 4.23: Kernel density plots for various h values for PE distributed variable X7.

Figure 4.24: Kernel density plots for various h values for PE distributed variable Y.

Table 4.7: Results for applying univariate kernel regression smoothing methods with power exponential data.

Model            RMSE     AIC     ICOMP(IFIM)   h for Y   h for X
Y = X β          0.3652   47.2    58.9
Y = X_k β        32.9     497.3   549.9                   1.193*
Y = X_mk β       34.4     501.7   556.4                   1.193
Y_k = X β        7.83     353.8   394.4         20.42
Y_k = X_k β      5.58     319.9   346.0         20.42     *
Y_k = X_mk β     7.15     344.6   375.8         20.42     1.193

*Optimal bandwidth h values for X1, X2, ..., X7 individually are respectively: [4.988, 2.5413, 18.5111, 0.7633, 2.7108, 14.5435, 82.7758] using Eqn. 3.55.

However, applying the kernel regression smoothing to the Y variable also increased the RMSE, AIC, and ICOMP(IFIM) compared to the baseline model, though not as much as for the Y = X_kβ and Y = X_mkβ models. The RMSE increased from the baseline model of 0.3652 to 7.83, 5.58, and 7.15 for the Y_k = Xβ, Y_k = X_kβ, and Y_k = X_mkβ models, respectively. Similarly, the AIC increased from 47.2 to 353.8 for the Y_k = Xβ model, while ICOMP(IFIM) increased from 58.9 to 394.4. Among the models using kernel regression, ICOMP(IFIM) and AIC are minimized for the Y_k = X_kβ model with an ICOMP(IFIM) score of 346.0. For the PE simulated data, no kernel regression model was better than the original model without kernel regression due to the highly skewed and peaked PE distribution.

The Box-Cox transformation was applied to the univariate PE data to compare with the kernel regression results. The Box-Cox optimal λ̂ = 1, so Box-Cox recommended no data transformation. In fact, the M5 model could not be evaluated because the matrix was nearly singular. Therefore, Box-Cox could not be used to improve the RMSE, AIC, or ICOMP(IFIM) of the univariate PE data.

4.2.5 Univariate Kernel Regression for Friedman's Nonlinear Regression

Friedman [Friedman, 1991] proposed a highly nonlinear multiple regression model with skewed normal errors. The true model is given by

y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \varepsilon,   (4.22)

where \varepsilon \sim N(0, 1). The 10 independent variables X1, X2, ..., X10 are randomly generated from the unit hypercube. Variables X6, X7, ..., X10 make no contribution to the response variable Y and are therefore redundant.
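For illustration, a minimal Python sketch that simulates data from Eqn. 4.22 is given below; the function name and seed are illustrative assumptions.

import numpy as np

def simulate_friedman(n, rng):
    """Simulate Friedman's nonlinear benchmark of Eqn. 4.22 with ten uniform
    regressors on the unit hypercube; X6-X10 do not enter the response."""
    X = rng.uniform(size=(n, 10))
    eps = rng.normal(size=n)
    y = (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20.0 * (X[:, 2] - 0.5) ** 2
         + 10.0 * X[:, 3] + 5.0 * X[:, 4] + eps)
    return X, y

rng = np.random.default_rng(13)
X, y = simulate_friedman(100, rng)
print(X.shape, y[:5])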

Ten independent X variables and one Y dependent variable were simulated using Eqn. 4.22 with n = 100 observations. Univariate kernel regression smoothing was performed on the univariate, nonlinearly distributed data.

Figure 4.25 is a surface plot of the first two variables X1 and X2 with the dependent variable Y . As expected, Y is highly nonlinear.

Figure 4.25: Bivariate surface plot of Friedman’s variables X1 and X2 with the dependent variable Y .

Histograms and scatter plots of each of the Friedman simulated variables are shown in Appendix Figures A.6 and A.7. Eqn. 3.55 was used to calculate the optimal bandwidth h values when kernel regression is applied individually to each variable. Therefore, there is one h bandwidth for each variable when kernel regression is applied individually to each variable.

Figure 4.26 shows smoothed kernel density plots for the various values of h_j calculated from Eqn. 3.55 for the variable X1. As the bandwidth h increases, the kernel density plots become smoother. The smallest h = 0.14649 shown in Figure 4.26 is too small since the kernel density plot is very jagged. Yet the largest h = 2.1619 shown in Figure 4.26 is too large since the kernel density plot has oversmoothed the data and removed features of the data that may be important. For X1 the bandwidth h that minimizes the ICOMP score from Eqn. 3.55 is h = 0.40197.

Similarly, Figures 4.27 through 4.36 each show smoothed kernel density plots for various values of h_j calculated from Eqn. 3.55 for the variables X2, X3, ..., X10 and Y, respectively, for the Y_k and X_k models. The values of h that minimize ICOMP for the variables X2, X3, ..., X10 and Y are 0.36273, 0.36094, 0.28781, 0.61694, 0.58268, 0.44327, 0.38365, 0.47411, 0.55332, and 17.1047, respectively.

Eqn. 3.54 was used to calculate the optimal bandwidth h value when kernel regression is applied to all X variables together (X_mk models). Therefore, there is only one h value for all X variables when kernel regression is applied to all X variables together.

Table 4.8 summarizes the RMSE, AIC, ICOMP(IFIM), and h bandwidths calculated

for the true underlying model using Eqns. 3.54 and 3.55 from the six kernel models.

Performing kernel regression smoothing only on the X variables for models Y = Xkβ and

Y = Xmkβ slightly decreased the ICOMP(IFIM) to 946.8 and 970.5, respectively, from the baseline model of 990.2. Similarly, RMSE and AIC also decreased slightly from the baseline

model where no kernel regression smoothing was applied to the data. Since RMSE, AIC,

and ICOMP(IFIM) decreased for these models, applying the kernel regression only to the

X variables resulted in slightly better models than the baseline.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.14649 to 2.1619.]

Figure 4.26: Kernel density plots for various h values for Friedman’s variable X1.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.13219 to 1.9509.]

Figure 4.27: Kernel density plots for various h values for Friedman’s variable X2.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.13154 to 1.9412.]

Figure 4.28: Kernel density plots for various h values for Friedman’s variable X3.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.14684 to 2.1671.]

Figure 4.29: Kernel density plots for various h values for Friedman’s variable X4.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.16059 to 2.37.]

Figure 4.30: Kernel density plots for various h values for Friedman’s variable X5.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.15168 to 2.2384.]

Figure 4.31: Kernel density plots for various h values for Friedman’s variable X6.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.16154 to 2.384.]

Figure 4.32: Kernel density plots for various h values for Friedman’s variable X7.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.13981 to 2.0634.]

Figure 4.33: Kernel density plots for various h values for Friedman’s variable X8.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.12342 to 1.8213.]

Figure 4.34: Kernel density plots for various h values for Friedman’s variable X9.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.14403 to 2.1256.]

Figure 4.35: Kernel density plots for various h values for Friedman’s variable X10.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 4.4525 to 65.7095.]

Figure 4.36: Kernel density plots for various h values for Friedman’s variable Y .

Table 4.8: Results for applying univariate kernel regression smoothing methods using Friedman's nonlinear data.

Model          RMSE   AIC     ICOMP(IFIM)   h for Y   h for X
Y = Xβ         24.1   930.2   990.2
Y = Xkβ        21.5   907.0   946.8                   *
Y = Xmkβ       22.8   918.9   970.5                   8.34E-03
Yk = Xβ        4.83   608.9   632.7         17.1
Yk = Xkβ       4.87   610.3   626.9         17.1      *
Yk = Xmkβ      5.18   622.8   642.6         17.1      8.34E-03

*Optimal bandwidth h values for X1, X2, ..., X10 individually are respectively: [0.40197, 0.36273, 0.36094, 0.28781, 0.61694, 0.58268, 0.44327, 0.38365, 0.47411, 0.55332].

However, applying the kernel regression smoothing to the Y variable more dramatically decreased the RMSE, AIC, and ICOMP(IFIM) compared to the baseline model. The

RMSE decreased from the baseline model of 24.1 to 4.83, 4.87, and 5.18 for the Yk = Xβ,

Yk = Xkβ, and Yk = Xmkβ models, respectively. Similarly, the AIC decreased from 930.2 to 608.9 for the Yk = Xβ model, while ICOMP(IFIM) decreased from 990.2 to 632.7.

ICOMP(IFIM) is minimized for the Yk = Xkβ model with an ICOMP(IFIM) score of

626.9, while AIC is minimized for the Yk = Xβ model with an AIC score of 608.9. These results agree with sections 4.2.1 and 4.2.3 that applying kernel regression on the Y variable decreases the RMSE, AIC, and ICOMP(IFIM) compared to the baseline model even for a highly nonlinear response Y variable.

The Box-Cox transformation was applied to the univariate Friedman's data to compare with the univariate kernel regression results. The Box-Cox optimal λ̂ = 1, so Box-Cox recommended no data transformation. In fact, the M5 model could not be evaluated because its matrix was numerically singular. Therefore, Box-Cox could not be used to improve the RMSE, AIC, or ICOMP(IFIM) of the univariate Friedman's data.

4.2.6 Univariate Kernel Regression for Uncorrelated Data

ICOMP in the general form utilizes the correlational structure in the data to accommodate correlated variables. However, in the case of completely uncorrelated variables, another form of ICOMP that estimates the posterior expected utility (PEU) as a Bayesian crite- rion can be used. Bozdogan [Bozdogan, 2006] defines ICOMP(IFIM)PEU for uncorrelated variables by

ICOMP(IFIM)_PEU = −2 log L(θ̂_M) + k + 2C₁(F̂⁻¹(θ̂_M)).   (4.23)

Substituting Eqns. 2.8 and 2.13 into Eqn. 4.23 yields

ICOMP(IFIM)_PEU = n log(2π) + n log(σ̂²) + n + k + s log( tr(F̂⁻¹) / s ) − log |F̂⁻¹|,   (4.24)

where s = dim(F̂⁻¹) and F̂⁻¹(θ̂_M) is the estimated inverse Fisher information matrix (IFIM) in its outer product form.
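As a rough illustration of how Eqn. 4.24 can be evaluated for a fitted univariate linear model, the MATLAB sketch below uses the usual block-diagonal Gaussian estimate of the inverse Fisher information matrix as a stand-in for the outer product form; the function name and this substitution are illustrative assumptions, not the dissertation's code.

% Sketch: evaluate ICOMP(IFIM)_PEU (Eqn. 4.24) for a linear regression fit.
function icomp = icomp_ifim_peu(y, X)
    [n, k] = size(X);
    b      = X \ y;                                % least squares estimate
    sig2   = sum((y - X*b).^2) / n;                % MLE of the error variance
    Finv   = blkdiag(sig2*inv(X'*X), 2*sig2^2/n);  % stand-in for the IFIM estimate
    s      = size(Finv, 1);
    icomp  = n*log(2*pi) + n*log(sig2) + n + k ...
             + s*log(trace(Finv)/s) - log(det(Finv));
end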

Seven independent variables X1, X2, ..., X7 with n = 100 observations were randomly generated from the unit hypercube using a uniform distribution with varying means such that X1, X2, ..., X7 were all uncorrelated. Variables X4, X5, X6, and X7 did not contribute to the response variable Y and were therefore redundant.

Univariate kernel regression smoothing using ICOMP(IFIM)PEU was performed on the uncorrelated univariate data. Figure 4.37 shows histograms and scatter plots of the

uncorrelated uniform variables X1, X2, X3, and Y . Histograms and scatter plots of the

uncorrelated variables X4, X5, X6, and X7 are shown in Appendix Figure A.8. Eqn. 3.55 was used to calculate the optimal bandwidth h values when kernel regression

is applied individually to each variable. Therefore, there is one h bandwidth for each

variable when kernel regression is applied individually to each variable.

Eqn. 3.54 was used to calculate the optimal bandwidth h value when kernel regression

is applied to all X variables together (Xmk models). Therefore, there is only one h value for all X variables when kernel regression is applied to all X variables together.


Figure 4.37: Histograms and scatter plots of uncorrelated variables X1, X2, X3, and Y .

Figure 4.38 shows smoothed kernel density plots for the various values of hj calculated from Eqn. 3.55 for the variable X1. Figure 4.38 shows that as the bandwidth h increases the kernel density plots become smoother. The smallest h = 0.0334 shown in Figure 4.38 is too small since the kernel density plot is very jagged with many peaks. Yet the largest h

= 0.4933 shown in Figure 4.38 is too large since the kernel density plot has oversmoothed the data and removed features of the data that may be important. For X1 the bandwidth h that minimizes the ICOMP score is h = 0.0917.

Similarly, Figures 4.39 through 4.45 each shows smoothed kernel density plots for var- ious values of hj calculated from Eqn. 3.55 for the variables X2,X3, ..., X7 and Y , respec- tively, for the Yk and Xk models. The values of h that minimize ICOMP for the variables

X2,X3, ..., X7 and Y are 0.0341, 0.1761, 0.0712, 0.3161, 0.0931, 0.1799, and 1.4217, respec- tively.

Table 4.9 summarizes the RMSE, AIC, ICOMP(IFIM)PEU , and h bandwidths calcu- lated for the true underlying model using Eqns. 3.54 and 3.55 from the six kernel models.

Performing kernel regression smoothing only on the X variables for models Y = Xkβ and

Y = Xmkβ dramatically increased the ICOMP(IFIM)PEU to 558.6 and 520.6, respectively, from the baseline model of -280.6. Similarly, the AIC also dramatically increased from the baseline model where no kernel regression smoothing was applied to the data. The RMSE likewise increased almost two orders of magnitude, from 0.058 to 3.848 for the Y = Xkβ model and 3.192 for the Y = Xmkβ model. Since RMSE, AIC, and ICOMP(IFIM)PEU increased for these models, applying the kernel regression only to the X variables resulted in much poorer models than the baseline.

However, applying the kernel regression smoothing to the Y variable also increased the

RMSE, AIC, and ICOMP(IFIM)PEU compared to the baseline model, but not as high as the Y = Xkβ and Y = Xmkβ models. The RMSE increased from the baseline model of

0.058 to 0.176, 0.288, and 0.244 for the Yk = Xβ, Yk = Xkβ, and Yk = Xmkβ models, respectively.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.033425 to 0.49328.]

Figure 4.38: Kernel density plots for various h values for uncorrelated variable X1.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.0088837 to 0.1311.]

Figure 4.39: Kernel density plots for various h values for uncorrelated variable X2.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.045848 to 0.67662.]

Figure 4.40: Kernel density plots for various h values for uncorrelated variable X3.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.025948 to 0.38293.]

Figure 4.41: Kernel density plots for various h values for uncorrelated variable X4.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.041983 to 0.61958.]

Figure 4.42: Kernel density plots for various h values for uncorrelated variable X5.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.024241 to 0.35774.]

Figure 4.43: Kernel density plots for various h values for uncorrelated variable X6.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.046829 to 0.6911.]

Figure 4.44: Kernel density plots for various h values for uncorrelated variable X7.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.37009 to 5.4617.]

Figure 4.45: Kernel density plots for various h values for uncorrelated variable Y .

Table 4.9: Results for applying univariate kernel regression smoothing methods on uncorrelated variables using posterior expected utility (PEU).

Model          RMSE    AIC      ICOMP(IFIM)_PEU   h for Y   h for X
Y = Xβ         0.058   -280.5   -280.6
Y = Xkβ        3.848   559.3    558.6                       *
Y = Xmkβ       3.192   521.9    520.6                       0.206
Yk = Xβ        0.176   -57.80   -59.18            1.422
Yk = Xkβ       0.288   40.83    38.52             1.422     *
Yk = Xmkβ      0.244   7.490    5.436             1.422     0.206

*Optimal bandwidth h values for X1, X2, ..., X7 individually are respectively: [0.0917, 0.0341, 0.1761, 0.0712, 0.3161, 0.0931, 0.1799] using Eqn. 3.55.

Similarly, the AIC increased from -280.5 to -57.8, 40.83, and 7.49 for the Yk = Xβ,

Yk = Xkβ, and Yk = Xmkβ models, respectively, while ICOMP(IFIM)PEU increased from -280.6 to -59.18, 38.52, and 5.436 for the same models. Among the models using kernel regression, ICOMP(IFIM)PEU and AIC are minimized for the Yk = Xβ model with an

ICOMP(IFIM)PEU score of -59.18 and AIC score of -57.8. For uncorrelated uniformly distributed data, no kernel regression model was better than the original model without kernel regression due to the extremely thick tails of the uniform distribution.

4.2.7 Univariate Kernel Regression Using the Epanechnikov Kernel on Normal Data

Epanechnikov [Epanechnikov, 1969] also proposed a kernel function for calculating the optimal bandwidth h. The Epanechnikov kernel minimizes asymptotic mean integrated squared error (AMISE) and assumes the data are from a multivariate normal distribution.

The Epanechnikov kernel univariate function is given by

K(x) = (3/4)(1 − x²) if |x| ≤ 1, and K(x) = 0 if |x| > 1.   (4.25)

Figure 4.46 is a univariate plot of the Epanechnikov kernel shown in Eqn. 4.25.

Figure 4.46: Univariate plot of the Epanechnikov kernel function.

The equation for calculating the optimal bandwidth h using an Epanechnikov kernel one variable at a time is given by

h*_j = [ 2^(p+2) p² (p + 2)(p + 4) Γ(p/2) / ( n(2p + 1) ) ]^(1/(p+4)) σ̂_j,   (4.26)

where p is the number of predictor variables.

Eqn. 4.26 has a similar form to Eqn. 3.33, which is the normal reference rule. In Eqn. 4.26, p = 1 when calculating h*_j one variable at a time. For all X variables together, Eqn. 3.39 is substituted into Eqn. 4.26, which becomes

h* = [ 2^(p+2) p² (p + 2)(p + 4) Γ(p/2) / ( n(2p + 1) ) ]^(1/(p+4)) [ (p/2) log( tr(Σ)/p ) − (1/2) log|Σ| ].   (4.27)

Eqn. 4.27 resembles Eqn. 3.54 for calculating a single optimal h value for all X variables together.
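A brief MATLAB sketch of the Epanechnikov kernel in Eqn. 4.25 and the per-variable bandwidth rule in Eqn. 4.26 (with p = 1, and the sample standard deviation standing in for σ̂_j) is shown below for illustration only; the data column is a placeholder.

% Epanechnikov kernel (Eqn. 4.25).
K = @(x) 0.75*(1 - x.^2).*(abs(x) <= 1);

% Per-variable bandwidth from Eqn. 4.26 with p = 1 (one variable at a time).
p  = 1;
xj = randn(50, 1);                 % placeholder data column
n  = numel(xj);
hj = ((2^(p+2)*p^2*(p+2)*(p+4)*gamma(p/2)) / (n*(2*p+1)))^(1/(p+4)) * std(xj);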

The Epanechnikov kernel was used on the same multivariate data set described in section 4.2.1 with one normally distributed Y variable and seven normally distributed X variables with n = 50 observations. Histograms and scatter plots for the data are shown in Appendix Figures A.1 and A.2.

Table 4.10 summarizes the AIC, RMSE, ICOMP(IFIM), and bandwidth h values cal- culated for the true underlying model with predictor variables X1, X2, and X3 using the Epanechnikov kernel.

Table 4.10 shows the results from the six models used for univariate Y . Performing kernel regression smoothing only on the X variables for models Y = Xkβ and Y = Xmkβ increased the ICOMP(IFIM) to 103 and 194.9, respectively, from the baseline model of 79.8.

Similarly, the RMSE and AIC also increased considerably from the baseline model where no kernel regression smoothing was applied to the data. Since RMSE, AIC, and ICOMP(IFIM) increased for these models, applying the kernel regression only to the X variables resulted in poorer models than the baseline. However, applying the kernel regression smoothing to the Y variable decreased the RMSE, AIC, and ICOMP(IFIM). The RMSE decreased from the baseline model of 0.474 to 0.146, 0.144, and 0.26 for the Yk = Xβ, Yk = Xkβ, and Yk = Xmkβ models, respectively. Similarly, the AIC decreased from 73.1 to -44.4,

-45.9, and 13.3 for the Yk = Xβ, Yk = Xkβ, and Yk = Xmkβ models, respectively, while

ICOMP(IFIM) decreased from 79.8 to -46.1, -59.2, and 6.79 for the Yk = Xβ, Yk = Xkβ, and Yk = Xmkβ models, respectively. The Yk = Xkβ model had the best RMSE, AIC, and ICOMP(IFIM) scores.

These results agree well with Table 4.2 where the Gaussian kernel was used. The RMSE,

AIC, and ICOMP(IFIM) scores using the normal kernel in Table 4.2 were lower for the Yk models than the Epanechnikov kernel results in Table 4.10, particularly for the Yk = Xmkβ model. Since the data were normally distributed, the Gaussian kernel performed slightly better than the Epanechnikov kernel.

Table 4.10: Results for applying univariate kernel regression with the Epanechnikov kernel on normal data.

Model          RMSE    AIC     ICOMP(IFIM)   h for Y   h for X
Y = Xβ         0.474   73.1    79.8
Y = Xkβ        0.656   105.7   103.0                   *
Y = Xmkβ       1.475   186.8   194.9                   0.0177
Yk = Xβ        0.146   -44.4   -46.1         1.677
Yk = Xkβ       0.144   -45.9   -59.2         1.677     *
Yk = Xmkβ      0.260   13.3    6.79          1.677     0.0177

*Optimal bandwidth h values for X1, X2, ..., X7 individually are respectively: [1.1118, 1.0691, 0.9609, 4.7303, 8.2965, 7.1532, 2.2444] using Eqn. 4.26.

4.2.8 Univariate Kernel Regression Using the Epanechnikov Kernel on PE Data

The Epanechnikov kernel defined in Eqn. 4.25 and shown in Fig. 4.46 was applied to the PE distributed data described in Section 4.2.4 to compare the performance of the

Epanechnikov and Gaussian kernels on highly skewed non-normal data.

The equations for calculating the optimal bandwidth h using an Epanechnikov kernel are shown in Eqns. 4.26 and 4.27. The Epanechnikov kernel was used on the same multivariate PE data set described in section 4.2.4 with one PE distributed Y variable and seven PE distributed X variables with n = 50 observations. Eqn. 4.26 was used when calculating h*_j one variable at a time. Eqn. 4.27 was used when calculating a single optimal h value for all X variables together. Table 4.11 summarizes the AIC, RMSE, ICOMP(IFIM), and bandwidth h values calculated for the true underlying model using the Epanechnikov kernel.

Table 4.11 shows the results from the six models used for univariate Y . Performing kernel regression smoothing only on the X variables for models Y = Xkβ and Y = Xmkβ increased the ICOMP(IFIM) to 549.3 and 556.9, respectively, from the baseline model of

58.9. Similarly, the RMSE also increased approximately two orders of magnitude from the baseline model where no kernel regression smoothing was applied to the data.

Table 4.11: Results for applying univariate kernel regression with the Epanechnikov kernel on power exponential data.

Model          RMSE    AIC     ICOMP(IFIM)   h for Y   h for X
Y = Xβ         0.365   47.2    58.9
Y = Xkβ        33.0    497.4   549.3                   *
Y = Xmkβ       34.5    502.1   556.9                   3.42
Yk = Xβ        7.78    353.1   393.7         39.1
Yk = Xkβ       5.03    309.4   333.3         39.1      *
Yk = Xmkβ      6.89    340.9   371.7         39.1      3.42

*Optimal bandwidth h values for X1, X2, ..., X7 individually are respectively: [12.2153, 3.8303, 42.9406, 2.5444, 5.7193, 19.6565, 98.13] using Eqn. 4.26.

AIC increased more than one order of magnitude from 47.2 in the baseline model to

497.4 and 502.1 for the Y = Xkβ and Y = Xmkβ models, respectively. Since RMSE, AIC, and ICOMP(IFIM) increased for these models, applying the kernel regression only to the

X variables resulted in poorer models than the baseline.

However, applying the kernel regression smoothing to the Y variable also increased the

RMSE, AIC, and ICOMP(IFIM), but not nearly as much as the Y = Xkβ and Y = Xmkβ models. The RMSE increased from the baseline model of 0.365 to 7.78, 5.03, and 6.89 for the models Yk = Xβ, Yk = Xkβ, and Yk = Xmkβ, respectively. Similarly, the AIC

increased from 47.2 to 353.1, 309.4, and 340.9 for the models Yk = Xβ, Yk = Xkβ, and

Yk = Xmkβ, respectively, while ICOMP(IFIM) increased from 58.9 to 393.7, 333.3, and

371.7 for these models, respectively. Among the kernel models, the Yk = Xkβ model had the best RMSE, AIC, and ICOMP(IFIM) scores. However, none of the kernel models

performed better than the baseline model.

These results agree well with Table 4.7 where the Gaussian kernel was applied on the

same PE data. Note the Epanechnikov kernel outperformed the Gaussian kernel for the

models Yk = Xβ, Yk = Xkβ, and Yk = Xmkβ since the RMSE, AIC, and ICOMP(IFIM) scores are slightly smaller in Table 4.11 than in Table 4.7.

4.3 Conclusions

Kernel regression smoothing on the univariate Y variable has been shown to significantly lower the ICOMP(IFIM), RMSE, and AIC scores compared to the baseline linear regression models for normally distributed data, Hald's cement data, and Friedman's nonlinear data.

However, for power exponential and uncorrelated uniformly distributed data, kernel regres- sion did not lower ICOMP(IFIM), RMSE, and AIC scores. Kernel regression smoothing on the X variables consistently produced poorer models compared to the baseline for all data sets.

The Epanechnikov kernel was applied to both the normally distributed data and power exponential data to compare its performance with the Gaussian kernel. The Epanechnikov kernel performed better than the Gaussian kernel for the skewed power exponential data, but not as well with the normal data. Therefore, the choice of the kernel can matter depending on the data.

Chapter 5

Multivariate Kernel Regression

5.1 Introduction

Chapter 3 and Section 4.1 present the theoretical framework for kernel regression. Kernel regression has been applied using a variety of bandwidth estimators as discussed in chapter

3. However, this dissertation proposes the extension of kernel regression to the multivariate case using ICOMP for variable model selection.

The univariate kernel regression described in section 4.1 is extended here to the multi- variate case. The multivariate partially linear model is defined as

Y = X^T β + g(T) + E,   (5.1)

where Y is a (n × m) response matrix, X and T are d-dimensional and scalar regressors,

β is a (d × m) matrix of unknown parameters, g(·) is an unknown smooth function, and E is an error matrix with E(ε_i) = 0.

Let g_n(t, β) = Σ_{i=1}^{n} ω_{ni}(t)(Y_i − X_i^T β) for a given matrix β, where ω_{ni}(t) is defined in Eqn. 4.2. By substituting g_n(t, β) into Eqn. 5.1 and using the least squares criterion, the least squares estimator of β is shown as

β̂_KR = (X̃^T X̃)^{-1} X̃^T Ỹ,   (5.2)

where X̃ = (X̃_1, ..., X̃_n)^T with X̃_j = X_j − Σ_{i=1}^{n} ω_{ni}(T_j) X_i, and Ỹ_k = (Ỹ_1, ..., Ỹ_n)^T with Ỹ_j = Y_j − Σ_{i=1}^{n} ω_{ni}(T_j) Y_i for k = 1, 2, ..., p. The nonparametric part g(t) is estimated by

ĝ_n(t) = Σ_{i=1}^{n} ω_{ni}(t)(Y_i − X_i^T β̂_KR).   (5.3)

When ε_1, ..., ε_p are identically distributed, their common variance Σ may be estimated by Σ̂ as

Σ̂ = (Ỹ − X̃β̂_KR)^T (Ỹ − X̃β̂_KR).   (5.4)
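To make Eqns. 5.2 through 5.4 concrete, the following simplified MATLAB sketch computes the estimator with Gaussian kernel weights for a single smoothing variable t and a fixed bandwidth h. It is an illustration only (it assumes MATLAB's implicit expansion, R2016b or later) and is not the dissertation's production code.

% Sketch of the partially linear kernel regression estimator (Eqns. 5.2-5.4).
function [betaKR, Sigma] = plm_kernel(Y, X, t, h)
    D  = (t - t') / h;                 % pairwise scaled differences of the smoothing variable
    W  = exp(-0.5*D.^2);               % Gaussian kernel weights omega_ni(t)
    W  = W ./ sum(W, 2);               % normalize each row of weights
    Xt = X - W*X;                      % X-tilde: X minus its kernel smooth
    Yt = Y - W*Y;                      % Y-tilde: Y minus its kernel smooth
    betaKR = (Xt'*Xt) \ (Xt'*Yt);      % Eqn. 5.2
    R      = Yt - Xt*betaKR;
    Sigma  = R'*R;                     % Eqn. 5.4 (as written, without normalization)
    % The nonparametric part (Eqn. 5.3) would be ghat = W*(Y - X*betaKR).
end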

Eqn. 4.2 shows the weight is calculated from only one regressor Xi independent of the

other Xj. Only the data within each Xi is used in the weighting for smoothing the data.

Eqn. 4.7 is an extension of Eqn. 4.2 that uses the data from all X1, ..., Xd or Y1, ..., Yp variables simultaneously to smooth the data.

Eqn. 4.7 can be applied to the independent variables X_i, the dependent variables Y_i, or both in the multivariate case. Let g_n(t, β) = Σ_{i=1}^{n} Σ_{k=1}^{d} ω_{nik}(t)(Y_i − X_i^T β) for a given β. By substituting g_n(t, β) into Eqn. 5.1 and using the least squares criterion, the least squares estimator of β is shown in Eqn. 5.2, where X̃ = (X̃_1, ..., X̃_n)^T with X̃_j = X_j − Σ_{i=1}^{n} Σ_{k=1}^{d} ω_{nik}(T_j) X_i and Ỹ_k = (Ỹ_1, ..., Ỹ_n)^T with Ỹ_j = Y_j − Σ_{i=1}^{n} Σ_{k=1}^{d} ω_{nik}(T_j) Y_i. The nonparametric part g(t) is estimated by

ĝ_n(t) = Σ_{i=1}^{n} Σ_{k=1}^{d} ω_{nik}(t)(Y_i − X_i^T β̂_KR).   (5.5)

When ε_1, ..., ε_p are identically distributed, their common variance Σ may be estimated by Σ̂ in Eqn. 5.4. The optimal bandwidth matrix Ĥ is calculated from Eqn. 3.41.

Multivariate data sets were simulated with both normal and non-normal error structures

to determine the optimal use of univariate or multivariate kernel regression on minimizing

the ICOMP(IFIM) from Eqn. 2.15.

The simulated multivariate data sets were small enough so that all possible subset models were considered. MATLAB programs were written to calculate AIC, root mean square error (RMSE), and ICOMP(IFIM) from the model with the smallest ICOMP(IFIM) from Eqn. 2.15 after considering all possible subset models.

AIC for multivariate regression derived from Eqn. 2.1 is defined by

AIC(k) = np log(2π) + n log(σ²|Σ|) + np + 2(k + 1),   (5.6)

where p = number of dependent variables Y1, Y2, ..., Yp, k = number of independent

variables X1, X2, ..., Xk, and n = number of observations. For multivariate data sets, all possible combinations of applying kernel regression were calculated using the notation

described in Table 5.1.
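For reference, Eqn. 5.6 translates directly into a one-line MATLAB function handle; the names used here are illustrative only.

% AIC for multivariate regression (Eqn. 5.6); sig2 and Sigma are the estimated
% error variance and covariance matrix of the fitted model.
aic56 = @(n, p, k, sig2, Sigma) n*p*log(2*pi) + n*log(sig2*det(Sigma)) + n*p + 2*(k+1);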

5.2 Kernel Regression with Normal Data

A multivariate data set with seven normally distributed X variables and three normally

distributed Y variables was simulated with n = 50 observations as first discussed in section

4.2.1. Two additional Y response variables were added to the data set using the variables

X1, X2, and X3 in the true model.

Figure 5.1 is a surface plot of the first two variables X1 and X2. These are approximately bivariate normal.

Histograms and scatter plots of each of the simulated variables X1, X2, ..., X7, and Y1,

Y2, and Y3 are shown in Appendix Figures A.9 and A.10 where each variable is approx- imately normally distributed. Eqn. 3.55 was used to calculate the optimal bandwidth h

values when kernel regression is applied individually to each Xi or Yi variable. Eqn. 3.54 was used to calculate the optimal bandwidth h value when kernel regression is

applied to all X variables together (Xmk models) or all Y variables together (Ymk models).

Table 5.1: Notation used for applying multivariate kernel regression smoothing methods.

Model        Description
Y = Xβ       Baseline model with no kernel regression smoothing on either X or Y variables
Y = Xkβ      Univariate kernel regression on X variables one at a time, but no kernel regression on Y variables
Y = Xmkβ     Multivariate kernel regression on all X variables together, but no kernel regression on Y variables
Yk = Xβ      Univariate kernel regression on Y variables one at a time, but no kernel regression on X variables
Yk = Xkβ     Univariate kernel regression on Y and X variables one at a time
Yk = Xmkβ    Univariate kernel regression on Y variables one at a time, and multivariate kernel regression on all X variables together
Ymk = Xβ     Multivariate kernel regression on all Y variables together, but no kernel regression on X variables
Ymk = Xkβ    Multivariate kernel regression on all Y variables together, and univariate kernel regression on X variables one at a time
Ymk = Xmkβ   Multivariate kernel regression on all Y variables together, and multivariate kernel regression on all X variables together

107 0 9 :1 2 T u e s d a y , A u g u s t 1 8 , 2 0 0 9 1

T he K D E P rocedure

In p u ts D a ta S e t D IS .H A L D S N u m b e r o f O b se rv a tio n s U se d 1 3 V a ria b le 1 X 1 V a ria b le 2 X 2 B a n d w id th M e th o d S im p le N o rm a l R e fe re n c e

C o n tro ls X 1 X 2 G rid P o in ts 6 0 6 0 L o w e r G rid L im it 1 2 6 U p p e r G rid L im it 2 1 7 1 B a n d w id th M u ltip lie r 1 1

Figure 5.1: Bivariate surface plot of normally distributed variables X1 and X2.

Therefore, there is only one h value for all X or Y variables when kernel regression is applied to all X or Y variables together.

Figure 5.2 shows smoothed kernel density plots for the various values of hj calculated from Eqn. 3.55 for the variable Y2. Figures 4.2 through 4.9 show smoothed kernel density plots for the values of hj calculated from Eqn. 3.55 for the variables X1, X2, ..., X7, and

Y1 in section 4.2.1. Figure 5.2 shows that as the bandwidth h increases the kernel density

plots become smoother. The smallest hj = 1.6352 is too small since the kernel density plot

is very jagged. Yet the largest hj = 24.1326 is too large since the kernel density plot has

oversmoothed the data and removed features of the data that may be important. For Y2 the bandwidth h that minimizes the ICOMP score is h = 6.2819.

Similarly, Figure 5.3 shows smoothed kernel density plots for various values of h calcu-

lated from Eqn. 3.55 for the Y3 variable for the Yk models. The value of h that minimizes

ICOMP for the Y3 variable is 1.3743. Table 5.2 shows the results from the nine kernel models for multivariate Y . Performing

kernel regression smoothing only on the X variables for models Y = Xkβ and Y = Xmkβ increased the ICOMP(IFIM) to 425 and 787, respectively, from the baseline multivariate

model of 319. Similarly, the AIC also increased from the baseline model of 205 to 370 and

647 for models Y = Xkβ and Y = Xmkβ, respectively. RMSE increased almost two orders

of magnitude from 0.0973 for the baseline to 8.26 for the Y = Xmkβ model. Since RMSE, AIC, and ICOMP(IFIM) increased for these models, applying the kernel regression only

to the X variables resulted in poorer models than the baseline.

However, applying the kernel regression smoothing to the multivariate Y variables

dramatically decreased the RMSE, AIC, and ICOMP(IFIM) as in the univariate Y case in

section 4.2.1. The RMSE decreased more than one order of magnitude from the baseline

model of 0.0973 to 0.008 for the Yk = Xβ model. Similarly, the AIC decreased from

205 to -47.4 for the Yk = Xβ model, while ICOMP(IFIM) decreased from 319 to 45.2.

Among the three Yk models, RMSE, AIC, and ICOMP(IFIM) increased dramatically for

the Yk = Xmkβ model compared to the Yk = Xβ and Yk = Xkβ models. However,

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 1.6352 to 24.1326.]

Figure 5.2: Kernel density plots for various h values for normally distributed variable Y2.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.50085 to 7.3914.]

Figure 5.3: Kernel density plots for various h values for normally distributed variable Y3.

Table 5.2: Results for applying multivariate kernel regression smoothing methods with normally distributed data.

Model          RMSE     AIC     ICOMP(IFIM)   h for Y   h for X
Y = Xβ         0.0973   205     319
Y = Xkβ        0.5178   370     425                     *
Y = Xmkβ       8.26     647     787                     0.0062
Yk = Xβ        0.0080   -47.4   45.2          **
Yk = Xkβ       0.0103   -21.9   42.6          **        *
Yk = Xmkβ      0.0617   157     275           **        0.0062
Ymk = Xβ       0.0032   -138    -93.5         0.8235
Ymk = Xkβ      0.0031   -141    -131          0.8235    *
Ymk = Xmkβ     0.0123   -3.81   18.4          0.8235    0.0062

*Optimal bandwidth h values for X1, X2, ..., X7 individually are respectively: [0.5892, 0.6766, 0.5691, 2.9936, 5.2505, 3.2336, 1.0146].
**Optimal bandwidth h values for Y1, Y2, Y3 individually are respectively: [0.4766, 6.2819, 1.3743].

compared to the baseline model all Yk models have lower RMSE, AIC, and ICOMP(IFIM) scores.

The Ymk models provide the best improvement in RMSE, AIC, and ICOMP(IFIM)

compared to the other models. The Ymk = Xkβ model has the lowest RMSE, AIC, and ICOMP(IFIM) scores of all nine models with scores of 0.0031, -141, and -131, respectively.

The RMSE, AIC, and ICOMP(IFIM) scores increased considerably for the Ymk = Xmkβ

model compared to the Ymk = Xβ and Ymk = Xkβ models. This pattern is consistent for

the Y , Yk, and Ymk models in Table 5.2. Using a single h bandwidth for all X variables consistently produces poorer models than applying kernel regression smoothing to the Y

variables.

For normally distributed data, applying the kernel regression smoothing to the Y vari-

ables had the most dramatic effect in lowering the RMSE, AIC, and ICOMP(IFIM).

5.3 Kernel Regression for Multivariate Power Exponential Distribution

Liu and Bozdogan [Liu and Bozdogan, 2008] discuss multivariate regression models with power exponential (PE) random errors. The univariate random variable x is PE distributed if the density function of x is defined by Eqn. 4.20 in section 4.2.4.

Gomez [Gomez, Gomez-Villegas and Marin, 1998] published a multivariate generalization of the PE family of distributions, denoted PE_p(µ, Σ, β), defined by

f(x; µ, Σ, β) = [ p Γ(p/2) / ( π^{p/2} Γ(1 + p/(2β)) 2^{1 + p/(2β)} ) ] |Σ|^{-1/2} exp{ −(1/2) [ (x − µ)′ Σ^{-1} (x − µ) ]^β },   (5.7)

where x = [x₁, x₂, ..., x_p]′ is a p-dimensional random vector, µ ∈ R^p, Σ is a (p × p) positive definite symmetric matrix, and β > 0 is the shape parameter. Note that when p = 1, Eqn. 5.7 reduces to Eqn. 4.20.

In order to generate univariate PE pseudo-random numbers, Liu and Bozdogan [Liu and Bozdogan, 2008] show that if Y = (1/2)|X|^(2β), where X has a standard PE distribution with shape parameter β, then Y ∼ Gamma(1/(2β)). So Y has a distribution with density

f(y) = y^(1/(2β) − 1) e^(−y) / Γ(1/(2β)).   (5.8)

A multivariate PE data set with seven independent X variables and one dependent

Y variable was simulated with n = 50 observations as discussed in section 4.2.4. Two

additional Y response variables were added to the data set using the variables X1, X2,

and X3 in the true model. The residuals from the multivariate linear regression models are gamma distributed with β = 0.3. A MATLAB program performed multivariate kernel regression smoothing on the multivariate PE distributed data.
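A minimal MATLAB sketch of the Gamma-based generator implied by this relationship is given below; it produces standard PE variates with shape parameter β and assumes the Statistics Toolbox function gamrnd is available (location and scale would be applied afterward as needed).

% Generate standard univariate PE pseudo-random numbers via the Gamma relation (Eqn. 5.8).
beta = 0.3;                             % shape parameter
n    = 50;
g    = gamrnd(1/(2*beta), 1, n, 1);     % Y ~ Gamma(1/(2*beta))
s    = sign(rand(n, 1) - 0.5);          % random signs
x    = s .* (2*g).^(1/(2*beta));        % invert Y = (1/2)|X|^(2*beta)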

113 Figure 5.4 is a surface plot of the first two variables X1 and X2 where β = 0.3. The data are gamma distributed with thin tails.

Histograms and scatter plots of each of the simulated variables X1, X2, ..., X7, and Y1,

Y2, and Y3 are shown in Appendix Figures A.11 and A.12 where each variable is skewed and non-normally distributed. Eqn. 3.55 was used to calculate the optimal bandwidth h values when kernel regression is applied individually to each Xi or Yi variable. Therefore, there is one h bandwidth for each variable when kernel regression is applied individually to each variable.

Eqn. 3.54 was used to calculate the optimal bandwidth h value when kernel regression is applied to all X variables together (Xmk models) or all Y variables together (Ymk models). Therefore, there is only one h value for all X or Y variables when kernel regression is applied to all X or Y variables together.

Figure 5.5 shows smoothed kernel density plots for the various values of hj calculated from Eqn. 3.55 for the variable Y2. Figures 4.17 through 4.24 show smoothed kernel density plots for the values of hj calculated from Eqn. 3.55 for the variables X1, X2, ..., X7, and

Y1, respectively, in section 4.2.4. Figure 5.5 shows that as the bandwidth h increases the kernel density plots become smoother. The smallest hj = 10.7018 is too small since the kernel density plot is jagged. Yet the largest hj = 157.9359 is too large since the kernel density plot has oversmoothed the data and removed features of the data that may be important. For Y2 the bandwidth h that minimizes the ICOMP score is h = 41.112.

Similarly, Figure 5.6 shows smoothed kernel density plots for various values of hj calcu- lated from Eqn. 3.55 for the Y3 variable for the Yk models. The value of h that minimizes

ICOMP for the Y3 variable is 10.0665. Table 5.3 shows the results from the nine kernel models for multivariate Y . Performing kernel regression smoothing only on the X variables for models Y = Xkβ and Y = Xmkβ increased the ICOMP(IFIM) to 1663 and 1705, respectively, from the baseline multivariate model of 434. Similarly, the AIC also increased from the baseline model of 184 to 1460 and

1497 for models Y = Xkβ and Y = Xmkβ, respectively. RMSE increased more than five

114 0 9 :1 2 T u e s d a y , A u g u s t 1 8 , 2 0 0 9 1

T he K D E P rocedure

In p u ts D a ta S e t D IS .M V P E R _ M U L T IV A R IA T E N u m b e r o f O b se rv a tio n s U se d 5 0 V a ria b le 1 X 1 V a ria b le 2 X 2 B a n d w id th M e th o d S im p le N o rm a l R e fe re n c e

C o n tro ls X 1 X 2 G rid P o in ts 6 0 6 0 L o w e r G rid L im it -6 .2 0 3 3 .8 4 7 U p p e r G rid L im it 6 4 .8 9 6 3 0 .2 3 7 B a n d w id th M u ltip lie r 1 1

Figure 5.4: Bivariate surface plot of power exponential variables X1 and X2.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 10.7018 to 157.9359.]

Figure 5.5: Kernel density plots for various h values for PE distributed variable Y2.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 3.6686 to 54.1403.]

Figure 5.6: Kernel density plots for various h values for PE distributed variable Y3.

Table 5.3: Results for applying multivariate kernel regression smoothing methods with power exponential data.

Model          RMSE     AIC    ICOMP(IFIM)   h for Y   h for X
Y = Xβ         0.080    184    434
Y = Xkβ        28,114   1460   1663                    *
Y = Xmkβ       40,783   1497   1705                    1.193
Yk = Xβ        318      1012   1177          **
Yk = Xkβ       12.8     691    824           **        *
Yk = Xmkβ      480      1053   1185          **        1.193
Ymk = Xβ       136      927    1066          0.812
Ymk = Xkβ      131      923    1023          0.812     *
Ymk = Xmkβ     149      936    1048          0.812     1.193

*Optimal bandwidth h values for X1, X2, ..., X7 individually are respectively: [4.988, 2.5413, 18.5111, 0.7633, 2.7108, 14.5435, 82.7758].
**Optimal bandwidth h values for Y1, Y2, Y3 individually are respectively: [20.4193, 41.112, 10.0665].

orders of magnitude from 0.08 for the baseline to 40,783 for the Y = Xmkβ model. Since

However, applying the kernel regression smoothing to the multivariate Y variables also increased the RMSE, AIC, and ICOMP(IFIM) as in the univariate Y case in section 4.2.4.

The RMSE increased more than two orders of magnitude from the baseline model of 0.08 to 318, 12.8, and 480 for the Yk = Xβ, Yk = Xkβ, and Yk = Xmkβ models, respectively.

Similarly, the AIC scores increased from 184 to 1012, 691, and 1053 for the Yk = Xβ,

Yk = Xkβ, and Yk = Xmkβ models, respectively, while ICOMP(IFIM) scores increased from 434 to 1177, 824, and 1185 for the three models, respectively. Among the three Yk models, RMSE, AIC, and ICOMP(IFIM) scores are much lower for the Yk = Xkβ model compared to the Yk = Xβ and Yk = Xmkβ models. However, compared to the baseline model all Yk models have higher RMSE, AIC, and ICOMP(IFIM) scores.

The Ymk models generally provide the better improvement in RMSE, ICOMP(IFIM), and AIC scores compared to all other models except the Yk = Xkβ model. There is little difference between the three Ymk models as shown in Table 5.3. Among the kernel regression models, the Yk = Xkβ model minimizes RMSE, AIC, and ICOMP(IFIM), although the scores are not as low as the baseline model. Kernel regression smoothing did not improve the baseline model for PE data using the Gaussian kernel due to its skewness and peaks.

5.4 Conclusions

Kernel regression smoothing on the multivariate Y variables has been shown to significantly lower the ICOMP(IFIM), RMSE, and AIC scores compared to the baseline linear regression models for normally distributed data. However, kernel regression did not lower

ICOMP(IFIM), RMSE, and AIC scores compared to the baseline model for power expo- nential data due to its high skewness and sharp peaks using the Gaussian kernel. Kernel regression smoothing on the X variables singly or jointly produces a poorer model com- pared to the baseline in both normally and power exponential distributed data.

Chapter 6

Kernel Regression with the Genetic Algorithm

6.1 Introduction

Chapters 4 and 5 presented results of applying kernel regression to univariate and multivariate data sets using both the Gaussian and Epanechnikov kernels. In several cases, applying the kernel to the Y variables dramatically reduced both the AIC and ICOMP(IFIM) compared to the model where no kernel regression was used. Chapter 6 presents results of integrating kernel regression with the genetic algorithm (GA).

6.2 Genetic Algorithm

A genetic algorithm is a heuristic search algorithm based on concepts of biological evolution and natural selection [Goldberg, 1989]. GA has many applications as a numerical optimiza- tion technique for complex problems in many different disciplines including management science, statistics, industrial engineering and operations research. GAs take into consider- ation all historical information involved in a problem as well as random information [Said,

2005].

A GA treats information as a series of codes on a string, where each string represents a different solution to a given problem. The strings are analogous to genetic information coded by genes on a chromosome. Each string is evaluated according to a "fitness" function that measures its ability to solve the problem. Genetic algorithms are search and optimization tools that enable the fittest candidate among strings to survive and reproduce based on information exchange imitating natural biological selection. "Parent" strings reproduce into two "child" strings that are created using portions of the parent strings.

The general procedure in the GA involves the following seven steps:

1. Convert predictor variables into a string of binary code

2. Create a population of breeding strings

3. Define the fitness function

4. Select the mating pool based on values of the fitness function

5. Perform chromosomal crossover and genetic mutation

6. Create the next generation of models

7. Loop back to step 2 until termination criteria are met.

Step 1 in GA for regression modeling is to use an encoding scheme for the various possible combinations of predictor variables into a string, where each locus in the string is a binary code indicating the presence (1) or absence (0) of a given predictor variable. For each string, locus 0 represents the intercept term, while locus i represents predictor variable i for i = 1, 2, ..., p. Each string has length p + 1. For example, if there are p = 5 predictor variables, the string “110101” represents the regression model that includes the intercept

(locus 0 has "1"), predictor variable 1 is included (locus 1 has "1"), predictor variable 2 is excluded (locus 2 has "0"), predictor variable 3 is included (locus 3 has "1"), predictor variable 4 is excluded (locus 4 has "0"), and predictor variable 5 is included (locus 5 has "1").
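In MATLAB such a string is conveniently stored as a logical vector, as in the illustrative snippet below; the data matrix here is a placeholder, and since MATLAB indexing is 1-based, locus 0 of the string corresponds to element 1 of the vector.

% Decode a GA chromosome into a regression design matrix.
chrom = logical([1 1 0 1 0 1]);        % the example string "110101"
X     = rand(100, 5);                  % placeholder predictor matrix
Xfull = [ones(size(X,1),1), X];        % intercept column followed by the predictors
Xsub  = Xfull(:, chrom);               % columns selected by the chromosome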

Step 2 in GA is to create an initial population of breeding strings or models. The initial population size is chosen by the investigator and is not random. Each model is selected by randomly choosing variables to be included in the model and encoding the string based on

the presence or absence of each variable.

Step 3 in GA is to define the criterion to rank solutions called the fitness function. The

fitness function used for model selection is AIC or the information complexity (ICOMP) of the inverse Fisher information matrix (IFIM) given by

ICOMP(IFIM) = n log(2π) + (n − q − 2) log(σ̂²) + n + (q + 1) log[ ( tr(σ̂²(X′X)^{-1}) + 2σ̂⁴/n ) / (q + 1) ] − log|(X′X)^{-1}| + log(n/2),   (6.1)

where q = k + 1 and k = number of parameters estimated in the model. Eqn. 6.1 is identical to Eqn. 2.13.
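A compact MATLAB sketch of this fitness evaluation, computed directly from the definition ICOMP(IFIM) = −2 log L + 2C₁(F̂⁻¹) with the usual block-diagonal IFIM estimate for a linear model (algebraically equivalent to Eqn. 6.1 after rearrangement), is shown below; it is an illustration, not the M³ toolbox implementation.

% ICOMP(IFIM) fitness for one candidate regression model (sketch).
function fit = icomp_ifim(y, Xsub)
    [n, q] = size(Xsub);                   % q columns, including the intercept
    b      = Xsub \ y;
    sig2   = sum((y - Xsub*b).^2) / n;     % MLE of sigma^2
    Finv   = blkdiag(sig2*inv(Xsub'*Xsub), 2*sig2^2/n);
    s      = size(Finv, 1);                % s = q + 1
    C1     = (s/2)*log(trace(Finv)/s) - 0.5*log(det(Finv));
    fit    = n*log(2*pi) + n*log(sig2) + n + 2*C1;
end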

Step 4 in GA is to select the mating pool based on their ICOMP values from Eqn.

6.1. After calculating the ICOMP criterion for each of the models in the population, each

ICOMP criterion is subtracted from the highest criterion value in the population. The

arithmetic average of these differences and the ratio of each model’s difference value to the

mean value are calculated. This process is called the natural selection strategy since the

probability of an individual being selected is proportional to the ratio

r_j = ( Fitness_Max − Fitness_j ) / ( (1/n) Σ_{j=1}^{n} Fitness_j ),   (6.2)

where Fitness_Max is the maximum ICOMP fitness value among the population and Fitness_j is the ICOMP fitness value of the jth model. The probability of an individual model being selected is proportional to the ratio in Eqn. 6.2. Parents from the population are chosen whose ratios are greater than one, which guarantees the mating pool contains the models with better fitness. This process of selecting parents and producing offspring models continues until the number of offspring equals the initial population size [Goldberg, 1989].

Step 5 in GA is the crossover and mutation process. Whether a chosen pair of chromosomes undergoes crossover is controlled by the crossover probability, which is input by the user. The crossover point is a randomly selected location on the binary string that separates the string into two parts. The portions of the two strings to the right of the crossover point are interchanged between the parents to form two offspring strings. For example, if parent A is the string "011|01001110" while parent B is the string

“010|11100101” where the symbol “|” is the randomly chosen crossover point, then offspring

A would be the string “011|11100101” and offspring B would be the string “010|01001110”.
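A minimal MATLAB illustration of this single-point crossover operation (illustrative only; the M³ toolbox implementation may differ) is:

% Randomized single-point crossover of two parent chromosomes.
parentA = logical([0 1 1 0 1 0 0 1 1 1 0]);       % "011|01001110" without the bar
parentB = logical([0 1 0 1 1 1 0 0 1 0 1]);       % "010|11100101" without the bar
cut     = 3;                                      % crossover point after the third element (randomly chosen in practice)
childA  = [parentA(1:cut), parentB(cut+1:end)];   % gives "011|11100101"
childB  = [parentB(1:cut), parentA(cut+1:end)];   % gives "010|01001110"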

A crossover probability of zero means that the members of the mating pool are carried over into the next generation and no offspring are produced. On the other hand, a crossover probability of 1 means that crossover always occurs between any two parent models chosen from the mating pool; hence, the next generation will consist only of offspring models with no models from the previous generation. Models that performed well in the previous population could then be removed from subsequent generations before selection can produce improvements from their contributions. However, a low crossover rate will likely slow down the search by not producing new combinations of variables fast enough [Luh, Minesky and

Bozdogan, 1996].

A mutation of models is used in the GA process to create new combinations of variables so the searching process can jump to other areas of the fitness landscape in search of the global optimal model. Mutation occurs when a randomly selected locus changes from 0 to

1 or from 1 to 0 with a specified probability called the mutation probability. The elitism rule is applied which allows the best model in the previous generation to remain in the subsequent generations. This guarantees the individual model with the most preferred

fitness in the current generation will survive to the next generation.

Step 6 in the GA is creating the next generation of models. The number of generations is set by the user in the algorithm. The number of generations the GA is allowed to reproduce should be large enough to have a high probability of searching most areas of the search landscape for the global optimum. However, the number of generations should be small enough that the algorithm terminates in a reasonable amount of time without being computationally expensive.

The GA continues looping through steps 2 through 6 until the algorithm has executed the total number of generations allowed or a prescribed number of generations have been executed with no improvement in the objective function.
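Putting these steps together, a highly simplified MATLAB sketch of such a GA loop for regression subset selection is given below. It is not the M³ toolbox code: the population size, operators, and the AIC-style fitness are placeholder choices, and the rank-based selection used here is only a crude stand-in for the toolbox's ranking scheme.

% Simplified GA loop for regression subset selection (illustrative only).
function best = ga_subset_sketch(y, X)
    [n, p]  = size(X);
    popSize = 30;  nGen = 60;  pCross = 0.75;  pMut = 0.10;
    pop      = rand(popSize, p+1) > 0.5;       % random chromosomes; column 1 = intercept
    pop(:,1) = true;                           % always keep the intercept
    fitFun   = @(c) aic_fit(y, [ones(n,1) X], c);
    for g = 1:nGen
        fit = arrayfun(@(i) fitFun(pop(i,:)), 1:popSize)';
        [~, order] = sort(fit);                % smaller criterion value is better
        elite = pop(order(1), :);              % elitism: remember the best model
        pool  = pop(order(randi(ceil(popSize/2), popSize, 1)), :);  % crude rank-based pool
        next  = pop;
        for i = 1:2:popSize-1                  % single-point crossover on mating pairs
            a = pool(i,:);  b = pool(i+1,:);
            if rand < pCross
                cut = randi(p);
                tmp = a(cut+1:end);  a(cut+1:end) = b(cut+1:end);  b(cut+1:end) = tmp;
            end
            next(i,:) = a;  next(i+1,:) = b;
        end
        next = xor(next, rand(popSize, p+1) < pMut);   % mutation: random bit flips
        next(:,1) = true;
        next(1,:) = elite;                     % carry the elite model forward
        pop = next;
    end
    fit = arrayfun(@(i) fitFun(pop(i,:)), 1:popSize)';
    [~, ibest] = min(fit);
    best = pop(ibest, :);
end

function a = aic_fit(y, Xfull, chrom)
    Xs = Xfull(:, chrom);
    n  = numel(y);
    b  = Xs \ y;
    s2 = sum((y - Xs*b).^2) / n;
    a  = n*log(2*pi) + n*log(s2) + n + 2*(size(Xs,2) + 1);   % AIC-style fitness
end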

The two main types of traditional search and optimization methods are calculus-based and enumerative. Calculus-based methods optimize an objective function subject to linear or non-linear constraints. The Lagrange methodology searches for local extremal points using gradients. Calculus-based methods assume the functions are differentiable and continuous. How close the starting point of the optimization is to the local maximum can determine whether the method converges to a local or global maximum, or even converges at all. Hence, many calculus-based optimization methods are not robust. When calculus-based methods do converge on a local maximum, it may not be clear whether the local maximum is also the global maximum. Real world applications are known to have discontinuities with multimodal and chaotic search domains, making them difficult to model with the continuous, differentiable functions required of calculus-based methods. These methods lack the power to provide a robust search through the optimization techniques [Said,

2005].

Enumeration techniques depend on either searching a finite continuous highly focused region or an infinite disconnected search region where the algorithm evaluates every point in the region. However, these techniques are inefficient and not robust. For excessively large problems, the number of evaluations could be prohibitive even with modern computing capabilities. Enumerative methods are best used for small optimization problems where the domain space can be searched without prohibitive costs in time or computing resources.

By contrast, GAs can be applied to solving problems where the number of possible solutions is very large. The GA does not require calculation of the gradient of the objective function and is not likely to be restricted to local optima [Goldberg, 1989]. Through the mutation and crossover stages of the algorithm, the GA can jump to other areas of the domain search space, thereby increasing the likelihood of finding the global optimum, unlike calculus-based methods. GAs maintain robustness throughout the optimization process as they search the feature space efficiently in complex systems. GAs require only simple computations while not imposing restrictions on the domain of the search or optimization problem, and they have been used to resolve an assortment of complex problems in a wide array of fields. GAs form a subfield of evolutionary computation comprising optimization algorithms based on simulated evolution. GAs allow discontinuous, complex objective functions, unlike calculus-based techniques. The optimal solution can be verified by running the GA again with the candidate optimal solution included as one of the initial parent solutions.

Howe [Howe, 2009] and Bozdogan developed the Information Complexity M³ Toolbox for MATLAB. These MATLAB programs (hereafter referred to as the M³ toolbox) provide the basis for generating results using the GA in this dissertation. The M³ toolbox programs were modified to incorporate the kernel regression discussed in chapters 3 and 4.

The M³ toolbox uses eight operational parameters that are summarized in Table 6.1. Howe [Howe, 2009] discusses these parameters in detail; they are summarized below.

Number of Generations

A generation in a GA is another name for an iteration. The number of generations selected should be large enough for the algorithm to identify the global optimal solution, but not unnecessarily large due to the increase in computation time. Selecting a number of generations too small could mean termination of the algorithm with a suboptimal result.

Table 6.1: Genetic algorithm operational parameters.

Parameter                          Setting
Number of generations              60
Premature termination threshold    40
Population size                    30
Generation seeding                 Ranking
Crossover probability              0.75
Mutation probability               0.10
Elitism                            On
Objective function                 AIC or ICOMP(IFIM)

Premature Termination Threshold

The premature termination threshold is a convergence criterion for the GA. This parameter controls the number of generations the GA is allowed to execute with no improvement in the objective function. This parameter needs to be large enough to find the global optimal solution, but not so large that the algorithm spends additional computation time making no improvement in the objective function.

Population Size

The population size parameter P determines how many solutions are evaluated in each generation. As a general rule, in a subsetting problem with p variables, each generation should evaluate P > p solutions.

Generation Seeding

Generation seeding is the process of selecting members of the next generation for mating. Ranking selection uses unequal bin widths so the larger bins are at the beginning, corresponding to the most fit solutions. This allows chromosomes with better objective function values to be overrepresented in the mating pool. The ordering of the solutions is still randomly permuted, and solutions are mated sequentially [Howe, 2009].

Crossover Probability

The crossover operation uses a randomized single-point crossover described in Howe [Howe, 2009]. A mating pair undergoes the crossover operation if a random draw is less than the crossover probability. Frequent crossovers are preferred since the procedure simply duplicates the original solutions when the solutions are not crossed. Therefore, crossover probabilities greater than 50% are preferred.

Mutation Probability

The mutation probability is implemented by randomly selecting elements on the chro- mosome and then reassigning them randomly to different groups. The GA uses mutation to widen the search for the global optimal solution by allowing a jump to another area of the fitness landscape. Generally, a small probability (≤ 10%) is preferred for mutation.

126 Elitism

The elitism rule ensures monotonic improvement in the objective function. After all reproduction operations have been completed, the elitism rule copies the best chromosome without modification for mating with the next generation. However, the elitism rule usually means the population size increases with each generation, which can increase computation time.
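
A sketch of the elitism rule, under the assumption that chromosomes are stored as rows of a matrix (hypothetical names):

function nextPop = apply_elitism(population, scores, offspring)
% Elitism sketch: after reproduction, the best chromosome of the current
% generation is copied, unmodified, into the next generation for mating.
[~, best] = min(scores);                       % smallest objective value is best
nextPop = [population(best, :); offspring];    % note the population grows by one row
end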

Objective Function

Like all optimization procedures, the GA requires some objective function to either maximize or minimize. The M 3 toolbox uses information criteria as the objective function to determine the optimal solution.
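
As a hedged illustration of an AIC-type objective for a candidate subset (the dissertation's Eqn. 2.1 may differ in constants), one common Gaussian-regression form is sketched below; the function name and the assumption that X already carries a column of ones for the intercept are mine, not the toolbox's.

function aic = aic_subset(y, X, subset)
% AIC sketch for a least squares subset model using the full Gaussian
% log-likelihood: AIC = -2 log L + 2m, with m estimated parameters.
Xs   = X(:, subset);                 % candidate design matrix
n    = numel(y);
beta = Xs \ y;                       % ordinary least squares fit
sig2 = sum((y - Xs*beta).^2) / n;    % ML estimate of the error variance
m    = numel(subset) + 1;            % regression coefficients plus sigma^2
aic  = n*log(2*pi) + n*log(sig2) + n + 2*m;
end

ICOMP(IFIM) replaces the fixed 2m penalty with an information-complexity penalty on the estimated inverse Fisher information matrix, as defined in Chapter 2.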

6.3 Kernel Regression with the Genetic Algorithm on Real Body Fat Data

The M 3 toolbox was originally developed by Howe and Bozdogan [Howe, 2009] for variable selection using ordinary least squares regression with the genetic algorithm. The MATLAB programs used in the M 3 toolbox were modified to incorporate kernel regression smoothing in place of ordinary regression for the models described in Table 4.1.

Kernel regression is applied with the GA on a real data set using the modified M 3 toolbox. The real data set, described by Bao [Bao, et. al, 2005] (http://lib.stat.cmu.edu/datasets/bodyfat), has 13 independent X variables, n = 252 observations, and percentage of body fat as the Y response variable.

6.3.1 Results from the Y = Xβ Model

The modified M 3 toolbox was run first with no kernel regression (the Y = Xβ model) applied to the data to establish a baseline of AIC and ICOMP(IFIM) scores for the best models. The operational parameters shown in Table 6.1 were used for the GA. Figure 6.1 is a three-dimensional plot showing the fitness landscape of AIC scores with the number of generations and populations output from the M 3 toolbox. Figure 6.1 shows the AIC scores generally decrease as the generations increase.

Figure 6.2 is a scatter plot of the minimum and average AIC scores for each generation in the GA. The GA converges to the minimum AIC at generation 5 and is constant through the remaining generations. Since the GA found the optimal model, the GA terminated early at generation 46. The average AIC score decreases quickly and stabilizes at generation 11.

The GA model was replicated 100 times. The results are summarized in Table 6.2.

An explicit enumeration of all possible subset models was performed, and the top 10 AIC models are shown in Table 6.2. The integer 0 in the model column denotes the presence of the intercept term. The other integers in the model column identify the variable numbers present in the model. For example, the best model (0,1,2,4,6,7,8,12,13) has the intercept term plus variables X1, X2, X4, X6, X7, X8, X12, and X13. The GA correctly identified the best model (0,1,2,4,6,7,8,12,13) with the smallest AIC score of 1457.00 69 times out of 100 replications. The model identified by the GA is the optimal solution shown in Table 6 of Bao [Bao, et. al, 2005] using the exact implicit enumeration (IE) algorithm. The GA identified the model (1,2,3,4,6,7,8,12,13) with the third smallest AIC score of 1457.47 19 times out of 100 replications. Ninety-nine of the 100 replications identified five of the best seven models using AIC. One replication identified model (1,2,3,4,6,7,11,12,13), outside the top 10 models, with an AIC score of 1459.79.

Figure 6.3 is a three-dimensional plot showing the fitness landscape of ICOMP(IFIM) scores with the number of generations and populations output from the M 3 toolbox. Figure 6.3 shows the ICOMP(IFIM) scores generally decrease as the generations increase.

The GA model was replicated 100 times for ICOMP(IFIM). The results are summarized in Table 6.3. An explicit enumeration of all possible subset models was performed and the top 10 ICOMP(IFIM) models are shown in Table 6.3.


Figure 6.1: Three dimensional surface plot of AIC scores for body fat from the Y = Xβ model.


Figure 6.2: Scatter plot of AIC scores for body fat for each generation from the Y = Xβ model.

Table 6.2: AIC results for body fat data from the Y = Xβ model.

Model                        AIC      Frequency
{0,1,2,4,6,7,8,12,13}        1457.00  69
{0,1,2,4,6,8,12,13}          1457.05  8
{1,2,3,4,6,7,8,12,13}        1457.47  19
{0,1,2,4,6,8,11,12,13}       1457.62  2
{1,3,4,6,7,8,12,13}          1457.72
{0,1,2,4,6,7,8,11,12,13}     1457.82
{0,1,2,4,6,11,12,13}         1458.13  1
{1,4,6,7,8,12,13}            1458.21
{0,1,2,3,4,6,7,8,12,13}      1458.33
{0,1,2,4,6,8,10,12,13}       1458.34
{1,2,3,4,6,7,11,12,13}       1459.79  1
Modeling completed in 3.5 minutes using 100 replications.

The GA identified the model (4,6,7,12,13) with the smallest ICOMP(IFIM) score of 1495.17 91 times out of 100 replications. The model (1,4,6,7,8,12,13) was the second most frequently identified model, seven times out of 100 replications, with an ICOMP(IFIM) score of 1498.50. All 100 replications using the GA identified four of the top five models scored by ICOMP(IFIM).

Figure 6.4 is a scatter plot of the minimum and average ICOMP(IFIM) scores for each generation in the GA. The GA converges to the minimum ICOMP(IFIM) at generation 41 and is constant through generation 60. The average ICOMP(IFIM) score decreases quickly and stabilizes at generation 26.

Histograms and scatter plots of each of the variables X1, X2, ..., X13, and Y are shown in Appendix Figures A.13, A.14, and A.15. The histograms show a variety of distributions from skewed to approximately normal. The scatter plots show significant correlations between several variables.


Figure 6.3: Three dimensional surface plot of ICOMP(IFIM) scores for body fat from the Y = Xβ model.


Figure 6.4: Scatter plot of ICOMP(IFIM) scores for body fat for each generation from the Y = Xβ model.

Table 6.3: ICOMP(IFIM) results for body fat data from the Y = Xβ model.

Model               ICOMP(IFIM)  Frequency
{4,6,7,12,13}       1495.17      91
{4,6,7,13}          1495.85      1
{6,7,13}            1495.97
{3,4,6,7,12,13}     1497.84      1
{1,4,6,7,8,12,13}   1498.50      7
{3,6,7,13}          1498.51
{6,7,12,13}         1498.54
{1,4,6,7,12,13}     1498.55
{4,6,7,11,13}       1498.55
{0,2,6}             1498.65
Modeling completed in 1.6 minutes using 100 replications.

6.3.2 Results from the Y = Xkβ Model

Eqn. 3.55 was used to calculate the optimal bandwidth h values when kernel regression is applied individually to each X variable. Therefore, there is one h bandwidth for each of the 13 X variables when kernel regression is applied individually to each variable.
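
Although the exact smoothing transformation is defined by Eqn. 3.55 and the surrounding chapters, a generic Gaussian-kernel Nadaraya-Watson smoother for a single variable gives the flavor of how one bandwidth h controls the amount of smoothing; the MATLAB sketch below is illustrative only, assumes column-vector inputs, and its names are hypothetical.

function s = nw_smooth(x, y, h)
% Illustrative Gaussian-kernel Nadaraya-Watson smoother: each fitted value is
% a locally weighted average of y, with weights that decay with distance in x
% on the scale of the bandwidth h.
n = numel(x);
s = zeros(n, 1);
for i = 1:n
    u = (x(i) - x) / h;              % scaled distances to every observation
    w = exp(-0.5 * u.^2);            % Gaussian kernel weights (constants cancel)
    s(i) = sum(w .* y) / sum(w);     % weighted local average
end
end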

Figure 6.5 shows smoothed kernel density plots for the various values of hj calculated from Eqn. 3.55 for the variable X1. Figure 6.5 shows that as the bandwidth h increases, the kernel density plots become smoother. The smallest h = 0.5871 is too small, since the kernel density plot is very jagged. Yet the largest h = 8.664 is too large, since the kernel density plot has oversmoothed the data and removed features of the data that may be important. For X1 the bandwidth h that minimizes the ICOMP score is h = 2.2553.

Similarly, Figures 6.6 through 6.18 each show smoothed kernel density plots for various values of hj calculated from Eqn. 3.55 for the variables X2, X3, ..., X13, and Y, respectively, for the Yk and Xk models. The values of h that minimize ICOMP for the variables X2, X3, ..., X13, and Y are 19.6249, 4.0225, 2.1908, 4.0862, 5.2954, 4.1302, 3.2837, 1.0994, 0.733, 1.4837, 0.7068, 0.3141, and 2.9355, respectively.
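
A hedged MATLAB sketch of a Gaussian kernel density estimate, which is enough to reproduce the qualitative behavior of the panels in Figures 6.5 through 6.18 for any chosen h, is given below; the function name and evaluation grid are assumptions, not the M 3 toolbox code.

function f = kde_gauss(xgrid, x, h)
% Standard Gaussian kernel density estimate evaluated on a grid:
% f(xgrid_j) = (1/(n*h)) * sum_i K((xgrid_j - x_i)/h), K the standard normal pdf.
n = numel(x);
f = zeros(size(xgrid));
for j = 1:numel(xgrid)
    u = (xgrid(j) - x) / h;
    f(j) = sum(exp(-0.5 * u.^2)) / (n * h * sqrt(2*pi));
end
end

For example, plotting kde_gauss(linspace(min(x), max(x), 200), x, h) for increasing h shows the same jagged-to-oversmoothed progression described above for X1.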

The modified M 3 toolbox was then run with the Y = Xkβ model, where kernel regression was applied to the X matrix of independent variables one variable at a time.

[Figure 6.5 panels: kernel density estimates f(x) of X1 for bandwidths h = 0.58708, 0.82191, 1.1507, 1.611, 2.2553, 3.1575, 4.4205, 6.1886, and 8.6641.]

Figure 6.5: Kernel density plots for various hj values for body fat variable X1.

[Figure 6.6 panels: kernel density estimates f(x) of X2 for bandwidths h = 5.1085, 7.1519, 10.0127, 14.0178, 19.6249, 27.4749, 38.4649, 53.8509, and 75.3912.]

Figure 6.6: Kernel density plots for various hj values for body fat variable X2.

[Figure 6.7 panels: kernel density estimates f(x) of X3 for bandwidths h = 1.0471, 1.4659, 2.0523, 2.8732, 4.0225, 5.6315, 7.884, 11.0377, and 15.4527.]

Figure 6.7: Kernel density plots for various hj values for body fat variable X3.

[Figure 6.8 panels: kernel density estimates f(x) of X4 for bandwidths h = 0.57029, 0.7984, 1.1178, 1.5649, 2.1908, 3.0671, 4.294, 6.0116, and 8.4162.]

Figure 6.8: Kernel density plots for various hj values for body fat variable X4.

[Figure 6.9 panels: kernel density estimates f(x) of X5 for bandwidths h = 1.0637, 1.4892, 2.0848, 2.9187, 4.0862, 5.7207, 8.009, 11.2126, and 15.6977.]

Figure 6.9: Kernel density plots for various hj values for body fat variable X5.

[Figure 6.10 panels: kernel density estimates f(x) of X6 for bandwidths h = 1.3784, 1.9298, 2.7017, 3.7824, 5.2954, 7.4135, 10.3789, 14.5305, and 20.3427.]

Figure 6.10: Kernel density plots for various hj values for body fat variable X6.

[Figure 6.11 panels: kernel density estimates f(x) of X7 for bandwidths h = 1.0751, 1.5052, 2.1073, 2.9502, 4.1302, 5.7823, 8.0952, 11.3333, and 15.8666.]

Figure 6.11: Kernel density plots for various hj values for body fat variable X7.

[Figure 6.12 panels: kernel density estimates f(x) of X8 for bandwidths h = 0.85476, 1.1967, 1.6753, 2.3455, 3.2837, 4.5971, 6.436, 9.0103, and 12.6145.]

Figure 6.12: Kernel density plots for various hj values for body fat variable X8.

[Figure 6.13 panels: kernel density estimates f(x) of X9 for bandwidths h = 0.28619, 0.40067, 0.56094, 0.78531, 1.0994, 1.5392, 2.1549, 3.0169, and 4.2236.]

Figure 6.13: Kernel density plots for various hj values for body fat variable X9.

[Figure 6.14 panels: kernel density estimates f(x) of X10 for bandwidths h = 0.19079, 0.26711, 0.37396, 0.52354, 0.73296, 1.0261, 1.4366, 2.0112, and 2.8157.]

Figure 6.14: Kernel density plots for various hj values for body fat variable X10.

[Figure 6.15 panels: kernel density estimates f(x) of X11 for bandwidths h = 0.38622, 0.5407, 0.75699, 1.0598, 1.4837, 2.0772, 2.908, 4.0713, and 5.6998.]

Figure 6.15: Kernel density plots for various hj values for body fat variable X11.

[Figure 6.16 panels: kernel density estimates f(x) of X12 for bandwidths h = 0.18398, 0.25757, 0.3606, 0.50484, 0.70678, 0.98949, 1.3853, 1.9394, and 2.7152.]

Figure 6.16: Kernel density plots for various hj values for body fat variable X12.

[Figure 6.17 panels: kernel density estimates f(x) of X13 for bandwidths h = 0.081769, 0.11448, 0.16027, 0.22437, 0.31412, 0.43977, 0.61568, 0.86196, and 1.2067.]

Figure 6.17: Kernel density plots for various hj values for body fat variable X13.

[Figure 6.18 panels: kernel density estimates f(x) of Y for bandwidths h = 0.76414, 1.0698, 1.4977, 2.0968, 2.9355, 4.1097, 5.7536, 8.0551, and 11.2771.]

Figure 6.18: Kernel density plots for various hj values for body fat variable Y .

The operational parameters shown in Table 6.1 were again used for the GA with 100 replications. Figure 6.19 is a three-dimensional plot showing the fitness landscape of AIC scores from the Y = Xkβ model with the number of generations and populations output from the M 3 toolbox. Figure 6.19 shows the AIC scores generally decrease as the generations increase.

Figure 6.20 is a scatter plot of the minimum and average AIC scores for each generation in the GA from the Y = Xkβ model. The GA converges to the minimum AIC at generation 44. The average AIC score decreases quickly and stabilizes at generation 14.

The GA model was replicated 100 times. The results are summarized in Table 6.4.

An explicit enumeration of all possible subset models was performed, and the top 10 AIC models are shown in Table 6.4. The optimal bandwidth h was calculated from Eqn. 3.55 for each model using the Gaussian kernel. The GA correctly identified the best model (0,3,4,6,13) with the smallest AIC score of 1477.28 100 times out of 100 replications. Applying kernel regression to the X matrix variables one at a time for the Y = Xkβ model resulted in higher AIC scores than the Y = Xβ model where no kernel regression was applied (Table 6.2). The modeling took 3 hours to complete, compared to 3.5 minutes for the Y = Xβ model as shown in Table 6.2.

The GA model was replicated 100 times for ICOMP(IFIM). The results are summarized in Table 6.5. The optimal bandwidth h was calculated from Eqn. 3.55 for each model using the Gaussian kernel. An explicit enumeration of all possible subset models was performed and the top 10 ICOMP(IFIM) models are shown in Table 6.5. The GA correctly identified the model (0,3,4,6) with the smallest ICOMP(IFIM) score of 1497.56 89 times out of 100 replications. The ICOMP(IFIM) scores from the models in Table 6.5 for the Y = Xkβ model are higher than the ICOMP(IFIM) scores when no kernel regression smoothing was applied to the data (Table 6.3). All 100 replications identified four of the top five models. The modeling took 2.2 hours to complete, compared to 1.6 minutes for the Y = Xβ model as shown in Table 6.3.


Figure 6.19: Three dimensional surface plot of AIC scores for body fat from the Y = Xkβ model.


Figure 6.20: Scatter plot of AIC scores for body fat for each generation from the Y = Xkβ model.

Table 6.4: AIC results for body fat data from the Y = Xkβ model.

Model              AIC      Frequency  h
{0,3,4,6,13}       1477.28  100        *
{0,3,4,6,10,13}    1478.42             *
{0,3,4,5,6,13}     1478.70             *
{0,1,3,4,6,13}     1478.78             *
{0,3,4,6,12,13}    1478.89             *
{0,3,4,6,11,13}    1479.04             *
{0,2,3,4,6,13}     1479.15             *
{0,3,4,6,7,13}     1479.26             *
{0,3,4,6,9,13}     1479.28             *
{0,3,4,6,8,13}     1479.28             *
Modeling completed in 3.0 hours using 100 replications.

*Optimal bandwidth h values for X1, X2, …, X13 individually are respectively: [2.255, 19.625, 4.023, 2.191, 4.086, 5.295, 4.130, 3.284, 1.099, 0.733, 1.484, 0.707, 0.314]

Table 6.5: ICOMP(IFIM) results for body fat data from the Y = Xkβ model.

Model             ICOMP(IFIM)  Frequency  h
{0,3,4,6}         1497.56      89         *
{0,3,4,6,10}      1498.13      2          *
{0,3,4,5,6}       1499.94                 *
{0,3,4,6,9}       1500.59      3          *
{0,3,6,13}        1500.60      6          *
{0,3,4,6,11}      1500.63                 *
{0,1,3,4,6}       1500.65                 *
{0,3,4,6,10,11}   1501.12                 *
{0,3,4,5,6,10}    1501.15                 *
{0,1,3,4,6,10}    1501.38                 *
Modeling completed in 2.2 hours using 100 replications.

*Optimal bandwidth h values for X1, X2, …, X13 individually are respectively: [2.255, 19.625, 4.023, 2.191, 4.086, 5.295, 4.130, 3.284, 1.099, 0.733, 1.484, 0.707, 0.314]

Figure 6.21 is a three-dimensional plot showing the fitness landscape of ICOMP(IFIM) scores with the number of generations and populations output from the modified M 3 toolbox for the Y = Xkβ model. Figure 6.21 shows the ICOMP(IFIM) scores generally decrease as the generations increase.

Figure 6.22 is a scatter plot of the minimum and average ICOMP(IFIM) scores for each generation in the GA from the Y = Xkβ model. The GA converges to the minimum ICOMP(IFIM) at generation 19 and remains constant until it terminates early at generation 52, since the GA found the optimal solution. The average ICOMP(IFIM) score decreases quickly and stabilizes at generation 20.

6.3.3 Results from the Y = Xmkβ Model

The modified M 3 toolbox was then run with kernel regression applied to the X matrix of independent variables together for the Y = Xmkβ model. The operational parameters shown in Table 6.1 were again used for the GA with 100 replications. Since the Y = Xmkβ model computes the optimal bandwidth h using all independent variables together, the computation time is significantly increased compared to the Y = Xkβ model. Figure 6.23 is a three-dimensional plot showing the fitness landscape of AIC scores from the Y = Xmkβ model with the number of generations and populations output from the M 3 toolbox.
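
One hedged way to picture why a single multivariate bandwidth is more expensive is a product Gaussian kernel smoother over all selected predictors at once, sketched below in MATLAB; this is an illustration under my own assumptions, not the Eqn. 3.54 implementation.

function yhat = nw_smooth_multi(X, y, h)
% Illustrative multivariate Nadaraya-Watson smoother with a product Gaussian
% kernel and one common bandwidth h: every fitted value requires kernel
% weights against all n observations in all selected columns of X.
n = size(X, 1);
yhat = zeros(n, 1);
for i = 1:n
    d2 = sum((bsxfun(@minus, X, X(i, :)) / h).^2, 2);  % squared scaled distances
    w  = exp(-0.5 * d2);                                % product Gaussian weights
    yhat(i) = sum(w .* y) / sum(w);
end
end

Because the bandwidth must be re-optimized for every candidate subset of columns, the cost grows quickly with the number of models the GA evaluates, which is consistent with the run times reported below.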

Figure 6.24 is a scatter plot of the minimum and average AIC scores for each generation in the GA from the Y = Xmkβ model. The GA converges to the minimum AIC at generation 10 and continues through the remaining generations. The average AIC score decreases quickly and stabilizes at generation 15.

The GA model was replicated 100 times for AIC. The results are summarized in Table 6.6. An explicit enumeration of all possible subset models was performed and the top 10 AIC models are shown in Table 6.6. The optimal bandwidth h was calculated from Eqn. 3.54 for each model using the Gaussian kernel. The GA identified the model (0,2,4,5,6,13) with the smallest AIC score of 1490.5 95 times out of 100 replications.


Figure 6.21: Three dimensional surface plot of ICOMP(IFIM) scores for body fat from the Y = Xkβ model.


Figure 6.22: Scatter plot of ICOMP(IFIM) scores for body fat for each generation from the Y = Xkβ model.


Figure 6.23: Three dimensional surface plot of AIC scores for body fat from the Y = Xmkβ model.


Figure 6.24: Scatter plot of AIC scores for body fat for each generation from the Y = Xmkβ model.

Table 6.6: AIC results for body fat data from the Y = Xmkβ model.

Model                       AIC      Frequency  h
{0,1,2,4,5,6,8,12,13}       387.93              3.987
{0,1,2,3,4,6,8,12,13}       388.37              4.377
{0,1,2,4,5,6,8,10,12,13}    389.18              5.028
{0,1,2,4,5,6,7,8,12,13}     389.22              4.237
{0,1,2,3,4,6,7,8,12,13}     389.41              4.629
{0,1,2,4,5,6,8,9,12,13}     389.53              4.815
{0,1,2,3,4,6,8,10,12,13}    389.66              5.415
{0,1,2,4,5,6,8,11,12,13}    389.88              4.682
{0,1,2,3,4,5,6,8,12,13}     389.93              4.572
{0,1,2,3,4,6,8,9,12,13}     390.06              5.202
{0,2,4,5,6,13}              1490.5   95         3.211
{0,2,5,6,13}                1490.8   3          1.747
{0,2,4,5,6,9,13}            1491.4   1          3.870
{0,2,5,6,9,11,13}           1492.7   1          2.279
Modeling completed in 206.5 hours using 100 replications.

Applying kernel regression to the X matrix variables together for the Y = Xmkβ model resulted in higher AIC scores than the Y = Xβ model where no kernel regression was applied, as seen in Table 6.2. None of the 100 replications identified a model within the top 10 models. The modeling took 206.5 hours to complete compared to 3.5 minutes for the Y = Xβ model as shown in Table 6.2.

Figure 6.25 is a three-dimensional plot showing the fitness landscape of ICOMP(IFIM) scores with the number of generations and populations output from the modified M 3 toolbox for the Y = Xmkβ model.

Figure 6.26 is a scatter plot of the minimum and average ICOMP(IFIM) scores for each generation in the GA from the Y = Xmkβ model. The GA first converges to the minimum ICOMP(IFIM) at generation 15 and continues through the remaining generations. The average ICOMP(IFIM) score decreases quickly and stabilizes at generation 23.

The GA model was replicated 100 times for ICOMP(IFIM). The results are summarized in Table 6.7. An explicit enumeration of all possible subset models was performed and the top 10 ICOMP(IFIM) models are shown in Table 6.7. The optimal bandwidth h was calculated from Eqn. 3.54 for each model using the Gaussian kernel. The GA identified the model (0,2,5,6,13) with the smallest ICOMP(IFIM) score of 1525.8 78 times out of 100 replications. The model (0,5,6,7,13) was the second most frequently identified model, 19 times out of 100 replications, with an ICOMP(IFIM) score of 1552.6. None of the models identified by the GA were among the top 10 best models. The ICOMP(IFIM) scores from the models identified by the GA in Table 6.7 from the Y = Xmkβ model are larger than the ICOMP(IFIM) scores when no kernel regression smoothing was applied to the data, as seen in Table 6.3. The modeling took 107.5 hours to complete compared to 1.6 minutes for the Y = Xβ model as shown in Table 6.3.


Figure 6.25: Three dimensional surface plot of ICOMP(IFIM) scores for body fat from the Y = Xmkβ model.


Figure 6.26: Scatter plot of ICOMP(IFIM) scores for body fat for each generation from the Y = Xmkβ model.

Table 6.7: ICOMP(IFIM) results for body fat data from the Y = Xmkβ model.

Model                             ICOMP(IFIM)  Frequency  h
{1,2,3,4,5,6,7,8,9,10,11,12,13}   380.68                  7.151
{1,2,3,4,6,7,8,9,10,11,12,13}     381.46                  8.337
{1,2,3,4,5,6,7,8,9,11,12,13}      381.48                  7.494
{1,2,3,4,5,6,7,8,10,11,12,13}     381.74                  7.720
{1,2,3,4,5,6,7,8,9,10,12,13}      381.81                  7.860
{1,2,3,4,6,7,8,9,11,12,13}        382.26                  8.680
{1,2,3,4,5,6,7,8,11,12,13}        382.41                  8.063
{1,2,3,4,6,7,8,10,11,12,13}       382.54                  8.905
{1,2,3,4,6,7,8,9,10,12,13}        382.55                  9.044
{1,2,3,4,5,6,7,8,9,12,13}         382.61                  8.203
{0,2,5,6,13}                      1525.8       78         1.747
{0,2,5,6,10,13}                   1539.9       1          1.902
{0,2,4,5,6}                       1542.9       1          1.741
{0,2,4,5,6,10,13}                 1548.2       1          3.287
{0,5,6,7,13}                      1552.6       19         2.040
Modeling completed in 107.5 hours using 100 replications.

6.3.4 Results from the Yk = Xβ Model

The modified M 3 toolbox was run for the Yk = Xβ model, where kernel regression was applied to the Y dependent variable only. The operational parameters shown in Table 6.1 were again used for the GA with 100 replications. Figure 6.27 is a three-dimensional plot showing the fitness landscape of AIC scores from the Yk = Xβ model with the number of generations and populations output from the modified M 3 toolbox.

Figure 6.28 is a scatter plot of the minimum and average AIC scores for each generation in the GA from the Yk = Xβ model. The GA converges to the minimum AIC at generation 25 and continues through the remaining generations. The average AIC score decreases quickly and stabilizes at generation 28.

The GA model was replicated 100 times for AIC. The results are summarized in Table 6.8. An explicit enumeration of all possible subset models was performed and the top 10 AIC models are shown in Table 6.8. The optimal bandwidth h for Y was calculated from Eqn. 3.55 for each model using the Gaussian kernel. The optimal bandwidth is h = 2.936 for each model, since Y does not change as the X matrix changes. The GA correctly identified the model (0,1,2,4,6,8,12,13) with the smallest AIC score of 386.38 98 times out of 100 replications. This model was also chosen as the second best AIC model from the Y = Xβ model in Table 6.2. Applying the kernel regression to the Y variable alone, and not on the X matrix of independent variables, resulted in significantly lower AIC scores. The AIC score for model (0,1,2,4,6,8,12,13) was reduced from 1457.05 for the Y = Xβ model in Table 6.2 to 386.38 for the Yk = Xβ model in Table 6.8. All 100 replications identified three of the top five models.

Figure 6.29 is a three-dimensional plot showing the fitness landscape of ICOMP(IFIM) scores with the number of generations and populations output from the modified M 3 toolbox for the Yk = Xβ model.

Figure 6.30 is a scatter plot of the minimum and average ICOMP(IFIM) scores for each generation in the GA from the Yk = Xβ model. The GA converges to the minimum ICOMP(IFIM) at generation 13 and remains constant through the remaining generations.


Figure 6.27: Three dimensional surface plot of AIC scores for body fat from the Yk = Xβ model.


Figure 6.28: Scatter plot of AIC scores for body fat for each generation from the Yk = Xβ model.

Table 6.8: AIC results for body fat data from the Yk = Xβ model.

Model                       AIC     Frequency  h
{0,1,2,4,6,8,12,13}         386.38  98         2.936
{0,1,2,4,6,7,8,12,13}       387.51  1          2.936
{0,1,2,4,6,8,10,12,13}      387.67             2.936
{0,1,2,4,5,6,8,12,13}       387.93             2.936
{0,1,2,4,6,8,9,12,13}       388.09  1          2.936
{0,1,2,6,8,12,13}           388.27             2.936
{0,1,2,4,6,8,11,12,13}      388.30             2.936
{0,1,2,3,4,6,8,12,13}       388.37             2.936
{0,1,4,6,7,8,12,13}         388.50             2.936
{0,1,2,4,6,7,8,10,12,13}    388.84             2.936
Modeling completed in 32.3 minutes using 100 replications.


Figure 6.29: Three dimensional surface plot of ICOMP(IFIM) scores for body fat from the Yk = Xβ model.


Figure 6.30: Scatter plot of ICOMP(IFIM) scores for body fat for each generation from the Yk = Xβ model.

The GA model was replicated 100 times for ICOMP(IFIM). The results are summarized in Table 6.9. An explicit enumeration of all possible subset models was performed and the top 10 ICOMP(IFIM) models are shown in Table 6.9. The model (0,1,4,6,12,13) was identified with an ICOMP(IFIM) score of 410.58 74 times out of 100 replications. This ICOMP(IFIM) score is approximately 72% lower than the ICOMP(IFIM) scores of models using no kernel regression on any variables for the Y = Xβ model from Table 6.3. The model (0,2,6) was the second most frequently selected model, 20 times out of 100 replications, with an ICOMP(IFIM) score of 410.83. None of the models identified by the GA are among the top 10 models. Table 6.9 shows the top 10 models all exclude the intercept term; the GA identified the best models that include an intercept term.

6.3.5 Results from the Yk = Xkβ Model

The modified M 3 toolbox was run with the Yk = Xkβ model, where kernel regression is applied to the Y dependent variable and to the X matrix of independent variables one variable at a time. The operational parameters shown in Table 6.1 were used for the GA with 100 replications. Figure 6.31 is a three-dimensional plot showing the fitness landscape of AIC scores from the Yk = Xkβ model with the number of generations and populations output from the modified M 3 toolbox.

Figure 6.32 is a scatter plot of the minimum and average AIC scores for each generation in the GA from the Yk = Xkβ model. The GA converges to the minimum AIC at generation 25 and stays constant through generation 60. The average AIC score decreases quickly and stabilizes at generation 13.

The GA model was replicated 100 times for AIC. The results are summarized in Table 6.10. An explicit enumeration of all possible subset models was performed and the top 10 AIC models are shown in Table 6.10. The optimal bandwidths h for Y and X were calculated from Eqn. 3.55 for each model using the Gaussian kernel. The GA identified the model (0,3,4,6,10,12) with an AIC score of 407.68 77 times out of 100 replications.

Table 6.9: ICOMP(IFIM) results for body fat data from the Yk = Xβ model.

Model                             ICOMP(IFIM)  Frequency  h
{1,2,3,4,5,6,7,8,9,10,11,12,13}   380.68                  2.936
{1,3,4,5,6,7,8,9,10,11,12,13}     381.16                  2.936
{1,2,3,4,6,7,8,9,10,11,12,13}     381.46                  2.936
{1,2,3,4,5,6,7,8,9,11,12,13}      381.48                  2.936
{1,2,3,4,5,6,7,8,10,11,12,13}     381.74                  2.936
{1,3,4,6,7,8,9,10,11,12,13}       381.75                  2.936
{1,2,3,4,5,6,7,8,9,10,12,13}      381.81                  2.936
{1,3,4,5,6,7,8,9,11,12,13}        382.05                  2.936
{1,3,4,5,6,7,8,10,11,12,13}       382.26                  2.936
{1,2,3,4,6,7,8,9,11,12,13}        382.26                  2.936
{0,1,4,6,12,13}                   410.58       74         2.936
{0,2,6}                           410.83       20         2.936
{0,1,4,6,8,12,13}                 411.23       1          2.936
{0,1,4,6,7,8,12,13}               411.66       1          2.936
{0,2,6,12,13}                     414.92       1          2.936
{0,4,6,7,12,13}                   414.97       3          2.936
Modeling completed in 34.7 minutes using 100 replications.


Figure 6.31: Three dimensional surface plot of AIC scores for body fat from the Yk = Xkβ model.


Figure 6.32: Scatter plot of AIC scores for body fat for each generation from the Yk = Xkβ model.

Table 6.10: AIC results for body fat data from the Yk = Xkβ model.

Model                       AIC     Frequency  h for Y  h for X
{0,1,2,4,5,6,8,12,13}       387.93             2.936    *
{0,1,2,3,4,6,8,12,13}       388.37             2.936    *
{0,1,2,4,5,6,8,10,12,13}    389.18             2.936    *
{0,1,2,4,5,6,7,8,12,13}     389.22             2.936    *
{0,1,2,3,4,6,7,8,12,13}     389.41             2.936    *
{0,1,2,4,5,6,8,9,12,13}     389.53             2.936    *
{0,1,2,3,4,6,8,10,12,13}    389.66             2.936    *
{0,1,2,4,5,6,8,11,12,13}    389.88             2.936    *
{0,1,2,3,4,5,6,8,12,13}     389.93             2.936    *
{0,1,2,3,4,6,8,9,12,13}     390.06             2.936    *
{0,3,4,6,10,12}             407.68  77         2.936    *
{0,3,4,6,10,12,13}          408.17  5          2.936    *
{0,3,4,6,7,8,10,12}         408.41  15         2.936    *
{0,3,4,6,12,13}             408.57  1          2.936    *
{0,3,4,6,7,9,10,12,13}      408.61  2          2.936    *
Modeling completed in 4.2 hours using 100 replications.

*Optimal bandwidth h values for X1, X2, …, X13 individually are respectively: [2.255, 19.625, 4.023, 2.191, 4.086, 5.295, 4.130, 3.284, 1.099, 0.733, 1.484, 0.707, 0.314]

Applying the kernel regression to the Y variable and also to the X matrix of independent variables one variable at a time resulted in approximately 72% lower AIC scores compared to the Y = Xβ model in Table 6.2. The model (0,3,4,6,7,8,10,12) was the second most frequently identified model, with an AIC score of 408.41, 15 times out of 100 replications. None of the 100 replications identified a model within the top 10 models. The modeling took 4.2 hours to complete compared to 32.3 minutes for the Yk = Xβ model in Table 6.8.

Figure 6.33 is a three-dimensional plot showing the fitness landscape of ICOMP(IFIM) scores with the number of generations and populations output from the modified M 3 toolbox for the Yk = Xkβ model.

Figure 6.34 is a scatter plot of the minimum and average ICOMP(IFIM) scores for each generation in the GA from the Yk = Xkβ model. The GA converges to the minimum ICOMP(IFIM) at generation 27 and is constant through generation 60. The average ICOMP(IFIM) score decreases quickly and stabilizes at generation 14.

The GA model was replicated 100 times for ICOMP(IFIM). The results are summarized in Table 6.11. An explicit enumeration of all possible subset models was performed and the top 10 ICOMP(IFIM) models are shown in Table 6.11. The optimal bandwidths h for Y and X were calculated from Eqn. 3.55 for each model using the Gaussian kernel. The model (0,1,3,4,5,6,7,8,9,10,11,12) was identified with an ICOMP(IFIM) score of 388.26 96 times out of 100 replications. This ICOMP(IFIM) score is approximately 76% lower than the ICOMP(IFIM) scores of the Y = Xβ model as seen in Table 6.3. None of the models identified by the GA are among the top 10 models. The top 10 models all exclude the intercept term. The GA correctly identified the top two models that include the intercept term. The modeling took 7.3 hours to complete compared to 34.7 minutes for the Yk = Xβ model in Table 6.9.


Figure 6.33: Three dimensional surface plot of ICOMP(IFIM) scores for body fat from the Yk = Xkβ model.


Figure 6.34: Scatter plot of ICOMP(IFIM) scores for body fat for each generation from the Yk = Xkβ model.

Table 6.11: ICOMP(IFIM) results for body fat data from the Yk = Xkβ model.

Model                             ICOMP(IFIM)  Frequency  h for Y  h for X
{1,2,3,4,5,6,7,8,9,10,11,12,13}   380.68                  2.936    *
{1,2,3,4,6,7,8,9,10,11,12,13}     381.46                  2.936    *
{1,2,3,4,5,6,7,8,9,11,12,13}      381.48                  2.936    *
{1,2,3,4,5,6,7,8,10,11,12,13}     381.74                  2.936    *
{1,2,3,4,5,6,7,8,9,10,12,13}      381.81                  2.936    *
{1,2,3,4,6,7,8,9,11,12,13}        382.26                  2.936    *
{1,2,3,4,5,6,7,8,11,12,13}        382.41                  2.936    *
{1,2,3,4,6,7,8,10,11,12,13}       382.54                  2.936    *
{1,2,3,4,6,7,8,9,10,12,13}        382.55                  2.936    *
{1,2,3,4,5,6,7,8,9,12,13}         382.61                  2.936    *
{0,1,3,4,5,6,7,8,9,10,11,12}      388.26       96         2.936    *
{0,1,3,4,5,6,7,8,9,10,11}         388.42       4          2.936    *
Modeling completed in 7.3 hours using 100 replications.

*Optimal bandwidth h values for X1, X2, …, X13 individually are respectively: [2.255, 19.625, 4.023, 2.191, 4.086, 5.295, 4.130, 3.284, 1.099, 0.733, 1.484, 0.707, 0.314]

6.3.6 Results from the Yk = Xmkβ Model

The modified M 3 toolbox was run with the Yk = Xmkβ model, where kernel regression was applied to the Y dependent variable and to the X matrix of independent variables all together. The operational parameters shown in Table 6.1 were used for the GA with 100 replications. Since the Yk = Xmkβ model computes the optimal bandwidth h using all independent variables together, the computation time is significantly increased compared to the Yk = Xkβ model. Figure 6.35 is a three-dimensional plot showing the fitness landscape of AIC scores from the Yk = Xmkβ model with the number of generations and populations output from the modified M 3 toolbox.

Figure 6.36 is a scatter plot of the minimum and average AIC scores for each generation in the GA from the Yk = Xmkβ model. The GA converges to the minimum AIC at generation 25 and stays constant through generation 60. The average AIC score decreases and stabilizes at generation 29.

The GA model was replicated 100 times for AIC. The results are summarized in Table 6.12. An explicit enumeration of all possible subset models was performed and the top 10 AIC models are shown in Table 6.12. The optimal bandwidth h for X was calculated from Eqn. 3.54 for each model using the Gaussian kernel, while the optimal bandwidth h for Y was calculated from Eqn. 3.55. The GA identified the model (0,2,4,5,6,9,13) with the smallest AIC score of 424.24 57 times out of 100 replications. Applying the kernel regression to the Y variable and also to the X matrix of independent variables together resulted in lower AIC scores compared to the no kernel regression model Y = Xβ in Table 6.2. The model (0,2,5,6,10,13) was the second most frequently identified model, with an AIC score of 424.61, 43 times out of 100 replications. The GA identified the top two models with an intercept term and two of the top three models overall. The modeling took 184 hours to complete compared to 32.3 minutes for the Yk = Xβ model in Table 6.8.

Figure 6.37 is a three-dimensional plot showing the fitness landscape of ICOMP(IFIM) scores with the number of generations and populations output from the modified M 3 toolbox for the Yk = Xmkβ model.


Figure 6.35: Three dimensional surface plot of AIC scores for body fat from the Yk = Xmkβ model.


Figure 6.36: Scatter plot of AIC scores for body fat for each generation from the Yk = Xmkβ model.

Table 6.12: AIC results for body fat data from the Yk = Xmkβ model.

Model                  AIC     Frequency  h for Y  h for X
{2,4,5,6,9,13}         423.20             2.936    3.696
{0,2,4,5,6,9,13}       424.24  57         2.936    3.870
{0,2,5,6,10,13}        424.61  43         2.936    1.902
{0,2,5,6,13}           425.83             2.936    1.747
{0,2,4,5,6,13}         425.95             2.936    3.211
{0,2,4,5,6,11,13}      427.00             2.936    2.270
{0,2,5,6,11,13}        427.58             2.936    3.293
{0,2,5,6,9,13}         427.83             2.936    3.202
{2,4,5,6,9,10,13}      428.28             2.936    3.600
{0,2,4,5,6,10,11,13}   428.49             2.936    2.045
Modeling completed in 184 hours using 100 replications.


Figure 6.37: Three dimensional surface plot of ICOMP(IFIM) scores for body fat from the Yk = Xmkβ model.

Figure 6.38 is a scatter plot of the minimum and average ICOMP(IFIM) scores for each generation in the GA from the Yk = Xmkβ model. The GA converges to the minimum ICOMP(IFIM) at generation 29 and is constant through generation 60. The average ICOMP(IFIM) score decreases quickly and stabilizes at generation 33.

The GA was replicated 100 times for ICOMP(IFIM). The results are summarized in Table 6.13. An explicit enumeration of all possible subset models was performed and the top 10 ICOMP(IFIM) models are shown in Table 6.13. The GA identified the model (0,2,4,5,6,13) with the smallest ICOMP(IFIM) score of 429.57 58 times out of 100 replications. This ICOMP(IFIM) score is much lower than the ICOMP(IFIM) scores using no kernel regression on any variables for the Y = Xβ model as seen in Table 6.3. The model (0,2,5,6,11,13) was the second most frequently selected model, 29 times out of 100 replications, with an ICOMP(IFIM) score of 431.17. Table 6.13 shows the ICOMP(IFIM) scores from the Yk = Xmkβ models are higher than the ICOMP(IFIM) scores from the Yk = Xβ models in Table 6.9 and the Yk = Xkβ models in Table 6.11. The GA identified 97 of the 100 replications within the top 10 models. However, an explicit enumeration of all possible models shows all the models the GA did identify are the best ICOMP(IFIM) models with an intercept term. The modeling took 161 hours to complete compared to 34.7 minutes for the Yk = Xβ model in Table 6.9.

6.3.7 Conclusions from Body Fat Data

The best AIC and ICOMP(IFIM) models on the body fat data from Tables 6.2 through 6.13, along with their AIC and ICOMP(IFIM) scores, are summarized in Table 6.14. Table 6.14 shows both AIC and ICOMP(IFIM) scores increased for the Y = Xkβ model compared to the baseline Y = Xβ model with no kernel regression. The AIC and ICOMP(IFIM) scores also increased for the Y = Xmkβ model compared to the baseline Y = Xβ model with no kernel regression. These results agree with Chapter 4, where applying kernel regression on the X matrix alone tends to increase both AIC and ICOMP(IFIM), indicating poorer models compared to the baseline Y = Xβ model with no kernel regression.


Figure 6.38: Scatter plot of ICOMP(IFIM) scores for body fat for each generation from the Yk = Xmkβ model.

Table 6.13: ICOMP(IFIM) results for body fat data from the Yk = Xmkβ model.

Model                  ICOMP(IFIM)  Frequency  h for Y  h for X
{2,4,5,6,9,10,13}      418.67                  2.936    3.600
{2,4,5,6,9,13}         422.76                  2.936    3.696
{2,4,5,6,9,10,12,13}   427.12                  2.936    3.778
{2,4,5,6,9,12,13}      427.46                  2.936    4.069
{0,2,4,5,6,13}         429.57       58         2.936    3.211
{2,4,5,6,9,10}         429.58                  2.936    2.917
{2,4,5,6,9,10,11,13}   430.24                  2.936    3.559
{2,4,5,6,9}            430.38                  2.936    2.249
{0,2,5,6,11,13}        431.17       29         2.936    3.293
{0,2,5,6,9,13}         431.34       10         2.936    3.202
{0,2,5,6,10,13}        432.21       3          2.936    1.990
Modeling completed in 161 hours using 100 replications.

Table 6.14: Best AIC and ICOMP(IFIM) models for the body fat data.

Kernel Model   Best AIC Model from GA   AIC      Best ICOMP(IFIM) Model from GA   ICOMP(IFIM)
Y = Xβ         {0,1,2,4,6,7,8,12,13}    1457.0   {4,6,7,12,13}                    1495.2
Y = Xkβ        {0,3,4,6,13}             1477.3   {0,3,4,6}                        1497.6
Y = Xmkβ       {0,2,4,5,6,13}           1490.5   {0,2,5,6,13}                     1525.8
Yk = Xβ        {0,1,2,4,6,8,12,13}      386.38   {0,1,4,6,12,13}                  410.58
Yk = Xkβ       {0,3,4,6,10,12}          407.68   {0,1,3,4,5,6,7,8,9,10,11,12}     388.26
Yk = Xmkβ      {0,2,4,5,6,9,13}         424.24   {0,2,4,5,6,13}                   429.57

However, applying kernel regression on the Y variable decreases both ICOMP(IFIM) and AIC scores, whether kernel regression is also applied to the X matrix or not. The AIC and ICOMP(IFIM) scores for the Yk = Xmkβ model were higher than for both the Yk = Xβ and Yk = Xkβ models. The GA correctly identified the best ICOMP(IFIM) model for the Yk = Xkβ kernel model and the best AIC model for the Yk = Xβ kernel model. The GA tends to identify the best models with an intercept term even when other models without an intercept term may have lower AIC or ICOMP(IFIM) scores.

The GA identified the global optimal solution for the body fat data as confirmed by Bao [Bao, et. al, 2005] (Table 6) using AIC. The models in Table 6.14 also confirm results presented by Bozdogan [Bozdogan, 2004].

6.4 Conclusions

The genetic algorithm (GA) was introduced as a data mining tool for the optimization of information criteria for model selection of multivariate data sets. The GA can locate a global optimum solution on very rugged fitness landscapes through its mutation and crossover properties.

A real body fat data set was used to demonstrate the integration of kernel regression with the GA. Kernel regression was applied to the Y dependent variable, the X matrix of independent variables, or both. AIC and ICOMP(IFIM) results were computed using the modified M 3 information complexity toolbox of MATLAB programs developed by Howe and Bozdogan [Howe, 2009]. The body fat data confirmed the results of Chapter 4 that applying the kernel regression on the Y variable dramatically decreases both the AIC and ICOMP(IFIM) scores compared to the baseline model using no kernel regression. However, applying kernel regression to the X matrix alone increases the AIC and ICOMP(IFIM) scores. The optimal bandwidths h were also calculated for each of the models identified by the modified M 3 toolbox using Eqns. 3.54 and 3.55.

The GA identified the global optimal solution for an example real data set in the literature [Bao, et. al, 2005] that used the exact implicit enumeration algorithm with AIC.

Chapter 7

Kernel Regression with Implicit Enumeration Algorithm

7.1 Introduction

Chapter 6 showed results from integrating kernel regression smoothing with the genetic algorithm on a real data set. Chapter 7 will present an application of kernel regression smoothing with the implicit enumeration algorithm. The implicit enumeration algorithm has been shown to dramatically reduce the number of models to be evaluated using information criteria to identify the global optimum solution. The results of integrating kernel regression with the genetic algorithm and implicit enumeration will be compared.

7.2 Implicit Enumeration Algorithm

Bao, et. al [Bao, et. al, 2005] developed an implicit enumeration (IE) algorithm that is guaranteed to find the global optimum solution for ordinary least squares multiple regression using AIC. The subset of variables that minimizes AIC is identified and global optimality is proven with a small number of iterations without exhaustively evaluating every subset model. Bao showed that in addition to the best model, IE can generate all models having information criteria values close to the minimum in order to estimate the uncertainty in model selection. The methodology of the IE used by Bao is summarized and applied on the same body fat data used in Chapter 6.

Bao assumed that the information criterion being used was of the form

z(w) = f(w) + g(w),    (7.1)

where f(w) is a term that measures bias in estimation of model parameters and g(w) is a measure of variance in the estimation of parameters. The first term, f(w), is considered a measure of lack of fit, while g(w) is a "penalty function" associated with the number of parameters that must be estimated within the model. If another independent variable is added to the set defined by a given w, the result would be a decrease in f(w) and an increase in g(w). That is, f(w) and g(w) must follow the monotonic conditions

f(w1) ≤ f(w2)    (7.2)

and

g(w1) ≥ g(w2)    (7.3)

if w1 ≥ w2.

The "best" solution is then defined as the solution among the 2^p possible solutions that minimizes z(w), where p is the number of independent variables comprising the X matrix.

Information criteria described in Chapter 2 operationalize z(w) in Eqn. 7.1. Specifically, AIC from Eqn. 2.1 and ICOMP(IFIM) from Eqn. 2.9 satisfy the monotonicity requirements for z(w) in Eqn. 7.1.
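
As a hedged illustration, the MATLAB sketch below splits a Gaussian-regression AIC into a lack-of-fit term f(w) and a penalty term g(w) for a 0/1 inclusion vector w over the columns of X; the exact definitions used in the dissertation are those of Eqns. 2.1 and 7.1, and the names here are hypothetical.

function [z, f, g] = z_decompose_aic(y, X, w)
% Illustrative decomposition z(w) = f(w) + g(w) for an AIC-type criterion.
% w is a 0/1 vector over the columns of X (X assumed to already include an
% intercept column if one is wanted).
Xs   = X(:, w == 1);
n    = numel(y);
beta = Xs \ y;
sig2 = sum((y - Xs*beta).^2) / n;
f = n*log(2*pi) + n*log(sig2) + n;   % lack of fit: decreases as variables are added
g = 2 * (sum(w) + 1);                % penalty: increases with the number of parameters
z = f + g;                           % Eqn. 7.1
end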

IE constructs an enumeration tree by identifying and eliminating branches of the tree that do not contain the optimal solution w∗, where w∗ minimizes z(w). Let z∗ = z(w∗). IE uses partial solutions to discard those that cannot lead to the optimal solution. Partial solutions are bounded from above with an upper bound that is used to determine whether fathoming is required. A partial solution is fathomed when no completion can be optimal or the best feasible solution has been found. For example, if the lower bound on a given partial solution is greater than or equal to the upper bound on another partial solution already examined, then the partial solution cannot be optimal, so it is unnecessary to further examine that partial solution.

Bao [Bao, et. al, 2005] discusses in further detail the seven steps of the IE algorithm:

1. Initialization: Set S = ∅, w^S = ∅, and z̄∗ = ∞. Let w = {0, 0, ..., 0} represent the incumbent solution.

2. Computing an upper bound: Compute z(w) for selected completions of (S, w^S) and set z̄(S, w^S) equal to the smallest of these. Go to Step 3.

3. Updating: If z̄(S, w^S) < z̄∗, a solution better than the incumbent has been found. Replace the incumbent solution with the best feasible completion of (S, w^S), and set z̄∗ = z̄(S, w^S).

4. Computing a lower bound: Compute the lower bound z(S, w^S).

5. Fathoming: (S, w^S) is fathomed if z(S, w^S) > z̄∗, indicating that no completion of (S, w^S) can be optimal, or if z(S, w^S) = z̄∗, indicating that the best feasible completion of (S, w^S) has been found. If (S, w^S) is fathomed, go to Step 7.

6. Branching: Add another variable to S and add the corresponding component (0 or 1) to w^S. The methods used to select a variable and to choose the value of the new component of w^S are described in Bao, et. al [Bao, et. al, 2005]. Go to Step 2.

7. Backtracking: If all of the elements of S are underlined, the incumbent represents the optimal solution. Otherwise, choose the rightmost non-underlined element of S, underline it, and change the corresponding element of w^S (from 0 to 1 or from 1 to 0). Then drop all elements of S (and the corresponding components of w^S) to the right of the newly underlined element. Go to Step 2.

Since IE is exhaustive it will either enumerate or eliminate by fathoming every possible solution.

Lower and upper bounds on the optimal solution are calculated using four bounding solutions:

1. core solution z(wc)

2. replete solution z(wr)

3. augmented core solution z(wac)

4. diminished replete solution z(wdr)

The upper bound is then computed as

z̄(S, w^S) = min(z(wc), z(wac), z(wr), z(wdr)).    (7.4)

For monotonic functions f(w) and g(w), Bao [Bao, et. al, 2005] proves that a lower bound is computed as

z(S, w^S) = min(z(wc), z(wac), z(wr), z(wdr), ze).    (7.5)
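
A minimal MATLAB sketch of the bounding and fathoming step, assuming the four bounding objective values and the extra term ze of Eqn. 7.5 have already been computed (hypothetical names, not Bao's code):

function [zbar, zlow, fathomed] = ie_bounds(zc, zac, zr, zdr, ze, zbarStar)
% zc, zac, zr, zdr - objective values of the core, augmented core, replete,
%                    and diminished replete bounding solutions
% ze               - additional lower-bounding term from Eqn. 7.5
% zbarStar         - objective value z-bar* of the current incumbent solution
zbar = min([zc, zac, zr, zdr]);        % upper bound on the partial solution (Eqn. 7.4)
zlow = min([zc, zac, zr, zdr, ze]);    % lower bound on the partial solution (Eqn. 7.5)
fathomed = (zlow >= zbarStar);         % fathom: no completion can improve the incumbent
end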

Bao evaluated three branching strategies: greedy, variable augmentation, and variable deletion. Bao showed the variable augmentation branching strategy (where the free variable is added to the core solution to form the augmented core solution) consistently identified the optimal solution in fewer iterations than either the greedy or variable deletion branching strategies.

Bao also showed that IE consistently outperformed the traditional stepwise regression approach by correctly identifying the optimal solution, while stepwise regression did not.

7.3 Implicit Enumeration with Kernel Regression

The MATLAB programs written by Bao, et al. [Bao, et. al, 2005] implemented IE on ordinary least squares multiple regression using AIC since AIC is calculated as the sum of two monotonic functions that satisfy IE assumptions. The MATLAB programs were modified for the six kernel regression models described in Table 4.1. In addition, the programs were modified to calculate ICOMP(IFIM) as the objective function in addition to AIC.

7.3.1 Results from the Y = Xβ Model

The IE MATLAB programs were applied on the real body fat data described in Section 6.3 and used by Bao [Bao, et. al, 2005]. Table 7.1 summarizes the AIC results from IE for the Y = Xβ model without kernel regression smoothing. Table 7.1 shows the same top 10 AIC models as Table 6.2. IE correctly identified the model (0,1,2,4,6,7,8,12,13) as having the lowest AIC score in only five iterations. IE evaluated 2598 out of the 8191 possible models to arrive at this model. This is the same model most frequently identified in Table 6.2 using the GA.

Table 7.2 summarizes the ICOMP(IFIM) results from IE for the Y = Xβ model without kernel regression smoothing. Table 7.2 shows the same top 10 ICOMP(IFIM) models as Table 6.3. The model (0,6) was identified by IE as having the lowest ICOMP(IFIM) score in one iteration. IE evaluated only 28 out of 8191 possible models to arrive at this solution. The (0,6) model is not among the top 10 models as scored by ICOMP(IFIM), with an ICOMP(IFIM) score of 1534.87.

Bozdogan [Bozdogan, 1990a] showed that the penalty term from Eqn. 2.13 monotonically increases as σ̂² increases when 2qσ̂² > n tr((X'X)^-1). However, Bozdogan [Bozdogan, 1990a] also showed that the penalty term from Eqn. 2.13 monotonically decreases as σ̂² increases when 2qσ̂² < n tr((X'X)^-1). Therefore, the implicit enumeration algorithm has difficulty obtaining the global minimum ICOMP(IFIM) score when the weak monotonicity assumption for σ̂² is violated when 2qσ̂² = n tr((X'X)^-1).
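
The following MATLAB sketch is a hedged diagnostic for checking which side of this condition a fitted subset model falls on; the treatment of q (taken here as the number of columns of X) and the exact penalty in Eqn. 2.13 are assumptions on my part.

function onIncreasingSide = icomp_monotonicity_check(y, X)
% Checks whether 2*q*sigma^2 exceeds n*trace((X'X)^-1), the side on which the
% text states the ICOMP(IFIM) penalty increases monotonically with sigma^2.
[n, q] = size(X);
beta = X \ y;                          % least squares fit of the candidate model
sig2 = sum((y - X*beta).^2) / n;       % ML estimate of sigma^2
lhs  = 2 * q * sig2;
rhs  = n * trace(inv(X' * X));
onIncreasingSide = (lhs > rhs);        % true when the penalty increases with sigma^2
end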

Table 7.1: Top 10 best AIC models for the body fat data using IE for the Y = Xβ model.

Model                       AIC      Number of Iterations  Number of Models Evaluated
{0,1,2,4,6,7,8,12,13}       1457.00  5                     2598
{0,1,2,4,6,8,12,13}         1457.05
{1,2,3,4,6,7,8,12,13}       1457.47
{0,1,2,4,6,8,11,12,13}      1457.62
{1,3,4,6,7,8,12,13}         1457.72
{0,1,2,4,6,7,8,11,12,13}    1457.82
{0,1,2,4,6,11,12,13}        1458.13
{1,4,6,7,8,12,13}           1458.21
{0,1,2,3,4,6,7,8,12,13}     1458.33
{0,1,2,4,6,8,10,12,13}      1458.34

Table 7.2: Top 10 best ICOMP(IFIM) models for the body fat data using IE for the Y = Xβ model.

Model               ICOMP(IFIM)  Number of Iterations  Number of Models Evaluated
{4,6,7,12,13}       1495.17
{4,6,7,13}          1495.85
{6,7,13}            1495.97
{3,4,6,7,12,13}     1497.84
{1,4,6,7,8,12,13}   1498.50
{3,6,7,13}          1498.51
{6,7,12,13}         1498.54
{1,4,6,7,12,13}     1498.55
{4,6,7,11,13}       1498.55
{0,2,6}             1498.65
{0,6}               1534.87      1                     28

7.3.2 Results from the Y = Xkβ Model

Table 7.3 summarizes the results from IE for the Y = Xkβ model with kernel regression smoothing applied to the X matrix variables individually. IE correctly identified the model (0,3,4,6,13) as having the lowest AIC score of 1477.28 in only nine iterations. IE evaluated 2178 out of the 8191 possible models to correctly confirm this model has the global optimal minimum AIC score. This is the same model identified in Table 6.4 using the GA. Table 7.3 shows the same top 10 models with the lowest AIC scores as Table 6.4.

Table 7.4 summarizes the ICOMP(IFIM) results from IE for the Y = Xkβ model. Table 7.4 shows the same top 10 ICOMP(IFIM) models as Table 6.5. The model (0,3,6,7,9) was identified by IE as having the lowest ICOMP(IFIM) score in nine iterations. IE evaluated only 240 out of 8191 possible models to arrive at this solution. The (0,3,6,7,9) model is not among the top 10 models as scored by ICOMP(IFIM), with an ICOMP(IFIM) score of 1511.13. The implicit enumeration algorithm has difficulty obtaining the global minimum ICOMP(IFIM) score when the weak monotonicity assumption for σ̂² is violated when 2qσ̂² = n tr((X'X)^-1), as discussed in Section 7.3.1.

7.3.3 Results from the Y = Xmkβ Model

Table 7.5 summarizes the results from IE for the Y = Xmkβ model with kernel regression smoothing applied to the X matrix variables together. IE identified the model (0,2,4,5,6,13) as having the lowest AIC score of 1490.5 in only eight iterations and evaluated only 238 out of the 8191 possible models. This is the same model identified in Table 6.6 using the GA. Table 7.5 shows the same top 10 models with the lowest AIC scores as Table 6.6. However, the model IE identified is not among the top 10 models as identified by an explicit enumeration of all possible subset models.

Table 7.6 summarizes the ICOMP(IFIM) results from IE for the Y = Xmkβ model. Table 7.6 shows the same top 10 ICOMP(IFIM) models as Table 6.7. The model (0,6) was again identified by IE as having the lowest ICOMP(IFIM) score in one iteration. IE evaluated only 28 out of 8191 possible models to arrive at this model.

Table 7.3: Top 10 best AIC models for the body fat data using IE for the Y = Xkβ model.

Model              AIC      Number of Iterations  Number of Models Evaluated
{0,3,4,6,13}       1477.28  9                     2178
{0,3,4,6,10,13}    1478.42
{0,3,4,5,6,13}     1478.70
{0,1,3,4,6,13}     1478.78
{0,3,4,6,12,13}    1478.89
{0,3,4,6,11,13}    1479.04
{0,2,3,4,6,13}     1479.15
{0,3,4,6,7,13}     1479.26
{0,3,4,6,9,13}     1479.28
{0,3,4,6,8,13}     1479.28

Table 7.4: Top 10 best ICOMP(IFIM) models for the body fat data using IE for the Y = Xkβ model.

Model                          ICOMP(IFIM)   Iterations   Models Evaluated
{0,3,4,6}                      1497.56
{0,3,4,6,10}                   1498.13
{0,3,4,5,6}                    1499.94
{0,3,4,6,9}                    1500.59
{0,3,6,13}                     1500.60
{0,3,4,6,11}                   1500.63
{0,1,3,4,6}                    1500.65
{0,3,4,6,10,11}                1501.12
{0,3,4,5,6,10}                 1501.15
{0,1,3,4,6,10}                 1501.38
{0,3,6,7,9}                    1511.13           9            240

Table 7.5: Top 10 best AIC models for the body fat data using IE for the Y = Xmkβ model.

Model                          AIC        Iterations   Models Evaluated
{0,1,2,4,5,6,8,12,13}          387.93
{0,1,2,3,4,6,8,12,13}          388.37
{0,1,2,4,5,6,8,10,12,13}       389.18
{0,1,2,4,5,6,7,8,12,13}        389.22
{0,1,2,3,4,6,7,8,12,13}        389.41
{0,1,2,4,5,6,8,9,12,13}        389.53
{0,1,2,3,4,6,8,10,12,13}       389.66
{0,1,2,4,5,6,8,11,12,13}       389.88
{0,1,2,3,4,5,6,8,12,13}        389.93
{0,1,2,3,4,6,8,9,12,13}        390.06
{0,2,4,5,6,13}                 1490.5         8            238

Table 7.6: Top 10 best ICOMP(IFIM) models for the body fat data using IE for the Y = Xmkβ model.

Model                              ICOMP(IFIM)   Iterations   Models Evaluated
{1,2,3,4,5,6,7,8,9,10,11,12,13}    380.68
{1,2,3,4,6,7,8,9,10,11,12,13}      381.46
{1,2,3,4,5,6,7,8,9,11,12,13}       381.48
{1,2,3,4,5,6,7,8,10,11,12,13}      381.74
{1,2,3,4,5,6,7,8,9,10,12,13}       381.81
{1,2,3,4,6,7,8,9,11,12,13}         382.26
{1,2,3,4,5,6,7,8,11,12,13}         382.41
{1,2,3,4,6,7,8,10,11,12,13}        382.54
{1,2,3,4,6,7,8,9,10,12,13}         382.55
{1,2,3,4,5,6,7,8,9,12,13}          382.61
{0,6}                              1696.7            1            28

The (0,6) model is not among the top 10 models as scored by ICOMP(IFIM), with an ICOMP(IFIM) score of 1696.7. IE has difficulty obtaining the global minimum score for ICOMP(IFIM) when the weak monotonicity assumption for $\hat{\sigma}^2$ is violated, as discussed in Section 7.3.1.

7.3.4 Results from the Yk = Xβ Model

Table 7.7 summarizes the results from IE for the Yk = Xβ model with kernel regression smoothing applied only to the Y dependent variable. IE correctly identified the model (0,1,2,4,6,8,12,13) as having the lowest AIC score of 386.38 in only six iterations. IE evaluated 2120 out of the 8191 possible models to confirm that this model has the global minimum AIC score. This is the same model identified in Table 6.8 using GA. Table 7.7 shows the same top 10 models with the lowest AIC scores as Table 6.8.

Table 7.8 summarizes the ICOMP(IFIM) results from IE for the Yk = Xβ model. Table 7.8 shows the same top 10 ICOMP(IFIM) models as Table 6.9. The model (0,1,4,6,8,12,13) was identified by IE as having the lowest ICOMP(IFIM) score in seven iterations. IE evaluated only 948 out of 8191 possible models to arrive at this solution. The (0,1,4,6,8,12,13) model is not among the top 10 models as scored by ICOMP(IFIM), with an ICOMP(IFIM) score of 411.24. IE has difficulty obtaining the global minimum ICOMP(IFIM) score when the weak monotonicity assumption for $\hat{\sigma}^2$ is violated, as discussed in Section 7.3.1. However, the (0,1,4,6,8,12,13) model is very similar to the global optimal AIC model (0,1,2,4,6,8,12,13) identified by IE in Table 7.7.

7.3.5 Results from the Yk = Xkβ Model

Table 7.9 summarizes the results from IE for the Yk = Xkβ model with kernel regression smoothing applied to the Y dependent variable and the X matrix of independent variables individually. IE identified the model (0,1,2,3,6,8,9) as having the lowest AIC score of 398.93 in only seven iterations. IE evaluated 2614 out of the 8191 possible models.

Table 7.7: Top 10 best AIC models for the body fat data using IE for the Yk = Xβ model.

Model                          AIC        Iterations   Models Evaluated
{0,1,2,4,6,8,12,13}            386.38         6            2120
{0,1,2,4,6,7,8,12,13}          387.51
{0,1,2,4,6,8,10,12,13}         387.67
{0,1,2,4,5,6,8,12,13}          387.93
{0,1,2,4,6,8,9,12,13}          388.09
{0,1,2,6,8,12,13}              388.27
{0,1,2,4,6,8,11,12,13}         388.30
{0,1,2,3,4,6,8,12,13}          388.37
{0,1,4,6,7,8,12,13}            388.50
{0,1,2,4,6,7,8,10,12,13}       388.84

Table 7.8: Top 10 best ICOMP(IFIM) models for the body fat data using IE for the Yk = Xβ model.

Model                              ICOMP(IFIM)   Iterations   Models Evaluated
{1,2,3,4,5,6,7,8,9,10,11,12,13}    380.68
{1,3,4,5,6,7,8,9,10,11,12,13}      381.16
{1,2,3,4,6,7,8,9,10,11,12,13}      381.46
{1,2,3,4,5,6,7,8,9,11,12,13}       381.48
{1,2,3,4,5,6,7,8,10,11,12,13}      381.74
{1,3,4,6,7,8,9,10,11,12,13}        381.75
{1,2,3,4,5,6,7,8,9,10,12,13}       381.81
{1,3,4,5,6,7,8,9,11,12,13}         382.05
{1,3,4,5,6,7,8,10,11,12,13}        382.26
{1,2,3,4,6,7,8,9,11,12,13}         382.26
{0,1,4,6,8,12,13}                  411.24            7            948

Table 7.9: Top 10 best AIC models for the body fat data using IE for the Yk = Xkβ model.

Model                          AIC        Iterations   Models Evaluated
{0,1,2,4,5,6,8,12,13}          387.93
{0,1,2,3,4,6,8,12,13}          388.37
{0,1,2,4,5,6,8,10,12,13}       389.18
{0,1,2,4,5,6,7,8,12,13}        389.22
{0,1,2,3,4,6,7,8,12,13}        389.41
{0,1,2,4,5,6,8,9,12,13}        389.53
{0,1,2,3,4,6,8,10,12,13}       389.66
{0,1,2,4,5,6,8,11,12,13}       389.88
{0,1,2,3,4,5,6,8,12,13}        389.93
{0,1,2,3,4,6,8,9,12,13}        390.06
{0,1,2,3,6,8,9}                398.93         7            2614

This model is not among the top 10 AIC models. Table 7.9 shows the same top 10 models with the lowest AIC scores as Table 6.10.

Table 7.10 summarizes the ICOMP(IFIM) results from IE for the Yk = Xkβ model. Table 7.10 shows the same top 10 ICOMP(IFIM) models as Table 6.11. The saturated model (0,1,2,3,4,5,6,7,8,9,10,11,12,13) was identified by IE as having the lowest ICOMP(IFIM) score in one iteration. IE evaluated 1578 out of 8191 possible models to arrive at this solution. The (0,1,2,3,4,5,6,7,8,9,10,11,12,13) model is not among the top 10 models as scored by ICOMP(IFIM), with an ICOMP(IFIM) score of 455.22. IE has difficulty obtaining the global minimum ICOMP(IFIM) score when the weak monotonicity assumption for $\hat{\sigma}^2$ is violated, as discussed in Section 7.3.1.

7.3.6 Results from the Yk = Xmkβ Model

Table 7.11 summarizes the results from IE for the Yk = Xmkβ model with kernel regression smoothing applied to the Y dependent variable and the X matrix of independent variables together. IE identified the model (0,2,5,6,10,13) as having the lowest AIC score of 424.61 in only 16 iterations. IE evaluated only 210 out of the 8191 possible models. This is one of the models identified in Table 6.12 using GA. Table 7.11 shows the same top 10 models with the lowest AIC scores as Table 6.12.

Table 7.12 summarizes the ICOMP(IFIM) results from IE for the Yk = Xmkβ model. Table 7.12 shows the same top 10 ICOMP(IFIM) models as Table 6.13. The model (0,6) was identified by IE as having the lowest ICOMP(IFIM) score in one iteration. IE evaluated only 28 out of 8191 possible models to arrive at this model. The (0,6) model is not among the top 10 models as scored by ICOMP(IFIM), with an ICOMP(IFIM) score of 556.65. IE has difficulty obtaining the global minimum ICOMP(IFIM) score when the weak monotonicity assumption for $\hat{\sigma}^2$ is violated, as discussed in Section 7.3.1. This is the same model IE identified as optimal in Tables 7.2 and 7.6 for the Y = Xβ and Y = Xmkβ models, respectively.

Table 7.10: Top 10 best ICOMP(IFIM) models for the body fat data using IE for the Yk = Xkβ model.

Model                                ICOMP(IFIM)   Iterations   Models Evaluated
{1,2,3,4,5,6,7,8,9,10,11,12,13}      380.68
{1,2,3,4,6,7,8,9,10,11,12,13}        381.46
{1,2,3,4,5,6,7,8,9,11,12,13}         381.48
{1,2,3,4,5,6,7,8,10,11,12,13}        381.74
{1,2,3,4,5,6,7,8,9,10,12,13}         381.81
{1,2,3,4,6,7,8,9,11,12,13}           382.26
{1,2,3,4,5,6,7,8,11,12,13}           382.41
{1,2,3,4,6,7,8,10,11,12,13}          382.54
{1,2,3,4,6,7,8,9,10,12,13}           382.55
{1,2,3,4,5,6,7,8,9,12,13}            382.61
{0,1,2,3,4,5,6,7,8,9,10,11,12,13}    455.22            1            1578

Table 7.11: Top 10 best AIC models for the body fat data using IE for the Yk = Xmkβ model.

Model                          AIC        Iterations   Models Evaluated
{2,4,5,6,9,13}                 423.20
{0,2,4,5,6,9,13}               424.24
{0,2,5,6,10,13}                424.61        16            210
{0,2,5,6,13}                   425.83
{0,2,4,5,6,13}                 425.95
{0,2,4,5,6,11,13}              427.00
{0,2,5,6,11,13}                427.58
{0,2,5,6,9,13}                 427.83
{2,4,5,6,9,10,13}              428.28
{0,2,4,5,6,10,11,13}           428.49

Table 7.12: Top 10 best ICOMP(IFIM) models for the body fat data using IE for the Yk = Xmkβ model.

Model                          ICOMP(IFIM)   Iterations   Models Evaluated
{2,4,5,6,9,10,13}              418.67
{2,4,5,6,9,13}                 422.76
{2,4,5,6,9,10,12,13}           427.12
{2,4,5,6,9,12,13}              427.46
{0,2,4,5,6,13}                 429.57
{2,4,5,6,9,10}                 429.58
{2,4,5,6,9,10,11,13}           430.24
{2,4,5,6,9}                    430.38
{0,2,5,6,11,13}                431.17
{0,2,5,6,9,13}                 431.34
{0,6}                          556.65            1            28

7.4 Conclusions

Kernel regression smoothing was incorporated into the implicit enumeration (IE) algorithm for the six kernel models shown in Table 4.1. MATLAB programs written by Bao [Bao, et. al, 2005] were modified to incorporate kernel regression smoothing using AIC or ICOMP(IFIM) information criteria as the objective function to be minimized.
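The modified programs themselves are not reproduced here. As a rough sketch of the kind of objective-function evaluation that IE (or the GA) minimizes for each candidate subset, the MATLAB function below computes a least-squares AIC for one subset of regressors. The AIC form n log(RSS/n) + 2(q + 1), the function name subset_aic, and the assumption that any kernel smoothing of y or of the columns of X has already been applied upstream are assumptions of this sketch, not details taken from the Bao et al. code.

% Sketch only: AIC objective for one candidate subset model.
% Assumes the least-squares AIC form n*log(RSS/n) + 2*(q+1); kernel regression
% smoothing of y and/or the columns of X is assumed to be done before this call.
function aic = subset_aic(y, X, subset)
    % y      : n-by-1 response (possibly kernel smoothed)
    % X      : n-by-p matrix of candidate regressors (column 1 = intercept)
    % subset : logical 1-by-p vector marking the columns included in this model
    Xs = X(:, subset);                     % regressors of the candidate model
    n  = size(Xs, 1);
    q  = size(Xs, 2);                      % number of regression coefficients
    betahat = Xs \ y;                      % least-squares fit
    rss = sum((y - Xs * betahat).^2);      % residual sum of squares
    aic = n * log(rss / n) + 2 * (q + 1);  % error variance counted as one parameter
end
% Hypothetical call for a 14-column X (intercept plus the 13 body fat predictors):
% aic = subset_aic(y, X, logical([1 1 0 1 0 0 1 0 0 0 0 0 1 1]));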

IE with kernel regression correctly identified the global optimal models for the six kernel models shown in Table 4.1 using AIC. These models confirm the results of the GA from Chapter 6 for AIC. IE works well for AIC since the monotonicity of $\hat{\sigma}^2$ holds for $\hat{\sigma}^2 > 0$.

Table 7.13 summarizes the best models from IE and GA using AIC for each of the six kernel models. Table 7.13 shows perfect agreement between the first four models identified by IE and GA. IE outperformed GA for the Yk = Xkβ model since the AIC score for the (0,1,2,3,6,8,9) model identified by IE was 398.93, while the AIC score for the (0,3,4,6,10,12) model identified by GA was 407.68. However, GA barely outperformed IE for the Yk = Xmkβ model since the AIC score for the (0,2,4,5,6,9,13) model identified by GA was 424.24, while the AIC score for the (0,2,5,6,10,13) model identified by IE was 424.61.

Table 7.14 summarizes the best models from IE and GA using ICOMP(IFIM) for each of the six kernel models. Table 7.14 shows the GA outperformed IE for selecting the optimal models for all six kernel models since the ICOMP(IFIM) scores were consistently lower for GA than IE. The weak monotonicity of the penalty term of ICOMP(IFIM) is a problem for IE, but not for GA.

However, ICOMP(IFIM) did not perform as well as AIC using IE with kernel regression due to the weak monotonicity properties of the penalty function of ICOMP(IFIM). Bozdogan [Bozdogan, 1990a] showed that the penalty term from Eqn. 2.13 monotonically increases as $\hat{\sigma}^2$ increases when $2q\hat{\sigma}^2 > n\,\mathrm{tr}(X'X)^{-1}$. However, Bozdogan [Bozdogan, 1990a] also showed that the penalty term from Eqn. 2.13 monotonically decreases as $\hat{\sigma}^2$ increases when $2q\hat{\sigma}^2 < n\,\mathrm{tr}(X'X)^{-1}$. Therefore, the implicit enumeration algorithm has difficulty obtaining the global minimum ICOMP(IFIM) score when the weak monotonicity assumption for $\hat{\sigma}^2$ is violated, that is, when $2q\hat{\sigma}^2 = n\,\mathrm{tr}(X'X)^{-1}$.
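Restating the two cited conditions compactly (this is only a summary of the statements above, written in the notation of Eqn. 2.13):

\[
\text{the penalty term of Eqn. 2.13 is }
\begin{cases}
\text{increasing in } \hat{\sigma}^{2}, & \text{if } 2q\hat{\sigma}^{2} > n\,\operatorname{tr}(X'X)^{-1},\\
\text{decreasing in } \hat{\sigma}^{2}, & \text{if } 2q\hat{\sigma}^{2} < n\,\operatorname{tr}(X'X)^{-1},
\end{cases}
\]

so IE has difficulty reaching the global ICOMP(IFIM) minimum when the weak monotonicity assumption for $\hat{\sigma}^{2}$ is violated.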

Table 7.13: Comparison of the best models from IE and GA using AIC.

Kernel Model    IE Best Model              IE AIC     GA Best Model              GA AIC
Y = Xβ          {0,1,2,4,6,7,8,12,13}      1457.0     {0,1,2,4,6,7,8,12,13}      1457.0
Y = Xkβ         {0,3,4,6,13}               1477.3     {0,3,4,6,13}               1477.3
Y = Xmkβ        {0,2,4,5,6,13}             1490.5     {0,2,4,5,6,13}             1490.5
Yk = Xβ         {0,1,2,4,6,8,12,13}        386.38     {0,1,2,4,6,8,12,13}        386.38
Yk = Xkβ        {0,1,2,3,6,8,9}            398.93     {0,3,4,6,10,12}            407.68
Yk = Xmkβ       {0,2,5,6,10,13}            424.61     {0,2,4,5,6,9,13}           424.24

Table 7.14: Comparison of the best models from IE and GA using ICOMP(IFIM).

Kernel Model    IE Best Model               IE ICOMP(IFIM)   GA Best Model                    GA ICOMP(IFIM)
Y = Xβ          {0,6}                       1534.87          {4,6,7,12,13}                    1495.17
Y = Xkβ         {0,3,6,7,9}                 1511.13          {0,3,4,6}                        1497.56
Y = Xmkβ        {0,6}                       1696.7           {0,2,5,6,13}                     1525.82
Yk = Xβ         {0,1,4,6,8,12,13}           411.24           {0,1,4,6,12,13}                  410.58
Yk = Xkβ        all variables + intercept   455.22           {0,1,3,4,5,6,7,8,9,10,11,12}     388.26
Yk = Xmkβ       {0,6}                       556.65           {0,2,4,5,6,13}                   429.57

Further research is needed to modify the IE algorithm or the ICOMP(IFIM) penalty term to account for its split monotonicity in order to achieve the global optimal solution.

Chapter 8

Conclusions

8.1 Summary of Dissertation

This dissertation provided a brief description of the problem and the proposed approach of kernel density estimation using information complexity in Chapter 1. Chapter 2 defined information criteria and complexity and compared information criteria as a tool for variable selection to other heuristic techniques and model diagnostics. Chapter 3 introduced kernel density estimation and provided a relevant and current literature review. Univariate, multivariate, and variable bandwidth selectors currently in the literature were also reviewed. Chapter 4 discussed univariate kernel regression using both simulated and real data sets for various data distributions and kernel functions. Chapter 5 extended kernel regression to the multivariate case and presented results of multivariate kernel regression using simulated data. Chapter 6 described the genetic algorithm and presented results of integrating multivariate kernel density regression with the genetic algorithm using a real multivariate data set. Chapter 7 presented results of integrating kernel regression with the implicit enumeration algorithm using the same real multivariate data set from Chapter 6.

The research in this dissertation made the following contributions to the literature.

• Information complexity is used as the criterion to determine the optimal bandwidth for multivariate kernel density estimation. The bandwidth is calculated for each model identified by the genetic algorithm. The best model has the minimum ICOMP using kernel regression smoothing within the genetic algorithm (a schematic sketch of this bandwidth search is given after this list).

• The management science approach taken is the minimization of the non-linear information complexity function to determine the global optimal model for high dimensional multivariate data sets.

• The approach of using an implicit enumeration with the GA in the kernel regression problem integrates management science optimization with statistical modeling.
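As referenced in the first bullet, the bandwidth search can be viewed as a one-dimensional minimization of the information complexity score. The sketch below shows only that search structure in MATLAB; the function name select_bandwidth, the grid-search strategy, and the handle icomp_of (standing in for the ICOMP computation developed in this dissertation) are illustrative assumptions, not the dissertation's actual implementation.

% Schematic bandwidth search: pick the candidate bandwidth with minimum ICOMP.
% icomp_of is a hypothetical function handle; icomp_of(h) should return the
% information complexity score of the kernel estimate with bandwidth h.
function [h_best, score_best] = select_bandwidth(icomp_of, h_grid)
    scores = arrayfun(icomp_of, h_grid);   % evaluate the criterion on the grid
    [score_best, idx] = min(scores);       % minimum ICOMP wins
    h_best = h_grid(idx);
end
% Within the GA, this search would be repeated for each subset model visited,
% and the resulting minimum ICOMP would serve as that model's fitness value.
% Purely illustrative call with a stand-in criterion:
% [h, s] = select_bandwidth(@(h) (h - 0.7).^2 + 10, linspace(0.05, 2, 40));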

All possible combinations of applying kernel regression smoothing to the dependent Y variables and the independent X variables were researched using both ICOMP(IFIM) and AIC for model selection. This research showed kernel regression smoothing on the X matrix increased the information complexity scores, resulting in generally poorer models. However, kernel regression smoothing on the Y variables dramatically decreased the information complexity scores in most data sets, resulting in much better models.

The choice of the kernel function used to smooth the data depends on the data distribution. The Gaussian kernel is best for approximately normally distributed data, while the Epanechnikov kernel was shown to be a better choice for uniform and power exponential distributed data.
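For reference, the two kernels being compared have the standard textbook forms shown below (a small MATLAB sketch; these definitions are standard and are not taken from the dissertation's code):

% Standard univariate kernel functions referred to above.
gaussian_kernel     = @(u) exp(-0.5 * u.^2) / sqrt(2 * pi);     % Gaussian kernel
epanechnikov_kernel = @(u) 0.75 * (1 - u.^2) .* (abs(u) <= 1);  % Epanechnikov kernel

% Evaluate both kernels at a few points for comparison.
u = -1.5:0.5:1.5;
disp([u; gaussian_kernel(u); epanechnikov_kernel(u)]);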

The genetic algorithm with kernel regression smoothing was applied to a well-publicized real multivariate data set. The real body fat data confirmed the simulated results of Chapter 4: applying kernel regression to the Y variable dramatically decreases both the AIC and ICOMP(IFIM) scores compared to the baseline model using no kernel regression. However, applying kernel regression to the X matrix alone increases the AIC and ICOMP(IFIM) scores.

Kernel regression smoothing was incorporated into the implicit enumeration algorithm for the six kernel models shown in Table 4.1. MATLAB programs written by Bao [Bao, et. al, 2005] were modified to incorporate kernel regression smoothing using AIC or ICOMP(IFIM) information criteria as the objective function to be minimized.

Implicit enumeration with kernel regression correctly identified the global optimal models for three of the six kernel models using AIC. These models confirm the results of the genetic algorithm for AIC. Implicit enumeration works well for AIC due to the monotonicity of $\hat{\sigma}^2$. However, ICOMP(IFIM) did not perform as well as AIC using implicit enumeration with kernel regression due to the monotonicity properties of the penalty function of ICOMP(IFIM).

8.2 Future Work

This dissertation has provided the opportunity for future work to extend this research.

Future work based on this research includes:

• Determine whether the dramatic reduction of information complexity scores by applying kernel regression smoothing to the dependent Y variables extends to more kernel density functions than were used in this dissertation.

• Integrate the kernel regression smoothing with the hybridized implicit enumeration algorithm and the genetic algorithm.

• Apply this research to a higher-dimensional real data set to see if the results are consistent with both the real and simulated data sets used in this dissertation.

• Modify the implicit enumeration algorithm to account for the monotonicity of the penalty function of ICOMP(IFIM).

• Use other information criteria such as SC, ICOMP(IFIM)_PEU, and ICOMP(IFIM) for misspecified models with kernel regression in the genetic and implicit enumeration algorithms.

The author plans to submit the results of this dissertation for publication to a peer reviewed journal.

Bibliography

[Abramson, 1982] Abramson, I. (1982). On Bandwidth Variation in Kernel Estimates: A

Square Root Law. The Annals of Statistics, 10:1217–1223.

[Akaike, 1973] Akaike, H. (1973). Information Theory and an Extension of the Maximum

Likelihood Principle. Second International Symposium on Information Theory, 267–281.

Budapest: Academiai Kiado.

[Akaike, 1974] Akaike, H. (1974). A New Look at the Statistical Model Identification.

IEEE Transactions Automatic Control, AC-19:716–723.

[Akaike, 1978] Akaike, H. (1978). A Bayesian Analysis of the Minimum AIC Procedure.

Annals of the Institute of Statistical Mathematics, 30:9–14.

[Akaike, 1981] Akaike, H. (1981). Likelihood of a Model and Information Criteria. Journal

of Econometrics, 16(1):3–14.

[Akaike, 1987] Akaike, H. (1987). Factor Analysis and AIC. Psychometrika, 52:317–332.

[An and Gu, 1989] An, H. and L. Gu (1989). Fast Stepwise Procedures of Selection of

Variables by Using AIC and BIC Criteria. Acta Math. Appl. Sinica (English Ser.),

5:60–67.

[Bao, et. al, 2005] Bao, X., H. Bozdogan, V. Chatpattananan and K. Gilbert (2005). An

Implicit Enumeration Algorithm for Mining High Dimensional Data. International Jour-

nal of Operational Research, 1:123–144.

[Beal, 2007] Beal, D. (2007). Information Criteria Methods in SAS for Multiple Linear

Regression Models. Proceedings of the 15th Annual Conference of the SouthEast SAS

Users Group.

[Bearse and Bozdogan, 1998] Bearse, P. and H. Bozdogan (1998). Subset Selection in Vec-

tor Autoregressive Models Using the Genetic Algorithm with Informational Complexity

as the Fitness Function. Systems Analysis Modelling Simulation, 31:61–91.

[Bhandarkar, et. al, 1994] Bhandarkar, S., Y. Zhang and W. Potter (1994) An Edge De-

tection Technique Using Genetic Algorithm Based Optimization. Pattern Recognition,

27:1159–1180.

[Bhuyan, et. al, 1991] Bhuyan, J., V. Raghavan and V. Elayavalli (1991). Genetic Algo-

rithm for Clustering with an Ordered Representation. 4th International Conference on

Genetic Algorithms, San Mateo, California: Morgan Kaufman, 38.

[Bhattacharyya, et. al, 1989] Bhattacharyya, B., L. Franklin and G. Richardson (1989).

Nonparametric Kernel Density Estimations: Unweighted Versus Length Biased Den-

sities. Computer Science and Statistics: Proceedings of the 21st Symposium on the

Interface, 319–322.

[Bowman, 1984] Bowman, A. W. (1984). An Alternative Method of Cross-Validation for

the Smoothing of Density Estimates. Biometrika, 71:353–360.

[Bowman and Azzalini, 1997] Bowman, A. W. and A. Azzalini (1997). Applied Smoothing

Techniques for Data Analysis: the Kernel Approach with S-Plus Illustrations. Oxford

University Press, Oxford.

[Box and Tidwell, 1962] Box, G. E. P. and P. W. Tidwell (1962). Transformation of the

Independent Variables. Technometrics, 4:531–550.

[Bozdogan, 1987] Bozdogan, H. (1987). Model Selection and Akaike’s Information Crite-

rion (AIC): The General Theory and its Analytical Extensions. Psychometrika, 52:345–

370.

[Bozdogan, 1988] Bozdogan, H. (1988). ICOMP: A New Model Selection Criterion. Classi-

fication and Related Methods of Data Analysis, Hans H. Bock (ed.), Amsterdam, Elsevier

Science Publishers B. V. (North Holland), 599–608.

[Bozdogan, 1990a] Bozdogan, H. (1990a). Statistical Modeling and Model Evaluation: A

New Informational Approach.

[Bozdogan, 1990b] Bozdogan, H. (1990b). On the Information Based Measure of Covari-

ance Complexity and its Application to the Evaluation of Multivariate Linear Models.

Communications in Statistics, Theory and Methods, 19(1):221–278.

[Bozdogan, 1993] Bozdogan, H. (1993). Choosing the Number of Component Clusters

in the Mixture Model Using a New Informational Complexity Criterion of the Inverse

Fisher Information Matrix. Information and Classification. Concepts, Methods and Ap-

plications. Proceedings of the 16th Annual Conference of the Gesellschaft für Klassifikation

e. V., 40–54.

[Bozdogan, 1994] Bozdogan, H. (1994). Mixture Model Cluster Analysis Using Model Se-

lection Criteria and a New Informational Measure of Complexity. Multivariate Statistical

Modeling, H. Bozdogan (Ed.), Vol. 2, 69–113. Proceedings of the first US/Japan Con-

ference on the Frontiers of Statistical Modeling: An Informational Approach, 2:69–113,

Dordrecht, the Netherlands: Kluwer Academic Publishers.

[Bozdogan, 2000] Bozdogan, H. (2000). Akaike’s Information Criterion and Recent Devel-

opments in Information Complexity. Journal of Mathematical Psychology, 44(1):62–91.

[Bozdogan, 2003] Bozdogan, H. (2003). Statistical Data Mining and Knowledge Discovery.

Boca Raton: Chapman & Hall CRC.

[Bozdogan, 2004] Bozdogan, H. (2004) Intelligent Statistical Data Mining with Informa-

tion Complexity and Genetic Algorithms. Statistical Data Mining and Knowledge Dis-

covery, H. Bozdogan (Ed.), 15–56, Chapman and Hall/CRC, Boca Raton, Florida.

[Bozdogan, 2006] Bozdogan, H. (2006). Keynote lecture presented at the International

Workshop on Knowledge Extraction and Modeling (KNEMO ’06), Island of Capri, Italy,

September 4-6, 2006, www.knemo.unana.it.

[Bozdogan and Haughton, 1998] Bozdogan, H. and D. Haughton (1998). Informational

Complexity Criteria for Regression Models. Computational Statistics and Data Analysis,

28:51–76.

[Bozdogan and Magnus, 2004] Bozdogan, H. and J. Magnus (2004). Misspecification Re-

sistant Model Selection Using Information Complexity. Journal of Econometrics.

[Bozdogan and Shigemasu, 1998] Bozdogan, H. and Shigemasu, K. (1998). Bayesian Fac-

tor Analysis Model and Choosing the Number of Factors Using a New Informational

Complexity Criterion. Advances in Data Science and Classification. Proceedings of the

6th Conference of the International Federation of Classification Societies, 335–342.

[Brieman, 2001] Brieman, L. (2001). Statistical Modeling: the Two Cultures (with discus-

sion). Statistical Science, 16:199–231.

[Breiman, Meisel and Purcell, 1977] Breiman, L., W. Meisel and E. Purcell (1977). Vari-

able Kernel Estimates of Probability Density Estimates. Technometrics, 19:135–144.

[Brewer, 1998] Brewer, M. J. (1998). A Modeling Approach for Bandwidth Selection in

Kernel Density Estimation. COMPSTAT – Proceedings in Computational Statistics,

13th Symposium, 203–208.

[Brooks, et. al, 2003] Brooks, S., N. Friel and R. King (2003). Classical Model Selection

via Simulated Annealing. Journal of Royal Statistical Society, Series B, 65:503–520.

[Burnham and Anderson, 2002] Burnham, K. and D. Anderson (2002). Model Selection

and Multimodel Inference: A Practical Information-Theoretic Approach. Springer-

Verlag, New York.

[Burns, 1992] Burns, P. (1992). A Genetic Algorithm for Robust Regression Estimation.

Statsci Technical Report, Statistical Sciences, Inc.

[Cao, 2001] Cao, R. (2001). Relative Efficiency of Local Bandwidths in Kernel Density

Estimation. Statistics, 35:113–137.

[Chatfield, 1995] Chatfield, C. (1995). Model Uncertainty, Data Mining and Statistical

Inference. Journal of the Royal Statistical Society, Series A, 158:419–466.

[Chaudhuri and Marron, 1999] Chaudhuri, P. and J. Marron (1999). SiZer for Exploration

of Structures in Curves. Journal of the American Statistical Association, 94:807–823.

[Chiu, 1991] Chiu, S. T. (1991). The Effect of Discretization Error on Bandwidth Selection

for Kernel Density Estimation. Biometrika, 78:436–441.

[Chiu, 1992] Chiu, S. T. (1992). An Automatic Bandwidth Selector for Kernel Density

Estimation. Biometrika, 79:771–782.

[Chiu, 1996] Chiu, S. T. (1996). A Comparative Review of Bandwidth Selection for Kernel

Density Estimation. Statistica Sinica, 6:126–145.

[Chu, 1994] Chu, C. K. (1994). Estimation of Change Points in a Nonparametric Re-

gression Function Through Kernel Density Estimation. Communications in Statistics:

Theory and Methods, 23:3037–3062.

[Cwik and Koronacki, 1997a] Cwik, J. and J. Koronacki (1997a). A Combined Adaptive-

Mixtures/Plug-in Estimator of Multivariate Probability Densities. Computational Statis-

tics and Data Analysis, 26:199–218.

[Cwik and Koronacki, 1997b] Cwik, J. and J. Koronacki (1997b). Multivariate Density

Estimation: A Comparative Study. Neural Computing and Applications, 6:173–185.

[Delaigle and Gijbels, 2004] Delaigle, A. and I. Gijbels (2004). Bootstrap Bandwidth Selec-

tion in Kernel Density Estimation from a Contaminated Sample. Annals of the Institute

of Statistical Mathematics, 56:19–47.

[Devroye and Gyorfi, 1985] Devroye, L. and L. Gyorfi (1985). Nonparametric Density Es-

timation: the L1 View, John Wiley and Sons Inc.: New York.

[Duin, 1976] Duin, R. P. W. (1976). On the Choice of Smoothing Parameters for Parzen Es-

timators of Probability Density Functions. IEEE Transactions on Computers, 25:1175–

1178.

[Duong, 2004] Duong, T. (2004). Bandwidth Selectors for Multivariate Kernel Density

Estimation. Ph.D. dissertation, School of Mathematics and Statistics, University of

Western Australia.

[Duong, et al, 2008] Duong, T., A. Cowling, I. Koch and M. Wand (2008). Feature Signif-

icance for Multivariate Kernel Density Estimation. Computational Statistics and Data

Analysis, 52:4225-4242.

[Duong and Hazelton, 2003] Duong, T. and M. L. Hazelton (2003). Plug-in Bandwidth

Selectors for Bivariate Kernel Density Estimation. Journal of Nonparametric Statistics,

15:17–30.

[Duong and Hazelton, 2005] Duong, T. and M. L. Hazelton (2005). Cross Validation Band-

width Matrices for Multivariate Kernel Density Estimation. Scandinavian Journal of

Statistics, 32:485–506.

[Eggermont and Lariccia, 1996] Eggermont, P.P.B. and V. N. Lariccia (1996). A Simple

and Effective Bandwidth Selector for Kernel Density Estimation. Scandinavian Journal

of Statistics, 23:285–301.

[Epanechnikov, 1969] Epanechnikov, V. A. (1969). Nonparametric Estimation of a Multi-

variate Probability Density. Theory of Probability and its Applications, 14:153–158.

[Faraway and Jhun, 1990] Faraway, J. J. and M. Jhun (1990). Bootstrap Choice of Band-

width for Density Estimation. Journal of the American Statistical Association, 85:1119–

1122.

[Friedman, 1991] Friedman, J. H. (1991). Multivariate Adaptive Regression Splines. The

Annals of Statistics, 19:1–67

[Furnival and Wilson, 1974] Furnival, G. and R. Wilson (1974). Regression by Leaps and

Bounds. Technometrics, 16:499–511.

[Gajek and Lenic, 1993] Gajek, L. and A. Lenic (1993). An Approximate Necessary Con-

dition for the Optimal Bandwidth Selector in Kernel Density Estimation. Applicationes

Mathematicae (Warsaw), 22:123–138.

[Godtliebsen et al, 2002] Godtliebsen, F., J. Marron, and P. Chaudhuri (2002). Signifi-

cance in Scale Space for Bivariate Density Estimation. Journal of Computational and

Graphical Statistics, 11:1–21.

[Goldberg, 1989] Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and

Machine Learning. New York: Addison-Wesley.

[Gomez, Gomez-Villegas and Marin, 1998] Gomez, E., M. A. Gomez-Villegas, and J. M.

Marin (1998). A Multivariate Generalization of the Power Exponential Family of Dis-

tributions. Communications in Statistics- Theory and Methods, 27(3):589–600.

[Hald, 2001] Hald, A. (2001). On the History of the Correction for Grouping. Scandinavian

Journal of Statistics, 28:417-428.

[Hall, 1992] Hall, P. (1992). On Global Properties of Variable Bandwidth Density Estima-

tors. Annals of Statistics, 20:762–778.

[Hall and Marron, 1987] Hall, P. and J. S. Marron (1987). Extent to Which Least Squares

Cross Validation Minimizes Integrated Square Error in Nonparametric Density Estima-

tion. Probability Theory and Related Fields, 74:567–581.

[Hall and Marron, 1991] Hall, P. and J. S. Marron (1991). Lower Bounds for Bandwidth

Selection in Density Estimation. Probability Theory and Related Fields, 90:149–173.

[Hall, et al, 1991] Hall, P., S. J. Sheather, M. C. Jones and J. S. Marron (1991). On

Optimal Data Based Bandwidth Selection in Kernel Density Estimation. Biometrika,

78:263–269.

[Hall, Marron and Park, 1992] Hall, P., J. S. Marron and B. U. Park (1992). Smoothed

Cross-Validation. Probability Theory and Related Fields, 92:1–20.

[Hall and Wand, 1988a] Hall, P. and M. P. Wand (1988a). On Nonparametric Discrimina-

tion using Density Differences. Biometrika, 75:541–547.

[Hallin and Tran, 1996] Hallin, M. and L. T. Tran (1996). Kernel Density Estimation for

Linear Processes: Asymptotic Normality and Optimal Bandwidth Derivation. Annals

of the Institute of Statistical Mathematics, 48:429–449.

[Hamada, et. al, 2001] Hamada, M., H. Martz, C. Reese and A. Wilson (2001). Finding

Near Optimal Bayesian Experimental Designs via Genetic Algorithms. The American

Statistician, 55(3):175–181.

[Hardle, 1990] Hardle, W. (1990). Applied Nonparametric Regression. Cambridge Univer-

sity Press, New York.

[Hardle, Liang and Gao, 2000] Hardle, W., H. Liang, and J. Gao (2000). Partially Linear

Models. Springer-Physica-Verlag, Heidelberg.

[Hardle et al, 1990] Hardle, W., J. Marron, and M. Wand (1990). Bandwidth Choice for

Density Derivatives. Journal of the Royal Statistical Society Series B, 52:223–232.

[Hardle, Mori and Vieu, 2007] Hardle, W., Y. Mori, P. Vieu (2007). Statistical Methods

for Biostatistics and Related Fields, Springer: New York.

[Hazelton, 1996] Hazelton, M. (1996). Bandwidth Selection for Local Density Estimation.

Scandinavian Journal of Statistics, 23:221–232.

[Hazelton, 1999] Hazelton, M. L. (1999). An Optimal Local Bandwidth Selector for Kernel

Density Estimation. Journal of Statistical Planning and Inference, 77:37–50.

[Hjort and Glad, 1995] Hjort, N. L. and I. K. Glad (1995). Nonparametric Density Esti-

mation with a Parametric Start. The Annals of Statistics, 23:882–904.

[Hocking, 1976] Hocking, R. (1976). The Analysis and Selection of Variables in Linear

Regression. Biometrics, 32:1–49.

[Hoeting and Ibrahim, 1996] Hoeting, J. and J. Ibrahim (1996). Bayesian Predictive Si-

multaneous Variable and Transformation Selection in the Linear Model. Department of

Statistics, Colorado State University, Fort Collins. Technical Report No. 96/39.

[Holland, 1992] Holland, J. H. (1992). Genetic Algorithms. Scientific American, 267(1):66–

72.

[Howe, 2009] Howe, J. A. (2009). A New Generation of Mixture-Model Cluster Analysis

with Information Complexity and the Genetic EM Algorithm. Ph.D. dissertation, the

University of Tennessee, Knoxville.

[Jones, 1990] Jones, M. C. (1990). Variable Kernel Density Estimates and Variable Kernel

Density Estimates. Australian Journal of Statistics, 32:361–372.

[Jones, 1991a] Jones, M. C. (1991). Kernel Density Estimation for Length Biased Data.

Biometrika, 78:511–519.

[Jones, 1991b] Jones, M. C. (1991). On Correcting for Variance Inflation in Kernel Density

Estimation. Computational Statistics & Data Analysis, 11:3–15.

[Jones, 1991c] Jones, M. C. (1991). Prospects for Automatic Bandwidth Selection in Ex-

tensions to Basic Kernel Density Estimation. Nonparametric Functional Estimation and

Related Topics, 241–249.

[Jones, 1992] Jones, M. C. (1992). Potential for Automatic Bandwidth Choice in Variations

on Kernel Density Estimation. Statistics and Probability Letters, 13:351–356.

[Jones, 1993a] Jones, M. C. (1993a). Kernel Density Estimation When the Bandwidth is

Large. The Australian Journal of Statistics, 35:319–326.

[Jones, 1993b] Jones, M. C. (1993b). Simple Boundary Correction for Kernel Density

Estimation. Statistics and Computing, 3:135–146.

[Jones, 1998] Jones, M. C. (1998). On Some Kernel Density Estimation Bandwidth Selec-

tors Related to the Double Kernel Method. Sankhyā, Series A, 60:249–264.

[Jones and Foster, 1996] Jones, M. C. and P. J. Foster (1996). A Simple Nonnegative

Boundary Correction Method for Kernel Density Estimation. Statistica Sinica, 6:1005–

1013.

[Jones and Kappenman, 1992] Jones, M. C. and R. F. Kappenman (1992). On a Class

of Kernel Density Estimate Bandwidth Selectors. Scandinavian Journal of Statistics:

Theory and Applications, 19:337–349.

[Jones and Lotwick, 1984] Jones, M. C. and H. W. Lotwick (1984). A Remark on Algo-

rithm AS 176: Kernel Density Estimation Using the Fast Fourier Transform. Applied

Statistics, 33:120–122.

[Jones, Marron and Park, 1991] Jones, M. C., J. S. Marron and B. U. Park (1991). A

Simple Root n Bandwidth Selector. The Annals of Statistics, 19:1919–1932.

[Jones, Marron and Sheather, 1996a] Jones, M. C., J. S. Marron and S. J. Sheather

(1996a). A Brief Survey of Bandwidth Selection for Density Estimation. Journal of

the American Statistical Association, 91:401–407.

[Jones, Marron and Sheather, 1996b] Jones, M. C., J. S. Marron and S. J. Sheather

(1996b). Progress in Data Based Bandwidth Selection for Kernel Density Estimation.

Computational Statistics, 11:337–381.

[Jones, McKay and Hu, 1994] Jones, M. C., I. J. McKay and T. C. Hu (1994). Variable

Location and Scale Kernel Density Estimation. Annals of the Institute of Statistical

Mathematics, 46:521–535.

[Jones, Signorini and Hjort, 1999] Jones, M. C., D. F. Signorini and N. L. Hjort (1999).

On Multiplicative Bias Correction in Kernel Density Estimation. Sankhyā, Series A,

61:422–430.

[Kullback, 1987] Kullback, S. (1987). The Kullback-Leibler distance. The American Statis-

tician, 41:340–341.

[Linhart and Zucchin, 1986] Linhart H. and W. Zucchin (1986). Model Selection. New

York: John Wiley & Sons.

[Liu, 2006] Liu, M. (2006). Multivariate Non-normal Regression Models, Information Com-

plexity, and Genetic Algorithms: A Three Way Hybrid for Intelligent Data Mining. Ph.D.

thesis, University of Tennessee, Knoxville.

[Liu and Bozdogan, 2008] Liu, M. and H. Bozdogan (2008). Multivariate Regression Mod-

els with Power Exponential Random Errors and Subset Selection Using Genetic Algo-

rithms With Information Complexity. European Journal of Pure and Applied Mathe-

matics, January 2008.

[Loader, 1999] Loader, C. R. (1999). Bandwidth Selection: Classical or Plug-in? The

Annals of Statistics, 27:415–438.

[Loftsgaarden and Quesenberry, 1965] Loftsgaarden, D. O. and C. P. Quesenberry (1965).

A Nonparametric Estimate of a Multivariate Density Function. Annals of Mathematical

Statistics, 36:1049–1051.

[Luh, Minesky and Bozdogan, 1996] Luh, H. K., J. J. Minesky and H. Bozdogan (1996).

Choosing the Best Predictors in Regression Analysis via the Genetic Algorithm with

Informational Complexity as the Fitness Function. unpublished paper.

[Mantel, 1970] Mantel, N. (1970). Why Stepdown Procedures in Variable Selection. Tech-

nometrics, 12:591–612.

[Meyer, 2003] Meyer, M. (2003). An Evolutionary Algorithm with Applications to Statis-

tics. Journal of Computational and Graphical Statistics, 12(2):265–281.

[Minnotte and Scott, 1993] Minnotte, M. C. and D. W. Scott (1993). The Mode Tree: A

Tool for Visualization of Nonparametric Density Features. Journal of Computational

and Graphical Statistics, 2:51–68.

[Park, 1989] Park, B. U. (1989). On the Plug-in Bandwidth Selectors in Kernel Density

Estimation. Journal of the Korean Statistical Society, 18:107–117.

[Parzen, 1962] Parzen, E. (1962). On Estimation of a Probability Density Function and

Mode. The Annals of Mathematical Statistics, 33:1065–1076.

[Rosenblatt, 1956] Rosenblatt, M. (1956). Remarks on Some Nonparametric Estimates of

a Density Function. The Annals of Mathematical Statistics, 27:832–837.

[Rudemo, 1982] Rudemo, M. (1982). Empirical Choice of Histograms and Kernel Density

Estimators. Scandinavian Journal of Statistics: Theory and Applications, 9:65–78.

[Sahin and Diwekar, 2004] Sahin, K. H. and U. M. Diwekar (2004). Better Optimization of

Nonlinear Uncertain Systems (BONUS): A New Algorithm for Stochastic Programming

Using Reweighting Through Kernel Density Estimation. Annals of Operations Research,

132:47–68.

[Said, 2005] Said, Y. H. (2005). On Genetic Algorithms and Their Applications. Handbook

of Statistics, 24:359–390.

[Sain, 2002] Sain, S. R. (2002). Multivariate Locally Adaptive Density Estimation. Com-

putational Statistics and Data Analysis, 39:165–186.

[Sain and Scott, 2002] Sain, S. R. and D. W. Scott (2002). Zero-Bias Bandwidths for

Locally Adaptive Kernel Density Estimation. Scandinavian Journal of Statistics, 29:441–

460.

[Sain, Baggerly and Scott, 1994] Sain, S. R., K. A. Baggerly and D. W. Scott (1994). Cross

Validation of Multivariate Densities. Journal of the American Statistical Association,

89:807–817.

[Sain and Scott, 1996] Sain, S. R. and D. W. Scott (1996). On Locally Adaptive Density

Estimation. Journal of the American Statistical Association, 91:1525–1534.

[Sawa, 1978] Sawa, T. (1978). Information Criteria for Discriminating Among Alternative

Regression Models. Econometrica, 46:1273–1282.

[Schucany, 1989] Schucany, W. R. (1989). Locally Optimal Window Widths for Kernel

Density Estimation with Large Samples. Statistics and Probability Letters, 7:401–405.

[Schwarz, 1978] Schwarz, G. (1978). Estimating the Dimension of a Model. Annals of

Statistics, 6:461–464.

[Sclove, 1987] Sclove, S. (1987). Application of Model Selection Criteria to Some Problems

in Multivariate Analysis. Psychometrika, 52:333–343.

[Scott, 1976] Scott, D. W. (1976). Nonparametric Probability Density Estimation by Op-

timization Theoretic Techniques. Statistics & Probability, Vol. 65.

[Scott, et. al, 1980] Scott, D. W., R. A. Tapia and J. R. Thompson (1980). Nonparametric

Probability Density Estimation by Discrete Maximum Penalized-Likelihood Criteria.

Annals of Statistics, 8(4): 820–832.

[Scott, 1985] Scott, D. W. (1985). Averaged Shifted Histograms: Effective Nonparametric

Density Estimators in Several Dimensions. Annals of Statistics, 13:1024–1040.

[Scott, 1992] Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and

Visualization. John Wiley: New York.

[Scott and Sain, 2004] Scott, D. W. and S. R. Sain (2004). Multi-dimensional Density

Estimation. Elsevier Science.

[Scott and Terrell, 1987] Scott, D. W. and G. R. Terrell (1987). Biased and Unbiased

Cross-Validation in Density Estimation. Journal of the American Statistical Association,

82(400):1131–1146.

[Scott, Tapia and Thompson, 1997] Scott, D.W., R. A. Tapia and J. R. Thompson (1997).

Kernel Density Estimation Revisited. Nonlinear Analysis, 1(4):339–372.

[Sheather and Jones, 1991] Sheather, S. J. and M. C. Jones (1991). A Reliable Data-

Based Bandwidth Selection Method for Kernel Density Estimation. Journal of the Royal

Statistical Society, Series B: Methodological, 53:683–690.

[Silverman, 1981] Silverman, B. W. (1981). Using Kernel Density Estimates to Investigate

Multi-Modality. Journal of the Royal Statistical Society, Series B, 43:97–99.

[Silverman, 1982] Silverman, B. W. (1982). Kernel Density Estimation Using the Fast

Fourier Transform. Applied Statistics, 31:93–99.

[Silverman, 1986] Silverman, B. W. (1986). Density Estimation for Statistics and Data

Analysis. Chapman and Hall: London.

[Simonoff, 1996] Simonoff, J. S. (1996). Smoothing Methods in Statistics, New York:

Springer-Verlag.

[Singh, 1987] Singh, R. S. (1987). MISE of Kernel Estimates of a Density and its Deriva-

tives. Statistics & Probability Letters, 5:153–159.

[Song, Seog and Cho, 1991] Song, M. S., K. H. Seog and S. S. Cho (1991). On Asymptot-

ically Optimal Plug-in Bandwidth Selectors in Kernel Density Estimation. Journal of

the Korean Statistical Society, 20:29–43.

[Speckman, 1988] Speckman, P. (1988). Kernel Smoothing in Partial Linear Models. Jour-

nal of the Royal Statistical Society, Series B, 50:413–436.

[Stone, 1984] Stone, C. J. (1984). An Asymptotically Optimal Window Selection Rule for

Kernel Density Estimates. The Annals of Statistics, 12:1285–1297.

[Sturges, 1926] Sturges, H. A. (1926). The Choice of a Class Interval. Journal of the

American Statistical Association, 21:65–66.

[Tapia and Thompson, 1978] Tapia, R. A. and J. R. Thompson (1978). Nonparametric

Probability Density Estimation. Hopkins Press: Baltimore.

[Taylor, 1989] Taylor, C. C. (1989). Bootstrap Choice of the Smoothing Parameter in

Kernel Density Estimation. Biometrika, 76:705–712.

[Terrell, 1990] Terrell, G. R. (1990). The Maximal Smoothing Principle in Density Esti-

mation. Journal of the American Statistical Association, 85:470–477.

[Terrell and Scott, 1985] Terrell, G. R. and D. W. Scott (1985). Oversmoothed Nonpara-

metric Density Estimates. Journal of the American Statistical Association, 80:209–214.

[Terrell and Scott, 1992] Terrell, G. R. and D. W. Scott (1992). Variable Kernel Density

Estimation. Annals of Statistics, 20:1236–1265.

[Trotter and Shetty, 1974] Trotter, L. and C. Shetty (1974). An Algorithm for the Bounded

Variable Integer Programming Problem. Journal of Association for Computing Machin-

ery, 21:505–513.

[Van Emden, 1971] Van Emden, M. H. (1971). An Analysis of Complexity. Mathematical

Centre Tracts. 35, Amsterdam: Mathematisch Centrum.

[Victor, 1976] Victor, N. (1976). Decision Making and Medical Care: Can Information

Science Help? Nonparametric Allocation Rules, North-Holland, Amsterdam, 515–529.

[Vose, 1999] Vose, M. (1999). The Simple Genetic Algorithm: Foundations and Theory.

MIT Press.

[Wagner, 1975] Wagner, T. J. (1975). Nonparametric Estimates of Probability Densities.

IEEE Transactions on Information Theory, IT-21:438–440.

[Wand and Jones, 1993] Wand, M. P. and M. C. Jones (1993). Comparison of Smoothing

Parameterizations in Bivariate Kernel Density Estimation. Journal of the American

Statistical Association, 88(422):520–528.

[Wand and Jones, 1994] Wand, M. P. and M. C. Jones (1994). Multivariate Plug-in Band-

width Selection. Computational Statistics, 9:97–116.

[Wand and Jones, 1995] Wand, M. P. and M. C. Jones (1995). Kernel Smoothing, Chap-

man and Hall: London.

[Wu, 1997] Wu, T. J. (1997). Root n Bandwidth Selectors for Kernel Estimation of Density

Derivatives. Journal of the American Statistical Association, 92:536–547.

[Wu and Tsai, 2004] Wu, T. J. and M. H. Tsai (2004). Root n Bandwidth Selectors in

Multivariate Kernel Density Estimation. Probability Theory and Related Fields, 129:537–

558.

[Yang, 2000] Yang, L. (2000). Root n Convergent Transformation Kernel Density Estima-

tion. Journal of Nonparametric Statistics, 12:447–474.

Appendices

Appendix

A.1 Simulated Normal Data from Section 4.2.1


Figure A.1: Histograms and scatter plots of normally distributed variables X1, X2, X3, and X4.


Figure A.2: Histograms and scatter plots of normally distributed variables X5, X6, X7, and Y1.

A.2 Hald's Cement Data from Section 4.2.3

Figure A.3: Histograms and scatter plots of Hald’s variables X1, X2, X3, X4 and Y .

A.3 Power Exponential Data from Section 4.2.4

Figure A.4: Histograms and scatter plots of PE variables X1, X2, X3, and X4.


Figure A.5: Histograms and scatter plots of PE variables X5, X6, X7, and Y1.

A.4 Friedman's Data from Section 4.2.5

Figure A.6: Histograms and scatter plots of Friedman’s variables X1, X2, X3, X4, X5, and Y .


Figure A.7: Histograms and scatter plots of Friedman’s variables X6, X7, X8, X9, and X10.

A.5 Uncorrelated Uniform Data from Section 4.2.6

Figure A.8: Histograms and scatter plots of uncorrelated uniform variables X4, X5, X6, and X7.

A.6 Simulated Multivariate Normal Data from Section 5.2

Figure A.9: Histograms and scatter plots of normally distributed variables X1, X2, Y1, Y2, and Y3.


Figure A.10: Histograms and scatter plots of normally distributed variables X3, X4, X5, X6, and X7.

A.7 Power Exponential Data from Section 5.3

Figure A.11: Histograms and scatter plots of PE variables X1, X2, Y1, Y2, and Y3.


Figure A.12: Histograms and scatter plots of PE variables X3, X4, X5, X6, and X7.

A.8 Body Fat Data from Section 6.3

Figure A.13: Histograms and scatter plots of body fat variables X1, X2, X3, X4, and X5.


Figure A.14: Histograms and scatter plots of body fat variables X6, X7, X8, X9, and X10.


Figure A.15: Histograms and scatter plots of body fat variables X11, X12, X13, and Y .

Vita

Dennis J. Beal was born and raised in Knoxville, Tennessee. He graduated from Carter High School in 1980, where he was class valedictorian of a senior class of approximately 300 students. He then attended Carson-Newman College in Jefferson City, Tennessee, and majored in mathematics. In addition to a major in mathematics, he minored in computer studies, history, business administration, and Greek. Dennis graduated magna cum laude from Carson-Newman College in 1984 with a 3.90 grade point average and was awarded the Science-Humanities Award in 1984 from his graduating class for academic excellence in both science and the humanities. He was a member or officer in many campus honor societies and organizations, including Alpha Chi, Mortar Board, Pi Tau Chi, Blue Key National Honor Society, Alpha Lambda Delta, Phi Alpha Theta, president of Pi Mu Epsilon, Phi Eta Sigma, and Sigma Mu Alpha. In 1985 he was selected for the Outstanding Young Men of America.

After graduating from Carson-Newman College, Dennis pursued graduate study in applied mathematics at Virginia Polytechnic Institute and State University (Virginia Tech) in Blacksburg, Virginia. He graduated from Virginia Tech with a Master of Science degree in applied mathematics in 1987. At Virginia Tech he took his first senior-level statistics courses, where he fell in love with the science of statistics. He also took his first graduate courses in operations research and saw how applicable these disciplines were to solving real-world problems.

After getting a taste of statistics and operations research, Dennis decided he wanted more education in these areas and enrolled at the University of Tennessee, Knoxville, to pursue a Master of Science degree in statistics. He took nearly every graduate statistics course offered in addition to some management science courses as electives. He graduated from the University of Tennessee, Knoxville, with his Master of Science degree in statistics in 1989 with a 3.91 grade point average.

During graduate school at the University of Tennessee, Knoxville, he had the opportunity to do part-time work as a co-op at the Oak Ridge National Laboratory in Oak Ridge, Tennessee. There he ran SAS models and analyzed transportation data for the U.S. Department of Transportation. After graduating with his Master of Science degree in statistics with a 3.9 grade point average, he applied for a full-time position as a statistician with Martin Marietta Energy Systems at the K-25 Plant (now known as the East Tennessee Technology Park) in Oak Ridge. He worked as an applied statistician on a wide variety of primarily environmental projects for the U.S. Department of Energy. Due to the nature of his projects, Dennis was granted a U.S. Department of Energy top secret ('Q') security clearance, which is still active today. Dennis became an experienced SAS programmer and statistician by analyzing environmental data across several U.S. Department of Energy installations.

In 1998 Dennis transitioned from Lockheed Martin Energy Systems (successor to Martin Marietta Energy Systems) to the new U.S. Department of Energy environmental contractor Bechtel Jacobs Company, where he continued his work as an applied statistician on environmental projects for the U.S. Department of Energy. In 1999 he transitioned from Bechtel Jacobs Company to one of its subcontractors.

In 2000 he left the subcontractor and was hired as a statistician at Science Applications International Corporation (SAIC) in Oak Ridge, Tennessee. At SAIC he used his advanced SAS skills and applied statistical professional experience on a wide variety of both government and private industry projects. In addition to gaining experience as a statistician, he also used his skills as a human health risk assessor at SAIC, where he wrote and ran complex SAS programs to calculate risks and hazards for human health.

In 2001 Dennis wanted to pursue a Ph.D. in management science with an emphasis in statistics to complement his 12 years of professional experience in the business world. His first doctoral-level statistics course at the University of Tennessee, Knoxville, was with Dr. Hamparsum Bozdogan, where Dennis was introduced to the concepts of information criteria and model selection for regression. He knew this was the area he wanted to pursue in research. He pursued the Ph.D. in management science part-time while still working full-time at SAIC. In 2007 he was promoted in grade to Senior Statistician, and he is the lead statistician on many high-profile projects for decontamination and decommissioning of legacy buildings at the Oak Ridge National Laboratory, Y-12 National Security Complex, and the East Tennessee Technology Park. He successfully designed the statistical sampling plan for characterizing the K-25 Building structure and its many process gaseous equipment components. This was the first successful statistical sampling design approved by the regulators and has become the standard for other building characterization projects within the U.S. Department of Energy.

Dennis also applied his statistical expertise to the National Biosurveillance Information System for the U.S. Federal Government, whose goal is to integrate information from multiple sources to predict terrorist threats or epidemics from emerging cases of highly contagious diseases in animals or humans. Dennis has an active Top Secret Department of Defense security clearance for his work on this project.

Dennis also was on the technical team that created a ranking system for quantifying the safety of meat products for the U.S. Department of Agriculture. He met with the Assistant Secretary of Agriculture in Washington, D.C., to provide statistical expertise for the development of the model used to predict meat processing plants that need closer inspection.

Dennis expects to complete his Ph.D. in management science in December 2009 with a 3.9 grade point average. In addition to his research and applied statistical consulting through SAIC, he continues to present papers annually at SAS conferences.
