
University of Tennessee, Knoxville TRACE: Tennessee Research and Creative Exchange

Doctoral Dissertations Graduate School

12-2009

Data Mining with Multivariate Kernel Regression Using Information Complexity and the Genetic Algorithm

Dennis Jack Beal University of Tennessee - Knoxville


Recommended Citation

Beal, Dennis Jack, "Data Mining with Multivariate Kernel Regression Using Information Complexity and the Genetic Algorithm." PhD diss., University of Tennessee, 2009. https://trace.tennessee.edu/utk_graddiss/561

To the Graduate Council:

I am submitting herewith a dissertation written by Dennis Jack Beal entitled “Data Mining with Multivariate Kernel Regression Using Information Complexity and the Genetic Algorithm.” I have examined the final electronic copy of this dissertation for form and content and recommend that it be accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, with a major in Management Science.

Hamparsum Bozdogan, Major Professor

We have read this dissertation and recommend its acceptance:

Ken Gilbert

Bogdan Bichescu

Mohammed Mohsin

Accepted for the Council:

Carolyn R. Hodges

Vice Provost and Dean of the Graduate School

(Original signatures are on file with official student records.)

Data Mining with Multivariate Kernel Regression Using Information Complexity and the Genetic Algorithm

A Dissertation

Presented for the

Doctor of Philosophy

Degree

The University of Tennessee, Knoxville

Dennis Jack Beal

December 2009

Copyright © 2009 by Dennis Jack Beal.

All rights reserved.

Dedication

This dissertation is dedicated to my wife, Shannon Beal, and my children, Heather Beal and Jonathan Beal. Their patience and understanding during my pursuit of this degree were a great help.

I also want to dedicate this dissertation to my mother, Paula S. Beal, who taught me the importance of learning, hard work, setting goals, perseverance, and the value of an education. Without her inspiration and sacrifices I would never have seen the value of education, nor would I have achieved this lifelong dream.

Acknowledgments

I would like to thank all those who supported and helped me complete this dissertation.

First, I want to thank my major professor, Dr. Hamparsum Bozdogan, for his time, expertise, and guidance while directing my dissertation. His passion for research has inspired me to continue researching in academic areas of interest to me.

I thank my doctoral committee members Dr. Ken Gilbert, Dr. Bogdan Bichescu, and

Dr. Mohammed Mohsin for their time, support, and useful comments.

I also appreciate Dr. John Andrew Howe for his help with LaTeX and the Information

Complexity M3 Toolbox for MATLAB that he and Dr. Bozdogan developed.

Last, I appreciate my current employer, Science Applications International Corporation

(SAIC), for investing in me as an employee by paying nearly all of my tuition and educational expenses throughout my doctoral program. Their investment in me made it

financially possible for me to pursue the degree.

Abstract

Kernel density estimation is a data smoothing technique that depends heavily on the bandwidth selection. The current literature has focused on optimal selectors for the univariate case that are primarily data driven. Plug-in and cross validation selectors have recently been extended to the general multivariate case.

This dissertation will introduce and develop novel techniques for data mining with multivariate kernel density regression using information complexity and the genetic algorithm as a heuristic optimizer to choose the optimal bandwidth and the best predictors in kernel regression models. Simulated and real data will be used to cross validate the optimal bandwidth selectors using information complexity. The genetic algorithm is used in conjunction with information complexity to determine kernel density estimates for variable selection from high-dimensional multivariate data sets.

Kernel regression is also hybridized with the implicit enumeration algorithm to determine the set of independent variables for the global optimal solution using information criteria as the objective function. The results from the genetic algorithm are compared to the optimal solution from the implicit enumeration algorithm and the known global optimal solution from an explicit enumeration of all possible subset models.

Contents

1 Introduction 1

1.1 Brief Description of the Problem and Proposed Approach ...... 1

1.2 Motivations ...... 2

1.3 Contributions of this Dissertation ...... 5

1.4 Organization of this Dissertation ...... 5

2 Information Criteria and Complexity 6

2.1 Information Criteria ...... 6

2.2 Informational Complexity ...... 8

2.3 Comparison of Model Selection Techniques ...... 12

2.4 Conclusions ...... 14

3 Kernel Density Estimation 15

3.1 Introduction ...... 15

3.2 Review of Literature on Kernel Density Estimation ...... 17

3.3 Bandwidth Selectors ...... 19

3.3.1 Univariate Fixed Bandwidth Selectors ...... 20

3.3.2 Multivariate Fixed Bandwidth Selectors ...... 25

3.3.3 Variable Bandwidth Selectors ...... 30

3.3.4 Information Complexity Bandwidth Selectors ...... 31

3.4 Conclusions ...... 35

4 Univariate Kernel Regression 36

4.1 Kernel Regression ...... 36

4.2 Univariate Kernel Regression Results ...... 39

4.2.1 Univariate Kernel Regression with Normal Data ...... 39

4.2.2 Univariate Box-Cox Transformation ...... 50

4.2.3 Univariate Kernel Regression on Hald Cement Data ...... 53

4.2.4 Kernel Regression for Univariate Power Exponential Distribution . . 61

4.2.5 Univariate Kernel Regression for Friedman’s Nonlinear Regression . 73

4.2.6 Univariate Kernel Regression for Uncorrelated Data ...... 87

4.2.7 Univariate Kernel Regression Using the Epanechnikov Kernel on Nor-

mal Data ...... 98

4.2.8 Univariate Kernel Regression Using the Epanechnikov Kernel on PE

Data ...... 101

4.3 Conclusions ...... 103

5 Multivariate Kernel Regression 104

5.1 Introduction ...... 104

5.2 Kernel Regression with Normal Data ...... 106

5.3 Kernel Regression for Multivariate Power Exponential Distribution . . . . . 113

5.4 Conclusions ...... 119

6 Kernel Regression with the Genetic Algorithm 120

6.1 Introduction ...... 120

6.2 Genetic Algorithm ...... 120

6.3 Kernel Regression with the Genetic Algorithm on Real Body Fat Data . . . 127

6.3.1 Results from the Y = Xβ Model ...... 127

6.3.2 Results from the Y = Xkβ Model ...... 134

6.3.3 Results from the Y = Xmkβ Model ...... 154

6.3.4 Results from the Yk = Xβ Model ...... 164

6.3.5 Results from the Yk = Xkβ Model ...... 170

6.3.6 Results from the Yk = Xmkβ Model ...... 179

6.3.7 Conclusions from Body Fat Data ...... 184

6.4 Conclusions ...... 187

7 Kernel Regression with Implicit Enumeration Algorithm 188

7.1 Introduction ...... 188

7.2 Implicit Enumeration Algorithm ...... 188

7.3 Implicit Enumeration with Kernel Regression ...... 192

7.3.1 Results from the Y = Xβ Model ...... 192

7.3.2 Results from the Y = Xkβ Model ...... 194

7.3.3 Results from the Y = Xmkβ Model ...... 194

7.3.4 Results from the Yk = Xβ Model ...... 197

7.3.5 Results from the Yk = Xkβ Model ...... 197

7.3.6 Results from the Yk = Xmkβ Model ...... 200

7.4 Conclusions ...... 203

8 Conclusions 206

8.1 Summary of Dissertation ...... 206

8.2 Future Work ...... 208

Bibliography 209

Appendix 227

A.1. Simulated Normal Data from Section 4.2.1 ...... 227

A.2. Hald’s Cement Data from Section 4.2.3 ...... 230

A.3. Power Exponential Data from Section 4.2.4 ...... 231

A.4. Friedman’s Data from Section 4.2.5 ...... 233

A.5. Uncorrelated Uniform Data from Section 4.2.6 ...... 235

A.6. Simulated Multivariate Normal Data from Section 5.2 ...... 236

A.7. Power Exponential Data from Section 5.3 ...... 238

A.8. Body Fat Data from Section 6.3 ...... 240

Vita 243

List of Tables

4.1 Notation used for applying univariate kernel regression smoothing methods. 40

4.2 Results for applying univariate kernel regression smoothing methods with

normally distributed data...... 49

4.3 Competing models for univariate Box-Cox transformations...... 51

4.4 Results from using Box-Cox transformation for normally distributed data. . 53

4.5 Hald’s cement production data...... 54

4.6 Results for applying univariate kernel regression smoothing methods on

Hald’s data...... 60

4.7 Results for applying univariate kernel regression smoothing methods with

power exponential data...... 72

4.8 Results for applying univariate kernel regression smoothing methods using

Friedman’s nonlinear data...... 86

4.9 Results for applying univariate kernel regression smoothing methods on un-

correlated variables using posterior expected utility (PEU)...... 98

4.10 Results for applying univariate kernel regression with the Epanechnikov ker-

nel on normal data...... 101

4.11 Results for applying univariate kernel regression with the Epanechnikov ker-

nel on power exponential data...... 102

5.1 Notation used for applying multivariate kernel regression smoothing methods.107

5.2 Results for applying multivariate kernel regression smoothing methods with

normally distributed data...... 112

5.3 Results for applying multivariate kernel regression smoothing methods with

power exponential data...... 118

6.1 Genetic algorithm operational parameters...... 125

6.2 AIC results for body fat data from the Y = Xβ model...... 131

6.3 ICOMP(IFIM) results for body fat data from the Y = Xβ model...... 134

6.4 AIC results for body fat data from the Y = Xkβ model...... 152

6.5 ICOMP(IFIM) results for body fat data from the Y = Xkβ model...... 153

6.6 AIC results for body fat data from the Y = Xmkβ model...... 159

6.7 ICOMP(IFIM) results for body fat data from the Y = Xmkβ model. . . . . 163

6.8 AIC results for body fat data from the Yk = Xβ model...... 167

6.9 ICOMP(IFIM) results for body fat data from the Yk = Xβ model...... 171

6.10 AIC results for body fat data from the Yk = Xkβ model...... 174

6.11 ICOMP(IFIM) results for body fat data from the Yk = Xkβ model...... 178

6.12 AIC results for body fat data from the Yk = Xmkβ model...... 182

6.13 ICOMP(IFIM) results for body fat data from the Yk = Xmkβ model. . . . . 186

6.14 Best AIC and ICOMP(IFIM) models for the body fat data...... 186

7.1 Top 10 best AIC models for the body fat data using IE for the Y = Xβ

model...... 193

7.2 Top 10 best ICOMP(IFIM) models for the body fat data using IE for the

Y = Xβ model...... 193

7.3 Top 10 best AIC models for the body fat data using IE for the Y = Xkβ model...... 195

7.4 Top 10 best ICOMP(IFIM) models for the body fat data using IE for the

Y = Xkβ model...... 195

7.5 Top 10 best AIC models for the body fat data using IE for the Y = Xmkβ model...... 196

7.6 Top 10 best ICOMP(IFIM) models for the body fat data using IE for the

Y = Xmkβ model...... 196

7.7 Top 10 best AIC models for the body fat data using IE for the Yk = Xβ model...... 198

7.8 Top 10 best ICOMP(IFIM) models for the body fat data using IE for the

Yk = Xβ model...... 198

7.9 Top 10 best AIC models for the body fat data using IE for the Yk = Xkβ model...... 199

7.10 Top 10 best ICOMP(IFIM) models for the body fat data using IE for the

Yk = Xkβ model...... 201

7.11 Top 10 best AIC models for the body fat data using IE for the Yk = Xmkβ model...... 202

7.12 Top 10 best ICOMP(IFIM) models for the body fat data using IE for the

Yk = Xmkβ model...... 202

7.13 Comparison of the best models from IE and GA using AIC...... 204

7.14 Comparison of the best models from IE and GA using ICOMP(IFIM). . . . 204

List of Figures

3.1 Univariate plot of the Gaussian kernel function...... 16

4.1 Bivariate surface plot of normally distributed variables X1 and X2...... 40

4.2 Kernel density plots for various h values for normally distributed variable X1. 41

4.3 Kernel density plots for various h values for normally distributed variable X2. 42

4.4 Kernel density plots for various h values for normally distributed variable X3. 43

4.5 Kernel density plots for various h values for normally distributed variable X4. 44

4.6 Kernel density plots for various h values for normally distributed variable X5. 45

4.7 Kernel density plots for various h values for normally distributed variable X6. 46

4.8 Kernel density plots for various h values for normally distributed variable X7. 47 4.9 Kernel density plots for various h values for normally distributed variable Y . 48

4.10 Bivariate surface plot of Hald’s variables X1 and X2...... 54

4.11 Kernel density plots for various h values for Hald’s variable X1...... 55

4.12 Kernel density plots for various h values for Hald’s variable X2...... 56

4.13 Kernel density plots for various h values for Hald’s variable X3...... 57

4.14 Kernel density plots for various h values for Hald’s variable X4...... 58 4.15 Kernel density plots for various h values for Hald’s variable Y ...... 59

4.16 Bivariate surface plot of power exponential variables X1 and X2...... 62

4.17 Kernel density plots for various h values for PE distributed variable X1... 64

4.18 Kernel density plots for various h values for PE distributed variable X2... 65

4.19 Kernel density plots for various h values for PE distributed variable X3... 66

4.20 Kernel density plots for various h values for PE distributed variable X4... 67

4.21 Kernel density plots for various h values for PE distributed variable X5... 68

4.22 Kernel density plots for various h values for PE distributed variable X6... 69

4.23 Kernel density plots for various h values for PE distributed variable X7... 70 4.24 Kernel density plots for various h values for PE distributed variable Y .... 71

4.25 Bivariate surface plot of Friedman’s variables X1 and X2 with the dependent variable Y ...... 73

4.26 Kernel density plots for various h values for Friedman’s variable X1..... 75

4.27 Kernel density plots for various h values for Friedman’s variable X2..... 76

4.28 Kernel density plots for various h values for Friedman’s variable X3..... 77

4.29 Kernel density plots for various h values for Friedman’s variable X4..... 78

4.30 Kernel density plots for various h values for Friedman’s variable X5..... 79

4.31 Kernel density plots for various h values for Friedman’s variable X6..... 80

4.32 Kernel density plots for various h values for Friedman’s variable X7..... 81

4.33 Kernel density plots for various h values for Friedman’s variable X8..... 82

4.34 Kernel density plots for various h values for Friedman’s variable X9..... 83

4.35 Kernel density plots for various h values for Friedman’s variable X10..... 84 4.36 Kernel density plots for various h values for Friedman’s variable Y ...... 85

4.37 Histograms and scatter plots of uncorrelated variables X1, X2, X3, and Y .. 88

4.38 Kernel density plots for various h values for uncorrelated variable X1.... 90

4.39 Kernel density plots for various h values for uncorrelated variable X2.... 91

4.40 Kernel density plots for various h values for uncorrelated variable X3.... 92

4.41 Kernel density plots for various h values for uncorrelated variable X4.... 93

4.42 Kernel density plots for various h values for uncorrelated variable X5.... 94

4.43 Kernel density plots for various h values for uncorrelated variable X6.... 95

4.44 Kernel density plots for various h values for uncorrelated variable X7.... 96 4.45 Kernel density plots for various h values for uncorrelated variable Y ..... 97

4.46 Univariate plot of the Epanechnikov kernel function...... 99

5.1 Bivariate surface plot of normally distributed variables X1 and X2...... 108

5.2 Kernel density plots for various h values for normally distributed variable Y2.110

5.3 Kernel density plots for various h values for normally distributed variable Y3.111

5.4 Bivariate surface plot of power exponential variables X1 and X2...... 115

5.5 Kernel density plots for various h values for PE distributed variable Y2... 116

5.6 Kernel density plots for various h values for PE distributed variable Y3... 117

6.1 Three dimensional surface plot of AIC scores for body fat from the Y = Xβ

model...... 129

6.2 Scatter plot of AIC scores for body fat for each generation from the Y = Xβ

model...... 130

6.3 Three dimensional surface plot of ICOMP(IFIM) scores for body fat from

the Y = Xβ model...... 132

6.4 Scatter plot of ICOMP(IFIM) scores for body fat for each generation from

the Y = Xβ model...... 133

6.5 Kernel density plots for various hj values for body fat variable X1...... 135

6.6 Kernel density plots for various hj values for body fat variable X2...... 136

6.7 Kernel density plots for various hj values for body fat variable X3...... 137

6.8 Kernel density plots for various hj values for body fat variable X4...... 138

6.9 Kernel density plots for various hj values for body fat variable X5...... 139

6.10 Kernel density plots for various hj values for body fat variable X6...... 140

6.11 Kernel density plots for various hj values for body fat variable X7...... 141

6.12 Kernel density plots for various hj values for body fat variable X8...... 142

6.13 Kernel density plots for various hj values for body fat variable X9...... 143

6.14 Kernel density plots for various hj values for body fat variable X10...... 144

6.15 Kernel density plots for various hj values for body fat variable X11...... 145

6.16 Kernel density plots for various hj values for body fat variable X12...... 146

6.17 Kernel density plots for various hj values for body fat variable X13...... 147

6.18 Kernel density plots for various hj values for body fat variable Y ...... 148

6.19 Three dimensional surface plot of AIC scores for body fat from the Y = Xkβ model...... 150

6.20 Scatter plot of AIC scores for body fat for each generation from the Y = Xkβ model...... 151

6.21 Three dimensional surface plot of ICOMP(IFIM) scores for body fat from

the Y = Xkβ model...... 155 6.22 Scatter plot of ICOMP(IFIM) scores for body fat for each generation from

the Y = Xkβ model...... 156 6.23 Three dimensional surface plot of AIC scores for body fat from the Y =

Xmkβ model...... 157 6.24 Scatter plot of AIC scores for body fat for each generation from the Y =

Xmkβ model...... 158 6.25 Three dimensional surface plot of ICOMP(IFIM) scores for body fat from

the Y = Xmkβ model...... 161 6.26 Scatter plot of ICOMP(IFIM) scores for body fat for each generation from

the Y = Xmkβ model...... 162

6.27 Three dimensional surface plot of AIC scores for body fat from the Yk = Xβ model...... 165

6.28 Scatter plot of AIC scores for body fat for each generation from the Yk = Xβ model...... 166

6.29 Three dimensional surface plot of ICOMP(IFIM) scores for body fat from

the Yk = Xβ model...... 168 6.30 Scatter plot of ICOMP(IFIM) scores for body fat for each generation from

the Yk = Xβ model...... 169

6.31 Three dimensional surface plot of AIC scores for body fat from the Yk = Xkβ model...... 172

6.32 Scatter plot of AIC scores for body fat for each generation from the Yk = Xkβ model...... 173

6.33 Three dimensional surface plot of ICOMP(IFIM) scores for body fat from

the Yk = Xkβ model...... 176 6.34 Scatter plot of ICOMP(IFIM) scores for body fat for each generation from

the Yk = Xkβ model...... 177

6.35 Three dimensional surface plot of AIC scores for body fat from the Yk =

Xmkβ model...... 180

6.36 Scatter plot of AIC scores for body fat for each generation from the Yk =

Xmkβ model...... 181 6.37 Three dimensional surface plot of ICOMP(IFIM) scores for body fat from

the Yk = Xmkβ model...... 183 6.38 Scatter plot of ICOMP(IFIM) scores for body fat for each generation from

the Yk = Xmkβ model...... 185

A.1 Histograms and scatter plots of normally distributed variables X1, X2, X3,

and X4...... 228

A.2 Histograms and scatter plots of normally distributed variables X5, X6, X7,

and Y1...... 229

A.3 Histograms and scatter plots of Hald’s variables X1, X2, X3, X4 and Y ... 230

A.4 Histograms and scatter plots of PE variables X1, X2, X3, and X4...... 231

A.5 Histograms and scatter plots of PE variables X5, X6, X7, and Y1...... 232

A.6 Histograms and scatter plots of Friedman’s variables X1, X2, X3, X4, X5, and Y ...... 233

A.7 Histograms and scatter plots of Friedman’s variables X6, X7, X8, X9, and

X10...... 234

A.8 Histograms and scatter plots of uncorrelated uniform variables X4, X5, X6,

and X7...... 235

A.9 Histograms and scatter plots of normally distributed variables X1, X2, Y1,

Y2, and Y3...... 236

A.10 Histograms and scatter plots of normally distributed variables X3, X4, X5,

X6, and X7...... 237

A.11 Histograms and scatter plots of PE variables X1, X2, Y1, Y2, and Y3..... 238

A.12 Histograms and scatter plots of PE variables X3, X4, X5, X6, and X7.... 239

A.13 Histograms and scatter plots of body fat variables X1, X2, X3, X4, and X5. 240

A.14 Histograms and scatter plots of body fat variables X6, X7, X8, X9, and X10. 241

A.15 Histograms and scatter plots of body fat variables X11, X12, X13, and Y .. 242

Chapter 1

Introduction

1.1 Brief Description of the Problem and Proposed Approach

This research focuses on the problem of selecting the optimal bandwidth for kernel density estimation. Kernel density estimates are very sensitive to bandwidth selection: a large bandwidth will oversmooth the distribution and could mask the true underlying distribution, while an unnecessarily small bandwidth will undersmooth the distribution and highlight the noise in the data more than the underlying distribution.

The current literature has many data driven methods for estimating the optimal bandwidth for kernel density estimation. These data driven methods include plug-in and cross validation techniques. However, the approach of this dissertation is to use Bozdogan's information complexity (ICOMP) [Bozdogan, 1988, Bozdogan, 1990a, Bozdogan, 1990b, Bozdogan and Haughton, 1998, Bozdogan, 2000, Bozdogan, 2003] and Akaike's information criteria (AIC) with a genetic algorithm (GA) to derive the optimal bandwidth selection for high dimensional multivariate data as a data mining tool.

1.2 Motivations

A univariate multiple linear regression model is of the form

y = Xβ + ε, (1.1)

where y is an (n × 1) observed response vector on each of n individuals, X is an (n × q) matrix of full rank on q predictor variables, β is an unknown (q × 1) vector of regression coefficients, and ε is an (n × 1) vector of unobserved random errors. When the number of q predictor variables is large, an efficient model selection algorithm must be used to identify a subset of those predictor variables that form the best models for predicting the response variable y. The best model should estimate the regression coefficients with small standard errors with as few variables as necessary (parsimony).

A multivariate linear regression (MLR) model is of the form

Y = XB + E, (1.2)

where Y is an (n × p) observed matrix of p response variables on each of n individuals,

X is an (n × q) matrix of full rank on q predictor variables, B is an unknown (q × p) matrix of regression coefficients, and E is an (n × p) matrix of unobserved random errors.

When the number of q predictor variables is large, an efficient model selection algorithm must be used to identify a subset of those predictor variables that form the best models for predicting the p response variables.

When q is small (e.g., q < 10), an obvious strategy to find the "best" model is using explicit enumeration, where all possible subset regression models are enumerated and each model is evaluated according to a given criterion. The number of possible subset regressions is 2^q − 1 since each of the q predictor variables is either included or excluded from the model. When q = 10, the number of subset models to enumerate explicitly is only 2^10 − 1 = 1023. However, the number of subset models grows exponentially larger as q increases. When q = 20, 1,048,575 subset models must be evaluated. When q = 30, 1,073,741,823 subset models must be evaluated using explicit enumeration. Since many modern multivariate data sets can have many more independent variables than 30, explicitly enumerating and evaluating the criterion for all possible models cannot be completed in a reasonable time even on the fastest computers.
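To make the combinatorial burden concrete, the short Python sketch below (an illustration only; it is not the MATLAB toolbox used in this dissertation) enumerates every nonempty subset of a small set of predictors and scores each candidate model by AIC from an ordinary least squares fit. The data, the helper name aic_ols, and all settings shown are hypothetical.

import itertools
import numpy as np

def aic_ols(X, y):
    """AIC for a Gaussian linear model fit by ordinary least squares."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    sigma2 = rss / n                      # MLE of the residual variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + 2 * (k + 1)      # k coefficients plus sigma^2

rng = np.random.default_rng(0)
n, q = 100, 6                             # small q, so 2^q - 1 = 63 subsets
X = rng.standard_normal((n, q))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.standard_normal(n)

best = None
for size in range(1, q + 1):
    for subset in itertools.combinations(range(q), size):
        design = np.column_stack([np.ones(n), X[:, subset]])
        score = aic_ols(design, y)
        if best is None or score < best[0]:
            best = (score, subset)

print("best subset:", best[1], "AIC:", round(best[0], 2))

With q = 6 only 63 subsets are scored; raising q to 20 or 30 makes the same loop infeasible, which motivates the heuristic search discussed below.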

Classical model selection methods include hypothesis testing based on forward selection, backward elimination or stepwise regression techniques. Forward selection begins with no variables in the model and allows the variable with the largest F-statistic to enter the model if its corresponding p-value is below a pre-determined significance level to enter. Once selected, the variable stays in all subsequent models. Conversely, backward elimination begins with all variables in the model and drops the variable with the smallest F-statistic if its corresponding p-value exceeds the pre-determined significance level to stay. Stepwise regression is a hybrid of forward selection and backward elimination. The technique begins by adding the most significant variable to the model (determined by the p-value of its F-statistic), and then tests which other variable can enter the model one at a time given the previous variables in the model. However, in stepwise regression at each step all variables are retested to determine if they remain significant in the presence of the other variables.

Therefore, a variable may be removed from subsequent models if its p-value exceeds the significance level to stay.

There are several criticisms of using forward selection, backward elimination and stepwise regression. Linhart and Zucchini [Linhart and Zucchini, 1986] state the pre-determined significance level to enter and significance level to stay lack an adequate foundation since the F-tests used to determine significance are based on the normal distribution with normally distributed errors. These assumptions usually do not hold in multivariate data sets in the real world. In addition, determining the significance levels for adding or deleting a variable from the model is subjective.

Another model selection method is the branch and bound algorithm, which guarantees finding the "best" model for a given criterion. In general, the branch and bound algorithm searches a binary tree where each variable is either included or excluded from

the model. If the criterion cannot be improved by adding a specific variable to the model, then the algorithm pivots to another branch and does not consider models that include that variable. Therefore, only a small percentage of models actually have to be evaluated before the "best" model is determined.

Bao [Bao, et. al, 2005] proposed an implicit enumeration algorithm for model selection using information criteria. The implicit enumeration algorithm is a branch and bound algorithm that assumes a strict monotonic condition on the criterion adopted.

However, this research proposes to use the genetic algorithm (GA) with the kernel regression approach for model selection. A genetic algorithm is a heuristic search technique that finds optimal or near-optimal solutions for general optimization problems. Genetic algorithms are particularly useful when the objective function to be minimized does not satisfy differentiability assumptions for standard calculus-based optimization techniques.

Genetic algorithms were introduced by John Holland [Holland, 1992] and are based on the principles of evolutionary biology using natural selection. GA is independent from the complexity of the problem structure and is not likely to be restricted to a local optimal solution [Goldberg, 1989]. The GA searches much more of the feature space through randomization during its crossover and mutation steps so that it is more likely to find the global optimum. Furthermore, GA does not assume the monotonic selection criterion condition as the implicit enumeration algorithm requires. GA is particularly useful for high dimensional data mining applications.
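A minimal genetic-algorithm sketch for subset selection is given below, assuming a caller-supplied scoring function (for example, an AIC or ICOMP evaluation of the model encoded by a binary inclusion mask). The population size, crossover rate, and mutation rate shown are illustrative defaults, not the operational parameters reported later in Table 6.1.

import numpy as np

def ga_subset_selection(score, q, pop_size=30, generations=50,
                        crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Minimize score(mask) over binary inclusion masks of length q."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, q))
    pop[pop.sum(axis=1) == 0, 0] = 1           # avoid empty models
    for _ in range(generations):
        fitness = np.array([score(ind) for ind in pop])
        pop = pop[np.argsort(fitness)]         # sort so the best chromosome is first
        new_pop = [pop[0].copy()]              # elitism: keep the current best
        while len(new_pop) < pop_size:
            # tournament selection of two parents (lower index = better rank)
            p1 = pop[min(rng.integers(0, pop_size, 2))]
            p2 = pop[min(rng.integers(0, pop_size, 2))]
            child = p1.copy()
            if rng.random() < crossover_rate:  # single-point crossover
                cut = rng.integers(1, q)
                child[cut:] = p2[cut:]
            flip = rng.random(q) < mutation_rate
            child[flip] = 1 - child[flip]      # bit-flip mutation
            if child.sum() == 0:
                child[rng.integers(q)] = 1
            new_pop.append(child)
        pop = np.array(new_pop)
    fitness = np.array([score(ind) for ind in pop])
    return pop[np.argmin(fitness)], fitness.min()

Each chromosome is a 0/1 vector of length q; tournament selection, single-point crossover, and bit-flip mutation are the standard operators described above, and the best chromosome is carried over unchanged to the next generation.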

For each model identified with the GA, the optimal bandwidth will be calculated using

ICOMP as proposed by Bozdogan [Bozdogan, 2004] and AIC. ICOMP uses an information theoretic measure of overall model complexity based on Kullback-Leibler distance [Kullback, 1987] and the generalized covariance complexity of Van Emden [Van Emden, 1971].

The inverse Fisher information matrix (IFIM) of a model is used with ICOMP to exploit the asymptotic optimality property of the maximum likelihood estimators (MLEs) and is the most general form of ICOMP, denoted ICOMP(IFIM). ICOMP(IFIM) is then used to calculate the optimal bandwidth for multivariate kernel density estimation.

1.3 Contributions of this Dissertation

This dissertation contributes the following to the literature:

• Information complexity is used as the criterion to determine the optimal bandwidth

for multivariate kernel density estimation. The bandwidth is calculated for each

model identified by the genetic algorithm. The best model has the minimum ICOMP

using kernel regression smoothing within the genetic algorithm.

• The management science approach taken is the minimization of the non-linear infor-

mation complexity function to determine the global optimal model for high dimen-

sional multivariate data sets.

• The approach of using an implicit enumeration with the GA in the kernel regression

problem integrates management science optimization with statistical modeling.

1.4 Organization of this Dissertation

This dissertation contains eight chapters. Chapter 1 provides a brief description of the problem and proposed approach of kernel density estimation using information complexity.

Chapter 2 defines information criteria and complexity as it will be used throughout this dissertation and compares information criteria with heuristic methods and model diagnostics for model selection. Chapter 3 introduces kernel density estimation and provides a relevant literature review. Univariate, multivariate and variable bandwidth selectors currently in the literature are also reviewed. Chapter 4 discusses univariate kernel regression using simulated and real data sets for subset selection under different models. Chapter 5 extends kernel regression to the multivariate case and presents results of multivariate kernel regression for simulated data. Chapter 6 presents results of integrating multivariate kernel density estimation with GA using a real data set. Chapter 7 presents results of integrating the implicit enumeration with kernel regression from the same real data set. Chapter 8 summarizes conclusions from this dissertation and outlines areas for future work.

Chapter 2

Information Criteria and Complexity

2.1 Information Criteria

The concept of information criteria originated with Akaike [Akaike, 1973, Akaike, 1974,

Akaike, 1981] as an extension of the maximum likelihood principle. An information criterion is a measure of the goodness of fit of a model to the data. Data with greater uncertainty will exhibit less information. Akaike's information criterion (AIC) is defined by

AIC(k) = −2 log L(θ̂k) + 2k, (2.1)

where L(θ̂k) is the maximized likelihood of the parameter vector θ̂k and k is the number of free parameters within the model. The first term −2 log L(θ̂k) in Eqn. 2.1 is a measure of lack of fit when the MLEs of the parameters of the model are used, and 2k is a penalty term for the number of free parameters in the model, or lack of parsimony. Models with smaller AIC are better since the information in these models is maximized.

Bayesian extensions of information criteria have also been widely used. Schwarz [Schwarz, 1978] proposed a Bayesian model selection criterion assuming the data are generated from an exponential family of distributions. Schwarz's Criterion (SC) is defined as

SC(k) = −2 log L(θ̂k) + k log n, (2.2)

where n is the sample size. SC has the same lack of fit term as AIC, but modifies the

lack of parsimony or penalty term. The AIC lack of parsimony term is simply 2k in the

model, whereas the SC lack of parsimony term in Eqn. 2.2 is k log n. In general, the

minimization of SC leads to lower dimensional models than those obtained by minimizing

AIC. SC adjusts the log likelihood by a penalty for model complexity. For sample sizes n

≥ 8, the values of SC are always larger than AIC. SC favors models with fewer parameters

than does AIC.
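As a concrete illustration of Eqns. 2.1 and 2.2 for a Gaussian linear regression model, the sketch below evaluates −2 log L(θ̂) at the least squares fit and adds the two penalty terms. The function names are hypothetical, and the parameter count includes the residual variance as a free parameter.

import numpy as np

def neg2_loglik_ols(X, y):
    """-2 log-likelihood of a Gaussian linear model at the MLE."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.sum((y - X @ beta) ** 2) / n
    return n * (np.log(2 * np.pi * sigma2) + 1)

def aic(X, y):
    k = X.shape[1] + 1                             # coefficients plus sigma^2
    return neg2_loglik_ols(X, y) + 2 * k           # Eqn. 2.1

def sc(X, y):
    n, k = len(y), X.shape[1] + 1
    return neg2_loglik_ols(X, y) + k * np.log(n)   # Eqn. 2.2

Since k log n exceeds 2k whenever n ≥ 8, the SC value is larger than the AIC value for the same fit, which is why SC favors smaller models, as noted above.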

Another Bayesian extension of information criteria is Akaike’s [Akaike, 1978] Bayesian

Information Criterion (ABIC). For multiple linear regression, ABIC is defined as

ABIC(Regression) = n log(2πσ̂²) + n + k [log(y′Hy/k) − log(y′My/n)] + O(1), (2.3)

where σ̂² is the MLE of the residual variance based on a k-parameter model, H = X(X′X)⁻¹X′ is the hat or projection matrix, M = I − H, and O(1) denotes "of order 1," which can be dropped. Note that if both y′Hy/k and y′My/n converge in probability to a constant, then ABIC will behave like SC in Eqn. 2.2 above.

None of these information criteria functions adjusts for biases or guards the researcher against model misspecification in complex structures between variables in high dimensional data. A new information complexity function is needed to account for the complex structures and correlations among the variables.

2.2 Informational Complexity

Bozdogan [Bozdogan, 1988,Bozdogan, 1990a] proposed an information theoretic measure of overall model complexity based on the generalized covariance complexity index. Bozdogan’s

Information Complexity (ICOMP) was motivated by AIC, but extends it to adjust for biases and complex structures in the data. The loss function for ICOMP takes the form of a penalized likelihood function by

Loss = lack of fit + lack of parsimony + profusion of complexity. (2.4)

Note that the loss function in Eqn. 2.4 has an additional term for profusion of complexity that is not accounted for in the previous information criteria in Eqns. 2.1, 2.2, and 2.3. ICOMP defines complexity as the degree of interdependence among the components of the model. The complexity of a random vector is a measure of the interaction or the dependency between its components.

Consider a continuous p-dimensional random vector x = [x1, ..., xp]′ that has joint density function f(x) = f(x1, ..., xp) and marginal density functions fj(xj), j = 1, 2, ..., p. Suppose that x has a p-variate Normal distribution with mean µ and covariance Σ, such that

f(x) = (2π)^(−p/2) |Σ|^(−1/2) exp{−(1/2)(x − µ)′Σ⁻¹(x − µ)}, (2.5)

where µ = (µ1, ..., µp)′, |µj| < ∞, j = 1, 2, ..., p and Σ is positive definite. By Kullback and Leibler [Kullback, 1987], the Kullback-Leibler information I(x) is

I(x) = Σ_{j=1}^{p} [(1/2) log(2π) + (1/2) log(σjj) + 1/2] − (p/2) log(2π) − (1/2) log |Σ| − p/2, (2.6)

where σjj = σj² is the jth diagonal element of Σ and p is the dimension of Σ. I(x) in Eqn. 2.6 reduces to the information complexity of a covariance matrix Σ, C0(Σ), defined by Van Emden [Van Emden, 1971] as

C0(Σ) = (1/2) Σ_{j=1}^{p} log(σjj) − (1/2) log |Σ|. (2.7)

However, the function C0(Σ) is dependent on the coordinates of the original random variables x1, ..., xp associated with the variances σj², j = 1, 2, ..., p. Van Emden [Van Emden, 1971] showed that C0(Σ) is not an effective measure of the amount of complexity in the covariance matrix Σ since C0(Σ) depends on the marginal and common distributions of the random variables x1, ..., xp. In addition, the first term of Eqn. 2.7 would change under orthonormal transformations. A general definition of the complexity of Σ should be independent of the coordinates of the original x vector of random variables that are associated with the variances σj² for j = 1, 2, ..., p.

Therefore, the maximal covariance complexity is defined in order to improve on C0(Σ) in Eqn. 2.7. Bozdogan [Bozdogan, 1990b] states the maximal information theoretic measure of complexity of a covariance matrix Σ of a p-dimensional multivariate Normal distribution is

C1(Σ) ≡ max_T C0(Σ) = (p/2) log(tr(Σ)/p) − (1/2) log |Σ|, (2.8)

where the maximum is taken over the orthonormal transformation T of the overall coordinate systems x1, ..., xp.

The derivation and proof of Eqn. 2.8 is shown in Bozdogan [Bozdogan, 1990a]. Bozdogan and Haughton [Bozdogan and Haughton, 1998] and Bozdogan and Shigemasu [Bozdogan and Shigemasu, 1998] discuss the properties of C1(Σ). Clearly, C1(Σ) in Eqn. 2.8 is an upper bound to C0(Σ) in Eqn. 2.7 since it is the maximum over the orthonormal transformations. C1(Σ) measures both the inequality among the variances and the contribution of the covariances in the covariance matrix Σ. Also, C1(Σ) is independent of the coordinate system associated with the variances σj² for j = 1, 2, ..., p. The complexity is also interpreted as the log ratio between the geometric mean of the average total variation and the generalized variance since tr(Σ) is the total variation and |Σ| is the generalized variance. A large value of C1(Σ) indicates a high interaction between the variables, while a small value of C1(Σ) indicates less interaction between the variables. C1(Σ) can be used in model selection to determine the strength of model structures and correlations in multidimensional data. Note that C1(Σ) −→ 0 as Σ −→ I, where I is the (p × p) identity matrix.
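The two complexity measures in Eqns. 2.7 and 2.8 translate directly into code. The sketch below is a plain numerical illustration (not the dissertation's MATLAB implementation) and uses the log-determinant for numerical stability.

import numpy as np

def c0(sigma):
    """C0(Sigma) of Eqn. 2.7: half the sum of log variances minus half log|Sigma|."""
    sigma = np.asarray(sigma, dtype=float)
    return 0.5 * np.sum(np.log(np.diag(sigma))) - 0.5 * np.linalg.slogdet(sigma)[1]

def c1(sigma):
    """C1(Sigma) of Eqn. 2.8: maximal complexity, invariant to orthonormal rotations."""
    sigma = np.asarray(sigma, dtype=float)
    p = sigma.shape[0]
    return 0.5 * p * np.log(np.trace(sigma) / p) - 0.5 * np.linalg.slogdet(sigma)[1]

# C1 approaches 0 as Sigma approaches the identity matrix
print(c1(np.eye(3)))                              # 0.0
print(c1(np.array([[1.0, 0.8], [0.8, 1.0]])))     # positive: strong interaction

The second call returns a positive value because the strong off-diagonal covariance inflates the complexity, matching the interpretation given above.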

Liu [Liu, 2006] states the complexity of the model covariance structure provides a

numerical measure to assess both parameter redundancy and stability together in one

measure. When the parameters are stable the covariance matrix should be approximately

a diagonal matrix. C1(Σ) penalizes the scaling of the ellipsoidal dispersion, while the

circular distribution has been accounted for. Therefore, ICOMP uses C1(Σ) without any

transformations of Σ without discarding the use of C0(Σ). The most general ICOMP approach to measure the fitness between multivariate linear

or nonlinear structural models and observed data is ICOMP(IFIM), where IFIM is the

inverse Fisher information matrix. The maximum likelihood estimators (MLEs) of model

parameters are asymptotically normally distributed with IFIM [Bozdogan, 2000]. The

general form of ICOMP(IFIM) is

ICOMP(IFIM) = −2 log L(θ̂) + 2C1(F̂⁻¹(θ̂)), (2.9)

where C1 denotes the maximal informational complexity of the estimated IFIM F̂⁻¹ and θ̂ denotes the estimated model parameter MLE vector. The term 2C1(F̂⁻¹(θ̂)) in Eqn. 2.9 is the penalty term which measures the complexity of the accuracy of the parameter estimates of the model as measured by IFIM. The C1 function gives a scalar measure of the Cramér-Rao lower bound matrix which accounts for the accuracy of the estimated parameters and

implicitly adjusts for the number of free parameters included in the model. ICOMP(IFIM)

also takes into account parameter orthogonality, redundancy, and stability [Bozdogan,

2000].

An alternative formulation of ICOMP in Eqn. 2.9 provides an achievable accuracy

of the parameter estimates by considering the entire parameter space of the model. The alternative formulation for ICOMP(IFIM) is

ICOMP(IFIM) = −2 log L(θ̂) + s log(tr(F̂⁻¹)/s) − log |F̂⁻¹|, (2.10)

where s = dim(F̂⁻¹). The minimum of ICOMP over the class of models Mk, k = 1, 2, ..., K is chosen as the best model. The proof is given in [Bozdogan, 2000].

ICOMP(IFIM) quantifies complexity not as the number of parameters in the model, but the degree of interdependence or correlational structure among the parameter estimates.

Therefore, ICOMP(IFIM) provides a more sensible penalty term than AIC, SC or BIC. Furthermore, the lack of parsimony is automatically adjusted by C1(F̂⁻¹(θ̂)) across the competing models as the parameter spaces of the models are constrained in the model selection process [Bozdogan, 2000].

Bozdogan [Bozdogan, 2000] derives IFIM for the multiple regression model

y = Xβ + ε (2.11)

where X is the (n × q) model matrix, β is the (q × 1) parameter vector, y is the (n × 1) response variable vector, and ε is the (n × 1) error vector. IFIM for the multiple regression model is

F̂⁻¹ = Cov(β̂, σ̂²) =
[ σ̂²(X′X)⁻¹    0
  0′           2σ̂⁴/n ]. (2.12)

ICOMP(IFIM) for the multiple regression model is derived in [Bozdogan, 2000] as

ICOMP(IFIM) = n log(2π) + (n − q − 2) log(σ̂²) + n + (q + 1) log[(σ̂² tr((X′X)⁻¹) + 2σ̂⁴/n)/(q + 1)] − log |(X′X)⁻¹| + log(n/2), (2.13)

where q = k + 1 and k = number of parameters estimated in the model.
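The following sketch assembles the block-diagonal estimated IFIM of Eqn. 2.12 for a single regression model and applies the C1 penalty of Eqn. 2.10, which is numerically the same quantity as the closed form in Eqn. 2.13. It is only an illustration under the Gaussian assumptions above; the function name and use of numpy are assumptions of this sketch, not the M3 Toolbox code.

import numpy as np

def icomp_ifim_regression(X, y):
    """ICOMP(IFIM) for y = X beta + eps, following Eqns. 2.10 and 2.12."""
    n, q = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.sum((y - X @ beta) ** 2) / n          # MLE of the error variance
    xtx_inv = np.linalg.inv(X.T @ X)

    # estimated inverse Fisher information, block-diagonal in (beta, sigma^2)
    ifim = np.zeros((q + 1, q + 1))
    ifim[:q, :q] = sigma2 * xtx_inv
    ifim[q, q] = 2 * sigma2 ** 2 / n

    s = q + 1
    c1 = 0.5 * s * np.log(np.trace(ifim) / s) - 0.5 * np.linalg.slogdet(ifim)[1]
    neg2_loglik = n * (np.log(2 * np.pi * sigma2) + 1)
    return neg2_loglik + 2 * c1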

11 Bozdogan [Bozdogan, 2000] derives IFIM for the multivariate regression model

Y = XB + E, (2.14) where X is the (n × q) model matrix with rank(X) = q +k +1, B is the (q × p) parameter matrix, Y is the (n × p) observed matrix of p response variables on n observations, and

E is the (n × p) matrix of unobserved random errors. ICOMP(IFIM) for the multivariate regression model is

ICOMP(IFIM) = np log(2π) + n log |Σ̂| + np + p(p + q) log[(tr(Σ̂) tr((X′X)⁻¹) + (1/2) tr(Σ̂²) + (1/2) tr(Σ̂)² + Σ_j σ̂jj²)/(p(p + q))] − (p + q + 1) log |Σ̂| − p log |(X′X)⁻¹| − p log(2). (2.15)

2.3 Comparison of Model Selection Techniques

Other techniques have been used for model selection besides information criteria. The most common model diagnostics are the root mean square error (RMSE), the adjusted R2, and

Mallow’s Cp statistic. Models with lower RMSE and Mallow’s Cp are considered better, while models with higher adjusted R2 are preferred. These methods are limited since they do not account for the correlational structure in the data, but do penalize for the inclusion of more variables in the model.

Common heuristic techniques include backward elimination, forward selection, and stepwise regression. In backward elimination parameters are estimated for the full saturated model. The regression parameters for individual variables are tested to determine whether they are significantly different from zero. The variable with the highest p-value exceeding some pre-selected significance level is removed from the model and the parameters for the remaining variables are estimated again. This process continues until no variables can be removed from the model. Once a variable leaves the model it cannot be included again.

12 In forward selection the model begins with an intercept term. The variable that is the most significant with the smallest p-value is kept in the model if it is below a pre- selected significance level to enter the model. All other variables are tested individually with this variable. The variable with the next lowest p-value that is below the pre-selected significance level (such as 0.15) is also kept in the model. This process continues until no more variables can be kept in the model. Once a variable enters the model it can not be excluded.

Stepwise regression is a combination of backward elimination and forward selection. Stepwise starts with the intercept term and adds the most significant variable. Additional significant variables are added one at a time, but variables already in the model can be deleted if they are no longer significant in the presence of other variables. Stepwise regression modeling is complete when no variables can enter the model and no variables can be deleted at the pre-selected significance level.
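A schematic implementation of the forward selection procedure described above is sketched below, using the partial F-test p-value and a pre-selected significance level to enter (0.15 here, purely as an example). The helper names and the use of scipy are assumptions of this sketch.

import numpy as np
from scipy import stats

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

def forward_selection(X, y, alpha_enter=0.15):
    """Greedy forward selection on the columns of X using partial F-tests."""
    n, q = X.shape
    selected, remaining = [], list(range(q))
    while remaining:
        rss_cur = rss(np.column_stack([np.ones(n)] + [X[:, j] for j in selected]), y)
        best = None
        for j in remaining:
            cols = selected + [j]
            rss_new = rss(np.column_stack([np.ones(n)] + [X[:, c] for c in cols]), y)
            df2 = n - len(cols) - 1
            f_stat = (rss_cur - rss_new) / (rss_new / df2)
            p_value = stats.f.sf(f_stat, 1, df2)
            if best is None or p_value < best[0]:
                best = (p_value, j)
        if best[0] < alpha_enter:       # enter the most significant variable
            selected.append(best[1])
            remaining.remove(best[1])
        else:
            break                       # no remaining variable meets the level to enter
    return selected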

Sawa [Sawa, 1978] developed a model selection criterion that was derived from a

Bayesian modification of the AIC criterion. Sawa defined Bayesian Information Criteria

(BIC) as

BIC = n log(σ̂²/n) + 2(k + 2) n σ̂p²/σ̂² − 2n² σ̂p⁴/σ̂⁴, (2.16)

where σ̂p² is the MLE of the residual variance based on the full p-parameter model. Beal [Beal, 2007] compared the information criteria AIC, BIC, and SC, model diagnostics, and heuristic techniques on a simulated multivariate data set to determine which techniques more consistently chose the true model. Beal showed that information criteria consistently outperformed both model diagnostics and heuristic techniques for selecting the true model. Model diagnostics RMSE, Mallow's Cp, and adjusted R2 performed poorly by selecting the true model only 28% of the time on average for n = 1000 and 26% of the time on average for n = 100. Adjusted R2 and RMSE consistently performed the worst compared to Mallow's Cp.

13 Beal also showed the heuristic methods backward elimination, forward selection, and stepwise regression performed better than the model diagnostics by selecting the true model

43% of the time on average for n = 1000 and 41% of the time on average for n = 100.

However, information criteria performed better than all the other methods by selecting the

true model 63% of the time on average for n = 1000 and 59% of the time on average for n

= 100.

2.4 Conclusions

Three measures of information criteria were defined for AIC, ABIC, and SC. Bozdogan’s

ICOMP [Bozdogan, 1988, Bozdogan, 1990a] was defined and contrasted with AIC, ABIC, and SC. ICOMP incorporates a profusion of complexity term and accounts for the correlational structure in the data, as opposed to AIC, BIC, and SC. Recent research by Beal [Beal, 2007] compares heuristic (forward selection, backward elimination, and stepwise regression) and model diagnostic methods (adjusted R2, Mallow's Cp, and RMSE) to

information criteria and shows information criteria superior to both heuristic and model

diagnostics for model selection.

14 Chapter 3

Kernel Density Estimation

3.1 Introduction

Kernel density estimation is a data smoothing technique that uses independent and identically distributed sample data to estimate a probability density function f̂(x). Kernel density estimation is used as a data mining technique in exploratory data analysis and visualization. Kernel density estimators are categorized into two broad classes: parametric and nonparametric estimators. Parametric kernel density estimators assume a distribution functional form that simplifies the problem to estimating the parameters of the assumed distribution. The most common parametric estimators are maximum likelihood estimators (MLEs). The most common distribution used for MLEs is the Gaussian (normal) distribution. Parametric density estimators are usually not as computationally intensive as nonparametric estimators.

Density estimation is used to approximate a “true” probability density function f(x)

from sample data of independent and identically distributed observations. The univariate

kernel estimate in the general form is

f̂(x) = (1/(nh)) Σ_{i=1}^{n} K((x − xi)/h), (3.1)

where x1, ..., xn are independent, identically distributed observations from density f(x)

and K(·) is the kernel function that is positive, integrates to 1, and satisfies some regularity conditions [Parzen, 1962]. The parameter h > 0 is the window width that controls the tradeoff between smoothness and fidelity of the estimate [Bozdogan, 2000]. The bandwidth h > 0 is the width of the bins the sample data are categorized into for density estimation.

Choosing an h that is too small will result in a density that is too abrupt and fluctuating.

However, choosing an h that is too large masks important features of the data by oversmoothing the density. Several techniques in the literature are presented in Section 3.3 for calculating the optimum bandwidth h without using information criteria or complexity.

A commonly used univariate parametric kernel is the Gaussian or normal kernel given by

K(x) = (1/√(2π)) exp(−x²/2) (3.2)

for x ∈ R. Figure 3.1 is a univariate plot of the shape of the Gaussian kernel shown in Eqn. 3.2.

Figure 3.1: Univariate plot of the Gaussian kernel function.

16 Substituting the Gaussian kernel from Eqn. 3.2 into Eqn. 3.1 yields the univariate

Gaussian kernel density estimator

f̂(x) = (1/(nh)) Σ_{i=1}^{n} (1/√(2π)) exp(−(x − xi)²/(2h²)). (3.3)
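Eqn. 3.3 translates directly into a few lines of code. The Python sketch below is only an illustration (the dissertation's computations use MATLAB), with the sample, grid, and bandwidth chosen arbitrarily.

import numpy as np

def gaussian_kde(x_grid, sample, h):
    """Univariate Gaussian kernel density estimate of Eqn. 3.3."""
    x_grid = np.asarray(x_grid, dtype=float).reshape(-1, 1)
    sample = np.asarray(sample, dtype=float).reshape(1, -1)
    u = (x_grid - sample) / h                        # (x - x_i)/h for every pair
    kernels = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (sample.size * h)

# example: estimate the density of 200 standard normal draws on a grid
rng = np.random.default_rng(1)
data = rng.standard_normal(200)
grid = np.linspace(-4, 4, 81)
f_hat = gaussian_kde(grid, data, h=0.4)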

A nonparametric kernel density estimator does not require a functional form for the density estimate. Nonparametric density estimators allow more flexibility than their para- metric counterparts, but are typically more computationally intensive. However, with mod- ern, high speed computers, this usually is no longer an issue. Examples of nonparametric density estimators are histograms, frequency polygons, penalized likelihood estimators and spline estimators [Duong, 2004,Scott and Sain, 2004]. Both parametric and nonparametric kernel density estimators require the bandwidth parameter h to be estimated.

3.2 Review of Literature on Kernel Density Estimation

Kernel density estimation has been widely discussed in the literature. A major area within the literature is using cross validation for kernel density estimation. Sheather and

Jones [Sheather and Jones, 1991] recommend least squares or unbiased cross-validation algorithms for density estimation. However, cross-validation on time series data is not recommended since serial correlation is present. Bowman [Bowman, 1984] extends cross- validation methods for smoothing density estimates. Hall and Marron [Hall and Marron,

1987] use least squares cross validation to minimize integrated square error in nonparamet- ric density estimation.

Another broad class within the literature is nonparametric kernel density estimation

[Tapia and Thompson, 1978, Wagner, 1975]. Bhattacharyya [Bhattacharyya, et. al, 1989] discuss unweighted nonparametric kernel density estimators, while Chu [Chu, 1994] uses

17 nonparametric regression function through kernel density estimation. Devroye [Devroye and Gyorfi, 1985] discusses the L1 view of nonparametric density estimation. Epanech- nikov [Epanechnikov, 1969] uses nonparametric estimation on multivariate densities, as do

Loftsgaarden and Quesenberry [Loftsgaarden and Quesenberry, 1965]. Hall and Wand [Hall and Wand, 1988a] use density differences on nonparametric discrimination. Hjort and

Glad [Hjort and Glad, 1995] proposed a technique that uses a parametric start for non- parametric density estimation. Minnotte and Scott [Minnotte and Scott, 1993] use a mode tree to visualize nonparametric density features. Scott [Scott, 1985] uses averaged shifted histograms for nonparametric density estimation in several dimensions. Scott [Scott, et. al, 1980] uses a discrete maximum penalized likelihood criteria for nonparametric proba- bility density estimation. Terrell and Scott [Terrell and Scott, 1985] discuss oversmoothed density estimates.

Scott and Sain [Scott and Sain, 2004] discuss averaged shifted histograms with kernels from several distributions: triangular, beta and normal. Silverman [Silverman, 1981] dis- cusses beta densities and polynomial kernels as they are asymptotically related to normal density kernels. Silverman [Silverman, 1982] explores kernel density estimation using the fast Fourier transformation.

In addition to the statistical literature, kernel density estimation is also researched in management science literature. For example, Sahin and Diwekar [Sahin and Diwekar,

2004] propose a new algorithm for stochastic programming using reweighting through kernel density estimation. The algorithm is a better optimization of nonlinear uncertain systems.

Scott [Scott, 1976] also uses optimization techniques to estimate nonparametric densities.

However, a thorough review of the literature shows no published research using information criteria or information complexity in conjunction with kernel density estimation.

18 3.3 Bandwidth Selectors

The literature contains many different approaches for determining the optimum bandwidth h. The most basic nonparametric density estimator is the histogram. Sturges [Sturges,

1926] proposed a rule for the bin width of a histogram from a random sample, but Scott and

Sain [Scott and Sain, 2004] showed this rule of thumb was far from optimal in a stochastic setting. Duin [Duin, 1976] proposed a bandwidth based on the leave-one-out maximum likelihood for histograms. However, the procedure is not consistent for densities with heavy tails. Brewer [Brewer, 1998] used a modeling approach for bandwidth selection. Cao [Cao,

2001] explored the relative efficiencies of bandwidth. Chiu [Chiu, 1991] researched the effect of discretization error on bandwidth selection and an automatic bandwidth selector

[Chiu, 1992]. Chiu [Chiu, 1996] also provided a comparative review of bandwidth selectors.

Delaigle and Gijbels [Delaigle and Gijbels, 2004] use the bootstrap for bandwidth selection from a contaminated sample.

Eggermont and Lariccia [Eggermont and Lariccia, 1996] proposed a simpler bandwidth selector that is effective in the univariate case. Faraway and Jhun [Faraway and Jhun,

1990] used the bootstrap to select the bandwidth. Gajek and Lenic [Gajek and Lenic,

1993] derived an approximate necessary condition for the optimal bandwidth selector. Hall and Marron [Hall and Marron, 1991] derived lower bounds for bandwidth selection in density estimation. Hall [Hall, et al, 1991] used a data based approach to derive an op- timal bandwidth selection. Hallin and Tran [Hallin and Tran, 1996] derived an optimal bandwidth for linear processes assuming asymptotic normality. Hazelton [Hazelton, 1996] explores bandwidth selection for local density estimation and proposed an optimal local bandwidth selector [Hazelton, 1999]. M. C. Jones has contributed extensively to the lit- erature on bandwidth selection for kernel density estimation. For example, Jones [Jones,

1991c] explored an automatic bandwidth selection in extensions to basic kernel density estimation.

Another type of bandwidth selector is the plug-in bandwidth selector. Cwik and Ko- ronacki [Cwik and Koronacki, 1997a] introduce a combined adaptive-mixtures plug-in es-

19 timator for multivariate probability densities. Duong and Hazelton [Duong and Hazel- ton, 2003] discuss plug-in bandwidth selectors for the bivariate kernel density estima- tion. Loader [Loader, 1999] compares classical and plug-in bandwidth selectors, as does

Park [Park, 1989]. Song [Song, Seog and Cho, 1991] discuss asymptotically optimal plug-in bandwidth selectors. Wand and Jones [Wand and Jones, 1994] extend plug-in bandwidth selectors to the multivariate case.

Authors exploring the statistical properties of using the square root law for bandwidth selection include Abramson [Abramson, 1982], Jones [Jones, Marron and Park, 1991], Wu and Tsai [Wu and Tsai, 2004], and Yang [Yang, 2000]. Scott and Sain [Scott and Sain, 2004] recommend least squares or unbiased cross-validation algorithms for optimal bandwidth estimation.

3.3.1 Univariate Fixed Bandwidth Selectors

Rosenblatt [Rosenblatt, 1956] and Parzen [Parzen, 1962] first introduced univariate kernel density estimators in the literature. Wand and Jones [Wand and Jones, 1995] contains a comprehensive history of univariate bandwidth selectors with an extensive bibliography.

These authors provide references to all the major types of bandwidth selectors, including plug-in and cross validation selectors.

Mean Integrated Square Error Selector

Many univariate bandwidth selectors assume a normal or Gaussian distribution. One common estimator of the optimal bandwidth h, obtained by minimizing the Mean Integrated Square Error (MISE) under a normal reference density, is given by

h_{opt} = \frac{1.06}{n^{1/5}} \min\left( \hat{\sigma}, \frac{\hat{R}}{1.34} \right),   (3.4)

where \hat{\sigma} is the sample standard deviation and \hat{R} is the interquartile range. This univariate selector works well when the data are normally distributed, but the smoothing parameter h is wrong if the population is not normally distributed or is a mixture of normals. If the true density is a mixture of normal distributions, the estimated density obtained using h_{opt} hides a lot of structure by smoothing away the modes.
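For illustration, a minimal Python sketch of Eqn. 3.4 follows; the function name and the simulated sample are illustrative assumptions and not part of the original analysis.

import numpy as np

def normal_reference_bandwidth(x):
    """Rule-of-thumb bandwidth of Eqn. 3.4: h = 1.06 * min(sd, IQR/1.34) / n^(1/5)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    sigma_hat = x.std(ddof=1)                           # sample standard deviation
    iqr_hat = np.subtract(*np.percentile(x, [75, 25]))  # interquartile range R-hat
    return 1.06 * min(sigma_hat, iqr_hat / 1.34) / n ** 0.2

# Example: bandwidth for 50 simulated normal observations
rng = np.random.default_rng(0)
print(normal_reference_bandwidth(rng.normal(10.0, 1.0, size=50)))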

Histogram Selector

The first publication to advocate a systematic rule for the construction of a histogram was due to Sturges [Sturges, 1926]. For a histogram of binomial data with k bins and p = 0.5, the optimal bandwidth h is given by Scott and Sain [Scott and Sain, 2004] as

h = \frac{x_{\max} - x_{\min}}{1 + \log_2(n)},   (3.5)

where x_{\max} is the largest order statistic, x_{\min} is the smallest order statistic, and n is the sample size. However, Scott and Sain [Scott and Sain, 2004] showed Eqn. 3.5 is far from optimal in a stochastic setting. Sturges' bin width in Eqn. 3.5 decreases too slowly as the sample size increases, so Sturges' rule generally oversmooths the data.

Another histogram bandwidth estimator is proposed by Scott and Sain [Scott and Sain, 2004]. For normal distributions, Scott's rule is given as

h = \hat{\sigma} \left( \frac{24\sqrt{\pi}}{n} \right)^{1/3},   (3.6)

where \hat{\sigma} is the estimated standard deviation.
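As a concrete illustration of Eqns. 3.5 and 3.6, a minimal Python sketch is given below; the function names and the simulated data are illustrative assumptions only.

import numpy as np

def sturges_bin_width(x):
    """Sturges' bin width, Eqn. 3.5: (x_max - x_min) / (1 + log2(n))."""
    x = np.asarray(x, dtype=float)
    return (x.max() - x.min()) / (1.0 + np.log2(x.size))

def scott_bin_width(x):
    """Scott's normal-reference bin width, Eqn. 3.6: sigma-hat * (24*sqrt(pi)/n)^(1/3)."""
    x = np.asarray(x, dtype=float)
    return x.std(ddof=1) * (24.0 * np.sqrt(np.pi) / x.size) ** (1.0 / 3.0)

rng = np.random.default_rng(1)
sample = rng.normal(size=200)
print(sturges_bin_width(sample), scott_bin_width(sample))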

Frequency Polygon Selector

Scott [Scott, 1985] showed that the smoothness of the frequency polygon reduces not only the bias but also the variance compared to the histogram. For normally distributed data, Scott proposed the optimal bandwidth h given by

h = \frac{2.15\,\hat{\sigma}}{n^{1/5}},   (3.7)

where \hat{\sigma} is the estimated standard deviation. However, for non-normally distributed data, Eqn. 3.7 is no longer optimal.

Cross Validation Selector

Duong [Duong, 2004] describes cross validation methods that make use of leave-one-out estimators. These estimators have the form

\hat{f}_j(x_j) = \frac{1}{(n-1)h} \sum_{\substack{i=1 \\ i \neq j}}^{n} K\left( \frac{x_j - x_i}{h} \right).   (3.8)

This estimator leaves out the jth data point. The kernel density estimate is computed on the rest of the data and then reevaluated at the omitted data point. If the estimate is appropriate, then \hat{f}_j(x_j) should be non-zero since the data set already has a data point at x_j.

Least Squares Cross Validation Selector

Least squares cross validation (LSCV) was developed independently by Rudemo [Rudemo, 1982] and Bowman [Bowman, 1984]. The goal is to find the bandwidth that minimizes

LSCV(h) = \int_{-\infty}^{\infty} \hat{f}(x;h)^2\,dx - \frac{2}{n(n-1)} \sum_{j=1}^{n} \sum_{\substack{i=1 \\ i \neq j}}^{n} K_h(x_j - x_i),   (3.9)

where K_h(\cdot) is the scaled kernel function for a given bandwidth h. LSCV(h) is an unbiased estimator of the MISE up to a constant that does not depend on h, so the LSCV selector is sometimes called the unbiased cross validation selector. This selector does not rely on asymptotic expansions, unlike the biased and smoothed cross validation methods described in the next two subsections.
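The following is a minimal Python sketch of Eqn. 3.9 for the special case of a Gaussian kernel, for which the integral of the squared estimate has a closed form; the grid search, function name, and simulated data are illustrative assumptions.

import numpy as np
from scipy.stats import norm

def lscv(h, x):
    """LSCV score of Eqn. 3.9 for a Gaussian kernel.

    For Gaussian kernels, the integral of f-hat squared equals
    (1/n^2) * sum_ij phi(x_i - x_j; sd = sqrt(2)*h).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    diff = x[:, None] - x[None, :]
    int_term = norm.pdf(diff, scale=np.sqrt(2.0) * h).sum() / n ** 2
    off_diag = norm.pdf(diff, scale=h)
    np.fill_diagonal(off_diag, 0.0)            # exclude the i = j terms
    cv_term = 2.0 * off_diag.sum() / (n * (n - 1))
    return int_term - cv_term

# Grid search for the LSCV-optimal bandwidth
rng = np.random.default_rng(2)
data = rng.normal(size=100)
grid = np.linspace(0.05, 1.5, 60)
print(grid[np.argmin([lscv(h, data) for h in grid])])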

Biased Cross Validation Selector

Scott and Terrell [Scott and Terrell, 1987] introduced a biased cross validation bandwidth selector that minimizes an estimate of the asymptotic mean integrated square error (AMISE). The biased cross validation selector as a function of the bandwidth h is

BCV(h) = \frac{R(K)}{nh} + \frac{h^4 \mu_2(K)^2}{4 n(n-1)} \sum_{j=1}^{n} \sum_{\substack{i=1 \\ i \neq j}}^{n} \left( K''_h * K''_h \right)(x_j - x_i),   (3.10)

where R(K) = \int K(x)^2\,dx.

Smoothed Cross Validation Selector

Hall [Hall, Marron and Park, 1992] proposed smoothed cross validation (SCV) as a hybrid of estimating MISE and AMISE. This selector uses the asymptotic integrated variance R(K)/(nh) and an estimate of the exact non-asymptotic integrated squared bias to derive the SCV function given by

SCV(h) = \frac{R(K)}{nh} + \frac{1}{n^2} \sum_{j=1}^{n} \sum_{i=1}^{n} \left( K_h * K_h * L_g * L_g - 2 K_h * L_g * L_g + L_g * L_g \right)(x_j - x_i),   (3.11)

where L is a pilot kernel and g is the pilot bandwidth. Hall also showed that SCV(h) is asymptotically equivalent to the smoothed bootstrap proposed by Taylor [Taylor, 1989] and Faraway and Jhun [Faraway and Jhun, 1990].

Plug-In Selector

Plug-in bandwidth selectors are based on the AMISE

AMISE\,\hat{f}(h) = \frac{R(K)}{nh} + \frac{h^4 \mu_2(K)^2 \psi_4}{4}   (3.12)

as a starting point. The AMISE requires that h \to 0 and nh \to \infty as n \to \infty and that f'' is piecewise continuous and square integrable. The function \psi_4 is defined by

\psi_4 = \int_{-\infty}^{\infty} f^{(4)}(x) f(x)\,dx.   (3.13)

The estimate \hat{\psi}_4 is then substituted into the AMISE estimate given by

PI(h) = \frac{R(K)}{nh} + \frac{h^4 \mu_2(K)^2 \hat{\psi}_4}{4}.   (3.14)

The plug-in univariate selector \hat{h}_{PI} that minimizes PI(h) has the closed form solution

\hat{h}_{PI} = \left[ \frac{R(K)}{\mu_2(K)^2\, \hat{\psi}_4\, n} \right]^{1/5}.   (3.15)

Sheather and Jones [Sheather and Jones, 1991] proposed the sample mean of the fourth derivative of a pilot kernel density estimate of f as an estimator for \psi_4, where

\hat{f}_P(x; g) = \frac{1}{n} \sum_{j=1}^{n} L_g(x - X_j),   (3.16)

and L is the pilot kernel and g is the pilot bandwidth. Therefore, an estimate of \psi_4 is given by

\hat{\psi}_4(g) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}_P^{(4)}(X_i; g) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} L_g^{(4)}(X_i - X_j),   (3.17)

where the most appropriate pilot bandwidth g is calculated from an algorithm by Sheather and Jones [Sheather and Jones, 1991].

An important conclusion from a review of these univariate bandwidth selectors is that there is no uniformly best bandwidth selector for all target densities. The shape and structure of the target density heavily influence which selectors perform well [Duong, 2004].

Note that none of these univariate bandwidth selectors has utilized information complexity to optimize the bandwidth h.

3.3.2 Multivariate Fixed Bandwidth Selectors

Extensions of univariate bandwidth selectors to the multivariate case are also in the literature, though they are typically more computationally intensive than univariate selectors. Multivariate bandwidth selectors require a bandwidth matrix rather than a scalar bandwidth as in the univariate case. Bowman and Azzalini [Bowman and Azzalini, 1997], Scott [Scott, 1992], Silverman [Silverman, 1986], Simonoff [Simonoff, 1996] and Wand and Jones [Wand and Jones, 1995] provide an extensive overview of the research already cited in the literature for multivariate density estimation. Duong and Hazelton [Duong and Hazelton, 2005] explore convergence rates for unconstrained bandwidth matrices and cross validation bandwidth matrices for multivariate kernel density estimation. Sain [Sain, Baggerly and Scott, 1994] and Scott [Scott, Tapia and Thompson, 1997] also extend bandwidth selection to multivariate kernel density estimation.

The general form of the multivariate kernel density estimator from Wand [Wand and Jones, 1993] of f(x) for a vector x = (x_1, ..., x_p)' is given by

\hat{f}_H(x) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - x_i),   (3.18)

where K(\cdot) is a multivariate kernel density function and H is a symmetric positive definite (p \times p) bandwidth matrix. In the sample point variant of this estimator, each data point x_i is used in the computations through its own bandwidth H_i; the sample point estimator has superior performance relative to kernel density estimators where the variable bandwidth is associated with the center of the kernel x.
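A minimal Python sketch of Eqn. 3.18 is given below, using the scaling K_H(u) = |H|^{-1/2} K(H^{-1/2} u) with a standard Gaussian kernel, consistent with the convention later used in Eqn. 3.45; the function name, the example bandwidth matrix, and the simulated data are illustrative assumptions.

import numpy as np
from scipy.stats import multivariate_normal

def mv_kde(x, data, H):
    """Multivariate kernel density estimate of Eqn. 3.18 at the point x.

    With a standard Gaussian kernel, |H|^(-1/2) K(H^(-1/2) u) is simply the
    Gaussian density with covariance matrix H evaluated at u.
    """
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    kernel = multivariate_normal(mean=np.zeros(x.size), cov=H)
    return kernel.pdf(x - data).mean()

# Example with a 2 x 2 bandwidth matrix
rng = np.random.default_rng(3)
sample = rng.normal(size=(50, 2))
H = np.array([[0.3, 0.05],
              [0.05, 0.2]])
print(mv_kde([0.0, 0.0], sample, H))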

Asymptotic Mean Squared Error Selector

In the multivariate setting with a scalar bandwidth h, the asymptotic mean squared error (AMSE) of \hat{f}(x) is given by

AMSE\,\hat{f}(x) = \frac{R(K) f(x)}{n h^d} + \frac{h^4 \mu_2(K)^2\, \mathrm{tr}(D^2 f(x))}{4}.   (3.19)

The minimizer of AMSE\,\hat{f}(x) can be shown to be of order n^{-1/(d+4)}. Closed forms for the AMSE optimal bandwidths are not available for d > 2. Epanechnikov [Epanechnikov, 1969] optimizes both the bandwidths and the kernel function in the context of AMISE. If h_i = h for i = 1, 2, ..., d, a closed form solution for the optimal bandwidth h_{AMISE} is given by

h_{AMISE} = \left[ \frac{d\, R(K)}{n\, \mu_2(K)^2 \int_{-\infty}^{\infty} \mathrm{tr}^2(D^2 f(x))\,dx} \right]^{1/(d+4)}.   (3.20)

The optimal kernel is known as the Epanechnikov kernel. However, the Epanechnikov kernel establishes only a theoretical basis for bandwidth selectors; it does not provide data based algorithms for estimating the unknown quantities in these expressions.

Least Squares Cross Validation Selector

Stone [Stone, 1984] proposed the multivariate least squares cross validation (LSCV) criterion that is a generalization of the univariate case described previously. The LSCV proposed by Stone is given by

LSCV(H) = \int \hat{f}(x; H)^2\,dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{-i}(X_i; H),   (3.21)

where \hat{f}_{-i} denotes the leave-one-out kernel density estimate that omits observation X_i.

Biased Cross Validation Selector

Sain [Sain, Baggerly and Scott, 1994] generalized the univariate biased cross validation (BCV) selector described previously using diagonal bandwidth matrices. The BCV selector they develop is

BCV(H) = \frac{R(K)}{n \sqrt{|H|}} + \frac{(\mathrm{vech}^T H)\, \hat{\Psi}_4\, (\mathrm{vech}\, H)}{4},   (3.22)

where

\hat{\Psi}_4(H) = \frac{1}{n(n-1)} \sum_{j=1}^{n} \sum_{\substack{i=1 \\ i \neq j}}^{n} K_H^{(4)}(X_i - X_j).   (3.23)

Based on their asymptotic results and simulation study, Sain [Sain, Baggerly and Scott, 1994] recommends this BCV selector.

Smoothed Cross Validation Selector

Sain [Sain, Baggerly and Scott, 1994] also developed a general multivariate smoothed cross validation (SCV) criterion that is an extension of Eqn. 3.11 given by

SCV(H) = \frac{R(K)}{n \sqrt{|H|}} + \frac{1}{n^2} \sum_{j=1}^{n} \sum_{i=1}^{n} \left( K_H * K_H * L_G * L_G - 2 K_H * L_G * L_G + L_G * L_G \right)(X_j - X_i),   (3.24)

where L is a pilot kernel and G is the pilot bandwidth matrix. Sain [Sain, Baggerly and Scott, 1994] show that for G = H, the SCV selector is suboptimal since the possibility of optimally selecting the pilot G is not accounted for.

Plug-In Selector

Wand and Jones [Wand and Jones, 1994] proposed a multivariate plug-in bandwidth selector that is similar to the BCV selector except for the way \hat{\Psi}_4 is estimated. The plug-in (PI) multivariate bandwidth selector is an extension of the univariate selector approach by Sheather and Jones [Sheather and Jones, 1991] and is given by

PI(H) = \frac{R(K)}{n \sqrt{|H|}} + \frac{(\mathrm{vech}^T H)\, \hat{\Psi}_4\, (\mathrm{vech}\, H)}{4},   (3.25)

where

\hat{\Psi}_4(G) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K_G^{(4)}(X_i - X_j).   (3.26)

Here G is independent of H and G \neq H. Wand and Jones [Wand and Jones, 1994] propose an algorithm to find an appropriate pilot bandwidth. They show through a simulation study that the properties of univariate plug-in selectors carry over to the multivariate case.

Cross validation and plug-in selectors are currently the most commonly used selectors [Duong, 2004].

Maximal Smoothing Selector

Terrell [Terrell, 1990] proposed a maximal smoothing selector that induces the smoothest density estimate that is consistent with the data scale. Terrell uses the kernel K scaled such that

I_d = \int x x^T K(x)\,dx,   (3.27)

so that the AMISE takes the form

AMISE\,\hat{f}(x; h) = \frac{R(K)}{n h^d} + \frac{h^4}{4} \int \mathrm{tr}\{H' D^2 f(x)\}\,dx,   (3.28)

with minimizing bandwidth

h = \left[ \frac{d\, R(K)}{n \int \mathrm{tr}^2\{D^2 f_0(x)\}\,dx} \right]^{1/(d+4)}.   (3.29)

Using a minimax approach, the maximally smoothed (MS) selector \hat{H}_{MS}, obtained with the density f_0 with variance I_d that maximizes the integral in the denominator, and minimizing over H, is given by

\hat{H}_{MS} = \left[ \frac{(d+8)^{(d+6)/2} \pi^{d/2} R(K)}{16 (d+2)\, n\, \Gamma(d/2 + 4)} \right]^{2/(d+4)} S.   (3.30)

The maximal smoothing selector is the only multivariate bandwidth selector that has a closed form. No large scale simulation studies of multivariate bandwidth selectors have been found in the literature.
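A minimal Python sketch of the closed form in Eqn. 3.30 follows, assuming a standard Gaussian kernel so that R(K) = (4\pi)^{-d/2} and taking S to be the sample covariance matrix; the function name and simulated data are illustrative assumptions.

import numpy as np
from scipy.special import gamma

def maximal_smoothing_H(data):
    """Terrell's maximal smoothing bandwidth matrix, Eqn. 3.30, Gaussian kernel."""
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    S = np.cov(data, rowvar=False)           # sample covariance matrix
    RK = (4.0 * np.pi) ** (-d / 2.0)         # R(K) for the standard Gaussian kernel
    factor = ((d + 8.0) ** ((d + 6.0) / 2.0) * np.pi ** (d / 2.0) * RK
              / (16.0 * (d + 2.0) * n * gamma(d / 2.0 + 4.0))) ** (2.0 / (d + 4.0))
    return factor * S

rng = np.random.default_rng(4)
print(maximal_smoothing_H(rng.normal(size=(50, 3))))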

AMISE Multivariate Normal

Wand and Jones [Wand and Jones, 1995] derive the asymptotic mean integrated squared error (AMISE) of the multivariate kernel density estimator, given by

AMISE(H) = \frac{\mu_2(K)^2 \int_{-\infty}^{\infty} \left[ \mathrm{tr}\{H' H_g(x) H\} \right]^2 dx}{4} + \frac{\|K\|_2^2}{n |H|},   (3.31)

where H is the bandwidth matrix, \mu_2(K) is the second moment of K, H_g is the Hessian matrix of second partial derivatives, and \|K\|_2^2 is the p-dimensional squared L_2 norm of K.

If we let K(\cdot) be the multivariate standard normal distribution such that K \sim N_p(0, I), Eqn. 3.31 becomes

AMISE_N(H) = \frac{2\,\mathrm{tr}\{(H'\Sigma^{-1}H)^2\} + \{\mathrm{tr}(H'\Sigma^{-1}H)\}^2}{4\left[ 2^{p+2} \pi^{p/2} \sqrt{|\Sigma|} \right]} + \frac{1}{2^p \pi^{p/2} n |H|},   (3.32)

where p is the number of variables. If we assume H and \Sigma to be diagonal matrices such that H = \mathrm{diag}(h_1, ..., h_p) and \Sigma = \mathrm{diag}(\sigma_1^2, ..., \sigma_p^2), the optimal bandwidth for each variable j is shown in Wand and Jones [Wand and Jones, 1995] to be

h_j^* = \left[ \frac{4}{n(p+2)} \right]^{1/(p+4)} \hat{\sigma}_j   (3.33)

for j = 1, ..., p, where \hat{\sigma}_j can be replaced by its sample estimate s_j. Eqn. 3.33 is known as the "normal reference rule" since a multivariate normal distribution is assumed in its derivation. Eqn. 3.33 requires the estimation of \hat{\sigma}_j and does not take into account the correlational structure in the data. Note in the univariate case when p = 1, Eqn. 3.33 reduces to Silverman's rule of thumb [Silverman, 1986] given by

h_j = \left( \frac{4}{3n} \right)^{1/5} \hat{\sigma}_j.   (3.34)

Eqn. 3.34 also requires estimation of each \hat{\sigma}_j and does not take into account the correlational structure in the data.
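For illustration, a minimal Python sketch of the normal reference rule (Eqn. 3.33) and its univariate special case (Eqn. 3.34) follows; the function names and simulated data are illustrative assumptions.

import numpy as np

def normal_reference_rule(data):
    """Per-variable bandwidths of Eqn. 3.33: h_j = (4 / (n(p+2)))^(1/(p+4)) * s_j."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    s = data.std(axis=0, ddof=1)                 # sample estimates s_j of sigma_j
    return (4.0 / (n * (p + 2.0))) ** (1.0 / (p + 4.0)) * s

def silverman_rule(x):
    """Univariate special case, Eqn. 3.34: h = (4 / (3n))^(1/5) * sigma-hat."""
    x = np.asarray(x, dtype=float)
    return (4.0 / (3.0 * x.size)) ** 0.2 * x.std(ddof=1)

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 7))
print(normal_reference_rule(X))
print(silverman_rule(X[:, 0]))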

3.3.3 Variable Bandwidth Selectors

In addition to univariate and multivariate fixed bandwidth selectors, variable bandwidth selectors are also used. Variable bandwidth selectors use bandwidth functions instead of a constant bandwidth. The two main classes of variable bandwidth selectors assume either that the bandwidth is different at each estimation point, (x, h(x)), or that the bandwidth is different at each data point, X_i: h_i = \omega(X_i), i = 1, 2, ..., n. The functions h(\cdot) and \omega(\cdot) are non-random functions. Sain and Scott [Sain and Scott, 1996] refer to the two types of selectors as balloon and sample point selectors. Thus, the kernel density estimators arising from these selectors are called balloon and sample point kernel density estimators [Duong, 2004].

Schucany [Schucany, 1989] introduced balloon selectors, while sample point selectors were introduced by Wagner [Wagner, 1975], Breiman [Breiman, Meisel and Purcell, 1977] and Victor [Victor, 1976]. The balloon (B) estimator \hat{f}_B(x; h(x)) takes on the functional form

\hat{f}_B(x; h(x)) = \frac{1}{n} \sum_{i=1}^{n} K_{h(x)}(x - X_i).   (3.35)

For balloon estimators, the bandwidth is a function of the estimation point. For a given point x_0, all the kernels have the same bandwidth h(x_0). However, balloon estimators are not true density functions since they do not typically integrate to 1 due to their focus on local estimation [Terrell and Scott, 1992].

The sample point (SP) estimator takes on the functional form

\hat{f}_{SP}(x; \omega) = \frac{1}{n} \sum_{i=1}^{n} K_{h_i}(x - X_i),   (3.36)

where h_i = \omega(X_i), i = 1, 2, ..., n. Each kernel has a different bandwidth for the sample point estimator, whereas the fixed kernel density estimator has a fixed bandwidth. The bandwidths change at each of the data points rather than at each estimation point as for the balloon estimator. Abramson [Abramson, 1982] suggested using a pilot estimate \hat{f}_P with

\hat{\omega}(X_i) = \frac{h}{\sqrt{\hat{f}_P(X_i)}}.   (3.37)
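The following is a minimal Python sketch of the sample point estimator with Abramson's bandwidths (Eqns. 3.36 and 3.37), assuming a Gaussian kernel and taking the pilot estimate to be a fixed-bandwidth estimate with Silverman's rule; these choices, the function name, and the simulated data are illustrative assumptions.

import numpy as np
from scipy.stats import norm

def abramson_sample_point_kde(x_grid, data, h):
    """Sample point estimator with h_i = h / sqrt(f_P(X_i)) from Eqn. 3.37."""
    data = np.asarray(data, dtype=float)
    n = data.size
    # pilot estimate at the data points (fixed Silverman bandwidth, Eqn. 3.34)
    h_pilot = (4.0 / (3.0 * n)) ** 0.2 * data.std(ddof=1)
    f_pilot = norm.pdf(data[:, None] - data[None, :], scale=h_pilot).mean(axis=1)
    h_i = h / np.sqrt(f_pilot)                       # Eqn. 3.37
    # each kernel carries its own bandwidth h_i (Eqn. 3.36)
    return norm.pdf(np.subtract.outer(x_grid, data), scale=h_i).mean(axis=1)

rng = np.random.default_rng(6)
sample = rng.standard_t(df=3, size=200)
grid = np.linspace(-6, 6, 5)
print(abramson_sample_point_kde(grid, sample, h=0.5))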

Terrell and Scott [Terrell and Scott, 1992] developed multivariate generalized kernel density estimators which combine the fixed, balloon and sample point kernel estimators, in addition to non-parametric density estimators, by generalizing the sample point estimator of Breiman [Breiman, Meisel and Purcell, 1977]. The general multivariate sample point estimator is

\hat{f}_{SP}(x; \Omega) = \frac{1}{n} \sum_{i=1}^{n} K_{\Omega(X_i)}(x - X_i),   (3.38)

where \Omega(X_i) = h^2 f(X_i)^{-1} I as proposed by Abramson [Abramson, 1982]. Other choices for \Omega include the kth nearest neighbor function of X_i multiplied by the identity matrix I [Breiman, Meisel and Purcell, 1977], a piecewise constant function [Sain, 2002], and \Omega(X_i) = H_j if X_i \in bin j [Sain and Scott, 1996]. In addition, Hall [Hall, 1992] discussed other global properties of variable bandwidth density estimators.

3.3.4 Information Complexity Bandwidth Selectors

A new method of estimating the bandwidth h_j^* from Eqn. 3.33 uses Bozdogan's ICOMP [Bozdogan, 1988, Bozdogan, 1990a, Bozdogan, 1990b, Bozdogan, 2000] from Eqn. 2.8 in place of \hat{\sigma}_j. The entropic complexity of the estimated covariance matrix \hat{\Sigma} is

\hat{\sigma}_j = C_1(\hat{\Sigma}) \equiv \frac{p}{2} \log\left[ \frac{\mathrm{tr}(\hat{\Sigma})}{p} \right] - \frac{1}{2} \log|\hat{\Sigma}|,   (3.39)

where p is the rank of \hat{\Sigma}.

If \Sigma is diagonal, the Frobenius norm characterization of the complexity is defined as

s_j = C_F(\hat{\Sigma}) = \frac{\mathrm{tr}(\hat{\Sigma}' \hat{\Sigma})}{p} - \frac{\mathrm{tr}(\hat{\Sigma})^2}{p^2},   (3.40)

where p is the rank of \hat{\Sigma}.

Another alternative is to choose the bandwidth matrix proportional to \Sigma^{1/2}. The estimated bandwidth matrix \hat{H} is defined as

\hat{H} = \frac{\hat{\Sigma}^{1/2}}{n^{1/(p+4)}},   (3.41)

where p is the rank of \hat{\Sigma}.

A likelihood cross validation technique uses the Kullback-Leibler (K-L) [Kullback, 1987] information as a measure of distance between two densities. The goal is to minimize the K-L distance of the approximate density \hat{f}_H(x) from the true density f(x). The K-L information is defined as

KL(f, \hat{f}_H) = \int \log\left( \frac{f(x)}{\hat{f}_H(x)} \right) f(x)\,dx = \int \log[f(x)] f(x)\,dx - \int \log[\hat{f}_H(x)] f(x)\,dx.   (3.42)

The goal is to derive a bandwidth matrix H that minimizes KL(f, \hat{f}_H) or, equivalently, maximizes E[\log \hat{f}_H(x)] as shown by

E[\log \hat{f}_H(x)] = \int \log[\hat{f}_H(x)] f(x)\,dx.   (3.43)

Eqn. 3.43 is estimated by using the log pseudo-likelihood function

\log L(x_1, ..., x_n | H) = \sum_{i=1}^{n} \log[\hat{f}_{H,(-i)}(x_i)],   (3.44)

where \hat{f}_{H,(-i)} is the leave-one-out estimator given by

\hat{f}_{H,(-i)}(x_i) = \frac{1}{n-1} \sum_{\substack{j=1 \\ j \neq i}}^{n} |H|^{-1/2} K\{H^{-1/2}(x_i - x_j)\}.   (3.45)

The likelihood cross validation criterion will select H by maximizing \frac{1}{n} \log L(\cdot|H). Alternatively, we can minimize ICOMP(IFIM) in

ICOMP(IFIM) = -2 \log L(x_1, ..., x_n | H) + 2 C_{1F}(\hat{\Sigma}),   (3.46)

where C_{1F}(\hat{\Sigma}) is defined by

C_{1F}(\hat{\Sigma}) = \frac{s\, C_F(\hat{\Sigma})}{4 \left( \frac{\mathrm{tr}(\hat{\Sigma})}{s} \right)^2} = \frac{1}{4 \bar{\lambda}_a^2} \sum_{j=1}^{s} (\lambda_j - \bar{\lambda}_a)^2,   (3.47)

where the \lambda_j are the eigenvalues of \hat{\Sigma} and \bar{\lambda}_a is their arithmetic mean. The function C_{1F}(\cdot) is a second order equivalent measure of complexity to the original C_1(\cdot) measure and is scale-invariant with C_{1F}(\cdot) \geq 0. Note that C_{1F}(\cdot) measures the relative variation in the eigenvalues and C_{1F}(\cdot) = 0 only when \lambda_j = \bar{\lambda}_a for all j = 1, ..., s.

The Cholesky decomposition can be applied to the bandwidth matrix H since H is symmetric positive definite. The Cholesky decomposition calculates a lower triangular matrix L such that H = LL'. The kernel estimator of f(x) is

\hat{f}_B(x) = \frac{|B|}{n} \sum_{i=1}^{n} K(B(x - x_i)),   (3.48)

where B = L^{-1}. The leave-one-out estimator of f(x) is

\hat{f}_{B,(-i)}(x_i) = \frac{|B|}{n-1} \sum_{\substack{j=1 \\ j \neq i}}^{n} K(B(x_i - x_j)).   (3.49)
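The sketch below illustrates how a candidate bandwidth matrix H could be scored by Eqn. 3.46, using the leave-one-out estimator of Eqn. 3.45 with a Gaussian kernel and the C_1F complexity of Eqn. 3.47; for simplicity \hat{\Sigma} is taken to be the sample covariance matrix of the data, which is an illustrative assumption rather than the dissertation's full construction, so the comparison across H values in this sketch is driven by the pseudo-likelihood term.

import numpy as np
from scipy.stats import multivariate_normal

def c1f(sigma_hat):
    """C_1F complexity of Eqn. 3.47 from the eigenvalues of sigma_hat."""
    lam = np.linalg.eigvalsh(sigma_hat)
    lam_bar = lam.mean()
    return np.sum((lam - lam_bar) ** 2) / (4.0 * lam_bar ** 2)

def icomp_ifim_bandwidth_score(H, data):
    """ICOMP(IFIM) score of Eqn. 3.46 for a candidate bandwidth matrix H."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    kernel = multivariate_normal(mean=np.zeros(p), cov=H)
    log_lik = 0.0
    for i in range(n):
        others = np.delete(data, i, axis=0)
        f_loo = kernel.pdf(data[i] - others).mean()      # Eqn. 3.45
        log_lik += np.log(f_loo)                         # Eqn. 3.44
    return -2.0 * log_lik + 2.0 * c1f(np.cov(data, rowvar=False))

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 2))
for scale in (0.1, 0.3, 0.9):
    print(scale, icomp_ifim_bandwidth_score(scale * np.eye(2), X))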

For a data set denoted by (x_1, ..., x_n), let S denote the sample covariance matrix with diagonal components D = \mathrm{diag}(s_1^2, ..., s_p^2). The sphering transformation is defined as

x_i^* = S^{-1/2} x_i   (3.50)

for i = 1, 2, ..., n. The scaling transformation is defined by

x_i^* = D^{-1/2} x_i   (3.51)

for i = 1, 2, ..., n. The optimal bandwidth matrix for the original data, \hat{H}, can be calculated through the reverse transformations given by

\hat{H} = S^{1/2} \hat{H}^* (S^{1/2})'   (3.52)

and

\hat{H} = D^{1/2} \hat{H}^* (D^{1/2})',   (3.53)

where \hat{H}^* denotes the optimal bandwidth matrix computed on the transformed data.

Substituting Eqn. 3.39 into Eqn. 3.33 gives

h^* = \left[ \frac{4}{n(p+2)} \right]^{1/(p+4)} \left[ \frac{p}{2} \log\frac{\mathrm{tr}(\hat{\Sigma})}{p} - \frac{1}{2} \log|\hat{\Sigma}| \right],   (3.54)

which will be used to estimate the bandwidth h for kernel regression smoothing using all X variables together. The subscript j is no longer needed to denote each of the p variables for h since \hat{\sigma}_j is calculated using the correlational structure from all the X matrix variables and is constant for all p variables.
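For illustration, a minimal Python sketch of Eqn. 3.54 is given below; the function name and simulated data are illustrative assumptions.

import numpy as np

def icomp_bandwidth(data):
    """Bandwidth of Eqn. 3.54: the normal reference rule of Eqn. 3.33 with the
    entropic complexity C_1(Sigma-hat) of Eqn. 3.39 used in place of sigma-hat_j."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    sigma_hat = np.cov(data, rowvar=False)
    c1 = (0.5 * p * np.log(np.trace(sigma_hat) / p)
          - 0.5 * np.linalg.slogdet(sigma_hat)[1])       # Eqn. 3.39
    return (4.0 / (n * (p + 2.0))) ** (1.0 / (p + 4.0)) * c1

rng = np.random.default_rng(8)
X = rng.multivariate_normal(np.zeros(3), np.diag([1.0, 4.0, 9.0]), size=50)
print(icomp_bandwidth(X))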

When kernel regression smoothing is applied to each X or Y variable individually, Bozdogan [Bozdogan, 2006] applied

h_j = 1.4^{j}\, \frac{1.06}{n^{1/5}} \min\left( \hat{\sigma}, \frac{\hat{R}}{1.34} \right)   (3.55)

to each variable for calculating the optimal h by scoring ICOMP for each j = -3, -2, ..., 4, where \hat{\sigma} is the sample standard deviation and \hat{R} is the interquartile range, as in Eqn. 3.4. The value of h_j associated with the minimized ICOMP score using the misspecification covariance matrix is the optimized h for that variable.

3.4 Conclusions

A review of the current literature showed several methods for both univariate and multivariate kernel density estimation. However, none of the methods in the literature use kernel density estimation in conjunction with ICOMP and the genetic algorithm. This new contribution to the literature is discussed in chapter 6. Kernel density estimation also has not been applied with the implicit enumeration algorithm, which is discussed in chapter 7.

Chapter 4

Univariate Kernel Regression

4.1 Kernel Regression

Regression models in which the response depends on some covariates linearly, but on other covariates nonparametrically are called partially linear models (PLMs). PLMs are special cases of additive models that generalize standard linear regression techniques. Linear parameters are estimated by least squares estimators, while the nonparametric part can be estimated by kernel regression. When the model is heteroscedastic, the variance functions are approximated by weighted least squares estimators.

Hardle [Hardle, Mori and Vieu, 2007] discusses how kernel regression is used in PLMs.

PLMs are defined by

Y = X^T \beta + g(T) + \varepsilon,   (4.1)

where X and T are d-dimensional and scalar regressors, respectively, \beta is a vector of unknown parameters, g(\cdot) is an unknown smooth function, and \varepsilon is an error term with mean zero conditional on X and T.

PLMs are a special form of the additive regression model and are more flexible than standard linear models since they combine both parametric and nonparametric components. Hardle [Hardle, 1990] gives an extensive discussion of various nonparametric statistical methods based on the kernel estimator. Wand and Jones [Wand and Jones, 1995] also discuss related work in kernel regression.

Let K(\cdot) be a kernel function and h_n be a bandwidth parameter. The weight function is defined by

\omega_{ni}(t) = \frac{K\left( \frac{t - T_i}{h_n} \right)}{\sum_{j=1}^{n} K\left( \frac{t - T_j}{h_n} \right)}.   (4.2)

Let g_n(t, \beta) = \sum_{i=1}^{n} \omega_{ni}(t)(Y_i - X_i^T \beta) for a given \beta. By substituting g_n(t, \beta) into Eqn. 4.1 and using the least squares criterion, the least squares estimator of \beta is

\hat{\beta}_{KR} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \tilde{Y},   (4.3)

where \tilde{X} = (\tilde{X}_1, ..., \tilde{X}_n)^T with \tilde{X}_j = X_j - \sum_{i=1}^{n} \omega_{ni}(T_j) X_i and \tilde{Y} = (\tilde{Y}_1, ..., \tilde{Y}_n)^T with \tilde{Y}_j = Y_j - \sum_{i=1}^{n} \omega_{ni}(T_j) Y_i. The nonparametric part g(t) is estimated by

\hat{g}_n(t) = \sum_{i=1}^{n} \omega_{ni}(t)(Y_i - X_i^T \hat{\beta}_{KR}).   (4.4)

When \varepsilon_1, ..., \varepsilon_n are identically distributed, their common variance \sigma^2 may be estimated by \sigma_n^2 given by

\sigma_n^2 = (\tilde{Y} - \tilde{X}\hat{\beta}_{KR})^T (\tilde{Y} - \tilde{X}\hat{\beta}_{KR}).   (4.5)
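A minimal Python sketch of the estimators in Eqns. 4.2 through 4.5 is given below, assuming a Gaussian kernel; the function name and the simulated partially linear data are illustrative assumptions.

import numpy as np
from scipy.stats import norm

def plm_kernel_regression(Y, X, T, h):
    """Partially linear model estimators of Eqns. 4.2-4.5 with a Gaussian kernel."""
    Y, X, T = np.asarray(Y, float), np.asarray(X, float), np.asarray(T, float)
    # weight matrix: W[a, i] = omega_ni(T_a) from Eqn. 4.2
    K = norm.pdf((T[:, None] - T[None, :]) / h)
    W = K / K.sum(axis=1, keepdims=True)
    X_tilde = X - W @ X                        # partial residuals of the regressors
    Y_tilde = Y - W @ Y                        # partial residuals of the response
    beta_kr, *_ = np.linalg.lstsq(X_tilde, Y_tilde, rcond=None)      # Eqn. 4.3
    g_hat = W @ (Y - X @ beta_kr)              # Eqn. 4.4 evaluated at the T_j
    sigma2 = (Y_tilde - X_tilde @ beta_kr) @ (Y_tilde - X_tilde @ beta_kr)  # Eqn. 4.5
    return beta_kr, g_hat, sigma2

# illustrative data: linear part plus a smooth function of a scalar index T
rng = np.random.default_rng(9)
n = 100
T = rng.uniform(size=n)
X = rng.normal(size=(n, 2))
Y = X @ np.array([1.5, -2.0]) + np.sin(2 * np.pi * T) + 0.1 * rng.normal(size=n)
beta_hat, g_hat, s2 = plm_kernel_regression(Y, X, T, h=0.1)
print(beta_hat)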

Hardle, Liang and Gao [Hardle, Liang and Gao, 2000] and Speckman [Speckman, 1988] provide a detailed discussion on asymptotic theories of these estimators. Hardle [Hardle, Mori and Vieu, 2007] states that if \sup_{0 \leq t \leq 1} E(\|X\|^3 \mid t) < \infty, \Sigma = \mathrm{Cov}\{X - E(X|T)\} is a positive definite matrix, g(t) and E(x_{ij}|t) are Lipschitz continuous, and the bandwidth h \approx \lambda n^{-1/5} for some 0 < \lambda < \infty, then

\sqrt{n}(\hat{\beta}_{KR} - \beta) \to N(0, \sigma^2 \Sigma^{-1}).   (4.6)

Kernel regression has been applied using a variety of bandwidth estimators, as discussed in chapter 3. However, this dissertation proposes the extension of kernel regression to the multivariate case using ICOMP for variable model selection.

Eqn. 4.2 shows the weight is calculated from only one regressor X_i, independent of the other X_j; only the data within each X_i are used in the weighting for smoothing the data. Eqn. 4.7 is an extension of Eqn. 4.2 that uses the data from all X_1, ..., X_d variables simultaneously to smooth the data by

\omega_{nik}(t) = \frac{K\left( \frac{t - T_{ik}}{h_n} \right)}{\sum_{k=1}^{d} \sum_{j=1}^{n} K\left( \frac{t - T_{jk}}{h_n} \right)}.   (4.7)

Eqn. 4.7 can be applied to the independent variables X_i, the dependent variables Y_i, or both in the multivariate case. Let g_n(t, \beta) = \sum_{i=1}^{n} \sum_{k=1}^{d} \omega_{nik}(t)(Y_i - X_i^T \beta) for a given \beta. Substituting g_n(t, \beta) into Eqn. 4.1 and using the least squares criterion, the least squares estimator of \beta is given by Eqn. 4.3, where \tilde{X} = (\tilde{X}_1, ..., \tilde{X}_n)^T with \tilde{X}_j = X_j - \sum_{i=1}^{n} \sum_{k=1}^{d} \omega_{nik}(T_j) X_i and \tilde{Y} = (\tilde{Y}_1, ..., \tilde{Y}_n)^T with \tilde{Y}_j = Y_j - \sum_{i=1}^{n} \sum_{k=1}^{d} \omega_{nik}(T_j) Y_i. The nonparametric part g(t) is estimated by

\hat{g}_n(t) = \sum_{i=1}^{n} \sum_{k=1}^{d} \omega_{nik}(t)(Y_i - X_i^T \hat{\beta}_{KR}).   (4.8)

When \varepsilon_1, ..., \varepsilon_n are identically distributed, their common variance \sigma^2 may be estimated by \sigma_n^2 in Eqn. 4.5.

4.2 Univariate Kernel Regression Results

For univariate Y data sets, all possible combinations of applying kernel regression were calculated using the notation summarized in Table 4.1.

4.2.1 Univariate Kernel Regression with Normal Data

A multivariate data set with seven normally distributed X variables and one normally distributed Y variable was simulated with n = 50 observations. The models from Table 4.1 were run assuming a univariate Y.

Figure 4.1 is a surface plot of the first two variables X1 and X2. These are approximately bivariate normal.

Histograms and scatter plots of each of the simulated variables are shown in Appendix Figures A.1 and A.2, where each variable is approximately normally distributed. Eqn. 3.55 was used to calculate the optimal bandwidth h values when kernel regression is applied individually to each variable. Therefore, there is one h bandwidth for each variable when kernel regression is applied individually to each variable.

Figure 4.2 shows smoothed kernel density plots for the various values of h_j calculated from Eqn. 3.55 for the variable X1. As the bandwidth h increases, the kernel density plots become smoother. The smallest h = 0.15337 is too small since the kernel density plot is very jagged. Yet the largest h = 2.2635 is too large since the kernel density plot has oversmoothed the data and removed features of the data that may be important. For X1 the bandwidth h that minimizes the ICOMP score is h = 0.5892.

Similarly, Figures 4.3 through 4.9 each show smoothed kernel density plots for various values of h_j calculated from Eqn. 3.55 for the variables X2, X3, ..., X7 and Y, respectively, for the Y_k and X_k models. The values of h that minimize ICOMP for the variables X2, X3, ..., X7 and Y are 0.6766, 0.5691, 2.9936, 5.2505, 3.2336, 1.0146, and 0.4766, respectively.

Table 4.1: Notation used for applying univariate kernel regression smoothing methods.

Model          Description
Y = X β        Baseline model with no kernel regression smoothing on either X or Y variables
Y = X_k β      Kernel regression on X variables one at a time, but no kernel regression on Y
Y = X_mk β     Kernel regression on all X variables together, but no kernel regression on Y
Y_k = X β      Kernel regression on Y variable, but no kernel regression on X
Y_k = X_k β    Kernel regression on Y and X variables one at a time
Y_k = X_mk β   Kernel regression on Y variable and kernel regression on X variables together

Figure 4.1: Bivariate surface plot of normally distributed variables X1 and X2.

Figure 4.2: Kernel density plots for various h values for normally distributed variable X1.

Figure 4.3: Kernel density plots for various h values for normally distributed variable X2.

Figure 4.4: Kernel density plots for various h values for normally distributed variable X3.

Figure 4.5: Kernel density plots for various h values for normally distributed variable X4.

Figure 4.6: Kernel density plots for various h values for normally distributed variable X5.

Figure 4.7: Kernel density plots for various h values for normally distributed variable X6.

Figure 4.8: Kernel density plots for various h values for normally distributed variable X7.

Figure 4.9: Kernel density plots for various h values for normally distributed variable Y.

Eqn. 3.54 was used to calculate the optimal bandwidth h value when kernel regression is applied to all X variables together (X_mk models). Therefore, there is only one h value for all X variables when kernel regression is applied to all X variables together.

Table 4.2 summarizes the RMSE, AIC, ICOMP(IFIM), and h bandwidths calculated for the true underlying model using Eqns. 3.54 and 3.55 from the six models. Performing kernel regression smoothing only on the X variables for models Y = Xkβ and Y = Xmkβ increased the ICOMP(IFIM) to 81.1 and 194, respectively, from the baseline model of 79.8.

Similarly, the RMSE and AIC also increased from the baseline model where no kernel regression smoothing was applied to the data. Since RMSE, AIC, and ICOMP(IFIM) increased for these models, applying the kernel regression only to the X variables resulted in poorer models than the baseline.

However, applying the kernel regression smoothing to the Y variable decreased the RMSE, AIC, and ICOMP(IFIM). The RMSE decreased from the baseline model of 0.474 to 0.109. Similarly, the AIC decreased from 73.1 to -74.0 for the Y_k = Xβ model, while ICOMP(IFIM) decreased from 79.8 to -77.7. ICOMP(IFIM) is minimized for the Y_k = X_kβ model with an ICOMP(IFIM) score of -89.1. RMSE, AIC, and ICOMP(IFIM) increased considerably for the Y_k = X_mkβ model compared to the Y_k = Xβ model. Therefore, applying kernel regression to the Y variable provides the largest decrease in RMSE, AIC, and ICOMP(IFIM) compared to the baseline model.

Table 4.2: Results for applying univariate kernel regression smoothing methods with normally distributed data.

Model            RMSE    AIC     ICOMP(IFIM)   h for Y   h for X
Y = X β          0.474   73.1    79.8
Y = X_k β        0.537   85.6    81.1                    *
Y = X_mk β       1.478   187     194                     0.0062
Y_k = X β        0.109   -74.0   -77.7         0.4766
Y_k = X_k β      0.109   -73.4   -89.1         0.4766    *
Y_k = X_mk β     0.144   -46.1   -58.3         0.4766    0.0062

*Optimal bandwidth h values for X1, X2, ..., X7 individually are respectively: [0.5892, 0.6766, 0.5691, 2.9936, 5.2505, 3.2336, 1.0146] using Eqn. 3.55.

4.2.2 Univariate Box-Cox Transformation

In multiple linear regression models, if the assumptions of normality and homoscedasticity of the error variance are violated, one approach established in the literature to stabilize the error variance, remove skewness, and make the distribution more nearly normal is to use a Box-Cox [Box and Tidwell, 1962] transformation. The Box-Cox transformation for \lambda \in R is given by

y = \begin{cases} (x^{\lambda} - 1)/\lambda & \text{if } \lambda \neq 0 \\ \log_e(x) & \text{if } \lambda = 0 \end{cases}.   (4.9)

Bozdogan [Bozdogan, 1990a] derived the information criteria for five competing models shown in Table 4.3.
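Before detailing these criteria, a minimal Python sketch of the transformation in Eqn. 4.9 and of selecting \lambda by a profile-likelihood grid search is given below; the function names, the grid, and the simulated data are illustrative assumptions, and positive data with normal errors after transformation are assumed.

import numpy as np

def box_cox(x, lam):
    """Box-Cox transformation of Eqn. 4.9 (requires positive x)."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def box_cox_loglik(x, lam):
    """Profile log likelihood of lambda under normal errors for the transformed data."""
    z = box_cox(x, lam)
    n = x.size
    return -0.5 * n * np.log(z.var()) + (lam - 1.0) * np.log(x).sum()

# grid search for the maximum likelihood lambda
rng = np.random.default_rng(10)
y = rng.lognormal(mean=1.0, sigma=0.5, size=100)
grid = np.linspace(-2, 2, 81)
print(grid[np.argmax([box_cox_loglik(y, lam) for lam in grid])])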

The AIC for M1 is given by

AIC = n \log(2\pi) + n \log(\hat{\sigma}^2) + n + 2(k + 1).   (4.10)

The ICOMP(IFIM) for M1 is given by

ICOMP(IFIM) = n \log(2\pi) + n \log(\hat{\sigma}^2) + n + (k + 2) \log\left[ \frac{\mathrm{tr}(\hat{\sigma}^2 (X'X)^{-1}) + 2\hat{\sigma}^4/n}{k + 2} \right] - \log|\hat{\sigma}^2 (X'X)^{-1}| - \log\left( \frac{2\hat{\sigma}^4}{n} \right).   (4.11)
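The following minimal Python sketch evaluates Eqns. 4.10 and 4.11 for an ordinary least squares fit; the function name and the simulated design are illustrative assumptions.

import numpy as np

def aic_icomp_m1(y, X):
    """AIC (Eqn. 4.10) and ICOMP(IFIM) (Eqn. 4.11) for the baseline model M1.

    X is the n x (k+1) design matrix including the intercept column, so k is
    the number of slope parameters.
    """
    y, X = np.asarray(y, float), np.asarray(X, float)
    n, k_plus_1 = X.shape
    k = k_plus_1 - 1
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n                        # maximum likelihood sigma-hat^2
    aic = n * np.log(2 * np.pi) + n * np.log(sigma2) + n + 2 * (k + 1)
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)        # sigma-hat^2 (X'X)^-1
    trace_term = np.trace(cov_beta) + 2 * sigma2 ** 2 / n
    icomp = (n * np.log(2 * np.pi) + n * np.log(sigma2) + n
             + (k + 2) * np.log(trace_term / (k + 2))
             - np.linalg.slogdet(cov_beta)[1]
             - np.log(2 * sigma2 ** 2 / n))
    return aic, icomp

rng = np.random.default_rng(11)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=50)
print(aic_icomp_m1(y, X))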

Table 4.3: Competing models for univariate Box-Cox transformations.

Model   Notation                       Model Description
M1      y = X β + ε                    No transformation baseline model
M2      y^(λ) = X β + ε                Transformation of dependent variable y by λ only
M3      y^(λ) = X^(λ) β + ε            Transformation of all variables by the same λ
M4      y = X^(µ) β + ε                Transformation of independent X variables only
M5      y^(λ) = X^(µ) β + ε            Independent transformation of X and y variables

The AIC for M2 is given by

AIC = n \log(2\pi) + n \log(\hat{\sigma}_{\hat{\lambda}}^2) + n - 2(\hat{\lambda} - 1) \sum_{i=1}^{n} \log(y_i) + 2(k + 2),   (4.12)

where \hat{\lambda} is the optimal \lambda that maximizes the log likelihood function.

The ICOMP(IFIM) for M2 is given by

ICOMP(IFIM) = n \log(2\pi) + n \log(\hat{\sigma}_{\hat{\lambda}}^2) + n - 2(\hat{\lambda} - 1) \sum_{i=1}^{n} \log(y_i) + (k + 2) \log\left[ \frac{\mathrm{tr}(\hat{\sigma}_{\hat{\lambda}}^2 (X'X)^{-1}) + 2\hat{\sigma}_{\hat{\lambda}}^4/n}{k + 2} \right] - \log|\hat{\sigma}_{\hat{\lambda}}^2 (X'X)^{-1}| - \log\left( \frac{2\hat{\sigma}_{\hat{\lambda}}^4}{n} \right).   (4.13)

The AIC for M3 is given by

AIC = n \log(2\pi) + n \log(\hat{\sigma}_{(\hat{\lambda},\hat{\lambda})}^2) + n - 2(\hat{\lambda} - 1) \sum_{i=1}^{n} \log(y_i) + 2(k + 2),   (4.14)

where \hat{\lambda} is the optimal \lambda that maximizes the log likelihood function.

The ICOMP(IFIM) for M3 is given by

ICOMP(IFIM) = n \log(2\pi) + n \log(\hat{\sigma}_{(\hat{\lambda},\hat{\lambda})}^2) + n - 2(\hat{\lambda} - 1) \sum_{i=1}^{n} \log(y_i) - \log\left( \frac{2\hat{\sigma}_{(\hat{\lambda},\hat{\lambda})}^4}{n} \right) + (k + 2) \log\left[ \frac{\mathrm{tr}(\hat{\sigma}_{(\hat{\lambda},\hat{\lambda})}^2 (X^{(\hat{\lambda})\prime} X^{(\hat{\lambda})})^{-1}) + 2\hat{\sigma}_{(\hat{\lambda},\hat{\lambda})}^4/n}{k + 2} \right] - \log|\hat{\sigma}_{(\hat{\lambda},\hat{\lambda})}^2 (X^{(\hat{\lambda})\prime} X^{(\hat{\lambda})})^{-1}|.   (4.15)

The AIC for M4 is given by

AIC = n \log(2\pi) + n \log(\hat{\sigma}_{\hat{\mu}}^2) + n + 2(k + 2),   (4.16)

where \hat{\mu} is the optimal \mu that maximizes the log likelihood function.

The ICOMP(IFIM) for M4 is given by

ICOMP(IFIM) = n \log(2\pi) + n \log(\hat{\sigma}_{\hat{\mu}}^2) + n + (k + 2) \log\left[ \frac{\mathrm{tr}(\hat{\sigma}_{\hat{\mu}}^2 (X^{(\hat{\mu})\prime} X^{(\hat{\mu})})^{-1}) + 2\hat{\sigma}_{\hat{\mu}}^4/n}{k + 2} \right] - \log|\hat{\sigma}_{\hat{\mu}}^2 (X^{(\hat{\mu})\prime} X^{(\hat{\mu})})^{-1}| - \log\left( \frac{2\hat{\sigma}_{\hat{\mu}}^4}{n} \right).   (4.17)

The AIC for M5 is given by

AIC = n \log(2\pi) + n \log(\hat{\sigma}_{(\hat{\lambda},\hat{\mu})}^2) + n - 2(\hat{\lambda} - 1) \sum_{i=1}^{n} \log(y_i) + 2(k + 3),   (4.18)

where \hat{\lambda} and \hat{\mu} are the optimal \lambda and \mu that maximize the log likelihood function.

The ICOMP(IFIM) for M5 is given by

ICOMP(IFIM) = n \log(2\pi) + n \log(\hat{\sigma}_{(\hat{\lambda},\hat{\mu})}^2) + n - 2(\hat{\lambda} - 1) \sum_{i=1}^{n} \log(y_i) - \log\left( \frac{2\hat{\sigma}_{(\hat{\lambda},\hat{\mu})}^4}{n} \right) + (k + 2) \log\left[ \frac{\mathrm{tr}(\hat{\sigma}_{(\hat{\lambda},\hat{\mu})}^2 (X^{(\hat{\mu})\prime} X^{(\hat{\mu})})^{-1}) + 2\hat{\sigma}_{(\hat{\lambda},\hat{\mu})}^4/n}{k + 2} \right] - \log|\hat{\sigma}_{(\hat{\lambda},\hat{\mu})}^2 (X^{(\hat{\mu})\prime} X^{(\hat{\mu})})^{-1}|.   (4.19)

Results using kernel regression smoothing for the models in Table 4.2 will be compared to results from using the Box-Cox transformation models described in Table 4.3 and the results in Table 4.4.

Table 4.4 shows the results from using a Box-Cox transformation on the univariate data with normally distributed residuals. The RMSE, AIC, and ICOMP(IFIM) decrease very little compared to the baseline M1 model.

Table 4.4: Results from using Box-Cox transformation for normally distributed data.

Model   Optimal µ   Optimal λ   RMSE     AIC     ICOMP(IFIM)
M1                              0.4735   73.13   79.77
M2                  0.75        0.2647   74.81   72.97
M3                  0.9         0.3756   75.11   73.12
M4      1.6                     0.4703   74.46   74.00
M5      1.5         1           0.4704   76.49   72.89

The Box-Cox transformation, even with the optimal maximum likelihood λ and µ parameters, did not provide significantly better models than kernel regression smoothing.

In fact, kernel regression applied to the univariate Y variable singly by far outperformed the optimal Box-Cox transformation as shown in Tables 4.2 and 4.4.

4.2.3 Univariate Kernel Regression on Hald Cement Data

Hald [Hald, 2001] published well-known cement production data that have been widely used in many published data analyses. The data set is small, with only four independent X variables, one dependent Y variable, and n = 13 observations. The raw data are shown in Table 4.5.

A multiple linear regression model using only the variables X1 and X2 minimizes the ICOMP(IFIM) among all possible subset models. The residuals from the best model y = β_0 + β_1 x_1 + β_2 x_2 + ε are approximately normally distributed.

Figure 4.10 is a surface plot of the first two variables X1 and X2. These are clearly not bivariate normal.

Histograms and scatter plots of each of the variables are shown in Appendix Figure A.3, where none of the variables is approximately normally distributed. Eqn. 3.55 was used to calculate the optimal bandwidth h values when kernel regression is applied individually to each variable. Therefore, there is one h bandwidth for each variable when kernel regression is applied individually to each variable.

Table 4.5: Hald's cement production data.

Y       X1   X2   X3   X4
78.5     7   26    6   60
74.5     1   29   15   52
104.3   11   56    8   20
87.6    11   31    8   47
95.9     7   52    6   33
109.2   11   55    9   22
102.7    3   71   17    6
72.5     1   31   22   44
93.1     2   54   18   22
115.9   21   47    4   26
83.8     1   40   23   34
113.3   11   66    9   12
109.4   10   68    8   12

Figure 4.10: Bivariate surface plot of Hald's variables X1 and X2.

Figure 4.11: Kernel density plots for various h values for Hald's variable X1.

Figure 4.11 shows smoothed kernel density plots for the various values of h_j calculated from Eqn. 3.55 for the variable X1. As the bandwidth h increases, the kernel density plots become smoother. The smallest h = 1.3605 is too small since the kernel density plot is jagged. Yet the largest h = 14.3411 is too large since the kernel density plot has oversmoothed the data and removed features of the data that may be important. For X1 the bandwidth h that minimizes the ICOMP score is h = 5.2264.

Similarly, Figures 4.12 through 4.15 each show smoothed kernel density plots for various values of h_j calculated from Eqn. 3.55 for the variables X2, X3, X4, and Y, respectively, for the Y_k and X_k models. The values of h that minimize ICOMP for the variables X2, X3, X4 and Y are 7.0538, 2.0739, 14.8715, and 9.5323, respectively.

Eqn. 3.54 was used to calculate the optimal bandwidth h value when kernel regression is applied to all X variables together (X_mk models). Therefore, there is only one h value for all X variables when kernel regression is applied to all X variables together.

Figure 4.12: Kernel density plots for various h values for Hald's variable X2.

Figure 4.13: Kernel density plots for various h values for Hald's variable X3.

Figure 4.14: Kernel density plots for various h values for Hald's variable X4.

Figure 4.15: Kernel density plots for various h values for Hald's variable Y.

Table 4.6: Results for applying univariate kernel regression smoothing methods on Hald's data.

Model            RMSE    AIC     ICOMP(IFIM)   h for Y   h for X
Y = X β          2.12    60.4    75.4
Y = X_k β        6.46    89.4    109.4                   *
Y = X_mk β       12.23   106.0   134.7                   0.2695
Y_k = X β        1.15    44.5    56.6          9.5323
Y_k = X_k β      1.55    52.3    57.3          9.5323    *
Y_k = X_mk β     2.18    61.1    81.0          9.5323    0.2695

*Optimal bandwidth h values for X1, X2, X3, X4 individually are respectively: [5.2264, 7.0538, 2.0739, 14.8715] using Eqn. 3.55.

Table 4.6 summarizes the RMSE, AIC, ICOMP(IFIM), and h bandwidths calculated for the best model using Eqns. 3.54 and 3.55 from the six kernel models. Performing kernel regression smoothing only on the X variables for models Y = X_kβ and Y = X_mkβ increased the ICOMP(IFIM) to 109.4 and 134.7, respectively, from the baseline model of 75.4. Similarly, the RMSE and AIC also increased from the baseline model where no kernel regression smoothing was applied to the data. Since RMSE, AIC, and ICOMP(IFIM) increased for these models, applying the kernel regression only to the X variables resulted in poorer models than the baseline.

However, applying the kernel regression smoothing to the Y variable decreased the RMSE, AIC, and ICOMP(IFIM) for the Y_k = Xβ and Y_k = X_kβ kernel models. The RMSE decreased from the baseline model of 2.12 to 1.15 for the Y_k = Xβ model. Similarly, the AIC decreased from 60.4 to 44.5 for the Y_k = Xβ model, while ICOMP(IFIM) decreased from 75.4 to 56.6. RMSE, AIC, and ICOMP(IFIM) are minimized for the Y_k = Xβ model compared to the other models. RMSE, AIC, and ICOMP(IFIM) increased for the Y_k = X_mkβ model compared to the Y = Xβ and Y_k = Xβ models. Therefore, applying kernel regression to the Y variable alone provides the largest decrease in RMSE, AIC, and ICOMP(IFIM) compared to the baseline model. This agrees with the results from section 4.2.1, although Hald's data were shown to be non-normally distributed.

60 4.2.4 Kernel Regression for Univariate Power Exponential Distribution

Liu and Bozdogan [Liu and Bozdogan, 2008] discuss regression models with power exponential (PE) random errors. A univariate random variable x is PE distributed if the density function of x is defined by

f(x; \mu, \sigma, \beta) = \frac{1}{\sigma \Gamma\left(1 + \frac{1}{2\beta}\right) 2^{1 + \frac{1}{2\beta}}} \exp\left\{ -\frac{1}{2} \left| \frac{x - \mu}{\sigma} \right|^{2\beta} \right\},   (4.20)

where the parameters -\infty < \mu < \infty and \sigma > 0 are location and scale parameters, respectively, and \beta > 0 is the shape parameter. Note that when \beta = 1 the density becomes the normal distribution, so the PE distribution can be seen as a generalized normal distribution. The standard PE distribution is the PE distribution with \mu = 0 and \sigma = 1.

In order to generate univariate PE pseudo-random numbers, Liu and Bozdogan [Liu and Bozdogan, 2008] show that if Y = \frac{1}{2}|X|^{2\beta}, where X has a standard PE distribution with shape parameter \beta, then Y \sim \mathrm{Gamma}\left(\frac{1}{2\beta}\right). So Y has a distribution with density given by

f(y) = \frac{y^{(1/2\beta) - 1} e^{-y}}{\Gamma\left(\frac{1}{2\beta}\right)}.   (4.21)

Seven independent X variables and one Y dependent variable were generated using Eqn. 4.21 with n = 50 observations. The residuals from the multiple linear regression models are gamma distributed with \beta = 0.3.
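A minimal Python sketch of this generation scheme follows, drawing Y from the gamma density of Eqn. 4.21 and inverting Y = |X|^{2β}/2 with a random sign, which relies on the symmetry of the standard PE density; the function name and seed are illustrative assumptions, and location and scale can be introduced afterwards by the usual shift and rescale.

import numpy as np

def standard_pe_sample(n, beta, rng):
    """Standard power exponential pseudo-random numbers via the gamma relation."""
    y = rng.gamma(shape=1.0 / (2.0 * beta), scale=1.0, size=n)   # Eqn. 4.21
    signs = rng.choice([-1.0, 1.0], size=n)                      # symmetric about zero
    return signs * (2.0 * y) ** (1.0 / (2.0 * beta))             # invert Y = |X|^(2*beta)/2

rng = np.random.default_rng(12)
x = standard_pe_sample(50, beta=0.3, rng=rng)
print(x.mean(), x.std(ddof=1))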

Figure 4.16 is a surface plot of the first two variables X1 and X2 where β = 0.3. These are gamma distributed with thin tails.

Histograms and scatter plots of each of the PE simulated variables are shown in Appendix Figures A.4 and A.5. Eqn. 3.55 was used to calculate the optimal bandwidth h values when kernel regression is applied individually to each variable. However, due to the extremely thin tails and sharp peaks of the PE data, the j parameters in Eqn. 3.55 were increased from 4 to 13 in order to minimize ICOMP and determine the optimal bandwidth h values for each variable individually. Therefore, there is one h bandwidth for each variable when kernel regression is applied individually to each variable.

Figure 4.16: Bivariate surface plot of power exponential variables X1 and X2.

Figure 4.17 shows smoothed kernel density plots for the various values of h_j calculated from Eqn. 3.55 for the variable X1. As the bandwidth h increases, the kernel density plots become smoother. The smallest h = 0.47318 shown in Figure 4.17 is too small since the kernel density plot is very jagged. Yet the largest h = 6.9832 shown in Figure 4.17 is too large since the kernel density plot has oversmoothed the data and removed features of the data that may be important. For X1 the bandwidth h that minimizes the ICOMP score is h = 4.988.

Similarly, Figures 4.18 through 4.24 each show smoothed kernel density plots for various values of h_j calculated from Eqn. 3.55 for the variables X2, X3, ..., X7 and Y, respectively, for the Y_k and X_k models. The values of h that minimize ICOMP for the variables X2, X3, ..., X7 and Y are 2.5413, 18.5111, 0.7633, 2.7108, 14.5435, 82.7758, and 20.4193, respectively. Eqn. 3.54 was used to calculate the optimal bandwidth h value when kernel regression is applied to all X variables together (X_mk models). Therefore, there is only one h value for all X variables when kernel regression is applied to all X variables together.

Table 4.7 summarizes the RMSE, AIC, ICOMP(IFIM), and h bandwidths calculated for the true underlying model using Eqns. 3.54 and 3.55 from the six kernel models. Performing kernel regression smoothing only on the X variables for models Y = X_kβ and Y = X_mkβ increased the ICOMP(IFIM) to 549.9 and 556.4, respectively, from the baseline model of 58.9, approximately one order of magnitude. Similarly, the AIC also increased approximately one order of magnitude from the baseline model where no kernel regression smoothing was applied to the data. However, the RMSE increased almost two orders of magnitude, from 0.3652 to 32.9 for the Y = X_kβ model and 34.4 for the Y = X_mkβ model. Since RMSE, AIC, and ICOMP(IFIM) increased for these models, applying the kernel regression only to the X variables resulted in much poorer models than the baseline.

Figure 4.17: Kernel density plots for various h values for PE distributed variable X1.

Figure 4.18: Kernel density plots for various h values for PE distributed variable X2.

Figure 4.19: Kernel density plots for various h values for PE distributed variable X3.

Figure 4.20: Kernel density plots for various h values for PE distributed variable X4.

Figure 4.21: Kernel density plots for various h values for PE distributed variable X5.

Figure 4.22: Kernel density plots for various h values for PE distributed variable X6.

Figure 4.23: Kernel density plots for various h values for PE distributed variable X7.

Figure 4.24: Kernel density plots for various h values for PE distributed variable Y.

Table 4.7: Results for applying univariate kernel regression smoothing methods with power exponential data.

Model            RMSE     AIC     ICOMP(IFIM)   h for Y   h for X
Y = X β          0.3652   47.2    58.9
Y = X_k β        32.9     497.3   549.9                   1.193*
Y = X_mk β       34.4     501.7   556.4                   1.193
Y_k = X β        7.83     353.8   394.4         20.42
Y_k = X_k β      5.58     319.9   346.0         20.42     *
Y_k = X_mk β     7.15     344.6   375.8         20.42     1.193

*Optimal bandwidth h values for X1, X2, ..., X7 individually are respectively: [4.988, 2.5413, 18.5111, 0.7633, 2.7108, 14.5435, 82.7758] using Eqn. 3.55.

However, applying the kernel regression smoothing to the Y variable also increased the RMSE, AIC, and ICOMP(IFIM) compared to the baseline model, though not as much as for the Y = X_kβ and Y = X_mkβ models. The RMSE increased from the baseline model of 0.3652 to 7.83, 5.58, and 7.15 for the Y_k = Xβ, Y_k = X_kβ, and Y_k = X_mkβ models, respectively. Similarly, the AIC increased from 47.2 to 353.8 for the Y_k = Xβ model, while ICOMP(IFIM) increased from 58.9 to 394.4. Among the models using kernel regression, ICOMP(IFIM) and AIC are minimized for the Y_k = X_kβ model with an ICOMP(IFIM) score of 346.0. For the PE simulated data, no kernel regression model was better than the original model without kernel regression due to the highly skewed and peaked PE distribution.

The Box-Cox transformation was applied to the univariate PE data to compare with the kernel regression results. The Box-Cox optimal λ̂ = 1, so Box-Cox recommended no data transformation. In fact, the M5 model could not be evaluated because the matrix was nearly singular. Therefore, Box-Cox could not be used to improve the RMSE, AIC, or ICOMP(IFIM) of the univariate PE data.

4.2.5 Univariate Kernel Regression for Friedman's Nonlinear Regression

Friedman [Friedman, 1991] proposed a highly nonlinear multiple regression model with skewed normal errors. The true model is given by

y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \varepsilon,   (4.22)

where \varepsilon \sim N(0, 1). The 10 independent variables X1, X2, ..., X10 are randomly generated from the unit hypercube. Variables X6, X7, ..., X10 make no contribution to the response variable Y and are therefore redundant.
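For illustration, a minimal Python sketch that simulates data from Eqn. 4.22 is given below; the function name and seed are illustrative assumptions.

import numpy as np

def simulate_friedman(n, rng):
    """Simulate Friedman's nonlinear benchmark of Eqn. 4.22 with ten uniform
    regressors on the unit hypercube; X6-X10 do not enter the response."""
    X = rng.uniform(size=(n, 10))
    eps = rng.normal(size=n)
    y = (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20.0 * (X[:, 2] - 0.5) ** 2
         + 10.0 * X[:, 3] + 5.0 * X[:, 4] + eps)
    return X, y

rng = np.random.default_rng(13)
X, y = simulate_friedman(100, rng)
print(X.shape, y[:5])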

Ten independent X variables and one Y dependent variable were simulated using Eqn. 4.22 with n = 100 observations. Univariate kernel regression smoothing was performed on the univariate, nonlinearly distributed data.

Figure 4.25 is a surface plot of the first two variables X1 and X2 with the dependent variable Y . As expected, Y is highly nonlinear.

Figure 4.25: Bivariate surface plot of Friedman’s variables X1 and X2 with the dependent variable Y .

Histograms and scatter plots of each of the Friedman simulated variables are shown in Appendix Figures A.6 and A.7. Eqn. 3.55 was used to calculate the optimal bandwidth h values when kernel regression is applied individually to each variable. Therefore, there is one h bandwidth for each variable when kernel regression is applied individually to each variable.

Figure 4.26 shows smoothed kernel density plots for the various values of h_j calculated from Eqn. 3.55 for the variable X1. As the bandwidth h increases, the kernel density plots become smoother. The smallest h = 0.14649 shown in Figure 4.26 is too small since the kernel density plot is very jagged. Yet the largest h = 2.1619 shown in Figure 4.26 is too large since the kernel density plot has oversmoothed the data and removed features of the data that may be important. For X1 the bandwidth h that minimizes the ICOMP score from Eqn. 3.55 is h = 0.40197.

Similarly, Figures 4.27 through 4.36 each show smoothed kernel density plots for various values of h_j calculated from Eqn. 3.55 for the variables X2, X3, ..., X10 and Y, respectively, for the Y_k and X_k models. The values of h that minimize ICOMP for the variables X2, X3, ..., X10 and Y are 0.36273, 0.36094, 0.28781, 0.61694, 0.58268, 0.44327, 0.38365, 0.47411, 0.55332, and 17.1047, respectively.

Eqn. 3.54 was used to calculate the optimal bandwidth h value when kernel regression is applied to all X variables together (X_mk models). Therefore, there is only one h value for all X variables when kernel regression is applied to all X variables together.

Table 4.8 summarizes the RMSE, AIC, ICOMP(IFIM), and h bandwidths calculated

for the true underlying model using Eqns. 3.54 and 3.55 from the six kernel models.

Performing kernel regression smoothing only on the X variables for models Y = Xkβ and

Y = Xmkβ slightly decreased the ICOMP(IFIM) to 946.8 and 970.5, respectively, from the baseline model of 990.2. Similarly, RMSE and AIC also decreased slightly from the baseline

model where no kernel regression smoothing was applied to the data. Since RMSE, AIC,

and ICOMP(IFIM) decreased for these models, applying the kernel regression only to the

X variables resulted in slightly better models than the baseline.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.14649 to 2.1619.]

Figure 4.26: Kernel density plots for various h values for Friedman’s variable X1.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.13219 to 1.9509.]

Figure 4.27: Kernel density plots for various h values for Friedman’s variable X2.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.13154 to 1.9412.]

Figure 4.28: Kernel density plots for various h values for Friedman’s variable X3.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.14684 to 2.1671.]

Figure 4.29: Kernel density plots for various h values for Friedman’s variable X4.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.16059 to 2.37.]

Figure 4.30: Kernel density plots for various h values for Friedman’s variable X5.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.15168 to 2.2384.]

Figure 4.31: Kernel density plots for various h values for Friedman’s variable X6.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.16154 to 2.384.]

Figure 4.32: Kernel density plots for various h values for Friedman’s variable X7.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.13981 to 2.0634.]

Figure 4.33: Kernel density plots for various h values for Friedman’s variable X8.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.12342 to 1.8213.]

Figure 4.34: Kernel density plots for various h values for Friedman’s variable X9.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.14403 to 2.1256.]

Figure 4.35: Kernel density plots for various h values for Friedman’s variable X10.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 4.4525 to 65.7095.]

Figure 4.36: Kernel density plots for various h values for Friedman’s variable Y .

Table 4.8: Results for applying univariate kernel regression smoothing methods using Friedman's nonlinear data.

Model          RMSE   AIC     ICOMP(IFIM)   h for Y   h for X
Y = Xβ         24.1   930.2   990.2
Y = Xkβ        21.5   907.0   946.8                   *
Y = Xmkβ       22.8   918.9   970.5                   8.34E-03
Yk = Xβ        4.83   608.9   632.7         17.1
Yk = Xkβ       4.87   610.3   626.9         17.1      *
Yk = Xmkβ      5.18   622.8   642.6         17.1      8.34E-03

*Optimal bandwidth h values for X1, X2, ..., X10 individually are respectively: [0.40197, 0.36273, 0.36094, 0.28781, 0.61694, 0.58268, 0.44327, 0.38365, 0.47411, 0.55332].

However, applying the kernel regression smoothing to the Y variable more dramatically decreased the RMSE, AIC, and ICOMP(IFIM) compared to the baseline model. The

RMSE decreased from the baseline model of 24.1 to 4.83, 4.87, and 5.18 for the Yk = Xβ,

Yk = Xkβ, and Yk = Xmkβ models, respectively. Similarly, the AIC decreased from 930.2 to 608.9 for the Yk = Xβ model, while ICOMP(IFIM) decreased from 990.2 to 632.7.

ICOMP(IFIM) is minimized for the Yk = Xkβ model with an ICOMP(IFIM) score of

626.9, while AIC is minimized for the Yk = Xβ model with an AIC score of 608.9. These results agree with sections 4.2.1 and 4.2.3 that applying kernel regression on the Y variable decreases the RMSE, AIC, and ICOMP(IFIM) compared to the baseline model even for a highly nonlinear response Y variable.

The Box-Cox transformation was applied to the univariate Friedman's data to compare with the univariate kernel regression results. The Box-Cox optimal λ̂ = 1, so Box-Cox recommended no data transformation. In fact, the M5 model could not be evaluated because its matrix was numerically singular. Therefore, Box-Cox could not be used to improve the RMSE, AIC, or ICOMP(IFIM) of the univariate Friedman's data.

4.2.6 Univariate Kernel Regression for Uncorrelated Data

ICOMP in the general form utilizes the correlational structure in the data to accommodate correlated variables. However, in the case of completely uncorrelated variables, another form of ICOMP that estimates the posterior expected utility (PEU) as a Bayesian crite- rion can be used. Bozdogan [Bozdogan, 2006] defines ICOMP(IFIM)PEU for uncorrelated variables by

ICOMP(IFIM)_PEU = −2 log L(θ̂_M) + k + 2C₁(F̂⁻¹(θ̂_M)).   (4.23)

Substituting Eqns. 2.8 and 2.13 into Eqn. 4.23 yields

ICOMP(IFIM)_PEU = n log(2π) + n log(σ̂²) + n + k + s log( tr(F̂⁻¹) / s ) − log |F̂⁻¹|,   (4.24)

where s = dim(F̂⁻¹) and F̂⁻¹(θ̂_M) is the estimated inverse Fisher information matrix (IFIM) in its outer product form.
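As a rough illustration of how Eqn. 4.24 can be evaluated for a fitted univariate linear model, the MATLAB sketch below uses the usual block-diagonal Gaussian estimate of the inverse Fisher information matrix as a stand-in for the outer product form; the function name and this substitution are illustrative assumptions, not the dissertation's code.

% Sketch: evaluate ICOMP(IFIM)_PEU (Eqn. 4.24) for a linear regression fit.
function icomp = icomp_ifim_peu(y, X)
    [n, k] = size(X);
    b      = X \ y;                                % least squares estimate
    sig2   = sum((y - X*b).^2) / n;                % MLE of the error variance
    Finv   = blkdiag(sig2*inv(X'*X), 2*sig2^2/n);  % stand-in for the IFIM estimate
    s      = size(Finv, 1);
    icomp  = n*log(2*pi) + n*log(sig2) + n + k ...
             + s*log(trace(Finv)/s) - log(det(Finv));
end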

Seven independent variables X1, X2, ..., X7 with n = 100 observations were randomly generated from the unit hypercube using a uniform distribution with varying means such that X1, X2, ..., X7 were all uncorrelated. Variables X4, X5, X6, and X7 did not contribute to the response variable Y and were therefore redundant.

Univariate kernel regression smoothing using ICOMP(IFIM)PEU was performed on the uncorrelated univariate data. Figure 4.37 shows histograms and scatter plots of the

uncorrelated uniform variables X1, X2, X3, and Y . Histograms and scatter plots of the

uncorrelated variables X4, X5, X6, and X7 are shown in Appendix Figure A.8. Eqn. 3.55 was used to calculate the optimal bandwidth h values when kernel regression

is applied individually to each variable. Therefore, there is one h bandwidth for each

variable when kernel regression is applied individually to each variable.

Eqn. 3.54 was used to calculate the optimal bandwidth h value when kernel regression

is applied to all X variables together (Xmk models). Therefore, there is only one h value for all X variables when kernel regression is applied to all X variables together.


Figure 4.37: Histograms and scatter plots of uncorrelated variables X1, X2, X3, and Y .

Figure 4.38 shows smoothed kernel density plots for the various values of hj calculated from Eqn. 3.55 for the variable X1. Figure 4.38 shows that as the bandwidth h increases the kernel density plots become smoother. The smallest h = 0.0334 shown in Figure 4.38 is too small since the kernel density plot is very jagged with many peaks. Yet the largest h

= 0.4933 shown in Figure 4.38 is too large since the kernel density plot has oversmoothed the data and removed features of the data that may be important. For X1 the bandwidth h that minimizes the ICOMP score is h = 0.0917.

Similarly, Figures 4.39 through 4.45 each shows smoothed kernel density plots for var- ious values of hj calculated from Eqn. 3.55 for the variables X2,X3, ..., X7 and Y , respec- tively, for the Yk and Xk models. The values of h that minimize ICOMP for the variables

X2,X3, ..., X7 and Y are 0.0341, 0.1761, 0.0712, 0.3161, 0.0931, 0.1799, and 1.4217, respec- tively.

Table 4.9 summarizes the RMSE, AIC, ICOMP(IFIM)PEU , and h bandwidths calcu- lated for the true underlying model using Eqns. 3.54 and 3.55 from the six kernel models.

Performing kernel regression smoothing only on the X variables for models Y = Xkβ and

Y = Xmkβ dramatically increased the ICOMP(IFIM)PEU to 558.6 and 520.6, respectively, from the baseline model of -280.6. Similarly, the AIC also dramatically increased from the baseline model where no kernel regression smoothing was applied to the data. The RMSE likewise increased almost two orders of magnitude, from 0.058 to 3.848 for the Y = Xkβ model and 3.192 for the Y = Xmkβ model. Since RMSE, AIC, and ICOMP(IFIM)PEU increased for these models, applying the kernel regression only to the X variables resulted in much poorer models than the baseline.

However, applying the kernel regression smoothing to the Y variable also increased the

RMSE, AIC, and ICOMP(IFIM)PEU compared to the baseline model, but not as high as the Y = Xkβ and Y = Xmkβ models. The RMSE increased from the baseline model of

0.058 to 0.176, 0.288, and 0.244 for the Yk = Xβ, Yk = Xkβ, and Yk = Xmkβ models, respectively.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.033425 to 0.49328.]

Figure 4.38: Kernel density plots for various h values for uncorrelated variable X1.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.0088837 to 0.1311.]

Figure 4.39: Kernel density plots for various h values for uncorrelated variable X2.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.045848 to 0.67662.]

Figure 4.40: Kernel density plots for various h values for uncorrelated variable X3.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.025948 to 0.38293.]

Figure 4.41: Kernel density plots for various h values for uncorrelated variable X4.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.041983 to 0.61958.]

Figure 4.42: Kernel density plots for various h values for uncorrelated variable X5.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.024241 to 0.35774.]

Figure 4.43: Kernel density plots for various h values for uncorrelated variable X6.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.046829 to 0.6911.]

Figure 4.44: Kernel density plots for various h values for uncorrelated variable X7.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.37009 to 5.4617.]

Figure 4.45: Kernel density plots for various h values for uncorrelated variable Y .

Table 4.9: Results for applying univariate kernel regression smoothing methods on uncorrelated variables using posterior expected utility (PEU).

Model          RMSE    AIC      ICOMP(IFIM)_PEU   h for Y   h for X
Y = Xβ         0.058   -280.5   -280.6
Y = Xkβ        3.848   559.3    558.6                       *
Y = Xmkβ       3.192   521.9    520.6                       0.206
Yk = Xβ        0.176   -57.80   -59.18            1.422
Yk = Xkβ       0.288   40.83    38.52             1.422     *
Yk = Xmkβ      0.244   7.490    5.436             1.422     0.206

*Optimal bandwidth h values for X1, X2, ..., X7 individually are respectively: [0.0917, 0.0341, 0.1761, 0.0712, 0.3161, 0.0931, 0.1799] using Eqn. 3.55.

Similarly, the AIC increased from -280.5 to -57.8, 40.83, and 7.49 for the Yk = Xβ,

Yk = Xkβ, and Yk = Xmkβ models, respectively, while ICOMP(IFIM)PEU increased from -280.6 to -59.18, 38.52, and 5.436 for the same models. Among the models using kernel regression, ICOMP(IFIM)PEU and AIC are minimized for the Yk = Xβ model with an

ICOMP(IFIM)PEU score of -59.18 and AIC score of -57.8. For uncorrelated uniformly distributed data, no kernel regression model was better than the original model without kernel regression due to the extremely thick tails of the uniform distribution.

4.2.7 Univariate Kernel Regression Using the Epanechnikov Kernel on Normal Data

Epanechnikov [Epanechnikov, 1969] also proposed a kernel function for calculating the optimal bandwidth h. The Epanechnikov kernel minimizes asymptotic mean integrated squared error (AMISE) and assumes the data are from a multivariate normal distribution.

The Epanechnikov kernel univariate function is given by

K(x) = (3/4)(1 − x²) if |x| ≤ 1, and K(x) = 0 if |x| > 1.   (4.25)

Figure 4.46 is a univariate plot of the Epanechnikov kernel shown in Eqn. 4.25.

Figure 4.46: Univariate plot of the Epanechnikov kernel function.

The equation for calculating the optimal bandwidth h using an Epanechnikov kernel one variable at a time is given by

h*_j = [ 2^(p+2) p² (p + 2)(p + 4) Γ(p/2) / ( n(2p + 1) ) ]^(1/(p+4)) σ̂_j,   (4.26)

where p is the number of predictor variables.

Eqn. 4.26 has a similar form to Eqn. 3.33, which is the normal reference rule. In Eqn. 4.26, p = 1 when calculating h*_j one variable at a time. For all X variables together, Eqn. 3.39 is substituted into Eqn. 4.26, which becomes

h* = [ 2^(p+2) p² (p + 2)(p + 4) Γ(p/2) / ( n(2p + 1) ) ]^(1/(p+4)) [ (p/2) log( tr(Σ)/p ) − (1/2) log|Σ| ].   (4.27)

Eqn. 4.27 resembles Eqn. 3.54 for calculating a single optimal h value for all X variables together.
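A brief MATLAB sketch of the Epanechnikov kernel in Eqn. 4.25 and the per-variable bandwidth rule in Eqn. 4.26 (with p = 1, and the sample standard deviation standing in for σ̂_j) is shown below for illustration only; the data column is a placeholder.

% Epanechnikov kernel (Eqn. 4.25).
K = @(x) 0.75*(1 - x.^2).*(abs(x) <= 1);

% Per-variable bandwidth from Eqn. 4.26 with p = 1 (one variable at a time).
p  = 1;
xj = randn(50, 1);                 % placeholder data column
n  = numel(xj);
hj = ((2^(p+2)*p^2*(p+2)*(p+4)*gamma(p/2)) / (n*(2*p+1)))^(1/(p+4)) * std(xj);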

The Epanechnikov kernel was used on the same multivariate data set described in section 4.2.1 with one normally distributed Y variable and seven normally distributed X variables with n = 50 observations. Histograms and scatter plots for the data are shown in Appendix Figures A.1 and A.2.

Table 4.10 summarizes the AIC, RMSE, ICOMP(IFIM), and bandwidth h values cal- culated for the true underlying model with predictor variables X1, X2, and X3 using the Epanechnikov kernel.

Table 4.10 shows the results from the six models used for univariate Y . Performing kernel regression smoothing only on the X variables for models Y = Xkβ and Y = Xmkβ increased the ICOMP(IFIM) to 103 and 194.9, respectively, from the baseline model of 79.8.

Similarly, the RMSE and AIC also increased considerably from the baseline model where no kernel regression smoothing was applied to the data. Since RMSE, AIC, and ICOMP(IFIM) increased for these models, applying the kernel regression only to the X variables resulted in poorer models than the baseline. However, applying the kernel regression smoothing to the Y variable decreased the RMSE, AIC, and ICOMP(IFIM). The RMSE decreased from the baseline model of 0.474 to 0.146, 0.144, and 0.26 for the Yk = Xβ, Yk = Xkβ, and Yk = Xmkβ models, respectively. Similarly, the AIC decreased from 73.1 to -44.4,

-45.9, and 13.3 for the Yk = Xβ, Yk = Xkβ, and Yk = Xmkβ models, respectively, while

ICOMP(IFIM) decreased from 79.8 to -46.1, -59.2, and 6.79 for the Yk = Xβ, Yk = Xkβ, and Yk = Xmkβ models, respectively. The Yk = Xkβ model had the best RMSE, AIC, and ICOMP(IFIM) scores.

These results agree well with Table 4.2 where the Gaussian kernel was used. The RMSE,

AIC, and ICOMP(IFIM) scores using the normal kernel in Table 4.2 were lower for the Yk models than the Epanechnikov kernel results in Table 4.10, particularly for the Yk = Xmkβ model. Since the data were normally distributed, the Gaussian kernel performed slightly better than the Epanechnikov kernel.

Table 4.10: Results for applying univariate kernel regression with the Epanechnikov kernel on normal data.

Model          RMSE    AIC     ICOMP(IFIM)   h for Y   h for X
Y = Xβ         0.474   73.1    79.8
Y = Xkβ        0.656   105.7   103.0                   *
Y = Xmkβ       1.475   186.8   194.9                   0.0177
Yk = Xβ        0.146   -44.4   -46.1         1.677
Yk = Xkβ       0.144   -45.9   -59.2         1.677     *
Yk = Xmkβ      0.260   13.3    6.79          1.677     0.0177

*Optimal bandwidth h values for X1, X2, ..., X7 individually are respectively: [1.1118, 1.0691, 0.9609, 4.7303, 8.2965, 7.1532, 2.2444] using Eqn. 4.26.

4.2.8 Univariate Kernel Regression Using the Epanechnikov Kernel on PE Data

The Epanechnikov kernel defined in Eqn. 4.25 and shown in Fig. 4.46 was applied to the PE distributed data described in Section 4.2.4 to compare the performance of the

Epanechnikov and Gaussian kernels on highly skewed non-normal data.

The equations for calculating the optimal bandwidth h using an Epanechnikov kernel are shown in Eqns. 4.26 and 4.27. The Epanechnikov kernel was used on the same multivariate PE data set described in section 4.2.4 with one PE distributed Y variable and seven PE distributed X variables with n = 50 observations. Eqn. 4.26 was used when calculating h*_j one variable at a time. Eqn. 4.27 was used when calculating a single optimal h value for all X variables together. Table 4.11 summarizes the AIC, RMSE, ICOMP(IFIM), and bandwidth h values calculated for the true underlying model using the Epanechnikov kernel.

Table 4.11 shows the results from the six models used for univariate Y . Performing kernel regression smoothing only on the X variables for models Y = Xkβ and Y = Xmkβ increased the ICOMP(IFIM) to 549.3 and 556.9, respectively, from the baseline model of

58.9. Similarly, the RMSE also increased approximately two orders of magnitude from the baseline model where no kernel regression smoothing was applied to the data.

Table 4.11: Results for applying univariate kernel regression with the Epanechnikov kernel on power exponential data.

Model          RMSE    AIC     ICOMP(IFIM)   h for Y   h for X
Y = Xβ         0.365   47.2    58.9
Y = Xkβ        33.0    497.4   549.3                   *
Y = Xmkβ       34.5    502.1   556.9                   3.42
Yk = Xβ        7.78    353.1   393.7         39.1
Yk = Xkβ       5.03    309.4   333.3         39.1      *
Yk = Xmkβ      6.89    340.9   371.7         39.1      3.42

*Optimal bandwidth h values for X1, X2, ..., X7 individually are respectively: [12.2153, 3.8303, 42.9406, 2.5444, 5.7193, 19.6565, 98.13] using Eqn. 4.26.

AIC increased more than one order of magnitude from 47.2 in the baseline model to

497.4 and 502.1 for the Y = Xkβ and Y = Xmkβ models, respectively. Since RMSE, AIC, and ICOMP(IFIM) increased for these models, applying the kernel regression only to the

X variables resulted in poorer models than the baseline.

However, applying the kernel regression smoothing to the Y variable also increased the

RMSE, AIC, and ICOMP(IFIM), but not nearly as much as the Y = Xkβ and Y = Xmkβ models. The RMSE increased from the baseline model of 0.365 to 7.78, 5.03, and 6.89 for the models Yk = Xβ, Yk = Xkβ, and Yk = Xmkβ, respectively. Similarly, the AIC

increased from 47.2 to 353.1, 309.4, and 340.9 for the models Yk = Xβ, Yk = Xkβ, and

Yk = Xmkβ, respectively, while ICOMP(IFIM) increased from 58.9 to 393.7, 333.3, and

371.7 for these models, respectively. Among the kernel models, the Yk = Xkβ model had the best RMSE, AIC, and ICOMP(IFIM) scores. However, none of the kernel models

performed better than the baseline model.

These results agree well with Table 4.7 where the Gaussian kernel was applied on the

same PE data. Note the Epanechnikov kernel outperformed the Gaussian kernel for the

models Yk = Xβ, Yk = Xkβ, and Yk = Xmkβ since the RMSE, AIC, and ICOMP(IFIM) scores are slightly smaller in Table 4.11 than in Table 4.7.

4.3 Conclusions

Kernel regression smoothing on the univariate Y variable has been shown to significantly lower the ICOMP(IFIM), RMSE, and AIC scores compared to the baseline linear regression models for normally distributed data, Hald's cement data, and Friedman's nonlinear data.

However, for power exponential and uncorrelated uniformly distributed data, kernel regres- sion did not lower ICOMP(IFIM), RMSE, and AIC scores. Kernel regression smoothing on the X variables consistently produced poorer models compared to the baseline for all data sets.

The Epanechnikov kernel was applied to both the normally distributed data and power exponential data to compare its performance with the Gaussian kernel. The Epanechnikov kernel performed better than the Gaussian kernel for the skewed power exponential data, but not as well with the normal data. Therefore, the choice of the kernel can matter depending on the data.

Chapter 5

Multivariate Kernel Regression

5.1 Introduction

Chapter 3 and Section 4.1 present the theoretical framework for kernel regression. Kernel regression has been applied using a variety of bandwidth estimators as discussed in chapter

3. However, this dissertation proposes the extension of kernel regression to the multivariate case using ICOMP for variable model selection.

The univariate kernel regression described in section 4.1 is extended here to the multi- variate case. The multivariate partially linear model is defined as

Y = X^T β + g(T) + E,   (5.1)

where Y is a (n × m) response matrix, X and T are d-dimensional and scalar regressors,

β is a (d × m) matrix of unknown parameters, g(·) is an unknown smooth function, and E is an error matrix with E(ε_i) = 0.

Let g_n(t, β) = Σ_{i=1}^{n} ω_{ni}(t)(Y_i − X_i^T β) for a given matrix β, where ω_{ni}(t) is defined in Eqn. 4.2. By substituting g_n(t, β) into Eqn. 5.1 and using the least squares criterion, the least squares estimator of β is shown as

β̂_KR = (X̃^T X̃)^{-1} X̃^T Ỹ,   (5.2)

where X̃ = (X̃_1, ..., X̃_n)^T with X̃_j = X_j − Σ_{i=1}^{n} ω_{ni}(T_j) X_i, and Ỹ_k = (Ỹ_1, ..., Ỹ_n)^T with Ỹ_j = Y_j − Σ_{i=1}^{n} ω_{ni}(T_j) Y_i for k = 1, 2, ..., p. The nonparametric part g(t) is estimated by

ĝ_n(t) = Σ_{i=1}^{n} ω_{ni}(t)(Y_i − X_i^T β̂_KR).   (5.3)

When ε_1, ..., ε_p are identically distributed, their common variance Σ may be estimated by Σ̂ as

Σ̂ = (Ỹ − X̃β̂_KR)^T (Ỹ − X̃β̂_KR).   (5.4)
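To make Eqns. 5.2 through 5.4 concrete, the following simplified MATLAB sketch computes the estimator with Gaussian kernel weights for a single smoothing variable t and a fixed bandwidth h. It is an illustration only (it assumes MATLAB's implicit expansion, R2016b or later) and is not the dissertation's production code.

% Sketch of the partially linear kernel regression estimator (Eqns. 5.2-5.4).
function [betaKR, Sigma] = plm_kernel(Y, X, t, h)
    D  = (t - t') / h;                 % pairwise scaled differences of the smoothing variable
    W  = exp(-0.5*D.^2);               % Gaussian kernel weights omega_ni(t)
    W  = W ./ sum(W, 2);               % normalize each row of weights
    Xt = X - W*X;                      % X-tilde: X minus its kernel smooth
    Yt = Y - W*Y;                      % Y-tilde: Y minus its kernel smooth
    betaKR = (Xt'*Xt) \ (Xt'*Yt);      % Eqn. 5.2
    R      = Yt - Xt*betaKR;
    Sigma  = R'*R;                     % Eqn. 5.4 (as written, without normalization)
    % The nonparametric part (Eqn. 5.3) would be ghat = W*(Y - X*betaKR).
end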

Eqn. 4.2 shows the weight is calculated from only one regressor Xi independent of the

other Xj. Only the data within each Xi is used in the weighting for smoothing the data.

Eqn. 4.7 is an extension of Eqn. 4.2 that uses the data from all X1, ..., Xd or Y1, ..., Yp variables simultaneously to smooth the data.

Eqn. 4.7 can be applied to the independent variables X_i, the dependent variables Y_i, or both in the multivariate case. Let g_n(t, β) = Σ_{i=1}^{n} Σ_{k=1}^{d} ω_{nik}(t)(Y_i − X_i^T β) for a given β. By substituting g_n(t, β) into Eqn. 5.1 and using the least squares criterion, the least squares estimator of β is shown in Eqn. 5.2, where X̃ = (X̃_1, ..., X̃_n)^T with X̃_j = X_j − Σ_{i=1}^{n} Σ_{k=1}^{d} ω_{nik}(T_j) X_i and Ỹ_k = (Ỹ_1, ..., Ỹ_n)^T with Ỹ_j = Y_j − Σ_{i=1}^{n} Σ_{k=1}^{d} ω_{nik}(T_j) Y_i. The nonparametric part g(t) is estimated by

ĝ_n(t) = Σ_{i=1}^{n} Σ_{k=1}^{d} ω_{nik}(t)(Y_i − X_i^T β̂_KR).   (5.5)

When ε_1, ..., ε_p are identically distributed, their common variance Σ may be estimated by Σ̂ in Eqn. 5.4. The optimal bandwidth matrix Ĥ is calculated from Eqn. 3.41.

Multivariate data sets were simulated with both normal and non-normal error structures

to determine the optimal use of univariate or multivariate kernel regression on minimizing

the ICOMP(IFIM) from Eqn. 2.15.

The simulated multivariate data sets were small enough so that all possible subset models were considered. MATLAB programs were written to calculate AIC, root mean square error (RMSE), and ICOMP(IFIM) from the model with the smallest ICOMP(IFIM) from Eqn. 2.15 after considering all possible subset models.

AIC for multivariate regression derived from Eqn. 2.1 is defined by

AIC(k) = np log(2π) + n log(σ²|Σ|) + np + 2(k + 1),   (5.6)

where p = number of dependent variables Y1, Y2, ..., Yp, k = number of independent

variables X1, X2, ..., Xk, and n = number of observations. For multivariate data sets, all possible combinations of applying kernel regression were calculated using the notation

described in Table 5.1.
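For reference, Eqn. 5.6 translates directly into a one-line MATLAB function handle; the names used here are illustrative only.

% AIC for multivariate regression (Eqn. 5.6); sig2 and Sigma are the estimated
% error variance and covariance matrix of the fitted model.
aic56 = @(n, p, k, sig2, Sigma) n*p*log(2*pi) + n*log(sig2*det(Sigma)) + n*p + 2*(k+1);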

5.2 Kernel Regression with Normal Data

A multivariate data set with seven normally distributed X variables and three normally

distributed Y variables was simulated with n = 50 observations as first discussed in section

4.2.1. Two additional Y response variables were added to the data set using the variables

X1, X2, and X3 in the true model.

Figure 5.1 is a surface plot of the first two variables X1 and X2. These are approximately bivariate normal.

Histograms and scatter plots of each of the simulated variables X1, X2, ..., X7, and Y1,

Y2, and Y3 are shown in Appendix Figures A.9 and A.10 where each variable is approx- imately normally distributed. Eqn. 3.55 was used to calculate the optimal bandwidth h

values when kernel regression is applied individually to each Xi or Yi variable. Eqn. 3.54 was used to calculate the optimal bandwidth h value when kernel regression is

applied to all X variables together (Xmk models) or all Y variables together (Ymk models).

Table 5.1: Notation used for applying multivariate kernel regression smoothing methods.

Model        Description
Y = Xβ       Baseline model with no kernel regression smoothing on either X or Y variables
Y = Xkβ      Univariate kernel regression on X variables one at a time, but no kernel regression on Y variables
Y = Xmkβ     Multivariate kernel regression on all X variables together, but no kernel regression on Y variables
Yk = Xβ      Univariate kernel regression on Y variables one at a time, but no kernel regression on X variables
Yk = Xkβ     Univariate kernel regression on Y and X variables one at a time
Yk = Xmkβ    Univariate kernel regression on Y variables one at a time, and multivariate kernel regression on all X variables together
Ymk = Xβ     Multivariate kernel regression on all Y variables together, but no kernel regression on X variables
Ymk = Xkβ    Multivariate kernel regression on all Y variables together, and univariate kernel regression on X variables one at a time
Ymk = Xmkβ   Multivariate kernel regression on all Y variables together, and multivariate kernel regression on all X variables together

107 0 9 :1 2 T u e s d a y , A u g u s t 1 8 , 2 0 0 9 1

T he K D E P rocedure

In p u ts D a ta S e t D IS .H A L D S N u m b e r o f O b se rv a tio n s U se d 1 3 V a ria b le 1 X 1 V a ria b le 2 X 2 B a n d w id th M e th o d S im p le N o rm a l R e fe re n c e

C o n tro ls X 1 X 2 G rid P o in ts 6 0 6 0 L o w e r G rid L im it 1 2 6 U p p e r G rid L im it 2 1 7 1 B a n d w id th M u ltip lie r 1 1

Figure 5.1: Bivariate surface plot of normally distributed variables X1 and X2.

Therefore, there is only one h value for all X or Y variables when kernel regression is applied to all X or Y variables together.

Figure 5.2 shows smoothed kernel density plots for the various values of hj calculated from Eqn. 3.55 for the variable Y2. Figures 4.2 through 4.9 show smoothed kernel density plots for the values of hj calculated from Eqn. 3.55 for the variables X1, X2, ..., X7, and

Y1 in section 4.2.1. Figure 5.2 shows that as the bandwidth h increases the kernel density

plots become smoother. The smallest hj = 1.6352 is too small since the kernel density plot

is very jagged. Yet the largest hj = 24.1326 is too large since the kernel density plot has

oversmoothed the data and removed features of the data that may be important. For Y2 the bandwidth h that minimizes the ICOMP score is h = 6.2819.

Similarly, Figure 5.3 shows smoothed kernel density plots for various values of h calcu-

lated from Eqn. 3.55 for the Y3 variable for the Yk models. The value of h that minimizes

ICOMP for the Y3 variable is 1.3743. Table 5.2 shows the results from the nine kernel models for multivariate Y . Performing

kernel regression smoothing only on the X variables for models Y = Xkβ and Y = Xmkβ increased the ICOMP(IFIM) to 425 and 787, respectively, from the baseline multivariate

model of 319. Similarly, the AIC also increased from the baseline model of 205 to 370 and

647 for models Y = Xkβ and Y = Xmkβ, respectively. RMSE increased almost two orders

of magnitude from 0.0973 for the baseline to 8.26 for the Y = Xmkβ model. Since RMSE, AIC, and ICOMP(IFIM) increased for these models, applying the kernel regression only

to the X variables resulted in poorer models than the baseline.

However, applying the kernel regression smoothing to the multivariate Y variables

dramatically decreased the RMSE, AIC, and ICOMP(IFIM) as in the univariate Y case in

section 4.2.1. The RMSE decreased more than one order of magnitude from the baseline

model of 0.0973 to 0.008 for the Yk = Xβ model. Similarly, the AIC decreased from

205 to -47.4 for the Yk = Xβ model, while ICOMP(IFIM) decreased from 319 to 45.2.

Among the three Yk models, RMSE, AIC, and ICOMP(IFIM) increased dramatically for

the Yk = Xmkβ model compared to the Yk = Xβ and Yk = Xkβ models. However,

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 1.6352 to 24.1326.]

Figure 5.2: Kernel density plots for various h values for normally distributed variable Y2.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 0.50085 to 7.3914.]

Figure 5.3: Kernel density plots for various h values for normally distributed variable Y3.

Table 5.2: Results for applying multivariate kernel regression smoothing methods with normally distributed data.

Model          RMSE     AIC     ICOMP(IFIM)   h for Y   h for X
Y = Xβ         0.0973   205     319
Y = Xkβ        0.5178   370     425                     *
Y = Xmkβ       8.26     647     787                     0.0062
Yk = Xβ        0.0080   -47.4   45.2          **
Yk = Xkβ       0.0103   -21.9   42.6          **        *
Yk = Xmkβ      0.0617   157     275           **        0.0062
Ymk = Xβ       0.0032   -138    -93.5         0.8235
Ymk = Xkβ      0.0031   -141    -131          0.8235    *
Ymk = Xmkβ     0.0123   -3.81   18.4          0.8235    0.0062

*Optimal bandwidth h values for X1, X2, ..., X7 individually are respectively: [0.5892, 0.6766, 0.5691, 2.9936, 5.2505, 3.2336, 1.0146].
**Optimal bandwidth h values for Y1, Y2, Y3 individually are respectively: [0.4766, 6.2819, 1.3743].

compared to the baseline model all Yk models have lower RMSE, AIC, and ICOMP(IFIM) scores.

The Ymk models provide the best improvement in RMSE, AIC, and ICOMP(IFIM)

compared to the other models. The Ymk = Xkβ model has the lowest RMSE, AIC, and ICOMP(IFIM) scores of all nine models with scores of 0.0031, -141, and -131, respectively.

The RMSE, AIC, and ICOMP(IFIM) scores increased considerably for the Ymk = Xmkβ

model compared to the Ymk = Xβ and Ymk = Xkβ models. This pattern is consistent for

the Y , Yk, and Ymk models in Table 5.2. Using a single h bandwidth for all X variables consistently produces poorer models than applying kernel regression smoothing to the Y

variables.

For normally distributed data, applying the kernel regression smoothing to the Y vari-

ables had the most dramatic effect in lowering the RMSE, AIC, and ICOMP(IFIM).

5.3 Kernel Regression for Multivariate Power Exponential Distribution

Liu and Bozdogan [Liu and Bozdogan, 2008] discuss multivariate regression models with power exponential (PE) random errors. The univariate random variable x is PE distributed if the density function of x is defined by Eqn. 4.20 in section 4.2.4.

Gomez [Gomez, Gomez-Villegas and Marin, 1998] published a multivariate generalization of the PE family of distributions, denoted PE_p(µ, Σ, β), defined by

f(x; µ, Σ, β) = [ p Γ(p/2) / ( π^{p/2} Γ(1 + p/(2β)) 2^{1 + p/(2β)} ) ] |Σ|^{-1/2} exp{ −(1/2) [ (x − µ)′ Σ^{-1} (x − µ) ]^β },   (5.7)

where x = [x₁, x₂, ..., x_p]′ is a p-dimensional random vector, µ ∈ R^p, Σ is a (p × p) positive definite symmetric matrix, and β > 0 is the shape parameter. Note that when p = 1, Eqn. 5.7 reduces to Eqn. 4.20.

In order to generate univariate PE pseudo-random numbers, Liu and Bozdogan [Liu and Bozdogan, 2008] show that if Y = (1/2)|X|^(2β), where X has a standard PE distribution with shape parameter β, then Y ∼ Gamma(1/(2β)). So Y has a distribution with density

f(y) = y^(1/(2β) − 1) e^(−y) / Γ(1/(2β)).   (5.8)

A multivariate PE data set with seven independent X variables and one dependent

Y variable was simulated with n = 50 observations as discussed in section 4.2.4. Two

additional Y response variables were added to the data set using the variables X1, X2,

and X3 in the true model. The residuals from the multivariate linear regression models are gamma distributed with β = 0.3. A MATLAB program performed multivariate kernel regression smoothing on the multivariate PE distributed data.
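A minimal MATLAB sketch of the Gamma-based generator implied by this relationship is given below; it produces standard PE variates with shape parameter β and assumes the Statistics Toolbox function gamrnd is available (location and scale would be applied afterward as needed).

% Generate standard univariate PE pseudo-random numbers via the Gamma relation (Eqn. 5.8).
beta = 0.3;                             % shape parameter
n    = 50;
g    = gamrnd(1/(2*beta), 1, n, 1);     % Y ~ Gamma(1/(2*beta))
s    = sign(rand(n, 1) - 0.5);          % random signs
x    = s .* (2*g).^(1/(2*beta));        % invert Y = (1/2)|X|^(2*beta)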

113 Figure 5.4 is a surface plot of the first two variables X1 and X2 where β = 0.3. The data are gamma distributed with thin tails.

Histograms and scatter plots of each of the simulated variables X1, X2, ..., X7, and Y1,

Y2, and Y3 are shown in Appendix Figures A.11 and A.12 where each variable is skewed and non-normally distributed. Eqn. 3.55 was used to calculate the optimal bandwidth h values when kernel regression is applied individually to each Xi or Yi variable. Therefore, there is one h bandwidth for each variable when kernel regression is applied individually to each variable.

Eqn. 3.54 was used to calculate the optimal bandwidth h value when kernel regression is applied to all X variables together (Xmk models) or all Y variables together (Ymk models). Therefore, there is only one h value for all X or Y variables when kernel regression is applied to all X or Y variables together.

Figure 5.5 shows smoothed kernel density plots for the various values of hj calculated from Eqn. 3.55 for the variable Y2. Figures 4.17 through 4.24 show smoothed kernel density plots for the values of hj calculated from Eqn. 3.55 for the variables X1, X2, ..., X7, and

Y1, respectively, in section 4.2.4. Figure 5.5 shows that as the bandwidth h increases the kernel density plots become smoother. The smallest hj = 10.7018 is too small since the kernel density plot is jagged. Yet the largest hj = 157.9359 is too large since the kernel density plot has oversmoothed the data and removed features of the data that may be important. For Y2 the bandwidth h that minimizes the ICOMP score is h = 41.112.

Similarly, Figure 5.6 shows smoothed kernel density plots for various values of hj calcu- lated from Eqn. 3.55 for the Y3 variable for the Yk models. The value of h that minimizes

ICOMP for the Y3 variable is 10.0665. Table 5.3 shows the results from the nine kernel models for multivariate Y . Performing kernel regression smoothing only on the X variables for models Y = Xkβ and Y = Xmkβ increased the ICOMP(IFIM) to 1663 and 1705, respectively, from the baseline multivariate model of 434. Similarly, the AIC also increased from the baseline model of 184 to 1460 and

1497 for models Y = Xkβ and Y = Xmkβ, respectively. RMSE increased more than five

114 0 9 :1 2 T u e s d a y , A u g u s t 1 8 , 2 0 0 9 1

T he K D E P rocedure

In p u ts D a ta S e t D IS .M V P E R _ M U L T IV A R IA T E N u m b e r o f O b se rv a tio n s U se d 5 0 V a ria b le 1 X 1 V a ria b le 2 X 2 B a n d w id th M e th o d S im p le N o rm a l R e fe re n c e

C o n tro ls X 1 X 2 G rid P o in ts 6 0 6 0 L o w e r G rid L im it -6 .2 0 3 3 .8 4 7 U p p e r G rid L im it 6 4 .8 9 6 3 0 .2 3 7 B a n d w id th M u ltip lie r 1 1

Figure 5.4: Bivariate surface plot of power exponential variables X1 and X2.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 10.7018 to 157.9359.]

Figure 5.5: Kernel density plots for various h values for PE distributed variable Y2.

[Nine-panel figure: kernel density estimates f(x) versus x for bandwidths ĥ(p) ranging from 3.6686 to 54.1403.]

Figure 5.6: Kernel density plots for various h values for PE distributed variable Y3.

Table 5.3: Results for applying multivariate kernel regression smoothing methods with power exponential data.

Model          RMSE     AIC    ICOMP(IFIM)   h for Y   h for X
Y = Xβ         0.080    184    434
Y = Xkβ        28,114   1460   1663                    *
Y = Xmkβ       40,783   1497   1705                    1.193
Yk = Xβ        318      1012   1177          **
Yk = Xkβ       12.8     691    824           **        *
Yk = Xmkβ      480      1053   1185          **        1.193
Ymk = Xβ       136      927    1066          0.812
Ymk = Xkβ      131      923    1023          0.812     *
Ymk = Xmkβ     149      936    1048          0.812     1.193

*Optimal bandwidth h values for X1, X2, ..., X7 individually are respectively: [4.988, 2.5413, 18.5111, 0.7633, 2.7108, 14.5435, 82.7758].
**Optimal bandwidth h values for Y1, Y2, Y3 individually are respectively: [20.4193, 41.112, 10.0665].

orders of magnitude from 0.08 for the baseline to 40,783 for the Y = Xmkβ model. Since

However, applying the kernel regression smoothing to the multivariate Y variables also increased the RMSE, AIC, and ICOMP(IFIM) as in the univariate Y case in section 4.2.4.

The RMSE increased more than two orders of magnitude from the baseline model of 0.08 to 318, 12.8, and 480 for the Yk = Xβ, Yk = Xkβ, and Yk = Xmkβ models, respectively.

Similarly, the AIC scores increased from 184 to 1012, 691, and 1053 for the Yk = Xβ,

Yk = Xkβ, and Yk = Xmkβ models, respectively, while ICOMP(IFIM) scores increased from 434 to 1177, 824, and 1185 for the three models, respectively. Among the three Yk models, RMSE, AIC, and ICOMP(IFIM) scores are much lower for the Yk = Xkβ model compared to the Yk = Xβ and Yk = Xmkβ models. However, compared to the baseline model all Yk models have higher RMSE, AIC, and ICOMP(IFIM) scores.

The Ymk models generally provide the better improvement in RMSE, ICOMP(IFIM), and AIC scores compared to all other models except the Yk = Xkβ model. There is little difference between the three Ymk models as shown in Table 5.3. Among the kernel regression models, the Yk = Xkβ model minimizes RMSE, AIC, and ICOMP(IFIM), although the scores are not as low as the baseline model. Kernel regression smoothing did not improve the baseline model for PE data using the Gaussian kernel due to its skewness and peaks.

5.4 Conclusions

Kernel regression smoothing on the multivariate Y variables has been shown to significantly lower the ICOMP(IFIM), RMSE, and AIC scores compared to the baseline linear regression models for normally distributed data. However, kernel regression did not lower

ICOMP(IFIM), RMSE, and AIC scores compared to the baseline model for power expo- nential data due to its high skewness and sharp peaks using the Gaussian kernel. Kernel regression smoothing on the X variables singly or jointly produces a poorer model com- pared to the baseline in both normally and power exponential distributed data.

Chapter 6

Kernel Regression with the Genetic Algorithm

6.1 Introduction

Chapters 4 and 5 presented results of applying kernel regression to univariate and multivariate data sets using both the Gaussian and Epanechnikov kernels. In several cases, applying the kernel to the Y variables dramatically reduced both the AIC and ICOMP(IFIM) compared to the model where no kernel regression was used. Chapter 6 presents results of integrating kernel regression with the genetic algorithm (GA).

6.2 Genetic Algorithm

A genetic algorithm is a heuristic search algorithm based on concepts of biological evolution and natural selection [Goldberg, 1989]. GA has many applications as a numerical optimiza- tion technique for complex problems in many different disciplines including management science, statistics, industrial engineering and operations research. GAs take into consider- ation all historical information involved in a problem as well as random information [Said,

2005].

A GA treats information as a series of codes on a string, where each string represents a different solution to a given problem. The strings are analogous to genetic information coded by genes on a chromosome. Each string is evaluated according to a "fitness" function that measures its ability to solve the problem. Genetic algorithms are search and optimization tools that enable the fittest candidate among strings to survive and reproduce based on information exchange imitating natural biological selection. "Parent" strings reproduce into two "child" strings that are created using portions of the parent strings.

The general procedure in the GA involves the following seven steps:

1. Convert predictor variables into a string of binary code

2. Create a population of breeding strings

3. Define the fitness function

4. Select the mating pool based on values of the fitness function

5. Perform chromosomal crossover and genetic mutation

6. Create the next generation of models

7. Loop back to step 2 until termination criteria are met.

Step 1 in GA for regression modeling is to use an encoding scheme for the various possible combinations of predictor variables into a string, where each locus in the string is a binary code indicating the presence (1) or absence (0) of a given predictor variable. For each string, locus 0 represents the intercept term, while locus i represents predictor variable i for i = 1, 2, ..., p. Each string has length p + 1. For example, if there are p = 5 predictor variables, the string “110101” represents the regression model that includes the intercept

(locus 0 has "1"), predictor variable 1 is included (locus 1 has "1"), predictor variable 2 is excluded (locus 2 has "0"), predictor variable 3 is included (locus 3 has "1"), predictor variable 4 is excluded (locus 4 has "0"), and predictor variable 5 is included (locus 5 has "1").
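In MATLAB such a string is conveniently stored as a logical vector, as in the illustrative snippet below; the data matrix here is a placeholder, and since MATLAB indexing is 1-based, locus 0 of the string corresponds to element 1 of the vector.

% Decode a GA chromosome into a regression design matrix.
chrom = logical([1 1 0 1 0 1]);        % the example string "110101"
X     = rand(100, 5);                  % placeholder predictor matrix
Xfull = [ones(size(X,1),1), X];        % intercept column followed by the predictors
Xsub  = Xfull(:, chrom);               % columns selected by the chromosome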

Step 2 in GA is to create an initial population of breeding strings or models. The initial population size is chosen by the investigator and is not random. Each model is selected by randomly choosing variables to be included in the model and encoding the string based on

the presence or absence of each variable.

Step 3 in GA is to define the criterion to rank solutions called the fitness function. The

fitness function used for model selection is AIC or the information complexity (ICOMP) of the inverse Fisher information matrix (IFIM) given by

ICOMP(IFIM) = n log(2π) + (n − q − 2) log(σ̂²) + n + (q + 1) log[ ( tr(σ̂²(X′X)^{-1}) + 2σ̂⁴/n ) / (q + 1) ] − log|(X′X)^{-1}| + log(n/2),   (6.1)

where q = k + 1 and k = number of parameters estimated in the model. Eqn. 6.1 is identical to Eqn. 2.13.
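A compact MATLAB sketch of this fitness evaluation, computed directly from the definition ICOMP(IFIM) = −2 log L + 2C₁(F̂⁻¹) with the usual block-diagonal IFIM estimate for a linear model (algebraically equivalent to Eqn. 6.1 after rearrangement), is shown below; it is an illustration, not the M³ toolbox implementation.

% ICOMP(IFIM) fitness for one candidate regression model (sketch).
function fit = icomp_ifim(y, Xsub)
    [n, q] = size(Xsub);                   % q columns, including the intercept
    b      = Xsub \ y;
    sig2   = sum((y - Xsub*b).^2) / n;     % MLE of sigma^2
    Finv   = blkdiag(sig2*inv(Xsub'*Xsub), 2*sig2^2/n);
    s      = size(Finv, 1);                % s = q + 1
    C1     = (s/2)*log(trace(Finv)/s) - 0.5*log(det(Finv));
    fit    = n*log(2*pi) + n*log(sig2) + n + 2*C1;
end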

Step 4 in GA is to select the mating pool based on their ICOMP values from Eqn.

6.1. After calculating the ICOMP criterion for each of the models in the population, each

ICOMP criterion is subtracted from the highest criterion value in the population. The

arithmetic average of these differences and the ratio of each model’s difference value to the

mean value are calculated. This process is called the natural selection strategy since the

probability of an individual being selected is proportional to the ratio

r_j = ( Fitness_Max − Fitness_j ) / ( (1/n) Σ_{j=1}^{n} Fitness_j ),   (6.2)

where Fitness_Max is the maximum ICOMP fitness value among the population and Fitness_j is the ICOMP fitness value of the jth model. The probability of an individual model being selected is proportional to the ratio in Eqn. 6.2. Parents from the population are chosen whose ratios are greater than one, which guarantees the mating pool contains the models with better fitness. This process of selecting parents and producing offspring models continues until the number of offspring equals the initial population size [Goldberg, 1989].

Step 5 in GA is the crossover and mutation process. Whether a chosen pair of chromosomes undergoes crossover is controlled by the crossover probability, which is input by the user. The crossover point is a randomly selected location on the binary string that separates the string into two parts. The portions of the two strings to the right of the crossover point are interchanged between the parents to form two offspring strings. For example, if parent A is the string "011|01001110" while parent B is the string

“010|11100101” where the symbol “|” is the randomly chosen crossover point, then offspring

A would be the string “011|11100101” and offspring B would be the string “010|01001110”.
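A minimal MATLAB illustration of this single-point crossover operation (illustrative only; the M³ toolbox implementation may differ) is:

% Randomized single-point crossover of two parent chromosomes.
parentA = logical([0 1 1 0 1 0 0 1 1 1 0]);       % "011|01001110" without the bar
parentB = logical([0 1 0 1 1 1 0 0 1 0 1]);       % "010|11100101" without the bar
cut     = 3;                                      % crossover point after the third element (randomly chosen in practice)
childA  = [parentA(1:cut), parentB(cut+1:end)];   % gives "011|11100101"
childB  = [parentB(1:cut), parentA(cut+1:end)];   % gives "010|01001110"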

A crossover probability of zero means that the members of the mating pool are carried over into the next generation and no offspring are produced. On the other hand, a crossover probability of 1 means that crossover always occurs between any two parent models chosen from the mating pool; hence, the next generation will consist only of offspring models with no models from the previous generation. Models that performed well in the previous population could then be removed from subsequent generations before selection can produce improvements from their contributions. However, a low crossover rate will likely slow down the search by not producing new combinations of variables fast enough [Luh, Minesky and

Bozdogan, 1996].

A mutation of models is used in the GA process to create new combinations of variables so the searching process can jump to other areas of the fitness landscape in search of the global optimal model. Mutation occurs when a randomly selected locus changes from 0 to

1 or from 1 to 0 with a specified probability called the mutation probability. The elitism rule is applied which allows the best model in the previous generation to remain in the subsequent generations. This guarantees the individual model with the most preferred

fitness in the current generation will survive to the next generation.

Step 6 in the GA is creating the next generation of models. The number of generations is set by the user in the algorithm. The number of generations the GA is allowed to reproduce should be large enough to have a high probability of searching most areas of the search landscape for the global optimum. However, the number of generations should be small enough that the algorithm terminates in a reasonable amount of time without being computationally expensive.

The GA continues looping through steps 2 through 6 until the algorithm has executed the total number of generations allowed or a prescribed number of generations have been executed with no improvement in the objective function.
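Putting these steps together, a highly simplified MATLAB sketch of such a GA loop for regression subset selection is given below. It is not the M³ toolbox code: the population size, operators, and the AIC-style fitness are placeholder choices, and the rank-based selection used here is only a crude stand-in for the toolbox's ranking scheme.

% Simplified GA loop for regression subset selection (illustrative only).
function best = ga_subset_sketch(y, X)
    [n, p]  = size(X);
    popSize = 30;  nGen = 60;  pCross = 0.75;  pMut = 0.10;
    pop      = rand(popSize, p+1) > 0.5;       % random chromosomes; column 1 = intercept
    pop(:,1) = true;                           % always keep the intercept
    fitFun   = @(c) aic_fit(y, [ones(n,1) X], c);
    for g = 1:nGen
        fit = arrayfun(@(i) fitFun(pop(i,:)), 1:popSize)';
        [~, order] = sort(fit);                % smaller criterion value is better
        elite = pop(order(1), :);              % elitism: remember the best model
        pool  = pop(order(randi(ceil(popSize/2), popSize, 1)), :);  % crude rank-based pool
        next  = pop;
        for i = 1:2:popSize-1                  % single-point crossover on mating pairs
            a = pool(i,:);  b = pool(i+1,:);
            if rand < pCross
                cut = randi(p);
                tmp = a(cut+1:end);  a(cut+1:end) = b(cut+1:end);  b(cut+1:end) = tmp;
            end
            next(i,:) = a;  next(i+1,:) = b;
        end
        next = xor(next, rand(popSize, p+1) < pMut);   % mutation: random bit flips
        next(:,1) = true;
        next(1,:) = elite;                     % carry the elite model forward
        pop = next;
    end
    fit = arrayfun(@(i) fitFun(pop(i,:)), 1:popSize)';
    [~, ibest] = min(fit);
    best = pop(ibest, :);
end

function a = aic_fit(y, Xfull, chrom)
    Xs = Xfull(:, chrom);
    n  = numel(y);
    b  = Xs \ y;
    s2 = sum((y - Xs*b).^2) / n;
    a  = n*log(2*pi) + n*log(s2) + n + 2*(size(Xs,2) + 1);   % AIC-style fitness
end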

The two main types of traditional search and optimization methods are calculus-based and enumerative. Calculus-based methods optimize an objective function subject to linear or non-linear constraints. The Lagrange methodology searches for local extremal points using gradients. Calculus-based methods assume the functions are differentiable and continuous. How close the starting point of the optimization is to the local maximum can determine whether the method converges to a local or global maximum, or even converges at all. Hence, many calculus-based optimization methods are not robust. When calculus-based methods do converge on a local maximum, it may not be clear whether the local maximum is also the global maximum. Real world applications are known to have discontinuities with multimodal and chaotic search domains, making them difficult to model with the continuous, differentiable functions required of calculus-based methods. These methods lack the power to provide a robust search through the optimization techniques [Said,

2005].

Enumeration techniques depend on either searching a finite continuous highly focused region or an infinite disconnected search region where the algorithm evaluates every point in the region. However, these techniques are inefficient and not robust. For excessively large problems, the number of evaluations could be prohibitive even with modern computing capabilities. Enumerative methods are best used for small optimization problems where the domain space can be searched without prohibitive costs in time or computing resources.

By contrast, GAs can be applied to solving problems where the number of possible solutions is very large. The GA does not require calculation of the gradient of the objective function and is not likely to be restricted to local optima [Goldberg, 1989]. Through the mutation and crossover stages of the algorithm, the GA can jump to other areas of the domain search space, thereby increasing the likelihood of finding the global optimum, unlike calculus-based methods. GAs maintain robustness throughout the optimization process as they search the feature space efficiently in complex systems. GAs require only simple computations while not imposing restrictions on the domain of the search or optimization problem, and they have been used to resolve an assortment of complex problems in a wide array of fields. GAs form a subfield of evolutionary computation comprising optimization algorithms based on simulated evolution. GAs allow discontinuous, complex objective functions, unlike calculus-based techniques. The optimal solution can be verified by running the GA again with the candidate optimal solution included as one of the initial parent solutions.

Howe [Howe, 2009] and Bozdogan developed the Information Complexity M³ Toolbox for MATLAB. These MATLAB programs (hereafter referred to as the M³ toolbox) provide the basis for generating results using the GA in this dissertation. The M³ toolbox programs were modified to incorporate the kernel regression discussed in chapters 3 and 4.

The M³ toolbox uses eight operational parameters that are summarized in Table 6.1. Howe [Howe, 2009] discusses these parameters in detail; they are summarized below.

Number of Generations

A generation in a GA is another name for an iteration. The number of generations selected should be large enough for the algorithm to identify the global optimal solution, but not unnecessarily large due to the increase in computation time. Selecting a number of generations too small could mean termination of the algorithm with a suboptimal result.

Table 6.1: Genetic algorithm operational parameters.

Parameter                          Setting
Number of generations              60
Premature termination threshold    40
Population size                    30
Generation seeding                 Ranking
Crossover probability              0.75
Mutation probability               0.10
Elitism                            On
Objective function                 AIC or ICOMP(IFIM)

Premature Termination Threshold

The premature termination threshold is a convergence criterion for the GA. This parameter controls the number of generations the GA is allowed to execute with no improvement in the objective function. This parameter needs to be large enough to find the global optimal solution, but not so large that the algorithm spends additional computation time making no improvement in the objective function.

Population Size

The population size parameter P determines how many solutions are evaluated in each generation. As a general rule, in a subsetting problem with p variables, each generation should evaluate P > p solutions.

Generation Seeding

Generation seeding is the process of selecting members of the next generation for mating. Ranking selection uses unequal bin widths so the larger bins are at the beginning, corresponding to the most fit solutions. This allows chromosomes with better objective function values to be overrepresented in the mating pool. The ordering of the solutions is still randomly permuted, and solutions are mated sequentially [Howe, 2009].

Crossover Probability

The crossover operation uses a randomized single-point crossover described in Howe [Howe, 2009]. A mating pair undergoes the crossover operation if a random draw is less than the crossover probability. Frequent crossovers are preferred since the procedure simply duplicates the original solutions when the solutions are not crossed. Therefore, crossover probabilities greater than 50% are preferred.

Mutation Probability

The mutation probability is implemented by randomly selecting elements on the chro- mosome and then reassigning them randomly to different groups. The GA uses mutation to widen the search for the global optimal solution by allowing a jump to another area of the fitness landscape. Generally, a small probability (≤ 10%) is preferred for mutation.

126 Elitism

The elitism rule ensures monotonic improvement in the objective function. After all reproduction operations have been completed, the elitism rule copies the best chromosome without modification for mating with the next generation. However, the elitism rule usually means the population size increases with each generation, which can increase computation time.
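
A sketch of the elitism rule, under the assumption that chromosomes are stored as rows of a matrix (hypothetical names):

function nextPop = apply_elitism(population, scores, offspring)
% Elitism sketch: after reproduction, the best chromosome of the current
% generation is copied, unmodified, into the next generation for mating.
[~, best] = min(scores);                       % smallest objective value is best
nextPop = [population(best, :); offspring];    % note the population grows by one row
end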

Objective Function

Like all optimization procedures, the GA requires some objective function to either maximize or minimize. The M 3 toolbox uses information criteria as the objective function to determine the optimal solution.
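
As a hedged illustration of an AIC-type objective for a candidate subset (the dissertation's Eqn. 2.1 may differ in constants), one common Gaussian-regression form is sketched below; the function name and the assumption that X already carries a column of ones for the intercept are mine, not the toolbox's.

function aic = aic_subset(y, X, subset)
% AIC sketch for a least squares subset model using the full Gaussian
% log-likelihood: AIC = -2 log L + 2m, with m estimated parameters.
Xs   = X(:, subset);                 % candidate design matrix
n    = numel(y);
beta = Xs \ y;                       % ordinary least squares fit
sig2 = sum((y - Xs*beta).^2) / n;    % ML estimate of the error variance
m    = numel(subset) + 1;            % regression coefficients plus sigma^2
aic  = n*log(2*pi) + n*log(sig2) + n + 2*m;
end

ICOMP(IFIM) replaces the fixed 2m penalty with an information-complexity penalty on the estimated inverse Fisher information matrix, as defined in Chapter 2.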

6.3 Kernel Regression with the Genetic Algorithm on Real Body Fat Data

The M 3 toolbox was originally developed by Howe and Bozdogan [Howe, 2009] for variable selection using ordinary least squares regression with the genetic algorithm. The MATLAB programs used in the M 3 toolbox were modified to incorporate kernel regression smoothing in place of ordinary regression for the models described in Table 4.1.

Kernel regression is applied with the GA on a real data set using the modified M 3 toolbox. The real data set, described by Bao [Bao, et. al, 2005] (http://lib.stat.cmu.edu/datasets/bodyfat), has 13 independent X variables, n = 252 observations, and percentage of body fat as the Y response variable.

6.3.1 Results from the Y = Xβ Model

The modified M 3 toolbox was run first with no kernel regression (the Y = Xβ model) applied to the data to establish a baseline of AIC and ICOMP(IFIM) scores for the best models. The operational parameters shown in Table 6.1 were used for the GA. Figure 6.1 is a three-dimensional plot showing the fitness landscape of AIC scores with the number of generations and populations output from the M 3 toolbox. Figure 6.1 shows the AIC scores generally decrease as the generations increase.

Figure 6.2 is a scatter plot of the minimum and average AIC scores for each generation in the GA. The GA converges to the minimum AIC at generation 5 and is constant through the remaining generations. Since the GA found the optimal model, the GA terminated early at generation 46. The average AIC score decreases quickly and stabilizes at generation 11.

The GA model was replicated 100 times. The results are summarized in Table 6.2.

An explicit enumeration of all possible subset models was performed, and the top 10 AIC models are shown in Table 6.2. The integer 0 in the model column denotes the presence of the intercept term. The other integers in the model column identify the variable numbers present in the model. For example, the best model (0,1,2,4,6,7,8,12,13) has the intercept term plus variables X1, X2, X4, X6, X7, X8, X12, and X13. The GA correctly identified the best model (0,1,2,4,6,7,8,12,13) with the smallest AIC score of 1457.00 69 times out of 100 replications. The model identified by the GA is the optimal solution shown in Table 6 of Bao [Bao, et. al, 2005] using the exact implicit enumeration (IE) algorithm. The GA identified the model (1,2,3,4,6,7,8,12,13) with the third smallest AIC score of 1457.47 19 times out of 100 replications. Ninety-nine of the 100 replications identified five of the best seven models using AIC. One replication identified model (1,2,3,4,6,7,11,12,13), outside the top 10 models, with an AIC score of 1459.79.

Figure 6.3 is a three-dimensional plot showing the fitness landscape of ICOMP(IFIM) scores with the number of generations and populations output from the M 3 toolbox. Figure 6.3 shows the ICOMP(IFIM) scores generally decrease as the generations increase.

The GA model was replicated 100 times for ICOMP(IFIM). The results are summarized in Table 6.3. An explicit enumeration of all possible subset models was performed and the top 10 ICOMP(IFIM) models are shown in Table 6.3.


Figure 6.1: Three dimensional surface plot of AIC scores for body fat from the Y = Xβ model.


Figure 6.2: Scatter plot of AIC scores for body fat for each generation from the Y = Xβ model.

Table 6.2: AIC results for body fat data from the Y = Xβ model.

Model                        AIC      Frequency
{0,1,2,4,6,7,8,12,13}        1457.00  69
{0,1,2,4,6,8,12,13}          1457.05  8
{1,2,3,4,6,7,8,12,13}        1457.47  19
{0,1,2,4,6,8,11,12,13}       1457.62  2
{1,3,4,6,7,8,12,13}          1457.72
{0,1,2,4,6,7,8,11,12,13}     1457.82
{0,1,2,4,6,11,12,13}         1458.13  1
{1,4,6,7,8,12,13}            1458.21
{0,1,2,3,4,6,7,8,12,13}      1458.33
{0,1,2,4,6,8,10,12,13}       1458.34
{1,2,3,4,6,7,11,12,13}       1459.79  1
Modeling completed in 3.5 minutes using 100 replications.

The GA identified the model (4,6,7,12,13) with the smallest ICOMP(IFIM) score of 1495.17 91 times out of 100 replications. The model (1,4,6,7,8,12,13) was the second most frequently identified model, seven times out of 100 replications, with an ICOMP(IFIM) score of 1498.50. All 100 replications using the GA identified four of the top five models scored by ICOMP(IFIM).

Figure 6.4 is a scatter plot of the minimum and average ICOMP(IFIM) scores for each generation in the GA. The GA converges to the minimum ICOMP(IFIM) at generation 41 and is constant through generation 60. The average ICOMP(IFIM) score decreases quickly and stabilizes at generation 26.

Histograms and scatter plots of each of the variables X1, X2, ..., X13, and Y are shown in Appendix Figures A.13, A.14, and A.15. The histograms show a variety of distributions from skewed to approximately normal. The scatter plots show significant correlations between several variables.


Figure 6.3: Three dimensional surface plot of ICOMP(IFIM) scores for body fat from the Y = Xβ model.


Figure 6.4: Scatter plot of ICOMP(IFIM) scores for body fat for each generation from the Y = Xβ model.

Table 6.3: ICOMP(IFIM) results for body fat data from the Y = Xβ model.

Model               ICOMP(IFIM)  Frequency
{4,6,7,12,13}       1495.17      91
{4,6,7,13}          1495.85      1
{6,7,13}            1495.97
{3,4,6,7,12,13}     1497.84      1
{1,4,6,7,8,12,13}   1498.50      7
{3,6,7,13}          1498.51
{6,7,12,13}         1498.54
{1,4,6,7,12,13}     1498.55
{4,6,7,11,13}       1498.55
{0,2,6}             1498.65
Modeling completed in 1.6 minutes using 100 replications.

6.3.2 Results from the Y = Xkβ Model

Eqn. 3.55 was used to calculate the optimal bandwidth h values when kernel regression is applied individually to each X variable. Therefore, there is one h bandwidth for each of the 13 X variables when kernel regression is applied individually to each variable.
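
Although the exact smoothing transformation is defined by Eqn. 3.55 and the surrounding chapters, a generic Gaussian-kernel Nadaraya-Watson smoother for a single variable gives the flavor of how one bandwidth h controls the amount of smoothing; the MATLAB sketch below is illustrative only, assumes column-vector inputs, and its names are hypothetical.

function s = nw_smooth(x, y, h)
% Illustrative Gaussian-kernel Nadaraya-Watson smoother: each fitted value is
% a locally weighted average of y, with weights that decay with distance in x
% on the scale of the bandwidth h.
n = numel(x);
s = zeros(n, 1);
for i = 1:n
    u = (x(i) - x) / h;              % scaled distances to every observation
    w = exp(-0.5 * u.^2);            % Gaussian kernel weights (constants cancel)
    s(i) = sum(w .* y) / sum(w);     % weighted local average
end
end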

Figure 6.5 shows smoothed kernel density plots for the various values of hj calculated from Eqn. 3.55 for the variable X1. Figure 6.5 shows that as the bandwidth h increases, the kernel density plots become smoother. The smallest h = 0.5871 is too small, since the kernel density plot is very jagged. Yet the largest h = 8.664 is too large, since the kernel density plot has oversmoothed the data and removed features of the data that may be important. For X1 the bandwidth h that minimizes the ICOMP score is h = 2.2553.

Similarly, Figures 6.6 through 6.18 each show smoothed kernel density plots for various values of hj calculated from Eqn. 3.55 for the variables X2, X3, ..., X13, and Y, respectively, for the Yk and Xk models. The values of h that minimize ICOMP for the variables X2, X3, ..., X13, and Y are 19.6249, 4.0225, 2.1908, 4.0862, 5.2954, 4.1302, 3.2837, 1.0994, 0.733, 1.4837, 0.7068, 0.3141, and 2.9355, respectively.
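
A hedged MATLAB sketch of a Gaussian kernel density estimate, which is enough to reproduce the qualitative behavior of the panels in Figures 6.5 through 6.18 for any chosen h, is given below; the function name and evaluation grid are assumptions, not the M 3 toolbox code.

function f = kde_gauss(xgrid, x, h)
% Standard Gaussian kernel density estimate evaluated on a grid:
% f(xgrid_j) = (1/(n*h)) * sum_i K((xgrid_j - x_i)/h), K the standard normal pdf.
n = numel(x);
f = zeros(size(xgrid));
for j = 1:numel(xgrid)
    u = (xgrid(j) - x) / h;
    f(j) = sum(exp(-0.5 * u.^2)) / (n * h * sqrt(2*pi));
end
end

For example, plotting kde_gauss(linspace(min(x), max(x), 200), x, h) for increasing h shows the same jagged-to-oversmoothed progression described above for X1.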

The modified M 3 toolbox was then run with the Y = Xkβ model, where kernel regression was applied to the X matrix of independent variables one variable at a time.

[Figure 6.5 panels: kernel density estimates f(x) of X1 for bandwidths h = 0.58708, 0.82191, 1.1507, 1.611, 2.2553, 3.1575, 4.4205, 6.1886, and 8.6641.]

Figure 6.5: Kernel density plots for various hj values for body fat variable X1.

[Figure 6.6 panels: kernel density estimates f(x) of X2 for bandwidths h = 5.1085, 7.1519, 10.0127, 14.0178, 19.6249, 27.4749, 38.4649, 53.8509, and 75.3912.]

Figure 6.6: Kernel density plots for various hj values for body fat variable X2.

[Figure 6.7 panels: kernel density estimates f(x) of X3 for bandwidths h = 1.0471, 1.4659, 2.0523, 2.8732, 4.0225, 5.6315, 7.884, 11.0377, and 15.4527.]

Figure 6.7: Kernel density plots for various hj values for body fat variable X3.

[Figure 6.8 panels: kernel density estimates f(x) of X4 for bandwidths h = 0.57029, 0.7984, 1.1178, 1.5649, 2.1908, 3.0671, 4.294, 6.0116, and 8.4162.]

Figure 6.8: Kernel density plots for various hj values for body fat variable X4.

[Figure 6.9 panels: kernel density estimates f(x) of X5 for bandwidths h = 1.0637, 1.4892, 2.0848, 2.9187, 4.0862, 5.7207, 8.009, 11.2126, and 15.6977.]

Figure 6.9: Kernel density plots for various hj values for body fat variable X5.

[Figure 6.10 panels: kernel density estimates f(x) of X6 for bandwidths h = 1.3784, 1.9298, 2.7017, 3.7824, 5.2954, 7.4135, 10.3789, 14.5305, and 20.3427.]

Figure 6.10: Kernel density plots for various hj values for body fat variable X6.

[Figure 6.11 panels: kernel density estimates f(x) of X7 for bandwidths h = 1.0751, 1.5052, 2.1073, 2.9502, 4.1302, 5.7823, 8.0952, 11.3333, and 15.8666.]

Figure 6.11: Kernel density plots for various hj values for body fat variable X7.

[Figure 6.12 panels: kernel density estimates f(x) of X8 for bandwidths h = 0.85476, 1.1967, 1.6753, 2.3455, 3.2837, 4.5971, 6.436, 9.0103, and 12.6145.]

Figure 6.12: Kernel density plots for various hj values for body fat variable X8.

[Figure 6.13 panels: kernel density estimates f(x) of X9 for bandwidths h = 0.28619, 0.40067, 0.56094, 0.78531, 1.0994, 1.5392, 2.1549, 3.0169, and 4.2236.]

Figure 6.13: Kernel density plots for various hj values for body fat variable X9.

[Figure 6.14 panels: kernel density estimates f(x) of X10 for bandwidths h = 0.19079, 0.26711, 0.37396, 0.52354, 0.73296, 1.0261, 1.4366, 2.0112, and 2.8157.]

Figure 6.14: Kernel density plots for various hj values for body fat variable X10.

[Figure 6.15 panels: kernel density estimates f(x) of X11 for bandwidths h = 0.38622, 0.5407, 0.75699, 1.0598, 1.4837, 2.0772, 2.908, 4.0713, and 5.6998.]

Figure 6.15: Kernel density plots for various hj values for body fat variable X11.

[Figure 6.16 panels: kernel density estimates f(x) of X12 for bandwidths h = 0.18398, 0.25757, 0.3606, 0.50484, 0.70678, 0.98949, 1.3853, 1.9394, and 2.7152.]

Figure 6.16: Kernel density plots for various hj values for body fat variable X12.

[Figure 6.17 panels: kernel density estimates f(x) of X13 for bandwidths h = 0.081769, 0.11448, 0.16027, 0.22437, 0.31412, 0.43977, 0.61568, 0.86196, and 1.2067.]

Figure 6.17: Kernel density plots for various hj values for body fat variable X13.

[Figure 6.18 panels: kernel density estimates f(x) of Y for bandwidths h = 0.76414, 1.0698, 1.4977, 2.0968, 2.9355, 4.1097, 5.7536, 8.0551, and 11.2771.]

Figure 6.18: Kernel density plots for various hj values for body fat variable Y .

The operational parameters shown in Table 6.1 were again used for the GA with 100 replications. Figure 6.19 is a three-dimensional plot showing the fitness landscape of AIC scores from the Y = Xkβ model with the number of generations and populations output from the M 3 toolbox. Figure 6.19 shows the AIC scores generally decrease as the generations increase.

Figure 6.20 is a scatter plot of the minimum and average AIC scores for each generation in the GA from the Y = Xkβ model. The GA converges to the minimum AIC at generation 44. The average AIC score decreases quickly and stabilizes at generation 14.

The GA model was replicated 100 times. The results are summarized in Table 6.4.

An explicit enumeration of all possible subset models was performed, and the top 10 AIC models are shown in Table 6.4. The optimal bandwidth h was calculated from Eqn. 3.55 for each model using the Gaussian kernel. The GA correctly identified the best model (0,3,4,6,13) with the smallest AIC score of 1477.28 100 times out of 100 replications. Applying kernel regression to the X matrix variables one at a time for the Y = Xkβ model resulted in higher AIC scores than the Y = Xβ model where no kernel regression was applied (Table 6.2). The modeling took 3 hours to complete, compared to 3.5 minutes for the Y = Xβ model as shown in Table 6.2.

The GA model was replicated 100 times for ICOMP(IFIM). The results are summarized in Table 6.5. The optimal bandwidth h was calculated from Eqn. 3.55 for each model using the Gaussian kernel. An explicit enumeration of all possible subset models was performed and the top 10 ICOMP(IFIM) models are shown in Table 6.5. The GA correctly identified the model (0,3,4,6) with the smallest ICOMP(IFIM) score of 1497.56 89 times out of 100 replications. The ICOMP(IFIM) scores from the models in Table 6.5 for the Y = Xkβ model are higher than the ICOMP(IFIM) scores when no kernel regression smoothing was applied to the data (Table 6.3). All 100 replications identified four of the top five models. The modeling took 2.2 hours to complete, compared to 1.6 minutes for the Y = Xβ model as shown in Table 6.3.


Figure 6.19: Three dimensional surface plot of AIC scores for body fat from the Y = Xkβ model.


Figure 6.20: Scatter plot of AIC scores for body fat for each generation from the Y = Xkβ model.

Table 6.4: AIC results for body fat data from the Y = Xkβ model.

Model              AIC      Frequency  h
{0,3,4,6,13}       1477.28  100        *
{0,3,4,6,10,13}    1478.42             *
{0,3,4,5,6,13}     1478.70             *
{0,1,3,4,6,13}     1478.78             *
{0,3,4,6,12,13}    1478.89             *
{0,3,4,6,11,13}    1479.04             *
{0,2,3,4,6,13}     1479.15             *
{0,3,4,6,7,13}     1479.26             *
{0,3,4,6,9,13}     1479.28             *
{0,3,4,6,8,13}     1479.28             *
Modeling completed in 3.0 hours using 100 replications.

*Optimal bandwidth h values for X1, X2, …, X13 individually are respectively: [2.255, 19.625, 4.023, 2.191, 4.086, 5.295, 4.130, 3.284, 1.099, 0.733, 1.484, 0.707, 0.314]

Table 6.5: ICOMP(IFIM) results for body fat data from the Y = Xkβ model.

Model             ICOMP(IFIM)  Frequency  h
{0,3,4,6}         1497.56      89         *
{0,3,4,6,10}      1498.13      2          *
{0,3,4,5,6}       1499.94                 *
{0,3,4,6,9}       1500.59      3          *
{0,3,6,13}        1500.60      6          *
{0,3,4,6,11}      1500.63                 *
{0,1,3,4,6}       1500.65                 *
{0,3,4,6,10,11}   1501.12                 *
{0,3,4,5,6,10}    1501.15                 *
{0,1,3,4,6,10}    1501.38                 *
Modeling completed in 2.2 hours using 100 replications.

*Optimal bandwidth h values for X1, X2, …, X13 individually are respectively: [2.255, 19.625, 4.023, 2.191, 4.086, 5.295, 4.130, 3.284, 1.099, 0.733, 1.484, 0.707, 0.314]

Figure 6.21 is a three-dimensional plot showing the fitness landscape of ICOMP(IFIM) scores with the number of generations and populations output from the modified M 3 toolbox for the Y = Xkβ model. Figure 6.21 shows the ICOMP(IFIM) scores generally decrease as the generations increase.

Figure 6.22 is a scatter plot of the minimum and average ICOMP(IFIM) scores for each generation in the GA from the Y = Xkβ model. The GA converges to the minimum ICOMP(IFIM) at generation 19 and remains constant until it terminates early at generation 52, since the GA found the optimal solution. The average ICOMP(IFIM) score decreases quickly and stabilizes at generation 20.

6.3.3 Results from the Y = Xmkβ Model

The modified M 3 toolbox was then run with kernel regression applied to the X matrix of independent variables together for the Y = Xmkβ model. The operational parameters shown in Table 6.1 were again used for the GA with 100 replications. Since the Y = Xmkβ model computes the optimal bandwidth h using all independent variables together, the computation time is significantly increased compared to the Y = Xkβ model. Figure 6.23 is a three-dimensional plot showing the fitness landscape of AIC scores from the Y = Xmkβ model with the number of generations and populations output from the M 3 toolbox.
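
One hedged way to picture why a single multivariate bandwidth is more expensive is a product Gaussian kernel smoother over all selected predictors at once, sketched below in MATLAB; this is an illustration under my own assumptions, not the Eqn. 3.54 implementation.

function yhat = nw_smooth_multi(X, y, h)
% Illustrative multivariate Nadaraya-Watson smoother with a product Gaussian
% kernel and one common bandwidth h: every fitted value requires kernel
% weights against all n observations in all selected columns of X.
n = size(X, 1);
yhat = zeros(n, 1);
for i = 1:n
    d2 = sum((bsxfun(@minus, X, X(i, :)) / h).^2, 2);  % squared scaled distances
    w  = exp(-0.5 * d2);                                % product Gaussian weights
    yhat(i) = sum(w .* y) / sum(w);
end
end

Because the bandwidth must be re-optimized for every candidate subset of columns, the cost grows quickly with the number of models the GA evaluates, which is consistent with the run times reported below.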

Figure 6.24 is a scatter plot of the minimum and average AIC scores for each generation in the GA from the Y = Xmkβ model. The GA converges to the minimum AIC at generation 10 and continues through the remaining generations. The average AIC score decreases quickly and stabilizes at generation 15.

The GA model was replicated 100 times for AIC. The results are summarized in Table 6.6. An explicit enumeration of all possible subset models was performed and the top 10 AIC models are shown in Table 6.6. The optimal bandwidth h was calculated from Eqn. 3.54 for each model using the Gaussian kernel. The GA identified the model (0,2,4,5,6,13) with the smallest AIC score of 1490.5 95 times out of 100 replications.


Figure 6.21: Three dimensional surface plot of ICOMP(IFIM) scores for body fat from the Y = Xkβ model.


Figure 6.22: Scatter plot of ICOMP(IFIM) scores for body fat for each generation from the Y = Xkβ model.


Figure 6.23: Three dimensional surface plot of AIC scores for body fat from the Y = Xmkβ model.


Figure 6.24: Scatter plot of AIC scores for body fat for each generation from the Y = Xmkβ model.

Table 6.6: AIC results for body fat data from the Y = Xmkβ model.

Model                       AIC      Frequency  h
{0,1,2,4,5,6,8,12,13}       387.93              3.987
{0,1,2,3,4,6,8,12,13}       388.37              4.377
{0,1,2,4,5,6,8,10,12,13}    389.18              5.028
{0,1,2,4,5,6,7,8,12,13}     389.22              4.237
{0,1,2,3,4,6,7,8,12,13}     389.41              4.629
{0,1,2,4,5,6,8,9,12,13}     389.53              4.815
{0,1,2,3,4,6,8,10,12,13}    389.66              5.415
{0,1,2,4,5,6,8,11,12,13}    389.88              4.682
{0,1,2,3,4,5,6,8,12,13}     389.93              4.572
{0,1,2,3,4,6,8,9,12,13}     390.06              5.202
{0,2,4,5,6,13}              1490.5   95         3.211
{0,2,5,6,13}                1490.8   3          1.747
{0,2,4,5,6,9,13}            1491.4   1          3.870
{0,2,5,6,9,11,13}           1492.7   1          2.279
Modeling completed in 206.5 hours using 100 replications.

Applying kernel regression to the X matrix variables together for the Y = Xmkβ model resulted in higher AIC scores than the Y = Xβ model where no kernel regression was applied, as seen in Table 6.2. None of the 100 replications identified a model within the top 10 models. The modeling took 206.5 hours to complete compared to 3.5 minutes for the Y = Xβ model as shown in Table 6.2.

Figure 6.25 is a three-dimensional plot showing the fitness landscape of ICOMP(IFIM) scores with the number of generations and populations output from the modified M 3 toolbox for the Y = Xmkβ model.

Figure 6.26 is a scatter plot of the minimum and average ICOMP(IFIM) scores for each generation in the GA from the Y = Xmkβ model. The GA first converges to the minimum ICOMP(IFIM) at generation 15 and continues through the remaining generations. The average ICOMP(IFIM) score decreases quickly and stabilizes at generation 23.

The GA model was replicated 100 times for ICOMP(IFIM). The results are summarized in Table 6.7. An explicit enumeration of all possible subset models was performed and the top 10 ICOMP(IFIM) models are shown in Table 6.7. The optimal bandwidth h was calculated from Eqn. 3.54 for each model using the Gaussian kernel. The GA identified the model (0,2,5,6,13) with the smallest ICOMP(IFIM) score of 1525.8 78 times out of 100 replications. The model (0,5,6,7,13) was the second most frequently identified model, 19 times out of 100 replications, with an ICOMP(IFIM) score of 1552.6. None of the models identified by the GA were among the top 10 best models. The ICOMP(IFIM) scores from the models identified by the GA in Table 6.7 from the Y = Xmkβ model are larger than the ICOMP(IFIM) scores when no kernel regression smoothing was applied to the data, as seen in Table 6.3. The modeling took 107.5 hours to complete compared to 1.6 minutes for the Y = Xβ model as shown in Table 6.3.


Figure 6.25: Three dimensional surface plot of ICOMP(IFIM) scores for body fat from the Y = Xmkβ model.


Figure 6.26: Scatter plot of ICOMP(IFIM) scores for body fat for each generation from the Y = Xmkβ model.

Table 6.7: ICOMP(IFIM) results for body fat data from the Y = Xmkβ model.

Model                             ICOMP(IFIM)  Frequency  h
{1,2,3,4,5,6,7,8,9,10,11,12,13}   380.68                  7.151
{1,2,3,4,6,7,8,9,10,11,12,13}     381.46                  8.337
{1,2,3,4,5,6,7,8,9,11,12,13}      381.48                  7.494
{1,2,3,4,5,6,7,8,10,11,12,13}     381.74                  7.720
{1,2,3,4,5,6,7,8,9,10,12,13}      381.81                  7.860
{1,2,3,4,6,7,8,9,11,12,13}        382.26                  8.680
{1,2,3,4,5,6,7,8,11,12,13}        382.41                  8.063
{1,2,3,4,6,7,8,10,11,12,13}       382.54                  8.905
{1,2,3,4,6,7,8,9,10,12,13}        382.55                  9.044
{1,2,3,4,5,6,7,8,9,12,13}         382.61                  8.203
{0,2,5,6,13}                      1525.8       78         1.747
{0,2,5,6,10,13}                   1539.9       1          1.902
{0,2,4,5,6}                       1542.9       1          1.741
{0,2,4,5,6,10,13}                 1548.2       1          3.287
{0,5,6,7,13}                      1552.6       19         2.040
Modeling completed in 107.5 hours using 100 replications.

6.3.4 Results from the Yk = Xβ Model

The modified M 3 toolbox was run for the Yk = Xβ model, where kernel regression was applied to the Y dependent variable only. The operational parameters shown in Table 6.1 were again used for the GA with 100 replications. Figure 6.27 is a three-dimensional plot showing the fitness landscape of AIC scores from the Yk = Xβ model with the number of generations and populations output from the modified M 3 toolbox.

Figure 6.28 is a scatter plot of the minimum and average AIC scores for each generation in the GA from the Yk = Xβ model. The GA converges to the minimum AIC at generation 25 and continues through the remaining generations. The average AIC score decreases quickly and stabilizes at generation 28.

The GA model was replicated 100 times for AIC. The results are summarized in Table 6.8. An explicit enumeration of all possible subset models was performed and the top 10 AIC models are shown in Table 6.8. The optimal bandwidth h for Y was calculated from Eqn. 3.55 for each model using the Gaussian kernel. The optimal bandwidth is h = 2.936 for each model, since Y does not change as the X matrix changes. The GA correctly identified the model (0,1,2,4,6,8,12,13) with the smallest AIC score of 386.38 98 times out of 100 replications. This model was also chosen as the second best AIC model from the Y = Xβ model in Table 6.2. Applying the kernel regression to the Y variable alone, and not on the X matrix of independent variables, resulted in significantly lower AIC scores. The AIC score for model (0,1,2,4,6,8,12,13) was reduced from 1457.05 for the Y = Xβ model in Table 6.2 to 386.38 for the Yk = Xβ model in Table 6.8. All 100 replications identified three of the top five models.

Figure 6.29 is a three-dimensional plot showing the fitness landscape of ICOMP(IFIM) scores with the number of generations and populations output from the modified M 3 toolbox for the Yk = Xβ model.

Figure 6.30 is a scatter plot of the minimum and average ICOMP(IFIM) scores for each generation in the GA from the Yk = Xβ model. The GA converges to the minimum ICOMP(IFIM) at generation 13 and remains constant through the remaining generations.


Figure 6.27: Three dimensional surface plot of AIC scores for body fat from the Yk = Xβ model.


Figure 6.28: Scatter plot of AIC scores for body fat for each generation from the Yk = Xβ model.

Table 6.8: AIC results for body fat data from the Yk = Xβ model.

Model                       AIC     Frequency  h
{0,1,2,4,6,8,12,13}         386.38  98         2.936
{0,1,2,4,6,7,8,12,13}       387.51  1          2.936
{0,1,2,4,6,8,10,12,13}      387.67             2.936
{0,1,2,4,5,6,8,12,13}       387.93             2.936
{0,1,2,4,6,8,9,12,13}       388.09  1          2.936
{0,1,2,6,8,12,13}           388.27             2.936
{0,1,2,4,6,8,11,12,13}      388.30             2.936
{0,1,2,3,4,6,8,12,13}       388.37             2.936
{0,1,4,6,7,8,12,13}         388.50             2.936
{0,1,2,4,6,7,8,10,12,13}    388.84             2.936
Modeling completed in 32.3 minutes using 100 replications.


Figure 6.29: Three dimensional surface plot of ICOMP(IFIM) scores for body fat from the Yk = Xβ model.


Figure 6.30: Scatter plot of ICOMP(IFIM) scores for body fat for each generation from the Yk = Xβ model.

The GA model was replicated 100 times for ICOMP(IFIM). The results are summarized in Table 6.9. An explicit enumeration of all possible subset models was performed and the top 10 ICOMP(IFIM) models are shown in Table 6.9. The model (0,1,4,6,12,13) was identified with an ICOMP(IFIM) score of 410.58 74 times out of 100 replications. This ICOMP(IFIM) score is approximately 72% lower than the ICOMP(IFIM) scores of models using no kernel regression on any variables for the Y = Xβ model from Table 6.3. The model (0,2,6) was the second most frequently selected model, 20 times out of 100 replications, with an ICOMP(IFIM) score of 410.83. None of the models identified by the GA are among the top 10 models. Table 6.9 shows the top 10 models all exclude the intercept term; the GA identified the best models that include an intercept term.

6.3.5 Results from the Yk = Xkβ Model

The modified M 3 toolbox was run with the Yk = Xkβ model, where kernel regression is applied to the Y dependent variable and to the X matrix of independent variables one variable at a time. The operational parameters shown in Table 6.1 were used for the GA with 100 replications. Figure 6.31 is a three-dimensional plot showing the fitness landscape of AIC scores from the Yk = Xkβ model with the number of generations and populations output from the modified M 3 toolbox.

Figure 6.32 is a scatter plot of the minimum and average AIC scores for each generation in the GA from the Yk = Xkβ model. The GA converges to the minimum AIC at generation 25 and stays constant through generation 60. The average AIC score decreases quickly and stabilizes at generation 13.

The GA model was replicated 100 times for AIC. The results are summarized in Table 6.10. An explicit enumeration of all possible subset models was performed and the top 10 AIC models are shown in Table 6.10. The optimal bandwidths h for Y and X were calculated from Eqn. 3.55 for each model using the Gaussian kernel. The GA identified the model (0,3,4,6,10,12) with an AIC score of 407.68 77 times out of 100 replications.

Table 6.9: ICOMP(IFIM) results for body fat data from the Yk = Xβ model.

Model                             ICOMP(IFIM)  Frequency  h
{1,2,3,4,5,6,7,8,9,10,11,12,13}   380.68                  2.936
{1,3,4,5,6,7,8,9,10,11,12,13}     381.16                  2.936
{1,2,3,4,6,7,8,9,10,11,12,13}     381.46                  2.936
{1,2,3,4,5,6,7,8,9,11,12,13}      381.48                  2.936
{1,2,3,4,5,6,7,8,10,11,12,13}     381.74                  2.936
{1,3,4,6,7,8,9,10,11,12,13}       381.75                  2.936
{1,2,3,4,5,6,7,8,9,10,12,13}      381.81                  2.936
{1,3,4,5,6,7,8,9,11,12,13}        382.05                  2.936
{1,3,4,5,6,7,8,10,11,12,13}       382.26                  2.936
{1,2,3,4,6,7,8,9,11,12,13}        382.26                  2.936
{0,1,4,6,12,13}                   410.58       74         2.936
{0,2,6}                           410.83       20         2.936
{0,1,4,6,8,12,13}                 411.23       1          2.936
{0,1,4,6,7,8,12,13}               411.66       1          2.936
{0,2,6,12,13}                     414.92       1          2.936
{0,4,6,7,12,13}                   414.97       3          2.936
Modeling completed in 34.7 minutes using 100 replications.


Figure 6.31: Three dimensional surface plot of AIC scores for body fat from the Yk = Xkβ model.


Figure 6.32: Scatter plot of AIC scores for body fat for each generation from the Yk = Xkβ model.

Table 6.10: AIC results for body fat data from the Yk = Xkβ model.

Model                       AIC     Frequency  h for Y  h for X
{0,1,2,4,5,6,8,12,13}       387.93             2.936    *
{0,1,2,3,4,6,8,12,13}       388.37             2.936    *
{0,1,2,4,5,6,8,10,12,13}    389.18             2.936    *
{0,1,2,4,5,6,7,8,12,13}     389.22             2.936    *
{0,1,2,3,4,6,7,8,12,13}     389.41             2.936    *
{0,1,2,4,5,6,8,9,12,13}     389.53             2.936    *
{0,1,2,3,4,6,8,10,12,13}    389.66             2.936    *
{0,1,2,4,5,6,8,11,12,13}    389.88             2.936    *
{0,1,2,3,4,5,6,8,12,13}     389.93             2.936    *
{0,1,2,3,4,6,8,9,12,13}     390.06             2.936    *
{0,3,4,6,10,12}             407.68  77         2.936    *
{0,3,4,6,10,12,13}          408.17  5          2.936    *
{0,3,4,6,7,8,10,12}         408.41  15         2.936    *
{0,3,4,6,12,13}             408.57  1          2.936    *
{0,3,4,6,7,9,10,12,13}      408.61  2          2.936    *
Modeling completed in 4.2 hours using 100 replications.

*Optimal bandwidth h values for X1, X2, …, X13 individually are respectively: [2.255, 19.625, 4.023, 2.191, 4.086, 5.295, 4.130, 3.284, 1.099, 0.733, 1.484, 0.707, 0.314]

Applying the kernel regression to the Y variable and also to the X matrix of independent variables one variable at a time resulted in approximately 72% lower AIC scores compared to the Y = Xβ model in Table 6.2. The model (0,3,4,6,7,8,10,12) was the second most frequently identified model, with an AIC score of 408.41, 15 times out of 100 replications. None of the 100 replications identified a model within the top 10 models. The modeling took 4.2 hours to complete compared to 32.3 minutes for the Yk = Xβ model in Table 6.8.

Figure 6.33 is a three-dimensional plot showing the fitness landscape of ICOMP(IFIM) scores with the number of generations and populations output from the modified M 3 toolbox for the Yk = Xkβ model.

Figure 6.34 is a scatter plot of the minimum and average ICOMP(IFIM) scores for each generation in the GA from the Yk = Xkβ model. The GA converges to the minimum ICOMP(IFIM) at generation 27 and is constant through generation 60. The average ICOMP(IFIM) score decreases quickly and stabilizes at generation 14.

The GA model was replicated 100 times for ICOMP(IFIM). The results are summarized in Table 6.11. An explicit enumeration of all possible subset models was performed and the top 10 ICOMP(IFIM) models are shown in Table 6.11. The optimal bandwidths h for Y and X were calculated from Eqn. 3.55 for each model using the Gaussian kernel. The model (0,1,3,4,5,6,7,8,9,10,11,12) was identified with an ICOMP(IFIM) score of 388.26 96 times out of 100 replications. This ICOMP(IFIM) score is approximately 76% lower than the ICOMP(IFIM) scores of the Y = Xβ model as seen in Table 6.3. None of the models identified by the GA are among the top 10 models. The top 10 models all exclude the intercept term. The GA correctly identified the top two models that include the intercept term. The modeling took 7.3 hours to complete compared to 34.7 minutes for the Yk = Xβ model in Table 6.9.


Figure 6.33: Three dimensional surface plot of ICOMP(IFIM) scores for body fat from the Yk = Xkβ model.


Figure 6.34: Scatter plot of ICOMP(IFIM) scores for body fat for each generation from the Yk = Xkβ model.

Table 6.11: ICOMP(IFIM) results for body fat data from the Yk = Xkβ model.

Model                             ICOMP(IFIM)  Frequency  h for Y  h for X
{1,2,3,4,5,6,7,8,9,10,11,12,13}   380.68                  2.936    *
{1,2,3,4,6,7,8,9,10,11,12,13}     381.46                  2.936    *
{1,2,3,4,5,6,7,8,9,11,12,13}      381.48                  2.936    *
{1,2,3,4,5,6,7,8,10,11,12,13}     381.74                  2.936    *
{1,2,3,4,5,6,7,8,9,10,12,13}      381.81                  2.936    *
{1,2,3,4,6,7,8,9,11,12,13}        382.26                  2.936    *
{1,2,3,4,5,6,7,8,11,12,13}        382.41                  2.936    *
{1,2,3,4,6,7,8,10,11,12,13}       382.54                  2.936    *
{1,2,3,4,6,7,8,9,10,12,13}        382.55                  2.936    *
{1,2,3,4,5,6,7,8,9,12,13}         382.61                  2.936    *
{0,1,3,4,5,6,7,8,9,10,11,12}      388.26       96         2.936    *
{0,1,3,4,5,6,7,8,9,10,11}         388.42       4          2.936    *
Modeling completed in 7.3 hours using 100 replications.

*Optimal bandwidth h values for X1, X2, …, X13 individually are respectively: [2.255, 19.625, 4.023, 2.191, 4.086, 5.295, 4.130, 3.284, 1.099, 0.733, 1.484, 0.707, 0.314]

6.3.6 Results from the Yk = Xmkβ Model

The modified M 3 toolbox was run with the Yk = Xmkβ model, where kernel regression was applied to the Y dependent variable and to the X matrix of independent variables all together. The operational parameters shown in Table 6.1 were used for the GA with 100 replications. Since the Yk = Xmkβ model computes the optimal bandwidth h using all independent variables together, the computation time is significantly increased compared to the Yk = Xkβ model. Figure 6.35 is a three-dimensional plot showing the fitness landscape of AIC scores from the Yk = Xmkβ model with the number of generations and populations output from the modified M 3 toolbox.

Figure 6.36 is a scatter plot of the minimum and average AIC scores for each generation in the GA from the Yk = Xmkβ model. The GA converges to the minimum AIC at generation 25 and stays constant through generation 60. The average AIC score decreases and stabilizes at generation 29.

The GA model was replicated 100 times for AIC. The results are summarized in Table 6.12. An explicit enumeration of all possible subset models was performed and the top 10 AIC models are shown in Table 6.12. The optimal bandwidth h for X was calculated from Eqn. 3.54 for each model using the Gaussian kernel, while the optimal bandwidth h for Y was calculated from Eqn. 3.55. The GA identified the model (0,2,4,5,6,9,13) with the smallest AIC score of 424.24 57 times out of 100 replications. Applying the kernel regression to the Y variable and also to the X matrix of independent variables together resulted in lower AIC scores compared to the no kernel regression model Y = Xβ in Table 6.2. The model (0,2,5,6,10,13) was the second most frequently identified model, with an AIC score of 424.61, 43 times out of 100 replications. The GA identified the top two models with an intercept term and two of the top three models overall. The modeling took 184 hours to complete compared to 32.3 minutes for the Yk = Xβ model in Table 6.8.

Figure 6.37 is a three-dimensional plot showing the fitness landscape of ICOMP(IFIM) scores with the number of generations and populations output from the modified M 3 toolbox for the Yk = Xmkβ model.


Figure 6.35: Three dimensional surface plot of AIC scores for body fat from the Yk = Xmkβ model.


Figure 6.36: Scatter plot of AIC scores for body fat for each generation from the Yk = Xmkβ model.

Table 6.12: AIC results for body fat data from the Yk = Xmkβ model.

Model                  AIC     Frequency  h for Y  h for X
{2,4,5,6,9,13}         423.20             2.936    3.696
{0,2,4,5,6,9,13}       424.24  57         2.936    3.870
{0,2,5,6,10,13}        424.61  43         2.936    1.902
{0,2,5,6,13}           425.83             2.936    1.747
{0,2,4,5,6,13}         425.95             2.936    3.211
{0,2,4,5,6,11,13}      427.00             2.936    2.270
{0,2,5,6,11,13}        427.58             2.936    3.293
{0,2,5,6,9,13}         427.83             2.936    3.202
{2,4,5,6,9,10,13}      428.28             2.936    3.600
{0,2,4,5,6,10,11,13}   428.49             2.936    2.045
Modeling completed in 184 hours using 100 replications.


Figure 6.37: Three dimensional surface plot of ICOMP(IFIM) scores for body fat from the Yk = Xmkβ model.

Figure 6.38 is a scatter plot of the minimum and average ICOMP(IFIM) scores for each generation in the GA from the Yk = Xmkβ model. The GA converges to the minimum ICOMP(IFIM) at generation 29 and is constant through generation 60. The average ICOMP(IFIM) score decreases quickly and stabilizes at generation 33.

The GA was replicated 100 times for ICOMP(IFIM). The results are summarized in Table 6.13. An explicit enumeration of all possible subset models was performed and the top 10 ICOMP(IFIM) models are shown in Table 6.13. The GA identified the model (0,2,4,5,6,13) with the smallest ICOMP(IFIM) score of 429.57 58 times out of 100 replications. This ICOMP(IFIM) score is much lower than the ICOMP(IFIM) scores using no kernel regression on any variables for the Y = Xβ model as seen in Table 6.3. The model (0,2,5,6,11,13) was the second most frequently selected model, 29 times out of 100 replications, with an ICOMP(IFIM) score of 431.17. Table 6.13 shows the ICOMP(IFIM) scores from the Yk = Xmkβ models are higher than the ICOMP(IFIM) scores from the Yk = Xβ models in Table 6.9 and the Yk = Xkβ models in Table 6.11. The GA identified 97 of the 100 replications within the top 10 models. However, an explicit enumeration of all possible models shows all the models the GA did identify are the best ICOMP(IFIM) models with an intercept term. The modeling took 161 hours to complete compared to 34.7 minutes for the Yk = Xβ model in Table 6.9.

6.3.7 Conclusions from Body Fat Data

The best AIC and ICOMP(IFIM) models on the body fat data from Tables 6.2 through 6.13, along with their AIC and ICOMP(IFIM) scores, are summarized in Table 6.14. Table 6.14 shows both AIC and ICOMP(IFIM) scores increased for the Y = Xkβ model compared to the baseline Y = Xβ model with no kernel regression. The AIC and ICOMP(IFIM) scores also increased for the Y = Xmkβ model compared to the baseline Y = Xβ model with no kernel regression. These results agree with Chapter 4, where applying kernel regression on the X matrix alone tends to increase both AIC and ICOMP(IFIM), indicating poorer models compared to the baseline Y = Xβ model with no kernel regression.


Figure 6.38: Scatter plot of ICOMP(IFIM) scores for body fat for each generation from the Yk = Xmkβ model.

Table 6.13: ICOMP(IFIM) results for body fat data from the Yk = Xmkβ model.

Model                  ICOMP(IFIM)  Frequency  h for Y  h for X
{2,4,5,6,9,10,13}      418.67                  2.936    3.600
{2,4,5,6,9,13}         422.76                  2.936    3.696
{2,4,5,6,9,10,12,13}   427.12                  2.936    3.778
{2,4,5,6,9,12,13}      427.46                  2.936    4.069
{0,2,4,5,6,13}         429.57       58         2.936    3.211
{2,4,5,6,9,10}         429.58                  2.936    2.917
{2,4,5,6,9,10,11,13}   430.24                  2.936    3.559
{2,4,5,6,9}            430.38                  2.936    2.249
{0,2,5,6,11,13}        431.17       29         2.936    3.293
{0,2,5,6,9,13}         431.34       10         2.936    3.202
{0,2,5,6,10,13}        432.21       3          2.936    1.990
Modeling completed in 161 hours using 100 replications.

Table 6.14: Best AIC and ICOMP(IFIM) models for the body fat data.

Kernel Model   Best AIC Model from GA   AIC      Best ICOMP(IFIM) Model from GA   ICOMP(IFIM)
Y = Xβ         {0,1,2,4,6,7,8,12,13}    1457.0   {4,6,7,12,13}                    1495.2
Y = Xkβ        {0,3,4,6,13}             1477.3   {0,3,4,6}                        1497.6
Y = Xmkβ       {0,2,4,5,6,13}           1490.5   {0,2,5,6,13}                     1525.8
Yk = Xβ        {0,1,2,4,6,8,12,13}      386.38   {0,1,4,6,12,13}                  410.58
Yk = Xkβ       {0,3,4,6,10,12}          407.68   {0,1,3,4,5,6,7,8,9,10,11,12}     388.26
Yk = Xmkβ      {0,2,4,5,6,9,13}         424.24   {0,2,4,5,6,13}                   429.57

However, applying kernel regression on the Y variable decreases both ICOMP(IFIM) and AIC scores, whether kernel regression is also applied to the X matrix or not. The AIC and ICOMP(IFIM) scores for the Yk = Xmkβ model were higher than for both the Yk = Xβ and Yk = Xkβ models. The GA correctly identified the best ICOMP(IFIM) model for the Yk = Xkβ kernel model and the best AIC model for the Yk = Xβ kernel model. The GA tends to identify the best models with an intercept term even when other models without an intercept term may have lower AIC or ICOMP(IFIM) scores.

The GA identified the global optimal solution for the body fat data as confirmed by Bao [Bao, et. al, 2005] (Table 6) using AIC. The models in Table 6.14 also confirm results presented by Bozdogan [Bozdogan, 2004].

6.4 Conclusions

The genetic algorithm (GA) was introduced as a data mining tool for the optimization of information criteria for model selection of multivariate data sets. The GA can locate a global optimum solution on very rugged fitness landscapes through its mutation and crossover properties.

A real body fat data set was used to demonstrate the integration of kernel regression with the GA. Kernel regression was applied to the Y dependent variable, the X matrix of independent variables, or both. AIC and ICOMP(IFIM) results were computed using the modified M 3 information complexity toolbox of MATLAB programs developed by Howe and Bozdogan [Howe, 2009]. The body fat data confirmed the results of Chapter 4 that applying the kernel regression on the Y variable dramatically decreases both the AIC and ICOMP(IFIM) scores compared to the baseline model using no kernel regression. However, applying kernel regression to the X matrix alone increases the AIC and ICOMP(IFIM) scores. The optimal bandwidths h were also calculated for each of the models identified by the modified M 3 toolbox using Eqns. 3.54 and 3.55.

The GA identified the global optimal solution for an example real data set in the literature [Bao, et. al, 2005] that used the exact implicit enumeration algorithm with AIC.

Chapter 7

Kernel Regression with Implicit Enumeration Algorithm

7.1 Introduction

Chapter 6 showed results from integrating kernel regression smoothing with the genetic algorithm on a real data set. Chapter 7 will present an application of kernel regression smoothing with the implicit enumeration algorithm. The implicit enumeration algorithm has been shown to dramatically reduce the number of models to be evaluated using information criteria to identify the global optimum solution. The results of integrating kernel regression with the genetic algorithm and implicit enumeration will be compared.

7.2 Implicit Enumeration Algorithm

Bao, et. al [Bao, et. al, 2005] developed an implicit enumeration (IE) algorithm that is guaranteed to find the global optimum solution for ordinary least squares multiple regression using AIC. The subset of variables that minimizes AIC is identified and global optimality is proven with a small number of iterations without exhaustively evaluating every subset model. Bao showed that in addition to the best model, IE can generate all models having information criteria values close to the minimum in order to estimate the uncertainty in model selection. The methodology of the IE used by Bao is summarized and applied on the same body fat data used in Chapter 6.

Bao assumed that the information criterion being used was of the form

z(w) = f(w) + g(w),    (7.1)

where f(w) is a term that measures bias in estimation of model parameters and g(w) is a measure of variance in the estimation of parameters. The first term, f(w), is considered a measure of lack of fit, while g(w) is a "penalty function" associated with the number of parameters that must be estimated within the model. If another independent variable is added to the set defined by a given w, the result would be a decrease in f(w) and an increase in g(w). That is, f(w) and g(w) must follow the monotonic conditions

f(w1) ≤ f(w2)    (7.2)

and

g(w1) ≥ g(w2)    (7.3)

if w1 ≥ w2.

The "best" solution is then defined as the solution among the 2^p possible solutions that minimizes z(w), where p is the number of independent variables comprising the X matrix.

Information criteria described in Chapter 2 operationalize z(w) in Eqn. 7.1. Specifically, AIC from Eqn. 2.1 and ICOMP(IFIM) from Eqn. 2.9 satisfy the monotonicity requirements for z(w) in Eqn. 7.1.
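
As a hedged illustration, the MATLAB sketch below splits a Gaussian-regression AIC into a lack-of-fit term f(w) and a penalty term g(w) for a 0/1 inclusion vector w over the columns of X; the exact definitions used in the dissertation are those of Eqns. 2.1 and 7.1, and the names here are hypothetical.

function [z, f, g] = z_decompose_aic(y, X, w)
% Illustrative decomposition z(w) = f(w) + g(w) for an AIC-type criterion.
% w is a 0/1 vector over the columns of X (X assumed to already include an
% intercept column if one is wanted).
Xs   = X(:, w == 1);
n    = numel(y);
beta = Xs \ y;
sig2 = sum((y - Xs*beta).^2) / n;
f = n*log(2*pi) + n*log(sig2) + n;   % lack of fit: decreases as variables are added
g = 2 * (sum(w) + 1);                % penalty: increases with the number of parameters
z = f + g;                           % Eqn. 7.1
end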

IE constructs an enumeration tree by identifying and eliminating branches of the tree that do not contain the optimal solution w∗, where w∗ minimizes z(w). Let z∗ = z(w∗). IE uses partial solutions to discard those that cannot lead to the optimal solution. Partial solutions are bounded from above with an upper bound that is used to determine whether fathoming is required. A partial solution is fathomed when no completion can be optimal or the best feasible solution has been found. For example, if the lower bound on a given partial solution is greater than or equal to the upper bound on another partial solution already examined, then the partial solution cannot be optimal, so it is unnecessary to further examine that partial solution.

Bao [Bao, et. al, 2005] discusses in further detail the seven steps of the IE algorithm:

1. Initialization: Set S = ∅, w^S = ∅, and z̄∗ = ∞. Let w = {0, 0, ..., 0} represent the incumbent solution.

2. Computing an upper bound: Compute z(w) for selected completions of (S, w^S) and set z̄(S, w^S) equal to the smallest of these. Go to Step 3.

3. Updating: If z̄(S, w^S) < z̄∗, a solution better than the incumbent has been found. Replace the incumbent solution with the best feasible completion of (S, w^S), and set z̄∗ = z̄(S, w^S).

4. Computing a lower bound: Compute the lower bound z(S, w^S).

5. Fathoming: (S, w^S) is fathomed if z(S, w^S) > z̄∗, indicating that no completion of (S, w^S) can be optimal, or if z(S, w^S) = z̄∗, indicating that the best feasible completion of (S, w^S) has been found. If (S, w^S) is fathomed, go to Step 7.

6. Branching: Add another variable to S and add the corresponding component (0 or 1) to w^S. The methods used to select a variable and to choose the value of the new component of w^S are described in Bao, et. al [Bao, et. al, 2005]. Go to Step 2.

7. Backtracking: If all of the elements of S are underlined, the incumbent represents the optimal solution. Otherwise, choose the rightmost non-underlined element of S, underline it, and change the corresponding element of w^S (from 0 to 1 or from 1 to 0). Then drop all elements of S (and the corresponding components of w^S) to the right of the newly underlined element. Go to Step 2.

Since IE is exhaustive it will either enumerate or eliminate by fathoming every possible solution.

Lower and upper bounds on the optimal solution are calculated using four bounding solutions:

1. core solution z(wc)

2. replete solution z(wr)

3. augmented core solution z(wac)

4. diminished replete solution z(wdr)

The upper bound is then computed as

z̄(S, w^S) = min(z(wc), z(wac), z(wr), z(wdr)).    (7.4)

For monotonic functions f(w) and g(w), Bao [Bao, et. al, 2005] proves that a lower bound is computed as

z(S, w^S) = min(z(wc), z(wac), z(wr), z(wdr), ze).    (7.5)
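
A minimal MATLAB sketch of the bounding and fathoming step, assuming the four bounding objective values and the extra term ze of Eqn. 7.5 have already been computed (hypothetical names, not Bao's code):

function [zbar, zlow, fathomed] = ie_bounds(zc, zac, zr, zdr, ze, zbarStar)
% zc, zac, zr, zdr - objective values of the core, augmented core, replete,
%                    and diminished replete bounding solutions
% ze               - additional lower-bounding term from Eqn. 7.5
% zbarStar         - objective value z-bar* of the current incumbent solution
zbar = min([zc, zac, zr, zdr]);        % upper bound on the partial solution (Eqn. 7.4)
zlow = min([zc, zac, zr, zdr, ze]);    % lower bound on the partial solution (Eqn. 7.5)
fathomed = (zlow >= zbarStar);         % fathom: no completion can improve the incumbent
end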

Bao evaluated three branching strategies: greedy, variable augmentation, and variable deletion. Bao showed the variable augmentation branching strategy (where the free variable is added to the core solution to form the augmented core solution) consistently identified the optimal solution in fewer iterations than either the greedy or variable deletion branching strategies.

Bao also showed that IE consistently outperformed the traditional stepwise regression approach by correctly identifying the optimal solution, while stepwise regression did not.

7.3 Implicit Enumeration with Kernel Regression

The MATLAB programs written by Bao, et al. [Bao, et. al, 2005] implemented IE on ordinary least squares multiple regression using AIC since AIC is calculated as the sum of two monotonic functions that satisfy IE assumptions. The MATLAB programs were modified for the six kernel regression models described in Table 4.1. In addition, the programs were modified to calculate ICOMP(IFIM) as the objective function in addition to AIC.

7.3.1 Results from the Y = Xβ Model

The IE MATLAB programs were applied on the real body fat data described in Section 6.3 and used by Bao [Bao, et. al, 2005]. Table 7.1 summarizes the AIC results from IE for the Y = Xβ model without kernel regression smoothing. Table 7.1 shows the same top 10 AIC models as Table 6.2. IE correctly identified the model (0,1,2,4,6,7,8,12,13) as having the lowest AIC score in only five iterations. IE evaluated 2598 out of the 8191 possible models to arrive at this model. This is the same model most frequently identified in Table 6.2 using the GA.

Table 7.2 summarizes the ICOMP(IFIM) results from IE for the Y = Xβ model without kernel regression smoothing. Table 7.2 shows the same top 10 ICOMP(IFIM) models as Table 6.3. The model (0,6) was identified by IE as having the lowest ICOMP(IFIM) score in one iteration. IE evaluated only 28 out of 8191 possible models to arrive at this solution. The (0,6) model is not among the top 10 models as scored by ICOMP(IFIM), with an ICOMP(IFIM) score of 1534.87.

Bozdogan [Bozdogan, 1990a] showed that the penalty term from Eqn. 2.13 monotonically increases as σ̂² increases when 2qσ̂² > n tr((X'X)^-1). However, Bozdogan [Bozdogan, 1990a] also showed that the penalty term from Eqn. 2.13 monotonically decreases as σ̂² increases when 2qσ̂² < n tr((X'X)^-1). Therefore, the implicit enumeration algorithm has difficulty obtaining the global minimum ICOMP(IFIM) score when the weak monotonicity assumption for σ̂² is violated when 2qσ̂² = n tr((X'X)^-1).
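
The following MATLAB sketch is a hedged diagnostic for checking which side of this condition a fitted subset model falls on; the treatment of q (taken here as the number of columns of X) and the exact penalty in Eqn. 2.13 are assumptions on my part.

function onIncreasingSide = icomp_monotonicity_check(y, X)
% Checks whether 2*q*sigma^2 exceeds n*trace((X'X)^-1), the side on which the
% text states the ICOMP(IFIM) penalty increases monotonically with sigma^2.
[n, q] = size(X);
beta = X \ y;                          % least squares fit of the candidate model
sig2 = sum((y - X*beta).^2) / n;       % ML estimate of sigma^2
lhs  = 2 * q * sig2;
rhs  = n * trace(inv(X' * X));
onIncreasingSide = (lhs > rhs);        % true when the penalty increases with sigma^2
end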

Table 7.1: Top 10 best AIC models for the body fat data using IE for the Y = Xβ model.

Model                       AIC      Number of Iterations  Number of Models Evaluated
{0,1,2,4,6,7,8,12,13}       1457.00  5                     2598
{0,1,2,4,6,8,12,13}         1457.05
{1,2,3,4,6,7,8,12,13}       1457.47
{0,1,2,4,6,8,11,12,13}      1457.62
{1,3,4,6,7,8,12,13}         1457.72
{0,1,2,4,6,7,8,11,12,13}    1457.82
{0,1,2,4,6,11,12,13}        1458.13
{1,4,6,7,8,12,13}           1458.21
{0,1,2,3,4,6,7,8,12,13}     1458.33
{0,1,2,4,6,8,10,12,13}      1458.34

Table 7.2: Top 10 best ICOMP(IFIM) models for the body fat data using IE for the Y = Xβ model.

Model               ICOMP(IFIM)  Number of Iterations  Number of Models Evaluated
{4,6,7,12,13}       1495.17
{4,6,7,13}          1495.85
{6,7,13}            1495.97
{3,4,6,7,12,13}     1497.84
{1,4,6,7,8,12,13}   1498.50
{3,6,7,13}          1498.51
{6,7,12,13}         1498.54
{1,4,6,7,12,13}     1498.55
{4,6,7,11,13}       1498.55
{0,2,6}             1498.65
{0,6}               1534.87      1                     28

7.3.2 Results from the Y = Xkβ Model

Table 7.3 summarizes the results from IE for the Y = Xkβ model with kernel regression smoothing applied to the X matrix variables individually. IE correctly identified the model (0,3,4,6,13) as having the lowest AIC score of 1477.28 in only nine iterations. IE evaluated 2178 out of the 8191 possible models to correctly confirm this model has the global optimal minimum AIC score. This is the same model identified in Table 6.4 using the GA. Table 7.3 shows the same top 10 models with the lowest AIC scores as Table 6.4.

Table 7.4 summarizes the ICOMP(IFIM) results from IE for the Y = Xkβ model. Table 7.4 shows the same top 10 ICOMP(IFIM) models as Table 6.5. The model (0,3,6,7,9) was identified by IE as having the lowest ICOMP(IFIM) score in nine iterations. IE evaluated only 240 out of 8191 possible models to arrive at this solution. The (0,3,6,7,9) model is not among the top 10 models as scored by ICOMP(IFIM), with an ICOMP(IFIM) score of 1511.13. The implicit enumeration algorithm has difficulty obtaining the global minimum ICOMP(IFIM) score when the weak monotonicity assumption for σ̂² is violated when 2qσ̂² = n tr((X'X)^-1), as discussed in Section 7.3.1.

7.3.3 Results from the Y = Xmkβ Model

Table 7.5 summarizes the results from IE for the Y = Xmkβ model with kernel regression smoothing applied to the X matrix variables together. IE identified the model (0,2,4,5,6,13) as having the lowest AIC score of 1490.5 in only eight iterations and evaluated only 238 out of the 8191 possible models. This is the same model identified in Table 6.6 using the GA. Table 7.5 shows the same top 10 models with the lowest AIC scores as Table 6.6. However, the model IE identified is not among the top 10 models as identified by an explicit enumeration of all possible subset models.

Table 7.6 summarizes the ICOMP(IFIM) results from IE for the Y = Xmkβ model. Table 7.6 shows the same top 10 ICOMP(IFIM) models as Table 6.7. The model (0,6) was again identified by IE as having the lowest ICOMP(IFIM) score in one iteration. IE evaluated only 28 out of 8191 possible models to arrive at this model.

Table 7.3: Top 10 best AIC models for the body fat data using IE for the Y = Xkβ model.

Model              AIC      Number of Iterations  Number of Models Evaluated
{0,3,4,6,13}       1477.28  9                     2178
{0,3,4,6,10,13}    1478.42
{0,3,4,5,6,13}     1478.70
{0,1,3,4,6,13}     1478.78
{0,3,4,6,12,13}    1478.89
{0,3,4,6,11,13}    1479.04
{0,2,3,4,6,13}     1479.15
{0,3,4,6,7,13}     1479.26
{0,3,4,6,9,13}     1479.28
{0,3,4,6,8,13}     1479.28

Table 7.4: Top 10 best ICOMP(IFIM) models for the body fat data using IE for the Y = Xkβ model.

Model                          ICOMP(IFIM)   Iterations   Models Evaluated
{0,3,4,6}                      1497.56
{0,3,4,6,10}                   1498.13
{0,3,4,5,6}                    1499.94
{0,3,4,6,9}                    1500.59
{0,3,6,13}                     1500.60
{0,3,4,6,11}                   1500.63
{0,1,3,4,6}                    1500.65
{0,3,4,6,10,11}                1501.12
{0,3,4,5,6,10}                 1501.15
{0,1,3,4,6,10}                 1501.38
{0,3,6,7,9}                    1511.13           9            240

Table 7.5: Top 10 best AIC models for the body fat data using IE for the Y = Xmkβ model.

Model                          AIC        Iterations   Models Evaluated
{0,1,2,4,5,6,8,12,13}          387.93
{0,1,2,3,4,6,8,12,13}          388.37
{0,1,2,4,5,6,8,10,12,13}       389.18
{0,1,2,4,5,6,7,8,12,13}        389.22
{0,1,2,3,4,6,7,8,12,13}        389.41
{0,1,2,4,5,6,8,9,12,13}        389.53
{0,1,2,3,4,6,8,10,12,13}       389.66
{0,1,2,4,5,6,8,11,12,13}       389.88
{0,1,2,3,4,5,6,8,12,13}        389.93
{0,1,2,3,4,6,8,9,12,13}        390.06
{0,2,4,5,6,13}                 1490.5         8            238

Table 7.6: Top 10 best ICOMP(IFIM) models for the body fat data using IE for the Y = Xmkβ model.

Model                              ICOMP(IFIM)   Iterations   Models Evaluated
{1,2,3,4,5,6,7,8,9,10,11,12,13}    380.68
{1,2,3,4,6,7,8,9,10,11,12,13}      381.46
{1,2,3,4,5,6,7,8,9,11,12,13}       381.48
{1,2,3,4,5,6,7,8,10,11,12,13}      381.74
{1,2,3,4,5,6,7,8,9,10,12,13}       381.81
{1,2,3,4,6,7,8,9,11,12,13}         382.26
{1,2,3,4,5,6,7,8,11,12,13}         382.41
{1,2,3,4,6,7,8,10,11,12,13}        382.54
{1,2,3,4,6,7,8,9,10,12,13}         382.55
{1,2,3,4,5,6,7,8,9,12,13}          382.61
{0,6}                              1696.7            1            28

The (0,6) model is not among the top 10 models as scored by ICOMP(IFIM), with an ICOMP(IFIM) score of 1696.7. IE has difficulty obtaining the global minimum score for ICOMP(IFIM) when the weak monotonicity assumption for $\hat{\sigma}^2$ is violated, as discussed in Section 7.3.1.

7.3.4 Results from the Yk = Xβ Model

Table 7.7 summarizes the results from IE for the Yk = Xβ model with kernel regression smoothing applied only to the Y dependent variable. IE correctly identified the model (0,1,2,4,6,8,12,13) as having the lowest AIC score of 386.38 in only six iterations. IE evaluated 2120 out of the 8191 possible models to confirm that this model has the global minimum AIC score. This is the same model identified in Table 6.8 using GA. Table 7.7 shows the same top 10 models with the lowest AIC scores as Table 6.8.

Table 7.8 summarizes the ICOMP(IFIM) results from IE for the Yk = Xβ model. Table 7.8 shows the same top 10 ICOMP(IFIM) models as Table 6.9. The model (0,1,4,6,8,12,13) was identified by IE as having the lowest ICOMP(IFIM) score in seven iterations. IE evaluated only 948 out of 8191 possible models to arrive at this solution. The (0,1,4,6,8,12,13) model is not among the top 10 models as scored by ICOMP(IFIM), with an ICOMP(IFIM) score of 411.24. IE has difficulty obtaining the global minimum ICOMP(IFIM) score when the weak monotonicity assumption for $\hat{\sigma}^2$ is violated, as discussed in Section 7.3.1. However, the (0,1,4,6,8,12,13) model is very similar to the global optimal AIC model (0,1,2,4,6,8,12,13) identified by IE in Table 7.7.

7.3.5 Results from the Yk = Xkβ Model

Table 7.9 summarizes the results from IE for the Yk = Xkβ model with kernel regression smoothing applied to the Y dependent variable and the X matrix of independent variables individually. IE identified the model (0,1,2,3,6,8,9) as having the lowest AIC score of 398.93 in only seven iterations. IE evaluated 2614 out of the 8191 possible models.

Table 7.7: Top 10 best AIC models for the body fat data using IE for the Yk = Xβ model.

Model                          AIC        Iterations   Models Evaluated
{0,1,2,4,6,8,12,13}            386.38         6            2120
{0,1,2,4,6,7,8,12,13}          387.51
{0,1,2,4,6,8,10,12,13}         387.67
{0,1,2,4,5,6,8,12,13}          387.93
{0,1,2,4,6,8,9,12,13}          388.09
{0,1,2,6,8,12,13}              388.27
{0,1,2,4,6,8,11,12,13}         388.30
{0,1,2,3,4,6,8,12,13}          388.37
{0,1,4,6,7,8,12,13}            388.50
{0,1,2,4,6,7,8,10,12,13}       388.84

Table 7.8: Top 10 best ICOMP(IFIM) models for the body fat data using IE for the Yk = Xβ model.

Model                              ICOMP(IFIM)   Iterations   Models Evaluated
{1,2,3,4,5,6,7,8,9,10,11,12,13}    380.68
{1,3,4,5,6,7,8,9,10,11,12,13}      381.16
{1,2,3,4,6,7,8,9,10,11,12,13}      381.46
{1,2,3,4,5,6,7,8,9,11,12,13}       381.48
{1,2,3,4,5,6,7,8,10,11,12,13}      381.74
{1,3,4,6,7,8,9,10,11,12,13}        381.75
{1,2,3,4,5,6,7,8,9,10,12,13}       381.81
{1,3,4,5,6,7,8,9,11,12,13}         382.05
{1,3,4,5,6,7,8,10,11,12,13}        382.26
{1,2,3,4,6,7,8,9,11,12,13}         382.26
{0,1,4,6,8,12,13}                  411.24            7            948

Table 7.9: Top 10 best AIC models for the body fat data using IE for the Yk = Xkβ model.

Model                          AIC        Iterations   Models Evaluated
{0,1,2,4,5,6,8,12,13}          387.93
{0,1,2,3,4,6,8,12,13}          388.37
{0,1,2,4,5,6,8,10,12,13}       389.18
{0,1,2,4,5,6,7,8,12,13}        389.22
{0,1,2,3,4,6,7,8,12,13}        389.41
{0,1,2,4,5,6,8,9,12,13}        389.53
{0,1,2,3,4,6,8,10,12,13}       389.66
{0,1,2,4,5,6,8,11,12,13}       389.88
{0,1,2,3,4,5,6,8,12,13}        389.93
{0,1,2,3,4,6,8,9,12,13}        390.06
{0,1,2,3,6,8,9}                398.93         7            2614

This model is not among the top 10 AIC models. Table 7.9 shows the same top 10 models with the lowest AIC scores as Table 6.10.

Table 7.10 summarizes the ICOMP(IFIM) results from IE for the Yk = Xkβ model. Table 7.10 shows the same top 10 ICOMP(IFIM) models as Table 6.11. The saturated model (0,1,2,3,4,5,6,7,8,9,10,11,12,13) was identified by IE as having the lowest ICOMP(IFIM) score in one iteration. IE evaluated 1578 out of 8191 possible models to arrive at this solution. The (0,1,2,3,4,5,6,7,8,9,10,11,12,13) model is not among the top 10 models as scored by ICOMP(IFIM), with an ICOMP(IFIM) score of 455.22. IE has difficulty obtaining the global minimum ICOMP(IFIM) score when the weak monotonicity assumption for $\hat{\sigma}^2$ is violated, as discussed in Section 7.3.1.

7.3.6 Results from the Yk = Xmkβ Model

Table 7.11 summarizes the results from IE for the Yk = Xmkβ model with kernel regression smoothing applied to the Y dependent variable and the X matrix of independent variables together. IE identified the model (0,2,5,6,10,13) as having the lowest AIC score of 424.61 in only 16 iterations. IE evaluated only 210 out of the 8191 possible models. This is one of the models identified in Table 6.12 using GA. Table 7.11 shows the same top 10 models with the lowest AIC scores as Table 6.12.

Table 7.12 summarizes the ICOMP(IFIM) results from IE for the Yk = Xmkβ model. Table 7.12 shows the same top 10 ICOMP(IFIM) models as Table 6.13. The model (0,6) was identified by IE as having the lowest ICOMP(IFIM) score in one iteration. IE evaluated only 28 out of 8191 possible models to arrive at this model. The (0,6) model is not among the top 10 models as scored by ICOMP(IFIM), with an ICOMP(IFIM) score of 556.65. IE has difficulty obtaining the global minimum ICOMP(IFIM) score when the weak monotonicity assumption for $\hat{\sigma}^2$ is violated, as discussed in Section 7.3.1. This is the same model IE identified as optimal in Tables 7.2 and 7.6 for the Y = Xβ and Y = Xmkβ models, respectively.

Table 7.10: Top 10 best ICOMP(IFIM) models for the body fat data using IE for the Yk = Xkβ model.

Model                                ICOMP(IFIM)   Iterations   Models Evaluated
{1,2,3,4,5,6,7,8,9,10,11,12,13}      380.68
{1,2,3,4,6,7,8,9,10,11,12,13}        381.46
{1,2,3,4,5,6,7,8,9,11,12,13}         381.48
{1,2,3,4,5,6,7,8,10,11,12,13}        381.74
{1,2,3,4,5,6,7,8,9,10,12,13}         381.81
{1,2,3,4,6,7,8,9,11,12,13}           382.26
{1,2,3,4,5,6,7,8,11,12,13}           382.41
{1,2,3,4,6,7,8,10,11,12,13}          382.54
{1,2,3,4,6,7,8,9,10,12,13}           382.55
{1,2,3,4,5,6,7,8,9,12,13}            382.61
{0,1,2,3,4,5,6,7,8,9,10,11,12,13}    455.22            1            1578

Table 7.11: Top 10 best AIC models for the body fat data using IE for the Yk = Xmkβ model.

Model                          AIC        Iterations   Models Evaluated
{2,4,5,6,9,13}                 423.20
{0,2,4,5,6,9,13}               424.24
{0,2,5,6,10,13}                424.61        16            210
{0,2,5,6,13}                   425.83
{0,2,4,5,6,13}                 425.95
{0,2,4,5,6,11,13}              427.00
{0,2,5,6,11,13}                427.58
{0,2,5,6,9,13}                 427.83
{2,4,5,6,9,10,13}              428.28
{0,2,4,5,6,10,11,13}           428.49

Table 7.12: Top 10 best ICOMP(IFIM) models for the body fat data using IE for the Yk = Xmkβ model.

Model                          ICOMP(IFIM)   Iterations   Models Evaluated
{2,4,5,6,9,10,13}              418.67
{2,4,5,6,9,13}                 422.76
{2,4,5,6,9,10,12,13}           427.12
{2,4,5,6,9,12,13}              427.46
{0,2,4,5,6,13}                 429.57
{2,4,5,6,9,10}                 429.58
{2,4,5,6,9,10,11,13}           430.24
{2,4,5,6,9}                    430.38
{0,2,5,6,11,13}                431.17
{0,2,5,6,9,13}                 431.34
{0,6}                          556.65            1            28

7.4 Conclusions

Kernel regression smoothing was incorporated into the implicit enumeration (IE) algorithm for the six kernel models shown in Table 4.1. MATLAB programs written by Bao [Bao, et. al, 2005] were modified to incorporate kernel regression smoothing using AIC or ICOMP(IFIM) information criteria as the objective function to be minimized.
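The modified programs themselves are not reproduced here. As a rough sketch of the kind of objective-function evaluation that IE (or the GA) minimizes for each candidate subset, the MATLAB function below computes a least-squares AIC for one subset of regressors. The AIC form n log(RSS/n) + 2(q + 1), the function name subset_aic, and the assumption that any kernel smoothing of y or of the columns of X has already been applied upstream are assumptions of this sketch, not details taken from the Bao et al. code.

% Sketch only: AIC objective for one candidate subset model.
% Assumes the least-squares AIC form n*log(RSS/n) + 2*(q+1); kernel regression
% smoothing of y and/or the columns of X is assumed to be done before this call.
function aic = subset_aic(y, X, subset)
    % y      : n-by-1 response (possibly kernel smoothed)
    % X      : n-by-p matrix of candidate regressors (column 1 = intercept)
    % subset : logical 1-by-p vector marking the columns included in this model
    Xs = X(:, subset);                     % regressors of the candidate model
    n  = size(Xs, 1);
    q  = size(Xs, 2);                      % number of regression coefficients
    betahat = Xs \ y;                      % least-squares fit
    rss = sum((y - Xs * betahat).^2);      % residual sum of squares
    aic = n * log(rss / n) + 2 * (q + 1);  % error variance counted as one parameter
end
% Hypothetical call for a 14-column X (intercept plus the 13 body fat predictors):
% aic = subset_aic(y, X, logical([1 1 0 1 0 0 1 0 0 0 0 0 1 1]));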

IE with kernel regression correctly identified the global optimal models for the six kernel models shown in Table 4.1 using AIC. These models confirm the results of the GA from Chapter 6 for AIC. IE works well for AIC since the monotonicity of $\hat{\sigma}^2$ holds for $\hat{\sigma}^2 > 0$.

Table 7.13 summarizes the best models from IE and GA using AIC for each of the six kernel models. Table 7.13 shows perfect agreement between the first four models identified by IE and GA. IE outperformed GA for the Yk = Xkβ model since the AIC score for the (0,1,2,3,6,8,9) model identified by IE was 398.93, while the AIC score for the (0,3,4,6,10,12) model identified by GA was 407.68. However, GA barely outperformed IE for the Yk = Xmkβ model since the AIC score for the (0,2,4,5,6,9,13) model identified by GA was 424.24, while the AIC score for the (0,2,5,6,10,13) model identified by IE was 424.61.

Table 7.14 summarizes the best models from IE and GA using ICOMP(IFIM) for each of the six kernel models. Table 7.14 shows the GA outperformed IE for selecting the optimal models for all six kernel models since the ICOMP(IFIM) scores were consistently lower for GA than IE. The weak monotonicity of the penalty term of ICOMP(IFIM) is a problem for IE, but not for GA.

However, ICOMP(IFIM) did not perform as well as AIC using IE with kernel regression due to the weak monotonicity properties of the penalty function of ICOMP(IFIM). Bozdogan [Bozdogan, 1990a] showed that the penalty term from Eqn. 2.13 monotonically increases as $\hat{\sigma}^2$ increases when $2q\hat{\sigma}^2 > n\,\mathrm{tr}(X'X)^{-1}$. However, Bozdogan [Bozdogan, 1990a] also showed that the penalty term from Eqn. 2.13 monotonically decreases as $\hat{\sigma}^2$ increases when $2q\hat{\sigma}^2 < n\,\mathrm{tr}(X'X)^{-1}$. Therefore, the implicit enumeration algorithm has difficulty obtaining the global minimum ICOMP(IFIM) score when the weak monotonicity assumption for $\hat{\sigma}^2$ is violated, that is, when $2q\hat{\sigma}^2 = n\,\mathrm{tr}(X'X)^{-1}$.
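Restating the two cited conditions compactly (this is only a summary of the statements above, written in the notation of Eqn. 2.13):

\[
\text{the penalty term of Eqn. 2.13 is }
\begin{cases}
\text{increasing in } \hat{\sigma}^{2}, & \text{if } 2q\hat{\sigma}^{2} > n\,\operatorname{tr}(X'X)^{-1},\\
\text{decreasing in } \hat{\sigma}^{2}, & \text{if } 2q\hat{\sigma}^{2} < n\,\operatorname{tr}(X'X)^{-1},
\end{cases}
\]

so IE has difficulty reaching the global ICOMP(IFIM) minimum when the weak monotonicity assumption for $\hat{\sigma}^{2}$ is violated.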

Table 7.13: Comparison of the best models from IE and GA using AIC.

Kernel Model    IE Best Model              IE AIC     GA Best Model              GA AIC
Y = Xβ          {0,1,2,4,6,7,8,12,13}      1457.0     {0,1,2,4,6,7,8,12,13}      1457.0
Y = Xkβ         {0,3,4,6,13}               1477.3     {0,3,4,6,13}               1477.3
Y = Xmkβ        {0,2,4,5,6,13}             1490.5     {0,2,4,5,6,13}             1490.5
Yk = Xβ         {0,1,2,4,6,8,12,13}        386.38     {0,1,2,4,6,8,12,13}        386.38
Yk = Xkβ        {0,1,2,3,6,8,9}            398.93     {0,3,4,6,10,12}            407.68
Yk = Xmkβ       {0,2,5,6,10,13}            424.61     {0,2,4,5,6,9,13}           424.24

Table 7.14: Comparison of the best models from IE and GA using ICOMP(IFIM).

Kernel Model    IE Best Model               IE ICOMP(IFIM)   GA Best Model                    GA ICOMP(IFIM)
Y = Xβ          {0,6}                       1534.87          {4,6,7,12,13}                    1495.17
Y = Xkβ         {0,3,6,7,9}                 1511.13          {0,3,4,6}                        1497.56
Y = Xmkβ        {0,6}                       1696.7           {0,2,5,6,13}                     1525.82
Yk = Xβ         {0,1,4,6,8,12,13}           411.24           {0,1,4,6,12,13}                  410.58
Yk = Xkβ        all variables + intercept   455.22           {0,1,3,4,5,6,7,8,9,10,11,12}     388.26
Yk = Xmkβ       {0,6}                       556.65           {0,2,4,5,6,13}                   429.57

Further research is needed to modify the IE algorithm or the ICOMP(IFIM) penalty term to account for its split monotonicity in order to achieve the global optimal solution.

Chapter 8

Conclusions

8.1 Summary of Dissertation

This dissertation provided a brief description of the problem and the proposed approach of kernel density estimation using information complexity in Chapter 1. Chapter 2 defined information criteria and complexity and compared information criteria as a tool for variable selection to other heuristic techniques and model diagnostics. Chapter 3 introduced kernel density estimation and provided a relevant and current literature review. Univariate, multivariate, and variable bandwidth selectors currently in the literature were also reviewed. Chapter 4 discussed univariate kernel regression using both simulated and real data sets for various data distributions and kernel functions. Chapter 5 extended kernel regression to the multivariate case and presented results of multivariate kernel regression using simulated data. Chapter 6 described the genetic algorithm and presented results of integrating multivariate kernel density regression with the genetic algorithm using a real multivariate data set. Chapter 7 presented results of integrating kernel regression with the implicit enumeration algorithm using the same real multivariate data set from Chapter 6.

The research in this dissertation made the following contributions to the literature.

• Information complexity is used as the criterion to determine the optimal bandwidth for multivariate kernel density estimation. The bandwidth is calculated for each model identified by the genetic algorithm. The best model has the minimum ICOMP using kernel regression smoothing within the genetic algorithm (a schematic sketch of this bandwidth search is given after this list).

• The management science approach taken is the minimization of the non-linear information complexity function to determine the global optimal model for high dimensional multivariate data sets.

• The approach of using an implicit enumeration with the GA in the kernel regression problem integrates management science optimization with statistical modeling.
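As referenced in the first bullet, the bandwidth search can be viewed as a one-dimensional minimization of the information complexity score. The sketch below shows only that search structure in MATLAB; the function name select_bandwidth, the grid-search strategy, and the handle icomp_of (standing in for the ICOMP computation developed in this dissertation) are illustrative assumptions, not the dissertation's actual implementation.

% Schematic bandwidth search: pick the candidate bandwidth with minimum ICOMP.
% icomp_of is a hypothetical function handle; icomp_of(h) should return the
% information complexity score of the kernel estimate with bandwidth h.
function [h_best, score_best] = select_bandwidth(icomp_of, h_grid)
    scores = arrayfun(icomp_of, h_grid);   % evaluate the criterion on the grid
    [score_best, idx] = min(scores);       % minimum ICOMP wins
    h_best = h_grid(idx);
end
% Within the GA, this search would be repeated for each subset model visited,
% and the resulting minimum ICOMP would serve as that model's fitness value.
% Purely illustrative call with a stand-in criterion:
% [h, s] = select_bandwidth(@(h) (h - 0.7).^2 + 10, linspace(0.05, 2, 40));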

All possible combinations of applying kernel regression smoothing to the dependent Y variables and the independent X variables were researched using both ICOMP(IFIM) and AIC for model selection. This research showed kernel regression smoothing on the X matrix increased the information complexity scores, resulting in generally poorer models. However, kernel regression smoothing on the Y variables dramatically decreased the information complexity scores in most data sets, resulting in much better models.

The choice of the kernel function used to smooth the data depends on the data distribution. The Gaussian kernel is best for approximately normally distributed data, while the Epanechnikov kernel was shown to be a better choice for uniform and power exponential distributed data.
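For reference, the two kernels being compared have the standard textbook forms shown below (a small MATLAB sketch; these definitions are standard and are not taken from the dissertation's code):

% Standard univariate kernel functions referred to above.
gaussian_kernel     = @(u) exp(-0.5 * u.^2) / sqrt(2 * pi);     % Gaussian kernel
epanechnikov_kernel = @(u) 0.75 * (1 - u.^2) .* (abs(u) <= 1);  % Epanechnikov kernel

% Evaluate both kernels at a few points for comparison.
u = -1.5:0.5:1.5;
disp([u; gaussian_kernel(u); epanechnikov_kernel(u)]);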

The genetic algorithm with kernel regression smoothing was applied to a well-publicized real multivariate data set. The real body fat data confirmed the simulated results of Chapter 4: applying kernel regression to the Y variable dramatically decreases both the AIC and ICOMP(IFIM) scores compared to the baseline model using no kernel regression. However, applying kernel regression to the X matrix alone increases the AIC and ICOMP(IFIM) scores.

Kernel regression smoothing was incorporated into the implicit enumeration algorithm for the six kernel models shown in Table 4.1. MATLAB programs written by Bao [Bao, et. al, 2005] were modified to incorporate kernel regression smoothing using AIC or ICOMP(IFIM) information criteria as the objective function to be minimized.

Implicit enumeration with kernel regression correctly identified the global optimal models for three of the six kernel models using AIC. These models confirm the results of the genetic algorithm for AIC. Implicit enumeration works well for AIC due to the monotonicity of $\hat{\sigma}^2$. However, ICOMP(IFIM) did not perform as well as AIC using implicit enumeration with kernel regression due to the monotonicity properties of the penalty function of ICOMP(IFIM).

8.2 Future Work

This dissertation has provided the opportunity for future work to extend this research.

Future work based on this research includes:

• Determine whether the dramatic reduction of information complexity scores by applying kernel regression smoothing to the dependent Y variables extends to more kernel density functions than were used in this dissertation.

• Integrate the kernel regression smoothing with the hybridized implicit enumeration algorithm and the genetic algorithm.

• Apply this research to a higher-dimensional real data set to see if the results are consistent with both the real and simulated data sets used in this dissertation.

• Modify the implicit enumeration algorithm to account for the monotonicity of the penalty function of ICOMP(IFIM).

• Use other information criteria such as SC, ICOMP(IFIM)_PEU, and ICOMP(IFIM) for misspecified models with kernel regression in the genetic and implicit enumeration algorithms.

The author plans to submit the results of this dissertation for publication to a peer reviewed journal.

Bibliography

[Abramson, 1982] Abramson, I. (1982). On Bandwidth Variation in Kernel Estimates: A

Square Root Law. The Annals of Statistics, 10:1217–1223.

[Akaike, 1973] Akaike, H. (1973). Information Theory and an Extension of the Maximum

Likelihood Principle. Second International Symposium on Information Theory, 267–281.

Budapest: Academiai Kiado.

[Akaike, 1974] Akaike, H. (1974). A New Look at the Statistical Model Identification.

IEEE Transactions Automatic Control, AC-19:716–723.

[Akaike, 1978] Akaike, H. (1978). A Bayesian Analysis of the Minimum AIC Procedure.

Annals of the Institute of Statistical Mathematics, 30:9–14.

[Akaike, 1981] Akaike, H. (1981). Likelihood of a Model and Information Criteria. Journal

of Econometrics, 16(1):3–14.

[Akaike, 1987] Akaike, H. (1987). Factor Analysis and AIC. Psychometrika, 52:317–332.

[An and Gu, 1989] An, H. and L. Gu (1989). Fast Stepwise Procedures of Selection of

Variables by Using AIC and BIC Criteria. Acta Math. Appl. Sinica (English Ser.),

5:60–67.

[Bao, et. al, 2005] Bao, X., H. Bozdogan, V. Chatpattananan and K. Gilbert (2005). An

Implicit Enumeration Algorithm for Mining High Dimensional Data. International Jour-

nal of Operational Research, 1:123–144.

[Beal, 2007] Beal, D. (2007). Information Criteria Methods in SAS for Multiple Linear

Regression Models. Proceedings of the 15th Annual Conference of the SouthEast SAS

Users Group.

[Bearse and Bozdogan, 1998] Bearse, P. and H. Bozdogan (1998). Subset Selection in Vec-

tor Autoregressive Models Using the Genetic Algorithm with Informational Complexity

as the Fitness Function. Systems Analysis Modelling Simulation, 31:61–91.

[Bhandarkar, et. al, 1994] Bhandarkar, S., Y. Zhang and W. Potter (1994) An Edge De-

tection Technique Using Genetic Algorithm Based Optimization. Pattern Recognition,

27:1159–1180.

[Bhuyan, et. al, 1991] Bhuyan, J., V. Raghavan and V. Elayavalli (1991). Genetic Algo-

rithm for Clustering with an Ordered Representation. 4th International Conference on

Genetic Algorithms, San Mateo, California: Morgan Kaufman, 38.

[Bhattacharyya, et. al, 1989] Bhattacharyya, B., L. Franklin and G. Richardson (1989).

Nonparametric Kernel Density Estimations: Unweighted Versus Length Biased Den-

sities. Computer Science and Statistics: Proceedings of the 21st Symposium on the

Interface, 319–322.

[Bowman, 1984] Bowman, A. W. (1984). An Alternative Method of Cross-Validation for

the Smoothing of Density Estimates. Biometrika, 71:353–360.

[Bowman and Azzalini, 1997] Bowman, A. W. and A. Azzalini (1997). Applied Smoothing

Techniques for Data Analysis: the Kernel Approach with S-Plus Illustrations. Oxford

University Press, Oxford.

[Box and Tidwell, 1962] Box, G. E. P. and P. W. Tidwell (1962). Transformation of the

Independent Variables. Technometrics, 4:531–550.

[Bozdogan, 1987] Bozdogan, H. (1987). Model Selection and Akaike’s Information Crite-

rion (AIC): The General Theory and its Analytical Extensions. Psychometrika, 52:345–

370.

[Bozdogan, 1988] Bozdogan, H. (1988). ICOMP: A New Model Selection Criterion. Classi-

fication and Related Methods of Data Analysis, Hans H. Bock (ed.), Amsterdam, Elsevier

Science Publishers B. V. (North Holland), 599–608.

[Bozdogan, 1990a] Bozdogan, H. (1990a). Statistical Modeling and Model Evaluation: A

New Informational Approach.

[Bozdogan, 1990b] Bozdogan, H. (1990b). On the Information Based Measure of Covari-

ance Complexity and its Application to the Evaluation of Multivariate Linear Models.

Communications in Statistics, Theory and Methods, 19(1):221–278.

[Bozdogan, 1993] Bozdogan, H. (1993). Choosing the Number of Component Clusters

in the Mixture Model Using a New Informational Complexity Criterion of the Inverse

Fisher Information Matrix. Information and Classification. Concepts, Methods and Ap-

plications. Proceedings of the 16th Annual Conference of the Gesellschaft für Klassifikation

e. V., 40–54.

[Bozdogan, 1994] Bozdogan, H. (1994). Mixture Model Cluster Analysis Using Model Se-

lection Criteria and a New Informational Measure of Complexity. Multivariate Statistical

Modeling, H. Bozdogan (Ed.), Vol. 2, 69–113. Proceedings of the first US/Japan Con-

ference on the Frontiers of Statistical Modeling: An Informational Approach, 2:69–113,

Dordrecht, the Netherlands: Kluwer Academic Publishers.

[Bozdogan, 2000] Bozdogan, H. (2000). Akaike’s Information Criterion and Recent Devel-

opments in Information Complexity. Journal of Mathematical Psychology, 44(1):62–91.

[Bozdogan, 2003] Bozdogan, H. (2003). Statistical Data Mining and Knowledge Discovery.

Boca Raton: Chapman & Hall CRC.

[Bozdogan, 2004] Bozdogan, H. (2004) Intelligent Statistical Data Mining with Informa-

tion Complexity and Genetic Algorithms. Statistical Data Mining and Knowledge Dis-

covery, H. Bozdogan (Ed.), 15–56, Chapman and Hall/CRC, Boca Raton, Florida.

[Bozdogan, 2006] Bozdogan, H. (2006). Keynote lecture presented at the International

Workshop on Knowledge Extraction and Modeling (KNEMO ’06), Island of Capri, Italy,

September 4-6, 2006, www.knemo.unana.it.

[Bozdogan and Haughton, 1998] Bozdogan, H. and D. Haughton (1998). Informational

Complexity Criteria for Regression Models. Computational Statistics and Data Analysis,

28:51–76.

[Bozdogan and Magnus, 2004] Bozdogan, H. and J. Magnus (2004). Misspecification Re-

sistant Model Selection Using Information Complexity. Journal of Econometrics.

[Bozdogan and Shigemasu, 1998] Bozdogan, H. and Shigemasu, K. (1998). Bayesian Fac-

tor Analysis Model and Choosing the Number of Factors Using a New Informational

Complexity Criterion. Advances in Data Science and Classification. Proceedings of the

6th Conference of the International Federation of Classification Societies, 335–342.

[Brieman, 2001] Brieman, L. (2001). Statistical Modeling: the Two Cultures (with discus-

sion). Statistical Science, 16:199–231.

[Breiman, Meisel and Purcell, 1977] Breiman, L., W. Meisel and E. Purcell (1977). Vari-

able Kernel Estimates of Probability Density Estimates. Technometrics, 19:135–144.

[Brewer, 1998] Brewer, M. J. (1998). A Modeling Approach for Bandwidth Selection in

Kernel Density Estimation. COMPSTAT – Proceedings in Computational Statistics,

13th Symposium, 203–208.

[Brooks, et. al, 2003] Brooks, S., N. Friel and R. King (2003). Classical Model Selection

via Simulated Annealing. Journal of Royal Statistical Society, Series B, 65:503–520.

[Burnham and Anderson, 2002] Burnham, K. and D. Anderson (2002). Model Selection

and Multimodel Inference: A Practical Information-Theoretic Approach. Springer-

Verlag, New York.

[Burns, 1992] Burns, P. (1992). A Genetic Algorithm for Robust Regression Estimation.

Statsci Technical Report, Statistical Sciences, Inc.

[Cao, 2001] Cao, R. (2001). Relative Efficiency of Local Bandwidths in Kernel Density

Estimation. Statistics, 35:113–137.

[Chatfield, 1995] Chatfield, C. (1995). Model Uncertainty, Data Mining and Statistical

Inference. Journal of the Royal Statistical Society, Series A, 158:419–466.

[Chaudhuri and Marron, 1999] Chaudhuri, P. and J. Marron (1999). SiZer for Exploration

of Structures in Curves. Journal of the American Statistical Association, 94:807–823.

[Chiu, 1991] Chiu, S. T. (1991). The Effect of Discretization Error on Bandwidth Selection

for Kernel Density Estimation. Biometrika, 78:436–441.

[Chiu, 1992] Chiu, S. T. (1992). An Automatic Bandwidth Selector for Kernel Density

Estimation. Biometrika, 79:771–782.

[Chiu, 1996] Chiu, S. T. (1996). A Comparative Review of Bandwidth Selection for Kernel

Density Estimation. Statistica Sinica, 6:126–145.

[Chu, 1994] Chu, C. K. (1994). Estimation of Change Points in a Nonparametric Re-

gression Function Through Kernel Density Estimation. Communications in Statistics:

Theory and Methods, 23:3037–3062.

[Cwik and Koronacki, 1997a] Cwik, J. and J. Koronacki (1997a). A Combined Adaptive-

Mixtures/Plug-in Estimator of Multivariate Probability Densities. Computational Statis-

tics and Data Analysis, 26:199–218.

[Cwik and Koronacki, 1997b] Cwik, J. and J. Koronacki (1997b). Multivariate Density

Estimation: A Comparative Study. Neural Computing and Applications, 6:173–185.

[Delaigle and Gijbels, 2004] Delaigle, A. and I. Gijbels (2004). Bootstrap Bandwidth Selec-

tion in Kernel Density Estimation from a Contaminated Sample. Annals of the Institute

of Statistical Mathematics, 56:19–47.

[Devroye and Gyorfi, 1985] Devroye, L. and L. Gyorfi (1985). Nonparametric Density Es-

timation: the L1 View, John Wiley and Sons Inc.: New York.

[Duin, 1976] Duin, R. P. W. (1976). On the Choice of Smoothing Parameters for Parzen Es-

timators of Probability Density Functions. IEEE Transactions on Computers, 25:1175–

1178.

[Duong, 2004] Duong, T. (2004). Bandwidth Selectors for Multivariate Kernel Density

Estimation. Ph.D. dissertation, School of Mathematics and Statistics, University of

Western Australia.

[Duong, et al, 2008] Duong, T., A. Cowling, I. Koch and M. Wand (2008). Feature Signif-

icance for Multivariate Kernel Density Estimation. Computational Statistics and Data

Analysis, 52:4225-4242.

[Duong and Hazelton, 2003] Duong, T. and M. L. Hazelton (2003). Plug-in Bandwidth

Selectors for Bivariate Kernel Density Estimation. Journal of Nonparametric Statistics,

15:17–30.

[Duong and Hazelton, 2005] Duong, T. and M. L. Hazelton (2005). Cross Validation Band-

width Matrices for Multivariate Kernel Density Estimation. Scandinavian Journal of

Statistics, 32:485–506.

[Eggermont and Lariccia, 1996] Eggermont, P.P.B. and V. N. Lariccia (1996). A Simple

and Effective Bandwidth Selector for Kernel Density Estimation. Scandinavian Journal

of Statistics, 23:285–301.

[Epanechnikov, 1969] Epanechnikov, V. A. (1969). Nonparametric Estimation of a Multi-

variate Probability Density. Theory of Probability and its Applications, 14:153–158.

[Faraway and Jhun, 1990] Faraway, J. J. and M. Jhun (1990). Bootstrap Choice of Band-

width for Density Estimation. Journal of the American Statistical Association, 85:1119–

1122.

[Friedman, 1991] Friedman, J. H. (1991). Multivariate Adaptive Regression Splines. The

Annals of Statistics, 19:1–67

[Furnival and Wilson, 1974] Furnival, G. and R. Wilson (1974). Regression by Leaps and

Bounds. Technometrics, 16:499–511.

[Gajek and Lenic, 1993] Gajek, L. and A. Lenic (1993). An Approximate Necessary Con-

dition for the Optimal Bandwidth Selector in Kernel Density Estimation. Applicationes

Mathematicae (Warsaw), 22:123–138.

[Godtliebsen et al, 2002] Godtliebsen, F., J. Marron, and P. Chaudhuri (2002). Signifi-

cance in Scale Space for Bivariate Density Estimation. Journal of Computational and

Graphical Statistics, 11:1–21.

[Goldberg, 1989] Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and

Machine Learning. New York: Addison-Wesley.

[Gomez, Gomez-Villegas and Marin, 1998] Gomez, E., M. A. Gomez-Villegas, and J. M.

Marin (1998). A Multivariate Generalization of the Power Exponential Family of Dis-

tributions. Communications in Statistics- Theory and Methods, 27(3):589–600.

[Hald, 2001] Hald, A. (2001). On the History of the Correction for Grouping. Scandinavian

Journal of Statistics, 28:417-428.

[Hall, 1992] Hall, P. (1992). On Global Properties of Variable Bandwidth Density Estima-

tors. Annals of Statistics, 20:762–778.

[Hall and Marron, 1987] Hall, P. and J. S. Marron (1987). Extent to Which Least Squares

Cross Validation Minimizes Integrated Square Error in Nonparametric Density Estima-

tion. Probability Theory and Related Fields, 74:567–581.

[Hall and Marron, 1991] Hall, P. and J. S. Marron (1991). Lower Bounds for Bandwidth

Selection in Density Estimation. Probability Theory and Related Fields, 90:149–173.

[Hall, et al, 1991] Hall, P., S. J. Sheather, M. C. Jones and J. S. Marron (1991). On

Optimal Data Based Bandwidth Selection in Kernel Density Estimation. Biometrika,

78:263–269.

[Hall, Marron and Park, 1992] Hall, P., J. S. Marron and B. U. Park (1992). Smoothed

Cross-Validation. Probability Theory and Related Fields, 92:1–20.

[Hall and Wand, 1988a] Hall, P. and M. P. Wand (1988a). On Nonparametric Discrimina-

tion using Density Differences. Biometrika, 75:541–547.

[Hallin and Tran, 1996] Hallin, M. and L. T. Tran (1996). Kernel Density Estimation for

Linear Processes: Asymptotic Normality and Optimal Bandwidth Derivation. Annals

of the Institute of Statistical Mathematics, 48:429–449.

[Hamada, et. al, 2001] Hamada, M., H. Martz, C. Reese and A. Wilson (2001). Finding

Near Optimal Bayesian Experimental Designs via Genetic Algorithms. The American

Statistician, 55(3):175–181.

[Hardle, 1990] Hardle, W. (1990). Applied Nonparametric Regression. Cambridge Univer-

sity Press, New York.

[Hardle, Liang and Gao, 2000] Hardle, W., H. Liang, and J. Gao (2000). Partially Linear

Models. Springer-Physica-Verlag, Heidelberg.

[Hardle et al, 1990] Hardle, W., J. Marron, and M. Wand (1990). Bandwidth Choice for

Density Derivatives. Journal of the Royal Statistical Society Series B, 52:223–232.

[Hardle, Mori and Vieu, 2007] Hardle, W., Y. Mori, P. Vieu (2007). Statistical Methods

for Biostatistics and Related Fields, Springer: New York.

[Hazelton, 1996] Hazelton, M. (1996). Bandwidth Selection for Local Density Estimation.

Scandinavian Journal of Statistics, 23:221–232.

[Hazelton, 1999] Hazelton, M. L. (1999). An Optimal Local Bandwidth Selector for Kernel

Density Estimation. Journal of Statistical Planning and Inference, 77:37–50.

[Hjort and Glad, 1995] Hjort, N. L. and I. K. Glad (1995). Nonparametric Density Esti-

mation with a Parametric Start. The Annals of Statistics, 23:882–904.

[Hocking, 1976] Hocking, R. (1976). The Analysis and Selection of Variables in Linear

Regression. Biometrics, 32:1–49.

[Hoeting and Ibrahim, 1996] Hoeting, J. and J. Ibrahim (1996). Bayesian Predictive Si-

multaneous Variable and Transformation Selection in the Linear Model. Department of

Statistics, Colorado State University, Fort Collins. Technical Report No. 96/39.

[Holland, 1992] Holland, J. H. (1992). Genetic Algorithms. Scientific American, 267(1):66–

72.

[Howe, 2009] Howe, J. A. (2009). A New Generation of Mixture-Model Cluster Analysis

with Information Complexity and the Genetic EM Algorithm. Ph.D. dissertation, the

University of Tennessee, Knoxville.

[Jones, 1990] Jones, M. C. (1990). Variable Kernel Density Estimates and Variable Kernel

Density Estimates. Australian Journal of Statistics, 32:361–372.

[Jones, 1991a] Jones, M. C. (1991). Kernel Density Estimation for Length Biased Data.

Biometrika, 78:511–519.

[Jones, 1991b] Jones, M. C. (1991). On Correcting for Variance Inflation in Kernel Density

Estimation. Computational Statistics & Data Analysis, 11:3–15.

[Jones, 1991c] Jones, M. C. (1991). Prospects for Automatic Bandwidth Selection in Ex-

tensions to Basic Kernel Density Estimation. Nonparametric Functional Estimation and

Related Topics, 241–249.

[Jones, 1992] Jones, M. C. (1992). Potential for Automatic Bandwidth Choice in Variations

on Kernel Density Estimation. Statistics and Probability Letters, 13:351–356.

[Jones, 1993a] Jones, M. C. (1993a). Kernel Density Estimation When the Bandwidth is

Large. The Australian Journal of Statistics, 35:319–326.

[Jones, 1993b] Jones, M. C. (1993b). Simple Boundary Correction for Kernel Density

Estimation. Statistics and Computing, 3:135–146.

[Jones, 1998] Jones, M. C. (1998). On Some Kernel Density Estimation Bandwidth Selec-

tors Related to the Double Kernel Method. Sankhyā, Series A, 60:249–264.

[Jones and Foster, 1996] Jones, M. C. and P. J. Foster (1996). A Simple Nonnegative

Boundary Correction Method for Kernel Density Estimation. Statistica Sinica, 6:1005–

1013.

[Jones and Kappenman, 1992] Jones, M. C. and R. F. Kappenman (1992). On a Class

of Kernel Density Estimate Bandwidth Selectors. Scandinavian Journal of Statistics:

Theory and Applications, 19:337–349.

[Jones and Lotwick, 1984] Jones, M. C. and H. W. Lotwick (1984). A Remark on Algo-

rithm AS 176: Kernel Density Estimation Using the Fast Fourier Transform. Applied

Statistics, 33:120–122.

[Jones, Marron and Park, 1991] Jones, M. C., J. S. Marron and B. U. Park (1991). A

Simple Root n Bandwidth Selector. The Annals of Statistics, 19:1919–1932.

[Jones, Marron and Sheather, 1996a] Jones, M. C., J. S. Marron and S. J. Sheather

(1996a). A Brief Survey of Bandwidth Selection for Density Estimation. Journal of

the American Statistical Association, 91:401–407.

[Jones, Marron and Sheather, 1996b] Jones, M. C., J. S. Marron and S. J. Sheather

(1996b). Progress in Data Based Bandwidth Selection for Kernel Density Estimation.

Computational Statistics, 11:337–381.

[Jones, McKay and Hu, 1994] Jones, M. C., I. J. McKay and T. C. Hu (1994). Variable

Location and Scale Kernel Density Estimation. Annals of the Institute of Statistical

Mathematics, 46:521–535.

[Jones, Signorini and Hjort, 1999] Jones, M. C., D. F. Signorini and N. L. Hjort (1999).

On Multiplicative Bias Correction in Kernel Density Estimation. Sankhyā, Series A,

61:422–430.

[Kullback, 1987] Kullback, S. (1987). The Kullback-Leibler distance. The American Statis-

tician, 41:340–341.

[Linhart and Zucchin, 1986] Linhart H. and W. Zucchin (1986). Model Selection. New

York: John Wiley & Sons.

[Liu, 2006] Liu, M. (2006). Multivariate Non-normal Regression Models, Information Com-

plexity, and Genetic Algorithms: A Three Way Hybrid for Intelligent Data Mining. Ph.D.

thesis, University of Tennessee, Knoxville.

[Liu and Bozdogan, 2008] Liu, M. and H. Bozdogan (2008). Multivariate Regression Mod-

els with Power Exponential Random Errors and Subset Selection Using Genetic Algo-

rithms With Information Complexity. European Journal of Pure and Applied Mathe-

matics, January 2008.

[Loader, 1999] Loader, C. R. (1999). Bandwidth Selection: Classical or Plug-in? The

Annals of Statistics, 27:415–438.

[Loftsgaarden and Quesenberry, 1965] Loftsgaarden, D. O. and C. P. Quesenberry (1965).

A Nonparametric Estimate of a Multivariate Density Function. Annals of Mathematical

Statistics, 36:1049–1051.

[Luh, Minesky and Bozdogan, 1996] Luh, H. K., J. J. Minesky and H. Bozdogan (1996).

Choosing the Best Predictors in Regression Analysis via the Genetic Algorithm with

Informational Complexity as the Fitness Function. unpublished paper.

[Mantel, 1970] Mantel, N. (1970). Why Stepdown Procedures in Variable Selection. Tech-

nometrics, 12:591–612.

[Meyer, 2003] Meyer, M. (2003). An Evolutionary Algorithm with Applications to Statis-

tics. Journal of Computational and Graphical Statistics, 12(2):265–281.

[Minnotte and Scott, 1993] Minnotte, M. C. and D. W. Scott (1993). The Mode Tree: A

Tool for Visualization of Nonparametric Density Features. Journal of Computational

and Graphical Statistics, 2:51–68.

[Park, 1989] Park, B. U. (1989). On the Plug-in Bandwidth Selectors in Kernel Density

Estimation. Journal of the Korean Statistical Society, 18:107–117.

[Parzen, 1962] Parzen, E. (1962). On Estimation of a Probability Density Function and

Mode. The Annals of Mathematical Statistics, 33:1065–1076.

[Rosenblatt, 1956] Rosenblatt, M. (1956). Remarks on Some Nonparametric Estimates of

a Density Function. The Annals of Mathematical Statistics, 27:832–837.

[Rudemo, 1982] Rudemo, M. (1982). Empirical Choice of Histograms and Kernel Density

Estimators. Scandinavian Journal of Statistics: Theory and Applications, 9:65–78.

[Sahin and Diwekar, 2004] Sahin, K. H. and U. M. Diwekar (2004). Better Optimization of

Nonlinear Uncertain Systems (BONUS): A New Algorithm for Stochastic Programming

Using Reweighting Through Kernel Density Estimation. Annals of Operations Research,

132:47–68.

[Said, 2005] Said, Y. H. (2005). On Genetic Algorithms and Their Applications. Handbook

of Statistics, 24:359–390.

[Sain, 2002] Sain, S. R. (2002). Multivariate Locally Adaptive Density Estimation. Com-

putational Statistics and Data Analysis, 39:165–186.

[Sain and Scott, 2002] Sain, S. R. and D. W. Scott (2002). Zero-Bias Bandwidths for

Locally Adaptive Kernel Density Estimation. Scandinavian Journal of Statistics, 29:441–

460.

[Sain, Baggerly and Scott, 1994] Sain, S. R., K. A. Baggerly and D. W. Scott (1994). Cross

Validation of Multivariate Densities. Journal of the American Statistical Association,

89:807–817.

[Sain and Scott, 1996] Sain, S. R. and D. W. Scott (1996). On Locally Adaptive Density

Estimation. Journal of the American Statistical Association, 91:1525–1534.

[Sawa, 1978] Sawa, T. (1978). Information Criteria for Discriminating Among Alternative

Regression Models. Econometrica, 46:1273–1282.

[Schucany, 1989] Schucany, W. R. (1989). Locally Optimal Window Widths for Kernel

Density Estimation with Large Samples. Statistics and Probability Letters, 7:401–405.

[Schwarz, 1978] Schwarz, G. (1978). Estimating the Dimension of a Model. Annals of

Statistics, 6:461–464.

[Sclove, 1987] Sclove, S. (1987). Application of Model Selection Criteria to Some Problems

in Multivariate Analysis. Psychometrika, 52:333–343.

[Scott, 1976] Scott, D. W. (1976). Nonparametric Probability Density Estimation by Op-

timization Theoretic Techniques. Statistics & Probability, Vol. 65.

[Scott, et. al, 1980] Scott, D. W., R. A. Tapia and J. R. Thompson (1980). Nonparametric

Probability Density Estimation by Discrete Maximum Penalized-Likelihood Criteria.

Annals of Statistics, 8(4): 820–832.

[Scott, 1985] Scott, D. W. (1985). Averaged Shifted Histograms: Effective Nonparametric

Density Estimators in Several Dimensions. Annals of Statistics, 13:1024–1040.

[Scott, 1992] Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and

Visualization. John Wiley: New York.

[Scott and Sain, 2004] Scott, D. W. and S. R. Sain (2004). Multi-dimensional Density

Estimation. Elsevier Science.

[Scott and Terrell, 1987] Scott, D. W. and G. R. Terrell (1987). Biased and Unbiased

Cross-Validation in Density Estimation. Journal of the American Statistical Association,

82(400):1131–1146.

[Scott, Tapia and Thompson, 1997] Scott, D.W., R. A. Tapia and J. R. Thompson (1997).

Kernel Density Estimation Revisited. Nonlinear Analysis, 1(4):339–372.

[Sheather and Jones, 1991] Sheather, S. J. and M. C. Jones (1991). A Reliable Data-

Based Bandwidth Selection Method for Kernel Density Estimation. Journal of the Royal

Statistical Society, Series B: Methodological, 53:683–690.

[Silverman, 1981] Silverman, B. W. (1981). Using Kernel Density Estimates to Investigate

Multi-Modality. Journal of the Royal Statistical Society, Series B, 43:97–99.

[Silverman, 1982] Silverman, B. W. (1982). Kernel Density Estimation Using the Fast

Fourier Transform. Applied Statistics, 31:93–99.

[Silverman, 1986] Silverman, B. W. (1986). Density Estimation for Statistics and Data

Analysis. Chapman and Hall: London.

[Simonoff, 1996] Simonoff, J. S. (1996). Smoothing Methods in Statistics, New York:

Springer-Verlag.

[Singh, 1987] Singh, R. S. (1987). MISE of Kernel Estimates of a Density and its Deriva-

tives. Statistics & Probability Letters, 5:153–159.

[Song, Seog and Cho, 1991] Song, M. S., K. H. Seog and S. S. Cho (1991). On Asymptot-

ically Optimal Plug-in Bandwidth Selectors in Kernel Density Estimation. Journal of

the Korean Statistical Society, 20:29–43.

[Speckman, 1988] Speckman, P. (1988). Kernel Smoothing in Partial Linear Models. Jour-

nal of the Royal Statistical Society, Series B, 50:413–436.

[Stone, 1984] Stone, C. J. (1984). An Asymptotically Optimal Window Selection Rule for

Kernel Density Estimates. The Annals of Statistics, 12:1285–1297.

[Sturges, 1926] Sturges, H. A. (1926). The Choice of a Class Interval. Journal of the

American Statistical Association, 21:65–66.

[Tapia and Thompson, 1978] Tapia, R. A. and J. R. Thompson (1978). Nonparametric

Probability Density Estimation. Hopkins Press: Baltimore.

[Taylor, 1989] Taylor, C. C. (1989). Bootstrap Choice of the Smoothing Parameter in

Kernel Density Estimation. Biometrika, 76:705–712.

[Terrell, 1990] Terrell, G. R. (1990). The Maximal Smoothing Principle in Density Esti-

mation. Journal of the American Statistical Association, 85:470–477.

[Terrell and Scott, 1985] Terrell, G. R. and D. W. Scott (1985). Oversmoothed Nonpara-

metric Density Estimates. Journal of the American Statistical Association, 80:209–214.

[Terrell and Scott, 1992] Terrell, G. R. and D. W. Scott (1992). Variable Kernel Density

Estimation. Annals of Statistics, 20:1236–1265.

[Trotter and Shetty, 1974] Trotter, L. and C. Shetty (1974). An Algorithm for the Bounded

Variable Integer Programming Problem. Journal of Association for Computing Machin-

ery, 21:505–513.

[Van Emden, 1971] Van Emden, M. H. (1971). An Analysis of Complexity. Mathematical

Centre Tracts. 35, Amsterdam: Mathematisch Centrum.

[Victor, 1976] Victor, N. (1976). Decision Making and Medical Care: Can Information

Science Help? Nonparametric Allocation Rules, North-Holland, Amsterdam, 515–529.

[Vose, 1999] Vose, M. (1999). The Simple Genetic Algorithm: Foundations and Theory.

MIT Press.

[Wagner, 1975] Wagner, T. J. (1975). Nonparametric Estimates of Probability Densities.

IEEE Transactions on Information Theory, IT-21:438–440.

[Wand and Jones, 1993] Wand, M. P. and M. C. Jones (1993). Comparison of Smoothing

Parameterizations in Bivariate Kernel Density Estimation. Journal of the American

Statistical Association, 88(422):520–528.

[Wand and Jones, 1994] Wand, M. P. and M. C. Jones (1994). Multivariate Plug-in Band-

width Selection. Computational Statistics, 9:97–116.

[Wand and Jones, 1995] Wand, M. P. and M. C. Jones (1995). Kernel Smoothing, Chap-

man and Hall: London.

[Wu, 1997] Wu, T. J. (1997). Root n Bandwidth Selectors for Kernel Estimation of Density

Derivatives. Journal of the American Statistical Association, 92:536–547.

[Wu and Tsai, 2004] Wu, T. J. and M. H. Tsai (2004). Root n Bandwidth Selectors in

Multivariate Kernel Density Estimation. Probability Theory and Related Fields, 129:537–

558.

[Yang, 2000] Yang, L. (2000). Root n Convergent Transformation Kernel Density Estima-

tion. Journal of Nonparametric Statistics, 12:447–474.

Appendices

Appendix

A.1 Simulated Normal Data from Section 4.2.1


Figure A.1: Histograms and scatter plots of normally distributed variables X1, X2, X3, and X4.


Figure A.2: Histograms and scatter plots of normally distributed variables X5, X6, X7, and Y1.

A.2 Hald's Cement Data from Section 4.2.3

Figure A.3: Histograms and scatter plots of Hald’s variables X1, X2, X3, X4 and Y .

A.3 Power Exponential Data from Section 4.2.4

Figure A.4: Histograms and scatter plots of PE variables X1, X2, X3, and X4.


Figure A.5: Histograms and scatter plots of PE variables X5, X6, X7, and Y1.

A.4 Friedman's Data from Section 4.2.5

Figure A.6: Histograms and scatter plots of Friedman’s variables X1, X2, X3, X4, X5, and Y .


Figure A.7: Histograms and scatter plots of Friedman’s variables X6, X7, X8, X9, and X10.

A.5 Uncorrelated Uniform Data from Section 4.2.6

Figure A.8: Histograms and scatter plots of uncorrelated uniform variables X4, X5, X6, and X7.

A.6 Simulated Multivariate Normal Data from Section 5.2

Figure A.9: Histograms and scatter plots of normally distributed variables X1, X2, Y1, Y2, and Y3.


Figure A.10: Histograms and scatter plots of normally distributed variables X3, X4, X5, X6, and X7.

A.7 Power Exponential Data from Section 5.3

Figure A.11: Histograms and scatter plots of PE variables X1, X2, Y1, Y2, and Y3.


Figure A.12: Histograms and scatter plots of PE variables X3, X4, X5, X6, and X7.

A.8 Body Fat Data from Section 6.3

Figure A.13: Histograms and scatter plots of body fat variables X1, X2, X3, X4, and X5.


Figure A.14: Histograms and scatter plots of body fat variables X6, X7, X8, X9, and X10.


Figure A.15: Histograms and scatter plots of body fat variables X11, X12, X13, and Y .

Vita

Dennis J. Beal was born and raised in Knoxville, Tennessee. He graduated from Carter High School in 1980, where he was class valedictorian of a senior class of approximately 300 students. He then attended Carson-Newman College in Jefferson City, Tennessee, and majored in mathematics. In addition to a major in mathematics, he minored in computer studies, history, business administration, and Greek. Dennis graduated magna cum laude from Carson-Newman College in 1984 with a 3.90 grade point average and was awarded the Science-Humanities Award in 1984 from his graduating class for academic excellence in both science and the humanities. He was a member or officer in many campus honor societies and organizations, including Alpha Chi, Mortar Board, Pi Tau Chi, Blue Key National Honor Society, Alpha Lambda Delta, Phi Alpha Theta, president of Pi Mu Epsilon, Phi Eta Sigma, and Sigma Mu Alpha. In 1985 he was selected for the Outstanding Young Men of America.

After graduating from Carson-Newman College, Dennis pursued graduate study in applied mathematics at Virginia Polytechnic Institute and State University (Virginia Tech) in Blacksburg, Virginia. He graduated from Virginia Tech with a Master of Science degree in applied mathematics in 1987. At Virginia Tech he took his first senior-level statistics courses, where he fell in love with the science of statistics. He also took his first graduate courses in operations research and saw how applicable these disciplines were to solving real-world problems.

After getting a taste of statistics and operations research, Dennis decided he wanted more education in these areas and enrolled at the University of Tennessee, Knoxville, to pursue a Master of Science degree in statistics. He took nearly every graduate statistics course offered in addition to some management science courses as electives. He graduated from the University of Tennessee, Knoxville, with his Master of Science degree in statistics in 1989 with a 3.91 grade point average.

During graduate school at the University of Tennessee, Knoxville, he had the opportunity to do part-time work as a co-op at the Oak Ridge National Laboratory in Oak Ridge, Tennessee. There he ran SAS models and analyzed transportation data for the U.S. Department of Transportation. After graduating with his Master of Science degree in statistics with a 3.9 grade point average, he applied for a full-time position as a statistician with Martin Marietta Energy Systems at the K-25 Plant (now known as the East Tennessee Technology Park) in Oak Ridge. He worked as an applied statistician on a wide variety of primarily environmental projects for the U.S. Department of Energy. Due to the nature of his projects, Dennis was granted a U.S. Department of Energy top secret ('Q') security clearance, which is still active today. Dennis became an experienced SAS programmer and statistician by analyzing environmental data across several U.S. Department of Energy installations.

In 1998 Dennis transitioned from Lockheed Martin Energy Systems (successor to Martin Marietta Energy Systems) to the new U.S. Department of Energy environmental contractor Bechtel Jacobs Company, where he continued his work as an applied statistician on environmental projects for the U.S. Department of Energy. In 1999 he transitioned from Bechtel Jacobs Company to one of its subcontractors.

In 2000 he left the subcontractor and was hired as a statistician at Science Applications International Corporation (SAIC) in Oak Ridge, Tennessee. At SAIC he used his advanced SAS skills and applied statistical professional experience on a wide variety of both government and private industry projects. In addition to gaining experience as a statistician, he also used his skills as a human health risk assessor at SAIC, where he wrote and ran complex SAS programs to calculate risks and hazards for human health.

In 2001 Dennis wanted to pursue a Ph.D. in management science with an emphasis in statistics to complement his 12 years of professional experience in the business world. His first doctoral-level statistics course at the University of Tennessee, Knoxville, was with Dr. Hamparsum Bozdogan, where Dennis was introduced to the concepts of information criteria and model selection for regression. He knew this was the area he wanted to pursue in research. He pursued the Ph.D. in management science part-time while still working full-time at SAIC. In 2007 he was promoted in grade to Senior Statistician, and he is the lead statistician on many high-profile projects for decontamination and decommissioning of legacy buildings at the Oak Ridge National Laboratory, Y-12 National Security Complex, and the East Tennessee Technology Park. He successfully designed the statistical sampling plan for characterizing the K-25 Building structure and its many process gaseous equipment components. This was the first successful statistical sampling design approved by the regulators and has become the standard for other building characterization projects within the U.S. Department of Energy.

Dennis also applied his statistical expertise to the National Biosurveillance Information System for the U.S. Federal Government, whose goal is to integrate information from multiple sources to predict terrorist threats or epidemics from emerging cases of highly contagious diseases in animals or humans. Dennis has an active Top Secret Department of Defense security clearance for his work on this project.

Dennis also was on the technical team that created a ranking system for quantifying the safety of meat products for the U.S. Department of Agriculture. He met with the Assistant Secretary of Agriculture in Washington, D.C., to provide statistical expertise for the development of the model used to predict meat processing plants that need closer inspection.

Dennis expects to complete his Ph.D. in management science in December 2009 with a 3.9 grade point average. In addition to his research and applied statistical consulting through SAIC, he continues to present papers annually at SAS conferences.
