Novel Computationally Intelligent Machine Learning Algorithms for Data Mining and Knowledge Discovery

Novel Computationally Intelligent Machine Learning Algorithms for Data Mining and Knowledge Discovery Iffat A. Gheyas Department of Computing Science and Mathematics University of Stirling, Stirling FK9 4LA Scotland, UK This thesis has been submitted to the University of Stirling in partial fulfilment of the requirements for the degree of Doctor of Philosophy. November, 2009 Declaration I hereby declare that this thesis has been composed by me, that the work and results have not been presented for any university degree prior to this and that the ideas that I do not attribute to others are my own. Iffat Gheyas 2009 i Abstract This thesis addresses three major issues in data mining regarding feature subset selection in large dimensionality domains, plausible reconstruction of incomplete data in cross-sectional applications, and forecasting univariate time series. For the automated selection of an optimal subset of features in real time, we present an improved hybrid algorithm: SAGA. SAGA combines the ability to avoid being trapped in local minima of Simulated Annealing with the very high convergence rate of the crossover operator of Genetic Algorithms, the strong local search ability of greedy algorithms and the high computational efficiency of generalized regression neural networks (GRNN). For imputing missing values and forecasting univariate time series, we propose a homogeneous neural network ensemble. The proposed ensemble consists of a committee of Generalized Regression Neural Networks (GRNNs) trained on different subsets of features generated by SAGA and the predictions of base classifiers are combined by a fusion rule. This approach makes it possible to discover all important interrelations between the values of the target variable and the input features. The proposed ensemble scheme has two innovative features which make it stand out amongst ensemble learning algorithms: (1) the ensemble makeup is optimized automatically by SAGA; and (2) GRNN is used for both base classifiers and the top level combiner classifier. Because of GRNN, the proposed ensemble is a dynamic weighting scheme. This is in contrast to the existing ensemble approaches which belong to the simple voting and static weighting strategy. The basic idea of the dynamic weighting procedure is to give a higher reliability weight to those scenarios that are similar to the new ones. The simulation results demonstrate the validity of the proposed ensemble model. ii Acknowledgement People often say that getting a PhD is stressful. I can‘t agree there. I really enjoyed every single moment of it. I was in huge financial distress. I didn‘t think that I would ever see the day! My thesis is a miracle to me. I wish to acknowledge a few people who made this thesis work possible. I have never had the chance to thank them –until now. I would like to thank my principal supervisor, Professor Leslie Smith, for his expert guidance, invaluable advice, great feedback and patient encouragement. Proud to call him my best friend!!! He pays attention; works hard and always supports my academic needs, whatever they might be. It is for him, in the whole university, only I have the SAS software installed on my computer. I used both the MATLAB and SAS software packages for performing the simulations and analyses; and it gave me a great flexibility to implement and experiment existing machine learning algorithms. In 2008, when it became that bad that I often could not work at times due to financial hardship, he did everything so that I could take my office computer home. I used to work late at night. I was very stressed out (due to financial reason) and we had ups and downs and battles. But he is always very patient and forgiving with me. I was privileged to work with him for the last 4 years. Sincere thanks to my external examiner Professor Colin Fyfe of the University of the West of Scotland and my internal examiner Dr Bruce Graham for insightful comments on the final revisions of this thesis. iii I would like to take this opportunity to thank my second supervisor Dr David Cairns. My tuition fee is over £30,000. I could not pay the last instalment of £2,600. I did not even have enough money for myself. I was asking everyone in the department to give me a part-time job. When heard, Dr Cairns became very sympathetic. This is some of what he wrote to me back in February 2008: ―I am very sorry to hear of your difficult circumstances. I have asked Kevin Swingler if he has any work available at the moment and unfortunately neither he nor I are able to offer any work since we do not have any available funds. I think that it is important that you have a discussion with Prof. Smith about your funding concerns to see if it is possible to come to some agreement with the university which avoids or defers the need to pay the outstanding balance. ...I will discuss your situation with him to see what can be done.‖ I‘m sure this is the sweetest email I‘ve ever received. This story has a happy ending! I wish to gratefully acknowledge the help of the following persons: When I was looking for part time jobs so that I could continue my research, Dr Simon Jones and Dr Savi Maharaj kindly agreed to act as my referees. I really want to thank, my MSc supervisor, Dr Amir Hussain for his initial help in getting me started. Thanks especially to Ms Kate Howie for helping me choose the right statistical tests for my project. Virtually everybody around me was stressed out. I have been a prolific reader of academic papers in my studies for the last four years. I always send a lot of iv interlibrary loan (ILL) requests (I averaged around 50 requests per day) which made the library staffs extremely stressed and tense. I deeply appreciate their help. My deepest gratitude goes to my family for their unflagging love and support v To My Parents & Especially, To My Sister Dr Ferdous Gheyas For Their Love, Endless Support and Encouragement vi Abbreviations ACF Autocorrelation Function ACO Ant Colony Optimization AIC Akaike Information Criterion ANN Artificial Neural Networks ARIMA Autoregressive Integrated Moving Average BIC Bayesian Information Criterion BP Blood Pressure CI Confidence Interval d Deseasonalized DFT Discrete Fourier Transform DWT Discrete Wavelet Transform EM Expectation Maximization ERM Expected Risk Maximization ERNN Elman's Recurrent Neural Networks FW Filter approach + Wrapper approach (a hybrid algorithm) GA Genetic Algorithm GARCH Generalized Auto-Regressive Conditional Heteroskedasticity Gbest Global best GE Generalized Regression Neural Network Ensemble GEFTS Generalized Regression Neural Network Ensemble for Forecasting Time Series ( proposed time series forecasting algorithm) GEMI Generalized Regression Neural Network Ensemble for Multiple Imputation GESI Generalized Regression Neural Network Ensemble for Single Imputation GRNN Generalized Regression Neural Networks GRNN MI Generalized Regression Neural Networks with Multiple Imputation GRNN SI Single Imputation with Generalized Regression Neural Networks HA ARIMA-GARCH+ERNN (a hybrid algorithm) HC Hill Climbing HD Hot Deck Imputation HD MI Hot Deck Imputation method with Multiple Imputation HD SI Hot Deck Imputation method for Single Imputation HES Heterogeneous ensemble with simple averaging HES MI Heterogeneous ensemble with simple averaging for Multiple Imputation HES SI Heterogeneous ensemble with simple averaging for Single Imputation HEW Heterogeneous ensemble with weighted averaging HEW MI Heterogeneous ensemble with weighted averaging for Multiple Imputation HEW SI Heterogeneous ensemble with weighted averaging for Single Imputation HOS Homogeneous ensemble with simple averaging vii HOS MI Homogeneous ensemble with simple averaging for Multiple Imputation HOS SI Homogeneous ensemble with simple averaging for Single Imputation HOW Homogeneous ensemble with weighted averaging HOW MI Homogeneous ensemble with weighted averaging for Multiple Imputation HOW SI Homogeneous ensemble with weighted averaging for Single Imputation HUX Half Uniform Crossover KNN K-Nearest Neighbours Algorithm KNN MI K-Nearest Neighbours Algorithm for Multiple Imputation KNN SI K-Nearest Neighbours Algorithm for Single Imputation logsig logistic sigmoid function MAR Missing at Random MCAR Missing Completely at Random MCMC Markov Chain Monte Carlo MI Multiple Imputation MLP Multilayer Perceptrons MLP MI Multilayer Perceptrons for Multiple Imputation MLP SI Multilayer Perceptrons for Single Imputation MNAR Missing Not at Random MS Mean Substitution MSE Mean square error nd non-deseasonalized NN Neural Networks PACF Partial Autocorrelation Function Pbest Personal best PC Principal Component PCA Principal Component Analysis PSO Particle Swarm Optimization RBFN Radial Basis Function Networks RBFN MI Radial Basis Function Networks for Multiple Imputation RBFN SI Radial Basis Function Networks for Single Imputation RNN Recurrent Neural Networks SA Simulated Annealing SAGA SA (Simulated Annealing) +GA (Genetic Algorithm) (proposed feature subset selection algorithm) SBS Sequential Backward Selection SFBS Sequential Floating Backward Selection SFFS Sequential Floating Forward Selection SFS Sequential Forward Selection SI Single Imputation SRM Structural Risk Minimization SU Symmetric Uncertainty SVM Support Vector Machines tanh hyperbolic tangent sigmoid function viii VC Vapnik-Chervonenkis dimension dimension WKNN Weighted K-Nearest Neighbours Algorithm WKNN MI Weighted K-Nearest Neighbours Algorithm for Multiple Imputation WKNN SI Weighted K-Nearest Neighbours Algorithm

Load more