
LEARNING CLASSIFIER SYSTEMS FROM FIRST PRINCIPLES

A PROBABILISTIC REFORMULATION OF LEARNING CLASSIFIER SYSTEMS FROM THE PERSPECTIVE OF MACHINE LEARNING

Submitted by Jan Drugowitsch for the degree of Doctor of Philosophy of the University of Bath, August 2007

COPYRIGHT

Attention is drawn to the fact that copyright of this thesis rests with its author. This copy of the thesis has been supplied on condition that anyone who consults it is understood to recognise that its copyright rests with its author and that no information derived from it may be published without the prior written consent of the author.

This thesis may be made available for consultation within the University library and may be photocopied or lent to other libraries for the purposes of consultation.

Abstract

Learning Classifier Systems (LCS) are a family of rule-based machine learning methods. They aim at the autonomous production of potentially human-readable results that are the most compact generalised representation whilst also maintaining high predictive accuracy, and have a wide range of application areas, such as autonomous robotics, economics, and multi-agent systems.

Their design is mainly approached heuristically and, even though their performance is competitive in regression and classification tasks, they do not meet their expected performance in sequential decision tasks despite being initially designed for such tasks. It is our contention that improvement is hindered by a lack of theoretical understanding of their underlying mechanisms and dynamics.

To improve this understanding, our work proposes a new methodology for their design that centres on the model they use to represent the problem structure, and subsequently applies standard machine learning methods to train this model. The LCS structure is commonly a set of rules, resulting in a parametric model that combines a set of localised models, each representing one rule. This leads to a general definition of the optimal set of rules as being the one whose model represents the data best at a minimal complexity, and hence an increased theoretical understanding of LCS. Consequently, LCS training reduces to searching for and evaluating this set of rules, for which we introduce and apply several standard methods that are shown to be closely related to current LCS implementations.

The benefit of taking this approach is not only a new view on LCS, and the transfer of the formal basis of the applied methods to the analysis of LCS, but

also the first general definition of what it means for a set of rules to be optimal. The work promises advances in several areas, such as the development of new LCS implementations with performance guarantees, improvements to their performance, and, foremost, a better theoretical understanding.

Acknowledgements

I first and foremost would like to thank my parents, Elsbeth and Knut Drugowitsch, for their general moral and generous financial support, without which I would have been unable to fully focus on my work. With equal emphasis I would like to thank my supervisor, Alwyn Barry, for providing me with a challenging research subject. His guidance, his constructive comments, and his initiative were essential in the realisation of this thesis. I would also like to acknowledge my examiners, Larry Bull and Dan Richardson, and their thoughtful criticism and discussion of my thesis submission.

I would like to express my gratitude to all people that shaped my work and life in the three years that I have spent in Bath, in particular Will Lowe for introducing me to the model-based machine learning perspective, for offering his stance on various machine learning topics, and for comments on an early draft of this thesis. Joanna Bryson supported me morally through general Artificial Intelligence discussions, and financially by providing me with programming jobs. Special thanks also go to Hagen Lehmann and Tristan Caulfield for comments on early drafts of my thesis and for discussions about my work, life, the universe and everything.

My communication with several LCS researchers has contributed to the content of this thesis. In particular, I would like to thank Pier Luca Lanzi and Daniele Loiacono for their frequent comments and motivating appreciation of my work, and the stimulating discussions at various conferences. Additionally, I would like to acknowledge the comments of Martin Butz and Lashon Booker on some of my published work, and those of Will Browne on the first draft of this thesis.

Due to the nature of my work and my previous education I am grateful for any mathematical support that was given to me during the completion of this thesis, in particular by Marelee Hurn with respect to some statistical questions, and my lab colleagues Jonty Needham and Mark Price.

Various researchers from the machine learning community have also provided their help: Christopher Bishop, Markus Svensén, Matthew Beal, and Tommi Jaakkola answered my questions regarding the application of variational Bayesian inference; Gavin Brown pointed me to relevant ensemble learning literature; Lei Xu supported my attempts at applying Bayesian Ying-Yang to LCS; Peter Grünwald and Arta Doci clarified some MDL-related issues; Michael Littman discussed my queries regarding intelligent exploration methods and performance guarantees in reinforcement learning.

Finally, I thank Odali Sanhueza for making my Ph.D.-free time as good as it can get.

Contents

1 Introduction
1.1 Machine Learning
1.1.1 Common Machine Learning Tasks
1.1.2 Designing an Unsupervised Learning Algorithm
1.2 Learning Classifier Systems
1.2.1 A Brief Overview
1.2.2 Applications and Current Issues
1.3 About this Work
1.3.1 The Initial Approach
1.3.2 Taking a Model-Centred View
1.3.3 Summarising the Approach
1.3.4 Contributions
1.4 How to Read this Thesis
1.4.1 Chapter Overview
2 Background
2.1 A General Problem Description
2.2 Early Learning Classifier Systems
2.2.1 Initial Idea
2.2.2 The General Framework
2.2.3 Interacting Subsystems
2.2.4 The Genetic Algorithm in LCS
2.2.5 The Problems of Early LCS
2.3 The LCS Renaissance
2.3.1 Computing the Prediction
2.3.2 Localisation and Representation
2.3.3 Classifiers as Localised Maps from Input to Output
2.3.4 Recovering the Global Prediction
2.3.5 Michigan-style vs. Pittsburgh-style LCS
2.4 Existing Theory
2.4.1 The Holistic View
2.4.2 Approaches from the Genetic Algorithm Side
2.4.3 Approaches from the Function Approximation Side
2.4.4 Approaches from the Reinforcement Learning Side
2.5 Discussion and Conclusion
3 A Learning Classifier Systems Model
3.1 Task Definitions
3.1.1 Expected Risk vs. Empirical Risk
3.1.2 Regression
3.1.3 Classification
3.1.4 Sequential Decision
3.1.5 Batch vs. Incremental Learning
3.2 LCS as Parametric Model
3.2.1 Parametric Models
3.2.2 LCS Model
3.2.3 Classifiers as Localised Models
3.2.4 Recovering the Global Model
3.2.5 Finding a Good Model Structure
3.2.6 Considerations for Model Structure Search
3.2.7 Relation to the Initial LCS Idea
3.3 Summary and Outlook
4 A Probabilistic Model for LCS
4.1 The Mixtures-of-Experts Model
4.1.1 Likelihood for Known Gating
4.1.2 Parametric Gating Network
4.1.3 Training by Expectation-Maximisation
4.1.4 Localisation by Interaction
4.1.5 Training Issues
4.2 Linear Expert Models
4.3 Generalising the MoE Model
4.3.1 An Additional Layer of Forced Localisation
4.3.2 Updated Expectation-Maximisation Training
4.3.3 Implications on Localisation
4.3.4 Relation to Standard MoE Model
4.3.5 Relation to LCS
4.3.6 Training Issues
4.4 Independent Classifier Training
4.4.1 The Origin of Local Maxima
4.4.2 What does a Classifier Model?
4.4.3 Introducing Independent Classifier Training
4.4.4 Training the Gating Network
4.4.5 Implications on Likelihood and Assumptions about the Data
4.5 Discussion and Summary
5 Training the Classifiers
5.1 Linear Classifier Models and Their Underlying Assumptions
5.1.1 Linear Models
5.1.2 Gaussian Noise
5.1.3 Maximum Likelihood and Least Squares
5.2 Batch Learning Approaches
5.2.1 The Weight Vector
5.2.2 The Noise Precision
5.3 Incremental Learning Approaches
5.3.1 The Principle of Orthogonality
5.3.2 Steepest Gradient Descent
5.3.3 Least Mean Squared
5.3.4 Normalised Least Mean Squared
5.3.5 Recursive Least Squares
5.3.6 The Kalman Filter
5.3.7 Incremental Noise Precision Estimation
5.3.8 Summarising Incremental Learning Approaches
5.4 Empirical Demonstration
5.4.1 Experimental Setup
5.4.2 Weight Vector Estimate
5.4.3 Noise Variance Estimate
5.5 Discussion and Summary
6 Mixing Independently Trained Classifiers
6.1 Using the Generalised Softmax Function
6.1.1 Batch Learning by Iterative Reweighted Least Squares
6.1.2 Incremental Learning by Least Squares
6.2 Alternative Heuristics
6.2.1 Properties of Weighted Averaging Mixing
6.2.2 Inverse Variance
6.2.3 Prediction Confidence
6.2.4 Maximum Prediction Confidence
6.2.5 XCS
6.3 Empirical Comparison
6.3.1 Experimental Design
6.3.2 Results
6.3.3 Discussion
6.4 Relation to our Previously Published Work
6.5 Summary and Outlook
7 The Optimal Set of Classifiers
7.1 What is Optimal?
7.1.1 Current LCS Approaches
7.1.2 Model Selection
7.1.3 Bayesian Model Selection
7.1.4 Applying Bayesian Model Selection to Finding the Best Set of Classifiers
7.1.5 The Model Structure Prior p(M)
7.1.6 The Myth of No Prior Assumptions
7.2 A Fully Bayesian LCS
7.2.1 Data, Model Structure, and Likelihood
7.2.2 Multivariate Regression Classifiers
7.2.3 Priors on the Classifier Model Parameters
7.2.4 Mixing by the Generalised Softmax Function
7.2.5 Priors on the Mixing Model
7.2.6 Joint Distribution over Random Variables
7.3 Evaluating the Model Evidence
7.3.1 Variational Bayesian Inference
7.3.2 Classifier Model q*_{W,τ}(W, τ)
7.3.3 Classifier Weight Priors q*_α(α)
7.3.4 Mixing Model q*_V(V)
7.3.5 Mixing Weight Priors q*_β(β)
7.3.6 Latent Variables q*_Z(Z)
7.3.7 Required Moments of the Variational Posterior
7.3.8 The Variational Bound L(q)
7.3.9 Independent Classifier Training
7.3.10 How to Get p(M|D) for Some M
7.4 Predictive Distribution
7.4.1 Deriving p(y'|x', D)
7.4.2 Mean and Variance
7.5 Alternative Model Selection Methods
7.5.1 Minimum Description Length
7.5.2 Structural Risk Minimisation
7.5.3 Bayesian Ying-Yang
7.5.4 Training Data-based Approaches
7.6 Discussion and Summary
8 An Algorithmic Description
8.1 Computing p(M|D)
8.1.1 Model Probability and Evidence
8.1.2 Training the Classifiers
8.1.3 Training the Mixing Model
8.1.4 The Variational Bound
8.1.5 Scaling Issues
8.2 Two Alternatives for Model Structure Search
8.2.1 Model Structure Search by a Genetic Algorithm
8.2.2 Model Structure Search by Markov Chain Monte Carlo
8.2.3 Building Blocks in Classifier Sets
8.3 Empirical Demonstration
8.3.1 Representations
8.3.2 Generated Function
8.3.3 Sparse, Noisy Data
8.3.4 Function with Variable Noise
8.3.5 A Slightly More Complex Function
8.4 Summary
9 Towards Reinforcement Learning with LCS
9.1 Problem Definition
9.1.1 Markov Decision Processes
9.1.2 The Value Function, the Action-Value Function and Bellman's Equation
9.1.3 Problem Types
9.1.4 Matrix Notation
9.2 Dynamic Programming and Reinforcement Learning
9.2.1 Dynamic Programming Operators
9.2.2 Value Iteration and Policy Iteration
9.2.3 Approximate Dynamic Programming
9.2.4 Temporal-Difference Learning
9.2.5 SARSA(λ)
9.2.6 Q-Learning
9.3 Reinforcement Learning with LCS
9.3.1 Approximating the Value Function
9.3.2 Bellman's Equation in the LCS Context
9.3.3 Asynchronous Value Iteration with LCS
9.3.4 Q-Learning by Least Mean Squares
9.3.5 Q-Learning by Recursive Least Squares
9.3.6 XCS with Gradient Descent
9.4 Stability Issues
9.4.1 LCS Training on the Structure and the Parameter Level
9.4.2 Stability on the Structure Learning Level
9.4.3 Stability on the Parameter Learning Level
9.5 Long Path Learning
9.5.1 XCS and Long Path Learning
9.5.2 Using the Relative Error
9.5.3 Where it Fails
9.5.4 A Possible Alternative?
9.6 Discussion and Summary
10 Future Work and New Horizons
10.1 The Optimal Set of Classifiers
10.1.1 Determining Optimality
10.1.2 Overlapping Classifiers
10.1.3 Default Hierarchies
10.2 Approximating or Changing the Mixing Model
10.3 An LCS Model for Classification
10.4 Improving Model Structure Search
10.5 Empirical Validation of Optimality Criterion
10.6 Incremental Implementation
10.6.1 Incremental Model Parameter Update
10.6.2 Incremental Model Structure Search
10.7 Matching by Degree
10.8 Advancing in Reinforcement Learning
10.9 Conclusions
11 Summary and Concluding Remarks
11.1 Contributions
11.2 Future Possibilities
11.3 Concluding Remarks
A Notation
B XCS and XCSF
B.1 Classifier Model and Mixing Model
B.2 Model Structure Search
Index
Bibliography

List of Figures

1.1 Two different interpretations for clustering a set of data points into two distinct clusters
4.1 Directed graphical model of the Mixtures-of-Experts model
4.2 Plot of the softmax function for 2 experts
4.3 Directed graphical model of the generalised Mixtures-of-Experts model
4.4 Plot of the generalised softmax function for 2 classifiers and φ(x) = x
4.5 Plot of the generalised softmax function for 2 classifiers and φ(x) = 1
4.6 Directed graphical model for training classifiers independently
5.1 Comparison of linear model training methods on data sampled from N(5, 1)
5.2 Comparison of linear model training methods on modelling part of a sinusoid
6.1 Prediction plots of different mixing models for the Blocks function
6.2 Prediction plots of different mixing models for the Bumps function
6.3 Prediction plots of different mixing models for the Doppler function
6.4 Prediction plots of different mixing models for the Heavisine function
7.1 Directed graphical model of the Bayesian LCS
7.2 Histogram plot of prior densities
8.1 Complexity of the different functions
8.2 Matching by radial basis functions
8.3 Matching by soft intervals
8.4 Classifiers and mixed model for generated function
8.5 Model structure for generated function using GA
8.6 Model structure for generated function using MCMC
8.7 Waterhouse et al. (1996) function and available data
8.8 Model structure for Waterhouse et al. (1996) function using GA
8.9 Model structure for Waterhouse et al. (1996) function using MCMC
8.10 Function with variable noise and available data
8.11 Model structure for function with variable noise using GA
8.12 Model structure for function with variable noise using MCMC
8.13 Noisy sinusoidal function and available data
8.14 Model structure for noisy sinusoidal function using GA
8.15 Model structure for noisy sinusoidal function using MCMC
9.1 Transition graph of 5-step corridor finite state world
9.2 Optimal value function for 15-step corridor finite state world
9.3 Update noise variance for value iteration on 15-step corridor finite state world

List of Tables

5.1 Summary of linear model training methods
6.1 Set of functions used to evaluate performance of mixing models
6.2 Performance-wise comparison of different mixing models
6.3 p-values for Tukey's HSD post-hoc test applied to comparing the performance of different mixing models
7.1 Bayesian LCS model
8.1 System parameters and constants used in the algorithmic description
8.2 Operators and global functions

List of Algorithms

1 ModelProbability(M, X, Y, Φ)
2 TrainClassifier(m_k, X, Y)
3 TrainMixing(M, X, Y, Φ, W, Λ^{-1}, a_τ, b_τ, a_α, b_α)
4 Mixing(M, Φ, V)
5 Responsibilities(X, Y, G, W, Λ^{-1}, a_τ, b_τ)
6 TrainMixWeights(M, X, Y, Φ, W, Λ^{-1}, a_τ, b_τ, V, a_β, b_β)
7 Hessian(Φ, G, a_β, b_β)
8 TrainMixPriors(V, Λ_V^{-1})
9 VarBound(M, X, Y, Φ, θ)
10 VarClBound(X, Y, W_k, Λ_k^{-1}, a_{τk}, b_{τk}, a_{αk}, b_{αk}, r_k)
11 VarMixBound(G, R, V, Λ_V^{-1}, a_β, b_β)
12 Crossover(M_a, M_b)

Chapter 1

Introduction

This thesis will show how taking a model-centred view to reformulating Learning Classifier Systems (LCS), a rule-based method for machine learning, provides an holistic approach to their design, analysis and understanding. The immediate contributions are a new methodology for their design and analysis, a probabilistic model of their structure that reveals their underlying assumptions, a formal definition of when they perform optimally, new approaches to their analysis, and strong links to other machine learning methods that have not been available before. The work opens up the prospects of advances in several areas, such as the development of new LCS implementations that have formal performance guarantees, the derivation of representational properties of the solutions that they aim for, and improved performance.

To introduce the work, let us initially give a short overview of machine learning, its applications and the most common problem types that it is concerned with. An example that follows highlights the difference between ad-hoc and model-centred approaches to designing machine learning algorithms and emphasises the advantages of the latter. This is followed by a short introduction to LCS, their applications and current issues. Thereafter, we introduce our research objective, the approach that we will take to reach this objective, and a short overview of the chapters that are to follow.

1.1 Machine Learning

Machine learning (ML) is a subfield of artificial intelligence (AI) that is concerned with methods and algorithms that allow machines to learn. Rather than explicitly instructing a computer how certain data is to be classified, how entities relate to each other, or which sequence of actions achieves a certain goal, machine learning algorithms allow this knowledge to be inferred from a limited number of observations, or from a description of the task and its goal.

Their use is manifold, including speech and handwriting recognition, object recognition, fraud detection, path planning for robot locomotion, game playing, natural language processing, medical diagnosis, and many more [19, 170]. There is no universal method to handle all of these tasks, but a large set of different approaches exists, each specialised for particular problem classes.

Probably the most distinct differences between the numerous machine learning methods are the type of task that they can handle, the approach that they are designed with, and the assumptions that they are based upon. Considering firstly a set of common machine learning task types, let us then, based on a simple example, introduce two common approaches to how one can develop machine learning algorithms.

1.1.1 Common Machine Learning Tasks

The most common types of tasks that machine learning deals with are:

Supervised Learning. In such tasks a set of input/output pairs is available, and the function between the inputs and the associated outputs is to be learned. Given a new input, the learned relation can be used to predict the corresponding output. An example of a supervised learning task is classification: given several examples of a set of object properties and the type of this object, a supervised learning approach can be taken to find the relation between the properties and the associated type, which subsequently allows us to predict the object type for a set of properties.

Unsupervised Learning. Unsupervised learning is similar to supervised learning, with the difference that no outputs are available. Thus, rather than learning the relationship between inputs and associated outputs, the learner builds a model of the inputs. Consider a clustering task where several examples of the properties of some object are given and we want to group the objects by the similarity of their properties: this is an unsupervised learning task because the given examples only contain the object properties, but not the group assignment of these objects.

Sequential Decision Tasks. Such tasks are characterised by a set of states and a set of actions that can be performed in these states, causing a transition to another state. Each transition is mediated by a scalar reward, and the aim of the learner is to find the action for each state that maximises the reward in the long run. An example of such a task is finding the shortest path to the goal in a labyrinth by assigning each step (that is, each transition) a reward of -1: as the aim is to maximise the reward, the number of steps is minimised. The most common approach to sequential decision tasks is that of dynamic programming and reinforcement learning: to learn the optimal value of a state, which is the expected sum of rewards when always performing the optimal actions from that state, and subsequently to derive the optimal actions from these values.

There exists a wide range of different machine learning methods that deal with each of the problem types. As we are interested in their design, let us consider two possible design approaches to an unsupervised learning task.

1.1.2 Designing an Unsupervised Learning Algorithm

Let us consider the well-known Iris dataset [84] that contains 150 instances, each consisting of four scalar attribute values and a class assignment. Each of the four attributes refers to a particular measure of the physical appearance of the flower. Each instance belongs to one of the three possible classes of the plant.

Assume that we do not know which class each instance belongs to and want to design an algorithm that groups the instances into three classes, based on the similarity of their appearance, which we infer from the similarity of their attribute values. This task is an unsupervised learning task with the inputs given by the attribute values of each instance.

Ad-Hoc Design of an Algorithm

Let us firstly approach the task intuitively by designing an algorithm that aims at grouping the instances such that the similarity of any two instances within the same group or cluster is maximised, and between different clusters is minimised. We measure the similarity of two instances by the inverse squared Euclidean distance¹ between the points that represent these instances in the four-dimensional attribute space spanned by the attribute values.

Starting by randomly assigning each instance to one of the three clusters, we compute the centre of each cluster as the average of the attribute values of all instances assigned to it. To group similar instances into the same cluster, we then re-assign each instance to the cluster to whose centre it is closest, and subsequently re-compute the centres of these clusters. Iterating these two steps causes the distance between instances within the same cluster to be minimised, and between clusters to be maximised. Thus, we have reached our goal. The concept of clustering by using the inverse distance between the data points as a measure of their similarity is illustrated in Figure 1.1(a).

This clustering algorithm is the well-known K-means algorithm, which is guaranteed to converge, but not always to the optimal solution [161, 19]. While it is a functional algorithm, it leaves many questions open: is the squared Euclidean distance indeed the best distance measure to use? What are the implicit assumptions that are made about the data? How should we handle data where the number of classes is unknown? In which cases would the algorithm fail?

¹The squared Euclidean distance between two equally-sized vectors a = (a_1, a_2, ...)^T and b = (b_1, b_2, ...)^T is given by \sum_i (a_i - b_i)^2 and is thus the sum of squared differences between the vectors' elements (see also Section 5.2). Therefore, two instances are considered as being similar if the squared differences between their attribute values are small.
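To make the procedure concrete, the following is a minimal sketch of this clustering scheme (our own illustration, not part of the original text), written in Python with NumPy. It assumes the data is given as an (N, 4) array X of attribute values, one row per instance, and it does not handle the degenerate case of a cluster becoming empty.

    import numpy as np

    def kmeans(X, k=3, n_iter=100, seed=0):
        """Group the rows of X into k clusters by the procedure described above."""
        rng = np.random.default_rng(seed)
        assign = rng.integers(k, size=len(X))       # random initial cluster assignment
        centres = None
        for _ in range(n_iter):
            # Cluster centres: average attribute values of the assigned instances.
            centres = np.array([X[assign == j].mean(axis=0) for j in range(k)])
            # Re-assign each instance to the closest centre (squared Euclidean distance).
            dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
            new_assign = dist.argmin(axis=1)
            if np.array_equal(new_assign, assign):  # assignments stable: converged
                break
            assign = new_assign
        return assign, centres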


Figure 1.1: Two different interpretations for clustering a set of data points into two distinct clusters. The circles and squares are data points that are assigned to different clusters. The dashed circle and square represent the centres of the identified clusters. (a) Identifying clusters by minimising the distance between the data points within a cluster, and reassigning data points to the cluster to whose centre they are closest. The dashed lines indicate the assignment of data points to cluster centres, given by the mean of all data points within the cluster. (b) Interpreting the data points as being generated by Gaussians that are centred on the cluster centres. The two dashed circles around the centres represent the first and the second standard deviation of the generating Gaussian.

Design of an Algorithm by Modelling the Data

Let us approach the same problem from a different perspective: assume that for each Iris class there is a virtual standard instance — something like a prototypical Iris — and that all instances of a class are just noisy instantiations of the standard instance. In other words, assume the attribute values of each instance of a particular class to be generated by sampling from a Gaussian that is centred on the attribute values of the standard instance of this class, where we have modelled the noisy instantiation process by a Gaussian (for an illustration see Figure 1.1(b)). Furthermore, let us assume that each class has generated all instances with a certain probability.

The model we have just described is completely specified by its parameters, which are the centres of the Gaussians and their covariance matrices, and the probability that is assigned to each class. We can train this model by the principle of maximum likelihood by adjusting its parameters such that the probability of having generated all observed instances is maximised; that is, we want to find the model parameters that best explain the data. This can be achieved by using a standard machine learning algorithm known as the expectation-maximisation (EM) algorithm [70]. In fact, assuming that each dimension of each Gaussian is independent and has equal variance in each of the dimensions, the resulting algorithm provides the same results as the K-means algorithm [19]; so why take the effort of specifying a model rather than using K-means directly?
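Before answering that question, a brief illustration of what specifying and training such a model amounts to in practice may be useful. The snippet below is purely ours and uses scikit-learn, which plays no role in this work, to fit the described mixture of spherical Gaussians to the Iris attribute values by expectation-maximisation:

    from sklearn.datasets import load_iris
    from sklearn.mixture import GaussianMixture

    X = load_iris().data                  # 150 instances, 4 attribute values each
    # One spherical Gaussian per assumed class, plus a probability for each class.
    gmm = GaussianMixture(n_components=3, covariance_type='spherical', random_state=0)
    gmm.fit(X)                            # EM: maximise the likelihood of the observed data
    resp = gmm.predict_proba(X)           # per-instance probability of belonging to each class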

Reconsidering the questions that we have posed in the previous section makes the benefit of having a model clear: it makes explicit the assumptions we make about the data. This also allows us to specify when the method is likely to fail, namely when we apply it to data that does not conform to the assumptions that the model makes. Furthermore, in this particular example, instances are not assigned to single clusters, but their probability of belonging to each cluster is given. Also, we can find the best number of clusters by applying techniques from the field of model selection that select the number of clusters that is most suitable to explain the data. Additional advantages are that if Gaussians do not describe the data well, we can easily change them to other distributions and use the same techniques to train the model; and if new training methods for that model type become available, they can be used as a drop-in replacement for the ones that are currently used.
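As a concrete instance of the model selection point above, and continuing the previous snippet, the number of clusters could for example be chosen by the Bayesian Information Criterion; again, this is only an illustration of ours and not a method used later in this work:

    # Fit mixtures with 1 to 6 components and keep the number with the lowest BIC.
    best_k = min(range(1, 7),
                 key=lambda k: GaussianMixture(n_components=k, covariance_type='spherical',
                                               random_state=0).fit(X).bic(X))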

Clearly, due to the many advantages of the model-based approach that the example in this section has demonstrated, it should always be preferred to the ad-hoc approach.

1.2 Learning Classifier Systems

Learning Classifier Systems are a family of machine learning algorithms that are usually designed by the ad-hoc approach. Generally, they can be characterised as handling sequential decision tasks with a rule-based representation and by the use of evolutionary computation methods (for example, [169, 93]), although some variants also perform supervised learning (for example, [162]) or unsupervised learning (for example, [215]), or do not rely on evolutionary computation (for example, [87]).

1.2.1 A Brief Overview

Based on initial ideas by Holland [110, 111, 112] to handle sequential decision tasks and to escape the brittleness of expert systems of that time, LCS initially did not provide the required operational stability that was hoped for [86, 198, 134], until Wilson introduced the simplified versions ZCS [239] and XCS [240], which solved most of the problems of earlier LCS and caused most of the LCS community to concentrate on these two systems and their variants.

Learning Classifier Systems are based on a population of rules (also called classifiers), each formed by a condition/action pair, that compete and cooperate to provide the desired solution. In sequential decision tasks, classifiers whose condition matches the current state are activated and promote their action. One or several of these classifiers are selected, their promoted action is performed, and the received reward is assigned to these classifiers, and additionally propagated to previously active classifiers that also contributed to receiving the current reward. Occasionally, classifiers of low quality are removed from the current population, and new ones are induced, with their condition and action based on current high-quality classifiers. The aim of replacing classifiers is to improve the overall quality of the classifiers in the population.

Different LCS differ in how they select classifiers, in how they distribute the reward, in whether they additionally maintain an internal state, and in how they evaluate the quality of classifiers. The latter is the most significant difference between early LCS, which based the quality of a classifier on the reward that it contributed to receiving, and the currently most popular LCS, XCS [240], which evaluates the quality of a classifier by how accurate it is at predicting its contribution to the reward.

Shifting from strength-based to accuracy-based LCS also allowed them to be directly applied to regression tasks [243, 244], which are supervised learning tasks where the output is of interval scale. That also changed the perspective of how LCS handle sequential decision tasks: they act as function approximators for the value function that maps the states and actions into the long-run reward that can be expected to be received when performing the action in this state, where the value function estimate is updated by reinforcement learning. By replacing classifiers in the population, LCS aim at finding the best representation of this value function [139].

1.2.2 Applications and Current Issues

Learning Classifier Systems are applied in many areas, such as autonomous robotics (for example, [74, 99]), multi-agent systems (for example, [85, 61]), economics (for example, [224, 171, 3]), and even traffic light control [39]. Particularly in classification tasks, which are supervised learning tasks where the output is of nominal scale, their performance has been found to be competitive with other state-of-the-art machine learning algorithms [97, 153, 8].

Nonetheless, even modern LCS are not free of problems, the most significant being the following:

• Even though initially designed for such tasks, LCS are still not particularly successful in handling sequential decision tasks [11, 12]. This is unfortunate, as "there is a lot of commonality in perspective between the RL community and the LCS community" and more communication between the two communities would be welcome [150].

• Most LCS feature a high number of system parameters, and while the effect of some of them is ill-understood, setting others requires specialised knowledge of the system. XCS, for example, has 20 partially interacting system parameters [57].

• No LCS features any formal performance guarantees, and even though such guarantees might not always seem particularly important in applications, the choice between a method with such guarantees and an equally powerful method without them will usually fall on the one that provides them.

• There is no knowledge about the assumptions made about the data, and as a result there is also hardly any knowledge about when some LCS might fail.

• Very few direct links between LCS and other machine learning methods are established, which makes the transfer of knowledge for mutual gain hard, if not impossible.

• The general lack of rigour in the design of LCS leads to a lack of their acceptance in the field of machine learning. Together with the previous point this inhibits the exchange of ideas between possibly closely related methods.

These problems concern both practitioners and theoreticians, and solving them should be a top priority in LCS research. Many of them are caused by designing LCS by an ad-hoc approach, with all the disadvantages that we have described before. This was justified when insufficient links were drawn between LCS and other approaches, and in particular when the formalisms were insufficiently developed within other machine learning methods, but now such a position is difficult to argue for.

1.3 About this Work

This work arises from the lack of theoretical understanding of LCS, and from the missing formality in their development. Its objective is to develop a formal framework for LCS that lets us design, analyse, and interpret LCS. In that process we focus on related machine learning approaches and techniques to gain from their understanding and their relation to LCS.

The immediate aim of this work is not to develop a new LCS. Rather it is to give a different perspective on LCS, to increase the understanding and performance of current LCS, and to lay the foundations for a more formal approach to developing new LCS. Although we initially concentrate exclusively on regression, the resulting framework also forms the basis for sequential decision tasks, as shown in Chapter 9, and only requires small modifications to be applied to classification (see Section 10.3).

1.3.1 The Initial Approach

Our initial approach was to concentrate on an LCS structure similar to XCSF [243] and to split it conceptually into its function approximation, reinforcement learning and classifier replacement components. Each of these was to be analysed separately but with subsequent integration in mind, and resulted in [77, 82, 156] for the function approximation component and [78, 79, 80] for the reinforcement learning component.

In the analysis of these components we have pragmatically and successfully followed a goal-centred approach: firstly we have defined formally what is to be learned, and then have concentrated on how methods from machine learning can be applied to reach that goal. The algorithms resulting from this approach are equivalent to or improve upon the ones in XCSF, with the additional gain of having a goal definition, a derivation of the method from first principles, and a strong link to associated machine learning methods from which we can borrow their theoretical analysis.

When concentrating on classifier replacement, however, taking this approach was hindered by the lack of a formal definition of what set of classifiers the process of classifier replacement should aim at. Even though some studies aimed at defining the optimal set for limited classifier representations [131, 134, 136], there was still no general definition available. But without having a formally expressible definition of the goal it was impossible to define a method that reaches it.

1.3.2 Taking a Model-Centred View

The definition of the optimal set of classifiers is at the core of LCS: given a certain problem, most LCS aim at finding the set of classifiers that provides the most compact competent solution to the problem.

Fortunately, taking the model-centred view to finding such a definition simplifies the task significantly: a set of classifiers can be interpreted as a model for the data. With such a perspective, the aim of finding the best set of classifiers becomes that of finding the model that explains the data best. This is the core problem of the field of model selection, and many methods have been developed to handle it, such as structural risk minimisation (SRM) [221], minimum description length (MDL) [100], or Bayesian model selection [160].

The advantage of taking the model-centred approach is not only to be able to provide a formal definition for the optimal classifier set. It also reveals the assumptions made about the data, and hence gives us hints about the cases in which the method might exceed the performance of other related methods. Also, the model is independent of the method used to train it, and therefore we can choose amongst several to perform this task and also acquire their performance guarantees. Furthermore, it makes LCS directly comparable to other machine learning methods that explicitly identify their underlying model.

To define the model underlying a set of classifiers we have borrowed the probabilistic formulation of the related Mixtures-of-Experts model [121, 122] and extended it such that it can describe such a set. This process was simplified by having already analysed the function approximation and reinforcement learning components, which allowed the integration of related LCS concepts into the description of the model. In fact, with the resulting model we were able to express both function approximation and reinforcement learning, which makes the model-centred approach for LCS holistic — it integrates function approximation, reinforcement learning and classifier replacement.

1.3.3 Summarising the Approach

In summary, the approach we take is the following: firstly, we give a formal description of the problem types we are interested in, and formulate a probabilistic model that describes a set of classifiers. We continue by describing how such a model can be trained by methods from adaptive filter theory [106] and statistical machine learning [19, 167], given some data.

To define the optimal classifier set we use Bayesian model selection [19, 120], which requires a Bayesian LCS model. We get this model by extending the probabilistic LCS model to include prior information. By applying variational Bayesian inference and introducing two methods of searching the space of classifier sets, we introduce a method that allows us to demonstrate the viability of our optimality criterion, as our preliminary results in [81] have already shown.

As handling sequential decision tasks requires the merger of our LCS model with methods from reinforcement learning, we suggest how such a combination can be derived from first principles. One of the major issues of such combinations is their algorithmic stability, and so we discuss how this can be analysed. In addition, we provide some new insight into tasks which require learning of long action sequences — that is, tasks in which XCS is known to struggle [11, 12].

1.3.4 Contributions

The main contributions of this work are a new methodology for the design and analysis of LCS, a probabilistic model of their structure that reveals their underlying assumptions, a formal definition of when they perform optimally, new approaches to their analysis, and strong links to other machine learning methods that have not been available before.

The methodology is based on taking the model-centred approach to describing the model underlying LCS, and applying standard machine learning methods to train it. It supports the development of new LCS by modifying their model and adjusting the training methods such that they conform to the new model structure. Thus, the introduced approach, if widely adopted, will ensure a formal as well as empirical comparability between approaches. In that sense, it defines a reusable framework for the development of LCS.

A more detailed discussion of the contributions can be found in Chapter 11.

1.4 How to Read this Thesis

Many concepts that are frequently used in this work are introduced throughout the text whenever they are required. Therefore, this work is best read sequentially, in the order that the chapters are presented. However, this might not be an option for all readers, and so we will emphasise some chapters that might be of particular interest for people with a background in LCS and/or ML.

Anyone new to both LCS and ML might want to first do some introductory reading on LCS (for example, [42, 134]) and ML (for example, [19, 103]) before reading this work from cover to cover. LCS workers who are particularly interested in our definition of the optimal set of classifiers should concentrate on Chapters 3 and 4 for the LCS model, Chapter 7 for its Bayesian formulation and the optimality criterion, and Chapter 8 for its application. Those who want to know how the introduced model relates to currently used LCS should read Chapters 3 and 4 for the definition of the model, Chapters 5 and 6 for training the classifiers and how they are combined, and Chapter 9 for reinforcement learning with LCS. People who know ML and are most interested in the LCS model itself should concentrate on the second half of Chapter 3, Chapter 4, and Chapter 7 for its Bayesian formulation.

1.4.1 Chapter Overview

Chapter 2 gives an overview of the initial LCS idea, the general LCS framework, and the problems of early LCS. It also describes how the role of classifiers changed with the introduction of XCS, and how this influences the structure of the LCS model. As our objective is also to advance the theoretical understanding of LCS, the chapter gives a brief introduction to previous attempts that analyse the inner workings of LCS and compares them with the approach that we have chosen to take.

Chapter 3 begins with a formal definition of the problem types, interleaved with what it means to build a model to handle these problems. It then gives a high-level overview of the LCS model by characterising it as a parametric ML model, continuing by discussing how such a model can be trained, and relating it back to the initial LCS idea.

Chapter 4 concentrates on formulating a probabilistic basis for the LCS model by first introducing the Mixture-of-Experts model after [122], and subsequently modifying it such that it can describe a set of classifiers in LCS. Certain training issues are resolved by training the classifiers independently. The consequences of this independent training and its relation to current LCS are discussed at the end of this chapter.

Chapter 5 is concerned with the training of a single classifier, either when all data is available at once, or when it is acquired incrementally. For both cases we define what it means for a classifier to perform optimally, based on training the LCS model with respect to the principle of maximum likelihood, and introduce methods from adaptive filter theory to handle its training. We concentrate on gradient-based methods and methods that directly track the optimum, and derive a new incremental approach to track the variance estimate of the classifier model. That this approach outperforms currently used methods, and that gradient-based methods might suffer from poor performance, is shown empirically. The content of this chapter is strongly related, but not equivalent, to our work in [77].

Chapter 6 shows how the local models of several classifiers can be combined into a global model, based on maximum likelihood training of the LCS model from Chapter 4. As the approach turns out to be computationally expensive, we additionally introduce a set of heuristics that are shown to feature competitive performance in a set of experiments. How the content of this chapter differs from our previous work in [82] is also discussed.

Chapter 7 deals with the core question of LCS: what is the best set of classifiers for a given problem? Relating this question to model selection, we introduce a Bayesian LCS model for use within Bayesian model selection. The model is based on the one elaborated in Chapter 4, but is again discussed in detail with special emphasis on the assumptions that are made about the data. To provide an approach to evaluate the optimality criterion, the second half of this chapter is concerned with deriving an analytical solution to the Bayesian model selection criterion by the use of variational Bayesian inference. Throughout this derivation, obvious similarities to the methods used in Chapters 5 and 6 are highlighted.

Chapter 8 describes two simple prototype algorithms for using the optimality criterion to find the optimal set of classifiers, one based on Markov Chain Monte Carlo (MCMC) methods, and the other based on GAs. Their core is formed by evaluating the quality of a set of classifiers, for which we give a detailed algorithmic description based on the variational Bayesian inference approach from Chapter 7. Based on these algorithms, the viability of the optimality criterion is demonstrated on a set of regression tasks that highlight some of its features and how they relate to current LCS.

Chapter 9 returns to the treatment of sequential decision tasks after having exclusively dealt with regression tasks in Chapters 4 to 8. It firstly gives a formal definition of these tasks and their goal, together with an introduction to methods from dynamic programming and reinforcement learning. Then, the exact role of LCS in handling such tasks is defined, and a possible method is partially derived from first principles. This derivation clarifies some of the current issues of how to correctly perform RL with XCS(F), which are discussed in more detail. Based on the LCS model, we also show how the stability of LCS with RL can be studied, and shed some new light on the issues of learning long action sequences in XCS.

Chapter 10 is fully devoted to discussing the wide range of future work that our approach has made possible, together with the new routes of research that it has opened up. From the practical side we concentrate on how new LCS can be implemented, based on the classifier set optimality criterion introduced in Chapter 7, and how this criterion can be further empirically validated. From the theoretical point of view we describe how the optimality criterion can give us more insight into the properties of an optimal set of classifiers, and how our discussion about stability in Chapter 9 might allow us to approach the question of whether an LCS implementation converges to the desired solution.

Chapter 11 finally summarises the work, points out its contributions, and puts it into the perspective of our initial objective.


Chapter 2

Background

To give the reader a perspective on what exactly characterises LCS, and to what level they are theoretically understood, we give in this chapter some background on the initial ideas behind designing LCS, and describe what we can learn from their development over the years and the existing theoretical descriptions. As an example of a current LCS we will concentrate on XCS [240] — not only because it is currently the most used and best understood LCS, but also because it is similar in structure to how we will design our LCS model. Therefore, when discussing the theoretical understanding of LCS we will also put a special emphasis on XCS and its variants, in addition to describing general approaches that have been used to analyse LCS.

Even though our work borrows numerous concepts and methods from statistical machine learning, we will not describe them and their background in this chapter, as this would cause us to deviate too much from our main topic of interest. However, whenever using new concepts and applying new methods we give a short discussion about their background throughout the text. A more thorough description of the methods used in this work can be found in [17, 19, 103, 106, 166, 167], of which we particularly recommend [17, 19].

In general, LCS describe a very flexible framework that differs from other machine learning methods in its generality. It can potentially handle a large number of different problem types and can do so by using a wide range of different representations. In particular, LCS have the potential of handling the complex problem class of POMDPs (as described below) that even the currently most powerful machine learning algorithms still struggle with. Another appealing feature is the possible use of human-readable representations that simplify the inspection of found solutions without the requirement of converting them into a different format. Their flexibility comes from the use of evolutionary computation techniques to search for adequate substructures of potential solutions. In combination, this makes LCS an interesting target for theoretical investigation, in particular to promote a more principled approach to their design.

We begin this chapter by giving a general overview of the problems that were the prime motivator for the development of LCS. This is followed by a review of the ideas behind LCS, describing the motivation and structure of Holland's first LCS, the CS-1 [117]. Many of the LCS that followed had a similar structure, so instead of describing them in detail we focus on some of the problems that they struggled with in Section 2.2.5. With the introduction of XCS [240] many of these problems disappeared and the role of the classifier within the population was redefined, as discussed in Section 2.3. However, as our theoretical understanding even of XCS is still insufficient, and as we aim at advancing this understanding with our work, we provide an overview of significant theoretical approaches to LCS in Section 2.4, before putting our approach into the general LCS context in Section 2.5.

2.1 A General Problem Description

Consider an agent that interacts with an environment. At each discrete time step the environment is in a particular hidden state that is not observable by the agent. Instead, the agent senses the observable state of the environment, which is stochastically determined by the hidden state. Based on this observed state, the agent performs an action that changes the hidden state of the environment and consequently also the observable state. The hidden state transitions conform to the Markov property, such that the current hidden state only depends on the previous hidden state and the performed action. For each such state transition the agent receives a scalar reward or payoff that can depend on the previous hidden and observable state and the chosen action. The aim of the agent is to learn which actions to perform in each observed state (called the policy) such that the received reward is maximised in the long run.

Such a task definition is known as a Partially Observable Markov Decision Process (POMDP) [123]. It is able to describe a large number of seemingly different problem types. Consider, for example, a rat that needs to find the location of food in a maze: in this case the rat is the agent and the maze is the environment, and a reward of -1 is given for each movement that the rat performs until the food is found, which leads the rat to minimise the number of required movements to reach the food. A game of chess can also be described by a POMDP, where the white player becomes the agent, and the black player and the chess board define the environment. Further examples include path planning, robot control, stock market prediction, and network routing.
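For reference, the ingredients just listed can be summarised formally as follows; the notation is ours and not necessarily the one used in later chapters. Writing s_t for the hidden state, o_t for the observable state, a_t for the action and r_{t+1} for the reward at time t, a POMDP is specified by the sets of hidden states, observations and actions together with

    p(s_{t+1} \mid s_t, a_t), \qquad p(o_t \mid s_t), \qquad r_{t+1} = r(s_t, o_t, a_t),

where the first distribution expresses the Markov property of the hidden state transitions, the second the stochastic relation between hidden and observable state, and the agent has to choose its actions based on the observations alone.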

While the POMDP framework allows the specification of complex tasks, finding their solution is equally complicated. Thus, most of the recent work in LCS has focused on a special case of POMDP problems that treats the hidden and observable states of the environment as equivalent. Such problems are known as Markov Decision Processes (MDPs) and are dealt with in more detail in Chapter 9. They are approached by LCS by the use of reinforcement learning, which is centred on learning the expected sum of rewards for each state when following the optimal policy. Thus, the intermediate aim is to learn a value function that maps the states into their respective expected sum of rewards, which is a univariate regression problem.
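In the standard discounted formulation (again in our notation; the thesis introduces its own in Chapter 9), the value V*(s) of a state s is the expected sum of future rewards when acting optimally from s onwards, and it satisfies Bellman's equation

    V^*(s) = \max_a \, \mathbb{E}\left[ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \right],

with discount factor 0 ≤ γ ≤ 1. Estimating V* from observed states and rewards is the univariate regression problem referred to above.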

Even though the ultimate aim of LCS is to handle POMDPs, this work focusses on an intermediate step, which is to perform univariate and multivariate regression with LCS, and discusses how the same approach can lead to LCS that are specialised for classification tasks. In addition, a separate chapter describes how the same approach can potentially be extended to handle MDPs, and which additional considerations need to be made. Nonetheless, when introducing LCS we still consider their original motivation, which is to deal with POMDPs.

2.2 Early Learning Classifier Systems

The primary problems that LCS were designed to handle are sequential decision tasks that are defined by POMDPs, as described above. In LCS it is assumed that each observed state is a composite element that is identified by the collection of its features, such that the agent is able to associate the choice of action with certain features of the state. This allows the agent to generalise over certain features and possibly also over certain states when defining its choice of action for each of the states.

2.2.1 Initial Idea

Although some of Holland's earlier work [110, 111, 112] had already introduced some ideas for LCS, a more specific framework was finally defined in [115]. The motivation was to escape the brittleness of popular expert systems of that time by evolving a set of cooperative and competing rules in a market-inspired economy. In particular, Holland addressed the following three problems [116]:

Parallelism and coordination. Complex situations are to be decomposed into simpler building blocks, called rules, that handle this situation coopera- tively. The problem is to provide for the interaction and coordination of a large number of rules that are active simultaneously.

Credit assignment. To decide which rules in a rule-based system are responsible for its success, one needs to have a mechanism which accredits each rule with its responsibility for that success. Such mechanisms become particularly complex when rules act collectively, simultaneously and sequentially. Furthermore, complex problems do not allow for exhaustive search over all possible rule combinations, and so this mechanism has to operate locally rather than globally.

Rule discovery. Only in toy problems can one evaluate all possible rules ex- haustively. Real-world problems require the search for better rules based on current knowledge to generate plausible hypotheses about situations that are currently poorly understood.

Holland addressed these questions by proposing a rule-based system that can be viewed as a message processing system acting on a current set of messages, either internal or generated by a set of detectors to the environment and thus representing the environment’s observable state. Credit assignment is handled by a market-like situation with bidders, suppliers and brokers. Rule discovery is driven by an evolutionary computation-based process that discovers and recombines building blocks of previously successful rules.

While we do not aim to replicate the original framework in full detail, the fol- lowing section gives an overview of the most common features among some of the LCS implementations derived from this framework. For a detailed overview and comparison of different early LCS, see [10, Ch. 2].

2.2.2 The General Framework

In LCS the agent’s behaviour is determined by a set of classifiers (Holland’s rules), each consisting of at least one condition and an action. On sensing the state of the environment through a detector, the sensor reading of the agent is injected as a message into an internal message list, containing both internal and external messages. Classifier conditions are then tested for matching any of the messages on the message list. The matching classifiers are activated, promoting their actions by putting their messages on the message list. The messages on the list can either be interpreted to perform actions or be kept on the list to act as input for the next cycle. If several actions are promoted at the same time, a conflict resolution subsystem decides which action to perform. Once this is completed, the cycle starts again by sensing the new state of the environment.

All of the messages are usually encoded using binary strings. Hence, to allow matching of messages by classifier conditions, we are required to encode conditions and actions of classifiers as binary strings as well. However, to allow for a classifier to generalise over several different input messages, the string representing its condition can contain the don’t care symbol “#” that matches both 1’s and 0’s in the corresponding position of the input message. Similarly, actions of the same length as classifier conditions can also contain the “#” symbol (in that case called pass-through), which implies that specific bits

of the matching message are passed through to the action, allowing a single classifier to perform different actions depending on the input message. The latter feature of generalisation in the classifier actions is much less frequently used than generalisation in the classifier condition.
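To make the ternary matching and pass-through described above concrete, here is a minimal Python sketch; the helper names are our own and are not taken from any particular LCS implementation:

```python
def matches(condition: str, message: str) -> bool:
    """True if every position of the condition equals the message bit or is '#'."""
    return len(condition) == len(message) and all(
        c == m or c == '#' for c, m in zip(condition, message))

def resolve_action(action: str, message: str) -> str:
    """Replace pass-through '#' symbols in the action by the matched message bits."""
    return ''.join(m if a == '#' else a for a, m in zip(action, message))

# The condition '1#0' generalises over its second bit.
assert matches('1#0', '110') and matches('1#0', '100')
assert not matches('1#0', '111')
# With pass-through, one classifier can perform different actions per message.
assert resolve_action('0#1', '110') == '011'
assert resolve_action('0#1', '100') == '001'
```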

The description above covers how the agent decides which actions to perform (called the performance subsystem) but does not explain how such an agent can react to external reward to optimise its behaviour in a given environment. Generally, the behaviour is determined by the population of classifiers and the conflict resolution subsystem. Hence, considering that the functionality of the conflict resolution subsystem is determined by properties of the classifiers, learning can be achieved by evaluating the quality of each classifier and aiming at a population that only contains classifiers of high quality. This is achieved by a combination of the credit allocation subsystem and the rule induction sub- system. The role of the former is to distribute externally received reward to classifiers that promoted the actions responsible for receiving this reward. The latter system creates new rules based on classifiers with high credit to promote the ones that are assumed to be of good quality.

2.2.3 Interacting Subsystems

To summarise, LCS aim at maximising external reward by an interaction of the following subsystems:

Performance Subsystem. This subsystem is responsible for reading the input message, activating the classifiers based on their condition matching any message in the message list, and performing actions that are promoted by messages that are posted by the active classifiers.

Conflict Resolution Subsystem. If the classifiers promote several conflicting actions, this subsystem decides for one action, based upon the quality rating of the classifiers that promote these actions.

Credit Allocation Subsystem. On receiving external reward, this subsystem decides how this reward is credited to the classifiers that promoted the actions causing the reward to be given.

Rule Induction Subsystem. This subsystem creates new classifiers based on current high-quality classifiers in the population. As the population size is usually limited, introducing new classifiers into the population requires the deletion of other classifiers from the population, which is an additional task of this subsystem.

Although the exact functionality of each of the subsystems was given in the original paper [115], further developments introduced changes to the operation of some subsystems, which is why we only give a general description here. In Section 2.2.5 we discuss some properties of these LCS, and point out the major problems that led the way to a new class of LCS that feature major performance improvements.

2.2.4 The Genetic Algorithm in LCS

Holland initially introduced Learning Classifier Systems as an extension of Genetic Algorithms to Machine Learning. GA’s are a class of algorithms that are based on the principles of evolutionary biology, driven by mutation, selec- tion and recombination. In principle, a population of candidate solutions is evolved and, by allowing more reproductive opportunities to fitter solutions, the whole population is pushed towards higher fitness. Although GA’s were initially applied as function optimisers (for example [93]), Holland’s idea was to adapt them to act as the search process in Machine Learning, giving rise to LCS.

In an LCS, the GA operates as the core of the rule induction subsystem, aiming at replicating classifiers of higher fitness to increase the quality of the whole population. New classifiers are created by selecting classifiers of high quality from the population, performing cross-over of their conditions and actions and mutating their offspring. The offspring is then reintroduced into the popula- tion, eventually causing deletion of lower quality classifiers due to bounded population size. Together with the credit allocation subsystem, which is re- sponsible for rating the quality of the classifiers, this process was intended to generate a set of classifiers that promote optimal behaviour in a given environ- ment.
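To illustrate how such a rule induction step might look, the sketch below is a hedged, minimal Python example with hypothetical parameter values: fitness-proportionate selection of two parents, one-point crossover of their conditions, and per-symbol mutation over the ternary alphabet. Credit allocation, which would supply the fitness values, is assumed to have happened elsewhere.

```python
import random

ALPHABET = ('0', '1', '#')

def select(population, fitness):
    # Fitness-proportionate (roulette-wheel) selection of one condition string.
    total = sum(fitness)
    r, acc = random.uniform(0, total), 0.0
    for cond, f in zip(population, fitness):
        acc += f
        if acc >= r:
            return cond
    return population[-1]

def crossover(a, b):
    # One-point crossover of two conditions of equal length.
    point = random.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(cond, rate=0.05):
    # Each symbol is replaced by a random ternary symbol with probability `rate`.
    return ''.join(random.choice(ALPHABET) if random.random() < rate else c
                   for c in cond)

population = ['1#0', '10#', '#00', '110']
fitness = [4.0, 2.5, 1.0, 3.0]
parents = select(population, fitness), select(population, fitness)
offspring = [mutate(c) for c in crossover(*parents)]
print(offspring)  # reintroduced into the population, displacing weak classifiers
```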

2.2.5 The Problems of Early LCS

In most earlier classifier systems1 each classifier in the population had an as- sociated scalar strength. This strength was assigned by the credit allocation subsystem and acted as the fitness and hence quality rating of the classifier.

On receiving external reward, this reward contributed to the strength of all classifiers that promoted the action leading to that reward. Learning immedi- ate reward alone is not sufficient, as sequential decision tasks might require a sequence of actions before any reward is received. Thus, reward needs to be propagated back to all classifiers in the action sequence that caused this reward to be received. The most popular scheme to perform this credit allocation was the Implicit Bucket Brigade [113, 188, 189].

Even though this scheme worked fairly well, performance in more complicated tasks was still not satisfactory. According to Kovacs [134, 133], the main problem was the use of classifier strength as its reproductive fitness. This causes only high-reward classifiers to be maintained, and thus the information about low-rewarding areas of the environment is lost, and with it the knowledge about whether the performed actions are indeed optimal. A related problem is that if the credit assignment is discounted, that is, if classifiers that are far away from the rewarding states receive less credit for causing this reward, then such classifiers have a lower fitness and are more likely to be removed, causing sub-optimal action selection in areas distant to rewarding states. Most fundamental, however, is the problem that if the classifier strength is not shared between the classifiers, then environments with layered payoff will lead to the emergence of classifiers that match a large number of states, despite them not promoting the best action in all of those states. Examples of such environments are the ones that describe sequential decision tasks. It needs to be pointed out that Kovacs does not consider fitness sharing in his investigations, and that according to Bull and Hurst [34] optimal performance can be achieved even with strength-based fitness as long as fitness sharing is used, but “[...] suitable system parameters must be identified for a given problem”, and how to do this remains open to further investigation.

It has also been shown by Forrest and Miller [86] that the stochastic selection

1See [10, Ch. 2] for a description and discussion of earlier LCS

of matching classifiers can lead to instabilities in any LCS that after each performed action reduces the strength of all classifiers by a life tax and has a small message list such that not all active classifiers can post their messages at once. In addition to these problems, Smith [198] investigated the emergence of parasitic classifiers that do not directly contribute to action selection but gain from the successful performance of other classifiers in certain LCS types with internal message lists.

Even though various taxation techniques, fitness sharing [34], and other meth- ods have been developed to overcome the problems of overly general and par- asitic classifiers, LCS still did not feature satisfactory performance in more complex tasks. A more drastic change was required.

2.3 The LCS Renaissance

Before introducing XCS, Wilson developed ZCS [239] as a minimalist classifier system that aimed, through its reductionist approach, to provide a better understanding of the underlying mechanisms. ZCS still bases classifier fitness on strength by using a version of the implicit bucket brigade for credit assignment, but utilises fitness sharing to penalise overly general classifiers.

Only a year after having published ZCS, Wilson introduced his XCS [240], which significantly influenced future LCS research. Its distinguishing feature is that the fitness of a classifier is no longer its strength, but its accuracy in predicting the expected reward2. Consequently, XCS does maintain information about low-rewarding areas of the environment and penalises classifiers that match overly large areas, as their reward prediction becomes inaccurate. By using a niche GA that restricts the reproduction of classifiers to those that match the currently observed state and promote the performed action, and by removing classifiers independently of their matching, XCS prefers classifiers that match more states as

2Using measures different than strength for fitness was already suggested before but was never implemented in the form of pure accuracy. Even in the first LCS paper, Holland sug- gested that fitness should be based not only on the reward but also on the consistency of the prediction [112], which was implemented in [117]. Later, however, Holland focused purely on strength-based fitness [240]. A further LCS that uses some accuracy-like fitness measure is Booker’s GOFER-1 [21].

long as they are still accurate, thus aiming towards optimally general classifiers3. More information about Wilson’s motivation for the development, and an in-depth description of its functionality, can be found in [134]. A short introduction to XCS from the model-based perspective is given in Appendix B.

After its introduction, XCS was frequently modified and extended, and its the- oretical properties and exact working analysed. This makes it, up until the time of this writing, the most used and best analysed LCS available. These modifications also enhanced the intuitive understanding of the role of the clas- sifiers within the system, and as the LCS model we propose borrows much of its design and intuition from XCS, we will in the following sections give fur- ther background on the role of a classifier in XCS and its extensions. We will only consider single-step tasks where a reward is received after each action, and postpone the description of multi-step tasks until Chapter 9.

2.3.1 Computing the Prediction

Initially, each classifier in XCS only provided a single prediction for all states that it matches, independent of the nature of these states [240, 241, 242]. In XCSF [243, 244], this was extended such that each classifier represents a straight line and thus is able to vary its prediction over the states that it matches, based on the numerical value of the state. This concept was soon picked up by other researchers and was quickly extended to higher-order poly- nomials [142, 143, 144], to the use of neural networks to compute the prediction [35, 177, 178, 157], and even Support Vector Machines (SVMs) [158].

What became clear was that each classifier approximates the function that is formed by a mapping from the value of the states to their associated payoffs, over the states that it matches [244]. In other words, each classifier provides a localised model of that function, where the localisation is determined by the condition and action of the classifier — even in the initial XCS, where the model is provided by a simple averaging over the payoff of all matched states [77].
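As a hedged illustration of this view (a minimal sketch, not XCSF's actual update rules), the following Python snippet shows a classifier with a one-dimensional interval condition whose local model is a straight line over the states it matches:

```python
import numpy as np

class LinearClassifier:
    """A classifier whose local model is a straight line over the states it matches."""
    def __init__(self, lower, upper, weights):
        self.lower, self.upper = lower, upper      # interval condition [lower, upper]
        self.w = np.asarray(weights, dtype=float)  # (bias, slope) of the local model

    def matches(self, x: float) -> bool:
        return self.lower <= x <= self.upper

    def predict(self, x: float) -> float:
        # The prediction varies with the state value instead of being a constant.
        return float(self.w @ np.array([1.0, x]))

cl = LinearClassifier(0.0, 0.5, weights=[0.2, 1.5])
if cl.matches(0.3):
    print(cl.predict(0.3))  # 0.2 + 1.5 * 0.3 = 0.65
```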

3Wilson in [240] calls optimally general classifiers maximally general, which could lead to the misinterpretation that these classifiers match all states.

2.3.2 Localisation and Representation

Similar progress was made in how the condition of a classifier can be represented: while XCS initially used ternary strings for that task [240, 241], the representational repertoire was soon increased to real-numbered interval representations to handle real-valued states [242], as a prerequisite to function approximation with computed predictions [243, 244]. Amongst other representations used with XCS(F) to determine the matching of a classifier are now hyper-ellipsoids [41, 41], neural networks [38], S-expressions [145], and convex hulls [148]. Fuzzy classifier representations [60] additionally introduce matching by degree which — despite a different approach to their design — makes them very similar to the model that is presented here.

The possibility of using arbitrary representations in XCS(F) to determine the matching of a classifier was highlighted in [244]. In fact, classifiers that model the payoff for a particular set of states and a single action can conceptually be seen as performing matching in the space of states and actions, as they only model the payoff if their condition matches the state and their action is the one that is performed. Similarly, classifiers without actions, such as the ones used in [243, 244] for function approximation, perform matching in the space of states alone.

2.3.3 Classifiers as Localised Maps from Input to Output

To summarise, classifiers in XCS are localised models of the function that maps the value of the states to their associated payoffs. The localisation is deter- mined by the condition/action pair that specifies which states and which ac- tions of the environment are matched.

When LCS are applied to regression tasks we prefer to follow standard ter- minology and call the state/action pair the input and the associated payoff the output, as already done in [243]. Thus, the localised model of a classifier provides a mapping from the input to the output, and its localisation is deter- mined by the input alone.

We can map sequential decision tasks onto the same concept by specifying an input by the state/action pair, and its associated output by the payoff. Similarly, in classification tasks the input is given by the attributes, and the output is the class label, as used in UCS [162], which is a variant of XCS specialised for classification tasks. Therefore, the concept of classifiers providing a localised model that maps inputs to outputs generalises over all LCS tasks, which we will exploit when developing the LCS model.

2.3.4 Recovering the Global Prediction

Several classifiers can match the same input but each might provide a different prediction for its output. To get a single output prediction for each input, the classifiers’ output predictions need to be combined, and in XCS and all its variants this is done by a weighted average of these predictions, with weights proportional to the fitness of the associated classifiers [240, 241].

The component responsible for combining the classifier predictions in XCS and LCS has mostly been ignored, until we showed in [82] that combining the classifier predictions in proportion to the inverse variance of the classifier models gives a lower prediction error than when using the inverse fitness. At the same time, Brown, Kovacs and Marshall have demonstrated that the same component can be improved in UCS by borrowing concepts from ensemble learning [29].
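A minimal, hedged sketch (arbitrary numbers, not tied to any specific implementation) of such a weighted combination, contrasting fitness-proportional weights with inverse-variance weights:

```python
import numpy as np

def mix_predictions(predictions, weights):
    """Weighted average of the matching classifiers' predictions."""
    predictions, weights = np.asarray(predictions), np.asarray(weights)
    return float(weights @ predictions / weights.sum())

preds     = [0.8, 1.1, 0.9]     # predictions of the classifiers matching one input
fitness   = [10.0, 2.0, 5.0]    # XCS-style weighting: proportional to fitness
variances = [0.01, 0.20, 0.05]  # alternative: proportional to inverse model variance

print(mix_predictions(preds, fitness))                       # fitness-weighted
print(mix_predictions(preds, [1.0 / v for v in variances]))  # inverse-variance-weighted
```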

Even though rarely discussed, we consider the necessity of combining the clas- sifier predictions as important as having classifiers provide localised models, as will become apparent when the LCS model is introduced.

2.3.5 Michigan-style vs. Pittsburgh-style LCS

In Michigan-style LCS all classifiers within a population cooperate to collec- tively provide a solution. Examples are the first LCS, Cognitive System 1 (CS- 1) [117], SCS [93], ZCS [239] and XCS [240]. In the less common Pittsburgh- style LCS several sets of classifiers compete against each other to provide a

solution with a single fitness value for the set, with examples for such systems given by LS-1 [200, 201, 202], GALE [152] and CCS [154, 155].

Even though “Michigan and Pittsburgh systems are really quite different ap- proaches to learning [. . . ]” [134], they share the common goal of finding sets of classifiers that provide a solution to the task at hand. Consequently, we assert that their classifier populations can be represented by the same LCS model, but their way of improving that model is different.

In developing the LCS model we do not distinguish between the two styles, not even when defining the optimal set of classifiers in Chapter 7, in order to emphasise that they are just two different implementations that have the same goal. We will distinguish between them as soon as we start discussing implementational details in Chapter 8, but attempt to build a bridge between the two styles in Chapter 10.

2.4 Existing Theory

As, with the creation of a model for LCS, we also aim at advancing the theoretical understanding of LCS in general, let us review some previous theoretical work on LCS. We start with theoretical approaches that consider all LCS subsystems at once, then concentrate on the GA in LCS, followed by a discussion of approaches that have analysed the function approximation and RL side of LCS.

2.4.1 The Holistic View

The first and currently only LCS model that allows studying the interaction with the environment and generalisation in the same model was developed by Holland just after the introduction of the LCS framework [114].

He describes the set of states that the system can take by combining all possible environmental states and internal states of the LCS, and defines a transition

matrix that describes the Markov chain probabilities of transiting from one system state to another. Thus, changes in the environment and the LCS are tracked simultaneously.

Environmental similarities are exploited in the model by partitioning the Markov matrix into equivalence classes to get a sub-Markov matrix that col- lapses similar states into one. From this, reset times, upper bounds on expected experiment repetition times and other properties can be derived.

The model was created before the emergence of modern RL4 and so cannot refer to its theoretical advances, and was not updated to reflect those. Additionally, the inclusion of the LCS state into the model causes the number of states to be uncountable due to the real-valued parameterisation of LCS. Thus, it is unclear whether the model will provide significant advances in the understanding of LCS. We rather propose to rely on RL theory to study the performance of LCS in sequential decision tasks, as discussed in Chapters 9 and 10.

2.4.2 Approaches from the Genetic Algorithm Side

As many researchers consider LCS as Genetic-based Machine Learners (GBML), they are most frequently analysed from the GA perspective. Particularly when considering single-step problems, where each action is immediately mediated by a reward, the task is a regression task and does not require an RL component. Due to its similarity to our LCS model, we will mainly consider the analyses performed on XCS. Note, however, that none of these analyses is of direct importance to the work presented here, as they study a single algorithm that performs a task which we only define by its aim, rather than by how it is performed. Nonetheless, the analysis of XCS has given valuable insights into the set of classifiers that XCS aims at evolving — a topic that we come back to in Section 7.1.1.

4By the “emergence of modern RL” we refer to Sutton’s development of TD [211] and Watkins’ Q-Learning [231].

Single-Step Tasks

Single-step problems are essentially regression tasks where XCS aims at learn- ing a complete mapping from the input space to the output space. In XCS, such problems are handled by an RL method that for these tasks reduces to a gradient-based supervised learning approach, as will be shown in Sec- tions 5.3.3 and 5.3.4.

Most of the analysis of XCS in single-step tasks has been performed by Butz et al. in an ongoing effort [51, 53, 44, 56, 49, 46, 50, 58] restricted to binary string representations, and using what they call a facet-wise approach. Their approach is to look at single genetic operators, analyse their functionality and then assemble a bigger picture from the operators’ interaction, sometimes making simplifying assumptions to make the analysis tractable.

In [53] they analyse the various evolutionary pressures in XCS, showing that the set pressure pushes towards less specific classifiers, as already conjectured in Wilson’s Generalization Hypothesis [240]. Mutation is shown to push towards 50% or 66% specificity, and no quantitative values are derived for the fitness and subsumption pressure. Overall, it is qualitatively shown that XCS pushes towards optimally general classifiers, but the quantitative results should be treated with care due to their reliance on several significant assumptions.

In a subsequent series of work [51, 44, 46, 58], Butz et al. derive various time and population bounds to analyse how XCS scales with the size of the input and the problem complexity, where the latter expresses how strongly the val- ues of various input bits depend on each other. Combining these bounds, they show that the computational complexity of XCS grows linearly with respect to the input space size and exponentially with the problem complexity. Thus they state in [58] that XCS is a Probably Approximately Correct (PAC) learner (for example [128]). While this claim might be correct, the work that is presented is certainly not sufficient to support it — in particular due to the simplifying assumptions made to derive these bounds. More work is required to formally support this claim.

In addition to analysing the genetic pressures and deriving various bounds, a wide range of further work has been performed, like the empirical and theoret-

ical analysis of various selection policies in XCS (for example [56, 49, 83, 183]), or improving the XCS and UCS performance on classification problems with strong class imbalance [180, 181, 182]. None of these studies is directly related to our work and therefore will not be discussed in detail.

Multi-Step Tasks

Very little work has been performed to analyse the GA in multi-step problems, where a sequence of actions rather than a single action leads to the reward that is to be maximised. The only relevant study might be [31], where Bull has firstly shown in single-step tasks that overly general classifiers are supported in strength-based LCS but not in accuracy-based LCS. The model is then extended to a 2-step task, showing that “effective selection pressure can vary over time, possibly dramatically, until an equilibrium is reached and the constituency of the co-evolving match sets stop changing” [31]. The model even shows a pressure towards lower payoff rules in some cases, although this might be an artifact of the model.

2.4.3 Approaches from the Function Approximation Side

XCS was, for the first time, used for function approximation in [243] by allow- ing classifiers to compute their predictions from the values of the inputs. In [143, 144] it has been shown that such classifiers might only converge slowly to the correct model, and a training algorithm based on Recursive Least Squares (RLS) [106] was proposed to improve their speed of convergence. A simi- lar, but more in-depth, analysis was also provided in [77], where further ap- proaches, including the Kalman filter [124], were proposed.

How classifiers are combined to form the global prediction is essential to func- tion approximation but has been mostly ignored since it was defined in [240]. Only [82] and [29] have recently shed a new light on this component, but there is certainly still room for advancing its understanding.

2.4.4 Approaches from the Reinforcement Learning Side

Again concentrating on XCS, its exact approach to performing reinforcement learning has been discussed in [139] and [45]. In the latter study, Butz et al. show the parallels between XCS and Q-Learning and aim at adding gradient descent to XCS’s update equations. This modification is additionally published in [47], and was later analysed in [226, 227, 143, 78, 141, 140], with mixed results. Due to the current controversy about this topic we postpone its detailed discussion to Section 9.3.6, where we show that XCS(F) does not need to be modified to perform Q-Learning with gradient descent.

Another study that is directly relevant to RL concerns the limits of XCS in learning long sequences of actions [11, 12]. As this limitation emerges from the type of classifier set model that XCS aims at, it is also relevant to our work, and thus will be discussed in more detail in Section 9.5. Let us just note here that we will show that the solution proposed in [12] might not apply to all sequential decision task definitions, but that our proposed model might be able to handle them.

There has been no work on the stability of XCS when used for sequential de- cision tasks, even though such stability is not guaranteed (for example, [25]). Wada et al. claim in [226, 227] that XCS does not perform Q-Learning correctly — a claim that we question in Section 9.3.6 — and consequently introduce a modification of ZCS in [227] that makes it equivalent to Q-Learning with linear function approximation. They demonstrate its instability in [225], and present a stable variant in [227]. As described in Section 4.5, their LCS model is not compatible with XCS, as they do not train their classifiers independently. As we favour the XCS approach, we consider its stability in Section 9.4, indepen- dent of the approach presented in [226, 227, 225].

2.5 Discussion and Conclusion

From this historical overview of LCS and in particular XCS we can see that LCS are traditionally approached algorithmically and also analysed as such.

Even in the first LCS, CS-1, most of the emphasis is put on how to approach the problem, and little on the problem itself. Given that many non-LCS approaches handle the same problem class (for example, [17, 213]), an algorithmic description of LCS emphasises the features that distinguish LCS from non-LCS methods. But even with such statements one needs to be careful: considering the series of 11 short essays under the title “What is a Learning Classifier System?” in [116] it becomes clear that there is no common agreement about what defines an LCS.

Based on these essays, Kovacs discusses in [135] whether LCS should be seen as GA’s or as algorithms that perform RL. He concludes that while strength-based LCS are more similar to GA’s, accuracy-based LCS shift their focus more towards RL. Thus, there is no universal concept that applies to all LCS, particularly when considering that there exist LCS that cannot handle sequential decision tasks (for example, UCS [162]), and others that do not have a GA (for example, MACS [90, 87]).

The extensive GA-oriented analysis in recent years has shed some light on which problems XCS can handle and where it might fail, and on how to set some of its extensive set of system parameters. Nonetheless, questions still remain about whether accuracy-based fitness is indeed better than strength-based fitness in all situations, or whether we even need some definition of fitness at all [22]. Furthermore, the correct approach to reinforcement learning in LCS is still not completely clear (see Section 9.3.6). In any case, we would like to emphasise that both the GA and RL in LCS are just methods to reach some goal, and without a clear definition of this goal it is impossible to determine if any method is ever able to reach it.

This is why the approach we propose for the analysis of LCS differs from look- ing further at existing algorithms and figuring out what they actually do and how they might be improved. Rather, as already alluded to in the previous chapter, we prefer to take a step back and concentrate firstly on the problem it- self before considering an approach to find its solution. This requires us to give a clear definition of the problem(s) that we aim to solve, followed by defining a model that determines the assumptions that we have about the problem struc- ture. To ensure that the resulting method can be considered as an LCS, the design of this model is strongly inspired by the structure of LCS, and in par-

ticular XCS.

Having a problem and an explicit model definition allows us to apply standard machine learning methods to train this model. The model in combination with its training defines the method, and as we will see, the resulting algorithms are indeed close to the ones of XCS, but with all the advantages that we have already described in the previous chapter. Additionally, we do not need to explicitly handle questions about possible fitness definitions or the correctness of the reinforcement learning method used, as they emerge naturally through deriving training methods for the model. From that perspective, the proposed approach handles many of the current issues in LCS more gracefully and holis- tically than previous attempts.


Chapter 3

A Learning Classifier Systems Model

Specifying the model that is formed by a set of classifiers is central to our approach. On one hand it explicitly defines the assumptions that we make about the problem that we want to solve, and on the other hand it determines the training methods that can be used to provide a solution. This chapter gives a conceptual overview of the LCS model, which is turned into a probabilistic formulation in the next chapter.

As specified in Chapter 1, the tasks that LCS are commonly applied to are regression tasks, classification tasks, and sequential decision tasks. The under- lying theme of providing solutions to these tasks is to build a model that as- sociates a set of observed inputs to their outputs. Taking the generative view, we assume that the observed input/output pairs are the result of a possibly stochastic process that generates an output for each associated input. Thus, the role of the model is to provide a good representation of the data-generating process.

As the number of available observations is generally finite and the observa- tions themselves possibly noisy, and we do not have direct access to the data- generating process, we need to induce its properties from these finite obser- vations. Therefore, we are required to make assumptions about the nature of the data-generating process which are expressed through the model that we

assume.

Staying close to the LCS philosophy, this model is given by a set of localised models that are combined into a global model. In LCS terms the localised models are the classifiers, with their localisation being determined by which inputs they match, and the global model is determined by how the classifier predictions are combined to provide a global prediction. Adopting such a model structure has several consequences for how it is trained, the most significant being that training is conceptually separable into a two-step procedure: firstly, we want to find a good number of classifiers and their localisation, and secondly we want to train this set of classifiers to be a good representation of the data-generating process. Both steps are closely interlinked and need to be dealt with in combination.

A more detailed definition of the tasks and the general concept of modelling the data-generating process is given in Section 3.1, after which we introduce the model that describes a set of classifiers as a member of the class of parametric models in Section 3.2. This includes an introduction to parametric models in Section 3.2.1, together with a more detailed definition of the localised classifier models and the global classifier set model in Sections 3.2.3 and 3.2.4. After discussing how the model structure influences its training and how the model itself relates to Holland’s initial LCS idea in Sections 3.2.6 and 3.2.7, we provide a brief overview of how the concepts introduced in this chapter propagate through the chapters to follow.

3.1 Task Definitions

In previous sections we have already informally described the different prob- lem classes that LCS are applied to. Here we give a more formal task definition that acts as the basis for further formal development. We differentiate between regression tasks, classification tasks, and sequential decision tasks.

Let us assume that we have a finite set of observations generated by noisy mea- surements of a stochastic process. All tasks have at their core the formation of a model that describes a hypothesis for the data-generating process. The pro-

cess maps an input space $\mathcal{X}$ into an output space $\mathcal{Y}$, and so each observation $(x, y)$ of that process is formed by an input $x \in \mathcal{X}$ that occurred and the associated measured output $y \in \mathcal{Y}$ of the process in reaction to that input. The set of all inputs $X = \{x_1, x_2, \dots\}$ and associated outputs $Y = \{y_1, y_2, \dots\}$ is called the training set or data $\mathcal{D} = \{X, Y\}$.

A model of that process provides a hypothesis for the mapping $\mathcal{X} \to \mathcal{Y}$, induced by the available data. Hence, given a new input $x$, the model can be used to predict the corresponding output $y$ that the process is expected to generate. Additionally, an inspection of the hypothesis structure can reveal regularities within the data. In sequential decision tasks the model represents the structure of the task and is employed as the basis of decision-making.

Before we describe the similarities and differences between the regression, classification and sequential decision tasks, we further discuss the difficulty of forming good hypotheses about the nature of the data-generating process from only a finite number of observations. For this purpose we assume batch learning, that is, the whole training set with $N$ observations of the form $(x_n, y_n)$ is available at once. In a later section, we contrast this approach with incremental learning, where the model is updated incrementally with each observation.

3.1.1 Expected Risk vs. Empirical Risk

In order to be able to model a data-generating process, we need to be able to express this process by a smooth stationary function $f : \mathcal{X} \to \mathcal{Y}$ that generates the observation $(x, y)$ by $y = f(x) + \epsilon$, where $\epsilon$ is a zero-mean random variable.

We require it to be given by a function such that the same expected output is generated for the same input. That is, given two inputs $x, x'$ such that $x = x'$, the expected output of the process needs to be the same for both inputs. Were this not the case, then we would be unable to detect any regularities within the process and so we could not build a meaningful model.

Smoothness of the function is required to express that the process generates similar outputs for similar inputs. That is, given two inputs $x, x'$ that are close in $\mathcal{X}$, their associated outputs $y, y'$ on average need to be close in $\mathcal{Y}$. This property is required in order to make predictions: if it did not hold, then we could not generalise over the training data, as relations between inputs do not transfer to relations between outputs, and thus we would be unable to predict the output for an input that is not in the training set. There are several ways of ensuring the smoothness of a function, such as by limiting its energy of high frequencies in the frequency domain [92]. We do not rely on any particular formal definition but rather deal with smoothness from an intuitive perspective.

As discussed before, the process may be stochastic and the measurements of the output may be noisy. This stochasticity is modelled by the random variable $\epsilon$, which has zero mean, such that for an observation $(x, y)$ we have $\mathrm{E}(y) = f(x)$. The distribution of $\epsilon$ is determined by the process stochasticity and the measurement noise.

With this formulation we can see that a model with structure $\mathcal{M}$ has to provide a hypothesis of the form $\hat{f}_{\mathcal{M}} : \mathcal{X} \to \mathcal{Y}$. In order to be a good model, $\hat{f}_{\mathcal{M}}$ has to be close to $f$. To be more specific, let $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ be a loss function that describes a distance metric in $\mathcal{Y}$, that is $L(y, y') > 0$ for all $y \neq y'$, and $L(y, y') = 0$ otherwise. To get a hypothesis $\hat{f}_{\mathcal{M}}$ close to $f$ we want to minimise the expected risk

$$\int_{\mathcal{X}} L\big(f(x), \hat{f}_{\mathcal{M}}(x)\big) \, \mathrm{d}p(x), \qquad (3.1)$$

where $p(x)$ is the probability density of having input $x$. In other words, our aim is to minimise the distance between the output of the data-generating process and our model of it, for each input $x$ weighted by the probability of observing it.

We cannot directly minimise the expected risk as f is only accessible by a finite set of observations. Thus, when constructing the model we need to rely on an approximation of the expected risk, called the empirical risk and defined as

$$\frac{1}{N} \sum_{n=1}^{N} L\big(y_n, \hat{f}_{\mathcal{M}}(x_n)\big), \qquad (3.2)$$

which is the average loss of the model over all available observations. Depending on the definition of the loss function, minimising the empirical risk can result in least squares learning or the principle of maximum likelihood [221]. By the law of large numbers, the empirical risk converges to the ex-

pected risk almost surely with the number of observations tending to infinity, but for a small set of observations the two measures might be quite different. How to minimise the expected risk based on the empirical risk forms the basis of statistical learning theory, for which a good introduction with slightly different definitions can be found in [221].
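The following hedged Python sketch (the data-generating function, noise level and model hypothesis are all assumed purely for illustration) shows how the empirical risk of Eq. (3.2) fluctuates for few observations but stabilises for many, in line with the law of large numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                                  # assumed data-generating function
f_hat = lambda x: x - x ** 3 / 6            # an assumed, fixed model hypothesis
loss = lambda y, y_hat: (y - y_hat) ** 2    # squared loss

def empirical_risk(n):
    x = rng.uniform(0.0, np.pi, n)          # inputs drawn from p(x)
    y = f(x) + rng.normal(0.0, 0.1, n)      # noisy observations of the process
    return np.mean(loss(y, f_hat(x)))

# Small samples scatter widely around the expected risk; large ones settle close to it.
print([round(empirical_risk(10), 3) for _ in range(3)])
print(round(empirical_risk(100_000), 3))
```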

We could simply proceed by minimising the empirical risk. That this approach will not lead to an adequate result is shown by the following observation: the model that minimises the empirical risk is the training set itself. However, as- suming noisy measurements, the data is almost certainly not completely cor- rect. Hence, we want to find a model that represents the general pattern in the training data but does not model its noise. The field that deals with this issue is known as model selection. Learning a model such that it perfectly fits the training set but does not provide a good representation of f is known as overfitting. The opposite, that is, learning a model where the structural bias of the model dominates over the information included from the training set, is called underfitting.

While in LCS several heuristics have been applied to deal with this issue, it has never been characterised explicitly. In this and the following chapters we consider our aim to be the minimisation of the empirical risk. In Chapter 7 we come back to the topic of model selection, and show how we can handle it with respect to LCS in a principled manner.

3.1.2 Regression

Both regression and classification tasks aim at finding a hypothesis for the data-generating process such that some risk measure is minimised, but differ in the nature of the input and output space. We characterise a regression task by a multidimensional real-valued input space $\mathcal{X} = \mathbb{R}^{D_X}$ with $D_X$ dimensions and a multidimensional real-valued output space $\mathcal{Y} = \mathbb{R}^{D_Y}$ with $D_Y$ dimensions. Thus, the inputs are column vectors $x = (x_1, \dots, x_{D_X})^T$ and the corresponding outputs are column vectors $y = (y_1, \dots, y_{D_Y})^T$. In the case of batch learning we assume that $N$ observations $(x_n, y_n)$ are available in the

form of the input matrix $X$ and output matrix $Y$,

$$X \equiv \begin{pmatrix} -\, x_1^T \,- \\ \vdots \\ -\, x_N^T \,- \end{pmatrix}, \qquad Y \equiv \begin{pmatrix} -\, y_1^T \,- \\ \vdots \\ -\, y_N^T \,- \end{pmatrix}. \qquad (3.3)$$

The loss function is commonly the $L_2$ norm, also known as the Euclidean distance, defined by $L_2(y, y') \equiv \|y - y'\|_2 = \big(\sum_i (y'_i - y_i)^2\big)^{1/2}$. Hence, the loss increases quadratically in all dimensions with the distance from the desired value. Alternatively, the $L_1$ norm, also known as the absolute distance, defined as $L_1(y, y') \equiv \|y - y'\|_1 = \sum_i |y'_i - y_i|$, can be used. The $L_1$ norm has the advantage that it only increases linearly with distance and is therefore more resilient to outliers. Using the $L_2$ norm, on the other hand, makes analytical solutions easier.
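A tiny numerical illustration (with assumed values) of how the two losses behave differently in the presence of an outlier:

```python
import numpy as np

y      = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.5, 9.0])       # two small errors and one outlier

l2 = np.linalg.norm(y_pred - y, ord=2)   # Euclidean distance, about 6.04
l1 = np.linalg.norm(y_pred - y, ord=1)   # absolute distance, 7.0
# Squaring the residuals lets the single outlier dominate the L2-based loss,
# whereas under the L1 norm each residual only contributes linearly.
print(l2, l1)
```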

All LCS developed so far only handle univariate regression, which is characterised by a 1-dimensional output space, that is $\mathcal{Y} = \mathbb{R}$. Consequently, the output vectors $y$ collapse to scalars $y \in \mathbb{R}$ and the output matrix $Y$ becomes a column vector $y \in \mathbb{R}^N$. For now we will also follow this convention, but will return to multivariate regression with $D_Y > 1$ in Chapter 7.

3.1.3 Classification

The task of classification is characterised by an input space that is mapped into a subset of a multidimensional real-valued space $\mathcal{X} \subseteq \mathbb{R}^{D_X}$ of $D_X$ dimensions, and an output space that is a finite set of labels, mapped into a subset of the natural numbers $\mathcal{Y} \subset \mathbb{N}$. Hence, the inputs are again real-valued column vectors $x = (x_1, \dots, x_{D_X})^T$, and the outputs are natural numbers $y$. The elements of the input vectors are commonly referred to as attributes, and the outputs are called the class labels.

XCS approaches classification tasks by modelling them as regression tasks: each input vector $x$ is augmented by its corresponding class label $y$ to get the new input vector $x' = (x^T, y)^T$ that is mapped into some positive scalar that we can without loss of generality assume to be 1. Furthermore, each input

vector in the training set is additionally augmented by any other valid class label except for the correct one (that is, as given by $y$) and maps into 0. Hence, the new input space becomes $\mathcal{X}' \subset \mathbb{R}^{D_X} \times \mathbb{N}$, and the output space becomes $\mathcal{Y}' = [0, 1]$. Consequently, the correct class for a new input $x$ can be predicted by augmenting the input by each possible class label and choosing the class for which the prediction of the model is closest to 1.
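A minimal sketch (with hypothetical helper names) of this augmentation and of the resulting prediction procedure:

```python
def augment(x, y, labels):
    """Turn one classification example (x, y) into len(labels) regression examples:
    the input is extended by a candidate label; the target is 1 for the correct
    label and 0 for every other label."""
    return [(tuple(x) + (label,), 1.0 if label == y else 0.0) for label in labels]

def predict_class(model, x, labels):
    # Choose the label whose augmented input yields the prediction closest to 1.
    return min(labels, key=lambda label: abs(1.0 - model(tuple(x) + (label,))))

print(augment([0.3, 0.7], 2, labels=[0, 1, 2]))
# [((0.3, 0.7, 0), 0.0), ((0.3, 0.7, 1), 0.0), ((0.3, 0.7, 2), 1.0)]
```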

We will proceed in the same way as XCS and therefore will not need to con- sider the classification task explicitly. Nonetheless, this procedure is not partic- ularly efficient, and alternative LCS approaches to classification have already been devised (for example, UCS [162]). We will discuss these alternatives in Section 10.3 in the light of later presented work.

3.1.4 Sequential Decision

A sequential decision task requires a learner to maximise the long-term reward it receives through the interaction with an environment. At any time, the environment is in a certain state within the state space $\mathcal{X}$. A state transition occurs when the learner performs an action from the action set $\mathcal{A}$. Each of these state transitions is mediated by a scalar reward. The aim of the learner is to find a policy, which is a mapping $\mathcal{X} \to \mathcal{A}$ that determines the action in each state, such that the reward is maximised in the long run.

While it is possible to search the space of possible policies directly, a more efficient approach is to compute the value function $\mathcal{X} \times \mathcal{A} \to \mathbb{R}$ that determines for each state which long-term reward to expect when performing a certain action. If we have a model of the state transitions and rewards, we can use Dynamic Programming (DP) to compute this function. Reinforcement Learning (RL), on the other hand, deals with finding the value function if no such model is available. As the latter is commonly the case, Reinforcement Learning is also the approach employed by LCS.

We can differentiate two approaches to RL: either we learn a model of the transitions and rewards by observations and then use dynamic programming to find the value function, called model-based RL, or we estimate the value func- tion directly while interacting with the environment, called model-free RL.

In the model-based case, we consequently need to derive a model of the state transitions and rewards from the given observations, both of which are regression tasks. If we want to compute the policy while sampling the environment we need to update the model incrementally and therefore need an incremental learner.

In the model-free case, the function to model is the estimate of the value function, again leading to a regression task that needs to be handled incrementally. Additionally, the value function estimate is itself updated incrementally, and as it constitutes the data-generating process for the regression learner, this process is slowly changing. As a result, there is a dynamic interaction between the RL algorithm that updates the value function estimate and the incremental regression learner that models it, which is not in all cases stable and needs special consideration [25]. These are additional difficulties that need to be taken into account when performing model-free RL.

Clearly, although the sequential decision task was the prime motivator for LCS, it is also the most complex to tackle. Therefore, we deal with standard regression tasks first, and come back to sequential decision tasks in Chapter 9. Even then we will only deal with it from the theoretical perspective of stability, as it requires an incremental learning procedure that we will not develop.

3.1.5 Batch vs. Incremental Learning

In batch learning, we assume that the whole training set is available at once, and that the order of the observations in that set is irrelevant. Thus, we can train the model with all data at once and in any order.

Incremental learning methods differ from batch learning in that the model is updated with each additional observation separately, and as such can handle observations that arrive sequentially as a stream. Revisiting the assumption of Section 3.1.1, that the data-generating process f is expressible by a function, we can differentiate between two cases:

f is stationary. If the data-generating process does not change with time and

the full training set is available at once, any incremental learning method is either only an incremental implementation of an equivalent batch learning algorithm, or an approximation to it.

f is non-stationary. Learning a model of a non-stationary generating process is only possible if the process is only slowly varying, that is, if it changes slowly with respect to the frequency that it is observed. Hence, we can assume its stationarity at least within a limited time-frame. It is modelled by putting more weight on later observations, as earlier observations do give general information about the process but might reflect it in an outdated state. Such recency-weighting of the observations is very naturally achieved within incremental learning by assigning the current model a lower weight than new observations.

The advantages of incremental learning methods over batch learning methods are that the former can handle observations that arrive sequentially as a stream, and that they more naturally handle non-stationary processes, even though the second feature can also be simulated by batch learning methods by weighting the different observations according to their temporal sequence. On the downside, when compared to batch learning, incremental learners are generally less transparent in what exactly they learn, and dynamically more complex.

With respect to the different tasks, incremental learners are particularly suited to model-free RL, where the value function estimate is learned incrementally and therefore changes slowly. Given that all data is available at once, regres- sion and classification tasks are best handled by batch learners.

From the theoretical perspective, incremental learners can be derived from a batch learner that is applied to solve the same task. This has the advantage of preserving the transparency of the batch learning method and acquiring the flexibility of the incremental method. We illustrate this principle with the following example.

Example 3.1.1 (Relating Batch and Incremental Learning). We want to estimate the probability of a tossed coin showing head, without any initial bias about its fairness. We perform $N$ experiments with no input, $\mathcal{X} = \emptyset$, and outputs $\mathcal{Y} = \{0, 1\}$, where 0 and 1 stand for tail and head respectively. Adopting a frequentist approach, we can estimate the probability of a coin toss resulting in head by

$$p_N(H) = \frac{1}{N} \sum_{n=1}^{N} y_n, \qquad (3.4)$$

where $p_N(H)$ stands for the estimated probability of head after $N$ experiments. This batch learning approach can be easily turned into an incremental approach by

$$p_N(H) = \frac{1}{N} y_N + \frac{1}{N} \sum_{n=1}^{N-1} y_n = p_{N-1}(H) + \frac{1}{N}\big(y_N - p_{N-1}(H)\big), \qquad (3.5)$$

starting with $p_1(H) = y_1$. Hence, to update the model $p_{N-1}(H)$ with the new observation $y_N$, we only need to maintain the number $N$ of experiments so far. When comparing Eqs. (3.4) and (3.5) we can see that, whilst the incremental approach yields the same results as the batch approach, it is far less transparent in what it is actually calculating.

Let us now assume that the coin changes its properties over time, and we therefore trust recent observations more. Hence, we will modify our incremental update to

$$p_N(H) = p_{N-1}(H) + \gamma\big(y_N - p_{N-1}(H)\big), \qquad (3.6)$$

where $0 < \gamma \leq 1$ is the recency factor that determines the influence of past observations on our current estimate. Recursive substitution of $p_n(H)$ results in the batch learning equation

$$p_N(H) = (1 - \gamma)^N p_0(H) + \sum_{n=1}^{N} \gamma (1 - \gamma)^{N-n} y_n. \qquad (3.7)$$

Inspecting this equation reveals that observations $n$ experiments back in time are weighted by $\gamma(1 - \gamma)^n$. Additionally, we can see that an initial bias $p_0(H)$ is introduced that decays exponentially with the number of available observations. Again, the batch learning formulation has led to greater insight and transparency.
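As a quick numerical check of this example, the following hedged Python sketch (with an arbitrarily chosen recency factor and initial bias) verifies that the incremental updates of Eqs. (3.5) and (3.6) reproduce their batch formulations, Eqs. (3.4) and (3.7):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200).astype(float)   # 0 = tail, 1 = head

# Batch estimate, Eq. (3.4), versus its incremental counterpart, Eq. (3.5).
p_inc = y[0]
for n, y_n in enumerate(y[1:], start=2):
    p_inc += (y_n - p_inc) / n
assert np.isclose(p_inc, y.mean())

# Recency-weighted update, Eq. (3.6), versus its batch form, Eq. (3.7).
gamma, p0 = 0.1, 0.5                             # assumed recency factor and initial bias
p_rec = p0
for y_n in y:
    p_rec += gamma * (y_n - p_rec)
N = len(y)
weights = gamma * (1.0 - gamma) ** (N - np.arange(1, N + 1))
assert np.isclose(p_rec, (1.0 - gamma) ** N * p0 + weights @ y)
```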

Are LCS Batch Learners or Incremental Learners?

LCS are often considered to be incremental learners. While they are usually implemented as such, there is no reason not to design them as batch learners when applying them to regression or classification tasks, given that all data is available at once. Indeed, Pittsburgh-style LCS usually require an individual representing a set of classifiers to be trained on the full data, and hence we can interpret them as incrementally implemented batch learners when applied to regression and classification tasks.

Even Michigan-style LCS can employ batch learning when the classifiers are trained independently: each classifier can be trained on the full data at once and is later only queried for its fitness evaluation and its prediction.

As we aim at understanding what LCS are learning, we — for now — prefer transparency over performance. Hence, we will predominantly describe LCS from a batch learning perspective, although, throughout Chapters 5, 6 and 7, we will also discuss how to get similar results with incremental learning. Still, the prototype system we develop is only fully described from the batch learning perspective. How to turn this system into an incremental learner is left as the topic of future research.

3.2 LCS as Parametric Model

While the term model may be used in many different ways, we will define it as a collection of possible hypotheses about the data-generating process. Hence, the choice of model determines the available hypotheses and therefore biases our expressiveness about this process. Such a bias represents the assumptions that we make about the process and its stochasticity. Understanding the as- sumptions that are introduced with the model allows us to make statements about its applicability and performance.

Example 3.2.1 (Different Linear Models and their Assumptions). A linear relation between inputs and outputs with constant-variance Gaussian noise $\epsilon$ leads to least squares (that is, using the $L_2$ loss function) linear regression. Alternatively, assuming the noise to have a Cauchy distribution results in linear regression using the $L_1$ loss function. As a Cauchy distribution has a longer tail than a Gaussian distribution, it is more resilient to outliers. Hence it is considered as being more robust, but the $L_1$ norm makes it harder to train [66]. This shows how an assumption of a model about the data-generating process can give us information about its expected performance.

Training a model means finding the hypothesis that is closest to what we as- sume is the data-generating process. For example, in a linear regression model the space of hypotheses is all hyper-planes in the input/output space, and per- forming linear regression means picking the hyper-plane that best explains the available observations.

The choice of model strongly determines how hard it is to train. While more complex models are usually able to express a larger range of possible hypothe- ses, this larger range also makes it harder for them to avoid overfitting and underfitting. Hence, very often, overfitting by minimising the empirical risk is counterbalanced by reducing the number of hypotheses that a model can ex- press, thus making the assumptions that a model introduces more important.

Example 3.2.2 (Avoiding Overfitting in Artificial Neural Networks). Reducing the number of hidden neurons in a feed-forward neural network is a popular measure for avoiding overfitting the training data. This measure effectively reduces the number of possible hypotheses that the network is able to express and as such introduces a stronger structural bias. Another approach to avoiding overfitting in neural network training is weight decay, which exponentially decays the magnitude of the weights of the neural connections in the network. While not initially designed as such, weight decay is equivalent to assuming a zero-mean Gaussian prior on the weights and hence biasing them towards smaller values. This prior is again equivalent to assuming smoothness of the target function [107].

Having underlined the importance of knowing the underlying model of a method, we continue by introducing the family of parametric models and de- scribing LCS as a member of that family. Our description is based on reflec-

tions on what classifiers actually are and do, and how they cooperate to form a model. While we give a general overview of how the model described by LCS can be trained, more details have to wait until after we have developed a formal probabilistic model of LCS in the following chapter.

3.2.1 Parametric Models

The choice of hypothesis during model training is usually determined by a set of adjustable parameters θ. Models for which the number of parameters is independent of the training set and remains unchanged during model train- ing are commonly referred to as parametric models. In contrast, non-parametric models are models for which the number of adjustable parameters either de- pends on the training set, changes during training, or both.

Another property of a parametric model is its structure M (often also referred to as scale). Given a model family, the choice of structure determines which model to use from this family. For example, considering the family of feed-forward neural networks with a single hidden layer, the model structure is the number of hidden neurons and the model parameters are the weights of the neural connections. Hence, the model structure is the adjustable part of the model that remains unchanged during training but might determine the number of parameters.

With these definitions we can re-formulate our aims: firstly, we want to pick an adequate model structure M that provides the model hypotheses \hat{f}_M(x; θ_M), and secondly, we want to find the values for the model parameters θ such that we minimise the expected risk for our choice of loss function.

3.2.2 LCS Model

An LCS forms a global model by the combination of local models, represented by the classifiers. The number of classifiers can change during the training process, and so can the number of adjustable parameters by action of the GA. Hence, an LCS is not a parametric model per se.

We can turn an LCS into a parametric model by assuming that the number of classifiers is fixed, and each classifier represents a parametric model. While this choice seems arbitrary at first, it becomes useful for later development. Its consequences are that both the number of classifiers and how they are located in the input space are part of the model structure M and are not modified while adjusting the model parameters. The model parameters θ are the parameters of the classifiers and those required to combine their local models.

Consequently, training an LCS is conceptually split into two parts: we want to find a good model structure M, that is, the adequate number of classifiers and their location, and for that structure the values for the model parameters θ. This interpretation justifies calling LCS adaptive models.

Before providing more details on how to find a good model structure, let us first assume a fixed model structure with K classifiers and investigate in more detail the components of such a model.

3.2.3 Classifiers as Localised Models

In LCS, the combination of condition and action of a classifier determines the inputs that a classifier matches. Hence, given the training set, one classifier matches only a subset of the observations in that set. Thus, we can say that a classifier is localised in the input space, where its location is determined by the inputs that it matches.

Matching

Let X_k ⊆ X be the subset of the input space that classifier k matches. The classifier is trained by all observations that it matches, and hence its aim is to provide a local model \hat{f}_k(x; θ_k) that maps X_k into Y, where θ_k is the set of parameters of the model of classifier k. More flexibly, we can define matching by a matching function m_k : X → [0, 1] specific to classifier k, given by the indicator function for the set X_k,

m_k(x) = \begin{cases} 1 & \text{if } x \in X_k, \\ 0 & \text{otherwise}. \end{cases} \qquad (3.8)

The advantage of using a matching function m_k rather than a set X_k is that the former allows for degrees of matching in-between 0 and 1, a feature that we will make use of in a later section. Also note that representing matching by X_k or the matching function m_k makes it independent of the choice of representation of the condition/action of a classifier. Thus, all future developments are valid for all choices of representation.
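As a small illustration (not part of the thesis; the interval bounds and the soft variant are assumptions made purely for this sketch), the following Python snippet implements the binary indicator of Eq. (3.8) for interval matching as well as a variant that returns a matching degree between 0 and 1 by softening the interval boundary:

import numpy as np

def match_interval(x, lower, upper):
    # Binary matching as in Eq. (3.8): 1 if x lies in the classifier's
    # interval [lower, upper], 0 otherwise.
    return 1.0 if lower <= x <= upper else 0.0

def match_soft(x, lower, upper, softness=0.5):
    # Hypothetical soft matching degree in [0, 1]: inside the interval the
    # degree is 1, outside it decays with the distance to the interval.
    dist = max(lower - x, 0.0, x - upper)
    return float(np.exp(-0.5 * (dist / softness) ** 2))

print(match_interval(0.3, 0.0, 1.0))   # 1.0
print(match_soft(1.2, 0.0, 1.0))       # degree slightly below 1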

Local Classifier Model

The local model of a classifier is usually a regression model with no particular restrictions. As we have discussed in Section 2.3.1, initially only simple averaging predictions were used, but more recently, classifiers have been extended to use linear regression models, neural networks, and SVM regression. While averagers are just a special case of linear models, neural networks might suffer from the problem of multiple local optima [105], and SVM regression has no clean approach to incremental implementations [158]. Hence, we will restrict ourselves to the well-studied class of linear models as a good tradeoff between expressive power and complexity of training. We will discuss them in depth in Chapters 4 and 5.

Input to Matching and Local Models

Note that in LCS the input given to the matching function and that given to the classifier’s model usually differ in that the input to the model is often formed by applying a transfer function to the input given to the matching mechanism. Nonetheless, to keep the notation uncluttered we assume that the given input x contains all available information and both matching and the local model selectively choose and modify the components that they require by an implicit transfer function.

Example 3.2.3 (Inputs to Matching and Local Model). Let us assume that both the input and the output space are 1-dimensional, that is, X = R and Y = R, and that we perform interval matching over the interval [l_k, u_k], such that m_k(x) = 1 if l_k ≤ x ≤ u_k, and m_k(x) = 0 otherwise. Applying the linear model \hat{f}(x; w_k) = x w_k to the input, with w_k being the adjustable parameter of classifier k, we can only model straight lines through the origin. However, applying the transfer function φ(x) = (1, x)^T allows us to introduce an additional bias to get \hat{f}(x; w_k) = w_k^T φ(x) = w_{k1} + x w_{k2}, with w_k = (w_{k1}, w_{k2})^T ∈ R^2, which is an arbitrary straight line. In such a case, we assume the input to be x' = (1, x)^T, and the matching function to only operate on the second component of the input. Hence, we can apply both matching and the model to the same input. We give a more detailed discussion about different transfer functions and their resulting models in Section 5.1.1.
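A tiny numerical sketch of this example (the weight values are invented) showing how the transfer function φ(x) = (1, x)^T introduces an intercept:

import numpy as np

def f_no_bias(x, w):
    # Model without transfer function: a straight line through the origin.
    return w * x

def f_with_bias(x, w_k):
    # Model with transfer function phi(x) = (1, x)^T: w_k1 + x * w_k2.
    phi = np.array([1.0, x])
    return float(w_k @ phi)

w_k = np.array([2.0, 0.5])      # hypothetical weights (w_k1, w_k2)
print(f_no_bias(0.0, 0.5))      # 0.0 -- always passes through the origin
print(f_with_bias(0.0, w_k))    # 2.0 -- intercept introduced by the bias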

3.2.4 Recovering the Global Model

To recover the global model from K local models, we need to combine these local models in some meaningful way. For inputs that only a single classifier matches, the best model we have available is the matching classifier’s model. However, there are no restrictions on how many classifiers can match a single input. Therefore, in some cases, it is required to mix the local models of several classifiers that match the same input.

There are several possible approaches to mixing classifier models, each corresponding to different assumptions about the data-generating process. We will introduce a standard approach in Chapter 4 and investigate alternatives in Chapter 6.

3.2.5 Finding a Good Model Structure

The model structure M is given by the number of classifiers and their localisation. As the localisation of a classifier k is determined by its matching function m_k, the model structure is completely specified by the number of classifiers K and their matching functions M = {m_k}, that is, the model structure is {K, M}.

To find a good model structure means to find a structure that allows for hypotheses about the data-generating process that are close to the process suggested by the available observations. Thus, finding a good model structure implies dealing with over- and underfitting of the training set. We will postpone a detailed treatment of this topic to Chapter 7 and will for now proceed by assuming that a good model structure is known.

3.2.6 Considerations for Model Structure Search

The space of possible model structures is potentially huge, and hence to search this space, evaluating the suitability of a single model structure M to explain the data needs to be efficient to keep searching the model structure space computationally tractable. Additionally, we want to guide our search by using the information we gain about the quality of the classifiers within a certain model structure by fitting this model structure to the data.

Each classifier in the LCS model represents some information about the input/output mapping, limited to the subspace of the input space that it matches. Hence, while preserving classifiers that seem to provide a good model of the matched data, we want to refine the model structure in areas of the input space for which none of the current classifiers provides an adequate model. This can be achieved by either modifying the localisation of current classifiers that do not provide an adequate fit, removing those classifiers, or adding new classifiers to compare their goodness-of-fit to the current ones. Intuitively, interpreting a classifier as a localised hypothesis for the data-generating process, we want to change or discard bad hypotheses, or add new hypotheses to see if they are favoured in comparison to already existing hypotheses.

In terms of the model structure search, the search space is traversed by modifying the current model structure rather than discarding it at each search step. By only modifying part of the model, we satisfy the aim of letting the knowledge gained about the suitability of the current model structure guide the structure search. Additionally, if only a few classifiers are changed in their localisation in each step of the search, we only need to re-train the modified or added classifiers, given that the classifiers are trained independently. This is an important feature that makes the search more efficient, and that we will revisit in Section 4.4.

Such a search strategy clearly relates to how current LCS traverse the search space: in Michigan-style LCS, such as XCS, new classifiers are added either if no classifier is localised in a certain area of the input space, or to provide alternative hypotheses by merging and modifying the localisation structure of two other current classifiers with a high goodness-of-fit. Classifiers in XCS are removed with a likelihood that is proportional to the average number of other classifiers that match the same area of the input space, causing the number of classifiers that match a particular input to be about the same for all inputs. Pittsburgh-style LCS also traverse the structure search space by merging and modifying sets of classifiers of two model structures that were evaluated to explain the data well. However, few current Pittsburgh-style LCS retain the evaluation of single classifiers to improve the efficiency of the search, a feature that we use in our prototype implementation.

3.2.7 Relation to the Initial LCS Idea

Recall that originally LCS addressed the problems of parallelism and coordination, credit assignment, and rule discovery, as described in Section 2.2.1. We will now describe how these problems are addressed in the proposed model.

Parallelism is featured by allowing several classifiers to be overlapping, that is, to be localised partially in the same areas of the input space. Hence, they compete locally by providing different models for the same data, and cooperate globally by providing a global model only in combination. Coordination of the different classifiers is handled on one hand by the model component that combines the local models into a global model, and on the other hand by the model structure search that removes or changes classifiers based on their contribution to the full model.

Credit assignment refers to assigning the external reward to the different classifiers; it is mapped to regression and classification tasks that fit the model to the data, as the reward is represented by the output. In sequential decision tasks, credit assignment is additionally handled by the reinforcement learning algorithm, which will be discussed in detail in Chapter 9.

Lastly, the role of discovering new rules, that is, classifiers with a better localisation, is performed by the model structure search. How to use current knowledge to introduce new classifiers depends strongly on the choice of representation for the condition and action of a classifier. As the presented work does not make any assumptions about the representation, it does not deal with this issue in detail, but rather relies on the body of prior work (for example, [41, 38, 165, 145, 148, 206]) that is available on this topic.

3.3 Summary and Outlook

We have identified that the task of LCS is to find a good model that forms a hypothesis about the form of the data-generating process, based on a finite set of observations. The process maps an input space into an output space, and the model provides a possible hypothesis for this mapping. The task of finding a good model is made more complex as only a finite set of observations of the input/output mapping is available, perturbed by measurement noise and the possible stochasticity of the process; this task is dealt with by the field of model selection. We have differentiated between minimising the expected risk, which is the difference between the real data-generating process and our model, and minimising the empirical risk, which is the difference between the observations available of that process and our model.

Regression, classification and sequential decision tasks differ in the form of the input and output spaces and in the assumptions made about the data-generating process. For both regression and classification tasks we have assumed the process to be representable by a smooth function with an additive zero-mean noise term, and have reduced the classification tasks in an XCS-like manner to regression tasks. While sequential decision tasks as handled by RL also have a regression task at their core, they have special requirements on the stability of the learning method and therefore receive a separate treatment in Chapter 9.

We have characterised a model as being a collection of possible hypotheses about the nature of the data-generating process, and training a model as finding the hypothesis that is best supported by the available observations of that process. We have introduced the class of parametric models that are characterised by an unchanging number of model parameters while the model is trained, in contrast to the model structure of a parametric model, which is the part of the model that is adjusted before training it, and determines the number of adjustable parameters during model training.

We have described LCS as a model that combines a set of local models (that is, the classifiers) into a global model. While LCS are not parametric models per se, we have characterised them as such by defining the model structure as the number of classifiers and their localisation, and the model parameters as the parameters of the classifiers and the ones required for combining the local models. As a result, the task of training LCS is conceptually split into finding a good model structure, that is, a good set of classifiers, and training these classifiers with the available training set.

Finding a good model structure requires us to deal with the topic of model selection and the tradeoff between overfitting and underfitting. As we require a good understanding of the LCS model itself before attacking this issue, we postpone the problem of evaluating the quality of a model structure to Chapter 7. Until then, we assume the model structure M to be constant.

In the next chapters we discuss how to train an LCS model given a certain model structure, that is, how to adjust the model parameters in the light of the available data. Our temporary aim at this stage is to minimise the empirical risk. Even though this might lead to overfitting, it still gives us valuable insights into how to train the LCS model, and its underlying assumptions about the data-generating process. We proceed by formulating a probabilistic model of LCS in Chapter 4 based on a generalisation of the related Mixtures-of-Experts model. Furthermore, we give more details on training the classifiers in Chapter 5, and alternatives for combining the local classifier models to a global model in Chapter 6, assuming that the model structure remains unchanged. After that we come back to developing a principled approach to finding a good set of classifiers, that is, a good model structure.

Chapter 4

A Probabilistic Model for LCS

Having conceptually defined the LCS model, we continue by embedding this model in a formal setting. The formal model will be initially designed for a fixed model structure M; that is, the number of classifiers and where they are localised in the input space is constant during training of the model. Even though we could proceed by characterising the LCS model by its functional form, as we have done in [77], we instead develop a probabilistic model. Its advantage is that rather than getting a point estimate for the output y given some input x, the probabilistic model provides the probability distribution p(y | x, θ) that for some input x and model parameters θ describes the probability density of the output having value y. From this distribution we can recover the point estimate from its mean or its mode, and additionally we get information about the certainty of the prediction by the spread of the distribution.

In this chapter we concentrate on modelling the data by the principle of maximum likelihood: given a set of observations, we want to tune the model parameters such that the probability of the observations given the model parameters is maximised. As described in the previous chapter this might lead to overfitting the data, but nonetheless it gives us a first idea about how the model can be trained, and relates it closely to XCS, where overfitting is controlled on the model structure level rather than the model parameter level. In Chapter 7 we generalise this model and introduce a training method that avoids overfitting.

In developing the probabilistic model we were guided by the formulation of a related machine learning model: the Mixtures-of-Experts (MOE) model [121, 122] fits the data by a fixed number of localised experts. Even though not identified by previous LCS research, there are strong similarities between LCS and MOE when relating the classifiers of LCS to the experts of MOE. However, they differ in that the localisation of the experts in MOE is changed by a gating network that assigns observations to experts, whereas in LCS the localisation of classifiers is defined by the matching functions and is fixed for a constant model structure. To relate these two approaches we modify the MOE model such that it acts as a generalisation of both the standard MOE model and LCS. Furthermore, we solve difficulties in training the emerging model by detaching expert training from training the gating network.

We first introduce the standard MOE model as described in [122], which we will build on in later developments, and discuss its training and how it localises the experts, followed by a discussion of the properties of linear expert models in Section 4.2. To relate MOE to LCS, we generalise the MOE model in Section 4.3 and again describe how it can be trained. Identifying the difficulties in training of this model, we propose a modification to the model in Section 4.4 that simplifies its training.

4.1 The Mixtures-of-Experts Model

The MOE model is probably best explained from the generative point-of-view: given a set of K experts, each observation in the training set is assumed to be generated by one and only one of these experts. Let z = (z_1, ..., z_K)^T be a random binary vector, where each of its elements z_k is associated with an expert and indicates whether that expert generated the given observation (x, y).

Given that expert k generated the observation, then zj = 1 for j = k, and zj = 0 otherwise, resulting in a 1-of-K structure of z. The introduced random vector is a latent variable, as its values cannot be directly observed. Each observation

(x_n, y_n) in the training set has such a random vector z_n associated with it, and we denote by Z = {z_n} the set of latent variables corresponding to each of the observations in the training set.

Each expert provides a mapping X → Y that is given by the conditional probability density p(y | x, θ_k), that is, the probability of the output taking value y given the input vector x and the model parameters θ_k of expert k. Depending on whether we deal with regression or classification tasks, experts can represent different parametric models. Leaving the expert models unspecified for now, we will introduce linear regression models in Section 4.2.

4.1.1 Likelihood for Known Gating

A common approach to training probabilistic models is to maximise the likelihood of the outputs given the inputs and the model parameters, a principle known as maximum likelihood. As we will later show, maximum likelihood training is equivalent to minimising the empirical risk, with a loss function depending on the probabilistic formulation of the model.

Following the standard assumptions of independent observations, and additionally assuming knowledge of the values of the latent variables Z, the likelihood of the training set is given by

p(Y | X, Z, θ) = \prod_{n=1}^{N} p(y_n | x_n, z_n, θ), \qquad (4.1)

where θ stands for the model parameters. Due to the 1-of-K structure of each z_n, the likelihood for the nth observation is given by

p(y_n | x_n, z_n, θ) = \prod_{k=1}^{K} p(y_n | x_n, θ_k)^{z_{nk}}, \qquad (4.2)

where z_nk is the kth element of z_n. As only one element of z_n can be 1, the above expression is equivalent to the model of the single expert j for which z_nj = 1.

As the logarithm function is monotonically increasing, maximising the logarithm of the likelihood is equivalent to maximising the likelihood. Combining Eqs. (4.1) and (4.2), the log-likelihood ln p(Y | X, Z, θ) results in

\ln p(Y | X, Z, θ) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln p(y_n | x_n, θ_k). \qquad (4.3)

Inspecting Eq. (4.3) we can see that each observation n is assigned to the single expert for which z_nk = 1. Hence, it is maximised by maximising the likelihood of the expert models separately, for each expert based on its assigned set of observations.

4.1.2 Parametric Gating Network

As the latent variables Z are not directly observable, we do not know the values that they take and therefore cannot maximise the likelihood introduced in the previous section directly. Rather, we introduce a parametric model for Z, known as the gating network, that is trained in combination with the experts.

The gating network used in the standard MOE model is based on the assumption that the probability of an expert having generated the observation (x, y) is log-linearly related to the input x. This is formulated by

g_k(x) \equiv p(z_k = 1 | x, v_k) \propto \exp(v_k^T x), \qquad (4.4)

stating that the probability of expert k having generated observation (x, y) is proportional to the exponential of the inner product of the input x and the gating vector v_k of the same size as x. Normalising p(z_k = 1 | x, v_k), we get

g_k(x) \equiv p(z_k = 1 | x, v_k) = \frac{\exp(v_k^T x)}{\sum_{j=1}^{K} \exp(v_j^T x)}, \qquad (4.5)

which is the well-known softmax function, and corresponds to the multinomial logit model in Statistics that is often used to model consumer choice [167]. It is parameterised by one gating vector v_k per expert, in combination forming the set V = {v_k}. Figure 4.1 shows the directed graphical model that illustrates the structure and variable dependencies of the Mixtures-of-Experts model.

To get the log-likelihood l(θ; D) \equiv \ln p(Y | X, θ), we use the 1-of-K structure of z to express the probability of having a latent random vector z for a given input x and a set of gating parameters V by

p(z | x, V) = \prod_{k=1}^{K} p(z_k = 1 | x, v_k)^{z_k} = \prod_{k=1}^{K} g_k(x)^{z_k}. \qquad (4.6)

Figure 4.1: Directed graphical model of the Mixtures-of-Experts model. The circular nodes are random variables (z_nk), which are observed when shaded (y_n). Labels without nodes are either constants (x_n) or adjustable parameters (θ_k, v_k). The boxes are "plates", comprising replicas of the entities inside them. Note that z_nk is shared by both boxes, indicating that there is one z for each expert for each observation.

Thus, by combining Eqs. (4.2) and (4.6), the joint density over y and z is given by

p(y, z | x, θ) = \prod_{k=1}^{K} g_k(x)^{z_k} p(y | x, θ_k)^{z_k}. \qquad (4.7)

By marginalising over z, the output density results in

p(y | x, θ) = \sum_z \prod_{k=1}^{K} g_k(x)^{z_k} p(y | x, θ_k)^{z_k} = \sum_{k=1}^{K} g_k(x) p(y | x, θ_k), \qquad (4.8)

and subsequently, the log-likelihood l(θ; D) is

l(θ; D) = \ln \prod_{n=1}^{N} p(y_n | x_n, θ) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} g_k(x_n) p(y_n | x_n, θ_k). \qquad (4.9)
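To make the structure of Eqs. (4.8) and (4.9) explicit, here is a small sketch (hypothetical data, weights, and gating vectors; not taken from the thesis) that evaluates the mixture density and the resulting log-likelihood for Gaussian linear experts and a softmax gating network:

import numpy as np

def softmax_gating(X, V):
    # g_k(x) of Eq. (4.5): rows are observations, columns are experts.
    a = X @ V.T
    a -= a.max(axis=1, keepdims=True)      # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def expert_densities(X, y, W, tau):
    # p(y | x, theta_k) for Gaussian linear experts, as in Eq. (4.15).
    mean = X @ W.T                          # N x K matrix of w_k^T x
    return np.sqrt(tau / (2 * np.pi)) * np.exp(-0.5 * tau * (mean - y[:, None]) ** 2)

def log_likelihood(X, y, V, W, tau):
    # Eq. (4.9): sum_n log sum_k g_k(x_n) p(y_n | x_n, theta_k).
    g = softmax_gating(X, V)
    p = expert_densities(X, y, W, tau)
    return np.sum(np.log(np.sum(g * p, axis=1)))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.uniform(-1, 1, 20)])  # inputs with bias element
y = rng.normal(0.0, 1.0, 20)                                # hypothetical outputs
V = rng.normal(size=(2, 2))                                 # gating vectors
W = rng.normal(size=(2, 2))                                 # expert weight vectors
tau = np.ones(2)                                            # noise precisions
print(log_likelihood(X, y, V, W, tau))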

Example 4.1.1 (Gating Network for 2 Experts). Let us consider the input space dimensionality D_X = 3, where an input is given by x = (1, x_1, x_2)^T. Assume two experts with gating parameters v_1 = (0, 0, 1)^T and v_2 = (0, 1, 0)^T. Then, Figure 4.2 shows the gating values g_1(x) for Expert 1 over the range -5 ≤ x_1 ≤ 5, -5 ≤ x_2 ≤ 5. As can be seen, we have g_1(x) > 0.5 in the input subspace x_1 - x_2 < 0. Thus, with the given gating parameters, Expert 1 mainly models observations in this subspace. Overall, the gating network causes a soft linear partitioning of the input space along the line x_1 - x_2 = 0 that separates the two experts.

Figure 4.2: Plot of the softmax function g_1(x) by Eq. (4.5) with inputs x = (1, x_1, x_2)^T, and gating parameters v_1 = (0, 0, 1), v_2 = (0, 1, 0).
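As a quick numerical illustration of this example (the test points are arbitrary), the gating value g_1(x) of Eq. (4.5) can be evaluated for the stated gating vectors; points with x_1 - x_2 < 0 indeed yield g_1(x) > 0.5:

import numpy as np

def gating(x, V):
    # Softmax gating of Eq. (4.5) for a single input x and gating vectors V.
    a = V @ x
    e = np.exp(a - a.max())
    return e / e.sum()

V = np.array([[0.0, 0.0, 1.0],    # v_1
              [0.0, 1.0, 0.0]])   # v_2
for x1, x2 in [(-2.0, 2.0), (2.0, -2.0), (1.0, 1.0)]:
    x = np.array([1.0, x1, x2])
    g1 = gating(x, V)[0]
    print(f"x1 - x2 = {x1 - x2:+.1f}  ->  g_1(x) = {g1:.3f}")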

4.1.3 Training by Expectation-Maximisation

Rather than using gradient descent to find the experts and gating network parameters θ that maximise the log-likelihood Eq. (4.9), as done in [121], we can make use of the latent variable structure and apply the expectation-maximisation (EM) algorithm [70, 122]. It begins with the observation that maximisation of the likelihood is simplified if the values of the latent variables were known, as in Eq. (4.3). Hence, assuming that Z is part of the data, D = {X, Y} is referred to as the incomplete data, and D ∪ {Z} = {X, Y, Z} is known as the complete data. The EM-algorithm proceeds with the expectation step, by finding the expectation of the complete data log-likelihood E_Z(l(θ; D ∪ {Z})) with the current model parameters θ fixed, where l(θ; D ∪ {Z}) \equiv \ln p(Y, Z | X, θ) is the logarithm of the joint density of the outputs and the values of the latent variables. In the maximisation step the above expectation is maximised with respect to the model parameters. When iterating this procedure, the incomplete data log-likelihood l(θ; D) increases monotonically until a maximum is reached, as proved in [176]. More details on the application of the EM-algorithm to train the MOE model are given in [122]. Now we will consider each step in turn.

The Expectation Step

Using Eq. (4.7), the complete-data log-likelihood is given by

l(θ; D ∪ {Z}) \equiv \ln p(Y, Z | X, θ) = \ln \prod_{n=1}^{N} p(y_n, z_n | x_n, θ) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \big( \ln g_k(x_n) + \ln p(y_n | x_n, θ_k) \big), \qquad (4.10)

where θ is the set of expert parameters {θ_1, ..., θ_K} and gating parameters V. When fixing these parameters, the latent variables are the only random variables in the likelihood, and hence its expectation is

E_Z\big( l(θ; D ∪ {Z}) \big) = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \big( \ln g_k(x_n) + \ln p(y_n | x_n, θ_k) \big), \qquad (4.11)

where r_nk \equiv E(z_nk) is commonly referred to as the responsibility of expert k for observation n [19] and by the use of Bayes' rule and Eq. (4.8) evaluates to

r_{nk} \equiv E(z_{nk}) = p(z_{nk} = 1 | x_n, y_n, θ) = \frac{p(z_{nk} = 1 | x_n, v_k) \, p(y_n | x_n, θ_k)}{p(y_n | x_n, θ)} = \frac{g_k(x_n) \, p(y_n | x_n, θ_k)}{\sum_{j=1}^{K} g_j(x_n) \, p(y_n | x_n, θ_j)}. \qquad (4.12)

Hence, the responsibilities are distributed according to the current gating and goodness-of-fit of an expert in relation to the gating and goodness-of-fit of the other experts.
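A minimal sketch of this expectation step: given current gating values and expert densities (both made up here), the responsibilities of Eq. (4.12) follow by normalising their element-wise product over the experts.

import numpy as np

def responsibilities(gating, densities):
    # Eq. (4.12): r_nk = g_k(x_n) p(y_n | x_n, theta_k) / sum_j g_j(x_n) p(y_n | x_n, theta_j).
    joint = gating * densities                  # element-wise, shape N x K
    return joint / joint.sum(axis=1, keepdims=True)

# Hypothetical values for 3 observations and 2 experts.
gating = np.array([[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]])
densities = np.array([[0.1, 0.4], [0.3, 0.3], [0.6, 0.05]])
r = responsibilities(gating, densities)
print(r)                  # each row sums to 1
print(r.sum(axis=1))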

The Maximisation Step

In the maximisation step we aim at adjusting the model parameters to maximise the expected complete data log-likelihood. g_k(x_n) and p(y_n | x_n, θ_k) do not share any parameters, and so maximising Eq. (4.11) results in the two independent maximisation problems

\max_V \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \ln g_k(x_n), \qquad (4.13)

\max_θ \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \ln p(y_n | x_n, θ_k). \qquad (4.14)

Note that the responsibilities are evaluated with the previous model parameters and are not considered as being functions of these parameters. The function concerning the gating parameters V can be maximised by the Iteratively Re-weighted Least Squares (IRLS) algorithm as described in Chapter 6 (see also [122, 19]). The expert parameters can be modified independently, and the method depends on the expert model. We will describe the maximisation step when introducing the linear expert model in Section 4.2.

To summarise, we can maximise l(θ; D) by iterating over the expectation and the maximisation steps. In the expectation step, the responsibilities are computed for the current model parameters. In the maximisation step, the model parameters are updated with the computed responsibilities. Convergence of the algorithm can be determined by monitoring the result of Eq. (4.9).
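To make the procedure concrete, the following is a minimal sketch (not the thesis' implementation) of such an EM loop for two linear experts with a fixed, assumed noise precision; the gating update uses a plain gradient ascent step as a stand-in for the IRLS algorithm of Chapter 6, and all data and parameter choices are invented for illustration.

import numpy as np

rng = np.random.default_rng(1)
N, K = 200, 2
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, N)])    # inputs with bias element
y = np.where(X[:, 1] < 0, 1.0, -1.0) * X[:, 1] + rng.normal(0, 0.1, N)

W = rng.normal(size=(K, 2))     # expert weight vectors
V = np.zeros((K, 2))            # gating vectors
tau = 10.0                      # fixed noise precision (assumption of this sketch)

def gating(X, V):
    a = X @ V.T
    a -= a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

for it in range(50):
    # E-step: responsibilities, Eq. (4.12), with Gaussian expert densities.
    dens = np.sqrt(tau / (2 * np.pi)) * np.exp(-0.5 * tau * (X @ W.T - y[:, None]) ** 2)
    g = gating(X, V)
    r = g * dens
    r /= r.sum(axis=1, keepdims=True)

    # M-step, experts: weighted least squares, Eq. (4.16), solved per expert.
    for k in range(K):
        Rk = r[:, k]
        A = X.T @ (Rk[:, None] * X)
        b = X.T @ (Rk * y)
        W[k] = np.linalg.solve(A + 1e-8 * np.eye(2), b)

    # M-step, gating: one gradient ascent step on Eq. (4.13)
    # (a simple stand-in for the IRLS algorithm of Chapter 6).
    V += 0.1 * (r - g).T @ X / N

# Monitor convergence by the incomplete-data log-likelihood, Eq. (4.9).
dens = np.sqrt(tau / (2 * np.pi)) * np.exp(-0.5 * tau * (X @ W.T - y[:, None]) ** 2)
print(np.sum(np.log(np.sum(gating(X, V) * dens, axis=1))))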

4.1.4 Localisation by Interaction

The experts in the standard MOE model are localised in the input space through the interaction of expert and gating network training: after the gating is randomly initialised, the responsibilities are calculated by Eq. (4.12) according to how well the experts fit the data in the areas of the input space that they are assigned to. In the maximisation step, performing Eq. (4.13) tunes the gating parameters such that the gating network best fits the previously calculated responsibilities. Equation (4.14) causes the experts to be only trained on the areas that they are assigned to by the responsibilities. The next expectation step re-evaluates the responsibilities according to the new fit of the experts, and the maximisation step adapts the gating network and the experts again. Hence, iterating the expectation and the maximisation step causes the experts to be distributed according to their best fit to the data.

The pattern of localisation is determined by the form of the gating model. As previously demonstrated, the softmax function causes a soft linear partition of the input space. Thus, the underlying assumption of the model is that the data was generated by some processes that are linearly separated in the input space. The model structure becomes richer by adding hierarchies to the gating network, as done in [122]. However, adding hierarchies to MOE moves them away from LCS, and thus we will not discuss this extension.

4.1.5 Training Issues

The likelihood function of MOE is neither concave nor unimodal [20]. Hence, training it by using a hill-climbing procedure such as the EM-algorithm will not guarantee that we find the global maximum. Several approaches have been developed to deal with this problem (for example, [20, 5]), all of which are either based on random restarts or stochastic global optimisers. Hence, they require several training epochs and/or a long training time. While this is not an issue for MOE where the global optimum only needs to be found once, it is not an option for LCS where the model needs to be (at least partially) re-trained for each change in the model structure. A potential LCS-related solution will be presented in Section 4.4.

4.2 Linear Expert Models

Even though experts can be used for both regression and classification, we are only concerned with regression and therefore will only deal with the standard expert regression model for MOE, which is the linear model [122], and in our LCS-related case the univariate linear model. For each expert k, it is characterised by a linear relation of the input x and the adjustable parameter w_k, which is a vector of the same size as the input. Hence, the relation between the input x and the output y is modelled by a hyper-plane. Additionally, the stochasticity and measurement noise are modelled by a Gaussian. Overall, the probabilistic model for expert k is given by

p(y | x, w_k, τ_k) = \mathcal{N}(y | w_k^T x, τ_k^{-1}) = \left( \frac{τ_k}{2π} \right)^{1/2} \exp\left( -\frac{τ_k}{2} (w_k^T x - y)^2 \right), \qquad (4.15)

where \mathcal{N} stands for a Gaussian, and the model parameters θ_k = {w_k, τ_k} are the D_X-dimensional weight vector w_k and the noise precision (that is, inverse variance) τ_k. The distribution is centred on the inner product w_k^T x, and its spread is inversely proportional to τ_k and independent of the input.

As we give a detailed discussion about the implications of assuming this expert model and various forms of its incremental training in Chapter 5, let us here only consider how it specifies the maximisation step of the EM-algorithm for training the MOE model, in particular with respect to the weight vector w_k: combining Eqs. (4.14) and (4.15), the term to maximise becomes

\sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \ln p(y_n | x_n, w_k, τ_k) = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \left( \frac{1}{2} \ln \frac{τ_k}{2π} - \frac{τ_k}{2} (w_k^T x_n - y_n)^2 \right) = - \sum_{k=1}^{K} \frac{τ_k}{2} \sum_{n=1}^{N} r_{nk} (w_k^T x_n - y_n)^2 + \text{const.},

where the constant term absorbs all terms that are independent of the weight vectors. Considering the experts separately, the aim for expert k is to find

\min_{w_k} \sum_{n=1}^{N} r_{nk} (w_k^T x_n - y_n)^2, \qquad (4.16)

which is a weighted linear least squares problem. This shows how the assumption of Gaussian noise locally leads to minimising the empirical risk with the L_2 loss function.
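For reference, Eq. (4.16) can be solved in closed form; multiplying inputs and outputs by the square roots of the responsibilities turns it into an ordinary least squares problem, as the following sketch (with hypothetical data and responsibilities) shows:

import numpy as np

def fit_expert(X, y, r_k):
    # Solve Eq. (4.16): scaling inputs and outputs by sqrt(r_nk) turns the
    # responsibility-weighted problem into an ordinary least squares problem.
    s = np.sqrt(r_k)
    w_k, *_ = np.linalg.lstsq(s[:, None] * X, s * y, rcond=None)
    return w_k

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])
y = 0.5 + 2.0 * X[:, 1] + rng.normal(0, 0.1, 50)    # hypothetical linear data
r_k = rng.uniform(0, 1, 50)                          # hypothetical responsibilities
print(fit_expert(X, y, r_k))                         # roughly (0.5, 2.0)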

While the concepts introduced in the following sections are valid for any form of expert models, a detailed description of how to train linear expert models to find both the weight vector and the model precision is postponed to Chapter 5.

4.3 Generalising the MoE Model

The standard MOE model assumes that each observation was generated by one and only one expert. In this section we will make the model more LCS-like by replacing the term “expert” with “classifier”, and by introducing the additional assumption that a classifier can only have produced the observation if it matches the corresponding input. The following sections implement this assumption and discuss its implications.

4.3.1 An Additional Layer of Forced Localisation

Let us recall that for a certain observation (x, y), the latent variable z determines which classifier generated this observation. The generalisation that is introduced assumes that a classifier k can only have generated this observation, that is, z_k = 1, if it matches the corresponding input.

Let us introduce an additional binary random vector m = (m_1, ..., m_K)^T, each element being associated with one classifier 1. The elements of m are 1 if and only if the associated classifier matches the current input x, and 0 otherwise. Unlike z, m does not comply with the 1-of-K structure, as more than one classifier can match the same input. The elements of the random vector are linked to the matching function by

p(m_k = 1 | x) = m_k(x), \qquad (4.17)

that is, the value of a classifier’s matching function determines the probability of that classifier matching a certain input.

To enforce matching, we re-define Eq. (4.4), that is, the probability for classifier k having generated observation (x, y), to be

p(z_k = 1 | x, v_k, m_k) \propto \begin{cases} \exp(v_k^T φ(x)) & \text{if } m_k = 1 \text{ for } x, \\ 0 & \text{otherwise}, \end{cases} \qquad (4.18)

1 While the symbol m also refers to the matching function, its use as either the matching function or the random variable that determines matching is apparent from its context.

where φ is a transfer function, whose purpose we will explain later and which we can for now assume to be the identity function. Thus, the differences from the previous definition Eq. (4.4) are the additional transfer function and the condition on m_k that locks the generation probability to 0 if the classifier does

not match the input. We remove the condition on mk by marginalising over it, to get

g_k(x) \equiv p(z_k = 1 | x, v_k) \propto \sum_{m \in \{0,1\}} p(z_k = 1 | x, v_k, m_k = m) \, p(m_k = m | x) = 0 + p(z_k = 1 | x, v_k, m_k = 1) \, p(m_k = 1 | x) = m_k(x) \exp(v_k^T φ(x)). \qquad (4.19)

Adding the normalisation term, the gating network is now defined by

g_k(x) \equiv p(z_k = 1 | x, v_k) = \frac{m_k(x) \exp(v_k^T φ(x))}{\sum_{j=1}^{K} m_j(x) \exp(v_j^T φ(x))}. \qquad (4.20)

As can be seen when comparing it to Eq. (4.5), the additional layer of localisation is specified by the matching function, which reduces the gating to g_k(x) = 0 if the classifier does not match x, that is, if m_k(x) = 0.
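A small sketch of Eq. (4.20) (hypothetical matching functions and gating vectors; not part of the thesis); note that the normalisation runs only over classifiers that match the input, so a classifier that matches alone receives a gating value of 1:

import numpy as np

def generalised_gating(x, phi, matching, V):
    # Eq. (4.20): g_k(x) = m_k(x) exp(v_k^T phi(x)) / sum_j m_j(x) exp(v_j^T phi(x)).
    m = np.array([m_k(x) for m_k in matching])
    a = m * np.exp(V @ phi(x))
    return a / a.sum()

# Two classifiers: the first matches a circle of radius 3 around the origin,
# the second matches everywhere (similar to Example 4.3.1 below).
matching = [lambda x: 1.0 if np.hypot(x[1], x[2]) <= 3.0 else 0.0,
            lambda x: 1.0]
V = np.array([[0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
phi = lambda x: x                        # identity transfer function
print(generalised_gating(np.array([1.0, 1.0, 1.0]), phi, matching, V))   # both match
print(generalised_gating(np.array([1.0, 4.0, 4.0]), phi, matching, V))   # only classifier 2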

Figure 4.3: Directed graphical model of the generalised Mixtures-of-Experts model. See the caption of Figure 4.1 for instructions on how to read this graph. When compared to the Mixtures-of-Experts model in Figure 4.1, the latent variables z_nk depend additionally on the matching random variables m_nk, whose values are determined by the matching functions m_k and the inputs x_n.

As classifiers can only generate observations if they match the corresponding input, the classifier model itself does not require any modification. Additionally, Eq. (4.9) is still valid, as z_k = 1 only if m_k = 1 by Eq. (4.18). Figure 4.3 shows the graphical model that, when compared to Figure 4.1, illustrates the changes that are introduced by generalising the MoE model.

4.3.2 Updated Expectation-Maximisation Training

The only modifications to the standard MOE are changes to the gating network, expressed by g_k. As Eqs. (4.12), (4.13) and (4.14) are independent of the functional form of g_k, they are still valid for the generalised MOE. Therefore, the expectation step of the EM-algorithm is again performed by evaluating the responsibilities by Eq. (4.12), and the gating and classifier models are updated by Eqs. (4.13) and (4.14). Convergence of the algorithm is again monitored by Eq. (4.9).

4.3.3 Implications on Localisation

Localisation of the classifiers is achieved on one hand by the matching function of the classifiers, and on the other hand by the combined training of gating networks and classifiers.

Let us first consider the case when the nth observation (xn, yn) is matched by one and only one classifier k, that is mj(xn) = 1 only if j = k, and mj(xn) = 0 otherwise. Hence, by Eq. (4.20), gj(xn) = 1 only if j = k, and gj(xn) = 0 otherwise, and consequently by Eq. (4.12), rnj = 1 only if j = k, and rnj = 0 otherwise. Therefore, full responsibility for the observation is given to the one and only matching classifier, independent of its goodness-of-fit.

On the other hand, let us assume that the same observation (x_n, y_n) is matched by all classifiers, that is m_j(x_n) = 1 for all j ∈ {1, ..., K}, and assume the identity transfer function φ(x) = x. In that case, Eq. (4.20) reduces to the standard MOE gating network Eq. (4.5) and we perform a soft linear partitioning as described in Section 4.1.4.

In summary, localisation by matching determines for which areas of the input space the classifiers attempt to model the observations. In areas where they match, they are distributed by soft linear partitions as in the standard MOE model. Hence, we can acquire a two-layer intuition on how localisation is performed: matching determines the rough areas where classifiers are responsible to model the observations, and the softmax function then performs the fine-tuning in areas of overlap between classifiers.

4.3.4 Relation to Standard MoE Model

The only difference between the generalised MOE model and the standard

MOE model is the definition of the gating model gk. Comparing the standard model Eq. (4.5) with its generalisation Eq. (4.20), we can see that the standard

model is recovered from the generalisation by having mk(x) = 1 for all k and x, and the identity transfer function φ(x)= x for all x. Defining the matching functions in such a way is equivalent to having each classifier match all inputs. Hence, we have a set of classifiers that all match the whole input space, and localisation is performed by soft linear partitioning of the gating network.

4.3.5 Relation to LCS

This generalised MOE model satisfies all characteristics of LCS that we have outlined in Section 3.2: each classifier describes a localised model with its localisation determined by the model structure, and the local models are combined to form a global model. So given that we can train the model efficiently, and have a good mechanism for searching the space of model structures, do we already have an LCS? While some LCS researchers might disagree (partially because there is no universal definition of what an LCS is, and because LCS appear to us to be mostly thought of in algorithmic terms rather than in terms of the model that they describe), we believe that this is the case.

However, the generalised MOE model has a feature that no LCS has ever used: beyond the localisation of classifiers by their matching function, the responsibilities of classifiers that share matching inputs are further distributed by the softmax function. While this feature might lead to a better fit of the model to the data, it blurs the observation/classifier association by extending it beyond the matching function. Nonetheless, we can use the introduced transfer function φ to level this effect: when defined as the identity function φ(x) = x, then by Eq. (4.19) the probability of a certain classifier generating an observation for a matching input is log-linearly related to the input x. However, by setting φ(x) = 1 for all x, the relation is reduced to g_k(x) ∝ m_k(x) exp(v_k), where the gating parameter v_k reduces to a scalar. Hence, the gating weight becomes independent of the input (besides the matching) and only relies on the constant v_k through exp(v_k). In areas of the input space that several classifiers match, classifiers with a larger v_k have a stronger influence when forming a prediction of the global model, as they have a higher gating weight. To summarise, setting φ(x) = 1 makes gating independent of the input (besides the matching), and the gating weight for each classifier is determined by a single scalar that is independent of the current input x that it matches. We will discuss further details and alternative models for the gating network in Chapter 6.

Note that φ(x) = 1 is not applicable in the standard MOE model, that is, when all classifiers match the full input space. In this case, we have neither localisation by matching nor by the softmax function. Hence, the global model is not better at modelling the data than a single local model applied to the whole data.

Figure 4.4: Plots showing the generalised softmax function Eq. (4.20) for 2 classifiers with inputs x = (1, x_1, x_2)^T and φ(x) = x, where Classifier 1 in plot (a) has gating parameters v_1 = (0, 0, 1)^T and matches a circle of radius 3 around the origin, and Classifier 2 in plot (b) has gating parameters v_2 = (0, 1, 0)^T and matches all inputs.

Example 4.3.1 (Localisation by Matching and the Softmax Function). Consider the same setting as in Example 4.1.1, and additionally φ(x) = x for all x and the matching functions

m_1(x) = \begin{cases} 1 & \text{if } \sqrt{x_1^2 + x_2^2} \le 3, \\ 0 & \text{otherwise}, \end{cases} \qquad (4.21)

and m_2(x) = 1 for all x. Therefore, Classifier 1 matches a circle of radius 3 around the origin, and Classifier 2 matches the whole input space. The values for g_1(x) and g_2(x) are shown in Figures 4.4(a) and 4.4(b), respectively. As can be seen, the whole part of the input space that is not matched by Classifier 1 is fully assigned to Classifier 2 by g_2(x) = 1. In the circular area where both classifiers match, the softmax function performs a soft linear partitioning of the input space, just as in Figure 4.2.

The effect of changing the transfer function to φ(x) = 1 is visualised in Figure 4.5, which shows that in such a case no linear partitioning takes place. Rather, in areas of the input space that both classifiers match, Eq. (4.20) assigns the generation probabilities input-independently in proportion to the exponential of the gating parameters v_1 = 0.7 and v_2 = 0.3.

Figure 4.5: Plots showing the generalised softmax function Eq. (4.20) for 2 classifiers with inputs x = (1, x_1, x_2)^T and φ(x) = 1, where Classifier 1 in plot (a) has gating parameter v_1 = 0.7 and matches a circle of radius 3 around the origin, and Classifier 2 in plot (b) has gating parameter v_2 = 0.3 and matches all inputs.

Besides localisation beyond matching, the generalised MOE model has another feature that distinguishes it from any previous LCS 2: it allows for matching by a degree in the range [0, 1] rather than by just specifying where a classifier matches and where it does not (as, for example, specified by the set X_k and Eq. (3.8)). Additionally, by Eq. (4.17), this degree has the well-defined meaning of the probability p(m_k = 1 | x) of classifier k matching input x. Alternatively, by observing that E(m_k | x) = p(m_k = 1 | x), this degree can also be interpreted as the expectation of the classifier matching the corresponding input. Overall, matching by a degree allows the specification of soft boundaries of the matched areas, which can be interpreted as the uncertainty about the exact area to match 3, justified by the limited number of data available. This might solve issues with hard classifier matching boundaries when searching for good model structures, which can occur when the input space X is very large or even infinite, leading to a possibly infinite number of possible model structures. In that case, smoothing the classifier matching boundaries makes fully covering the input space with classifiers easier. A more detailed investigation of the advantages of matching by degree is left as future research.

2 While Butz seems to have experimented with matching by a degree in [41], he does not describe how it is implemented and only states that “Preliminary experiments in that respect [. . . ] did not yield any further improvement in performance”. Furthermore, his hyper-ellipsoidal conditions [41, 52] might look like matching by degree on initial inspection, but as he determines matching by a threshold on the basis function, matching is still binary. Fuzzy LCS (for example, [60]), on the other hand, provide matching by degree but are usually not developed from the bottom up, which makes modifying the parameter update equations difficult.

3 Thanks to Dr. Dan Richardson, University of Bath, for this interpretation.

4.3.6 Training Issues

If each input is only matched by a single classifier, each classifier model is trained separately, and the problem of getting stuck in local maxima does not occur, analogous to the discussion that we will present in Section 4.4.3. Classifiers with overlapping matching areas, on the other hand, cause the same training issues as already discussed for the standard MOE model in Section 4.1.5, which causes the model training to be time-consuming.

LCS training is, in our approach, conceptually split into two parts: training the model for a fixed model structure, and searching the space of possible model structures. To do the latter, the evaluation of a single model structure by training the model needs to be efficient. Hence, the current training strategy is hardly a viable option. However, identifying the cause of the local maxima allows us to modify the model to avoid them and therefore make model training more efficient, as we will show in the next section.

4.4 Independent Classifier Training

The assumption of the standard MOE model is that any observation is generated by one and only one classifier. We have generalised this model by adding the restriction that any classifier can only have generated an observation if it matches the input associated with this observation, thereby adding an additional layer of forced localisation of the classifiers in the input space.

Here we introduce a change rather than a generalisation to the model assumptions: as before we assume that the data is generated by a combination of localised processes, but the role of a classifier is changed from cooperating with other classifiers in order to locally model the observations that it matches, to modelling all observations that it matches, independent of the other classifiers that match the same inputs. This distinction becomes clearer once we have discussed the formal differences in Sections 4.4.2 and 4.4.3.

The motivation behind this change is twofold: firstly, it removes local maxima and thus simplifies classifier training, and secondly, it simplifies the intuition behind what a classifier models. We start by discussing these motivations in more detail in the following section, followed by their implication on training the model and the assumptions about the data-generating process.

4.4.1 The Origin of Local Maxima

Following the discussion in Section 4.1.5, local maxima of the likelihood function are the result of the simultaneous training of the classifiers and the gating network. In the standard MOE model, this simultaneous training is necessary to provide the localisation of the classifiers in the input space. In our generalisation, on the other hand, a preliminary layer of localisation is provided by the matching function, and the interaction between classifiers and the gating network is only required for inputs that are matched by more than one classifier. This was already demonstrated in Section 4.3.3, where we have shown that classifiers acquire full responsibility for inputs that they match alone. Hence, in the generalised MOE, local maxima only arise when classifiers overlap in the input space.

4.4.2 What does a Classifier Model?

By Eq. (4.14), a classifier aims at maximising the sum of log-likelihoods of all observations, weighted by the responsibilities. By Eqs. (4.12) and (4.20), these responsibilities can only be non-zero if the classifier matches the corresponding inputs, that is, r_nk > 0 only if m_k(x_n) > 0. Hence, by maximising Eq. (4.14), a classifier only considers observations that it matches.

Given that an observation (x_n, y_n) is matched by a single classifier k, we have established in Section 4.3.3 that r_nk = 1 and r_nj = 0 for all j ≠ k. Hence, Eq. (4.14) assigns full weight to classifier k when maximising the likelihood of this observation. Consequently, given that all observations that a classifier matches are matched by only this classifier, the classifier models these observations in full, independent of the other classifiers 4.

Let us consider how observations are modelled that are matched by more than one classifier: as a consequence of Eq. (4.12), the non-negative responsibilities of all matching classifiers sum up to 1, and are therefore between 0 and 1. Hence, by Eq. (4.14), each matching classifier assigns less weight to modelling the observation than if it were the only classifier matching it. Intuitively, overlapping classifiers “share” the observation when modelling it.

We have now established that i) a classifier only models observations that it matches, ii) it assigns full weight to observations that no other classifier matches, and iii) it assigns partial weight to observations that other classifiers match. Expressed differently, a classifier fully models all observations that it matches alone, and partially models observations that itself and other classifiers match. Consequently, the local model provided by a classifier cannot be interpreted by its matching function alone, but also requires knowledge of the gating network parameters. Additionally, when changing the model structure as discussed in Section 3.2.6 by adding, removing, or changing the localisation of classifiers, all other overlapping classifiers need to be re-trained, as their model is now incorrect due to changing responsibilities. We can avoid these problems and make the classifier model more transparent if we train the classifiers independently of each other.

4 XCS has the tendency to evolve sets of classifiers with little overlap in the areas that they match. In such cases, all classifiers model their assigned observations in full, independent of whether they are trained independently or in combination.

4.4.3 Introducing Independent Classifier Training

Classifiers are trained independently if we replace the responsibilities rnk in

Eq. (4.14) by the matching functions mk(xn) to get

\max_θ \sum_{n=1}^{N} \sum_{k=1}^{K} m_k(x_n) \ln p(y_n | x_n, θ_k). \qquad (4.22)

Hence, a classifier models all observations that it matches, independent of the other classifiers. We have reached our first goal, which was to simplify the intuition about what a single classifier models. While this does not cause any change for observations that are matched by a single classifier, observations that are matched by several classifiers are modelled by each of these classifiers independently rather than shared between them. This independence is shown by the graphical model in Figure 4.6, which illustrates the model of a single classifier k.
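Under independent training, each linear classifier therefore solves the weighted least squares problem of Eq. (4.16) with the responsibilities replaced by its matching values. A sketch with hypothetical data and interval matching (not part of the thesis):

import numpy as np

def train_classifier_independently(X, y, m_k):
    # Maximise Eq. (4.22) for one linear classifier: a least squares fit over
    # all observations, weighted only by the matching function m_k(x_n).
    weights = np.array([m_k(x) for x in X])
    s = np.sqrt(weights)
    w_k, *_ = np.linalg.lstsq(s[:, None] * X, s * y, rcond=None)
    return w_k

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(100), rng.uniform(-2, 2, 100)])
y = 1.0 - X[:, 1] + rng.normal(0, 0.1, 100)             # hypothetical data
m_k = lambda x: 1.0 if 0.0 <= x[1] <= 1.0 else 0.0       # interval matching on x
print(train_classifier_independently(X, y, m_k))          # fit uses matched data only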

With this change, the classifiers are independent of the responsibilities and subsequently also of the gating network. Thus, they can be trained completely independently, and we can modify the model structure by adding, removing, or changing classifier locations without re-training the other classifiers that are currently in the model, and thereby make searching the space of possible model structures more efficient.

An additional consequence of classifiers being trained independently of the responsibilities is that for standard choices of the local models (see, for example, [122]), the log-likelihood Eq. (4.22) is concave for each classifier. Therefore, it

Figure 4.6: Directed graphical model for training classifier k independently. See the caption of Figure 4.1 for instructions on how to read this graph. Note that the values of the matching random variables m_nk are determined by the matching function m_k and the inputs x_n.

has a unique maximum and consequently we cannot get stuck in local maxima when training individual classifiers.

4.4.4 Training the Gating Network

Training the gating network remains unchanged, and therefore is described by Eqs. (4.12) and (4.13). Given a set of trained classifiers, the responsibilities are fully specified by evaluating Eq. (4.12). Hence, the log-likelihood of the gating network Eq. (4.13) is a concave function (for example, [20]), and therefore has a unique maximum.

Thus, the classifier models have unique optima and can be trained independently of the gating network by maximising a concave log-likelihood function. Furthermore, the gating network depends on the goodness-of-fit of the classifiers, but as they are trained independently, the log-likelihood function of the gating network is also concave. Therefore, the complete model has a unique maximum likelihood, and as a consequence we have reached our second goal, which was to remove local maxima to ease training of the model.

4.4.5 Implications on Likelihood and Assumptions about the Data

By letting a classifier model match each observation with equal weight, we are violating the assumption that each observation was generated by one and only one classifier for observations that are matched by more than one classifier. Rather we can interpret each classifier as a hypothesis for a data-generating process that generated all observations of the matched area of the input space.

The gating network, on the other hand, was previously responsible for modelling the probabilities of some classifier having produced some observation, and the classifiers were trained according to this probability. While the gating network still has the same purpose when the classifiers are trained independently, the estimated probability is not fed back to the classifiers anymore. The cost of this lack of feedback is a worse fit of the model to the data, which results in a lower likelihood of the data given the model structure.

Note, however, that independent classifier training only causes a change in the likelihood in areas where more than one classifier matches the same input. Hence, we only get a lower likelihood if classifiers have large areas of overlap, and it is doubtful that such a solution is ever desired. Nonetheless, the potentially worse fit needs to be offset by the model structure search to find solutions with sufficiently minimal overlap between different classifiers.

As the gating network is not gating the observations to the different classifiers anymore, but rather mixes the independently trained classifier models to best explain the available observations, we will, in the remaining chapters, refer to it as the mixing model rather than the gating network.

4.5 Discussion and Summary

Starting with the probabilistic MOE model as given in [122], we have generalised it by adding matching as a form of forced localisation of the experts, which makes the model similar to LCS. Additionally, we have simplified its training by handling the classifiers independently of the gating network. As a result, we have a probabilistic model for LCS that can act as the basis of further development. In fact, solving Eq. (4.22) to train the classifiers forms the basis of the next chapter. The chapter thereafter deals with the mixing model by describing how the solution to Eq. (4.13) can be found exactly, and by approximation. Thus, in combination, the following two chapters describe in detail how the model can be trained by maximum likelihood.

Even though we have approached the LCS model from a different perspective, the resulting structure is very similar to a currently existing LCS: XCS and its derivatives follow the same path of independently training the classifier models and combining them by a mixing model. While it is not explicitly identified in XCS that the classifiers are indeed trained independently, this fact becomes apparent in the next chapter, where we show that the classifier parameter update equations that result from independent classifier training resemble those of XCS. The mixing model used by XCS does not conform to the generalised softmax function but rather relies on heuristics, as we will show in Chapter 6.

ZCS [239], on the other hand, differs from the presented model as its classifier training is not independent. If we consider single-step tasks where the reward r is immediately presented after each action is performed, then the strength w_k of each classifier is updated at time t + 1 after receiving reward r_{t+1} by

w_{k,t+1} = w_{k,t} + m_k(x_{t+1}) γ ( r_{t+1} / Σ_j m_j(x_{t+1}) − w_{k,t} ),   (4.23)

where w_{k,t} is the strength estimate of classifier k at time t, x_{t+1} is the input at time t + 1 that is associated with the reward r_{t+1}, and γ is a scalar step size^5. The m_k(x_{t+1}) causes the algorithm to only perform updates when the classifier matches the input x_{t+1}, as otherwise m_k(x_{t+1}) = 0. As we will discuss in the next chapter, the above is the Least Mean Squared (LMS) algorithm that aims at minimising

Σ_t m_k(x_t) ( r_t / Σ_j m_j(x_t) − w_k )².   (4.24)

Thus, the strength w_k of each classifier represents the average shared reward over all states that it matches, where the reward r_t is shared by all classifiers that match x_t. This sharing, known as fitness sharing due to the strength of a classifier also being its fitness, causes the strength of a classifier to depend on how many classifiers match the same inputs. Thus, the classifiers are trained in combination rather than independently. Due to its strong relation to ZCS, the above discussion also applies to the LCS developed by Wada et al. in [227].

^5 In [239], r_t is denoted r_imm, γ is given by β, Σ_j m_j(x_{t+1}) is the size |A| of the current action set A, and w_k has no explicit symbol.
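To make this concrete, the following is a minimal sketch (not taken from [239]) of the fitness-sharing strength update of Eq. (4.23) for a single-step task; the classifier list, the binary matching test matches(k, x) and the step size gamma are illustrative assumptions only.

def zcs_strength_update(strengths, matches, x, reward, gamma):
    # strengths: list of classifier strengths w_k; matches(k, x): 0/1 matching function
    matched = [k for k in range(len(strengths)) if matches(k, x) > 0]
    if not matched:
        return strengths
    shared = reward / len(matched)          # reward shared over the matching classifiers
    for k in matched:
        # LMS step towards the shared reward, Eq. (4.23)
        strengths[k] += gamma * (shared - strengths[k])
    return strengths

Because all matching classifiers move towards the same shared reward, the strength of one classifier depends on how many others match the same inputs, which is exactly why this training is not independent.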

In [23], Booker develops a different LCS model based on a tile-coding representation whose functional form can be collapsed to a linear model. In that sense it is similar to ZCS, as for a constant model structure ZCS can also be described by a linear model [227]. LCS that are for all model structures describable by linear models are very likely not to train their classifiers independently^6. For example, consider the linear model f(x; θ) = Σ_k g_k(x) θ_k, and an input x that is matched by two classifiers. Note that matching is induced by g_k(x), which is input-dependent and therefore, to keep the model linear, cannot be a function of the model parameters θ. Given that the state x is matched by more than one classifier, we have g_k(x) > 0 for at least two different k. The prediction f(x; θ) is then formed by the linear weighted combination of the θ_k for these k's. Therefore, the parameters of several classifiers depend on each other in providing f(x; θ), and given that the g_k(x)'s do not always sum up to 1, they cannot be trained independently. Even though using linear models avoids local optima in the training process, another problem emerges: due to the interdependence of the classifiers, changing the matching function of one of them requires all other classifiers to be re-trained, which makes model structure search less efficient. Also, it becomes harder to determine the quality of a single classifier to guide the model structure search^7. Furthermore, there is no clear interpretation for the model provided by a single classifier. Despite linear models being easier to analyse, all these reasons support using independent classifier training in LCS.

^6 Possibly the only way to use a linear model and still train the classifiers independently is, for each input, to average over all matching classifier models, as introduced as a modification to XCS by Wada et al. [226]. In such a case, the mixing model is g_k(x) = m_k(x) / Σ_k̄ m_k̄(x) and is therefore only dependent on the model structure and independent of the model parameters.

^7 Booker proposed to consider classifiers with low parameter values as bad classifiers, as "The ones with large weights are the most important terms in the approximation" [24], but would that also work in cases where low parameter values are actually good parameter values? One can easily imagine a part of a function that is constantly 0 and thus requires 0 parameter values to model it.

Independent classifier training moves LCS closer to ensemble learning. This similarity has been exploited recently in [29, 164], where knowledge from ensemble learning and other machine learning methods has been used to improve the performance of UCS [162]. Even though this direction is very promising, we will not consider the link between LCS and ensemble learning in this work.

In summary, amongst currently popular LCS, the presented model is most similar to XCS(F). It combines independently trained classifiers by a mixing model to provide a global model that aims at explaining the given observa- tions. While LCS with independently trained classifiers are only one particu- lar type of LCS, we have chosen to concentrate on this particular type due to the obvious advantages discussed above. As classifiers are trained indepen- dently of each other, we can concentrate on the training of a single classifier, as we will do in the following chapter.


Chapter 5

Training the Classifiers

The model of a set of classifiers consists of the classifiers themselves and the mixing model. The classifiers are localised linear models that are trained inde- pendently of each other, and their localisation is determined by the matching

function mk. This chapter is entirely devoted to the training of a single classi- fier.

We have already introduced the linear model that a classifier assumes in Sec- tion 4.2, but here we provide more details about its underlying assumptions, and how it can be trained in both a batch learning and an incremental learn- ing way. Most of the concepts and methods in this chapter are well known in statistics (for example, [96]) and adaptive filter theory (for example, [106]), but have not been put into the context of LCS before.

In training a classifier we focus on solving Eq. (4.22), which emerges from applying the principle of maximum likelihood to the LCS model. By maximising the likelihood we minimise the empirical risk rather than the expected risk, which might lead to overfitting. Nonetheless, it provides us with a first approach to training the classifiers, and results in parameter update equations that are mostly equivalent to the ones used in XCS(F), which confirms that the LCS model is in its structure similar to XCS(F). In Chapter 7 we return to dealing with over- and underfitting, and will derive methods that are subsequently related to the methods derived in this chapter.

The classifier model parameters we estimate are its weight vector and its noise variance. The latter is a good indicator of the goodness-of-fit of the model and is also used in a modified form to estimate the accuracy of a classifier in XCS and its variants. In general, it is useful to guide the model structure search as we have already discussed in Section 3.2.6, and thus having a good estimate of the noise variance is advantageous. Thus, we put special emphasis on how to estimate it efficiently and accurately.

Since each classifier is trained independently (see Section 4.4), we will in this chapter only consider training of a single classifier k. To keep the notation uncluttered, we will drop the subscript k; that is, the classifier's matching function m_k is denoted m, the model parameters θ_k = {w_k, τ_k} become w and τ, and the estimate f̂_k provided by classifier k is denoted f̂. For any further variables introduced throughout this chapter it will be explicitly stated whether they are local to a classifier.

We start by introducing the linear classifier model and its underlying assump- tions in the next section, followed in Section 5.2 by how to estimate its param- eters if all training data is available at once. Incremental learning approaches are discussed in Section 5.3, where we firstly describe gradient-based methods to estimate the weight vector and then methods that track the optimal esti- mate exactly. Estimating the noise variance simultaneously is discussed for both methods in Section 5.3.7. In Section 5.4, we demonstrate the slow con- vergence of gradient-based methods empirically, and summarise the chapter in Section 5.5 by putting what we have introduced in this chapter into the per- spective of current LCS.

5.1 Linear Classifier Models and Their Underlying Assumptions

By reducing classification tasks to regression tasks in Section 3.1.3, we can limit ourselves to using local regression models and, in particular, linear models as a good balance between the expressiveness of the model and the ease of training the model (see Section 3.2.3). We have already previously introduced

the univariate linear model in Section 4.2, but will here discuss its underlying assumptions and implications in more detail.

5.1.1 Linear Models

A linear model assumes a linear relation between the inputs and the output, parameterised by a set of model parameters. Given an input vector x with D_X elements, the model is parameterised by the equally-sized random vector ω with realisation w, and assumes that the scalar output random variable υ with realisation y follows the relation

υ = ω^T x + ǫ,   (5.1)

where ǫ is a zero-mean Gaussian random variable that models the stochasticity of the process and the measurement noise. Hence, ignoring for now the noise term ǫ, we assume that the process generates the output by a weighted sum of the components of the input, as becomes very clear when considering a realisation w of ω, and rewriting the inner product

w^T x ≡ Σ_i w_i x_i,   (5.2)

where w_i and x_i are the ith element of w and x respectively.

While linear models are usually augmented by a bias term to offset them from the origin, we assume that the input vector always contains a single constant element (which is usually fixed to 1), which has the same effect. For example, consider the input space to be the set of reals; that is, X = R, D_X = 1, and both x and w are scalars. In such a case, the assumption of a linear model implies that the observed output follows xw, which is a straight line through the origin with slope w. To add the bias term, we can instead assume an augmented input space X′ = {1} × R, with input vectors x′ = (1, x)^T, resulting in the linear model w^T x′ = w_1 + w_2 x, a straight line with slope w_2 and bias w_1. Equally, the input vector can be augmented by other elements to extend the expressiveness of the linear model, as shown in the following example:

Example 5.1.1 (Common Classifier Models used in XCS(F)). Initially, classifiers in XCS [240, 241] only provided a single prediction, independent of the input. Such behaviour is equivalent to having the scalar input x_n = 1 for all n, as the weight w then models the output as an average over all matched outputs. Hence, we call such classifiers averaging classifiers. That they are really averaging over the matched outputs will be demonstrated in Example 5.2.1.

Later, Wilson introduced XCSF (the F standing for "function"), which initially used straight lines as the local models [244]. Hence, in the one-dimensional case, the inputs are given by x_n = (1, i_n)^T to model the output by w_1 + w_2 i_n, where i_n is the variable part of the input. This concept was taken further by Lanzi et al. [142] by applying 2nd and 3rd order polynomials, using the input vectors x_n = (1, i_n, i_n²)^T and x_n = (1, i_n, i_n², i_n³)^T respectively. Naturally, the input vector does not need to be restricted to taking i_n to some power, but allows for the use of arbitrary functions. These functions are known as basis functions, as they construct the base of the input space. Nonetheless, increasing the complexity of the input space makes it harder to interpret the local models. Hence, if we aim at understanding the localised model, we should keep these models simple, such as straight lines.
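As a brief illustration, the following sketch (not part of the original text) builds such input vectors from a raw scalar input i_n; the function names are purely illustrative.

import numpy as np

def averaging_features(i):
    return np.array([1.0])                    # constant input; w models an average

def line_features(i):
    return np.array([1.0, i])                 # model w_1 + w_2 i

def polynomial_features(i, degree):
    # basis functions 1, i, i^2, ..., i^degree
    return np.array([i ** d for d in range(degree + 1)])

x_n = polynomial_features(0.5, 3)             # the input vector (1, i, i^2, i^3)^T

The raw input i_n is still the one handed to the matching function; only the local linear model sees the augmented vector.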

5.1.2 Gaussian Noise

The noise term ǫ captures the stochasticity of the data-generating process and the measurement noise. In the case of linear models we assume that inputs and outputs stand in a linear relation. Every deviation from this relation is captured by ǫ and is interpreted as noise. Hence, assuming the absence of measurement noise, the fluctuation of ǫ gives us information about the ade- quacy of assuming a linear model. In other words, if the variance of ǫ is small, then inputs and outputs do indeed follow a linear relation. Hence, monitoring the variance of ǫ gives us a measure of how well the local model fits the data. For that reason, we aim not only at finding a weight vector that maximises the likelihood, but also want to estimate the variance of ǫ at the same time.

For linear models it is common to assume that the random variable ǫ representing the noise has zero mean, constant variance, and follows a normal distribution [96], that is, ǫ ∼ N(0, τ^{-1}), where τ is the noise precision (inverse noise variance). Hence, in combination with Eq. (5.1), and for some realisation w of ω and input x, the output is modelled by

υ ∼ p(y | x, w, τ^{-1}) = N(y | w^T x, τ^{-1}) = (τ / 2π)^{1/2} exp( −(τ/2) (w^T x − y)² ),   (5.3)

which defines the probabilistic model of a classifier and forms the core of our investigations of this chapter.

That the assumption of Gaussian noise is sensible is discussed at length in [166, Ch. 1].

5.1.3 Maximum Likelihood and Least Squares

To model the matched observations, a classifier aims at maximising the proba- bility of these observations given its model, as formally described by Eq. (4.22). Combined with the linear model Eq. (5.3), the term to maximise by a single classifier k is given by

Σ_{n=1}^N m(x_n) ln p(y_n | x_n, w, τ^{-1}) = Σ_{n=1}^N m(x_n) ( −(1/2) ln(2π) + (1/2) ln τ − (τ/2) (w^T x_n − y_n)² ).   (5.4)

As already shown in Section 4.2, maximising Eq. (5.4) with respect to the weight vector w results in the weighted least squares problem,

min_w Σ_{n=1}^N m(x_n) ( w^T x_n − y_n )²,   (5.5)

where the weights are given by the classifier's matching function. Thus, to determine w by maximum likelihood, we only consider observations for which

m(xn) > 0, that is, which are matched.

To determine the noise precision of the fitted model, we maximise Eq. (5.4)

with respect to τ, resulting in the problem

max_τ ( ln(τ) Σ_{n=1}^N m(x_n) − τ Σ_{n=1}^N m(x_n) ( w^T x_n − y_n )² ),   (5.6)

where w is the weight vector determined by Eq. (5.5).

The rest of this chapter is devoted to discussing batch and incremental learning solutions to Eqs. (5.5) and (5.6). Let us start with the batch learning approach.

5.2 Batch Learning Approaches

When performing batch learning, we assume as described in Section 3.1.5 that all the data is available at once. Hence, we have full knowledge of the N observations {x_n, y_n} that form D and, knowing the current model structure M, also of the classifier's matching function m.

Let us now apply this approach to find the classifier’s model parameters by solving Eqs. (5.5) and (5.6).

Notation. We will use the following notation in this and the remaining sections and chapters. Let x, y ∈ R^M be vectors, and A ∈ R^M × R^M a diagonal matrix. Let ⟨x, y⟩ ≡ x^T y be the inner product of x and y, and let ⟨x, y⟩_A ≡ x^T A y be the inner product weighted by A, forming the inner product space ⟨·,·⟩_A. Then, ‖x‖_A ≡ √⟨x, x⟩_A is the norm associated with the inner product space ⟨·,·⟩_A. Any two vectors x, x̄ are said to be A-orthogonal if ⟨x, x̄⟩_A = 0. Note that ‖x‖ ≡ ‖x‖_I is the Euclidean norm, where I is the identity matrix.

5.2.1 The Weight Vector

Using the matrix notation introduced in Eq. (3.3), and defining the diagonal N × N matching matrix M_k of classifier k by M_k = diag(m_k(x_1), ..., m_k(x_N)), in this chapter simply denoted M, we can rewrite Eq. (5.5) to

min_w (Xw − y)^T M (Xw − y) = min_w ‖Xw − y‖²_M.   (5.7)

Thus, we want to find the w that minimises the weighted distance between the estimated outputs Xw and the observed outputs y in the inner product space ⟨·,·⟩_M. This distance is convex with respect to w and therefore has a unique minimum [26]. Note that, as we assume the output space to be single-dimensional, the set of observed outputs is given by the vector y rather than the matrix Y.

The solution to Eq. (5.7) is found by setting its first derivative to zero, resulting in

ŵ = (X^T M X)^{-1} X^T M y.   (5.8)

Alternatively, a numerically more stable solution that can also be computed if X^T M X is singular and therefore cannot be inverted, is

ŵ = (√M X)^+ √M y,   (5.9)

where X^+ ≡ (X^T X)^{-1} X^T denotes the pseudo-inverse of matrix X [19].

Using the weight vector according to Eq. (5.8), the matching-weighted vector of estimated outputs Xwˆ evaluates to

Xŵ = X (X^T M X)^{-1} X^T M y.   (5.10)

Note that X (X^T M X)^{-1} X^T M is a projection matrix that projects the vector of observed outputs y onto the hyperplane {Xw | w ∈ R^{D_X}} with respect to ⟨·,·⟩_M. This result is intuitively plausible, as the w that minimises the weighted distance ‖Xw − y‖_M between the observed and the estimated outputs is the closest point on this hyperplane to y with respect to ⟨·,·⟩_M, which is the orthogonal projection of y in ⟨·,·⟩_M onto this plane. We will use this concept of projection extensively in Chapter 9.

5.2.2 The Noise Precision

To get the maximum likelihood noise precision we need to solve Eq. (5.6). As before, we evaluate the maximum of Eq. (5.6) by setting its first derivative with respect to τ to zero, to get

τ̂^{-1} = c^{-1} ‖Xŵ − y‖²_M,   (5.11)

where

c_k = Σ_{n=1}^N m_k(x_n) = Tr(M_k),   (5.12)

is the match count of classifier k, and is in this chapter simply denoted c. Tr(M) denotes the trace of the matrix M, which is the sum of its diagonal elements. Hence, the inverse noise precision, that is, the noise variance, is given by the average squared error of the model output estimates over all matched observations.

Note, however, that the precision estimate is biased, as it is based on another estimate wˆ [96, Ch. 5]. This can be accounted for by using

τ̂^{-1} = (c − D_X)^{-1} ‖Xŵ − y‖²_M,   (5.13)

which is the unbiased estimate of the noise precision.

To summarise, the maximum likelihood model parameters of a classifier us- ing batch learning are found by first evaluating Eq. (5.8) to get wˆ and then Eq. (5.13) to get τˆ.
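As an illustration, the following is a minimal sketch (assuming NumPy arrays; the function name is not from the thesis) of this batch procedure, using the numerically stable form Eq. (5.9) and the unbiased precision estimate Eq. (5.13).

import numpy as np

def train_classifier_batch(X, y, m):
    # X: N x D_X input matrix, y: N outputs, m: matching values m(x_n) in [0, 1];
    # assumes the match count exceeds D_X so that Eq. (5.13) is well defined
    sqrt_m = np.sqrt(m)
    # w_hat = (sqrt(M) X)^+ sqrt(M) y, Eq. (5.9)
    w_hat, *_ = np.linalg.lstsq(sqrt_m[:, None] * X, sqrt_m * y, rcond=None)
    c = m.sum()                                        # match count, Eq. (5.12)
    sq_err = (X @ w_hat - y) ** 2
    tau_inv = (m * sq_err).sum() / (c - X.shape[1])    # unbiased variance, Eq. (5.13)
    return w_hat, 1.0 / tau_inv

The returned pair corresponds to (ŵ, τ̂); for an averaging classifier with X = (1, ..., 1)^T this reduces to the matched mean and variance of Example 5.2.1.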

Example 5.2.1 (Batch Learning with Averaging Classifiers). Averaging

classifiers are characterised by using x_n = 1 for all n for their linear model. Hence, we have X = (1, ..., 1)^T, and evaluating Eq. (5.8) results in the scalar weight estimate

ŵ = c^{-1} Σ_{n=1}^N m(x_n) y_n,   (5.14)

which is the outputs yn averaged over all matched inputs. Note that, as

90 discussed in Section 3.2.3, the inputs to the matching function as appear-

ing in m(xn) are not necessarily the same as the ones used to build the local model. In the case of averaging classifiers this differentiation is es-

sential, as the inputs xn = 1 used for building the local models do not carry any information that can be used for localisation of the classifiers.

The noise precision is determined by evaluating Eq. (5.13) and results in

τ̂^{-1} = (c − 1)^{-1} Σ_{n=1}^N m(x_n) (ŵ − y_n)²,   (5.15)

which is the unbiased average of the squared deviation of the outputs from their average, and hence gives us an indication of the prediction error we can expect from the linear model.

5.3 Incremental Learning Approaches

Having derived the batch learning solution, let us now consider the case where we want to update our model with each additional observation. In partic- ular, let us assume that the model parameters wˆN and τˆN are based on N observations, and we want to incorporate the knowledge of the new obser- vation (xN+1, yN+1) to get the updated parameters wˆN+1 and τˆN+1. The fol- lowing notation will be used: XN , yN , MN , and cN denote the input, output, matching matrix, and match count respectively, after N observations. Sim-

ilarly, XN+1, yN+1, MN+1,cN+1 stand for the same objects after knowing the

additional observation (xN+1, yN+1).

In this section we describe several methods that can be used to perform the model parameter update, starting with computationally simple gradient-based approaches and moving on to more complex, but also more stable, methods. Since quickly obtaining a good idea of the quality of the model of a classifier is important, and as the noise precision quality measure after Eq. (5.6) relies on the weight estimate, we will not only consider the computational costs of the methods, but also emphasise their speed of convergence with respect to estimating both w and τ.

We start by describing a principle from adaptive filter theory that tells us

when an incremental linear model performs optimally. Then we consider some gradient-based approaches, followed by approaches that recursively track the least-squares solution. Additionally, we will currently only consider updating the weight vector parameter w, and will return to estimating the noise precision τ with similar means in Section 5.3.7.

5.3.1 The Principle of Orthogonality

The Principle of Orthogonality tells us when the weight vector estimate wˆN is optimal in the weighted least squares sense of Eq. (5.5):

Theorem 5.3.1 (Principle of Orthogonality (for example, [106])). The weight vector estimate ŵ_N after N observations is optimal in the sense of Eq. (5.5) if the sequence of inputs {x_1, ..., x_N} is M_N-orthogonal to the sequence of estimation errors {(ŵ_N^T x_1 − y_1), ..., (ŵ_N^T x_N − y_N)}, that is

⟨X_N, X_N ŵ_N − y_N⟩_{M_N} = Σ_{n=1}^N m(x_n) x_n ( ŵ_N^T x_n − y_n ) = 0.   (5.16)

Proof. The solution of Eq. (5.5) is found by setting the first derivative of Eq. (5.7) to zero to get

2 X_N^T M_N X_N ŵ_N − 2 X_N^T M_N y_N = 0.

The result follows by dividing the above by 2 and rearranging the expression.

By multiplying Eq. (5.16) by wˆN , we can make a similar statement about the output estimates:

Corollary 5.3.2 (Corollary to the Principle of Orthogonality (for example,

[106])). The weight vector estimate ŵ_N after N observations is optimal in the sense of Eq. (5.5) if the sequence of output estimates {ŵ_N^T x_1, ..., ŵ_N^T x_N} is M_N-orthogonal to the sequence of estimation errors {(ŵ_N^T x_1 − y_1), ..., (ŵ_N^T x_N − y_N)}, that is

⟨X_N ŵ_N, X_N ŵ_N − y_N⟩_{M_N} = Σ_{n=1}^N m(x_n) ŵ_N^T x_n ( ŵ_N^T x_n − y_n ) = 0.   (5.17)

Hence, when having a ŵ_N that minimises ‖X_N ŵ_N − y_N‖_{M_N}, both the sequence of inputs and output estimates are M_N-orthogonal to the estimation errors.

In other words, the hyperplane spanned by the vectors X_N and X_N ŵ_N is M_N-orthogonal to the vector of estimation errors (X_N ŵ_N − y_N), and therefore the output estimate is an orthogonal projection onto this hyperplane with respect to ⟨·,·⟩_{M_N}. This conforms to the batch learning solution introduced in Section 5.2.1.

5.3.2 Steepest Gradient Descent

Steepest gradient descent is a well-known method for function minimisation, based on following the gradient of that function. Applied to Eq. (5.5), we can use it to find the weight vector that minimises the squared error. However, it is only applicable if we know all observations at once, which is not the case when performing incremental learning. Nonetheless, we discuss it here as it gives valuable insights into the stability and speed of convergence of other gradient-based incremental learning methods that we will describe in a later section.

As for batch learning, let X, y, M and c be the input matrix, the output vector, the matching matrix, and the match count respectively, given all N observations. Then, steepest gradient descent is defined by

w_{n+1} = w_n − γ_{n+1} (1/2) ∇_{w_n} ‖X w_n − y‖²_M,   (5.18)

starting at some arbitrary w_0, and hence generating a sequence of weight vectors {w_0, w_1, ...} by performing small steps along the gradient of the squared error. Note that n does in this case refer to the iteration number of the method rather than to the index of the observation, and γ_n > 0 is the step size in the nth iteration. Evaluating the gradient ∇_{w_n} with respect to w_n results in the algorithm

w_{n+1} = w_n − γ_{n+1} X^T M (X w_n − y).   (5.19)

With each step along the gradient, steepest gradient descent reduces the squared error. As the error function is convex and hence has a unique minimum, following its gradient will lead us to this minimum and hence solves Eq. (5.5).

Stability Criteria

By definition, the step size γ_n can change at each iteration. When kept constant, that is γ_n = γ for all n > 0, and the gradient is Lipschitz continuous^1, then the steepest gradient descent method is guaranteed to converge to the minimum of Eq. (5.5), if that minimum exists [17, Prop. 3.4]. In our case, the gradient as a function of w is Lipschitz continuous, and hence, convergence for a constant step size is guaranteed.

Another condition for the stability of steepest gradient descent, which is easier to evaluate, is for the step size γ to satisfy

0 < γ < 2 / λ_max,   (5.20)

where λ_max is the largest eigenvalue of the input correlation matrix c^{-1} X^T M X [106, Ch. 4]. Hence, the step size that keeps the algorithm stable depends highly on the values of the input vectors.

Time Constant Bounds

Similar to the stability of the method, its rate of convergence is also dependent on the eigenvalues of the input correlation matrix. Let T be the time constant^2 of the weight vector update. This time constant is bounded by

−1 / ln(1 − γ λ_max) ≤ T ≤ −1 / ln(1 − γ λ_min),   (5.21)

where λ_max and λ_min are the largest and the smallest eigenvalue of c^{-1} X^T M X respectively [106, Ch. 4]. As a low T implies a higher rate of convergence,

^1 A function f : A → A is Lipschitz continuous if there exists a finite constant scalar K such that ‖f(a) − f(b)‖ ≤ K ‖a − b‖ for any a, b ∈ A. The magnitude of K is a measure of the continuity of the function f.

^2 The time constant is a measure of the responsiveness of a dynamic system. A low time constant means that the system responds quickly to a changing input. Hence, it is inversely proportional to the rate of convergence.

we would prefer λ_max and λ_min to be close together for a tight bound, and large such that T is kept small. However, if the eigenvalues are widely spread, which is an indication of ill-conditioned inputs, then the settling time of the gradient descent algorithm is limited by λ_min [17, Ch. 3]. Therefore, the convergence rate is, as is the stability criterion, dependent on the values of the input vectors.

Example 5.3.1 (Stability Criteria and Time Constant for Steepest Gradient Descent). Let us start with investigating averaging classifiers, that is X = (1, ..., 1)^T, matching all inputs, and hence M = I, the identity matrix. The only eigenvalue of c^{-1} X^T M X is λ = 1, and therefore, according to Eq. (5.20), steepest gradient descent is stable for 0 ≤ γ ≤ 2. Equation (5.21) results in the time constant T = −1 / ln(1 − γ), and hence the method converges faster with a larger step size, as we would intuitively expect.

The same analysis can be applied to classifiers with straight line models, with input vectors x_n = (1, i_n)^T with i_n ∈ R for all n. In that case, the input vector correlation matrix is given by

c^{-1} X^T M X = (1/N) Σ_{n=1}^N [ 1  i_n ; i_n  i_n² ],   (5.22)

with eigenvalues λ_1 = 0, λ_2 = 1 + N^{-1} Σ_n i_n². Hence, the step size has to obey

0 ≤ γ ≤ 2 / (1 + N^{-1} Σ_n i_n²),   (5.23)

which demonstrates that the larger the values of i_n, the smaller the step size has to be to still guarantee stability of the algorithm. The time constant is bounded by

−1 / ln(1 − γ (1 + N^{-1} Σ_n i_n²)) ≤ T ≤ ∞,   (5.24)

showing that a large eigenvalue spread |λ_2 − λ_1|, caused by on average high magnitudes of i_n, pushes the time constant towards infinity, resulting in very slow convergence. Therefore, the convergence rate of steepest gradient descent frequently depends on the range of the inputs^3. This dependency will be demonstrated empirically in Section 5.4.

^3 A similar LCS-related analysis was done in [143, 144], but there the stability criteria for steepest gradient descent were applied to the LMS algorithm.

5.3.3 Least Mean Squared

The Least Mean Squared (LMS) algorithm is an incremental approximation to steepest gradient descent. Rather than performing gradient descent on the er- ror function given all observations, it follows the gradient of the error function given only the current observation. For this reason, it is also known as Stochas- tic Incremental Steepest Gradient Descent, ADALINE, or, after their developers Widrow and Hoff [237], the Widrow-Hoff Update.

By inspecting Eq. (5.5), the error function for the (N + 1)th observation based on the model after N observations is m(x_{N+1}) ( ŵ_N^T x_{N+1} − y_{N+1} )², and its gradient with respect to ŵ_N is therefore 2 m(x_{N+1}) x_{N+1} ( ŵ_N^T x_{N+1} − y_{N+1} ). Using this local gradient estimate rather than the global gradient, the LMS update is given by

ŵ_{N+1} = ŵ_N + γ_{N+1} m(x_{N+1}) x_{N+1} ( y_{N+1} − ŵ_N^T x_{N+1} ),   (5.25)

starting with an arbitrary ŵ_0.

As the gradient estimate is only based on the current input, the method suffers from gradient noise. Due to this noise, a constant step size γ will cause random motion close to the optimal approximation [106, Ch. 5].
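The following is a minimal sketch (assuming NumPy arrays; the function name is illustrative) of a single LMS step after Eq. (5.25).

import numpy as np

def lms_update(w_hat, x, y, m_x, gamma):
    # w_hat: current weight estimate, x: input vector, y: observed output,
    # m_x: matching value m(x) of this input, gamma: step size
    return w_hat + gamma * m_x * x * (y - w_hat @ x)

Unmatched observations (m_x = 0) leave the weight vector unchanged, mirroring the role of the matching function in Eq. (5.25).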

Misadjustment due to Local Gradient Estimate

Let h_N(w) = c_N^{-1} ‖X_N w − y_N‖² be the mean squared error (MSE) after N observations as a function of the weight vector. The excess mean square estimation error is the difference between the MSE of the LMS algorithm and the minimal MSE after Eq. (5.16). The ratio between the excess MSE and the minimal MSE error is the misadjustment, which is a measure of how far away the convergence area of LMS is from the optimal estimate. The estimate error for some small constant step size can, according to [106, Ch. 5], be estimated by

h_N(w*_N) + (γ h_N(w*_N) / 2) Σ_{j=1}^J λ_j,   (5.26)

where w*_N is the weight vector that satisfies Eq. (5.16) and thus h_N(w*_N) is the minimal MSE, and λ_j is the jth of the J eigenvalues of c^{-1} X_N^T M_N X_N. This shows that the excess MSE estimate is i) always positive, and ii) proportional to the step size γ. Thus, reducing the step size also reduces the misadjustment. Indeed, under the standard stochastic approximation assumptions that Σ_{n=1}^∞ γ_n = ∞ and Σ_{n=1}^∞ γ_n² < ∞, the Lipschitz continuity of the gradient, and some pseudogradient property of the gradient, we can guarantee convergence to the optimal estimate [17, Prop. 4.1].

Stability Criteria and Average Time Constant

As the LMS filter is a transversal filter of length one, using only the current observation for its gradient estimate, no concrete bounds for the step size can currently be given [106, Ch. 6]. However, if the step size is small when compared to the inverse of the largest eigenvalue of the input vector correlation matrix, then the stability criteria are the same as for steepest gradient descent, Eq. (5.20).

As the gradient changes with each step, we can only give an expression for the local time constant that varies with time (for more details see [77]). On average, however, the time constant can be bounded in the same way as for steepest gradient descent Eq. (5.21), with the same consequences.

This leaves us in a dilemma: we have already established that the misadjust- ment is proportional to the step size. On the other hand, the time constant is inversely proportional to it. Hence, we have conflicting requirements and can either aim for a low estimation error or a fast rate of convergence, but will not be able to satisfy both requirements with anything other than a compromise.

Relation to Batch Learning

To get a better intuitive understanding of how the LMS algorithm estimates the weight vector, let us reformulate it as a batch learning approach for the

simplified case of an averaging classifier that matches all inputs, that is x_n = 1, m(x_n) = 1 for all n > 0. In that case, Eq. (5.25) reduces to

ŵ_{N+1} = ŵ_N + γ_{N+1} ( y_{N+1} − ŵ_N ),   (5.27)

and by recursive substitution (as in Example 3.1.1) results in the batch learning formulation

ŵ_N = Σ_{n=1}^N y_n γ_n Π_{m=n+1}^N (1 − γ_m) + w_0 Π_{n=1}^N (1 − γ_n).   (5.28)

Hence, the nth observation y_n is weighted by γ_n Π_{m=n+1}^N (1 − γ_m), which, for 0 < γ_n̄ < 1 for all 0 < n̄ ≤ n, means that the lower n, the less y_n contributes to the weight estimate. Also, w_0 introduces a bias that decays exponentially with Π_{n=1}^N (1 − γ_n). Comparing this insight to the results of Example 5.2.1, where we have shown that the optimal weight in the least squares sense for averaging classifiers is the average over all matched outputs, we can see that the LMS algorithm does not achieve this optimum for arbitrary step sizes. Nonetheless, it can be applied readily for recency-weighted applications, such as to handle non-stationary processes, as is required in reinforcement learning applications.

5.3.4 Normalised Least Mean Squared

As we can see from Eq. (5.25), the magnitude of the weight update is directly

proportional to the new input vector x_{N+1}, causing gradient noise amplification [106, Ch. 6]. Thus, if we have large values in some elements of the feature vector, the correction based on a local error will be amplified and cause additional noise. This problem can be overcome by weighting the correction by the squared Euclidean norm of the input, resulting in the update

ŵ_{N+1} = ŵ_N + γ_{N+1} m(x_{N+1}) (x_{N+1} / ‖x_{N+1}‖²) ( y_{N+1} − ŵ_N^T x_{N+1} ).   (5.29)

This update equation can also be derived by calculating the weight vector update that minimises the norm of the weight change ‖ŵ_{N+1} − ŵ_N‖², subject to the constraint m(x_{N+1}) ŵ_{N+1}^T x_{N+1} = y_{N+1}. As such, the normalised LMS filter can be seen as a solution to a constrained optimisation problem.

Regarding stability, the step size parameter γ is now weighted by the inverted

square norm of the input vector. Thus, stability in the MSE sense is dependent on the current input. The lower bound is still 0, and the upper bound will be generally larger than 2 if the input values are overestimated, and smaller than 2 otherwise. The optimal step size, located at the largest value of the mean squared deviation, is the centre of the two bounds [106, Ch. 6].

As expected, the normalised LMS algorithm features a rate of convergence that is higher than that of the standard LMS filter, as demonstrated empirically in [75]. One drawback of the modification is that one needs to check ‖x_{N+1}‖² for being zero, in which case no update needs to be performed to avoid division by zero.
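A corresponding sketch (again assuming NumPy arrays and an illustrative function name) of the normalised update, Eq. (5.29), including the zero-norm guard just mentioned:

import numpy as np

def nlms_update(w_hat, x, y, m_x, gamma):
    norm_sq = x @ x                      # squared Euclidean norm of the input
    if norm_sq == 0.0:
        return w_hat                     # skip the update to avoid division by zero
    return w_hat + gamma * m_x * (x / norm_sq) * (y - w_hat @ x)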

To summarise, both variants of the LMS algorithm have low computational and space costs O(D_X), but only rely on the local gradient estimate and may hence feature slow convergence and misadjustment. We can adjust the step size to either improve convergence speed or misadjustment, but cannot improve both at the same time. Additionally, the speed of convergence is by Eq. (5.21) influenced by the value of the inputs and might be severely reduced by ill-conditioned inputs, as we will demonstrate in Section 5.4.

Let us recall that to quickly get an idea of the goodness-of-fit of a classifier model, which we measure by the model variance, we also need a good es- timate of the weight vector. Despite their low computational cost, gradient- based methods are known to suffer from low speed of convergence and are therefore not necessarily the best choice for this task. In the following sections we describe incremental methods that are computationally more costly, but are able to recursively track the weight vector that satisfies Eq. (5.16) and are therefore optimal in the least squares sense.

5.3.5 Recursive Least Squares

The Principle of Orthogonality Eq. (5.16) is satisfied if the normal equation

(X_N^T M_N X_N) ŵ_N = X_N^T M_N y_N,   (5.30)

holds. Using the D_X × D_X symmetric matrix Λ_N = X_N^T M_N X_N, we can relate Λ_N and Λ_{N+1} by

Λ_{N+1} = Λ_N + m(x_{N+1}) x_{N+1} x_{N+1}^T,   (5.31)

with Λ_0 = 0. Similarly, we have

X_{N+1}^T M_{N+1} y_{N+1} = X_N^T M_N y_N + m(x_{N+1}) x_{N+1} y_{N+1},   (5.32)

which, in combination with Eqs. (5.30) and (5.31), allows us to derive the relation

Λ_{N+1} ŵ_{N+1} = Λ_{N+1} ŵ_N + m(x_{N+1}) x_{N+1} ( y_{N+1} − ŵ_N^T x_{N+1} ).   (5.33)

Pre-multiplying the above by Λ_{N+1}^{-1}, we get the weight vector update

ŵ_{N+1} = ŵ_N + m(x_{N+1}) Λ_{N+1}^{-1} x_{N+1} ( y_{N+1} − ŵ_N^T x_{N+1} ),   (5.34)

which, together with Eq. (5.31) and starting with ŵ_0 = 0, defines the recursive least squares (RLS) algorithm (see, for example, [106, Ch. 9] or [17, Ch. 3]).

Following this algorithm allows us to satisfy the Principle of Orthogonality with each additional observation, and as such provides an incremental approach of tracking the optimal weight vector in the least squares sense. This comes at the cost O(D_X³) of having to invert the matrix Λ with each additional observation that is to be included into the model. Alternatively, we can utilise the properties of Λ to derive the following modified update:

Operating on Λ^{-1}

The Sherman-Morrison formula (also known as the Matrix Inversion Lemma, e.g. [106, Ch. 6]) provides a method of adding a dyadic product to an invert- ible matrix by operating directly on the inverse of this matrix. Hence, it is applicable to Eq. (5.31), and results in

Λ_{N+1}^{-1} = Λ_N^{-1} − m(x_{N+1}) (Λ_N^{-1} x_{N+1} x_{N+1}^T Λ_N^{-1}) / (1 + m(x_{N+1}) x_{N+1}^T Λ_N^{-1} x_{N+1}),   (5.35)

which is of cost O(D_X²) rather than O(D_X³) for inverting Λ in Eq. (5.34) at each update.

The drawback of this approach is that we cannot initialise Λ_0 = 0, as the Sherman-Morrison formula is only valid for invertible matrices, and Λ_0 = 0 is clearly not. This issue is usually handled by initialising Λ_0^{-1} = δI, where δ is a large positive scalar (to keep Λ_0 close to 0), and I is the identity matrix. While this approach introduces an initial bias to the RLS algorithm, this bias decays exponentially, as we will show in the next section.
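The following sketch (assuming NumPy; class and parameter names are illustrative, not from the thesis) combines Eqs. (5.34) and (5.35) with the initialisation Λ_0^{-1} = δI discussed above.

import numpy as np

class RLSClassifier:
    def __init__(self, d_x, delta=1e4):
        self.w_hat = np.zeros(d_x)
        self.Lambda_inv = delta * np.eye(d_x)   # Lambda_0^{-1} = delta I

    def update(self, x, y, m_x):
        if m_x == 0.0:
            return                              # unmatched observations leave the model unchanged
        Li_x = self.Lambda_inv @ x
        # Sherman-Morrison step, Eq. (5.35)
        self.Lambda_inv -= m_x * np.outer(Li_x, Li_x) / (1.0 + m_x * (x @ Li_x))
        # weight vector update, Eq. (5.34), using the updated Lambda^{-1}
        self.w_hat += m_x * (self.Lambda_inv @ x) * (y - self.w_hat @ x)

As derived in the following subsection, choosing δ corresponds to performing ridge regression with ridge complexity λ = δ^{-1}.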

Relation to Ridge Regression

It is easy to show that the solution wˆN to minimising

‖X_N w − y_N‖²_{M_N} + λ ‖w‖²,   (5.36)

(λ is the positive scalar ridge complexity) with respect to w requires

(X_N^T M_N X_N + λI) ŵ_N = X_N^T M_N y_N   (5.37)

to hold. The above is similar to Eq. (5.30) with the additional term λI. Hence,

Eq. (5.31) still holds when initialised with Λ_0 = λI, and consequently so does Eq. (5.34). Therefore, initialising Λ_0^{-1} = δI to apply Eq. (5.35) to operate on Λ^{-1} rather than Λ is equivalent to minimising Eq. (5.36) with λ = δ^{-1}.

In addition to the matching-weighted squared error, Eq. (5.36) penalises the size of w. This approach is known as ridge regression and was initially introduced to work around the problem of an initially singular X_N^T M_N X_N for small N, which prohibited the solution of Eq. (5.30). However, minimising Eq. (5.36) rather than Eq. (5.7) is also advantageous if the input vectors suffer from a high noise variance, resulting in large w and a bad model for the real data-generating process. Essentially, ridge regression assumes that the size of w is small and hence computes better model parameters for noisy data, given that the inputs are normalised [103, Ch. 3].

To summarise, using the RLS algorithm Eqs. (5.34) and (5.35) with Λ_0^{-1} = δI, a classifier performs ridge regression with ridge complexity λ = δ^{-1}. As, by Eq. (5.36), the contribution of ‖w‖ is independent of the number of observations N, its influence decreases exponentially with N.

A Recency-Weighted Variant

While the RLS algorithm provides a recursive solution such that Eq. (5.16) holds, it weights all observations equally. Nonetheless, we might sometimes require recency-weighting, such as when using LCS in combination with reinforcement learning. Hence, let us derive a variant of RLS that applies a scalar decay factor 0 ≤ λ ≤ 1 to past observations.

More formally, after N observations, we aim at minimising

Σ_{n=1}^N m(x_n) λ^{Σ_{j=n+1}^N m(x_j)} ( w^T x_n − y_n )² = ‖X_N w − y_N‖²_{M_N^λ}   (5.38)

with respect to w, where the λ-augmented diagonal matching matrix M_N^λ is given by

M_N^λ = diag( m(x_1) λ^{Σ_{j=2}^N m(x_j)}, m(x_2) λ^{Σ_{j=3}^N m(x_j)}, ..., m(x_N) ).   (5.39)

Note that we are using λ^{Σ_{j=n+1}^N m(x_j)} rather than simply λ^{N−n} to only decay past

observations if the current observation is matched. As before, the solution wˆN that minimises Eq. (5.38) satisfies

(X_N^T M_N^λ X_N) ŵ_N = X_N^T M_N^λ y_N.   (5.40)

Using Λ_N = X_N^T M_N^λ X_N and the relations

Λ_{N+1} = λ^{m(x_{N+1})} Λ_N + m(x_{N+1}) x_{N+1} x_{N+1}^T,   (5.41)

Λ_{N+1} ŵ_{N+1} = λ^{m(x_{N+1})} Λ_N ŵ_N + m(x_{N+1}) x_{N+1} y_{N+1},   (5.42)

the recency-weighted RLS weight vector update is given by

ŵ_{N+1} = λ^{m(x_{N+1})} ŵ_N + m(x_{N+1}) Λ_{N+1}^{-1} x_{N+1} ( y_{N+1} − ŵ_N^T x_{N+1} ).   (5.43)

The matrix Λ can be updated by either using Eq. (5.41) or by applying the Sherman-Morrison formula to get

Λ_{N+1}^{-1} = λ^{−m(x_{N+1})} Λ_N^{-1} − m(x_{N+1}) λ^{−m(x_{N+1})} (Λ_N^{-1} x_{N+1} x_{N+1}^T Λ_N^{-1}) / (λ^{m(x_{N+1})} + m(x_{N+1}) x_{N+1}^T Λ_N^{-1} x_{N+1}).   (5.44)

All equations from this section reduce to the non-recency-weighted equivalents if λ = 1.

In summary, the RLS algorithm recursively tracks the solution according to the Principle of Orthogonality. As this solution is always optimal in the least squares sense, there is no need to discuss its convergence to the optimal solution, as was required for gradient-based algorithms. While the RLS can also be adjusted to perform recency-weighting, as developed in this section, its only drawback when compared to the LMS or normalised LMS algorithm is its higher computational cost. Nonetheless, if this additional cost is bearable, it should always be preferred to the gradient-based algorithms, as we will also demonstrate in Section 5.4.

Example 5.3.2 (RLS Algorithm for Averaging Classifiers). Let us consider

averaging classifiers, that is xn = 1 for all n > 0. Hence, we have for Eq. (5.31)

Λ_{N+1} = Λ_N + m(x_{N+1}),   (5.45)

which, when starting with Λ_0 = 0, is equivalent to the match count, Λ_N = c_N. The weight update after Eq. (5.34) reduces to

w_{N+1} = w_N + m(x_{N+1}) c_{N+1}^{-1} ( y_{N+1} − w_N ).   (5.46)

Note that this is equivalent to the LMS algorithm Eq. (5.25) for averaging classifiers when using the step size γ_N = c_N^{-1}. By recursive back-substitution of the above, and using w_0 = 0, we get

w_N = c_N^{-1} Σ_{n=1}^N m(x_n) y_n,   (5.47)

which is, as already derived for batch learning Eq. (5.14), the matching-weighted average over all observed outputs. Interestingly, XCS applies the MAM update that is equivalent to averaging the input for the first γ^{-1} inputs, where γ is the step size, and then tracking the input using the LMS algorithm [240]. In other words, it bootstraps its weight estimate using the RLS algorithm, and then continues tracking of the output using the LMS algorithm. Note that this is only the case for XCS with averaging classifiers, and does not apply for XCS derivatives that use more complex models, such as XCSF. Even though not explicitly stated in [244], we assume that the MAM update was not used for the weight update in those XCS derivatives, but is still used when updating its scalar parameters, such as the relative classifier accuracy and fitness.
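The following small sketch (illustrative names, not from the thesis) implements Eqs. (5.45) and (5.46): for averaging classifiers, RLS reduces to an incremental matching-weighted mean.

def averaging_rls_update(w, c, y, m_x):
    # w: current weight (mean estimate), c: match count, y: new output, m_x: matching value
    c_new = c + m_x                        # Lambda_N = c_N, Eq. (5.45)
    if c_new == 0.0:
        return w, c_new                    # nothing matched yet
    w_new = w + m_x * (y - w) / c_new      # Eq. (5.46)
    return w_new, c_new

This is the averaging behaviour that, according to the discussion above, the MAM update mimics during the first 1/γ matched inputs.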

5.3.6 The Kalman Filter

In developing the RLS algorithm we have concentrated on following the Prin- ciple of Orthogonality, without considering the probabilistic structure of the random variables. While by introducing the Kalman filter in this section we formally arrive at the same update equations as for the RLS algorithm, we additionally provide this probabilistic structure, and hence support better un- derstanding of the method. Furthermore, its use is advantageous as “[. . . ] the Kalman filter is optimal with respect to virtually any criterion that makes sense” [166, Ch. 1].

Firstly, we introduce the system model, and from this model derive the update equations in the covariance form and the inverse covariance form. Following this, we discuss how to estimate the system state and the measurement noise simultaneously, by following the Minimum Model Error philosophy. Finally, we relate the resulting algorithm to the RLS algorithm.

The System Model

The Kalman-Bucy system model [124, 125] describes how a noisy process mod- ifies the state of a system, and how this affects the noisy observation of the sys-

tem. Both the process and the relation between system state and observation are assumed to be linear, and all noise is zero-mean white (uncorrelated) Gaussian noise.

In our case, the process that generates the observations is assumed to be stationary, which is expressed by a constant system state. Additionally, the observations are in linear relation to the system state and all deviations from that linearity are covered by zero-mean white (uncorrelated) Gaussian noise. The resulting model is

υ_n = ω^T x_n + ǫ_n,   (5.48)

where υn is the random variable that represents the observed nth scalar output

of the system, ω is the system state random variable, xn is the known nth

input vector to the system, and ǫn is the measurement noise associated with

observing yn.

The noise ǫ_n is modelled by a zero-mean Gaussian ǫ_n ∼ N(0, (m(x_n) τ_n)^{-1}) with precision m(x_n) τ_n. Here, we utilise the matching function to blur obser-

vations that are not matched. Given, for example, that x_n is matched, we have m(x_n) = 1, resulting in a measurement noise with variance τ_n^{-1}. However, if that state is not matched, that is, if m(x_n) = 0, then the measurement noise has infinite variance and hence we cannot induce any information from the associated observation.

For the system state ω we acquire the multivariate Gaussian model ω ∼ N(ŵ, Λ^{-1}), centred on ŵ and with precision matrix Λ. Hence, the output υ_n is also Gaussian, υ_n ∼ N(y_n | ω^T x_n, (m(x_n) τ_n)^{-1}), and jointly Gaussian with the system state ω. More details on the random variables, their relations and distributions can be found in [166, Ch. 5] and [2, Ch. 1].

Comparing the model Eq. (5.48) to the previously introduced linear model Eq. (5.1), we can see that the system state corresponds to the weight vector, and that the only difference is the assumption that the measurement noise variance can change with each observation. Additionally, the Kalman-Bucy system model explicitly assumes a multivariate Gaussian model for the sys- tem state ω, resulting in the output υ also being modelled by a Gaussian.

The aim of the Kalman filter is to estimate the system state that can subse-

quently be used to predict the output given a new input. We achieve this by conditioning a prior ω_0 ∼ N(ŵ_0, Λ_0^{-1}) on the available observations. As before, we proceed by assuming that the current model ω_N ∼ N(ŵ_N, Λ_N^{-1}) results from incorporating the information of N observations, and we want to add the new observation (x_{N+1}, y_{N+1}, τ_{N+1}). Later we will show how to estimate the noise precision τ_{N+1}, but for now we assume that it is part of the observation.

Covariance Form

As the system state and the observation are jointly Gaussian, the Bayesian up- date of the model parameters is given by [2, Ch. 3]

ŵ_{N+1} = E( ω_N | υ_{N+1} ∼ N(y_{N+1}, (m(x_{N+1}) τ_{N+1})^{-1}) )
        = E(ω_N) + cov(ω_N, υ_{N+1}) var(υ_{N+1})^{-1} ( y_{N+1} − E(υ_{N+1}) ),   (5.49)

Λ_{N+1}^{-1} = cov( ω_N, ω_N | υ_{N+1} ∼ N(y_{N+1}, (m(x_{N+1}) τ_{N+1})^{-1}) )
        = cov(ω_N, ω_N) − cov(ω_N, υ_{N+1}) var(υ_{N+1})^{-1} cov(υ_{N+1}, ω_N).   (5.50)

Evaluating the expectations, variances and covariances

E(ω_N) = ŵ_N,   cov(ω_N, ω_N) = Λ_N^{-1},
E(υ_{N+1}) = ŵ_N^T x_{N+1},   var(υ_{N+1}) = x_{N+1}^T Λ_N^{-1} x_{N+1} + (m(x_{N+1}) τ_{N+1})^{-1},
cov(ω_N, υ_{N+1}) = Λ_N^{-1} x_{N+1},   cov(υ_{N+1}, ω_N) = x_{N+1}^T Λ_N^{-1},

and substituting them into the Bayesian update results in

ζ_{N+1} = m(x_{N+1}) Λ_N^{-1} x_{N+1} ( m(x_{N+1}) x_{N+1}^T Λ_N^{-1} x_{N+1} + τ_{N+1}^{-1} )^{-1},   (5.51)
ŵ_{N+1} = ŵ_N + ζ_{N+1} ( y_{N+1} − ŵ_N^T x_{N+1} ),   (5.52)
Λ_{N+1}^{-1} = Λ_N^{-1} − ζ_{N+1} x_{N+1}^T Λ_N^{-1}.   (5.53)

This form of the Kalman filter is known as the covariance form as we are operating on the covariance matrix Λ^{-1} rather than the precision matrix Λ.

The value ζN+1 is known as the Kalman gain and is a temporary measure that

depends on the current model ωN and the new observation. It mediates how

much ωN is corrected, that is, how much the current input xN+1 influences

Λ_{N+1}^{-1}, and how the output residual y_{N+1} − ŵ_N^T x_{N+1} contributes to computing ŵ_{N+1}.

From the update equations we can see that as the measurement noise variance τ_{N+1}^{-1} approaches zero, the gain ζ_{N+1} weights the output residual more heavily. On the other hand, as the weight covariance Λ_N^{-1} approaches zero, the gain ζ_{N+1} assigns less weight to the output residual [236]. This is the behaviour that we would intuitively expect, as low-noise observations should influence the model parameters more strongly than high-noise observations. Also, the gain is mediated by the matching function and in the cases of non-matched inputs reduced to zero, which causes the model parameters to remain unchanged.
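The following sketch (assuming NumPy arrays; the function name is illustrative) performs one covariance-form update according to Eqs. (5.51) to (5.53), given a known noise precision τ_{N+1}.

import numpy as np

def kalman_covariance_update(w_hat, Lambda_inv, x, y, m_x, tau):
    Li_x = Lambda_inv @ x
    # Kalman gain, Eq. (5.51)
    gain = m_x * Li_x / (m_x * (x @ Li_x) + 1.0 / tau)
    w_hat = w_hat + gain * (y - w_hat @ x)              # Eq. (5.52)
    Lambda_inv = Lambda_inv - np.outer(gain, Li_x)      # Eq. (5.53)
    return w_hat, Lambda_inv

With m_x = 0 the gain vanishes and both the state estimate and its covariance remain untouched, as discussed above.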

Inverse Covariance Form

Using the Kalman filter to estimate the system state requires the definition of a prior ω_0. In many cases, we do not have any knowledge of what the correct prior might be, and setting it arbitrarily might introduce an unnecessary bias. While complete lack of information can be theoretically induced as the limiting case of certain eigenvalues of Λ_0^{-1} going to infinity [166, Ch. 5.7], it cannot be used in practice due to large numerical errors when evaluating Eq. (5.51).

This problem can be dealt with by operating the Kalman filter in the inverse covariance form rather than the previously introduced covariance form. To update Λ rather than Λ^{-1}, we substitute ζ_{N+1} from Eq. (5.51) into Eq. (5.53) and apply the Matrix Inversion Lemma (for example, [106, Ch. 9.2]) to get

Λ_{N+1} = Λ_N + m(x_{N+1}) τ_{N+1} x_{N+1} x_{N+1}^T.   (5.54)

The weight update is derived by combining Eq. (5.51) and Eq. (5.53) to get

ζ_{N+1} = m(x_{N+1}) τ_{N+1} Λ_{N+1}^{-1} x_{N+1},   (5.55)

which, when substituted into Eq. (5.52), gives

ŵ_{N+1} = ŵ_N + m(x_{N+1}) τ_{N+1} Λ_{N+1}^{-1} x_{N+1} ( y_{N+1} − ŵ_N^T x_{N+1} ).   (5.56)

We get the final update equation by pre-multiplying the above by ΛN+1 and

substituting Eq. (5.54) for the first Λ_{N+1} of the resulting equation, giving

Λ_{N+1} ŵ_{N+1} = Λ_N ŵ_N + m(x_{N+1}) τ_{N+1} x_{N+1} y_{N+1}.   (5.57)

Thus, we are not directly updating ŵ but rather the vector (Λŵ) ∈ R^{D_X}, from which we can recover ŵ by ŵ = Λ^{-1}(Λŵ). Hence, even though the initial Λ might be singular and therefore cannot be inverted to calculate ŵ, it can still be updated by Eq. (5.54) until it is non-singular and can be inverted. This allows us to use the non-informative prior Λ_0 = 0 that cannot be used when applying the covariance form of the Kalman filter.

Minimum Model Error Philosophy

For deriving the Kalman filter update equations we have assumed knowledge of the measurement noise variances {τ_1^{-1}, τ_2^{-1}, ...}. In our application of the Kalman filter that is not the case, and so we have to find a method that allows us to estimate the variances at the same time as the system state.

Assuming a different measurement noise variance for each observation makes estimating these prohibitive, as it would require us to estimate more parameters than we have observations. To reduce the degrees of freedom of the model we will make the assumption that τ is constant for all observations, that is, τ_1 = τ_2 = ··· = τ. In addition, we adopt the Minimum Model Error (MME) philosophy [172] that aims at finding the model parameters that minimise the model error, which is determined by the noise variance τ. The MME is based on the Covariance Constraint condition, which states that the observation-minus-estimate error variance must match the observation-minus-truth error variance, that is

( y_n − ŵ^T x_n )² ≈ ( m(x_n) τ )^{-1}.   (5.58)

Given that constraint and the assumption of not having any process noise, the model error for the nth observation is given by weighting the left-hand side of Eq. (5.58) by the inverted right-hand side, which, for N observations, results in

τ Σ_{n=1}^N m(x_n) ( ŵ^T x_n − y_n )².   (5.59)

Minimising the above is independent of τ and therefore equivalent to Eq. (5.5). Thus, assuming a constant measurement noise variance has led us back to minimising the error that we originally intended to minimise.

Relation to Recursive Least Squares

In deriving the Kalman filter update we arrived at a set of equations that are very similar but not quite the same as the RLS update equations. Maybe the most obvious match is the inverse covariance update Eq. (5.54) of the Kalman filter, and Eq. (5.31) of the RLS algorithm, only differing by the additional term τ_{N+1} in Eq. (5.54). Similarly, Eq. (5.56) and Eq. (5.34) differ by the same term.

In fact, if we substitute all Λ in the RLS update equations by τ^{-1}Λ, and apply the assumption τ_1 = τ_2 = ··· = τ to the Kalman filter equations, these equations become equivalent. More specifically, the covariance form of the Kalman filter corresponds to the RLS algorithm that uses Eq. (5.35), and the inverse covariance form is equivalent to using Eq. (5.31). They also share the same characteristics: while Eq. (5.35) is computationally cheaper, it cannot be used with a non-informative prior, just like the covariance form. Conversely, using Eq. (5.31) allows the use of non-informative priors, but requires a matrix inversion with every additional update, as does the inverse covariance form to recover ŵ by ŵ = Λ^{-1}(Λŵ), making it computationally more expensive.

The information we gain from this relation is manifold:

• The weight vector of the linear model we apply corresponds to the system state of the Kalman filter. Hence, it can be modelled by a multivariate Gaussian that, in the notation of the RLS algorithm, is given by ω_N ∼ N(ŵ_N, (τ Λ_N)^{-1}). As τ is unknown, it needs to be substituted by its estimate τ̂.

• Acquiring this model for ω causes the output random variable υ to become Gaussian as well. Hence, using the model for prediction, these predictions will be Gaussian. More specifically, given a new input x′, the

109 predictive density is

y′ ∼ N( ŵ^T x′, τ̂^{-1} ( x′^T Λ^{-1} x′ + m(x′)^{-1} ) ),   (5.60)

and is thus centred on ŵ^T x′. Its spread is determined on one hand by the estimated noise variance (m(x′) τ̂)^{-1} and on the other hand by the uncertainty of the weight vector estimate, x′^T (τ̂ Λ)^{-1} x′. The Λ in the above equations refers to the one estimated by the RLS algorithm. Following [103, Ch. 8.2.1], we can get the two-sided 95% confidence of the standard normal distribution by considering its 97.5% point (as (100% − 2 × 2.5%) = 95%), which is 1.96. Therefore, the 95% confidence interval of the classifier predictions is centred on the mean of Eq. (5.60) with 1.96 times the square root of the prediction's variance to either side of the mean (see the sketch following this list).

• In deriving the Kalman filter update equations, we have embedded matching as a modifier to the measurement noise variance, that is, ǫ_n ∼ N(0, (m(x_n) τ)^{-1}), which gives us a new interpretation for matching: a matching value between 0 and 1 for a certain input can be interpreted as reducing the amount of information that the model acquires about the associated observation by increasing the noise of the observation and hence reducing its certainty.

• A similar interpretation can be given to RLS with recency-weighting: the decay factor λ acts as a multiplier to the noise precision of past observations and hence reduces their certainty. This causes the model to put more emphasis on more recent observations due to their lower noise variance. Formally, modelling the noise for the nth observation after N observations by

ǫ_n ∼ N( 0, ( m(x_n) τ λ^{Σ_{j=n+1}^N m(x_j)} )^{-1} )   (5.61)

causes the Kalman filter to perform the same recency weighting as the recency-weighted RLS variant.

• The Gaussian prior on ω provides a different interpretation of the ridge complexity λ in ridge regression: recalling that λ corresponds to initialising RLS with Λ_0^{-1} = λ^{-1} I, it is also equivalent to using the Kalman filter with the prior ω_0 ∼ N(0, (λτ)^{-1} I). Hence, ridge regression assumes the weight vector to be centred on 0 with an independent variance of (λτ)^{-1} for each element of this vector. As the prior covariance is proportional to the real noise variance τ^{-1}, a smaller variance will cause stronger shrinkage due to a more informative prior.
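As referenced above, the following sketch (assuming NumPy arrays, a matched input with m(x′) > 0, and illustrative names) computes the mean, variance and 95% confidence interval of the predictive density Eq. (5.60) from the quantities maintained by the RLS algorithm.

import numpy as np

def predictive_interval(w_hat, Lambda_inv, tau_hat, x_new, m_x):
    mean = w_hat @ x_new
    # predictive variance of Eq. (5.60): weight uncertainty plus measurement noise
    var = (x_new @ Lambda_inv @ x_new + 1.0 / m_x) / tau_hat
    half_width = 1.96 * np.sqrt(var)      # two-sided 95% interval of a Gaussian
    return mean, var, (mean - half_width, mean + half_width)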

What if the noise distribution is not Gaussian? Would that invalidate the approach taken by RLS and the Kalman filter? Fortunately, the Gauss-Markov Theorem (for example, [96]) states that the least squares estimate is optimal independent of the shape of the noise distribution, as long as its variance is constant over all observations. Nonetheless, adding the assumption of Gaussian noise and acquiring a Gaussian model for the weight vector allows us to specify the predictive density. Without these assumptions, we would be unable to make any statements about this density, and are subsequently also unable to provide a measure for the prediction confidence.

In summary, while we have demonstrated the formal equivalence between the Kalman filter in our application and the RLS algorithm, relating the two methods significantly increases the understanding of the assumptions under- lying the RLS method and provides intuitive interpretations for matching and recency-weighting by relating them to an increased uncertainty about the ob- servations.

5.3.7 Incremental Noise Precision Estimation

So far, we have concentrated our discussion of incremental methods on how to estimate the weight vector that solves Eq. (5.5). Let us now consider how we can estimate the noise precision by incrementally solving Eq. (5.6).

For batch learning it was already demonstrated that Eqs. (5.11) and (5.13) provide a biased and unbiased noise precision estimate that solves Eq. (5.6). The same solutions are valid when using an incremental approach, and thus, after N observations,

\[ \hat{\tau}_N^{-1} = c_N^{-1} \left\| X_N \hat{w}_N - y_N \right\|^2_{M_N} \qquad (5.62) \]

provides a biased estimate of the noise precision, and

\[ \hat{\tau}_N^{-1} = (c_N - D_X)^{-1} \left\| X_N \hat{w}_N - y_N \right\|^2_{M_N} \qquad (5.63) \]

is the unbiased estimate. Ideally, wˆ_N is the weight vector that satisfies the Principle of Orthogonality, but if gradient-based methods are utilised, we are forced to rely on the current (possibly quite wrong) estimate.

Let us firstly derive a gradient-based method for estimating the noise preci- sion, which is the one applied in XCS. Following that, we introduce a much more accurate approach that can be used alongside the RLS algorithm to track the exact noise precision estimate after Eq. (5.63) for each additional observa- tion.

Estimation by Gradient Descent

We can reformulate the problem of computing Eq. (5.62) as finding the minimum of

\[ \sum_{n=1}^{N} m(x_n) \left( \tau^{-1} - (\hat{w}_N^T x_n - y_n)^2 \right)^2 . \qquad (5.64) \]

That the minimum of the above with respect to τ^{-1} is indeed Eq. (5.62) can easily be shown by setting its gradient with respect to τ^{-1} to zero and solving for τ^{-1}.

This minimisation problem can now be solved with any gradient-based method. Applying the LMS algorithm, the resulting update equation is given by

\[ \hat{\tau}_{N+1}^{-1} = \hat{\tau}_N^{-1} + \gamma\, m(x_{N+1}) \left( (\hat{w}_{N+1}^T x_{N+1} - y_{N+1})^2 - \hat{\tau}_N^{-1} \right). \qquad (5.65) \]

While this method provides a computationally cheap approach to estimating the noise precision, it is flawed in several ways: firstly, it suffers under some circumstances from slow convergence speed, just as any other gradient-based method. Secondly, at each step, the method relies on the updated weight vector estimate, but does not take into account that changing the weight vector also modifies past estimates and with it the squared estimation error. Finally, by minimising Eq. (5.64) we are computing the biased estimate Eq. (5.62) rather than the unbiased estimate Eq. (5.63). Let us now introduce a method that solves all of these problems.
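Before turning to that alternative, the gradient-based update Eq. (5.65) itself amounts to only a few lines. The following sketch assumes the current weight vector estimate is supplied from outside; the function and argument names are illustrative only.

```python
import numpy as np

def lms_noise_precision_update(tau_inv, w_hat, x, y, m_x, gamma):
    """One LMS step towards the biased noise variance estimate, Eq. (5.65).

    tau_inv is the current estimate of tau^{-1}, gamma the step size,
    w_hat and x are numpy vectors, m_x the matching value m(x)."""
    squared_error = float(np.dot(w_hat, x) - y) ** 2
    return tau_inv + gamma * m_x * (squared_error - tau_inv)
```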

Estimation by Direct Tracking

Let us assume that the sequence of weight vector estimates {wˆ_1, wˆ_2, ...} satisfies the Principle of Orthogonality, which we can achieve by utilising the RLS algorithm. In the following, we derive a method for incrementally updating ‖X_N wˆ_N − y_N‖²_{M_N}, which then allows us to accurately track the unbiased noise precision estimate Eq. (5.63).

At first, let us derive a simplified expression for ‖X_N wˆ_N − y_N‖²_{M_N}: based on the Corollary to the Principle of Orthogonality Eq. (5.17) and y_N = X_N wˆ_N − (X_N wˆ_N − y_N) we get

\[
\begin{aligned}
y_N^T M_N y_N &= \hat{w}_N^T X_N^T M_N X_N \hat{w}_N - 2 \hat{w}_N^T X_N^T M_N (X_N \hat{w}_N - y_N) \\
&\quad + (X_N \hat{w}_N - y_N)^T M_N (X_N \hat{w}_N - y_N) \\
&= \hat{w}_N^T X_N^T M_N X_N \hat{w}_N + \left\| X_N \hat{w}_N - y_N \right\|^2_{M_N}, \qquad (5.66)
\end{aligned}
\]

which, for the sum of squared errors, results in

\[ \left\| X_N \hat{w}_N - y_N \right\|^2_{M_N} = y_N^T M_N y_N - \hat{w}_N^T X_N^T M_N X_N \hat{w}_N . \qquad (5.67) \]

To express ‖X_{N+1} wˆ_{N+1} − y_{N+1}‖²_{M_{N+1}} in terms of ‖X_N wˆ_N − y_N‖²_{M_N}, we combine Eqs. (5.31), (5.32) and (5.67), and use Λ_N wˆ_N = X_N^T M_N y_N after Eq. (5.30) to get

\[
\begin{aligned}
\left\| X_{N+1} \hat{w}_{N+1} - y_{N+1} \right\|^2_{M_{N+1}}
&= y_{N+1}^T M_{N+1} y_{N+1} - \hat{w}_{N+1}^T X_{N+1}^T M_{N+1} X_{N+1} \hat{w}_{N+1} \\
&= \left\| X_N \hat{w}_N - y_N \right\|^2_{M_N} + m(x_{N+1}) y_{N+1}^2 + \hat{w}_N^T \Lambda_N \hat{w}_N - \hat{w}_{N+1}^T \Lambda_{N+1} \hat{w}_{N+1} \\
&= \left\| X_N \hat{w}_N - y_N \right\|^2_{M_N} + m(x_{N+1}) y_{N+1}^2 \\
&\quad + \hat{w}_N^T \left( \Lambda_N + m(x_{N+1}) x_{N+1} x_{N+1}^T \right) \hat{w}_{N+1} - m(x_{N+1}) \hat{w}_N^T x_{N+1} y_{N+1} \\
&\quad - \hat{w}_{N+1}^T \left( \Lambda_N \hat{w}_N + m(x_{N+1}) x_{N+1} y_{N+1} \right) \\
&= \left\| X_N \hat{w}_N - y_N \right\|^2_{M_N} + m(x_{N+1}) y_{N+1}^2 + m(x_{N+1}) \hat{w}_N^T x_{N+1} x_{N+1}^T \hat{w}_{N+1} \\
&\quad - m(x_{N+1}) \hat{w}_N^T x_{N+1} y_{N+1} - m(x_{N+1}) \hat{w}_{N+1}^T x_{N+1} y_{N+1} \\
&= \left\| X_N \hat{w}_N - y_N \right\|^2_{M_N} + m(x_{N+1}) (\hat{w}_N^T x_{N+1} - y_{N+1})(\hat{w}_{N+1}^T x_{N+1} - y_{N+1}).
\end{aligned}
\]

Thus, we have the following result:

Theorem 5.3.3 (Incremental Sum of Squared Error Update). Let the sequence

of weight vector estimates {wˆ_1, wˆ_2, ...} satisfy the Principle of Orthogonality Eq. (5.16). Then

\[ \left\| X_{N+1} \hat{w}_{N+1} - y_{N+1} \right\|^2_{M_{N+1}} = \left\| X_N \hat{w}_N - y_N \right\|^2_{M_N} + m(x_{N+1}) (\hat{w}_N^T x_{N+1} - y_{N+1})(\hat{w}_{N+1}^T x_{N+1} - y_{N+1}) \qquad (5.68) \]

holds.

An almost equal derivation reveals that the sum of squared errors for the recency-weighted RLS variant is given by

\[ \left\| X_{N+1} \hat{w}_{N+1} - y_{N+1} \right\|^2_{M_{N+1}} = \lambda^{m(x_{N+1})} \left\| X_N \hat{w}_N - y_N \right\|^2_{M_N} + m(x_{N+1}) (\hat{w}_N^T x_{N+1} - y_{N+1})(\hat{w}_{N+1}^T x_{N+1} - y_{N+1}), \qquad (5.69) \]

where, when compared to Eq. (5.68), the current sum of squared errors is ad- ditionally discounted.

In summary, we can track the unbiased noise precision estimate by directly solving Eq. (5.63), where the match count is updated by

cN+1 = cN + m(xN+1), (5.70)

and the sum of squared errors is updated by Eq. (5.68). As Theorem 5.3.3 states, Eq. (5.68) is only valid if the Principle of Orthogonality holds. However, using the computationally cheaper RLS implementation that involves Eq. (5.35) introduces an initial bias and hence violates the Principle of Orthogonality. Nonetheless, if δ in Λ_0^{-1} = δI is set to a very large positive scalar, this bias is negligible, and hence we can still apply Eq. (5.68) with only minor inaccuracy.
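As an illustration of how these pieces fit together, the following sketch combines the covariance-form RLS weight update with direct tracking of the sum of squared errors and the match count. It is a simplified illustration under the assumptions of this section, not the exact implementation used later; all names are ours.

```python
import numpy as np

class RLSClassifier:
    """Covariance-form RLS for one classifier with direct tracking of the
    sum of squared errors (cf. Eqs. (5.34)/(5.35), (5.68), (5.70), (5.63))."""

    def __init__(self, dim, delta=1000.0):
        self.w = np.zeros(dim)                  # weight vector estimate
        self.Lambda_inv = delta * np.eye(dim)   # Lambda_0^{-1} = delta I, delta large
        self.sse = 0.0                          # ||X w - y||^2_M, tracked incrementally
        self.c = 0.0                            # match count c_N
        self.dim = dim

    def update(self, x, y, m):
        if m == 0.0:                            # unmatched inputs leave the model untouched
            return
        err_before = self.w @ x - y             # residual using w_N
        Lx = self.Lambda_inv @ x
        # covariance-form update of Lambda^{-1}, Eq. (5.35)
        self.Lambda_inv -= m * np.outer(Lx, Lx) / (1.0 + m * (x @ Lx))
        # weight update, Eq. (5.34)
        self.w = self.w - m * (self.Lambda_inv @ x) * err_before
        err_after = self.w @ x - y              # residual using w_{N+1}
        self.sse += m * err_before * err_after  # Eq. (5.68)
        self.c += m                             # Eq. (5.70)

    def noise_variance(self):
        """Unbiased noise variance estimate, Eq. (5.63); requires c > dim."""
        return self.sse / (self.c - self.dim)
```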

Example 5.3.3 (Noise Precision Estimation for Averaging Classifiers). Let

us assume averaging classifiers, that is x_n = 1 for all n > 0. Given that we utilise a gradient-based method to estimate the weight vector, we are violating the Principle of Orthogonality, and hence have to use Eq. (5.65) to estimate the noise precision, resulting in

\[ \hat{\tau}_{N+1}^{-1} = \hat{\tau}_N^{-1} + m(x_{N+1}) \left( (\hat{w}_{N+1} - y_{N+1})^2 - \hat{\tau}_N^{-1} \right). \qquad (5.71) \]

Alternatively, we can use the RLS algorithm Eq. (5.46) for averaging classifiers, and use Eq. (5.68) to accurately track the noise precision by

\[ \hat{\tau}_{N+1}^{-1} = \hat{\tau}_N^{-1} + m(x_{N+1}) (\hat{w}_N - y_{N+1})(\hat{w}_{N+1} - y_{N+1}). \qquad (5.72) \]

Note that while the computational cost of both approaches is equal (in their application to averaging classifiers), the second approach is vastly superior in its weight vector and noise precision estimation accuracy and should therefore always be preferred.
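For averaging classifiers, both updates reduce to a handful of operations. The following sketch contrasts the two variants of this example, assuming Eq. (5.46) denotes the recursive weighted-mean weight update; the function names are illustrative.

```python
def averaging_rls_update(w, sse, c, y, m):
    """Averaging classifier (x_n = 1): recursive weighted-mean weight update
    with direct tracking of the squared-error sum, cf. Eqs. (5.46) and (5.72)."""
    c_new = c + m
    w_new = w + (m / c_new) * (y - w)
    sse_new = sse + m * (w - y) * (w_new - y)
    return w_new, sse_new, c_new

def averaging_lms_update(w, tau_inv, y, m, gamma):
    """Gradient-based alternative: LMS weight and noise updates, cf. Eqs. (5.25) and (5.71)."""
    w_new = w + gamma * m * (y - w)
    tau_inv_new = tau_inv + m * ((w_new - y) ** 2 - tau_inv)
    return w_new, tau_inv_new
```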

Squared Error or Absolute Error?

XCSF (of which XCS is a special case) initially applied the NLMS method Eq. (5.29) [240], and later the RLS algorithm by Eqs. (5.34) and (5.35) [143, 144] to estimate the weight vector. The classifier estimation error is tracked by the LMS update

\[ \hat{\tau}_{N+1}^{-1} = \hat{\tau}_N^{-1} + m(x_{N+1}) \left( \left| \hat{w}_{N+1}^T x_{N+1} - y_{N+1} \right| - \hat{\tau}_N^{-1} \right), \qquad (5.73) \]

to — after N observations — perform stochastic incremental gradient descent on the error function

\[ \sum_{n=1}^{N} m(x_n) \left( \tau^{-1} - \left| \hat{w}_N^T x_n - y_n \right| \right)^2 . \qquad (5.74) \]

Therefore, the error that is estimated is the mean absolute error

\[ c_N^{-1} \sum_{n=1}^{N} m(x_n) \left| \hat{w}_N^T x_n - y_n \right| , \qquad (5.75) \]

rather than the MSE Eq. (5.62). Thus, XCSF does not estimate the error that its weight vector estimate aims at minimising, and does not give a justification for this inconsistency — probably because the errors that are minimised have never before been explicitly expressed. While there is no systematic study that compares using Eq. (5.62) rather than Eq. (5.75) as the classifier error estimate

in XCSF, we have recommended in [156] to use the MSE for the reason of consistency and easier tracking by Eq. (5.68), and — as shown here — to provide its probabilistic interpretation as the noise precision estimate τˆ of the linear model.

5.3.8 Summarising Incremental Learning Approaches

We have introduced various approaches to estimating the weight vector and noise precision estimate of the linear model Eq. (5.3). While the gradient-based methods, such as LMS or NLMS, are computationally cheap, they require problem-dependent tuning of the step size and might feature slow convergence to the optimal estimates. RLS and Kalman filter approaches, on the other hand, scale at best with O(D_X²), but are able to accurately track both the optimal weight vector estimate and its associated noise precision estimate simultaneously.

Table 5.1 gives a summary of all the methods introduced in this chapter (omitting the recency-weighted variants), together with their computational complexity. As can be seen, this complexity is exclusively dependent on the size of the input vectors as used by the classifier model (in contrast to their use for matching). Given averaging classifiers, we have D_X = 1, and thus all methods have equal complexity. In this case, the RLS algorithm with direct noise precision tracking should always be applied. For higher-dimensional input spaces, the choice of the algorithm depends on the available computational resources, but the RLS approach should always be given a strong preference.

5.4 Empirical Demonstration

Having described the advantage of utilising the RLS algorithm for estimating the weight vector and tracking the noise variance simultaneously, we will in this section demonstrate its superiority over gradient-based methods empirically with two simple experiments. The experiments show on one hand that the speed of convergence of the LMS and NLMS algorithms is lower than that of the RLS algorithm and depends on the values of the inputs, and on the other hand that direct tracking of the noise variance is more accurate than estimating it by the LMS method.

[Figure 5.1 about here.]

Figure 5.1: The graph shows the weight estimate (on the left scale) and noise variance estimate (on the right scale) of different averaging classifiers when being presented with observations sampled from N(5, 1). The weight estimate of the RLSLMS classifier is not shown, as it is equivalent to the estimate of the RLS classifier.

5.4.1 Experimental Setup

We use the following classifier setups:

NLMS Classifier. This classifier uses the NLMS algorithm Eq. (5.29) to estimate the weight vector, starting with wˆ_0 = 0, and a constant step size of γ = 0.2. For one-dimensional input spaces, D_X = 1, with x_n = 1 for all n > 0, the NLMS algorithm is equivalent to the LMS algorithm Eq. (5.25), in which case we use the variable step size

\[ \gamma_N = \begin{cases} 1/c_N & \text{if } c_N \leq 1/\gamma, \\ \gamma & \text{otherwise,} \end{cases} \qquad (5.76) \]

which is known as the MAM update [223], and is equivalent to bootstrapping the estimate by RLS (see Example 5.3.2).

The noise variance is estimated by the LMS algorithm Eq. (5.65), with an initial τˆ_0^{-1} = 0, and a step size that follows the MAM update Eq. (5.76). Thus, the NLMS classifier uses the same techniques for weight vector and noise variance estimation as XCS(F), with the only difference that we are estimating the correct variance, rather than inconsistently the mean absolute error Eq. (5.75) (see also Section 5.3.7). Hence, the performance of NLMS classifiers reflects the performance of classifiers in XCS(F).

RLSLMS Classifier. The weight vector is estimated by the RLS algorithm, using Eqs. (5.34) and (5.35), with initialisation wˆ_0 = 0 and Λ_0^{-1} = 1000 I. The noise variance is estimated by the LMS algorithm, just as for the NLMS classifier. This setup conforms to XCSF classifiers with RLS as first introduced in [143, 144].

RLS Classifier. As before, the weight vector is estimated by the RLS algorithm Eqs. (5.34) and (5.35), with initialisation wˆ_0 = 0 and Λ_0^{-1} = 1000 I. The noise variance is estimated by tracking the sum of squared errors according to Eq. (5.68) and evaluating Eq. (5.63) for the unbiased variance estimate.

In both experiments, all three classifiers are used for the same regression task, with the assumption that they match all inputs, that is, m(x_n) = 1 for all n > 0. Their performance in estimating the weight vector is measured by the MSE of their model, evaluated with respect to the target function f over 1000 inputs that are evenly distributed over the function's domain, using Eq. (5.11). The quality of the estimated noise variance is evaluated by its squared error when compared to the unbiased noise variance estimate Eq. (5.13) of a linear model trained by Eq. (5.8) over 1000 observations that are evenly distributed over the function's domain.

For the first experiment, averaging classifiers with x_n = 1 for all n > 0 are used to estimate weight and noise variance of the noisy target function f_1(x) = 5 + N(0, 1). Hence, the correct weight estimate is wˆ = 5, with noise variance τˆ^{-1} = 1. As the function output is independent of its input, its domain does not need to be defined. The target function of the second experiment is the sinusoid f_2(x_n) = sin(i_n) with inputs x_n = (1, i_n)^T, hence using classifiers that model straight lines. The experiment is split into two parts, where in the first part the function is modelled over the domain i_n ∈ [0, π/2), and in the second part over i_n ∈ [π/2, π). The classifiers are trained incrementally, by presenting them with observations that are uniformly sampled from the target function's domain.
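The first experiment is simple enough to sketch in a few lines. The following is an illustrative reconstruction under the setup just described — not the original experimental code — comparing an LMS averaging classifier with MAM step sizes against an RLS averaging classifier with direct noise tracking on 50 samples of f_1.

```python
import numpy as np

rng = np.random.default_rng(0)
N, gamma = 50, 0.2
ys = 5.0 + rng.normal(0.0, 1.0, size=N)        # observations of f_1(x) = 5 + N(0, 1)

w_lms, tau_inv_lms = 0.0, 0.0                  # LMS weight and noise estimates
w_rls, sse, c = 0.0, 0.0, 0.0                  # RLS weight, squared-error sum, match count

for n, y in enumerate(ys, start=1):
    step = 1.0 / n if n <= 1.0 / gamma else gamma        # MAM update, Eq. (5.76)
    w_lms += step * (y - w_lms)                          # LMS weight update
    tau_inv_lms += step * ((w_lms - y) ** 2 - tau_inv_lms)   # LMS noise update, Eq. (5.65)

    c += 1.0
    w_prev = w_rls
    w_rls += (y - w_rls) / c                             # RLS for averaging classifiers
    sse += (w_prev - y) * (w_rls - y)                    # direct tracking, Eq. (5.68)

tau_inv_rls = sse / (c - 1.0)                            # unbiased estimate, Eq. (5.63)
print(w_lms, tau_inv_lms, w_rls, tau_inv_rls)            # both weights near 5, noise near 1
```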

Statistical significance of the difference in the classifiers' performance in estimating the weight vector and noise variance is evaluated by comparing the sequences of model MSEs and squared noise variance estimation errors, respectively, after each additional observation, and over 20 experimental runs. These sequences violate the standard analysis of variance (ANOVA) assumption of homogeneity of covariances, and thus we utilise the randomised ANOVA procedure as introduced in [187], which was specifically designed to analyse the difference between performance curves of machine learning algorithms. It is based on estimating the sampling distribution of the null hypothesis ("all methods feature the same performance") by sampling the standard two-way ANOVA F-values from randomly reshuffled performance curves between the methods, where we use a sample size of 5000. The two factors are the type of classifier that is used, and the number of observations that the classifier has been trained on, where performance is measured by the model or noise variance error. We only report significant differences between classifier types, and use Tukey's HSD post hoc test to determine the direction of the effect.

Figures 5.1 and 5.2 show one run of training the classifiers on f_1 and f_2, respectively. Figure 5.1 illustrates how the weight and noise variance estimates differ between classifiers when trained on the same 50 observations. Figure 5.2, on the other hand, does not display the estimates themselves, but rather shows the error of the weight vector and noise variance estimates. Let us first evaluate the ability of the different classifiers to estimate the weight vector.

119 5.4.2 Weight Vector Estimate

In our discussion of the weight vector estimate we will ignore the RLSLMS classifier due to its equivalence to the RLS classifier. Figure 5.1 shows that while both the NLMS and the RLS algorithm estimate the weight to be about wˆ = 5, the RLS algorithm is more stable in its estimate. In fact, comparing the model MSEs by the randomised ANOVA procedure reveals that this error is significantly lower for the RLS method (randomised ANOVA: F_alg(2, 2850) = 38.0, F*_alg,.01 = 25.26, p < .01). Figure 5.1 also clearly illustrates that utilising the MAM causes the weight estimates to be initially equivalent to the RLS estimates, until 1/γ = 5 observations are reached. As the input to the averaging classifier is always x_n = 1, the speed of convergence of the LMS classifier does not depend on these inputs.

The second experiment, on the other hand, demonstrates how ill-conditioned inputs cause the convergence speed of the NLMS algorithm to deteriorate. The upper graph of Figure 5.2 shows that while the weight estimate is close to optimal after 10 observations for the RLS classifier, the NLMS classifier requires more than 50 observations to reach a similar performance when modelling f_2 over i_n ∈ [0, π/2). Even worse, changing the sampling range to i_n ∈ [π/2, π) causes the NLMS performance to drop such that it still features an MSE of around 0.1 after 300 observations, while the performance of the RLS classifier remains unchanged, as shown by the lower graph of Figure 5.2. This drop can be explained by the increasing eigenvalues of c_N^{-1} X_N^T M_N X_N that, as discussed in Section 5.3.4, reduce the speed of convergence. The minimal MSE of a linear model is in both cases approximately 0.00394, and the difference in performance between the NLMS and the RLS classifier is in both cases significant (randomised ANOVA for i_n ∈ [0, π/2]: F_alg(2, 2850) = 973.0, F*_alg,.001 = 93.18, p < .001; randomised ANOVA for i_n ∈ [π/2, π]: F_alg(2, 17100) = 88371.5, F*_alg,.001 = 2190.0, p < .001).

5.4.3 Noise Variance Estimate

As the noise variance estimate depends by Eq. (5.63) on a good estimate of the weight vector, we can expect classifiers that perform poorly on estimating the

weight vector not to perform any better when estimating the noise variance. This suggestion is confirmed when considering the noise variance estimate of the NLMS classifier in Figure 5.1, which fluctuates heavily around the correct value of 1. While the RLSLMS classifier has a weight estimate equivalent to that of the RLS classifier, its noise variance estimate fluctuates almost as heavily as that of the NLMS classifier, as it also uses LMS to perform this estimate. Thus, while a good weight vector estimate is a basic requirement for estimating the noise variance, the applied LMS method seems to perform even worse when estimating the noise variance than when estimating the weight. As can be seen in Figure 5.1, direct tracking of the noise variance in combination with the RLS algorithm for a stable weight estimate gives the least noisy and most accurate estimate. Indeed, while there is no significant difference in the squared estimation error between the NLMS and RLSLMS classifiers (randomised ANOVA: F_alg(2, 2850) = 53.68, F*_alg,.001 = 29.26, p < .001; Tukey's HSD: p > .05), the RLS classifier features a significantly better estimate than both of the other classifier types (Tukey's HSD: for both NLMS and RLSLMS p < .01).

Conceptually, the same pattern can be observed in the second experiment, as shown in Figure 5.2. However, in this case, the influence of a badly estimated weight vector becomes clearer, and is particularly visible for the NLMS classifier. Recall that in this figure we are plotting the estimation errors rather than the estimates themselves, and hence the upper graph shows that the NLMS classifier only provides estimates that are comparable to those of the RLSLMS and RLS classifiers after 30 observations. The performance of NLMS in the case of ill-conditioned inputs is even worse; its estimation performance never matches that of the classifiers that utilise the RLS algorithm for their weight vector estimate. In contrast to the first experiment there is no significant difference between the noise variance estimation errors of the RLSLMS and RLS classifiers, but in both cases they are significantly better than the NLMS classifier (for i_n ∈ [0, π/2]: randomised ANOVA: F_alg(2, 2850) = 171.41, F*_alg,.001 = 32.81, p < .001; Tukey's HSD: NLMS vs. RLSLMS and RLS p < .01, RLSLMS vs. RLS p > .05; for i_n ∈ [π/2, π]: randomised ANOVA: F_alg(2, 17100) = 4268.7, F*_alg,.001 = 577.89, p < .001; Tukey's HSD: NLMS vs. RLS and RLSLMS p < .01, RLSLMS vs. RLS p > .05).

In summary, both experiments in combination demonstrate that to provide a good noise variance estimate, the method needs to estimate the weight vector

121 well, and that direct tracking of this estimate is better than its estimation by the LMS algorithm.

5.5 Discussion and Summary

We started this chapter by describing the aim of the local model that is rep- resented by a classifier as maximising its likelihood, as this follows from the probabilistic LCS model in the previous chapter. The rest of this chapter was devoted to describing a batch approach and comparing and contrasting sev- eral incremental learning approaches to estimating the maximum likelihood parameter values.

In more detail, we have reduced the problem of estimating the weight vector to a weighted least squares problem Eq. (5.5), which by itself is a well known problem with a multitude of approaches that go far beyond the ones described in this chapter. Nonetheless, the actual contribution of this chapter is to, for the first time, explicitly identify what a classifier aims at learning, and to derive several approaches to reach this aim in a principled manner. In addition, we also provide new LCS-related probabilistic interpretations for i) the linear model and its noise structure, ii) the model error as the noise variance, iii) an explicit weight vector model that allows for the specification of a predictive density, and iv) matching and recency-weighting as uncertainty of the observations.

The weight update of the original XCS conforms to Eq. (5.25) with x_n = 1 for all n > 0 and hence, as firstly shown here, aims at minimising the squared error Eq. (5.5). Later, XCS was modified to act as a regression model [243], and extended to XCSF to model straight lines [244] by using the NLMS update Eq. (5.29), again without explicitly stating a single classifier's aim. In a similar manner, the classifier model was extended to a full linear model [142]4.

Simultaneously, and similar to our discussion in Section 5.3.4, the convergence of gradient-based methods was identified as a problem [143, 144], but in con- trast to our discussion, [143] apply the stability criteria of steepest gradient

4Despite the title “Extending XCSF Beyond Linear Approximation” of [142], the underlying model is still linear.

descent to the NLMS method. As an alternative, the RLS algorithm was proposed to estimate the weight vector, but the aim of a classifier was specified without considering matching, and matching was implemented by only updating the classifier's parameters if that classifier matches the current input. While this is a valid procedure from the algorithmic perspective, it does not make matching explicit in the classifier's aim, and cannot deal with matching to a degree. Our formulation of the aim Eq. (5.5), in contrast, provides both features and thereby leads to a better understanding and greater flexibility of the classifier model.

At about the same time as the RLS algorithm was introduced to estimate the weight vector, in addition to deriving the various incremental weight update equations from first principles, we have applied the Kalman filter for this task [77], with the aforementioned benefits to the probabilistic interpretation of the classifier. Here, we have also linked them to maximum likelihood learning, and — by incorporating matching into the definition of the aim of a classifier — have provided a principled approach to matching by degree.

While XCSF weight estimation research did not stop at linear models [157, 177], we have decided not to extend our work beyond their realm to avoid the introduction of multiple local optima that make estimating the globally optimal weight vector significantly more complicated. In addition, there is always the tradeoff between the complexity of the local models and the global model to consider: if more powerful local models are used, fewer of them are necessary to provide the same level of complexity of the global model, but their increased complexity and power usually makes them harder to understand. For these reasons, we see linear classifier models as a good tradeoff between ease of training and power of the model, while still remaining relatively simple to interpret.

In contrast to the large amount of research activity seeking to improve the weight vector estimation method in XCS, its method of estimating the clas- sifier model quality based on the absolute rather than the squared error was left untouched since the initial introduction of XCS until we questioned its validity in [77] on the basis of the identified model aim, as also discussed in Section 5.3.7. The modified error measure not only introduces consistency, but also allows us to accurately track the noise precision estimate with the method developed in Section 5.3.7, as we have previously published in [77]. Used as

a drop-in replacement for the mean absolute error measure in XCSF, we have shown that it, indeed, improves the generalisation capabilities as it provides a more accurate and stable estimate of the model quality of a classifier and subsequently a fitness estimate with the same qualities [156].

Nonetheless, the methods introduced in this chapter are by no means to be interpreted as the ultimate methods to use to train the classifier models. Alternatively, one can use the procedure deployed in this chapter to adapt other parameter estimation techniques to their use in LCS. Hence, the further contribution of our work is a method for integrating or replacing alternative approaches in a formal, predictable, and principled manner. If widely adopted, this will ensure formal as well as empirical comparability between approaches, and enables the development of strong statements in regard to complexity, convergence and efficiency that have not previously been available to LCS research in the form of a reusable developmental framework. Still, currently the RLS algorithm is the best known incremental method to track the optimal weight estimate while simultaneously accurately estimating the noise variance. Hence, given that one aims at minimising the squared error Eq. (5.5), it should be the method of choice.

As an alternative to the squared error that corresponds to the assumption of Gaussian noise, one can consistently aim at estimating the weight vector that minimises the mean absolute error Eq. (5.75), as done in [158]. However, this requires a modification of the assumptions about the distributions of the different linear model variables. Additionally, there is currently no known method to incrementally track the optimal weight estimate, as RLS does for the squared error measure. This also means that Eq. (5.68) cannot be used to track the model error, and slower gradient-based alternatives have to be applied.

In a later chapter we will reconsider the probabilistic structure of the linear model and show how the development of a probabilistic approach enables us to embed it in a fully Bayesian framework that also lends itself to application to multi-dimensional output spaces. Before that, let us in the following chapter discuss another LCS component that, contrary to the weight vector estimate, has received hardly any attention in LCS research: how the local models pro- vided by the classifiers are combined to form a global model.

Batch Learning
    wˆ = (X^T M X)^{-1} X^T M y,   or   wˆ = (√M X)^+ √M y
    τˆ^{-1} = (c − D_X)^{-1} ‖X wˆ − y‖²_M,   with c = Tr(M)

Incremental Weight Vector Estimate                                            Complexity
  LMS
    wˆ_{N+1} = wˆ_N + γ_{N+1} m(x_{N+1}) x_{N+1} (y_{N+1} − wˆ_N^T x_{N+1})            O(D_X)
  NLMS
    wˆ_{N+1} = wˆ_N + γ_{N+1} m(x_{N+1}) (x_{N+1}/‖x_{N+1}‖²) (y_{N+1} − wˆ_N^T x_{N+1})   O(D_X)
  RLS (Inverse Covariance Form)
    wˆ_{N+1} = wˆ_N + m(x_{N+1}) Λ_{N+1}^{-1} x_{N+1} (y_{N+1} − wˆ_N^T x_{N+1}),          O(D_X³)
    Λ_{N+1} = Λ_N + m(x_{N+1}) x_{N+1} x_{N+1}^T
  RLS (Covariance Form)
    wˆ_{N+1} = wˆ_N + m(x_{N+1}) Λ_{N+1}^{-1} x_{N+1} (y_{N+1} − wˆ_N^T x_{N+1}),          O(D_X²)
    Λ_{N+1}^{-1} = Λ_N^{-1} − m(x_{N+1}) (Λ_N^{-1} x_{N+1} x_{N+1}^T Λ_N^{-1}) / (1 + m(x_{N+1}) x_{N+1}^T Λ_N^{-1} x_{N+1})
  Kalman Filter (Covariance Form)
    ζ_{N+1} = m(x_{N+1}) Λ_N^{-1} x_{N+1} (m(x_{N+1}) x_{N+1}^T Λ_N^{-1} x_{N+1} + τ_{N+1}^{-1})^{-1},
    wˆ_{N+1} = wˆ_N + ζ_{N+1} (y_{N+1} − wˆ_N^T x_{N+1}),                                  O(D_X²)
    Λ_{N+1}^{-1} = Λ_N^{-1} − ζ_{N+1} x_{N+1}^T Λ_N^{-1}
  Kalman Filter (Inverse Covariance Form)
    Λ_{N+1} wˆ_{N+1} = Λ_N wˆ_N + m(x_{N+1}) τ_{N+1} x_{N+1} y_{N+1},
    Λ_{N+1} = Λ_N + m(x_{N+1}) τ_{N+1} x_{N+1} x_{N+1}^T,                                  O(D_X³)
    wˆ_{N+1} = Λ_{N+1}^{-1} (Λ_{N+1} wˆ_{N+1})

Incremental Noise Precision Estimate                                          Complexity
  LMS (for the biased estimate Eq. (5.62))
    τˆ_{N+1}^{-1} = τˆ_N^{-1} + m(x_{N+1}) ((wˆ_{N+1}^T x_{N+1} − y_{N+1})² − τˆ_N^{-1})   O(D_X)
  Direct tracking (for the unbiased estimate Eq. (5.63))
    Only valid in combination with RLS/Kalman filter in Inverse Covariance Form,
    or in Covariance Form with an insignificant prior:
    ‖X_{N+1} wˆ_{N+1} − y_{N+1}‖²_{M_{N+1}} = ‖X_N wˆ_N − y_N‖²_{M_N}
        + m(x_{N+1}) (wˆ_N^T x_{N+1} − y_{N+1})(wˆ_{N+1}^T x_{N+1} − y_{N+1}),             O(D_X)
    c_{N+1} = c_N + m(x_{N+1}),
    τˆ_{N+1}^{-1} = (c_{N+1} − D_X)^{-1} ‖X_{N+1} wˆ_{N+1} − y_{N+1}‖²_{M_{N+1}}

Table 5.1: A summary of batch and incremental methods presented in this chapter for training the linear model of a single classifier. The notation and initialisation values are explained throughout the chapter.

[Figure 5.2 about here.]

Figure 5.2: The graphs show the MSE of the weight vector estimate (on the left scale) and the squared noise variance estimation error (on the right scale) of different classifiers when approximating a sinusoid. The classifiers are presented with input x_n = (1, i_n)^T and output y_n = sin(i_n). In the upper graph, the sinusoid was sampled from the range i_n ∈ [0, π/2], and in the lower graph the samples are taken from the range i_n ∈ [π/2, π]. The MSE of the weight vector estimate for the RLSLMS classifier is not shown, as it is equivalent to the MSE of the RLS classifier.

Chapter 6

Mixing Independently Trained Classifiers

An essential part of our model and of LCS in general that hardly any research has been devoted to is how to combine the local models provided by the classi- fiers to produce a global model. More precisely, given an input and the output prediction of all matched classifiers, the task is to combine these predictions to form a global prediction. We will call this task the mixing problem, and some model that provides an approach to this task a mixing model.

Whilst some early LCS (for example, SCS [93]) aimed at choosing a single “best” classifier to provide the global prediction, in modern Michigan style LCS, predictions of matching classifiers have been mixed to give the “system prediction”, that is, what we call the global prediction. In XCS, for example, Wilson [240] defined the mixing model as follows:

    "There are several reasonable ways to determine [the global prediction] P(a_i). We have experimented primarily with a fitness-weighted average of the prediction of classifiers advocating a_i. Presumably, one wants a method that yields the system's 'best guess' as to the payoff [. . . ] to be received if a_i is chosen",

and maintains this model for all XCS derivatives without any further discussion. As we will discuss in Section 6.2.5, the fitness he is referring to is a complex heuristic measure of the quality of a classifier. While we do not aim at redefining the fitness of a classifier in XCS, we question if it is really the best measure to use when mixing the local classifier predictions. The mixing model has been changed in YCS [33], a simplified version of XCS and accuracy-based LCS in general, such that the classifier update equations can be formulated by difference equations, and by Wada et al. [226] to linearise the underlying model for the purpose of correcting XCS for use with reinforcement learning (see also Section 9.3.6). In either case the motivation for changing the mixing model differs from the motivation in this chapter, which is to improve the performance of the model itself, rather than to simplify it or to modify its formulation for use in reinforcement learning.

A formal treatment of the mixing problem requires a formal statement of the aim that we want to reach. In a previous study [82] we have defined this aim to be the minimisation of the mean squared error of the global prediction with respect to the target function, given a fixed set of fully trained classifiers. As will be discussed in Section 6.4, this aim does not completely conform to the LCS model we have introduced in Chapter 4.

Rather than using the mean squared error as a measure of the quality of a mixing model, we will pragmatically follow the approach we have introduced with the probabilistic LCS model: each classifier k provides a localised probabilistic input/output mapping p(y | x, θ_k), and the value of a binary latent random variable z_nk determines if classifier k generated the nth observation. Each observation is generated by one and only one matching classifier, and so the vector z_n = (z_{n1}, ..., z_{nK})^T has a single element with value 1, with all other elements being 0. As the values of the latent variables are unknown, they are modelled by the probabilistic model g_k(x) ≡ p(z_nk = 1 | x_n, v_k), which is the mixing model. The aim is to find a mixing model that is sufficiently easy to train and maximises the data likelihood Eq. (4.9), given by

\[ l(\theta; \mathcal{D}) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} g_k(x_n)\, p(y_n \mid x_n, \theta_k). \qquad (6.1) \]

One possibility for such a mixing model was already introduced in Chapter 4 as a generalisation of the gating network used in the Mixtures-of-Experts model, and given by the matching-augmented softmax function Eq. (4.20). Further alternatives will be introduced in this chapter.

We have called the approach "pragmatic", as by maximising the data likelihood, we ignore the problem of overfitting and the identification of a good model structure that is essential to LCS. Nonetheless, the methods introduced here will reappear in only slightly modified form once we deal with these issues, and discussing them here provides us with a better understanding in later chapters. Additionally, XCS implicitly uses an approach similar to maximum likelihood to train its classifiers and mixing models, and deals with overfitting only at the level of localising the classifiers in the input space (see Appendix B). Therefore, the methods and approaches discussed here can be used as a drop-in replacement for the XCS mixing model and for related LCS.

To summarise, we assume to have a set of K fully trained classifiers, each of which provides a localised probabilistic model p(y | x, θ_k). The aim is to find a mixing model that provides the generative probability p(z_nk = 1 | x_n, v_k), that is, the probability that classifier k generated observation n, given input x_n and model parameters v_k, that maximises the data likelihood Eq. (6.1), and that is sufficiently easy to train and scales well with the number of classifiers.

We will firstly concentrate on the model we have introduced in Chapter 4, and provide two approaches to training this model. Due to thereafter discussed weaknesses of these training procedures, we introduce a set of formally in- spired heuristics that are computationally cheap. In some empirical studies we show that these heuristics perform competitively when compared to the optimum. The chapter concludes by comparing the approach of maximising the likelihood to our previous study [82], where we have minimised the mean squared error.

6.1 Using the Generalised Softmax Function

By relating the probabilistic structure of LCS to the Mixtures-of-Experts model in Chapter 4, we defined the probability of classifier k generating the nth ob-

129 servation by the generalised softmax function Eq. (4.20), that is,

\[ g_k(x_n) = \frac{m_k(x_n) \exp(v_k^T \phi(x_n))}{\sum_{j=1}^{K} m_j(x_n) \exp(v_j^T \phi(x_n))}, \qquad (6.2) \]

where V = {v_k} is the set of mixing model parameters v_k ∈ R^{D_V}, and φ(x) is a transfer function that maps the input space X into some D_V-dimensional real space R^{D_V}. In LCS, this function is usually φ(x) = 1 for all x ∈ X, with D_V = 1, but to stay general, we assume an arbitrary definition of φ.
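For concreteness, the generalised softmax of Eq. (6.2) can be computed as follows. This is a small sketch; the function and its arguments are our own naming, and the matching functions and transfer function φ are assumed to be supplied by the surrounding LCS model (at least one classifier is assumed to match the input).

```python
import numpy as np

def generalised_softmax(x, V, matching, phi):
    """Mixing weights g_k(x) of Eq. (6.2).

    V        : (K, D_V) array of mixing vectors v_k
    matching : sequence of K matching functions m_k(x) returning values in [0, 1]
    phi      : transfer function mapping an input to a D_V-dimensional vector
    """
    feat = phi(x)
    weights = np.array([m(x) * np.exp(v @ feat) for m, v in zip(matching, V)])
    return weights / weights.sum()
```

With φ(x) = 1 and D_V = 1, as is usual in LCS, each classifier's weight reduces to m_k(x) exp(v_k), normalised over all matching classifiers.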

Assuming knowledge of the predictive densities of all classifiers p(y | x, θ_k), the data likelihood Eq. (6.1) is maximised by the expectation-maximisation algorithm by finding the values for V that maximise Eq. (4.13), given by

\[ \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \ln g_k(x_n). \qquad (6.3) \]

In the above equation, r_nk stands for the responsibility of classifier k for observation n, given by Eq. (4.12), that is

\[ r_{nk} = \frac{g_k(x_n)\, p(y_n \mid x_n, \theta_k)}{\sum_{j=1}^{K} g_j(x_n)\, p(y_n \mid x_n, \theta_j)}. \qquad (6.4) \]

Thus, we want to fit the mixing model to the data by minimising the cross-entropy −Σ_n Σ_k r_nk ln g_k(x_n) between the responsibilities and the generative mixing model.

6.1.1 Batch Learning by Iterative Reweighted Least Squares

The softmax function is a generalised linear model, and specialised tools have been developed to fit such models [167]. Even though we use a generalisa- tion of this function, we can still apply the same tools, as we will introduce in this section. In particular, we will use the Iterative Reweighted Least Squares (IRLS) to find the mixing model parameters.

The IRLS can be derived by applying the Newton-Raphson iterative optimisa-

130 tion scheme [19] that, for minimising an error function E(V ), takes the form

\[ \hat{V}^{(\text{new})} = \hat{V}^{(\text{old})} - H^{-1} \nabla E(V), \qquad (6.5) \]

where H is the Hessian matrix whose elements comprise the second derivatives of E(V), and ∇E(V) is the gradient vector of E(V) with respect to V. Even though not immediately obvious, its name derives from a reformulation of the update procedure that reveals that, at each update step, the algorithm solves a weighted least squares problem where the weights change at each step [19].

As we want to maximise Eq. (6.3), our function to minimise is the cross-entropy

\[ E(V) = - \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \ln g_k(x_n). \qquad (6.6) \]

The gradient of g_k with respect to v_j is

\[ \nabla_{v_j} g_k(x) = g_k(x) (I_{kj} - g_j(x)) \phi(x), \qquad (6.7) \]

and, thus, the gradient of E(V) evaluates to

\[ \nabla_V E(V) = \begin{pmatrix} \nabla_{v_1} E(V) \\ \vdots \\ \nabla_{v_K} E(V) \end{pmatrix}, \qquad \nabla_{v_j} E(V) = \sum_{n=1}^{N} (g_j(x_n) - r_{nj}) \phi(x_n), \qquad (6.8) \]

where we have used Σ_k g_k(x) = 1. The Hessian matrix

\[ H = \begin{pmatrix} H_{11} & \cdots & H_{1K} \\ \vdots & \ddots & \vdots \\ H_{K1} & \cdots & H_{KK} \end{pmatrix}, \qquad (6.9) \]

is constructed by evaluating its D_V × D_V blocks

\[ H_{kj} = H_{jk} = \sum_{n=1}^{N} g_k(x_n) (I_{kj} - g_j(x_n)) \phi(x_n) \phi(x_n)^T, \qquad (6.10) \]

that result from H_{kj} = ∇_{v_k} ∇_{v_j} E(V).

To summarise the IRLS algorithm, given N observations D = {X, Y}, and knowledge of the classifier parameters {θ_1, ..., θ_K} to evaluate p(y | x, θ_k), we can incrementally improve the estimate V̂ by repeatedly performing Eq. (6.5), starting with arbitrary initial values for V̂. As the Hessian matrix H given by Eq. (6.9) is positive definite [19], the error function E(V) is convex, and the IRLS algorithm will approach its unique minimum, although not monotonically [120]. Thus, E(V) after Eq. (6.6) will decrease, and can be used to monitor convergence of the algorithm.

Note, however, that by Eq. (6.5), a single step of the algorithm requires us to compute the gradient ∇_V E(V) of size K D_V, the K D_V × K D_V Hessian matrix H, and the inversion of the latter. Due to this inversion, the IRLS algorithm is at least of complexity O(N (K D_V)³), which prohibits its application in LCS, where we require algorithms that preferably scale linearly with the number of classifiers. Nonetheless, it is of significant theoretical value, as it provides us with the values for V that maximise Eq. (6.3) and can therefore be used as a benchmark for other mixing models and their associated methods.
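Although too expensive for large K, a single IRLS step is straightforward to write down. The sketch below assumes the responsibilities and matching values have been precomputed; the small ridge term added before inverting the Hessian is purely for numerical stability and is our addition, not part of the algorithm as stated.

```python
import numpy as np

def irls_step(V, Phi, M, R):
    """One Newton-Raphson step on the mixing parameters, Eq. (6.5).

    V   : (K, D_V) current mixing parameters
    Phi : (N, D_V) transferred inputs phi(x_n)
    M   : (N, K) matching values m_k(x_n)
    R   : (N, K) responsibilities r_nk from Eq. (6.4)
    """
    K, D_V = V.shape
    A = M * np.exp(Phi @ V.T)                       # m_k(x_n) exp(v_k^T phi(x_n))
    G = A / A.sum(axis=1, keepdims=True)            # g_k(x_n), Eq. (6.2)

    # gradient of E(V), Eq. (6.8), stacked into a K*D_V vector
    grad = ((G - R).T[:, :, None] * Phi[None, :, :]).sum(axis=1).reshape(K * D_V)

    # Hessian blocks H_kj, Eqs. (6.9) and (6.10)
    H = np.zeros((K * D_V, K * D_V))
    for k in range(K):
        for j in range(K):
            coeff = G[:, k] * ((1.0 if k == j else 0.0) - G[:, j])
            H[k*D_V:(k+1)*D_V, j*D_V:(j+1)*D_V] = (Phi * coeff[:, None]).T @ Phi

    step = np.linalg.solve(H + 1e-8 * np.eye(K * D_V), grad)
    return (V.reshape(K * D_V) - step).reshape(K, D_V)
```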

6.1.2 Incremental Learning by Least Squares

Following a similar but slightly modified derivation to the one given in [122], we can incrementally approximate the maximum of Eq. (6.3) by a recursive least squares procedure that is of lower complexity than the IRLS algorithm. Due to the convexity of E(V), its unique minimum is found where its gradient is ∇_V E(V) = 0, that is, when we have V̂ such that

\[ \sum_{n=1}^{N} (g_k(x_n) - r_{nk}) \phi(x_n) = 0, \qquad k = 1, \dots, K. \qquad (6.11) \]

Equally, when substituting Eq. (6.2) for gk, we want to solve

\[ \sum_{n=1}^{N} m_k(x_n) \left( \frac{\exp(\hat{v}_k^T \phi(x_n))}{\sum_{j=1}^{K} m_j(x_n) \exp(\hat{v}_j^T \phi(x_n))} - \frac{r_{nk}}{m_k(x_n)} \right) \phi(x_n) = 0. \qquad (6.12) \]

Thus, we want the left-hand term inside the brackets to be similar to the right-hand term, weighted by m_k(x_n), that is

\[ m_k(x_n) \frac{\exp(\hat{v}_k^T \phi(x_n))}{\sum_{j=1}^{K} m_j(x_n) \exp(\hat{v}_j^T \phi(x_n))} \approx m_k(x_n) \frac{r_{nk}}{m_k(x_n)}, \qquad (6.13) \]

for all n. Solving the above for v̂_k^T φ(x_n), its desired target value is

\[ \ln \frac{r_{nk}}{m_k(x_n)} - \ln C_n, \qquad (6.14) \]

where C_n = Σ_j m_j(x_n) exp(v̂_j^T φ(x_n)) is the normalising term that is common to all v̂_k^T φ(x_n) and can therefore be omitted, as it disappears when v̂_k^T φ(x_n) is converted to g_k(x_n). Therefore, the target for v̂_k^T φ(x_n) is ln(r_nk / m_k(x_n)), weighted by m_k(x_n). This allows us to reformulate the problem of finding values for V̂ that maximise Eq. (6.3) as the K linear least squares problems of minimising

\[ \sum_{n=1}^{N} m_k(x_n) \left( \hat{v}_k^T \phi(x_n) - \ln \frac{r_{nk}}{m_k(x_n)} \right)^2, \qquad k = 1, \dots, K. \qquad (6.15) \]

Even though r_nk = 0 if m_k(x_n) = 0, and therefore r_nk / m_k(x_n) is undefined in such a case, this does not cause any problems, as in such a case the weight is equally zero, which makes computing the target superfluous. Also note that each of these problems operates on an input space of dimensionality D_V, and hence, using the least squares methods introduced in the previous chapter, has either complexity O(N K D_V³) for the batch solution or O(K D_V²) for each step of the incremental solution. Given that we usually have D_V = 1 in LCS, this is certainly an appealing property.

When minimising Eq. (6.15) it is essential to consider that the values for rnk

by Eq. (6.4) depend on the current v̂_k of all classifiers. Consequently, when performing batch learning, it is not sufficient to solve all K least squares problems only once, as the corresponding targets change with the updated values of V̂. Thus, again one needs to repeatedly update the estimate V̂ until the cross-entropy Eq. (6.6) converges.

On the other hand, using recursive least squares to provide an incremental ap- proximation of Vˆ we need to honour the non-stationarity of the target values by using the recency-weighted RLS variant. Hence, according to Section 5.3.5

133 the update equations take the form

\[ \hat{v}_{k,N+1} = \lambda^{m_k(x_{N+1})} \hat{v}_{k,N} + m_k(x_{N+1}) \Lambda_{k,N+1}^{-1} \phi(x_{N+1}) \left( \ln \frac{r_{(N+1)k}}{m_k(x_{N+1})} - \hat{v}_{k,N}^T \phi(x_{N+1}) \right), \qquad (6.16) \]

\[ \Lambda_{k,N+1}^{-1} = \lambda^{-m_k(x_{N+1})} \Lambda_{k,N}^{-1} - m_k(x_{N+1}) \lambda^{-m_k(x_{N+1})} \frac{\Lambda_{k,N}^{-1} \phi(x_{N+1}) \phi(x_{N+1})^T \Lambda_{k,N}^{-1}}{\lambda^{m_k(x_{N+1})} + m_k(x_{N+1}) \phi(x_{N+1})^T \Lambda_{k,N}^{-1} \phi(x_{N+1})}, \qquad (6.17) \]

where the v̂_k's and Λ_k^{-1}'s are initialised to v̂_{k0} = 0 and Λ_{k0}^{-1} = δI for all k, with δ being a large scalar. In [122], Jordan and Jacobs initially set λ = 0.99 and increased a fixed fraction (0.6) of the remaining distance to 1.0 every 1000 updates. While this seems a sensible approach to start with, future work includes empirical investigation of how to best set λ.

As pointed out in [122], approximating the values of V̂ by least squares does not result in the same parameter estimates as when using the IRLS algorithm, due to the use of least squares rather than maximum likelihood. In fact, the least squares approach can be seen as an approximation to the maximum likelihood solution under the assumption that the residual in Eq. (6.15) is small, which is equivalent to assuming that the LCS model can fit the underlying regression surface and that the noise is small. Nonetheless, empirical results in [122] demonstrate that the least squares approach provides good results even when the residual is large in the early stages of training. In any case, in terms of complexity it is a very appealing alternative to the IRLS algorithm.
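A sketch of the corresponding incremental update for a single classifier k is given below, following the recency-weighted form of Eqs. (6.16) and (6.17) as reconstructed above; the exact form of the recency weighting should be checked against Section 5.3.5, and all names are illustrative.

```python
import numpy as np

def mixing_ls_update(v_k, Lambda_inv_k, phi_x, r_k, m_k, lam=0.99):
    """Recency-weighted least squares step for one classifier's mixing vector,
    cf. Eqs. (6.14), (6.16) and (6.17); skipped if the classifier does not
    match the input, and r_k is assumed positive whenever m_k > 0."""
    if m_k <= 0.0:
        return v_k, Lambda_inv_k
    decay = lam ** m_k
    target = np.log(r_k / m_k)                       # target of Eq. (6.14), ln(r_nk / m_k)
    Lphi = Lambda_inv_k @ phi_x
    Lambda_inv_k = (Lambda_inv_k
                    - m_k * np.outer(Lphi, Lphi) / (decay + m_k * (phi_x @ Lphi))) / decay
    v_k = decay * v_k + m_k * (Lambda_inv_k @ phi_x) * (target - v_k @ phi_x)
    return v_k, Lambda_inv_k
```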

6.2 Alternative Heuristics

While the IRLS algorithm minimises Eq. (6.6), it does not scale well with the number of classifiers. The least squares approximation, on the other hand, scales well, but minimises Eq. (6.15) instead of Eq. (6.6), which does not al- ways give good results, as we will show in Section 6.3. Thus, in this section, we introduce some heuristic mixing models that scale linearly with the num- ber of classifiers, just like the least squares approximation, and feature better performance.

Before discussing different heuristics, let us define the requirements on g_k: to preserve their probabilistic interpretation, we require g_k(x) ≥ 0 for all k and x, and Σ_k g_k(x) = 1 for all x. In addition, we need to honour matching, which means that if m_k(x) = 0, we need to have g_k(x) = 0. These requirements are met if we define

\[ g_k(x) = \frac{m_k(x) \gamma_k(x)}{\sum_{j=1}^{K} m_j(x) \gamma_j(x)}, \qquad (6.18) \]

where {γ_k : X → R^+} is a set of K functions returning positive scalars, that implicitly rely on the mixing model parameters V. Thus, the mixing model defines a weighted average, where the weights are specified on one hand by the

matching functions, and on the other hand by the functions γk. The heuristics

differ among each other only in how they define the γk’s.

Note that the generalised softmax function Eq. (6.2) also performs mixing by weighted average, as it conforms to Eq. (6.18) with γ_k(x) = exp(v_k^T φ(x)) and mixing model parameters V = {v_k}. The weights it assigns to each classifier are determined by the log-linear model exp(v_k^T φ(x)), which needs to be trained separately, depending on the responsibilities that express the goodness-of-fit of the classifier models for the different inputs. In contrast, all heuristic models that we introduce rely on measures that are part of the classifiers' models and do not need to be fitted separately. As they do not have any adjustable parameters, they all have V = ∅.

6.2.1 Properties of Weighted Averaging Mixing

Let f̂_k : X → R be given by f̂_k(x) = E(y | x, θ_k), that is, the estimator of classifier k defined by the mean of the conditional distribution of the output given the input and the classifier parameters. Equally, let f̂ : X → R be the global model estimator, given by f̂(x) = E(y | x, θ). As by Eq. (4.8) we have p(y | x, θ) = Σ_k g_k(x) p(y | x, θ_k), the global estimator is related to the local estimators by

\[ \hat{f}(x) = \int_{\mathcal{Y}} y \sum_k g_k(x)\, p(y \mid x, \theta_k)\, dy = \sum_k g_k(x) \hat{f}_k(x), \qquad (6.19) \]

and, thus, is also a weighted average of the local estimators. From this follows that f̂ is bounded from below and above by the lowest and highest estimate of

135 the local models, respectively, that is

\[ \min_k \hat{f}_k(x) \leq \hat{f}(x) \leq \max_k \hat{f}_k(x), \qquad \forall x \in \mathcal{X}. \qquad (6.20) \]

In general, we aim at minimising the deviation of the global estimator f̂ from the target function f that describes the data-generating process. If we measure this deviation by the difference measure h(f(x) − f̂(x)), where h : R → R^+ is some convex function, mixing by a weighted average allows us to derive an upper bound on this difference measure:

Theorem 6.2.1. Given the global estimator f̂ : X → R, that is formed by a weighted averaging of K local estimators f̂_k : X → R by f̂(x) = Σ_k g_k(x) f̂_k(x), such that g_k(x) ≥ 0 for all x and k, and Σ_k g_k(x) = 1 for all x, the difference between the target function f : X → R and the global estimator is bounded from above by

\[ h\left( \hat{f}(x) - f(x) \right) \leq \sum_k g_k(x)\, h\left( \hat{f}_k(x) - f(x) \right), \qquad \forall x \in \mathcal{X}, \qquad (6.21) \]

where h : R → R^+ is a convex function. More specifically, we have

\[ \left( \hat{f}(x) - f(x) \right)^2 \leq \sum_k g_k(x) \left( \hat{f}_k(x) - f(x) \right)^2, \qquad \forall x \in \mathcal{X}, \qquad (6.22) \]

and

\[ \left| \hat{f}(x) - f(x) \right| \leq \sum_k g_k(x) \left| \hat{f}_k(x) - f(x) \right|, \qquad \forall x \in \mathcal{X}. \qquad (6.23) \]

Proof. For any x ∈ X, we have

\[
\begin{aligned}
h\left( \hat{f}(x) - f(x) \right) &= h\left( \sum_k g_k(x) \hat{f}_k(x) - f(x) \right) \\
&= h\left( \sum_k g_k(x) \left( \hat{f}_k(x) - f(x) \right) \right) \\
&\leq \sum_k g_k(x)\, h\left( \hat{f}_k(x) - f(x) \right),
\end{aligned}
\]

where we have used Σ_k g_k(x) = 1, and the inequality is Jensen's Inequality (for example, [234]), based on the convexity of h and the weighted average property of the g_k. Having proven Eq. (6.21), Eqs. (6.22) and (6.23) follow from the convexity of h(a) = a² and h(a) = |a|, respectively.

Therefore, we can minimise the error of the global estimator by assigning high weights, that is, high values of g_k(x), to classifiers whose local estimator errors are small. Observing in Eq. (6.18) that the value of g_k(x) is directly proportional to the value of γ_k(x), a good heuristic will assign high values to

γk(x) if the error of the local estimator can be expected to be small. The design of all of our heuristics is based on this intuition.

The probabilistic formulation of the LCS model allows us to derive a further bound, this time on the variance of the output prediction:

Theorem 6.2.2. Given the density p(y | x, θ) for output y given input x and parameters θ, formed from the K classifier model densities p(y | x, θ_k) by p(y | x, θ) = Σ_k g_k(x) p(y | x, θ_k), such that g_k(x) ≥ 0 for all x and k, and Σ_k g_k(x) = 1 for all x, the variance of y is bounded from above by the weighted average of the variances of the local models for y, that is

\[ \mathrm{var}(y \mid x, \theta) = \sum_k g_k(x)^2\, \mathrm{var}(y \mid x, \theta_k) \leq \sum_k g_k(x)\, \mathrm{var}(y \mid x, \theta_k), \qquad \forall x \in \mathcal{X}. \qquad (6.24) \]

Proof. To show the above, we again take the view that each observation was generated by one and only one classifier, and introduce the indicator variable I as a conceptual tool that takes the value k if classifier k generated the observation, giving g_k(x) ≡ p(I = k | x), where we are omitting the parameters of the mixing models implicit in g_k. We also use p(y | x, θ_k) ≡ p(y | x, I = k) to denote the model provided by classifier k. Thus, we have p(y | x, θ) = Σ_k p(I = k | x) p(y | x, I = k), and, analogously, E(y | x, θ) = Σ_k p(I = k | x) E(y | x, I = k). However, similarly to the basic relation var(aX + bY) = a² var(X) + b² var(Y) + 2ab cov(X, Y), we have for the variance

\[ \mathrm{var}(y \mid x, \theta) = \sum_k p(I = k \mid x)^2\, \mathrm{var}(y \mid x, I = k) + 0, \qquad (6.25) \]

where the covariance terms are zero as the classifier models are conditionally independent given I. This confirms the equality in Eq. (6.24). The inequality is justified by observing that the variance is non-negative, and 0 ≤ g_k(x) ≤ 1 and so g_k(x)² ≤ g_k(x).

Here, we not only provide a bound on the variance, but also an exact expres- sion for the variance of the combined prediction. This gives us a different view

on the design criteria for possible heuristics: we want to assign weights that are in some way inversely proportional to the classifier prediction variance. As the prediction variance indicates the prediction error we can expect, this design criterion conforms to the one we have derived from the statement of Theorem 6.2.1.

Neither Theorem 6.2.1 nor Theorem 6.2.2 assume that the local models are linear. In fact, they apply in any case when a global model results from a weighted average of a set of local models. Thus, they can also be used in LCS when the classifier models are non-linear, such as in [157, 177].

Example 6.2.1 (Mean and Variance of a Mixture of Gaussians). Consider 3 classifiers that, for some input x, provide the predictions p(y | x, θ_1) = N(y | 0.2, 0.1²), p(y | x, θ_2) = N(y | 0.5, 0.05²), and p(y | x, θ_3) = N(y | 0.7, 0.2²). Using mixing weights inversely proportional to their variance, that is g_1(x) = 0.20, g_2(x) = 0.76, and g_3(x) = 0.04, our global estimator f̂(x), determined by Eq. (6.19), results in f̂(x) = 0.448. Let us assume that the target function value is given by f(x) = 0.5, resulting in the squared prediction error (f(x) − f̂(x))² ≈ 0.002704. This error is correctly upper-bounded by Eq. (6.22), which results in (f(x) − f̂(x))² ≤ 0.0196.

We can demonstrate the correctness of Eq. (6.24) by taking 10⁶ samples from the predictive distributions of the different classifiers, resulting in the sample vectors s_1, s_2, and s_3, each of size 10⁶. Thus, we can produce a sample vector of the global prediction by s = Σ_k g_k(x) s_k, which has the sample variance 0.00190. This conforms to — and thus empirically validates — the variance after Eq. (6.24), which results in var(y | x, θ) = 0.00191 ≤ 0.0055.
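The numbers of this example are easy to reproduce; the following snippet is a quick numerical check (our own, with an arbitrary random seed), not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(1)
means = np.array([0.2, 0.5, 0.7])
variances = np.array([0.1, 0.05, 0.2]) ** 2
g = np.array([0.20, 0.76, 0.04])            # mixing weights used in the example

f_hat = g @ means                           # global estimate, Eq. (6.19): 0.448
bound = g @ (means - 0.5) ** 2              # Eq. (6.22) bound for f(x) = 0.5: 0.0196

samples = np.array([rng.normal(mu, np.sqrt(v), 10**6) for mu, v in zip(means, variances)])
mixed = g @ samples                         # samples of the global prediction
print(f_hat, (0.5 - f_hat) ** 2, bound)     # 0.448, ~0.0027, 0.0196
print(mixed.var(), g**2 @ variances, g @ variances)   # ~0.0019, ~0.00191, 0.0055
```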

6.2.2 Inverse Variance

The unbiased noise variance estimate of classifier k is, after Eq. (5.13), given by

\[ \hat{\tau}_k^{-1} = (c_k - D_X)^{-1} \sum_{n=1}^{N} m_k(x_n) \left( \hat{w}_k^T x_n - y_n \right)^2, \qquad (6.26) \]

and is therefore approximately the mean sum of squared prediction errors. If this estimate is small, the squared prediction error is, on average, known to be small and we can expect the predictions to have a low error. Hence, we define inverse variance mixing by using mixing weights that are inversely proportional to the noise variance estimates of the corresponding classifiers. More formally, we use Eq. (6.18) with γ_k(x) = τˆ_k for all x. In the previous chapter we have shown how to estimate the noise variance of a classifier by batch or incremental learning.

6.2.3 Prediction Confidence

If the classifier model is probabilistic, we can specify a probabilistic density for its predictions. Knowing this density allows us to specify an interval on the output into which 95% of the observations are likely to fall, known as the 95% confidence interval. The width of this interval therefore gives us a measure of how certain we are about the prediction made by this classifier. This is the underlying idea of mixing by prediction confidence.

More formally, the predictive density of the linear classifier model is given for classifier k by marginalising p(y, θ_k | x) = p(y | x, θ_k) p(θ_k) over the parameters θ_k, and results in

\[ p(y \mid x) = \mathcal{N}\left( y \mid \hat{w}_k^T x,\; \hat{\tau}_k^{-1} (x^T \Lambda_k^{-1} x + 1) \right), \qquad (6.27) \]

as already introduced in Section 5.3.6. The 95% confidence interval — indeed that of any percentage — is directly proportional to the standard deviation of this density, which is the square root of its variance. Thus, to assign higher weights to classifiers with a higher confidence prediction, that is, a prediction with a smaller confidence interval, we use

\[ \gamma_k(x) = \left( \hat{\tau}_k^{-1} (x^T \Lambda_k^{-1} x + 1) \right)^{-1/2}. \qquad (6.28) \]

Compared to mixing by inverse variance, this measure additionally takes the uncertainty of the weight vector estimate into account and is consequently dependent on the input.

139 ular when the number of observations that the classifier is trained on is small. Therefore, despite using more information than mixing by inverse variance, we cannot guarantee its better performance.

6.2.4 Maximum Prediction Confidence

The global model density is by Eq. (4.8) given by a mixture of the densities of the local models. As for the local models, we can specify a confidence in- terval on the global model by looking at the spread of its density. In order to maximise the global prediction confidence we want to minimise the spread of the global prediction. As we apply mixing by weighted average, the spread of the global density is bounded from below and above by the smallest and the largest spread of the contributing classifiers. Thus, in order to minimise the spread of the global prediction, we only consider the predictive density of the classifier with the smallest predictive spread.

Using this concept, mixing to maximise the prediction confidence is formalised

by setting γk(x) to 1 only for the classifier with the lowest prediction spread, that is,

−1 T Λ−1 −1/2 1 if k = argmaxk mk(x) τˆk (x k x + 1) , γk(x)= (6.29) ( 0 otherwise. 

Note the addition of mk(x) to ensure that we pick the matching highest confi- dence classifier.

As for mixing by confidence, using only the classifier with the highest predic- tion confidence relies on several assumptions that might by violated. Thus, we expect maximum confidence mixing to deliver worse performance than mix- ing by inverse variance in cases where these assumptions are violated. In such cases it might even fare worse than mixing by confidence, as it relies on these assumptions more heavily.

140 6.2.5 XCS

While none of the approaches discussed before are currently used in any LCS, we will cast — for the sake of comparison — the way in which XCS(F) performs mixing into the same formal framework. Mixing in XCS(F) has not changed since it was firstly specified in [240], despite its multiple other changes and improvements. Additionally, the mixing model in XCS(F) is closely linked to the fitness of a classifier as used by the genetic algorithm, and thus — as we will see — is overly complex. Due to the algorithmic descrip- tion of an incremental method, the aims of XCS(F) are usually not explicitly specified. Nonetheless, all mixing parameters in XCS(F) are updated by the LMS method, for which we have already discussed the formally equivalent, but more intuitive, batch approach in the previous chapter.

Recall that the LMS algorithm for single-dimensional constant inputs is specified by Eq. (5.25) to update some scalar estimate ŵ of an output y after observing the (N + 1)th output by

\[ \hat{w}_{N+1} = \hat{w}_N + \gamma_{N+1}\left(y_{N+1} - \hat{w}_N\right), \tag{6.30} \]

where γ_{N+1} is some scalar step size. As previously shown in Example 5.2.1, this update equation aims at minimising a sum of squared errors Eq. (5.5), whose minimum is achieved by

\[ \hat{w} = c_k^{-1} \sum_{n=1}^{N} m_k(x_n)\, y_n, \tag{6.31} \]

given all N observations. Hence, Eq. (6.31) is the batch formulation for the solution that the incremental Eq. (6.30) approximates.

Applying this relation to the XCS update equations for the mixing parameters, the mixing model employed by XCS(F) can be described as follows: the error ε_k of classifier k in XCS(F) is the mean absolute prediction error of its local model, and is given by

\[ \epsilon_k = c_k^{-1} \sum_{n=1}^{N} m_k(x_n) \left| y_n - \hat{w}_k^T x_n \right|. \tag{6.32} \]

The classifier's accuracy is some inverse function κ(ε_k) of the classifier error. This function was initially given by an exponential [240], but was later [242, 57] redefined to

\[ \kappa(\epsilon) = \begin{cases} 1 & \text{if } \epsilon < \epsilon_0, \\ \alpha \left( \dfrac{\epsilon}{\epsilon_0} \right)^{-\nu} & \text{otherwise}, \end{cases} \tag{6.33} \]

where the constant scalar ε_0 is known as the minimum error, the constant α is a scaling factor, and the constant ν is a mixing power factor [57]. The accuracy is constantly 1 up to the error ε_0 and then drops off steeply, with the shape of the drop determined by α and ν. The relative accuracy is a classifier's accuracy for a single input, normalised by the sum of the accuracies of all classifiers matching that input. The fitness is the relative accuracy of a classifier averaged over all inputs that it matches, that is

\[ F_k = c_k^{-1} \sum_{n=1}^{N} \frac{m_k(x_n)\, \kappa(\epsilon_k)}{\sum_{j=1}^{K} m_j(x_n)\, \kappa(\epsilon_j)}. \tag{6.34} \]

This fitness is the measure of a classifier's prediction quality, and hence γ_k is given, independently of the input, by γ_k(x) = F_k.

Note that the magnitude of a relative accuracy depends on both the error of a classifier and on the error of the classifiers that match the same input. This makes the fitness of classifier k dependent on inputs that are matched by classifiers that share inputs with classifier k, but are not necessarily matched by this classifier. This might be a good measure for the fitness of a classifier (where prediction quality is not all that counts), but we do not expect it to perform overly well as a measure of the prediction quality of a classifier. This expectation is confirmed in the following experiments.
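The following sketch summarises how the XCS(F) mixing weights of Eqs. (6.32) to (6.34) can be computed in batch form; it assumes binary matching and the standard parameter values from [57], and the array names are illustrative.

    import numpy as np

    def xcs_accuracy(eps, eps0=0.01, alpha=0.1, nu=5.0):
        """Classifier accuracy kappa(eps) of Eq. (6.33)."""
        eps = np.asarray(eps, dtype=float)
        kappa = np.ones_like(eps)
        high = eps >= eps0
        kappa[high] = alpha * (eps[high] / eps0) ** (-nu)
        return kappa

    def xcs_mixing_weights(errors, M):
        """Input-independent mixing weights gamma_k = F_k of Eq. (6.34).

        errors : (K,) mean absolute prediction errors eps_k of Eq. (6.32)
        M      : (N, K) binary matching matrix with M[n, k] = m_k(x_n)
        """
        kappa = xcs_accuracy(errors)                  # (K,) accuracies
        rel_acc = (M * kappa) / (M @ kappa)[:, None]  # relative accuracies per input
        c = M.sum(axis=0)                             # match counts c_k
        return rel_acc.sum(axis=0) / c                # fitness F_k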

6.3 Empirical Comparison

In order to evaluate how well the different heuristics perform with respect to our aim of maximising Eq. (6.1), we perform a set of experiments that compare the different methods when applied to four regression tasks. The experiments show that i) mixing by inverse variance outperforms the other heuristic methods, ii) it also performs better than the least squares approximation, and iii) mixing as done in XCS(F) performs worse than all other methods.

In all experiments we firstly create a set of K classifiers such that the number of classifiers matching each input is about the same for all inputs, and train these classifiers on all available observations by batch learning. As the next step, the different mixing models are applied to the previously trained set of clas- sifiers, and their performance is compared based on the likelihood Eq. (6.1). This setup was chosen for several reasons: firstly, mixing is only required if several classifiers match the same input, which is provided by the generated set of classifiers. Secondly, the classifiers are trained before the mixing models are applied, as we want to only compare the mixing models based on the same set of classifiers, and not how training of classifiers and mixing them interacts. Finally, we use the likelihood measure to compare the performance of the mix- ing models, rather than some form of squared error or similar, as our aim in this chapter is to discuss methods that maximise this likelihood, rather than any other measure.

6.3.1 Experimental Design

Function     Definition
Blocks       f(x) = Σ_j h_j K(x − x_j),  K(x) = (1 + sgn(x))/2,
             (x_j) = (0.1, 0.13, 0.15, 0.23, 0.25, 0.40, 0.44, 0.65, 0.76, 0.78, 0.81),
             (h_j) = (4, −5, 3, −4, 5, −4.2, 2.1, 4.3, −3.1, 5.1, −4.2).
Bumps        f(x) = Σ_j h_j K((x − x_j)/w_j),  K(x) = (1 + |x|^4)^{−1},
             (x_j) = (x_j) of Blocks,
             (h_j) = (4, 5, 3, 4, 5, 4.2, 2.1, 4.3, 3.1, 5.1, 4.2),
             (w_j) = (0.005, 0.005, 0.006, 0.01, 0.01, 0.03, 0.01, 0.01, 0.005, 0.008, 0.005).
Doppler      f(x) = (x(1 − x))^{1/2} sin(2π(1 + 0.05)/(x + 0.05))
Heavisine    f(x) = 4 sin(4πx) − sgn(x − 0.3) − sgn(0.72 − x)

Table 6.1: The set of functions used for evaluating the performance of the different mixing models. The functions are taken from [72], and have been previously used in an LCS-related study in [23]. The functions are sampled over the range [0, 1] and their outputs are normalised to −0.5 ≤ f(x) ≤ 0.5.

Regression Tasks. The mixing models are evaluated on four regression tasks

f : ℝ → ℝ, given in Table 6.1. The input range is [0, 1], and the output is shifted and scaled such that −0.5 ≤ f(x) ≤ 0.5. 1000 observations (i_n, f(i_n)) are taken from the target function f at regular intervals, from 0 to 1, to give the output vector y = (f(i_1), ..., f(i_1000))^T. The input matrix for averaging classifiers is given by X = (1, ..., 1)^T, and for classifiers that model straight lines by a 1000 × 2 matrix X with the nth row given by (1, i_n).
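As a sketch under the definitions of Table 6.1, the four target functions can be written as follows; the Bumps kernel is taken as K(x) = (1 + |x|^4)^{-1} as reconstructed above, and the standardisation to the range [−0.5, 0.5] is one possible reading of the described shifting and scaling.

    import numpy as np

    XJ = np.array([0.1, 0.13, 0.15, 0.23, 0.25, 0.40, 0.44, 0.65, 0.76, 0.78, 0.81])
    HJ_BLOCKS = np.array([4, -5, 3, -4, 5, -4.2, 2.1, 4.3, -3.1, 5.1, -4.2])
    HJ_BUMPS = np.array([4, 5, 3, 4, 5, 4.2, 2.1, 4.3, 3.1, 5.1, 4.2])
    WJ = np.array([0.005, 0.005, 0.006, 0.01, 0.01, 0.03, 0.01, 0.01, 0.005, 0.008, 0.005])

    def blocks(x):
        return sum(h * (1 + np.sign(x - xj)) / 2 for h, xj in zip(HJ_BLOCKS, XJ))

    def bumps(x):
        return sum(h / (1 + np.abs((x - xj) / w) ** 4) for h, xj, w in zip(HJ_BUMPS, XJ, WJ))

    def doppler(x):
        return np.sqrt(x * (1 - x)) * np.sin(2 * np.pi * 1.05 / (x + 0.05))

    def heavisine(x):
        return 4 * np.sin(4 * np.pi * x) - np.sign(x - 0.3) - np.sign(0.72 - x)

    def standardise(y):
        """Shift and scale the outputs to lie in [-0.5, 0.5]."""
        return (y - y.min()) / (y.max() - y.min()) - 0.5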

Classifier Generation and Training. For each experimental run K classifiers are created, where K depends on the experiment. Each classifier matches an interval [l_k, u_k] of the input space, that is m_k(i_n) = 1 if l_k ≤ i_n ≤ u_k, and m_k(i_n) = 0 otherwise. Even coverage, such that about an equal number of classifiers matches each input, is achieved by splitting the input space into 1000 bins and localising the classifiers one by one in a "Tetris"-style way: the average width in bins of the matched interval of a classifier needs to be 1000c/K such that on average c classifiers match each bin. The interval width of a new classifier is sampled from B(1000, (1000c/K)/1000), where B(n, p) is a binomial distribution for n trials and a success probability of p. The minimal width is bounded below by 3, such that each classifier is trained on at least 3 observations. The new classifier is then localised such that the number of classifiers that match the same bins is minimal. If several such locations are possible, one is chosen at random by sampling from a uniform distribution. Having positioned all K classifiers, they are trained by batch learning using Eqs. (5.9) and (5.13). The number of classifiers that match each input is in all experiments set to c = 3.
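One way to realise the described "Tetris"-style placement is sketched below; it assumes K ≥ c and represents classifiers simply by the bin intervals they match.

    import numpy as np

    def generate_classifiers(K, c=3, bins=1000, rng=np.random.default_rng()):
        """Place K interval classifiers such that roughly c classifiers match each bin."""
        coverage = np.zeros(bins, dtype=int)        # classifiers matching each bin
        intervals = []
        for _ in range(K):
            # average width 1000c/K, bounded below by 3 observations
            width = max(3, rng.binomial(bins, (bins * c / K) / bins))
            # overlap each candidate position would have with existing classifiers
            overlaps = np.array([coverage[l:l + width].sum()
                                 for l in range(bins - width + 1)])
            candidates = np.flatnonzero(overlaps == overlaps.min())
            l = rng.choice(candidates)              # break ties uniformly at random
            coverage[l:l + width] += 1
            intervals.append((l, l + width - 1))
        return intervals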

Mixing Models. We compare the performance of the IRLS algorithm (IRLS) and its least-squares approximation (LS) on the generalised softmax function with φ(x) = 1 for all x, the inverse variance (InvVar) heuristic, the mixing by confidence (Conf) and mixing by maximum confidence (MaxConf) heuristics, and mixing by XCS(F) (XCS). Additionally, when classifiers model straight lines, we use the IRLS algorithm (IRLSf) and its least-squares approximation (LSf) with a transfer function φ(x) = (1, i_n)^T to allow for an additional soft-linear partitioning beyond the realm of matching (see the discussion in Section 4.3.5 for more information). Training by the IRLS algorithm is performed incrementally according to Section 6.1.1, until the change in cross-entropy Eq. (6.6) between two iterations is smaller than 0.1%. The least-squares approximation is performed repeatedly in batches rather than as described in

Section 6.1.2, by using Eq. (5.9) to find the v_k's that minimise Eq. (6.15). Convergence is assumed when the change in Eq. (6.6) between two batch updates is smaller than 0.05% (this value is smaller than for the IRLS algorithm, as the least squares approximation takes smaller steps). The heuristic mixing models do not require any separate training and are applied as described in Section 6.2. For XCS we use the standard setting ε_0 = 0.01, α = 0.1, and ν = 5, as recommended in [57].

Evaluating the Performance. Having generated and trained a set of classifiers, each mixing model is trained with the same set to make their performance directly comparable. It is measured by evaluating Eq. (6.1), where p(y_n | x_n, θ_k) is computed by Eq. (5.3), using the same observations that the classifiers were trained on, and the g_k's are provided by the different mixing models. As the IRLS algorithm maximises the data likelihood Eq. (6.1) when using the generalised softmax function as the mixing model, we use it as a benchmark, and report the likelihoods of the other mixing models as a fraction of the one reached by the IRLS algorithm with φ(x) = 1.
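Assuming that Eq. (6.1) is the data log-likelihood of the mixture of classifier models, the comparison measure can be computed as in the following sketch; the variable names and shapes are illustrative rather than the notation of the text.

    import numpy as np
    from scipy.stats import norm

    def data_log_likelihood(X, y, W_hat, tau_hat, G):
        """Score a mixing model by the data (log-)likelihood of the mixture.

        X, y    : (N, D) inputs and (N,) outputs
        W_hat   : (K, D) estimated classifier weight vectors
        tau_hat : (K,) estimated noise precisions
        G       : (N, K) mixing weights g_k(x_n), each row summing to 1
        """
        means = X @ W_hat.T                          # (N, K) classifier predictions
        dens = norm.pdf(y[:, None], loc=means, scale=tau_hat ** -0.5)
        return np.sum(np.log(np.sum(G * dens, axis=1)))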

Statistical Analysis. To determine whether the performances of the different mixing models differ significantly, we use a two-way analysis of variance (ANOVA), with the first factor being the type of mixing model (IRLS, IRLSf, LS, LSf, InvVar, Conf, MaxConf, XCS) and the second factor being the combination of regression task and type of classifier (Blocks, Bumps, Doppler, Heavisine, either with averaging classifiers or with classifiers that model straight lines). The direction of the difference is determined by Tukey's HSD post-hoc test. As the optimal likelihood as measured by IRLS varies strongly with different sets of classifiers, we measure the performance of the methods as a fraction of the optimal likelihood rather than the likelihood itself.
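This analysis can be reproduced, for example, with statsmodels; the long-format table mixing_results.csv with columns frac, method and task is a hypothetical way of organising the 50 runs per function.

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    results = pd.read_csv("mixing_results.csv")   # one row per experimental run

    # Two-way ANOVA: mixing method and task/classifier-type combination as factors.
    model = smf.ols("frac ~ C(method) + C(task)", data=results).fit()
    print(sm.stats.anova_lm(model, typ=2))

    # Tukey's HSD post-hoc comparison over the mixing-method factor only.
    print(pairwise_tukeyhsd(results["frac"], results["method"], alpha=0.01))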

6.3.2 Results

In our first experiment, we have compared the performance of all mixing models when K = 50 classifiers are used. For all functions, and for both averaging classifiers and classifiers that model straight lines, we have performed 50

                    Likelihood of Mixing Model as Fraction of IRLS
Function         IRLS     IRLSf    LS       LSf      InvVar   Conf     MaxConf  XCS
Blocks           1.00000     -     0.99473     -     0.99991  0.99988  0.99973  0.99877
Bumps            1.00000     -     0.94930     -     0.98442  0.97740  0.96367  0.94678
Doppler          1.00000     -     0.94930     -     0.98442  0.97740  0.96367  0.94678
Heavisine        1.00000     -     0.96289     -     0.96697  0.95123  0.95864  0.95807
Blocks lin       1.00000  1.00014  0.99141  0.99559  0.99955  0.99929  0.99956  0.99722
Bumps lin        1.00000  0.99720  0.94596  0.94870  0.98425  0.97494  0.97797  0.94107
Doppler lin      1.00000  0.99856  0.94827  0.98628  0.98723  0.97818  0.98172  0.94395
Heavisine lin    1.00000  0.99523  0.98480  0.96854  0.98448  0.97347  0.99005  0.95739

Table 6.2: The mean likelihoods of the different mixing models, as a fraction of the mean likelihood of IRLS, averaged over 50 experimental runs per function. A lin added to the function name indicates the use of classifiers that model straight lines rather than averaging classifiers. For averaging classifiers, IRLS and IRLSf, and LS and LSf are equivalent, and so their results are combined. The results written in bold indicate that there is no significant difference to the best-performing mixing model for this function. Those results that are sig- nificantly worse than the best mixing model but not significantly worse than the best model in their group are written in italics. Statistical significance was determined by Tukey’s HSD post-hoc test at the 0.01 level.

experimental runs per function¹. To give the reader an intuitive idea of how mixing is performed, Figures 6.1 to 6.4 show the predictions of the different methods of a single run when using classifiers that model straight lines. The mean likelihoods over these 50 runs as a fraction of the mean likelihood of the IRLS method are shown in Table 6.2. An ANOVA reveals that there is a significant performance difference between the different methods (F(7, 2744) = 43.0688, p = 0.0). Comparing the means reveals that the method that performs best is IRLS, followed by IRLSf, InvVar, MaxConf, Conf, LSf, LS, and last, XCS. The p-values of Tukey's HSD post-hoc test are given in Table 6.3. They show that the performance difference between all methods is significant at the 0.01 level, except for the ones that are written in italics.

To see if the number of classifiers influences the results, we have performed further experiments with K ∈ {20, 100, 400}. These experiments gave qualitatively similar results, which is why we do not report them explicitly.

1In our experience, performing the experiments with fewer runs provided insufficient data to permit significance tests to reliably detect the differences.

          IRLS    IRLSf   InvVar  MaxConf Conf    LSf     LS      XCS
XCS       0.0000  0.0000  0.0000  0.0000  0.0000  0.0283  0.5131  -
LS        0.0000  0.0000  0.0000  0.0000  0.0000  0.8574  -
LSf       0.0000  0.0000  0.0000  0.0095  0.0150  -
Conf      0.0000  0.0000  0.1044  0.9999  -
MaxConf   0.0000  0.0000  0.1445  -
InvVar    0.0001  0.0002  -
IRLSf     0.8657  -
IRLS      -

Table 6.3: p-values for Tukey's HSD post-hoc comparison of the different mixing methods. The performance values were gathered in 50 experimental runs per function, using both averaging classifiers and classifiers that model straight lines. The p-values reported are for a post-hoc comparison only considering the factor that determines the mixing method. The methods are ordered by performance, with the leftmost and bottom method being the best-performing one. The p-values in italics indicate that no significant difference between the methods at the 0.01 level was detected.

6.3.3 Discussion

As can be seen from the results, IRLS is in almost all cases significantly better, and in no case significantly worse, than any other method that was applied. IRLSf uses more information than IRLS to mix the classifier predictions, and thus can be expected to perform better. As can be seen from Table 6.2, however, it frequently features worse performance, though not significantly. This worse performance can be attributed to our stopping criterion, which is based on the relative change of the likelihood between two successive iterations. We observed this likelihood to increase more slowly when using IRLSf, which leads the stopping criterion to abort learning earlier for IRLSf than for IRLS, causing it to perform worse.

InvVar is the best method of the introduced heuristics and consistently outperforms LS and LSf. Even though it does not perform significantly better than Conf and MaxConf, its mean is higher and the method relies on fewer assumptions. Thus, it should be the preferred method amongst the heuristics that were introduced.

As expected, XCS features a worse performance than all other methods, which we attribute to the fact that the performance measure of the local model is influenced by the performance of the local models that match the same inputs. This might introduce some smoothing, but it remains questionable if such smoothing is ever advantageous. This doubt is justified by observing that XCS performs worst even on the smoothest function in the test set, which is the Heavisine function.

[Plot omitted: predictions of the Blocks function using the different mixing models (InvVar, Conf, MaxConf, LS, IRLS, XCS); f(x) plotted over x in [0, 1].]

Figure 6.1: Resulting predictions of a single run, using different mixing models for the Blocks function. See the text for an explanation of the experimental setup.

Overall, we have empirically confirmed that IRLS performs best. However, due to its high complexity and bad scaling properties, it is not recommendable for applications that require the use of a large number of classifiers. While the least squares approximation could be used as an alternative in such cases, our experiments suggest that InvVar provides us with better results. Additionally, it is easier to implement than LS and LSf, and requires no incremental update. Thus, it should be the preferred method to use.

[Plot omitted: predictions of the Bumps function using the different mixing models (InvVar, Conf, MaxConf, LS, IRLS, XCS); f(x) plotted over x in [0, 1].]

Figure 6.2: Resulting predictions of a single run, using different mixing models for the Bumps function. See the text for an explanation of the experimental setup.

6.4 Relation to our Previously Published Work

In a previous study [82], we have dealt with a similar problem, but with the motivation of minimising the mean squared error of the global output prediction rather than relying on a probabilistic model and maximising the likelihood. Thus, we have defined the mixing problem as finding a mixing model that minimises

\[ \sum_{n=1}^{N} \left( \hat{f}(x_n) - f(x_n) \right)^2, \tag{6.35} \]

where f is the target function, and fˆ(x_n) is the global output prediction for

input x_n. We can derive this problem statement with a model that assumes the relation between f and fˆ to be fˆ(x) = f(x) + ε, where ε ∼ N(0, σ²) is a zero-mean constant variance Gaussian that represents the random noise. The maximum likelihood estimate for the parameters of fˆ is found by maximising Σ_n ln N(f(x_n) | fˆ(x_n), σ²), which is equivalent to minimising Eq. (6.35).

[Plot omitted: predictions of the Doppler function using the different mixing models (InvVar, Conf, MaxConf, LS, IRLS, XCS); f(x) plotted over x in [0, 1].]

Figure 6.3: Resulting predictions of a single run, using different mixing models for the Doppler function. See the text for an explanation of the experimental setup.

In the LCS model introduced in Chapter 4, on the other hand, we assume zero-mean constant variance Gaussian noise on each local model p(y | x, θ_k) rather than on the global model p(y | x, θ). These models are related by p(y | x, θ) = Σ_k g_k(x) p(y | x, θ_k), and as g_k(x) might change with x, the noise variance of the global model is very likely not constant. As a result, the maximum likelihood estimate for the LCS model as introduced in Chapter 4 does not conform to minimising Eq. (6.35).

Therefore, while the study in [82] is valid from the purely functional point- of-view of minimising the squared global prediction error, it is not compatible with the assumptions that are the basis of this work. Thus, the additional linear mixing model that was introduced in [82] and is directly based on Eq. (6.35) does not apply here, and was therefore skipped. Another difference between [82] and the work presented in this chapter is that [82] lacks the probabilistic basis, and does not consider the generalised softmax function as a possible mixing model. The heuristics, on the other hand, are the same as in [82].

The results of the empirical study in [82], on the other hand, are qualitatively the same as the ones we have presented here, as they show that the InvVar heuristic features competitive performance that is usually better than that of Conf and MaxConf, and always outperforms XCS. That this same trend is observable when maximising the likelihood rather than minimising the mean squared error demonstrates that the developed probabilistic model structure is compatible, despite different underlying assumptions, with the functional structure of LCS and the assumption of a constant noise variance at the global model structure level.

[Plot omitted: predictions of the Heavisine function using the different mixing models (InvVar, Conf, MaxConf, LS, IRLS, XCS); f(x) plotted over x in [0, 1].]

Figure 6.4: Resulting predictions of a single run, using different mixing models for the Heavisine function. See the text for an explanation of the experimental setup.

6.5 Summary and Outlook

In this chapter we have approached an essential LCS component that is largely ignored by LCS research: how to combine a set of localised models, provided by the classifiers, into a global prediction. We have defined the aim of this "mixing problem" as maximising the data likelihood Eq. (6.1) of the previously introduced LCS model.

As we have shown, the IRLS algorithm is a possible approach to finding the globally optimal mixing parameters V of the generalised softmax mixing model, but suffers from high complexity, and can therefore act as nothing more than a benchmark against which to compare other approaches. The least squares approximation, on the other hand, scales well but lacks the desired performance, as shown in experiments.

As an alternative we have introduced heuristics that are inspired by formal properties of mixing by weighted average. Not only do they scale well with the number of classifiers, as they do not have any adjustable parameters other than the classifier parameters, but they also perform better than mixing by the least squares approximation. In particular, mixing by inverse variance makes the fewest assumptions of the introduced heuristics, is also the best-performing one (though not significantly), and is therefore our recommended choice.

The mixing model in XCS was never designed to maximise the data likelihood, and therefore our comparison to other heuristics might not seem completely fair. However, we have shown in [82] that it also performs worst with respect to the mean squared error measure, and thus is not a good choice for a mixing model. Rather, mixing by inverse variance should be used as a drop-in replacement in XCS, but this recommendation is based more strongly on the previous experiments in [82] (see Section 6.4) than on the empirical results presented here.

With this chapter we complete our discussion of how to find the LCS model parameters θ by the principle of maximum likelihood for a fixed model structure M, and continue by providing a framework that lets us in addition find a good model structure, that is, a good set of classifiers. As we will see, the approach we take does not allow us to deal with identifying good model structures only at the model structure level M, but requires us to reformulate the probabilistic model itself to avoid overfitting even when finding the model parameters for a fixed model structure. With it, we deviate from the principle of maximum likelihood, which, however, does not completely invalidate the work that was presented in the last two chapters. Rather, we will discover that the new update equations for parameter learning are, up to small modifications, similar to the ones that provide us with maximum likelihood estimates. Investigating these differences provides us with the valuable insight of how exactly model selection infiltrates the parameter learning process.


Chapter 7

The Optimal Set of Classifiers

In this chapter we deal with the question of what it means for a set of classifiers to be optimal in the light of the available data, and how to provide a formal solution to this problem. As such, we tackle the core task of LCS, whose ultimate aim it is to find such a set.

Up until now there is no general definition of what we want to learn in LCS. Rather, there is an intuitive understanding of what a desirable set of classifiers should look like, and LCS algorithms are designed around such an under- standing. However, having LCS that perform according to intuition in simple problems where the desired solution is known does not mean that they will do so in more complex tasks. Furthermore, how do we know that our intuition does not betray us?

While there are a small number of studies on what LCS want to learn and how that can be measured [131, 134, 136], they concentrate exclusively on the case where the input is encoded as a binary string, and even then they list several possible approaches rather than providing a single conclusive answer. How- ever, considering the complexity of the problem at hand, it is understandable that approaching it is anything but trivial. The solution structure is strongly dependent on the chosen representation, but what is the best representation? Do we want the classifiers to partition the input space such that each of them independently provides a part of the solution, or do we expect them to cooper- ate? Should we prefer default hierarchies, where predictions of more general

classifiers, that is, classifiers that match larger areas of the input space, are overridden by more specific ones, in a tree-like structure? Are the predictions of the classifiers supposed to be completely accurate, or do we allow for some error? And these are just a few questions to consider.

Rather than listing all possible questions and going through them one by one, we approach the problem from another side, based on how we have characterised LCS in Chapter 3: a fixed set of classifiers, that is, a fixed model structure M, provides a certain hypothesis about the data-generating process that generated the observed data D. Thus, "What do LCS want to learn?" becomes "Which model structure M explains the available data D best?". But what exactly does "best" mean? Fortunately, evaluating the suitability of a model with respect to the available data is a common task in machine learning, known as model selection. Hence, we have reduced the complex problem of defining the optimal set of classifiers to identifying a suitable model, and to applying it. This is what we will do for the rest of this chapter.

Firstly, we will spend a bit more time on the question of optimality, and, in general, which model properties are desirable. We decide to use Bayesian model selection to identify good sets of classifiers, and therefore will reformulate the LCS model as a fully Bayesian model. Subsequently, in a longer, more technical section, we apply variational Bayesian inference to find closed-form approximations to posterior distributions. As a result, we have a closed-form expression for the quality of a particular model structure that allows us to compare the suitability of different LCS model structures to explain the available data. As such, we provide the first general (that is, representation-independent) definition of optimality for a set of classifiers, and with it an answer to the question of what we want to learn with LCS.

7.1 What is Optimal?

Let us consider two extremes: N classifiers, such that each observation is matched by exactly one classifier, or a single classifier that matches all inputs. In the first case, each classifier replicates its associated observation completely accurately, and so the whole set of classifiers is a completely accurate repre-

sentation of the data; it has an optimal goodness-of-fit. Methods that minimise the empirical risk, such as maximum likelihood or squared error minimisation, would evaluate such a set as being optimal. Nonetheless, it does not allow us to find any generalisation in noisy data, as it does not differentiate between noise and the pattern in the data. In other words, having one classifier per observation does not provide us with any information beyond the data itself, and thus is not a desired solution.

Using a single classifier that matches all inputs, on the other hand, is the simplest LCS model structure, but has a very low expressive power. That is, it can only express very simple patterns in the data, and will very likely have a bad goodness-of-fit. Thus, finding a good set of classifiers involves balancing the goodness-of-fit of this set and its complexity, which determines its expressive power. This tradeoff must be somehow expressed in each method that avoids overfitting.

7.1.1 Current LCS Approaches

XCS has the ability to find a set of classifiers that generalises over the available data [240, 241], and so have YCS [33] and CCS [154, 155]. This means that they do not simply minimise the overall model error but have some built-in model selection capability, however crude it might be.

Let us first consider XCS: its ability to generalise is brought about by a combi- nation of the accuracy-definition of a classifier and the operation of its genetic algorithm. A classifier is considered as being accurate if its mean absolute prediction error over all matched observations is below the minimum error1 threshold ǫ0. The genetic algorithm provides accurate classifiers that match larger areas of the input space with more reproductive opportunity. However, overly general classifiers, that is, classifiers that match overly large areas of the input space, will feature a mean absolute error that is larger than ǫ0, and are not accurate anymore. Thus, the genetic algorithm “pushes” towards more general classifiers, but only until they reach ǫ0 [53]. In combination with the

1 The term minimum error for ε_0 is a misnomer, as it specifies the maximum error that a classifier can have to still be accurate. Thus, ε_0 should be called the maximum admissible error or similar.

competition between classifiers that match the same input, XCS can be said to aim at finding the smallest non-overlapping set of accurate classifiers. From this perspective we could define an optimal set of classifiers that is dependent

on ǫ0. However, such a definition is not very appealing, as i) it is based on an algorithm, rather than having an algorithm that is based on the definition; ii) it is based solely on intuition; iii) the best set of classifiers is fully determined by

the setting of ǫ0 that might depend on the task at hand; and iv) ǫ0 is the same for the whole input space, and so XCS cannot cope with tasks where the noise varies for different areas of the input space.

YCS [33] was developed by Bull as a simplified version of XCS such that its classifier dynamics can be modelled by difference equations. While it still mea- sures the mean absolute prediction error of each classifier, it defines the fitness as being inversely proportional to this error, rather than using any accuracy concept based on some error threshold. Additionally, its genetic algorithm differs from the one used in XCS in that it selects classifiers from the whole set rather than only from the set that matches the current input. Having a fitness that is inverse to the error will make the genetic algorithm assign a higher re- productive opportunity to low-error classifiers that match many inputs. How low this error has to be depends on the error of other competing classifiers in the set, and on the maximum number of classifiers allowed, as that number determines the number of classifiers that the genetic algorithm aims at assign- ing to each input. Due to these dependencies it is difficult to define which set of classifiers YCS aims at finding, particularly as it depends on the dynamics of the genetic algorithm and the interplay of several system parameters. Its pressure towards more general classifiers comes from those classifiers match- ing more inputs and thus updating their error estimates more quickly, which gives them an initial higher fitness than more specific classifiers. However, this pressure is implicit and weaker than in XCS, which is easily seen in Figure 1(a) of [33], where general and specific, but equally accurate, classifiers peacefully and stably co-exist in the population. We can only state that it supports clas- sifiers that match larger areas of the input space, but only up until their errors get too large when compared to other classifiers in the set.

CCS [154, 155], in contrast, has a very clear definition of what types of classifiers win the competition in a classification task: it aims at maximally general and maximally accurate classifiers by combining a generality measure, given by the proportion of overall examples correctly classified, and an error measure that is inversely proportional to the number of correct positive classifications over all classification attempts of a rule². The tradeoff between generality and error is handled by a constant γ that needs to be tuned. Thus, as in XCS, it is dependent on a system parameter that is to be set by the user. Additionally, in its current form, CCS aims at evolving rules that are completely accurate, and is thus unable to cope with noisy data [154, 155]. The set of classifiers it aims for can be described as the smallest set of classifiers that has the best tradeoff between error and generality, as controlled by the parameter γ.

7.1.2 Model Selection

Due to the shortcomings of the previously discussed LCS, we will not consider them in our definition of the optimal set of classifiers, but rather will use existing concepts from current model selection methods. Even though most of the model selection criteria have different philosophical backgrounds, they all result in the principle of minimising a combination of the model error and a measure of the model complexity. To provide good model selection it is essential to use a good model complexity measure, and it has been shown in [126] that, generally, methods that consider the data when judging the model complexity outperform methods that do not. Furthermore, it is also of advantage to use the full training data rather than an independent test set [13].

Bayesian model selection meets these requirements and has additionally already been applied to the Mixtures-of-Experts model [230, 20, 219]. This makes it an obvious choice as a model selection criterion for LCS. We will provide a short discussion of alternative model selection criteria that might be applicable to LCS in Section 7.5, later in this chapter.

2In [154, 155], the generality measure is called the accuracy, and the ratio of positive correct classifications over the total number of classification attempts is the error, despite it being some inverse measure of the error.

7.1.3 Bayesian Model Selection

Given a model structure M and the data D, Bayesian model selection is based on finding the probability density of the model structure given the data by Bayes' rule

\[ p(\mathcal{M} \,|\, \mathcal{D}) \propto p(\mathcal{D} \,|\, \mathcal{M})\, p(\mathcal{M}), \tag{7.1} \]

where p(M) is the prior over the set of possible model structures. The "best" model structure given the data is the one with the highest probability density p(M | D).

The data-dependent term p(D | M) is a likelihood known as the evidence for model structure M, and is for a parametric model with parameters θ evaluated by

\[ p(\mathcal{D} \,|\, \mathcal{M}) = \int_{\theta} p(\mathcal{D} \,|\, \theta, \mathcal{M})\, p(\theta \,|\, \mathcal{M})\, \mathrm{d}\theta, \tag{7.2} \]

where p(D | θ, M) is the data likelihood for a given model structure M, and p(θ | M) are the parameter priors given the same model structure. Thus, in order to perform Bayesian model selection, one needs to have a prior over the model structure space {M}, a prior over the parameters given a model structure, and an efficient way of computing the model evidence Eq. (7.2).

As we would expect from a good model selection method, an implicit property of Bayesian model selection is that it penalises overly complex models [160]. This can be intuitively explained as follows: probability distributions that are more widely spread generally have lower peaks, as the area underneath their density function is always 1. While simple model structures have only a limited capability of expressing data sets, more complex model structures are able to express a wider range of different data sets. Thus, their prior distribution will be more widely spread. As a consequence, conditioning a simple model structure on some data that it can express will cause its distribution to have a larger peak than a more complex model structure that is also able to express this data. This shows that, in cases where a simple model structure is able to explain the same data as a more complex model structure, Bayesian model selection will prefer the simpler model structure.

7.1.4 Applying Bayesian Model Selection to Finding the Best Set of Classifiers

Applied to LCS, the model structure is, as previously described, defined by the number of classifiers K and their matching functions M = {m_k : X → [0, 1]}, giving M = {K, M}. In order to find the best set of classifiers, we need to maximise its probability density with respect to the data, Eq. (7.1), which is equivalent to maximising its logarithm

\[ \ln p(\mathcal{M} \,|\, \mathcal{D}) = \ln p(\mathcal{D} \,|\, \mathcal{M}) + \ln p(\mathcal{M}) + \text{const.}, \tag{7.3} \]

where the constant term captures the normalising constant and can be ignored when comparing the different model structures, as it is shared between them.

Evaluating the log-evidence ln p(D | M) in Eq. (7.3) requires us to first specify a parameter prior p(θ | M), and then to evaluate Eq. (7.2) to get the evidence of M. Unfortunately, the LCS model described in Chapter 4 is not fully Bayesian and needs to be reformulated before we can evaluate the evidence. Additionally, the resulting probabilistic model structure does not provide a closed-form solution to Eq. (7.2). Thus, the rest of this chapter is devoted to i) introducing a fully Bayesian LCS model, and ii) applying an approximation method called variational Bayesian inference that gives us a closed-form expression for the evidence. Before we do so, let us discuss the prior p(M) on the model structure itself, and why the requirement of specifying parameter and model structure priors is not an inherent weakness of the method.

7.1.5 The Model Structure Prior p(M)

Specifying the prior p(M) lets us express our belief about which model structures are best at representing the data, prior to knowledge of the data. Recall that M = {M, K}, and thus we can decompose p(M) into p(M) = p(M | K) p(K). Our belief about the number of classifiers K is that this number is certainly always finite, and thus we need to have p(K) → 0 with K → ∞. The beliefs about the set of matching functions M given some K are less clear. Let us only note that M contains K matching functions, such that the set

of possible M grows exponentially with K.

The question of how to best specify p(M), and if there even is a "best" prior on M, will be left open as a topic of further research. For now, we will for illustrative purposes use p(M) ∝ 1/K!, or

\[ \ln p(\mathcal{M}) = -\ln K! + \text{const.} \tag{7.4} \]

This prior can be interpreted as the prior p(K) = (e − 1)^{-1} 1/K! on the number of classifiers, where e ≡ exp(1), and a uniform p(M | K) that is absorbed by the constant term. Such a prior satisfies p(K) → 0 for K → ∞ and expresses that we expect the number of classifiers in the model to be small³.
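For illustration, the log of this prior is readily evaluated up to its constant; a small sketch:

    import math

    def log_structure_prior(K):
        """ln p(M) = -ln K! + const. for the illustrative prior p(M) proportional to 1/K!
        (the constant is shared between model structures and can be ignored)."""
        return -math.lgamma(K + 1)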

7.1.6 The Myth of No Prior Assumptions

A prior in the Bayesian sense is specified by a prior probability distribution and expresses what is known about a random variable in the absence of some evidence. For parametric models, the prior usually expresses what we expect the model parameters to be, in the absence of any observations. As such, it is part of the assumptions that we make about the data-generating process. Combining the information of the prior and the data gives us the posterior.

Having the need to specify prior distributions could be considered as a weak- ness of Bayesian model selection, or even Bayesian statistics. Similarly, it could also be seen as a weakness of our approach to define the best set of classifiers. This view is justified by the idea that there exist other methods that do not make any prior assumptions. But is this really the case?

Let us investigate the class of linear models as we have described them in Chapter 5. Due to linking the recursive least squares algorithm to ridge re- gression in Section 5.3.5 and the Kalman filter in Section 5.3.6, we have shown

³As pointed out by Dr. Dan Richardson, University of Bath, the prior p(K) ∝ 1/K! has E(K) < 2 and thus expresses the belief that the number of classifiers is expected to be on average less than 2. He proposed the alternative prior p(K) = exp(−V) V^K / K!, where V is a constant related to volume, and E(K) increases with V.

that the ridge regression problem

\[ \min_{w} \left( \|Xw - y\|^2 + \lambda \|w\|^2 \right) \tag{7.5} \]

is equivalent to conditioning a multivariate Gaussian prior ω_0 ∼ N(0, (λτ)^{-1} I) on the available data {X, y}, where τ is the noise precision of the linear model with respect to the data. Such a prior means that we assume each element of the weight vector to be independent, due to the zero off-diagonal elements of the diagonal covariance matrix, and zero-mean Gaussian with variance (λτ)^{-1}. That is, we assume the elements most likely to be zero, but they can also have other values with a likelihood that decreases with their deviation from zero.

Setting λ = 0 reduces Eq. (7.5) to a standard linear least squares problem without any prior assumptions, as it seems, besides the linear relation between the input and the output and the constant noise variance. Let us have a closer look at how λ = 0 influences ω_0: as λ → 0 causes (λτ)^{-1} → ∞, one can interpret the prior ω_0 to be the multivariate Gaussian N(0, ∞I) (ignoring the problems that come with the use of ∞). As a Gaussian with increasing variance approaches the uniform distribution, the elements of the weight vector are now equally likely to take any possible value on the real line. Even though such a prior seems unbiased at first, let us not forget that the uniform density puts most of its weight on large values due to its uniform tails [69]. Thus, as linear least squares is equivalent to ridge regression with λ = 0, its prior assumption on the values of the weight vector elements is that they are uncorrelated but most likely take very large values. Large weight vector values, however, are usually a sign of non-smooth functions. Thus, linear least squares implicitly assumes that the function it models is not smooth.

We have discussed in Section 3.1.1 that a prerequisite for generalisation is that a function is smooth. Thus, we do actually assume smoothness of the function, and therefore ridge regression with λ > 0 is more appropriate than plain lin- ear least squares. The prior that is associated with ridge regression is known as a shrinkage prior [103], as it causes the weight vector elements to be smaller than without using this prior. Ridge regression itself is part of a family of reg- ularisation methods that add the assumption of function smoothness to guide parameter learning in otherwise ill-defined circumstances [217].
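The equivalence discussed above can be made concrete in a few lines; the following sketch computes the ridge regression solution, which is the MAP estimate under the shrinkage prior, and recovers ordinary least squares for λ = 0 with its implicit flat prior.

    import numpy as np

    def ridge_map_estimate(X, y, lam):
        """MAP weight vector under the prior w ~ N(0, (lam * tau)^-1 I),
        i.e. the minimiser of Eq. (7.5); lam = 0 gives plain least squares."""
        D = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)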

In summary, even methods that seemingly make no assumptions about the parameter values are biased by implicit priors, as we have shown by comparing ridge regression with linear least squares. In any case, it is important to be aware of these priors, as they are part of the assumptions that a model makes about the data-generating process. Thus, when introducing the Bayesian LCS model, we put special emphasis on how the introduced parameter priors express our assumptions.

7.2 A Fully Bayesian LCS

The Bayesian LCS model is equivalent to the one introduced as a generalisa- tion of the Mixtures-of-Experts model in Chapter 4, with the differences that we allow classifiers to perform multivariate rather than univariate regression, and that we put priors and associated hyperpriors on all model parameters. As such, it is a generalisation of the previous model as it completely subsumes it. For now we do not assume the classifiers to be trained independently, and will re-introduce this independence at a later stage, analogous to Section 4.4.

Table 7.1 gives a summary of the Bayesian LCS model, and Figure 7.1 shows its variable dependency structure as a directed graph. The model is, besides the additional matching, similar to the Bayesian MoE model in [230, 229], to the Bayesian mixture model in [219], and to the Bayesian MoE model in [20]. We will now describe each of its components in more detail.

7.2.1 Data, Model Structure, and Likelihood

To evaluate the evidence of a certain model structure M, we need the data D and the model structure M to be known. The data D consists of N observations, each given by an input/output pair (x_n, y_n). The input vector x_n is an element of the D_X-dimensional real input space X = ℝ^{D_X}, and the output vector y_n is an element of the D_Y-dimensional real output space Y = ℝ^{D_Y}. Hence, x_n has D_X, and y_n has D_Y elements. The input matrix X and output matrix Y are defined according to Eq. (3.3).

Data, Model Structure, and Likelihood
    Data                 N observations {(x_n, y_n)}, x_n ∈ X = ℝ^{D_X}, y_n ∈ Y = ℝ^{D_Y}
    Model structure      M = {K, M}, K classifiers, k = 1, ..., K
    Matching functions   M = {m_k : X → [0, 1]}
    Likelihood           p(Y | X, W, τ, Z) = ∏_{n=1}^{N} ∏_{k=1}^{K} p(y_n | x_n, W_k, τ_k)^{z_nk}

Classifiers
    Variables            weight matrices W = {W_k}, W_k ∈ ℝ^{D_Y × D_X}
                         noise precisions τ = {τ_k}
                         weight shrinkage priors α = {α_k}
                         noise precision prior parameters a_τ, b_τ
                         α-hyperprior parameters a_α, b_α
    Model                p(y | x, W_k, τ_k) = N(y | W_k x, τ_k^{-1} I) = ∏_{j=1}^{D_Y} N(y_j | w_kj^T x, τ_k^{-1})
    Priors               p(W_k, τ_k | α_k) = ∏_{j=1}^{D_Y} ( N(w_kj | 0, (α_k τ_k)^{-1} I) Gam(τ_k | a_τ, b_τ) )
                         p(α_k) = Gam(α_k | a_α, b_α)

Mixing
    Variables            latent variables Z = {z_n}, z_n = (z_n1, ..., z_nK)^T ∈ {0, 1}^K, 1-of-K
                         mixing weight vectors V = {v_k}, v_k ∈ ℝ^{D_V}
                         mixing weight shrinkage priors β = {β_k}
                         β-hyperprior parameters a_β, b_β
    Model                p(Z | X, V, M) = ∏_{n=1}^{N} ∏_{k=1}^{K} g_k(x_n)^{z_nk}
                         g_k(x) ≡ p(z_k = 1 | x, v_k, m_k) = m_k(x) exp(v_k^T φ(x)) / Σ_{j=1}^{K} m_j(x) exp(v_j^T φ(x))
    Priors               p(v_k | β_k) = N(v_k | 0, β_k^{-1} I)
                         p(β_k) = Gam(β_k | a_β, b_β)

Table 7.1: Bayesian LCS model, with all its components. For more details on the model see Section 7.2.

We assume the data to be standardised by a linear transformation such that all x and y have mean 0 and a range of 1. The purpose of this standardisation is the same as the one given in [62], which is to make it easier to intuitively gauge parameter values. For example, with the data being standardised, a weight value of 2 can be considered large as a half range increase in x would result in a full range increase in y.

The model structure M = {K, M} specifies on one hand that we have K classifiers, and on the other hand where these classifiers are localised. Each classifier k has an associated matching function m_k : X → [0, 1] that returns for each input the probability of classifier k matching this input, as described in

[Graphical model omitted. Its nodes are the data x_n, y_n, the matching m_nk and latent variables z_nk, the classifier parameters W_k, τ_k, α_k, the mixing parameters v_k, β_k, the hyperparameters a_α, b_α, a_τ, b_τ, a_β, b_β, and the model structure M.]

Figure 7.1: Directed graphical model of the Bayesian LCS model. See the caption of Figure 4.1 for instructions on how to read this graph. Note that to train the model, we assume the data D and the model structure M to be given. Hence, the y_n's and M are observed random variables, and the x_n's are constants.

Section 4.3.1. We assume that for each input x_n we have Σ_k m_k(x_n) > 0, that is, that each input is matched by at least one classifier. This needs to be the case to ensure that we can model all of the inputs. As the model structure is known, all probability distributions are implicitly conditional on M.

To specify the data likelihood, we again take the generative view that each observation was generated by one and only one classifier. Let Z = {z_n} be the N latent binary vectors z_n = (z_n1, ..., z_nK)^T of size K. We have z_nk = 1 if classifier k generated observation n, and z_nk = 0 otherwise. As each observation is generated by a single classifier, only a single element of each z_n is 1, and all other elements are 0. Under the standard assumption of independent and identically distributed data, that gives the likelihood

\[ p(Y \,|\, X, W, \tau, Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(y_n \,|\, x_n, W_k, \tau_k)^{z_{nk}}, \tag{7.6} \]

where p(y_n | x_n, W_k, τ_k) is the model for the input/output relation of classifier k, parameterised by W = {W_k} and τ = {τ_k}. Let us first introduce the classifier model, and then the model for the latent variables Z.

7.2.2 Multivariate Regression Classifiers

The classifier model for classifier k is given by

\[
\begin{aligned}
p(y \,|\, x, W_k, \tau_k) &= \mathcal{N}(y \,|\, W_k x, \tau_k^{-1} I) = \prod_{j=1}^{D_Y} \mathcal{N}(y_j \,|\, w_{kj}^T x, \tau_k^{-1}) \\
&= \prod_{j=1}^{D_Y} \left( \frac{\tau_k}{2\pi} \right)^{1/2} \exp\left( -\frac{\tau_k}{2} (y_j - w_{kj}^T x)^2 \right),
\end{aligned} \tag{7.7}
\]

where y_j is the jth element of y, W_k is the D_Y × D_X weight matrix, and τ_k is the scalar noise precision. w_kj^T is the jth row vector of the weight matrix W_k.

This model assumes that each element of the output y is linearly related to x with coefficients w_kj, that is, y_j ≈ w_kj^T x. Additionally, it assumes the elements of the output vector to be independent and feature zero-mean Gaussian noise with constant variance τ_k^{-1}. Note that the noise variance is assumed to be the same for each element of this output. It would be possible to assign each output element its own noise variance estimate, but we have chosen not to do so to keep the model relatively simple. If we have D_Y = 1, we return to the univariate regression model Eq. (5.3) that forms the basis of Chapter 5.

7.2.3 Priors on the Classifier Model Parameters

We assume each element of the output to be related to the input by a smooth function. Thus, we assume the elements of the weight matrix W_k to be small, which we express by assigning shrinkage priors to each row vector w_kj of the weight matrix W_k. Additionally, we assume the noise precision to be larger, but not much larger, than 0, and in no case infinite, which is given by the prior Gam(τ_k | a_τ, b_τ) on the noise precision. Thus, the prior on W_k and τ_k is given by

\[
\begin{aligned}
p(W_k, \tau_k \,|\, \alpha_k) &= \prod_{j=1}^{D_Y} p(w_{kj}, \tau_k \,|\, \alpha_k) \\
&= \prod_{j=1}^{D_Y} \mathcal{N}\!\left(w_{kj} \,|\, 0, (\alpha_k \tau_k)^{-1} I\right) \mathrm{Gam}(\tau_k \,|\, a_\tau, b_\tau) \\
&= \prod_{j=1}^{D_Y} \left( \frac{\alpha_k \tau_k}{2\pi} \right)^{D_X/2} \frac{b_\tau^{a_\tau} \tau_k^{(a_\tau - 1)}}{\Gamma(a_\tau)} \exp\left( -\frac{\alpha_k \tau_k}{2} w_{kj}^T w_{kj} - b_\tau \tau_k \right),
\end{aligned} \tag{7.8}
\]

where Γ(·) is the gamma function, α_k parameterises the variance of the Gaussian, and a_τ and b_τ are the parameters of the Gamma distribution. This prior distribution is known as normal inverse-gamma, as the inverse variance parameter of the Gaussian is distributed according to a Gamma distribution. Its use is advantageous, as conditioning it on a Gaussian results again in a normal inverse-gamma distribution, that is, it is a conjugate prior of the Gaussian distribution.

The prior assumes that the elements of the weight vectors w_kj are independent and most likely zero, which is justified by the standardised data and the lack of further information. Their likelihood of deviating from zero is parameterised by α_k. τ_k is added to the variance term of the normal distribution for mathematical convenience, as it simplifies the computation of the posterior and predictive density.

The noise precision is distributed according to a Gamma distribution, which we will parameterise as in [20] by a_τ = 10^{-2} and b_τ = 10^{-4} to keep the prior sufficiently broad and uninformative, as shown in Figure 7.2(a). An alternative approach would be to set the prior on τ_k to express the belief that the variance of the localised models will be most likely smaller than the variance of a single global model of the same form. We will not follow this approach, but more information on how to set the distribution parameters in such a case can be found in [62].

We could specify a value for α_k by again considering the relation between the local models and the global model, as in [62]. However, we rather follow [20], and treat α_k as a random variable that is modelled in addition to W_k and τ_k. It is

assigned a conjugate Gamma distribution

\[ p(\alpha_k) = \mathrm{Gam}(\alpha_k \,|\, a_\alpha, b_\alpha) = \frac{b_\alpha^{a_\alpha} \alpha_k^{(a_\alpha - 1)}}{\Gamma(a_\alpha)} \exp(-b_\alpha \alpha_k), \tag{7.9} \]

which we keep sufficiently broad and uninformative by setting a_α = 10^{-2} and b_α = 10^{-4}. The combined effect of τ_k and α_k on the weight vector prior variance is shown in Figure 7.2(b).

[Two histogram plots omitted: (a) prior density for the classifier noise variance; (b) prior density for the classifier weight variance.]

Figure 7.2: Histogram plot of the density of the (a) noise variance, and (b) variance of the weight vector prior. The plot in (a) was generated by sampling from τ_k^{-1} and shows that the prior on the variance is very flat, with the highest peak at a density of around 0.04 and a variance of about 100. The plot in (b) was generated by sampling from (α_k τ_k)^{-1} and shows an even broader density for the variance of the zero-mean weight vector prior, with its peak at around 0.00028 at a variance of about 10000.
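The densities in Figure 7.2 can be approximated by sampling from the priors, as in the sketch below; it assumes the shape/rate parameterisation of the Gamma distribution used in this chapter (numpy expects the scale, i.e. the inverse rate).

    import numpy as np

    rng = np.random.default_rng(0)
    a_tau, b_tau = 1e-2, 1e-4
    a_alpha, b_alpha = 1e-2, 1e-4

    tau = rng.gamma(a_tau, 1.0 / b_tau, size=100_000)        # noise precisions tau_k
    alpha = rng.gamma(a_alpha, 1.0 / b_alpha, size=100_000)  # weight shrinkage alpha_k

    noise_variance = 1.0 / tau              # samples underlying Figure 7.2(a)
    weight_variance = 1.0 / (alpha * tau)   # samples underlying Figure 7.2(b)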

7.2.4 Mixing by the Generalised Softmax Function

As in Chapter 4, the latent variables are modelled by the generalised softmax function Eq. (4.20), given by

\[ g_k(x) \equiv p(z_k = 1 \,|\, x, v_k) = \frac{m_k(x) \exp(v_k^T \phi(x))}{\sum_{j=1}^{K} m_j(x) \exp(v_j^T \phi(x))}. \tag{7.10} \]

It assumes that, given that classifier k matched input x, the probability of classifier k generating observation n is related to φ(x) by a log-linear function exp(v_k^T φ(x)), parameterised by v_k. The transfer function φ : X → ℝ^{D_V} maps the input into a D_V-dimensional real space, and therefore the vector v_k is of

size D_V and also an element of that space. In LCS, we usually have D_V = 1 and φ(x) = 1 for all x ∈ X, but to stay general, we will not make any assumptions about φ and D_V.
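A direct transcription of Eq. (7.10) for a single input is given below; the function and argument names are illustrative, and φ(x) = 1 corresponds to the common LCS choice.

    import numpy as np

    def generalised_softmax(x, V, matching, phi=lambda x: np.ones(1)):
        """Mixing weights g_k(x) of Eq. (7.10) for one input x.

        V        : (K, D_V) mixing weight vectors v_k
        matching : list of K matching functions m_k returning values in [0, 1]
        phi      : transfer function mapping x into R^{D_V}
        """
        m = np.array([m_k(x) for m_k in matching])
        a = m * np.exp(V @ phi(x))     # m_k(x) exp(v_k^T phi(x))
        return a / a.sum()             # normalise over all classifiers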

To specify the joint probability for the latent vector z we again make use of its 1-of-K structure to get

\[ p(z \,|\, x, V) = \prod_{k=1}^{K} g_k(x)^{z_k}. \tag{7.11} \]

Thus, the joint probability of all z_n becomes

\[ p(Z \,|\, X, V) = \prod_{n=1}^{N} \prod_{k=1}^{K} g_k(x_n)^{z_{nk}}, \tag{7.12} \]

which fully specifies the model for Z.

7.2.5 Priors on the Mixing Model

Due to the normalisation, the mixing function g_k is over-parameterised, as it would be sufficient to specify K − 1 vectors v_k and leave v_K constant [167]. This would make the values of all v_k's be specified in relation to the constant v_K, and causes problems if classifier K is removed from the current set. Thus, we rather leave g_k over-parameterised, and assume all v_k's to be small, which is again expressed by a shrinkage prior, given by

\[ p(v_k \,|\, \beta_k) = \mathcal{N}(v_k \,|\, 0, \beta_k^{-1} I) = \left( \frac{\beta_k}{2\pi} \right)^{D_V/2} \exp\left( -\frac{\beta_k}{2} v_k^T v_k \right). \tag{7.13} \]

Thus, the elements of vk are assumed to be independent and zero-mean Gaus- sian with precision βk.

Rather than specifying a value for β_k, we again model it by the Gamma hyperprior

\[ p(\beta_k) = \mathrm{Gam}(\beta_k \,|\, a_\beta, b_\beta) = \frac{b_\beta^{a_\beta} \beta_k^{(a_\beta - 1)}}{\Gamma(a_\beta)} \exp(-b_\beta \beta_k), \tag{7.14} \]

with hyper-parameters set to a_β = 10^{-2} and b_β = 10^{-4} to get a broad and

uninformative prior for the variance of the mixing weight vectors. The shape of the prior is the same as for τ_k^{-1}, which is shown in Figure 7.2(a).

7.2.6 Joint Distribution over Random Variables

Assuming knowledge of X and M, the joint distribution over all random variables decomposes into

\[ p(Y, U \,|\, X) = p(Y \,|\, X, W, \tau, Z)\, p(W, \tau \,|\, \alpha)\, p(\alpha)\, p(Z \,|\, X, V)\, p(V \,|\, \beta)\, p(\beta), \tag{7.15} \]

where U collectively denotes the hidden variables U = {W, τ, α, Z, V, β}. This decomposition is also clearly visible in Figure 7.1, where the dependency structure between the different variables and parameters is graphically illustrated. All priors are independent for different k's, and so we have

\[
\begin{aligned}
p(W, \tau \,|\, \alpha) &= \prod_{k=1}^{K} p(W_k, \tau_k \,|\, \alpha_k), & (7.16) \\
p(\alpha) &= \prod_{k=1}^{K} p(\alpha_k), & (7.17) \\
p(V \,|\, \beta) &= \prod_{k=1}^{K} p(v_k \,|\, \beta_k), & (7.18) \\
p(\beta) &= \prod_{k=1}^{K} p(\beta_k). & (7.19)
\end{aligned}
\]

By inspecting Eqs. (7.6) and (7.12) we can see that, similar to the priors, both p(Y | X, W, τ, Z) and p(Z | X, V) factorise over k, and therefore the joint distribution Eq. (7.15) factorises over k as well. We will use this property when deriving expressions for the evidence p(D | M).

7.3 Evaluating the Model Evidence

In this rather technical section we will derive an expression for the model evidence p(D | M) for use in Eq. (7.3). Evaluating Eq. (7.2) does not give us a closed-form expression. Hence, we will make use of an approximation technique known as variational Bayesian inference [120, 19] that provides us with such a closed-form expression.

Alternatively, we could utilise sampling techniques, such as Markov Chain Monte Carlo (MCMC) methods, that would provide us with an accurate posterior and model evidence. However, the model structure search is expensive and requires a quick evaluation of the model evidence for a given model structure, and therefore the computational burden of sampling techniques makes approximating the model evidence by variational methods a better choice.

For the remainder of this chapter, we treat all distributions as being implicitly conditional on X and M, to keep the notation simple. Additionally, we will not always explicitly specify the range for sums and products, as they are usually obvious from their context.

7.3.1 Variational Bayesian Inference

Our goal is, on one hand, to find a variational distribution q(U) that approximates the true posterior p(U | Y) and, on the other hand, to get the model evidence p(Y). Variational Bayesian inference is based on the decomposition [19, 119]

\[
\begin{aligned}
\ln p(Y) &= \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p), & (7.20) \\
\mathcal{L}(q) &= \int q(U) \ln \frac{p(U, Y)}{q(U)}\, \mathrm{d}U, & (7.21) \\
\mathrm{KL}(q \,\|\, p) &= -\int q(U) \ln \frac{p(U \,|\, Y)}{q(U)}\, \mathrm{d}U, & (7.22)
\end{aligned}
\]

which holds for any choice of q. As the Kullback-Leibler divergence KL(q‖p) is always non-negative, and zero if and only if p(U | Y) = q(U) [235], the variational bound L(q) is a lower bound on ln p(Y) and only equivalent to the latter

Factorial Distributions

To make this approach tractable, we need to choose a family of distributions q(U) that gives an analytical solution. A frequently used approach (for example, [20, 230]) that is sufficiently flexible to give a good approximation to the true posterior is to use the set of distributions that factorises with respect to disjoint groups U_i of variables

q(U) = Π_i q_i(U_i),    (7.23)

which allows us to maximise L(q) with respect to each group of hidden variables separately while keeping the other ones fixed. This results in

ln q_i^*(U_i) = E_{i≠j}( ln p(U, Y) ) + const.,    (7.24)

when maximising with respect to U_i, where the expectation is taken with respect to all hidden variables except for U_i, and the constant term is the logarithm of the normalisation constant of q_i^* [19, 119]. In our case, we group the variables according to their priors by {W, τ}, {α}, {V}, {β}, {Z}.

Handling the Softmax Function

If the model has a conjugate-exponential structure, Eq. (7.24) gives an analytical solution with a distribution form equal to the prior of the corresponding hidden variable. However, in our case the generalised softmax function Eq. (7.10) does not conform to this conjugate-exponential structure, and we need to deal with it separately. A possible approach is to replace the softmax function by an exponential lower bound on it, which consequently introduces additional variational variables with respect to which L(q) also needs to be maximised. This approach was followed in [20, 120] for the logistic sigmoid function, but currently there is no known exponential lower bound function on the softmax besides a conjectured one in [91]^4. Alternatively, we can follow the approach taken in [230, 229], where q_V^*(V) is approximated by a Laplace approximation. Despite such an approximation invalidating the lower bound nature of L(q), we have chosen to use it due to the lack of better alternatives.

Update Equations and Model Posterior

To get the update equations for the parameters of the variational distribution, we need to evaluate Eq. (7.24) for each group of hidden variables in U separately, similar to the derivations in [229] and [219]. This provides us with an approximation for the posterior p(U | Y) and will be shown in the following sections.

To approximate the model evidence p(Y), we need to find a closed-form expression for L(q) by evaluating Eq. (7.21), where we can reuse many terms that have already been used for finding the variational update equations, as we will see after having derived the update equations.

7.3.2 Classifier Model q_{W,τ}^*(W, τ)

The maximum of L(q) with respect to W and τ is given by evaluating Eq. (7.24) for q_{W,τ}, which, by using Eqs. (7.15), (7.16) and (7.6), results in

ln q_{W,τ}^*(W, τ) = E_Z(ln p(Y | W, τ, Z)) + E_α(ln p(W, τ | α)) + const.
    = Σ_k Σ_n E_Z( z_nk ln p(y_n | W_k, τ_k) ) + Σ_k E_α( ln p(W_k, τ_k | α_k) ) + const.,    (7.25)

where the constant represents all terms in Eq. (7.15) that are independent of W and τ, and E_Z and E_α are the expectations evaluated with respect to Z and α respectively. This expression shows that q_{W,τ}^* factorises with respect to k,

which allows us to handle the q_{W,τ}^*(W_k, τ_k)'s separately, by solving

ln q_{W,τ}^*(W_k, τ_k) = Σ_n E_Z( z_nk ln p(y_n | W_k, τ_k) ) + E_α( ln p(W_k, τ_k | α_k) ) + const.    (7.26)

^4 A more general bound was recently developed in [228], but its applicability still needs to be evaluated.

Using the classifier model Eq. (7.7), we get

Σ_n E_Z( z_nk ln p(y_n | W_k, τ_k) )
    = Σ_n E_Z(z_nk) ln Π_j N(y_nj | w_kj^T x_n, τ_k^{−1})
    = Σ_n r_nk Σ_j ( (1/2) ln τ_k − (τ_k/2) (y_nj − w_kj^T x_n)² ) + const.
    = (D_Y/2) ( Σ_n r_nk ) ln τ_k
      − (τ_k/2) Σ_j ( Σ_n r_nk y_nj² − 2 w_kj^T Σ_n r_nk x_n y_nj + w_kj^T ( Σ_n r_nk x_n x_n^T ) w_kj ) + const.,    (7.27)

where r_nk ≡ E_Z(z_nk) is the responsibility of classifier k for observation n, and y_nj is the jth element of y_n. The constant represents the terms that are independent of W_k and τ_k.

E_α( ln p(W_k, τ_k | α_k) ) is expanded by the use of Eq. (7.8) and results in

E_α( ln p(W_k, τ_k | α_k) )
    = Σ_j E_α( ln N(w_kj | 0, (α_k τ_k)^{−1} I) + ln Gam(τ_k | a_τ, b_τ) )
    = Σ_j ( (D_X/2) ln τ_k − (τ_k/2) E_α(α_k) w_kj^T w_kj + (a_τ − 1) ln τ_k − b_τ τ_k ) + const.
    = ( D_Y a_τ − D_Y + (D_X D_Y)/2 ) ln τ_k − (τ_k/2) ( 2 D_Y b_τ + E_α(α_k) Σ_j w_kj^T w_kj ) + const.    (7.28)

Thus, evaluating Eq. (7.26) gives

ln q_{W,τ}^*(W_k, τ_k) = ( D_Y a_τ − D_Y + (D_X D_Y)/2 + (D_Y/2) Σ_n r_nk ) ln τ_k
    − (τ_k/2) ( 2 D_Y b_τ + Σ_j ( Σ_n r_nk y_nj² − 2 w_kj^T Σ_n r_nk x_n y_nj
      + w_kj^T ( E_α(α_k) I + Σ_n r_nk x_n x_n^T ) w_kj ) ) + const.
    = ln ( Π_j N(w_kj | w_kj^*, (τ_k Λ_k^*)^{−1}) Gam(τ_k | a_τk^*, b_τk^*) ),    (7.29)

with the distribution parameters

Λ_k^* = E_α(α_k) I + Σ_n r_nk x_n x_n^T,    (7.30)

w_kj^* = Λ_k^{*−1} Σ_n r_nk x_n y_nj,    (7.31)

a_τk^* = a_τ + (1/2) Σ_n r_nk,    (7.32)

b_τk^* = b_τ + (1/(2 D_Y)) Σ_j ( Σ_n r_nk y_nj² − w_kj^{*T} Λ_k^* w_kj^* ).    (7.33)

The second equality in Eq. (7.29) can be derived by expanding the final result and replacing all terms that are independent of W_k and τ_k by a constant. The distribution parameter update equations are those of a standard Bayesian weighted linear regression (for example, [19, 15, 71]).
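To make these updates concrete, the following minimal NumPy sketch evaluates Eqs. (7.30)–(7.33) for a single classifier, assuming the responsibilities r_nk and the current expectation E_α(α_k) are given; all function and variable names are illustrative and not taken from any existing implementation.

    import numpy as np

    def update_classifier(X, Y, r_k, E_alpha_k, a_tau, b_tau):
        """One update of q*_{W,tau}(W_k, tau_k) following Eqs. (7.30)-(7.33).

        X: (N, D_X) inputs, Y: (N, D_Y) outputs, r_k: (N,) responsibilities of
        classifier k, E_alpha_k: expectation of the weight precision prior.
        """
        D_Y = Y.shape[1]
        Xr = X * r_k[:, None]                                  # responsibility-weighted inputs
        Lambda_k = E_alpha_k * np.eye(X.shape[1]) + Xr.T @ X   # Eq. (7.30)
        Lambda_k_inv = np.linalg.inv(Lambda_k)
        W_k = (Lambda_k_inv @ (Xr.T @ Y)).T                    # Eq. (7.31); row j is w*_kj
        a_tau_k = a_tau + 0.5 * r_k.sum()                      # Eq. (7.32)
        b_tau_k = b_tau + (1.0 / (2 * D_Y)) * sum(             # Eq. (7.33)
            (r_k * Y[:, j] ** 2).sum() - W_k[j] @ Lambda_k @ W_k[j]
            for j in range(D_Y))
        return W_k, Lambda_k_inv, a_tau_k, b_tau_k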

Note that due to the use of conjugate priors, the variational posterior q_{W,τ}^*(W_k, τ_k) Eq. (7.29) has the same distribution form as the prior p(W_k, τ_k | α_k) Eq. (7.8). The resulting weight vector w_kj, that models the relation between the inputs and the jth component of the outputs, is given by a Gaussian with mean w_kj^* and precision τ_k Λ_k^*. The same posterior weight mean can be found by minimising

‖X w_kj − y_j‖²_{R_k} + E_α(α_k) ‖w_kj‖²,    (7.34)

with respect to w_kj, where R_k is the diagonal matrix R_k = diag(r_1k, …, r_Nk), and y_j is the vector of jth output elements, y_j = (y_1j, …, y_Nj)^T, that is, the jth column of Y. This shows that we are performing a responsibility-weighted ridge regression with ridge complexity E_α(α_k). Thus, the shrinkage is determined by the prior on α_k, as we can expect from our specification of the weight vector prior Eq. (7.8).

The noise precision posterior is the Gamma distribution Gam(τ_k | a_τk^*, b_τk^*). Using the relation νλ/χ²_ν ∼ Gam(ν/2, νλ/2), where νλ/χ²_ν is the scaled inverse χ² distribution with ν degrees of freedom, we can interpret Eq. (7.32) as incrementing the degrees of freedom from an initial 2a_τ by Σ_n r_nk. Thus, while the prior has the weight of 2a_τ observations, each added observation is weighted according to the responsibility that classifier k has for it. By using Eq. (7.30) and the relation

Σ_n r_nk (y_nj − w_kj^{*T} x_n)² = Σ_n r_nk y_nj² − 2 w_kj^{*T} Σ_n r_nk x_n y_nj + w_kj^{*T} ( Σ_n r_nk x_n x_n^T ) w_kj^*,

Eq. (7.33) can be reformulated to give

b_τk^* = b_τ + (1/(2 D_Y)) ( Σ_n r_nk ‖y_n − W_k^* x_n‖² + E_α(α_k) Σ_j ‖w_kj^*‖² ).    (7.35)

This shows that bτ is updated by the responsibility-weighted sum of squared prediction errors, averaged over the different elements of the output vector,

and the average size of the wkj’s, weighted by the expectation of the weight precision prior. Considering that E(Gam(a, b)) = a/b [19], the mean of the noise variance posterior is therefore strongly influenced by the responsibility- weighted averaged squared prediction error, given a sufficiently uninforma- tive prior.

7.3.3 Classifier Weight Priors q_α^*(α)

As by Eq. (7.17), p(α) factorises with respect to k, we can treat the variational posterior q_α^* for each classifier separately. For classifier k, this posterior is according to Eqs. (7.15), (7.16), (7.17) and (7.24) given by

ln q_α^*(α_k) = E_{W,τ}( ln p(W_k, τ_k | α_k) ) + ln p(α_k) + const.    (7.36)

Using Eq. (7.8), the expectation of weights and noise precision evaluates to

E_{W,τ}( ln p(W_k, τ_k | α_k) )
    = Σ_j E_{W,τ}( ln N(w_kj | 0, (α_k τ_k)^{−1} I) + ln Gam(τ_k | a_τ, b_τ) )
    = Σ_j ( (D_X/2) ln α_k − (α_k/2) E_{W,τ}(τ_k w_kj^T w_kj) ) + const.    (7.37)

Also, by Eq. (7.9),

ln p(α_k) = (a_α − 1) ln α_k − b_α α_k + const.    (7.38)

Together, that gives the variational posterior

ln q_α^*(α_k) = ( (D_X D_Y)/2 + a_α − 1 ) ln α_k − ( b_α + (1/2) Σ_j E_{W,τ}(τ_k w_kj^T w_kj) ) α_k + const.
    = ln Gam(α_k | a_αk^*, b_αk^*),    (7.39)

with

a_αk^* = a_α + (D_X D_Y)/2,    (7.40)

b_αk^* = b_α + (1/2) Σ_j E_{W,τ}(τ_k w_kj^T w_kj).    (7.41)

Utilising again the relation between the Gamma distribution and the scaled inverse χ² distribution, Eq. (7.40) increments the initial 2a_α degrees of freedom by D_X D_Y, which is the number of elements in W_k.
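As a small illustration, these two updates translate directly into code; the required moment E_{W,τ}(τ_k w_kj^T w_kj) is the one evaluated later in Section 7.3.7, Eq. (7.72), and all names below are illustrative assumptions rather than part of an existing implementation.

    import numpy as np

    def update_alpha(W_k, Lambda_k_inv, E_tau_k, a_alpha, b_alpha):
        """Update of q*_alpha(alpha_k), Eqs. (7.40)-(7.41).

        W_k: (D_Y, D_X) posterior weight means, Lambda_k_inv: (D_X, D_X) weight
        covariance, E_tau_k: expectation of the noise precision posterior.
        """
        D_Y, D_X = W_k.shape
        a_alpha_k = a_alpha + 0.5 * D_X * D_Y
        # sum_j E(tau_k w_kj^T w_kj) = E(tau_k) sum_j w*_kj^T w*_kj + D_Y Tr(Lambda_k^{-1})
        b_alpha_k = b_alpha + 0.5 * (E_tau_k * (W_k ** 2).sum()
                                     + D_Y * np.trace(Lambda_k_inv))
        return a_alpha_k, b_alpha_k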

The posterior mean of α_k is E(α_k) = a_αk^*/b_αk^* and thus is inversely proportional to the size of the weight vectors ‖w_kj‖² = w_kj^T w_kj and the noise precision τ_k. As the element-wise variance in the weight vector prior Eq. (7.8) is given by (α_k τ_k)^{−1}, the effect of τ_k on that prior is diminished. Thus, the weight vector prior variance is proportional to the expected size of the weight vectors, which has the effect of spreading the weight vector prior if the weight vector is expected to be large, effectively reducing the shrinkage. Intuitively, this is a sensible thing to do, as one should refrain from using an overly strong shrinkage prior if the weight vector is expected to have large elements.

7.3.4 Mixing Model q_V^*(V)

We get the variational posterior q_V^*(V) on the mixing model parameters by solving Eq. (7.24) with Eq. (7.15), that is

ln q_V^*(V) = E_Z( ln p(Z | V) ) + E_β( ln p(V | β) ) + const.    (7.42)

Even though q_V^* factorises with respect to k, we will solve it for all classifiers simultaneously due to the Laplace approximation that we apply thereafter.

Evaluating the expectations by using Eqs. (7.12), (7.13) and (7.19) we get

E_Z( ln p(Z | V) ) = Σ_n Σ_k r_nk ln g_k(x_n),    (7.43)

E_β( ln p(V | β) ) = Σ_k E_β( ln N(v_k | 0, β_k^{−1} I) ) = − Σ_k ( E_β(β_k)/2 ) v_k^T v_k + const.,    (7.44)

where we have again used r_nk ≡ E_Z(z_nk). Thus, the variational log-posterior evaluates to

ln q_V^*(V) = Σ_k ( − ( E_β(β_k)/2 ) v_k^T v_k + Σ_n r_nk ln g_k(x_n) ) + const.    (7.45)

Note that the distribution form of this posterior differs from its prior Eq. (7.13), which would cause problems in further derivations. Thus, we proceed the same way as Waterhouse et al. in [230, 229] by performing a Laplace approximation of the posterior.

The Laplace approximation aims at finding a Gaussian approximation to the posterior density, by centering the Gaussian on the mode of the density and deriving its covariance by a second-order Taylor expansion of the posterior [19]. The mode of the posterior is found by solving

∂ ln q_V^*(V) / ∂V = 0,    (7.46)

which, by using the posterior Eq. (7.45) and the definition of g_k Eq. (7.10), results in

Σ_n ( r_nk − g_k(x_n) ) φ(x_n) − E_β(β_k) v_k = 0,   k = 1, …, K.    (7.47)

Note that, besides the addition of the E_β(β_k) v_k term due to the shrinkage prior on v_k, the minimum we seek is equivalent to the one of the prior-less generalised softmax function, given by Eq. (6.11). Therefore, we can find this minimum by applying the IRLS algorithm Eq. (6.5) with error function E(V) = −ln q_V^*(V), where the required gradient vector and the D_V × D_V blocks H_kj of the Hessian matrix Eq. (6.9) are given by

∇E(V) = ( ∇_{v_1} E(V), …, ∇_{v_K} E(V) ),   ∇_{v_j} E(V) = Σ_n ( g_j(x_n) − r_nj ) φ(x_n) + E_β(β_j) v_j,    (7.48)

and

H_kj = H_jk = Σ_n g_k(x_n) ( I_kj − g_j(x_n) ) φ(x_n) φ(x_n)^T + I_kj E_β(β_k) I.    (7.49)

I_kj is the kjth element of the identity matrix, and the second I in the above expression is an identity matrix of size D_V × D_V. As the resulting Hessian is positive definite [175], the posterior density is concave and has a unique maximum. We will provide more detail on how to implement the IRLS algorithm in the next chapter.
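For illustration, the sketch below assembles the gradient Eq. (7.48) and the Hessian blocks Eq. (7.49) and performs one Newton (IRLS) step on the mixing weights. The softmax helper assumes the matching-weighted form of the generalised softmax of Eq. (7.10); all names are illustrative, and this is a sketch rather than the implementation used in this thesis.

    import numpy as np

    def gen_softmax(Phi, V, M):
        """Assumed form of Eq. (7.10): matching-weighted softmax over classifiers."""
        A = Phi @ V
        A -= A.max(axis=1, keepdims=True)      # numerical stabilisation
        G = M * np.exp(A)
        return G / G.sum(axis=1, keepdims=True)

    def irls_step(Phi, M, R, V, E_beta):
        """One Newton step for the mixing weights V (D_V x K), Eqs. (7.48)-(7.49)."""
        N, D_V = Phi.shape
        K = R.shape[1]
        G = gen_softmax(Phi, V, M)
        # gradient: K stacked blocks of length D_V
        grad = np.concatenate(
            [Phi.T @ (G[:, j] - R[:, j]) + E_beta[j] * V[:, j] for j in range(K)])
        # Hessian: (K D_V) x (K D_V), built from the D_V x D_V blocks H_kj
        H = np.zeros((K * D_V, K * D_V))
        for k in range(K):
            for j in range(K):
                w = G[:, k] * ((1.0 if k == j else 0.0) - G[:, j])
                H_kj = (Phi * w[:, None]).T @ Phi
                if k == j:
                    H_kj += E_beta[k] * np.eye(D_V)
                H[k*D_V:(k+1)*D_V, j*D_V:(j+1)*D_V] = H_kj
        v_new = V.T.reshape(-1) - np.linalg.solve(H, grad)   # Newton update
        return v_new.reshape(K, D_V).T, H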

Let V^* with components v_k^* denote the parameters that maximise Eq. (7.45). V^* gives the mode of the posterior density, and thus the mean vector of its Gaussian approximation. As the logarithm of a Gaussian distribution is a quadratic function of the variables, this quadratic form can be recovered by a second-order Taylor expansion of ln q_V^*(V) [19], which results in the precision matrix

Λ_V^* = −∇∇ ln q_V^*(V) = ∇∇ E(V) = H |_{V=V^*},    (7.50)

where H is the Hessian matrix of E(V) as used in the IRLS algorithm. Overall, the Laplace approximation to the posterior q_V^*(V) is given by the multivariate Gaussian

q_V^*(V) ≈ N(V | V^*, Λ_V^{*−1}),    (7.51)

where V^* is the solution to Eq. (7.47), and Λ_V^* is the Hessian matrix evaluated at V^*.

7.3.5 Mixing Weight Priors q_β^*(β)

By Eq. (7.19), p(β) factorises with respect to k, and thus allows us to find q_β^*(β_k) for each classifier separately, which, by Eqs. (7.15), (7.18) and (7.24), requires us to evaluate

ln q_β^*(β_k) = E_V( ln p(v_k | β_k) ) + ln p(β_k).    (7.52)

Using Eqs. (7.13) and (7.14), the expectation and log-density are given by

E_V( ln p(v_k | β_k) ) = (D_V/2) ln β_k − (β_k/2) E_V(v_k^T v_k) + const.,    (7.53)

ln p(β_k) = (a_β − 1) ln β_k − β_k b_β + const.    (7.54)

Combining the above, we get the variational posterior

ln q_β^*(β_k) = ( a_β − 1 + D_V/2 ) ln β_k − ( b_β + (1/2) E_V(v_k^T v_k) ) β_k + const.
    = ln Gam(β_k | a_βk^*, b_βk^*),    (7.55)

with the distribution parameters

a_βk^* = a_β + D_V/2,    (7.56)

b_βk^* = b_β + (1/2) E_V(v_k^T v_k).    (7.57)

As the priors on v_k are similar to the ones on w_k, they cause the same effect: as b_βk^* increases proportionally to the expected size ‖v_k‖², the expectation of the posterior E_β(β_k) = a_βk^*/b_βk^* decreases in proportion to it. This expectation determines the shrinkage on v_k (see Eq. (7.47)), and thus, the strength of the shrinkage prior is reduced if v_k is expected to have large elements, which is an intuitively sensible procedure.

7.3.6 Latent Variables q_Z^*(Z)

To get the variational posterior over the latent variables Z we need to evaluate Eq. (7.24) by the use of Eq. (7.15), that is,

ln q_Z^*(Z) = E_{W,τ}( ln p(Y | W, τ, Z) ) + E_V( ln p(Z | V) ) + const.    (7.58)

We can evaluate the first expectation by combining Eqs. (7.6) and (7.7), to get

E_{W,τ}( ln p(Y | W, τ, Z) )
    = Σ_n Σ_k Σ_j z_nk E_{W,τ}( ln N(y_nj | w_kj^T x_n, τ_k^{−1}) )
    = − (1/2) Σ_n Σ_k Σ_j z_nk ln 2π + (1/2) Σ_n Σ_k Σ_j z_nk E_τ(ln τ_k)
      − (1/2) Σ_n Σ_k Σ_j z_nk E_{W,τ}( τ_k (y_nj − w_kj^T x_n)² )
    = (D_Y/2) Σ_n Σ_k z_nk E_τ(ln τ_k)
      − (1/2) Σ_n Σ_k Σ_j z_nk E_{W,τ}( τ_k (y_nj − w_kj^T x_n)² ) + const.,    (7.59)

where we have used Σ_k z_nk = 1. Using Eqs. (7.12) and (7.11), the second expectation results in

E_V( ln p(Z | V) ) = Σ_n Σ_k z_nk E_V( ln g_k(x_n) ) ≈ Σ_n Σ_k z_nk ln g_k(x_n)|_{v_k=v_k^*},    (7.60)

where we have approximated the expectation of ln g_k(x_n) by the logarithm of its maximum a-posteriori estimate, that is, ln g_k(x_n) evaluated at v_k = v_k^*. This approximation was applied as a direct evaluation of the expectation does not yield a closed-form solution. The same approximation was applied in [230, 229] for the MoE model.

Combining the above expectations results in the posterior

ln q_Z^*(Z) = Σ_n Σ_k z_nk ln ρ_nk + const.,    (7.61)

with

ln ρ_nk = ln g_k(x_n)|_{v_k=v_k^*} + (D_Y/2) E_τ(ln τ_k) − (1/2) Σ_j E_{W,τ}( τ_k (y_nj − w_kj^T x_n)² ).    (7.62)

Without the logarithm, the posterior becomes q_Z^*(Z) ∝ Π_n Π_k ρ_nk^{z_nk}, and thus, under the constraint Σ_k z_nk = 1, we get

q_Z^*(Z) = Π_n Π_k r_nk^{z_nk},   with   r_nk = ρ_nk / Σ_j ρ_nj = E_Z(z_nk).    (7.63)

As for all posteriors, the variational posterior for the latent variables has the same distribution form as its prior Eq. (7.12).

Note that r_nk gives the responsibility that is assigned to classifier k for modelling observation n, and is proportional to ρ_nk Eq. (7.62). Thus, the responsibilities are on one hand proportional to the current mixing weights g_k(x_n), and on the other hand are higher for low-variance classifiers (note that τ_k is the inverse variance of classifier k) that feature a low expected squared prediction error (y_nj − w_kj^T x_n)² for the associated observation. Overall, the responsibilities are distributed such that the observations are modelled by the classifiers that are best at modelling them.

7.3.7 Required Moments of the Variational Posterior

Some of the variational distribution parameters require evaluation of the moments of one or the other random variable in our probabilistic model. In this section, we evaluate these moments, and also provide other moments of the variational distribution that are required at a later stage. Throughout this section we will use E_x(x) = x^* and cov_x(x, x) = Λ^{−1}, where x ∼ N(x^*, Λ^{−1}) is a random vector that is distributed according to a multivariate Gaussian with mean x^* and covariance matrix Λ^{−1}.

Given that we have a random variable X ∼ Gam(a, b), then its expectation is E_X(X) = a/b, and the expectation of its logarithm is E_X(ln X) = ψ(a) − ln b, where ψ(x) = d/dx ln Γ(x) is the digamma function [19]. Thus we get the following posterior moments for q_α^*(α_k), q_β^*(β_k), and q_τ^*(τ_k):

E_α(α_k) = a_αk^*/b_αk^*,    (7.64)
E_α(ln α_k) = ψ(a_αk^*) − ln b_αk^*,    (7.65)
E_β(β_k) = a_βk^*/b_βk^*,    (7.66)
E_β(ln β_k) = ψ(a_βk^*) − ln b_βk^*,    (7.67)
E_τ(τ_k) = a_τk^*/b_τk^*,    (7.68)
E_τ(ln τ_k) = ψ(a_τk^*) − ln b_τk^*.    (7.69)
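These moments map directly onto standard library functions; as a small illustrative sketch (the names are assumptions, not an existing API of this work):

    import numpy as np
    from scipy.special import digamma

    def gamma_moments(a, b):
        """Mean and log-expectation of X ~ Gam(a, b), as used in Eqs. (7.64)-(7.69)."""
        return a / b, digamma(a) - np.log(b)

    # example: moments of the noise precision posterior Gam(tau_k | a*_tau_k, b*_tau_k)
    E_tau_k, E_ln_tau_k = gamma_moments(12.3, 4.56)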

To get the moments of q_{W,τ}^*(W_k, τ_k) and q_V^*(v_k), we can use var(X) = E(X²) − E(X)², and thus, E(X²) = var(X) + E(X)², to get

E(x^T x) = Σ_i E(x_i²) = Σ_i var(x_i) + Σ_i E(x_i)² = Tr( cov(x, x) ) + E(x)^T E(x),

and similarly, E(x x^T) = cov(x, x) + E(x) E(x)^T,

where X is a random variable, and x = (x_i) is a random vector. Hence, as by Eq. (7.51), q_V^*(V) is a multivariate Gaussian with covariance matrix Λ_V^{*−1}, we get

E_V(v_k^T v_k) = Tr( (Λ_V^{*−1})_kk ) + v_k^{*T} v_k^*,    (7.70)

where (Λ_V^{*−1})_kk denotes the kth D_V × D_V block element along the diagonal of Λ_V^{*−1}.

Getting the moments of q_{W,τ}^*(W_k, τ_k) requires a bit more work. Let us first consider E_{W,τ}(τ_k w_kj), which by Eq. (7.29) and the previously evaluated moments gives

E_{W,τ}(τ_k w_kj)
    = ∫ τ_k Gam(τ_k | a_τk^*, b_τk^*) ( ∫ w_kj N(w_kj | w_kj^*, (τ_k Λ_k^*)^{−1}) dw_kj ) dτ_k
    = w_kj^* ∫ τ_k Gam(τ_k | a_τk^*, b_τk^*) dτ_k
    = (a_τk^*/b_τk^*) w_kj^*.    (7.71)

For E_{W,τ}(τ_k w_kj^T w_kj) we get

E_{W,τ}(τ_k w_kj^T w_kj)
    = ∫ τ_k Gam(τ_k | a_τk^*, b_τk^*) ( ∫ w_kj^T w_kj N(w_kj | w_kj^*, (τ_k Λ_k^*)^{−1}) dw_kj ) dτ_k
    = ∫ τ_k Gam(τ_k | a_τk^*, b_τk^*) E_W(w_kj^T w_kj) dτ_k
    = w_kj^{*T} w_kj^* E_τ(τ_k) + Tr(Λ_k^{*−1})
    = (a_τk^*/b_τk^*) w_kj^{*T} w_kj^* + Tr(Λ_k^{*−1}).    (7.72)

E_{W,τ}(τ_k w_kj w_kj^T) can be derived in a similar way, and results in

E_{W,τ}(τ_k w_kj w_kj^T) = (a_τk^*/b_τk^*) w_kj^* w_kj^{*T} + Λ_k^{*−1}.    (7.73)

The last required moment is E_{W,τ}( τ_k (y_nj − w_kj^T x_n)² ), which we get by binomial expansion and substituting the previously evaluated moments, to get

E_{W,τ}( τ_k (y_nj − w_kj^T x_n)² )
    = E_τ(τ_k) y_nj² − 2 E_{W,τ}(τ_k w_kj)^T x_n y_nj + x_n^T E_{W,τ}(τ_k w_kj w_kj^T) x_n
    = (a_τk^*/b_τk^*) (y_nj − w_kj^{*T} x_n)² + x_n^T Λ_k^{*−1} x_n.    (7.74)

Now we have all the required expressions to compute the parameters of the variational posterior density.
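Putting Eqs. (7.62), (7.63), (7.69) and (7.74) together, the responsibilities of q_Z^*(Z) can be computed as in the following sketch; the names are illustrative, and G holds the mixing weights g_k(x_n) evaluated at v_k = v_k^*.

    import numpy as np
    from scipy.special import digamma

    def responsibilities(X, Y, G, W, Lambda_inv, a_tau, b_tau):
        """Variational posterior q*_Z(Z) via Eqs. (7.62)-(7.63)."""
        N, K = G.shape
        D_Y = Y.shape[1]
        ln_rho = np.empty((N, K))
        for k in range(K):
            err = Y - X @ W[k].T                             # (N, D_Y) prediction errors
            quad = (a_tau[k] / b_tau[k]) * (err ** 2).sum(axis=1) \
                 + D_Y * np.einsum('ni,ij,nj->n', X, Lambda_inv[k], X)   # Eq. (7.74)
            ln_rho[:, k] = np.log(G[:, k]) \
                         + 0.5 * D_Y * (digamma(a_tau[k]) - np.log(b_tau[k])) \
                         - 0.5 * quad                        # Eq. (7.62)
        ln_rho -= ln_rho.max(axis=1, keepdims=True)          # Eq. (7.63), stabilised
        R = np.exp(ln_rho)
        return R / R.sum(axis=1, keepdims=True)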

7.3.8 The Variational Bound L(q)

We are most interested in finding the value for L(q) by Eq. (7.21), as it provides us with an approximated lower bound on the logarithm of the model evidence ln p(Y), and is the actual expression that we want to maximise. Evaluating Eq. (7.21) by using the distribution decomposition according to Eq. (7.15), the variational bound is given by

L(q) = ∫ q(U) ln ( p(Y, U) / q(U) ) dU
    = E_{W,τ,α,Z,V,β}( ln p(Y, W, τ, α, Z, V, β) ) − E_{W,τ,α,Z,V,β}( ln q(W, τ, α, Z, V, β) )
    = E_{W,τ,Z}( ln p(Y | W, τ, Z) ) + E_{W,τ,α}( ln p(W, τ | α) ) + E_α( ln p(α) )
      + E_{Z,V}( ln p(Z | V) ) + E_{V,β}( ln p(V | β) ) + E_β( ln p(β) )
      − E_{W,τ}( ln q(W, τ) ) − E_α( ln q(α) ) − E_Z( ln q(Z) )
      − E_V( ln q(V) ) − E_β( ln q(β) ),    (7.75)

where all expectations are taken with respect to the variational distribution q. We proceed by evaluating the expectations one by one, using the previously derived moments of the variational posteriors.

To derive E_{W,τ,Z}( ln p(Y | W, τ, Z) ), we use Eqs. (7.6) and (7.7) to get

E_{W,τ,Z}( ln p(Y | W, τ, Z) )
    = Σ_n Σ_k Σ_j E_Z(z_nk) E_{W,τ}( ln N(y_nj | w_kj^T x_n, τ_k^{−1}) )
    = Σ_n Σ_k Σ_j r_nk ( (1/2) E_τ(ln τ_k) − (1/2) ln 2π − (1/2) E_{W,τ}( τ_k (y_nj − w_kj^T x_n)² ) )
    = Σ_k ( (D_Y/2) ( ψ(a_τk^*) − ln b_τk^* − ln 2π ) Σ_n r_nk
      − (1/2) Σ_n Σ_j r_nk ( (a_τk^*/b_τk^*) (y_nj − w_kj^{*T} x_n)² + x_n^T Λ_k^{*−1} x_n ) )
    = Σ_k ( (D_Y/2) ( ψ(a_τk^*) − ln b_τk^* − ln 2π ) Σ_n r_nk
      − (1/2) Σ_n r_nk ( (a_τk^*/b_τk^*) ‖y_n − W_k^* x_n‖² + D_Y x_n^T Λ_k^{*−1} x_n ) ).    (7.76)

The classifier model parameters expectation E_{W,τ,α}( ln p(W, τ | α) ) can be derived by using Eqs. (7.7) and (7.16), and is given by

E_{W,τ,α}( ln p(W, τ | α) ) = Σ_k Σ_j ( E_{W,τ,α}( ln N(w_kj | 0, (α_k τ_k)^{−1} I) ) + E_τ( ln Gam(τ_k | a_τ, b_τ) ) ).    (7.77)

Expanding for the densities and substituting the variational moments results in

E_{W,τ,α}( ln p(W, τ | α) )
    = Σ_k ( (D_X D_Y/2) ( ψ(a_αk^*) − ln b_αk^* + ψ(a_τk^*) − ln b_τk^* − ln 2π )
      − (1/2) (a_αk^*/b_αk^*) ( (a_τk^*/b_τk^*) Σ_j w_kj^{*T} w_kj^* + D_Y Tr(Λ_k^{*−1}) )
      + D_Y ( −ln Γ(a_τ) + a_τ ln b_τ + (a_τ − 1)( ψ(a_τk^*) − ln b_τk^* ) − b_τ (a_τk^*/b_τk^*) ) ).    (7.78)

We derive the expression E_α( ln p(α) ) − E_α( ln q(α) ) in combination, as that allows for some simplification. Starting with E_α( ln p(α) ), we get from Eqs. (7.17) and (7.9), by expanding the densities and substituting the variational moments,

E_α( ln p(α) ) = Σ_k ( −ln Γ(a_α) + a_α ln b_α + (a_α − 1)( ψ(a_αk^*) − ln b_αk^* ) − b_α (a_αk^*/b_αk^*) ).    (7.79)

The expression for E_α( ln q(α) ) can be derived by observing that −E_α( ln q(α_k) ) is the entropy of q_α^*(α_k). Thus, using q_α^*(α) = Π_k q_α^*(α_k), substituting Eq. (7.39) for q_α^*(α_k), and applying the entropy of the Gamma distribution as given in [19], we get

E_α( ln q(α) ) = − Σ_k ( ln Γ(a_αk^*) − (a_αk^* − 1) ψ(a_αk^*) − ln b_αk^* + a_αk^* ).    (7.80)

187 sults in

E (ln p(α)) E (ln q(α)) = lnΓ(a )+ a ln b +(a a∗ )ψ(a∗ ) α − α − α α α α − αk αk Xk  a∗ a ln b∗ b αk + lnΓ(a∗ )+ a∗ . α αk α ∗ αk αk (7.81) − − bα k 

The expression E (ln p(Z V )) E (ln q(Z)) is also derived in combination Z,V | − Z by using Eqs. (7.12), (7.11) and (7.63), from which we get

gk(x) v =v∗ E (ln p(Z V )) E (ln q(Z)) = r ln | k k , (7.82) Z,V | − Z nk r n nk X Xk where we have, as previously, approximated EV (ln gk(xn)) by ln gk(xn) v =v∗ . | k k

The derivation to get E_{V,β}( ln p(V | β) ) is again based on simple expansion of the distribution given by Eqs. (7.18) and (7.13), and substituting the variational moments, which results in

E_{V,β}( ln p(V | β) ) = Σ_k ( (D_V/2) ( ψ(a_βk^*) − ln b_βk^* − ln 2π ) − (1/2) (a_βk^*/b_βk^*) ( v_k^{*T} v_k^* + Tr( (Λ_V^{*−1})_kk ) ) ).    (7.83)

We get EV (ln q(V )) by observing that it is the negative entropy of the Gaussian Eq. (7.51), and thus evaluates, as given in [19], to

E_V( ln q(V) ) = − ( (1/2) ln |Λ_V^{*−1}| + (K D_V/2)(1 + ln 2π) ).    (7.84)

As the priors on βk are of the same distribution form as the ones on αk, the expectations of their log-density results in a similar expression as Eq. (7.65) and is given by

E_β( ln p(β) ) − E_β( ln q(β) ) = Σ_k ( −ln Γ(a_β) + a_β ln b_β + (a_β − a_βk^*) ψ(a_βk^*)
    − a_β ln b_βk^* − b_β (a_βk^*/b_βk^*) + ln Γ(a_βk^*) + a_βk^* ).    (7.85)

This completes the evaluation of the expectations required to compute the variational bound Eq. (7.75).

To simplify the computation of the variational bound, we define

L_k(q) = E_{W,τ,Z}( ln p(Y | W_k, τ_k, z_k) ) + E_{W,τ,α}( ln p(W_k, τ_k | α_k) )
    + E_α( ln p(α_k) ) − E_{W,τ}( ln q(W_k, τ_k) ) − E_α( ln q(α_k) ),    (7.86)

which can be evaluated separately for each classifier by observing that all expectations except for E_V( ln q(V) ) are sums whose components can be evaluated independently for each classifier. Furthermore, L_k(q) can be simplified by using the relations

(D_X D_Y)/2 = a_αk^* − a_α,    (7.87)

(1/2) ( (a_τk^*/b_τk^*) Σ_j w_kj^{*T} w_kj^* + D_Y Tr(Λ_k^{*−1}) ) = b_αk^* − b_α,    (7.88)

which result from Eqs. (7.40) and (7.41). Thus, the final, simplified expression for L_k(q) becomes

L_k(q) = (D_Y/2) ( ψ(a_τk^*) − ln b_τk^* − ln 2π ) Σ_n r_nk
    − (1/2) Σ_n r_nk ( (a_τk^*/b_τk^*) ‖y_n − W_k^* x_n‖² + D_Y x_n^T Λ_k^{*−1} x_n )
    − ln Γ(a_α) + a_α ln b_α + ln Γ(a_αk^*) − a_αk^* ln b_αk^* + (D_X D_Y)/2 + (D_Y/2) ln |Λ_k^{*−1}|
    + D_Y ( −ln Γ(a_τ) + a_τ ln b_τ + (a_τ − a_τk^*) ψ(a_τk^*) − a_τ ln b_τk^* − b_τ (a_τk^*/b_τk^*)
    + ln Γ(a_τk^*) + a_τk^* ).    (7.89)
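The following sketch evaluates L_k(q) as written in Eq. (7.89); the argument names are illustrative and correspond to the variational parameters derived above.

    import numpy as np
    from scipy.special import digamma, gammaln

    def var_bound_classifier(X, Y, r_k, W_k, Lambda_k_inv, a_tau_k, b_tau_k,
                             a_alpha_k, b_alpha_k, a_alpha, b_alpha, a_tau, b_tau):
        """Classifier component L_k(q) of the variational bound, Eq. (7.89)."""
        D_X = X.shape[1]
        D_Y = Y.shape[1]
        E_ln_tau = digamma(a_tau_k) - np.log(b_tau_k)
        err = Y - X @ W_k.T
        quad = (a_tau_k / b_tau_k) * (err ** 2).sum(axis=1) \
             + D_Y * np.einsum('ni,ij,nj->n', X, Lambda_k_inv, X)
        return (0.5 * D_Y * (E_ln_tau - np.log(2 * np.pi)) * r_k.sum()
                - 0.5 * (r_k * quad).sum()
                - gammaln(a_alpha) + a_alpha * np.log(b_alpha)
                + gammaln(a_alpha_k) - a_alpha_k * np.log(b_alpha_k)
                + 0.5 * D_X * D_Y + 0.5 * D_Y * np.linalg.slogdet(Lambda_k_inv)[1]
                + D_Y * (-gammaln(a_tau) + a_tau * np.log(b_tau)
                         + (a_tau - a_tau_k) * digamma(a_tau_k)
                         - a_tau * np.log(b_tau_k) - b_tau * (a_tau_k / b_tau_k)
                         + gammaln(a_tau_k) + a_tau_k))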

All leftover terms from Eq. (7.75) are assigned to the mixing model, and are given by

L_M(q) = E_{Z,V}( ln p(Z | V) ) + E_{V,β}( ln p(V | β) ) + E_β( ln p(β) )
    − E_Z( ln q(Z) ) − E_V( ln q(V) ) − E_β( ln q(β) ).    (7.90)

We can again derive a simplified expression for L_M(q) by using the relations

D_V/2 = a_βk^* − a_β,    (7.91)

(1/2) ( Tr( (Λ_V^{*−1})_kk ) + v_k^{*T} v_k^* ) = b_βk^* − b_β,    (7.92)

which result from Eqs. (7.56) and (7.57). Overall, this leads to the final simplified expression

L_M(q) = Σ_k ( −ln Γ(a_β) + a_β ln b_β + ln Γ(a_βk^*) − a_βk^* ln b_βk^* )
    + Σ_n Σ_k r_nk ( ln g_k(x_n)|_{v_k=v_k^*} − ln r_nk ) + (1/2) ln |Λ_V^{*−1}| + (K D_V)/2.    (7.93)

To get the variational bound of the whole model structure, and with it the lower bound on the logarithm of the model evidence ln p(Y), we need to compute

L(q) = L_M(q) + Σ_k L_k(q),    (7.94)

where L_k(q) and L_M(q) are given by Eqs. (7.89) and (7.93) respectively.
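A corresponding sketch for the mixing component Eq. (7.93) and the overall bound Eq. (7.94); as before, the names are illustrative, and the per-classifier values L_ks are assumed to come from the previous sketch.

    import numpy as np
    from scipy.special import gammaln

    def var_bound_mixing(R, G, Lambda_V_inv, a_beta_k, b_beta_k, a_beta, b_beta):
        """Mixing component L_M(q) of the variational bound, Eq. (7.93)."""
        K = R.shape[1]
        D_V = Lambda_V_inv.shape[0] // K
        prior_terms = (-gammaln(a_beta) + a_beta * np.log(b_beta)
                       + gammaln(a_beta_k) - a_beta_k * np.log(b_beta_k)).sum()
        # responsibility terms; 0 ln 0 is treated as 0 for unmatched observations
        with np.errstate(divide='ignore', invalid='ignore'):
            resp_terms = np.nansum(R * (np.log(G) - np.log(R)))
        logdet = np.linalg.slogdet(Lambda_V_inv)[1]
        return prior_terms + resp_terms + 0.5 * logdet + 0.5 * K * D_V

    def var_bound(L_ks, L_M):
        """Overall bound L(q) of Eq. (7.94)."""
        return L_M + sum(L_ks)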

Training the model means maximising L(q) Eq. (7.94) with respect to its parameters {W_k^*, Λ_k^*, a_τk^*, b_τk^*, a_αk^*, b_αk^*, V^*, Λ_V^*, a_βk^*, b_βk^*}. In fact, deriving the maximum of L(q) with respect to each of these parameters separately while keeping the others constant results in the variational update equations that we have derived in the previous sections [19].

7.3.9 Independent Classifier Training

As we can see from Eq. (7.89), we need to know the responsibilities {r_nk} to train each of the classifiers. The mixing model, on the other hand, relies on the goodness-of-fit of the classifiers, as embedded in g_k in Eq. (7.93). Therefore, classifiers and mixing model need to be trained in combination to maximise Eq. (7.94). Taking this approach, however, introduces local optima in the training process, as already discussed for the non-Bayesian MoE model in Section 4.1.5. Such local optima make evaluating the model evidence for a single model structure too costly to perform efficient model structure search, and so we need to modify the training process to remove these local optima. We will proceed the same way as in Section 4.4: we train the classifiers independently of the mixing model.

More specifically, the classifiers are fully trained on all observations that they match, independently of other classifiers, and then combined by the mixing

model. Formally, this is achieved by replacing the responsibilities rnk by the matching functions mk(xn).

The only required modification to the variational update equations is to change the classifier model updates from Eqs (7.30) – (7.33) to

Λ_k^* = E_α(α_k) I + Σ_n m_k(x_n) x_n x_n^T,    (7.95)

w_kj^* = Λ_k^{*−1} Σ_n m_k(x_n) x_n y_nj,    (7.96)

a_τk^* = a_τ + (1/2) Σ_n m_k(x_n),    (7.97)

b_τk^* = b_τ + (1/(2 D_Y)) Σ_j ( Σ_n m_k(x_n) y_nj² − w_kj^{*T} Λ_k^* w_kj^* ).    (7.98)

Thus, we are now effectively finding a w_kj that minimises

‖X w_kj − y_j‖²_{M_k} + E_α(α_k) ‖w_kj‖²,    (7.99)

as we have already discussed extensively in Section 5.3.5. The weight prior update Eqs. (7.40) and (7.41), as well as all mixing model update equations remain unchanged.

Even though we have replaced all r_nk's in the classifier update equations with m_k(x_n)'s, the classifier-specific component L_k(q) Eq. (7.89) remains unchanged. This is justified by observing that the responsibilities enter L_k(q) through the expectation E_{W,τ,Z}( ln p(Y | W, τ, Z) ), which is based on Eqs. (7.6) and (7.7). Note that Eq. (7.6) combines the classifier models to form a global model, and is thus conceptually part of the mixing model rather than the classifier model. Thus, the r_nk's in L_k(q) specify how classifier k contributes to the global model and remain unchanged.

Consequently, the variational posteriors for the classifiers only maximise the variational bound L(q) if we have r_nk = m_k(x_n) for all n, k. In all other cases, the variational bound remains below the one that we could achieve by training the classifiers according to their responsibilities. This effect is analogous to the reduced likelihood as discussed in Section 4.4.5. In cases where we only have one classifier per observation, we automatically have r_nk = m_k(x_n), and thus making classifier training independent only affects areas where several classifiers match the same input. Nonetheless, the model structure selection criterion is proportional to the value of the variational bound and therefore most likely prefers model structures that do not assign multiple classifiers to a single observation.

7.3.10 How to Get p(M | D) for Some M

Recall that rather than finding the model parameters θ for a fixed model structure, we want to find the model structure M that maximises p(M | D). However, we will see that the approach we have taken also requires us to train the model.

Variational Bayesian inference provides us with a lower bound on ln p(D | M) that is given by maximising the variational bound L(q). As we get p(M | D) from p(D | M) by Eq. (7.3), we can approximate p(M | D) for a given model structure M by maximising L(q). Using the assumptions of factorial distributions, L(q) is maximised with respect to a group of hidden variables while keeping the other ones fixed by computing Eq. (7.24). Therefore, by iteratively updating the distribution parameters of q_{W,τ}^*(W, τ), q_α^*(α), q_V^*(V), q_β^*(β), and q_Z^*(Z) in a sequential fashion, we monotonically increase the variational bound until we reach a maximum [26]. Independent classifier training simplifies this procedure by making the update of q_{W,τ}^*(W, τ) and q_α^*(α) independent of the update of the other variational densities. Thus, we first train the classifiers independently, and then update the mixing model parameters accordingly.

To summarise, finding p(M | D) for a given model structure can be done with the following steps:

1. Train the classifiers by iteratively updating the distribution parameters of q_{W,τ}^*(W, τ) and q_α^*(α) until convergence, for each classifier separately.

2. Train the mixing model by iteratively updating the distribution parameters of q_V^*(V), q_β^*(β), and q_Z^*(Z) until convergence.

3. Compute the variational bound L(q) by Eq. (7.94).

4. p(M | D) is then given by Eq. (7.3), where ln p(D | M) is replaced by its approximation L(q).

Appropriate convergence criteria are introduced in the next chapter.

7.4 Predictive Distribution

An additional bonus of the probabilistic basis we provide for LCS is that we are able to provide a predictive distribution rather than having to use simple point estimates. Hence, we can also provide information about the certainty of the prediction, which allows us to provide confidence intervals, rather than only its most likely value. In this section we derive the predictive density for the Bayesian LCS model.

The question we are answering is: in the light of all available data, how likely are certain output values for a new input? We approach this question formally by providing the predictive density p(y' | x', D) ≡ p(y' | x', X, Y), where x' is the new known input vector, and y' its associated unknown output vector, and all densities are, as before, implicitly conditional on the current model structure M.

7.4.1 Deriving p(y' | x', D)

We get an expression for p(y' | x', D) by using the relation

p(y' | x', X, Y)
    = Σ_{z'} ∫∫∫ p(y', z', W, τ, V | x', X, Y) dW dτ dV
    = Σ_{z'} ∫∫∫ p(y' | x', z', W, τ) p(z' | x', V) p(W, τ, V | X, Y) dW dτ dV
    = Σ_{z'} ∫∫∫ ( Π_k N(y' | W_k x', τ_k^{−1} I)^{z'_k} g_k(x')^{z'_k} ) p(W, τ, V | X, Y) dW dτ dV,    (7.100)

where z' is the latent variable associated with the observation (x', y'), and we have replaced p(y' | x', z', W, τ) by Eq. (7.6), and p(z' | x', V) by Eq. (7.11). As we do not know the real posterior p(W, τ, V | X, Y), we approximate it by the variational posterior, that is, p(W, τ, V | X, Y) ≈ q_{W,τ}^*(W, τ) q_V^*(V). Together with summing over all z', this results in

p(y' | x', X, Y) = Σ_k ( ∫ g_k(x') q_V^*(v_k) dv_k ) ( ∫∫ q_{W,τ}^*(W_k, τ_k) N(y' | W_k x', τ_k^{−1} I) dW_k dτ_k ),    (7.101)

where we have utilised the factorisation of q_V^*(V) and q_{W,τ}^*(W, τ) with respect to k, and the independence of the two variational densities.

The first integral ∫ g_k(x') q_V^*(v_k) dv_k is the expectation E_V(g_k(x')), which does not have an analytical solution. Thus, as in [219], we approximate it by the maximum a-posteriori estimate

∫ g_k(x') q_V^*(v_k) dv_k ≈ g_k(x')|_{v_k=v_k^*}.    (7.102)

The second integral ∫∫ q_{W,τ}^*(W_k, τ_k) N(y' | W_k x', τ_k^{−1} I) dW_k dτ_k is the expectation E_{W,τ}( N(y' | W_k x', τ_k^{−1} I) ), that, by using Eqs. (7.7) and (7.29), evaluates to

E_{W,τ}( N(y' | W_k x', τ_k^{−1} I) )
    = ∫∫ N(y' | W_k x', τ_k^{−1} I) q_{W|τ}^*(W_k | τ_k) q_τ^*(τ_k) dW_k dτ_k
    = ∫ ( Π_j ∫ N(y'_j | w_kj^T x', τ_k^{−1}) N(w_kj | w_kj^*, (τ_k Λ_k^*)^{−1}) dw_kj ) q_τ^*(τ_k) dτ_k
    = Π_j ∫ N(y'_j | w_kj^{*T} x', τ_k^{−1}(1 + x'^T Λ_k^{*−1} x')) Gam(τ_k | a_τk^*, b_τk^*) dτ_k
    = Π_j St( y'_j | w_kj^{*T} x', (1 + x'^T Λ_k^{*−1} x')^{−1} (a_τk^*/b_τk^*), 2 a_τk^* ),    (7.103)

where St(y'_j | w_kj^{*T} x', (1 + x'^T Λ_k^{*−1} x')^{−1} a_τk^*/b_τk^*, 2a_τk^*) is the Student's t distribution with mean w_kj^{*T} x', precision (1 + x'^T Λ_k^{*−1} x')^{−1} a_τk^*/b_τk^*, and 2a_τk^* degrees of freedom. To derive the above we have used the convolution of two Gaussians, given by

∫ N(y'_j | w_kj^T x', τ_k^{−1}) N(w_kj | w_kj^*, (τ_k Λ_k^*)^{−1}) dw_kj = N(y'_j | w_kj^{*T} x', τ_k^{−1}(1 + x'^T Λ_k^{*−1} x')),    (7.104)

and the convolution of a Gaussian with a Gamma distribution,

∫ N(y'_j | w_kj^{*T} x', τ_k^{−1}(1 + x'^T Λ_k^{*−1} x')) Gam(τ_k | a_τk^*, b_τk^*) dτ_k
    = St( y'_j | w_kj^{*T} x', (1 + x'^T Λ_k^{*−1} x')^{−1} (a_τk^*/b_τk^*), 2a_τk^* ),    (7.105)

both of which can be found in [19].

Combining Eqs. (7.101), (7.102), and (7.103) gives the final predictive density

p(y' | x', X, Y) = Σ_k g_k(x')|_{v_k=v_k^*} Π_j St( y'_j | w_kj^{*T} x', (1 + x'^T Λ_k^{*−1} x')^{−1} (a_τk^*/b_τk^*), 2a_τk^* ),    (7.106)

which is a mixture of Student's t distributions.

7.4.2 Mean and Variance

Given the predictive density, we derive point estimates by its mean, and get information about the prediction confidence by its variance. As the mixture of Student’s t distributions might be multi-modal, there exists no clear defini- tion for the 95% confidence intervals, but a mixture density-related study that deals with this problem can be found in [118]. Here, we take the variance as a sufficient indicator of the prediction’s confidence.

Let us first state the mean and variance for arbitrary mixture densities, and subsequently apply it to Eq. (7.106). Let {X_k} be a set of random variables that are mixed with mixing coefficients {g_k} to give X = Σ_k g_k X_k. As shown in [230], the mean and variance of X are given by

E(X) = Σ_k g_k E(X_k),   var(X) = Σ_k g_k ( var(X_k) + E(X_k)² ) − E(X)².    (7.107)

The Student's t distributions in Eq. (7.106) have mean w_kj^{*T} x' and variance (1 + x'^T Λ_k^{*−1} x') 2b_τk^*/(a_τk^* − 1). Therefore, the mean vector of the predictive density is

E(y' | x', X, Y) = Σ_k g_k(x')|_{v_k=v_k^*} W_k^* x',    (7.108)

and each element y'_j of y' has variance

var(y'_j | x', X, Y) = Σ_k g_k(x')|_{v_k=v_k^*} ( ( 2b_τk^*/(a_τk^* − 1) ) (1 + x'^T Λ_k^{*−1} x') + (w_kj^{*T} x')² ) − E(y' | x', X, Y)_j²,    (7.109)

where E(y' | x', X, Y)_j denotes the jth element of E(y' | x', X, Y).
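A minimal sketch of Eqs. (7.108) and (7.109) for a single new input x'; the names are illustrative, and the spread term 2b_τk^*/(a_τk^* − 1)(1 + x'^T Λ_k^{*−1} x') is taken directly from the text above.

    import numpy as np

    def predict(x, g, W, Lambda_inv, a_tau, b_tau):
        """Predictive mean and per-output variance, Eqs. (7.108)-(7.109).

        x: (D_X,) new input, g: (K,) mixing weights g_k(x') at v_k = v*_k,
        W: list of K (D_Y, D_X) weight matrices, Lambda_inv: list of K (D_X, D_X)
        covariances, a_tau, b_tau: (K,) noise precision posterior parameters.
        """
        K = len(W)
        D_Y = W[0].shape[0]
        mean = sum(g[k] * (W[k] @ x) for k in range(K))            # Eq. (7.108)
        var = np.zeros(D_Y)
        for k in range(K):
            spread = 2.0 * b_tau[k] / (a_tau[k] - 1.0) * (1.0 + x @ Lambda_inv[k] @ x)
            var += g[k] * (spread + (W[k] @ x) ** 2)
        var -= mean ** 2                                           # Eq. (7.109)
        return mean, var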

In the following chapter we will use these expressions to plot the mean predictions of the LCS model, and will use the variance to derive confidence intervals on these predictions.

7.5 Alternative Model Selection Methods

Bayesian model selection is not the only model selection criterion that might be applicable to LCS. In this section we review a set of alternatives and their relation to LCS.

As described in Section 7.1.2, model selection criteria might differ in their philosophical background, but they all result in the principle of minimising a combination of model error and model complexity. Their main difference lies in how they define the model complexity. Very crude approaches, like the two- part MDL, only consider the coarse model structure, whereas more refined cri- teria, like the refined MDL, SRM, and BYY, are based on the functional form of the model. However, they usually do not take the training data into consider- ation when evaluating the model complexity. Recent research has shown that approaches based on the training data, like cross-validation, Bayesian model selection, or Rademacher complexity, are usually better in approximating the target function [126].

7.5.1 Minimum Description Length

The principle of Minimum Description Length (MDL) [190, 191, 192] is based on the idea of Occam’s Razor, that amongst models that explain the data equally well, the simplest one is the one to prefer. MDL uses Kolmogorov complexity as a baseline to describe the complexity of the model, but as that is uncomputable, coding theory is used as an approximation to find minimum coding lengths that then represent the model complexity [100].

In its crudest form, the two-part MDL requires a binary representation of both the model error and the model itself, where the combined representation is to be minimised [190, 191]. Using such an approach for LCS makes its per- formance highly dependent on the representation used for the matching func- tions and the model parameters, and is therefore rather arbitrary. Its depen- dence on the chosen representation and the lack of guidelines on how to decide upon a particular representation are generally considered the biggest weak- ness of the two-part MDL [100].

A more refined approach is to use the Bayesian MDL [100] that, despite a different philosophical background, is mathematically identical to Bayesian model selection as applied here. In that sense, the approach presented in this chapter can be said to be using the Bayesian MDL model selection criterion.

The latest MDL approach is theoretically optimal as it minimises the worst- case coding length of the model. Mathematically, it is expressed as the maxi- mum likelihood normalised by the model complexity, where the model com- plexity is its coding length summed over all possible model parameter values [193]. Therefore, given continuous model parameters, as used here, the com- plexity is infinite, which makes model comparison impossible. In addition, the LCS structure makes computing the model complexity even for a finite set of parameters extremely complicated, which makes us doubt that, in its pure form, the latest MDL measure will be of any use for LCS.

7.5.2 Structural Risk Minimisation

Structural Risk Minimisation (SRM) is based on minimising an upper bound on the expected risk Eq. (3.1), given by the sum of the empirical risk Eq. (3.2) and a model complexity metric based on the functional form of the model [221]. The functional form of the model complexity enters SRM in the form of the model's Vapnik-Chervonenkis (VC) dimensions. Having the empirical risk and the VC dimensions of the model, we can find a model that minimises the expected risk.

The difficulty of the SRM approach when applied to LCS is to find the VC dimensions of the LCS model. For linear regression classifiers, the VC dimen-

sions are simply the dimensionality of the input space DX . Mixing these mod- els, however, introduces non-linearity that makes evaluation of the VC dimen- sions difficult. An additional weakness of SRM is that it deals with worst-case bounds that do apply to any distribution of the data, which causes the bound on the expected risk to be quite loose and reduces its usefulness for model selection [19].

A more powerful approach that provides us with a tighter bound to the ex- pected risk is to use data-dependent SRM. Such an approach has been applied

198 to the Mixtures-of-Expert model in [6, 5]. It still remains to be seen if this ap- proach can be generalised to the LCS model, such as we have done with the Bayesian MoE model to provide the Bayesian LCS model. If this is possible, data-dependent SRM might be a viable alternative for defining the optimal set of classifiers.

7.5.3 Bayesian Ying-Yang

Bayesian Ying Yang (BYY) defines a unified framework that lets one derive many statistics-based machine learning methods [246]. It describes the prob- ability distribution given by the data, and the one described by the model, and aims at finding models that are closest in distribution to the data. Using the Kullback-Leibler divergence as a distribution comparison metric results in maximum likelihood learning, and therefore will cause overfitting of the model. An alternative is Harmony Learning which is based on minimising the cross entropy between the data distribution and the model distribution, and prefers statistically simple distributions, that is, distributions of low entropy.

Even though it is very likely applicable to LCS as it has already been applied to the Mixtures-of-Expert model [245], there is no clear philosophical background that justifies the use of the cross entropy. Therefore, the Bayesian approach that we have introduced in this chapter seems to be a better alternative.

7.5.4 Training Data-based Approaches

It has been shown that penalising the model complexity based on some structural properties of the model alone cannot compete on all scales with data-based methods like cross validation [126]. Furthermore, using the training data rather than an independent test set gives even better results in minimising the expected risk [13]. Two examples of such complexity measures are the Rademacher complexity and the Gaussian complexity [14]. Both of them are defined as the expected error of the model when trying to fit the data perturbed by a sequence of either Rademacher random variables (uniform over {±1}) or Gaussian N(0, 1) random variables. Hence, they measure the model complexity by the model's ability to match a noisy sequence.

Using such methods in LCS would require training two models for the same model structure, where one is trained with the normal training data, and the other with the perturbed data. It is questionable if such additional space and computational effort justifies the application of the methods. Furthermore, us- ing sampling of random variables to find the model complexity makes it im- possible to find an analytical expression for the utility of the model and thus provides little insight in how a particular model structure is selected. Nonethe- less, it might still be of use as a benchmark method.

7.6 Discussion and Summary

In this chapter we have tackled the core question of LCS: what is the best set of classifiers that explains the given data? Rather than relying on intuition, we have approached the question formally by aiming to find the best model structure M that explains the given data D. More specifically, we have used the principles of Bayesian model selection to define the best set of classifiers to be the most likely one given the data, that is, the one that maximises p(M | D).

Computing this probability density requires a Bayesian LCS model that we have introduced by adding priors to the probabilistic model from Chapter 4. Additionally, we have increased the flexibility of the classifier models from univariate to multivariate regression. The requirement of specifying prior pa- rameters is not a weakness of this approach, but rather a strength, as the pri- ors make explicit the commonly implicit assumptions made about the data- generating process.

To find a closed-form solution to p(M | D) we have employed variational Bayesian inference and have used various approximations to handle the generalised softmax function that is used to combine the local classifier models to a global model. Whilst variational Bayesian inference usually provides us with a lower bound L(q) on ln p(D | M) that is directly related to p(M | D), these approximations invalidate the lower bound nature of L(q). Even without these approximations, the use of L(q) for selecting the best set of classifiers depends very much on the tightness of the bound, and on whether this tightness is consistent for different model structures M. Variational Bayesian inference has been shown to perform well in practice [219, 19], and the same approximations that we apply were successfully used for the Mixtures-of-Experts model in [229, 230]. Thus, we can also expect our method to feature good performance when applied to LCS, but sufficient empirical investigation is required before we can make more definite statements.

We have introduced the first formal and general definition of what it means for a set of classifiers to be optimal, using the best applicable of the currently known model selection approaches. The definition is general as i) it is independent of the representation of the matching function, ii) it can be used for both discrete and continuous input spaces, and iii) it can handle matching by degree. The reader is reminded that the definition itself is independent of the variational inference, and thus is not affected by the issues that are introduced through approximating the posterior. A further significant contribution that comes with the definition of optimality is a Bayesian model for LCS that goes beyond the probabilistic model as it makes the prior assumptions about the data-generating process explicit. Additionally, we provide for the first time classifier models that can perform multivariate regression rather than only univariate regression, as was the case in all previous LCS.

After this rather abstract introduction of the definition of the optimal classifier set and a method of computing the model probability, we continue by provid- ing a more concrete description of how it can be implemented, and demon- strate on the basis of a set of simple experiments that the optimality criterion is indeed able to identify good sets of classifiers.


Chapter 8

An Algorithmic Description

In the previous chapter we have provided a definition for the optimal set of classifiers given some data D, based on finding the model structure M, that is, the set of classifiers, that maximises p(M | D). Additionally, we have shown how one can use variational Bayesian inference to compute p(M | D) for some given M and D.

To demonstrate that our definition of the optimal classifier set leads to useful results, we describe a simple set of algorithms that allows us to demonstrate its use on a set of regression tasks. We provide two possible approaches to search the model structure space in order to maximise p(M | D), one based on a basic genetic algorithm to create a simple Pittsburgh-style LCS, and the other on sampling from the model posterior p(M | D) by Markov Chain Monte Carlo (MCMC) methods. These approaches are by no means supposed to act as viable competitors to current LCS, but rather as prototype implementations to demonstrate the correctness and usefulness of our optimal classifier set definition. Additionally, when formulating the algorithms in this chapter, we aim for readability rather than performance. Thus, there might still be plenty of room for optimisation.

The core of both approaches is the evaluation of p(M | D) and its comparison for different classifier sets in order to find the best set. We approach the evaluation of p(M | D) by variational Bayesian inference, as introduced in the previous chapter. Thus, with the algorithmic description of how to find p(M | D) we also provide a summary of the variational approach and a better understanding of how it can be implemented. The drawback of the algorithm as it is presented here is that it does not scale well with the number of classifiers, and that it can currently only operate in batch mode. The reader is reminded, however, that our algorithmic description is only meant to show that our definition of the optimal set of classifiers is a viable one. Future work, described in Chapter 10, will show how this definition can be incorporated into current LCS or can kindle the development of new LCS.

We continue by providing a set of functions that in combination allow us to compute a measure of the quality of a classifier set given the data. As this measure can subsequently be used by any global search algorithm that is able to find its maximum in the space of possible model structures, we keep its algorithmic description separate from the model structure search. For the structure search we provide two simple alternatives in a later section, one based on genetic algorithms, and another based on sampling the model posterior p(M | D) by MCMC methods. Finally, we use these approaches to demonstrate on simple regression tasks that our definition of optimality indeed allows us to identify a good set of classifiers.

8.1 Computing p(M | D)

In this section we introduce a set of functions that allow us to compute an approximation to p(M | D) for a given data set D and model structure M. These functions rely on a small set of global system parameters and constants that are given in Table 8.1. The functions are presented in a top-down order, starting with a function that returns p(M | D), and continuing with the sub-functions that it calls. The functions use a small set of non-standard operators and global functions that are described in Table 8.2.

We assume the data to be given by the N × D_X input matrix X and the N × D_Y output matrix Y, as described in Section 7.2.1.

The model structure is fully defined by the N × K matching matrix M, that is given by

M = [ m_1(x_1)  …  m_K(x_1)
          ⋮     ⋱      ⋮
      m_1(x_N)  …  m_K(x_N) ].    (8.1)

Thus, column k of this matrix specifies the degree of matching of classifier k for all available observations. Note that the definition of M differs from the one in Chapter 5, where M was a diagonal matrix that specified the matching for a single classifier.

In addition to the matching matrix, we also need to define the N × D_V mixing feature matrix Φ, that is given by

Φ = [ φ(x_1)^T
          ⋮
      φ(x_N)^T ],    (8.2)

and thus specifies the feature vector φ(x) for each observation. In LCS, we usually have φ(x) = 1 for all x, and thus also Φ = (1, …, 1)^T, but the algorithm presented here also works for other definitions of φ.

Symbol         Recom.   Description
a_α            10^{−2}  Scale parameter of weight vector variance prior
b_α            10^{−4}  Shape parameter of weight vector variance prior
a_β            10^{−2}  Scale parameter of mixing weight vector variance prior
b_β            10^{−4}  Shape parameter of mixing weight vector variance prior
a_τ            10^{−2}  Scale parameter of noise variance prior
b_τ            10^{−4}  Shape parameter of noise variance prior
Δ_s L_k(q)     10^{−4}  Stopping criterion for classifier update
Δ_s L_M(q)     10^{−2}  Stopping criterion for mixing model update
Δ_s KL(R‖G)    10^{−8}  Stopping criterion for mixing weight update
exp_min                 lowest real number x on system such that exp(x) > 0
ln_max                  ln(x), where x is the highest real number on system

Table 8.1: Description of the system parameters and constants. These include the distribution parameters of the priors and hyperpriors, and constants that parameterise the stopping criteria of parameter update iterations. The recommended values specify rather uninformative priors and hyperpriors, such that the introduced bias due to these priors is negligible.

Fn. / Op.      Description
A ⊗ B          given an a × b matrix or vector A, and a c × d matrix or vector B with a = c, b = d, A ⊗ B returns an a × b matrix that is the result of an element-wise multiplication of A and B. If a = c, d = 1, that is, if B is a column vector with c elements, then every column of A is multiplied element-wise by B, and the result is returned. Analogously, if B is a row vector with b elements, then each row of A is multiplied element-wise by B, and the result is returned.
A ⊘ B          the same as A ⊗ B, only performing division rather than multiplication.
Sum(A)         returns the sum over all elements of matrix or vector A.
RowSum(A)      given an a × b matrix A, returns a column vector of size a, where its ith element is the sum of the b elements of the ith row of A.
FixNaN(A, b)   replaces all NaN elements in matrix or vector A by b.

Table 8.2: Operators and global functions used in the algorithmic descriptions.

8.1.1 Model Probability and Evidence

The Function ModelProbability takes the model structure and the data as arguments and returns L(q) + ln p(M) as an approximation to the unnormalised ln p(M | D). Thus, it replaces the model evidence p(D | M) in Eq. (7.3) by its approximation L(q). The function assumes that the order of the classifiers can be arbitrarily permuted without changing the model structure and therefore uses the p(M) given by Eq. (7.4). In approximating ln p(M | D), the function does not add the normalisation constant. Hence, even though the return values are not proper probabilities, they can still be used for the comparison of different model structures, as the normalisation term is shared between all of them.

Function ModelProbability(M, X, Y, Φ)
Input: matching matrix M, input matrix X, output matrix Y, mixing feature matrix Φ
Output: approximate model probability L(q) + ln p(M)
 1  get K from shape of M
 2  for k ← 1 to K do
 3      m_k ← kth column of M
 4      W_k, Λ_k^{−1}, a_τk, b_τk, a_αk, b_αk ← TrainClassifier(m_k, X, Y)
 5  W, Λ^{−1} ← {W_1, …, W_K}, {Λ_1^{−1}, …, Λ_K^{−1}}
 6  a_τ, b_τ ← {a_τ1, …, a_τK}, {b_τ1, …, b_τK}
 7  a_α, b_α ← {a_α1, …, a_αK}, {b_α1, …, b_αK}
 8  V, Λ_V^{−1}, a_β, b_β ← TrainMixing(M, X, Y, Φ, W, Λ^{−1}, a_τ, b_τ, a_α, b_α)
 9  θ ← {W, Λ^{−1}, a_τ, b_τ, a_α, b_α, V, Λ_V^{−1}, a_β, b_β}
10  L(q) ← VarBound(M, X, Y, Φ, θ)
11  return L(q) + ln K!

The computation of L(q) + ln p(M) is straightforward: in Lines 2 to 7 we compute and assemble the parameters of the classifiers by calling TrainClassifier for each classifier k separately, and providing it with the data and the matching vector m_k for that classifier. After that, the mixing model parameters are computed in Line 8 by calling TrainMixing, based on the fully trained classifiers.

Having evaluated all classifiers, all parameters are collected in Line 9 to give θ and used in Line 10 to compute L(q) by calling VarBound. After that, the function returns L(q) + ln K!, based on Eqs. (7.3) and (7.4).

8.1.2 Training the Classifiers

The Function TrainClassifier takes the data X, Y and the matching vec-

tor mk and returns all model parameters for the trained classifier k. The model parameters are found by iteratively updating the distribution parameters of ∗ ∗ the variational posteriors qW,τ (Wk,τk) and qα(αk) until the convergence crite- rion is satisfied. This criterion is given by the classifier-specific components (q) of the variational bound (q), as given by Eq. (7.89). However, rather Lk L than evaluating (q) with the responsibilities r , as done in Eq. (7.89), we use Lk nk the matching function mk(xn). The underlying idea is that — as each classifier is trained independently — we assume that the responsibilities are equivalent to the matching function values. This has the effect that by updating the clas- sifier parameters according to Eqs. (7.95) – (7.98), we are indeed maximising

207 Function TrainClassifier( mk, X, Y ) Input: matching vector mk, input matrix X, output matrix Y Output: D D weight matrix W , D D covariance matrix Λ−1, noise Y × X k X × X k precision parameters aτk , bτk , weight vector prior parameters aαk , bαk

1 get DX ,DY from shape of X, Y 2 X X √m k ← ⊗ k 3 Y Y √m k ← ⊗ k 4 a , b a , b αk αk ← α α 5 a , b a , b τk τk ← τ τ 6 (q) Lk ← −∞ 7 ∆ (q) ∆ (q) + 1 Lk ← sLk 8 while ∆ k(q) > ∆s k(q) do E L L 9 α(αk) aαk /bαk Λ E← T 10 k α(αk)I + Xk Xk Λ−1← Λ −1 11 k ( k) ← T Λ−1 12 Wk Yk Xk k ← 1 13 Sum( ) aτk aτ + 2 mk ← 1 14 bτ bτ + Sum(Yk Yk) Sum(Wk WkΛk) k ← 2DY ⊗ − ⊗ 15 E (τ ) a /b τ k τk τk  ← DX DY 16 aαk aα + 2 ← 1 −1 17 b b + E (τ ) Sum(W W ) +D Tr(Λ ) αk ← α 2 τ k k ⊗ k Y k 18 k,prev(q) k(q) L ← L −1  19 (q) VarClBound(X, Y , W , Λ ,a , b ,a , b , m ) Lk ← k k τk τk αk αk k 20 ∆ (q) (q) (q) Lk ← Lk − Lk,prev 21 assert ∆ (q) 0 Lk ≥ Λ−1 22 return Wk, k ,aτk , bτk ,aαk , bαk

(q), which is not necessarily given if we have r = m (x ), as discussed Lk nk 6 k n in Section 7.3.9. Therefore, every parameter update is guaranteed to increase (q), until the algorithm converges. Lk

In more detail, in Lines 2 and 3 we compute the matched input vector Xk

and output vector Yk, based on mk(x) mk(x) = mk(x). Note that each column of X and Y is element-wisep multipliedp by √mk, where the square root is applied to each element of mk separately. The prior and hyperprior parameters are initialised with their prior parameter values in Lines 4 and 5.

In the actual iteration, Lines 9 to 14 compute the parameters of the varia- ∗ tional posterior qW,τ (Wk,τk) by the use of Eqs. (7.95) – (7.98) and Eq. (7.64). Λ−1 T To get the weight vector covariance k we make use of the equality Xk Xk = T n mk(xn)xnxn . The weight matrix Wk is evaluated by observing that the jth P 208 T Λ−1 Λ−1 row of Yk Xk k , giving wkj, is equivalent to k n mk(xn)xnynj. The up- date of b uses Sum(Y Y ) that effectively squares each element of Y before τk k ⊗ k P k 2 T Λ returning the sum over all elements, that is j n mk(xn)ynj. j wkj kwkj in Eq. (7.98) is computed by observing that itP canP be reformulatedP to the sum over all elements of the element-wise multiplication of Wk and WkΛk.

Lines 15 to 17 update the parameters of the variational posterior q*_α(α_k), as given by Eqs. (7.40), (7.41), and (7.72). Here, we use the sum over all squared elements of W_k to evaluate Σ_j w_kj^T w_kj.

The function determines convergence of the parameter updates in Lines 18 to 21 by computing the change of L_k(q) over two successive iterations. If this change drops below the system parameter Δ_s L_k(q), then the function returns. The value of L_k(q) is computed by Function VarClBound, which is described in Section 8.1.4. Its last argument is a vector of responsibilities for classifier k, which we substitute by the matching function values for the reasons mentioned above. Each parameter update either increases L_k(q) or leaves it unchanged, which we have specified in Line 21. If this is not the case, then the implementation is faulty and/or suffers from numerical instabilities. In the experiments we have performed, convergence was usually reached after 3–4 iterations.
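As an illustration of the update loop just described, here is a minimal NumPy sketch of the classifier training; the default prior values and the stopping threshold delta are assumptions standing in for the system parameters, and the bound evaluation is delegated to a var_cl_bound callable.

    import numpy as np

    def train_classifier(m_k, X, Y, var_cl_bound,
                         a_alpha=1e-2, b_alpha=1e-4, a_tau=1e-2, b_tau=1e-4,
                         delta=1e-4):
        """Variational classifier update, mirroring Function TrainClassifier."""
        D_X, D_Y = X.shape[1], Y.shape[1]
        root_m = np.sqrt(m_k)[:, None]
        Xk, Yk = X * root_m, Y * root_m              # match-weighted data
        a_ak, b_ak, a_tk, b_tk = a_alpha, b_alpha, a_tau, b_tau
        L_prev = -np.inf
        while True:
            E_alpha = a_ak / b_ak
            Lam = E_alpha * np.eye(D_X) + Xk.T @ Xk  # weight posterior precision
            Lam_inv = np.linalg.inv(Lam)
            W = Yk.T @ Xk @ Lam_inv                  # (D_Y, D_X) weight matrix
            a_tk = a_tau + 0.5 * m_k.sum()
            b_tk = b_tau + (np.sum(Yk * Yk) - np.sum(W * (W @ Lam))) / (2.0 * D_Y)
            E_tau = a_tk / b_tk
            a_ak = a_alpha + 0.5 * D_X * D_Y
            b_ak = b_alpha + 0.5 * (E_tau * np.sum(W * W) + D_Y * np.trace(Lam_inv))
            L = var_cl_bound(X, Y, W, Lam_inv, a_tk, b_tk, a_ak, b_ak, m_k)
            if L - L_prev < delta:                   # convergence of the bound
                return W, Lam_inv, a_tk, b_tk, a_ak, b_ak
            L_prev = L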

8.1.3 Training the Mixing Model

Training the mixing model is more complex than training the classifiers, as we need to use the IRLS algorithm to find the parameters of q*_V(V). The function TrainMixing takes the model structure, data, and the parameters of the fully trained classifiers, and returns the parameters of the mixing model.

As with training the classifiers, the parameters of the mixing model are found incrementally, by sequentially updating the parameters of the variational posteriors q*_V(V), q*_β(β) and q*_Z(Z). Convergence of the updates is determined by monitoring the change of the mixing model-related components L_M(q) of the variational bound L(q), as given by Eq. (7.93). If the magnitude of change of L_M(q) between two successive iterations is lower than the system parameter Δ_s L_M(q), then the algorithm assumes convergence and returns.

Function TrainMixing(M, X, Y, Φ, W, Λ^{-1}, a_τ, b_τ, a_α, b_α)
Input: matching matrix M, input matrix X, output matrix Y, mixing feature matrix Φ, classifier parameters W, Λ^{-1}, a_τ, b_τ, a_α, b_α
Output: D_V × K mixing weight matrix V, (KD_V) × (KD_V) mixing weight covariance matrix Λ_V^{-1}, mixing weight vector prior parameters a_β, b_β

 1  get D_X, D_Y, D_V, K from shape of X, Y, Φ, W
 2  V ← D_V × K matrix with elements sampled from N(0, b_β/a_β)
 3  a_β ← {a_β1, ..., a_βK}, all initialised to a_βk = a_β
 4  b_β ← {b_β1, ..., b_βK}, all initialised to b_βk = b_β
 5  L_M(q) ← −∞
 6  ΔL_M(q) ← Δ_s L_M(q) + 1
 7  while ΔL_M(q) > Δ_s L_M(q) do
 8      V, Λ_V^{-1} ← TrainMixWeights(M, X, Y, Φ, W, Λ^{-1}, a_τ, b_τ, V, a_β, b_β)
 9      a_β, b_β ← TrainMixPriors(V, Λ_V^{-1})
10      G ← Mixing(M, Φ, V)
11      R ← Responsibilities(X, Y, G, W, Λ^{-1}, a_τ, b_τ)
12      L_{M,prev}(q) ← L_M(q)
13      L_M(q) ← VarMixBound(G, R, V, Λ_V^{-1}, a_β, b_β)
14      ΔL_M(q) ← |L_M(q) − L_{M,prev}(q)|
15  return V, Λ_V^{-1}, a_β, b_β

The parameters are initialised in Lines 2 to 4 of TrainMixing. The D_V × K mixing matrix V holds the vector v_k that corresponds to classifier k in its kth column. As by Eq. (7.13) the prior on each element of v_k is given by a zero-mean Gaussian with variance β_k^{-1}, we initialise each element of V by sampling from N(0, b_β/a_β), where we have approximated the value of the random variable β_k by its prior expectation. The distribution parameters for q*_β(β_k) are initialised by setting them to the prior parameters.

An iteration starts by calling TrainMixWeights in Line 8 to get the parameters of the variational posterior q*_V(V). These are subsequently used in Line 9 to update the parameters of q*_β(β_k) for each k by calling TrainMixPriors. Lines 10 to 14 determine the magnitude of change of L_M(q) when compared to the last iteration. This is achieved by computing the N × K mixing matrix G = (g_k(x_n)) by calling Mixing. Based on G, the responsibility matrix R = (r_nk) is evaluated by calling Responsibilities in Line 11. This allows us to evaluate L_M(q) in Line 13 by calling VarMixBound, and to determine the magnitude of change ΔL_M(q) in the next line, which is subsequently used to determine whether the parameter updates have converged. In the experiments we have performed, the function usually converged after 5–6 iterations.

We continue by introducing the Functions TrainMixWeights, TrainMixPriors, Mixing and Responsibilities that are all used by TrainMixing to train the mixing model. VarMixBound is described in the later Section 8.1.4.

Function Mixing(M, Φ, V)
Input: matching matrix M, mixing feature matrix Φ, mixing weight matrix V
Output: N × K mixing matrix G

 1  get K from shape of V
 2  G ← ΦV
 3  limit all elements of G such that exp_min ≤ g_nk ≤ ln_max − ln K
 4  G ← exp(G) ⊗ M
 5  G ← G ⊘ RowSum(G)
 6  FixNaN(G, 1/K)
 7  return G

Starting with Mixing, this function is used to compute the mixing matrix G

that contains the values for gk(xn) for each classifier/input combination. It takes the matching matrix M, the mixing features Φ, and the mixing weight matrix V as arguments, and returns G.

The mixing matrix G is evaluated by computing Eq. (7.10) in several steps: firstly, in Line 2, v_k^T φ(x_n) is computed for each combination of n and k. Before we can take the exponential of these values, we need to make sure that it does not cause any overflow or underflow. We do this by limiting the values in G in Line 3 to a certain range, with the following underlying idea [175]: they are limited from below by exp_min to ensure that their exponential is positive, as we might later take their logarithm. Additionally, they are limited from above by ln_max − ln K such that summing over K such elements does not cause an overflow. Once this is done, we can take the element-wise exponential and multiply each element by the corresponding matching function value, as done in Line 4. This essentially gives us the numerator of Eq. (7.10) for all combinations of n and k. Normalisation over k is performed in the next line by dividing each element in a certain row by the element sum of this row. If we have rows in G that were zero before normalisation, we have performed 0/0, which we fix in Line 6 by assigning equal weights to all classifiers for inputs that are not

matched by any classifier. Usually, this should never happen, as we only accept model structures where Σ_k m_k(x_n) > 0 for all n. Nonetheless, we have added this check to ensure that we can handle even these cases gracefully.
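The following NumPy sketch mirrors these steps; the clipping bounds exp_min and ln_max are assumptions that stand in for the numerical limits discussed above.

    import numpy as np

    def mixing(M, Phi, V, exp_min=np.log(np.finfo(float).tiny),
               ln_max=np.log(np.finfo(float).max)):
        """Matching-weighted generalised softmax, mirroring Function Mixing."""
        K = V.shape[1]
        G = Phi @ V                                   # g_nk = v_k^T phi(x_n)
        G = np.clip(G, exp_min, ln_max - np.log(K))   # guard against over-/underflow
        G = np.exp(G) * M                             # numerator of Eq. (7.10)
        with np.errstate(invalid='ignore', divide='ignore'):
            G = G / G.sum(axis=1, keepdims=True)      # normalise over k
        return np.where(np.isnan(G), 1.0 / K, G)      # 0/0 rows get equal weights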

Function Responsibilities(X, Y, G, W, Λ^{-1}, a_τ, b_τ)
Input: input matrix X, output matrix Y, gating matrix G, classifier parameters W, Λ^{-1}, a_τ, b_τ
Output: N × K responsibility matrix R

 1  get K, D_Y from shape of Y, G
 2  for k = 1 to K do
 3      W_k, Λ_k^{-1}, a_τk, b_τk ← pick from W, Λ^{-1}, a_τ, b_τ
 4      kth column of R ← exp( (D_Y/2)(ψ(a_τk) − ln b_τk)
 5          − (1/2) ( (a_τk/b_τk) RowSum((Y − X W_k^T)²) + D_Y RowSum(X ⊗ X Λ_k^{-1}) ) )
 6  R ← R ⊗ G
 7  R ← R ⊘ RowSum(R)
 8  FixNaN(R, 0)
 9  return R

Based on the gating matrix G and the goodness-of-fit of the classifiers, the Function Responsibilities computes the N × K responsibility matrix, with r_nk as its nkth element. Its elements are evaluated by following Eqs. (7.62), (7.63), (7.69) and (7.74).

The loop from Line 2 to 5 in Responsibilities iterates over all k to fill the

columns of R with the values for ρ_nk according to Eq. (7.62), but without the term g_k(x_n)¹. This is simplified by observing that the term Σ_j (y_nj − w_kj^T x_n)², which is by Eq. (7.74) part of E_{W,τ}(τ_k Σ_j (y_nj − w_kj^T x_n)²), is given for each observation separately in the vector that results from summing over the rows of (Y − X W_k^T)², where the square is taken element-wise. Similarly, x_n^T Λ_k^{-1} x_n of the same expectation is given for each observation by the vector that results from summing over the rows of X ⊗ X Λ_k^{-1}, based on x_n^T Λ_k^{-1} x_n = Σ_i (x_n)_i (Λ_k^{-1} x_n)_i. The values of g_k(x_n) are incorporated in Line 6, and the normalisation step of Eq. (7.63) is performed in Line 7. For the same reason as in the Mixing function, we need to subsequently replace all NaN values in R by 0, so as not to assign responsibility to any classifiers for inputs that are not matched.

1 Note that we are operating on ρnk rather than ln ρnk, as given by Eq. (7.62), as we certainly have gk(xn) = 0 in cases where mk(xn) = 0, which would lead to subsequent numerical problems when evaluating ln gk(xn).
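A corresponding NumPy sketch of this computation is given below; scipy's digamma stands in for ψ, and the classifier parameters are passed as lists indexed by k.

    import numpy as np
    from scipy.special import digamma

    def responsibilities(X, Y, G, W, Lam_inv, a_tau, b_tau):
        """Posterior responsibilities r_nk, mirroring Function Responsibilities."""
        N, D_Y = Y.shape
        K = G.shape[1]
        R = np.empty((N, K))
        for k in range(K):
            E_tau = a_tau[k] / b_tau[k]
            sq_err = np.sum((Y - X @ W[k].T) ** 2, axis=1)    # row sums of (Y - X W_k^T)^2
            x_var = np.sum(X * (X @ Lam_inv[k]), axis=1)      # x_n^T Lam_k^{-1} x_n
            R[:, k] = np.exp(0.5 * D_Y * (digamma(a_tau[k]) - np.log(b_tau[k]))
                             - 0.5 * (E_tau * sq_err + D_Y * x_var))
        R *= G                                                # weight by mixing coefficients
        with np.errstate(invalid='ignore', divide='ignore'):
            R = R / R.sum(axis=1, keepdims=True)              # normalise over k, Eq. (7.63)
        return np.where(np.isnan(R), 0.0, R)                  # unmatched inputs get zero responsibility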

Function TrainMixWeights(M, X, Y, Φ, W, Λ^{-1}, a_τ, b_τ, V, a_β, b_β)
Input: matching matrix M, input matrix X, output matrix Y, mixing feature matrix Φ, classifier parameters W, Λ^{-1}, a_τ, b_τ, mixing weight matrix V, mixing weight prior parameters a_β, b_β
Output: D_V × K mixing weight matrix V, (KD_V) × (KD_V) mixing weight covariance matrix Λ_V^{-1}

 1  E_β(β) ← row vector with elements (a_β1/b_β1, ..., a_βK/b_βK)
 2  G ← Mixing(M, Φ, V)
 3  R ← Responsibilities(X, Y, G, W, Λ^{-1}, a_τ, b_τ)
 4  KL(R‖G) ← ∞
 5  ΔKL(R‖G) ← Δ_s KL(R‖G) + 1
 6  while ΔKL(R‖G) > Δ_s KL(R‖G) do
 7      E ← Φ^T (G − R) + V ⊗ E_β(β)
 8      e ← (E_11, ..., E_{D_V 1}, E_12, ..., E_{D_V 2}, ..., E_{1K}, ..., E_{D_V K})^T
 9      H ← Hessian(Φ, G, a_β, b_β)
10      Δv ← −H^{-1} e
11      ΔV ← D_V × K matrix with jkth element
12          given by the ((k − 1)K + j)th element of Δv
13      V ← V + ΔV
14      G ← Mixing(M, Φ, V)
15      R ← Responsibilities(X, Y, G, W, Λ^{-1}, a_τ, b_τ)
16      KL_prev(R‖G) ← KL(R‖G)
17      KL(R‖G) ← Sum(R ⊗ FixNaN(ln(G ⊘ R), 0))
18      ΔKL(R‖G) ← |KL_prev(R‖G) − KL(R‖G)|
19  H ← Hessian(Φ, G, a_β, b_β)
20  Λ_V^{-1} ← H^{-1}
21  return V, Λ_V^{-1}

The Function TrainMixWeights approximates the mixing weight variational posterior q*_V(V), Eq. (7.51), by performing the IRLS algorithm. It takes the matching matrix, the data and mixing feature matrix, the trained classifier parameters, the mixing weight matrix, and the mixing weight prior parameters. As the IRLS algorithm performs incremental updates of the mixing weights V until convergence, we do not re-initialise V every time TrainMixWeights is called, but rather use the previous estimates as their initial values to reduce the number of iterations that are required until convergence.

As we aim at modelling the responsibilities by finding mixing weights that

make the mixing coefficients given by g_k(x_n) similar to r_nk, we determine convergence by the Kullback-Leibler divergence measure KL(R‖G) that measures the distance between the probability distributions given by R and G. Formally, this is defined by KL(R‖G) = Σ_n Σ_k r_nk ln(g_k(x_n)/r_nk), and is represented in L_M(q), Eq. (7.93), by the terms E_{Z,V}(ln p(Z|V)) − E_Z(ln q(Z)), given by Eq. (7.82). As the Kullback-Leibler divergence is non-negative and zero if and only if R = G [235], the algorithm assumes convergence of the IRLS algorithm if the change in KL(R‖G) between two successive iterations is below the system parameter Δ_s KL(R‖G).

TrainMixWeights starts by computing the expectation E_β(β_k) for all k in Line 1. The IRLS iteration, Eq. (6.5), requires the error gradient ∇E(V) and the Hessian H, which are by Eqs. (7.48) and (7.49) based on the values of g_k(x_n) and r_nk. Hence, TrainMixWeights continues by computing G and R in Lines 2 and 3.

The error gradient ∇E(V) of Eq. (7.48) is evaluated in Lines 7 and 8. Line 7 uses the fact that Φ^T (G − R) results in a D_V × K matrix that has the vector Σ_n (g_j(x_n) − r_nj) φ(x_n) as its jth column. Similarly, V ⊗ E_β(β) results in a matrix of the same size, with E_β(β_j) v_j as its jth column. Line 8 rearranges the matrix E, which has ∇_{v_j} E(V) as its jth column, into the gradient vector e = ∇E(V). The Hessian H is assembled in Line 9 by calling the Function Hessian, and is used in the next line to compute the vector Δv by which the mixing weights need to be changed according to the IRLS algorithm, Eq. (6.5). The mixing weight matrix is updated by rearranging Δv to the shape of V in Line 12, and adding it to V in the next line.

As the mixing weights have changed, we recompute G and R with the updated weights, to get KL(R‖G), and eventually to use them in the next iteration. The Kullback-Leibler divergence between the responsibilities R and their model G is evaluated in Line 17, and then compared to its value from the last iteration to determine convergence of the IRLS algorithm. Note that due to the use of matrix operations, we do not check for elements in R that are r_nk = 0 due to g_k(x) = 0 when computing G ⊘ R, which might cause NaN entries in the resulting matrix. Even though these entries are multiplied by r_nk = 0 thereafter, we first need to replace all of these entries by zero, as otherwise we would still get 0 × NaN = NaN.

The IRLS algorithm gives us the mean of q*_V(V) by the mixing weights that minimise the error function E(V). We still need to evaluate its covariance matrix Λ_V^{-1}, which, by Eq. (7.50), is the inverse Hessian, as evaluated in Line 19. We cannot use the last Hessian computed in the IRLS iteration in Line 9, because the Hessian depends on G, which has changed thereafter.
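To make the IRLS update concrete, the following sketch performs a single Newton step on the mixing weights; the callables mixing, responsibilities and hessian are assumed to behave like the corresponding functions in this chapter, and the flattening uses column-major order as in the gradient vector of Line 8.

    import numpy as np

    def irls_step(M, X, Y, Phi, V, a_beta, b_beta, cl_params,
                  mixing, responsibilities, hessian):
        """One IRLS (Newton) step on the mixing weights, as in TrainMixWeights."""
        E_beta = a_beta / b_beta                     # expectations E_beta(beta_k)
        G = mixing(M, Phi, V)
        R = responsibilities(X, Y, G, *cl_params)
        E = Phi.T @ (G - R) + V * E_beta[None, :]    # gradient blocks, Eq. (7.48)
        e = E.flatten(order='F')                     # stack the columns into one vector
        H = hessian(Phi, G, a_beta, b_beta)          # (K*D_V, K*D_V) Hessian, Eq. (7.49)
        dv = -np.linalg.solve(H, e)                  # Newton direction; avoids forming H^{-1}
        return V + dv.reshape(V.shape, order='F'), H

Solving the linear system instead of explicitly inverting H is numerically preferable; the covariance Λ_V^{-1} is still obtained once from the final Hessian, as in Line 19.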

Function Hessian(Φ, G, a_β, b_β)
Input: mixing feature matrix Φ, mixing matrix G, mixing weight prior parameters a_β, b_β
Output: (KD_V) × (KD_V) Hessian matrix H

 1  get D_V, K from shape of Φ, G
 2  H ← empty (KD_V) × (KD_V) matrix
 3  for k = 1 to K do
 4      g_k ← kth column of G
 5      for j = 1 to k − 1 do
 6          g_j ← jth column of G
 7          H_kj ← −Φ^T (Φ ⊗ (g_k ⊗ g_j))
 8          kjth D_V × D_V block of H ← H_kj
 9          jkth D_V × D_V block of H ← H_kj
10      a_βk, b_βk ← pick from a_β, b_β
11      H_kk ← Φ^T (Φ ⊗ (g_k ⊗ (1 − g_k))) + (a_βk/b_βk) I
12      kth D_V × D_V block along diagonal of H ← H_kk
13  return H

To complete TrainMixWeights, let us consider how the Function Hessian assembles the Hessian matrix H: it first creates an empty (KD_V) × (KD_V) matrix that is thereafter filled by its block elements H_kj = H_jk, as given by Eq. (7.49). Here we use the equality

    Σ_n φ(x_n) g_k(x_n) g_j(x_n) φ(x_n)^T = Φ^T (Φ ⊗ (g_k ⊗ g_j))        (8.3)

for the off-diagonal blocks of H where Ikj = 0 in Eq. (7.49), and a similar relation to get the diagonal blocks of H.

The posterior parameters of the prior on the mixing weights are evaluated according to Eqs. (7.56), (7.57), and (7.70) in order to get q*_β(β_k) for all k. Function TrainMixPriors takes the parameters of q*_V(V) and returns the parameters for all q*_β(β_k). The posterior parameters are computed by iterating over all k, and in Lines 5 and 6 by performing a straightforward evaluation of Eqs. (7.56) and (7.57), where in the latter, Eq. (7.70) replaces E_V(v_k^T v_k).

Function TrainMixPriors(V, Λ_V^{-1})
Input: mixing weight matrix V, mixing weight covariance matrix Λ_V^{-1}
Output: mixing weight vector prior parameters a_β, b_β

 1  get D_V, K from shape of V
 2  for k = 1 to K do
 3      v_k ← kth column of V
 4      (Λ_V^{-1})_kk ← kth D_V × D_V block along diagonal of Λ_V^{-1}
 5      a_βk ← a_β + D_V / 2
 6      b_βk ← b_β + (1/2) (Tr((Λ_V^{-1})_kk) + v_k^T v_k)
 7  a_β, b_β ← {a_β1, ..., a_βK}, {b_β1, ..., b_βK}
 8  return a_β, b_β
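A minimal sketch of this update, assuming the block layout of Λ_V^{-1} used above; the hyperprior values a_beta0 and b_beta0 are illustrative assumptions.

    import numpy as np

    def train_mix_priors(V, Lam_V_inv, a_beta0=1e-2, b_beta0=1e-4):
        """Mixing weight prior update, mirroring Function TrainMixPriors."""
        D_V, K = V.shape
        a_beta, b_beta = np.empty(K), np.empty(K)
        for k in range(K):
            v_k = V[:, k]
            block = Lam_V_inv[k * D_V:(k + 1) * D_V, k * D_V:(k + 1) * D_V]
            a_beta[k] = a_beta0 + 0.5 * D_V
            b_beta[k] = b_beta0 + 0.5 * (np.trace(block) + v_k @ v_k)
        return a_beta, b_beta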

8.1.4 The Variational Bound

Function VarBound(M, X, Y, Φ, θ)
Input: matching matrix M, input matrix X, output matrix Y, mixing feature matrix Φ, trained model parameters θ
Output: variational bound L(q)

 1  get K from shape of V
 2  G ← Mixing(M, Φ, V)
 3  R ← Responsibilities(X, Y, G, W, Λ^{-1}, a_τ, b_τ)
 4  L_K(q) ← 0
 5  for k = 1 to K do
 6      r_k ← kth column of R
 7      L_K(q) ← L_K(q)
 8          + VarClBound(X, Y, W_k, Λ_k^{-1}, a_τk, b_τk, a_αk, b_αk, r_k)
 9  L_M(q) ← VarMixBound(G, R, V, Λ_V^{-1}, a_β, b_β)
10  return L_K(q) + L_M(q)

The variational bound L(q) is evaluated in Function VarBound according to Eq. (7.94). The function takes the model structure, the data, and the trained classifier and mixing model parameters, and returns the value for L(q). The classifier-specific components L_k(q) are computed separately for each classifier k in Line 8 by calling VarClBound. Note that in contrast to calling VarClBound with the matching function values of the classifiers, as done in Function TrainClassifier, we here conform to Eq. (7.89) and provide VarClBound with the previously evaluated responsibilities. The full variational bound is found by adding the mixing model-specific components L_M(q), which are computed in Line 9 by a call to VarMixBound, to the sum of all L_k(q)'s.

Function VarClBound(X, Y, W_k, Λ_k^{-1}, a_τk, b_τk, a_αk, b_αk, r_k)
Input: input matrix X, output matrix Y, classifier parameters W_k, Λ_k^{-1}, a_τk, b_τk, a_αk, b_αk, responsibility vector r_k
Output: classifier component L_k(q) of the variational bound

 1  get D_X, D_Y from shape of X, Y
 2  E_τ(τ_k) ← a_τk / b_τk
 3  L_{k,1}(q) ← (D_Y/2)(ψ(a_τk) − ln b_τk − ln 2π) Sum(r_k)
 4  L_{k,2}(q) ← −(1/2) r_k^T (E_τ(τ_k) RowSum((Y − X W_k^T)²) + D_Y RowSum(X ⊗ X Λ_k^{-1}))
 5  L_{k,3}(q) ← −ln Γ(a_α) + a_α ln b_α + ln Γ(a_αk) − a_αk ln b_αk + (D_X D_Y)/2 + (D_Y/2) ln |Λ_k^{-1}|
 6  L_{k,4}(q) ← D_Y (−ln Γ(a_τ) + a_τ ln b_τ + (a_τ − a_τk) ψ(a_τk) − a_τ ln b_τk − b_τ E_τ(τ_k)
 7               + ln Γ(a_τk) + a_τk)
 8  return L_{k,1}(q) + L_{k,2}(q) + L_{k,3}(q) + L_{k,4}(q)

By evaluating Eq. (7.89), the Function VarClBound returns the components of L(q) that are specific to classifier k. It takes the data, the trained classifier parameters, and the responsibilities with respect to that classifier, and returns the value for L_k(q). This value is computed by splitting Eq. (7.89) into the components L_{k,1}(q) to L_{k,4}(q), evaluating them one by one, and then returning their sum. To get L_{k,2}(q) we have used the same matrix simplifications to get ‖y_n − W_k x_n‖² and x_n^T Λ_k^{-1} x_n as in Line 5 in Function Responsibilities.

Function VarMixBound(G, R, V, Λ_V^{-1}, a_β, b_β)
Input: mixing matrix G, responsibility matrix R, mixing weight matrix V, mixing covariance matrix Λ_V^{-1}, mixing weight prior parameters a_β, b_β
Output: mixing component L_M(q) of the variational bound

 1  get D_V, K from shape of V
 2  L_{M,1}(q) ← K (−ln Γ(a_β) + a_β ln b_β)
 3  for k = 1 to K do
 4      a_βk, b_βk ← pick from a_β, b_β
 5      L_{M,1}(q) ← L_{M,1}(q) + ln Γ(a_βk) − a_βk ln b_βk
 6  L_{M,2}(q) ← Sum(R ⊗ FixNaN(ln(G ⊘ R), 0))
 7  L_{M,3}(q) ← (1/2) ln |Λ_V^{-1}| + (KD_V)/2
 8  return L_{M,1}(q) + L_{M,2}(q) + L_{M,3}(q)

Finally, Function VarMixBound takes the mixing values and responsibilities, and the mixing model parameters, and returns the mixing model-specific components L_M(q) of L(q) by evaluating Eq. (7.93). As in VarClBound, the computation of L_M(q) is split into the components L_{M,1}(q), L_{M,2}(q), and L_{M,3}(q), whose sum is returned. L_{M,1}(q) contains the components of L_M(q) that depend on the parameters of q*_β(β), and is computed in Lines 2 to 5 by iterating

Function            O(·)                    Comments
ModelProbability    N K³ D_X³ D_Y D_V³      K³ D_V³ from TrainMixing, D_X³ from TrainClassifier
TrainClassifier     N D_X³ D_Y              D_X³ due to Λ_k^{-1}
TrainMixing         N K³ D_X² D_Y D_V³      K³ D_X² D_V³ from TrainMixWeights
Mixing              N K D_V
Responsibilities    N K D_X² D_Y            D_X² due to X ⊗ X Λ_k^{-1}
TrainMixWeights     N K³ D_X² D_Y D_V³      (KD_V)³ due to H^{-1}, D_X² from Responsibilities
Hessian             N K² D_V²               K² due to the nested iteration, D_V² due to Φ^T (Φ ⊗ (g_k ⊗ g_j))
TrainMixPriors      K D_V
VarClBound          N D_X² D_Y              D_X² due to X Λ_k^{-1} or |Λ_k^{-1}|
VarMixBound         N K² D_V²               (KD_V)² due to |Λ_V^{-1}|

Figure 8.1: Complexity of the different functions with respect to the number of observations N, the number of classifiers K, the dimensionality of the input space D_X, the dimensionality of the output space D_Y, and the dimensionality of the mixing feature space D_V.

over all k. L_{M,2}(q) is the Kullback-Leibler divergence KL(R‖G), as given by Eq. (7.82), which is computed in the same way as in Line 17 of Function TrainMixWeights.

8.1.5 Scaling Issues

Let us now consider how the presented algorithm scales with the dimensionality D_X of the input space, the dimensionality D_Y of the output space, the dimensionality D_V of the mixing feature space, the number N of observations that are available, and the number K of classifiers. All O(·) results are based on the observation that multiplying an a × b matrix with a b × c matrix scales with O(abc), and that the inversion and the determinant of an a × a matrix have complexity O(a³) and O(a²), respectively.

Figure 8.1 gives an overview of how the different functions scale with N, K, D_X, D_Y and D_V. Unfortunately, even though ModelProbability scales linearly with N and D_Y, it neither scales well with D_X, nor with K and D_V. In all three cases, the cubic term is caused by a matrix inversion.

Considering that the D_X³ term is due to inverting the precision matrix Λ_k, it might be reducible to D_X² by using the Sherman-Morrison formula, as shown in Section 5.3.5. D_X is the dimensionality of the input space with respect to the classifier model, and is given by D_X = 1 for averaging classifiers, and by D_X = 2 for classifiers that model straight lines. Thus, it is in general not too high, and D_X³ will not be the most influential complexity component. In any case, as long as we are required to maintain a covariance matrix Λ_k^{-1} of size D_X × D_X, the influence of D_X is unlikely to be reducible below D_X².
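As a sketch of the kind of reduction meant here, the Sherman-Morrison formula updates an already known inverse after a rank-one change in O(D_X²); the function below is illustrative and assumes the precision matrix is grown by one weighted observation at a time.

    import numpy as np

    def sherman_morrison_update(Lam_inv, x, weight=1.0):
        """Rank-one update of an inverted precision matrix in O(D_X^2).

        Given Lam_inv = Lam^{-1}, returns (Lam + weight * x x^T)^{-1} without
        performing a full O(D_X^3) inversion.
        """
        Lx = Lam_inv @ x                            # O(D_X^2)
        denom = 1.0 + weight * (x @ Lx)             # scalar
        return Lam_inv - (weight / denom) * np.outer(Lx, Lx)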

The biggest weakness of the prototype algorithm that we have presented here is that the number of operations required to find the parameters of the mixing model scales with K³ D_V³. This is due to the inversion of the (KD_V) × (KD_V) Hessian matrix that is required at each iteration of the IRLS algorithm. To apply variational inference to real-world problems, we would require the algorithm to scale linearly with the number of classifiers K. This is best achieved by approximating the optimal mixing weights by well-tuned heuristics, as we have already done for the prior-free LCS model in Chapter 6. The mixing feature space dimensionality, on the other hand, is usually D_V = 1, and its influence is therefore negligible.

In summary, the presented algorithm scales with O(N K³ D_X³ D_Y D_V³). While it might be possible to reduce D_X³ to D_X², it still scales super-linearly with the number of classifiers K. This is due to the use of the generalised softmax function, which requires the application of the IRLS algorithm to find its parameters. To reduce the complexity, we can either replace the softmax function by another model that is easier to train, or introduce well-tuned heuristics that provide us with a good approximation. This issue will be discussed further in Chapter 10.

8.2 Two Alternatives for Model Structure Search

Recall that we have defined the optimal set of classifiers M as the set that maximises p(M|D). Therefore, in order to find this optimal set we need to search the space {M} for the M such that p(M|D) ≥ p(M'|D) for all M'. This can theoretically be approached by any method that is able to find some element in a set that maximises some function of the elements in that set, such as simulated annealing [220], or genetic algorithms [93, 169].

The two methods that we describe here are the ones we have used to test the usefulness of our optimality definition. They are conceptually simple and not particularly intelligent, as neither of them uses any information embedded in the probabilistic LCS model besides the value proportional to ln p(M|D) to form the search trajectory through the model structure space. Consequently, there is still plenty of room for improvement.

The reason we introduce two alternatives is i) to emphasise the conceptual separation between evaluating the quality of a set of classifiers, and searching for better ones, and ii) to show that we can in theory use any global optimiser to perform the task of model structure search. As the aim is independent of the search procedure, reaching this aim only depends on the compatibility of the search procedure with the model structure space. After having introduced the two alternatives, we give a short discussion in Section 8.2.3 about their differences, and what might in general be good guidelines to improve the ef- fectiveness of searching for good sets of classifiers.

Note that the optimal set of classifiers strongly depends on the chosen rep- resentation for the matching functions, as we can only find solutions that we are able to represent. Nonetheless, to keep the description of the meth- ods representation-independent, we postpone the discussion of components of the methods that are representation-dependent to the point where we have to choose some representation; that is, in Section 8.3.

8.2.1 Model Structure Search by a Genetic Algorithm

Genetic algorithms (GA) are a family of global optimisers that are conceptu- ally based on Darwinian evolution. We expect the reader to be familiar with their underlying idea and basic implementations, of which an overview can be found in [93, 169].

An individual in the population that our GA operates on is defined by an LCS model structure M, and its fitness is given by the value that ModelProbability returns for this model structure. As the genetic algorithm seeks to increase the fitness of the individuals in the population, its goal is to find the model structure that maximises p(M|D). An allele of an individual's genome is given by the representation of a single classifier's matching function, which makes the genome's length determined by the number of classifiers of the associated model structure. As this number is not fixed, the individuals in the population can be of variable length².

Starting with an initial population of P randomly generated individuals, a single iteration of the genetic algorithm is performed as follows: firstly, we determine the matching matrix M according to Eq. (8.1) for each individual, based on its representation of the matching functions and the input matrix X. This matching matrix is subsequently used to determine each individual's fitness by calling ModelProbability. After that, we create a new population by selecting two individuals from the current population and applying crossover with probability p_c and mutation with probability p_m. The last step is repeated until the new population again holds P individuals. Then, the new population replaces the current one, and the next iteration begins.

An individual is initially generated by randomly choosing the number of clas- sifiers it represents, and then initialising the matching function of each of its classifiers, again randomly. How these matching functions are initialised de- pends on the representation and is thus discussed later. To avoid the influence of fitness scaling, we select individuals from the current population by de- terministic tournament selection with tournament size ts. Mutation is again dependent on the chosen representation, and will be discussed later.

As two selected individuals can be of different length, we cannot apply standard uniform crossover but have to use different means: we want the total number of classifiers to remain unchanged, but as the locations of the classifiers in the genome of an individual do not provide us with any information, we allow their locations to change. Thus, we proceed as shown in Function Crossover by randomly choosing the new numbers K_a' and K_b' of classifiers

2Variable-length individuals might cause bloat, which is a common problem when using Evolutionary Computation algorithms with such individuals, as frequently observed in ge- netic programming [159]. It also plagues some Pittsburgh-style LCS that use variable-length individuals, such as LS-1 [200] and GAssist [7], and counteracting measures have to be taken to avoid its occurrence. This is not an issue in our application, as overly complex model struc- tures will receive a lower fitness due to the preference of the applied model selection criterion for models of low complexity.

Function Crossover(M_a, M_b)
Input: two model structures M_a, M_b
Output: resulting two model structures M_a', M_b' after crossover

 1  K_a, K_b ← number of classifiers in M_a, M_b
 2  M_a, M_b ← matching function sets from M_a, M_b
 3  M_a' ← M_a ∪ M_b
 4  K_b' ← random integer K such that 1 ≤ K < K_a + K_b
 5  M_b' ← ∅
 6  for k = 1 to K_b' do
 7      m ← matching function picked uniformly at random from M_a'
 8      M_b' ← M_b' ∪ {m}, M_a' ← M_a' \ {m}
 9  return model structures M_a', M_b' with matching function sets M_a', M_b'

in each of the new individuals M_a' and M_b' such that the sum of classifiers K_a' + K_b' = K_a + K_b remains unchanged, and each new individual has at least one classifier. The matching functions of individual M_b' are determined by randomly picking K_b' matching functions from either of the old individuals. The other individual M_a' receives all the remaining K_a + K_b − K_b' matching functions. In summary, we perform crossover by collecting the matching functions of both individuals, and randomly redistributing them.
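A minimal Python sketch of this operator, representing a model structure simply as a list of matching functions (an assumption made for illustration):

    import random

    def crossover(model_a, model_b, rng=random):
        """Variable-length uniform crossover, mirroring Function Crossover.

        model_a, model_b: lists of matching functions, one entry per classifier.
        """
        pool = list(model_a) + list(model_b)     # collect all matching functions
        rng.shuffle(pool)                        # random redistribution
        k_b = rng.randint(1, len(pool) - 1)      # both children keep at least one classifier
        return pool[k_b:], pool[:k_b]            # new model structures a' and b'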

For our empirical demonstration we have not specified any particular criteria to determine the convergence of the genetic algorithm. Rather, we pre-specify the number of iterations that it performs. Additionally, we use an elitist strategy by separately maintaining the highest-fitness model structure M* that was found so far. This model structure is not part of the normal population, but is replaced as soon as a fitter model structure is found.

This completes the description of the genetic algorithm that we have used. It is kept deliberately simple so as not to distract from the task it has to solve, which is to find the model structure that maximises p(M|D). In the presented form, it might be considered a simple Pittsburgh-style LCS.

222 8.2.2 Model Structure Search by Markov Chain Monte Carlo

Our use of the MCMC algorithm provides a sample sequence M_1, M_2, ... from the model structure space that follows a Markov chain with steady-state probabilities p(M|D), and thus allows us to sample from p(M|D) [19]. As such a sampling process takes more samples from high-probability model structures, the sample sequence spends more time in high-probability areas of p(M|D). Hence, the MCMC algorithm can be seen as a stochastic hill-climber that aims at finding the M that maximises p(M|D). The algorithm presented here is based on a similar algorithm developed for CART model search in [63].

The sample sequence is generated by the Metropolis-Hastings algorithm [104], which is given by the following procedure: given an initial model structure M_0, a candidate model structure M' is created in step t + 1, based on the current model structure M_t. This candidate is accepted, that is, M_{t+1} = M', with probability

    min( (p(M_t|M') p(M'|D)) / (p(M'|M_t) p(M_t|D)), 1 ),        (8.4)

and otherwise rejected, in which case the sequence continues with the previous model, that is, M_{t+1} = M_t. p(M_t|M') and p(M'|M_t) are the probability distributions that describe the process of generating the candidate model M'. As the search procedure tends to prefer model structures that improve p(M|D), it is prone to spending many steps in areas of the model structure space where p(M|D) is locally optimal. To avoid getting stuck in such areas, we perform random restarts after a certain number of steps, which are executed by randomly reinitialising the current model structure.

The initial model structure M_0, as well as the model structure after a random restart, is generated by randomly initialising K classifiers, where K needs to be given. We assume that the matching function of a classifier can be initialised by sampling from the probability distribution p(m_k). Thus, M_0 is generated by taking K samples from p(m_k). The exact form of p(m_k) depends on the chosen representation, and thus will be discussed later.

A new candidate model structure M' is created from the current model structure M_t with K_t classifiers, similarly to the procedure in [63], by choosing one of the following actions:

change. Picks one classifier of M_t at random, and reinitialises its matching function by taking a sample from p(m_k).

add. Adds one classifier to M_t, with a matching function sampled from p(m_k), resulting in K_t + 1 classifiers.

remove. Removes one classifier from M_t at random, resulting in K_t − 1 classifiers.

The actions are chosen by taking samples from the discrete random variable A ∈ {change, add, remove}, where we assume p(A = add) = p(A = remove) and p(A = change) = 1 − 2p(A = add).

Let us now consider how to compute the acceptance probability Eq. (8.4) for each of these actions. We have p(M|D) ∝ p(D|M) p(M|K) p(K) by Bayes' Theorem, where, different to Eq. (7.3), we have separated the number of classifiers K from the model structure M. As in Eq. (7.4), we assume a uniform prior over the unique models, giving p(K) ∝ 1/K!. Additionally, every classifier in M is created independently by sampling from p(m_k), which results in p(M|K) = p(m_k)^K. Using variational inference, the model evidence is approximated by the variational bound p(D|M) ∝ exp(L_M(q)), where L_M(q) denotes the variational bound of model M. Thus, in combination we have

    p(M'|D) / p(M_t|D) ≈ (exp(L_{M'}(q)) p(m_k)^{K'} (K'!)^{-1}) / (exp(L_{M_t}(q)) p(m_k)^{K_t} (K_t!)^{-1}),        (8.5)

where K' denotes the number of classifiers in M'.

We get the model transition probability p(M'|M_t) by marginalising over the actions A, to get

    p(M'|M_t) = p(M'|M_t, A = change) p(A = change)
              + p(M'|M_t, A = add) p(A = add)
              + p(M'|M_t, A = remove) p(A = remove),        (8.6)

and a similar expression for p(M_t|M'). When we choose action add, then K' = K_t + 1, and p(M'|M_t, A = change) = p(M'|M_t, A = remove) = 0, as neither the action change nor the action remove causes a classifier to be added. M_t and M' differ in a single classifier that is picked from p(m_k), and therefore p(M'|M_t, A = add) = p(m_k). Similarly, when choosing the action remove for M_t, an arbitrary classifier is picked with probability 1/K_t, and therefore p(M'|M_t, A = remove) = 1/K_t. The action change requires choosing a classifier with probability 1/K_t and reinitialising it with probability p(m_k), giving p(M'|M_t, A = change) = p(m_k)/K_t. The reverse transitions p(M_t|M') can be evaluated by observing that the only possible action that causes the reverse transition from M' to M_t after the action add is the action remove, and vice versa. Equally, change causes the reverse transition after performing action change.

Overall, the candidate model M' that was created by add from M_t is accepted by Eq. (8.4) with probability

    min( (p(M_t|M', A = remove) p(A = remove) p(M'|D)) / (p(M'|M_t, A = add) p(A = add) p(M_t|D)), 1 )
      ≈ min( exp(L_{M'}(q) − L_{M_t}(q) − 2 ln(K_t + 1)), 1 ),        (8.7)

where we have used our previous assumption p(A = add) = p(A = remove), K' = K_t + 1, and Eq. (8.5). When choosing the action remove, on the other hand, the candidate model M' is accepted with probability

    min( (p(M_t|M', A = add) p(A = add) p(M'|D)) / (p(M'|M_t, A = remove) p(A = remove) p(M_t|D)), 1 )
      ≈ min( exp(L_{M'}(q) − L_{M_t}(q) − 2 ln K_t), 1 ),        (8.8)

based on K' = K_t − 1, and Eq. (8.5). Note that in the case of K' = 0, the variational bound will be L_{M'}(q) = −∞, and the candidate model will always be rejected, which confirms that a model without a single classifier is of no value. Finally, a candidate model M' where a single classifier from M_t has been changed by action change is accepted with probability

    min( (p(M_t|M', A = change) p(A = change) p(M'|D)) / (p(M'|M_t, A = change) p(A = change) p(M_t|D)), 1 )
      ≈ min( exp(L_{M'}(q) − L_{M_t}(q)), 1 ).        (8.9)

To summarise, the MCMC algorithm starts with a randomly initialised model structure M_0 with K_0 classifiers and at each step t + 1 performs either change, add, or remove to create a candidate model structure M' from M_t, which is either accepted (M_{t+1} = M') with a probability that, depending on the chosen action, is given by Eq. (8.7), (8.8) or (8.9), and otherwise rejected (M_{t+1} = M_t).
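The following sketch puts one such step together; sample_matching stands in for drawing from p(m_k), model_log_prob for evaluating the variational bound of a candidate structure, and the acceptance tests follow Eqs. (8.7) – (8.9). All names are illustrative assumptions.

    import math
    import random

    def mcmc_step(model, L_model, sample_matching, model_log_prob,
                  p_add=0.25, rng=random):
        """One Metropolis-Hastings step over model structures, following Eqs. (8.7)-(8.9).

        model:                  list of matching functions (current structure M_t)
        L_model:                its variational bound L_{M_t}(q)
        sample_matching():      draws a matching function from p(m_k)
        model_log_prob(model):  returns the variational bound of a candidate structure
        """
        K = len(model)
        candidate = list(model)
        u = rng.random()
        if u < p_add:                                        # add a classifier
            candidate.append(sample_matching())
            log_correction = -2.0 * math.log(K + 1)
        elif u < 2.0 * p_add and K > 1:                      # remove a classifier
            candidate.pop(rng.randrange(K))
            log_correction = -2.0 * math.log(K)
        else:                                                # change a classifier
            candidate[rng.randrange(K)] = sample_matching()
            log_correction = 0.0
        L_candidate = model_log_prob(candidate)
        if math.log(rng.random() + 1e-300) < L_candidate - L_model + log_correction:
            return candidate, L_candidate                    # accept
        return model, L_model                                # reject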

8.2.3 Building Blocks in Classifier Sets

As apparent from the above descriptions, the most pronounced difference be- tween the GA and the MCMC search procedures is that the MCMC search only considers a single model structure at a time, while the GA operates on a pop- ulation of them simultaneously. This parallelism allows the GA to maintain several competing model structure hypotheses that might contain valuable building blocks to form better model structures. In GA, building blocks refer to a group of alleles that in combination provide a part of the solution [93]. With respect to our model structure search, a building block is a subset of the classifiers in a model structure that in combination provides a good model for a subset of the data. A good model structure search maintains such building blocks and recombines them with other building blocks to form new model structure hypotheses.

Do such building blocks really exist in the LCS model that we have provided, and in LCS in general? Let us consider a simple example where the model structure contains a single classifier that matches all inputs with about equal probability. The only sensible action that MCMC search can perform is to add another classifier to see if it improves the model structure, which results in a classifier that matches all observations about equally, and a possibly more specific classifier that concentrates on a subset of the data. Only in rare cases will such a combination provide a better model for the data (see Section 8.3.3 for an example where it does). Rather, the globally matching classifier should be rearranged such that it does not directly compete with the specific classifier in modelling its part of the data. The resulting pair of classifiers would then cooperate to model a part of the data and can be seen as a building block of a potentially good model structure. Thus, while these building blocks exist, they are not exploited when using the MCMC algorithm for model structure search.

When using a GA for model structure search, on the other hand, the pop- ulation of individuals can contain several potentially useful building blocks, and it is the responsibility of the crossover operator to identify and recombine

them. As shown in [214], uniform crossover generally yields better results than one-point and two-point crossover. The crossover operator that we use aims at applying uniform crossover to variable-length individuals. Further improvement in identifying building blocks can be made by using Estimation of Distribution Algorithms (EDAs) [185], but as there are currently no EDAs that directly apply to our problem structure [151], this is a possible topic of future research.

8.3 Empirical Demonstration

To demonstrate the usefulness of the optimality criterion that we have introduced in the last chapter, we use the previously described algorithms to find a good set of classifiers for a set of simple regression tasks. These tasks are kept simple in the sense that the number of classifiers that are expected to be required is low, such that the O(K³) complexity of ModelProbability does not cause any computational problems. Additionally, the crudeness of the model structure search procedures does not allow us to handle problems where the best solution is given by a complex agglomeration of classifiers. All regression tasks have D_X = 1 and D_Y = 1 such that we can visualise the results easily. The mixing features are given by φ(x) = 1 for all x. Not all functions are standardised, but their domain is always within [−1, 4] and their range is within [−1, 1]. For all experiments we have used classifiers that model straight lines, and have used uninformative priors and hyperpriors as given in Table 8.1.

Even though the prime problems that most new LCS are tested against are Multiplexer problems of various lengths [240], we consider them as a challenge for the model structure search rather than the optimality criterion and thus have omitted them from our test set. Rather, we add a significant amount of noise to the data, as our aim is to provide a criterion that provides the minimal model, and can separate the underlying patterns from the noise, given that enough data is available.

Firstly, we introduce the two different representations that are used for the matching functions, and then continue by describing the four regression tasks, their aim, and our results, one by one.

8.3.1 Representations

The two representations that we are using are matching by radial-basis functions, and matching by soft intervals. Starting with matching by radial-basis functions, we now describe their matching functions, and how these are initialised and mutated.

Matching by Radial-Basis Functions

The matching function for matching by radial-basis functions is defined by

    m_k(x) = exp( −(1/(2σ_k²)) (x − µ_k)² ),        (8.10)

which is an unnormalised Gaussian that is parameterised by a scalar µ_k and a positive spread σ_k. Thus, the probability of classifier k matching input x

decreases with the distance from µk, where the strength of the decrease is

determined by σk. If σk is small, then the matching probability decreases

rapidly with the squared distance of x from µ_k. Note that, as m_k(x) > 0 for all x, we always have Σ_k m_k(x_n) > 0 for all n, that is, all inputs are matched by at least one classifier, as required. Examples for the shape of the radial-basis matching function are shown in Figure 8.2. We have chosen this matching function to demonstrate matching by probability, a feature that has not been available in LCS before.

Figure 8.2: Matching probability for matching by radial-basis functions for different parameters. Classifiers 1, 2, and 3 all have their matching functions centred on µ1 = µ2 = µ3 = 0.5, but have different spreads σ1 = 0.1, σ2 = 0.01, σ3 = 1. This visualises how a larger spread causes the classifier to match a larger area of the input space with higher probability. The matching function of classifier 4 is centred on µ4 = 0.8 and has spread σ4 = 0.2, showing that µ controls the location of the input space where the classifier matches with probability 1.

Rather than declaring µ_k and σ_k directly, we specify the matching parameters 0 ≤ a_k ≤ 100 and 0 ≤ b_k ≤ 50 that give µ_k = l + (u − l) a_k / 100 and σ_k² = 10^(−b_k/10), where [l, u] is the range of the input x. Thus, a_k determines the centre of the classifier, where 0 and 100 specify the lower and upper end of x, respectively. σ_k is given by b_k such that 10^(−50) ≤ σ_k² ≤ 1, and a low b_k gives a wide spread of the classifier matching function. A new classifier is initialised by randomly choosing a_k uniformly from [0, 100), and b_k uniformly from [0, 50). The two values are mutated by adding a sample from N(0, 10) to a_k, and a sample from N(0, 5) to b_k, ensuring thereafter that they still conform to 0 ≤ a_k ≤ 100 and 0 ≤ b_k ≤ 50. The reason we operate on a_k, b_k rather than µ_k, σ_k is that this simplifies the mutation operation by making it independent of the range of x for µ_k, and allows for non-linearity with respect to σ_k. Alternatively, one could simply adopt the mutation operator from [52].
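A small sketch of this parameterisation; the input range [l, u] is passed in explicitly, and the mutation spreads N(0, 10) and N(0, 5) from the text are treated here as standard deviations, which is an assumption about the intended convention.

    import numpy as np

    def rbf_matching(x, a_k, b_k, l=0.0, u=1.0):
        """Radial-basis matching probability m_k(x), Eq. (8.10), from the (a_k, b_k) encoding."""
        mu = l + (u - l) * a_k / 100.0
        sigma2 = 10.0 ** (-b_k / 10.0)
        return np.exp(-0.5 * (np.asarray(x, dtype=float) - mu) ** 2 / sigma2)

    def init_rbf(rng=np.random):
        """Random initialisation of the matching parameters (a_k, b_k)."""
        return rng.uniform(0.0, 100.0), rng.uniform(0.0, 50.0)

    def mutate_rbf(a_k, b_k, rng=np.random):
        """Gaussian mutation, clipped back to the valid parameter ranges."""
        a_k = float(np.clip(a_k + rng.normal(0.0, 10.0), 0.0, 100.0))   # spread 10 for a_k
        b_k = float(np.clip(b_k + rng.normal(0.0, 5.0), 0.0, 50.0))     # spread 5 for b_k
        return a_k, b_k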

Matching by Soft Intervals

Matching by soft intervals is similar to the interval matching that was intro- duced in XCS by Wilson in [242], with the difference that we are using soft boundaries on the intervals. We have chosen to represent the interval by soft boundaries rather than hard boundaries to express the fact that we are never absolutely certain about the exact location of these boundaries, and to avoid the need to explicitly care about having each input matched by at least one classifier.

To avoid the representational bias of the centre/spread representation of [242], we use the lower/upper bound representation that was introduced and analysed in [206]. The softness of the boundary is provided by an unnormalised Gaussian that is attached to both sides of the interval within which the classifier matches with probability 1. To avoid the boundaries from being too soft, we include them partially in the interval. More precisely, when specifying the interval for classifier k by its lower bound l_k and upper bound u_k, we want exactly one standard deviation of the Gaussian to lie inside this interval, and additionally require 95% of the area underneath the matching function to be inside this interval. More formally, we need 0.95(b_k' + √(2π) σ_k) = b_k to hold to have the interval b_k = u_k − l_k specify 95% of the area underneath the matching function, where b_k' gives the width of the interval where the classifier matches with probability 1, and we have used the fact that the area underneath an unnormalised Gaussian with standard deviation σ is √(2π)σ. The requirement of the specified interval extending by one standard deviation to either side of the Gaussian is satisfied by b_k' + 0.6827 √(2π) σ_k = b_k, based on the fact that the area underneath the unnormalised Gaussian within one standard deviation of its centre is 0.6827 √(2π)σ. Solving these equations with respect to b_k' and σ_k for

a given b_k results in

    σ_k = ((1/0.95 − 1) / (1 − 0.6827)) (1/√(2π)) b_k ≈ 0.0662 b_k,        (8.11)
    b_k' = b_k − 0.6827 √(2π) σ_k ≈ 0.8868 b_k.        (8.12)

Thus, about 89% of the specified interval is matched with probability 1, and the leftover 5.5% to either side is matched according to one standard deviation of a Gaussian. Therefore, the matching function for soft interval matching is given by

    m_k(x) = exp( −(1/(2σ_k²)) (x − l_k')² )   if x < l_k',
             exp( −(1/(2σ_k²)) (x − u_k')² )   if x > u_k',
             1                                 otherwise,        (8.13)

where l_k' and u_k' are the lower and upper bound of the interval that the classifier matches with probability 1, and are given by l_k' ≈ l_k + 0.0566 b_k and u_k' ≈ u_k − 0.0566 b_k, such that u_k' − l_k' = b_k'. Figure 8.3 shows examples for the shape of the matching function for soft interval matching.
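A corresponding sketch of the soft interval matching function, using the constants derived in Eqs. (8.11) and (8.12):

    import numpy as np

    def soft_interval_matching(x, l_k, u_k):
        """Soft interval matching probability m_k(x), following Eqs. (8.11)-(8.13)."""
        b_k = u_k - l_k
        sigma_k = 0.0662 * b_k                   # Eq. (8.11)
        l_in = l_k + 0.0566 * b_k                # l'_k: start of the probability-1 region
        u_in = u_k - 0.0566 * b_k                # u'_k: end of the probability-1 region
        x = np.atleast_1d(np.asarray(x, dtype=float))
        m = np.ones_like(x)
        below, above = x < l_in, x > u_in
        m[below] = np.exp(-0.5 * (x[below] - l_in) ** 2 / sigma_k ** 2)
        m[above] = np.exp(-0.5 * (x[above] - u_in) ** 2 / sigma_k ** 2)
        return m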

We perform initialisation of a classifier k by following [206], and sample lk

and uk from a uniform distribution over [l, u], which is the range of x. If lk >

uk, then their values are swapped. While in [206, 242], the boundary values

are mutated by a uniform random variable, we rather sample from a Gaussian to make small changes more likely than large changes. Thus, the boundaries after mutation are given by perturbing both bounds by N(0, (u − l)/10), that is, by a sample from a zero-mean Gaussian with a standard deviation that is a tenth of the range of x. After that, we again make sure that l ≤ l_k < u_k ≤ u by swapping and bounding their values if required.

Figure 8.3: Matching probability for matching by soft intervals for different parameters. Classifiers 1 and 2 are adjacent, as l1 = 0, u1 = l2 = 0.2, and u2 = 0.5. The area where these two classifiers overlap shows that the classifiers do not match their full interval with probability 1, due to the soft boundaries of the intervals. Nonetheless, 95% of the area beneath the matching function is within the specified interval. Classifier 3 matches the interval l3 = 0.7, u3 = 0.9. Comparing the boundaries of classifiers 2 and 3 shows that the spread of the boundary grows with the width of the interval that it matches.

Even though both matching functions are only introduced for the case when

DX = 1, they can be easily extended to higher-dimensional input spaces. In the case of radial-basis function matching, the matching function is specified by a multivariate Gaussian, analogous to the hyper-ellipsoidal conditions for XCS [41, 52]. Matching by a soft interval becomes slightly more complex due to the interval-specification of the matching function, but its computation can be simplified by defining the matching function as the product of one single- dimensional matching function per dimension of the input space.

8.3.2 Generated Function

Figure 8.4: Classifier models, mixed model and available data for the generated function.

To see if the optimality criterion is correct when the data conforms to the underlying assumptions of the model, we first test it on a function that was generated to conform to these assumptions. The data is generated by taking 300 samples from 3 linear classifiers with models N(y | 0.05 + 0.5x, 0.1), N(y | 2 − 4x, 0.1), and N(y | −1.5 + 2.5x, 0.1), which use radial-basis function matching with (µ, σ²) parameters (0.2, 0.05), (0.5, 0.01), (0.8, 0.05) and mixing weights v1 = 0.5, v2 = 1.0, v3 = 0.4, respectively. A plot of the classifiers' means, their generated function mean, and the available data can be found in Figure 8.4.

We have tested both the GA and MCMC model structure search, where the GA is in this and all other experiments initialised with a population of size

P = 20, crossover and mutation probability pc = pm = 0.4, and tournament

size t_s = 5. The number of classifiers in each of the individuals is sampled from the binomial distribution B(8, 0.5), such that, on average, an individual has 4 classifiers. The performance of the GA model structure search is not sensitive to the initial size of the individuals and gives similar results for different initialisations of its population.

The results after a single run with 250 GA iterations are shown in Figure 8.5. As can be seen, the model was not correctly identified, as the number of

classifiers of the best found individual is 2 rather than the desired 3, with L(q) − ln K! ≈ 118.81. Nonetheless, the generated function mean is still within the first standard deviation of the predicted mean.

Figure 8.5: Plots showing the best found model structure for the generated function using GA model structure search, and fitness and average number of classifiers over the GA iterations. Plot (a) shows the available data, the model of the classifiers, and their mixed prediction with 1 standard deviation to either side, and additionally the mean of the generating function. The matching function parameters of the classifiers are µ1 = 0.09, σ1² = 0.063 and µ2 = 0.81, σ2² = 0.006. Plot (b) shows the maximum, average, and minimum fitness of the individuals in the population after each GA iteration. The minimum fitness is usually below the lower edge of the plot. The plot also shows the average number of classifiers for all individuals in the current population.

The MCMC model structure search was applied to the same data, using for this and all further experiments 10 restarts with 500 steps each, and p(A = add) = p(A = remove) = 1/4. Thus, we use the same number of model structure evaluations as with the GA. The initial number of classifiers is, after each restart, sampled from the binomial distribution B(8, 0.5), resulting in 4 classifiers on average.

As can be seen in Figure 8.6, MCMC model structure search performed better than the GA by correctly identifying all 3 classifiers with L(q) − ln K! ≈ 174.50, indicating a higher p(M|D) than for the one found by the GA. While the discovered model structure is not exactly that of the data-generating process, it is surprisingly similar, given the rather crude search procedure. The reject rate of the MCMC algorithm was about 96.9%, which shows that the algorithm quickly finds a local optimum and remains there.

Figure 8.6: Plots showing the best discovered model structure for the generated function using MCMC model structure search, and variational bound and number of classifiers over the MCMC steps. Plot (a) shows the available data, the model of the classifiers, and their mixed prediction with 1 standard deviation to either side, and additionally the mean of the generating function. The matching function parameters of the classifiers are µ1 = 0.16, σ1² = 0.01, µ2 = 0.461, σ2² = 0.025, and µ3 = 0.78, σ3² = 0.006. Plot (b) shows the variational bound L(q) for each step of the MCMC algorithm, and clearly visualises the random restarts after 500 steps. It also shows the number of classifiers K in the current model structure for each step of the MCMC search.

8.3.3 Sparse, Noisy Data

While the noise of the generated function is rather low and there is plenty of data available, the next experiment investigates whether the optimality criterion can handle more noise and less data. For this purpose we take a test function from [230], where it was used to test the performance of the Bayesian MoE model with a fixed model structure. The function is given by f(x) = 4.25(e^(−x) − 4e^(−2x) + 3e^(−3x)) + N(0, 0.2) over 0 ≤ x ≤ 4, and is shown in Figure 8.7, together with the 200 sampled observations. In [230], the added noise had variance 0.44, but we have reduced it to 0.2, as otherwise no pattern was apparent in the data. We assume that the Bayesian MoE model was only able to identify a good model despite the high noise due to its pre-determined model structure.

Again using radial-basis function matching, the GA and MCMC settings are the same as in the previous experiment, except for the initial number of classifiers, which is in both cases sampled from B(4, 0.5). As before, the result is insensitive to this number. The best discovered model structures are shown in Figure 8.8 for the GA, with L(q) − ln K! ≈ −159.07, and in Figure 8.9 for the MCMC, with L(q) − ln K! ≈ −158.55. The MCMC search had a reject rate of about 97.0% over its 5000 steps.

Figure 8.7: Plot showing the test function used in [230], and the 200 available observations.

Both the GA and the MCMC search resulted in about the same model structure, which at first sight seems slightly surprising: looking at Figure 8.7, one would initially expect the function to be modelled by a flat line over 1.5

8.3.4 Function with Variable Noise

One of the disadvantages of XCS, as discussed in Section 7.1.1, is that the de- sired mean absolute error of each classifier is globally specified by the system

parameter ǫ_0. Therefore, XCS cannot properly handle data where the noise level varies significantly over the input space. The optimality criterion we have devised assumes constant noise variance at the classifier level, but does

not make such an assumption at the global level. Thus, it can handle cases where each classifier is required to accept a different level of noise, as we will show with the following experiment.

Figure 8.8: Plots similar to the ones in Figure 8.5, when using a GA for model structure search applied to the function as given in [230]. The best discovered model structure is given by µ1 = 0.52, σ1 = 0.016 and µ2 = 3.32, σ2 = 1.000.

Figure 8.9: Plots similar to the ones in Figure 8.6, when using MCMC model structure search applied to the function as given in [230]. The best discovered model structure is given by µ1 = 0.56, σ1 = 0.025 and µ2 = 2.40, σ2 = 0.501.

Similarly to, but not exactly as in, [230], we use a function that has two different noise levels. The function is given for −1 ≤ x ≤ 1 by f(x) = −1 − 2x + N(0, 0.6) if x < 0, and f(x) = −1 + 2x + N(0, 0.1) otherwise. Thus, the V-shaped function has a noise variance of 0.6 below x = 0, and a noise variance of 0.1 above it. Its mean and the 200 data points that are used as the data set are shown in Figure 8.10. To assign each classifier to a clear interval of the input space, we use soft interval matching functions.


Figure 8.10: Plot showing the mean of the function with variable noise, and the 200 observations that are available from this function.

We have again applied both GA and MCMC search with the same settings as before, and an initial number of classifiers sampled from B(8, 0.5). The best discovered model structures are shown for the GA in Figure 8.11, with L(q) + ln K! ≈ −63.12, and for MCMC search in Figure 8.12, with a slightly better L(q) + ln K! ≈ −58.59. The reject rate of the MCMC search was about 96.6%.

In both cases, the model structure search was able to identify two classifiers with different noise variance. The difference in the modelled noise variance is clearly visible in both Figures 8.11 and 8.12 by the plotted prediction standard deviation. Thus, we have demonstrated that the classifier set optimality criterion is suitable for data where the level of noise differs for different areas of the input space.

8.3.5 A Slightly More Complex Function

To demonstrate the limitations of the model structure search methods as introduced in this chapter, we perform the last experiment on a slightly more complex function. The function we have used is the noisy sinusoid given over the range $-1 \le x \le 1$ by $f(x) = \sin(2\pi x) + \mathcal{N}(0, 0.15)$, as shown in Figure 8.13. We are again using soft interval matching to clearly specify the area of the input space that a classifier models. The data set is given by 300 samples from f(x).


Figure 8.11: Plots similar to the ones in Figure 8.5, where GA model structure search was applied to a function with variable noise. The best discovered model structure is given by $l_1 = -0.82$, $u_1 = 0.08$ and $l_2 = 0.04$, $u_2 = 1.00$.


Figure 8.12: Plots similar to the ones in Figure 8.6, where MCMC model structure search was applied to a function with variable noise. The best discovered model structure is given by $l_1 = -0.98$, $u_1 = -0.06$ and $l_2 = 0.08$, $u_2 = 0.80$.

Both GA and MCMC search are initialised as before, with the number of classifiers sampled from $\mathcal{B}(8, 0.5)$. The GA search identified 7 classifiers with $\mathcal{L}(q) + \ln K! \approx -155.68$, as shown in Figure 8.14. It is apparent that the model can be improved by reducing the number of classifiers to 5 and moving them to adequate locations. However, as can be seen in Figure 8.14(b), the GA initially operated with 5 classifiers, but was not able to find good interval placements, as the low maximum fitness shows. Once it increased the number of classifiers to 7, at around the 60th iteration, it was able to provide a fitter model structure, but at the cost of an increased number of classifiers.


Figure 8.13: Plot showing the mean of the noisy sinusoidal function, and the 300 observations that are available from this function.

It maintained this model up to the 250th iteration without finding a better one, which indicates that the genetic operators need to be improved and require better tuning to the representation used in order to make the GA perform better model structure search.

That the inappropriate model can be attributed to a weak model structure search rather than a failing optimality criterion becomes apparent when considering the result of the MCMC search, with a superior $\mathcal{L}(q) + \ln K! \approx -29.39$, as shown in Figure 8.15. The discovered model is clearly better, which is also reflected in a higher $p(\mathcal{M}|\mathcal{D})$. Note, however, that this model was not discovered after all restarts of the MCMC algorithm. Rather, model structures with 6 or 7 classifiers were sometimes preferred, as Figure 8.15(b) shows. This indicates that a further increase in problem complexity will very likely cause the MCMC search to fail as well.

8.4 Summary

In this chapter we have developed simple algorithms that search for the optimal set of classifiers given some data, and have used these algorithms to demonstrate, on the basis of four regression tasks, the adequacy of our definition of the optimal classifier set.


Figure 8.14: Plots similar to the ones in Figure 8.5, using GA model structure search applied to the noisy sinusoidal function. The best discovered model structure is given by $l_1 = -0.98$, $u_1 = -0.40$, $l_2 = -0.78$, $u_2 = -0.32$, $l_3 = -0.22$, $u_3 = 0.16$, $l_4 = -0.08$, $u_4 = 0.12$, $l_5 = 0.34$, $u_5 = 0.50$, $l_6 = 0.34$, $u_6 = 1.00$, and $l_7 = 0.60$, $u_7 = 0.68$.

As a basis for evaluating the quality of a set of classifiers as specified by the model structure $\mathcal{M}$, we have provided functions that perform variational Bayesian inference, as described in the previous chapter, to approximate the model probability $p(\mathcal{M}|\mathcal{D})$. More specifically, the function ModelProbability takes the model structure $\mathcal{M}$ and the data $\mathcal{D}$ as arguments and returns an approximation to the unnormalised model probability. Thus, in addition to the theoretical treatment of variational inference in the previous chapter, we show in this chapter how it can be implemented. Due to the complex procedure required to find the mixing weight vectors that combine the localised classifier models into a global model, the described implementation scales unfavourably with the number of classifiers K. As a topic of future research, we might reduce this complexity by replacing the generalised softmax function by well-tuned heuristics.

To emphasise that in theory any global optimisation procedure can be used to find the best set of classifiers, we have introduced two methods to find the $\mathcal{M}$ that maximises $p(\mathcal{M}|\mathcal{D})$. On one hand, we have described a GA that operates in a Pittsburgh-style LCS way, and on the other hand, we have introduced a stochastic hill-climber based on MCMC to sample $p(\mathcal{M}|\mathcal{D})$. Both methods are rather crude, but sufficient to demonstrate the abilities of our optimality criterion.


Figure 8.15: Plots similar to the ones in Figure 8.6, using MCMC model structure search applied to the noisy sinusoidal function. The best discovered model structure is given by $l_1 = -1.00$, $u_1 = -0.68$, $l_2 = -0.62$, $u_2 = -0.30$, $l_3 = -0.24$, $u_3 = 0.14$, $l_4 = 0.34$, $u_4 = 0.78$, and $l_5 = 0.74$, $u_5 = 0.98$.

Using the introduced optimisation algorithms, we have shown on a set of regression tasks that our definition of the best set of classifiers i) is able to differentiate between patterns in the data and noise, ii) prefers simpler model structures over more complex ones, and iii) can handle data where the level of noise differs between different areas of the input space. These features have previously not been available in any LCS without manually tuning system parameters that influence not only the model structure search procedure but also the definition of what constitutes a good set of classifiers. Being able to handle different levels of noise is a feature that has, to our knowledge, not been available in any LCS before, regardless of how the system parameters are tuned. While it is certainly useful for regression and classification, it might additionally be able to solve the issue of long-path learning in sequential decision tasks, as we will discuss in the following chapter.


Chapter 9

Towards Reinforcement Learning with LCS

Having until now concentrated on how LCS can handle regression tasks, let us return to the prime motivation for LCS: sequential decision tasks. There has been little theoretical LCS work that concentrates on these tasks (for example, [30, 227]), despite some obvious problems that need to be solved [11, 12, 76]. At the same time, other machine learning methods have steadily improved their performance in handling these tasks [127, 28, 207], based on extensive theoretical advances. In order to catch up with these methods, LCS need to refine their theory if they want to deliver competitive performance. In this chapter we provide a strong basis for further theoretical development and discuss some currently relevant issues.

Sequential decision tasks are, in general, characterised by having a set of states and actions, where an action performed in a particular state causes a transition to the same or another state. Each transition is mediated by a scalar reward, and the aim is to perform actions in particular states such that the sum of rewards received is maximised in the long run. How to choose an action for a given state is determined by the policy. Even though the space of possible policies could be searched directly, a more common and more efficient approach is to learn for each state the sum of future rewards that one can expect to receive from that state, and to derive the optimal policy from that knowledge.

The core of Dynamic Programming (DP) is how to learn the mapping between states and their associated expected sum of rewards, but doing so requires a model of the transition probabilities and the rewards that are given. Reinforcement Learning (RL), on the other hand, aims at learning this mapping, known as the value function, at the same time as performing the actions, and as such improves the policy simultaneously. It can do so either without any model of the transitions and rewards, known as model-free RL, or by modelling the transitions and rewards from observations and then using DP methods based on these models to improve the policy, known as model-based RL. In this chapter we mainly concentrate on model-free RL, as it is the variant that has been used most frequently in LCS.

If the state space is large or even continuous, then the value function is not learned for each state separately but rather modelled by some function approximation technique. However, this limits the quality of the discovered policy by how close the approximated value function is to the real value function. Furthermore, the shape of the value function is not known beforehand, and so the function approximation technique has to be able to adjust its resources adaptively. Considering that LCS provide such adaptive regression models, they seem to be a prime candidate for approximating the value function; and this is in fact exactly what LCS are used for when applied to sequential decision tasks: they act as adaptive value function approximation methods that aid RL methods in learning the value function.

As early LCS pre-date common RL methods, they have not always been characterised as approximating the value function. In fact, the first comparison between RL and LCS was made in [73], where Dorigo and Bersini show that a Very Simple CS without generalisation and with a slightly modified implicit bucket brigade is equivalent to tabular Q-Learning. A more general study shows how evolutionary computation can be used for reinforcement learning [174], but ignores the development of XCS [240], where Wilson explicitly uses Q-Learning as the RL component.

Recently, there has been some confusion [47, 226, 143] about how to correctly implement RL in XCS(F), and this has caused XCS(F) to be modified in various ways. To prevent further confusion, we show in this chapter how to correctly derive variants of Q-Learning that use LCS function approximation from first

principles. This not only provides a formal basis for combining LCS with RL, and as such gains from formal developments in RL, but also shows that XCS(F) already performs correct RL without the need for modifications. Furthermore, the derivations can act as an example for any LCS that aims at solving sequential decision tasks, as the procedure is conceptually always the same.

To appropriately link LCS into RL we firstly need to introduce the formal basis for RL, which is formed by various DP methods. We aim at keeping this introduction brief and provide a longer LCS-related version in [78]. Nonetheless, we discuss some stability issues that RL is known to have when the value function is approximated, as these are particularly relevant (though mostly ignored) when combining RL with LCS. Hence, after showing how to derive the use of Q-Learning with LCS from first principles in Section 9.3 and discussing the recent confusion around XCS(F), we show in Section 9.4 how to analyse the stability of RL when used with LCS. Learning of long action sequences is another issue that XCS is known to struggle with [11], and even though a solution is proposed in [12], we show in Section 9.5 that this solution does not apply to all problem cases. On the upside, the classifier set optimality criterion from Chapter 7 might not suffer from the same problems, as we will discuss in Section 9.5.4. But firstly, let us define sequential decision tasks more formally in Section 9.1, and introduce DP and RL methods that provide solutions to such tasks in Section 9.2.

9.1 Problem Definition

We will concentrate on problems that are solvable by reinforcement learning and are therefore describable by a Markov Decision Process (MDP). To stay close to the notation that is common in the literature [17, 213], we assign to some of the previously used symbols a new meaning. The definitions given in this section are similar to the ones in [17, 78].

9.1.1 Markov Decision Processes

Let $\mathcal{X}$ be the set of states $x \in \mathcal{X}$ of the problem domain, which we assume to be of finite size$^1$ $N$, and hence we will map $\mathcal{X}$ to the natural numbers $\mathbb{N}$. We have previously defined $\mathcal{X}$ as being the input space, but as the states are identified by the input that is determined by the environmental state, we use state and input interchangeably. In every state $x_i \in \mathcal{X}$ we can perform an action $a$ out of a finite set $\mathcal{A}$ that causes a state transition to $x_j$. The probability of getting to state $x_j$ after performing action $a$ in state $x_i$ is given by the transition function $p(x_j|x_i,a)$, which is a probability distribution over $\mathcal{X}$, conditional on $\mathcal{X} \times \mathcal{A}$. Each such transition is mediated by a scalar reward $r_{x_i x_j}(a)$, defined by the reward function $r : \mathcal{X} \times \mathcal{X} \times \mathcal{A} \to \mathbb{R}$. The positive discount factor $\gamma \in \mathbb{R}$ with $0 < \gamma \le 1$ determines the preference of immediate reward over future reward. Therefore, the MDP that describes the problem is defined by the quintuple $\{\mathcal{X}, \mathcal{A}, p, r, \gamma\}^2$. We have previously used $\gamma$ to denote the step size for gradient-based incremental methods in Chapter 5. In this chapter, the step size will be denoted by $\alpha$ to conform to the RL literature [213].

The aim is for every state to choose the action that maximises the reward in the long run, where future rewards are possibly valued less than immediate rewards. A possible solution is represented by a policy $\mu : \mathcal{X} \to \mathcal{A}$, which returns the chosen action $a = \mu(x)$ for any state $x \in \mathcal{X}$. With a fixed policy $\mu$, the MDP is reduced to a Markov chain with transition probabilities $p^\mu(x_j|x_i) = p(x_j|x_i, a = \mu(x_i))$, and rewards $r^\mu_{x_i x_j} = r_{x_i x_j}(\mu(x_i))$. In such cases we usually operate with the expected reward $r^\mu_{x_i} = \sum_{x_j} p^\mu(x_j|x_i)\, r^\mu_{x_i x_j}$. This reward expresses what we would expect to receive when in state $x_i$ we choose an action according to policy $\mu$.

$^1$Assuming a finite state space simplifies the presentation. Extending it to a continuous state space requires considerably more technical work. For examples of an analysis of reinforcement learning in continuous state spaces see [130, 179].
$^2$The problem definition and with it the solution to the problem changes when the discount rate $\gamma$ is changed. Thus, it is important to consider the discount rate $\gamma$ as part of the problem rather than a tunable parameter. This fact is ignored in some LCS research, where the discount rate is modified to make the task seemingly easier to learn, when, in fact, the task itself is changed.

9.1.2 The Value Function, the Action-Value Function and Bellman's Equation

The approach taken by dynamic programming (DP) and reinforcement learning (RL) is to define a value function $V : \mathcal{X} \to \mathbb{R}$ that expresses for each state how much reward we can expect to receive in the long run. While we have previously used $V$ to denote the mixing weight vectors, we will not need to refer to them in this chapter and hence avoid any ambiguity. Let $\mu = \{\mu_0, \mu_1, \dots\}$ be a sequence of policies where we use policy $\mu_t$ at time $t$, starting at time $t = 0$. Then, the reward that is accumulated after $n$ steps when starting at state $x$, called the $n$-step return $V^\mu_n$ for state $x$, is given by

$$V^\mu_n(x) = E\left(\gamma^n R(x_n) + \sum_{t=0}^{n-1} \gamma^t r^{\mu_t}_{x_t x_{t+1}} \;\Big|\; x_0 = x\right), \qquad (9.1)$$
where $\{x_0, x_1, \dots\}$ is the sequence of states, and $R(x_n)$ denotes the expected return that we will receive when starting from state $x_n$. The return differs from the reward in that it implicitly considers future rewards.

In finite horizon cases, where $n < \infty$, the optimal policy $\mu$ is the one that maximises the expected return for each state $x \in \mathcal{X}$, giving the optimal $n$-step return $V^*_n(x) = \max_\mu V^\mu_n(x)$. Finite horizon cases can be seen as a special case of infinite horizon cases with zero-reward absorbing states [17]. For infinite horizon cases, the expected return when starting at state $x$ is, analogously to Eq. (9.1), given by
$$V^\mu(x) = \lim_{n \to \infty} E\left(\sum_{t=0}^{n-1} \gamma^t r^{\mu_t}_{x_t x_{t+1}} \;\Big|\; x_0 = x\right). \qquad (9.2)$$
The optimal policy is the one that maximises this expected return for each state $x \in \mathcal{X}$, and results in the optimal value function $V^*(x) = \max_\mu V^\mu(x)$. Therefore, knowing $V^*$, we can infer the optimal policy by

$$\mu^*(x) = \arg\max_{a \in \mathcal{A}} E\left(r_{xx'}(a) + \gamma V^*(x') \;|\; x, a\right). \qquad (9.3)$$

Thus, the optimal policy is given by choosing the action that maximises the expected sum of immediate reward and the discounted expected optimal return of the next state. This reduces our goal of finding the policy that maximises the reward in the long run to learning the optimal value function, which is the

approach taken by DP and RL. In fact, Sutton conjectures that

“All efficient methods for solving sequential decision problems determine (learn or compute) value functions as an intermediate step.”

which he calls the “Value-Function Hypothesis” [210].

In some cases, such as when we do not have a model of the transition function, we cannot evaluate the expectation in Eq. (9.3). Then, it is easier to work with the action-value function $Q : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ that estimates the expected return $Q(x,a)$ when taking action $a$ in state $x$, and which is for some policy $\mu$ defined by

$$Q^\mu(x,a) = \lim_{n \to \infty} E\left(r_{x_0 x_1}(a) + \gamma \sum_{t=1}^{n-1} \gamma^{t-1} r^{\mu}_{x_t x_{t+1}} \;\Big|\; x_0 = x, a\right) = E\left(r_{xx'}(a) + \gamma V^\mu(x') \;|\; x, a\right). \qquad (9.4)$$

We get $V^\mu$ from $Q^\mu$ by $V^\mu(x) = Q^\mu(x,\mu(x))$. Given that we know the optimal action-value function $Q^*$, getting the optimal policy $\mu^*$ is simplified from Eq. (9.3) to
$$\mu^*(x) = \arg\max_{a \in \mathcal{A}} Q^*(x,a), \qquad (9.5)$$
that is, by choosing the action $a$ in state $x$ that maximises the expected return given by $Q^*(x,a)$.

Note that $V^*$ and $Q^*$ are related by $V^*(x) = Q^*(x, \mu^*(x)) = \max_{a \in \mathcal{A}} Q^*(x,a)$. Combining this relation with Eq. (9.4) gives us Bellman's Equation

$$V^*(x) = \max_{a \in \mathcal{A}} E\left(r_{xx'}(a) + \gamma V^*(x') \;|\; x, a\right), \qquad (9.6)$$

which relates the optimal values of different states to each other, and finding its solution forms the core of DP. Similarly, Bellman's equation for a fixed policy $\mu$ is given by
$$V^\mu(x) = E\left(r^\mu_{xx'} + \gamma V^\mu(x') \;|\; x\right). \qquad (9.7)$$

9.1.3 Problem Types

The three basic classes of infinite horizon problems are stochastic shortest path problems, discounted problems, and average reward per step problems, a description of which can be found in [17, 78]. We will only consider discounted problems and stochastic shortest path problems, where for the latter we restrict ourselves to so-called proper policies that are guaranteed to reach the desired terminal state. As the analysis of stochastic shortest path problems is very similar to that of discounted problems, we only deal with discounted problems explicitly. These are characterised by $\gamma < 1$ and a bounded reward function to make the values $V^\mu(x)$ well defined.

9.1.4 Matrix Notation

Rather than representing the value function for each state explicitly, it is convenient to exploit the finiteness of $\mathcal{X}$ and collect the values for each state into a vector, which also simplifies the notation. Let $V = (V(x_1), \dots, V(x_N))^T$ be the vector of size $N$ that contains the values of the value function $V$ for each state $x_n$. Let $V^*$ and $V^\mu$ denote the vectors that contain the optimal value function $V^*$ and the value function $V^\mu$ for policy $\mu$, respectively. Similarly, let $P^\mu = (p^\mu(x_j|x_i))$ denote the transition matrix of the Markov chain for a fixed policy $\mu$, and let $r^\mu = (r^\mu_{x_1}, \dots, r^\mu_{x_N})^T$ be the vector consisting of the expected rewards when following this policy. With these definitions, we can rewrite Bellman's Equation for a fixed policy, Eq. (9.7), as

$$V^\mu = r^\mu + \gamma P^\mu V^\mu. \qquad (9.8)$$

We will use this notation extensively in further developments.
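To make the matrix notation concrete, the following minimal sketch solves Eq. (9.8) directly for $V^\mu = (I - \gamma P^\mu)^{-1} r^\mu$; the function name and the small example Markov chain are our own assumptions, not part of the thesis.

    import numpy as np

    def policy_value(P_mu, r_mu, gamma):
        # Solve Bellman's equation V = r + gamma * P * V for a fixed policy,
        # i.e. V = (I - gamma * P)^(-1) * r, as in Eq. (9.8).
        n = P_mu.shape[0]
        return np.linalg.solve(np.eye(n) - gamma * P_mu, r_mu)

    # A small two-state Markov chain induced by some fixed policy.
    P_mu = np.array([[0.9, 0.1],
                     [0.2, 0.8]])
    r_mu = np.array([1.0, 0.0])   # expected reward per state
    V_mu = policy_value(P_mu, r_mu, gamma=0.9)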

9.2 Dynamic Programming and Reinforcement Learning

Recall that in order to find the optimal policy µ∗, we aim at learning the optimal value function V ∗ by Eq. (9.6), or the optimal action-value function Q∗ for cases where the expectation in Eqs. (9.6) and (9.3) is hard or impossible to evaluate.

In this section we introduce some common methods in RL that learn these functions while traversing the state space, without building a model of the transition and reward function. These methods are simulation-based approximations to DP methods, and their stability is determined by the stability of the corresponding DP method. Hence, we first introduce the DP methods that they are based on, from which we derive the RL methods.

9.2.1 Dynamic Programming Operators

Bellman’s Equation (9.6) is a set of equations that cannot be solved analyti- cally. Fortunately, several methods have been developed that make finding its

solution, all of which are based on the DP operators T and Tµ.

The operator T is given a value vector V and returns a new value vector that

is based on Bellman’s Equation (9.6). The ith element (T V )i of the resulting vector T V is given by

$$(TV)_i = \max_{a \in \mathcal{A}} \sum_{x_j \in \mathcal{X}} p(x_j|x_i, a) \left( r_{x_i x_j}(a) + \gamma V_j \right). \qquad (9.9)$$

Similarly, for a fixed policy $\mu$ the operator $T_\mu$ is based on Eq. (9.7), and is given by
$$(T_\mu V)_i = \sum_{x_j \in \mathcal{X}} p^\mu(x_j|x_i) \left( r^\mu_{x_i x_j} + \gamma V_j \right), \qquad (9.10)$$
which, in matrix notation, is $T_\mu V = r^\mu + \gamma P^\mu V$.
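A minimal sketch of both operators for a finite MDP; the array layout, with one transition matrix and one reward matrix per action, and all names are our own choices (numpy is imported as np, as in the earlier sketch).

    def T(V, P, r, gamma):
        # DP operator T, Eq. (9.9). P[a] and r[a] are N x N arrays holding
        # p(x_j | x_i, a) and r_{x_i x_j}(a) for action a.
        q = np.stack([(P[a] * (r[a] + gamma * V[None, :])).sum(axis=1)
                      for a in range(len(P))])
        return q.max(axis=0)          # maximise over actions

    def T_mu(V, P_mu, r_mu_mat, gamma):
        # DP operator T_mu, Eq. (9.10), for a fixed policy mu.
        return (P_mu * (r_mu_mat + gamma * V[None, :])).sum(axis=1)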

Probably the most important property of both $T$ and $T_\mu$ is that they form a contraction mapping with respect to the maximum norm [17]; that is, given two arbitrary vectors $V$, $V'$, we have
$$\|TV - TV'\|_\infty \le \gamma \|V - V'\|_\infty, \quad \text{and} \qquad (9.11)$$
$$\|T_\mu V - T_\mu V'\|_\infty \le \gamma \|V - V'\|_\infty, \qquad (9.12)$$

where $\|V\|_\infty = \max_i |V_i|$ is the maximum norm of $V$. Thus, every update with $T$ or $T_\mu$ reduces the maximum distance between $V$ and $V'$ by at least the factor $\gamma$. Applying them repeatedly will therefore lead us to some fixed point

$TV = V$ or $T_\mu V = V$ which is, according to the Banach Fixed Point Theorem [233], unique.

Further properties of the DP operators are that the optimal value vector $V^*$ and the value vector $V^\mu$ for policy $\mu$ are the unique vectors that satisfy $TV^* = V^*$ and $T_\mu V^\mu = V^\mu$, respectively, which follows from Bellman's Equations (9.6) and (9.7). As these vectors are the fixed points of $T$ and $T_\mu$, applying the operators repeatedly causes convergence to these vectors, that is, $V^* = \lim_{n \to \infty} T^n V$ and $V^\mu = \lim_{n \to \infty} T_\mu^n V$ for an arbitrary $V$, where $T^n$ and $T_\mu^n$ denote $n$ applications of $T$ and $T_\mu$, respectively. A policy $\mu^*$ is optimal if and only if $T_{\mu^*} V^* = TV^*$. Note that there can be several optimal policies [17].

9.2.2 Value Iteration and Policy Iteration

The method of value iteration is a straightforward application of the contraction property of $T$ and is based on applying $T$ repeatedly to an initially arbitrary value vector $V$ until it converges to the optimal value vector $V^*$. Convergence can only be guaranteed after an infinite number of steps, but the value vector $V$ is usually already close to $V^*$ after a few iterations.
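Using the operator T sketched above, value iteration reduces to a short loop; the stopping tolerance and iteration cap are our own choices.

    def value_iteration(P, r, gamma, tol=1e-8, max_iter=10000):
        # Apply T repeatedly until the value vector stops changing.
        V = np.zeros(P[0].shape[0])
        for _ in range(max_iter):
            V_new = T(V, P, r, gamma)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new
        return V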

As an alternative to value iteration, policy iteration converges after a finite number of policy evaluation and policy improvement steps. Given a fixed policy $\mu_t$, policy evaluation finds the value vector for this policy by solving $T_{\mu_t} V^{\mu_t} = V^{\mu_t}$. The policy improvement step generates a new policy $\mu_{t+1}$ based on the current $V^{\mu_t}$, such that $T_{\mu_{t+1}} V^{\mu_t} = T V^{\mu_t}$. Starting with an initially random policy $\mu_0$, the sequence of policies $\{\mu_0, \mu_1, \dots\}$ generated by iterating policy evaluation and policy improvement is guaranteed to converge to the optimal policy within a finite number of iterations [17].

Various variants of these methods exist, such as asynchronous value iteration, which at each application of $T$ only updates a single state of $V$. Modified policy iteration performs the policy evaluation step by approximating $V^\mu$ by $T_\mu^n V$ for some small $n$. Asynchronous policy iteration mixes asynchronous value iteration with policy iteration by at each step either i) updating some states of $V$ by asynchronous value iteration, or ii) improving the policy of some set of states by policy improvement. Convergence criteria for these variants are given in [17].

9.2.3 Approximate Dynamic Programming

If $N$ is large, we prefer to approximate the value function rather than representing the value for each state explicitly. Let $\tilde{V}$ denote the vector that holds the value function approximations for each state, as generated by a function approximation technique as an approximation to $V$. Approximate value iteration is performed by approximating the value iteration update $V_{t+1} = TV_t$ by

$$\tilde{V}_{t+1} = \Pi T \tilde{V}_t, \qquad (9.13)$$
where $\Pi$ is the approximation operator that, for the used function approximation technique, returns the value function approximation $\tilde{V}_{t+1}$ that is closest to $V_{t+1} = T \tilde{V}_t$, that is, $\tilde{V}_{t+1} = \arg\min_{\tilde{V}} \|\tilde{V} - V_{t+1}\|$. As shown in [25], this procedure might, due to the nonlinearity of $T$, diverge even when used with the most common approximation architectures, such as linear or quadratic regression, locally weighted regression, or neural networks. As shown in [95], stability is guaranteed if the approximation is a non-expansion with respect to the maximum norm, that is, if for any two $V$, $V'$ the approximation operator $\Pi$ conforms to $\|\Pi V - \Pi V'\|_\infty \le \|V - V'\|_\infty$. This requirement is satisfied by the class of averagers, which contains "[. . . ] local weighted averaging, k-nearest neighbour, Bézier patches, linear interpolation, bilinear interpolation on a square (or cubical, etc.) mesh, as well as simpler methods like grids and other state aggregations." [95].
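To make the non-expansion property concrete, the following sketch performs approximate value iteration with a simple state aggregation, which is one of the averagers listed above; the aggregation map and all names are our own assumptions.

    def aggregation_operator(assign, n_states):
        # Averager Pi for a state aggregation: each state's value is replaced
        # by the mean value of its cluster; assign maps state index -> cluster id.
        Pi = np.zeros((n_states, n_states))
        for c in set(assign):
            members = [i for i, ci in enumerate(assign) if ci == c]
            Pi[np.ix_(members, members)] = 1.0 / len(members)
        return Pi

    def approximate_value_iteration(P, r, gamma, Pi, n_iter=1000):
        # Iterate V <- Pi T V, Eq. (9.13); stable here because Pi is a non-expansion.
        V = np.zeros(P[0].shape[0])
        for _ in range(n_iter):
            V = Pi @ T(V, P, r, gamma)
        return V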

Approximate policy iteration, on the other hand, has fewer stability problems, as the operator $T_\mu$ used for the policy evaluation step is linear. While the policy improvement step is performed as for standard policy iteration, the policy evaluation step is based on an approximation of $V^\mu$. As $T_\mu$ is linear, there are several possibilities for how to perform the approximation, which are outlined in [78, 195]. Here, we concentrate on the temporal-difference solution, which aims at finding the fixed point $\tilde{V}^\mu = \Pi T_\mu \tilde{V}^\mu$ by the update $\tilde{V}^\mu_{t+1} = \Pi T_\mu \tilde{V}^\mu_t$. This iteration has been analysed extensively for linear approximation architectures [196, 197]. Nonetheless, as $T_\mu$ forms a contraction with respect to the maximum norm, it is guaranteed to converge even if the approximation is nonlinear, as long as it forms a non-expansion with respect to the maximum norm.

9.2.4 Temporal-Difference Learning

Even though temporal-difference (TD) learning is an incremental method for policy evaluation that was initially developed in [211] as a modification of the Widrow-Hoff rule [237], we here describe only the TD($\lambda$) operator $T_\mu^{(\lambda)}$, as it forms the basis of SARSA($\lambda$) and gives us some necessary information about $T_\mu$. For more information on temporal-difference learning, we refer the interested reader to [78] and [17].

The temporal-difference learning operator $T_\mu^{(\lambda)}$ is parameterised by $0 \le \lambda \le 1$, and, when applied to $V$, results in [218]

$$(T_\mu^{(\lambda)} V)_i = (1 - \lambda) \sum_{m=0}^{\infty} \lambda^m E\left(\sum_{t=0}^{m} \gamma^t r^{\mu}_{x_t x_{t+1}} + \gamma^{m+1} V_{x_{m+1}} \;\Big|\; x_0 = x_i\right), \qquad (9.14)$$
for $\lambda < 1$. The definition for $\lambda = 1$ is given in [78]. The expectation in the above expression is equivalent to the $n$-step return $V^\mu_n$ of Eq. (9.1), which shows that the temporal-difference update is based on mixing returns of various lengths, where the mixing coefficients are controlled by $\lambda$. To implement the above update incrementally, Sutton uses eligibility traces that propagate current temporal differences to previously visited states [211].
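As an illustration of how eligibility traces implement this mixing of returns incrementally, here is a sketch of TD($\lambda$) policy evaluation; the transition-list data format, step size, and names are our own assumptions.

    def td_lambda(episodes, n_states, gamma, lam, alpha=0.1):
        # Incremental TD(lambda) policy evaluation with accumulating traces.
        # episodes: list of lists of (state, reward, next_state) transitions
        # generated by following the evaluated policy.
        V = np.zeros(n_states)
        for episode in episodes:
            z = np.zeros(n_states)                    # eligibility traces
            for s, r, s_next in episode:
                delta = r + gamma * V[s_next] - V[s]  # temporal difference
                z *= gamma * lam
                z[s] += 1.0
                V += alpha * delta * z                # propagate delta along traces
        return V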

Its most interesting property for our purpose is that $T_\mu^{(\lambda)}$ forms a contraction mapping with respect to the weighted norm $\|\cdot\|_D$, which is defined as given in Section 5.2, where the diagonal weight matrix $D$ is given by the steady-state distribution of the Markov chain $P^\mu$ that corresponds to policy $\mu$ [218, 17].

More formally, we have for any $V$, $V'$,

$$\|T_\mu^{(\lambda)} V - T_\mu^{(\lambda)} V'\|_D \le \frac{\gamma(1-\lambda)}{1-\gamma\lambda} \|V - V'\|_D \le \gamma \|V - V'\|_D. \qquad (9.15)$$
Note that $T_\mu \equiv T_\mu^{(0)}$, and therefore $T_\mu$ also forms a contraction mapping with respect to $\|\cdot\|_D$, which means that we can guarantee stability of approximate policy iteration even when the approximation architecture conforms to a non-expansion with respect to $\|\cdot\|_D$, rather than only $\|\cdot\|_\infty$.

9.2.5 SARSA(λ)

Coming to the first reinforcement learning algorithm, SARSA stands for State-Action-Reward-State-Action, as SARSA(0) requires only information on the current and next state/action pair and the reward that was received for the transition. Its name was coined by Sutton [212] for an algorithm that was developed in [194] in its approximate form, which is very similar to Wilson's ZCS [239], as discussed by Sutton and Barto in [213, Ch. 6.10].

It conceptually performs policy iteration and uses TD($\lambda$) to update its action-value function $Q$. More specifically, it performs optimistic policy iteration, where, in contrast to standard policy iteration, the policy improvement step is based on an incompletely evaluated policy. As the value update is based on the state trajectory of the current policy, this method is an on-policy method. A summary of its convergence properties with and without the use of function approximation is given in [78].

9.2.6 Q-Learning

The much-celebrated Q-Learning was developed by Watkins [231] as a result of combining TD-learning and DP methods. It is similar to SARSA(0), but rather than using the Q-value of the next state/action pair to update the Q-value of the last state/action pair, it uses the Q-value that would result from following a greedy policy, that is, by choosing the action that maximises the current estimate of the expected reward. Hence, Q-Learning is called an off-policy method.

For a sequence of states $\{x_0, x_1, \dots\}$ and actions $\{a_0, a_1, \dots\}$, the Q-values are updated by

$$Q_{t+1}(x_t,a_t) = Q_t(x_t,a_t) + \alpha_t \left( r_{x_t x_{t+1}}(a_t) + \gamma \max_{a \in \mathcal{A}} Q_t(x_{t+1},a) - Q_t(x_t,a_t) \right), \qquad (9.16)$$
where $\alpha_t$ denotes the step size at time $t$. Hence, the estimate for $Q_t(x_t,a_t)$ is updated towards $r_{x_t x_{t+1}}(a_t) + \gamma V^*_t(x_{t+1})$, where $V^*_t(x_{t+1}) = \max_{a \in \mathcal{A}} Q_t(x_{t+1},a)$ is the current estimate for the expected return of the next state $x_{t+1}$ when following a greedy policy. This shows that Q-Learning is an approximation to asynchronous value iteration that performs the update with the actual reward rather than its expectation. As a result of this, Q-Learning is guaranteed to converge to the optimal $Q^*$-values, given that all state/action pairs are visited an infinite number of times [232].
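A single tabular Q-Learning update then takes only a few lines; storing Q as a states-by-actions array is our own choice.

    def q_learning_step(Q, s, a, r, s_next, alpha, gamma):
        # One tabular Q-Learning update, Eq. (9.16).
        td_target = r + gamma * Q[s_next].max()   # greedy value of the next state
        Q[s, a] += alpha * (td_target - Q[s, a])
        return Q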

A variant of Q-Learning, called Q($\lambda$), is an extension that uses eligibility traces like TD($\lambda$) as long as it performs on-policy actions [232]. As soon as an off-policy action is chosen, all traces are reset to zero, as the off-policy action breaks the temporal sequence of predictions. Hence, the performance increase due to traces depends significantly on the policy that is used, but is usually marginal. In [76] we have shown that, when used in XCS, it performs even worse than standard Q-Learning.

As Q-Learning is a step-wise approximation of asynchronous value iteration, function approximations for which the latter diverges will very likely not work with Q-Learning either. This also applies to linear approximation architectures, for which Q-Learning was demonstrated to diverge in some cases [27].

9.3 Reinforcement Learning with LCS

Performing RL with LCS means using LCS to approximate the action-value function estimate. RL methods update this estimate incrementally, and we can only use LCS with RL if the LCS implementation can handle incremental learning. Additionally, while approximating the action-value function is a simple univariate regression task, the function estimate to approximate is non-stationary due to its sequential update. Thus, in addition to incremental learning, the LCS implementation needs to be able to handle non-stationary target functions.

In this section we will show how to derive Q-Learning with the LCS model as introduced in Chapter 4 to approximate the action-value function. The derivation is performed from first principles to make explicit the usually implicit design decisions. As we have not developed an incremental LCS, we assume the model structure $\mathcal{M}$ to be fixed and concentrate purely on how to update the model parameters $\theta$. In particular, we focus on the classifier parameter updates, as these are the most relevant with respect to reinforcement learning.

Even though we have derived Bayesian update equations in Chapter 7 that protect against overfitting, we will in this section fall back to the principle of maximum likelihood, as was done in Chapter 4, as it forms the basis for the incremental methods described in Chapters 5 and 6. Using the Bayesian LCS model for RL is postponed until an incremental implementation is available. Nonetheless, the underlying principles remain the same, and as given here, the update equations conform exactly to XCS(F). Thus, we explicitly show the design principles underlying the use of Q-Learning with XCS(F), which should add clarity to some of the confusion about how to implement gradient descent in XCS(F), as we will discuss in Section 9.3.6. Additionally, it allows us to derive a more accurate classifier update method for XCS(F) based on the noise precision estimation methods developed in Chapter 5.

Firstly, we introduce the approximation operator that describes how the LCS model according to Chapter 4 approximates the value function. This is followed by discussing how the principle of independent classifier training relates to how DP and RL update the value and action-value function estimates, which is essential for the use of LCS to perform RL. As Q-Learning is based on asynchronous value iteration, we first show how LCS can perform asynchronous value iteration and then derive two Q-Learning variants, one based on LMS, and the other on RLS. Finally, we relate our derivations to other work that has been performed on XCS(F) with gradient descent.

9.3.1 Approximating the Value Function

Given a value vector $V$, LCS approximate it by a set of $K$ localised models $\{\tilde{V}_k\}$ that are combined to form a global model $\tilde{V}$. The localised models are provided by the classifiers, and the mixing model is used to combine these into the global model.

Each classifier $k$ matches a subset of the state space that is determined by its matching function $m_k$, which returns for each state $x$ the probability $m_k(x)$ of matching it. Let us for now assume that we approximate the value function $V$ rather than the action-value function $Q$. Then, classifier $k$ provides the probabilistic model $p(V|x, \theta_k)$ that gives the probability of the expected return of state $x$ having the value $V$. Assuming linear classifiers, Eq. (5.3), this model is given by
$$p(V|x, \theta_k) = \mathcal{N}(V \,|\, w_k^T x, \tau_k^{-1}), \qquad (9.17)$$
where we assume $x$ to be the vector of size $D_X$ that represents the features of the corresponding input, $w_k$ denotes the weight vector of size $D_X$, and $\tau_k$ is the scalar non-negative noise precision. As shown in Eq. (5.10), following the principle of maximum likelihood results in the approximation

$$\tilde{V}_k = \Pi_k V, \qquad (9.18)$$

where $\Pi_k = X(X^T M_k X)^{-1} X^T M_k$ is the projection matrix that provides the matching-weighted maximum likelihood approximation to $V$, and $X$ and $M_k$ denote the state matrix, given by Eq. (3.3), and the diagonal matching matrix $M_k = \mathrm{diag}(m_k(x_1), \dots, m_k(x_N))$, respectively. Thus, $\Pi_k$ can be interpreted as the approximation operator for classifier $k$ that maps the value function vector $V$ to its approximation $\tilde{V}_k$.

Given the classifier approximations $\{\tilde{V}_1, \dots, \tilde{V}_K\}$, the mixing model combines them into a global approximation. For a particular state $x$, the global approximation is given by $\tilde{V}(x) = \sum_k g_k(x) \tilde{V}_k(x)$, where the functions $\{g_k\}$ are determined by the chosen mixing model. Possible mixing models and their training are discussed in Chapter 6, and we will only assume that the used mixing model honours matching by $g_k(x) = 0$ if $m_k(x) = 0$, and creates a weighted average of the local approximations by $g_k(x) \ge 0$ for all $x, k$, and $\sum_k g_k(x) = 1$ for all $x$. Thus, the global approximation $\tilde{V}$ of $V$ is given by

$$\tilde{V} = \Pi V, \quad \text{with} \quad \Pi V = \sum_k G_k \Pi_k V, \qquad (9.19)$$
where the $G_k$'s are diagonal $N \times N$ matrices that specify the mixing model and are given by $G_k = \mathrm{diag}(g_k(x_1), \dots, g_k(x_N))$. The approximation operator $\Pi$ in Eq. (9.19) defines how LCS approximate the value function, given a fixed model structure.
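Both operators can be written down directly; the sketch below assumes that the matching and mixing functions are available as per-state vectors, and all names are our own.

    def classifier_projection(X, m_k):
        # Pi_k = X (X^T M_k X)^(-1) X^T M_k, the matching-weighted projection
        # onto classifier k's linear model, Eq. (9.18). m_k holds the matching
        # probability of every state.
        M_k = np.diag(m_k)
        A = X.T @ M_k @ X
        return X @ np.linalg.solve(A, X.T @ M_k)

    def lcs_approximation(V, X, matching, mixing):
        # Global LCS approximation Pi V = sum_k G_k Pi_k V, Eq. (9.19).
        V_tilde = np.zeros_like(V)
        for m_k, g_k in zip(matching, mixing):
            V_tilde += g_k * (classifier_projection(X, m_k) @ V)
        return V_tilde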

9.3.2 Bellman’s Equation in the LCS Context

Any DP or RL method is based on relating the expected return estimate for the current state to the expected return estimate of any potential next state. This can be seen when inspecting Bellman’s Equation (9.6), where the value of V ∗(x) is related to the values V ∗(x′) for all x′ that are reachable from x.

Similarly, Q-Learning Eq. (9.16) updates the action-value Q(xt,at) by relating it to the action-value maxa∈A Q(xt+1,a) of the next state that predicts the highest expected return.

According to the LCS model as given in Chapter 4, each classifier models the value function over its matched area in the state space independently of the other classifiers. Let us consider a single transition from state x to state x′ by performing action a. Given that classifier k matches both states, it could

update its local model of the value function $\tilde{V}_k(x)$ for $x$ by relating it to its own local model of the value function $\tilde{V}_k(x')$ for $x'$. However, what happens if $x'$ is not matched by classifier $k$? In such a case we cannot rely on its approximation $\tilde{V}_k(x')$, as the classifier does not aim at modelling the value for this state. The most reliable model in such a case is in fact given by the global model $\tilde{V}(x')$.

Generally, we will use the global model for all updates, regardless of whether the classifier matches the next state or not. This is justified by the observation that the global model is on average more accurate than the local models, as we have established in Chapter 6. Based on this principle, we can reformulate

258 Bellman’s Equation V ∗ = T V ∗ for LCS with independent classifiers to

$$\tilde{V}^*_k = \Pi_k T \tilde{V}^* = \Pi_k T \sum_k G_k \tilde{V}^*_k, \qquad k = 1, \dots, K, \qquad (9.20)$$
where $\Pi_k$ expresses the approximation operator for classifier $k$, which does not necessarily need to describe a linear approximation. By applying $\sum_k G_k$ to both sides of the first equality of Eq. (9.20) and using Eq. (9.19), we get the alternative expression $\tilde{V}^* = \Pi T \tilde{V}^*$, which shows that Eq. (9.20) is in fact Bellman's Equation with LCS approximation. Nonetheless, we prefer to express this relation by Eq. (9.20), as it shows what the classifiers model rather than what the global model models. For a fixed model structure $\mathcal{M}$, any method that performs DP or RL with LCS should aim at finding the solution to Eq. (9.20).

9.3.3 Asynchronous Value Iteration with LCS

Let us describe approximate value iteration before we derive its asynchronous variant: as given in Section 9.2.3, approximate value iteration is performed by the iteration Vt+1 = ΠT Vt. Therefore, using Eq. (9.19), value iteration with LCS is given by the iteration

$$\tilde{V}_{k,t+1} = \Pi_k V_{t+1}, \quad \text{with} \quad V_{t+1} = T \sum_k G_{k,t} \tilde{V}_{k,t}, \qquad (9.21)$$
which has to be performed for each classifier separately. We have split the iteration into two components to show that we first find the updated value vector $V_{t+1}$ by applying the $T$ operator to the global model, and then approximate this value vector for each classifier separately. We have added the subscript $t$ to the mixing models $G_k$ to express that they depend on the current approximation and therefore change with each iteration. Note that the fixed point of Eq. (9.21) is the desired Bellman Equation in the LCS context, Eq. (9.20).

We get the elements of the updated value vector Vt+1 by Eqs. (9.21) and (9.9), which results in

$$V_{t+1}(x_i) = \max_{a \in \mathcal{A}} \sum_{x_j \in \mathcal{X}} p(x_j|x_i,a) \left( r_{x_i x_j}(a) + \gamma \sum_k g_{k,t}(x_j) \tilde{V}_{k,t}(x_j) \right), \qquad (9.22)$$
where $V_{t+1}(x_i)$ denotes the $i$th element of $V_{t+1}$, and $\tilde{V}_{k,t}(x_j)$ denotes the $j$th element of $\tilde{V}_{k,t}$. Subsequently, each classifier is trained by batch learning, based on $V_{t+1}$ and its matching function, as described in Section 5.2. This completes one iteration of LCS approximate value iteration.
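Combining the earlier sketches of the operator T and the classifier projection, one such iteration could look as follows (again a sketch under the same assumptions).

    def lcs_value_iteration_step(V_tilde_k, X, matching, mixing, P, r, gamma):
        # One iteration of Eq. (9.21): apply T to the mixed global model, then
        # re-approximate the resulting value vector with every classifier.
        V_global = sum(g_k * Vk for g_k, Vk in zip(mixing, V_tilde_k))
        V_target = T(V_global, P, r, gamma)                 # Eq. (9.22)
        return [classifier_projection(X, m_k) @ V_target for m_k in matching]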

The only modification introduced by the asynchronous variant is that rather than updating the value function for all states at once, a single state is picked per iteration, and the value function is updated for this state, as already described in Section 9.2.2. Let $\{x_{i_1}, x_{i_2}, \dots\}$ be the sequence of states that determines which state is updated at which iteration. Thus, in the $t$th iteration we compute $V_t(x_{i_t})$ by Eq. (9.22), which results in the sequence $\{V_1(x_{i_1}), V_2(x_{i_2}), \dots\}$ that can be used to incrementally train the classifiers by a method of choice from Section 5.3. For the asynchronous variant we cannot use batch learning anymore, as not all elements of $V_{t+1}$ are available at once.

9.3.4 Q-Learning by Least Mean Squares

So far we have operated on the value function estimate under the assumption that the transition and reward functions of the given problem are known. Q-Learning does not make this assumption and uses the action-value function instead. This needs to be reflected in the LCS model by redefining the input space to be the space of all state/action pairs. Thus, given state $x$ and action $a$, the matching probability of classifier $k$ is given by $m_k(x,a)$, and the approximation of its action-value by $\tilde{Q}_k(x,a)$. Mixing is also based on state and action, where the mixing coefficient for classifier $k$ is given by $g_k(x,a)$. This results in the global approximation of the action-value for state $x$ and action $a$ being given by

$$\tilde{Q}(x,a) = \sum_k g_k(x,a) \tilde{Q}_k(x,a). \qquad (9.23)$$

As described in Section 9.2.6, Q-Learning approximates asynchronous value iteration by performing the update of the action-value function with the actual reward rather than its expectation. The expectation that is referred to is the

part of Eq. (9.22) given by

$$E\left(r_{x_i x_j}(a) + \gamma \sum_k g_{k,t}(x_j) \tilde{V}_{k,t}(x_j) \;\Big|\; x_i, a\right) = \sum_{x_j \in \mathcal{X}} p(x_j|x_i,a) \left( r_{x_i x_j}(a) + \gamma \sum_k g_{k,t}(x_j) \tilde{V}_{k,t}(x_j) \right). \qquad (9.24)$$
The approximated value of the next state, $\tilde{V}_t(x_j) = \sum_k g_{k,t}(x_j) \tilde{V}_{k,t}(x_j)$, is replaced by the current estimate for the expected return of the next state $x_j$ when following a greedy policy thereafter, which is given by $\max_{a \in \mathcal{A}} \tilde{Q}_t(x_j,a)$, where

$\tilde{Q}_t(x_j,a)$ is the global model's action-value function estimate, Eq. (9.23). Thus, given a transition from $x_t$ to $x_{t+1}$ under action $a_t$, the estimate of the action-value for state $x_t$ and action $a_t$ is updated by

$$Q_{t+1}(x_t,a_t) = r_{x_t x_{t+1}}(a_t) + \gamma \max_{a \in \mathcal{A}} \tilde{Q}_t(x_{t+1},a). \qquad (9.25)$$

As shown in Section 5.1.3, the maximum likelihood weight parameter of a linear classifier model $\tilde{Q}_k(x,a) = w_k^T x$ can be found by solving the linear least squares problem Eq. (5.5). Applied to Q-Learning with LCS, each classifier $k$ aims at modelling the matched elements of the sequence $\{Q_1(x_0,a_0), Q_2(x_1,a_1), \dots\}$, and thus at time $t$ aims at finding the weight vector $w_k$ that minimises

$$\sum_{m=0}^{t} m_k(x_m,a_m) \left( w_k^T x_m - Q_{m+1}(x_m,a_m) \right)^2. \qquad (9.26)$$
We can now apply any incremental learning method for linear models to train the classifiers. Using the normalised least mean squared (NLMS) algorithm as described in Section 5.3.4, the weight vector estimate update for classifier $k$ is given by
$$\hat{w}_{k,t+1} = \hat{w}_{k,t} + \alpha m_k(x_t,a_t) \frac{x_t}{\|x_t\|^2} \left( Q_{t+1}(x_t,a_t) - \hat{w}_{k,t}^T x_t \right), \qquad (9.27)$$
where $\alpha$ denotes the step size, and $Q_{t+1}(x_t,a_t)$ is given by Eq. (9.25). As we will discuss in more detail in Section 9.3.6, this is the weight vector update of XCSF.

To get the noise variance of the model, we estimate it by the LMS algorithm, as described in Section 5.3.7. This results in the update equation

$$\hat{\tau}_{k,t+1}^{-1} = \hat{\tau}_{k,t}^{-1} + \alpha m_k(x_t,a_t) \left( \left( \hat{w}_{k,t+1}^T x_t - Q_{t+1}(x_t,a_t) \right)^2 - \hat{\tau}_{k,t}^{-1} \right),$$
where $\alpha$ is again the scalar step size, and $Q_{t+1}(x_t,a_t)$ is given by Eq. (9.25).
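As a sketch of how a single classifier could be updated with these two equations: the classifier object cl with fields w and var and a match function is our own construction, and q_target stands for the value from Eq. (9.25), computed from the global model.

    def lcs_q_update_nlms(cl, x, a, q_target, alpha):
        # NLMS weight update, Eq. (9.27), followed by the LMS noise variance update.
        m = cl.match(x, a)
        if m == 0.0:
            return
        err = q_target - cl.w @ x                  # error under the old weights
        cl.w = cl.w + alpha * m * x / (x @ x) * err
        err_new = cl.w @ x - q_target              # error under the new weights
        cl.var = cl.var + alpha * m * (err_new ** 2 - cl.var)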

9.3.5 Q-Learning by Recursive Least Squares

As we have shown in Chapter 5, incremental methods based on gradient descent might suffer from slow convergence rates. Thus, despite their higher computational and space complexity, methods based on directly tracking the least squares solution are to be preferred. In this section we will reformulate Q-Learning with LCS to use recursive least squares (RLS) and direct noise precision tracking.

The action-value function estimate we approximate is non-stationary, which we take into account by using the recency-weighted RLS variant that puts more weight on recent observations. This was not an issue for the NLMS algorithm, as it performs recency-weighting implicitly.

Minimising the recency-weighted variant of the sum of squared errors Eq. (9.26), the update equations are according to Section 5.3.5 given by

$$\hat{w}_{k,t+1} = \lambda^{m_k(x_t,a_t)} \hat{w}_{k,t} + m_k(x_t,a_t) \Lambda_{k,t+1}^{-1} x_t \left( Q_{t+1}(x_t,a_t) - \hat{w}_{k,t}^T x_t \right), \qquad (9.28)$$
$$\Lambda_{k,t+1}^{-1} = \lambda^{-m_k(x_t,a_t)} \Lambda_{k,t}^{-1} - m_k(x_t,a_t) \lambda^{-m_k(x_t,a_t)} \frac{\Lambda_{k,t}^{-1} x_t x_t^T \Lambda_{k,t}^{-1}}{\lambda^{m_k(x_t,a_t)} + m_k(x_t,a_t)\, x_t^T \Lambda_{k,t}^{-1} x_t}, \qquad (9.29)$$

where $Q_{t+1}(x_t,a_t)$ is given by Eq. (9.25), and $\hat{w}_{k,0}$ and $\Lambda_{k,0}^{-1}$ are initialised by $\hat{w}_{k,0} = 0$ and $\Lambda_{k,0}^{-1} = \delta I$, where $\delta$ is a large scalar. $\lambda$ determines the recency weighting, which is strongest for $\lambda = 0$, where only the last observation is considered, and deactivated when $\lambda = 1$. Even though more empirical experience is needed, $\lambda = 0.95$ might be a good starting point.

Using the RLS algorithm to track the least squares approximation of the action-values for each classifier allows us to directly track the classifier's model noise variance, as described in Section 5.3.7. More precisely, we track the sum of

squared errors, denoted by $s_{k,t}$ for classifier $k$ at time $t$, and can then compute the noise precision by Eq. (5.63). By Eq. (5.69), the sum of squared errors is updated by

$$s_{k,t+1} = \lambda^{m_k(x_t,a_t)} s_{k,t} + m_k(x_t,a_t) \left( \hat{w}_{k,t}^T x_t - Q_{t+1}(x_t,a_t) \right) \left( \hat{w}_{k,t+1}^T x_t - Q_{t+1}(x_t,a_t) \right), \qquad (9.30)$$
starting with $s_{k,0} = 0$.
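The corresponding RLS-based classifier update could be sketched as follows; as before, the classifier object and its field names (w, Lambda_inv, s) are our own assumptions.

    def lcs_q_update_rls(cl, x, a, q_target, lam=0.95):
        # Recency-weighted RLS update of one classifier, Eqs. (9.28)-(9.30).
        m = cl.match(x, a)
        if m == 0.0:
            return
        lam_m = lam ** m
        Li_x = cl.Lambda_inv @ x
        # Eq. (9.29): recency-weighted Sherman-Morrison update of Lambda^-1.
        cl.Lambda_inv = (cl.Lambda_inv
                         - m * np.outer(Li_x, Li_x) / (lam_m + m * x @ Li_x)) / lam_m
        err_old = cl.w @ x - q_target          # error under the old weights
        # Eq. (9.28): weight vector update using the updated Lambda^-1.
        cl.w = lam_m * cl.w - m * (cl.Lambda_inv @ x) * err_old
        err_new = cl.w @ x - q_target          # error under the new weights
        # Eq. (9.30): sum of squared errors, from which the noise precision follows.
        cl.s = lam_m * cl.s + m * err_old * err_new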

Recursive Least Squares has already been introduced in XCS in [143, 144], but has never been derived from first principles before. Furthermore, there are, to our knowledge, no studies that apply this XCS variant to sequential decision problems. We have already investigated the incremental noise precision update as given in this chapter for simple regression tasks [156], but it has not yet been applied to sequential decision tasks either.

9.3.6 XCS with Gradient Descent

Some recent work [47, 45] has caused a significant amount of confusion over how XCS performs Q-Learning, and how this can be enhanced by the use of gradient descent [226, 227, 141, 140]. In this section we aim at clarifying this issue, based on showing that Eq. (9.27) is exactly the update equation that is used by XCS(F). We have derived Eq. (9.27) by using the NLMS algorithm to

find the weight vector $w_k$ that minimises Eq. (9.26). This algorithm performs stochastic incremental gradient descent on the error function, as described in Section 5.3.4. Thus, Eq. (9.27) describes a gradient-based algorithm that makes each classifier model the action-value function, and consequently, XCS(F) already performs gradient descent and does not need to be modified. As XCSF is (besides the MAM update) equivalent to XCS if $D_X = 1$, we will only consider XCSF in the following discussion.

To show the equivalence between XCSF and Eq. (9.27), let us describe the operation of XCSF in more detail: upon arriving at state $x_t$, XCSF forms a match set that contains all classifiers for which $m_k(x_t,a) > 0$, independent of the action $a$. The match set is then partitioned into one subset per possible

action, resulting in $|\mathcal{A}|$ subsets. The subset associated with action $a$ contains all classifiers for which $m_k(x,a) > 0$, and for each of these subsets the action-value estimate $\tilde{Q}_t(x_t,a) = \sum_k g_k(x_t,a) \tilde{Q}_{k,t}(x_t,a)$ is calculated, resulting in the prediction vector $(\tilde{Q}_t(x_t,a_1), \dots, \tilde{Q}_t(x_t,a_{|\mathcal{A}|}))$ that predicts the expected return for the current state $x_t$ and each possible action that can be performed.

Based on this prediction vector, an action $a_t$ is chosen and performed, leading to the next state $x_{t+1}$ and reward $r_{x_t x_{t+1}}(a_t)$. The subset of the match set that promoted the chosen action becomes the action set, which contains all classifiers such that $m_k(x_t,a_t) > 0$. At the same time, a new prediction vector

$(\tilde{Q}_t(x_{t+1},a_1), \dots, \tilde{Q}_t(x_{t+1},a_{|\mathcal{A}|}))$ for state $x_{t+1}$ is formed, and its largest element is chosen, giving $\max_{a \in \mathcal{A}} \tilde{Q}_t(x_{t+1},a)$. Then, all classifiers in the action set are updated by the modified delta rule (which is equivalent to the NLMS algorithm) with the target value $r_{x_t x_{t+1}}(a_t) + \gamma \max_{a \in \mathcal{A}} \tilde{Q}_t(x_{t+1},a)$. The update in Eq. (9.27) uses exactly this target value, as given by Eq. (9.25), and updates the weight vector of each classifier for which $m_k(x_t,a_t) > 0$, which are the classifiers in the action set. This shows that Eq. (9.27) describes the weight vector update as it is performed in XCSF, and therefore XCS(F) performs gradient descent without any additional modification.
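Putting the pieces together, one such step could be sketched as below, reusing the NLMS classifier update from Section 9.3.4; for simplicity the mixing weights here are the normalised matching probabilities rather than XCS(F)'s fitness-based weights, so this is an illustration of the procedure, not XCS(F) itself.

    def lcs_q_learning_step(classifiers, x, a, r, x_next, actions, gamma, alpha):
        # One Q-Learning step with LCS function approximation.
        def q_hat(x_, a_):
            # Global prediction: matching-weighted mixture of classifier predictions.
            weights = [cl.match(x_, a_) for cl in classifiers]
            total = sum(weights)
            if total == 0.0:
                return 0.0
            return sum(m * (cl.w @ x_) for m, cl in zip(weights, classifiers)) / total

        q_target = r + gamma * max(q_hat(x_next, a_) for a_ in actions)  # Eq. (9.25)
        for cl in classifiers:      # only matching classifiers change (the action set)
            lcs_q_update_nlms(cl, x, a, q_target, alpha)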

The initial investigation on how to add gradient descent to XCS was done by Butz, Goldberg and Lanzi in [47, 45], based on Chapter 6.4 of [213]. There, the action-value update equation differs from the one used in XCS by an additional gradient term, given by the partial derivative of the action-value function approximation with respect to the approximation method's parameters. In the derivation of [47, 45] Butz et al. do not consider that classifiers are trained independently, and thus they aim at minimising

$$\sum_{m=0}^{t} \left( \tilde{Q}_t(x_m,a_m) - Q_{m+1}(x_m,a_m) \right)^2, \qquad (9.31)$$

t 2 m (x ,a ) Q˜ (x ,a ) Q (x ,a ) , (9.32) k m m k,t m m − m+1 m m m=0 X   for each classifier independently. XCS uses averaging classifiers with local ˜ ˜ ˜ models Qk(x,a) =w ˆk, and the global model Q(x,a) = k gk(x,a)Qk(x,a), where gk(x,a) is given by the normalised fitness of classifierP k. Thus, the par-

264 tial derivative of Q˜(x,a) with respect to the classifier parameter wk is in [47, 45] given by ∂Q˜(x,a) = gk(x,a), (9.33) ∂wk which is added to the parameter update to result in

wˆ =w ˆ + αm (x ,a )(Q˜ (x ,a ) wˆ )g (x ,a ). (9.34) k,t+1 k,t k t t t+1 t t − k,t k t t

However, considering that the classifiers are trained independently, one needs by Eq. (9.32) to add the partial derivate of Q˜k,t(x,a) with respect to wk, which is 1, rather than gk(x,a). Therefore, the update equation remains unchanged, which shows from a different perspective that XCS already performs gradient descent.

Subsequently, Wada et al. [226] correctly point out that the derivation in [47,

45] is incorrect, as when deriving the partial derivative ∂Q˜(x,a)/∂wk, it ignores that the value of gk(x,a) is a function of wk and thus gk(x,a) also needs to be differentiated. On the downside, they also ignore that the classifiers are trained independently and conclude that “[. . . ] XCS’s reinforcement process is shown to be inconsistent with Q-learning with FA [. . . ]” [227] and cannot be modified to conform to it.

In a further study [141], Lanzi and Loiacono suggest that the apparent gradient-term introduced for XCS in [47, 45] and for XCSF in [143] actually implements averaging RL rather than standard RL. Averaging RL are charac- terised by minimising

t 2 Q˜ (x ,a ) Q (x ,a ) x , (9.35) t m m − m+1 m m m m=0 X   instead of Eq. (9.31), where Q˜t(xm,am) is a linear model of xm. This makes them being member of the class of averagers (see Section 9.2.3) and therefore guarantees convergence if used for approximate value iteration [94]. In Lanzi and Loiacono’s derivation they do not consider the independent training of the classifiers, and characterise both XCS and XCSF as linear models, where the global model prediction is formed by a linear combination of the model parameters and some input-dependent coefficients. These coefficients, how- ever, are a function of the classifier fitness which subsequently also depend

265 on the model parameters. Therefore, the characterisation of XCS and XCSF as linear models is formally incorrect, as they are, in fact, nonlinear models. This also formally invalidates the characterisation of both systems as averaging RL methods.

Interestingly, the modifications introduced in [47, 45] actually improved the performance of XCS in sequential decision tasks, even though these modifica- tions are formally not gradient descent. The performance remains unchanged for XCSF, as reported in [143]. In [78] we have conjectured without providing empirical evidence that the performance increase in XCS can be attributed to a lack of generalisation. Independently, the same conjecture was later given in [141], and it was shown in a multiplexer task that the initial number of clas- sifiers is indeed larger than when using standard XCS. A further study [140] that aimed at testing our conjecture, however, shows that the generality of the classifiers is about equal when comparing the standard and modified XCS. The performance, on the other hand, was always worse for the modified version.

To summarise, we have shown that XCS(F) already performs gradient descent and do not need to be modified, in contrast to what is presented in [47, 45, 226, 227, 143, 141, 140]. Although the introduced modification seems to be a good heuristic as it is reported in [41, 45] to improve the performance significantly, in other studies it seems to decrease the performance [141, 140]. In any case, calling it “XCS(F) with gradient descent” is formally not correct. Overall, as there are no conclusive results about its usefulness and it does not fit correctly into a formulation of XCS(F), we question if it should be used in LCS.

9.4 Stability Issues

We have already pointed out in Section 9.2.3 that some combinations of DP with function approximation can lead to divergence, and RL suffers from the same issues, as it is a simulation-based approximation to DP. In fact, conver- gence of RL with value function approximation is frequently analysed (for ex- ample, [218, 17, 16, 130]) by showing that the underlying DP method is stable when used with this function approximation method, and that the difference between the RL and the DP method converges to zero over time.

266 In this section we investigate whether the LCS approximation architecture is stable when used with DP. While value iteration is certainly the most critical method, as Q-Learning is based on it, we will also discuss the use of LCS with policy iteration. We will not provide conclusive answers, but show initial results that can lead to such answers.

As we will discuss in the next section, the LCS model is trained on two levels, each of which can contribute to ensuring stability when using LCS with DP. Thus, we will discuss the question of stability for each of these layers sepa- rately.

9.4.1 LCS Training on the Structure and the Parameter Level

In order to provide an approximation to the action-value function, we want on one hand to find a good set of classifiers, and on the other hand to find the correct values for the classifier and mixing model parameters for that set of classifiers. In other words, we want to find a good model structure , and M then the correct model parameters θ for that model structure, as discussed in Chapter 3.

Learning the model structure means to improve it by removing, adding, and replacing classifiers in M. Before we can do that, we need to evaluate the quality of each classifier, which requires us to train the model parameters. Thus, LCS training is performed on the model structure level and the model parameter level. However, as improving the model structure requires having a good estimate of the model parameters, learning on the model structure level is always slower than on the parameter level. We will now discuss the stability of the interaction of DP with learning on each of these levels separately.

9.4.2 Stability on the Structure Learning Level

Divergence of DP with function approximation is expressed by the values of the value function estimate rapidly growing out of bounds (for example, [25]).

Let us assume that for some fixed LCS model structure, the parameter learning process diverges when used with DP, and that there exist model structures for which this is not the case.

Divergence of the parameters usually happens locally, that is, not for all classifiers at once. Therefore, we can detect it by monitoring the model error of single classifiers, which, for linear classifier models as given in Chapter 5, would be the model noise variance. Subsequently, divergent classifiers can be detected and replaced until the model structure allows the parameter learning to be compatible with the used DP method.

XCSF uses linear classifier models and Q-Learning, but such combinations are known to be unstable [25]. However, to our knowledge, XCSF has never been reported to show divergent behaviour. Thus, we conjecture that it provides stability on the model structure level by replacing divergent classifiers with potentially better ones.

Would the classifier set optimality criterion that we have introduced in Chap- ter 7 also provide us with a safeguard against divergence at the model struc- ture level; that is, would divergent classifiers be detected? In contrast to XCS(F), the criterion we have presented does not assume a classifier to be a bad local model as soon as its model error is above a certain threshold. Rather, the localisation of a classifier is inappropriate if its model is unable to capture the apparent pattern that is hidden in the noisy data. Therefore, it is not im- mediately clear if the criterion would detect the divergent model as a pattern that the classifier cannot model, or if it would assume it to be noise.

In any case, providing stability on the model structure level is to repair the problem of divergence after it occurred, and relies on the assumption that changing the model structure does indeed provide us with the required sta- bility. We do not consider this a satisfactory solution and rather focus on pre- venting the problem from occurring at all, as discussed in the next section.

9.4.3 Stability on the Parameter Learning Level

Given a fixed model structure M, we aim at providing parameter learning that is guaranteed to converge when used with DP methods. Recall from Section 9.2.3 that both value iteration and the policy evaluation step of policy iteration are guaranteed to converge if the function approximation forms a non-expansion with respect to the maximum norm ‖·‖∞. As by Section 9.2.4, approximate policy evaluation of policy µ is also stable if the function approximation forms a non-expansion with respect to the weighted norm ‖·‖D, where D is given by the steady-state probabilities of the Markov chain Pµ associated with µ. Let us now discuss if LCS provide the required non-expansion property, first with respect to ‖·‖∞, and then with respect to ‖·‖D.

Observe that having a single classifier that matches all states is a valid model structure. In order for this model structure to provide a non-expansion, the classifier model itself must form a non-expansion. Therefore, to ensure that the LCS model provides the non-expansion property for any model structure, every classifier model needs to form a non-expansion, and any mixture of a set of localised classifiers that forms the global LCS model needs to form a non-expansion as well. Formally, if ‖·‖ denotes the norm in question, we need

\|\Pi V - \Pi V'\| \le \|V - V'\|    (9.36)

to hold for any two V, V', where ΠV = Σk GkΠkV by Eq. (9.19). For a single classifier k = 1 that matches all states we have G1 = I and consequently ΠV = Π1V, as the Gk's need to satisfy Σk Gk = I. Thus, the classifier approximation Πk also needs to satisfy

\|\Pi_k V - \Pi_k V'\| \le \|V - V'\|    (9.37)

for any two V, V'. Let us now consider separately for ‖·‖∞ and ‖·‖D whether these requirements hold. Detailed derivations for the results given here are not replicated, as they require the introduction of further concepts that are not directly related to LCS. The derivations with all details can be found in [78, 79, 80].

Non-expansion with respect to ‖·‖∞

In [79, 80] we have analysed the convergence of value iteration in LCS when averaging classifiers, characterised by inputs x = 1, are used. We have shown that the approximation Πk of such classifiers forms a non-expansion with respect to ‖·‖∞, and consequently guarantees stability at the classifier level. Furthermore, if the mixing weights Gk are stationary, then we have shown the global model Π also to provide such a non-expansion, expressed by Eq. (9.36) with respect to ‖·‖∞. As a result, we can guarantee convergence of value iteration with LCS, given that the mixing weights are stationary. We have also made some progress for non-stationary mixing weights in the same study, but final results are still pending. Nonetheless, we have shown by a counterexample that a non-expansion is not guaranteed for arbitrary mixing weights, showing that the choice of mixing model matters when considering the stability of RL with LCS.
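As an illustration of what this property means in practice, the following sketch (not taken from [79, 80]; the state space, names, and the simple 0/1 matching are illustrative assumptions) builds the approximation of a single averaging classifier over a small discrete state space and checks numerically that it is a non-expansion with respect to the maximum norm:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 10                                          # number of states
    m = rng.integers(0, 2, size=N).astype(float)    # 0/1 matching vector of classifier k
    m[0] = 1.0                                      # ensure at least one matched state

    def averaging_projection(V, m):
        # An averaging classifier predicts, for every matched state, the
        # matching-weighted mean of the value vector over the matched states.
        return m * ((m / m.sum()) @ V)

    for _ in range(1000):
        V, Vp = rng.normal(size=N), rng.normal(size=N)
        lhs = np.max(np.abs(averaging_projection(V, m) - averaging_projection(Vp, m)))
        rhs = np.max(np.abs(V - Vp))                # maximum norm of the difference
        assert lhs <= rhs + 1e-12                   # non-expansion w.r.t. the max norm

An analogous check with straight-line classifier models can violate this inequality, which is the instability discussed next.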

Considering general linear classifier models, it was shown by Gordon [95] that even simple straight line models do not form the required non-expansion, which possibly leads to divergence when used with value iteration, as demon- strated by Boyan in [25]. Given that each classifier model in LCS needs by Eq. (9.37) to form a non-expansion, using general linear classifier models does not allow us to guarantee their stability when used with value iteration. A possible alternative would be to use averaging RL [94] at the classifier level, but whether that can guarantee the global model to also form a non-expansion still needs to be investigated.

Non-expansion with respect to ‖·‖D

Considering ‖·‖D, the steady-state distribution D of the Markov chain Pµ can be sampled from by performing actions according to policy µ, and thus by following this Markov chain. In such a case, it was shown in [218] that linear models perform a non-expansion with respect to the required norm, as the observations that the model is fitted to are distributed according to D. We have shown in [78] that this property also holds if classifiers do not match all

states, that is (Lemma 4.3 in [78], using Jensen's Inequality [234])

\|\Pi_k V - \Pi_k V'\|_D \le \|V - V'\|_{D_k} \le \|V - V'\|_D \le \|V - V'\|    (9.38)

for any two V, V', where Dk = DMk is the steady-state matrix augmented by

the matching matrix Mk = diag(mk(x1), ..., mk(xN)) of classifier k. Thus, we can guarantee stability at the classifier level for policy evaluation for any form of linear classifier model, not only averaging classifiers.

In addition, we have shown in [78] that for specific stationary mixing weights the global LCS model also forms a non-expansion with respect to ‖·‖D and therefore is stable when used with policy evaluation. Results for non-stationary mixing models that readjust themselves to the changing classifier approximation are still pending, but we have taken first steps in this direction in the same study by reformulating the policy evaluation update as a matrix iteration. In that way, the problem is shifted from analysing the non-expansion property of the global model to analysing the magnitude of the eigenvalues of the core matrix of the iteration, resulting in sufficient and necessary conditions for convergence of the iteration.

Consequences for XCS and XCSF

Both XCS and XCSF use Q-Learning as their reinforcement learning compo- nent. To show that this combination is stable, the first step to take is to show the stability of value iteration with the model structure underlying XCS and XCSF.

XCS uses averaging classifiers, which we have shown to be stable with value iteration. Stability at the global model level is very likely, but depends on the mixing model, and definite results are still pending.

XCSF in its initial implementation [243, 244], on the other hand, uses classifiers that model straight lines, and such models are known to be unstable with value iteration. Thus, XCSF does not even guarantee stability at the classifier level, and therefore neither at the global model level. As previously mentioned, we conjecture that XCSF provides its stability at the structure learning level

instead, but we do not see that as a satisfactory solution. Instead, one should aim at replacing the classifier models such that stability at the parameter level can be guaranteed. Averaging RL seems to be a good starting point, but how exactly this can be done is a topic of further investigation.

9.5 Long Path Learning

The problem of long path learning is to find the optimal policy in sequential decision tasks when the solution requires learning of action sequences of sub- stantial length. As identified by Barry [11, 12], XCS struggles with such tasks due to the generalisation method that it uses.

A solution to this problem is proposed in [12], but this solution is only de- signed to work for a particular problem class, as we will show after having described why standard XCS fails at long path learning tasks. We also discuss if the classifier set optimality criterion from Chapter 7 might choose a suit- able set of classifiers to handle such tasks, but conclude that general long path learning remains an open problem.

9.5.1 XCS and Long Path Learning

Consider the problem that is shown in Figure 9.1. The aim is to find the policy

that reaches the terminal state x6 from the initial state x1a in the shortest number of steps. In RL terms, this aim is described by giving a reward of 1 upon reaching the terminal state, and a reward of 0 for all other transitions³. The optimal policy is to alternatingly choose actions 0 and 1, starting with action 1

in state x1a.

The optimal value function V* over the number of steps to the terminal state is shown in Figure 9.2(a) for a 15-step corridor finite state world. As can be seen,

³ More precisely, we have modelled the reward 1 that is received upon reaching the terminal state by adding a transition that, independent of the chosen action, leads from the terminal state to an absorbing state and is rewarded by 1. Each transition from the absorbing state leads to itself, with a reward of 0.


Figure 9.1: A 5-step corridor finite state world. The circles represent the states of the problem, and the arrows the possible state transitions. The numbers next to the arrows are the actions that cause the transitions, showing that the only available actions are 0 and 1. The state x1a is the initial state in which the task starts, and the square state x6 is the terminal state in which the task ends.

the difference of the values of V* between two adjacent states decreases with the distance from the terminal state.
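For concreteness, the following sketch (state encoding and helper names are my own, and the reward of 1 is granted directly on the transition into the terminal state rather than through the extra absorbing state of the footnote) builds an n-step corridor finite state world with the transition structure of Figure 9.1 and computes V* by value iteration with γ = 0.9:

    import numpy as np

    def corridor_optimal_values(n_steps=15, gamma=0.9):
        # States 0..n_steps-1 are x_1a..x_na, states n_steps..2n_steps-1 are
        # x_1b..x_nb, and state 2*n_steps is the terminal state.
        n, terminal = 2 * n_steps + 1, 2 * n_steps
        nxt = np.full((n, 2), terminal, dtype=int)   # successor per (state, action)
        rew = np.zeros((n, 2))                       # reward per (state, action)
        for i in range(n_steps):
            good = 1 if i % 2 == 0 else 0            # correct actions alternate 1, 0, 1, ...
            advance = i + 1 if i + 1 < n_steps else terminal
            nxt[i, good] = advance                   # correct action advances along the corridor
            nxt[i, 1 - good] = n_steps + i           # wrong action side-steps to x_ib
            nxt[n_steps + i, :] = advance            # both actions leave the side state
            rew[i, good] = 1.0 if advance == terminal else 0.0
            rew[n_steps + i, :] = 1.0 if advance == terminal else 0.0
        V = np.zeros(n)
        for _ in range(10_000):                      # value iteration until convergence
            V_new = np.max(rew + gamma * V[nxt], axis=1)
            V_new[terminal] = 0.0
            if np.max(np.abs(V_new - V)) < 1e-12:
                break
            V = V_new
        return V[:n_steps]                           # optimal values of the x_ia states

    print(corridor_optimal_values())

The printed values reproduce the shape in Figure 9.2(a): the gap between neighbouring states shrinks with the distance from the terminal state.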

Recall that, as described in Section 7.1.1, XCS seeks classifiers that feature a mean absolute error of ǫ0, where ǫ0 is the same for all classifiers. Thus, with increasing ǫ0, XCS will start generalising over states that are further away from the terminal state, as due to their similar value they can be modelled by a single classifier while keeping its approximation error below ǫ0. On the other hand,

ǫ0 cannot be set too small, as otherwise the non-stationarity of the function to model would make all classifiers seem inaccurate. Generalising over states xia for different i's, however, causes the policy in these areas to be sub-optimal, as choosing the same action in two subsequent steps in the corridor finite state world causes at least one sidestep to one of the xib states⁴.

To summarise, XCS struggles in learning the optimal policy for tasks where the difference in value function between two successive states is very small and might be modelled by the same classifier, and where choosing the same action for both states leads to a sub-optimal policy. The problem was identified and demonstrated by means of different-length corridor finite state worlds in [11, 12]. We continue by discussing the modification to XCS that was proposed in the same study to handle this problem.

⁴ It should be noted that while the classifiers in standard implementations of XCS(F) can match several states, they always match, and thus promote, a single action.


Figure 9.2: Plots showing the optimal value function for the 15-step corridor finite state world for γ = 0.9. The value function in (a) results from describing the task by giving a single reward 1 upon reaching the terminal state, and a reward of 0 for all other transitions. In (b) the values are based on a task description that gives a reward of −1 for all transitions. Note that in both cases the optimal policy is the same, but in (a) all values are positive, and in (b) they are negative.

9.5.2 Using the Relative Error

Two preliminary approaches were proposed in [12] to handle the problem of long path learning in XCS, both based on making the error calculation of a classifier relative to its prediction of the value function. The first approach is to estimate the distance of the matched states to the terminal state and scale the error accordingly, but this approach suffers from the inaccuracy of predicting this distance.

A second, more promising alternative proposed in this study is to scale the measured prediction error by the inverse absolute magnitude of the prediction. The underlying assumption is that the difference in optimal values between two successive states is proportional to the absolute magnitude of these values, as can be seen in Figure 9.2(a). Consequently, the relative error is larger for states that are further away from the terminal state, and overly general classifiers are identified as such. This modification allows XCS to find the optimal policy in the 15-step corridor finite state world, which it fails to do without the modification.
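A minimal sketch of this relative-error idea (the function name is illustrative, and the small constant guarding against division by zero is my addition):

    import numpy as np

    def relative_error(prediction, target, eps=1e-6):
        # Scale the absolute prediction error by the inverse absolute magnitude
        # of the prediction, so that the same absolute error counts as more
        # severe where the predicted values themselves are small.
        return np.abs(prediction - target) / np.maximum(np.abs(prediction), eps)

    # The same absolute error of 0.01 close to and far from the terminal state:
    print(relative_error(0.95, 0.96))   # close to the goal: small relative error
    print(relative_error(0.05, 0.06))   # far from the goal: large relative error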

9.5.3 Where it Fails

The problem of finding the shortest path to the terminal state can also be defined differently: rather than giving a single reward of 1 upon reaching the terminal state, one can alternatively punish each transition with a reward of −1. As the reward is to be maximised, the number of transitions is minimised, and therefore the optimal policy is the same as before. Figure 9.2(b) shows the optimal value function for the modified problem definition.

Observe that, in contrast to Figure 9.2(a), all values of V* are negative or zero, and their absolute magnitude grows with the distance from the terminal state. The difference in magnitude between two successive states, on the other hand, still decreases with the distance from the terminal state. This clearly violates the assumption that this difference is proportional to the absolute magnitude of the values, as the modified problem definition causes exactly the opposite pattern. Hence, the relative error approach from [12] will certainly fail, as it was not designed to handle such cases.

To create a task where the relative error measure fails we had to redefine the problem such that the value function takes exclusively negative values. While it might be possible to do the opposite and redefine each problem such that it conforms to the assumption that the relative error measure is based on, we rather seek for alternative approaches that do not require us to modify the problem definition – which might not even be possible in all cases.

9.5.4 A Possible Alternative?

We have shown in Section 8.3.4 that, in contrast to XCS(F), the optimality cri- terion that we have introduced in Chapter 7 is able to handle problems where the noise differs in different areas of the input space. Given that it is possible to use this criterion in an incremental implementation, will such an implemen- tation be able to perform long path learning?

As previously discussed (see Sections 5.1.2 and 7.2.2), a linear classifier model attributes all observed deviation from its linear model to measurement noise

(implicitly including the stochasticity of the data-generating process). In reinforcement learning, an additional component of stochasticity is introduced by updating the value function estimates, which makes them non-stationary. Thus, in order for the LCS model to provide a good representation of the value function estimate, it needs to be able to handle both the measurement noise and the update noise, a differentiation that is absent in [11, 12].

A detailed investigation of the optimality criterion still needs to be performed, but let us for now assume that the size of the area of the input space that is matched by a classifier is proportional to the level of noise in the data, such that the model is refined in areas where the observations are known to accu- rately represent the data-generating process. Considering only measurement noise, when applied to value function approximation this would lead to hav- ing more specific classifiers in states where the difference in magnitude of the value function for successive states is low, as in such areas this noise is deemed to be low. Therefore, we expect an incremental LCS that implements our op- timality criterion to provide an adequate value function approximation of the optimal value function, even in cases where long action sequences need to be represented.

Also considering update noise, its magnitude is related to the magnitude of the optimal value function, as demonstrated in Figure 9.3. Therefore, the noise appears to be largest where the magnitude of the optimal value function is large. Due to this noise, the model in such areas will most likely be coarse. With respect to the corridor finite state world, for which the optimal value function is shown in Figure 9.2(b), this would have the effect of providing an overly coarse model for states that are distant from the terminal state, and thus might cause the policy to be sub-optimal, just as in XCS. However, this depends heavily on the dynamic interaction between the RL method and the incremental LCS implementation. Thus, we cannot make definite statements before such an implementation is available.
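The update noise referred to here can be measured by the procedure used for Figure 9.3: run value iteration from the zero value vector, record the value vector after every sweep, and take the per-state variance over all recordings. A sketch of that measurement, assuming `nxt` and `rew` arrays of shape (number of states, number of actions) like those built in the corridor-world sketch above:

    import numpy as np

    def update_noise_variance(nxt, rew, gamma=0.9, tol=1e-12, max_sweeps=10_000):
        # Run value iteration from the zero vector, store the value vector after
        # every sweep, and return the per-state variance over all stored vectors.
        V = np.zeros(nxt.shape[0])
        history = [V.copy()]
        for _ in range(max_sweeps):
            V_new = np.max(rew + gamma * V[nxt], axis=1)
            history.append(V_new.copy())
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        return np.var(np.vstack(history), axis=0)

For the reward-of-−1 formulation this variance grows with the distance from the terminal state, which is the pattern visible in Figure 9.3(b).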

Overall, our optimality criterion seems to be a promising approach to handle long path learning in LCS when considering only measurement noise. Given the additional update noise, however, our criterion might suffer from the same problems as the approach based on the relative error. The significance of this influence cannot be evaluated before an incremental implementation is available. Alternatively, it might be possible to seek RL approaches that allow for the differentiation between measurement and update noise, which would make it possible for the model itself to concentrate on the measurement noise only. Whether such an approach is feasible still needs to be investigated.


Figure 9.3: Update noise variance for value iteration performed on the 15-step corridor finite state world. Plot (a) shows the variance when a reward of 1 is given upon reaching the terminal state, and 0 for all other transitions. Plot (b) shows the same when rewarding each transition with −1. The states are enumerated in the order x1a, x1b, x2a, ..., x15b, x16. The noise variance is determined by initialising the value vector to 0 for each state, and storing the value vector after each iteration of value iteration, until convergence. The noise variance is the variance of the values of each state over all iterations. It clearly shows that this variance is higher for states which have a larger absolute optimal value. The optimal values are shown in Figure 9.2.

9.6 Discussion and Summary

Despite sequential decision tasks being the prime motivator for LCS, they are still the ones which LCS handle least successfully. We have shown in this chapter how to approach them from first principles from the RL perspective, and have dealt with some of the current issues. The introduction to DP and RL is required to put LCS in their context, as we have done by deriving Q-Learning with LCS as an incremental approximation to asynchronous value iteration.

An essential concept that is often ignored is that LCS with independently trained classifiers updates the value function approximation of each classifier

based on the value function approximation of the global classifier model. Based on this concept, the resulting Q-Learning update equations are exactly the ones used by XCS(F). That these are already performing gradient descent without any modifications to the update equations was shown in Section 9.3.6, together with a critical discussion of the large body of work that was done in an attempt to fix what is not broken; that is, by trying to introduce gradient descent to XCS(F).

Regarding convergence we have not yet provided conclusive answers, but have shown how to perform first but important steps in their direction. The analysis is based on investigating if the functional form of the LCS model gives a non-expansion with respect to some norm. This property is only required to ensure stability at the parameter learning level, and even though we could instead concentrate on providing stability on the model structure level by removing diverging classifiers, we prefer to avoid divergent behaviour from the start rather than repairing the model once divergence occurs.

Related to stability is also the issue of learning long action sequences, which was shown to cause problems in XCS due to its accuracy definition. While a preliminary modification to XCS was proposed in [12] to solve this prob- lem, we have shown that this modification only gives the desired result if a certain assumption about the value function holds, which is not the case for all sequential decision tasks. Our previously introduced optimality criterion, however, might provide a more stable solution, even though some issues re- garding overly general classifiers need to be investigated once an incremental implementation is available.

Overall, using LCS to approximate the value or action-value function in RL is appealing, as LCS dynamically adjust to the form of this function and thus might provide a better approximation than standard function approximation techniques. It should be noted, however, that the field of RL is moving quickly, and that Q-Learning is by far not the best method that is currently available. Hence, in order for LCS to be a competitive approach to sequential decision tasks, they also need to keep pace with new developments in RL.

One significant development in RL that was largely ignored by LCS is to intelligently control the explore/exploit behaviour. Rather than always exploiting

current knowledge to choose the seemingly optimal action, sometimes it is necessary to choose sub-optimal actions to increase that knowledge and explore all parts of the state space [231, 213]. Traditionally, exploration was performed randomly at regular intervals, but new methods [127, 28, 207] allow for more efficient exploration strategies that find the optimal policy more rapidly. In fact, these methods have been shown to be very efficient in the Probably Approximately Correct (PAC) sense (for example, [207]). Unfortunately, all of them require a model of the reward and transition function, and thus they have a larger space complexity than model-free RL methods [208]. Nonetheless, methods that perform intelligent exploration currently outperform all other RL methods [150]. Recently, a model-free method with intelligent exploration has also become available [208], but according to [150], it performs "really slow" when compared to model-based alternatives.

A recent LCS approach aimed at providing such intelligent exploration [168], but without considering the RL techniques that are available. These techniques could be used in LCS even without having models of the transition and reward function by proceeding in a manner similar to [68], and building a model at the same time as using it to learn the optimal policy. Anticipatory LCS [205, 48, 40, 89, 88] are already used to model at least the transition function, and can easily be modified to include the reward function. Such an LCS has already been developed in [90, 87] to perform model-based RL, but as it uses heuristics rather than evolutionary computation techniques for model structure search, some LCS workers did not consider it as being an LCS. In any case, having such a model opens the path towards using new exploration control methods to improve their efficiency and performance.

Bayesian RL also provides some form of intelligent exploration by explicitly modelling the uncertainty about the true values of the value function by a probability distribution that is updated when further information becomes available (for example, [209]). This has the advantage that exploration and exploitation are elegantly controlled by defining the aim as minimising the uncertainty of the model. Obviously, Bayesian RL is only possible when a Bayesian model of the value function is available. This was first exploited in [1], where Bayesian classifier models were combined by the standard XCS(F) procedure to choose actions that provide the largest increase of information, based on Bayesian Q-Learning [67]. Having provided a fully Bayesian model for LCS

in Chapter 7, this approach could be extended to also use a Bayesian formulation of the mixture model, in addition to using other, more recent, Bayesian RL methods.

In summary, it is obvious that there is still plenty of work to be done in order for LCS to provide the same formal development as RL currently does. Nonetheless, we have in this chapter provided some initial formal basis upon which we, and hopefully other researchers, can build further analysis and improvements for using LCS to handle sequential decision tasks effectively, competitively, and with high reliability.

Chapter 10

Future Work and New Horizons

Throughout the previous chapters we have frequently alluded to topics of future work that arise from the developments we have presented. In this chapter we summarise some of these topics and provide a more in-depth discussion of how they might be approached. Additionally, we present some paths along which LCS can proceed that have not been within reach before.

The order in which the topics are presented might seem arbitrary at first, but it was chosen such that later topics rely on the work of topics that are presented earlier in this chapter. The exact relation between the topics and a suggested sequence in which they are handled is given in the concluding section of this chapter.

10.1 The Optimal Set of Classifiers

In Chapter 7 we have defined the optimal set of classifiers as being the one that provides the best model structure for the given data. Using Bayesian model selection, the best model structure is the one that maximises p(M|D), that is, its probability density given the data.

Having a formal optimality criterion allows us for the first time to approach questions like "Does the optimal set of classifiers contain overlapping classifiers?" and

"Are default hierarchies ever an optimal representation for a solution?" from a formal rather than an intuitive point-of-view. Furthermore, we can aim at finding theorems that apply to a wide range of problem types rather than providing empirical evidence for a small problem set.

Additionally, gathering knowledge of the properties of the optimal set of clas- sifiers allows us to redesign the model structure search: it should aim at pro- viding sets of classifiers that conform to these properties. Therefore, revealing properties of the optimal set has the twofold purpose of giving us a better intuitive understanding of the optimal set and allowing us to increase the effi- ciency of the model structure search, as we discuss further below.

10.1.1 Determining Optimality

Our optimality criterion is based on the probabilistic LCS model that is summarised in Table 7.2. To get p(M|D) for a given M and D, we first have to marginalise the joint probability density of data and model parameters, p(D, θ|M), over the parameters to get the model evidence p(D|M) by Eq. (7.2). This allows us to get the model probability density by Eq. (7.1).

Given that we want to compare model structures M and M' with an equal number of classifiers K, we have p(M) = p(M') by Eq. (7.4), and therefore it is sufficient by Eq. (7.1) to compare the model structures based on the model evidence rather than the model probability. More explicitly, the set of classifiers M is at least as good as M' if p(D|M) ≥ p(D|M'). If this holds for any M', then M is optimal for a fixed K.

Getting p(D|M) is more complex and requires working through the probabilistic formulation of the model, including its priors. How this can be done has not been investigated yet. Alternatively, one can resort to comparing the variational bound L(q) of different models by Eq. (7.21), which gives a lower bound on p(D|M). Still, it should always be kept in mind that L(q) is an approximation, and results that are based on it might not reflect the true nature of p(D|M) in all its details.

10.1.2 Overlapping Classifiers

One significant question that we might be able to answer formally is “Is an overlapping set of classifiers ever better than its non-overlapping counter- part?”. Overlapping classifiers are classifiers that match inputs that are also matched by other classifiers. The non-overlapping counterpart of such a set is one where inputs that are previously matched by several classifiers are as- signed to only one of them.

Assuming that the optimal set is always non-overlapping, an overlapping set of classifiers combines several hypotheses of how to partition the input space. The model structure search has the role of comparing these hypotheses and deciding on the best one. Thus, having knowledge of some of the properties of the optimal set allows us to design more efficient model structure search algorithms that aim towards classifier sets that are non-overlapping.

Allowing for matching by degree such that 0 < mk(x) < 1 for some k and x makes finding an intuitive understanding about what “overlapping” means complicated. Hence, the question is best approached by firstly restricting one- self to 0/1 matching. Furthermore, it is sufficient to show that given a par- ticular overlapping set of classifiers, removing the overlap for a single input increases the model evidence in all cases. Consequently, the optimal set must be non-overlapping.

To show the opposite it is sufficient to provide a single example where an over- lapping set is better than a non-overlapping one. Such an example might be the one given in Section 8.3.3, where a more specific classifier interleaves the prediction of a global classifier to model a regional feature of the data. Even though matching by degree was used in such a case, a similar example might show that 0/1 matching also provides such solutions. In any case, formal state- ments are required, rather than empirical evidence, as the result of Section 8.3.3 might be an anomaly caused by a failing model structure search or the approx- imation of the model evidence.

10.1.3 Default Hierarchies

Default hierarchies are a set of classifiers in which a more specific classifier matches a subset of the set of inputs matched by a more general classifier, and in this subset provides an exception to the model of the more general classifier. Using our formal definition of optimality, it might be possible to answer if default hierarchies are indeed a better solution representation than one that does not use default hierarchies.

The first step towards an answer is certainly to investigate if all optimal sets are non-overlapping. If this is true, then default hierarchies can by definition not provide an optimal solution.

To proceed further, we require a formal definition of how exactly a classifier models an exception to another classifier's model. Given, for example, that a state x is matched by two classifiers k, k', and they both contribute to the output prediction by gk(x) ≈ gk'(x) ≈ 0.5, then neither classifier provides an exception to the model of the other. Providing a threshold for gk(x) above which classifier k dominates classifier k' is arbitrary and should be avoided, especially since it strongly depends on the mixing model. Using a mixing model that lets specific classifiers always override the prediction of more general classifiers firstly requires a probabilistic formulation before it can be analysed.

To summarise, answering whether solutions with default hierarchies are better than without requires us to first provide a clear probabilistic formulation of what default hierarchies actually are.

The issues of overlapping classifiers in optimal solutions and default hierarchies are only two examples of the properties of the optimal set of classifiers that can be analysed. In any case, such analysis might finally be able to answer some of the essential questions about LCS classifier sets.

10.2 Approximating or Changing the Mixing Model

Deriving the LCS model from the MOE model, we have defined the mixing model by the generalised softmax function Eq. (7.10). This function models the probability of classifier k having generated observation n by the log-linear model p(mk = 1 | xn, vk) ∝ mk(xn) exp(vkᵀφ(xn)). In LCS we usually have φ(x) = 1 for all x ∈ X, and thus the model reduces to p(mk = 1 | xn, vk) ∝ mk(xn) v'k, where v'k = exp(vk); that is, the probability of classifier k having generated observation n is proportional to its degree of matching xn and some scalar parameter. To achieve a good fit of the global model, this parameter should be proportional to the goodness-of-fit of the local classifier model.
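Written out for a single input, the reduced mixing model assigns classifier k a weight proportional to mk(x) v'k. A small, purely illustrative sketch of the resulting mixing coefficients:

    import numpy as np

    def mixing_coefficients(match_degrees, v_scalars):
        # match_degrees[k] = m_k(x) and v_scalars[k] = v'_k = exp(v_k) for
        # phi(x) = 1; the mixing coefficient g_k(x) is proportional to their product.
        raw = np.asarray(match_degrees, float) * np.asarray(v_scalars, float)
        return raw / raw.sum()

    # Three classifiers matching an input to different degrees:
    print(mixing_coefficients([1.0, 0.5, 0.0], [2.0, 4.0, 10.0]))   # -> [0.5, 0.5, 0.0]

As noted above, such a v'k only yields a good global fit if it reflects the quality of classifier k's local model.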

The softmax function corresponds to the multinomial logit model, which is a standard model in statistics to model consumer choice [167]. Unfortunately, its training does not scale well with the number of classifiers K when used as the mixing model for LCS, as shown in Chapters 6 and 8. To provide an LCS that can be applied to real-world problems with a potentially large number of classifiers, we need to either approximate the function or replace it by another mixing model that can be trained more efficiently.

In Chapter 6 we have already introduced a set of heuristics that approximate the softmax function when performing maximum likelihood learning. To evaluate the model evidence for our optimality criterion, on the other hand, we need to additionally handle the priors on the mixing weight vectors. These influence the solution such that we need to solve Eq. (7.47) rather than ∇V E(V) = 0 with ∇V E(V) given by Eq. (6.8). The Eβ(βk) in Eq. (7.47) is the mean of the noise precision prior on vk that expresses the assumption that the elements of vk are small and independent. In order for an approximation to be usable within our model, it needs to be devised such that it is able to express this assumption.

Mixing the prediction of a set of independently trained and potentially lo- calised models is the core problem of ensemble learning. Hence, the search for good approximations or other mixing models can be inspired by looking at

285 the ensemble learning literature. Good starting points might be [186], where Perrone presents a theoretical framework for “combining populations of para- metric and/or non-parametric regression estimators for classification, density estimation and regression”, and [102], where neural networks are combined by linear combinations to improve the mixed prediction.

Alternatively, the mixing model can be based on Bayesian model averaging (see [108] for a tutorial), which might fit better into our Bayesian model selection framework. Given that classifier k of model structure Ma provides the localised model with the predictive density p(y'|x', D, Mak), Bayesian model averaging is based on defining the global predictive density by

p(y' \mid x', \mathcal{D}, \mathcal{M}_a) = \sum_k p(y' \mid x', \mathcal{D}, \mathcal{M}_{ak})\, p(\mathcal{M}_{ak} \mid x', \mathcal{D}),    (10.1)

where the probability of the classifier model can be computed by

p(\mathcal{M}_{ak} \mid x', \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathcal{M}_{ak})\, p(\mathcal{M}_{ak} \mid x')}{\sum_p p(\mathcal{D} \mid \mathcal{M}_{ap})\, p(\mathcal{M}_{ap} \mid x')}.    (10.2)

In contrast to the standard formulation of Bayesian model averaging, we make the probability density of the classifier model conditional on the current input to incorporate matching. How exactly these probabilities are defined and if the above formulation leads to easily computable results still needs to be investigated.
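Nothing about this formulation is settled, but structurally it would amount to something like the following sketch, where the classifier evidences and the input-conditional priors (taken here to be proportional to the matching degrees) are supplied as placeholders:

    import numpy as np

    def bma_weights(evidence, match_degree):
        # evidence[k]     plays the role of p(D | M_ak)
        # match_degree[k] plays the role of p(M_ak | x'), here simply taken
        #                 proportional to the matching degree m_k(x')
        raw = np.asarray(evidence, float) * np.asarray(match_degree, float)
        return raw / raw.sum()                     # structure of Eq. (10.2)

    def bma_prediction(classifier_means, evidence, match_degree):
        # Global predictive mean as the weighted average of the classifier
        # predictive means, following the structure of Eq. (10.1).
        return bma_weights(evidence, match_degree) @ np.asarray(classifier_means, float)

    print(bma_prediction([0.2, 0.8], evidence=[0.9, 0.3], match_degree=[1.0, 0.5]))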

To summarise, there are various approaches that can be taken to reformulate the mixing model, either by a custom heuristic approximation, by borrowing concepts from ensemble learning, or by using Bayesian model averaging. In all of these cases, training of the model needs to be efficient and scale well with the number of classifiers. Additionally, the model needs to take into account the priors that we have specified in the full LCS model, or needs to adjust these priors to the new model formulation.

10.3 An LCS Model for Classification

The LCS model that we have developed focusses on regression tasks, but can also handle classification tasks if they are reformulated in the XCS way (see Section 3.1.3). This focus should not be interpreted as a restriction, as it is almost trivial to construct a similar model that is specialised exclusively on classification tasks. The only required change is to use classifier models that are specialised on classification rather than regression.

A simple but powerful multi-class classification model that is a member of the class of generalised linear models [167] is the multinomial logit model that we

have already used as the mixing model. Given DY class labels, the probability of an observation n having class yn ∈ Y = {1, ..., DY} is defined by

p(y_n \mid x_n, W_k) = \frac{\exp(w_{k y_n}^T x_n)}{\sum_y \exp(w_{k y}^T x_n)},    (10.3)

where Wk is the DY × DX weight matrix with wky as its yth row. This model assumes a soft linear partitioning of the input space with each partition being assigned to a particular class. As a result, a single classifier can model all classes within its matched area of the input space. For further information about its underlying assumptions about the noise distribution see [167, 19]. Its binary classification counterpart is the binomial logit model that is also described in [167, 19].
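A minimal sketch of evaluating Eq. (10.3) for a single classifier (the shapes and example numbers are illustrative, and subtracting the maximum score is only a standard numerical stabilisation):

    import numpy as np

    def class_probabilities(x, W_k):
        # W_k holds one weight vector w_ky per class label (as rows); returns
        # the class probabilities p(y | x, W_k) of Eq. (10.3) for one input x.
        scores = W_k @ x
        scores = scores - scores.max()     # numerical stabilisation only
        p = np.exp(scores)
        return p / p.sum()

    x = np.array([1.0, 0.3])               # e.g. an input with a leading bias element
    W_k = np.array([[ 0.5, -1.0],
                    [ 0.2,  2.0],
                    [-0.7,  0.1]])
    print(class_probabilities(x, W_k))     # probabilities over three class labels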

If each classifier is supposed to model only a single class, as is usually the case in LCS, we can alternatively use the multinomial model

p(y_n \mid x_n, w_k) = w_{k y_n},    (10.4)

where wk = (wk1, ..., wkDY)ᵀ is the classifier's weight vector whose DY elements must satisfy 0 ≤ wkc ≤ 1 and Σc wkc = 1. This shows that the probability of the output being y is modelled by wky, independent of the input. The multinomial density is in fact similar to the multinomial logit model with inputs x = 1, as shown in [122]. Embedding the multinomial model into the Bayesian LCS model, its conjugate prior is given by the Dirichlet distribution [15].

Reducing the model further to binary classification with Y = {0, 1}, the classifier model can be given by the Bernoulli density

p(y_n \mid x_n, w_k) = w_k^{y_n} (1 - w_k)^{1 - y_n},    (10.5)

with the scalar parameter wk ∈ [0, 1] that represents the probability of yn = 1. As shown in [164], this is the classifier model that is implicitly used in UCS [162] in combination with maximum likelihood learning. For use in the Bayesian LCS model, its conjugate prior is the Beta distribution [15, 164].
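Because the Beta prior is conjugate to this Bernoulli model, the posterior over wk is available in closed form. The following sketch illustrates this; the matching-weighted counts are only an assumption about how matching would enter, and the prior parameters are chosen arbitrarily:

    import numpy as np

    def beta_bernoulli_posterior(y, match, a0=1.0, b0=1.0):
        # y[n] in {0, 1} are the class labels, match[n] = m_k(x_n) the matching
        # degrees of classifier k.  With a Beta(a0, b0) prior on w_k, the
        # matching-weighted posterior is Beta(a0 + sum m*y, b0 + sum m*(1-y)).
        y, match = np.asarray(y, float), np.asarray(match, float)
        a = a0 + np.sum(match * y)
        b = b0 + np.sum(match * (1.0 - y))
        return a, b, a / (a + b)          # posterior parameters and mean of w_k

    a, b, w_mean = beta_bernoulli_posterior(y=[1, 1, 0, 1], match=[1.0, 1.0, 0.5, 0.0])
    print(a, b, w_mean)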

Having specified the classifier model, the full Bayesian LCS model of Chapter 7 can be modified accordingly, and the optimal set of classifiers is again given by the model structure that maximises p(M|D). Furthermore, a computable approximation to the model evidence can be derived by variational Bayesian inference, as shown in the same chapter. This results in an LCS model for classification, similar to the one developed for regression, and additionally in a formal definition for the optimal set of classifiers for classification tasks.

In contrast to UCS and XCS, the classifiers are not required to specify the classes they model by the matching function. Rather, they learn which class to model when they are trained. Given, for example, a binary classification problem, then classifier k might predict class 0 with probability 0.95, which means that it is almost certain that all the inputs that it matches are associated with class 0. On the other hand, if it assigns class 0 a probability of 0.5, then the matched area of the input space is associated with both classes and a set of classifiers that is able to better distinguish between the two classes needs to be sought.

10.4 Improving Model Structure Search

The algorithms introduced in Chapter 8 were sufficient to demonstrate the use of the optimality criterion in simple regression tasks, but are not powerful enough for real-world applications. The two components that need improve- ment are the mixing model training and the model structure search. We have

already discussed possible modifications and approximations of the mixing model to simplify its training in Section 10.2. Here, we concentrate on how to improve the model structure search to be able to handle more complex problems.

In Chapter 8 we have introduced two alternatives to model structure search, but here we exclusively concentrate on how to improve the GA, as it has the advantage of exploiting building blocks in the LCS model (see Section 8.2.3 for a discussion). It can be improved on two levels: we should on one hand use more information embedded in the LCS model than just the fitness of a model structure, and on the other hand facilitate current theoretical and practical advances in evolutionary computation to improve the GA itself. In Chapter 8, we have kept the GA deliberately naïve so that it does not draw attention away from the actual goal of maximising p(M|D).

With respect to using the information that is available within the model itself, model structure search operates on the classifiers, and thus we are interested in getting information about the quality of classifiers within a model structure Ma. We can exploit the probabilistic model to perform classifier selection in the same way as we do Bayesian model selection: given that Mak specifies the model of classifier k, then p(Mak|D) ∝ p(D|Mak) p(Mak) gives the probability density of this model given the data. The classifier model evidence p(D|Mak) is, similarly to Eq. (7.2), given by

p(\mathcal{D} \mid \mathcal{M}_{ak}) = \int_{\theta_k} p(\mathcal{D} \mid \theta_k, \mathcal{M}_a)\, p(\theta_k \mid \mathcal{M}_a)\, d\theta_k,    (10.6)

where one needs to take matching into account when formulating p(D | θk, Ma). As for good model structures, good classifiers are classifiers for which p(Mak|D) is large, or equivalently, for which p(D|Mak) is large, given uniform classifier priors p(Mak). Therefore, we can bias mutation towards changing bad classifiers, and introduce new genetic operators that construct new individuals from good classifiers of other individuals, or prune individuals by removing bad classifiers.
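One simple way of biasing mutation towards bad classifiers, sketched below with hypothetical names, is to select the classifier to mutate with a probability that grows as its (approximate) log evidence falls behind the best classifier in the individual:

    import numpy as np

    def pick_classifier_to_mutate(log_evidence, rng=None):
        # log_evidence[k] approximates ln p(D | M_ak); classifiers whose
        # evidence falls behind the best one are picked for mutation more often.
        rng = rng or np.random.default_rng()
        log_ev = np.asarray(log_evidence, dtype=float)
        badness = log_ev.max() - log_ev
        if badness.sum() == 0.0:                         # all classifiers equally good
            probs = np.full(log_ev.size, 1.0 / log_ev.size)
        else:
            probs = badness / badness.sum()
        return rng.choice(log_ev.size, p=probs)

    print(pick_classifier_to_mutate([-12.3, -45.0, -13.1]))   # usually returns 1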

From the GA side itself, a more principled approach is to be sought from evolutionary computation techniques where variable-length individuals are common. Genetic programming (GP) is one of them, as the program that each individual

represents is not fixed in size. However, while the fitness of a program does not necessarily decrease with its complexity, our optimality criterion penalises overly complex model structures. Thus, GP suffers from the problem of individuals growing out of bound that is naturally avoided in our approach. Nonetheless, some of the theoretical results of GP might still be applicable to improving the GA to search the model structure space.

Another route that can additionally improve the performance of the GA is to use Estimation of Distribution Algorithms (EDAs) [185] that improve the crossover operator by detecting and preserving building blocks. They do so by creating a statistical model of the high-fitness individuals of the current population and drawing samples from this model to create new individuals, rather than using standard crossover operators. Good building blocks are expected to appear in several individuals and consequently receive additional support in the model. The Bayesian Optimization Algorithm (BOA), for example, models the alleles of selected individuals by a Bayesian network that is sampled thereafter to produce new individuals [184].

Currently, there exists no EDA that can handle variable-length individuals adequately [151]. The problem is that the space of possible classifiers, that is, the space of possible matching function parameterisations, is too big to frequently have similar classifiers within the population. The chance of having the same building blocks is even exponentially smaller [43, 151]. Despite these issues it is still worth trying to construct an EDA that can be used with the population structure at hand, at least to provide a more principled crossover operator. What needs to be considered when constructing the statistical population model is the irrelevance of the location of a classifier within an individual. The only thing that matters is the classifier itself, and whether it frequently co-occurs with the same other classifiers. This should allow modelling and preserving building blocks within a set of classifiers. An additional enhancement of the model is to take the nature of the matching function into account, as done for Michigan-style LCS in [54, 55]. This will make the model highly representation-dependent, and so we will firstly concentrate on building models without considering the matching function explicitly.

10.5 Empirical Validation of Optimality Criterion

So far we have shown on a small set of regression tasks that the optimality criterion that we have devised in Chapter 7 gives the correct results. To ensure that this criterion also performs well in real-world situations, its performance needs to be additionally tested on some standard data sets. These tests need to show that the assumptions underlying the LCS model are actually applicable in real-world tasks. Naturally, before such testing can take place, we need to simplify the training of the mixing model and improve the performance of the GA, to which possible approaches are outlined in Sections 10.2 and 10.4, respectively.

To test the LCS optimality criterion for regression tasks, an obvious starting point is the StatLib dataset archive [204] which contains a multitude of real- world regression tasks that are commonly used in the scientific literature. It contains the popular Boston Housing data [101], for example, that was used amongst other studies in [62] to evaluate and compare the performance of the Bayesian Treed model — a variant of the CART model with linear regression models at its nodes that is in its structure closely related to the LCS model that we have presented.

Once an LCS classification model has been built, the performance of its opti- mality criterion should be evaluated on some of the sets available in the UCI Machine Learning Repository [4], which contains popular classification tasks, such as the Iris dataset. Several of its datasets have already been used to test the classification performance of other LCS approaches [163], which can act as a benchmark to compare our optimality criterion against.

Tests on large datasets are not feasible without using an improved mixing model training and the variational Bayesian inference of the model evidence. To check if these approximations produce results that are close to those of a non-approximated model, we can utilise sampling techniques to sample the model posterior directly and get an accurate evaluation of the probability den- sity of a model structure. It still needs to be evaluated which sampling tech- niques are applicable to sampling from this posterior. As sampling is compu- tationally more expensive than using the approximations, the comparison can

only be performed on small data sets. Nonetheless, it can show on these sets if the applied approximations are adequate.

10.6 Incremental Implementation

Probably the most challenging task is to create a Michigan-style incremental LCS that aims at maximising the overall model structure probability density p(M|D), as defined by our optimality criterion. The advantages of such an implementation are that it only needs to maintain a single set of classifiers rather than many of them in parallel, and that it can be used for RL tasks, which require the value function approximation to be updated incrementally.

As already outlined in Section 9.4.1, every incremental implementation operates on two levels: on one hand, the model parameters θ for a fixed model structure M need to be trained, and on the other hand, the model structure M itself needs to be changed to improve the model structure probability p(M|D). We continue by discussing possibilities for providing incremental implementations for each of these levels separately. Dealing with them separately provides the first step towards a solution, but combining them needs to be considered explicitly in a further step.

10.6.1 Incremental Model Parameter Update

Having provided a Bayesian LCS model for a fixed model structure M, one could assume that it automatically provides us with the possibility of training its parameters incrementally by using the posterior of one update as the prior of the next update. However, due to the use of hyperpriors, this does not apply directly to our case.

The parameters we need to update are those of the classifier model {W, τ, α}, and those of the mixing model {V, β}. Z represents the latent variables, and its model is completely determined by knowing the other parameter values. We want to update these parameters with every additional observation, starting

with their prior settings.

Let us consider the posterior weight Eq. (7.96) and precision Eq. (7.95) of the classifier model, which also results from performing matching-weighted ridge

regression with ridge complexity Eα(αk) (see Section 7.3.2). As shown in Section 5.3.5, ridge regression can, due to its formal equivalence to RLS, be performed incrementally. Note, however, that the ridge complexity is set by the expectation of the prior on αk that is modelled by the hyperprior Eq. (7.9) and is updated together with the classifier model parameters. A direct change of the ridge complexity after having performed some RLS updates is not feasible. However, there remain two possibilities for an incremental update of these

parameters: we could fix the prior parameters by specifying αk directly rather

than modelling it by a hyperprior. Potentially good values for αk are given in Section 7.2.3. Alternatively, we can incrementally update Σn mk(xn) xn xnᵀ and recover Λ*k after each update by using Eq. (7.95) directly, which requires a matrix inversion of complexity O(DX³) rather than the O(DX²) of the RLS algorithm. Thus, we can either increase the bias of the model or the computational complexity of the update. Using uninformative priors, the first approach might be the one to prefer. From inspecting Eqs. (7.97) and (7.98) we can see that both parameters of the noise precision model can be updated incrementally without any modifications.
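A sketch of this second route (incrementally maintaining the matching-weighted sufficient statistics and recovering the weights by direct inversion after every observation); the fixed ridge term below only stands in for the prior contribution of Eq. (7.95) and is not a faithful transcription of it:

    import numpy as np

    class IncrementalClassifierStats:
        # Sketch: accumulate the matching-weighted sufficient statistics of a
        # linear classifier model and re-solve for its (ridge-regularised)
        # weights after every observation.
        def __init__(self, dim, ridge=1.0):
            self.S = np.zeros((dim, dim))   # running sum of m_k(x) x x^T
            self.b = np.zeros(dim)          # running sum of m_k(x) x y
            self.ridge = ridge              # stand-in for the prior term

        def update(self, x, y, match):
            self.S += match * np.outer(x, x)
            self.b += match * x * y
            precision = self.ridge * np.eye(len(x)) + self.S
            return np.linalg.solve(precision, self.b)   # O(D_X^3) per update

    stats = IncrementalClassifierStats(dim=2)
    for x, y, m in [(np.array([1.0, 0.2]), 0.5, 1.0),
                    (np.array([1.0, 0.8]), 1.1, 0.6)]:
        w = stats.update(x, y, m)
    print(w)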

Even though we could use a least squares approximation to train the mixing model, analogous to Section 6.1.2, the results in Chapter 6 have shown that we can design heuristics that outperform this approximation. Additionally, these heuristics have no parameters besides relying on parameters of the classifier models. Given that similar parameter-less heuristics for the Bayesian model exists, they can be immediately used in incremental implementations, as no parameters need to be updated. We have already outlined possible approaches to designing such mixing models in Section 10.2.

10.6.2 Incremental Model Structure Search

The GA in Michigan-style LCS has strong parallels to cooperative co- evolutionary algorithms (for example [238]). In these, the fitness of an indi- vidual depends on the other individuals in the population. Equally, the fitness

of a classifier in a Michigan-style LCS depends on the other classifiers in the set of classifiers, as they cooperate to form the solution. Note that while in Pittsburgh-style LCS an individual is a set of classifiers that provides a candidate solution, in Michigan-style LCS each classifier is an individual and the whole population forms the solution.

Having defined a fitness for a set of classifiers by the model structure probability, we want to design an algorithm that is able to increase the fitness of this population by modifying separate classifiers. Expressed differently, we want to design a GA for which the fixed point of its operators is the optimal set of classifiers, such that p(M|D) is maximised. While this is not a trivial problem, an obvious approach is to attempt to design a cooperative co-evolutionary algorithm with such operators, or to modify existing LCS, like XCS(F), to aim at satisfying our optimality criterion. However, the lack of theoretical understanding of either method does not make the approach any simpler [173].

Here, we propose an alternative approach based on Replicator Dynamics (for example, [109]): assume that the number of possible matching function parameterisations is given by a finite P (for any finite X and a sensible representation this is always the case) and that C1, ..., CP enumerate each possible type of matching function. Each Ci stands for a classifier type that is a possible replicator in a population. Let c = (c1, ..., cP)ᵀ denote the frequencies of the classifier types. Assuming an infinite population model, ci gives the proportion of classifiers of Ci in the population. As the ci's satisfy 0 ≤ ci ≤ 1 and Σi ci = 1, c is an element of the P-dimensional simplex SP.

The fitness fi(c) of Ci is a function of all classifiers in the population, described by c. The rate of increase ċi/ci of classifier type Ci is a measure of its evolutionary success and may be expressed as the difference between the fitness of Ci and the average fitness f̄(c) = Σi ci fi(c), which results in the replicator equation

\dot{c}_i = c_i \left( f_i(c) - \bar{f}(c) \right).    (10.7)

Thus, the frequency of classifier type Ci only remains unchanged if there is no such classifier in the current population, or if its fitness equals the average fitness of the current population. The population is stable only if this applies to all its classifiers.
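A discrete-time simulation of Eq. (10.7) for an arbitrary, purely illustrative fitness function shows the kind of dynamics such an approach would induce:

    import numpy as np

    def replicate(c, fitness, dt=0.01, steps=5000):
        # Euler-discretised replicator equation: c_i changes at rate
        # c_i * (f_i(c) - average fitness); c is renormalised to stay on the
        # simplex despite the discretisation error.
        c = np.asarray(c, dtype=float)
        for _ in range(steps):
            f = fitness(c)
            c = c + dt * c * (f - c @ f)
            c = np.clip(c, 0.0, None)
            c = c / c.sum()
        return c

    # Illustrative fitness only: type 0 is penalised when it becomes frequent,
    # type 2 is always dominated by type 1.
    fitness = lambda c: np.array([1.0 - c[0], 0.6, 0.5])
    print(replicate(np.array([1/3, 1/3, 1/3]), fitness))

In this toy example the type whose fitness drops as it becomes frequent settles at an intermediate frequency, while the always-dominated type dies out; this is the kind of behaviour that the fitness function to be designed would have to exploit.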

Our aim is to define a fitness function for each classifier such that the stable population is the optimal population according to our optimality criterion. Currently, L(q) by Eq. (7.94) cannot be fully split into one component per classifier, due to the term ln |Λ*V|⁻¹ in LM(q) that results from the mixing model. Replacing this mixing model by heuristics, as described in Section 10.2, should make such a split possible. Even then, each classifier's component is a function of all classifiers in the current population, as the mixing coefficients assigned to a single classifier for some input depend on the other classifiers that match the same input; this conforms to the above definition of the fitness of a classifier type being a function of the frequencies of all classifier types.

The population is stable when each classifier's fitness equals the average fitness of all classifiers. This seems very unlikely to result naturally from splitting L(q) into the classifier components, and thus either Eq. (10.7) needs to be modified, or the fitness function needs to be tuned so that this is the case. If and how this can be done cannot be answered before the fitness function is available. Furthermore, Eq. (10.7) does not allow the emergence of classifiers that initially have a frequency of 0. As initialising the population with all possible classifiers is not feasible even for rather small problems, we need to periodically add new classifier types, chosen stochastically. To make this possible, Eq. (10.7) needs to be modified to take such additions into account, resulting in a stochastic equation.

Obviously, a lot more work is required before we can decide if the replicator dynamics approach can be used to design Michigan-style LCS. If it can, the approach opens the door to applying the numerous tools designed to analyse replicator dynamics to the analysis of the classifier dynamics in Michigan-style LCS.

10.7 Matching by Degree

With the probabilistic LCS model we have made matching by degree possi- ble. Before that, either a classifier matched an input or it did not. Now, by

Eq. (4.17), the matching function m_k(x) returns the probability of classifier k matching input x.

The matching function in LCS is usually represented by the condition and action of a classifier. Thus, how matching by degree can be used is representation-dependent, and we have presented two alternatives in Section 8.3.1. Interpreting the flexibility of choosing different representations as a strength of LCS, this strength is further increased by allowing classifiers to match states to a certain degree.

Let us consider an LCS-related issue that matching by degree might additionally solve: given that XCS(F) is applied to continuous input spaces, there is an infinite number of ways that the matching function of a classifier can be specified. Overlapping classifiers in XCS(F) compete in representing the inputs that both match, but each input needs to be matched such that the classifiers can provide a model for the whole input space. Overall, it is very unlikely that classifiers are located in the input space such that they do not overlap but still model all possible inputs. Using matching by degree, on the other hand, one can define soft boundaries on the subspace of the input space that a classifier matches, as we have done in Section 8.3.1. This should make it easier for the model structure search to align classifiers in the input space, as matching all inputs is simplified by the softness of the boundaries. Naturally, this claim still needs to be verified empirically, and a modification of the study in [52] seems to be a particularly suitable approach.
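As an illustration of what such soft boundaries could look like, the sketch below gives a radial basis-function style matching function and a one-dimensional soft-interval variant along the lines of Section 8.3.1. The particular parameterisation (centre and spread, interval bounds and softness) is an assumption made purely for illustration.

    import numpy as np

    def rbf_matching(x, centre, spread):
        # Radial basis-function matching: the probability of matching input x
        # decays smoothly with the distance from the classifier's centre.
        d = np.atleast_1d(np.asarray(x, dtype=float) - np.asarray(centre, dtype=float))
        return float(np.exp(-0.5 * float(d @ d) / spread ** 2))

    def soft_interval_matching(x, lower, upper, softness=0.1):
        # Soft-interval matching in one dimension: close to 1 inside
        # [lower, upper] and dropping off smoothly (sigmoidally) outside.
        s = lambda z: 1.0 / (1.0 + np.exp(-z / softness))
        return s(x - lower) * s(upper - x)

    # Example: matching probability of an input near the soft boundary.
    print(soft_interval_matching(0.95, lower=0.0, upper=1.0))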

10.8 Advancing in Reinforcement Learning

We have already provided an RL-related discussion of potential future work in Section 9.6, and will here concentrate on two issues: convergence of the algorithm, and an incremental LCS that can be used for RL.

Given that we provide stability on the model parameter level in the sense discussed in Section 9.4.3, convergence of the LCS parameter update to the fixed point Ṽ* = Π_M T Ṽ* is guaranteed, where Π_M is the LCS approximation operator Eq. (9.19) for model structure M, and T is the dynamic programming operator Eq. (9.9). At the same time as approximating the current value function estimate, an incremental LCS also aims at finding a model structure M that adequately represents this estimate. Changing the model structure, however, changes the fixed point and thus also the optimal model structure. This shows that there are circular dependencies between approximating the value function estimate and improving M. Therefore, analysing the convergence of both in combination is certainly not trivial.

The first step is to show that LCS converge to a stable set of classifiers if the target function is stationary, and the approach outlined in Section 10.6.2 might help in approaching this step. When this is achieved, the interaction between updating the value function estimate and changing the model structure needs to be analysed. How this can be done, and if it can be done at all is still un- clear and very likely depends on how convergence was shown for a stationary function.

In any case, we need an incremental LCS implementation that can handle non- stationary functions, because in RL we deal with a value function estimate that changes over time. Recency-weighting as shown for the RLS algorithm in Sec- tion 5.3.5 is a possible approach to handle this non-stationarity, and we can use its probabilistic interpretation given in Section 5.3.6 to include recency- weighting into our Bayesian LCS model: the variance of an observation deter- mines how much weight the classifier model assigns to the information con- tained within that observation. Thus recency-weighting can be induced by increasing the variance of past observations along the lines of Eq. (5.61). Based on this observation, we can re-derive the variational posteriors as shown in Chapter 7 to get a recency-weighted LCS model and classifier set optimality criterion. From this model, an incremental version might be derivable as dis- cussed in Section 10.6.
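A minimal sketch of such a recency-weighted classifier update, along the lines of the recency-weighted RLS algorithm of Section 5.3.5, is given below for a scalar output. The class name, the decay factor value and the initialisation constant are illustrative assumptions rather than parts of a specific implementation in this work.

    import numpy as np

    class RecencyWeightedRLS:
        # Sketch of recency-weighted recursive least squares for one classifier.
        # Down-weighting older observations by the decay factor lam corresponds
        # to inflating the assumed noise variance of past observations.
        def __init__(self, dim, lam=0.99, delta=1000.0):
            self.w = np.zeros(dim)           # weight vector estimate
            self.P = delta * np.eye(dim)     # (scaled) inverse input correlation matrix
            self.lam = lam

        def update(self, x, y, m=1.0):
            # x: input vector, y: target output, m: matching degree in [0, 1].
            if m == 0.0:
                return                       # unmatched observations are ignored
            Px = self.P @ x
            k = (m * Px) / (self.lam + m * float(x @ Px))    # gain vector
            self.w = self.w + k * (y - float(self.w @ x))
            self.P = (self.P - np.outer(k, Px)) / self.lam

        def predict(self, x):
            return float(self.w @ x)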

10.9 Conclusions

As can be seen from this chapter, the developments presented in this work open up a wide range of future work. Most of it centres on the optimality criterion and how it can be used for real-world applications in Pittsburgh-style and Michigan-style LCS. We also present new possibilities, like the theoretical analysis of the optimality criterion to derive properties of the optimal set of classifiers, and the gains of using matching by degree.

What certainly needs to be done first is to create a Pittsburgh-style LCS following Sections 10.2 and 10.4, to evaluate whether the assumptions that the optimality criterion is based on provide good results on standard datasets. This potentially also leads to an EDA for variable-length individuals and a reformulation of the mixing model based on Bayesian model averaging or related ensemble learning methods.

Once the applicability of the definition of an optimal classifier set to real-world problems has been confirmed, the properties of such optimal sets can be analysed theoretically, as described in Section 10.1. On the other hand, an LCS model for classification can be created by following Section 10.3, and its performance evaluated as described in Section 10.5.

A next major step is to turn such a system into a Michigan-style LCS. While updating the model parameters incrementally is straightforward, how to do the same for the model structure is less clear, as discussed in Section 10.6. Once this is achieved, we will have established a strong link between Pittsburgh-style and Michigan-style LCS, showing that they are just different implementations with the same goal, which is to find the optimal set of classifiers given some data.

Having dealt with the relation between these two LCS styles, we expect to have gained a better theoretical understanding of Michigan-style LCS that eventually allows us to make statements about their convergence for station- ary target functions. Subsequently, this can be used to analyse convergence of LCS in sequential decision tasks, as described in Section 10.8. Note that before being able to do so, we need to modify the incremental LCS implementation to support recency-weighting such that it can handle non-stationary value func- tion estimates.

To summarise, the theoretical understanding we have gained from the work presented here allows us to outline how to approach problems for which a solution seemed to be out of reach. Prime examples are to provide answers to the properties of optimal solutions, or to analyse the convergence of LCS with RL. We hope that the outlined approaches will yield useful results, but even if they do not, they might reveal alternative approaches that do.

Chapter 11

Summary and Concluding Remarks

To reflect back on what we have achieved, let us recall that the objective was to “develop a formal framework for LCS that lets us design, analyse, and interpret LCS” (see Section 1.3). A new LCS interpretation is clearly given through the model-based view of LCS and its probabilistic formulation. First steps in how to design LCS based on this model are shown by the application of ML methods to training the model for a fixed model structure, and searching the space of model structures to provide better sets of classifiers with respect to the given data. Its analysis is implicitly given by revealing the assumptions that underlie the model formed by a set of classifiers, and the theoretical foundations of the applied training methods. The formal framework is formed by the combination of the LCS model and the methods applied to train it. In that sense, we have evidently reached our objective.

11.1 Contributions

Before we started this work, most of the existing theory was built on a facet-wise approach that investigates the properties of subcomponents of existing LCS algorithms by means of representing these components by simplified models (see Section 2.4). The underlying assumption is that one can gain knowledge about the operation of an algorithm by understanding its components. While one could question if such an approach is also able to adequately capture the interaction between these components, we see its main limitation as being the focus on the analysis of existing algorithms, which are always just a means to an end.

We rather concentrate on the end, which is the solution to the problems that LCS want to solve, and design algorithms around it, guided by how LCS were characterised by previous theoretical investigations. The main novelty of this work is the methodology of taking a model-centred view to specifying the structure of LCS and their training. All the main contributions follow from this approach.

The model-centred view is characterised by first formalising a probabilistic model that represents a set of classifiers, and then using standard machine learning methods to find the model that explains the given data best. The main contributions that follow are

• a probabilistic model for a set of classifiers. This model makes explicit the assumptions that are made about the data, gives a better understanding of the structure of the LCS model, and is easily reformulated to express different assumptions or to specialise LCS to different tasks.

• a general definition of the optimal set of classifiers. The definition is general as it is independent of the representation, suitable for continuous input and output spaces, and hardly dependent on any system parameters, given that the priors are sufficiently uninformative.

• a strong link to other machine learning methods through utilising general machine learning methods for the training of the model. The link also revealed more detail about the training of current LCS.

Overall, approaching LCS from a different perspective has given us a clearer view of the problems that need to be solved, which tools can be used to solve them, and has resulted in a wide range of immediate contributions. In the order of the chapters they are:

Background

• promoted the view of LCS as a particular structure of a model, independent of their implementation (Section 2.3.5).

A Learning Classifier Systems Model

• characterised LCS as a parametric model, conceptually separating the model structure and the model parameters (Section 3.2.2).

• interpreted classifiers as localised models, with their localisation determined by the matching function (Section 3.2.3).

• defined the LCS model structure as a global model formed by a combination of localised models (Section 3.2.4).

A Probabilistic Model for LCS

• related the LCS model to the Mixtures-of-Experts model (Chapter 4).

• generalised the Mixtures-of-Experts model by adding an additional layer of localisation (Section 4.3).

• defined matching by degree as the probability of matching a certain input (Section 4.3.1).

• introduced a localisation method for classifiers that goes beyond what is defined by their matching function (Section 4.3.5).

• introduced independent classifier training to avoid local optima and to clarify what a classifier models (Section 4.4).

• clarified that independent classifier training might lead to a worse goodness-of-fit (Section 4.4.5).

• showed how the LCS model with independent classifier training relates to XCS [240] and ZCS [239, 227, 23] (Section 4.5).

Training the Classifiers

• made the assumptions underlying independently trained linear classifiers explicit (Section 5.1).

• derived the implementation-independent goal of classifiers when trained according to the principle of maximum likelihood (Section 5.1.3).

• gave batch-learning solutions for classifier training (Section 5.2).

• discussed stability, rate of convergence, and convergence criteria for gradient-based methods of classifier training (Section 5.3).

• introduced the use of the Kalman filter to train linear classifiers, and linked it to the RLS algorithm and ridge regression (Section 5.3.6).

• gave another interpretation for matching as modulating the variance of the observed output (Section 5.3.6).

• formulated the problem of estimating the noise precision as a least squares problem, which allows tracking it by the LMS algorithm (Section 5.3.7).

• derived update equations to directly track the optimal noise precision estimate (Section 5.3.7).

• showed that XCS and XCSF do not estimate the error that they minimise (Section 5.3.7).

• empirically demonstrated the superior performance of methods that directly track the optimal estimates over gradient-based methods (Section 5.4).

• discussed how XCS and XCSF use gradient-based methods equivalent to the ones introduced in Chapter 5, and thus train localised linear classifier models independently (Section 5.5).

Mixing Independently Trained Classifiers

• introduced the IRLS algorithm to train the mixing model, when given by the generalised softmax function (Section 6.1.1).

• derived an incremental least squares approximation to training the mixing model, when given by the generalised softmax function (Section 6.1.2).

• introduced four alternative heuristic mixing models to approximate mixing by the generalised softmax function (Section 6.2).

• derived bounds on the global prediction error as well as the variance of the global prediction, when mixing by weighted average is used (Section 6.2.1).

• introduced a batch formulation of the mixing model used by XCS (Section 6.2.5).

• showed empirically that mixing by inverse variance is the best of the heuristic models and better than the least squares approximation for training the generalised softmax function mixing model (Section 6.3).

The Optimal Set of Classifiers

• linked finding a good set of classifiers to model selection (Chapter 7).

• defined the best set of classifiers as the one that provides the best model for the given problem (Chapter 7).

• discussed what sets of classifiers XCS [240], YCS [33], and CCS [154] aim for (Section 7.1.1).

• showed that even methods that seem to be unbiased implicitly make assumptions about the data-generating process (Section 7.1.6).

• introduced a fully Bayesian model for a set of classifiers, making explicit all the assumptions that such a model makes about the data (Section 7.2).

• introduced classifiers that can perform multivariate regression rather than the univariate regression that is usually used in LCS (Section 7.2.2).

• showed how to apply variational Bayesian inference to find the posterior and model evidence of the Bayesian LCS model (Section 7.3).

• derived the predictive density of the Bayesian LCS model (Section 7.4).

• discussed the applicability of alternative model selection criteria to defining the optimal set of classifiers in LCS (Section 7.5).

An Algorithmic Description

• described an algorithmic implementation of variational Bayesian inference to find the model evidence and probability density for given data and a given model structure (Section 8.1).

• showed that the model structure search in LCS can be performed by any suitable global optimiser, by introducing the Markov Chain Monte Carlo method as an alternative to searching the optimal model structure by GA (Section 8.2.2).

• introduced two matching function representations that use matching by degree: radial basis-function matching and matching by soft interval (Section 8.3.1).

• showed empirically that the optimality criterion i) is able to separate the pattern in the data from the noise, ii) prefers simpler model structures over more complex ones, and iii) can handle problems where the noise variance varies over the input space (Section 8.3).

Reinforcement Learning with LCS

• introduced a formulation of LCS function approximation that characterises them by their value function approximation rather than their model (Section 9.3.1).

• formulated Bellman’s Equation for LCS as the basis of dynamic programming and reinforcement learning with LCS that train their classifiers independently (Section 9.3.2).

• showed how to perform approximate value iteration and approximate asynchronous value iteration with LCS (Section 9.3.3).

• derived the classifier parameter update when using Q-Learning with LCS, both when using the LMS and the RLS algorithm (Sections 9.3.4 and 9.3.5).

• clarified that XCS and XCSF already perform Q-Learning with gradient descent, without the need to modify them, contrary to that which is presented in [47, 45, 226, 227, 143, 141, 140] (Section 9.3.6).

• gave sufficient conditions for the stability of dynamic programming with LCS for a constant model structure, which is the prerequisite of ensuring stability of reinforcement learning with LCS (Section 9.4).

• discussed that XCSF using Q-Learning cannot provide stability at the parameter learning level, and conjectured that it might provide this stability at the structure learning level instead (Section 9.4).

• provided a counterexample where the relative error method that is supposed to handle long path learning in XCS [12] will almost certainly fail (Section 9.5.3).

11.2 Future Possibilities

Our work opens up a wide range of future work, the most obvious being a further evaluation of the suitability of the introduced optimality criterion to real-world applications, which still requires some modifications to the meth- ods we have introduced. Another challenging route is to use our optimality criterion to design a Michigan-style LCS, and with it bridge the gap between Pittsburgh-style and Michigan-style approaches.

New paths off the beaten LCS track are to derive properties of the optimal set of classifiers, such as whether overlapping classifiers are supported in such solutions, and to proceed further with the performance guarantees such as the stability of using LCS with reinforcement learning.

These future routes are amongst others discussed in more depth in Chapter 10.

11.3 Concluding Remarks

The model-centred approach taken in this work is holistic in the sense that rather than handling each LCS component separately, it allows us to deal with function approximation, reinforcement learning and classifier replace- ment from the same starting point, which is the model.

Is taking this approach really so much better than the ad-hoc approach; that is, does it result in better methods? Giving a definite answer to this question needs to be postponed until we have a fully functional LCS. Nonetheless, even taking the model-based perspective alone gives us a new view on LCS and is by itself a contribution. Also, considering that most popular machine learning methods started ad-hoc and were later improved by reformulating them from a model-centred perspective, applying the same methodology to reformulat- ing LCS is very likely to be profitable in the long run.

Another question is whether theoretical advances in a field really help improve its methods. Let us firstly claim that founding the theoretical understanding of a method is a sign of its maturity. The method does not necessarily need to be initially developed from the formal perspective, as Support Vector Machines (SVMs) were [222]. Still, providing a theoretical foundation that explains what a method is doing adds significantly to its understanding, if not also to its per- formance. An example where the understanding was improved is the interpre- tation of weight decay in neural networks as Gaussian priors on their weights (see Example 3.2.2). The significant performance increase of reinforcement learning through intelligent exploration can almost exclusively be attributed to advances in their theoretical understanding [127, 28, 207]. Correspondingly, while we cannot guarantee a further improvement of the already competitive performance of LCS in supervised learning tasks through advances from the theoretical side, such advances unquestionably increase their understanding and provide a different perspective.

Of course, we do not claim that the methodology presented in this work is the ultimate and only approach to designing LCS. We do not intend to stifle the innovation in this field. Rather, we want to promote its uptake for well-defined tasks such as regression and classification tasks, due to the obvious advantages that our approach promises.

Also, if Sutton’s value-function hypothesis [210] is correct, and value function learning is the only efficient way to handle sequential decision tasks, then these tasks are most likely best approached by taking the model-centred view as well. If, on the other hand, the task does not fall into these categories (for example, [199]), then an ad-hoc approach without strong formal foundations might still be the preferred choice for designing LCS. However, even following the route we have outlined leaves significant space for design variations in how to formulate the model, and in particular which method to develop or apply to search the space of possible model structures.

Overall, taking our perspective, the answer to “What is a Learning Classifier System?” is: a family of models that are defined by a global model being formed by a set of localised models known as classifiers, an approach for comparing competing models with respect to their suitability in representing the data, and a method to search the space of sets of classifiers to provide a good model for the problem at hand. Thus, we have added the model to the method.


Appendix A

Notation

The notation used in this work is very similar to the machine learning standard (for example, [19]). The subscript k always refers to the kth classifier, and the subscript n refers to the nth observation. The only exception is Chapter 5, which discusses a single classifier and therefore makes the use of k superfluous. Composite objects, like sets, vectors and matrices, are usually written in bold. Vectors are usually column vectors and are denoted by a lowercase symbol; matrices are denoted by an uppercase symbol. A superscript T denotes the transpose of a vector/matrix. A hat (ˆ) denotes an estimate. A star (*) in Chapter 7 denotes the parameters of the variational posterior, and the posterior itself, and in Chapter 9 indicates optimality.

The tables on the following pages give the symbol used in the first column, a brief explanation of its meaning in the second column, and, where appropriate, the section that is best consulted with respect to this symbol.

Sets, Functions and Distributions

∅               empty set
R               set of real numbers
N               set of natural numbers
E_X(X, Y)       expectation of X, Y with respect to X
var(X)          variance of X
cov(X, Y)       covariance between X and Y
Tr(A)           trace of matrix A
⟨x, y⟩           inner product of x and y (5.2)
⟨x, y⟩_A         inner product of x and y, weighted by matrix A (5.2)
‖x‖_A           norm of x associated with inner product space ⟨·, ·⟩_A (5.2)
‖x‖             Euclidean norm of x, ‖x‖ ≡ ‖x‖_I (5.2)
‖x‖_∞           maximum norm of x (9.2.1)
⊗, ⊘            operators for element-wise matrix and vector multiplication/division (8.1)
L               loss function, L : X × X → R^+ (3.1.1)
l               log-likelihood function (4.1.2)
N(x | µ, Σ)     normal distribution with mean vector µ and covariance matrix Σ (4.2)
Gam(x | a, b)   gamma distribution with shape a and scale b (7.2.3)
St(x | µ, Λ, a) Student’s t distribution with mean vector µ, precision matrix Λ, and a degrees of freedom (7.4)
p               probability mass/density
q               variational probability mass/density (7.3.1)
q*              variational posterior (7.3)
Γ               gamma function (7.2.3)
ψ               digamma function (7.3.7)
KL(q ‖ p)       Kullback-Leibler divergence between q and p (7.3.1)
L(q)            variational bound of q (7.3.1)
U               set of hidden variables (7.2.6)

Data and Model

X        input space (3.1)
Y        output space (3.1)
D_X      dimensionality of X (3.1.2)
D_Y      dimensionality of Y (3.1.2)
N        number of observations (3.1)
n        index referring to the nth observation (3.1)
X        set/matrix of inputs (3.1, 3.1.2)
Y        set/matrix of outputs (3.1, 3.1.2)
x        input, x ∈ X (3.1)
y        output, y ∈ Y (3.1)
υ        random variable for output y (5.1.1)
D        data/training set, D = {X, Y} (3.1)
f        target function, mean of the data-generating process, f : X → Y (3.1.1)
ǫ        zero-mean random variable, modelling the stochasticity of the data-generating process and measurement noise
M        model structure, M = {M, K} (3.1.1, 3.2.5)
θ        model parameters (3.2.1)
f̂_M      hypothesis for the data-generating process of the model with structure M, f̂_M : X → Y (3.1.1)
K        number of classifiers (3.2.2)
k        index referring to classifier k (3.2.3)

Classifier Model

X_k            input space of classifier k, X_k ⊆ X (3.2.3)
m_nk           binary matching random variable of classifier k for observation n (4.3.1)
m_k            matching function of classifier k, m_k : X → [0, 1] (3.2.3)
M              set of matching functions, M = {m_k} (3.2.5)
M_k            matching matrix of classifier k (5.2.1)
M              matching matrix for all classifiers (8.1)
θ_k            parameters of the model of classifier k (9.1.1)
w_k            weight vector of classifier k, w_k ∈ R^{D_X} (4.2)
ω_k            random vector for the weight vector of classifier k (5.1.1)
W_k            weight matrix of classifier k, W_k ∈ R^{D_Y × D_X} (7.2)
τ_k            noise precision of classifier k, τ_k ∈ R (4.2)
α_k            weight shrinkage prior (7.2)
a_τ, b_τ       shape and scale parameters of the prior on the noise precision (7.2)
a_τk, b_τk     shape and scale parameters of the posterior on the noise precision of classifier k (7.3.2)
a_α, b_α       shape and scale parameters of the hyperprior on the weight shrinkage priors (7.2)
a_αk, b_αk     shape and scale parameters of the hyperposterior on the weight shrinkage prior of classifier k (7.3.3)
W              set of weight matrices, W = {W_k} (7.2)
τ              set of noise precisions, τ = {τ_k} (7.2)
α              set of weight shrinkage priors, α = {α_k} (7.2)
ǫ_k            zero-mean Gaussian noise for classifier k (5.1.1)
c_k            match count of classifier k (5.2.2)
Λ_k^{-1}       input covariance matrix (for RLS, input correlation matrix) of classifier k (5.3.5)
γ              step size for gradient-based algorithms (5.3)
λ_min / λ_max  smallest / largest eigenvalue of the input correlation matrix c_k^{-1} X^T M_k X (5.3)
T              time constant (5.3)
λ              ridge complexity (5.3.5)
λ              decay factor for recency-weighting (5.3.5)
ζ              Kalman gain (5.3.6)

Gating Network / Mixing Model

z_nk           binary latent variable, associating observation n with classifier k (4.1)
r_nk           responsibility of classifier k for observation n, r_nk = E(z_nk) (4.1.3, 7.3.2)
v_k            gating/mixing vector associated with classifier k, v_k ∈ R^{D_V} (4.1.2)
β_k            mixing weight shrinkage prior associated with classifier k (7.2)
a_β, b_β       shape and scale parameters of the hyperprior on the mixing weight shrinkage priors (7.2)
a_βk, b_βk     shape and scale parameters of the hyperposterior on the mixing weight shrinkage priors, associated with classifier k (7.3.5)
Z              set of latent variables, Z = {z_nk} (4.1)
V              set/vector of gating/mixing vectors (4.1.2)
β              set of mixing weight shrinkage priors, β = {β_k} (7.2)
D_V            dimensionality of the gating/mixing space (6.1)
g_k            gating/mixing function (softmax function in Section 4.1.2, any mixing function in Chapter 6, otherwise the generalised softmax function), g_k : X → [0, 1] (4.1.2, 4.3.1)
φ              transfer function, φ : X → R^{D_V} (6.1)
Φ              mixing feature matrix, Φ ∈ R^{N × D_V} (8.1)
H              Hessian matrix, H ∈ R^{K D_V × K D_V} (6.1.1)
E              error function of the mixing model, E : R^{K D_V} → R (6.1.1)
γ_k            function returning the quality metric of the model of classifier k for input x, γ_k : X → R^+ (6.2)

Dynamic Programming and Reinforcement Learning

X              set of states (9.1.1)
x              state, x ∈ X (9.1.1)
N              number of states (9.1.1)
A              set of actions (9.1.1)
a              action, a ∈ A (9.1.1)
r_xx'(a)       reward function, r : X × X × A → R (9.1.1)
r_xx'^µ        reward function for policy µ (9.1.1)
r_x^µ          reward function for expected rewards and policy µ (9.1.1)
r^µ            reward vector of expected rewards for policy µ, r^µ ∈ R^N (9.1.1)
p^µ            transition function for policy µ (9.1.1)
P^µ            transition matrix for policy µ, P^µ ∈ [0, 1]^{N×N} (9.1.4)
γ              discount rate, 0 < γ ≤ 1 (9.1.1)
µ              policy, µ : X → A (9.1.1)
V              value function, V : X → R; V* optimal, V^µ for policy µ, Ṽ approximated (9.1.2)
V              value vector, V ∈ R^N; V* optimal, V^µ for policy µ, Ṽ approximated (9.1.4)
Ṽ_k            value vector approximated by classifier k (9.3.1)
Q              action-value function, Q : X × A → R; Q* optimal, Q^µ for policy µ, Q̃ approximated (9.1.2)
Q̃_k            action-value function approximated by classifier k (9.3.4)
T              dynamic programming operator (9.2.1)
T_µ            dynamic programming operator for policy µ (9.2.1)
T_µ^{(λ)}      temporal-difference learning operator for policy µ (9.2.4)
Π              approximation operator (9.2.3)
Π_k            approximation operator of classifier k (9.3.1)
α              step size for gradient-based incremental algorithms (9.2.6)

Appendix B

XCS and XCSF

As frequently referred to throughout this work, we here give a short account of the functionality of XCS [240, 241] and XCSF [243, 244] from the model-based perspective. The interested reader is referred to [57] for a full description of its algorithmic implementation. The description here focuses on XCSF and only considers XCS explicitly in cases where it differs from XCSF.

Even though XCSF is trained incrementally and is designed to handle sequential decision tasks, we describe it here as if it performed batch learning and univariate regression, to relate it more easily to the methods that are described in this work. More information on how XCSF handles sequential decision tasks is given in Section 9.3.

We assume a univariate regression setup as described in Section 3.1.2 with N given observations. The description concentrates firstly on the classifier and mixing models, and how to find the model parameters for a fixed model structure M, and then focuses on how the model structure search in XCSF searches for better model structures.

B.1 Classifier Model and Mixing Model

Let us assume a model structure M = {K, M} with K classifiers and their matching functions M = {m_k : X → [0, 1]}. The classifier models are univariate regression models that are trained independently by maximum likelihood and thus aim at finding weight vectors w_k that minimise

    \sum_{n=1}^{N} m_k(x_n) \left( w_k^T x_n - y_n \right)^2, \qquad k = 1, \dots, K,    (B.1)

as described in more detail in Chapter 5.

In addition to the weight vector, each classifier maintains its match count c_k, called experience, and estimates its mean absolute prediction error ǫ_k, simply called error, by

    \epsilon_k = c_k^{-1} \sum_{n=1}^{N} m_k(x_n) \left| y_n - w_k^T x_n \right|.    (B.2)

A classifier’s accuracy is some inverse function κ(ǫ_k) of the classifier error. It was initially given by an exponential [240], but was later [241, 57] redefined to

    \kappa(\epsilon) = \begin{cases} 1 & \text{if } \epsilon < \epsilon_0, \\ \alpha (\epsilon / \epsilon_0)^{-\nu} & \text{otherwise,} \end{cases}    (B.3)

where the constant scalar ǫ_0 is the minimum error, the constant α is the scaling factor, and the constant ν is a mixing power factor [57].

The accuracy is constantly 1 up to the error ǫ_0 and then drops off steeply, with the shape of the drop determined by α and ν. The relative accuracy is a classifier’s accuracy for a single input, normalised by the sum of the accuracies of all classifiers matching that input. The fitness is the relative accuracy of a classifier averaged over all inputs that it matches, that is

    F_k = c_k^{-1} \sum_{n=1}^{N} \frac{m_k(x_n) \kappa(\epsilon_k)}{\sum_{j=1}^{K} m_j(x_n) \kappa(\epsilon_j)}.    (B.4)

Each classifier additionally maintains an estimate of the action set size as_k, which is the average number of classifiers that match the classifier’s matched

inputs, and is given by

    as_k = c_k^{-1} \sum_{n=1}^{N} m_k(x_n) \sum_{j=1}^{K} m_j(x_n).    (B.5)

The error, fitness, and action set size are incrementally updated by the LMS algorithm (see Section 5.3.3), using the MAM update (see Section 5.4.1). The weight vector is updated in XCSF by the NLMS algorithm (see Section 5.3.4), and in XCS by the LMS algorithm and the MAM update with x_n = 1 for all n.

The mixing model is the fitness-weighted average of all matching classifiers (see also Section 6.2.5), and is formally specified by the mixing function

    g_k(x) = \frac{m_k(x) F_k}{\sum_{j=1}^{K} m_j(x) F_j}.    (B.6)
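Purely to tie Eqs. (B.1) to (B.6) together, the following sketch computes these batch quantities for a set of classifiers from their matching values over the data. The array shapes and the parameter values eps0, alpha and nu are illustrative assumptions, and it is assumed that every classifier matches at least one input and every input is matched by at least one classifier.

    import numpy as np

    def xcs_batch_quantities(X, y, M, eps0=0.01, alpha=0.1, nu=5.0):
        # X: (N, D_X) inputs, y: (N,) outputs, M: (N, K) matching values m_k(x_n).
        # Returns weight vectors W, errors eps, fitnesses F, action set sizes as_,
        # and the fitness-weighted mixing weights G.
        N, K = M.shape
        W = np.zeros((K, X.shape[1]))
        eps = np.zeros(K)
        c = M.sum(axis=0)                                    # match counts c_k
        for k in range(K):
            sw = np.sqrt(M[:, k])                            # matched least squares, Eq. (B.1)
            W[k] = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
            eps[k] = np.sum(M[:, k] * np.abs(y - X @ W[k])) / c[k]   # Eq. (B.2)
        ratio = np.maximum(eps, 1e-12) / eps0                # guard against 0 ** (-nu)
        kappa = np.where(eps < eps0, 1.0, alpha * ratio ** (-nu))    # Eq. (B.3)
        rel_acc = (M * kappa) / (M * kappa).sum(axis=1, keepdims=True)
        F = rel_acc.sum(axis=0) / c                          # Eq. (B.4)
        as_ = (M * M.sum(axis=1, keepdims=True)).sum(axis=0) / c     # Eq. (B.5)
        G = (M * F) / (M * F).sum(axis=1, keepdims=True)     # Eq. (B.6)
        return W, eps, F, as_, G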

For both classifier and mixing model training, XCSF aims at minimising the empirical risk rather than the expected risk, regardless of the risk of overfitting that comes with this approach. Overfitting is handled at the model structure search level, as will be described in the following section.

B.2 Model Structure Search

The model structure search incrementally improves the model structure by promoting classifiers whose error is close to but not above ǫ_0 (that is, classifiers that are most general but still accurate), and a set of classifiers that is non-overlapping in the input space.

The search is performed by a Michigan-style niche GA that interprets a single classifier as an individual in a population, formed by the current set of classifiers. The set of classifiers that matches the current input is called the match set, and its subset that promotes the performed action is called the action set¹. In regression tasks, these two sets are equivalent, as the actions are irrelevant.

¹ Initially, XCS as described in [240] performed GA reproduction in the match set, but was later modified to act on the action set [241]. The description given here conforms to the latter version.

Reproduction of classifiers is invoked at regular intervals, based on the time since the last invocation, averaged over the classifiers in the current action set. Upon reproduction, two classifiers from the current action set are selected with probabilities proportional to their fitnesses², are then copied, and – after performing crossover and mutation on their condition, which represents their matching function – are injected back into the current population. If the number of classifiers in the population reaches a certain preset limit on the population size, deletion occurs. Classifier deletion is not limited to the current action set but, in general³, classifiers are selected with a probability proportional to their estimated action set size as_k. If unmatched inputs are observed, XCSF induces classifiers into the population that match that input, called covering, and additionally deletes other classifiers if the population size grows out of bounds.

As reproduction is performed in the action sets, classifiers which are more general and thus participate in more action sets are more likely to reproduce. Deletion, on the other hand, does not depend on the classifiers’ generality but mainly on their action set size estimates. In combination, this causes a preference for more general classifiers that are still considered as being accurate, a GA pressure called the set pressure in [53]. Note that due to the fitness pressure, classifiers with ǫ > ǫ_0 will have a very low fitness and are therefore very unlikely to be selected for reproduction. The deletion pressure refers to deletion being proportional to the action set size estimates, and causes an even distribution of resources over all inputs. The mutation pressure depends on the mutation operator and in general pushes the classifiers towards more generality, up to a certain threshold.

In combination, these pressures cause XCSF to evolve classifiers that feature an error ǫ as close to ǫ_0 as possible. Thus, generality of the classifiers is controlled by the parameter ǫ_0. Therefore, overfitting is avoided by the explicit tendency of classifiers to feature some (small) deliberate error.

² Selection for reproduction does not need to be with probabilities proportional to classifier fitness. As an alternative, tournament selection has been used [56].
³ Various variations to the described deletion scheme have been proposed and investigated in [240, 132, 137].

XCSF additionally prefers non-overlapping sets of classifiers, as overlapping classifiers compete for selection within the same action set until either of them dominates. For a further discussion of the set of classifiers that XCSF tends to evolve, see Section 7.1.1.



Bibliography

[1] Davide Aliprandi, Alix Mancastroppa, and Matteo Matteucci. A Bayesian Approach to Learning Classifier Systems in Uncertain Envi- ronments. In Keijzer et al. [129], pages 1537–1544.

[2] Brian D. O. Anderson and John B. Moore. Optimal Filtering. Information and System Sciences Series. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1979.

[3] G. Armano. NXCS Experts for Financial Time Series Forecasting. In Bull [32], pages 68–91.

[4] A. Asuncion and D.J. Newman. UCI machine learning repository, 2007.

[5] Arik Azran. Data Dependent Risk Bounds and Algorithms for Hierar- chical Mixture of Experts Classifiers. Master’s thesis, Israel Institute of Technology, Haifa, Israel, June 2004.

[6] Arik Azran and Ron Meir. Data Dependent Risk Bounds for Hierarchical Mixture of Experts Classifiers. In John Shawe-Taylor and Yoram Singer, editors, Learning Theory, 17th Annual Conference on Learning Theory, COLT 2004, Banff, Canada, July 1-4, 2004, Proceedings, volume 3120 of Lecture Notes in Computer Science, pages 427–441. Springer, 2004.

[7] Jaume Bacardit and Josep M. Garrell Guiu. Bloat control and general- ization pressure using the minimum description length principle for a Pittsburgh approach Learning Classifier System. In Kovacs et al. [138], pages 59–79.

[8] Jaume Bacardit, Michael Stout, Jonathan D. Hirst, Kumara Sastry, Xavier Llorà, and Natalio Krasnogor. Automated Alphabet Reduction Method

with Evolutionary Algorithms for Protein Structure Prediction. In Thierens et al. [216], pages 346–353.

[9] Wolfgang Banzhaf, Jason M. Daida, A. E. Eiben, Max H. Garzon, Vasant Honavar, Mark J. Jakiela, and Robert E. Smith, editors. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 1999), San Francisco, CA, USA, 13-17 July 1999. Morgan Kaufmann.

[10] Alwyn Barry. XCS Performance and Population Structure within Multiple- Step Environments. PhD thesis, Queens University Belfast, 2000.

[11] Alwyn M. Barry. The Stability of Long Action Chains in XCS. In Bull et al. [37], pages 183–199.

[12] Alwyn M. Barry. Limits in Long Path Learning with XCS. In Cantú-Paz et al. [59], pages 1832–1843.

[13] Peter L. Bartlett, Stéphane Boucheron, and Gábor Lugosi. Model selection and error estimation. Machine Learning, 48:85–113, 2002.

[14] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:462–482, 2002.

[15] José M. Bernardo and Adrian F. M. Smith. Bayesian Theory. Wiley, 1994.

[16] Dimitri P. Bertsekas, Vivek S. Borkar, and Angelia Nedić. Improved Temporal Difference Methods with Linear Function Approximation. In Jennie Si, Andrew G. Barto, Warren Buckler Powell, and Don Wunsch, editors, Handbook of Learning and Approximate Dynamic Programming, chapter 9, pages 235–260. Wiley Publishers, August 2004.

[17] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.

[18] H.-G. Beyer, U.-M. O’Reilly, D.V. Arnold, W. Banzhaf, C. Blum, E.W. Bonabeau, E. Cantú-Paz, D. Dasgupta, K. Deb, J.A. Foster, E.D. de Jong, H. Lipson, X. Llorà, S. Mancoridis, M. Pelikan, G.R. Raidl, T. Soule, A. Tyrrell, J.-P. Watson, and E. Zitzler, editors. Proceedings of the Genetic and Evolutionary Computation Conference, GECCO-2005, volume 2, New York, NY, USA, 2005. ACM Press.

[19] Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, 2006.

[20] Christopher M. Bishop and Markus Svensén. Bayesian Hierarchical Mixtures of Experts. In Proceedings of the 19th Annual Conference on Uncertainty in Artificial Intelligence (UAI-03), pages 57–64, San Francisco, CA, 2003. Morgan Kaufmann.

[21] Lashon B. Booker. Triggered rule discovery in classifier systems. In J. David Schaffer, editor, Proceedings of the 3rd International Conference on Genetic Algorithms (ICGA89), pages 265–274, George Mason University, June 1989. Morgan Kaufmann.

[22] Lashon B. Booker. Do We Really Need to Estimate Rule Utilities in Clas- sifier Systems? In Lanzi et al. [146], pages 125–142.

[23] Lashon B. Booker. Approximating value function in classifier systems. In Bull and Kovacs [36].

[24] Lashon B. Booker, May 2006. Personal Communication.

[25] Justin A. Boyan and Andrew W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Process- ing Systems 7, pages 369–376, Cambridge, MA, 1995. The MIT Press.

[26] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cam- bridge University Press, 2004.

[27] Steven J. Bradtke. Reinforcement Learning Applied to Linear Quadratic Regulation. In Advances in Neural Information Processing Systems, vol- ume 5. Morgan Kaufmann Publishers, 1993.

[28] R. I. Brafman and M. Tennenholtz. R-max: a General Polynomial Time Algorithm for Near-optimal Reinforcement Learning. In Proceedings of the 17th International Joint Conference on Artificial Intelligence, pages 953– 958, 2001.

[29] Gavin Brown, Tim Kovacs, and James Marshall. UCSPv: Principled Vot- ing in UCS Rule Populations. In Thierens et al. [216], pages 1774–1782.

[30] Larry Bull. Simple Markov Models of the Genetic Algorithm in Classifier Systems: Multi-step Tasks. In Lanzi et al. [149].

327 [31] Larry Bull. On accuracy-based fitness. Journal of Soft Computing, 6(3– 4):154–161, 2002.

[32] Larry Bull, editor. Applications of Learning Classifier Systems, volume 150 of Studies in Fuzziness and Soft Computing. Springer, 2004.

[33] Larry Bull. Two Simple Learning Classifier Systems. In Bull and Kovacs [36], pages 63–90. YCS part also in TR UWELCSG03–005.

[34] Larry Bull and Jacob Hurst. ZCS redux. Evolutionary Computation, 10(2):185–205, 2002.

[35] Larry Bull and Jacob Hurst. A Neural Learning Classifier System with Self-Adaptive Constructivism. In Proceedings of the 2003 IEEE Congress on Evolutionary Computation, volume 2, pages 991–997. IEEE Press, 2003. Also TR UWELCSG03-003.

[36] Larry Bull and Tim Kovacs, editors. Foundations of Learning Classifier Systems, volume 183 of Studies in Fuzziness and Soft Computing. Springer Verlag, Berlin, 2005.

[37] Larry Bull, Pier Luca Lanzi, and Wolfgang Stolzmann, editors. Journal of Soft Computing, volume 6. Elsevier Science Publishers, 2002.

[38] Larry Bull and Toby O’Hara. A Neural Rule Representation for Learning Classifier Systems. In Lanzi et al. [147].

[39] Larry Bull, J. Sha’Aban, Andy Tomlinson, J. D. Addison, and B.G. Hey- decker. Towards Distributed Adaptive Control for Road Traffic Junction Signals using Learning Classifier Systems. In Bull [32], pages 279–299.

[40] Martin V. Butz. An Algorithmic Description of ACS2. In Lanzi et al. [147], pages 211–229.

[41] Martin V. Butz. Kernel-based, Ellipsoidal Conditions in the Real-Valued XCS Classifier System. In Beyer et al. [18], pages 1835–1842.

[42] Martin V. Butz. Rule-Based Evolutionary Online Learning Systems: A Prin- cipled Approach to LCS Analysis and Design, volume 191 of Studies in Fuzzi- ness and Soft Computing. Springer, 2006.

[43] Martin V. Butz, July 2007. Personal Communication.

[44] Martin V. Butz and David E. Goldberg. Bounding the population size in XCS to ensure reproductive opportunities. In Cantú-Paz et al. [59], pages 1844–1856.

[45] Martin V. Butz, David E. Goldberg, and Pier Luca Lanzi. Gradient De- scent Methods in Learning Classifier Systems: Improving XCS Perfor- mance in Multistep Problems. Technical Report 2003028, Illinois Genetic Algorithms Laboratory, December 2003.

[46] Martin V. Butz, David E. Goldberg, and Pier Luca Lanzi. Bound- ing Learning Time in XCS. In Kalyanmoy Deb, Riccardo Poli, Wolf- gang Banzhaf, Hans-Georg Beyer, Edmund K. Burke, Paul J. Darwen, Dipankar Dasgupta, Dario Floreano, James A. Foster, Mark Harman, Owen Holland, Pier Luca Lanzi, Lee Spector, Andrea Tettamanzi, Dirk Thierens, and Andrew M. Tyrrell, editors, Genetic and Evolutionary Com- putation - GECCO 2004, Genetic and Evolutionary Computation Conference, Seattle, WA, USA, June 26-30, 2004, Proceedings, volume 3102 of Lecture Notes in Computer Science. Springer Verlag, 2004.

[47] Martin V. Butz, David E. Goldberg, and Pier Luca Lanzi. Gradient De- scent Methods in Learning Classifier Systems: Improving XCS Perfor- mance in Multistep Problems. IEEE Transactions on Evolutionary Compu- tation, 9(5):452–473, October 2005. Also IlliGAl TR No. 2003028.

[48] Martin V. Butz, David E. Goldberg, and Wolfgang Stolzmann. Introduc- ing a Genetic Generalization Pressure to the Anticipatory Classifier Sys- tem Part I: Theoretical Approach. In Proceedings of the 2000 Genetic and Evolutionary Computation Conference (GECCO 2000), pages 34–41, 2000.

[49] Martin V. Butz, David E. Goldberg, and Kurian Tharakunnel. Analy- sis and Improvement of Fitness Exploitation in XCS: Bounding Models, Tournament Selection and Bilateral Accuracy. Evolutionary Computation, 11:239–277, 2003.

[50] Martin V. Butz, Tim Kovacs, Pier Luca Lanzi, and Stewart Wilson. To- ward a Theory of Generalization and Learning in XCS. IEEE Transaction on Evolutionary Computation, 8:28–46, 2004.

[51] Martin V. Butz, Tim Kovacs, Pier Luca Lanzi, and Stewart W. Wilson. How XCS Evolves Accurate Classifiers. In Spector et al. [203], pages 927–934.

329 [52] Martin V. Butz, Pier Luca Lanzi, and Stewart W. Wilson. Hyper- ellipsoidal conditions in XCS: Rotation, linear approximation, and so- lution structure. In Keijzer et al. [129], pages 1457–1464.

[53] Martin V. Butz and Martin Pelikan. Analyzing the Evolutionary Pres- sures in XCS. In Spector et al. [203], pages 935–942.

[54] Martin V. Butz and Martin Pelikan. Studying XCS/BOA learning in Boolean functions: structure encoding and random Boolean functions. In Keijzer et al. [129], pages 1449–1456.

[55] Martin V. Butz, Martin Pelikan, Xavier Llorà, and David E. Goldberg. Automated global structure extraction for effective local building block processing in XCS. Evolutionary Computation, 14(3), September 2006.

[56] Martin V. Butz, Kumara Sastry, and David E. Goldberg. Tournament selection: Stable fitness pressure in XCS. In Cantu-Paz´ et al. [59], pages 1857–1869.

[57] Martin V. Butz and Stewart W. Wilson. An Algorithmic Description of XCS. In Bull et al. [37], pages 144–153.

[58] Martin V. Butz, David E. Goldberg, and Pier Luca Lanzi. Computational Complexity of the XCS Classifier System. In Bull and Kovacs [36].

[59] Erick Cantú-Paz, James A. Foster, Kalyanmoy Deb, Lawrence Davis, Rajkumar Roy, Una-May O’Reilly, Hans-Georg Beyer, Russell K. Standish, Graham Kendall, Stewart W. Wilson, Mark Harman, Joachim Wegener, Dipankar Dasgupta, Mitchell A. Potter, Alan C. Schultz, Kathryn A. Dowsland, Natasa Jonoska, and Julian F. Miller, editors. Genetic and Evolutionary Computation - GECCO 2003, Genetic and Evolutionary Computation Conference, Chicago, IL, USA, July 12-16, 2003. Proceedings, volume 2723 of Lecture Notes in Computer Science. Springer, 2003.

[60] Jorge Casillas, Brian Carse, and Larry Bull. Fuzzy-XCS: A Michigan Genetic Fuzzy System. IEEE Transactions on Fuzzy Systems, 15(4), August 2007.

[61] Keith Chalk and George D. Smith. Multi-Agent Classifier Systems and the Iterated Prisoner’s Dilemma. In George D. Smith, Nigel C. Steele, and Rudolf F. Albrecht, editors, Artificial Neural Networks and Genetic Al- gorithms, pages 615–618. Springer, 1997.

330 [62] Hugh Chipman, Edward I. George, and Robert E. McCulloch. Bayesian Treed Models. Machine Learning, 48(1–3):299–320, July 2002.

[63] Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Bayesian CART Model Search. Journal of the American Statistical Asso- ciation, 93(443):935–948, September 1998.

[64] David Corne, Zbigniew Michalewicz, Marco Dorigo, Gusz Eiben, David Fogel, Carlos Fonseca, Garrison Greenwood, Tan Kay Chen, Guenther Raidl, Ali Zalzala, Simon Lucas, Ben Paechter, Jennifier Willies, Juan J. Merelo Guervos, Eugene Eberbach, Bob McKay, Alastair Channon, Ashutosh Tiwari, L. Gwenn Volkert, Dan Ashlock, and Marc Schoe- nauer, editors. Proceedings of the 2005 IEEE Congress on Evolutionary Com- putation, volume 3. IEEE Press, 2005.

[65] David Corne, Zbigniew Michalewicz, Marco Dorigo, Gusz Eiben, David Fogel, Carlos Fonseca, Garrison Greenwood, Tan Kay Chen, Guenther Raidl, Ali Zalzala, Simon Lucas, Ben Paechter, Jennifier Willies, Juan J. Merelo Guervos, Eugene Eberbach, Bob McKay, Alastair Channon, Ashutosh Tiwari, L. Gwenn Volkert, Dan Ashlock, and Marc Schoe- nauer, editors. Proceedings of the 2005 IEEE Congress on Evolutionary Com- putation, volume 1. IEEE Press, 2005.

[66] Don Coursey and Hans Nyquist. On Least Absolute Error Estimation of Linear Regression Models with Dependent Stable Residuals. The Review of Economics and Statistics, 65(4):687–692, November 1983.

[67] R. Dearden, N. Friedman, and S. Russell. Bayesian Q-Learning. In Proceedings of the 15th National Conference on Artificial Intelligence, Menlo Park, CA, USA, 1998.

[68] T. Degris, O. Sigaud, and P.-H. Wuillemin. Learning the Structure of Factored Markov Decision Processes in Reinforcement Learning Problems. In Proceedings of the 23rd International Conference on Machine Learning (ICML’2006), pages 257–264, CMU, Pennsylvania, USA, 2006.

[69] Morris H. DeGroot. Lindley’s Paradox: Comment. Journal of the American Statistical Association, 77(378):337–339, June 1982.

[70] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1–38, 1977.

331 [71] David G. T. Denison, Christopher C. Holmes, Bani K. Mallick, and Adrian F. M. Smith. Bayesian Methods for Nonlinear Classification and Re- gression. Wiley Series in Probability and Statistics. John Wiley & Sons, Ltd., 2002.

[72] David L. Donoho and Iain M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81:425–455, 1994.

[73] Marco Dorigo and Hugues Bersini. A Comparison of Q-Learning and Classifier Systems. In Dave Cliff, Philip Husbands, Jean-Arcady Meyer, and Stewart W. Wilson, editors, From Animals to Animats 3. Proceed- ings of the Third International Conference on Simulation of Adaptive Behavior (SAB94), pages 248–255. A Bradford Book. MIT Press, 1994.

[74] Marco Dorigo and U. Schnepf. Genetic-based Machine Learning and Be- haviour Based Robotics: A New Synthesis. IEEE Transactions on Systems, Man and Cybernetics, 23(1), 1993.

[75] Scott C. Douglas. A Family of Normalized LMS Algorithms. IEEE Signal Processing Letters, SPL-1(3):49–51, March 1994.

[76] Jan Drugowitsch and Alwyn M. Barry. XCS with Eligibility Traces. In Beyer et al. [18], pages 1851–1858.

[77] Jan Drugowitsch and Alwyn M. Barry. A Formal Framework and Extensions for Function Approximation in Learning Classifier Systems. Technical Report 2006–01, University of Bath, U.K., January 2006.

[78] Jan Drugowitsch and Alwyn M. Barry. A Formal Framework for Reinforcement Learning with Function Approximation in Learning Classifier Systems. Technical Report 2006–02, University of Bath, U.K., January 2006.

[79] Jan Drugowitsch and Alwyn M. Barry. Towards Convergence of Learning Classifier Systems Value Iteration. Technical Report 2006–03, University of Bath, U.K., April 2006.

[80] Jan Drugowitsch and Alwyn M. Barry. Towards Convergence of Learning Classifier Systems Value Iteration. In Proceedings of the 9th International Workshop on Learning Classifier Systems, pages 16–20, 2006.

[81] Jan Drugowitsch and Alwyn M. Barry. Generalised Mixtures of Experts, Independent Expert Training, and Learning Classifier Systems. Technical Report 2007–02, University of Bath, April 2007.

[82] Jan Drugowitsch and Alwyn M. Barry. Mixing independent classifiers. In Thierens et al. [216], pages 1596–1603. Also TR CSBU-2006-13.

[83] Faten Kharbat, Larry Bull, and Mohammed Odeh. Revisiting genetic selection in the XCS learning classifier system. In Corne et al. [64], pages 2061–2068.

[84] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.

[85] Terence C. Fogarty, Larry Bull, and Brian Carse. Evolving Multi-Agent Systems. In J. Periaux and G. Winter, editors, Genetic Algorithms in Engineering and Computer Science, pages 3–22. John Wiley & Sons, 1995.

[86] Stephanie Forrest and John H. Miller. Emergent behavior in classifier systems. In Stephanie Forrest, editor, Emergent Computation. Proceedings of the Ninth Annual International Conference of the Center for Nonlinear Studies on Self-organizing, Collective, and Cooperative Phenomena in Natural and Artificial Computing Networks. A special issue of Physica D, volume 42, pages 213–217. Elsevier Science Publishers, 1990.

[87] P. Gérard, J.-A. Meyer, and O. Sigaud. Combining Latent Learning with Dynamic Programming in MACS. European Journal of Operational Research, 160:614–637, 2005.

[88] P. Gérard and O. Sigaud. Adding a Generalization Mechanism to YACS. In Spector et al. [203], pages 951–957.

[89] P. Gérard and O. Sigaud. YACS: Combining Anticipation and Dynamic Programming in Classifier Systems. In Lanzi et al. [149], pages 52–69.

[90] P. Gérard and O. Sigaud. Designing Efficient Exploration with MACS: Modules and Function Approximation. In Cantú-Paz et al. [59], pages 1882–1893.

[91] Mark N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, University of Cambridge, 1997.

[92] Federico Girosi, Michael Jones, and Tomaso Poggio. Regularization Theory and Neural Networks Architectures. Neural Computation, 7:219–269, 1995.

[93] David E. Goldberg. Genetic Algorithms in Search, Optimisation, and Ma- chine Learning. Addison-Wesley, MA, 1989.

[94] Geoffrey J. Gordon. Online fitted reinforcement learning from the value function approximation. In Workshop on Value Function Approximation held during the 12th International Conference on Machine Learning, 1995.

[95] Geoffrey J. Gordon. Stable Function Approximation in Dynamic Programming. In Armand Prieditis and Stuart Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 261–268, San Francisco, CA, USA, 1995. Morgan Kaufmann.

[96] Franklin A. Graybill. An Introduction to Linear Statistical Models, volume 1. McGraw-Hill Education, 1961.

[97] A. Greenyer. The use of a learning classifier system JXCS. In P. van der Putten and M. van Someren, editors, CoIL Challenge 2000: The Insurance Company Case. Leiden Institute of Advanced Computer Science, June 2000. Technical report 2000-09.

[98] John J. Grefenstette, editor. Proceedings of the 2nd International Conference on Genetic Algorithms (ICGA87), Cambridge, MA, July 1987. Lawrence Erlbaum Associates.

[99] John J. Grefenstette. Evolutionary Algorithms in Robotics. In M. Jamshedi and C. Nguyen, editors, Robotics and Manufacturing: Recent Trends in Research, Education and Applications, v5. Proc. Fifth Intl. Symposium on Robotics and Manufacturing, ISRAM 94, pages 65–72. ASME Press: New York, 1994. http://www.ib3.gmu.edu/gref/.

[100] Peter D. Grünwald. A tutorial introduction to the minimum description length. In Peter Grünwald, Jae Myung, and Mark A. Pitt, editors, Advances in Minimum Description Length Theory and Applications, Neural Information Processing Series, chapter 1 & 2, pages 3–79. MIT Press, Cambridge, MA, USA, 2005.

[101] D. Harrison and D. L. Rubinfeld. Hedonic Prices and the Demand for Clean Air. Journal of Environmental Economics and Management, 5:81–102, 1978.

[102] Sherif Hashem. Optimal Linear Combination of Neural Networks. PhD thesis, Purdue University, December 1993.

[103] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, 2001.

[104] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970.

[105] Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall International, Upper Saddle River, NJ, 2nd edition, 1999.

[106] Simon Haykin. Adaptive Filter Theory. Information and System Sciences Series. Prentice Hall, Upper Saddle River, NJ, 4th edition, 2002.

[107] John A. Hertz and Richard G. Palmer. Introduction to the Theory of Neural Computation. Westview Press, 1991.

[108] Jennifer A. Hoeting, David Madigan, Adrian E. Raftery, and Chris T. Volinsky. Bayesian Model Averaging: A Tutorial. Statistical Science, 14(4):382–417, 1999.

[109] Joseph Hofbauer and Karl Sigmund. Evolutionary Games and Population Dynamics. Cambridge University Press, 1998.

[110] John H. Holland. Hierarchical descriptions of universal spaces and adaptive systems. Technical Report ORA Projects 01252 and 08226, University of Michigan, 1968.

[111] John H. Holland. Processing and processors for schemata. In E. L. Jacks, editor, Associative Information Processing, pages 127–146. New York: American Elsevier, 1971.

[112] John H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, 1975. Republished by the MIT press, 1992.

[113] John H. Holland. Properties of the bucket brigade. In John J. Grefenstette, editor, Proceedings of the 1st International Conference on Genetic Algorithms and their Applications (ICGA85), pages 1–7. Lawrence Erlbaum Associates: Pittsburgh, PA, July 1985.

[114] John H. Holland. A Mathematical Framework for Studying Learning in Classifier Systems. Physica D, 22:307–317, 1986.

[115] John H. Holland. Escaping Brittleness: The Possibilities of General-Purpose Learning Algorithms Applied to Parallel Rule-Based Systems. In Mitchell, Michalski, and Carbonell, editors, Machine Learning, an Artificial Intelligence Approach. Volume II, chapter 20, pages 593–623. Morgan Kaufmann, 1986.

[116] John H. Holland, Lashon B. Booker, Marco Colombetti, Marco Dorigo, David E. Goldberg, Stephanie Forrest, Rick L. Riolo, Robert E. Smith, Pier Luca Lanzi, Wolfgang Stolzmann, and Stewart W. Wilson. What is a Learning Classifier System? In Lanzi et al. [146], pages 3–32.

[117] John H. Holland and J. S. Reitman. Cognitive systems based on adaptive algorithms. In D. A. Waterman and F. Hayes-Roth, editors, Pattern-directed Inference Systems. New York: Academic Press, 1978. Reprinted in: Evolutionary Computation. The Fossil Record. David B. Fogel (Ed.) IEEE Press, 1998. ISBN: 0-7803-3481-7.

[118] Rob J. Hyndman. Computing and graphing highest density regions. The American Statistician, 50(2):120–126, May 1996.

[119] Tommi S. Jaakkola. Tutorial on variational approximation methods. In Manfred Opper and David Saad, editors, Advanced Mean Field Methods, pages 129–160. MIT Press, 2001.

[120] Tommi S. Jaakkola and Michael I. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10(1):25–37, 2000.

[121] R. A. Jacobs, M. I. Jordan, S. Nowlan, and G. E. Hinton. Adaptive mix- tures of local experts. Neural Computation, 3:1–12, 1991.

[122] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181–214, 1994.

[123] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence, 101:99–134, 1998.

[124] Rudolph Emil Kalman. A New Approach to Linear Filtering and Prediction Problems. Transactions of the ASME–Journal of Basic Engineering, 82(Series D):35–45, 1960.

[125] Rudolph Emil Kalman and R. S. Bucy. New results in linear filtering and prediction theory. Transactions ASME, Part D (J. Basic Engineering), 83:95–108, 1961.

[126] Michael J. Kearns, Yishay Mansour, Andrew Y. Ng, and Dana Ron. An experimental and theoretical comparison of model selection methods. Machine Learning, 27:7–50, 1997.

[127] Michael J. Kearns and S. Singh. Near-optimal Reinforcement Learning in Polynomial Time. In Proceedings of the 15th International Conference on Machine Learning, pages 260–268, San Francisco, CA, USA, 1998. Morgan Kaufmann.

[128] Michael J. Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA, USA, 1994.

[129] Maarten Keijzer, Mike Cattolico, Dirk Arnold, Vladan Babovic, Christian Blum, Peter Bosman, Martin V. Butz, Carlos Coello Coello, Dipankar Dasgupta, Sevan G. Ficici, James Foster, Arturo Hernandez-Aguirre, Greg Hornby, Hod Lipson, Phil McMinn, Jason Moore, Guenther Raidl, Franz Rothlauf, Conor Ryan, and Dirk Thierens, editors. GECCO 2006: Proceedings of the 8th annual conference on Genetic and evolutionary computation, Seattle, Washington, USA, 8–12 July 2006. ACM Press.

[130] Vijay R. Konda and John N. Tsitsiklis. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143–1166, 2003.

[131] Tim Kovacs. Evolving Optimal Populations with XCS Classifier Systems. Master’s thesis, School of Computer Science, University of Birmingham, Birmingham, U.K., 1996. Also technical report CSR-96-17 and CSRP-96-17 ftp://ftp.cs.bham.ac.uk/pub/tech-reports/1996/CSRP-96-17.ps.gz.

[132] Tim Kovacs. Deletion schemes for classifier systems. In Banzhaf et al. [9], pages 329–336. Also TR CSRP-99-08, School of Computer Science, University of Birmingham.

[133] Tim Kovacs. Strength or accuracy? A comparison of two approaches to fitness calculation in learning classifier systems. In Annie S. Wu, editor, Proceedings of the 1999 Genetic and Evolutionary Computation Conference Workshop Program, pages 258–265, 1999.

[134] Tim Kovacs. A Comparison of Strength and Accuracy-based Fitness in Learning Classifier Systems. PhD thesis, University of Birmingham, 2002.

[135] Tim Kovacs. Two views of classifier systems. In Lanzi et al. [147], pages 74–87.

[136] Tim Kovacs. What should a classifier system learn and how should we measure it? In Bull et al. [37], pages 171–182.

[137] Tim Kovacs and Larry Bull. Towards a better understanding of rule initialisation and deletion. In Thierens et al. [216], pages 2777–2780.

[138] Tim Kovacs, Xavier Llorà, Keiki Takadama, Pier Luca Lanzi, Wolfgang Stolzmann, and Stewart W. Wilson, editors. Learning Classifier Systems: International Workshops, IWLCS 2003–2005, Revised Selected Papers, volume 4399 of LNAI. Springer, 2007.

[139] Pier Luca Lanzi. Learning Classifier Systems from a Reinforcement Learning Perspective. In Bull et al. [37], pages 162–170.

[140] Pier Luca Lanzi, Martin V. Butz, and David E. Goldberg. Empirical Analysis of Generalization and Learning in XCS with Gradient Descent. In Thierens et al. [216], pages 1814–1821.

[141] Pier Luca Lanzi and Daniele Loiacono. Standard and averaging reinforcement learning in XCS. In Keijzer et al. [129], pages 1489–1496.

[142] Pier Luca Lanzi, Daniele Loiacono, Stewart W. Wilson, and David E. Goldberg. Extending XCSF Beyond Linear Approximation. In Beyer et al. [18], pages 1827–1834.

[143] Pier Luca Lanzi, Daniele Loiacono, Stewart W. Wilson, and David E. Goldberg. Generalization in the XCSF Classifier Systems: Analysis, Improvement, and Extension. Technical Report 2005012, Illinois Genetic Algorithms Laboratory, March 2005.

[144] Pier Luca Lanzi, Daniele Loiacono, Stewart W. Wilson, and David E. Goldberg. Generalization in the XCSF Classifier System: Analysis, Improvement, and Extension. Evolutionary Computation, 15(2):133–168, 2007.

[145] Pier Luca Lanzi and Alessandro Perrucci. Extending the Representation of Classifier Conditions Part II: From Messy Coding to S-Expressions. In Banzhaf et al. [9], pages 345–352.

[146] Pier Luca Lanzi, Wolfgang Stolzmann, and Stewart W. Wilson, editors. Learning Classifier Systems. From Foundations to Applications, volume 1813 of LNAI. Springer-Verlag, Berlin, 2000.

[147] Pier Luca Lanzi, Wolfgang Stolzmann, and Stewart W. Wilson, editors. IWLCS ’01: Revised Papers from the 4th International Workshop on Advances in Learning Classifier Systems, volume 2321 of LNAI. Springer-Verlag, London, UK, 2002.

[148] Pier Luca Lanzi and Stewart W. Wilson. Using convex hulls to represent classifier conditions. In Keijzer et al. [129], pages 1481–1488.

[149] Pier Luca Lanzi, Wolfgang Stolzmann, and Stewart W. Wilson, editors. Advances in Learning Classifier Systems, volume 1996 of LNAI. Springer-Verlag, Berlin, 2001.

[150] Michael Littman, September 2006. Personal Communication.

[151] Xavier Llorà, July 2007. Personal Communication.

[152] Xavier Llorà and Josep M. Garrell. Knowledge-Independent Data Mining with Fine-Grained Parallel Evolutionary Algorithms. In Spector et al. [203], pages 461–468.

[153] Xavier Llorà, Rohith Reddy, Brian Matesic, and Rohit Bhargava. Towards Better than Human Capability in Diagnosing Prostate Cancer Using Infrared Spectroscopic Imaging. In Thierens et al. [216], pages 2098–2105.

[154] Xavier Llorà, Kumara Sastry, and David E. Goldberg. The Compact Classifier System: Motivation, Analysis and First Results. In Corne et al. [65], pages 596–603. Also IlliGAL TR No. 2005019.

[155] Xavier Llorà, Kumara Sastry, David E. Goldberg, and Luis de la Ossa. The χ-ary Extended Compact Classifier System: Linkage Learning in Pittsburgh LCS. In Proceedings of the International Workshop on Learning Classifier Systems (IWLCS-2006), to appear. Also IlliGAL TR No. 2006015.

[156] Daniele Loiacono, Jan Drugowitsch, Alwyn M. Barry, and Pier Luca Lanzi. Improving Classifier Error Estimate in XCSF. In Proceedings of the 9th International Workshop on Learning Classifier Systems, 2006.

[157] Daniele Loiacono and Pier Luca Lanzi. Neural Networks for Classifier Prediction in XCSF. In Stefano Cagnoni, Pierre Collet, Giuseppe Nicosia, and Leonardo Vanneschi, editors, Proceedings of the Workshop on Evolutionary Computation (EC)2AI, pages 36–40, August 2006.

[158] Daniele Loiacono, Andrea Marelli, and Pier Luca Lanzi. Support Vector Regression for Classifier Prediction. In Thierens et al. [216], pages 1806–1813.

[159] Sean Luke and Liviu Panait. A comparison of bloat control methods for genetic programming. Evolutionary Computation, 14(3):309–344, 2006.

[160] David J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, May 1992.

[161] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.

[162] Ester Bernadó Mansilla and Josep M. Garrell Guiu. Accuracy-based learning classifier systems: Models, analysis and applications to classification tasks. Evolutionary Computation, 11(3):209–238, 2003.

[163] Ester Bernadó Mansilla, Xavier Llorà, and Josep M. Garrell Guiu. XCS and GALE: A Comparative Study of Two Learning Classifier Systems on Data Mining. In Lanzi et al. [147], pages 115–132.

[164] James A. R. Marshall, Gavin Brown, and Tim Kovacs. Bayesian estimation of rule accuracy in UCS. In GECCO ’07: Proceedings of the 2007 GECCO conference companion on Genetic and evolutionary computation, pages 2831–2834, New York, NY, USA, 2007. ACM Press.

[165] James A. R. Marshall and Tim Kovacs. A representational ecology for learning classifier systems. In Keijzer et al. [129], pages 1529–1536.

[166] Peter S. Maybeck. Stochastic Models, Estimation, and Control. Volume 1, volume 141 of Mathematics in Science and Engineering. Academic Press, Inc., New York, 1979.

[167] P. McCullagh and J. A. Nelder. Generalized Linear Models. Monographs on Statistics and Applied Probability. Chapman and Hall, 1983.

[168] Alex McMahon, Dan Scott, and Will Browne. An autonomous explore/exploit strategy. In GECCO ’05: Proceedings of the 2005 workshops on Genetic and evolutionary computation, pages 103–108, New York, NY, USA, 2005. ACM Press.

[169] Melanie Mitchell. An Introduction to Genetic Algorithms. MIT Press, February 1998.

[170] Tom Mitchell. Machine Learning. McGraw Hill, 1997.

[171] Johann Mitlöhner. Classifier systems and economic modelling. In APL ’96. Proceedings of the APL 96 Conference on Designing the Future, volume 26 (4), pages 77–86, 1996.

[172] D. J. Mook and J. L. Junkins. Minimum Model Error Estimation for Poorly Modeled Dynamic Systems. Journal of Guidance, Control and Dynamics, 11(3):256–261, May–June 1988.

[173] Alberto Moraglio, November 2006. Personal Communication.

[174] David E. Moriarty, Alan C. Schultz, and John J. Grefenstette. Evolutionary Algorithms for Reinforcement Learning. Journal of Artificial Intelligence Research, 11:199–229, 1999. http://www.ib3.gmu.edu/gref/papers/moriarty-jair99.html.

[175] Ian T. Nabney. Netlab: Algorithms for Pattern Recognition. Springer, 2002.

[176] Radford Neal and Geoffrey E. Hinton. A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants. In Michael I. Jordan, editor, Learning in Graphical Models, pages 355–368. MIT Press, Cambridge, MA, USA, 1999.

[177] Toby O’Hara and Larry Bull. A Memetic Accuracy-based Neural Learning Classifier System. In Corne et al. [64], pages 2040–2045.

[178] Toby O’Hara and Larry Bull. Backpropagation in Accuracy-based Neural Learning Classifier Systems. In Kovacs et al. [138], pages 26–40.

[179] Dirk Ormoneit and Saunak Sen. Kernel-Based Reinforcement Learning. Machine Learning, 49(2-3):161–178, 2002.

[180] Albert Orriols-Puig and Ester Bernadó-Mansilla. Class Imbalance Problem in the UCS Classifier System: Fitness Adaptation. In Corne et al. [65], pages 604–611.

[181] Albert Orriols-Puig and Ester Bernadó-Mansilla. Bounding XCS’s Parameters for Unbalanced Datasets. In Keijzer et al. [129], pages 1561–1568.

[182] Albert Orriols-Puig, David E. Goldberg, Kumara Sastry, and Ester Bernadó Mansilla. Modeling XCS in Class Imbalances: Population Size and Parameter Settings. In Thierens et al. [216], pages 1838–1846.

[183] Albert Orriols-Puig, Kumara Sastry, Pier Luca Lanzi, David E. Goldberg, and Ester Bernadó Mansilla. Modeling Selection Pressure in XCS for Proportionate and Tournament Selection. In Thierens et al. [216], pages 1846–1854.

[184] Martin Pelikan. Hierarchical Bayesian Optimization Algorithm: Toward a New Generation of Evolutionary Algorithms. Studies in Fuzziness and Soft Computing. Springer, 2005.

[185] Martin Pelikan, Kumara Sastry, and Erick Cantú-Paz, editors. Scalable Optimization via Probabilistic Modeling: From Algorithms to Applications. Studies in Computational Intelligence. Springer, 2006.

[186] Michael Peter Perrone. Improving Regression Estimation: Averaging Methods for Variance Reduction with Extensions to General Convex Measure Optimization. PhD thesis, Brown University, May 1993.

[187] Justus H. Piater, Paul R. Cohen, Xiaoqin Zhang, and Michael Atighetchi. A Randomized ANOVA Procedure for Comparing Performance Curves. In ICML ’98: Proceedings of the Fifteenth International Conference on Machine Learning, pages 430–438, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.

[188] Rick L. Riolo. Bucket Brigade Performance: I. Long Sequences of Classifiers. In Grefenstette [98], pages 184–195.

[189] Rick L. Riolo. Bucket Brigade Performance: II. Default Hierarchies. In Grefenstette [98], pages 196–201.

[190] Jorma Rissanen. Modeling by the shortest data description. Automatica, 14:465–471, 1978.

[191] Jorma Rissanen. A universal prior for integers and estimation by minimum description length. Annals of Statistics, 11:416–431, 1983.

[192] Jorma Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore, 1989.

[193] Jorma Rissanen. Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42(1):40–47, 1996.

[194] Gavin Rummery and Mahesan Niranjan. On-line Q-Learning using Connectionist Systems. Technical Report 166, Engineering Department, University of Cambridge, 1994.

[195] Ralf Schoknecht. Optimality of Reinforcement Learning Algorithms with Linear Function Approximation. In Proceedings of the 15th Neural Information Processing Systems conference, pages 1555–1562, 2002.

[196] Ralf Schoknecht and Artur Merke. Convergent Combinations of Reinforcement Learning with Linear Function Approximation. In Proceedings of the 15th Neural Information Processing Systems conference, pages 1579–1586, 2002.

[197] Ralf Schoknecht and Artur Merke. TD(0) Converges Provably Faster than the Residual Gradient Algorithm. In ICML ’03: Proceedings of the twentieth international conference on Machine Learning, pages 680–687, 2003.

[198] Robert E. Smith. Memory Exploitation in Learning Classifier Systems. Evolutionary Computation, 2(3):199–220, 1994.

[199] Robert E. Smith, B. A. Dike, B. Ravichandran, A. El-Fallah, and R. K. Mehra. The Fighter Aircraft LCS: A Case of Different LCS Goals and Techniques. In Lanzi et al. [146], pages 283–300.

[200] S. F. Smith. A Learning System Based on Genetic Adaptive Algorithms. PhD thesis, University of Pittsburgh, 1980.

[201] S. F. Smith. Flexible Learning of Problem Solving Heuristics through Adaptive Search. In Proceedings of the Eighth International Joint Conference on Artificial Intelligence, pages 422–425, 1983.

[202] S. F. Smith. Adaptive learning systems. In R. Forsyth, editor, Expert Systems: Principles and Case Studies, pages 169–189. Chapman and Hall, 1984.

[203] Lee Spector, Erik D. Goodman, Annie Wu, W. B. Langdon, Hans-Michael Voigt, Mitsuo Gen, Sandip Sen, Marco Dorigo, Shahram Pezeshk, Max H Garzon, and Edmund Burke, editors. GECCO-2001: Proceedings of the Genetic and Evolutionary Computation Conference, San Francisco, CA, USA, 7-11 July 2001. Morgan Kaufmann.

[204] Statlib dataset archive. From StatLib – Data, Software and News from the Statistics Community. http://lib.stat.cmu.edu/.

[205] Wolfgang Stolzmann. Anticipatory Classifier Systems. In J. R. Koza, W. Banzhaf, K. Chellapilla, K. Deb, M. Dorigo, D. B. Fogel, M. H. Garzon, D. E. Goldberg, H. Iba, and R. Riolo, editors, Genetic Programming, pages 658–664. Morgan Kaufmann Publishers, Inc., San Francisco, CA, USA, 1998.

[206] Christopher Stone and Larry Bull. For real! XCS with continuous-valued inputs. Evolutionary Computation, 11(3):299–336, 2003. Also UWE TR UWELCSG02-007.

[207] Alexander L. Strehl. Model-Based Reinforcement Learning in Factored MDPs. In IEEE Symposium on Approximate Dynamic Programming, pages 103–110, 2007.

[208] Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC Model-Free Reinforcement Learning. In Proceedings of the 23rd International Conference on Machine Learning (ICML 2006), pages 881–888, Pittsburgh, PA, USA, 2006.

[209] Malcolm J. A. Strens. A Bayesian Framework for Reinforcement Learning. In ICML ’00: Proceedings of the Seventeenth International Conference on Machine Learning, pages 943–950, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

[210] Richard S. Sutton. Value-function hypothesis. From Reinforcement Learning and Artificial Intelligence. http://rlai.cs.ualberta.ca/RLAI/valuefunctionhypothesis.html.

[211] Richard S. Sutton. Learning to predict by the method of temporal differences. Machine Learning, 3:9–44, 1988.

[212] Richard S. Sutton. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 1038–1044, Cambridge, MA, USA, 1996. MIT Press.

[213] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 1998.

[214] Gilbert Syswerda. Uniform Crossover in Genetic Algorithms. In Proceedings of the 3rd International Conference on Genetic Algorithms, pages 2–9, San Francisco, CA, USA, 1989. Morgan Kaufmann Publishers Inc.

[215] Kreangsak Tamee, Larry Bull, and Ouen Pinngern. Towards Clustering with XCS. In Thierens et al. [216], pages 1854–1860.

[216] Dirk Thierens, Hans-Georg Beyer, Mauro Birattari, Josh Bongard, Jürgen Branke, John Andrew Clark, Dave Cliff, Clare Bates Congdon, Kalyanmoy Deb, Benjamin Doerr, Tim Kovacs, Sanjeev Kumar, Julian F. Miller, Jason Moore, Frank Neumann, Martin Pelikan, Riccardo Poli, Kumara Sastry, Kenneth Owen Stanley, Thomas Stützle, Richard A. Watson, and Ingo Wegener, editors. GECCO-2007: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, volume 2. ACM Press, July 2007.

[217] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-posed Problems. Winston, 1977.

[218] John Tsitsiklis and Benjamin Van Roy. An Analysis of Temporal-Difference Learning with Function Approximation. IEEE Transactions on Automatic Control, 42(5):674–690, May 1997.

[219] Naonori Ueda and Zoubin Ghahramani. Bayesian model search for mixture models based on optimizing variational bounds. Neural Networks, 15:1223–1241, 2002.

[220] P. J. van Laarhoven and E. H. Aarts. Simulated Annealing: Theory and Applications. Springer, June 1987.

[221] Vladimir N. Vapnik. An Overview of Statistical Learning Theory. IEEE Transactions on Neural Networks, 10(5):988–999, September 1999.

[222] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1999.

[223] G. Venturini. Apprentissage Adaptatif et Apprentissage Supervisé par Algorithme Génétique. PhD thesis, Université de Paris-Sud, 1994.

[224] Nickolas Vriend. Self-Organization of Markets: An Example of a Com- putational Approach. Computational Economics, 8(3):205–231, 1995.

[225] Atsushi Wada, Keiki Takadama, and Katsunori Shimohara. Counter Example for Q-Bucket-Brigade under Prediction Problem. In Kovacs et al. [138], pages 130–145.

[226] Atsushi Wada, Keiki Takadama, Katsunori Shimohara, and Osamu Katai. Is Gradient Descent Method Effective for XCS? Analysis of Reinforcement Process in XCSG. In Wolfgang Stolzmann et al., editors, Proceedings of the Seventh International Workshop on Learning Classifier Systems, 2004, LNAI, Seattle, WA, June 2004. Springer Verlag.

[227] Atsushi Wada, Keiki Takadama, Katsunori Shimohara, and Osamu Katai. Learning Classifier System with Convergence and Generalisation. In Bull and Kovacs [36].

[228] M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51:2313–2335, 2005.

[229] Steve Waterhouse. Classification and Regression using Mixtures of Experts. PhD thesis, Department of Engineering, University of Cambridge, 1997.

[230] Steve Waterhouse, David MacKay, and Tony Robinson. Bayesian Methods for Mixtures of Experts. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 351–357. MIT Press, 1996.

[231] Christopher J.C.H. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge, Psychology Department, 1989.

[232] Christopher J.C.H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.

[233] Eric W. Weisstein. Banach fixed point theorem, 1999. From Mathworld – a Wolfram Web Resource. http://mathworld.wolfram.com/BanachFixedPointTheorem.html.

[234] Eric W. Weisstein. Jensen’s inequality, 1999. From Mathworld – a Wolfram Web Resource. http://mathworld.wolfram.com/JensensInequality.html.

[235] Eric W. Weisstein. Relative entropy, 1999. From Mathworld – a Wolfram Web Resource. http://mathworld.wolfram.com/RelativeEntropy.html.

[236] Greg Welch and Gary Bishop. An Introduction to the Kalman Filter. Technical Report TR 95-401, University of North Carolina at Chapel Hill, Department of Computer Science, April 2004.

[237] Bernard Widrow and Marcian E. Hoff. Adaptive switching circuits. In IRE WESCON Convention Record Part IV, pages 96–104, 1960.

[238] R. Paul Wiegand, William C. Liles, and Kenneth A. De Jong. An Empirical Analysis of Collaboration Methods in Cooperative Coevolutionary Algorithms. In Spector et al. [203], pages 1235–1242.

[239] Stewart W. Wilson. ZCS: A zeroth level classifier system. Evolutionary Computation, 2(1):1–18, 1994.

[240] Stewart W. Wilson. Classifier Fitness Based on Accuracy. Evolutionary Computation, 3(2):149–175, 1995. http://prediction-dynamics.com/.

[241] Stewart W. Wilson. Generalization in the XCS classifier system. In John R. Koza, Wolfgang Banzhaf, Kumar Chellapilla, Kalyanmoy Deb, Marco Dorigo, David B. Fogel, Max H. Garzon, David E. Goldberg, Hitoshi Iba, and Rick Riolo, editors, Genetic Programming 1998: Proceedings of the Third Annual Conference, pages 665–674. Morgan Kaufmann, 1998. http://prediction-dynamics.com/.

[242] Stewart W. Wilson. Get real! XCS with continuous-valued inputs. In Lanzi et al. [146], pages 209–222.

[243] Stewart W. Wilson. Function Approximation with a Classifier System. In Spector et al. [203], pages 974–981.

[244] Stewart W. Wilson. Classifiers that Approximate Functions. Natural Computing, 1(2-3):211–234, 2002.

[245] Lei Xu. BYY harmony learning, structural RPCL, and topological self-organizing on mixture models. Neural Networks, 15:1125–1151, 2002.

[246] Lei Xu. Fundamentals, Challenges, and Advances of Statistical Learning for Knowledge Discovery and Problem Solving: A BYY Harmony Perspective. In Proceedings of International Conference on Neural Networks and Brain, volume 1, pages 24–55. Publishing House of Electronics Industry, Beijing, China, October 2005.
