Learning in High Dimensions with Projected Linear Discriminants
By Robert John Durrant

A thesis submitted to the University of Birmingham for the degree of Doctor of Philosophy

School of Computer Science
College of Engineering and Physical Sciences
University of Birmingham
January 2013

© Robert John Durrant, 2013. Typeset in LaTeX 2ε.

Except where acknowledged in the customary manner, the material presented in this thesis is, to the best of my knowledge, original and has not been submitted in whole or part for a degree in any university.

Robert John Durrant

Abstract

The enormous power of modern computers has made possible the statistical modelling of data with dimensionality that would have made this task inconceivable only decades ago. However, experience in such modelling has made researchers aware of many issues associated with working in high-dimensional domains, collectively known as 'the curse of dimensionality', which can confound practitioners' desires to build good models of the world from these data. When the dimensionality is very large, low-dimensional methods and geometric intuition both break down.

To mitigate the dimensionality curse we can use low-dimensional representations of the original data that capture most of the information it contained. However, little is currently known about the effect of such dimensionality reduction on classifier performance. In this thesis we develop theory quantifying the effect of random projection, a recent and very promising non-adaptive dimensionality reduction technique, on the classification performance of Fisher's Linear Discriminant (FLD), a successful and widely-used linear classifier. We tackle the issues associated with small sample size and high dimensionality by using randomly projected FLD ensembles, and we develop theory explaining why our new approach performs well. Finally, we quantify the generalization error of Kernel FLD, a related non-linear projected classifier.

Dedication

To my family.

Acknowledgments

It is traditional to begin one's thesis by thanking those individuals without whom it would not have been completed, and since it is a good tradition I don't intend to break with it. My wife, Sam, has been an unfailing source of support for as long as we have known each other, and her love, patience and good humour have kept me going much more than I think she knows. Thank you Sam. On the other hand my daughters Amy (8) and Lucy (2) have been an unfailing source of distraction; however in my experience it sometimes pays to be distracted for a while, and besides that is how things are supposed to be. Thank you girls. Without my family, this would have been a lonelier and harder road, and I dedicate this thesis to them.

I would also like to especially thank my supervisor, Dr. Ata Kabán, for her encouragement, patience, and support, and especially for starting the ball rolling by convincing me to attempt some theory during my Master's degree. We have had a productive partnership, and the contents of this thesis are the fruit of our collaborative effort. I would also like to thank Ata for reading my thesis drafts and for finding in them more bugs than I imagined were possible. The usual caveat that any remaining errors are my own is, of course, still true.

I would also like to thank the other researchers I have been privileged to meet during the course of this research, too many to name, for interesting and often enlightening discussions and ideas.
In particular, thanks are due to Peter Tiňo, Jakramate Bootkrajang, and Hector Basevi for asking hard questions in the measure concentration group, and to my thesis group members Steve Vickers and Iain Styles for keeping me on track and encouraging me to keep at it when sometimes things weren't going as well as planned. Thanks are also due to my examiners, Professor John Shawe-Taylor and Peter Tiňo, as well as to the chair, Iain Styles, for taking time out of their busy schedules to examine my thesis. Finally, I would like to thank the School of Computer Science at the University of Birmingham for their generous financial support, which enabled me to spend my time in research, and also the School Research Committee, the Royal Academy of Engineering, and the KDD 2010 and ECML 2012 programme committees for financial support towards attending conferences, schools and workshops.

Bob Durrant, Halesowen, January 2013

Publications

During the course of this project, the following peer-reviewed papers were published in journals and conference proceedings, and the following technical reports were written:

Journal Papers

R. Durrant and A. Kabán. A tight bound on the performance of Fisher's linear discriminant in randomly projected data spaces. Pattern Recognition Letters (2011).

R. Durrant and A. Kabán. Random Projections as Regularizers: Learning a Linear Discriminant Ensemble from Fewer Observations than Dimensions. (Submitted).

Refereed Conference Papers

R. Durrant and A. Kabán. Compressed Fisher Linear Discriminant Analysis: Classification of Randomly Projected Data. In Proceedings of the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2010) (2010).

R. Durrant and A. Kabán. A bound on the performance of LDA in randomly projected data spaces. In Proceedings of the 20th International Conference on Pattern Recognition (ICPR 2010), pp. 4044–4047 (2010). IBM Best Student Paper Award, Machine Learning and Pattern Recognition Track.

R. Durrant and A. Kabán. Error bounds for Kernel Fisher Linear Discriminant in Gaussian Hilbert space. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS 2012) (2012).

Technical Reports

R. Durrant and A. Kabán. Flip probabilities for random projections of θ-separated vectors. University of Birmingham, School of Computer Science Technical Report No. CSR-10-10.

R. Durrant and A. Kabán. A comparison of the moments of a quadratic form involving orthonormalised and normalised random projection matrices. University of Birmingham, School of Computer Science Technical Report No. CSR-11-04.

Fare forward, travellers! not escaping from the past
Into different lives, or into any future;
You are not the same people who left that station
Or who will arrive at any terminus,
While the narrowing rails slide together behind you;
And on the deck of the drumming liner
Watching the furrow that widens behind you,
You shall not think 'the past is finished'
Or 'the future is before us'.
At nightfall, in the rigging and the aerial,
Is a voice descanting (though not to the ear,
The murmuring shell of time, and not in any language)
'Fare forward, you who think that you are voyaging;
You are not those who saw the harbour
Receding, or those who will disembark.'

T.S. Eliot, 'The Dry Salvages'

Contents

List of Figures
List of Tables

1 Introduction
   1.1 Problem Statement
   1.2 Research Questions
   1.3 Contributions of this Thesis
   1.4 Organization of this Thesis
2 Mathematical Background and Tools
   2.1 Notation and Conventions
   2.2 Linear Algebra
      2.2.1 Normed Vector Spaces
      2.2.2 Linear Independence and Bases
      2.2.3 Positive (Semi-)definite Matrices
   2.3 Random Matrix Theory
      2.3.1 Random Projections
   2.4 Other Useful Inequalities

3 State of the Art
   3.1 Summary of Existing Work
   3.2 Inadequacies in previous work and our Contribution

4 Fisher's Linear Discriminant
   4.1 Preliminaries
      4.1.1 The classification problem
      4.1.2 Fisher's Linear Discriminant
   4.2 Dataspace analysis of FLD
      4.2.1 Upper bound on the generalization error of two-class FLD with subgaussian class conditionals
      4.2.2 Exact generalization error for two-class FLD with Gaussian class-conditionals
      4.2.3 Multi-class FLD
   4.3 Summary and Discussion

5 Randomly-projected Fisher Linear Discriminant
   5.1 Preliminaries
      5.1.1 The randomly-projected classification problem
      5.1.2 Decision rule for RP-FLD
   5.2 Analysis of RP-FLD
      5.2.1 Decomposition of data space FLD bound as sum of 'estimated error' and 'estimation error'
      5.2.2 Main results: Generalization error bounds on RP-FLD
      5.2.3 Bounds on the projected dimensionality k and discussion
      5.2.4 Numerical validation
      5.2.5 Relation to Theorem 6
      5.2.6 The mean flipping problem
      5.2.7 Corollary to Theorems 7 and 8
   5.3 Summary and Discussion

6 Flip Probabilities for Random Projections of θ-separated Vectors
   6.1 Preliminaries
   6.2 Statement of Theorem and Proofs
   6.3 Proof of Theorem 9
   6.4 Proof of Part 2
   6.5 Proof of Part 3
   6.6 Geometric Interpretation
   6.7 Empirical Validation
   6.8 Summary and Discussion

7 Voting Ensembles of RP-FLDs
   7.1 Preliminaries
   7.2 Theory: Linking the ensemble error to the full data space error
      7.2.1 Analysis of $E[R^T(R\hat{\Lambda}R^T)^{-1}R]$
      7.2.2 Generalization error of the ensemble for a fixed training set
   7.3 Tail bound on the generalization error of the ensemble when k < ρ − 1
   7.4 Lower bound on the error of pseudo-inverted FLD in the data space
   7.5 Variability reduction and bounding the deviation of a finite ensemble from its expectation
   7.6 Experiments
      7.6.1 Datasets
      7.6.2 Protocol
      7.6.3 Results
   7.7 Summary and Discussion